Hi,
I am trying to configure Xen on top of an active-backup bonded
configuration using eth0 and eth1. Dom0 can communicate, dom1 cannot
communicate (well, it can with some extra work).
I have seen a few posts about this problem but didn't really find
answers beyond "it should work" and "it works for me". I'm hoping
someone can share their configuration or simply confirm that this
really works as I would expect. I thought it worked at first, but after
more testing I realized it did not.
Here is where I am in my debugging:
- when dom0 is configured to use "plain" eth0 and xen-br0, networking to
dom0 and dom1 works correctly
- when dom0 is configured to use "plain" eth1 and xen-br0, networking to
dom0 and dom1 works correctly
- when the server is booted non-Xen, the host can communicate correctly
over the bond0 interface; failover, tested by plugging/unplugging cat5,
works as expected
- when the server is booted into Xen (only a single domU for now), bond0
and vif1.0 are attached to xen-br0
- when the server is booted into Xen, dom0 communicates correctly and has
no networking problems (at least none that I have detected)
- dom1 cannot communicate over the bonded bridge
The Xen startup scripts leave my dom0's IP address on bond0 as well as
xen-br0. I have removed the IP from bond0, with no improvement.
By using tcpdump I have determined that arp replies are not being
received. Let's pretend my router is 192.168.1.1; here is what I am
seeing:
- from dom1 "ping -n 192.168.1.1"
- arp request "who-has 192.168.1.1" goes through
(vif1.0->xen-br0->bond0->eth0->wire)
- arp request is received by 192.168.1.1 and router replies ("arp-reply
192.168.1.1 is at ...")
- arp reply is never received: sniffing on eth0/eth1, bond0, etc. shows
no arp reply
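
For anyone wanting to repeat the check, here is a minimal sketch of how
a saved tcpdump text capture could be scanned for arp requests that
never get a reply. It assumes the lines contain "who-has <ip> tell <ip>"
for requests and "<ip> is-at <mac>" for replies; the exact wording
varies between tcpdump versions, so the patterns may need adjusting. The
function name and the sample lines are made up for illustration and are
not from my actual trace.

```python
import re

# "who-has <target> tell <sender>" part of an arp request line.
REQUEST_RE = re.compile(r"who-has (\d+\.\d+\.\d+\.\d+) tell (\d+\.\d+\.\d+\.\d+)")
# "<ip> is-at <mac>" part of an arp reply line.
REPLY_RE = re.compile(r"(\d+\.\d+\.\d+\.\d+) is-at ([0-9a-fA-F:]{17})")


def unanswered_arp_targets(lines):
    """Return the sorted IPs that were asked for but never answered."""
    asked = set()
    answered = set()
    for line in lines:
        request = REQUEST_RE.search(line)
        if request:
            asked.add(request.group(1))
            continue
        reply = REPLY_RE.search(line)
        if reply:
            answered.add(reply.group(1))
    return sorted(asked - answered)


# Illustrative capture lines (hypothetical addresses).
sample = [
    "arp who-has 192.168.1.1 tell 192.168.1.50",
    "arp who-has 192.168.1.20 tell 192.168.1.50",
    "arp reply 192.168.1.20 is-at 00:16:3e:00:00:01",
]
print(unanswered_arp_targets(sample))
```

Running the same check against captures taken on vif1.0, xen-br0, bond0
and eth0/eth1 would show at which hop the reply goes missing.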
This sounds like a switch problem except for one thing: why do
eth0->xen-br0->vif1.0 and eth1->xen-br0->vif1.0 work, while neither
eth0->bond0->xen-br0->vif1.0 nor eth1->bond0->xen-br0->vif1.0 works?
Active-backup mode is not supposed to require any support from the
switch, and my testing confirms this. Without Xen in the picture,
bonding and failover work exactly as they are supposed to.
I suspected that there may be some security configured in our switches
(perhaps one MAC per port), but that doesn't make sense either since
things work correctly without bonding. I don't control our switches ...
If I hardcode the MAC address of the router in dom1's arp table then it
can communicate with the world. Anything in the local subnet must also
be hardcoded in the arp table or it cannot get through (logical, since
arp replies are not being received).
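
The workaround amounts to one static arp entry per local neighbour. A
small sketch that only builds (and prints) the corresponding
"arp -s <ip> <mac>" commands, assuming the net-tools arp utility in
dom1; the helper name and all addresses below are placeholders, not my
real ones.

```python
def static_arp_commands(neighbours):
    """Build one `arp -s <ip> <mac>` argument list per neighbour."""
    return [["arp", "-s", ip, mac] for ip, mac in sorted(neighbours.items())]


# Placeholder addresses for the router and one other local host.
neighbours = {
    "192.168.1.1": "00:11:22:33:44:55",
    "192.168.1.20": "00:16:3e:00:00:01",
}

# Print the commands rather than running them, so nothing is changed.
for command in static_arp_commands(neighbours):
    print(" ".join(command))
```

Obviously this does not scale past a handful of hosts, which is why I
am still looking for the real cause.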
It strikes me as strange that everything but arp is functioning
correctly on the bridge ... isn't that one of the major functions of a
bridge? Anyway I continue to read up on bridges hoping for my eureka
moment.
I have played with rp_filter on all interfaces and other parameters that
I thought might be relevant ... to no avail so far.
One other strange thing to note is that sometimes it works. In fact I
had a bonded server in testing for quite a while and didn't notice this
problem until I moved it to a new data centre (though I don't think I
had tested failover originally) ... now it won't work even in the
original lab setting.
Occasionally something happens and dom1 suddenly gets an arp-reply on
its own and talks. Perhaps it's just coincidentally waiting for an
arp-reply at the same time as dom0 and that gets passed through?
Once in a while, if I down one interface in the bond, things start
working, but not always. I'm trying to track down a cause here but have
no ideas so far.
If I hardcode the arp entries in dom1, things always work.
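
To see whether the intermittent successes line up with which slave is
active, the bond state could be logged alongside each test. A minimal
sketch, assuming the usual "Key: value" layout of
/proc/net/bonding/bond0; the function name is mine and the sample text
is illustrative, not copied from my server.

```python
def parse_bond_status(text):
    """Extract the mode, active slave and slave list from bonding status text."""
    status = {"mode": None, "active_slave": None, "slaves": []}
    for raw in text.splitlines():
        # Split on the first colon only, so MAC addresses in values survive.
        key, sep, value = raw.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            status["mode"] = value
        elif key == "Currently Active Slave":
            status["active_slave"] = value
        elif key == "Slave Interface":
            status["slaves"].append(value)
    return status


# Illustrative status text (hypothetical values).
sample_status = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: up
"""
print(parse_bond_status(sample_status))
```

On the real machine the text would come from reading
/proc/net/bonding/bond0 instead of the sample string.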
Using Xen 2.0.5/kernel 2.6.11 in dom0 (SuSE Pro 9.3) and kernel 2.6.5 in
domU.
Ideas greatly appreciated, I've been mulling this one over for a while
now! I will follow up for the record if I happen to stumble over an
answer myself.
Thanks,
Fraser