RE: [Xen-devel] network hang trigger

I've tried this, and I see the first fragment of the ping get sent and then a complete hang. That is what originally made me suspect a race when packets are sent with a very small interval between one and the next.

It could be that Bin's patch changed the timing on his machine such that the bug goes away for him. I can make the bug come and go by placing printk's in network_start_xmit, as per my previous email.

This is a dump from a normal size ping.

Xen0

listening on vif13.0, link-type EN10MB (Ethernet), capture size 96 bytes

19:02:48.276397 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 64: echo request seq 1

19:02:48.306646 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 64: echo reply seq 1

19:02:49.275931 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 64: echo request seq 2

19:02:49.276033 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 64: echo reply seq 2

XenU

listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes

19:02:48.270125 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 64: echo request seq 1

19:02:48.277577 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 64: echo reply seq 1

19:02:49.275460 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 64: echo request seq 2

19:02:49.276848 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 64: echo reply seq 2

This is from a large ping (with printk’s in network_start_xmit so it works)

Xen0

19:10:33.502706 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 1

19:10:33.502711 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp

19:10:33.502966 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 1480: echo reply seq 1

19:10:33.502992 IP xen2.int.sbss.com.au > 192.168.200.200: icmp

19:10:34.496713 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 2

19:10:34.496717 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp

19:10:34.496872 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 1480: echo reply seq 2

19:10:34.496895 IP xen2.int.sbss.com.au > 192.168.200.200: icmp

XenU

19:10:33.496431 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 1

19:10:33.498042 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp

19:10:33.507890 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 1480: echo reply seq 1

19:10:33.507953 IP xen2.int.sbss.com.au > 192.168.200.200: icmp

19:10:34.492920 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 2

19:10:34.494703 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp

19:10:34.501604 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 1480: echo reply seq 2

19:10:34.501639 IP xen2.int.sbss.com.au > 192.168.200.200: icmp

This is from the same large ping (with the printk’s removed so it hangs)

Xen0

listening on vif14.0, link-type EN10MB (Ethernet), capture size 96 bytes

19:23:25.125927 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 1

19:23:55.122574 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 556: ip reassembly time exceeded

19:23:55.122726 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp

19:23:55.122732 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 2

19:23:55.122734 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp

19:23:55.122735 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 3

19:23:55.122737 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp

19:23:55.122739 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp 1480: echo request seq 4

19:23:55.122741 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp

19:23:55.123850 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 1480: echo reply seq 2

19:23:55.123873 IP xen2.int.sbss.com.au > 192.168.200.200: icmp

19:23:55.123955 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 1480: echo reply seq 3

19:23:55.123977 IP xen2.int.sbss.com.au > 192.168.200.200: icmp

19:23:55.124050 IP xen2.int.sbss.com.au > 192.168.200.200: icmp 1480: echo reply seq 4

19:23:55.124070 IP xen2.int.sbss.com.au > 192.168.200.200: icmp

XenU

listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes

19:23:25.126797 IP 192.168.200.200 > 192.168.200.204: icmp 1480: echo request seq 1

19:23:25.129472 IP 192.168.200.200 > 192.168.200.204: icmp

19:23:26.143609 IP 192.168.200.200 > 192.168.200.204: icmp 1480: echo request seq 2

19:23:26.143622 IP 192.168.200.200 > 192.168.200.204: icmp

19:23:27.143643 IP 192.168.200.200 > 192.168.200.204: icmp 1480: echo request seq 3

19:23:27.143660 IP 192.168.200.200 > 192.168.200.204: icmp

19:23:28.143643 IP 192.168.200.200 > 192.168.200.204: icmp 1480: echo request seq 4

19:23:28.143658 IP 192.168.200.200 > 192.168.200.204: icmp

19:23:55.124352 IP 192.168.200.204 > 192.168.200.200: icmp 556: ip reassembly time exceeded

19:23:55.126145 IP 192.168.200.204 > 192.168.200.200: icmp 1480: echo reply seq 2

19:23:55.126170 IP 192.168.200.204 > 192.168.200.200: icmp

19:23:55.126201 IP 192.168.200.204 > 192.168.200.200: icmp 1480: echo reply seq 3

19:23:55.126208 IP 192.168.200.204 > 192.168.200.200: icmp

19:23:55.126224 IP 192.168.200.204 > 192.168.200.200: icmp 1480: echo reply seq 4

19:23:55.126230 IP 192.168.200.204 > 192.168.200.200: icmp

The clocks are in sync between the two domains, so you can see that dom0 sees only the first fragment of the first ping, then a delay of about 30 seconds, and only then does the rest come through.
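
To put a number on that delay, here is a minimal sketch in Python that subtracts the timestamps of two lines copied from the captures above. The helper names (parse_ts, gap_seconds) are made up for illustration, and it assumes both lines fall on the same day, since tcpdump prints no date here. The gaps come out just under 30 seconds, which would be consistent with the usual Linux default fragment reassembly timeout (ipfrag_time, 30 seconds), though that setting has not been checked on this machine.

```python
from datetime import datetime


def parse_ts(line: str) -> datetime:
    """Parse the leading HH:MM:SS.ffffff timestamp of a tcpdump line."""
    return datetime.strptime(line.split()[0], "%H:%M:%S.%f")


def gap_seconds(earlier: str, later: str) -> float:
    """Seconds elapsed between two tcpdump lines (same day assumed)."""
    return (parse_ts(later) - parse_ts(earlier)).total_seconds()


# Lines copied from the hanging run: dom0 (vif14.0) and XenU (eth0) captures.
dom0_first_fragment = (
    "19:23:25.125927 IP 192.168.200.200 > xen2.int.sbss.com.au: "
    "icmp 1480: echo request seq 1"
)
dom0_reassembly_timeout = (
    "19:23:55.122574 IP xen2.int.sbss.com.au > 192.168.200.200: "
    "icmp 556: ip reassembly time exceeded"
)
xenu_second_fragment = "19:23:25.129472 IP 192.168.200.200 > 192.168.200.204: icmp"
dom0_second_fragment = "19:23:55.122726 IP 192.168.200.200 > xen2.int.sbss.com.au: icmp"

# How long dom0 saw nothing after the first fragment.
stall = gap_seconds(dom0_first_fragment, dom0_reassembly_timeout)
# How long the second fragment took to get from XenU to dom0.
fragment_delay = gap_seconds(xenu_second_fragment, dom0_second_fragment)

print(f"dom0 stall after first fragment: {stall:.6f} s")
print(f"second fragment XenU -> dom0:    {fragment_delay:.6f} s")
```

The second figure is the more telling one: the fragment appears in the XenU capture a few milliseconds after the first, but does not show up in dom0 until the stall ends.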

Is it possible that there is a synchronisation problem in interdomain communications?

James

> -----Original Message-----

> From: Keir Fraser [mailto:Keir.Fraser@xxxxxxxxxxxx]

> Sent: Thursday, 16 September 2004 17:24

> To: James Harper

> Cc: Bin Ren; xen-devel@xxxxxxxxxxxxxxxxxxxxx

> Subject: Re: [Xen-devel] network hang trigger

> > When I was thinking about this problem, I was imagining a deadlock

> > condition where rapid back to back packets (eg a fragmented icmp packet

> > from ping or a fragmented udp packet from nfs) was causing a hang until

> > part of the deadlock timed itself out and the packets started flowing

> > again. I couldn't see 1 packet causing a buffer exhaustion unless it got

> > itself into a loop where it kept retrying to send the second fragment

> > and didn't free the buffer each time, but even then the buffer bug would

> > be a side effect.

> >

> > The deadlock would have to be caused in the transmit from xenU to xen0,

> > and something about the difference between sending a ping and responding

> > to a ping is the difference between always causing a lockup and only

> > sometimes causing a lockup.

> Try tcpdumping each end of the connection.

> I find that for ping 0->U, the 'seizure' is entirely within DOM0 --

> ping responses are still received, but for some reason they don't make

> it up to the ping application.

> For ping U->0, it does look as though the network seizes up -- I see

> no packets in either direction.

> -- Keir
