I've been running the Pallas MPI benchmarks with several configurations, and
I just ran a test that errored out. I've previously run the benchmark
successfully on Xen0 only (four nodes) and on XenU only (four nodes) with
no Xen-related errors and no benchmark errors.
This time I ran it with two XenUs on each of the four nodes, each
participating in one of two separate, simultaneous benchmark runs (two
groups of four XenUs), and all bridged to the cluster LAN. Only one physical
node had a problem (all nodes are identical builds of Xen and XenLinux
2.4.27, last cset 1.1362, 2004-10-04 15:55:47+01:00). There was a group
of messages in late August reporting the same "Time went backwards" errors,
but this is a recent build.
One other thing: on this node Xen chose to host both guests on CPU 1, and
I know that at the exact moment of failure Xen1 was interacting with the
only other node that did not spread out its guests (that one actually had
all three of Xen0, Xen1, and Xen2 on CPU 0).
I have no clue if any of this information is helpful :-).
(I am attempting another run with the same configuration right now)
xm dmesg:
(XEN) APIC error on CPU0: 00(02)
(XEN) APIC error on CPU1: 00(02)
(XEN) APIC error on CPU1: 02(02)
(XEN) APIC error on CPU0: 02(02)
(XEN) APIC error on CPU1: 02(01)
(XEN) APIC error on CPU0: 02(02)
Xen0 dmesg, just two error messages:
Timer ISR: Time went backwards: -59799000
Timer ISR: Time went backwards: -48699000
(these filled the whole kernel ring buffer:)
Xen1 dmesg, attached, time went backwards many times
Xen2 dmesg, attached, time went backwards many times
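Since the guests' ring buffers are full of these messages, a quick way to compare nodes is to count them and find the largest backwards step. The following is only a sketch: it assumes the dmesg output has been saved as text and that every message has the same form as the two Xen0 lines above. The function name is made up for illustration, and the unit of the printed value is not assumed.

```python
import re

# Matches lines such as "Timer ISR: Time went backwards: -59799000".
PATTERN = re.compile(r"Time went backwards: (-?\d+)")

def summarize_backwards(log_text):
    """Return (count, most_negative_delta) for 'Time went backwards' lines."""
    deltas = [int(m.group(1)) for m in PATTERN.finditer(log_text)]
    if not deltas:
        return 0, None
    # The most negative value is the largest backwards step.
    return len(deltas), min(deltas)

# The two Xen0 messages quoted above.
sample = """Timer ISR: Time went backwards: -59799000
Timer ISR: Time went backwards: -48699000
"""
count, worst = summarize_backwards(sample)
print(count, worst)
```

Running the same function over the saved Xen1 and Xen2 dmesg files would show whether the backwards steps there are of similar size to the ones on Xen0.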
benchmark error, Xen1, presumably at the same time as Xen2 (though in
a different benchmark: the two groups of four lost sync after a while.
I'm using the default CPU scheduler, and I chalk that up to the odd CPU
pinning that Xen/Xend chose for two of the physical nodes. I am going to
pin those myself in the future)
p3_827: p4_error: net_recv read: probable EOF on socket: 1
p1_777: p4_error: net_recv read: probable EOF on socket: 1
benchmark error, Xen2
p2_821: (347.806618) net_recv failed for fd = 4
p2_821: p4_error: net_recv read, errno = : 104
p3_769: p4_error: net_recv read: probable EOF on socket: 1
p1_766: (402.558327) net_recv failed for fd = 8
p1_766: p4_error: net_recv read, errno = : 104
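For sorting through the benchmark errors, here is a rough sketch that groups the p4_error lines by kind. On Linux, errno 104 is ECONNRESET (connection reset by peer), so those lines and the "probable EOF" ones may just be processes noticing that a peer went away rather than independent failures. The function name and the small errno table are illustrative only, and the line formats are assumed to match the ones quoted above.

```python
import re

# Linux value for the one errno seen in the logs above; extend as needed.
LINUX_ERRNO_NAMES = {104: "ECONNRESET"}

# Leading process tag (e.g. "p2_821") followed by the p4_error text.
P4_ERROR = re.compile(r"(p\d+_\d+): p4_error: net_recv read(.*)")

def classify(line):
    """Return (process_tag, kind) for a p4_error line, or None otherwise."""
    m = P4_ERROR.match(line)
    if not m:
        return None
    proc, rest = m.groups()
    e = re.search(r"errno = : (\d+)", rest)
    if e:
        code = int(e.group(1))
        return proc, LINUX_ERRNO_NAMES.get(code, str(code))
    if "probable EOF" in rest:
        return proc, "EOF"
    return proc, "unknown"

for line in [
    "p2_821: (347.806618) net_recv failed for fd = 4",
    "p2_821: p4_error: net_recv read, errno = : 104",
    "p3_769: p4_error: net_recv read: probable EOF on socket: 1",
]:
    print(classify(line))
```

Lines that are not p4_error lines (such as the "net_recv failed for fd" ones) come back as None and can be skipped.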
error.dmesg.xen1
Description: Binary data
error.dmesg.xen2
Description: Binary data