Re: [Xen-devel] MPI benchmark performance gap between native lin

Santos, Jose Renato G (Jose Renato Santos) wrote:

  Hi,

  We had a similar network problem in the past. We were using a TCP
benchmark instead of MPI but I believe your problem is probably the same
as the one we encountered.
  It took us a while to get to the bottom of this and we only identified
the reason for this behavior after we ported oprofile to Xen and did
some performance profiling experiments.

  Here is a brief explanation of the problem we found and the solution
that worked for us.
  Xenolinux allocates a full page (4 KB) for each socket buffer instead
of using just MTU bytes as in traditional Linux. This is necessary to
enable page exchanges between the guest and the I/O domains. The side
effect is that the memory set aside for socket buffers is not used very
efficiently. Even if packets have the maximum MTU size (typically 1500
bytes for Ethernet), the total buffer utilization is very low (at most
just slightly higher than 35%). If packets arrive faster than they are
processed at the receiver side, they will exhaust the receive buffer
before the TCP advertised window is reached. (By default Linux uses a TCP
advertised window equal to 75% of the receive buffer size. In standard
Linux, this is typically sufficient to stop packet transmission at the
sender before running out of receive buffers. The same is not true in
Xen due to the inefficient use of socket buffers.) When a packet arrives
and there is no receive buffer available, TCP tries to free socket buffer
space by eliminating socket buffer fragmentation (i.e. eliminating
wasted buffer space). This is done at the cost of an extra copy of all
receive buffers into new, compacted socket buffers. This introduces
overhead and reduces throughput when the CPU is the bottleneck, which
seems to be your case.
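
The arithmetic behind this can be sketched as follows. This is a simplified model, not the kernel's actual accounting: it assumes every packet occupies exactly one page of receive buffer, and the 64 KB receive buffer size (RCVBUF) and the function names are illustrative assumptions that do not come from this thread.

```shell
#!/bin/sh
# Simplified model: every packet occupies one full page of receive buffer.
# RCVBUF is an assumed example value, not a figure from this thread.
PAGE=4096      # bytes allocated per socket buffer under Xenolinux
MTU=1500       # largest Ethernet payload per packet
RCVBUF=65536   # assumed receive buffer size in bytes

# Payload bytes that fit before the receive buffer is exhausted.
payload_capacity() {   # args: rcvbuf page mtu
    echo $(( $1 / $2 * $3 ))
}

# Advertised window in bytes for a given percentage of the receive buffer.
advertised_window() {  # args: rcvbuf percent
    echo $(( $1 * $2 / 100 ))
}

cap=$(payload_capacity "$RCVBUF" "$PAGE" "$MTU")
for pct in 75 25; do
    win=$(advertised_window "$RCVBUF" "$pct")
    if [ "$win" -gt "$cap" ]; then
        echo "window ${pct}%: $win bytes > capacity $cap bytes (buffer can overflow)"
    else
        echo "window ${pct}%: $win bytes <= capacity $cap bytes (sender stops first)"
    fi
done
```

Under these assumptions the 75% window (49152 bytes) is larger than the 24000 bytes of payload that fit in page-sized buffers, so the sender may keep transmitting after the buffer is full, while a 25% window (16384 bytes) stays below that limit.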

This problem is not very frequent because modern CPUs are fast enough to
receive packets at Gigabit speeds, so the receive buffer does not fill
up. However, the problem may arise when using slower machines and/or when
the workload consumes a lot of CPU cycles, as scientific MPI
applications do. In your case you have both factors against you.

The solution to this problem is trivial. You just have to change the TCP
advertised window of your guest to a lower value. In our case, we used
25% of the receive buffer size and that was sufficient to eliminate the
problem. This can be done using the following command:

echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale

In my experiments, I notice that the above change does not persist across reboots (every reboot changes the value back to 2, the default value for Debian Sarge 3.1). Is there a way to make the change permanent?


Thanks.

Xuehai
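
One common way to make such a setting survive reboots is to record it in /etc/sysctl.conf, assuming the distribution applies that file at boot (Debian normally does this through its procps init script, but that is worth checking on the system in question). The sketch below is a hypothetical helper, not something from this thread; the function name `persist_sysctl` and its optional third argument are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: record a sysctl setting in a config file that is
# applied at boot. Assumes the file uses "key = value" lines.
persist_sysctl() {   # args: key value [config file]
    key=$1
    value=$2
    conf=${3:-/etc/sysctl.conf}
    # Refuse to add a second entry if the key is already configured.
    if grep -q "^[[:space:]]*$key[[:space:]]*=" "$conf" 2>/dev/null; then
        echo "$key is already set in $conf; edit that line by hand" >&2
        return 1
    fi
    printf '%s = %s\n' "$key" "$value" >> "$conf"
}

# Example (run as root inside the guest):
#   persist_sysctl net.ipv4.tcp_adv_win_scale -2
#   sysctl -p        # apply the file now instead of waiting for a reboot
```

The helper only appends when no entry exists, so an earlier value for the same key is never silently shadowed.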

(The default 2 corresponds to 75% of receive buffer, and -2 corresponds
to 25%)
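
The mapping from tcp_adv_win_scale to a window fraction can be checked with a few lines of shell. This is a sketch of the rule as described in the kernel's ip-sysctl documentation for kernels of this era (a positive scale subtracts bytes/2^scale as overhead; a zero or negative scale keeps bytes/2^(-scale)); the function name and the 64 KB buffer size are illustrative assumptions.

```shell
#!/bin/sh
# Sketch of how tcp_adv_win_scale turns buffer space into an advertised
# window. The function name and the 65536-byte buffer are illustrative.
adv_window_from_space() {   # args: space_in_bytes scale
    space=$1
    scale=$2
    if [ "$scale" -le 0 ]; then
        # scale <= 0: window = space / 2^(-scale)
        echo $(( $space >> (0 - $scale) ))
    else
        # scale > 0: window = space - space / 2^scale
        echo $(( $space - ($space >> $scale) ))
    fi
}

for scale in 2 -2; do
    win=$(adv_window_from_space 65536 "$scale")
    echo "scale=$scale window=$win bytes ($(( win * 100 / 65536 ))% of buffer)"
done
```

For a 65536-byte buffer this gives 49152 bytes (75%) at scale 2 and 16384 bytes (25%) at scale -2, matching the percentages quoted above.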

Please let me know if this improves your results. You should still see a
degradation in throughput when comparing Xen to traditional Linux, but
hopefully you should be able to see better throughput. You should also
try running your experiments in domain 0. This will give better
throughput, although still lower than traditional Linux.
I am curious to know whether this has any effect on your experiments.
Please post the new results if it does.

Thanks

Renato
-----Original Message-----
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of xuehai zhang
Sent: Monday, April 04, 2005 4:19 PM
To: Xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-devel] MPI benchmark performance gap between native linux and domU
Hi all,
I did the following experiments to explore MPI application execution performance both on native Linux machines and inside unprivileged Xen user domains. I use 8 machines with identical HW configurations (498.756 MHz dual CPU, 512MB memory, on a 10MB/sec LAN) and I use the Pallas MPI Benchmarks (PMB).
Experiment 1: I boot all 8 nodes with native Linux (nosmp, kernel 2.4.29) and use all of them for the PMB tests.
Experiment 2: I boot all 8 nodes with Xen running and start a single user domain (port 2.6.10, using file-backed VBD) on each node with 360MB memory. Then I run the same PMB tests among these 8 user domains.
The experiment results show that running the same MPI benchmark in user domains usually results in worse (sometimes very bad) performance compared with native Linux machines. The following are the results of the PMB SendRecv benchmark for both experiments (Table 1 and Table 2 report throughput and latency respectively). As you may notice, SendRecv can achieve a 14.9 MB/sec throughput on native Linux machines but reaches a maximum of only 7.07 MB/sec when running inside user domains. The latency results also show a big gap.
Clearly, there is a difference between the memory used in the native Linux machines of Experiment 1 (512MB) and in the user domains of Experiment 2 (360MB, which cannot go higher because dom0 is started with 128MB memory). However, I don't think it is the main cause of the performance gap because the tested message sizes are much smaller than both memory sizes.
I would appreciate your help if you have had a similar experience and want to share your insights.
BTW, if you are not familiar with the PMB SendRecv benchmark, you can find a detailed explanation at http://people.cs.uchicago.edu/~hai/PMB-MPI1.pdf (see section 4.3.1).
Thanks in advance for your help.

Xuehai


P.S. Table 1: SendRecv throughput (MB/sec) performance

Message_Size(bytes)    Experiment_1    Experiment_2
0                   0             0
1                   0             0
2                   0             0
4                   0             0
8                   0.04          0.01
16                  0.16          0.01
32                  0.34          0.02
64                  0.65          0.04
128                 1.17          0.09
256                 2.15          0.59
512                 3.4           1.23
1K                  5.29          2.57
2K                  7.68          3.5
4K                  10.7          4.96
8K                  13.35         7.07
16K                 14.9          3.77
32K                 9.85          3.68
64K                 5.06          3.02
128K                7.91          4.94
256K                7.85          5.25
512K                7.93          6.11
1M                  7.85          6.5
2M                  8.18          5.44
4M                  7.55          4.93

Table 2: SendRecv latency (millisec) performance

Message_Size(bytes)    Experiment_1    Experiment_2
0                   1979.6        3010.96
1                   1724.16       3218.88
2                   1669.65       3185.3
4                   1637.26       3055.67
8                   406.77        2966.17
16                  185.76        2777.89
32                  181.06        2791.06
64                  189.12        2940.82
128                 210.51        2716.3
256                 227.36        843.94
512                 287.28        796.71
1K                  368.72        758.19
2K                  508.65        1144.24
4K                  730.59        1612.66
8K                  1170.22       2471.65
16K                 2096.86       8300.18
32K                 6340.45       17017.99
64K                 24640.78      41264.5
128K                31709.09      50608.97
256K                63680.67      94918.13
512K                125531.7      162168.47
1M                  251566.94     321451.02
2M                  477431.32     707981
4M                  997768.35     1503987.61



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


