Angela,
I'm not sure what you EXPECTED to see. A virtual machine will always be
(somewhat) slower than the "real" hardware, because you have an extra software
layer for some operations. Basically, this is the price you pay for the
extended system functionality that you get. It's the same as saying "If I
remove the file-system from my Operating System, I can read from or write to
the disk much quicker than going through the file-system"[1]. You gain some
functionality, and you lose some performance.
Comments on Byte-Bench:
I can't explain the pipe throughput, because I just don't
know anything about how that works.
Process creation involves a lot of page-table work, which is definitely a
typical situation where the hypervisor (Xen) has to take extra action on top
of what's normally done in the OS: operations that are normally trivial writes
to a page-table entry now become calls into Xen to perform the "trivial
operation". So instead of a few simple operations, we now have a software
interrupt, a function call and several extra operations just to work out what
needs to be done, and then the actual page-table update. I'd expect this to be
an order of magnitude slower than the native operation.
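If you want to isolate that effect, here's a minimal sketch (not the actual
UnixBench code, just the same sort of fork+wait loop its "Process Creation"
test runs) that is easy to compare between native and a domU - the loops/sec
ratio should be in the same ballpark as the Process Creation numbers in your
results:

/* Minimal sketch of a process-creation microbenchmark: time N fork+wait
 * pairs and report loops/second.  Under a paravirtualised guest, the
 * page-table setup done for every fork is where the extra hypervisor
 * work lands. */
#include <stdio.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const int iterations = 10000;
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (int i = 0; i < iterations; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            _exit(0);                /* child: exit immediately */
        } else if (pid > 0) {
            waitpid(pid, NULL, 0);   /* parent: reap the child */
        } else {
            perror("fork");
            return 1;
        }
    }
    gettimeofday(&end, NULL);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%d fork+exit pairs in %.3f s (%.1f loops/sec)\n",
           iterations, elapsed, iterations / elapsed);
    return 0;
}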
My guess is that shell scripts aren't slower in themselves, but that several
new processes are created within each shell script.
Comments on lmbench:
read & write are slower - no big surprise. Most likely the reads and writes go
to a file, which is commonly emulated through the loopback-mounted file that
is the DomU's "disk". So you get twice the number of reads: one in Dom0
reading the disk image, and then the data is transferred to DomU through a
"read" operation.
It's similar for the other file-related operations: they become two-step
operations, with Dom0 doing the actual work and then transferring the result
to DomU.
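For what it's worth, your lmbench figures below work out to roughly a factor
of two for both: Simple read goes from 0.2147 to 0.4090 microseconds (about
1.9x) and Simple write from 0.1817 to 0.3588 microseconds (about 2.0x), which
is at least consistent with each operation effectively being done twice.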
Protection fault handling goes through extra steps, as the code enters Xen
itself and then has to be passed back to the guest that prot-faulted, so it's
expected that this takes longer than the same operation in a native OS.
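From your numbers: 0.2196 microseconds native vs 0.5993 microseconds in the
domU, i.e. roughly 2.7x.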
I still have no explanation for the pipe behaviour - in
the few minutes I've been working on this answer, I haven't learnt how pipes
work ;-)
Sockets are probably related to pipes... but I have no real idea how pipes or
sockets work...
fork+<something>: More work is needed in the virtual machine than on the real
hardware, as described under process creation above. A factor of ~2x slower
isn't bad at all... Some of these tests also involve file operations, which
add to the already slower operation.
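From your figures: fork+exit goes from roughly 143.7 to 365.9 microseconds
(about 2.5x), fork+execve from 483.1 to 1066.4 microseconds (about 2.2x), and
fork+/bin/sh -c from 1884 to 3826 microseconds (about 2.0x).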
[1] This assumes the file-system is relatively stupid about caching, because a
modern file-system performs a lot of clever caching/optimisation to increase
system performance.
--
Mats
Hi all,
While doing some benchmarking of Xen, I ran across a couple of performance
issues. I am wondering if anyone else has noticed this and whether there is
anything I can do to tune the performance.
The setup:
CPU: Athlon XP 2500+ (1826.005 MHz)
RAM: Limited to 256 MB in native and xenU
Disk: Maxtor 6B200P0, ATA DISK drive
Motherboard: ASUS A7VBX-MX SE
Network: tested only loopback interface.

I have Fedora Core 4 installed as dom0, with Scientific Linux 3.0.7 (RHEL3)
installed on a separate partition as the single domU. I installed the FC4 xen
rpms (xen-3.0-0.20050912.fc4, kernel-xenU-2.6.12-1.1454_FC4,
kernel-xen0-2.6.12-1.1454_FC4) using yum.

I used the following benchmark tools/suites:
bonnie++-1.03a
UnixBench 4.1.0
ab
lmbench 3.0-a5
The areas where I saw the greatest performance hit were in
system calls, process creation, and pipe throughput. Here are some selected
results:
UnixBench:
============
Scientific Linux 3 Native:
BYTE UNIX Benchmarks (Version 4.1.0)
System -- Linux localhost.localdomain 2.4.21-27.0.2.EL #1 Tue Jan 18 20:27:31 CST 2005 i686 athlon i386 GNU/Linux
Start Benchmark Run: Thu Sep 22 15:23:17 PDT 2005
2 interactive users.
15:23:17 up 12 min, 2 users, load average: 0.03, 0.08, 0.05
lrwxr-xr-x 1 root root 4 Sep 9 10:56 /bin/sh -> bash
/bin/sh: symbolic link to bash
/dev/hdc11 20161172 5059592 14077440 27% /
<--snip-->
System Call Overhead           995605.1 lps (10.0 secs, 10 samples)
Pipe Throughput               1135376.3 lps (10.0 secs, 10 samples)
Pipe-based Context Switching   375521.7 lps (10.0 secs, 10 samples)
Process Creation                 9476.4 lps (30.0 secs, 3 samples)
Execl Throughput                 2918.3 lps (29.7 secs, 3 samples)
<--snip-->
INDEX VALUES
TEST                                      BASELINE     RESULT  INDEX
Dhrystone 2 using register variables      116700.0  4307104.5  369.1
Double-Precision Whetstone                    55.0      980.4  178.3
Execl Throughput                              43.0     2918.3  678.7
File Copy 1024 bufsize 2000 maxblocks       3960.0   143780.0  363.1
File Copy 256 bufsize 500 maxblocks         1655.0    72156.0  436.0
File Copy 4096 bufsize 8000 maxblocks       5800.0   192427.0  331.8
Pipe Throughput                            12440.0  1135376.3  912.7
Process Creation                             126.0     9476.4  752.1
Shell Scripts (8 concurrent)                   6.0      329.7  549.5
System Call Overhead                       15000.0   995605.1  663.7
                                                           =========
     FINAL SCORE                                               475.2
--------------------------------------------
SL3 XenU
BYTE UNIX Benchmarks (Version 4.1.0)
System -- Linux localhost.localdomain 2.6.12-1.1454_FC4xenU #1 SMP Fri Sep 9 00:45:34 EDT 2005 i686 athlon i386 GNU/Linux
Start Benchmark Run: Fri Sep 23 09:08:23 PDT 2005
1 interactive users.
09:08:23 up 0 min, 1 user, load average: 0.95, 0.25, 0.08
lrwxr-xr-x 1 root root 4 Sep 9 10:56 /bin/sh -> bash
/bin/sh: symbolic link to bash
/dev/sda1 20161172 5058964 14078068 27% /
<--snip-->
System Call Overhead           969225.3 lps (10.0 secs, 10 samples)
Pipe Throughput                619270.7 lps (10.0 secs, 10 samples)
Pipe-based Context Switching    85183.9 lps (10.0 secs, 10 samples)
Process Creation                 3014.6 lps (30.0 secs, 3 samples)
Execl Throughput                 1807.4 lps (29.9 secs, 3 samples)
<--snip-->
INDEX VALUES
TEST                                      BASELINE     RESULT  INDEX
Dhrystone 2 using register variables      116700.0  4288647.9  367.5
Double-Precision Whetstone                    55.0      976.3  177.5
Execl Throughput                              43.0     1807.4  420.3
File Copy 1024 bufsize 2000 maxblocks       3960.0   143559.0  362.5
File Copy 256 bufsize 500 maxblocks         1655.0    70328.0  424.9
File Copy 4096 bufsize 8000 maxblocks       5800.0   186297.0  321.2
Pipe Throughput                            12440.0   619270.7  497.8
Process Creation                             126.0     3014.6  239.3
Shell Scripts (8 concurrent)                   6.0      188.0  313.3
System Call Overhead                       15000.0   969225.3  646.2
                                                           =========
     FINAL SCORE                                               356.0
---------------------------------------------------------------------------------
lmbench Selected Results:
==========================
SL3 Native:
<--snip-->
Simple syscall: 0.1516 microseconds
Simple read: 0.2147 microseconds
Simple write: 0.1817 microseconds
Simple stat: 1.8486 microseconds
Simple fstat: 0.3026 microseconds
Simple open/close: 2.2201 microseconds
<--snip-->
Protection fault: 0.2196 microseconds
Pipe latency: 2.2539 microseconds
AF_UNIX sock stream latency: 4.8221 microseconds
Process fork+exit: 143.7297 microseconds
Process fork+execve: 483.0833 microseconds
Process fork+/bin/sh -c: 1884.0000 microseconds
-------------------------------------------------
SL3 XenU:
<--snip-->
Simple syscall: 0.1671 microseconds
Simple read: 0.4090 microseconds
Simple write: 0.3588 microseconds
Simple stat: 3.5761 microseconds
Simple fstat: 0.5530 microseconds
Simple open/close: 3.9425 microseconds
<--snip-->
Protection fault: 0.5993 microseconds
Pipe latency: 12.1886 microseconds
AF_UNIX sock stream latency: 22.3485 microseconds
Process fork+exit: 365.8667 microseconds
Process fork+execve: 1066.4000 microseconds
Process fork+/bin/sh -c: 3826.0000 microseconds
<--snip-->
-------------------------------------------------------------------------
I can post the full results of these tests if anyone is interested.
Does anyone have any ideas for tuning the performance of the domUs? Are there
any configurations that perform better than others?
Thank You,
Angela Norton