xen-devel

[Xen-devel] Scheduling anomaly with 4.0.0 (rc6)

To: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-devel] Scheduling anomaly with 4.0.0 (rc6)
From: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
Date: Fri, 2 Apr 2010 09:48:49 -0700 (PDT)
I've been running some heavy testing on a recent Xen 4.0
snapshot and seeing a strange scheduling anomaly that
I thought I should report.  I don't know if this is
a regression... I suspect not.

System is a Core 2 Duo (Conroe).  Load is four 2-VCPU
EL5u4 guests, two of which are 64-bit and two of which
are 32-bit.  Otherwise they are identical.  All four
are running a sequence of three Linux compiles with
(make -j8 clean; make -j8).  All are started approximately
concurrently: after all domains are launched, I synchronize
the start of the test with an external NFS semaphore file
that each guest checks every 30 seconds.
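
The synchronization itself is nothing fancy; roughly the
following loop in each guest (the NFS mount point and
semaphore file name here are placeholders, not the exact
paths I used):

    # block until the shared "go" file appears on the NFS mount,
    # polling every 30 seconds
    while [ ! -e /mnt/nfs/go ]; do
        sleep 30
    done
    # ...then start the compile sequence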

What I am seeing is a rather large discrepancy in the
amount of CPU time the four domains have consumed while
the test is underway, as reported by xentop and xm list.
I have seen this repeatedly, but the numbers in front of
me right now are:

1191s dom0
3182s 64-bit #1
2577s 64-bit #2 <-- 20% less!
4316s 32-bit #1
2667s 32-bit #2 <-- 40% less!
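
(These numbers are easy to watch accumulate from dom0 with
something like the loop below; the Time(s) column in the
xm list output is the cumulative CPU time per domain.)

    # snapshot accumulated CPU seconds per domain once a minute
    while true; do
        date
        xm list
        sleep 60
    done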

Again, these are identical workloads, and within each pair
the guests run identical released kernels from identical
"file"-based virtual block devices containing released
distros.  Much of my testing had been with tmem and
self-ballooning, so I had blamed them for a while, but I
have reproduced it multiple times with both of those
turned off.

At the start and after each kernel compile, I record a
timestamp, so I know the same work is being done.
Eventually the workload finishes on each domain and
intentionally crashes the kernel so measurement is
stopped.  At the conclusion, the 64-bit pair have very
similar total CPU sec, and the 32-bit pair have very
similar total CPU sec, so eventually (presumably when the
#1's are done hogging CPU) the "slower" domains do finish
the same amount of work.  As a result, it is hard to tell
from just the final results that the four domains are
getting scheduled at very different rates.
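
For anyone wanting to reproduce this, the per-guest driver
amounts to roughly the following; the paths and the forced
crash at the end are illustrative, and sysrq has to be
enabled in the guest for the last line to work:

    # timestamp, three kernel compiles (timestamp after each),
    # then crash the kernel so CPU(sec) stops accumulating
    date >> /root/compile.times
    for i in 1 2 3; do
        (cd /usr/src/linux && make -j8 clean && make -j8)
        date >> /root/compile.times
    done
    echo c > /proc/sysrq-trigger   # forced crash; needs sysrq enabled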

Does this seem like a scheduler problem, or are there
other explanations? Anybody care to try to reproduce it?
Unfortunately, I have to use the machine now for other
work.

P.S. According to xentop, there is almost no network
activity, so it is all CPU and VBD.  And the ratio of VBD
activity across the domains looks to be approximately the
same as the ratio of CPU(sec).
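
A single batch-mode xentop snapshot is enough to eyeball
that comparison, something like:

    # one batch-mode xentop iteration; compare the CPU(sec)
    # column with the VBD_RD/VBD_WR counts per domain
    xentop -b -i 1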

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
