xen-devel

Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split

To: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
Subject: Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split
From: Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>
Date: Thu, 17 Feb 2011 10:11:25 +0100
Cc: Andre Przywara <andre.przywara@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, "Diestelhorst, Stephan" <Stephan.Diestelhorst@xxxxxxx>
Delivery-date: Thu, 17 Feb 2011 01:12:09 -0800
In-reply-to: <4D5CC89C.7020306@xxxxxxxxxxxxxx>
Organization: Fujitsu Technology Solutions
References: <4D41FD3A.5090506@xxxxxxx> <4D4A72D8.3020502@xxxxxxxxxxxxxx> <4D4C08B6.30600@xxxxxxx> <4D4FE7E2.9070605@xxxxxxx> <4D4FF452.6060508@xxxxxxxxxxxxxx> <AANLkTinoRUQC_suVYFM9-x3D00KvYofq3R=XkCQUj6RP@xxxxxxxxxxxxxx> <4D50D80F.9000007@xxxxxxxxxxxxxx> <AANLkTinKJUAXhiXpKui_XX8XCD6T5fmzNARwHE6Fjafv@xxxxxxxxxxxxxx> <AANLkTinP0z9GynF1RFd8RwzWuqvxYdb+UBE+7xKpX6D4@xxxxxxxxxxxxxx> <4D517051.10402@xxxxxxx> <AANLkTi=MiELBnPFvb6-jzVth+T7aKxP5JMFhVh3Crdmo@xxxxxxxxxxxxxx> <AANLkTikgGNz=imS1xRVVjntY5P=+MuT_Qsb=-h3QHajY@xxxxxxxxxxxxxx> <4D529BD9.5050200@xxxxxxx> <4D52A2CD.9090507@xxxxxxxxxxxxxx> <4D5388DF.8040900@xxxxxxxxxxxxxx> <4D53AF27.7030909@xxxxxxx> <4D53F3BC.4070807@xxxxxxx> <4D54D478.9000402@xxxxxxxxxxxxxx> <4D54E79E.3000800@xxxxxxx> <AANLkTimkRAHtM4CoTskQ7w6B-8Pis4B2+k7=frxM3oyW@xxxxxxxxxxxxxx> <4D5A29C0.4050702@xxxxxxxxxxxxxx> <4D5B9D2B.107@xxxxxxxxxxxxxx> <AANLkTin+rE1=+vpmTg9xeQdYn7_hucSFkrz1qCtiKfkY@xxxxxxxxxxxxxx> <4D5CC89C.7020306@xxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.16) Gecko/20101226 Iceowl/1.0b1 Icedove/3.0.11
On 02/17/11 08:05, Juergen Gross wrote:
On 02/16/11 14:54, George Dunlap wrote:
Andre (and Juergen), can you try again with the attached patch?

What the patch basically does is try to make "cpu_disable_scheduler()"
do what it seems to say it does. :-) Namely, the various
scheduler-related interrupts (both per-cpu ticks and the master tick)
are part of the scheduler, so disable them before doing anything, and
don't enable them until the cpu is really ready to go again.

To be precise:
* cpu_disable_scheduler() disables ticks
* schedule_cpu_switch() only enables ticks if adding a cpu to a pool,
and does it after inserting the idle vcpu
* Modify semantics, s.t., {alloc,free}_pdata() don't actually start or
stop tickers
+ Call tick_{resume,suspend} in cpu_{up,down}, respectively
* Modify credit1's tick_{suspend,resume} to handle the master ticker
as well.

With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
on one pcpu), I can perform thousands of operations successfully.

(NB this is not ready for application yet, I just wanted to check to
see if it fixes Andre's problem)
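
The ordering described above can be sketched as a small toy model. This is only an illustration under stated assumptions: struct toy_cpu, its fields, and the toy_* functions are hypothetical and are not Xen's real data structures or the contents of the attached patch. The sketch shows the intended invariant, namely that ticks run only on a cpu that is in a pool and already has its idle vcpu inserted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model: these are NOT Xen's real data structures. */
struct toy_cpu {
    bool ticks_enabled;  /* per-cpu ticker armed */
    bool idle_inserted;  /* idle vcpu known to the current scheduler */
    int  pool;           /* pool id, or -1 when the cpu is in no pool */
};

/* Invariant the description aims for: ticks run only on a fully set-up cpu. */
static bool toy_invariant(const struct toy_cpu *c)
{
    return !c->ticks_enabled || (c->pool >= 0 && c->idle_inserted);
}

/* Taking a cpu out of service: ticks go off before anything else changes. */
static void toy_cpu_disable_scheduler(struct toy_cpu *c)
{
    c->ticks_enabled = false;
    c->idle_inserted = false;
    c->pool = -1;
    assert(toy_invariant(c));
}

/* Moving a cpu: ticks come back only when it joins a pool, and only last. */
static void toy_schedule_cpu_switch(struct toy_cpu *c, int new_pool)
{
    c->ticks_enabled = false;
    c->pool = new_pool;
    c->idle_inserted = (new_pool >= 0);
    if (new_pool >= 0)
        c->ticks_enabled = true;   /* only after the idle vcpu is in place */
    assert(toy_invariant(c));
}
```

In this model, removing a cpu from a pool (new_pool of -1) leaves the ticks off, which mirrors the "only enables ticks if adding a cpu to a pool" rule in the list.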

Tried again, this time with the following patch:

diff -r 72470de157ce xen/common/sched_credit.c
--- a/xen/common/sched_credit.c Wed Feb 16 09:49:33 2011 +0000
+++ b/xen/common/sched_credit.c Wed Feb 16 15:09:54 2011 +0100
@@ -1268,7 +1268,8 @@ csched_load_balance(struct csched_privat
         /*
          * Any work over there to steal?
          */
-        speer = csched_runq_steal(peer_cpu, cpu, snext->pri);
+        speer = cpu_isset(peer_cpu, *online) ?
+            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
         pcpu_schedule_unlock(peer_cpu);
         if ( speer != NULL )
         {


Worked without any flaw for 30000 iterations.
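
For readers who want to see the effect of the one-line change in isolation, here is a minimal standalone sketch of the same guard. Everything in it is an assumption made for illustration: the toy_* types and functions are hypothetical stand-ins, the run queue holds at most one vcpu, and the online mask is a plain unsigned long rather than Xen's cpumask_t. Only the shape of the check (steal from a peer only if that peer is in the online mask, otherwise return NULL) follows the diff.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS 16

/* Hypothetical stand-ins, not the credit scheduler's real structures. */
struct toy_vcpu { int id; int pri; };
struct toy_runq { struct toy_vcpu *head; };   /* at most one queued vcpu */

/* Simplified cpu_isset(): is bit 'cpu' set in the mask? */
static bool toy_cpu_isset(int cpu, unsigned long mask)
{
    return cpu >= 0 && cpu < TOY_NR_CPUS && ((mask >> cpu) & 1UL);
}

/* Take the queued vcpu if its priority is strictly higher than 'pri'. */
static struct toy_vcpu *toy_runq_steal(struct toy_runq *rq, int pri)
{
    struct toy_vcpu *v = rq->head;

    if (v == NULL || v->pri <= pri)
        return NULL;
    rq->head = NULL;
    return v;
}

/* The guard from the diff: never look at the run queue of an offline peer. */
static struct toy_vcpu *toy_try_steal(struct toy_runq rqs[], int peer_cpu,
                                      unsigned long online, int pri)
{
    return toy_cpu_isset(peer_cpu, online)
               ? toy_runq_steal(&rqs[peer_cpu], pri)
               : NULL;
}
```

With cpu 9 missing from the mask, toy_try_steal() returns NULL and leaves that cpu's queue untouched; once cpu 9 is set in the mask, the same call hands back the queued vcpu.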


Juergen


After some thousand iterations the machine hung, and after I dumped the Dom0
registers to the console it continued running and crashed about a second later:

(XEN) cpupool_unassign_cpu(pool=0,cpu=9)
(XEN) cpupool_unassign_cpu(pool=0,cpu=9) ffff83083fff74c0
(XEN) cpupool_unassign_cpu ret=0
(XEN) cpupool_unassign_cpu(pool=0,cpu=4)
(XEN) cpupool_unassign_cpu(pool=0,cpu=4) ffff83083fff74c0
(XEN) cpupool_unassign_cpu ret=0
(XEN) cpupool_assign_cpu(pool=1,cpu=9)
(XEN) cpupool_assign_cpu(pool=1,cpu=9) ffff83083002de40
(XEN) Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:279
(XEN) ----[ Xen-4.1.0-rc5-pre x86_64 debug=y Tainted: C ]----
(XEN) CPU: 9
(XEN) RIP: e008:[<ffff82c480126100>] active_timer+0xc/0x37
(XEN) RFLAGS: 0000000000010046 CONTEXT: hypervisor
(XEN) rax: 0000000000000000 rbx: 0000000000000000 rcx: 0000000000000000
(XEN) rdx: ffff830839d8ff18 rsi: 0000010dbb628a80 rdi: ffff83083ffbcf98
(XEN) rbp: ffff830839d8fd50 rsp: ffff830839d8fd50 r8: ffff83083ffbcf90
(XEN) r9: ffff82c480213680 r10: 00000000ffffffff r11: 0000000000000010
(XEN) r12: ffff82c4802d3f80 r13: ffff82c4802d3f80 r14: ffff83083ffbcf98
(XEN) r15: ffff83083ffbcfc0 cr0: 000000008005003b cr4: 00000000000026f0
(XEN) cr3: 000000007809c000 cr2: 0000000000620048
(XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff830839d8fd50:
(XEN) ffff830839d8fda0 ffff82c480126ef9 0000000000000000 0000010dbb628a80
(XEN) 0000000000000086 0000000000000009 ffff83083002de40 ffff83083002dd50
(XEN) 0000000000000009 0000000000000009 ffff830839d8fdc0 ffff82c480117906
(XEN) ffff83083ffa3b40 ffff83083ffa5d70 ffff830839d8fe30 ffff82c4801214fa
(XEN) ffff83083002dd00 0000000900000100 0000000000000286 ffff8300780da000
(XEN) ffff83083ffbcf80 ffff83083ffbcf90 ffff82c480247e00 0000000000000009
(XEN) 00000000fffffff0 ffff83083002dd00 0000000000000000 ffff8300781cc198
(XEN) ffff830839d8fe60 ffff82c4801019ff 0000000000000009 0000000000000009
(XEN) ffff8300781cc198 ffff830839d990d0 ffff830839d8fe80 ffff82c480101bd9
(XEN) ffff83107e80c5b0 ffff8300781cc000 ffff830839d8fea0 ffff82c480104f21
(XEN) 0000000000000009 ffff830839d990e0 ffff830839d8fee0 ffff82c480125b6c
(XEN) ffff82c48024a020 ffff830839d8ff18 ffff82c48024a020 ffff830839d8ff18
(XEN) ffff830839d99060 ffff830839d99040 ffff830839d8ff10 ffff82c48015645a
(XEN) 0000000000000000 ffff8300780da000 ffff8300780da000 ffffffffffffffff
(XEN) ffff830839d8fe00 0000000000000000 0000000000000000 0000000000000000
(XEN) 0000000000000000 ffffffff8062bda0 ffff880fbb1e5fd8 0000000000000246
(XEN) 0000000000000000 000000010003347d 0000000000000000 0000000000000000
(XEN) ffffffff800033aa 00000000deadbeef 00000000deadbeef 00000000deadbeef
(XEN) 0000010000000000 ffffffff800033aa 000000000000e033 0000000000000246
(XEN) ffff880fbb1e5f08 000000000000e02b 0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN) [<ffff82c480126100>] active_timer+0xc/0x37
(XEN) [<ffff82c480126ef9>] set_timer+0x102/0x218
(XEN) [<ffff82c480117906>] csched_tick_resume+0x53/0x75
(XEN) [<ffff82c4801214fa>] schedule_cpu_switch+0x1f1/0x25c
(XEN) [<ffff82c4801019ff>] cpupool_assign_cpu_locked+0x61/0xd6
(XEN) [<ffff82c480101bd9>] cpupool_assign_cpu_helper+0x9f/0xcd
(XEN) [<ffff82c480104f21>] continue_hypercall_tasklet_handler+0x51/0xc3
(XEN) [<ffff82c480125b6c>] do_tasklet+0xe1/0x155
(XEN) [<ffff82c48015645a>] idle_loop+0x5f/0x67
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 9:
(XEN) Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:279
(XEN) ****************************************
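
As a reading aid for the assertion in the trace, the following toy model shows how a check of the form "status >= inactive" rejects a timer whose status field is still zero. It rests on assumptions that are not confirmed in this thread: that the lowest status value means "invalid" and sorts below "inactive", and that a never-initialised or zeroed timer is what reached set_timer() here. The toy_* names and the enum ordering are hypothetical and are not copied from Xen's timer code.

```c
#include <stdbool.h>
#include <string.h>

/* Assumed ordering for illustration only: "invalid" sorts below "inactive". */
enum toy_timer_status {
    TOY_TIMER_invalid = 0,   /* zeroed or never initialised */
    TOY_TIMER_inactive,      /* initialised, not armed */
    TOY_TIMER_killed,        /* must not be armed again */
    TOY_TIMER_in_heap        /* armed */
};

struct toy_timer {
    enum toy_timer_status status;
    long expires;
};

/* Same shape as the failed assertion: status >= inactive. */
static bool toy_status_ok(const struct toy_timer *t)
{
    return t->status >= TOY_TIMER_inactive;
}

static void toy_init_timer(struct toy_timer *t)
{
    memset(t, 0, sizeof(*t));
    t->status = TOY_TIMER_inactive;
}

/* Returns false where a debug build would trip the assertion instead. */
static bool toy_set_timer(struct toy_timer *t, long expires)
{
    if (!toy_status_ok(t) || t->status == TOY_TIMER_killed)
        return false;
    t->expires = expires;
    t->status = TOY_TIMER_in_heap;
    return true;
}
```

In this model, arming a zeroed timer fails, while arming it after toy_init_timer() succeeds. Whether that is what happened on CPU 9 via csched_tick_resume() is not established by the log alone.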


Juergen



--
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@xxxxxxxxxxxxxx
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html
