xen-devel

RE: [Xen-devel] [PATCH] Fix softlockup issue after vcpu hotplug

To: "Keir Fraser" <Keir.Fraser@xxxxxxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject: RE: [Xen-devel] [PATCH] Fix softlockup issue after vcpu hotplug
From: "Tian, Kevin" <kevin.tian@xxxxxxxxx>
Date: Tue, 30 Jan 2007 22:11:32 +0800
>From: Keir Fraser [mailto:Keir.Fraser@xxxxxxxxxxxx]
>Sent: 30 January 2007 21:13
>On 30/1/07 1:09 pm, "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
>
>>> I'm sure this will fix the issue. But who knows what real underlying
>>> issue it might be hiding?
>>>
>>> -- Keir
>>
>> I'm not sure whether it hides something. But the current situation
>> seems like a self-trap to me: the watchdog waits for the timer
>> interrupt to wake it at 1s intervals, while the timer interrupt
>> deliberately schedules a longer interval without considering the
>> watchdog, and then blames the watchdog thread for not running
>> within 10s. :-)
>
>Actually I think you're right -- if this fixes the issue then it points to a
>problem in the next_timer_event code. So it would actually be interesting
>to try clamping the timeout to one second.
>
> -- Keir

I tried clamping the timeout with a simple change like this:

@@ -962,7 +962,8 @@ u64 jiffies_to_st(unsigned long j)
                } else if (((unsigned long)delta >> (BITS_PER_LONG-3)) != 0) {
                        /* Very long timeout means there is no pending timer.
                         * We indicate this to Xen by passing zero timeout. */
-                       st = 0;
+                       //st = 0;
+                       st = processed_system_time + HZ * (u64)NS_PER_TICK;
                } else {
                        st = processed_system_time + delta * (u64)NS_PER_TICK;
                }
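
A minimal user-space sketch of the two branches visible in the hunk above may help when reasoning about the clamp. It is not the kernel code: the function name, the explicit parameters, and the HZ value of 100 are assumptions made for illustration, and the branch that precedes the hunk is not modelled.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Assumed values for illustration only; a real kernel takes these
 * from its own configuration. */
#define HZ            100UL
#define NS_PER_TICK   (1000000000ULL / HZ)
#define BITS_PER_LONG ((int)(sizeof(long) * CHAR_BIT))

/*
 * Hypothetical model of the two branches shown in the hunk.
 *   delta                  ticks until the next timer (must be positive)
 *   processed_system_time  current system time in nanoseconds
 *   clamp                  0 = original behaviour, nonzero = patched
 */
uint64_t model_timeout_ns(long delta, uint64_t processed_system_time,
                          int clamp)
{
    assert(delta > 0); /* the earlier branch is outside this model */

    if (((unsigned long)delta >> (BITS_PER_LONG - 3)) != 0) {
        /* One of the top three bits is set: treated as "no pending
         * timer". The original code returns 0 here; the patched code
         * asks for a wakeup one second (HZ ticks) from now. */
        if (!clamp)
            return 0;
        return processed_system_time + HZ * (uint64_t)NS_PER_TICK;
    }

    /* Ordinary case: wake up after delta ticks. */
    return processed_system_time + (uint64_t)delta * NS_PER_TICK;
}
```

In this model, calling model_timeout_ns(LONG_MAX, pst, 1) yields pst plus one second, whereas the same call with clamp set to 0 yields 0, the value the original code passes to Xen to mean there is no pending timer.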

I had hoped to call this the root fix, but I can't, although the
change did improve things. I created a domU with 4 VCPUs on a
2-CPU box and alternately hot-removed and hot-plugged VCPUs 1, 2
and 3. After about ten rounds of testing everything was still OK.
However, several minutes later I saw the warning again, though far
less frequently than before.

So I have to dig further into this bug. The first thing I plan to do is
check whether such a long timeout is what the guest actually
requested, or whether Xen is enlarging the timeout underneath... :-(

BTW, do you think it would be worthwhile to remove a vcpu from the
scheduler when it goes down and re-initialize it in the scheduler
when it comes back up? I don't know whether this would affect the
scheduler's accounting. Domain save/restore doesn't show this bug,
and one obvious difference from vcpu hotplug is that the domain is
restored in a new context...

Thanks,
Kevin

P.S. Some trace log is attached. You can see that the drift in each
warning is just around 1000 ticks.
[root@localhost ~]# BUG: soft lockup detected on CPU#1!
BUG: drift by 0x41e
 [<c0151301>] softlockup_tick+0xd1/0x100
 [<c01095d4>] timer_interrupt+0x4e4/0x640
 [<c011bbae>] try_to_wake_up+0x24e/0x300
 [<c0151c89>] handle_IRQ_event+0x59/0xa0
 [<c0151d65>] __do_IRQ+0x95/0x120
 [<c010708f>] do_IRQ+0x3f/0xa0
 [<c0103070>] xen_idle+0x0/0x60
 [<c024e355>] evtchn_do_upcall+0xb5/0x120
 [<c0103070>] xen_idle+0x0/0x60
 [<c01057a5>] hypervisor_callback+0x3d/0x48
 [<c0103070>] xen_idle+0x0/0x60
 [<c0109d40>] raw_safe_halt+0x20/0x50
 [<c01030a1>] xen_idle+0x31/0x60
 [<c010316e>] cpu_idle+0x9e/0xf0
BUG: soft lockup detected on CPU#2!
BUG: drift by 0x447
 [<c0151301>] softlockup_tick+0xd1/0x100
 [<c01095d4>] timer_interrupt+0x4e4/0x640
 [<c011bbae>] try_to_wake_up+0x24e/0x300
 [<c0151c89>] handle_IRQ_event+0x59/0xa0
 [<c0151d65>] __do_IRQ+0x95/0x120
 [<c010708f>] do_IRQ+0x3f/0xa0
 [<c0103070>] xen_idle+0x0/0x60
 [<c024e355>] evtchn_do_upcall+0xb5/0x120
 [<c0103070>] xen_idle+0x0/0x60
 [<c01057a5>] hypervisor_callback+0x3d/0x48
 [<c0103070>] xen_idle+0x0/0x60
 [<c0109d40>] raw_safe_halt+0x20/0x50
 [<c01030a1>] xen_idle+0x31/0x60
 [<c010316e>] cpu_idle+0x9e/0xf0
BUG: soft lockup detected on CPU#1!
BUG: drift by 0x43f
 [<c0151301>] softlockup_tick+0xd1/0x100
 [<c01095d4>] timer_interrupt+0x4e4/0x640
 [<c011bbae>] try_to_wake_up+0x24e/0x300
 [<c0151c89>] handle_IRQ_event+0x59/0xa0
 [<c0151d65>] __do_IRQ+0x95/0x120
 [<c010708f>] do_IRQ+0x3f/0xa0
 [<c0103070>] xen_idle+0x0/0x60
 [<c024e355>] evtchn_do_upcall+0xb5/0x120
 [<c0103070>] xen_idle+0x0/0x60
 [<c01057a5>] hypervisor_callback+0x3d/0x48
 [<c0103070>] xen_idle+0x0/0x60
 [<c0109d40>] raw_safe_halt+0x20/0x50
 [<c01030a1>] xen_idle+0x31/0x60
 [<c010316e>] cpu_idle+0x9e/0xf0
BUG: soft lockup detected on CPU#1!
BUG: drift by 0x3ea
 [<c0151301>] softlockup_tick+0xd1/0x100
 [<c01095d4>] timer_interrupt+0x4e4/0x640
 [<c0137699>] __rcu_process_callbacks+0x99/0x100
 [<c0129867>] tasklet_action+0x87/0x130
 [<c0151c89>] handle_IRQ_event+0x59/0xa0
 [<c0151d65>] __do_IRQ+0x95/0x120
 [<c010708f>] do_IRQ+0x3f/0xa0
 [<c0103070>] xen_idle+0x0/0x60
 [<c024e355>] evtchn_do_upcall+0xb5/0x120
 [<c0103070>] xen_idle+0x0/0x60
 [<c01057a5>] hypervisor_callback+0x3d/0x48
 [<c0103070>] xen_idle+0x0/0x60
 [<c0109d40>] raw_safe_halt+0x20/0x50
 [<c01030a1>] xen_idle+0x31/0x60
 [<c010316e>] cpu_idle+0x9e/0xf0
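
The drift values in the log are printed in hexadecimal. A small sketch like the following converts them to decimal ticks and, for a given tick rate, to milliseconds. The tick rate is not stated in this thread, so the HZ value of 100 used in the usage note is purely an assumption; under that assumption, roughly 1000 ticks would correspond to about 10 seconds, the threshold mentioned earlier in the thread.

```c
#include <stdio.h>

/* Drift values copied from the "BUG: drift by ..." lines in the log.
 * In decimal they are 1054, 1095, 1087 and 1002 ticks. */
static const unsigned long drift_ticks[] = { 0x41e, 0x447, 0x43f, 0x3ea };

#define N_DRIFTS (sizeof(drift_ticks) / sizeof(drift_ticks[0]))

/* Convert a tick count to milliseconds for a tick rate of hz per second.
 * The caller chooses hz; it is not taken from the log. */
unsigned long ticks_to_ms(unsigned long ticks, unsigned long hz)
{
    return ticks * 1000UL / hz;
}

/* Print each logged drift in hex, in decimal ticks, and in milliseconds. */
void print_drifts(unsigned long hz)
{
    size_t i;

    for (i = 0; i < N_DRIFTS; i++)
        printf("drift 0x%lx = %lu ticks = %lu ms at HZ=%lu\n",
               drift_ticks[i], drift_ticks[i],
               ticks_to_ms(drift_ticks[i], hz), hz);
}
```

Calling print_drifts(100) prints one line per warning. With a different tick rate the same tick counts map to a different wall-clock time, so the 10-second reading only holds if the guest really runs at HZ=100.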

