

Re: [Xen-devel] dom0 hang

To: mukesh.rathor@xxxxxxxxxx
Subject: Re: [Xen-devel] dom0 hang
From: George Dunlap <dunlapg@xxxxxxxxx>
Date: Thu, 2 Jul 2009 18:50:16 +0100
Cc: "Kurt C. Hackel" <kurt.hackel@xxxxxxxxxx>, Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>, ackaouy@xxxxxxxxx, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>, andrew thomas <andrew.thomas@xxxxxxxxxx>
Delivery-date: Thu, 02 Jul 2009 10:52:31 -0700
In-reply-to: <4A4C2743.5030703@xxxxxxxxxx>
References: <4A426D50.80401@xxxxxxxxxx> <4A4C2743.5030703@xxxxxxxxxx>
On Thu, Jul 2, 2009 at 4:19 AM, Mukesh Rathor<mukesh.rathor@xxxxxxxxxx> wrote:
> dom0 hang:
>    vcpu0 is trying to wakeup a task and in try_to_wake_up() calls
>    task_rq_lock(). since the task has cpu set to 1, it gets runq lock
>    for vcpu1. next it calls resched_task() which results in sending IPI
>    to vcpu1. for that, vcpu0 gets into the HYPERVISOR_event_channel_op
>    HCALL and is waiting to return. Meanwhile, vcpu1 got running, and is
>    spinning on its runq lock in "schedule():spin_lock_irq(&rq->lock);",
>    that vcpu0 is holding (and is waiting to return from the HCALL).
>    As I had noticed before, vcpu0 never gets scheduled in xen. So
>    looking further into xen:
> xen:
>    Both vcpu's are on the same runq, in this case cpu1. But the
>    priority of vcpu1 has been set to CSCHED_PRI_TS_BOOST. As a result,
>    the scheduler always picks vcpu1, and vcpu0 is starved. Also, I see in
>    kdb that the scheduler timer is not set on cpu 0. That would've
>    allowed csched_load_balance() to kick in on cpu0. [Also, on
>    cpu1, the accounting timer, csched_tick, is not set.  Although
>    csched_tick() is running on cpu0, it only checks runq for cpu0.]
>    Looks like c/s 19500 changed csched_schedule():
> +    ret.time = (is_idle_vcpu(snext->vcpu) ?
> +                -1 : MILLISECS(CSCHED_MSECS_PER_TSLICE));
>  The quickest fix for us would be to just back that out.
>  BTW, just a comment on following (all in sched_credit.c):
>      if ( svc->pri == CSCHED_PRI_TS_UNDER &&
>         !(svc->flags & CSCHED_FLAG_VCPU_PARKED) )
>      {
>         svc->pri = CSCHED_PRI_TS_BOOST;
>      }
>  combined with
>    if ( snext->pri > CSCHED_PRI_TS_OVER )
>            __runq_remove(snext);
>      Setting CSCHED_PRI_TS_BOOST as pri of vcpu seems dangerous. To me,
>      since csched_schedule() never checks for time accumulated by a
>      vcpu at pri CSCHED_PRI_TS_BOOST, that is same as pinning a vcpu to a
>      pcpu. if that vcpu never makes progress, essentially, the system
>      has lost a physical cpu.  Optionally, csched_schedule() should always
>      check for cpu time accumulated and reduce the priority over time.
>      I can't tell right off if it already does that. or something like
>      that :)...  my 2 cents.

Hmm... what's supposed to happen is that eventually a timer tick will
interrupt vcpu1.  If vcpu1 is marked "active", then it will be
debited 10ms worth of credit.  Eventually, it will go into OVER, and
lose BOOST.  If it's "inactive", then when the tick happens, it will
be set to "active" and be debited 10ms again, setting it directly into
OVER (and thus also losing BOOST).

Can you see if the timer ticks are still happening, and perhaps add
some tracing to verify that what I described above is happening?

