Re: [Xen-devel] Need help in debugging partially blocked hypervisor

To:	xen-devel@xxxxxxxxxxxxxxxxxxx, haitao.shan@xxxxxxxxx
Subject:	Re: [Xen-devel] Need help in debugging partially blocked hypervisor
From:	Dietmar Hahn <dietmar.hahn@xxxxxxxxxxxxxx>
Date:	Mon, 2 Nov 2009 10:11:25 +0100
Cc:	Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
Delivery-date:	Mon, 02 Nov 2009 01:11:54 -0800
In-reply-to:	<481ad8630911011712p38b028a9r8078199b176326f3@xxxxxxxxxxxxxx>
References:	<200910301320.40125.dietmar.hahn@xxxxxxxxxxxxxx> <C7109568.18E0D%keir.fraser@xxxxxxxxxxxxx> <481ad8630911011712p38b028a9r8078199b176326f3@xxxxxxxxxxxxxx>
Sender:	xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent:	KMail/1.12.2 (Linux/2.6.27.29-0.1-pae; KDE/4.3.1; i686; ; )

Hi Haitao,

> Can I know how you enabled vPMU on Nehalem? This is not supported in
> current Xen.

http://lists.xensource.com/archives/html/xen-devel/2009-09/msg00829.html

> 
> Concerning vpmu support, I totally agree that we can disable this
> feature by default. If anyone really wants to use it, he can use boot
> options to turn it on.

Yes, that's OK for me.

> I am preparing a patch for that. And I will
> send a patch to enable NHM vpmu together.
> 
> For the problem that Dietmar met, I think I once met this before. Can
> you add some code in vpmu_do_interrupt that sets the counter you are
> using to a value other than zero? Please let me know if that can help.

I don't set the counter to zero. I program the counter with 0-val, so it
overflows after val events.
Actually I tested on Nehalem with
- General perf counter #2 (0xc3) with CPU_CLK_UNHALTED and val=1100000
- Fixed counter #1 (0x30a) and val=1100000
The thing is that in the normal case the overflows of both counters appear
at nearly the same time.
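
To make the "0-val" programming concrete, here is a minimal sketch of how
such a preload value can be computed. The helper name pmc_preload() is
hypothetical and not part of Xen, and the 48-bit counter width used in the
usage note is an assumption; the real width should be read from CPUID leaf
0xA on the machine in question.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not Xen code): value to preload into a performance
 * counter so that it overflows after `period` events.
 * `width` is the counter width in bits (an assumption supplied by the
 * caller, e.g. 48).
 */
static uint64_t pmc_preload(uint64_t period, unsigned int width)
{
    assert(width >= 1 && width <= 64);

    /* All-ones mask of `width` bits; avoid shifting by 64. */
    uint64_t mask = (width == 64) ? ~0ULL : ((1ULL << width) - 1);

    /* 0 - period, truncated to the counter width. */
    return (0 - period) & mask;
}
```

With val=1100000 and an assumed width of 48 bits, pmc_preload(1100000, 48)
gives 0xFFFFFFEF3720, i.e. the counter wraps after 1100000 increments.
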
As described, I added some extra trace points for xentrace in
core2_vpmu_do_interrupt(), so the code looks like:

    rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, msr_content);     -> 1. Step
        {
                uint32_t HAHN_l, HAHN_h;
                HAHN_l = (uint32_t) msr_content;
                HAHN_h = (uint32_t) (msr_content >> 32);
                HVMTRACE_3D(HAHN_TR2, v, 1, HAHN_h, HAHN_l);      -> 2. Step
        }
    if ( !msr_content )
        return 0;
    core2_vpmu_cxt->global_ovf_status |= msr_content;
    msr_content = 0xC000000700000000 | ((1 << core2_get_pmc_count()) - 1);
    wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, msr_content);   -> 3. Step

    rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, msr_content);     -> 4. Step
        {
        uint32_t HAHN_l, HAHN_h;
        HAHN_l = (uint32_t) msr_content;
        HAHN_h = (uint32_t) (msr_content >> 32);
        HVMTRACE_3D(HAHN_TR2, v, 0xa, HAHN_h, HAHN_l);    -> 5. Step

        rdmsrl(0xc3, msr_content);                        -> 6. Step General counter #2
        HAHN_l = (uint32_t) msr_content;
        HAHN_h = (uint32_t) (msr_content >> 32);
        HVMTRACE_3D(HAHN_TR2, v, 0xc3, HAHN_h, HAHN_l);
        rdmsrl(0x30a, msr_content);                       -> 7. Step Fixed counter #1
        HAHN_l = (uint32_t) msr_content;
        HAHN_h = (uint32_t) (msr_content >> 32);
        HVMTRACE_3D(HAHN_TR2, v, 0x30a, HAHN_h, HAHN_l);
        }

With these trace points I got the following output:

Last good NMI:
Both counters caused the NMI. Resetting works OK.
The counters themselves kept running.
2. Step: par1 = 0x01,  high = 0x0002, low =  0x0004 ]  rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
5. Step: par1 = 0x0a,  high = 0x0000, low =  0x0000 ]  rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
6. Step: par1 = 0xc3,  high = 0x0000, low =  0x03c4 ]  rdmsrl(0xc3)  -> #2 general counter
7. Step: par1 = 0x30a, high = 0x0000, low =  0x02da ]  rdmsrl(0x30a) -> #1 fixed counter
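
For reading the high/low words in these traces, a small decoder may help.
This is only a sketch: the function names are hypothetical, and the bit
layout (general counter overflow bits starting at bit 0, fixed counter
overflow bits starting at bit 32) is my reading of the Intel SDM
description of the global status MSR. The limits of 4 general and 3 fixed
counters are assumptions for Nehalem.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Rebuild the 64-bit MSR value from the two 32-bit trace parameters. */
static uint64_t join_trace_words(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}

/* Overflow bit of general counter n (assumed: bit n). */
static uint64_t ovf_pmc_bit(unsigned int n)
{
    assert(n < 4);              /* assumed: 4 general counters */
    return 1ULL << n;
}

/* Overflow bit of fixed counter n (assumed: bit 32 + n). */
static uint64_t ovf_fixed_bit(unsigned int n)
{
    assert(n < 3);              /* assumed: 3 fixed counters */
    return 1ULL << (32 + n);
}

/* Print which counters have their overflow bit set in `status`. */
static void print_overflows(uint64_t status)
{
    for (unsigned int n = 0; n < 4; n++)
        if (status & ovf_pmc_bit(n))
            printf("general counter #%u overflowed\n", n);
    for (unsigned int n = 0; n < 3; n++)
        if (status & ovf_fixed_bit(n))
            printf("fixed counter #%u overflowed\n", n);
}
```

Under these assumptions, high = 0x0002 and low = 0x0004 decode to general
counter #2 plus fixed counter #1, which matches the two counters programmed
in this test.
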

NMI from which things go wrong:
Both counters caused the NMI. Resetting does NOT work correctly, it works
only for the general counter!
The general counter (which caused the NMI) seems to be stopped!
2. Step: par1 = 0x01,  high = 0x0002, low =  0x0004 ]  rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
5. Step: par1 = 0x0a,  high = 0x0002, low =  0x0000 ]  rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
6. Step: par1 = 0xc3,  high = 0x0000, low =  0x00ec ]  rdmsrl(0xc3)  -> #2 general counter
7. Step: par1 = 0x30a, high = 0x0000, low =  0x0000 ]  rdmsrl(0x30a) -> #1 fixed counter

Wrong NMI:
Only the fixed counter causes the NMI (it was not reset during the NMI
handling above!)
Both counters seem to be stopped!
2. Step: par1 = 0x01,  high = 0x0002, low =  0x0000 ]  rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
5. Step: par1 = 0x0a,  high = 0x0002, low =  0x0000 ]  rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS)
6. Step: par1 = 0xc3,  high = 0x0000, low =  0x00ec ]  rdmsrl(0xc3)  -> #2 general counter
7. Step: par1 = 0x30a, high = 0x0000, low =  0x0000 ]  rdmsrl(0x30a) -> #1 fixed counter

And this state remains forever!
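
The failing case can also be expressed as a check: any bit that was written
to MSR_CORE_PERF_GLOBAL_OVF_CTRL but is still set when the status is read
back did not get cleared. The sketch below only does the arithmetic; both
function names are hypothetical, and a general counter count of 4 (what I
would expect core2_get_pmc_count() to return on Nehalem) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mask written to the overflow control MSR, built the same way as in the
 * handler code above, with the general counter count passed in.
 */
static uint64_t ovf_ctrl_mask(unsigned int pmc_count)
{
    assert(pmc_count >= 1 && pmc_count < 32);
    return 0xC000000700000000ULL | ((1ULL << pmc_count) - 1);
}

/*
 * Overflow bits that were requested to be cleared but are still set in
 * the status value read back afterwards.
 */
static uint64_t stuck_overflow_bits(uint64_t status_after,
                                    uint64_t cleared_mask)
{
    return status_after & cleared_mask;
}
```

For the step 5 value of the bad NMI (high = 0x0002, low = 0x0000) and an
assumed count of 4, stuck_overflow_bits() returns 0x0000000200000000, i.e.
bit 33, the overflow bit I attribute to fixed counter #1. For the last good
NMI it returns 0.
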
I hope my explanations are understandable ;-)

Until now I can see this behavior only on a Nehalem processor.

Thanks.
Dietmar

> 
> Best Regards
> Shan Haitao
> 
> 2009/10/30 Keir Fraser <keir.fraser@xxxxxxxxxxxxx>:
> > On 30/10/2009 12:20, "Dietmar Hahn" <dietmar.hahn@xxxxxxxxxxxxxx> wrote:
> >
> >> I searched the intel processor spec but couldn't find any help.
> >> So my questions is, what is wrong here?
> >> Can anybody with more knowledge point me in the right direction, what can I
> >> still
> >> do to find the real cause of this?
> >
> > You should probably Cc one of the Intel guys who implemented this stuff --
> > I've added Haitao Shan.
> >
> > Meanwhile I'd be interested to know whether things work okay for you, minus
> > performance counters and the hypervisor hang, if you return immediately from
> > vpmu_initialise(). Really at minimum we need such a fix, perhaps with a boot
> > paremeter to re-enable the feature, for 3.4.2 release; allowing guests to
> > hose the hypervisor like this is of course not on.
> >
> >  -- Keir
> >

-- 
Company details: http://ts.fujitsu.com/imprint.html
