Excellent! I agree you have found a very difficult bug!
I am now up to 167 linux compiles with no segfaults!
Congratulations Anthony!
One minor suggestion: I think the new added store can
be in the same cycle as the previous load (no stop bit
needed). I didn't look at the bundling... perhaps it
doesn't matter.
Dan
> -----Original Message-----
> From: Xu, Anthony [mailto:anthony.xu@xxxxxxxxx]
> Sent: Saturday, April 29, 2006 11:44 PM
> To: Magenheimer, Dan (HP Labs Fort Collins); Tristan Gingold;
> xen-ia64-devel@xxxxxxxxxxxxxxxxxxx; Williamson, Alex (Linux
> Kernel Dev)
> Subject: RE: [Xen-ia64-devel] PATCH: slightly improve stability
>
> >From: Magenheimer, Dan
> >Sent: 2006年4月29日 21:58
> >To: Xu, Anthony; Tristan Gingold; xen-ia64-devel@xxxxxxxxxxxxxxxxxxx;
> >Williamson, Alex (Linux Kernel Dev)
> >Subject: RE: [Xen-ia64-devel] PATCH: slightly improve stability
> >
> >Hi Anthony --
> >
> >With both Tristan's stability patch and your earlier patch,
> >I have completed 103 linux compiles now with no segfaults
> >yet. Did you see your segfault with Tristan's patch
> >included?
> >
> >I'll continue running over the weekend with the bits I
> >have but if I see a segfault I will add in the additional
> >store in Xen entry (minstate.h) from your newer patch.
> >
>
> --- a/linux-2.6-xen-sparse/arch/ia64/xen/xenminstate.h
> Thu Apr 27 02:55:42 2006
> +++ b/linux-2.6-xen-sparse/arch/ia64/xen/xenminstate.h
> Sat Apr 29 13:14:58 2006
> @@ -155,6 +155,8 @@
> ;;
> \
> ld4 r30=[r8];
> \
> ;;
> \
> + /* set XSI_INCOMPL_REGFR 0 */
> \
> + st4 [r8]=r0;
> \
> cmp.eq p6,p7=r30,r0;
> \
> ;; /* not sure if this stop bit is necessary */
> \
> (p6) adds r8=XSI_PRECOVER_IFS-XSI_INCOMPL_REGFR,r8;
>
> The additional store is necessary.
>
> In theory, after Guest executes "cover", incomplete frame
> changes to complete
> frame. So Guest should set INCOMPL to 0 just after "cover".
> At least before guest
> psr.ic and psr.i are turned on.
>
> Previously, only when Guest executes "rfi", INCOMPL is set to
> 0. The window
> between "cover" and "rfi" causes trouble in below scenario.
>
> 1. Application A calls system call.
>
> 2. In OS breaks handler entry, INCOMPL is 0. Due to its system call,
> Linux kernel doesn't execute "cover".
>
> 3. Before returning to Application A, schedule happens,
> Application B begins
> to run.
>
> 4. A TLB miss happens on the context of B, this may make
> INCOMPL 1, before
> Returning to B, (that means "rfi" is not executed, and
> INCOMPL is still 1)
> schedule happens again. A resumes to run with INCOMPL 1
> (this is incorrect now).
>
> 5. As mentioned before, this is system call, "cover" is executed in
> ia64_leave_kernel path. Because INCOMPL is 1, this
> "cover" is not actually
> executed, but this "cover" should be executed.
>
> 5. Thus application A's frame is destroyed. Issue appears.
>
>
> I did catch this scenario.
>
> Thanks,
> Anthony
>
>
> >Dan
> >
> >> -----Original Message-----
> >> From: Xu, Anthony [mailto:anthony.xu@xxxxxxxxx]
> >> Sent: Saturday, April 29, 2006 12:03 AM
> >> To: Magenheimer, Dan (HP Labs Fort Collins); Tristan Gingold;
> >> xen-ia64-devel@xxxxxxxxxxxxxxxxxxx; Williamson, Alex (Linux
> >> Kernel Dev)
> >> Subject: RE: [Xen-ia64-devel] PATCH: slightly improve stability
> >>
> >> Hi Dan,
> >>
> >> Yes, we also got a segmentation fault in 1 run out of 30.
> >>
> >> Could you please try this new patch?
> >>
> >> Thanks,
> >> -Anthony
> >>
> >> >-----Original Message-----
> >> >From: Magenheimer, Dan (HP Labs Fort Collins)
> >> [mailto:dan.magenheimer@xxxxxx]
> >> >Sent: 2006年4月28日 22:49
> >> >To: Xu, Anthony; Tristan Gingold;
> xen-ia64-devel@xxxxxxxxxxxxxxxxxxx;
> >> >Williamson, Alex (Linux Kernel Dev)
> >> >Subject: RE: [Xen-ia64-devel] PATCH: slightly improve stability
> >> >
> >> >Hi Anthony --
> >> >
> >> >I tried your patch overnight and still got a segmentation
> >> >fault in 1 run out of 50. I didn't try Tristan's patch yet,
> >> >so will try both at the same time next... perhaps there
> >> >are two different problems that show up as the segmentation
> >> >fault.
> >> >
> >> >Dan
> >> >
> >> >> -----Original Message-----
> >> >> From: Xu, Anthony [mailto:anthony.xu@xxxxxxxxx]
> >> >> Sent: Thursday, April 27, 2006 9:19 PM
> >> >> To: Xu, Anthony; Tristan Gingold;
> >> >> xen-ia64-devel@xxxxxxxxxxxxxxxxxxx; Magenheimer, Dan (HP Labs
> >> >> Fort Collins); Williamson, Alex (Linux Kernel Dev)
> >> >> Subject: RE: [Xen-ia64-devel] PATCH: slightly improve stability
> >> >>
> >> >> Hi Tristan,
> >> >> Could you please check whether this patch address RSE issue?
> >> >>
> >> >> Yes, Intel QA team is doing the test in the meantime.
> >> >>
> >> >>
> >> >> Thanks,
> >> >> -Anthony
> >> >>
> >> >> >-----Original Message-----
> >> >> >From: xen-ia64-devel-bounces@xxxxxxxxxxxxxxxxxxx
> >> >> >[mailto:xen-ia64-devel-bounces@xxxxxxxxxxxxxxxxxxx] On
> >> >> Behalf Of Xu, Anthony
> >> >> >Sent: 2006?4?28? 9:48
> >> >> >To: Tristan Gingold; xen-ia64-devel@xxxxxxxxxxxxxxxxxxx;
> >> >> Magenheimer, Dan (HP
> >> >> >Labs Fort Collins); Alex Williamson
> >> >> >Subject: RE: [Xen-ia64-devel] PATCH: slightly improve stability
> >> >> >
> >> >> >>From: xen-ia64-devel-bounces@xxxxxxxxxxxxxxxxxxx
> >> >> >>[mailto:xen-ia64-devel-bounces@xxxxxxxxxxxxxxxxxxx] On
> >> >> Behalf Of Tristan
> >> >> >>Gingold
> >> >> >>Sent: 2006?4?27? 23:14
> >> >> >>To: xen-ia64-devel@xxxxxxxxxxxxxxxxxxx; Magenheimer, Dan
> >> >> (HP Labs Fort
> >> >> >>Collins); Alex Williamson
> >> >> >>Subject: [Xen-ia64-devel] PATCH: slightly improve stability
> >> >> >>
> >> >> >>Hi,
> >> >> >>
> >> >> >>as reported earlier, this patch seems to improve stability:
> >> >> crashes are at
> >> >> >>least more coherent and maybe less frequent.
> >> >> >>
> >> >> >>RSE handling seems to have a bug: crahes are now due to
> >> >> either a bad value in
> >> >> >>a stacked register or a use of an invalid stacked register
> >> >> (although cfm
> >> >> >>seems correct in gdb!)
> >> >> >
> >> >> >I'm looking at this too,
> >> >> >Yes there is a bug about handle_lazy_cover.
> >> >> >
> >> >> >void ia64_do_page_fault (unsigned long address, unsigned
> >> >> long isr, struct
> >> >> >pt_regs *regs, unsigned long itir)
> >> >> >{
> >> >> > unsigned long iip = regs->cr_iip, iha;
> >> >> > // FIXME should validate address here
> >> >> > unsigned long pteval;
> >> >> > unsigned long is_data = !((isr >>
> IA64_ISR_X_BIT) & 1UL);
> >> >> > IA64FAULT fault;
> >> >> >
> >> >> > if ((isr & IA64_ISR_IR) && handle_lazy_cover(current,
> >> >> isr, regs)) return;
> >> >> >
> >> >> >This code sequence is intended to handle following scenario.
> >> >> >
> >> >> >1. Guest executes br.ret, this may cause mandatory RSE load,
> >> >> and this load may
> >> >> >cause TLB miss.
> >> >> >2. VMM gets control, but VMM can't handle this TLB miss
> >> >> itself, then VMM injects
> >> >> >TLB miss to Guest TLB miss handler, when VMM executing "rfi"
> >> >> to jump to Guest
> >> >> >TLB miss handler, this TLB miss happens again.
> >> >> >3. At this time, interrupt_collection_enabled is 0, so
> >> >> handle_lazy_cover
> >> >> >executes "cover" on behalf of Guest, and return to Guest TLB
> >> >> miss handler again,
> >> >> >this time there is no TLB miss.
> >> >> >
> >> >> >
> >> >> >Following code sequence is in ia64_leave_kernel path with
> >> >> psr.ic and psr.i off.
> >> >> >When br.ret.dptk.many b0 is executed, there may be a
> >> >> mandatory load, thus
> >> >> >There may be a tlb miss, according to above description
> >> >> handle_lazy_cover
> >> >> >executes "cover" on behalf of Guest and return to Guest,
> >> >> this is no correct
> >> >> >in this scenario.
> >> >> >
> >> >> >I didn't find an easy way to fix this bug.
> >> >> >
> >> >> >
> >> >> > mov loc6=0
> >> >> > mov loc7=0
> >> >> >(pRecurse) br.call.dptk.few b0=rse_clear_invalid
> >> >> > ;;
> >> >> > mov loc8=0
> >> >> > mov loc9=0
> >> >> > cmp.ne pReturn,p0=r0,in1 // if recursion count
> >> >> != 0, we need to do a
> >> >> >br.ret
> >> >> > mov loc10=0
> >> >> > mov loc11=0
> >> >> >(pReturn) br.ret.dptk.many b0
> >> >> >#endif /* !CONFIG_ITANIUM */
> >> >> ># undef pRecurse
> >> >> ># undef pReturn
> >> >> > ;;
> >> >> > alloc r17=ar.pfs,0,0,0,0 // drop current
> register frame
> >> >> > ;;
> >> >> > loadrs
> >> >> >
> >> >> >Thanks,
> >> >> >Anthony
> >> >> >
> >> >> >
> >> >> >>
> >> >> >>Tested by doing many linux kernel compilation in SMP
> >> domU (> 100).
> >> >> >>
> >> >> >>Tristan.
> >> >> >
> >> >> >_______________________________________________
> >> >> >Xen-ia64-devel mailing list
> >> >> >Xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
> >> >> >http://lists.xensource.com/xen-ia64-devel
> >> >>
> >>
>
_______________________________________________
Xen-ia64-devel mailing list
Xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-ia64-devel
|