RE: [Xen-devel] xm pause causing lockup

I need to think about this more, but it looks like you have an L2 page
that has a type count of 1 but hasn't been validated. You're then
looping when you try to increment the count to 2, on the assumption
that you're racing someone else.

Does this happen if you boot with 'nosmp'? I don't really believe it's a
race, but it might be worth checking.

Also, it's worth adding a printk to this loop just to confirm that this
is where you're getting caught.

            /* Someone else is updating validation of this page. Wait... */
            while ( (y = page->u.inuse.type_info) == x )
                cpu_relax();
            goto again;
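
As a rough, standalone model of what I think is happening (this is not
the real mm.c code): the bit positions and helper names below are
assumptions made for illustration, chosen only so that 0x40000001
decodes as an L2 page with a type count of 1 and the validated flag
clear. The wait is bounded so that a stuck page shows up as a message
rather than a silent spin.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bit layout for illustration only; not copied from Xen's mm.h. */
#define TYPE_L2     0x40000000u  /* matches type=0x40000000 in the backtrace */
#define VALIDATED   0x08000000u  /* assumed position of the "validated" flag */
#define COUNT_MASK  0x0000FFFFu  /* assumed width of the type count field */

/* Returns 1 if the word has a non-zero type count but is not validated. */
static int is_unvalidated_in_use(uint32_t type_info)
{
    return (type_info & COUNT_MASK) != 0 && (type_info & VALIDATED) == 0;
}

/*
 * Model of the wait loop with a spin limit (the limit is an addition for
 * debugging; the loop quoted above has none). Returns the number of
 * iterations spent waiting, or -1 if the word never changed from x.
 */
static long wait_for_change(const volatile uint32_t *type_info,
                            uint32_t x, long max_spins)
{
    long spins = 0;

    while (*type_info == x) {
        if (++spins >= max_spins) {
            fprintf(stderr, "still waiting: type_info=%08x after %ld spins\n",
                    (unsigned)x, spins);
            return -1;
        }
    }
    return spins;
}
```

With the values from the gdb session, is_unvalidated_in_use(0x40000001)
is true, and wait_for_change() gives up instead of spinning forever when
nothing else ever updates the word.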

We need to figure out how the type count managed to get to one without
the page being validated. I presume you're doing a debug=y build of Xen?
Do you get any warnings about illegal mmu_update attempts when you boot
FreeBSD?

Ian

> Without the ability to continue and only a very basic 
> understanding of the page typing code there is not a whole 
> lot to go on. Let me know if there is some other bit of 
> information that I can provide you with.
> 
>          -Kip
> 
> Before attaching:
> (XEN) 'd' pressed -> dumping registers
> (XEN) CPU:    1
> (XEN) EIP:    0808:[<fc52d59f>]      
> (XEN) EFLAGS: 00000246   CONTEXT: hypervisor
> (XEN) eax: 40000001   ebx: 00000000   ecx: fcfe3740   edx: fcfe3740
> (XEN) esi: 00007ff0   edi: 00000001   ebp: fcffbda0   esp: fcffbd58
> (XEN) ds: 0810   es: 0810   fs: 0810   gs: 0810   ss: 0810   cs: 0808
> (XEN) Stack trace from ESP=fcffbd58:
> (XEN)    80000003 00000001 fcfe3740 fcfe3740 fcfe3740 80000003
> 80000004 80000003
> (XEN)    00000000 00007ff0 fcffbda0 [fc52bfec] fd494968 fcfe3740
> fcffbdc0 40000001
> (XEN)    40000001 40000002 fcffbdd0 [fc52c07b] fd494968 25fe0000
> 00000000 00000000
> (XEN)    000003d1 00000000 fcffbde0 [fc52bcec] 00000000 fd494968
> fcffbe00 [fc52c52e]
> (XEN)    0000630f 25fe0000 fcfe3740 [fc52d100] fffffffc 00000000
> fcffe000 00000001
> (XEN)    00000001 ff85b000 fcffbe40 [fc52c889] 0630f061 0000630f
> fcfe3740 000002ff
> (XEN)    00000001 f0000000 f0000000 00000004 f0000001 f0000000
> 000002ff ff85b000
> (XEN)    0000630f fcfe3740 fcffbe60 [fc52d0f0] fd494968 000001fa
> fc5b20c0 [fc53185d]
> (XEN)    40000000 00000002 fcffbeb0 [fc52d771] fd494968 40000000
> fcfe3740 fcfe3740
> (XEN)    fcfe3740 80000002 80000003 00000004 00000000 f0000000
> f0000000 00000004
> (XEN)    40000001 f0000000 fd49497c f0000000 f0000000 40000001
> fcffbee0 [fc52c07b]
> (XEN)    fd494968 40000000 002ed518 00000000 a089075b 00000001
> fcfe3740 00000000
> (XEN)    00007ff0 fd494968 fcffbfb0 [fc52df98] 0000630f 40000000
> fcfe3740 00000292
> (XEN)    fc5781c0 00000001 0019b901 00000000 00804e95 00000000
> a089075b 000000a1
> (XEN)    a10955f0 000000a1 00000001 fcfea040 00007ff0 00000001
> fcffbf80 00000000
> (XEN)    fcfe3740 00000000 fcfe3740 00000000 a10955f0 000000a1
> 00000000 fcffbf98
> (XEN)    c0293bac 0000000c 00000003 [fc515bfc] a08902cd 000000a1
> 00000002 fcfe3740
> (XEN)    fcfea040 fd494968 00000000 40000000 00000001 00000001
> 00000000 00000000
> (XEN)    00000001 0000630f c018a19b 00000001 fcfea040 00007ff0
> c0293bc8 [fc54e923]
> (XEN)    c0293bac 00000001 00000000 00007ff0 00000001 c0293bc8
> 0000001a 00000000
> (XEN) Call Trace from ESP=fcffbd58:
> (XEN)    [<fc52bfec>] [<fc52c07b>] [<fc52bcec>] [<fc52c52e>]
> [<fc52d100>] [<fc52c889>]
> (XEN)    [<fc52d0f0>] [<fc53185d>] [<fc52d771>] [<fc52c07b>]
> [<fc52df98>] [<fc515bfc>]
> (XEN)    [<fc54e923>] 
> (XEN) Waiting for GDB to attach to XenDBG
> 
> 
> gdb) bt
> #0  0xfc52d59f in get_page_type (page=0xfd494968, 
> type=0x25fe0000) at mm.c:1235
> #1  0xfc52c07b in get_page_and_type_from_pagenr 
> (page_nr=0x630f, type=0x25fe0000, d=0xfcfe3740) at mm.c:360
> #2  0xfc52c52e in get_page_from_l2e (l2e={l2_lo = 0x630f061}, 
> pfn=0x630f, d=0xfcfe3740, va_idx=0x2ff) at mm.c:495
> #3  0xfc52c889 in alloc_l2_table (page=0xfd494968) at mm.c:679
> #4  0xfc52d0f0 in alloc_page_type (page=0xfd494968, 
> type=0x40000000) at mm.c:1083
> #5  0xfc52d771 in get_page_type (page=0xfd494968, 
> type=0x40000000) at mm.c:1269
> #6  0xfc52c07b in get_page_and_type_from_pagenr 
> (page_nr=0x630f, type=0x40000000, d=0xfcfe3740) at mm.c:360
> #7  0xfc52df98 in do_mmuext_op (uops=0xc0293bac, count=0x1, pdone=0x0,
> foreigndom=0x7ff0) at mm.c:1499
> #8  0xfc54e923 in test_all_events () at bitops.h:239
> #9  0xc0293bac in ?? ()
> 
> (gdb) f 7
> #7  0xfc52df98 in do_mmuext_op (uops=0xc0293bac, count=0x1, pdone=0x0,
> foreigndom=0x7ff0)  at mm.c:1499
> 1499                okay = get_page_and_type_from_pagenr(op.mfn, type,
> FOREIGNDOM);
> (gdb) p op
> $9 = {
>   cmd = 0x1,
>   {
>     mfn = 0x630f,
>     linear_addr = 0x630f
>   },
>   {
>     nr_ents = 0xc018a19b,
>     cpuset = 0xc018a19b
>   }
> }
> (gdb) p x
> $1 = 0x40000001
> (gdb) x nx
> 0x40000002:     Ignoring packet error, continuing...
> Reply contains invalid hex digit 40
> (gdb) p y
> $2 = 0x40000001
> (gdb) p page->u.inuse.type_info
> $3 = 0x40000001
> (gdb) p x
> $4 = 0x40000001
> (gdb) p nx
> $5 = 0x40000002
> (gdb) p y
> $6 = 0x40000001
> (gdb) p x
> $7 = 0x40000001
> (gdb) p sizeof(page->u.inuse.type_info)
> $8 = 0x4
> 
> 
> 
> On 4/15/05, Ian Pratt <m+Ian.Pratt@xxxxxxxxxxxx> wrote:
> > Wild! It really is looping in get_page_type.
> > 
> > Any chance you could use the serial debugger to find out what x, nx 
> > and y are in the cmpxchg?
> > 
> > I've tried to think of duff inputs that could cause it to loop, but 
> > I'm not smart enough.
> > 
> > Ian
> > 
> > > -----Original Message-----
> > > From: Kip Macy [mailto:kip.macy@xxxxxxxxx]
> > > Sent: 15 April 2005 18:13
> > > To: Ian Pratt
> > > Cc: Keir Fraser; xen-devel; ian.pratt@xxxxxxxxxxxx
> > > Subject: Re: [Xen-devel] xm pause causing lockup
> > >
> > > Great, thanks. I'm now running a completely fresh tree from last 
> > > night.
> > >
> > > > Over the course of several minutes I hit 'd' a number of times.
> > > > The addresses I got were:
> > >
> > > 0xfc51c742
> > > 0xfc51c746
> > > 0xfc51c74b
> > > 0xfc51c740
> > >
> > > (gdb) x/i 0xfc51c742
> > > 0xfc51c742 <get_page_type+1218>:        mov    0x40(%esp,1),%eax
> > > (gdb) x/i 0xfc51c746
> > > 0xfc51c746 <get_page_type+1222>:        mov    0x14(%eax),%ebx
> > > (gdb) x/i 0xfc51c74b
> > > > 0xfc51c74b <get_page_type+1227>:        je     0xfc51c740 <get_page_type+1216>
> > > (gdb) x/i 0xfc51c740
> > > 0xfc51c740 <get_page_type+1216>:        repz nop
> > >
> > >
> > >                -Kip
> > >
> > > On 4/14/05, Ian Pratt <m+Ian.Pratt@xxxxxxxxxxxx> wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
> > > > > > [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Kip Macy
> > > > > Sent: 15 April 2005 05:36
> > > > > To: Keir Fraser
> > > > > Cc: xen-devel
> > > > > Subject: Re: [Xen-devel] xm pause causing lockup
> > > > >
> > > > > To further check this I added:
> > > > >  printk("%s %d %d %d %d %d\n", __FUNCTION__, op->cmd,
> > > > > op->mfn, count, success_count, domid); to
> > > > > HYPERVISOR_mmuext_op and something similar to mmu_update.
> > > >
> > > > > Is your hypothesis that Xen gets stuck in either the mmuext_op or
> > > > > mmu_update loops?
> > > > Are you running with watchdog enabled?
> > > >
> > > > > It might be good to add a printk at the end so that you can
> > > > > prove this.
> > > >
> > > > Hitting 'd' on the debug console will give us an EIP on CPU 1.
> > > >
> > > > Ian
> > > >
> > >
> >
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel