WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 
Xen 
 
Home Products Support Community News
 
   
 

xen-devel

FW: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT

To: xen devel <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject: FW: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
From: MaoXiaoyun <tinnycloud@xxxxxxxxxxx>
Date: Wed, 22 Sep 2010 09:32:36 +0800
Cc: keir.fraser@xxxxxxxxxxxxx
Delivery-date: Tue, 21 Sep 2010 18:33:11 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
Importance: Normal
In-reply-to: <C8AB84B9.123A9%keir.fraser@xxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <BAY121-W424D327E136ADEFE1F7273DA710@xxxxxxx>, <C8AB84B9.123A9%keir.fraser@xxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Forward the tested patch to anyone interested.
 
> Date: Tue, 7 Sep 2010 05:57:11 -0700
> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> From: keir.fraser@xxxxxxxxxxxxx
> To: tinnycloud@xxxxxxxxxxx; JBeulich@xxxxxxxxxx
> CC: keir.fraser@xxxxxxxxxxxxx
>
> On 07/09/2010 05:19, "MaoXiaoyun" <tinnycloud@xxxxxxxxxxx> wrote:
>
> > It looks like the free_heap_pages can be invoked concurrently, right?
> > If so, suppose there are two domains run into this function, run to
> > line 546-548,
> > we can see that pg[i].count_info will be either set to PGC_state_offlined,
> > PGC_state_free,
> > and remember this is not locked by heap_lock(in fact the lock starts at line
> > 558).
> > Then it is possible that, one domain set one page to PGC_state_free, while
> > another domain
> > happened run to line 580, and its pg-mask is the same page that first domain
> > just freed,
> > sure the page is not in heap yet. Thus causes xen panic.
>
> Yes, I think this is it! The dumb patch (attached) is just to extend the
> locked region in alloc_ and free_heap_pages to include the for loops which
> update each page's state. A cleverer version would only need to update the
> first page in an aligned extent with the lock held, but it's not clear that
> buys us that much, especially in free_heap_pages() which usually gets called
> on single pages anyway.
>
> Please try the attached fix and see how you get on. I'm very confident it
> will fix the bug. Well caught!
>
> Thanks,
> Keir
>
>
> > 531 for ( i = 0; i < (1 << order); i++ )
> > 532 {
> > 533 /*
> > 534 * Cannot assume that count_info == 0, as there are some corner
> > case s
> > 535 * where it isn't the case and yet it isn't a bug:
> > 536 * 1. page_get_owner() is NULL
> > 537 * 2. page_get_owner() is a domain that was never accessible by
> > 538 * its domid (e.g., failed to fully construct the domain).
> > 539 * 3. page was never addressable by the guest (e.g., it's an
> > 540 * auto-translate-physmap guest and the page was never
> > included
> > 541 * in its pseudophysical address space).
> > 542 * In all the above cases there can be no guest mappings of this
> > page.
> > 543 */
> > 544 ASSERT(!page_state_is(&pg[i], offlined));
> > 545 pg[i].count_info =
> > 546 ((pg[i].count_info & PGC_broken) |
> > 547 (page_state_is(&pg[i], offlining)
> > 548 ? PGC_state_offlined : PGC_state_free));
> > 549 if ( page_state_is(&pg[i], offlined) )
> > 550 tainted = 1;
> > 5 51
> > 552 /* If a page has no owner it will need no safety TLB flush. */
> > 553 pg[i].u.free.need_tlbflush = (page_get_owner(&pg[i]) != NULL);
> > 554 if ( pg[i].u.free.need_tlbflush )
> > 555 pg[i].tlbflush_timestamp = tlbflush_current_time();
> > 556 }
> > 557
> > 558 spin_lock(&heap_lock);
> > 559
> > 560 avail[node][zone] += 1 << order;
> > 561 total_avail_pages += 1 << order;
> > 562
> > 563 if ( opt_tmem )
> > 564 midsize_alloc_zone_pages = max(
> > 565 midsize_alloc_zone_pages, total_avail_pages /
> > MIDSIZE_ALLOC_FRAC);
> > 566
> > 567 /* Merge chunks as far as possible. */
> > 568 while ( order < MAX_ORDER )
> > 569 {
> > 570 mask = 1UL << order;
> > 571
> > 572 if ( (page_to_mfn(pg) & mask) )
> > 573 {
> > 574 /* Merge with predecessor block? */
> > 575 if ( !mfn_valid(page_to_mfn(pg-mask)) ||
> > 576 !page_state_is(pg-mask, free) ||
> > 577 (PFN_ORDER(pg-mask) != order) )
> > 578 break;
> > 579
> > 580 pgn = pg - mask;
> > 581 head = &heap(node, zone, order);
> > 582 next = pdx_to_page(pgn->list.next);
> > 583 prev = pdx_to_page(pgn->list.prev);
> > 584 if((head->next == pgn) != (pgn->list.prev == PAGE_LIST_NULL)
> > ||
> > 585 (head->tail == pgn) != (pgn->list.next == PAGE_LIST_NULL))
> > {
> > 586 check_page(pg, pg-mask, mask, order, 1, zone, node,head);
> > 587 }
> > 588
> > 589 pg -= mask;
> > 590 page_list_del(pg, &heap(node, zone, order));
> > 591 }
> > 592 else
> > 593 {
> > 594 /* Merge with successor block? */
> > 595 if ( !mfn_valid(page_to_m fn(pg+mask)) ||
> > 596 !page_state_is(pg+mask, free) ||
> > 597 (PFN_ORDER(pg+mask) != order) )
> > 598 break;
> > 599
> > 600 pgn = pg + mask;
> > 601 head = &heap(node, zone, order);
> > 602 next = pdx_to_page(pgn->list.next);
> >
> >
> > From: tinnycloud@xxxxxxxxxxx
> > To: jbeulich@xxxxxxxxxx
> > CC: keir.fraser@xxxxxxxxxxxxx
> > Subject: RE: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> > Date: Tue, 7 Sep 2010 19:57:47 +0800
> >
> > Hi Jan:
> >
> > Thanks for your explainations.
> > I added the log on the page your required.
> >
> > ----------------------------check page---------------------------------
> > 502 static int check_page(struct page_info* pgb, struct page_info* pg,
> > unsigned long mask, unsigned int order, int i, int zone, int node, st ruct
> > page_list_ head *head){
> > 503
> > 504 printk("--------pgb %p, pg op mask %p, prev %p, next %p, \
> > 505 mask %lx, order %d, i %d zone %d node %d\n",
> > 506 pgb, pg, pdx_to_page(pg->list.prev),
> > pdx_to_page(pg->list.next),
> > 507 mask, order, i, zone, node);
> > 508 printk("--------head->next %p head->tail %p \n", head->next,
> > head->tail);
> > 509 printk("shr_handle %lu count_info %lu type_info %lu order %u
> > tlbflush_timestamp %u\n",
> > 510 pg->shr_handle, pg->count_info, pg->u.inuse.type_info,
> > pg->v.free.order, pg->tlbflush_timestamp);
> > 511
> > 512 panic("xmao id page before address assigned \n");
> > 513 return 0;
> > 514 }
> >
> > In addtion, I go through the code today. Well, I have another
> > question, in free_heap_pages, refer to
> > below, line 545 to 548, it seems that here domain page is travelled and its
> > state is set either to
> > PGC_state_offlined, PGC_state_free, so could be existed there are two adjacent
> > pages both set to
> > PGC_state_free(that is pg1 = pg2 - mask, and both free). If so, both two pages
> > are not in heap page list,
> > then this situation satisfy the panic. Is this possible?
> >
> > 531 for ( i = 0; i < (1 << order); i++ )
> > 532 {
> > 533 /*
> > 534 * Cannot assume that count_info == 0, as there are some corner
> > cases
> > 535 * where it isn't the case and yet it isn't a bug:
> > 536 * 1. page_get_owner() is NULL
> > 537 * 2. page_get_owner() is a domain that was never accessible by
> > 538 * its domid (e.g., failed to fully construct the domain).
> > 539 * 3. page was never addressable by t he guest (e.g., it's an
> > 540 * auto-translate-physmap guest and the page was never
> > included
> > 541 * in its pseudophysical address space).
> > 542 * In all the above cases there can be no guest mappings of this
> > page.
> > 543 */
> > 544 ASSERT(!page_state_is(&pg[i], offlined));
> > 545 pg[i].count_info =
> > 546 ((pg[i].count_info & PGC_broken) |
> > 547 (page_state_is(&pg[i], offlining)
> > 548 ? PGC_state_offlined : PGC_state_free));
> > 549 if ( page_state_is(&pg[i], offlined) )
> > 550 tainted = 1;
> > 551
> > 552 /* If a page has no owner it will need no safety TLB flush. */
> > 553 pg[i].u.free.need_tlbflush = (page_get_owner(&pg[i]) != NULL);
> > 554 if ( pg[i].u.free.need_tlbflush )
> > 555 pg[i].tlbflush_timestamp = tlbflush_current_time();
> > 556 }
> >
> >> Date: Mon, 6 Sep 2010 14:13:00 +0100
> >> From: JBeulich@xxxxxxxxxx
> >> To: tinnycloud@xxxxxxxxxxx
> >> CC: keir.fraser@xxxxxxxxxxxxx
> >> Subject: RE: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
> >>
> >>>>> On 06.09.10 at 14:20, MaoXiaoyun <tinnycloud@xxxxxxxxxxx> wrote:
> >>> A quick question for you, what is tainted refer to in below log?
> >>>
> >>> (XEN) ----[ Xen-4.0.0 x86_64 debug=n Not tainted ]----
> >>
> >> In Xen, this really doesn't mean much - it's a concept inherited from
> >> Linux trying to describe how "clean" the current instance is in terms
> >> of software or hardware events in the past that may have affected
> >> stability and/or supportability.
> >>
> >>> (XEN) --------pgb ffff82f6087f02e0, pg op mask ffff82f6087f0 2c0, prev
> >>> ffff8315ffffffe0, next ffff8315ffffffe0, mask 1, order 0, i 1
> >>> zone 22 node 1
> >>> (XEN) ------head->next 0000000000000000, head->tail 0000000000000000
> >>
> >> If that data is right, we basically have a page that is free but not
> >> on the respective heap's list. You will need to print the fields of
> >> the calculated (not the passed int) page's struct page_info (not
> >> just ->list.next and ->list.prev). Further, you should go hunt
> >> for that page across all heap lists, and print where you found it
> >> (or whether you didn't find it anywhere). That'll hopefully tell us
> >> whether the page's state is wrong or whether it got misplaced.
> >>
> >> But as you perhaps noticed the system died on the third of your
> >> prink()-s: You should not de-reference pdx_t o_page(pg->list.prev)
> >> and pdx_to_page(pg->list.next), the information won't tell us
> >> much anyway (I think the whole third printk() is pointless).
> >>
> >> Further, I only now noticed that you second call to check_page()
> >> uses an incorrect second arguments: You want to pass pgn (and
> >> you could make the first call do so too).
> >>
> >> Jan
> >>
> >
>

Attachment: 00-heaplock
Description: Binary data

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel