xen-devel
FW: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
Forward the tested patch to anyone interested. > Date: Tue, 7 Sep 2010 05:57:11 -0700 > Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT > From: keir.fraser@xxxxxxxxxxxxx > To: tinnycloud@xxxxxxxxxxx; JBeulich@xxxxxxxxxx > CC: keir.fraser@xxxxxxxxxxxxx > > On 07/09/2010 05:19, "MaoXiaoyun" <tinnycloud@xxxxxxxxxxx> wrote: > > > It looks like the free_heap_pages can be invoked concurrently, right? > > If so, suppose there are two domains run into this function, run to > > line 546-548, > > we can see that pg[i].count_info will be either set to PGC_state_offlined, > > PGC_state_free, > > and remember this is not locked by heap_lock(in fact the lock starts at line > > 558). > > Then it is possible that, one domain set one page to PGC_state_free, while > > another domain > > happened run to line 580,
and its pg-mask is the same page that first domain > > just freed, > > sure the page is not in heap yet. Thus causes xen panic. > > Yes, I think this is it! The dumb patch (attached) is just to extend the > locked region in alloc_ and free_heap_pages to include the for loops which > update each page's state. A cleverer version would only need to update the > first page in an aligned extent with the lock held, but it's not clear that > buys us that much, especially in free_heap_pages() which usually gets called > on single pages anyway. > > Please try the attached fix and see how you get on. I'm very confident it > will fix the bug. Well caught! > > Thanks, > Keir > > > > 531 for ( i = 0; i < (1 << order); i++ ) > > 532 { > > 533 /* > > 534 * Cannot assume that count_info == 0, as there are some corner > > case
s > > 535 * where it isn't the case and yet it isn't a bug: > > 536 * 1. page_get_owner() is NULL > > 537 * 2. page_get_owner() is a domain that was never accessible by > > 538 * its domid (e.g., failed to fully construct the domain). > > 539 * 3. page was never addressable by the guest (e.g., it's an > > 540 * auto-translate-physmap guest and the page was never > > included > > 541 * in its pseudophysical address space). > > 542 * In all the above cases there can be no guest mappings of this > > page. > > 543 */ > > 544 ASSERT(!page_state_is(&pg[i], offlined)); > > 545 pg[i].count_info = > > 546 ((pg[i].count_info & PGC_broken) | > > 547 (page_state_is(&pg[i], offlining) > > 548 ? PGC_state_offlined : PGC_state_free)); > > 549 if ( page_state_is(&pg[i], offlined) ) > > 550 tainted = 1; > > 5
51 > > 552 /* If a page has no owner it will need no safety TLB flush. */ > > 553 pg[i].u.free.need_tlbflush = (page_get_owner(&pg[i]) != NULL); > > 554 if ( pg[i].u.free.need_tlbflush ) > > 555 pg[i].tlbflush_timestamp = tlbflush_current_time(); > > 556 } > > 557 > > 558 spin_lock(&heap_lock); > > 559 > > 560 avail[node][zone] += 1 << order; > > 561 total_avail_pages += 1 << order; > > 562 > > 563 if ( opt_tmem ) > > 564 midsize_alloc_zone_pages = max( > > 565 midsize_alloc_zone_pages, total_avail_pages / > > MIDSIZE_ALLOC_FRAC); > > 566 > > 567 /* Merge chunks as far as possible. */ > > 568 while ( order < MAX_ORDER ) > > 569 { > > 570 mask = 1UL << order; > > 571 > > 572 if ( (page_to_mfn(pg) & mask) ) > > 573 { > > 574 /*
Merge with predecessor block? */ > > 575 if ( !mfn_valid(page_to_mfn(pg-mask)) || > > 576 !page_state_is(pg-mask, free) || > > 577 (PFN_ORDER(pg-mask) != order) ) > > 578 break; > > 579 > > 580 pgn = pg - mask; > > 581 head = &heap(node, zone, order); > > 582 next = pdx_to_page(pgn->list.next); > > 583 prev = pdx_to_page(pgn->list.prev); > > 584 if((head->next == pgn) != (pgn->list.prev == PAGE_LIST_NULL) > > || > > 585 (head->tail == pgn) != (pgn->list.next == PAGE_LIST_NULL)) > > { > > 586 check_page(pg, pg-mask, mask, order, 1, zone, node,head); > > 587 } > > 588 > > 589 pg -= mask; > > 590 page_list_del(pg, &heap(node, zone, order)); > > 591 } > > 592 else > > 593 { > > 594 /* Merge with successor block? */ > > 595 if ( !mfn_valid(page_to_m
fn(pg+mask)) || > > 596 !page_state_is(pg+mask, free) || > > 597 (PFN_ORDER(pg+mask) != order) ) > > 598 break; > > 599 > > 600 pgn = pg + mask; > > 601 head = &heap(node, zone, order); > > 602 next = pdx_to_page(pgn->list.next); > > > > > > From: tinnycloud@xxxxxxxxxxx > > To: jbeulich@xxxxxxxxxx > > CC: keir.fraser@xxxxxxxxxxxxx > > Subject: RE: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT > > Date: Tue, 7 Sep 2010 19:57:47 +0800 > > > > Hi Jan: > > > > Thanks for your explainations. > > I added the log on the page your required. > > > > ----------------------------check page--------------------------------- > > 502 static int check_page(struct page_info* pgb, struct page_info* pg, > > unsigned long mask, unsigned int order, int i, int zone, int node, st
ruct > > page_list_ head *head){ > > 503 > > 504 printk("--------pgb %p, pg op mask %p, prev %p, next %p, \ > > 505 mask %lx, order %d, i %d zone %d node %d\n", > > 506 pgb, pg, pdx_to_page(pg->list.prev), > > pdx_to_page(pg->list.next), > > 507 mask, order, i, zone, node); > > 508 printk("--------head->next %p head->tail %p \n", head->next, > > head->tail); > > 509 printk("shr_handle %lu count_info %lu type_info %lu order %u > > tlbflush_timestamp %u\n", > > 510 pg->shr_handle, pg->count_info, pg->u.inuse.type_info, > > pg->v.free.order, pg->tlbflush_timestamp); > > 511 > > 512 panic("xmao id page before address assigned \n"); > > 513 return 0; > > 514 } > > > > In addtion, I go through the code today. Well, I have another > > question, in free_heap_pages, refer
to > > below, line 545 to 548, it seems that here domain page is travelled and its > > state is set either to > > PGC_state_offlined, PGC_state_free, so could be existed there are two adjacent > > pages both set to > > PGC_state_free(that is pg1 = pg2 - mask, and both free). If so, both two pages > > are not in heap page list, > > then this situation satisfy the panic. Is this possible? > > > > 531 for ( i = 0; i < (1 << order); i++ ) > > 532 { > > 533 /* > > 534 * Cannot assume that count_info == 0, as there are some corner > > cases > > 535 * where it isn't the case and yet it isn't a bug: > > 536 * 1. page_get_owner() is NULL > > 537 * 2. page_get_owner() is a domain that was never accessible by > > 538 * its domid (e.g., failed to fully construct the domain). > > 539 * 3. page was never addressable by t
he guest (e.g., it's an > > 540 * auto-translate-physmap guest and the page was never > > included > > 541 * in its pseudophysical address space). > > 542 * In all the above cases there can be no guest mappings of this > > page. > > 543 */ > > 544 ASSERT(!page_state_is(&pg[i], offlined)); > > 545 pg[i].count_info = > > 546 ((pg[i].count_info & PGC_broken) | > > 547 (page_state_is(&pg[i], offlining) > > 548 ? PGC_state_offlined : PGC_state_free)); > > 549 if ( page_state_is(&pg[i], offlined) ) > > 550 tainted = 1; > > 551 > > 552 /* If a page has no owner it will need no safety TLB flush. */ > > 553 pg[i].u.free.need_tlbflush = (page_get_owner(&pg[i]) != NULL); > > 554 if ( pg[i].u.free.need_tlbflush ) > > 555 pg[i].tlbflush_timestamp = tlbflush_current_time(); > > 556 } > >
> >> Date: Mon, 6 Sep 2010 14:13:00 +0100 > >> From: JBeulich@xxxxxxxxxx > >> To: tinnycloud@xxxxxxxxxxx > >> CC: keir.fraser@xxxxxxxxxxxxx > >> Subject: RE: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT > >> > >>>>> On 06.09.10 at 14:20, MaoXiaoyun <tinnycloud@xxxxxxxxxxx> wrote: > >>> A quick question for you, what is tainted refer to in below log? > >>> > >>> (XEN) ----[ Xen-4.0.0 x86_64 debug=n Not tainted ]---- > >> > >> In Xen, this really doesn't mean much - it's a concept inherited from > >> Linux trying to describe how "clean" the current instance is in terms > >> of software or hardware events in the past that may have affected > >> stability and/or supportability. > >> > >>> (XEN) --------pgb ffff82f6087f02e0, pg op mask ffff82f6087f0
2c0, prev > >>> ffff8315ffffffe0, next ffff8315ffffffe0, mask 1, order 0, i 1 > >>> zone 22 node 1 > >>> (XEN) ------head->next 0000000000000000, head->tail 0000000000000000 > >> > >> If that data is right, we basically have a page that is free but not > >> on the respective heap's list. You will need to print the fields of > >> the calculated (not the passed int) page's struct page_info (not > >> just ->list.next and ->list.prev). Further, you should go hunt > >> for that page across all heap lists, and print where you found it > >> (or whether you didn't find it anywhere). That'll hopefully tell us > >> whether the page's state is wrong or whether it got misplaced. > >> > >> But as you perhaps noticed the system died on the third of your > >> prink()-s: You should not de-reference pdx_t
o_page(pg->list.prev) > >> and pdx_to_page(pg->list.next), the information won't tell us > >> much anyway (I think the whole third printk() is pointless). > >> > >> Further, I only now noticed that you second call to check_page() > >> uses an incorrect second arguments: You want to pass pgn (and > >> you could make the first call do so too). > >> > >> Jan > >> > > >
|
00-heaplock
Description: Binary data
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
|
<Prev in Thread] |
Current Thread |
[Next in Thread> |
FW: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT,
MaoXiaoyun <=
|
|
|