
To: Keir Fraser <keir@xxxxxxxxxxxxx>, xen-users@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-users] Re: XEN 3.1: critical bug: vif init failure after creating 15-17 VMs (XENBUS: Timeout connecting to device: device/vif)
From: Eric Tessler <maiden1134@xxxxxxxxx>
Date: Fri, 13 Jul 2007 19:32:58 -0700 (PDT)
Cc: "mark.williamson@xxxxxxxxxxxx" <mark.williamson@xxxxxxxxxxxx>
In-reply-to: <C2BCDA68.ABDB%keir@xxxxxxxxxxxxx>
I was able to get some debugging in on this problem and here is what I have found.
 
I re-ran my test with the XEN debug options enabled as Keir suggested (I also put some debug output in netif_map and map_frontend_pages to find out exactly what was failing). The 16th VM's vif timed out again and here is what I saw in the dmesg log:
 
   (XEN) grant_table.c:557:d1 Expanding dom (1) grant table from (4) to (5) frames.
   (XEN) grant_table.c:557:d2 Expanding dom (2) grant table from (4) to (5) frames.
   (XEN) grant_table.c:557:d3 Expanding dom (3) grant table from (4) to (5) frames.
   (XEN) grant_table.c:557:d4 Expanding dom (4) grant table from (4) to (5) frames.
   (XEN) grant_table.c:557:d5 Expanding dom (5) grant table from (4) to (5) frames.
   (XEN) grant_table.c:557:d6 Expanding dom (6) grant table from (4) to (5) frames.
   (XEN) grant_table.c:557:d7 Expanding dom (7) grant table from (4) to (5) frames.
   (XEN) grant_table.c:557:d8 Expanding dom (8) grant table from (4) to (5) frames.
   (XEN) grant_table.c:557:d9 Expanding dom (9) grant table from (4) to (5) frames.
   (XEN) grant_table.c:557:d10 Expanding dom (10) grant table from (4) to (5) frames.
   (XEN) grant_table.c:557:d11 Expanding dom (11) grant table from (4) to (5) frames.
   (XEN) grant_table.c:557:d12 Expanding dom (12) grant table from (4) to (5) frames.
   (XEN) grant_table.c:557:d13 Expanding dom (13) grant table from (4) to (5) frames.
   (XEN) grant_table.c:557:d14 Expanding dom (14) grant table from (4) to (5) frames.
   (XEN) grant_table.c:557:d15 Expanding dom (15) grant table from (4) to (5) frames.
   (XEN) grant_table.c:557:d16 Expanding dom (16) grant table from (4) to (5) frames.
   (XEN) mm.c:2605:d0 Could not find L1 PTE for address d1400000
You can see from the above that the first 15 VMs are OK, and the 16th VM fails with the "mm.c" error on the last line. To trace where exactly the failure originates, I enabled debug output in "linux-2.6-xen-sparse/drivers/xen/netback/interface.c" (this is where netif_map() is located). I then observed the following output in /var/log/messages when the 16th VM's vif timed out:
   (map_frontend_pages:227)  Gnttab failure mapping rx_ring_ref!
   (netif_map:274) map frontend pages failed [I added this debug output]
   vif vif-16-3: 1 mapping shared-frames 2310/2311 port 11
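
For reference, here is roughly what the failing half of map_frontend_pages() looks like (a simplified sketch from memory of the 3.1 netback source, with the rx mapping pulled out into a hypothetical helper for readability; exact names and signatures may differ):

/*
 * Sketch of the rx-ring half of map_frontend_pages() in
 * linux-2.6-xen-sparse/drivers/xen/netback/interface.c (Xen 3.1),
 * pulled out into a hypothetical helper; approximate, from memory.
 * The DPRINTK below is the "Gnttab failure mapping rx_ring_ref!"
 * message in the log above; op.status carries the error returned by
 * the hypervisor from GNTTABOP_map_grant_ref.
 */
static int map_frontend_rx_page(netif_t *netif, grant_ref_t rx_ring_ref)
{
        struct gnttab_map_grant_ref op;

        /* Ask Xen to map the frontend's rx ring page into the backend's
         * pre-allocated vm area. */
        gnttab_set_map_op(&op, (unsigned long)netif->rx_comms_area->addr,
                          GNTMAP_host_map, rx_ring_ref, netif->domid);

        if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1))
                BUG();

        if (op.status) {
                /* This is the failure the 16th VM hits. */
                DPRINTK(" Gnttab failure mapping rx_ring_ref!");
                return op.status;
        }

        netif->rx_shmem_ref    = rx_ring_ref;
        netif->rx_shmem_handle = op.handle;
        return 0;
}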

The error message from mm.c in the dmesg log comes from the function create_grant_va_mapping() (its call to guest_map_l1e() is returning NULL).
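
For context, the relevant part of create_grant_va_mapping() looks roughly like this (simplified from xen/arch/x86/mm.c as I read it; treat the details as approximate):

/*
 * Simplified sketch of create_grant_va_mapping() (xen/arch/x86/mm.c,
 * Xen 3.1); locking and error handling trimmed, details approximate.
 */
static int create_grant_va_mapping(
        unsigned long va, l1_pgentry_t nl1e, struct vcpu *v)
{
        l1_pgentry_t *pl1e, ol1e;
        unsigned long gl1mfn;
        int okay;

        /* Find the L1 PTE covering 'va' in the guest's page tables. */
        pl1e = guest_map_l1e(v, va, &gl1mfn);
        if (pl1e == NULL) {
                /* This is the mm.c:2605 message in the dmesg log above. */
                MEM_LOG("Could not find L1 PTE for address %lx", va);
                return GNTST_general_error;
        }

        ol1e = *pl1e;
        okay = UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, v); /* checked PTE write */
        guest_unmap_l1e(v, pl1e);

        return okay ? GNTST_okay : GNTST_general_error;
}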
 
In summary, it looks like the mapping of the RX shared memory ring is failing (the TX ring always maps successfully; the failure is always on the RX ring). Another interesting note is that the address dumped in the dmesg log is always the same, d1400000 (I saw the failure about 10 times today and the address never changed), perhaps because the comms area for the RX ring always ends up at the same vmalloc address in dom0.
 
Also, at Keir's suggestion, I tried the XEN 3.0.4 kernel (2.6.16.33) in my 16th VM; it failed the same way. The only difference is that instead of expanding the grant table from 4 to 5 frames, it expanded it from 4 to 16 frames:
   (XEN) grant_table.c:557:d18 Expanding dom (18) grant table from (4) to (16) frames.
   (XEN) mm.c:2605:d0 Could not find L1 PTE for address d1400000
I believe the following call chain represents the path of the failure (starting from within XenBus, traced by hand):
connect_rings                        linux-2.6-xen-sparse/drivers/xen/netback/xenbus.c
netif_map                            linux-2.6-xen-sparse/drivers/xen/netback/interface.c
map_frontend_pages                   linux-2.6-xen-sparse/drivers/xen/netback/interface.c
__gnttab_map_grant_ref (hypercall)   xen/common/grant_table.c
create_grant_host_mapping            xen/arch/x86/mm.c
create_grant_va_mapping              xen/arch/x86/mm.c
guest_map_l1e                        xen/arch/x86/mm.c   (this is the function that is ultimately failing)
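
For completeness, guest_map_l1e() only has a couple of ways to return NULL. From my reading of mm.c it is roughly the following (again a simplified sketch), which suggests that no L1 page table is present covering the address d1400000 at the time of the mapping:

/*
 * Simplified sketch of guest_map_l1e() (xen/arch/x86/mm.c, Xen 3.1) for a
 * non-autotranslated guest such as dom0; approximate, details trimmed.
 * It returns NULL either when the guest L2 entry covering 'addr' cannot
 * be read, or when that entry is not present (or is a superpage) -- i.e.
 * no L1 table is mapped for that address at all.
 */
void *guest_map_l1e(struct vcpu *v, unsigned long addr, unsigned long *gl1mfn)
{
        l2_pgentry_t l2e;

        /* Read the L2 entry covering 'addr' via the linear page table. */
        if (__copy_from_user(&l2e,
                             &__linear_l2_table[l2_linear_offset(addr)],
                             sizeof(l2_pgentry_t)) != 0)
                return NULL;

        /* The L1 table must be present and must not be a superpage mapping. */
        if ((l2e_get_flags(l2e) & (_PAGE_PRESENT | _PAGE_PSE)) != _PAGE_PRESENT)
                return NULL;

        *gl1mfn = l2e_get_pfn(l2e);
        return &__linear_l1_table[l1_linear_offset(addr)];
}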
 
Any clue as to what is causing this failure or how to fix it? Is there any other debug info I can provide that would help in resolving this? I have some free time tomorrow to debug further, but I need some direction; this is an area of XEN I don't understand very well.
 
I am also thinking about downloading the xen-unstable development tree and trying it, to see whether the problem exists there as well.
 
Thanks,
 
Eric

Keir Fraser <keir@xxxxxxxxxxxxx> wrote:
Can you try the 3.0.4 domU kernel against the 3.1 dom0 kernel, and vice versa? Also, turn on debug tracing in Xen (boot options 'loglvl=all guest_loglvl=all') and see what appears at the end of 'xm dmesg'.

 -- Keir


On 13/7/07 02:19, "Eric Tessler" <maiden1134@xxxxxxxxx> wrote:

At the same time in dom0, we see the following error message in /var/log/messages:
  
"vif vif-16-3: 1 mapping shared-frames 2310/2311 port 11"
  
(the error message above means that netif_map failed for some reason in XenBus)
  
 
  
If we repeat this same exact test using XEN 3.0.4, we never have any problems. All vifs in all VMs work correctly. This problem must be specific to XEN 3.1.



_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users