To:	xen-devel@xxxxxxxxxxxxxxxxxxx
Subject:	Re: [Xen-devel] tg3 network stall in xen-3.4.x but not in xen-3.3.x
From:	Teck Choon Giam <giamteckchoon@xxxxxxxxx>
Date:	Mon, 6 Jul 2009 11:55:19 +0800

Hi Ian,

On Mon, Jul 6, 2009 at 5:36 AM, Ian Pratt<Ian.Pratt@xxxxxxxxxxxxx> wrote:
>> >> Power management is another difference between 3.3 and 3.4. You can
>> >> disable 3.4 power management by adding Xen boot parameters:
>> >> cpuidle=0 cpufreq=none
>>
>> > I will disable and run the test tomorrow to see whether network stall
>> > issue is there or not.
>>
>> Using cpuidle=0 cpufreq=none seems to solve the network stall problem.
>
> Hmm, that's rather disturbing. It's presumably the cpuidle parameter which is
> having the effect. Quite why deeper sleep states can result in one particular
> device interrupt getting stuck (as opposed to all of them) is a mystery. It
> might be interesting to see the boot messages, and also to find out which of
> the C states is causing the problem (presumably C2 or C3).

Without cpuidle and cpufreq in the Xen boot parameters, I got the
following:

# xenpm get-cpuidle-states
Max C-state: C1

cpu id               : 0
total C-states       : 2
idle time(ms)        : 131588676
C0                   : transition [00000000000019346170]
                      residency  [00000000000003897999 ms]
C1                   : transition [00000000000019346170]
                      residency  [00000000000131507268 ms]

cpu id               : 1
total C-states       : 2
idle time(ms)        : 131696919
C0                   : transition [00000000000012247741]
                      residency  [00000000000003766854 ms]
C1                   : transition [00000000000012247741]
                      residency  [00000000000131638414 ms]

cpu id               : 2
total C-states       : 2
idle time(ms)        : 131540647
C0                   : transition [00000000000013405442]
                      residency  [00000000000003922680 ms]
C1                   : transition [00000000000013405442]
                      residency  [00000000000131482588 ms]

cpu id               : 3
total C-states       : 2
idle time(ms)        : 131527968
C0                   : transition [00000000000031194790]
                      residency  [00000000000004030618 ms]
C1                   : transition [00000000000031194790]
                      residency  [00000000000131374650 ms]
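
To compare C-state residency between hosts, the output above can be
reduced to a share of time per state. This is only a minimal Python
sketch, assuming the text layout shown above; the function names and the
SAMPLE string are my own (hypothetical) and not part of xenpm:

```python
import re

# Excerpt in the same layout as the xenpm output above (cpu 0 only).
SAMPLE = """\
Max C-state: C1

cpu id               : 0
total C-states       : 2
idle time(ms)        : 131588676
C0                   : transition [00000000000019346170]
                      residency  [00000000000003897999 ms]
C1                   : transition [00000000000019346170]
                      residency  [00000000000131507268 ms]
"""


def parse_cpuidle_states(text):
    """Return {cpu_id: {state_name: residency_ms}} from the pasted text."""
    cpus = {}
    cpu = None
    state = None
    for line in text.splitlines():
        m = re.match(r"\s*cpu id\s*:\s*(\d+)", line)
        if m:
            cpu = int(m.group(1))
            cpus[cpu] = {}
            state = None
            continue
        # A "Cn : transition [...]" line names the state for the next residency line.
        m = re.match(r"\s*(C\d+)\s*:\s*transition", line)
        if m:
            state = m.group(1)
            continue
        m = re.match(r"\s*residency\s*\[\s*(\d+)\s*ms\]", line)
        if m and cpu is not None and state is not None:
            cpus[cpu][state] = int(m.group(1))
    return cpus


def residency_share(states):
    """Return each state's fraction of the summed residency (empty if no data)."""
    total = sum(states.values())
    if total == 0:
        return {}
    return {name: ms / total for name, ms in states.items()}


if __name__ == "__main__":
    for cpu_id, states in parse_cpuidle_states(SAMPLE).items():
        for name, share in residency_share(states).items():
            print(f"cpu {cpu_id} {name}: {share:.2%}")
```

Feeding it the full output of xenpm get-cpuidle-states instead of SAMPLE
gives one entry per CPU, which makes it easier to spot a host that
spends time in states deeper than C1.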

Sorry, I cannot give you more details right now, because all the servers
are currently booted with cpuidle and cpufreq in the Xen boot
parameters. I will try to migrate the VMs from one server to another,
then use the freed server to test without cpuidle and cpufreq, and
report back my findings. At the moment all servers boot with:

kernel /xen.gz dom0_mem=256M loglvl=all guest_loglvl=all cpuidle=0 cpufreq=none

If you have any suggestions for Xen boot parameters to add, or anything
else to try, feel free to let me know ;)

> In your tests, rather than rebooting the machine you may possibly be able to 
> recover the machine by unloading and reloading the NIC module. (you may need 
> to remove it from the bridge and ifconfig it down first).

Yes. Shutting down all xendomains, network-bridge and xend, then
restarting them (without restarting the network service) brings the
network back most of the time. It is disruptive, though, because all
VMs have to be shut down cleanly to avoid leaving their ext3 file
systems dirty.

I noticed that on other servers running xen-3.4.x without cpuidle=0
cpufreq=none, xenpm get-cpuidle-states shows:

# xenpm get-cpuidle-states
Max C-state: C7

Is this due to the processor type (they are not dual-core, quad-core or
multi-processor systems), or to whether the system is VT-d enabled?

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.00GHz
stepping        : 3
cpu MHz         : 3000.112
cache size      : 2048 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht nx constant_tsc pni cid
bogomips        : 6004.86

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.00GHz
stepping        : 3
cpu MHz         : 3000.112
cache size      : 2048 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht nx constant_tsc pni cid
bogomips        : 6004.86

The above server is not a Dell but a Tyan server:

# lspci -vvv|grep -i ethernet
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
        Subsystem: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
        Subsystem: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express

Testing on this server shows no network stall. However, this server
crashes within a month, and when I plug in a monitor and keyboard I
see no output and Ctrl+Alt+Delete gets no response. The only thing I
can do is reboot the server, and then the cycle repeats: a sudden crash
within a month, sometimes two or more times in a month. So this server
only runs a backup domU and a mirror domU, which are not so critical.
Because of this sudden-crash issue on this type of server (I have two
such servers with the same issue), I can't really run them in real
production :(

Thanks.

Kindest regards,
Giam Teck Choon
