On 05/29/2010 02:43 PM, Mark Hurenkamp wrote:
>> That appears to mean that you're getting single packets which are larger
>> than 18 pages long (72k). I'm not quite sure how that's possible, since
>> I thought the datagram limit is 64k.
>>
>> Are you using nfs over udp or tcp? (I think tcp, from your stack
>> trace.)
>>
>> Does turning off tso/gso with ethtool make a difference?
>>
> Ok, I tried this on the running system, and it did seem to improve
> things, but I would still see some (other) messages.
> After a reboot, with the new xen/stable-2.6.32.13.x based kernel
> and switching tso and gso off with ethtool, these messages are
> now completely gone (the system has been up for about a day now).
Hm. I don't think disabling them should be necessary, but the only
downside in doing so is slightly higher per-packet processing cost.
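
For reference, the offload toggle discussed above can be scripted. This is
only a minimal sketch: the interface name "eth0" is an assumption (the thread
does not name the domU's NIC), the helper names are hypothetical, and it
expects ethtool to be on PATH and to be run as root.

```python
import subprocess


def offload_off_command(iface, features=("tso", "gso")):
    """Build the argv for `ethtool -K <iface> <feature> off ...`."""
    cmd = ["ethtool", "-K", iface]
    for feature in features:
        cmd += [feature, "off"]
    return cmd


def disable_offloads(iface):
    # Needs root; raises CalledProcessError if ethtool rejects the request.
    subprocess.run(offload_off_command(iface), check=True)


if __name__ == "__main__":
    # "eth0" is an assumed interface name; substitute the domU's real NIC.
    print(" ".join(offload_off_command("eth0")))
```

Settings changed with ethtool -K are not kept across a reboot, so something
like this would have to be rerun at boot for the workaround to stick.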
>
> I do notice something else though (might have been there before,
> but now it is the only message in domU dmesg), just after starting
> nfs during boot of the domU:
>
> BUG: unable to handle kernel paging request at 00000002dcf32198
> IP: [<ffffffff811cf09a>] bitmap_scnprintf+0x5c/0xb6
> PGD a777067 PUD 0
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/devices/pci-0/pci0000:08/0000:08:02.0/local_cpus
What device is 0000:08:02.0?
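
(Running `lspci -s 08:02.0 -v` in the domU should answer this. If lspci is
not installed, the IDs can be read straight from sysfs. A rough sketch,
assuming the device is also visible under /sys/bus/pci/devices; the function
name is hypothetical.)

```python
import os


def read_pci_ids(address, sysfs_root="/sys/bus/pci/devices"):
    """Return the vendor, device and class IDs sysfs exposes for a PCI address."""
    dev_dir = os.path.join(sysfs_root, address)
    ids = {}
    for name in ("vendor", "device", "class"):
        with open(os.path.join(dev_dir, name)) as f:
            ids[name] = f.read().strip()
    return ids


if __name__ == "__main__":
    addr = "0000:08:02.0"
    # Only query the device if it actually exists on this machine.
    if os.path.isdir(os.path.join("/sys/bus/pci/devices", addr)):
        print(read_pci_ids(addr))
```
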
> CPU 0
> Modules linked in: nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss
> autofs4 ipv6 wm8775 tea5767 cx25840 tuner_simple sunrpc tuner_types
> tda9887 tda8290 tuner msp3400 saa7127 saa7115 ivtv i2c_algo_bit
> cx2341x v4l2_common videodev v4l1_compat xen_fbfront
> v4l2_compat_ioctl32 fb_sys_fops tveeprom sysimgblt joydev i2c_core
> sysfillrect xen_kbdfront syscopyarea xen_netfront raid10 raid456
> async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy
> async_tx raid1 raid0 multipath linear
> Pid: 3468, comm: irqbalance Not tainted 2.6.32.13m7.1 #1
> RIP: e030:[<ffffffff811cf09a>] [<ffffffff811cf09a>]
> bitmap_scnprintf+0x5c/0xb6
> RSP: e02b:ffff88001cbd9e18 EFLAGS: 00010246
> RAX: ffffffff81527f2b RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000ffe RDI: 0000000000000000
> RBP: ffff88001cbd9e48 R08: 0000000000000010 R09: 0000000000000001
> R10: 0000000000000357 R11: dead000000200200 R12: 0000000000000000
> R13: 0000000000000ffe R14: 00000002dcf32198 R15: ffff880002bbd000
> FS: 00007fc142b6d720(0000) GS:ffff8800046e0000(0000)
> knlGS:0000000000000000
> CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00000002dcf32198 CR3: 000000001ca58000 CR4: 0000000000002660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process irqbalance (pid: 3468, threadinfo ffff88001cbd8000, task
> ffff88001ded2920)
> Stack:
> 0000000000000200 ffff880002bbd000 ffff88001cbd9f58 ffff880002eeb858
> <0> ffff88001ce8ed10 ffffffff81616230 ffff88001cbd9e68 ffffffff811dd333
> <0> ffff880002eeb878 ffffffff81606368 ffff88001cbd9e98 ffffffff81273574
> Call Trace:
> [<ffffffff811dd333>] local_cpus_show+0x44/0x57
> [<ffffffff81273574>] dev_attr_show+0x22/0x49
> [<ffffffff810a4e8e>] ? __get_free_pages+0x9/0x46
> [<ffffffff8112fbc2>] sysfs_read_file+0xb4/0x139
> [<ffffffff810da927>] vfs_read+0xa6/0x103
> [<ffffffff810daa3a>] sys_read+0x45/0x69
> [<ffffffff81011b02>] system_call_fastpath+0x16/0x1b
> Code: e0 48 c7 c0 2b 7f 52 81 41 83 ec 20 31 db eb 60 44 89 e2 44 89
> e1 48 63 fb 83 e1 3f c1 fa 06 41 b9 01 00 00 00 48 63 d2 44 89 ee <49>
> 8b 14 d6 29 de 48 d3 ea 49 8d 3c 3f 44 88 c1 41 83 ec 20 49
> RIP [<ffffffff811cf09a>] bitmap_scnprintf+0x5c/0xb6
> RSP <ffff88001cbd9e18>
> CR2: 00000002dcf32198
> ---[ end trace 5f520ed1e48e5394 ]---
>
>
> During boot of dom0 I see the following when it is starting my domU
> (seems to be more of a warning):
> BUG: MAX_LOCK_DEPTH too low!
> turning off the locking correctness validator.
Interesting. That looks like a bug in the core kernel's mmu notifier
machinery that we're using, but the only side-effect is that it will
disable lockdep checking.
> Pid: 5861, comm: qemu-dm Not tainted 2.6.32.13m7.1 #1
> Call Trace:
> [<ffffffff8106a625>] __lock_acquire+0x431/0x459
> [<ffffffff810b029d>] ? vma_prio_tree_remove+0x27/0xda
> [<ffffffff8106a6b1>] lock_acquire+0x64/0x81
> [<ffffffff810b939d>] ? mm_take_all_locks+0xe5/0x11c
> [<ffffffff813cdb70>] _spin_lock_nest_lock+0x31/0x66
> [<ffffffff810b939d>] ? mm_take_all_locks+0xe5/0x11c
> [<ffffffff813ccc0e>] ? mutex_lock_nested+0x34/0x39
> [<ffffffff810b939d>] mm_take_all_locks+0xe5/0x11c
> [<ffffffff810cbcbc>] ? do_mmu_notifier_register+0x56/0x113
> [<ffffffff810cbcc4>] do_mmu_notifier_register+0x5e/0x113
> [<ffffffff810cbd94>] mmu_notifier_register+0xe/0x10
> [<ffffffff8123acdb>] gntdev_open+0x8f/0xcc
> [<ffffffff81257dc2>] misc_open+0x188/0x21e
> [<ffffffff810dd1f6>] chrdev_open+0x164/0x185
> [<ffffffff810dd092>] ? chrdev_open+0x0/0x185
> [<ffffffff810d8bd5>] __dentry_open+0x149/0x27f
> [<ffffffff810d8dd1>] nameidata_to_filp+0x3d/0x4e
> [<ffffffff810e59ed>] do_filp_open+0x4ee/0x9e9
> [<ffffffff8100e871>] ? xen_force_evtchn_callback+0xd/0xf
> [<ffffffff8100eff2>] ? check_events+0x12/0x20
> [<ffffffff811d0637>] ? _raw_spin_unlock+0x8f/0x98
> [<ffffffff813cdb3a>] ? _spin_unlock+0x26/0x2b
> [<ffffffff810eedf2>] ? alloc_fd+0x111/0x123
> [<ffffffff810d89a3>] do_sys_open+0x5e/0x10a
> [<ffffffff810d8a78>] sys_open+0x1b/0x1d
> [<ffffffff81011b02>] system_call_fastpath+0x16/0x1b
>
>
> Probably not related: I see the following message in my dom0 from time
> to time, and if it appears at the 'wrong' moment, it causes my system
> to become completely unusable as soon as a process needs disk access.
>
> ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> ata4.00: BMDMA stat 0x64
> ata4.00: failed command: READ DMA
> ata4.00: cmd c8/00:08:99:13:5c/00:00:00:00:00/ef tag 0 dma 4096 in
> res 51/40:00:a0:13:5c/00:00:00:00:00/ef Emask 0x9 (media error)
> ata4.00: status: { DRDY ERR }
> ata4.00: error: { UNC }
> ata4.00: configured for UDMA/133
> ata4.01: configured for UDMA/133
> ata4: EH complete
>
> Not sure if this is related though; it could just be a bad disk (it
> always seems to involve the same disk). I'm going to replace the
> disk and see if that makes a difference.
That looks like a real disk error - it's getting uncorrectable read errors.
J