On Tue, Feb 08, 2011 at 01:39:06PM +0100, Pim van Riezen wrote:
> Addendum:
>
> The Dells are actually R715.
> The dom0 kernel is actually vmlinuz-2.6.18-194.32.1.el5xen
>
Have you given dom0 a fixed amount of memory, and also increased the dom0 vcpu
weight so that dom0 always gets enough CPU time to take care of things?
http://wiki.xensource.com/xenwiki/XenBestPractices
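
Something along these lines (the numbers are just examples, adjust for your
setup); the memory cap goes on the Xen command line, dom0 ballooning gets
switched off in xend-config.sxp, and the weight is raised via the credit
scheduler:

    # grub.conf, Xen line (you already have the dom0_mem part):
    kernel /xen.gz-4.0.1 dom0_mem=2048M ...

    # /etc/xen/xend-config.sxp:
    (enable-dom0-ballooning no)

    # default credit weight is 256; give dom0 double that:
    xm sched-credit -d Domain-0 -w 512
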
-- Pasi
> Cheers,
> Pim
>
> On Feb 8, 2011, at 13:22 , Pim van Riezen wrote:
>
> > Good day,
> >
> > After an incident in which we saw several dom0 nodes fall over due to a
> > sustained SYN flood against a network range, we have been investigating
> > issues with Xen under high network load. The results so far are not pretty.
> > We recreated a lab setup that can reproduce the scenario with some
> > reliability, although it takes a bit of trial and error to get crashes out
> > of it.
> >
> > SETUP:
> > 2x Dell R710
> > - 4x 6core AMD Opteron 6174
> > - 128GB memory
> > - Broadcom BCM5709
> > - LSI SAS2008 rev.02
> > - Emulex Saturn-X FC adapter
> > - CentOS 5.5 w/ gitco Xen 4.0.1
> >
> > 1x NexSan SATABeast FC raid
> > 1x Brocade FC switch
> > 5x Flood sources (Dell R210)
> >
> > The dom0 machines are loaded with 50 PV images, each coupled to an LVM
> > partition on FC; half of the guests are set to start compiling a kernel in
> > rc.local. There are also 2 HVM images on both machines doing the same.
> >
> > Networking for all guests is configured in a bridged setup, attached to a
> > specific VLAN that arrives tagged at the dom0, so the vifs end up in
> > xenbr86 (née xenbr0.86).
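> >
> > (For reference, the bridge is set up roughly like this; "eth0" stands in
> > for whatever the physical interface is in our case:
> >
> >     vconfig add eth0 86
> >     brctl addbr xenbr86
> >     brctl addif xenbr86 eth0.86
> >
> > and the guests attach to it with vif = [ 'bridge=xenbr86' ] in their
> > configs.)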
> >
> > Grub conf for the dom0s:
> >
> > kernel /xen.gz-4.0.1 dom0_mem=2048M max_cstate=0 cpuidle=off
> > module /vmlinuz-2.6.18-194.11.4.el5xen ro root=LABEL=/ elevator=deadline
> > xencons=tty
> >
> > The flooding is always done to either the entire IP range the guests live
> > in (in case of SYN floods) or a sub-range of about 50 IPs (in case of UDP
> > floods), with random source addresses.
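> >
> > (With a generic flood tool such as hping3, the per-address equivalent would
> > be roughly:
> >
> >     hping3 --flood --rand-source -S <target>          # SYN flood
> >     hping3 --flood --rand-source --udp -d 0 <target>  # 28-byte UDP flood
> >
> > run from the flood boxes against each address in the range.)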
> >
> > ISSUE:
> > When the pps rate gets into insane territory (gigabit link saturated or
> > near-saturated), the machine seems to start losing track of interrupts.
> > Depending on the severity, this leads to CPU soft lockups on random cores.
> > Under more dire circumstances, other hardware attached to the PCI bus
> > starts timing out, making the kernel lose track of storage. Usually the
> > SAS controller is the first to go, but I've also seen timeouts on the FC
> > controller.
> >
> > THINGS TRIED:
> > 1. Raising the Broadcom RX ring from 255 to 3000 (command sketched below
> > the list). No noticeable effect.
> > 2. Downgrading to Xen 3.4.3. No effect.
> > 3. Different Dell BIOS versions. No effect.
> > 4. Lowering the number of guests -> effects get less severe. Not a realistic
> > option.
> > 5. Using ipt_LIMIT in the FORWARD chain set to 10000/s (rule sketched below
> > the list) -> effects get less severe when dealing with TCP SYN attacks. No
> > effect when dealing with 28-byte UDP attacks.
> > 6. Disabling HPET as per
> > http://lists.xensource.com/archives/html/xen-devel/2010-09/msg00556.html
> > with cpuidle=0 and disabling irqbalance -> effects get less severe.
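> >
> > For reference, 1 and 5 were done roughly like this (interface name and
> > exact numbers are specific to our setup):
> >
> >     ethtool -G eth0 rx 3000
> >     iptables -A FORWARD -p tcp --syn -m limit --limit 10000/second -j ACCEPT
> >     iptables -A FORWARD -p tcp --syn -j DROP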
> >
> > The changes in 6 stop the machine from completely crapping itself, but I
> > still get soft lockups, although they seem to be limited to one of these
> > two paths:
> >
> > [<ffffffff8023f830>] invalidate_bh_lru+0x0/0x42
> > [<ffffffff8023f830>] invalidate_bh_lru+0x0/0x42
> > [<ffffffff8027458e>] smp_call_function_many+0x38/0x4c
> > [<ffffffff8023f830>] invalidate_bh_lru+0x0/0x42
> > [<ffffffff80274688>] smp_call_function+0x4e/0x5e
> > [<ffffffff8023f830>] invalidate_bh_lru+0x0/0x42
> > [<ffffffff8028fdd7>] on_each_cpu+0x10/0x2a
> > [<ffffffff802d7428>] kill_bdev+0x1b/0x30
> > [<ffffffff802d7a47>] __blkdev_put+0x4f/0x169
> > [<ffffffff80213492>] __fput+0xd3/0x1bd
> > [<ffffffff802243cb>] filp_close+0x5c/0x64
> > [<ffffffff8021e5d0>] sys_close+0x88/0xbd
> > [<ffffffff802602f9>] tracesys+0xab/0xb6
> >
> > and
> >
> > [<ffffffff8026f4f3>] raw_safe_halt+0x84/0xa8
> > [<ffffffff8026ca88>] xen_idle+0x38/0x4a
> > [<ffffffff8024af6c>] cpu_idle+0x97/0xba
> > [<ffffffff8064eb0f>] start_kernel+0x21f/0x224
> > [<ffffffff8064e1e5>] _sinittext+0x1e5/0x1eb
> >
> > In some scenarios, an application running on the dom0 that relies on
> > pthread_cond_timedwait seems to be hanging in all its threads on that
> > specific call. This may be related to some timing going wonky during the
> > attack, not sure.
> >
> > Is there anything more we can try?
> >
> > Cheers,
> > Pim van Riezen
> >
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel