[Xen-devel] Xen hypervisor external denial of service vulnerability?
Good day,
After seeing several dom0 nodes go down under a sustained SYN flood against one
of our network ranges, we have been investigating how Xen behaves under high
network load. The results so far are not pretty. We recreated a lab setup that
reproduces the scenario fairly reliably, although it takes a bit of trial and
error to get crashes out of it.
SETUP:
2x Dell R710
- 4x 6-core AMD Opteron 6174
- 128GB memory
- Broadcom BCM5709
- LSI SAS2008 rev.02
- Emulex Saturn-X FC adapter
- CentOS 5.5 w/ gitco Xen 4.0.1
1x NexSan SATABeast FC raid
1x Brocade FC switch
5x Flood sources (Dell R210)
The dom0 machines are loaded with 50 PV guests each, every guest backed by an
LVM partition on FC; half of the guests are set to start compiling a kernel
from rc.local. There are also 2 HVM guests on each machine doing the same.
Networking for all guests uses the bridging setup, attached to a specific VLAN
that arrives tagged at the dom0, so the vifs end up in xenbr86 (née xenbr0.86).
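For the curious, the plumbing behind that is roughly equivalent to the
following (interface and bridge names are from our setup, and the xend network
scripts normally take care of this; shown here purely for illustration):
vconfig add eth0 86          # tagged vlan 86 arrives on the physical nic
brctl addbr xenbr86          # per-vlan guest bridge (formerly named xenbr0.86)
brctl addif xenbr86 eth0.86  # uplink into the bridge; the guest vifs get added here as well
ip link set eth0.86 up
ip link set xenbr86 up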
Grub conf for the dom0s:
kernel /xen.gz-4.0.1 dom0_mem=2048M max_cstate=0 cpuidle=off
module /vmlinuz-2.6.18-194.11.4.el5xen ro root=LABEL=/ elevator=deadline
xencons=tty
The flooding is always directed at either the entire IP range the guests live
in (for the SYN floods) or a sub-range of about 50 IPs (for the UDP floods),
with random source addresses.
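To give an idea of the traffic, the flood sources run something along these
lines (hping3 and the 10.0.86.0/24 addresses are stand-ins for illustration,
not necessarily the exact tooling or range we use):
# SYN flood across the whole guest range with spoofed sources
# (--rand-dest randomises the 'x' octet of the target)
hping3 --flood --rand-source --rand-dest -S -p 80 -I eth0 10.0.86.x
# 28-byte UDP packets (20-byte IP header + 8-byte UDP header, no payload)
# against a ~50-address sub-range
for i in $(seq 100 149); do
    hping3 --flood --rand-source --udp -p 53 -d 0 10.0.86.$i &
done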
ISSUE:
When the pps rate gets into insane territory (gigabit link saturated or
near-saturated), the machine seems to start losing track of interrupts.
Depending on the severity, this leads to CPU soft lockups on random cores.
Under more dire circumstances, other hardware attached to the PCI bus starts
timing out, making the kernel lose track of storage. Usually the SAS controller
is the first to go, but I've also seen timeouts on the FC controller.
THINGS TRIED:
1. Raising the Broadcom RX ring from 255 to 3000 (see the commands after this
list). No noticeable effect.
2. Downgrading to Xen 3.4.3. No effect.
3. Different Dell BIOS versions. No effect.
4. Lowering the number of guests -> effects get less severe. Not a serious option.
5. Using the iptables limit match (ipt_limit) in the FORWARD chain, set to
10000/s (see below) -> effects get less severe for TCP SYN attacks. No effect
for 28-byte UDP attacks.
6. Disabling HPET as per
http://lists.xensource.com/archives/html/xen-devel/2010-09/msg00556.html with
cpuidle=0 and disabling irqbalance (see below) -> effects get less severe.
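For completeness, the knobs from 1, 5 and 6 boil down to roughly the following
on the dom0s (our actual iptables ruleset is a bit more specific, and the
HPET/cpuidle part lives on the Xen command line as per the thread linked above):
# 1: bump the bnx2 RX ring from its 255 default
ethtool -G eth0 rx 3000
# 5: crude rate limit on traffic forwarded towards the guest bridge
iptables -A FORWARD -m limit --limit 10000/s -j ACCEPT
iptables -A FORWARD -j DROP
# 6: keep irqbalance from moving vectors around during the flood
service irqbalance stop
chkconfig irqbalance off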
The changes in 6 stop the machine from completely crapping itself, but I still
get soft lockups, although they seem to be limited to one of these two paths:
[<ffffffff8023f830>] invalidate_bh_lru+0x0/0x42
[<ffffffff8023f830>] invalidate_bh_lru+0x0/0x42
[<ffffffff8027458e>] smp_call_function_many+0x38/0x4c
[<ffffffff8023f830>] invalidate_bh_lru+0x0/0x42
[<ffffffff80274688>] smp_call_function+0x4e/0x5e
[<ffffffff8023f830>] invalidate_bh_lru+0x0/0x42
[<ffffffff8028fdd7>] on_each_cpu+0x10/0x2a
[<ffffffff802d7428>] kill_bdev+0x1b/0x30
[<ffffffff802d7a47>] __blkdev_put+0x4f/0x169
[<ffffffff80213492>] __fput+0xd3/0x1bd
[<ffffffff802243cb>] filp_close+0x5c/0x64
[<ffffffff8021e5d0>] sys_close+0x88/0xbd
[<ffffffff802602f9>] tracesys+0xab/0xb6
and
[<ffffffff8026f4f3>] raw_safe_halt+0x84/0xa8
[<ffffffff8026ca88>] xen_idle+0x38/0x4a
[<ffffffff8024af6c>] cpu_idle+0x97/0xba
[<ffffffff8064eb0f>] start_kernel+0x21f/0x224
[<ffffffff8064e1e5>] _sinittext+0x1e5/0x1eb
In some scenarios, an application running on the dom0 that relies on
pthread_cond_timedwait seems to hang in all of its threads on that specific
call. This may be related to timing going wonky during the attack; I'm not
sure.
Is there anything more we can try?
Cheers,
Pim van Riezen
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel