xen-devel

RE: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU

To: Dante Cinco <dantecinco@xxxxxxxxx>, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Subject: RE: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
From: "Lin, Ray" <Ray.Lin@xxxxxxx>
Date: Thu, 18 Nov 2010 11:52:53 -0700
Accept-language: en-US
Cc: Jeremy Fitzhardinge <jeremy@xxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>, "mathieu.desnoyers@xxxxxxxxxx" <mathieu.desnoyers@xxxxxxxxxx>, "andrew.thomas@xxxxxxxxxx" <andrew.thomas@xxxxxxxxxx>, "keir.fraser@xxxxxxxxxxxxx" <keir.fraser@xxxxxxxxxxxxx>, "chris.mason@xxxxxxxxxx" <chris.mason@xxxxxxxxxx>
Delivery-date: Thu, 18 Nov 2010 10:53:45 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <AANLkTimPJ4y+YOL2Ed78jmCeaKnxLZb93Kuowxutu_O1@xxxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: AcuHULCxA9lDktfMTzq9B1W/yazMCwAAEOng
Thread-topic: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
 

-----Original Message-----
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx 
[mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Dante Cinco
Sent: Thursday, November 18, 2010 10:44 AM
To: Konrad Rzeszutek Wilk
Cc: Jeremy Fitzhardinge; Xen-devel; mathieu.desnoyers@xxxxxxxxxx; 
andrew.thomas@xxxxxxxxxx; keir.fraser@xxxxxxxxxxxxx; chris.mason@xxxxxxxxxx
Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops 
domU kernel with PCI passthrough

On Thu, Nov 18, 2010 at 9:19 AM, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> 
wrote:
> Keir, Dan, Mathieu, Chris, Mukesh,
>
> This fellow is passing in a PCI device to his Xen PV guest and trying 
> to get high IOPS. The kernel he is using is a 2.6.36 with tglx's 
> sparse_irq rework.
>
>> I wanted to confirm that bounce buffering was indeed occurring so I 
>> modified swiotlb.c in the kernel and added printks in the following
>> functions:
>> swiotlb_bounce
>> swiotlb_tbl_map_single
>> swiotlb_tbl_unmap_single
>> Sure enough, we were calling all three functions five times per I/O.
>> We took your suggestion and replaced pci_map_single with
>> pci_pool_alloc. The swiotlb calls were gone, but the I/O performance
>> improved only 6% (29k IOPS to 31k IOPS), which is still abysmal.
>
> Hey! 6% - that is nothing to sneeze at.

When we were using an HVM kernel (2.6.32.15+drm33.5), our IOPS was at
least 20x higher (~700k IOPS).
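
The conversion looked roughly like this - a minimal sketch with
hypothetical buffer names and sizes, not our actual driver code. A
pci_pool hands out buffers carved from DMA-able memory set up once at
init time, so the per-I/O mapping calls (and their swiotlb bounce
checks) disappear:

    #include <linux/pci.h>

    #define CMD_BUF_SIZE 512  /* hypothetical per-command buffer size */

    static struct pci_pool *cmd_pool;

    static int cmd_pool_init(struct pci_dev *pdev)
    {
            /* One-time setup: the pool dispenses CMD_BUF_SIZE buffers
             * from coherent DMA memory, so they never need bouncing. */
            cmd_pool = pci_pool_create("cmd_pool", pdev, CMD_BUF_SIZE,
                                       64 /* alignment */, 0);
            return cmd_pool ? 0 : -ENOMEM;
    }

    static void *cmd_buf_get(dma_addr_t *dma)
    {
            /* Before: kmalloc() plus pci_map_single() on every I/O,
             * each of which may bounce through swiotlb.
             * After: the buffer is already DMA-addressable. */
            return pci_pool_alloc(cmd_pool, GFP_ATOMIC, dma);
    }

    static void cmd_buf_put(void *buf, dma_addr_t dma)
    {
            pci_pool_free(cmd_pool, buf, dma);
    }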

>
>>
>> Any suggestions on where to look next? I have one question about the
>
> So since you are talking IOPS I figured you must be using fio to run 
> those numbers. And since you mentioned HVM at some point, you are not 
> running this PV domain as a back-end for another PV guest. You are 
> probably going to run some form of iSCSI target and stuff those down the PCI 
> device.

Our setup is pure Fibre Channel. We're using a physically separate system 
(Linux-based also) to initiate the SCSI I/Os.

>
> A couple of things pop into my head.. but let's first address your question.
>
>> P2M array: Does the P2M lookup occur every DMA or just during the 
>> allocation? What I'm getting at is this: Is the Xen-SWIOTLB a central
>
> It only occurs during allocation. Also, since you are bypassing the
> bounce buffer, those calls are done without any spinlock. The P2M
> lookup is just bit-shifting and division - constant-time operations -
> so O(1).
>
>> resource that could be a bottleneck?
>
> Doubt it. Your best bet to figure this out is to play with ftrace, or 
> perf trace. But I don't know how well they work with Xen nowadays - 
> Jeremy and Mathieu Desnoyers poked it a bit and I think I overheard 
> that Mathieu got it working?
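
(Side note: the P2M lookup being O(1) makes sense - it is essentially
a two-level array walk indexed by shifts/divisions on the pfn. A rough
sketch, loosely modeled on the pvops p2m code, with illustrative names;
the real code also handles missing and identity entries:)

    #include <linux/mm.h>   /* PAGE_SIZE */

    #define P2M_PER_PAGE (PAGE_SIZE / sizeof(unsigned long))

    extern unsigned long *p2m_top[];  /* hypothetical top-level table */

    static inline unsigned long pfn_to_mfn_sketch(unsigned long pfn)
    {
            unsigned long topidx = pfn / P2M_PER_PAGE;
            unsigned long idx    = pfn % P2M_PER_PAGE;

            /* Two dependent loads, no locks: constant time per lookup. */
            return p2m_top[topidx][idx];
    }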
>
> So the next couple of possibilities are:
>  1). You are hitting the spinlock issues on 'struct request' or any of
>     the paths on the I/O. Oracle did a lot of work on those - and one
>     way to find this out is to look at tracing and see where the
>     contention is. I don't know where or if those patches have been
>     posted upstream.. but as said, if you are seeing the spinlock
>     usage high - that might be it.
>  1b). Spinlocks - make sure you have CONFIG_PVOPS_SPINLOCK enabled.
>     Otherwise

I checked the config file and it is enabled: CONFIG_PARAVIRT_SPINLOCKS=y

The platform we're running has an Intel Xeon E5540 and the X58 chipset.
Here is the kernel configuration associated with the processor. Is
there anything we could tune to improve performance?

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
CONFIG_SPARSE_IRQ=y
CONFIG_NUMA_IRQ_DESC=y
CONFIG_X86_MPPARSE=y
# CONFIG_X86_EXTENDED_PLATFORM is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_PARAVIRT_GUEST=y
CONFIG_XEN=y
CONFIG_XEN_PVHVM=y
CONFIG_XEN_MAX_DOMAIN_MEMORY=8
CONFIG_XEN_SAVE_RESTORE=y
CONFIG_XEN_DEBUG_FS=y
CONFIG_KVM_CLOCK=y
CONFIG_KVM_GUEST=y
CONFIG_PARAVIRT=y
CONFIG_PARAVIRT_SPINLOCKS=y
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_NO_BOOTMEM=y
# CONFIG_MEMTEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=7
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_STATS=y
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
CONFIG_IOMMU_API=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_MCE_INJECT=y
CONFIG_X86_THERMAL_VECTOR=y
# CONFIG_I8K is not set
CONFIG_MICROCODE=y
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_COMPACTION is not set
CONFIG_MIGRATION=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
# CONFIG_KSM is not set
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
# CONFIG_MEMORY_FAILURE is not set
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW_64K=y
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_EFI=y
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_HZ_100=y
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=100
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_KEXEC_JUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y


>     you are going to hit dreadful conditions.
>  2). You are hitting the 64-bit syscall wall. Basically your user-mode
>     application (fio) is doing a write(), which used to be int 0x80
>     but now is a syscall. The syscall gets trapped in the hypervisor,
>     which has to call into your PV kernel. You get hit with two
>     context switches for each write() call. The solution is to use a
>     32-bit DomU, as there the guest user application and guest kernel
>     run in different rings.

There is no user space application that is involved with the I/O. It's all 
kernel driver code that handles the I/O.

>  3). Xen CPU pools. You didn't say where the application that sends
>     the I/Os is located. But if it was in a separate domain then you
>     will want to use Xen CPU pools. Basically this way you can get
>     gang-scheduling, where the guest that submits the I/O and the
>     guest that picks up the I/O are running right after each other. I
>     don't know much more details, but this is what I understand it
>     does.
>  4). CPU/MSI-X affinity. I think you already did this, but make sure
>     you pin your guest to specific CPUs and also pin the MSI-X
>     (vectors) to the proper destination. You can use 'xm debug-keys i'
>     to see the MSI-X affinity - it is a mask - and basically see if it
>     overlays the CPUs you are running your guest at. Not sure how to
>     actually set the MSI-X affinity... now that I think about it.
>     Keir or some of the Intel folks might know better.

There are 16 devices (multi-function) that are PCI-passed through to
domU. There are 16 VCPUs in domU, all pinned to individual PCPUs
(24-CPU platform). Each IRQ in domU is affinitized to a CPU. This
strategy has worked well for us with the HVM kernel. Here's the output
of 'xm debug-keys i':
(XEN)    IRQ:  67 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:7a
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:127(----),
(XEN)    IRQ:  68 affinity:00000000,00000000,00000000,00000200 vec:43
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:126(----),
(XEN)    IRQ:  69 affinity:00000000,00000000,00000000,00000400 vec:83
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:125(----),
(XEN)    IRQ:  70 affinity:00000000,00000000,00000000,00000800 vec:4b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:124(----),
(XEN)    IRQ:  71 affinity:00000000,00000000,00000000,00001000 vec:8b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:123(----),
(XEN)    IRQ:  72 affinity:00000000,00000000,00000000,00002000 vec:53
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:122(----),
(XEN)    IRQ:  73 affinity:00000000,00000000,00000000,00004000 vec:93
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:121(----),
(XEN)    IRQ:  74 affinity:00000000,00000000,00000000,00008000 vec:5b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:120(----),
(XEN)    IRQ:  75 affinity:00000000,00000000,00000000,00010000 vec:9b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:119(----),
(XEN)    IRQ:  76 affinity:00000000,00000000,00000000,00020000 vec:63
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:118(----),
(XEN)    IRQ:  77 affinity:00000000,00000000,00000000,00040000 vec:a3
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:117(----),
(XEN)    IRQ:  78 affinity:00000000,00000000,00000000,00080000 vec:6b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:116(----),
(XEN)    IRQ:  79 affinity:00000000,00000000,00000000,00100000 vec:ab
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:115(----),
(XEN)    IRQ:  80 affinity:00000000,00000000,00000000,00200000 vec:73
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:114(----),
(XEN)    IRQ:  81 affinity:00000000,00000000,00000000,00400000 vec:b3
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:113(----),
(XEN)    IRQ:  82 affinity:00000000,00000000,00000000,00800000 vec:7b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:112(----),
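
(Decoding the masks: the affinity field is a CPU bitmask, so IRQ 68's
value 00000000,00000000,00000000,00000200 has only bit 9 set, i.e. that
vector is delivered to PCPU 9. IRQ 67's all-ffffffff mask means it has
not been restricted to any particular CPU.)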

>  5). Andrew, Mukesh, Keir, Dan, any other ideas?
>

We're also trying the four things Chris suggested and will consider
Mathieu's LTT suggestion.

- Dante

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
