Sorry, patch code style cleaned up and rebased to the latest tip.
---
VMSI: This patch simulates the MSIx table read operation.
Signed-off-by: Liu Yuan <yuan.b.liu@xxxxxxxxx>
Signed-off-by: Eddie Dong <eddie.dong@xxxxxxxxx>
diff -r 38aee6139719 xen/arch/x86/hvm/vmsi.c
--- a/xen/arch/x86/hvm/vmsi.c	Tue Aug 03 21:03:09 2010 +0100
+++ b/xen/arch/x86/hvm/vmsi.c	Wed Aug 04 17:01:23 2010 +0800
@@ -159,7 +159,10 @@
     unsigned long gtable;       /* gpa of msix table */
     unsigned long table_len;
     unsigned long table_flags[MAX_MSIX_TABLE_ENTRIES / BITS_PER_LONG + 1];
-
+#define MAX_MSIX_ACC_ENTRIES 3
+    struct {
+        uint32_t msi_ad[3];     /* Shadow of address low, high and data */
+    } gentries[MAX_MSIX_ACC_ENTRIES];
 
     struct rcu_head rcu;
 };
@@ -205,9 +208,10 @@
     struct vcpu *v, unsigned long address,
     unsigned long len, unsigned long *pval)
 {
-    unsigned long offset;
+    unsigned long offset, val;
     struct msixtbl_entry *entry;
     void *virt;
+    int nr_entry, index;
     int r = X86EMUL_UNHANDLEABLE;
 
     rcu_read_lock(&msixtbl_rcu_lock);
@@ -215,18 +219,29 @@
     if ( len != 4 )
         goto out;
 
-    offset = address & (PCI_MSIX_ENTRY_SIZE - 1);
-    if ( offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET)
-        goto out;
-
     entry = msixtbl_find_entry(v, address);
     virt = msixtbl_addr_to_virt(entry, address);
     if ( !virt )
         goto out;
 
-    *pval = readl(virt);
+    nr_entry = (address - entry->gtable) / PCI_MSIX_ENTRY_SIZE;
+    offset = address & (PCI_MSIX_ENTRY_SIZE - 1);
+    if ( nr_entry >= MAX_MSIX_ACC_ENTRIES &&
+         offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET )
+        goto out;
+
+    val = readl(virt);
+    if ( offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET )
+    {
+        index = offset / sizeof(uint32_t);
+        *pval = entry->gentries[nr_entry].msi_ad[index];
+    }
+    else
+    {
+        *pval = val;
+    }
+
     r = X86EMUL_OKAY;
-
 out:
     rcu_read_unlock(&msixtbl_rcu_lock);
     return r;
@@ -238,7 +253,7 @@
     unsigned long offset;
     struct msixtbl_entry *entry;
     void *virt;
-    int nr_entry;
+    int nr_entry, index;
     int r = X86EMUL_UNHANDLEABLE;
 
     rcu_read_lock(&msixtbl_rcu_lock);
@@ -252,6 +267,11 @@
     offset = address & (PCI_MSIX_ENTRY_SIZE - 1);
     if ( offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET)
     {
+        if ( nr_entry < MAX_MSIX_ACC_ENTRIES )
+        {
+            index = offset / sizeof(uint32_t);
+            entry->gentries[nr_entry].msi_ad[index] = val;
+        }
         set_bit(nr_entry, &entry->table_flags);
         goto out;
     }
From: Liu, Yuan B
Sent: Wednesday, August 04, 2010 10:35 AM
To: 'xen-devel@xxxxxxxxxxxxxxxxxxx'
Cc: Dong, Eddie
Subject: [PATCH] Simulates the MSIx table read operation
Hi,
This patch simulates the MSI-X table read operation to avoid the read traffic generated by the guest Linux kernel when many guests run a high-interrupt-rate workload. (We tested 24 guests running iperf with a 10Gb workload.)
[Background]
The assumptions an OS makes about its underlying hardware do not always hold when it runs in a virtual machine. This is particularly visible with CPU virtualization: a VCPU can be scheduled out, whereas a physical CPU never is. That breaks corner cases in OS code designed around the physical-CPU assumption. We have already seen the _lock-holder preemption_ case; this SR-IOV issue is yet another one.
[Issue]
The generic Linux IRQ logic for edge-triggered interrupts is written such that, around the EOI write, a high interrupt rate makes the guest busily mask and unmask the interrupt whenever the previous interrupt has not been handled yet (for example because the guest is scheduled out). Each mask/unmask operation is followed by a read that flushes the preceding PCI transactions, to make sure the write has taken effect. Xen does not handle this corner case: it only intercepts the guests' mask/unmask operations and forwards the other requests (table reads/writes) to qemu.
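To make the traffic pattern concrete, below is a minimal sketch of what each guest-side mask/unmask looks like. It is not the actual Linux code; the function name and the MASKBIT constant are illustrative, and only the vector-control offset reuses the naming from the patch. The point is the final read-back, which is the MMIO read that traps on every mask/unmask.

    #include <stdint.h>

    #define PCI_MSIX_ENTRY_SIZE                16
    #define PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET  12
    #define MSIX_VECTOR_CTRL_MASKBIT           0x1   /* illustrative name */

    /* Illustrative guest-side mask/unmask of one MSI-X vector.  The write
     * to the vector-control dword is posted, so the driver reads the
     * register back to be sure the mask change has reached the device --
     * that read-back is the MMIO read which traps to Xen/qemu on every
     * mask/unmask under a high interrupt rate. */
    static void guest_msix_set_mask(volatile uint32_t *entry_base, int masked)
    {
        volatile uint32_t *ctrl =
            entry_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET / sizeof(uint32_t);
        uint32_t val = *ctrl;            /* readl() of vector control      */

        if ( masked )
            val |= MSIX_VECTOR_CTRL_MASKBIT;
        else
            val &= ~MSIX_VECTOR_CTRL_MASKBIT;

        *ctrl = val;                     /* writel(), posted               */
        val = *ctrl;                     /* readl() to flush the write     */
        (void)val;
    }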
This special case does not show up under light workloads, but with many guests (e.g. 24) it drives the Dom0 CPU utilization up to 140% (roughly proportional to the number of guests), which clearly limits the scalability and performance of the virtualization stack.
[Effect]
This patch emulates the read operation inside Xen. Testing showed that the abnormal MMIO read operations are eliminated completely while iperf runs a heavy workload, and the Dom0 CPU utilization dropped to 60% in my test.
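For reference, here is a small worked example of the address arithmetic the new read path relies on; the table base and read address are hypothetical, and the field names are the ones introduced by the patch.

    #include <stdint.h>
    #include <stdio.h>

    #define PCI_MSIX_ENTRY_SIZE                16
    #define PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET  12

    int main(void)
    {
        /* Hypothetical guest-physical MSI-X table base and trapped read address. */
        uint64_t gtable  = 0xfebf1000ULL;   /* entry->gtable                    */
        uint64_t address = 0xfebf1024ULL;   /* guest reads entry 2, dword 1     */

        unsigned int nr_entry = (address - gtable) / PCI_MSIX_ENTRY_SIZE;  /* 2 */
        unsigned int offset   = address & (PCI_MSIX_ENTRY_SIZE - 1);       /* 4 */
        unsigned int index    = offset / sizeof(uint32_t);                 /* 1 */

        /* offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET, so this read is served
         * from the shadow copy entry->gentries[2].msi_ad[1] (address high)
         * captured on the guest's last write; only the vector-control dword
         * (offset 12) is still fetched from the real mapping via readl(). */
        printf("entry %u, offset %u, msi_ad index %u\n", nr_entry, offset, index);
        return 0;
    }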
Thanks,
Yuan