>-----Original Message-----
>From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
>[mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Jan Beulich
>Sent: Monday, July 12, 2010 5:55 PM
>To: Konrad Rzeszutek Wilk
>Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
>Subject: Re: [Xen-devel] granting access to MSI-X table and pending bit array
>
>>>> On 07.07.10 at 16:14, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote:
>> On Wed, Jul 07, 2010 at 11:14:04AM +0100, Jan Beulich wrote:
>>> The original implementation (c/s 17536) disallowed access to these
>>> after granting access to all BAR specified resources (i.e. this was
>>> almost correct, except for a small time window during which the
>>> memory was accessible to the guest and except for hiding the
>>> pending bit array from the guest), but this got reverted with c/s
>>> 20171.
>>>
>>> Afaics this is a security problem, as CPU accesses to the granted
>>> memory don't go through any IOMMU and hence there's no place
>>> these could be filtered out even in a supposedly secure environment
>(not that I think device accesses would be filtered at present, but
>for those this would at least be possible), and such accesses could
>>> inadvertently or maliciously unmask masked vectors or modify the
>>> message address/data fields.
>>>
>>> Imo the guest must be granted read-only access to the pending bit
>>> array (instead of either full access or no access at all), with
>>> the potential side effect of also granting read-only access to the
>>> table. And I would even think that this shouldn't be done in the
>>> tools, but rather in Xen itself (since it knows of all the PCI devices
>>> and their respective eventual MSI-X address ranges), thus at once
>>> eliminating any timing windows.
>>
>> That sounds sensible. You got a patch ready?
>
>Below/attached is an untested and possibly not yet complete one
>(can't really test this myself as I don't have, with one exception, any
>MSI-X capable devices around, and the exceptional one doesn't have
>a driver making use of it).
>
>It tracks MMIO MFNs required to only have read-only guest access
>(determined when the first MSI-X interrupt gets enabled on a device)
>in a global range set, and enforces the write protection as
>translations get established.
>
>The changes are made under the assumption that p2m_mmio_direct will
>only ever be used for order 0 pages.
>
>An open question is whether dealing with pv guests (including the
>IOMMU-less case) is necessary, as handling mappings a domain may
>already have in place at the time the first interrupt gets set up
>would require scanning all of the guest's page table pages.
>
>An alternative would be to determine and insert the address ranges
>earlier into mmio_ro_ranges, but that would require a hook in the
>PCI config space writes, which is particularly problematic in case
>MMCONFIG accesses are being used.
>
>A second alternative would be to require Dom0 to report all devices
>(or at least all MSI-X capable ones) regardless of whether they would
>be used by that domain, and do so after resources got determined/
>assigned for them (i.e. a second notification later than the one
>currently happening from the PCI bus scan would be needed).
>
>Jan
>
>--- 2010-06-15.orig/xen/arch/x86/mm.c 2010-06-15 12:24:43.000000000 +0200
>+++ 2010-06-15/xen/arch/x86/mm.c 2010-07-09 16:25:57.000000000 +0200
>@@ -823,7 +823,13 @@ get_page_from_l1e(
> return 0;
> }
>
>- return 1;
>+ if ( !(l1f & _PAGE_RW) || IS_PRIV(pg_owner) ||
>+ !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
>+ return 1;
>+ dprintk(XENLOG_G_WARNING,
>+ "d%d: Forcing read-only access to MFN %lx\n",
>+ l1e_owner->domain_id, mfn);
>+ return -1;
> }
>
> if ( unlikely(real_pg_owner != pg_owner) )
>@@ -1179,9 +1185,15 @@ static int alloc_l1_table(struct page_in
>
> for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
> {
>- if ( is_guest_l1_slot(i) &&
>- unlikely(!get_page_from_l1e(pl1e[i], d, d)) )
>- goto fail;
>+ if ( is_guest_l1_slot(i) )
>+ switch ( get_page_from_l1e(pl1e[i], d, d) )
>+ {
>+ case 0:
>+ goto fail;
>+ case -1:
>+ l1e_remove_flags(pl1e[i], _PAGE_RW);
>+ break;
>+ }
>
> adjust_guest_l1e(pl1e[i], d);
> }
>@@ -1764,8 +1776,14 @@ static int mod_l1_entry(l1_pgentry_t *pl
> return rc;
> }
>
>- if ( unlikely(!get_page_from_l1e(nl1e, pt_dom, pg_dom)) )
>+ switch ( get_page_from_l1e(nl1e, pt_dom, pg_dom) )
>+ {
>+ case 0:
> return 0;
>+ case -1:
>+ l1e_remove_flags(nl1e, _PAGE_RW);
>+ break;
>+ }
>
> adjust_guest_l1e(nl1e, pt_dom);
> if ( unlikely(!UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, pt_vcpu,
>@@ -4979,8 +4997,9 @@ static int ptwr_emulated_update(
>
> /* Check the new PTE. */
> nl1e = l1e_from_intpte(val);
>- if ( unlikely(!get_page_from_l1e(nl1e, d, d)) )
>+ switch ( get_page_from_l1e(nl1e, d, d) )
> {
>+ case 0:
> if ( is_pv_32bit_domain(d) && (bytes == 4) && (unaligned_addr & 4) &&
> !do_cmpxchg && (l1e_get_flags(nl1e) & _PAGE_PRESENT) )
> {
>@@ -4999,6 +5018,10 @@ static int ptwr_emulated_update(
> MEM_LOG("ptwr_emulate: could not get_page_from_l1e()");
> return X86EMUL_UNHANDLEABLE;
> }
>+ break;
>+ case -1:
>+ l1e_remove_flags(nl1e, _PAGE_RW);
>+ break;
> }
>
> adjust_guest_l1e(nl1e, d);
>--- 2010-06-15.orig/xen/arch/x86/mm/hap/p2m-ept.c 2010-06-14 08:49:36.000000000 +0200
>+++ 2010-06-15/xen/arch/x86/mm/hap/p2m-ept.c 2010-07-12 08:47:12.000000000 +0200
>@@ -72,9 +72,13 @@ static void ept_p2m_type_to_flags(ept_en
> entry->r = entry->w = entry->x = 0;
> return;
> case p2m_ram_rw:
>- case p2m_mmio_direct:
> entry->r = entry->w = entry->x = 1;
> return;
>+ case p2m_mmio_direct:
>+ entry->r = entry->x = 1;
>+ entry->w = !rangeset_contains_singleton(mmio_ro_ranges,
>+ entry->mfn);
>+ return;
> case p2m_ram_logdirty:
> case p2m_ram_ro:
> case p2m_ram_shared:
>@@ -675,6 +679,9 @@ static void ept_change_entry_type_global
> if ( ept_get_asr(d) == 0 )
> return;
>
>+ BUG_ON(p2m_is_grant(ot) || p2m_is_grant(nt));
>+ BUG_ON(ot != nt && (ot == p2m_mmio_direct || nt == p2m_mmio_direct));
>+
> ept_change_entry_type_page(_mfn(ept_get_asr(d)), ept_get_wl(d), ot, nt);
>
> ept_sync_domain(d);
>--- 2010-06-15.orig/xen/arch/x86/mm/p2m.c 2010-06-11 11:41:35.000000000 +0200
>+++ 2010-06-15/xen/arch/x86/mm/p2m.c 2010-07-12 08:48:08.000000000 +0200
>@@ -72,7 +72,7 @@ boolean_param("hap_1gb", opt_hap_1gb);
> #define SUPERPAGE_PAGES (1UL << 9)
> #define superpage_aligned(_x) (((_x)&(SUPERPAGE_PAGES-1))==0)
>
>-static unsigned long p2m_type_to_flags(p2m_type_t t)
>+static unsigned long p2m_type_to_flags(p2m_type_t t, mfn_t mfn)
> {
> unsigned long flags;
> #ifdef __x86_64__
>@@ -101,7 +101,9 @@ static unsigned long p2m_type_to_flags(p
> case p2m_mmio_dm:
> return flags;
> case p2m_mmio_direct:
>- return flags | P2M_BASE_FLAGS | _PAGE_RW | _PAGE_PCD;
>+ if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
>+ flags |= _PAGE_RW;
>+ return flags | P2M_BASE_FLAGS | _PAGE_PCD;
> case p2m_populate_on_demand:
> return flags;
> }
>@@ -1301,8 +1303,10 @@ p2m_set_entry(struct domain *d, unsigned
> domain_crash(d);
> goto out;
> }
>+ ASSERT(!mfn_valid(mfn) || p2mt != p2m_mmio_direct);
> l3e_content = mfn_valid(mfn)
>- ? l3e_from_pfn(mfn_x(mfn), p2m_type_to_flags(p2mt) | _PAGE_PSE)
>+ ? l3e_from_pfn(mfn_x(mfn),
>+ p2m_type_to_flags(p2mt, mfn) | _PAGE_PSE)
> : l3e_empty();
> entry_content.l1 = l3e_content.l3;
> paging_write_p2m_entry(d, gfn, p2m_entry, table_mfn, entry_content, 3);
>@@ -1335,7 +1339,8 @@ p2m_set_entry(struct domain *d, unsigned
> ASSERT(p2m_entry);
>
> if ( mfn_valid(mfn) || (p2mt == p2m_mmio_direct) )
>- entry_content = l1e_from_pfn(mfn_x(mfn), p2m_type_to_flags(p2mt));
>+ entry_content = l1e_from_pfn(mfn_x(mfn),
>+ p2m_type_to_flags(p2mt, mfn));
> else
> entry_content = l1e_empty();
>
>@@ -1358,9 +1363,11 @@ p2m_set_entry(struct domain *d, unsigned
> goto out;
> }
>
>+ ASSERT(!mfn_valid(mfn) || p2mt != p2m_mmio_direct);
> if ( mfn_valid(mfn) || p2m_is_magic(p2mt) )
> l2e_content = l2e_from_pfn(mfn_x(mfn),
>- p2m_type_to_flags(p2mt) | _PAGE_PSE);
>+ p2m_type_to_flags(p2mt, mfn) |
>+ _PAGE_PSE);
> else
> l2e_content = l2e_empty();
>
>@@ -2424,6 +2431,7 @@ void p2m_change_type_global(struct domai
> struct p2m_domain *p2m = p2m_get_hostp2m(d);
>
> BUG_ON(p2m_is_grant(ot) || p2m_is_grant(nt));
>+ BUG_ON(ot != nt && (ot == p2m_mmio_direct || nt == p2m_mmio_direct));
>
> if ( !paging_mode_translate(d) )
> return;
>@@ -2465,7 +2473,7 @@ void p2m_change_type_global(struct domai
> continue;
> mfn = l3e_get_pfn(l3e[i3]);
> gfn = get_gpfn_from_mfn(mfn);
>- flags = p2m_type_to_flags(nt);
>+ flags = p2m_type_to_flags(nt, _mfn(mfn));
> l1e_content = l1e_from_pfn(mfn, flags | _PAGE_PSE);
> paging_write_p2m_entry(d, gfn, (l1_pgentry_t *)&l3e[i3],
> l3mfn, l1e_content, 3);
>@@ -2495,7 +2503,7 @@ void p2m_change_type_global(struct domai
> #endif
> )
> * L2_PAGETABLE_ENTRIES) * L1_PAGETABLE_ENTRIES;
>- flags = p2m_type_to_flags(nt);
>+ flags = p2m_type_to_flags(nt, _mfn(mfn));
> l1e_content = l1e_from_pfn(mfn, flags | _PAGE_PSE);
> paging_write_p2m_entry(d, gfn, (l1_pgentry_t *)&l2e[i2],
> l2mfn, l1e_content, 2);
>@@ -2518,7 +2526,7 @@ void p2m_change_type_global(struct domai
> )
> * L2_PAGETABLE_ENTRIES) * L1_PAGETABLE_ENTRIES;
> /* create a new 1le entry with the new type */
>- flags = p2m_type_to_flags(nt);
>+ flags = p2m_type_to_flags(nt, _mfn(mfn));
> l1e_content = l1e_from_pfn(mfn, flags);
> paging_write_p2m_entry(d, gfn, &l1e[i1],
> l1mfn, l1e_content, 1);
>--- 2010-06-15.orig/xen/arch/x86/mm/shadow/multi.c 2010-05-28 13:59:16.000000000 +0200
>+++ 2010-06-15/xen/arch/x86/mm/shadow/multi.c 2010-07-09 17:11:07.000000000 +0200
>@@ -653,7 +653,9 @@ _sh_propagate(struct vcpu *v,
> }
>
> /* Read-only memory */
>- if ( p2m_is_readonly(p2mt) )
>+ if ( p2m_is_readonly(p2mt) ||
>+ (p2mt == p2m_mmio_direct &&
>+ rangeset_contains_singleton(mmio_ro_ranges, mfn_x(target_mfn))) )
> sflags &= ~_PAGE_RW;
>
> // protect guest page tables
>@@ -1204,15 +1206,19 @@ static int shadow_set_l1e(struct vcpu *v
> /* About to install a new reference */
> if ( shadow_mode_refcounts(d) ) {
>
>TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_SHADOW_L1_GET_REF);
>- if ( shadow_get_page_from_l1e(new_sl1e, d, new_type) == 0 )
>+ switch ( shadow_get_page_from_l1e(new_sl1e, d, new_type) )
> {
>+ case 0:
> /* Doesn't look like a pagetable. */
> flags |= SHADOW_SET_ERROR;
> new_sl1e = shadow_l1e_empty();
>- }
>- else
>- {
>+ break;
>+ case -1:
>+ shadow_l1e_remove_flags(new_sl1e, _PAGE_RW);
>+ /* fall through */
>+ default:
> shadow_vram_get_l1e(new_sl1e, sl1e, sl1mfn, d);
>+ break;
> }
> }
> }
>--- 2010-06-15.orig/xen/arch/x86/msi.c 2010-07-09 15:19:02.000000000 +0200
>+++ 2010-06-15/xen/arch/x86/msi.c 2010-07-12 09:23:01.000000000 +0200
>@@ -16,12 +16,14 @@
> #include <xen/errno.h>
> #include <xen/pci.h>
> #include <xen/pci_regs.h>
>+#include <xen/iocap.h>
> #include <xen/keyhandler.h>
> #include <asm/io.h>
> #include <asm/smp.h>
> #include <asm/desc.h>
> #include <asm/msi.h>
> #include <asm/fixmap.h>
>+#include <asm/p2m.h>
> #include <mach_apic.h>
> #include <io_ports.h>
> #include <public/physdev.h>
>@@ -520,6 +522,43 @@ static int msi_capability_init(struct pc
> return 0;
> }
>
>+static u64 read_pci_mem_bar(u8 bus, u8 slot, u8 func, u8 bir)
>+{
>+ u8 limit;
>+ u32 addr;
>+
>+ switch ( pci_conf_read8(bus, slot, func, PCI_HEADER_TYPE) )
>+ {
>+ case PCI_HEADER_TYPE_NORMAL:
>+ limit = 6;
>+ break;
>+ case PCI_HEADER_TYPE_BRIDGE:
>+ limit = 2;
>+ break;
>+ case PCI_HEADER_TYPE_CARDBUS:
>+ limit = 1;
>+ break;
>+ default:
>+ return 0;
>+ }
>+
>+ if ( bir >= limit )
>+ return 0;
>+ addr = pci_conf_read32(bus, slot, func, PCI_BASE_ADDRESS_0 + bir * 4);
>+ if ( (addr & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
>+ return 0;
>+ if ( (addr & PCI_BASE_ADDRESS_MEM_TYPE_MASK) == PCI_BASE_ADDRESS_MEM_TYPE_64 )
>+ {
>+ addr &= ~PCI_BASE_ADDRESS_MEM_MASK;
>+ if ( ++bir >= limit )
>+ return 0;
>+ return addr |
>+ ((u64)pci_conf_read32(bus, slot, func,
>+ PCI_BASE_ADDRESS_0 + bir * 4) << 32);
>+ }
>+ return addr & ~PCI_BASE_ADDRESS_MEM_MASK;
>+}
>+
> /**
> * msix_capability_init - configure device's MSI-X capability
> * @dev: pointer to the pci_dev data structure of MSI-X device function
>@@ -532,7 +571,8 @@ static int msi_capability_init(struct pc
> **/
> static int msix_capability_init(struct pci_dev *dev,
> struct msi_info *msi,
>- struct msi_desc **desc)
>+ struct msi_desc **desc,
>+ unsigned int nr_entries)
> {
> struct msi_desc *entry;
> int pos;
>@@ -587,6 +627,70 @@ static int msix_capability_init(struct p
>
> list_add_tail(&entry->list, &dev->msi_list);
>
>+ if ( !dev->msix_nr_entries )
>+ {
>+ u64 pba_paddr;
>+ u32 pba_offset;
>+ unsigned long pfn;
>+
>+ ASSERT(!dev->msix_used_entries);
>+ WARN_ON(msi->table_base != read_pci_mem_bar(bus, slot, func, bir));
>+
>+ dev->msix_nr_entries = nr_entries;
>+ dev->msix_table.first = pfn = PFN_DOWN(table_paddr);
>+ dev->msix_table.last = PFN_DOWN(table_paddr +
>+ nr_entries * PCI_MSIX_ENTRY_SIZE - 1);
>+ for ( ; pfn <= dev->msix_table.last; ++pfn )
>+ WARN_ON(rangeset_contains_singleton(mmio_ro_ranges, pfn));
>+ if ( rangeset_add_range(mmio_ro_ranges, dev->msix_table.first,
>+ dev->msix_table.last) )
>+ WARN();
>+
>+ pba_offset = pci_conf_read32(bus, slot, func,
>+ msix_pba_offset_reg(pos));
>+ bir = (u8)(pba_offset & PCI_MSIX_BIRMASK);
>+ pba_paddr = read_pci_mem_bar(bus, slot, func, bir);
>+ WARN_ON(!pba_paddr);
>+ pba_paddr += pba_offset & ~PCI_MSIX_BIRMASK;
>+
>+ dev->msix_pba.first = PFN_DOWN(pba_paddr);
>+ dev->msix_pba.last = PFN_DOWN(pba_paddr +
>+ BITS_TO_LONGS(nr_entries) - 1);
>+ for ( pfn = dev->msix_pba.first; pfn <= dev->msix_pba.last; ++pfn )
>+ if ( pfn < dev->msix_table.first || pfn > dev->msix_table.last )
>+ WARN_ON(rangeset_contains_singleton(mmio_ro_ranges, pfn));
>+ if ( rangeset_add_range(mmio_ro_ranges, dev->msix_pba.first,
>+ dev->msix_pba.last) )
>+ WARN();
>+printk("MSIX%02x:%02x.%x: table@(%lx,%lx), pba@(%lx,%lx)\n", bus, slot, func,
>+ dev->msix_table.first, dev->msix_table.last,
>+ dev->msix_pba.first, dev->msix_pba.last);//temp
>+
>+ if ( dev->domain )
>+ p2m_change_entry_type_global(dev->domain, p2m_mmio_direct,
>+ p2m_mmio_direct);
>+ if ( !dev->domain || !paging_mode_translate(dev->domain) )
>+ {
>+ struct domain *d = dev->domain;
>+
>+ if ( !d )
>+ for_each_domain(d)
>+ if ( !paging_mode_translate(d) &&
>+ (iomem_access_permitted(d, dev->msix_table.first,
>+ dev->msix_table.last) ||
>+ iomem_access_permitted(d, dev->msix_pba.first,
>+ dev->msix_pba.last)) )
>+ break;
>+ if ( d )
>+ {
>+ /* XXX How to deal with existing mappings? */
>+ }
>+ }
>+ }
>+ WARN_ON(dev->msix_nr_entries != nr_entries);
>+ WARN_ON(dev->msix_table.first != (table_paddr >> PAGE_SHIFT));
>+ ++dev->msix_used_entries;
>+
> /* Mask interrupt here */
> writel(1, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
>
>@@ -707,7 +811,7 @@ static int __pci_enable_msix(struct msi_
> return 0;
> }
>
>- status = msix_capability_init(pdev, msi, desc);
>+ status = msix_capability_init(pdev, msi, desc, nr_entries);
> return status;
> }
>
>@@ -732,6 +836,16 @@ static void __pci_disable_msix(struct ms
> writel(1, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
>
> pci_conf_write16(bus, slot, func, msix_control_reg(pos), control);
>+
>+ if ( !--dev->msix_used_entries )
>+ {
>+ if ( rangeset_remove_range(mmio_ro_ranges, dev->msix_table.first,
>+ dev->msix_table.last) )
>+ WARN();
>+ if ( rangeset_remove_range(mmio_ro_ranges, dev->msix_pba.first,
>+ dev->msix_pba.last) )
>+ WARN();
>+ }
> }
>
> /*
>--- 2010-06-15.orig/xen/drivers/passthrough/io.c 2010-04-22 14:43:25.000000000 +0200
>+++ 2010-06-15/xen/drivers/passthrough/io.c 2010-07-09 16:02:23.000000000 +0200
>@@ -26,6 +26,8 @@
> #include <xen/hvm/irq.h>
> #include <xen/tasklet.h>
>
>+struct rangeset *__read_mostly mmio_ro_ranges;
>+
> static void hvm_dirq_assist(unsigned long _d);
>
> bool_t pt_irq_need_timer(uint32_t flags)
>@@ -565,3 +567,11 @@ void hvm_dpci_eoi(struct domain *d, unsi
> unlock:
> spin_unlock(&d->event_lock);
> }
>+
>+static int __init setup_mmio_ro_ranges(void)
>+{
>+ mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
>+ RANGESETF_prettyprint_hex);
>+ return 0;
>+}
>+__initcall(setup_mmio_ro_ranges);
>--- 2010-06-15.orig/xen/include/xen/iommu.h 2010-06-01 13:39:57.000000000 +0200
>+++ 2010-06-15/xen/include/xen/iommu.h 2010-07-09 15:55:52.000000000 +0200
>@@ -31,6 +31,8 @@ extern bool_t force_iommu, iommu_verbose
> extern bool_t iommu_workaround_bios_bug, iommu_passthrough;
> extern bool_t iommu_snoop, iommu_qinval, iommu_intremap;
>
>+extern struct rangeset *mmio_ro_ranges;
>+
> #define domain_hvm_iommu(d) (&d->arch.hvm_domain.hvm_iommu)
>
> #define MAX_IOMMUS 32
>--- 2010-06-15.orig/xen/include/xen/pci.h 2010-01-21 15:36:53.000000000 +0100
>+++ 2010-06-15/xen/include/xen/pci.h 2010-07-09 15:49:32.000000000 +0200
>@@ -45,6 +45,10 @@ struct pci_dev {
> struct list_head domain_list;
>
> struct list_head msi_list;
>+ unsigned int msix_nr_entries, msix_used_entries;
>+ struct {
>+ unsigned long first, last;
>+ } msix_table, msix_pba;
> int msix_table_refcnt[MAX_MSIX_TABLE_PAGES];
> int msix_table_idx[MAX_MSIX_TABLE_PAGES];
> spinlock_t msix_table_lock;
Will the small window for qemu still exist? QEMU may set up the mapping
before msix_capability_init() runs.
BTW, I'm not sure that mapping the pages read-only solves the sharing
issue (i.e. the PBA or table entries sharing a page with other resources).
Thanks
--jyh
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel