This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
Home Products Support Community News


RE: [PATCH, RFC, resend] Re: [Xen-devel] granting access to MSI-X table

To: Jan Beulich <JBeulich@xxxxxxxxxx>
Subject: RE: [PATCH, RFC, resend] Re: [Xen-devel] granting access to MSI-X table and pending bit array
From: "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
Date: Thu, 26 Aug 2010 16:41:41 +0800
Accept-language: en-US
Acceptlanguage: en-US
Cc: "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>, Konrad
Delivery-date: Thu, 26 Aug 2010 01:42:53 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <4C762EA6020000780001239A@xxxxxxxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <4C6566B2020000780000FBD9@xxxxxxxxxxxxxxxxxx> <789F9655DD1B8F43B48D77C5D30659732A128E9C@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <4C762EA6020000780001239A@xxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: ActE7ULIOgPEFaLoR5i0r0lhH94UCwAB/7lg
Thread-topic: [PATCH, RFC, resend] Re: [Xen-devel] granting access to MSI-X table and pending bit array

>-----Original Message-----
>From: Jan Beulich [mailto:JBeulich@xxxxxxxxxx]
>Sent: Thursday, August 26, 2010 3:07 PM
>To: Jiang, Yunhong
>Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; Konrad Rzeszutek Wilk
>Subject: RE: [PATCH, RFC, resend] Re: [Xen-devel] granting access to MSI-X 
>table and
>pending bit array
>>>> On 26.08.10 at 08:24, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx> wrote:
>>>From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
>>>[mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Jan Beulich
>>>An alternative would be to determine and insert the address ranges
>>>earlier into mmio_ro_ranges, but that would require a hook in the
>>>PCI config space writes, which is particularly problematic in case
>>>MMCONFIG accesses are being used.
>> I noticed you stated in your previous mail that this should be done in
>> hypervisor, not tools. Is it because tools is not trusted by xen hypervisor?
>> If tools can be trusted, is it possible to achieve this in tools: Tools tell
>> xen hypervisor the MMIO range that is read-only to this guest after the guest
>> is created, but before the domain is unpaused.
>Doing this from the tools would have the unfortunate side effect that
>Dom0 (or the associated stubdom) would continue to need special
>casing, which really it shouldn't for this purpose (and all the special
>casing in the patch is really just because qemu wants to map writably
>the ranges in question, and this can only be removed after the
>hypervisor handling changed).

Agree that doing this from tools can't remove dom0's write access.
Maybe we can combine this together. Tools for normal guest, while your change 
to msix_capability_init() is for dom0, the flow is followed:
1) When xen boot, xen will scan all PCI hierarchy, and get the MSI-X address. 
Mark that range to dom0's mmio_ro_range. (Maybe a seperetaed list is needed to 
track which MMIO range to which device).
2) When a MSI-x interrupt is started, we check the corresponding BAR to see if 
the range is changed by dom0. If yes, update dom0's mmio_ro_range.  We can also 
check if the assigned domain's mmio_ro_range cover this.
3) When tools create domain, tools will remove the mmio range for the guest.

A bit over-complicated?

>Another aspect is that of necessity of denying write access to these
>ranges: The guest must be prevented writing to them only once at

Although preventing writing access is only needed when MSI-x interrupt enabled, 
but as your patch stated, we need consider how to handle mapping setup already 
before the MSI-x enabled (In fact, we are working on removing setup-already 
mapping for MCA purpose, although not sure if it is accepetable by upstream). 
Removing the access from beginning will avoid such clean-already-mapped 

>least one MSI-X interrupt got enabled. Plus (while not the case with
>the current Linux implementation) a (non-Linux or future Linux)
>version may choose to (re-)assign device resources only when the
>device gets enabled, which would be after the guest was already

I'm a bit confused. Are you talking about guest re-assign device resources or 
If you are talking about guest, I think that's easy to handle , and we should 
anyway do that. Especially it should not impact physical resource.
If you are talking aboud dom0, I can't think out while guest enable device will 
cause re-assignment in dom0.

>>>A second alternative would be to require Dom0 to report all devices
>>>(or at least all MSI-X capable ones) regardless of whether they would
>>>be used by that domain, and do so after resources got determined/
>>>assigned for them (i.e. a second notification later than the one
>>>currently happening from the PCI bus scan would be needed).
>> Currently Xen knows about the PCI device's resource assignment already when
>> system boot, since Xen have PCI information. The only situations that Xen may
>> have no idea includes: a) Dom0 re-assign the device resource, may because of
>> resource balance; b) VF device for SR-IOV.
>> I think for situation a, IIRC, xen hypervisor can't handle it, because that
>> means all shadow need be rebuild, the MSI information need be updated etc.
>Shadows and p2m table need updating in this case, but MSI information
>doesn't afaict (it gets proagated only when the first interrupt is being
>set up).

Maybe I didn't state my point clearly. What I mean is, since xen hypervisor 
knows about PCI hierarchy, they can setup the mmio_ro_range when xen boot. The 
only concerns for this method is situation a and situation b, which will update 
the PCI device's resource assignment, while xen hypervisor have no idea and 
can't update mmio_ro_range.

>>>@@ -653,7 +653,9 @@ _sh_propagate(struct vcpu *v,
>>>     }
>>>     /* Read-only memory */
>>>-    if ( p2m_is_readonly(p2mt) )
>>>+    if ( p2m_is_readonly(p2mt) ||
>>>+         (p2mt == p2m_mmio_direct &&
>>>+          rangeset_contains_singleton(mmio_ro_ranges,
>mfn_x(target_mfn))) )
>>>         sflags &= ~_PAGE_RW;
>> Would it have performance impact if too much mmio rangeset and we need
>> search it for each l1 entry update? Or you assume this range will not be
>> updated so frequently?
>Yes, performance wise this isn't nice.
>> I'm not sure if we can add one more p2m type like p2m_mmio_ro? And expand
>> the p2m_is_readonly to cover this also? PAE xen may have trouble for it, but
>> at least it works for x86_64, and some wrap function with #ifdef X86_64 can
>> handle the difference.
>That would be a possible alternative, but I'm afraid I wouldn't dare to
>make such a change.

Which change? Would following code works without much issue?

 if (p2m_is_readonly(p2mt) || p2m_is_ro_mmio(p2mt, mfn_t(target_mfn)))
        sflags &= ~_PAGE_RW;

#ifdef __x86_64
#define P2M_RO_TYPES  (p2m_to_mask(p2m_ram_logdirty) 
#define p2m_is_ro_mmio() 0
#define P2M_RO_TYPES  (p2m_to_mask(p2m_ram_logdirty) 
#define p2m_is_ro_mmio(_t, mfn)  (p2mt == p2m_mmio_direct  && 

#define p2m_is_readonly(_t)  (p2m_to_mask(_t) & P2M_RO_TYPE))

>>>@@ -707,7 +810,7 @@ static int __pci_enable_msix(struct msi_
>>>         return 0;
>>>     }
>>>-    status = msix_capability_init(pdev, msi, desc);
>>>+    status = msix_capability_init(pdev, msi, desc, nr_entries);
>>>     return status;
>>> }
>> As stated above, would it be possible to achive this in tools?
>> Also if it is possible to place the mmio_ro_ranges to be per domain
>> structure?
>As said above, doing it in the tools has down sides, but if done
>there and if excluding/special-casing Dom0 (or the servicing
>stubdom) it should be possible to make the ranges per-domain.

I agree that dom0's writing access should also be avoided . The concern for 
global mmio_ro_ranges is, the ranges maybe a bit big, if several SR-IOV card 
populated, each support several VF. But I have no data how bit impact would it 



Xen-devel mailing list