Good stuff Tim, let me summarize:
- The key is to obtain exclusive access to a p2m entry, or to a range
[gfn, gfn + (1 << order)). This exclusive access lasts beyond the actual
lookup, until the caller has finished its modifications, so that the p2m
mapping cannot change underfoot.
- bits for either fine-grained locks or refcounts need to be set aside.
Stuffing those bits into the actual p2m entries would be very
error-prone, if not outright impossible, given all the existing
implementations (NPT+IOMMU, 32-bit, etc). So we're stuck with extra
space overhead for a fine-grained p2m concurrency control structure (a
sketch follows after this list).
- Unless the refcount collapses into the page_info struct. Even then,
there is a critical section, "get p2m_entry then get_page", that needs
to execute atomically.
- foreign mappings can block p2m actions for arbitrarily long. This
doesn't usually happen, but the risk is latent. This is "hard to
solve", for now.
question 1: I still don't see the need for refcounts. If you want to
prevent changes underfoot, you need to lock the entry, and that's it.
In all the cases you explained, somebody would have to wait until the
refcount on the entry drops before they could proceed as the sole
holder. That is effectively the same as being locked out.
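To illustrate what I mean (names and types invented, nothing here is
existing Xen code), waiting for a per-entry refcount to drain behaves
just like taking a write lock:

#include <stdatomic.h>

typedef atomic_uint p2m_entry_ref_t;   /* hypothetical per-entry reader count */

/* Reader side: pin the entry, cf. taking a read lock. */
static void entry_get(p2m_entry_ref_t *r) { atomic_fetch_add(r, 1); }
static void entry_put(p2m_entry_ref_t *r) { atomic_fetch_sub(r, 1); }

/* Writer side: "wait until the refcount drops" just means being locked
 * out until every reader has called entry_put(); a real version would
 * also hold off new readers, exactly as a write lock does. */
static void entry_wait_exclusive(p2m_entry_ref_t *r)
{
    while ( atomic_load(r) != 0 )
        ;   /* spin until we are the sole party interested in the entry */
}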
question 2: although internal hypervisor code paths do not seem to act
on unaligned p2m ranges, external calls (e.g. MEMF_populate_on_demand)
could well pass unaligned ranges, and those complicate fine-grained
concurrency. Should we fail them? With so many toolstacks out there, I
am very hesitant to do that.
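One alternative to failing them, sketched here with invented names and a
toy main() just to show the decomposition, is to split an unaligned
[gfn, gfn + nr) request into maximal naturally aligned power-of-two
chunks and take the fine-grained lock chunk by chunk:

#include <stdint.h>
#include <stdio.h>

/* Largest order such that gfn is aligned to (1 << order) and the chunk
 * does not overrun the remaining pages; capped at 1GB for sanity. */
static unsigned int chunk_order(uint64_t gfn, uint64_t remaining)
{
    unsigned int order = 0;

    while ( !(gfn & (1ULL << order)) && (2ULL << order) <= remaining &&
            order < 18 )
        order++;
    return order;
}

int main(void)
{
    uint64_t gfn = 0x1003, nr = 0x1234;   /* deliberately unaligned example */

    while ( nr )
    {
        unsigned int order = chunk_order(gfn, nr);

        /* In the hypervisor this would be: lock the chunk, act, unlock. */
        printf("operate on [%#llx, +%llu pages)\n",
               (unsigned long long)gfn, 1ULL << order);
        gfn += 1ULL << order;
        nr  -= 1ULL << order;
    }
    return 0;
}

That keeps old toolstacks working, at the cost of the operation no
longer being atomic across the whole unaligned range.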
question 3: is there any way to know a priori the maximum gfn a domain
will ever have? Could we pre-allocate the concurrency control structure
instead of demand-allocating it?
suggestion 1: bake exclusive access into the current calls. A p2m
lookup followed by a matching p2m set_entry delimits a critical section
for that range of p2m mappings. A p2m lookup with no closing set_entry
will have to issue an explicit call to drop exclusive access on the
range of mappings.
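Roughly, the calling convention I have in mind (all names invented for
illustration; these are not existing Xen interfaces):

#include <stdint.h>

typedef struct domain domain_t;    /* opaque stand-in for struct domain */

/* Takes exclusive access to [gfn, gfn + (1 << order)) as a side effect. */
uint64_t p2m_lookup_locked(domain_t *d, uint64_t gfn, unsigned int order);

/* Updates the entry and releases the access taken by the matching lookup. */
void p2m_set_entry_unlock(domain_t *d, uint64_t gfn, unsigned int order,
                          uint64_t new_mfn);

/* For lookups with no matching set_entry: release the access explicitly. */
void p2m_drop_access(domain_t *d, uint64_t gfn, unsigned int order);

void modify_example(domain_t *d, uint64_t gfn, uint64_t new_mfn)
{
    uint64_t old_mfn = p2m_lookup_locked(d, gfn, 0);
    /* ... decide what to do; the mapping cannot change underfoot here ... */
    (void)old_mfn;
    p2m_set_entry_unlock(d, gfn, 0, new_mfn);  /* closes the critical section */
}

void read_only_example(domain_t *d, uint64_t gfn)
{
    uint64_t mfn = p2m_lookup_locked(d, gfn, 0);
    /* ... use mfn, e.g. map it or emulate against it ... */
    (void)mfn;
    p2m_drop_access(d, gfn, 0);                /* no set_entry, so drop by hand */
}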
suggestion 2: limit the fine granularity (if locking, not refcounting)
to 2MB superpages. That saves space, and 512 neighbours can surely
coexist without locking each other out :)
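Rough arithmetic, assuming (purely as an assumption) one byte of
lock/refcount state per lockable unit and a 64GB guest:

#include <stdio.h>

int main(void)
{
    unsigned long long guest_bytes = 64ULL << 30;        /* 64GB example guest */
    unsigned long long pages_4k    = guest_bytes >> 12;
    unsigned long long chunks_2m   = guest_bytes >> 21;

    printf("per-4k-page state  : %llu bytes\n", pages_4k);   /* 16 MB */
    printf("per-2MB-chunk state: %llu bytes\n", chunks_2m);  /* 32 KB */
    return 0;
}

A 512x saving, which is the whole appeal.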
Thanks,
Andres
On Thu, Oct 13, 2011 at 11:02 AM, Tim Deegan <tim@xxxxxxx> wrote:
> Hi,
>
> At 15:17 -0400 on 10 Oct (1318259867), Andres Lagar Cavilla wrote:
>> On Mon, Oct 10, 2011 at 5:21 AM, Tim Deegan <tim@xxxxxxx> wrote:
>> > A few issues I can see with it:
>> > - p2m lookups are on some very performance-sensitive paths
>> > (e.g. multiple times in any pagetable walk or instruction emulation
>> > in a HVM guest) so adding the rwlock might have a noticeable impact.
>> >
>> > - This fixes one class of races (page gets freed-to-xen underfoot) but
>> > leaves another one (gfn -> mfn map changes underfoot) untouched. In
>> > particular it doesn't solve the race where a foreign mapper
>> > gets a r/w map of what's meant to be a read-only frame.
>>
>> Can you elaborate a bit? Under what situations does the gfn->mfn map
>> change underfoot (other than sharing, paging in or superpage
>> shattering)? Wouldn't those be taking a writer lock, and thus be
>> mutually excluded from lookups?
>
> Once CPU A has done a p2m lookup but before it is finished using the
> result (so, maybe it has established a mapping, or is about to establish
> a mapping, or is going to do some other thing like shadow a pagetable):
> - CPU B might run a log-dirty clean that makes the pfn read-only.
> This is actually handled, for historical reasons, by having all
> possible callers be aware of log-dirty - that won't scale to
> all other p2m actions, though. :(
> - CPU B might remove the p2m mapping entirely and free the page.
> This is the case that your suggestion handles, by holding
> a refcount on the old MFN.
> - CPU B might make the page read-only so it can be shared with another
> VM.
> - CPU B might move the MFN to another PFN (this happens with emulated
> video RAM, I believe).
>
>
>> > I think that to fix things properly we need to have some refcount
>> > associated with the p2m mapping itself. That would be taken by all
>> > lookups (or at least some - we could have a flag to the p2m lookup) and
>> > released as you suggest, but more importantly it would block all p2m
>> > changes
>> > while the count was raised. (I think that at least in the common case we
>> > could encode such a refcount using the existing typecount).
>>
>> Assuming mapping means "entry in p2m", multiple mappings would have
>> their refcounts collapse into the page typecount. Isn't that a problem?
>
> It might be. I think the only case where multiple p2m mappings point to
> the same MFN is page-sharing, which is already a special case; but it
> might make sense to have the refcounts per-pfn anyway, just for
> clarity.
>
>> Do we need per-mapping refcounts, or rather, per mapping mutual
>> exclusion? My feel is that page refcounts are necessary to prevent the
>> page from disappearing, and mappings need to have their lookups and
>> modifications synchronized.
>>
>> > One problem then is how to make all the callers of the p2m update
>> > functions handle failure, either by waiting (new deadlock risk?) or
>> > returning EAGAIN at the hypercall interface. Paths where the update
>> > isn't caused by an explicit request (like log-dirty and the mem-event
>> > rx-to-rw conversion) would be particularly tricky.
>>
>> Callers already wait on lock_p2m. They'll wait longer :)
>
> :)
>
>> On failure, to cite a specific example, if paging was trying to swap
>> something out that got foreign-mapped by somebody else, then yeah,
>> there's no other option than failing that call.
>>
>> How would log-dirty and x^w fail (if the refcount increases before
>> they get exclusive access to the mapping)? They're not trying to
>> change the mapping and/or make a page go away, rather they're changing
>> the p2m permission.
>
> Good point. But x^w would have to fail (or at least wait) if
> x86_emulate on another CPU was doing an instruction-fetch. Otherwise
> there's a race where the page becomes writeable and is written to before
> the instruction fetch completes.
>
>> > More seriously, it introduces another round of the sort of priority
>> > inversion we already get with the existing refcounts - a foreign
>> > mapping, caused by a user-space program in another VM, could arbitrarily
>> > delay a p2m update (and so prevent a VCPU from making progress), without
>> > any mechanism to even request that the mapping be removed.
>>
>> Yeah, that's tricky. I do not know if there is a fix at all.
>> Fundamentally, the foreign mapper (some dom0 sharing/paging/foo
>> utility) is completely async to the domain. Even if we were to revoke
>> the foreign mapping as you suggest below, that would involve an upcall
>> into the foreign-mapper-guest-OS to have it cleanly neuter the mapping
>> and drop the refcount. That's not at all trivial! Perhaps foreign
>> mapping vma's should be taught to patiently re-establish mappings if
>> they disappear under their feet? Even then, you need to keep track of
>> those foreign l1e's, and nothing short of a list will do.
>>
>> Because this is performance rather than correctness I'm inclined to
>> not poke the beast.
>
> I'm inclined to agree. Maybe the right thing to do is implement it and
> see whether serious problems arise.
>
>> >> I'm more wary that turning p2m locking into read/write will result in
>> >> code deadlocking itself: taking a read lock first and a write lock
>> >> later. Possibly the current rwlock implementation could be improved to
>> >> keep a cpumask of read-lockers, and provide an atomic "promote from
>> >> read to write" operation (something along the lines of wait
>> >> until you're the only reader in the cpumask, and then cmpxchg(lock,
>> >> -1, WRITE_BIAS))
>> >
>> > I think that would deadlock if two cpus tried it at once.
>>
>> If you keep a cpumask of all read lockers, and only try to promote if
>> you're the only read locker, it wouldn't. But then you'd have to
>> protect the cpumask from races with another spinlock. Yuk.
>
> If two CPUs both take read locks and both want to promote, they'll
> deadlock waiting for each other to go away.
>
>> Which brings me to another question: p2m_locked_by_me (and others)
>> check on the physical cpu (lock->holder == current->processor,
>> roughly). Well, what if you lock and then the vcpu migrates? Are vcpu
>> migrations prevented if you hold any locks? Or is there some other
>> magic going on?
>
> When that code was written, vcpus could not be preempted in the
> hypervisor (except in the scheduler softirq handler). Now they can
> voluntarily preempt, but not while holding any locks, so it's still OK.
>
> Cheers,
>
> Tim.
>