>IMO you're doing code building anyway, but just of one instruction. You get
>rid of the locking by doing it to a per-CPU buffer, and the stack is the
>obvious place, calling out to register save/restore code. I don't really
>care about the performance of the save/restore code -- it's obviously going
>to be trivial compared with the unavoidable trap-and-emulate cost. Also, do
>you need separate save/restore code for IN vs. OUT instructions?
Actually, in the code I currently have, I do. This is because for OUTs I need
to merge the value to be output with the user-specified rAX, under the
assumption that the output value and the register contents are not always
identical (i.e. if particular bits within a port need special treatment by Xen,
which I can easily imagine being required at some point).
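
To illustrate the merge I mean, here is a minimal sketch. It is not the code from the patch; the helper name merge_out_value and its parameters are made up for this example, and it assumes the operand size is 1, 2 or 4 bytes and that the (possibly Xen-adjusted) output value simply replaces the low bytes of the guest's rAX.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not taken from the patch): build the rAX value to
 * load before executing OUT. The low `bytes` bytes come from the value Xen
 * decided to output; all remaining bits keep the guest's register contents.
 */
static uint64_t merge_out_value(uint64_t guest_rax, uint32_t out_val,
                                unsigned int bytes)
{
    /* Mask covering the operand: 0xff, 0xffff or 0xffffffff. */
    uint64_t mask = (bytes >= 4) ? 0xffffffffULL
                                 : ((1ULL << (bytes * 8)) - 1);

    return (guest_rax & ~mask) | ((uint64_t)out_val & mask);
}
```

For IN no such merge is needed before the instruction executes, which is why the two directions end up with different save/restore paths in what I have now.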
> call save_host_restore_guest
> <IN or OUT>
> call save_guest_restore_host
>Would that be reasonable?
It would, provided the above assumption about the need to modify the
output value never becomes true. Additionally, for 64-bit, I'm
concerned about the potential need for indirect calls here (as well
as in the syscall trampolines): there's nothing keeping a user from making
the Xen heap 2GB or more in size. Indirect calls would further slow things
down. Depending on the nature of the allocations made from the Xen heap, it
may also be possible to simply place an upper limit on the heap size, as
the heap is currently assumed to be adjacent to the Xen image. But taking
memory holes at rather low addresses into account, a user may even be required
to bump the heap size significantly (what if only a few MB of memory
existed below 4GB?), since, after all, the heap size is the size of the
address space consumed, not the amount of memory used.
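
To make the 2GB concern concrete: a direct near call (opcode E8) encodes its target as a signed 32-bit displacement relative to the end of the 5-byte instruction, so a stub can only use it if the callee is within roughly +/-2GB. A minimal sketch of that check follows; the function name and the idea of testing this at stub-build time are assumptions for illustration, not something the patch does.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check (illustration only): can a 5-byte "call rel32" placed
 * at address `site` reach `target`? The displacement is measured from the
 * address of the next instruction, i.e. site + 5, and must fit in a signed
 * 32-bit value. If it does not, an indirect call would be needed.
 */
static bool call_rel32_reachable(uint64_t site, uint64_t target)
{
    int64_t disp = (int64_t)(target - (site + 5));

    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

With the heap adjacent to the image and capped below 2GB this would always hold, which is the argument for an upper limit on the heap size.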
>Alternatively, perhaps we could get rid of the distinction and emulate all
>port accesses in this way? I suspect that the cost of state save/restore and
>building the trampoline is dwarfed by the cost of the GPF and even the cost
>of the I/O port access itself (they don't tend to be super fast). Could you
>do a few quick measurements to determine this? If the extra cost is less
>than, say, 10%, I'd be inclined to take the hit to avoid interface changes.
Percentages of full-context relative to simply emulated i/o, without having
changed the assembly file approach to the stub building one, yet (as per
PentiumIII (32-bit)   with locking      67%
PentiumIII (32-bit)   without locking   84%
Pentium4 (64-bit)     with locking      86%
Pentium4 (64-bit)     without locking   89%
Revised patch (domctl->sysctl, naming) attached.