Xen/ia64 dom0 virtual physical model design memo

                                               2006 VA Linux Systems Japan K.K.
                                     Isaku Yamahata <yamahata at valinux.co.jp>

* Introduction
This document targets xen/ia64 developers and gives an overview of the
planned virtual physical implementation.
It describes what the virtual physical model is and the design of the
xen/ia64 dom0 virtual physical model, but does not explain basic Xen concepts.

* Terminology
Address-related terms are often used in confusing ways.
For clarity, the terms used in this document are explained in this section.

- VMM
  Virtual Machine Monitor.

- Virtual Processor(VP)
  The virtual physical model is sometimes called the VP model.
  However, VP more commonly stands for Virtual Processor.
  Although the two can usually be distinguished by context,
  VP is not used for Virtual Physical in this document to avoid confusion.

- physical address
  A physical address is used to refer to RAM in a non-virtualized environment.
  The CPU uses this address to access RAM.

- virtual address
  The address which a user process typically sees.
  The MMU translates this address to a physical address.

- bus address
  A bus address is used by I/O devices to refer to RAM.
  For example, a PCI bus address must be used to program a PCI bus master
  device to do DMA.
  On x86 platforms the conversion between machine address and bus address is
  usually trivial (i.e. bus address value == machine address value),
  but it is not safe to assume that the conversion is always trivial.
  Some examples follow.
  - An x86 box with a 32-bit PCI bus and more than 4GB of memory.
    Memory beyond 4GB cannot be addressed by a PCI bus address.
  - A bus address and a machine address with the same value may refer to
    different RAM.
  - An extreme example is a machine with an IOMMU.
    In an IOMMU environment, a bus address is the address before IOMMU
    translation.

- machine address
  A machine address is used by a real CPU to refer to RAM in a virtualized
  environment.
  It corresponds to the physical address of a non-virtualized environment.
  The terms host physical address and machine physical address are
  sometimes used with the same meaning.

- pseudo physical address
  This is an address which a guest domain believes to be a physical address.
  The VMM actually translates this address to a machine address.
  The terms guest physical address and, sometimes, just physical address
  are also used with this meaning, as is metaphysical address, a term from
  the HP vBlades project.

- machine bus address
  In a virtualized environment, machine bus address means a real bus address,
  to distinguish real bus addresses from virtualized ones.
  Although there is no real bus corresponding to a virtual device,
  the notion of machine bus address is still useful. Usually the machine
  address is used as the machine bus address of a virtual device on a
  virtual bus, but this is not mandatory and other schemes are possible.
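
The 32-bit PCI example in the bus address entry can be made concrete with a
small check. The following is only a sketch: the helper name
dma_addressable() and its parameters are hypothetical, and it assumes the
trivial case where bus address value == machine address value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: can a device whose bus addresses are limited to
 * bus_bits bits reach the whole buffer [maddr, maddr + len)?
 * Assumes the trivial mapping bus address == machine address. */
static bool dma_addressable(uint64_t maddr, uint64_t len, unsigned bus_bits)
{
    assert(bus_bits >= 1 && bus_bits <= 64);
    if (len == 0)
        return true;

    uint64_t last = maddr + len - 1;    /* last byte of the buffer */
    if (last < maddr)                   /* wrapped around 64 bits */
        return false;
    if (bus_bits == 64)
        return true;
    return last < ((uint64_t)1 << bus_bits);
}
```

With a 32-bit bus, a buffer that starts at or crosses the 4GB boundary is
rejected, which is the case where a bounce buffer or an IOMMU is needed.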

* Xen/ia64 dom0 virtual physical model
The purpose of the dom0 virtual physical model is to make xen/ia64
architecturally correct, and by doing so to make future xen/ia64 development
easier and reduce maintenance effort. For example, the vUSB device driver,
which is under development, and other future virtual devices should
be easy to adapt to Xen/ia64.
This issue arose from the effort to reduce ia64-specific hacks and
to get VNIF working on xen/ia64.
There are several ways to get this done; the virtual physical model was
chosen at the 2006 winter xen summit.
xen/x86 is the development mainstream, and xen/ia64 may have to keep up with
xen/x86 development. So an appropriate balance between architectural
correctness and xen/x86-ism may have to be found.

There are two kinds of address translation which can be (para-)virtualized:
  virtual address <-> pseudo physical address <-> machine address
used by the OS virtual memory subsystem, and
  pseudo physical address <-> machine address <-> machine bus address
used by the OS I/O subsystem.
Since Xen/ia64 already fully virtualizes the TLB, the latter is the issue.
Unfortunately, machine bus address virtualization requires IOMMU assistance,
and IOMMUs are not widely available on ia64 platforms (yet), so
para-virtualization has to be adopted.
Here Xen virtual devices are regarded as a part of the I/O subsystem;
e.g. the grant table is a part of the I/O subsystem.

In contrast to the virtual physical model, the currently implemented model
is called the P==M model. The P==M model doesn't do any I/O-related address
translation. As a result, dom0 Linux needs to have a struct page for every
machine page.
This is the reason why sparse/discontiguous memory cannot be supported in
domain0 right now. This could still be fixed in the P==M model but would be
difficult. It is easy in the virtual physical model.

The essence of the virtual physical model is that dom0 Linux needs only the
translation from pseudo physical address to machine bus address, not to
machine address.
Machine bus addresses are used only by the OS I/O subsystems, and
Linux has well-defined I/O APIs, so it should be easy to
isolate the source files which do the conversion.
However, the correspondence between machine address and machine bus address
is maintained by dom0, not by xen. Thus, in order to translate from pseudo
physical to machine bus addresses, dom0 needs a way to convert from pseudo
physical to machine addresses.

In the virtual physical model, the pseudo physical address is virtualized.
As a result, a pseudo-physical-contiguous range which is larger
than the xen page size may not be machine-contiguous/bus-contiguous.
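
The contiguity problem can be illustrated with a small check over a
pseudo physical to machine table. This is a sketch only: the flat array
p2m[] (indexed by pseudo physical frame number, holding machine frame
numbers in xen page units) and the function name are assumptions made for
illustration, not part of the design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table layout: p2m[pfn] is the machine frame number that
 * backs pseudo physical frame pfn, both in xen page units.
 * Returns true if nr_pages frames starting at first_pfn are backed by
 * consecutive machine frames. */
static bool range_is_machine_contiguous(const uint64_t *p2m, size_t p2m_len,
                                        size_t first_pfn, size_t nr_pages)
{
    assert(nr_pages <= p2m_len && first_pfn <= p2m_len - nr_pages);
    for (size_t i = 1; i < nr_pages; i++) {
        if (p2m[first_pfn + i] != p2m[first_pfn] + i)
            return false;
    }
    return true;
}
```

A DMA routine that wants to coalesce adjacent pseudo physical pages into one
region would need a test of this kind before doing so.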


* Detailed design

There are two major issues: address translation and memory contiguity.
For address translation:
  Add a pseudo physical to machine address conversion mechanism.
  Make the Linux I/O-related files aware of machine addresses.
For memory contiguity:
  Add a mechanism to allocate machine-contiguous memory.
  Modify the DMA'able memory allocators and the routines which examine
  machine-contiguity to coalesce DMA regions.
In addition to the above, xenLinux/x86 uses a bounce buffer called swiotlb.
It bounces data through a preallocated DMA'able machine-contiguous
region. This might also be needed for the virtual physical model.
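
The bounce buffer idea can be sketched as follows. This is a toy model, not
the lib/swiotlb.c code: the structure, the function names and the fixed
pool size are all hypothetical, and the pool merely stands in for a
preallocated DMA'able machine-contiguous region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_SIZE 4096        /* hypothetical pool size */

/* Toy bounce buffer: pool stands in for the preallocated DMA'able,
 * machine-contiguous region. Only one mapping at a time is modelled. */
struct bounce {
    unsigned char pool[BOUNCE_SIZE];
    void *orig;                 /* caller's buffer being bounced */
    size_t len;
};

/* Returns the address the device should use, or NULL if buf is too big. */
static void *bounce_map(struct bounce *b, void *buf, size_t len,
                        bool to_device)
{
    assert(b != NULL && buf != NULL);
    if (len > BOUNCE_SIZE)
        return NULL;
    b->orig = buf;
    b->len = len;
    if (to_device)
        memcpy(b->pool, buf, len);      /* copy out before the DMA */
    return b->pool;
}

static void bounce_unmap(struct bounce *b, bool from_device)
{
    assert(b != NULL && b->orig != NULL);
    if (from_device)
        memcpy(b->orig, b->pool, b->len);   /* copy back after the DMA */
    b->orig = NULL;
    b->len = 0;
}
```

The cost is one extra copy per transfer, which is why it is only a fallback
for buffers the device cannot reach directly.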

- domain0 builder
  The ACPI table area, the EFI port I/O area and the EFI memory mapped I/O
  area have to be mapped to the dom0 pseudo physical address space in advance.

- tlb miss handler, tr/tc emulation
  In the current implementation, any tlb request whose page size > xen page
  size is accepted as is.
  In the virtual physical model, however, such a tlb request is broken down
  to the xen page size.

  This will cause excess tlb misses. However, this can be mitigated
  by assuming that the low addresses in pseudo physical memory
  (say 0-64MB) are contiguous in machine address and covering them
  by a single TLB entry. This gives coarse-grained TLB coverage
  at the bottom of memory, and fine-grained coverage for the rest of memory.
  This is not a focus right now, but will be addressed in the future
  tuning phase.
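
Breaking a large guest tlb request down to the xen page size can be sketched
as below. The structure, the function name and the 16KB xen page size
(XEN_PAGE_SHIFT of 14) are assumptions made for illustration. The result is
still a pseudo physical address; the pseudo physical to machine conversion
is a separate step.

```c
#include <assert.h>
#include <stdint.h>

#define XEN_PAGE_SHIFT 14       /* assumed 16KB xen page size */

/* Hypothetical description of one translation. */
struct tlb_entry {
    uint64_t vaddr;             /* virtual base address */
    uint64_t paddr;             /* pseudo physical base address */
    unsigned page_shift;        /* log2 of the page size */
};

/* Given a guest translation with a page size >= the xen page size, return
 * the xen-page-sized piece of it that covers fault_vaddr. */
static struct tlb_entry xen_page_piece(struct tlb_entry guest,
                                       uint64_t fault_vaddr)
{
    assert(guest.page_shift >= XEN_PAGE_SHIFT && guest.page_shift < 64);

    uint64_t guest_mask = ((uint64_t)1 << guest.page_shift) - 1;
    uint64_t xen_mask = ((uint64_t)1 << XEN_PAGE_SHIFT) - 1;
    /* offset of the covering xen page inside the guest page */
    uint64_t offset = (fault_vaddr & guest_mask) & ~xen_mask;

    struct tlb_entry piece = {
        .vaddr = (guest.vaddr & ~guest_mask) + offset,
        .paddr = (guest.paddr & ~guest_mask) + offset,
        .page_shift = XEN_PAGE_SHIFT,
    };
    return piece;
}
```

Each different xen page touched inside the guest page then costs its own
tlb miss, which is the overhead described above.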

- machine address page lookup(ACPI table parse)
  Parsing ACPI tables requires reading pages pointed to by machine address.
  Fortunately, ACPI tables live in the EFI runtime service data region, and
  xen/ia64 maps the region in the way pseudo physical == machine at
  dom0 building.
  On Linux/i386, __acpi_map_table() must be used to access an ACPI table,
  so ACPI table access could easily be hooked by adding
  a hyper-call to __acpi_map_table().
  Unfortunately, on Linux/ia64, __va() is used instead of
  __acpi_map_table().
  The right way is to fix the linux/ia64 ACPI code to use __acpi_map_table()
  and add a hyper-call to __acpi_map_table().
  But currently the EFI runtime service data region is mapped to dom0,
  since that is the easier way.

- EFI port/memory-mapped I/O spaces
  The EFI memory mapped I/O region and the EFI memory mapped I/O port space
  are also mapped to the dom0 pseudo physical address space in the way
  pseudo physical == machine at dom0 building.

- other I/O spaces
  EFI doesn't cover all I/O spaces, e.g. those of PCI devices.
  So add a hyper-call to map such I/O spaces to dom0 in the way
  pseudo physical == machine.
  In theory pseudo physical != machine address is possible,
  but it would require more coding in xen to maintain the dom0 I/O mapping.

  In Linux such I/O must be done via ioremap(), so it is easy to hook
  such I/O.

  Xen fakes up an EFI memory descriptor table and passes it to domain0.
  Memory areas must not overlap with I/O areas. However, the current
  implementation doesn't handle this, just because such a situation happens
  not to occur on my testing box.
  This issue will be addressed later.

- pseudo physical to machine address translation
  At first it will be implemented by a simple hyper-call.
  Once the dom0 virtual physical model has stabilized, it can be replaced
  by a table lookup or something similar.
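
A table lookup of the kind mentioned above might look like the following.
It is a minimal sketch under stated assumptions: a flat table of machine
frame numbers, a 16KB xen page size, and hypothetical names throughout. The
actual table format is not defined by this memo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XEN_PAGE_SHIFT 14       /* assumed 16KB xen page size */
#define XEN_PAGE_MASK  (((uint64_t)1 << XEN_PAGE_SHIFT) - 1)
#define INVALID_MFN    (~(uint64_t)0)   /* hypothetical "no page" marker */
#define INVALID_MADDR  (~(uint64_t)0)   /* hypothetical error value */

/* Hypothetical flat table: mfn[pfn] is the machine frame number backing
 * pseudo physical frame pfn. */
struct p2m_table {
    const uint64_t *mfn;
    uint64_t nr_pfns;
};

/* Convert a pseudo physical address to a machine address. */
static uint64_t phys_to_machine(const struct p2m_table *t, uint64_t paddr)
{
    assert(t != NULL && t->mfn != NULL);

    uint64_t pfn = paddr >> XEN_PAGE_SHIFT;
    if (pfn >= t->nr_pfns || t->mfn[pfn] == INVALID_MFN)
        return INVALID_MADDR;
    /* keep the offset within the page, swap the frame number */
    return (t->mfn[pfn] << XEN_PAGE_SHIFT) | (paddr & XEN_PAGE_MASK);
}
```

Compared with a hyper-call per conversion, a lookup like this avoids
entering xen on every DMA mapping, at the price of keeping the table
consistent.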

- dma
  Linux dma-related code must be modified to be machine address aware.
  At least the following files must be modified.
  Hopefully no other files need to be modified.
  - dma api
    include/asm-ia64/dma-mapping.h
    arch/ia64/kernel/machvec.c
    include/asm-ia64/dma.h
    include/asm-ia64/pci.h
    arch/ia64/pci/pci.c
  - swiotlb
    swiotlb.h, lib/swiotlb.c, scatterlist.h
  - iommu
    arch/ia64/hp/common/sba_iommu.c
    sgi sn (this will not be done here; it is SGI's work)
  - agp
    include/asm-ia64/agp.h

- foreign page mapping
  This functionality is for dom0 (or a privileged domain) to access another
  domain's pages specified by machine address. dom0 maps such a page into
  its address space. (On x86, it is specified by kernel virtual address.)
  This is used to build domU. On Xen/x86 this is implemented by
  __direct_remap_pfn_range(), which is eventually invoked
  via ioctl("/proc/xen/privcmd", IOCTL_PRIVCMD_MMAP).

  1. reserve a range in dom0 pseudo physical address space.
     Add a hyper-call to assign a page specified by machine address to
     the reserved pseudo physical address space.
     Then map it by remap_pfn_range().
     When unmapped, de-assign the assigned pages by overriding its
     vm_area_operations.

  2. replace a dom0's page with a foreign domain's page.
     Dom0 allocates a page and replaces its underlying machine page with
     a foreign domain's page by a hyper-call. Then it maps the page
     as a usual page.
     When unmapped, de-assign the assigned pages by overriding its
     vm_area_operations.
     The same mechanism which is used for the grant table (described below)
     can be used.
  3. others
     If there is a better implementation, please propose.

  2. is adopted because it would require less implementation effort than 1.
  Its implementation is well isolated, so if needed it
  can be replaced by 1. later with a small impact on other code.
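
The bookkeeping behind option 2 can be modelled in a few lines. This is a
toy model with hypothetical names, and the two functions only stand in for
the proposed hyper-call and for the de-assign done at unmap time. The memo
does not say what happens to dom0's original machine page, so the sketch
simply assumes it is put back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one dom0 pseudo physical page. */
struct dom0_page {
    uint64_t mfn;           /* machine frame currently backing the page */
    uint64_t own_mfn;       /* dom0's original frame while replaced */
    bool is_foreign;        /* true while a foreign frame is assigned */
};

/* Stand-in for the hyper-call: back the page with foreign_mfn. */
static int assign_foreign_page(struct dom0_page *pg, uint64_t foreign_mfn)
{
    assert(pg != NULL);
    if (pg->is_foreign)
        return -1;          /* already holds a foreign frame */
    pg->own_mfn = pg->mfn;
    pg->mfn = foreign_mfn;
    pg->is_foreign = true;
    return 0;
}

/* Stand-in for the de-assign triggered when the mapping goes away. */
static int deassign_foreign_page(struct dom0_page *pg)
{
    assert(pg != NULL);
    if (!pg->is_foreign)
        return -1;          /* nothing to de-assign */
    pg->mfn = pg->own_mfn;  /* assumption: the original frame comes back */
    pg->is_foreign = false;
    return 0;
}
```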

- grant table
  The current grant table api depends deeply on xen/x86.
  There are four kinds of addresses related to the grant table:
  user virtual address, kernel virtual address, pseudo physical address and
  machine address.
  The xen/x86 grant table uses user virtual address, kernel virtual address
  and machine address.
  On the other hand, xen/ia64 can use pseudo physical address and machine
  address. Because xen/ia64 fully virtualizes the TLB, it is difficult for
  xen/ia64 to handle user/kernel virtual addresses.
  The virtual address related api might be emulated by xenLinux/ia64 without
  xen/ia64. However, kernel virtual address is an issue.
  The right way is to re-define the grant table api, separating the
  arch-independent part and the arch-dependent part (or to define an entirely
  new clean replacement), and to rewrite the existing code including common
  xen code, xen/x86 code, and xenLinux/x86 code.
  This might take a long time.

  Step 1
  Xen/ia64 part:
  Use pseudo physical address instead of kernel virtual address.
  xenLinux/ia64 part:
  Don't change the HYPERVISOR_grant_table_op() api, but wrap
  HYPERVISOR_grant_table_op() and do the necessary conversion/work in it.
  Impose a restriction that only virtual addresses in the Linux identity
  mapping area can be used with the grant table api.
  The area (0xe0000000:00000000 - ) has a 1:1 correspondence to pseudo
  physical addresses. A virtual address in the area can be converted to a
  pseudo physical address by __pa().
  xenbus_map_ring_valloc() and xenbus_unmap_ring_vfree() need to be modified.
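
The restriction and the conversion in Step 1 can be sketched as below. The
function names are hypothetical and the code only mimics what __pa() does
for the identity mapping area; it assumes that the area is region 7, i.e.
addresses whose top three bits are all set, starting at 0xe000000000000000.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed start of the Linux/ia64 identity mapping area (region 7). */
#define IDENTITY_BASE 0xe000000000000000ULL

/* True if vaddr lies in the identity mapping area. */
static bool in_identity_map(uint64_t vaddr)
{
    return (vaddr >> 61) == 7;      /* top three bits select the region */
}

/* Hypothetical stand-in for __pa() restricted to the identity area. */
static uint64_t virt_to_pseudo_phys(uint64_t vaddr)
{
    assert(in_identity_map(vaddr));
    return vaddr - IDENTITY_BASE;
}
```

A wrapper around HYPERVISOR_grant_table_op() could reject any address for
which in_identity_map() is false before doing the conversion.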

  Step 2
  Clean up the grant table api.
  Remove the virtual address assumption and introduce an arch-specific grant
  table address. It might be a virtual address on x86, and a pseudo physical
  address on ia64. Define arch-dependent conversions.
  Xen/PPC people should be involved.

- vbd, vnif and other xen virtual device drivers except blktap.
  Once the grant table issues are resolved, vbd and vnif should work.
  At most, only a split into arch-independent/arch-dependent parts should
  be needed.

- balloon
  TODO
  similar to grant table page transfer.

- blktap
  TODO
  blktap uses the grant table with GNTMAP_application_map and
  GNTMAP_contains_pte.
  This should be emulated by xenLinux.

- Rusty's share
  This might need to be researched.
  This can be a clean inter domain communication API.


* Current status
item             status
dom0 builder     done
ACPI             done
mm I/O           done
phys2mach        done
dma api          done
swiotlb          done
iommu            not yet
                 (the modification must be done. testers are also needed)
agp              not yet
                 (the modification is done. testers are needed)
foreign mapping  work in progress
grant table      work in progress
vbd              work in progress
vnif             work in progress
balloon          not yet
blktap           not yet


* Issues
- hyper-call
  Arch-specific hyper-calls need to be added.
  A convention for assigning their numbers must be agreed on with the xen
  core team.

- guest SMP
  This issue is not specific to the dom0 virtual physical model.
  In the current implementation, the pseudo physical-to-machine conversion
  tables are not smp-protected. Perhaps it is assumed that a table is built at
  domain creation and is read-only after that.
  Once page flipping or anything else that requires modifying the table is
  introduced, protection of the table becomes a problem,
  and the corresponding tlb shootdowns must be inserted very carefully.

  This issue is not addressed in the early phase of the dom0 virtual physical
  implementation.
  It will be addressed once the dom0 virtual physical model is proven.

- page reference count
  The page reference counting effort is in progress.
  Some update of the page ref might be needed.

- transparent para-virtualization
  Some xen/ia64 developers value transparent para-virtualization.
  "if (running_on_xen) { }" can be used.
  Or is it worthwhile to define a switch?

- ski simulator
  The simscsi and simeth drivers for the ski simulator are broken.
  These should be fixed.

- tlb miss optimization
  This should be addressed in the future tuning phase.

- pseudo physical address to machine bus address conversion performance
  This should be addressed in the future tuning phase.

- guest domain page size < xen page size
  This is not supported yet by the current implementation,
  and it isn't addressed at this early stage of the dom0 virtual
  physical model.
  This is a future issue.

* Thanks
Many people contributed to this document.
All mistakes are mine.

In no particular order
Magenheimer Dan <dan.magenheimer at hp.com>
Yang, Fred <fred.yang at intel.com> 
Dong, Eddie <eddie.dong at intel.com> 
Simon Horms <horms at verge.net.au>
Hirokazu Takahashi <taka at valinux.co.jp>
YAMAMOTO Takashi <yamamoto at valinux.co.jp>