Xen/ia64 dom0 virtual physical model design memo

2006 VA Linux Systems Japan K.K.
Isaku Yamahata

* Introduction
This document targets xen/ia64 developers and gives an overview of the
planned virtual physical implementation. It describes what the virtual
physical model is and how it is designed for xen/ia64 dom0. It does not
explain basic Xen concepts.

* Terminology
Address-related terms are often used confusingly. For clarity, the terms
used in this document are explained in this section.

- VMM
  Virtual Machine Monitor.

- Virtual Processor (VP)
  The virtual physical model is sometimes called the VP model. However,
  VP more commonly stands for Virtual Processor. Although the two might
  be distinguished by context, this document never uses VP to mean
  Virtual Physical, to avoid confusion.

- physical address
  The address used to refer to RAM in a non-virtualized environment.
  The CPU uses this address to access RAM.

- virtual address
  The address which a user process typically sees.
  The MMU translates this address to a physical address.

- bus address
  The address used by I/O devices to refer to RAM. For example, a PCI
  bus address must be used to program a PCI bus master device to do DMA.
  On the x86 platform the conversion between machine address and bus
  address is trivial (i.e. bus address value == machine address value),
  but assuming that the conversion is always trivial is wrong.
  Some examples follow.
  - An x86 box with a 32bit PCI bus and >4GB memory.
    Memory beyond 4GB can't be addressed by a PCI bus address.
  - A bus address and a machine address with the same value may refer
    to different RAM.
  - An extreme example is a machine with an IOMMU.
    In an IOMMU environment, bus address means the address before IOMMU
    translation.

- machine address
  The address used by a real CPU to refer to RAM in a virtualized
  environment. It corresponds to the physical address of a
  non-virtualized environment.
  The terms host physical address and machine physical address are
  sometimes used with the same meaning.

- pseudo physical address
  The address which a guest domain believes to be a physical address.
  In reality the VMM translates this address to a machine address.
  The term guest physical address is sometimes used with the same
  meaning, and sometimes plain physical address is used as well.
  The term metaphysical address from the HP vBlades project is also used.

- machine bus address
  In a virtualized environment, machine bus address means the real bus
  address, to distinguish the real bus address from a virtualized one.
  Although a virtual device has no corresponding real bus, the notion
  of machine bus address is still useful.
  Usually the machine address is used as the bus address on a virtual
  bus. This is not mandatory, and other schemes are possible.

* Xen/ia64 dom0 virtual physical model
The purpose of the dom0 virtual physical model is to make xen/ia64
architecturally correct and, by doing so, to make future xen/ia64
development easier and to reduce maintenance effort.
For example, the vUSB device driver, which is under development, and
other future virtual devices should be easy to adapt to Xen/ia64.
This issue was raised by the effort to reduce ia64 specific hacks and
to get VNIF working on xen/ia64. There are several ways to get this
done. The virtual physical model was chosen at the 2006 winter xen
summit.

xen/x86 is the development mainstream, and xen/ia64 may have to catch
up with xen/x86 development. So an appropriate point somewhere between
architectural correctness and xen/x86-ism may have to be found.

There are two kinds of address translation which can be
(para-)virtualized:
    virtual address <-> pseudo physical address <-> machine address
  used by the OS virtual memory subsystem, and
    pseudo physical address <-> machine address <-> machine bus address
  used by the OS I/O subsystem.
Since Xen/ia64 already fully virtualizes the TLB, the latter is the issue.
Unfortunately, machine bus address virtualization requires IOMMU
assistance, and IOMMUs are not (yet) widely available on ia64
platforms, so para-virtualization has to be adopted.
Here Xen virtual devices are regarded as part of the I/O subsystem,
e.g. the grant table is part of the I/O subsystem.

In contrast to the virtual physical model, the currently implemented
model is called the P==M model. The P==M model doesn't do any
I/O-related address translation. As a result, dom0 Linux needs to have
a page struct for every machine page. This is the reason why
sparse/discontiguous memory cannot be supported in domain0 right now.
This could still be fixed in the P==M model, but it would be difficult.
It is easy in the virtual physical model.

The essence of the virtual physical model is that dom0 Linux needs only
the translation from pseudo physical address to machine bus address,
not to machine address. The machine bus address is only used by the OS
I/O subsystems for I/O, and Linux has well-defined I/O apis, so it
should be easy to isolate the source files which do the conversion.
However, the correspondence between machine address and machine bus
address is maintained by dom0, not by xen. Thus, in order to translate
from pseudo physical to machine bus addresses, dom0 needs a way to
convert from pseudo physical to machine addresses.

In the virtual physical model the pseudo physical address is
virtualized. As a result, a pseudo-physical-contiguous range whose page
size is larger than the xen page size may not be
machine-contiguous/bus-contiguous.

* Detailed design
There are two major issues: address translation and memory contiguity.
  For address translation:
    Add a pseudo physical to machine address conversion mechanism.
    Make the Linux I/O related files aware of machine addresses.
  For memory contiguity:
    Add a mechanism to allocate machine-contiguous memory.
    Modify the DMA'able memory allocators and the routines which
    examine machine-contiguity to coalesce DMA regions.
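
The two issues above can be illustrated with a minimal user-space
sketch in C. It is not code from the Xen or Linux trees. The table
sketch_p2m, the 16KB page size and every sketch_* name are assumptions
made up for this illustration, and the machine address to machine bus
address step is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values for illustration only. */
#define SKETCH_PAGE_SHIFT 14u                        /* 16KB pages, assumed */
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)
#define SKETCH_NR_PAGES   4u

/* Toy table: index is a pseudo physical frame number, value is the
 * machine frame number backing it. Frames 0-1 and 2-3 are adjacent
 * in machine memory, but there is a gap between frame 1 and frame 2. */
static const uint64_t sketch_p2m[SKETCH_NR_PAGES] = {
    0x100, 0x101, 0x250, 0x251
};

/* Address translation: pseudo physical address to machine address.
 * The page offset is preserved, only the frame number is replaced. */
static uint64_t sketch_phys_to_machine(uint64_t paddr)
{
    uint64_t pfn = paddr >> SKETCH_PAGE_SHIFT;

    assert(pfn < SKETCH_NR_PAGES);
    return (sketch_p2m[pfn] << SKETCH_PAGE_SHIFT) |
           (paddr & (SKETCH_PAGE_SIZE - 1));
}

/* Memory contiguity: true if the pseudo physical range
 * [paddr, paddr + len) is backed by consecutive machine pages, so
 * that it could be handed to a device as a single DMA region. */
static bool sketch_range_is_machine_contiguous(uint64_t paddr, uint64_t len)
{
    uint64_t first, last, pfn;

    if (len == 0)
        return true;
    first = paddr >> SKETCH_PAGE_SHIFT;
    last = (paddr + len - 1) >> SKETCH_PAGE_SHIFT;
    assert(last < SKETCH_NR_PAGES);

    for (pfn = first; pfn < last; pfn++) {
        if (sketch_p2m[pfn + 1] != sketch_p2m[pfn] + 1)
            return false;
    }
    return true;
}
```

With this table, a two-page range starting at pseudo physical frame 0
is machine-contiguous, while a two-page range starting at frame 1 is
not, although both are contiguous in pseudo physical address space. A
routine that coalesces DMA regions would have to make this kind of
check on machine frames instead of comparing pseudo physical addresses.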

In addition to the above, xenLinux/x86 uses a bounce buffer called
swiotlb. It bounces data through a preallocated DMA'able
machine-contiguous region. This might also be needed for the virtual
physical model.

- domain0 builder
  The ACPI table area, the EFI ported I/O area and the EFI memory mapped
  I/O area have to be mapped into the dom0 pseudo physical address space
  in advance.

- tlb miss handler, tr/tc emulation
  In the current implementation, any tlb request whose page size is
  larger than the xen page size is accepted. In the virtual physical
  model, however, such a tlb request is broken down to the xen page
  size. This will cause excess tlb misses. It can be mitigated by
  assuming that the low addresses in pseudo physical memory
  (say 0-64MB) are contiguous in machine address and covering them with
  a single TLB entry. This gives coarse-grained TLB coverage at the
  bottom of memory and fine-grained coverage for the rest of memory.
  This is not a focus right now. It will be addressed in the future
  tuning phase.

- machine address page lookup (ACPI table parse)
  Parsing the ACPI tables requires reading pages pointed to by machine
  addresses. Fortunately the ACPI tables live in the EFI runtime service
  data region, and xen/ia64 maps that region so that
  pseudo physical == machine when dom0 is built.
  On Linux/i386, __acpi_map_table() must be used to access an ACPI
  table, so ACPI table access could easily be hooked by adding a
  hyper-call to __acpi_map_table(). Unfortunately, on Linux/ia64,
  __va() is abused instead of __acpi_map_table().
  The right way is to fix the linux/ia64 ACPI code to use
  __acpi_map_table() and add a hyper-call to __acpi_map_table().
  For now, the EFI runtime service data region is mapped to dom0
  because that is the easier way.

- EFI ported/memory-mapped I/O spaces
  The EFI memory mapped IO region and the EFI memory mapped io port
  space are also mapped into the dom0 pseudo physical address space so
  that pseudo physical == machine when dom0 is built.

- other I/O spaces
  EFI doesn't cover all I/O spaces, e.g. PCI devices.
  So add a hyper-call to map such I/O spaces to dom0 so that
  pseudo physical == machine.
  In theory pseudo physical != machine address is possible, but more
  coding in xen would be required to maintain the dom0 I/O mapping.
  In Linux such I/O must be done via ioremap(), so it is easy to hook
  such I/O.

  Xen fakes up an EFI memory descriptor table and passes it to domain0.
  A memory area must not overlap with an I/O area. However, the current
  implementation doesn't handle this, just because such a situation
  happens not to occur on my testing box.
  This issue will be addressed later.

- pseudo physical to machine address translation
  At first it will be implemented by a simple hyper-call.
  Once the dom0 virtual physical model is stabilized, it can be replaced
  by a table lookup or something similar.

- dma
  The Linux dma related code must be modified to be machine address
  aware. At least the following files must be modified. Hopefully no
  more files need to be modified.
  - dma api
      include/asm-ia64/dma-mapping.h
      arch/ia64/kernel/machvec.c
      include/asm-ia64/dma.h
      include/asm-ia64/pci.h
      arch/ia64/pci/pci.c
  - swiotlb
      swiotlb.h, lib/swiotlb.c, scatterlist.h
  - iommu
      arch/ia64/hp/common/sba_iommu.c
      sgi sn (this won't be worked on. it's sgi's work)
  - agp
      include/asm-ia64/agp.h

- foreign page mapping
  This functionality lets dom0 (or a privileged domain) access another
  domain's pages specified by machine address. dom0 maps such a page
  into its address spaces. (On x86, it is specified by kernel virtual
  address.) This is used to build domU.
  On Xen/x86 this is implemented by __direct_remap_pfn_range(), which is
  eventually invoked via
  ioctl("/proc/xen/privcmd", IOCTL_PRIVCMD_MMAP).

  1. reserve a range in the dom0 pseudo physical address space.
     Add a hyper-call to assign a page specified by machine address
     into the reserved pseudo physical address space.
     Then map it by remap_pfn_range().
     When unmapped, de-assign the assigned pages by overriding its
     vm_area_operations.

  2. replace a dom0's page with a foreign domain's page.
     Dom0 allocates a page and replaces its underlying machine page
     with a foreign domain's page by a hyper-call.
     Then it maps the page as a usual page.
     When unmapped, de-assign the assigned pages by overriding its
     vm_area_operations.
     The same mechanism which is used for the grant table (described
     below) can be used.

  3. others
     If there is a better implementation, please propose it.

  Option 2 is adopted because it requires less implementation effort
  than option 1. Its implementation is well isolated, so if needed it
  can later be replaced by option 1 with a small impact on other code.

- grant table
  The current grant table api depends deeply on xen/x86.
  There are four kinds of addresses related to the grant table: user
  virtual address, kernel virtual address, pseudo physical address and
  machine address.
  The xen/x86 grant table uses user virtual address, kernel virtual
  address and machine address.
  On the other hand, xen/ia64 can use pseudo physical address and
  machine address. Because xen/ia64 fully virtualizes the TLB, it is
  difficult for xen/ia64 to handle user/kernel virtual addresses.
  The virtual address related api might be emulated by xenLinux/ia64
  without xen/ia64. However, kernel virtual address is an issue.

  The right way is to re-define the grant table api, separating the
  arch-independent part from the arch-dependent part (or to define an
  entirely new clean replacement), and to rewrite the existing code
  including common xen code, xen/x86 code and xenLinux/x86 code.
  This might take a long time.

  Step 1
    Xen/ia64 part:
      Use pseudo physical address instead of kernel virtual address.
    xenLinux/ia64 part:
      Don't change the HYPERVISOR_grant_table_op() api.
      Instead, wrap HYPERVISOR_grant_table_op() and do the necessary
      conversion/work in the wrapper.
      Impose the restriction that only virtual addresses in the Linux
      identity mapping area can be used with the grant table api.
      The area (0xe0000000:00000000 - ) corresponds 1:1 to pseudo
      physical addresses. A virtual address in the area can be
      converted to a pseudo physical address by __pa().
      xenbus_map_ring_valloc() and xenbus_unmap_ring_vfree() need to
      be modified.

  Step 2
    Clean up the grant table api.
    Remove the virtual address assumption and introduce an
    arch-specific grant table address. It might be a virtual address on
    x86 and a pseudo physical address on ia64.
    Define arch-dependent conversions.
    Xen/PPC people should be involved.

- vbd, vnif and other xen virtual device drivers except blktap
  Once the grant table issues are resolved, vbd and vnif should work.
  At most, only arch-independent/arch-dependent separation should be
  needed.

- balloon
  TODO
  Similar to grant table page transfer.

- blktap
  TODO
  blktap uses the grant table with GNTMAP_application_map and
  GNTMAP_contains_pte. This should be emulated by xenLinux.

- Rusty's share
  This might need to be researched.
  This can be a clean inter domain communication API.

* Current status
  item              status
  dom0 builder      done
  ACPI              done
  mm I/O            done
  phys2mach         done
  dma api           done
  swiotlb           done
  iommu             not yet (the modification must be done.
                    testers are also needed)
  agp               not yet (the modification is done.
                    testers are needed)
  foreign mapping   work in progress
  grant table       work in progress
  vbd               work in progress
  vnif              work in progress
  balloon           not yet
  blktap            not yet

* Issues
- hyper-call
  Arch-specific hyper-calls need to be added. A convention for
  assigning their numbers must be determined with the xen core team.

- guest SMP
  This issue is not specific to the dom0 virtual physical model.
  In the current implementation the pseudo physical-to-machine
  conversion tables are not smp-protected. Perhaps it is assumed that a
  table is built at domain creation and is read-only after that.
  Once page flipping, or anything else that requires modifying the
  table, is introduced, protecting the table becomes a problem, and the
  corresponding tlb shoot down must be inserted very carefully.
  This issue is not addressed in the early phase of the dom0 virtual
  physical implementation. It will be addressed after the dom0 virtual
  physical model has been proven.

- page reference count
  The page reference counting effort is in progress.
  Some update of the page ref might be needed.

- transparent para-virtualization
  Some xen/ia64 developers value transparent para-virtualization.
  "if (running_on_xen) { }" can be used.
  Or is it worthwhile to define a switch?

- ski simulator
  The simscsi and simeth drivers for the ski simulator are broken.
  These should be fixed.

- tlb miss optimization
  This should be addressed in the future tuning phase.

- pseudo physical address to machine bus address conversion performance
  This should be addressed in the future tuning phase.

- guest domain page size < xen page size
  This is not supported yet by the current implementation, and it is
  not being addressed at this early stage of the dom0 virtual physical
  model. This is a future issue.

* Thanks
Many people contributed to this document.
All mistakes are mine.
In no particular order:
  Magenheimer Dan
  Yang, Fred
  Dong, Eddie
  Simon Horms
  Hirokazu Takahashi
  YAMAMOTO Takashi