Xen/ia64 dom0 virtual physical model design memo
2006 VA Linux Systems Japan K.K.

* Introduction
This document targets xen/ia64 developers and gives an overview of the
planned virtual physical implementation. It describes what the virtual
physical model is and the xen/ia64 dom0 virtual physical design, but it
doesn't explain basic Xen concepts.

* Terminology
Terms related to addresses are often used confusingly. For clarity, the
terms used in this document are explained in this section.

- VMM
  Virtual Machine Monitor.

- Virtual Processor (VP)
  The virtual physical model is sometimes called the VP model. However,
  VP as an abbreviation of Virtual Processor seems more popular.
  Although the two might be distinguished by context, to avoid
  confusion VP isn't used for Virtual Physical in this document.

- physical address
  A physical address refers to RAM in a non-virtualized environment.
  The CPU uses this address to access RAM.

- virtual address
  The address which a user process typically sees. This address is
  translated to a physical address by the MMU.

- bus address
  A bus address is used by I/O devices to refer to RAM. For example, a
  PCI bus address must be used to program a PCI bus master device to do
  DMA.
  On the x86 platform the conversion between machine address and bus
  address is usually trivial (i.e. bus address value == machine address
  value), but it is not right to assume that the conversion is always
  trivial. Some examples follow to clarify this.
  - An x86 box with a 32bit PCI bus and >4GB of memory.
    Memory beyond 4GB can't be addressed by a PCI bus address.
  - Even when bus address value == machine address value, the two may
    refer to different RAM.
  - An extreme example is a machine with an IOMMU. In an IOMMU
    environment, bus address means the address before IOMMU
    translation.

- machine address
  A machine address is used by a real CPU to refer to RAM in a
  virtualized environment. It corresponds to the physical address of a
  non-virtualized environment. Sometimes host physical address is used
  with the same meaning.

- pseudo physical address
  This is an address which a guest domain believes to be a physical
  address.
  Actually this address is translated in some way to a machine address
  by the VMM. Sometimes guest physical address is used with the same
  meaning, and sometimes physical address is also used with this
  meaning.

- machine bus address
  In a virtualized environment, machine bus address is used for the
  real bus address, to distinguish real bus addresses from virtualized
  ones. Although there is no real bus corresponding to a virtual
  device, the notion of a machine bus address is still useful.
  Usually the machine address is used as the machine bus address on a
  virtual bus, but this is not mandatory; other schemes are possible.

* Xen/ia64 dom0 virtual physical model
The purpose of the dom0 virtual physical model is to make xen/ia64
architecturally correct and, by doing so, to make future xen/ia64
development easier and reduce maintenance effort.
For example, the vUSB device driver, which is under development, and
other future virtual devices should be easy to adapt to Xen/ia64.
This issue was raised by the effort to reduce ia64-specific hacks and
to get VNIF working on xen/ia64. There are several ways to achieve
this; the virtual physical model was chosen at the 2006 winter Xen
summit.

xen/x86 is the development mainstream, and xen/ia64 may have to catch
up with xen/x86 development. So an appropriate point somewhere between
architectural correctness and xen/x86-ism may have to be found.

There are two kinds of address translation which can be
(para-)virtualized:
  virtual address <-> pseudo physical address <-> machine address
used by the OS virtual memory subsystem, and
  pseudo physical address <-> machine address <-> machine bus address
used by the OS I/O subsystem.
Since Xen/ia64 already fully virtualizes the TLB, the latter is the
issue. Unfortunately, machine bus address virtualization requires IOMMU
assistance, and IOMMUs are not widely available on ia64 platforms
(yet), so para-virtualization has to be adopted.
Here Xen virtual devices are regarded as a part of the I/O subsystem;
e.g. the grant table is a part of the I/O subsystem.
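
The I/O-side chain above (pseudo physical address -> machine address ->
machine bus address) can be sketched in C as follows. This is only an
illustration, not xen/ia64 code: every toy_* name, the 16KB page size,
the contents of the table and the identity machine-to-bus conversion
(a platform without an IOMMU) are assumptions made for the example.
Each kind of address gets its own struct type so that the compiler
rejects accidental mixing of the three address spaces.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 14   /* assumed 16KB page size */
#define TOY_PAGE_MASK  ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)
#define TOY_NR_PAGES   4

/* One distinct type per address space (all hypothetical). */
typedef struct { uint64_t val; } toy_pseudo_phys_t;
typedef struct { uint64_t val; } toy_machine_t;
typedef struct { uint64_t val; } toy_machine_bus_t;

/* Made-up pseudo physical frame -> machine frame assignment. */
static const uint64_t toy_p2m[TOY_NR_PAGES] = { 7, 3, 9, 4 };

/* Step 1: pseudo physical -> machine (needs help from the VMM). */
static toy_machine_t toy_pseudo_phys_to_machine(toy_pseudo_phys_t p)
{
    uint64_t pfn = p.val >> TOY_PAGE_SHIFT;
    toy_machine_t m;

    assert(pfn < TOY_NR_PAGES);
    m.val = (toy_p2m[pfn] << TOY_PAGE_SHIFT) | (p.val & TOY_PAGE_MASK);
    return m;
}

/* Step 2: machine -> machine bus (assumed identity, i.e. no IOMMU). */
static toy_machine_bus_t toy_machine_to_bus(toy_machine_t m)
{
    toy_machine_bus_t b = { m.val };
    return b;
}

/* The only conversion the dom0 I/O subsystem has to call. */
static toy_machine_bus_t toy_pseudo_phys_to_bus(toy_pseudo_phys_t p)
{
    return toy_machine_to_bus(toy_pseudo_phys_to_machine(p));
}
```

With the made-up table, an address in pseudo physical frame 1 ends up
in machine frame 3 with its page offset preserved.
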

The essence of the virtual physical model is that dom0 Linux needs only
the translation from pseudo physical address to machine bus address,
not to machine address. Machine bus addresses are only used by the OS
I/O subsystems for I/O, and Linux has well-defined I/O APIs, so it
should be easy to isolate the sources which do the conversion.
However, the correspondence between machine address and machine bus
address is maintained by dom0, not by Xen. Thus, in order to translate
from pseudo physical to machine bus addresses, dom0 needs a way to
convert from pseudo physical to machine addresses.

* detailed design
Add a pseudo physical to machine address conversion mechanism.
Make Linux I/O related files aware of machine addresses.

- domain0 builder
  The ACPI table area, the EFI ported I/O area and the EFI memory
  mapped I/O area have to be mapped into the dom0 pseudo physical
  address space in advance.

- tlb miss handler, tr/tc emulation
  In the current implementation, any tlb request whose page size > xen
  page size is accepted. In the virtual physical model, however, such a
  tlb request is broken down to xen page size. This will cause extra
  tlb misses. This can be mitigated by assuming that the low addresses
  in pseudo physical memory (say 0-64MB) are contiguous in machine
  address space and covering them with a single TLB entry. This gives
  coarse-grained TLB coverage at the bottom of memory and fine-grained
  coverage for the rest of memory.
  This isn't a focus right now, but will be addressed in the future
  tuning phase.

- machine address page lookup (ACPI table parse)
  Parsing ACPI tables requires reading pages pointed to by machine
  addresses. Fortunately ACPI tables live in the EFI runtime service
  data region, and xen/ia64 maps that region with pseudo physical ==
  machine when building dom0.

  On Linux/i386, __acpi_map_table() must be used to access ACPI tables,
  so ACPI table access could easily be hooked by adding a hyper-call to
  __acpi_map_table().
  Unfortunately, on Linux/ia64 __va() is abused instead of
  __acpi_map_table(). The right way is to fix the linux/ia64 ACPI code
  to use __acpi_map_table() and add a hyper-call to __acpi_map_table().
  But currently the EFI runtime service data region is mapped to dom0,
  since this is the easier way.

- EFI ported/memory-mapped I/O spaces
  The EFI memory mapped I/O region and the EFI memory mapped I/O port
  space are also mapped into the dom0 pseudo physical address space
  with pseudo physical == machine when building dom0.

- other I/O spaces
  EFI doesn't cover all I/O spaces, e.g. those of PCI devices. So add a
  hyper-call to map such I/O spaces into dom0 with pseudo physical ==
  machine.
  In theory pseudo physical != machine address is possible, but more
  code in xen would be required to maintain the dom0 I/O mapping.
  In Linux such I/O must be done via ioremap(), so it is easy to hook
  such I/O.

  Xen fakes up an EFI memory descriptor table and passes it to domain0.
  Memory areas must not overlap with I/O areas; however, the current
  implementation doesn't handle this, just because such a situation
  happens not to occur on my testing box. This issue will be addressed
  later.

- pseudo physical to machine address translation
  At first this will be implemented by a simple hyper-call. Once the
  dom0 virtual physical model is stabilized, it can be replaced by a
  table lookup or something similar.

- dma
  Linux dma related code must be modified to be machine address aware.
  At least the following files must be modified. I hope no more files
  need to be modified.

  - dma api
    include/asm-ia64/dma-mapping.h
    arch/ia64/kernel/machvec.c
    include/asm-ia64/dma.h
    include/asm-ia64/pci.h
    arch/ia64/pci/pci.c

  - swiotlb
    swiotlb.h, lib/swiotlb.c, scatterlist.h

  - iommu
    arch/ia64/hp/common/sba_iommu.c
    sgi sn (this won't be worked on here; it's sgi's work)

  - agp
    include/asm-ia64/agp.h

- grant table
  TODO

- vbd, vnif, balloon
  TODO

- Rusty's share
  This might need to be researched.
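
As a rough sketch of the "pseudo physical to machine address
translation" and "dma" items above, the following C fragment puts a
lazily filled lookup table in front of a translation hyper-call and
uses it from a dma-map style helper. The lazy fill is just one possible
reading of "a table lookup or something similar", not a decided design.
All toy_* names are hypothetical, the hyper-call is replaced by a
stand-in function with a made-up mapping (mfn = pfn + 100), the page
size is assumed to be 16KB, and machine bus address == machine address
is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT  14   /* assumed 16KB page size */
#define TOY_PAGE_MASK   ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)
#define TOY_NR_PAGES    8
#define TOY_INVALID_MFN UINT64_MAX

/* Counts stand-in hyper-calls so that the saving can be observed. */
static unsigned int toy_hypercall_count;

/* Stand-in for the arch-specific hyper-call; the mapping is made up. */
static uint64_t toy_hypercall_p2m(uint64_t pfn)
{
    toy_hypercall_count++;
    return pfn + 100;
}

/* dom0-side lookup table, filled on first use of each frame. */
static uint64_t toy_p2m_table[TOY_NR_PAGES];
static int toy_p2m_table_ready;

static uint64_t toy_pfn_to_mfn(uint64_t pfn)
{
    assert(pfn < TOY_NR_PAGES);
    if (!toy_p2m_table_ready) {
        for (unsigned int i = 0; i < TOY_NR_PAGES; i++)
            toy_p2m_table[i] = TOY_INVALID_MFN;
        toy_p2m_table_ready = 1;
    }
    if (toy_p2m_table[pfn] == TOY_INVALID_MFN)
        toy_p2m_table[pfn] = toy_hypercall_p2m(pfn);  /* slow path */
    return toy_p2m_table[pfn];
}

/* dma-map style helper: pseudo physical in, machine bus address out. */
static uint64_t toy_dma_map_single(uint64_t pseudo_phys)
{
    uint64_t mfn = toy_pfn_to_mfn(pseudo_phys >> TOY_PAGE_SHIFT);

    /* Assumption: machine bus address == machine address (no IOMMU). */
    return (mfn << TOY_PAGE_SHIFT) | (pseudo_phys & TOY_PAGE_MASK);
}
```

Such a table is only safe while the pseudo physical to machine mapping
does not change underneath it; the guest SMP issue below is about
exactly that case.
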

* Current status
item                    status
dom0 builder            done
ACPI                    done
mm I/O                  done
phys2mach               in progress
dma api                 in progress
swiotlb                 not yet
iommu                   not yet
agp                     not yet
grant table             not yet
vbd, vnif and balloon   not yet

* Issues
- Linux version
  At the time of writing, xen-unstable.hg and xen-ia64-unstable.hg are
  based on linux 2.6.16-rc2. On the other hand,
  xen-ia64-unstable-Intel.hg is based on the old linux 2.6.12. For ease
  of future merging, I'd like to base the work on linux 2.6.16-rc2.

- hyper-call
  An arch-specific hyper-call needs to be added. Some convention for
  assigning its number must be agreed with the xen core team.

- guest SMP
  This issue is not specific to the dom0 virtual physical model.
  In the current implementation the pseudo physical-to-machine
  conversion tables are not smp-protected. Perhaps it is assumed that a
  table is built at domain creation and is read-only after that. Once
  page flipping or something else that requires modifying the table is
  introduced, protection of the table becomes a problem, and the
  corresponding tlb shoot down must be inserted very carefully.
  In the early phase of the dom0 virtual physical implementation this
  issue is not addressed. It will be addressed after the dom0 virtual
  physical model has been proven.

- page reference count
  The page reference counting effort is in progress. Some updates to
  the page ref handling might be needed.

- tlb miss optimization
  This should be addressed in the future tuning phase.

- pseudo physical address to machine bus address conversion performance
  This should be addressed in the future tuning phase.

- guest domain page size < xen page size
  This is not supported yet by the current implementation, and it isn't
  being addressed at this early stage of the dom0 virtual physical
  model. It is a future issue.
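
To make the guest SMP issue above a little more concrete, here is a
minimal C11 sketch of one way a pseudo physical-to-machine table could
tolerate concurrent readers while an entry is being flipped: every
entry is an atomic word, readers load it, and the flip exchanges it.
This is not how xen/ia64 does it and is not a proposal. All toy_* names
are hypothetical, and the tlb shoot down that must follow a flip is
deliberately left out; the flip only returns the old frame so that a
caller would know what to purge.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_NR_PAGES 4

/* One atomic machine frame number per pseudo physical frame
 * (zero-initialized static storage). */
static _Atomic uint64_t toy_p2m[TOY_NR_PAGES];

/* Reader side: may run on any vcpu at any time. */
static uint64_t toy_p2m_read(uint64_t pfn)
{
    assert(pfn < TOY_NR_PAGES);
    return atomic_load_explicit(&toy_p2m[pfn], memory_order_acquire);
}

/*
 * Writer side (e.g. page flipping): install new_mfn and return the
 * previous frame. The caller would still have to purge stale tlb
 * entries for the old frame; that step is not modelled here.
 */
static uint64_t toy_p2m_flip(uint64_t pfn, uint64_t new_mfn)
{
    assert(pfn < TOY_NR_PAGES);
    return atomic_exchange_explicit(&toy_p2m[pfn], new_mfn,
                                    memory_order_acq_rel);
}
```

An atomic exchange only keeps a single table entry consistent. Anything
that has to stay coherent across several entries, or with the tlb
contents, would need a real lock or a more careful protocol.
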