WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 
Xen 
 
Home Products Support Community News
 
   
 

xen-devel

Re: [Xen-devel][PATCH][RFC] Supporting Enlightened Windows 2008Server

To: "Steven Smith" <steven.smith@xxxxxxxxxx>
Subject: Re: [Xen-devel][PATCH][RFC] Supporting Enlightened Windows 2008Server
From: "Ky Srinivasan" <ksrinivasan@xxxxxxxxxx>
Date: Sun, 13 Apr 2008 13:04:37 -0600
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx, Tim Deegan <Tim.Deegan@xxxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
Delivery-date: Sun, 13 Apr 2008 12:05:12 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxx
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Steven,

Thanks for the detailed review. I will address the issues you have raised as 
part of my next cleanup of these patches. I have responded to your comments 
in-line.

Regards,

K. Y

>>> Steven Smith <steven.smith@xxxxxxxxxx> 04/10/08 7:23 AM >>> 
> I have a couple of comments on the new patches:


> Would you mind generating diffs with -p, please?  It makes them a fair
> bit easier to read.

Will do.

>  
>  static void svm_vmexit_do_cpuid(struct cpu_user_regs *regs)
>  {
> -    unsigned int eax, ebx, ecx, edx, inst_len;
> +    unsigned int eax, ebx, ecx, edx, inst_len, input;
>  
>      inst_len = __get_instruction_length(current, INSTR_CPUID, NULL);
>      if ( inst_len == 0 ) 
> ...
> @@ -968,6 +970,7 @@
>      regs->ecx = ecx;
>      regs->edx = edx;
>  
> +    hyperx_intercept_do_cpuid(input, regs);
>      __update_guest_eip(regs, inst_len);
>  }
>
  
> Would it be easier to put this bit in hvm_cpuid, or maybe
> cpuid_hypervisor_leaves?  That would avoid needing to make the same
> change to both the VMX and SVM CPUID infrastructure.

Good point. I will look into it.



> @@ -984,6 +987,10 @@
>      struct vcpu *v = current;
>      struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
>  
> +    if (hyperx_intercept_do_msr_read(ecx, regs))
> +    {
> +            goto done;
> +    }
>      switch ( ecx )
>      {
>      case MSR_IA32_TSC:
> 
> Likewise, it seems like that could be done better in
> rdmsr_hypervisor_regs().

Yep.

> @@ -1085,6 +1092,10 @@
>      msr_content = (u32)regs->eax | ((u64)regs->edx << 32);
>  
>      hvmtrace_msr_write(v, ecx, msr_content);
> +    if (hyperx_intercept_do_msr_write(ecx, regs))
> +    {
> +            goto done_msr_write;
> +    }
>  
>      switch ( ecx )
>      {

> Or wrmsr_hypervisor_regs().

Will do.





> @@ -2210,6 +2226,10 @@
>                  if ( a.value > HVMPTM_one_missed_tick_pending )
>                      goto param_fail;
>                  break;
> +            case HVM_PARAM_EXTEND_HYPERVISOR:
> +printk("KYS: EXTEND hypervisor id is %d\n", (int)a.value);
> +                if ((a.value == 1) && hyperv_initialize(d)) 
> +                    goto param_fail;
>              }
>              d->arch.hvm_domain.params[a.index] = a.value;
>              rc = 0;
> 

> It might be a good idea to fail with -EINVAL or something if
> a.value > 1, so that if anyone ever introduces another hypervisor
> extension it's easy for the tools to tell whether it's available or
> not.

Will do.


> Index: xen-unstable.hg/xen/arch/x86/hvm/save.c
> ===================================================================
> --- xen-unstable.hg.orig/xen/arch/x86/hvm/save.c2008-04-04 14:28:26.000000000 
> -0400
> +++ xen-unstable.hg/xen/arch/x86/hvm/save.c2008-04-04 14:28:44.000000000 -0400
> @@ -23,6 +23,8 @@
>  
>  #include <asm/hvm/support.h>
>  #include <public/hvm/save.h>
> +#include <public/hvm/params.h>
> +#include <asm/hvm/hvm_extensions.h>
>  
>  void arch_hvm_save(struct domain *d, struct hvm_save_header *hdr)
>  {
> 

> The only change made to this file was to add some #include's.  Why
> were they necessary?

This was necessary at some point in the past. This is the vestiges of a partial 
cleanup! This will be fixed.



> Index: xen-unstable.hg/xen/include/public/arch-x86/hvm/save.h
> ===================================================================
> --- xen-unstable.hg.orig/xen/include/public/arch-x86/hvm/save.h2008-04-04 
> 14:28:26.000000000 -0400
> +++ xen-unstable.hg/xen/include/public/arch-x86/hvm/save.h2008-04-04 
> 14:28:44.000000000 -0400
> @@ -38,7 +38,7 @@
>      uint32_t version;           /* File format version */
>      uint64_t changeset;         /* Version of Xen that saved this file */
>      uint32_t cpuid;             /* CPUID[0x01][%eax] on the saving machine */
> -    uint32_t pad0;
> +    uint32_t pad0;/* extension ID */
>  };
>  
>  DECLARE_HVM_SAVE_TYPE(HEADER, 1, struct hvm_save_header);
> 
> I think it might be a good idea to give the field a more
> descriptive name than ``pad0'' if it's being used for something
> other than padding.

This again is the result of a partial cleanup. I was stashing away the 
extension ID and Tim rightly pointed out that we had enough state and did not 
need the extension ID in the save record. This will be cleaned up.

> It might be an even better idea to move the ID out of the header
> entirely and put it in its own HVM_SAVE_TYPE.  You already have
> hvm_hyperv_dom; can you use that to identify the presence or
> absence of the extensions?

I will look into this.




> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_errno.h2008-03-26 
> 13:56:39.000000000 -0400
> @@ -0,0 +1,4 @@
...
> +/*
> + * hv_errno.h
> + * Error codes for the  Novell Shim.
> 
>> These are just the Viridian error codes, aren't they, rather than
>> something specific to the Novell shim?

Yes. This will be fixed.


> + *
> + * Engineering Contact: K. Y. Srinivasan
> + */
> +
> +#ifndef HV_ERRNO_H
> +#define HV_ERRNO_H
> +
> +#define HV_STATUS_SUCCESS0x0000
> +#define HV_STATUS_INVALID_HYPERCALL_CODE0x0002
> +#define HV_STATUS_INVALID_HYPERCALL_INPUT0x0003
> +#define HV_STATUS_INVALID_ALIGNMENT0x0004
> +#define HV_STATUS_INVALID_PARAMETER0x0005
...
> +#endif 
> Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.c
> ===================================================================
> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.c2008-03-26 
> 14:18:32.000000000 -0400
> @@ -0,0 +1,13 @@
...
> +void hv_collect_stats(int event, hv_vcpu_stats_t *statsp)
> +{
> +switch (event) {
> +case HV_CSWITCH:
> +statsp->num_switches++;
> +return;
> +case HV_FLUSH_VA:
> +statsp->num_flushes++;
> +return;
> +case HV_FLUSH_RANGE:
> +statsp->num_flush_ranges++;
> +return;
> That looks a bit pointless.  Would it not be easier to just inline
> this whole function?

Now that, I have dropped all the TLB flush enlightenments, I will clean this up.


...
> +}
...
> +static int
> +hv_get_vp_registers(paddr_t input, paddr_t output)
> +{
> +hv_vcpu_t        *vcpup, *targetp;
> +hv_partition_t   *curp = hv_get_current_partition();
> +get_vp_registers_input_t*inbuf;
> +get_vp_registers_output_t*outbuf;
> +struct vcpu_guest_context*vcpu_ctx;
> +u32*reg_indexp;
> +get_vp_registers_output_t*out_regp;
> +u32num_output_bytes = 0;
> +
> +        vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
> +inbuf = vcpup->input_buffer;
> +outbuf = vcpup->output_buffer;
> +out_regp = outbuf;
> +/*
> + * Copy the input data to the per-cpu input buffer.
> + * This may be an overkill; obviously it is better to only
> + * copy what we need. XXXKYS: Check with Mike.
> + */
> Who's Mike?  What did he say?

Mike is  the MSFT contact. 




> +if (hvm_copy_from_guest_phys(inbuf, input, PAGE_SIZE)) {
> +return (HV_STATUS_INVALID_ALIGNMENT);
> +}
...
> +/*
> + * Now that we have the register state; select what we want and
> + * populate the output buffer.
> + */
> +reg_indexp = &inbuf->reg_index;
> +while (*reg_indexp != 0) {
> +switch(*reg_indexp) {
> +/*
> + * XXXKYS: need mapping code here; populate
> + * outbuf.
> + */
> +panic("hv_get_vp_registers not supported\n");

> Yeah, that's not really good enough.  It would be better to just
> not support this hypercall at all than have an implementation which
> always crashes Xen.

Early on, I decided what Hypercalls I would support; and from this list I 
wanted 
to only support those that were used by windows 2008 server. I  had put in a 
bunch of panics to trap all the hypercalls that the guest would invoke. This 
hypercall is not used. Thus the implementation is incomplete. I will clean this 
up.

> +}
> +reg_indexp++;
> +out_regp++ ;/*128 bit registers */
> +num_output_bytes +=16;
> +if ((char *)reg_indexp > ((char *)inbuf + PAGE_SIZE)) {
> +/*
> + *input list not reminated correctly; bail out.
> + */
> +panic("hv_get_vp_registers:input list not terminated\n"); 
> +break;

> It would be nice if this caused the hypercall to fail, rather than
> crashing Xen.

Will be cleaned up.



> +}
> +}
> +if (hvm_copy_to_guest_phys(output, outbuf, num_output_bytes)) {
> +/* Some problem copying data out*/
> +panic("hv_get_vp_registers:copyout problem\n");
 
> And again.

Will be cleaned up.


> +}
> +xfree(vcpu_ctx);
> +return (HV_STATUS_SUCCESS);
> +}

> Has this hypercall had any testing at all?  If it hasn't, then what
> is it doing here?

As I noted earlier, win2k8 does not currently use this API. I will clean this 
up.



> +
> +static int
> +hv_set_vp_registers(paddr_t input, paddr_t output)

> Again, I don't believe this function has ever been called.  That
> suggests that this may not be a high-value optimisation suitable
> for inclusion in an initial version.

As I noted earlier, win2k8 does not currently use this API. I will clean this 
up.



...
> 
> +static int
> +hv_switch_va(paddr_t input)
> +{
> +hv_partition_t   *curp = hv_get_current_partition();
> +        hv_vcpu_t *vcpup = &curp->vcpu_state[hv_get_current_vcpu_index()];
> +
> +/*
> + * XXXKYS: the spec sys the asID is passed via memory at offset 0 of 
> + * the page whose GPA is in the input register. However, it appears 
> + * the current build of longhorn (longhorn-2007-02-06-x86_64-fv-02)
> + * passes the asID in the input register instead. Need to check if 
> + * future builds do this.
> + */
> +hvm_set_cr3(input); 
> +HV_STATS_COLLECT(HV_CSWITCH, &vcpup->stats);
> +return (HV_STATUS_SUCCESS);
> +}

> Do you have any evidence that this is actually faster than just
> doing the CR3 write in the guest?  It's a win on real Viridian
> because you avoid blowing the shadow page table cache, but I don't
> think that applies on Xen's current shadow page table
> implementation.

As I noted earlier, initially I identified all the hypercalls that made for a 
win2k8 guest on our platform. From this list I implemented those that the guest 
actually invoked. I implemented all the hypercalls that the guest was invoking 
without regard to the performance value of the enlightenment. I agree with you 
that this may be more important on Veridian than on our platform. 



> +static int
> +hv_flush_va(paddr_t input)
> +{
...
> +/*
> + * Now operate on what we are given
> + * XXXKYS: For now we are ignoring as_id and fushGlobal flag.
> + * May have to revisit this. But first stash away the processed 
> + * parameters for subsequent use.
> + */
> +flush_argp->as_handle = as_id;
> +flush_argp->flags = flush_global;
> +flush_argp->v_mask = vcpu_mask;
> +
> +
> +ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, 0);
> +hv_set_syscall_retval(guest_cpu_user_regs(),
> +                                   curp->long_mode_guest,
> +                                   ret_val);
> +HV_STATS_COLLECT(HV_FLUSH_VA_STAT, &cur_vcpup->stats);
> +panic("hv_flush_va not supported\n"); 

> So if this hypercall is ever correctly used by the guest, Xen will
> crash?

In the original patches that I posted, this hypercall was implemented. As part 
of my first round of cleanup, I removed the support for the TLB  flush 
hypercalls. As part of this cleanup, I also disabled TLB related 
enlightenments. This will be cleaned up.

> +return (HV_STATUS_SUCCESS);
> +}
> +
> +static int
> +hv_flush_va_range(paddr_t input, unsigned short start_index, 
> +unsigned short rep_count, unsigned short *reps_done)
> +{
...
> +ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, rep_count);
> +hv_set_syscall_retval(guest_cpu_user_regs(),
> +                                   curp->long_mode_guest,
> +                                   ret_val);
> +
> +
> +HV_STATS_COLLECT(HV_FLUSH_RANGE, &cur_vcpup->stats);
> +panic("hv_flush_vaRange not supported\n");
 
> And again.

Will be cleaned up.


> +return (HV_STATUS_SUCCESS);
> +}
> +
> +void
> +hv_handle_hypercall(u64 opcode, u64 input, u64 output, 
> +  u64 *ret_val)

> Would it be easier to just return the return value?

Perhaps.


> +{
...
> +verb = (short)(opcode & 0xffff);
> +rep_count = (short)((opcode >>32) & 0xfff);
> +start_index = (short)((opcode >> 48) & 0xfff);
> +switch (verb) {
> +case HV_CREATE_PARTITION:
> +/*
> + * Xen only allows dom0 to create domains.
> + */
> +*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
> +return;
> +case HV_INITIALIZE_PARTITION:
> +/*
> + * We don't support this.
> + */
> +*ret_val = hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
> +return;

> Would it be easier to just make HV_STATUS_ACCESS_DENIED the
> default?  Actually, my reading of the spec is that you should
> return HV_STATUS_INVALID_HYPERCALL_CODE if the guest makes a
> hypercall which you don't support, but I can't imagine it makes a
> great deal of difference.

Perhaps not. 


...
> +case HV_GET_PARTITION_ID:
> +if (!hv_privilege_check(curp, HV_ACCESS_PARTITION_ID)) {
> +*ret_val = 
> +hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);
> +return;
> +}
> +partition_id = (u64)current->domain->domain_id;
> +if (hvm_copy_to_guest_phys(output, 
> +&partition_id, 8)) {
> +/*
> + * Invalid output area.
> + */
> +*ret_val = 
> +hv_build_hcall_retval(HV_STATUS_ACCESS_DENIED, 0);

> I think you mean HV_STATUS_INVALID_ALIGNMENT.

Agreed.


> +return;
> +}
> +*ret_val = hv_build_hcall_retval(HV_STATUS_SUCCESS, 0);
> +return;
> 
> +case HV_GET_VP_REGISTERS:
> +*ret_val = hv_build_hcall_retval(
> +hv_get_vp_registers(input, output), 0);
> +return;
> +case HV_SET_VP_REGISTERS:
> +*ret_val = hv_build_hcall_retval(
> +hv_set_vp_registers(input, output), 0);
...
> +case HV_FLUSH_VA:
> +*ret_val = 
> +hv_build_hcall_retval(hv_flush_va(input), 0);
> +return;
> +case HV_FLUSH_VA_LIST:
> +value  = hv_flush_va_range(input, start_index, 
> +rep_count, &reps_done);
> +*ret_val = hv_build_hcall_retval(value, reps_done);  
> +return;

> As mentioned above, your implementations of these hypercalls are
> really quite broken.

These hypercalls are not invoked by the guest and that is the reason they are 
broken. I will clean it up though.



...
> 
> +}
> +}
> Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.h
> ===================================================================
> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_hypercall.h2008-03-26 
> 13:56:39.000000000 -0400
> @@ -0,0 +1,21 @@
> +
> +/*
> + * nshypercall.h
> + * Memory layouts for the various hypercalls supported. 
> + *
> + * Engineering Contact: K. Y. Srinivasan
> + */
> +
> +#ifndef HV_HYPERCALL_H
> +#define HV_HYPERCALL_H
> +
> +#include <xen/cpumask.h>
> +
> +
> +typedef struct get_vp_registers_input {
> +u64partition_id;
> +u64vp_index;
> +u32reg_index;
> +u32pad1;
> +u64pad2;
> +} get_vp_registers_input_t;

> My reading of the 0.83 specification is that this should have been:

> typedef struct get_vp_registers_input {
> u64 partition_id;
> u32 vp_index;
> u32 pad;
> u32 reg_index[0];
> } get_vp_registers_input_t;

> Is my copy of the specification out of date?

Nope. 



> +typedef struct set_vp_register_spec {
> +u32reg_name;
> +u32pad;
> +u64pad1;
> +u64low_value;
> +u64high_value;
> +} set_vp_register_spec_t;
> +
> +typedef struct set_vp_registers_input {
> +u64partition_id;
> +u64vp_index;
> +set_vp_register_spec_treg_spec;
> +} set_vp_registers_input_t;

> Again, the 0.83 spec says that vp_index should be a u32, and should
> be followed by a u32 MBZ pad.
Right.

> +
> +
> +typedef struct flush_va {
> +u64as_handle;
> +u64flags;
> +union  {
> +u64processor_mask;
> +cpumask_t vcpu_mask;
> +} proc_mask;
> +#define p_mask proc_mask.processor_mask
> +#define v_maskproc_mask.vcpu_mask
> +u64gva;
> +} flush_va_t;

> Is there something wrong with anonymous unions?



> +/*
> + * Hypercall verbs.
> + */
> +
> +#define HV_CREATE_PARTITION 0x0010
> +#define HV_INITIALIZE_PARTITION 0x0011
> +#define HV_DELETE_PARTITION0x0014

> Is it really necesary to include #define's for hypercalls which we
> don't support and will probably never support?

No; I will clean this file up.


> Index: xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_intercept.c
> ===================================================================
> --- /dev/null1970-01-01 00:00:00.000000000 +0000
> +++ xen-unstable.hg/xen/arch/x86/hvm/hyperv/hv_intercept.c2008-03-26 
> 14:26:38.000000000 -0400
> @@ -0,0 +1,0 @@
> 
> +/*
> + * Local includes; extension specific.
> + */
> +#include "hv_errno.h"
> +#include "hv_shim.h"
> +
> +
> +/*
> + * Implement the Hyperv Shim.
> + */
> +
> +extern struct cpuinfo_x86 boot_cpu_data;

> Is there something wrong with processor.h?

This will be fixed.

> +extern struct hvm_mmio_handler vlapic_mmio_handler;

> We should probably move that to a header file somewhere.

Will do.

> +static inline void *
> +get_virt_from_gmfn(struct domain *d, unsigned long gmfn)
> +{
> +unsigned long mfn = gmfn_to_mfn(d, gmfn);
> +if (mfn == INVALID_MFN) {
> +return (NULL);
> +}
> +return (map_domain_page_global(mfn));
> +}

> Is it really necessary to use map_domain_page_global() here?
> map_domain_page() is usually quite a bit faster.  To be honest, I'm
> not convinced that this wrapper function has any real value anyway.

I will look into this.


> +static inline unsigned long
> +get_mfn_from_gva(unsigned long va)
> +{
> +uint32_t pfec = PFEC_page_present;
> +unsigned long gfn;
> +gfn = paging_gva_to_gfn(current, va, &pfec);
> +return (gmfn_to_mfn((current->domain), gfn));
> +}

> That's only valid if you can guarantee that the resulting MFN will
> never be written to.

Yes. I use this to figure out if the hypercall is from the PV drivers in the 
guest or the hyparcall from the guest.


> +static inline void 
> +hv_write_hypercall_msr(hv_partition_t *curp,
> +  hv_vcpu_t*cur_vcpu,
> +  u64msr_content)
> +{
...
> +hv_hypercall_page_initialize(hypercall_page, curp);
> +curp->hypercall_mfn = gmfn_to_mfn(d, gmfn);
> +#ifdef CONFIG_DOMAIN_PAGE
> +unmap_domain_page_global(hypercall_page);
> +#endif

> unmap_domain_page_global() is #define'd to nothing #ifndef
> CONFIG_DOMAIN_PAGE, so you don't need the #ifdef.

Yes.


> +curp->hypercall_msr = msr_content;
> +spin_unlock(&curp->lock);
> +cur_vcpu->flags |= HV_VCPU_UP;
> +}
> +
> +
> +static inline void hv_write_sx_msr(uint32_t idx, hv_partition_t *curp,
> +  hv_vcpu_t*cur_vcpu,
> +  u64msr_content)
> +{
...
> +sx_page = get_virt_from_gmfn(d, gmfn);
> +if (sx_page == NULL) {
> +/*
> + * The guest specified a bogus GPA; inject a GP fault
> + * into the guest.
> + */
> +hv_inject_exception(TRAP_gp_fault);

> The spec says tha SIEFP can be outside the physical address space,
> and you're just supposed to deal with it.  The guest will be unable
> to access the frame, but the MSR write is supposed to succeed
> (section 14.6.3 of version 0.83).  Likewise the SIMP (14.6.4).

You are right. As it turns out, the win2k8 guest on our platform never runs 
into this situation. I will fix this.


> +return;
> +}
> +switch (idx) {
> +case HV_MSR_SIEFP:
> +hv_init_event_page(sx_page);
> +cur_vcpu->siefp_msr = msr_content; 
> +cur_vcpu->sief_page = sx_page; 
> +break;
> +case HV_MSR_SIMP:
> +hv_init_message_page(sx_page);
> +cur_vcpu->simp_msr = msr_content;
> +cur_vcpu->sim_page = sx_page;
> +break;
> +}
> +
> +}
> +

> +
> +/*
> + * Time this domain booted.
> + */
> +s_time_t hv_domain_boot_time;
> Time which domain booted?  What if you have two domains which both
> use the HyperV extensions?

Oops! Will fix this.



> +static inline void
> +hv_inject_exception(int trap)
> +{
> +hvm_funcs.inject_exception(trap, 0, 0);
> +}
> hvm_inject_exception() might be a better choice.

Ok.

> +static inline void 
> +hv_set_partition_privileges(hv_partition_t *hvpp)
> +{
> +/*
> + * This is based on the hypervisor spec under section 5.2.3. 
> + */
> +hvpp->privileges = 0x000000020000007f;
> +}
> So you allow AccessVpRuntime, AccessTimeCounters, AccessSynicMsrs,
> AccessSyntheticTimers, AccessApicMsrs, AccessHypercallMsrs,
> AccessVpIndex, and AccessSelfPartitionId?  That sounds like a
> pretty reasonable set, but it'd be nice if you'd documented the
> fact rather than just dropping in a magic number.  Something like
> this, maybe:

> hvpp->privileges = HV_PRIV_ACCESS_VP_RUNTIME |
>   HV_PRIV_ACCESS_TIME_COUNTERS |
>                  HV_PRIV_ACCESS_SYNIC_MSRS |
                   ...

Ok.

> +static inline u32
> +hv_get_recommendations(void)
> +{
> +/*
> + *For now we recommend all the features. Need to validate.
> + */
> +if ( paging_mode_hap(current->domain)) {
> +/*
> + * If HAP is enabled; the guest should not use TLB flush
> + * related enlightenments.
> + */
> +return (0x19);
> +} else {
> +/*
> + * For now disable TLB flush enlightenments.
> + */
> +return (0x19); 
> +}
> +}

> So you recommend the use of hypercalls for address switches, MSRs
> for APIC registers, and MSRs for rebooting?  The first two make
> sense, but I'm not so sure that enlightened reboot is a useful
> optimisation.

Originally, I had implemented TLB flush enlightenments as well. Based on some 
of the performance analysis we are currently doing, I may re-introduce them. As 
I noted earlier, I just wanted to implement all the features that made sense 
for the guest on our platform - hence I implemented reboot as well!

> Again, it would have been useful if the actual set of features was
> documented rather than just using a magic number.

Yep.

> +static inline void 
> +hv_set_partition_features(hv_partition_t *hvpp)
> +{
> +hvpp->supported_features = 0x1f;
> +}
>i.e. VP_RUNTIME | PARTITION_REF_COUNT | SYNIC MSRS | SYNTHETIC_TIMERS |
>     APIC_MSRS | HYPERCALL_MSRS.

> It's kind of surprising that you recommend guests use the reboot MSRs,
> but then don't claim to support it.  Was that deliberate?

> Actually, looking some more, it doesn't look like you ever use the
> supported_features field, and have instead hardcoded the value to
> 0xff everywhere you need it.

Will fix this.

> +
> +static inline u16 
> +hv_get_guest_major(void)
> +{
> +//KYS: Check!
> +return (0);
> +}
> +static inline u16
> +hv_get_guest_minor(void)
> +{
> +//KYS: Check!
> +return (0);
> +}
> +static inline u32
> +hv_get_guest_service_pack(void)
> +{
> +//KYS: Check!
> +return (0);
> +}
> Err... are these supposed to return the guest major/minor/sp
> reported in the identification MSR?

Yep. Was not quite sure what numbers made sense!



> +static inline void
> +hv_read_icr(u64 *icr_content)
...
> +static inline void
> +hv_read_tpr(u64 *tpr_content)
...
> +static inline void
> +hv_write_eoi(u64 msr_content)
...
> +static inline void
> +hv_write_icr(u64 msr_content)
...
> +static inline void
> +hv_write_tpr(u64 msr_content)
...

> These all have really weird implementations and lots of gratuitous
> use of hardcoded magic numbers (have a look at apicdef.h).

Will fix it.


> +static void
> +hv_timeout_handler(void *arg)
> +{
...
> +/*
> + * First post the message and then optionally deal with the 
> + * interrupt notification.
> + */
> +if (cur_vcpu->sim_page == NULL) {
> +panic("Novell Shim: Sim page not setup\n");
> +}

> So if a guest requests a timeout before they've set up the sim
> page, Xen crashes?  What if they clear the sim page after setting
> up a timeout, but before it fires?

I will fix this. As I noted earlier, win2k8 currently does not use any of these 
features.

...
> +}
> +

> +int
> +hyperv_do_hypercall(struct cpu_user_regs *pregs)
> +{
> +hv_partition_t*curp = hv_get_current_partition();
> +hv_vcpu_t        *vcpup;
> +intlong_mode_guest = curp->long_mode_guest;
> +unsigned long hypercall_mfn;
> +unsigned long gmfn;
> +gmfn = (curp->hypercall_msr >> 12);
> +
> +hypercall_mfn = get_mfn_from_gva(pregs->eip);

> That's exciting.  Do you expect that guests will make hypercalls
> from anywhere other than the Viridian hypercall page?

We support PV drivers for win2k8 guests. Given that both HyperV and Xen use the 
same hypercall numbers to implement different functionality, I needed a way to 
differentiate HyperV hypercalls. I use the mfn of the hypercall page to figure 
out who should handle the hypercall.

...

> +int
> +hyperv_vcpu_initialize(struct vcpu *v)
> +{
...
> +/*
> + * Setup the input page for handling hypercalls.
> + *
> + */
> +vcpup->input_buffer_page = 
> +alloc_domheap_page(NULL);
...
> +vcpup->input_buffer =
> +get_virt_from_page_ptr(vcpup->input_buffer_page);
...
> +vcpup->output_buffer_page = 
> +alloc_domheap_page(NULL);
...
> +vcpup->output_buffer =
> +get_virt_from_page_ptr(vcpup->output_buffer_page);

> What exactly are these used for?  The only place I can see it
> referenced are immediately before a panic(), which seems a bit
> pointless.

These were used in implementing TLB flush enlightenments. I got rid of those 
hypercalls in the current patch set. I will clean these up.


> +vcpup->xen_vcpu = v; 
> +
> +return (0);
> +}
> +

> +
> +static int 
> +hyperv_dom_restore(struct domain *d, hvm_domain_context_t *h)
> +{
> +struct hvm_hyperv_dom ctxt;
> +hv_partition_t*curp;
> +
> +if ( hvm_load_entry(HYPERV_DOM, h, &ctxt) != 0 )
> +        return -22;
> +d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR] = ctxt.ext_id;
> +if ((ctxt.ext_id == 0) || (ctxt.ext_id > 1)) {
> +return 0;
> +}
> +if (hyperv_initialize(d)) {
> +return -22;
> +}
> ????  Did you mean -EINVAL there?

Yep.

> +curp = d->arch.hvm_domain.hyperv_handle;
> +
> +curp->guest_id_msr = ctxt.guestid_msr;
> +curp->hypercall_msr = ctxt.hypercall_msr;
> +curp->long_mode_guest = ctxt.long_mode;
> +curp->hypercall_mfn =
> +gmfn_to_mfn(d, (ctxt.hypercall_msr >> 12));
> +
> +return 0; 
> +}
> +

> +static int
> +hv_preprocess_cpuid_leaves(unsigned int input, struct cpu_user_regs *regs)
> +{
> +uint32_t idx;
> +struct domain*d = current->domain;
> +intextid = d->arch.hvm_domain.params[HVM_PARAM_EXTEND_HYPERVISOR];
> +
> +if (extid == 1) {
> +/*
> + * Enlightened Windows guest; need to remap and handle 
> + * leaves used by PV front-end drivers.
> + */
> +if ((input >= 0x40000000) && (input <= 0x40000005)) {
> +return (0);
> +}
> +/*
> + * PV drivers use cpuid to query the hypervisor for details. On
> + * Windows we will use the following leaves for this:
> + *
> + * 4096: VMM Sinature (corresponds to 0x40000000 on Linux)
> + * 4097: VMM Version (corresponds to 0x40000001 on Linux)
> + * 4098: Hypercall details (corresponds to 0x40000002 on Linux)
> + */

> I think it would be better to just unconditionally duplicate the
> Xen leaves at 0x40001000, regardless of whether the viridian
> enlightenments are enabled.  That would make it much easier to
> write PV drivers which work regardless of whether the
> enlightenments are enabled, and if we can do that then it might (in
> a few years' time) be possible to default to enlightments-on for
> all domains.  That isn't really an option when there's no common
> Xen API.

> This would also mean that you don't need to duplicate the whole of
> hypervisor_cpuid_leaves() in the Viridian implementation.

I will look into cleaning this up.

...
> +}
> +
> +int 
> +hyperv_do_cpu_id(unsigned int input, struct cpu_user_regs *regs)
> +{
> +uint32_t idx;
> +
> +/*
> + * hvmloader uses cpuid to set up a hypercall page; we don't want to
> + * intercept calls coming from the bootstrap (bios) code in the HVM 
> + * guest; we discriminate based on the instruction pointer.
> + */
> +if (hv_call_from_bios(regs)) {
> +/*
> + * We don't intercept this.
> + */
> +return (0);
> +}
> +
> +if (input == 0x00000001) { 
> +regs->ecx = (regs->ecx | 0x80000000);
> +return (1);
> +} 

> I think we should be doing this even when the shim is disabled.
> That bit is supposed to indicate ``running on a VMM'', not
> ``running on a Viridian-compatible VMM'', and it has the side
> effect of making Windows much more tolerant of vcpus getting
> scheduled at different rates.

Good point.

> +if (hv_preprocess_cpuid_leaves(input, regs)) {
> +return (0);
> +}
> +idx = (input - 0x40000000);
> +
> +switch (idx) {
> +case 0:
> +/*
> + * 0x40000000: Hypervisor identification. 
> + */
> +regs->eax = 0x40000005; /* For now clamp this */
> +regs->ebx = 0x65766f4e; /* "Nove" */ 
> +regs->ecx = 0x68536c6c; /* "llSh" */
> +regs->edx = 0x76486d69; /* "imHv" */ 
> +break;

> I think this is supposed to identify the hypervisor (Xen), rather
> than the Viridian implementation (Novell shim).

You are right.

...
> +case 5:
> +/*
> + * 0x40000005: Implementation limits.
> + * Currently we retrieve maximum number of vcpus and 
> + * logical processors (hardware threads) supported.
> + */
> +regs->eax = hv_get_max_vcpus_supported();
> 
> +regs->ebx = hv_get_max_lcpus_supported();
> a) ebx is marked as reserved in the specification, and
> b) it doesn't make a great deal of sense to expose the maximum
>   number of physical CPUs to a virtualised guest anyway.

I thought this is what the spec mandated.  I will look into this.

> +regs->ecx = 0; /* Reserved */
> +regs->edx = 0; /* Reserved */
> +break; 
> +
> +default:
> +/*
> + * We don't handle this leaf.
> + */
> +return (0);
> +
> +}
> +return (1);
> +}
> +
> +int
> +hyperv_do_rd_msr(uint32_t idx, struct cpu_user_regs *regs)
> +{
...
> +if (extid > 1) {
> +/*
> + * For now this is all other "Enlightened" operating systems
> + * other than Longhorn.
> + */
> +if (idx == 0x40000000) {
> +/*
> + * PV driver hypercall setup. Let xen handle this.
> + */
> +return (0);
> +}
> +if (idx == 0x40001000) {
> +idx = 0x40000000;
> +}

> Eh?  What's this doing?

At one point, I was planning to support other enlightened operating systems. I 
will clean it up.


> +}
> +switch (idx) {
...
> +case HV_MSR_TIMER0_COUNT:
> +timer = 0;
> +goto process_timer_count_read;
> +case HV_MSR_TIMER1_COUNT:
> +timer = 1;
> +goto process_timer_count_read;
> +case HV_MSR_TIMER2_COUNT:
> +timer = 2;
> +goto process_timer_count_read;
> +case HV_MSR_TIMER3_COUNT:
> +timer = 3;
> +process_timer_count_read:

> How much does timer support actually buy you?  It's pretty complicated
> all by itself, and it also seems to be the only reason you need SYNIC
> support.

The feature is not used currently by the win2k8 guest. I may choose to get rid 
of this body of code.

> +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) {
> +goto msr_read_error;
> +}
...
> +}
> +
> +int
> +hyperv_do_wr_msr(uint32_t idx, struct cpu_user_regs *regs)
> +{
...
> +switch (idx) {
...
> +case HV_MSR_SEOM:
> +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_MSRS)) {
> +goto msr_write_error;
> +}
> +cur_vcpu->eom_msr = msr_content; 
> +hv_process_message_q(curp, cur_vcpu);
> +break;

> You don't support queued messages, so there's not much point in
> supporting the SEOM MSR either.  Also, eom_msr never seems to get
> read, except for saving it when you do a suspend.

Right.


> +case HV_MSR_TIMER3_CONFIG:
> +timer = 3;
> +process_timer_config:
> +if (!hv_privilege_check(curp, HV_ACCESS_SYNC_TIMERS)) {
> +goto msr_write_error;
> +}
> +/*
> + * Assume that the client is going to write the whole msr. 
> + */
> +if (!(msr_content & 0x9)) {
> +/*
> + * We are neither setting Auto Enable or Enable; 
> + * silently exit.
> + * Should this be considered to turn off a 
> + * timer that may be currently 
> + * active; XXXKYS: Check. For now we are 
> + * not doing anything here.
> + */
> +break;
> +}
> +if (!(((u32)(msr_content >> 16)) & 0x0000000f)) {
> +/*
> + * sintx is 0; clear the enable bit(s).
> + */
> +msr_content &= ~(0x1);
> +}

> Again, some #define's would make these magic numbers a bit more
> clear.

I will fix the code.

Regards,

K. Y


> Steven.





_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel