
RE: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enabling in XEN



Christoph/Frank, below is the interface definition; please have a look.

Thanks
Yunhong Jiang

1) Interface between Xen and dom0 for passing Xen's recovery action
   information to dom0.
   Usage model: After offlining a broken page, Xen may pass the result of its
   page-offline recovery action to dom0. Dom0 saves the information in
   non-volatile memory for later proactive actions, such as offlining the
   error-prone page early on the next reboot.


struct page_offline_action
{
    /* Params for passing the offlined page number to DOM0 */
    uint64_t mfn;
    uint64_t status; /* Similar to page offline hypercall */
};

struct cpu_offline_action
{
    /* Params for passing the identity of the offlined CPU to DOM0 */
    uint32_t mc_socketid;
    uint16_t mc_coreid;
    uint16_t mc_core_threadid;
};

struct cache_shrink_action
{
    /* TBD, Christoph, please fill it */
};

/* Recovery action flags, giving recovery result information to the guest */
/* Recovered successfully after taking one of the recovery actions below */
#define REC_ACT_RECOVERED      (0x1 << 0)
/* For Solaris usage: dom0 takes ownership when the system crashes */
#define REC_ACT_RESET          (0x1 << 2)
/* No action is performed by Xen */
#define REC_ACT_INFO           (0x1 << 3)

/* Recovery action types, valid only when flags & REC_ACT_RECOVERED */
#define MC_ACT_PAGE_OFFLINE 1
#define MC_ACT_CPU_OFFLINE  2
#define MC_ACT_CACHE_SHRINK 3

struct recovery_action
{
    uint8_t flags;
    uint8_t action_type;
    union
    {
        struct page_offline_action page_retire;
        struct cpu_offline_action cpu_offline;
        struct cache_shrink_action cache_shrink;
        uint8_t pad[MAX_ACTION_SIZE];
    } action_info;
};

struct mcinfo_bank {
    struct mcinfo_common common;

    uint16_t mc_bank; /* bank nr */
    uint16_t mc_domid; /* Usecase 5: domain referenced by mc_addr on dom0
                        * and if mc_addr is valid. Never valid on DomU. */
    uint64_t mc_status; /* bank status */
    uint64_t mc_addr;   /* bank address, only valid
                         * if addr bit is set in mc_status */
    uint64_t mc_misc;
    uint64_t mc_ctrl2;
    uint64_t mc_tsc;
    /* Recovery action is performed per bank */
    struct recovery_action action;
};
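
To make the usage model in 1) concrete, here is a minimal sketch of how dom0 might pick an offlined page out of a per-bank recovery_action before saving it. This is only an illustration: offlined_page_from_action() is a hypothetical helper, and the MAX_ACTION_SIZE value is made up because the proposal does not define it. The empty cache_shrink member is left out.

```c
#include <stdint.h>

/* Copied from the proposal */
#define REC_ACT_RECOVERED   (0x1 << 0)
#define MC_ACT_PAGE_OFFLINE 1
#define MC_ACT_CPU_OFFLINE  2

/* Assumption: the proposal leaves MAX_ACTION_SIZE undefined */
#define MAX_ACTION_SIZE 16

struct page_offline_action {
    uint64_t mfn;
    uint64_t status;
};

struct cpu_offline_action {
    uint32_t mc_socketid;
    uint16_t mc_coreid;
    uint16_t mc_core_threadid;
};

struct recovery_action {
    uint8_t flags;
    uint8_t action_type;
    union {
        struct page_offline_action page_retire;
        struct cpu_offline_action cpu_offline;
        uint8_t pad[MAX_ACTION_SIZE];
    } action_info;
};

/*
 * Hypothetical dom0-side helper. Returns 1 and stores the MFN in *mfn if
 * the action reports a page that Xen offlined; returns 0 otherwise.
 * action_type is only consulted when REC_ACT_RECOVERED is set.
 */
static int offlined_page_from_action(const struct recovery_action *act,
                                     uint64_t *mfn)
{
    if (!(act->flags & REC_ACT_RECOVERED))
        return 0;
    if (act->action_type != MC_ACT_PAGE_OFFLINE)
        return 0;
    *mfn = act->action_info.page_retire.mfn;
    return 1;
}
```

Dom0 would call this for the action field of each mcinfo_bank it receives and write any returned MFN to its persistent store.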

2) The two interfaces below are for internal use by MCA processing.
    a. pre_handler is called early, in MCA ISR context, mainly to detect as
       soon as possible whether a reset is needed (flag MCA_RESET), so that
       no log is lost. pre_handler may also identify the impacted domain,
       if possible.
    b. mca_error_handler is an (error_action_index, recovery_handler pointer)
       pair. The recovery_handler function performs the actual recovery
       operations in softIRQ context once a per-bank MCA error matches the
       corresponding mca_code index. If pre_handler cannot determine the
       impacted domain, recovery_handler must figure it out.

/* Error has been recovered successfully */
#define MCA_RECOVERED   0
/* Error impacts one guest, as stated in the owner field */
#define MCA_OWNER       1
/* Error can't be recovered; the system needs a reboot */
#define MCA_RESET       2
/* Error should be handled in softIRQ context */
#define MCA_MORE_ACTION 3

struct mca_handle_result
{
    uint32_t flags;
    /* Valid only when flags & MCA_OWNER */
    domid_t owner;
    /* Valid only when flags & MCA_RECOVERED */
    struct recovery_action *action;
};

struct mca_error_handler
{
    /*
     * Assume we will need only architecture-defined codes. If the index can't
     * be set up from mca_code, we will add a function to do the
     * (index, recovery_handler) mapping check.
     * This mca_code is the index of the recovery handler pointer, identifying
     * the recovery action that corresponds to this particular error.
     */
    uint16_t mca_code;

    /* Handler to be called in softIRQ handler context */
    int (*recovery_handler)(struct mcinfo_bank *bank,
                            struct mcinfo_global *global,
                            struct mcinfo_extended *extension,
                            struct mca_handle_result *result);

};

struct mca_error_handler intel_mca_handler[] = 
{
    ....
};

struct mca_error_handler amd_mca_handler[] =
{
    ....
};
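
As an illustration of the (mca_code, recovery_handler) pairing, the softIRQ side could look a handler up with a plain linear search. This is a rough sketch only: the table contents, the codes 0x0001/0x0002, the demo_* names and find_recovery_handler() are all made up for the example, and the exact-match lookup is an assumption, not something the proposal specifies.

```c
#include <stdint.h>
#include <stddef.h>

/* Opaque here; only pointers to these are passed around */
struct mcinfo_bank;
struct mcinfo_global;
struct mcinfo_extended;
struct mca_handle_result;

typedef int (*recovery_fn)(struct mcinfo_bank *bank,
                           struct mcinfo_global *global,
                           struct mcinfo_extended *extension,
                           struct mca_handle_result *result);

struct mca_error_handler {
    uint16_t mca_code;
    recovery_fn recovery_handler;
};

/* Hypothetical handlers that do nothing; real ones would do the recovery */
static int demo_handler_a(struct mcinfo_bank *bank,
                          struct mcinfo_global *global,
                          struct mcinfo_extended *extension,
                          struct mca_handle_result *result)
{
    (void)bank; (void)global; (void)extension; (void)result;
    return 0;
}

static int demo_handler_b(struct mcinfo_bank *bank,
                          struct mcinfo_global *global,
                          struct mcinfo_extended *extension,
                          struct mca_handle_result *result)
{
    (void)bank; (void)global; (void)extension; (void)result;
    return 0;
}

/* Hypothetical table; the codes are placeholders, not real MCA codes */
static const struct mca_error_handler demo_mca_handler[] = {
    { 0x0001, demo_handler_a },
    { 0x0002, demo_handler_b },
};

/* Return the handler registered for mca_code, or NULL if there is none */
static recovery_fn find_recovery_handler(const struct mca_error_handler *tbl,
                                         size_t n, uint16_t mca_code)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].mca_code == mca_code)
            return tbl[i].recovery_handler;
    return NULL;
}
```

A NULL return is the case where the extra (index, recovery_handler) mapping function mentioned in the struct comment would be needed.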


/* Handler to be called in the MCA ISR, in MCA context */
int intel_mca_pre_handler(struct cpu_user_regs *regs,
                                struct mca_handle_result *result);

int amd_mca_pre_handler(struct cpu_user_regs *regs,
                            struct mca_handle_result *result);
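
To show how the two stages might fit together, here is a small sketch of what the ISR could do with the flags that a pre_handler returns. It rests on assumptions: the flag numbers from the proposal are treated as bit positions (DEMO_FLAG is a hypothetical helper), reset is given priority over deferring to softIRQ, and the demo_* names are invented for the example.

```c
#include <stdint.h>

/* Flag numbers as given in the proposal */
#define MCA_OWNER       1
#define MCA_RESET       2
#define MCA_MORE_ACTION 3

/* Assumption: each flag number is a bit position in the flags word */
#define DEMO_FLAG(n) (1u << (n))

enum demo_next_step {
    DEMO_DONE,          /* nothing further to do */
    DEMO_RESET_NOW,     /* unrecoverable: save the log and reset */
    DEMO_RAISE_SOFTIRQ  /* defer to recovery_handler in softIRQ context */
};

/* Decide the next step from the flags a pre_handler filled in */
static enum demo_next_step demo_after_pre_handler(uint32_t flags)
{
    /* Reset is checked first so the log is not lost */
    if (flags & DEMO_FLAG(MCA_RESET))
        return DEMO_RESET_NOW;
    if (flags & DEMO_FLAG(MCA_MORE_ACTION))
        return DEMO_RAISE_SOFTIRQ;
    return DEMO_DONE;
}
```

For example, a result with only the MCA_MORE_ACTION bit set would lead the ISR to raise the softIRQ and leave the real work to the matching recovery_handler.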


Frank.Vanderlinden@xxxxxxx wrote:
> Jiang, Yunhong wrote:
>> Frank/Christopher, can you please give more comments on it, or are you OK
>> with this? For the action reporting mechanism, we will send out a proposal
>> for review soon. 
> 
> I'm ok with this. We need a little more information on the AMD
> mechanism, but it seems to me that we can fit this in.
> 
> Sometime this week, I'll also send out the last of our changes that
> haven't been sent upstream to xen-unstable yet. Maybe we can combine
> some things into one patch, like the telemetry handling changes that
> Gavin did. The other changes are error injection (for debugging) and
> panic crash dump support for our FMA tools, but those are probably only
> interesting to us. 
> 
> - Frank
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

