This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


RE: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN

To: "Frank.Vanderlinden@xxxxxxx" <Frank.Vanderlinden@xxxxxxx>
Subject: RE: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN
From: "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
Date: Thu, 5 Mar 2009 16:31:27 +0800
Accept-language: en-US
Cc: Christoph Egger <Christoph.Egger@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, "Ke, Liping" <liping.ke@xxxxxxxxx>, Gavin Maltby <Gavin.Maltby@xxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, "Kleen, Andi" <andi.kleen@xxxxxxxxx>
Delivery-date: Thu, 05 Mar 2009 00:32:58 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <49AC1BA8.3090302@xxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <C5BF30B3.2C2B%keir.fraser@xxxxxxxxxxxxx> <49A45CF0.6080807@xxxxxxx> <E2263E4A5B2284449EEBD0AAB751098401C7B6E888@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <200902251319.29299.Christoph.Egger@xxxxxxx> <49A580C0.7050501@xxxxxxx> <E2263E4A5B2284449EEBD0AAB751098401C7C59202@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <49AC1BA8.3090302@xxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: AcmbXwgSG6ZN2R4KQqOX+5VH/tqchQCDTOzA
Thread-topic: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN
Christoph/Frank, the interface definition follows; please have a look.

Yunhong Jiang

1) Interface between Xen and dom0 for passing Xen's recovery action information to dom0
   Usage model: After offlining a broken page, Xen might pass the result of its
   page-offline recovery action to dom0. Dom0 will save the information in
   non-volatile memory for further proactive actions, such as offlining the
   error-prone page early on the next reboot.
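
To illustrate the usage model, here is a minimal dom0-side sketch. It is not part of the proposal: the names bad_page_record, save_bad_page, replay_bad_pages and demo_offline_page are hypothetical, and a stdio stream stands in for whatever non-volatile store dom0 would really use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical on-disk record; mirrors the mfn/status pair Xen reports. */
struct bad_page_record {
    uint64_t mfn;
    uint64_t status;
};

/* Append one offlined page to the persistent store. Returns 0 on success. */
static int save_bad_page(FILE *store, uint64_t mfn, uint64_t status)
{
    struct bad_page_record rec = { mfn, status };

    return fwrite(&rec, sizeof rec, 1, store) == 1 ? 0 : -1;
}

/*
 * On the next boot, walk the store and ask for each recorded page to be
 * offlined again. offline_page is a caller-supplied callback (an assumption;
 * in a real dom0 it would wrap the page offline hypercall).
 * Returns the number of pages successfully offlined.
 */
static size_t replay_bad_pages(FILE *store, int (*offline_page)(uint64_t mfn))
{
    struct bad_page_record rec;
    size_t done = 0;

    rewind(store);
    while (fread(&rec, sizeof rec, 1, store) == 1) {
        if (offline_page(rec.mfn) == 0)
            done++;
    }
    return done;
}

/* Demo callback: only remembers the last MFN it was asked to offline. */
static uint64_t demo_last_mfn;

static int demo_offline_page(uint64_t mfn)
{
    demo_last_mfn = mfn;
    return 0;
}
```

A caller would invoke save_bad_page() each time Xen reports a page-offline result, and replay_bad_pages(store, demo_offline_page) early during the next boot.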

struct page_offline_action {
    /* Params for passing the offlined page number to DOM0 */
    uint64_t mfn;
    uint64_t status; /* Similar to page offline hypercall */
};

struct cpu_offline_action {
    /* Params for passing the identity of the offlined CPU to DOM0 */
    uint32_t mc_socketid;
    uint16_t mc_coreid;
    uint16_t mc_core_threadid;
};

struct cache_shrink_action {
    /* TBD, Christoph, please fill it */
};

/* Recovery action flags, giving recovery result information to the guest */
/* Recovered successfully after taking one of the recovery actions below */
#define REC_ACT_RECOVERED      (0x1 << 0)
/* For Solaris's usage, where dom0 takes ownership on a crash */
#define REC_ACT_RESET          (0x1 << 2)
/* No action is performed by XEN */
#define REC_ACT_INFO           (0x1 << 3)

/* Recovery action type definitions, valid only when flags & REC_ACT_RECOVERED */
#define MC_ACT_CPU_OFFLINE   2

struct recovery_action {
    uint8_t flags;
    uint8_t action_type;
    union {
        struct page_offline_action page_retire;
        struct cpu_offline_action cpu_offline;
        struct cache_shrink_action cache_shrink;
        uint8_t pad[MAX_ACTION_SIZE];
    } action_info;
};

struct mcinfo_bank {
    struct mcinfo_common common;

    uint16_t mc_bank; /* bank nr */
    uint16_t mc_domid; /* Usecase 5: domain referenced by mc_addr on dom0
                        * and if mc_addr is valid. Never valid on DomU. */
    uint64_t mc_status; /* bank status */
    uint64_t mc_addr;   /* bank address, only valid
                         * if addr bit is set in mc_status */
    uint64_t mc_misc;
    uint64_t mc_ctrl2;
    uint64_t mc_tsc;
    /* Recovery action is performed per bank */
    struct recovery_action action;
};
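
For reference, a rough sketch of how a dom0 consumer might decode the per-bank recovery_action. The following are assumptions for illustration only: the value of MAX_ACTION_SIZE, the MC_ACT_PAGE_OFFLINE constant (only MC_ACT_CPU_OFFLINE is defined above), and the helper name describe_recovery_action. The cache shrink member is left out because its layout is still TBD.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_ACTION_SIZE 16          /* assumption: size not given in the RFC */

#define REC_ACT_RECOVERED (0x1 << 0)
#define REC_ACT_RESET     (0x1 << 2)
#define REC_ACT_INFO      (0x1 << 3)

#define MC_ACT_PAGE_OFFLINE 1       /* hypothetical value, not in the RFC */
#define MC_ACT_CPU_OFFLINE  2       /* as defined in the RFC */

struct page_offline_action {
    uint64_t mfn;
    uint64_t status;
};

struct cpu_offline_action {
    uint32_t mc_socketid;
    uint16_t mc_coreid;
    uint16_t mc_core_threadid;
};

struct recovery_action {
    uint8_t flags;
    uint8_t action_type;
    union {
        struct page_offline_action page_retire;
        struct cpu_offline_action cpu_offline;
        uint8_t pad[MAX_ACTION_SIZE];
    } action_info;
};

/*
 * Write a one-line summary of the action into buf.
 * Returns 1 if action_info carried a decoded recovery action, 0 otherwise.
 */
static int describe_recovery_action(const struct recovery_action *act,
                                    char *buf, size_t len)
{
    if (act->flags & REC_ACT_RESET) {
        snprintf(buf, len, "reset required");
        return 0;
    }
    /* action_type is only meaningful when REC_ACT_RECOVERED is set */
    if (!(act->flags & REC_ACT_RECOVERED)) {
        snprintf(buf, len, "info only, no action taken");
        return 0;
    }

    switch (act->action_type) {
    case MC_ACT_PAGE_OFFLINE:
        snprintf(buf, len, "page offlined: mfn=0x%" PRIx64 " status=0x%" PRIx64,
                 act->action_info.page_retire.mfn,
                 act->action_info.page_retire.status);
        return 1;
    case MC_ACT_CPU_OFFLINE:
        snprintf(buf, len, "cpu offlined: socket=%u core=%u thread=%u",
                 (unsigned)act->action_info.cpu_offline.mc_socketid,
                 (unsigned)act->action_info.cpu_offline.mc_coreid,
                 (unsigned)act->action_info.cpu_offline.mc_core_threadid);
        return 1;
    default:
        snprintf(buf, len, "unknown action type %u",
                 (unsigned)act->action_type);
        return 0;
    }
}
```

The flags are checked before action_type because, per the comment above, the action type is valid only when flags & REC_ACT_RECOVERED.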

2) The two interfaces below are for internal use by MCA processing.
    a. pre_handler is called early, in MCA ISR context, mainly for early
       detection, to avoid losing logs (flag MCA_RESET). pre_handler should
       also identify the impacted domain if possible.
    b. mca_error_handler is actually an (error_action_index, recovery_handler
       pointer) pair. The recovery_handler function performs the actual
       recovery operations in softIRQ context once the per-bank MCA error
       matches the corresponding mca_code index. If pre_handler can't
       determine the impacted domain, recovery_handler must figure it out.

/* Error has been recovered successfully */
#define MCA_RECOVERD    0
/* Error impacts one guest, as stated in the owner field */
#define MCA_OWNER       1
/* Error can't be recovered and the system needs a reboot */
#define MCA_RESET       2
/* Error should be handled in softIRQ context */

struct mca_handle_result {
    uint32_t flags;
    /* Valid only when flags & MCA_OWNER */
    domid_t owner;
    /* Valid only when flags & MCA_RECOVERD */
    struct recovery_action *action;
};

struct mca_error_handler {
    /*
     * Assume we will need only architecture-defined codes. If the index can't
     * be set up by mca_code, we will add a function to do the
     * (index, recovery_handler) mapping check.
     * This mca_code represents the recovery handler pointer index for
     * identifying this particular error's corresponding recovery action.
     */
    uint16_t mca_code;

    /* Handler to be called in softIRQ handler context */
    int (*recovery_handler)(struct mcinfo_bank *bank,
                            struct mcinfo_global *global,
                            struct mcinfo_extended *extension,
                            struct mca_handle_result *result);
};


struct mca_error_handler intel_mca_handler[] = 

struct mca_error_handler amd_mca_handler[] =

/* Handler to be called in MCA ISR in MCA context */
int intel_mca_pre_handler(struct cpu_user_regs *regs,
                                struct mca_handle_result *result);

int amd_mca_pre_handler(struct cpu_user_regs *regs,
                            struct mca_handle_result *result);
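
To make the (mca_code, recovery_handler) pairing concrete, here is a small sketch of the softIRQ-side lookup. It is only an illustration under assumptions: mcinfo_bank is trimmed to two fields, domid_t is typedef'd locally, the table entry and its 0x009f code are made up, and mca_dispatch and demo_mem_handler are hypothetical names. It matches on the low 16 bits of mc_status (the MCA error code field) with an exact compare; a real table would probably need masked matching. Since MCA_RECOVERD is defined as 0 above and can't be tested with &, the demo handler assigns flags directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t domid_t;           /* local stand-in for Xen's domid_t */

#define MCA_RECOVERD    0
#define MCA_OWNER       1
#define MCA_RESET       2

/* Trimmed stand-in: only the fields this sketch touches. */
struct mcinfo_bank {
    uint16_t mc_bank;
    uint64_t mc_status;
};
struct mcinfo_global;               /* opaque in this sketch */
struct mcinfo_extended;             /* opaque in this sketch */
struct recovery_action;             /* opaque in this sketch */

struct mca_handle_result {
    uint32_t flags;
    domid_t owner;
    struct recovery_action *action;
};

struct mca_error_handler {
    uint16_t mca_code;
    int (*recovery_handler)(struct mcinfo_bank *bank,
                            struct mcinfo_global *global,
                            struct mcinfo_extended *extension,
                            struct mca_handle_result *result);
};

/* Hypothetical handler: blames domain 0 and reports success. */
static int demo_mem_handler(struct mcinfo_bank *bank,
                            struct mcinfo_global *global,
                            struct mcinfo_extended *extension,
                            struct mca_handle_result *result)
{
    (void)bank; (void)global; (void)extension;
    result->flags = MCA_OWNER;
    result->owner = 0;
    return 0;
}

/* Made-up table; 0x009f is just an example error code. */
static struct mca_error_handler demo_mca_handler[] = {
    { 0x009f, demo_mem_handler },
};

/*
 * Find the handler whose mca_code equals the bank's MCA error code and run
 * it. Returns the handler's return value, or -1 if no entry matches.
 */
static int mca_dispatch(const struct mca_error_handler *tbl, size_t n,
                        struct mcinfo_bank *bank,
                        struct mcinfo_global *global,
                        struct mcinfo_extended *extension,
                        struct mca_handle_result *result)
{
    uint16_t code = (uint16_t)(bank->mc_status & 0xffff);
    size_t i;

    for (i = 0; i < n; i++) {
        if (tbl[i].mca_code == code)
            return tbl[i].recovery_handler(bank, global, extension, result);
    }
    return -1;
}
```

The same loop would serve both intel_mca_handler[] and amd_mca_handler[]; only the table passed in changes.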

Frank.Vanderlinden@xxxxxxx wrote:
> Jiang, Yunhong wrote:
>> Frank/Christopher, can you please give more comments for it, or you are OK
>> with this? For the action reporting mechanism, we will send out a proposal
>> for review soon. 
> I'm ok with this. We need a little more information on the AMD
> mechanism, but it seems to me that we can fit this in.
> Sometime this week, I'll also send out the last of our changes that
> haven't been sent upstream to xen-unstable yet. Maybe we can combine
> some things into one patch, like the telemetry handling changes that
> Gavin did. The other changes are error injection (for debugging) and
> panic crash dump support for our FMA tools, but those are probably only
> interesting to us. 
> - Frank