Xen project Mailing List

Re: [PATCH 1/3] x86/ucode: Fix error handling during parallel ucode load

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Tue, 18 Nov 2025 12:01:21 +0000

Cc: Roger Pau Monné <roger.pau@xxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Tue, 18 Nov 2025 12:01:38 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 18/11/2025 7:49 am, Jan Beulich wrote:
> On 17.11.2025 23:21, Andrew Cooper wrote:
>> wait_for_state() returns false on encountering LOADING_EXIT.
>> control_thread_fn() can move directly to this state in the case of an early
>> error.  It is not an error condition for APs, but right now the latest write
>> into stopmachine_data.fn_result wins, causing the real error, -EIO, to get
>> clobbered with -EBUSY.  e.g.:
>>
>>   # xen-ucode /lib/firmware/amd-ucode/microcode_amd_fam17h.bin --force
>>   Failed to update microcode. (err: Device or resource busy)
>>
>>   (XEN) 256 cores are to update their microcode
>>   (XEN) microcode: CPU0 update rev 0x830107d to 0x830107c failed, result 0x830107d
>>   (XEN) Late loading aborted: CPU0 failed to update ucode: -5
>>
>> Drop all the -EBUSY's, and treat hitting LOADING_EXIT as a success case.  This
>> causes only a single error to be returned through stop_machine_run().  e.g.:
> Why "single"? stop_machine_run() can't return multiple ones, having only a
> scalar return type? Or do you mean "a single, consistent" or some such?

stop_machine_run() has a data race on stopmachine_data.fn_result.  Any
CPU returning any nonzero value back into the stop_machine machinery
will update the singleton result, and the latest write wins.

With this patch, the BSP returns -EIO and all APs return 0, so they do
not interfere with the -EIO.

>
>>   # xen-ucode /lib/firmware/amd-ucode/microcode_amd_fam17h.bin --force
>>   Failed to update microcode. (err: Input/output error)
>>
>>   (XEN) 256 cores are to update their microcode
>>   (XEN) microcode: CPU0 update rev 0x830107d to 0x830107c failed, result 0x830107d
>>   (XEN) Late loading aborted: CPU0 failed to update ucode: -5
> The sole difference being which specific error is observed, which looks to
> support the above interpretation. What I don't quite understand is ...
>
>> Fixes: 5ed12565aa32 ("microcode: rendezvous CPUs in NMI handler and load ucode")
> ... this and the specific indication that this needs backporting: Why is
> the particular error code this important here?

Because userspace cares about -EEXIST as a special case for success.
Having -EEXIST clobbered with -EBUSY causes a false negative failure in
XenServer's testing.

As said in the cover letter, 4.19 and earlier now suffer this as a side
effect of e0bb712a28a9 ("x86/ucode: Abort parallel load early on any
control thread error"), because out-of-date ucodes used to be passed into
stop_machine and cause every CPU to fail with -EEXIST.

>> --- a/xen/arch/x86/cpu/microcode/core.c
>> +++ b/xen/arch/x86/cpu/microcode/core.c
>> @@ -260,7 +260,9 @@ static int secondary_nmi_work(void)
>>  {
>>      cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
>>
>> -    return wait_for_state(LOADING_EXIT) ? 0 : -EBUSY;
>> +    wait_for_state(LOADING_EXIT);
>> +
>> +    return 0;
>>  }
> At which point the function could as well return void? Preferably with this
> adjustment (and the knock-on one at the call site) and with the slight
> clarification to the description
> Reviewed-by: Jan Beulich <jbeulich@xxxxxxxx>

I have a different series, but ucode_in_nmi needs untangling first.

Even changing this function to be void causes this patch to be dominated
by cleanup, which isn't appropriate for a bugfix.

~Andrew
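
The "latest write wins" behaviour discussed in this thread can be illustrated with a small sequential model. This is a sketch, not Xen code: `fn_result` merely stands in for stopmachine_data.fn_result, and `record_result()`, `replay()`, `before_patch()` and `after_patch()` are hypothetical names introduced here. It also assumes, for illustration only, that the APs report after CPU0; in the real race the ordering is not guaranteed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for stopmachine_data.fn_result: one shared slot. */
static int fn_result;

/* Any nonzero return overwrites the slot, so the last writer wins. */
static void record_result(int ret)
{
    if (ret)
        fn_result = ret;
}

/*
 * Replay per-CPU return values in the given completion order and
 * report what the caller of the (modelled) stop_machine_run() sees.
 */
static int replay(const int *rets, size_t n)
{
    assert(rets != NULL && n > 0);

    fn_result = 0;
    for (size_t i = 0; i < n; i++)
        record_result(rets[i]);

    return fn_result;
}

/* Before the patch: CPU0 fails with -EIO, the APs then report -EBUSY. */
static int before_patch(void)
{
    const int rets[] = { -EIO, -EBUSY, -EBUSY };

    return replay(rets, sizeof(rets) / sizeof(rets[0]));
}

/* After the patch: APs return 0 and leave the real error in place. */
static int after_patch(void)
{
    const int rets[] = { -EIO, 0, 0 };

    return replay(rets, sizeof(rets) / sizeof(rets[0]));
}
```

In this model `before_patch()` yields -EBUSY, hiding the real error, while `after_patch()` yields -EIO. The same reasoning applies when the control thread's error is -EEXIST, which userspace treats as a success case.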
