Re: [PATCH 1/3] x86/ucode: Fix error handling during parallel ucode load
On 18/11/2025 7:49 am, Jan Beulich wrote:
> On 17.11.2025 23:21, Andrew Cooper wrote:
>> wait_for_state() returns false on encountering LOADING_EXIT.
>> control_thread_fn() can move directly to this state in the case of an early
>> error. It is not an error condition for APs, but right now the latest write
>> into stopmachine_data.fn_result wins, causing the real error, -EIO, to get
>> clobbered with -EBUSY. e.g.:
>>
>> # xen-ucode /lib/firmware/amd-ucode/microcode_amd_fam17h.bin --force
>> Failed to update microcode. (err: Device or resource busy)
>>
>> (XEN) 256 cores are to update their microcode
>> (XEN) microcode: CPU0 update rev 0x830107d to 0x830107c failed, result 0x830107d
>> (XEN) Late loading aborted: CPU0 failed to update ucode: -5
>>
>> Drop all the -EBUSY's, and treat hitting LOADING_EXIT as a success case.
>> This causes only a single error to be returned through stop_machine_run(). e.g.:
> Why "single"? stop_machine_run() can't return multiple ones, having only a
> scalar return type? Or do you mean "a single, consistent" or some such?
stop_machine_run() has a data race on stopmachine_data.fn_result.
Any CPU returning any nonzero value back into the stop_machine machinery
will update the singleton result, and latest wins.
With this patch, the BSP returns -EIO and all APs return 0, so nothing
interferes with the -EIO.
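
To illustrate the "latest wins" behaviour described above, here is a minimal single-threaded sketch. It is not Xen's stop_machine code: `struct result_slot`, `record_result()` and `collect()` are hypothetical names, and the real race is between CPUs rather than loop iterations. The array order stands in for the order in which CPUs happen to write their result.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared result slot; not Xen's real structure. */
struct result_slot {
    int fn_result;
};

/* Any nonzero return value overwrites the slot, so the latest writer wins. */
static void record_result(struct result_slot *slot, int ret)
{
    if ( ret )
        slot->fn_result = ret;
}

/* Feed per-CPU return values in completion order; report what survives. */
static int collect(const int *rets, size_t n)
{
    struct result_slot slot = { 0 };

    for ( size_t i = 0; i < n; i++ )
        record_result(&slot, rets[i]);

    return slot.fn_result;
}
```

With inputs `{ -EIO, -EBUSY, -EBUSY }` the surviving value is -EBUSY; with `{ -EIO, 0, 0 }` the zeros never touch the slot and -EIO survives.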
>
>> # xen-ucode /lib/firmware/amd-ucode/microcode_amd_fam17h.bin --force
>> Failed to update microcode. (err: Input/output error)
>>
>> (XEN) 256 cores are to update their microcode
>> (XEN) microcode: CPU0 update rev 0x830107d to 0x830107c failed, result 0x830107d
>> (XEN) Late loading aborted: CPU0 failed to update ucode: -5
> The sole difference being which specific error is observed, which looks to
> support the above interpretation. What I don't quite understand is ...
>
>> Fixes: 5ed12565aa32 ("microcode: rendezvous CPUs in NMI handler and load ucode")
> ... this and the specific indication that this needs backporting: Why is
> the particular error code this important here?
Because userspace cares about -EEXIST as a special case for success.
Having -EEXIST clobbered with -EBUSY causes a false negative failure in
XenServer's testing.
As said in the cover letter, 4.19 and earlier now suffer this as a side
effect of e0bb712a28a9 ("x86/ucode: Abort parallel load early on any
control thread error") because out-of-date ucodes used to be passed into
stop_machine and cause every CPU to fail with -EEXIST.
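
As a rough sketch of why the specific error code matters to a caller, assume a tool that treats -EEXIST as "nothing to do". The helper name `ucode_load_ok()` and its arguments are hypothetical; this is not the actual XenServer test code or the xen-ucode implementation.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: rc is the return value of the load request
 * (0 on success), err is the positive errno value seen on failure.
 */
static bool ucode_load_ok(int rc, int err)
{
    if ( rc == 0 )
        return true;

    /* Assumed policy: EEXIST counts as success, any other error is a failure. */
    return err == EEXIST;
}
```

Under this assumed policy, an -EEXIST that gets replaced by -EBUSY turns an acceptable outcome into a reported failure.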
>> --- a/xen/arch/x86/cpu/microcode/core.c
>> +++ b/xen/arch/x86/cpu/microcode/core.c
>> @@ -260,7 +260,9 @@ static int secondary_nmi_work(void)
>>  {
>>      cpumask_set_cpu(smp_processor_id(), &cpu_callin_map);
>>
>> -    return wait_for_state(LOADING_EXIT) ? 0 : -EBUSY;
>> +    wait_for_state(LOADING_EXIT);
>> +
>> +    return 0;
>>  }
> At which point the function could as well return void? Preferably with this
> adjustment (and the knock-on one at the call site) and with the slight
> clarification to the description
> Reviewed-by: Jan Beulich <jbeulich@xxxxxxxx>
I have a different series, but ucode_in_nmi needs untangling first.
Even changing this function to be void causes this patch to be dominated
by cleanup, which isn't appropriate for a bugfix.
~Andrew