Re: [Patch v3 2/2] x86/microcode: Synchronize late microcode loading

From: Chao Gao <chao.gao@intel.com>
To: Jan Beulich <JBeulich@suse.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Kevin Tian <kevin.tian@intel.com>,
	Ashok Raj <ashok.raj@intel.com>,
	xen-devel@lists.xen.org, Jun Nakajima <jun.nakajima@intel.com>,
	tglx@linutronix.de, Borislav Petkov <bp@suse.de>
Subject: Re: [Patch v3 2/2] x86/microcode: Synchronize late microcode loading
Date: Fri, 18 May 2018 15:21:14 +0800
Message-ID: <20180518072113.GA65239@skl-4s-chao.sh.intel.com>
In-Reply-To: <5AFC364802000078001C3436@prv1-mh.provo.novell.com>

On Wed, May 16, 2018 at 07:46:48AM -0600, Jan Beulich wrote:
>>>> On 16.05.18 at 15:25, <andrew.cooper3@citrix.com> wrote:
>> On 16/05/18 14:10, Jan Beulich wrote:
>>>> +static int do_microcode_update(void *_info)
>>>> +{
>>>> +    struct microcode_info *info = _info;
>>>> +    unsigned int cpu = smp_processor_id();
>>>> +    int ret;
>>>> +
>>>> +    ret = wait_for_cpus(&info->cpu_in, MICROCODE_DEFAULT_TIMEOUT);
>>>> +    if ( ret )
>>>> +        return ret;
>>>> +
>>>> +    /*
>>>> +     * Logical threads which set the first bit in cpu_sibling_mask can do
>>>> +     * the update. Other sibling threads just await the completion of
>>>> +     * microcode update.
>>>> +     */
>>>> +    if ( !cpumask_test_and_set_cpu(
>>>> +                cpumask_first(per_cpu(cpu_sibling_mask, cpu)), &info->cpus) )
>>>> +        ret = microcode_update_cpu(info->buffer, info->buffer_size);
>>>> +    /*
>>>> +     * Increase the wait timeout to a safe value here since we're serializing
>>>> +     * the microcode update and that could take a while on a large number of
>>>> +     * CPUs. And that is fine as the *actual* timeout will be determined by
>>>> +     * the last CPU finished updating and thus cut short
>>>> +     */
>>>> +    if ( wait_for_cpus(&info->cpu_out, MICROCODE_DEFAULT_TIMEOUT *
>>>> +                                       nr_phys_cpus) )
>>> I remain unconvinced that this is a safe thing to do on a huge system with
>>> guests running (even Dom0 alone would seem risky enough). I continue to

I think there are other operations that may also endanger the security and
stability of the whole system, and we offer them with caveats. The same
applies here: three different methods can be used to update microcode, and
the late update isn't perfect at the moment. At least this patch provides a
more reliable way to update microcode at runtime on systems without so many
cores. On a huge system, admins can assess the risk and choose the most
suitable method. They can avoid live updates entirely and instead mandate a
reboot with an early update, since that is the most dependable method.
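To put numbers on the risk being weighed here, a small worked sketch of the
cpu_out wait in the quoted patch. Because updates are serialized, the patch
allows up to MICROCODE_DEFAULT_TIMEOUT * nr_phys_cpus, but the rendezvous
really ends once the last core has finished, i.e. after the sum of the
per-core update times. The per-core budget and the update durations below
are hypothetical inputs, not measured values or the patch's actual constant.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Upper bound the patch uses for the cpu_out wait:
 * per-core timeout multiplied by the number of physical cores.
 * per_core_us is a hypothetical stand-in for MICROCODE_DEFAULT_TIMEOUT.
 */
static unsigned long long cpu_out_bound_us(unsigned long long per_core_us,
                                           unsigned int nr_phys_cpus)
{
    return per_core_us * nr_phys_cpus;
}

/*
 * With serialized updates, the last core finishes after the sum of all
 * per-core update times; that sum is the wait every CPU actually sees.
 * update_us[] holds hypothetical per-core durations in microseconds.
 */
static unsigned long long serialized_wait_us(const unsigned int *update_us,
                                             size_t nr_phys_cpus)
{
    unsigned long long total = 0;

    assert(update_us != NULL || nr_phys_cpus == 0);
    for (size_t i = 0; i < nr_phys_cpus; i++)
        total += update_us[i];
    return total;
}
```

For example, with a hypothetical 30 ms per-core budget, a 4-core host gets a
bound of 120 ms while a 256-core host gets 7.68 s. The bound is only reached
if updates are slow, but it grows linearly with the core count, which is the
scaling concern raised for huge systems.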

>>> hope for comments from others, in particular Andrew, here. At the very
>>> least I think you should taint the hypervisor when making it here.
>> 
>> I see nothing in this patch which prevents a deadlock against the time
>> calibration rendezvous.  It think its fine to pause the time calibration
>> rendezvous while performing this update.
>
>If there's a problem here, wouldn't that be a general one with
>stop_machine()?

I agree with Jan. If such a deadlock exists, it is not specific to this use
of stop_machine(). Anyhow, I will look into the potential deadlock you mentioned.

>
>> Also, what is the purpose of serialising the updates while all pcpus are
>> in rendezvous?

microcode_mutex, which prevents the updates from being done in parallel, is
not introduced by this patch. For now, we want to keep this patch and the
update process simple. Could we make it work first and work out
optimizations later?
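To make the serialization point concrete, a toy user-space model of the
patch's flow using POSIX threads (not Xen code): every thread rendezvouses
on cpu_in, one thread per core performs the "update" while holding a mutex
that stands in for microcode_mutex, and everyone then rendezvouses on
cpu_out. The topology rule (consecutive ids share a core), run_model() and
the other names are hypothetical; timeouts are not modelled.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_CPUS 64

struct model {
    unsigned int nr_cpus;
    unsigned int threads_per_core;   /* hypothetical fixed SMT width */
    atomic_uint cpu_in, cpu_out;     /* rendezvous counters */
    pthread_mutex_t update_mutex;    /* stands in for microcode_mutex */
    unsigned int in_update, max_in_update, updates; /* under update_mutex */
};

struct cpu_arg {
    struct model *m;
    unsigned int cpu;
};

/* Announce arrival, then spin until every modelled CPU has arrived. */
static void wait_for_all(atomic_uint *cnt, unsigned int nr_cpus)
{
    atomic_fetch_add(cnt, 1);
    while (atomic_load(cnt) < nr_cpus)
        sched_yield();
}

static void *cpu_thread(void *p)
{
    struct cpu_arg *a = p;
    struct model *m = a->m;

    wait_for_all(&m->cpu_in, m->nr_cpus);

    /* One thread per core updates; its siblings go straight to cpu_out. */
    if (a->cpu % m->threads_per_core == 0) {
        pthread_mutex_lock(&m->update_mutex);
        m->in_update++;
        if (m->in_update > m->max_in_update)
            m->max_in_update = m->in_update;
        m->updates++;               /* stand-in for microcode_update_cpu() */
        m->in_update--;
        pthread_mutex_unlock(&m->update_mutex);
    }

    wait_for_all(&m->cpu_out, m->nr_cpus);
    return NULL;
}

/* Returns the number of per-core updates and stores the peak number of
 * concurrent updaters in *max_concurrent. */
static unsigned int run_model(unsigned int nr_cpus, unsigned int tpc,
                              unsigned int *max_concurrent)
{
    pthread_t th[MAX_CPUS];
    struct cpu_arg args[MAX_CPUS];
    struct model m = { .nr_cpus = nr_cpus, .threads_per_core = tpc };

    assert(nr_cpus > 0 && nr_cpus <= MAX_CPUS && tpc > 0);
    atomic_init(&m.cpu_in, 0);
    atomic_init(&m.cpu_out, 0);
    pthread_mutex_init(&m.update_mutex, NULL);

    for (unsigned int i = 0; i < nr_cpus; i++) {
        args[i] = (struct cpu_arg){ .m = &m, .cpu = i };
        int rc = pthread_create(&th[i], NULL, cpu_thread, &args[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (unsigned int i = 0; i < nr_cpus; i++)
        pthread_join(th[i], NULL);

    pthread_mutex_destroy(&m.update_mutex);
    *max_concurrent = m.max_in_update;
    return m.updates;
}
```

With 8 threads and 2 threads per core, run_model() performs 4 updates and
never more than one at a time. Dropping the mutex so that core-leading
threads update concurrently is the kind of later optimization being
discussed; the model only shows the serialized baseline.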

>> Surely at that point the best option is to initiate an
>> update on all processors which don't have an online sibling thread with
>> a lower thread id.
>
>I've suggested that before.

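I think Andrew's suggestion is close to what this patch already does:
exactly one thread per core (whichever first claims the bit of the first CPU
in its cpu_sibling_mask) performs the update, while its sibling threads just
wait for it to complete.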
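A sequential sketch of that selection logic, using plain 64-bit words in
place of cpumask_t and a hypothetical sibling_mask[] topology table rather
than Xen's per_cpu(cpu_sibling_mask, cpu). In the real patch, arrival order
decides which sibling wins the test-and-set; iterating CPUs in ascending
order here makes the lowest-numbered sibling win, which coincides with
Andrew's "no online sibling with a lower thread id" rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Lowest set bit, like cpumask_first(); returns 64 for an empty mask. */
static unsigned int first_cpu(uint64_t mask)
{
    for (unsigned int i = 0; i < 64; i++)
        if (mask & (UINT64_C(1) << i))
            return i;
    return 64;
}

/* Like cpumask_test_and_set_cpu(): set the bit and return its old value.
 * Not atomic; concurrent callers would need an atomic read-modify-write. */
static bool test_and_set(uint64_t *mask, unsigned int cpu)
{
    uint64_t bit = UINT64_C(1) << cpu;
    bool was_set = (*mask & bit) != 0;

    *mask |= bit;
    return was_set;
}

/* Returns a mask of the CPUs that would perform the update: for each CPU,
 * try to claim the first CPU of its sibling mask; only the claimant updates. */
static uint64_t pick_updaters(const uint64_t *sibling_mask,
                              unsigned int nr_cpus)
{
    uint64_t claimed = 0, updaters = 0;

    assert(nr_cpus <= 64);
    for (unsigned int cpu = 0; cpu < nr_cpus; cpu++) {
        unsigned int first = first_cpu(sibling_mask[cpu]);

        assert(first < 64);
        if (!test_and_set(&claimed, first))
            updaters |= UINT64_C(1) << cpu;
    }
    return updaters;
}
```

For a 4-CPU system where CPUs 0/1 and 2/3 are siblings (masks 0x3, 0x3, 0xC,
0xC), CPUs 0 and 2 are selected; with an interleaved layout where 0/2 and
1/3 are siblings, CPUs 0 and 1 are selected. Either way there is one updater
per core.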

Thanks
Chao

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel