linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/6] x86: Support for Intel Microcode Staging Feature
@ 2025-03-20 23:40 Chang S. Bae
  2025-03-20 23:40 ` [PATCH v2 1/6] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
                   ` (6 more replies)
  0 siblings, 7 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-03-20 23:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chang.seok.bae

Hi all,

Here is a revision following up feedbacks from this posting [1]:

  *  Instead of embedding staging logic directly into load_late_locked(),
     it was suggested [3] to place it within load_late_stop_cpus(). This
     change ensures that staging is treated as a preparatory step
     before CPUs are stopped.

  *  Rather than introducing a separate .c file only for staging, the
     recommendation was to consolidate all staging-related code within
     intel.c [4].

  *  The previous implementation lacks clarity in explaining key aspects
     of the mailbox and staging handler. Improving documentation and
     readability in these areas was suggested as a necessary refinement
     [5,6], as I understood.

  *  It was also requested to fold MSR definitions into their usage
     patches [7].

In addition to addressing these points, I’ve considered a unified staging
state struct (patch 2), primarily to simplify the staging handler loop
while also improving overall code organization.

This series is based on the tip/master branch. You can also find it from
this repo:
    git://github.com/intel-staging/microcode.git staging_v2

I suspect the maintainers could afford another look at least after the
upcoming merge window. In the meantime, I would appreciate any additional
feedback from those interested in this feature.

The original cover letter, which provides some background on this feature
enabling and its initial integration considerations, can be found in the
previous postings [1,2]. The relevant specification has also been posted
[8].

Thanks,
Chang

[1] Last posting: https://lore.kernel.org/lkml/20241211014213.3671-1-chang.seok.bae@intel.com/
[2] RFC: https://lore.kernel.org/lkml/20241001161042.465584-1-chang.seok.bae@intel.com/
[3] https://lore.kernel.org/lkml/20250218113634.GGZ7RwwkrrXADX0eRo@fat_crate.local/
[4] https://lore.kernel.org/lkml/20250226175642.GOZ79V2jWQTH5rbuXo@fat_crate.local/
[5] https://lore.kernel.org/lkml/fac46937-e0a5-42c1-96ee-65fec4e17551@intel.com/
[6] https://lore.kernel.org/lkml/1aee0888-b87b-443c-84fa-3bc000cbebcf@intel.com/
[7] https://lore.kernel.org/lkml/20250226171923.GMZ79NG_8wDtZ8vyWH@fat_crate.local/
[8] Staging Spec: https://cdrdv2.intel.com/v1/dl/getContent/782715

Chang S. Bae (6):
  x86/microcode: Introduce staging step to reduce late-loading time
  x86/microcode/intel: Define staging state struct
  x86/microcode/intel: Establish staging control logic
  x86/microcode/intel: Implement staging handler
  x86/microcode/intel: Support mailbox transfer
  x86/microcode/intel: Enable staging when available

 arch/x86/include/asm/msr-index.h         |   9 +
 arch/x86/kernel/cpu/microcode/core.c     |  11 +
 arch/x86/kernel/cpu/microcode/intel.c    | 341 +++++++++++++++++++++++
 arch/x86/kernel/cpu/microcode/internal.h |   4 +-
 4 files changed, 364 insertions(+), 1 deletion(-)


base-commit: 758ea9705c51865858c612f591c6e6950dcafccf
-- 
2.45.2


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH v2 1/6] x86/microcode: Introduce staging step to reduce late-loading time
  2025-03-20 23:40 [PATCH v2 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
@ 2025-03-20 23:40 ` Chang S. Bae
  2025-03-20 23:40 ` [PATCH v2 2/6] x86/microcode/intel: Define staging state struct Chang S. Bae
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-03-20 23:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chang.seok.bae

As microcode patch sizes continue to grow, late-loading latency spikes
can lead to timeouts and disruptions in running workloads. This trend of
increasing patch sizes is expected to continue, so a foundational
solution is needed to address the issue.

To mitigate the problem, a new staging feature is introduced. This option
processes most of the microcode update (excluding activation) on a
non-critical path, allowing CPUs to remain operational during the
majority of the update. By offloading work from the critical path,
staging can significantly reduces latency spikes.

Integrate staging as a preparatory step in late-loading. Introduce a new
callback for staging, which is invoked at the beginning of
load_late_stop_cpus(), before CPUs enter the rendezvous phase.

Staging follows an opportunistic model:

  *  If successful, it reduces CPU rendezvous time
  *  Even though it fails, the process falls back to the legacy path to
     finish the loading process but with potentially higher latency.

Extend struct microcode_ops to incorporate staging properties, which will
be implemented in the vendor code separately.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
V1 -> V2:
- Move invocation inside of load_late_stop_cpus() (Boris)
- Add more note about staging (Dave)

RFC-V1 -> V1:
- Rename the function name to the do_something() style (Boris).

Note:

Now the invocation is placed as part of the late-loading core function
following Boris' comment on the last posting [1], which I think can keep
the main late-loading logic simple from load_late_locked().

There was some brief discussion about the necessity of enforcing staging
as mandatory [2]. But it was not clearly justified yet, so I thought it
still makes sense to exclude such policy discussions out of this series.

[1] https://lore.kernel.org/lkml/20250218113634.GGZ7RwwkrrXADX0eRo@fat_crate.local/
[2] https://lore.kernel.org/lkml/526df712-6091-4b04-97d5-9007789dc750@intel.com/
---
 arch/x86/kernel/cpu/microcode/core.c     | 11 +++++++++++
 arch/x86/kernel/cpu/microcode/internal.h |  4 +++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
index b3658d11e7b6..c4aff44a7ffc 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -541,6 +541,17 @@ static int load_late_stop_cpus(bool is_safe)
 		pr_err("You should switch to early loading, if possible.\n");
 	}
 
+	/*
+	 * Pre-load the microcode image into a staging device. This
+	 * process is preemptible and does not require stopping CPUs.
+	 * Successful staging simplifies the subsequent late-loading
+	 * process, reducing rendezvous time.
+	 *
+	 * Even if the transfer fails, the update will proceed as usual.
+	 */
+	if (microcode_ops->use_staging)
+		microcode_ops->stage_microcode();
+
 	atomic_set(&late_cpus_in, num_online_cpus());
 	atomic_set(&offline_in_nmi, 0);
 	loops_per_usec = loops_per_jiffy / (TICK_NSEC / 1000);
diff --git a/arch/x86/kernel/cpu/microcode/internal.h b/arch/x86/kernel/cpu/microcode/internal.h
index 5df621752fef..4b983b4cddbd 100644
--- a/arch/x86/kernel/cpu/microcode/internal.h
+++ b/arch/x86/kernel/cpu/microcode/internal.h
@@ -31,10 +31,12 @@ struct microcode_ops {
 	 * See also the "Synchronization" section in microcode_core.c.
 	 */
 	enum ucode_state	(*apply_microcode)(int cpu);
+	void			(*stage_microcode)(void);
 	int			(*collect_cpu_info)(int cpu, struct cpu_signature *csig);
 	void			(*finalize_late_load)(int result);
 	unsigned int		nmi_safe	: 1,
-				use_nmi		: 1;
+				use_nmi		: 1,
+				use_staging	: 1;
 };
 
 struct early_load_data {
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 2/6] x86/microcode/intel: Define staging state struct
  2025-03-20 23:40 [PATCH v2 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
  2025-03-20 23:40 ` [PATCH v2 1/6] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
@ 2025-03-20 23:40 ` Chang S. Bae
  2025-03-20 23:40 ` [PATCH v2 3/6] x86/microcode/intel: Establish staging control logic Chang S. Bae
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-03-20 23:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chang.seok.bae

To facilitate a structured staging handler, a set of functions will be
introduced. Define staging_state struct to simplify function prototypes
by consolidating relevant data, instead of passing multiple local
variables.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
V1 -> V2: New patch

In the previous version, local variables were used to track state values,
with the intention of improving readability by explicitly passing them
between functions. However, given feedbacks, introducing a dedicated data
structure looks to provide a benefit by simplifying the main loop, which
is now prioritized.
---
 arch/x86/kernel/cpu/microcode/intel.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 819199bc0119..57ed5d414cd1 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -54,6 +54,27 @@ struct extended_sigtable {
 	struct extended_signature	sigs[];
 };
 
+/**
+ * struct staging_state - Tracks the current staging process state
+ *
+ * @mmio_base:		MMIO base address for staging
+ * @ucode_ptr:		Pointer to the microcode image
+ * @ucode_len:		Total size of the microcode image
+ * @chunk_size:		Size of each data piece
+ * @bytes_sent:		Total bytes transmitted so far
+ * @offset:		Current offset in the microcode image
+ * @state:		Current state of the staging process
+ */
+struct staging_state {
+	void __iomem		*mmio_base;
+	void			*ucode_ptr;
+	unsigned int		ucode_len;
+	unsigned int		chunk_size;
+	unsigned int		bytes_sent;
+	unsigned int		offset;
+	enum ucode_state	state;
+};
+
 #define DEFAULT_UCODE_TOTALSIZE (DEFAULT_UCODE_DATASIZE + MC_HEADER_SIZE)
 #define EXT_HEADER_SIZE		(sizeof(struct extended_sigtable))
 #define EXT_SIGNATURE_SIZE	(sizeof(struct extended_signature))
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 3/6] x86/microcode/intel: Establish staging control logic
  2025-03-20 23:40 [PATCH v2 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
  2025-03-20 23:40 ` [PATCH v2 1/6] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
  2025-03-20 23:40 ` [PATCH v2 2/6] x86/microcode/intel: Define staging state struct Chang S. Bae
@ 2025-03-20 23:40 ` Chang S. Bae
  2025-03-21 21:18   ` [PATCH v2a " Chang S. Bae
  2025-03-20 23:40 ` [PATCH v2 4/6] x86/microcode/intel: Implement staging handler Chang S. Bae
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-03-20 23:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chang.seok.bae

When microcode staging is initiated, operations are carried out through
an MMIO interface. Each package has a unique interface specified by the
IA32_MCU_STAGING_MBOX_ADDR MSR, which maps to a set of 32-bit registers.

Prepare staging with the following steps:

  1.  Ensure the microcode image is 32-bit aligned to match the MMIO
      register size.

  2.  Identify each MMIO interface based on its per-package scope.

  3.  Invoke the staging function for each identified interface, which
      will be implemented separately.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Link: https://lore.kernel.org/all/871pznq229.ffs@tglx
---
V1 -> V2:
* Adjust to reference the staging_state struct.

RFC-V1 -> V1:
* Simplify code by leveraging the architectural per-package staging scope
  (Thomas).
* Fix MSR read code (Boris and Dave).
* Rename the staging function: staging_work() -> do_stage() (Boris).
* Polish the result messages (Boris).
* Add a prototype for builds without CONFIG_CPU_SUP_INTEL (Boris).
* Massage the changelog.

Note:

1. Using a direct reference to 'cpu_primary_thread_mask' in
   for_each_cpu(...) causes a build error when !CONFIG_SMP. Instead, use
   the wrapper function topology_is_primary_thread() to avoid it.

2. Ideally, the do_stage() function would be as simple as a single WRMSR
   execution. If this were the case, the staging flow could be completed
   with this patch.
---
 arch/x86/include/asm/msr-index.h      |  2 +
 arch/x86/kernel/cpu/microcode/intel.c | 57 ++++++++++++++++++++++++++-
 2 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index bc6d2de109b5..f123abfdffcb 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -891,6 +891,8 @@
 #define MSR_IA32_UCODE_WRITE		0x00000079
 #define MSR_IA32_UCODE_REV		0x0000008b
 
+#define MSR_IA32_MCU_STAGING_MBOX_ADDR	0x000007a5
+
 /* Intel SGX Launch Enclave Public Key Hash MSRs */
 #define MSR_IA32_SGXLEPUBKEYHASH0	0x0000008C
 #define MSR_IA32_SGXLEPUBKEYHASH1	0x0000008D
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 57ed5d414cd1..14c20b53f14d 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -64,8 +64,11 @@ struct extended_sigtable {
  * @bytes_sent:		Total bytes transmitted so far
  * @offset:		Current offset in the microcode image
  * @state:		Current state of the staging process
+ *
+ * Staging is performed sequentially per package, so concurrent access is
+ * not expected.
  */
-struct staging_state {
+static struct staging_state {
 	void __iomem		*mmio_base;
 	void			*ucode_ptr;
 	unsigned int		ucode_len;
@@ -73,7 +76,7 @@ struct staging_state {
 	unsigned int		bytes_sent;
 	unsigned int		offset;
 	enum ucode_state	state;
-};
+} staging;
 
 #define DEFAULT_UCODE_TOTALSIZE (DEFAULT_UCODE_DATASIZE + MC_HEADER_SIZE)
 #define EXT_HEADER_SIZE		(sizeof(struct extended_sigtable))
@@ -320,6 +323,55 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 	return size ? NULL : patch;
 }
 
+/*
+ * Handle the staging process using the mailbox MMIO interface. The
+ * caller is expected to check the result in staging.state.
+ */
+static void do_stage(u64 mmio_pa)
+{
+	pr_debug_once("Staging implementation is pending.\n");
+	staging.state = UCODE_ERROR;
+}
+
+static void stage_microcode(void)
+{
+	unsigned int pkg_id = UINT_MAX;
+	u64 mmio_pa;
+	int cpu;
+
+	staging.ucode_ptr = ucode_patch_late;
+	staging.ucode_len = get_totalsize(&ucode_patch_late->hdr);
+	if (!IS_ALIGNED(staging.ucode_len, sizeof(u32)))
+		return;
+
+	lockdep_assert_cpus_held();
+
+	/*
+	 * The MMIO address is unique per package, and all the SMT
+	 * primary threads are online here. Find each MMIO space by
+	 * their package ids to avoid duplicate staging.
+	 */
+	for_each_cpu(cpu, cpu_online_mask) {
+		if (!topology_is_primary_thread(cpu) ||
+		     topology_logical_package_id(cpu) == pkg_id)
+			continue;
+		pkg_id = topology_logical_package_id(cpu);
+
+		rdmsrl_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &mmio_pa);
+
+		do_stage(mmio_pa);
+		if (staging.state != UCODE_OK) {
+			pr_err("Error: staging failed with %s for CPU%d at package %u.\n",
+			       staging.state == UCODE_TIMEOUT ? "timeout" : "error state",
+			       cpu, pkg_id);
+			return;
+		}
+	}
+
+	pr_info("Staging of patch revision 0x%x succeeded.\n",
+		((struct microcode_header_intel *)ucode_patch_late)->rev);
+}
+
 static enum ucode_state __apply_microcode(struct ucode_cpu_info *uci,
 					  struct microcode_intel *mc,
 					  u32 *cur_rev)
@@ -648,6 +700,7 @@ static struct microcode_ops microcode_intel_ops = {
 	.collect_cpu_info	= collect_cpu_info,
 	.apply_microcode	= apply_microcode_late,
 	.finalize_late_load	= finalize_late_load,
+	.stage_microcode	= stage_microcode,
 	.use_nmi		= IS_ENABLED(CONFIG_X86_64),
 };
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 4/6] x86/microcode/intel: Implement staging handler
  2025-03-20 23:40 [PATCH v2 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
                   ` (2 preceding siblings ...)
  2025-03-20 23:40 ` [PATCH v2 3/6] x86/microcode/intel: Establish staging control logic Chang S. Bae
@ 2025-03-20 23:40 ` Chang S. Bae
  2025-03-21  0:15   ` Dave Hansen
  2025-03-20 23:40 ` [PATCH v2 5/6] x86/microcode/intel: Support mailbox transfer Chang S. Bae
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-03-20 23:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chang.seok.bae

Previously, per-package staging invocations were established for each
MMIO space. The next step is to implement the staging handler according
to the specified protocol. Below are key aspects to note:

  (a)  Each staging process must begin by resetting the staging hardware.

  (b)  The staging hardware processes up to a page-sized chunk of the
       microcode image per iteration, requiring software to submit data
       incrementally.

  (c)  Once a data chunk is processed, the hardware responds with an
       offset in the image for the next chunk.

  (d)  The offset may indicate completion or request retransmission of an
       already transferred chunk. As long as the total transferred data
       remains within the predefined limit (twice the image size),
       retransmissions should be acceptable.

With that, incorporate these code sequences to the staging handler:

  1.  Initialization: Map the MMIO space via ioremap(). Reset the staging
      hardware and initialize software state, ensuring a fresh staging
      process aligned with (a).

  2.  Processing Loop: Introduce a loop iterating over data chunk,
      following (b), with proper termination conditions established from
      (d) -- stop staging when the hardware signals completion, or if the
      total transmitted data exceeds the predefined limit.

  3.  Loop Body: Finally, compose the loop body with two steps --
      transmitting a data chunk and retrieving the next offset from the
      hardware response, aligning with (b) and (c).

Since data transmission and mailbox format handling require additional
details, they are implemented separately in next changes.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
V1 -> V2:
* Re-write the changelog for clarity (Dave).
* Move staging handling code into intel.c (Boris).
* Add extensive comments to clarify staging logic and hardware
  interactions, along with function renaming renaming (Dave).

 A key change to highlight is the updated main loop in do_stage(). With
 the introduction of global staging data, sub-functions now focus on
 being more self-explanatory through meaningful naming.

RFC-V1 -> V1:
* Rename the function name and change the return type.
---
 arch/x86/kernel/cpu/microcode/intel.c | 116 +++++++++++++++++++++++++-
 1 file changed, 113 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 14c20b53f14d..05b5b73e525a 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -20,6 +20,8 @@
 #include <linux/cpu.h>
 #include <linux/uio.h>
 #include <linux/mm.h>
+#include <linux/delay.h>
+#include <linux/io.h>
 
 #include <asm/cpu_device_id.h>
 #include <asm/processor.h>
@@ -33,6 +35,29 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
 
 #define UCODE_BSP_LOADED	((struct microcode_intel *)0x1UL)
 
+/* Defines for the microcode staging mailbox interface */
+
+#define MBOX_REG_NUM		4
+#define MBOX_REG_SIZE		sizeof(u32)
+
+#define MBOX_CONTROL_OFFSET	0x0
+#define MBOX_STATUS_OFFSET	0x4
+
+#define MASK_MBOX_CTRL_ABORT	BIT(0)
+
+/*
+ * Each microcode image is divided into chunks, each at most
+ * MBOX_XACTION_SIZE in size. A 10-chunk image would typically require
+ * 10 transactions. However, the hardware managing the mailbox has
+ * limited resources and may not cache the entire image, potentially
+ * requesting the same chunk multiple times.
+ *
+ * To accommodate this behavior, allow up to twice the expected number of
+ * transactions (i.e., a 10-chunk image can take up to 20 attempts).
+ */
+#define MBOX_XACTION_SIZE	PAGE_SIZE
+#define MBOX_XACTION_MAX(imgsz)	((imgsz) * 2)
+
 /* Current microcode patch used in early patching on the APs. */
 static struct microcode_intel *ucode_patch_va __read_mostly;
 static struct microcode_intel *ucode_patch_late __read_mostly;
@@ -323,14 +348,99 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 	return size ? NULL : patch;
 }
 
+/*
+ * Prepare for a new microcode transfer by resetting both hardware and
+ * software states.
+ */
+static inline void reset_stage(void)
+{
+	/* Reset tracking variables */
+	staging.offset = 0;
+	staging.bytes_sent = 0;
+
+	/*
+	 * Abort any ongoing process, effectively resetting the device.
+	 * Unlike regular mailbox data processing requests, this
+	 * operation does not require a status check.
+	 */
+	writel(MASK_MBOX_CTRL_ABORT, staging.mmio_base + MBOX_CONTROL_OFFSET);
+}
+
+/*
+ * Check if the staging process has completed. The hardware signals
+ * completion by setting a unique end offset.
+ */
+static inline bool is_stage_complete(void)
+{
+	return staging.offset == UINT_MAX;
+}
+
+/*
+ * Determine if the next data chunk can be sent. Each chunk is typically
+ * one page unless the remaining data is smaller. If the total
+ * transmitted data exceeds the defined limit, a timeout occurs.
+ */
+static bool can_send_next_chunk(void)
+{
+	WARN_ON_ONCE(staging.ucode_len < staging.offset);
+	staging.chunk_size = min(MBOX_XACTION_SIZE, staging.ucode_len - staging.offset);
+
+	if (staging.bytes_sent + staging.chunk_size > MBOX_XACTION_MAX(staging.ucode_len)) {
+		staging.state = UCODE_TIMEOUT;
+		return false;
+	}
+	return true;
+}
+
+/*
+ * Transmit a chunk of the microcode image to the hardware.
+ * Return true if the chunk is processed successfully.
+ */
+static bool send_data_chunk(void)
+{
+	pr_debug_once("Staging mailbox loading code needs to be implemented.\n");
+	return false;
+}
+
+/*
+ * Retrieve the next offset from the hardware response.
+ * Return true if the response is valid, false otherwise.
+ */
+static bool fetch_next_offset(void)
+{
+	pr_debug_once("Staging mailbox response handling code needs to be implemented.\n\n");
+	return false;
+}
+
 /*
  * Handle the staging process using the mailbox MMIO interface. The
- * caller is expected to check the result in staging.state.
+ * microcode image is transferred in chunks until completion. The caller
+ * is expected to check the result in staging.state.
  */
 static void do_stage(u64 mmio_pa)
 {
-	pr_debug_once("Staging implementation is pending.\n");
-	staging.state = UCODE_ERROR;
+	staging.mmio_base = ioremap(mmio_pa, MBOX_REG_NUM * MBOX_REG_SIZE);
+	if (WARN_ON_ONCE(!staging.mmio_base)) {
+		staging.state = UCODE_ERROR;
+		return;
+	}
+
+	reset_stage();
+
+	/* Perform the staging process while within the retry limit */
+	while (!is_stage_complete() && can_send_next_chunk()) {
+		/* Send a chunk of microcode each time: */
+		if (!send_data_chunk())
+			break;
+		/*
+		 * Then, ask the hardware which piece of the image it
+		 * needs next. The same piece may be sent more than once.
+		 */
+		if (!fetch_next_offset())
+			break;
+	}
+
+	iounmap(staging.mmio_base);
 }
 
 static void stage_microcode(void)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 5/6] x86/microcode/intel: Support mailbox transfer
  2025-03-20 23:40 [PATCH v2 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
                   ` (3 preceding siblings ...)
  2025-03-20 23:40 ` [PATCH v2 4/6] x86/microcode/intel: Implement staging handler Chang S. Bae
@ 2025-03-20 23:40 ` Chang S. Bae
  2025-03-21 21:19   ` [PATCH v2a " Chang S. Bae
  2025-03-27  3:32   ` [PATCH v2 " Chao Gao
  2025-03-20 23:40 ` [PATCH v2 6/6] x86/microcode/intel: Enable staging when available Chang S. Bae
  2025-04-09 23:27 ` [PATCH v3 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
  6 siblings, 2 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-03-20 23:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chang.seok.bae

Previously, the functions for sending microcode data and retrieving the
next offset were placeholders, as they required handling the specific
mailbox format. Implement them as following:

== Mailbox Format ==

The staging mailbox consists of two primary sections: 'header' and
'data'. While the microcode must be transferred following this format,
the actual data transfer mechanism involves reading and writing to
specific MMIO registers.

== Mailbox Data Registers ==

Unlike conventional interfaces that allocate MMIO space for each data
chunk, the staging interface features a "narrow" interface, using only
two dword-sized registers for read and write operations.

For example, if writing 2 dwords of data to a device. Typically, the
device would expose 2 dwords of "wide" MMIO space. To send the data to
the device:

	writel(buf[0], io_addr + 0);
	writel(buf[1], io_addr + 1);

But, this interface is a bit different. Instead of having a "wide"
interface where there is separate MMIO space for each word in a
transaction, it has a "narrow" interface where several words are written
to the same spot in MMIO space:

	writel(buf[0], io_addr);
	writel(buf[1], io_addr);

The same goes for the read side.

== Implementation Summary ==

Given that, introduce two layers of helper functions at first:

  * Low-level helpers for reading and writing to data registers directly.
  * Wrapper functions for handling mailbox header and data sections.

Using them, implement send_data_chunk() and fetch_next_offset()
functions. Add explicit error and timeout handling routine in
wait_for_transaction(), finishing up the transfer.

Note: The kernel has support for similar mailboxes. But none of them are
compatible with this one. Trying to share code resulted in a bloated
mess, so this code is standalone.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
V1 -> V2:
* Add lots of code comments and edit the changelog (Dave).
* Encapsulate register read/write operations for processing header and
  data sections.
---
 arch/x86/kernel/cpu/microcode/intel.c | 148 +++++++++++++++++++++++++-
 1 file changed, 144 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 05b5b73e525a..b0d530db72dd 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -22,6 +22,7 @@
 #include <linux/mm.h>
 #include <linux/delay.h>
 #include <linux/io.h>
+#include <linux/pci_ids.h>
 
 #include <asm/cpu_device_id.h>
 #include <asm/processor.h>
@@ -42,8 +43,30 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
 
 #define MBOX_CONTROL_OFFSET	0x0
 #define MBOX_STATUS_OFFSET	0x4
+#define MBOX_WRDATA_OFFSET	0x8
+#define MBOX_RDDATA_OFFSET	0xc
 
 #define MASK_MBOX_CTRL_ABORT	BIT(0)
+#define MASK_MBOX_CTRL_GO	BIT(31)
+
+#define MASK_MBOX_STATUS_ERROR	BIT(2)
+#define MASK_MBOX_STATUS_READY	BIT(31)
+
+#define MASK_MBOX_RESP_SUCCESS	BIT(0)
+#define MASK_MBOX_RESP_PROGRESS	BIT(1)
+#define MASK_MBOX_RESP_ERROR	BIT(2)
+
+#define MBOX_CMD_LOAD		0x3
+#define MBOX_OBJ_STAGING	0xb
+#define DWORD_ALIGN(size)	((size) / sizeof(u32))
+#define MBOX_HEADER(mbox_size)	(PCI_VENDOR_ID_INTEL | \
+				 (MBOX_OBJ_STAGING << 16) | \
+				 ((u64)DWORD_ALIGN(mbox_size) << 32))
+
+/* The size of each mailbox header */
+#define MBOX_HEADER_SIZE	sizeof(u64)
+/* The size of staging hardware response */
+#define MBOX_RESPONSE_SIZE	sizeof(u64)
 
 /*
  * Each microcode image is divided into chunks, each at most
@@ -57,6 +80,7 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
  */
 #define MBOX_XACTION_SIZE	PAGE_SIZE
 #define MBOX_XACTION_MAX(imgsz)	((imgsz) * 2)
+#define MBOX_XACTION_TIMEOUT_MS	(10 * MSEC_PER_SEC)
 
 /* Current microcode patch used in early patching on the APs. */
 static struct microcode_intel *ucode_patch_va __read_mostly;
@@ -348,6 +372,49 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 	return size ? NULL : patch;
 }
 
+static inline u32 read_mbox_dword(void)
+{
+	u32 dword = readl(staging.mmio_base + MBOX_RDDATA_OFFSET);
+
+	/* Acknowledge read completion to the staging firmware */
+	writel(0, staging.mmio_base + MBOX_RDDATA_OFFSET);
+	return dword;
+}
+
+static inline void write_mbox_dword(u32 dword)
+{
+	writel(dword, staging.mmio_base + MBOX_WRDATA_OFFSET);
+}
+
+static inline u64 read_mbox_header(void)
+{
+	u32 high, low;
+
+	low  = read_mbox_dword();
+	high = read_mbox_dword();
+
+	return ((u64)high << 32) | low;
+}
+
+static inline void write_mbox_header(u64 value)
+{
+	write_mbox_dword(value);
+	write_mbox_dword(value >> 32);
+}
+
+static inline void write_mbox_data(u32 *chunk, unsigned int chunk_size)
+{
+	int i;
+
+	/*
+	 * The MMIO space is mapped as Uncached (UC). Each write arrives
+	 * at the device as an individual transaction in program order.
+	 * The device can then resemble the sequence accordingly.
+	 */
+	for (i = 0; i < DWORD_ALIGN(chunk_size); i++)
+		write_mbox_dword(chunk[i]);
+}
+
 /*
  * Prepare for a new microcode transfer by resetting both hardware and
  * software states.
@@ -392,14 +459,71 @@ static bool can_send_next_chunk(void)
 	return true;
 }
 
+/*
+ * Wait for the hardware to complete a transaction.
+ * Return true on success, false on failure.
+ */
+static bool wait_for_transaction(void)
+{
+	u32 timeout, status;
+
+	/* Allow time for hardware to complete the operation: */
+	for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++) {
+		msleep(1);
+
+		status = readl(staging.mmio_base + MBOX_STATUS_OFFSET);
+		/* Break out early if the hardware is ready: */
+		if (status & MASK_MBOX_STATUS_READY)
+			break;
+	}
+
+	status = readl(staging.mmio_base + MBOX_STATUS_OFFSET);
+
+	/* Check for explicit error response */
+	if (status & MASK_MBOX_STATUS_ERROR) {
+		staging.state = UCODE_ERROR;
+		return false;
+	}
+
+	/*
+	 * Hardware is neither responded to the action nor
+	 * signaled any error. Treat the case as timeout.
+	 */
+	if (!(status & MASK_MBOX_STATUS_READY)) {
+		staging.state = UCODE_TIMEOUT;
+		return false;
+	}
+
+	staging.state = UCODE_OK;
+	return true;
+}
+
 /*
  * Transmit a chunk of the microcode image to the hardware.
  * Return true if the chunk is processed successfully.
  */
 static bool send_data_chunk(void)
 {
-	pr_debug_once("Staging mailbox loading code needs to be implemented.\n");
-	return false;
+	u16 mbox_size = MBOX_HEADER_SIZE * 2 + staging.chunk_size;
+	u32 *chunk = staging.ucode_ptr + staging.offset;
+
+	/*
+	 * Write 'request' mailbox object in the following order:
+	 * - Mailbox header includes total size
+	 * - Command header specifies the load operation
+	 * - Data section contains a microcode chunk
+	 */
+	write_mbox_header(MBOX_HEADER(mbox_size));
+	write_mbox_header(MBOX_CMD_LOAD);
+	write_mbox_data(chunk, staging.chunk_size);
+	staging.bytes_sent += staging.chunk_size;
+
+	/*
+	 * Notify the hardware that the mailbox is ready for processing.
+	 * The staging firmware will process the request asynchronously.
+	 */
+	writel(MASK_MBOX_CTRL_GO, staging.mmio_base + MBOX_CONTROL_OFFSET);
+	return wait_for_transaction();
 }
 
 /*
@@ -408,8 +532,24 @@ static bool send_data_chunk(void)
  */
 static bool fetch_next_offset(void)
 {
-	pr_debug_once("Staging mailbox response handling code needs to be implemented.\n\n");
-	return false;
+	const u16 mbox_size = MBOX_HEADER_SIZE + MBOX_RESPONSE_SIZE;
+
+	/* All responses begin with the same header value: */
+	WARN_ON_ONCE(read_mbox_header() != MBOX_HEADER(mbox_size));
+
+	/*
+	 * The 'response' mailbox contains two dword data:
+	 * - First has next offset in microcode image
+	 * - Second delivers status flag
+	 */
+	staging.offset = read_mbox_dword();
+	if (read_mbox_dword() & MASK_MBOX_RESP_ERROR) {
+		staging.state = UCODE_ERROR;
+		return false;
+	}
+
+	staging.state = UCODE_OK;
+	return true;
 }
 
 /*
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 6/6] x86/microcode/intel: Enable staging when available
  2025-03-20 23:40 [PATCH v2 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
                   ` (4 preceding siblings ...)
  2025-03-20 23:40 ` [PATCH v2 5/6] x86/microcode/intel: Support mailbox transfer Chang S. Bae
@ 2025-03-20 23:40 ` Chang S. Bae
  2025-04-09 23:27 ` [PATCH v3 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
  6 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-03-20 23:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chang.seok.bae

With the staging code being ready, check for staging availability by
reading these following MSRs:

  * IA32_ARCH_CAPABILITIES (bit 16) for the presence of
    IA32_MCU_ENUMERATION

  * IA32_MCU_ENUMERATION (bit 4) for the staging feature availability.

When available, enable it by setting the feature bit.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
V1     -> V2: Fold MSR definings (Boris).
RFC-V1 -> V1: Massage the enabling message.
---
 arch/x86/include/asm/msr-index.h      |  7 +++++++
 arch/x86/kernel/cpu/microcode/intel.c | 17 +++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index f123abfdffcb..50c8c0914454 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -166,6 +166,10 @@
 						 * Processor MMIO stale data
 						 * vulnerabilities.
 						 */
+#define ARCH_CAP_MCU_ENUM		BIT(16) /*
+						 * Indicates the presence of microcode update
+						 * feature enumeration and status information
+						 */
 #define ARCH_CAP_FB_CLEAR		BIT(17)	/*
 						 * VERW clears CPU fill buffer
 						 * even on MDS_NO CPUs.
@@ -891,6 +895,9 @@
 #define MSR_IA32_UCODE_WRITE		0x00000079
 #define MSR_IA32_UCODE_REV		0x0000008b
 
+#define MSR_IA32_MCU_ENUMERATION	0x0000007b
+#define MCU_STAGING			BIT(4)
+
 #define MSR_IA32_MCU_STAGING_MBOX_ADDR	0x000007a5
 
 /* Intel SGX Launch Enclave Public Key Hash MSRs */
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index b0d530db72dd..add0c1edd212 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -962,6 +962,18 @@ static __init void calc_llc_size_per_core(struct cpuinfo_x86 *c)
 	llc_size_per_core = (unsigned int)llc_size;
 }
 
+static __init bool staging_available(void)
+{
+	u64 val;
+
+	val = x86_read_arch_cap_msr();
+	if (!(val & ARCH_CAP_MCU_ENUM))
+		return false;
+
+	rdmsrl(MSR_IA32_MCU_ENUMERATION, val);
+	return !!(val & MCU_STAGING);
+}
+
 struct microcode_ops * __init init_intel_microcode(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
@@ -972,6 +984,11 @@ struct microcode_ops * __init init_intel_microcode(void)
 		return NULL;
 	}
 
+	if (staging_available()) {
+		microcode_intel_ops.use_staging = true;
+		pr_info("Enabled staging feature.\n");
+	}
+
 	calc_llc_size_per_core(c);
 
 	return &microcode_intel_ops;
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 4/6] x86/microcode/intel: Implement staging handler
  2025-03-20 23:40 ` [PATCH v2 4/6] x86/microcode/intel: Implement staging handler Chang S. Bae
@ 2025-03-21  0:15   ` Dave Hansen
  2025-03-21 21:19     ` [PATCH v2a " Chang S. Bae
  2025-03-21 21:19     ` [PATCH v2 " Chang S. Bae
  0 siblings, 2 replies; 69+ messages in thread
From: Dave Hansen @ 2025-03-21  0:15 UTC (permalink / raw)
  To: Chang S. Bae, linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell

On 3/20/25 16:40, Chang S. Bae wrote:
>  static void do_stage(u64 mmio_pa)
>  {
> -	pr_debug_once("Staging implementation is pending.\n");
> -	staging.state = UCODE_ERROR;
> +	staging.mmio_base = ioremap(mmio_pa, MBOX_REG_NUM * MBOX_REG_SIZE);
> +	if (WARN_ON_ONCE(!staging.mmio_base)) {
> +		staging.state = UCODE_ERROR;
> +		return;
> +	}
> +
> +	reset_stage();

This is purely a style thing, but I really dislike using global
variables like this.

What is reset_stage() doing? What state is it resetting? What is locking
this? What is its scope?

Now, imagine you have a function variable. It can be on the stack or static:

static void do_stage(u64 mmio_pa)
{
	struct staging_state staging = {};

Then you pass it around:

	staging.mmio_base = ioremap(mmio_pa, MBOX_REG_NUM * ...
	if (WARN_ON_ONCE(!staging.mmio_base)) {
		staging.state = UCODE_ERROR;
		return;
	}

	reset_stage(&staging);

Yeah, it means passing around _one_ function argument to a function
that's not currently taking an argument. But it makes it a heck of a lot
more clear what is going on. It also makes the lifetime and
initialization state of 'staging' *MUCH* more obvious.

It also makes it clear what state reset_stage() needs to do its job.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH v2a 3/6] x86/microcode/intel: Establish staging control logic
  2025-03-20 23:40 ` [PATCH v2 3/6] x86/microcode/intel: Establish staging control logic Chang S. Bae
@ 2025-03-21 21:18   ` Chang S. Bae
  2025-03-26  7:35     ` Chao Gao
  0 siblings, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-03-21 21:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chang.seok.bae

When microcode staging is initiated, operations are carried out through
an MMIO interface. Each package has a unique interface specified by the
IA32_MCU_STAGING_MBOX_ADDR MSR, which maps to a set of 32-bit registers.

Prepare staging with the following steps:

  1.  Ensure the microcode image is 32-bit aligned to match the MMIO
      register size.

  2.  Identify each MMIO interface based on its per-package scope.

  3.  Invoke the staging function for each identified interface, which
      will be implemented separately.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Link: https://lore.kernel.org/all/871pznq229.ffs@tglx
---
V2 -> V2a:
* Remove a global variable and adjust stage_microcode() (Dave [1]).

Note: this quick revision is just intended to ensure that the feedback
has been properly addressed.

[1]: https://lore.kernel.org/lkml/b01224ee-c935-4b08-a76f-5dc49341182d@intel.com/
---
 arch/x86/include/asm/msr-index.h      |  2 ++
 arch/x86/kernel/cpu/microcode/intel.c | 49 +++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index bc6d2de109b5..f123abfdffcb 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -891,6 +891,8 @@
 #define MSR_IA32_UCODE_WRITE		0x00000079
 #define MSR_IA32_UCODE_REV		0x0000008b
 
+#define MSR_IA32_MCU_STAGING_MBOX_ADDR	0x000007a5
+
 /* Intel SGX Launch Enclave Public Key Hash MSRs */
 #define MSR_IA32_SGXLEPUBKEYHASH0	0x0000008C
 #define MSR_IA32_SGXLEPUBKEYHASH1	0x0000008D
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 57ed5d414cd1..5d0216e9aee5 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -320,6 +320,54 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 	return size ? NULL : patch;
 }
 
+/*
+ * Handle the staging process using the mailbox MMIO interface.
+ * Return the result state.
+ */
+static enum ucode_state do_stage(u64 mmio_pa)
+{
+	pr_debug_once("Staging implementation is pending.\n");
+	return UCODE_ERROR;
+}
+
+static void stage_microcode(void)
+{
+	unsigned int pkg_id = UINT_MAX;
+	enum ucode_state ret;
+	u64 mmio_pa;
+	int cpu;
+
+	if (!IS_ALIGNED(get_totalsize(&ucode_patch_late->hdr), sizeof(u32)))
+		return;
+
+	lockdep_assert_cpus_held();
+
+	/*
+	 * The MMIO address is unique per package, and all the SMT
+	 * primary threads are online here. Find each MMIO space by
+	 * their package ids to avoid duplicate staging.
+	 */
+	for_each_cpu(cpu, cpu_online_mask) {
+		if (!topology_is_primary_thread(cpu) ||
+		     topology_logical_package_id(cpu) == pkg_id)
+			continue;
+		pkg_id = topology_logical_package_id(cpu);
+
+		rdmsrl_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &mmio_pa);
+
+		ret = do_stage(mmio_pa);
+		if (ret != UCODE_OK) {
+			pr_err("Error: staging failed with %s for CPU%d at package %u.\n",
+			       ret == UCODE_TIMEOUT ? "timeout" : "error state",
+			       cpu, pkg_id);
+			return;
+		}
+	}
+
+	pr_info("Staging of patch revision 0x%x succeeded.\n",
+		((struct microcode_header_intel *)ucode_patch_late)->rev);
+}
+
 static enum ucode_state __apply_microcode(struct ucode_cpu_info *uci,
 					  struct microcode_intel *mc,
 					  u32 *cur_rev)
@@ -648,6 +696,7 @@ static struct microcode_ops microcode_intel_ops = {
 	.collect_cpu_info	= collect_cpu_info,
 	.apply_microcode	= apply_microcode_late,
 	.finalize_late_load	= finalize_late_load,
+	.stage_microcode	= stage_microcode,
 	.use_nmi		= IS_ENABLED(CONFIG_X86_64),
 };
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2a 4/6] x86/microcode/intel: Implement staging handler
  2025-03-21  0:15   ` Dave Hansen
@ 2025-03-21 21:19     ` Chang S. Bae
  2025-03-26  8:34       ` Chao Gao
  2025-03-21 21:19     ` [PATCH v2 " Chang S. Bae
  1 sibling, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-03-21 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chang.seok.bae

Previously, per-package staging invocations were established for each
MMIO space. The next step is to implement the staging handler according
to the specified protocol. Below are key aspects to note:

  (a)  Each staging process must begin by resetting the staging hardware.

  (b)  The staging hardware processes up to a page-sized chunk of the
       microcode image per iteration, requiring software to submit data
       incrementally.

  (c)  Once a data chunk is processed, the hardware responds with an
       offset in the image for the next chunk.

  (d)  The offset may indicate completion or request retransmission of an
       already transferred chunk. As long as the total transferred data
       remains within the predefined limit (twice the image size),
       retransmissions should be acceptable.

With that, incorporate these code sequences to the staging handler:

  1.  Initialization: Map the MMIO space via ioremap(). Reset the staging
      hardware and initialize software state, ensuring a fresh staging
      process aligned with (a).

  2.  Processing Loop: Introduce a loop iterating over data chunk,
      following (b), with proper termination conditions established from
      (d) -- stop staging when the hardware signals completion, or if the
      total transmitted data exceeds the predefined limit.

  3.  Loop Body: Finally, compose the loop body with two steps --
      transmitting a data chunk and retrieving the next offset from the
      hardware response, aligning with (b) and (c).

Since data transmission and mailbox format handling require additional
details, they are implemented separately in next changes.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
V2 -> V2a:
* Adjust the code with the global variable removal (Dave [1]).

Note: this quick revision is just intended to ensure that the feedback
has been properly addressed.

[1]: https://lore.kernel.org/lkml/b01224ee-c935-4b08-a76f-5dc49341182d@intel.com/
---
 arch/x86/kernel/cpu/microcode/intel.c | 122 +++++++++++++++++++++++++-
 1 file changed, 118 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 5d0216e9aee5..44b94d4d05f7 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -20,6 +20,8 @@
 #include <linux/cpu.h>
 #include <linux/uio.h>
 #include <linux/mm.h>
+#include <linux/delay.h>
+#include <linux/io.h>
 
 #include <asm/cpu_device_id.h>
 #include <asm/processor.h>
@@ -33,6 +35,29 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
 
 #define UCODE_BSP_LOADED	((struct microcode_intel *)0x1UL)
 
+/* Defines for the microcode staging mailbox interface */
+
+#define MBOX_REG_NUM		4
+#define MBOX_REG_SIZE		sizeof(u32)
+
+#define MBOX_CONTROL_OFFSET	0x0
+#define MBOX_STATUS_OFFSET	0x4
+
+#define MASK_MBOX_CTRL_ABORT	BIT(0)
+
+/*
+ * Each microcode image is divided into chunks, each at most
+ * MBOX_XACTION_SIZE in size. A 10-chunk image would typically require
+ * 10 transactions. However, the hardware managing the mailbox has
+ * limited resources and may not cache the entire image, potentially
+ * requesting the same chunk multiple times.
+ *
+ * To accommodate this behavior, allow up to twice the expected number of
+ * transactions (i.e., a 10-chunk image can take up to 20 attempts).
+ */
+#define MBOX_XACTION_SIZE	PAGE_SIZE
+#define MBOX_XACTION_MAX(imgsz)	((imgsz) * 2)
+
 /* Current microcode patch used in early patching on the APs. */
 static struct microcode_intel *ucode_patch_va __read_mostly;
 static struct microcode_intel *ucode_patch_late __read_mostly;
@@ -321,13 +346,102 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 }
 
 /*
- * Handle the staging process using the mailbox MMIO interface.
- * Return the result state.
+ * Prepare for a new microcode transfer by resetting hardware and
+ * initializing software states.
+ */
+static void init_stage(struct staging_state *ss)
+{
+	ss->ucode_ptr = ucode_patch_late;
+	ss->ucode_len = get_totalsize(&ucode_patch_late->hdr);
+
+	/* Reset tracking variables */
+	ss->offset = 0;
+	ss->bytes_sent = 0;
+
+	/*
+	 * Abort any ongoing process, effectively resetting the device.
+	 * Unlike regular mailbox data processing requests, this
+	 * operation does not require a status check.
+	 */
+	writel(MASK_MBOX_CTRL_ABORT, ss->mmio_base + MBOX_CONTROL_OFFSET);
+}
+
+/*
+ * Check if the staging process has completed. The hardware signals
+ * completion by setting a unique end offset.
+ */
+static inline bool is_stage_complete(unsigned int offset)
+{
+	return offset == UINT_MAX;
+}
+
+/*
+ * Determine if the next data chunk can be sent. Each chunk is typically
+ * one page unless the remaining data is smaller. If the total
+ * transmitted data exceeds the defined limit, a timeout occurs.
+ */
+static bool can_send_next_chunk(struct staging_state *ss)
+{
+	WARN_ON_ONCE(ss->ucode_len < ss->offset);
+	ss->chunk_size = min(MBOX_XACTION_SIZE, ss->ucode_len - ss->offset);
+
+	if (ss->bytes_sent + ss->chunk_size > MBOX_XACTION_MAX(ss->ucode_len)) {
+		ss->state = UCODE_TIMEOUT;
+		return false;
+	}
+	return true;
+}
+
+/*
+ * Transmit a chunk of the microcode image to the hardware.
+ * Return true if the chunk is processed successfully.
+ */
+static bool send_data_chunk(struct staging_state *unused)
+{
+	pr_debug_once("Staging mailbox loading code needs to be implemented.\n");
+	return false;
+}
+
+/*
+ * Retrieve the next offset from the hardware response.
+ * Return true if the response is valid, false otherwise.
+ */
+static bool fetch_next_offset(struct staging_state *unused)
+{
+	pr_debug_once("Staging mailbox response handling code needs to be implemented.\n\n");
+	return false;
+}
+
+/*
+ * Handle the staging process using the mailbox MMIO interface. The
+ * microcode image is transferred in chunks until completion. Return the
+ * result state.
  */
 static enum ucode_state do_stage(u64 mmio_pa)
 {
-	pr_debug_once("Staging implementation is pending.\n");
-	return UCODE_ERROR;
+	struct staging_state ss = {};
+
+	ss.mmio_base = ioremap(mmio_pa, MBOX_REG_NUM * MBOX_REG_SIZE);
+	if (WARN_ON_ONCE(!ss.mmio_base))
+		return UCODE_ERROR;
+
+	init_stage(&ss);
+
+	/* Perform the staging process while within the retry limit */
+	while (!is_stage_complete(ss.offset) && can_send_next_chunk(&ss)) {
+		/* Send a chunk of microcode each time: */
+		if (!send_data_chunk(&ss))
+			break;
+		/*
+		 * Then, ask the hardware which piece of the image it
+		 * needs next. The same piece may be sent more than once.
+		 */
+		if (!fetch_next_offset(&ss))
+			break;
+	}
+
+	iounmap(ss.mmio_base);
+	return ss.state;
 }
 
 static void stage_microcode(void)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2a 5/6] x86/microcode/intel: Support mailbox transfer
  2025-03-20 23:40 ` [PATCH v2 5/6] x86/microcode/intel: Support mailbox transfer Chang S. Bae
@ 2025-03-21 21:19   ` Chang S. Bae
  2025-03-27  3:32   ` [PATCH v2 " Chao Gao
  1 sibling, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-03-21 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chang.seok.bae

Previously, the functions for sending microcode data and retrieving the
next offset were placeholders, as they required handling the specific
mailbox format. Implement them as following:

== Mailbox Format ==

The staging mailbox consists of two primary sections: 'header' and
'data'. While the microcode must be transferred following this format,
the actual data transfer mechanism involves reading and writing to
specific MMIO registers.

== Mailbox Data Registers ==

Unlike conventional interfaces that allocate MMIO space for each data
chunk, the staging interface features a "narrow" interface, using only
two dword-sized registers for read and write operations.

For example, if writing 2 dwords of data to a device. Typically, the
device would expose 2 dwords of "wide" MMIO space. To send the data to
the device:

	writel(buf[0], io_addr + 0);
	writel(buf[1], io_addr + 1);

But, this interface is a bit different. Instead of having a "wide"
interface where there is separate MMIO space for each word in a
transaction, it has a "narrow" interface where several words are written
to the same spot in MMIO space:

	writel(buf[0], io_addr);
	writel(buf[1], io_addr);

The same goes for the read side.

== Implementation Summary ==

Given that, introduce two layers of helper functions at first:

  * Low-level helpers for reading and writing to data registers directly.
  * Wrapper functions for handling mailbox header and data sections.

Using them, implement send_data_chunk() and fetch_next_offset()
functions. Add explicit error and timeout handling routine in
wait_for_transaction(), finishing up the transfer.

Note: The kernel has support for similar mailboxes. But none of them are
compatible with this one. Trying to share code resulted in a bloated
mess, so this code is standalone.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
V2 -> V2a:
* Adjust the code to align with the global variable removal (Dave [1]).

Note: this quick revision is just intended to ensure that the feedback
has been properly addressed.

[1]: https://lore.kernel.org/lkml/b01224ee-c935-4b08-a76f-5dc49341182d@intel.com/
---
 arch/x86/kernel/cpu/microcode/intel.c | 152 +++++++++++++++++++++++++-
 1 file changed, 146 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 44b94d4d05f7..ca1847ce1059 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -22,6 +22,7 @@
 #include <linux/mm.h>
 #include <linux/delay.h>
 #include <linux/io.h>
+#include <linux/pci_ids.h>
 
 #include <asm/cpu_device_id.h>
 #include <asm/processor.h>
@@ -42,8 +43,30 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
 
 #define MBOX_CONTROL_OFFSET	0x0
 #define MBOX_STATUS_OFFSET	0x4
+#define MBOX_WRDATA_OFFSET	0x8
+#define MBOX_RDDATA_OFFSET	0xc
 
 #define MASK_MBOX_CTRL_ABORT	BIT(0)
+#define MASK_MBOX_CTRL_GO	BIT(31)
+
+#define MASK_MBOX_STATUS_ERROR	BIT(2)
+#define MASK_MBOX_STATUS_READY	BIT(31)
+
+#define MASK_MBOX_RESP_SUCCESS	BIT(0)
+#define MASK_MBOX_RESP_PROGRESS	BIT(1)
+#define MASK_MBOX_RESP_ERROR	BIT(2)
+
+#define MBOX_CMD_LOAD		0x3
+#define MBOX_OBJ_STAGING	0xb
+#define DWORD_ALIGN(size)	((size) / sizeof(u32))
+#define MBOX_HEADER(mbox_size)	(PCI_VENDOR_ID_INTEL | \
+				 (MBOX_OBJ_STAGING << 16) | \
+				 ((u64)DWORD_ALIGN(mbox_size) << 32))
+
+/* The size of each mailbox header */
+#define MBOX_HEADER_SIZE	sizeof(u64)
+/* The size of staging hardware response */
+#define MBOX_RESPONSE_SIZE	sizeof(u64)
 
 /*
  * Each microcode image is divided into chunks, each at most
@@ -57,6 +80,7 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
  */
 #define MBOX_XACTION_SIZE	PAGE_SIZE
 #define MBOX_XACTION_MAX(imgsz)	((imgsz) * 2)
+#define MBOX_XACTION_TIMEOUT_MS	(10 * MSEC_PER_SEC)
 
 /* Current microcode patch used in early patching on the APs. */
 static struct microcode_intel *ucode_patch_va __read_mostly;
@@ -345,6 +369,49 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 	return size ? NULL : patch;
 }
 
+static inline u32 read_mbox_dword(void __iomem *mmio_base)
+{
+	u32 dword = readl(mmio_base + MBOX_RDDATA_OFFSET);
+
+	/* Acknowledge read completion to the staging firmware */
+	writel(0, mmio_base + MBOX_RDDATA_OFFSET);
+	return dword;
+}
+
+static inline void write_mbox_dword(void __iomem *mmio_base, u32 dword)
+{
+	writel(dword, mmio_base + MBOX_WRDATA_OFFSET);
+}
+
+static inline u64 read_mbox_header(void __iomem *mmio_base)
+{
+	u32 high, low;
+
+	low  = read_mbox_dword(mmio_base);
+	high = read_mbox_dword(mmio_base);
+
+	return ((u64)high << 32) | low;
+}
+
+static inline void write_mbox_header(void __iomem *mmio_base, u64 value)
+{
+	write_mbox_dword(mmio_base, value);
+	write_mbox_dword(mmio_base, value >> 32);
+}
+
+static void write_mbox_data(void __iomem *mmio_base, u32 *chunk, unsigned int chunk_size)
+{
+	int i;
+
+	/*
+	 * The MMIO space is mapped as Uncached (UC). Each write arrives
+	 * at the device as an individual transaction in program order.
+	 * The device can then resemble the sequence accordingly.
+	 */
+	for (i = 0; i < DWORD_ALIGN(chunk_size); i++)
+		write_mbox_dword(mmio_base, chunk[i]);
+}
+
 /*
  * Prepare for a new microcode transfer by resetting hardware and
  * initializing software states.
@@ -392,24 +459,97 @@ static bool can_send_next_chunk(struct staging_state *ss)
 	return true;
 }
 
+/*
+ * Wait for the hardware to complete a transaction.
+ * Return true on success, false on failure.
+ */
+static bool wait_for_transaction(struct staging_state *ss)
+{
+	u32 timeout, status;
+
+	/* Allow time for hardware to complete the operation: */
+	for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++) {
+		msleep(1);
+
+		status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
+		/* Break out early if the hardware is ready: */
+		if (status & MASK_MBOX_STATUS_READY)
+			break;
+	}
+
+	status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
+
+	/* Check for explicit error response */
+	if (status & MASK_MBOX_STATUS_ERROR) {
+		ss->state = UCODE_ERROR;
+		return false;
+	}
+
+	/*
+	 * Hardware is neither responded to the action nor
+	 * signaled any error. Treat the case as timeout.
+	 */
+	if (!(status & MASK_MBOX_STATUS_READY)) {
+		ss->state = UCODE_TIMEOUT;
+		return false;
+	}
+
+	ss->state = UCODE_OK;
+	return true;
+}
+
 /*
  * Transmit a chunk of the microcode image to the hardware.
  * Return true if the chunk is processed successfully.
  */
-static bool send_data_chunk(struct staging_state *unused)
+static bool send_data_chunk(struct staging_state *ss)
 {
-	pr_debug_once("Staging mailbox loading code needs to be implemented.\n");
-	return false;
+	u16 mbox_size = MBOX_HEADER_SIZE * 2 + ss->chunk_size;
+	u32 *chunk = ss->ucode_ptr + ss->offset;
+
+	/*
+	 * Write 'request' mailbox object in the following order:
+	 * - Mailbox header includes total size
+	 * - Command header specifies the load operation
+	 * - Data section contains a microcode chunk
+	 */
+	write_mbox_header(ss->mmio_base, MBOX_HEADER(mbox_size));
+	write_mbox_header(ss->mmio_base, MBOX_CMD_LOAD);
+	write_mbox_data(ss->mmio_base, chunk, ss->chunk_size);
+	ss->bytes_sent += ss->chunk_size;
+
+	/*
+	 * Notify the hardware that the mailbox is ready for processing.
+	 * The staging firmware will process the request asynchronously.
+	 */
+	writel(MASK_MBOX_CTRL_GO, ss->mmio_base + MBOX_CONTROL_OFFSET);
+	return wait_for_transaction(ss);
 }
 
 /*
  * Retrieve the next offset from the hardware response.
  * Return true if the response is valid, false otherwise.
  */
-static bool fetch_next_offset(struct staging_state *unused)
+static bool fetch_next_offset(struct staging_state *ss)
 {
-	pr_debug_once("Staging mailbox response handling code needs to be implemented.\n\n");
-	return false;
+	const u16 mbox_size = MBOX_HEADER_SIZE + MBOX_RESPONSE_SIZE;
+
+	/* All responses begin with the same header value: */
+	WARN_ON_ONCE(read_mbox_header(ss->mmio_base) != MBOX_HEADER(mbox_size));
+
+	/*
+	 * The 'response' mailbox contains two dword data:
+	 * - First has next offset in microcode image
+	 * - Second delivers status flag
+	 */
+	ss->offset = read_mbox_dword(ss->mmio_base);
+	if (read_mbox_dword(ss->mmio_base) & MASK_MBOX_RESP_ERROR) {
+		ss->state = UCODE_ERROR;
+		return false;
+	}
+
+	ss->state = UCODE_OK;
+	return true;
 }
 
 /*
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 4/6] x86/microcode/intel: Implement staging handler
  2025-03-21  0:15   ` Dave Hansen
  2025-03-21 21:19     ` [PATCH v2a " Chang S. Bae
@ 2025-03-21 21:19     ` Chang S. Bae
  1 sibling, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-03-21 21:19 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell

On 3/20/2025 5:15 PM, Dave Hansen wrote:
> 
> Yeah, it means passing around _one_ function argument to a function
> that's not currently taking an argument. But it makes it a heck of a lot
> more clear what is going on. It also makes the lifetime and
> initialization state of 'staging' *MUCH* more obvious.

I see -- that seemed like a case of over-simplification.

I've updated the relevant patches after passing the staging test. I've 
also thought reordering patches (patch2 <-> patch3), perhaps when 
posting v3.

Thanks,
Chang

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2a 3/6] x86/microcode/intel: Establish staging control logic
  2025-03-21 21:18   ` [PATCH v2a " Chang S. Bae
@ 2025-03-26  7:35     ` Chao Gao
  2025-03-26 18:43       ` Chang S. Bae
  0 siblings, 1 reply; 69+ messages in thread
From: Chao Gao @ 2025-03-26  7:35 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen, colinmitchell

>+static void stage_microcode(void)
>+{
>+	unsigned int pkg_id = UINT_MAX;
>+	enum ucode_state ret;
>+	u64 mmio_pa;
>+	int cpu;
>+
>+	if (!IS_ALIGNED(get_totalsize(&ucode_patch_late->hdr), sizeof(u32)))
>+		return;
>+
>+	lockdep_assert_cpus_held();
>+
>+	/*
>+	 * The MMIO address is unique per package, and all the SMT
>+	 * primary threads are online here. Find each MMIO space by
>+	 * their package ids to avoid duplicate staging.
>+	 */
>+	for_each_cpu(cpu, cpu_online_mask) {

for_each_online_cpu(cpu)?

>+		if (!topology_is_primary_thread(cpu) ||
>+		     topology_logical_package_id(cpu) == pkg_id)
>+			continue;

Documentation/arch/x86/topology.rst states:
  - topology_core_cpumask():

    The cpumask contains all online threads in the package to which a thread
    belongs.

    The number of online threads is also printed in /proc/cpuinfo "siblings."

So, how about:

		if (cpu != cpumask_first(topology_core_cpumask(cpu)))
			continue;

and dropping the pkg_id?

>+		pkg_id = topology_logical_package_id(cpu);
>+
>+		rdmsrl_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &mmio_pa);

Note rdmsrl_on_cpu() may return an error. please consider adding
error-handling. Is it possible that somehow one package doesn't support
this staging feature while others do?

>+
>+		ret = do_stage(mmio_pa);
>+		if (ret != UCODE_OK) {
>+			pr_err("Error: staging failed with %s for CPU%d at package %u.\n",
>+			       ret == UCODE_TIMEOUT ? "timeout" : "error state",
>+			       cpu, pkg_id);

Shall we print a message somewhere showing "Continuing updates without
staging"?

It could be confusing for users to see a success message following an error
message that states "Error: staging failed ..."

>+			return;
>+		}
>+	}
>+
>+	pr_info("Staging of patch revision 0x%x succeeded.\n",
>+		((struct microcode_header_intel *)ucode_patch_late)->rev);
>+}
>+
> static enum ucode_state __apply_microcode(struct ucode_cpu_info *uci,
> 					  struct microcode_intel *mc,
> 					  u32 *cur_rev)
>@@ -648,6 +696,7 @@ static struct microcode_ops microcode_intel_ops = {
> 	.collect_cpu_info	= collect_cpu_info,
> 	.apply_microcode	= apply_microcode_late,
> 	.finalize_late_load	= finalize_late_load,
>+	.stage_microcode	= stage_microcode,
> 	.use_nmi		= IS_ENABLED(CONFIG_X86_64),
> };
> 
>-- 
>2.45.2
>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2a 4/6] x86/microcode/intel: Implement staging handler
  2025-03-21 21:19     ` [PATCH v2a " Chang S. Bae
@ 2025-03-26  8:34       ` Chao Gao
  2025-03-26 18:43         ` Chang S. Bae
  0 siblings, 1 reply; 69+ messages in thread
From: Chao Gao @ 2025-03-26  8:34 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen, colinmitchell

> /*
>- * Handle the staging process using the mailbox MMIO interface.
>- * Return the result state.
>+ * Prepare for a new microcode transfer by resetting hardware and
>+ * initializing software states.
>+ */
>+static void init_stage(struct staging_state *ss)
>+{
>+	ss->ucode_ptr = ucode_patch_late;
>+	ss->ucode_len = get_totalsize(&ucode_patch_late->hdr);
>+
>+	/* Reset tracking variables */
>+	ss->offset = 0;
>+	ss->bytes_sent = 0;

Nit: no need to reset them, as

>+	struct staging_state ss = {};

in do_stage() will zero the whole structure.

>+/*
>+ * Handle the staging process using the mailbox MMIO interface. The
>+ * microcode image is transferred in chunks until completion. Return the
>+ * result state.
>  */
> static enum ucode_state do_stage(u64 mmio_pa)
> {
>-	pr_debug_once("Staging implementation is pending.\n");
>-	return UCODE_ERROR;
>+	struct staging_state ss = {};
>+
>+	ss.mmio_base = ioremap(mmio_pa, MBOX_REG_NUM * MBOX_REG_SIZE);
>+	if (WARN_ON_ONCE(!ss.mmio_base))
>+		return UCODE_ERROR;
>+
>+	init_stage(&ss);
>+
>+	/* Perform the staging process while within the retry limit */
>+	while (!is_stage_complete(ss.offset) && can_send_next_chunk(&ss)) {
>+		/* Send a chunk of microcode each time: */
>+		if (!send_data_chunk(&ss))
>+			break;
>+		/*
>+		 * Then, ask the hardware which piece of the image it
>+		 * needs next. The same piece may be sent more than once.
>+		 */
>+		if (!fetch_next_offset(&ss))
>+			break;

why send_data_chunk() and fetch_next_offset() return a boolean instead of
an error or ucode_state?

Using the return value to indicate just success or failure, while relying
on another variable to report detailed error/state, seems a bit clumsy to
me.

>+	}
>+
>+	iounmap(ss.mmio_base);
>+	return ss.state;
> }
> 
> static void stage_microcode(void)
>-- 
>2.45.2
>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2a 4/6] x86/microcode/intel: Implement staging handler
  2025-03-26  8:34       ` Chao Gao
@ 2025-03-26 18:43         ` Chang S. Bae
  0 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-03-26 18:43 UTC (permalink / raw)
  To: Chao Gao; +Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen, colinmitchell

On 3/26/2025 1:34 AM, Chao Gao wrote:
>> +
>> +	/* Reset tracking variables */
>> +	ss->offset = 0;
>> +	ss->bytes_sent = 0;
> 
> Nit: no need to reset them, as
> 
>> +	struct staging_state ss = {};
> 
> in do_stage() will zero the whole structure.

I initially wanted to explicitly highlight where these variables are 
reset, but you’re right — in practice, they can be removed.

> why send_data_chunk() and fetch_next_offset() return a boolean instead of
> an error or ucode_state?
> 
> Using the return value to indicate just success or failure, while relying
> on another variable to report detailed error/state, seems a bit clumsy to
> me.

The error state is interpreted at the call site, where the final result 
is returned to the caller.

At the end of each step, all the loop needs to decide is whether to 
continue or break, which naturally fits a boolean return. I don’t see a 
strong reason to change this approach.

Thanks,
Chang

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2a 3/6] x86/microcode/intel: Establish staging control logic
  2025-03-26  7:35     ` Chao Gao
@ 2025-03-26 18:43       ` Chang S. Bae
  2025-03-27  1:44         ` Chao Gao
  0 siblings, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-03-26 18:43 UTC (permalink / raw)
  To: Chao Gao; +Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen, colinmitchell

On 3/26/2025 12:35 AM, Chao Gao wrote:
>> +	for_each_cpu(cpu, cpu_online_mask) {
> 
> for_each_online_cpu(cpu)?

Yes, it looks equivalent but shorter.

> So, how about:
> 
> 		if (cpu != cpumask_first(topology_core_cpumask(cpu)))
> 			continue;
> 
> and dropping the pkg_id?

No, the pkg_id check is intentional to prevent duplicate staging within 
a package. As noted in the comment: "The MMIO address is unique per 
package."

>> +		rdmsrl_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &mmio_pa);
> 
> Note rdmsrl_on_cpu() may return an error. please consider adding
> error-handling. Is it possible that somehow one package doesn't support
> this staging feature while others do?

rdmsrl_on_cpu() -> smp_call_function_single() -> generic_exec_single():

         if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu)) {
                 csd_unlock(csd);
                 return -ENXIO;
         }

This error condition applies to an invalid cpu, but since the function 
is guarded by cpu_online_mask, it should not occur.

That said though, ignoring the return value may appear to be incorrect.
Perhaps,

   err = rdmsrl_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &mmio_pa);
   if (WARN_ON_ONCE(err))
         return;

> Shall we print a message somewhere showing "Continuing updates without
> staging"?
> 
> It could be confusing for users to see a success message following an error
> message that states "Error: staging failed ..."

This function already prints either a success or failure message based 
on staging results which are variable.

But this behavior follows the established policy that loading should 
continue even if staging fails, which is a known and invariant behavior 
at runtime.

So, explicitly stating that updates will proceed without staging seems 
redundant and could be considered noise.

Thanks,
Chang

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2a 3/6] x86/microcode/intel: Establish staging control logic
  2025-03-26 18:43       ` Chang S. Bae
@ 2025-03-27  1:44         ` Chao Gao
  2025-03-28 14:12           ` Chang S. Bae
  0 siblings, 1 reply; 69+ messages in thread
From: Chao Gao @ 2025-03-27  1:44 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen, colinmitchell

On Wed, Mar 26, 2025 at 11:43:58AM -0700, Chang S. Bae wrote:
>On 3/26/2025 12:35 AM, Chao Gao wrote:
>> > +	for_each_cpu(cpu, cpu_online_mask) {
>> 
>> for_each_online_cpu(cpu)?
>
>Yes, it looks equivalent but shorter.
>
>> So, how about:
>> 
>> 		if (cpu != cpumask_first(topology_core_cpumask(cpu)))
>> 			continue;
>> 
>> and dropping the pkg_id?
>
>No, the pkg_id check is intentional to prevent duplicate staging within a
>package. As noted in the comment: "The MMIO address is unique per package."

The check I suggested can also achieve the goal and is simpler, right?

>
>> > +		rdmsrl_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &mmio_pa);
>> 
>> Note rdmsrl_on_cpu() may return an error. please consider adding
>> error-handling. Is it possible that somehow one package doesn't support
>> this staging feature while others do?
>
>rdmsrl_on_cpu() -> smp_call_function_single() -> generic_exec_single():
>
>        if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu)) {
>                csd_unlock(csd);
>                return -ENXIO;
>        }
>
>This error condition applies to an invalid cpu, but since the function is
>guarded by cpu_online_mask, it should not occur.

Ok. I misread rdmsrl_on_cpu(). I thought it would return an error if the
MSR doesn't exist on that CPU. But that's not the case.

>
>That said though, ignoring the return value may appear to be incorrect.
>Perhaps,
>
>  err = rdmsrl_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &mmio_pa);
>  if (WARN_ON_ONCE(err))
>        return;

Looks good.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 5/6] x86/microcode/intel: Support mailbox transfer
  2025-03-20 23:40 ` [PATCH v2 5/6] x86/microcode/intel: Support mailbox transfer Chang S. Bae
  2025-03-21 21:19   ` [PATCH v2a " Chang S. Bae
@ 2025-03-27  3:32   ` Chao Gao
  2025-03-27 14:11     ` Chang S. Bae
  2025-03-31 19:16     ` Dave Hansen
  1 sibling, 2 replies; 69+ messages in thread
From: Chao Gao @ 2025-03-27  3:32 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen, colinmitchell

>+/*
>+ * Wait for the hardware to complete a transaction.
>+ * Return true on success, false on failure.
>+ */
>+static bool wait_for_transaction(void)
>+{
>+	u32 timeout, status;
>+
>+	/* Allow time for hardware to complete the operation: */
>+	for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++) {
>+		msleep(1);
>+
>+		status = readl(staging.mmio_base + MBOX_STATUS_OFFSET);
>+		/* Break out early if the hardware is ready: */
>+		if (status & MASK_MBOX_STATUS_READY)
>+			break;
>+	}
>+
>+	status = readl(staging.mmio_base + MBOX_STATUS_OFFSET);

why read the STATUS again?

>+
>+	/* Check for explicit error response */
>+	if (status & MASK_MBOX_STATUS_ERROR) {
>+		staging.state = UCODE_ERROR;
>+		return false;
>+	}
>+
>+	/*
>+	 * Hardware is neither responded to the action nor
>+	 * signaled any error. Treat the case as timeout.
>+	 */
>+	if (!(status & MASK_MBOX_STATUS_READY)) {
>+		staging.state = UCODE_TIMEOUT;
>+		return false;
>+	}
>+
>+	staging.state = UCODE_OK;
>+	return true;
>+}

How about:

static enum ucode_state wait_for_transaction(void)
{
	u32 timeout, status;

	/* Allow time for hardware to complete the operation: */
	for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++) {
		msleep(1);

		status = readl(staging.mmio_base + MBOX_STATUS_OFFSET);

		if (status & MASK_MBOX_STATUS_READY)
			return UCODE_OK;

		/* Check for explicit error response */
		if (status & MASK_MBOX_STATUS_ERROR)
			return UCODE_ERROR;
	}

	/*
	 * Hardware is neither responded to the action nor
	 * signaled any error. Treat the case as timeout.
	 */
	return UCODE_TIMEOUT;
}

and in send_data_chunk(), do:

	staging.state = wait_for_transaction();
	return staging.state != UCODE_OK;

It is simpler and requires less code. Even better, send_data_chunk() can just
propagate the ucode_state to its caller.

By the way, checkpatch.pl warns that 'msleep < 20ms can sleep for up to 20ms;
see function description of msleep().' This makes me wonder how the 10ms
timeout was determined but not precisely enforced. Is it arbitrary or selected
for specific reasons?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 5/6] x86/microcode/intel: Support mailbox transfer
  2025-03-27  3:32   ` [PATCH v2 " Chao Gao
@ 2025-03-27 14:11     ` Chang S. Bae
  2025-03-31 19:16     ` Dave Hansen
  1 sibling, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-03-27 14:11 UTC (permalink / raw)
  To: Chao Gao; +Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen, colinmitchell

On 3/26/2025 8:32 PM, Chao Gao wrote:
> 
> and in send_data_chunk(), do:
> 
> 	staging.state = wait_for_transaction();
> 	return staging.state != UCODE_OK;
> 
> It is simpler and requires less code. Even better, send_data_chunk() can just
> propagate the ucode_state to its caller.

Thanks for the suggestion. To me, this is more of a stylistic choice. 
Previously, the call site had a single line:

     return wait_for_transaction();

> By the way, checkpatch.pl warns that 'msleep < 20ms can sleep for up to 20ms;
> see function description of msleep().' This makes me wonder how the 10ms
> timeout was determined but not precisely enforced. Is it arbitrary or selected
> for specific reasons?

Yes, I saw that warning. It notes that msleep(1) could sleep for longer 
than 1 ms, which I thought was acceptable. The 10 ms timeout was 
determined through discussions with the firmware team responsible for 
staging.

Thanks,
Chang

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2a 3/6] x86/microcode/intel: Establish staging control logic
  2025-03-27  1:44         ` Chao Gao
@ 2025-03-28 14:12           ` Chang S. Bae
  0 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-03-28 14:12 UTC (permalink / raw)
  To: Chao Gao; +Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen, colinmitchell

On 3/26/2025 6:44 PM, Chao Gao wrote:
> 
> The check I suggested can also achieve the goal and is simpler, right?
Okay, I’d (finally) define this mask compilable for CONFIG_SMP=n, in 
topology.h:

     #define cpu_primary_thread_mask cpu_none_mask

Then,

     for_each_cpu(cpu, cpu_primary_thread_mask) {
         if (topology_logical_package_id(cpu) == pkg_id)
             continue;
         ...
     }

This still looks simpler. Plus, while timing isn’t a major concern on 
this staging path, cpumask_first() on every CPU appears costly. On my 
measurements, this takes about a third of the time compared to yours.

Thanks,
Chang

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 5/6] x86/microcode/intel: Support mailbox transfer
  2025-03-27  3:32   ` [PATCH v2 " Chao Gao
  2025-03-27 14:11     ` Chang S. Bae
@ 2025-03-31 19:16     ` Dave Hansen
  1 sibling, 0 replies; 69+ messages in thread
From: Dave Hansen @ 2025-03-31 19:16 UTC (permalink / raw)
  To: Chao Gao, Chang S. Bae
  Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen, colinmitchell

On 3/26/25 20:32, Chao Gao wrote:
> static enum ucode_state wait_for_transaction(void)
> {
> 	u32 timeout, status;
> 
> 	/* Allow time for hardware to complete the operation: */
> 	for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++) {
> 		msleep(1);
> 
> 		status = readl(staging.mmio_base + MBOX_STATUS_OFFSET);
> 
> 		if (status & MASK_MBOX_STATUS_READY)
> 			return UCODE_OK;
> 
> 		/* Check for explicit error response */
> 		if (status & MASK_MBOX_STATUS_ERROR)
> 			return UCODE_ERROR;
> 	}
> 
> 	/*
> 	 * Hardware is neither responded to the action nor
> 	 * signaled any error. Treat the case as timeout.
> 	 */
> 	return UCODE_TIMEOUT;
> }

The flow at minimal indent in the function should be the _main_ control
flow: the common case.

The common case here is the "return UCODE_OK". It should not be buried
in a for() loop. I kinda like what Chang had before.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH v3 0/6] x86: Support for Intel Microcode Staging Feature
  2025-03-20 23:40 [PATCH v2 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
                   ` (5 preceding siblings ...)
  2025-03-20 23:40 ` [PATCH v2 6/6] x86/microcode/intel: Enable staging when available Chang S. Bae
@ 2025-04-09 23:27 ` Chang S. Bae
  2025-04-09 23:27   ` [PATCH v3 1/6] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
                     ` (6 more replies)
  6 siblings, 7 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-04-09 23:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao,
	chang.seok.bae

Hi all,

This is another revision addressing feedback from the last posting [1]
over the past few weeks:

  *  Dave suggested using a local variable to clarify the scope handled
     by each helper, via a function argument [2]. Accordingly, the patch
     order was adjusted (patch 2 <-> patch 3).

  *  Chao provided helpful feedback, leading to additional refinements.

This round contains relatively smaller changes compared to the last one,
but I hope this iteration provide a chance to draw more reviews or even
collect a few tags.

As before, the patch series is based on tip/master and is also available
in this repo:
    git://github.com/intel-staging/microcode.git staging_v3

Thanks,
Chang

[1]: V2: https://lore.kernel.org/lkml/Z+O8DK5NZJL43Nt6@intel.com/
[2]: https://lore.kernel.org/lkml/b01224ee-c935-4b08-a76f-5dc49341182d@intel.com/

Chang S. Bae (6):
  x86/microcode: Introduce staging step to reduce late-loading time
  x86/microcode/intel: Establish staging control logic
  x86/microcode/intel: Define staging state struct
  x86/microcode/intel: Implement staging handler
  x86/microcode/intel: Support mailbox transfer
  x86/microcode/intel: Enable staging when available

 arch/x86/include/asm/msr-index.h         |   9 +
 arch/x86/include/asm/topology.h          |   1 +
 arch/x86/kernel/cpu/microcode/core.c     |  11 +
 arch/x86/kernel/cpu/microcode/intel.c    | 338 +++++++++++++++++++++++
 arch/x86/kernel/cpu/microcode/internal.h |   4 +-
 5 files changed, 362 insertions(+), 1 deletion(-)


base-commit: 033163360247053bec75b81ac2d34aeb9d994e59
-- 
2.45.2


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH v3 1/6] x86/microcode: Introduce staging step to reduce late-loading time
  2025-04-09 23:27 ` [PATCH v3 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
@ 2025-04-09 23:27   ` Chang S. Bae
  2025-04-09 23:27   ` [PATCH v3 2/6] x86/microcode/intel: Establish staging control logic Chang S. Bae
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-04-09 23:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao,
	chang.seok.bae

As microcode patch sizes continue to grow, late-loading latency spikes
can lead to timeouts and disruptions in running workloads. This trend of
increasing patch sizes is expected to continue, so a foundational
solution is needed to address the issue.

To mitigate the problem, a new staging feature is introduced. This option
processes most of the microcode update (excluding activation) on a
non-critical path, allowing CPUs to remain operational during the
majority of the update. By offloading work from the critical path,
staging can significantly reduces latency spikes.

Integrate staging as a preparatory step in late-loading. Introduce a new
callback for staging, which is invoked at the beginning of
load_late_stop_cpus(), before CPUs enter the rendezvous phase.

Staging follows an opportunistic model:

  *  If successful, it reduces CPU rendezvous time
  *  Even though it fails, the process falls back to the legacy path to
     finish the loading process but with potentially higher latency.

Extend struct microcode_ops to incorporate staging properties, which will
be implemented in the vendor code separately.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
V2 -> V3: No change

Note:

There were discussions about whether staging success should be enforced
by a configurable option. That topic is identified as follow-up work,
separate from this series.

    https://lore.kernel.org/lkml/54308373-7867-4b76-be34-63730953f83c@intel.com/

V1 -> V2:
* Move invocation inside of load_late_stop_cpus() (Boris)
* Add more note about staging (Dave)
---
 arch/x86/kernel/cpu/microcode/core.c     | 11 +++++++++++
 arch/x86/kernel/cpu/microcode/internal.h |  4 +++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
index b3658d11e7b6..c4aff44a7ffc 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -541,6 +541,17 @@ static int load_late_stop_cpus(bool is_safe)
 		pr_err("You should switch to early loading, if possible.\n");
 	}
 
+	/*
+	 * Pre-load the microcode image into a staging device. This
+	 * process is preemptible and does not require stopping CPUs.
+	 * Successful staging simplifies the subsequent late-loading
+	 * process, reducing rendezvous time.
+	 *
+	 * Even if the transfer fails, the update will proceed as usual.
+	 */
+	if (microcode_ops->use_staging)
+		microcode_ops->stage_microcode();
+
 	atomic_set(&late_cpus_in, num_online_cpus());
 	atomic_set(&offline_in_nmi, 0);
 	loops_per_usec = loops_per_jiffy / (TICK_NSEC / 1000);
diff --git a/arch/x86/kernel/cpu/microcode/internal.h b/arch/x86/kernel/cpu/microcode/internal.h
index 5df621752fef..4b983b4cddbd 100644
--- a/arch/x86/kernel/cpu/microcode/internal.h
+++ b/arch/x86/kernel/cpu/microcode/internal.h
@@ -31,10 +31,12 @@ struct microcode_ops {
 	 * See also the "Synchronization" section in microcode_core.c.
 	 */
 	enum ucode_state	(*apply_microcode)(int cpu);
+	void			(*stage_microcode)(void);
 	int			(*collect_cpu_info)(int cpu, struct cpu_signature *csig);
 	void			(*finalize_late_load)(int result);
 	unsigned int		nmi_safe	: 1,
-				use_nmi		: 1;
+				use_nmi		: 1,
+				use_staging	: 1;
 };
 
 struct early_load_data {
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 2/6] x86/microcode/intel: Establish staging control logic
  2025-04-09 23:27 ` [PATCH v3 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
  2025-04-09 23:27   ` [PATCH v3 1/6] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
@ 2025-04-09 23:27   ` Chang S. Bae
  2025-04-09 23:27   ` [PATCH v3 3/6] x86/microcode/intel: Define staging state struct Chang S. Bae
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-04-09 23:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao,
	chang.seok.bae

When microcode staging is initiated, operations are carried out through
an MMIO interface. Each package has a unique interface specified by the
IA32_MCU_STAGING_MBOX_ADDR MSR, which maps to a set of 32-bit registers.

Prepare staging with the following steps:

  1.  Ensure the microcode image is 32-bit aligned to match the MMIO
      register size.

  2.  Identify each MMIO interface based on its per-package scope.

  3.  Invoke the staging function for each identified interface, which
      will be implemented separately.

Also, define cpu_primary_thread_mask for the CONFIG_SMP=n case, allowing
consistent use when narrowing down primary threads to locate the
per-package interface.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Link: https://lore.kernel.org/all/871pznq229.ffs@tglx
---
V2 -> V3:
* Remove a global variable and adjust stage_microcode() (Dave).
* Simplify for_each_cpu() loop control code
* Handle rdmsrl_on_cpu() return code explicitly (Chao)

V1 -> V2:
* Adjust to reference the staging_state struct.
* Add lockdep_assert_cpus_held() (Boris)
---
 arch/x86/include/asm/msr-index.h      |  2 ++
 arch/x86/include/asm/topology.h       |  1 +
 arch/x86/kernel/cpu/microcode/intel.c | 50 +++++++++++++++++++++++++++
 3 files changed, 53 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 53da787b9326..f5b05e4f1e27 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -897,6 +897,8 @@
 #define MSR_IA32_UCODE_WRITE		0x00000079
 #define MSR_IA32_UCODE_REV		0x0000008b
 
+#define MSR_IA32_MCU_STAGING_MBOX_ADDR	0x000007a5
+
 /* Intel SGX Launch Enclave Public Key Hash MSRs */
 #define MSR_IA32_SGXLEPUBKEYHASH0	0x0000008C
 #define MSR_IA32_SGXLEPUBKEYHASH1	0x0000008D
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 6c79ee7c0957..91b5fc44ca62 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -235,6 +235,7 @@ static inline bool topology_is_primary_thread(unsigned int cpu)
 static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
 static inline int topology_max_smt_threads(void) { return 1; }
 static inline unsigned int topology_amd_nodes_per_pkg(void) { return 1; }
+#define cpu_primary_thread_mask cpu_none_mask
 #endif /* !CONFIG_SMP */
 
 static inline void arch_fix_phys_package_id(int num, u32 slot)
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 819199bc0119..a791d5dc2cef 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -299,6 +299,55 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 	return size ? NULL : patch;
 }
 
+/*
+ * Handle the staging process using the mailbox MMIO interface.
+ * Return the result state.
+ */
+static enum ucode_state do_stage(u64 mmio_pa)
+{
+	pr_debug_once("Staging implementation is pending.\n");
+	return UCODE_ERROR;
+}
+
+static void stage_microcode(void)
+{
+	unsigned int pkg_id = UINT_MAX;
+	enum ucode_state ret;
+	int cpu, err;
+	u64 mmio_pa;
+
+	if (!IS_ALIGNED(get_totalsize(&ucode_patch_late->hdr), sizeof(u32)))
+		return;
+
+	lockdep_assert_cpus_held();
+
+	/*
+	 * The MMIO address is unique per package, and all the SMT
+	 * primary threads are online here. Find each MMIO space by
+	 * their package ids to avoid duplicate staging.
+	 */
+	for_each_cpu(cpu, cpu_primary_thread_mask) {
+		if (topology_logical_package_id(cpu) == pkg_id)
+			continue;
+		pkg_id = topology_logical_package_id(cpu);
+
+		err = rdmsrl_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &mmio_pa);
+		if (WARN_ON_ONCE(err))
+			return;
+
+		ret = do_stage(mmio_pa);
+		if (ret != UCODE_OK) {
+			pr_err("Error: staging failed with %s for CPU%d at package %u.\n",
+			       ret == UCODE_TIMEOUT ? "timeout" : "error state",
+			       cpu, pkg_id);
+			return;
+		}
+	}
+
+	pr_info("Staging of patch revision 0x%x succeeded.\n",
+		((struct microcode_header_intel *)ucode_patch_late)->rev);
+}
+
 static enum ucode_state __apply_microcode(struct ucode_cpu_info *uci,
 					  struct microcode_intel *mc,
 					  u32 *cur_rev)
@@ -627,6 +676,7 @@ static struct microcode_ops microcode_intel_ops = {
 	.collect_cpu_info	= collect_cpu_info,
 	.apply_microcode	= apply_microcode_late,
 	.finalize_late_load	= finalize_late_load,
+	.stage_microcode	= stage_microcode,
 	.use_nmi		= IS_ENABLED(CONFIG_X86_64),
 };
 
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 3/6] x86/microcode/intel: Define staging state struct
  2025-04-09 23:27 ` [PATCH v3 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
  2025-04-09 23:27   ` [PATCH v3 1/6] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
  2025-04-09 23:27   ` [PATCH v3 2/6] x86/microcode/intel: Establish staging control logic Chang S. Bae
@ 2025-04-09 23:27   ` Chang S. Bae
  2025-04-09 23:27   ` [PATCH v3 4/6] x86/microcode/intel: Implement staging handler Chang S. Bae
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-04-09 23:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao,
	chang.seok.bae

To facilitate a structured staging handler, a set of functions will be
introduced. Define staging_state struct to simplify function prototypes
by consolidating relevant data, instead of passing multiple local
variables.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
V2 -> V3: No change
V1 -> V2: New patch

Prior to V2, local variables were used to track state values, with the
intention of improving readability by explicitly passing them between
functions. However, given feedbacks, introducing a dedicated data
structure looks to provide a benefit by simplifying the main loop.
---
 arch/x86/kernel/cpu/microcode/intel.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index a791d5dc2cef..d8ea172d90e2 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -54,6 +54,27 @@ struct extended_sigtable {
 	struct extended_signature	sigs[];
 };
 
+/**
+ * struct staging_state - Tracks the current staging process state
+ *
+ * @mmio_base:		MMIO base address for staging
+ * @ucode_ptr:		Pointer to the microcode image
+ * @ucode_len:		Total size of the microcode image
+ * @chunk_size:		Size of each data piece
+ * @bytes_sent:		Total bytes transmitted so far
+ * @offset:		Current offset in the microcode image
+ * @state:		Current state of the staging process
+ */
+struct staging_state {
+	void __iomem		*mmio_base;
+	void			*ucode_ptr;
+	unsigned int		ucode_len;
+	unsigned int		chunk_size;
+	unsigned int		bytes_sent;
+	unsigned int		offset;
+	enum ucode_state	state;
+};
+
 #define DEFAULT_UCODE_TOTALSIZE (DEFAULT_UCODE_DATASIZE + MC_HEADER_SIZE)
 #define EXT_HEADER_SIZE		(sizeof(struct extended_sigtable))
 #define EXT_SIGNATURE_SIZE	(sizeof(struct extended_signature))
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 4/6] x86/microcode/intel: Implement staging handler
  2025-04-09 23:27 ` [PATCH v3 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
                     ` (2 preceding siblings ...)
  2025-04-09 23:27   ` [PATCH v3 3/6] x86/microcode/intel: Define staging state struct Chang S. Bae
@ 2025-04-09 23:27   ` Chang S. Bae
  2025-04-09 23:27   ` [PATCH v3 5/6] x86/microcode/intel: Support mailbox transfer Chang S. Bae
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-04-09 23:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao,
	chang.seok.bae

Previously, per-package staging invocations and their associated state
data were established. The next step is to implement the actual staging
handler according to the specified protocol. Below are key aspects to
note:

  (a)  Each staging process must begin by resetting the staging hardware.

  (b)  The staging hardware processes up to a page-sized chunk of the
       microcode image per iteration, requiring software to submit data
       incrementally.

  (c)  Once a data chunk is processed, the hardware responds with an
       offset in the image for the next chunk.

  (d)  The offset may indicate completion or request retransmission of an
       already transferred chunk. As long as the total transferred data
       remains within the predefined limit (twice the image size),
       retransmissions should be acceptable.

With that, incorporate these code sequences to the staging handler:

  1.  Initialization: Map the MMIO space via ioremap(). Reset the staging
      hardware and initialize software state, ensuring a fresh staging
      process aligned with (a).

  2.  Processing Loop: Introduce a loop iterating over data chunk,
      following (b), with proper termination conditions established from
      (d) -- stop staging when the hardware signals completion, or if the
      total transmitted data exceeds the predefined limit.

  3.  Loop Body: Finally, compose the loop body with two steps --
      transmitting a data chunk and retrieving the next offset from the
      hardware response, aligning with (b) and (c).

Since data transmission and mailbox format handling require additional
details, they are implemented separately in next changes.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
V2 -> V3:
* Rework code to eliminate global variables (Dave)
* Remove redundant variable resets (Chao)

V1 -> V2:
* Re-write the changelog for clarity (Dave).
* Move staging handling code into intel.c (Boris).
* Add extensive comments to clarify staging logic and hardware
  interactions, along with function renaming (Dave).
---
 arch/x86/kernel/cpu/microcode/intel.c | 120 +++++++++++++++++++++++++-
 1 file changed, 116 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index d8ea172d90e2..12910b8f9f8a 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -20,6 +20,8 @@
 #include <linux/cpu.h>
 #include <linux/uio.h>
 #include <linux/mm.h>
+#include <linux/delay.h>
+#include <linux/io.h>
 
 #include <asm/cpu_device_id.h>
 #include <asm/processor.h>
@@ -33,6 +35,29 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
 
 #define UCODE_BSP_LOADED	((struct microcode_intel *)0x1UL)
 
+/* Defines for the microcode staging mailbox interface */
+
+#define MBOX_REG_NUM		4
+#define MBOX_REG_SIZE		sizeof(u32)
+
+#define MBOX_CONTROL_OFFSET	0x0
+#define MBOX_STATUS_OFFSET	0x4
+
+#define MASK_MBOX_CTRL_ABORT	BIT(0)
+
+/*
+ * Each microcode image is divided into chunks, each at most
+ * MBOX_XACTION_SIZE in size. A 10-chunk image would typically require
+ * 10 transactions. However, the hardware managing the mailbox has
+ * limited resources and may not cache the entire image, potentially
+ * requesting the same chunk multiple times.
+ *
+ * To accommodate this behavior, allow up to twice the expected number of
+ * transactions (i.e., a 10-chunk image can take up to 20 attempts).
+ */
+#define MBOX_XACTION_SIZE	PAGE_SIZE
+#define MBOX_XACTION_MAX(imgsz)	((imgsz) * 2)
+
 /* Current microcode patch used in early patching on the APs. */
 static struct microcode_intel *ucode_patch_va __read_mostly;
 static struct microcode_intel *ucode_patch_late __read_mostly;
@@ -321,13 +346,100 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 }
 
 /*
- * Handle the staging process using the mailbox MMIO interface.
- * Return the result state.
+ * Prepare for a new microcode transfer by resetting hardware and
+ * configuring microcode image info.
+ */
+static void init_stage(struct staging_state *ss)
+{
+	ss->ucode_ptr = ucode_patch_late;
+	ss->ucode_len = get_totalsize(&ucode_patch_late->hdr);
+
+	/*
+	 * Abort any ongoing process, effectively resetting the device.
+	 * Unlike regular mailbox data processing requests, this
+	 * operation does not require a status check.
+	 */
+	writel(MASK_MBOX_CTRL_ABORT, ss->mmio_base + MBOX_CONTROL_OFFSET);
+}
+
+/*
+ * Check if the staging process has completed. The hardware signals
+ * completion by setting a unique end offset.
+ */
+static inline bool is_stage_complete(unsigned int offset)
+{
+	return offset == UINT_MAX;
+}
+
+/*
+ * Determine if the next data chunk can be sent. Each chunk is typically
+ * one page unless the remaining data is smaller. If the total
+ * transmitted data exceeds the defined limit, a timeout occurs.
+ */
+static bool can_send_next_chunk(struct staging_state *ss)
+{
+	WARN_ON_ONCE(ss->ucode_len < ss->offset);
+	ss->chunk_size = min(MBOX_XACTION_SIZE, ss->ucode_len - ss->offset);
+
+	if (ss->bytes_sent + ss->chunk_size > MBOX_XACTION_MAX(ss->ucode_len)) {
+		ss->state = UCODE_TIMEOUT;
+		return false;
+	}
+	return true;
+}
+
+/*
+ * Transmit a chunk of the microcode image to the hardware.
+ * Return true if the chunk is processed successfully.
+ */
+static bool send_data_chunk(struct staging_state *ss)
+{
+	pr_debug_once("Staging mailbox loading code needs to be implemented.\n");
+	ss->state = UCODE_ERROR;
+	return false;
+}
+
+/*
+ * Retrieve the next offset from the hardware response.
+ * Return true if the response is valid, false otherwise.
+ */
+static bool fetch_next_offset(struct staging_state *ss)
+{
+	pr_debug_once("Staging mailbox response handling code needs to be implemented.\n\n");
+	ss->state = UCODE_ERROR;
+	return false;
+}
+
+/*
+ * Handle the staging process using the mailbox MMIO interface. The
+ * microcode image is transferred in chunks until completion. Return the
+ * result state.
  */
 static enum ucode_state do_stage(u64 mmio_pa)
 {
-	pr_debug_once("Staging implementation is pending.\n");
-	return UCODE_ERROR;
+	struct staging_state ss = {};
+
+	ss.mmio_base = ioremap(mmio_pa, MBOX_REG_NUM * MBOX_REG_SIZE);
+	if (WARN_ON_ONCE(!ss.mmio_base))
+		return UCODE_ERROR;
+
+	init_stage(&ss);
+
+	/* Perform the staging process while within the retry limit */
+	while (!is_stage_complete(ss.offset) && can_send_next_chunk(&ss)) {
+		/* Send a chunk of microcode each time: */
+		if (!send_data_chunk(&ss))
+			break;
+		/*
+		 * Then, ask the hardware which piece of the image it
+		 * needs next. The same piece may be sent more than once.
+		 */
+		if (!fetch_next_offset(&ss))
+			break;
+	}
+
+	iounmap(ss.mmio_base);
+	return ss.state;
 }
 
 static void stage_microcode(void)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 5/6] x86/microcode/intel: Support mailbox transfer
  2025-04-09 23:27 ` [PATCH v3 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
                     ` (3 preceding siblings ...)
  2025-04-09 23:27   ` [PATCH v3 4/6] x86/microcode/intel: Implement staging handler Chang S. Bae
@ 2025-04-09 23:27   ` Chang S. Bae
  2025-04-16 14:14     ` Chao Gao
  2025-04-09 23:27   ` [PATCH v3 6/6] x86/microcode/intel: Enable staging when available Chang S. Bae
  2025-08-13 17:26   ` [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
  6 siblings, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-04-09 23:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao,
	chang.seok.bae

Previously, the functions for sending microcode data and retrieving the
next offset were placeholders, as they required handling the specific
mailbox format. Implement them as following:

== Mailbox Format ==

The staging mailbox consists of two primary sections: 'header' and
'data'. While the microcode must be transferred following this format,
the actual data transfer mechanism involves reading and writing to
specific MMIO registers.

== Mailbox Data Registers ==

Unlike conventional interfaces that allocate MMIO space for each data
chunk, the staging interface features a "narrow" interface, using only
two dword-sized registers for read and write operations.

For example, if writing 2 dwords of data to a device. Typically, the
device would expose 2 dwords of "wide" MMIO space. To send the data to
the device:

	writel(buf[0], io_addr + 0);
	writel(buf[1], io_addr + 1);

But, this interface is a bit different. Instead of having a "wide"
interface where there is separate MMIO space for each word in a
transaction, it has a "narrow" interface where several words are written
to the same spot in MMIO space:

	writel(buf[0], io_addr);
	writel(buf[1], io_addr);

The same goes for the read side.

== Implementation Summary ==

Given that, introduce two layers of helper functions at first:

  * Low-level helpers for reading and writing to data registers directly.
  * Wrapper functions for handling mailbox header and data sections.

Using them, implement send_data_chunk() and fetch_next_offset()
functions. Add explicit error and timeout handling routine in
wait_for_transaction(), finishing up the transfer.

Note: The kernel has support for similar mailboxes. But none of them are
compatible with this one. Trying to share code resulted in a bloated
mess, so this code is standalone.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
V2 -> V3:
* Update code to reflect the removal of a global variable (Dave).

V1 -> V2:
* Add lots of code comments and edit the changelog (Dave).
* Encapsulate register read/write operations for processing header and
  data sections.
---
 arch/x86/kernel/cpu/microcode/intel.c | 150 ++++++++++++++++++++++++--
 1 file changed, 144 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 12910b8f9f8a..542b3abad8e3 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -22,6 +22,7 @@
 #include <linux/mm.h>
 #include <linux/delay.h>
 #include <linux/io.h>
+#include <linux/pci_ids.h>
 
 #include <asm/cpu_device_id.h>
 #include <asm/processor.h>
@@ -42,8 +43,30 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
 
 #define MBOX_CONTROL_OFFSET	0x0
 #define MBOX_STATUS_OFFSET	0x4
+#define MBOX_WRDATA_OFFSET	0x8
+#define MBOX_RDDATA_OFFSET	0xc
 
 #define MASK_MBOX_CTRL_ABORT	BIT(0)
+#define MASK_MBOX_CTRL_GO	BIT(31)
+
+#define MASK_MBOX_STATUS_ERROR	BIT(2)
+#define MASK_MBOX_STATUS_READY	BIT(31)
+
+#define MASK_MBOX_RESP_SUCCESS	BIT(0)
+#define MASK_MBOX_RESP_PROGRESS	BIT(1)
+#define MASK_MBOX_RESP_ERROR	BIT(2)
+
+#define MBOX_CMD_LOAD		0x3
+#define MBOX_OBJ_STAGING	0xb
+#define DWORD_ALIGN(size)	((size) / sizeof(u32))
+#define MBOX_HEADER(mbox_size)	(PCI_VENDOR_ID_INTEL | \
+				 (MBOX_OBJ_STAGING << 16) | \
+				 ((u64)DWORD_ALIGN(mbox_size) << 32))
+
+/* The size of each mailbox header */
+#define MBOX_HEADER_SIZE	sizeof(u64)
+/* The size of staging hardware response */
+#define MBOX_RESPONSE_SIZE	sizeof(u64)
 
 /*
  * Each microcode image is divided into chunks, each at most
@@ -57,6 +80,7 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
  */
 #define MBOX_XACTION_SIZE	PAGE_SIZE
 #define MBOX_XACTION_MAX(imgsz)	((imgsz) * 2)
+#define MBOX_XACTION_TIMEOUT_MS	(10 * MSEC_PER_SEC)
 
 /* Current microcode patch used in early patching on the APs. */
 static struct microcode_intel *ucode_patch_va __read_mostly;
@@ -345,6 +369,49 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 	return size ? NULL : patch;
 }
 
+static inline u32 read_mbox_dword(void __iomem *mmio_base)
+{
+	u32 dword = readl(mmio_base + MBOX_RDDATA_OFFSET);
+
+	/* Acknowledge read completion to the staging hardware */
+	writel(0, mmio_base + MBOX_RDDATA_OFFSET);
+	return dword;
+}
+
+static inline void write_mbox_dword(void __iomem *mmio_base, u32 dword)
+{
+	writel(dword, mmio_base + MBOX_WRDATA_OFFSET);
+}
+
+static inline u64 read_mbox_header(void __iomem *mmio_base)
+{
+	u32 high, low;
+
+	low  = read_mbox_dword(mmio_base);
+	high = read_mbox_dword(mmio_base);
+
+	return ((u64)high << 32) | low;
+}
+
+static inline void write_mbox_header(void __iomem *mmio_base, u64 value)
+{
+	write_mbox_dword(mmio_base, value);
+	write_mbox_dword(mmio_base, value >> 32);
+}
+
+static void write_mbox_data(void __iomem *mmio_base, u32 *chunk, unsigned int chunk_size)
+{
+	int i;
+
+	/*
+	 * The MMIO space is mapped as Uncached (UC). Each write arrives
+	 * at the device as an individual transaction in program order.
+	 * The device can then resemble the sequence accordingly.
+	 */
+	for (i = 0; i < DWORD_ALIGN(chunk_size); i++)
+		write_mbox_dword(mmio_base, chunk[i]);
+}
+
 /*
  * Prepare for a new microcode transfer by resetting hardware and
  * configuring microcode image info.
@@ -388,15 +455,71 @@ static bool can_send_next_chunk(struct staging_state *ss)
 	return true;
 }
 
+/*
+ * Wait for the hardware to complete a transaction.
+ * Return true on success, false on failure.
+ */
+static bool wait_for_transaction(struct staging_state *ss)
+{
+	u32 timeout, status;
+
+	/* Allow time for hardware to complete the operation: */
+	for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++) {
+		msleep(1);
+
+		status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
+		/* Break out early if the hardware is ready: */
+		if (status & MASK_MBOX_STATUS_READY)
+			break;
+	}
+
+	status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
+
+	/* Check for explicit error response */
+	if (status & MASK_MBOX_STATUS_ERROR) {
+		ss->state = UCODE_ERROR;
+		return false;
+	}
+
+	/*
+	 * Hardware is neither responded to the action nor
+	 * signaled any error. Treat the case as timeout.
+	 */
+	if (!(status & MASK_MBOX_STATUS_READY)) {
+		ss->state = UCODE_TIMEOUT;
+		return false;
+	}
+
+	ss->state = UCODE_OK;
+	return true;
+}
+
 /*
  * Transmit a chunk of the microcode image to the hardware.
  * Return true if the chunk is processed successfully.
  */
 static bool send_data_chunk(struct staging_state *ss)
 {
-	pr_debug_once("Staging mailbox loading code needs to be implemented.\n");
-	ss->state = UCODE_ERROR;
-	return false;
+	u16 mbox_size = MBOX_HEADER_SIZE * 2 + ss->chunk_size;
+	u32 *chunk = ss->ucode_ptr + ss->offset;
+
+	/*
+	 * Write 'request' mailbox object in the following order:
+	 * - Mailbox header includes total size
+	 * - Command header specifies the load operation
+	 * - Data section contains a microcode chunk
+	 */
+	write_mbox_header(ss->mmio_base, MBOX_HEADER(mbox_size));
+	write_mbox_header(ss->mmio_base, MBOX_CMD_LOAD);
+	write_mbox_data(ss->mmio_base, chunk, ss->chunk_size);
+	ss->bytes_sent += ss->chunk_size;
+
+	/*
+	 * Notify the hardware that the mailbox is ready for processing.
+	 * The staging hardware will process the request asynchronously.
+	 */
+	writel(MASK_MBOX_CTRL_GO, ss->mmio_base + MBOX_CONTROL_OFFSET);
+	return wait_for_transaction(ss);
 }
 
 /*
@@ -405,9 +528,24 @@ static bool send_data_chunk(struct staging_state *ss)
  */
 static bool fetch_next_offset(struct staging_state *ss)
 {
-	pr_debug_once("Staging mailbox response handling code needs to be implemented.\n\n");
-	ss->state = UCODE_ERROR;
-	return false;
+	const u16 mbox_size = MBOX_HEADER_SIZE + MBOX_RESPONSE_SIZE;
+
+	/* All responses begin with the same header value: */
+	WARN_ON_ONCE(read_mbox_header(ss->mmio_base) != MBOX_HEADER(mbox_size));
+
+	/*
+	 * The 'response' mailbox contains two dword data:
+	 * - First has next offset in microcode image
+	 * - Second delivers status flag
+	 */
+	ss->offset = read_mbox_dword(ss->mmio_base);
+	if (read_mbox_dword(ss->mmio_base) & MASK_MBOX_RESP_ERROR) {
+		ss->state = UCODE_ERROR;
+		return false;
+	}
+
+	ss->state = UCODE_OK;
+	return true;
 }
 
 /*
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v3 6/6] x86/microcode/intel: Enable staging when available
  2025-04-09 23:27 ` [PATCH v3 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
                     ` (4 preceding siblings ...)
  2025-04-09 23:27   ` [PATCH v3 5/6] x86/microcode/intel: Support mailbox transfer Chang S. Bae
@ 2025-04-09 23:27   ` Chang S. Bae
  2025-08-13 17:26   ` [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
  6 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-04-09 23:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao,
	chang.seok.bae

With the staging code being ready, check for staging availability by
reading these following MSRs:

  * IA32_ARCH_CAPABILITIES (bit 16) for the presence of
    IA32_MCU_ENUMERATION

  * IA32_MCU_ENUMERATION (bit 4) for the staging feature availability.

When available, enable it by setting the feature bit.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
V2 -> V3: No change
V1 -> V2: Fold MSR definings (Boris).
---
 arch/x86/include/asm/msr-index.h      |  7 +++++++
 arch/x86/kernel/cpu/microcode/intel.c | 17 +++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index f5b05e4f1e27..4eb1001fc5ac 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -166,6 +166,10 @@
 						 * Processor MMIO stale data
 						 * vulnerabilities.
 						 */
+#define ARCH_CAP_MCU_ENUM		BIT(16) /*
+						 * Indicates the presence of microcode update
+						 * feature enumeration and status information
+						 */
 #define ARCH_CAP_FB_CLEAR		BIT(17)	/*
 						 * VERW clears CPU fill buffer
 						 * even on MDS_NO CPUs.
@@ -897,6 +901,9 @@
 #define MSR_IA32_UCODE_WRITE		0x00000079
 #define MSR_IA32_UCODE_REV		0x0000008b
 
+#define MSR_IA32_MCU_ENUMERATION	0x0000007b
+#define MCU_STAGING			BIT(4)
+
 #define MSR_IA32_MCU_STAGING_MBOX_ADDR	0x000007a5
 
 /* Intel SGX Launch Enclave Public Key Hash MSRs */
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 542b3abad8e3..89c17cdd4336 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -959,6 +959,18 @@ static __init void calc_llc_size_per_core(struct cpuinfo_x86 *c)
 	llc_size_per_core = (unsigned int)llc_size;
 }
 
+static __init bool staging_available(void)
+{
+	u64 val;
+
+	val = x86_read_arch_cap_msr();
+	if (!(val & ARCH_CAP_MCU_ENUM))
+		return false;
+
+	rdmsrl(MSR_IA32_MCU_ENUMERATION, val);
+	return !!(val & MCU_STAGING);
+}
+
 struct microcode_ops * __init init_intel_microcode(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
@@ -969,6 +981,11 @@ struct microcode_ops * __init init_intel_microcode(void)
 		return NULL;
 	}
 
+	if (staging_available()) {
+		microcode_intel_ops.use_staging = true;
+		pr_info("Enabled staging feature.\n");
+	}
+
 	calc_llc_size_per_core(c);
 
 	return &microcode_intel_ops;
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 5/6] x86/microcode/intel: Support mailbox transfer
  2025-04-09 23:27   ` [PATCH v3 5/6] x86/microcode/intel: Support mailbox transfer Chang S. Bae
@ 2025-04-16 14:14     ` Chao Gao
  2025-04-16 17:22       ` Chang S. Bae
  0 siblings, 1 reply; 69+ messages in thread
From: Chao Gao @ 2025-04-16 14:14 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen, colinmitchell

>+/*
>+ * Wait for the hardware to complete a transaction.
>+ * Return true on success, false on failure.
>+ */
>+static bool wait_for_transaction(struct staging_state *ss)
>+{
>+	u32 timeout, status;
>+
>+	/* Allow time for hardware to complete the operation: */
>+	for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++) {
>+		msleep(1);
>+
>+		status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
>+		/* Break out early if the hardware is ready: */
>+		if (status & MASK_MBOX_STATUS_READY)
>+			break;

Shouldn't we exit the loop if the MASK_MBOX_STATUS_ERROR is set, or is the
ERROR bit always set in conjunction with the READY bit?

>+	}
>+
>+	status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);

I still think this read is not needed.

>+
>+	/* Check for explicit error response */
>+	if (status & MASK_MBOX_STATUS_ERROR) {
>+		ss->state = UCODE_ERROR;
>+		return false;
>+	}
>+
>+	/*
>+	 * Hardware is neither responded to the action nor
>+	 * signaled any error. Treat the case as timeout.
>+	 */
>+	if (!(status & MASK_MBOX_STATUS_READY)) {
>+		ss->state = UCODE_TIMEOUT;
>+		return false;
>+	}
>+
>+	ss->state = UCODE_OK;
>+	return true;
>+}

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 5/6] x86/microcode/intel: Support mailbox transfer
  2025-04-16 14:14     ` Chao Gao
@ 2025-04-16 17:22       ` Chang S. Bae
  2025-04-16 17:37         ` Dave Hansen
  0 siblings, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-04-16 17:22 UTC (permalink / raw)
  To: Chao Gao; +Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen, colinmitchell

On 4/16/2025 7:14 AM, Chao Gao wrote:
>> +/*
>> + * Wait for the hardware to complete a transaction.
>> + * Return true on success, false on failure.
>> + */
>> +static bool wait_for_transaction(struct staging_state *ss)
>> +{
>> +	u32 timeout, status;
>> +
>> +	/* Allow time for hardware to complete the operation: */
>> +	for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++) {
>> +		msleep(1);
>> +
>> +		status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
>> +		/* Break out early if the hardware is ready: */
>> +		if (status & MASK_MBOX_STATUS_READY)
>> +			break;
> 
> Shouldn't we exit the loop if the MASK_MBOX_STATUS_ERROR is set, or is the
> ERROR bit always set in conjunction with the READY bit?
> 
>> +	}
>> +
>> +	status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
> 
> I still think this read is not needed.

No, if it times out, it could exit without having read the status.

But this is a slow path — instead of trying to save a few cycles, I 
wanted to present the logic more clearly:

   * Give the hardware enough time, but not forever.
     * Oh, we don't need to wait anymore if it's done

   * After waiting, check the status and handle it properly.

Isn’t it clearer to read the status after the wait, right before the 
error handling?

Also, reading the status inside the loop is just a way to determine 
whether to break — not to interpret or act on it. That should come after 
the wait completes.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v3 5/6] x86/microcode/intel: Support mailbox transfer
  2025-04-16 17:22       ` Chang S. Bae
@ 2025-04-16 17:37         ` Dave Hansen
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Hansen @ 2025-04-16 17:37 UTC (permalink / raw)
  To: Chang S. Bae, Chao Gao
  Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen, colinmitchell

On 4/16/25 10:22, Chang S. Bae wrote:
> On 4/16/2025 7:14 AM, Chao Gao wrote:
>>> +/*
>>> + * Wait for the hardware to complete a transaction.
>>> + * Return true on success, false on failure.
>>> + */
>>> +static bool wait_for_transaction(struct staging_state *ss)
>>> +{
>>> +    u32 timeout, status;
>>> +
>>> +    /* Allow time for hardware to complete the operation: */
>>> +    for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++) {
>>> +        msleep(1);
>>> +
>>> +        status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
>>> +        /* Break out early if the hardware is ready: */
>>> +        if (status & MASK_MBOX_STATUS_READY)
>>> +            break;
>>
>> Shouldn't we exit the loop if the MASK_MBOX_STATUS_ERROR is set, or is
>> the
>> ERROR bit always set in conjunction with the READY bit?
>>
>>> +    }
>>> +
>>> +    status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
>>
>> I still think this read is not needed.
> 
> No, if it times out, it could exit without having read the status.

Can it? There are only two ways out of the loop: the explicit break and
the implicit break from the 'timeout' check. Both happen at the "end" of
the loop, unless the timeout< check failed up front:

	for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++){
		msleep(1);
		status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
		if (status & MASK_MBOX_STATUS_READY)
			break;
	}

> But this is a slow path — instead of trying to save a few cycles, I
> wanted to present the logic more clearly:
> 
>   * Give the hardware enough time, but not forever.
>     * Oh, we don't need to wait anymore if it's done
> 
>   * After waiting, check the status and handle it properly.
> 
> Isn’t it clearer to read the status after the wait, right before the
> error handling?

It's simpler to read the code the way you have it. You don't have to
consider if 'status' could be uninitialized. It makes the "timeout loop"
*COMPLETELY* independent from the rest of the function. It could be:

foo()
{
	spin_until_ready();

	status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);

	...
}

for example.

It's less likely to break, too. Let's say someone added a new break:

	for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++){
+		if (some_other_condition());
+			break;
		msleep(1);
		status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
		if (status & MASK_MBOX_STATUS_READY)
			break;
	}

you'd have a bug on your hands.

I'm fine with the way you have it. I probably would have written it that
way too.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature
  2025-04-09 23:27 ` [PATCH v3 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
                     ` (5 preceding siblings ...)
  2025-04-09 23:27   ` [PATCH v3 6/6] x86/microcode/intel: Enable staging when available Chang S. Bae
@ 2025-08-13 17:26   ` Chang S. Bae
  2025-08-13 17:26     ` [PATCH v4 1/6] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
                       ` (7 more replies)
  6 siblings, 8 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-08-13 17:26 UTC (permalink / raw)
  To: x86
  Cc: tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao, abusse,
	chang.seok.bae, linux-kernel

Dear x86 maintainers,

This is another posting of the series, now including a Tested-by tag.
There are no code changes.

The use case for this feature is late-loading with increased load times,
most likely in cloud service environments. I reached out to potential
users for reviews and feedback, and Anselm Busse has confirmed his
testing.

While additional reviews or endorsements could help ease the maintainers'
burden and build confidence, there hasn't been much recent traction. At
this point, I believe it's better to move forward and get a clear
assessment from the maintainers -- either to prompt more engagement from
prospective users or to determine its readiness for merging.

Of course, I appreciate any further feedback on this set.

As usual, the patch series is based on tip/master and is also available
in this repo:
  git://github.com/intel-staging/microcode.git staging_v4

Last posting (V3):
  https://lore.kernel.org/lkml/20250409232713.4536-1-chang.seok.bae@intel.com/

Thanks,
Chang

Chang S. Bae (6):
  x86/microcode: Introduce staging step to reduce late-loading time
  x86/microcode/intel: Establish staging control logic
  x86/microcode/intel: Define staging state struct
  x86/microcode/intel: Implement staging handler
  x86/microcode/intel: Support mailbox transfer
  x86/microcode/intel: Enable staging when available

 arch/x86/include/asm/msr-index.h         |   9 +
 arch/x86/include/asm/topology.h          |   1 +
 arch/x86/kernel/cpu/microcode/core.c     |  11 +
 arch/x86/kernel/cpu/microcode/intel.c    | 338 +++++++++++++++++++++++
 arch/x86/kernel/cpu/microcode/internal.h |   4 +-
 5 files changed, 362 insertions(+), 1 deletion(-)


base-commit: ca76508b9352e8c770b58213cc6c4700e459b7c2
-- 
2.48.1


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH v4 1/6] x86/microcode: Introduce staging step to reduce late-loading time
  2025-08-13 17:26   ` [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
@ 2025-08-13 17:26     ` Chang S. Bae
  2025-08-18  7:45       ` Chao Gao
  2025-08-13 17:26     ` [PATCH v4 2/6] x86/microcode/intel: Establish staging control logic Chang S. Bae
                       ` (6 subsequent siblings)
  7 siblings, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-08-13 17:26 UTC (permalink / raw)
  To: x86
  Cc: tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao, abusse,
	chang.seok.bae, linux-kernel

As microcode patch sizes continue to grow, late-loading latency spikes
can lead to timeouts and disruptions in running workloads. This trend of
increasing patch sizes is expected to continue, so a foundational
solution is needed to address the issue.

To mitigate the problem, a new staging feature is introduced. This option
processes most of the microcode update (excluding activation) on a
non-critical path, allowing CPUs to remain operational during the
majority of the update. By offloading work from the critical path,
staging can significantly reduces latency spikes.

Integrate staging as a preparatory step in late-loading. Introduce a new
callback for staging, which is invoked at the beginning of
load_late_stop_cpus(), before CPUs enter the rendezvous phase.

Staging follows an opportunistic model:

  *  If successful, it reduces CPU rendezvous time
  *  Even though it fails, the process falls back to the legacy path to
     finish the loading process but with potentially higher latency.

Extend struct microcode_ops to incorporate staging properties, which will
be implemented in the vendor code separately.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
---
There were discussions about whether staging success should be enforced
by a configurable option. That topic is identified as follow-up work,
separate from this series.
    https://lore.kernel.org/lkml/54308373-7867-4b76-be34-63730953f83c@intel.com/

V1 -> V2:
* Move invocation inside of load_late_stop_cpus() (Boris)
* Add more note about staging (Dave)
---
 arch/x86/kernel/cpu/microcode/core.c     | 11 +++++++++++
 arch/x86/kernel/cpu/microcode/internal.h |  4 +++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
index b92e09a87c69..34e569ee1db2 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -552,6 +552,17 @@ static int load_late_stop_cpus(bool is_safe)
 		pr_err("You should switch to early loading, if possible.\n");
 	}
 
+	/*
+	 * Pre-load the microcode image into a staging device. This
+	 * process is preemptible and does not require stopping CPUs.
+	 * Successful staging simplifies the subsequent late-loading
+	 * process, reducing rendezvous time.
+	 *
+	 * Even if the transfer fails, the update will proceed as usual.
+	 */
+	if (microcode_ops->use_staging)
+		microcode_ops->stage_microcode();
+
 	atomic_set(&late_cpus_in, num_online_cpus());
 	atomic_set(&offline_in_nmi, 0);
 	loops_per_usec = loops_per_jiffy / (TICK_NSEC / 1000);
diff --git a/arch/x86/kernel/cpu/microcode/internal.h b/arch/x86/kernel/cpu/microcode/internal.h
index 50a9702ae4e2..adf02ebbf7a3 100644
--- a/arch/x86/kernel/cpu/microcode/internal.h
+++ b/arch/x86/kernel/cpu/microcode/internal.h
@@ -31,10 +31,12 @@ struct microcode_ops {
 	 * See also the "Synchronization" section in microcode_core.c.
 	 */
 	enum ucode_state	(*apply_microcode)(int cpu);
+	void			(*stage_microcode)(void);
 	int			(*collect_cpu_info)(int cpu, struct cpu_signature *csig);
 	void			(*finalize_late_load)(int result);
 	unsigned int		nmi_safe	: 1,
-				use_nmi		: 1;
+				use_nmi		: 1,
+				use_staging	: 1;
 };
 
 struct early_load_data {
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v4 2/6] x86/microcode/intel: Establish staging control logic
  2025-08-13 17:26   ` [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
  2025-08-13 17:26     ` [PATCH v4 1/6] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
@ 2025-08-13 17:26     ` Chang S. Bae
  2025-08-13 18:21       ` Dave Hansen
  2025-08-13 17:26     ` [PATCH v4 3/6] x86/microcode/intel: Define staging state struct Chang S. Bae
                       ` (5 subsequent siblings)
  7 siblings, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-08-13 17:26 UTC (permalink / raw)
  To: x86
  Cc: tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao, abusse,
	chang.seok.bae, linux-kernel

When microcode staging is initiated, operations are carried out through
an MMIO interface. Each package has a unique interface specified by the
IA32_MCU_STAGING_MBOX_ADDR MSR, which maps to a set of 32-bit registers.

Prepare staging with the following steps:

  1.  Ensure the microcode image is 32-bit aligned to match the MMIO
      register size.

  2.  Identify each MMIO interface based on its per-package scope.

  3.  Invoke the staging function for each identified interface, which
      will be implemented separately.

Also, define cpu_primary_thread_mask for the CONFIG_SMP=n case, allowing
consistent use when narrowing down primary threads to locate the
per-package interface.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
Link: https://lore.kernel.org/all/871pznq229.ffs@tglx
---
V2 -> V3:
* Remove a global variable and adjust stage_microcode() (Dave).
* Simplify for_each_cpu() loop control code
* Handle rdmsrl_on_cpu() return code explicitly (Chao)

V1 -> V2:
* Adjust to reference the staging_state struct.
* Add lockdep_assert_cpus_held() (Boris)
---
 arch/x86/include/asm/msr-index.h      |  2 ++
 arch/x86/include/asm/topology.h       |  1 +
 arch/x86/kernel/cpu/microcode/intel.c | 50 +++++++++++++++++++++++++++
 3 files changed, 53 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index b65c3ba5fa14..0356155f9264 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -913,6 +913,8 @@
 #define MSR_IA32_UCODE_WRITE		0x00000079
 #define MSR_IA32_UCODE_REV		0x0000008b
 
+#define MSR_IA32_MCU_STAGING_MBOX_ADDR	0x000007a5
+
 /* Intel SGX Launch Enclave Public Key Hash MSRs */
 #define MSR_IA32_SGXLEPUBKEYHASH0	0x0000008C
 #define MSR_IA32_SGXLEPUBKEYHASH1	0x0000008D
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 6c79ee7c0957..91b5fc44ca62 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -235,6 +235,7 @@ static inline bool topology_is_primary_thread(unsigned int cpu)
 static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
 static inline int topology_max_smt_threads(void) { return 1; }
 static inline unsigned int topology_amd_nodes_per_pkg(void) { return 1; }
+#define cpu_primary_thread_mask cpu_none_mask
 #endif /* !CONFIG_SMP */
 
 static inline void arch_fix_phys_package_id(int num, u32 slot)
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 371ca6eac00e..468c4d3d5d66 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -299,6 +299,55 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 	return size ? NULL : patch;
 }
 
+/*
+ * Handle the staging process using the mailbox MMIO interface.
+ * Return the result state.
+ */
+static enum ucode_state do_stage(u64 mmio_pa)
+{
+	pr_debug_once("Staging implementation is pending.\n");
+	return UCODE_ERROR;
+}
+
+static void stage_microcode(void)
+{
+	unsigned int pkg_id = UINT_MAX;
+	enum ucode_state ret;
+	int cpu, err;
+	u64 mmio_pa;
+
+	if (!IS_ALIGNED(get_totalsize(&ucode_patch_late->hdr), sizeof(u32)))
+		return;
+
+	lockdep_assert_cpus_held();
+
+	/*
+	 * The MMIO address is unique per package, and all the SMT
+	 * primary threads are online here. Find each MMIO space by
+	 * their package ids to avoid duplicate staging.
+	 */
+	for_each_cpu(cpu, cpu_primary_thread_mask) {
+		if (topology_logical_package_id(cpu) == pkg_id)
+			continue;
+		pkg_id = topology_logical_package_id(cpu);
+
+		err = rdmsrl_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &mmio_pa);
+		if (WARN_ON_ONCE(err))
+			return;
+
+		ret = do_stage(mmio_pa);
+		if (ret != UCODE_OK) {
+			pr_err("Error: staging failed with %s for CPU%d at package %u.\n",
+			       ret == UCODE_TIMEOUT ? "timeout" : "error state",
+			       cpu, pkg_id);
+			return;
+		}
+	}
+
+	pr_info("Staging of patch revision 0x%x succeeded.\n",
+		((struct microcode_header_intel *)ucode_patch_late)->rev);
+}
+
 static enum ucode_state __apply_microcode(struct ucode_cpu_info *uci,
 					  struct microcode_intel *mc,
 					  u32 *cur_rev)
@@ -627,6 +676,7 @@ static struct microcode_ops microcode_intel_ops = {
 	.collect_cpu_info	= collect_cpu_info,
 	.apply_microcode	= apply_microcode_late,
 	.finalize_late_load	= finalize_late_load,
+	.stage_microcode	= stage_microcode,
 	.use_nmi		= IS_ENABLED(CONFIG_X86_64),
 };
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v4 3/6] x86/microcode/intel: Define staging state struct
  2025-08-13 17:26   ` [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
  2025-08-13 17:26     ` [PATCH v4 1/6] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
  2025-08-13 17:26     ` [PATCH v4 2/6] x86/microcode/intel: Establish staging control logic Chang S. Bae
@ 2025-08-13 17:26     ` Chang S. Bae
  2025-08-13 18:25       ` Dave Hansen
  2025-08-13 17:26     ` [PATCH v4 4/6] x86/microcode/intel: Implement staging handler Chang S. Bae
                       ` (4 subsequent siblings)
  7 siblings, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-08-13 17:26 UTC (permalink / raw)
  To: x86
  Cc: tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao, abusse,
	chang.seok.bae, linux-kernel

To facilitate a structured staging handler, a set of functions will be
introduced. Define staging_state struct to simplify function prototypes
by consolidating relevant data, instead of passing multiple local
variables.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
---
V1 -> V2: New patch

Prior to V2, local variables were used to track state values, with the
intention of improving readability by explicitly passing them between
functions. However, given feedbacks, introducing a dedicated data
structure looks to provide a benefit by simplifying the main loop.
---
 arch/x86/kernel/cpu/microcode/intel.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 468c4d3d5d66..36b426a90844 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -54,6 +54,27 @@ struct extended_sigtable {
 	struct extended_signature	sigs[];
 };
 
+/**
+ * struct staging_state - Tracks the current staging process state
+ *
+ * @mmio_base:		MMIO base address for staging
+ * @ucode_ptr:		Pointer to the microcode image
+ * @ucode_len:		Total size of the microcode image
+ * @chunk_size:		Size of each data piece
+ * @bytes_sent:		Total bytes transmitted so far
+ * @offset:		Current offset in the microcode image
+ * @state:		Current state of the staging process
+ */
+struct staging_state {
+	void __iomem		*mmio_base;
+	void			*ucode_ptr;
+	unsigned int		ucode_len;
+	unsigned int		chunk_size;
+	unsigned int		bytes_sent;
+	unsigned int		offset;
+	enum ucode_state	state;
+};
+
 #define DEFAULT_UCODE_TOTALSIZE (DEFAULT_UCODE_DATASIZE + MC_HEADER_SIZE)
 #define EXT_HEADER_SIZE		(sizeof(struct extended_sigtable))
 #define EXT_SIGNATURE_SIZE	(sizeof(struct extended_signature))
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v4 4/6] x86/microcode/intel: Implement staging handler
  2025-08-13 17:26   ` [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
                       ` (2 preceding siblings ...)
  2025-08-13 17:26     ` [PATCH v4 3/6] x86/microcode/intel: Define staging state struct Chang S. Bae
@ 2025-08-13 17:26     ` Chang S. Bae
  2025-08-13 18:44       ` Dave Hansen
  2025-08-13 17:26     ` [PATCH v4 5/6] x86/microcode/intel: Support mailbox transfer Chang S. Bae
                       ` (3 subsequent siblings)
  7 siblings, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-08-13 17:26 UTC (permalink / raw)
  To: x86
  Cc: tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao, abusse,
	chang.seok.bae, linux-kernel

Previously, per-package staging invocations and their associated state
data were established. The next step is to implement the actual staging
handler according to the specified protocol. Below are key aspects to
note:

  (a)  Each staging process must begin by resetting the staging hardware.

  (b)  The staging hardware processes up to a page-sized chunk of the
       microcode image per iteration, requiring software to submit data
       incrementally.

  (c)  Once a data chunk is processed, the hardware responds with an
       offset in the image for the next chunk.

  (d)  The offset may indicate completion or request retransmission of an
       already transferred chunk. As long as the total transferred data
       remains within the predefined limit (twice the image size),
       retransmissions should be acceptable.

With that, incorporate these code sequences to the staging handler:

  1.  Initialization: Map the MMIO space via ioremap(). Reset the staging
      hardware and initialize software state, ensuring a fresh staging
      process aligned with (a).

  2.  Processing Loop: Introduce a loop iterating over data chunk,
      following (b), with proper termination conditions established from
      (d) -- stop staging when the hardware signals completion, or if the
      total transmitted data exceeds the predefined limit.

  3.  Loop Body: Finally, compose the loop body with two steps --
      transmitting a data chunk and retrieving the next offset from the
      hardware response, aligning with (b) and (c).

Since data transmission and mailbox format handling require additional
details, they are implemented separately in next changes.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
---
V2 -> V3:
* Rework code to eliminate global variables (Dave)
* Remove redundant variable resets (Chao)

V1 -> V2:
* Re-write the changelog for clarity (Dave).
* Move staging handling code into intel.c (Boris).
* Add extensive comments to clarify staging logic and hardware
  interactions, along with function renaming (Dave).
---
 arch/x86/kernel/cpu/microcode/intel.c | 120 +++++++++++++++++++++++++-
 1 file changed, 116 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 36b426a90844..3eb3331c5012 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -20,6 +20,8 @@
 #include <linux/cpu.h>
 #include <linux/uio.h>
 #include <linux/mm.h>
+#include <linux/delay.h>
+#include <linux/io.h>
 
 #include <asm/cpu_device_id.h>
 #include <asm/processor.h>
@@ -33,6 +35,29 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
 
 #define UCODE_BSP_LOADED	((struct microcode_intel *)0x1UL)
 
+/* Defines for the microcode staging mailbox interface */
+
+#define MBOX_REG_NUM		4
+#define MBOX_REG_SIZE		sizeof(u32)
+
+#define MBOX_CONTROL_OFFSET	0x0
+#define MBOX_STATUS_OFFSET	0x4
+
+#define MASK_MBOX_CTRL_ABORT	BIT(0)
+
+/*
+ * Each microcode image is divided into chunks, each at most
+ * MBOX_XACTION_SIZE in size. A 10-chunk image would typically require
+ * 10 transactions. However, the hardware managing the mailbox has
+ * limited resources and may not cache the entire image, potentially
+ * requesting the same chunk multiple times.
+ *
+ * To accommodate this behavior, allow up to twice the expected number of
+ * transactions (i.e., a 10-chunk image can take up to 20 attempts).
+ */
+#define MBOX_XACTION_SIZE	PAGE_SIZE
+#define MBOX_XACTION_MAX(imgsz)	((imgsz) * 2)
+
 /* Current microcode patch used in early patching on the APs. */
 static struct microcode_intel *ucode_patch_va __read_mostly;
 static struct microcode_intel *ucode_patch_late __read_mostly;
@@ -321,13 +346,100 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 }
 
 /*
- * Handle the staging process using the mailbox MMIO interface.
- * Return the result state.
+ * Prepare for a new microcode transfer by resetting hardware and
+ * configuring microcode image info.
+ */
+static void init_stage(struct staging_state *ss)
+{
+	ss->ucode_ptr = ucode_patch_late;
+	ss->ucode_len = get_totalsize(&ucode_patch_late->hdr);
+
+	/*
+	 * Abort any ongoing process, effectively resetting the device.
+	 * Unlike regular mailbox data processing requests, this
+	 * operation does not require a status check.
+	 */
+	writel(MASK_MBOX_CTRL_ABORT, ss->mmio_base + MBOX_CONTROL_OFFSET);
+}
+
+/*
+ * Check if the staging process has completed. The hardware signals
+ * completion by setting a unique end offset.
+ */
+static inline bool is_stage_complete(unsigned int offset)
+{
+	return offset == UINT_MAX;
+}
+
+/*
+ * Determine if the next data chunk can be sent. Each chunk is typically
+ * one page unless the remaining data is smaller. If the total
+ * transmitted data exceeds the defined limit, a timeout occurs.
+ */
+static bool can_send_next_chunk(struct staging_state *ss)
+{
+	WARN_ON_ONCE(ss->ucode_len < ss->offset);
+	ss->chunk_size = min(MBOX_XACTION_SIZE, ss->ucode_len - ss->offset);
+
+	if (ss->bytes_sent + ss->chunk_size > MBOX_XACTION_MAX(ss->ucode_len)) {
+		ss->state = UCODE_TIMEOUT;
+		return false;
+	}
+	return true;
+}
+
+/*
+ * Transmit a chunk of the microcode image to the hardware.
+ * Return true if the chunk is processed successfully.
+ */
+static bool send_data_chunk(struct staging_state *ss)
+{
+	pr_debug_once("Staging mailbox loading code needs to be implemented.\n");
+	ss->state = UCODE_ERROR;
+	return false;
+}
+
+/*
+ * Retrieve the next offset from the hardware response.
+ * Return true if the response is valid, false otherwise.
+ */
+static bool fetch_next_offset(struct staging_state *ss)
+{
+	pr_debug_once("Staging mailbox response handling code needs to be implemented.\n\n");
+	ss->state = UCODE_ERROR;
+	return false;
+}
+
+/*
+ * Handle the staging process using the mailbox MMIO interface. The
+ * microcode image is transferred in chunks until completion. Return the
+ * result state.
  */
 static enum ucode_state do_stage(u64 mmio_pa)
 {
-	pr_debug_once("Staging implementation is pending.\n");
-	return UCODE_ERROR;
+	struct staging_state ss = {};
+
+	ss.mmio_base = ioremap(mmio_pa, MBOX_REG_NUM * MBOX_REG_SIZE);
+	if (WARN_ON_ONCE(!ss.mmio_base))
+		return UCODE_ERROR;
+
+	init_stage(&ss);
+
+	/* Perform the staging process while within the retry limit */
+	while (!is_stage_complete(ss.offset) && can_send_next_chunk(&ss)) {
+		/* Send a chunk of microcode each time: */
+		if (!send_data_chunk(&ss))
+			break;
+		/*
+		 * Then, ask the hardware which piece of the image it
+		 * needs next. The same piece may be sent more than once.
+		 */
+		if (!fetch_next_offset(&ss))
+			break;
+	}
+
+	iounmap(ss.mmio_base);
+	return ss.state;
 }
 
 static void stage_microcode(void)
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v4 5/6] x86/microcode/intel: Support mailbox transfer
  2025-08-13 17:26   ` [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
                       ` (3 preceding siblings ...)
  2025-08-13 17:26     ` [PATCH v4 4/6] x86/microcode/intel: Implement staging handler Chang S. Bae
@ 2025-08-13 17:26     ` Chang S. Bae
  2025-08-13 19:07       ` Dave Hansen
  2025-08-13 17:26     ` [PATCH v4 6/6] x86/microcode/intel: Enable staging when available Chang S. Bae
                       ` (2 subsequent siblings)
  7 siblings, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-08-13 17:26 UTC (permalink / raw)
  To: x86
  Cc: tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao, abusse,
	chang.seok.bae, linux-kernel

Previously, the functions for sending microcode data and retrieving the
next offset were placeholders, as they required handling the specific
mailbox format. Implement them as following:

== Mailbox Format ==

The staging mailbox consists of two primary sections: 'header' and
'data'. While the microcode must be transferred following this format,
the actual data transfer mechanism involves reading and writing to
specific MMIO registers.

== Mailbox Data Registers ==

Unlike conventional interfaces that allocate MMIO space for each data
chunk, the staging interface features a "narrow" interface, using only
two dword-sized registers for read and write operations.

For example, if writing 2 dwords of data to a device. Typically, the
device would expose 2 dwords of "wide" MMIO space. To send the data to
the device:

	writel(buf[0], io_addr + 0);
	writel(buf[1], io_addr + 1);

But, this interface is a bit different. Instead of having a "wide"
interface where there is separate MMIO space for each word in a
transaction, it has a "narrow" interface where several words are written
to the same spot in MMIO space:

	writel(buf[0], io_addr);
	writel(buf[1], io_addr);

The same goes for the read side.

== Implementation Summary ==

Given that, introduce two layers of helper functions at first:

  * Low-level helpers for reading and writing to data registers directly.
  * Wrapper functions for handling mailbox header and data sections.

Using them, implement send_data_chunk() and fetch_next_offset()
functions. Add explicit error and timeout handling routine in
wait_for_transaction(), finishing up the transfer.

Note: The kernel has support for similar mailboxes. But none of them are
compatible with this one. Trying to share code resulted in a bloated
mess, so this code is standalone.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
---
V2 -> V3:
* Update code to reflect the removal of a global variable (Dave).

V1 -> V2:
* Add lots of code comments and edit the changelog (Dave).
* Encapsulate register read/write operations for processing header and
  data sections.
---
 arch/x86/kernel/cpu/microcode/intel.c | 150 ++++++++++++++++++++++++--
 1 file changed, 144 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 3eb3331c5012..5402bcb96f47 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -22,6 +22,7 @@
 #include <linux/mm.h>
 #include <linux/delay.h>
 #include <linux/io.h>
+#include <linux/pci_ids.h>
 
 #include <asm/cpu_device_id.h>
 #include <asm/processor.h>
@@ -42,8 +43,30 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
 
 #define MBOX_CONTROL_OFFSET	0x0
 #define MBOX_STATUS_OFFSET	0x4
+#define MBOX_WRDATA_OFFSET	0x8
+#define MBOX_RDDATA_OFFSET	0xc
 
 #define MASK_MBOX_CTRL_ABORT	BIT(0)
+#define MASK_MBOX_CTRL_GO	BIT(31)
+
+#define MASK_MBOX_STATUS_ERROR	BIT(2)
+#define MASK_MBOX_STATUS_READY	BIT(31)
+
+#define MASK_MBOX_RESP_SUCCESS	BIT(0)
+#define MASK_MBOX_RESP_PROGRESS	BIT(1)
+#define MASK_MBOX_RESP_ERROR	BIT(2)
+
+#define MBOX_CMD_LOAD		0x3
+#define MBOX_OBJ_STAGING	0xb
+#define DWORD_ALIGN(size)	((size) / sizeof(u32))
+#define MBOX_HEADER(mbox_size)	(PCI_VENDOR_ID_INTEL | \
+				 (MBOX_OBJ_STAGING << 16) | \
+				 ((u64)DWORD_ALIGN(mbox_size) << 32))
+
+/* The size of each mailbox header */
+#define MBOX_HEADER_SIZE	sizeof(u64)
+/* The size of staging hardware response */
+#define MBOX_RESPONSE_SIZE	sizeof(u64)
 
 /*
  * Each microcode image is divided into chunks, each at most
@@ -57,6 +80,7 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
  */
 #define MBOX_XACTION_SIZE	PAGE_SIZE
 #define MBOX_XACTION_MAX(imgsz)	((imgsz) * 2)
+#define MBOX_XACTION_TIMEOUT_MS	(10 * MSEC_PER_SEC)
 
 /* Current microcode patch used in early patching on the APs. */
 static struct microcode_intel *ucode_patch_va __read_mostly;
@@ -345,6 +369,49 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 	return size ? NULL : patch;
 }
 
+static inline u32 read_mbox_dword(void __iomem *mmio_base)
+{
+	u32 dword = readl(mmio_base + MBOX_RDDATA_OFFSET);
+
+	/* Acknowledge read completion to the staging hardware */
+	writel(0, mmio_base + MBOX_RDDATA_OFFSET);
+	return dword;
+}
+
+static inline void write_mbox_dword(void __iomem *mmio_base, u32 dword)
+{
+	writel(dword, mmio_base + MBOX_WRDATA_OFFSET);
+}
+
+static inline u64 read_mbox_header(void __iomem *mmio_base)
+{
+	u32 high, low;
+
+	low  = read_mbox_dword(mmio_base);
+	high = read_mbox_dword(mmio_base);
+
+	return ((u64)high << 32) | low;
+}
+
+static inline void write_mbox_header(void __iomem *mmio_base, u64 value)
+{
+	write_mbox_dword(mmio_base, value);
+	write_mbox_dword(mmio_base, value >> 32);
+}
+
+static void write_mbox_data(void __iomem *mmio_base, u32 *chunk, unsigned int chunk_size)
+{
+	int i;
+
+	/*
+	 * The MMIO space is mapped as Uncached (UC). Each write arrives
+	 * at the device as an individual transaction in program order.
+	 * The device can then resemble the sequence accordingly.
+	 */
+	for (i = 0; i < DWORD_ALIGN(chunk_size); i++)
+		write_mbox_dword(mmio_base, chunk[i]);
+}
+
 /*
  * Prepare for a new microcode transfer by resetting hardware and
  * configuring microcode image info.
@@ -388,15 +455,71 @@ static bool can_send_next_chunk(struct staging_state *ss)
 	return true;
 }
 
+/*
+ * Wait for the hardware to complete a transaction.
+ * Return true on success, false on failure.
+ */
+static bool wait_for_transaction(struct staging_state *ss)
+{
+	u32 timeout, status;
+
+	/* Allow time for hardware to complete the operation: */
+	for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++) {
+		msleep(1);
+
+		status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
+		/* Break out early if the hardware is ready: */
+		if (status & MASK_MBOX_STATUS_READY)
+			break;
+	}
+
+	status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
+
+	/* Check for explicit error response */
+	if (status & MASK_MBOX_STATUS_ERROR) {
+		ss->state = UCODE_ERROR;
+		return false;
+	}
+
+	/*
+	 * Hardware is neither responded to the action nor
+	 * signaled any error. Treat the case as timeout.
+	 */
+	if (!(status & MASK_MBOX_STATUS_READY)) {
+		ss->state = UCODE_TIMEOUT;
+		return false;
+	}
+
+	ss->state = UCODE_OK;
+	return true;
+}
+
 /*
  * Transmit a chunk of the microcode image to the hardware.
  * Return true if the chunk is processed successfully.
  */
 static bool send_data_chunk(struct staging_state *ss)
 {
-	pr_debug_once("Staging mailbox loading code needs to be implemented.\n");
-	ss->state = UCODE_ERROR;
-	return false;
+	u16 mbox_size = MBOX_HEADER_SIZE * 2 + ss->chunk_size;
+	u32 *chunk = ss->ucode_ptr + ss->offset;
+
+	/*
+	 * Write 'request' mailbox object in the following order:
+	 * - Mailbox header includes total size
+	 * - Command header specifies the load operation
+	 * - Data section contains a microcode chunk
+	 */
+	write_mbox_header(ss->mmio_base, MBOX_HEADER(mbox_size));
+	write_mbox_header(ss->mmio_base, MBOX_CMD_LOAD);
+	write_mbox_data(ss->mmio_base, chunk, ss->chunk_size);
+	ss->bytes_sent += ss->chunk_size;
+
+	/*
+	 * Notify the hardware that the mailbox is ready for processing.
+	 * The staging hardware will process the request asynchronously.
+	 */
+	writel(MASK_MBOX_CTRL_GO, ss->mmio_base + MBOX_CONTROL_OFFSET);
+	return wait_for_transaction(ss);
 }
 
 /*
@@ -405,9 +528,24 @@ static bool send_data_chunk(struct staging_state *ss)
  */
 static bool fetch_next_offset(struct staging_state *ss)
 {
-	pr_debug_once("Staging mailbox response handling code needs to be implemented.\n\n");
-	ss->state = UCODE_ERROR;
-	return false;
+	const u16 mbox_size = MBOX_HEADER_SIZE + MBOX_RESPONSE_SIZE;
+
+	/* All responses begin with the same header value: */
+	WARN_ON_ONCE(read_mbox_header(ss->mmio_base) != MBOX_HEADER(mbox_size));
+
+	/*
+	 * The 'response' mailbox contains two dword data:
+	 * - First has next offset in microcode image
+	 * - Second delivers status flag
+	 */
+	ss->offset = read_mbox_dword(ss->mmio_base);
+	if (read_mbox_dword(ss->mmio_base) & MASK_MBOX_RESP_ERROR) {
+		ss->state = UCODE_ERROR;
+		return false;
+	}
+
+	ss->state = UCODE_OK;
+	return true;
 }
 
 /*
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v4 6/6] x86/microcode/intel: Enable staging when available
  2025-08-13 17:26   ` [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
                       ` (4 preceding siblings ...)
  2025-08-13 17:26     ` [PATCH v4 5/6] x86/microcode/intel: Support mailbox transfer Chang S. Bae
@ 2025-08-13 17:26     ` Chang S. Bae
  2025-08-18  8:35       ` Chao Gao
  2025-08-13 19:08     ` [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature Dave Hansen
  2025-08-23 15:52     ` [PATCH v5 0/7] " Chang S. Bae
  7 siblings, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-08-13 17:26 UTC (permalink / raw)
  To: x86
  Cc: tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao, abusse,
	chang.seok.bae, linux-kernel

With the staging code being ready, check for staging availability by
reading these following MSRs:

  * IA32_ARCH_CAPABILITIES (bit 16) for the presence of
    IA32_MCU_ENUMERATION

  * IA32_MCU_ENUMERATION (bit 4) for the staging feature availability.

When available, enable it by setting the feature bit.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
---
V1 -> V2: Fold MSR definings (Boris).
---
 arch/x86/include/asm/msr-index.h      |  7 +++++++
 arch/x86/kernel/cpu/microcode/intel.c | 17 +++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 0356155f9264..2b7b598c414f 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -166,6 +166,10 @@
 						 * Processor MMIO stale data
 						 * vulnerabilities.
 						 */
+#define ARCH_CAP_MCU_ENUM		BIT(16) /*
+						 * Indicates the presence of microcode update
+						 * feature enumeration and status information
+						 */
 #define ARCH_CAP_FB_CLEAR		BIT(17)	/*
 						 * VERW clears CPU fill buffer
 						 * even on MDS_NO CPUs.
@@ -913,6 +917,9 @@
 #define MSR_IA32_UCODE_WRITE		0x00000079
 #define MSR_IA32_UCODE_REV		0x0000008b
 
+#define MSR_IA32_MCU_ENUMERATION	0x0000007b
+#define MCU_STAGING			BIT(4)
+
 #define MSR_IA32_MCU_STAGING_MBOX_ADDR	0x000007a5
 
 /* Intel SGX Launch Enclave Public Key Hash MSRs */
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 5402bcb96f47..4da5f3581a94 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -959,6 +959,18 @@ static __init void calc_llc_size_per_core(struct cpuinfo_x86 *c)
 	llc_size_per_core = (unsigned int)llc_size;
 }
 
+static __init bool staging_available(void)
+{
+	u64 val;
+
+	val = x86_read_arch_cap_msr();
+	if (!(val & ARCH_CAP_MCU_ENUM))
+		return false;
+
+	rdmsrl(MSR_IA32_MCU_ENUMERATION, val);
+	return !!(val & MCU_STAGING);
+}
+
 struct microcode_ops * __init init_intel_microcode(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
@@ -969,6 +981,11 @@ struct microcode_ops * __init init_intel_microcode(void)
 		return NULL;
 	}
 
+	if (staging_available()) {
+		microcode_intel_ops.use_staging = true;
+		pr_info("Enabled staging feature.\n");
+	}
+
 	calc_llc_size_per_core(c);
 
 	return &microcode_intel_ops;
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v4 2/6] x86/microcode/intel: Establish staging control logic
  2025-08-13 17:26     ` [PATCH v4 2/6] x86/microcode/intel: Establish staging control logic Chang S. Bae
@ 2025-08-13 18:21       ` Dave Hansen
  2025-08-13 20:46         ` Chang S. Bae
  2025-08-22 22:39         ` [PATCH v4a 2/6] x86/microcode/intel: Establish staging control logic Chang S. Bae
  0 siblings, 2 replies; 69+ messages in thread
From: Dave Hansen @ 2025-08-13 18:21 UTC (permalink / raw)
  To: Chang S. Bae, x86
  Cc: tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao, abusse,
	linux-kernel

On 8/13/25 10:26, Chang S. Bae wrote:
> When microcode staging is initiated, operations are carried out through
> an MMIO interface. Each package has a unique interface specified by the
> IA32_MCU_STAGING_MBOX_ADDR MSR, which maps to a set of 32-bit registers.
> 
> Prepare staging with the following steps:
> 
>   1.  Ensure the microcode image is 32-bit aligned to match the MMIO
>       register size.
> 
>   2.  Identify each MMIO interface based on its per-package scope.
> 
>   3.  Invoke the staging function for each identified interface, which
>       will be implemented separately.
> 
> Also, define cpu_primary_thread_mask for the CONFIG_SMP=n case, allowing
> consistent use when narrowing down primary threads to locate the
> per-package interface.
...
>  static inline unsigned int topology_amd_nodes_per_pkg(void) { return 1; }
> +#define cpu_primary_thread_mask cpu_none_mask
>  #endif /* !CONFIG_SMP */

Isn't 'cpu_none_mask' a mask containing no CPUs? How can that possible
work here:

	for_each_cpu(cpu, cpu_primary_thread_mask) {

? Wouldn't it just not run through the for loop at all on CONFIG_SMP=n?
Is that what we want for some reason? I would have thought that we'd
still want to find the MMIO address for CPU 0, the one and only CPU.

>  static inline void arch_fix_phys_package_id(int num, u32 slot)
> diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
> index 371ca6eac00e..468c4d3d5d66 100644
> --- a/arch/x86/kernel/cpu/microcode/intel.c
> +++ b/arch/x86/kernel/cpu/microcode/intel.c
> @@ -299,6 +299,55 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
>  	return size ? NULL : patch;
>  }
>  
> +/*
> + * Handle the staging process using the mailbox MMIO interface.
> + * Return the result state.
> + */
> +static enum ucode_state do_stage(u64 mmio_pa)
> +{
> +	pr_debug_once("Staging implementation is pending.\n");
> +	return UCODE_ERROR;
> +}
> +
> +static void stage_microcode(void)
> +{
> +	unsigned int pkg_id = UINT_MAX;
> +	enum ucode_state ret;
> +	int cpu, err;
> +	u64 mmio_pa;
> +
> +	if (!IS_ALIGNED(get_totalsize(&ucode_patch_late->hdr), sizeof(u32)))
> +		return;
> +
> +	lockdep_assert_cpus_held();
> +
> +	/*
> +	 * The MMIO address is unique per package, and all the SMT
> +	 * primary threads are online here. Find each MMIO space by
> +	 * their package ids to avoid duplicate staging.
> +	 */
> +	for_each_cpu(cpu, cpu_primary_thread_mask) {
> +		if (topology_logical_package_id(cpu) == pkg_id)
> +			continue;
> +		pkg_id = topology_logical_package_id(cpu);
> +
> +		err = rdmsrl_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &mmio_pa);
> +		if (WARN_ON_ONCE(err))
> +			return;
> +
> +		ret = do_stage(mmio_pa);
> +		if (ret != UCODE_OK) {
> +			pr_err("Error: staging failed with %s for CPU%d at package %u.\n",
> +			       ret == UCODE_TIMEOUT ? "timeout" : "error state",
> +			       cpu, pkg_id);
> +			return;
> +		}
> +	}
> +
> +	pr_info("Staging of patch revision 0x%x succeeded.\n",
> +		((struct microcode_header_intel *)ucode_patch_late)->rev);
> +}
Hmmm. Consider:

	static struct microcode_intel *ucode_patch_late __read_mostly;

and:

	struct microcode_intel {
	        struct microcode_header_intel   hdr;
	        unsigned int                    bits[];
	};

So isn't this whole ugly cast thing equivalent to:

	ucode_patch_late->hdr.rev

?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v4 3/6] x86/microcode/intel: Define staging state struct
  2025-08-13 17:26     ` [PATCH v4 3/6] x86/microcode/intel: Define staging state struct Chang S. Bae
@ 2025-08-13 18:25       ` Dave Hansen
  2025-08-22 22:39         ` [PATCH v4a " Chang S. Bae
  0 siblings, 1 reply; 69+ messages in thread
From: Dave Hansen @ 2025-08-13 18:25 UTC (permalink / raw)
  To: Chang S. Bae, x86
  Cc: tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao, abusse,
	linux-kernel

> +/**
> + * struct staging_state - Tracks the current staging process state
> + *
> + * @mmio_base:		MMIO base address for staging
> + * @ucode_ptr:		Pointer to the microcode image
> + * @ucode_len:		Total size of the microcode image
> + * @chunk_size:		Size of each data piece
> + * @bytes_sent:		Total bytes transmitted so far
> + * @offset:		Current offset in the microcode image
> + * @state:		Current state of the staging process
> + */
> +struct staging_state {
> +	void __iomem		*mmio_base;
> +	void			*ucode_ptr;

This is assigned a single time to ucode_patch_late. Shouldn't it share
the same type? Or, maybe not even get a structure member since it can
only have one value.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v4 4/6] x86/microcode/intel: Implement staging handler
  2025-08-13 17:26     ` [PATCH v4 4/6] x86/microcode/intel: Implement staging handler Chang S. Bae
@ 2025-08-13 18:44       ` Dave Hansen
  2025-08-22 22:39         ` [PATCH v4a " Chang S. Bae
  0 siblings, 1 reply; 69+ messages in thread
From: Dave Hansen @ 2025-08-13 18:44 UTC (permalink / raw)
  To: Chang S. Bae, x86
  Cc: tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao, abusse,
	linux-kernel

> +/*
> + * Determine if the next data chunk can be sent. Each chunk is typically
> + * one page unless the remaining data is smaller. If the total
> + * transmitted data exceeds the defined limit, a timeout occurs.
> + */

This comment isn't really telling the whole story. It's not just
determining if the chunk can be sent, it's calculating it and filling it in.

> +static bool can_send_next_chunk(struct staging_state *ss)
> +{
> +	WARN_ON_ONCE(ss->ucode_len < ss->offset);

Please don't WARN_ON() they can be fatal because of panic_on_warn. Also
I think this is the wrong spot for this. We should enforce this at the
time ss->offset is _established_ which is oddly enough in the next patch.

	ss->offset = read_mbox_dword(ss->mmio_base);
	if (ss->offset > ss->ucode_len)
		// error out


> +	ss->chunk_size = min(MBOX_XACTION_SIZE, ss->ucode_len - ss->offset);

It's a _little_ non-obvious that "can_send_next_chunk()" is also setting
->chunk_size. It would be easier to grok if it was something like:

	ok = calc_next_chunk_size(&ss);
	if (!ok)
		// error out

> +	if (ss->bytes_sent + ss->chunk_size > MBOX_XACTION_MAX(ss->ucode_len)) {
> +		ss->state = UCODE_TIMEOUT;
> +		return false;
> +	}

"TIMEOUT" seems like an odd thing to call this failure. Can you explain
the choice of this error code a bit, please?

> +/*
> + * Handle the staging process using the mailbox MMIO interface. The
> + * microcode image is transferred in chunks until completion. Return the
> + * result state.
>   */
>  static enum ucode_state do_stage(u64 mmio_pa)
>  {
> -	pr_debug_once("Staging implementation is pending.\n");
> -	return UCODE_ERROR;
> +	struct staging_state ss = {};
> +
> +	ss.mmio_base = ioremap(mmio_pa, MBOX_REG_NUM * MBOX_REG_SIZE);
> +	if (WARN_ON_ONCE(!ss.mmio_base))
> +		return UCODE_ERROR;
> +
> +	init_stage(&ss);
> +
> +	/* Perform the staging process while within the retry limit */
> +	while (!is_stage_complete(ss.offset) && can_send_next_chunk(&ss)) {
> +		/* Send a chunk of microcode each time: */
> +		if (!send_data_chunk(&ss))
> +			break;
> +		/*
> +		 * Then, ask the hardware which piece of the image it
> +		 * needs next. The same piece may be sent more than once.
> +		 */
> +		if (!fetch_next_offset(&ss))
> +			break;
> +	}

The return types here are a _bit_ wonky. The 'bool' returns make sense
for things like is_stage_complete(). But they don't look right for:

	if (!send_data_chunk(&ss))
		break;

where we'd typically use an -ERRNO and where 0 mean success. It would
look something like this:

	while (!staging_is_complete(&ss)) {
		err = send_data_chunk(&ss);
		if (err)
			break;

		err = fetch_next_offset(&ss);
		if (err)
			break;
	}

That's utterly unambiguous about the intent and what types the send and
fetch function _must_ have.

Note I also moved the can_send_next_chunk() call into
staging_is_complete(). I think that makes sense as well for the
top-level loop.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v4 5/6] x86/microcode/intel: Support mailbox transfer
  2025-08-13 17:26     ` [PATCH v4 5/6] x86/microcode/intel: Support mailbox transfer Chang S. Bae
@ 2025-08-13 19:07       ` Dave Hansen
  2025-08-22 22:40         ` [PATCH v4a " Chang S. Bae
  0 siblings, 1 reply; 69+ messages in thread
From: Dave Hansen @ 2025-08-13 19:07 UTC (permalink / raw)
  To: Chang S. Bae, x86
  Cc: tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao, abusse,
	linux-kernel

On 8/13/25 10:26, Chang S. Bae wrote:
> Previously, the functions for sending microcode data and retrieving the
> next offset were placeholders, as they required handling the specific
> mailbox format. Implement them as following:
> 
> == Mailbox Format ==
> 
> The staging mailbox consists of two primary sections: 'header' and
> 'data'. While the microcode must be transferred following this format,
> the actual data transfer mechanism involves reading and writing to
> specific MMIO registers.
> 
> == Mailbox Data Registers ==
> 
> Unlike conventional interfaces that allocate MMIO space for each data
> chunk, the staging interface features a "narrow" interface, using only
> two dword-sized registers for read and write operations.
> 
> For example, if writing 2 dwords of data to a device. Typically, the
> device would expose 2 dwords of "wide" MMIO space. To send the data to
> the device:
> 
> 	writel(buf[0], io_addr + 0);
> 	writel(buf[1], io_addr + 1);
> 
> But, this interface is a bit different. Instead of having a "wide"
> interface where there is separate MMIO space for each word in a
> transaction, it has a "narrow" interface where several words are written
> to the same spot in MMIO space:
> 
> 	writel(buf[0], io_addr);
> 	writel(buf[1], io_addr);
> 
> The same goes for the read side.
> 
> == Implementation Summary ==
> 
> Given that, introduce two layers of helper functions at first:
> 
>   * Low-level helpers for reading and writing to data registers directly.
>   * Wrapper functions for handling mailbox header and data sections.
> 
> Using them, implement send_data_chunk() and fetch_next_offset()
> functions. Add explicit error and timeout handling routine in
> wait_for_transaction(), finishing up the transfer.
> 
> Note: The kernel has support for similar mailboxes. But none of them are
> compatible with this one. Trying to share code resulted in a bloated
> mess, so this code is standalone.
> 
> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
> Tested-by: Anselm Busse <abusse@amazon.de>
> ---
> V2 -> V3:
> * Update code to reflect the removal of a global variable (Dave).
> 
> V1 -> V2:
> * Add lots of code comments and edit the changelog (Dave).
> * Encapsulate register read/write operations for processing header and
>   data sections.
> ---
>  arch/x86/kernel/cpu/microcode/intel.c | 150 ++++++++++++++++++++++++--
>  1 file changed, 144 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
> index 3eb3331c5012..5402bcb96f47 100644
> --- a/arch/x86/kernel/cpu/microcode/intel.c
> +++ b/arch/x86/kernel/cpu/microcode/intel.c
> @@ -22,6 +22,7 @@
>  #include <linux/mm.h>
>  #include <linux/delay.h>
>  #include <linux/io.h>
> +#include <linux/pci_ids.h>
>  
>  #include <asm/cpu_device_id.h>
>  #include <asm/processor.h>
> @@ -42,8 +43,30 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
>  
>  #define MBOX_CONTROL_OFFSET	0x0
>  #define MBOX_STATUS_OFFSET	0x4
> +#define MBOX_WRDATA_OFFSET	0x8
> +#define MBOX_RDDATA_OFFSET	0xc
>  
>  #define MASK_MBOX_CTRL_ABORT	BIT(0)
> +#define MASK_MBOX_CTRL_GO	BIT(31)
> +
> +#define MASK_MBOX_STATUS_ERROR	BIT(2)
> +#define MASK_MBOX_STATUS_READY	BIT(31)
> +
> +#define MASK_MBOX_RESP_SUCCESS	BIT(0)
> +#define MASK_MBOX_RESP_PROGRESS	BIT(1)
> +#define MASK_MBOX_RESP_ERROR	BIT(2)
> +
> +#define MBOX_CMD_LOAD		0x3
> +#define MBOX_OBJ_STAGING	0xb
> +#define DWORD_ALIGN(size)	((size) / sizeof(u32))
> +#define MBOX_HEADER(mbox_size)	(PCI_VENDOR_ID_INTEL | \
> +				 (MBOX_OBJ_STAGING << 16) | \
> +				 ((u64)DWORD_ALIGN(mbox_size) << 32))
> +
> +/* The size of each mailbox header */
> +#define MBOX_HEADER_SIZE	sizeof(u64)
> +/* The size of staging hardware response */
> +#define MBOX_RESPONSE_SIZE	sizeof(u64)
>  
>  /*
>   * Each microcode image is divided into chunks, each at most
> @@ -57,6 +80,7 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
>   */
>  #define MBOX_XACTION_SIZE	PAGE_SIZE
>  #define MBOX_XACTION_MAX(imgsz)	((imgsz) * 2)
> +#define MBOX_XACTION_TIMEOUT_MS	(10 * MSEC_PER_SEC)
>  
>  /* Current microcode patch used in early patching on the APs. */
>  static struct microcode_intel *ucode_patch_va __read_mostly;
> @@ -345,6 +369,49 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
>  	return size ? NULL : patch;
>  }
>  
> +static inline u32 read_mbox_dword(void __iomem *mmio_base)
> +{
> +	u32 dword = readl(mmio_base + MBOX_RDDATA_OFFSET);
> +
> +	/* Acknowledge read completion to the staging hardware */
> +	writel(0, mmio_base + MBOX_RDDATA_OFFSET);
> +	return dword;
> +}
> +
> +static inline void write_mbox_dword(void __iomem *mmio_base, u32 dword)
> +{
> +	writel(dword, mmio_base + MBOX_WRDATA_OFFSET);
> +}
> +
> +static inline u64 read_mbox_header(void __iomem *mmio_base)
> +{
> +	u32 high, low;
> +
> +	low  = read_mbox_dword(mmio_base);
> +	high = read_mbox_dword(mmio_base);
> +
> +	return ((u64)high << 32) | low;
> +}
> +
> +static inline void write_mbox_header(void __iomem *mmio_base, u64 value)
> +{
> +	write_mbox_dword(mmio_base, value);
> +	write_mbox_dword(mmio_base, value >> 32);
> +}
> +
> +static void write_mbox_data(void __iomem *mmio_base, u32 *chunk, unsigned int chunk_size)
> +{
> +	int i;
> +
> +	/*
> +	 * The MMIO space is mapped as Uncached (UC). Each write arrives
> +	 * at the device as an individual transaction in program order.
> +	 * The device can then resemble the sequence accordingly.
> +	 */
> +	for (i = 0; i < DWORD_ALIGN(chunk_size); i++)

I don't really like the DWORD_ALIGN(). To me, this is much more clear:

	for (i = 0; i < chunk_size / sizeof(u32); i++)

> +		write_mbox_dword(mmio_base, chunk[i]);
> +}

I'd _maybe_ make it "chunk_bytes" instead of "chunk_size", but that's
just a naming nit.

>  /*
>   * Prepare for a new microcode transfer by resetting hardware and
>   * configuring microcode image info.
> @@ -388,15 +455,71 @@ static bool can_send_next_chunk(struct staging_state *ss)
>  	return true;
>  }
>  
> +/*
> + * Wait for the hardware to complete a transaction.
> + * Return true on success, false on failure.
> + */
> +static bool wait_for_transaction(struct staging_state *ss)
> +{
> +	u32 timeout, status;
> +
> +	/* Allow time for hardware to complete the operation: */
> +	for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++) {
> +		msleep(1);
> +
> +		status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
> +		/* Break out early if the hardware is ready: */
> +		if (status & MASK_MBOX_STATUS_READY)
> +			break;
> +	}
> +
> +	status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);

Why is another read needed here? It had to go through the loop above at
_least_ once, right?


>  /*
>   * Transmit a chunk of the microcode image to the hardware.
>   * Return true if the chunk is processed successfully.
>   */
>  static bool send_data_chunk(struct staging_state *ss)
>  {
> -	pr_debug_once("Staging mailbox loading code needs to be implemented.\n");
> -	ss->state = UCODE_ERROR;
> -	return false;
> +	u16 mbox_size = MBOX_HEADER_SIZE * 2 + ss->chunk_size;

The "* 2" needs to be explained.

> +	u32 *chunk = ss->ucode_ptr + ss->offset;

I think calling this "src_chunk" would make a lot of sense.

> +	/*
> +	 * Write 'request' mailbox object in the following order:
> +	 * - Mailbox header includes total size
> +	 * - Command header specifies the load operation
> +	 * - Data section contains a microcode chunk
> +	 */
> +	write_mbox_header(ss->mmio_base, MBOX_HEADER(mbox_size));
> +	write_mbox_header(ss->mmio_base, MBOX_CMD_LOAD);
> +	write_mbox_data(ss->mmio_base, chunk, ss->chunk_size);
> +	ss->bytes_sent += ss->chunk_size;
> +
> +	/*
> +	 * Notify the hardware that the mailbox is ready for processing.
> +	 * The staging hardware will process the request asynchronously.
> +	 */
> +	writel(MASK_MBOX_CTRL_GO, ss->mmio_base + MBOX_CONTROL_OFFSET);
> +	return wait_for_transaction(ss);
>  }

Why is it important that "The staging hardware will process the request
asynchronously."? We *could* go off and do other things, but we don't.
We sit here and poll. So what does it matter?

>  
>  /*
> @@ -405,9 +528,24 @@ static bool send_data_chunk(struct staging_state *ss)
>   */
>  static bool fetch_next_offset(struct staging_state *ss)
>  {
> -	pr_debug_once("Staging mailbox response handling code needs to be implemented.\n\n");
> -	ss->state = UCODE_ERROR;
> -	return false;
> +	const u16 mbox_size = MBOX_HEADER_SIZE + MBOX_RESPONSE_SIZE;
> +
> +	/* All responses begin with the same header value: */
> +	WARN_ON_ONCE(read_mbox_header(ss->mmio_base) != MBOX_HEADER(mbox_size));

Again, this seems like a pr_warn() and then 'return false', not a "panic
the system" kind of error.

We should not be panic'ing the system in code that's purely an
optimization at this point.

It's also, IMNHO, bad form to do important actions inside of a WARN_ON()
condition. I'd much rather it be:

/*
 * All responses start with the same mailbox header. Consume
 * it and ensure that it has the expected contents.
 */
static inline int consume_mbox_header(ss)
{
	u32 header = read_mbox_header(ss->mmio_base);
	u32 expected_header = MBOX_HEADER(mbox_size));

	if (header != expected_header) {
		pr_warn_once(...);
		return -EINVAL;
	}

	return 0;
}	

The other way to do it is to do the three reads next to each other and
make it obvious that there are 3 dwords coming out of the device:

	u32 header    = read_mbox_dword(ss->mmio_base);
	u32 offset    = read_mbox_dword(ss->mmio_base);
	u32 err_word  = read_mbox_dword(ss->mmio_base);

... then go do all the checking after that.




^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature
  2025-08-13 17:26   ` [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
                       ` (5 preceding siblings ...)
  2025-08-13 17:26     ` [PATCH v4 6/6] x86/microcode/intel: Enable staging when available Chang S. Bae
@ 2025-08-13 19:08     ` Dave Hansen
  2025-08-23 15:52     ` [PATCH v5 0/7] " Chang S. Bae
  7 siblings, 0 replies; 69+ messages in thread
From: Dave Hansen @ 2025-08-13 19:08 UTC (permalink / raw)
  To: Chang S. Bae, x86
  Cc: tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao, abusse,
	linux-kernel

On 8/13/25 10:26, Chang S. Bae wrote:
> The use case for this feature is late-loading with increased load times,
> most likely in cloud service environments. I reached out to potential
> users for reviews and feedback, and Anselm Busse has confirmed his
> testing.

I'm glad someone tested this, but it would also be ideal to have some
code reviews from the folks that are actually going to use this. From
the lack of reviews, I'm going to assume that folks don't really want
this merged any time soon.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v4 2/6] x86/microcode/intel: Establish staging control logic
  2025-08-13 18:21       ` Dave Hansen
@ 2025-08-13 20:46         ` Chang S. Bae
  2025-08-13 20:55           ` Dave Hansen
  2025-08-22 22:39         ` [PATCH v4a 2/6] x86/microcode/intel: Establish staging control logic Chang S. Bae
  1 sibling, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-08-13 20:46 UTC (permalink / raw)
  To: Dave Hansen, x86
  Cc: tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao, abusse,
	linux-kernel

On 8/13/2025 11:21 AM, Dave Hansen wrote:
> On 8/13/25 10:26, Chang S. Bae wrote:
>>
>> Also, define cpu_primary_thread_mask for the CONFIG_SMP=n case, allowing
>> consistent use when narrowing down primary threads to locate the
>> per-package interface.
> ...
>>   static inline unsigned int topology_amd_nodes_per_pkg(void) { return 1; }
>> +#define cpu_primary_thread_mask cpu_none_mask
>>   #endif /* !CONFIG_SMP */
> 
> Isn't 'cpu_none_mask' a mask containing no CPUs? How can that possible
> work here:
> 
> 	for_each_cpu(cpu, cpu_primary_thread_mask) {
> 
> ? Wouldn't it just not run through the for loop at all on CONFIG_SMP=n?
> Is that what we want for some reason? I would have thought that we'd
> still want to find the MMIO address for CPU 0, the one and only CPU.

Yeah, right.

Then, looking at it again, I see this:

config MICROCODE_LATE_LOADING
	bool "Late microcode loading (DANGEROUS)"
	default n
	depends on MICROCODE && SMP

This optimization only applies to the late-loading path. But, today I 
also had to clarify this dependency for myself. At least, my changelog 
could've made it clearer, sorry.

>> +
>> +	pr_info("Staging of patch revision 0x%x succeeded.\n",
>> +		((struct microcode_header_intel *)ucode_patch_late)->rev);
>> +}
> Hmmm. Consider:
> 
> 	static struct microcode_intel *ucode_patch_late __read_mostly;
> 
> and:
> 
> 	struct microcode_intel {
> 	        struct microcode_header_intel   hdr;
> 	        unsigned int                    bits[];
> 	};
> 
> So isn't this whole ugly cast thing equivalent to:
> 
> 	ucode_patch_late->hdr.rev
> 
> ?

Indeed. I must have been blind to that bit of ugliness. Thanks for 
spotting on it!

Chang

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v4 2/6] x86/microcode/intel: Establish staging control logic
  2025-08-13 20:46         ` Chang S. Bae
@ 2025-08-13 20:55           ` Dave Hansen
  2025-08-14 18:30             ` Chang S. Bae
  2025-08-22 22:39             ` [PATCH] x86/cpu/topology: Make primary thread mask available with SMP=n Chang S. Bae
  0 siblings, 2 replies; 69+ messages in thread
From: Dave Hansen @ 2025-08-13 20:55 UTC (permalink / raw)
  To: Chang S. Bae, x86
  Cc: tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao, abusse,
	linux-kernel

On 8/13/25 13:46, Chang S. Bae wrote:
>> Isn't 'cpu_none_mask' a mask containing no CPUs? How can that possible
>> work here:
>>
>>     for_each_cpu(cpu, cpu_primary_thread_mask) {
>>
>> ? Wouldn't it just not run through the for loop at all on CONFIG_SMP=n?
>> Is that what we want for some reason? I would have thought that we'd
>> still want to find the MMIO address for CPU 0, the one and only CPU.
> 
> Yeah, right.
> 
> Then, looking at it again, I see this:
> 
> config MICROCODE_LATE_LOADING
>     bool "Late microcode loading (DANGEROUS)"
>     default n
>     depends on MICROCODE && SMP
> 
> This optimization only applies to the late-loading path. But, today I
> also had to clarify this dependency for myself. At least, my changelog
> could've made it clearer, sorry.

I'm not following.

I _think_ you're trying to say that it's a "no harm, no foul" situation
because this user of 'cpu_primary_thread_mask' won't even get compiled
in the buggy !SMP case.

But that's not the problem. The issue is that this line of code:

#define cpu_primary_thread_mask cpu_none_mask

reads as 100% bogus to me. Even on !SMP kernels,
'cpu_primary_thread_mask' should have one CPU in it. Right? The _one_
thread that's present is a primary thread. If this were a mask for
secondary threads, 'cpu_none_mask' would make sense. But it's not.

So could we please make use 'cpu_primary_thread_mask' is getting defined
correctly whether it's really getting compiled into the end image or not?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v4 2/6] x86/microcode/intel: Establish staging control logic
  2025-08-13 20:55           ` Dave Hansen
@ 2025-08-14 18:30             ` Chang S. Bae
  2025-08-22 22:39             ` [PATCH] x86/cpu/topology: Make primary thread mask available with SMP=n Chang S. Bae
  1 sibling, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-08-14 18:30 UTC (permalink / raw)
  To: Dave Hansen, x86
  Cc: tglx, mingo, bp, dave.hansen, colinmitchell, chao.gao, abusse,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2327 bytes --]

On 8/13/2025 1:55 PM, Dave Hansen wrote:
> 
> But that's not the problem. The issue is that this line of code:
> 
> #define cpu_primary_thread_mask cpu_none_mask

With CONFIG_SMP=n, on the core side (include/linux/cpu_smt.h), the code 
clarifies there is no SMT:

# define cpu_smt_control               (CPU_SMT_NOT_IMPLEMENTED)

This leads kernel/cpu.c to return an empty mask:

static inline const struct cpumask *cpuhp_get_primary_thread_mask(void)
{
	return cpu_none_mask;
}

On the x86 side, the definition is explicit that “primary threads” are 
SMT threads (arch/x86/kernel/smpboot.c):

/* CPUs which are the primary SMT threads */
struct cpumask __cpu_primary_thread_mask __read_mostly;

And via ifdeffery, this mask is only available to SMP kernels.

So it seems I had been subscribing this model -- no primary threads 
without SMP.

> reads as 100% bogus to me. Even on !SMP kernels,
> 'cpu_primary_thread_mask' should have one CPU in it. Right? The _one_
> thread that's present is a primary thread. If this were a mask for
> secondary threads, 'cpu_none_mask' would make sense. But it's not.

Your confidence made me take another look.

Digging into the history, I found that x86 used to have this in the !SMP 
case:

static inline bool topology_is_primary_thread(unsigned int cpu)
{
	return true;
}

That stayed until the recent commit 4b455f59945aa ("cpu/SMT: Provide a 
default topology_is_primary_thread()"), which now defines it in 
include/linux/topology.h with this telling comment:

/*
  * When disabling SMT, the primary thread of the SMT will remain
  * enabled/active. Architectures that have a special primary thread
  * (e.g. x86) need to override this function. ...
  */

This comment basically supports your point.

> So could we please make use 'cpu_primary_thread_mask' is getting defined
> correctly whether it's really getting compiled into the end image or not?

Given that, I’m thinking of simplifying the x86 side a bit -- by making 
the primary thread mask configured and available regardless of 
CONFIG_SMP, matching the behavior of other cpumasks. And its relevant 
helpers are also available, like in the attached diff.

I think the change still aligns x86 with the core code -- especially 
with the note in topology_is_primary_thread(). With that, the user may 
be introduced here.

[-- Attachment #2: tmp.diff --]
[-- Type: text/plain, Size: 2947 bytes --]

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 6c79ee7c0957..281252af6e9d 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -218,6 +218,12 @@ static inline unsigned int topology_amd_nodes_per_pkg(void)
 	return __amd_nodes_per_pkg;
 }
 
+#else /* CONFIG_SMP */
+static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
+static inline int topology_max_smt_threads(void) { return 1; }
+static inline unsigned int topology_amd_nodes_per_pkg(void) { return 1; }
+#endif /* !CONFIG_SMP */
+
 extern struct cpumask __cpu_primary_thread_mask;
 #define cpu_primary_thread_mask ((const struct cpumask *)&__cpu_primary_thread_mask)
 
@@ -231,12 +237,6 @@ static inline bool topology_is_primary_thread(unsigned int cpu)
 }
 #define topology_is_primary_thread topology_is_primary_thread
 
-#else /* CONFIG_SMP */
-static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
-static inline int topology_max_smt_threads(void) { return 1; }
-static inline unsigned int topology_amd_nodes_per_pkg(void) { return 1; }
-#endif /* !CONFIG_SMP */
-
 static inline void arch_fix_phys_package_id(int num, u32 slot)
 {
 }
diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
index e35ccdc84910..946004d7dd1d 100644
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -52,6 +52,9 @@ u32 cpuid_to_apicid[] __ro_after_init = { [0 ... NR_CPUS - 1] = BAD_APICID, };
 /* Bitmaps to mark registered APICs at each topology domain */
 static struct { DECLARE_BITMAP(map, MAX_LOCAL_APIC); } apic_maps[TOPO_MAX_DOMAIN] __ro_after_init;
 
+/* CPUs which are the primary SMT threads */
+struct cpumask __cpu_primary_thread_mask __read_mostly;
+
 /*
  * Keep track of assigned, disabled and rejected CPUs. Present assigned
  * with 1 as CPU #0 is reserved for the boot CPU.
@@ -75,15 +78,11 @@ bool arch_match_cpu_phys_id(int cpu, u64 phys_id)
 	return phys_id == (u64)cpuid_to_apicid[cpu];
 }
 
-#ifdef CONFIG_SMP
 static void cpu_mark_primary_thread(unsigned int cpu, unsigned int apicid)
 {
 	if (!(apicid & (__max_threads_per_core - 1)))
 		cpumask_set_cpu(cpu, &__cpu_primary_thread_mask);
 }
-#else
-static inline void cpu_mark_primary_thread(unsigned int cpu, unsigned int apicid) { }
-#endif
 
 /*
  * Convert the APIC ID to a domain level ID by masking out the low bits
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 33e166f6ab12..7804175d2d87 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -103,9 +103,6 @@ EXPORT_PER_CPU_SYMBOL(cpu_core_map);
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
 EXPORT_PER_CPU_SYMBOL(cpu_die_map);
 
-/* CPUs which are the primary SMT threads */
-struct cpumask __cpu_primary_thread_mask __read_mostly;
-
 /* Representing CPUs for which sibling maps can be computed */
 static cpumask_var_t cpu_sibling_setup_mask;
 

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v4 1/6] x86/microcode: Introduce staging step to reduce late-loading time
  2025-08-13 17:26     ` [PATCH v4 1/6] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
@ 2025-08-18  7:45       ` Chao Gao
  0 siblings, 0 replies; 69+ messages in thread
From: Chao Gao @ 2025-08-18  7:45 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, abusse,
	linux-kernel

>+	/*
>+	 * Pre-load the microcode image into a staging device. This
>+	 * process is preemptible and does not require stopping CPUs.
>+	 * Successful staging simplifies the subsequent late-loading
>+	 * process, reducing rendezvous time.
>+	 *
>+	 * Even if the transfer fails, the update will proceed as usual.
>+	 */
>+	if (microcode_ops->use_staging)
>+		microcode_ops->stage_microcode();

->use_staging is not necessary. we can directly check if ->stage_microcode()
is NULL, or even add a stub function and then call ->stage_microcode()
unconditionally here. But since this pattern occurs only once in the series,
I don't have a strong preference.

If my review counts for anything,

Reviewed-by: Chao Gao <chao.gao@intel.com>

>+
> 	atomic_set(&late_cpus_in, num_online_cpus());
> 	atomic_set(&offline_in_nmi, 0);
> 	loops_per_usec = loops_per_jiffy / (TICK_NSEC / 1000);
>diff --git a/arch/x86/kernel/cpu/microcode/internal.h b/arch/x86/kernel/cpu/microcode/internal.h
>index 50a9702ae4e2..adf02ebbf7a3 100644
>--- a/arch/x86/kernel/cpu/microcode/internal.h
>+++ b/arch/x86/kernel/cpu/microcode/internal.h
>@@ -31,10 +31,12 @@ struct microcode_ops {
> 	 * See also the "Synchronization" section in microcode_core.c.
> 	 */
> 	enum ucode_state	(*apply_microcode)(int cpu);
>+	void			(*stage_microcode)(void);
> 	int			(*collect_cpu_info)(int cpu, struct cpu_signature *csig);
> 	void			(*finalize_late_load)(int result);
> 	unsigned int		nmi_safe	: 1,
>-				use_nmi		: 1;
>+				use_nmi		: 1,
>+				use_staging	: 1;
> };
> 
> struct early_load_data {
>-- 
>2.48.1
>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v4 6/6] x86/microcode/intel: Enable staging when available
  2025-08-13 17:26     ` [PATCH v4 6/6] x86/microcode/intel: Enable staging when available Chang S. Bae
@ 2025-08-18  8:35       ` Chao Gao
  2025-08-22 22:42         ` Chang S. Bae
  0 siblings, 1 reply; 69+ messages in thread
From: Chao Gao @ 2025-08-18  8:35 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, abusse,
	linux-kernel

On Wed, Aug 13, 2025 at 10:26:49AM -0700, Chang S. Bae wrote:
>With the staging code being ready, check for staging availability by
>reading these following MSRs:
>
>  * IA32_ARCH_CAPABILITIES (bit 16) for the presence of
>    IA32_MCU_ENUMERATION
>
>  * IA32_MCU_ENUMERATION (bit 4) for the staging feature availability.
>
>When available, enable it by setting the feature bit.
>
>Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
>Tested-by: Anselm Busse <abusse@amazon.de>

Reviewed-by: Chao Gao <chao.gao@intel.com>

<snip>

>+static __init bool staging_available(void)
>+{
>+	u64 val;
>+
>+	val = x86_read_arch_cap_msr();
>+	if (!(val & ARCH_CAP_MCU_ENUM))
>+		return false;
>+
>+	rdmsrl(MSR_IA32_MCU_ENUMERATION, val);

nit: s/rdmsrl/rdmsrq

rdmsrl has been renamed to rdmsrq.

>+	return !!(val & MCU_STAGING);
>+}
>+

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH] x86/cpu/topology: Make primary thread mask available with SMP=n
  2025-08-13 20:55           ` Dave Hansen
  2025-08-14 18:30             ` Chang S. Bae
@ 2025-08-22 22:39             ` Chang S. Bae
  2025-08-23 16:05               ` Chang S. Bae
  1 sibling, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-08-22 22:39 UTC (permalink / raw)
  To: dave.hansen, x86
  Cc: tglx, mingo, bp, colinmitchell, chao.gao, abusse, chang.seok.bae,
	linux-kernel

cpu_primary_thread_mask is only defined when CONFIG_SMP=y. However, even
in UP kernels there is always exactly one CPU, which can reasonably be
treated as the primary thread.

Historically, topology_is_primary_thread() always returned true with
CONFIG_SMP=n. A recent commit:

  4b455f59945aa ("cpu/SMT: Provide a default topology_is_primary_thread()")

replaced it with a generic implementation with the note:

  "When disabling SMT, the primary thread of the SMT will remain
   enabled/active. Architectures that have a special primary thread (e.g.
   x86) need to override this function. ..."

For consistency and clarity, make the primary thread mask available
regardless of SMP, similar to cpu_possible_mask and cpu_present_mask.

Move __cpu_primary_thread_mask into common code to prevent build issues.
Let cpu_mark_primary_thread() configure the mask even for UP kernels,
alongside other masks. Then, topology_is_primary_thread() can
consistently reference it.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
This improvement was identified from feedback on a feature-enabling
series [*], where a user of this mask is introduced. It is posted here as
a standalone patch for clarity and self-containment. The next revision of
that series will depend on this.

[*] https://lore.kernel.org/lkml/20250813172649.15474-1-chang.seok.bae@intel.com/
---
 arch/x86/include/asm/topology.h       | 12 ++++++------
 arch/x86/kernel/cpu/topology.c        |  4 ----
 arch/x86/kernel/cpu/topology_common.c |  3 +++
 arch/x86/kernel/smpboot.c             |  3 ---
 4 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 6c79ee7c0957..281252af6e9d 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -218,6 +218,12 @@ static inline unsigned int topology_amd_nodes_per_pkg(void)
 	return __amd_nodes_per_pkg;
 }
 
+#else /* CONFIG_SMP */
+static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
+static inline int topology_max_smt_threads(void) { return 1; }
+static inline unsigned int topology_amd_nodes_per_pkg(void) { return 1; }
+#endif /* !CONFIG_SMP */
+
 extern struct cpumask __cpu_primary_thread_mask;
 #define cpu_primary_thread_mask ((const struct cpumask *)&__cpu_primary_thread_mask)
 
@@ -231,12 +237,6 @@ static inline bool topology_is_primary_thread(unsigned int cpu)
 }
 #define topology_is_primary_thread topology_is_primary_thread
 
-#else /* CONFIG_SMP */
-static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
-static inline int topology_max_smt_threads(void) { return 1; }
-static inline unsigned int topology_amd_nodes_per_pkg(void) { return 1; }
-#endif /* !CONFIG_SMP */
-
 static inline void arch_fix_phys_package_id(int num, u32 slot)
 {
 }
diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
index dcf05c64dd82..6a76caf813d6 100644
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -75,15 +75,11 @@ bool arch_match_cpu_phys_id(int cpu, u64 phys_id)
 	return phys_id == (u64)cpuid_to_apicid[cpu];
 }
 
-#ifdef CONFIG_SMP
 static void cpu_mark_primary_thread(unsigned int cpu, unsigned int apicid)
 {
 	if (!(apicid & (__max_threads_per_core - 1)))
 		cpumask_set_cpu(cpu, &__cpu_primary_thread_mask);
 }
-#else
-static inline void cpu_mark_primary_thread(unsigned int cpu, unsigned int apicid) { }
-#endif
 
 /*
  * Convert the APIC ID to a domain level ID by masking out the low bits
diff --git a/arch/x86/kernel/cpu/topology_common.c b/arch/x86/kernel/cpu/topology_common.c
index b5a5e1411469..71625795d711 100644
--- a/arch/x86/kernel/cpu/topology_common.c
+++ b/arch/x86/kernel/cpu/topology_common.c
@@ -16,6 +16,9 @@ EXPORT_SYMBOL_GPL(x86_topo_system);
 unsigned int __amd_nodes_per_pkg __ro_after_init;
 EXPORT_SYMBOL_GPL(__amd_nodes_per_pkg);
 
+/* CPUs which are the primary SMT threads */
+struct cpumask __cpu_primary_thread_mask __read_mostly;
+
 void topology_set_dom(struct topo_scan *tscan, enum x86_topology_domains dom,
 		      unsigned int shift, unsigned int ncpus)
 {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 33e166f6ab12..7804175d2d87 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -103,9 +103,6 @@ EXPORT_PER_CPU_SYMBOL(cpu_core_map);
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
 EXPORT_PER_CPU_SYMBOL(cpu_die_map);
 
-/* CPUs which are the primary SMT threads */
-struct cpumask __cpu_primary_thread_mask __read_mostly;
-
 /* Representing CPUs for which sibling maps can be computed */
 static cpumask_var_t cpu_sibling_setup_mask;
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v4a 2/6] x86/microcode/intel: Establish staging control logic
  2025-08-13 18:21       ` Dave Hansen
  2025-08-13 20:46         ` Chang S. Bae
@ 2025-08-22 22:39         ` Chang S. Bae
  2025-08-22 23:34           ` Dave Hansen
  1 sibling, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-08-22 22:39 UTC (permalink / raw)
  To: dave.hansen, x86
  Cc: tglx, mingo, bp, colinmitchell, chao.gao, abusse, chang.seok.bae,
	linux-kernel

When microcode staging is initiated, operations are carried out through
an MMIO interface. Each package has a unique interface specified by the
IA32_MCU_STAGING_MBOX_ADDR MSR, which maps to a set of 32-bit registers.

Prepare staging with the following steps:

  1.  Ensure the microcode image is 32-bit aligned to match the MMIO
      register size.

  2.  Identify each MMIO interface based on its per-package scope.

  3.  Invoke the staging function for each identified interface, which
      will be implemented separately.

Also, define cpu_primary_thread_mask for the CONFIG_SMP=n case, allowing
consistent use when narrowing down primary threads to locate the
per-package interface.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
Link: https://lore.kernel.org/all/871pznq229.ffs@tglx
---
V4 -> V4a:
* Rebase on the primary thread cpumask fix (Dave)
* Clean up the rev print code (Dave)
* rdmsrl_on_cpu() -> rdmsrq_on_cpu (Chao)
---
 arch/x86/include/asm/msr-index.h      |  2 ++
 arch/x86/kernel/cpu/microcode/intel.c | 49 +++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index b65c3ba5fa14..0356155f9264 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -913,6 +913,8 @@
 #define MSR_IA32_UCODE_WRITE		0x00000079
 #define MSR_IA32_UCODE_REV		0x0000008b
 
+#define MSR_IA32_MCU_STAGING_MBOX_ADDR	0x000007a5
+
 /* Intel SGX Launch Enclave Public Key Hash MSRs */
 #define MSR_IA32_SGXLEPUBKEYHASH0	0x0000008C
 #define MSR_IA32_SGXLEPUBKEYHASH1	0x0000008D
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 371ca6eac00e..d309fb1f058f 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -299,6 +299,54 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 	return size ? NULL : patch;
 }
 
+/*
+ * Handle the staging process using the mailbox MMIO interface.
+ * Return the result state.
+ */
+static enum ucode_state do_stage(u64 mmio_pa)
+{
+	pr_debug_once("Staging implementation is pending.\n");
+	return UCODE_ERROR;
+}
+
+static void stage_microcode(void)
+{
+	unsigned int pkg_id = UINT_MAX;
+	enum ucode_state ret;
+	int cpu, err;
+	u64 mmio_pa;
+
+	if (!IS_ALIGNED(get_totalsize(&ucode_patch_late->hdr), sizeof(u32)))
+		return;
+
+	lockdep_assert_cpus_held();
+
+	/*
+	 * The MMIO address is unique per package, and all the SMT
+	 * primary threads are online here. Find each MMIO space by
+	 * their package ids to avoid duplicate staging.
+	 */
+	for_each_cpu(cpu, cpu_primary_thread_mask) {
+		if (topology_logical_package_id(cpu) == pkg_id)
+			continue;
+		pkg_id = topology_logical_package_id(cpu);
+
+		err = rdmsrq_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &mmio_pa);
+		if (WARN_ON_ONCE(err))
+			return;
+
+		ret = do_stage(mmio_pa);
+		if (ret != UCODE_OK) {
+			pr_err("Error: staging failed with %s for CPU%d at package %u.\n",
+			       ret == UCODE_TIMEOUT ? "timeout" : "error state",
+			       cpu, pkg_id);
+			return;
+		}
+	}
+
+	pr_info("Staging of patch revision 0x%x succeeded.\n", ucode_patch_late->hdr.rev);
+}
+
 static enum ucode_state __apply_microcode(struct ucode_cpu_info *uci,
 					  struct microcode_intel *mc,
 					  u32 *cur_rev)
@@ -627,6 +675,7 @@ static struct microcode_ops microcode_intel_ops = {
 	.collect_cpu_info	= collect_cpu_info,
 	.apply_microcode	= apply_microcode_late,
 	.finalize_late_load	= finalize_late_load,
+	.stage_microcode	= stage_microcode,
 	.use_nmi		= IS_ENABLED(CONFIG_X86_64),
 };
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v4a 3/6] x86/microcode/intel: Define staging state struct
  2025-08-13 18:25       ` Dave Hansen
@ 2025-08-22 22:39         ` Chang S. Bae
  0 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-08-22 22:39 UTC (permalink / raw)
  To: dave.hansen, x86
  Cc: tglx, mingo, bp, colinmitchell, chao.gao, abusse, chang.seok.bae,
	linux-kernel

To facilitate a structured staging handler, a set of functions will be
introduced. Define staging_state struct to simplify function prototypes
by consolidating relevant data, instead of passing multiple local
variables.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
---
V4 -> V4a: Drop the ucode_ptr field (Dave)
---
 arch/x86/kernel/cpu/microcode/intel.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index d309fb1f058f..3ca22457d839 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -54,6 +54,25 @@ struct extended_sigtable {
 	struct extended_signature	sigs[];
 };
 
+/**
+ * struct staging_state - Tracks the current staging process state
+ *
+ * @mmio_base:		MMIO base address for staging
+ * @ucode_len:		Total size of the microcode image
+ * @chunk_size:		Size of each data piece
+ * @bytes_sent:		Total bytes transmitted so far
+ * @offset:		Current offset in the microcode image
+ * @state:		Current state of the staging process
+ */
+struct staging_state {
+	void __iomem		*mmio_base;
+	unsigned int		ucode_len;
+	unsigned int		chunk_size;
+	unsigned int		bytes_sent;
+	unsigned int		offset;
+	enum ucode_state	state;
+};
+
 #define DEFAULT_UCODE_TOTALSIZE (DEFAULT_UCODE_DATASIZE + MC_HEADER_SIZE)
 #define EXT_HEADER_SIZE		(sizeof(struct extended_sigtable))
 #define EXT_SIGNATURE_SIZE	(sizeof(struct extended_signature))
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v4a 4/6] x86/microcode/intel: Implement staging handler
  2025-08-13 18:44       ` Dave Hansen
@ 2025-08-22 22:39         ` Chang S. Bae
  0 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-08-22 22:39 UTC (permalink / raw)
  To: dave.hansen, x86
  Cc: tglx, mingo, bp, colinmitchell, chao.gao, abusse, chang.seok.bae,
	linux-kernel

Previously, per-package staging invocations and their associated state
data were established. The next step is to implement the actual staging
handler according to the specified protocol. Below are key aspects to
note:

  (a)  Each staging process must begin by resetting the staging hardware.

  (b)  The staging hardware processes up to a page-sized chunk of the
       microcode image per iteration, requiring software to submit data
       incrementally.

  (c)  Once a data chunk is processed, the hardware responds with an
       offset in the image for the next chunk.

  (d) The offset may indicate completion or request retransmission of an
       already transferred chunk. As long as the total transferred data
       remains within the predefined limit (twice the image size),
       retransmissions should be acceptable.

With that, incorporate these code sequences to the staging handler:

  1.  Initialization: Map the MMIO space via ioremap(). Reset the staging
      hardware and initialize software state, ensuring a fresh staging
      process aligned with (a).

  2.  Processing Loop: Introduce a loop iterating over data chunk,
      following (b), with proper termination conditions established from
      (d) -- stop staging when the hardware signals completion, or if the
      total transmitted data exceeds the predefined limit.

  3.  Loop Body: Finally, compose the loop body with two steps --
      transmitting a data chunk and retrieving the next offset from the
      hardware response, aligning with (b) and (c).

Since data transmission and mailbox format handling require additional
details, they are implemented separately in next changes.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
---
V4 -> V4a:
* Convert helper functions to return error codes (Dave)
* Consolidate loop-control logic
* Refactor next-chunk calculation/check for clarity
* Remove offset sanity check (moved to next patch)
---
 arch/x86/kernel/cpu/microcode/intel.c | 137 +++++++++++++++++++++++++-
 1 file changed, 133 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 3ca22457d839..a1b13202330d 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -20,6 +20,8 @@
 #include <linux/cpu.h>
 #include <linux/uio.h>
 #include <linux/mm.h>
+#include <linux/delay.h>
+#include <linux/io.h>
 
 #include <asm/cpu_device_id.h>
 #include <asm/processor.h>
@@ -33,6 +35,16 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
 
 #define UCODE_BSP_LOADED	((struct microcode_intel *)0x1UL)
 
+/* Defines for the microcode staging mailbox interface */
+
+#define MBOX_REG_NUM		4
+#define MBOX_REG_SIZE		sizeof(u32)
+
+#define MBOX_CONTROL_OFFSET	0x0
+#define MBOX_STATUS_OFFSET	0x4
+
+#define MASK_MBOX_CTRL_ABORT	BIT(0)
+
 /* Current microcode patch used in early patching on the APs. */
 static struct microcode_intel *ucode_patch_va __read_mostly;
 static struct microcode_intel *ucode_patch_late __read_mostly;
@@ -319,13 +331,130 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 }
 
 /*
- * Handle the staging process using the mailbox MMIO interface.
- * Return the result state.
+ * Prepare for a new microcode transfer: reset hardware and record the
+ * image size.
+ */
+static void init_stage(struct staging_state *ss)
+{
+	ss->ucode_len = get_totalsize(&ucode_patch_late->hdr);
+
+	/*
+	 * Abort any ongoing process, effectively resetting the device.
+	 * Unlike regular mailbox data processing requests, this
+	 * operation does not require a status check.
+	 */
+	writel(MASK_MBOX_CTRL_ABORT, ss->mmio_base + MBOX_CONTROL_OFFSET);
+}
+
+/*
+ * Return PAGE_SIZE, or remaining bytes if this is the final chunk
+ */
+static inline unsigned int calc_next_chunk_size(unsigned int ucode_len, unsigned int offset)
+{
+	return min(PAGE_SIZE, ucode_len - offset);
+}
+
+/*
+ * Update the chunk size and decide whether another chunk can be sent.
+ * This accounts for remaining data and retry limits.
+ */
+static bool can_send_next_chunk(struct staging_state *ss)
+{
+	ss->chunk_size = calc_next_chunk_size(ss->ucode_len, ss->offset);
+	/*
+	 * Each microcode image is divided into chunks, each at most
+	 * one page size. A 10-chunk  image would typically require 10
+	 * transactions.
+	 *
+	 * However, the hardware managing the mailbox has limited
+	 * resources and may not cache the entire image, potentially
+	 * requesting the same chunk multiple times.
+	 *
+	 * To tolerate this behavior, allow up to twice the expected
+	 * number of transactions (i.e., a 10-chunk image can take up to
+	 * 20 attempts).
+	 *
+	 * If the number of attempts exceeds this limit, the hardware is
+	 * likely stuck and mark the state as timeout.
+	 */
+	if (ss->bytes_sent + ss->chunk_size > ss->ucode_len * 2) {
+		ss->state = UCODE_TIMEOUT;
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Determine whether staging is complete: either the hardware signaled
+ * the end offset, or no more transactions are permitted (retry limit
+ * reached).
+ */
+static inline bool staging_is_complete(struct staging_state *ss)
+{
+	return (ss->offset == UINT_MAX) || !can_send_next_chunk(ss);
+}
+
+/*
+ * Transmit a chunk of the microcode image to the hardware.
+ * Return 0 on success, or an error code on failure.
+ */
+static int send_data_chunk(struct staging_state *ss, void *ucode_ptr __maybe_unused)
+{
+	pr_debug_once("Staging mailbox loading code needs to be implemented.\n");
+	ss->state = UCODE_ERROR;
+	return -EPROTONOSUPPORT;
+}
+
+/*
+ * Retrieve the next offset from the hardware response.
+ * Return 0 on success, or an error code on failure.
+ */
+static int fetch_next_offset(struct staging_state *ss)
+{
+	pr_debug_once("Staging mailbox response handling code needs to be implemented.\n\n");
+	ss->state = UCODE_ERROR;
+	return -EPROTONOSUPPORT;
+}
+
+/*
+ * Handle the staging process using the mailbox MMIO interface. The
+ * microcode image is transferred in chunks until completion. Return the
+ * result state.
  */
 static enum ucode_state do_stage(u64 mmio_pa)
 {
-	pr_debug_once("Staging implementation is pending.\n");
-	return UCODE_ERROR;
+	struct staging_state ss = {};
+	int err;
+
+	ss.mmio_base = ioremap(mmio_pa, MBOX_REG_NUM * MBOX_REG_SIZE);
+	if (WARN_ON_ONCE(!ss.mmio_base))
+		return UCODE_ERROR;
+
+	init_stage(&ss);
+
+	/* Perform the staging process while within the retry limit */
+	while (!staging_is_complete(&ss)) {
+		/* Send a chunk of microcode each time: */
+		err = send_data_chunk(&ss, ucode_patch_late);
+		if (err)
+			break;
+		/*
+		 * Then, ask the hardware which piece of the image it
+		 * needs next. The same piece may be sent more than once.
+		 */
+		err = fetch_next_offset(&ss);
+		if (err)
+			break;
+	}
+
+	iounmap(ss.mmio_base);
+
+	/*
+	 * The helpers update ss.state on error. The final state is
+	 * returned to the caller.
+	 */
+	return ss.state;
 }
 
 static void stage_microcode(void)
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v4a 5/6] x86/microcode/intel: Support mailbox transfer
  2025-08-13 19:07       ` Dave Hansen
@ 2025-08-22 22:40         ` Chang S. Bae
  0 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-08-22 22:40 UTC (permalink / raw)
  To: dave.hansen, x86
  Cc: tglx, mingo, bp, colinmitchell, chao.gao, abusse, chang.seok.bae,
	linux-kernel

Previously, the functions for sending microcode data and retrieving the
next offset were placeholders, as they required handling the specific
mailbox format. Implement them as following:

== Mailbox Format ==

The staging mailbox consists of two primary sections: 'header' and
'data'. While the microcode must be transferred following this format,
the actual data transfer mechanism involves reading and writing to
specific MMIO registers.

== Mailbox Data Registers ==

Unlike conventional interfaces that allocate MMIO space for each data
chunk, the staging interface features a "narrow" interface, using only
two dword-sized registers for read and write operations.

For example, if writing 2 dwords of data to a device. Typically, the
device would expose 2 dwords of "wide" MMIO space. To send the data to
the device:

	writel(buf[0], io_addr + 0);
	writel(buf[1], io_addr + 1);

But, this interface is a bit different. Instead of having a "wide"
interface where there is separate MMIO space for each word in a
transaction, it has a "narrow" interface where several words are written
to the same spot in MMIO space:

	writel(buf[0], io_addr);
	writel(buf[1], io_addr);

The same goes for the read side.

== Implementation Summary ==

Given that, introduce two layers of helper functions at first:

  * Low-level helpers for reading and writing to data registers directly.
  * Wrapper functions for handling mailbox header and data sections.

Using them, implement send_data_chunk() and fetch_next_offset()
functions. Add explicit error and timeout handling routine in
wait_for_transaction(), finishing up the transfer.

Both hardware error states and implicit errors -- invalid header or
offset -- result in UCODE_ERROR. Emit a clear message for the latter.

Note: The kernel has support for similar mailboxes. But none of them are
compatible with this one. Trying to share code resulted in a bloated
mess, so this code is standalone.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
---
V4 -> V4a: Addressed Dave's feedback
* fetch_next_offset():
  - Make dword reads explicit
  - Consolidate offset validation -- adding another user for the
    end-offset checker
  - Convert WARN_* with pr_err_once()
* Simplify transaction waiting logic a bit
---
 arch/x86/kernel/cpu/microcode/intel.c | 186 +++++++++++++++++++++++++-
 1 file changed, 179 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index a1b13202330d..f6b365eba6a2 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -22,6 +22,7 @@
 #include <linux/mm.h>
 #include <linux/delay.h>
 #include <linux/io.h>
+#include <linux/pci_ids.h>
 
 #include <asm/cpu_device_id.h>
 #include <asm/processor.h>
@@ -42,8 +43,31 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
 
 #define MBOX_CONTROL_OFFSET	0x0
 #define MBOX_STATUS_OFFSET	0x4
+#define MBOX_WRDATA_OFFSET	0x8
+#define MBOX_RDDATA_OFFSET	0xc
 
 #define MASK_MBOX_CTRL_ABORT	BIT(0)
+#define MASK_MBOX_CTRL_GO	BIT(31)
+
+#define MASK_MBOX_STATUS_ERROR	BIT(2)
+#define MASK_MBOX_STATUS_READY	BIT(31)
+
+#define MASK_MBOX_RESP_SUCCESS	BIT(0)
+#define MASK_MBOX_RESP_PROGRESS	BIT(1)
+#define MASK_MBOX_RESP_ERROR	BIT(2)
+
+#define MBOX_CMD_LOAD		0x3
+#define MBOX_OBJ_STAGING	0xb
+#define MBOX_HEADER(size)	((PCI_VENDOR_ID_INTEL)    | \
+				 (MBOX_OBJ_STAGING << 16) | \
+				 ((u64)((size) / sizeof(u32)) << 32))
+
+/* The size of each mailbox header */
+#define MBOX_HEADER_SIZE	sizeof(u64)
+/* The size of staging hardware response */
+#define MBOX_RESPONSE_SIZE	sizeof(u64)
+
+#define MBOX_XACTION_TIMEOUT_MS	(10 * MSEC_PER_SEC)
 
 /* Current microcode patch used in early patching on the APs. */
 static struct microcode_intel *ucode_patch_va __read_mostly;
@@ -330,6 +354,49 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 	return size ? NULL : patch;
 }
 
+static inline u32 read_mbox_dword(void __iomem *mmio_base)
+{
+	u32 dword = readl(mmio_base + MBOX_RDDATA_OFFSET);
+
+	/* Acknowledge read completion to the staging hardware */
+	writel(0, mmio_base + MBOX_RDDATA_OFFSET);
+	return dword;
+}
+
+static inline void write_mbox_dword(void __iomem *mmio_base, u32 dword)
+{
+	writel(dword, mmio_base + MBOX_WRDATA_OFFSET);
+}
+
+static inline u64 read_mbox_header(void __iomem *mmio_base)
+{
+	u32 high, low;
+
+	low  = read_mbox_dword(mmio_base);
+	high = read_mbox_dword(mmio_base);
+
+	return ((u64)high << 32) | low;
+}
+
+static inline void write_mbox_header(void __iomem *mmio_base, u64 value)
+{
+	write_mbox_dword(mmio_base, value);
+	write_mbox_dword(mmio_base, value >> 32);
+}
+
+static void write_mbox_data(void __iomem *mmio_base, u32 *chunk, unsigned int chunk_bytes)
+{
+	int i;
+
+	/*
+	 * The MMIO space is mapped as Uncached (UC). Each write arrives
+	 * at the device as an individual transaction in program order.
+	 * The device can then resemble the sequence accordingly.
+	 */
+	for (i = 0; i < chunk_bytes / sizeof(u32); i++)
+		write_mbox_dword(mmio_base, chunk[i]);
+}
+
 /*
  * Prepare for a new microcode transfer: reset hardware and record the
  * image size.
@@ -385,6 +452,14 @@ static bool can_send_next_chunk(struct staging_state *ss)
 	return true;
 }
 
+/*
+ * The hardware indicates completion by returning a sentinel end offset
+ */
+static inline bool is_end_offset(u32 offset)
+{
+	return offset == UINT_MAX;
+}
+
 /*
  * Determine whether staging is complete: either the hardware signaled
  * the end offset, or no more transactions are permitted (retry limit
@@ -392,18 +467,73 @@ static bool can_send_next_chunk(struct staging_state *ss)
  */
 static inline bool staging_is_complete(struct staging_state *ss)
 {
-	return (ss->offset == UINT_MAX) || !can_send_next_chunk(ss);
+	return is_end_offset(ss->offset) || !can_send_next_chunk(ss);
+}
+
+/*
+ * Wait for the hardware to complete a transaction.
+ * Return 0 on success, or an error code on failure.
+ */
+static int wait_for_transaction(struct staging_state *ss)
+{
+	u32 timeout, status;
+
+	/* Allow time for hardware to complete the operation: */
+	for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++) {
+		msleep(1);
+
+		status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
+		/* Break out early if the hardware is ready: */
+		if (status & MASK_MBOX_STATUS_READY)
+			break;
+	}
+
+	/* Check for explicit error response */
+	if (status & MASK_MBOX_STATUS_ERROR) {
+		ss->state = UCODE_ERROR;
+		return -EPROTO;
+	}
+
+	/*
+	 * Hardware is neither responded to the action nor signaled any
+	 * error. Treat this as timeout.
+	 */
+	if (!(status & MASK_MBOX_STATUS_READY)) {
+		ss->state = UCODE_TIMEOUT;
+		return -ETIMEDOUT;
+	}
+
+	ss->state = UCODE_OK;
+	return 0;
 }
 
 /*
  * Transmit a chunk of the microcode image to the hardware.
  * Return 0 on success, or an error code on failure.
  */
-static int send_data_chunk(struct staging_state *ss, void *ucode_ptr __maybe_unused)
+static int send_data_chunk(struct staging_state *ss, void *ucode_ptr)
 {
-	pr_debug_once("Staging mailbox loading code needs to be implemented.\n");
-	ss->state = UCODE_ERROR;
-	return -EPROTONOSUPPORT;
+	u32 *src_chunk = ucode_ptr + ss->offset;
+	u16 mbox_size;
+
+	/*
+	 * Write a 'request' mailbox object in this order:
+	 *  1. Mailbox header includes total size
+	 *  2. Command header specifies the load operation
+	 *  3. Data section contains a microcode chunk
+	 *
+	 * Thus, the mailbox size is two headers plus the chunk size.
+	 */
+	mbox_size = MBOX_HEADER_SIZE * 2 + ss->chunk_size;
+	write_mbox_header(ss->mmio_base, MBOX_HEADER(mbox_size));
+	write_mbox_header(ss->mmio_base, MBOX_CMD_LOAD);
+	write_mbox_data(ss->mmio_base, src_chunk, ss->chunk_size);
+	ss->bytes_sent += ss->chunk_size;
+
+	/* Notify the hardware that the mailbox is ready for processing. */
+	writel(MASK_MBOX_CTRL_GO, ss->mmio_base + MBOX_CONTROL_OFFSET);
+
+	return wait_for_transaction(ss);
 }
 
 /*
@@ -412,9 +542,51 @@ static int send_data_chunk(struct staging_state *ss, void *ucode_ptr __maybe_unu
  */
 static int fetch_next_offset(struct staging_state *ss)
 {
-	pr_debug_once("Staging mailbox response handling code needs to be implemented.\n\n");
+	const u64 expected_header = MBOX_HEADER(MBOX_HEADER_SIZE + MBOX_RESPONSE_SIZE);
+	u32 offset, status;
+	u64 header;
+	int err;
+
+	/*
+	 * The 'response' mailbox returns three fields, in order:
+	 *  1. Header
+	 *  2. Next offset in the microcode image
+	 *  3. Status flags
+	 */
+	header = read_mbox_header(ss->mmio_base);
+	offset = read_mbox_dword(ss->mmio_base);
+	status = read_mbox_dword(ss->mmio_base);
+
+	/* All valid responses must start with the expected header. */
+	if (header != expected_header) {
+		pr_err_once("staging: invalid response header\n");
+		err = -EINVAL;
+		goto err_out;
+	}
+
+	/*
+	 * Verify the offset: If not at the end marker, it must not
+	 * exceed the microcode image length
+	 */
+	if (!is_end_offset(offset) && offset > ss->ucode_len) {
+		pr_err_once("staging: invalid response offset\n");
+		err = -EINVAL;
+		goto err_out;
+	}
+
+	/* Hardware may report errors explicitly in the status field */
+	if (status & MASK_MBOX_RESP_ERROR) {
+		err = -EPROTO;
+		goto err_out;
+	}
+
+	ss->offset = offset;
+	ss->state  = UCODE_OK;
+	return 0;
+
+err_out:
 	ss->state = UCODE_ERROR;
-	return -EPROTONOSUPPORT;
+	return err;
 }
 
 /*
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v4 6/6] x86/microcode/intel: Enable staging when available
  2025-08-18  8:35       ` Chao Gao
@ 2025-08-22 22:42         ` Chang S. Bae
  0 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-08-22 22:42 UTC (permalink / raw)
  To: Chao Gao
  Cc: x86, tglx, mingo, bp, dave.hansen, colinmitchell, abusse,
	linux-kernel

On 8/18/2025 1:35 AM, Chao Gao wrote:
> 
> Reviewed-by: Chao Gao <chao.gao@intel.com>

Thanks.

>> +	rdmsrl(MSR_IA32_MCU_ENUMERATION, val);
> 
> nit: s/rdmsrl/rdmsrq
> 
> rdmsrl has been renamed to rdmsrq.

Yeah, right. Thanks for the catch!

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v4a 2/6] x86/microcode/intel: Establish staging control logic
  2025-08-22 22:39         ` [PATCH v4a 2/6] x86/microcode/intel: Establish staging control logic Chang S. Bae
@ 2025-08-22 23:34           ` Dave Hansen
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Hansen @ 2025-08-22 23:34 UTC (permalink / raw)
  To: Chang S. Bae, dave.hansen, x86
  Cc: tglx, mingo, bp, colinmitchell, chao.gao, abusse, linux-kernel

What's with the "v4a" stuff? Is this an attempt to torture test b4? ;)

Seriously, though, it's been more than a week since the last series.
Just send a new series, please.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH v5 0/7] x86: Support for Intel Microcode Staging Feature
  2025-08-13 17:26   ` [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
                       ` (6 preceding siblings ...)
  2025-08-13 19:08     ` [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature Dave Hansen
@ 2025-08-23 15:52     ` Chang S. Bae
  2025-08-23 15:52       ` [PATCH v5 1/7] x86/cpu/topology: Make primary thread mask available with SMP=n Chang S. Bae
                         ` (7 more replies)
  7 siblings, 8 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-08-23 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, chao.gao, abusse,
	chang.seok.bae

Hi all,

This is another iteration -- changes since v4 [*]:

  * Reworked the preparatory change before referencing
    cpu_primary_thread_mask, based on Dave’s feedback. This is now patch1.

  * Incorporated further feedback from Dave on the staging code, mainly
    to improve clarity, reduce ambiguity, and fix minor issues. Each
    patch includes some details.

  * Collected Chao’s review tag (thanks!) for the first and last patches.

As usual, the series is also available here:
  git://github.com/intel-staging/microcode.git staging_v5

[*] https://lore.kernel.org/lkml/20250813172649.15474-1-chang.seok.bae@intel.com/

Thanks,
Chang

Chang S. Bae (7):
  x86/cpu/topology: Make primary thread mask available with SMP=n
  x86/microcode: Introduce staging step to reduce late-loading time
  x86/microcode/intel: Establish staging control logic
  x86/microcode/intel: Define staging state struct
  x86/microcode/intel: Implement staging handler
  x86/microcode/intel: Support mailbox transfer
  x86/microcode/intel: Enable staging when available

 arch/x86/include/asm/msr-index.h         |   9 +
 arch/x86/include/asm/topology.h          |  12 +-
 arch/x86/kernel/cpu/microcode/core.c     |  11 +
 arch/x86/kernel/cpu/microcode/intel.c    | 386 +++++++++++++++++++++++
 arch/x86/kernel/cpu/microcode/internal.h |   4 +-
 arch/x86/kernel/cpu/topology.c           |   4 -
 arch/x86/kernel/cpu/topology_common.c    |   3 +
 arch/x86/kernel/smpboot.c                |   3 -
 8 files changed, 418 insertions(+), 14 deletions(-)


base-commit: 7182bf4176f93be42225d2ef983894febfa4a1b1
-- 
2.48.1


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH v5 1/7] x86/cpu/topology: Make primary thread mask available with SMP=n
  2025-08-23 15:52     ` [PATCH v5 0/7] " Chang S. Bae
@ 2025-08-23 15:52       ` Chang S. Bae
  2025-08-23 15:52       ` [PATCH v5 2/7] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
                         ` (6 subsequent siblings)
  7 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-08-23 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, chao.gao, abusse,
	chang.seok.bae

cpu_primary_thread_mask is only defined when CONFIG_SMP=y. However, even
in UP kernels there is always exactly one CPU, which can reasonably be
treated as the primary thread.

Historically, topology_is_primary_thread() always returned true with
CONFIG_SMP=n. A recent commit:

  4b455f59945aa ("cpu/SMT: Provide a default topology_is_primary_thread()")

replaced it with a generic implementation with the note:

  "When disabling SMT, the primary thread of the SMT will remain
   enabled/active. Architectures that have a special primary thread (e.g.
   x86) need to override this function. ..."

For consistency and clarity, make the primary thread mask available
regardless of SMP, similar to cpu_possible_mask and cpu_present_mask.

Move __cpu_primary_thread_mask into common code to prevent build issues.
Let cpu_mark_primary_thread() configure the mask even for UP kernels,
alongside other masks. Then, topology_is_primary_thread() can
consistently reference it.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
V4 -> V5: New patch

Preparatory patch to set up the mask correctly before its new use in
patch3.
---
 arch/x86/include/asm/topology.h       | 12 ++++++------
 arch/x86/kernel/cpu/topology.c        |  4 ----
 arch/x86/kernel/cpu/topology_common.c |  3 +++
 arch/x86/kernel/smpboot.c             |  3 ---
 4 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 6c79ee7c0957..281252af6e9d 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -218,6 +218,12 @@ static inline unsigned int topology_amd_nodes_per_pkg(void)
 	return __amd_nodes_per_pkg;
 }
 
+#else /* CONFIG_SMP */
+static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
+static inline int topology_max_smt_threads(void) { return 1; }
+static inline unsigned int topology_amd_nodes_per_pkg(void) { return 1; }
+#endif /* !CONFIG_SMP */
+
 extern struct cpumask __cpu_primary_thread_mask;
 #define cpu_primary_thread_mask ((const struct cpumask *)&__cpu_primary_thread_mask)
 
@@ -231,12 +237,6 @@ static inline bool topology_is_primary_thread(unsigned int cpu)
 }
 #define topology_is_primary_thread topology_is_primary_thread
 
-#else /* CONFIG_SMP */
-static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
-static inline int topology_max_smt_threads(void) { return 1; }
-static inline unsigned int topology_amd_nodes_per_pkg(void) { return 1; }
-#endif /* !CONFIG_SMP */
-
 static inline void arch_fix_phys_package_id(int num, u32 slot)
 {
 }
diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
index e35ccdc84910..f083023f7dd9 100644
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -75,15 +75,11 @@ bool arch_match_cpu_phys_id(int cpu, u64 phys_id)
 	return phys_id == (u64)cpuid_to_apicid[cpu];
 }
 
-#ifdef CONFIG_SMP
 static void cpu_mark_primary_thread(unsigned int cpu, unsigned int apicid)
 {
 	if (!(apicid & (__max_threads_per_core - 1)))
 		cpumask_set_cpu(cpu, &__cpu_primary_thread_mask);
 }
-#else
-static inline void cpu_mark_primary_thread(unsigned int cpu, unsigned int apicid) { }
-#endif
 
 /*
  * Convert the APIC ID to a domain level ID by masking out the low bits
diff --git a/arch/x86/kernel/cpu/topology_common.c b/arch/x86/kernel/cpu/topology_common.c
index b5a5e1411469..71625795d711 100644
--- a/arch/x86/kernel/cpu/topology_common.c
+++ b/arch/x86/kernel/cpu/topology_common.c
@@ -16,6 +16,9 @@ EXPORT_SYMBOL_GPL(x86_topo_system);
 unsigned int __amd_nodes_per_pkg __ro_after_init;
 EXPORT_SYMBOL_GPL(__amd_nodes_per_pkg);
 
+/* CPUs which are the primary SMT threads */
+struct cpumask __cpu_primary_thread_mask __read_mostly;
+
 void topology_set_dom(struct topo_scan *tscan, enum x86_topology_domains dom,
 		      unsigned int shift, unsigned int ncpus)
 {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 33e166f6ab12..7804175d2d87 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -103,9 +103,6 @@ EXPORT_PER_CPU_SYMBOL(cpu_core_map);
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
 EXPORT_PER_CPU_SYMBOL(cpu_die_map);
 
-/* CPUs which are the primary SMT threads */
-struct cpumask __cpu_primary_thread_mask __read_mostly;
-
 /* Representing CPUs for which sibling maps can be computed */
 static cpumask_var_t cpu_sibling_setup_mask;
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v5 2/7] x86/microcode: Introduce staging step to reduce late-loading time
  2025-08-23 15:52     ` [PATCH v5 0/7] " Chang S. Bae
  2025-08-23 15:52       ` [PATCH v5 1/7] x86/cpu/topology: Make primary thread mask available with SMP=n Chang S. Bae
@ 2025-08-23 15:52       ` Chang S. Bae
  2025-09-04 12:08         ` Borislav Petkov
  2025-08-23 15:52       ` [PATCH v5 3/7] x86/microcode/intel: Establish staging control logic Chang S. Bae
                         ` (5 subsequent siblings)
  7 siblings, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-08-23 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, chao.gao, abusse,
	chang.seok.bae

As microcode patch sizes continue to grow, late-loading latency spikes
can lead to timeouts and disruptions in running workloads. This trend of
increasing patch sizes is expected to continue, so a foundational
solution is needed to address the issue.

To mitigate the problem, a new staging feature is introduced. This option
processes most of the microcode update (excluding activation) on a
non-critical path, allowing CPUs to remain operational during the
majority of the update. By offloading work from the critical path,
staging can significantly reduces latency spikes.

Integrate staging as a preparatory step in late-loading. Introduce a new
callback for staging, which is invoked at the beginning of
load_late_stop_cpus(), before CPUs enter the rendezvous phase.

Staging follows an opportunistic model:

  *  If successful, it reduces CPU rendezvous time
  *  Even though it fails, the process falls back to the legacy path to
     finish the loading process but with potentially higher latency.

Extend struct microcode_ops to incorporate staging properties, which will
be implemented in the vendor code separately.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
Reviewed-by: Chao Gao <chao.gao@intel.com>
---
V4 -> V5:
* Collect Chao's review tag

V1 -> V2:
* Move invocation inside of load_late_stop_cpus() (Boris)
* Add more note about staging (Dave)

There were discussions about whether staging success should be enforced
by a configurable option. That topic is identified as follow-up work,
separate from this series.
    https://lore.kernel.org/lkml/54308373-7867-4b76-be34-63730953f83c@intel.com/
---
 arch/x86/kernel/cpu/microcode/core.c     | 11 +++++++++++
 arch/x86/kernel/cpu/microcode/internal.h |  4 +++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
index b92e09a87c69..34e569ee1db2 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -552,6 +552,17 @@ static int load_late_stop_cpus(bool is_safe)
 		pr_err("You should switch to early loading, if possible.\n");
 	}
 
+	/*
+	 * Pre-load the microcode image into a staging device. This
+	 * process is preemptible and does not require stopping CPUs.
+	 * Successful staging simplifies the subsequent late-loading
+	 * process, reducing rendezvous time.
+	 *
+	 * Even if the transfer fails, the update will proceed as usual.
+	 */
+	if (microcode_ops->use_staging)
+		microcode_ops->stage_microcode();
+
 	atomic_set(&late_cpus_in, num_online_cpus());
 	atomic_set(&offline_in_nmi, 0);
 	loops_per_usec = loops_per_jiffy / (TICK_NSEC / 1000);
diff --git a/arch/x86/kernel/cpu/microcode/internal.h b/arch/x86/kernel/cpu/microcode/internal.h
index 50a9702ae4e2..adf02ebbf7a3 100644
--- a/arch/x86/kernel/cpu/microcode/internal.h
+++ b/arch/x86/kernel/cpu/microcode/internal.h
@@ -31,10 +31,12 @@ struct microcode_ops {
 	 * See also the "Synchronization" section in microcode_core.c.
 	 */
 	enum ucode_state	(*apply_microcode)(int cpu);
+	void			(*stage_microcode)(void);
 	int			(*collect_cpu_info)(int cpu, struct cpu_signature *csig);
 	void			(*finalize_late_load)(int result);
 	unsigned int		nmi_safe	: 1,
-				use_nmi		: 1;
+				use_nmi		: 1,
+				use_staging	: 1;
 };
 
 struct early_load_data {
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v5 3/7] x86/microcode/intel: Establish staging control logic
  2025-08-23 15:52     ` [PATCH v5 0/7] " Chang S. Bae
  2025-08-23 15:52       ` [PATCH v5 1/7] x86/cpu/topology: Make primary thread mask available with SMP=n Chang S. Bae
  2025-08-23 15:52       ` [PATCH v5 2/7] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
@ 2025-08-23 15:52       ` Chang S. Bae
  2025-09-04 12:13         ` Borislav Petkov
  2025-08-23 15:52       ` [PATCH v5 4/7] x86/microcode/intel: Define staging state struct Chang S. Bae
                         ` (4 subsequent siblings)
  7 siblings, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-08-23 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, chao.gao, abusse,
	chang.seok.bae

When microcode staging is initiated, operations are carried out through
an MMIO interface. Each package has a unique interface specified by the
IA32_MCU_STAGING_MBOX_ADDR MSR, which maps to a set of 32-bit registers.

Prepare staging with the following steps:

  1.  Ensure the microcode image is 32-bit aligned to match the MMIO
      register size.

  2.  Identify each MMIO interface based on its per-package scope.

  3.  Invoke the staging function for each identified interface, which
      will be implemented separately.

Also, define cpu_primary_thread_mask for the CONFIG_SMP=n case, allowing
consistent use when narrowing down primary threads to locate the
per-package interface.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
Link: https://lore.kernel.org/all/871pznq229.ffs@tglx
---
V4 -> V5:
* Rebase on the primary thread cpumask fix (Dave)
* Clean up the rev print code (Dave)
* rdmsrl_on_cpu() -> rdmsrq_on_cpu (Chao)

V2 -> V3:
* Remove a global variable and adjust stage_microcode() (Dave).
* Simplify for_each_cpu() loop control code
* Handle rdmsrl_on_cpu() return code explicitly (Chao)

V1 -> V2:
* Adjust to reference the staging_state struct.
* Add lockdep_assert_cpus_held() (Boris)
---
 arch/x86/include/asm/msr-index.h      |  2 ++
 arch/x86/kernel/cpu/microcode/intel.c | 49 +++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index b65c3ba5fa14..0356155f9264 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -913,6 +913,8 @@
 #define MSR_IA32_UCODE_WRITE		0x00000079
 #define MSR_IA32_UCODE_REV		0x0000008b
 
+#define MSR_IA32_MCU_STAGING_MBOX_ADDR	0x000007a5
+
 /* Intel SGX Launch Enclave Public Key Hash MSRs */
 #define MSR_IA32_SGXLEPUBKEYHASH0	0x0000008C
 #define MSR_IA32_SGXLEPUBKEYHASH1	0x0000008D
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 371ca6eac00e..d309fb1f058f 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -299,6 +299,54 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 	return size ? NULL : patch;
 }
 
+/*
+ * Handle the staging process using the mailbox MMIO interface.
+ * Return the result state.
+ */
+static enum ucode_state do_stage(u64 mmio_pa)
+{
+	pr_debug_once("Staging implementation is pending.\n");
+	return UCODE_ERROR;
+}
+
+static void stage_microcode(void)
+{
+	unsigned int pkg_id = UINT_MAX;
+	enum ucode_state ret;
+	int cpu, err;
+	u64 mmio_pa;
+
+	if (!IS_ALIGNED(get_totalsize(&ucode_patch_late->hdr), sizeof(u32)))
+		return;
+
+	lockdep_assert_cpus_held();
+
+	/*
+	 * The MMIO address is unique per package, and all the SMT
+	 * primary threads are online here. Find each MMIO space by
+	 * their package ids to avoid duplicate staging.
+	 */
+	for_each_cpu(cpu, cpu_primary_thread_mask) {
+		if (topology_logical_package_id(cpu) == pkg_id)
+			continue;
+		pkg_id = topology_logical_package_id(cpu);
+
+		err = rdmsrq_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &mmio_pa);
+		if (WARN_ON_ONCE(err))
+			return;
+
+		ret = do_stage(mmio_pa);
+		if (ret != UCODE_OK) {
+			pr_err("Error: staging failed with %s for CPU%d at package %u.\n",
+			       ret == UCODE_TIMEOUT ? "timeout" : "error state",
+			       cpu, pkg_id);
+			return;
+		}
+	}
+
+	pr_info("Staging of patch revision 0x%x succeeded.\n", ucode_patch_late->hdr.rev);
+}
+
 static enum ucode_state __apply_microcode(struct ucode_cpu_info *uci,
 					  struct microcode_intel *mc,
 					  u32 *cur_rev)
@@ -627,6 +675,7 @@ static struct microcode_ops microcode_intel_ops = {
 	.collect_cpu_info	= collect_cpu_info,
 	.apply_microcode	= apply_microcode_late,
 	.finalize_late_load	= finalize_late_load,
+	.stage_microcode	= stage_microcode,
 	.use_nmi		= IS_ENABLED(CONFIG_X86_64),
 };
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v5 4/7] x86/microcode/intel: Define staging state struct
  2025-08-23 15:52     ` [PATCH v5 0/7] " Chang S. Bae
                         ` (2 preceding siblings ...)
  2025-08-23 15:52       ` [PATCH v5 3/7] x86/microcode/intel: Establish staging control logic Chang S. Bae
@ 2025-08-23 15:52       ` Chang S. Bae
  2025-09-04 13:48         ` Borislav Petkov
  2025-08-23 15:52       ` [PATCH v5 5/7] x86/microcode/intel: Implement staging handler Chang S. Bae
                         ` (3 subsequent siblings)
  7 siblings, 1 reply; 69+ messages in thread
From: Chang S. Bae @ 2025-08-23 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, chao.gao, abusse,
	chang.seok.bae

To facilitate a structured staging handler, a set of functions will be
introduced. Define staging_state struct to simplify function prototypes
by consolidating relevant data, instead of passing multiple local
variables.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
---
V4 -> V5: Drop the ucode_ptr field (Dave)
V1 -> V2: New patch

Prior to V2, local variables were used to track state values, with the
intention of improving readability by explicitly passing them between
functions. However, given feedbacks, introducing a dedicated data
structure looks to provide a benefit by simplifying the main loop.
---
 arch/x86/kernel/cpu/microcode/intel.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index d309fb1f058f..3ca22457d839 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -54,6 +54,25 @@ struct extended_sigtable {
 	struct extended_signature	sigs[];
 };
 
+/**
+ * struct staging_state - Tracks the current staging process state
+ *
+ * @mmio_base:		MMIO base address for staging
+ * @ucode_len:		Total size of the microcode image
+ * @chunk_size:		Size of each data piece
+ * @bytes_sent:		Total bytes transmitted so far
+ * @offset:		Current offset in the microcode image
+ * @state:		Current state of the staging process
+ */
+struct staging_state {
+	void __iomem		*mmio_base;
+	unsigned int		ucode_len;
+	unsigned int		chunk_size;
+	unsigned int		bytes_sent;
+	unsigned int		offset;
+	enum ucode_state	state;
+};
+
 #define DEFAULT_UCODE_TOTALSIZE (DEFAULT_UCODE_DATASIZE + MC_HEADER_SIZE)
 #define EXT_HEADER_SIZE		(sizeof(struct extended_sigtable))
 #define EXT_SIGNATURE_SIZE	(sizeof(struct extended_signature))
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v5 5/7] x86/microcode/intel: Implement staging handler
  2025-08-23 15:52     ` [PATCH v5 0/7] " Chang S. Bae
                         ` (3 preceding siblings ...)
  2025-08-23 15:52       ` [PATCH v5 4/7] x86/microcode/intel: Define staging state struct Chang S. Bae
@ 2025-08-23 15:52       ` Chang S. Bae
  2025-08-23 15:52       ` [PATCH v5 6/7] x86/microcode/intel: Support mailbox transfer Chang S. Bae
                         ` (2 subsequent siblings)
  7 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-08-23 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, chao.gao, abusse,
	chang.seok.bae

Previously, per-package staging invocations and their associated state
data were established. The next step is to implement the actual staging
handler according to the specified protocol. Below are key aspects to
note:

  (a)  Each staging process must begin by resetting the staging hardware.

  (b)  The staging hardware processes up to a page-sized chunk of the
       microcode image per iteration, requiring software to submit data
       incrementally.

  (c)  Once a data chunk is processed, the hardware responds with an
       offset in the image for the next chunk.

  (d) The offset may indicate completion or request retransmission of an
       already transferred chunk. As long as the total transferred data
       remains within the predefined limit (twice the image size),
       retransmissions should be acceptable.

With that, incorporate these code sequences to the staging handler:

  1.  Initialization: Map the MMIO space via ioremap(). Reset the staging
      hardware and initialize software state, ensuring a fresh staging
      process aligned with (a).

  2.  Processing Loop: Introduce a loop iterating over data chunk,
      following (b), with proper termination conditions established from
      (d) -- stop staging when the hardware signals completion, or if the
      total transmitted data exceeds the predefined limit.

  3.  Loop Body: Finally, compose the loop body with two steps --
      transmitting a data chunk and retrieving the next offset from the
      hardware response, aligning with (b) and (c).

Since data transmission and mailbox format handling require additional
details, they are implemented separately in next changes.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
---
V4 -> V5:
* Convert helper functions to return error codes (Dave)
* Consolidate loop-control logic
* Refactor next-chunk calculation/check for clarity
* Remove offset sanity check (moved to next patch)

V2 -> V3:
* Rework code to eliminate global variables (Dave)
* Remove redundant variable resets (Chao)

V1 -> V2:
* Re-write the changelog for clarity (Dave).
* Move staging handling code into intel.c (Boris).
* Add extensive comments to clarify staging logic and hardware
  interactions, along with function renaming (Dave).
---
 arch/x86/kernel/cpu/microcode/intel.c | 137 +++++++++++++++++++++++++-
 1 file changed, 133 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 3ca22457d839..a1b13202330d 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -20,6 +20,8 @@
 #include <linux/cpu.h>
 #include <linux/uio.h>
 #include <linux/mm.h>
+#include <linux/delay.h>
+#include <linux/io.h>
 
 #include <asm/cpu_device_id.h>
 #include <asm/processor.h>
@@ -33,6 +35,16 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
 
 #define UCODE_BSP_LOADED	((struct microcode_intel *)0x1UL)
 
+/* Defines for the microcode staging mailbox interface */
+
+#define MBOX_REG_NUM		4
+#define MBOX_REG_SIZE		sizeof(u32)
+
+#define MBOX_CONTROL_OFFSET	0x0
+#define MBOX_STATUS_OFFSET	0x4
+
+#define MASK_MBOX_CTRL_ABORT	BIT(0)
+
 /* Current microcode patch used in early patching on the APs. */
 static struct microcode_intel *ucode_patch_va __read_mostly;
 static struct microcode_intel *ucode_patch_late __read_mostly;
@@ -319,13 +331,130 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 }
 
 /*
- * Handle the staging process using the mailbox MMIO interface.
- * Return the result state.
+ * Prepare for a new microcode transfer: reset hardware and record the
+ * image size.
+ */
+static void init_stage(struct staging_state *ss)
+{
+	ss->ucode_len = get_totalsize(&ucode_patch_late->hdr);
+
+	/*
+	 * Abort any ongoing process, effectively resetting the device.
+	 * Unlike regular mailbox data processing requests, this
+	 * operation does not require a status check.
+	 */
+	writel(MASK_MBOX_CTRL_ABORT, ss->mmio_base + MBOX_CONTROL_OFFSET);
+}
+
+/*
+ * Return PAGE_SIZE, or remaining bytes if this is the final chunk
+ */
+static inline unsigned int calc_next_chunk_size(unsigned int ucode_len, unsigned int offset)
+{
+	return min(PAGE_SIZE, ucode_len - offset);
+}
+
+/*
+ * Update the chunk size and decide whether another chunk can be sent.
+ * This accounts for remaining data and retry limits.
+ */
+static bool can_send_next_chunk(struct staging_state *ss)
+{
+	ss->chunk_size = calc_next_chunk_size(ss->ucode_len, ss->offset);
+	/*
+	 * Each microcode image is divided into chunks, each at most
+	 * one page size. A 10-chunk  image would typically require 10
+	 * transactions.
+	 *
+	 * However, the hardware managing the mailbox has limited
+	 * resources and may not cache the entire image, potentially
+	 * requesting the same chunk multiple times.
+	 *
+	 * To tolerate this behavior, allow up to twice the expected
+	 * number of transactions (i.e., a 10-chunk image can take up to
+	 * 20 attempts).
+	 *
+	 * If the number of attempts exceeds this limit, the hardware is
+	 * likely stuck and mark the state as timeout.
+	 */
+	if (ss->bytes_sent + ss->chunk_size > ss->ucode_len * 2) {
+		ss->state = UCODE_TIMEOUT;
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Determine whether staging is complete: either the hardware signaled
+ * the end offset, or no more transactions are permitted (retry limit
+ * reached).
+ */
+static inline bool staging_is_complete(struct staging_state *ss)
+{
+	return (ss->offset == UINT_MAX) || !can_send_next_chunk(ss);
+}
+
+/*
+ * Transmit a chunk of the microcode image to the hardware.
+ * Return 0 on success, or an error code on failure.
+ */
+static int send_data_chunk(struct staging_state *ss, void *ucode_ptr __maybe_unused)
+{
+	pr_debug_once("Staging mailbox loading code needs to be implemented.\n");
+	ss->state = UCODE_ERROR;
+	return -EPROTONOSUPPORT;
+}
+
+/*
+ * Retrieve the next offset from the hardware response.
+ * Return 0 on success, or an error code on failure.
+ */
+static int fetch_next_offset(struct staging_state *ss)
+{
+	pr_debug_once("Staging mailbox response handling code needs to be implemented.\n\n");
+	ss->state = UCODE_ERROR;
+	return -EPROTONOSUPPORT;
+}
+
+/*
+ * Handle the staging process using the mailbox MMIO interface. The
+ * microcode image is transferred in chunks until completion. Return the
+ * result state.
  */
 static enum ucode_state do_stage(u64 mmio_pa)
 {
-	pr_debug_once("Staging implementation is pending.\n");
-	return UCODE_ERROR;
+	struct staging_state ss = {};
+	int err;
+
+	ss.mmio_base = ioremap(mmio_pa, MBOX_REG_NUM * MBOX_REG_SIZE);
+	if (WARN_ON_ONCE(!ss.mmio_base))
+		return UCODE_ERROR;
+
+	init_stage(&ss);
+
+	/* Perform the staging process while within the retry limit */
+	while (!staging_is_complete(&ss)) {
+		/* Send a chunk of microcode each time: */
+		err = send_data_chunk(&ss, ucode_patch_late);
+		if (err)
+			break;
+		/*
+		 * Then, ask the hardware which piece of the image it
+		 * needs next. The same piece may be sent more than once.
+		 */
+		err = fetch_next_offset(&ss);
+		if (err)
+			break;
+	}
+
+	iounmap(ss.mmio_base);
+
+	/*
+	 * The helpers update ss.state on error. The final state is
+	 * returned to the caller.
+	 */
+	return ss.state;
 }
 
 static void stage_microcode(void)
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v5 6/7] x86/microcode/intel: Support mailbox transfer
  2025-08-23 15:52     ` [PATCH v5 0/7] " Chang S. Bae
                         ` (4 preceding siblings ...)
  2025-08-23 15:52       ` [PATCH v5 5/7] x86/microcode/intel: Implement staging handler Chang S. Bae
@ 2025-08-23 15:52       ` Chang S. Bae
  2025-08-23 15:52       ` [PATCH v5 7/7] x86/microcode/intel: Enable staging when available Chang S. Bae
  2025-08-26 22:13       ` [PATCH v5 0/7] x86: Support for Intel Microcode Staging Feature Luck, Tony
  7 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-08-23 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, chao.gao, abusse,
	chang.seok.bae

Previously, the functions for sending microcode data and retrieving the
next offset were placeholders, as they required handling the specific
mailbox format. Implement them as following:

== Mailbox Format ==

The staging mailbox consists of two primary sections: 'header' and
'data'. While the microcode must be transferred following this format,
the actual data transfer mechanism involves reading and writing to
specific MMIO registers.

== Mailbox Data Registers ==

Unlike conventional interfaces that allocate MMIO space for each data
chunk, the staging interface features a "narrow" interface, using only
two dword-sized registers for read and write operations.

For example, if writing 2 dwords of data to a device. Typically, the
device would expose 2 dwords of "wide" MMIO space. To send the data to
the device:

	writel(buf[0], io_addr + 0);
	writel(buf[1], io_addr + 1);

But, this interface is a bit different. Instead of having a "wide"
interface where there is separate MMIO space for each word in a
transaction, it has a "narrow" interface where several words are written
to the same spot in MMIO space:

	writel(buf[0], io_addr);
	writel(buf[1], io_addr);

The same goes for the read side.

== Implementation Summary ==

Given that, introduce two layers of helper functions at first:

  * Low-level helpers for reading and writing to data registers directly.
  * Wrapper functions for handling mailbox header and data sections.

Using them, implement send_data_chunk() and fetch_next_offset()
functions. Add explicit error and timeout handling routine in
wait_for_transaction(), finishing up the transfer.

Both hardware error states and implicit errors -- invalid header or
offset -- result in UCODE_ERROR. Emit a clear message for the latter.

Note: The kernel has support for similar mailboxes. But none of them are
compatible with this one. Trying to share code resulted in a bloated
mess, so this code is standalone.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
---
V4 -> V5: Addressed Dave's feedback
* fetch_next_offset():
  - Make dword reads explicit
  - Consolidate offset validation -- adding another user for the
    end-offset checker
  - Convert WARN_* with pr_err_once()
* Simplify transaction waiting logic a bit

V2 -> V3:
* Update code to reflect the removal of a global variable (Dave).

V1 -> V2:
* Add lots of code comments and edit the changelog (Dave).
* Encapsulate register read/write operations for processing header and
  data sections.
---
 arch/x86/kernel/cpu/microcode/intel.c | 186 +++++++++++++++++++++++++-
 1 file changed, 179 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index a1b13202330d..f6b365eba6a2 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -22,6 +22,7 @@
 #include <linux/mm.h>
 #include <linux/delay.h>
 #include <linux/io.h>
+#include <linux/pci_ids.h>
 
 #include <asm/cpu_device_id.h>
 #include <asm/processor.h>
@@ -42,8 +43,31 @@ static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
 
 #define MBOX_CONTROL_OFFSET	0x0
 #define MBOX_STATUS_OFFSET	0x4
+#define MBOX_WRDATA_OFFSET	0x8
+#define MBOX_RDDATA_OFFSET	0xc
 
 #define MASK_MBOX_CTRL_ABORT	BIT(0)
+#define MASK_MBOX_CTRL_GO	BIT(31)
+
+#define MASK_MBOX_STATUS_ERROR	BIT(2)
+#define MASK_MBOX_STATUS_READY	BIT(31)
+
+#define MASK_MBOX_RESP_SUCCESS	BIT(0)
+#define MASK_MBOX_RESP_PROGRESS	BIT(1)
+#define MASK_MBOX_RESP_ERROR	BIT(2)
+
+#define MBOX_CMD_LOAD		0x3
+#define MBOX_OBJ_STAGING	0xb
+#define MBOX_HEADER(size)	((PCI_VENDOR_ID_INTEL)    | \
+				 (MBOX_OBJ_STAGING << 16) | \
+				 ((u64)((size) / sizeof(u32)) << 32))
+
+/* The size of each mailbox header */
+#define MBOX_HEADER_SIZE	sizeof(u64)
+/* The size of staging hardware response */
+#define MBOX_RESPONSE_SIZE	sizeof(u64)
+
+#define MBOX_XACTION_TIMEOUT_MS	(10 * MSEC_PER_SEC)
 
 /* Current microcode patch used in early patching on the APs. */
 static struct microcode_intel *ucode_patch_va __read_mostly;
@@ -330,6 +354,49 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
 	return size ? NULL : patch;
 }
 
+static inline u32 read_mbox_dword(void __iomem *mmio_base)
+{
+	u32 dword = readl(mmio_base + MBOX_RDDATA_OFFSET);
+
+	/* Acknowledge read completion to the staging hardware */
+	writel(0, mmio_base + MBOX_RDDATA_OFFSET);
+	return dword;
+}
+
+static inline void write_mbox_dword(void __iomem *mmio_base, u32 dword)
+{
+	writel(dword, mmio_base + MBOX_WRDATA_OFFSET);
+}
+
+static inline u64 read_mbox_header(void __iomem *mmio_base)
+{
+	u32 high, low;
+
+	low  = read_mbox_dword(mmio_base);
+	high = read_mbox_dword(mmio_base);
+
+	return ((u64)high << 32) | low;
+}
+
+static inline void write_mbox_header(void __iomem *mmio_base, u64 value)
+{
+	write_mbox_dword(mmio_base, value);
+	write_mbox_dword(mmio_base, value >> 32);
+}
+
+static void write_mbox_data(void __iomem *mmio_base, u32 *chunk, unsigned int chunk_bytes)
+{
+	int i;
+
+	/*
+	 * The MMIO space is mapped as Uncached (UC). Each write arrives
+	 * at the device as an individual transaction in program order.
+	 * The device can then resemble the sequence accordingly.
+	 */
+	for (i = 0; i < chunk_bytes / sizeof(u32); i++)
+		write_mbox_dword(mmio_base, chunk[i]);
+}
+
 /*
  * Prepare for a new microcode transfer: reset hardware and record the
  * image size.
@@ -385,6 +452,14 @@ static bool can_send_next_chunk(struct staging_state *ss)
 	return true;
 }
 
+/*
+ * The hardware indicates completion by returning a sentinel end offset
+ */
+static inline bool is_end_offset(u32 offset)
+{
+	return offset == UINT_MAX;
+}
+
 /*
  * Determine whether staging is complete: either the hardware signaled
  * the end offset, or no more transactions are permitted (retry limit
@@ -392,18 +467,73 @@ static bool can_send_next_chunk(struct staging_state *ss)
  */
 static inline bool staging_is_complete(struct staging_state *ss)
 {
-	return (ss->offset == UINT_MAX) || !can_send_next_chunk(ss);
+	return is_end_offset(ss->offset) || !can_send_next_chunk(ss);
+}
+
+/*
+ * Wait for the hardware to complete a transaction.
+ * Return 0 on success, or an error code on failure.
+ */
+static int wait_for_transaction(struct staging_state *ss)
+{
+	u32 timeout, status;
+
+	/* Allow time for hardware to complete the operation: */
+	for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++) {
+		msleep(1);
+
+		status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
+		/* Break out early if the hardware is ready: */
+		if (status & MASK_MBOX_STATUS_READY)
+			break;
+	}
+
+	/* Check for explicit error response */
+	if (status & MASK_MBOX_STATUS_ERROR) {
+		ss->state = UCODE_ERROR;
+		return -EPROTO;
+	}
+
+	/*
+	 * Hardware is neither responded to the action nor signaled any
+	 * error. Treat this as timeout.
+	 */
+	if (!(status & MASK_MBOX_STATUS_READY)) {
+		ss->state = UCODE_TIMEOUT;
+		return -ETIMEDOUT;
+	}
+
+	ss->state = UCODE_OK;
+	return 0;
 }
 
 /*
  * Transmit a chunk of the microcode image to the hardware.
  * Return 0 on success, or an error code on failure.
  */
-static int send_data_chunk(struct staging_state *ss, void *ucode_ptr __maybe_unused)
+static int send_data_chunk(struct staging_state *ss, void *ucode_ptr)
 {
-	pr_debug_once("Staging mailbox loading code needs to be implemented.\n");
-	ss->state = UCODE_ERROR;
-	return -EPROTONOSUPPORT;
+	u32 *src_chunk = ucode_ptr + ss->offset;
+	u16 mbox_size;
+
+	/*
+	 * Write a 'request' mailbox object in this order:
+	 *  1. Mailbox header includes total size
+	 *  2. Command header specifies the load operation
+	 *  3. Data section contains a microcode chunk
+	 *
+	 * Thus, the mailbox size is two headers plus the chunk size.
+	 */
+	mbox_size = MBOX_HEADER_SIZE * 2 + ss->chunk_size;
+	write_mbox_header(ss->mmio_base, MBOX_HEADER(mbox_size));
+	write_mbox_header(ss->mmio_base, MBOX_CMD_LOAD);
+	write_mbox_data(ss->mmio_base, src_chunk, ss->chunk_size);
+	ss->bytes_sent += ss->chunk_size;
+
+	/* Notify the hardware that the mailbox is ready for processing. */
+	writel(MASK_MBOX_CTRL_GO, ss->mmio_base + MBOX_CONTROL_OFFSET);
+
+	return wait_for_transaction(ss);
 }
 
 /*
@@ -412,9 +542,51 @@ static int send_data_chunk(struct staging_state *ss, void *ucode_ptr __maybe_unu
  */
 static int fetch_next_offset(struct staging_state *ss)
 {
-	pr_debug_once("Staging mailbox response handling code needs to be implemented.\n\n");
+	const u64 expected_header = MBOX_HEADER(MBOX_HEADER_SIZE + MBOX_RESPONSE_SIZE);
+	u32 offset, status;
+	u64 header;
+	int err;
+
+	/*
+	 * The 'response' mailbox returns three fields, in order:
+	 *  1. Header
+	 *  2. Next offset in the microcode image
+	 *  3. Status flags
+	 */
+	header = read_mbox_header(ss->mmio_base);
+	offset = read_mbox_dword(ss->mmio_base);
+	status = read_mbox_dword(ss->mmio_base);
+
+	/* All valid responses must start with the expected header. */
+	if (header != expected_header) {
+		pr_err_once("staging: invalid response header\n");
+		err = -EINVAL;
+		goto err_out;
+	}
+
+	/*
+	 * Verify the offset: If not at the end marker, it must not
+	 * exceed the microcode image length
+	 */
+	if (!is_end_offset(offset) && offset > ss->ucode_len) {
+		pr_err_once("staging: invalid response offset\n");
+		err = -EINVAL;
+		goto err_out;
+	}
+
+	/* Hardware may report errors explicitly in the status field */
+	if (status & MASK_MBOX_RESP_ERROR) {
+		err = -EPROTO;
+		goto err_out;
+	}
+
+	ss->offset = offset;
+	ss->state  = UCODE_OK;
+	return 0;
+
+err_out:
 	ss->state = UCODE_ERROR;
-	return -EPROTONOSUPPORT;
+	return err;
 }
 
 /*
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v5 7/7] x86/microcode/intel: Enable staging when available
  2025-08-23 15:52     ` [PATCH v5 0/7] " Chang S. Bae
                         ` (5 preceding siblings ...)
  2025-08-23 15:52       ` [PATCH v5 6/7] x86/microcode/intel: Support mailbox transfer Chang S. Bae
@ 2025-08-23 15:52       ` Chang S. Bae
  2025-08-26 22:13       ` [PATCH v5 0/7] x86: Support for Intel Microcode Staging Feature Luck, Tony
  7 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-08-23 15:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, tglx, mingo, bp, dave.hansen, chao.gao, abusse,
	chang.seok.bae

With the staging code being ready, check for staging availability by
reading these following MSRs:

  * IA32_ARCH_CAPABILITIES (bit 16) for the presence of
    IA32_MCU_ENUMERATION

  * IA32_MCU_ENUMERATION (bit 4) for the staging feature availability.

When available, enable it by setting the feature bit.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
---
V4 -> V5:
* Collect Chao's Review tag
* rdmsrl() -> rdmsrq() (Chao)

V1 -> V2:
* Fold MSR definings (Boris).
---
 arch/x86/include/asm/msr-index.h      |  7 +++++++
 arch/x86/kernel/cpu/microcode/intel.c | 17 +++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 0356155f9264..2b7b598c414f 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -166,6 +166,10 @@
 						 * Processor MMIO stale data
 						 * vulnerabilities.
 						 */
+#define ARCH_CAP_MCU_ENUM		BIT(16) /*
+						 * Indicates the presence of microcode update
+						 * feature enumeration and status information
+						 */
 #define ARCH_CAP_FB_CLEAR		BIT(17)	/*
 						 * VERW clears CPU fill buffer
 						 * even on MDS_NO CPUs.
@@ -913,6 +917,9 @@
 #define MSR_IA32_UCODE_WRITE		0x00000079
 #define MSR_IA32_UCODE_REV		0x0000008b
 
+#define MSR_IA32_MCU_ENUMERATION	0x0000007b
+#define MCU_STAGING			BIT(4)
+
 #define MSR_IA32_MCU_STAGING_MBOX_ADDR	0x000007a5
 
 /* Intel SGX Launch Enclave Public Key Hash MSRs */
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index f6b365eba6a2..7d6582f8a80f 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -1007,6 +1007,18 @@ static __init void calc_llc_size_per_core(struct cpuinfo_x86 *c)
 	llc_size_per_core = (unsigned int)llc_size;
 }
 
+static __init bool staging_available(void)
+{
+	u64 val;
+
+	val = x86_read_arch_cap_msr();
+	if (!(val & ARCH_CAP_MCU_ENUM))
+		return false;
+
+	rdmsrq(MSR_IA32_MCU_ENUMERATION, val);
+	return !!(val & MCU_STAGING);
+}
+
 struct microcode_ops * __init init_intel_microcode(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
@@ -1017,6 +1029,11 @@ struct microcode_ops * __init init_intel_microcode(void)
 		return NULL;
 	}
 
+	if (staging_available()) {
+		microcode_intel_ops.use_staging = true;
+		pr_info("Enabled staging feature.\n");
+	}
+
 	calc_llc_size_per_core(c);
 
 	return &microcode_intel_ops;
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH] x86/cpu/topology: Make primary thread mask available with SMP=n
  2025-08-22 22:39             ` [PATCH] x86/cpu/topology: Make primary thread mask available with SMP=n Chang S. Bae
@ 2025-08-23 16:05               ` Chang S. Bae
  0 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-08-23 16:05 UTC (permalink / raw)
  To: dave.hansen, x86
  Cc: tglx, mingo, bp, colinmitchell, chao.gao, abusse, linux-kernel

On 8/22/2025 3:39 PM, Chang S. Bae wrote:
> ... The next revision of
> that series will depend on this.
Fold into the V5 posting now:
https://lore.kernel.org/lkml/20250823155214.17465-2-chang.seok.bae@intel.com/

Thanks,
Chang

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v5 0/7] x86: Support for Intel Microcode Staging Feature
  2025-08-23 15:52     ` [PATCH v5 0/7] " Chang S. Bae
                         ` (6 preceding siblings ...)
  2025-08-23 15:52       ` [PATCH v5 7/7] x86/microcode/intel: Enable staging when available Chang S. Bae
@ 2025-08-26 22:13       ` Luck, Tony
  2025-08-26 22:15         ` Chang S. Bae
  7 siblings, 1 reply; 69+ messages in thread
From: Luck, Tony @ 2025-08-26 22:13 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen, chao.gao, abusse

On Sat, Aug 23, 2025 at 08:52:03AM -0700, Chang S. Bae wrote:
> Hi all,
> 
> This is another iteration -- changes since v4 [*]:
> 
>   * Reworked the preparatory change before referencing
>     cpu_primary_thread_mask, based on Dave’s feedback. This is now patch1.
> 
>   * Incorporated further feedback from Dave on the staging code, mainly
>     to improve clarity, reduce ambiguity, and fix minor issues. Each
>     patch includes some details.
> 
>   * Collected Chao’s review tag (thanks!) for the first and last patches.
> 
> As usual, the series is also available here:
>   git://github.com/intel-staging/microcode.git staging_v5
> 
> [*] https://lore.kernel.org/lkml/20250813172649.15474-1-chang.seok.bae@intel.com/
> 
> Thanks,
> Chang

Reviewed-by: Tony Luck <tony.luck@intel.com>

-Tony

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v5 0/7] x86: Support for Intel Microcode Staging Feature
  2025-08-26 22:13       ` [PATCH v5 0/7] x86: Support for Intel Microcode Staging Feature Luck, Tony
@ 2025-08-26 22:15         ` Chang S. Bae
  0 siblings, 0 replies; 69+ messages in thread
From: Chang S. Bae @ 2025-08-26 22:15 UTC (permalink / raw)
  To: Luck, Tony
  Cc: linux-kernel, x86, tglx, mingo, bp, dave.hansen, chao.gao, abusse

On 8/26/2025 3:13 PM, Luck, Tony wrote:
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>

Thanks!

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v5 2/7] x86/microcode: Introduce staging step to reduce late-loading time
  2025-08-23 15:52       ` [PATCH v5 2/7] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
@ 2025-09-04 12:08         ` Borislav Petkov
  0 siblings, 0 replies; 69+ messages in thread
From: Borislav Petkov @ 2025-09-04 12:08 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: linux-kernel, x86, tglx, mingo, dave.hansen, chao.gao, abusse

On Sat, Aug 23, 2025 at 08:52:05AM -0700, Chang S. Bae wrote:
> As microcode patch sizes continue to grow, late-loading latency spikes
> can lead to timeouts and disruptions in running workloads. This trend of
> increasing patch sizes is expected to continue, so a foundational
> solution is needed to address the issue.
> 
> To mitigate the problem, a new staging feature is introduced. This option
> processes most of the microcode update (excluding activation) on a
> non-critical path, allowing CPUs to remain operational during the
> majority of the update. By offloading work from the critical path,
> staging can significantly reduces latency spikes.

"reduce"

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v5 3/7] x86/microcode/intel: Establish staging control logic
  2025-08-23 15:52       ` [PATCH v5 3/7] x86/microcode/intel: Establish staging control logic Chang S. Bae
@ 2025-09-04 12:13         ` Borislav Petkov
  0 siblings, 0 replies; 69+ messages in thread
From: Borislav Petkov @ 2025-09-04 12:13 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: linux-kernel, x86, tglx, mingo, dave.hansen, chao.gao, abusse

On Sat, Aug 23, 2025 at 08:52:06AM -0700, Chang S. Bae wrote:
> When microcode staging is initiated, operations are carried out through
> an MMIO interface. Each package has a unique interface specified by the
> IA32_MCU_STAGING_MBOX_ADDR MSR, which maps to a set of 32-bit registers.
> 
> Prepare staging with the following steps:
> 
>   1.  Ensure the microcode image is 32-bit aligned to match the MMIO
>       register size.
> 
>   2.  Identify each MMIO interface based on its per-package scope.
> 
>   3.  Invoke the staging function for each identified interface, which
>       will be implemented separately.
> 
> Also, define cpu_primary_thread_mask for the CONFIG_SMP=n case, allowing
> consistent use when narrowing down primary threads to locate the
> per-package interface.

This paragraph is stale now and can go.

> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index b65c3ba5fa14..0356155f9264 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -913,6 +913,8 @@
>  #define MSR_IA32_UCODE_WRITE		0x00000079
>  #define MSR_IA32_UCODE_REV		0x0000008b
>  
> +#define MSR_IA32_MCU_STAGING_MBOX_ADDR	0x000007a5

Doesn't look sorted to me.

> +
>  /* Intel SGX Launch Enclave Public Key Hash MSRs */
>  #define MSR_IA32_SGXLEPUBKEYHASH0	0x0000008C
>  #define MSR_IA32_SGXLEPUBKEYHASH1	0x0000008D
> diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
> index 371ca6eac00e..d309fb1f058f 100644
> --- a/arch/x86/kernel/cpu/microcode/intel.c
> +++ b/arch/x86/kernel/cpu/microcode/intel.c
> @@ -299,6 +299,54 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
>  	return size ? NULL : patch;
>  }
>  
> +/*
> + * Handle the staging process using the mailbox MMIO interface.
> + * Return the result state.
> + */
> +static enum ucode_state do_stage(u64 mmio_pa)
> +{
> +	pr_debug_once("Staging implementation is pending.\n");
> +	return UCODE_ERROR;
> +}
> +
> +static void stage_microcode(void)
> +{
> +	unsigned int pkg_id = UINT_MAX;
> +	enum ucode_state ret;
> +	int cpu, err;
> +	u64 mmio_pa;
> +
> +	if (!IS_ALIGNED(get_totalsize(&ucode_patch_late->hdr), sizeof(u32)))
> +		return;
> +
> +	lockdep_assert_cpus_held();
> +
> +	/*
> +	 * The MMIO address is unique per package, and all the SMT
> +	 * primary threads are online here. Find each MMIO space by
> +	 * their package ids to avoid duplicate staging.
> +	 */
> +	for_each_cpu(cpu, cpu_primary_thread_mask) {
> +		if (topology_logical_package_id(cpu) == pkg_id)
> +			continue;

<---- newline here.

> +		pkg_id = topology_logical_package_id(cpu);
> +
> +		err = rdmsrq_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &mmio_pa);
> +		if (WARN_ON_ONCE(err))
> +			return;
> +
> +		ret = do_stage(mmio_pa);
> +		if (ret != UCODE_OK) {
> +			pr_err("Error: staging failed with %s for CPU%d at package %u.\n",
> +			       ret == UCODE_TIMEOUT ? "timeout" : "error state",

What does "error state" mean?

Are we going to dump additional error state so that it is clear why it failed?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v5 4/7] x86/microcode/intel: Define staging state struct
  2025-08-23 15:52       ` [PATCH v5 4/7] x86/microcode/intel: Define staging state struct Chang S. Bae
@ 2025-09-04 13:48         ` Borislav Petkov
  0 siblings, 0 replies; 69+ messages in thread
From: Borislav Petkov @ 2025-09-04 13:48 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: linux-kernel, x86, tglx, mingo, dave.hansen, chao.gao, abusse

On Sat, Aug 23, 2025 at 08:52:07AM -0700, Chang S. Bae wrote:
> To facilitate a structured staging handler, a set of functions will be
> introduced.

No need for that sentence.

> Define staging_state struct to simplify function prototypes
> by consolidating relevant data, instead of passing multiple local
> variables.

That's clear enough.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2025-09-04 13:48 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-20 23:40 [PATCH v2 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
2025-03-20 23:40 ` [PATCH v2 1/6] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
2025-03-20 23:40 ` [PATCH v2 2/6] x86/microcode/intel: Define staging state struct Chang S. Bae
2025-03-20 23:40 ` [PATCH v2 3/6] x86/microcode/intel: Establish staging control logic Chang S. Bae
2025-03-21 21:18   ` [PATCH v2a " Chang S. Bae
2025-03-26  7:35     ` Chao Gao
2025-03-26 18:43       ` Chang S. Bae
2025-03-27  1:44         ` Chao Gao
2025-03-28 14:12           ` Chang S. Bae
2025-03-20 23:40 ` [PATCH v2 4/6] x86/microcode/intel: Implement staging handler Chang S. Bae
2025-03-21  0:15   ` Dave Hansen
2025-03-21 21:19     ` [PATCH v2a " Chang S. Bae
2025-03-26  8:34       ` Chao Gao
2025-03-26 18:43         ` Chang S. Bae
2025-03-21 21:19     ` [PATCH v2 " Chang S. Bae
2025-03-20 23:40 ` [PATCH v2 5/6] x86/microcode/intel: Support mailbox transfer Chang S. Bae
2025-03-21 21:19   ` [PATCH v2a " Chang S. Bae
2025-03-27  3:32   ` [PATCH v2 " Chao Gao
2025-03-27 14:11     ` Chang S. Bae
2025-03-31 19:16     ` Dave Hansen
2025-03-20 23:40 ` [PATCH v2 6/6] x86/microcode/intel: Enable staging when available Chang S. Bae
2025-04-09 23:27 ` [PATCH v3 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
2025-04-09 23:27   ` [PATCH v3 1/6] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
2025-04-09 23:27   ` [PATCH v3 2/6] x86/microcode/intel: Establish staging control logic Chang S. Bae
2025-04-09 23:27   ` [PATCH v3 3/6] x86/microcode/intel: Define staging state struct Chang S. Bae
2025-04-09 23:27   ` [PATCH v3 4/6] x86/microcode/intel: Implement staging handler Chang S. Bae
2025-04-09 23:27   ` [PATCH v3 5/6] x86/microcode/intel: Support mailbox transfer Chang S. Bae
2025-04-16 14:14     ` Chao Gao
2025-04-16 17:22       ` Chang S. Bae
2025-04-16 17:37         ` Dave Hansen
2025-04-09 23:27   ` [PATCH v3 6/6] x86/microcode/intel: Enable staging when available Chang S. Bae
2025-08-13 17:26   ` [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature Chang S. Bae
2025-08-13 17:26     ` [PATCH v4 1/6] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
2025-08-18  7:45       ` Chao Gao
2025-08-13 17:26     ` [PATCH v4 2/6] x86/microcode/intel: Establish staging control logic Chang S. Bae
2025-08-13 18:21       ` Dave Hansen
2025-08-13 20:46         ` Chang S. Bae
2025-08-13 20:55           ` Dave Hansen
2025-08-14 18:30             ` Chang S. Bae
2025-08-22 22:39             ` [PATCH] x86/cpu/topology: Make primary thread mask available with SMP=n Chang S. Bae
2025-08-23 16:05               ` Chang S. Bae
2025-08-22 22:39         ` [PATCH v4a 2/6] x86/microcode/intel: Establish staging control logic Chang S. Bae
2025-08-22 23:34           ` Dave Hansen
2025-08-13 17:26     ` [PATCH v4 3/6] x86/microcode/intel: Define staging state struct Chang S. Bae
2025-08-13 18:25       ` Dave Hansen
2025-08-22 22:39         ` [PATCH v4a " Chang S. Bae
2025-08-13 17:26     ` [PATCH v4 4/6] x86/microcode/intel: Implement staging handler Chang S. Bae
2025-08-13 18:44       ` Dave Hansen
2025-08-22 22:39         ` [PATCH v4a " Chang S. Bae
2025-08-13 17:26     ` [PATCH v4 5/6] x86/microcode/intel: Support mailbox transfer Chang S. Bae
2025-08-13 19:07       ` Dave Hansen
2025-08-22 22:40         ` [PATCH v4a " Chang S. Bae
2025-08-13 17:26     ` [PATCH v4 6/6] x86/microcode/intel: Enable staging when available Chang S. Bae
2025-08-18  8:35       ` Chao Gao
2025-08-22 22:42         ` Chang S. Bae
2025-08-13 19:08     ` [PATCH v4 0/6] x86: Support for Intel Microcode Staging Feature Dave Hansen
2025-08-23 15:52     ` [PATCH v5 0/7] " Chang S. Bae
2025-08-23 15:52       ` [PATCH v5 1/7] x86/cpu/topology: Make primary thread mask available with SMP=n Chang S. Bae
2025-08-23 15:52       ` [PATCH v5 2/7] x86/microcode: Introduce staging step to reduce late-loading time Chang S. Bae
2025-09-04 12:08         ` Borislav Petkov
2025-08-23 15:52       ` [PATCH v5 3/7] x86/microcode/intel: Establish staging control logic Chang S. Bae
2025-09-04 12:13         ` Borislav Petkov
2025-08-23 15:52       ` [PATCH v5 4/7] x86/microcode/intel: Define staging state struct Chang S. Bae
2025-09-04 13:48         ` Borislav Petkov
2025-08-23 15:52       ` [PATCH v5 5/7] x86/microcode/intel: Implement staging handler Chang S. Bae
2025-08-23 15:52       ` [PATCH v5 6/7] x86/microcode/intel: Support mailbox transfer Chang S. Bae
2025-08-23 15:52       ` [PATCH v5 7/7] x86/microcode/intel: Enable staging when available Chang S. Bae
2025-08-26 22:13       ` [PATCH v5 0/7] x86: Support for Intel Microcode Staging Feature Luck, Tony
2025-08-26 22:15         ` Chang S. Bae

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).