Hi all,
I'd like to ask initial feedback on this series enabling the staging
feature. Thanks!
== Latency Spike Issue ==
As microcode images have increased in size, a corresponding rise in load
latency has become inevitable. This latency spike significantly impacts
late loading, which remains in use despite the cautions highlighted in
the documentation [1]. The issue is especially critical for continuously
running workloads and virtual machines, where excessive delays can lead
to timeouts.
== Staging for Latency Reduction ==
Currently, writing to MSR_IA32_UCODE_WRITE triggers the entire update
process -- loading, validating, and activation -- all of which contribute
to the latency during CPU halt. The staging feature mitigates this by
refactoring all but the activation step out of the critical path,
allowing CPUs to continue serving workloads while staging takes place.
== Cache Flush Removal ==
Before resolving this latency spike caused by larger images, another
major latency issue -- cache invalidation [2] -- must first be addressed.
Originally introduced to handle a specific erratum, this cache
invalidation is now unnecessary because the problematic microcode images
have been banned. This cache flush has been found to negate the benefits
of staging, so this patch series begins by removing the WBINVD
instruction.
== Validation ==
We internally established pseudocode to clearly define all essential
steps for interacting with the firmware. Any firmware implementation
supporting staging should adhere to this contract. This patch set
incorporates that staging logic, which I successfully tested on one
firmware implementation. Multiple teams at Intel have also validated the
feature across different implementations.
Preliminary results from a pre-production system show a significant
reduction in latency (about 40%) with the staging approach alone.
Further improvements are possible with additional optimizations [*].
== Call for Review ==
This RFC series aims to present the proposed approach for community
review, to assess its soundness, and to discuss potential alternatives
if necessary. There are several key points to highlight for feedback:
1. Staging Integration Approach
In the core code, the high-level sequence for late loading is:
(1) request_microcode_fw(), and
(2) load_late_stop_cpus()->apply_microcode()
Staging doesn't fit neatly into either step, as it involves the
loading process but not the activation. Therefore, a new callback is
introduced:
core::load_late_locked()
-> intel::staging_microcode()
-> intel_staging::staging_work()
-> intel_staging::...
2. Code Abstraction
The newly added intel_staging.c file contains all staging-related
code to keep it self-contained. Ideally, the entire firmware
interaction could eventually be abstracted into a single MSR write,
which remains a long-term goal. Fortunately, recent protocol
simplifications have made this more feasible.
3. Staging Policy (TODO)
While staging is always attempted, the system will fall back to the
legacy update method if staging fails. There is an open question
regarding staging policy: should it be mandatory, without fallback,
in certain usage scenarios? This could lead to further refinements in
the flow depending on feedback and use cases.
4. Specification Updates
Recent specification updates have simplified the staging protocol
and clarified the behavior of MSR_IA32_UCODE_WRITE in conjunction
with staging:
4.1. Protocol Simplification
The specification update [3] has significantly reduced the
complexity of staging code, trimming the kernel code from ~1K lines
in preliminary implementations. Thanks to Dave for guiding this
redesign effort.
4.2. Clarification of Legacy Update Behavior
Chapter 5 of the specification adds further clarification on
MSR_IA32_UCODE_WRITE. Key points are summarized below:
(a) When staging is not performed or has failed, a WRMSR will still load
the patch image, but with higher latency.
(b) During an active staging process, MSR_IA32_UCODE_WRITE can
load a new microcode image, again with higher latency.
(c) If the versions differ between the staged microcode and the
version loaded via MSR_IA32_UCODE_WRITE, the version loaded through
the MSR takes precedence.
I'd also like to make sure there is no further ambiguity in this documentation
[3]. Feel free to provide feedback if anything seems unclear or
unreasonable.
As noted [*], an additional series focused on further latency
optimizations will follow. However, the staging approach was prioritized
due to its significant first-order impact on latency.
This series is based on 6.12-rc1. You can also find it from this repo:
git://github.com/intel-staging/microcode.git staging_rfc-v1
Thanks,
Chang
[1]: https://docs.kernel.org/arch/x86/microcode.html#why-is-late-loading-dangerous
[2]: https://lore.kernel.org/all/20240701212012.21499-1-chang.seok.bae@intel.com/
[3]: https://cdrdv2.intel.com/v1/dl/getContent/782715
[*]: Further latency improvements will be addressed in the upcoming
‘Uniform’ feature series.
Chang S. Bae (7):
x86/microcode/intel: Remove unnecessary cache writeback and
invalidation
x86/microcode: Introduce staging option to reduce late-loading latency
x86/msr-index: Define MSR index and bit for the microcode staging
feature
x86/microcode/intel: Prepare for microcode staging
x86/microcode/intel_staging: Implement staging logic
x86/microcode/intel_staging: Support mailbox data transfer
x86/microcode/intel: Enable staging when available
arch/x86/include/asm/msr-index.h | 9 +
arch/x86/kernel/cpu/microcode/Makefile | 2 +-
arch/x86/kernel/cpu/microcode/core.c | 12 +-
arch/x86/kernel/cpu/microcode/intel.c | 77 ++++++++-
arch/x86/kernel/cpu/microcode/intel_staging.c | 154 ++++++++++++++++++
arch/x86/kernel/cpu/microcode/internal.h | 5 +-
6 files changed, 247 insertions(+), 12 deletions(-)
create mode 100644 arch/x86/kernel/cpu/microcode/intel_staging.c
--
2.43.0
Hi all,
Changes since the RFC posting [1]:
* Simplified the staging address discovery code. Leveraging the staging
topology, stage only if package id changes (Thomas).
* Cleaned up the MSR read logic (Boris and Dave).
* Renamed functions to align with the do_something naming convention
(Boris).
* Polished staging result messages (Boris).
* Dropped the WBINVD removal as mainlined now.
This series is based on 6.13-rc2. You can also find it from this repo:
git://github.com/intel-staging/microcode.git staging_v1
I would appreciate further reviews and feedback.
Thanks,
Chang
---
Here is the original cover letter with minor updates -- removing the
WBINVD story and updating the function names:
== Latency Spike Issue ==
As microcode images have increased in size, a corresponding rise in load
latency has become inevitable. This latency spike significantly impacts
late loading, which remains in use despite the cautions highlighted in
the documentation [2]. The issue is especially critical for continuously
running workloads and virtual machines, where excessive delays can lead
to timeouts.
== Staging for Latency Reduction ==
Currently, writing to MSR_IA32_UCODE_WRITE triggers the entire update
process -- loading, validating, and activation -- all of which contribute
to the latency during CPU halt. The staging feature mitigates this by
refactoring all but the activation step out of the critical path,
allowing CPUs to continue serving workloads while staging takes place.
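In terms of code flow, the late-load sequence with this series roughly
becomes (see also item 1 under "Call for Review" below):

	request_microcode_fw()            /* prepare the image; CPUs keep running */
	microcode_ops->stage_microcode()  /* new: hand the image to the staging firmware */
	load_late_stop_cpus()
	  -> apply_microcode()            /* only the activation stays on the critical path */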
== Validation ==
We internally established pseudocode to clearly define all essential
steps for interacting with the firmware. Any firmware implementation
supporting staging should adhere to this contract. This patch set
incorporates that staging logic, which I successfully tested on one
firmware implementation. Multiple teams at Intel have also validated the
feature across different implementations.
Preliminary results from a pre-production system show a significant
reduction in latency (about 40%) with the staging approach alone.
Further improvements are possible with additional optimizations [*].
== Call for Review ==
Here are several key points to highlight for feedback:
1. Staging Integration Approach:
In the core code, the high-level sequence for late loading is:
(1) request_microcode_fw(), and
(2) load_late_stop_cpus()->apply_microcode()
Staging doesn't fit neatly into either step, as it involves the
loading process but not the activation. Therefore, a new callback is
introduced:
core::load_late_locked()
-> intel::staging_microcode()
-> intel_staging::do_stage()
2. Code Abstraction:
The newly added intel_staging.c file contains all staging-related
code to keep it self-contained. Ideally, the entire firmware
interaction could eventually be abstracted into a single MSR write,
which remains a long-term goal. Fortunately, recent protocol
simplifications have made this more feasible.
3. Staging Policy (TODO):
While staging is always attempted, the system will fall back to the
legacy update method if staging fails. There is an open question
regarding staging policy: should it be mandatory, without fallback,
in certain usage scenarios? This could lead to further refinements in
the flow depending on feedback and use cases.
4. Specification Updates
Recent specification updates have simplified the staging protocol
and clarified the behavior of MSR_IA32_UCODE_WRITE in conjunction
with staging:
4.1. Protocol Simplification
The specification update [3] has significantly reduced the
complexity of staging code, trimming the kernel code from ~1K lines
in preliminary implementations. Thanks to Dave for guiding this
redesign effort.
4.2. Clarification of Legacy Update Behavior
Chapter 5 of the specification adds further clarification on
MSR_IA32_UCODE_WRITE. Key points are summarized below:
(a) When staging is not performed or has failed, a WRMSR will still load
the patch image, but with higher latency.
(b) During an active staging process, MSR_IA32_UCODE_WRITE can
load a new microcode image, again with higher latency.
(c) If the versions differ between the staged microcode and the
version loaded via MSR_IA32_UCODE_WRITE, the version loaded through
the MSR takes precedence.
I'd also like to make sure there is no further ambiguity in this documentation
[3]. Feel free to provide feedback if anything seems unclear or
unreasonable.
As noted [*], an additional series focused on further latency
optimizations will follow. However, the staging approach was prioritized
due to its significant first-order impact on latency.
[1]: https://lore.kernel.org/all/20241001161042.465584-1-chang.seok.bae@intel.com/
[2]: https://docs.kernel.org/arch/x86/microcode.html#why-is-late-loading-dangerous
[3]: https://cdrdv2.intel.com/v1/dl/getContent/782715
[*]: Further latency improvements will be addressed in the upcoming
‘Uniform’ feature series.
Chang S. Bae (6):
x86/microcode: Introduce staging option to reduce late-loading latency
x86/msr-index: Define MSR index and bit for the microcode staging
feature
x86/microcode/intel: Prepare for microcode staging
x86/microcode/intel_staging: Implement staging logic
x86/microcode/intel_staging: Support mailbox data transfer
x86/microcode/intel: Enable staging when available
arch/x86/include/asm/msr-index.h | 9 ++
arch/x86/kernel/cpu/microcode/Makefile | 2 +-
arch/x86/kernel/cpu/microcode/core.c | 12 +-
arch/x86/kernel/cpu/microcode/intel.c | 53 +++++++
arch/x86/kernel/cpu/microcode/intel_staging.c | 149 ++++++++++++++++++
arch/x86/kernel/cpu/microcode/internal.h | 7 +-
6 files changed, 228 insertions(+), 4 deletions(-)
create mode 100644 arch/x86/kernel/cpu/microcode/intel_staging.c
--
2.45.2
On 12/10/2024 5:42 PM, Chang S. Bae wrote:
>
> This series is based on 6.13-rc2. You can also find it from this repo:
> git://github.com/intel-staging/microcode.git staging_v1
I've rebased on 6.14-rc1 and found no conflict, pushing to the repo:
git://github.com/intel-staging/microcode.git staging_v1-for-6.14-rc1
> 3. Staging Policy (TODO):
>
> While staging is always attempted, the system will fall back to the
> legacy update method if staging fails. There is an open question
> regarding staging policy: should it be mandatory, without fallback,
> in certain usage scenarios? This could lead further refinements in
> the flow depending on feedback and use cases.
While I noted this as "TODO", the policy part may require a separate
follow-up.
At the moment, I’d like to assess its current state and ensure the
review process covers details up to the mailbox handling logic, while
fully respecting the maintainers' schedule.
Also, feedback from those considering this feature is highly appreciated
here. If the series looks good, I hope to collect ack or review tags to
help move it forward.
Thanks,
Chang
As a potential user, I'd advocate for an option to disable legacy
loading if staging is supported.

The difference in quiesce time between staging and legacy loading
can be significant. Since late loading is more of an explicit active
action, I would believe allowing the initiating process freedom of a retry
loop or handling any errors from staging makes sense.

Presumably load_late_locked could return an error on failure
to stage leading to an appropriate print.
On 2/28/25 14:27, Colin Mitchell wrote:
> As a potential user, I'd advocate for an option to disable legacy
> loading if staging is supported.
>
> The difference in quiesce time between staging and legacy loading
> can be significant. Since late loading is more of an explicit active
> action, I would believe allowing the initiating process freedom of a retry
> loop or handling any errors from staging makes sense.
>
> Presumably load_late_locked could return an error on failure
> to stage leading to an appropriate print.

Hey Colin!

You might not have noticed, but this series hasn't even been merged yet.

If this series is an important thing to your employer, I'd really
appreciate some reviews on it from you or anyone else to whom it is
important. That would really speed things along!

Requesting new features to be piled on top of an out-of-tree
under-reviewed patch set is likely to slow things down more than speed
them up.
On Fri, Feb 28, 2025 at 02:27:15PM -0800, Colin Mitchell wrote:
> As a potential user, I'd advocate for an option to disable legacy
> loading if staging is supported.
What happens if staging is failing indefinitely, for whatever reason?
You can't load any microcode anymore if you've disabled the legacy method
too.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 2/28/25 14:52, Borislav Petkov wrote:
> On Fri, Feb 28, 2025 at 02:27:15PM -0800, Colin Mitchell wrote:
>> As a potential user, I'd advocate for an option to disable legacy
>> loading if staging is supported.
>
> What happens if staging is failing indefinitely, for whatever reason?
>
> You can't load any microcode anymore if you've disabled the legacy method
> too.

Yeah, I'm perplexed by this choice too. You seem to be saying that you'd
rather be (for instance) insecure running old microcode than have the
latency blip from a legacy microcode load.

What action would you take if a staging-load fails? Retry again a few
times? Go back to the CPU vendor and get a new image? Or just ignore it?
> On 2/28/25 14:52, Borislav Petkov wrote:
> You can't load any microcode anymore if you've disabled the legacy method
> too.

Staging, if I've read the code correctly here, is only used for late
loading. There is no performance reason to use staging for early kernel
microcode loading pre-SMP. Therefore, if staging perpetually fails, it
can be applied without staging on the next reboot.

> On 2/28/25 15:23, Dave Hansen wrote:
> You seem to be saying that you'd rather be (for instance) insecure
> running old microcode than have the latency blip from a legacy microcode
> load.
> What action would you take if a staging-load fails? Retry again a few
> times? Go back to the CPU vendor and get a new image? Or just ignore it?

That's correct, but the latency tradeoff scales with the platform specific
size of the microcode patch. I'd prefer to have a more deterministic
update path and believe the potential latency blip would be significant
enough to justify the option.

Adding configuration would allow me to handle the error as needed.
A retry loop would be a first step but I could also look to migrate VMs
off the machine if the platform specific latency blip would negatively
affect sensitive guest VMs. While an ideal solution imo would then
allow me to force legacy loading, I could also settle with it being done
through a reboot where early boot would already skip staging.
On 3/26/25 14:29, Colin Mitchell wrote:
>> On 2/28/25 15:23, Dave Hansen wrote:
>> You seem to be saying that you'd rather be (for instance) insecure
>> running old microcode than have the latency blip from a legacy microcode
>> load.
>> What action would you take if a staging-load fails? Retry again a few
>> times? Go back to the CPU vendor and get a new image? Or just ignore it?
> That's correct, but the latency tradeoff scales with the platform specific
> size of the microcode patch. I'd prefer to have a more deterministic
> update path and believe the potential latency blip would be significant
> enough to justify the option.
>
> Adding configuration would allow me to handle the error as needed.
> A retry loop would be a first step but I could also look to migrate VMs
> off the machine if the platform specific latency blip would negatively
> affect sensitive guest VMs. While an ideal solution imo would then
> allow me to force legacy loading, I could also settle with it being done
> through a reboot where early boot would already skip staging.

There's a lot to unpack there. But, for the purposes of this series, I
think what's here is fine for now. Let's leave staging _purely_ as an
opportunistic optimization.

If folks want to make this more configurable like making staging
*mandatory* and disabling legacy loading then we'll look at the patches
(and their justifications) as folks submit them.

A good justification would be something along these lines:

	Legacy microcode loading causes a 5,000ms latency blip. Our
	customers have been complaining to us for years about those
	legacy loading blips. Migrating a VM causes a 1ms latency blip.
	Those 4,999ms mean a lot to the folks running those VMs. As a
	CSP, we would like the flexibility to avoid the gigantic legacy
	microcode loading blips because they are bad and getting worse.

It becomes less compelling if it's something like this:

	Legacy microcode loading causes a 50ms latency blip. Migrating
	a VM causes a 49ms latency blip. That millisecond is super
	important.

... and increasingly less so as it becomes:

	We like knobs and flexibility for $REASONS.

You don't have to have down-to-the-millisecond numbers here. Orders of
magnitude are fine. But if you can't demonstrate (or don't anticipate)
orders of magnitude improvement from the knob, then it's probably not
worth it. It better be a 10x improvement, not 10%.
As microcode patch sizes continue to grow, the latency during
late-loading can spike, leading to timeouts and interruptions in running
workloads. This trend of increasing patch sizes is expected to continue,
so a foundational solution is needed to address the issue.
To mitigate the problem, a new staging feature is introduced. This option
processes most of the microcode update (excluding activation) on a
non-critical path, allowing CPUs to remain operational during the
majority of the update. By moving most of the work off the critical path,
the latency spike can be significantly reduced.
Integrate the staging process as an additional step in the late-loading
flow. Introduce a new callback for staging, which is invoked after the
microcode patch image is prepared but before entering the CPU rendezvous
for triggering the update.
Staging follows an opportunistic model: it is attempted when available.
If successful, it reduces CPU rendezvous time; if not, the process falls
back to the legacy loading, potentially exposing the system to higher
latency.
Extend struct microcode_ops to incorporate staging properties, which will
be updated in the vendor code from subsequent patches.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
RFC-V1 -> V1: Rename the function to the do_something() style (Boris).
Note:
Whether staging should be mandatory is a policy decision that is beyond
the scope of this patch at the moment. For now, the focus is on
establishing a basic flow, with the intention of attracting focused
reviews, while deferring the discussion on staging policy later.
In terms of the flow, an alternative approach could be to integrate
staging as part of microcode preparation on the vendor code side.
However, this was deemed too implicit, as staging involves loading and
validating the microcode image, which differs from typical microcode file
handling.
---
arch/x86/kernel/cpu/microcode/core.c | 12 ++++++++++--
arch/x86/kernel/cpu/microcode/internal.h | 4 +++-
2 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
index b3658d11e7b6..0967fd15be6e 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -676,19 +676,27 @@ static bool setup_cpus(void)
static int load_late_locked(void)
{
+ bool is_safe = false;
+
if (!setup_cpus())
return -EBUSY;
switch (microcode_ops->request_microcode_fw(0, &microcode_pdev->dev)) {
case UCODE_NEW:
- return load_late_stop_cpus(false);
+ break;
case UCODE_NEW_SAFE:
- return load_late_stop_cpus(true);
+ is_safe = true;
+ break;
case UCODE_NFOUND:
return -ENOENT;
default:
return -EBADFD;
}
+
+ if (microcode_ops->use_staging)
+ microcode_ops->stage_microcode();
+
+ return load_late_stop_cpus(is_safe);
}
static ssize_t reload_store(struct device *dev,
diff --git a/arch/x86/kernel/cpu/microcode/internal.h b/arch/x86/kernel/cpu/microcode/internal.h
index 21776c529fa9..b27cb8e1228d 100644
--- a/arch/x86/kernel/cpu/microcode/internal.h
+++ b/arch/x86/kernel/cpu/microcode/internal.h
@@ -31,10 +31,12 @@ struct microcode_ops {
* See also the "Synchronization" section in microcode_core.c.
*/
enum ucode_state (*apply_microcode)(int cpu);
+ void (*stage_microcode)(void);
int (*collect_cpu_info)(int cpu, struct cpu_signature *csig);
void (*finalize_late_load)(int result);
unsigned int nmi_safe : 1,
- use_nmi : 1;
+ use_nmi : 1,
+ use_staging : 1;
};
struct early_load_data {
--
2.45.2
On 12/10/24 17:42, Chang S. Bae wrote:
> +	if (microcode_ops->use_staging)
> +		microcode_ops->stage_microcode();

I don't know if any non-Intel vendors will implement one of these, but
could we please comment this a _bit_? Somebody is going to come along at
some point and ask themselves whether they should add a new staging
handler or dump some new code in an existing one.

The key aspects of the existing staging handler are:

1. Helps the future actual microcode load in some way
2. Has little impact on the rest of the system
3. Can succeed or fail without affecting functionality

Did I miss any?

Could we add a comment here, or maybe even at the Intel
stage_microcode() function explaining this intent?
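Something along these lines, maybe (rough wording only):

	/*
	 * Staging is purely opportunistic: it may pre-process the pending
	 * patch to help the later microcode load, it should have little
	 * impact on the rest of the running system, and the late load
	 * proceeds the same way whether it succeeds or fails.
	 */
	if (microcode_ops->use_staging)
		microcode_ops->stage_microcode();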
On Tue, Dec 10, 2024 at 05:42:07PM -0800, Chang S. Bae wrote:
> static int load_late_locked(void)
> {
> + bool is_safe = false;
> +
> if (!setup_cpus())
> return -EBUSY;
>
> switch (microcode_ops->request_microcode_fw(0, &microcode_pdev->dev)) {
> case UCODE_NEW:
> - return load_late_stop_cpus(false);
> + break;
> case UCODE_NEW_SAFE:
> - return load_late_stop_cpus(true);
> + is_safe = true;
> + break;
> case UCODE_NFOUND:
> return -ENOENT;
> default:
> return -EBADFD;
> }
> +
> + if (microcode_ops->use_staging)
> + microcode_ops->stage_microcode();
> +
> + return load_late_stop_cpus(is_safe);
> }
Why are you even touching this function instead of doing the staging in the
beginning of load_late_stop_cpus()?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 2/17/2025 5:33 AM, Borislav Petkov wrote:
>
> Why are you even touching this function instead of doing the staging in the
> beginning of load_late_stop_cpus()?
I thought staging is logically distinguishable. While
load_late_stop_cpus() currently performs loading when CPUs are
_stopped_, staging occurs on a non-critical path and remains
interruptible. So, the function name itself seems misaligned with the
staging process.
It looks like commit 4b753955e915 ("x86/microcode: Add per CPU result
state") renamed microcode_reload_late() to the current name:
-/*
- * Reload microcode late on all CPUs. Wait for a sec until they
- * all gather together.
- */
-static int microcode_reload_late(void)
+static int load_late_stop_cpus(void)
which primarily narrowed the function’s scope to microcode rendezvous
for late loading.
Given all that, maybe another option is to introduce a wrapper, instead
of modifying load_late_locked() directly, like:
@@ -536,11 +536,6 @@ static int load_late_stop_cpus(bool is_safe)
int old_rev = boot_cpu_data.microcode;
struct cpuinfo_x86 prev_info;
- if (!is_safe) {
- pr_err("Late microcode loading without minimal revision
check.\n");
- pr_err("You should switch to early loading, if
possible.\n");
- }
-
atomic_set(&late_cpus_in, num_online_cpus());
atomic_set(&offline_in_nmi, 0);
loops_per_usec = loops_per_jiffy / (TICK_NSEC / 1000);
@@ -674,6 +669,20 @@ static bool setup_cpus(void)
return true;
}
+static int load_late_apply(bool is_safe)
+{
+ if (!is_safe) {
+ pr_err("Late microcode loading without minimal revision
check.\n");
+ pr_err("You should switch to early loading, if
possible.\n");
+ }
+
+ /* Stage microcode without stopping CPUs */
+ if (microcode_ops->use_staging)
+ microcode_ops->stage_microcode();
+
+ return load_late_stop_cpus(is_safe);
+}
+
static int load_late_locked(void)
{
if (!setup_cpus())
@@ -681,9 +690,9 @@ static int load_late_locked(void)
switch (microcode_ops->request_microcode_fw(0,
&microcode_pdev->dev)) {
case UCODE_NEW:
- return load_late_stop_cpus(false);
+ return load_late_apply(false);
case UCODE_NEW_SAFE:
- return load_late_stop_cpus(true);
+ return load_late_apply(true);
case UCODE_NFOUND:
return -ENOENT;
default:
Thanks,
Chang
On Mon, Feb 17, 2025 at 11:51:28PM -0800, Chang S. Bae wrote:
> On 2/17/2025 5:33 AM, Borislav Petkov wrote:
> >
> > Why are you even touching this function instead of doing the staging in the
> > beginning of load_late_stop_cpus()?
>
> I thought staging is logically distinguishable.
What does that even mean?
> While load_late_stop_cpus() currently performs loading when CPUs are
> _stopped_,
Not entirely - it does preparatory work and then stops the CPUs. Staging could
be part of that prep work.
> staging occurs on a non-critical path and remains interruptible.
So if you really wanna do that and be really "free", then you should do it
outside of load_late_locked() because that runs with the hotplug lock taken.
But then the only thing that matters is, when *exactly* you should stage. If
->request_microcode_fw() fails, staging would be unnecessary work.
So instead of trying too hard, just stick the staging at the beginning of
load_late_stop_cpus() and be done with it.
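IOW, something like this (sketch, untested):

	static int load_late_stop_cpus(bool is_safe)
	{
		...
		/* Stage opportunistically before the CPUs get stopped: */
		if (microcode_ops->use_staging)
			microcode_ops->stage_microcode();

		atomic_set(&late_cpus_in, num_online_cpus());
		...
	}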
Also, if you want to send a patch, don't send it from a mail client which will
mangle it so that it is inapplicable and no one can play with it.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
The microcode staging feature involves two key MSR entities, the
presence of which is indicated by bit 16 of IA32_ARCH_CAPABILITIES:
* Bit 4 in IA32_MCU_ENUMERATION shows the availability of the microcode
staging feature.
* Staging is managed through MMIO registers, with
IA32_MCU_STAGING_MBOX_ADDR MSR specifying the physical address of the
first MMIO register.
Define the MSR index and bit assignments, helping the upcoming staging
code to make use of the hardware feature.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
arch/x86/include/asm/msr-index.h | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3ae84c3b8e6d..2840a2fe340b 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -164,6 +164,10 @@
* Processor MMIO stale data
* vulnerabilities.
*/
+#define ARCH_CAP_MCU_ENUM BIT(16) /*
+ * Indicates the presence of microcode update
+ * feature enumeration and status information
+ */
#define ARCH_CAP_FB_CLEAR BIT(17) /*
* VERW clears CPU fill buffer
* even on MDS_NO CPUs.
@@ -884,6 +888,11 @@
#define MSR_IA32_UCODE_WRITE 0x00000079
#define MSR_IA32_UCODE_REV 0x0000008b
+#define MSR_IA32_MCU_ENUMERATION 0x0000007b
+#define MCU_STAGING BIT(4)
+
+#define MSR_IA32_MCU_STAGING_MBOX_ADDR 0x000007a5
+
/* Intel SGX Launch Enclave Public Key Hash MSRs */
#define MSR_IA32_SGXLEPUBKEYHASH0 0x0000008C
#define MSR_IA32_SGXLEPUBKEYHASH1 0x0000008D
--
2.45.2
On Tue, Dec 10, 2024 at 05:42:08PM -0800, Chang S. Bae wrote:
> The microcode staging feature involves two key MSR entities, the
> presence of which is indicated by bit 16 of IA32_ARCH_CAPABILITIES:
>
> * Bit 4 in IA32_MCU_ENUMERATION shows the availability of the microcode
> staging feature.
>
> * Staging is managed through MMIO registers, with
> IA32_MCU_STAGING_MBOX_ADDR MSR specifying the physical address of the
> first MMIO register.
>
> Define the MSR index and bit assignments, helping the upcoming staging
> code to make use of the hardware feature.
>
> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
> ---
> arch/x86/include/asm/msr-index.h | 9 +++++++++
> 1 file changed, 9 insertions(+)
Merge this patch with the patch where those MSR bits are used - no need for
a separate one.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
When microcode staging is initiated, operations are carried out through
an MMIO interface. Each package has a unique interface specified by the
IA32_MCU_STAGING_MBOX_ADDR MSR, which points to a set of dword-sized
registers.
Prepare staging with the following steps: First, ensure the microcode
image is dword-aligned to correspond with MMIO registers. Next,
identify each MMIO interface based on its per-package scope. Then,
invoke the staging function for each identified interface.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
RFC-V1 -> V1:
* Simplify code by leveraging the architectural per-package staging scope
(Thomas).
* Fix MSR read code (Boris and Dave).
* Rename the staging function: staging_work() -> do_stage() (Boris).
* Polish the result messages (Boris).
* Add a prototype for builds without CONFIG_CPU_SUP_INTEL (Boris).
* Massage the changelog.
Note:
1. Using a direct reference to 'cpu_primary_thread_mask' in
for_each_cpu(...) causes a build error when !CONFIG_SMP. Instead, use
the wrapper function topology_is_primary_thread() to avoid it.
2. Ideally, the do_stage() function would be as simple as a single WRMSR
execution. If this were the case, the staging flow could be completed
with this patch. From this perspective, the software handling for
interacting with the staging firmware has been separated from this
vendor code and moved into a new file dedicated to staging logic.
---
arch/x86/kernel/cpu/microcode/intel.c | 36 ++++++++++++++++++++++++
arch/x86/kernel/cpu/microcode/internal.h | 7 +++++
2 files changed, 43 insertions(+)
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index f3d534807d91..325068bb5524 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -299,6 +299,41 @@ static __init struct microcode_intel *scan_microcode(void *data, size_t size,
return size ? NULL : patch;
}
+static void stage_microcode(void)
+{
+ unsigned int totalsize, pkg_id = UINT_MAX;
+ enum ucode_state state;
+ int cpu;
+ u64 pa;
+
+ totalsize = get_totalsize(&ucode_patch_late->hdr);
+ if (!IS_ALIGNED(totalsize, sizeof(u32)))
+ return;
+
+ /*
+ * The MMIO address is unique per package, and all the SMT
+ * primary threads are online here. Find each MMIO space by
+ * their package ids to avoid duplicate staging.
+ */
+ for_each_cpu(cpu, cpu_online_mask) {
+ if (!topology_is_primary_thread(cpu) || topology_logical_package_id(cpu) == pkg_id)
+ continue;
+ pkg_id = topology_logical_package_id(cpu);
+
+ rdmsrl_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &pa);
+
+ state = do_stage(pa, ucode_patch_late, totalsize);
+ if (state != UCODE_OK) {
+ pr_err("Error: staging failed with %s for CPU%d at package %u.\n",
+ state == UCODE_TIMEOUT ? "timeout" : "error state", cpu, pkg_id);
+ return;
+ }
+ }
+
+ pr_info("Staging of patch revision 0x%x succeeded.\n",
+ ((struct microcode_header_intel *)ucode_patch_late)->rev);
+}
+
static enum ucode_state __apply_microcode(struct ucode_cpu_info *uci,
struct microcode_intel *mc,
u32 *cur_rev)
@@ -627,6 +662,7 @@ static struct microcode_ops microcode_intel_ops = {
.collect_cpu_info = collect_cpu_info,
.apply_microcode = apply_microcode_late,
.finalize_late_load = finalize_late_load,
+ .stage_microcode = stage_microcode,
.use_nmi = IS_ENABLED(CONFIG_X86_64),
};
diff --git a/arch/x86/kernel/cpu/microcode/internal.h b/arch/x86/kernel/cpu/microcode/internal.h
index b27cb8e1228d..158429d80f93 100644
--- a/arch/x86/kernel/cpu/microcode/internal.h
+++ b/arch/x86/kernel/cpu/microcode/internal.h
@@ -120,11 +120,18 @@ void load_ucode_intel_bsp(struct early_load_data *ed);
void load_ucode_intel_ap(void);
void reload_ucode_intel(void);
struct microcode_ops *init_intel_microcode(void);
+static inline enum ucode_state do_stage(u64 pa, void *ucode_ptr, unsigned int totalsize)
+{
+ pr_debug_once("Need to implement the staging code.\n");
+ return UCODE_ERROR;
+}
#else /* CONFIG_CPU_SUP_INTEL */
static inline void load_ucode_intel_bsp(struct early_load_data *ed) { }
static inline void load_ucode_intel_ap(void) { }
static inline void reload_ucode_intel(void) { }
static inline struct microcode_ops *init_intel_microcode(void) { return NULL; }
+static inline enum ucode_state
+do_stage(u64 pa, void *ucode_ptr, unsigned int totalsize) { return UCODE_ERROR; }
#endif /* !CONFIG_CPU_SUP_INTEL */
#endif /* _X86_MICROCODE_INTERNAL_H */
--
2.45.2
On Tue, Dec 10, 2024 at 05:42:09PM -0800, Chang S. Bae wrote:
> +static void stage_microcode(void)
> +{
> + unsigned int totalsize, pkg_id = UINT_MAX;
> + enum ucode_state state;
> + int cpu;
> + u64 pa;
> +
> + totalsize = get_totalsize(&ucode_patch_late->hdr);
> + if (!IS_ALIGNED(totalsize, sizeof(u32)))
> + return;
... lockdep assert hotplug lock held blabla...
> +
> + /*
> + * The MMIO address is unique per package, and all the SMT
> + * primary threads are online here. Find each MMIO space by
> + * their package ids to avoid duplicate staging.
> + */
> + for_each_cpu(cpu, cpu_online_mask) {
> + if (!topology_is_primary_thread(cpu) || topology_logical_package_id(cpu) == pkg_id)
> + continue;
if (!topology_is_primary_thread(cpu) ||
topology_logical_package_id(cpu) == pkg_id)
> + pkg_id = topology_logical_package_id(cpu);
> +
> + rdmsrl_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &pa);
> +
> + state = do_stage(pa, ucode_patch_late, totalsize);
> + if (state != UCODE_OK) {
> + pr_err("Error: staging failed with %s for CPU%d at package %u.\n",
> + state == UCODE_TIMEOUT ? "timeout" : "error state", cpu, pkg_id);
> + return;
> + }
> + }
> +
> + pr_info("Staging of patch revision 0x%x succeeded.\n",
> + ((struct microcode_header_intel *)ucode_patch_late)->rev);
> +}
> +
> static enum ucode_state __apply_microcode(struct ucode_cpu_info *uci,
> struct microcode_intel *mc,
> u32 *cur_rev)
> @@ -627,6 +662,7 @@ static struct microcode_ops microcode_intel_ops = {
> .collect_cpu_info = collect_cpu_info,
> .apply_microcode = apply_microcode_late,
> .finalize_late_load = finalize_late_load,
> + .stage_microcode = stage_microcode,
Btw, you can get rid of ->use_staging and simply check whether
->stage_microcode is not NULL.
> .use_nmi = IS_ENABLED(CONFIG_X86_64),
> };
>
> diff --git a/arch/x86/kernel/cpu/microcode/internal.h b/arch/x86/kernel/cpu/microcode/internal.h
> index b27cb8e1228d..158429d80f93 100644
> --- a/arch/x86/kernel/cpu/microcode/internal.h
> +++ b/arch/x86/kernel/cpu/microcode/internal.h
> @@ -120,11 +120,18 @@ void load_ucode_intel_bsp(struct early_load_data *ed);
> void load_ucode_intel_ap(void);
> void reload_ucode_intel(void);
> struct microcode_ops *init_intel_microcode(void);
> +static inline enum ucode_state do_stage(u64 pa, void *ucode_ptr, unsigned int totalsize)
> +{
> + pr_debug_once("Need to implement the staging code.\n");
> + return UCODE_ERROR;
> +}
> #else /* CONFIG_CPU_SUP_INTEL */
> static inline void load_ucode_intel_bsp(struct early_load_data *ed) { }
> static inline void load_ucode_intel_ap(void) { }
> static inline void reload_ucode_intel(void) { }
> static inline struct microcode_ops *init_intel_microcode(void) { return NULL; }
> +static inline enum ucode_state
> +do_stage(u64 pa, void *ucode_ptr, unsigned int totalsize) { return UCODE_ERROR; }
> #endif /* !CONFIG_CPU_SUP_INTEL */
>
You don't need to export those if staging is going to be done only by the
Intel side. Just keep everything in intel.c
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
The staging firmware operates through a protocol via the MMIO interface.
The protocol defines a serialized sequence that begins by clearing the
hardware with an abort request. It then proceeds through iterative
process of sending data, initiating transactions, waiting for processing,
and reading responses.
To facilitate this interaction, follow the outlined protocol. Refactor
the waiting code to manage loop breaks more effectively. Data transfer
involves a next level of detail to handle the mailbox format. While
defining helpers, leave them empty for now.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
RFC-V1 -> V1: Rename the function and change the return type.
---
arch/x86/kernel/cpu/microcode/Makefile | 2 +-
arch/x86/kernel/cpu/microcode/intel_staging.c | 100 ++++++++++++++++++
arch/x86/kernel/cpu/microcode/internal.h | 6 +-
3 files changed, 102 insertions(+), 6 deletions(-)
create mode 100644 arch/x86/kernel/cpu/microcode/intel_staging.c
diff --git a/arch/x86/kernel/cpu/microcode/Makefile b/arch/x86/kernel/cpu/microcode/Makefile
index 193d98b33a0a..a9f79aaffcb0 100644
--- a/arch/x86/kernel/cpu/microcode/Makefile
+++ b/arch/x86/kernel/cpu/microcode/Makefile
@@ -1,5 +1,5 @@
# SPDX-License-Identifier: GPL-2.0-only
microcode-y := core.o
obj-$(CONFIG_MICROCODE) += microcode.o
-microcode-$(CONFIG_CPU_SUP_INTEL) += intel.o
+microcode-$(CONFIG_CPU_SUP_INTEL) += intel.o intel_staging.o
microcode-$(CONFIG_CPU_SUP_AMD) += amd.o
diff --git a/arch/x86/kernel/cpu/microcode/intel_staging.c b/arch/x86/kernel/cpu/microcode/intel_staging.c
new file mode 100644
index 000000000000..2fc8667cab45
--- /dev/null
+++ b/arch/x86/kernel/cpu/microcode/intel_staging.c
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#define pr_fmt(fmt) "microcode: " fmt
+#include <linux/delay.h>
+#include <linux/io.h>
+
+#include "internal.h"
+
+#define MBOX_REG_NUM 4
+#define MBOX_REG_SIZE sizeof(u32)
+
+#define MBOX_CONTROL_OFFSET 0x0
+#define MBOX_STATUS_OFFSET 0x4
+
+#define MASK_MBOX_CTRL_ABORT BIT(0)
+
+#define MASK_MBOX_STATUS_ERROR BIT(2)
+#define MASK_MBOX_STATUS_READY BIT(31)
+
+#define MBOX_XACTION_LEN PAGE_SIZE
+#define MBOX_XACTION_MAX(imgsz) ((imgsz) * 2)
+#define MBOX_XACTION_TIMEOUT (10 * MSEC_PER_SEC)
+
+#define STAGING_OFFSET_END 0xffffffff
+
+static inline void abort_xaction(void __iomem *base)
+{
+ writel(MASK_MBOX_CTRL_ABORT, base + MBOX_CONTROL_OFFSET);
+}
+
+static void request_xaction(void __iomem *base, u32 *chunk, unsigned int chunksize)
+{
+ pr_debug_once("Need to implement staging mailbox loading code.\n");
+}
+
+static enum ucode_state wait_for_xaction(void __iomem *base)
+{
+ u32 timeout, status;
+
+ for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT; timeout++) {
+ msleep(1);
+ status = readl(base + MBOX_STATUS_OFFSET);
+ if (status & MASK_MBOX_STATUS_READY)
+ break;
+ }
+
+ status = readl(base + MBOX_STATUS_OFFSET);
+ if (status & MASK_MBOX_STATUS_ERROR)
+ return UCODE_ERROR;
+ if (!(status & MASK_MBOX_STATUS_READY))
+ return UCODE_TIMEOUT;
+
+ return UCODE_OK;
+}
+
+static enum ucode_state read_xaction_response(void __iomem *base, unsigned int *offset)
+{
+ pr_debug_once("Need to implement staging response handler.\n");
+ return UCODE_ERROR;
+}
+
+static inline unsigned int get_chunksize(unsigned int totalsize, unsigned int offset)
+{
+ WARN_ON_ONCE(totalsize < offset);
+ return min(MBOX_XACTION_LEN, totalsize - offset);
+}
+
+enum ucode_state do_stage(u64 pa, void *ucode_ptr, unsigned int totalsize)
+{
+ unsigned int xaction_bytes = 0, offset = 0, chunksize;
+ void __iomem *mmio_base;
+ enum ucode_state state;
+
+ mmio_base = ioremap(pa, MBOX_REG_NUM * MBOX_REG_SIZE);
+ if (WARN_ON_ONCE(!mmio_base))
+ return UCODE_ERROR;
+
+ abort_xaction(mmio_base);
+
+ while (offset != STAGING_OFFSET_END) {
+ chunksize = get_chunksize(totalsize, offset);
+ if (xaction_bytes + chunksize > MBOX_XACTION_MAX(totalsize)) {
+ state = UCODE_TIMEOUT;
+ break;
+ }
+
+ request_xaction(mmio_base, ucode_ptr + offset, chunksize);
+ state = wait_for_xaction(mmio_base);
+ if (state != UCODE_OK)
+ break;
+
+ xaction_bytes += chunksize;
+ state = read_xaction_response(mmio_base, &offset);
+ if (state != UCODE_OK)
+ break;
+ }
+
+ iounmap(mmio_base);
+ return state;
+}
diff --git a/arch/x86/kernel/cpu/microcode/internal.h b/arch/x86/kernel/cpu/microcode/internal.h
index 158429d80f93..787524e4ef1e 100644
--- a/arch/x86/kernel/cpu/microcode/internal.h
+++ b/arch/x86/kernel/cpu/microcode/internal.h
@@ -120,11 +120,7 @@ void load_ucode_intel_bsp(struct early_load_data *ed);
void load_ucode_intel_ap(void);
void reload_ucode_intel(void);
struct microcode_ops *init_intel_microcode(void);
-static inline enum ucode_state do_stage(u64 pa, void *ucode_ptr, unsigned int totalsize)
-{
- pr_debug_once("Need to implement the staging code.\n");
- return UCODE_ERROR;
-}
+enum ucode_state do_stage(u64 pa, void *ucode_ptr, unsigned int totalsize);
#else /* CONFIG_CPU_SUP_INTEL */
static inline void load_ucode_intel_bsp(struct early_load_data *ed) { }
static inline void load_ucode_intel_ap(void) { }
--
2.45.2
On Tue, Dec 10, 2024 at 05:42:10PM -0800, Chang S. Bae wrote:
> diff --git a/arch/x86/kernel/cpu/microcode/Makefile b/arch/x86/kernel/cpu/microcode/Makefile
> index 193d98b33a0a..a9f79aaffcb0 100644
> --- a/arch/x86/kernel/cpu/microcode/Makefile
> +++ b/arch/x86/kernel/cpu/microcode/Makefile
> @@ -1,5 +1,5 @@
> # SPDX-License-Identifier: GPL-2.0-only
> microcode-y := core.o
> obj-$(CONFIG_MICROCODE) += microcode.o
> -microcode-$(CONFIG_CPU_SUP_INTEL) += intel.o
> +microcode-$(CONFIG_CPU_SUP_INTEL) += intel.o intel_staging.o
^^^^^^^^^^^^^^^
No need for that guy - just stick everything in intel.c
> +enum ucode_state do_stage(u64 pa, void *ucode_ptr, unsigned int totalsize)
static
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
Remember that the subject prefixes are logical areas of the code, not
filenames.
Bad:
x86/microcode/intel_staging
Good:
x86/microcode/intel
On 12/10/24 17:42, Chang S. Bae wrote:
> The staging firmware operates through a protocol via the MMIO interface.
> The protocol defines a serialized sequence that begins by clearing the
> hardware with an abort request. It then proceeds through iterative
> process of sending data, initiating transactions, waiting for processing,
> and reading responses.
I'm not sure how much value this adds. It's rehashing a few things that
are or should be blatantly obvious from the code.
For instance, mentioning that it's an "MMIO interface" is kinda obvious
from the ioremap() and variable named "mmio_base".
> To facilitate this interaction, follow the outlined protocol. Refactor
> the waiting code to manage loop breaks more effectively. Data transfer
> involves a next level of detail to handle the mailbox format. While
> defining helpers, leave them empty for now.
I'm not sure what's being refactored here.
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/microcode/intel_staging.c
> @@ -0,0 +1,100 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +#define pr_fmt(fmt) "microcode: " fmt
> +#include <linux/delay.h>
> +#include <linux/io.h>
> +
> +#include "internal.h"
> +
> +#define MBOX_REG_NUM 4
> +#define MBOX_REG_SIZE sizeof(u32)
> +
> +#define MBOX_CONTROL_OFFSET 0x0
> +#define MBOX_STATUS_OFFSET 0x4
> +
> +#define MASK_MBOX_CTRL_ABORT BIT(0)
> +
> +#define MASK_MBOX_STATUS_ERROR BIT(2)
> +#define MASK_MBOX_STATUS_READY BIT(31)
> +
> +#define MBOX_XACTION_LEN PAGE_SIZE
> +#define MBOX_XACTION_MAX(imgsz) ((imgsz) * 2)
The MBOX_XACTION_MAX() concept needs to be explained _somewhere_. The
concept as I remember it is:
Each image is broken up into pieces which are at most
MBOX_XACTION_LEN in length. So a 10*MBOX_XACTION_LEN image
will need at least 10 actions. The hardware on the other side of
the mbox has very few resources. It may not be able to cache the
entire image and may ask for the same part of the image more
than once. This means that it may take more than 10 actions to
send a 10-piece image. Allow a 10-piece image to try 20 times to
send the whole thing.
Can we comment that here or in the loop?
That's a rather verbose comment, but it is a kinda complicated thing. I
remember trying to talk the hardware guys out of this, but they were
rather insistent that it's required. But in the end it does make sense
to me that the (relatively) svelte device on the other end of this
mailbox might not have megabytes of spare storage and it's a relatively
simple thing to have the CPU send some data more than once.
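Maybe something concrete like this (numbers assume a 4k MBOX_XACTION_LEN):

	/*
	 * Example: a 64k image needs at least 16 chunk sends.
	 * MBOX_XACTION_MAX(64k) == 128k, so up to 32 chunk-sized sends
	 * (one full round of re-sends) are tolerated before do_stage()
	 * gives up with UCODE_TIMEOUT.
	 */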
> +#define MBOX_XACTION_TIMEOUT (10 * MSEC_PER_SEC)
I find these a lot more self-explanatory when they gave units on them.
#define MBOX_XACTION_TIMEOUT_MS (10 * MSEC_PER_SEC)
plus:
for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++)
msleep(1);
makes it a bit more obvious why there's an msleep(1) if the timeout is
obviously in milliseconds in the first place.
> +#define STAGING_OFFSET_END 0xffffffff
Should this get an explicit type?
> +static inline void abort_xaction(void __iomem *base)
> +{
> + writel(MASK_MBOX_CTRL_ABORT, base + MBOX_CONTROL_OFFSET);
> +}
> +
> +static void request_xaction(void __iomem *base, u32 *chunk, unsigned int chunksize)
> +{
> + pr_debug_once("Need to implement staging mailbox loading code.\n");
> +}
Can we comment this a little more please?
/*
* Callers should use this after a request_xaction() to handle
* explicit errors or timeouts when the hardware does not respond.
*
* Returns UCODE_OK on success.
*/
> +static enum ucode_state wait_for_xaction(void __iomem *base)
> +{
> + u32 timeout, status;
> +
/* Give the hardware time to perform the action: */
> + for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT; timeout++) {
> + msleep(1);
> + status = readl(base + MBOX_STATUS_OFFSET);
/* Break out early if the hardware is ready: */
> + if (status & MASK_MBOX_STATUS_READY)
> + break;
> + }
> +
> + status = readl(base + MBOX_STATUS_OFFSET);
/* The hardware signaled an explicit error: */
> + if (status & MASK_MBOX_STATUS_ERROR)
> + return UCODE_ERROR;
/*
* Hardware neither responded to the action nor
* signaled an error. It must be out to lunch.
*/
> + if (!(status & MASK_MBOX_STATUS_READY))
> + return UCODE_TIMEOUT;
> +
> + return UCODE_OK;
> +}
> +
> +static enum ucode_state read_xaction_response(void __iomem *base, unsigned int *offset)
> +{
> + pr_debug_once("Need to implement staging response handler.\n");
> + return UCODE_ERROR;
> +}
> +
> +static inline unsigned int get_chunksize(unsigned int totalsize, unsigned int offset)
> +{
> + WARN_ON_ONCE(totalsize < offset);
> + return min(MBOX_XACTION_LEN, totalsize - offset);
> +}
I honestly forgot what this is trying to do. Is it trying to break up a
"totalsize" action into pieces that are at most MBOX_XACTION_LEN bytes
in size? But it is also trying to ensure that if it has 10 bytes left
that it doesn't try to request MBOX_XACTION_LEN?
Also, "chunk_size", please. "chunksize" is a bit less readable.
> +enum ucode_state do_stage(u64 pa, void *ucode_ptr, unsigned int totalsize)
"totalsize" is the length of the data at 'ucode_ptr', right? Would the
connection be clearer if we had:
void *ucode_ptr,
int ucode_len_bytes);
? Or even "ucode_len"?
> +{
Could we rename "pa" to "mmio_pa", please? It makes it much more clear
that it's not the "pa" of ucode_ptr or something.
> + unsigned int xaction_bytes = 0, offset = 0, chunksize;
> + void __iomem *mmio_base;
> + enum ucode_state state;
> +
> + mmio_base = ioremap(pa, MBOX_REG_NUM * MBOX_REG_SIZE);
> + if (WARN_ON_ONCE(!mmio_base))
> + return UCODE_ERROR;
> +
> + abort_xaction(mmio_base);
Why is this aborting first? Why doesn't it have to wait for the abort to
complete?
> + while (offset != STAGING_OFFSET_END) {
> + chunksize = get_chunksize(totalsize, offset);
> + if (xaction_bytes + chunksize > MBOX_XACTION_MAX(totalsize)) {
> + state = UCODE_TIMEOUT;
> + break;
> + }
So, "xaction_bytes" is the number of bytes that we tried to send. If it
exceeds MBOX_XACTION_MAX(), then too many retries occurred. That's a
timeout.
Right?
I don't think that's obvious from the code.
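Maybe a comment along these lines (wording is just a suggestion):

	/*
	 * xaction_bytes counts everything sent so far, including chunks
	 * the device asked to have re-sent. Blowing through
	 * MBOX_XACTION_MAX() means the device kept asking for retries,
	 * so treat it as a timeout.
	 */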
> +
> + request_xaction(mmio_base, ucode_ptr + offset, chunksize);
> + state = wait_for_xaction(mmio_base);
> + if (state != UCODE_OK)
> + break;
> +
> + xaction_bytes += chunksize;
> + state = read_xaction_response(mmio_base, &offset);
> + if (state != UCODE_OK)
> + break;
> + }
So, a dumb me would look at this loop and expect it to look like this:
while (need_moar_data) {
if (too_many_retries())
break;
send_data();
}
But it doesn't. There's a two-step process to send data, make sure the
device got the data, and then read a separate response. It _seems_ like
it's double-checking the response.
Could we comment it something more like this to make that more clear?
while (need_moar_data) {
if (too_many_retries())
break;
/* Send the data: */
ret = hw_send_data(offset);
if (ret)
break;
/*
* Ask the hardware which piece of the image it needs
* next. The same piece may be sent more than once.
*/
ret = hw_get_next_offset(&offset);
if (ret)
break;
}
The staging architecture features a narrowed interface for data transfer.
Instead of allocating MMIO space based on data chunk size, it utilizes
two data registers: one for reading and one for writing, enforcing the
serialization of read and write operations. Additionally, it defines a
mailbox data format.
To facilitate data transfer, implement helper functions in line with this
specified format for reading and writing staging data. This mailbox
format is a customized version and is not compatible with the existing
mailbox code, so reuse is not feasible.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
arch/x86/kernel/cpu/microcode/intel_staging.c | 55 ++++++++++++++++++-
1 file changed, 52 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/microcode/intel_staging.c b/arch/x86/kernel/cpu/microcode/intel_staging.c
index 2fc8667cab45..eab6e891db9c 100644
--- a/arch/x86/kernel/cpu/microcode/intel_staging.c
+++ b/arch/x86/kernel/cpu/microcode/intel_staging.c
@@ -3,6 +3,7 @@
#define pr_fmt(fmt) "microcode: " fmt
#include <linux/delay.h>
#include <linux/io.h>
+#include <linux/pci_ids.h>
#include "internal.h"
@@ -11,17 +12,44 @@
#define MBOX_CONTROL_OFFSET 0x0
#define MBOX_STATUS_OFFSET 0x4
+#define MBOX_WRDATA_OFFSET 0x8
+#define MBOX_RDDATA_OFFSET 0xc
#define MASK_MBOX_CTRL_ABORT BIT(0)
+#define MASK_MBOX_CTRL_GO BIT(31)
#define MASK_MBOX_STATUS_ERROR BIT(2)
#define MASK_MBOX_STATUS_READY BIT(31)
+#define MASK_MBOX_RESP_SUCCESS BIT(0)
+#define MASK_MBOX_RESP_PROGRESS BIT(1)
+#define MASK_MBOX_RESP_ERROR BIT(2)
+
+#define MBOX_CMD_LOAD 0x3
+#define MBOX_OBJ_STAGING 0xb
+#define MBOX_HDR (PCI_VENDOR_ID_INTEL | (MBOX_OBJ_STAGING << 16))
+#define MBOX_HDR_SIZE 16
+
#define MBOX_XACTION_LEN PAGE_SIZE
#define MBOX_XACTION_MAX(imgsz) ((imgsz) * 2)
#define MBOX_XACTION_TIMEOUT (10 * MSEC_PER_SEC)
#define STAGING_OFFSET_END 0xffffffff
+#define DWORD_SIZE(s) ((s) / sizeof(u32))
+
+static inline u32 read_mbox_dword(void __iomem *base)
+{
+ u32 dword = readl(base + MBOX_RDDATA_OFFSET);
+
+ /* Inform the read completion to the staging firmware */
+ writel(0, base + MBOX_RDDATA_OFFSET);
+ return dword;
+}
+
+static inline void write_mbox_dword(void __iomem *base, u32 dword)
+{
+ writel(dword, base + MBOX_WRDATA_OFFSET);
+}
static inline void abort_xaction(void __iomem *base)
{
@@ -30,7 +58,18 @@ static inline void abort_xaction(void __iomem *base)
static void request_xaction(void __iomem *base, u32 *chunk, unsigned int chunksize)
{
- pr_debug_once("Need to implement staging mailbox loading code.\n");
+ unsigned int i, dwsize = DWORD_SIZE(chunksize);
+
+ write_mbox_dword(base, MBOX_HDR);
+ write_mbox_dword(base, dwsize + DWORD_SIZE(MBOX_HDR_SIZE));
+
+ write_mbox_dword(base, MBOX_CMD_LOAD);
+ write_mbox_dword(base, 0);
+
+ for (i = 0; i < dwsize; i++)
+ write_mbox_dword(base, chunk[i]);
+
+ writel(MASK_MBOX_CTRL_GO, base + MBOX_CONTROL_OFFSET);
}
static enum ucode_state wait_for_xaction(void __iomem *base)
@@ -55,8 +94,18 @@ static enum ucode_state wait_for_xaction(void __iomem *base)
static enum ucode_state read_xaction_response(void __iomem *base, unsigned int *offset)
{
- pr_debug_once("Need to implement staging response handler.\n");
- return UCODE_ERROR;
+ u32 flag;
+
+ WARN_ON_ONCE(read_mbox_dword(base) != MBOX_HDR);
+ WARN_ON_ONCE(read_mbox_dword(base) != DWORD_SIZE(MBOX_HDR_SIZE));
+
+ *offset = read_mbox_dword(base);
+
+ flag = read_mbox_dword(base);
+ if (flag & MASK_MBOX_RESP_ERROR)
+ return UCODE_ERROR;
+
+ return UCODE_OK;
}
static inline unsigned int get_chunksize(unsigned int totalsize, unsigned int offset)
--
2.45.2
On 12/10/24 17:42, Chang S. Bae wrote:
> The staging architecture features a narrowed interface for data transfer.
> Instead of allocating MMIO space based on data chunk size, it utilizes
> two data registers: one for reading and one for writing, enforcing the
> serialization of read and write operations. Additionally, it defines a
> mailbox data format.
So I'm going back and reading this _after_ all of the code. I don't
think I comprehended what "narrowed interface" or "serialization" really
meant when I read this. I was thinking "serializing instructions".
Maybe something like this would help:
Let's say you want to write 2 bytes of data to a device. Typically, the
device would expose 2 bytes of "wide" MMIO space. To send the data to
the device, you could do:
writeb(buf[0], io_addr + 0);
writeb(buf[1], io_addr + 1);
But this mailbox is a bit different. Instead of having a "wide"
interface where there is separate MMIO space for each word in a
transaction, it has a "narrow" interface where several words are written
to the same spot in MMIO space:
writeb(buf[0], io_addr);
writeb(buf[1], io_addr);
The same goes for the read side.
To me, it's much more akin to talking over a serial port than how I
think of devices attached via MMIO.
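Or, in code terms ('mmio_window' below is made up for the "wide" case):

	/* "Wide" interface: the device exposes a window big enough for the chunk: */
	memcpy_toio(mmio_window, chunk, chunk_size);

	/* This mailbox: every dword funnels through the single data register: */
	for (i = 0; i < chunk_size / sizeof(u32); i++)
		writel(chunk[i], base + MBOX_WRDATA_OFFSET);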
> To facilitate data transfer, implement helper functions in line with this
> specified format for reading and writing staging data. This mailbox
> format is a customized version and is not compatible with the existing
> mailbox code, so reuse is not feasible.
This is getting a bit too flowery.
Implement helper functions that send and receive data to and
from the device.
Note: the kernel has support for similar mailboxes. But none of
them are compatible with this one. Trying to share code resulted
in a bloated mess, so this code is standalone.
> diff --git a/arch/x86/kernel/cpu/microcode/intel_staging.c b/arch/x86/kernel/cpu/microcode/intel_staging.c
> index 2fc8667cab45..eab6e891db9c 100644
> --- a/arch/x86/kernel/cpu/microcode/intel_staging.c
> +++ b/arch/x86/kernel/cpu/microcode/intel_staging.c
> @@ -3,6 +3,7 @@
> #define pr_fmt(fmt) "microcode: " fmt
> #include <linux/delay.h>
> #include <linux/io.h>
> +#include <linux/pci_ids.h>
>
> #include "internal.h"
>
> @@ -11,17 +12,44 @@
>
> #define MBOX_CONTROL_OFFSET 0x0
> #define MBOX_STATUS_OFFSET 0x4
> +#define MBOX_WRDATA_OFFSET 0x8
> +#define MBOX_RDDATA_OFFSET 0xc
>
> #define MASK_MBOX_CTRL_ABORT BIT(0)
> +#define MASK_MBOX_CTRL_GO BIT(31)
>
> #define MASK_MBOX_STATUS_ERROR BIT(2)
> #define MASK_MBOX_STATUS_READY BIT(31)
>
> +#define MASK_MBOX_RESP_SUCCESS BIT(0)
> +#define MASK_MBOX_RESP_PROGRESS BIT(1)
> +#define MASK_MBOX_RESP_ERROR BIT(2)
> +
> +#define MBOX_CMD_LOAD 0x3
> +#define MBOX_OBJ_STAGING 0xb
> +#define MBOX_HDR (PCI_VENDOR_ID_INTEL | (MBOX_OBJ_STAGING << 16))
> +#define MBOX_HDR_SIZE 16
> +
> #define MBOX_XACTION_LEN PAGE_SIZE
> #define MBOX_XACTION_MAX(imgsz) ((imgsz) * 2)
> #define MBOX_XACTION_TIMEOUT (10 * MSEC_PER_SEC)
>
> #define STAGING_OFFSET_END 0xffffffff
> +#define DWORD_SIZE(s) ((s) / sizeof(u32))
> +
> +static inline u32 read_mbox_dword(void __iomem *base)
> +{
> + u32 dword = readl(base + MBOX_RDDATA_OFFSET);
> +
> + /* Inform the read completion to the staging firmware */
> + writel(0, base + MBOX_RDDATA_OFFSET);
> + return dword;
> +}
That comment doesn't quite parse for me. Maybe:
/* Inform the staging firmware that the read is complete: */
BTW, is this kind of read/write normal for MMIO reads? It looks really
goofy to me, but I don't deal with devices much.
> +static inline void write_mbox_dword(void __iomem *base, u32 dword)
> +{
> + writel(dword, base + MBOX_WRDATA_OFFSET);
> +}
>
> static inline void abort_xaction(void __iomem *base)
> {
> @@ -30,7 +58,18 @@ static inline void abort_xaction(void __iomem *base)
>
> static void request_xaction(void __iomem *base, u32 *chunk, unsigned int chunksize)
> {
> - pr_debug_once("Need to implement staging mailbox loading code.\n");
> + unsigned int i, dwsize = DWORD_SIZE(chunksize);
> +
> + write_mbox_dword(base, MBOX_HDR);
> + write_mbox_dword(base, dwsize + DWORD_SIZE(MBOX_HDR_SIZE));
> +
> + write_mbox_dword(base, MBOX_CMD_LOAD);
> + write_mbox_dword(base, 0);
This last write is a mystery. Why is it here?
> +
> + for (i = 0; i < dwsize; i++)
> + write_mbox_dword(base, chunk[i]);
So, again, I'm a device dummy here. But this _looks_ nonsensical like
the code is just scribbling over itself repeatedly by rewriting data at
'base' over and over.
Would a comment like this help?
/*
* 'base' is mapped UnCached (UC). Each write shows up at the device
* as a separate "packet" in program order. That property allows the
* device to reassemble everything in order on its side.
*/
> + writel(MASK_MBOX_CTRL_GO, base + MBOX_CONTROL_OFFSET);
> }
Can we comment the MASK_MBOX_CTRL_GO? Maybe:
/*
* Tell the device that this chunk is
* complete and can be processed.
*/
> static enum ucode_state wait_for_xaction(void __iomem *base)
> @@ -55,8 +94,18 @@ static enum ucode_state wait_for_xaction(void __iomem *base)
>
> static enum ucode_state read_xaction_response(void __iomem *base, unsigned int *offset)
> {
> - pr_debug_once("Need to implement staging response handler.\n");
> - return UCODE_ERROR;
> + u32 flag;
/*
* All responses begin with the same header
* word and are the same length: 4 dwords.
*/
> + WARN_ON_ONCE(read_mbox_dword(base) != MBOX_HDR);
> + WARN_ON_ONCE(read_mbox_dword(base) != DWORD_SIZE(MBOX_HDR_SIZE));
> +
> + *offset = read_mbox_dword(base);
> +
> + flag = read_mbox_dword(base);
> + if (flag & MASK_MBOX_RESP_ERROR)
> + return UCODE_ERROR;
> +
> + return UCODE_OK;
> }
>
> static inline unsigned int get_chunksize(unsigned int totalsize, unsigned int offset)
On 2/18/2025 12:54 PM, Dave Hansen wrote:
>
> BTW, is this kind of read/write normal for MMIO reads? It looks really
> goofy to me, but I don't deal with devices much.

Yeah, looking at some device code, functions like memcpy_toio() and
__iowrite*_copy() seem to assume sufficient MMIO space to copy data from
kernel memory. I also noticed that some other mailbox code calls back
one of these functions.

So, I think your assessment is correct:

> To me, it's much more akin to talking over a serial port than how I
> think of devices attached via MMIO.

Certainly, this staging mailbox looks somewhat similar to a serial port.
I suspect the staging portion is quite resource-limited. That appears to
be one of the key characteristics of this interface, which I should
highlight more comprehensively.

Thanks,
Chang
With the staging code in place, check the relevant MSRs and set the
use_staging flag to allow staging to be invoked from the core
microcode update process.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
---
RFC-V1 -> V1: Massage the enabling message.
---
arch/x86/kernel/cpu/microcode/intel.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 325068bb5524..c988b6f8672f 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -674,6 +674,18 @@ static __init void calc_llc_size_per_core(struct cpuinfo_x86 *c)
llc_size_per_core = (unsigned int)llc_size;
}
+static __init bool staging_available(void)
+{
+ u64 val;
+
+ val = x86_read_arch_cap_msr();
+ if (!(val & ARCH_CAP_MCU_ENUM))
+ return false;
+
+ rdmsrl(MSR_IA32_MCU_ENUMERATION, val);
+ return !!(val & MCU_STAGING);
+}
+
struct microcode_ops * __init init_intel_microcode(void)
{
struct cpuinfo_x86 *c = &boot_cpu_data;
@@ -684,6 +696,11 @@ struct microcode_ops * __init init_intel_microcode(void)
return NULL;
}
+ if (staging_available()) {
+ microcode_intel_ops.use_staging = true;
+ pr_info("Enabled staging feature.\n");
+ }
+
calc_llc_size_per_core(c);
return &microcode_intel_ops;
--
2.45.2