When running an AMD guest on QEMU with > 255 cores, the following FW_BUG
was noticed with recent kernels:

    [Firmware Bug]: CPU 512: APIC ID mismatch. CPUID: 0x0000 APIC: 0x0200

Naveen, Sairaj debugged the cause to commit c749ce393b8f ("x86/cpu: Use
common topology code for AMD") where, after the rework, the initial
APICID was set using the CPUID leaf 0x8000001e EAX[31:0] as opposed to
the value from CPUID leaf 0xb EDX[31:0] previously.

This led us down a rabbit hole of XTOPOEXT vs TOPOEXT support, preferred
order of their parsing, and QEMU nuances like [1] where QEMU 0's out the
CPUID leaf 0x8000001e on CPUs where Core ID crosses 255 fearing a
Core ID collision in the 8 bit field which leads to the reported FW_BUG.

Following were major observations during the debug which the two
patches address respectively:

1. The support for CPUID leaf 0xb is independent of the TOPOEXT feature
   and is rather linked to the x2APIC enablement. On baremetal, this has
   not been a problem since TOPOEXT support (Fam 0x15 and above)
   predates the support for CPUID leaf 0xb (Fam 0x17[Zen2] and above)
   however, in virtualized environment, the support for x2APIC can be
   enabled independent of topoext where QEMU expects the guest to parse
   the topology and the APICID from CPUID leaf 0xb.

2. Since CPUID leaf 0x8000001e cannot represent Core ID without
   collision for guests with > 255 cores, and QEMU 0's out the entire
   leaf when Core ID crosses 255, prefer the initial APICID read from
   the XTOPOEXT leaf before falling back to the APICID from 0x8000001e,
   which is still better than the 8-bit APICID from leaf 0x1 EBX[31:24].

More details are enclosed in the commit logs.

Ideally, these changes should not affect baremetal AMD/Hygon platforms
as they have supported TOPOEXT long before the support for CPUID leaf
0xb and the extended CPUID leaf 0x80000026 (famous last words).

Patches 1 and 4 are yak shaving to explicitly define a raw MSR value used
in the topology parsing bits and simplify the flow around "has_topoext"
when the same can be discovered using X86_FEATURE_XTOPOLOGY.

This series has been tested on baremetal Zen1 (contains topoext but not
0xb leaf), Zen3 (contains both topoext and 0xb leaf), and Zen4 (contains
topoext, 0xb leaf, and 0x80000026 leaf) servers with no changes
observed in "/sys/kernel/debug/x86/topo/" directory.

The series was also tested on 255 and 512 vCPU (each vCPU is an
individual core from QEMU topology being passed) EPYC-Genoa guest with
and without x2apic and topoext enabled and this series solves the FW_BUG
seen on guest with > 255 vCPUs. No changes observed in
"/sys/kernel/debug/x86/topo/" for all other cases without warning.

0xb leaf is provided unconditionally on these guests (with or without
topoext, even with x2apic disabled on guests with <= 255 vCPU).

In all the cases initial_apicid matched the apicid in
"/sys/kernel/debug/x86/topo/" after applying this series.

Relevant bits of QEMU cmdline used during testing are as follows:

    qemu-system-x86_64 \
    -enable-kvm -m 32G -smp cpus=512,cores=512 \
    -cpu EPYC-Genoa,x2apic=on,kvm-msi-ext-dest-id=on,+kvm-pv-unhalt,kvm-pv-tlb-flush,kvm-pv-ipi,kvm-pv-sched-yield,[-topoext] \
    -machine q35,kernel_irqchip=split \
    -global kvm-pit.lost_tick_policy=discard
    ...

References:

[1] https://github.com/qemu/qemu/commit/35ac5dfbcaa4b

Series is based on tip:x86/cpu at commit 65f55a301766 ("x86/CPU/AMD: Add
CPUID faulting support")

---
Changelog v2..v3:

o Patch 1 was added to the series.
o Use cpu_feature_enabled() in Patch 3.
o Rebased on top of tip:x86/cpu.

v2: https://lore.kernel.org/lkml/20250725110622.59743-1-kprateek.nayak@amd.com/

Changelog v1..v2:

o Collected tags from Naveen. (Thank you for testing!)
o Rebased the series on tip:x86/cpu.
o Swapped Patch 1 and Patch 2 from v1.
o Merged the body of two if blocks in Patch 1 to allow for cleanup in
  Patch 3.

v1: https://lore.kernel.org/lkml/20250612072921.15107-1-kprateek.nayak@amd.com/
---
K Prateek Nayak (4):
  x86/msr-index: Define AMD64_CPUID_FN_EXT MSR
  x86/cpu/topology: Use initial APICID from XTOPOEXT on AMD/HYGON
  x86/cpu/topology: Always try cpu_parse_topology_ext() on AMD/Hygon
  x86/cpu/topology: Check for X86_FEATURE_XTOPOLOGY instead of passing
    has_topoext

 arch/x86/include/asm/msr-index.h   |  5 +++
 arch/x86/kernel/cpu/topology_amd.c | 55 ++++++++++++++++--------------
 2 files changed, 34 insertions(+), 26 deletions(-)

base-commit: 5bf2f5119b9e957f773a22f226974166b58cff32
--
2.34.1
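The preference order described in observation 2 of the cover letter can be sketched roughly as below. This is only an illustrative model, not the code from the series: `struct cpuid_snapshot` and `pick_initial_apicid()` are hypothetical names, and the fields stand in for values the kernel would read via CPUID.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the CPUID values relevant to the APIC ID. */
struct cpuid_snapshot {
	bool     xtopology;     /* an extended topology leaf (0x80000026 or 0xb) parsed OK */
	uint32_t xtopo_apicid;  /* APIC ID from that leaf (EDX[31:0] for leaf 0xb) */
	bool     topoext;       /* CPUID 0x80000001 ECX[22] */
	uint32_t leaf_1e_eax;   /* CPUID 0x8000001e EAX[31:0] */
	uint32_t leaf_1_ebx;    /* CPUID 0x1 EBX, 8-bit APIC ID in bits 31:24 */
};

/*
 * Prefer the extended topology leaf, then leaf 0x8000001e,
 * and only then the 8-bit APIC ID from leaf 0x1.
 */
static uint32_t pick_initial_apicid(const struct cpuid_snapshot *s)
{
	if (s->xtopology)
		return s->xtopo_apicid;
	if (s->topoext)
		return s->leaf_1e_eax;
	return s->leaf_1_ebx >> 24;
}
```

With the values from the report (leaf 0xb giving 0x200 while leaf 0x8000001e is zeroed by QEMU), this order yields 0x200, which matches the APIC ID printed in the FW_BUG message.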
Lemme try to make some sense of this because the wild use of names and things
is making my head spin...

On Mon, Aug 18, 2025 at 06:04:31AM +0000, K Prateek Nayak wrote:
> When running an AMD guest on QEMU with > 255 cores, the following FW_BUG
> was noticed with recent kernels:
>
>     [Firmware Bug]: CPU 512: APIC ID mismatch. CPUID: 0x0000 APIC: 0x0200
>
> Naveen, Sairaj debugged the cause to commit c749ce393b8f ("x86/cpu: Use
> common topology code for AMD") where, after the rework, the initial
> APICID was set using the CPUID leaf 0x8000001e EAX[31:0] as opposed to

That's

CPUID_Fn8000001E_ECX [Node Identifiers] (Core::X86::Cpuid::NodeId)

> the value from CPUID leaf 0xb EDX[31:0] previously.

That's

CPUID_Fn0000000B_EDX [Extended Topology Enumeration] (Core::X86::Cpuid::ExtTopEnumEdx)

> This led us down a rabbit hole of XTOPOEXT vs TOPOEXT support, preferred

What is XTOPOEXT?

CPUID_Fn0000000B_EDX?

Please define all your things properly so that we can have common base when
reading this text.

TOPOEXT is, I presume:

#define X86_FEATURE_TOPOEXT ( 6*32+22) /* "topoext" Topology extensions CPUID leafs */

Our PPR says:

CPUID_Fn80000001_ECX [Feature Identifiers] (Core::X86::Cpuid::FeatureExtIdEcx)

"22 TopologyExtensions: topology extensions support. Read-only. Reset:
Fixed,1. 1=Indicates support for Core::X86::Cpuid::CachePropEax0 and
Core::X86::Cpuid::ExtApicId."

Those leafs are:

CPUID_Fn8000001D_EAX_x00 [Cache Properties (DC)] (Core::X86::Cpuid::CachePropEax0)

DC topology info. Probably not important for this here.

and

CPUID_Fn8000001E_EAX [Extended APIC ID] (Core::X86::Cpuid::ExtApicId)

the extended APIC ID is there.

How is this APIC ID different from the extended APIC ID in

CPUID_Fn0000000B_EDX [Extended Topology Enumeration] (Core::X86::Cpuid::ExtTopEnumEdx)

?

> order of their parsing, and QEMU nuances like [1] where QEMU 0's out the
> CPUID leaf 0x8000001e on CPUs where Core ID crosses 255 fearing a
> Core ID collision in the 8 bit field which leads to the reported FW_BUG.

Is that what the hw does though?

Has this been verified instead of willy nilly clearing CPUID leafs in qemu?

> Following were major observations during the debug which the two
> patches address respectively:
>
> 1. The support for CPUID leaf 0xb is independent of the TOPOEXT feature

Yes, PPR says so.

>    and is rather linked to the x2APIC enablement.

Because the SDM says:

"Bits 31-00: x2APIC ID of the current logical processor."

?

Is our version not containing the x2APIC ID?

>    On baremetal, this has
>    not been a problem since TOPOEXT support (Fam 0x15 and above)
>    predates the support for CPUID leaf 0xb (Fam 0x17[Zen2] and above)
>    however, in virtualized environment, the support for x2APIC can be
>    enabled independent of topoext where QEMU expects the guest to parse
>    the topology and the APICID from CPUID leaf 0xb.

So we're fixing a qemu bug?

Why isn't qemu force-enabling TOPOEXT support when one requests x2APIC?

My initial reaction: fix qemu.

Thx.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Aug 19, 2025 at 01:34:47PM +0200, Borislav Petkov wrote:
> Lemme try to make some sense of this because the wild use of names and things
> is making my head spin...
>
> On Mon, Aug 18, 2025 at 06:04:31AM +0000, K Prateek Nayak wrote:
> > When running an AMD guest on QEMU with > 255 cores, the following FW_BUG
> > was noticed with recent kernels:
> >
> >     [Firmware Bug]: CPU 512: APIC ID mismatch. CPUID: 0x0000 APIC: 0x0200
> >
> > Naveen, Sairaj debugged the cause to commit c749ce393b8f ("x86/cpu: Use
> > common topology code for AMD") where, after the rework, the initial
> > APICID was set using the CPUID leaf 0x8000001e EAX[31:0] as opposed to
>
> That's
>
> CPUID_Fn8000001E_ECX [Node Identifiers] (Core::X86::Cpuid::NodeId)
>
> > the value from CPUID leaf 0xb EDX[31:0] previously.
>
> That's
>
> CPUID_Fn0000000B_EDX [Extended Topology Enumeration]
> (Core::X86::Cpuid::ExtTopEnumEdx)

Regardless of the qemu bug with leaf 0x8000001e (with >255 cores),
section '16.12 x2APIC_ID' of the APM says:

  CPUID. The x2APIC ID is reported by CPUID functions Fn0000_000B
  (Extended Topology Enumeration) and CPUID Fn8000_001E (Extended APIC
  ID) as follows:
  - Fn0000_000B_EDX[31:0]_x0 reports the full 32-bit ID, independent of
    APIC mode (i.e. even with APIC disabled)
  - Fn8000_001E_EAX[31:0] conditionally reports APIC ID. There are 3
    cases:
    - 32-bit x2APIC_ID, in x2APIC mode.
    - 8-bit APIC ID (upper 24 bits are 0), in xAPIC mode.
    - 0, if the APIC is disabled.

That suggests use of leaf 0xb for the initial x2APIC ID especially
during early init. I'm not sure why leaf 0x8000001e was preferred over
leaf 0xb in commit c749ce393b8f ("x86/cpu: Use common topology code for
AMD") though.

- Naveen
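The APM cases quoted above can be written down as a small model. This is a sketch for illustration only: the enum and function names are made up here, and modelling the xAPIC case as a plain truncation to 8 bits is an assumption (the APM text only says the upper 24 bits read as 0).

```c
#include <stdint.h>

/* Hypothetical names; these model the APM text, not kernel code. */
enum apic_mode { APIC_DISABLED, APIC_XAPIC, APIC_X2APIC };

/* Fn8000_001E_EAX[31:0]: what is reported depends on the APIC mode. */
static uint32_t fn8000_001e_eax(enum apic_mode mode, uint32_t apic_id)
{
	switch (mode) {
	case APIC_X2APIC:
		return apic_id;          /* full 32-bit x2APIC ID */
	case APIC_XAPIC:
		return apic_id & 0xff;   /* assumption: 8-bit ID, upper 24 bits 0 */
	default:
		return 0;                /* APIC disabled */
	}
}

/* Fn0000_000B_EDX[31:0]: full 32-bit ID, independent of the APIC mode. */
static uint32_t fn0000_000b_edx(enum apic_mode mode, uint32_t apic_id)
{
	(void)mode;
	return apic_id;
}
```

Only the leaf 0xb model returns the same value in all three modes, which is the property that makes it attractive during early init.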
On Wed, Aug 20, 2025 at 01:41:28PM +0530, Naveen N Rao wrote:
> That suggests use of leaf 0xb for the initial x2APIC ID especially
> during early init. I'm not sure why leaf 0x8000001e was preferred over
> leaf 0xb in commit c749ce393b8f ("x86/cpu: Use common topology code for
> AMD") though.

Well, I see parse_topology_amd() calling cpu_parse_topology_ext() if you have
TOPOEXT - which all AMD hw does - which then does cpu_parse_topology_ext() and
that one tries 0x80000026 and then falls back to 0xb and *only* *then* to
0x8000001e.

So, it looks like it DTRT to me...

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Hello Boris,

On 8/20/2025 2:29 PM, Borislav Petkov wrote:
> On Wed, Aug 20, 2025 at 01:41:28PM +0530, Naveen N Rao wrote:
>> That suggests use of leaf 0xb for the initial x2APIC ID especially
>> during early init. I'm not sure why leaf 0x8000001e was preferred over
>> leaf 0xb in commit c749ce393b8f ("x86/cpu: Use common topology code for
>> AMD") though.
>
> Well, I see parse_topology_amd() calling cpu_parse_topology_ext() if you have
> TOPOEXT - which all AMD hw does - which then does cpu_parse_topology_ext() and
> that one tries 0x80000026 and then falls back to 0xb and *only* *then* to
> 0x8000001e.
>
> So, it looks like it DTRT to me...

But parse_8000_001e() then unconditionally overwrites the
"initial_apicid" with the value in 0x8000001E EAX despite it being
populated from cpu_parse_topology_ext().

The flow is as follows:

  parse_topology_amd()
    if (X86_FEATURE_TOPOEXT) /* True */
      has_topoext = cpu_parse_topology_ext(); /* Populates "initial_apicid", returns True */

    /* parse_8000_0008() is never called since has_topoext is true */

    parse_8000_001e()
      if (!X86_FEATURE_TOPOEXT) /* False */
        return;

      /* Proceeds here */
      cpuid_leaf(0x8000001e, &leaf);

      tscan->c->topo.initial_apicid = leaf.ext_apic_id; /*** Overwritten here ***/

--
Thanks and Regards,
Prateek
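The flow above can be reproduced with a small standalone model, using the values from the bug report (leaf 0xb reporting 0x200, leaf 0x8000001e zeroed by QEMU). All names prefixed with `model_` and both structs are hypothetical stand-ins, and the `guard` flag is only an assumption about what "do not overwrite" might look like, not the actual patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the topology state being filled in. */
struct topo_model {
	uint32_t initial_apicid;
};

/* Toy stand-in for the CPUID values the guest observes. */
struct guest_cpuid {
	uint32_t leaf_b_edx;   /* CPUID 0xb EDX[31:0] */
	uint32_t leaf_1e_eax;  /* CPUID 0x8000001e EAX[31:0] */
};

/* Models cpu_parse_topology_ext(): populates initial_apicid, reports success. */
static bool model_parse_topology_ext(struct topo_model *t, const struct guest_cpuid *g)
{
	t->initial_apicid = g->leaf_b_edx;
	return true;
}

/* Models parse_8000_001e(); skip_overwrite is the hypothetical guard. */
static void model_parse_8000_001e(struct topo_model *t, const struct guest_cpuid *g,
				  bool skip_overwrite)
{
	if (skip_overwrite)
		return;
	t->initial_apicid = g->leaf_1e_eax;  /* the overwrite from the flow above */
}

/* Returns the initial_apicid the model ends up with. */
static uint32_t model_parse_topology_amd(const struct guest_cpuid *g, bool guard)
{
	struct topo_model t = { 0 };
	bool has_ext = model_parse_topology_ext(&t, g);

	model_parse_8000_001e(&t, g, guard && has_ext);
	return t.initial_apicid;
}
```

Without the guard the model ends with 0x0000 against an APIC ID of 0x0200, the same pair of values seen in the FW_BUG line; with the guard the value from leaf 0xb survives.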
On Wed, Aug 20, 2025 at 02:42:26PM +0530, K Prateek Nayak wrote:
> Hello Boris,
>
> On 8/20/2025 2:29 PM, Borislav Petkov wrote:
> > On Wed, Aug 20, 2025 at 01:41:28PM +0530, Naveen N Rao wrote:
> >> That suggests use of leaf 0xb for the initial x2APIC ID especially
> >> during early init. I'm not sure why leaf 0x8000001e was preferred over
> >> leaf 0xb in commit c749ce393b8f ("x86/cpu: Use common topology code for
> >> AMD") though.
> >
> > Well, I see parse_topology_amd() calling cpu_parse_topology_ext() if you have
> > TOPOEXT - which all AMD hw does - which then does cpu_parse_topology_ext() and
> > that one tries 0x80000026 and then falls back to 0xb and *only* *then* to
> > 0x8000001e.
> >
> > So, it looks like it DTRT to me...
>
> But parse_8000_001e() then unconditionally overwrites the
> "initial_apicid" with the value in 0x8000001E EAX despite it being
> populated from cpu_parse_topology_ext().
>
> The flow is as follows:
>
>   parse_topology_amd()
>     if (X86_FEATURE_TOPOEXT) /* True */
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>       cpu_parse_topology_ext();

Patch 2 from this patchset, which removes this "if" condition above,
seems to be the right thing to do.

X86_FEATURE_TOPOEXT refers to CPUID 0x80000001.ECX[22] which
advertises the support for 0x8000001D.EAX and 0x8000001E.EAX.

OTOH, the function cpu_parse_topology_ext() parses the topology via
the following CPUIDs in that order

* CPUID 0x1f (Intel Only)
* CPUID 0x80000026 (AMD only)
* CPUID 0xB (Both Intel and AMD)

None of these have anything to do with X86_FEATURE_TOPOEXT. So the
call to cpu_parse_topology_ext() in parse_topology_amd() doesn't have
to be gated by the presence or absence of X86_FEATURE_TOPOEXT.

I agree that QEMU needs to do something better than clearing all the
regs of CPUID 0x8000001E on encountering a topology with more than 256
cores. Or at the very least not clear the CPUID 0x8000001E.EAX which
has the provision to advertise a valid Extended APIC ID.
-- Thanks and Regards gautham.
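The leaf ordering listed in the message above can be sketched as a selection function. This is an illustrative model under the assumptions stated in that message (0x1f is tried on Intel only, 0x80000026 on AMD only, 0xb on both); the type and function names are hypothetical and this is not the kernel's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical types used only for this sketch. */
enum vendor { VENDOR_INTEL, VENDOR_AMD };

struct leaf_avail {
	bool leaf_1f;        /* CPUID 0x1f usable */
	bool leaf_80000026;  /* CPUID 0x80000026 usable */
	bool leaf_b;         /* CPUID 0xb usable */
};

/*
 * Returns the first leaf that would be used in the stated order,
 * or 0 if none is usable. Note that X86_FEATURE_TOPOEXT is not an
 * input at all.
 */
static uint32_t first_topology_leaf(enum vendor v, const struct leaf_avail *a)
{
	if (v == VENDOR_INTEL && a->leaf_1f)
		return 0x1f;
	if (v == VENDOR_AMD && a->leaf_80000026)
		return 0x80000026;
	if (a->leaf_b)
		return 0xb;
	return 0;
}
```

The absence of a TOPOEXT parameter is the point of the sketch: nothing in the selection depends on CPUID 0x80000001.ECX[22].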
On Wed, Aug 20, 2025 at 02:42:26PM +0530, K Prateek Nayak wrote:
>       tscan->c->topo.initial_apicid = leaf.ext_apic_id; /*** Overwritten here ***/

Looks like it shouldn't unconditionally overwrite it but I'll let tglx comment
here.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Hello Boris,

On 8/19/2025 5:04 PM, Borislav Petkov wrote:
> Lemme try to make some sense of this because the wild use of names and things
> is making my head spin...
>
> On Mon, Aug 18, 2025 at 06:04:31AM +0000, K Prateek Nayak wrote:
>> When running an AMD guest on QEMU with > 255 cores, the following FW_BUG
>> was noticed with recent kernels:
>>
>>     [Firmware Bug]: CPU 512: APIC ID mismatch. CPUID: 0x0000 APIC: 0x0200
>>
>> Naveen, Sairaj debugged the cause to commit c749ce393b8f ("x86/cpu: Use
>> common topology code for AMD") where, after the rework, the initial
>> APICID was set using the CPUID leaf 0x8000001e EAX[31:0] as opposed to
>
> That's
>
> CPUID_Fn8000001E_ECX [Node Identifiers] (Core::X86::Cpuid::NodeId)

Small correction here, this is actually,

CPUID_Fn8000001E_EAX [Extended APIC ID] (Core::X86::Cpuid::ExtApicId)

>
>> the value from CPUID leaf 0xb EDX[31:0] previously.
>
> That's
>
> CPUID_Fn0000000B_EDX [Extended Topology Enumeration]
> (Core::X86::Cpuid::ExtTopEnumEdx)
>
>> This led us down a rabbit hole of XTOPOEXT vs TOPOEXT support, preferred
>
> What is XTOPOEXT?
>
> CPUID_Fn0000000B_EDX?
>
> Please define all your things properly so that we can have common base when
> reading this text.

Sorry about that! This should actually be "X86_FEATURE_XTOPOLOGY" which
is a synthetic feature set when topology parsing via one of the
following CPUID leaf is successful:

- 0x1f V2 Extended Topology Enumeration Leaf
  (Intel only)

- 0x80000026 CPUID_Fn80000026_E[A,B,C]X_x0[0...3] [Extended CPU Topology]
  Core::X86::Cpuid::ExCpuTopologyE[a,b,c]x[0..3]
  (AMD only)

- 0xb CPUID_Fn0000000B_E[A,B,C]X_x0[0..2] [Extended Topology Enumeration]
  Core::X86::Cpuid::ExtTopEnumE[a,b,c]x[0..2]
  (Both Intel and AMD)

The parsing of the leaves is tried in the same order as above.

>
> TOPOEXT is, I presume:
>
> #define X86_FEATURE_TOPOEXT ( 6*32+22) /* "topoext" Topology extensions CPUID leafs */
>
> Our PPR says:
>
> CPUID_Fn80000001_ECX [Feature Identifiers] (Core::X86::Cpuid::FeatureExtIdEcx)
>
> "22 TopologyExtensions: topology extensions support. Read-only. Reset:
> Fixed,1. 1=Indicates support for Core::X86::Cpuid::CachePropEax0 and
> Core::X86::Cpuid::ExtApicId."
>
> Those leafs are:
>
> CPUID_Fn8000001D_EAX_x00 [Cache Properties (DC)] (Core::X86::Cpuid::CachePropEax0)
>
> DC topology info. Probably not important for this here.
>
> and
>
> CPUID_Fn8000001E_EAX [Extended APIC ID] (Core::X86::Cpuid::ExtApicId)
>
> the extended APIC ID is there.
>
> How is this APIC ID different from the extended APIC ID in
>
> CPUID_Fn0000000B_EDX [Extended Topology Enumeration] (Core::X86::Cpuid::ExtTopEnumEdx)
>
> ?

On baremetal, they are the same. On QEMU, when we launch a guest with a
topology that contains more than 256 cores on a single socket, QEMU 0s
out all the bits in CPUID_Fn8000001E [1] since it fears a collision in
the "CoreId[7:0]" field of "CPUID_Fn8000001E_EBX [Core Identifiers]
(Core::X86::Cpuid::CoreId)"

Since "CPUID_Fn0000000B_EBX_x01 [Extended Topology Enumeration]" and
"LogProcAtThisLevel[15:0]" can describe a domain with up to 2^16 cores,
the Core ID can always be derived correctly from this even when the
number of cores in the guest topology crosses 256.

>
>> order of their parsing, and QEMU nuances like [1] where QEMU 0's out the
>> CPUID leaf 0x8000001e on CPUs where Core ID crosses 255 fearing a
>> Core ID collision in the 8 bit field which leads to the reported FW_BUG.
>
> Is that what the hw does though?

We don't have baremetal systems with more than 256 cores per socket and
when that happens, I believe the expectation from H/W is to just use
CPUID_Fn80000026 leaf or the CPUID_Fn0000000B leaf.

>
> Has this been verified instead of willy nilly clearing CPUID leafs in qemu?
>
>> Following were major observations during the debug which the two
>> patches address respectively:
>>
>> 1. The support for CPUID leaf 0xb is independent of the TOPOEXT feature
>
> Yes, PPR says so.
>
>>    and is rather linked to the x2APIC enablement.
>
> Because the SDM says:
>
> "Bits 31-00: x2APIC ID of the current logical processor."
>
> ?

SDM Vol. 3A Sec. 11.12.8 "CPUID Extensions And Topology Enumeration" reads:

  For Intel 64 and IA-32 processors that support x2APIC, a value of 1
  reported by CPUID.01H:ECX[21] indicates that the processor supports
  x2APIC and the extended topology enumeration leaf (CPUID.0BH).

  The extended topology enumeration leaf can be accessed by executing
  CPUID with EAX = 0BH. Processors that do not support x2APIC may
  support CPUID leaf 0BH. Software can detect the availability of the
  extended topology enumeration leaf (0BH) by performing two steps:

  1. Check maximum input value for basic CPUID information by executing
     CPUID with EAX= 0. If CPUID.0H:EAX is greater than or equal or 11
     (0BH), then proceed to next step
  2. Check CPUID.EAX=0BH, ECX=0H:EBX is non-zero.

  If both of the above conditions are true, extended topology
  enumeration leaf is available.

>
> Is our version not containing the x2APIC ID?

We too have the Extended APIC ID in both CPUID_Fn0000000B and
CPUID_Fn8000001E_EAX and they both match on baremetal.

The problem is only for virtualized guest whose topology contains more
than 256 cores per socket because of [1]

>
>>    On baremetal, this has
>>    not been a problem since TOPOEXT support (Fam 0x15 and above)
>>    predates the support for CPUID leaf 0xb (Fam 0x17[Zen2] and above)
>>    however, in virtualized environment, the support for x2APIC can be
>>    enabled independent of topoext where QEMU expects the guest to parse
>>    the topology and the APICID from CPUID leaf 0xb.
>
> So we're fixing a qemu bug?
>
> Why isn't qemu force-enabling TOPOEXT support when one requests x2APIC?
>
> My initial reaction: fix qemu.

This is possible, however what should be the right thing for
CPUID_Fn8000001E_EBX [Core Identifiers] (Core::X86::Cpuid::CoreId)?

Should QEMU just wrap and start counting the Core Identifiers again
from 0?

Or

Should QEMU go ahead and populate just the
CPUID_Fn8000001E_EAX [Extended APIC ID] (Core::X86::Cpuid::ExtApicId)
fields and continue to zero out EBX and ECX when CoreID > 255?

[1] https://github.com/qemu/qemu/commit/35ac5dfbcaa4b

--
Thanks and Regards,
Prateek
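The collision concern behind the "wrap" option can be illustrated with plain arithmetic. This is a hedged sketch: the helper names are hypothetical, and wrapping is modelled here as simple truncation to the 8-bit "CoreId[7:0]" field, which is only one of the options raised above.

```c
#include <stdbool.h>
#include <stdint.h>

/* What an 8-bit CoreId field would hold if core numbers simply wrapped. */
static uint8_t core_id_8bit(uint32_t core)
{
	return (uint8_t)(core & 0xff);
}

/* True if two distinct cores would report the same 8-bit CoreId. */
static bool core_ids_collide(uint32_t a, uint32_t b)
{
	return a != b && core_id_8bit(a) == core_id_8bit(b);
}

/* True if value is representable in a field that is 'bits' wide (bits < 32). */
static bool fits_in_field(uint32_t value, unsigned int bits)
{
	return value < (1u << bits);
}
```

Under this model core 256 reports the same CoreId as core 0, while the same core number still fits in a 16-bit field such as "LogProcAtThisLevel[15:0]".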
On Tue, Aug 19, 2025 at 07:58:52PM +0530, K Prateek Nayak wrote: > This is possible, however what should be the right thing for > CPUID_Fn8000001E_EBX [Core Identifiers] (Core::X86::Cpuid::CoreId)? > > Should QEMU just wrap and start counting the Core Identifiers again > from 0? > > Or Should QEMU go ahead and populate just the > CPUID_Fn8000001E_EAX [Extended APIC ID] (Core::X86::Cpuid::ExtApicId) > fields and continue to zero out EBX and ECX when CoreID > 255? I think the right thing to do is what the HW does (or will do), when it gets to more than 256 APIC IDs - "cores" is ambiguous. Perhaps something to discuss with hw folks internally first and then stick to that plan everywhere, qemu included. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette