detect_extended_topology() along with its early() variant is a classic
example of duct tape engineering:

  - It evaluates an array of subleafs with a boatload of local variables
    for the relevant topology levels instead of using an array to save the
    enumerated information and propagate it to the right level.

  - It has no boundary checks for subleafs.

  - It prevents updating the die_id with a crude workaround instead of
    checking for leaf 0xb, which does not provide die information.

  - It's broken vs. the number-of-dies evaluation as it uses:

      num_processors[DIE_LEVEL] / num_processors[CORE_LEVEL]

    which "works" correctly only if none of the intermediate topology
    levels (MODULE/TILE) is enumerated.

There is zero value in trying to "fix" that code as the only proper fix is
to rewrite it from scratch.

Implement a sane parser with proper code documentation, which will be used
for the consolidated topology evaluation in the next step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2: Fixed up the comment alignment for registers - Peterz
---
arch/x86/kernel/cpu/Makefile | 2
arch/x86/kernel/cpu/topology.h | 12 +++
arch/x86/kernel/cpu/topology_ext.c | 136 +++++++++++++++++++++++++++++++++++++
3 files changed, 149 insertions(+), 1 deletion(-)
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -18,7 +18,7 @@ KMSAN_SANITIZE_common.o := n
KCSAN_SANITIZE_common.o := n
obj-y := cacheinfo.o scattered.o
-obj-y += topology_common.o topology.o
+obj-y += topology_common.o topology_ext.o topology.o
obj-y += common.o
obj-y += rdrand.o
obj-y += match.o
--- a/arch/x86/kernel/cpu/topology.h
+++ b/arch/x86/kernel/cpu/topology.h
@@ -16,6 +16,7 @@ void cpu_init_topology(struct cpuinfo_x8
void cpu_parse_topology(struct cpuinfo_x86 *c);
void topology_set_dom(struct topo_scan *tscan, enum x86_topology_domains dom,
unsigned int shift, unsigned int ncpus);
+bool cpu_parse_topology_ext(struct topo_scan *tscan);
static inline u32 topo_shift_apicid(u32 apicid, enum x86_topology_domains dom)
{
@@ -31,4 +32,15 @@ static inline u32 topo_relative_domain_i
return apicid & (x86_topo_system.dom_size[dom] - 1);
}
+/*
+ * Update a domain level after the fact without propagating. Used to fixup
+ * broken CPUID enumerations.
+ */
+static inline void topology_update_dom(struct topo_scan *tscan, enum x86_topology_domains dom,
+ unsigned int shift, unsigned int ncpus)
+{
+ tscan->dom_shifts[dom] = shift;
+ tscan->dom_ncpus[dom] = ncpus;
+}
+
#endif /* ARCH_X86_TOPOLOGY_H */
--- /dev/null
+++ b/arch/x86/kernel/cpu/topology_ext.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/cpu.h>
+
+#include <asm/apic.h>
+#include <asm/memtype.h>
+#include <asm/processor.h>
+
+#include "cpu.h"
+
+enum topo_types {
+ INVALID_TYPE = 0,
+ SMT_TYPE = 1,
+ CORE_TYPE = 2,
+ MODULE_TYPE = 3,
+ TILE_TYPE = 4,
+ DIE_TYPE = 5,
+ DIEGRP_TYPE = 6,
+ MAX_TYPE = 7,
+};
+
+/*
+ * Use a lookup table for the case that there are future types > 6 which
+ * describe an intermediate domain level which does not exist today.
+ *
+ * A table will also be handy to parse the new AMD 0x80000026 leaf which
+ * has defined different domain types, but otherwise uses the same layout
+ * with some of the reserved bits used for new information.
+ */
+static const unsigned int topo_domain_map[MAX_TYPE] = {
+ [SMT_TYPE] = TOPO_SMT_DOMAIN,
+ [CORE_TYPE] = TOPO_CORE_DOMAIN,
+ [MODULE_TYPE] = TOPO_MODULE_DOMAIN,
+ [TILE_TYPE] = TOPO_TILE_DOMAIN,
+ [DIE_TYPE] = TOPO_DIE_DOMAIN,
+ [DIEGRP_TYPE] = TOPO_PKG_DOMAIN,
+};
+
+static inline bool topo_subleaf(struct topo_scan *tscan, u32 leaf, u32 subleaf)
+{
+ unsigned int dom, maxtype = leaf == 0xb ? CORE_TYPE + 1 : MAX_TYPE;
+ struct {
+ // eax
+ u32 x2apic_shift : 5, // Number of bits to shift APIC ID right
+ // for the topology ID at the next level
+ __rsvd0 : 27; // Reserved
+ // ebx
+ u32 num_processors : 16, // Number of processors at current level
+ __rsvd1 : 16; // Reserved
+ // ecx
+ u32 level : 8, // Current topology level. Same as sub leaf number
+ type : 8, // Level type. If 0, invalid
+ __rsvd2 : 16; // Reserved
+ // edx
+ u32 x2apic_id : 32; // X2APIC ID of the current logical processor
+ } sl;
+
+ cpuid_subleaf(leaf, subleaf, &sl);
+
+ if (!sl.num_processors || sl.type == INVALID_TYPE)
+ return false;
+
+ if (sl.type >= maxtype) {
+ /*
+ * As the subleafs are ordered in domain level order, this
+ * could be recovered in theory by propagating the
+ * information at the last parsed level.
+ *
+ * But if the infinite wisdom of hardware folks decides to
+ * create a new domain type between CORE and MODULE or DIE
+ * and DIEGRP, then that would overwrite the CORE or DIE
+ * information.
+ *
+ * It really would have been too obvious to make the domain
+ * type space sparse and leave a few reserved types between
+ * the points which might change instead of forcing
+ * software to either create a monstrosity of workarounds
+ * or just being up the creek without a paddle.
+ *
+ * Refuse to implement monstrosity, emit an error and try
+ * to survive.
+ */
+ pr_err_once("Topology: leaf 0x%x:%d Unknown domain type %u\n",
+ leaf, subleaf, sl.type);
+ return true;
+ }
+
+ dom = topo_domain_map[sl.type];
+ if (!dom) {
+ tscan->c->topo.initial_apicid = sl.x2apic_id;
+ } else if (tscan->c->topo.initial_apicid != sl.x2apic_id) {
+ pr_warn_once(FW_BUG "CPUID leaf 0x%x subleaf %d APIC ID mismatch %x != %x\n",
+ leaf, subleaf, tscan->c->topo.initial_apicid, sl.x2apic_id);
+ }
+
+ topology_set_dom(tscan, dom, sl.x2apic_shift, sl.num_processors);
+ return true;
+}
+
+static bool parse_topology_leaf(struct topo_scan *tscan, u32 leaf)
+{
+ u32 subleaf;
+
+ if (tscan->c->cpuid_level < leaf)
+ return false;
+
+ /* Read all available subleafs and populate the levels */
+ for (subleaf = 0; topo_subleaf(tscan, leaf, subleaf); subleaf++);
+
+ /* If subleaf 0 failed to parse, give up */
+ if (!subleaf)
+ return false;
+
+ /*
+ * There are machines in the wild which have shift 0 in the subleaf
+ * 0, but advertise 2 logical processors at that level. They are
+ * truly SMT.
+ */
+ if (!tscan->dom_shifts[TOPO_SMT_DOMAIN] && tscan->dom_ncpus[TOPO_SMT_DOMAIN] > 1) {
+ unsigned int sft = get_count_order(tscan->dom_ncpus[TOPO_SMT_DOMAIN]);
+
+ pr_warn_once(FW_BUG "CPUID leaf 0x%x subleaf 0 has shift level 0 but %u CPUs\n",
+ leaf, tscan->dom_ncpus[TOPO_SMT_DOMAIN]);
+ topology_update_dom(tscan, TOPO_SMT_DOMAIN, sft, tscan->dom_ncpus[TOPO_SMT_DOMAIN]);
+ }
+
+ set_cpu_cap(tscan->c, X86_FEATURE_XTOPOLOGY);
+ return true;
+}
+
+bool cpu_parse_topology_ext(struct topo_scan *tscan)
+{
+	/* Try leaf 0x1F first. If not available try leaf 0x0b */
+ if (parse_topology_leaf(tscan, 0x1f))
+ return true;
+ return parse_topology_leaf(tscan, 0x0b);
+}
Hi, Thomas,

On Wed, 2023-08-02 at 12:21 +0200, Thomas Gleixner wrote:
> detect_extended_topology() along with its early() variant is a classic
> example of duct tape engineering:
>
> [...]
>
> Implement a sane parser with proper code documentation, which will be
> used for the consolidated topology evaluation in the next step.
>
> [...]
>
> +static const unsigned int topo_domain_map[MAX_TYPE] = {
> +	[SMT_TYPE]	= TOPO_SMT_DOMAIN,
> +	[CORE_TYPE]	= TOPO_CORE_DOMAIN,
> +	[MODULE_TYPE]	= TOPO_MODULE_DOMAIN,
> +	[TILE_TYPE]	= TOPO_TILE_DOMAIN,
> +	[DIE_TYPE]	= TOPO_DIE_DOMAIN,
> +	[DIEGRP_TYPE]	= TOPO_PKG_DOMAIN,

May I know why DIEGRP_TYPE is mapped to TOPO_PKG_DOMAIN?

> +};
>
> [...]
>
> +	if (sl.type >= maxtype) {

It is still legal to have sparse type values in the future, and then
this check will break.

IMO, it is better to use a function to convert type to domain, and
check for an unknown domain here, say, something like

diff --git a/arch/x86/kernel/cpu/topology_ext.c b/arch/x86/kernel/cpu/topology_ext.c
index 5ddc5d24435e..7720a7bc7478 100644
--- a/arch/x86/kernel/cpu/topology_ext.c
+++ b/arch/x86/kernel/cpu/topology_ext.c
@@ -26,14 +26,27 @@ enum topo_types {
  * has defined different domain types, but otherwise uses the same layout
  * with some of the reserved bits used for new information.
  */
-static const unsigned int topo_domain_map[MAX_TYPE] = {
-	[SMT_TYPE]	= TOPO_SMT_DOMAIN,
-	[CORE_TYPE]	= TOPO_CORE_DOMAIN,
-	[MODULE_TYPE]	= TOPO_MODULE_DOMAIN,
-	[TILE_TYPE]	= TOPO_TILE_DOMAIN,
-	[DIE_TYPE]	= TOPO_DIE_DOMAIN,
-	[DIEGRP_TYPE]	= TOPO_PKG_DOMAIN,
-};
+static enum x86_topology_domains topo_type_to_domain(int type)
+{
+	switch (type) {
+	case SMT_TYPE:
+		return TOPO_SMT_DOMAIN;
+	case CORE_TYPE:
+		return TOPO_CORE_DOMAIN;
+	case MODULE_TYPE:
+		return TOPO_MODULE_DOMAIN;
+	case TILE_TYPE:
+		return TOPO_TILE_DOMAIN;
+	case DIE_TYPE:
+		return TOPO_DIE_DOMAIN;
+	case DIEGRP_TYPE:
+		return TOPO_PKG_DOMAIN;
+	default:
+		return TOPO_MAX_DOMAIN;
+	}
+}

 static inline bool topo_subleaf(struct topo_scan *tscan, u32 leaf, u32 subleaf)
 {
@@ -59,7 +72,8 @@ static inline bool topo_subleaf(struct topo_scan *tscan, u32 leaf, u32 subleaf)
 	if (!sl.num_processors || sl.type == INVALID_TYPE)
 		return false;

-	if (sl.type >= maxtype) {
+	dom = topo_type_to_domain(sl.type);
+	if (dom == TOPO_MAX_DOMAIN) {
 		/*
 		 * As the subleafs are ordered in domain level order, this
 		 * could be recovered in theory by propagating the
@@ -84,7 +98,6 @@ static inline bool topo_subleaf(struct topo_scan *tscan, u32 leaf, u32 subleaf)
 		return true;
 	}

-	dom = topo_domain_map[sl.type];
 	if (!dom) {
 		tscan->c->topo.initial_apicid = sl.x2apic_id;
 	} else if (tscan->c->topo.initial_apicid != sl.x2apic_id) {

> +		/*
> +		 * As the subleafs are ordered in domain level order, this
> +		 * could be recovered in theory by propagating the
> +		 * information at the last parsed level.
> +		 *
> +		 * But if the infinite wisdom of hardware folks decides to
> +		 * create a new domain type between CORE and MODULE or DIE
> +		 * and DIEGRP, then that would overwrite the CORE or DIE
> +		 * information.

Sorry that I'm confused here.

Say we have CORE, FOO, MODULE; then the subleaf of FOO must be higher
than CORE but lower than MODULE. So we parse CORE first and propagate
the info to FOO/MODULE, then parse FOO and propagate to MODULE, and
parse MODULE in the end. How could we overwrite the info of a lower
level?

> +		 *
> +		 * It really would have been too obvious to make the domain
> +		 * type space sparse and leave a few reserved types between
> +		 * the points which might change instead of forcing
> +		 * software to either create a monstrosity of workarounds
> +		 * or just being up the creek without a paddle.

Agreed. With a sparse type space, we know the relationship between
different types without knowing what the types really mean.

> +		 *
> +		 * Refuse to implement monstrosity, emit an error and try
> +		 * to survive.
> +		 */
> +		pr_err_once("Topology: leaf 0x%x:%d Unknown domain type %u\n",
> +			    leaf, subleaf, sl.type);
> +		return true;

Don't want to be TLDR, but I can think of a couple of cases that break
Linux in different ways if we ignore the CPU topology info of an
unknown level.

So I just want to understand the strategy here: does this mean that
we're not looking for a future proof solution, and instead we are
planning to take future updates (patch enum topo_types/enum
x86_topology_domains/topo_domain_map) whenever a new level is invented?

TBH, I'm still thinking of a future proof proposal here.

Currently, Linux only cares about pkg_id/core_id/die_id, and the
relationship between these three levels.

1. For package id:
   pkg_id_low = FOO.x2apic_shift (FOO is the highest enumerated level,
   no matter whether its type is known or not)

2. For core_id: as the SMT level is always enumerated,
   core_id_low = SMT.x2apic_shift, core_id_high = pkg_id_low - 1

3. For die_id: make the Linux die *OPTIONAL*.
   When DIE is enumerated via CPUID.1F,
   die_id_low = FOO.x2apic_shift (FOO is the next enumerated lower
   level of DIE, no matter whether its type is known or not),
   die_id_high = pkg_id_low - 1.
   When DIE is not enumerated via CPUID.1F, then the Linux die does
   not exist; adjust the die related topology information, say,
   die_id = -1, topology_max_dies_per_package = 0, etc., and don't
   expose the die sysfs I/F.

With this, we can guarantee that all the available topology information
is always valid, even when running on future platforms.

What do you think?

thanks,
rui
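The bit-range scheme Rui proposes above can be sketched roughly as follows. This is an illustrative model only, not kernel code: the `decode` helper, the subleaf tuples and the APIC ID values are all made up for the example; the only assumptions taken from the thread are that subleaf 0 is always SMT and that the topmost enumerated level's shift delimits the package ID bits.

```python
# Rough sketch of deriving pkg_id and a system-wide unique core_id from
# an APIC ID, given only the enumerated (type, shift) pairs. Every domain
# ID is a bit range of the APIC ID delimited by the enumerated shifts,
# so unknown intermediate types do not matter for these two IDs.

def decode(apicid, subleafs):
    """subleafs: list of (type_name, x2apic_shift) in CPUID.1F subleaf order."""
    smt_shift = subleafs[0][1]     # level 0 is always the SMT level
    top_shift = subleafs[-1][1]    # highest enumerated level, known type or not
    pkg_id = apicid >> top_shift   # bits above the topmost enumerated level
    core_id = apicid >> smt_shift  # strip only SMT bits: globally unique
    return pkg_id, core_id

# Hypothetical system: SMT shift 1, CORE shift 5, DIE shift 9.
pkg, core = decode(0b1001101011, [("SMT", 1), ("CORE", 5), ("DIE", 9)])
assert (pkg, core) == (1, 0b100110101)
```

The point of the sketch is that pkg_id and core_id stay decodable even when an unknown FOO level sits somewhere between CORE and the package; only die_id needs the extra "optional" treatment discussed above.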
On Sat, Aug 12 2023 at 08:21, Rui Zhang wrote:
> On Wed, 2023-08-02 at 12:21 +0200, Thomas Gleixner wrote:
>> +static const unsigned int topo_domain_map[MAX_TYPE] = {
>> +	[SMT_TYPE]	= TOPO_SMT_DOMAIN,
>> +	[CORE_TYPE]	= TOPO_CORE_DOMAIN,
>> +	[MODULE_TYPE]	= TOPO_MODULE_DOMAIN,
>> +	[TILE_TYPE]	= TOPO_TILE_DOMAIN,
>> +	[DIE_TYPE]	= TOPO_DIE_DOMAIN,
>> +	[DIEGRP_TYPE]	= TOPO_PKG_DOMAIN,
>
> May I know why DIEGRP_TYPE is mapped to TOPO_PKG_DOMAIN?

Where else should it go? It's the topmost, no? But diegrp is a
terminology which is not used in the kernel.

>> +	if (sl.type >= maxtype) {
>
> It is still legal to have sparse type values in the future, and then
> this check will break.
> IMO, it is better to use a function to convert type to domain, and
> check for an unknown domain here, say, something like

Why? If somewhere in the future Intel decides to add UBER_TILE_TYPE,
then this will be a type larger than DIEGRP_TYPE. maxtype will then
cover the whole thing and the table will map it to the right place.

Even if in their infinite wisdom the HW folks decide to make a gap,
then the table can handle it simply by putting an invalid value into
the gap and checking for that. Seriously, we don't need a switch case
for that.

>> +		/*
>> +		 * As the subleafs are ordered in domain level order, this
>> +		 * could be recovered in theory by propagating the
>> +		 * information at the last parsed level.
>> +		 *
>> +		 * But if the infinite wisdom of hardware folks decides to
>> +		 * create a new domain type between CORE and MODULE or DIE
>> +		 * and DIEGRP, then that would overwrite the CORE or DIE
>> +		 * information.
>
> Sorry that I'm confused here.
>
> Say we have CORE, FOO, MODULE; then the subleaf of FOO must be higher
> than CORE but lower than MODULE. So we parse CORE first and propagate
> the info to FOO/MODULE, then parse FOO and propagate to MODULE, and
> parse MODULE in the end. How could we overwrite the info of a lower
> level?

We don't know about this new thing yet. So where should we propagate
to? We could say, last was core, so we stick the new thing into module,
but do we know that's correct? Do we know there is actually a module?

Let me rephrase that comment.

>> +		 * Refuse to implement monstrosity, emit an error and try
>> +		 * to survive.
>> +		 */
>> +		pr_err_once("Topology: leaf 0x%x:%d Unknown domain type %u\n",
>> +			    leaf, subleaf, sl.type);
>> +		return true;
>
> Don't want to be TLDR, but I can think of a couple of cases that break
> Linux in different ways if we ignore the CPU topology info of an
> unknown level.

Come on. If Intel manages to create a new level, then it's not rocket
science to integrate support for it a long time before actual silicon
ships. So what will break? The machines of people who use ancient
kernels on modern hardware? They can keep the pieces.

> So I just want to understand the strategy here: does this mean that
> we're not looking for a future proof solution, and instead we are
> planning to take future updates (patch enum topo_types/enum
> x86_topology_domains/topo_domain_map) whenever a new level is
> invented?

You need that anyway, no?

> With this, we can guarantee that all the available topology
> information is always valid, even when running on future platforms.

I know that it can be made to work, but is it worth the extra effort?
I don't think so.

Thanks,

	tglx
On Sat, Aug 12 2023 at 22:04, Thomas Gleixner wrote:
> On Sat, Aug 12 2023 at 08:21, Rui Zhang wrote:
>> With this, we can guarantee that all the available topology
>> information is always valid, even when running on future platforms.
>
> I know that it can be made to work, but is it worth the extra effort?
> I don't think so.

So I thought more about it. For intermediate levels, i.e. something
which is squeezed between two existing levels, this works, by some
definition of works.

I.e. in the example where we have UBER_TILE between TILE and DIE, we'd
set and propagate the UBER_TILE entry into the DIE slot and then
overwrite it again if there is a DIE entry too.

Where it becomes interesting is when the unknown level is past DIEGRP,
e.g. DIEGRP_CONGLOMERATE; then we'd need to overwrite the DIEGRP
level, right?

It can be done, but I don't know whether it buys us much for the
purely theoretical case of new levels being added.

Thanks,

	tglx
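The propagate-and-overwrite mechanic described in the mail above can be sketched as a toy model. Everything here is hypothetical: the `KNOWN` map, the `parse` helper and the shift values are invented for illustration, and the only behavior modeled is "park an unknown type in the slot above the last known level, let a later real entry overwrite it".

```python
# Toy model of parsing CPUID.1F subleafs in order, parking an unknown
# intermediate type in the next known slot so a later real entry can
# overwrite it. Not kernel code.
KNOWN = {"SMT": 0, "CORE": 1, "MODULE": 2, "TILE": 3, "DIE": 4, "PKG": 5}
BY_RANK = {v: k for k, v in KNOWN.items()}

def parse(subleafs):
    """subleafs: list of (type_name, shift) in enumeration order."""
    shifts = {}
    last_known = "SMT"
    for typ, shift in subleafs:
        if typ in KNOWN:
            shifts[typ] = shift      # may overwrite a parked unknown level
            last_known = typ
        else:
            # Unknown type: park it one slot above the last known level.
            shifts[BY_RANK[KNOWN[last_known] + 1]] = shift
    return shifts

# UBER_TILE lands in the DIE slot first, then the real DIE overwrites it:
s = parse([("SMT", 1), ("CORE", 5), ("TILE", 7), ("UBER_TILE", 8), ("DIE", 9)])
assert s["DIE"] == 9 and s["TILE"] == 7
```

This also makes the failure mode visible: if the unknown level were past DIEGRP, the parked entry would land in the package slot and a later known level could no longer correct it, which is exactly the case the mail calls "interesting".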
On Sun, 2023-08-13 at 17:04 +0200, Thomas Gleixner wrote:
> On Sat, Aug 12 2023 at 22:04, Thomas Gleixner wrote:
>> On Sat, Aug 12 2023 at 08:21, Rui Zhang wrote:
>>> With this, we can guarantee that all the available topology
>>> information is always valid, even when running on future platforms.
>>
>> I know that it can be made to work, but is it worth the extra
>> effort? I don't think so.
>
> So I thought more about it. For intermediate levels, i.e. something
> which is squeezed between two existing levels, this works, by some
> definition of works.

This "some definition of works" includes parsing the unknown levels,
right?

> I.e. in the example where we have UBER_TILE between TILE and DIE,
> we'd set and propagate the UBER_TILE entry into the DIE slot and then
> overwrite it again if there is a DIE entry too.

Well, not really. If we have TILE/UBER_TILE/DIE in CPUID but only
support TILE/DIE in the kernel, the UBER_TILE information is
overwritten. But UBER_TILE tells us the starting bit in the APIC ID
for die_id.

Say,

	level	type		eax.shifts
	0	SMT		1
	1	CORE		5
	2	TILE		7
	3	UBER_TILE	8
	4	DIE		9

This is a 1 package system with 2 dies, each die has 2 uber_tiles and
each uber_tile has 2 tiles.

If we don't support uber_tile, what we want to see is a platform with
2 dies where each die has 4 tiles. But topo_shift_apicid() uses
x86_topo_system.dom_shifts[TILE], so what we see is a platform with 4
dies where each die has 2 tiles. And this is broken.

IMO, what we really need for each domain in x86_topo_system is
dom_size and dom_offset (the ID bit offset in the APIC ID), and when
parsing domain A, we can propagate its eax.shifts to the dom_offset of
its upper level domains.

With this, we set dom_offset[DIE] to 7 first when parsing TILE, then
overwrite it to 8 when parsing UBER_TILE, and set dom_offset[PACKAGE]
to 9 when parsing DIE.

Losing TILE.eax.shifts is okay, because it is for the UBER_TILE id.

> Where it becomes interesting is when the unknown level is past
> DIEGRP, e.g. DIEGRP_CONGLOMERATE; then we'd need to overwrite the
> DIEGRP level, right?
>
> It can be done, but I don't know whether it buys us much for the
> purely theoretical case of new levels being added.

Similar to the previous case, DIEGRP_CONGLOMERATE.eax.shifts can be
propagated to dom_offset[PACKAGE].

But, still, there is one case that we cannot handle (the reason I'm
proposing optional die support in Linux).

Say we have a new level FOO, and the CPUID is like this:

	level	type	eax.shifts
	0	SMT	1
	1	CORE	5
	2	FOO	8

This can be a system with
1. 1 die and 8 FOOs, if DIE is the upper level of FOO, or
2. 8 FOOs with 1 die in each FOO, if DIE is the lower level of FOO.

Currently, die topology information is mandatory in Linux, and we
cannot make it right without patching enum topo_types/enum
x86_topology_domains/topo_domain_map (which in fact tells the
relationship between DIE and FOO).

thanks,
rui
> On Sun, 2023-08-13 at 17:04 +0200, Thomas Gleixner wrote:
>
> With this, we set dom_offset[DIE] to 7 first when parsing TILE, then
> overwrite it to 8 when parsing UBER_TILE, and set dom_offset[PACKAGE]
> to 9 when parsing DIE.
>
> Losing TILE.eax.shifts is okay, because it is for the UBER_TILE id.

No. That's just wrong. TILE is defined and potentially used in the
kernel. How can you rightfully assume that UBER_TILE is a valid
substitution? You can't.

> Currently, die topology information is mandatory in Linux, and we
> cannot make it right without patching enum topo_types/enum
> x86_topology_domains/topo_domain_map (which in fact tells the
> relationship between DIE and FOO).

You cannot just nilly willy assume at which domain level FOO sits.
Look at your example:

> Say we have a new level FOO, and the CPUID is like this:
>
>	level	type	eax.shifts
>	0	SMT	1
>	1	CORE	5
>	2	FOO	8

FOO can be anything between CORE and PKG, so you cannot tell what it
means.

Simple heuristics _cannot_ be correct by definition. So why try to
come up with them just because?

What's the problem you are trying to solve? Some real world issue or
some academic thought experiment which might never become a real
problem?

Thanks,

	tglx
Hi, Thomas,

On Mon, 2023-08-14 at 14:26 +0200, Thomas Gleixner wrote:
>> With this, we set dom_offset[DIE] to 7 first when parsing TILE, then
>> overwrite it to 8 when parsing UBER_TILE, and set dom_offset[PACKAGE]
>> to 9 when parsing DIE.
>>
>> Losing TILE.eax.shifts is okay, because it is for the UBER_TILE id.
>
> No. That's just wrong. TILE is defined and potentially used in the
> kernel.

Sure.

> How can you rightfully assume that UBER_TILE is a valid substitution?
> You can't.

TILE.eax.shifts tells us
1. the maximum number of addressable threads in the TILE domain, which
   should be saved in x86_topo_system.dom_size[TILE]
2. the highest bit in the APIC ID for the tile id, but we don't need
   this if we use a package/system scope unique tile id
3. the lowest bit in the APIC ID for the upper level of tile:
   if the upper level is a known level, say, die, this info is saved
   in dom_offset[die];
   if the upper level is an unknown level, then we don't need this to
   decode the topology information for the unknown level.

Maybe I missed something, but for now I don't see how things break
here.

>> Currently, die topology information is mandatory in Linux, and we
>> cannot make it right without patching enum topo_types/enum
>> x86_topology_domains/topo_domain_map (which in fact tells the
>> relationship between DIE and FOO).
>
> You cannot just nilly willy assume at which domain level FOO sits.

Exactly.

> Look at your example:
>
>> Say we have a new level FOO, and the CPUID is like this:
>>
>>	level	type	eax.shifts
>>	0	SMT	1
>>	1	CORE	5
>>	2	FOO	8
>
> FOO can be anything between CORE and PKG, so you cannot tell what it
> means.

Exactly. Anything related to MODULE/TILE/DIE can break in this case.

Say this is a system with 1 package, 2 FOOs, 8 cores. In the current
design (in this patch set), the kernel has to tell how many
dies/tiles/modules this system has, and the kernel cannot do this
right.

But if using an optional die (and surely optional module/tile), the
kernel can tell that this is a 1-package-0-die-0-tile-0-module-8-core
system before knowing what FOO means; we don't need to make up
anything we don't know.

> Simple heuristics _cannot_ be correct by definition. So why try to
> come up with them just because?
>
> What's the problem you are trying to solve? Some real world issue or
> some academic thought experiment which might never become a real
> problem?

Maybe I was misleading previously. IMO, I totally agree with your
points, and "using optional die/tile/module" is what I propose to
address these concerns.

thanks,
rui
On Mon, Aug 14 2023 at 15:28, Rui Zhang wrote:
> On Mon, 2023-08-14 at 14:26 +0200, Thomas Gleixner wrote:
>> What's the problem you are trying to solve? Some real world issue or
>> some academic thought experiment which might never become a real
>> problem?
>
> Maybe I was misleading previously. IMO, I totally agree with your
> points, and "using optional die/tile/module" is what I propose to
> address these concerns.

That's exactly what's implemented. If module, tile and die are not
advertised, then you end up with:

	N threads
	N/2 cores
	1 module
	1 tile
	1 die

in a package, because the bits occupied by module, tile and die are
exactly 0. But from a conceptual and consistency point of view they
exist, no?

Thanks,

	tglx
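The degenerate case in the mail above can be sketched numerically. The shift values here are invented for illustration; the only assumption from the thread is that a non-enumerated level inherits the shift of the next enumerated level, so the bit range for its ID is empty and exactly one instance exists per package.

```python
# Sketch: only SMT and CORE are enumerated, so MODULE/TILE/DIE collapse
# onto the package boundary. Illustrative shifts, not kernel values.
smt_shift, core_shift = 1, 4       # SMT-capable, 8 cores per package
pkg_shift = core_shift             # no module/tile/die bits enumerated

# MODULE/TILE/DIE occupy an empty bit range -> one instance per package:
ndies_per_pkg = 1 << (pkg_shift - core_shift)   # 2**0 == 1
assert ndies_per_pkg == 1

# The tglx example: N threads, N/2 cores, 1 module, 1 tile, 1 die
nthreads = 1 << pkg_shift                        # 16 threads
ncores = 1 << (pkg_shift - smt_shift)            # 8 cores
assert (nthreads, ncores) == (16, 8)
```

This is why the non-enumerated levels still "exist" conceptually: their IDs are well defined (always 0 relative to the package) even though no CPUID subleaf describes them.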
> What's the problem you are trying to solve? Some real world issue or
> some academic thought experiment which might never become a real
> problem?

There are several existing bugs, bad practices, and latent future bugs
in today's x86 topology code.

First, with multiple cores having the same core_id, a consumer of
core_id must know about packages to understand core_id.

This is the original sin of the current interface -- which should
never have used the word "sibling" *anyplace*, because to make sense
of the word sibling, you must know what *contains* the siblings. We
introduced "core_cpus" a number of years ago to address this for
core_ids (and for other levels, such as die_cpus). Unfortunately, we
can probably never actually delete thread_siblings and core_siblings
without breaking some program someplace...

core_id should be system-wide global, just like the cpu number is
system-wide global. Today, it is the only level id that is not
system-wide global. This could be implemented by simply not masking
off the package_id bits when creating the core_id, like we have done
for other levels.

Yes, this could be awkward for some existing code that indexes arrays
with core_id and doesn't like them to be sparse. But that rough edge
is a much smaller problem than having to comprehend a level (package)
that you may otherwise not care about. Besides, core_ids can already
be sparse today.

Secondly, with the obsolescence of CPUID.0B and its replacement with
CPUID.1F, the contract between the hardware and the software is that a
level can appear in between any existing levels. (The only exception
is that SMT is married to core.) It is not possible for an old kernel
to know the name or position of a new level in the hierarchy, going
forward.

Today, this manifests with a (currently) latent bug that I caused, for
I implemented die_id in the style of package_id, and I shouldn't have
followed that example.

Today, if CPUID.1F doesn't know anything about multiple dies, Linux
conjures up a die_id 0 in sysfs. It should not. The reason is that
when CPUID.1F enumerates a level that legacy code doesn't know about,
we can't possibly tell if it is above DIE or below DIE. If it is above
DIE, then our default die_id 0 becomes bogus.

That said, I have voiced my objection inside Intel to the creation of
random levels which do not have an architectural (software)
definition; and I'm advocating that they be *removed* from the SDM
until a software programming definition that spans all generations is
documented.

SMT, core, module, die and the (implicit) package may not be well
documented, but they do have existing uses and will thus live on. The
others, maybe not.

-Len
Len! On Mon, Aug 14 2023 at 14:48, Len Brown wrote: > First, with multiple cores having the same core_id. > A consumer of core_id must know about packages to understand core_id. > > This is the original sin of the current interface -- which should > never have used the word "sibling" *anyplace*, Because to make sense > of the word sibling, you must know what *contains* the siblings. You're conflating things. The fact that core_id is relative to the package has absolutely nothing to do with the concept of siblings. Whether sibling is the right terminology or not is a problem on its own. Fact is that CPUs are structured in different domain levels and that structuring makes some sense as it reflects the actual hardware. The question whether this structuring has a relevance for software or not, is independent of that. That's something which needs to be defined and there are certainly aspects which affect scheduling, while others affect power or other entities and some affect all of them. > We introduced "core_cpus" a number of years ago to address this for > core_ids (and for other levels, Such as die_cpus). Unfortunately, we > can probably never actually delete threads_siblings and core_siblings > Without breaking some program someplace... Sorry, but "core_cpus" is just a different terminology. I'm not seeing how this is solving anything. > Core_id should be system-wide global, just like the cpu number is > system wide global. Today, it is the only level id that is not system > wide global. That's simply not true. cpu_info::die_id is package relative too, which is silly to begin with and caused to add this noisy logical_die_id muck. > This could be implemented by simply not masking off the package_id > bits when creating the core_id, You have to shift it right by one, if the system is SMT capable, or just use the APIC ID if it's not (i.e. 0x1f subleaf 0 has a shift of 0). Not more not less. 
Alternatively, all IDs are not shifted right at all and just the bits below the actual level are masked off.

> Like we have done for other levels.

Which one exactly? The only level ID which is truly system-wide unique is the package ID. The die ID is not, the core ID is not, and there are no other levels the current upstream code is dealing with.

> Yes, this could be awkward for some existing code that indexes arrays
> with core_id, and doesn't like them to be sparse. But that rough edge
> is a much smaller problem than having to comprehend a level (package)
> that you may otherwise not care about. Besides, core_ids can already
> be sparse today.

It's not awkward. It's just a matter of auditing all the places which care about the core ID and fixing them up in case they can't deal with it. I went through the die ID usage before making the die ID unique to ensure that it won't break anything. I surely have mopped up more complex things than that, so where is the problem doing the same for the core ID?

> Secondly, with the obsolescence of CPUID.0b and its replacement with
> CPUID.1F, the contract between the hardware and the software is that
> a level can appear in between any existing levels (the only exception
> is that SMT is married to core).

In theory, yes. But what's the practical relevance that there might be a new level between CORE and MODULE or MODULE and TILE etc.?

> It is not possible for an old kernel to know the name or position of
> a new level in the hierarchy, going forward.

Again, where is the practical problem? These new levels are not going to be declared willy-nilly every other week, right?

> Today, this manifests with a (currently) latent bug that I caused.
> For I implemented die_id in the style of package_id, and I shouldn't
> have followed that example.

You did NOT. You implemented die_id relative to the package, which does not make it unique, in the same way as core_id is relative to the package and therefore not unique.
Package ID is unique, and the only reason why the logical package ID exists is because there have been systems with massive gaps between the package IDs. That could have been handled differently, but that's again a different story.

> Today, if CPUID.1F doesn't know anything about multiple DIE, Linux
> conjures up a die_id 0 in sysfs. It should not. The reason is that
> when CPUID.1F enumerates a level that legacy code doesn't know about,
> we can't possibly tell if it is above DIE, or below DIE. If it is
> above DIE, then our default die_id 0 becomes bogus.

That's an implementation problem, and the code I posted fixes this by making die_id unique and taking the documented domain levels into account. So if 0x1f does not enumerate dies, then each package has one die and the die ID is the same as the package ID. It's that simple.

> That said, I have voiced my objection inside Intel to the creation of
> random levels which do not have an architectural (software)
> definition; and I'm advocating that they be *removed* from the SDM
> until a software programming definition that spans all generations is
> documented.
>
> SMT, core, module, die and the (implicit) package may not be well
> documented, but they do have existing uses and will thus live on. The
> others maybe not.

Why remove them? If there is no documentation for using them, then software will ignore them, but they reflect how the hardware is built, and according to conversations with various people this topology reflects other things which are undocumented.

What do you win by removing them from the SDM? Absolutely nothing. Actually you lose, because if they get added later with information on how software should use those levels, then the whole problem I discussed with Rui about imaginary new domain levels surfacing in the future is there sooner than later. If we deal with them correctly today and ignore them for now, then kernels will just work on systems which enumerate them; they just won't make use of these levels.
The amount of extra storage to handle them is marginal and really not worth debating.

Thanks,

        tglx
Hello Thomas,

It seems we need to take a momentary step back to step forward...

First, the Intel CPUID context...

Even though CPUID.B was created to be extensible, we found that adding a "die" level to it would break legacy software. That is because some legacy software did silly things, such as hard-coding that the package level is always adjacent to the core level...

Enter CPUID.1F -- an exact clone of CPUID.B, but with a new name. The new name guaranteed that the old broken software would not parse CPUID.1F, and gave Intel license to add levels to CPUID.1F at any time without confusing CPUID.1F parsing software. As 3-year-old kernels routinely run on the very latest hardware, this future-proof goal is paramount.

Multi-die/package systems shipped as the first test of CPUID.1F. Enumerating the multi-die/package was mostly about MSR scope... In retrospect, we under-specified what it means to enumerate a CPUID.1F die, because it has been a constant battle to get the HW people to *not* enumerate hidden die that software does not see.

Indeed, we were equally guilty in not codifying an architectural definition of "module" and "tile", which were placed into the CPUID.1F definition mostly as place-holders, with awareness of hardware structures that were already in common use. For example, there were already module-scoped counters that were hard-coded, and enumerating modules seemed to be a way to give architectural (re-usable) enumeration to model-specific code.

Second, failings of the Linux topology code...

I agree with you that "thread_siblings" and "core_cpus" are different words for the same thing. This will always be true because the hardware architecture guarantees that SMT siblings are the next level down from core. But no such definition exists for "core_siblings". It is impossible to write correct software that reads "core_siblings" and takes any action on it.
Those could be the CPUs inside a module, or inside a die, or inside some other level that today's software can't possibly know by name. On the other hand, die_cpus is clear -- the CPUs within a die. package_cpus -- the CPUs within a package. core_cpus -- the CPUs within a core... Words matter.

Specific replies...

Re: globally unique core_id

I have 100% confidence that you can make the Linux kernel handle a sparse, globally unique core_id name space. My concern is unknown exposure to the joe-random-user-space program that consumes the sysfs representation.

>> Secondly, with the obsolescence of CPUID.0b and its replacement with
>> CPUID.1F, the contract between the hardware and the software is that
>> a level can appear in between any existing levels (the only exception
>> is that SMT is married to core).

> In theory, yes. But what's the practical relevance that there might be
> a new level between CORE and MODULE or MODULE and TILE etc.?

>> It is not possible for an old kernel to know the name or position of
>> a new level in the hierarchy, going forward.

> Again, where is the practical problem? These new levels are not going
> to be declared willy-nilly every other week, right?

It is irrelevant if a new level is of any practical use to Linux. What is important is that Linux be able to parse and use the levels it finds useful, while gracefully ignoring any that it doesn't care about (or doesn't yet know about).

Yes, hardware folks can drop something into the ucode and the SDM w/o us knowing ahead of time (see DieGrp in the June 2023 SDM). Certainly they can do it in well under the 4 years' notice we'd need if we were to simply track the named levels in the SDM.

>> Today, this manifests with a (currently) latent bug that I caused.
>> For I implemented die_id in the style of package_id, and I shouldn't
>> have followed that example.

> You did NOT.
> You implemented die_id relative to the package, which does not make
> it unique, in the same way as core_id is relative to the package and
> therefore not unique.

The point is that, like package_id=0 on a single-package system, I put a die_id=0 attribute in sysfs even when NO "die" level is enumerated in CPUID.1F. That was a mistake.

>> Today, if CPUID.1F doesn't know anything about multiple DIE, Linux
>> conjures up a die_id 0 in sysfs. It should not. The reason is that
>> when CPUID.1F enumerates a level that legacy code doesn't know about,
>> we can't possibly tell if it is above DIE, or below DIE. If it is
>> above DIE, then our default die_id 0 becomes bogus.

> That's an implementation problem and the code I posted fixes this by
> making die_id unique and taking the documented domain levels into
> account.

Your code change does not fix the problem above.

> So if 0x1f does not enumerate dies, then each package has one die and
> the die ID is the same as the package ID. It's that simple.

Unfortunately, no. Your code will be written and ship before level-X is defined. A couple of years later, level-X is defined above die. Your code runs on new hardware that enumerates level-X, but no packages and no die. How many die_ids does this system have?

If you could see into the future, you'd answer that there are 2 dies, because there is one inside each level-X. But since die isn't enumerated, and you don't know if a level-X is defined to be above or below die, then you can't tell if level-X is something containing die, or something contained by die...

The proper solution is to not expose a die_id attribute in sysfs if there is no die level enumerated in CPUID.1F. When it is enumerated, we get it right. When it is not enumerated, we don't guess.

> What do you win by removing them from the SDM?

When you give HW people enough rope to hang themselves, they will.
Give them something vague in the SDM, and you've created a monster that is interpreted differently by different hardware teams, and no validation team on the planet can figure out if the hardware is correct or not. Then the definition becomes how the OS (possibly not Linux) happened to use that interface on some past chip -- and that use is not documented in the SDM -- and down the rabbit hole you go...

When the SDM precisely documents the software/hardware interface, then proper tests can be written, independent hardware teams are forced to follow the same definition, and correct software can be written once and never break.

-Len