From: Arnd Bergmann <arnd@arndb.de>
With cx8 and tsc being mandatory features, the only important
architectural features are now cmov and pae.
Change the large list of target CPUs to no longer pick the instruction set
itself but only the mtune= optimization level and in-kernel optimizations
that remain compatible with all cores.
The CONFIG_X86_CMOV instead becomes user-selectable and is now how
Kconfig picks between 586-class (Pentium, Pentium MMX, K6, C3, GeodeGX)
and 686-class (everything else) targets.
In order to allow running on late 32-bit cores (Athlon, Pentium-M,
Pentium 4, ...), the X86_L1_CACHE_SHIFT can no longer be set to anything
lower than 6 (i.e. 64 byte cache lines).
The optimization options now depend on X86_CMOV and X86_PAE instead
of the other way round, while other compile-time conditionals that
checked for MATOM/MGEODEGX1 instead now check for CPU_SUP_* options
that enable support for a particular CPU family.
Link: https://lore.kernel.org/lkml/dd29df0c-0b4f-44e6-b71b-2a358ea76fb4@app.fastmail.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
This is what I had in mind as mentioned in the earlier thread on
cx8/tsc removal. I based this on top of Ingo's [RFC 15/15]
patch.
---
arch/x86/Kconfig | 2 +-
arch/x86/Kconfig.cpu | 100 ++++++++++++++------------------
arch/x86/Makefile_32.cpu | 48 +++++++--------
arch/x86/include/asm/vermagic.h | 36 +-----------
arch/x86/kernel/tsc.c | 2 +-
arch/x86/xen/Kconfig | 1 -
drivers/misc/mei/Kconfig | 2 +-
7 files changed, 74 insertions(+), 117 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a9d717558972..1e33f88c9b97 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1438,7 +1438,7 @@ config HIGHMEM
config X86_PAE
bool "PAE (Physical Address Extension) Support"
- depends on X86_32 && X86_HAVE_PAE
+ depends on X86_32 && X86_CMOV
select PHYS_ADDR_T_64BIT
help
PAE is required for NX support, and furthermore enables
diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 6f1e8cc8fe58..0619566de93f 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -1,23 +1,32 @@
# SPDX-License-Identifier: GPL-2.0
# Put here option for CPU selection and depending optimization
-choice
- prompt "x86-32 Processor family"
- depends on X86_32
- default M686
+
+config X86_CMOV
+ bool "Require 686-class CMOV instructions" if X86_32
+ default y
help
- This is the processor type of your CPU. This information is
- used for optimizing purposes. In order to compile a kernel
- that can run on all supported x86 CPU types (albeit not
- optimally fast), you can specify "586" here.
+ Most x86-32 processor implementations are compatible with
+ the CMOV instruction originally added in the Pentium Pro,
+ and they perform much better when using it.
+
+ Disable this option to build for 586-class CPUs without this
+ instruction. This is only required for the original Intel
+ Pentium (P5, P54, P55), AMD K6/K6-II/K6-3D, Geode GX1 and Via
+ CyrixIII/C3 CPUs.
Note that the 386 and 486 is no longer supported, this includes
AMD/Cyrix/Intel 386DX/DXL/SL/SLC/SX, Cyrix/TI 486DLC/DLC2,
UMC 486SX-S and the NexGen Nx586, AMD ELAN and all 486 based
CPUs.
- The kernel will not necessarily run on earlier architectures than
- the one you have chosen, e.g. a Pentium optimized kernel will run on
- a PPro, but not necessarily on a i486.
+choice
+ prompt "x86-32 Processor optimization"
+ depends on X86_32
+ default X86_GENERIC
+ help
+ This is the processor type of your CPU. This information is
+ used for optimizing purposes, but does not change compatibility
+ with other CPU types.
Here are the settings recommended for greatest speed:
- "586" for generic Pentium CPUs lacking the TSC
@@ -45,14 +54,13 @@ choice
config M586TSC
bool "Pentium-Classic"
- depends on X86_32
+ depends on X86_32 && !X86_CMOV
help
- Select this for a Pentium Classic processor with the RDTSC (Read
- Time Stamp Counter) instruction for benchmarking.
+ Select this for a Pentium Classic processor.
config M586MMX
bool "Pentium-MMX"
- depends on X86_32
+ depends on X86_32 && !X86_CMOV
help
Select this for a Pentium with the MMX graphics/multimedia
extended instructions.
@@ -117,7 +125,7 @@ config MPENTIUM4
config MK6
bool "K6/K6-II/K6-III"
- depends on X86_32
+ depends on X86_32 && !X86_CMOV
help
Select this for an AMD K6-family processor. Enables use of
some extended instructions, and passes appropriate optimization
@@ -125,7 +133,7 @@ config MK6
config MK7
bool "Athlon/Duron/K7"
- depends on X86_32
+ depends on X86_32 && !X86_PAE
help
Select this for an AMD Athlon K7-family processor. Enables use of
some extended instructions, and passes appropriate optimization
@@ -147,42 +155,37 @@ config MEFFICEON
config MGEODEGX1
bool "GeodeGX1"
- depends on X86_32
+ depends on X86_32 && !X86_CMOV
help
Select this for a Geode GX1 (Cyrix MediaGX) chip.
config MGEODE_LX
bool "Geode GX/LX"
- depends on X86_32
+ depends on X86_32 && !X86_PAE
help
Select this for AMD Geode GX and LX processors.
config MCYRIXIII
bool "CyrixIII/VIA-C3"
- depends on X86_32
+ depends on X86_32 && !X86_CMOV
help
Select this for a Cyrix III or C3 chip. Presently Linux and GCC
treat this chip as a generic 586. Whilst the CPU is 686 class,
it lacks the cmov extension which gcc assumes is present when
generating 686 code.
- Note that Nehemiah (Model 9) and above will not boot with this
- kernel due to them lacking the 3DNow! instructions used in earlier
- incarnations of the CPU.
config MVIAC3_2
bool "VIA C3-2 (Nehemiah)"
- depends on X86_32
+ depends on X86_32 && !X86_PAE
help
Select this for a VIA C3 "Nehemiah". Selecting this enables usage
of SSE and tells gcc to treat the CPU as a 686.
- Note, this kernel will not boot on older (pre model 9) C3s.
config MVIAC7
bool "VIA C7"
- depends on X86_32
+ depends on X86_32 && !X86_PAE
help
- Select this for a VIA C7. Selecting this uses the correct cache
- shift and tells gcc to treat the CPU as a 686.
+ Select this for a VIA C7.
config MATOM
bool "Intel Atom"
@@ -192,20 +195,19 @@ config MATOM
accordingly optimized code. Use a recent GCC with specific Atom
support in order to fully benefit from selecting this option.
-endchoice
-
config X86_GENERIC
- bool "Generic x86 support"
- depends on X86_32
+ bool "Generic x86"
help
- Instead of just including optimizations for the selected
+ Instead of just including optimizations for a particular
x86 variant (e.g. PII, Crusoe or Athlon), include some more
generic optimizations as well. This will make the kernel
- perform better on x86 CPUs other than that selected.
+ perform better on a variety of CPUs.
This is really intended for distributors who need more
generic optimizations.
+endchoice
+
#
# Define implied options from the CPU selection here
config X86_INTERNODE_CACHE_SHIFT
@@ -216,17 +218,14 @@ config X86_INTERNODE_CACHE_SHIFT
config X86_L1_CACHE_SHIFT
int
default "7" if MPENTIUM4
- default "6" if MK7 || MPENTIUMM || MATOM || MVIAC7 || X86_GENERIC || X86_64
- default "4" if MGEODEGX1
- default "5" if MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MVIAC3_2 || MGEODE_LX
+ default "6"
config X86_F00F_BUG
- def_bool y
- depends on M586MMX || M586TSC || M586
+ def_bool !X86_CMOV
config X86_ALIGNMENT_16
def_bool y
- depends on MCYRIXIII || MK6 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODEGX1
+ depends on MCYRIXIII || MK6 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODEGX1 || (!X86_CMOV && X86_GENERIC)
config X86_INTEL_USERCOPY
def_bool y
@@ -234,34 +233,23 @@ config X86_INTEL_USERCOPY
config X86_USE_PPRO_CHECKSUM
def_bool y
- depends on MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MATOM
+ depends on MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MATOM || (X86_CMOV && X86_GENERIC)
config X86_TSC
def_bool y
-config X86_HAVE_PAE
- def_bool y
- depends on MCRUSOE || MEFFICEON || MCYRIXIII || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC7 || MATOM || X86_64
-
config X86_CX8
def_bool y
-# this should be set for all -march=.. options where the compiler
-# generates cmov.
-config X86_CMOV
- def_bool y
- depends on (MK7 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || MATOM || MGEODE_LX || X86_64)
-
config X86_MINIMUM_CPU_FAMILY
int
default "64" if X86_64
- default "6" if X86_32 && (MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MATOM || MK7)
- default "5" if X86_32
- default "4"
+ default "6" if X86_32 && X86_CMOV
+ default "5"
config X86_DEBUGCTLMSR
def_bool y
- depends on !(MK6 || MCYRIXIII || M586MMX || M586TSC || M586) && !UML
+ depends on X86_CMOV && !UML
config IA32_FEAT_CTL
def_bool y
@@ -297,7 +285,7 @@ config CPU_SUP_INTEL
config CPU_SUP_CYRIX_32
default y
bool "Support Cyrix processors" if PROCESSOR_SELECT
- depends on M586 || M586TSC || M586MMX || (EXPERT && !64BIT)
+ depends on !64BIT
help
This enables detection, tunings and quirks for Cyrix processors
diff --git a/arch/x86/Makefile_32.cpu b/arch/x86/Makefile_32.cpu
index f5e933077bf4..ebd7ec6eaf34 100644
--- a/arch/x86/Makefile_32.cpu
+++ b/arch/x86/Makefile_32.cpu
@@ -10,30 +10,32 @@ else
align := -falign-functions=0 -falign-jumps=0 -falign-loops=0
endif
-cflags-$(CONFIG_M586TSC) += -march=i586
-cflags-$(CONFIG_M586MMX) += -march=pentium-mmx
-cflags-$(CONFIG_M686) += -march=i686
-cflags-$(CONFIG_MPENTIUMII) += -march=i686 $(call tune,pentium2)
-cflags-$(CONFIG_MPENTIUMIII) += -march=i686 $(call tune,pentium3)
-cflags-$(CONFIG_MPENTIUMM) += -march=i686 $(call tune,pentium3)
-cflags-$(CONFIG_MPENTIUM4) += -march=i686 $(call tune,pentium4)
-cflags-$(CONFIG_MK6) += -march=k6
-# Please note, that patches that add -march=athlon-xp and friends are pointless.
-# They make zero difference whatsosever to performance at this time.
-cflags-$(CONFIG_MK7) += -march=athlon
-cflags-$(CONFIG_MCRUSOE) += -march=i686 $(align)
-cflags-$(CONFIG_MEFFICEON) += -march=i686 $(call tune,pentium3) $(align)
-cflags-$(CONFIG_MCYRIXIII) += $(call cc-option,-march=c3,-march=i486) $(align)
-cflags-$(CONFIG_MVIAC3_2) += $(call cc-option,-march=c3-2,-march=i686)
-cflags-$(CONFIG_MVIAC7) += -march=i686
-cflags-$(CONFIG_MATOM) += -march=atom
+ifdef CONFIG_X86_CMOV
+cflags-y += -march=i686
+else
+cflags-y += -march=i586
+endif
-# Geode GX1 support
-cflags-$(CONFIG_MGEODEGX1) += -march=pentium-mmx
-cflags-$(CONFIG_MGEODE_LX) += $(call cc-option,-march=geode,-march=pentium-mmx)
-# add at the end to overwrite eventual tuning options from earlier
-# cpu entries
-cflags-$(CONFIG_X86_GENERIC) += $(call tune,generic,$(call tune,i686))
+cflags-$(CONFIG_M586TSC) += -mtune=i586
+cflags-$(CONFIG_M586MMX) += -mtune=pentium-mmx
+cflags-$(CONFIG_M686) += -mtune=i686
+cflags-$(CONFIG_MPENTIUMII) += -mtune=pentium2
+cflags-$(CONFIG_MPENTIUMIII) += -mtune=pentium3
+cflags-$(CONFIG_MPENTIUMM) += -mtune=pentium3
+cflags-$(CONFIG_MPENTIUM4) += -mtune=pentium4
+cflags-$(CONFIG_MK6) += -mtune=k6
+# Please note that patches that add -mtune=athlon-xp and friends are pointless.
+# They make zero difference whatsoever to performance at this time.
+cflags-$(CONFIG_MK7) += -mtune=athlon
+cflags-$(CONFIG_MCRUSOE) += -mtune=i686 $(align)
+cflags-$(CONFIG_MEFFICEON) += -mtune=pentium3 $(align)
+cflags-$(CONFIG_MCYRIXIII) += -mtune=c3 $(align)
+cflags-$(CONFIG_MVIAC3_2) += -mtune=c3-2
+cflags-$(CONFIG_MVIAC7) += -mtune=i686
+cflags-$(CONFIG_MATOM) += -mtune=atom
+cflags-$(CONFIG_MGEODEGX1) += -mtune=pentium-mmx
+cflags-$(CONFIG_MGEODE_LX) += -mtune=geode
+cflags-$(CONFIG_X86_GENERIC) += -mtune=generic
# Bug fix for binutils: this option is required in order to keep
# binutils from generating NOPL instructions against our will.
diff --git a/arch/x86/include/asm/vermagic.h b/arch/x86/include/asm/vermagic.h
index e26061df0c9b..6554dbdfd719 100644
--- a/arch/x86/include/asm/vermagic.h
+++ b/arch/x86/include/asm/vermagic.h
@@ -5,42 +5,10 @@
#ifdef CONFIG_X86_64
/* X86_64 does not define MODULE_PROC_FAMILY */
-#elif defined CONFIG_M586TSC
-#define MODULE_PROC_FAMILY "586TSC "
-#elif defined CONFIG_M586MMX
-#define MODULE_PROC_FAMILY "586MMX "
-#elif defined CONFIG_MATOM
-#define MODULE_PROC_FAMILY "ATOM "
-#elif defined CONFIG_M686
+#elif defined CONFIG_X86_CMOV
#define MODULE_PROC_FAMILY "686 "
-#elif defined CONFIG_MPENTIUMII
-#define MODULE_PROC_FAMILY "PENTIUMII "
-#elif defined CONFIG_MPENTIUMIII
-#define MODULE_PROC_FAMILY "PENTIUMIII "
-#elif defined CONFIG_MPENTIUMM
-#define MODULE_PROC_FAMILY "PENTIUMM "
-#elif defined CONFIG_MPENTIUM4
-#define MODULE_PROC_FAMILY "PENTIUM4 "
-#elif defined CONFIG_MK6
-#define MODULE_PROC_FAMILY "K6 "
-#elif defined CONFIG_MK7
-#define MODULE_PROC_FAMILY "K7 "
-#elif defined CONFIG_MCRUSOE
-#define MODULE_PROC_FAMILY "CRUSOE "
-#elif defined CONFIG_MEFFICEON
-#define MODULE_PROC_FAMILY "EFFICEON "
-#elif defined CONFIG_MCYRIXIII
-#define MODULE_PROC_FAMILY "CYRIXIII "
-#elif defined CONFIG_MVIAC3_2
-#define MODULE_PROC_FAMILY "VIAC3-2 "
-#elif defined CONFIG_MVIAC7
-#define MODULE_PROC_FAMILY "VIAC7 "
-#elif defined CONFIG_MGEODEGX1
-#define MODULE_PROC_FAMILY "GEODEGX1 "
-#elif defined CONFIG_MGEODE_LX
-#define MODULE_PROC_FAMILY "GEODE "
#else
-#error unknown processor family
+#define MODULE_PROC_FAMILY "586 "
#endif
#ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 489c779ef3ef..76b15ef8c85f 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1221,7 +1221,7 @@ bool tsc_clocksource_watchdog_disabled(void)
static void __init check_system_tsc_reliable(void)
{
-#if defined(CONFIG_MGEODEGX1) || defined(CONFIG_MGEODE_LX) || defined(CONFIG_X86_GENERIC)
+#if defined(CONFIG_CPU_SUP_CYRIX)
if (is_geode_lx()) {
/* RTSC counts during suspend */
#define RTSC_SUSP 0x100
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 222b6fdad313..2648459b8e8f 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -9,7 +9,6 @@ config XEN
select PARAVIRT_CLOCK
select X86_HV_CALLBACK_VECTOR
depends on X86_64 || (X86_32 && X86_PAE)
- depends on X86_64 || (X86_GENERIC || MPENTIUM4 || MATOM)
depends on X86_LOCAL_APIC
help
This is the Linux Xen port. Enabling this will allow the
diff --git a/drivers/misc/mei/Kconfig b/drivers/misc/mei/Kconfig
index 7575fee96cc6..4deb17ed0a62 100644
--- a/drivers/misc/mei/Kconfig
+++ b/drivers/misc/mei/Kconfig
@@ -3,7 +3,7 @@
config INTEL_MEI
tristate "Intel Management Engine Interface"
depends on X86 && PCI
- default X86_64 || MATOM
+ default X86_64 || CPU_SUP_INTEL
help
The Intel Management Engine (Intel ME) provides Manageability,
Security and Media services for system containing Intel chipsets.
--
2.39.5
* Arnd Bergmann <arnd@kernel.org> wrote:

> From: Arnd Bergmann <arnd@arndb.de>
>
> With cx8 and tsc being mandatory features, the only important
> architectural features are now cmov and pae.
>
> Change the large list of target CPUs to no longer pick the instruction set
> itself but only the mtune= optimization level and in-kernel optimizations
> that remain compatible with all cores.
>
> The CONFIG_X86_CMOV instead becomes user-selectable and is now how
> Kconfig picks between 586-class (Pentium, Pentium MMX, K6, C3, GeodeGX)
> and 686-class (everything else) targets.
>
> In order to allow running on late 32-bit cores (Athlon, Pentium-M,
> Pentium 4, ...), the X86_L1_CACHE_SHIFT can no longer be set to anything
> lower than 6 (i.e. 64 byte cache lines).
>
> The optimization options now depend on X86_CMOV and X86_PAE instead
> of the other way round, while other compile-time conditionals that
> checked for MATOM/MGEODEGX1 instead now check for CPU_SUP_* options
> that enable support for a particular CPU family.
>
> Link: https://lore.kernel.org/lkml/dd29df0c-0b4f-44e6-b71b-2a358ea76fb4@app.fastmail.com/
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> This is what I had in mind as mentioned in the earlier thread on
> cx8/tsc removal. I based this on top of Ingo's [RFC 15/15]
> patch.
> ---
>  arch/x86/Kconfig                |   2 +-
>  arch/x86/Kconfig.cpu            | 100 ++++++++++++++------------------
>  arch/x86/Makefile_32.cpu        |  48 +++++++--------
>  arch/x86/include/asm/vermagic.h |  36 +-----------
>  arch/x86/kernel/tsc.c           |   2 +-
>  arch/x86/xen/Kconfig            |   1 -
>  drivers/misc/mei/Kconfig        |   2 +-
>  7 files changed, 74 insertions(+), 117 deletions(-)

While the simplification is nice on its face, this looks messy:

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index a9d717558972..1e33f88c9b97 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1438,7 +1438,7 @@ config HIGHMEM
>
>  config X86_PAE
>  	bool "PAE (Physical Address Extension) Support"
> -	depends on X86_32 && X86_HAVE_PAE
> +	depends on X86_32 && X86_CMOV

Coupling CMOV to PAE ... :-/

> +config X86_CMOV
> +	bool "Require 686-class CMOV instructions" if X86_32
> +	default y
>  	help
> -	  This is the processor type of your CPU. This information is
> -	  used for optimizing purposes. In order to compile a kernel
> -	  that can run on all supported x86 CPU types (albeit not
> -	  optimally fast), you can specify "586" here.
> +	  Most x86-32 processor implementations are compatible with
> +	  the CMOV instruction originally added in the Pentium Pro,
> +	  and they perform much better when using it.
> +
> +	  Disable this option to build for 586-class CPUs without this
> +	  instruction. This is only required for the original Intel
> +	  Pentium (P5, P54, P55), AMD K6/K6-II/K6-3D, Geode GX1 and Via
> +	  CyrixIII/C3 CPUs.

Very few users will know anything about CMOV.

I'd argue the right path forward is to just bite the bullet and remove
non-CMOV support as well, that would be the outcome *anyway* in a few
years. That would allow basically a single 'modern' 32-bit kernel that
is supposed to boot on every supported CPU. People might even end up
testing it ... ;-)

Thanks,

	Ingo
On Sat, Apr 26, 2025, at 11:08, Ingo Molnar wrote:
> * Arnd Bergmann <arnd@kernel.org> wrote:
>
> While the simplification is nice on its face, this looks messy:
>
>>
>> config X86_PAE
>> bool "PAE (Physical Address Extension) Support"
>> - depends on X86_32 && X86_HAVE_PAE
>> + depends on X86_32 && X86_CMOV
>
> Coupling CMOV to PAE ... :-/
Right. With the current set of features, CMOV is almost the
same as 686. My reasoning was that support for CMOV has a
very clear definition, with the instruction either being
available or not.
When the M686/MPENTIUMII/MK6/... options are just optimization
levels rather than selecting an instruction set, X86_PAE
can't depend on those any more. An easy answer here would be
to not have X86_PAE depend on anything, but instead make it
force X86_MINIMUM_CPU_FAMILY=6.
>> +config X86_CMOV
>> + bool "Require 686-class CMOV instructions" if X86_32
>> + default y
>> help
>> - This is the processor type of your CPU. This information is
>> - used for optimizing purposes. In order to compile a kernel
>> - that can run on all supported x86 CPU types (albeit not
>> - optimally fast), you can specify "586" here.
>> + Most x86-32 processor implementations are compatible with
>> + the CMOV instruction originally added in the Pentium Pro,
>> + and they perform much better when using it.
>> +
>> + Disable this option to build for 586-class CPUs without this
>> + instruction. This is only required for the original Intel
>> + Pentium (P5, P54, P55), AMD K6/K6-II/K6-3D, Geode GX1 and Via
>> + CyrixIII/C3 CPUs.
>
> Very few users will know anything about CMOV.
>
> I'd argue the right path forward is to just bite the bullet and remove
> non-CMOV support as well, that would be the outcome *anyway* in a few
> years. That would allow basically a single 'modern' 32-bit kernel that
> is supposed to boot on every supported CPU. People might even end up
> testing it ... ;-)
That would be a much more drastic change than requiring CX8
and TSC, which were present on almost all Socket 7 CPUs and
all embedded cores other than Elan and Vortex86SX.
CMOV is missing not just on old Socket 5/7 CPUs (Pentium
MMX, AMD K6, Cyrix MII) but also newer embedded Via C3, Geode GX
and Vortex86DX/MX/EX/DX2. The replacement Nehemiah (2003), GeodeLX
(2005) and Vortex86DX3/EX2 (2015!) have CMOV, but the old ones
were sold alongside them for years, and some of the 586-class
Vortex86 products are still commercially available.
There is a good chance that we could just not use CMOV and only
build 586-compatible kernels without anyone caring about the
performance difference. There is not much to gain here either
though, as the cost of supporting both 586-class and 686-class
builds is rather small: there is a compiler flag, a boot time
check and a micro-optimization in ffs/fls.
Arnd
On Sat, 26 Apr 2025 at 11:59, Arnd Bergmann <arnd@arndb.de> wrote:
>
> Right. With the current set of features, CMOV is almost the
> same as 686. My reasoning was that support for CMOV has a
> very clear definition, with the instruction either being
> available or not.
Yeah, I don't think there's any reason to make CMOV a reason to drop support.
It has questionable performance impact - I doubt anybody can measure
it - and the "maintenance burden" is basically a single compiler flag.
(And yes, one use in a x86 header file that is pretty questionable
too: I think the reason for the cmov is actually i486-only behavior
and we could probably unify the 32-bit and 64-bit implementation)
Let's not drop Pentium support due to something as insignificant as that.
Particularly as the only half-way "modern" use of the Pentium core is
actually the embedded cores (ie old atoms and clones).
We have good reasons to drop i486 (and the "fake Pentium" cores that
weren't). We _don't_ have good reason to drop Pentium support, I
think.
> An easy answer here would be
> to not have X86_PAE depend on anything, but instead make it
> force X86_MINIMUM_CPU_FAMILY=6.
Make it so.
Linus
On Sat, 26 Apr 2025 12:24:37 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 26 Apr 2025 at 11:59, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > Right. With the current set of features, CMOV is almost the
> > same as 686. My reasoning was that support for CMOV has a
> > very clear definition, with the instruction either being
> > available or not.
>
> Yeah, I don't think there's any reason to make CMOV a reason to drop support.
>
> It has questionable performance impact - I doubt anybody can measure
> it - and the "maintenance burden" is basically a single compiler flag.

There is also the user/kernel address check for copy_to/from_user (etc).

The 'cmov' version used for 64bit is nice and succinct (as well as
being speculative execution safe).
Unlike the 'sbb' version it doesn't rely on the first access being to
the first address of the buffer (or page 0 not being mapped).

But I'd guess that the kernel ought to have a boot time test for some
of these instructions - so at least it fails gracefully.
Which would require compiling some early code without cmov.

	David
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 26 Apr 2025 at 11:59, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > Right. With the current set of features, CMOV is almost the
> > same as 686. My reasoning was that support for CMOV has a
> > very clear definition, with the instruction either being
> > available or not.
>
> Yeah, I don't think there's any reason to make CMOV a reason to drop support.
>
> It has questionable performance impact - I doubt anybody can measure
> it - and the "maintenance burden" is basically a single compiler
> flag.
>
> (And yes, one use in a x86 header file that is pretty questionable
> too: I think the reason for the cmov is actually i486-only behavior
> and we could probably unify the 32-bit and 64-bit implementation)
>
> Let's not drop Pentium support due to something as insignificant as
> that.

Agreed on that. Idea to require CMOV dropped.

Note that the outcome of 486 removal will likely be that the few
remaining community distros that still offer x86-32 builds are either
already 686-CMOV-only (Debian), or are going to drop their 486 builds
and keep their 686-CMOV-only builds (Gentoo and Archlinux32) by way of
simple inertia.

(There's an off chance that they'll change their 486 builds to 586,
but I think dropping the extra complication and standardizing on 686
will be the most likely outcome.)

No commercial distro builds x86-32 with a modern v6.x series kernel
AFAICS.

Anyway, I agree that the maintenance cost on the kernel side to build
non-CMOV kernels is very low.

Thanks,

	Ingo
On Sat, 26 Apr 2025 at 12:24, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> (And yes, one use in a x86 header file that is pretty questionable
> too: I think the reason for the cmov is actually i486-only behavior
> and we could probably unify the 32-bit and 64-bit implementation)
Actually, what we *should* do is to remove that manual use of 'cmov'
entirely - even if we decide that yes, that undefined zero case is
actually real.
We should probably change it to use CC_SET(), and the compiler will do
a much better job - and probably never use cmov anyway.
And yes, that will generate worse code if you have an old compiler
that doesn't do ASM_FLAG_OUTPUTS, but hey, that's true in general. If
you want good code, you need a good compiler.
And clang needs to learn the CC_SET() pattern anyway.
So I think that manual cmov pattern for x86-32 should be replaced with
bool zero;
asm("bsfl %[in],%[out]"
CC_SET(z)
: CC_OUT(z) (zero),
[out]"=r" (r)
: [in] "rm" (x));
return zero ? 0 : r+1;
instead (that's ffs(), and fls() would need the same thing except with
bsrl instead, of course).
I bet that would actually improve code generation.
And I also bet it doesn't actually matter, of course.
Linus
On 26/04/2025 8:55 pm, Linus Torvalds wrote:
> So I think that manual cmov pattern for x86-32 should be replaced with
>
> bool zero;
>
> asm("bsfl %[in],%[out]"
> CC_SET(z)
> : CC_OUT(z) (zero),
> [out]"=r" (r)
> : [in] "rm" (x));
>
> return zero ? 0 : r+1;
>
> instead (that's ffs(), and fls() would need the same thing except with
> bsrl insteadm, of course).
>
> I bet that would actually improve code generation.
It is possible to do better still.
ffs/fls are commonly found inside loops where x is the loop condition
too. Therefore, using statically_true() to provide a form without the
zero compatibility turns out to be a win.
> And I also bet it doesn't actually matter, of course.
Something that neither Linux nor Xen had, which makes a reasonable
difference, is a for_each_set_bit() optimised over a scalar value. The
existing APIs make it all too easy to spill the loop condition to memory.
~Andrew
On Sun, 27 Apr 2025 at 12:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> ffs/fls are commonly found inside loops where x is the loop condition
> too. Therefore, using statically_true() to provide a form without the
> zero compatibility turns out to be a win.
We already have the version without the zero capability - it's just
called "__ffs()" and "__fls()", and performance-critical code uses
those.
So fls/ffs are the "standard" library functions that have to handle
zero, and add that stupid "+1" because that interface was designed by
some Pascal person who doesn't understand that we start counting from
0.
Standards bodies: "companies aren't sending their best people".
But it's silly that we then spend effort on magic cmov in inline asm
on those things when it's literally the "don't use this version unless
you don't actually care about performance" case.
I don't think it would be wrong to just make the x86-32 code just do
the check against zero ahead of time - in C.
And yes, that will generate some extra code - you'll test for zero
before, and then the caller might also test for a zero result that
then results in another test for zero that can't actually happen (but
the compiler doesn't know that). But I suspect that on the whole, it
is likely to generate better code anyway just because the compiler
sees that first test and can DTRT.
UNTESTED patch applied in case somebody wants to play with this. It
removes 10 lines of silly code, and along with them that 'cmov' use.
Anybody?
Linus
On April 27, 2025 12:34:46 PM PDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 27 Apr 2025 at 12:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >
> > ffs/fls are commonly found inside loops where x is the loop condition
> > too. Therefore, using statically_true() to provide a form without the
> > zero compatibility turns out to be a win.
>
> We already have the version without the zero capability - it's just
> called "__ffs()" and "__fls()", and performance-critical code uses
> those.
>
> So fls/ffs are the "standard" library functions that have to handle
> zero, and add that stupid "+1" because that interface was designed by
> some Pascal person who doesn't understand that we start counting from
> 0.
>
> Standards bodies: "companies aren't sending their best people".
>
> But it's silly that we then spend effort on magic cmov in inline asm
> on those things when it's literally the "don't use this version unless
> you don't actually care about performance" case.
>
> I don't think it would be wrong to just make the x86-32 code just do
> the check against zero ahead of time - in C.
>
> And yes, that will generate some extra code - you'll test for zero
> before, and then the caller might also test for a zero result that
> then results in another test for zero that can't actually happen (but
> the compiler doesn't know that). But I suspect that on the whole, it
> is likely to generate better code anyway just because the compiler
> sees that first test and can DTRT.
>
> UNTESTED patch applied in case somebody wants to play with this. It
> removes 10 lines of silly code, and along with them that 'cmov' use.
>
> Anybody?
>
>          Linus

It's noteworthy that hardware implementations are now invariably using
a different interface, which makes sense for the LSB case (tzcnt/ctz)
but has its own drain bramage for the MSB case (lzcnt/clz)...
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 27 Apr 2025 at 12:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >
> > ffs/fls are commonly found inside loops where x is the loop condition
> > too. Therefore, using statically_true() to provide a form without the
> > zero compatibility turns out to be a win.
>
> We already have the version without the zero capability - it's just
> called "__ffs()" and "__fls()", and performance-critical code uses
> those.
>
> So fls/ffs are the "standard" library functions that have to handle
> zero, and add that stupid "+1" because that interface was designed by
> some Pascal person who doesn't understand that we start counting from
> 0.
>
> Standards bodies: "companies aren't sending their best people".
>
> But it's silly that we then spend effort on magic cmov in inline asm
> on those things when it's literally the "don't use this version unless
> you don't actually care about performance" case.
>
> I don't think it would be wrong to just make the x86-32 code just do
> the check against zero ahead of time - in C.
>
> And yes, that will generate some extra code - you'll test for zero
> before, and then the caller might also test for a zero result that
> then results in another test for zero that can't actually happen (but
> the compiler doesn't know that). But I suspect that on the whole, it
> is likely to generate better code anyway just because the compiler
> sees that first test and can DTRT.
>
> UNTESTED patch applied in case somebody wants to play with this. It
> removes 10 lines of silly code, and along with them that 'cmov' use.
>
> Anybody?
Makes sense - it seems to boot here, but I only did some very light
testing.
There's a minor text size increase on x86-32 defconfig, GCC 14.2.0:
text data bss dec hex filename
16577728 7598826 1744896 25921450 18b87aa vmlinux.before
16577908 7598838 1744896 25921642 18b886a vmlinux.after
bloat-o-meter output:
add/remove: 2/1 grow/shrink: 201/189 up/down: 5681/-3486 (2195)
Patch with changelog and your SOB added attached. Does it look good to
you?
Thanks,
Ingo
================>
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 28 Apr 2025 08:38:35 +0200
Subject: [PATCH] bitops/32: Convert variable_ffs() and fls() zero-case handling to C
Don't do the complicated and probably questionable BS*L+CMOVZL
asm() optimization in variable_ffs() and fls(): performance-critical
code is already using __ffs() and __fls() that use sane interfaces
close to the machine instruction ABI. Check ahead for zero in C.
There's a minor text size increase on x86-32 defconfig:
text data bss dec hex filename
16577728 7598826 1744896 25921450 18b87aa vmlinux.before
16577908 7598838 1744896 25921642 18b886a vmlinux.after
bloat-o-meter output:
add/remove: 2/1 grow/shrink: 201/189 up/down: 5681/-3486 (2195)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
arch/x86/include/asm/bitops.h | 22 ++++++----------------
1 file changed, 6 insertions(+), 16 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 100413aff640..6061c87f14ac 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -321,15 +321,10 @@ static __always_inline int variable_ffs(int x)
asm("bsfl %1,%0"
: "=r" (r)
: ASM_INPUT_RM (x), "0" (-1));
-#elif defined(CONFIG_X86_CMOV)
- asm("bsfl %1,%0\n\t"
- "cmovzl %2,%0"
- : "=&r" (r) : "rm" (x), "r" (-1));
#else
- asm("bsfl %1,%0\n\t"
- "jnz 1f\n\t"
- "movl $-1,%0\n"
- "1:" : "=r" (r) : "rm" (x));
+ if (!x)
+ return 0;
+ asm("bsfl %1,%0" : "=r" (r) : "rm" (x));
#endif
return r + 1;
}
@@ -378,15 +373,10 @@ static __always_inline int fls(unsigned int x)
asm("bsrl %1,%0"
: "=r" (r)
: ASM_INPUT_RM (x), "0" (-1));
-#elif defined(CONFIG_X86_CMOV)
- asm("bsrl %1,%0\n\t"
- "cmovzl %2,%0"
- : "=&r" (r) : "rm" (x), "rm" (-1));
#else
- asm("bsrl %1,%0\n\t"
- "jnz 1f\n\t"
- "movl $-1,%0\n"
- "1:" : "=r" (r) : "rm" (x));
+ if (!x)
+ return 0;
+ asm("bsrl %1,%0" : "=r" (r) : "rm" (x));
#endif
return r + 1;
}
* Ingo Molnar <mingo@kernel.org> wrote:
> > UNTESTED patch applied in case somebody wants to play with this. It
> > removes 10 lines of silly code, and along with them that 'cmov' use.
> >
> > Anybody?
>
> Makes sense - it seems to boot here, but I only did some very light
> testing.
>
> There's a minor text size increase on x86-32 defconfig, GCC 14.2.0:
>
> text data bss dec hex filename
> 16577728 7598826 1744896 25921450 18b87aa vmlinux.before
> 16577908 7598838 1744896 25921642 18b886a vmlinux.after
>
> bloatometer output:
>
> add/remove: 2/1 grow/shrink: 201/189 up/down: 5681/-3486 (2195)
And once we remove 486, I think we can do the optimization below to
just assume the output doesn't get clobbered by BS*L in the zero-case,
right?
In the text size space it's a substantial optimization on x86-32
defconfig:
text data bss dec hex filename
16,577,728 7598826 1744896 25921450 18b87aa vmlinux.vanilla # CMOV+BS*L
16,577,908 7598838 1744896 25921642 18b886a vmlinux.linus_patch # if()+BS*L
16,573,568 7602922 1744896 25921386 18b876a vmlinux.noclobber # BS*L
Thanks,
Ingo
---
arch/x86/include/asm/bitops.h | 20 ++------------------
1 file changed, 2 insertions(+), 18 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6061c87f14ac..e3e94a806656 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -308,24 +308,16 @@ static __always_inline int variable_ffs(int x)
{
int r;
-#ifdef CONFIG_X86_64
/*
* AMD64 says BSFL won't clobber the dest reg if x==0; Intel64 says the
* dest reg is undefined if x==0, but their CPU architect says its
* value is written to set it to the same as before, except that the
* top 32 bits will be cleared.
- *
- * We cannot do this on 32 bits because at the very least some
- * 486 CPUs did not behave this way.
*/
asm("bsfl %1,%0"
: "=r" (r)
: ASM_INPUT_RM (x), "0" (-1));
-#else
- if (!x)
- return 0;
- asm("bsfl %1,%0" : "=r" (r) : "rm" (x));
-#endif
+
return r + 1;
}
@@ -360,24 +352,16 @@ static __always_inline int fls(unsigned int x)
if (__builtin_constant_p(x))
return x ? 32 - __builtin_clz(x) : 0;
-#ifdef CONFIG_X86_64
/*
* AMD64 says BSRL won't clobber the dest reg if x==0; Intel64 says the
* dest reg is undefined if x==0, but their CPU architect says its
* value is written to set it to the same as before, except that the
* top 32 bits will be cleared.
- *
- * We cannot do this on 32 bits because at the very least some
- * 486 CPUs did not behave this way.
*/
asm("bsrl %1,%0"
: "=r" (r)
: ASM_INPUT_RM (x), "0" (-1));
-#else
- if (!x)
- return 0;
- asm("bsrl %1,%0" : "=r" (r) : "rm" (x));
-#endif
+
return r + 1;
}
On Mon, 28 Apr 2025 at 00:05, Ingo Molnar <mingo@kernel.org> wrote:
>
> And once we remove 486, I think we can do the optimization below to
> just assume the output doesn't get clobbered by BS*L in the zero-case,
> right?
We probably can't, because who knows what "Pentium" CPU's are out there.
Or even if Pentium really does get it right. I doubt we have any
developers with an original Pentium around.
So just leave the "we don't know what the CPU result is for zero"
unless we get some kind of official confirmation.
Linus
On April 28, 2025 9:14:45 AM PDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> We probably can't, because who knows what "Pentium" CPU's are out there.
>
> Or even if Pentium really does get it right. I doubt we have any
> developers with an original Pentium around.
>
> So just leave the "we don't know what the CPU result is for zero"
> unless we get some kind of official confirmation.
>
> Linus

If anyone knows for sure, it is probably Christian Ludloff. However,
there was a *huge* tightening of the formal ISA when the i686 was
introduced (family=6) and I really believe this was part of it.

I also really don't trust that family=5 really means conforms to
undocumented P5 behavior, e.g. for Quark.
On 28/04/2025 10:38 pm, H. Peter Anvin wrote:
> If anyone knows for sure, it is probably Christian Ludloff. However,
> there was a *huge* tightening of the formal ISA when the i686 was
> introduced (family=6) and I really believe this was part of it.
>
> I also really don't trust that family=5 really means conforms to
> undocumented P5 behavior, e.g. for Quark.

https://www.sandpile.org/x86/flags.htm

That's a lot of "can't even characterise the result" in the P5.

Looking at the P4 column, that is clearly what the latest SDM has
retroactively declared to be architectural.

~Andrew
On April 28, 2025 5:12:13 PM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> https://www.sandpile.org/x86/flags.htm
>
> That's a lot of "can't even characterise the result" in the P5.
>
> Looking at P4 column, that is clearly what the latest SDM has
> retroactively declared to be architectural.
>
> ~Andrew

Yes, but it wasn't about flags here.

Now, question: can we just use __builtin_*() for these? I think gcc
should always generate inline code for these on x86.
On 29/04/2025 3:00 am, H. Peter Anvin wrote:
> Yes, but it wasn't about flags here.
>
> Now, question: can we just use __builtin_*() for these? I think gcc
> should always generate inline code for these on x86.

Yes it does generate inline code. https://godbolt.org/z/M45oo5rqT

GCC does it branchlessly, but cannot optimise based on context.

Clang can optimise based on context, except the 0 case it seems.

Moving to -march=i686 causes both GCC and Clang to switch to CMOV and
create branchless code, but GCC still can't optimise out the CMOV
based on context.

~Andrew
On April 28, 2025 7:25:17 PM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> Yes it does generate inline code. https://godbolt.org/z/M45oo5rqT
>
> GCC does it branchlessly, but cannot optimise based on context.
>
> Clang can optimise based on context, except the 0 case it seems.
>
> Moving to -march=i686 causes both GCC and Clang to switch to CMOV and
> create branchless code, but GCC still can't optimise out the CMOV
> based on context.
>
> ~Andrew

Maybe a gcc bug report would be better than trying to hack around this
in the kernel?
On 29/04/2025 4:13 am, H. Peter Anvin wrote:
> Maybe a gcc bug report would be better than trying to hack around this
> in the kernel?

I tried that. (The thread started as a question around
__builtin_constant_p() but did grow to cover __builtin_ffs().)

https://gcc.gnu.org/pipermail/gcc/2024-March/243465.html

~Andrew
On Tue, 29 Apr 2025 at 07:38, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> I tried that. (The thread started as a question around
> __builtin_constant_p() but did grow to cover __builtin_ffs().)
Maybe we could do something like
#define ffs(x) \
(statically_true((x) != 0) ? __ffs(x)+1 : __builtin_ffs(x))
which uses our "statically_true()" helper that is actually fairly good
at the whole "let the compiler tell us that it knows that value cannot
be zero"
I didn't check what code that generated, but I've seen gcc do well on
that statically_true() thing in the past.
Then we can just remove our current variable_ffs() thing entirely,
because we now depend on our (good) __ffs() and the builtin being
"good enough" for the bad case.
(And do the same thing for fls() too, of course)
Linus
On 29/04/2025 7:05 pm, Linus Torvalds wrote:
> On Tue, 29 Apr 2025 at 07:38, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> I tried that. (The thread started as a question around
>> __builtin_constant_p() but did grow to cover __builtin_ffs().)
> Maybe we could do something like
>
> #define ffs(x) \
> (statically_true((x) != 0) ? __ffs(x)+1 : __builtin_ffs(x))
>
> which uses our "statically_true()" helper that is actually fairly good
> at the whole "let the compiler tell us that it knows that value cannot
> be zero"
>
> I didn't check what code that generated, but I've seen gcc do well on
> that statically_true() thing in the past.
>
> Then we can just remove our current variable_ffs() thing entirely,
> because we now depend on our (good) __ffs() and the builtin being
> "good enough" for the bad case.
That would improve code generation for 32bit, but generally regress 64bit.
Preloading the destination register with -1 is better than the CMOV form
emitted by the builtin; BSF's habit of conditionally not writing the
destination register *is* a CMOV of sorts.
When I cleaned this up in Xen, there were several factors where I
thought improvements could be made.
Having both ffs() and __ffs(), where the latter is undefined in a common
case, is a trap waiting for an unwary programmer. I have no particular
love for ffs() being off-by-one from normal, but it is well defined for all
inputs.
Also, leaving the constant folding to the arch-optimised form means that
it often gets forgotten. Therefore, I rearranged everything to have
this be common:
static always_inline attr_const unsigned int ffs(unsigned int x)
{
if ( __builtin_constant_p(x) )
return __builtin_ffs(x);
#ifdef arch_ffs
return arch_ffs(x);
#else
return generic_ffsl(x);
#endif
}
with most architectures implementing arch_ffs as:
#define arch_ffs(x) ((x) ? 1 + __builtin_ctz(x) : 0)
and x86 as:
static always_inline unsigned int arch_ffs(unsigned int x)
{
unsigned int r;
if ( __builtin_constant_p(x > 0) && x > 0 )
{
/*
* A common code pattern is:
*
* while ( bits )
* {
* bit = ffs(bits);
* ...
*
* and the optimiser really can work with the knowledge of x being
* non-zero without knowing its exact value, in which case we don't
* need to compensate for BSF's corner cases. Otherwise...
*/
asm ( "bsf %[val], %[res]"
: [res] "=r" (r)
: [val] "rm" (x) );
}
else
{
/*
* ... the AMD manual states that BSF won't modify the destination
* register if x=0. The Intel manual states that the result is
* undefined, but the architects have said that the register is
* written back with its old value (zero extended as normal).
*/
asm ( "bsf %[val], %[res]"
: [res] "=r" (r)
: [val] "rm" (x), "[res]" (-1) );
}
return r + 1;
}
#define arch_ffs arch_ffs
and finally, providing compatibility for the other forms as:
#define __ffs(x) (ffs(x) - 1)
The end result is fewer APIs to implement in arch-specific code, and the
removal of undefined behaviour.
That said, I don't envy anyone wanting to try and untangle this in
Linux, even if consensus were to agree on it as an approach.
~Andrew
On Tue, 29 Apr 2025 at 12:13, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> That would improve code generation for 32bit, but generally regress 64bit.
>
> Preloading the destination register with -1 is better than the CMOV form
> emitted by the builtin; BSF's habit of conditionally not writing the
> destination register *is* a CMOV of sorts.
Right you are. So we'd need to do this just for the x86-32 case. Oh
well. Ugly, but still prettier than what we have now, I guess.
Linus
On April 29, 2025 1:12:48 PM PDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Right you are. So we'd need to do this just for the x86-32 case. Oh
> well. Ugly, but still prettier than what we have now, I guess.
>
> Linus

Could you file a gcc bug? Gcc shouldn't generate worse code than inline
asm ...
On Tue, 29 Apr 2025 at 14:24, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Could you file a gcc bug? Gcc shouldn't generate worse code than inline asm ...
Honestly, I've given up on that idea.
That's basically what the bug report I reported about clts was - gcc
generating worse code than inline asm.
But why would we use gcc builtins at all if we have inline asm that is
better than what some versions of gcc generate? The inline asm works
for all versions.
Anyway, attached is a test file that seems to generate basically
"optimal" code. It's not a kernel patch, but a standalone C file for
testing with a couple of stupid test-cases to make sure that it gets
the constant argument case right, and that it optimizes the "I know
it's not zero" case.
That is why it also uses that "#if __SIZEOF_LONG__ == 4" instead of
something like CONFIG_64BIT.
So it's a proof-of-concept thing, and it does seem to be fairly simple.
The "important" parts are just the
#define variable_ffs(x) \
(statically_true((x)!=0) ? __ffs(x)+1 : do_variable_ffs(x))
#define constant_ffs(x) __builtin_ffs(x)
#define ffs(x) \
(__builtin_constant_p(x) ? constant_ffs(x) : variable_ffs(x))
lines which end up picking the right choice. The rest is either the
"reimplement the kernel infrastructure for testing" or the trivial
do_variable_ffs() thing.
I just did
gcc -m32 -O2 -S -fomit-frame-pointer t.c
(with, and without that -m32) and looked at the result to see that it
looks sane. No *actual* testing.
Linus
On 29/04/2025 10:53 pm, Linus Torvalds wrote:
> I just did
>
> gcc -m32 -O2 -S -fomit-frame-pointer t.c
>
> (with, and without that -m32) and looked at the result to see that it
> looks sane. No *actual* testing.

do_variable_ffs() doesn't quite work.

REP BSF is TZCNT, and unconditionally writes its output operand, which
defeats the attempt to preload with -1.

Drop the REP prefix, and it should work as intended.

~Andrew
On Tue, 29 Apr 2025 at 14:59, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> do_variable_ffs() doesn't quite work.
>
> REP BSF is TZCNT, and unconditionally writes its output operand, and
> defeats the attempt to preload with -1.
>
> Drop the REP prefix, and it should work as intended.
Bah. That's what I get for just doing it blindly without actually
looking at the kernel source. I just copied the __ffs() thing - and
there the 'rep' is not for the zero case - which we don't care about -
but because tzcnt performs better on newer CPUs.
So you're obviously right.
Linus
On 29/04/2025 11:04 pm, Linus Torvalds wrote:
> Bah. That's what I get for just doing it blindly without actually
> looking at the kernel source. I just copied the __ffs() thing - and
> there the 'rep' is not for the zero case - which we don't care about -
> but because tzcnt performs better on newer CPUs.

Oh, I didn't realise there was also a perf difference, but Agner Fog
agrees.

Apparently in Zen4, BSF and friends have become a single uop with a
sensible latency. Previously they were 6-8 uops with a latency to
match.

Intel appear to have had them as a single uop since SandyBridge, so
quite a long time now.

~Andrew
On Tue, 29 Apr 2025 at 15:22, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> Oh, I didn't realise there was also a perf difference too, but Agner Fog
> agrees.
The perf difference is exactly because of the issue where the non-rep
one acts as a cmov, and has basically two inputs (the bits to test in
the source, and the old value of the result register)
I guess it's not "fundamental", but tzcnt is basically a bit simpler
for hardware to implement, and the non-rep legacy bsf instruction
basically has a dependency on the previous value of the result
register.
So even when it's a single uop for both cases, that single uop can be
slower for the bsf because of the (typically false) dependency and
extra pressure on the rename registers.
Linus
On April 29, 2025 3:04:30 PM PDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Bah. That's what I get for just doing it blindly without actually
> looking at the kernel source. I just copied the __ffs() thing - and
> there the 'rep' is not for the zero case - which we don't care about -
> but because tzcnt performs better on newer CPUs.
>
> So you're obviously right.
>
> Linus

Yeah, the encoding of lzcnt was a real mistake, because the outputs are
different (so you still need instruction-specific postprocessing.)
On Mon, 28 Apr 2025 at 19:00, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Now, question: can we just use __builtin_*() for these? I think gcc should always generate inline code for these on x86.
Yeah, I think we can just use __builtin_ffs() directly and get rid of
all the games.
Not all gcc builtins are great: I did a bugzilla about gcc ctzl some time ago:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106471
but for ffs() that does sound like it's the simplest thing to do.
Linus
* Ingo Molnar <mingo@kernel.org> wrote:
> And once we remove 486, I think we can do the optimization below to
> just assume the output doesn't get clobbered by BS*L in the
> zero-case, right?
>
> In the text size space it's a substantial optimization on x86-32
> defconfig:
>
> text data bss dec hex filename
> 16,577,728 7598826 1744896 25921450 18b87aa vmlinux.vanilla # CMOV+BS*L
> 16,577,908 7598838 1744896 25921642 18b886a vmlinux.linus_patch # if()+BS*L
> 16,573,568 7602922 1744896 25921386 18b876a vmlinux.noclobber # BS*L
And BTW, *that* is a price that all of non-486 x86-32 was paying for
486 support...
And, just out of intellectual curiosity, I also tried to measure the
code generation price of the +1 standards-quirk in the fls()/ffs()
interface as well:
text data bss dec hex filename
16,577,728 7598826 1744896 25921450 18b87aa vmlinux.vanilla # CMOV+BS*L
16,577,908 7598838 1744896 25921642 18b886a vmlinux.linus_patch # if()+BS*L
16,573,568 7602922 1744896 25921386 18b876a vmlinux.noclobber # BS*L
..........
16,573,552 7602922 1744896 25921370 18b875a vmlinux.broken # BROKEN: 0 baseline instead of 1
... and unless I messed up the patch, it seems to have a surprisingly
low impact - maybe because the compiler can amortize its cost by
adjusting all dependent code mostly at build time, so the +1 doesn't
end up being generated most of the time?
Thanks,
Ingo
===============================>
This broken patch is broken: it intentionally breaks the ffs()/fls()
interface in an attempt to measure the code generation effects of
interface details.
NOT-Signed-off-by: <anyone@anywhere.anytime>
---
arch/x86/include/asm/bitops.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index e3e94a806656..21707696bafe 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -318,7 +318,7 @@ static __always_inline int variable_ffs(int x)
: "=r" (r)
: ASM_INPUT_RM (x), "0" (-1));
- return r + 1;
+ return r;
}
/**
@@ -362,7 +362,7 @@ static __always_inline int fls(unsigned int x)
: "=r" (r)
: ASM_INPUT_RM (x), "0" (-1));
- return r + 1;
+ return r;
}
/**
On Mon, 28 Apr 2025 at 00:14, Ingo Molnar <mingo@kernel.org> wrote:
>
> And, just out of intellectual curiosity, I also tried to measure the
> code generation price of the +1 standards-quirk in the fls()/ffs()
> interface as well:
>
> ... and unless I messed up the patch, it seems to have a surprisingly
> low impact - maybe because the compiler can amortize its cost by
> adjusting all dependent code mostly at build time, so the +1 doesn't
> end up being generated most of the time?
No, I think one issue is that most users actually end up subtracting
one from the return value of 'ffs()', because the "bit #0 returns 1"
semantics of the standard ffs() function really is insane.
It's not just that it doesn't match sane hardware, it's also that it
doesn't match sane *users*. If bit #0 is set, people want '0', so they
typically subtract 1.
So when you stop adding one, you aren't actually removing code -
you're often adding it.
Just see how many hits you get from
git grep '\<ffs(.*).*-.*1'
which is obviously not a very precise pattern, but just look at the
output and see just *how* common that "subtract one" thing is.
I really don't understand how anybody *ever* thought that the whole
"return one bigger" was a good idea for ffs().
Sure, I understand that zero is special and needs a special return
value, but returning a negative value would have been pretty simple
(or just do what our bitops finding functions do, which is to return
past the end, which is often convenient but does tend to make the
error condition check a bit more complex).
Anyway, the fact that so many users subtract one means that your "look
at the size of the binary" model doesn't work. You're counting both
the wins (when that addition doesn't happen) and the losses (when the
"subtract one in the user" happens).
So the "+1" doesn't cause much code generation - as long as it's done
by the compiler that can also undo it - but it's just a horrid user
interface.
But maybe people really were poisoned by the Pascal mindset. Or maybe
it was invented by some ancient Roman who hadn't heard of the concept
of zero. Who knows?
Linus
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, 28 Apr 2025 at 00:14, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > And, just out of intellectual curiosity, I also tried to measure the
> > code generation price of the +1 standards-quirk in the fls()/ffs()
> > interface as well:
> >
> > ... and unless I messed up the patch, it seems to have a surprisingly
> > low impact - maybe because the compiler can amortize its cost by
> > adjusting all dependent code mostly at build time, so the +1 doesn't
> > end up being generated most of the time?
>
> No, I think one issue is that most users actually end up subtracting
> one from the return value of 'ffs()', because the "bit #0 returns 1"
> semantics of the standard ffs() function really is insane.
>
> It's not just that it doesn't match sane hardware, it's also that it
> doesn't match sane *users*. If bit #0 is set, people want '0', so they
> typically subtract 1.
>
> So when you stop adding one, you aren't actually removing code -
> you're often adding it.
>
> Just see how many hits you get from
>
> git grep '\<ffs(.*).*-.*1'
>
> which is obviously not a very precise pattern, but just look at the
> output and see just *how* common that "subtract one" thing is.
>
> I really don't understand how anybody *ever* thought that the whole
> "return one bigger" was a good idea for ffs().
Yeah. No argument from me that it's a badly thought out interface - I
was just surprised that it doesn't seem to impact performance as badly
as I expected. I have to add that a lot of work went into absorbing the
negative effects of the ffs()/fls() interfaces:
starship:~/tip> git grep -Ee '__ffs\(|__fls\(' | wc -l
1055
So it impacts code quality negatively, which is arguably the worse side
effect.
> But maybe people really were poisoned by the Pascal mindset. Or maybe
> it was invented by some ancient Roman who hadn't heard of the concept
> of zero. Who knows?
Hey, ancient Romans didn't even have the concept of *whitespaces* and
punctuation to begin with:
https://historyofinformation.com/images/Vergilius_Augusteus,_Georgica_121.jpg
Lazy stonemasons the lot of them.
Romans were the worst ever coders too I suspect. What have the Romans
ever done for us??
Ingo
On April 29, 2025 3:08:03 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> On Mon, 28 Apr 2025 at 00:14, Ingo Molnar <mingo@kernel.org> wrote:
>> >
>> > And, just out of intellectual curiosity, I also tried to measure the
>> > code generation price of the +1 standards-quirk in the fls()/ffs()
>> > interface as well:
>> >
>> > ... and unless I messed up the patch, it seems to have a surprisingly
>> > low impact - maybe because the compiler can amortize its cost by
>> > adjusting all dependent code mostly at build time, so the +1 doesn't
>> > end up being generated most of the time?
>>
>> No, I think one issue is that most users actually end up subtracting
>> one from the return value of 'ffs()', because the "bit #0 returns 1"
>> semantics of the standard ffs() function really is insane.
>>
>> It's not just that it doesn't match sane hardware, it's also that it
>> doesn't match sane *users*. If bit #0 is set, people want '0', so they
>> typically subtract 1.
>>
>> So when you stop adding one, you aren't actually removing code -
>> you're often adding it.
>>
>> Just see how many hits you get from
>>
>> git grep '\<ffs(.*).*-.*1'
>>
>> which is obviously not a very precise pattern, but just look at the
>> output and see just *how* common that "subtract one" thing is.
>>
>> I really don't understand how anybody *ever* thought that the whole
>> "return one bigger" was a good idea for ffs().
>
>Yeah. No argument from me that it's a badly thought out interface - I
>was just surprised that it doesn't seem to impact performance as badly
>as I expected. I have to add that a lot of work went into absorbing the
>negative effects of the ffs()/fls() interfaces:
>
> starship:~/tip> git grep -Ee '__ffs\(|__fls\(' | wc -l
> 1055
>
>So it impacts code quality negatively, which is arguably the worse side
>effect.
>
>> But maybe people really were poisoned by the Pascal mindset. Or maybe
>> it was invented by some ancient Roman who hadn't heard of the concept
>> of zero. Who knows?
>
>Hey, ancient Romans didn't even have the concept of *whitespaces* and
>punctuation to begin with:
>
> https://historyofinformation.com/images/Vergilius_Augusteus,_Georgica_121.jpg
>
>Lazy stonemasons the lot of them.
>
>Romans were the worst ever coders too I suspect. What have the Romans
>ever done for us??
>
> Ingo
Well, they did build the roads... 🤣
Roman numerals obviously were not a positional system, but at least in accounting they used N for zero (nulla).
ROMANI•ITE•DOMVM
On April 28, 2025 12:14:40 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Ingo Molnar <mingo@kernel.org> wrote:
>
>> And once we remove 486, I think we can do the optimization below to
>> just assume the output doesn't get clobbered by BS*L in the
>> zero-case, right?
>>
>> In the text size space it's a substantial optimization on x86-32
>> defconfig:
>>
>>        text      data      bss       dec       hex  filename
>>  16,577,728   7598826  1744896  25921450  18b87aa  vmlinux.vanilla      # CMOV+BS*L
>>  16,577,908   7598838  1744896  25921642  18b886a  vmlinux.linus_patch  # if()+BS*L
>>  16,573,568   7602922  1744896  25921386  18b876a  vmlinux.noclobber    # BS*L
>
>And BTW, *that* is a price that all of non-486 x86-32 was paying for
>486 support...
>
>And, just out of intellectual curiosity, I also tried to measure the
>code generation price of the +1 standards-quirk in the fls()/ffs()
>interface as well:
>
>        text      data      bss       dec       hex  filename
>  16,577,728   7598826  1744896  25921450  18b87aa  vmlinux.vanilla      # CMOV+BS*L
>  16,577,908   7598838  1744896  25921642  18b886a  vmlinux.linus_patch  # if()+BS*L
>  16,573,568   7602922  1744896  25921386  18b876a  vmlinux.noclobber    # BS*L
>  ..........
>  16,573,552   7602922  1744896  25921370  18b875a  vmlinux.broken       # BROKEN: 0 baseline instead of 1
>
>... and unless I messed up the patch, it seems to have a surprisingly
>low impact - maybe because the compiler can amortize its cost by
>adjusting all dependent code mostly at build time, so the +1 doesn't
>end up being generated most of the time?
>
>Thanks,
>
>	Ingo
>
>===============================>
>
>This broken patch is broken: it intentionally breaks the ffs()/fls()
>interface in an attempt to measure the code generation effects of
>interface details.
>
>NOT-Signed-off-by: <anyone@anywhere.anytime>
>---
> arch/x86/include/asm/bitops.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
>diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
>index e3e94a806656..21707696bafe 100644
>--- a/arch/x86/include/asm/bitops.h
>+++ b/arch/x86/include/asm/bitops.h
>@@ -318,7 +318,7 @@ static __always_inline int variable_ffs(int x)
> 		     : "=r" (r)
> 		     : ASM_INPUT_RM (x), "0" (-1));
>
>-	return r + 1;
>+	return r;
> }
>
> /**
>@@ -362,7 +362,7 @@ static __always_inline int fls(unsigned int x)
> 		     : "=r" (r)
> 		     : ASM_INPUT_RM (x), "0" (-1));
>
>-	return r + 1;
>+	return r;
> }
>
> /**

My recollection was that you can't assume that even for 586; that it is
only safe for 686, but it has been a long time...
On Mon, Apr 28, 2025, at 09:14, Ingo Molnar wrote:
>
> ... and unless I messed up the patch, it seems to have a surprisingly
> low impact - maybe because the compiler can amortize its cost by
> adjusting all dependent code mostly at build time, so the +1 doesn't
> end up being generated most of the time?
Is there any reason we can't just use the compiler-builtins directly
like we do on other architectures, at least for 32-bit?
Looking at a couple of vmlinux objects confirms the findings from
fdb6649ab7c1 ("x86/asm/bitops: Use __builtin_ctzl() to evaluate
constant expressions") from looking at the object file that using
the built-in helpers is slightly better than the current asm code
for all 32-bit targets with both gcc and clang. It's also better
for 64-bit targets with clang, but not with gcc, where the inline
asm often saves a cmov but in other cases the compiler finds an
even better instruction sequence.
Arnd
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index eebbc8889e70..bdeae9a497e5 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -246,184 +246,18 @@ arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
variable_test_bit(nr, addr);
}
-static __always_inline unsigned long variable__ffs(unsigned long word)
-{
- asm("tzcnt %1,%0"
- : "=r" (word)
- : ASM_INPUT_RM (word));
- return word;
-}
-
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-#define __ffs(word) \
- (__builtin_constant_p(word) ? \
- (unsigned long)__builtin_ctzl(word) : \
- variable__ffs(word))
-
-static __always_inline unsigned long variable_ffz(unsigned long word)
-{
- return variable__ffs(~word);
-}
-
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-#define ffz(word) \
- (__builtin_constant_p(word) ? \
- (unsigned long)__builtin_ctzl(~word) : \
- variable_ffz(word))
-
-/*
- * __fls: find last set bit in word
- * @word: The word to search
- *
- * Undefined if no set bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __fls(unsigned long word)
-{
- if (__builtin_constant_p(word))
- return BITS_PER_LONG - 1 - __builtin_clzl(word);
-
- asm("bsr %1,%0"
- : "=r" (word)
- : ASM_INPUT_RM (word));
- return word;
-}
-
#undef ADDR
-#ifdef __KERNEL__
-static __always_inline int variable_ffs(int x)
-{
- int r;
-
-#ifdef CONFIG_X86_64
- /*
- * AMD64 says BSFL won't clobber the dest reg if x==0; Intel64 says the
- * dest reg is undefined if x==0, but their CPU architect says its
- * value is written to set it to the same as before, except that the
- * top 32 bits will be cleared.
- *
- * We cannot do this on 32 bits because at the very least some
- * 486 CPUs did not behave this way.
- */
- asm("bsfl %1,%0"
- : "=r" (r)
- : ASM_INPUT_RM (x), "0" (-1));
-#elif defined(CONFIG_X86_CMOV)
- asm("bsfl %1,%0\n\t"
- "cmovzl %2,%0"
- : "=&r" (r) : "rm" (x), "r" (-1));
-#else
- asm("bsfl %1,%0\n\t"
- "jnz 1f\n\t"
- "movl $-1,%0\n"
- "1:" : "=r" (r) : "rm" (x));
-#endif
- return r + 1;
-}
-
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
-
-/**
- * fls - find last set bit in word
- * @x: the word to search
- *
- * This is defined in a similar way as the libc and compiler builtin
- * ffs, but returns the position of the most significant set bit.
- *
- * fls(value) returns 0 if value is 0 or the position of the last
- * set bit if value is nonzero. The last (most significant) bit is
- * at position 32.
- */
-static __always_inline int fls(unsigned int x)
-{
- int r;
-
- if (__builtin_constant_p(x))
- return x ? 32 - __builtin_clz(x) : 0;
-
-#ifdef CONFIG_X86_64
- /*
- * AMD64 says BSRL won't clobber the dest reg if x==0; Intel64 says the
- * dest reg is undefined if x==0, but their CPU architect says its
- * value is written to set it to the same as before, except that the
- * top 32 bits will be cleared.
- *
- * We cannot do this on 32 bits because at the very least some
- * 486 CPUs did not behave this way.
- */
- asm("bsrl %1,%0"
- : "=r" (r)
- : ASM_INPUT_RM (x), "0" (-1));
-#elif defined(CONFIG_X86_CMOV)
- asm("bsrl %1,%0\n\t"
- "cmovzl %2,%0"
- : "=&r" (r) : "rm" (x), "rm" (-1));
-#else
- asm("bsrl %1,%0\n\t"
- "jnz 1f\n\t"
- "movl $-1,%0\n"
- "1:" : "=r" (r) : "rm" (x));
-#endif
- return r + 1;
-}
-
-/**
- * fls64 - find last set bit in a 64-bit word
- * @x: the word to search
- *
- * This is defined in a similar way as the libc and compiler builtin
- * ffsll, but returns the position of the most significant set bit.
- *
- * fls64(value) returns 0 if value is 0 or the position of the last
- * set bit if value is nonzero. The last (most significant) bit is
- * at position 64.
- */
-#ifdef CONFIG_X86_64
-static __always_inline int fls64(__u64 x)
-{
- int bitpos = -1;
-
- if (__builtin_constant_p(x))
- return x ? 64 - __builtin_clzll(x) : 0;
- /*
- * AMD64 says BSRQ won't clobber the dest reg if x==0; Intel64 says the
- * dest reg is undefined if x==0, but their CPU architect says its
- * value is written to set it to the same as before.
- */
- asm("bsrq %1,%q0"
- : "+r" (bitpos)
- : ASM_INPUT_RM (x));
- return bitpos + 1;
-}
-#else
+#include <asm-generic/bitops/__ffs.h>
+#include <asm-generic/bitops/ffz.h>
+#include <asm-generic/bitops/builtin-fls.h>
+#include <asm-generic/bitops/__fls.h>
#include <asm-generic/bitops/fls64.h>
-#endif
-#include <asm-generic/bitops/sched.h>
+#include <asm-generic/bitops/sched.h>
+#include <asm-generic/bitops/builtin-ffs.h>
#include <asm/arch_hweight.h>
-
#include <asm-generic/bitops/const_hweight.h>
#include <asm-generic/bitops/instrumented-atomic.h>
@@ -434,5 +268,4 @@ static __always_inline int fls64(__u64 x)
#include <asm-generic/bitops/ext2-atomic-setbit.h>
-#endif /* __KERNEL__ */
#endif /* _ASM_X86_BITOPS_H */
On April 26, 2025 12:55:13 PM PDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Sat, 26 Apr 2025 at 12:24, Linus Torvalds
><torvalds@linux-foundation.org> wrote:
>>
>> (And yes, one use in a x86 header file that is pretty questionable
>> too: I think the reason for the cmov is actually i486-only behavior
>> and we could probably unify the 32-bit and 64-bit implementation)
>
>Actually, what we *should* do is to remove that manual use of 'cmov'
>entirely - even if we decide that yes, that undefined zero case is
>actually real.
>
>We should probably change it to use CC_SET(), and the compiler will do
>a much better job - and probably never use cmov anyway.
>
>And yes, that will generate worse code if you have an old compiler
>that doesn't do ASM_FLAG_OUTPUTS, but hey, that's true in general. If
>you want good code, you need a good compiler.
>
>And clang needs to learn the CC_SET() pattern anyway.
>
>So I think that manual cmov pattern for x86-32 should be replaced with
>
> bool zero;
>
> asm("bsfl %[in],%[out]"
> CC_SET(z)
> : CC_OUT(z) (zero),
> [out]"=r" (r)
> : [in] "rm" (x));
>
> return zero ? 0 : r+1;
>
>instead (that's ffs(), and fls() would need the same thing except with
>bsrl instead, of course).
>
>I bet that would actually improve code generation.
>
>And I also bet it doesn't actually matter, of course.
>
> Linus
It is unfortunate, if understandable, that we ended up using a convention other than what ended up becoming standard. (Return the size in bits if the input is 0.)
This would let us use __builtin_ctz() -> tzcnt, which I believe is always inline on x86, and probably would help several other architectures too.
How much of a pain would it really be to fix this interface?
On April 26, 2025 12:55:13 PM PDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Sat, 26 Apr 2025 at 12:24, Linus Torvalds
><torvalds@linux-foundation.org> wrote:
>>
>> (And yes, one use in a x86 header file that is pretty questionable
>> too: I think the reason for the cmov is actually i486-only behavior
>> and we could probably unify the 32-bit and 64-bit implementation)
>
>Actually, what we *should* do is to remove that manual use of 'cmov'
>entirely - even if we decide that yes, that undefined zero case is
>actually real.
>
>We should probably change it to use CC_SET(), and the compiler will do
>a much better job - and probably never use cmov anyway.
>
>And yes, that will generate worse code if you have an old compiler
>that doesn't do ASM_FLAG_OUTPUTS, but hey, that's true in general. If
>you want good code, you need a good compiler.
>
>And clang needs to learn the CC_SET() pattern anyway.
>
>So I think that manual cmov pattern for x86-32 should be replaced with
>
> bool zero;
>
> asm("bsfl %[in],%[out]"
> CC_SET(z)
> : CC_OUT(z) (zero),
> [out]"=r" (r)
> : [in] "rm" (x));
>
> return zero ? 0 : r+1;
>
>instead (that's ffs(), and fls() would need the same thing except with
>bsrl instead, of course).
>
>I bet that would actually improve code generation.
>
>And I also bet it doesn't actually matter, of course.
>
> Linus
The undefined zero case applies to family < 6 as far as I know... the same platforms which don't have cmov...
* H. Peter Anvin <hpa@zytor.com> wrote:

> The undefined zero case applies to family < 6 as far as I know... the
> same platforms which don't have cmov...

So, technically, these are family 5 CPUs with CMOV support, if
Kconfig.cpu can be believed:

	MGEODE_LX
	MCRUSOE

Right?

	Ingo
* Arnd Bergmann <arnd@arndb.de> wrote:

> CMOV is missing not just on old Socket 5/7 CPUs (Pentium MMX, AMD K6,
> Cyrix MII) but also newer embedded Via C3, Geode GX and
> Vortex86DX/MX/EX/DX2. The replacement Nehemiah (2003), GeodeLX (2005)
> and Vortex86DX3/EX2 (2015!) have CMOV, but the old ones were sold
> alongside them for years, and some of the 586-class Vortex86 products
> are still commercially available.

Very few (if any) of the commercially available products will run
modern 6.16+ kernels, right?

Note that the real danger the 32-bit x86 kernel is going to be facing
in 2-5 years is total removal due to lack of development interest, but
I think we can support 686+ reasonably far into the future, and can
keep it tested reasonably - while covering like 99%+ of the currently
available 32-bit-only x86 products on the market. The fewer variants,
the better.

Thanks,

	Ingo
On Sat, Apr 26, 2025, at 21:09, Ingo Molnar wrote:
> * Arnd Bergmann <arnd@arndb.de> wrote:
>
>> CMOV is missing not just on old Socket 5/7 CPUs (Pentium MMX, AMD K6,
>> Cyrix MII) but also newer embedded Via C3, Geode GX and
>> Vortex86DX/MX/EX/DX2. The replacement Nehemiah (2003), GeodeLX (2005)
>> and Vortex86DX3/EX2 (2015!) have CMOV, but the old ones were sold
>> alongside them for years, and some of the 586-class Vortex86 products
>> are still commercially available.
>
> Very few (if any) of the commercially available products will run
> modern 6.16+ kernels, right?
No, at least not in absolute numbers. As far as I can tell, the RDC
SoC family is the only one that is still around, after Quark, Geode
and Eden were all discontinued around 2019.
There are multiple known RDC licensees (DM&P/Vortex86, xlichip) and
probably a few more with custom chips. They lag behind Intel and AMD
by about one patent expiration time, and maybe a decade behind Arm
SoCs, so they only just arrived at quad-core SMP, LPDDR4, and SSSE3
instructions and have announced upcoming 64-bit chips.
They do have super-long support cycles, and there are a few markets
that absolutely require kernel updates for many years, so I would
still consider the 586-class embedded chips more relevant for future
kernels than 30 year old PCs, and the 686-class embedded chips
more relevant than 20 year old laptops.
> Note that the real danger the 32-bit x86 kernel is going to be facing
> in 2-5 years is total removal due to lack of development interest, but
> I think we can support 686+ reasonably far into the future, and can
> keep it tested reasonably - while covering like 99%+ of the currently
> available 32-bit-only x86 products on the market. The fewer variants,
> the better.
I agree that this is the endgame for x86-32 and that eventually
the whole thing will be removed, and every simplification helps, but
I still don't think that removing 586 earlier helps enough to
outweigh the cost here.
The situation is similar on 32-bit Arm, where we currently support
armv4, armv4t, armv5, armv6, armv6k, armv7, armv7ve and armv8-aarch32.
Removing armv3 a few years ago helped a lot, removing the extremely
rare armv6 will help as well, but there is little value in dropping
CPU support for v4 and v4t as long as v5 is there, or v6k and v7
as long as we have v7ve and v8-aarch32. My guess would be that we
can remove armv4/v4t/v5 at the same time as i586/i686 and
some other 32-bit targets, followed by armv7/v7ve/v8-aarch32
much later.
Arnd
On April 27, 2025 6:24:59 AM PDT, Arnd Bergmann <arnd@arndb.de> wrote:
>On Sat, Apr 26, 2025, at 21:09, Ingo Molnar wrote:
>> * Arnd Bergmann <arnd@arndb.de> wrote:
>>
>>> CMOV is missing not just on old Socket 5/7 CPUs (Pentium MMX, AMD K6,
>>> Cyrix MII) but also newer embedded Via C3, Geode GX and
>>> Vortex86DX/MX/EX/DX2. The replacement Nehemiah (2003), GeodeLX (2005)
>>> and Vortex86DX3/EX2 (2015!) have CMOV, but the old ones were sold
>>> alongside them for years, and some of the 586-class Vortex86 products
>>> are still commercially available.
>>
>> Very few (if any) of the commercially available products will run
>> modern 6.16+ kernels, right?
>
>No, at least not in absolute numbers. As far as I can tell, the RDC
>SoC family is the only one that is still around, after Quark, Geode
>and Eden were all discontinued around 2019.
>
>There are multiple known RDC licensees (DM&P/Vortex86, xlichip) and
>probably a few more with custom chips. They lag behind Intel and AMD
>by about one patent expiration time, and maybe a decade behind Arm
>SoCs, so they only just arrived at quad-core SMP, LPDDR4, and SSSE3
>instructions and have announced upcoming 64-bit chips.
>
>They do have super-long support cycles, and there are a few markets
>that absolutely require kernel updates for many years, so I would
>still consider the 586-class embedded chips more relevant for future
>kernels than 30 year old PCs, and the 686-class embedded chips
>more relevant than 20 year old laptops.
>
>> Note that the real danger the 32-bit x86 kernel is going to be facing
>> in 2-5 years is total removal due to lack of development interest, but
>> I think we can support 686+ reasonably far into the future, and can
>> keep it tested reasonably - while covering like 99%+ of the currently
>> available 32-bit-only x86 products on the market. The fewer variants,
>> the better.
>
>I agree that this is the endgame for x86-32 and that eventually
>the whole thing will be removed, and every simplification helps, but
>I still don't think that removing 586 earlier helps enough to
>outweigh the cost here.
>
>The situation is similar on 32-bit Arm, where we currently support
>armv4, armv4t, armv5, armv6, armv6k, armv7, armv7ve and armv8-aarch32.
>Removing armv3 a few years ago helped a lot, removing the extremely
>rare armv6 will help as well, but there is little value in dropping
>CPU support for v4 and v4t as long as v5 is there, or v6k and v7
>as long as we have v7ve and v8-aarch32. My guess would be that we
>can remove armv4/v4t/v5 at the same time as i586/i686 and
>some other 32-bit targets, followed by armv7/v7ve/v8-aarch32
>much later.
>
>	Arnd

"Commercially available" doesn't mean "binary distribution." There is a
whole world beyond the desktop distros. These kinds of systems run
embedded stacks that are at least in part compiled by the appliance
vendor to allow for higher configurability.
On April 26, 2025 2:08:17 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Arnd Bergmann <arnd@kernel.org> wrote:
>
>> From: Arnd Bergmann <arnd@arndb.de>
>>
>> With cx8 and tsc being mandatory features, the only important
>> architectural features are now cmov and pae.
>>
>> Change the large list of target CPUs to no longer pick the instruction set
>> itself but only the mtune= optimization level and in-kernel optimizations
>> that remain compatible with all cores.
>>
>> The CONFIG_X86_CMOV instead becomes user-selectable and is now how
>> Kconfig picks between 586-class (Pentium, Pentium MMX, K6, C3, GeodeGX)
>> and 686-class (everything else) targets.
>>
>> In order to allow running on late 32-bit cores (Athlon, Pentium-M,
>> Pentium 4, ...), the X86_L1_CACHE_SHIFT can no longer be set to anything
>> lower than 6 (i.e. 64 byte cache lines).
>>
>> The optimization options now depend on X86_CMOV and X86_PAE instead
>> of the other way round, while other compile-time conditionals that
>> checked for MATOM/MGEODEGX1 instead now check for CPU_SUP_* options
>> that enable support for a particular CPU family.
>>
>> Link: https://lore.kernel.org/lkml/dd29df0c-0b4f-44e6-b71b-2a358ea76fb4@app.fastmail.com/
>> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>> ---
>> This is what I had in mind as mentioned in the earlier thread on
>> cx8/tsc removal. I based this on top of the Ingo's [RFC 15/15]
>> patch.
>> ---
>>  arch/x86/Kconfig                |   2 +-
>>  arch/x86/Kconfig.cpu            | 100 ++++++++++++++------------------
>>  arch/x86/Makefile_32.cpu        |  48 +++++++--------
>>  arch/x86/include/asm/vermagic.h |  36 +-----------
>>  arch/x86/kernel/tsc.c           |   2 +-
>>  arch/x86/xen/Kconfig            |   1 -
>>  drivers/misc/mei/Kconfig        |   2 +-
>>  7 files changed, 74 insertions(+), 117 deletions(-)
>
>While the simplification is nice on its face, this looks messy:
>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index a9d717558972..1e33f88c9b97 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1438,7 +1438,7 @@ config HIGHMEM
>>
>>  config X86_PAE
>>  	bool "PAE (Physical Address Extension) Support"
>> -	depends on X86_32 && X86_HAVE_PAE
>> +	depends on X86_32 && X86_CMOV
>
>Coupling CMOV to PAE ... :-/
>
>> +config X86_CMOV
>> +	bool "Require 686-class CMOV instructions" if X86_32
>> +	default y
>>  	help
>> -	  This is the processor type of your CPU. This information is
>> -	  used for optimizing purposes. In order to compile a kernel
>> -	  that can run on all supported x86 CPU types (albeit not
>> -	  optimally fast), you can specify "586" here.
>> +	  Most x86-32 processor implementations are compatible with
>> +	  the CMOV instruction originally added in the Pentium Pro,
>> +	  and they perform much better when using it.
>> +
>> +	  Disable this option to build for 586-class CPUs without this
>> +	  instruction. This is only required for the original Intel
>> +	  Pentium (P5, P54, P55), AMD K6/K6-II/K6-3D, Geode GX1 and Via
>> +	  CyrixIII/C3 CPUs.
>
>Very few users will know anything about CMOV.
>
>I'd argue the right path forward is to just bite the bullet and remove
>non-CMOV support as well, that would be the outcome *anyway* in a few
>years. That would allow basically a single 'modern' 32-bit kernel that
>is supposed to boot on every supported CPU. People might even end up
>testing it ... ;-)
>
>Thanks,
>
>	Ingo

Dropping CMOV would mean dropping P5 support.
* H. Peter Anvin <hpa@zytor.com> wrote:

> Dropping CMOV would mean dropping P5 support.

Yeah, I think we should make the cutoff at the 686 level. Is there any
strong reason not to do that? Stable kernels will still exist for a
very long time for ancient boards.

	Ingo
On April 26, 2025 11:55:50 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* H. Peter Anvin <hpa@zytor.com> wrote:
>
>> Dropping CMOV would mean dropping P5 support.
>
>Yeah, I think we should make the cutoff at the 686 level. Is there any
>strong reason not to do that? Stable kernels will still exist for a
>very long time for ancient boards.
>
>	Ingo

I don't think some of the embedded 586-level ISA CPUs are ancient.
On April 25, 2025 7:15:15 AM PDT, Arnd Bergmann <arnd@kernel.org> wrote:
>From: Arnd Bergmann <arnd@arndb.de>
>
>With cx8 and tsc being mandatory features, the only important
>architectural features are now cmov and pae.
>
>Change the large list of target CPUs to no longer pick the instruction set
>itself but only the mtune= optimization level and in-kernel optimizations
>that remain compatible with all cores.
>
>The CONFIG_X86_CMOV instead becomes user-selectable and is now how
>Kconfig picks between 586-class (Pentium, Pentium MMX, K6, C3, GeodeGX)
>and 686-class (everything else) targets.
>
>In order to allow running on late 32-bit cores (Athlon, Pentium-M,
>Pentium 4, ...), the X86_L1_CACHE_SHIFT can no longer be set to anything
>lower than 6 (i.e. 64 byte cache lines).
>
>The optimization options now depend on X86_CMOV and X86_PAE instead
>of the other way round, while other compile-time conditionals that
>checked for MATOM/MGEODEGX1 instead now check for CPU_SUP_* options
>that enable support for a particular CPU family.
>
>Link: https://lore.kernel.org/lkml/dd29df0c-0b4f-44e6-b71b-2a358ea76fb4@app.fastmail.com/
>Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>---
>This is what I had in mind as mentioned in the earlier thread on
>cx8/tsc removal. I based this on top of the Ingo's [RFC 15/15]
>patch.
>---
> arch/x86/Kconfig | 2 +-
> arch/x86/Kconfig.cpu | 100 ++++++++++++++------------------
> arch/x86/Makefile_32.cpu | 48 +++++++--------
> arch/x86/include/asm/vermagic.h | 36 +-----------
> arch/x86/kernel/tsc.c | 2 +-
> arch/x86/xen/Kconfig | 1 -
> drivers/misc/mei/Kconfig | 2 +-
> 7 files changed, 74 insertions(+), 117 deletions(-)
>
>diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>index a9d717558972..1e33f88c9b97 100644
>--- a/arch/x86/Kconfig
>+++ b/arch/x86/Kconfig
>@@ -1438,7 +1438,7 @@ config HIGHMEM
>
> config X86_PAE
> bool "PAE (Physical Address Extension) Support"
>- depends on X86_32 && X86_HAVE_PAE
>+ depends on X86_32 && X86_CMOV
> select PHYS_ADDR_T_64BIT
> help
> PAE is required for NX support, and furthermore enables
>diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
>index 6f1e8cc8fe58..0619566de93f 100644
>--- a/arch/x86/Kconfig.cpu
>+++ b/arch/x86/Kconfig.cpu
>@@ -1,23 +1,32 @@
> # SPDX-License-Identifier: GPL-2.0
> # Put here option for CPU selection and depending optimization
>-choice
>- prompt "x86-32 Processor family"
>- depends on X86_32
>- default M686
>+
>+config X86_CMOV
>+ bool "Require 686-class CMOV instructions" if X86_32
>+ default y
> help
>- This is the processor type of your CPU. This information is
>- used for optimizing purposes. In order to compile a kernel
>- that can run on all supported x86 CPU types (albeit not
>- optimally fast), you can specify "586" here.
>+ Most x86-32 processor implementations are compatible with
>+ the CMOV instruction originally added in the Pentium Pro,
>+ and they perform much better when using it.
>+
>+ Disable this option to build for 586-class CPUs without this
>+ instruction. This is only required for the original Intel
>+ Pentium (P5, P54, P55), AMD K6/K6-II/K6-3D, Geode GX1 and Via
>+ CyrixIII/C3 CPUs.
>
> Note that the 386 and 486 is no longer supported, this includes
> AMD/Cyrix/Intel 386DX/DXL/SL/SLC/SX, Cyrix/TI 486DLC/DLC2,
> UMC 486SX-S and the NexGen Nx586, AMD ELAN and all 486 based
> CPUs.
>
>- The kernel will not necessarily run on earlier architectures than
>- the one you have chosen, e.g. a Pentium optimized kernel will run on
>- a PPro, but not necessarily on a i486.
>+choice
>+ prompt "x86-32 Processor optimization"
>+ depends on X86_32
>+ default X86_GENERIC
>+ help
>+ This is the processor type of your CPU. This information is
>+ used for optimizing purposes, but does not change compatibility
>+ with other CPU types.
>
> Here are the settings recommended for greatest speed:
> - "586" for generic Pentium CPUs lacking the TSC
>@@ -45,14 +54,13 @@ choice
>
> config M586TSC
> bool "Pentium-Classic"
>- depends on X86_32
>+ depends on X86_32 && !X86_CMOV
> help
>- Select this for a Pentium Classic processor with the RDTSC (Read
>- Time Stamp Counter) instruction for benchmarking.
>+ Select this for a Pentium Classic processor.
>
> config M586MMX
> bool "Pentium-MMX"
>- depends on X86_32
>+ depends on X86_32 && !X86_CMOV
> help
> Select this for a Pentium with the MMX graphics/multimedia
> extended instructions.
>@@ -117,7 +125,7 @@ config MPENTIUM4
>
> config MK6
> bool "K6/K6-II/K6-III"
>- depends on X86_32
>+ depends on X86_32 && !X86_CMOV
> help
> Select this for an AMD K6-family processor. Enables use of
> some extended instructions, and passes appropriate optimization
>@@ -125,7 +133,7 @@ config MK6
>
> config MK7
> bool "Athlon/Duron/K7"
>- depends on X86_32
>+ depends on X86_32 && !X86_PAE
> help
> Select this for an AMD Athlon K7-family processor. Enables use of
> some extended instructions, and passes appropriate optimization
>@@ -147,42 +155,37 @@ config MEFFICEON
>
> config MGEODEGX1
> bool "GeodeGX1"
>- depends on X86_32
>+ depends on X86_32 && !X86_CMOV
> help
> Select this for a Geode GX1 (Cyrix MediaGX) chip.
>
> config MGEODE_LX
> bool "Geode GX/LX"
>- depends on X86_32
>+ depends on X86_32 && !X86_PAE
> help
> Select this for AMD Geode GX and LX processors.
>
> config MCYRIXIII
> bool "CyrixIII/VIA-C3"
>- depends on X86_32
>+ depends on X86_32 && !X86_CMOV
> help
> Select this for a Cyrix III or C3 chip. Presently Linux and GCC
> treat this chip as a generic 586. Whilst the CPU is 686 class,
> it lacks the cmov extension which gcc assumes is present when
> generating 686 code.
>- Note that Nehemiah (Model 9) and above will not boot with this
>- kernel due to them lacking the 3DNow! instructions used in earlier
>- incarnations of the CPU.
>
> config MVIAC3_2
> bool "VIA C3-2 (Nehemiah)"
>- depends on X86_32
>+ depends on X86_32 && !X86_PAE
> help
> Select this for a VIA C3 "Nehemiah". Selecting this enables usage
> of SSE and tells gcc to treat the CPU as a 686.
>- Note, this kernel will not boot on older (pre model 9) C3s.
>
> config MVIAC7
> bool "VIA C7"
>- depends on X86_32
>+ depends on X86_32 && !X86_PAE
> help
>- Select this for a VIA C7. Selecting this uses the correct cache
>- shift and tells gcc to treat the CPU as a 686.
>+ Select this for a VIA C7.
>
> config MATOM
> bool "Intel Atom"
>@@ -192,20 +195,19 @@ config MATOM
> accordingly optimized code. Use a recent GCC with specific Atom
> support in order to fully benefit from selecting this option.
>
>-endchoice
>-
> config X86_GENERIC
>- bool "Generic x86 support"
>- depends on X86_32
>+ bool "Generic x86"
> help
>- Instead of just including optimizations for the selected
>+ Instead of just including optimizations for a particular
> x86 variant (e.g. PII, Crusoe or Athlon), include some more
> generic optimizations as well. This will make the kernel
>- perform better on x86 CPUs other than that selected.
>+ perform better on a variety of CPUs.
>
> This is really intended for distributors who need more
> generic optimizations.
>
>+endchoice
>+
> #
> # Define implied options from the CPU selection here
> config X86_INTERNODE_CACHE_SHIFT
>@@ -216,17 +218,14 @@ config X86_INTERNODE_CACHE_SHIFT
> config X86_L1_CACHE_SHIFT
> int
> default "7" if MPENTIUM4
>- default "6" if MK7 || MPENTIUMM || MATOM || MVIAC7 || X86_GENERIC || X86_64
>- default "4" if MGEODEGX1
>- default "5" if MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MVIAC3_2 || MGEODE_LX
>+ default "6"
>
> config X86_F00F_BUG
>- def_bool y
>- depends on M586MMX || M586TSC || M586
>+ def_bool !X86_CMOV
>
> config X86_ALIGNMENT_16
> def_bool y
>- depends on MCYRIXIII || MK6 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODEGX1
>+ depends on MCYRIXIII || MK6 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODEGX1 || (!X86_CMOV && X86_GENERIC)
>
> config X86_INTEL_USERCOPY
> def_bool y
>@@ -234,34 +233,23 @@ config X86_INTEL_USERCOPY
>
> config X86_USE_PPRO_CHECKSUM
> def_bool y
>- depends on MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MATOM
>+ depends on MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MATOM || (X86_CMOV && X86_GENERIC)
>
> config X86_TSC
> def_bool y
>
>-config X86_HAVE_PAE
>- def_bool y
>- depends on MCRUSOE || MEFFICEON || MCYRIXIII || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC7 || MATOM || X86_64
>-
> config X86_CX8
> def_bool y
>
>-# this should be set for all -march=.. options where the compiler
>-# generates cmov.
>-config X86_CMOV
>- def_bool y
>- depends on (MK7 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || MATOM || MGEODE_LX || X86_64)
>-
> config X86_MINIMUM_CPU_FAMILY
> int
> default "64" if X86_64
>- default "6" if X86_32 && (MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MATOM || MK7)
>- default "5" if X86_32
>- default "4"
>+ default "6" if X86_32 && X86_CMOV
>+ default "5"
>
> config X86_DEBUGCTLMSR
> def_bool y
>- depends on !(MK6 || MCYRIXIII || M586MMX || M586TSC || M586) && !UML
>+ depends on X86_CMOV && !UML
>
> config IA32_FEAT_CTL
> def_bool y
>@@ -297,7 +285,7 @@ config CPU_SUP_INTEL
> config CPU_SUP_CYRIX_32
> default y
> bool "Support Cyrix processors" if PROCESSOR_SELECT
>- depends on M586 || M586TSC || M586MMX || (EXPERT && !64BIT)
>+ depends on !64BIT
> help
> This enables detection, tunings and quirks for Cyrix processors
>
>diff --git a/arch/x86/Makefile_32.cpu b/arch/x86/Makefile_32.cpu
>index f5e933077bf4..ebd7ec6eaf34 100644
>--- a/arch/x86/Makefile_32.cpu
>+++ b/arch/x86/Makefile_32.cpu
>@@ -10,30 +10,32 @@ else
> align := -falign-functions=0 -falign-jumps=0 -falign-loops=0
> endif
>
>-cflags-$(CONFIG_M586TSC) += -march=i586
>-cflags-$(CONFIG_M586MMX) += -march=pentium-mmx
>-cflags-$(CONFIG_M686) += -march=i686
>-cflags-$(CONFIG_MPENTIUMII) += -march=i686 $(call tune,pentium2)
>-cflags-$(CONFIG_MPENTIUMIII) += -march=i686 $(call tune,pentium3)
>-cflags-$(CONFIG_MPENTIUMM) += -march=i686 $(call tune,pentium3)
>-cflags-$(CONFIG_MPENTIUM4) += -march=i686 $(call tune,pentium4)
>-cflags-$(CONFIG_MK6) += -march=k6
>-# Please note, that patches that add -march=athlon-xp and friends are pointless.
>-# They make zero difference whatsosever to performance at this time.
>-cflags-$(CONFIG_MK7) += -march=athlon
>-cflags-$(CONFIG_MCRUSOE) += -march=i686 $(align)
>-cflags-$(CONFIG_MEFFICEON) += -march=i686 $(call tune,pentium3) $(align)
>-cflags-$(CONFIG_MCYRIXIII) += $(call cc-option,-march=c3,-march=i486) $(align)
>-cflags-$(CONFIG_MVIAC3_2) += $(call cc-option,-march=c3-2,-march=i686)
>-cflags-$(CONFIG_MVIAC7) += -march=i686
>-cflags-$(CONFIG_MATOM) += -march=atom
>+ifdef CONFIG_X86_CMOV
>+cflags-y += -march=i686
>+else
>+cflags-y += -march=i586
>+endif
>
>-# Geode GX1 support
>-cflags-$(CONFIG_MGEODEGX1) += -march=pentium-mmx
>-cflags-$(CONFIG_MGEODE_LX) += $(call cc-option,-march=geode,-march=pentium-mmx)
>-# add at the end to overwrite eventual tuning options from earlier
>-# cpu entries
>-cflags-$(CONFIG_X86_GENERIC) += $(call tune,generic,$(call tune,i686))
>+cflags-$(CONFIG_M586TSC) += -mtune=i586
>+cflags-$(CONFIG_M586MMX) += -mtune=pentium-mmx
>+cflags-$(CONFIG_M686) += -mtune=i686
>+cflags-$(CONFIG_MPENTIUMII) += -mtune=pentium2
>+cflags-$(CONFIG_MPENTIUMIII) += -mtune=pentium3
>+cflags-$(CONFIG_MPENTIUMM) += -mtune=pentium3
>+cflags-$(CONFIG_MPENTIUM4) += -mtune=pentium4
>+cflags-$(CONFIG_MK6) += -mtune=k6
>+# Please note that patches that add -mtune=athlon-xp and friends are pointless.
>+# They make zero difference whatsoever to performance at this time.
>+cflags-$(CONFIG_MK7) += -mtune=athlon
>+cflags-$(CONFIG_MCRUSOE) += -mtune=i686 $(align)
>+cflags-$(CONFIG_MEFFICEON) += -mtune=pentium3 $(align)
>+cflags-$(CONFIG_MCYRIXIII) += -mtune=c3 $(align)
>+cflags-$(CONFIG_MVIAC3_2) += -mtune=c3-2
>+cflags-$(CONFIG_MVIAC7) += -mtune=i686
>+cflags-$(CONFIG_MATOM) += -mtune=atom
>+cflags-$(CONFIG_MGEODEGX1) += -mtune=pentium-mmx
>+cflags-$(CONFIG_MGEODE_LX) += -mtune=geode
>+cflags-$(CONFIG_X86_GENERIC) += -mtune=generic
>
> # Bug fix for binutils: this option is required in order to keep
> # binutils from generating NOPL instructions against our will.
>diff --git a/arch/x86/include/asm/vermagic.h b/arch/x86/include/asm/vermagic.h
>index e26061df0c9b..6554dbdfd719 100644
>--- a/arch/x86/include/asm/vermagic.h
>+++ b/arch/x86/include/asm/vermagic.h
>@@ -5,42 +5,10 @@
>
> #ifdef CONFIG_X86_64
> /* X86_64 does not define MODULE_PROC_FAMILY */
>-#elif defined CONFIG_M586TSC
>-#define MODULE_PROC_FAMILY "586TSC "
>-#elif defined CONFIG_M586MMX
>-#define MODULE_PROC_FAMILY "586MMX "
>-#elif defined CONFIG_MATOM
>-#define MODULE_PROC_FAMILY "ATOM "
>-#elif defined CONFIG_M686
>+#elif defined CONFIG_X86_CMOV
> #define MODULE_PROC_FAMILY "686 "
>-#elif defined CONFIG_MPENTIUMII
>-#define MODULE_PROC_FAMILY "PENTIUMII "
>-#elif defined CONFIG_MPENTIUMIII
>-#define MODULE_PROC_FAMILY "PENTIUMIII "
>-#elif defined CONFIG_MPENTIUMM
>-#define MODULE_PROC_FAMILY "PENTIUMM "
>-#elif defined CONFIG_MPENTIUM4
>-#define MODULE_PROC_FAMILY "PENTIUM4 "
>-#elif defined CONFIG_MK6
>-#define MODULE_PROC_FAMILY "K6 "
>-#elif defined CONFIG_MK7
>-#define MODULE_PROC_FAMILY "K7 "
>-#elif defined CONFIG_MCRUSOE
>-#define MODULE_PROC_FAMILY "CRUSOE "
>-#elif defined CONFIG_MEFFICEON
>-#define MODULE_PROC_FAMILY "EFFICEON "
>-#elif defined CONFIG_MCYRIXIII
>-#define MODULE_PROC_FAMILY "CYRIXIII "
>-#elif defined CONFIG_MVIAC3_2
>-#define MODULE_PROC_FAMILY "VIAC3-2 "
>-#elif defined CONFIG_MVIAC7
>-#define MODULE_PROC_FAMILY "VIAC7 "
>-#elif defined CONFIG_MGEODEGX1
>-#define MODULE_PROC_FAMILY "GEODEGX1 "
>-#elif defined CONFIG_MGEODE_LX
>-#define MODULE_PROC_FAMILY "GEODE "
> #else
>-#error unknown processor family
>+#define MODULE_PROC_FAMILY "586 "
> #endif
>
> #ifdef CONFIG_X86_32
>diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
>index 489c779ef3ef..76b15ef8c85f 100644
>--- a/arch/x86/kernel/tsc.c
>+++ b/arch/x86/kernel/tsc.c
>@@ -1221,7 +1221,7 @@ bool tsc_clocksource_watchdog_disabled(void)
>
> static void __init check_system_tsc_reliable(void)
> {
>-#if defined(CONFIG_MGEODEGX1) || defined(CONFIG_MGEODE_LX) || defined(CONFIG_X86_GENERIC)
>+#if defined(CONFIG_CPU_SUP_CYRIX)
> if (is_geode_lx()) {
> /* RTSC counts during suspend */
> #define RTSC_SUSP 0x100
>diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
>index 222b6fdad313..2648459b8e8f 100644
>--- a/arch/x86/xen/Kconfig
>+++ b/arch/x86/xen/Kconfig
>@@ -9,7 +9,6 @@ config XEN
> select PARAVIRT_CLOCK
> select X86_HV_CALLBACK_VECTOR
> depends on X86_64 || (X86_32 && X86_PAE)
>- depends on X86_64 || (X86_GENERIC || MPENTIUM4 || MATOM)
> depends on X86_LOCAL_APIC
> help
> This is the Linux Xen port. Enabling this will allow the
>diff --git a/drivers/misc/mei/Kconfig b/drivers/misc/mei/Kconfig
>index 7575fee96cc6..4deb17ed0a62 100644
>--- a/drivers/misc/mei/Kconfig
>+++ b/drivers/misc/mei/Kconfig
>@@ -3,7 +3,7 @@
> config INTEL_MEI
> tristate "Intel Management Engine Interface"
> depends on X86 && PCI
>- default X86_64 || MATOM
>+ default X86_64 || CPU_SUP_INTEL
> help
> The Intel Management Engine (Intel ME) provides Manageability,
> Security and Media services for system containing Intel chipsets.
I really don't like testing an unrelated feature (CMOV for PAE); furthermore, at least some old hypervisors were known to have broken PAE.
At the very least it needs to be abstracted for clarity reasons.
Nacked-by: H. Peter Anvin <hpa@zytor.com>
On Fri, Apr 25, 2025, at 17:34, H. Peter Anvin wrote:
> On April 25, 2025 7:15:15 AM PDT, Arnd Bergmann <arnd@kernel.org> wrote:
>
> I really don't like testing an unrelated feature (CMOV for PAE);
How about a new symbol with the opposite polarity, e.g. CONFIG_CPU_586?
In that case, X86_HAVE_PAE and X86_CMOV could both depend on that
not being set.
I only picked the X86_CMOV symbol because it already exists in .config
files, but that is not the important bit here.
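A minimal Kconfig sketch of that opposite-polarity idea (symbol name and placement are hypothetical, not part of the posted patch):

```kconfig
# Hypothetical: a single user-visible 586-class symbol, with the
# architectural features depending on it *not* being set, so no
# feature option is tested for an unrelated feature.
config CPU_586
	bool "586-class CPU (no CMOV, no PAE)"
	depends on X86_32

config X86_CMOV
	def_bool y
	depends on !CPU_586

config X86_HAVE_PAE
	def_bool y
	depends on !CPU_586
```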
> furthermore, at least some old hypervisors were known to have
> broken PAE.
I'm not following. What does that have to do with my patch?
Arnd
On April 25, 2025 9:13:31 AM PDT, Arnd Bergmann <arnd@arndb.de> wrote:
>On Fri, Apr 25, 2025, at 17:34, H. Peter Anvin wrote:
>> On April 25, 2025 7:15:15 AM PDT, Arnd Bergmann <arnd@kernel.org> wrote:
>>
>> I really don't like testing an unrelated feature (CMOV for PAE);
>
>How about a new symbol with the opposite polarity, e.g. CONFIG_CPU_586?
>In that case, X86_HAVE_PAE and X86_CMOV could both depend on that
>not being set.
>
>I only picked the X86_CMOV symbol because it already exists in .config
>files, but that is not the important bit here.
>
>> furthermore, at least some old hypervisors were known to have
>> broken PAE.
>
>I'm not following. What does that have to do with my patch?
>
>     Arnd

This seems overly complex to me.