From: Arnd Bergmann <arnd@arndb.de>
With cx8 and tsc being mandatory features, the only important
architectural features are now cmov and pae.
Change the large list of target CPUs to no longer pick the instruction set
itself but only the mtune= optimization level and in-kernel optimizations
that remain compatible with all cores.
The CONFIG_X86_CMOV instead becomes user-selectable and is now how
Kconfig picks between 586-class (Pentium, Pentium MMX, K6, C3, GeodeGX)
and 686-class (everything else) targets.
In order to allow running on late 32-bit cores (Athlon, Pentium-M,
Pentium 4, ...), the X86_L1_CACHE_SHIFT can no longer be set to anything
lower than 6 (i.e. 64 byte cache lines).
The optimization options now depend on X86_CMOV and X86_PAE instead
of the other way round, while other compile-time conditionals that
checked for MATOM/MGEODEGX1 instead now check for CPU_SUP_* options
that enable support for a particular CPU family.
Link: https://lore.kernel.org/lkml/dd29df0c-0b4f-44e6-b71b-2a358ea76fb4@app.fastmail.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
This is what I had in mind as mentioned in the earlier thread on
cx8/tsc removal. I based this on top of Ingo's [RFC 15/15]
patch.
---
arch/x86/Kconfig | 2 +-
arch/x86/Kconfig.cpu | 100 ++++++++++++++------------------
arch/x86/Makefile_32.cpu | 48 +++++++--------
arch/x86/include/asm/vermagic.h | 36 +-----------
arch/x86/kernel/tsc.c | 2 +-
arch/x86/xen/Kconfig | 1 -
drivers/misc/mei/Kconfig | 2 +-
7 files changed, 74 insertions(+), 117 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a9d717558972..1e33f88c9b97 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1438,7 +1438,7 @@ config HIGHMEM
config X86_PAE
bool "PAE (Physical Address Extension) Support"
- depends on X86_32 && X86_HAVE_PAE
+ depends on X86_32 && X86_CMOV
select PHYS_ADDR_T_64BIT
help
PAE is required for NX support, and furthermore enables
diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 6f1e8cc8fe58..0619566de93f 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -1,23 +1,32 @@
# SPDX-License-Identifier: GPL-2.0
# Put here option for CPU selection and depending optimization
-choice
- prompt "x86-32 Processor family"
- depends on X86_32
- default M686
+
+config X86_CMOV
+ bool "Require 686-class CMOV instructions" if X86_32
+ default y
help
- This is the processor type of your CPU. This information is
- used for optimizing purposes. In order to compile a kernel
- that can run on all supported x86 CPU types (albeit not
- optimally fast), you can specify "586" here.
+ Most x86-32 processor implementations are compatible with
+ the CMOV instruction originally added in the Pentium Pro,
+ and they perform much better when using it.
+
+ Disable this option to build for 586-class CPUs without this
+ instruction. This is only required for the original Intel
+ Pentium (P5, P54, P55), AMD K6/K6-II/K6-3D, Geode GX1 and Via
+ CyrixIII/C3 CPUs.
Note that the 386 and 486 is no longer supported, this includes
AMD/Cyrix/Intel 386DX/DXL/SL/SLC/SX, Cyrix/TI 486DLC/DLC2,
UMC 486SX-S and the NexGen Nx586, AMD ELAN and all 486 based
CPUs.
- The kernel will not necessarily run on earlier architectures than
- the one you have chosen, e.g. a Pentium optimized kernel will run on
- a PPro, but not necessarily on a i486.
+choice
+ prompt "x86-32 Processor optimization"
+ depends on X86_32
+ default X86_GENERIC
+ help
+ This is the processor type of your CPU. This information is
+ used for optimizing purposes, but does not change compatibility
+ with other CPU types.
Here are the settings recommended for greatest speed:
- "586" for generic Pentium CPUs lacking the TSC
@@ -45,14 +54,13 @@ choice
config M586TSC
bool "Pentium-Classic"
- depends on X86_32
+ depends on X86_32 && !X86_CMOV
help
- Select this for a Pentium Classic processor with the RDTSC (Read
- Time Stamp Counter) instruction for benchmarking.
+ Select this for a Pentium Classic processor.
config M586MMX
bool "Pentium-MMX"
- depends on X86_32
+ depends on X86_32 && !X86_CMOV
help
Select this for a Pentium with the MMX graphics/multimedia
extended instructions.
@@ -117,7 +125,7 @@ config MPENTIUM4
config MK6
bool "K6/K6-II/K6-III"
- depends on X86_32
+ depends on X86_32 && !X86_CMOV
help
Select this for an AMD K6-family processor. Enables use of
some extended instructions, and passes appropriate optimization
@@ -125,7 +133,7 @@ config MK6
config MK7
bool "Athlon/Duron/K7"
- depends on X86_32
+ depends on X86_32 && !X86_PAE
help
Select this for an AMD Athlon K7-family processor. Enables use of
some extended instructions, and passes appropriate optimization
@@ -147,42 +155,37 @@ config MEFFICEON
config MGEODEGX1
bool "GeodeGX1"
- depends on X86_32
+ depends on X86_32 && !X86_CMOV
help
Select this for a Geode GX1 (Cyrix MediaGX) chip.
config MGEODE_LX
bool "Geode GX/LX"
- depends on X86_32
+ depends on X86_32 && !X86_PAE
help
Select this for AMD Geode GX and LX processors.
config MCYRIXIII
bool "CyrixIII/VIA-C3"
- depends on X86_32
+ depends on X86_32 && !X86_CMOV
help
Select this for a Cyrix III or C3 chip. Presently Linux and GCC
treat this chip as a generic 586. Whilst the CPU is 686 class,
it lacks the cmov extension which gcc assumes is present when
generating 686 code.
- Note that Nehemiah (Model 9) and above will not boot with this
- kernel due to them lacking the 3DNow! instructions used in earlier
- incarnations of the CPU.
config MVIAC3_2
bool "VIA C3-2 (Nehemiah)"
- depends on X86_32
+ depends on X86_32 && !X86_PAE
help
Select this for a VIA C3 "Nehemiah". Selecting this enables usage
of SSE and tells gcc to treat the CPU as a 686.
- Note, this kernel will not boot on older (pre model 9) C3s.
config MVIAC7
bool "VIA C7"
- depends on X86_32
+ depends on X86_32 && !X86_PAE
help
- Select this for a VIA C7. Selecting this uses the correct cache
- shift and tells gcc to treat the CPU as a 686.
+ Select this for a VIA C7.
config MATOM
bool "Intel Atom"
@@ -192,20 +195,19 @@ config MATOM
accordingly optimized code. Use a recent GCC with specific Atom
support in order to fully benefit from selecting this option.
-endchoice
-
config X86_GENERIC
- bool "Generic x86 support"
- depends on X86_32
+ bool "Generic x86"
help
- Instead of just including optimizations for the selected
+ Instead of just including optimizations for a particular
x86 variant (e.g. PII, Crusoe or Athlon), include some more
generic optimizations as well. This will make the kernel
- perform better on x86 CPUs other than that selected.
+ perform better on a variety of CPUs.
This is really intended for distributors who need more
generic optimizations.
+endchoice
+
#
# Define implied options from the CPU selection here
config X86_INTERNODE_CACHE_SHIFT
@@ -216,17 +218,14 @@ config X86_INTERNODE_CACHE_SHIFT
config X86_L1_CACHE_SHIFT
int
default "7" if MPENTIUM4
- default "6" if MK7 || MPENTIUMM || MATOM || MVIAC7 || X86_GENERIC || X86_64
- default "4" if MGEODEGX1
- default "5" if MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MVIAC3_2 || MGEODE_LX
+ default "6"
config X86_F00F_BUG
- def_bool y
- depends on M586MMX || M586TSC || M586
+ def_bool !X86_CMOV
config X86_ALIGNMENT_16
def_bool y
- depends on MCYRIXIII || MK6 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODEGX1
+ depends on MCYRIXIII || MK6 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODEGX1 || (!X86_CMOV && X86_GENERIC)
config X86_INTEL_USERCOPY
def_bool y
@@ -234,34 +233,23 @@ config X86_INTEL_USERCOPY
config X86_USE_PPRO_CHECKSUM
def_bool y
- depends on MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MATOM
+ depends on MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MATOM || (X86_CMOV && X86_GENERIC)
config X86_TSC
def_bool y
-config X86_HAVE_PAE
- def_bool y
- depends on MCRUSOE || MEFFICEON || MCYRIXIII || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC7 || MATOM || X86_64
-
config X86_CX8
def_bool y
-# this should be set for all -march=.. options where the compiler
-# generates cmov.
-config X86_CMOV
- def_bool y
- depends on (MK7 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || MATOM || MGEODE_LX || X86_64)
-
config X86_MINIMUM_CPU_FAMILY
int
default "64" if X86_64
- default "6" if X86_32 && (MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MATOM || MK7)
- default "5" if X86_32
- default "4"
+ default "6" if X86_32 && X86_CMOV
+ default "5"
config X86_DEBUGCTLMSR
def_bool y
- depends on !(MK6 || MCYRIXIII || M586MMX || M586TSC || M586) && !UML
+ depends on X86_CMOV && !UML
config IA32_FEAT_CTL
def_bool y
@@ -297,7 +285,7 @@ config CPU_SUP_INTEL
config CPU_SUP_CYRIX_32
default y
bool "Support Cyrix processors" if PROCESSOR_SELECT
- depends on M586 || M586TSC || M586MMX || (EXPERT && !64BIT)
+ depends on !64BIT
help
This enables detection, tunings and quirks for Cyrix processors
diff --git a/arch/x86/Makefile_32.cpu b/arch/x86/Makefile_32.cpu
index f5e933077bf4..ebd7ec6eaf34 100644
--- a/arch/x86/Makefile_32.cpu
+++ b/arch/x86/Makefile_32.cpu
@@ -10,30 +10,32 @@ else
align := -falign-functions=0 -falign-jumps=0 -falign-loops=0
endif
-cflags-$(CONFIG_M586TSC) += -march=i586
-cflags-$(CONFIG_M586MMX) += -march=pentium-mmx
-cflags-$(CONFIG_M686) += -march=i686
-cflags-$(CONFIG_MPENTIUMII) += -march=i686 $(call tune,pentium2)
-cflags-$(CONFIG_MPENTIUMIII) += -march=i686 $(call tune,pentium3)
-cflags-$(CONFIG_MPENTIUMM) += -march=i686 $(call tune,pentium3)
-cflags-$(CONFIG_MPENTIUM4) += -march=i686 $(call tune,pentium4)
-cflags-$(CONFIG_MK6) += -march=k6
-# Please note, that patches that add -march=athlon-xp and friends are pointless.
-# They make zero difference whatsosever to performance at this time.
-cflags-$(CONFIG_MK7) += -march=athlon
-cflags-$(CONFIG_MCRUSOE) += -march=i686 $(align)
-cflags-$(CONFIG_MEFFICEON) += -march=i686 $(call tune,pentium3) $(align)
-cflags-$(CONFIG_MCYRIXIII) += $(call cc-option,-march=c3,-march=i486) $(align)
-cflags-$(CONFIG_MVIAC3_2) += $(call cc-option,-march=c3-2,-march=i686)
-cflags-$(CONFIG_MVIAC7) += -march=i686
-cflags-$(CONFIG_MATOM) += -march=atom
+ifdef CONFIG_X86_CMOV
+cflags-y += -march=i686
+else
+cflags-y += -march=i586
+endif
-# Geode GX1 support
-cflags-$(CONFIG_MGEODEGX1) += -march=pentium-mmx
-cflags-$(CONFIG_MGEODE_LX) += $(call cc-option,-march=geode,-march=pentium-mmx)
-# add at the end to overwrite eventual tuning options from earlier
-# cpu entries
-cflags-$(CONFIG_X86_GENERIC) += $(call tune,generic,$(call tune,i686))
+cflags-$(CONFIG_M586TSC) += -mtune=i586
+cflags-$(CONFIG_M586MMX) += -mtune=pentium-mmx
+cflags-$(CONFIG_M686) += -mtune=i686
+cflags-$(CONFIG_MPENTIUMII) += -mtune=pentium2
+cflags-$(CONFIG_MPENTIUMIII) += -mtune=pentium3
+cflags-$(CONFIG_MPENTIUMM) += -mtune=pentium3
+cflags-$(CONFIG_MPENTIUM4) += -mtune=pentium4
+cflags-$(CONFIG_MK6) += -mtune=k6
+# Please note that patches that add -mtune=athlon-xp and friends are pointless.
+# They make zero difference whatsoever to performance at this time.
+cflags-$(CONFIG_MK7) += -mtune=athlon
+cflags-$(CONFIG_MCRUSOE) += -mtune=i686 $(align)
+cflags-$(CONFIG_MEFFICEON) += -mtune=pentium3 $(align)
+cflags-$(CONFIG_MCYRIXIII) += -mtune=c3 $(align)
+cflags-$(CONFIG_MVIAC3_2) += -mtune=c3-2
+cflags-$(CONFIG_MVIAC7) += -mtune=i686
+cflags-$(CONFIG_MATOM) += -mtune=atom
+cflags-$(CONFIG_MGEODEGX1) += -mtune=pentium-mmx
+cflags-$(CONFIG_MGEODE_LX) += -mtune=geode
+cflags-$(CONFIG_X86_GENERIC) += -mtune=generic
# Bug fix for binutils: this option is required in order to keep
# binutils from generating NOPL instructions against our will.
diff --git a/arch/x86/include/asm/vermagic.h b/arch/x86/include/asm/vermagic.h
index e26061df0c9b..6554dbdfd719 100644
--- a/arch/x86/include/asm/vermagic.h
+++ b/arch/x86/include/asm/vermagic.h
@@ -5,42 +5,10 @@
#ifdef CONFIG_X86_64
/* X86_64 does not define MODULE_PROC_FAMILY */
-#elif defined CONFIG_M586TSC
-#define MODULE_PROC_FAMILY "586TSC "
-#elif defined CONFIG_M586MMX
-#define MODULE_PROC_FAMILY "586MMX "
-#elif defined CONFIG_MATOM
-#define MODULE_PROC_FAMILY "ATOM "
-#elif defined CONFIG_M686
+#elif defined CONFIG_X86_CMOV
#define MODULE_PROC_FAMILY "686 "
-#elif defined CONFIG_MPENTIUMII
-#define MODULE_PROC_FAMILY "PENTIUMII "
-#elif defined CONFIG_MPENTIUMIII
-#define MODULE_PROC_FAMILY "PENTIUMIII "
-#elif defined CONFIG_MPENTIUMM
-#define MODULE_PROC_FAMILY "PENTIUMM "
-#elif defined CONFIG_MPENTIUM4
-#define MODULE_PROC_FAMILY "PENTIUM4 "
-#elif defined CONFIG_MK6
-#define MODULE_PROC_FAMILY "K6 "
-#elif defined CONFIG_MK7
-#define MODULE_PROC_FAMILY "K7 "
-#elif defined CONFIG_MCRUSOE
-#define MODULE_PROC_FAMILY "CRUSOE "
-#elif defined CONFIG_MEFFICEON
-#define MODULE_PROC_FAMILY "EFFICEON "
-#elif defined CONFIG_MCYRIXIII
-#define MODULE_PROC_FAMILY "CYRIXIII "
-#elif defined CONFIG_MVIAC3_2
-#define MODULE_PROC_FAMILY "VIAC3-2 "
-#elif defined CONFIG_MVIAC7
-#define MODULE_PROC_FAMILY "VIAC7 "
-#elif defined CONFIG_MGEODEGX1
-#define MODULE_PROC_FAMILY "GEODEGX1 "
-#elif defined CONFIG_MGEODE_LX
-#define MODULE_PROC_FAMILY "GEODE "
#else
-#error unknown processor family
+#define MODULE_PROC_FAMILY "586 "
#endif
#ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 489c779ef3ef..76b15ef8c85f 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1221,7 +1221,7 @@ bool tsc_clocksource_watchdog_disabled(void)
static void __init check_system_tsc_reliable(void)
{
-#if defined(CONFIG_MGEODEGX1) || defined(CONFIG_MGEODE_LX) || defined(CONFIG_X86_GENERIC)
+#if defined(CONFIG_CPU_SUP_CYRIX)
if (is_geode_lx()) {
/* RTSC counts during suspend */
#define RTSC_SUSP 0x100
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 222b6fdad313..2648459b8e8f 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -9,7 +9,6 @@ config XEN
select PARAVIRT_CLOCK
select X86_HV_CALLBACK_VECTOR
depends on X86_64 || (X86_32 && X86_PAE)
- depends on X86_64 || (X86_GENERIC || MPENTIUM4 || MATOM)
depends on X86_LOCAL_APIC
help
This is the Linux Xen port. Enabling this will allow the
diff --git a/drivers/misc/mei/Kconfig b/drivers/misc/mei/Kconfig
index 7575fee96cc6..4deb17ed0a62 100644
--- a/drivers/misc/mei/Kconfig
+++ b/drivers/misc/mei/Kconfig
@@ -3,7 +3,7 @@
config INTEL_MEI
tristate "Intel Management Engine Interface"
depends on X86 && PCI
- default X86_64 || MATOM
+ default X86_64 || CPU_SUP_INTEL
help
The Intel Management Engine (Intel ME) provides Manageability,
Security and Media services for system containing Intel chipsets.
--
2.39.5
* Arnd Bergmann <arnd@kernel.org> wrote:

> From: Arnd Bergmann <arnd@arndb.de>
>
> With cx8 and tsc being mandatory features, the only important
> architectural features are now cmov and pae.
>
> Change the large list of target CPUs to no longer pick the instruction set
> itself but only the mtune= optimization level and in-kernel optimizations
> that remain compatible with all cores.
>
> The CONFIG_X86_CMOV instead becomes user-selectable and is now how
> Kconfig picks between 586-class (Pentium, Pentium MMX, K6, C3, GeodeGX)
> and 686-class (everything else) targets.
>
> In order to allow running on late 32-bit cores (Athlon, Pentium-M,
> Pentium 4, ...), the X86_L1_CACHE_SHIFT can no longer be set to anything
> lower than 6 (i.e. 64 byte cache lines).
>
> The optimization options now depend on X86_CMOV and X86_PAE instead
> of the other way round, while other compile-time conditionals that
> checked for MATOM/MGEODEGX1 instead now check for CPU_SUP_* options
> that enable support for a particular CPU family.
>
> Link: https://lore.kernel.org/lkml/dd29df0c-0b4f-44e6-b71b-2a358ea76fb4@app.fastmail.com/
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> This is what I had in mind as mentioned in the earlier thread on
> cx8/tsc removal. I based this on top of Ingo's [RFC 15/15]
> patch.
> ---
>  arch/x86/Kconfig                |   2 +-
>  arch/x86/Kconfig.cpu            | 100 ++++++++++++++------------------
>  arch/x86/Makefile_32.cpu        |  48 +++++++--------
>  arch/x86/include/asm/vermagic.h |  36 +-----------
>  arch/x86/kernel/tsc.c           |   2 +-
>  arch/x86/xen/Kconfig            |   1 -
>  drivers/misc/mei/Kconfig        |   2 +-
>  7 files changed, 74 insertions(+), 117 deletions(-)

While the simplification is nice on its face, this looks messy:

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index a9d717558972..1e33f88c9b97 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1438,7 +1438,7 @@ config HIGHMEM
>
>  config X86_PAE
>  	bool "PAE (Physical Address Extension) Support"
> -	depends on X86_32 && X86_HAVE_PAE
> +	depends on X86_32 && X86_CMOV

Coupling CMOV to PAE ... :-/

> +config X86_CMOV
> +	bool "Require 686-class CMOV instructions" if X86_32
> +	default y
>  	help
> -	  This is the processor type of your CPU. This information is
> -	  used for optimizing purposes. In order to compile a kernel
> -	  that can run on all supported x86 CPU types (albeit not
> -	  optimally fast), you can specify "586" here.
> +	  Most x86-32 processor implementations are compatible with
> +	  the CMOV instruction originally added in the Pentium Pro,
> +	  and they perform much better when using it.
> +
> +	  Disable this option to build for 586-class CPUs without this
> +	  instruction. This is only required for the original Intel
> +	  Pentium (P5, P54, P55), AMD K6/K6-II/K6-3D, Geode GX1 and Via
> +	  CyrixIII/C3 CPUs.

Very few users will know anything about CMOV.

I'd argue the right path forward is to just bite the bullet and remove
non-CMOV support as well, that would be the outcome *anyway* in a few
years. That would allow basically a single 'modern' 32-bit kernel that
is supposed to boot on every supported CPU. People might even end up
testing it ... ;-)

Thanks,

	Ingo
On Sat, Apr 26, 2025, at 11:08, Ingo Molnar wrote:
> * Arnd Bergmann <arnd@kernel.org> wrote:
>
> While the simplification is nice on its face, this looks messy:
>
>>
>> config X86_PAE
>> bool "PAE (Physical Address Extension) Support"
>> - depends on X86_32 && X86_HAVE_PAE
>> + depends on X86_32 && X86_CMOV
>
> Coupling CMOV to PAE ... :-/
Right. With the current set of features, CMOV is almost the
same as 686. My reasoning was that support for CMOV has a
very clear definition, with the instruction either being
available or not.
When the M686/MPENTIUMII/MK6/... options are just optimization
levels rather than selecting an instruction set, X86_PAE
can't depend on those any more. An easy answer here would be
to not have X86_PAE depend on anything, but instead make it
force X86_MINIMUM_CPU_FAMILY=6.
>> +config X86_CMOV
>> + bool "Require 686-class CMOV instructions" if X86_32
>> + default y
>> help
>> - This is the processor type of your CPU. This information is
>> - used for optimizing purposes. In order to compile a kernel
>> - that can run on all supported x86 CPU types (albeit not
>> - optimally fast), you can specify "586" here.
>> + Most x86-32 processor implementations are compatible with
>> + the CMOV instruction originally added in the Pentium Pro,
>> + and they perform much better when using it.
>> +
>> + Disable this option to build for 586-class CPUs without this
>> + instruction. This is only required for the original Intel
>> + Pentium (P5, P54, P55), AMD K6/K6-II/K6-3D, Geode GX1 and Via
>> + CyrixIII/C3 CPUs.
>
> Very few users will know anything about CMOV.
>
> I'd argue the right path forward is to just bite the bullet and remove
> non-CMOV support as well, that would be the outcome *anyway* in a few
> years. That would allow basically a single 'modern' 32-bit kernel that
> is supposed to boot on every supported CPU. People might even end up
> testing it ... ;-)
That would be a much more drastic change than requiring CX8
and TSC, which were present on almost all Socket 7 CPUs and
all embedded cores other than Elan and Vortex86SX.
CMOV is missing not just on old Socket 5/7 CPUs (Pentium
MMX, AMD K6, Cyrix MII) but also newer embedded Via C3, Geode GX
and Vortex86DX/MX/EX/DX2. The replacement Nehemiah (2003), GeodeLX
(2005) and Vortex86DX3/EX2 (2015!) have CMOV, but the old ones
were sold alongside them for years, and some of the 586-class
Vortex86 products are still commercially available.
There is a good chance that we could just not use CMOV and only
build 586-compatible kernels without anyone caring about the
performance difference. There is not much to gain here either
though, as the cost of supporting both 586-class and 686-class
builds is rather small: there is a compiler flag, a boot time
check and a micro-optimization in ffs/fls.
Arnd
On Sat, 26 Apr 2025 at 11:59, Arnd Bergmann <arnd@arndb.de> wrote:
>
> Right. With the current set of features, CMOV is almost the
> same as 686. My reasoning was that support for CMOV has a
> very clear definition, with the instruction either being
> available or not.
Yeah, I don't think there's any reason to make CMOV a reason to drop support.
It has questionable performance impact - I doubt anybody can measure
it - and the "maintenance burden" is basically a single compiler flag.
(And yes, one use in a x86 header file that is pretty questionable
too: I think the reason for the cmov is actually i486-only behavior
and we could probably unify the 32-bit and 64-bit implementation)
Let's not drop Pentium support due to something as insignificant as that.
Particularly as the only half-way "modern" use of the Pentium core is
actually the embedded cores (ie old atoms and clones).
We have good reasons to drop i486 (and the "fake Pentium" cores that
weren't). We _don't_ have good reason to drop Pentium support, I
think.
> An easy answer here would be
> to not have X86_PAE depend on anything, but instead make it
> force X86_MINIMUM_CPU_FAMILY=6.
Make it so.
Linus
On Sat, 26 Apr 2025 12:24:37 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 26 Apr 2025 at 11:59, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > Right. With the current set of features, CMOV is almost the
> > same as 686. My reasoning was that support for CMOV has a
> > very clear definition, with the instruction either being
> > available or not.
>
> Yeah, I don't think there's any reason to make CMOV a reason to drop support.
>
> It has questionable performance impact - I doubt anybody can measure
> it - and the "maintenance burden" is basically a single compiler flag.

There is also the user/kernel address check for copy_to/from_user (etc).

The 'cmov' version used for 64bit is nice and succinct (as well as
being speculative execution safe).
Unlike the 'sbb' version it doesn't rely on the first access being to
the first address of the buffer (or page 0 not being mapped).

But I'd guess that the kernel ought to have a boot time test for some
of these instructions - so at least it fails gracefully.
Which would require compiling some early code without cmov.

	David
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 26 Apr 2025 at 11:59, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > Right. With the current set of features, CMOV is almost the
> > same as 686. My reasoning was that support for CMOV has a
> > very clear definition, with the instruction either being
> > available or not.
>
> Yeah, I don't think there's any reason to make CMOV a reason to drop support.
>
> It has questionable performance impact - I doubt anybody can measure
> it - and the "maintenance burden" is basically a single compiler
> flag.
>
> (And yes, one use in a x86 header file that is pretty questionable
> too: I think the reason for the cmov is actually i486-only behavior
> and we could probably unify the 32-bit and 64-bit implementation)
>
> Let's not drop Pentium support due to something as insignificant as
> that.

Agreed on that. Idea to require CMOV dropped.

Note that the outcome of 486 removal will likely be that the few
remaining community distros that still offer x86-32 builds are either
already 686-CMOV-only (Debian), or are going to drop their 486 builds
and keep their 686-CMOV-only builds (Gentoo and Archlinux32) by way of
simple inertia.

(There's an off chance that they'll change their 486 builds to 586,
but I think dropping the extra complication and standardizing on 686
will be the most likely outcome.)

No commercial distro builds x86-32 with a modern v6.x series kernel
AFAICS.

Anyway, I agree that the maintenance cost on the kernel side to build
non-CMOV kernels is very low.

Thanks,

	Ingo
On Sat, 26 Apr 2025 at 12:24, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> (And yes, one use in a x86 header file that is pretty questionable
> too: I think the reason for the cmov is actually i486-only behavior
> and we could probably unify the 32-bit and 64-bit implementation)
Actually, what we *should* do is to remove that manual use of 'cmov'
entirely - even if we decide that yes, that undefined zero case is
actually real.
We should probably change it to use CC_SET(), and the compiler will do
a much better job - and probably never use cmov anyway.
And yes, that will generate worse code if you have an old compiler
that doesn't do ASM_FLAG_OUTPUTS, but hey, that's true in general. If
you want good code, you need a good compiler.
And clang needs to learn the CC_SET() pattern anyway.
So I think that manual cmov pattern for x86-32 should be replaced with
bool zero;
asm("bsfl %[in],%[out]"
CC_SET(z)
: CC_OUT(z) (zero),
[out]"=r" (r)
: [in] "rm" (x));
return zero ? 0 : r+1;
instead (that's ffs(), and fls() would need the same thing except with
bsrl instead, of course).
I bet that would actually improve code generation.
And I also bet it doesn't actually matter, of course.
Linus
On 26/04/2025 8:55 pm, Linus Torvalds wrote:
> So I think that manual cmov pattern for x86-32 should be replaced with
>
> bool zero;
>
> asm("bsfl %[in],%[out]"
> CC_SET(z)
> : CC_OUT(z) (zero),
> [out]"=r" (r)
> : [in] "rm" (x));
>
> return zero ? 0 : r+1;
>
> instead (that's ffs(), and fls() would need the same thing except with
> bsrl insteadm, of course).
>
> I bet that would actually improve code generation.
It is possible to do better still.
ffs/fls are commonly found inside loops where x is the loop condition
too. Therefore, using statically_true() to provide a form without the
zero compatibility turns out to be a win.
> And I also bet it doesn't actually matter, of course.
Something that neither Linux nor Xen had, which makes a reasonable
difference, is a for_each_set_bit() optimised over a scalar value. The
existing APIs make it all too easy to spill the loop condition to memory.
~Andrew
On Sun, 27 Apr 2025 at 12:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> ffs/fls are commonly found inside loops where x is the loop condition
> too. Therefore, using statically_true() to provide a form without the
> zero compatibility turns out to be a win.
We already have the version without the zero capability - it's just
called "__ffs()" and "__fls()", and performance-critical code uses
those.
So fls/ffs are the "standard" library functions that have to handle
zero, and add that stupid "+1" because that interface was designed by
some Pascal person who doesn't understand that we start counting from
0.
Standards bodies: "companies aren't sending their best people".
But it's silly that we then spend effort on magic cmov in inline asm
on those things when it's literally the "don't use this version unless
you don't actually care about performance" case.
I don't think it would be wrong to just make the x86-32 code just do
the check against zero ahead of time - in C.
And yes, that will generate some extra code - you'll test for zero
before, and then the caller might also test for a zero result that
then results in another test for zero that can't actually happen (but
the compiler doesn't know that). But I suspect that on the whole, it
is likely to generate better code anyway just because the compiler
sees that first test and can DTRT.
UNTESTED patch applied in case somebody wants to play with this. It
removes 10 lines of silly code, and along with them that 'cmov' use.
Anybody?
Linus
On April 27, 2025 12:34:46 PM PDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 27 Apr 2025 at 12:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >
> > ffs/fls are commonly found inside loops where x is the loop condition
> > too. Therefore, using statically_true() to provide a form without the
> > zero compatibility turns out to be a win.
>
> We already have the version without the zero capability - it's just
> called "__ffs()" and "__fls()", and performance-critical code uses
> those.
>
> So fls/ffs are the "standard" library functions that have to handle
> zero, and add that stupid "+1" because that interface was designed by
> some Pascal person who doesn't understand that we start counting from
> 0.
>
> Standards bodies: "companies aren't sending their best people".
>
> But it's silly that we then spend effort on magic cmov in inline asm
> on those things when it's literally the "don't use this version unless
> you don't actually care about performance" case.
>
> I don't think it would be wrong to just make the x86-32 code just do
> the check against zero ahead of time - in C.
>
> And yes, that will generate some extra code - you'll test for zero
> before, and then the caller might also test for a zero result that
> then results in another test for zero that can't actually happen (but
> the compiler doesn't know that). But I suspect that on the whole, it
> is likely to generate better code anyway just because the compiler
> sees that first test and can DTRT.
>
> UNTESTED patch applied in case somebody wants to play with this. It
> removes 10 lines of silly code, and along with them that 'cmov' use.
>
> Anybody?
>
>          Linus

It's noteworthy that hardware implementations are now invariably using
a different interface, which makes sense for the LSB case (tzcnt/ctz)
but has its own drain bramage for the MSB case (lzcnt/clz)...
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 27 Apr 2025 at 12:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >
> > ffs/fls are commonly found inside loops where x is the loop condition
> > too. Therefore, using statically_true() to provide a form without the
> > zero compatibility turns out to be a win.
>
> We already have the version without the zero capability - it's just
> called "__ffs()" and "__fls()", and performance-critical code uses
> those.
>
> So fls/ffs are the "standard" library functions that have to handle
> zero, and add that stupid "+1" because that interface was designed by
> some Pascal person who doesn't understand that we start counting from
> 0.
>
> Standards bodies: "companies aren't sending their best people".
>
> But it's silly that we then spend effort on magic cmov in inline asm
> on those things when it's literally the "don't use this version unless
> you don't actually care about performance" case.
>
> I don't think it would be wrong to just make the x86-32 code just do
> the check against zero ahead of time - in C.
>
> And yes, that will generate some extra code - you'll test for zero
> before, and then the caller might also test for a zero result that
> then results in another test for zero that can't actually happen (but
> the compiler doesn't know that). But I suspect that on the whole, it
> is likely to generate better code anyway just because the compiler
> sees that first test and can DTRT.
>
> UNTESTED patch applied in case somebody wants to play with this. It
> removes 10 lines of silly code, and along with them that 'cmov' use.
>
> Anybody?
Makes sense - it seems to boot here, but I only did some very light
testing.
There's a minor text size increase on x86-32 defconfig, GCC 14.2.0:
text data bss dec hex filename
16577728 7598826 1744896 25921450 18b87aa vmlinux.before
16577908 7598838 1744896 25921642 18b886a vmlinux.after
bloat-o-meter output:
add/remove: 2/1 grow/shrink: 201/189 up/down: 5681/-3486 (2195)
Patch with changelog and your SOB added attached. Does it look good to
you?
Thanks,
Ingo
================>
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 28 Apr 2025 08:38:35 +0200
Subject: [PATCH] bitops/32: Convert variable_ffs() and fls() zero-case handling to C
Don't do the complicated and probably questionable BS*L+CMOVZL
asm() optimization in variable_ffs() and fls(): performance-critical
code is already using __ffs() and __fls() that use sane interfaces
close to the machine instruction ABI. Check ahead for zero in C.
There's a minor text size increase on x86-32 defconfig:
text data bss dec hex filename
16577728 7598826 1744896 25921450 18b87aa vmlinux.before
16577908 7598838 1744896 25921642 18b886a vmlinux.after
bloat-o-meter output:
add/remove: 2/1 grow/shrink: 201/189 up/down: 5681/-3486 (2195)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
arch/x86/include/asm/bitops.h | 22 ++++++----------------
1 file changed, 6 insertions(+), 16 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 100413aff640..6061c87f14ac 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -321,15 +321,10 @@ static __always_inline int variable_ffs(int x)
asm("bsfl %1,%0"
: "=r" (r)
: ASM_INPUT_RM (x), "0" (-1));
-#elif defined(CONFIG_X86_CMOV)
- asm("bsfl %1,%0\n\t"
- "cmovzl %2,%0"
- : "=&r" (r) : "rm" (x), "r" (-1));
#else
- asm("bsfl %1,%0\n\t"
- "jnz 1f\n\t"
- "movl $-1,%0\n"
- "1:" : "=r" (r) : "rm" (x));
+ if (!x)
+ return 0;
+ asm("bsfl %1,%0" : "=r" (r) : "rm" (x));
#endif
return r + 1;
}
@@ -378,15 +373,10 @@ static __always_inline int fls(unsigned int x)
asm("bsrl %1,%0"
: "=r" (r)
: ASM_INPUT_RM (x), "0" (-1));
-#elif defined(CONFIG_X86_CMOV)
- asm("bsrl %1,%0\n\t"
- "cmovzl %2,%0"
- : "=&r" (r) : "rm" (x), "rm" (-1));
#else
- asm("bsrl %1,%0\n\t"
- "jnz 1f\n\t"
- "movl $-1,%0\n"
- "1:" : "=r" (r) : "rm" (x));
+ if (!x)
+ return 0;
+ asm("bsrl %1,%0" : "=r" (r) : "rm" (x));
#endif
return r + 1;
}
* Ingo Molnar <mingo@kernel.org> wrote:
> > UNTESTED patch applied in case somebody wants to play with this. It
> > removes 10 lines of silly code, and along with them that 'cmov' use.
> >
> > Anybody?
>
> Makes sense - it seems to boot here, but I only did some very light
> testing.
>
> There's a minor text size increase on x86-32 defconfig, GCC 14.2.0:
>
> text data bss dec hex filename
> 16577728 7598826 1744896 25921450 18b87aa vmlinux.before
> 16577908 7598838 1744896 25921642 18b886a vmlinux.after
>
> bloatometer output:
>
> add/remove: 2/1 grow/shrink: 201/189 up/down: 5681/-3486 (2195)
And once we remove 486, I think we can do the optimization below to
just assume the output doesn't get clobbered by BS*L in the zero-case,
right?
In the text size space it's a substantial optimization on x86-32
defconfig:
text data bss dec hex filename
16,577,728 7598826 1744896 25921450 18b87aa vmlinux.vanilla # CMOV+BS*L
16,577,908 7598838 1744896 25921642 18b886a vmlinux.linus_patch # if()+BS*L
16,573,568 7602922 1744896 25921386 18b876a vmlinux.noclobber # BS*L
Thanks,
Ingo
---
arch/x86/include/asm/bitops.h | 20 ++------------------
1 file changed, 2 insertions(+), 18 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6061c87f14ac..e3e94a806656 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -308,24 +308,16 @@ static __always_inline int variable_ffs(int x)
{
int r;
-#ifdef CONFIG_X86_64
/*
* AMD64 says BSFL won't clobber the dest reg if x==0; Intel64 says the
* dest reg is undefined if x==0, but their CPU architect says its
* value is written to set it to the same as before, except that the
* top 32 bits will be cleared.
- *
- * We cannot do this on 32 bits because at the very least some
- * 486 CPUs did not behave this way.
*/
asm("bsfl %1,%0"
: "=r" (r)
: ASM_INPUT_RM (x), "0" (-1));
-#else
- if (!x)
- return 0;
- asm("bsfl %1,%0" : "=r" (r) : "rm" (x));
-#endif
+
return r + 1;
}
@@ -360,24 +352,16 @@ static __always_inline int fls(unsigned int x)
if (__builtin_constant_p(x))
return x ? 32 - __builtin_clz(x) : 0;
-#ifdef CONFIG_X86_64
/*
* AMD64 says BSRL won't clobber the dest reg if x==0; Intel64 says the
* dest reg is undefined if x==0, but their CPU architect says its
* value is written to set it to the same as before, except that the
* top 32 bits will be cleared.
- *
- * We cannot do this on 32 bits because at the very least some
- * 486 CPUs did not behave this way.
*/
asm("bsrl %1,%0"
: "=r" (r)
: ASM_INPUT_RM (x), "0" (-1));
-#else
- if (!x)
- return 0;
- asm("bsrl %1,%0" : "=r" (r) : "rm" (x));
-#endif
+
return r + 1;
}
On Mon, 28 Apr 2025 at 00:05, Ingo Molnar <mingo@kernel.org> wrote:
>
> And once we remove 486, I think we can do the optimization below to
> just assume the output doesn't get clobbered by BS*L in the zero-case,
> right?
We probably can't, because who knows what "Pentium" CPU's are out there.
Or even if Pentium really does get it right. I doubt we have any
developers with an original Pentium around.
So just leave the "we don't know what the CPU result is for zero"
unless we get some kind of official confirmation.
Linus
On April 28, 2025 9:14:45 AM PDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> We probably can't, because who knows what "Pentium" CPU's are out there.
>
> Or even if Pentium really does get it right. I doubt we have any
> developers with an original Pentium around.
>
> So just leave the "we don't know what the CPU result is for zero"
> unless we get some kind of official confirmation.
>
> Linus

If anyone knows for sure, it is probably Christian Ludloff. However,
there was a *huge* tightening of the formal ISA when the i686 was
introduced (family=6) and I really believe this was part of it.

I also really don't trust that family=5 really means conforms to
undocumented P5 behavior, e.g. for Quark.
On 28/04/2025 10:38 pm, H. Peter Anvin wrote:
> If anyone knows for sure, it is probably Christian Ludloff. However,
> there was a *huge* tightening of the formal ISA when the i686 was
> introduced (family=6) and I really believe this was part of it.
>
> I also really don't trust that family=5 really means conforms to
> undocumented P5 behavior, e.g. for Quark.

https://www.sandpile.org/x86/flags.htm

That's a lot of "can't even characterise the result" in the P5.

Looking at the P4 column, that is clearly what the latest SDM has
retroactively declared to be architectural.

~Andrew
On April 28, 2025 5:12:13 PM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> https://www.sandpile.org/x86/flags.htm
>
> That's a lot of "can't even characterise the result" in the P5.
>
> Looking at P4 column, that is clearly what the latest SDM has
> retroactively declared to be architectural.
>
> ~Andrew

Yes, but it wasn't about flags here.

Now, question: can we just use __builtin_*() for these? I think gcc
should always generate inline code for these on x86.
On 29/04/2025 3:00 am, H. Peter Anvin wrote:
> Yes, but it wasn't about flags here.
>
> Now, question: can we just use __builtin_*() for these? I think gcc
> should always generate inline code for these on x86.

Yes it does generate inline code. https://godbolt.org/z/M45oo5rqT

GCC does it branchlessly, but cannot optimise based on context.

Clang can optimise based on context, except the 0 case it seems.

Moving to -march=i686 causes both GCC and Clang to switch to CMOV and
create branchless code, but GCC still can't optimise out the CMOV
based on context.

~Andrew
On April 28, 2025 7:25:17 PM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> Yes it does generate inline code. https://godbolt.org/z/M45oo5rqT
>
> GCC does it branchlessly, but cannot optimise based on context.
>
> Clang can optimise based on context, except the 0 case it seems.
>
> Moving to -march=i686 causes both GCC and Clang to switch to CMOV and
> create branchless code, but GCC still can't optimise out the CMOV
> based on context.
>
> ~Andrew

Maybe a gcc bug report would be better than trying to hack around this
in the kernel?
On 29/04/2025 4:13 am, H. Peter Anvin wrote:
> Maybe a gcc bug report would be better than trying to hack around this
> in the kernel?

I tried that. (The thread started as a question around
__builtin_constant_p() but did grow to cover __builtin_ffs().)

https://gcc.gnu.org/pipermail/gcc/2024-March/243465.html

~Andrew
On Tue, 29 Apr 2025 at 07:38, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> I tried that. (The thread started as a question around
> __builtin_constant_p() but did grow to cover __builtin_ffs().)
Maybe we could do something like
#define ffs(x) \
(statically_true((x) != 0) ? __ffs(x)+1 : __builtin_ffs(x))
which uses our "statically_true()" helper that is actually fairly good
at the whole "let the compiler tell us that it knows that value cannot
be zero"
I didn't check what code that generated, but I've seen gcc do well on
that statically_true() thing in the past.
Then we can just remove our current variable_ffs() thing entirely,
because we now depend on our (good) __ffs() and the builtin being
"good enough" for the bad case.
(And do the same thing for fls() too, of course)
Linus
On 29/04/2025 7:05 pm, Linus Torvalds wrote:
> On Tue, 29 Apr 2025 at 07:38, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> I tried that. (The thread started as a question around
>> __builtin_constant_p() but did grow to cover __builtin_ffs().)
> Maybe we could do something like
>
> #define ffs(x) \
> (statically_true((x) != 0) ? __ffs(x)+1 : __builtin_ffs(x))
>
> which uses our "statically_true()" helper that is actually fairly good
> at the whole "let the compiler tell us that it knows that value cannot
> be zero"
>
> I didn't check what code that generated, but I've seen gcc do well on
> that statically_true() thing in the past.
>
> Then we can just remove our current variable_ffs() thing entirely,
> because we now depend on our (good) __ffs() and the builtin being
> "good enough" for the bad case.
That would improve code generation for 32bit, but generally regress 64bit.
Preloading the destination register with -1 is better than the CMOV form
emitted by the builtin; BSF's habit of conditionally not writing the
destination register *is* a CMOV of sorts.
When I cleaned this up in Xen, there were several factors where I
thought improvements could be made.
Having both ffs() and __ffs(), where the latter is undefined in a common
case, is a trap waiting for an unwary programmer. I have no particular
love for ffs() being off-by-one from normal, but it is well defined for all
inputs.
Also, leaving the constant folding to the arch-optimised form means that
it often gets forgotten. Therefore, I rearranged everything to have
this be common:
static always_inline attr_const unsigned int ffs(unsigned int x)
{
if ( __builtin_constant_p(x) )
return __builtin_ffs(x);
#ifdef arch_ffs
return arch_ffs(x);
#else
return generic_ffsl(x);
#endif
}
with most architectures implementing arch_ffs as:
#define arch_ffs(x) ((x) ? 1 + __builtin_ctz(x) : 0)
and x86 as:
static always_inline unsigned int arch_ffs(unsigned int x)
{
unsigned int r;
if ( __builtin_constant_p(x > 0) && x > 0 )
{
/*
* A common code pattern is:
*
* while ( bits )
* {
* bit = ffs(bits);
* ...
*
* and the optimiser really can work with the knowledge of x being
* non-zero without knowing its exact value, in which case we don't
* need to compensate for BSF's corner cases. Otherwise...
*/
asm ( "bsf %[val], %[res]"
: [res] "=r" (r)
: [val] "rm" (x) );
}
else
{
/*
* ... the AMD manual states that BSF won't modify the destination
* register if x=0. The Intel manual states that the result is
* undefined, but the architects have said that the register is
* written back with its old value (zero extended as normal).
*/
asm ( "bsf %[val], %[res]"
: [res] "=r" (r)
: [val] "rm" (x), "[res]" (-1) );
}
return r + 1;
}
#define arch_ffs arch_ffs
and finally, providing compatibility for the other forms as:
#define __ffs(x) (ffs(x) - 1)
The end result is fewer APIs to implement in arch-specific code, and the
removal of undefined behaviour.
That said, I don't envy anyone wanting to try and untangle this in
Linux, even if consensus were to agree on it as an approach.
~Andrew
On Tue, 29 Apr 2025 at 12:13, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> That would improve code generation for 32bit, but generally regress 64bit.
>
> Preloading the destination register with -1 is better than the CMOV form
> emitted by the builtin; BSF's habit of conditionally not writing the
> destination register *is* a CMOV of sorts.
Right you are. So we'd need to do this just for the x86-32 case. Oh
well. Ugly, but still prettier than what we have now, I guess.
Linus
On April 29, 2025 1:12:48 PM PDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Right you are. So we'd need to do this just for the x86-32 case. Oh
> well. Ugly, but still prettier than what we have now, I guess.
>
> Linus

Could you file a gcc bug? Gcc shouldn't generate worse code than inline
asm ...
On Tue, 29 Apr 2025 at 14:24, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Could you file a gcc bug? Gcc shouldn't generate worse code than inline asm ...
Honestly, I've given up on that idea.
That's basically what the bug report I reported about clts was - gcc
generating worse code than inline asm.
But why would we use gcc builtins at all if we have inline asm that is
better than what some versions of gcc generate? The inline asm works
for all versions.
Anyway, attached is a test file that seems to generate basically
"optimal" code. It's not a kernel patch, but a standalone C file for
testing with a couple of stupid test-cases to make sure that it gets
the constant argument case right, and that it optimizes the "I know
it's not zero" case.
That is why it also uses that "#if __SIZEOF_LONG__ == 4" instead of
something like CONFIG_64BIT.
So it's a proof-of-concept thing, and it does seem to be fairly simple.
The "important" parts are just the
#define variable_ffs(x) \
(statically_true((x)!=0) ? __ffs(x)+1 : do_variable_ffs(x))
#define constant_ffs(x) __builtin_ffs(x)
#define ffs(x) \
(__builtin_constant_p(x) ? constant_ffs(x) : variable_ffs(x))
lines which end up picking the right choice. The rest is either the
"reimplement the kernel infrastructure for testing" or the trivial
do_variable_ffs() thing.
I just did
gcc -m32 -O2 -S -fomit-frame-pointer t.c
(with, and without that -m32) and looked at the result to see that it
looks sane. No *actual* testing.
Linus
On 29/04/2025 10:53 pm, Linus Torvalds wrote:
> I just did
>
> gcc -m32 -O2 -S -fomit-frame-pointer t.c
>
> (with, and without that -m32) and looked at the result to see that it
> looks sane. No *actual* testing.

do_variable_ffs() doesn't quite work.

REP BSF is TZCNT, and unconditionally writes its output operand, which
defeats the attempt to preload with -1.

Drop the REP prefix, and it should work as intended.

~Andrew
On Tue, 29 Apr 2025 at 14:59, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> do_variable_ffs() doesn't quite work.
>
> REP BSF is TZCNT, and unconditionally writes its output operand, and
> defeats the attempt to preload with -1.
>
> Drop the REP prefix, and it should work as intended.
Bah. That's what I get for just doing it blindly without actually
looking at the kernel source. I just copied the __ffs() thing - and
there the 'rep' is not for the zero case - which we don't care about -
but because tzcnt performs better on newer CPUs.
So you're obviously right.
Linus
On 29/04/2025 11:04 pm, Linus Torvalds wrote:
> Bah. That's what I get for just doing it blindly without actually
> looking at the kernel source. I just copied the __ffs() thing - and
> there the 'rep' is not for the zero case - which we don't care about -
> but because tzcnt performs better on newer CPUs.

Oh, I didn't realise there was also a perf difference, but Agner Fog
agrees.

Apparently in Zen4, BSF and friends have become a single uop with a
sensible latency. Previously they were 6-8 uops with a latency to
match.

Intel appear to have had them as a single uop since SandyBridge, so
quite a long time now.

~Andrew
On Tue, 29 Apr 2025 at 15:22, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> Oh, I didn't realise there was also a perf difference too, but Agner Fog
> agrees.
The perf difference is exactly because of the issue where the non-rep
one acts as a cmov, and has basically two inputs (the bits to test in
the source, and the old value of the result register)
I guess it's not "fundamental", but tzcnt is basically a bit simpler
for hardware to implement, and the non-rep legacy bsf instruction
basically has a dependency on the previous value of the result
register.
So even when it's a single uop for both cases, that single uop can be
slower for the bsf because of the (typically false) dependency and
extra pressure on the rename registers.
Linus
On April 29, 2025 3:04:30 PM PDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Bah. That's what I get for just doing it blindly without actually
> looking at the kernel source. I just copied the __ffs() thing - and
> there the 'rep' is not for the zero case - which we don't care about -
> but because tzcnt performs better on newer CPUs.
>
> So you're obviously right.
>
> Linus

Yeah, the encoding of lzcnt was a real mistake, because the outputs are
different (so you still need instruction-specific postprocessing.)
On Mon, 28 Apr 2025 at 19:00, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Now, question: can we just use __builtin_*() for these? I think gcc should always generate inline code for these on x86.
Yeah, I think we can just use __builtin_ffs() directly and get rid of
all the games.
Not all gcc builtins are great: I did a bugzilla about gcc ctzl some time ago:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106471
but for ffs() that does sound like it's the simplest thing to do.
Linus
* Ingo Molnar <mingo@kernel.org> wrote:
> And once we remove 486, I think we can do the optimization below to
> just assume the output doesn't get clobbered by BS*L in the
> zero-case, right?
>
> In the text size space it's a substantial optimization on x86-32
> defconfig:
>
> text data bss dec hex filename
> 16,577,728 7598826 1744896 25921450 18b87aa vmlinux.vanilla # CMOV+BS*L
> 16,577,908 7598838 1744896 25921642 18b886a vmlinux.linus_patch # if()+BS*L
> 16,573,568 7602922 1744896 25921386 18b876a vmlinux.noclobber # BS*L
And BTW, *that* is a price that all of non-486 x86-32 was paying for
486 support...
And, just out of intellectual curiosity, I also tried to measure the
code generation price of the +1 standards-quirk in the fls()/ffs()
interface as well:
text data bss dec hex filename
16,577,728 7598826 1744896 25921450 18b87aa vmlinux.vanilla # CMOV+BS*L
16,577,908 7598838 1744896 25921642 18b886a vmlinux.linus_patch # if()+BS*L
16,573,568 7602922 1744896 25921386 18b876a vmlinux.noclobber # BS*L
..........
16,573,552 7602922 1744896 25921370 18b875a vmlinux.broken # BROKEN: 0 baseline instead of 1
... and unless I messed up the patch, it seems to have a surprisingly
low impact - maybe because the compiler can amortize its cost by
adjusting all dependent code mostly at build time, so the +1 doesn't
end up being generated most of the time?
Thanks,
Ingo
===============================>
This broken patch is broken: it intentionally breaks the ffs()/fls()
interface in an attempt to measure the code generation effects of
interface details.
NOT-Signed-off-by: <anyone@anywhere.anytime>
---
arch/x86/include/asm/bitops.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index e3e94a806656..21707696bafe 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -318,7 +318,7 @@ static __always_inline int variable_ffs(int x)
: "=r" (r)
: ASM_INPUT_RM (x), "0" (-1));
- return r + 1;
+ return r;
}
/**
@@ -362,7 +362,7 @@ static __always_inline int fls(unsigned int x)
: "=r" (r)
: ASM_INPUT_RM (x), "0" (-1));
- return r + 1;
+ return r;
}
/**
On Mon, 28 Apr 2025 at 00:14, Ingo Molnar <mingo@kernel.org> wrote:
>
> And, just out of intellectual curiosity, I also tried to measure the
> code generation price of the +1 standards-quirk in the fls()/ffs()
> interface as well:
>
> ... and unless I messed up the patch, it seems to have a surprisingly
> low impact - maybe because the compiler can amortize its cost by
> adjusting all dependent code mostly at build time, so the +1 doesn't
> end up being generated most of the time?
No, I think one issue is that most users actually end up subtracting
one from the return value of 'ffs()', because the "bit #0 returns 1"
semantics of the standard ffs() function really is insane.
It's not just that it doesn't match sane hardware, it's also that it
doesn't match sane *users*. If bit #0 is set, people want '0', so they
typically subtract 1.
So when you stop adding one, you aren't actually removing code -
you're often adding it.
Just see how many hits you get from
git grep '\<ffs(.*).*-.*1'
which is obviously not a very precise pattern, but just look at the
output and see just *how* common that "subtract one" thing is.
I really don't understand how anybody *ever* thought that the whole
"return one bigger" was a good idea for ffs().
Sure, I understand that zero is special and needs a special return
value, but returning a negative value would have been pretty simple
(or just do what our bitops finding functions do, which is to return
past the end, which is often convenient but does tend to make the
error condition check a bit more complex).
Anyway, the fact that so many users subtract one means that your "look
at the size of the binary" model doesn't work. You're counting both
the wins (when that addition doesn't happen) and the losses (when the
"subtract one in the user" happens).
So the "+1" doesn't cause much code generation - as long as it's done
by the compiler that can also undo it - but it's just a horrid user
interface.
But maybe people really were poisoned by the Pascal mindset. Or maybe
it was invented by some ancient Roman who hadn't heard of the concept
of zero. Who knows?
Linus
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, 28 Apr 2025 at 00:14, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > And, just out of intellectual curiosity, I also tried to measure the
> > code generation price of the +1 standards-quirk in the fls()/ffs()
> > interface as well:
> >
> > ... and unless I messed up the patch, it seems to have a surprisingly
> > low impact - maybe because the compiler can amortize its cost by
> > adjusting all dependent code mostly at build time, so the +1 doesn't
> > end up being generated most of the time?
>
> No, I think one issue is that most users actually end up subtracting
> one from the return value of 'ffs()', because the "bit #0 returns 1"
> semantics of the standard ffs() function really is insane.
>
> It's not just that it doesn't match sane hardware, it's also that it
> doesn't match sane *users*. If bit #0 is set, people want '0', so they
> typically subtract 1.
>
> So when you stop adding one, you aren't actually removing code -
> you're often adding it.
>
> Just see how many hits you get from
>
> git grep '\<ffs(.*).*-.*1'
>
> which is obviously not a very precise pattern, but just look at the
> output and see just *how* common that "subtract one" thing is.
>
> I really don't understand how anybody *ever* thought that the whole
> "return one bigger" was a good idea for ffs().
Yeah. No argument from me that it's a badly thought out interface - I
was just surprised that it doesn't seem to impact performance as badly
as I expected. I have to add that a lot of work went into absorbing the
negative effects of the ffs()/fls() interfaces:
starship:~/tip> git grep -Ee '__ffs\(|__fls\(' | wc -l
1055
So it impacts code quality negatively, which is arguably the worse side
effect.
> But maybe people really were poisoned by the Pascal mindset. Or maybe
> it was invented by some ancient Roman who hadn't heard of the concept
> of zero. Who knows?
Hey, ancient Romans didn't even have the concept of *whitespaces* and
punctuation to begin with:
https://historyofinformation.com/images/Vergilius_Augusteus,_Georgica_121.jpg
Lazy stonemasons the lot of them.
Romans were the worst ever coders too I suspect. What have the Romans
ever done for us??
Ingo
On April 29, 2025 3:08:03 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> On Mon, 28 Apr 2025 at 00:14, Ingo Molnar <mingo@kernel.org> wrote:
>> >
>> > And, just out of intellectual curiosity, I also tried to measure the
>> > code generation price of the +1 standards-quirk in the fls()/ffs()
>> > interface as well:
>> >
>> > ... and unless I messed up the patch, it seems to have a surprisingly
>> > low impact - maybe because the compiler can amortize its cost by
>> > adjusting all dependent code mostly at build time, so the +1 doesn't
>> > end up being generated most of the time?
>>
>> No, I think one issue is that most users actually end up subtracting
>> one from the return value of 'ffs()', because the "bit #0 returns 1"
>> semantics of the standard ffs() function really is insane.
>>
>> It's not just that it doesn't match sane hardware, it's also that it
>> doesn't match sane *users*. If bit #0 is set, people want '0', so they
>> typically subtract 1.
>>
>> So when you stop adding one, you aren't actually removing code -
>> you're often adding it.
>>
>> Just see how many hits you get from
>>
>> git grep '\<ffs(.*).*-.*1'
>>
>> which is obviously not a very precise pattern, but just look at the
>> output and see just *how* common that "subtract one" thing is.
>>
>> I really don't understand how anybody *ever* thought that the whole
>> "return one bigger" was a good idea for ffs().
>
>Yeah. No argument from me that it's a badly thought out interface - I
>was just surprised that it doesn't seem to impact performance as badly
>as I expected. I have to add that a lot of work went into absorbing the
>negative effects of the ffs()/fls() interfaces:
>
> starship:~/tip> git grep -Ee '__ffs\(|__fls\(' | wc -l
> 1055
>
>So it impacts code quality negatively, which is arguably the worse side
>effect.
>
>> But maybe people really were poisoned by the Pascal mindset. Or maybe
>> it was invented by some ancient Roman who hadn't heard of the concept
>> of zero. Who knows?
>
>Hey, ancient Romans didn't even have the concept of *whitespaces* and
>punctuation to begin with:
>
> https://historyofinformation.com/images/Vergilius_Augusteus,_Georgica_121.jpg
>
>Lazy stonemasons the lot of them.
>
>Romans were the worst ever coders too I suspect. What have the Romans
>ever done for us??
>
> Ingo
Well, they did build the roads... 🤣
Roman numerals obviously were not a positional system, but at least in accounting they used N for zero (nulla).
ROMANI•ITE•DOMVM
On April 28, 2025 12:14:40 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Ingo Molnar <mingo@kernel.org> wrote:
>
>> And once we remove 486, I think we can do the optimization below to
>> just assume the output doesn't get clobbered by BS*L in the
>> zero-case, right?
>>
>> In the text size space it's a substantial optimization on x86-32
>> defconfig:
>>
>>        text      data      bss       dec       hex  filename
>>  16,577,728   7598826  1744896  25921450  18b87aa  vmlinux.vanilla      # CMOV+BS*L
>>  16,577,908   7598838  1744896  25921642  18b886a  vmlinux.linus_patch  # if()+BS*L
>>  16,573,568   7602922  1744896  25921386  18b876a  vmlinux.noclobber    # BS*L
>
>And BTW, *that* is a price that all of non-486 x86-32 was paying for
>486 support...
>
>And, just out of intellectual curiosity, I also tried to measure the
>code generation price of the +1 standards-quirk in the fls()/ffs()
>interface as well:
>
>        text      data      bss       dec       hex  filename
>  16,577,728   7598826  1744896  25921450  18b87aa  vmlinux.vanilla      # CMOV+BS*L
>  16,577,908   7598838  1744896  25921642  18b886a  vmlinux.linus_patch  # if()+BS*L
>  16,573,568   7602922  1744896  25921386  18b876a  vmlinux.noclobber    # BS*L
>  ..........
>  16,573,552   7602922  1744896  25921370  18b875a  vmlinux.broken       # BROKEN: 0 baseline instead of 1
>
>... and unless I messed up the patch, it seems to have a surprisingly
>low impact - maybe because the compiler can amortize its cost by
>adjusting all dependent code mostly at build time, so the +1 doesn't
>end up being generated most of the time?
>
>Thanks,
>
>	Ingo
>
>===============================>
>
>This broken patch is broken: it intentionally breaks the ffs()/fls()
>interface in an attempt to measure the code generation effects of
>interface details.
>
>NOT-Signed-off-by: <anyone@anywhere.anytime>
>---
> arch/x86/include/asm/bitops.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
>diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
>index e3e94a806656..21707696bafe 100644
>--- a/arch/x86/include/asm/bitops.h
>+++ b/arch/x86/include/asm/bitops.h
>@@ -318,7 +318,7 @@ static __always_inline int variable_ffs(int x)
> 		     : "=r" (r)
> 		     : ASM_INPUT_RM (x), "0" (-1));
>
>-	return r + 1;
>+	return r;
> }
>
> /**
>@@ -362,7 +362,7 @@ static __always_inline int fls(unsigned int x)
> 		     : "=r" (r)
> 		     : ASM_INPUT_RM (x), "0" (-1));
>
>-	return r + 1;
>+	return r;
> }
>
> /**

My recollection was that you can't assume that even for 586; that it is
only safe for 686, but it has been a long time...
On Mon, Apr 28, 2025, at 09:14, Ingo Molnar wrote:
>
> ... and unless I messed up the patch, it seems to have a surprisingly
> low impact - maybe because the compiler can amortize its cost by
> adjusting all dependent code mostly at build time, so the +1 doesn't
> end up being generated most of the time?
Is there any reason we can't just use the compiler-builtins directly
like we do on other architectures, at least for 32-bit?
Looking at a couple of vmlinux objects confirms the findings from
fdb6649ab7c1 ("x86/asm/bitops: Use __builtin_ctzl() to evaluate
constant expressions") from looking at the object file that using
the built-in helpers is slightly better than the current asm code
for all 32-bit targets with both gcc and clang. It's also better
for 64-bit targets with clang, but not with gcc, where the inline
asm often saves a cmov but in other cases the compiler finds an
even better instruction sequence.
Arnd
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index eebbc8889e70..bdeae9a497e5 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -246,184 +246,18 @@ arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
variable_test_bit(nr, addr);
}
-static __always_inline unsigned long variable__ffs(unsigned long word)
-{
- asm("tzcnt %1,%0"
- : "=r" (word)
- : ASM_INPUT_RM (word));
- return word;
-}
-
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-#define __ffs(word) \
- (__builtin_constant_p(word) ? \
- (unsigned long)__builtin_ctzl(word) : \
- variable__ffs(word))
-
-static __always_inline unsigned long variable_ffz(unsigned long word)
-{
- return variable__ffs(~word);
-}
-
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-#define ffz(word) \
- (__builtin_constant_p(word) ? \
- (unsigned long)__builtin_ctzl(~word) : \
- variable_ffz(word))
-
-/*
- * __fls: find last set bit in word
- * @word: The word to search
- *
- * Undefined if no set bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __fls(unsigned long word)
-{
- if (__builtin_constant_p(word))
- return BITS_PER_LONG - 1 - __builtin_clzl(word);
-
- asm("bsr %1,%0"
- : "=r" (word)
- : ASM_INPUT_RM (word));
- return word;
-}
-
#undef ADDR
-#ifdef __KERNEL__
-static __always_inline int variable_ffs(int x)
-{
- int r;
-
-#ifdef CONFIG_X86_64
- /*
- * AMD64 says BSFL won't clobber the dest reg if x==0; Intel64 says the
- * dest reg is undefined if x==0, but their CPU architect says its
- * value is written to set it to the same as before, except that the
- * top 32 bits will be cleared.
- *
- * We cannot do this on 32 bits because at the very least some
- * 486 CPUs did not behave this way.
- */
- asm("bsfl %1,%0"
- : "=r" (r)
- : ASM_INPUT_RM (x), "0" (-1));
-#elif defined(CONFIG_X86_CMOV)
- asm("bsfl %1,%0\n\t"
- "cmovzl %2,%0"
- : "=&r" (r) : "rm" (x), "r" (-1));
-#else
- asm("bsfl %1,%0\n\t"
- "jnz 1f\n\t"
- "movl $-1,%0\n"
- "1:" : "=r" (r) : "rm" (x));
-#endif
- return r + 1;
-}
-
-/**
- * ffs - find first set bit in word
- * @x: the word to search
- *
- * This is defined the same way as the libc and compiler builtin ffs
- * routines, therefore differs in spirit from the other bitops.
- *
- * ffs(value) returns 0 if value is 0 or the position of the first
- * set bit if value is nonzero. The first (least significant) bit
- * is at position 1.
- */
-#define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
-
-/**
- * fls - find last set bit in word
- * @x: the word to search
- *
- * This is defined in a similar way as the libc and compiler builtin
- * ffs, but returns the position of the most significant set bit.
- *
- * fls(value) returns 0 if value is 0 or the position of the last
- * set bit if value is nonzero. The last (most significant) bit is
- * at position 32.
- */
-static __always_inline int fls(unsigned int x)
-{
- int r;
-
- if (__builtin_constant_p(x))
- return x ? 32 - __builtin_clz(x) : 0;
-
-#ifdef CONFIG_X86_64
- /*
- * AMD64 says BSRL won't clobber the dest reg if x==0; Intel64 says the
- * dest reg is undefined if x==0, but their CPU architect says its
- * value is written to set it to the same as before, except that the
- * top 32 bits will be cleared.
- *
- * We cannot do this on 32 bits because at the very least some
- * 486 CPUs did not behave this way.
- */
- asm("bsrl %1,%0"
- : "=r" (r)
- : ASM_INPUT_RM (x), "0" (-1));
-#elif defined(CONFIG_X86_CMOV)
- asm("bsrl %1,%0\n\t"
- "cmovzl %2,%0"
- : "=&r" (r) : "rm" (x), "rm" (-1));
-#else
- asm("bsrl %1,%0\n\t"
- "jnz 1f\n\t"
- "movl $-1,%0\n"
- "1:" : "=r" (r) : "rm" (x));
-#endif
- return r + 1;
-}
-
-/**
- * fls64 - find last set bit in a 64-bit word
- * @x: the word to search
- *
- * This is defined in a similar way as the libc and compiler builtin
- * ffsll, but returns the position of the most significant set bit.
- *
- * fls64(value) returns 0 if value is 0 or the position of the last
- * set bit if value is nonzero. The last (most significant) bit is
- * at position 64.
- */
-#ifdef CONFIG_X86_64
-static __always_inline int fls64(__u64 x)
-{
- int bitpos = -1;
-
- if (__builtin_constant_p(x))
- return x ? 64 - __builtin_clzll(x) : 0;
- /*
- * AMD64 says BSRQ won't clobber the dest reg if x==0; Intel64 says the
- * dest reg is undefined if x==0, but their CPU architect says its
- * value is written to set it to the same as before.
- */
- asm("bsrq %1,%q0"
- : "+r" (bitpos)
- : ASM_INPUT_RM (x));
- return bitpos + 1;
-}
-#else
+#include <asm-generic/bitops/__ffs.h>
+#include <asm-generic/bitops/ffz.h>
+#include <asm-generic/bitops/builtin-fls.h>
+#include <asm-generic/bitops/__fls.h>
#include <asm-generic/bitops/fls64.h>
-#endif
-#include <asm-generic/bitops/sched.h>
+#include <asm-generic/bitops/sched.h>
+#include <asm-generic/bitops/builtin-ffs.h>
#include <asm/arch_hweight.h>
-
#include <asm-generic/bitops/const_hweight.h>
#include <asm-generic/bitops/instrumented-atomic.h>
@@ -434,5 +268,4 @@ static __always_inline int fls64(__u64 x)
#include <asm-generic/bitops/ext2-atomic-setbit.h>
-#endif /* __KERNEL__ */
#endif /* _ASM_X86_BITOPS_H */
On April 26, 2025 12:55:13 PM PDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Sat, 26 Apr 2025 at 12:24, Linus Torvalds
><torvalds@linux-foundation.org> wrote:
>>
>> (And yes, one use in a x86 header file that is pretty questionable
>> too: I think the reason for the cmov is actually i486-only behavior
>> and we could probably unify the 32-bit and 64-bit implementation)
>
>Actually, what we *should* do is to remove that manual use of 'cmov'
>entirely - even if we decide that yes, that undefined zero case is
>actually real.
>
>We should probably change it to use CC_SET(), and the compiler will do
>a much better job - and probably never use cmov anyway.
>
>And yes, that will generate worse code if you have an old compiler
>that doesn't do ASM_FLAG_OUTPUTS, but hey, that's true in general. If
>you want good code, you need a good compiler.
>
>And clang needs to learn the CC_SET() pattern anyway.
>
>So I think that manual cmov pattern for x86-32 should be replaced with
>
> bool zero;
>
> asm("bsfl %[in],%[out]"
> CC_SET(z)
> : CC_OUT(z) (zero),
> [out]"=r" (r)
> : [in] "rm" (x));
>
> return zero ? 0 : r+1;
>
>instead (that's ffs(), and fls() would need the same thing except with
>bsrl instead, of course).
>
>I bet that would actually improve code generation.
>
>And I also bet it doesn't actually matter, of course.
>
> Linus
It is unfortunate, if understandable, that we ended up using a convention other than what ended up becoming standard. (Return the size in bits if the input is 0.)
This would let us use __builtin_ctz() -> tzcnt, which I believe is always inline on x86, and probably would help several other architectures too.
How much of a pain would it really be to fix this interface?
On April 26, 2025 12:55:13 PM PDT, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Sat, 26 Apr 2025 at 12:24, Linus Torvalds
><torvalds@linux-foundation.org> wrote:
>>
>> (And yes, one use in a x86 header file that is pretty questionable
>> too: I think the reason for the cmov is actually i486-only behavior
>> and we could probably unify the 32-bit and 64-bit implementation)
>
>Actually, what we *should* do is to remove that manual use of 'cmov'
>entirely - even if we decide that yes, that undefined zero case is
>actually real.
>
>We should probably change it to use CC_SET(), and the compiler will do
>a much better job - and probably never use cmov anyway.
>
>And yes, that will generate worse code if you have an old compiler
>that doesn't do ASM_FLAG_OUTPUTS, but hey, that's true in general. If
>you want good code, you need a good compiler.
>
>And clang needs to learn the CC_SET() pattern anyway.
>
>So I think that manual cmov pattern for x86-32 should be replaced with
>
> bool zero;
>
> asm("bsfl %[in],%[out]"
> CC_SET(z)
> : CC_OUT(z) (zero),
> [out]"=r" (r)
> : [in] "rm" (x));
>
> return zero ? 0 : r+1;
>
>instead (that's ffs(), and fls() would need the same thing except with
>bsrl instead, of course).
>
>I bet that would actually improve code generation.
>
>And I also bet it doesn't actually matter, of course.
>
> Linus
The undefined zero case applies to family < 6 as far as I know... the same platforms which don't have cmov...
* H. Peter Anvin <hpa@zytor.com> wrote:

> The undefined zero case applies to family < 6 as far as I know... the
> same platforms which don't have cmov...

So, technically, these are family 5 CPUs with CMOV support, if
Kconfig.cpu can be believed:

	MGEODE_LX
	MCRUSOE

Right?

	Ingo
* Arnd Bergmann <arnd@arndb.de> wrote:

> CMOV is missing not just on old Socket 5/7 CPUs (Pentium MMX, AMD K6,
> Cyrix MII) but also newer embedded Via C3, Geode GX and
> Vortex86DX/MX/EX/DX2. The replacement Nehemiah (2003), GeodeLX (2005)
> and Vortex86DX3/EX2 (2015!) have CMOV, but the old ones were sold
> alongside them for years, and some of the 586-class Vortex86 products
> are still commercially available.

Very few (if any) of the commercially available products will run
modern 6.16+ kernels, right?

Note that the real danger the 32-bit x86 kernel is going to be facing
in 2-5 years is total removal due to lack of development interest, but
I think we can support 686+ reasonably far into the future, and can
keep it tested reasonably - while covering like 99%+ of the currently
available 32-bit-only x86 products on the market. The fewer variants,
the better.

Thanks,

	Ingo
On Sat, Apr 26, 2025, at 21:09, Ingo Molnar wrote:
> * Arnd Bergmann <arnd@arndb.de> wrote:
>
>> CMOV is missing not just on old Socket 5/7 CPUs (Pentium MMX, AMD K6,
>> Cyrix MII) but also newer embedded Via C3, Geode GX and
>> Vortex86DX/MX/EX/DX2. The replacement Nehemiah (2003), GeodeLX (2005)
>> and Vortex86DX3/EX2 (2015!) have CMOV, but the old ones were sold
>> alongside them for years, and some of the 586-class Vortex86 products
>> are still commercially available.
>
> Very few (if any) of the commercially available products will run
> modern 6.16+ kernels, right?
No, at least not in absolute numbers. As far as I can tell, the RDC
SoC family is the only one that is still around, after Quark, Geode
and Eden were all discontinued around 2019.
There are multiple known RDC licensees (DM&P/Vortex86, xlichip) and
probably a few more with custom chips. They lag behind Intel and AMD
by about one patent expiration time, and maybe a decade behind Arm
SoCs, so they only just arrived at quad-core SMP, LPDDR4, and SSSE3
instructions and have announced upcoming 64-bit chips.
They do have super-long support cycles, and there are a few markets
that absolutely require kernel updates for many years, so I would
still consider the 586-class embedded chips more relevant for future
kernels than 30 year old PCs, and the 686-class embedded chips
more relevant than 20 year old laptops.
> Note that the real danger the 32-bit x86 kernel is going to be facing
> in 2-5 years is total removal due to lack of development interest, but
> I think we can support 686+ reasonably far into the future, and can
> keep it tested reasonably - while covering like 99%+ of the currently
> available 32-bit-only x86 products on the market. The fewer variants,
> the better.
I agree that this is the endgame for x86-32 and that eventually
the whole thing will be removed, and every simplification helps, but
I still don't think that removing 586 earlier helps enough to
outweigh the cost here.
The situation is similar on 32-bit Arm, where we currently support
armv4, armv4t, armv5, armv6, armv6k, armv7, armv7ve and armv8-aarch32.
Removing armv3 a few years ago helped a lot, removing the extremely
rare armv6 will help as well, but there is little value in dropping
CPU support for v4 and v4t as long as v5 is there, or v6k and v7
as long as we have v7ve and v8-aarch32. My guess would be that we
can remove armv4/v4t/v5 at the same time as i586/i686 and
some other 32-bit targets, followed by armv7/v7ve/v8-aarch32
much later.
Arnd
On April 27, 2025 6:24:59 AM PDT, Arnd Bergmann <arnd@arndb.de> wrote:
>On Sat, Apr 26, 2025, at 21:09, Ingo Molnar wrote:
>> * Arnd Bergmann <arnd@arndb.de> wrote:
>>
>>> CMOV is missing not just on old Socket 5/7 CPUs (Pentium MMX, AMD K6,
>>> Cyrix MII) but also newer embedded Via C3, Geode GX and
>>> Vortex86DX/MX/EX/DX2. The replacement Nehemiah (2003), GeodeLX (2005)
>>> and Vortex86DX3/EX2 (2015!) have CMOV, but the old ones were sold
>>> alongside them for years, and some of the 586-class Vortex86 products
>>> are still commercially available.
>>
>> Very few (if any) of the commercially available products will run
>> modern 6.16+ kernels, right?
>
>No, at least not in absolute numbers. As far as I can tell, the RDC
>SoC family is the only one that is still around, after Quark, Geode
>and Eden were all discontinued around 2019.
>
>There are multiple known RDC licensees (DM&P/Vortex86, xlichip) and
>probably a few more with custom chips. They lag behind Intel and AMD
>by about one patent expiration time, and maybe a decade behind Arm
>SoCs, so they only just arrived at quad-core SMP, LPDDR4, and SSSE3
>instructions and have announced upcoming 64-bit chips.
>
>They do have super-long support cycles, and there are a few markets
>that absolutely require kernel updates for many years, so I would
>still consider the 586-class embedded chips more relevant for future
>kernels than 30 year old PCs, and the 686-class embedded chips
>more relevant than 20 year old laptops.
>
>> Note that the real danger the 32-bit x86 kernel is going to be facing
>> in 2-5 years is total removal due to lack of development interest, but
>> I think we can support 686+ reasonably far into the future, and can
>> keep it tested reasonably - while covering like 99%+ of the currently
>> available 32-bit-only x86 products on the market. The fewer variants,
>> the better.
>
>I agree that this is the endgame for x86-32 and that eventually
>the whole thing will be removed, and every simplification helps, but
>I still don't think that removing 586 earlier helps enough to
>outweigh the cost here.
>
>The situation is similar on 32-bit Arm, where we currently support
>armv4, armv4t, armv5, armv6, armv6k, armv7, armv7ve and armv8-aarch32.
>Removing armv3 a few years ago helped a lot, removing the extremely
>rare armv6 will help as well, but there is little value in dropping
>CPU support for v4 and v4t as long as v5 is there, or v6k and v7
>as long as we have v7ve and v8-aarch32. My guess would be that we
>can remove armv4/v4t/v5 at the same time as i586/i686 and
>some other 32-bit targets, followed by armv7/v7ve/v8-aarch32
>much later.
>
>	Arnd

"Commercially available" doesn't mean "binary distribution." There is a
whole world beyond the desktop distros. These kinds of systems run
embedded stacks that are at least in part compiled by the appliance
vendor to allow for higher configurability.
On April 26, 2025 2:08:17 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Arnd Bergmann <arnd@kernel.org> wrote:
>
>> From: Arnd Bergmann <arnd@arndb.de>
>>
>> With cx8 and tsc being mandatory features, the only important
>> architectural features are now cmov and pae.
>>
>> Change the large list of target CPUs to no longer pick the instruction set
>> itself but only the mtune= optimization level and in-kernel optimizations
>> that remain compatible with all cores.
>>
>> The CONFIG_X86_CMOV instead becomes user-selectable and is now how
>> Kconfig picks between 586-class (Pentium, Pentium MMX, K6, C3, GeodeGX)
>> and 686-class (everything else) targets.
>>
>> In order to allow running on late 32-bit cores (Athlon, Pentium-M,
>> Pentium 4, ...), the X86_L1_CACHE_SHIFT can no longer be set to anything
>> lower than 6 (i.e. 64 byte cache lines).
>>
>> The optimization options now depend on X86_CMOV and X86_PAE instead
>> of the other way round, while other compile-time conditionals that
>> checked for MATOM/MGEODEGX1 instead now check for CPU_SUP_* options
>> that enable support for a particular CPU family.
>>
>> Link: https://lore.kernel.org/lkml/dd29df0c-0b4f-44e6-b71b-2a358ea76fb4@app.fastmail.com/
>> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>> ---
>> This is what I had in mind as mentioned in the earlier thread on
>> cx8/tsc removal. I based this on top of the Ingo's [RFC 15/15]
>> patch.
>> ---
>>  arch/x86/Kconfig                |   2 +-
>>  arch/x86/Kconfig.cpu            | 100 ++++++++++++++------------------
>>  arch/x86/Makefile_32.cpu        |  48 +++++++--------
>>  arch/x86/include/asm/vermagic.h |  36 +-----------
>>  arch/x86/kernel/tsc.c           |   2 +-
>>  arch/x86/xen/Kconfig            |   1 -
>>  drivers/misc/mei/Kconfig        |   2 +-
>>  7 files changed, 74 insertions(+), 117 deletions(-)
>
>While the simplification is nice on its face, this looks messy:
>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index a9d717558972..1e33f88c9b97 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1438,7 +1438,7 @@ config HIGHMEM
>>
>>  config X86_PAE
>>  	bool "PAE (Physical Address Extension) Support"
>> -	depends on X86_32 && X86_HAVE_PAE
>> +	depends on X86_32 && X86_CMOV
>
>Coupling CMOV to PAE ... :-/
>
>> +config X86_CMOV
>> +	bool "Require 686-class CMOV instructions" if X86_32
>> +	default y
>>  	help
>> -	  This is the processor type of your CPU. This information is
>> -	  used for optimizing purposes. In order to compile a kernel
>> -	  that can run on all supported x86 CPU types (albeit not
>> -	  optimally fast), you can specify "586" here.
>> +	  Most x86-32 processor implementations are compatible with
>> +	  the CMOV instruction originally added in the Pentium Pro,
>> +	  and they perform much better when using it.
>> +
>> +	  Disable this option to build for 586-class CPUs without this
>> +	  instruction. This is only required for the original Intel
>> +	  Pentium (P5, P54, P55), AMD K6/K6-II/K6-3D, Geode GX1 and Via
>> +	  CyrixIII/C3 CPUs.
>
>Very few users will know anything about CMOV.
>
>I'd argue the right path forward is to just bite the bullet and remove
>non-CMOV support as well, that would be the outcome *anyway* in a few
>years. That would allow basically a single 'modern' 32-bit kernel that
>is supposed to boot on every supported CPU. People might even end up
>testing it ... ;-)
>
>Thanks,
>
>	Ingo

Dropping CMOV would mean dropping P5 support.
* H. Peter Anvin <hpa@zytor.com> wrote:

> Dropping CMOV would mean dropping P5 support.

Yeah, I think we should make the cutoff at the 686 level. Is there any
strong reason not to do that? Stable kernels will still exist for a
very long time for ancient boards.

	Ingo
On April 26, 2025 11:55:50 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* H. Peter Anvin <hpa@zytor.com> wrote:
>
>> Dropping CMOV would mean dropping P5 support.
>
>Yeah, I think we should make the cutoff at the 686 level. Is there any
>strong reason not to do that? Stable kernels will still exist for a
>very long time for ancient boards.
>
>	Ingo

I don't think some of the embedded 586-level ISA CPUs are ancient.
On April 25, 2025 7:15:15 AM PDT, Arnd Bergmann <arnd@kernel.org> wrote:
>From: Arnd Bergmann <arnd@arndb.de>
>
>With cx8 and tsc being mandatory features, the only important
>architectural features are now cmov and pae.
>
>Change the large list of target CPUs to no longer pick the instruction set
>itself but only the mtune= optimization level and in-kernel optimizations
>that remain compatible with all cores.
>
>The CONFIG_X86_CMOV instead becomes user-selectable and is now how
>Kconfig picks between 586-class (Pentium, Pentium MMX, K6, C3, GeodeGX)
>and 686-class (everything else) targets.
>
>In order to allow running on late 32-bit cores (Athlon, Pentium-M,
>Pentium 4, ...), the X86_L1_CACHE_SHIFT can no longer be set to anything
>lower than 6 (i.e. 64 byte cache lines).
>
>The optimization options now depend on X86_CMOV and X86_PAE instead
>of the other way round, while other compile-time conditionals that
>checked for MATOM/MGEODEGX1 instead now check for CPU_SUP_* options
>that enable support for a particular CPU family.
>
>Link: https://lore.kernel.org/lkml/dd29df0c-0b4f-44e6-b71b-2a358ea76fb4@app.fastmail.com/
>Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>---
>This is what I had in mind as mentioned in the earlier thread on
>cx8/tsc removal. I based this on top of the Ingo's [RFC 15/15]
>patch.
>---
> arch/x86/Kconfig | 2 +-
> arch/x86/Kconfig.cpu | 100 ++++++++++++++------------------
> arch/x86/Makefile_32.cpu | 48 +++++++--------
> arch/x86/include/asm/vermagic.h | 36 +-----------
> arch/x86/kernel/tsc.c | 2 +-
> arch/x86/xen/Kconfig | 1 -
> drivers/misc/mei/Kconfig | 2 +-
> 7 files changed, 74 insertions(+), 117 deletions(-)
>
>diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>index a9d717558972..1e33f88c9b97 100644
>--- a/arch/x86/Kconfig
>+++ b/arch/x86/Kconfig
>@@ -1438,7 +1438,7 @@ config HIGHMEM
>
> config X86_PAE
> bool "PAE (Physical Address Extension) Support"
>- depends on X86_32 && X86_HAVE_PAE
>+ depends on X86_32 && X86_CMOV
> select PHYS_ADDR_T_64BIT
> help
> PAE is required for NX support, and furthermore enables
>diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
>index 6f1e8cc8fe58..0619566de93f 100644
>--- a/arch/x86/Kconfig.cpu
>+++ b/arch/x86/Kconfig.cpu
>@@ -1,23 +1,32 @@
> # SPDX-License-Identifier: GPL-2.0
> # Put here option for CPU selection and depending optimization
>-choice
>- prompt "x86-32 Processor family"
>- depends on X86_32
>- default M686
>+
>+config X86_CMOV
>+ bool "Require 686-class CMOV instructions" if X86_32
>+ default y
> help
>- This is the processor type of your CPU. This information is
>- used for optimizing purposes. In order to compile a kernel
>- that can run on all supported x86 CPU types (albeit not
>- optimally fast), you can specify "586" here.
>+ Most x86-32 processor implementations are compatible with
>+ the CMOV instruction originally added in the Pentium Pro,
>+ and they perform much better when using it.
>+
>+ Disable this option to build for 586-class CPUs without this
>+ instruction. This is only required for the original Intel
>+ Pentium (P5, P54, P55), AMD K6/K6-II/K6-3D, Geode GX1 and Via
>+ CyrixIII/C3 CPUs.
>
> Note that the 386 and 486 is no longer supported, this includes
> AMD/Cyrix/Intel 386DX/DXL/SL/SLC/SX, Cyrix/TI 486DLC/DLC2,
> UMC 486SX-S and the NexGen Nx586, AMD ELAN and all 486 based
> CPUs.
>
>- The kernel will not necessarily run on earlier architectures than
>- the one you have chosen, e.g. a Pentium optimized kernel will run on
>- a PPro, but not necessarily on a i486.
>+choice
>+ prompt "x86-32 Processor optimization"
>+ depends on X86_32
>+ default X86_GENERIC
>+ help
>+ This is the processor type of your CPU. This information is
>+ used for optimizing purposes, but does not change compatibility
>+ with other CPU types.
>
> Here are the settings recommended for greatest speed:
> - "586" for generic Pentium CPUs lacking the TSC
>@@ -45,14 +54,13 @@ choice
>
> config M586TSC
> bool "Pentium-Classic"
>- depends on X86_32
>+ depends on X86_32 && !X86_CMOV
> help
>- Select this for a Pentium Classic processor with the RDTSC (Read
>- Time Stamp Counter) instruction for benchmarking.
>+ Select this for a Pentium Classic processor.
>
> config M586MMX
> bool "Pentium-MMX"
>- depends on X86_32
>+ depends on X86_32 && !X86_CMOV
> help
> Select this for a Pentium with the MMX graphics/multimedia
> extended instructions.
>@@ -117,7 +125,7 @@ config MPENTIUM4
>
> config MK6
> bool "K6/K6-II/K6-III"
>- depends on X86_32
>+ depends on X86_32 && !X86_CMOV
> help
> Select this for an AMD K6-family processor. Enables use of
> some extended instructions, and passes appropriate optimization
>@@ -125,7 +133,7 @@ config MK6
>
> config MK7
> bool "Athlon/Duron/K7"
>- depends on X86_32
>+ depends on X86_32 && !X86_PAE
> help
> Select this for an AMD Athlon K7-family processor. Enables use of
> some extended instructions, and passes appropriate optimization
>@@ -147,42 +155,37 @@ config MEFFICEON
>
> config MGEODEGX1
> bool "GeodeGX1"
>- depends on X86_32
>+ depends on X86_32 && !X86_CMOV
> help
> Select this for a Geode GX1 (Cyrix MediaGX) chip.
>
> config MGEODE_LX
> bool "Geode GX/LX"
>- depends on X86_32
>+ depends on X86_32 && !X86_PAE
> help
> Select this for AMD Geode GX and LX processors.
>
> config MCYRIXIII
> bool "CyrixIII/VIA-C3"
>- depends on X86_32
>+ depends on X86_32 && !X86_CMOV
> help
> Select this for a Cyrix III or C3 chip. Presently Linux and GCC
> treat this chip as a generic 586. Whilst the CPU is 686 class,
> it lacks the cmov extension which gcc assumes is present when
> generating 686 code.
>- Note that Nehemiah (Model 9) and above will not boot with this
>- kernel due to them lacking the 3DNow! instructions used in earlier
>- incarnations of the CPU.
>
> config MVIAC3_2
> bool "VIA C3-2 (Nehemiah)"
>- depends on X86_32
>+ depends on X86_32 && !X86_PAE
> help
> Select this for a VIA C3 "Nehemiah". Selecting this enables usage
> of SSE and tells gcc to treat the CPU as a 686.
>- Note, this kernel will not boot on older (pre model 9) C3s.
>
> config MVIAC7
> bool "VIA C7"
>- depends on X86_32
>+ depends on X86_32 && !X86_PAE
> help
>- Select this for a VIA C7. Selecting this uses the correct cache
>- shift and tells gcc to treat the CPU as a 686.
>+ Select this for a VIA C7.
>
> config MATOM
> bool "Intel Atom"
>@@ -192,20 +195,19 @@ config MATOM
> accordingly optimized code. Use a recent GCC with specific Atom
> support in order to fully benefit from selecting this option.
>
>-endchoice
>-
> config X86_GENERIC
>- bool "Generic x86 support"
>- depends on X86_32
>+ bool "Generic x86"
> help
>- Instead of just including optimizations for the selected
>+ Instead of just including optimizations for a particular
> x86 variant (e.g. PII, Crusoe or Athlon), include some more
> generic optimizations as well. This will make the kernel
>- perform better on x86 CPUs other than that selected.
>+ perform better on a variety of CPUs.
>
> This is really intended for distributors who need more
> generic optimizations.
>
>+endchoice
>+
> #
> # Define implied options from the CPU selection here
> config X86_INTERNODE_CACHE_SHIFT
>@@ -216,17 +218,14 @@ config X86_INTERNODE_CACHE_SHIFT
> config X86_L1_CACHE_SHIFT
> int
> default "7" if MPENTIUM4
>- default "6" if MK7 || MPENTIUMM || MATOM || MVIAC7 || X86_GENERIC || X86_64
>- default "4" if MGEODEGX1
>- default "5" if MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MVIAC3_2 || MGEODE_LX
>+ default "6"
>
> config X86_F00F_BUG
>- def_bool y
>- depends on M586MMX || M586TSC || M586
>+ def_bool !X86_CMOV
>
> config X86_ALIGNMENT_16
> def_bool y
>- depends on MCYRIXIII || MK6 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODEGX1
>+ depends on MCYRIXIII || MK6 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODEGX1 || (!X86_CMOV && X86_GENERIC)
>
> config X86_INTEL_USERCOPY
> def_bool y
>@@ -234,34 +233,23 @@ config X86_INTEL_USERCOPY
>
> config X86_USE_PPRO_CHECKSUM
> def_bool y
>- depends on MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MATOM
>+ depends on MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MATOM || (X86_CMOV && X86_GENERIC)
>
> config X86_TSC
> def_bool y
>
>-config X86_HAVE_PAE
>- def_bool y
>- depends on MCRUSOE || MEFFICEON || MCYRIXIII || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC7 || MATOM || X86_64
>-
> config X86_CX8
> def_bool y
>
>-# this should be set for all -march=.. options where the compiler
>-# generates cmov.
>-config X86_CMOV
>- def_bool y
>- depends on (MK7 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || MATOM || MGEODE_LX || X86_64)
>-
> config X86_MINIMUM_CPU_FAMILY
> int
> default "64" if X86_64
>- default "6" if X86_32 && (MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MATOM || MK7)
>- default "5" if X86_32
>- default "4"
>+ default "6" if X86_32 && X86_CMOV
>+ default "5"
>
> config X86_DEBUGCTLMSR
> def_bool y
>- depends on !(MK6 || MCYRIXIII || M586MMX || M586TSC || M586) && !UML
>+ depends on X86_CMOV && !UML
>
> config IA32_FEAT_CTL
> def_bool y
>@@ -297,7 +285,7 @@ config CPU_SUP_INTEL
> config CPU_SUP_CYRIX_32
> default y
> bool "Support Cyrix processors" if PROCESSOR_SELECT
>- depends on M586 || M586TSC || M586MMX || (EXPERT && !64BIT)
>+ depends on !64BIT
> help
> This enables detection, tunings and quirks for Cyrix processors
>
>diff --git a/arch/x86/Makefile_32.cpu b/arch/x86/Makefile_32.cpu
>index f5e933077bf4..ebd7ec6eaf34 100644
>--- a/arch/x86/Makefile_32.cpu
>+++ b/arch/x86/Makefile_32.cpu
>@@ -10,30 +10,32 @@ else
> align := -falign-functions=0 -falign-jumps=0 -falign-loops=0
> endif
>
>-cflags-$(CONFIG_M586TSC) += -march=i586
>-cflags-$(CONFIG_M586MMX) += -march=pentium-mmx
>-cflags-$(CONFIG_M686) += -march=i686
>-cflags-$(CONFIG_MPENTIUMII) += -march=i686 $(call tune,pentium2)
>-cflags-$(CONFIG_MPENTIUMIII) += -march=i686 $(call tune,pentium3)
>-cflags-$(CONFIG_MPENTIUMM) += -march=i686 $(call tune,pentium3)
>-cflags-$(CONFIG_MPENTIUM4) += -march=i686 $(call tune,pentium4)
>-cflags-$(CONFIG_MK6) += -march=k6
>-# Please note, that patches that add -march=athlon-xp and friends are pointless.
>-# They make zero difference whatsosever to performance at this time.
>-cflags-$(CONFIG_MK7) += -march=athlon
>-cflags-$(CONFIG_MCRUSOE) += -march=i686 $(align)
>-cflags-$(CONFIG_MEFFICEON) += -march=i686 $(call tune,pentium3) $(align)
>-cflags-$(CONFIG_MCYRIXIII) += $(call cc-option,-march=c3,-march=i486) $(align)
>-cflags-$(CONFIG_MVIAC3_2) += $(call cc-option,-march=c3-2,-march=i686)
>-cflags-$(CONFIG_MVIAC7) += -march=i686
>-cflags-$(CONFIG_MATOM) += -march=atom
>+ifdef CONFIG_X86_CMOV
>+cflags-y += -march=i686
>+else
>+cflags-y += -march=i586
>+endif
>
>-# Geode GX1 support
>-cflags-$(CONFIG_MGEODEGX1) += -march=pentium-mmx
>-cflags-$(CONFIG_MGEODE_LX) += $(call cc-option,-march=geode,-march=pentium-mmx)
>-# add at the end to overwrite eventual tuning options from earlier
>-# cpu entries
>-cflags-$(CONFIG_X86_GENERIC) += $(call tune,generic,$(call tune,i686))
>+cflags-$(CONFIG_M586TSC) += -mtune=i586
>+cflags-$(CONFIG_M586MMX) += -mtune=pentium-mmx
>+cflags-$(CONFIG_M686) += -mtune=i686
>+cflags-$(CONFIG_MPENTIUMII) += -mtune=pentium2
>+cflags-$(CONFIG_MPENTIUMIII) += -mtune=pentium3
>+cflags-$(CONFIG_MPENTIUMM) += -mtune=pentium3
>+cflags-$(CONFIG_MPENTIUM4) += -mtune=pentium4
>+cflags-$(CONFIG_MK6) += -mtune=k6
>+# Please note that patches that add -mtune=athlon-xp and friends are pointless.
>+# They make zero difference whatsoever to performance at this time.
>+cflags-$(CONFIG_MK7) += -mtune=athlon
>+cflags-$(CONFIG_MCRUSOE) += -mtune=i686 $(align)
>+cflags-$(CONFIG_MEFFICEON) += -mtune=pentium3 $(align)
>+cflags-$(CONFIG_MCYRIXIII) += -mtune=c3 $(align)
>+cflags-$(CONFIG_MVIAC3_2) += -mtune=c3-2
>+cflags-$(CONFIG_MVIAC7) += -mtune=i686
>+cflags-$(CONFIG_MATOM) += -mtune=atom
>+cflags-$(CONFIG_MGEODEGX1) += -mtune=pentium-mmx
>+cflags-$(CONFIG_MGEODE_LX) += -mtune=geode
>+cflags-$(CONFIG_X86_GENERIC) += -mtune=generic
>
> # Bug fix for binutils: this option is required in order to keep
> # binutils from generating NOPL instructions against our will.
>diff --git a/arch/x86/include/asm/vermagic.h b/arch/x86/include/asm/vermagic.h
>index e26061df0c9b..6554dbdfd719 100644
>--- a/arch/x86/include/asm/vermagic.h
>+++ b/arch/x86/include/asm/vermagic.h
>@@ -5,42 +5,10 @@
>
> #ifdef CONFIG_X86_64
> /* X86_64 does not define MODULE_PROC_FAMILY */
>-#elif defined CONFIG_M586TSC
>-#define MODULE_PROC_FAMILY "586TSC "
>-#elif defined CONFIG_M586MMX
>-#define MODULE_PROC_FAMILY "586MMX "
>-#elif defined CONFIG_MATOM
>-#define MODULE_PROC_FAMILY "ATOM "
>-#elif defined CONFIG_M686
>+#elif defined CONFIG_X86_CMOV
> #define MODULE_PROC_FAMILY "686 "
>-#elif defined CONFIG_MPENTIUMII
>-#define MODULE_PROC_FAMILY "PENTIUMII "
>-#elif defined CONFIG_MPENTIUMIII
>-#define MODULE_PROC_FAMILY "PENTIUMIII "
>-#elif defined CONFIG_MPENTIUMM
>-#define MODULE_PROC_FAMILY "PENTIUMM "
>-#elif defined CONFIG_MPENTIUM4
>-#define MODULE_PROC_FAMILY "PENTIUM4 "
>-#elif defined CONFIG_MK6
>-#define MODULE_PROC_FAMILY "K6 "
>-#elif defined CONFIG_MK7
>-#define MODULE_PROC_FAMILY "K7 "
>-#elif defined CONFIG_MCRUSOE
>-#define MODULE_PROC_FAMILY "CRUSOE "
>-#elif defined CONFIG_MEFFICEON
>-#define MODULE_PROC_FAMILY "EFFICEON "
>-#elif defined CONFIG_MCYRIXIII
>-#define MODULE_PROC_FAMILY "CYRIXIII "
>-#elif defined CONFIG_MVIAC3_2
>-#define MODULE_PROC_FAMILY "VIAC3-2 "
>-#elif defined CONFIG_MVIAC7
>-#define MODULE_PROC_FAMILY "VIAC7 "
>-#elif defined CONFIG_MGEODEGX1
>-#define MODULE_PROC_FAMILY "GEODEGX1 "
>-#elif defined CONFIG_MGEODE_LX
>-#define MODULE_PROC_FAMILY "GEODE "
> #else
>-#error unknown processor family
>+#define MODULE_PROC_FAMILY "586 "
> #endif
>
> #ifdef CONFIG_X86_32
>diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
>index 489c779ef3ef..76b15ef8c85f 100644
>--- a/arch/x86/kernel/tsc.c
>+++ b/arch/x86/kernel/tsc.c
>@@ -1221,7 +1221,7 @@ bool tsc_clocksource_watchdog_disabled(void)
>
> static void __init check_system_tsc_reliable(void)
> {
>-#if defined(CONFIG_MGEODEGX1) || defined(CONFIG_MGEODE_LX) || defined(CONFIG_X86_GENERIC)
>+#if defined(CONFIG_CPU_SUP_CYRIX)
> if (is_geode_lx()) {
> /* RTSC counts during suspend */
> #define RTSC_SUSP 0x100
>diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
>index 222b6fdad313..2648459b8e8f 100644
>--- a/arch/x86/xen/Kconfig
>+++ b/arch/x86/xen/Kconfig
>@@ -9,7 +9,6 @@ config XEN
> select PARAVIRT_CLOCK
> select X86_HV_CALLBACK_VECTOR
> depends on X86_64 || (X86_32 && X86_PAE)
>- depends on X86_64 || (X86_GENERIC || MPENTIUM4 || MATOM)
> depends on X86_LOCAL_APIC
> help
> This is the Linux Xen port. Enabling this will allow the
>diff --git a/drivers/misc/mei/Kconfig b/drivers/misc/mei/Kconfig
>index 7575fee96cc6..4deb17ed0a62 100644
>--- a/drivers/misc/mei/Kconfig
>+++ b/drivers/misc/mei/Kconfig
>@@ -3,7 +3,7 @@
> config INTEL_MEI
> tristate "Intel Management Engine Interface"
> depends on X86 && PCI
>- default X86_64 || MATOM
>+ default X86_64 || CPU_SUP_INTEL
> help
> The Intel Management Engine (Intel ME) provides Manageability,
> Security and Media services for system containing Intel chipsets.
I really don't like testing an unrelated feature (CMOV for PAE); furthermore, at least some old hypervisors were known to have broken PAE.
At the very least it needs to be abstracted for clarity reasons.
Nacked-by: H. Peter Anvin <hpa@zytor.com>
On Fri, Apr 25, 2025, at 17:34, H. Peter Anvin wrote:
> On April 25, 2025 7:15:15 AM PDT, Arnd Bergmann <arnd@kernel.org> wrote:
>
> I really don't like testing an unrelated feature (CMOV for PAE);
How about a new symbol with the opposite polarity, e.g. CONFIG_CPU_586?
In that case, X86_HAVE_PAE and X86_CMOV could both depend on that
not being set.
I only picked the X86_CMOV symbol because it already exists in .config
files, but that is not the important bit here.
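A minimal Kconfig sketch of that opposite-polarity idea (symbol name and placement are hypothetical, not part of the posted patch):

```kconfig
# Hypothetical: a single user-visible 586-class symbol, with the
# architectural features depending on it *not* being set, so no
# feature option is tested for an unrelated feature.
config CPU_586
	bool "586-class CPU (no CMOV, no PAE)"
	depends on X86_32

config X86_CMOV
	def_bool y
	depends on !CPU_586

config X86_HAVE_PAE
	def_bool y
	depends on !CPU_586
```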
> furthermore, at least some old hypervisors were known to have
> broken PAE.
I'm not following. What does that have to do with my patch?
Arnd
On April 25, 2025 9:13:31 AM PDT, Arnd Bergmann <arnd@arndb.de> wrote:
>On Fri, Apr 25, 2025, at 17:34, H. Peter Anvin wrote:
>> On April 25, 2025 7:15:15 AM PDT, Arnd Bergmann <arnd@kernel.org> wrote:
>>
>> I really don't like testing an unrelated feature (CMOV for PAE);
>
>How about a new symbol with the opposite polarity, e.g. CONFIG_CPU_586?
>In that case, X86_HAVE_PAE and X86_CMOV could both depend on that
>not being set.
>
>I only picked the X86_CMOV symbol because it already exists in .config
>files, but that is not the important bit here.
>
>> furthermore, at least some old hypervisors were known to have
>> broken PAE.
>
>I'm not following. What does that have to do with my patch?
>
>     Arnd

This seems overly complex to me.