[PATCH] arch/x86: Add an option to build the kernel with '-march=native' on x86-64

Tor Vic posted 1 patch 9 months ago
[PATCH] arch/x86: Add an option to build the kernel with '-march=native' on x86-64
Posted by Tor Vic 9 months ago
Add a 'native' option that allows users to build an optimized kernel for
their local machine (i.e. the machine which is used to build the kernel)
by passing '-march=native' to the CFLAGS.

The idea comes from Linus' reply to Arnd's initial proposal in [1].

This patch is based on Arnd's x86 cleanup series, which is now in -tip [2].

[1] https://lore.kernel.org/all/CAHk-=wji1sV93yKbc==Z7OSSHBiDE=LAdG_d5Y-zPBrnSs0k2A@mail.gmail.com/
[2] https://lore.kernel.org/all/20250226213714.4040853-1-arnd@kernel.org/

Signed-off-by: Tor Vic <torvic9@mailbox.org>
---
Here are some numbers comparing 'generic' to 'native' on a Skylake dual-core
laptop (generic --> native):

  - vmlinux and compressed modules size:
      125'907'744 bytes --> 125'595'280 bytes  (-0.248 %)
      18'810 kilobytes --> 18'770 kilobytes    (-0.213 %)

  - phoronix, average of 3 runs:
      ffmpeg:
      130.99 --> 131.15                        (+0.122 %)
      nginx:
      10'650 --> 10'725                        (+0.704 %)
      hackbench (lower is better):
      102.27 --> 99.50                         (-2.709 %)

  - xz compression of firefox tarball (lower is better):
      319.57 seconds --> 317.34 seconds        (-0.698 %)

  - stress-ng, bogoops, average of 3 15-second runs:
      fork:
      111'744 --> 115'509                      (+3.397 %)
      bsearch:
      7'211 --> 7'436                          (+3.120 %)
      memfd:
      3'591 --> 3'604                          (+0.362 %)
      mmapfork:
      630 --> 629                              (-0.159 %)
      schedmix:
      42'715 --> 43'251                        (+1.255 %)
      epoll:
      2'443'767 --> 2'454'413                  (+0.436 %)
      vm:
      1'442'256 --> 1'486'615                  (+3.076 %)

  - schbench (two message threads), 30-second runs:
      304 rps --> 305 rps                      (+0.329 %)

There is little difference both in terms of size and of performance, however
the native build comes out on top ever so slightly.
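
For anyone who wants to re-check the percentages, here is a minimal sketch of the calculation, assuming each figure is the relative change (native - generic) / generic. The helper name pct_change is only illustrative; it strips the apostrophes used as thousands separators in the table.

```shell
#!/bin/sh
# pct_change OLD NEW: print the relative change from OLD to NEW in percent.
# Apostrophes used as thousands separators (e.g. 125'907'744) are stripped.
pct_change() {
    old=$(printf '%s' "$1" | tr -d "'")
    new=$(printf '%s' "$2" | tr -d "'")
    awk -v old="$old" -v new="$new" \
        'BEGIN { printf "%+.3f\n", (new - old) / old * 100 }'
}

pct_change 102.27 99.50                    # hackbench, prints -2.709
pct_change "125'907'744" "125'595'280"     # vmlinux size, prints -0.248
```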
---
 arch/x86/Kconfig.cpu | 9 +++++++++
 arch/x86/Makefile    | 5 +++++
 2 files changed, 14 insertions(+)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 8fcb8ccee44b..057d7c28b794 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -245,6 +245,15 @@ config MATOM
 
 endchoice
 
+config NATIVE_CPU
+	bool "Build for native CPU"
+	depends on X86_64
+	default n
+	help
+	  Optimize for the current CPU used to compile the kernel.
+	  Use this option if you intend to build the kernel for your
+	  local machine.
+
 config X86_GENERIC
 	bool "Generic x86 support"
 	depends on X86_32
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 8120085b00a4..0075bace3ed9 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -178,8 +178,13 @@ else
 	# Use -mskip-rax-setup if supported.
 	KBUILD_CFLAGS += $(call cc-option,-mskip-rax-setup)
 
+ifdef CONFIG_NATIVE_CPU
+        KBUILD_CFLAGS += -march=native
+        KBUILD_RUSTFLAGS += -Ctarget-cpu=native
+else
         KBUILD_CFLAGS += -march=x86-64 -mtune=generic
         KBUILD_RUSTFLAGS += -Ctarget-cpu=x86-64 -Ztune-cpu=generic
+endif
 
         KBUILD_CFLAGS += -mno-red-zone
         KBUILD_CFLAGS += -mcmodel=kernel
-- 
2.49.0
Re: [PATCH] arch/x86: Add an option to build the kernel with '-march=native' on x86-64
Posted by Arnd Bergmann 8 months, 4 weeks ago
On Fri, Mar 21, 2025, at 15:28, Tor Vic wrote:
> Add a 'native' option that allows users to build an optimized kernel for
> their local machine (i.e. the machine which is used to build the kernel)
> by passing '-march=native' to the CFLAGS.
>
> The idea comes from Linus' reply to Arnd's initial proposal in [1].
>
> This patch is based on Arnd's x86 cleanup series, which is now in -tip [2].

Thanks for having another look at this and for including the
benchmarks. I ended up dropping this bit of my series because
there were too many open questions around things like
reproducible builds, but there is clearly a demand for having
this included.

>       hackbench (lower is better):
>       102.27 --> 99.50                         (-2.709 %)
>
>   - stress-ng, bogoops, average of 3 15-second runs:
>       fork:
>       111'744 --> 115'509                      (+3.397 %)
>       bsearch:
>       7'211 --> 7'436                          (+3.120 %)
>       vm:
>       1'442'256 --> 1'486'615                  (+3.076 %)

3% in userspace benchmarks does seem significant enough to
spend more time on seeing what exactly made the difference
here, and possibly including it as separate options.

> +ifdef CONFIG_NATIVE_CPU
> +        KBUILD_CFLAGS += -march=native
> +        KBUILD_RUSTFLAGS += -Ctarget-cpu=native
> +else
>          KBUILD_CFLAGS += -march=x86-64 -mtune=generic
>          KBUILD_RUSTFLAGS += -Ctarget-cpu=x86-64 -Ztune-cpu=generic
> +endif

I assume that the difference here is that -march=native on
your machine gets turned into -march=skylake, which then turns
on both additional instructions and a different instruction
scheduler.

Are you able to quickly run the same tests again using
just one of the two?

a) -march=x86-64 -mtune=skylake
b) -march=skylake -mtune=generic
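
One possible way to try the two variants without editing arch/x86/Makefile is sketched below. It assumes that kbuild appends the user-supplied KCFLAGS after the architecture flags, so that the later -march/-mtune takes effect. The function names are illustrative, and build_variant must be run from the top of an already configured kernel tree.

```shell
#!/bin/sh
# variant_flags a|b: print the compiler flags for the two test builds above.
variant_flags() {
    case "$1" in
        a) echo "-march=x86-64 -mtune=skylake" ;;
        b) echo "-march=skylake -mtune=generic" ;;
        *) echo "usage: variant_flags a|b" >&2; return 1 ;;
    esac
}

# build_variant a|b: build the kernel with the chosen flags passed via
# KCFLAGS (assumption: these end up after the arch defaults on the
# compiler command line and therefore override them).
build_variant() {
    flags=$(variant_flags "$1") || return 1
    make KCFLAGS="$flags" -j"$(nproc)"
}
```

Running "make V=1" on a single object is a quick way to confirm which -march/-mtune pair actually ends up last on the command line.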

    Arnd
Re: [PATCH] arch/x86: Add an option to build the kernel with '-march=native' on x86-64
Posted by Tor Vic 8 months, 4 weeks ago

On 3/22/25 12:40, Arnd Bergmann wrote:
> On Fri, Mar 21, 2025, at 15:28, Tor Vic wrote:
>> Add a 'native' option that allows users to build an optimized kernel for
>> their local machine (i.e. the machine which is used to build the kernel)
>> by passing '-march=native' to the CFLAGS.
>>
>> The idea comes from Linus' reply to Arnd's initial proposal in [1].
>>
>> This patch is based on Arnd's x86 cleanup series, which is now in -tip [2].
> 
> Thanks for having another look at this and for including the
> benchmarks. I ended up dropping this bit of my series because
> there were too many open questions around things like
> reproducible builds, but there is clearly a demand for having
> this included.
> 
>>        hackbench (lower is better):
>>        102.27 --> 99.50                         (-2.709 %)
>>
>>    - stress-ng, bogoops, average of 3 15-second runs:
>>        fork:
>>        111'744 --> 115'509                      (+3.397 %)
>>        bsearch:
>>        7'211 --> 7'436                          (+3.120 %)
>>        vm:
>>        1'442'256 --> 1'486'615                  (+3.076 %)
> 
> 3% in userspace benchmarks does seem significant enough to
> spend more time on seeing what exactly made the difference
> here, and possibly including it as separate options.
> 
>> +ifdef CONFIG_NATIVE_CPU
>> +        KBUILD_CFLAGS += -march=native
>> +        KBUILD_RUSTFLAGS += -Ctarget-cpu=native
>> +else
>>           KBUILD_CFLAGS += -march=x86-64 -mtune=generic
>>           KBUILD_RUSTFLAGS += -Ctarget-cpu=x86-64 -Ztune-cpu=generic
>> +endif
> 
> I assume that the difference here is that -march=native on
> your machine gets turned into -march=skylake, which then turns
> on both additional instructions and a different instruction
> scheduler.
> 
> Are you able to quickly run the same tests again using
> just one of the two?
> 
> a) -march=x86-64 -mtune=skylake
> b) -march=skylake -mtune=generic
> 

I ran the tests on Zen2 (znver2), but this time the kernel was built 
with clang-20+lto (the skylake tests were with gcc-14).

It turned out that with '-march=native', there is almost no difference
compared to '-march=x86-64'.
All results are between -0.6% and +0.8%, and most of the differences are
probably noise. Hackbench, stress-ng fork and xz compression seem to
benefit the most from 'native'.

The vmlinux image is 0.03% bigger with 'native'.

I guess that 'native' can be somewhat useful on some architectures, but 
not on all...

As for your request to test '-march' and '-mtune' separately, I'm sorry,
but I haven't had the time yet.


>      Arnd
Re: [PATCH] arch/x86: Add an option to build the kernel with '-march=native' on x86-64
Posted by Arnd Bergmann 8 months, 3 weeks ago
On Sun, Mar 23, 2025, at 16:14, Tor Vic wrote:
> On 3/22/25 12:40, Arnd Bergmann wrote:
>> On Fri, Mar 21, 2025, at 15:28, Tor Vic wrote:
>>
>> Are you able to quickly run the same tests again using
>> just one of the two?
>> 
>> a) -march=x86-64 -mtune=skylake
>> b) -march=skylake -mtune=generic
>> 
>
> I ran the tests on Zen2 (znver2), but this time the kernel was built 
> with clang-20+lto (the skylake tests were with gcc-14).
>
> It turned out that with '-march=native', there is almost no difference 
> compared to '-march=x86-64'.
> All results are within +0.8% and -0.6%, most of which are probably 
> noise. Hackbench, stress-ng fork and xz compression seem to profit the 
> most from 'native'.
>
> The vmlinux image is 0.03% bigger with 'native'.
>
> I guess that 'native' can be somewhat useful on some architectures, but 
> not on all...

It certainly depends on configuration, toolchain and target machine,
so it's very hard to say anything general about -march=native.
znver2 and skylake are quite similar in the supported instructions,
so my guess would be that clang-20+lto doesn't behave that differently
between -march=x86-64 and -march=znver2.

Of course, if you didn't test the same kernel version and configuration
on both machines, the differences between the two sets of results would
be fairly random as well.

     Arnd
Re: [PATCH] arch/x86: Add an option to build the kernel with '-march=native' on x86-64
Posted by David Laight 8 months, 4 weeks ago
On Sat, 22 Mar 2025 12:40:08 +0100
"Arnd Bergmann" <arnd@kernel.org> wrote:

...
> I assume that the difference here is that -march=native on
> your machine gets turned into -march=skylake, which then turns
> on both additional instructions and a different instruction
> scheduler.
> 
> Are you able to quickly run the same tests again using
> just one of the two?
> 
> a) -march=x86-64 -mtune=skylake
> b) -march=skylake -mtune=generic

I've wondered what -mtune=generic is actually optimised for?
I've seen gcc convert 32 bit add into an lea and then have
to use another instruction to clear the high bits.
I've not fiddled with the options to see why it does that.
My only guess is it is avoiding false dependencies against
the flags register - but the flags are split on non-archaic
cpu so that doesn't matter.

	David
Re: [PATCH] arch/x86: Add an option to build the kernel with '-march=native' on x86-64
Posted by H. Peter Anvin 8 months, 4 weeks ago
On March 22, 2025 2:53:10 PM PDT, David Laight <david.laight.linux@gmail.com> wrote:
>On Sat, 22 Mar 2025 12:40:08 +0100
>"Arnd Bergmann" <arnd@kernel.org> wrote:
>
>...
>> I assume that the difference here is that -march=native on
>> your machine gets turned into -march=skylake, which then turns
>> on both additional instructions and a different instruction
>> scheduler.
>> 
>> Are you able to quickly run the same tests again using
>> just one of the two?
>> 
>> a) -march=x86-64 -mtune=skylake
>> b) -march=skylake -mtune=generic
>
>I've wondered what -mtune=generic is actually optimised for?
>I've seen gcc convert 32 bit add into an lea and then have
>to use another instruction to clear the high bits.
>I've not fiddled with the options to see why it does that.
>My only guess is it is avoiding false dependencies against
>the flags register - but the flags are split on non-archaic
>cpu so that doesn't matter.
>
>	David

Not to mention that you can use leal to get the zero-extended low 32 bits. Sounds like a gcc bug report is in order. 

The general idea, I believe, is that "generic" is supposed to produce code that performs well on the majority of hardware currently in production, and doesn't outright suck on older hardware either. In other words, it is intentionally a moving target.
Re: [PATCH] arch/x86: Add an option to build the kernel with '-march=native' on x86-64
Posted by Tor Vic 8 months, 4 weeks ago

On 3/22/25 12:40, Arnd Bergmann wrote:
> On Fri, Mar 21, 2025, at 15:28, Tor Vic wrote:
>> Add a 'native' option that allows users to build an optimized kernel for
>> their local machine (i.e. the machine which is used to build the kernel)
>> by passing '-march=native' to the CFLAGS.
>>
>> The idea comes from Linus' reply to Arnd's initial proposal in [1].
>>
>> This patch is based on Arnd's x86 cleanup series, which is now in -tip [2].
> 
> Thanks for having another look at this and for including the
> benchmarks. I ended up dropping this bit of my series because
> there were too many open questions around things like
> reproducible builds, but there is clearly a demand for having
> this included.
> 
>>        hackbench (lower is better):
>>        102.27 --> 99.50                         (-2.709 %)
>>
>>    - stress-ng, bogoops, average of 3 15-second runs:
>>        fork:
>>        111'744 --> 115'509                      (+3.397 %)
>>        bsearch:
>>        7'211 --> 7'436                          (+3.120 %)
>>        vm:
>>        1'442'256 --> 1'486'615                  (+3.076 %)
> 
> 3% in userspace benchmarks does seem significant enough to
> spend more time on seeing what exactly made the difference
> here, and possibly including it as separate options.

I didn't have a lot of time, so I only did 3 runs, which is
probably not enough to get accurate results, as some tests fluctuate
quite a bit.
I'm also not sure whether those stress-ng tests are appropriate for this kind
of testing, but I decided to include the results anyway.

Suggestions for other benchmarks are welcome!

> 
>> +ifdef CONFIG_NATIVE_CPU
>> +        KBUILD_CFLAGS += -march=native
>> +        KBUILD_RUSTFLAGS += -Ctarget-cpu=native
>> +else
>>           KBUILD_CFLAGS += -march=x86-64 -mtune=generic
>>           KBUILD_RUSTFLAGS += -Ctarget-cpu=x86-64 -Ztune-cpu=generic
>> +endif
> 
> I assume that the difference here is that -march=native on
> your machine gets turned into -march=skylake, which then turns
> on both additional instructions and a different instruction
> scheduler.

IIRC, '-march=native' on Skylake turns into '-march=skylake -mabm' with gcc.
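
To see what 'native' resolves to on a given build machine, something along these lines should work with gcc. This is a sketch that relies on gcc's '-Q --help=target' listing; the function name is made up, and clang does not accept the same options.

```shell
#!/bin/sh
# show_native_target [CC]: print the -march=/-mtune= values that the
# compiler selects when given -march=native (GCC-specific option listing).
show_native_target() {
    "${1:-gcc}" -march=native -Q --help=target 2>/dev/null |
        grep -E '^[[:space:]]+-m(arch|tune)='
}
```

Calling "show_native_target gcc" on the build host prints one line each for -march= and -mtune=; the values depend on the CPU and the gcc version.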

> 
> Are you able to quickly run the same tests again using
> just one of the two?
> 
> a) -march=x86-64 -mtune=skylake
> b) -march=skylake -mtune=generic
> 

Unfortunately that poor old laptop is quite slow, but I intend to run 
more tests on my much faster Zen2 machine next week.
I'll report back.

>      Arnd
Re: [PATCH] arch/x86: Add an option to build the kernel with '-march=native' on x86-64
Posted by H. Peter Anvin 9 months ago
On March 21, 2025 7:28:58 AM PDT, Tor Vic <torvic9@mailbox.org> wrote:
>Add a 'native' option that allows users to build an optimized kernel for
>their local machine (i.e. the machine which is used to build the kernel)
>by passing '-march=native' to the CFLAGS.
>
>The idea comes from Linus' reply to Arnd's initial proposal in [1].
>
>This patch is based on Arnd's x86 cleanup series, which is now in -tip [2].
>
>[1] https://lore.kernel.org/all/CAHk-=wji1sV93yKbc==Z7OSSHBiDE=LAdG_d5Y-zPBrnSs0k2A@mail.gmail.com/
>[2] https://lore.kernel.org/all/20250226213714.4040853-1-arnd@kernel.org/
>
>Signed-off-by: Tor Vic <torvic9@mailbox.org>
>---
>Here are some numbers comparing 'generic' to 'native' on a Skylake dual-core
>laptop (generic --> native):
>
>  - vmlinux and compressed modules size:
>      125'907'744 bytes --> 125'595'280 bytes  (-0.248 %)
>      18'810 kilobytes --> 18'770 kilobytes    (-0.213 %)
>
>  - phoronix, average of 3 runs:
>      ffmpeg:
>      130.99 --> 131.15                        (+0.122 %)
>      nginx:
>      10'650 --> 10'725                        (+0.704 %)
>      hackbench (lower is better):
>      102.27 --> 99.50                         (-2.709 %)
>
>  - xz compression of firefox tarball (lower is better):
>      319.57 seconds --> 317.34 seconds        (-0.698 %)
>
>  - stress-ng, bogoops, average of 3 15-second runs:
>      fork:
>      111'744 --> 115'509                      (+3.397 %)
>      bsearch:
>      7'211 --> 7'436                          (+3.120 %)
>      memfd:
>      3'591 --> 3'604                          (+0.362 %)
>      mmapfork:
>      630 --> 629                              (-0.159 %)
>      schedmix:
>      42'715 --> 43'251                        (+1.255 %)
>      epoll:
>      2'443'767 --> 2'454'413                  (+0.436 %)
>      vm:
>      1'442'256 --> 1'486'615                  (+3.076 %)
>
>  - schbench (two message threads), 30-second runs:
>      304 rps --> 305 rps                      (+0.329 %)
>
>There is little difference both in terms of size and of performance, however
>the native build comes out on top ever so slightly.
>---
> arch/x86/Kconfig.cpu | 9 +++++++++
> arch/x86/Makefile    | 5 +++++
> 2 files changed, 14 insertions(+)
>
>diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
>index 8fcb8ccee44b..057d7c28b794 100644
>--- a/arch/x86/Kconfig.cpu
>+++ b/arch/x86/Kconfig.cpu
>@@ -245,6 +245,15 @@ config MATOM
> 
> endchoice
> 
>+config NATIVE_CPU
>+	bool "Build for native CPU"
>+	depends on X86_64
>+	default n
>+	help
>+	  Optimize for the current CPU used to compile the kernel.
>+	  Use this option if you intend to build the kernel for your
>+	  local machine.
>+
> config X86_GENERIC
> 	bool "Generic x86 support"
> 	depends on X86_32
>diff --git a/arch/x86/Makefile b/arch/x86/Makefile
>index 8120085b00a4..0075bace3ed9 100644
>--- a/arch/x86/Makefile
>+++ b/arch/x86/Makefile
>@@ -178,8 +178,13 @@ else
> 	# Use -mskip-rax-setup if supported.
> 	KBUILD_CFLAGS += $(call cc-option,-mskip-rax-setup)
> 
>+ifdef CONFIG_NATIVE_CPU
>+        KBUILD_CFLAGS += -march=native
>+        KBUILD_RUSTFLAGS += -Ctarget-cpu=native
>+else
>         KBUILD_CFLAGS += -march=x86-64 -mtune=generic
>         KBUILD_RUSTFLAGS += -Ctarget-cpu=x86-64 -Ztune-cpu=generic
>+endif
> 
>         KBUILD_CFLAGS += -mno-red-zone
>         KBUILD_CFLAGS += -mcmodel=kernel

You may want to consider this option to also select matching kernel options (which would require probing cpuid.)
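
As a rough illustration of such probing from the build host, the sketch below reads the "flags" line of /proc/cpuinfo instead of executing CPUID directly. That substitution is an assumption made for brevity, and the helper name and optional file argument are hypothetical.

```shell
#!/bin/sh
# cpu_has FLAG [FILE]: succeed if FLAG appears on the first "flags" line
# of FILE (default: /proc/cpuinfo). Userspace approximation of a CPUID probe.
cpu_has() {
    awk -v want="$1" '
        /^flags[[:space:]]*:/ {
            # Drop the "flags :" prefix so only feature names remain.
            sub(/^flags[[:space:]]*:[[:space:]]*/, "")
            for (i = 1; i <= NF; i++)
                if ($i == want)
                    found = 1
            exit
        }
        END { exit !found }
    ' "${2:-/proc/cpuinfo}"
}
```

A configuration step could then, for example, run "cpu_has avx2" and only offer options that match the machine it is running on.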
[tip: x86/kconfig] x86/kbuild/64: Add the CONFIG_X86_NATIVE_CPU option to locally optimize the kernel with '-march=native'
Posted by tip-bot2 for Tor Vic 8 months, 2 weeks ago
The following commit has been merged into the x86/kconfig branch of tip:

Commit-ID:     ea1dcca1de129dfdf145338a868648bc0e24717c
Gitweb:        https://git.kernel.org/tip/ea1dcca1de129dfdf145338a868648bc0e24717c
Author:        Tor Vic <torvic9@mailbox.org>
AuthorDate:    Fri, 21 Mar 2025 15:28:58 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 25 Mar 2025 08:24:06 +01:00

x86/kbuild/64: Add the CONFIG_X86_NATIVE_CPU option to locally optimize the kernel with '-march=native'

Add a 'native' option that allows users to build an optimized kernel for
their local machine (i.e. the machine which is used to build the kernel)
by passing '-march=native' to CFLAGS.

The idea comes from Linus' reply to Arnd's initial proposal:

  https://lore.kernel.org/all/CAHk-=wji1sV93yKbc==Z7OSSHBiDE=LAdG_d5Y-zPBrnSs0k2A@mail.gmail.com/

Here are some numbers comparing 'generic' to 'native' on a Skylake dual-core
laptop (generic --> native):

  - vmlinux and compressed modules size:
      125'907'744 bytes --> 125'595'280 bytes  (-0.248 %)
      18'810 kilobytes --> 18'770 kilobytes    (-0.213 %)

  - phoronix, average of 3 runs:
      ffmpeg:
      130.99 --> 131.15                        (+0.122 %)
      nginx:
      10'650 --> 10'725                        (+0.704 %)
      hackbench (lower is better):
      102.27 --> 99.50                         (-2.709 %)

  - xz compression of firefox tarball (lower is better):
      319.57 seconds --> 317.34 seconds        (-0.698 %)

  - stress-ng, bogoops, average of 3 15-second runs:
      fork:
      111'744 --> 115'509                      (+3.397 %)
      bsearch:
      7'211 --> 7'436                          (+3.120 %)
      memfd:
      3'591 --> 3'604                          (+0.362 %)
      mmapfork:
      630 --> 629                              (-0.159 %)
      schedmix:
      42'715 --> 43'251                        (+1.255 %)
      epoll:
      2'443'767 --> 2'454'413                  (+0.436 %)
      vm:
      1'442'256 --> 1'486'615                  (+3.076 %)

  - schbench (two message threads), 30-second runs:
      304 rps --> 305 rps                      (+0.329 %)

There is little difference both in terms of size and of performance, however
the native build comes out on top ever so slightly.

[ mingo: Renamed the option to CONFIG_X86_NATIVE_CPU, expanded the help text
         and added Linus's Suggested-by tag. ]

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Tor Vic <torvic9@mailbox.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250321142859.13889-1-torvic9@mailbox.org
---
 arch/x86/Kconfig.cpu | 14 ++++++++++++++
 arch/x86/Makefile    |  5 +++++
 2 files changed, 19 insertions(+)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 753b876..9d108a5 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -245,6 +245,20 @@ config MATOM
 
 endchoice
 
+config X86_NATIVE_CPU
+	bool "Build and optimize for local/native CPU"
+	depends on X86_64
+	default n
+	help
+	  Optimize for the current CPU used to compile the kernel.
+	  Use this option if you intend to build the kernel for your
+	  local machine.
+
+	  Note that such a kernel might not work optimally on a
+	  different x86 machine.
+
+	  If unsure, say N.
+
 config X86_GENERIC
 	bool "Generic x86 support"
 	depends on X86_32
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 0fc7e8f..436635e 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -173,8 +173,13 @@ else
 	# Use -mskip-rax-setup if supported.
 	KBUILD_CFLAGS += $(call cc-option,-mskip-rax-setup)
 
+ifdef CONFIG_X86_NATIVE_CPU
+        KBUILD_CFLAGS += -march=native
+        KBUILD_RUSTFLAGS += -Ctarget-cpu=native
+else
         KBUILD_CFLAGS += -march=x86-64 -mtune=generic
         KBUILD_RUSTFLAGS += -Ctarget-cpu=x86-64 -Ztune-cpu=generic
+endif
 
         KBUILD_CFLAGS += -mno-red-zone
         KBUILD_CFLAGS += -mcmodel=kernel