[RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants

Ard Biesheuvel posted 3 patches 7 months, 1 week ago
There is a newer version of this series
[RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Ard Biesheuvel 7 months, 1 week ago
From: Ard Biesheuvel <ardb@kernel.org>

There are a couple of cases where pgtable_l5_enabled() is not used for
control flow, but for selecting the value of a global constant. There
are some other occurrences of such constants where the value is stored
in a global variable that needs to be assigned sufficiently early.

To make this more robust, base all of these on a new helper that uses
alternatives based code patching to select one of two immediate values,
based on whether 5 level paging is in use.

Base this on __pgtable_l5_enabled, which is guaranteed to be set to the
right value before C code ever observes it. This allows the helper to
have the 'const' attribute.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/include/asm/page_64_types.h    |  2 +-
 arch/x86/include/asm/pgtable_64_types.h | 28 +++++++++++++++++---
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 1faa8f88850a..c5631dc4ab16 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -54,7 +54,7 @@
 #define __PHYSICAL_MASK_SHIFT	52
 
 #ifdef CONFIG_X86_5LEVEL
-#define __VIRTUAL_MASK_SHIFT	(pgtable_l5_enabled() ? 56 : 47)
+#define __VIRTUAL_MASK_SHIFT	(choose_l5_enabled(56, 47))
 /* See task_size_max() in <asm/page_64.h> */
 #else
 #define __VIRTUAL_MASK_SHIFT	47
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 2c498d16609c..bb4f54ac2e62 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -43,6 +43,28 @@ static inline bool __attribute_const__ pgtable_l5_enabled(void)
 t_no:
 	return false;
 }
+
+static inline int __attribute_const__ choose_l5_enabled(int yes, int no)
+{
+	int ret = no;
+
+	asm_inline(ALTERNATIVE_TERNARY("jmp 6f; 8:", %c[feat], "movl %[yes], %[ret]", "")
+		"	.pushsection .altinstr_aux,\"ax\"	\n"
+		"6:	pushfq					\n"
+		"	testb	$1, %a[l5en]			\n"
+		"	jz	7f				\n"
+		"	movl	%[yes], %[ret]			\n"
+		"7:	popfq					\n"
+		"	jmp	8b				\n"
+		"	.popsection				\n"
+		: [ret]  "+rm" (ret)
+		: [feat] "i" (X86_FEATURE_LA57),
+		  [yes]	 "i" (yes),
+		  [l5en] "i" (&__pgtable_l5_enabled));
+
+	return ret;
+}
+
 #else
 #define pgtable_l5_enabled() 0
 #endif /* CONFIG_X86_5LEVEL */
@@ -59,7 +81,7 @@ extern unsigned int ptrs_per_p4d;
 /*
  * PGDIR_SHIFT determines what a top-level page table entry can map
  */
-#define PGDIR_SHIFT	pgdir_shift
+#define PGDIR_SHIFT	choose_l5_enabled(48, 39)
 #define PTRS_PER_PGD	512
 
 /*
@@ -67,7 +89,7 @@ extern unsigned int ptrs_per_p4d;
  */
 #define P4D_SHIFT		39
 #define MAX_PTRS_PER_P4D	512
-#define PTRS_PER_P4D		ptrs_per_p4d
+#define PTRS_PER_P4D		choose_l5_enabled(MAX_PTRS_PER_P4D, 1)
 #define P4D_SIZE		(_AC(1, UL) << P4D_SHIFT)
 #define P4D_MASK		(~(P4D_SIZE - 1))
 
@@ -138,7 +160,7 @@ extern unsigned int ptrs_per_p4d;
 
 #ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
 # define VMALLOC_START		vmalloc_base
-# define VMALLOC_SIZE_TB	(pgtable_l5_enabled() ? VMALLOC_SIZE_TB_L5 : VMALLOC_SIZE_TB_L4)
+# define VMALLOC_SIZE_TB	((unsigned long)choose_l5_enabled(VMALLOC_SIZE_TB_L5, VMALLOC_SIZE_TB_L4))
 # define VMEMMAP_START		vmemmap_base
 #else
 # define VMALLOC_START		__VMALLOC_BASE_L4
-- 
2.49.0.987.g0cc8ee98dc-goog
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Ingo Molnar 7 months, 1 week ago
* Ard Biesheuvel <ardb+git@google.com> wrote:

> +static inline int __attribute_const__ choose_l5_enabled(int yes, int no)
> +{
> +	int ret = no;
> +
> +	asm_inline(ALTERNATIVE_TERNARY("jmp 6f; 8:", %c[feat], "movl %[yes], %[ret]", "")
> +		"	.pushsection .altinstr_aux,\"ax\"	\n"
> +		"6:	pushfq					\n"
> +		"	testb	$1, %a[l5en]			\n"
> +		"	jz	7f				\n"
> +		"	movl	%[yes], %[ret]			\n"
> +		"7:	popfq					\n"
> +		"	jmp	8b				\n"
> +		"	.popsection				\n"
> +		: [ret]  "+rm" (ret)
> +		: [feat] "i" (X86_FEATURE_LA57),
> +		  [yes]	 "i" (yes),
> +		  [l5en] "i" (&__pgtable_l5_enabled));
> +
> +	return ret;

So why not create a new synthethic cpufeature flag, 
X86_FEATURE_LA57_ENABLED or so, which could then be queried via the 
regular facilities? This ternary logic is not really needed, because 
the hardware isn't ternary. :)

With that we could do with only a single, obvious line of ALTERNATIVE() 
assembly:

        #define ALTERNATIVES_CONST_U32(__val1, __val2, __feature)       \
        ({                                                              \
                u32 __val;                                              \
                                                                        \
                asm_inline (ALTERNATIVE("movl $" #__val1 ", %0", "movl $" __val2 ", %0", __feature) :"=g" (__val)); \
                                                                        \
                __val;                                                  \
        })

        ...

        #define MAX_PHYSMEM_BITS ALTERNATIVE_CONST_U32(46, 52, X86_FEATURE_LA57_ENABLED)

Thanks,

	Ingo
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Ard Biesheuvel 7 months, 1 week ago
On Tue, 6 May 2025 at 18:50, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Ard Biesheuvel <ardb+git@google.com> wrote:
>
> > +static inline int __attribute_const__ choose_l5_enabled(int yes, int no)
> > +{
> > +     int ret = no;
> > +
> > +     asm_inline(ALTERNATIVE_TERNARY("jmp 6f; 8:", %c[feat], "movl %[yes], %[ret]", "")
> > +             "       .pushsection .altinstr_aux,\"ax\"       \n"
> > +             "6:     pushfq                                  \n"
> > +             "       testb   $1, %a[l5en]                    \n"
> > +             "       jz      7f                              \n"
> > +             "       movl    %[yes], %[ret]                  \n"
> > +             "7:     popfq                                   \n"
> > +             "       jmp     8b                              \n"
> > +             "       .popsection                             \n"
> > +             : [ret]  "+rm" (ret)
> > +             : [feat] "i" (X86_FEATURE_LA57),
> > +               [yes]  "i" (yes),
> > +               [l5en] "i" (&__pgtable_l5_enabled));
> > +
> > +     return ret;
>
> So why not create a new synthethic cpufeature flag,
> X86_FEATURE_LA57_ENABLED or so, which could then be queried via the
> regular facilities? This ternary logic is not really needed, because
> the hardware isn't ternary. :)
>

The logic is not ternary, and ALTERNATIVE_TERNARY() is a misnomer. The
first branch of the alternative is the preliminary version that gets
used before alternatives have been patched.

This is needed, because otherwise, code that runs really early may
observe the wrong value. This is the whole reason the
USE_EARLY_PGTABLE_L5 exists to begin with.

> With that we could do with only a single, obvious line of ALTERNATIVE()
> assembly:
>
>         #define ALTERNATIVES_CONST_U32(__val1, __val2, __feature)       \
>         ({                                                              \
>                 u32 __val;                                              \
>                                                                         \
>                 asm_inline (ALTERNATIVE("movl $" #__val1 ", %0", "movl $" __val2 ", %0", __feature) :"=g" (__val)); \
>                                                                         \
>                 __val;                                                  \
>         })
>
>         ...
>
>         #define MAX_PHYSMEM_BITS ALTERNATIVE_CONST_U32(46, 52, X86_FEATURE_LA57_ENABLED)
>

This will produce 46 during early boot, and may therefore result in
the logic in __startup_64() to behave incorrectly. (Not sure whether
this macro in particular is implicated in that, but in general,
anything that relies on binary CPU feature logic in this context is
potentially broken.

I do agree that having a separate CPU feature X86_FEATURE_LA57_ENABLED
would be preferable: it is kind of strange that we have to pretend the
CPU does not implement 5-level paging in order to use only 4 levels,
and I suppose there may be cases where we want the wider physical
address space even when using only 4 levels of virtual addressing.
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Linus Torvalds 7 months, 1 week ago
On Tue, 6 May 2025 at 08:49, Ard Biesheuvel <ardb+git@google.com> wrote:
>
> +       asm_inline(ALTERNATIVE_TERNARY("jmp 6f; 8:", %c[feat], "movl %[yes], %[ret]", "")
> +               "       .pushsection .altinstr_aux,\"ax\"       \n"
> +               "6:     pushfq                                  \n"
> +               "       testb   $1, %a[l5en]                    \n"
> +               "       jz      7f                              \n"
> +               "       movl    %[yes], %[ret]                  \n"
> +               "7:     popfq                                   \n"
> +               "       jmp     8b                              \n"
> +               "       .popsection                             \n"
> +               : [ret]  "+rm" (ret)
> +               : [feat] "i" (X86_FEATURE_LA57),
> +                 [yes]  "i" (yes),
> +                 [l5en] "i" (&__pgtable_l5_enabled));

I really detest these things. I don't think it's worth the complexity.
Is there really such a hot path somewhere that we should do something
like this?

If we can't use some existing infrastructure for this, we should just
keep it simple. Not write ten lines of specialized inline asm.

             Linus
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Ard Biesheuvel 7 months, 1 week ago
On Tue, 6 May 2025 at 18:23, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 6 May 2025 at 08:49, Ard Biesheuvel <ardb+git@google.com> wrote:
> >
> > +       asm_inline(ALTERNATIVE_TERNARY("jmp 6f; 8:", %c[feat], "movl %[yes], %[ret]", "")
> > +               "       .pushsection .altinstr_aux,\"ax\"       \n"
> > +               "6:     pushfq                                  \n"
> > +               "       testb   $1, %a[l5en]                    \n"
> > +               "       jz      7f                              \n"
> > +               "       movl    %[yes], %[ret]                  \n"
> > +               "7:     popfq                                   \n"
> > +               "       jmp     8b                              \n"
> > +               "       .popsection                             \n"
> > +               : [ret]  "+rm" (ret)
> > +               : [feat] "i" (X86_FEATURE_LA57),
> > +                 [yes]  "i" (yes),
> > +                 [l5en] "i" (&__pgtable_l5_enabled));
>
> I really detest these things. I don't think it's worth the complexity.
> Is there really such a hot path somewhere that we should do something
> like this?
>

Not sure - Ingo's data only tells us how often these occur in the
code, not how often they are called.

The only constant that I know is likely to be hot is
KASAN_SHADOW_START, but that only matters when KASAN is enabled so I'm
not sure that matters.

For the remaining uses, I would assume that manipulations of PGD/P4D
level entries [where the 4-level/5-level distinction matters] are
extremely rare compared to lower level ones, so I wouldn't expect any
bottlenecks there.

> If we can't use some existing infrastructure for this, we should just
> keep it simple. Not write ten lines of specialized inline asm.
>

Fair enough - I'm not particularly attached to this, and I'm generally
averse to premature optimization. I was just addressing the concerns
raised by Ingo.

I think the first two patches are important, though, as they are about
robustness/consistency rather than optimization.
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Linus Torvalds 7 months, 1 week ago
On Tue, 6 May 2025 at 09:35, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> I think the first two patches are important, though, as they are about
> robustness/consistency rather than optimization.

That first patch already has another copy of that insane inline asm
optimization.

Let's fix this properly, not this disgusting way.

Let's get rid of that USE_EARLY_PGTABLE_L5 crazy case entirely, and
fix the few places where it is currently used.

That code is bogus *anyway*, because your argument for these patches
is "one single truth", but the fact is, there's at least *SIX*
different values that get set depending on this value: pgdir_shift,
ptrs_per_p4d, vmalloc_base, etc. All of those change depending on
whether we do 5-level page tables or not, so the whole argument that
"pgtable_l5_enabled() is special" is just wrong to begin with.

Any code that makes pgtable_l5_enabled() will fundamentally then just
have *another* inconsistency, namely the inconsistency between that
"is_enabled" and all the other values that L5 paging actually
modifies.

So I don't think your patches even fix anything. They only paper over
one very particular issue.

For example, as far as I can tell, the only real reason for it in
arch/x86/kernel/cpu/common.c is *one* single use where we do that

        if (!pgtable_l5_enabled())
                setup_clear_cpu_cap(X86_FEATURE_LA57);

in early_identify_cpu().

And then we have __early_make_pgtable(), but that's the SAME FILE that
has all the magical special __pgtable_l5_enabled logic anyway. So that
damn well could just write out the actual real logic, instead of using
that "is L5 enabled" helper function THAT IT IS ITSELF INITIALIZING.

So I reall ythink this whole issue goes much deeper, and is much more
broken than your patches imply. And I think your patches in many ways
make it *worse*, because they may make that pgtable_l5_enabled() be
set up early, but that just hides all the *other* issues that aren't.,

As a very real example of that, just look at what happens in
arch/x86/mm/mem_encrypt_identity.c

It does all that page table setup, but if __pgtable_l5_enabled hassn't
been set up yet, then all those *othger* values that go with it also
haven't been set up yet, so now it uses the *wrong* value for
ptrs_per_p4d etc.

And I just checked: that code very much does use ptrs_per_p4d,
although it's hidden by

                memset(p4d, 0, sizeof(*p4d) * PTRS_PER_P4D);

so I really think this is all papering over things. That code CANNOT
WORK before __pgtable_l5_enabled hass been initialized in its
*current* location.

IOW, I think your patches only make things *less* consistent, not more.

               Linus
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Ard Biesheuvel 7 months, 1 week ago
On Tue, 6 May 2025 at 19:03, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 6 May 2025 at 09:35, Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > I think the first two patches are important, though, as they are about
> > robustness/consistency rather than optimization.
>
> That first patch already has another copy of that insane inline asm
> optimization.
>
> Let's fix this properly, not this disgusting way.
>
> Let's get rid of that USE_EARLY_PGTABLE_L5 crazy case entirely, and
> fix the few places where it is currently used.
>
> That code is bogus *anyway*, because your argument for these patches
> is "one single truth", but the fact is, there's at least *SIX*
> different values that get set depending on this value: pgdir_shift,
> ptrs_per_p4d, vmalloc_base, etc.

Yes, there are many. But that is not what I meant by source of truth.

Currently, we are conflating two things
- whether the CPU has the LA57 capability
- whether the CPU is using the LA57 capability to support 5-level paging.

The CPU feature gets set too late to be useful, and then needs to be
cleared again if we happen to run with 4 levels of paging.

> All of those change depending on
> whether we do 5-level page tables or not, so the whole argument that
> "pgtable_l5_enabled() is special" is just wrong to begin with.
>

In my original patch, which is the one Ingo objected to,
pgtable_l5_enabled() is unambiguously based on whether CR4.LA57 is
set.

> Any code that makes pgtable_l5_enabled() will fundamentally then just
> have *another* inconsistency, namely the inconsistency between that
> "is_enabled" and all the other values that L5 paging actually
> modifies.
>

Not sure I follow you here.

> So I don't think your patches even fix anything. They only paper over
> one very particular issue.
>
> For example, as far as I can tell, the only real reason for it in
> arch/x86/kernel/cpu/common.c is *one* single use where we do that
>
>         if (!pgtable_l5_enabled())
>                 setup_clear_cpu_cap(X86_FEATURE_LA57);
>
> in early_identify_cpu().
>

Yes, this is needed because we conflate the CPU capability with
whether or not we are running with 5 levels of paging.

> And then we have __early_make_pgtable(), but that's the SAME FILE that
> has all the magical special __pgtable_l5_enabled logic anyway. So that
> damn well could just write out the actual real logic, instead of using
> that "is L5 enabled" helper function THAT IT IS ITSELF INITIALIZING.
>
> So I reall ythink this whole issue goes much deeper, and is much more
> broken than your patches imply. And I think your patches in many ways
> make it *worse*, because they may make that pgtable_l5_enabled() be
> set up early, but that just hides all the *other* issues that aren't.,
>
> As a very real example of that, just look at what happens in
> arch/x86/mm/mem_encrypt_identity.c
>
> It does all that page table setup, but if __pgtable_l5_enabled hassn't
> been set up yet, then all those *othger* values that go with it also
> haven't been set up yet, so now it uses the *wrong* value for
> ptrs_per_p4d etc.
>

That code runs after __startup_64() so it should observe the correct
value. But it's flaky as hell, which is why I am trying to fix this.

I'd much rather get rid of those variables entirely, which is what I
proposed initially.

> And I just checked: that code very much does use ptrs_per_p4d,
> although it's hidden by
>
>                 memset(p4d, 0, sizeof(*p4d) * PTRS_PER_P4D);
>
> so I really think this is all papering over things. That code CANNOT
> WORK before __pgtable_l5_enabled hass been initialized in its
> *current* location.
>
> IOW, I think your patches only make things *less* consistent, not more.
>

In the light of the above, care to comment on the previous approach?

https://lore.kernel.org/all/20250504095230.2932860-28-ardb+git@google.com/

That also uses the ALTERNATIVE_TERNARY() so the CR4 access gets
patched away, and I'm happy to take suggestions how to improve that.
But it replaces those global variables with expressions that are
directly derived from CR4.LA57, which I think is likely the most
robust approach.

I'd still like to split off the LA57 CPU capability from the 5-level
paging state, but that needs careful consideration of the existing
users: for instance, KVM might need the capability in some cases, but
the current state in others. There is also some IOMMU driver code that
refers to the feature.
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Linus Torvalds 7 months, 1 week ago
On Tue, 6 May 2025 at 10:26, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> In the light of the above, care to comment on the previous approach?
>
> https://lore.kernel.org/all/20250504095230.2932860-28-ardb+git@google.com/

I have to say, I find that one much more palatable.

That said, I still think it would be better to get rid of the early
case *entirely*.

 So it would be lovely to have a subsequent patch that just makes the
"before fixup" case result in an UD2 instead of "read cr4 and check
the LA57 bit" and then fix the fallout.

I think that fallout could be handled by having it be an exception
that prints out a warning, and then jumps to the right target.
Anything else would be very painful (as in "oh, the machion doesn't
boot, because things go wrong during early boot").

But I think that first version of yours is simpler and more
straightforward than the later alternatives, and it does get rid of
that nasty USE_EARLY_PGTABLE_L5 thing.

I just think that in a perfect world we could then do more cleanups on
top of that.

              Linus
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Ard Biesheuvel 7 months, 1 week ago
On Tue, 6 May 2025 at 19:44, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 6 May 2025 at 10:26, Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > In the light of the above, care to comment on the previous approach?
> >
> > https://lore.kernel.org/all/20250504095230.2932860-28-ardb+git@google.com/
>
> I have to say, I find that one much more palatable.
>
> That said, I still think it would be better to get rid of the early
> case *entirely*.
>
>  So it would be lovely to have a subsequent patch that just makes the
> "before fixup" case result in an UD2 instead of "read cr4 and check
> the LA57 bit" and then fix the fallout.
>

Pretending the early case does not exist is not going to get us very far.

The very first thing we do when entering the core kernel is populate
the page tables, and this uses all the macros and #define's that are
based on pgtable_l5_enabled(). Alternatives patching occurs *much*
later.

cpu_feature_enabled() is currently built around that ternary logic,
and so we could use it this early as long as we set the bit in
boot_cpu_data. So that might be the best approach.
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Linus Torvalds 7 months, 1 week ago
On Tue, 6 May 2025 at 10:53, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> The very first thing we do when entering the core kernel is populate
> the page tables, and this uses all the macros and #define's that are
> based on pgtable_l5_enabled(). Alternatives patching occurs *much*
> later.

.. but that's my *point*.

If you depend on pgtable_l5_enabled() when you set up early paging,
you likely also depend on ptrs_per_p4d.

And if you depend on that, then you depend on having run check_la57_support().

And we should have some way to *verify* that, rather than say "you can
use pgtable_l5_enabled randomly early".

Maybe we could have a "early alternatives fixup" for this. I thnk
RISC-V does something like that, where it has separate early
alternatives fixup.

The early boot alternatives code fixup should be really easy - no need
to worry about IPIs and things like that.

Hmm?

              Linus
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Linus Torvalds 7 months, 1 week ago
On Tue, 6 May 2025 at 11:17, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The early boot alternatives code fixup should be really easy - no need
> to worry about IPIs and things like that.

Actually, we already run the alternatives fixup before smp, so I guess
we already do that.

But I think we could fairly easily add another section for early boot
alternatives and just use the existing apply_alternatives() for this,
and run it long before arch_cpu_finalize_init().

No?

              Linus
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Ard Biesheuvel 7 months, 1 week ago
On Tue, 6 May 2025 at 20:39, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 6 May 2025 at 11:17, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > The early boot alternatives code fixup should be really easy - no need
> > to worry about IPIs and things like that.
>
> Actually, we already run the alternatives fixup before smp, so I guess
> we already do that.
>
> But I think we could fairly easily add another section for early boot
> alternatives and just use the existing apply_alternatives() for this,
> and run it long before arch_cpu_finalize_init().
>
> No?
>

Not great.

First of all, that would mean calling into even more C code from the
1:1 mapping of memory, and this whole effort is about reducing and
confining that, so we don't get more random breakage in confidential
guests at boot time because some absolute address is being used before
it is mapped. (i.e., all the RIP_REL_REF() changes)

But patching alternatives is fundamentally an optimization, and so
things will look something like

void __startup_64()
{
    if (cr4.la57)
        set_capability(foo)

    apply_alternatives();

    // set up the page tables etc etc
}

and given that the code would work equally well without the
alternatives being applied, I'm not sure what we gain here, given that
we are still relying on the C code not touching the page tables before
this code executes.

This is why I included patch #2, so that at the very least, the
variable cannot be observed by C code before it assumes the correct
value.

So what I'd like to do instead is

- make cpu_feature_enabled() work early on
- replace all global variables that are derived from
pgtable_l5_enabled() with expressions based on it.

That still means we need to set the capability before populating the
page tables, but we might be able to do that in a not too nasty way
from the startup asm code.

Then, if the codegen for those LA57 related expressions sucks, perhaps
we could look into expanding the runtime const stuff to handle those
constants. That code looks like it is more suitable for execution from
the confined early startup code that runs from the 1:1 mapping, and
conceptually already does what we need.
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Linus Torvalds 7 months, 1 week ago
On Tue, 6 May 2025 at 12:15, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> First of all, that would mean calling into even more C code from the
> 1:1 mapping of memory,

Not a lot, actually.

When I did the runtime-const stuff, I refused to use the alternatives
code, because it was called much too late and was much too
complicated.

So I just did an early fixup by hand - in the place where the variable
was actually changed. Exactly so that you had *consistency*.

It's literally four lines of code that gets inlined.

See runtime_const_fixup().

And I just checked. Those four lines of code generate seven *instructions*:

  .LBB80_4:
        movq    $__start_runtime_ptr_dentry_hashtable, %rcx
  .LBB80_6:                               # =>This Inner Loop Header: Depth=1
        cmpq    $__stop_runtime_ptr_dentry_hashtable, %rcx
        jae     .LBB80_7
  # %bb.5:                                #   in Loop: Header=BB80_6 Depth=1
        movslq  (%rcx), %rdx
        movq    %rax, (%rcx,%rdx)
        addq    $4, %rcx
        jmp     .LBB80_6

that's literally what that

        runtime_const_init(ptr, dentry_hashtable);

generates in dcache_init_early() for me.

Seriously, this is *trivial*.

                Linus

             Linus
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Ard Biesheuvel 7 months, 1 week ago
On Tue, 6 May 2025 at 21:42, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 6 May 2025 at 12:15, Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > First of all, that would mean calling into even more C code from the
> > 1:1 mapping of memory,
>
> Not a lot, actually.
>
> When I did the runtime-const stuff, I refused to use the alternatives
> code, because it was called much too late and was much too
> complicated.
>
> So I just did an early fixup by hand - in the place where the variable
> was actually changed. Exactly so that you had *consistency*.
>
> It's literally four lines of code that gets inlined.
>
> See runtime_const_fixup().
>
> And I just checked. Those four lines of code generate seven *instructions*:
>
>   .LBB80_4:
>         movq    $__start_runtime_ptr_dentry_hashtable, %rcx
>   .LBB80_6:                               # =>This Inner Loop Header: Depth=1
>         cmpq    $__stop_runtime_ptr_dentry_hashtable, %rcx
>         jae     .LBB80_7
>   # %bb.5:                                #   in Loop: Header=BB80_6 Depth=1
>         movslq  (%rcx), %rdx
>         movq    %rax, (%rcx,%rdx)
>         addq    $4, %rcx
>         jmp     .LBB80_6
>
> that's literally what that
>
>         runtime_const_init(ptr, dentry_hashtable);
>
> generates in dcache_init_early() for me.
>
> Seriously, this is *trivial*.
>

This looks trivial, yes - I thought you were talking about the
alternatives patching code, which seems rather complex, and either
pulling it into the startup code or writing an 'early' version of it
seem like a lot of work for little gain.

So yes, let's use this from the startup code, once we figure out which
of these constants are too costly to be expressed as
'pgtable_l5_enabled() ? foo : bar'
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Ard Biesheuvel 7 months, 1 week ago
On Tue, 6 May 2025 at 21:50, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Tue, 6 May 2025 at 21:42, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Tue, 6 May 2025 at 12:15, Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > First of all, that would mean calling into even more C code from the
> > > 1:1 mapping of memory,
> >
> > Not a lot, actually.
> >
> > When I did the runtime-const stuff, I refused to use the alternatives
> > code, because it was called much too late and was much too
> > complicated.
> >
> > So I just did an early fixup by hand - in the place where the variable
> > was actually changed. Exactly so that you had *consistency*.
> >
> > It's literally four lines of code that gets inlined.
> >
> > See runtime_const_fixup().
> >
> > And I just checked. Those four lines of code generate seven *instructions*:
> >
> >   .LBB80_4:
> >         movq    $__start_runtime_ptr_dentry_hashtable, %rcx
> >   .LBB80_6:                               # =>This Inner Loop Header: Depth=1
> >         cmpq    $__stop_runtime_ptr_dentry_hashtable, %rcx
> >         jae     .LBB80_7
> >   # %bb.5:                                #   in Loop: Header=BB80_6 Depth=1
> >         movslq  (%rcx), %rdx
> >         movq    %rax, (%rcx,%rdx)
> >         addq    $4, %rcx
> >         jmp     .LBB80_6
> >
> > that's literally what that
> >
> >         runtime_const_init(ptr, dentry_hashtable);
> >
> > generates in dcache_init_early() for me.
> >
> > Seriously, this is *trivial*.
> >
>
> This looks trivial, yes - I thought you were talking about the
> alternatives patching code, which seems rather complex, and either
> pulling it into the startup code or writing an 'early' version of it
> seem like a lot of work for little gain.
>
> So yes, let's use this from the startup code, once we figure out which
> of these constants are too costly to be expressed as
> 'pgtable_l5_enabled() ? foo : bar'

OK, after digging a bit deeper, it seems that runtime constants are
not going to be usable here.

First of all, the constants

page_offset_base
vmalloc_base
vmemmap_base

have their address taken in the KASLR memory randomization code, which
subsequently overwrites them with different values.

This means that runtime constants are not usable here. It also means
that there is no need whatsoever to assign these variables extremely
early, given that they may assume different values during the
execution of setup_arch().

That leaves

pgdir_shift
ptrs_per_p4d

which are both 'int', and while looking into adding runtime const
support for the 'int' type, I noticed that it doesn't appear to
support loadable modules? This is a problem because these variables
are exported.

So, in order to get rid of USE_EARLY_PGTABLE_L5, which is how this
discussion started to begin with, I suggest we go with an approach
that just redefines references to these two variables in terms of
pgtable_l5_enabled()?l5_val:l4_val, as I proposed originally. Then, I
can try to make cpu_feature_enabled() usable from early boot onwards
so we can drop the early variant of pgtable_l5_enabled().

If any of these variables turn out to be a bottleneck in practice, we
can revisit.
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Ingo Molnar 7 months, 1 week ago
* Ard Biesheuvel <ardb@kernel.org> wrote:

> > All of those change depending on
> > whether we do 5-level page tables or not, so the whole argument that
> > "pgtable_l5_enabled() is special" is just wrong to begin with.
> >
> 
> In my original patch, which is the one Ingo objected to, 
> pgtable_l5_enabled() is unambiguously based on whether CR4.LA57 is 
> set.

So I didn't really object to the simplification aspect - I was 
criticizing the current state of LA57 handling, regardless of your 
patch. In fact in that thread I supported the simplification aspect:

  > > Anyway, I'm not against Ard's simplification patch as a first step, and
  > > any optimizations can be layered on top of that.

But in hindsight I can see how my first reply came away as 
disagreement...

> In the light of the above, care to comment on the previous approach?
> 
> https://lore.kernel.org/all/20250504095230.2932860-28-ardb+git@google.com/
> 
> That also uses the ALTERNATIVE_TERNARY() so the CR4 access gets
> patched away, and I'm happy to take suggestions how to improve that.

I still think we should introduce a LA57_ENABLED synthethic cpufeature 
flag or so for the MM constants and all the late facilities, and go 
from there.

Thanks,

	Ingo
Re: [RFC PATCH 3/3] x86/boot: Use alternatives based selector for 5-level paging constants
Posted by Ard Biesheuvel 7 months, 1 week ago
On Tue, 6 May 2025 at 19:39, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Ard Biesheuvel <ardb@kernel.org> wrote:
>
> > > All of those change depending on
> > > whether we do 5-level page tables or not, so the whole argument that
> > > "pgtable_l5_enabled() is special" is just wrong to begin with.
> > >
> >
> > In my original patch, which is the one Ingo objected to,
> > pgtable_l5_enabled() is unambiguously based on whether CR4.LA57 is
> > set.
>
> So I didn't really object to the simplification aspect - I was
> criticizing the current state of LA57 handling, regardless of your
> patch. In fact in that thread I supported the simplification aspect:
>
>   > > Anyway, I'm not against Ard's simplification patch as a first step, and
>   > > any optimizations can be layered on top of that.
>
> But in hindsight I can see how my first reply came away as
> disagreement...
>

OK, so at least we all agree that there is plenty of room for
improvement here :-)

> > In the light of the above, care to comment on the previous approach?
> >
> > https://lore.kernel.org/all/20250504095230.2932860-28-ardb+git@google.com/
> >
> > That also uses the ALTERNATIVE_TERNARY() so the CR4 access gets
> > patched away, and I'm happy to take suggestions how to improve that.
>
> I still think we should introduce a LA57_ENABLED synthethic cpufeature
> flag or so for the MM constants and all the late facilities, and go
> from there.
>

It would be nice if we could set this CPU capability early on. That
way, we could just use cpu_feature_enabled(LA57_ENABLED) throughout.

I.e.,

--- a/arch/x86/boot/startup/map_kernel.c
+++ b/arch/x86/boot/startup/map_kernel.c
@@ -26,7 +26,8 @@ static inline bool check_la57_support(void)
        if (!(native_read_cr4() & X86_CR4_LA57))
                return false;

-       __pgtable_l5_enabled    = 1;
+       set_cpu_cap(&boot_cpu_data, X86_FEATURE_LA57_ENABLED);
+
        pgdir_shift             = 48;
        ptrs_per_p4d            = 512;
        page_offset_base        = __PAGE_OFFSET_BASE_L5;

but this requires some tweaking of the CPU feature detection code.
(Complete patch here [0])

The use of a separate feature bit is kind of orthogonal, though - it
would be a nice cleanup, but I don't see it as a prerequisite for this
change.


[0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=la57-early-cap