If CONFIG_KPKEYS_HARDENED_PGTABLES is enabled, map allocated page
table pages using a privileged pkey (KPKEYS_PKEY_PGTABLES), so that
page tables can only be written under guard(kpkeys_hardened_pgtables).
This patch is a no-op if CONFIG_KPKEYS_HARDENED_PGTABLES is disabled
(default).
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
include/linux/mm.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d9371d992033..4880cb7a4cb9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -34,6 +34,7 @@
 #include <linux/slab.h>
 #include <linux/cacheinfo.h>
 #include <linux/rcuwait.h>
+#include <linux/kpkeys.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -2979,6 +2980,8 @@ static inline bool __pagetable_ctor(struct ptdesc *ptdesc)
 
 	__folio_set_pgtable(folio);
 	lruvec_stat_add_folio(folio, NR_PAGETABLE);
+	if (kpkeys_protect_pgtable_memory(folio))
+		return false;
 	return true;
 }
 
@@ -2989,6 +2992,7 @@ static inline void pagetable_dtor(struct ptdesc *ptdesc)
 	ptlock_free(ptdesc);
 	__folio_clear_pgtable(folio);
 	lruvec_stat_sub_folio(folio, NR_PAGETABLE);
+	kpkeys_unprotect_pgtable_memory(folio);
 }
 
 static inline void pagetable_dtor_free(struct ptdesc *ptdesc)
--
2.47.0
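
For readers who haven't followed the series: the commit message's
"written under guard(kpkeys_hardened_pgtables)" refers to a scoped
guard in the cleanup.h style. A rough sketch of the intended pattern
(illustrative only; set_pte_guarded() is a hypothetical stand-in for
the real set_pte()-style helpers):

/*
 * Minimal sketch, not part of the patch. With the PTP's linear mapping
 * assigned KPKEYS_PKEY_PGTABLES, plain stores to it fault. The guard
 * switches the pkey register to a privileged level on entry, allowing
 * writes to that pkey, and restores it when the scope ends, so only
 * code inside such a scope can modify page tables.
 */
static inline void set_pte_guarded(pte_t *ptep, pte_t pte)
{
	guard(kpkeys_hardened_pgtables)();
	WRITE_ONCE(*ptep, pte);
}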
On 15.08.25 10:55, Kevin Brodsky wrote:

Just wondering, should the patch subject be:

"mm: protect page tables with privileged pkey" ?

At least patch #2 tells me that set_memory_pkey() will set the
protection key, and the function is called
"kpkeys_protect_pgtable_memory"?

Just trying to connect the dots here :)

> If CONFIG_KPKEYS_HARDENED_PGTABLES is enabled, map allocated page
> table pages using a privileged pkey (KPKEYS_PKEY_PGTABLES), so that
> page tables can only be written under guard(kpkeys_hardened_pgtables).
>
> This patch is a no-op if CONFIG_KPKEYS_HARDENED_PGTABLES is disabled
> (default).
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
>  include/linux/mm.h | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d9371d992033..4880cb7a4cb9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -34,6 +34,7 @@
>  #include <linux/slab.h>
>  #include <linux/cacheinfo.h>
>  #include <linux/rcuwait.h>
> +#include <linux/kpkeys.h>
>
>  struct mempolicy;
>  struct anon_vma;
> @@ -2979,6 +2980,8 @@ static inline bool __pagetable_ctor(struct ptdesc *ptdesc)
>
>  	__folio_set_pgtable(folio);
>  	lruvec_stat_add_folio(folio, NR_PAGETABLE);
> +	if (kpkeys_protect_pgtable_memory(folio))
> +		return false;
>  	return true;
>  }
>
> @@ -2989,6 +2992,7 @@ static inline void pagetable_dtor(struct ptdesc *ptdesc)
>  	ptlock_free(ptdesc);
>  	__folio_clear_pgtable(folio);
>  	lruvec_stat_sub_folio(folio, NR_PAGETABLE);
> +	kpkeys_unprotect_pgtable_memory(folio);

This is all rather nasty. Not your fault.

In the near future page tables will not be folios, and the whole
ptdesc_folio() conversion will not make any sense.

Likely you should make kpkeys_protect_pgtable_memory() etc. consume an
address range, or a page range right from the start.

--
Cheers

David / dhildenb
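
To make the range-based suggestion concrete, the interface might look
roughly like this (a sketch only; the signatures below are hypothetical
and not taken from the series):

/*
 * Hypothetical range-based variants: operate on a linear-map address
 * range rather than a folio, so nothing here depends on page tables
 * being folios. Returns 0 on success.
 */
int kpkeys_protect_pgtable_memory(unsigned long addr, unsigned int nr_pages);
void kpkeys_unprotect_pgtable_memory(unsigned long addr, unsigned int nr_pages);

/* A caller such as __pagetable_ctor() would then do something like: */
if (kpkeys_protect_pgtable_memory((unsigned long)ptdesc_address(ptdesc), 1))
	return false;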
On 01/10/2025 17:28, David Hildenbrand wrote:
> On 15.08.25 10:55, Kevin Brodsky wrote:
>
> Just wondering, should the patch subject be:
>
> "mm: protect page tables with privileged pkey" ?
>
> At least patch #2 tells me that set_memory_pkey() will set the
> protection key, and the function is called
> "kpkeys_protect_pgtable_memory"?
>
> Just trying to connect the dots here :)

That's fair! I suppose I meant "map with privileged pkey" in the sense
of "modify the mapping/PTE to set the pkey". But I see how that can be
confusing, as we're not creating any mapping. I'll reword that in v6.

>
>> If CONFIG_KPKEYS_HARDENED_PGTABLES is enabled, map allocated page
>> table pages using a privileged pkey (KPKEYS_PKEY_PGTABLES), so that
>> page tables can only be written under guard(kpkeys_hardened_pgtables).
>>
>> This patch is a no-op if CONFIG_KPKEYS_HARDENED_PGTABLES is disabled
>> (default).
>>
>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>> ---
>>  include/linux/mm.h | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index d9371d992033..4880cb7a4cb9 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -34,6 +34,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/cacheinfo.h>
>>  #include <linux/rcuwait.h>
>> +#include <linux/kpkeys.h>
>>
>>  struct mempolicy;
>>  struct anon_vma;
>> @@ -2979,6 +2980,8 @@ static inline bool __pagetable_ctor(struct
>> ptdesc *ptdesc)
>>
>>  	__folio_set_pgtable(folio);
>>  	lruvec_stat_add_folio(folio, NR_PAGETABLE);
>> +	if (kpkeys_protect_pgtable_memory(folio))
>> +		return false;
>>  	return true;
>>  }
>>
>> @@ -2989,6 +2992,7 @@ static inline void pagetable_dtor(struct
>> ptdesc *ptdesc)
>>  	ptlock_free(ptdesc);
>>  	__folio_clear_pgtable(folio);
>>  	lruvec_stat_sub_folio(folio, NR_PAGETABLE);
>> +	kpkeys_unprotect_pgtable_memory(folio);
>
> This is all rather nasty. Not your fault.
>
> In the near future page tables will not be folios, and the whole
> ptdesc_folio() conversion will not make any sense.

Ah, interesting. Any patches/series I should be aware of?

>
> Likely you should make kpkeys_protect_pgtable_memory() etc. consume an
> address range, or a page range right from the start.

Got it. That said, as per the discussion on the cover letter [1], it's
quite likely that we'll have to change the approach completely anyway
(i.e. allocate PTPs from a dedicated pool where the pkey is already set).

- Kevin

[1] lore.kernel.org/r/aMwd7IJVECEy8mzf@willie-the-truck
On Fri, 2025-08-15 at 09:55 +0100, Kevin Brodsky wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d9371d992033..4880cb7a4cb9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -34,6 +34,7 @@
>  #include <linux/slab.h>
>  #include <linux/cacheinfo.h>
>  #include <linux/rcuwait.h>
> +#include <linux/kpkeys.h>
>
>  struct mempolicy;
>  struct anon_vma;
> @@ -2979,6 +2980,8 @@ static inline bool __pagetable_ctor(struct ptdesc *ptdesc)
>
>  	__folio_set_pgtable(folio);
>  	lruvec_stat_add_folio(folio, NR_PAGETABLE);
> +	if (kpkeys_protect_pgtable_memory(folio))
> +		return false;
>  	return true;
>  }

It seems like this does a kernel range shootdown for every page table
that gets allocated? If so it throws a pretty big wrench into the
carefully managed TLB flush minimization logic in the kernel.

Obviously this is much more straightforward than the x86 series' page
table conversion batching stuff, but TBH I was worried that even that
was going to have a performance hit. I think how to efficiently do
direct map permissions is the key technical problem to solve for pkeys
security usages. They can switch on and off fast, but applying the key
is just as much of a hit as any other kernel memory permission change.
(I assume this works similarly to x86's?)
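
The cost Rick describes can be made concrete with a sketch (conceptual
only, not the series' actual arm64 implementation; the helpers marked
hypothetical do not exist under these names):

/*
 * Conceptual sketch of why set_memory_pkey() on the linear map implies
 * a TLB shootdown: each call rewrites the PTEs covering the range and
 * must then invalidate the stale TLB entries. With this patch, that
 * happens on every page-table allocation and free.
 */
static int set_memory_pkey_sketch(unsigned long addr, int numpages, int pkey)
{
	unsigned long start = addr;
	int i;

	for (i = 0; i < numpages; i++, addr += PAGE_SIZE) {
		pte_t *ptep = linear_map_pte(addr);	/* hypothetical */

		/* pte_set_pkey() is hypothetical: set the pkey PTE bits */
		__set_pte(ptep, pte_set_pkey(__ptep_get(ptep), pkey));
	}
	flush_tlb_kernel_range(start, addr);	/* the expensive part */
	return 0;
}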
On 15/08/2025 18:37, Edgecombe, Rick P wrote:
> On Fri, 2025-08-15 at 09:55 +0100, Kevin Brodsky wrote:
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index d9371d992033..4880cb7a4cb9 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -34,6 +34,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/cacheinfo.h>
>>  #include <linux/rcuwait.h>
>> +#include <linux/kpkeys.h>
>>
>>  struct mempolicy;
>>  struct anon_vma;
>> @@ -2979,6 +2980,8 @@ static inline bool __pagetable_ctor(struct ptdesc *ptdesc)
>>
>>  	__folio_set_pgtable(folio);
>>  	lruvec_stat_add_folio(folio, NR_PAGETABLE);
>> +	if (kpkeys_protect_pgtable_memory(folio))
>> +		return false;
>>  	return true;
>>  }
> It seems like this does a kernel range shootdown for every page table that gets
> allocated? If so it throws a pretty big wrench into the carefully managed TLB
> flush minimization logic in the kernel.
>
> Obviously this is much more straightforward than the x86 series' page table
> conversion batching stuff, but TBH I was worried that even that was going to
> have a performance hit. I think how to efficiently do direct map permissions is
> the key technical problem to solve for pkeys security usages. They can switch
> on and off fast, but applying the key is just as much of a hit as any other
> kernel memory permission change. (I assume this works similarly to x86's?)

The benchmarking results (see cover letter) don't seem to point to a
major performance hit from setting the pkey on arm64 (worth noting that
the linear mapping is PTE-mapped on arm64 today, so no splitting should
occur when setting the pkey). The overhead may well be substantially
higher on x86.

I agree this is worth looking into, though. I will check the overhead
added by set_memory_pkey() specifically (ignoring pkey register
switches), and maybe try to allocate page tables with a dedicated
kmem_cache instead, reusing this patch [1] from my other kpkeys series.
A kmem_cache won't be as optimal as a dedicated allocator, but batching
the page freeing may already improve things substantially.

- Kevin

[1] https://lore.kernel.org/linux-hardening/20250815090000.2182450-4-kevin.brodsky@arm.com/
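
The amortization a dedicated kmem_cache buys could work along these
lines (a sketch under stated assumptions: the ctor-based hook and the
set_memory_pkey(addr, numpages, pkey) signature are assumptions, and
the referenced patch [1] may well do this differently):

/*
 * Sketch: slab invokes the ctor once per object when it populates a
 * fresh slab page, not on every kmem_cache_alloc(), so the
 * set_memory_pkey() cost is paid once and then amortized across all
 * alloc/free cycles served from that slab. Note that stock slab has no
 * symmetric hook for pages returning to the page allocator; handling
 * that is part of what a real implementation would need to add.
 */
static struct kmem_cache *ptp_cache;

static void ptp_ctor(void *object)
{
	set_memory_pkey((unsigned long)object, 1, KPKEYS_PKEY_PGTABLES);
}

static int __init ptp_cache_init(void)
{
	ptp_cache = kmem_cache_create("ptp_cache", PAGE_SIZE, PAGE_SIZE,
				      SLAB_PANIC, ptp_ctor);
	return 0;
}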
On Mon, 2025-08-18 at 18:02 +0200, Kevin Brodsky wrote:
> The benchmarking results (see cover letter) don't seem to point to a
> major performance hit from setting the pkey on arm64 (worth noting that
> the linear mapping is PTE-mapped on arm64 today, so no splitting should
> occur when setting the pkey). The overhead may well be substantially
> higher on x86.

It's surprising to me. The batching seems to be about switching the
pkey, not the conversion of the direct map. And with batching you
measured a fork benchmark actually sped up a tiny bit. Shouldn't it
involve a pile of page table allocations and so extra direct map work?

I don't know if it's possible the mock implementation skipped some
set_memory() work somehow?

> I agree this is worth looking into, though. I will check the overhead
> added by set_memory_pkey() specifically (ignoring pkey register
> switches), and maybe try to allocate page tables with a dedicated
> kmem_cache instead, reusing this patch [1] from my other kpkeys series.
> A kmem_cache won't be as optimal as a dedicated allocator, but batching
> the page freeing may already improve things substantially.

I actually never got to the benchmark-on-real-HW stage either, but I'd
be surprised if this approach would have acceptable performance for
x86. There are so many optimizations around minimizing TLB flushes in
Linux. Dunno. Maybe my arm knowledge is too lacking.
On 18/08/2025 19:01, Edgecombe, Rick P wrote:
> On Mon, 2025-08-18 at 18:02 +0200, Kevin Brodsky wrote:
>> The benchmarking results (see cover letter) don't seem to point to a
>> major performance hit from setting the pkey on arm64 (worth noting that
>> the linear mapping is PTE-mapped on arm64 today, so no splitting should
>> occur when setting the pkey). The overhead may well be substantially
>> higher on x86.
> It's surprising to me. The batching seems to be about switching the
> pkey, not the conversion of the direct map.

Correct, there is still a set_memory_pkey() for each PTP.

> And with batching you measured a fork benchmark actually sped up a
> tiny bit. Shouldn't it involve a pile of page table allocations and
> so extra direct map work?

It should indeed...

> I don't know if it's possible the mock implementation skipped some
> set_memory() work somehow?

In fact you're absolutely right: in the mock implementation I
benchmarked, set_memory_pkey() is in fact a no-op :( This is because
patch 6 gates set_memory_pkey() on system_supports_poe(), but the mock
implementation [1] only modifies arch_kpkeys_enabled(). In other words,
the numbers in the cover letter correspond to the added pkey register
switches, without touching the page tables.

I am now re-running the benchmarks with set_memory_pkey() actually
modifying the page tables. I'll reply to the cover letter with the
updated numbers.

- Kevin

[1] https://gitlab.arm.com/linux-arm/linux-kb/-/commit/fd75b43abb354e84d06f3dfb05ce839e9fb13e08
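
Spelling out the gating issue (a reconstructed sketch, not the actual
patch; __set_memory_pkey() is a hypothetical internal helper):

/*
 * Patch 6 gates set_memory_pkey() on POE support. The mock setup only
 * overrides arch_kpkeys_enabled(), so system_supports_poe() stays
 * false, the early return is always taken, and the benchmarks never
 * touched the linear map.
 */
int set_memory_pkey(unsigned long addr, int numpages, int pkey)
{
	if (!system_supports_poe())
		return 0;	/* always taken on the mock setup */

	return __set_memory_pkey(addr, numpages, pkey);
}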