If CONFIG_KPKEYS_HARDENED_PGTABLES is enabled, map allocated page
table pages using a privileged pkey (KPKEYS_PKEY_PGTABLES), so that
page tables can only be written under guard(kpkeys_hardened_pgtables).
This patch is a no-op if CONFIG_KPKEYS_HARDENED_PGTABLES is disabled
(default).
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
include/linux/mm.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d9371d992033..4880cb7a4cb9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -34,6 +34,7 @@
 #include <linux/slab.h>
 #include <linux/cacheinfo.h>
 #include <linux/rcuwait.h>
+#include <linux/kpkeys.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -2979,6 +2980,8 @@ static inline bool __pagetable_ctor(struct ptdesc *ptdesc)
 
 	__folio_set_pgtable(folio);
 	lruvec_stat_add_folio(folio, NR_PAGETABLE);
+	if (kpkeys_protect_pgtable_memory(folio))
+		return false;
 	return true;
 }
 
@@ -2989,6 +2992,7 @@ static inline void pagetable_dtor(struct ptdesc *ptdesc)
 	ptlock_free(ptdesc);
 	__folio_clear_pgtable(folio);
 	lruvec_stat_sub_folio(folio, NR_PAGETABLE);
+	kpkeys_unprotect_pgtable_memory(folio);
 }
 
 static inline void pagetable_dtor_free(struct ptdesc *ptdesc)
--
2.47.0
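
For readers who haven't followed the series: the commit message's
"written under guard(kpkeys_hardened_pgtables)" refers to a scoped
guard in the cleanup.h style. A rough sketch of the intended pattern
(illustrative only; set_pte_guarded() is a hypothetical stand-in for
the real set_pte()-style helpers):

/*
 * Minimal sketch, not part of the patch. With the PTP's linear mapping
 * assigned KPKEYS_PKEY_PGTABLES, plain stores to it fault. The guard
 * switches the pkey register to a privileged level on entry, allowing
 * writes to that pkey, and restores it when the scope ends, so only
 * code inside such a scope can modify page tables.
 */
static inline void set_pte_guarded(pte_t *ptep, pte_t pte)
{
	guard(kpkeys_hardened_pgtables)();
	WRITE_ONCE(*ptep, pte);
}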
On 15.08.25 10:55, Kevin Brodsky wrote:

Just wondering, should the patch subject be:

"mm: protect page tables with privileged pkey" ?

At least patch #2 tells me that set_memory_pkey() will set the
protection key, and the function is called
"kpkeys_protect_pgtable_memory"?

Just trying to connect the dots here :)

> If CONFIG_KPKEYS_HARDENED_PGTABLES is enabled, map allocated page
> table pages using a privileged pkey (KPKEYS_PKEY_PGTABLES), so that
> page tables can only be written under guard(kpkeys_hardened_pgtables).
>
> This patch is a no-op if CONFIG_KPKEYS_HARDENED_PGTABLES is disabled
> (default).
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
>  include/linux/mm.h | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d9371d992033..4880cb7a4cb9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -34,6 +34,7 @@
>  #include <linux/slab.h>
>  #include <linux/cacheinfo.h>
>  #include <linux/rcuwait.h>
> +#include <linux/kpkeys.h>
>
>  struct mempolicy;
>  struct anon_vma;
> @@ -2979,6 +2980,8 @@ static inline bool __pagetable_ctor(struct ptdesc *ptdesc)
>
>  	__folio_set_pgtable(folio);
>  	lruvec_stat_add_folio(folio, NR_PAGETABLE);
> +	if (kpkeys_protect_pgtable_memory(folio))
> +		return false;
>  	return true;
>  }
>
> @@ -2989,6 +2992,7 @@ static inline void pagetable_dtor(struct ptdesc *ptdesc)
>  	ptlock_free(ptdesc);
>  	__folio_clear_pgtable(folio);
>  	lruvec_stat_sub_folio(folio, NR_PAGETABLE);
> +	kpkeys_unprotect_pgtable_memory(folio);

This is all rather nasty. Not your fault.

In the near future page tables will not be folios, and the whole
ptdesc_folio() conversion will not make any sense.

Likely you should make kpkeys_protect_pgtable_memory() etc. consume an
address range, or a page range right from the start.

--
Cheers

David / dhildenb
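
To make the range-based suggestion concrete, the interface might look
roughly like this (a sketch only; the signatures below are hypothetical
and not taken from the series):

/*
 * Hypothetical range-based variants: operate on a linear-map address
 * range rather than a folio, so nothing here depends on page tables
 * being folios. Returns 0 on success.
 */
int kpkeys_protect_pgtable_memory(unsigned long addr, unsigned int nr_pages);
void kpkeys_unprotect_pgtable_memory(unsigned long addr, unsigned int nr_pages);

/* A caller such as __pagetable_ctor() would then do something like: */
if (kpkeys_protect_pgtable_memory((unsigned long)ptdesc_address(ptdesc), 1))
	return false;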
On 01/10/2025 17:28, David Hildenbrand wrote:
> On 15.08.25 10:55, Kevin Brodsky wrote:
>
> Just wondering, should the patch subject be:
>
> "mm: protect page tables with privileged pkey" ?
>
> At least patch #2 tells me that set_memory_pkey() will set the
> protection key, and the function is called
> "kpkeys_protect_pgtable_memory"?
>
> Just trying to connect the dots here :)

That's fair! I suppose I meant "map with privileged pkey" in the sense
of "modify the mapping/PTE to set the pkey". But I see how that can be
confusing, as we're not creating any mapping. I'll reword that in v6.

>
>> If CONFIG_KPKEYS_HARDENED_PGTABLES is enabled, map allocated page
>> table pages using a privileged pkey (KPKEYS_PKEY_PGTABLES), so that
>> page tables can only be written under guard(kpkeys_hardened_pgtables).
>>
>> This patch is a no-op if CONFIG_KPKEYS_HARDENED_PGTABLES is disabled
>> (default).
>>
>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>> ---
>>  include/linux/mm.h | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index d9371d992033..4880cb7a4cb9 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -34,6 +34,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/cacheinfo.h>
>>  #include <linux/rcuwait.h>
>> +#include <linux/kpkeys.h>
>>
>>  struct mempolicy;
>>  struct anon_vma;
>> @@ -2979,6 +2980,8 @@ static inline bool __pagetable_ctor(struct
>> ptdesc *ptdesc)
>>
>>  	__folio_set_pgtable(folio);
>>  	lruvec_stat_add_folio(folio, NR_PAGETABLE);
>> +	if (kpkeys_protect_pgtable_memory(folio))
>> +		return false;
>>  	return true;
>>  }
>>
>> @@ -2989,6 +2992,7 @@ static inline void pagetable_dtor(struct
>> ptdesc *ptdesc)
>>  	ptlock_free(ptdesc);
>>  	__folio_clear_pgtable(folio);
>>  	lruvec_stat_sub_folio(folio, NR_PAGETABLE);
>> +	kpkeys_unprotect_pgtable_memory(folio);
>
> This is all rather nasty. Not your fault.
>
> In the near future page tables will not be folios, and the whole
> ptdesc_folio() conversion will not make any sense.

Ah, interesting. Any patches/series I should be aware of?

>
> Likely you should make kpkeys_protect_pgtable_memory() etc. consume an
> address range, or a page range right from the start.

Got it. That said, as per the discussion on the cover letter [1], it's
quite likely that we'll have to change the approach completely anyway
(i.e. allocate PTPs from a dedicated pool where the pkey is already set).

- Kevin

[1] lore.kernel.org/r/aMwd7IJVECEy8mzf@willie-the-truck
On Fri, 2025-08-15 at 09:55 +0100, Kevin Brodsky wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d9371d992033..4880cb7a4cb9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -34,6 +34,7 @@
>  #include <linux/slab.h>
>  #include <linux/cacheinfo.h>
>  #include <linux/rcuwait.h>
> +#include <linux/kpkeys.h>
>
>  struct mempolicy;
>  struct anon_vma;
> @@ -2979,6 +2980,8 @@ static inline bool __pagetable_ctor(struct ptdesc *ptdesc)
>
>  	__folio_set_pgtable(folio);
>  	lruvec_stat_add_folio(folio, NR_PAGETABLE);
> +	if (kpkeys_protect_pgtable_memory(folio))
> +		return false;
>  	return true;
>  }

It seems like this does a kernel range shootdown for every page table
that gets allocated? If so it throws a pretty big wrench into the
carefully managed TLB flush minimization logic in the kernel.

Obviously this is much more straightforward than the x86 series' page
table conversion batching stuff, but TBH I was worried that even that
was going to have a performance hit. I think how to efficiently do
direct map permissions is the key technical problem to solve for pkeys
security usages. They can switch on and off fast, but applying the key
is just as much of a hit as any other kernel memory permission change.
(I assume this works similarly to x86's?)
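
The cost Rick describes can be made concrete with a sketch (conceptual
only, not the series' actual arm64 implementation; the helpers marked
hypothetical do not exist under these names):

/*
 * Conceptual sketch of why set_memory_pkey() on the linear map implies
 * a TLB shootdown: each call rewrites the PTEs covering the range and
 * must then invalidate the stale TLB entries. With this patch, that
 * happens on every page-table allocation and free.
 */
static int set_memory_pkey_sketch(unsigned long addr, int numpages, int pkey)
{
	unsigned long start = addr;
	int i;

	for (i = 0; i < numpages; i++, addr += PAGE_SIZE) {
		pte_t *ptep = linear_map_pte(addr);	/* hypothetical */

		/* pte_set_pkey() is hypothetical: set the pkey PTE bits */
		__set_pte(ptep, pte_set_pkey(__ptep_get(ptep), pkey));
	}
	flush_tlb_kernel_range(start, addr);	/* the expensive part */
	return 0;
}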
On 15/08/2025 18:37, Edgecombe, Rick P wrote:
> On Fri, 2025-08-15 at 09:55 +0100, Kevin Brodsky wrote:
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index d9371d992033..4880cb7a4cb9 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -34,6 +34,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/cacheinfo.h>
>>  #include <linux/rcuwait.h>
>> +#include <linux/kpkeys.h>
>>
>>  struct mempolicy;
>>  struct anon_vma;
>> @@ -2979,6 +2980,8 @@ static inline bool __pagetable_ctor(struct ptdesc *ptdesc)
>>
>>  	__folio_set_pgtable(folio);
>>  	lruvec_stat_add_folio(folio, NR_PAGETABLE);
>> +	if (kpkeys_protect_pgtable_memory(folio))
>> +		return false;
>>  	return true;
>>  }
> It seems like this does a kernel range shootdown for every page table that gets
> allocated? If so it throws a pretty big wrench into the carefully managed TLB
> flush minimization logic in the kernel.
>
> Obviously this is much more straightforward than the x86 series' page table
> conversion batching stuff, but TBH I was worried that even that was going to
> have a performance hit. I think how to efficiently do direct map permissions is
> the key technical problem to solve for pkeys security usages. They can switch
> on and off fast, but applying the key is just as much of a hit as any other
> kernel memory permission change. (I assume this works similarly to x86's?)

The benchmarking results (see cover letter) don't seem to point to a
major performance hit from setting the pkey on arm64 (worth noting that
the linear mapping is PTE-mapped on arm64 today, so no splitting should
occur when setting the pkey). The overhead may well be substantially
higher on x86.

I agree this is worth looking into, though. I will check the overhead
added by set_memory_pkey() specifically (ignoring pkey register
switches), and maybe try to allocate page tables with a dedicated
kmem_cache instead, reusing this patch [1] from my other kpkeys series.
A kmem_cache won't be as optimal as a dedicated allocator, but batching
the page freeing may already improve things substantially.

- Kevin

[1] https://lore.kernel.org/linux-hardening/20250815090000.2182450-4-kevin.brodsky@arm.com/
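
The amortization a dedicated kmem_cache buys could work along these
lines (a sketch under stated assumptions: the ctor-based hook and the
set_memory_pkey(addr, numpages, pkey) signature are assumptions, and
the referenced patch [1] may well do this differently):

/*
 * Sketch: slab invokes the ctor once per object when it populates a
 * fresh slab page, not on every kmem_cache_alloc(), so the
 * set_memory_pkey() cost is paid once and then amortized across all
 * alloc/free cycles served from that slab. Note that stock slab has no
 * symmetric hook for pages returning to the page allocator; handling
 * that is part of what a real implementation would need to add.
 */
static struct kmem_cache *ptp_cache;

static void ptp_ctor(void *object)
{
	set_memory_pkey((unsigned long)object, 1, KPKEYS_PKEY_PGTABLES);
}

static int __init ptp_cache_init(void)
{
	ptp_cache = kmem_cache_create("ptp_cache", PAGE_SIZE, PAGE_SIZE,
				      SLAB_PANIC, ptp_ctor);
	return 0;
}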
On Mon, 2025-08-18 at 18:02 +0200, Kevin Brodsky wrote:
> The benchmarking results (see cover letter) don't seem to point to a
> major performance hit from setting the pkey on arm64 (worth noting that
> the linear mapping is PTE-mapped on arm64 today, so no splitting should
> occur when setting the pkey). The overhead may well be substantially
> higher on x86.

It's surprising to me. The batching seems to be about switching the
pkey, not the conversion of the direct map. And with batching you
measured a fork benchmark actually sped up a tiny bit. Shouldn't it
involve a pile of page table allocations and so extra direct map work?

I don't know if it's possible the mock implementation skipped some
set_memory() work somehow?

> I agree this is worth looking into, though. I will check the overhead
> added by set_memory_pkey() specifically (ignoring pkey register
> switches), and maybe try to allocate page tables with a dedicated
> kmem_cache instead, reusing this patch [1] from my other kpkeys series.
> A kmem_cache won't be as optimal as a dedicated allocator, but batching
> the page freeing may already improve things substantially.

I actually never got to the benchmark-on-real-HW stage either, but I'd
be surprised if this approach would have acceptable performance for
x86. There are so many optimizations around minimizing TLB flushes in
Linux. Dunno. Maybe my arm knowledge is too lacking.
On 18/08/2025 19:01, Edgecombe, Rick P wrote:
> On Mon, 2025-08-18 at 18:02 +0200, Kevin Brodsky wrote:
>> The benchmarking results (see cover letter) don't seem to point to a
>> major performance hit from setting the pkey on arm64 (worth noting that
>> the linear mapping is PTE-mapped on arm64 today, so no splitting should
>> occur when setting the pkey). The overhead may well be substantially
>> higher on x86.
> It's surprising to me. The batching seems to be about switching the
> pkey, not the conversion of the direct map.

Correct, there is still a set_memory_pkey() for each PTP.

> And with batching you measured a fork benchmark actually sped up a
> tiny bit. Shouldn't it involve a pile of page table allocations and
> so extra direct map work?

It should indeed...

> I don't know if it's possible the mock implementation skipped some
> set_memory() work somehow?

In fact you're absolutely right: in the mock implementation I
benchmarked, set_memory_pkey() is in fact a no-op :( This is because
patch 6 gates set_memory_pkey() on system_supports_poe(), but the mock
implementation [1] only modifies arch_kpkeys_enabled(). In other words,
the numbers in the cover letter correspond to the added pkey register
switches, without touching the page tables.

I am now re-running the benchmarks with set_memory_pkey() actually
modifying the page tables. I'll reply to the cover letter with the
updated numbers.

- Kevin

[1] https://gitlab.arm.com/linux-arm/linux-kb/-/commit/fd75b43abb354e84d06f3dfb05ce839e9fb13e08
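
Spelling out the gating issue (a reconstructed sketch, not the actual
patch; __set_memory_pkey() is a hypothetical internal helper):

/*
 * Patch 6 gates set_memory_pkey() on POE support. The mock setup only
 * overrides arch_kpkeys_enabled(), so system_supports_poe() stays
 * false, the early return is always taken, and the benchmarks never
 * touched the linear map.
 */
int set_memory_pkey(unsigned long addr, int numpages, int pkey)
{
	if (!system_supports_poe())
		return 0;	/* always taken on the mock setup */

	return __set_memory_pkey(addr, numpages, pkey);
}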