[PATCH v1] arm64: mm: Don't sleep in split_kernel_leaf_mapping() when in atomic context
Posted by Ryan Roberts 3 months ago
It has been reported that split_kernel_leaf_mapping() is trying to sleep
in non-sleepable context. It does this when acquiring the
pgtable_split_lock mutex, when either CONFIG_DEBUG_PAGEALLOC or
CONFIG_KFENCE is enabled; these features change linear map permissions within
softirq context during memory allocation and/or freeing. All other paths
into this function are called from sleepable context and so are safe.

But it turns out that the memory for which these 2 features may attempt
to modify the permissions is always mapped by pte, so there is no need
to attempt to split the mapping. So let's exit early in these cases and
avoid attempting to take the mutex.

There is one wrinkle to this approach; late-initialized kfence allocates
its pool from the buddy, which may be block mapped. So we must hook that
allocation and convert it to pte-mappings up front. Previously this was
done as a side-effect of kfence protecting all the individual pages in
its pool at init-time, but this no longer works due to the added early
exit path in split_kernel_leaf_mapping().

So instead, do this via the existing arch_kfence_init_pool() arch hook,
and reuse the existing linear_map_split_to_ptes() infrastructure. This
will now also be more efficient as a result.

Closes: https://lore.kernel.org/all/f24b9032-0ec9-47b1-8b95-c0eeac7a31c5@roeck-us.net/
Fixes: a166563e7ec3 ("arm64: mm: support large block mapping when rodata=full")
Tested-by: Guenter Roeck <groeck@google.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---

Hi All,

This is a fuller fix than the suggestion I sent yesterday, and works correctly
with late-init kfence (thanks to Yang Shi for pointing that out).

I've verified this on AmpereOne with CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE
individually, and I've also forced it to take the linear_map_split_to_ptes() path to
verify that I haven't broken it during the refactoring.

I've kept Guenter's T-b since the early-init kfence path that he was testing is
unchanged.

Assuming nobody spots any issues, I'd like to get it into the next round of
arm64 bug fixes for this cycle.

Thanks,
Ryan


 arch/arm64/include/asm/kfence.h |  4 +-
 arch/arm64/mm/mmu.c             | 92 +++++++++++++++++++++++----------
 2 files changed, 68 insertions(+), 28 deletions(-)

diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
index a81937fae9f6..4a921e06d750 100644
--- a/arch/arm64/include/asm/kfence.h
+++ b/arch/arm64/include/asm/kfence.h
@@ -10,8 +10,6 @@

 #include <asm/set_memory.h>

-static inline bool arch_kfence_init_pool(void) { return true; }
-
 static inline bool kfence_protect_page(unsigned long addr, bool protect)
 {
 	set_memory_valid(addr, 1, !protect);
@@ -25,8 +23,10 @@ static inline bool arm64_kfence_can_set_direct_map(void)
 {
 	return !kfence_early_init;
 }
+bool arch_kfence_init_pool(void);
 #else /* CONFIG_KFENCE */
 static inline bool arm64_kfence_can_set_direct_map(void) { return false; }
+static inline bool arch_kfence_init_pool(void) { return false; }
 #endif /* CONFIG_KFENCE */

 #endif /* __ASM_KFENCE_H */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index b8d37eb037fc..0385e9b17ab0 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -708,6 +708,16 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
 	return ret;
 }

+static inline bool force_pte_mapping(void)
+{
+	bool bbml2 = system_capabilities_finalized() ?
+		system_supports_bbml2_noabort() : cpu_supports_bbml2_noabort();
+
+	return (!bbml2 && (rodata_full || arm64_kfence_can_set_direct_map() ||
+			   is_realm_world())) ||
+		debug_pagealloc_enabled();
+}
+
 static DEFINE_MUTEX(pgtable_split_lock);

 int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
@@ -723,6 +733,16 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
 	if (!system_supports_bbml2_noabort())
 		return 0;

+	/*
+	 * If the region is within a pte-mapped area, there is no need to try to
+	 * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may
+	 * change permissions from softirq context so for those cases (which are
+	 * always pte-mapped), we must not go any further because taking the
+	 * mutex below may sleep.
+	 */
+	if (force_pte_mapping() || is_kfence_address((void *)start))
+		return 0;
+
 	/*
 	 * Ensure start and end are at least page-aligned since this is the
 	 * finest granularity we can split to.
@@ -758,30 +778,30 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
 	return ret;
 }

-static int __init split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
-					  unsigned long next,
-					  struct mm_walk *walk)
+static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
+				   unsigned long next, struct mm_walk *walk)
 {
+	gfp_t gfp = *(gfp_t *)walk->private;
 	pud_t pud = pudp_get(pudp);
 	int ret = 0;

 	if (pud_leaf(pud))
-		ret = split_pud(pudp, pud, GFP_ATOMIC, false);
+		ret = split_pud(pudp, pud, gfp, false);

 	return ret;
 }

-static int __init split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
-					  unsigned long next,
-					  struct mm_walk *walk)
+static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
+				   unsigned long next, struct mm_walk *walk)
 {
+	gfp_t gfp = *(gfp_t *)walk->private;
 	pmd_t pmd = pmdp_get(pmdp);
 	int ret = 0;

 	if (pmd_leaf(pmd)) {
 		if (pmd_cont(pmd))
 			split_contpmd(pmdp);
-		ret = split_pmd(pmdp, pmd, GFP_ATOMIC, false);
+		ret = split_pmd(pmdp, pmd, gfp, false);

 		/*
 		 * We have split the pmd directly to ptes so there is no need to
@@ -793,9 +813,8 @@ static int __init split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
 	return ret;
 }

-static int __init split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
-					  unsigned long next,
-					  struct mm_walk *walk)
+static int split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
+				   unsigned long next, struct mm_walk *walk)
 {
 	pte_t pte = __ptep_get(ptep);

@@ -805,12 +824,24 @@ static int __init split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
 	return 0;
 }

-static const struct mm_walk_ops split_to_ptes_ops __initconst = {
+static const struct mm_walk_ops split_to_ptes_ops = {
 	.pud_entry	= split_to_ptes_pud_entry,
 	.pmd_entry	= split_to_ptes_pmd_entry,
 	.pte_entry	= split_to_ptes_pte_entry,
 };

+static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
+{
+	int ret;
+
+	arch_enter_lazy_mmu_mode();
+	ret = walk_kernel_page_table_range_lockless(start, end,
+					&split_to_ptes_ops, NULL, &gfp);
+	arch_leave_lazy_mmu_mode();
+
+	return ret;
+}
+
 static bool linear_map_requires_bbml2 __initdata;

 u32 idmap_kpti_bbml2_flag;
@@ -847,11 +878,9 @@ static int __init linear_map_split_to_ptes(void *__unused)
 		 * PTE. The kernel alias remains static throughout runtime so
 		 * can continue to be safely mapped with large mappings.
 		 */
-		ret = walk_kernel_page_table_range_lockless(lstart, kstart,
-						&split_to_ptes_ops, NULL, NULL);
+		ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
 		if (!ret)
-			ret = walk_kernel_page_table_range_lockless(kend, lend,
-						&split_to_ptes_ops, NULL, NULL);
+			ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
 		if (ret)
 			panic("Failed to split linear map\n");
 		flush_tlb_kernel_range(lstart, lend);
@@ -1002,6 +1031,27 @@ static void __init arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp)
 	memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE);
 	__kfence_pool = phys_to_virt(kfence_pool);
 }
+
+bool arch_kfence_init_pool(void)
+{
+	unsigned long start = (unsigned long)__kfence_pool;
+	unsigned long end = start + KFENCE_POOL_SIZE;
+	int ret;
+
+	/* Exit early if we know the linear map is already pte-mapped. */
+	if (!system_supports_bbml2_noabort() || force_pte_mapping())
+		return true;
+
+	/* Kfence pool is already pte-mapped for the early init case. */
+	if (kfence_early_init)
+		return true;
+
+	mutex_lock(&pgtable_split_lock);
+	ret = range_split_to_ptes(start, end, GFP_PGTABLE_KERNEL);
+	mutex_unlock(&pgtable_split_lock);
+
+	return ret ? false : true;
+}
 #else /* CONFIG_KFENCE */

 static inline phys_addr_t arm64_kfence_alloc_pool(void) { return 0; }
@@ -1009,16 +1059,6 @@ static inline void arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp) {

 #endif /* CONFIG_KFENCE */

-static inline bool force_pte_mapping(void)
-{
-	bool bbml2 = system_capabilities_finalized() ?
-		system_supports_bbml2_noabort() : cpu_supports_bbml2_noabort();
-
-	return (!bbml2 && (rodata_full || arm64_kfence_can_set_direct_map() ||
-			   is_realm_world())) ||
-		debug_pagealloc_enabled();
-}
-
 static void __init map_mem(pgd_t *pgdp)
 {
 	static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN);
--
2.43.0
Re: [PATCH v1] arm64: mm: Don't sleep in split_kernel_leaf_mapping() when in atomic context
Posted by Yang Shi 3 months ago
Hi Ryan,

Thanks for coming up with the fix so quickly. I believe Will and David
already covered most of the points. I don't have too much to add, just one
nit below.
On 11/3/25 4:57 AM, Ryan Roberts wrote:
> It has been reported that split_kernel_leaf_mapping() is trying to sleep
> in non-sleepable context. It does this when acquiring the
> pgtable_split_lock mutex, when either CONFIG_DEBUG_PAGEALLOC or
> CONFIG_KFENCE are enabled, which change linear map permissions within
> softirq context during memory allocation and/or freeing. All other paths
> into this function are called from sleepable context and so are safe.
>
> But it turns out that the memory for which these 2 features may attempt
> to modify the permissions is always mapped by pte, so there is no need
> to attempt to split the mapping. So let's exit early in these cases and
> avoid attempting to take the mutex.
>
> There is one wrinkle to this approach; late-initialized kfence allocates
> it's pool from the buddy which may be block mapped. So we must hook that
> allocation and convert it to pte-mappings up front. Previously this was
> done as a side-effect of kfence protecting all the individual pages in
> its pool at init-time, but this no longer works due to the added early
> exit path in split_kernel_leaf_mapping().
>
> So instead, do this via the existing arch_kfence_init_pool() arch hook,
> and reuse the existing linear_map_split_to_ptes() infrastructure. This
> will now also be more efficient as a result.
>
> Closes: https://lore.kernel.org/all/f24b9032-0ec9-47b1-8b95-c0eeac7a31c5@roeck-us.net/
> Fixes: a166563e7ec3 ("arm64: mm: support large block mapping when rodata=full")
> Tested-by: Guenter Roeck <groeck@google.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> Hi All,
>
> This is a fuller fix than the suggestion I sent yesterday, and works correctly
> with late-init kfence (thanks to Yang Shi for pointing that out).
>
> I've verified this on AmpereOne with CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE
> individually, and I've also forced it to take the linear_map_split_to_ptes() to
> verify that I haven't broken it during the refactoring.
>
> I've kept Guenter's T-b since the early-init kfence path that he was testing is
> unchanged.
>
> Assuming nobody spots any issues, I'fd like to get it into the next round of
> arm64 bug fixes for this cycle.
>
> Thanks,
> Ryan
>
>
>   arch/arm64/include/asm/kfence.h |  4 +-
>   arch/arm64/mm/mmu.c             | 92 +++++++++++++++++++++++----------
>   2 files changed, 68 insertions(+), 28 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
> index a81937fae9f6..4a921e06d750 100644
> --- a/arch/arm64/include/asm/kfence.h
> +++ b/arch/arm64/include/asm/kfence.h
> @@ -10,8 +10,6 @@
>
>   #include <asm/set_memory.h>
>
> -static inline bool arch_kfence_init_pool(void) { return true; }
> -
>   static inline bool kfence_protect_page(unsigned long addr, bool protect)
>   {
>   	set_memory_valid(addr, 1, !protect);
> @@ -25,8 +23,10 @@ static inline bool arm64_kfence_can_set_direct_map(void)
>   {
>   	return !kfence_early_init;
>   }
> +bool arch_kfence_init_pool(void);
>   #else /* CONFIG_KFENCE */
>   static inline bool arm64_kfence_can_set_direct_map(void) { return false; }
> +static inline bool arch_kfence_init_pool(void) { return false; }
>   #endif /* CONFIG_KFENCE */
>
>   #endif /* __ASM_KFENCE_H */
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index b8d37eb037fc..0385e9b17ab0 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -708,6 +708,16 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>   	return ret;
>   }
>
> +static inline bool force_pte_mapping(void)
> +{
> +	bool bbml2 = system_capabilities_finalized() ?
> +		system_supports_bbml2_noabort() : cpu_supports_bbml2_noabort();
> +
> +	return (!bbml2 && (rodata_full || arm64_kfence_can_set_direct_map() ||
> +			   is_realm_world())) ||
> +		debug_pagealloc_enabled();
> +}
> +
>   static DEFINE_MUTEX(pgtable_split_lock);
>
>   int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
> @@ -723,6 +733,16 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>   	if (!system_supports_bbml2_noabort())
>   		return 0;
>
> +	/*
> +	 * If the region is within a pte-mapped area, there is no need to try to
> +	 * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may
> +	 * change permissions from softirq context so for those cases (which are

I'd prefer to use some more general words, for example, "atomic context,
for example, softirq". We are not sure (maybe it's just me) whether this
issue exists in non-softirq atomic context or not, so a more general
phrase could cover more cases.

Thanks,
Yang

> +	 * always pte-mapped), we must not go any further because taking the
> +	 * mutex below may sleep.
> +	 */
> +	if (force_pte_mapping() || is_kfence_address((void *)start))
> +		return 0;
> +
>   	/*
>   	 * Ensure start and end are at least page-aligned since this is the
>   	 * finest granularity we can split to.
> @@ -758,30 +778,30 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>   	return ret;
>   }
>
> -static int __init split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
> -					  unsigned long next,
> -					  struct mm_walk *walk)
> +static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
> +				   unsigned long next, struct mm_walk *walk)
>   {
> +	gfp_t gfp = *(gfp_t *)walk->private;
>   	pud_t pud = pudp_get(pudp);
>   	int ret = 0;
>
>   	if (pud_leaf(pud))
> -		ret = split_pud(pudp, pud, GFP_ATOMIC, false);
> +		ret = split_pud(pudp, pud, gfp, false);
>
>   	return ret;
>   }
>
> -static int __init split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
> -					  unsigned long next,
> -					  struct mm_walk *walk)
> +static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
> +				   unsigned long next, struct mm_walk *walk)
>   {
> +	gfp_t gfp = *(gfp_t *)walk->private;
>   	pmd_t pmd = pmdp_get(pmdp);
>   	int ret = 0;
>
>   	if (pmd_leaf(pmd)) {
>   		if (pmd_cont(pmd))
>   			split_contpmd(pmdp);
> -		ret = split_pmd(pmdp, pmd, GFP_ATOMIC, false);
> +		ret = split_pmd(pmdp, pmd, gfp, false);
>
>   		/*
>   		 * We have split the pmd directly to ptes so there is no need to
> @@ -793,9 +813,8 @@ static int __init split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
>   	return ret;
>   }
>
> -static int __init split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
> -					  unsigned long next,
> -					  struct mm_walk *walk)
> +static int split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
> +				   unsigned long next, struct mm_walk *walk)
>   {
>   	pte_t pte = __ptep_get(ptep);
>
> @@ -805,12 +824,24 @@ static int __init split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
>   	return 0;
>   }
>
> -static const struct mm_walk_ops split_to_ptes_ops __initconst = {
> +static const struct mm_walk_ops split_to_ptes_ops = {
>   	.pud_entry	= split_to_ptes_pud_entry,
>   	.pmd_entry	= split_to_ptes_pmd_entry,
>   	.pte_entry	= split_to_ptes_pte_entry,
>   };
>
> +static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
> +{
> +	int ret;
> +
> +	arch_enter_lazy_mmu_mode();
> +	ret = walk_kernel_page_table_range_lockless(start, end,
> +					&split_to_ptes_ops, NULL, &gfp);
> +	arch_leave_lazy_mmu_mode();
> +
> +	return ret;
> +}
> +
>   static bool linear_map_requires_bbml2 __initdata;
>
>   u32 idmap_kpti_bbml2_flag;
> @@ -847,11 +878,9 @@ static int __init linear_map_split_to_ptes(void *__unused)
>   		 * PTE. The kernel alias remains static throughout runtime so
>   		 * can continue to be safely mapped with large mappings.
>   		 */
> -		ret = walk_kernel_page_table_range_lockless(lstart, kstart,
> -						&split_to_ptes_ops, NULL, NULL);
> +		ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
>   		if (!ret)
> -			ret = walk_kernel_page_table_range_lockless(kend, lend,
> -						&split_to_ptes_ops, NULL, NULL);
> +			ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
>   		if (ret)
>   			panic("Failed to split linear map\n");
>   		flush_tlb_kernel_range(lstart, lend);
> @@ -1002,6 +1031,27 @@ static void __init arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp)
>   	memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE);
>   	__kfence_pool = phys_to_virt(kfence_pool);
>   }
> +
> +bool arch_kfence_init_pool(void)
> +{
> +	unsigned long start = (unsigned long)__kfence_pool;
> +	unsigned long end = start + KFENCE_POOL_SIZE;
> +	int ret;
> +
> +	/* Exit early if we know the linear map is already pte-mapped. */
> +	if (!system_supports_bbml2_noabort() || force_pte_mapping())
> +		return true;
> +
> +	/* Kfence pool is already pte-mapped for the early init case. */
> +	if (kfence_early_init)
> +		return true;
> +
> +	mutex_lock(&pgtable_split_lock);
> +	ret = range_split_to_ptes(start, end, GFP_PGTABLE_KERNEL);
> +	mutex_unlock(&pgtable_split_lock);
> +
> +	return ret ? false : true;
> +}
>   #else /* CONFIG_KFENCE */
>
>   static inline phys_addr_t arm64_kfence_alloc_pool(void) { return 0; }
> @@ -1009,16 +1059,6 @@ static inline void arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp) {
>
>   #endif /* CONFIG_KFENCE */
>
> -static inline bool force_pte_mapping(void)
> -{
> -	bool bbml2 = system_capabilities_finalized() ?
> -		system_supports_bbml2_noabort() : cpu_supports_bbml2_noabort();
> -
> -	return (!bbml2 && (rodata_full || arm64_kfence_can_set_direct_map() ||
> -			   is_realm_world())) ||
> -		debug_pagealloc_enabled();
> -}
> -
>   static void __init map_mem(pgd_t *pgdp)
>   {
>   	static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN);
> --
> 2.43.0
>
Re: [PATCH v1] arm64: mm: Don't sleep in split_kernel_leaf_mapping() when in atomic context
Posted by Will Deacon 3 months ago
On Mon, Nov 03, 2025 at 12:57:37PM +0000, Ryan Roberts wrote:
> It has been reported that split_kernel_leaf_mapping() is trying to sleep
> in non-sleepable context. It does this when acquiring the
> pgtable_split_lock mutex, when either CONFIG_DEBUG_PAGEALLOC or
> CONFIG_KFENCE are enabled, which change linear map permissions within
> softirq context during memory allocation and/or freeing. All other paths
> into this function are called from sleepable context and so are safe.
> 
> But it turns out that the memory for which these 2 features may attempt
> to modify the permissions is always mapped by pte, so there is no need
> to attempt to split the mapping. So let's exit early in these cases and
> avoid attempting to take the mutex.
> 
> There is one wrinkle to this approach; late-initialized kfence allocates
> it's pool from the buddy which may be block mapped. So we must hook that
> allocation and convert it to pte-mappings up front. Previously this was
> done as a side-effect of kfence protecting all the individual pages in
> its pool at init-time, but this no longer works due to the added early
> exit path in split_kernel_leaf_mapping().
> 
> So instead, do this via the existing arch_kfence_init_pool() arch hook,
> and reuse the existing linear_map_split_to_ptes() infrastructure. This
> will now also be more efficient as a result.
> 
> Closes: https://lore.kernel.org/all/f24b9032-0ec9-47b1-8b95-c0eeac7a31c5@roeck-us.net/
> Fixes: a166563e7ec3 ("arm64: mm: support large block mapping when rodata=full")
> Tested-by: Guenter Roeck <groeck@google.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> 
> Hi All,
> 
> This is a fuller fix than the suggestion I sent yesterday, and works correctly
> with late-init kfence (thanks to Yang Shi for pointing that out).
> 
> I've verified this on AmpereOne with CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE
> individually, and I've also forced it to take the linear_map_split_to_ptes() to
> verify that I haven't broken it during the refactoring.
> 
> I've kept Guenter's T-b since the early-init kfence path that he was testing is
> unchanged.
> 
> Assuming nobody spots any issues, I'fd like to get it into the next round of
> arm64 bug fixes for this cycle.
> 
> Thanks,
> Ryan
> 
> 
>  arch/arm64/include/asm/kfence.h |  4 +-
>  arch/arm64/mm/mmu.c             | 92 +++++++++++++++++++++++----------
>  2 files changed, 68 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
> index a81937fae9f6..4a921e06d750 100644
> --- a/arch/arm64/include/asm/kfence.h
> +++ b/arch/arm64/include/asm/kfence.h
> @@ -10,8 +10,6 @@
> 
>  #include <asm/set_memory.h>
> 
> -static inline bool arch_kfence_init_pool(void) { return true; }
> -
>  static inline bool kfence_protect_page(unsigned long addr, bool protect)
>  {
>  	set_memory_valid(addr, 1, !protect);
> @@ -25,8 +23,10 @@ static inline bool arm64_kfence_can_set_direct_map(void)
>  {
>  	return !kfence_early_init;
>  }
> +bool arch_kfence_init_pool(void);
>  #else /* CONFIG_KFENCE */
>  static inline bool arm64_kfence_can_set_direct_map(void) { return false; }
> +static inline bool arch_kfence_init_pool(void) { return false; }

Why do we need this for the !KFENCE case?

>  #endif /* CONFIG_KFENCE */
> 
>  #endif /* __ASM_KFENCE_H */
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index b8d37eb037fc..0385e9b17ab0 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -708,6 +708,16 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>  	return ret;
>  }
> 
> +static inline bool force_pte_mapping(void)
> +{
> +	bool bbml2 = system_capabilities_finalized() ?
> +		system_supports_bbml2_noabort() : cpu_supports_bbml2_noabort();
> +
> +	return (!bbml2 && (rodata_full || arm64_kfence_can_set_direct_map() ||
> +			   is_realm_world())) ||
> +		debug_pagealloc_enabled();
> +}
> +
>  static DEFINE_MUTEX(pgtable_split_lock);
> 
>  int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
> @@ -723,6 +733,16 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>  	if (!system_supports_bbml2_noabort())
>  		return 0;
> 
> +	/*
> +	 * If the region is within a pte-mapped area, there is no need to try to
> +	 * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may
> +	 * change permissions from softirq context so for those cases (which are
> +	 * always pte-mapped), we must not go any further because taking the
> +	 * mutex below may sleep.
> +	 */
> +	if (force_pte_mapping() || is_kfence_address((void *)start))
> +		return 0;
> +
>  	/*
>  	 * Ensure start and end are at least page-aligned since this is the
>  	 * finest granularity we can split to.
> @@ -758,30 +778,30 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>  	return ret;
>  }
> 
> -static int __init split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
> -					  unsigned long next,
> -					  struct mm_walk *walk)
> +static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
> +				   unsigned long next, struct mm_walk *walk)
>  {
> +	gfp_t gfp = *(gfp_t *)walk->private;
>  	pud_t pud = pudp_get(pudp);
>  	int ret = 0;
> 
>  	if (pud_leaf(pud))
> -		ret = split_pud(pudp, pud, GFP_ATOMIC, false);
> +		ret = split_pud(pudp, pud, gfp, false);
> 
>  	return ret;
>  }
> 
> -static int __init split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
> -					  unsigned long next,
> -					  struct mm_walk *walk)
> +static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
> +				   unsigned long next, struct mm_walk *walk)
>  {
> +	gfp_t gfp = *(gfp_t *)walk->private;
>  	pmd_t pmd = pmdp_get(pmdp);
>  	int ret = 0;
> 
>  	if (pmd_leaf(pmd)) {
>  		if (pmd_cont(pmd))
>  			split_contpmd(pmdp);
> -		ret = split_pmd(pmdp, pmd, GFP_ATOMIC, false);
> +		ret = split_pmd(pmdp, pmd, gfp, false);
> 
>  		/*
>  		 * We have split the pmd directly to ptes so there is no need to
> @@ -793,9 +813,8 @@ static int __init split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
>  	return ret;
>  }
> 
> -static int __init split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
> -					  unsigned long next,
> -					  struct mm_walk *walk)
> +static int split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
> +				   unsigned long next, struct mm_walk *walk)
>  {
>  	pte_t pte = __ptep_get(ptep);
> 
> @@ -805,12 +824,24 @@ static int __init split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
>  	return 0;
>  }
> 
> -static const struct mm_walk_ops split_to_ptes_ops __initconst = {
> +static const struct mm_walk_ops split_to_ptes_ops = {
>  	.pud_entry	= split_to_ptes_pud_entry,
>  	.pmd_entry	= split_to_ptes_pmd_entry,
>  	.pte_entry	= split_to_ptes_pte_entry,
>  };
> 
> +static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
> +{
> +	int ret;
> +
> +	arch_enter_lazy_mmu_mode();
> +	ret = walk_kernel_page_table_range_lockless(start, end,
> +					&split_to_ptes_ops, NULL, &gfp);
> +	arch_leave_lazy_mmu_mode();

Why are you entering/leaving lazy mode now? linear_map_split_to_ptes()
calls flush_tlb_kernel_range() right after this so now it looks like
we have more barriers than we need there.

> +
> +	return ret;
> +}
> +
>  static bool linear_map_requires_bbml2 __initdata;
> 
>  u32 idmap_kpti_bbml2_flag;
> @@ -847,11 +878,9 @@ static int __init linear_map_split_to_ptes(void *__unused)
>  		 * PTE. The kernel alias remains static throughout runtime so
>  		 * can continue to be safely mapped with large mappings.
>  		 */
> -		ret = walk_kernel_page_table_range_lockless(lstart, kstart,
> -						&split_to_ptes_ops, NULL, NULL);
> +		ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
>  		if (!ret)
> -			ret = walk_kernel_page_table_range_lockless(kend, lend,
> -						&split_to_ptes_ops, NULL, NULL);
> +			ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
>  		if (ret)
>  			panic("Failed to split linear map\n");
>  		flush_tlb_kernel_range(lstart, lend);
> @@ -1002,6 +1031,27 @@ static void __init arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp)
>  	memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE);
>  	__kfence_pool = phys_to_virt(kfence_pool);
>  }
> +
> +bool arch_kfence_init_pool(void)
> +{
> +	unsigned long start = (unsigned long)__kfence_pool;
> +	unsigned long end = start + KFENCE_POOL_SIZE;
> +	int ret;
> +
> +	/* Exit early if we know the linear map is already pte-mapped. */
> +	if (!system_supports_bbml2_noabort() || force_pte_mapping())
> +		return true;
> +
> +	/* Kfence pool is already pte-mapped for the early init case. */
> +	if (kfence_early_init)
> +		return true;
> +
> +	mutex_lock(&pgtable_split_lock);
> +	ret = range_split_to_ptes(start, end, GFP_PGTABLE_KERNEL);
> +	mutex_unlock(&pgtable_split_lock);
> +
> +	return ret ? false : true;

return !ret

Will
Re: [PATCH v1] arm64: mm: Don't sleep in split_kernel_leaf_mapping() when in atomic context
Posted by Ryan Roberts 3 months ago
Thanks for the fast review!


On 03/11/2025 15:37, Will Deacon wrote:
> On Mon, Nov 03, 2025 at 12:57:37PM +0000, Ryan Roberts wrote:
>> It has been reported that split_kernel_leaf_mapping() is trying to sleep
>> in non-sleepable context. It does this when acquiring the
>> pgtable_split_lock mutex, when either CONFIG_DEBUG_PAGEALLOC or
>> CONFIG_KFENCE are enabled, which change linear map permissions within
>> softirq context during memory allocation and/or freeing. All other paths
>> into this function are called from sleepable context and so are safe.
>>
>> But it turns out that the memory for which these 2 features may attempt
>> to modify the permissions is always mapped by pte, so there is no need
>> to attempt to split the mapping. So let's exit early in these cases and
>> avoid attempting to take the mutex.
>>
>> There is one wrinkle to this approach; late-initialized kfence allocates
>> it's pool from the buddy which may be block mapped. So we must hook that
>> allocation and convert it to pte-mappings up front. Previously this was
>> done as a side-effect of kfence protecting all the individual pages in
>> its pool at init-time, but this no longer works due to the added early
>> exit path in split_kernel_leaf_mapping().
>>
>> So instead, do this via the existing arch_kfence_init_pool() arch hook,
>> and reuse the existing linear_map_split_to_ptes() infrastructure. This
>> will now also be more efficient as a result.
>>
>> Closes: https://lore.kernel.org/all/f24b9032-0ec9-47b1-8b95-c0eeac7a31c5@roeck-us.net/
>> Fixes: a166563e7ec3 ("arm64: mm: support large block mapping when rodata=full")
>> Tested-by: Guenter Roeck <groeck@google.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> Hi All,
>>
>> This is a fuller fix than the suggestion I sent yesterday, and works correctly
>> with late-init kfence (thanks to Yang Shi for pointing that out).
>>
>> I've verified this on AmpereOne with CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE
>> individually, and I've also forced it to take the linear_map_split_to_ptes() to
>> verify that I haven't broken it during the refactoring.
>>
>> I've kept Guenter's T-b since the early-init kfence path that he was testing is
>> unchanged.
>>
>> Assuming nobody spots any issues, I'fd like to get it into the next round of
>> arm64 bug fixes for this cycle.
>>
>> Thanks,
>> Ryan
>>
>>
>>  arch/arm64/include/asm/kfence.h |  4 +-
>>  arch/arm64/mm/mmu.c             | 92 +++++++++++++++++++++++----------
>>  2 files changed, 68 insertions(+), 28 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
>> index a81937fae9f6..4a921e06d750 100644
>> --- a/arch/arm64/include/asm/kfence.h
>> +++ b/arch/arm64/include/asm/kfence.h
>> @@ -10,8 +10,6 @@
>>
>>  #include <asm/set_memory.h>
>>
>> -static inline bool arch_kfence_init_pool(void) { return true; }
>> -
>>  static inline bool kfence_protect_page(unsigned long addr, bool protect)
>>  {
>>  	set_memory_valid(addr, 1, !protect);
>> @@ -25,8 +23,10 @@ static inline bool arm64_kfence_can_set_direct_map(void)
>>  {
>>  	return !kfence_early_init;
>>  }
>> +bool arch_kfence_init_pool(void);
>>  #else /* CONFIG_KFENCE */
>>  static inline bool arm64_kfence_can_set_direct_map(void) { return false; }
>> +static inline bool arch_kfence_init_pool(void) { return false; }
> 
> Why do we need this for the !KFENCE case?

We don't, my bad. I'll remove.

> 
>>  #endif /* CONFIG_KFENCE */
>>
>>  #endif /* __ASM_KFENCE_H */
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index b8d37eb037fc..0385e9b17ab0 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -708,6 +708,16 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>>  	return ret;
>>  }
>>
>> +static inline bool force_pte_mapping(void)
>> +{
>> +	bool bbml2 = system_capabilities_finalized() ?
>> +		system_supports_bbml2_noabort() : cpu_supports_bbml2_noabort();
>> +
>> +	return (!bbml2 && (rodata_full || arm64_kfence_can_set_direct_map() ||
>> +			   is_realm_world())) ||
>> +		debug_pagealloc_enabled();
>> +}
>> +
>>  static DEFINE_MUTEX(pgtable_split_lock);
>>
>>  int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>> @@ -723,6 +733,16 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>>  	if (!system_supports_bbml2_noabort())
>>  		return 0;
>>
>> +	/*
>> +	 * If the region is within a pte-mapped area, there is no need to try to
>> +	 * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may
>> +	 * change permissions from softirq context so for those cases (which are
>> +	 * always pte-mapped), we must not go any further because taking the
>> +	 * mutex below may sleep.
>> +	 */
>> +	if (force_pte_mapping() || is_kfence_address((void *)start))
>> +		return 0;
>> +
>>  	/*
>>  	 * Ensure start and end are at least page-aligned since this is the
>>  	 * finest granularity we can split to.
>> @@ -758,30 +778,30 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>>  	return ret;
>>  }
>>
>> -static int __init split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
>> -					  unsigned long next,
>> -					  struct mm_walk *walk)
>> +static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
>> +				   unsigned long next, struct mm_walk *walk)
>>  {
>> +	gfp_t gfp = *(gfp_t *)walk->private;
>>  	pud_t pud = pudp_get(pudp);
>>  	int ret = 0;
>>
>>  	if (pud_leaf(pud))
>> -		ret = split_pud(pudp, pud, GFP_ATOMIC, false);
>> +		ret = split_pud(pudp, pud, gfp, false);
>>
>>  	return ret;
>>  }
>>
>> -static int __init split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
>> -					  unsigned long next,
>> -					  struct mm_walk *walk)
>> +static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
>> +				   unsigned long next, struct mm_walk *walk)
>>  {
>> +	gfp_t gfp = *(gfp_t *)walk->private;
>>  	pmd_t pmd = pmdp_get(pmdp);
>>  	int ret = 0;
>>
>>  	if (pmd_leaf(pmd)) {
>>  		if (pmd_cont(pmd))
>>  			split_contpmd(pmdp);
>> -		ret = split_pmd(pmdp, pmd, GFP_ATOMIC, false);
>> +		ret = split_pmd(pmdp, pmd, gfp, false);
>>
>>  		/*
>>  		 * We have split the pmd directly to ptes so there is no need to
>> @@ -793,9 +813,8 @@ static int __init split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
>>  	return ret;
>>  }
>>
>> -static int __init split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
>> -					  unsigned long next,
>> -					  struct mm_walk *walk)
>> +static int split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
>> +				   unsigned long next, struct mm_walk *walk)
>>  {
>>  	pte_t pte = __ptep_get(ptep);
>>
>> @@ -805,12 +824,24 @@ static int __init split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
>>  	return 0;
>>  }
>>
>> -static const struct mm_walk_ops split_to_ptes_ops __initconst = {
>> +static const struct mm_walk_ops split_to_ptes_ops = {
>>  	.pud_entry	= split_to_ptes_pud_entry,
>>  	.pmd_entry	= split_to_ptes_pmd_entry,
>>  	.pte_entry	= split_to_ptes_pte_entry,
>>  };
>>
>> +static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
>> +{
>> +	int ret;
>> +
>> +	arch_enter_lazy_mmu_mode();
>> +	ret = walk_kernel_page_table_range_lockless(start, end,
>> +					&split_to_ptes_ops, NULL, &gfp);
>> +	arch_leave_lazy_mmu_mode();
> 
> Why are you entering/leaving lazy mode now? linear_map_split_to_ptes()
> calls flush_tlb_kernel_range() right after this so now it looks like
> we have more barriers than we need there.

Without the lazy mmu block, every write to every pte (or pmd/pud) will cause a
dsb and isb to be emitted. With the lazy mmu block, we only emit a single
dsb/isb at the end of the block.
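
Roughly speaking, the difference is (a simplified sketch only, not the literal
arm64 implementation; the helper name here is made up):

	static void sketch_lazy_mmu_batching(pte_t *ptep, pte_t pte, unsigned int nr)
	{
		unsigned int i;

		arch_enter_lazy_mmu_mode();
		for (i = 0; i < nr; i++)
			__set_pte(ptep + i, pte);	/* barriers deferred */
		arch_leave_lazy_mmu_mode();		/* one dsb/isb for the lot */
	}

Without the enter/leave pair, each pte store in the loop would emit its own
dsb/isb.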

linear_map_split_to_ptes() didn't previously have a lazy mmu block; that was an
oversight, I believe. So when refactoring I thought it made sense to make it
common for both cases.

Yes, the flush_tlb_kernel_range() also has the barriers, so the lazy mmu mode is
reducing from a gazillion barriers to 2. We could further optimize from 2 to 1,
but I doubt the performance improvement will be measurable.

Perhaps I've misunderstood your point...?

> 
>> +
>> +	return ret;
>> +}
>> +
>>  static bool linear_map_requires_bbml2 __initdata;
>>
>>  u32 idmap_kpti_bbml2_flag;
>> @@ -847,11 +878,9 @@ static int __init linear_map_split_to_ptes(void *__unused)
>>  		 * PTE. The kernel alias remains static throughout runtime so
>>  		 * can continue to be safely mapped with large mappings.
>>  		 */
>> -		ret = walk_kernel_page_table_range_lockless(lstart, kstart,
>> -						&split_to_ptes_ops, NULL, NULL);
>> +		ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
>>  		if (!ret)
>> -			ret = walk_kernel_page_table_range_lockless(kend, lend,
>> -						&split_to_ptes_ops, NULL, NULL);
>> +			ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
>>  		if (ret)
>>  			panic("Failed to split linear map\n");
>>  		flush_tlb_kernel_range(lstart, lend);
>> @@ -1002,6 +1031,27 @@ static void __init arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp)
>>  	memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE);
>>  	__kfence_pool = phys_to_virt(kfence_pool);
>>  }
>> +
>> +bool arch_kfence_init_pool(void)
>> +{
>> +	unsigned long start = (unsigned long)__kfence_pool;
>> +	unsigned long end = start + KFENCE_POOL_SIZE;
>> +	int ret;
>> +
>> +	/* Exit early if we know the linear map is already pte-mapped. */
>> +	if (!system_supports_bbml2_noabort() || force_pte_mapping())
>> +		return true;
>> +
>> +	/* Kfence pool is already pte-mapped for the early init case. */
>> +	if (kfence_early_init)
>> +		return true;
>> +
>> +	mutex_lock(&pgtable_split_lock);
>> +	ret = range_split_to_ptes(start, end, GFP_PGTABLE_KERNEL);
>> +	mutex_unlock(&pgtable_split_lock);
>> +
>> +	return ret ? false : true;
> 
> return !ret

Yes good point - will fix in the next spin.

Thanks,
Ryan

> 
> Will
Re: [PATCH v1] arm64: mm: Don't sleep in split_kernel_leaf_mapping() when in atomic context
Posted by Will Deacon 3 months ago
Hey Ryan,

On Mon, Nov 03, 2025 at 04:28:44PM +0000, Ryan Roberts wrote:
> On 03/11/2025 15:37, Will Deacon wrote:
> > On Mon, Nov 03, 2025 at 12:57:37PM +0000, Ryan Roberts wrote:
> >> +static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
> >> +{
> >> +	int ret;
> >> +
> >> +	arch_enter_lazy_mmu_mode();
> >> +	ret = walk_kernel_page_table_range_lockless(start, end,
> >> +					&split_to_ptes_ops, NULL, &gfp);
> >> +	arch_leave_lazy_mmu_mode();
> > 
> > Why are you entering/leaving lazy mode now? linear_map_split_to_ptes()
> > calls flush_tlb_kernel_range() right after this so now it looks like
> > we have more barriers than we need there.
> 
> Without the lazy mmu block, every write to every pte (or pmd/pud) will cause a
> dsb and isb to be emitted. With the lazy mmu block, we only emit a single
> dsb/isb at the end of the block.
> 
> linear_map_split_to_ptes() didn't previously have a lazy mmu block; that was an
> oversight, I believe. So when refactoring I thought it made sense to make it
> common for both cases.
> 
> Yes, the flush_tlb_kernel_range() also has the barriers, so the lazy mmu mode is
> reducing from a gazillion barriers to 2. We could further optimize from 2 to 1,
> but I doubt the performance improvement will be measurable.
> 
> Perhaps I've misunderstood your point...?

I was just trying to understand whether this was a functional thing (which
I couldn't grok) or an optimisation. Sounds like it's the latter, but I'd
prefer not to mix optimisations with fixes.

Will
Re: [PATCH v1] arm64: mm: Don't sleep in split_kernel_leaf_mapping() when in atomic context
Posted by Ryan Roberts 3 months ago
On 04/11/2025 12:41, Will Deacon wrote:
> Hey Ryan,
> 
> On Mon, Nov 03, 2025 at 04:28:44PM +0000, Ryan Roberts wrote:
>> On 03/11/2025 15:37, Will Deacon wrote:
>>> On Mon, Nov 03, 2025 at 12:57:37PM +0000, Ryan Roberts wrote:
>>>> +static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
>>>> +{
>>>> +	int ret;
>>>> +
>>>> +	arch_enter_lazy_mmu_mode();
>>>> +	ret = walk_kernel_page_table_range_lockless(start, end,
>>>> +					&split_to_ptes_ops, NULL, &gfp);
>>>> +	arch_leave_lazy_mmu_mode();
>>>
>>> Why are you entering/leaving lazy mode now? linear_map_split_to_ptes()
>>> calls flush_tlb_kernel_range() right after this so now it looks like
>>> we have more barriers than we need there.
>>
>> Without the lazy mmu block, every write to every pte (or pmd/pud) will cause a
>> dsb and isb to be emitted. With the lazy mmu block, we only emit a single
>> dsb/isb at the end of the block.
>>
>> linear_map_split_to_ptes() didn't previously have a lazy mmu block; that was an
>> oversight, I believe. So when refactoring I thought it made sense to make it
>> common for both cases.
>>
>> Yes, the flush_tlb_kernel_range() also has the barriers, so the lazy mmu mode is
>> reducing from a gazillion barriers to 2. We could further optimize from 2 to 1,
>> but I doubt the performance improvement will be measurable.
>>
>> Perhaps I've misunderstood your point...?
> 
> I was just trying to understand whether this was a functional thing (which
> I couldn't grok) or an optimisation. Sounds like it's the latter, but I'd
> prefer not to mix optimisations with fixes.

Ahh, yes fair enough. I'll send out a new version either tomorrow or Thursday.
In that I'll include 3 separate patches: the fix, this optimization and the tidy
up that David suggested. Ideally all 3 will go into 6.18, but it's fine if you
only want to push the fix to 6.18 and leave the rest for 6.19.

Thanks,
Ryan

> 
> Will
Re: [PATCH v1] arm64: mm: Don't sleep in split_kernel_leaf_mapping() when in atomic context
Posted by Ryan Roberts 3 months ago
On 03/11/2025 16:28, Ryan Roberts wrote:
> Thanks for the fast review!
> 
> 
> On 03/11/2025 15:37, Will Deacon wrote:
>> On Mon, Nov 03, 2025 at 12:57:37PM +0000, Ryan Roberts wrote:
>>> It has been reported that split_kernel_leaf_mapping() is trying to sleep
>>> in non-sleepable context. It does this when acquiring the
>>> pgtable_split_lock mutex, when either CONFIG_DEBUG_PAGEALLOC or
>>> CONFIG_KFENCE are enabled, which change linear map permissions within
>>> softirq context during memory allocation and/or freeing. All other paths
>>> into this function are called from sleepable context and so are safe.
>>>
>>> But it turns out that the memory for which these 2 features may attempt
>>> to modify the permissions is always mapped by pte, so there is no need
>>> to attempt to split the mapping. So let's exit early in these cases and
>>> avoid attempting to take the mutex.
>>>
>>> There is one wrinkle to this approach; late-initialized kfence allocates
>>> it's pool from the buddy which may be block mapped. So we must hook that
>>> allocation and convert it to pte-mappings up front. Previously this was
>>> done as a side-effect of kfence protecting all the individual pages in
>>> its pool at init-time, but this no longer works due to the added early
>>> exit path in split_kernel_leaf_mapping().
>>>
>>> So instead, do this via the existing arch_kfence_init_pool() arch hook,
>>> and reuse the existing linear_map_split_to_ptes() infrastructure. This
>>> will now also be more efficient as a result.
>>>
>>> Closes: https://lore.kernel.org/all/f24b9032-0ec9-47b1-8b95-c0eeac7a31c5@roeck-us.net/
>>> Fixes: a166563e7ec3 ("arm64: mm: support large block mapping when rodata=full")
>>> Tested-by: Guenter Roeck <groeck@google.com>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>
>>> Hi All,
>>>
>>> This is a fuller fix than the suggestion I sent yesterday, and works correctly
>>> with late-init kfence (thanks to Yang Shi for pointing that out).
>>>
>>> I've verified this on AmpereOne with CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE
>>> individually, and I've also forced it to take the linear_map_split_to_ptes() to
>>> verify that I haven't broken it during the refactoring.
>>>
>>> I've kept Guenter's T-b since the early-init kfence path that he was testing is
>>> unchanged.
>>>
>>> Assuming nobody spots any issues, I'fd like to get it into the next round of
>>> arm64 bug fixes for this cycle.
>>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>  arch/arm64/include/asm/kfence.h |  4 +-
>>>  arch/arm64/mm/mmu.c             | 92 +++++++++++++++++++++++----------
>>>  2 files changed, 68 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
>>> index a81937fae9f6..4a921e06d750 100644
>>> --- a/arch/arm64/include/asm/kfence.h
>>> +++ b/arch/arm64/include/asm/kfence.h
>>> @@ -10,8 +10,6 @@
>>>
>>>  #include <asm/set_memory.h>
>>>
>>> -static inline bool arch_kfence_init_pool(void) { return true; }
>>> -
>>>  static inline bool kfence_protect_page(unsigned long addr, bool protect)
>>>  {
>>>  	set_memory_valid(addr, 1, !protect);
>>> @@ -25,8 +23,10 @@ static inline bool arm64_kfence_can_set_direct_map(void)
>>>  {
>>>  	return !kfence_early_init;
>>>  }
>>> +bool arch_kfence_init_pool(void);
>>>  #else /* CONFIG_KFENCE */
>>>  static inline bool arm64_kfence_can_set_direct_map(void) { return false; }
>>> +static inline bool arch_kfence_init_pool(void) { return false; }
>>
>> Why do we need this for the !KFENCE case?
> 
> We don't, my bad. I'll remove.
> 
>>
>>>  #endif /* CONFIG_KFENCE */
>>>
>>>  #endif /* __ASM_KFENCE_H */
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index b8d37eb037fc..0385e9b17ab0 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -708,6 +708,16 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>>>  	return ret;
>>>  }
>>>
>>> +static inline bool force_pte_mapping(void)
>>> +{
>>> +	bool bbml2 = system_capabilities_finalized() ?
>>> +		system_supports_bbml2_noabort() : cpu_supports_bbml2_noabort();
>>> +
>>> +	return (!bbml2 && (rodata_full || arm64_kfence_can_set_direct_map() ||
>>> +			   is_realm_world())) ||
>>> +		debug_pagealloc_enabled();
>>> +}
>>> +
>>>  static DEFINE_MUTEX(pgtable_split_lock);
>>>
>>>  int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>>> @@ -723,6 +733,16 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>>>  	if (!system_supports_bbml2_noabort())
>>>  		return 0;
>>>
>>> +	/*
>>> +	 * If the region is within a pte-mapped area, there is no need to try to
>>> +	 * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may
>>> +	 * change permissions from softirq context so for those cases (which are
>>> +	 * always pte-mapped), we must not go any further because taking the
>>> +	 * mutex below may sleep.
>>> +	 */
>>> +	if (force_pte_mapping() || is_kfence_address((void *)start))
>>> +		return 0;
>>> +
>>>  	/*
>>>  	 * Ensure start and end are at least page-aligned since this is the
>>>  	 * finest granularity we can split to.
>>> @@ -758,30 +778,30 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>>>  	return ret;
>>>  }
>>>
>>> -static int __init split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
>>> -					  unsigned long next,
>>> -					  struct mm_walk *walk)
>>> +static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
>>> +				   unsigned long next, struct mm_walk *walk)
>>>  {
>>> +	gfp_t gfp = *(gfp_t *)walk->private;
>>>  	pud_t pud = pudp_get(pudp);
>>>  	int ret = 0;
>>>
>>>  	if (pud_leaf(pud))
>>> -		ret = split_pud(pudp, pud, GFP_ATOMIC, false);
>>> +		ret = split_pud(pudp, pud, gfp, false);
>>>
>>>  	return ret;
>>>  }
>>>
>>> -static int __init split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
>>> -					  unsigned long next,
>>> -					  struct mm_walk *walk)
>>> +static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
>>> +				   unsigned long next, struct mm_walk *walk)
>>>  {
>>> +	gfp_t gfp = *(gfp_t *)walk->private;
>>>  	pmd_t pmd = pmdp_get(pmdp);
>>>  	int ret = 0;
>>>
>>>  	if (pmd_leaf(pmd)) {
>>>  		if (pmd_cont(pmd))
>>>  			split_contpmd(pmdp);
>>> -		ret = split_pmd(pmdp, pmd, GFP_ATOMIC, false);
>>> +		ret = split_pmd(pmdp, pmd, gfp, false);
>>>
>>>  		/*
>>>  		 * We have split the pmd directly to ptes so there is no need to
>>> @@ -793,9 +813,8 @@ static int __init split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
>>>  	return ret;
>>>  }
>>>
>>> -static int __init split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
>>> -					  unsigned long next,
>>> -					  struct mm_walk *walk)
>>> +static int split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
>>> +				   unsigned long next, struct mm_walk *walk)
>>>  {
>>>  	pte_t pte = __ptep_get(ptep);
>>>
>>> @@ -805,12 +824,24 @@ static int __init split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
>>>  	return 0;
>>>  }
>>>
>>> -static const struct mm_walk_ops split_to_ptes_ops __initconst = {
>>> +static const struct mm_walk_ops split_to_ptes_ops = {
>>>  	.pud_entry	= split_to_ptes_pud_entry,
>>>  	.pmd_entry	= split_to_ptes_pmd_entry,
>>>  	.pte_entry	= split_to_ptes_pte_entry,
>>>  };
>>>
>>> +static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
>>> +{
>>> +	int ret;
>>> +
>>> +	arch_enter_lazy_mmu_mode();
>>> +	ret = walk_kernel_page_table_range_lockless(start, end,
>>> +					&split_to_ptes_ops, NULL, &gfp);
>>> +	arch_leave_lazy_mmu_mode();
>>
>> Why are you entering/leaving lazy mode now? linear_map_split_to_ptes()
>> calls flush_tlb_kernel_range() right after this so now it looks like
>> we have more barriers than we need there.
> 
> Without the lazy mmu block, every write to every pte (or pmd/pud) will cause a
> dsb and isb to be emitted. With the lazy mmu block, we only emit a single
> dsb/isb at the end of the block.
> 
> linear_map_split_to_ptes() didn't previously have a lazy mmu block; that was an
> oversight, I believe. So when refactoring I thought it made sense to make it
> common for both cases.
> 
> Yes, the flush_tlb_kernel_range() also has the barriers, so the lazy mmu mode is
> reducing from a gazillion barriers to 2. We could further optimize from 2 to 1,
> but I doubt the performance improvement will be measurable.
> 
> Perhaps I've misunderstood your point...?

Having thought about this, I just wanted to add a few details. If KFENCE is
enabled and can be used to allocate a page (actually I think it is currently
only used for slab allocations), then allocation of a new page table page (which
can be triggered while splitting the mappings to ptes) may get hooked by KFENCE
and cause a page unprotection request. That in turn enters a nested lazy mmu
block and calls into split_kernel_leaf_mapping().
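
To spell out the theoretical chain I mean (intermediate pageattr helpers
elided):

  range_split_to_ptes()                        /* outer lazy mmu block */
    split_pud()/split_pmd()
      pgtable page allocation                  /* hypothetically KFENCE-backed */
        kfence_protect_page(addr, false)       /* i.e. unprotect */
          set_memory_valid()
            split_kernel_leaf_mapping()        /* early-exits: kfence address */
            nested lazy mmu block for the pte write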

The nested lazy mmu block is safe but non-optimal since we will emit the
barriers when we exit the nested block and will consider ourselves not in lazy
mmu mode for the remainder of the outer block; so we will emit more barriers
than necessary there. That's an existing problem that Kevin has a series to
address.
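
In other words, the nesting currently behaves roughly like this:

	arch_enter_lazy_mmu_mode();	/* outer block */
	...
	arch_enter_lazy_mmu_mode();	/* nested block */
	...
	arch_leave_lazy_mmu_mode();	/* barriers emitted; lazy mode now considered "off" */
	...				/* remainder of outer block emits per-pte barriers */
	arch_leave_lazy_mmu_mode();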

Calling split_kernel_leaf_mapping() is safe because it will exit early: the page
table page whose permissions are being changed must be in a region that is
already pte-mapped (i.e. the KFENCE pool). So it won't attempt to grab the mutex
and there will never be any deadlock. This is a special (slightly simpler) case
of the circular issue that Kevin is trying to solve for kpkeys.

But as I say - as far as I can tell, KFENCE will never be used for a page
allocation so this is theoretical anyway. You could construct a similar argument
with CONFIG_DEBUG_PAGEALLOC, but in that case force_pte_mapping() would be true
and the whole linear map would be mapped by pte from the start and we would
never be trying to split the mappings in the first place.

I'm babbling, but the TL;DR is that I believe this is all safe and optimal in
practice.

> 
>>
>>> +
>>> +	return ret;
>>> +}
>>> +
>>>  static bool linear_map_requires_bbml2 __initdata;
>>>
>>>  u32 idmap_kpti_bbml2_flag;
>>> @@ -847,11 +878,9 @@ static int __init linear_map_split_to_ptes(void *__unused)
>>>  		 * PTE. The kernel alias remains static throughout runtime so
>>>  		 * can continue to be safely mapped with large mappings.
>>>  		 */
>>> -		ret = walk_kernel_page_table_range_lockless(lstart, kstart,
>>> -						&split_to_ptes_ops, NULL, NULL);
>>> +		ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
>>>  		if (!ret)
>>> -			ret = walk_kernel_page_table_range_lockless(kend, lend,
>>> -						&split_to_ptes_ops, NULL, NULL);
>>> +			ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
>>>  		if (ret)
>>>  			panic("Failed to split linear map\n");
>>>  		flush_tlb_kernel_range(lstart, lend);
>>> @@ -1002,6 +1031,27 @@ static void __init arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp)
>>>  	memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE);
>>>  	__kfence_pool = phys_to_virt(kfence_pool);
>>>  }
>>> +
>>> +bool arch_kfence_init_pool(void)
>>> +{
>>> +	unsigned long start = (unsigned long)__kfence_pool;
>>> +	unsigned long end = start + KFENCE_POOL_SIZE;
>>> +	int ret;
>>> +
>>> +	/* Exit early if we know the linear map is already pte-mapped. */
>>> +	if (!system_supports_bbml2_noabort() || force_pte_mapping())
>>> +		return true;
>>> +
>>> +	/* Kfence pool is already pte-mapped for the early init case. */
>>> +	if (kfence_early_init)
>>> +		return true;
>>> +
>>> +	mutex_lock(&pgtable_split_lock);
>>> +	ret = range_split_to_ptes(start, end, GFP_PGTABLE_KERNEL);
>>> +	mutex_unlock(&pgtable_split_lock);

I think it's worth adding a comment here to say that TLB invalidation is not
required since we know that bbml2 is supported. So while we have split the
pgtables, the TLB may still contain the larger mappings; these will be naturally
evicted over time, and if/when the permissions are changed for smaller regions,
those ranges will be invalidated, which also removes any intersecting large TLB
entries.
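
Something along these lines (my rough wording, to be refined):

	/*
	 * No TLB invalidation required here: with BBML2_NOABORT the pgtables
	 * can be split while the TLB still holds the old larger entries. Those
	 * stale entries are evicted naturally over time, and any later
	 * permission change on a sub-region invalidates that range, which also
	 * removes any intersecting larger entries.
	 */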

I'll add for the next version.

Thanks,
Ryan


>>> +
>>> +	return ret ? false : true;
>>
>> return !ret
> 
> Yes good point - will fix in the next spin.
> 
> Thanks,
> Ryan
> 
>>
>> Will
>
Re: [PATCH v1] arm64: mm: Don't sleep in split_kernel_leaf_mapping() when in atomic context
Posted by David Hildenbrand (Red Hat) 3 months ago
>   }
> 
> +static inline bool force_pte_mapping(void)
> +{
> +	bool bbml2 = system_capabilities_finalized() ?
> +		system_supports_bbml2_noabort() : cpu_supports_bbml2_noabort();

You are only moving this function. Still, there is some room for improvement I want to point out :)

bbml2 could be a const (or a helper function like bbml2_supported).

> +
> +	return (!bbml2 && (rodata_full || arm64_kfence_can_set_direct_map() ||
> +			   is_realm_world())) ||
> +		debug_pagealloc_enabled();


I suspect this could be made a bit easier to read.

	if (debug_pagealloc_enabled())
		return true;
	if (bbml2)
		return false;
	return rodata_full || arm64_kfence_can_set_direct_map() || is_realm_world();


> +}
> +
>   static DEFINE_MUTEX(pgtable_split_lock);
> 
>   int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
> @@ -723,6 +733,16 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>   	if (!system_supports_bbml2_noabort())
>   		return 0;
> 
> +	/*
> +	 * If the region is within a pte-mapped area, there is no need to try to
> +	 * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may
> +	 * change permissions from softirq context so for those cases (which are
> +	 * always pte-mapped), we must not go any further because taking the
> +	 * mutex below may sleep.
> +	 */
> +	if (force_pte_mapping() || is_kfence_address((void *)start))
> +		return 0;
> +

We're effectively performing two system_supports_bbml2_noabort() checks, similarly in
arch_kfence_init_pool().

I wonder if there is a clean way to avoid that.

I'm not super up-to-date on that code. Nothing else jumped at me.

-- 
Cheers

David
Re: [PATCH v1] arm64: mm: Don't sleep in split_kernel_leaf_mapping() when in atomic context
Posted by Ryan Roberts 3 months ago
On 03/11/2025 15:38, David Hildenbrand (Red Hat) wrote:
> 
>>   }
>>
>> +static inline bool force_pte_mapping(void)
>> +{
>> +    bool bbml2 = system_capabilities_finalized() ?
>> +        system_supports_bbml2_noabort() : cpu_supports_bbml2_noabort();
> 
> You are only moving this function. Still, there is some room for improvement I
> want to point out :)
> 
> bbml2 could be a const (or a helper function like bbml2_supported).
> 
>> +
>> +    return (!bbml2 && (rodata_full || arm64_kfence_can_set_direct_map() ||
>> +               is_realm_world())) ||
>> +        debug_pagealloc_enabled();
> 
> 
> I suspect this could be made a bit easier to read.
> 
>     if (debug_pagealloc_enabled())
>         return true;
>     if (bbml2)
>         return false;
>     return rodata_full || arm64_kfence_can_set_direct_map() || is_realm_world();

Yeah, I guess that's a bit nicer. I'd prefer to tidy it up in a separate commit
though. (feel free ;-) )

> 
> 
>> +}
>> +
>>   static DEFINE_MUTEX(pgtable_split_lock);
>>
>>   int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>> @@ -723,6 +733,16 @@ int split_kernel_leaf_mapping(unsigned long start,
>> unsigned long end)
>>       if (!system_supports_bbml2_noabort())
>>           return 0;
>>
>> +    /*
>> +     * If the region is within a pte-mapped area, there is no need to try to
>> +     * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may
>> +     * change permissions from softirq context so for those cases (which are
>> +     * always pte-mapped), we must not go any further because taking the
>> +     * mutex below may sleep.
>> +     */
>> +    if (force_pte_mapping() || is_kfence_address((void *)start))
>> +        return 0;
>> +
> 
> We're effectively performing two system_supports_bbml2_noabort() checks,
> similarly in
> arch_kfence_init_pool().
> 
> I wonder if there is a clean way to avoid that.

I thought about this too. But system_supports_bbml2_noabort() is actually a
magic alternatives patching thing; the code is updated so it's zero overhead. I
decided this was the simplest and clearest way to do it. But I'm open to other
ideas...

> 
> I'm not super up-to-date on that code. Nothing else jumped at me.

Thanks for the review!

> 

Re: [PATCH v1] arm64: mm: Don't sleep in split_kernel_leaf_mapping() when in atomic context
Posted by David Hildenbrand (Red Hat) 3 months ago
On 03.11.25 18:28, Ryan Roberts wrote:
> On 03/11/2025 15:38, David Hildenbrand (Red Hat) wrote:
>>
>>>    }
>>>
>>> +static inline bool force_pte_mapping(void)
>>> +{
>>> +    bool bbml2 = system_capabilities_finalized() ?
>>> +        system_supports_bbml2_noabort() : cpu_supports_bbml2_noabort();
>>
>> You are only moving this function. Still, there is some room for improvement I
>> want to point out :)
>>
>> bbml2 could be a const (or a helper function like bbml2_supported).
>>
>>> +
>>> +    return (!bbml2 && (rodata_full || arm64_kfence_can_set_direct_map() ||
>>> +               is_realm_world())) ||
>>> +        debug_pagealloc_enabled();
>>
>>
>> I suspect this could be made a bit easier to read.
>>
>>      if (debug_pagealloc_enabled())
>>          return true;
>>      if (bbml2)
>>          return false;
>>      return rodata_full || arm64_kfence_can_set_direct_map() || is_realm_world();
> 
> Yeah, I guess that's a bit nicer. I'd prefer to tidy it up in as separate commit
> though. (feel free ;-) )

Separate commit is fine (hoping you can do it once this lands :P ).

> 
>>
>>
>>> +}
>>> +
>>>    static DEFINE_MUTEX(pgtable_split_lock);
>>>
>>>    int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>>> @@ -723,6 +733,16 @@ int split_kernel_leaf_mapping(unsigned long start,
>>> unsigned long end)
>>>        if (!system_supports_bbml2_noabort())
>>>            return 0;
>>>
>>> +    /*
>>> +     * If the region is within a pte-mapped area, there is no need to try to
>>> +     * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may
>>> +     * change permissions from softirq context so for those cases (which are
>>> +     * always pte-mapped), we must not go any further because taking the
>>> +     * mutex below may sleep.
>>> +     */
>>> +    if (force_pte_mapping() || is_kfence_address((void *)start))
>>> +        return 0;
>>> +
>>
>> We're effectively performing two system_supports_bbml2_noabort() checks,
>> similarly in
>> arch_kfence_init_pool().
>>
>> I wonder if there is a clean way to avoid that.
> 
> I thought about this too. But system_supports_bbml2_noabort() is actually a
> magic alternatives patching thing; 

Makes sense, so likely just another nop in the final code.

> the code is updated so it's zero overhead. I
> decided this was the simplest and clearest way to do it. But I'm open to other
> ideas...

Given that we have two such call sequences, I was wondering if we could 
have a helper that better expresses+documents the desired semantics.

static bool pte_leaf_split_possible(void)
{
	/*
	 * !BBML2_NOABORT systems should never run into scenarios where
	 * we would have to split. So exit early and let calling code
	 * detect it + raise a warning.
	 */
	if (!system_supports_bbml2_noabort())
		return false;
	return force_pte_mapping();
}

Something like that maybe.

-- 
Cheers

David