saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)

[RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)

Posted by Al Viro 2 months, 1 week ago

On Sat, Nov 29, 2025 at 03:37:28AM +0000, Al Viro wrote:

> AFAICS, 32bit arm is similar to 32bit x86 in that respect; propagation
> is lazier, though - there arch_sync_kernel_mappings() bumps a counter
> in init_mm and context switches use that to check if propagation needs
> to be done.  No idea how well does that work on vfree() side of things -
> hadn't looked into that rabbit hole...

BTW, speaking of vmalloc space - does anybody object against sorting
CONFIG_ALPHA_LARGE_VMALLOC out, so that we wouldn't need to mess
with that in alpha page fault handler?

Basically, do what amd64 does - something along the lines of (untested)
patch below.  Comments?

[PATCH] alpha: take the LARGE_VMALLOC kludge out

Support of vmalloc space that won't fit into the single L1 slot had
been a headache for quite a while.

The only things we use seg1 for are virtual mapping of page tables
(at the last 8G) and vmalloc space (below that).  Normal setup has
vmalloc space from -16G to -8G, occupying the penultimate L1 slot.
It is set up (with table sitting just after the kernel image) very
early, by callback_init().  pgd_alloc() copies that entry when
setting a new L1 table up, and it's never changed afterwards.
All page table changes done by vmalloc() are done to tables that
are shared between all threads, avoiding the need to synchronize
them.

It would be trivial to extend that - preallocate L2 tables to
cover the entire vmalloc space (8Kb for each 8Gb of that) and
set all the L1 slots involved before anything gets forked,
then copy these slots on pgd_alloc().  Unfortunately, that
had been done in a different way - only one L2 table is
preallocated, the rest gets created on demand, which means
that we need to propagate changes to threads' L1 tables.
It's kinda-sorta handled in do_page_fault(), but it's racy and
fixing that up would be a major headache.

Bugger that for the game of soldiers - do what e.g. amd64 does
and preallocate these in mem_init().  And replace the bool
ALPHA_LARGE_VMALLOC with int ALPHA_VMALLOC_SIZE (in gigabytes),
dependent upon EXPERT and defaulting to 8.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 80367f2cf821..36cbba4e21d9 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -410,20 +410,21 @@ config ALPHA_WTINT
 
 	  If unsure, say N.
 
-# LARGE_VMALLOC is racy, if you *really* need it then fix it first
-config ALPHA_LARGE_VMALLOC
-	bool
+config ALPHA_VMALLOC_SIZE
+	int "vmalloc space (in gigabytes)" if EXPERT
+	default "8"
+	range 8 2040
 	help
-	  Process creation and other aspects of virtual memory management can
-	  be streamlined if we restrict the kernel to one PGD for all vmalloc
-	  allocations.  This equates to about 8GB.
+	  We preallocate the second-level page tables to cover the entire
+	  vmalloc area; one 8Kb page for each 8Gb.
 
-	  Under normal circumstances, this is so far and above what is needed
-	  as to be laughable.  However, there are certain applications (such
-	  as benchmark-grade in-kernel web serving) that can make use of as
-	  much vmalloc space as is available.
+	  Default is 8Gb total and under normal circumstances, this is so
+	  far and above what is needed as to be laughable.  However, there are
+	  certain applications (such as benchmark-grade in-kernel web serving)
+	  that can make use of as much vmalloc space as is available.
 
-	  Say N unless you know you need gobs and gobs of vmalloc space.
+	  Leave it at 8 unless you know you need gobs and gobs of
+	  vmalloc space.
 
 config VERBOSE_MCHECK
 	bool "Verbose Machine Checks"
diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h
index 90e7a9539102..0f554d01fe54 100644
--- a/arch/alpha/include/asm/pgtable.h
+++ b/arch/alpha/include/asm/pgtable.h
@@ -49,11 +49,8 @@ struct vm_area_struct;
 /* Number of pointers that fit on a page:  this will go away. */
 #define PTRS_PER_PAGE	(1UL << (PAGE_SHIFT-3))
 
-#ifdef CONFIG_ALPHA_LARGE_VMALLOC
-#define VMALLOC_START		0xfffffe0000000000
-#else
-#define VMALLOC_START		(-2*PGDIR_SIZE)
-#endif
+#define VMALLOC_SLOTS		DIV_ROUND_UP(CONFIG_ALPHA_VMALLOC_SIZE, 8)
+#define VMALLOC_START		(-(VMALLOC_SLOTS + 1)*PGDIR_SIZE)
 #define VMALLOC_END		(-PGDIR_SIZE)
 
 /*
diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index a9816bbc9f34..0bc5fc4d510e 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -111,10 +111,6 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	if (!mm || faulthandler_disabled())
 		goto no_context;
 
-#ifdef CONFIG_ALPHA_LARGE_VMALLOC
-	if (address >= TASK_SIZE)
-		goto vmalloc_fault;
-#endif
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
@@ -225,24 +221,4 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
  do_sigsegv:
 	force_sig_fault(SIGSEGV, si_code, (void __user *) address);
 	return;
-
-#ifdef CONFIG_ALPHA_LARGE_VMALLOC
- vmalloc_fault:
-	if (user_mode(regs))
-		goto do_sigsegv;
-	else {
-		/* Synchronize this task's top level page-table
-		   with the "reference" page table from init.  */
-		long index = pgd_index(address);
-		pgd_t *pgd, *pgd_k;
-
-		pgd = current->active_mm->pgd + index;
-		pgd_k = swapper_pg_dir + index;
-		if (!pgd_present(*pgd) && pgd_present(*pgd_k)) {
-			pgd_val(*pgd) = pgd_val(*pgd_k);
-			return;
-		}
-		goto no_context;
-	}
-#endif
 }
diff --git a/arch/alpha/mm/init.c b/arch/alpha/mm/init.c
index 4c5ab9cd8a0a..e5eea8b05c7f 100644
--- a/arch/alpha/mm/init.c
+++ b/arch/alpha/mm/init.c
@@ -45,13 +45,10 @@ pgd_alloc(struct mm_struct *mm)
 	ret = __pgd_alloc(mm, 0);
 	init = pgd_offset(&init_mm, 0UL);
 	if (ret) {
-#ifdef CONFIG_ALPHA_LARGE_VMALLOC
-		memcpy (ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
-			(PTRS_PER_PGD - USER_PTRS_PER_PGD - 1)*sizeof(pgd_t));
-#else
-		pgd_val(ret[PTRS_PER_PGD-2]) = pgd_val(init[PTRS_PER_PGD-2]);
-#endif
-
+		for (int i = 0; i < VMALLOC_SLOTS; i++) {
+			pgd_val(ret[PTRS_PER_PGD - VMALLOC_SLOTS - 1 + i]) =
+			pgd_val(init[PTRS_PER_PGD - VMALLOC_SLOTS - 1 + i]);
+		}
 		/* The last PGD entry is the VPTB self-map.  */
 		pgd_val(ret[PTRS_PER_PGD-1])
 		  = pte_val(mk_pte(virt_to_page(ret), PAGE_KERNEL));
@@ -148,9 +145,10 @@ callback_init(void * kernel_end)
 	   On systems with larger consoles, additional pages will be
 	   allocated as needed during the mapping process.
 
-	   In the case of not SRM, but not CONFIG_ALPHA_LARGE_VMALLOC,
-	   we need to allocate the PGD we use for vmalloc before we start
-	   forking other tasks.  */
+	   In any case we need to allocate a PGD we use for vmalloc
+	   before we start forking other tasks.  If vmalloc wants more
+	   than one PGD slot, allocate the rest later (at mem_init() -
+	   it's still early enough).  */
 
 	two_pages = (void *)
 	  (((unsigned long)kernel_end + ~PAGE_MASK) & PAGE_MASK);
@@ -246,6 +244,22 @@ srm_paging_stop (void)
 }
 #endif
 
+void __init mem_init(void)
+{
+	// first slot already filled by callback_init()
+	unsigned long addr = VMALLOC_START + PGDIR_SIZE;
+
+	while (addr < VMALLOC_END) {
+		pgd_t *pgd = pgd_offset_k(addr);
+		p4d_t *p4d = p4d_offset(pgd, addr);
+		pud_t *pud = pud_offset(p4d, addr);
+		pmd_t *pmd = pmd_alloc(&init_mm, pud, addr);
+		if (!pmd)
+			panic("can't preallocate tables for vmalloc");
+		addr += PGDIR_SIZE;
+	}
+}
+
 static const pgprot_t protection_map[16] = {
 	[VM_NONE]					= _PAGE_P(_PAGE_FOE | _PAGE_FOW |
 								  _PAGE_FOR),
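
The slot arithmetic behind VMALLOC_SLOTS and VMALLOC_START in the patch above can be checked in userspace. What follows is only a sketch, assuming alpha's 8Kb pages and three-level tables (which is where the "8Kb for each 8Gb" figure comes from); the TOY_* macros and helper names are made up for illustration and are not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 8Kb pages, 1024 eight-byte entries per table. */
#define TOY_PAGE_SHIFT   13
#define TOY_PGDIR_SHIFT  (TOY_PAGE_SHIFT + 2 * (TOY_PAGE_SHIFT - 3)) /* 33 */
#define TOY_PGDIR_SIZE   (UINT64_C(1) << TOY_PGDIR_SHIFT)            /* 8Gb */
#define TOY_PTRS_PER_PGD (1u << (TOY_PAGE_SHIFT - 3))                /* 1024 */

/* Number of L1 slots needed: DIV_ROUND_UP(size_gb, 8). */
static uint64_t vmalloc_slots(uint64_t size_gb)
{
	assert(size_gb >= 8 && size_gb <= 2040);	/* Kconfig range */
	return (size_gb + 7) / 8;
}

/* -(slots + 1) * PGDIR_SIZE; the "+ 1" skips the last slot (VPTB self-map). */
static uint64_t vmalloc_start(uint64_t size_gb)
{
	return (uint64_t)0 - (vmalloc_slots(size_gb) + 1) * TOY_PGDIR_SIZE;
}

/* -PGDIR_SIZE: vmalloc space ends where the last L1 slot begins. */
static uint64_t vmalloc_end(void)
{
	return (uint64_t)0 - TOY_PGDIR_SIZE;
}

/* Index of the L1 slot covering addr. */
static unsigned pgd_index_of(uint64_t addr)
{
	return (unsigned)((addr >> TOY_PGDIR_SHIFT) & (TOY_PTRS_PER_PGD - 1));
}

/* L2 pages left for mem_init(); the first one comes from callback_init(). */
static uint64_t extra_l2_pages(uint64_t size_gb)
{
	return vmalloc_slots(size_gb) - 1;
}
```

With the default of 8 that gives one slot at index 1022, i.e. PTRS_PER_PGD-2, same as the old non-LARGE_VMALLOC setup.  With the maximum of 2040 it is 255 slots and a start of 0xfffffe0000000000, the constant the old LARGE_VMALLOC case had hardwired.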

Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)

Posted by Linus Torvalds 2 months, 1 week ago

On Sat, 29 Nov 2025 at 19:01, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> +         Default is 8Gb total and under normal circumstances, this is so
> +         far and above what is needed as to be laughable.  However, there are
> +         certain applications (such as benchmark-grade in-kernel web serving)
> +         that can make use of as much vmalloc space as is available.

I wonder if we even need the config variable?

Because this reads like the whole feature exists due to the old 'tux'
web server thing (from the early 2000's - long long gone, never merged
upstream).

So I'm not sure there are any actual real use-cases for tons of
vmalloc space on alpha.

Anyway, I see no real objections to the patch, only a "maybe it could
be cut down even more".

                Linus

Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)

Posted by Al Viro 2 months, 1 week ago

On Sun, Nov 30, 2025 at 02:16:13PM -0800, Linus Torvalds wrote:
> On Sat, 29 Nov 2025 at 19:01, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > +         Default is 8Gb total and under normal circumstances, this is so
> > +         far and above what is needed as to be laughable.  However, there are
> > +         certain applications (such as benchmark-grade in-kernel web serving)
> > +         that can make use of as much vmalloc space as is available.
> 
> I wonder if we even need the config variable?
> 
> Because this reads like the whole feature exists due to the old 'tux'
> web server thing (from the early 2000's - long long gone, never merged
> upstream).
> 
> So I'm not sure there are any actual real use-cases for tons of
> vmalloc space on alpha.
> 
> Anyway, I see no real objections to the patch, only a "maybe it could
> be cut down even more".

FWIW, I'm trying to figure out what's going on with amd64 in that area;
we used to do allocate-on-demand until 2020, when Joerg went for "let's
preallocate them" and killed arch_sync_kernel_mappings(), which got
reverted soon after, only to be brought back when Joerg had fixed the
bug in preallocation.  It stayed that way until this August, when
commit 6659d027998083fbb6d42a165b0c90dc2e8ba989
Author: Harry Yoo <harry.yoo@oracle.com>
Date:   Mon Aug 18 11:02:06 2025 +0900

     x86/mm/64: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings()

happened, with reference to this
commit 8d400913c231bd1da74067255816453f96cd35b0
Author: Oscar Salvador <osalvador@suse.de>
Date:   Thu Apr 29 22:57:19 2021 -0700

     x86/vmemmap: handle unpopulated sub-pmd ranges

What I don't understand is how does that manage to avoid the same race -
on #PF amd64 does not bother with vmalloc_fault logics.  Exact same
scenario with two vmalloc() on different CPUs would seem to apply here
as well...

Which callers of arch_sync_kernel_mappings() are involved?  If it's
anything in mm/vmalloc.c, I really don't see how that could be correct;
if it's about apply_to_page_range() and calls never hit vmalloc space,
we might be OK, but it would be nice to have described somewhere...

Am I missing something obvious here?

Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)

Posted by david laight 2 months, 1 week ago

On Sun, 30 Nov 2025 03:01:46 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> On Sat, Nov 29, 2025 at 03:37:28AM +0000, Al Viro wrote:
> 
> > AFAICS, 32bit arm is similar to 32bit x86 in that respect; propagation
> > is lazier, though - there arch_sync_kernel_mappings() bumps a counter
> > in init_mm and context switches use that to check if propagation needs
> > to be done.  No idea how well does that work on vfree() side of things -
> > hadn't looked into that rabbit hole...  
> 
> BTW, speaking of vmalloc space - does anybody object against sorting
> CONFIG_ALPHA_LARGE_VMALLOC out, so that we wouldn't need to mess
> with that in alpha page fault handler?
> 
> Basically, do what amd64 does - something along the lines of (untested)
> patch below.  Comments?

How difficult would it be to allocate the pte for the next 8GB on demand
inside vmalloc(), and then propagate it to the per-task page tables?
That is a path that can sleep, so being slow if it needs to synchronise
with other cpu shouldn't matter - especially since it won't happen often.

That should be moderately generic code and would let the vmalloc limit
be 'soft'; perhaps based on physical memory size, and even be raisable
from a sysctl.

Likely more use for very large x86-64 and arm-64 systems than alpha.

	David

Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)

Posted by Al Viro 2 months, 1 week ago

On Sun, Nov 30, 2025 at 11:32:13AM +0000, david laight wrote:

> How difficult would it be to allocate the pte for the next 8GB on demand
> inside vmalloc(), and then propagate it to the per-task page tables?
> That is a path that can sleep, so being slow if it needs to synchronise
> with other cpu shouldn't matter - especially since it won't happen often.
> 
> That should be moderately generic code and would let the vmalloc limit
> be 'soft'; perhaps based on physical memory size, and even be raisable
> from a sysctl.

Considerable headache and pretty pointless, at that.  Note that >8G vmalloc
space on alpha had been racy all along (and known to be that); it was
basically "could we squeeze more out of khttpd" kind of fun.

Do we have realistic vmalloc-crazy loads with high fragmentation of vmalloc
space and total footprint worth bothering with that?

Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)

Posted by david laight 2 months, 1 week ago

On Sun, 30 Nov 2025 16:43:48 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> On Sun, Nov 30, 2025 at 11:32:13AM +0000, david laight wrote:
> 
> > How difficult would it be to allocate the pte for the next 8GB on demand
> > inside vmalloc(), and then propagate it to the per-task page tables?
> > That is a path that can sleep, so being slow if it needs to synchronise
> > with other cpu shouldn't matter - especially since it won't happen often.
> > 
> > That should be moderately generic code and would let the vmalloc limit
> > be 'soft'; perhaps based on physical memory size, and even be raisable
> > from a sysctl.  
> 
> Considerable headache and pretty pointless, at that.  Note that >8G vmalloc
> space on alpha had been racy all along (and known to be that); it was
> basically "could we squeeze more out of khttpd" kind of fun.
> 
> Do we have realistic vmalloc-crazy loads with high fragmentation of vmalloc
> space and total footprint worth bothering with that?
> 

I doubt it matters for alpha - I suspect you could just nuke ALPHA_LARGE_VMALLOC.
At a guess it was written way back when the biggest/fastest systems you could
get were alpha.

I was more thinking about modern 64-bit systems where you might want
to run a distro kernel on systems with relatively small amounts of RAM and
others with 100s of CPUs and multiple TB of RAM.
I can well imagine workloads for the latter that might run out of vmalloc space.
In some situations even getting a command line parameter in can be hard,
so you might want it to be a sysctl - even if changing that is what does the
update.
(Doing the updates in the page fault handler definitely sounds like a recipe
for disaster.)

Note that I've not looked at where amd64 gets the limit for mem_init().
Maybe it tries to 'guess' the correct value for the system.
But it is likely to be workload related - so allocating 8K for every 8G
of physical memory (one option) may be wasteful.

	David

Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)

Posted by Al Viro 2 months, 1 week ago

On Sun, Nov 30, 2025 at 07:03:31PM +0000, david laight wrote:

> (Doing the updates in the page fault handler definitely sounds like a recipe
> for disaster.)

See the comments in arch/x86/mm/fault.c regarding the need to do it in
#PF on i386.  Basically, you have

CPU1: does vmalloc(), needs to grab a new slot in top-level table.
Updates the page table tree for init_mm (rooted at swapper_pg_dir).
Starts propagating that change to other threads.

CPU2: does vmalloc(), which grabs another address in the range
covered by the same slot.  It works with the same page table tree,
so it sees that slot already occupied, inserts a new page reference
into the page table hanging off it.  No top-level changes to
propagate, so it returns the address it has grabbed to the caller.

CPU2: caller of vmalloc() dereferences the returned object.
If propagation from CPU1 has not reached the top-level page table
CPU2 is using, the top-level slot is empty and MMU of CPU2 raises
#PF.

BTW, it might be possible for two parallel allocations to grab two
areas, each requiring grabbing its own top-level slot ;-/
It certainly can happen on x86 - two vmalloc(SZ_4M) in parallel
will do just that, but with NUMA you can get it on 64bit boxen
as well.

So simple generation count on root page table won't solve it either...
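
The CPU1/CPU2 sequence above can be replayed with a toy model.  This is only a sketch under loud assumptions: a two-level tree, four top-level slots, and helper names (toy_vmalloc(), toy_access(), toy_propagate()) invented for the illustration; none of it is kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_TOP_SLOTS 4
#define TOY_L2_ENTRIES 8

/* Second-level table: shared by everybody once a top-level slot points at it. */
struct toy_l2 { bool mapped[TOY_L2_ENTRIES]; };

/* Top-level table: either the reference one (init_mm) or a per-task copy. */
struct toy_top { struct toy_l2 *slot[TOY_TOP_SLOTS]; };

static struct toy_l2 toy_l2_pool[TOY_TOP_SLOTS];

/*
 * Map one page in the reference tree.  Returns true if a new top-level
 * slot had to be filled, i.e. if there is something to propagate.
 */
static bool toy_vmalloc(struct toy_top *ref, unsigned top, unsigned idx)
{
	bool new_slot = false;

	assert(top < TOY_TOP_SLOTS && idx < TOY_L2_ENTRIES);
	if (!ref->slot[top]) {
		ref->slot[top] = &toy_l2_pool[top];
		new_slot = true;
	}
	ref->slot[top]->mapped[idx] = true;
	return new_slot;
}

/* Walk through a given top-level table; false stands for #PF. */
static bool toy_access(const struct toy_top *t, unsigned top, unsigned idx)
{
	return t->slot[top] && t->slot[top]->mapped[idx];
}

/* Copy one top-level slot from the reference table into a task's table. */
static void toy_propagate(struct toy_top *t, const struct toy_top *ref,
			  unsigned top)
{
	t->slot[top] = ref->slot[top];
}
```

Replaying the scenario: the first toy_vmalloc() on a slot returns true (CPU1, propagation pending), the second one on the same slot returns false (CPU2, nothing to propagate from its point of view), and toy_access() through a table that has not been reached by toy_propagate() yet fails even though the page is mapped in the reference tree.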

Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)

Posted by Al Viro 2 months, 1 week ago

On Sun, Nov 30, 2025 at 08:31:18PM +0000, Al Viro wrote:
> It certainly can happen on x86 - two vmalloc(SZ_4M) in parallel

32bit x86, that is.

> will do just that, but with NUMA you can get it on 64bit boxen
> as well.

Re: [RFC][alpha] saner vmalloc handling (was Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context)

Posted by Magnus Lindholm 2 months, 1 week ago

On Sun, Nov 30, 2025 at 5:43 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sun, Nov 30, 2025 at 11:32:13AM +0000, david laight wrote:
>
> > How difficult would it be to allocate the pte for the next 8GB on demand
> > inside vmalloc(), and then propagate it to the per-task page tables?
> > That is a path that can sleep, so being slow if it needs to synchronise
> > with other cpu shouldn't matter - especially since it won't happen often.
> >
> > That should be moderately generic code and would let the vmalloc limit
> > be 'soft'; perhaps based on physical memory size, and even be raisable
> > from a sysctl.
>
> Considerable headache and pretty pointless, at that.  Note that >8G vmalloc
> space on alpha had been racy all along (and known to be that); it was
> basically "could we squeeze more out of khttpd" kind of fun.
>
> Do we have realistic vmalloc-crazy loads with high fragmentation of vmalloc
> space and total footprint worth bothering with that?

Hi everyone,

In my opinion, for Alpha I’d prefer the static preallocation model, as
it fixes the LARGE_VMALLOC race cleanly and keeps the fault path
straightforward. I don’t see many realistic Alpha workloads that would
benefit from a more complex or dynamic vmalloc setup, and compile-time
adjustment seems sufficient. Al’s version solves the issues without
adding new moving parts, which feels like the right tradeoff. Removing
code that has never worked properly should not cause any harm.

FWIW, I applied Al's patch and I'm running it now on my XP1000.
Seems to work as-is.

Regards

Magnus
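
For comparison, the static preallocation model discussed in this thread can be sketched the same way.  Again a toy under stated assumptions: eight top-level slots, three of them for vmalloc, the last one left alone (standing in for the self-map), and made-up names throughout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_NR_SLOTS 8
#define TOY_VM_SLOTS 3
/* The last slot is not part of vmalloc space, hence the extra -1. */
#define TOY_FIRST_VM (TOY_NR_SLOTS - TOY_VM_SLOTS - 1)

struct toy_l2 { int pte[4]; };
struct toy_pgd { struct toy_l2 *slot[TOY_NR_SLOTS]; };

static struct toy_l2 toy_boot_l2[TOY_VM_SLOTS];

/* Boot time, before anything is forked: fill every vmalloc slot once. */
static void toy_prealloc(struct toy_pgd *ref)
{
	for (int i = 0; i < TOY_VM_SLOTS; i++)
		ref->slot[TOY_FIRST_VM + i] = &toy_boot_l2[i];
}

/* New address space: copy the vmalloc slots; they never change afterwards. */
static void toy_pgd_alloc(struct toy_pgd *mm, const struct toy_pgd *ref)
{
	assert(ref->slot[TOY_FIRST_VM]);	/* toy_prealloc() must have run */
	memset(mm, 0, sizeof(*mm));
	for (int i = 0; i < TOY_VM_SLOTS; i++)
		mm->slot[TOY_FIRST_VM + i] = ref->slot[TOY_FIRST_VM + i];
}
```

Since every table created after toy_prealloc() points at the same second-level tables, anything written through the reference tree later is visible through every copy, with nothing to propagate and nothing for the fault handler to fix up.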