hash_name() may cross page boundary and trigger sleep in RCU context

[Bug report] hash_name() may cross page boundary and trigger sleep in RCU context

Posted by Xie Yuanbin 2 months, 1 week ago

On Tue, 2 Dec 2025 14:07:25 -0800, Linus Torvalds wrote:
> On Tue, 2 Dec 2025 at 04:43, Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
>>
>> What I'm thinking is to address both of these by handling kernel space
>> page faults (which will be permission or PTE-not-present) separately
>> (not even build tested):
>
> That patch looks sane to me.
>
> But I also didn't build test it, just scanned it visually ;)

That patch removes harden_branch_predictor() from __do_user_fault(), and
moves it to do_page_fault()->do_kernel_address_page_fault(). 
This resolves previously mentioned kernel warning issue. However,
__do_user_fault() is not only called by do_page_fault(), it is
alse called by do_bad_area(), do_sect_fault() and do_translation_fault().

So I think that some harden_branch_predictor() is missing on other paths.
According to my tests, when CONFIG_ARM_LPAE=n, harden_branch_predictor()
will never be called anymore, even if a user program trys to access the
kernel address.

Or perhaps I've misunderstood something, could you please point it out?
Thank you very much.

What about something like this (The patch has been tested):
```patch
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 2bc828a1940c..af86198631c5 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -183,9 +183,11 @@ __do_user_fault(unsigned long addr, unsigned int fsr, unsigned int sig,
 {
 	struct task_struct *tsk = current;

-	if (addr > TASK_SIZE)
+	if (addr >= TASK_SIZE)
 		harden_branch_predictor();

+	local_irq_enable();
+
 #ifdef CONFIG_DEBUG_USER
 	if (((user_debug & UDBG_SEGV) && (sig == SIGSEGV)) ||
 	    ((user_debug & UDBG_BUS)  && (sig == SIGBUS))) {
@@ -272,6 +274,24 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	if (kprobe_page_fault(regs, fsr))
 		return 0;

+	if (unlikely(addr >= TASK_SIZE)) {
+		/*
+		 * Fault from user mode for a kernel space address. User mode
+		 * should not be faulting in kernel space, which includes the
+		 * vector/khelper page. Handle the Spectre issues while
+		 * interrupts are still disabled, then send a SIGSEGV. Note
+		 * that __do_user_fault() will enable interrupts.
+		 *
+		 * Fault from kernel mode. Jump to __do_kernel_fault()->
+		 * fixup_exception() directly, without getting mm lock and
+		 * finding vma. The interrupts are not enabled but it will be
+		 * good, just like what do_translation_fault() and
+		 * do_bad_area() does.
+		 */
+		fault = 0;
+		code = SEGV_MAPERR;
+		goto bad_area;
+	}

 	/* Enable interrupts if they were enabled in the parent context. */
 	if (interrupts_enabled(regs))

```

Thanks very much!

Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context

Posted by Russell King (Oracle) 2 months ago

On Wed, Dec 03, 2025 at 09:48:00AM +0800, Xie Yuanbin wrote:
> On Tue, 2 Dec 2025 14:07:25 -0800, Linus Torvalds wrote:
> > On Tue, 2 Dec 2025 at 04:43, Russell King (Oracle)
> > <linux@armlinux.org.uk> wrote:
> >>
> >> What I'm thinking is to address both of these by handling kernel space
> >> page faults (which will be permission or PTE-not-present) separately
> >> (not even build tested):
> >
> > That patch looks sane to me.
> >
> > But I also didn't build test it, just scanned it visually ;)
> 
> That patch removes harden_branch_predictor() from __do_user_fault(), and
> moves it to do_page_fault()->do_kernel_address_page_fault(). 
> This resolves previously mentioned kernel warning issue. However,
> __do_user_fault() is not only called by do_page_fault(), it is
> alse called by do_bad_area(), do_sect_fault() and do_translation_fault().
> 
> So I think that some harden_branch_predictor() is missing on other paths.
> According to my tests, when CONFIG_ARM_LPAE=n, harden_branch_predictor()
> will never be called anymore, even if a user program trys to access the
> kernel address.
> 
> Or perhaps I've misunderstood something, could you please point it out?
> Thank you very much.

Right, let's split these issues into separate patches. Please test this
patch, which should address only the hash_name() fault issue, and
provides the basis for fixing the branch predictor issue.

Yes, at the moment, do_kernel_address_page_fault() looks very much like
do_bad_area(), but with the addition of the IRQ-enable if the parent
context was enabled, but the following patch to address the branch
predictor hardening will show why its different.

In my opinion, this approach makes the handling for kernel address
page faults (non-present pages and page permission faults) much easier
to understand.

Note that this will call __do_user_fault() with interrupts disabled.

Build tested, and remotely boot tested on Cortex-A5 hardware but
without kfence enabled. Also tested usermode access to kernel space
which fails with SEGV:
- read from 0xc0000000 (section permission fault, do_sect_fault)
- read from 0xffff2000 (page translation fault, do_page_fault)
- read from 0xffff0000 (vectors page - read possible as expected)
- write to 0xffff0000 (page permission fault, do_page_fault)

8<===
From: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk>
Subject: [PATCH] ARM: fix hash_name() fault

Zizhi Wo reports:

"During the execution of hash_name()->load_unaligned_zeropad(), a
 potential memory access beyond the PAGE boundary may occur. For
 example, when the filename length is near the PAGE_SIZE boundary.
 This triggers a page fault, which leads to a call to
 do_page_fault()->mmap_read_trylock(). If we can't acquire the lock,
 we have to fall back to the mmap_read_lock() path, which calls
 might_sleep(). This breaks RCU semantics because path lookup occurs
 under an RCU read-side critical section."

This is seen with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_KFENCE=y.

Kernel addresses (with the exception of the vectors/kuser helper
page) do not have VMAs associated with them. If the vectors/kuser
helper page faults, then there are two possibilities:

1. if the fault happened while in kernel mode, then we're basically
   dead, because the CPU won't be able to vector through this page
   to handle the fault.
2. if the fault happened while in user mode, that means the page was
   protected from user access, and we want to fault anyway.

Thus, we can handle kernel addresses from any context entirely
separately without going anywhere near the mmap lock. This gives us
an entirely non-sleeping path for all kernel mode kernel address
faults.

Reported-by: Zizhi Wo <wozizhi@huaweicloud.com>
Reported-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Link: https://lore.kernel.org/r/20251126090505.3057219-1-wozizhi@huaweicloud.com
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm/mm/fault.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 46169fe42c61..2bbec38ced97 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -260,6 +260,35 @@ static inline bool ttbr0_usermode_access_allowed(struct pt_regs *regs)
 }
 #endif

+static int __kprobes
+do_kernel_address_page_fault(struct mm_struct *mm, unsigned long addr,
+			     unsigned int fsr, struct pt_regs *regs)
+{
+	if (user_mode(regs)) {
+		/*
+		 * Fault from user mode for a kernel space address. User mode
+		 * should not be faulting in kernel space, which includes the
+		 * vector/khelper page. Send a SIGSEGV.
+		 */
+		__do_user_fault(addr, fsr, SIGSEGV, SEGV_MAPERR, regs);
+	} else {
+		/*
+		 * Fault from kernel mode. Enable interrupts if they were
+		 * enabled in the parent context. Section (upper page table)
+		 * translation faults are handled via do_translation_fault(),
+		 * so we will only get here for a non-present kernel space
+		 * PTE or PTE permission fault. This may happen in exceptional
+		 * circumstances and need the fixup tables to be walked.
+		 */
+		if (interrupts_enabled(regs))
+			local_irq_enable();
+
+		__do_kernel_fault(mm, addr, fsr, regs);
+	}
+
+	return 0;
+}
+
 static int __kprobes
 do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 {
@@ -273,6 +302,12 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	if (kprobe_page_fault(regs, fsr))
 		return 0;

+	/*
+	 * Handle kernel addresses faults separately, which avoids touching
+	 * the mmap lock from contexts that are not able to sleep.
+	 */
+	if (addr >= TASK_SIZE)
+		return do_kernel_address_page_fault(mm, addr, fsr, regs);

 	/* Enable interrupts if they were enabled in the parent context. */
 	if (interrupts_enabled(regs))
-- 
2.47.3

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context

Posted by Xie Yuanbin 2 months ago

On Fri, 5 Dec 2025 12:08:14 +0000, Russell King wrote:
> On Wed, Dec 03, 2025 at 09:48:00AM +0800, Xie Yuanbin wrote:
>> On Tue, 2 Dec 2025 14:07:25 -0800, Linus Torvalds wrote:
>> > On Tue, 2 Dec 2025 at 04:43, Russell King (Oracle)
>> > <linux@armlinux.org.uk> wrote:
>> >>
>> >> What I'm thinking is to address both of these by handling kernel space
>> >> page faults (which will be permission or PTE-not-present) separately
>> >> (not even build tested):
>> >
>> > That patch looks sane to me.
>> >
>> > But I also didn't build test it, just scanned it visually ;)
>>
>> That patch removes harden_branch_predictor() from __do_user_fault(), and
>> moves it to do_page_fault()->do_kernel_address_page_fault().
>> This resolves previously mentioned kernel warning issue. However,
>> __do_user_fault() is not only called by do_page_fault(), it is
>> alse called by do_bad_area(), do_sect_fault() and do_translation_fault().
>>
>> So I think that some harden_branch_predictor() is missing on other paths.
>> According to my tests, when CONFIG_ARM_LPAE=n, harden_branch_predictor()
>> will never be called anymore, even if a user program trys to access the
>> kernel address.
>>
>> Or perhaps I've misunderstood something, could you please point it out?
>> Thank you very much.
>
> Right, let's split these issues into separate patches. Please test this
> patch, which should address only the hash_name() fault issue, and
> provides the basis for fixing the branch predictor issue.

I conducted a simple test, and it seems that both the hash_name()
might sleep issue and the branch predictor issue have been fixed.

BTW, even with this patch, test cases may still fail. There is another
bug in hash_name() will also be triggered by the testcase, which will be
fixed in this patch:
Link: https://lore.kernel.org/20251127025848.363992-1-pangliyuan1@huawei.com

Test case is from:
Link: https://lore.kernel.org/20251127140109.191657-1-xieyuanbin1@huawei.com

Test in commit 6987d58a9cbc5bd57c98 ("Add linux-next specific files for
20251205") from linux-next branch.

I still have a question about this patch: Is 
```patch
+		if (interrupts_enabled(regs))
+			local_irq_enable();
```
necessary? Although this implementation is closer to the original code,
which can reduce side effects, do_bad_area(), do_sect_fault(),
and do_translation_fault() all call __do_kernel_fault() with interrupts
disabled.

Thanks very much!

Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context

Posted by Russell King (Oracle) 2 months ago

On Mon, Dec 08, 2025 at 10:32:06AM +0800, Xie Yuanbin wrote:
> On Fri, 5 Dec 2025 12:08:14 +0000, Russell King wrote:
> > Right, let's split these issues into separate patches. Please test this
> > patch, which should address only the hash_name() fault issue, and
> > provides the basis for fixing the branch predictor issue.
> 
> I conducted a simple test, and it seems that both the hash_name()
> might sleep issue and the branch predictor issue have been fixed.

This isn't entirely fixed. A data abort for an alignment fault (thus
calling do_alignment()) will enable interrupts, and then may call
do_bad_area(). We can't short-circuit this path like we can with
do_page_fault() as alignment faults from userspace can be valid for
the vectors page - not that we should see them, but that doesn't mean
that there isn't something in userspace that does.

> BTW, even with this patch, test cases may still fail. There is another
> bug in hash_name() will also be triggered by the testcase, which will be
> fixed in this patch:
> Link: https://lore.kernel.org/20251127025848.363992-1-pangliyuan1@huawei.com

That patch got missed - I'm notoriously bad at catching every email.
There's just way too much email coming in.

> Test case is from:
> Link: https://lore.kernel.org/20251127140109.191657-1-xieyuanbin1@huawei.com
> 
> Test in commit 6987d58a9cbc5bd57c98 ("Add linux-next specific files for
> 20251205") from linux-next branch.
> 
> I still have a question about this patch: Is 
> ```patch
> +		if (interrupts_enabled(regs))
> +			local_irq_enable();
> ```
> necessary? Although this implementation is closer to the original code,
> which can reduce side effects, do_bad_area(), do_sect_fault(),
> and do_translation_fault() all call __do_kernel_fault() with interrupts
> disabled.

It's to keep the behaviour closer to the original as possible, on the
principle of avoiding unnecessary behavioural changes to the code. As
noted above, do_bad_area() can be called with interrupts enabled.

Whether RT folk would be happy removing that is a different question,
given that they want as much of the kernel to be preemptable.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context

Posted by Xie Yuanbin 2 months ago

On Mon, 8 Dec 2025 10:07:25 +0000, Russell King wrote:
> This isn't entirely fixed. A data abort for an alignment fault (thus
> calling do_alignment()) will enable interrupts, and then may call
> do_bad_area(). We can't short-circuit this path like we can with
> do_page_fault() as alignment faults from userspace can be valid for
> the vectors page - not that we should see them, but that doesn't mean
> that there isn't something in userspace that does.

I had indeed been lacking in consideration regarding do_alignment()
before, so thank you for reply. But, may I ask that, is there a scenario
where user-mode access to kernel addresses causes an alignment fault
(do_alignment())?

In your last email, you described it as follows:
On Fri, 5 Dec 2025 12:08:14 +0000, Russell King wrote:
> Also tested usermode access to kernel space
> which fails with SEGV:
> - read from 0xc0000000 (section permission fault, do_sect_fault)
> - read from 0xffff2000 (page translation fault, do_page_fault)
> - read from 0xffff0000 (vectors page - read possible as expected)
> - write to 0xffff0000 (page permission fault, do_page_fault)

There seems to be no do_alignment()?

In other words, is there a way to construct a user-mode testcase which
accesses a kernel address and triggers do_alignment()?

> That patch got missed - I'm notoriously bad at catching every email.
> There's just way too much email coming in.

No need to worry.

> It's to keep the behaviour closer to the original as possible, on the
> principle of avoiding unnecessary behavioural changes to the code. As
> noted above, do_bad_area() can be called with interrupts enabled.
>
> Whether RT folk would be happy removing that is a different question,
> given that they want as much of the kernel to be preemptable.

Thank you for your reply. I have no objections to this, although it might
introduce some unnecessary code paths, at least it won't bring any new
issues.

Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context

Posted by Russell King (Oracle) 2 months ago

On Mon, Dec 08, 2025 at 09:18:42PM +0800, Xie Yuanbin wrote:
> On Mon, 8 Dec 2025 10:07:25 +0000, Russell King wrote:
> > This isn't entirely fixed. A data abort for an alignment fault (thus
> > calling do_alignment()) will enable interrupts, and then may call
> > do_bad_area(). We can't short-circuit this path like we can with
> > do_page_fault() as alignment faults from userspace can be valid for
> > the vectors page - not that we should see them, but that doesn't mean
> > that there isn't something in userspace that does.
> 
> I had indeed been lacking in consideration regarding do_alignment()
> before, so thank you for reply. But, may I ask that, is there a scenario
> where user-mode access to kernel addresses causes an alignment fault
> (do_alignment())?

If you mean, won't permission errors be detected first, then no.
Alignment is one of the first things that is checked if alignment
faults are enabled.

So yes, if userspace attempts an unaigned load of a kernel address,
and the CPU does not support / have enabled unaigned load support,
then we will get a data abort with the FSR indicating an alignment
fault. So do_alignment() wil be entered.

Whether branch predictor handling needs to happen in this path is
a separate question, but as it's highly likely we'll take an
exception anyway and userspace is doing Bad Stuff, I feel it's
better to be over-cautious.

> In your last email, you described it as follows:
> On Fri, 5 Dec 2025 12:08:14 +0000, Russell King wrote:
> > Also tested usermode access to kernel space
> > which fails with SEGV:
> > - read from 0xc0000000 (section permission fault, do_sect_fault)
> > - read from 0xffff2000 (page translation fault, do_page_fault)
> > - read from 0xffff0000 (vectors page - read possible as expected)
> > - write to 0xffff0000 (page permission fault, do_page_fault)
> 
> There seems to be no do_alignment()?

Yes, I didn't test this case, because I was only concentrating on
the effects of the proposed patch which did not include moving the
branch predictor handling.

> In other words, is there a way to construct a user-mode testcase which
> accesses a kernel address and triggers do_alignment()?

Testing these mitigations is very difficult as there's no public
test cases for ARM.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context

Posted by Xie Yuanbin 2 months ago

On Mon, 8 Dec 2025 15:43:56 +0000, Russell King wrote:
> On Mon, Dec 08, 2025 at 09:18:42PM +0800, Xie Yuanbin wrote:
>> I had indeed been lacking in consideration regarding do_alignment()
>> before, so thank you for reply. But, may I ask that, is there a scenario
>> where user-mode access to kernel addresses causes an alignment fault
>> (do_alignment())?
>
> If you mean, won't permission errors be detected first, then no.
> Alignment is one of the first things that is checked if alignment
> faults are enabled.
>
> So yes, if userspace attempts an unaigned load of a kernel address,
> and the CPU does not support / have enabled unaigned load support,
> then we will get a data abort with the FSR indicating an alignment
> fault. So do_alignment() wil be entered.
>
> Whether branch predictor handling needs to happen in this path is
> a separate question, but as it's highly likely we'll take an
> exception anyway and userspace is doing Bad Stuff, I feel it's
> better to be over-cautious.

Thanks for your reply. I know it now, and thank you.

Re: [Bug report] hash_name() may cross page boundary and trigger sleep in RCU context

Posted by David Laight 2 months ago

On Mon, 8 Dec 2025 10:32:06 +0800
Xie Yuanbin <xieyuanbin1@huawei.com> wrote:

> On Fri, 5 Dec 2025 12:08:14 +0000, Russell King wrote:
> > On Wed, Dec 03, 2025 at 09:48:00AM +0800, Xie Yuanbin wrote:  
> >> On Tue, 2 Dec 2025 14:07:25 -0800, Linus Torvalds wrote:  
> >> > On Tue, 2 Dec 2025 at 04:43, Russell King (Oracle)
> >> > <linux@armlinux.org.uk> wrote:  
> >> >>
> >> >> What I'm thinking is to address both of these by handling kernel space
> >> >> page faults (which will be permission or PTE-not-present) separately
> >> >> (not even build tested):  
> >> >
> >> > That patch looks sane to me.
> >> >
> >> > But I also didn't build test it, just scanned it visually ;)  
> >>
> >> That patch removes harden_branch_predictor() from __do_user_fault(), and
> >> moves it to do_page_fault()->do_kernel_address_page_fault().
> >> This resolves previously mentioned kernel warning issue. However,
> >> __do_user_fault() is not only called by do_page_fault(), it is
> >> alse called by do_bad_area(), do_sect_fault() and do_translation_fault().
> >>
> >> So I think that some harden_branch_predictor() is missing on other paths.
> >> According to my tests, when CONFIG_ARM_LPAE=n, harden_branch_predictor()
> >> will never be called anymore, even if a user program trys to access the
> >> kernel address.
> >>
> >> Or perhaps I've misunderstood something, could you please point it out?
> >> Thank you very much.  
> >
> > Right, let's split these issues into separate patches. Please test this
> > patch, which should address only the hash_name() fault issue, and
> > provides the basis for fixing the branch predictor issue.  
> 
> I conducted a simple test, and it seems that both the hash_name()
> might sleep issue and the branch predictor issue have been fixed.
> 
> BTW, even with this patch, test cases may still fail. There is another
> bug in hash_name() will also be triggered by the testcase, which will be
> fixed in this patch:
> Link: https://lore.kernel.org/20251127025848.363992-1-pangliyuan1@huawei.com
> 
> Test case is from:
> Link: https://lore.kernel.org/20251127140109.191657-1-xieyuanbin1@huawei.com
> 
> Test in commit 6987d58a9cbc5bd57c98 ("Add linux-next specific files for
> 20251205") from linux-next branch.
> 
> I still have a question about this patch: Is 
> ```patch
> +		if (interrupts_enabled(regs))
> +			local_irq_enable();
> ```
> necessary? Although this implementation is closer to the original code,
> which can reduce side effects, do_bad_area(), do_sect_fault(),
> and do_translation_fault() all call __do_kernel_fault() with interrupts
> disabled.

It has to be safer to leave them disabled.
But you don't want to do that over long code paths.
But I'd have thought the 'act on an exception table entry or panic'
path wouldn't be long compared to an actual ISR (or other code that
disables interrupts) so there is no real point enabling them here.
But that is just my 2c.

	David

> 
> Thanks very much!
>