When the path is initialized with LOOKUP_RCU flag in path_init(), the
rcu read lock will be acquired. Inside the rcu critical section,
load_unaligned_zeropad() may be called. According to the comments of
load_unaligned_zeropad(), when loading the memory, a page fault may be
triggered in the very unlikely case.
On arm32/arm64, a page fault may cause the current thread to sleep inside
mmap_read_lock_killable(). If CONFIG_DEBUG_ATOMIC_SLEEP=y, the following
warning will be triggered:
```log
[ 16.243462] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1559
[ 16.245271] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 68, name: test
[ 16.246219] preempt_count: 0, expected: 0
[ 16.246582] RCU nest depth: 1, expected: 0
[ 16.247262] CPU: 0 UID: 0 PID: 68 Comm: test Not tainted 6.18.0-rc6-next-20251124 #28 PREEMPT
[ 16.247432] Hardware name: Generic DT based system
[ 16.247549] Call trace:
[ 16.247618] unwind_backtrace from show_stack+0x10/0x14
[ 16.248442] show_stack from dump_stack_lvl+0x50/0x5c
[ 16.248458] dump_stack_lvl from __might_resched+0x174/0x188
[ 16.248475] __might_resched from down_read_killable+0x18/0x10c
[ 16.248490] down_read_killable from mmap_read_lock_killable+0x24/0x84
[ 16.248504] mmap_read_lock_killable from lock_mm_and_find_vma+0x164/0x18c
[ 16.248516] lock_mm_and_find_vma from do_page_fault+0x1d4/0x4a0
[ 16.248529] do_page_fault from do_DataAbort+0x30/0xa8
[ 16.248549] do_DataAbort from __dabt_svc+0x44/0x60
[ 16.248597] Exception stack(0xf0b41da0 to 0xf0b41de8)
[ 16.248675] 1da0: c20b34f0 c3f23bf8 00000000 c389be50 f0b41e90 00000501 61c88647 00000000
[ 16.248698] 1dc0: 80808080 fefefeff 2f2f2f2f eec51ffd c3219088 f0b41df0 c066d3e4 c066d218
[ 16.248705] 1de0: 60000013 ffffffff
[ 16.248736] __dabt_svc from link_path_walk+0xa8/0x444
[ 16.248752] link_path_walk from path_openat+0xac/0xe18
[ 16.248764] path_openat from do_filp_open+0x94/0x134
[ 16.248775] do_filp_open from do_sys_openat2+0x9c/0xf0
[ 16.248785] do_sys_openat2 from sys_openat+0x80/0xa0
[ 16.248806] sys_openat from ret_fast_syscall+0x0/0x4c
[ 16.248814] Exception stack(0xf0b41fa8 to 0xf0b41ff0)
[ 16.248825] 1fa0: 00000000 00000000 ffffff9c beb27d0c 00000242 000001b6
[ 16.248834] 1fc0: 00000000 00000000 000c543c 00000142 00027e85 00000002 00000002 00000000
[ 16.248841] 1fe0: beb27c20 beb27c0c 0006ea80 00072e78
[ 16.923450] ------------[ cut here ]------------
[ 16.923630] WARNING: kernel/rcu/tree_plugin.h:332 at rcu_note_context_switch+0x408/0x610, CPU#0: test/68
[ 16.924780] Voluntary context switch within RCU read-side critical section!
[ 16.924887] Modules linked in:
[ 16.925670] CPU: 0 UID: 0 PID: 68 Comm: test Tainted: G W 6.18.0-rc6-next-20251124 #28 PREEMPT
[ 16.926120] Tainted: [W]=WARN
[ 16.926257] Hardware name: Generic DT based system
[ 16.926474] Call trace:
[ 16.926487] unwind_backtrace from show_stack+0x10/0x14
[ 16.926899] show_stack from dump_stack_lvl+0x50/0x5c
[ 16.927318] dump_stack_lvl from __warn+0xf8/0x200
[ 16.927696] __warn from warn_slowpath_fmt+0x180/0x208
[ 16.928060] warn_slowpath_fmt from rcu_note_context_switch+0x408/0x610
[ 16.928768] rcu_note_context_switch from __schedule+0xe4/0xa58
[ 16.928917] __schedule from schedule+0x70/0x124
[ 16.929197] schedule from schedule_preempt_disabled+0x14/0x20
[ 16.929514] schedule_preempt_disabled from rwsem_down_read_slowpath+0x26c/0x4e4
[ 16.929875] rwsem_down_read_slowpath from down_read_killable+0x58/0x10c
[ 16.930320] down_read_killable from mmap_read_lock_killable+0x24/0x84
[ 16.930761] mmap_read_lock_killable from lock_mm_and_find_vma+0x164/0x18c
[ 16.931101] lock_mm_and_find_vma from do_page_fault+0x1d4/0x4a0
[ 16.931354] do_page_fault from do_DataAbort+0x30/0xa8
[ 16.931649] do_DataAbort from __dabt_svc+0x44/0x60
[ 16.931862] Exception stack(0xf0b41d88 to 0xf0b41dd0)
[ 16.932063] 1d80: c3219088 eec5dffd f0b41ec0 00000002 c3219118 00000010
[ 16.933732] 1da0: c321913c 00000002 00007878 c2da86c0 00000000 00000002 b8009440 f0b41ddc
[ 16.934019] 1dc0: eec5dffd c0677300 60000013 ffffffff
[ 16.934294] __dabt_svc from __d_lookup_rcu+0xc4/0x10c
[ 16.934468] __d_lookup_rcu from lookup_fast+0xa0/0x190
[ 16.934720] lookup_fast from path_openat+0x154/0xe18
[ 16.934953] path_openat from do_filp_open+0x94/0x134
[ 16.935141] do_filp_open from do_sys_openat2+0x9c/0xf0
[ 16.935384] do_sys_openat2 from sys_openat+0x80/0xa0
[ 16.935547] sys_openat from ret_fast_syscall+0x0/0x4c
[ 16.935799] Exception stack(0xf0b41fa8 to 0xf0b41ff0)
[ 16.936007] 1fa0: 00000000 00000000 ffffff9c beb27d0c 00000242 000001b6
[ 16.936293] 1fc0: 00000000 00000000 000c543c 00000142 00027e85 00000002 00000002 00000000
[ 16.936624] 1fe0: beb27c20 beb27c0c 0006ea80 00072e78
[ 16.936780] ---[ end trace 0000000000000000 ]---
```
Add pagefault_disable() to handle this situation.
Fixes: b9a50f74905a ("ARM: 7450/1: dcache: select DCACHE_WORD_ACCESS for little-endian ARMv6+ CPUs")
Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Co-developed-by: Liyuan Pang <pangliyuan1@huawei.com>
---
On latest linux-next source, using arm32's multi_v7_defconfig, and
setting CONFIG_PREEMPT=y, CONFIG_DEBUG_ATOMIC_SLEEP=y, CONFIG_KFENCE=y,
CONFIG_ARM_PAN=n, then run the following testcase:
```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>

static void *thread(void *arg)
{
	while (1) {
		void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);

		assert(p != (void *)-1);
		__asm__ volatile ("":"+r"(p)::"memory");

		munmap(p, 4096);
	}
}

int main()
{
	pthread_t th;
	int ret;
	char path[4096] = "/tmp";

	for (size_t i = 0; i < 2044; ++i) {
		strcat(path, "/x");
		ret = mkdir(path, 0755);
		assert(ret == 0 || errno == EEXIST);
	}
	strcat(path, "/xx");

	assert(strlen(path) == 4095);

	assert(pthread_create(&th, NULL, thread, NULL) == 0);

	while (1) {
		FILE *fp = fopen(path, "wb+");

		assert(fp);
		fclose(fp);
	}
	return 0;
}
```
The might sleep warning will be triggered immediately.
Another possible solution: call pagefault_disable() right after
rcu_read_lock() in path_init(), and pagefault_enable() right before
rcu_read_unlock() in leave_rcu(). However, that solution disables page
faults over a much wider scope; a rough sketch follows below.
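For illustration only, this is roughly what that wider-scope variant would
look like (a minimal sketch; the real bodies of path_init() and leave_rcu()
are elided and may differ in the current tree):

```c
/* Sketch: cover the whole RCU-walk window instead of each call site. */
static const char *path_init(struct nameidata *nd, unsigned flags)
{
	/* ... existing setup ... */
	if (flags & LOOKUP_RCU) {
		rcu_read_lock();
		pagefault_disable();	/* faults under RCU walk fail fast instead of sleeping */
	}
	/* ... rest of path_init() ... */
}

static void leave_rcu(struct nameidata *nd)
{
	nd->flags &= ~LOOKUP_RCU;
	pagefault_enable();		/* paired with path_init() above */
	rcu_read_unlock();
}
```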
fs/dcache.c | 10 ++++++++--
fs/namei.c | 7 +++++++
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 23d1752c29e6..154195909f07 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -264,23 +264,29 @@ fs_initcall(init_fs_dcache_sysctls);
*/
static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char *ct, unsigned tcount)
{
unsigned long a,b,mask;
+ pagefault_disable();
for (;;) {
a = read_word_at_a_time(cs);
b = load_unaligned_zeropad(ct);
if (tcount < sizeof(unsigned long))
break;
- if (unlikely(a != b))
+ if (unlikely(a != b)) {
+ pagefault_enable();
return 1;
+ }
cs += sizeof(unsigned long);
ct += sizeof(unsigned long);
tcount -= sizeof(unsigned long);
- if (!tcount)
+ if (!tcount) {
+ pagefault_enable();
return 0;
+ }
}
+ pagefault_enable();
mask = bytemask_from_count(tcount);
return unlikely(!!((a ^ b) & mask));
}
#else
diff --git a/fs/namei.c b/fs/namei.c
index 4ac7ff8e3a40..b04756e58ca3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2304,10 +2304,11 @@ static inline unsigned int fold_hash(unsigned long x, unsigned long y)
*/
unsigned int full_name_hash(const void *salt, const char *name, unsigned int len)
{
unsigned long a, x = 0, y = (unsigned long)salt;
+ pagefault_disable();
for (;;) {
if (!len)
goto done;
a = load_unaligned_zeropad(name);
if (len < sizeof(unsigned long))
@@ -2316,10 +2317,11 @@ unsigned int full_name_hash(const void *salt, const char *name, unsigned int len
name += sizeof(unsigned long);
len -= sizeof(unsigned long);
}
x ^= a & bytemask_from_count(len);
done:
+ pagefault_enable();
return fold_hash(x, y);
}
EXPORT_SYMBOL(full_name_hash);
/* Return the "hash_len" (hash and length) of a null-terminated string */
@@ -2328,18 +2330,20 @@ u64 hashlen_string(const void *salt, const char *name)
unsigned long a = 0, x = 0, y = (unsigned long)salt;
unsigned long adata, mask, len;
const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
len = 0;
+ pagefault_disable();
goto inside;
do {
HASH_MIX(x, y, a);
len += sizeof(unsigned long);
inside:
a = load_unaligned_zeropad(name+len);
} while (!has_zero(a, &adata, &constants));
+ pagefault_enable();
adata = prep_zero_mask(a, adata, &constants);
mask = create_zero_mask(adata);
x ^= a & zero_bytemask(mask);
@@ -2357,17 +2361,19 @@ static inline const char *hash_name(struct nameidata *nd,
{
unsigned long a, b, x, y = (unsigned long)nd->path.dentry;
unsigned long adata, bdata, mask, len;
const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
+ pagefault_disable();
/*
* The first iteration is special, because it can result in
* '.' and '..' and has no mixing other than the final fold.
*/
a = load_unaligned_zeropad(name);
b = a ^ REPEAT_BYTE('/');
if (has_zero(a, &adata, &constants) | has_zero(b, &bdata, &constants)) {
+ pagefault_enable();
adata = prep_zero_mask(a, adata, &constants);
bdata = prep_zero_mask(b, bdata, &constants);
mask = create_zero_mask(adata | bdata);
a &= zero_bytemask(mask);
*lastword = a;
@@ -2383,10 +2389,11 @@ static inline const char *hash_name(struct nameidata *nd,
HASH_MIX(x, y, a);
len += sizeof(unsigned long);
a = load_unaligned_zeropad(name+len);
b = a ^ REPEAT_BYTE('/');
} while (!(has_zero(a, &adata, &constants) | has_zero(b, &bdata, &constants)));
+ pagefault_enable();
adata = prep_zero_mask(a, adata, &constants);
bdata = prep_zero_mask(b, bdata, &constants);
mask = create_zero_mask(adata | bdata);
a &= zero_bytemask(mask);
--
2.51.0
On Wed, Nov 26, 2025 at 06:19:52PM +0800, Xie Yuanbin wrote:
> On latest linux-next source, using arm32's multi_v7_defconfig, and
> setting CONFIG_PREEMPT=y, CONFIG_DEBUG_ATOMIC_SLEEP=y, CONFIG_KFENCE=y,
> CONFIG_ARM_PAN=n, then run the following testcase:
> ```c
> static void *thread(void *arg)
> {
> 	while (1) {
> 		void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
>
> 		assert(p != (void *)-1);
> 		__asm__ volatile ("":"+r"(p)::"memory");
>
> 		munmap(p, 4096);
> 	}
> }
>
> int main()
> {
> 	pthread_t th;
> 	int ret;
> 	char path[4096] = "/tmp";
>
> 	for (size_t i = 0; i < 2044; ++i) {
> 		strcat(path, "/x");
> 		ret = mkdir(path, 0755);
> 		assert(ret == 0 || errno == EEXIST);
> 	}
> 	strcat(path, "/xx");
>
> 	assert(strlen(path) == 4095);
>
> 	assert(pthread_create(&th, NULL, thread, NULL) == 0);
>
> 	while (1) {
> 		FILE *fp = fopen(path, "wb+");
>
> 		assert(fp);
> 		fclose(fp);
> 	}
> 	return 0;
> }
> ```
> The might sleep warning will be triggered immediately.
"Immediately" part is interesting - presumably KFENCE is playing silly
buggers with PTEs in there.
Anyway, the underlying bug is that a fault in this scenario should not
even look at VMAs - it should get to fixup_exception() and be done
with that, with minimal overhead for all other causes of faults.
We have an unaligned 32bit fetch from kernel address, spanning the
page boundary, with the second page unmapped or unreadable. Access
comes from kernel mode. All we want is to fail the fault without
an oops, blocking, etc.
AFAICS, on arm32 a VMA lookup at an address > TASK_SIZE won't find
a damn thing anyway, so skipping these attempts and going to
bad_area looks safe enough, if we do that after all early cases...
On Wed, Nov 26, 2025 at 06:19:52PM +0800, Xie Yuanbin wrote:
> When the path is initialized with LOOKUP_RCU flag in path_init(), the
> rcu read lock will be acquired. Inside the rcu critical section,
> load_unaligned_zeropad() may be called. According to the comments of
> load_unaligned_zeropad(), when loading the memory, a page fault may be
> triggered in the very unlikely case.

> Add pagefault_disable() to handle this situation.

Way too costly, IMO. That needs to be dealt with in page fault handler
and IIRC arm used to do that; did that get broken at some point?
On Wed, Nov 26, 2025 at 06:10:31PM +0000, Al Viro wrote:
> On Wed, Nov 26, 2025 at 06:19:52PM +0800, Xie Yuanbin wrote:
> > When the path is initialized with LOOKUP_RCU flag in path_init(), the
> > rcu read lock will be acquired. Inside the rcu critical section,
> > load_unaligned_zeropad() may be called. According to the comments of
> > load_unaligned_zeropad(), when loading the memory, a page fault may be
> > triggered in the very unlikely case.
>
> > Add pagefault_disable() to handle this situation.
>
> Way too costly, IMO. That needs to be dealt with in page fault handler
> and IIRC arm used to do that; did that get broken at some point?
FWIW, on arm64 it's dealt with by hitting do_translation_fault(), which
sees that the address is a kernel-space one, goes into do_bad_area(), sees
that it's from kernel mode and proceeds into __do_kernel_fault() and
from there - to fixup_exception(). No messing with VMA lookups, etc.
anywhere in the process.
Had been that way since 760bfb47c36a "arm64: fault: Route pte translation
faults via do_translation_fault"...
Did that get broken? Or is it actually arm32-specific?
In any case, making every architecture pay for that is insane - taking
a fault there is extremely rare to start with and callers sit on very
hot paths. Deal with that in the fault handler. Really.
We have no business even looking at VMAs when the fault is on kernel-mode
read access to kernel-space address. And callers of load_unaligned_zeropad()
are not the place for dealing with that.
It's been years since I looked at 32bit arm exception handling, so I'd need
quite a bit of (re)RTF{S,M} before I'm comfortable with poking in
arch/arm/mm/fault.c; better let ARM folks deal with that. But arch/* is
where it should be dealt with; as for papering over that in fs/*:
NAKed-by: Al Viro <viro@zeniv.linux.org.uk>
On Wed, Nov 26, 2025 at 06:48:20PM +0000, Al Viro wrote:
> It's been years since I looked at 32bit arm exception handling, so I'd need
> quite a bit of (re)RTF{S,M} before I'm comfortable with poking in
> arch/arm/mm/fault.c; better let ARM folks deal with that. But arch/* is
> where it should be dealt with; as for papering over that in fs/*:
Don't expect that to happen. I've not looked at it for over a decade,
I do very little 32-bit ARM stuff anymore. Others have modified the
fault handling, the VM has changed, I basically no longer have the
knowledge. Effectively, 32-bit ARM is unmaintained now, although it
still has many users.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
On Wed, Nov 26, 2025 at 07:05:05PM +0000, Russell King (Oracle) wrote:
> On Wed, Nov 26, 2025 at 06:48:20PM +0000, Al Viro wrote:
> > It's been years since I looked at 32bit arm exception handling, so I'd need
> > quite a bit of (re)RTF{S,M} before I'm comfortable with poking in
> > arch/arm/mm/fault.c; better let ARM folks deal with that. But arch/* is
> > where it should be dealt with; as for papering over that in fs/*:
>
> Don't expect that to happen. I've not looked at it for over a decade,
> I do very little 32-bit ARM stuff anymore. Others have modified the
> fault handling, the VM has changed, I basically no longer have the
> knowledge. Effectively, 32-bit ARM is unmaintained now, although it
> still has many users.
Joy... For quick and dirty variant (on current tree), how about
adding
	if (unlikely(addr > TASK_SIZE) && !user_mode(regs))
		goto no_context;

right after

	if (!ttbr0_usermode_access_allowed(regs))
		goto no_context;
in do_page_fault() there?
NOTE: that might or might not break vdso; I don't think it would, but...
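Purely to show where that check would land, a sketch against my reading of
arch/arm/mm/fault.c (surrounding code abridged; it may not match the
current tree exactly):

```c
static int __kprobes
do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
{
	/* ... early checks ... */

	if (!ttbr0_usermode_access_allowed(regs))
		goto no_context;

	/*
	 * Suggested addition: a kernel-mode fault on a kernel-space
	 * address (e.g. from load_unaligned_zeropad()) goes straight
	 * to the exception-table fixup via no_context, without taking
	 * mmap_lock or searching VMAs.
	 */
	if (unlikely(addr > TASK_SIZE) && !user_mode(regs))
		goto no_context;

	/* ... normal VMA-based handling for faults on user addresses ... */
}
```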
On Wed, 26 Nov 2025 19:26:40 +0000, Al Viro wrote:
> For quick and dirty variant (on current tree), how about
> adding
> 	if (unlikely(addr > TASK_SIZE) && !user_mode(regs))
> 		goto no_context;
>
> right after
>
> 	if (!ttbr0_usermode_access_allowed(regs))
> 		goto no_context;
>
> in do_page_fault() there?
>
> NOTE: that might or might not break vdso; I don't think it would, but...

On Wed, 26 Nov 2025 23:31:00 +0000, Russell King wrote:
> Now, for 32-bit ARM, I think I am coming to the conclusion that Al's
> suggestion is probably the easiest solution. However, whether it has
> side effects, I couldn't say - the 32-bit ARM fault code has been
> modified by quite a few people in ways I don't yet understand, so I
> can't be certain at the moment whether it would cause problems.
>
> I think the only thing to do is to try the solution and see what
> breaks. I'm not in a position to be able to do that as, having not
> had reason to touch 32-bit ARM for years, I don't have a hackable
> platform nearby. Maybe Xie Yuanbin can test it?

Hi, Al Viro and Russell King!

I moved the judgment forward to before local_irq_enable() and submitted a
new patch:
Link: https://lore.kernel.org/20251127140109.191657-1-xieyuanbin1@huawei.com

This is because there's another bug I reported before that also requires a
similar judgment, but before the interrupt is enabled.
Link: https://lore.kernel.org/20250925025744.6807-1-xieyuanbin1@huawei.com

I hope this can fix both of these bugs. It is closer to x86's
implementation and works well in current tests.

Could you please take a look? Thank you very much!

Xie Yuanbin
On Wed, Nov 26, 2025 at 07:26:40PM +0000, Al Viro wrote:
> On Wed, Nov 26, 2025 at 07:05:05PM +0000, Russell King (Oracle) wrote:
> > On Wed, Nov 26, 2025 at 06:48:20PM +0000, Al Viro wrote:
> > > It's been years since I looked at 32bit arm exception handling, so I'd need
> > > quite a bit of (re)RTF{S,M} before I'm comfortable with poking in
> > > arch/arm/mm/fault.c; better let ARM folks deal with that. But arch/* is
> > > where it should be dealt with; as for papering over that in fs/*:
> >
> > Don't expect that to happen. I've not looked at it for over a decade,
> > I do very little 32-bit ARM stuff anymore. Others have modified the
> > fault handling, the VM has changed, I basically no longer have the
> > knowledge. Effectively, 32-bit ARM is unmaintained now, although it
> > still has many users.
>
> Joy... For quick and dirty variant (on current tree), how about
> adding
> if (unlikely(addr > TASK_SIZE) && !user_mode(regs))
> goto no_context;
>
> right after
>
> if (!ttbr0_usermode_access_allowed(regs))
> goto no_context;
>
> in do_page_fault() there?
>
> NOTE: that might or might not break vdso; I don't think it would, but...
I don't understand how that helps. Wasn't the report that the filename
crosses a page boundary in userspace, but the following page is
inaccessible which causes a fault to be taken (as it always would do).
Thus, wouldn't "addr" be a userspace address (that the kernel is
accessing) and thus be below TASK_SIZE ?
I'm also confused - if we can't take a fault and handle it while
reading the filename from userspace, how are pages that have been
swapped out or evicted from the page cache read back in from storage
which invariably results in sleeping - which we can't do here because
of the RCU context (not that I've ever understood RCU, which is why
I've always referred those bugs to Paul.)
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
On Wed, Nov 26, 2025 at 07:51:54PM +0000, Russell King (Oracle) wrote:
> I don't understand how that helps. Wasn't the report that the filename
> crosses a page boundary in userspace, but the following page is
> inaccessible which causes a fault to be taken (as it always would do).
> Thus, wouldn't "addr" be a userspace address (that the kernel is
> accessing) and thus be below TASK_SIZE ?
>
> I'm also confused - if we can't take a fault and handle it while
> reading the filename from userspace, how are pages that have been
> swapped out or evicted from the page cache read back in from storage
> which invariably results in sleeping - which we can't do here because
> of the RCU context (not that I've ever understood RCU, which is why
> I've always referred those bugs to Paul.)

No, the filename is already copied in kernel space *and* it's long enough
to end right next to the end of page. There's NUL before the end of page,
at that, with '/' a couple of bytes prior. We attempt to save on memory
accesses, doing word-by-word fetches, starting from the beginning of
component. We *will* detect NUL and ignore all subsequent bytes; the
problem is that the last 3 bytes of page might be '/', 'x' and '\0'.
We call load_unaligned_zeropad() on page + PAGE_SIZE - 2. And get
a fetch that spans the end of page.

We don't care what's in the next page, if there is one mapped there
to start with. If there's nothing mapped, we want zeroes read from
it, but all we really care about is having the bytes within *our*
page read correctly - and no oops happening, obviously.

That fault is an extremely cold case on a fairly hot path. We don't
want to mess with disabling pagefaults, etc. - not for the sake
of that.
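To make that layout concrete, a hypothetical example (the buffer name and
offsets are invented for illustration; PAGE_SIZE assumed 4096, 4-byte
words):

```c
/*
 * Kernel buffer holding the pathname, with the last component "x"
 * ending right at the end of the page:
 *
 *   kbuf[PAGE_SIZE - 3] = '/';
 *   kbuf[PAGE_SIZE - 2] = 'x';
 *   kbuf[PAGE_SIZE - 1] = '\0';
 *
 * hash_name() starts its word-at-a-time walk at the component, i.e.
 *
 *   a = load_unaligned_zeropad(kbuf + PAGE_SIZE - 2);
 *
 * With 4-byte words that load covers offsets PAGE_SIZE-2 .. PAGE_SIZE+1,
 * so its last two bytes land in the *next* page, which may be unmapped.
 * The NUL inside our page means those bytes are masked off anyway; the
 * fault only needs an exception-table fixup, never a VMA walk or a sleep.
 */
```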
On Wed, Nov 26, 2025 at 08:02:21PM +0000, Al Viro wrote:
> On Wed, Nov 26, 2025 at 07:51:54PM +0000, Russell King (Oracle) wrote:
>
> > I don't understand how that helps. Wasn't the report that the filename
> > crosses a page boundary in userspace, but the following page is
> > inaccessible which causes a fault to be taken (as it always would do).
> > Thus, wouldn't "addr" be a userspace address (that the kernel is
> > accessing) and thus be below TASK_SIZE ?
> >
> > I'm also confused - if we can't take a fault and handle it while
> > reading the filename from userspace, how are pages that have been
> > swapped out or evicted from the page cache read back in from storage
> > which invariably results in sleeping - which we can't do here because
> > of the RCU context (not that I've ever understood RCU, which is why
> > I've always referred those bugs to Paul.)
>
> No, the filename is already copied in kernel space *and* it's long enough
> to end right next to the end of page. There's NUL before the end of page,
> at that, with '/' a couple of bytes prior. We attempt to save on memory
> accesses, doing word-by-word fetches, starting from the beginning of
> component. We *will* detect NUL and ignore all subsequent bytes; the
> problem is that the last 3 bytes of page might be '/', 'x' and '\0'.
> We call load_unaligned_zeropad() on page + PAGE_SIZE - 2. And get
> a fetch that spans the end of page.
>
> We don't care what's in the next page, if there is one mapped there
> to start with. If there's nothing mapped, we want zeroes read from
> it, but all we really care about is having the bytes within *our*
> page read correctly - and no oops happening, obviously.
>
> That fault is an extremely cold case on a fairly hot path. We don't
> want to mess with disabling pagefaults, etc. - not for the sake
> of that.
I think, looking at the x86 handling, 32-bit ARM has missed a heck of
a lot of changes to the fault handling code, going all the way back to
pre-git history.
I seem to remember that I had updated it to match i386's implementation
at one point in the distant past, which is essentially what we have
today with a few tweaks. As code ages, it gets more difficult to
justify wholesale rewrites to bring it back up.
Relevant to this, looking at i386, that at some point added:
+	/*
+	 * We fault-in kernel-space virtual memory on-demand. The
+	 * 'reference' page table is init_mm.pgd.
+	 *
+	 * NOTE! We MUST NOT take any locks for this case. We may
+	 * be in an interrupt or a critical region, and should
+	 * only copy the information from the master page table,
+	 * nothing more.
+	 *
+	 * This verifies that the fault happens in kernel space
+	 * (error_code & 4) == 0, and that the fault was not a
+	 * protection error (error_code & 1) == 0.
+	 */
+	if (unlikely(address >= TASK_SIZE)) {
+		if (!(error_code & 5))
+			goto vmalloc_fault;
+		/*
+		 * Don't take the mm semaphore here. If we fixup a prefetch
+		 * fault we could otherwise deadlock.
+		 */
+		goto bad_area_nosemaphore;
+	}
which is after notify_die() and the test to see whether we need a
local_irq_enable(). This means we go straight to the fixing up etc
for these addresses.
In today's kernel, this has morphed into:
	/* Was the fault on kernel-controlled part of the address space? */
	if (unlikely(fault_in_kernel_space(address))) {
		do_kern_addr_fault(regs, error_code, address);
	} else {
		do_user_addr_fault(regs, error_code, address);
meaning any page fault for a kernel space address is handled entirely
separately from the normal page fault handling, and it looks like
this is entirely sensible.
Interestingly, however, I notice that x86 appears to no longer call
notify_die(DIE_PAGE_FAULT) in its page fault handling path, and I
wonder whether that's a regression on x86.
Now, for 32-bit ARM, I think I am coming to the conclusion that Al's
suggestion is probably the easiest solution. However, whether it has
side effects, I couldn't say - the 32-bit ARM fault code has been
modified by quite a few people in ways I don't yet understand, so I
can't be certain at the moment whether it would cause problems.
I think the only thing to do is to try the solution and see what
breaks. I'm not in a position to be able to do that as, having not
had reason to touch 32-bit ARM for years, I don't have a hackable
platform nearby. Maybe Xie Yuanbin can test it?
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
On Wed, 26 Nov 2025 19:26:40 +0000, Al Viro wrote:
> For quick and dirty variant (on current tree), how about
> adding
> if (unlikely(addr > TASK_SIZE) && !user_mode(regs))
> goto no_context;
>
> right after
>
> if (!ttbr0_usermode_access_allowed(regs))
> goto no_context;
>
> in do_page_fault() there?
On Wed, 26 Nov 2025 23:31:00 +0000, Russell King (Oracle) wrote:
> Now, for 32-bit ARM, I think I am coming to the conclusion that Al's
> suggestion is probably the easiest solution. However, whether it has
> side effects, I couldn't say - the 32-bit ARM fault code has been
> modified by quite a few people in ways I don't yet understand, so I
> can't be certain at the moment whether it would cause problems.
I think I've already submitted a very similar patch, to fix another bug:
On Thu, 16 Oct 2025 20:16:21 +0800, Xie Yuanbin wrote:
> +#ifdef CONFIG_HARDEN_BRANCH_PREDICTOR
> +	if (unlikely(addr > TASK_SIZE) && user_mode(regs)) {
> +		fault = 0;
> +		code = SEGV_MAPERR;
> +		goto bad_area;
> +	}
> +#endif
Link: https://lore.kernel.org/20250925025744.6807-1-xieyuanbin1@huawei.com
However, the patch seems to have received no response for a very long
time.
On Wed, 26 Nov 2025 23:31:00 +0000, Russell King wrote:
> I think the only thing to do is to try the solution and see what
> breaks. I'm not in a position to be able to do that as, having not
> had reason to touch 32-bit ARM for years, I don't have a hackable
> platform nearby. Maybe Xie Yuanbin can test it?
With pleasure.
By the way, for the config and test case shown in this patch:
vfs: Fix might sleep in load_unaligned_zeropad() with rcu read lock held
Link: https://lore.kernel.org/20251126101952.174467-1-xieyuanbin1@huawei.com
the warning can be reproduced directly on QEMU.
Xie Yuanbin
On 2025-11-27 11:03:16 [+0800], Xie Yuanbin wrote:
> On, Wed, 26 Nov 2025 19:26:40 +0000, Al Viro wrote:
> > For quick and dirty variant (on current tree), how about
> > adding
> > if (unlikely(addr > TASK_SIZE) && !user_mode(regs))
> > goto no_context;
> >
> > right after
> >
> > if (!ttbr0_usermode_access_allowed(regs))
> > goto no_context;
> >
> > in do_page_fault() there?
>
> On, Wed, 26 Nov 2025 23:31:00 +0000, Russell King (Oracle) wrote:
> > Now, for 32-bit ARM, I think I am coming to the conclusion that Al's
> > suggestion is probably the easiest solution. However, whether it has
> > side effects, I couldn't say - the 32-bit ARM fault code has been
> > modified by quite a few people in ways I don't yet understand, so I
> > can't be certain at the moment whether it would cause problems.
>
> I think I've already submitted a very similar patch, to fix another bug:
> On Thu, 16 Oct 2025 20:16:21 +0800, Xie Yuanbin wrote:
> > +#ifdef CONFIG_HARDEN_BRANCH_PREDICTOR
> > + if (unlikely(addr > TASK_SIZE) && user_mode(regs)) {
> > + fault = 0;
> > + code = SEGV_MAPERR;
> > + goto bad_area;
> > + }
> > +#endif
> Link: https://lore.kernel.org/20250925025744.6807-1-xieyuanbin1@huawei.com
>
> However, the patch seems to have received no response for a very long
> time.
This all should be covered by the series here
https://lore.kernel.org/all/20251110145555.2555055-1-bigeasy@linutronix.de/
or do I miss something.
Sebastian
On Thu, 27 Nov 2025 08:20:57 +0100, Sebastian Andrzej Siewior wrote:
> This all should be covered by the series here
> https://lore.kernel.org/all/20251110145555.2555055-1-bigeasy@linutronix.de/

Yes, I know it.

> or do I miss something.

We had some discussions about this bug:
Link: https://lore.kernel.org/lkml/20251126090505.3057219-1-wozizhi@huaweicloud.com/

The discussions:
Link: https://lore.kernel.org/CAHk-=wh1Wfwt9OFB4AfBbjyeu4JVZuSWQ4A8OoT3W6x9btddfw@mail.gmail.com
Link: https://lore.kernel.org/20251126192640.GD3538@ZenIV
Link: https://lore.kernel.org/aSeNtFxD1WRjFaiR@shell.armlinux.org.uk

According to the discussion, in do_page_fault(), when addr >= TASK_SIZE,
we should not try to acquire the mm read lock or find the vma. Instead,
we should directly call __do_kernel_fault() or __do_user_fault().
Your submission just moved harden_branch_predictor() forward. I think we
can have more discussions about the patches to fix the missing Spectre
mitigation.

I am trying to write a new patch; I hope it will better handle these two
bugs and be compatible with PREEMPT_RT scenarios.

> Sebastian

Thanks!

Xie Yuanbin
On Wed, 26 Nov 2025 20:02:21 +0000 Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Nov 26, 2025 at 07:51:54PM +0000, Russell King (Oracle) wrote:
>
> > I don't understand how that helps. Wasn't the report that the filename
> > crosses a page boundary in userspace, but the following page is
> > inaccessible which causes a fault to be taken (as it always would do).
> > Thus, wouldn't "addr" be a userspace address (that the kernel is
> > accessing) and thus be below TASK_SIZE ?
> >
> > I'm also confused - if we can't take a fault and handle it while
> > reading the filename from userspace, how are pages that have been
> > swapped out or evicted from the page cache read back in from storage
> > which invariably results in sleeping - which we can't do here because
> > of the RCU context (not that I've ever understood RCU, which is why
> > I've always referred those bugs to Paul.)
>
> No, the filename is already copied in kernel space *and* it's long enough
> to end right next to the end of page. There's NUL before the end of page,
> at that, with '/' a couple of bytes prior. We attempt to save on memory
> accesses, doing word-by-word fetches, starting from the beginning of
> component. We *will* detect NUL and ignore all subsequent bytes; the
> problem is that the last 3 bytes of page might be '/', 'x' and '\0'.
> We call load_unaligned_zeropad() on page + PAGE_SIZE - 2. And get
> a fetch that spans the end of page.
>
> We don't care what's in the next page, if there is one mapped there
> to start with. If there's nothing mapped, we want zeroes read from
> it, but all we really care about is having the bytes within *our*
> page read correctly - and no oops happening, obviously.
>
> That fault is an extremely cold case on a fairly hot path. We don't
> want to mess with disabling pagefaults, etc. - not for the sake
> of that.

Can you fix it with a flag on the exception table entry that means
'don't try to fault in a page'?

I think the logic would be the same as 'disabling pagefaults', just
checking a different flag.
After all the fault itself happens in both cases.

	David
On Wed, Nov 26, 2025 at 10:25:05PM +0000, david laight wrote:
> Can you fix it with a flag on the exception table entry that means
> 'don't try to fault in a page'?
>
> I think the logic would be the same as 'disabling pagefaults', just
> checking a different flag.
> After all the fault itself happens in both cases.

The problem is getting to the point where you search the exception table
without blocking.

x86 #PF had been done that way from well before the point when
load_unaligned_zeropad() had been introduced, so everything worked there
from the very beginning.

arm and arm64, OTOH, were different - there had been logics for "if
trylock fails, check if we are in kernel space and have no matching
exception table entry; bugger off if so, otherwise we are safe to grab
mmap_sem - it's something like get_user() and we *want* mmap_sem there",
but it did exactly the wrong thing for this case.

The only thing that prevented serious breakage from the very beginning
was that these faults are very rare - and hard to arrange without KFENCE.
So it didn't blow up.

In 2017 arm64 side of problem had been spotted and (hopefully) fixed.
arm counterpart stayed unnoticed (perhaps for the lack of good
reproducer) until now.

Most of the faults are from userland code, obviously, so we don't want
to search through the exception table on the common path. So hanging
that on a flag in exception table entry is not a good idea - we need a
cheaper predicate checked first. x86 starts with separating the fault on
kernel address from that on userland; we are not going anywhere near
mmap_sem (and VMAs in general) in the former case and that's where
load_unaligned_zeropad() faults end up.

arm64 fix consisted of using do_translation_fault() instead of
do_page_fault(), with the former falling back to the latter for userland
addresses and using do_bad_area() for kernel ones. Assuming that the way
it's hooked up covers everything, we should be fine there. One potential
problem _might_ be with the next PTE present, but write-only. Note that
it has to cope with symlink bodies as well and those might come from
page cache rather than kmem_cache_alloc(). I'm nowhere near being
uptodate on arm64 virtual memory setup, though, so take that with a
cartload of salt...
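For reference, the arm64 routing described above has roughly this shape (a
paraphrased sketch of do_translation_fault() after commit 760bfb47c36a, not
the verbatim arch/arm64/mm/fault.c source):

```c
static int do_translation_fault(unsigned long far, unsigned long esr,
				struct pt_regs *regs)
{
	unsigned long addr = untagged_addr(far);

	/* User-space address: fall back to the normal VMA-based handler. */
	if (is_ttbr0_addr(addr))
		return do_page_fault(far, esr, regs);

	/* Kernel-space address: no VMA lookup, straight to fixup/oops. */
	do_bad_area(far, esr, regs);
	return 0;
}
```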