A race condition exists between filesystem unmount and inode permission
operations. When ocfs2_dismount_volume() frees the ocfs2_super (osb)
structure, concurrent access via OCFS2_SB(inode->i_sb) in
ocfs2_inode_lock_full_nested() can dereference freed memory, causing a
page fault in __pv_queued_spin_lock_slowpath via
ocfs2_is_hard_readonly() -> spin_lock(&osb->osb_lock).
Fix this with two changes:
1. In ocfs2_dismount_volume(): set sb->s_fs_info = NULL before
kfree(osb), so OCFS2_SB() returns NULL instead of a dangling pointer
during the teardown race window.
2. In ocfs2_inode_lock_full_nested(): add a NULL check on osb after
OCFS2_SB(), returning -EIO if the superblock info is already gone.
This ensures the crash path is handled gracefully when the
filesystem is being torn down.
Signed-off-by: Jiakai Xu <xujiakai24@mails.ucas.ac.cn>
Fixes: ccd979bdbce9f ("OCFS2: The Second Oracle Cluster Filesystem")
---
fs/ocfs2/dlmglue.c | 3 +++
fs/ocfs2/super.c | 2 +-
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 7283bb2c5a31..cd619958a0a2 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -2435,6 +2435,9 @@ int ocfs2_inode_lock_full_nested(struct inode *inode,
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 	struct buffer_head *local_bh = NULL;
 
+	if (!osb)
+		return -EIO;
+
 	mlog(0, "inode %llu, take %s META lock\n",
 	     (unsigned long long)OCFS2_I(inode)->ip_blkno,
 	     ex ? "EXMODE" : "PRMODE");
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index b875f01c9756..3fd56638e4f0 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1881,10 +1881,10 @@ static void ocfs2_dismount_volume(struct super_block *sb, int mnt_err)
 	printk(KERN_INFO "ocfs2: Unmounting device (%s) on (node %s)\n",
 	       osb->dev_str, nodestr);
 
+	sb->s_fs_info = NULL;
 	ocfs2_delete_osb(osb);
 	kfree(osb);
 	sb->s_dev = 0;
-	sb->s_fs_info = NULL;
 }
 
 static int ocfs2_setup_osb_uuid(struct ocfs2_super *osb, const unsigned char *uuid,
--
2.34.1
On 5/8/26 2:01 PM, Jiakai Xu wrote:
> A race condition exists between filesystem unmount and inode permission
> operations. When ocfs2_dismount_volume() frees the ocfs2_super (osb)
> structure, concurrent access via OCFS2_SB(inode->i_sb) in
> ocfs2_inode_lock_full_nested() can dereference freed memory, causing a
> page fault in __pv_queued_spin_lock_slowpath via
> ocfs2_is_hard_readonly() -> spin_lock(&osb->osb_lock).
>
> Fix this with two changes:
>
> 1. In ocfs2_dismount_volume(): set sb->s_fs_info = NULL before
> kfree(osb), so OCFS2_SB() returns NULL instead of a dangling pointer
> during the teardown race window.
>
> 2. In ocfs2_inode_lock_full_nested(): add a NULL check on osb after
> OCFS2_SB(), returning -EIO if the superblock info is already gone.
> This ensures the crash path is handled gracefully when the
> filesystem is being torn down.
>
It seems this is not enough, or TOCTOU still exists. Say:
Thread A                          Thread B
osb = OCFS2_SB(inode->i_sb)
                                  ocfs2_dismount_volume()
                                    -> sb->s_fs_info = NULL
                                    -> kfree(osb)
use freed osb
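
The interleaving above can be replayed deterministically in userspace. The following is only a single-threaded sketch under stated assumptions: `struct model_osb`, `struct model_sb`, `model_dismount()` and `model_null_check_race()` are hypothetical stand-ins invented for illustration, and a `freed` flag replaces a real `kfree()` so the read stays well-defined. It is not ocfs2 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the kernel structures. */
struct model_osb {
	bool freed;		/* set instead of calling free(), so reads stay defined */
};

struct model_sb {
	struct model_osb *s_fs_info;
};

/* Thread B: teardown in the patched order (clear the pointer, then "free"). */
static void model_dismount(struct model_sb *sb)
{
	struct model_osb *osb = sb->s_fs_info;

	assert(osb != NULL);
	sb->s_fs_info = NULL;
	osb->freed = true;
}

/*
 * Replays the diagram's interleaving on a single thread. Returns true when
 * Thread A passes its NULL check and still ends up holding a freed object.
 */
static bool model_null_check_race(void)
{
	struct model_osb osb = { .freed = false };
	struct model_sb sb = { .s_fs_info = &osb };

	/* Thread A: osb = OCFS2_SB(inode->i_sb) */
	struct model_osb *snapshot = sb.s_fs_info;

	/* Thread A: the added NULL check passes, the pointer is still valid. */
	if (!snapshot)
		return false;

	/* Thread B: the whole dismount runs between the check and the use. */
	model_dismount(&sb);

	/* Thread A: "use freed osb". */
	return snapshot->freed;
}
```

In this model `model_null_check_race()` returns true: clearing `s_fs_info` only helps readers that load the pointer after the store, not a reader that already holds a copy.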
BTW, how did you find this issue?
Joseph
> Signed-off-by: Jiakai Xu <xujiakai24@mails.ucas.ac.cn>
> Fixes: ccd979bdbce9f ("OCFS2: The Second Oracle Cluster Filesystem")
> ---
> fs/ocfs2/dlmglue.c | 3 +++
> fs/ocfs2/super.c | 2 +-
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 7283bb2c5a31..cd619958a0a2 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -2435,6 +2435,9 @@ int ocfs2_inode_lock_full_nested(struct inode *inode,
>  	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>  	struct buffer_head *local_bh = NULL;
>
> +	if (!osb)
> +		return -EIO;
> +
>  	mlog(0, "inode %llu, take %s META lock\n",
>  	     (unsigned long long)OCFS2_I(inode)->ip_blkno,
>  	     ex ? "EXMODE" : "PRMODE");
> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
> index b875f01c9756..3fd56638e4f0 100644
> --- a/fs/ocfs2/super.c
> +++ b/fs/ocfs2/super.c
> @@ -1881,10 +1881,10 @@ static void ocfs2_dismount_volume(struct super_block *sb, int mnt_err)
>  	printk(KERN_INFO "ocfs2: Unmounting device (%s) on (node %s)\n",
>  	       osb->dev_str, nodestr);
>
> +	sb->s_fs_info = NULL;
>  	ocfs2_delete_osb(osb);
>  	kfree(osb);
>  	sb->s_dev = 0;
> -	sb->s_fs_info = NULL;
>  }
>
>  static int ocfs2_setup_osb_uuid(struct ocfs2_super *osb, const unsigned char *uuid,
> It seems this is not enough, or TOCTOU still exists. Say:
>
> Thread A                          Thread B
> osb = OCFS2_SB(inode->i_sb)
>                                   ocfs2_dismount_volume()
>                                     -> sb->s_fs_info = NULL
>                                     -> kfree(osb)
> use freed osb
>
Hi Joseph,
Thank you very much for the review! You are absolutely right about the
TOCTOU issue — simply adding a NULL check after OCFS2_SB() cannot
prevent the race where thread A reads a valid osb pointer before thread
B frees it.
> BTW, how did you find this issue?
I found this issue through fuzzing. The crash report shows a page fault
at __pv_queued_spin_lock_slowpath via the call path:
ocfs2_permission -> ocfs2_inode_lock_tracker ->
ocfs2_inode_lock_full_nested -> ocfs2_is_hard_readonly ->
spin_lock(&osb->osb_lock)
The fault address was in the kernel static data region, indicating that
the osb structure had been freed and its memory reused.
I have been thinking about a more robust fix and would like to get your
opinion on the following approach:
Currently, ocfs2_dismount_volume() is called from ocfs2_put_super(),
which runs inside generic_shutdown_super() while s_umount is still held.
The osb structure is freed at this point, but inodes with elevated
refcounts (e.g., held by inotify) survive evict_inodes() and may still
trigger filesystem operations (like ocfs2_permission) that access osb.
The idea is to move the osb cleanup out of ocfs2_dismount_volume() and
into an ocfs2-specific ->kill_sb() callback, so that the cleanup happens
after generic_shutdown_super() has completed and all concurrent VFS
operations have drained.
Specifically:
1. Remove ocfs2_delete_osb(), kfree(osb), and sb->s_fs_info = NULL from
   ocfs2_dismount_volume(). Keep all the subsystem shutdown (journal,
   dlm, recovery, quota, etc.) there.

2. Add a new ocfs2_kill_sb() that wraps kill_block_super():

   static void ocfs2_kill_sb(struct super_block *sb)
   {
       struct ocfs2_super *osb = OCFS2_SB(sb);

       kill_block_super(sb);
       // At this point generic_shutdown_super() has completed,
       // SB_DYING is set, and no new VFS operations can enter.

       if (osb) {
           ocfs2_delete_osb(osb);
           kfree(osb);
           sb->s_fs_info = NULL;
       }
   }

3. Update ocfs2_fs_type to use ocfs2_kill_sb instead of kill_block_super.

4. The NULL check in ocfs2_inode_lock_full_nested() can optionally be
   kept as a defense-in-depth measure, though it is no longer strictly
   necessary if the life-cycle ordering is correct.
This pattern is similar to ext4 — ext4_kill_sb() calls kill_block_super()
first and then handles cleanup after (e.g., journal_bdev_file).
Does this approach make sense?
Best regards,
Jiakai
On 5/9/26 12:28 PM, Jiakai Xu wrote:
>> It seems this is not enough, or TOCTOU still exists. Say:
>>
>> Thread A                          Thread B
>> osb = OCFS2_SB(inode->i_sb)
>>                                   ocfs2_dismount_volume()
>>                                     -> sb->s_fs_info = NULL
>>                                     -> kfree(osb)
>> use freed osb
>>
>
> Hi Joseph,
>
> Thank you very much for the review! You are absolutely right about the
> TOCTOU issue — simply adding a NULL check after OCFS2_SB() cannot
> prevent the race where thread A reads a valid osb pointer before thread
> B frees it.
>
>> BTW, how did you find this issue?
>
> I found this issue through fuzzing. The crash report shows a page fault
> at __pv_queued_spin_lock_slowpath via the call path:
>
> ocfs2_permission -> ocfs2_inode_lock_tracker ->
> ocfs2_inode_lock_full_nested -> ocfs2_is_hard_readonly ->
> spin_lock(&osb->osb_lock)
What is the operation?
We expect all operations cannot access filesystem during filesystem shutdown.
>
> The fault address was in the kernel static data region, indicating that
> the osb structure had been freed and its memory reused.
>
> I have been thinking about a more robust fix and would like to get your
> opinion on the following approach:
>
> Currently, ocfs2_dismount_volume() is called from ocfs2_put_super(),
> which runs inside generic_shutdown_super() while s_umount is still held.
> The osb structure is freed at this point, but inodes with elevated
> refcounts (e.g., held by inotify) survive evict_inodes() and may still
> trigger filesystem operations (like ocfs2_permission) that access osb.
>
> The idea is to move the osb cleanup out of ocfs2_dismount_volume() and
> into an ocfs2-specific ->kill_sb() callback, so that the cleanup happens
> after generic_shutdown_super() has completed and all concurrent VFS
> operations have drained.
>
> Specifically:
>
> 1. Remove ocfs2_delete_osb(), kfree(osb), and sb->s_fs_info = NULL from
>    ocfs2_dismount_volume(). Keep all the subsystem shutdown (journal,
>    dlm, recovery, quota, etc.) there.
>
> 2. Add a new ocfs2_kill_sb() that wraps kill_block_super():
>
>    static void ocfs2_kill_sb(struct super_block *sb)
>    {
>        struct ocfs2_super *osb = OCFS2_SB(sb);
>
>        kill_block_super(sb);
>        // At this point generic_shutdown_super() has completed,
>        // SB_DYING is set, and no new VFS operations can enter.
>
>        if (osb) {
>            ocfs2_delete_osb(osb);
>            kfree(osb);
>            sb->s_fs_info = NULL;
>        }
>    }
>
> 3. Update ocfs2_fs_type to use ocfs2_kill_sb instead of kill_block_super.
>
> 4. The NULL check in ocfs2_inode_lock_full_nested() can optionally be
>    kept as a defense-in-depth measure, though it is no longer strictly
>    necessary if the life-cycle ordering is correct.
>
> This pattern is similar to ext4 — ext4_kill_sb() calls kill_block_super()
> first and then handles cleanup after (e.g., journal_bdev_file).
>
> Does this approach make sense?
>
In generic_shutdown_super(), it clears SB_ACTIVE.
So it seems we can check this flag.
Thanks,
Joseph
> In generic_shutdown_super(), it clears SB_ACTIVE.
> So it seems we can check this flag.
Hi Joseph,
Thank you for the suggestion. I looked into the SB_ACTIVE approach,
but it seems like it still cannot fully close the TOCTOU window.
Let me explain my understanding:
generic_shutdown_super() clears SB_ACTIVE and then calls put_super(),
so checking sb->s_flags & SB_ACTIVE in ocfs2_inode_lock_full_nested()
would access the superblock itself (which is still alive), not osb.
That part is safe. However, consider this race:
Thread A (inotify_add_watch)        Thread B (umount)
─────────────────────────────       ─────────────────────
read sb->s_flags → SB_ACTIVE set
                                    generic_shutdown_super()
                                      → clear SB_ACTIVE
                                      → put_super
                                        → kfree(osb)
osb = OCFS2_SB(sb) → osb is freed
  → use osb → UAF
So even with the SB_ACTIVE check at the beginning of
ocfs2_inode_lock_full_nested(), there is still a window between
the flag check and the actual dereference of osb where the
filesystem teardown can complete and free the osb structure.
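
The check-then-use window described above can be shown with a small single-threaded userspace model. This is a sketch under stated assumptions, not kernel code: `struct flag_model_sb`, `flag_model_shutdown()` and `flag_model_check_then_use()` are hypothetical names, the `active` field stands in for SB_ACTIVE, and a `freed` flag replaces the real kfree() so the final read is well-defined. The model stops at the instant right after the free, before s_fs_info is cleared, matching the diagram.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the kernel structures. */
struct flag_model_osb {
	bool freed;		/* set instead of calling free() */
};

struct flag_model_sb {
	bool active;		/* stands in for SB_ACTIVE in s_flags */
	struct flag_model_osb *s_fs_info;
};

/* Thread B: clear the flag, then "free" osb (pointer not yet cleared). */
static void flag_model_shutdown(struct flag_model_sb *sb)
{
	assert(sb->s_fs_info != NULL);
	sb->active = false;
	sb->s_fs_info->freed = true;
}

/*
 * Returns true when Thread A sees the flag set, and then dereferences an
 * object that Thread B freed between the flag check and the use.
 */
static bool flag_model_check_then_use(void)
{
	struct flag_model_osb osb = { .freed = false };
	struct flag_model_sb sb = { .active = true, .s_fs_info = &osb };

	/* Thread A: read sb->s_flags, SB_ACTIVE is still set. */
	if (!sb.active)
		return false;

	/* Thread B: shutdown runs to the point where osb is freed. */
	flag_model_shutdown(&sb);

	/* Thread A: osb = OCFS2_SB(sb), then use it. */
	struct flag_model_osb *stale = sb.s_fs_info;

	return stale->freed;
}
```

Here `flag_model_check_then_use()` returns true. A flag test without something that also keeps osb alive across the use narrows the window but does not remove it.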
To be honest, I'm finding it difficult to come up with a clean
solution for this race. I wonder if you or anyone in the community
might have ideas on how to best address it.
Any guidance would be greatly appreciated.
Best regards,
Jiakai
> What is the operation?
> We expect all operations cannot access filesystem during filesystem shutdown.

Here is the full crash report produced by the fuzzer:

BUG: unable to handle page fault for address: ffffffff1315afd0
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 6e6a067 P4D 6e6b067 PUD 0
Oops: Oops: 0002 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 12119 Comm: syz.2.132 Not tainted 6.18.5 #1 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:__pv_queued_spin_lock_slowpath+0x109/0x430 home/zzzrrll/tmp/linux/kernel/locking/qspinlock.c:288
Code: 9a 00 00 00 0f b7 c8 81 e1 fc ff 00 00 83 e0 03 48 c1 e0 05 4c 8d a8 00 c4 6b 89 48 c7 c2 f8 ff ff ff 48 8b ac 4a 90 0d ab 86 <48> 89 9c 05 00 c4 6b 89 b8 00 80 00 00 45 31 f6 eb 23 41 80 7c 2d
RSP: 0018:ffa000000da9bcc0 EFLAGS: 00010216
RAX: 0000000000000060 RBX: ff1100007da2c400 RCX: 0000000000008584
RDX: fffffffffffffff8 RSI: 0000000085873528 RDI: 0000000000040000
RBP: ffffffff89a9eb70 R08: ff1100007da2c414 R09: 0000000000000000
R10: 0000000000000002 R11: ffffffff823c6ad0 R12: 0000000000000000
R13: ffffffff896bc460 R14: ff110000f4370000 R15: ff1100007ba096c8
FS:  00007fb3ffc0a640(0000) GS:ff110000f4370000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff1315afd0 CR3: 000000003f598000 CR4: 0000000000751ef0
PKRU: 80000000
Call Trace:
 <TASK>
 pv_queued_spin_lock_slowpath home/zzzrrll/tmp/linux/include/asm-generic/qspinlock.h:111 [inline]
 queued_spin_lock_slowpath home/zzzrrll/tmp/linux/arch/x86/include/asm/qspinlock.h:51 [inline]
 queued_spin_lock home/zzzrrll/tmp/linux/include/asm-generic/qspinlock.h:114 [inline]
 do_raw_spin_lock home/zzzrrll/tmp/linux/include/linux/spinlock.h:187 [inline]
 __raw_spin_lock home/zzzrrll/tmp/linux/include/linux/spinlock_api_smp.h:134 [inline]
 _raw_spin_lock+0x31/0x40 home/zzzrrll/tmp/linux/kernel/locking/spinlock.c:154
 spin_lock home/zzzrrll/tmp/linux/include/linux/spinlock.h:351 [inline]
 ocfs2_is_hard_readonly home/zzzrrll/tmp/linux/fs/ocfs2/ocfs2.h:665 [inline]
 ocfs2_inode_lock_full_nested+0x5c/0xca0 home/zzzrrll/tmp/linux/fs/ocfs2/dlmglue.c:2446
 ocfs2_inode_lock_tracker+0xd8/0x400 home/zzzrrll/tmp/linux/fs/ocfs2/dlmglue.c:2691
 ocfs2_permission+0x75/0x130 home/zzzrrll/tmp/linux/fs/ocfs2/file.c:1349
 do_inode_permission home/zzzrrll/tmp/linux/fs/namei.c:526 [inline]
 inode_permission+0x1b4/0x2d0 home/zzzrrll/tmp/linux/fs/namei.c:593
 path_permission home/zzzrrll/tmp/linux/include/linux/fs.h:3086 [inline]
 inotify_find_inode home/zzzrrll/tmp/linux/fs/notify/inotify/inotify_user.c:381 [inline]
 __do_sys_inotify_add_watch home/zzzrrll/tmp/linux/fs/notify/inotify/inotify_user.c:771 [inline]
 __se_sys_inotify_add_watch+0x146/0x650 home/zzzrrll/tmp/linux/fs/notify/inotify/inotify_user.c:729
 do_syscall_x64 home/zzzrrll/tmp/linux/arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc6/0xfa0 home/zzzrrll/tmp/linux/arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb3fedae16d
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fb3ffc09f98 EFLAGS: 00000246 ORIG_RAX: 00000000000000fe
RAX: ffffffffffffffda RBX: 00007fb3feff5fa0 RCX: 00007fb3fedae16d
RDX: 0000000004000000 RSI: 0000200000000080 RDI: 0000000000000004
RBP: 00007fb3fee480f0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fb3feff6038 R14: 00007fb3feff5fa0 R15: 00007fb3ffbea000
 </TASK>
Modules linked in:
CR2: ffffffff1315afd0
---[ end trace 0000000000000000 ]---
RIP: 0010:__pv_queued_spin_lock_slowpath+0x109/0x430 home/zzzrrll/tmp/linux/kernel/locking/qspinlock.c:288
Code: 9a 00 00 00 0f b7 c8 81 e1 fc ff 00 00 83 e0 03 48 c1 e0 05 4c 8d a8 00 c4 6b 89 48 c7 c2 f8 ff ff ff 48 8b ac 4a 90 0d ab 86 <48> 89 9c 05 00 c4 6b 89 b8 00 80 00 00 45 31 f6 eb 23 41 80 7c 2d
RSP: 0018:ffa000000da9bcc0 EFLAGS: 00010216
RAX: 0000000000000060 RBX: ff1100007da2c400 RCX: 0000000000008584
RDX: fffffffffffffff8 RSI: 0000000085873528 RDI: 0000000000040000
RBP: ffffffff89a9eb70 R08: ff1100007da2c414 R09: 0000000000000000
R10: 0000000000000002 R11: ffffffff823c6ad0 R12: 0000000000000000
R13: ffffffff896bc460 R14: ff110000f4370000 R15: ff1100007ba096c8
FS:  00007fb3ffc0a640(0000) GS:ff110000f4370000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff1315afd0 CR3: 000000003f598000 CR4: 0000000000751ef0
PKRU: 80000000
----------------
Code disassembly (best guess), 1 bytes skipped:
   0:   00 00                   add    %al,(%rax)
   2:   00 0f                   add    %cl,(%rdi)
   4:   b7 c8                   mov    $0xc8,%bh
   6:   81 e1 fc ff 00 00       and    $0xfffc,%ecx
   c:   83 e0 03                and    $0x3,%eax
   f:   48 c1 e0 05             shl    $0x5,%rax
  13:   4c 8d a8 00 c4 6b 89    lea    -0x76943c00(%rax),%r13
  1a:   48 c7 c2 f8 ff ff ff    mov    $0xfffffffffffffff8,%rdx
  21:   48 8b ac 4a 90 0d ab    mov    -0x7954f270(%rdx,%rcx,2),%rbp
  28:   86
* 29:   48 89 9c 05 00 c4 6b    mov    %rbx,-0x76943c00(%rbp,%rax,1) <-- trapping instruction
  30:   89
  31:   b8 00 80 00 00          mov    $0x8000,%eax
  36:   45 31 f6                xor    %r14d,%r14d
  39:   eb 23                   jmp    0x5e
  3b:   41                      rex.B
  3c:   80                      .byte 0x80
  3d:   7c 2d                   jl     0x6c

Jiakai
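
As a cross-check of the report, the faulting address in CR2 can be recomputed from the trapping instruction, `mov %rbx,-0x76943c00(%rbp,%rax,1)`, and the RBP and RAX values in the register dump. The sketch below is plain userspace arithmetic; `effective_address()` and `oops_fault_address()` are hypothetical helper names, and the only inputs are the values printed in the oops.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Effective address of an x86-64 memory operand disp32(base,index,scale).
 * The 32-bit displacement is sign-extended to 64 bits and the sum wraps
 * modulo 2^64.
 */
static uint64_t effective_address(uint64_t base, uint64_t index,
				  unsigned int scale, int32_t disp)
{
	assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
	return base + index * scale + (uint64_t)(int64_t)disp;
}

/* Inputs copied from the oops: RBP, RAX and the displacement -0x76943c00. */
static uint64_t oops_fault_address(void)
{
	const uint64_t rbp = 0xffffffff89a9eb70ULL;
	const uint64_t rax = 0x60;
	const int32_t disp = -0x76943c00;

	return effective_address(rbp, rax, 1, disp);
}
```

`oops_fault_address()` evaluates to 0xffffffff1315afd0, the address reported in CR2, so the bad store target comes from the value in RBP, which the preceding instruction loaded using an index derived from RCX.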