ocfs2: fix use-after-free in ocfs2_inode_lock_full_nested during unmount

[PATCH] ocfs2: fix use-after-free in ocfs2_inode_lock_full_nested during unmount

Posted by Jiakai Xu 3 weeks, 1 day ago

A race condition exists between filesystem unmount and inode permission
operations. When ocfs2_dismount_volume() frees the ocfs2_super (osb)
structure, concurrent access via OCFS2_SB(inode->i_sb) in
ocfs2_inode_lock_full_nested() can dereference freed memory, causing a
page fault in __pv_queued_spin_lock_slowpath via
ocfs2_is_hard_readonly() -> spin_lock(&osb->osb_lock).

Fix this with two changes:

1. In ocfs2_dismount_volume(): set sb->s_fs_info = NULL before
   kfree(osb), so OCFS2_SB() returns NULL instead of a dangling pointer
   during the teardown race window.

2. In ocfs2_inode_lock_full_nested(): add a NULL check on osb after
   OCFS2_SB(), returning -EIO if the superblock info is already gone.
   This ensures the crash path is handled gracefully when the
   filesystem is being torn down.

Fixes: ccd979bdbce9f ("OCFS2: The Second Oracle Cluster Filesystem")
Signed-off-by: Jiakai Xu <xujiakai24@mails.ucas.ac.cn>
---
 fs/ocfs2/dlmglue.c | 3 +++
 fs/ocfs2/super.c   | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 7283bb2c5a31..cd619958a0a2 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -2435,6 +2435,9 @@ int ocfs2_inode_lock_full_nested(struct inode *inode,
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 	struct buffer_head *local_bh = NULL;
 
+	if (!osb)
+		return -EIO;
+
 	mlog(0, "inode %llu, take %s META lock\n",
 	     (unsigned long long)OCFS2_I(inode)->ip_blkno,
 	     ex ? "EXMODE" : "PRMODE");
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index b875f01c9756..3fd56638e4f0 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1881,10 +1881,10 @@ static void ocfs2_dismount_volume(struct super_block *sb, int mnt_err)
 	printk(KERN_INFO "ocfs2: Unmounting device (%s) on (node %s)\n",
 	       osb->dev_str, nodestr);
 
+	sb->s_fs_info = NULL;
 	ocfs2_delete_osb(osb);
 	kfree(osb);
 	sb->s_dev = 0;
-	sb->s_fs_info = NULL;
 }
 
 static int ocfs2_setup_osb_uuid(struct ocfs2_super *osb, const unsigned char *uuid,
-- 
2.34.1

Re: [PATCH] ocfs2: fix use-after-free in ocfs2_inode_lock_full_nested during unmount

Posted by Joseph Qi 3 weeks, 1 day ago


On 5/8/26 2:01 PM, Jiakai Xu wrote:
> A race condition exists between filesystem unmount and inode permission
> operations. When ocfs2_dismount_volume() frees the ocfs2_super (osb)
> structure, concurrent access via OCFS2_SB(inode->i_sb) in
> ocfs2_inode_lock_full_nested() can dereference freed memory, causing a
> page fault in __pv_queued_spin_lock_slowpath via
> ocfs2_is_hard_readonly() -> spin_lock(&osb->osb_lock).
> 
> Fix this with two changes:
> 
> 1. In ocfs2_dismount_volume(): set sb->s_fs_info = NULL before
>    kfree(osb), so OCFS2_SB() returns NULL instead of a dangling pointer
>    during the teardown race window.
> 
> 2. In ocfs2_inode_lock_full_nested(): add a NULL check on osb after
>    OCFS2_SB(), returning -EIO if the superblock info is already gone.
>    This ensures the crash path is handled gracefully when the
>    filesystem is being torn down.
> 

This does not seem to be enough; a TOCTOU window still exists. For example:

Thread A			Thread B
osb = OCFS2_SB(inode->i_sb)
				ocfs2_dismount_volume()
				-> sb->s_fs_info = NULL
				-> kfree(osb)
use freed osb
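
The interleaving above can be replayed in a few lines of userspace C. This
is only an illustrative sketch, not ocfs2 code: model_osb, model_sb and
null_check_misses_stale_pointer() are hypothetical names, the two threads
are replayed sequentially in the order shown, and a "freed" flag stands in
for kfree() so that the example itself stays well-defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct ocfs2_super and struct super_block. */
struct model_osb { bool freed; };
struct model_sb  { struct model_osb *s_fs_info; };

/*
 * Replays the schedule from the diagram and returns true if thread A
 * ends up touching an object that thread B has already "freed".
 */
static bool null_check_misses_stale_pointer(void)
{
	struct model_osb osb = { .freed = false };
	struct model_sb sb = { .s_fs_info = &osb };

	/* Thread A: osb = OCFS2_SB(inode->i_sb) */
	struct model_osb *local = sb.s_fs_info;

	/* Thread B: sb->s_fs_info = NULL, then kfree(osb) (modelled) */
	sb.s_fs_info = NULL;
	osb.freed = true;
	assert(sb.s_fs_info == NULL);

	/* Thread A: the NULL check tests the stale local copy, so it passes */
	if (!local)
		return false;

	/* Thread A: use of osb after it was freed */
	return local->freed;
}
```

In this model the function returns true: clearing s_fs_info does not help
a reader that loaded the pointer before it was cleared.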

BTW, how did you find this issue?

Joseph

> Signed-off-by: Jiakai Xu <xujiakai24@mails.ucas.ac.cn>
> Fixes: ccd979bdbce9f ("OCFS2: The Second Oracle Cluster Filesystem")
> ---
>  fs/ocfs2/dlmglue.c | 3 +++
>  fs/ocfs2/super.c   | 2 +-
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 7283bb2c5a31..cd619958a0a2 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -2435,6 +2435,9 @@ int ocfs2_inode_lock_full_nested(struct inode *inode,
>  	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>  	struct buffer_head *local_bh = NULL;
>  
> +	if (!osb)
> +		return -EIO;
> +
>  	mlog(0, "inode %llu, take %s META lock\n",
>  	     (unsigned long long)OCFS2_I(inode)->ip_blkno,
>  	     ex ? "EXMODE" : "PRMODE");
> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
> index b875f01c9756..3fd56638e4f0 100644
> --- a/fs/ocfs2/super.c
> +++ b/fs/ocfs2/super.c
> @@ -1881,10 +1881,10 @@ static void ocfs2_dismount_volume(struct super_block *sb, int mnt_err)
>  	printk(KERN_INFO "ocfs2: Unmounting device (%s) on (node %s)\n",
>  	       osb->dev_str, nodestr);
>  
> +	sb->s_fs_info = NULL;
>  	ocfs2_delete_osb(osb);
>  	kfree(osb);
>  	sb->s_dev = 0;
> -	sb->s_fs_info = NULL;
>  }
>  
>  static int ocfs2_setup_osb_uuid(struct ocfs2_super *osb, const unsigned char *uuid,

Re: [PATCH] ocfs2: fix use-after-free in ocfs2_inode_lock_full_nested during unmount

Posted by Jiakai Xu 3 weeks ago

> This does not seem to be enough; a TOCTOU window still exists. For example:
> 
> Thread A			Thread B
> osb = OCFS2_SB(inode->i_sb)
> 				ocfs2_dismount_volume()
> 				-> sb->s_fs_info = NULL
> 				-> kfree(osb)
> use freed osb
> 

Hi Joseph,

Thank you very much for the review! You are absolutely right about the
TOCTOU issue — simply adding a NULL check after OCFS2_SB() cannot
prevent the race where thread A reads a valid osb pointer before thread
B frees it.

> BTW, how did you find this issue?

I found this issue through fuzzing. The crash report shows a page fault
in __pv_queued_spin_lock_slowpath(), reached via the call path:

  ocfs2_permission -> ocfs2_inode_lock_tracker ->
  ocfs2_inode_lock_full_nested -> ocfs2_is_hard_readonly ->
  spin_lock(&osb->osb_lock)

The fault address is not a mapped kernel address (the oops reports a
not-present page), which suggests that the osb structure had been freed
and its memory reused.

I have been thinking about a more robust fix and would like to get your
opinion on the following approach:

Currently, ocfs2_dismount_volume() is called from ocfs2_put_super(),
which runs inside generic_shutdown_super() while s_umount is still held.
The osb structure is freed at this point, but inodes with elevated
refcounts (e.g., held by inotify) survive evict_inodes() and may still
trigger filesystem operations (like ocfs2_permission) that access osb.

The idea is to move the osb cleanup out of ocfs2_dismount_volume() and
into an ocfs2-specific ->kill_sb() callback, so that the cleanup happens
after generic_shutdown_super() has completed and all concurrent VFS
operations have drained.

Specifically:

1. Remove ocfs2_delete_osb(), kfree(osb), and sb->s_fs_info = NULL from
   ocfs2_dismount_volume(). Keep all the subsystem shutdown (journal,
   dlm, recovery, quota, etc.) there.

2. Add a new ocfs2_kill_sb() that wraps kill_block_super():

   static void ocfs2_kill_sb(struct super_block *sb)
   {
       struct ocfs2_super *osb = OCFS2_SB(sb);

       kill_block_super(sb);
       /*
        * At this point generic_shutdown_super() has completed,
        * SB_DYING is set, and no new VFS operations can enter.
        */

       if (osb) {
           ocfs2_delete_osb(osb);
           kfree(osb);
           sb->s_fs_info = NULL;
       }
   }

3. Update ocfs2_fs_type to use ocfs2_kill_sb instead of kill_block_super.

4. The NULL check in ocfs2_inode_lock_full_nested() can optionally be
   kept as a defense-in-depth measure, though it is no longer strictly
   necessary if the life-cycle ordering is correct.

This pattern is similar to ext4: ext4_kill_sb() calls kill_block_super()
first and then handles its remaining cleanup (e.g., journal_bdev_file).

Does this approach make sense?

Best regards,
Jiakai

Re: [PATCH] ocfs2: fix use-after-free in ocfs2_inode_lock_full_nested during unmount

Posted by Joseph Qi 3 weeks ago


On 5/9/26 12:28 PM, Jiakai Xu wrote:
>> This does not seem to be enough; a TOCTOU window still exists. For example:
>>
>> Thread A			Thread B
>> osb = OCFS2_SB(inode->i_sb)
>> 				ocfs2_dismount_volume()
>> 				-> sb->s_fs_info = NULL
>> 				-> kfree(osb)
>> use freed osb
>>
> 
> Hi Joseph,
> 
> Thank you very much for the review! You are absolutely right about the
> TOCTOU issue — simply adding a NULL check after OCFS2_SB() cannot
> prevent the race where thread A reads a valid osb pointer before thread
> B frees it.
> 
>> BTW, how did you find this issue?
> 
> I found this issue through fuzzing. The crash report shows a page fault 
> at __pv_queued_spin_lock_slowpath via the call path:
> 
>   ocfs2_permission -> ocfs2_inode_lock_tracker ->
>   ocfs2_inode_lock_full_nested -> ocfs2_is_hard_readonly ->
>   spin_lock(&osb->osb_lock)

What is the operation?
We expect that no operation can access the filesystem during shutdown.

> 
> The fault address is not a mapped kernel address (the oops reports a
> not-present page), which suggests that the osb structure had been freed
> and its memory reused.
> 
> I have been thinking about a more robust fix and would like to get your
> opinion on the following approach:
> 
> Currently, ocfs2_dismount_volume() is called from ocfs2_put_super(),
> which runs inside generic_shutdown_super() while s_umount is still held.
> The osb structure is freed at this point, but inodes with elevated
> refcounts (e.g., held by inotify) survive evict_inodes() and may still
> trigger filesystem operations (like ocfs2_permission) that access osb.
> 
> The idea is to move the osb cleanup out of ocfs2_dismount_volume() and
> into an ocfs2-specific ->kill_sb() callback, so that the cleanup happens
> after generic_shutdown_super() has completed and all concurrent VFS
> operations have drained.
> 
> Specifically:
> 
> 1. Remove ocfs2_delete_osb(), kfree(osb), and sb->s_fs_info = NULL from
>    ocfs2_dismount_volume(). Keep all the subsystem shutdown (journal,
>    dlm, recovery, quota, etc.) there.
> 
> 2. Add a new ocfs2_kill_sb() that wraps kill_block_super():
> 
>    static void ocfs2_kill_sb(struct super_block *sb)
>    {
>        struct ocfs2_super *osb = OCFS2_SB(sb);
> 
>        kill_block_super(sb);
>        // At this point generic_shutdown_super() has completed,
>        // SB_DYING is set, and no new VFS operations can enter.
> 
>        if (osb) {
>            ocfs2_delete_osb(osb);
>            kfree(osb);
>            sb->s_fs_info = NULL;
>        }
>    }
> 
> 3. Update ocfs2_fs_type to use ocfs2_kill_sb instead of kill_block_super.
> 
> 4. The NULL check in ocfs2_inode_lock_full_nested() can optionally be
>    kept as a defense-in-depth measure, though it is no longer strictly
>    necessary if the life-cycle ordering is correct.
> 
> This pattern is similar to ext4 — ext4_kill_sb() calls kill_block_super()
> first and then handles cleanup after (e.g., journal_bdev_file).
> 
> Does this approach make sense?
> 

generic_shutdown_super() clears SB_ACTIVE,
so it seems we can check this flag.

Thanks,
Joseph

Re: [PATCH] ocfs2: fix use-after-free in ocfs2_inode_lock_full_nested during unmount

Posted by Jiakai Xu 2 weeks, 3 days ago

> generic_shutdown_super() clears SB_ACTIVE,
> so it seems we can check this flag.

Hi Joseph,

Thank you for the suggestion. I looked into the SB_ACTIVE approach,
but it seems like it still cannot fully close the TOCTOU window.
Let me explain my understanding:

generic_shutdown_super() clears SB_ACTIVE and then calls put_super(),
so checking sb->s_flags & SB_ACTIVE in ocfs2_inode_lock_full_nested()
would access the superblock itself (which is still alive), not osb.
That part is safe. However, consider this race:

    Thread A (inotify_add_watch)          Thread B (umount)
    ─────────────────────────────         ─────────────────────
    read sb->s_flags → SB_ACTIVE set
                                          generic_shutdown_super()
                                            → clear SB_ACTIVE
                                            → put_super
                                              → kfree(osb)
    osb = OCFS2_SB(sb) → osb is freed
    → use osb → UAF

So even with the SB_ACTIVE check at the beginning of
ocfs2_inode_lock_full_nested(), there is still a window between
the flag check and the actual dereference of osb where the
filesystem teardown can complete and free the osb structure.
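
To make the window concrete, the orderings of the steps in the diagram can
be enumerated exhaustively. The following is a sketch under simplifying
assumptions: each thread is reduced to two atomic steps (A: check the flag,
then use osb; B: clear the flag, then free osb), and all names are
hypothetical rather than taken from ocfs2.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Replay one schedule of four slots. If bit i of a_slots is set, slot i
 * runs thread A's next step; otherwise it runs thread B's next step.
 * A: step 0 checks the flag, step 1 uses osb (only if the check passed).
 * B: step 0 clears the flag, step 1 frees osb.
 */
static bool schedule_hits_uaf(unsigned int a_slots)
{
	bool active = true;	/* models SB_ACTIVE */
	bool freed = false;	/* models kfree(osb) having run */
	bool passed = false;	/* did A's flag check succeed? */
	bool uaf = false;
	int a_step = 0, b_step = 0;

	for (int slot = 0; slot < 4; slot++) {
		if (a_slots & (1u << slot)) {
			if (a_step++ == 0)
				passed = active;
			else if (passed && freed)
				uaf = true;
		} else {
			if (b_step++ == 0)
				active = false;
			else
				freed = true;
		}
	}
	assert(a_step == 2 && b_step == 2);
	return uaf;
}

static int count_set_bits(unsigned int v)
{
	int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/* Count the schedules (A owns exactly two of four slots) that hit the UAF. */
static int count_uaf_schedules(int *total)
{
	int bad = 0;

	*total = 0;
	for (unsigned int m = 0; m < 16; m++) {
		if (count_set_bits(m) != 2)
			continue;
		(*total)++;
		if (schedule_hits_uaf(m))
			bad++;
	}
	return bad;
}
```

Under this model, one of the six possible orderings ends in a use after
free: the one where both of B's steps run between A's check and A's use.
The flag check narrows the window but does not close it.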

To be honest, I'm finding it difficult to come up with a clean
solution for this race. I wonder if you or anyone in the community
might have ideas on how to best address it.

Any guidance would be greatly appreciated.

Best regards,
Jiakai

Re: [PATCH] ocfs2: fix use-after-free in ocfs2_inode_lock_full_nested during unmount

Posted by Jiakai Xu 3 weeks ago

> What is the operation?
> We expect that no operation can access the filesystem during shutdown.

Here is the full crash report produced by the fuzzer:

BUG: unable to handle page fault for address: ffffffff1315afd0
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 6e6a067 P4D 6e6b067 PUD 0 
Oops: Oops: 0002 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 12119 Comm: syz.2.132 Not tainted 6.18.5 #1 PREEMPT(full) 
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:__pv_queued_spin_lock_slowpath+0x109/0x430 home/zzzrrll/tmp/linux/kernel/locking/qspinlock.c:288
Code: 9a 00 00 00 0f b7 c8 81 e1 fc ff 00 00 83 e0 03 48 c1 e0 05 4c 8d a8 00 c4 6b 89 48 c7 c2 f8 ff ff ff 48 8b ac 4a 90 0d ab 86 <48> 89 9c 05 00 c4 6b 89 b8 00 80 00 00 45 31 f6 eb 23 41 80 7c 2d
RSP: 0018:ffa000000da9bcc0 EFLAGS: 00010216
RAX: 0000000000000060 RBX: ff1100007da2c400 RCX: 0000000000008584
RDX: fffffffffffffff8 RSI: 0000000085873528 RDI: 0000000000040000
RBP: ffffffff89a9eb70 R08: ff1100007da2c414 R09: 0000000000000000
R10: 0000000000000002 R11: ffffffff823c6ad0 R12: 0000000000000000
R13: ffffffff896bc460 R14: ff110000f4370000 R15: ff1100007ba096c8
FS:  00007fb3ffc0a640(0000) GS:ff110000f4370000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff1315afd0 CR3: 000000003f598000 CR4: 0000000000751ef0
PKRU: 80000000
Call Trace:
 <TASK>
 pv_queued_spin_lock_slowpath home/zzzrrll/tmp/linux/include/asm-generic/qspinlock.h:111 [inline]
 queued_spin_lock_slowpath home/zzzrrll/tmp/linux/arch/x86/include/asm/qspinlock.h:51 [inline]
 queued_spin_lock home/zzzrrll/tmp/linux/include/asm-generic/qspinlock.h:114 [inline]
 do_raw_spin_lock home/zzzrrll/tmp/linux/include/linux/spinlock.h:187 [inline]
 __raw_spin_lock home/zzzrrll/tmp/linux/include/linux/spinlock_api_smp.h:134 [inline]
 _raw_spin_lock+0x31/0x40 home/zzzrrll/tmp/linux/kernel/locking/spinlock.c:154
 spin_lock home/zzzrrll/tmp/linux/include/linux/spinlock.h:351 [inline]
 ocfs2_is_hard_readonly home/zzzrrll/tmp/linux/fs/ocfs2/ocfs2.h:665 [inline]
 ocfs2_inode_lock_full_nested+0x5c/0xca0 home/zzzrrll/tmp/linux/fs/ocfs2/dlmglue.c:2446
 ocfs2_inode_lock_tracker+0xd8/0x400 home/zzzrrll/tmp/linux/fs/ocfs2/dlmglue.c:2691
 ocfs2_permission+0x75/0x130 home/zzzrrll/tmp/linux/fs/ocfs2/file.c:1349
 do_inode_permission home/zzzrrll/tmp/linux/fs/namei.c:526 [inline]
 inode_permission+0x1b4/0x2d0 home/zzzrrll/tmp/linux/fs/namei.c:593
 path_permission home/zzzrrll/tmp/linux/include/linux/fs.h:3086 [inline]
 inotify_find_inode home/zzzrrll/tmp/linux/fs/notify/inotify/inotify_user.c:381 [inline]
 __do_sys_inotify_add_watch home/zzzrrll/tmp/linux/fs/notify/inotify/inotify_user.c:771 [inline]
 __se_sys_inotify_add_watch+0x146/0x650 home/zzzrrll/tmp/linux/fs/notify/inotify/inotify_user.c:729
 do_syscall_x64 home/zzzrrll/tmp/linux/arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc6/0xfa0 home/zzzrrll/tmp/linux/arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb3fedae16d
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fb3ffc09f98 EFLAGS: 00000246 ORIG_RAX: 00000000000000fe
RAX: ffffffffffffffda RBX: 00007fb3feff5fa0 RCX: 00007fb3fedae16d
RDX: 0000000004000000 RSI: 0000200000000080 RDI: 0000000000000004
RBP: 00007fb3fee480f0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fb3feff6038 R14: 00007fb3feff5fa0 R15: 00007fb3ffbea000
 </TASK>
Modules linked in:
CR2: ffffffff1315afd0
---[ end trace 0000000000000000 ]---
RIP: 0010:__pv_queued_spin_lock_slowpath+0x109/0x430 home/zzzrrll/tmp/linux/kernel/locking/qspinlock.c:288
Code: 9a 00 00 00 0f b7 c8 81 e1 fc ff 00 00 83 e0 03 48 c1 e0 05 4c 8d a8 00 c4 6b 89 48 c7 c2 f8 ff ff ff 48 8b ac 4a 90 0d ab 86 <48> 89 9c 05 00 c4 6b 89 b8 00 80 00 00 45 31 f6 eb 23 41 80 7c 2d
RSP: 0018:ffa000000da9bcc0 EFLAGS: 00010216
RAX: 0000000000000060 RBX: ff1100007da2c400 RCX: 0000000000008584
RDX: fffffffffffffff8 RSI: 0000000085873528 RDI: 0000000000040000
RBP: ffffffff89a9eb70 R08: ff1100007da2c414 R09: 0000000000000000
R10: 0000000000000002 R11: ffffffff823c6ad0 R12: 0000000000000000
R13: ffffffff896bc460 R14: ff110000f4370000 R15: ff1100007ba096c8
FS:  00007fb3ffc0a640(0000) GS:ff110000f4370000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff1315afd0 CR3: 000000003f598000 CR4: 0000000000751ef0
PKRU: 80000000
----------------
Code disassembly (best guess), 1 bytes skipped:
   0:	00 00                	add    %al,(%rax)
   2:	00 0f                	add    %cl,(%rdi)
   4:	b7 c8                	mov    $0xc8,%bh
   6:	81 e1 fc ff 00 00    	and    $0xfffc,%ecx
   c:	83 e0 03             	and    $0x3,%eax
   f:	48 c1 e0 05          	shl    $0x5,%rax
  13:	4c 8d a8 00 c4 6b 89 	lea    -0x76943c00(%rax),%r13
  1a:	48 c7 c2 f8 ff ff ff 	mov    $0xfffffffffffffff8,%rdx
  21:	48 8b ac 4a 90 0d ab 	mov    -0x7954f270(%rdx,%rcx,2),%rbp
  28:	86
* 29:	48 89 9c 05 00 c4 6b 	mov    %rbx,-0x76943c00(%rbp,%rax,1) <-- trapping instruction
  30:	89
  31:	b8 00 80 00 00       	mov    $0x8000,%eax
  36:	45 31 f6             	xor    %r14d,%r14d
  39:	eb 23                	jmp    0x5e
  3b:	41                   	rex.B
  3c:	80                   	.byte 0x80
  3d:	7c 2d                	jl     0x6c
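
As a consistency check on the report, the fault address can be recomputed
from the trapping instruction, mov %rbx,-0x76943c00(%rbp,%rax,1), using the
RBP and RAX values from the register dump. A small sketch (the function
names are hypothetical; all constants are copied from the oops above):

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of disp(%base,%index,1), with 64-bit wraparound. */
static uint64_t effective_address(uint64_t base, uint64_t index, int64_t disp)
{
	return base + index + (uint64_t)disp;
}

static uint64_t oops_fault_address(void)
{
	const uint64_t rbp = 0xffffffff89a9eb70ULL;	/* RBP from the dump */
	const uint64_t rax = 0x60;			/* RAX from the dump */
	const int64_t disp = -0x76943c00LL;		/* from the disassembly */

	return effective_address(rbp, rax, disp);
}
```

The result equals the reported CR2 value (ffffffff1315afd0), so the
faulting access is the store marked as the trapping instruction.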

Jiakai