[PATCH v5] f2fs: protect published gc_thread during teardown

Zhang Cen posted 1 patch 2 days, 13 hours ago
fs/f2fs/gc.c      | 52 ++++++++++++++++++++++++++++++-----------------
fs/f2fs/gc.h      | 38 ++++++++++++++++++++++++++++++++++
fs/f2fs/segment.c | 19 ++++++++++-------
fs/f2fs/super.c   |  7 +++++--
fs/f2fs/sysfs.c   | 18 +++++++++++-----
5 files changed, 101 insertions(+), 33 deletions(-)
[PATCH v5] f2fs: protect published gc_thread during teardown
Posted by Zhang Cen 2 days, 13 hours ago
f2fs_stop_gc_thread() stops the background GC task, wakes GC_MERGE
foreground waiters, frees sbi->gc_thread, and then clears the published
pointer. A foreground f2fs_balance_fs() caller can already have copied
that pointer and queued itself on gc_th->fggc_wq, so freeing gc_th at
stop time can leave finish_wait() operating on a freed waitqueue.

The buggy scenario involves two paths, with each column showing the
order within that path:

foreground f2fs_balance_fs() caller:   shutdown path:
  1. observes no checkpoint error        1. sets CP_ERROR_FLAG
  2. snapshots sbi->gc_thread           2. enters f2fs_stop_gc_thread()
  3. queues on gc_th->fggc_wq           3. stops gc_th->f2fs_gc_task
  4. wakes gc_wait_queue_head           4. wakes gc_th->fggc_wq
  5. sleeps for foreground GC           5. frees gc_th
  6. finish_wait() touches fggc_wq

GC_MERGE does not keep independent work_struct items that shutdown can
cancel. Its pending foreground GC requests are waitqueue waiters. Drain
them by withdrawing the GC task pointer, stopping the task, waking
gc_th->fggc_wq, and leaving each waiter to remove its own wait entry
with finish_wait().

Keep the allocated GC-thread state until the superblock is destroyed and
use gc_th->f2fs_gc_task as the running-state marker. The stop path now
withdraws the task pointer under gc_task_lock, stops the task, and wakes
foreground waiters, but leaves the waitqueue storage valid. The start
path reuses a stopped gc_thread object instead of reinitializing its
waitqueues, and remount restart decisions check the task pointer rather
than only the object pointer.

f2fs_balance_fs() also snapshots sbi->gc_thread once and rechecks both
f2fs_cp_error() and f2fs_gc_task after prepare_to_wait(). If shutdown is
visible or the GC task has already been withdrawn, the caller removes its
wait entry without waking the GC thread or sleeping for new foreground GC
work. Thus shutdown drains already queued waiters and stops accepting
new foreground GC work once shutdown is visible to the caller.

Task pointer users are protected separately from the retained container
lifetime. A sysfs critical_task_priority store now snapshots the GC task
under gc_task_lock and holds a task_struct reference while calling
set_user_nice(). Boolean running-state checks that do not dereference the
task_struct continue to use READ_ONCE().

One observed report was:

BUG: KASAN: slab-use-after-free in finish_wait+0x276/0x290
Write of size 8 at addr ffff8881150819b8 by task dd/802
The buggy address belongs to the object at ffff888115081900 which
belongs to the cache kmalloc-256 of size 256
The buggy address is located 184 bytes inside of freed 256-byte region
Call trace:
  finish_wait()
  f2fs_balance_fs()
  f2fs_write_single_data_page()
  f2fs_write_cache_pages()
  __f2fs_write_data_pages()
  do_writepages()
  filemap_fdatawrite_wbc()
  __filemap_fdatawrite_range()
  file_write_and_wait_range()
  f2fs_do_sync_file()
  f2fs_sync_file()
  do_fsync()
Freed by task stack:
  kfree()
  f2fs_stop_gc_thread()
  f2fs_do_shutdown()
  f2fs_shutdown()
  fs_bdev_mark_dead()

Fixes: 5911d2d1d1a3 ("f2fs: introduce gc_merge mount option")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
---
v5:
Explain that GC_MERGE foreground requests are waitqueue waiters, not
independent work items; shutdown drains them by withdrawing the GC task,
waking fggc_wq, and letting waiters dequeue themselves.
Recheck checkpoint error and the GC task pointer after prepare_to_wait() so
shutdown-visible f2fs_balance_fs() callers do not wake or sleep for new
foreground GC work.

v4:
Replace the v3 SRCU/refcounted lifetime model with a smaller fix that
keeps the existing heap-allocated gc_thread object alive until
superblock teardown.
Use f2fs_gc_task as the running-state marker and withdraw it with
xchg() before waking GC_MERGE waiters.
Reuse a stopped gc_thread object across remount restarts so the
waitqueues are not reinitialized while old waiters can still finish.
Recheck f2fs_gc_task after prepare_to_wait() so a waiter that races
with teardown does not sleep after the worker has been withdrawn.

v3:
Fix checkpatch style issues in the broader lifetime variant.

v2:
Sashiko.dev pointed out that GC_MERGE foreground waiters and
GC-thread users needed lifetime-safe access after teardown.

 fs/f2fs/gc.c      | 52 ++++++++++++++++++++++++++++++-----------------
 fs/f2fs/gc.h      | 38 ++++++++++++++++++++++++++++++++++
 fs/f2fs/segment.c | 19 ++++++++++-------
 fs/f2fs/super.c   |  7 +++++--
 fs/f2fs/sysfs.c   | 18 +++++++++++-----
 5 files changed, 101 insertions(+), 33 deletions(-)

diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index ba93010924c0..190d60b3cfc5 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -193,12 +193,24 @@ static int gc_thread_func(void *data)
 
 int f2fs_start_gc_thread(struct f2fs_sb_info *sbi)
 {
-	struct f2fs_gc_kthread *gc_th;
+	struct f2fs_gc_kthread *gc_th = READ_ONCE(sbi->gc_thread);
+	struct task_struct *task;
+	bool allocated = false;
 	dev_t dev = sbi->sb->s_bdev->bd_dev;
 
-	gc_th = f2fs_kmalloc(sbi, sizeof(struct f2fs_gc_kthread), GFP_KERNEL);
-	if (!gc_th)
-		return -ENOMEM;
+	if (gc_th && READ_ONCE(gc_th->f2fs_gc_task))
+		return 0;
+
+	if (!gc_th) {
+		gc_th = f2fs_kmalloc(sbi, sizeof(*gc_th), GFP_KERNEL);
+		if (!gc_th)
+			return -ENOMEM;
+		gc_th->f2fs_gc_task = NULL;
+		spin_lock_init(&gc_th->gc_task_lock);
+		init_waitqueue_head(&gc_th->gc_wait_queue_head);
+		init_waitqueue_head(&gc_th->fggc_wq);
+		allocated = true;
+	}
 
 	gc_th->urgent_sleep_time = DEF_GC_THREAD_URGENT_SLEEP_TIME;
 	gc_th->valid_thresh_ratio = DEF_GC_THREAD_VALID_THRESH_RATIO;
@@ -221,34 +233,36 @@ int f2fs_start_gc_thread(struct f2fs_sb_info *sbi)
 
 	gc_th->gc_wake = false;
 
-	sbi->gc_thread = gc_th;
-	init_waitqueue_head(&sbi->gc_thread->gc_wait_queue_head);
-	init_waitqueue_head(&sbi->gc_thread->fggc_wq);
-	sbi->gc_thread->f2fs_gc_task = kthread_run(gc_thread_func, sbi,
-			"f2fs_gc-%u:%u", MAJOR(dev), MINOR(dev));
-	if (IS_ERR(gc_th->f2fs_gc_task)) {
-		int err = PTR_ERR(gc_th->f2fs_gc_task);
+	task = kthread_create(gc_thread_func, sbi, "f2fs_gc-%u:%u",
+			      MAJOR(dev), MINOR(dev));
+	if (IS_ERR(task)) {
+		int err = PTR_ERR(task);
 
-		kfree(gc_th);
-		sbi->gc_thread = NULL;
+		if (allocated)
+			kfree(gc_th);
 		return err;
 	}
 
-	set_user_nice(gc_th->f2fs_gc_task,
-			PRIO_TO_NICE(sbi->critical_task_priority));
+	set_user_nice(task, PRIO_TO_NICE(sbi->critical_task_priority));
+	if (allocated)
+		WRITE_ONCE(sbi->gc_thread, gc_th);
+	f2fs_set_gc_task(gc_th, task);
+	wake_up_process(task);
 	return 0;
 }
 
 void f2fs_stop_gc_thread(struct f2fs_sb_info *sbi)
 {
-	struct f2fs_gc_kthread *gc_th = sbi->gc_thread;
+	struct f2fs_gc_kthread *gc_th = READ_ONCE(sbi->gc_thread);
+	struct task_struct *task;
 
 	if (!gc_th)
 		return;
-	kthread_stop(gc_th->f2fs_gc_task);
+	task = f2fs_detach_gc_task(gc_th);
+	if (!task)
+		return;
+	kthread_stop(task);
 	wake_up_all(&gc_th->fggc_wq);
-	kfree(gc_th);
-	sbi->gc_thread = NULL;
 }
 
 static int select_gc_type(struct f2fs_sb_info *sbi, int gc_type)
diff --git a/fs/f2fs/gc.h b/fs/f2fs/gc.h
index 6c4d4567571e..6df406e98939 100644
--- a/fs/f2fs/gc.h
+++ b/fs/f2fs/gc.h
@@ -5,6 +5,9 @@
  * Copyright (c) 2012 Samsung Electronics Co., Ltd.
  *             http://www.samsung.com/
  */
+#include <linux/sched/task.h>
+#include <linux/spinlock.h>
+
 #define GC_THREAD_MIN_WB_PAGES		1	/*
 						 * a threshold to determine
 						 * whether IO subsystem is idle
@@ -47,6 +50,7 @@
 
 struct f2fs_gc_kthread {
 	struct task_struct *f2fs_gc_task;
+	spinlock_t gc_task_lock;		/* protects f2fs_gc_task */
 	wait_queue_head_t gc_wait_queue_head;
 
 	/* for gc sleep time */
@@ -72,6 +76,40 @@ struct f2fs_gc_kthread {
 	unsigned int boost_gc_greedy;
 };
 
+static inline struct task_struct *f2fs_get_gc_task(struct f2fs_gc_kthread *gc_th)
+{
+	struct task_struct *task;
+
+	spin_lock(&gc_th->gc_task_lock);
+	task = READ_ONCE(gc_th->f2fs_gc_task);
+	if (task)
+		get_task_struct(task);
+	spin_unlock(&gc_th->gc_task_lock);
+
+	return task;
+}
+
+static inline void f2fs_set_gc_task(struct f2fs_gc_kthread *gc_th,
+				    struct task_struct *task)
+{
+	spin_lock(&gc_th->gc_task_lock);
+	WRITE_ONCE(gc_th->f2fs_gc_task, task);
+	spin_unlock(&gc_th->gc_task_lock);
+}
+
+static inline struct task_struct *
+f2fs_detach_gc_task(struct f2fs_gc_kthread *gc_th)
+{
+	struct task_struct *task;
+
+	spin_lock(&gc_th->gc_task_lock);
+	task = READ_ONCE(gc_th->f2fs_gc_task);
+	WRITE_ONCE(gc_th->f2fs_gc_task, NULL);
+	spin_unlock(&gc_th->gc_task_lock);
+
+	return task;
+}
+
 struct gc_inode_list {
 	struct list_head ilist;
 	struct radix_tree_root iroot;
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 788f8b050249..5f61627009a5 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -424,6 +424,8 @@ int f2fs_commit_atomic_write(struct inode *inode)
  */
 void f2fs_balance_fs(struct f2fs_sb_info *sbi, bool need)
 {
+	struct f2fs_gc_kthread *gc_th;
+
 	if (f2fs_cp_error(sbi))
 		return;
 
@@ -444,15 +446,18 @@ void f2fs_balance_fs(struct f2fs_sb_info *sbi, bool need)
 	if (has_enough_free_secs(sbi, 0, 0))
 		return;
 
-	if (test_opt(sbi, GC_MERGE) && sbi->gc_thread &&
-				sbi->gc_thread->f2fs_gc_task) {
+	gc_th = READ_ONCE(sbi->gc_thread);
+	if (test_opt(sbi, GC_MERGE) && gc_th &&
+	    READ_ONCE(gc_th->f2fs_gc_task)) {
 		DEFINE_WAIT(wait);
 
-		prepare_to_wait(&sbi->gc_thread->fggc_wq, &wait,
-					TASK_UNINTERRUPTIBLE);
-		wake_up(&sbi->gc_thread->gc_wait_queue_head);
-		io_schedule();
-		finish_wait(&sbi->gc_thread->fggc_wq, &wait);
+		prepare_to_wait(&gc_th->fggc_wq, &wait,
+				TASK_UNINTERRUPTIBLE);
+		if (!f2fs_cp_error(sbi) && READ_ONCE(gc_th->f2fs_gc_task)) {
+			wake_up(&gc_th->gc_wait_queue_head);
+			io_schedule();
+		}
+		finish_wait(&gc_th->fggc_wq, &wait);
 	} else {
 		struct f2fs_gc_control gc_control = {
 			.victim_segno = NULL_SEGNO,
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index ccf806b676f5..d6863da05a7c 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -2925,11 +2925,12 @@ static int __f2fs_remount(struct fs_context *fc, struct super_block *sb)
 	if ((flags & SB_RDONLY) ||
 			(F2FS_OPTION(sbi).bggc_mode == BGGC_MODE_OFF &&
 			!test_opt(sbi, GC_MERGE))) {
-		if (sbi->gc_thread) {
+		if (sbi->gc_thread && READ_ONCE(sbi->gc_thread->f2fs_gc_task)) {
 			f2fs_stop_gc_thread(sbi);
 			need_restart_gc = true;
 		}
-	} else if (!sbi->gc_thread) {
+	} else if (!sbi->gc_thread ||
+			!READ_ONCE(sbi->gc_thread->f2fs_gc_task)) {
 		err = f2fs_start_gc_thread(sbi);
 		if (err)
 			goto restore_opts;
@@ -5451,6 +5452,7 @@ static int f2fs_fill_super(struct super_block *sb, struct fs_context *fc)
 free_sb_buf:
 	kfree(raw_super);
 free_sbi:
+	kfree(sbi->gc_thread);
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	lockdep_unregister_key(&sbi->cp_global_sem_key);
 #endif
@@ -5535,6 +5537,7 @@ static void kill_f2fs_super(struct super_block *sb)
 	/* Release block devices last, after fscrypt_destroy_keyring(). */
 	if (sbi) {
 		destroy_device_list(sbi);
+		kfree(sbi->gc_thread);
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 		lockdep_unregister_key(&sbi->cp_global_sem_key);
 #endif
diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c
index 352e96ad5c3a..07543df2b5b1 100644
--- a/fs/f2fs/sysfs.c
+++ b/fs/f2fs/sysfs.c
@@ -12,6 +12,7 @@
 #include <linux/seq_file.h>
 #include <linux/unicode.h>
 #include <linux/ioprio.h>
+#include <linux/sched/task.h>
 #include <linux/sysfs.h>
 
 #include "f2fs.h"
@@ -981,17 +982,24 @@ static ssize_t __sbi_store(struct f2fs_attr *a,
 	}
 
 	if (!strcmp(a->attr.name, "critical_task_priority")) {
+		struct f2fs_gc_kthread *gc_th;
+		struct task_struct *gc_task;
+		int nice;
+
 		if (t < NICE_TO_PRIO(MIN_NICE) || t > NICE_TO_PRIO(MAX_NICE))
 			return -EINVAL;
 		if (!capable(CAP_SYS_NICE))
 			return -EPERM;
 		sbi->critical_task_priority = t;
+		nice = PRIO_TO_NICE(sbi->critical_task_priority);
 		if (sbi->cprc_info.f2fs_issue_ckpt)
-			set_user_nice(sbi->cprc_info.f2fs_issue_ckpt,
-					PRIO_TO_NICE(sbi->critical_task_priority));
-		if (sbi->gc_thread && sbi->gc_thread->f2fs_gc_task)
-			set_user_nice(sbi->gc_thread->f2fs_gc_task,
-					PRIO_TO_NICE(sbi->critical_task_priority));
+			set_user_nice(sbi->cprc_info.f2fs_issue_ckpt, nice);
+		gc_th = READ_ONCE(sbi->gc_thread);
+		gc_task = gc_th ? f2fs_get_gc_task(gc_th) : NULL;
+		if (gc_task) {
+			set_user_nice(gc_task, nice);
+			put_task_struct(gc_task);
+		}
 		return count;
 	}
 
-- 
2.43.0