Hi all,
Changes since v3:
* Collect RVB from Jan Kara. (Thanks for your review!)
* Patch 3: Remove stale comments. (Reported by Sashiko)
Changes since v2:
* Collect RVB from Jan Kara. (Thanks for your review!)
* Patch 3: switch to wake_up_var() / wait_var_event() to drain
s_isw_nr_in_flight. (Suggested by Christian Brauner and Sashiko)
* Polish comments and changelogs.
Changes since v1:
* Use a simple RCU-based fix (patch 1) that is easy to backport to
older kernels; the per-sb refcount optimization is split out as a
separate performance patch (patch 3). (Suggested by Jan Kara)
v1: https://patch.msgid.link/20260513094829.867648-1-libaokun@linux.alibaba.com
v2: https://patch.msgid.link/20260517142147.3354909-1-libaokun@linux.alibaba.com
v3: https://patch.msgid.link/20260518135349.1187628-1-libaokun@linux.alibaba.com
======
When a container exits, a race between cgroup_writeback_umount() and
inode_switch_wbs() / cleanup_offline_cgwb() can trigger
"VFS: Busy inodes after unmount" followed by a use-after-free on
percpu counters.
There is a window between inode_prepare_wbs_switch() returning true
(having passed the SB_ACTIVE check and grabbed the inode) and the
subsequent wb_queue_isw() call. If cgroup_writeback_umount() observes
the global isw_nr_in_flight counter as non-zero but flush_workqueue()
finds nothing queued, it returns early. That leaves a held inode
reference that blocks evict_inodes(), and a later iput() then hits
freed percpu counters.
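
The interleaving above can be walked through with a toy, single-threaded
model. This is only a sketch for reasoning about ordering, not kernel
code: struct toy_sb and every toy_* name are hypothetical, the global
counter is assumed to be non-zero so the flush always runs, and each
queued work item is assumed to drop exactly one inode reference.

```c
#include <stdbool.h>

/* Toy single-threaded model; every name here is hypothetical. */
struct toy_sb {
	bool active;     /* stands in for the SB_ACTIVE check */
	int inode_refs;  /* references held by in-flight switches */
	int queued;      /* switch work items visible to a flush */
};

/* Switcher step 1: pass the active check and grab the inode. */
static bool toy_prepare_switch(struct toy_sb *sb)
{
	if (!sb->active)
		return false;
	sb->inode_refs++;
	return true;
}

/* Switcher step 2: make the work visible to a later flush. */
static void toy_queue_switch(struct toy_sb *sb)
{
	sb->queued++;
}

/* Umount side: run everything queued so far; each item drops its ref. */
static void toy_flush(struct toy_sb *sb)
{
	sb->inode_refs -= sb->queued;
	sb->queued = 0;
}

/* Racy order: the flush runs inside the prepare -> queue window. */
static int toy_racy_order(void)
{
	struct toy_sb sb = { .active = true };
	bool prepared = toy_prepare_switch(&sb);
	int busy;

	sb.active = false;      /* umount begins */
	toy_flush(&sb);         /* finds nothing queued */
	busy = sb.inode_refs;   /* what eviction would see */
	if (prepared)
		toy_queue_switch(&sb);  /* queued too late to be flushed */
	return busy;
}

/* Ordered case: queueing completes before the flush runs. */
static int toy_ordered(void)
{
	struct toy_sb sb = { .active = true };

	if (toy_prepare_switch(&sb))
		toy_queue_switch(&sb);
	sb.active = false;
	toy_flush(&sb);
	return sb.inode_refs;
}
```

In this model toy_racy_order() leaves one reference held at the point
where eviction would run, while toy_ordered() leaves none.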
Patch 1 closes the race by extending the RCU read-side critical
section to cover the window from inode_prepare_wbs_switch() through
wb_queue_isw(), and adding synchronize_rcu() in the umount path so
that all in-flight switchers complete queueing before
flush_workqueue() runs. rcu_barrier() is intentionally retained so
the same hunk applies cleanly to stable trees that still queue
switches via queue_rcu_work().
Patch 2 removes the now-dead rcu_barrier() that was left over from
the queue_rcu_work() era (replaced by plain queue_work() in commit
e1b849cfa6b6 "writeback: Avoid contention on wb->list_lock when
switching inodes"). This is mainline-only.
Patch 3 replaces the global synchronize_rcu()/flush_workqueue() pair
with a per-sb counter (s_isw_nr_in_flight) plus three small helpers
(cgroup_writeback_pin / cgroup_writeback_unpin /
cgroup_writeback_drain), eliminating the global serialization
penalty. This also reverts the RCU extension from patch 1 since the
per-sb counter makes it unnecessary.
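
As a rough userspace analogy of the pin / unpin / drain idea, the
sketch below uses a pthread mutex and condition variable in place of
the kernel's wake_up_var() / wait_var_event(). It is not the code from
patch 3: struct sb_model and the model_* helpers are hypothetical
stand-ins, and the locking scheme is an assumption chosen to keep the
example short.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for per-superblock state. */
struct sb_model {
	pthread_mutex_t lock;
	pthread_cond_t  idle;
	int             nr_in_flight;  /* models s_isw_nr_in_flight */
};

#define SB_MODEL_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* Account for one in-flight switch on this sb. */
static void model_pin(struct sb_model *sb)
{
	pthread_mutex_lock(&sb->lock);
	sb->nr_in_flight++;
	pthread_mutex_unlock(&sb->lock);
}

/* Drop one in-flight switch; wake any drainer when the count hits zero. */
static void model_unpin(struct sb_model *sb)
{
	pthread_mutex_lock(&sb->lock);
	assert(sb->nr_in_flight > 0);
	if (--sb->nr_in_flight == 0)
		pthread_cond_broadcast(&sb->idle);
	pthread_mutex_unlock(&sb->lock);
}

/* Wait until this sb, and only this sb, has nothing in flight. */
static void model_drain(struct sb_model *sb)
{
	pthread_mutex_lock(&sb->lock);
	while (sb->nr_in_flight != 0)
		pthread_cond_wait(&sb->idle, &sb->lock);
	pthread_mutex_unlock(&sb->lock);
}

static void *model_worker(void *arg)
{
	model_unpin(arg);
	return NULL;
}

/* Pin n times, let n threads unpin, drain, and return what is left. */
static int model_demo(int n)
{
	struct sb_model sb = SB_MODEL_INIT;
	pthread_t tids[16];
	int started[16] = { 0 };
	int i;

	if (n > 16)
		n = 16;
	for (i = 0; i < n; i++)
		model_pin(&sb);
	for (i = 0; i < n; i++) {
		if (pthread_create(&tids[i], NULL, model_worker, &sb) == 0)
			started[i] = 1;
		else
			model_unpin(&sb);  /* no thread: unpin inline */
	}
	model_drain(&sb);
	for (i = 0; i < n; i++)
		if (started[i])
			pthread_join(tids[i], NULL);
	return sb.nr_in_flight;
}
```

Because the count lives in each struct sb_model, draining an idle
instance returns immediately even while another instance still has
pinned switches, which is the property the per-sb counter is after.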
Performance
-----------
Measured on a 16 vCPU QEMU guest, all kernels share the same .config.
Background load: 4 ext4 superblocks each running
    while :; do
        mkdir /sys/fs/cgroup/<tag>-tmp$N
        ( echo $BASHPID > <tag>-tmp$N/cgroup.procs
          dd if=/dev/zero of=$mp/burner bs=4k count=256 conv=notrunc \
             oflag=sync)
        rmdir /sys/fs/cgroup/<tag>-tmp$N
    done
This drives both inode_switch_wbs() (different cgroups writing the
same inode) and cleanup_offline_cgwb() (dying memcgs), keeping the
global isw_nr_in_flight non-zero throughout the run. Latencies are
wall-clock around umount(8) on a separate target sb; only the target
sb's umount is measured.
Four kernels are compared at each step of the series:
  base         pre-fix mainline
  +race        base + patch 1 (race fix, keeps rcu_barrier)
  +rmbarrier   +race + patch 2 (drop rcu_barrier)
  +persb       +rmbarrier + patch 3 (per-sb counter)
Target sb runs its own cgwb churn:
               p50        p95        p99        max
  base          99.7 ms   112.9 ms   112.9 ms   127.2 ms
  +race        110.2 ms   153.8 ms   153.8 ms   160.4 ms
  +rmbarrier    67.6 ms    88.3 ms    88.3 ms    96.8 ms
  +persb         7.9 ms    10.0 ms    10.0 ms    10.1 ms
Idle target umount under cross-sb cgwb-switch pressure:
               p50        p95        p99        max
  base          92.0 ms   123.5 ms   136.5 ms   141.3 ms
  +race        118.8 ms   154.6 ms   164.7 ms   165.3 ms
  +rmbarrier    62.7 ms    95.4 ms   108.1 ms   108.6 ms
  +persb         5.3 ms     6.9 ms     7.4 ms     7.4 ms
8 concurrent umounts of idle sbs under the same pressure:
               p50        p95        p99        max
  base         137.5 ms   166.9 ms   166.9 ms   171.3 ms
  +race        162.2 ms   183.9 ms   183.9 ms   217.0 ms
  +rmbarrier    61.3 ms    99.5 ms    99.5 ms   113.7 ms
  +persb         8.1 ms     9.1 ms     9.1 ms     9.5 ms
A no-pressure baseline run (no background load) measures ~5 ms p50
across all four kernels, validating that the methodology has no
systematic bias.
In-kernel cgroup_writeback_umount() cumulative cost across the same
run (bpftrace, ~340 calls covering all four scenarios):
                                 cgroup_writeback_umount() time
  base                           21240 ms total   (~62 ms / call)
  +race (rcu_barrier+sync)       24966 ms total   (~73 ms / call)
  +rmbarrier (synchronize_rcu)   12371 ms total   (~36 ms / call)
  +persb (per-sb counter)         1.37 ms total   ( ~4 us / call)
Under +persb the wait_var_event() condition is true on entry
whenever the target sb has nothing in flight, so synchronize_rcu()
and flush_workqueue() are never called on this path.
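
The per-call figures in the table above follow from dividing each
total by the call count. A minimal sketch of that arithmetic, assuming
exactly 340 calls (the text only says "~340") and using a hypothetical
helper name:

```c
/* Average cost per call in microseconds, given a total in milliseconds. */
static double per_call_us(double total_ms, int calls)
{
	return total_ms * 1000.0 / calls;
}
```

For example, per_call_us(21240.0, 340) is roughly 62 ms expressed in
microseconds, and per_call_us(1.37, 340) is roughly 4 us.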
Notes:
- Patch 1 adds ~10-27 ms p50 over base by introducing
synchronize_rcu(). This is the cost of closing the race
correctly and is paid by stable backports as well.
- Patch 2 ("drop rcu_barrier()") was expected to be a pure cleanup
on mainline, but actually removes a real wait: rcu_barrier()
drains call_rcu() callbacks from *all* subsystems, and the
cgroup teardown path keeps that pipeline busy under this
workload. Removing it cuts ~43-101 ms p50 on top of patch 1.
- Patch 3 (per-sb counter) replaces the global wait entirely; the
target sb no longer waits for activity on unrelated sbs,
recovering near-baseline latency in all three scenarios.
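
The p50 ranges quoted in the notes can be rechecked from the three
latency tables. The sketch below hardcodes the p50 columns copied from
those tables; the struct and function names are hypothetical and exist
only for this check.

```c
#include <stddef.h>

/* p50 values in ms, one row per scenario, copied from the tables. */
struct p50_row {
	double base, race, rmbarrier;
};

static const struct p50_row p50_ms[] = {
	{  99.7, 110.2, 67.6 },  /* target sb runs its own cgwb churn */
	{  92.0, 118.8, 62.7 },  /* idle target under cross-sb pressure */
	{ 137.5, 162.2, 61.3 },  /* 8 concurrent umounts of idle sbs */
};

#define N_ROWS (sizeof(p50_ms) / sizeof(p50_ms[0]))

struct range {
	double min, max;
};

/* Smallest and largest value in v[0..n-1]; n must be at least 1. */
static struct range span(const double *v, size_t n)
{
	struct range r = { v[0], v[0] };
	size_t i;

	for (i = 1; i < n; i++) {
		if (v[i] < r.min)
			r.min = v[i];
		if (v[i] > r.max)
			r.max = v[i];
	}
	return r;
}

/* p50 added by patch 1 over base, across the three scenarios. */
static struct range patch1_cost(void)
{
	double d[N_ROWS];
	size_t i;

	for (i = 0; i < N_ROWS; i++)
		d[i] = p50_ms[i].race - p50_ms[i].base;
	return span(d, N_ROWS);
}

/* p50 removed by patch 2 on top of patch 1. */
static struct range patch2_saving(void)
{
	double d[N_ROWS];
	size_t i;

	for (i = 0; i < N_ROWS; i++)
		d[i] = p50_ms[i].race - p50_ms[i].rmbarrier;
	return span(d, N_ROWS);
}
```

patch1_cost() spans about 10.5 to 26.8 ms and patch2_saving() about
42.6 to 100.9 ms, matching the rounded ~10-27 ms and ~43-101 ms figures.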
Comments and questions are, as always, welcome.
Thanks,
Baokun
Baokun Li (3):
  writeback: fix race between cgroup_writeback_umount() and
    inode_switch_wbs()
  writeback: drop now-unnecessary rcu_barrier() in
    cgroup_writeback_umount()
  writeback: use a per-sb counter to drain inode wb switches at umount
fs/fs-writeback.c | 71 +++++++++++++++++++++-------------
include/linux/fs/super_types.h | 8 ++++
2 files changed, 53 insertions(+), 26 deletions(-)
--
2.43.7
On Thu, 21 May 2026 17:50:13 +0800, Baokun Li wrote:
> Changes since v3:
> * Collect RVB from Jan Kara. (Thanks for your review!)
> * Patch 3: Remove stale comments. (Reported by Sashiko)
>
> Changes since v2:
> * Collect RVB from Jan Kara. (Thanks for your review!)
> * Patch 3: switch to wake_up_var() / wait_var_event() to drain
> s_isw_nr_in_flight. (Suggested by Christian Brauner and Sashiko)
> * Polish comments and changelogs.
>
> [...]
Applied to the vfs-7.2.writeback branch of the vfs/vfs.git tree.
Patches in the vfs-7.2.writeback branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series, allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-7.2.writeback
[1/3] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()
https://git.kernel.org/vfs/vfs/c/cba38ec4cbd3
[2/3] writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount()
https://git.kernel.org/vfs/vfs/c/e90a6d668e26
[3/3] writeback: use a per-sb counter to drain inode wb switches at umount
https://git.kernel.org/vfs/vfs/c/31c1d19ead2c