Changes since v1:
 * Use a simple RCU-based fix (patch 1) that is easy to backport to
   older kernels; the per-sb refcount optimization is split out as a
   separate performance patch (patch 3). (Suggested by Jan Kara)

v1: https://patch.msgid.link/20260513094829.867648-1-libaokun@linux.alibaba.com
======
When a container exits, a race between cgroup_writeback_umount() and
inode_switch_wbs()/cleanup_offline_cgwb() can trigger "VFS: Busy inodes
after unmount" followed by a use-after-free on percpu counters.

There is a window between inode_prepare_wbs_switch() returning true
(having passed the SB_ACTIVE check and grabbed the inode) and the
subsequent wb_queue_isw() call. If cgroup_writeback_umount() observes
the global isw_nr_in_flight counter as non-zero but flush_workqueue()
finds nothing queued, it returns early. The switcher still holds its
inode reference, which blocks evict_inodes(), and the later iput() hits
percpu counters that have already been freed.
Patch 1 fixes the race by extending the RCU read-side critical section
to cover the window from inode_prepare_wbs_switch() through
wb_queue_isw(), and adding synchronize_rcu() in the umount path so
that all in-flight switchers complete queueing before flush_workqueue()
runs.
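As a rough illustration of the interleaving described above, here is a
single-threaded userspace model. It is only a sketch: struct model and
every function name in it are hypothetical and do not exist in the
kernel, and the "wait" step merely stands in for the effect that
synchronize_rcu() is described as having in patch 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model {
	int  nr_in_flight;   /* stands in for isw_nr_in_flight */
	int  inode_refs;     /* inode references held by in-flight switches */
	int  queued;         /* switch work items sitting on the workqueue */
	bool in_rcu_section; /* switcher is between prepare and queue */
};

/* Step 1 of a switch: pass the SB_ACTIVE check and grab the inode. */
static void switcher_prepare(struct model *m)
{
	m->in_rcu_section = true;
	m->nr_in_flight++;
	m->inode_refs++;
}

/* Step 2 of a switch: hand the work item to the workqueue. */
static void switcher_queue(struct model *m)
{
	m->queued++;
	m->in_rcu_section = false;
}

/* Run every queued work item; each one drops its inode reference. */
static void flush_queue(struct model *m)
{
	while (m->queued > 0) {
		m->queued--;
		m->inode_refs--;
		m->nr_in_flight--;
	}
}

/*
 * Umount-side drain. With wait_for_switchers set, a switcher caught
 * inside its read-side section finishes queueing before the flush.
 * Returns the number of inode references still held afterwards.
 */
static int umount_drain(struct model *m, bool wait_for_switchers)
{
	if (m->nr_in_flight > 0) {
		if (wait_for_switchers && m->in_rcu_section)
			switcher_queue(m);
		flush_queue(m);
	}
	return m->inode_refs;
}

/* Replay the bad interleaving: umount runs between prepare and queue. */
int leaked_refs(bool wait_for_switchers)
{
	struct model m = {0};

	switcher_prepare(&m);
	return umount_drain(&m, wait_for_switchers);
}
```

In this model, leaked_refs(false) leaves one reference behind (the
"busy inode" case), while leaked_refs(true) leaves none.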
Patch 2 removes the now-dead rcu_barrier(), a leftover from when switch
work was queued via queue_rcu_work(); that usage was removed by commit
e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when
switching inodes").
Patch 3 replaces the global synchronize_rcu()/flush_workqueue() pair
with a per-sb counter (s_isw_nr_in_flight), eliminating the global
serialization penalty. This also reverts the RCU extension from patch 1
since the per-sb counter makes it unnecessary.
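To make the difference between the two counters concrete, the following
userspace sketch compares the "must drain?" decision for an idle sb
under both schemes. It is not the patch itself: struct sb_model and the
helper names are hypothetical, and only the counter name
s_isw_nr_in_flight is taken from the description above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct super_block; only the counter is kept. */
struct sb_model {
	atomic_int s_isw_nr_in_flight;
};

/* Models the single global counter shared by every superblock. */
static atomic_int global_isw_nr_in_flight;

/* A wb switch starts on @sb: bump both counters. */
static void switch_start(struct sb_model *sb)
{
	atomic_fetch_add(&sb->s_isw_nr_in_flight, 1);
	atomic_fetch_add(&global_isw_nr_in_flight, 1);
}

/* The switch on @sb has completed: drop both counters. */
static void switch_done(struct sb_model *sb)
{
	atomic_fetch_sub(&sb->s_isw_nr_in_flight, 1);
	atomic_fetch_sub(&global_isw_nr_in_flight, 1);
}

/* Would umount of @sb have to take the slow drain path? */
static bool umount_must_drain(struct sb_model *sb, bool per_sb)
{
	if (per_sb)
		return atomic_load(&sb->s_isw_nr_in_flight) != 0;
	return atomic_load(&global_isw_nr_in_flight) != 0;
}

/* One switch is in flight on a busy sb while an idle sb is unmounted. */
bool idle_umount_drains(bool per_sb)
{
	struct sb_model busy, idle;
	bool drains;

	atomic_init(&busy.s_isw_nr_in_flight, 0);
	atomic_init(&idle.s_isw_nr_in_flight, 0);

	switch_start(&busy);
	drains = umount_must_drain(&idle, per_sb);
	switch_done(&busy);
	return drains;
}
```

With the global counter the idle sb is forced onto the drain path by
unrelated activity; with the per-sb counter it is not.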
Measured with 4 background superblocks churning cgwb switches to keep
isw_nr_in_flight non-zero, while a separate idle sb is umounted in a
loop (N=100):

Idle target umount latency under cross-sb cgwb-switch pressure:

                                     p50       p95       p99       max
  patch 1+2 (synchronize_rcu)    64.4 ms   95.8 ms  101.4 ms  110.5 ms
  patch 3 (per-sb counter)        5.3 ms    6.9 ms    7.4 ms    7.7 ms
  no-pressure baseline            5.2 ms    5.9 ms    6.0 ms    6.1 ms

8 concurrent umounts of idle sbs under the same pressure (5 batches):

                                     p50       p95       max
  patch 1+2 (synchronize_rcu)    57.9 ms   82.1 ms   90.0 ms
  patch 3 (per-sb counter)        7.5 ms    7.8 ms    8.0 ms

In-kernel cgroup_writeback_umount() cumulative cost over 286 calls
(bpftrace, kprobes filtered to the umount call context):

                                cgroup_writeback_umount() time
  patch 1+2 (synchronize_rcu)   8717 ms total (~30 ms / call)
  patch 3 (per-sb counter)      1.16 ms total (~4 us / call)
Comments and questions are, as always, welcome.
Thanks,
Baokun
Baokun Li (3):
  writeback: fix race between cgroup_writeback_umount() and
    inode_switch_wbs()
  writeback: drop now-unnecessary rcu_barrier() in
    cgroup_writeback_umount()
  writeback: use a per-sb counter to drain inode wb switches at umount

 fs/fs-writeback.c              | 52 +++++++++++++++++++---------------
 include/linux/fs/super_types.h |  8 ++++++
 2 files changed, 37 insertions(+), 23 deletions(-)
--
2.43.7