From: Rahul Sharma
To: gregkh@linuxfoundation.org, stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, Qu Wenruo, Jan Kara, Boris Burkov, David Sterba, Rahul Sharma
Subject: [PATCH 5.15.y] btrfs: do not strictly require dirty metadata threshold for metadata writepages
Date: Tue, 3 Mar 2026 14:00:34 +0800
Message-Id: <20260303060034.3997360-1-black.hawk@163.com>

From: Qu Wenruo

[ Upstream commit 4e159150a9a56d66d247f4b5510bed46fe58aa1c ]

[BUG]
There is an internal report that over 1000 processes are waiting at the
io_schedule_timeout() of balance_dirty_pages(), causing a system hang and
triggering a kernel coredump.

The kernel is based on v6.4, but the root problem still applies to any
upstream kernel before v6.18.

[CAUSE]
First, Jan Kara's explanation of the dirty page balance behavior:

  This cgroup dirty limit was what was actually playing the role here
  because the cgroup had only a small amount of memory and so the dirty
  limit for it was something like 16MB.

  Dirty throttling is responsible for enforcing that nobody can dirty
  (significantly) more dirty memory than there's dirty limit.
  Thus when a task is dirtying pages it periodically enters into
  balance_dirty_pages() and we let it sleep there to slow down the
  dirtying.

  When the system is over dirty limit already (either globally or within
  a cgroup of the running task), we will not let the task exit from
  balance_dirty_pages() until the number of dirty pages drops below the
  limit.

  So in this particular case, as I already mentioned, there was a cgroup
  with relatively small amount of memory and as a result with dirty limit
  set at 16MB. A task from that cgroup has dirtied about 28MB worth of
  pages in btrfs btree inode and these were practically the only dirty
  pages in that cgroup.

So the only way to reduce the dirty pages of that cgroup is to write back
the dirty pages of the btrfs btree inode, and only after that can those
processes exit balance_dirty_pages().

Now back to the btrfs part: btree_writepages() is responsible for writing
back dirty btree inode pages.

The problem is that btrfs has an internal threshold: if the btree inode's
dirty bytes are below 32MiB, it will not do any writeback.

This behavior is meant to batch as much metadata as possible, so that we
don't write back tree blocks and then re-COW them for another
modification.

This internal 32MiB threshold is higher than the existing dirty page size
(28MiB), meaning no writeback will happen, causing a deadlock between
btrfs and cgroup:

- Btrfs doesn't want to write back the btree inode until there are more
  dirty pages

- Cgroup/MM doesn't want more dirty pages for the btrfs btree inode.
  Thus any process touching that btree inode is put to sleep until
  the number of dirty pages is reduced.

Many thanks to Jan Kara for the analysis of the root cause.

[ENHANCEMENT]
Since kernel commit b55102826d7d ("btrfs: set AS_KERNEL_FILE on the
btree_inode"), btrfs btree inode pages are only charged to the root
cgroup, which should have a much larger limit than btrfs' 32MiB threshold.
So newer kernels should not be affected.
But all current LTS kernels are affected by this problem, and backporting
the whole AS_KERNEL_FILE change may not be a good idea.

Even for newer kernels I still think it's a good idea to get rid of the
internal threshold at btree_writepages(), since in most cases cgroup/MM
has a better view of full system memory usage than btrfs' fixed
threshold.

Internal callers use btrfs_btree_balance_dirty(), which already does its
own internal threshold check, so we don't need to bother them.

But for external callers of btree_writepages(), just respect their
requests and write back whatever they want, ignoring the internal btrfs
threshold, to avoid such a deadlock on btree inode dirty page balancing.

CC: stable@vger.kernel.org
CC: Jan Kara
Reviewed-by: Boris Burkov
Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba
[ The context change is due to the commit 41044b41ad2c
("btrfs: add helper to get fs_info from struct inode pointer")
in v6.9 and the commit c66f2afc7148
("btrfs: remove pointless writepages callback wrapper")
in v6.10 which are irrelevant to the logic of this patch.
]
Signed-off-by: Rahul Sharma
---
 fs/btrfs/disk-io.c   | 23 -----------------------
 fs/btrfs/extent_io.c |  3 +--
 fs/btrfs/extent_io.h |  3 +--
 3 files changed, 2 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 136902f27e44..c838648f46d6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -912,29 +912,6 @@ static int btree_migratepage(struct address_space *mapping,
 }
 #endif
 
-
-static int btree_writepages(struct address_space *mapping,
-			    struct writeback_control *wbc)
-{
-	struct btrfs_fs_info *fs_info;
-	int ret;
-
-	if (wbc->sync_mode == WB_SYNC_NONE) {
-
-		if (wbc->for_kupdate)
-			return 0;
-
-		fs_info = BTRFS_I(mapping->host)->root->fs_info;
-		/* this is a bit racy, but that's ok */
-		ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-					     BTRFS_DIRTY_METADATA_THRESH,
-					     fs_info->dirty_metadata_batch);
-		if (ret < 0)
-			return 0;
-	}
-	return btree_write_cache_pages(mapping, wbc);
-}
-
 static int btree_releasepage(struct page *page, gfp_t gfp_flags)
 {
 	if (PageWriteback(page) || PageDirty(page))
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3b671e9bf684..64fb86dd61c4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4792,8 +4792,7 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
 	return 1;
 }
 
-int btree_write_cache_pages(struct address_space *mapping,
-				   struct writeback_control *wbc)
+int btree_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	struct extent_buffer *eb_context = NULL;
 	struct extent_page_data epd = {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index f7ab6ba8238e..c1f0953e8099 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -187,8 +187,7 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
 			      int mode);
 int extent_writepages(struct address_space *mapping,
 		      struct writeback_control *wbc);
-int btree_write_cache_pages(struct address_space *mapping,
-			    struct writeback_control *wbc);
+int btree_writepages(struct address_space *mapping, struct writeback_control *wbc);
 void extent_readahead(struct readahead_control *rac);
 int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		  u64 start, u64 len);
-- 
2.34.1
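
For readers who want to see the deadlock condition in isolation, the following is a minimal userspace sketch of the decision logic, not kernel code. It assumes the flusher's request is WB_SYNC_NONE and not for_kupdate, and it ignores the per-CPU batching of the real counter. The names model_wbc, old_wrapper_writes(), new_writepages_writes(), task_stays_throttled() and check_reported_scenario() are hypothetical and exist only for this illustration; the 16MiB, 28MiB and 32MiB figures come from the report above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/* Models the 32MiB BTRFS_DIRTY_METADATA_THRESH described in the patch. */
#define MODEL_DIRTY_METADATA_THRESH (32 * MiB)

/* Hypothetical stand-in for the two writeback_control fields the old code read. */
struct model_wbc {
	bool sync_none;   /* models wbc->sync_mode == WB_SYNC_NONE */
	bool for_kupdate; /* models wbc->for_kupdate */
};

/* Old behavior: the removed wrapper could decline to write anything. */
static bool old_wrapper_writes(const struct model_wbc *wbc, uint64_t dirty_bytes)
{
	if (wbc->sync_none) {
		if (wbc->for_kupdate)
			return false;
		/* Simplified: the kernel compares a per-CPU counter with a batch. */
		if (dirty_bytes < MODEL_DIRTY_METADATA_THRESH)
			return false;
	}
	return true;
}

/* New behavior: every writepages request is passed straight through. */
static bool new_writepages_writes(const struct model_wbc *wbc, uint64_t dirty_bytes)
{
	(void)wbc;
	(void)dirty_bytes;
	return true;
}

/* A task cannot leave balance_dirty_pages() while dirty pages exceed the limit. */
static bool task_stays_throttled(uint64_t dirty_bytes, uint64_t dirty_limit)
{
	return dirty_bytes > dirty_limit;
}

/* The reported scenario: 16MiB cgroup dirty limit, 28MiB of dirty btree pages. */
static void check_reported_scenario(void)
{
	const struct model_wbc flusher = { .sync_none = true, .for_kupdate = false };
	const uint64_t dirty = 28 * MiB, limit = 16 * MiB;

	assert(task_stays_throttled(dirty, limit));     /* tasks keep sleeping */
	assert(!old_wrapper_writes(&flusher, dirty));   /* old code writes nothing */
	assert(new_writepages_writes(&flusher, dirty)); /* patched code makes progress */
}
```

Calling check_reported_scenario() from a test program exercises the case from the report: the task is over its cgroup limit, the old wrapper refuses to write because 28MiB is below 32MiB, and only the patched behavior lets the dirty count drop.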