From: guzebing
To: tytso@mit.edu, adilger.kernel@dilger.ca, libaokun@linux.alibaba.com, jack@suse.cz, ojaswin@linux.ibm.com, ritesh.list@gmail.com, yi.zhang@huawei.com, guzebing@bytedance.com
Cc: linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org
Subject: [PATCH] ext4: make mballoc max prealloc size configurable
Date: Fri, 10 Apr 2026 11:56:35 +0800
Message-Id: <20260410035635.1381920-1-guzebing1612@gmail.com>
X-Mailer: git-send-email 2.20.1

From: Guzebing

Add a per-superblock sysfs knob, mb_max_prealloc_kb, and use it in
request normalization. Values below 8192 (8 MiB) are raised to that
minimum, and the result is rounded up to a power of two.

When multiple tasks write to different files on the same filesystem
concurrently, each file ends up with 8 MiB extents. If the preallocation
size is increased, the resulting extent size grows accordingly. Due to
the readahead mechanism on NVMe SSDs, files with larger extents achieve
higher sequential read throughput.

On an ext4 filesystem on an NVMe Gen4 data drive, dd read throughput for
a file with 8 MiB extents is 455 MB/s, while for a file with 32 MiB
extents it reaches 702 MB/s.

Steps to reproduce:
1. Configure the maximum preallocation size to 8 MiB or 32 MiB:
   echo 8192 > /sys/fs/ext4/nvme13n1/mb_max_prealloc_kb
   echo 32768 > /sys/fs/ext4/nvme13n1/mb_max_prealloc_kb
2. Run the following commands simultaneously so that the extents of the
   two files are physically interleaved, resulting in 8 MiB or 32 MiB
   extents:
   dd if=/dev/zero of=/mnt/store1/501.txt bs=128K count=80K oflag=direct
   dd if=/dev/zero of=/mnt/store1/502.txt bs=128K count=80K oflag=direct
3. Read back the file and measure the read throughput:
   dd if=/mnt/store1/501.txt of=/dev/null bs=128K count=80K iflag=direct

Signed-off-by: Guzebing
---
 Documentation/ABI/testing/sysfs-fs-ext4 |  8 +++++++
 fs/ext4/ext4.h                          |  1 +
 fs/ext4/mballoc.c                       |  2 +-
 fs/ext4/super.c                         |  1 +
 fs/ext4/sysfs.c                         | 28 ++++++++++++++++++++++++-
 5 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-fs-ext4 b/Documentation/ABI/testing/sysfs-fs-ext4
index 2edd0a6672d3a..316ae1d1ec18b 100644
--- a/Documentation/ABI/testing/sysfs-fs-ext4
+++ b/Documentation/ABI/testing/sysfs-fs-ext4
@@ -48,6 +48,14 @@ Description:
 		will have its blocks allocated out of its own unique
 		preallocation pool.
 
+What:		/sys/fs/ext4//mb_max_prealloc_kb
+Date:		April 2026
+Contact:	"Linux Ext4 Development List"
+Description:
+		Maximum size (in kilobytes) used by the multiblock allocator's
+		normalized request preallocation heuristic. Values are rounded
+		up to a power of two and clamped to a minimum of 8192 (8MiB).
+
 What:		/sys/fs/ext4//inode_readahead_blks
 Date:		March 2008
 Contact:	"Theodore Ts'o"
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7617e2d454ea5..bce99740740f5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1634,6 +1634,7 @@ struct ext4_sb_info {
 	unsigned int s_mb_best_avail_max_trim_order;
 	unsigned int s_sb_update_sec;
 	unsigned int s_sb_update_kb;
+	unsigned int s_mb_max_prealloc_kb;
 
 	/* where last allocation was done - for stream allocation */
 	ext4_group_t *s_mb_last_groups;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index bb58eafb87bcd..f5f63c56fcdac 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4589,7 +4589,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 					(8<<20)>>bsbits, max, 8 * 1024)) {
 		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
 							(23 - bsbits)) << 23;
-		size = 8 * 1024 * 1024;
+		size = (loff_t)sbi->s_mb_max_prealloc_kb << 10;
 	} else {
 		start_off = (loff_t) ac->ac_o_ex.fe_logical << bsbits;
 		size = (loff_t) EXT4_C2B(sbi,
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a34efb44e73d7..f815e31657cc9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5447,6 +5447,7 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 		sbi->s_stripe = 0;
 	}
 	sbi->s_extent_max_zeroout_kb = 32;
+	sbi->s_mb_max_prealloc_kb = 8 * 1024;
 
 	/*
 	 * set up enough so that it can read an inode
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 923b375e017fa..6339492eb2fa7 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -10,6 +10,8 @@
 
 #include
 #include
+#include
+#include
 #include
 #include
 #include
@@ -41,6 +43,7 @@ typedef enum {
 	attr_pointer_atomic,
 	attr_journal_task,
 	attr_err_report_sec,
+	attr_mb_max_prealloc_kb,
 } attr_id_t;
 
 typedef enum {
@@ -115,6 +118,25 @@ static ssize_t reserved_clusters_store(struct ext4_sb_info *sbi,
 	return count;
 }
 
+static ssize_t mb_max_prealloc_kb_store(struct ext4_sb_info *sbi,
+					const char *buf, size_t count)
+{
+	unsigned int v;
+	int ret;
+	unsigned long rounded;
+
+	ret = kstrtouint(skip_spaces(buf), 0, &v);
+	if (ret)
+		return ret;
+	if (v < 8192)
+		v = 8192;
+	rounded = roundup_pow_of_two((unsigned long)v);
+	if (rounded > UINT_MAX)
+		return -EINVAL;
+	sbi->s_mb_max_prealloc_kb = (unsigned int)rounded;
+	return count;
+}
+
 static ssize_t trigger_test_error(struct ext4_sb_info *sbi,
 				  const char *buf, size_t count)
 {
@@ -288,6 +310,7 @@ EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
 EXT4_RW_ATTR_SBI_UL(last_trim_minblks, s_last_trim_minblks);
 EXT4_RW_ATTR_SBI_UI(sb_update_sec, s_sb_update_sec);
 EXT4_RW_ATTR_SBI_UI(sb_update_kb, s_sb_update_kb);
+EXT4_ATTR_OFFSET(mb_max_prealloc_kb, 0644, mb_max_prealloc_kb, ext4_sb_info, s_mb_max_prealloc_kb);
 
 static unsigned int old_bump_val = 128;
 EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val);
@@ -341,6 +364,7 @@ static struct attribute *ext4_attrs[] = {
 	ATTR_LIST(last_trim_minblks),
 	ATTR_LIST(sb_update_sec),
 	ATTR_LIST(sb_update_kb),
+	ATTR_LIST(mb_max_prealloc_kb),
 	ATTR_LIST(err_report_sec),
 	NULL,
 };
@@ -431,6 +455,7 @@ static ssize_t ext4_generic_attr_show(struct ext4_attr *a,
 	case attr_mb_order:
 	case attr_pointer_pi:
 	case attr_pointer_ui:
+	case attr_mb_max_prealloc_kb:
 		if (a->attr_ptr == ptr_ext4_super_block_offset)
 			return sysfs_emit(buf, "%u\n", le32_to_cpup(ptr));
 		return sysfs_emit(buf, "%u\n", *((unsigned int *) ptr));
@@ -557,6 +582,8 @@ static ssize_t ext4_attr_store(struct kobject *kobj,
 		return reserved_clusters_store(sbi, buf, len);
 	case attr_inode_readahead:
 		return inode_readahead_blks_store(sbi, buf, len);
+	case attr_mb_max_prealloc_kb:
+		return mb_max_prealloc_kb_store(sbi, buf, len);
 	case attr_trigger_test_error:
 		return trigger_test_error(sbi, buf, len);
 	case attr_err_report_sec:
@@ -695,4 +722,3 @@ void ext4_exit_sysfs(void)
 	remove_proc_entry(proc_dirname, NULL);
 	ext4_proc_root = NULL;
 }
-
-- 
2.20.1
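
For readers who want to check what value a given write to
mb_max_prealloc_kb ends up storing, the clamp-and-round step in
mb_max_prealloc_kb_store() can be approximated in userspace. This is only
a sketch under stated assumptions: normalize_prealloc_kb(),
prealloc_bytes() and MIN_PREALLOC_KB are hypothetical names that do not
appear in the patch, a plain shift loop stands in for the kernel's
roundup_pow_of_two(), and -1 stands in for -EINVAL.

```c
#include <limits.h>

/* Hypothetical name; 8192 KiB (8 MiB) is the minimum stated in the patch. */
#define MIN_PREALLOC_KB 8192U

/*
 * Userspace approximation of the store logic: raise values below the
 * minimum, round up to the next power of two, and reject results that
 * do not fit in an unsigned int.
 * Returns 0 and writes the effective value to *out, or -1 on overflow
 * (the patch returns -EINVAL in that case).
 */
static int normalize_prealloc_kb(unsigned int v, unsigned int *out)
{
	/* unsigned long long is at least 64 bits, so the shift cannot wrap */
	unsigned long long rounded = 1;

	if (v < MIN_PREALLOC_KB)
		v = MIN_PREALLOC_KB;
	while (rounded < v)
		rounded <<= 1;
	if (rounded > UINT_MAX)
		return -1;
	*out = (unsigned int)rounded;
	return 0;
}

/* Byte size used by request normalization: the KiB value shifted by 10. */
static unsigned long long prealloc_bytes(unsigned int kb)
{
	return (unsigned long long)kb << 10;
}
```

Under these assumptions, writing 40000 would store 65536, and values at
or below 8192 would store 8192, which prealloc_bytes() maps to the
original 8 MiB size.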