From: Ojaswin Mujoo
To: linux-ext4@vger.kernel.org, "Theodore Ts'o"
Cc: John Garry, dchinner@redhat.com, "Darrick J. Wong", Ritesh Harjani, linux-kernel@vger.kernel.org
Subject: [RFC v3 01/11] ext4: add aligned allocation hint in mballoc
Date: Mon, 24 Mar 2025 13:06:59 +0530
Message-ID: <70fcaa59709f4fc30223b9d0e33b5cbda74209c6.1742800203.git.ojaswin@linux.ibm.com>

Add support in mballoc for allocating blocks that are aligned to a
certain power-of-2 offset.

1. We define a new flag EXT4_MB_HINT_ALIGNED to indicate that we want
   an aligned allocation. This is only a hint: mballoc tries its best
   to provide aligned blocks, but if it can't, it falls back to a
   normal allocation.

2.
   The alignment is determined by the length of the allocation. For
   example, if we ask for 8192 bytes, then the physical blocks will
   also be 8192-byte aligned (i.e. 2-block aligned on a 4k block size).

3. We don't yet support arbitrary alignment. For aligned writes, the
   length/alignment must be a power of 2 in blocks, i.e. for a 4k block
   size we can get 4k-byte, 8k-byte or 16k-byte aligned allocations,
   but not 12k-byte aligned ones.

4. We use the CR_POWER2_ALIGNED criteria for aligned allocation, which
   by design allocates in an aligned manner. Since CR_POWER2_ALIGNED
   needs ac->ac_g_ex.fe_len to be a power of 2, that's where the
   restriction in point 3 above comes from. Since aligned allocation
   support is currently added mainly for the atomic writes use case,
   this restriction should be fine: atomic-write-capable devices
   usually support only power-of-2 alignments.

5. For ease of review, inode preallocation support is enabled in
   upcoming patches and is disabled in this patch.

Signed-off-by: Ojaswin Mujoo
---
 fs/ext4/ext4.h              |  2 ++
 fs/ext4/mballoc.c           | 60 +++++++++++++++++++++++++++++++++----
 include/trace/events/ext4.h |  1 +
 3 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 5a20e9cd7184..2c83275d2ad4 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -222,6 +222,8 @@ enum criteria {
 /* Avg fragment size rb tree lookup succeeded at least once for
  * CR_BEST_AVAIL_LEN */
 #define EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED	0x00020000
+/* mballoc will try to align physical start to length (aka natural alignment) */
+#define EXT4_MB_HINT_ALIGNED		0x40000
 
 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 0d523e9fb3d5..ca51581573e3 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2177,8 +2177,11 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	 * user requested originally, we store allocated
 	 * space in a special descriptor.
 	 */
-	if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len)
+	if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len) {
+		/* Aligned allocation doesn't have preallocation support */
+		WARN_ON(ac->ac_flags & EXT4_MB_HINT_ALIGNED);
 		ext4_mb_new_preallocation(ac);
+	}
 
 }
 
@@ -2814,10 +2817,15 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 	BUG_ON(ac->ac_status == AC_STATUS_FOUND);
 
-	/* first, try the goal */
-	err = ext4_mb_find_by_goal(ac, &e4b);
-	if (err || ac->ac_status == AC_STATUS_FOUND)
-		goto out;
+	/*
+	 * first, try the goal. Skip trying goal for aligned allocations since
+	 * goal determination logic is not alignment aware (yet)
+	 */
+	if (!(ac->ac_flags & EXT4_MB_HINT_ALIGNED)) {
+		err = ext4_mb_find_by_goal(ac, &e4b);
+		if (err || ac->ac_status == AC_STATUS_FOUND)
+			goto out;
+	}
 
 	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
 		goto out;
@@ -2858,9 +2866,22 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	 */
 	if (ac->ac_2order)
 		cr = CR_POWER2_ALIGNED;
+	else
+		WARN_ON_ONCE(ac->ac_g_ex.fe_len > 1 &&
+			     ac->ac_flags & EXT4_MB_HINT_ALIGNED);
 repeat:
 	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
 		ac->ac_criteria = cr;
+
+		if (ac->ac_criteria > CR_POWER2_ALIGNED &&
+		    ac->ac_flags & EXT4_MB_HINT_ALIGNED &&
+		    ac->ac_g_ex.fe_len > 1) {
+			ext4_warning_inode(
+				ac->ac_inode,
+				"Aligned allocation not possible, using unaligned allocation");
+			ac->ac_flags &= ~EXT4_MB_HINT_ALIGNED;
+		}
+
 		/*
 		 * searching for the right group start
 		 * from the goal value specified
@@ -2993,6 +3014,24 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	if (!err && ac->ac_status != AC_STATUS_FOUND && first_err)
 		err = first_err;
 
+	if (ac->ac_flags & EXT4_MB_HINT_ALIGNED && ac->ac_status == AC_STATUS_FOUND) {
+		ext4_fsblk_t start = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
+		ext4_grpblk_t len = EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
+
+		if (!len) {
+			ext4_warning_inode(ac->ac_inode,
+					   "Expected a non zero len extent");
+			ac->ac_status = AC_STATUS_BREAK;
+			goto exit;
+		}
+
+		WARN_ON_ONCE(!is_power_of_2(len));
+		WARN_ON_ONCE(start % len);
+		/* We don't support preallocation yet */
+		WARN_ON_ONCE(ac->ac_b_ex.fe_len != ac->ac_o_ex.fe_len);
+	}
+
+ exit:
 	mb_debug(sb, "Best len %d, origin len %d, ac_status %u, ac_flags 0x%x, cr %d ret %d\n",
 		 ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status,
 		 ac->ac_flags, cr, err);
@@ -4440,6 +4479,13 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC)
 		return;
 
+	/*
+	 * caller may have strict alignment requirements. In this case, avoid
+	 * normalization since it is not alignment aware.
+	 */
+	if (ac->ac_flags & EXT4_MB_HINT_ALIGNED)
+		return;
+
 	if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC) {
 		ext4_mb_normalize_group_request(ac);
 		return ;
@@ -4794,6 +4840,10 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return false;
 
+	/* using preallocated blocks is not alignment aware. */
+	if (ac->ac_flags & EXT4_MB_HINT_ALIGNED)
+		return false;
+
 	/*
 	 * first, try per-file preallocation by searching the inode pa rbtree.
 	 *
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 156908641e68..79cc4224fbbd 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -36,6 +36,7 @@ struct partial_cluster;
 	{ EXT4_MB_STREAM_ALLOC,		"STREAM_ALLOC" },	\
 	{ EXT4_MB_USE_ROOT_BLOCKS,	"USE_ROOT_BLKS" },	\
 	{ EXT4_MB_USE_RESERVED,		"USE_RESV" },		\
+	{ EXT4_MB_HINT_ALIGNED,		"HINT_ALIGNED" },	\
 	{ EXT4_MB_STRICT_CHECK,		"STRICT_CHECK" })
 
 #define show_map_flags(flags)	__print_flags(flags, "|",	\
-- 
2.48.1
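[Editorial note, not part of the patch] The "natural alignment" contract patch 01 aims for can be sketched in plain userspace C. The helper names below are illustrative, not kernel APIs: a request of `len` blocks is considered aligned only if `len` is a power of 2 and the physical start block is a multiple of `len`, mirroring the WARN_ON_ONCE checks in ext4_mb_regular_allocator().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative helper: power-of-2 test, as is_power_of_2() does in the kernel */
static bool is_pow2_u64(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Illustrative model of the alignment contract: a `len`-block request
 * (len a power of 2) is satisfied only by a start block that is a
 * multiple of len. Arbitrary (non-power-of-2) alignments are rejected.
 */
static bool naturally_aligned(uint64_t start_blk, uint64_t len_blks)
{
	if (!is_pow2_u64(len_blks))
		return false;
	return (start_blk % len_blks) == 0;
}
```

For instance, an 8-block request starting at block 16 is naturally aligned, while the same request starting at block 20 is not.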
From: Ojaswin Mujoo
To: linux-ext4@vger.kernel.org, "Theodore Ts'o"
Cc: John Garry, dchinner@redhat.com, "Darrick J. Wong", Ritesh Harjani, linux-kernel@vger.kernel.org
Subject: [RFC v3 02/11] ext4: allow inode preallocation for aligned alloc
Date: Mon, 24 Mar 2025 13:07:00 +0530
Message-ID: <083378fa1d7114228d30b84d453e975a9da06751.1742800203.git.ojaswin@linux.ibm.com>
Enable inode preallocation support for aligned allocations. Inode
preallocation will only be used if the preallocated blocks can satisfy
the length and alignment requirements of the allocation; otherwise we
disable preallocation for this particular allocation and proceed as
usual. Disabling inode preallocation is required because we might
otherwise end up with overlapping preallocated ranges, which can
trigger a BUG() later.

Further, during normalization, we usually try to round the request up
to a power of 2, which can still give us an aligned allocation. We also
make sure not to change the goal start so that aligned allocation stays
straightforward. If for whatever reason the goal is not a power of 2 or
doesn't contain the original request, we throw a warning and proceed as
normal.

For now, group preallocation is disabled for aligned allocations.

Signed-off-by: Ojaswin Mujoo
---
 fs/ext4/mballoc.c | 96 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 63 insertions(+), 33 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index ca51581573e3..db7c593873a9 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2178,8 +2178,6 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	 * space in a special descriptor.
 	 */
 	if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len) {
-		/* Aligned allocation doesn't have preallocation support */
-		WARN_ON(ac->ac_flags & EXT4_MB_HINT_ALIGNED);
 		ext4_mb_new_preallocation(ac);
 	}
 
@@ -3027,8 +3025,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 		WARN_ON_ONCE(!is_power_of_2(len));
 		WARN_ON_ONCE(start % len);
-		/* We don't support preallocation yet */
-		WARN_ON_ONCE(ac->ac_b_ex.fe_len != ac->ac_o_ex.fe_len);
+		WARN_ON_ONCE(ac->ac_b_ex.fe_len < ac->ac_o_ex.fe_len);
 	}
 
  exit:
@@ -4479,13 +4476,6 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC)
 		return;
 
-	/*
-	 * caller may have strict alignment requirements. In this case, avoid
-	 * normalization since it is not alignment aware.
-	 */
-	if (ac->ac_flags & EXT4_MB_HINT_ALIGNED)
-		return;
-
 	if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC) {
 		ext4_mb_normalize_group_request(ac);
 		return ;
@@ -4542,6 +4532,21 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 		size = (loff_t) EXT4_C2B(sbi, ac->ac_o_ex.fe_len) << bsbits;
 	}
+
+	/*
+	 * For aligned allocations, we need to ensure 2 things:
+	 *
+	 * 1. The start should remain same as original start so that finding
+	 * aligned physical blocks for it is straight forward.
+	 *
+	 * 2. The new_size should not be less than the original len. This
+	 * can sometimes happen due to the way we predict size above.
+	 */
+	if (ac->ac_flags & EXT4_MB_HINT_ALIGNED) {
+		start_off = ac->ac_o_ex.fe_logical << bsbits;
+		size = max_t(loff_t, size,
+			     EXT4_C2B(sbi, ac->ac_o_ex.fe_len) << bsbits);
+	}
 	size = size >> bsbits;
 	start = start_off >> bsbits;
 
@@ -4792,32 +4797,46 @@ ext4_mb_check_group_pa(ext4_fsblk_t goal_block,
 }
 
 /*
- * check if found pa meets EXT4_MB_HINT_GOAL_ONLY
+ * check if found pa meets EXT4_MB_HINT_GOAL_ONLY or EXT4_MB_HINT_ALIGNED
  */
 static bool
-ext4_mb_pa_goal_check(struct ext4_allocation_context *ac,
+ext4_mb_pa_check(struct ext4_allocation_context *ac,
 		      struct ext4_prealloc_space *pa)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	ext4_fsblk_t start;
 
-	if (likely(!(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY)))
+	if (likely(!(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY ||
+		     ac->ac_flags & EXT4_MB_HINT_ALIGNED)))
 		return true;
 
-	/*
-	 * If EXT4_MB_HINT_GOAL_ONLY is set, ac_g_ex will not be adjusted
-	 * in ext4_mb_normalize_request and will keep same with ac_o_ex
-	 * from ext4_mb_initialize_context. Choose ac_g_ex here to keep
-	 * consistent with ext4_mb_find_by_goal.
-	 */
-	start = pa->pa_pstart +
-		(ac->ac_g_ex.fe_logical - pa->pa_lstart);
-	if (ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex) != start)
-		return false;
+	if (ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY) {
+		/*
+		 * If EXT4_MB_HINT_GOAL_ONLY is set, ac_g_ex will not be adjusted
+		 * in ext4_mb_normalize_request and will keep same with ac_o_ex
+		 * from ext4_mb_initialize_context. Choose ac_g_ex here to keep
+		 * consistent with ext4_mb_find_by_goal.
+		 */
+		start = pa->pa_pstart +
+			(ac->ac_g_ex.fe_logical - pa->pa_lstart);
+		if (ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex) != start)
+			return false;
 
-	if (ac->ac_g_ex.fe_len > pa->pa_len -
-			EXT4_B2C(sbi, ac->ac_g_ex.fe_logical - pa->pa_lstart))
-		return false;
+		if (ac->ac_g_ex.fe_len >
+		    pa->pa_len - EXT4_B2C(sbi, ac->ac_g_ex.fe_logical -
+					  pa->pa_lstart))
+			return false;
+	} else if (ac->ac_flags & EXT4_MB_HINT_ALIGNED) {
+		start = pa->pa_pstart +
+			(ac->ac_g_ex.fe_logical - pa->pa_lstart);
+		if (start % EXT4_C2B(sbi, ac->ac_g_ex.fe_len))
+			return false;
+
+		if (EXT4_C2B(sbi, ac->ac_g_ex.fe_len) >
+		    (EXT4_C2B(sbi, pa->pa_len) -
+		     (ac->ac_g_ex.fe_logical - pa->pa_lstart)))
+			return false;
+	}
 
 	return true;
 }
@@ -4840,10 +4859,6 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return false;
 
-	/* using preallocated blocks is not alignment aware. */
-	if (ac->ac_flags & EXT4_MB_HINT_ALIGNED)
-		return false;
-
 	/*
 	 * first, try per-file preallocation by searching the inode pa rbtree.
 	 *
@@ -4949,7 +4964,7 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 			goto try_group_pa;
 		}
 
-		if (tmp_pa->pa_free && likely(ext4_mb_pa_goal_check(ac, tmp_pa))) {
+		if (tmp_pa->pa_free && likely(ext4_mb_pa_check(ac, tmp_pa))) {
 			atomic_inc(&tmp_pa->pa_count);
 			ext4_mb_use_inode_pa(ac, tmp_pa);
 			spin_unlock(&tmp_pa->pa_lock);
@@ -4984,6 +4999,19 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 		 * pa_free == 0.
 		 */
 		WARN_ON_ONCE(tmp_pa->pa_free == 0);
+
+		/*
+		 * If, for any reason, we reach here then we need to disable PA
+		 * because otherwise ext4_mb_normalize_request() will try to
+		 * allocate a new PA for this logical range where another PA
+		 * already exists. This is not allowed and will trigger BUG_ONs.
+		 * Hence, as a workaround we disable PA.
+		 *
+		 * NOTE: ideally we would want to have some logic to take care
+		 * of the unusable PA. Maybe a more fine grained discard logic
+		 * that could allow us to discard only specific PAs.
+		 */
+		ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
 	}
 	spin_unlock(&tmp_pa->pa_lock);
 try_group_pa:
@@ -5790,6 +5818,7 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 	int bsbits = ac->ac_sb->s_blocksize_bits;
 	loff_t size, isize;
 	bool inode_pa_eligible, group_pa_eligible;
+	bool is_aligned = (ac->ac_flags & EXT4_MB_HINT_ALIGNED);
 
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return;
@@ -5797,7 +5826,8 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
 		return;
 
-	group_pa_eligible = sbi->s_mb_group_prealloc > 0;
+	/* Aligned allocation does not support group pa */
+	group_pa_eligible = (!is_aligned && sbi->s_mb_group_prealloc > 0);
 	inode_pa_eligible = true;
 	size = extent_logical_end(sbi, &ac->ac_o_ex);
 	isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
-- 
2.48.1
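[Editorial note, not part of the patch] The check patch 02 adds to the aligned branch of ext4_mb_pa_check() can be modeled in userspace: an existing preallocation (PA) may serve an aligned request only if the physical block that maps the request's logical start is aligned to the request length, and the PA has enough blocks left from that point. The struct and function below are illustrative stand-ins, with all units in file-system blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the fields of ext4_prealloc_space we need */
struct pa {
	uint64_t pstart;	/* physical start block of the PA */
	uint64_t lstart;	/* logical start block of the PA */
	uint64_t len;		/* PA length in blocks */
};

/*
 * Can this PA satisfy an aligned request of `len` blocks starting at
 * logical block `logical`? Mirrors the two tests in the patch:
 * the mapped physical start must be len-aligned, and the PA must have
 * at least len blocks remaining past that point.
 */
static bool pa_fits_aligned_request(const struct pa *pa,
				    uint64_t logical, uint64_t len)
{
	uint64_t start = pa->pstart + (logical - pa->lstart);

	if (start % len)
		return false;	/* physical start not aligned to length */
	if (len > pa->len - (logical - pa->lstart))
		return false;	/* not enough blocks left in the PA */
	return true;
}
```

When this check fails, the patch falls back by setting EXT4_MB_HINT_NOPREALLOC rather than risking overlapping PAs.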
From: Ojaswin Mujoo
To: linux-ext4@vger.kernel.org, "Theodore Ts'o"
Cc: John Garry, dchinner@redhat.com, "Darrick J. Wong", Ritesh Harjani, linux-kernel@vger.kernel.org
Subject: [RFC v3 03/11] ext4: support for extsize hint using FS_IOC_FS(GET/SET)XATTR
Date: Mon, 24 Mar 2025 13:07:01 +0530
Message-ID: <630bf8077b84d576462eb6d1e2a55b0471b324c1.1742800203.git.ojaswin@linux.ibm.com>

This patch adds support for getting and setting the extsize hint via
the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR interface. The extsize is
stored in an xattr of type EXT4_XATTR_INDEX_SYSTEM.

Restrictions on setting extsize:

1. extsize can't be set on files with data.
2. extsize can't be set on non-regular files.
3. The extsize hint can't be used with bigalloc (yet).
4. extsize (in blocks) should be a power of 2, for simplicity.
5. extsize must be a multiple of the block size.

The ioctl behavior has been kept as close to the XFS equivalent as
possible.
Signed-off-by: Ojaswin Mujoo
---
 fs/ext4/ext4.h  |   6 +++
 fs/ext4/inode.c |  89 +++++++++++++++++++++++++++++++++++
 fs/ext4/ioctl.c | 122 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/super.c |   1 +
 4 files changed, 218 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2c83275d2ad4..75c1c70f7815 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1173,6 +1173,8 @@ struct ext4_inode_info {
 	__u32 i_csum_seed;
 
 	kprojid_t i_projid;
+	/* The extent size hint for the inode, in blocks */
+	ext4_grpblk_t i_extsize;
 };
 
 /*
@@ -3051,6 +3053,10 @@ extern void ext4_da_update_reserve_space(struct inode *inode,
 					int used, int quota_claim);
 extern int ext4_issue_zeroout(struct inode *inode, ext4_lblk_t lblk,
 			      ext4_fsblk_t pblk, ext4_lblk_t len);
+int ext4_inode_xattr_get_extsize(struct inode *inode);
+int ext4_inode_xattr_set_extsize(struct inode *inode, ext4_grpblk_t extsize);
+ext4_grpblk_t ext4_inode_get_extsize(struct ext4_inode_info *ei);
+void ext4_inode_set_extsize(struct ext4_inode_info *ei, ext4_grpblk_t extsize);
 
 /* indirect.c */
 extern int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index aede80fa1781..00d8e9065a02 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4976,6 +4976,20 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 		}
 	}
 
+	ret = ext4_inode_xattr_get_extsize(&ei->vfs_inode);
+	if (ret >= 0) {
+		ei->i_extsize = ret;
+	} else if (ret == -ENODATA) {
+		/* extsize is not set */
+		ei->i_extsize = 0;
+	} else {
+		ext4_error_inode(inode, function, line, 0,
+				 "iget: error while retrieving extsize from xattr: %d",
+				 ret);
+		ret = -EFSCORRUPTED;
+		goto bad_inode;
+	}
+
 	EXT4_INODE_GET_CTIME(inode, raw_inode);
 	EXT4_INODE_GET_ATIME(inode, raw_inode);
 	EXT4_INODE_GET_MTIME(inode, raw_inode);
@@ -6311,3 +6325,78 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 	ext4_journal_stop(handle);
 	goto out;
 }
+
+/*
+ * Returns the positive extsize if set, 0 if not set, else a negative error.
+ */
+ext4_grpblk_t ext4_inode_xattr_get_extsize(struct inode *inode)
+{
+	char *buf;
+	int size, ret = 0;
+	ext4_grpblk_t extsize = 0;
+
+	size = ext4_xattr_get(inode, EXT4_XATTR_INDEX_SYSTEM, "extsize", NULL, 0);
+
+	if (size == -ENODATA || size == 0) {
+		return 0;
+	} else if (size < 0) {
+		ret = size;
+		goto exit;
+	}
+
+	buf = kmalloc(size + 1, GFP_KERNEL);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	size = ext4_xattr_get(inode, EXT4_XATTR_INDEX_SYSTEM, "extsize", buf,
+			      size);
+	if (size == -ENODATA)
+		/* No extsize is set */
+		extsize = 0;
+	else if (size < 0)
+		ret = size;
+	else {
+		buf[size] = '\0';
+		ret = kstrtoint(buf, 10, &extsize);
+	}
+
+	kfree(buf);
+exit:
+	if (ret)
+		return ret;
+	return extsize;
+}
+
+int ext4_inode_xattr_set_extsize(struct inode *inode, ext4_grpblk_t extsize)
+{
+	int err = 0;
+	/* max value of extsize should fit within 11 chars */
+	char extsize_str[11];
+
+	err = snprintf(extsize_str, sizeof(extsize_str), "%u", extsize);
+	if (err < 0)
+		return err;
+
+	/* Try to replace the xattr if it exists, else try to create it */
+	err = ext4_xattr_set(inode, EXT4_XATTR_INDEX_SYSTEM, "extsize",
+			     extsize_str, strlen(extsize_str), XATTR_REPLACE);
+	if (err == -ENODATA)
+		err = ext4_xattr_set(inode, EXT4_XATTR_INDEX_SYSTEM, "extsize",
+				     extsize_str, strlen(extsize_str),
+				     XATTR_CREATE);
+
+	return err;
+}
+
+ext4_grpblk_t ext4_inode_get_extsize(struct ext4_inode_info *ei)
+{
+	return ei->i_extsize;
+}
+
+void ext4_inode_set_extsize(struct ext4_inode_info *ei, ext4_grpblk_t extsize)
+{
+	ei->i_extsize = extsize;
+}
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index d17207386ead..48f62d7c27e6 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -708,6 +708,93 @@ static int ext4_ioctl_setflags(struct inode *inode,
 	return err;
 }
 
+static u32 ext4_ioctl_getextsize(struct inode *inode)
+{
+	ext4_grpblk_t extsize;
+
+	extsize = ext4_inode_get_extsize(EXT4_I(inode));
+
+	return (u32) extsize << inode->i_blkbits;
+}
+
+static int ext4_ioctl_setextsize(struct inode *inode, u32 extsize, u32 xflags)
+{
+	int err;
+	ext4_grpblk_t extsize_blks = extsize >> inode->i_blkbits;
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	int blksize = 1 << inode->i_blkbits;
+	char *msg = NULL;
+
+	if (!S_ISREG(inode->i_mode)) {
+		msg = "Cannot set extsize on non regular file";
+		err = -EOPNOTSUPP;
+		goto error;
+	}
+
+	/*
+	 * We are okay with a non-zero i_size as long as there is no data.
+	 */
+	if (ext4_has_inline_data(inode) ||
+	    READ_ONCE(EXT4_I(inode)->i_disksize) ||
+	    EXT4_I(inode)->i_reserved_data_blocks) {
+		msg = "Cannot set extsize on file with data";
+		err = -EINVAL;
+		goto error;
+	}
+
+	if (extsize % blksize) {
+		msg = "extsize must be multiple of blocksize";
+		err = -EINVAL;
+		goto error;
+	}
+
+	if (sbi->s_cluster_ratio > 1) {
+		msg = "Can't use extsize hint with bigalloc";
+		err = -EINVAL;
+		goto error;
+	}
+
+	if ((xflags & FS_XFLAG_EXTSIZE) && extsize == 0) {
+		msg = "fsx_extsize can't be 0 if FS_XFLAG_EXTSIZE is passed";
+		err = -EINVAL;
+		goto error;
+	}
+
+	if (extsize_blks > sbi->s_blocks_per_group) {
+		msg = "extsize cannot exceed number of blocks in block group";
+		err = -EINVAL;
+		goto error;
+	}
+
+	if (extsize && !is_power_of_2(extsize_blks)) {
+		msg = "extsize must be either power-of-2 in fs blocks or 0";
+		err = -EINVAL;
+		goto error;
+	}
+
+	if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
+		msg = "extsize can't be set on non-extent based files";
+		err = -EINVAL;
+		goto error;
+	}
+
+	/* update the extsize in inode xattr */
+	err = ext4_inode_xattr_set_extsize(inode, extsize_blks);
+	if (err < 0)
+		return err;
+
+	/* Update the new extsize in the in-core inode */
+	ext4_inode_set_extsize(EXT4_I(inode), extsize_blks);
+	return 0;
+
+error:
+	if (msg)
+		ext4_warning_inode(inode, "%s\n", msg);
+
+	return err;
+}
+
 #ifdef CONFIG_QUOTA
 static int ext4_ioctl_setproject(struct inode *inode, __u32 projid)
 {
@@ -985,6 +1072,7 @@ int ext4_fileattr_get(struct dentry *dentry, struct fileattr *fa)
 	struct inode *inode = d_inode(dentry);
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	u32 flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
+	u32 extsize = 0;
 
 	if (S_ISREG(inode->i_mode))
 		flags &= ~FS_PROJINHERIT_FL;
@@ -993,6 +1081,13 @@ int ext4_fileattr_get(struct dentry *dentry, struct fileattr *fa)
 	if (ext4_has_feature_project(inode->i_sb))
 		fa->fsx_projid = from_kprojid(&init_user_ns, ei->i_projid);
 
+	extsize = ext4_ioctl_getextsize(inode);
+	/* Flag is only set if extsize is non zero */
+	if (extsize > 0) {
+		fa->fsx_extsize = extsize;
+		fa->fsx_xflags |= FS_XFLAG_EXTSIZE;
+	}
+
 	return 0;
 }
 
@@ -1022,6 +1117,33 @@ int ext4_fileattr_set(struct mnt_idmap *idmap,
 	if (err)
 		goto out;
 	err = ext4_ioctl_setproject(inode, fa->fsx_projid);
+	if (err)
+		goto out;
+
+	if (fa->fsx_xflags & FS_XFLAG_EXTSIZE) {
+		err = ext4_ioctl_setextsize(inode, fa->fsx_extsize,
+					    fa->fsx_xflags);
+		if (err)
+			goto out;
+	} else if (fa->fsx_extsize == 0) {
+		/*
+		 * Even when the user explicitly passes extsize=0, the flag is
+		 * cleared in fileattr_set_prepare().
+		 */
+		if (ext4_inode_get_extsize(EXT4_I(inode)) != 0) {
+			err = ext4_ioctl_setextsize(inode, fa->fsx_extsize,
+						    fa->fsx_xflags);
+			if (err)
+				goto out;
+		}
+	} else {
+		/* Unexpected usage, reset extsize to 0 */
+		err = ext4_ioctl_setextsize(inode, 0, fa->fsx_xflags);
+		if (err)
+			goto out;
+		fa->fsx_xflags = 0;
+	}
 out:
 	return err;
 }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8122d4ffb3b5..250479a1d237 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1413,6 +1413,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	spin_lock_init(&ei->i_completed_io_lock);
 	ei->i_sync_tid = 0;
 	ei->i_datasync_tid = 0;
+	ei->i_extsize = 0;
 	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
 	ext4_fc_init_inode(&ei->vfs_inode);
 	mutex_init(&ei->i_fc_lock);
-- 
2.48.1

From nobody Fri Dec 19 04:15:07 2025
From: Ojaswin Mujoo
To: linux-ext4@vger.kernel.org, "Theodore Ts'o"
Cc: John Garry, dchinner@redhat.com, "Darrick J. Wong", Ritesh Harjani, linux-kernel@vger.kernel.org
Subject: [RFC v3 04/11] ext4: pass lblk and len explicitly to ext4_split_extent*()
Date: Mon, 24 Mar 2025 13:07:02 +0530
Message-ID: <7b4e15e314dc4d247fc19c12f76bbbc66a23faa5.1742800203.git.ojaswin@linux.ibm.com>
X-Mailer: git-send-email 2.48.1

Since these functions only use the map to determine the lblk and len of the
split, pass them explicitly. This is in preparation for making them work
cleanly with extent size hints.

No functional change in this patch.
Signed-off-by: Ojaswin Mujoo
---
 fs/ext4/extents.c | 57 +++++++++++++++++++++++++----------------------
 1 file changed, 30 insertions(+), 27 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c616a16a9f36..4e604ce6ce35 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3347,7 +3347,8 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
 static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
 					struct inode *inode,
 					struct ext4_ext_path *path,
-					struct ext4_map_blocks *map,
+					ext4_lblk_t lblk,
+					unsigned int len,
 					int split_flag, int flags,
 					unsigned int *allocated)
 {
@@ -3363,7 +3364,7 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
 	ee_len = ext4_ext_get_actual_len(ex);
 	unwritten = ext4_ext_is_unwritten(ex);
 
-	if (map->m_lblk + map->m_len < ee_block + ee_len) {
+	if (lblk + len < ee_block + ee_len) {
 		split_flag1 = split_flag & EXT4_EXT_MAY_ZEROOUT;
 		flags1 = flags | EXT4_GET_BLOCKS_PRE_IO;
 		if (unwritten)
@@ -3372,28 +3373,28 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
 		if (split_flag & EXT4_EXT_DATA_VALID2)
 			split_flag1 |= EXT4_EXT_DATA_VALID1;
 		path = ext4_split_extent_at(handle, inode, path,
-				map->m_lblk + map->m_len, split_flag1, flags1);
+				lblk + len, split_flag1, flags1);
 		if (IS_ERR(path))
 			return path;
 		/*
 		 * Update path is required because previous ext4_split_extent_at
 		 * may result in split of original leaf or extent zeroout.
 		 */
-		path = ext4_find_extent(inode, map->m_lblk, path, flags);
+		path = ext4_find_extent(inode, lblk, path, flags);
 		if (IS_ERR(path))
 			return path;
 		depth = ext_depth(inode);
 		ex = path[depth].p_ext;
 		if (!ex) {
 			EXT4_ERROR_INODE(inode, "unexpected hole at %lu",
-					 (unsigned long) map->m_lblk);
+					 (unsigned long) lblk);
 			ext4_free_ext_path(path);
 			return ERR_PTR(-EFSCORRUPTED);
 		}
 		unwritten = ext4_ext_is_unwritten(ex);
 	}
 
-	if (map->m_lblk >= ee_block) {
+	if (lblk >= ee_block) {
 		split_flag1 = split_flag & EXT4_EXT_DATA_VALID2;
 		if (unwritten) {
 			split_flag1 |= EXT4_EXT_MARK_UNWRIT1;
@@ -3401,16 +3402,16 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
 					     EXT4_EXT_MARK_UNWRIT2);
 		}
 		path = ext4_split_extent_at(handle, inode, path,
-				map->m_lblk, split_flag1, flags);
+				lblk, split_flag1, flags);
 		if (IS_ERR(path))
 			return path;
 	}
 
 	if (allocated) {
-		if (map->m_lblk + map->m_len > ee_block + ee_len)
-			*allocated = ee_len - (map->m_lblk - ee_block);
+		if (lblk + len > ee_block + ee_len)
+			*allocated = ee_len - (lblk - ee_block);
 		else
-			*allocated = map->m_len;
+			*allocated = len;
 	}
 	ext4_ext_show_leaf(inode, path);
 	return path;
@@ -3658,8 +3659,8 @@ ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
 	}
 
 fallback:
-	path = ext4_split_extent(handle, inode, path, &split_map, split_flag,
-				 flags, NULL);
+	path = ext4_split_extent(handle, inode, path, split_map.m_lblk,
+				 split_map.m_len, split_flag, flags, NULL);
 	if (IS_ERR(path))
 		return path;
 out:
@@ -3699,11 +3700,11 @@ ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
  * allocated pointer. Return an extent path pointer on success, or an error
  * pointer on failure.
  */
-static struct ext4_ext_path *ext4_split_convert_extents(handle_t *handle,
-						struct inode *inode,
-						struct ext4_map_blocks *map,
-						struct ext4_ext_path *path,
-						int flags, unsigned int *allocated)
+static struct ext4_ext_path *
+ext4_split_convert_extents(handle_t *handle, struct inode *inode,
+			   ext4_lblk_t lblk, unsigned int len,
+			   struct ext4_ext_path *path, int flags,
+			   unsigned int *allocated)
 {
 	ext4_lblk_t eof_block;
 	ext4_lblk_t ee_block;
@@ -3712,12 +3713,12 @@ static struct ext4_ext_path *ext4_split_convert_extents(handle_t *handle,
 	int split_flag = 0, depth;
 
 	ext_debug(inode, "logical block %llu, max_blocks %u\n",
-		  (unsigned long long)map->m_lblk, map->m_len);
+		  (unsigned long long)lblk, len);
 
 	eof_block = (EXT4_I(inode)->i_disksize + inode->i_sb->s_blocksize - 1)
			>> inode->i_sb->s_blocksize_bits;
-	if (eof_block < map->m_lblk + map->m_len)
-		eof_block = map->m_lblk + map->m_len;
+	if (eof_block < lblk + len)
+		eof_block = lblk + len;
 	/*
 	 * It is safe to convert extent to initialized via explicit
 	 * zeroout only if extent is fully inside i_size or new_size.
@@ -3737,8 +3738,8 @@ static struct ext4_ext_path *ext4_split_convert_extents(handle_t *handle,
 		split_flag |= (EXT4_EXT_MARK_UNWRIT2 | EXT4_EXT_DATA_VALID2);
 	}
 	flags |= EXT4_GET_BLOCKS_PRE_IO;
-	return ext4_split_extent(handle, inode, path, map, split_flag, flags,
-				 allocated);
+	return ext4_split_extent(handle, inode, path, lblk, len, split_flag,
+				 flags, allocated);
 }
 
 static struct ext4_ext_path *
@@ -3773,7 +3774,7 @@ ext4_convert_unwritten_extents_endio(handle_t *handle, struct inode *inode,
 		  inode->i_ino, (unsigned long long)ee_block, ee_len,
 		  (unsigned long long)map->m_lblk, map->m_len);
 #endif
-	path = ext4_split_convert_extents(handle, inode, map, path,
+	path = ext4_split_convert_extents(handle, inode, map->m_lblk, map->m_len, path,
 					  EXT4_GET_BLOCKS_CONVERT, NULL);
 	if (IS_ERR(path))
 		return path;
@@ -3837,8 +3838,9 @@ convert_initialized_extent(handle_t *handle, struct inode *inode,
 		  (unsigned long long)ee_block, ee_len);
 
 	if (ee_block != map->m_lblk || ee_len > map->m_len) {
-		path = ext4_split_convert_extents(handle, inode, map, path,
-				EXT4_GET_BLOCKS_CONVERT_UNWRITTEN, NULL);
+		path = ext4_split_convert_extents(
+			handle, inode, map->m_lblk, map->m_len, path,
+			EXT4_GET_BLOCKS_CONVERT_UNWRITTEN, NULL);
 		if (IS_ERR(path))
 			return path;
 
@@ -3909,8 +3911,9 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
 
 	/* get_block() before submitting IO, split the extent */
 	if (flags & EXT4_GET_BLOCKS_PRE_IO) {
-		path = ext4_split_convert_extents(handle, inode, map, path,
-				flags | EXT4_GET_BLOCKS_CONVERT, allocated);
+		path = ext4_split_convert_extents(
+			handle, inode, map->m_lblk, map->m_len, path,
+			flags | EXT4_GET_BLOCKS_CONVERT, allocated);
 		if (IS_ERR(path))
 			return path;
 		/*
-- 
2.48.1

From nobody Fri Dec 19 04:15:07 2025
From: Ojaswin Mujoo
To: linux-ext4@vger.kernel.org, "Theodore Ts'o"
Cc: John Garry, dchinner@redhat.com, "Darrick J. Wong", Ritesh Harjani, linux-kernel@vger.kernel.org
Subject: [RFC v3 05/11] ext4: add extsize hint support
Date: Mon, 24 Mar 2025 13:07:03 +0530
X-Mailer: git-send-email 2.48.1

Now that the ioctl is in place, add the underlying infrastructure to support
extent size hints.

** MOTIVATION **

1. This feature allows us to ask the allocator for blocks that are logically
   AS WELL AS physically aligned to an extent size hint (aka extsize), which
   is generally a power of 2.

2. This means both the start and the length of the physical and logical
   range should be aligned to the extsize.

3. This sets up the infrastructure we'll eventually need for supporting
   non-torn/atomic writes that need to follow a certain alignment as
   required by hardware.

4. This can also be extended to other use cases like stripe alignment.

** DESIGN NOTES **

* Physical Alignment *

1. Since the extsize is always a power of 2 (for now) in fs blocks, we
   leverage CR_POWER2_ALIGNED allocation to get the blocks. This ensures the
   blocks are physically aligned.

2. Since this is just a hint, in case we are not able to get any aligned
   blocks we simply fall back to non-aligned allocation.

* Logical Alignment *

The flow of extsize-aligned allocation with buffered IO (condensed from the
flow diagram):

1. ext4_map_blocks() is called with extsize allocation.
2. Adjust lblk and len to align to extsize.
3. If pre-existing written/unwritten blocks are in the extsize range:
   return the blocks if they cover the original range, else fall back to
   non-extsize allocation.
4. Otherwise, allocate the extsize range, mark the allocated extent as
   unwritten, insert the extsize extent into the tree, and return the
   allocated blocks.
5. During writeback: use PRE_IO to split only the dirty extent.
6. After IO: convert the extent under IO to written.

The flow with direct IO (condensed from the flow diagram):

1. ext4_map_blocks() is called with extsize allocation and PRE_IO.
2. Adjust lblk and len to align to extsize.
3. If pre-existing written blocks are in the extsize range: return the
   blocks if they cover the original range, else fall back to non-extsize
   allocation.
4. Else, if unwritten blocks are in the extsize range: call
   ext4_ext_map_blocks() -> ext4_ext_handle_unwritten_extents(), split the
   original range out of the bigger unwritten extent, mark the split extent
   as unwritten, and return the split extent to the user.
5. Else, allocate the extsize range, mark the complete range unwritten and
   insert it in the tree, split the original range from the bigger
   allocated extent, mark the split extent as unwritten, and return the
   split extent to the user.
6. After IO: convert the extent under IO to written.

** IMPLEMENTATION NOTES **

* Callers of ext4_map_blocks() work under the assumption that it will only
  ever return as much as requested, or less. Since we might now allocate
  more than requested, ext4_map_blocks() adjusts the returned map to report
  only as much as the user requested.

* Further, ext4_map_blocks() now maintains two maps: the original map, and
  the extsize map that is used when extsize hint allocation is taking
  place. Both maps are passed down because some functions now need
  information from the original map as well as the extsize map.
* For example, when we go for direct IO and there's a hole in the original
  range requested, we allocate based on the extsize range and then split the
  bigger unwritten extent into smaller unwritten extents based on the
  original range. (Needed so we don't have to split after IO.) For this we
  need the information of the extsize range as well as the original range,
  hence the two maps.

* Since we now allocate more than the user requested, to avoid stale data
  exposure we mark the bigger extsize extent as unwritten and then use a
  flow similar to dioread_nolock to mark only the extent under write as
  written.

* We disable extsize hints for writes beyond EOF (for now).

* When extsize is set on an inode, we drop to non-delalloc allocations,
  similar to XFS.

Signed-off-by: Ojaswin Mujoo
---
 fs/ext4/ext4.h              |   4 +-
 fs/ext4/ext4_jbd2.h         |  15 ++
 fs/ext4/extents.c           | 174 ++++++++++++++++--
 fs/ext4/inode.c             | 354 ++++++++++++++++++++++++++++++++----
 include/trace/events/ext4.h |   1 +
 5 files changed, 495 insertions(+), 53 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 75c1c70f7815..ab4f10f9031a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -727,6 +727,7 @@ enum {
 #define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
 	/* Caller is in the atomic contex, find extent if it has been cached */
 #define EXT4_GET_BLOCKS_CACHED_NOWAIT		0x0800
+#define EXT4_GET_BLOCKS_EXTSIZE			0x1000
 
 /*
  * The bit position of these flags must not overlap with any of the
@@ -3708,7 +3709,8 @@ struct ext4_extent;
 extern void ext4_ext_tree_init(handle_t *handle, struct inode *inode);
 extern int ext4_ext_index_trans_blocks(struct inode *inode, int extents);
 extern int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
-			       struct ext4_map_blocks *map, int flags);
+			       struct ext4_map_blocks *orig_map,
+			       struct ext4_map_blocks *extsize_map, int flags);
 extern int ext4_ext_truncate(handle_t *, struct inode *);
 extern int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
				 ext4_lblk_t end);
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 3221714d9901..53b930f6c797 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -458,4 +458,19 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *journal)
 	return err;
 }
 
+static inline int ext4_should_use_extsize(struct inode *inode)
+{
+	if (!S_ISREG(inode->i_mode))
+		return 0;
+	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+		return 0;
+	return (ext4_inode_get_extsize(EXT4_I(inode)) > 0);
+}
+
+static inline int ext4_should_use_unwrit_extents(struct inode *inode)
+{
+	return (ext4_should_dioread_nolock(inode) ||
+		ext4_should_use_extsize(inode));
+}
+
 #endif	/* _EXT4_JBD2_H */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 4e604ce6ce35..a86cc3e76f14 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3889,15 +3889,24 @@ convert_initialized_extent(handle_t *handle, struct inode *inode,
 
 static struct ext4_ext_path *
 ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
-			struct ext4_map_blocks *map,
+			struct ext4_map_blocks *orig_map,
+			struct ext4_map_blocks *extsize_map,
 			struct ext4_ext_path *path, int flags,
 			unsigned int *allocated, ext4_fsblk_t newblock)
 {
+	struct ext4_map_blocks *map;
 	int err = 0;
 
-	ext_debug(inode, "logical block %llu, max_blocks %u, flags 0x%x, allocated %u\n",
-		  (unsigned long long)map->m_lblk, map->m_len, flags,
-		  *allocated);
+	if (flags & EXT4_GET_BLOCKS_EXTSIZE) {
+		BUG_ON(extsize_map == NULL);
+		map = extsize_map;
+	} else
+		map = orig_map;
+
+	ext_debug(
+		inode,
+		"logical block %llu, max_blocks %u, flags 0x%x, allocated %u\n",
+		(unsigned long long)map->m_lblk, map->m_len, flags, *allocated);
 	ext4_ext_show_leaf(inode, path);
 
 	/*
@@ -3906,13 +3915,14 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
 	 */
 	flags |= EXT4_GET_BLOCKS_METADATA_NOFAIL;
 
-	trace_ext4_ext_handle_unwritten_extents(inode, map, flags,
-						*allocated, newblock);
+	trace_ext4_ext_handle_unwritten_extents(inode, map, flags, *allocated,
+						newblock);
 
 	/* get_block() before submitting IO, split the extent */
 	if (flags & EXT4_GET_BLOCKS_PRE_IO) {
+		/* Split should always happen based on original mapping */
 		path = ext4_split_convert_extents(
-			handle, inode, map->m_lblk, map->m_len, path,
+			handle, inode, orig_map->m_lblk, orig_map->m_len, path,
 			flags | EXT4_GET_BLOCKS_CONVERT, allocated);
 		if (IS_ERR(path))
 			return path;
@@ -3927,11 +3937,19 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
 			err = -EFSCORRUPTED;
 			goto errout;
 		}
+
+		/*
+		 * For the extsize case we need to adjust lblk to the start of
+		 * the split extent because m_len will be set to the len of the
+		 * split extent. No change for the non-extsize case.
+		 */
+		map->m_lblk = orig_map->m_lblk;
 		map->m_flags |= EXT4_MAP_UNWRITTEN;
 		goto out;
 	}
 	/* IO end_io complete, convert the filled extent to written */
 	if (flags & EXT4_GET_BLOCKS_CONVERT) {
+		BUG_ON(map == extsize_map);
 		path = ext4_convert_unwritten_extents_endio(handle, inode,
							    map, path);
 		if (IS_ERR(path))
@@ -4189,7 +4207,8 @@ static ext4_lblk_t ext4_ext_determine_insert_hole(struct inode *inode,
  *	return < 0, error case.
  */
 int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
-			struct ext4_map_blocks *map, int flags)
+			struct ext4_map_blocks *orig_map,
+			struct ext4_map_blocks *extsize_map, int flags)
 {
 	struct ext4_ext_path *path = NULL;
 	struct ext4_extent newex, *ex, ex2;
@@ -4200,6 +4219,17 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	unsigned int allocated_clusters = 0;
 	struct ext4_allocation_request ar;
 	ext4_lblk_t cluster_offset;
+	struct ext4_map_blocks *map;
+#ifdef CONFIG_EXT4_DEBUG
+	struct ext4_ext_path *test_path = NULL;
+#endif
+
+	if (flags & EXT4_GET_BLOCKS_EXTSIZE) {
+		BUG_ON(extsize_map == NULL);
+		map = extsize_map;
+	} else
+		map = orig_map;
+
 
 	ext_debug(inode, "blocks %u/%u requested\n", map->m_lblk, map->m_len);
 	trace_ext4_ext_map_blocks_enter(inode, map->m_lblk, map->m_len, flags);
@@ -4256,6 +4286,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		 */
 		if ((!ext4_ext_is_unwritten(ex)) &&
		    (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN)) {
+			BUG_ON(map == extsize_map);
 			path = convert_initialized_extent(handle, inode, map,
							  path, &allocated);
 			if (IS_ERR(path))
@@ -4272,8 +4303,8 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		}
 
 		path = ext4_ext_handle_unwritten_extents(
-			handle, inode, map, path, flags,
-			&allocated, newblock);
+			handle, inode, orig_map, extsize_map, path,
+			flags, &allocated, newblock);
 		if (IS_ERR(path))
 			err = PTR_ERR(path);
 		goto out;
@@ -4306,6 +4337,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	 */
 	if (cluster_offset && ex &&
	    get_implied_cluster_alloc(inode->i_sb, map, ex, path)) {
+		BUG_ON(map == extsize_map);
 		ar.len = allocated = map->m_len;
 		newblock = map->m_pblk;
 		goto got_allocated_blocks;
@@ -4325,6 +4357,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
	 * cluster we can use.
*/ if ((sbi->s_cluster_ratio > 1) && err && get_implied_cluster_alloc(inode->i_sb, map, &ex2, path)) { + BUG_ON(map =3D=3D extsize_map); ar.len =3D allocated =3D map->m_len; newblock =3D map->m_pblk; err =3D 0; @@ -4379,6 +4412,8 @@ int ext4_ext_map_blocks(handle_t *handle, struct inod= e *inode, ar.flags |=3D EXT4_MB_DELALLOC_RESERVED; if (flags & EXT4_GET_BLOCKS_METADATA_NOFAIL) ar.flags |=3D EXT4_MB_USE_RESERVED; + if (flags & EXT4_GET_BLOCKS_EXTSIZE) + ar.flags |=3D EXT4_MB_HINT_ALIGNED; newblock =3D ext4_mb_new_blocks(handle, &ar, &err); if (!newblock) goto out; @@ -4400,9 +4435,114 @@ int ext4_ext_map_blocks(handle_t *handle, struct in= ode *inode, map->m_flags |=3D EXT4_MAP_UNWRITTEN; } =20 - path =3D ext4_ext_insert_extent(handle, inode, path, &newex, flags); - if (IS_ERR(path)) { - err =3D PTR_ERR(path); + if ((flags & EXT4_GET_BLOCKS_EXTSIZE) && + (flags & EXT4_GET_BLOCKS_PRE_IO)) { + /* + * With EXTSIZE and PRE-IO (direct io case) we have to be careful + * because we want to insert the complete extent but split only the + * originally requested range. + * + * Below are the different (S)cenarios and the (A)ction we take: + * + * S1: New extent covers the original range completely/partially. + * A1: Insert new extent, allow merges. Then split the original + * range from this. Adjust the length of split if new extent only + * partially covers original. + * + * S2: New extent doesn't cover original range at all + * A2: Just insert this range and return. Rest is handled in + * ext4_map_blocks() + * NOTE: We can handle this as an error with EAGAIN in future. 
+ */ + ext4_lblk_t newex_lblk =3D le32_to_cpu(newex.ee_block); + loff_t newex_len =3D ext4_ext_get_actual_len(&newex); + + if (in_range(orig_map->m_lblk, newex_lblk, newex_len)) { + /* S1 */ + loff_t split_len =3D 0; + + BUG_ON(!ext4_ext_is_unwritten(&newex)); + + if (newex_lblk + newex_len >=3D + orig_map->m_lblk + (loff_t)orig_map->m_len) + split_len =3D orig_map->m_len; + else + split_len =3D newex_len - + (orig_map->m_lblk - newex_lblk); + + path =3D ext4_ext_insert_extent( + handle, inode, path, &newex, + (flags & ~EXT4_GET_BLOCKS_PRE_IO)); + if (IS_ERR(path)) { + err =3D PTR_ERR(path); + goto insert_error; + } + + /* + * Update path before split + * NOTE: This might no longer be needed with recent + * changes in ext4_ext_insert_extent() + */ + path =3D ext4_find_extent(inode, orig_map->m_lblk, path, 0); + if (IS_ERR(path)) { + err =3D PTR_ERR(path); + goto insert_error; + } + + /* + * GET_BLOCKS_CONVERT is needed to make sure split + * extent is marked unwritten although the flags itself + * means that the extent should be converted to written. + * + * TODO: This is because ext4_split_convert_extents() + * doesn't respect the flags at all but fixing this + * needs more involved design changes. 
+ */ + path =3D ext4_split_convert_extents( + handle, inode, orig_map->m_lblk, split_len, + path, flags | EXT4_GET_BLOCKS_CONVERT, NULL); + if (IS_ERR(path)) { + err =3D PTR_ERR(path); + goto insert_error; + } + +#ifdef CONFIG_EXT4_DEBUG + test_path =3D ext4_find_extent(inode, orig_map->m_lblk, + NULL, 0); + if (!IS_ERR(test_path)) { + /* Confirm we've correctly split and marked the extent unwritten */ + struct ext4_extent *test_ex =3D + test_path[ext_depth(inode)].p_ext; + WARN_ON(!ext4_ext_is_unwritten(test_ex)); + WARN_ON(test_ex->ee_block !=3D orig_map->m_lblk); + WARN_ON(ext4_ext_get_actual_len(test_ex) !=3D + orig_map->m_len); + kfree(test_path); + } +#endif + } else { + /* S2 */ + BUG_ON(orig_map->m_lblk < newex_lblk + newex_len); + + path =3D ext4_ext_insert_extent( + handle, inode, path, &newex, + (flags & ~EXT4_GET_BLOCKS_PRE_IO)); + if (IS_ERR(path)) { + err =3D PTR_ERR(path); + goto insert_error; + } + } + } else { + path =3D ext4_ext_insert_extent(handle, inode, path, &newex, + flags); + if (IS_ERR(path)) { + err =3D PTR_ERR(path); + goto insert_error; + } + } + +insert_error: + if (err) { if (allocated_clusters) { int fb_flags =3D 0; =20 @@ -4672,7 +4812,7 @@ static long ext4_do_fallocate(struct file *file, loff= _t offset, loff_t end =3D offset + len; loff_t new_size =3D 0; ext4_lblk_t start_lblk, len_lblk; - int ret; + int ret, flags; =20 trace_ext4_fallocate_enter(inode, offset, len, mode); WARN_ON_ONCE(!inode_is_locked(inode)); @@ -4694,8 +4834,12 @@ static long ext4_do_fallocate(struct file *file, lof= f_t offset, goto out; } =20 + flags =3D EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT; + if (ext4_should_use_extsize(inode)) + flags |=3D EXT4_GET_BLOCKS_EXTSIZE; + ret =3D ext4_alloc_file_blocks(file, start_lblk, len_lblk, new_size, - EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT); + flags); if (ret) goto out; =20 diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 00d8e9065a02..53724b7cb9e0 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -435,7 +435,7 @@ static 
void ext4_map_blocks_es_recheck(handle_t *handle, */ down_read(&EXT4_I(inode)->i_data_sem); if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) { - retval =3D ext4_ext_map_blocks(handle, inode, map, 0); + retval =3D ext4_ext_map_blocks(handle, inode, map, NULL, 0); } else { retval =3D ext4_ind_map_blocks(handle, inode, map, 0); } @@ -460,15 +460,33 @@ static void ext4_map_blocks_es_recheck(handle_t *hand= le, #endif /* ES_AGGRESSIVE_TEST */ =20 static int ext4_map_query_blocks(handle_t *handle, struct inode *inode, - struct ext4_map_blocks *map) + struct ext4_map_blocks *orig_map, + struct ext4_map_blocks *extsize_map, + bool should_extsize) { unsigned int status; int retval; + struct ext4_map_blocks *map; + + if (should_extsize) { + BUG_ON(extsize_map =3D=3D NULL); + map =3D extsize_map; + } else + map =3D orig_map; =20 if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) - retval =3D ext4_ext_map_blocks(handle, inode, map, 0); - else + if (should_extsize) { + retval =3D ext4_ext_map_blocks(handle, inode, orig_map, + map, + EXT4_GET_BLOCKS_EXTSIZE); + } else { + retval =3D ext4_ext_map_blocks(handle, inode, map, NULL, + 0); + } + else { + BUG_ON(should_extsize); retval =3D ext4_ind_map_blocks(handle, inode, map, 0); + } =20 if (retval <=3D 0) return retval; @@ -489,11 +507,20 @@ static int ext4_map_query_blocks(handle_t *handle, st= ruct inode *inode, } =20 static int ext4_map_create_blocks(handle_t *handle, struct inode *inode, - struct ext4_map_blocks *map, int flags) + struct ext4_map_blocks *orig_map, + struct ext4_map_blocks *extsize_map, int flags, + bool should_extsize) { struct extent_status es; unsigned int status; int err, retval =3D 0; + struct ext4_map_blocks *map; + + if (should_extsize) { + BUG_ON(extsize_map =3D=3D NULL); + map =3D extsize_map; + } else + map =3D orig_map; =20 /* * We pass in the magic EXT4_GET_BLOCKS_DELALLOC_RESERVE @@ -514,8 +541,15 @@ static int ext4_map_create_blocks(handle_t *handle, st= ruct inode *inode, * changed the inode 
type in between. */ if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) { - retval =3D ext4_ext_map_blocks(handle, inode, map, flags); + if (should_extsize) { + retval =3D ext4_ext_map_blocks(handle, inode, orig_map, + map, flags); + } else { + retval =3D ext4_ext_map_blocks(handle, inode, map, NULL, + flags); + } } else { + BUG_ON(should_extsize); retval =3D ext4_ind_map_blocks(handle, inode, map, flags); =20 /* @@ -570,6 +604,80 @@ static int ext4_map_create_blocks(handle_t *handle, st= ruct inode *inode, return retval; } =20 +/** + * Extsize hint will change the mapped range and hence we'll end up mappin= g more. + * To not confuse the caller, adjust the struct ext4_map_blocks to reflect= the + * original mapping requested by them. + * + * @cur_map: The block mapping we are working with (for sanity check) + * @orig_map: The originally requested mapping + * @extsize_map: The mapping after adjusting for extsize hint + * @flags Get block flags (for sanity check) + * + * This function assumes that the orig_mlblk is contained within the mappi= ng + * held in extsize_map. Caller must make sure this is true. + */ +static inline unsigned int ext4_extsize_adjust_map(struct ext4_map_blocks = *cur_map, + struct ext4_map_blocks *orig_map, + struct ext4_map_blocks *extsize_map, + int flags) +{ + __u64 map_end =3D (__u64)extsize_map->m_lblk + extsize_map->m_len; + + BUG_ON(cur_map !=3D extsize_map || !(flags & EXT4_GET_BLOCKS_EXTSIZE)); + + orig_map->m_len =3D min(orig_map->m_len, map_end - orig_map->m_lblk); + orig_map->m_pblk =3D + extsize_map->m_pblk + (orig_map->m_lblk - extsize_map->m_lblk); + orig_map->m_flags =3D extsize_map->m_flags; + + return orig_map->m_len; +} + +/** + * ext4_error_adjust_map - Adjust map returned upon error in ext4_map_bloc= ks() + * + * @cur_map: current map we are working with + * @orig_map: original map that would be returned to the user. 
+ * + * Most of the callers of ext4_map_blocks() ignore the map on error, howev= er + * some use it for debug logging. In this case, they log state of the map = just + * before the error, hence this function ensures that map returned to call= er is + * the one we were working with when error happened. Mostly useful when ex= tsize + * hints are enabled. + */ +static inline void ext4_error_adjust_map(struct ext4_map_blocks *cur_map, + struct ext4_map_blocks *orig_map) +{ + if (cur_map !=3D orig_map) + memcpy(orig_map, cur_map, sizeof(*cur_map)); +} + +/* + * This functions resets the mapping to it's original state after it has b= een + * modified due to extent size hint and drops the extsize hint. To be used + * incase we want to fallback from extsize based aligned allocation to nor= mal + * allocation + * + * @map: The block mapping where lblk and len have been modified + * because of extsize hint + * @flags: The get_block flags + * @orig_mlblk: The originally requested logical block to map + * @orig_mlen: The originally requested len to map + * @orig_flags: The originally requested get_block flags + */ +static inline void ext4_extsize_reset_map(struct ext4_map_blocks *map, + int *flags, ext4_lblk_t orig_mlblk, + unsigned int orig_mlen, + int orig_flags) +{ + /* Drop the extsize hint from original flags */ + *flags =3D orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE; + map->m_lblk =3D orig_mlblk; + map->m_len =3D orig_mlen; + map->m_flags =3D 0; +} + /* * The ext4_map_blocks() function tries to look up the requested blocks, * and returns if the blocks are already mapped. @@ -594,31 +702,111 @@ static int ext4_map_create_blocks(handle_t *handle, = struct inode *inode, * It returns the error in case of allocation failure. 
*/ int ext4_map_blocks(handle_t *handle, struct inode *inode, - struct ext4_map_blocks *map, int flags) + struct ext4_map_blocks *orig_map, int flags) { struct extent_status es; int retval; int ret =3D 0; + + ext4_lblk_t orig_mlblk, extsize_mlblk; + unsigned int orig_mlen, extsize_mlen; + int orig_flags; + + struct ext4_map_blocks *map =3D NULL; + struct ext4_map_blocks extsize_map =3D {0}; + + __u32 extsize =3D ext4_inode_get_extsize(EXT4_I(inode)); + bool should_extsize =3D false; + #ifdef ES_AGGRESSIVE_TEST - struct ext4_map_blocks orig_map; + struct ext4_map_blocks test_map; =20 - memcpy(&orig_map, map, sizeof(*map)); + memcpy(&test_map, map, sizeof(*map)); #endif =20 - map->m_flags =3D 0; - ext_debug(inode, "flag 0x%x, max_blocks %u, logical block %lu\n", - flags, map->m_len, (unsigned long) map->m_lblk); + orig_map->m_flags =3D 0; + ext_debug(inode, "flag 0x%x, max_blocks %u, logical block %lu\n", flags, + orig_map->m_len, (unsigned long)orig_map->m_lblk); =20 /* * ext4_map_blocks returns an int, and m_len is an unsigned int */ - if (unlikely(map->m_len > INT_MAX)) - map->m_len =3D INT_MAX; + if (unlikely(orig_map->m_len > INT_MAX)) + orig_map->m_len =3D INT_MAX; =20 /* We can handle the block number less than EXT_MAX_BLOCKS */ - if (unlikely(map->m_lblk >=3D EXT_MAX_BLOCKS)) + if (unlikely(orig_map->m_lblk >=3D EXT_MAX_BLOCKS)) return -EFSCORRUPTED; =20 + orig_mlblk =3D orig_map->m_lblk; + orig_mlen =3D orig_map->m_len; + orig_flags =3D flags; + +set_map: + should_extsize =3D (extsize && (flags & EXT4_GET_BLOCKS_CREATE) && + (flags & EXT4_GET_BLOCKS_EXTSIZE)); + if (should_extsize) { + /* + * We adjust the extent size here but we still return the + * original lblk and len while returning to keep the behavior + * compatible. + */ + int len, align; + /* + * NOTE: Should we import EXT_UNWRITTEN_MAX_LEN from + * ext4_extents.h here? 
+ */ + int max_unwrit_len =3D ((1UL << 15) - 1); + loff_t end; + + align =3D orig_map->m_lblk % extsize; + len =3D orig_map->m_len + align; + + extsize_map.m_lblk =3D orig_map->m_lblk - align; + extsize_map.m_len =3D + max_t(unsigned int, roundup_pow_of_two(len), extsize); + + /* + * For now allocations beyond EOF don't use extsize hints so + * that we can avoid dealing with extra blocks allocated past + * EOF. We have inode lock since extsize allocations are + * non-delalloc so i_size can be accessed safely + */ + end =3D (extsize_map.m_lblk + (loff_t)extsize_map.m_len) << inode->i_blk= bits; + if (end > inode->i_size) { + flags =3D orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE; + goto set_map; + } + + /* Fallback to normal allocation if we go beyond max len */ + if (extsize_map.m_len >=3D max_unwrit_len) { + flags =3D orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE; + goto set_map; + } + + /* + * We are allocating more than requested. We'll have to convert + * the extent to unwritten and then convert only the part + * requested to written. For now we are using the same flow as + * dioread nolock to achieve this. 
Hence the caller has to pass
+	 * CREATE_UNWRIT with EXTSIZE
+	 */
+	if (!(flags & EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT)) {
+		WARN_ON(true);
+
+		/* Fallback to non-extsize allocation */
+		flags = orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE;
+		goto set_map;
+	}
+
+	extsize_mlblk = extsize_map.m_lblk;
+	extsize_mlen = extsize_map.m_len;
+
+	extsize_map.m_flags = orig_map->m_flags;
+	map = &extsize_map;
+	} else
+		map = orig_map;
+
 	/* Lookup extent status tree firstly */
 	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) &&
 	    ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
@@ -648,7 +836,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			return retval;
 #ifdef ES_AGGRESSIVE_TEST
 			ext4_map_blocks_es_recheck(handle, inode, map,
-						   &orig_map, flags);
+						   &test_map, flags);
 #endif
 			goto found;
 		}
@@ -664,19 +852,60 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	 * file system block.
 	 */
 	down_read(&EXT4_I(inode)->i_data_sem);
-	retval = ext4_map_query_blocks(handle, inode, map);
+	if (should_extsize) {
+		BUG_ON(map != &extsize_map);
+		retval = ext4_map_query_blocks(handle, inode, orig_map, map,
+					       should_extsize);
+	} else {
+		BUG_ON(map != orig_map);
+		retval = ext4_map_query_blocks(handle, inode, map, NULL,
+					       should_extsize);
+	}
 	up_read((&EXT4_I(inode)->i_data_sem));
 
 found:
 	if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
 		ret = check_block_validity(inode, map);
-		if (ret != 0)
+		if (ret != 0) {
+			ext4_error_adjust_map(map, orig_map);
 			return ret;
+		}
 	}
 
 	/* If it is only a block(s) look up */
-	if ((flags & EXT4_GET_BLOCKS_CREATE) == 0)
+	if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
+		BUG_ON(flags & EXT4_GET_BLOCKS_EXTSIZE);
 		return retval;
+	}
+
+	/* Handle some special cases when extsize based allocation is needed */
+	if (retval >= 0 && flags & EXT4_GET_BLOCKS_EXTSIZE) {
+		bool orig_in_range =
+			in_range(orig_mlblk, (__u64)map->m_lblk, map->m_len);
+		/*
+		 * Special case: if the extsize range is
mapped already and + * covers the original start, we return it. + */ + if (map->m_flags & EXT4_MAP_MAPPED && orig_in_range) { + /* + * We don't use EXTSIZE with CONVERT_UNWRITTEN so + * we can directly return the written extent + */ + return ext4_extsize_adjust_map(map, orig_map, &extsize_map, flags); + } + + /* + * Fallback case: if the found mapping (or hole) doesn't cover + * the extsize required, then just fall back to normal + * allocation to keep things simple. + */ + + if (map->m_lblk !=3D extsize_mlblk || + map->m_len !=3D extsize_mlen) { + flags =3D orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE; + goto set_map; + } + } =20 /* * Returns if the blocks have already allocated @@ -700,12 +929,22 @@ int ext4_map_blocks(handle_t *handle, struct inode *i= node, * with create =3D=3D 1 flag. */ down_write(&EXT4_I(inode)->i_data_sem); - retval =3D ext4_map_create_blocks(handle, inode, map, flags); + if (should_extsize) { + BUG_ON(map !=3D &extsize_map); + retval =3D ext4_map_create_blocks(handle, inode, orig_map, map, flags, + should_extsize); + } else { + BUG_ON(map !=3D orig_map); + retval =3D ext4_map_create_blocks(handle, inode, map, NULL, flags, + should_extsize); + } up_write((&EXT4_I(inode)->i_data_sem)); if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) { ret =3D check_block_validity(inode, map); - if (ret !=3D 0) + if (ret !=3D 0) { + ext4_error_adjust_map(map, orig_map); return ret; + } =20 /* * Inodes with freshly allocated blocks where contents will be @@ -727,16 +966,38 @@ int ext4_map_blocks(handle_t *handle, struct inode *i= node, else ret =3D ext4_jbd2_inode_add_write(handle, inode, start_byte, length); - if (ret) + if (ret) { + ext4_error_adjust_map(map, orig_map); return ret; + } } } if (retval > 0 && (map->m_flags & EXT4_MAP_UNWRITTEN || map->m_flags & EXT4_MAP_MAPPED)) ext4_fc_track_range(handle, inode, map->m_lblk, map->m_lblk + map->m_len - 1); - if (retval < 0) + + if (retval > 0 && flags & EXT4_GET_BLOCKS_EXTSIZE) { + /* + * In the rare case that we 
have a short allocation and orig + * lblk doesn't lie in mapped range just try to retry with + * actual allocation. This is not ideal but this should be an + * edge case near ENOSPC. + * + * NOTE: This has a side effect that blocks are allocated but + * not used. Can we avoid that? + */ + if (!in_range(orig_mlblk, (__u64)map->m_lblk, map->m_len)) { + flags =3D orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE; + goto set_map; + } + return ext4_extsize_adjust_map(map, orig_map, &extsize_map, flags); + } + + if (retval < 0) { + ext4_error_adjust_map(map, orig_map); ext_debug(inode, "failed with err %d\n", retval); + } return retval; } =20 @@ -772,18 +1033,20 @@ static int _ext4_get_block(struct inode *inode, sect= or_t iblock, { struct ext4_map_blocks map; int ret =3D 0; + unsigned int orig_mlen =3D bh->b_size >> inode->i_blkbits; =20 if (ext4_has_inline_data(inode)) return -ERANGE; =20 map.m_lblk =3D iblock; - map.m_len =3D bh->b_size >> inode->i_blkbits; + map.m_len =3D orig_mlen; =20 ret =3D ext4_map_blocks(ext4_journal_current_handle(), inode, &map, flags); if (ret > 0) { map_bh(bh, inode->i_sb, map.m_pblk); ext4_update_bh_state(bh, map.m_flags); + WARN_ON(map.m_len !=3D orig_mlen); bh->b_size =3D inode->i_sb->s_blocksize * map.m_len; ret =3D 0; } else if (ret =3D=3D 0) { @@ -809,11 +1072,14 @@ int ext4_get_block_unwritten(struct inode *inode, se= ctor_t iblock, struct buffer_head *bh_result, int create) { int ret =3D 0; + int flags =3D EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT; + + if (ext4_should_use_extsize(inode)) + flags |=3D EXT4_GET_BLOCKS_EXTSIZE; =20 ext4_debug("ext4_get_block_unwritten: inode %lu, create flag %d\n", inode->i_ino, create); - ret =3D _ext4_get_block(inode, iblock, bh_result, - EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT); + ret =3D _ext4_get_block(inode, iblock, bh_result, flags); =20 /* * If the buffer is marked unwritten, mark it as new to make sure it is @@ -1164,7 +1430,8 @@ static int ext4_write_begin(struct file *file, struct= address_space *mapping, from =3D 
pos & (PAGE_SIZE - 1); to =3D from + len; =20 - if (ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)) { + if (!ext4_should_use_extsize(inode) && + ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)) { ret =3D ext4_try_to_write_inline_data(mapping, inode, pos, len, foliop); if (ret < 0) @@ -1212,7 +1479,7 @@ static int ext4_write_begin(struct file *file, struct= address_space *mapping, /* In case writeback began while the folio was unlocked */ folio_wait_stable(folio); =20 - if (ext4_should_dioread_nolock(inode)) + if (ext4_should_use_unwrit_extents(inode)) ret =3D ext4_block_write_begin(handle, folio, pos, len, ext4_get_block_unwritten); else @@ -1802,7 +2069,7 @@ static int ext4_da_map_blocks(struct inode *inode, st= ruct ext4_map_blocks *map) if (ext4_has_inline_data(inode)) retval =3D 0; else - retval =3D ext4_map_query_blocks(NULL, inode, map); + retval =3D ext4_map_query_blocks(NULL, inode, map, NULL, false); up_read(&EXT4_I(inode)->i_data_sem); if (retval) return retval < 0 ? retval : 0; @@ -1825,7 +2092,7 @@ static int ext4_da_map_blocks(struct inode *inode, st= ruct ext4_map_blocks *map) goto found; } } else if (!ext4_has_inline_data(inode)) { - retval =3D ext4_map_query_blocks(NULL, inode, map); + retval =3D ext4_map_query_blocks(NULL, inode, map, NULL, false); if (retval) { up_write(&EXT4_I(inode)->i_data_sem); return retval < 0 ? 
retval : 0; @@ -2199,6 +2466,7 @@ static int mpage_map_one_extent(handle_t *handle, str= uct mpage_da_data *mpd) struct ext4_map_blocks *map =3D &mpd->map; int get_blocks_flags; int err, dioread_nolock; + int extsize =3D ext4_should_use_extsize(inode); =20 trace_ext4_da_write_pages_extent(inode, map); /* @@ -2217,11 +2485,14 @@ static int mpage_map_one_extent(handle_t *handle, s= truct mpage_da_data *mpd) dioread_nolock =3D ext4_should_dioread_nolock(inode); if (dioread_nolock) get_blocks_flags |=3D EXT4_GET_BLOCKS_IO_CREATE_EXT; + if (extsize) + get_blocks_flags |=3D EXT4_GET_BLOCKS_PRE_IO; =20 err =3D ext4_map_blocks(handle, inode, map, get_blocks_flags); if (err < 0) return err; - if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN)) { + if ((extsize || dioread_nolock) && + (map->m_flags & EXT4_MAP_UNWRITTEN)) { if (!mpd->io_submit.io_end->handle && ext4_handle_valid(handle)) { mpd->io_submit.io_end->handle =3D handle->h_rsv_handle; @@ -2643,10 +2914,11 @@ static int ext4_do_writepages(struct mpage_da_data = *mpd) } mpd->journalled_more_data =3D 0; =20 - if (ext4_should_dioread_nolock(inode)) { + if (ext4_should_use_unwrit_extents(inode)) { /* - * We may need to convert up to one extent per block in - * the page and we may dirty the inode. + * For extsize allocation or dioread_nolock, we may need to + * convert up to one extent per block in the page and we may + * dirty the inode. 
*/ rsv_blocks =3D 1 + ext4_chunk_trans_blocks(inode, PAGE_SIZE >> inode->i_blkbits); @@ -2924,7 +3196,8 @@ static int ext4_da_write_begin(struct file *file, str= uct address_space *mapping, =20 index =3D pos >> PAGE_SHIFT; =20 - if (ext4_nonda_switch(inode->i_sb) || ext4_verity_in_progress(inode)) { + if (ext4_nonda_switch(inode->i_sb) || ext4_verity_in_progress(inode) || + ext4_should_use_extsize(inode)) { *fsdata =3D (void *)FALL_BACK_TO_NONDELALLOC; return ext4_write_begin(file, mapping, pos, len, foliop, fsdata); @@ -3371,12 +3644,19 @@ static int ext4_iomap_alloc(struct inode *inode, st= ruct ext4_map_blocks *map, * can complete at any point during the I/O and subsequently push the * i_disksize out to i_size. This could be beyond where direct I/O is * happening and thus expose allocated blocks to direct I/O reads. + * + * NOTE for extsize hints: We only support it for writes inside + * EOF (for now) to not have to deal with blocks past EOF */ else if (((loff_t)map->m_lblk << blkbits) >=3D i_size_read(inode)) m_flags =3D EXT4_GET_BLOCKS_CREATE; - else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) + else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) { m_flags =3D EXT4_GET_BLOCKS_IO_CREATE_EXT; =20 + if (ext4_should_use_extsize(inode)) + m_flags |=3D EXT4_GET_BLOCKS_EXTSIZE; + } + ret =3D ext4_map_blocks(handle, inode, map, m_flags); =20 /* @@ -6270,7 +6550,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf) } folio_unlock(folio); /* OK, we need to fill the hole... 
 */
-	if (ext4_should_dioread_nolock(inode))
+	if (ext4_should_use_unwrit_extents(inode))
 		get_block = ext4_get_block_unwritten;
 	else
 		get_block = ext4_get_block;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 79cc4224fbbd..d9464ee764af 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -50,6 +50,7 @@ struct partial_cluster;
	{ EXT4_GET_BLOCKS_CONVERT_UNWRITTEN,	"CONVERT_UNWRITTEN" },	\
	{ EXT4_GET_BLOCKS_ZERO,			"ZERO" },		\
	{ EXT4_GET_BLOCKS_IO_SUBMIT,		"IO_SUBMIT" },		\
+	{ EXT4_GET_BLOCKS_EXTSIZE,		"EXTSIZE" },		\
	{ EXT4_EX_NOCACHE,			"EX_NOCACHE" })
 
 /*
-- 
2.48.1

From nobody Fri Dec 19 04:15:07 2025
From: Ojaswin Mujoo
To: linux-ext4@vger.kernel.org, "Theodore Ts'o"
Cc: John Garry, dchinner@redhat.com, "Darrick J. Wong",
    Ritesh Harjani, linux-kernel@vger.kernel.org
Subject: [RFC v3 06/11] ext4: make extsize work with EOF allocations
Date: Mon, 24 Mar 2025 13:07:04 +0530

Make extsize hints work with EOF allocations. We deviate from XFS here:
in case we have blocks left past EOF, we don't truncate them. There are
2 main reasons:

1. Since the user is opting for extsize allocations, chances are that
   they will use the blocks in the future.

2. If we start truncating all EOF blocks in ext4_release_file like XFS,
   then we will always truncate blocks, even ones that were
   intentionally preallocated using fallocate with KEEP_SIZE, which
   might confuse users. This is mainly because ext4 has no way to
   distinguish whether the blocks beyond EOF were allocated
   intentionally. We could work around this with an on-disk inode flag
   like XFS's XFS_DIFLAG_PREALLOC, but that would be overkill. It's
   much simpler to just let the EOF blocks stick around.
NOTE: One thing that changes in this patch is that for direct I/O we
need to pass EXT4_GET_BLOCKS_IO_CREATE_EXT even if we are allocating
beyond i_size.

Signed-off-by: Ojaswin Mujoo
---
 fs/ext4/inode.c | 22 ++++++----------------
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 53724b7cb9e0..bf19b9f99cea 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -757,7 +757,6 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	 * ext4_extents.h here?
 	 */
 	int max_unwrit_len = ((1UL << 15) - 1);
-	loff_t end;
 
 	align = orig_map->m_lblk % extsize;
 	len = orig_map->m_len + align;
@@ -766,18 +765,6 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	extsize_map.m_len = max_t(unsigned int,
 				  roundup_pow_of_two(len), extsize);
 
-	/*
-	 * For now allocations beyond EOF don't use extsize hints so
-	 * that we can avoid dealing with extra blocks allocated past
-	 * EOF. We have inode lock since extsize allocations are
-	 * non-delalloc so i_size can be accessed safely
-	 */
-	end = (extsize_map.m_lblk + (loff_t)extsize_map.m_len) << inode->i_blkbits;
-	if (end > inode->i_size) {
-		flags = orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE;
-		goto set_map;
-	}
-
 	/* Fallback to normal allocation if we go beyond max len */
 	if (extsize_map.m_len >= max_unwrit_len) {
 		flags = orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE;
@@ -3645,10 +3632,13 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 	 * i_disksize out to i_size. This could be beyond where direct I/O is
 	 * happening and thus expose allocated blocks to direct I/O reads.
 	 *
-	 * NOTE for extsize hints: We only support it for writes inside
-	 * EOF (for now) to not have to deal with blocks past EOF
+	 * NOTE: For extsize hint based EOF allocations, we still need the
+	 * IO_CREATE_EXT flag because we will be allocating more than the
+	 * write, hence the extra blocks need to be marked unwritten and
+	 * split before the I/O.
 	 */
-	else if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
+	else if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode) &&
+		 !ext4_should_use_extsize(inode))
 		m_flags = EXT4_GET_BLOCKS_CREATE;
 	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
 		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
-- 
2.48.1

From nobody Fri Dec 19 04:15:07 2025
From: Ojaswin Mujoo
To: linux-ext4@vger.kernel.org, "Theodore Ts'o"
Cc: John Garry, dchinner@redhat.com, "Darrick J. Wong", Ritesh Harjani, linux-kernel@vger.kernel.org
Subject: [RFC v3 07/11] ext4: add ext4_map_blocks_extsize() wrapper to handle overwrites
Date: Mon, 24 Mar 2025 13:07:05 +0530

Currently, with extsize hints, consider a scenario where the hint is
set to 16k and we do a write of (0,4k). We get the
below mapping:

    [ 4k written ][ 12k unwritten ]

Now, if we do a (4k,4k) write, ext4_map_blocks() will again try for an
extsize aligned write, adjust the range to (0,16k) and then run into
issues since the new range already has a mapping in it. Although this
does not lead to a failure, since we eventually fall back to a non
extsize allocation, it is not a good approach.

Hence, implement a wrapper over ext4_map_blocks() which detects whether
a mapping already exists for an extsize based allocation and then
reuses that mapping. In case the mapping completely covers the original
request, we simply disable extsize allocation and call
ext4_map_blocks() to correctly process the mapping and set the map
flags. Otherwise, if there is a hole or a partial mapping, we just let
ext4_map_blocks() handle the allocation.

Signed-off-by: Ojaswin Mujoo
---
 fs/ext4/inode.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf19b9f99cea..e41c97584f35 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -678,6 +678,42 @@ static inline void ext4_extsize_reset_map(struct ext4_map_blocks *map,
 	map->m_flags = 0;
 }
 
+static int ext4_map_blocks_extsize(handle_t *handle, struct inode *inode,
+				   struct ext4_map_blocks *map, int flags)
+{
+	int orig_mlen = map->m_len;
+	int ret = 0;
+	int tmp_flags;
+
+	WARN_ON(!ext4_inode_get_extsize(EXT4_I(inode)));
+	WARN_ON(!(flags & EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT));
+
+	/*
+	 * First check if there are any existing allocations
+	 */
+	ret = ext4_map_blocks(handle, inode, map, 0);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * If the present mapping fully covers the requested range, just go
+	 * for a non extsize based allocation. Note that we won't really be
+	 * allocating new blocks but the call to ext4_map_blocks is
+	 * important to ensure things like extent splitting and proper map
+	 * flags are taken care of. For all other cases, just let
+	 * ext4_map_blocks handle the allocations
+	 */
+	if (ret > 0 && map->m_len == orig_mlen)
+		tmp_flags = flags & ~(EXT4_GET_BLOCKS_EXTSIZE |
+				      EXT4_GET_BLOCKS_FORCEALIGN);
+	else
+		tmp_flags = flags;
+
+	ret = ext4_map_blocks(handle, inode, map, tmp_flags);
+
+	return ret;
+}
+
 /*
  * The ext4_map_blocks() function tries to look up the requested blocks,
  * and returns if the blocks are already mapped.
@@ -1028,8 +1064,12 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 	map.m_lblk = iblock;
 	map.m_len = orig_mlen;
 
-	ret = ext4_map_blocks(ext4_journal_current_handle(), inode, &map,
-			      flags);
+	if ((flags & EXT4_GET_BLOCKS_CREATE) && ext4_should_use_extsize(inode))
+		ret = ext4_map_blocks_extsize(ext4_journal_current_handle(),
+					      inode, &map, flags);
+	else
+		ret = ext4_map_blocks(ext4_journal_current_handle(), inode,
+				      &map, flags);
 	if (ret > 0) {
 		map_bh(bh, inode->i_sb, map.m_pblk);
 		ext4_update_bh_state(bh, map.m_flags);
@@ -3647,7 +3687,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 		m_flags |= EXT4_GET_BLOCKS_EXTSIZE;
 	}
 
-	ret = ext4_map_blocks(handle, inode, map, m_flags);
+	if (ext4_should_use_extsize(inode))
+		ret = ext4_map_blocks_extsize(handle, inode, map, m_flags);
+	else
+		ret = ext4_map_blocks(handle, inode, map, m_flags);
 
 	/*
 	 * We cannot fill holes in indirect tree based inodes as that could
-- 
2.48.1

From nobody Fri Dec 19 04:15:07 2025
From: Ojaswin Mujoo
To: linux-ext4@vger.kernel.org, "Theodore Ts'o"
Cc: John Garry, dchinner@redhat.com, "Darrick J. Wong", Ritesh Harjani, linux-kernel@vger.kernel.org
Subject: [RFC v3 08/11] ext4: add forcealign support to mballoc
Date: Mon, 24 Mar 2025 13:07:06 +0530
Message-ID: <18c0e6352a9d20bea37447e6500ecec4cae73614.1742800203.git.ojaswin@linux.ibm.com>

Introduce an EXT4_MB_FORCE_ALIGN flag that enforces the same behavior
as EXT4_MB_HINT_ALIGNED, except that the alignment requirement is no
longer a hint and must be respected. If the allocator cannot return
aligned blocks, ENOSPC is returned. This will eventually be used to
guarantee aligned blocks for H/W accelerated atomic writes.
Signed-off-by: Ojaswin Mujoo
---
 fs/ext4/ext4.h              |  2 ++
 fs/ext4/mballoc.c           | 31 +++++++++++++++++++++++--------
 include/trace/events/ext4.h |  1 +
 3 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ab4f10f9031a..9b9d7a354736 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -224,6 +224,8 @@ enum criteria {
 #define EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED	0x00020000
 /* mballoc will try to align physical start to length (aka natural alignment) */
 #define EXT4_MB_HINT_ALIGNED		0x40000
+/* Same as HINT_ALIGNED but fail allocation if alignment can't be guaranteed */
+#define EXT4_MB_FORCE_ALIGN		0x80000
 
 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index db7c593873a9..412aa80bc6e7 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2872,12 +2872,21 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	ac->ac_criteria = cr;
 
 	if (ac->ac_criteria > CR_POWER2_ALIGNED &&
-	    ac->ac_flags & EXT4_MB_HINT_ALIGNED && ac->ac_g_ex.fe_len > 1) {
-		ext4_warning_inode(
-			ac->ac_inode,
-			"Aligned allocation not possible, using unaligned allocation");
-		ac->ac_flags &= ~EXT4_MB_HINT_ALIGNED;
+	    ac->ac_g_ex.fe_len > 1) {
+		if (ac->ac_flags & EXT4_MB_FORCE_ALIGN) {
+			ext4_warning_inode(
+				ac->ac_inode,
+				"Aligned allocation not possible, failing allocation");
+			ac->ac_status = AC_STATUS_BREAK;
+			goto exit;
+		}
+
+		if (ac->ac_flags & EXT4_MB_HINT_ALIGNED) {
+			ext4_warning_inode(
+				ac->ac_inode,
+				"Aligned allocation not possible, using unaligned allocation");
+			ac->ac_flags &= ~EXT4_MB_HINT_ALIGNED;
+		}
 	}
 
 	/*
@@ -3023,9 +3032,15 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			goto exit;
 		}
 
-		WARN_ON_ONCE(!is_power_of_2(len));
-		WARN_ON_ONCE(start % len);
-		WARN_ON_ONCE(ac->ac_b_ex.fe_len < ac->ac_o_ex.fe_len);
+		if (WARN_ON_ONCE(!is_power_of_2(len)) ||
+		    WARN_ON_ONCE(start % len) ||
+		    WARN_ON_ONCE(ac->ac_b_ex.fe_len < ac->ac_o_ex.fe_len)) {
+			/* FORCE_ALIGN should error out if aligned blocks can't be found */
+			if (ac->ac_flags & EXT4_MB_FORCE_ALIGN) {
+				ac->ac_status = AC_STATUS_BREAK;
+				goto exit;
+			}
+		}
 	}
 
 exit:
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index d9464ee764af..ebc1fb5ad57b 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -37,6 +37,7 @@ struct partial_cluster;
 	{ EXT4_MB_USE_ROOT_BLOCKS,	"USE_ROOT_BLKS" },	\
 	{ EXT4_MB_USE_RESERVED,		"USE_RESV" },		\
 	{ EXT4_MB_HINT_ALIGNED,		"HINT_ALIGNED" },	\
+	{ EXT4_MB_FORCE_ALIGN,		"FORCE_ALIGN" },	\
 	{ EXT4_MB_STRICT_CHECK,		"STRICT_CHECK" })
 
 #define show_map_flags(flags) __print_flags(flags, "|",	\
-- 
2.48.1

From nobody Fri Dec 19 04:15:07 2025
From: Ojaswin Mujoo
To: linux-ext4@vger.kernel.org, "Theodore Ts'o"
Cc: John Garry, dchinner@redhat.com, "Darrick J. Wong", Ritesh Harjani, linux-kernel@vger.kernel.org
Subject: [RFC v3 09/11] ext4: add forcealign support to ext4_map_blocks
Date: Mon, 24 Mar 2025 13:07:07 +0530
Message-ID: <003d7711c71a8f515687045f5b74fe045c0f01d1.1742800203.git.ojaswin@linux.ibm.com>

Introduce EXT4_GET_BLOCKS_FORCEALIGN, which works with
EXT4_GET_BLOCKS_EXTSIZE and guarantees that the extent returned by the
allocator is physically as well as logically aligned to the extsize
hint set on the inode.
This feature will be used to guarantee aligned allocations for H/W
accelerated atomic writes.

Signed-off-by: Ojaswin Mujoo
---
 fs/ext4/ext4.h    |  1 +
 fs/ext4/extents.c | 24 ++++++++++++++++++++++--
 fs/ext4/inode.c   | 43 ++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9b9d7a354736..a7429797c1d2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -730,6 +730,7 @@ enum {
 /* Caller is in the atomic contex, find extent if it has been cached */
 #define EXT4_GET_BLOCKS_CACHED_NOWAIT		0x0800
 #define EXT4_GET_BLOCKS_EXTSIZE			0x1000
+#define EXT4_GET_BLOCKS_FORCEALIGN		0x2000
 
 /*
  * The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index a86cc3e76f14..25c1368b49bb 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4414,13 +4414,27 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		ar.flags |= EXT4_MB_USE_RESERVED;
 	if (flags & EXT4_GET_BLOCKS_EXTSIZE)
 		ar.flags |= EXT4_MB_HINT_ALIGNED;
+	if (flags & EXT4_GET_BLOCKS_FORCEALIGN) {
+		if (WARN_ON(ar.logical != map->m_lblk || ar.len != map->m_len ||
+			    !(flags & EXT4_GET_BLOCKS_EXTSIZE))) {
+			/*
+			 * This should ideally not happen but if it does then
+			 * error out
+			 */
+			err = -ENOSPC;
+			goto out;
+		}
+		ar.flags |= EXT4_MB_FORCE_ALIGN;
+	}
 	newblock = ext4_mb_new_blocks(handle, &ar, &err);
 	if (!newblock)
 		goto out;
 	allocated_clusters = ar.len;
 	ar.len = EXT4_C2B(sbi, ar.len) - offset;
-	ext_debug(inode, "allocate new block: goal %llu, found %llu/%u, requested %u\n",
-		  ar.goal, newblock, ar.len, allocated);
+	ext_debug(inode,
+		  "allocate new block: goal %llu, found %llu/%u, requested %u\n",
+		  ar.goal, newblock, ar.len, allocated);
 	if (ar.len > allocated)
 		ar.len = allocated;
 
@@ -4435,6 +4449,12 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		map->m_flags |= EXT4_MAP_UNWRITTEN;
 	}
 
+	if ((flags & EXT4_GET_BLOCKS_FORCEALIGN) &&
+	    (ar.len != map->m_len || pblk % map->m_len)) {
+		err = -ENOSPC;
+		goto insert_error;
+	}
+
 	if ((flags & EXT4_GET_BLOCKS_EXTSIZE) &&
 	    (flags & EXT4_GET_BLOCKS_PRE_IO)) {
 		/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e41c97584f35..93ab76cb4818 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -753,6 +753,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 
 	__u32 extsize = ext4_inode_get_extsize(EXT4_I(inode));
 	bool should_extsize = false;
+	bool should_forcealign = false;
 
 #ifdef ES_AGGRESSIVE_TEST
 	struct ext4_map_blocks test_map;
@@ -793,6 +794,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	 * ext4_extents.h here?
 	 */
 	int max_unwrit_len = ((1UL << 15) - 1);
+	should_forcealign = (flags & EXT4_GET_BLOCKS_FORCEALIGN);
 
 	align = orig_map->m_lblk % extsize;
 	len = orig_map->m_len + align;
@@ -802,7 +804,11 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		  max_t(unsigned int, roundup_pow_of_two(len), extsize);
 
 	/* Fallback to normal allocation if we go beyond max len */
-	if (extsize_map.m_len >= max_unwrit_len) {
+	if (WARN_ON(extsize_map.m_len >= max_unwrit_len)) {
+		if (should_forcealign)
+			/* forcealign has no fallback */
+			return -EINVAL;
+
 		flags = orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE;
 		goto set_map;
 	}
@@ -814,8 +820,10 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	 * dioread nolock to achieve this. Hence the caller has to pass
 	 * CREATE_UNWRIT with EXTSIZE
 	 */
-	if (!(flags & EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT)) {
-		WARN_ON(true);
+	if (WARN_ON(!(flags & EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT))) {
+		if (should_forcealign)
+			/* forcealign has no fallback */
+			return -EINVAL;
 
 		/* Fallback to non extsize allocation */
 		flags = orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE;
@@ -905,6 +913,29 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	if (retval >= 0 && flags & EXT4_GET_BLOCKS_EXTSIZE) {
 		bool orig_in_range = in_range(orig_mlblk, (__u64)map->m_lblk,
 					      map->m_len);
+
+		if (should_forcealign) {
+			/*
+			 * For forcealign, irrespective of whether it's a hole
+			 * or not, the mapping we got should be exactly equal
+			 * to the extsize mapping we requested since allocation
+			 * and deallocation both respect extsize. If not,
+			 * something has gone terribly wrong.
+			 */
+			if (WARN_ON((map->m_lblk != extsize_mlblk) ||
+				    (map->m_len != extsize_mlen))) {
+				ext4_error_adjust_map(map, orig_map);
+				ext4_warning(inode->i_sb,
+					     "%s: Unaligned blocks found! Disable forcealign and try again. "
+					     "requested:(%u, %u) extsize:(%u, %u) got:(%u, %u)\n",
+					     __func__, orig_mlblk, orig_mlen,
+					     extsize_mlblk, extsize_mlen,
+					     map->m_lblk, map->m_len);
+				return -EUCLEAN;
+			}
+		}
+
 		/*
 		 * Special case: if the extsize range is mapped already and
 		 * covers the original start, we return it.
@@ -925,6 +956,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 
 		if (map->m_lblk != extsize_mlblk ||
 		    map->m_len != extsize_mlen) {
+			WARN_ON(should_forcealign);
 			flags = orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE;
 			goto set_map;
 		}
@@ -1011,6 +1043,11 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	 * not used. Can we avoid that?
 	 */
 	if (!in_range(orig_mlblk, (__u64)map->m_lblk, map->m_len)) {
+		if (WARN_ON(should_forcealign)) {
+			/* this should never happen */
+			ext4_error_adjust_map(map, orig_map);
+			return -ENOSPC;
+		}
 		flags = orig_flags & ~EXT4_GET_BLOCKS_EXTSIZE;
 		goto set_map;
 	}
-- 
2.48.1

From nobody Fri Dec 19 04:15:07 2025
From: Ojaswin Mujoo
To: linux-ext4@vger.kernel.org, "Theodore Ts'o"
Cc: John Garry, dchinner@redhat.com, "Darrick J. Wong", Ritesh Harjani,
 linux-kernel@vger.kernel.org
Subject: [RFC v3 10/11] ext4: add support for adding forcealign via SETXATTR ioctl
Date: Mon, 24 Mar 2025 13:07:08 +0530

With forcealign set on an inode, we should always either get an extent
physically aligned to the extsize or we should error out.
This is suitable for hardware accelerated atomic writes since it allows
us to exit early rather than sending the bio and then getting an error
from the device.

This patch adds the SET/GETXATTR ioctl level support to set/get this
flag. Right now, this can only be set if extsize is set on an inode.

Since we are almost out of inode flags, we reuse the unused
EXT4_EOFBLOCKS_FL.

Signed-off-by: Ojaswin Mujoo
---
 fs/ext4/ext4.h          |  5 ++-
 fs/ext4/ext4_jbd2.h     |  8 +++++
 fs/ext4/extents.c       |  7 ++++-
 fs/ext4/inode.c         | 16 ++++++++--
 fs/ext4/ioctl.c         | 69 +++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  6 ++--
 6 files changed, 104 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index a7429797c1d2..690caad50cb6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -514,6 +514,9 @@ struct flex_groups {
 #define EXT4_CASEFOLD_FL		0x40000000 /* Casefolded directory */
 #define EXT4_RESERVED_FL		0x80000000 /* reserved for ext4 lib */

+/* Extended flags, can only be set via FS_SETXATTR ioctl */
+#define EXT4_FORCEALIGN_XFL	0x00400000 /* Inode must do aligned allocation */
+
 /* User modifiable flags */
 #define EXT4_FL_USER_MODIFIABLE	(EXT4_SECRM_FL | \
					 EXT4_UNRM_FL | \
@@ -528,7 +531,6 @@ struct flex_groups {
					 EXT4_DIRSYNC_FL | \
					 EXT4_TOPDIR_FL | \
					 EXT4_EXTENTS_FL | \
-					 0x00400000 /* EXT4_EOFBLOCKS_FL */ | \
					 EXT4_DAX_FL | \
					 EXT4_PROJINHERIT_FL | \
					 EXT4_CASEFOLD_FL)
@@ -605,6 +607,7 @@ enum {
	EXT4_INODE_VERITY	= 20,	/* Verity protected inode */
	EXT4_INODE_EA_INODE	= 21,	/* Inode used for large EA */
	/* 22 was formerly EXT4_INODE_EOFBLOCKS */
+	EXT4_INODE_FORCEALIGN	= 22,	/* Inode should do aligned allocation */
	EXT4_INODE_DAX		= 25,	/* Inode is DAX */
	EXT4_INODE_INLINE_DATA	= 28,	/* Data in inode. */
	EXT4_INODE_PROJINHERIT	= 29,	/* Create with parents projid */
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 53b930f6c797..f88149ff0033 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -467,6 +467,14 @@ static inline int ext4_should_use_extsize(struct inode *inode)
	return (ext4_inode_get_extsize(EXT4_I(inode)) > 0);
 }

+static inline int ext4_should_use_forcealign(struct inode *inode)
+{
+	if (!ext4_should_use_extsize(inode))
+		return 0;
+
+	return (ext4_test_inode_flag(inode, EXT4_INODE_FORCEALIGN));
+}
+
 static inline int ext4_should_use_unwrit_extents(struct inode *inode)
 {
	return (ext4_should_dioread_nolock(inode) ||
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 25c1368b49bb..1835e18f0eef 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4855,9 +4855,14 @@ static long ext4_do_fallocate(struct file *file, loff_t offset,
	}

	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
-	if (ext4_should_use_extsize(inode))
+	if (ext4_should_use_extsize(inode)) {
		flags |= EXT4_GET_BLOCKS_EXTSIZE;

+		if (ext4_should_use_forcealign(inode)) {
+			flags |= EXT4_GET_BLOCKS_FORCEALIGN;
+		}
+	}
+
	ret = ext4_alloc_file_blocks(file, start_lblk, len_lblk, new_size,
				     flags);
	if (ret)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 93ab76cb4818..5b36e62872d6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -922,7 +922,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
			 * deallocation both respect extsize. If
			 * not, something has gone terribly wrong.
			 */
-			if (WARN_ON((map->m_lblk != extsize_mlblk) ||
+			if (WARN_ON_ONCE((map->m_lblk != extsize_mlblk) ||
				    (map->m_len != extsize_mlen))) {
				ext4_error_adjust_map(map, orig_map);
				ext4_warning(
@@ -1138,9 +1138,14 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
	int ret = 0;
	int flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;

-	if (ext4_should_use_extsize(inode))
+	if (ext4_should_use_extsize(inode)) {
		flags |= EXT4_GET_BLOCKS_EXTSIZE;

+		if (ext4_should_use_forcealign(inode)) {
+			flags |= EXT4_GET_BLOCKS_FORCEALIGN;
+		}
+	}
+
	ext4_debug("ext4_get_block_unwritten: inode %lu, create flag %d\n",
		   inode->i_ino, create);
	ret = _ext4_get_block(inode, iblock, bh_result, flags);
@@ -3720,8 +3725,13 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;

-		if (ext4_should_use_extsize(inode))
+		if (ext4_should_use_extsize(inode)) {
			m_flags |= EXT4_GET_BLOCKS_EXTSIZE;
+
+			if (ext4_should_use_forcealign(inode)) {
+				m_flags |= EXT4_GET_BLOCKS_FORCEALIGN;
+			}
+		}
	}

	if (ext4_should_use_extsize(inode))
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 48f62d7c27e6..5c3cdbe17e2b 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -795,6 +795,67 @@ static int ext4_ioctl_setextsize(struct inode *inode, u32 extsize, u32 xflags)
	return err;
 }

+/*
+ * If forcealign == 0 then the caller wants to unset it.
+ */
+static int ext4_ioctl_setforcealign(struct inode *inode, bool forcealign)
+{
+	int err = 0;
+	char *msg = NULL;
+	handle_t *handle;
+
+	bool has_forcealign = ext4_test_inode_flag(inode, EXT4_INODE_FORCEALIGN);
+	bool set_forcealign = (forcealign && !has_forcealign);
+	bool unset_forcealign = (!forcealign && has_forcealign);
+
+	bool modify_forcealign = (set_forcealign || unset_forcealign);
+	if (!modify_forcealign)
+		return 0;
+
+	if (set_forcealign && !ext4_inode_get_extsize(EXT4_I(inode))) {
+		msg = "forcealign can't be used without extsize set";
+		err = -EINVAL;
+		goto error;
+	}
+
+	handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
+	if (IS_ERR(handle)) {
+		err = PTR_ERR(handle);
+		goto error;
+	}
+
+	struct ext4_iloc iloc;
+	err = ext4_reserve_inode_write(handle, inode, &iloc);
+	if (err < 0)
+		goto error_journal;
+
+	if (set_forcealign) {
+		ext4_set_inode_flag(inode, EXT4_INODE_FORCEALIGN);
+	} else if (unset_forcealign)
+		ext4_clear_inode_flag(inode, EXT4_INODE_FORCEALIGN);
+
+	inode_set_ctime_current(inode);
+	inode_inc_iversion(inode);
+
+	err = ext4_mark_iloc_dirty(handle, inode, &iloc);
+	if (err < 0)
+		goto error_journal;
+
+	err = ext4_journal_stop(handle);
+	if (err < 0)
+		goto error;
+
+	return 0;
+error_journal:
+	if (handle)
+		ext4_journal_stop(handle);
+error:
+	if (msg)
+		ext4_warning_inode(inode, "%s\n", msg);
+
+	return err;
+}
+
 #ifdef CONFIG_QUOTA
 static int ext4_ioctl_setproject(struct inode *inode, __u32 projid)
 {
@@ -1088,6 +1149,9 @@ int ext4_fileattr_get(struct dentry *dentry, struct fileattr *fa)
		fa->fsx_xflags |= FS_XFLAG_EXTSIZE;
	}

+	if (ext4_test_inode_flag(inode, EXT4_INODE_FORCEALIGN))
+		fa->fsx_xflags |= FS_XFLAG_FORCEALIGN;
+
	return 0;
 }

@@ -1144,6 +1208,11 @@ int ext4_fileattr_set(struct mnt_idmap *idmap,
			goto out;
		fa->fsx_xflags = 0;
	}
+
+	err = ext4_ioctl_setforcealign(inode,
+			(fa->fsx_xflags & FS_XFLAG_FORCEALIGN));
+	if (err)
+		goto out;
 out:
	return err;
 }

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 2bbe00cf1248..944fa77ce18e 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -167,7 +167,9 @@ struct fsxattr {
 #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
 #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
 #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
-#define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this */
+/* data extent mappings for regular files must be aligned to extent size hint */
+#define FS_XFLAG_FORCEALIGN	0x00020000
+#define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this */

 /* the read-only stuff doesn't really belong here, but any other place is
    probably as bad and I don't want to create yet another include file. */
@@ -295,7 +297,7 @@ struct fsxattr {
 #define FS_EXTENT_FL			0x00080000 /* Extents */
 #define FS_VERITY_FL			0x00100000 /* Verity protected inode */
 #define FS_EA_INODE_FL			0x00200000 /* Inode used for large EA */
-#define FS_EOFBLOCKS_FL			0x00400000 /* Reserved for ext4 */
+/* Was previously FS_EOFBLOCKS_FL (reserved for ext4) */
 #define FS_NOCOW_FL			0x00800000 /* Do not cow file */
 #define FS_DAX_FL			0x02000000 /* Inode is DAX */
 #define FS_INLINE_DATA_FL		0x10000000 /* Reserved for ext4 */
-- 
2.48.1

From nobody Fri Dec 19 04:15:07 2025
From: Ojaswin Mujoo
To: linux-ext4@vger.kernel.org, "Theodore Ts'o"
Cc: John Garry, dchinner@redhat.com, "Darrick J. Wong", Ritesh Harjani,
 linux-kernel@vger.kernel.org
Subject: [RFC v3 11/11] ext4: disallow unaligned deallocations on forcealign
 inodes
Date: Mon, 24 Mar 2025 13:07:09 +0530
Message-ID: <4fafc9c42962e1fdc8ce44b21fa977df5de0679e.1742800203.git.ojaswin@linux.ibm.com>

When forcealign is set, an unaligned deallocation can disturb the
alignment of the mappings of the file by introducing unaligned
holes/unwritten extents to it. This could then lead to increased
allocation failures on the file in the future.

To avoid this, disallow deallocations or extent shifting operations
(like insert range) unless they are aligned to extsize as well.

Note that this is a relatively strict approach which can be relaxed a
bit more in the future by using partial zeroing tricks etc.
Signed-off-by: Ojaswin Mujoo
---
 fs/ext4/extents.c | 36 ++++++++++++++++++++++++++++++++++++
 fs/ext4/inode.c   | 12 ++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 1835e18f0eef..1ac5bb8cbbde 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4754,6 +4754,18 @@ static long ext4_zero_range(struct file *file, loff_t offset,
		return ret;
	}

+	if (ext4_should_use_forcealign(inode)) {
+		u32 extsize_bytes = ext4_inode_get_extsize(EXT4_I(inode))
+				    << inode->i_blkbits;
+
+		if (!IS_ALIGNED(offset | end, extsize_bytes)) {
+			ext4_warning(
+				inode->i_sb,
+				"tried unaligned operation on forcealign inode");
+			return -EINVAL;
+		}
+	}
+
	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
	/* Preallocate the range including the unaligned edges */
	if (!IS_ALIGNED(offset | end, blocksize)) {
@@ -5473,6 +5485,18 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
	if (end >= inode->i_size)
		return -EINVAL;

+	if (ext4_should_use_forcealign(inode)) {
+		u32 extsize_bytes = ext4_inode_get_extsize(EXT4_I(inode))
+				    << inode->i_blkbits;
+
+		if (!IS_ALIGNED(offset | end, extsize_bytes)) {
+			ext4_warning(
+				inode->i_sb,
+				"tried unaligned operation on forcealign inode");
+			return -EINVAL;
+		}
+	}
+
	/*
	 * Write tail of the last page before removed range and data that
	 * will be shifted since they will get removed from the page cache
@@ -5573,6 +5597,18 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len)
	if (len > inode->i_sb->s_maxbytes - inode->i_size)
		return -EFBIG;

+	if (ext4_should_use_forcealign(inode)) {
+		u32 extsize_bytes = ext4_inode_get_extsize(EXT4_I(inode))
+				    << inode->i_blkbits;
+
+		if (!IS_ALIGNED(offset | (offset + len), extsize_bytes)) {
+			ext4_warning(
+				sb,
+				"tried unaligned operation on forcealign inode");
+			return -EINVAL;
+		}
+	}
+
	/*
	 * Write out all dirty pages. Need to round down to align start offset
	 * to page size boundary for page size > block size.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5b36e62872d6..4c974e461061 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4397,6 +4397,18 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
		end = max_end;
	length = end - offset;

+	if (ext4_should_use_forcealign(inode)) {
+		u32 extsize_bytes = ext4_inode_get_extsize(EXT4_I(inode))
+				    << inode->i_blkbits;
+
+		if (!IS_ALIGNED(offset | end, extsize_bytes)) {
+			ext4_warning(
+				sb,
+				"tried unaligned operation on forcealign inode");
+			return -EINVAL;
+		}
+	}
+
	/*
	 * Attach jinode to inode for jbd2 if we do any zeroing of partial
	 * block.
-- 
2.48.1