From nobody Thu Apr  9 09:34:23 2026
From: Ojaswin Mujoo
To: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Cc: djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org,
	hch@lst.de, ritesh.list@gmail.com, jack@suse.cz, Luis Chamberlain,
	dgc@kernel.org, tytso@mit.edu, p.raghav@samsung.com,
	andres@anarazel.de, brauner@kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: [RFC PATCH v2 1/5] mm: Refactor folio_clear_dirty_for_io()
Date: Thu, 9 Apr 2026 00:15:42 +0530
X-Mailer: git-send-email 2.53.0
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Add a new __folio_clear_dirty_for_io() helper which takes an extra
parameter to indicate whether folio_mkclean() is needed. This is in
preparation for buffered writethrough support, where we already do
folio_mkclean() before calling into this function.

Co-developed-by: Ritesh Harjani (IBM)
Signed-off-by: Ritesh Harjani (IBM)
Signed-off-by: Ojaswin Mujoo
---
 mm/page-writeback.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 601a5e048d12..2f0c6916213d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2847,8 +2847,11 @@ EXPORT_SYMBOL(__folio_cancel_dirty);
  *
  * This incoherency between the folio's dirty flag and xarray tag is
  * unfortunate, but it only exists while the folio is locked.
+ *
+ * For some cases we might not want to do mkclean, e.g. if we've already taken
+ * care of it, hence pass the should_mkclean flag to indicate if it's needed.
  */
-bool folio_clear_dirty_for_io(struct folio *folio)
+static bool __folio_clear_dirty_for_io(struct folio *folio, bool should_mkclean)
 {
 	struct address_space *mapping = folio_mapping(folio);
 	bool ret = false;
@@ -2885,7 +2888,7 @@ bool folio_clear_dirty_for_io(struct folio *folio)
 		 * as a serialization point for all the different
 		 * threads doing their things.
 		 */
-		if (folio_mkclean(folio))
+		if (should_mkclean && folio_mkclean(folio))
 			folio_mark_dirty(folio);
 		/*
 		 * We carefully synchronise fault handlers against
@@ -2908,6 +2911,11 @@ bool folio_clear_dirty_for_io(struct folio *folio)
 	}
 	return folio_test_clear_dirty(folio);
 }
+
+bool folio_clear_dirty_for_io(struct folio *folio)
+{
+	return __folio_clear_dirty_for_io(folio, true);
+}
 EXPORT_SYMBOL(folio_clear_dirty_for_io);
 
 static void wb_inode_writeback_start(struct bdi_writeback *wb)
-- 
2.53.0

From nobody Thu Apr  9 09:34:23 2026
From: Ojaswin Mujoo
To: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Cc: djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org,
	hch@lst.de, ritesh.list@gmail.com, jack@suse.cz, Luis Chamberlain,
	dgc@kernel.org, tytso@mit.edu, p.raghav@samsung.com,
	andres@anarazel.de, brauner@kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH
Date: Thu, 9 Apr 2026 00:15:43 +0530
X-Mailer: git-send-email 2.53.0
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

This adds initial support for performing buffered non-aio RWF_WRITETHROUGH
writes. The rough flow for a writethrough write is as follows:

1. Acquire the inode lock.
2. Initialize the writethrough context (wt_ctx) and mark the mapping as
   stable.
3. Start the iomap_iter() loop. For each iomap:
   3.1. Acquire the folio and folio_lock.
   3.2. Perform the memcpy from the user buffer to the folio and mark it
        dirty.
   3.3. Wait for any current writeback to complete and then call
        folio_mkclean() to prevent mmap writes from changing it.
   3.4. Start writeback on the folio.
   3.5. Add the folio range under write to wt_ctx->bvec and folio_unlock().
   3.6. If the bvec array is full, submit the current bvecs for IO.
   3.7. Repeat 3.2 to 3.6 until the whole iomap is processed. Submit the
        final set of bvecs for IO.
4. Repeat step 3 until there is no more data to write.
5. Finally, sleep in the syscall thread until all the IOs are completed
   (refcount == 0). Once that happens, the end io handler will wake us up.
6. Upon waking up, call the fs ->end_io() callback (which updates the inode
   size), record any errors and return.
7. inode_unlock()

This design gives buffered writethrough the same semantics as dio, and any
error in the IO is directly returned to the caller.
The design deliberately open codes the IO submission and completion flow
(inspired by dio) rather than reusing the dio functions, as accommodating
the buffered writethrough logic in the dio code polluted it with too many
if-else conditionals and special cases.

Suggested-by: Jan Kara
Suggested-by: Dave Chinner
Co-developed-by: Ritesh Harjani (IBM)
Signed-off-by: Ritesh Harjani (IBM)
Signed-off-by: Ojaswin Mujoo
---
 fs/iomap/buffered-io.c  | 352 ++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h      |   7 +
 include/linux/iomap.h   |  38 +++++
 include/linux/pagemap.h |   1 +
 include/uapi/linux/fs.h |   5 +-
 mm/page-writeback.c     |   6 +
 6 files changed, 408 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e4b6886e5c3c..74e1ab108b0f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -9,6 +9,7 @@
 #include
 #include
 #include
+#include
 #include "internal.h"
 #include "trace.h"
 
@@ -1096,6 +1097,276 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
 	return __iomap_write_end(iter->inode, pos, len, copied, folio);
 }
 
+static ssize_t iomap_writethrough_complete(struct iomap_writethrough_ctx *wt_ctx)
+{
+	struct kiocb *iocb = wt_ctx->iocb;
+	struct inode *inode = wt_ctx->inode;
+	ssize_t ret = wt_ctx->error;
+
+	if (wt_ctx->dops && wt_ctx->dops->end_io) {
+		int err = wt_ctx->dops->end_io(iocb, wt_ctx->written,
+					       wt_ctx->error,
+					       wt_ctx->flags);
+		if (err)
+			ret = err;
+	}
+
+	mapping_clear_stable_writes(inode->i_mapping);
+
+	if (!ret) {
+		ret = wt_ctx->written;
+		iocb->ki_pos = wt_ctx->pos + ret;
+	}
+
+	kfree(wt_ctx);
+	return ret;
+}
+
+static void iomap_writethrough_done(struct iomap_writethrough_ctx *wt_ctx)
+{
+	struct task_struct *waiter = wt_ctx->waiter;
+
+	WRITE_ONCE(wt_ctx->waiter, NULL);
+	blk_wake_io_task(waiter);
+	return;
+}
+
+static void iomap_writethrough_bio_end_io(struct bio *bio)
+{
+	struct iomap_writethrough_ctx *wt_ctx = bio->bi_private;
+	struct folio_iter fi;
+
+	if (bio->bi_status)
+		cmpxchg(&wt_ctx->error, 0,
+			blk_status_to_errno(bio->bi_status));
+	bio_for_each_folio_all(fi, bio)
+		folio_end_writeback(fi.folio);
+
+	bio_put(bio);
+	if (atomic_dec_and_test(&wt_ctx->ref))
+		iomap_writethrough_done(wt_ctx);
+}
+
+static void
+iomap_writethrough_submit_bio(struct iomap_writethrough_ctx *wt_ctx,
+			      struct iomap *iomap,
+			      const struct iomap_writethrough_ops *wt_ops)
+{
+	struct bio *bio;
+	unsigned int i;
+	u64 len = 0;
+
+	if (!wt_ctx->nr_bvecs)
+		return;
+
+	for (i = 0; i < wt_ctx->nr_bvecs; i++)
+		len += wt_ctx->bvec[i].bv_len;
+
+	if (wt_ops->writethrough_submit)
+		wt_ops->writethrough_submit(wt_ctx->inode, iomap, wt_ctx->bio_pos,
+					    len);
+
+	bio = bio_alloc(iomap->bdev, wt_ctx->nr_bvecs, REQ_OP_WRITE, GFP_NOFS);
+	bio->bi_iter.bi_sector = iomap_sector(iomap, wt_ctx->bio_pos);
+	bio->bi_end_io = iomap_writethrough_bio_end_io;
+	bio->bi_private = wt_ctx;
+
+	for (i = 0; i < wt_ctx->nr_bvecs; i++)
+		__bio_add_page(bio, wt_ctx->bvec[i].bv_page,
+			       wt_ctx->bvec[i].bv_len,
+			       wt_ctx->bvec[i].bv_offset);
+
+	atomic_inc(&wt_ctx->ref);
+	submit_bio(bio);
+	wt_ctx->nr_bvecs = 0;
+}
+
+/**
+ * iomap_folio_prepare_writethrough - prepare a folio for writethrough
+ * @folio: folio to prepare for writethrough
+ * @off: offset of write within folio
+ * @len: len of write within folio
+ *
+ * This function does the major preparation work needed before starting the
+ * writethrough. The main task is to prepare the folio for writethrough by
+ * blocking mmap writes and setting writeback on it. Further, we must clear
+ * the write range to non-dirty. If this results in the complete folio
+ * becoming non-dirty, then we need to clear the master dirty bit.
+ */
+static void iomap_folio_prepare_writethrough(struct folio *folio, size_t off,
+					     size_t len)
+{
+	bool fully_written;
+	u64 zero = 0;
+
+	if (folio_test_writeback(folio))
+		folio_wait_writeback(folio);
+
+	if (folio_mkclean(folio))
+		folio_mark_dirty(folio);
+
+	/*
+	 * We might either write through the complete folio, or a partial folio
+	 * writethrough might result in all blocks becoming non-dirty, so we
+	 * need to check and mark the folio clean if that is the case.
+	 */
+	fully_written = (off == 0 && len == folio_size(folio));
+	iomap_clear_range_dirty(folio, off, len);
+	if (fully_written ||
+	    !iomap_find_dirty_range(folio, &zero, folio_size(folio)))
+		folio_clear_dirty_for_writethrough(folio);
+
+	folio_start_writeback(folio);
+}
+
+/**
+ * iomap_writethrough_iter - perform RWF_WRITETHROUGH buffered write
+ * @wt_ctx: writethrough context
+ * @iter: iomap iter holding mapping information
+ * @i: iov_iter for write
+ * @wt_ops: the fs callbacks needed for writethrough
+ *
+ * This function copies the user buffer to the folio similar to the usual
+ * buffered IO path, with the difference that we immediately issue the IO.
+ * For this we utilize an IO submission and completion mechanism that is
+ * inspired by dio.
+ *
+ * Folio handling note: We might be writing through a partial folio so we
+ * need to be careful to not clear the folio dirty bit unless there are no
+ * dirty blocks in the folio after the writethrough.
+ */
+static int iomap_writethrough_iter(struct iomap_writethrough_ctx *wt_ctx,
+				   struct iomap_iter *iter, struct iov_iter *i,
+				   const struct iomap_writethrough_ops *wt_ops)
+
+{
+	ssize_t total_written = 0;
+	int status = 0;
+	struct address_space *mapping = iter->inode->i_mapping;
+	size_t chunk = mapping_max_folio_size(mapping);
+	unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
+	unsigned int bs = i_blocksize(iter->inode);
+
+	/* copied over based on how DIO handles these flags */
+	if (iter->iomap.type == IOMAP_UNWRITTEN)
+		wt_ctx->flags |= IOMAP_DIO_UNWRITTEN;
+	if (iter->iomap.flags & IOMAP_F_SHARED)
+		wt_ctx->flags |= IOMAP_DIO_COW;
+
+	if (!(iter->flags & IOMAP_WRITETHROUGH))
+		return -EINVAL;
+
+	do {
+		struct folio *folio;
+		size_t offset;		/* Offset into folio */
+		u64 bytes;		/* Bytes to write to folio */
+		size_t copied;		/* Bytes copied from user */
+		u64 written;		/* Bytes have been written */
+		loff_t pos;
+		size_t off_aligned, len_aligned;
+
+		bytes = iov_iter_count(i);
+retry:
+		offset = iter->pos & (chunk - 1);
+		bytes = min(chunk - offset, bytes);
+		status = balance_dirty_pages_ratelimited_flags(mapping,
+							       bdp_flags);
+		if (unlikely(status))
+			break;
+
+		/*
+		 * If completions already occurred and reported errors, give up
+		 * now and don't bother submitting more bios.
+		 */
+		if (unlikely(data_race(wt_ctx->error))) {
+			wt_ctx->nr_bvecs = 0;
+			break;
+		}
+
+		if (bytes > iomap_length(iter))
+			bytes = iomap_length(iter);
+
+		/*
+		 * Bring in the user page that we'll copy from _first_.
+		 * Otherwise there's a nasty deadlock on copying from the
+		 * same page as we're writing to, without it being marked
+		 * up-to-date.
+		 *
+		 * For async buffered writes the assumption is that the user
+		 * page has already been faulted in. This can be optimized by
+		 * faulting the user page.
+		 */
+		if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
+			status = -EFAULT;
+			break;
+		}
+
+		status = iomap_write_begin(iter, wt_ops->write_ops, &folio,
+					   &offset, &bytes);
+		if (unlikely(status)) {
+			iomap_write_failed(iter->inode, iter->pos, bytes);
+			break;
+		}
+		if (iter->iomap.flags & IOMAP_F_STALE)
+			break;
+
+		pos = iter->pos;
+
+		if (mapping_writably_mapped(mapping))
+			flush_dcache_folio(folio);
+
+		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
+		written = iomap_write_end(iter, bytes, copied, folio) ?
+			  copied : 0;
+
+		if (!written)
+			goto put_folio;
+
+		off_aligned = round_down(offset, bs);
+		len_aligned = round_up(offset + written, bs) - off_aligned;
+
+		iomap_folio_prepare_writethrough(folio, off_aligned,
+						 len_aligned);
+
+		if (!wt_ctx->nr_bvecs)
+			wt_ctx->bio_pos = round_down(pos, bs);
+
+		bvec_set_folio(&wt_ctx->bvec[wt_ctx->nr_bvecs], folio,
+			       len_aligned, off_aligned);
+		wt_ctx->nr_bvecs++;
+		wt_ctx->written += written;
+
+		if (pos + written > wt_ctx->new_i_size)
+			wt_ctx->new_i_size = pos + written;
+
+		if (wt_ctx->nr_bvecs == wt_ctx->max_bvecs)
+			iomap_writethrough_submit_bio(wt_ctx, &iter->iomap, wt_ops);
+
+put_folio:
+		__iomap_put_folio(iter, wt_ops->write_ops, written, folio);
+
+		cond_resched();
+		if (unlikely(written == 0)) {
+			iomap_write_failed(iter->inode, pos, bytes);
+			iov_iter_revert(i, copied);
+
+			if (chunk > PAGE_SIZE)
+				chunk /= 2;
+			if (copied) {
+				bytes = copied;
+				goto retry;
+			}
+		} else {
+			total_written += written;
+			iomap_iter_advance(iter, written);
+		}
+	} while (iov_iter_count(i) && iomap_length(iter));
+
+	if (wt_ctx->nr_bvecs)
+		iomap_writethrough_submit_bio(wt_ctx, &iter->iomap, wt_ops);
+
+	return total_written ? 0 : status;
+}
+
 static int iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
 		const struct iomap_write_ops *write_ops)
 {
@@ -1232,6 +1503,87 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 }
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
 
+ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
+				      const struct iomap_writethrough_ops *wt_ops,
+				      void *private)
+{
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	struct iomap_iter iter = {
+		.inode = inode,
+		.pos = iocb->ki_pos,
+		.len = iov_iter_count(i),
+		.flags = IOMAP_WRITE | IOMAP_WRITETHROUGH,
+		.private = private,
+	};
+	struct iomap_writethrough_ctx *wt_ctx;
+	unsigned int max_bvecs;
+	ssize_t ret;
+
+
+	/*
+	 * For now we don't support any other flag with WRITETHROUGH
+	 */
+	if (!(iocb->ki_flags & IOCB_WRITETHROUGH))
+		return -EINVAL;
+	if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_DONTCACHE))
+		return -EINVAL;
+	if (iocb_is_dsync(iocb))
+		/* D_SYNC support not implemented yet */
+		return -EOPNOTSUPP;
+	if (!is_sync_kiocb(iocb))
+		/* aio support not implemented yet */
+		return -EOPNOTSUPP;
+
+	/*
+	 * +1 to max bvecs to account for unaligned write spanning multiple
+	 * folios
+	 */
+	max_bvecs = DIV_ROUND_UP(
+			iov_iter_count(i),
+			PAGE_SIZE << mapping_min_folio_order(inode->i_mapping)) + 1;
+
+	if (max_bvecs > BIO_MAX_VECS)
+		max_bvecs = BIO_MAX_VECS;
+	if (!max_bvecs)
+		max_bvecs = 1;
+
+	wt_ctx = kzalloc(struct_size(wt_ctx, bvec, max_bvecs), GFP_NOFS);
+	if (!wt_ctx)
+		return -ENOMEM;
+
+	wt_ctx->iocb = iocb;
+	wt_ctx->inode = inode;
+	wt_ctx->dops = wt_ops->dops;
+	wt_ctx->pos = iocb->ki_pos;
+	wt_ctx->new_i_size = i_size_read(inode);
+	wt_ctx->max_bvecs = max_bvecs;
+	atomic_set(&wt_ctx->ref, 1);
+	wt_ctx->waiter = current;
+
+	mapping_set_stable_writes(inode->i_mapping);
+
+	while ((ret = iomap_iter(&iter, wt_ops->ops)) > 0) {
+		WARN_ON(iter.iomap.type != IOMAP_UNWRITTEN &&
+			iter.iomap.type != IOMAP_MAPPED);
+		iter.status = iomap_writethrough_iter(wt_ctx, &iter, i, wt_ops);
+	}
+	if (ret < 0)
+		cmpxchg(&wt_ctx->error, 0, ret);
+
+	if (!atomic_dec_and_test(&wt_ctx->ref)) {
+		for (;;) {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			if (!READ_ONCE(wt_ctx->waiter))
+				break;
+			blk_io_schedule();
+		}
+		__set_current_state(TASK_RUNNING);
+	}
+
+	return iomap_writethrough_complete(wt_ctx);
+}
+EXPORT_SYMBOL_GPL(iomap_file_writethrough_write);
+
 static void iomap_write_delalloc_ifs_punch(struct inode *inode,
 		struct folio *folio, loff_t start_byte, loff_t end_byte,
 		struct iomap *iomap, iomap_punch_t punch)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 547ce27fb741..2f95fd49472a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -344,6 +344,7 @@ struct readahead_control;
 #define IOCB_ATOMIC		(__force int) RWF_ATOMIC
 #define IOCB_DONTCACHE		(__force int) RWF_DONTCACHE
 #define IOCB_NOSIGNAL		(__force int) RWF_NOSIGNAL
+#define IOCB_WRITETHROUGH	(__force int) RWF_WRITETHROUGH
 
 /* non-RWF related bits - start at 16 */
 #define IOCB_EVENTFD		(1 << 16)
@@ -1985,6 +1986,8 @@ struct file_operations {
 #define FOP_ASYNC_LOCK		((__force fop_flags_t)(1 << 6))
 /* File system supports uncached read/write buffered IO */
 #define FOP_DONTCACHE		((__force fop_flags_t)(1 << 7))
+/* File system supports write through buffered IO */
+#define FOP_WRITETHROUGH	((__force fop_flags_t)(1 << 8))
 
 /* Wrap a directory iterator that needs exclusive inode access */
 int wrap_directory_iterator(struct file *, struct dir_context *,
@@ -3434,6 +3437,10 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
 		if (IS_DAX(ki->ki_filp->f_mapping->host))
 			return -EOPNOTSUPP;
 	}
+	if (flags & RWF_WRITETHROUGH)
+		/* file system must support it */
+		if (!(ki->ki_filp->f_op->fop_flags & FOP_WRITETHROUGH))
+			return -EOPNOTSUPP;
 	kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
 	if (flags & RWF_SYNC)
 		kiocb_flags |= IOCB_DSYNC;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 531f9ebdeeae..661233aa009d 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -209,6 +209,7 @@ struct iomap_write_ops {
 #endif /* CONFIG_FS_DAX */
 #define IOMAP_ATOMIC		(1 << 9) /* torn-write protection */
 #define IOMAP_DONTCACHE		(1 << 10)
+#define IOMAP_WRITETHROUGH	(1 << 11)
 
 struct iomap_ops {
 	/*
@@ -475,6 +476,27 @@ struct iomap_writepage_ctx {
 	void *wb_ctx;		/* pending writeback context */
 };
 
+struct iomap_writethrough_ctx {
+	struct kiocb *iocb;
+	const struct iomap_dio_ops *dops;
+	struct inode *inode;
+	loff_t new_i_size;
+	loff_t pos;
+	size_t written;
+	atomic_t ref;
+	unsigned int flags;
+	int error;
+
+	/* used during submission and for non-aio completion */
+	struct task_struct *waiter;
+
+	loff_t bio_pos;
+	unsigned int nr_bvecs;
+	unsigned int max_bvecs;
+	struct bio_vec bvec[];
+
+};
+
 struct iomap_ioend *iomap_init_ioend(struct inode *inode, struct bio *bio,
 		loff_t file_offset, u16 ioend_flags);
 struct iomap_ioend *iomap_split_ioend(struct iomap_ioend *ioend,
@@ -599,6 +621,22 @@ struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 ssize_t iomap_dio_complete(struct iomap_dio *dio);
 void iomap_dio_bio_end_io(struct bio *bio);
 
+/*
+ * In writethrough, we copy user data to the folio first and then send the
+ * folio to writeback via the dio path. To achieve this, we need callbacks
+ * from iomap_ops, iomap_write_ops and iomap_dio_ops. This struct packs them
+ * together.
+ */
+struct iomap_writethrough_ops {
+	const struct iomap_ops *ops;
+	const struct iomap_write_ops *write_ops;
+	const struct iomap_dio_ops *dops;
+	int (*writethrough_submit)(struct inode *inode, struct iomap *iomap,
+				   loff_t offset, u64 len);
+};
+ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
+				      const struct iomap_writethrough_ops *wt_ops,
+				      void *private);
+
 #ifdef CONFIG_SWAP
 struct file;
 struct swap_info_struct;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 31a848485ad9..192a00422bc8 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1260,6 +1260,7 @@ static inline void folio_cancel_dirty(struct folio *folio)
 		__folio_cancel_dirty(folio);
 }
 bool folio_clear_dirty_for_io(struct folio *folio);
+bool folio_clear_dirty_for_writethrough(struct folio *folio);
 bool clear_page_dirty_for_io(struct page *page);
 void folio_invalidate(struct folio *folio, size_t offset, size_t length);
 bool noop_dirty_folio(struct address_space *mapping, struct folio *folio);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 70b2b661f42c..dec78041b0cf 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -435,10 +435,13 @@ typedef int __bitwise __kernel_rwf_t;
 /* prevent pipe and socket writes from raising SIGPIPE */
 #define RWF_NOSIGNAL	((__force __kernel_rwf_t)0x00000100)
 
+/* buffered IO that is asynchronously written through to disk after write */
+#define RWF_WRITETHROUGH	((__force __kernel_rwf_t)0x00000200)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
-			 RWF_DONTCACHE | RWF_NOSIGNAL)
+			 RWF_DONTCACHE | RWF_NOSIGNAL | RWF_WRITETHROUGH)
 
 #define PROCFS_IOCTL_MAGIC 'f'
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2f0c6916213d..20561d3d5eaa 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2918,6 +2918,12 @@ bool folio_clear_dirty_for_io(struct folio *folio)
 }
 EXPORT_SYMBOL(folio_clear_dirty_for_io);
 
+bool folio_clear_dirty_for_writethrough(struct folio *folio)
+{
+	return __folio_clear_dirty_for_io(folio, false);
+}
+EXPORT_SYMBOL(folio_clear_dirty_for_writethrough);
+
 static void wb_inode_writeback_start(struct bdi_writeback *wb)
 {
 	atomic_inc(&wb->writeback_inodes);
-- 
2.53.0

From nobody Thu Apr  9 09:34:23 2026
From: Ojaswin Mujoo
To: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Cc: djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org, hch@lst.de, ritesh.list@gmail.com, jack@suse.cz, Luis Chamberlain, dgc@kernel.org, tytso@mit.edu, p.raghav@samsung.com, andres@anarazel.de, brauner@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC PATCH v2 3/5] xfs: Add RWF_WRITETHROUGH support to xfs
Date: Thu, 9 Apr 2026 00:15:44 +0530
Content-Type: text/plain; charset="utf-8"

Add the boilerplate needed to start supporting RWF_WRITETHROUGH in XFS.

We use the direct write ->iomap_begin() functions to ensure the range
under writethrough always has a real non-delalloc extent. We reuse the
XFS dio end IO function to perform extent conversion and i_size
handling for us.

*Note on the COW extent over DATA hole case*

In case of an unmapped COW extent over a DATA hole (due to COW
preallocations), leave the extent unmapped until we are just about to
send the IO. At that time, use the ->writethrough_submit() callback to
convert the COW extent to written. We initially tried converting at
iomap_begin() time (like dio does) but that results in a stale data
exposure as follows:

1. iomap_begin() converts the COW extent over the DATA hole to written
   and marks IOMAP_F_NEW to handle zeroing.
2. iomap_write_begin() realises the extent is stale and returns without
   zeroing.
3. iomap_begin() again sees the same COW extent, but it is written this
   time, so we don't mark IOMAP_F_NEW.
4. Since IOMAP_F_NEW is not set, we never zero out and hence expose
   stale data.

To avoid the above, take the buffered IO approach of converting the
extent just before the IO, when we are sure to have zeroed out the
folio.
Co-developed-by: Ritesh Harjani (IBM)
Signed-off-by: Ritesh Harjani (IBM)
Signed-off-by: Ojaswin Mujoo
---
 fs/xfs/xfs_file.c | 53 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 47 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 6246f34df9fd..d8436d840476 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -988,6 +988,39 @@ xfs_file_dax_write(
 	return ret;
 }
 
+static int
+xfs_writethrough_submit(
+	struct inode		*inode,
+	struct iomap		*iomap,
+	loff_t			offset,
+	u64			count)
+{
+	int			error = 0;
+	unsigned int		nofs_flag;
+
+	/*
+	 * Convert CoW extents to regular.
+	 *
+	 * We are under writethrough context with folio lock possibly held. To
+	 * avoid memory allocation deadlocks, set the task-wide nofs context.
+	 */
+	if (iomap->flags & IOMAP_F_SHARED) {
+		nofs_flag = memalloc_nofs_save();
+		error = xfs_reflink_convert_cow(XFS_I(inode), offset, count);
+		memalloc_nofs_restore(nofs_flag);
+	}
+
+	return error;
+}
+
+const struct iomap_writethrough_ops xfs_writethrough_ops = {
+	.ops			= &xfs_direct_write_iomap_ops,
+	.write_ops		= &xfs_iomap_write_ops,
+	.dops			= &xfs_dio_write_ops,
+	.writethrough_submit	= &xfs_writethrough_submit
+};
+
+
 STATIC ssize_t
 xfs_file_buffered_write(
 	struct kiocb		*iocb,
@@ -1010,9 +1043,13 @@ xfs_file_buffered_write(
 		goto out;
 
 	trace_xfs_file_buffered_write(iocb, from);
-	ret = iomap_file_buffered_write(iocb, from,
-			&xfs_buffered_write_iomap_ops, &xfs_iomap_write_ops,
-			NULL);
+	if (iocb->ki_flags & IOCB_WRITETHROUGH) {
+		ret = iomap_file_writethrough_write(iocb, from,
+				&xfs_writethrough_ops, NULL);
+	} else
+		ret = iomap_file_buffered_write(iocb, from,
+				&xfs_buffered_write_iomap_ops,
+				&xfs_iomap_write_ops, NULL);
 
 	/*
	 * If we hit a space limit, try to free up some lingering preallocated
@@ -1047,8 +1084,12 @@
 
 	if (ret > 0) {
 		XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret);
-		/* Handle various SYNC-type writes */
-		ret = generic_write_sync(iocb, ret);
+		/*
+		 * Handle various SYNC-type writes.
+		 * For writethrough, we handle sync during completion.
+		 */
+		if (!(iocb->ki_flags & IOCB_WRITETHROUGH))
+			ret = generic_write_sync(iocb, ret);
 	}
 	return ret;
 }
@@ -2042,7 +2083,7 @@ const struct file_operations xfs_file_operations = {
 	.remap_file_range = xfs_file_remap_range,
 	.fop_flags	= FOP_MMAP_SYNC | FOP_BUFFER_RASYNC |
			  FOP_BUFFER_WASYNC | FOP_DIO_PARALLEL_WRITE |
-			  FOP_DONTCACHE,
+			  FOP_DONTCACHE | FOP_WRITETHROUGH,
 	.setlease	= generic_setlease,
 };
--
2.53.0

From nobody Thu Apr 9 09:34:23 2026
From: Ojaswin Mujoo
To: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Cc: djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org, hch@lst.de, ritesh.list@gmail.com, jack@suse.cz, Luis Chamberlain, dgc@kernel.org, tytso@mit.edu, p.raghav@samsung.com, andres@anarazel.de, brauner@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC PATCH v2 4/5] iomap: Add aio support to RWF_WRITETHROUGH
Date: Thu, 9 Apr 2026 00:15:45 +0530
Content-Type: text/plain; charset="utf-8"

With aio, the only thing we need to be careful of is that writethrough
can be in progress even after dropping the inode and folio locks. Due
to this, we need a way to synchronise with other paths where a stable
write is not enough, for example:

1. Truncate to 0 in xfs sets i_size = 0 before waiting for writeback to
   complete. In case of writethrough, the end IO completion can again
   push i_size to a non-zero value.

2. Dio reads might race with aio writethrough ->end_io() and read 0s if
   the unwritten conversion is yet to happen.

Hence, use the inode_dio_begin()/inode_dio_end() helpers, as they give
us the required guarantees.
Co-developed-by: Ritesh Harjani (IBM)
Signed-off-by: Ritesh Harjani (IBM)
Signed-off-by: Ojaswin Mujoo
---
 fs/iomap/buffered-io.c | 53 ++++++++++++++++++++++++++++++++++++------
 include/linux/iomap.h  | 10 ++++++--
 2 files changed, 54 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 74e1ab108b0f..6937f10e2782 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1113,6 +1113,9 @@ static ssize_t iomap_writethrough_complete(struct iomap_writethrough_ctx *wt_ctx
 
 	mapping_clear_stable_writes(inode->i_mapping);
 
+	if (wt_ctx->is_aio)
+		inode_dio_end(inode);
+
 	if (!ret) {
 		ret = wt_ctx->written;
 		iocb->ki_pos = wt_ctx->pos + ret;
@@ -1122,12 +1125,27 @@ static ssize_t iomap_writethrough_complete(struct iomap_writethrough_ctx *wt_ctx
 	return ret;
 }
 
+static void iomap_writethrough_complete_work(struct work_struct *work)
+{
+	struct iomap_writethrough_ctx *wt_ctx =
+		container_of(work, struct iomap_writethrough_ctx, aio_work);
+	struct kiocb *iocb = wt_ctx->iocb;
+
+	iocb->ki_complete(iocb, iomap_writethrough_complete(wt_ctx));
+}
+
 static void iomap_writethrough_done(struct iomap_writethrough_ctx *wt_ctx)
 {
-	struct task_struct *waiter = wt_ctx->waiter;
+	if (!wt_ctx->is_aio) {
+		struct task_struct *waiter = wt_ctx->waiter;
 
-	WRITE_ONCE(wt_ctx->waiter, NULL);
-	blk_wake_io_task(waiter);
+		WRITE_ONCE(wt_ctx->waiter, NULL);
+		blk_wake_io_task(waiter);
+		return;
+	}
+
+	INIT_WORK(&wt_ctx->aio_work, iomap_writethrough_complete_work);
+	queue_work(wt_ctx->inode->i_sb->s_dio_done_wq, &wt_ctx->aio_work);
 	return;
 }
 
@@ -1530,9 +1548,6 @@ ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
 	if (iocb_is_dsync(iocb))
 		/* D_SYNC support not implemented yet */
 		return -EOPNOTSUPP;
-	if (!is_sync_kiocb(iocb))
-		/* aio support not implemented yet */
-		return -EOPNOTSUPP;
 
 	/*
	 * +1 to max bvecs to account for unaligned write spanning multiple
@@ -1557,11 +1572,32 @@ ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
 	wt_ctx->pos = iocb->ki_pos;
 	wt_ctx->new_i_size = i_size_read(inode);
 	wt_ctx->max_bvecs = max_bvecs;
+	wt_ctx->is_aio = !is_sync_kiocb(iocb);
 	atomic_set(&wt_ctx->ref, 1);
-	wt_ctx->waiter = current;
+
+	if (!wt_ctx->is_aio)
+		wt_ctx->waiter = current;
+	else
+		/*
+		 * With aio, writethrough can be in progress even after dropping
+		 * inode and folio lock. Due to this, we need a way to
+		 * synchronise with other paths where stable write is not enough
+		 * (example truncate). Hence use the dio begin/end as it gives
+		 * us the required guarantees.
+		 */
+		inode_dio_begin(inode);
 
 	mapping_set_stable_writes(inode->i_mapping);
 
+	if (wt_ctx->is_aio && !inode->i_sb->s_dio_done_wq) {
+		ret = sb_init_dio_done_wq(inode->i_sb);
+		if (ret < 0) {
+			mapping_clear_stable_writes(inode->i_mapping);
+			kfree(wt_ctx);
+			return ret;
+		}
+	}
+
 	while ((ret = iomap_iter(&iter, wt_ops->ops)) > 0) {
 		WARN_ON(iter.iomap.type != IOMAP_UNWRITTEN &&
 			iter.iomap.type != IOMAP_MAPPED);
@@ -1571,6 +1607,9 @@ ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
 		cmpxchg(&wt_ctx->error, 0, ret);
 
 	if (!atomic_dec_and_test(&wt_ctx->ref)) {
+		if (wt_ctx->is_aio)
+			return -EIOCBQUEUED;
+
 		for (;;) {
 			set_current_state(TASK_UNINTERRUPTIBLE);
 			if (!READ_ONCE(wt_ctx->waiter))
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 661233aa009d..e99f7c279dc6 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -486,9 +486,15 @@ struct iomap_writethrough_ctx {
 	atomic_t ref;
 	unsigned int flags;
 	int error;
+	bool is_aio;
 
-	/* used during submission and for non-aio completion */
-	struct task_struct *waiter;
+	union {
+		/* used during submission and for non-aio completion */
+		struct task_struct *waiter;
+
+		/* used during aio completion */
+		struct work_struct aio_work;
+	};
 
 	loff_t bio_pos;
 	unsigned int nr_bvecs;
--
2.53.0

From nobody Thu Apr 9 09:34:23
2026
From: Ojaswin Mujoo
To: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Cc: djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org, hch@lst.de, ritesh.list@gmail.com, jack@suse.cz, Luis Chamberlain, dgc@kernel.org, tytso@mit.edu, p.raghav@samsung.com, andres@anarazel.de, brauner@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC PATCH v2 5/5] iomap: Add DSYNC support to writethrough
Date: Thu, 9 Apr 2026 00:15:46 +0530
Message-ID: <162ff37dec9295bb76ebadba6ea72ac72cc3c3df.1775658795.git.ojaswin@linux.ibm.com>
Content-Type: text/plain; charset="utf-8"

Add DSYNC support to writethrough buffered writes.

Unlike the usual buffered writes, where we call generic_write_sync()
inline in the syscall path, for writethrough we instead sync the data
in the IO completion path, just like dio. This allows aio writethrough
to be truly async: the syscall can return after IO submission and the
sync can then be done asynchronously at IO completion time.

Further, just like dio, we utilize the FUA optimization, if available,
to avoid syncing the data for DSYNC operations.

Suggested-by: Dave Chinner
Co-developed-by: Ritesh Harjani (IBM)
Signed-off-by: Ritesh Harjani (IBM)
Signed-off-by: Ojaswin Mujoo
---
 fs/iomap/buffered-io.c | 37 +++++++++++++++++++++++++++++++++----
 include/linux/iomap.h  |  1 +
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6937f10e2782..8965f603f2cf 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1119,6 +1119,14 @@ static ssize_t iomap_writethrough_complete(struct iomap_writethrough_ctx *wt_ctx
 	if (!ret) {
 		ret = wt_ctx->written;
 		iocb->ki_pos = wt_ctx->pos + ret;
+
+		/*
+		 * If this is a DSYNC write and we couldn't optimize it, make
+		 * sure we push it to stable storage now that we've written
+		 * data.
+		 */
+		if (iocb_is_dsync(wt_ctx->iocb) && !wt_ctx->use_fua)
+			ret = generic_write_sync(iocb, ret);
 	}
 
 	kfree(wt_ctx);
@@ -1173,6 +1181,7 @@ iomap_writethrough_submit_bio(struct iomap_writethrough_ctx *wt_ctx,
 	struct bio *bio;
 	unsigned int i;
 	u64 len = 0;
+	blk_opf_t opf = REQ_OP_WRITE;
 
 	if (!wt_ctx->nr_bvecs)
 		return;
@@ -1184,7 +1193,10 @@ iomap_writethrough_submit_bio(struct iomap_writethrough_ctx *wt_ctx,
 	wt_ops->writethrough_submit(wt_ctx->inode, iomap, wt_ctx->bio_pos,
 			len);
 
-	bio = bio_alloc(iomap->bdev, wt_ctx->nr_bvecs, REQ_OP_WRITE, GFP_NOFS);
+	if (wt_ctx->use_fua)
+		opf |= REQ_FUA;
+
+	bio = bio_alloc(iomap->bdev, wt_ctx->nr_bvecs, opf, GFP_NOFS);
 	bio->bi_iter.bi_sector = iomap_sector(iomap, wt_ctx->bio_pos);
 	bio->bi_end_io = iomap_writethrough_bio_end_io;
 	bio->bi_private = wt_ctx;
@@ -1273,6 +1285,19 @@ static int iomap_writethrough_iter(struct iomap_writethrough_ctx *wt_ctx,
 	if (!(iter->flags & IOMAP_WRITETHROUGH))
 		return -EINVAL;
 
+	/*
+	 * If we realise that a cache flush is necessary (e.g. FUA is not
+	 * present or we need metadata updates) then we turn off the
+	 * optimization.
+	 */
+	if (wt_ctx->use_fua) {
+		if (iter->iomap.type != IOMAP_MAPPED ||
+		    (iter->iomap.flags &
+		     (IOMAP_F_NEW | IOMAP_F_SHARED | IOMAP_F_DIRTY)) ||
+		    (bdev_write_cache(iter->iomap.bdev) &&
+		     !bdev_fua(iter->iomap.bdev)))
+			wt_ctx->use_fua = false;
+	}
+
 	do {
 		struct folio *folio;
 		size_t offset;	/* Offset into folio */
@@ -1545,9 +1570,6 @@ ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
 		return -EINVAL;
 	if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_DONTCACHE))
 		return -EINVAL;
-	if (iocb_is_dsync(iocb))
-		/* D_SYNC support not implemented yet */
-		return -EOPNOTSUPP;
 
 	/*
	 * +1 to max bvecs to account for unaligned write spanning multiple
@@ -1575,6 +1597,13 @@ ssize_t iomap_file_writethrough_write(struct kiocb *iocb, struct iov_iter *i,
 	wt_ctx->is_aio = !is_sync_kiocb(iocb);
 	atomic_set(&wt_ctx->ref, 1);
 
+	/*
+	 * Similar to dio, we optimistically set use_fua=true to avoid an
+	 * explicit sync. In case we later realise a cache flush is needed we
+	 * set it back to false.
+	 */
+	wt_ctx->use_fua = iocb_is_dsync(iocb) && !(iocb->ki_flags & IOCB_SYNC);
+
 	if (!wt_ctx->is_aio)
 		wt_ctx->waiter = current;
 	else
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index e99f7c279dc6..579bc48ed39c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -487,6 +487,7 @@ struct iomap_writethrough_ctx {
 	unsigned int flags;
 	int error;
 	bool is_aio;
+	bool use_fua;
 
 	union {
 		/* used during submission and for non-aio completion */
--
2.53.0