From nobody Thu Dec 18 14:28:03 2025
From: Baokun Li
Subject: [PATCH -RFC 1/2] mm: avoid data corruption when extending DIO write race with buffered read
Date: Sat, 2 Dec 2023 17:14:31 +0800
Message-ID: <20231202091432.8349-2-libaokun1@huawei.com>
In-Reply-To: <20231202091432.8349-1-libaokun1@huawei.com>
References: <20231202091432.8349-1-libaokun1@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

When a DIO write and a buffered read are performed on the same file on
two CPUs, the following race may occur:

          cpu1                           cpu2
Direct write 1024 from 4096  | Buffered read 8192 from 0
-----------------------------|----------------------------
...                            ...
ext4_file_write_iter
 ext4_dio_write_iter
  iomap_dio_rw
   ...
                               ext4_file_read_iter
                                generic_file_read_iter
                                 filemap_read
                                  filemap_get_pages
                                   ...
                                    ext4_mpage_readpages
                                     ext4_readpage_limit(inode)
                                      i_size_read(inode) // 4096
   ext4_dio_write_end_io
    i_size_write(inode, 5120)
                                  i_size_read(inode) // 5120

1. read alloc 8192
    0                                      8192
    |-------------------|-------------------|
2. read from disk (i_size 4096)
    0    filled data   4096   filled zero  8192
    |-------------------|-------------------|
3. copyout (i_size 5120)
    0   copyout to user buffer  5120       8192
    |------------------------|--------------|
                        |~~~~|
                   Inconsistent data

In the above race, the inode size changes between the disk read and the
copyout. Only 4096 bytes are actually read from disk, but 5120 bytes are
copied to the user's buffer, including 1024 bytes from the zero-filled
tail of the page. As a result, the data returned to the user is not
consistent with the data on disk.

To solve this problem completely, we should take the smaller of the
number of bytes actually read and the inode size, and use that as the
final read size.

The problem here is that we don't know how many bytes of valid data
filemap_get_pages() reads, or how many bytes of valid data are in a
page, so we have to rely on the inode size to determine the range of
valid data. So we read the inode size before and after
filemap_get_pages(), and take the smaller of the two as the size of the
copyout. This reduces the probability of the above issue being
triggered.

Signed-off-by: Baokun Li
---
 mm/filemap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 71f00539ac00..47c1729afbb4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2587,7 +2587,8 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
 			iocb->ki_flags |= IOCB_NOWAIT;
 
-		if (unlikely(iocb->ki_pos >= i_size_read(inode)))
+		isize = i_size_read(inode);
+		if (unlikely(iocb->ki_pos >= isize))
 			break;
 
 		error = filemap_get_pages(iocb, iter->count, &fbatch, false);
@@ -2602,7 +2603,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		 * part of the page is not copied back to userspace (unless
 		 * another truncate extends the file - this is desired though).
 		 */
-		isize = i_size_read(inode);
+		isize = min_t(loff_t, isize, i_size_read(inode));
 		if (unlikely(iocb->ki_pos >= isize))
 			goto put_folios;
 		end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);
-- 
2.31.1

From nobody Thu Dec 18 14:28:03 2025
From: Baokun Li
Subject: [PATCH -RFC 2/2] ext4: avoid data corruption when extending DIO write race with buffered read
Date: Sat, 2 Dec 2023 17:14:32 +0800
Message-ID: <20231202091432.8349-3-libaokun1@huawei.com>
In-Reply-To: <20231202091432.8349-1-libaokun1@huawei.com>
References: <20231202091432.8349-1-libaokun1@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

The following race between an extending DIO write and a buffered read
may result in reading a stale page cache:

           cpu1                             cpu2
------------------------------|-----------------------------
// Direct write 1024 from 4096
                               // Buffered read 8192 from 0
...                             ...
ext4_file_write_iter
 ext4_dio_write_iter
  iomap_dio_rw
   ...
                               ext4_file_read_iter
                                generic_file_read_iter
                                 filemap_read
                                  i_size_read(inode) // 4096
                                  filemap_get_pages
                                   ...
                                    ext4_mpage_readpages
                                     ext4_readpage_limit(inode)
                                      i_size_read(inode) // 4096
                                     // read 4096, zero-filled 4096
   ext4_dio_write_end_io
    i_size_write(inode, 5120)
                                  i_size_read(inode) // 5120
                                  copyout 4096
                               // new read 4096 from 4096
                               ext4_file_read_iter
                                generic_file_read_iter
                                 filemap_read
                                  i_size_read(inode) // 5120
                                  filemap_get_pages
                                   // stale page is uptodate
                                  i_size_read(inode) // 5120
                                  copyout 5120
   dio invalidate stale page cache

In the above race, after the DIO write updates the inode size but before
it invalidates the stale page cache, a buffered read sees that the
previously read page is still uptodate. It therefore copies that page
directly to user space without re-reading it from disk, so the last
1024 bytes returned are not the same as the data on disk.

To get around this, wait for any in-flight DIO write to invalidate the
stale page cache before each new buffered read.

Signed-off-by: Baokun Li
---
 fs/ext4/file.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0166bb9ca160..99e92ddef97d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -144,6 +144,9 @@ static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (iocb->ki_flags & IOCB_DIRECT)
 		return ext4_dio_read_iter(iocb, to);
 
+	/* wait for stale page cache to be invalidated */
+	inode_dio_wait(inode);
+
 	return generic_file_read_iter(iocb, to);
 }
 
-- 
2.31.1
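
For readers following the series, the size clamp that patch 1/2 adds to
filemap_read() can be illustrated with a small userspace model. This is
only a sketch under stated assumptions, not kernel code: off_model_t,
min_off(), end_offset_unpatched() and end_offset_patched() are
hypothetical names made up for this illustration, and the model leaves
out the "ki_pos >= isize" early exits that the real function performs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's loff_t. */
typedef int64_t off_model_t;

/* Smaller of two offsets; plays the role of min_t(loff_t, a, b). */
static off_model_t min_off(off_model_t a, off_model_t b)
{
	return a < b ? a : b;
}

/*
 * Before the patch: the copyout end is bounded only by the i_size
 * sampled after the pages were read, so a concurrent extension leaks
 * the zero-filled tail of the page into the user buffer.
 */
static off_model_t end_offset_unpatched(off_model_t pos, off_model_t count,
					off_model_t isize_before,
					off_model_t isize_after)
{
	assert(pos >= 0 && count >= 0);
	(void)isize_before;	/* the pre-read sample is not kept */
	return min_off(isize_after, pos + count);
}

/*
 * After the patch: i_size is sampled before and after the read and
 * the smaller value bounds the copyout.
 */
static off_model_t end_offset_patched(off_model_t pos, off_model_t count,
				      off_model_t isize_before,
				      off_model_t isize_after)
{
	off_model_t isize = min_off(isize_before, isize_after);

	assert(pos >= 0 && count >= 0);
	return min_off(isize, pos + count);
}
```

With the numbers from the race diagram (read of 8192 from offset 0,
i_size 4096 before the read and 5120 after it), the unpatched model
yields an end offset of 5120 and the patched model yields 4096. As the
commit message says, this only narrows the window: the model takes both
samples as given and cannot tell how many bytes were actually valid in
the page.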