From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, David Sterba, Qu Wenruo
Subject: [PATCH 5.10 545/545] btrfs: raid56: don't trust any cached sector in __raid56_parity_recover()
Date: Fri, 19 Aug 2022 17:45:15 +0200
Message-Id: <20220819153854.004582400@linuxfoundation.org>
In-Reply-To: <20220819153829.135562864@linuxfoundation.org>
References: <20220819153829.135562864@linuxfoundation.org>

From: Qu Wenruo

commit f6065f8edeb25f4a9dfe0b446030ad995a84a088 upstream.

[BUG]
There is a small workload which will always fail with recent kernels
(a simplified version of the btrfs/125 test case):

  mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
  mount $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
  sync
  umount $mnt

  btrfs dev scan -u $dev3
  mount -o degraded $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
  umount $mnt

  btrfs dev scan
  mount $dev1 $mnt
  btrfs balance start --full-balance $mnt
  umount $mnt

The failure is always a failure to read some tree blocks:

  BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
  BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
  BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
  ...
[CAUSE]
With the recently added debug output, we can see all RAID56 operations
related to full stripe 38928384:

  56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
  56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
  56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
  56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
  56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
  56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
  56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
  56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
  56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
  56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
  56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
  56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2

Before we enter raid56_parity_recover(), we have triggered some metadata
writes for the full stripe 38928384, which caused us to read all the
sectors from disk.

Furthermore, btrfs raid56 write will cache its calculated P/Q sectors to
avoid unnecessary reads.

This means, for that full stripe, after any partial write, we will have
stale data, along with P/Q calculated using that stale data.

Thankfully, due to patch "btrfs: only write the sectors in the vertical
stripe which has data stripes" we haven't submitted all the corrupted
P/Q to disk.

When we really need to recover a certain range, aka in
raid56_parity_recover(), we will use the cached rbio, along with its
cached sectors (the full stripe is all cached).

This explains why we have no raid56_scrub_read_recover() event
triggered.

Since we have the cached P/Q which is calculated using the stale data,
the recovered data will just be stale too.

In our particular test case, it will always return the same incorrect
metadata, thus causing the same error message "parent transid verify
failed on 39010304 wanted 9 found 7" again and again.

[BTRFS DESTRUCTIVE RMW PROBLEM]

Test case btrfs/125 (and the above workload) always has its trouble
with the destructive read-modify-write (RMW) cycle:

              0       32K     64K
      Data1:  | Good  | Good  |
      Data2:  | Bad   | Bad   |
      Parity: | Good  | Good  |

In the above case, if we trigger any write into Data1, we will use the
bad data in Data2 to re-generate parity, killing the only chance to
recover Data2; thus Data2 is lost forever.
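To make that destructive step concrete, here is a minimal userspace
sketch (illustrative only, not btrfs code; plain single-parity XOR math
with made-up byte values 0xee/0xaa/0x00) showing that once parity is
re-generated from an already-stale copy of Data2, "recovery" of Data2
only reproduces the stale content:

  /* Single-parity XOR demo -- illustrative only, not btrfs code. */
  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          uint8_t d1 = 0xee;       /* Data1 sector, still good on disk      */
          uint8_t d2 = 0xaa;       /* Data2 sector, original (now missing)  */
          uint8_t p  = d1 ^ d2;    /* parity written while all was good     */

          uint8_t d2_stale = 0x00; /* stale Data2, e.g. from cached sectors */

          /* Recovery using the untouched on-disk parity gives back 0xaa. */
          printf("recover with good P:  0x%02x\n", (unsigned)(d1 ^ p));

          /* Destructive RMW: parity re-generated from the stale copy,
           * then used for recovery -- we only get the stale 0x00 back.
           */
          uint8_t p_stale = d1 ^ d2_stale;
          printf("recover with stale P: 0x%02x\n", (unsigned)(d1 ^ p_stale));
          return 0;
  }

Using the cached P/Q that was calculated from stale data puts the
recovery path in exactly the "stale P" situation above.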
This destructive RMW cycle is not specific to btrfs RAID56, but there
are some btrfs specific behaviors making the case even worse:

- Btrfs will cache sectors for unrelated vertical stripes.

  In the above example, if we're only writing into the 0~32K range,
  btrfs will still read the data range (32K ~ 64K) of Data1, and
  (64K ~ 128K) of Data2.
  This behavior is to cache sectors for later update.

  Incidentally commit d4e28d9b5f04 ("btrfs: raid56: make steal_rbio()
  subpage compatible") has a bug which makes RAID56 never trust the
  cached sectors, thus slightly improving the situation for recovery.

  Unfortunately, the follow-up fix "btrfs: update
  stripe_sectors::uptodate in steal_rbio" will revert the behavior
  back to the old one.

- Btrfs raid56 partial write will update all P/Q sectors and cache them

  This means that even if data at (64K ~ 96K) of Data2 is free space,
  and only (96K ~ 128K) of Data2 is really stale data, when we write
  into that (96K ~ 128K) range we will update all the parity sectors
  for the full stripe.

  This unnecessary behavior will completely kill the chance of
  recovery.

  Thankfully, an unrelated optimization "btrfs: only write the sectors
  in the vertical stripe which has data stripes" will prevent
  submitting the write bio for untouched vertical sectors.

  That optimization will keep the on-disk P/Q untouched for a chance of
  later recovery.

[FIX]
Although we have no good way to completely fix the destructive RMW
(unless we go full scrub for each partial write), we can still limit
the damage.

With patch "btrfs: only write the sectors in the vertical stripe which
has data stripes" we won't really submit the P/Q of unrelated vertical
stripes, so the on-disk P/Q should still be fine.

Now what we really need to do is just drop all the cached sectors when
doing recovery.

By this, we have a chance to read the original P/Q from disk, and have
a chance to recover the stale data, while still keeping the cache to
speed up the regular write path.

In fact, just dropping all the cache for the recovery path is good
enough to allow the test case btrfs/125, along with the above small
script, to pass reliably.
The lack of metadata writes after the degraded mount, and forced
metadata COW, is what saves us this time.

So this patch will fix the behavior by not trusting any cache in
__raid56_parity_recover(), to solve the problem while still keeping the
cache useful.

But please note that this test passing DOES NOT mean we have solved the
destructive RMW problem; we just do damage control a little better.

Related patches:

- btrfs: only write the sectors in the vertical stripe
- d4e28d9b5f04 ("btrfs: raid56: make steal_rbio() subpage compatible")
- btrfs: update stripe_sectors::uptodate in steal_rbio

Acked-by: David Sterba
Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman
---
 fs/btrfs/raid56.c |   19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2094,9 +2094,12 @@ static int __raid56_parity_recover(struc
         atomic_set(&rbio->error, 0);
 
         /*
-         * read everything that hasn't failed. Thanks to the
-         * stripe cache, it is possible that some or all of these
-         * pages are going to be uptodate.
+         * Read everything that hasn't failed. However this time we will
+         * not trust any cached sector.
+         * As we may read out some stale data but higher layer is not reading
+         * that stale part.
+         *
+         * So here we always re-read everything in recovery path.
         */
        for (stripe = 0; stripe < rbio->real_stripes; stripe++) {
                if (rbio->faila == stripe || rbio->failb == stripe) {
@@ -2105,16 +2108,6 @@ static int __raid56_parity_recover(struc
                }
 
                for (pagenr = 0; pagenr < rbio->stripe_npages; pagenr++) {
-                       struct page *p;
-
-                       /*
-                        * the rmw code may have already read this
-                        * page in
-                        */
-                       p = rbio_stripe_page(rbio, stripe, pagenr);
-                       if (PageUptodate(p))
-                               continue;
-
                        ret = rbio_add_io_page(rbio, &bio_list,
                                       rbio_stripe_page(rbio, stripe, pagenr),
                                       stripe, pagenr, rbio->stripe_len);