[PATCH 0/1] bcache: fix stale data race between read cache miss and bypass write

Ankit Kapoor posted 1 patch 3 days, 2 hours ago
drivers/md/bcache/request.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
Overview
--------
This series fixes a cache inconsistency in bcache: stale data can be cached
when a read cache miss races with a bypass write (a write that skips the
cache because of congestion or sequential cutoff). The fix sequences the
btree invalidation of the bypass write so that it runs strictly after the
write to the backing device.

Race Analysis
-------------
The following sequence illustrates how stale data is cached after a read
cache miss when btree invalidation of a bypass write happens in parallel
with a delayed write to the backing device:

Write IO Path (Parallel)            Read IO Path
------------------------            ------------
           |
 [Btree Invalidation]
           |
           |                      [Cache Miss]
           |                           |
           |                     [Btree Placeholder Key Insertion]
           |                           |
 (Delay in writing                     |
 to the backing device)                |
           |                     [Cache data from the backing device]
           |                           |
           +-------------------------->|  <-- No key collision detected!
           |                      [Btree Placeholder Key Replacement]
           |                           |
    [Write to the                      |
    backing device]                -------------
                                 CRITICAL BUG:
                             Stale data gets cached

Reproduction Steps
------------------
The bug can be reproduced reliably by injecting a 5-second delay into
the backing device write path via dm-delay. Cache mode is set to
writearound so that writes take the bypass path.
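
As a rough sketch of the delay injection, a dm-delay table that delays
only writes could be generated as shown below. The helper delay_table,
the device path /dev/sdX and the target name delayed-backing are
placeholders and are not taken from the actual test setup. The table
follows the dm-delay layout
"<device> <offset> <delay> [<write_device> <write_offset> <write_delay>]".

```shell
# Hypothetical helper: print a dm-delay table that leaves reads undelayed
# and delays writes by the given number of milliseconds.
# Arguments: device path, device size in 512-byte sectors, write delay (ms).
delay_table() {
    dev=$1
    sectors=$2
    write_ms=$3
    printf '0 %s delay %s 0 0 %s 0 %s\n' "$sectors" "$dev" "$dev" "$write_ms"
}

# Example use (assumed names, requires root):
#   delay_table /dev/sdX "$(blockdev --getsz /dev/sdX)" 5000 \
#       | dmsetup create delayed-backing
```

Under these assumptions, the bcache backing device would be created on
top of /dev/mapper/delayed-backing rather than on /dev/sdX directly.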

1. Data Preparation:
  # printf -- '%.0s\0' {1..4096} > /tmp/0.txt
  # printf -- '%.0s\1' {1..4096} > /tmp/1.txt
  # echo writearound > /sys/block/bcache0/bcache/cache_mode
  # dd if=/tmp/0.txt of=/media/bcache/data.txt oflag=direct \
    bs=4096 count=1 conv=notrunc

2. Race Execution:
  # dd if=/tmp/1.txt of=/media/bcache/data.txt oflag=direct \
    bs=4096 count=1 conv=notrunc &
  # sleep 1
  # dd if=/media/bcache/data.txt iflag=direct bs=4096 count=1 \
    status=none | hexdump > ./concurrent-read-result
  # sleep 10
  # dd if=/media/bcache/data.txt iflag=direct bs=4096 count=1 \
    status=none | hexdump > ./second-read-result

3. Results (Without Patch):
  # cat second-read-result
  0000000 0000 0000 0000 0000 0000 0000 0000 0000  # <--- STALE READ

Proposed Fix
------------
The fix enforces a strict order within a bypass write: btree invalidation
runs only after the write to the backing device has completed:

OLD FLOW                                          NEW FLOW
-------------------------------       --------------------------------
        [ Write Start ]                       [ Write Start ]
               |                                     |
       +-------+-------+                             |
       |               |                             v
       v               v                    [     Write to   ]
 [    Btree     ] [   Write to    ]         [ backing-device ]
 [ Invalidation ] [ backing-device]                  |
       |               |                             v
       +-------+-------+                    [      Btree     ]
               |                            [  Invalidation  ]
               v                                     |
         [ Write End ]                               v
                                               [ Write End ]

Enforcing this sequential execution ensures that one of the following holds:
1. The read caches stale data first, and the deferred invalidation of
   the write then removes that stale entry.
2. The write invalidation executes first, so the key replacement step of
   the read path detects the collision and does not cache stale data.
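
The two orderings can be illustrated with a toy model of a single block.
This is only a sketch of the reasoning above, not the bcache code: the
event letters and the simulate function are invented for illustration,
and the btree is reduced to one variable.

```shell
# Toy model: the backing device holds "old" or "new"; the cache entry is
# "empty", "placeholder", or a copy of the data.
# Events: W = write to backing device, I = btree invalidation,
#         M = read cache miss (placeholder key insertion),
#         F = fetch data from backing device,
#         R = placeholder key replacement.
simulate() {
    backing=old
    cache=empty
    buf=none
    for ev in $1; do
        case $ev in
            W) backing=new ;;
            I) cache=empty ;;
            M) cache=placeholder ;;
            F) buf=$backing ;;
            R)
                # Replacement succeeds only while the placeholder is still
                # present; otherwise the collision is detected and nothing
                # is cached.
                if [ "$cache" = placeholder ]; then
                    cache=$buf
                fi
                ;;
        esac
    done
    if [ "$cache" = old ] && [ "$backing" = new ]; then
        echo stale
    else
        echo consistent
    fi
}

simulate "I M F R W"   # old flow, invalidation ahead of delayed write: stale
simulate "M F R W I"   # new flow, case 1: consistent
simulate "M F W I R"   # new flow, case 2: consistent
```

In this model, any event order in which W precedes I ends up consistent,
which matches the two cases listed above.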

Failure Handling
----------------
This patch keeps existing error-handling behavior intact. Although
execution is now sequential, btree invalidation is still triggered
regardless of whether the write to the backing device succeeds
or fails.

Verification and Performance
----------------------------
Manual Results (With Patch):
  # cat second-read-result
  0000000 0101 0101 0101 0101 0101 0101 0101 0101  # <--- CORRECT DATA
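
For scripted runs, the hexdump output could be classified automatically.
The sketch below assumes the result file was produced by plain hexdump
as in the reproduction steps and inspects only the first output line
(the first 16 bytes); classify_result is a hypothetical helper name.

```shell
# Hypothetical helper: classify a hexdump result file by its first line.
# All-0000 words mean the old (stale) data was read, all-0101 words mean
# the new data was read; anything else is reported as unknown.
classify_result() {
    awk 'NR == 1 {
        verdict = "unknown"
        if ($2 == "0000") verdict = "stale"
        else if ($2 == "0101") verdict = "correct"
        # Every data word on the line must match the first one.
        for (i = 3; i <= NF; i++)
            if ($i != $2) verdict = "unknown"
        print verdict
        exit
    }' "$1"
}

# Example: classify_result ./second-read-result
```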

Stress Verification:
FIO was run with a write-only workload (128 KB writes, libaio,
iodepth=64, direct=1). Without the patch, FIO reported CRC errors
caused by stale reads; with the patch, no CRC errors or corruption
were reported.

Write-Only Workload (FIO Averages CSV):
Metric,With Fix,Without Fix,Delta
Write IOPS,1630,1630,0.00%
Write Bandwidth (MiB/s),204,204,0.00%
Write Avg Latency (microseconds),39219.95,39219.58,0.00%
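
The Delta column can be re-derived from the two averages. The sketch
below assumes the delta is the relative change of the "With Fix" value
against the "Without Fix" value, rounded to two decimals; delta_pct is a
hypothetical helper name.

```shell
# Hypothetical helper: relative change in percent, two decimal places.
# Arguments: value with the fix, value without the fix.
delta_pct() {
    awk -v fixed="$1" -v base="$2" \
        'BEGIN { printf "%.2f%%\n", (fixed - base) / base * 100 }'
}

delta_pct 39219.95 39219.58   # average write latency: prints 0.00%
```

The latency averages differ by 0.37 microseconds, which is below 0.001%
of the baseline and therefore rounds to 0.00%.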

Test Environment
----------------
- CPU: 1 vCPU, Intel Haswell x86_64 (n1-standard-1 instance)
- Memory: 3.75 GB RAM
- OS: Linux 6.12.68 (Google COS)
- Storage: Google Cloud SSD PD + Local SSD

Ankit Kapoor (1):
  bcache: fix stale data race between read cache miss and bypass write

 drivers/md/bcache/request.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

-- 
2.54.0.669.g59709faab0-goog
Re: [PATCH 0/1] bcache: fix stale data race between read cache miss and bypass write
Posted by Coly Li 3 hours ago
On Thu, May 21, 2026 at 04:39:24PM +0800, Ankit Kapoor wrote:

Hi Ankit,

From your description and analysis, I feel this is a real issue.
Let me look into this more deeply and respond to you later.

Thanks.

Coly Li
