block/io: fix reproducible silent data corruption in write-vs-discard race

[PATCH 0/2] block/io: fix reproducible silent data corruption in write-vs-discard race
Posted by Denis V. Lunev via qemu development 1 month, 1 week ago
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Hanna Reitz <hreitz@redhat.com>

This series fixes a qemu block-layer race between an in-flight
format-driver write and a concurrent discard or MAY_UNMAP
write-zeroes on the same guest range. The race has been latent
in upstream since v1.0 (commit 68d100e905 "qcow2: Use coroutines",
2011-06-30) and has been producing silent qcow2 metadata
corruption in production.

Mechanism
---------

qcow2's write path drops s->lock around the data I/O of an
allocating write.  If a discard / pwrite_zeroes(MAY_UNMAP) on the
same guest offset lands in that window, it clears the L2 entry
and decrements the cluster's refcount to zero; the writer then
reacquires the lock and unconditionally writes L2[G] =
alloc_offset | OFLAG_COPIED onto the now-freed cluster.  The next
allocation re-hands the cluster out and we end up with two L2
entries aliasing one host cluster.  Patch 1/2 carries the
per-frame diagram of the interleaving.

On-disk signature: qemu-img check reports refcount=0 with a live
OFLAG_COPIED reference, or refcount < reference.  Runtime
signature: "qcow2_free_clusters failed: Invalid argument" on
stderr with no guest-visible error.

Production consequences
-----------------------

  * Silent data drift.  Once two guest offsets share one host
    cluster, writes through either alias overwrite bytes the
    other alias owns.  The guest reads back bytes it never
    wrote, with no I/O error hit anywhere in the stack.

  * Guest-filesystem corruption.  ext4 discovers the resulting
    inconsistency and remounts read-only.  Because the backing
    qemu returned success for every request, nothing in the
    guest's own block layer logs anything; kernels have been
    observed stopping all FS writes silently for hours until
    userspace tries to write.

  * Latent poisoning with multi-day incubation.  A block-job
    (commit, stream, active mirror, legacy block-migrate +
    commit on a destination) running concurrently with guest
    discard traffic plants aliased clusters that may not
    produce a symptom until a later guest discard walks one
    of them.  Cases in the wild have surfaced 8 to 17 days
    after the originating block-job window.

  * Recovery requires both fsck inside the guest AND
    qemu-img check -r all on the host -- the former repairs
    the ext4 level, the latter repairs the qcow2 refcount/L2
    aliasing; fsck alone leaves the image to re-corrupt the
    moment writes to an aliased cluster resume.

Why it surfaces only under block-jobs
-------------------------------------

Guest-only I/O rarely opens the race window: the guest's own
block layer serialises DISCARD and WRITE to the same LBA range,
so at any moment a cluster is either "being written" or "being
discarded" from the guest, not both.  The race requires a second
I/O producer on the same BDS that does not observe guest-side
ordering -- i.e. a block job.  Every migration / commit / backup
/ mirror cycle is an exposure window; steady-state VMs are
essentially immune until the next image-management operation
runs.

Fix
---

Patch 1/2 marks both pdiscard and all pwrite_zeroes (with or
without MAY_UNMAP) as BDRV_REQ_SERIALISING in the generic block
layer.  Their tracked_request then waits for overlapping
in-flight writes -- including non-serialising ones -- to finish
their format-driver commit before any L2/refcount mutation
happens.

The gate lives in block/io.c rather than in qcow2 so that:

  * every format driver that drops an internal mutex during
    the data I/O of an allocating write is covered, not just
    qcow2;

  * the NBD WRITE_ZEROES path (blk_co_pwrite_zeroes ->
    blk_co_pwritev -> bdrv_co_pwritev_part ->
    bdrv_aligned_pwritev, which bypasses the
    bdrv_co_pwrite_zeroes wrapper entirely) is still caught
    -- the gate is placed where BDRV_REQ_ZERO_WRITE is
    observed on the way down to the driver.

Perf impact is limited to the overlap window. The serialising
request only waits when a conflict actually exists, which is
exactly the corruption surface.  Steady-state non-overlapping
traffic pays nothing.

Test
----

Patch 2/2 adds a deterministic iotest
(tests/qemu-iotests/tests/discard-write-serialisation) that
drives a single qemu-io process with a fixed-seed 5000-command
sequence of interleaved aio_write and aio_write -z -u at random
cluster-aligned offsets in a small contention region, then runs
qemu-img check and asserts zero corruptions. Results across 8
runs each:

  fixed tree:    8/8 clean
  unfixed tree:  8/8 detect (2-4 corruptions per run)

100% detection on the unfixed tree, zero false positives on the
fixed tree, under 30 seconds per run.  The test is scoped to
qcow2 because qcow2 is the format whose qemu-img check validates
the fingerprint; the underlying race is format-agnostic.

Denis V. Lunev (2):
  block/io: serialise discard and write-zeroes against in-flight writes
  iotests: regression test for discard/write-zeroes vs in-flight write
    race

 block/io.c                                    | 25 ++++-
 .../tests/discard-write-serialisation         | 97 +++++++++++++++++++
 .../tests/discard-write-serialisation.out     |  1 +
 3 files changed, 122 insertions(+), 1 deletion(-)
 create mode 100755 tests/qemu-iotests/tests/discard-write-serialisation
 create mode 100644 tests/qemu-iotests/tests/discard-write-serialisation.out

-- 
2.51.0