From nobody Sat May 30 20:11:40 2026
To: qemu-devel@nongnu.org, qemu-block@nongnu.org, qemu-stable@nongnu.org
Cc: den@openvz.org, Stefan Hajnoczi, Kevin Wolf, Hanna Reitz
Subject: [PATCH 1/2] block/io: serialise discard and write-zeroes against in-flight writes
Date: Tue, 21 Apr 2026 17:56:27 +0200
Message-ID: <20260421155628.3600671-2-den@openvz.org>
X-Mailer: git-send-email 2.51.0
In-Reply-To: <20260421155628.3600671-1-den@openvz.org>
References: <20260421155628.3600671-1-den@openvz.org>
MIME-Version: 1.0
List-Id: qemu development
Reply-to: "Denis V. Lunev"
From: "Denis V. Lunev" via qemu development
Content-Type: text/plain; charset="utf-8"

qcow2's write path drops s->lock around the data I/O of an allocating
write. A concurrent discard (or MAY_UNMAP write-zeroes) on the same
guest offset lands the cluster-free operation in that window. The
original writer then reacquires the lock and unconditionally writes
L2[G] = alloc_offset | OFLAG_COPIED on its now-stale l2meta, binding
the L2 entry to a freed cluster:

    WRITE coroutine                         DISCARD coroutine
    ---------------                         -----------------
    qcow2_co_pwritev_part:
      lock(s->lock)
      qcow2_alloc_host_offset:
        handle_copied reads
          L2[G] = C | OFLAG_COPIED
        builds l2meta { alloc=C,
                        keep_old_clusters=true }
      unlock(s->lock)
      --> bdrv_co_pwritev_part (data I/O)
                                            lock(s->lock)
                                            qcow2_co_pdiscard on G:
                                              discard_in_l2_slice
                                                set_l2_entry(G, 0)
                                              free_any_cluster(C):
                                                rc(C) 1 -> 0
                                            unlock(s->lock)
      lock(s->lock)
      qcow2_handle_l2meta(link_l2=true):
        qcow2_alloc_cluster_link_l2:
          set_l2_entry(G, C | OFLAG_COPIED)   <- stale alloc onto
                                                 freed cluster

The next allocator pass re-hands C out on rc=0, so we end up with two
L2 entries aliasing one host cluster. On disk this shows up in
qemu-img check as refcount=0 with a live OFLAG_COPIED reference or as
refcount < reference; at runtime the next discard on either alias
prints "qcow2_free_clusters failed: Invalid argument" on stderr with no
guest-visible error.

Mark both discards and all write-zeroes (with or without MAY_UNMAP) as
BDRV_REQ_SERIALISING in the generic block layer.
Their tracked_request then waits for overlapping in-flight writes,
including non-serialising ones, to finish their format-driver commit
before any L2/refcount mutation happens.

Signed-off-by: Denis V. Lunev
Cc: Stefan Hajnoczi
Cc: Kevin Wolf
Cc: Hanna Reitz
---
 block/io.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index dd5f13c694..9f23029b95 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2097,6 +2097,16 @@ bdrv_aligned_pwritev(BdrvChild *child, BdrvTrackedRequest *req,
     max_transfer = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_transfer, INT_MAX),
                                    align);
 
+    /*
+     * Zero-writes (with or without MAY_UNMAP) mutate L2 entries / refcounts
+     * in the format driver and therefore race with concurrent in-flight
+     * regular writes that have dropped their internal mutex for the data
+     * I/O. See the comment in bdrv_co_pdiscard(). Serialise them.
+     */
+    if (flags & BDRV_REQ_ZERO_WRITE) {
+        flags |= BDRV_REQ_SERIALISING;
+    }
+
     ret = bdrv_co_write_req_prepare(child, offset, bytes, req, flags);
 
     if (!ret && bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF &&
@@ -3192,7 +3202,20 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
     bdrv_inc_in_flight(bs);
     tracked_request_begin(&req, bs, offset, bytes, BDRV_TRACKED_DISCARD);
 
-    ret = bdrv_co_write_req_prepare(child, offset, bytes, &req, 0);
+    /*
+     * Discards must serialise against overlapping in-flight writes.
+     * A format driver's write path may drop its internal mutex around
+     * the data I/O while still holding a pending cluster-allocation
+     * commit (see qcow2's handle_copied / qcow2_alloc_cluster_link_l2
+     * sequence). A concurrent discard that clears L2 and drops the
+     * refcount during that window leaves the writer pointing at a
+     * freed cluster - the root of the refcount/reference aliasing
+     * corruption family. Marking the discard serialising makes it wait
+     * for the in-flight write's tracked_request to complete before any
+     * L2/refcount mutation happens.
+     */
+    ret = bdrv_co_write_req_prepare(child, offset, bytes, &req,
+                                    BDRV_REQ_SERIALISING);
     if (ret < 0) {
         goto out;
     }
-- 
2.51.0

From nobody Sat May 30 20:11:40 2026
To: qemu-devel@nongnu.org, qemu-block@nongnu.org, qemu-stable@nongnu.org
Cc: den@openvz.org, Stefan Hajnoczi, Kevin Wolf, Hanna Reitz
Subject: [PATCH 2/2] iotests: regression test for discard/write-zeroes vs in-flight write race
Date: Tue, 21 Apr 2026 17:56:28 +0200
Message-ID: <20260421155628.3600671-3-den@openvz.org>
X-Mailer: git-send-email 2.51.0
In-Reply-To: <20260421155628.3600671-1-den@openvz.org>
References: <20260421155628.3600671-1-den@openvz.org>
MIME-Version: 1.0
List-Id: qemu development
Reply-to: "Denis V. Lunev"
From: "Denis V. Lunev" via qemu development
Content-Type: text/plain; charset="utf-8"

Add tests/qemu-iotests/tests/discard-write-serialisation, a
deterministic regression test for the race fixed in the previous
commit.

Drive a single qemu-io process with a fixed-seed interleaved sequence
of async aio_write and aio_write -z -u commands at random
cluster-aligned offsets in a small contention region, then run
qemu-img check and assert zero corruptions. On an unpatched tree the
same workload reproduces the refcount-aliasing fingerprint
deterministically; on the fixed tree the image comes back clean.

The test is scoped to qcow2 because qcow2 is the format whose qemu-img
check validates refcount/reference consistency and therefore actually
detects the fingerprint. The underlying race is in the generic block
layer and not format-specific, but a test that asserts "qemu-img check
returns zero corruptions" only has signal on formats that run such a
check.

Signed-off-by: Denis V. Lunev
Cc: Stefan Hajnoczi
Cc: Kevin Wolf
Cc: Hanna Reitz
---
 .../tests/discard-write-serialisation         | 97 +++++++++++++++++++
 .../tests/discard-write-serialisation.out     |  1 +
 2 files changed, 98 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/discard-write-serialisation
 create mode 100644 tests/qemu-iotests/tests/discard-write-serialisation.out

diff --git a/tests/qemu-iotests/tests/discard-write-serialisation b/tests/qemu-iotests/tests/discard-write-serialisation
new file mode 100755
index 0000000000..45a3f7f043
--- /dev/null
+++ b/tests/qemu-iotests/tests/discard-write-serialisation
@@ -0,0 +1,97 @@
+#!/usr/bin/env python3
+# group: rw quick auto
+#
+# Regression test for the block-layer race fixed in
+#   block/io: serialise discard and write-zeroes against in-flight writes.
+#
+# A format driver's write path may drop its internal mutex around the
+# data I/O of an allocating write (qcow2 does so between
+# qcow2_alloc_host_offset and qcow2_alloc_cluster_link_l2). A
+# concurrent discard or MAY_UNMAP write-zeroes on the same guest range,
+# running in that window, can clear the L2 entry and drop the cluster's
+# refcount to zero; the writer's subsequent link then binds the L2
+# entry to a freed cluster. qemu-img check reports this as refcount=0
+# with a live OFLAG_COPIED reference, or refcount < reference when the
+# allocator re-hands the cluster out.
+#
+# The bug is in the generic block layer, not format-specific; qcow2 is
+# the detection vehicle because its refcount validation in qemu-img
+# check catches the fingerprint. The test drives a single qemu-io
+# process with interleaved async aio_write and aio_write -z -u commands
+# at random cluster-aligned offsets in a small contention region, then
+# runs qemu-img check and asserts zero corruptions. On an unpatched
+# tree the same workload reproduces the fingerprint deterministically
+# (seed is fixed).
+#
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+import random
+import subprocess
+
+import iotests
+from iotests import qemu_img_create, qemu_img_check, qemu_io_wrap_args
+
+
+iotests.script_initialize(supported_fmts=['qcow2'],
+                          supported_platforms=['linux'])
+
+IMG_SIZE = 256 * 1024 * 1024       # 256 MiB
+REGION = 64 * 1024 * 1024          # contention region: 64 MiB
+CLUSTER = 1024 * 1024              # 1 MiB
+SUBCLUSTER = 32 * 1024             # 32 KiB
+OPS = 5000
+SEED = 7
+
+def build_commands() -> bytes:
+    rng = random.Random(SEED)
+    max_cluster = REGION // CLUSTER - 1
+    lines = []
+    for _ in range(OPS):
+        cl = rng.randint(0, max_cluster)
+        off = cl * CLUSTER
+        if rng.random() < 0.5:
+            # Small sub-cluster write at an unaligned position inside
+            # the cluster -- exercises the handle_copied path and the
+            # s->lock drop around the data I/O.
+            sub = rng.randrange(0, CLUSTER, SUBCLUSTER)
+            lines.append(f'aio_write -q {off + sub} 32k')
+        else:
+            # MAY_UNMAP write-zeroes aligned to the cluster -- frees
+            # clusters at the format driver level and is the concurrent
+            # cluster-free source that races with the in-flight writes.
+            lines.append(f'aio_write -q -z -u {off} 1M')
+    lines.append('aio_flush')
+    return ('\n'.join(lines) + '\n').encode()
+
+
+def main() -> None:
+    with iotests.FilePath('disk.img') as img:
+        qemu_img_create('-f', 'qcow2',
+                        '-o', 'cluster_size=1M,extended_l2=on,'
+                        'lazy_refcounts=on,refcount_bits=16',
+                        img, str(IMG_SIZE))
+
+        # Run qemu-io with async AIO. --cache=none and --aio=native ensure
+        # the writer coroutine actually yields around its data I/O (which
+        # is what opens the race window). Swallow stdout/stderr: the
+        # result we care about is the on-disk state, checked below.
+        args = qemu_io_wrap_args(['-f', 'qcow2', '-n',
+                                  '--cache=none', '--aio=native', img])
+        subprocess.run(args, input=build_commands(),
+                       stdout=subprocess.DEVNULL,
+                       stderr=subprocess.DEVNULL,
+                       check=True)
+
+        result = qemu_img_check(img)
+        corruptions = result.get('corruptions', 0)
+        check_errors = result.get('check-errors', 0)
+        if corruptions or check_errors:
+            iotests.log(f'FAIL: qemu-img check reports '
+                        f'corruptions={corruptions} '
+                        f'check-errors={check_errors}')
+        else:
+            iotests.log('OK')
+
+
+if __name__ == '__main__':
+    main()
diff --git a/tests/qemu-iotests/tests/discard-write-serialisation.out b/tests/qemu-iotests/tests/discard-write-serialisation.out
new file mode 100644
index 0000000000..d86bac9de5
--- /dev/null
+++ b/tests/qemu-iotests/tests/discard-write-serialisation.out
@@ -0,0 +1 @@
+OK
-- 
2.51.0
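
For readers who want to see the interleaving from patch 1/2 without building QEMU, the following is a toy asyncio model of it. It is only a sketch under loud assumptions: ToyImage, its fields, and run() are hypothetical names invented here, the image has a single guest cluster, and "serialising" is simplified to "wait until no write is in flight" rather than QEMU's overlap-based tracked_request logic. None of this is QEMU code.

```python
import asyncio


class ToyImage:
    """Hypothetical stand-in for a qcow2 image: one L2 map, refcounts, one lock."""

    def __init__(self):
        self.lock = asyncio.Lock()          # plays the role of s->lock
        self.l2 = {0: 5}                    # guest cluster 0 -> host cluster 5
        self.refcount = {5: 1}
        self.in_flight = 0                  # number of writes not yet committed
        self.idle = asyncio.Event()         # set while no write is in flight
        self.idle.set()
        self.in_data_io = asyncio.Event()   # set once the writer dropped the lock

    async def write(self, guest):
        self.in_flight += 1
        self.idle.clear()
        try:
            async with self.lock:
                host = self.l2[guest]       # remember the mapping (the "l2meta")
            self.in_data_io.set()
            await asyncio.sleep(0.01)       # data I/O with the lock dropped
            async with self.lock:
                self.l2[guest] = host       # unconditional commit of stale state
        finally:
            self.in_flight -= 1
            if self.in_flight == 0:
                self.idle.set()

    async def discard(self, guest, serialising):
        if serialising:
            await self.idle.wait()          # simplified: wait for in-flight writes
        async with self.lock:
            host = self.l2.pop(guest, None)
            if host is not None:
                self.refcount[host] -= 1    # cluster is freed


async def run(serialising):
    """Return the L2 entries that point at a host cluster with refcount 0."""
    img = ToyImage()
    writer = asyncio.create_task(img.write(0))
    await img.in_data_io.wait()             # discard arrives inside the window
    await img.discard(0, serialising)
    await writer
    return sorted((g, h) for g, h in img.l2.items() if img.refcount[h] == 0)


if __name__ == "__main__":
    print("non-serialising discard:", asyncio.run(run(False)))
    print("serialising discard:    ", asyncio.run(run(True)))
```

In this model the non-serialising run leaves an L2 entry referencing a freed host cluster, which corresponds to the "refcount=0 with a live reference" fingerprint; the serialising run lets the writer commit first, so the discard then removes a consistent mapping.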