From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
To: qemu-block@nongnu.org
Cc: qemu-devel@nongnu.org, kwolf@redhat.com, hreitz@redhat.com,
    andrey.drobyshev@virtuozzo.com, den@virtuozzo.com
Subject: [PATCH 4/6] qemu-img: rebase: avoid unnecessary COW operations
Date: Thu, 1 Jun 2023 22:28:34 +0300
Message-Id: <20230601192836.598602-5-andrey.drobyshev@virtuozzo.com>
In-Reply-To: <20230601192836.598602-1-andrey.drobyshev@virtuozzo.com>
References: <20230601192836.598602-1-andrey.drobyshev@virtuozzo.com>

When rebasing an image from one backing file to another, we need to
compare data from old and new backings.
If the diff between that data
happens to be unaligned to the target cluster size, we might end up
doing partial writes, which would lead to copy-on-write and additional IO.

Consider the following simple case (virtual_size == cluster_size == 64K):

base <-- inc1 <-- inc2

qemu-io -c "write -P 0xaa 0 32K" base.qcow2
qemu-io -c "write -P 0xcc 32K 32K" base.qcow2
qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2
qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2
qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2

While doing the rebase, we'll write half of the cluster to inc2, and the
block layer will have to read the 2nd half of the same cluster from the
backing image inc1 while doing this write operation, although the whole
cluster was already read earlier to perform the data comparison.

In order to avoid these unnecessary IO cycles, let's make sure every
write request is aligned to the overlay cluster size.

Signed-off-by: Andrey Drobyshev
Reviewed-by: Denis V. Lunev
---
 qemu-img.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 52 insertions(+), 20 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 60f4c06487..9a469cd609 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3513,6 +3513,7 @@ static int img_rebase(int argc, char **argv)
     uint8_t *buf_new = NULL;
     BlockDriverState *bs = NULL, *prefix_chain_bs = NULL;
     BlockDriverState *unfiltered_bs;
+    BlockDriverInfo bdi = {0};
     char *filename;
     const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg;
     int c, flags, src_flags, ret;
@@ -3646,6 +3647,15 @@ static int img_rebase(int argc, char **argv)
         }
     }
 
+    /* We need overlay cluster size to make sure write requests are aligned */
+    ret = bdrv_get_info(unfiltered_bs, &bdi);
+    if (ret < 0) {
+        error_report("could not get block driver info");
+        goto out;
+    } else if (bdi.cluster_size == 0) {
+        bdi.cluster_size = 1;
+    }
+
     /* For safe rebasing we need to compare old and new backing file */
     if (!unsafe) {
         QDict *options = NULL;
@@ -3744,6 +3754,7 @@ static int img_rebase(int argc, char **argv)
         int64_t new_backing_size = 0;
         uint64_t offset;
         int64_t n;
+        int64_t n_old = 0, n_new = 0;
         float local_progress = 0;
 
         buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
@@ -3784,7 +3795,7 @@ static int img_rebase(int argc, char **argv)
         }
 
         for (offset = 0; offset < size; offset += n) {
-            bool buf_old_is_zero = false;
+            bool old_backing_eof = false;
 
             /* How many bytes can we handle with the next read? */
             n = MIN(IO_BUF_SIZE, size - offset);
@@ -3829,33 +3840,38 @@ static int img_rebase(int argc, char **argv)
                 }
             }
 
+            /* At this point n must be aligned to the target cluster size. */
+            if (offset + n < size) {
+                assert(n % bdi.cluster_size == 0);
+            }
+
+            /*
+             * Much like with the target image, we'll try to read as much
+             * of the old and new backings as we can.
+             */
+            n_old = MIN(n, MAX(0, old_backing_size - (int64_t) offset));
+            if (blk_new_backing) {
+                n_new = MIN(n, MAX(0, new_backing_size - (int64_t) offset));
+            }
+
             /*
              * Read old and new backing file and take into consideration that
              * backing files may be smaller than the COW image.
              */
-            if (offset >= old_backing_size) {
-                memset(buf_old, 0, n);
-                buf_old_is_zero = true;
+            memset(buf_old + n_old, 0, n - n_old);
+            if (!n_old) {
+                old_backing_eof = true;
             } else {
-                if (offset + n > old_backing_size) {
-                    n = old_backing_size - offset;
-                }
-
-                ret = blk_pread(blk_old_backing, offset, n, buf_old, 0);
+                ret = blk_pread(blk_old_backing, offset, n_old, buf_old, 0);
                 if (ret < 0) {
                     error_report("error while reading from old backing file");
                     goto out;
                 }
             }
 
-            if (offset >= new_backing_size || !blk_new_backing) {
-                memset(buf_new, 0, n);
-            } else {
-                if (offset + n > new_backing_size) {
-                    n = new_backing_size - offset;
-                }
-
-                ret = blk_pread(blk_new_backing, offset, n, buf_new, 0);
+            memset(buf_new + n_new, 0, n - n_new);
+            if (blk_new_backing && n_new) {
+                ret = blk_pread(blk_new_backing, offset, n_new, buf_new, 0);
                 if (ret < 0) {
                     error_report("error while reading from new backing file");
                     goto out;
@@ -3867,15 +3883,28 @@ static int img_rebase(int argc, char **argv)
 
             while (written < n) {
                 int64_t pnum;
+                int64_t start, end;
 
                 if (compare_buffers(buf_old + written, buf_new + written,
                                     n - written, &pnum))
                 {
-                    if (buf_old_is_zero) {
+                    if (old_backing_eof) {
                         ret = blk_pwrite_zeroes(blk, offset + written, pnum, 0);
                     } else {
-                        ret = blk_pwrite(blk, offset + written, pnum,
-                                         buf_old + written, 0);
+                        /*
+                         * If we've got to this point, it means the cluster
+                         * we're dealing with is unallocated, and any partial
+                         * write will cause COW.  To avoid that, we make sure
+                         * the request is aligned to cluster size.
+                         */
+                        start = QEMU_ALIGN_DOWN(offset + written,
+                                                bdi.cluster_size);
+                        end = QEMU_ALIGN_UP(offset + written + pnum,
+                                            bdi.cluster_size);
+                        end = end > size ? size : end;
+                        ret = blk_pwrite(blk, start, end - start,
+                                         buf_old + (start - offset), 0);
+                        pnum = end - (offset + written);
                     }
                     if (ret < 0) {
                         error_report("Error while writing to COW image: %s",
@@ -3885,6 +3914,9 @@ static int img_rebase(int argc, char **argv)
                 }
 
                 written += pnum;
+                if (offset + written >= old_backing_size) {
+                    old_backing_eof = true;
+                }
             }
             qemu_progress_print(local_progress, 100);
         }
-- 
2.31.1
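[Editor's note, not part of the patch] The alignment arithmetic the patch adds, namely QEMU_ALIGN_DOWN/QEMU_ALIGN_UP on the dirty range clamped to the image size, can be modeled in a few lines. This is a sketch: the helper names (align_down, align_up, aligned_write_range) are hypothetical and only mirror the start/end/pnum computation in the diff, they are not QEMU code.

```python
def align_down(x, a):
    # Round x down to a multiple of a (model of QEMU_ALIGN_DOWN).
    return (x // a) * a

def align_up(x, a):
    # Round x up to a multiple of a (model of QEMU_ALIGN_UP).
    return ((x + a - 1) // a) * a

def aligned_write_range(offset, written, pnum, cluster_size, size):
    """Expand a partial-cluster dirty range [offset+written,
    offset+written+pnum) to full cluster boundaries, clamping the end
    at the image size, and return (start, end, new_pnum), where
    new_pnum is how far 'written' should advance."""
    start = align_down(offset + written, cluster_size)
    end = min(align_up(offset + written + pnum, cluster_size), size)
    new_pnum = end - (offset + written)
    return start, end, new_pnum

# The commit message scenario: a 32K dirty half of a 64K cluster gets
# widened to the full 64K cluster, so the write causes no COW read.
print(aligned_write_range(0, 0, 32 * 1024, 64 * 1024, 64 * 1024))
```

For the 32K-of-64K case above the range widens to the whole cluster (start 0, end 65536), so the block layer never has to read the other half of the cluster during the write.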