When mirroring, the goal is to ensure that the destination reads the
same as the source; this goal is met whether the destination is sparse
or fully-allocated. However, if the destination cannot efficiently
write zeroes, then any time the mirror operation wants to copy zeroes
from the source to the destination (either during the background pass over
sparse regions when doing a full mirror, or in the foreground when the
guest actively writes zeroes), we were causing the destination to
fully allocate that portion of the disk, even if it already read as
zeroes.
The effect is especially pronounced when the source is a raw file.
That's because when the source is a qcow2 file, the dirty bitmap is
only populated for the portions of the source that are allocated, which
tend to be non-zero. But when the source is a raw file,
bdrv_co_is_allocated_above() reports the entire file as allocated, so
mirror_dirty_init() sets the entire dirty bitmap; only later, during
mirror_iteration(), do we consult the more precise
bdrv_co_block_status_above() to learn where the source reads as zero.
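
For orientation, the pre-patch scan in mirror_dirty_init() looks
roughly like the following simplified sketch (the helper name is only
for illustration, and locking, throttling, and cancellation checks are
omitted; this is not the verbatim code in block/mirror.c):

static int coroutine_fn mirror_dirty_init_scan(MirrorBlockJob *s,
                                               BlockDriverState *bs)
{
    int64_t offset = 0;

    while (offset < s->bdev_length) {
        int64_t count;
        int64_t bytes = MIN(s->bdev_length - offset,
                            QEMU_ALIGN_DOWN(INT_MAX, s->granularity));
        int ret = bdrv_co_is_allocated_above(bs, s->base_overlay, true,
                                             offset, bytes, &count);
        if (ret < 0) {
            return ret;
        }
        if (ret) {
            /* A raw source reports the whole file as allocated, so
             * every cluster of the image is marked dirty here. */
            bdrv_set_dirty_bitmap(s->dirty_bitmap, offset, count);
        }
        offset += count;
    }
    return 0;
}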
Remember that since a mirror operation can write a cluster more than
once (every time the guest changes the source, the destination is also
changed to keep up), we can't take the shortcut of relying on
s->zero_target (which is static for the life of the job) in
mirror_co_zero() to see if the destination is already zero, because
that information may be stale. Any solution we use must be dynamic in
the face of the guest writing or discarding a cluster while the mirror
is ongoing.
We could just teach mirror_co_zero() to do a block_status() probe of
the destination, and skip the zeroes if the destination already reads
as zero, but we know from past experience that extra block_status()
calls are not always cheap (tmpfs, anyone?), especially when they are
random access rather than linear. Use of block_status() of the source
by the background task in a linear fashion is not our bottleneck (it's
a background task, after all); but since mirroring can be done while
the source is actively being changed, we don't want a slow
block_status() of the destination to occur on the hot path of the
guest trying to do random-access writes to the source.
So this patch takes a slightly different approach: any time we have to
transfer the full image, we know that mirror_dirty_init() is _already_
doing a pre-zero pass over the entire destination. Therefore, if we
track which clusters of the destination are zero at any given moment,
we don't have to do a block_status() call on the destination, but can
instead just refer to the zero bitmap associated with the job.
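
Concretely, the bookkeeping boils down to three small operations on the
per-job bitmap, using the existing helpers from qemu/bitmap.h. This is
a condensed sketch of what the patch does; the helper names are only
for illustration, and the end-of-image alignment special case is
omitted:

/* Guest data or a discard invalidates our knowledge of the range. */
static void zero_bitmap_forget(MirrorBlockJob *s, int64_t offset,
                               uint64_t bytes)
{
    bitmap_clear(s->zero_bitmap, offset / s->granularity,
                 DIV_ROUND_UP(bytes, s->granularity));
}

/* True if every cluster in the range is already known to read as zero,
 * in which case blk_co_pwrite_zeroes() can be skipped. */
static bool zero_bitmap_covers(MirrorBlockJob *s, int64_t offset,
                               uint64_t bytes)
{
    unsigned long start = offset / s->granularity;
    unsigned long end = DIV_ROUND_UP(offset + bytes, s->granularity);

    return find_next_zero_bit(s->zero_bitmap, end, start) == end;
}

/* After zeroes were written successfully, remember the range is zero. */
static void zero_bitmap_record(MirrorBlockJob *s, int64_t offset,
                               uint64_t bytes)
{
    bitmap_set(s->zero_bitmap, offset / s->granularity,
               DIV_ROUND_UP(bytes, s->granularity));
}

The copy and discard paths only ever clear bits, while the write-zeroes
paths test and then set them, so skipping a write never leaves the
bitmap claiming more than the destination actually contains.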
With this patch, if I create a raw sparse destination file, connect it
with QMP 'blockdev-add' while leaving it at the default "discard":
"ignore", then run QMP 'blockdev-mirror' with "sync": "full", the
destination remains sparse rather than fully allocated. Meanwhile, a
destination image that is already fully allocated remains so unless it
was opened with "detect-zeroes": "unmap". If writing zeroes is
skipped, the job counters are not incremented.
Signed-off-by: Eric Blake <eblake@redhat.com>
---
v3: also skip counters when skipping I/O [Sunny]
---
block/mirror.c | 92 ++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 81 insertions(+), 11 deletions(-)
diff --git a/block/mirror.c b/block/mirror.c
index 4059bf96854..69a02dfc2b7 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -73,6 +73,7 @@ typedef struct MirrorBlockJob {
size_t buf_size;
int64_t bdev_length;
unsigned long *cow_bitmap;
+ unsigned long *zero_bitmap;
BdrvDirtyBitmap *dirty_bitmap;
BdrvDirtyBitmapIter *dbi;
uint8_t *buf;
@@ -108,9 +109,12 @@ struct MirrorOp {
int64_t offset;
uint64_t bytes;
- /* The pointee is set by mirror_co_read(), mirror_co_zero(), and
- * mirror_co_discard() before yielding for the first time */
+ /*
+ * These pointers are set by mirror_co_read(), mirror_co_zero(), and
+ * mirror_co_discard() before yielding for the first time
+ */
int64_t *bytes_handled;
+ bool *io_skipped;
bool is_pseudo_op;
bool is_active_write;
@@ -408,15 +412,34 @@ static void coroutine_fn mirror_co_read(void *opaque)
static void coroutine_fn mirror_co_zero(void *opaque)
{
MirrorOp *op = opaque;
- int ret;
+ bool write_needed = true;
+ int ret = 0;
op->s->in_flight++;
op->s->bytes_in_flight += op->bytes;
*op->bytes_handled = op->bytes;
op->is_in_flight = true;
- ret = blk_co_pwrite_zeroes(op->s->target, op->offset, op->bytes,
- op->s->unmap ? BDRV_REQ_MAY_UNMAP : 0);
+ if (op->s->zero_bitmap) {
+ unsigned long end = DIV_ROUND_UP(op->offset + op->bytes,
+ op->s->granularity);
+ assert(QEMU_IS_ALIGNED(op->offset, op->s->granularity));
+ assert(QEMU_IS_ALIGNED(op->bytes, op->s->granularity) ||
+ op->offset + op->bytes == op->s->bdev_length);
+ if (find_next_zero_bit(op->s->zero_bitmap, end,
+ op->offset / op->s->granularity) == end) {
+ write_needed = false;
+ *op->io_skipped = true;
+ }
+ }
+ if (write_needed) {
+ ret = blk_co_pwrite_zeroes(op->s->target, op->offset, op->bytes,
+ op->s->unmap ? BDRV_REQ_MAY_UNMAP : 0);
+ }
+ if (ret >= 0 && op->s->zero_bitmap) {
+ bitmap_set(op->s->zero_bitmap, op->offset / op->s->granularity,
+ DIV_ROUND_UP(op->bytes, op->s->granularity));
+ }
mirror_write_complete(op, ret);
}
@@ -435,29 +458,43 @@ static void coroutine_fn mirror_co_discard(void *opaque)
}
static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
- unsigned bytes, MirrorMethod mirror_method)
+ unsigned bytes, MirrorMethod mirror_method,
+ bool *io_skipped)
{
MirrorOp *op;
Coroutine *co;
int64_t bytes_handled = -1;
+ assert(QEMU_IS_ALIGNED(offset, s->granularity));
+ assert(QEMU_IS_ALIGNED(bytes, s->granularity) ||
+ offset + bytes == s->bdev_length);
op = g_new(MirrorOp, 1);
*op = (MirrorOp){
.s = s,
.offset = offset,
.bytes = bytes,
.bytes_handled = &bytes_handled,
+ .io_skipped = io_skipped,
};
qemu_co_queue_init(&op->waiting_requests);
switch (mirror_method) {
case MIRROR_METHOD_COPY:
+ if (s->zero_bitmap) {
+ bitmap_clear(s->zero_bitmap, offset / s->granularity,
+ DIV_ROUND_UP(bytes, s->granularity));
+ }
co = qemu_coroutine_create(mirror_co_read, op);
break;
case MIRROR_METHOD_ZERO:
+ /* s->zero_bitmap handled in mirror_co_zero */
co = qemu_coroutine_create(mirror_co_zero, op);
break;
case MIRROR_METHOD_DISCARD:
+ if (s->zero_bitmap) {
+ bitmap_clear(s->zero_bitmap, offset / s->granularity,
+ DIV_ROUND_UP(bytes, s->granularity));
+ }
co = qemu_coroutine_create(mirror_co_discard, op);
break;
default:
@@ -568,6 +605,7 @@ static void coroutine_fn GRAPH_UNLOCKED mirror_iteration(MirrorBlockJob *s)
int ret = -1;
int64_t io_bytes;
int64_t io_bytes_acct;
+ bool io_skipped = false;
MirrorMethod mirror_method = MIRROR_METHOD_COPY;
assert(!(offset % s->granularity));
@@ -611,8 +649,10 @@ static void coroutine_fn GRAPH_UNLOCKED mirror_iteration(MirrorBlockJob *s)
}
io_bytes = mirror_clip_bytes(s, offset, io_bytes);
- io_bytes = mirror_perform(s, offset, io_bytes, mirror_method);
- if (mirror_method != MIRROR_METHOD_COPY && write_zeroes_ok) {
+ io_bytes = mirror_perform(s, offset, io_bytes, mirror_method,
+ &io_skipped);
+ if (io_skipped ||
+ (mirror_method != MIRROR_METHOD_COPY && write_zeroes_ok)) {
io_bytes_acct = 0;
} else {
io_bytes_acct = io_bytes;
@@ -849,6 +889,8 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
bdrv_graph_co_rdunlock();
if (s->zero_target) {
+ int64_t bitmap_length = DIV_ROUND_UP(s->bdev_length, s->granularity);
+
offset = 0;
bdrv_graph_co_rdlock();
ret = bdrv_co_is_all_zeroes(target_bs);
@@ -856,6 +898,7 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
if (ret < 0) {
return ret;
}
+ s->zero_bitmap = bitmap_new(bitmap_length);
/*
* If the destination already reads as zero, and we are not
* requested to punch holes into existing zeroes, then we can
@@ -864,6 +907,7 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
if (ret > 0 &&
(target_bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP ||
!bdrv_can_write_zeroes_with_unmap(target_bs))) {
+ bitmap_set(s->zero_bitmap, 0, bitmap_length);
offset = s->bdev_length;
}
if (!bdrv_can_write_zeroes_with_unmap(target_bs)) {
@@ -875,6 +919,7 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
while (offset < s->bdev_length) {
int bytes = MIN(s->bdev_length - offset,
QEMU_ALIGN_DOWN(INT_MAX, s->granularity));
+ bool ignored;
mirror_throttle(s);
@@ -890,7 +935,7 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
continue;
}
- mirror_perform(s, offset, bytes, MIRROR_METHOD_ZERO);
+ mirror_perform(s, offset, bytes, MIRROR_METHOD_ZERO, &ignored);
offset += bytes;
}
@@ -1180,6 +1225,7 @@ immediate_exit:
assert(s->in_flight == 0);
qemu_vfree(s->buf);
g_free(s->cow_bitmap);
+ g_free(s->zero_bitmap);
g_free(s->in_flight_bitmap);
bdrv_dirty_iter_free(s->dbi);
@@ -1359,6 +1405,7 @@ do_sync_target_write(MirrorBlockJob *job, MirrorMethod method,
int ret;
size_t qiov_offset = 0;
int64_t dirty_bitmap_offset, dirty_bitmap_end;
+ int64_t zero_bitmap_offset, zero_bitmap_end;
if (!QEMU_IS_ALIGNED(offset, job->granularity) &&
bdrv_dirty_bitmap_get(job->dirty_bitmap, offset))
@@ -1402,8 +1449,9 @@ do_sync_target_write(MirrorBlockJob *job, MirrorMethod method,
}
/*
- * Tails are either clean or shrunk, so for bitmap resetting
- * we safely align the range down.
+ * Tails are either clean or shrunk, so for dirty bitmap resetting
+ * we safely align the range narrower. But for zero bitmap, round
+ * range wider for checking or clearing, and narrower for setting.
*/
dirty_bitmap_offset = QEMU_ALIGN_UP(offset, job->granularity);
dirty_bitmap_end = QEMU_ALIGN_DOWN(offset + bytes, job->granularity);
@@ -1411,22 +1459,44 @@ do_sync_target_write(MirrorBlockJob *job, MirrorMethod method,
bdrv_reset_dirty_bitmap(job->dirty_bitmap, dirty_bitmap_offset,
dirty_bitmap_end - dirty_bitmap_offset);
}
+ zero_bitmap_offset = offset / job->granularity;
+ zero_bitmap_end = DIV_ROUND_UP(offset + bytes, job->granularity);
job_progress_increase_remaining(&job->common.job, bytes);
job->active_write_bytes_in_flight += bytes;
switch (method) {
case MIRROR_METHOD_COPY:
+ if (job->zero_bitmap) {
+ bitmap_clear(job->zero_bitmap, zero_bitmap_offset,
+ zero_bitmap_end - zero_bitmap_offset);
+ }
ret = blk_co_pwritev_part(job->target, offset, bytes,
qiov, qiov_offset, flags);
break;
case MIRROR_METHOD_ZERO:
+ if (job->zero_bitmap) {
+ if (find_next_zero_bit(job->zero_bitmap, zero_bitmap_end,
+ zero_bitmap_offset) == zero_bitmap_end) {
+ ret = 0;
+ break;
+ }
+ }
assert(!qiov);
ret = blk_co_pwrite_zeroes(job->target, offset, bytes, flags);
+ if (job->zero_bitmap && ret >= 0) {
+ bitmap_set(job->zero_bitmap, dirty_bitmap_offset / job->granularity,
+ (dirty_bitmap_end - dirty_bitmap_offset) /
+ job->granularity);
+ }
break;
case MIRROR_METHOD_DISCARD:
+ if (job->zero_bitmap) {
+ bitmap_clear(job->zero_bitmap, zero_bitmap_offset,
+ zero_bitmap_end - zero_bitmap_offset);
+ }
assert(!qiov);
ret = blk_co_pdiscard(job->target, offset, bytes);
break;
--
2.49.0
on Thu 24 Apr 2025 19:52:08 -0500, Eric wrote:
> if (s->zero_target) {
> + int64_t bitmap_length = DIV_ROUND_UP(s->bdev_length, s->granularity);
> +
> offset = 0;
> bdrv_graph_co_rdlock();
> ret = bdrv_co_is_all_zeroes(target_bs);
> @@ -856,6 +898,7 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
> if (ret < 0) {
> return ret;
> }
> + s->zero_bitmap = bitmap_new(bitmap_length);
> /*
> * If the destination already reads as zero, and we are not
> * requested to punch holes into existing zeroes, then we can
> @@ -864,6 +907,7 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
> if (ret > 0 &&
> (target_bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP ||
> !bdrv_can_write_zeroes_with_unmap(target_bs))) {
> + bitmap_set(s->zero_bitmap, 0, bitmap_length);

when arg->mode != NEW_IMAGE_MODE_EXISTING && bdrv_has_zero_init(target_bs) is true
in drive_mirror (This means the target image is newly created), in which case
s->zero_target == false, we still need to execute bitmap_set(s->zero_bitmap, 0, bitmap_length)

> offset = s->bdev_length;
> }
> if (!bdrv_can_write_zeroes_with_unmap(target_bs)) {
> @@ -875,6 +919,7 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
> while (offset < s->bdev_length) {
> int bytes = MIN(s->bdev_length - offset,
> QEMU_ALIGN_DOWN(INT_MAX, s->granularity));
On Thu, May 01, 2025 at 12:38:30AM +0800, Sunny Zhu wrote:
> on Thu 24 Apr 2025 19:52:08 -0500, Eric wrote:
> > if (s->zero_target) {
> > + int64_t bitmap_length = DIV_ROUND_UP(s->bdev_length, s->granularity);
> > +
> > offset = 0;
> > bdrv_graph_co_rdlock();
> > ret = bdrv_co_is_all_zeroes(target_bs);
> > @@ -856,6 +898,7 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
> > if (ret < 0) {
> > return ret;
> > }
> > + s->zero_bitmap = bitmap_new(bitmap_length);
> > /*
> > * If the destination already reads as zero, and we are not
> > * requested to punch holes into existing zeroes, then we can
> > @@ -864,6 +907,7 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
> > if (ret > 0 &&
> > (target_bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP ||
> > !bdrv_can_write_zeroes_with_unmap(target_bs))) {
> > + bitmap_set(s->zero_bitmap, 0, bitmap_length);
>
> when arg->mode != NEW_IMAGE_MODE_EXISTING && bdrv_has_zero_init(target_bs) is true
> in drive_mirror (This means the target image is newly created), in which case
> s->zero_target == false, we still need to execute bitmap_set(s->zero_bitmap, 0, bitmap_length)

Good catch.  I will fix that in v4.

Now that I'm thinking a bit more about it, I wonder if s->zero_target
is the right name.  We have several pieces of information feeding the
potential optimizations: are we mirroring the full image or just a
portion of it (if just a portion, pre-zeroing is wrong because we
shouldn't touch the part of the image not being mirrored), did we just
create the image (in which case it reads as zero and we can skip
pre-zeroing), did the user pass target-is-zero (later in this series,
same effect as if we just created the image), is punching zeroes
allowed (not all setups allow it; and even when it is allowed, there
are tradeoffs on whether to punch one big hole and then fill back up
or to only punch small holes as zeroes are encountered).

It's a big mess of conditionals, so I'm still trying to figure out if
there is a more sensible way to arrange the various logic bits.  With
the addition of target-is-zero, maybe it makes more sense to rename
s->zero_target to s->full_image, and to set s->full_image=true
s->target_is_zero in the arg->mode != NEW_IMAGE_MODE_EXISTING &&
bdrv_has_zero_init(target) case.

>
> > offset = s->bdev_length;
> > }
> > if (!bdrv_can_write_zeroes_with_unmap(target_bs)) {
> > @@ -875,6 +919,7 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
> > while (offset < s->bdev_length) {
> > int bytes = MIN(s->bdev_length - offset,
> > QEMU_ALIGN_DOWN(INT_MAX, s->granularity));

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org
On Thu, 1 May 2025 12:58:42 -0500, Eric wrote:
> On Thu, May 01, 2025 at 12:38:30AM +0800, Sunny Zhu wrote:
> > on Thu 24 Apr 2025 19:52:08 -0500, Eric wrote:
> > > if (s->zero_target) {
> > > + int64_t bitmap_length = DIV_ROUND_UP(s->bdev_length, s->granularity);
> > > +
> > > offset = 0;
> > > bdrv_graph_co_rdlock();
> > > ret = bdrv_co_is_all_zeroes(target_bs);
> > > @@ -856,6 +898,7 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
> > > if (ret < 0) {
> > > return ret;
> > > }
> > > + s->zero_bitmap = bitmap_new(bitmap_length);
> > > /*
> > > * If the destination already reads as zero, and we are not
> > > * requested to punch holes into existing zeroes, then we can
> > > @@ -864,6 +907,7 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
> > > if (ret > 0 &&
> > > (target_bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP ||
> > > !bdrv_can_write_zeroes_with_unmap(target_bs))) {
> > > + bitmap_set(s->zero_bitmap, 0, bitmap_length);
> >
> > when arg->mode != NEW_IMAGE_MODE_EXISTING && bdrv_has_zero_init(target_bs) is true
> > in drive_mirror (This means the target image is newly created), in which case
> > s->zero_target == false, we still need to execute bitmap_set(s->zero_bitmap, 0, bitmap_length)
>
> Good catch.  I will fix that in v4.
>
> Now that I'm thinking a bit more about it, I wonder if s->zero_target
> is the right name.  We have several pieces of information feeding the
> potential optimizations: are we mirroring the full image or just a
> portion of it (if just a portion, pre-zeroing is wrong because we
> shouldn't touch the part of the image not being mirrored), did we just
> create the image (in which case it reads as zero and we can skip
> pre-zeroing), did the user pass target-is-zero (later in this series,
> same effect as if we just created the image), is punching zeroes
> allowed (not all setups allow it; and even when it is allowed, there
> are tradeoffs on whether to punch one big hole and then fill back up
> or to only punch small holes as zeroes are encountered).
>
> It's a big mess of conditionals, so I'm still trying to figure out if
> there is a more sensible way to arrange the various logic bits.  With
> the addition of target-is-zero, maybe it makes more sense to rename
> s->zero_target to s->full_image, and to set s->full_image=true
> s->target_is_zero in the arg->mode != NEW_IMAGE_MODE_EXISTING &&
> bdrv_has_zero_init(target) case.

Correct. The pre-zero and skip-zero optimizations are only applicable
when sync == MIRROR_SYNC_MODE_FULL. In MIRROR_SYNC_MODE_TOP scenarios,
where the source snapshot chain (excluding the top snapshot) is
pre-copied to the target, the destination image is not an all-zero
image. Consequently, we can neither perform pre-zeroing nor skip-zero
operations in this case.

if (sync == MIRROR_SYNC_MODE_TOP) {
    /* No optimization is needed. */
} else { /* sync == MIRROR_SYNC_MODE_FULL */
    if (mode == NEW_IMAGE_MODE_EXISTING) {
        if (target_is_zero) {
            bitmap_set(zero_bitmap);
        } else {
            pre_zeroing;
            bitmap_set(zero_bitmap);
        }
    } else {
        if (bdrv_has_zero_init(target_bs)) {
            bitmap_set(zero_bitmap);
        } else {
            pre_zeroing;
            bitmap_set(zero_bitmap);
        }
    }
}

We list all possible scenarios, where some code branches can be merged
during implementation.