[PATCH 3/6] mirror: Skip writing zeroes when target is already zero

Eric Blake posted 6 patches 5 months, 3 weeks ago
Maintainers: Kevin Wolf <kwolf@redhat.com>, Hanna Reitz <hreitz@redhat.com>, John Snow <jsnow@redhat.com>, Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>, Stefan Hajnoczi <stefanha@redhat.com>, Fam Zheng <fam@euphon.net>, Ronnie Sahlberg <ronniesahlberg@gmail.com>, Paolo Bonzini <pbonzini@redhat.com>, Peter Lieven <pl@dlhnet.de>, Eric Blake <eblake@redhat.com>, "Denis V. Lunev" <den@openvz.org>, Alberto Garcia <berto@igalia.com>, Ilya Dryomov <idryomov@gmail.com>, Stefan Weil <sw@weilnetz.de>, Markus Armbruster <armbru@redhat.com>
There is a newer version of this series
[PATCH 3/6] mirror: Skip writing zeroes when target is already zero
Posted by Eric Blake 5 months, 3 weeks ago
When mirroring, the goal is to ensure that the destination reads the
same as the source; this goal is met whether the destination is sparse
or fully-allocated.  However, if the destination cannot efficiently
write zeroes, then any time the mirror operation wants to copy zeroes
from the source to the destination (either in the background over
sparse regions when doing a full mirror, or in the foreground when the
guest actively writes zeroes), we were causing the destination to
fully allocate that portion of the disk, even if it already read as
zeroes.

We could just teach mirror_co_zero() to do a block_status() probe of
the destination, and skip the zeroes if the destination already reads
as zero, but we know from past experience that block_status() calls
are not always cheap (tmpfs, anyone?).  So this patch takes a slightly
different approach: any time we have to transfer the full image,
mirror_dirty_init() is _already_ doing a pre-zero pass over the entire
destination.  Therefore, if we track which clusters of the destination
are zero at any given moment, we don't have to do a block_status()
call on the destination, but can instead just refer to the zero bitmap
associated with the job.

With this patch, if I externally create a raw sparse destination file
('truncate --size=$N dst.raw'), connect it with QMP 'blockdev-add'
while leaving it at the default "discard":"ignore", then run QMP
'blockdev-mirror' with "sync":"full", the destination remains sparse
rather than fully allocated.

However, a raw destination file created with 'blockdev-create' still
gets fully allocated, because more work is needed in file-posix to
still identify reads-as-zeroes even when the first 4k has to be
allocated to make alignment probing work.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/mirror.c | 94 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 78 insertions(+), 16 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 2e1e14c8e7e..98da5a6dc27 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -73,6 +73,7 @@ typedef struct MirrorBlockJob {
     size_t buf_size;
     int64_t bdev_length;
     unsigned long *cow_bitmap;
+    unsigned long *zero_bitmap;
     BdrvDirtyBitmap *dirty_bitmap;
     BdrvDirtyBitmapIter *dbi;
     uint8_t *buf;
@@ -408,15 +409,32 @@ static void coroutine_fn mirror_co_read(void *opaque)
 static void coroutine_fn mirror_co_zero(void *opaque)
 {
     MirrorOp *op = opaque;
-    int ret;
+    bool write_needed = true;
+    int ret = 0;

     op->s->in_flight++;
     op->s->bytes_in_flight += op->bytes;
     *op->bytes_handled = op->bytes;
     op->is_in_flight = true;

-    ret = blk_co_pwrite_zeroes(op->s->target, op->offset, op->bytes,
-                               op->s->unmap ? BDRV_REQ_MAY_UNMAP : 0);
+    if (op->s->zero_bitmap) {
+        unsigned long last = (op->offset + op->bytes) / op->s->granularity;
+        assert(QEMU_IS_ALIGNED(op->offset, op->s->granularity));
+        assert(QEMU_IS_ALIGNED(op->bytes, op->s->granularity) ||
+               op->offset + op->bytes == op->s->bdev_length);
+        if (find_next_zero_bit(op->s->zero_bitmap, last,
+                               op->offset / op->s->granularity) == last) {
+            write_needed = false;
+        }
+    }
+    if (write_needed) {
+        ret = blk_co_pwrite_zeroes(op->s->target, op->offset, op->bytes,
+                                   op->s->unmap ? BDRV_REQ_MAY_UNMAP : 0);
+    }
+    if (ret >= 0 && op->s->zero_bitmap) {
+        bitmap_set(op->s->zero_bitmap, op->offset / op->s->granularity,
+                   op->bytes / op->s->granularity);
+    }
     mirror_write_complete(op, ret);
 }

@@ -441,6 +459,9 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
     Coroutine *co;
     int64_t bytes_handled = -1;

+    assert(QEMU_IS_ALIGNED(offset, s->granularity));
+    assert(QEMU_IS_ALIGNED(bytes, s->granularity) ||
+           offset + bytes == s->bdev_length);
     op = g_new(MirrorOp, 1);
     *op = (MirrorOp){
         .s              = s,
@@ -452,12 +473,21 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,

     switch (mirror_method) {
     case MIRROR_METHOD_COPY:
+        if (s->zero_bitmap) {
+            bitmap_clear(s->zero_bitmap, offset / s->granularity,
+                         bytes / s->granularity);
+        }
         co = qemu_coroutine_create(mirror_co_read, op);
         break;
     case MIRROR_METHOD_ZERO:
+        /* s->zero_bitmap handled in mirror_co_zero */
         co = qemu_coroutine_create(mirror_co_zero, op);
         break;
     case MIRROR_METHOD_DISCARD:
+        if (s->zero_bitmap) {
+            bitmap_clear(s->zero_bitmap, offset / s->granularity,
+                         bytes / s->granularity);
+        }
         co = qemu_coroutine_create(mirror_co_discard, op);
         break;
     default:
@@ -851,10 +881,17 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
     }
     bdrv_graph_co_rdunlock();

-    if (s->zero_target && ret <= 0) {
+    if (s->zero_target) {
+        int64_t length;
+
         if (ret < 0) {
             return ret;
         }
+        length = DIV_ROUND_UP(s->bdev_length, s->granularity);
+        s->zero_bitmap = bitmap_new(length);
+        if (ret > 0) {
+            bitmap_set(s->zero_bitmap, 0, length);
+        }
         if (!bdrv_can_write_zeroes_with_unmap(target_bs)) {
             bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, s->bdev_length);
             return 0;
@@ -1169,6 +1206,7 @@ immediate_exit:
     assert(s->in_flight == 0);
     qemu_vfree(s->buf);
     g_free(s->cow_bitmap);
+    g_free(s->zero_bitmap);
     g_free(s->in_flight_bitmap);
     bdrv_dirty_iter_free(s->dbi);

@@ -1347,7 +1385,8 @@ do_sync_target_write(MirrorBlockJob *job, MirrorMethod method,
 {
     int ret;
     size_t qiov_offset = 0;
-    int64_t bitmap_offset, bitmap_end;
+    int64_t dirty_bitmap_offset, dirty_bitmap_end;
+    int64_t zero_bitmap_offset, zero_bitmap_end;

     if (!QEMU_IS_ALIGNED(offset, job->granularity) &&
         bdrv_dirty_bitmap_get(job->dirty_bitmap, offset))
@@ -1391,31 +1430,54 @@ do_sync_target_write(MirrorBlockJob *job, MirrorMethod method,
     }

     /*
-     * Tails are either clean or shrunk, so for bitmap resetting
-     * we safely align the range down.
+     * Tails are either clean or shrunk, so for dirty bitmap resetting
+     * we safely align the range down.  But for zero bitmap, round range
+     * up for checking or clearing, and down for setting.
      */
-    bitmap_offset = QEMU_ALIGN_UP(offset, job->granularity);
-    bitmap_end = QEMU_ALIGN_DOWN(offset + bytes, job->granularity);
-    if (bitmap_offset < bitmap_end) {
-        bdrv_reset_dirty_bitmap(job->dirty_bitmap, bitmap_offset,
-                                bitmap_end - bitmap_offset);
+    dirty_bitmap_offset = QEMU_ALIGN_UP(offset, job->granularity);
+    dirty_bitmap_end = QEMU_ALIGN_DOWN(offset + bytes, job->granularity);
+    if (dirty_bitmap_offset < dirty_bitmap_end) {
+        bdrv_reset_dirty_bitmap(job->dirty_bitmap, dirty_bitmap_offset,
+                                dirty_bitmap_end - dirty_bitmap_offset);
     }
+    zero_bitmap_offset = offset / job->granularity;
+    zero_bitmap_end = DIV_ROUND_UP(offset + bytes, job->granularity);

     job_progress_increase_remaining(&job->common.job, bytes);
     job->active_write_bytes_in_flight += bytes;

     switch (method) {
     case MIRROR_METHOD_COPY:
+        if (job->zero_bitmap) {
+            bitmap_clear(job->zero_bitmap, zero_bitmap_offset,
+                         zero_bitmap_end - zero_bitmap_offset);
+        }
         ret = blk_co_pwritev_part(job->target, offset, bytes,
                                   qiov, qiov_offset, flags);
         break;

     case MIRROR_METHOD_ZERO:
+        if (job->zero_bitmap) {
+            if (find_next_zero_bit(job->zero_bitmap, zero_bitmap_end,
+                                   zero_bitmap_offset) == zero_bitmap_end) {
+                ret = 0;
+                break;
+            }
+        }
         assert(!qiov);
         ret = blk_co_pwrite_zeroes(job->target, offset, bytes, flags);
+        if (job->zero_bitmap && ret >= 0) {
+            bitmap_set(job->zero_bitmap, dirty_bitmap_offset / job->granularity,
+                       (dirty_bitmap_end - dirty_bitmap_offset) /
+                       job->granularity);
+        }
         break;

     case MIRROR_METHOD_DISCARD:
+        if (job->zero_bitmap) {
+            bitmap_clear(job->zero_bitmap, zero_bitmap_offset,
+                         zero_bitmap_end - zero_bitmap_offset);
+        }
         assert(!qiov);
         ret = blk_co_pdiscard(job->target, offset, bytes);
         break;
@@ -1436,10 +1498,10 @@ do_sync_target_write(MirrorBlockJob *job, MirrorMethod method,
          * at function start, and they must be still dirty, as we've locked
          * the region for in-flight op.
          */
-        bitmap_offset = QEMU_ALIGN_DOWN(offset, job->granularity);
-        bitmap_end = QEMU_ALIGN_UP(offset + bytes, job->granularity);
-        bdrv_set_dirty_bitmap(job->dirty_bitmap, bitmap_offset,
-                              bitmap_end - bitmap_offset);
+        dirty_bitmap_offset = QEMU_ALIGN_DOWN(offset, job->granularity);
+        dirty_bitmap_end = QEMU_ALIGN_UP(offset + bytes, job->granularity);
+        bdrv_set_dirty_bitmap(job->dirty_bitmap, dirty_bitmap_offset,
+                              dirty_bitmap_end - dirty_bitmap_offset);
         qatomic_set(&job->actively_synced, false);

         action = mirror_error_action(job, false, -ret);
-- 
2.49.0
Re: [PATCH 3/6] mirror: Skip writing zeroes when target is already zero
Posted by Vladimir Sementsov-Ogievskiy 5 months, 3 weeks ago
On 11.04.25 04:04, Eric Blake wrote:
> When mirroring, the goal is to ensure that the destination reads the
> same as the source; this goal is met whether the destination is sparse
> or fully-allocated.  However, if the destination cannot efficiently
> write zeroes, then any time the mirror operation wants to copy zeroes
> from the source to the destination (either in the background over
> sparse regions when doing a full mirror, or in the foreground when the
> guest actively writes zeroes), we were causing the destination to
> fully allocate that portion of the disk, even if it already read as
> zeroes.
> 
> We could just teach mirror_co_zero() to do a block_status() probe of
> the destination, and skip the zeroes if the destination already reads
> as zero, but we know from past experience that block_status() calls
> are not always cheap (tmpfs, anyone?).  So this patch takes a slightly
> different approach: any time we have to transfer the full image,
> mirror_dirty_init() is _already_ doing a pre-zero pass over the entire
> destination.  Therefore, if we track which clusters of the destination
> are zero at any given moment, we don't have to do a block_status()
> call on the destination, but can instead just refer to the zero bitmap
> associated with the job.
> 
> With this patch, if I externally create a raw sparse destination file
> ('truncate --size=$N dst.raw'), connect it with QMP 'blockdev-add'
> while leaving it at the default "discard":"ignore", then run QMP
> 'blockdev-mirror' with "sync":"full", the destination remains sparse
> rather than fully allocated.
> 
> However, a raw destination file created with 'blockdev-create' still
> gets fully allocated, because more work is needed in file-posix to
> still identify reads-as-zeroes even when the first 4k has to be
> allocated to make alignment probing work.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>   block/mirror.c | 94 +++++++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 78 insertions(+), 16 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 2e1e14c8e7e..98da5a6dc27 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -73,6 +73,7 @@ typedef struct MirrorBlockJob {
>       size_t buf_size;
>       int64_t bdev_length;
>       unsigned long *cow_bitmap;
> +    unsigned long *zero_bitmap;
>       BdrvDirtyBitmap *dirty_bitmap;
>       BdrvDirtyBitmapIter *dbi;
>       uint8_t *buf;
> @@ -408,15 +409,32 @@ static void coroutine_fn mirror_co_read(void *opaque)
>   static void coroutine_fn mirror_co_zero(void *opaque)
>   {
>       MirrorOp *op = opaque;
> -    int ret;
> +    bool write_needed = true;
> +    int ret = 0;
> 
>       op->s->in_flight++;
>       op->s->bytes_in_flight += op->bytes;
>       *op->bytes_handled = op->bytes;
>       op->is_in_flight = true;
> 
> -    ret = blk_co_pwrite_zeroes(op->s->target, op->offset, op->bytes,
> -                               op->s->unmap ? BDRV_REQ_MAY_UNMAP : 0);
> +    if (op->s->zero_bitmap) {
> +        unsigned long last = (op->offset + op->bytes) / op->s->granularity;

Maybe call it "end", not "last", as it's not the last element of the range, but the first one after the range.

Also, it seems we still need DIV_ROUND_UP, for ..

> +        assert(QEMU_IS_ALIGNED(op->offset, op->s->granularity));
> +        assert(QEMU_IS_ALIGNED(op->bytes, op->s->granularity) ||
> +               op->offset + op->bytes == op->s->bdev_length);

.. ^ this case, when bytes is unaligned to granularity but offset + bytes ends exactly at bdev_length.

> +        if (find_next_zero_bit(op->s->zero_bitmap, last,
> +                               op->offset / op->s->granularity) == last) {
> +            write_needed = false;
> +        }
> +    }
> +    if (write_needed) {
> +        ret = blk_co_pwrite_zeroes(op->s->target, op->offset, op->bytes,
> +                                   op->s->unmap ? BDRV_REQ_MAY_UNMAP : 0);
> +    }
> +    if (ret >= 0 && op->s->zero_bitmap) {
> +        bitmap_set(op->s->zero_bitmap, op->offset / op->s->granularity,
> +                   op->bytes / op->s->granularity);

and here we want to align up bytes, for the corner case

> +    }

Also, I'm not sure what guarantees we have in case of write-zeroes failure. Should we clear the bitmap in this case, like we do for MIRROR_METHOD_COPY and MIRROR_METHOD_DISCARD below?

>       mirror_write_complete(op, ret);
>   }
> 
> @@ -441,6 +459,9 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
>       Coroutine *co;
>       int64_t bytes_handled = -1;
> 
> +    assert(QEMU_IS_ALIGNED(offset, s->granularity));
> +    assert(QEMU_IS_ALIGNED(bytes, s->granularity) ||
> +           offset + bytes == s->bdev_length);
>       op = g_new(MirrorOp, 1);
>       *op = (MirrorOp){
>           .s              = s,
> @@ -452,12 +473,21 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
> 
>       switch (mirror_method) {
>       case MIRROR_METHOD_COPY:
> +        if (s->zero_bitmap) {
> +            bitmap_clear(s->zero_bitmap, offset / s->granularity,
> +                         bytes / s->granularity);

again, align up for corner case

> +        }
>           co = qemu_coroutine_create(mirror_co_read, op);
>           break;
>       case MIRROR_METHOD_ZERO:
> +        /* s->zero_bitmap handled in mirror_co_zero */
>           co = qemu_coroutine_create(mirror_co_zero, op);
>           break;
>       case MIRROR_METHOD_DISCARD:
> +        if (s->zero_bitmap) {
> +            bitmap_clear(s->zero_bitmap, offset / s->granularity,
> +                         bytes / s->granularity);
> +        }
>           co = qemu_coroutine_create(mirror_co_discard, op);
>           break;
>       default:
> @@ -851,10 +881,17 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
>       }
>       bdrv_graph_co_rdunlock();
> 
> -    if (s->zero_target && ret <= 0) {
> +    if (s->zero_target) {
> +        int64_t length;
> +
>           if (ret < 0) {
>               return ret;
>           }
> +        length = DIV_ROUND_UP(s->bdev_length, s->granularity);
> +        s->zero_bitmap = bitmap_new(length);
> +        if (ret > 0) {
> +            bitmap_set(s->zero_bitmap, 0, length);

Hmm, we should not continue zeroing the target in case of ret > 0.

I already didn't like that we set ret in one if-block and handle it in another, and now it gets even more confusing.

Maybe just move the bdrv_co_is_zero_fast() call into the big "if (s->zero_target) {" block?



> +        }
>           if (!bdrv_can_write_zeroes_with_unmap(target_bs)) {
>               bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, s->bdev_length);
>               return 0;
> @@ -1169,6 +1206,7 @@ immediate_exit:
>       assert(s->in_flight == 0);
>       qemu_vfree(s->buf);
>       g_free(s->cow_bitmap);
> +    g_free(s->zero_bitmap);
>       g_free(s->in_flight_bitmap);
>       bdrv_dirty_iter_free(s->dbi);
> 
> @@ -1347,7 +1385,8 @@ do_sync_target_write(MirrorBlockJob *job, MirrorMethod method,
>   {
>       int ret;
>       size_t qiov_offset = 0;
> -    int64_t bitmap_offset, bitmap_end;
> +    int64_t dirty_bitmap_offset, dirty_bitmap_end;
> +    int64_t zero_bitmap_offset, zero_bitmap_end;
> 
>       if (!QEMU_IS_ALIGNED(offset, job->granularity) &&
>           bdrv_dirty_bitmap_get(job->dirty_bitmap, offset))
> @@ -1391,31 +1430,54 @@ do_sync_target_write(MirrorBlockJob *job, MirrorMethod method,
>       }
> 
>       /*
> -     * Tails are either clean or shrunk, so for bitmap resetting
> -     * we safely align the range down.
> +     * Tails are either clean or shrunk, so for dirty bitmap resetting
> +     * we safely align the range down.  But for zero bitmap, round range
> +     * up for checking or clearing, and down for setting.
>        */
> -    bitmap_offset = QEMU_ALIGN_UP(offset, job->granularity);
> -    bitmap_end = QEMU_ALIGN_DOWN(offset + bytes, job->granularity);
> -    if (bitmap_offset < bitmap_end) {
> -        bdrv_reset_dirty_bitmap(job->dirty_bitmap, bitmap_offset,
> -                                bitmap_end - bitmap_offset);
> +    dirty_bitmap_offset = QEMU_ALIGN_UP(offset, job->granularity);
> +    dirty_bitmap_end = QEMU_ALIGN_DOWN(offset + bytes, job->granularity);
> +    if (dirty_bitmap_offset < dirty_bitmap_end) {
> +        bdrv_reset_dirty_bitmap(job->dirty_bitmap, dirty_bitmap_offset,
> +                                dirty_bitmap_end - dirty_bitmap_offset);
>       }
> +    zero_bitmap_offset = offset / job->granularity;
> +    zero_bitmap_end = DIV_ROUND_UP(offset + bytes, job->granularity);
> 
>       job_progress_increase_remaining(&job->common.job, bytes);
>       job->active_write_bytes_in_flight += bytes;
> 
>       switch (method) {
>       case MIRROR_METHOD_COPY:
> +        if (job->zero_bitmap) {
> +            bitmap_clear(job->zero_bitmap, zero_bitmap_offset,
> +                         zero_bitmap_end - zero_bitmap_offset);
> +        }
>           ret = blk_co_pwritev_part(job->target, offset, bytes,
>                                     qiov, qiov_offset, flags);
>           break;
> 
>       case MIRROR_METHOD_ZERO:
> +        if (job->zero_bitmap) {
> +            if (find_next_zero_bit(job->zero_bitmap, zero_bitmap_end,
> +                                   zero_bitmap_offset) == zero_bitmap_end) {
> +                ret = 0;
> +                break;
> +            }
> +        }
>           assert(!qiov);
>           ret = blk_co_pwrite_zeroes(job->target, offset, bytes, flags);
> +        if (job->zero_bitmap && ret >= 0) {
> +            bitmap_set(job->zero_bitmap, dirty_bitmap_offset / job->granularity,
> +                       (dirty_bitmap_end - dirty_bitmap_offset) /
> +                       job->granularity);
> +        }

Same thing, probably we should clear the bitmap in case of write failure.

>           break;
> 
>       case MIRROR_METHOD_DISCARD:
> +        if (job->zero_bitmap) {
> +            bitmap_clear(job->zero_bitmap, zero_bitmap_offset,
> +                         zero_bitmap_end - zero_bitmap_offset);
> +        }
>           assert(!qiov);
>           ret = blk_co_pdiscard(job->target, offset, bytes);
>           break;
> @@ -1436,10 +1498,10 @@ do_sync_target_write(MirrorBlockJob *job, MirrorMethod method,
>            * at function start, and they must be still dirty, as we've locked
>            * the region for in-flight op.
>            */
> -        bitmap_offset = QEMU_ALIGN_DOWN(offset, job->granularity);
> -        bitmap_end = QEMU_ALIGN_UP(offset + bytes, job->granularity);
> -        bdrv_set_dirty_bitmap(job->dirty_bitmap, bitmap_offset,
> -                              bitmap_end - bitmap_offset);
> +        dirty_bitmap_offset = QEMU_ALIGN_DOWN(offset, job->granularity);
> +        dirty_bitmap_end = QEMU_ALIGN_UP(offset + bytes, job->granularity);
> +        bdrv_set_dirty_bitmap(job->dirty_bitmap, dirty_bitmap_offset,
> +                              dirty_bitmap_end - dirty_bitmap_offset);

It doesn't really matter, but still, doing the renaming in a separate patch would make this one a bit simpler.

>           qatomic_set(&job->actively_synced, false);
> 
>           action = mirror_error_action(job, false, -ret);

-- 
Best regards,
Vladimir
Re: [PATCH 3/6] mirror: Skip writing zeroes when target is already zero
Posted by Eric Blake 5 months, 3 weeks ago
On Tue, Apr 15, 2025 at 06:59:15PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 11.04.25 04:04, Eric Blake wrote:
> > When mirroring, the goal is to ensure that the destination reads the
> > same as the source; this goal is met whether the destination is sparse
> > or fully-allocated.  However, if the destination cannot efficiently
> > write zeroes, then any time the mirror operation wants to copy zeroes
> > from the source to the destination (either in the background over
> > sparse regions when doing a full mirror, or in the foreground when the
> > guest actively writes zeroes), we were causing the destination to
> > fully allocate that portion of the disk, even if it already read as
> > zeroes.
> > 
> > -    ret = blk_co_pwrite_zeroes(op->s->target, op->offset, op->bytes,
> > -                               op->s->unmap ? BDRV_REQ_MAY_UNMAP : 0);
> > +    if (op->s->zero_bitmap) {
> > +        unsigned long last = (op->offset + op->bytes) / op->s->granularity;
> 
> Maybe call it "end", not "last", as it's not the last element of the range, but the first one after the range.

Works for me.

> 
> Also, it seems we still need DIV_ROUND_UP, for ..
> 
> > +        assert(QEMU_IS_ALIGNED(op->offset, op->s->granularity));
> > +        assert(QEMU_IS_ALIGNED(op->bytes, op->s->granularity) ||
> > +               op->offset + op->bytes == op->s->bdev_length);
> 
> .. ^ this case, when bytes is unaligned to granularity but offset + bytes ends exactly at bdev_length.

Hmm, good point.  I'll have to revisit the logic and make sure it
still works in the case of a final partial region.

> 
> > +        if (find_next_zero_bit(op->s->zero_bitmap, last,
> > +                               op->offset / op->s->granularity) == last) {
> > +            write_needed = false;
> > +        }
> > +    }
> > +    if (write_needed) {
> > +        ret = blk_co_pwrite_zeroes(op->s->target, op->offset, op->bytes,
> > +                                   op->s->unmap ? BDRV_REQ_MAY_UNMAP : 0);
> > +    }
> > +    if (ret >= 0 && op->s->zero_bitmap) {
> > +        bitmap_set(op->s->zero_bitmap, op->offset / op->s->granularity,
> > +                   op->bytes / op->s->granularity);
> 
> and here we want to align up bytes, for the corner case
> 
> > +    }
> 
> Also, I'm not sure what guarantees we have in case of write-zeroes failure. Should we clear the bitmap in this case, like we do for MIRROR_METHOD_COPY and MIRROR_METHOD_DISCARD below?

My goal was to clear the bitmap before any action that (potentially)
destroys zero regions (writes and discards), even if that action
fails; but only set the bitmap after success on any action that
creates zero regions (write-zero).  If the write-zero fails, and the
bitmap was already set, it doesn't matter that we left the bitmap set
(a write zeroes that fails AND causes the disk to no longer read as
zero should not happen); otherwise, we are correct in leaving the
bitmap unset on failure (even if the write-zeroes partially succeeded,
the pessimistic answer of assuming a region is non-zero is always
correct even if sometimes slower).

> 
> >       mirror_write_complete(op, ret);
> >   }
> > 
> > @@ -441,6 +459,9 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
> >       Coroutine *co;
> >       int64_t bytes_handled = -1;
> > 
> > +    assert(QEMU_IS_ALIGNED(offset, s->granularity));
> > +    assert(QEMU_IS_ALIGNED(bytes, s->granularity) ||
> > +           offset + bytes == s->bdev_length);
> >       op = g_new(MirrorOp, 1);
> >       *op = (MirrorOp){
> >           .s              = s,
> > @@ -452,12 +473,21 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
> > 
> >       switch (mirror_method) {
> >       case MIRROR_METHOD_COPY:
> > +        if (s->zero_bitmap) {
> > +            bitmap_clear(s->zero_bitmap, offset / s->granularity,
> > +                         bytes / s->granularity);
> 
> again, align up for corner case
> 
> > +        }
> >           co = qemu_coroutine_create(mirror_co_read, op);
> >           break;
> >       case MIRROR_METHOD_ZERO:
> > +        /* s->zero_bitmap handled in mirror_co_zero */
> >           co = qemu_coroutine_create(mirror_co_zero, op);
> >           break;
> >       case MIRROR_METHOD_DISCARD:
> > +        if (s->zero_bitmap) {
> > +            bitmap_clear(s->zero_bitmap, offset / s->granularity,
> > +                         bytes / s->granularity);
> > +        }
> >           co = qemu_coroutine_create(mirror_co_discard, op);
> >           break;
> >       default:
> > @@ -851,10 +881,17 @@ static int coroutine_fn GRAPH_UNLOCKED mirror_dirty_init(MirrorBlockJob *s)
> >       }
> >       bdrv_graph_co_rdunlock();
> > 
> > -    if (s->zero_target && ret <= 0) {
> > +    if (s->zero_target) {
> > +        int64_t length;
> > +
> >           if (ret < 0) {
> >               return ret;
> >           }
> > +        length = DIV_ROUND_UP(s->bdev_length, s->granularity);
> > +        s->zero_bitmap = bitmap_new(length);
> > +        if (ret > 0) {
> > +            bitmap_set(s->zero_bitmap, 0, length);
> 
> Hmm, we should not continue zeroing the target in case of ret > 0.

Correct. If ret > 0, we set zero_bitmap to record that starting state,
and then want to proceed with the rest of the code that populates
only the non-zero portions of the dirty bitmap.

> 
> I already didn't like that we set ret in one if-block and handle it in another, and now it gets even more confusing.
> 
> Maybe just move the bdrv_co_is_zero_fast() call into the big "if (s->zero_target) {" block?

That would require grabbing another RDLOCK - but I'll try and see if
that makes things easier to read.

...
> > +        if (job->zero_bitmap && ret >= 0) {
> > +            bitmap_set(job->zero_bitmap, dirty_bitmap_offset / job->granularity,
> > +                       (dirty_bitmap_end - dirty_bitmap_offset) /
> > +                       job->granularity);
> > +        }
> 
> Same thing, probably we should clear the bitmap in case of write failure.

Why? Either the bitmap was already clear, or we had a write failure on
writing zeroes on top of zeroes, but the region should still read as
zeroes.

> 
> >           break;
> > 
> >       case MIRROR_METHOD_DISCARD:
> > +        if (job->zero_bitmap) {
> > +            bitmap_clear(job->zero_bitmap, zero_bitmap_offset,
> > +                         zero_bitmap_end - zero_bitmap_offset);
> > +        }
> >           assert(!qiov);
> >           ret = blk_co_pdiscard(job->target, offset, bytes);
> >           break;
> > @@ -1436,10 +1498,10 @@ do_sync_target_write(MirrorBlockJob *job, MirrorMethod method,
> >            * at function start, and they must be still dirty, as we've locked
> >            * the region for in-flight op.
> >            */
> > -        bitmap_offset = QEMU_ALIGN_DOWN(offset, job->granularity);
> > -        bitmap_end = QEMU_ALIGN_UP(offset + bytes, job->granularity);
> > -        bdrv_set_dirty_bitmap(job->dirty_bitmap, bitmap_offset,
> > -                              bitmap_end - bitmap_offset);
> > +        dirty_bitmap_offset = QEMU_ALIGN_DOWN(offset, job->granularity);
> > +        dirty_bitmap_end = QEMU_ALIGN_UP(offset + bytes, job->granularity);
> > +        bdrv_set_dirty_bitmap(job->dirty_bitmap, dirty_bitmap_offset,
> > +                              dirty_bitmap_end - dirty_bitmap_offset);
> 
> It doesn't really matter, but still, doing the renaming in a separate patch would make this one a bit simpler.

Okay, I'll split the variable rename.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org
Re: [PATCH 3/6] mirror: Skip writing zeroes when target is already zero
Posted by Vladimir Sementsov-Ogievskiy 5 months, 2 weeks ago
On 17.04.25 00:51, Eric Blake wrote:
> (a write zeroes that fails AND causes the disk to no longer read as
> zero should not happen)

I don't know, is there such a contract? write-zeroes may fall back to write(), which only states that:

        An error return value while performing write() using direct I/O
        does not mean the entire write has failed.  Partial data may be
        written and the data at the file offset on which the write() was
        attempted should be considered inconsistent.

So, I used to think that on failed write nothing is guaranteed.

What do we lose if we just unset the bitmap before write-zeroes, and set it again in case of success?

-- 
Best regards,
Vladimir
Re: [PATCH 3/6] mirror: Skip writing zeroes when target is already zero
Posted by Eric Blake 5 months, 2 weeks ago
On Mon, Apr 21, 2025 at 09:15:33AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 17.04.25 00:51, Eric Blake wrote:
> > (a write zeroes that fails AND causes the disk to no longer read as
> > zero should not happen)
> 
> I don't know, is there such a contract? write-zeroes may fall back to write(), which only states that:
> 
>        An error return value while performing write() using direct I/O
>        does not mean the entire write has failed.  Partial data may be
>        written and the data at the file offset on which the write() was
>        attempted should be considered inconsistent.
> 
> So, I used to think that on failed write nothing is guaranteed.
> 
> What do we lose if we just unset the bitmap before write-zeroes, and set it again in case of success?
>

I still don't see the point.  Either the cluster was already non-zero
before the failed write-zero (so there's no bit to pre-clear); or the
cluster was already zero before the failed write-zero, and any failure
that corrupts the disk by actually turning zeroes into non-zero is not
worth worrying about, so pre-clearing the bit is not going to make
things any better.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org
Re: [PATCH 3/6] mirror: Skip writing zeroes when target is already zero
Posted by Vladimir Sementsov-Ogievskiy 5 months, 2 weeks ago
On 21.04.25 17:41, Eric Blake wrote:
> On Mon, Apr 21, 2025 at 09:15:33AM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> On 17.04.25 00:51, Eric Blake wrote:
>>> (a write zeroes that fails AND causes the disk to no longer read as
>>> zero should not happen)
>>
>> I don't know, is there such a contract? write-zeroes may fall back to write(), which only states that:
>>
>>         An error return value while performing write() using direct I/O
>>         does not mean the entire write has failed.  Partial data may be
>>         written and the data at the file offset on which the write() was
>>         attempted should be considered inconsistent.
>>
>> So, I used to think that on failed write nothing is guaranteed.
>>
>> What do we lose if we just unset the bitmap before write-zeroes, and set it again in case of success?
>>
> 
> I still don't see the point.  Either the cluster was already non-zero
> before the failed write-zero (so there's no bit to pre-clear);

[..]

> cluster was already zero before the failed write-zero, and any failure
> that corrupts the disk by actually turning zeroes into non-zero

Yes, I mean this case

> is not
> worth worrying about

And I don't follow why we should not care.

Usually we make no assumptions about the data at offset if write(offset, ..) failed. We just retry the write.

Here we make the assumption that "if there were zeroes, they will not be broken by writing zeroes even on failure scenarios", for any kind of block driver and underlying file system.

You say "corrupts the disk". Of course, if the disk is corrupted, everything is broken and we should not care; the result of the mirror would be incorrect anyway. But we don't consider every EIO to be disk corruption. We do retry the write and continue the mirror operation.

Moreover, if EIO is returned due to some broken sector, we may hope that retrying the write will fix it; probably some other sector will be allocated by the FS (I hope I'm not talking nonsense). But relying on the data still reading as zero without rewriting would be wrong.

>, so pre-clearing the bit is not going to make
> things any better.
> 

-- 
Best regards,
Vladimir