fix potential data loss in QEMU (missed flush)

[PATCH 2/3] block: move bdrv_advance_flush_gen() earlier from bdrv_co_write_req_finish

Posted by Denis V. Lunev via 1 month, 1 week ago

This helper is mandatory to be called before bdrv_co_flush(). In the
other case bdrv_co_flush() will be noop. This helper should be called
after actual write is completed for subsequent flush to perform some
work.

Actually this change is important, without it BDRV_REQ_FUA semantics is
broken completely. flush() is not called. This smells like potential
data loss if somebody relies on BDRV_REQ_FUA.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Hanna Reitz <hreitz@redhat.com>
---
 block/io.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/block/io.c b/block/io.c
index f39ff862c11..820b41fab12 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1163,8 +1163,11 @@ bdrv_driver_pwritev(BlockDriverState *bs, int64_t offset, int64_t bytes,
     ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov, flags);
 
 emulate_flags:
-    if (ret == 0 && emulate_fua) {
-        ret = bdrv_co_flush(bs);
+    if (ret == 0) {
+        bdrv_advance_flush_gen(bs);
+        if (emulate_fua) {
+            ret = bdrv_co_flush(bs);
+        }
     }
 
     if (qiov == &local_qiov) {
@@ -2077,8 +2080,11 @@ bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int64_t bytes,
     }
 
 fail:
-    if (ret == 0 && need_flush) {
-        ret = bdrv_co_flush(bs);
+    if (ret == 0) {
+        bdrv_advance_flush_gen(bs);
+        if (need_flush) {
+            ret = bdrv_co_flush(bs);
+        }
     }
     qemu_vfree(buf);
     return ret;
@@ -2147,8 +2153,6 @@ bdrv_co_write_req_finish(BdrvChild *child, int64_t offset, int64_t bytes,
 
     bdrv_check_request(offset, bytes, &error_abort);
 
-    bdrv_advance_flush_gen(bs);
-
     /*
      * Discard cannot extend the image, but in error handling cases, such as
      * when reverting a qcow2 cluster allocation, the discarded range can pass
@@ -3720,6 +3724,7 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
     }
     ret = 0;
 out:
+    bdrv_advance_flush_gen(bs);
     bdrv_co_write_req_finish(child, req.offset, req.bytes, &req, ret);
     tracked_request_end(&req);
     bdrv_dec_in_flight(bs);
@@ -3994,6 +4000,7 @@ static int coroutine_fn GRAPH_RDLOCK bdrv_co_copy_range_internal(
                                                       bytes,
                                                       read_flags, write_flags);
         }
+        bdrv_advance_flush_gen(dst->bs);
         bdrv_co_write_req_finish(dst, dst_offset, bytes, &req, ret);
         tracked_request_end(&req);
         bdrv_dec_in_flight(dst->bs);
@@ -4187,6 +4194,7 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
     } else {
         offset = bs->total_sectors * BDRV_SECTOR_SIZE;
     }
+    bdrv_advance_flush_gen(bs);
     /*
      * It's possible that truncation succeeded but bdrv_refresh_total_sectors
      * failed, but the latter doesn't affect how we should finish the request.
-- 
2.43.5

Re: [PATCH 2/3] block: move bdrv_advance_flush_gen() earlier from bdrv_co_write_req_finish

Posted by Hanna Czenczek 2 weeks ago

On 29.12.25 17:07, Denis V. Lunev wrote:
> This helper is mandatory to be called before bdrv_co_flush(). In the
> other case bdrv_co_flush() will be noop. This helper should be called
> after actual write is completed for subsequent flush to perform some
> work.
>
> Actually this change is important, without it BDRV_REQ_FUA semantics is
> broken completely. flush() is not called. This smells like potential
> data loss if somebody relies on BDRV_REQ_FUA.
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Hanna Reitz <hreitz@redhat.com>
> ---
>   block/io.c | 20 ++++++++++++++------
>   1 file changed, 14 insertions(+), 6 deletions(-)

Definitely an improvement, but if possible, I would still prefer keeping 
the increment it in bdrv_co_write_req_finish(), because that is where 
all write requests eventually end up in, and that’s also where dirty 
bitmaps are handled.  It seems slightly error-prone to manually have to 
call this every time the BDS data is modified.

Can we maybe instead pull the FUA emulation out into 
bdrv_aligned_pwritev(), after bdrv_co_write_req_finish()?

I think we can only pull out the flush, not the FUA emulation detection 
because for write-zeroes, we have to check either supported_zero_flags 
or supported_write_flags, depending on whether we fall back or not.  But 
bdrv_driver_pwritev() and bdrv_co_do_pwrite_zeroes() could return an 
need_emulate_fua flag via pointer parameter, and if set, 
bdrv_aligned_pwritev() can then do the flush after 
bdrv_co_write_req_finish().  (This would also allow dropping both “only 
flush once at the end” blocks.)

Maybe it would then also make sense to move the bdrv_co_flush() into 
bdrv_co_write_req_finish(), and add a need_flush parameter to it, so 
that it can see whether the whole request (write + flush) was successful.

Hanna

Re: [PATCH 2/3] block: move bdrv_advance_flush_gen() earlier from bdrv_co_write_req_finish

Posted by Denis V. Lunev 1 week, 3 days ago

On 1/26/26 16:07, Hanna Czenczek wrote:
> On 29.12.25 17:07, Denis V. Lunev wrote:
>> This helper is mandatory to be called before bdrv_co_flush(). In the
>> other case bdrv_co_flush() will be noop. This helper should be called
>> after actual write is completed for subsequent flush to perform some
>> work.
>>
>> Actually this change is important, without it BDRV_REQ_FUA semantics is
>> broken completely. flush() is not called. This smells like potential
>> data loss if somebody relies on BDRV_REQ_FUA.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Hanna Reitz <hreitz@redhat.com>
>> ---
>>   block/io.c | 20 ++++++++++++++------
>>   1 file changed, 14 insertions(+), 6 deletions(-)
> Definitely an improvement, but if possible, I would still prefer keeping 
> the increment it in bdrv_co_write_req_finish(), because that is where 
> all write requests eventually end up in, and that’s also where dirty 
> bitmaps are handled.  It seems slightly error-prone to manually have to 
> call this every time the BDS data is modified.
>
> Can we maybe instead pull the FUA emulation out into 
> bdrv_aligned_pwritev(), after bdrv_co_write_req_finish()?
>
> I think we can only pull out the flush, not the FUA emulation detection 
> because for write-zeroes, we have to check either supported_zero_flags 
> or supported_write_flags, depending on whether we fall back or not.  But 
> bdrv_driver_pwritev() and bdrv_co_do_pwrite_zeroes() could return an 
> need_emulate_fua flag via pointer parameter, and if set, 
> bdrv_aligned_pwritev() can then do the flush after 
> bdrv_co_write_req_finish().  (This would also allow dropping both “only 
> flush once at the end” blocks.)
>
> Maybe it would then also make sense to move the bdrv_co_flush() into 
> bdrv_co_write_req_finish(), and add a need_flush parameter to it, so 
> that it can see whether the whole request (write + flush) was successful.
>
> Hanna
>
I have tried to think about this problem in a this and that way
and found yet another missed place in the code. This has
frustrated me even more.

For the reference:

raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr, uint64_t bytes,
           QEMUIOVector *qiov, int type, int flags)
{
    ...
    assert(qiov->size == bytes);
    ret = raw_thread_pool_submit(handle_aiocb_rw, &acb);
    if (ret == 0 && (flags & BDRV_REQ_FUA)) {
        /* TODO Use pwritev2() instead if it's available */
        ret = bdrv_co_flush(bs);
    }
    goto out; /* Avoid the compiler err of unused label */

This definitely states that the code is cumbersome and you
right that I need to try harder to move FUA emulation out
into generic code.

I'll invest more here.

Thank you for the point,
    Den