block/blklogwrites: Fix a bug when logging "write zeroes" operations.

[PATCH] block/blklogwrites: Fix a bug when logging "write zeroes" operations.

Posted by megari@gmx.com 8 months, 2 weeks ago

From: Ari Sundholm <ari@tuxera.com>

There is a bug in the blklogwrites driver pertaining to logging "write
zeroes" operations, causing log corruption. This can be easily observed
by setting detect-zeroes to something other than "off" for the driver.

The issue is caused by a concurrency bug pertaining to the fact that
"write zeroes" operations have to be logged in two parts: first the log
entry metadata, then the zeroed-out region. While the log entry
metadata is being written by bdrv_co_pwritev(), another operation may
begin in the meanwhile and modify the state of the blklogwrites driver.
This is as intended by the coroutine-driven I/O model in QEMU, of
course.

Unfortunately, this specific scenario is mishandled. A short example:
    1. Initially, in the current operation (#1), the current log sector
number in the driver state is only incremented by the number of sectors
taken by the log entry metadata, after which the log entry metadata is
written. The current operation yields.
    2. Another operation (#2) may start while the log entry metadata is
being written. It uses the current log position as the start offset for
its log entry. This is in the sector right after the operation #1 log
entry metadata, which is bad!
    3. After bdrv_co_pwritev() returns (#1), the current log sector
number is reread from the driver state in order to find out the start
offset for bdrv_co_pwrite_zeroes(). This is an obvious blunder, as the
offset will be the sector right after the (misplaced) operation #2 log
entry, which means that the zeroed-out region begins at the wrong
offset.
    4. As a result of the above, the log is corrupt.

Fix this by only reading the driver metadata once, computing the
offsets and sizes in one go (including the optional zeroed-out region)
and setting the log sector number to the appropriate value for the next
operation in line.

Signed-off-by: Ari Sundholm <ari@tuxera.com>
Cc: qemu-stable@nongnu.org
---
 block/blklogwrites.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index 7207b2e757..ba717dab4d 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -328,22 +328,39 @@ static void coroutine_fn GRAPH_RDLOCK
 blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
 {
     BDRVBlkLogWritesState *s = lr->bs->opaque;
-    uint64_t cur_log_offset = s->cur_log_sector << s->sectorbits;

-    s->nr_entries++;
-    s->cur_log_sector +=
-            ROUND_UP(lr->qiov->size, s->sectorsize) >> s->sectorbits;
+    /*
+     * Determine the offsets and sizes of different parts of the entry, and
+     * update the state of the driver.
+     *
+     * This needs to be done in one go, before any actual I/O is done, as the
+     * log entry may have to be written in two parts, and the state of the
+     * driver may be modified by other driver operations while waiting for the
+     * I/O to complete.
+     */
+    const uint64_t entry_start_sector = s->cur_log_sector;
+    const uint64_t entry_offset = entry_start_sector << s->sectorbits;
+    const uint64_t qiov_aligned_size = ROUND_UP(lr->qiov->size, s->sectorsize);
+    const uint64_t entry_aligned_size = qiov_aligned_size +
+        ROUND_UP(lr->zero_size, s->sectorsize);
+    const uint64_t entry_nr_sectors = entry_aligned_size >> s->sectorbits;

-    lr->log_ret = bdrv_co_pwritev(s->log_file, cur_log_offset, lr->qiov->size,
+    s->nr_entries++;
+    s->cur_log_sector += entry_nr_sectors;
+
+    /*
+     * Write the log entry. Note that if this is a "write zeroes" operation,
+     * only the entry header is written here, with the zeroing being done
+     * separately below.
+     */
+    lr->log_ret = bdrv_co_pwritev(s->log_file, entry_offset, lr->qiov->size,
                                   lr->qiov, 0);

     /* Logging for the "write zeroes" operation */
     if (lr->log_ret == 0 && lr->zero_size) {
-        cur_log_offset = s->cur_log_sector << s->sectorbits;
-        s->cur_log_sector +=
-                ROUND_UP(lr->zero_size, s->sectorsize) >> s->sectorbits;
+        const uint64_t zeroes_offset = entry_offset + qiov_aligned_size;

-        lr->log_ret = bdrv_co_pwrite_zeroes(s->log_file, cur_log_offset,
+        lr->log_ret = bdrv_co_pwrite_zeroes(s->log_file, zeroes_offset,
                                             lr->zero_size, 0);
     }

--
2.43.0

Re: [PATCH] block/blklogwrites: Fix a bug when logging "write zeroes" operations.

Posted by Kevin Wolf 8 months, 2 weeks ago

Am 09.01.2024 um 19:46 hat megari@gmx.com geschrieben:
> From: Ari Sundholm <ari@tuxera.com>
> 
> There is a bug in the blklogwrites driver pertaining to logging "write
> zeroes" operations, causing log corruption. This can be easily observed
> by setting detect-zeroes to something other than "off" for the driver.
> 
> The issue is caused by a concurrency bug pertaining to the fact that
> "write zeroes" operations have to be logged in two parts: first the log
> entry metadata, then the zeroed-out region. While the log entry
> metadata is being written by bdrv_co_pwritev(), another operation may
> begin in the meanwhile and modify the state of the blklogwrites driver.
> This is as intended by the coroutine-driven I/O model in QEMU, of
> course.
> 
> Unfortunately, this specific scenario is mishandled. A short example:
>     1. Initially, in the current operation (#1), the current log sector
> number in the driver state is only incremented by the number of sectors
> taken by the log entry metadata, after which the log entry metadata is
> written. The current operation yields.
>     2. Another operation (#2) may start while the log entry metadata is
> being written. It uses the current log position as the start offset for
> its log entry. This is in the sector right after the operation #1 log
> entry metadata, which is bad!
>     3. After bdrv_co_pwritev() returns (#1), the current log sector
> number is reread from the driver state in order to find out the start
> offset for bdrv_co_pwrite_zeroes(). This is an obvious blunder, as the
> offset will be the sector right after the (misplaced) operation #2 log
> entry, which means that the zeroed-out region begins at the wrong
> offset.
>     4. As a result of the above, the log is corrupt.
> 
> Fix this by only reading the driver metadata once, computing the
> offsets and sizes in one go (including the optional zeroed-out region)
> and setting the log sector number to the appropriate value for the next
> operation in line.
> 
> Signed-off-by: Ari Sundholm <ari@tuxera.com>
> Cc: qemu-stable@nongnu.org

Thanks, applied to the block branch.

Note that while this fixes the single threaded case, it is not thread
safe and will still break with multiqueue enabled (see the new
iothread-vq-mapping option added to virtio-blk very recently). Most
block drivers already take a lock when modifying their global state, and
it looks like we missed blklogwrites when checking what needs to be
prepared for the block layer to be thread safe.

So I think we'll want another patch to add a QemuMutex that can be taken
while you do the calculations on s->cur_log_sector. But this patch is
a good first step because it means that we don't need to keep a lock
across an I/O request (just for the sake of completeness, this would
have had to be a CoMutex rather than a QemuMutex).

Kevin

Re: [PATCH] block/blklogwrites: Fix a bug when logging "write zeroes" operations.

Posted by Ari Sundholm 8 months, 2 weeks ago

Hi, Kevin!

On 1/10/24 15:39, Kevin Wolf wrote:
> Am 09.01.2024 um 19:46 hat megari@gmx.com geschrieben:
>> From: Ari Sundholm <ari@tuxera.com>
>>
>> There is a bug in the blklogwrites driver pertaining to logging "write
>> zeroes" operations, causing log corruption. This can be easily observed
>> by setting detect-zeroes to something other than "off" for the driver.
>>
>> The issue is caused by a concurrency bug pertaining to the fact that
>> "write zeroes" operations have to be logged in two parts: first the log
>> entry metadata, then the zeroed-out region. While the log entry
>> metadata is being written by bdrv_co_pwritev(), another operation may
>> begin in the meanwhile and modify the state of the blklogwrites driver.
>> This is as intended by the coroutine-driven I/O model in QEMU, of
>> course.
>>
>> Unfortunately, this specific scenario is mishandled. A short example:
>>      1. Initially, in the current operation (#1), the current log sector
>> number in the driver state is only incremented by the number of sectors
>> taken by the log entry metadata, after which the log entry metadata is
>> written. The current operation yields.
>>      2. Another operation (#2) may start while the log entry metadata is
>> being written. It uses the current log position as the start offset for
>> its log entry. This is in the sector right after the operation #1 log
>> entry metadata, which is bad!
>>      3. After bdrv_co_pwritev() returns (#1), the current log sector
>> number is reread from the driver state in order to find out the start
>> offset for bdrv_co_pwrite_zeroes(). This is an obvious blunder, as the
>> offset will be the sector right after the (misplaced) operation #2 log
>> entry, which means that the zeroed-out region begins at the wrong
>> offset.
>>      4. As a result of the above, the log is corrupt.
>>
>> Fix this by only reading the driver metadata once, computing the
>> offsets and sizes in one go (including the optional zeroed-out region)
>> and setting the log sector number to the appropriate value for the next
>> operation in line.
>>
>> Signed-off-by: Ari Sundholm <ari@tuxera.com>
>> Cc: qemu-stable@nongnu.org
> 
> Thanks, applied to the block branch.
> 

Thank you.

> Note that while this fixes the single threaded case, it is not thread
> safe and will still break with multiqueue enabled (see the new
> iothread-vq-mapping option added to virtio-blk very recently). Most
> block drivers already take a lock when modifying their global state, and
> it looks like we missed blklogwrites when checking what needs to be
> prepared for the block layer to be thread safe.
> 

I see. Thanks for the heads up. I had missed this new development.

> So I think we'll want another patch to add a QemuMutex that can be taken
> while you do the calculations on s->cur_log_sector. But this patch is
> a good first step because it means that we don't need to keep a lock
> across an I/O request (just for the sake of completeness, this would
> have had to be a CoMutex rather than a QemuMutex).
> 

OK. I guess I have a bit of additional work to do, then. What release 
would these fixes realistically make it to? Just trying to gauge the 
urgency for the fix for the multi-threaded case for prioritization purposes.

And yes, holding a CoMutex while doing I/O would have fixed this issue, 
with the tiny drawback of killing any concurrency in the driver. ;)

Best regards,
Ari Sundholm
ari@tuxera.com

> Kevin
>

Re: [PATCH] block/blklogwrites: Fix a bug when logging "write zeroes" operations.

Posted by Kevin Wolf 8 months, 2 weeks ago

Am 10.01.2024 um 16:21 hat Ari Sundholm geschrieben:
> On 1/10/24 15:39, Kevin Wolf wrote:
> > Am 09.01.2024 um 19:46 hat megari@gmx.com geschrieben:
> > > From: Ari Sundholm <ari@tuxera.com>
> > > 
> > > There is a bug in the blklogwrites driver pertaining to logging "write
> > > zeroes" operations, causing log corruption. This can be easily observed
> > > by setting detect-zeroes to something other than "off" for the driver.
> > > 
> > > The issue is caused by a concurrency bug pertaining to the fact that
> > > "write zeroes" operations have to be logged in two parts: first the log
> > > entry metadata, then the zeroed-out region. While the log entry
> > > metadata is being written by bdrv_co_pwritev(), another operation may
> > > begin in the meanwhile and modify the state of the blklogwrites driver.
> > > This is as intended by the coroutine-driven I/O model in QEMU, of
> > > course.
> > > 
> > > Unfortunately, this specific scenario is mishandled. A short example:
> > >      1. Initially, in the current operation (#1), the current log sector
> > > number in the driver state is only incremented by the number of sectors
> > > taken by the log entry metadata, after which the log entry metadata is
> > > written. The current operation yields.
> > >      2. Another operation (#2) may start while the log entry metadata is
> > > being written. It uses the current log position as the start offset for
> > > its log entry. This is in the sector right after the operation #1 log
> > > entry metadata, which is bad!
> > >      3. After bdrv_co_pwritev() returns (#1), the current log sector
> > > number is reread from the driver state in order to find out the start
> > > offset for bdrv_co_pwrite_zeroes(). This is an obvious blunder, as the
> > > offset will be the sector right after the (misplaced) operation #2 log
> > > entry, which means that the zeroed-out region begins at the wrong
> > > offset.
> > >      4. As a result of the above, the log is corrupt.
> > > 
> > > Fix this by only reading the driver metadata once, computing the
> > > offsets and sizes in one go (including the optional zeroed-out region)
> > > and setting the log sector number to the appropriate value for the next
> > > operation in line.
> > > 
> > > Signed-off-by: Ari Sundholm <ari@tuxera.com>
> > > Cc: qemu-stable@nongnu.org
> > 
> > Thanks, applied to the block branch.
> > 
> 
> Thank you.
> 
> > Note that while this fixes the single threaded case, it is not thread
> > safe and will still break with multiqueue enabled (see the new
> > iothread-vq-mapping option added to virtio-blk very recently). Most
> > block drivers already take a lock when modifying their global state, and
> > it looks like we missed blklogwrites when checking what needs to be
> > prepared for the block layer to be thread safe.
> > 
> 
> I see. Thanks for the heads up. I had missed this new development.
> 
> > So I think we'll want another patch to add a QemuMutex that can be taken
> > while you do the calculations on s->cur_log_sector. But this patch is
> > a good first step because it means that we don't need to keep a lock
> > across an I/O request (just for the sake of completeness, this would
> > have had to be a CoMutex rather than a QemuMutex).
> 
> OK. I guess I have a bit of additional work to do, then.

Yes, a bit, but I honestly don't expect the locking to be a big effort
for a small driver like blklogwrites.

> What release would these fixes realistically make it to? Just trying
> to gauge the urgency for the fix for the multi-threaded case for
> prioritization purposes.

Well, this one is queued for 9.0. If we can make sure to get the thread
safety fix into 9.0, too (I expect the soft freeze to be somewhere
around beginning of March and the release in April), that would be good,
because that would also be the first release to be affected.

I'm not sure if and when a 8.2.* stable release is planned before that.

> And yes, holding a CoMutex while doing I/O would have fixed this
> issue, with the tiny drawback of killing any concurrency in the
> driver. ;)

Ah, that's not true, only any concurrency in the log writes. ;)

Kevin

[PATCH v3] block/blklogwrites: Protect mutable driver state with a mutex.

Posted by Ari Sundholm via 8 months ago

During the review of a fix for a concurrency issue in blklogwrites,
it was found that the driver needs an additional fix when enabling
multiqueue, which is a new feature introduced in QEMU 9.0, as the
driver state may be read and written by multiple threads at the same
time, which was not the case when the driver was originally written.

Fix the multi-threaded scenario by introducing a mutex to protect the
mutable fields in the driver state, and always having the mutex locked
by the current thread when accessing them. Also use the mutex and a
CoQueue to ensure that the super block is not being written to by
multiple threads concurrently and updates are properly serialized.

Additionally, add the const qualifier to a few BDRVBlkLogWritesState
pointer targets in contexts where the driver state is not written to.

Signed-off-by: Ari Sundholm <ari@tuxera.com>

v1->v2: Ensure that the super block is not written to concurrently.
v2->v3: Use a CoQueue instead of a condition variable, as the latter
does not make the currently executing coroutine yield on entering a
wait.
---
 block/blklogwrites.c | 85 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 77 insertions(+), 8 deletions(-)

diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index ba717dab4d..399819a2a1 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -3,7 +3,7 @@
  *
  * Copyright (c) 2017 Tuomas Tynkkynen <tuomas@tuxera.com>
  * Copyright (c) 2018 Aapo Vienamo <aapo@tuxera.com>
- * Copyright (c) 2018 Ari Sundholm <ari@tuxera.com>
+ * Copyright (c) 2018-2024 Ari Sundholm <ari@tuxera.com>
  *
  * This work is licensed under the terms of the GNU GPL, version 2 or later.
  * See the COPYING file in the top-level directory.
@@ -55,9 +55,34 @@ typedef struct {
     BdrvChild *log_file;
     uint32_t sectorsize;
     uint32_t sectorbits;
+    uint64_t update_interval;
+
+    /*
+     * The mutable state of the driver, consisting of the current log sector
+     * and the number of log entries.
+     *
+     * May be read and/or written from multiple threads, and the mutex must be
+     * held when accessing these fields.
+     */
     uint64_t cur_log_sector;
     uint64_t nr_entries;
-    uint64_t update_interval;
+    QemuMutex mutex;
+
+    /*
+     * The super block sequence number. Non-zero if a super block update is in
+     * progress.
+     *
+     * The mutex must be held when accessing this field.
+     */
+    uint64_t super_update_seq;
+
+    /*
+     * A coroutine-aware queue to serialize super block updates.
+     *
+     * Used with the mutex to ensure that only one thread be updating the super
+     * block at a time.
+     */
+    CoQueue super_update_queue;
 } BDRVBlkLogWritesState;
 
 static QemuOptsList runtime_opts = {
@@ -169,6 +194,9 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
         goto fail;
     }
 
+    qemu_mutex_init(&s->mutex);
+    qemu_co_queue_init(&s->super_update_queue);
+
     log_append = qemu_opt_get_bool(opts, "log-append", false);
 
     if (log_append) {
@@ -231,6 +259,8 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
         s->nr_entries = 0;
     }
 
+    s->super_update_seq = 0;
+
     if (!blk_log_writes_sector_size_valid(log_sector_size)) {
         ret = -EINVAL;
         error_setg(errp, "Invalid log sector size %"PRIu64, log_sector_size);
@@ -255,6 +285,7 @@ fail_log:
         bdrv_unref_child(bs, s->log_file);
         bdrv_graph_wrunlock();
         s->log_file = NULL;
+        qemu_mutex_destroy(&s->mutex);
     }
 fail:
     qemu_opts_del(opts);
@@ -269,6 +300,7 @@ static void blk_log_writes_close(BlockDriverState *bs)
     bdrv_unref_child(bs, s->log_file);
     s->log_file = NULL;
     bdrv_graph_wrunlock();
+    qemu_mutex_destroy(&s->mutex);
 }
 
 static int64_t coroutine_fn GRAPH_RDLOCK
@@ -295,7 +327,7 @@ static void blk_log_writes_child_perm(BlockDriverState *bs, BdrvChild *c,
 
 static void blk_log_writes_refresh_limits(BlockDriverState *bs, Error **errp)
 {
-    BDRVBlkLogWritesState *s = bs->opaque;
+    const BDRVBlkLogWritesState *s = bs->opaque;
     bs->bl.request_alignment = s->sectorsize;
 }
 
@@ -338,15 +370,18 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
      * driver may be modified by other driver operations while waiting for the
      * I/O to complete.
      */
+    qemu_mutex_lock(&s->mutex);
     const uint64_t entry_start_sector = s->cur_log_sector;
     const uint64_t entry_offset = entry_start_sector << s->sectorbits;
     const uint64_t qiov_aligned_size = ROUND_UP(lr->qiov->size, s->sectorsize);
     const uint64_t entry_aligned_size = qiov_aligned_size +
         ROUND_UP(lr->zero_size, s->sectorsize);
     const uint64_t entry_nr_sectors = entry_aligned_size >> s->sectorbits;
+    const uint64_t entry_seq = s->nr_entries + 1;
 
-    s->nr_entries++;
+    s->nr_entries = entry_seq;
     s->cur_log_sector += entry_nr_sectors;
+    qemu_mutex_unlock(&s->mutex);
 
     /*
      * Write the log entry. Note that if this is a "write zeroes" operation,
@@ -366,17 +401,44 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
 
     /* Update super block on flush or every update interval */
     if (lr->log_ret == 0 && ((lr->entry.flags & LOG_FLUSH_FLAG)
-        || (s->nr_entries % s->update_interval == 0)))
+        || (entry_seq % s->update_interval == 0)))
     {
         struct log_write_super super = {
             .magic      = cpu_to_le64(WRITE_LOG_MAGIC),
             .version    = cpu_to_le64(WRITE_LOG_VERSION),
-            .nr_entries = cpu_to_le64(s->nr_entries),
+            .nr_entries = const_le64(0),
             .sectorsize = cpu_to_le32(s->sectorsize),
         };
-        void *zeroes = g_malloc0(s->sectorsize - sizeof(super));
+        void *zeroes;
         QEMUIOVector qiov;
 
+        /*
+         * Wait if a super block update is already in progress.
+         * Bail out if a newer update got its turn before us.
+         */
+        WITH_QEMU_LOCK_GUARD(&s->mutex) {
+            CoQueueWaitFlags wait_flags = 0;
+            while (s->super_update_seq) {
+                if (entry_seq < s->super_update_seq) {
+                    return;
+                }
+                qemu_co_queue_wait_flags(&s->super_update_queue,
+                    &s->mutex, wait_flags);
+
+                /*
+                 * In case the wait condition remains true after wakeup,
+                 * to avoid starvation, make sure that this request is
+                 * scheduled to rerun next by pushing it to the front of the
+                 * queue.
+                 */
+                wait_flags = CO_QUEUE_WAIT_FRONT;
+            }
+            s->super_update_seq = entry_seq;
+            super.nr_entries = cpu_to_le64(s->nr_entries);
+        }
+
+        zeroes = g_malloc0(s->sectorsize - sizeof(super));
+
         qemu_iovec_init(&qiov, 2);
         qemu_iovec_add(&qiov, &super, sizeof(super));
         qemu_iovec_add(&qiov, zeroes, s->sectorsize - sizeof(super));
@@ -386,6 +448,13 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
         if (lr->log_ret == 0) {
             lr->log_ret = bdrv_co_flush(s->log_file->bs);
         }
+
+        /* The super block has been updated. Let another request have a go. */
+        qemu_mutex_lock(&s->mutex);
+        s->super_update_seq = 0;
+        (void) qemu_co_queue_next(&s->super_update_queue);
+        qemu_mutex_unlock(&s->mutex);
+
         qemu_iovec_destroy(&qiov);
         g_free(zeroes);
     }
@@ -405,7 +474,7 @@ blk_log_writes_co_log(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
 {
     QEMUIOVector log_qiov;
     size_t niov = qiov ? qiov->niov : 0;
-    BDRVBlkLogWritesState *s = bs->opaque;
+    const BDRVBlkLogWritesState *s = bs->opaque;
     BlkLogWritesFileReq fr = {
         .bs         = bs,
         .offset     = offset,
-- 
2.43.0

Re: [PATCH v3] block/blklogwrites: Protect mutable driver state with a mutex.

Posted by Kevin Wolf 8 months ago

Am 19.01.2024 um 17:29 hat Ari Sundholm geschrieben:
> During the review of a fix for a concurrency issue in blklogwrites,
> it was found that the driver needs an additional fix when enabling
> multiqueue, which is a new feature introduced in QEMU 9.0, as the
> driver state may be read and written by multiple threads at the same
> time, which was not the case when the driver was originally written.
> 
> Fix the multi-threaded scenario by introducing a mutex to protect the
> mutable fields in the driver state, and always having the mutex locked
> by the current thread when accessing them. Also use the mutex and a
> CoQueue to ensure that the super block is not being written to by
> multiple threads concurrently and updates are properly serialized.
> 
> Additionally, add the const qualifier to a few BDRVBlkLogWritesState
> pointer targets in contexts where the driver state is not written to.
> 
> Signed-off-by: Ari Sundholm <ari@tuxera.com>
> 
> v1->v2: Ensure that the super block is not written to concurrently.
> v2->v3: Use a CoQueue instead of a condition variable, as the latter
> does not make the currently executing coroutine yield on entering a
> wait.
> ---
>  block/blklogwrites.c | 85 +++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 77 insertions(+), 8 deletions(-)

For your next series, please put the changelog between versions below
the "---" marker so that git am doesn't consider it part of the commit
message.

Thanks, applied to the block branch.

Kevin

[PATCH v2] block/blklogwrites: Protect mutable driver state with a mutex.

Posted by Ari Sundholm via 8 months, 1 week ago

During the review of a fix for a concurrency issue in blklogwrites,
it was found that the driver needs an additional fix when enabling
multiqueue, which is a new feature introduced in QEMU 9.0, as the
driver state may be read and written by multiple threads at the same
time, which was not the case when the driver was originally written.

Fix the multi-threaded scenario by introducing a mutex to protect the
mutable fields in the driver state, and always having the mutex locked
by the current thread when accessing them. Also use the mutex and a
condition variable to ensure that the super block is not being written
to by multiple threads concurrently.

Additionally, add the const qualifier to a few BDRVBlkLogWritesState
pointer targets in contexts where the driver state is not written to.

Signed-off-by: Ari Sundholm <ari@tuxera.com>

v1->v2: Ensure that the super block is not written to concurrently.
---
 block/blklogwrites.c | 77 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 69 insertions(+), 8 deletions(-)

diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index ba717dab4d..f8bec7c863 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -3,7 +3,7 @@
  *
  * Copyright (c) 2017 Tuomas Tynkkynen <tuomas@tuxera.com>
  * Copyright (c) 2018 Aapo Vienamo <aapo@tuxera.com>
- * Copyright (c) 2018 Ari Sundholm <ari@tuxera.com>
+ * Copyright (c) 2018-2024 Ari Sundholm <ari@tuxera.com>
  *
  * This work is licensed under the terms of the GNU GPL, version 2 or later.
  * See the COPYING file in the top-level directory.
@@ -55,9 +55,34 @@ typedef struct {
     BdrvChild *log_file;
     uint32_t sectorsize;
     uint32_t sectorbits;
+    uint64_t update_interval;
+
+    /*
+     * The mutable state of the driver, consisting of the current log sector
+     * and the number of log entries.
+     *
+     * May be read and/or written from multiple threads, and the mutex must be
+     * held when accessing these fields.
+     */
     uint64_t cur_log_sector;
     uint64_t nr_entries;
-    uint64_t update_interval;
+    QemuMutex mutex;
+
+    /*
+     * The super block sequence number. Non-zero if a super block update is in
+     * progress.
+     *
+     * The mutex must be held when accessing this field.
+     */
+    uint64_t super_update_seq;
+
+    /*
+     * A condition variable to wait for and signal finished superblock updates.
+     *
+     * Used with the mutex to ensure that only one thread be updating the super
+     * block at a time.
+     */
+    QemuCond super_updated;
 } BDRVBlkLogWritesState;
 
 static QemuOptsList runtime_opts = {
@@ -169,6 +194,9 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
         goto fail;
     }
 
+    qemu_mutex_init(&s->mutex);
+    qemu_cond_init(&s->super_updated);
+
     log_append = qemu_opt_get_bool(opts, "log-append", false);
 
     if (log_append) {
@@ -231,6 +259,8 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
         s->nr_entries = 0;
     }
 
+    s->super_update_seq = 0;
+
     if (!blk_log_writes_sector_size_valid(log_sector_size)) {
         ret = -EINVAL;
         error_setg(errp, "Invalid log sector size %"PRIu64, log_sector_size);
@@ -255,6 +285,8 @@ fail_log:
         bdrv_unref_child(bs, s->log_file);
         bdrv_graph_wrunlock();
         s->log_file = NULL;
+        qemu_cond_destroy(&s->super_updated);
+        qemu_mutex_destroy(&s->mutex);
     }
 fail:
     qemu_opts_del(opts);
@@ -269,6 +301,8 @@ static void blk_log_writes_close(BlockDriverState *bs)
     bdrv_unref_child(bs, s->log_file);
     s->log_file = NULL;
     bdrv_graph_wrunlock();
+    qemu_cond_destroy(&s->super_updated);
+    qemu_mutex_destroy(&s->mutex);
 }
 
 static int64_t coroutine_fn GRAPH_RDLOCK
@@ -295,7 +329,7 @@ static void blk_log_writes_child_perm(BlockDriverState *bs, BdrvChild *c,
 
 static void blk_log_writes_refresh_limits(BlockDriverState *bs, Error **errp)
 {
-    BDRVBlkLogWritesState *s = bs->opaque;
+    const BDRVBlkLogWritesState *s = bs->opaque;
     bs->bl.request_alignment = s->sectorsize;
 }
 
@@ -338,15 +372,18 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
      * driver may be modified by other driver operations while waiting for the
      * I/O to complete.
      */
+    qemu_mutex_lock(&s->mutex);
     const uint64_t entry_start_sector = s->cur_log_sector;
     const uint64_t entry_offset = entry_start_sector << s->sectorbits;
     const uint64_t qiov_aligned_size = ROUND_UP(lr->qiov->size, s->sectorsize);
     const uint64_t entry_aligned_size = qiov_aligned_size +
         ROUND_UP(lr->zero_size, s->sectorsize);
     const uint64_t entry_nr_sectors = entry_aligned_size >> s->sectorbits;
+    const uint64_t entry_seq = s->nr_entries + 1;
 
-    s->nr_entries++;
+    s->nr_entries = entry_seq;
     s->cur_log_sector += entry_nr_sectors;
+    qemu_mutex_unlock(&s->mutex);
 
     /*
      * Write the log entry. Note that if this is a "write zeroes" operation,
@@ -366,17 +403,34 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
 
     /* Update super block on flush or every update interval */
     if (lr->log_ret == 0 && ((lr->entry.flags & LOG_FLUSH_FLAG)
-        || (s->nr_entries % s->update_interval == 0)))
+        || (entry_seq % s->update_interval == 0)))
     {
         struct log_write_super super = {
             .magic      = cpu_to_le64(WRITE_LOG_MAGIC),
             .version    = cpu_to_le64(WRITE_LOG_VERSION),
-            .nr_entries = cpu_to_le64(s->nr_entries),
+            .nr_entries = const_le64(0),
             .sectorsize = cpu_to_le32(s->sectorsize),
         };
-        void *zeroes = g_malloc0(s->sectorsize - sizeof(super));
+        void *zeroes;
         QEMUIOVector qiov;
 
+        /*
+         * Wait if a super block update is already in progress.
+         * Bail out if a newer update got its turn before us.
+         */
+        WITH_QEMU_LOCK_GUARD(&s->mutex) {
+            while (s->super_update_seq) {
+                if (entry_seq < s->super_update_seq) {
+                    return;
+                }
+                qemu_cond_wait(&s->super_updated, &s->mutex);
+            }
+            s->super_update_seq = entry_seq;
+            super.nr_entries = cpu_to_le64(s->nr_entries);
+        }
+
+        zeroes = g_malloc0(s->sectorsize - sizeof(super));
+
         qemu_iovec_init(&qiov, 2);
         qemu_iovec_add(&qiov, &super, sizeof(super));
         qemu_iovec_add(&qiov, zeroes, s->sectorsize - sizeof(super));
@@ -386,6 +440,13 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
         if (lr->log_ret == 0) {
             lr->log_ret = bdrv_co_flush(s->log_file->bs);
         }
+
+        /* The super block has been updated. Let another thread have a go. */
+        qemu_mutex_lock(&s->mutex);
+        s->super_update_seq = 0;
+        qemu_cond_signal(&s->super_updated);
+        qemu_mutex_unlock(&s->mutex);
+
         qemu_iovec_destroy(&qiov);
         g_free(zeroes);
     }
@@ -405,7 +466,7 @@ blk_log_writes_co_log(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
 {
     QEMUIOVector log_qiov;
     size_t niov = qiov ? qiov->niov : 0;
-    BDRVBlkLogWritesState *s = bs->opaque;
+    const BDRVBlkLogWritesState *s = bs->opaque;
     BlkLogWritesFileReq fr = {
         .bs         = bs,
         .offset     = offset,
-- 
2.43.0

Re: [PATCH v2] block/blklogwrites: Protect mutable driver state with a mutex.

Posted by Kevin Wolf 8 months ago

Am 11.01.2024 um 17:32 hat Ari Sundholm geschrieben:
> During the review of a fix for a concurrency issue in blklogwrites,
> it was found that the driver needs an additional fix when enabling
> multiqueue, which is a new feature introduced in QEMU 9.0, as the
> driver state may be read and written by multiple threads at the same
> time, which was not the case when the driver was originally written.
> 
> Fix the multi-threaded scenario by introducing a mutex to protect the
> mutable fields in the driver state, and always having the mutex locked
> by the current thread when accessing them. Also use the mutex and a
> condition variable to ensure that the super block is not being written
> to by multiple threads concurrently.
> 
> Additionally, add the const qualifier to a few BDRVBlkLogWritesState
> pointer targets in contexts where the driver state is not written to.
> 
> Signed-off-by: Ari Sundholm <ari@tuxera.com>
> 
> v1->v2: Ensure that the super block is not written to concurrently.
> ---
>  block/blklogwrites.c | 77 +++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 69 insertions(+), 8 deletions(-)
> 
> diff --git a/block/blklogwrites.c b/block/blklogwrites.c
> index ba717dab4d..f8bec7c863 100644
> --- a/block/blklogwrites.c
> +++ b/block/blklogwrites.c
> @@ -3,7 +3,7 @@
>   *
>   * Copyright (c) 2017 Tuomas Tynkkynen <tuomas@tuxera.com>
>   * Copyright (c) 2018 Aapo Vienamo <aapo@tuxera.com>
> - * Copyright (c) 2018 Ari Sundholm <ari@tuxera.com>
> + * Copyright (c) 2018-2024 Ari Sundholm <ari@tuxera.com>
>   *
>   * This work is licensed under the terms of the GNU GPL, version 2 or later.
>   * See the COPYING file in the top-level directory.
> @@ -55,9 +55,34 @@ typedef struct {
>      BdrvChild *log_file;
>      uint32_t sectorsize;
>      uint32_t sectorbits;
> +    uint64_t update_interval;
> +
> +    /*
> +     * The mutable state of the driver, consisting of the current log sector
> +     * and the number of log entries.
> +     *
> +     * May be read and/or written from multiple threads, and the mutex must be
> +     * held when accessing these fields.
> +     */
>      uint64_t cur_log_sector;
>      uint64_t nr_entries;
> -    uint64_t update_interval;
> +    QemuMutex mutex;
> +
> +    /*
> +     * The super block sequence number. Non-zero if a super block update is in
> +     * progress.
> +     *
> +     * The mutex must be held when accessing this field.
> +     */
> +    uint64_t super_update_seq;
> +
> +    /*
> +     * A condition variable to wait for and signal finished superblock updates.
> +     *
> +     * Used with the mutex to ensure that only one thread be updating the super
> +     * block at a time.
> +     */
> +    QemuCond super_updated;
>  } BDRVBlkLogWritesState;
>  
>  static QemuOptsList runtime_opts = {
> @@ -169,6 +194,9 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
>          goto fail;
>      }
>  
> +    qemu_mutex_init(&s->mutex);
> +    qemu_cond_init(&s->super_updated);
> +
>      log_append = qemu_opt_get_bool(opts, "log-append", false);
>  
>      if (log_append) {
> @@ -231,6 +259,8 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
>          s->nr_entries = 0;
>      }
>  
> +    s->super_update_seq = 0;
> +
>      if (!blk_log_writes_sector_size_valid(log_sector_size)) {
>          ret = -EINVAL;
>          error_setg(errp, "Invalid log sector size %"PRIu64, log_sector_size);
> @@ -255,6 +285,8 @@ fail_log:
>          bdrv_unref_child(bs, s->log_file);
>          bdrv_graph_wrunlock();
>          s->log_file = NULL;
> +        qemu_cond_destroy(&s->super_updated);
> +        qemu_mutex_destroy(&s->mutex);
>      }
>  fail:
>      qemu_opts_del(opts);
> @@ -269,6 +301,8 @@ static void blk_log_writes_close(BlockDriverState *bs)
>      bdrv_unref_child(bs, s->log_file);
>      s->log_file = NULL;
>      bdrv_graph_wrunlock();
> +    qemu_cond_destroy(&s->super_updated);
> +    qemu_mutex_destroy(&s->mutex);
>  }
>  
>  static int64_t coroutine_fn GRAPH_RDLOCK
> @@ -295,7 +329,7 @@ static void blk_log_writes_child_perm(BlockDriverState *bs, BdrvChild *c,
>  
>  static void blk_log_writes_refresh_limits(BlockDriverState *bs, Error **errp)
>  {
> -    BDRVBlkLogWritesState *s = bs->opaque;
> +    const BDRVBlkLogWritesState *s = bs->opaque;
>      bs->bl.request_alignment = s->sectorsize;
>  }
>  
> @@ -338,15 +372,18 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
>       * driver may be modified by other driver operations while waiting for the
>       * I/O to complete.
>       */
> +    qemu_mutex_lock(&s->mutex);
>      const uint64_t entry_start_sector = s->cur_log_sector;
>      const uint64_t entry_offset = entry_start_sector << s->sectorbits;
>      const uint64_t qiov_aligned_size = ROUND_UP(lr->qiov->size, s->sectorsize);
>      const uint64_t entry_aligned_size = qiov_aligned_size +
>          ROUND_UP(lr->zero_size, s->sectorsize);
>      const uint64_t entry_nr_sectors = entry_aligned_size >> s->sectorbits;
> +    const uint64_t entry_seq = s->nr_entries + 1;
>  
> -    s->nr_entries++;
> +    s->nr_entries = entry_seq;
>      s->cur_log_sector += entry_nr_sectors;
> +    qemu_mutex_unlock(&s->mutex);
>  
>      /*
>       * Write the log entry. Note that if this is a "write zeroes" operation,
> @@ -366,17 +403,34 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
>  
>      /* Update super block on flush or every update interval */
>      if (lr->log_ret == 0 && ((lr->entry.flags & LOG_FLUSH_FLAG)
> -        || (s->nr_entries % s->update_interval == 0)))
> +        || (entry_seq % s->update_interval == 0)))
>      {
>          struct log_write_super super = {
>              .magic      = cpu_to_le64(WRITE_LOG_MAGIC),
>              .version    = cpu_to_le64(WRITE_LOG_VERSION),
> -            .nr_entries = cpu_to_le64(s->nr_entries),
> +            .nr_entries = const_le64(0),
>              .sectorsize = cpu_to_le32(s->sectorsize),
>          };
> -        void *zeroes = g_malloc0(s->sectorsize - sizeof(super));
> +        void *zeroes;
>          QEMUIOVector qiov;
>  
> +        /*
> +         * Wait if a super block update is already in progress.
> +         * Bail out if a newer update got its turn before us.
> +         */
> +        WITH_QEMU_LOCK_GUARD(&s->mutex) {
> +            while (s->super_update_seq) {
> +                if (entry_seq < s->super_update_seq) {
> +                    return;
> +                }
> +                qemu_cond_wait(&s->super_updated, &s->mutex);

This will block, which is exactly what you want if another thread is
writing the super block.

However, in a single-threaded case where it's just the previous request
coroutine that is still writing its super block (i.e. bdrv_co_pwritev()
below has yielded), this will deadlock because we'll never switch back
and actually complete the previous super block write.

So unless I'm missing a reason why this won't happen, I think you need a
coroutine aware mechanism here. The obvious options would be using a
CoMutex in the first place and holding it across the I/O operation or
keeping the cheaper QemuMutex and replacing the condition variable with
a CoQueue.

> +            }
> +            s->super_update_seq = entry_seq;
> +            super.nr_entries = cpu_to_le64(s->nr_entries);
> +        }
> +
> +        zeroes = g_malloc0(s->sectorsize - sizeof(super));
> +
>          qemu_iovec_init(&qiov, 2);
>          qemu_iovec_add(&qiov, &super, sizeof(super));
>          qemu_iovec_add(&qiov, zeroes, s->sectorsize - sizeof(super));

Kevin

Re: [PATCH v2] block/blklogwrites: Protect mutable driver state with a mutex.

Posted by Ari Sundholm 8 months ago

On 1/18/24 21:18, Kevin Wolf wrote:
> Am 11.01.2024 um 17:32 hat Ari Sundholm geschrieben:
>> During the review of a fix for a concurrency issue in blklogwrites,
>> it was found that the driver needs an additional fix when enabling
>> multiqueue, which is a new feature introduced in QEMU 9.0, as the
>> driver state may be read and written by multiple threads at the same
>> time, which was not the case when the driver was originally written.
>>
>> Fix the multi-threaded scenario by introducing a mutex to protect the
>> mutable fields in the driver state, and always having the mutex locked
>> by the current thread when accessing them. Also use the mutex and a
>> condition variable to ensure that the super block is not being written
>> to by multiple threads concurrently.
>>
>> Additionally, add the const qualifier to a few BDRVBlkLogWritesState
>> pointer targets in contexts where the driver state is not written to.
>>
>> Signed-off-by: Ari Sundholm <ari@tuxera.com>
>>
>> v1->v2: Ensure that the super block is not written to concurrently.
>> ---
>>   block/blklogwrites.c | 77 +++++++++++++++++++++++++++++++++++++++-----
>>   1 file changed, 69 insertions(+), 8 deletions(-)
>>
>> diff --git a/block/blklogwrites.c b/block/blklogwrites.c
>> index ba717dab4d..f8bec7c863 100644
>> --- a/block/blklogwrites.c
>> +++ b/block/blklogwrites.c
>> @@ -3,7 +3,7 @@
>>    *
>>    * Copyright (c) 2017 Tuomas Tynkkynen <tuomas@tuxera.com>
>>    * Copyright (c) 2018 Aapo Vienamo <aapo@tuxera.com>
>> - * Copyright (c) 2018 Ari Sundholm <ari@tuxera.com>
>> + * Copyright (c) 2018-2024 Ari Sundholm <ari@tuxera.com>
>>    *
>>    * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>    * See the COPYING file in the top-level directory.
>> @@ -55,9 +55,34 @@ typedef struct {
>>       BdrvChild *log_file;
>>       uint32_t sectorsize;
>>       uint32_t sectorbits;
>> +    uint64_t update_interval;
>> +
>> +    /*
>> +     * The mutable state of the driver, consisting of the current log sector
>> +     * and the number of log entries.
>> +     *
>> +     * May be read and/or written from multiple threads, and the mutex must be
>> +     * held when accessing these fields.
>> +     */
>>       uint64_t cur_log_sector;
>>       uint64_t nr_entries;
>> -    uint64_t update_interval;
>> +    QemuMutex mutex;
>> +
>> +    /*
>> +     * The super block sequence number. Non-zero if a super block update is in
>> +     * progress.
>> +     *
>> +     * The mutex must be held when accessing this field.
>> +     */
>> +    uint64_t super_update_seq;
>> +
>> +    /*
>> +     * A condition variable to wait for and signal finished superblock updates.
>> +     *
>> +     * Used with the mutex to ensure that only one thread be updating the super
>> +     * block at a time.
>> +     */
>> +    QemuCond super_updated;
>>   } BDRVBlkLogWritesState;
>>   
>>   static QemuOptsList runtime_opts = {
>> @@ -169,6 +194,9 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
>>           goto fail;
>>       }
>>   
>> +    qemu_mutex_init(&s->mutex);
>> +    qemu_cond_init(&s->super_updated);
>> +
>>       log_append = qemu_opt_get_bool(opts, "log-append", false);
>>   
>>       if (log_append) {
>> @@ -231,6 +259,8 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
>>           s->nr_entries = 0;
>>       }
>>   
>> +    s->super_update_seq = 0;
>> +
>>       if (!blk_log_writes_sector_size_valid(log_sector_size)) {
>>           ret = -EINVAL;
>>           error_setg(errp, "Invalid log sector size %"PRIu64, log_sector_size);
>> @@ -255,6 +285,8 @@ fail_log:
>>           bdrv_unref_child(bs, s->log_file);
>>           bdrv_graph_wrunlock();
>>           s->log_file = NULL;
>> +        qemu_cond_destroy(&s->super_updated);
>> +        qemu_mutex_destroy(&s->mutex);
>>       }
>>   fail:
>>       qemu_opts_del(opts);
>> @@ -269,6 +301,8 @@ static void blk_log_writes_close(BlockDriverState *bs)
>>       bdrv_unref_child(bs, s->log_file);
>>       s->log_file = NULL;
>>       bdrv_graph_wrunlock();
>> +    qemu_cond_destroy(&s->super_updated);
>> +    qemu_mutex_destroy(&s->mutex);
>>   }
>>   
>>   static int64_t coroutine_fn GRAPH_RDLOCK
>> @@ -295,7 +329,7 @@ static void blk_log_writes_child_perm(BlockDriverState *bs, BdrvChild *c,
>>   
>>   static void blk_log_writes_refresh_limits(BlockDriverState *bs, Error **errp)
>>   {
>> -    BDRVBlkLogWritesState *s = bs->opaque;
>> +    const BDRVBlkLogWritesState *s = bs->opaque;
>>       bs->bl.request_alignment = s->sectorsize;
>>   }
>>   
>> @@ -338,15 +372,18 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
>>        * driver may be modified by other driver operations while waiting for the
>>        * I/O to complete.
>>        */
>> +    qemu_mutex_lock(&s->mutex);
>>       const uint64_t entry_start_sector = s->cur_log_sector;
>>       const uint64_t entry_offset = entry_start_sector << s->sectorbits;
>>       const uint64_t qiov_aligned_size = ROUND_UP(lr->qiov->size, s->sectorsize);
>>       const uint64_t entry_aligned_size = qiov_aligned_size +
>>           ROUND_UP(lr->zero_size, s->sectorsize);
>>       const uint64_t entry_nr_sectors = entry_aligned_size >> s->sectorbits;
>> +    const uint64_t entry_seq = s->nr_entries + 1;
>>   
>> -    s->nr_entries++;
>> +    s->nr_entries = entry_seq;
>>       s->cur_log_sector += entry_nr_sectors;
>> +    qemu_mutex_unlock(&s->mutex);
>>   
>>       /*
>>        * Write the log entry. Note that if this is a "write zeroes" operation,
>> @@ -366,17 +403,34 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
>>   
>>       /* Update super block on flush or every update interval */
>>       if (lr->log_ret == 0 && ((lr->entry.flags & LOG_FLUSH_FLAG)
>> -        || (s->nr_entries % s->update_interval == 0)))
>> +        || (entry_seq % s->update_interval == 0)))
>>       {
>>           struct log_write_super super = {
>>               .magic      = cpu_to_le64(WRITE_LOG_MAGIC),
>>               .version    = cpu_to_le64(WRITE_LOG_VERSION),
>> -            .nr_entries = cpu_to_le64(s->nr_entries),
>> +            .nr_entries = const_le64(0),
>>               .sectorsize = cpu_to_le32(s->sectorsize),
>>           };
>> -        void *zeroes = g_malloc0(s->sectorsize - sizeof(super));
>> +        void *zeroes;
>>           QEMUIOVector qiov;
>>   
>> +        /*
>> +         * Wait if a super block update is already in progress.
>> +         * Bail out if a newer update got its turn before us.
>> +         */
>> +        WITH_QEMU_LOCK_GUARD(&s->mutex) {
>> +            while (s->super_update_seq) {
>> +                if (entry_seq < s->super_update_seq) {
>> +                    return;
>> +                }
>> +                qemu_cond_wait(&s->super_updated, &s->mutex);
> 
> This will block, which is exactly what you want if another thread is
> writing the super block.
> 
> However, in a single-threaded case where it's just the previous request
> coroutine that is still writing its super block (i.e. bdrv_co_pwritev()
> below has yielded), this will deadlock because we'll never switch back
> and actually complete the previous super block write.
> 
> So unless I'm missing a reason why this won't happen, I think you need a
> coroutine aware mechanism here. The obvious options would be using a
> CoMutex in the first place and holding it across the I/O operation or
> keeping the cheaper QemuMutex and replacing the condition variable with
> a CoQueue.
> 

Yup. Indeed, you are right. It took me a while to see why. Thanks for 
pointing this out. I had not properly considered the fact that QemuCond 
does not yield on entering a wait in coroutine context.

Posted v3, where the condition variable has been substituted with a 
CoQueue. Hopefully I did it right this time.

If I am reading the CoQueue implementation correctly, this should have 
an additional bonus of fairness among the pending super block updates 
due to the execution order imposed, and it also appears that spurious 
wakeups might not be an issue like they would be with pthread condvars? 
I am not relying on these properties in the v3 patch, though.

Best regards,
Ari Sundholm
ari@tuxera.com

>> +            }
>> +            s->super_update_seq = entry_seq;
>> +            super.nr_entries = cpu_to_le64(s->nr_entries);
>> +        }
>> +
>> +        zeroes = g_malloc0(s->sectorsize - sizeof(super));
>> +
>>           qemu_iovec_init(&qiov, 2);
>>           qemu_iovec_add(&qiov, &super, sizeof(super));
>>           qemu_iovec_add(&qiov, zeroes, s->sectorsize - sizeof(super));
> 
> Kevin
>

Re: [PATCH v2] block/blklogwrites: Protect mutable driver state with a mutex.

Posted by Kevin Wolf 8 months ago

Am 19.01.2024 um 17:55 hat Ari Sundholm geschrieben:
> On 1/18/24 21:18, Kevin Wolf wrote:
> > Am 11.01.2024 um 17:32 hat Ari Sundholm geschrieben:
> > > During the review of a fix for a concurrency issue in blklogwrites,
> > > it was found that the driver needs an additional fix when enabling
> > > multiqueue, which is a new feature introduced in QEMU 9.0, as the
> > > driver state may be read and written by multiple threads at the same
> > > time, which was not the case when the driver was originally written.
> > > 
> > > Fix the multi-threaded scenario by introducing a mutex to protect the
> > > mutable fields in the driver state, and always having the mutex locked
> > > by the current thread when accessing them. Also use the mutex and a
> > > condition variable to ensure that the super block is not being written
> > > to by multiple threads concurrently.
> > > 
> > > Additionally, add the const qualifier to a few BDRVBlkLogWritesState
> > > pointer targets in contexts where the driver state is not written to.
> > > 
> > > Signed-off-by: Ari Sundholm <ari@tuxera.com>
> > > 
> > > v1->v2: Ensure that the super block is not written to concurrently.
> > > ---
> > >   block/blklogwrites.c | 77 +++++++++++++++++++++++++++++++++++++++-----
> > >   1 file changed, 69 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/block/blklogwrites.c b/block/blklogwrites.c
> > > index ba717dab4d..f8bec7c863 100644
> > > --- a/block/blklogwrites.c
> > > +++ b/block/blklogwrites.c
> > > @@ -3,7 +3,7 @@
> > >    *
> > >    * Copyright (c) 2017 Tuomas Tynkkynen <tuomas@tuxera.com>
> > >    * Copyright (c) 2018 Aapo Vienamo <aapo@tuxera.com>
> > > - * Copyright (c) 2018 Ari Sundholm <ari@tuxera.com>
> > > + * Copyright (c) 2018-2024 Ari Sundholm <ari@tuxera.com>
> > >    *
> > >    * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > >    * See the COPYING file in the top-level directory.
> > > @@ -55,9 +55,34 @@ typedef struct {
> > >       BdrvChild *log_file;
> > >       uint32_t sectorsize;
> > >       uint32_t sectorbits;
> > > +    uint64_t update_interval;
> > > +
> > > +    /*
> > > +     * The mutable state of the driver, consisting of the current log sector
> > > +     * and the number of log entries.
> > > +     *
> > > +     * May be read and/or written from multiple threads, and the mutex must be
> > > +     * held when accessing these fields.
> > > +     */
> > >       uint64_t cur_log_sector;
> > >       uint64_t nr_entries;
> > > -    uint64_t update_interval;
> > > +    QemuMutex mutex;
> > > +
> > > +    /*
> > > +     * The super block sequence number. Non-zero if a super block update is in
> > > +     * progress.
> > > +     *
> > > +     * The mutex must be held when accessing this field.
> > > +     */
> > > +    uint64_t super_update_seq;
> > > +
> > > +    /*
> > > +     * A condition variable to wait for and signal finished superblock updates.
> > > +     *
> > > +     * Used with the mutex to ensure that only one thread be updating the super
> > > +     * block at a time.
> > > +     */
> > > +    QemuCond super_updated;
> > >   } BDRVBlkLogWritesState;
> > >   static QemuOptsList runtime_opts = {
> > > @@ -169,6 +194,9 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
> > >           goto fail;
> > >       }
> > > +    qemu_mutex_init(&s->mutex);
> > > +    qemu_cond_init(&s->super_updated);
> > > +
> > >       log_append = qemu_opt_get_bool(opts, "log-append", false);
> > >       if (log_append) {
> > > @@ -231,6 +259,8 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
> > >           s->nr_entries = 0;
> > >       }
> > > +    s->super_update_seq = 0;
> > > +
> > >       if (!blk_log_writes_sector_size_valid(log_sector_size)) {
> > >           ret = -EINVAL;
> > >           error_setg(errp, "Invalid log sector size %"PRIu64, log_sector_size);
> > > @@ -255,6 +285,8 @@ fail_log:
> > >           bdrv_unref_child(bs, s->log_file);
> > >           bdrv_graph_wrunlock();
> > >           s->log_file = NULL;
> > > +        qemu_cond_destroy(&s->super_updated);
> > > +        qemu_mutex_destroy(&s->mutex);
> > >       }
> > >   fail:
> > >       qemu_opts_del(opts);
> > > @@ -269,6 +301,8 @@ static void blk_log_writes_close(BlockDriverState *bs)
> > >       bdrv_unref_child(bs, s->log_file);
> > >       s->log_file = NULL;
> > >       bdrv_graph_wrunlock();
> > > +    qemu_cond_destroy(&s->super_updated);
> > > +    qemu_mutex_destroy(&s->mutex);
> > >   }
> > >   static int64_t coroutine_fn GRAPH_RDLOCK
> > > @@ -295,7 +329,7 @@ static void blk_log_writes_child_perm(BlockDriverState *bs, BdrvChild *c,
> > >   static void blk_log_writes_refresh_limits(BlockDriverState *bs, Error **errp)
> > >   {
> > > -    BDRVBlkLogWritesState *s = bs->opaque;
> > > +    const BDRVBlkLogWritesState *s = bs->opaque;
> > >       bs->bl.request_alignment = s->sectorsize;
> > >   }
> > > @@ -338,15 +372,18 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
> > >        * driver may be modified by other driver operations while waiting for the
> > >        * I/O to complete.
> > >        */
> > > +    qemu_mutex_lock(&s->mutex);
> > >       const uint64_t entry_start_sector = s->cur_log_sector;
> > >       const uint64_t entry_offset = entry_start_sector << s->sectorbits;
> > >       const uint64_t qiov_aligned_size = ROUND_UP(lr->qiov->size, s->sectorsize);
> > >       const uint64_t entry_aligned_size = qiov_aligned_size +
> > >           ROUND_UP(lr->zero_size, s->sectorsize);
> > >       const uint64_t entry_nr_sectors = entry_aligned_size >> s->sectorbits;
> > > +    const uint64_t entry_seq = s->nr_entries + 1;
> > > -    s->nr_entries++;
> > > +    s->nr_entries = entry_seq;
> > >       s->cur_log_sector += entry_nr_sectors;
> > > +    qemu_mutex_unlock(&s->mutex);
> > >       /*
> > >        * Write the log entry. Note that if this is a "write zeroes" operation,
> > > @@ -366,17 +403,34 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
> > >       /* Update super block on flush or every update interval */
> > >       if (lr->log_ret == 0 && ((lr->entry.flags & LOG_FLUSH_FLAG)
> > > -        || (s->nr_entries % s->update_interval == 0)))
> > > +        || (entry_seq % s->update_interval == 0)))
> > >       {
> > >           struct log_write_super super = {
> > >               .magic      = cpu_to_le64(WRITE_LOG_MAGIC),
> > >               .version    = cpu_to_le64(WRITE_LOG_VERSION),
> > > -            .nr_entries = cpu_to_le64(s->nr_entries),
> > > +            .nr_entries = const_le64(0),
> > >               .sectorsize = cpu_to_le32(s->sectorsize),
> > >           };
> > > -        void *zeroes = g_malloc0(s->sectorsize - sizeof(super));
> > > +        void *zeroes;
> > >           QEMUIOVector qiov;
> > > +        /*
> > > +         * Wait if a super block update is already in progress.
> > > +         * Bail out if a newer update got its turn before us.
> > > +         */
> > > +        WITH_QEMU_LOCK_GUARD(&s->mutex) {
> > > +            while (s->super_update_seq) {
> > > +                if (entry_seq < s->super_update_seq) {
> > > +                    return;
> > > +                }
> > > +                qemu_cond_wait(&s->super_updated, &s->mutex);
> > 
> > This will block, which is exactly what you want if another thread is
> > writing the super block.
> > 
> > However, in a single-threaded case where it's just the previous request
> > coroutine that is still writing its super block (i.e. bdrv_co_pwritev()
> > below has yielded), this will deadlock because we'll never switch back
> > and actually complete the previous super block write.
> > 
> > So unless I'm missing a reason why this won't happen, I think you need a
> > coroutine aware mechanism here. The obvious options would be using a
> > CoMutex in the first place and holding it across the I/O operation or
> > keeping the cheaper QemuMutex and replacing the condition variable with
> > a CoQueue.
> > 
> 
> Yup. Indeed, you are right. It took me a while to see why. Thanks for
> pointing this out. I had not properly considered the fact that QemuCond does
> not yield on entering a wait in coroutine context.

Yes, exactly.

> Posted v3, where the condition variable has been substituted with a CoQueue.
> Hopefully I did it right this time.

Looks good at the first sight, but I'll properly review it now.

> If I am reading the CoQueue implementation correctly, this should have an
> additional bonus of fairness among the pending super block updates due to
> the execution order imposed

Yes, though I'm not sure how much of a bonus this really is here. If you
process the latest update first, you wouldn't have to do any I/O for the
earlier requests any more. But completely without it, you could indeed
get starvation.

Probably not worth worrying too much about.

> and it also appears that spurious wakeups might not be an issue like
> they would be with pthread condvars? I am not relying on these
> properties in the v3 patch, though.

I think in your case you can get spurious wakeups because you only go
through the queue if someone is already updating the super block.
aio_co_wake() doesn't immediately enter the coroutine that is woken up,
but it just schedules it to run when the current one yields or
terminates. So another thread could start updating the super block in
between. Maybe it can happen even in the single threaded case that
another request coroutine can be scheduled first, I'm not sure about
this.

It would be possible to avoid this by resetting super_update_seq to 0
only in the woken up coroutine or if the queue is empty. Not sure if
this would be any better in practice, though.

Kevin

[PATCH] block/blklogwrites: Protect mutable driver state with a mutex.

Posted by Ari Sundholm via 8 months, 1 week ago

During the review of a fix for a concurrency issue in blklogwrites,
it was found that the driver needs an additional fix when enabling
multiqueue, which is a new feature introduced in QEMU 9.0, as the
driver state may be read and written by multiple threads at the same
time, which was not the case when the driver was originally written.

Fix the multi-threaded scenario by introducing a mutex to protect the
mutable fields in the driver state, and always having the mutex locked
by the current thread when accessing them.

Additionally, add the const qualifier to a few BDRVBlkLogWritesState
pointer targets in contexts where the driver state is not written to.

Signed-off-by: Ari Sundholm <ari@tuxera.com>
---
 block/blklogwrites.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/block/blklogwrites.c b/block/blklogwrites.c
index ba717dab4d..50f68df2a6 100644
--- a/block/blklogwrites.c
+++ b/block/blklogwrites.c
@@ -3,7 +3,7 @@
  *
  * Copyright (c) 2017 Tuomas Tynkkynen <tuomas@tuxera.com>
  * Copyright (c) 2018 Aapo Vienamo <aapo@tuxera.com>
- * Copyright (c) 2018 Ari Sundholm <ari@tuxera.com>
+ * Copyright (c) 2018-2024 Ari Sundholm <ari@tuxera.com>
  *
  * This work is licensed under the terms of the GNU GPL, version 2 or later.
  * See the COPYING file in the top-level directory.
@@ -55,9 +55,18 @@ typedef struct {
     BdrvChild *log_file;
     uint32_t sectorsize;
     uint32_t sectorbits;
+    uint64_t update_interval;
+
+    /*
+     * The mutable state of the driver, consisting of the current log sector
+     * and the number of log entries.
+     *
+     * May be read and/or written from multiple threads, and the mutex must be
+     * held when accessing these fields.
+     */
     uint64_t cur_log_sector;
     uint64_t nr_entries;
-    uint64_t update_interval;
+    QemuMutex mutex;
 } BDRVBlkLogWritesState;
 
 static QemuOptsList runtime_opts = {
@@ -149,6 +158,7 @@ static int blk_log_writes_open(BlockDriverState *bs, QDict *options, int flags,
     uint64_t log_sector_size;
     bool log_append;
 
+    qemu_mutex_init(&s->mutex);
     opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
     if (!qemu_opts_absorb_qdict(opts, options, errp)) {
         ret = -EINVAL;
@@ -255,6 +265,7 @@ fail_log:
         bdrv_unref_child(bs, s->log_file);
         bdrv_graph_wrunlock();
         s->log_file = NULL;
+        qemu_mutex_destroy(&s->mutex);
     }
 fail:
     qemu_opts_del(opts);
@@ -269,6 +280,7 @@ static void blk_log_writes_close(BlockDriverState *bs)
     bdrv_unref_child(bs, s->log_file);
     s->log_file = NULL;
     bdrv_graph_wrunlock();
+    qemu_mutex_destroy(&s->mutex);
 }
 
 static int64_t coroutine_fn GRAPH_RDLOCK
@@ -295,7 +307,7 @@ static void blk_log_writes_child_perm(BlockDriverState *bs, BdrvChild *c,
 
 static void blk_log_writes_refresh_limits(BlockDriverState *bs, Error **errp)
 {
-    BDRVBlkLogWritesState *s = bs->opaque;
+    const BDRVBlkLogWritesState *s = bs->opaque;
     bs->bl.request_alignment = s->sectorsize;
 }
 
@@ -338,15 +350,18 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
      * driver may be modified by other driver operations while waiting for the
      * I/O to complete.
      */
+    qemu_mutex_lock(&s->mutex);
     const uint64_t entry_start_sector = s->cur_log_sector;
     const uint64_t entry_offset = entry_start_sector << s->sectorbits;
     const uint64_t qiov_aligned_size = ROUND_UP(lr->qiov->size, s->sectorsize);
     const uint64_t entry_aligned_size = qiov_aligned_size +
         ROUND_UP(lr->zero_size, s->sectorsize);
     const uint64_t entry_nr_sectors = entry_aligned_size >> s->sectorbits;
+    const uint64_t entry_seq = s->nr_entries + 1;
 
-    s->nr_entries++;
+    s->nr_entries = entry_seq;
     s->cur_log_sector += entry_nr_sectors;
+    qemu_mutex_unlock(&s->mutex);
 
     /*
      * Write the log entry. Note that if this is a "write zeroes" operation,
@@ -366,14 +381,16 @@ blk_log_writes_co_do_log(BlkLogWritesLogReq *lr)
 
     /* Update super block on flush or every update interval */
     if (lr->log_ret == 0 && ((lr->entry.flags & LOG_FLUSH_FLAG)
-        || (s->nr_entries % s->update_interval == 0)))
+        || (entry_seq % s->update_interval == 0)))
     {
+        qemu_mutex_lock(&s->mutex);
         struct log_write_super super = {
             .magic      = cpu_to_le64(WRITE_LOG_MAGIC),
             .version    = cpu_to_le64(WRITE_LOG_VERSION),
             .nr_entries = cpu_to_le64(s->nr_entries),
             .sectorsize = cpu_to_le32(s->sectorsize),
         };
+        qemu_mutex_unlock(&s->mutex);
         void *zeroes = g_malloc0(s->sectorsize - sizeof(super));
         QEMUIOVector qiov;
 
@@ -405,7 +422,7 @@ blk_log_writes_co_log(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
 {
     QEMUIOVector log_qiov;
     size_t niov = qiov ? qiov->niov : 0;
-    BDRVBlkLogWritesState *s = bs->opaque;
+    const BDRVBlkLogWritesState *s = bs->opaque;
     BlkLogWritesFileReq fr = {
         .bs         = bs,
         .offset     = offset,
-- 
2.43.0