[PATCH 10/16] qcow2: Fix cache_clean_timer

Hanna Czenczek posted 16 patches 2 weeks, 3 days ago
Maintainers: Paolo Bonzini <pbonzini@redhat.com>, "Alex Bennée" <alex.bennee@linaro.org>, Kevin Wolf <kwolf@redhat.com>, Hanna Reitz <hreitz@redhat.com>, Stefan Hajnoczi <stefanha@redhat.com>, Fam Zheng <fam@euphon.net>, Ronnie Sahlberg <ronniesahlberg@gmail.com>, Peter Lieven <pl@dlhnet.de>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, Ilya Dryomov <idryomov@gmail.com>, "Richard W.M. Jones" <rjones@redhat.com>, Stefan Weil <sw@weilnetz.de>
There is a newer version of this series
[PATCH 10/16] qcow2: Fix cache_clean_timer
Posted by Hanna Czenczek 2 weeks, 3 days ago
The cache-cleaner runs as a timer CB in the BDS AioContext.  With
multiqueue, it can run concurrently to I/O requests, and because it does
not take any lock, this can break concurrent cache accesses, corrupting
the image.  While the chances of this happening are low, it can be
reproduced e.g. by modifying the code to schedule the timer CB every
5 ms (instead of at most once per second) and modifying the last (inner)
while loop of qcow2_cache_clean_unused() like so:

    while (i < c->size && can_clean_entry(c, i)) {
        for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
            usleep(100);
        }
        c->entries[i].offset = 0;
        c->entries[i].lru_counter = 0;
        i++;
        to_clean++;
    }

i.e. making it wait on purpose for the point in time where the cache is
in use by something else.
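
For reference, the 5 ms re-scheduling mentioned above is not part of
this patch; it roughly amounts to replacing the interval computation in
cache_clean_timer_cb() with a fixed offset:

    timer_mod(s->cache_clean_timer,
              qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + 5);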

The solution chosen for this in this patch is not the best solution, I
hope, but I admittedly can’t come up with anything strictly better.

We can protect from concurrent cache accesses either by taking the
existing s->lock, or we introduce a new (non-coroutine) mutex
specifically for cache accesses.  I would prefer to avoid the latter so
as not to introduce additional (very slight) overhead.

Using s->lock, which is a coroutine mutex, however means that we need to
take it in a coroutine, so the timer CB must enter such a coroutine.  As
a result, descheduling the timer is no longer a guarantee that the
cache-cleaner will not run, because it may now be yielding in
qemu_co_mutex_lock().

(Note even now this was only guaranteed for cache_clean_timer_del()
callers that run in the BDS (the timer’s) AioContext.  For callers
running in the main context, the problem may have already existed,
though maybe the BQL prevents timers from running in other contexts, I’m
not sure.)

Polling to await the timer to actually settle seems very complicated for
something that’s rather a minor problem, but I can’t come up with any
better solution that doesn’t again just overlook potential problems.

(One cleaner idea may be to have a generic way to have timers run
coroutines, and to await those when descheduling the timer.  But while
cleaner, it would also be more complicated, and I don’t think worth it
at this point.)

(Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
I’m not too fond of this solution.)

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/qcow2.h |  1 +
 block/qcow2.c | 90 ++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 79 insertions(+), 12 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index a9e3481c6e..272d34def1 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -347,6 +347,7 @@ typedef struct BDRVQcow2State {
     Qcow2Cache *refcount_block_cache;
     QEMUTimer *cache_clean_timer;
     unsigned cache_clean_interval;
+    bool cache_clean_running, cache_clean_polling;
 
     QLIST_HEAD(, QCowL2Meta) cluster_allocs;
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 4aa9f9e068..ef851794c3 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -835,12 +835,38 @@ static const char *overlap_bool_option_names[QCOW2_OL_MAX_BITNR] = {
     [QCOW2_OL_BITMAP_DIRECTORY_BITNR] = QCOW2_OPT_OVERLAP_BITMAP_DIRECTORY,
 };
 
-static void cache_clean_timer_cb(void *opaque)
+static void coroutine_fn cache_clean_timer_co(void *opaque)
 {
     BlockDriverState *bs = opaque;
     BDRVQcow2State *s = bs->opaque;
+
+    qemu_co_mutex_lock(&s->lock);
+    if (!s->cache_clean_timer) {
+        /* cache_clean_timer_del() has been called, skip doing anything */
+        goto out;
+    }
     qcow2_cache_clean_unused(s->l2_table_cache);
     qcow2_cache_clean_unused(s->refcount_block_cache);
+
+out:
+    qemu_co_mutex_unlock(&s->lock);
+    qatomic_set(&s->cache_clean_running, false);
+    if (qatomic_xchg(&s->cache_clean_polling, false)) {
+        aio_wait_kick();
+    }
+}
+
+static void cache_clean_timer_cb(void *opaque)
+{
+    BlockDriverState *bs = opaque;
+    BDRVQcow2State *s = bs->opaque;
+    Coroutine *co;
+
+    co = qemu_coroutine_create(cache_clean_timer_co, bs);
+    /* cleared in cache_clean_timer_co() */
+    qatomic_set(&s->cache_clean_running, true);
+    qemu_coroutine_enter(co);
+
     timer_mod(s->cache_clean_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
               (int64_t) s->cache_clean_interval * 1000);
 }
@@ -867,6 +893,39 @@ static void cache_clean_timer_del(BlockDriverState *bs)
     }
 }
 
+/*
+ * Delete the cache clean timer and await any yet running instance.
+ * Must be called from the main or BDS AioContext, holding s->lock.
+ */
+static void coroutine_fn
+cache_clean_timer_locked_co_del_and_wait(BlockDriverState *bs)
+{
+    BDRVQcow2State *s = bs->opaque;
+    IO_OR_GS_CODE();
+    cache_clean_timer_del(bs);
+    if (qatomic_read(&s->cache_clean_running)) {
+        qemu_co_mutex_unlock(&s->lock);
+        qatomic_set(&s->cache_clean_polling, true);
+        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
+        qemu_co_mutex_lock(&s->lock);
+    }
+}
+
+/*
+ * Delete the cache clean timer and await any yet running instance.
+ * Must be called from the main or BDS AioContext without s->lock held.
+ */
+static void cache_clean_timer_del_and_wait(BlockDriverState *bs)
+{
+    BDRVQcow2State *s = bs->opaque;
+    IO_OR_GS_CODE();
+    cache_clean_timer_del(bs);
+    if (qatomic_read(&s->cache_clean_running)) {
+        qatomic_set(&s->cache_clean_polling, true);
+        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
+    }
+}
+
 static void qcow2_detach_aio_context(BlockDriverState *bs)
 {
     cache_clean_timer_del(bs);
@@ -1214,12 +1273,20 @@ fail:
     return ret;
 }
 
+/* s_locked specifies whether s->lock is held or not */
 static void qcow2_update_options_commit(BlockDriverState *bs,
-                                        Qcow2ReopenState *r)
+                                        Qcow2ReopenState *r,
+                                        bool s_locked)
 {
     BDRVQcow2State *s = bs->opaque;
     int i;
 
+    if (s_locked) {
+        cache_clean_timer_locked_co_del_and_wait(bs);
+    } else {
+        cache_clean_timer_del_and_wait(bs);
+    }
+
     if (s->l2_table_cache) {
         qcow2_cache_destroy(s->l2_table_cache);
     }
@@ -1228,6 +1295,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
     }
     s->l2_table_cache = r->l2_table_cache;
     s->refcount_block_cache = r->refcount_block_cache;
+
+    s->cache_clean_interval = r->cache_clean_interval;
+    cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
+
     s->l2_slice_size = r->l2_slice_size;
 
     s->overlap_check = r->overlap_check;
@@ -1239,12 +1310,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
 
     s->discard_no_unref = r->discard_no_unref;
 
-    if (s->cache_clean_interval != r->cache_clean_interval) {
-        cache_clean_timer_del(bs);
-        s->cache_clean_interval = r->cache_clean_interval;
-        cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
-    }
-
     qapi_free_QCryptoBlockOpenOptions(s->crypto_opts);
     s->crypto_opts = r->crypto_opts;
 }
@@ -1261,6 +1326,7 @@ static void qcow2_update_options_abort(BlockDriverState *bs,
     qapi_free_QCryptoBlockOpenOptions(r->crypto_opts);
 }
 
+/* Called with s->lock held */
 static int coroutine_fn GRAPH_RDLOCK
 qcow2_update_options(BlockDriverState *bs, QDict *options, int flags,
                      Error **errp)
@@ -1270,7 +1336,7 @@ qcow2_update_options(BlockDriverState *bs, QDict *options, int flags,
 
     ret = qcow2_update_options_prepare(bs, &r, options, flags, errp);
     if (ret >= 0) {
-        qcow2_update_options_commit(bs, &r);
+        qcow2_update_options_commit(bs, &r, true);
     } else {
         qcow2_update_options_abort(bs, &r);
     }
@@ -1908,7 +1974,7 @@ qcow2_do_open(BlockDriverState *bs, QDict *options, int flags,
     qemu_vfree(s->l1_table);
     /* else pre-write overlap checks in cache_destroy may crash */
     s->l1_table = NULL;
-    cache_clean_timer_del(bs);
+    cache_clean_timer_locked_co_del_and_wait(bs);
     if (s->l2_table_cache) {
         qcow2_cache_destroy(s->l2_table_cache);
     }
@@ -2048,7 +2114,7 @@ static void qcow2_reopen_commit(BDRVReopenState *state)
 
     GRAPH_RDLOCK_GUARD_MAINLOOP();
 
-    qcow2_update_options_commit(state->bs, state->opaque);
+    qcow2_update_options_commit(state->bs, state->opaque, false);
     if (!s->data_file) {
         /*
          * If we don't have an external data file, s->data_file was cleared by
@@ -2805,7 +2871,7 @@ qcow2_do_close(BlockDriverState *bs, bool close_data_file)
         qcow2_inactivate(bs);
     }
 
-    cache_clean_timer_del(bs);
+    cache_clean_timer_del_and_wait(bs);
     qcow2_cache_destroy(s->l2_table_cache);
     qcow2_cache_destroy(s->refcount_block_cache);
 
-- 
2.51.0


Re: [PATCH 10/16] qcow2: Fix cache_clean_timer
Posted by Kevin Wolf 2 weeks, 1 day ago
Am 28.10.2025 um 17:33 hat Hanna Czenczek geschrieben:
> The cache-cleaner runs as a timer CB in the BDS AioContext.  With
> multiqueue, it can run concurrently to I/O requests, and because it does
> not take any lock, this can break concurrent cache accesses, corrupting
> the image.  While the chances of this happening are low, it can be
> reproduced e.g. by modifying the code to schedule the timer CB every
> 5 ms (instead of at most once per second) and modifying the last (inner)
> while loop of qcow2_cache_clean_unused() like so:
> 
>     while (i < c->size && can_clean_entry(c, i)) {
>         for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
>             usleep(100);
>         }
>         c->entries[i].offset = 0;
>         c->entries[i].lru_counter = 0;
>         i++;
>         to_clean++;
>     }
> 
> i.e. making it wait on purpose for the point in time where the cache is
> in use by something else.
> 
> The solution chosen for this in this patch is not the best solution, I
> hope, but I admittedly can’t come up with anything strictly better.
> 
> We can protect from concurrent cache accesses either by taking the
> existing s->lock, or we introduce a new (non-coroutine) mutex
> specifically for cache accesses.  I would prefer to avoid the latter so
> as not to introduce additional (very slight) overhead.

In theory, the old plan was that eventually qcow2 would use fine grained
locks instead of the single s->lock, and having a separate cache lock
would be a step towards it. But if we never actually make use of it to
hold s->lock for a shorter time, that's not really a good argument. I'm
not sure if that's ever going to happen unless for a rewrite in Rust or
something.

I never tried to measure specifically if lock contention is a problem
with high queue depth and random I/O on a huge disk. Intuitively,
holding s->lock while doing I/O for loading entries into the cache can't
be really good.

Anyway, I went a bit on a tangent there...

> Using s->lock, which is a coroutine mutex, however means that we need to
> take it in a coroutine, so the timer CB must enter such a coroutine.  As
> a result, descheduling the timer is no longer a guarantee that the
> cache-cleaner will not run, because it may now be yielding in
> qemu_co_mutex_lock().

I think creating a coroutine in cache_clean_timer_cb() is the wrong
approach. Instead, cache_clean_timer_init() could create a coroutine
and its implementation could be something like this:

    while (!s->cache_clean_timer_stopping) {
        qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
                                  QEMU_CLOCK_VIRTUAL,
                                  s->cache_clean_interval * NANOSECONDS_PER_SECOND);

        qemu_co_mutex_lock(&s->lock);
        qcow2_cache_clean_unused(s->l2_table_cache);
        qcow2_cache_clean_unused(s->refcount_block_cache);
        qemu_co_mutex_unlock(&s->lock);
    }
    s->cache_clean_timer_stopping = false;
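
The start/stop side could then look roughly like this (untested;
cache_clean_co is the loop above, and the wake/stopping fields are the
ones I made up there, keeping the existing function names even though
there is no QEMUTimer any more):

    static void cache_clean_timer_init(BlockDriverState *bs,
                                       AioContext *context)
    {
        BDRVQcow2State *s = bs->opaque;

        if (s->cache_clean_interval > 0) {
            Coroutine *co = qemu_coroutine_create(cache_clean_co, bs);
            /* run the cleaner loop in the BDS AioContext */
            aio_co_enter(context, co);
        }
    }

    static void cache_clean_timer_del(BlockDriverState *bs)
    {
        BDRVQcow2State *s = bs->opaque;

        s->cache_clean_timer_stopping = true;
        /* if the coroutine is currently sleeping, wake it so it can exit */
        qemu_co_sleep_wake(&s->cache_clean_timer_wake);
    }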

> (Note even now this was only guaranteed for cache_clean_timer_del()
> callers that run in the BDS (the timer’s) AioContext.  For callers
> running in the main context, the problem may have already existed,
> though maybe the BQL prevents timers from running in other contexts, I’m
> not sure.)
> 
> Polling to await the timer to actually settle seems very complicated for
> something that’s rather a minor problem, but I can’t come up with any
> better solution that doesn’t again just overlook potential problems.
> 
> (One cleaner idea may be to have a generic way to have timers run
> coroutines, and to await those when descheduling the timer.  But while
> cleaner, it would also be more complicated, and I don’t think worth it
> at this point.)
> 
> (Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
> I’m not too fond of this solution.)
> 
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  block/qcow2.h |  1 +
>  block/qcow2.c | 90 ++++++++++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 79 insertions(+), 12 deletions(-)

> @@ -867,6 +893,39 @@ static void cache_clean_timer_del(BlockDriverState *bs)
>      }
>  }
>  
> +/*
> + * Delete the cache clean timer and await any yet running instance.
> + * Must be called from the main or BDS AioContext, holding s->lock.
> + */
> +static void coroutine_fn
> +cache_clean_timer_locked_co_del_and_wait(BlockDriverState *bs)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    IO_OR_GS_CODE();
> +    cache_clean_timer_del(bs);
> +    if (qatomic_read(&s->cache_clean_running)) {
> +        qemu_co_mutex_unlock(&s->lock);
> +        qatomic_set(&s->cache_clean_polling, true);
> +        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));

Polling in a coroutine_fn is verboten.

If we do need this function, I think it would be a yield here and a wake
on the other side. I think we might be able to get around it if we move
the call from qcow2_do_open() into qcow2_open() (i.e. outside the
coroutine). A bit ugly, so your choice.
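
I.e. something like this on the waiting side (untested, waiter field
made up):

    /* instead of BDRV_POLL_WHILE() in a coroutine_fn */
    s->cache_clean_waiter = qemu_coroutine_self();
    qemu_coroutine_yield();

and in cache_clean_timer_co(), instead of aio_wait_kick():

    if (s->cache_clean_waiter) {
        aio_co_wake(s->cache_clean_waiter);
        s->cache_clean_waiter = NULL;
    }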

> +        qemu_co_mutex_lock(&s->lock);
> +    }
> +}
> +
> +/*
> + * Delete the cache clean timer and await any yet running instance.
> + * Must be called from the main or BDS AioContext without s->lock held.
> + */
> +static void cache_clean_timer_del_and_wait(BlockDriverState *bs)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    IO_OR_GS_CODE();
> +    cache_clean_timer_del(bs);
> +    if (qatomic_read(&s->cache_clean_running)) {
> +        qatomic_set(&s->cache_clean_polling, true);
> +        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
> +    }
> +}
> +
>  static void qcow2_detach_aio_context(BlockDriverState *bs)
>  {
>      cache_clean_timer_del(bs);
> @@ -1214,12 +1273,20 @@ fail:
>      return ret;
>  }
>  
> +/* s_locked specifies whether s->lock is held or not */
>  static void qcow2_update_options_commit(BlockDriverState *bs,
> -                                        Qcow2ReopenState *r)
> +                                        Qcow2ReopenState *r,
> +                                        bool s_locked)
>  {
>      BDRVQcow2State *s = bs->opaque;
>      int i;
>  
> +    if (s_locked) {
> +        cache_clean_timer_locked_co_del_and_wait(bs);
> +    } else {
> +        cache_clean_timer_del_and_wait(bs);
> +    }
> +
>      if (s->l2_table_cache) {
>          qcow2_cache_destroy(s->l2_table_cache);
>      }
> @@ -1228,6 +1295,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>      }
>      s->l2_table_cache = r->l2_table_cache;
>      s->refcount_block_cache = r->refcount_block_cache;
> +
> +    s->cache_clean_interval = r->cache_clean_interval;
> +    cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> +
>      s->l2_slice_size = r->l2_slice_size;
>  
>      s->overlap_check = r->overlap_check;
> @@ -1239,12 +1310,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>  
>      s->discard_no_unref = r->discard_no_unref;
>  
> -    if (s->cache_clean_interval != r->cache_clean_interval) {
> -        cache_clean_timer_del(bs);
> -        s->cache_clean_interval = r->cache_clean_interval;
> -        cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> -    }
> -

I think the del/init pair here won't be necessary any more after
switching to the background coroutine. It will just start using the new
value of s->cache_clean_interval the next time it sleeps.

Kevin


Re: [PATCH 10/16] qcow2: Fix cache_clean_timer
Posted by Hanna Czenczek 2 weeks ago
On 29.10.25 21:23, Kevin Wolf wrote:
> Am 28.10.2025 um 17:33 hat Hanna Czenczek geschrieben:
>> The cache-cleaner runs as a timer CB in the BDS AioContext.  With
>> multiqueue, it can run concurrently to I/O requests, and because it does
>> not take any lock, this can break concurrent cache accesses, corrupting
>> the image.  While the chances of this happening are low, it can be
>> reproduced e.g. by modifying the code to schedule the timer CB every
>> 5 ms (instead of at most once per second) and modifying the last (inner)
>> while loop of qcow2_cache_clean_unused() like so:
>>
>>      while (i < c->size && can_clean_entry(c, i)) {
>>          for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
>>              usleep(100);
>>          }
>>          c->entries[i].offset = 0;
>>          c->entries[i].lru_counter = 0;
>>          i++;
>>          to_clean++;
>>      }
>>
>> i.e. making it wait on purpose for the point in time where the cache is
>> in use by something else.
>>
>> The solution chosen for this in this patch is not the best solution, I
>> hope, but I admittedly can’t come up with anything strictly better.
>>
>> We can protect from concurrent cache accesses either by taking the
>> existing s->lock, or we introduce a new (non-coroutine) mutex
>> specifically for cache accesses.  I would prefer to avoid the latter so
>> as not to introduce additional (very slight) overhead.
> In theory, the old plan was that eventually qcow2 would use fine grained
> locks instead of the single s->lock, and having a separate cache lock
> would be a step towards it. But if we never actually make use of it to
> hold s->lock for a shorter time, that's not really a good argument. I'm
> not sure if that's ever going to happen unless for a rewrite in Rust or
> something.
>
> I never tried to measure specifically if lock contention is a problem
> with high queue depth and random I/O on a huge disk. Intuitively,
> holding s->lock while doing I/O for loading entries into the cache can't
> be really good.
>
> Anyway, I went a bit on a tangent there...
>
>> Using s->lock, which is a coroutine mutex, however means that we need to
>> take it in a coroutine, so the timer CB must enter such a coroutine.  As
>> a result, descheduling the timer is no longer a guarantee that the
>> cache-cleaner will not run, because it may now be yielding in
>> qemu_co_mutex_lock().
> I think creating a coroutine in cache_clean_timer_cb() is the wrong
> approach. Instead, cache_clean_timer_init() could create a coroutine
> and its implementation could be something like this:
>
>      while (!s->cache_clean_timer_stopping) {
>          qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
>                                    QEMU_CLOCK_VIRTUAL,
>                                    s->cache_clean_interval * NANOSECONDS_PER_SECOND);
>
>          qemu_co_mutex_lock(&s->lock);
>          qcow2_cache_clean_unused(s->l2_table_cache);
>          qcow2_cache_clean_unused(s->refcount_block_cache);
>          qemu_co_mutex_unlock(&s->lock);
>      }
>      s->cache_clean_timer_stopping = false;

Ah, that’s nicer.  I think we can replace the flag by checking 
s->cache_clean_interval > 0 and adding a CoQueue to wake up waiting 
coroutines.
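
Roughly (untested; the wake/queue fields are made up,
cache_clean_running is the flag from this patch):

    /* cache-clean coroutine */
    while (s->cache_clean_interval > 0) {
        qemu_co_sleep_ns_wakeable(&s->cache_clean_wake, QEMU_CLOCK_VIRTUAL,
                                  s->cache_clean_interval * NANOSECONDS_PER_SECOND);
        /* ... clean both caches under s->lock, as in your sketch ... */
    }
    s->cache_clean_running = false;
    qemu_co_queue_restart_all(&s->cache_clean_waiters);

    /* coroutine_fn that wants the cleaner stopped */
    s->cache_clean_interval = 0;
    qemu_co_sleep_wake(&s->cache_clean_wake);
    while (s->cache_clean_running) {
        qemu_co_queue_wait(&s->cache_clean_waiters, NULL);
    }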

>> (Note even now this was only guaranteed for cache_clean_timer_del()
>> callers that run in the BDS (the timer’s) AioContext.  For callers
>> running in the main context, the problem may have already existed,
>> though maybe the BQL prevents timers from running in other contexts, I’m
>> not sure.)
>>
>> Polling to await the timer to actually settle seems very complicated for
>> something that’s rather a minor problem, but I can’t come up with any
>> better solution that doesn’t again just overlook potential problems.
>>
>> (One cleaner idea may be to have a generic way to have timers run
>> coroutines, and to await those when descheduling the timer.  But while
>> cleaner, it would also be more complicated, and I don’t think worth it
>> at this point.)
>>
>> (Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
>> I’m not too fond of this solution.)
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   block/qcow2.h |  1 +
>>   block/qcow2.c | 90 ++++++++++++++++++++++++++++++++++++++++++++-------
>>   2 files changed, 79 insertions(+), 12 deletions(-)
>> @@ -867,6 +893,39 @@ static void cache_clean_timer_del(BlockDriverState *bs)
>>       }
>>   }
>>   
>> +/*
>> + * Delete the cache clean timer and await any yet running instance.
>> + * Must be called from the main or BDS AioContext, holding s->lock.
>> + */
>> +static void coroutine_fn
>> +cache_clean_timer_locked_co_del_and_wait(BlockDriverState *bs)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    IO_OR_GS_CODE();
>> +    cache_clean_timer_del(bs);
>> +    if (qatomic_read(&s->cache_clean_running)) {
>> +        qemu_co_mutex_unlock(&s->lock);
>> +        qatomic_set(&s->cache_clean_polling, true);
>> +        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
> Polling in a coroutine_fn is verboten.
>
> If we do need this function, I think it would be a yield here and a wake
> on the other side. I think we might be able to get around it if we move
> the call from qcow2_do_open() into qcow2_open() (i.e. outside the
> coroutine). A bit ugly, so your choice.

We can let a CoQueue do the waking, no?

>> +        qemu_co_mutex_lock(&s->lock);
>> +    }
>> +}
>> +
>> +/*
>> + * Delete the cache clean timer and await any yet running instance.
>> + * Must be called from the main or BDS AioContext without s->lock held.
>> + */
>> +static void cache_clean_timer_del_and_wait(BlockDriverState *bs)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    IO_OR_GS_CODE();
>> +    cache_clean_timer_del(bs);
>> +    if (qatomic_read(&s->cache_clean_running)) {
>> +        qatomic_set(&s->cache_clean_polling, true);
>> +        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
>> +    }
>> +}
>> +
>>   static void qcow2_detach_aio_context(BlockDriverState *bs)
>>   {
>>       cache_clean_timer_del(bs);
>> @@ -1214,12 +1273,20 @@ fail:
>>       return ret;
>>   }
>>   
>> +/* s_locked specifies whether s->lock is held or not */
>>   static void qcow2_update_options_commit(BlockDriverState *bs,
>> -                                        Qcow2ReopenState *r)
>> +                                        Qcow2ReopenState *r,
>> +                                        bool s_locked)
>>   {
>>       BDRVQcow2State *s = bs->opaque;
>>       int i;
>>   
>> +    if (s_locked) {
>> +        cache_clean_timer_locked_co_del_and_wait(bs);
>> +    } else {
>> +        cache_clean_timer_del_and_wait(bs);
>> +    }
>> +
>>       if (s->l2_table_cache) {
>>           qcow2_cache_destroy(s->l2_table_cache);
>>       }
>> @@ -1228,6 +1295,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>>       }
>>       s->l2_table_cache = r->l2_table_cache;
>>       s->refcount_block_cache = r->refcount_block_cache;
>> +
>> +    s->cache_clean_interval = r->cache_clean_interval;
>> +    cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
>> +
>>       s->l2_slice_size = r->l2_slice_size;
>>   
>>       s->overlap_check = r->overlap_check;
>> @@ -1239,12 +1310,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>>   
>>       s->discard_no_unref = r->discard_no_unref;
>>   
>> -    if (s->cache_clean_interval != r->cache_clean_interval) {
>> -        cache_clean_timer_del(bs);
>> -        s->cache_clean_interval = r->cache_clean_interval;
>> -        cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
>> -    }
>> -
> I think the del/init pair here won't be necessary any more after
> switching to the background coroutine. It will just start using the new
> value of s->cache_clean_interval the next time it sleeps.

One problem is that if we don’t lock s->lock, the coroutine can read 
s->l2_table_cache and s->refcount_block_cache while they’re invalid, 
which is why I moved the deletion above.  We also still need to delete 
if the interval is set to 0 (or special-case that in the coroutine to 
wait forever).
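
The wait-forever variant could e.g. be (untested, wake field made up):

    if (s->cache_clean_interval > 0) {
        qemu_co_sleep_ns_wakeable(&s->cache_clean_wake, QEMU_CLOCK_VIRTUAL,
                                  s->cache_clean_interval * NANOSECONDS_PER_SECOND);
    } else {
        /* interval of 0: sleep until woken explicitly */
        qemu_co_sleep(&s->cache_clean_wake);
    }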

We could run all of this in a coroutine so we can lock s->lock, or we 
have to force-stop the timer/coroutine at the start.  Maybe running it 
in a coroutine is better…

Hanna


Re: [PATCH 10/16] qcow2: Fix cache_clean_timer
Posted by Kevin Wolf 2 weeks ago
Am 31.10.2025 um 10:29 hat Hanna Czenczek geschrieben:
> On 29.10.25 21:23, Kevin Wolf wrote:
> > Am 28.10.2025 um 17:33 hat Hanna Czenczek geschrieben:
> > > The cache-cleaner runs as a timer CB in the BDS AioContext.  With
> > > multiqueue, it can run concurrently to I/O requests, and because it does
> > > not take any lock, this can break concurrent cache accesses, corrupting
> > > the image.  While the chances of this happening are low, it can be
> > > reproduced e.g. by modifying the code to schedule the timer CB every
> > > 5 ms (instead of at most once per second) and modifying the last (inner)
> > > while loop of qcow2_cache_clean_unused() like so:
> > > 
> > >      while (i < c->size && can_clean_entry(c, i)) {
> > >          for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
> > >              usleep(100);
> > >          }
> > >          c->entries[i].offset = 0;
> > >          c->entries[i].lru_counter = 0;
> > >          i++;
> > >          to_clean++;
> > >      }
> > > 
> > > i.e. making it wait on purpose for the point in time where the cache is
> > > in use by something else.
> > > 
> > > The solution chosen for this in this patch is not the best solution, I
> > > hope, but I admittedly can’t come up with anything strictly better.
> > > 
> > > We can protect from concurrent cache accesses either by taking the
> > > existing s->lock, or we introduce a new (non-coroutine) mutex
> > > specifically for cache accesses.  I would prefer to avoid the latter so
> > > as not to introduce additional (very slight) overhead.
> > In theory, the old plan was that eventually qcow2 would use fine grained
> > locks instead of the single s->lock, and having a separate cache lock
> > would be a step towards it. But if we never actually make use of it to
> > hold s->lock for a shorter time, that's not really a good argument. I'm
> > not sure if that's ever going to happen unless for a rewrite in Rust or
> > something.
> > 
> > I never tried to measure specifically if lock contention is a problem
> > with high queue depth and random I/O on a huge disk. Intuitively,
> > holding s->lock while doing I/O for loading entries into the cache can't
> > be really good.
> > 
> > Anyway, I went a bit on a tangent there...
> > 
> > > Using s->lock, which is a coroutine mutex, however means that we need to
> > > take it in a coroutine, so the timer CB must enter such a coroutine.  As
> > > a result, descheduling the timer is no longer a guarantee that the
> > > cache-cleaner will not run, because it may now be yielding in
> > > qemu_co_mutex_lock().
> > I think creating a coroutine in cache_clean_timer_cb() is the wrong
> > approach. Instead, cache_clean_timer_init() could create a coroutine
> > and its implementation could be something like this:
> > 
> >      while (!s->cache_clean_timer_stopping) {
> >          qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
> >                                    QEMU_CLOCK_VIRTUAL,

Oh, I wanted to comment on this one, too, but apparently forgot...

I have absolutely no idea why we're using QEMU_CLOCK_VIRTUAL in the
existing code. I don't think the cache cleaning should stop just because
the guest is paused. You can still have block jobs and exports that work
on the image and fill the cache.

> >                                    s->cache_clean_interval * NANOSECONDS_PER_SECOND);
> > 
> >          qemu_co_mutex_lock(&s->lock);
> >          qcow2_cache_clean_unused(s->l2_table_cache);
> >          qcow2_cache_clean_unused(s->refcount_block_cache);
> >          qemu_co_mutex_unlock(&s->lock);
> >      }
> >      s->cache_clean_timer_stopping = false;
> 
> Ah, that’s nicer.  I think we can replace the flag by checking
> s->cache_clean_interval > 0 and adding a CoQueue to wake up waiting
> coroutines.

Ah, yes, stopping on s->cache_clean_interval == 0 is good.

> > > (Note even now this was only guaranteed for cache_clean_timer_del()
> > > callers that run in the BDS (the timer’s) AioContext.  For callers
> > > running in the main context, the problem may have already existed,
> > > though maybe the BQL prevents timers from running in other contexts, I’m
> > > not sure.)
> > > 
> > > Polling to await the timer to actually settle seems very complicated for
> > > something that’s rather a minor problem, but I can’t come up with any
> > > better solution that doesn’t again just overlook potential problems.
> > > 
> > > (One cleaner idea may be to have a generic way to have timers run
> > > coroutines, and to await those when descheduling the timer.  But while
> > > cleaner, it would also be more complicated, and I don’t think worth it
> > > at this point.)
> > > 
> > > (Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
> > > I’m not too fond of this solution.)
> > > 
> > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > ---
> > >   block/qcow2.h |  1 +
> > >   block/qcow2.c | 90 ++++++++++++++++++++++++++++++++++++++++++++-------
> > >   2 files changed, 79 insertions(+), 12 deletions(-)
> > > @@ -867,6 +893,39 @@ static void cache_clean_timer_del(BlockDriverState *bs)
> > >       }
> > >   }
> > > +/*
> > > + * Delete the cache clean timer and await any yet running instance.
> > > + * Must be called from the main or BDS AioContext, holding s->lock.
> > > + */
> > > +static void coroutine_fn
> > > +cache_clean_timer_locked_co_del_and_wait(BlockDriverState *bs)
> > > +{
> > > +    BDRVQcow2State *s = bs->opaque;
> > > +    IO_OR_GS_CODE();
> > > +    cache_clean_timer_del(bs);
> > > +    if (qatomic_read(&s->cache_clean_running)) {
> > > +        qemu_co_mutex_unlock(&s->lock);
> > > +        qatomic_set(&s->cache_clean_polling, true);
> > > +        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
> > Polling in a coroutine_fn is verboten.
> > 
> > If we do need this function, I think it would be a yield here and a wake
> > on the other side. I think we might be able to get around it if we move
> > the call from qcow2_do_open() into qcow2_open() (i.e. outside the
> > coroutine). A bit ugly, so your choice.
> 
> We can let a CoQueue do the waking, no?

Yes, that's probably nicer. I don't expect that you'll ever have more
than a single waiter, but a CoQueue would be safe in either case.

> > > +        qemu_co_mutex_lock(&s->lock);
> > > +    }
> > > +}
> > > +
> > > +/*
> > > + * Delete the cache clean timer and await any yet running instance.
> > > + * Must be called from the main or BDS AioContext without s->lock held.
> > > + */
> > > +static void cache_clean_timer_del_and_wait(BlockDriverState *bs)
> > > +{
> > > +    BDRVQcow2State *s = bs->opaque;
> > > +    IO_OR_GS_CODE();
> > > +    cache_clean_timer_del(bs);
> > > +    if (qatomic_read(&s->cache_clean_running)) {
> > > +        qatomic_set(&s->cache_clean_polling, true);
> > > +        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
> > > +    }
> > > +}
> > > +
> > >   static void qcow2_detach_aio_context(BlockDriverState *bs)
> > >   {
> > >       cache_clean_timer_del(bs);
> > > @@ -1214,12 +1273,20 @@ fail:
> > >       return ret;
> > >   }
> > > +/* s_locked specifies whether s->lock is held or not */
> > >   static void qcow2_update_options_commit(BlockDriverState *bs,
> > > -                                        Qcow2ReopenState *r)
> > > +                                        Qcow2ReopenState *r,
> > > +                                        bool s_locked)
> > >   {
> > >       BDRVQcow2State *s = bs->opaque;
> > >       int i;
> > > +    if (s_locked) {
> > > +        cache_clean_timer_locked_co_del_and_wait(bs);
> > > +    } else {
> > > +        cache_clean_timer_del_and_wait(bs);
> > > +    }
> > > +
> > >       if (s->l2_table_cache) {
> > >           qcow2_cache_destroy(s->l2_table_cache);
> > >       }
> > > @@ -1228,6 +1295,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
> > >       }
> > >       s->l2_table_cache = r->l2_table_cache;
> > >       s->refcount_block_cache = r->refcount_block_cache;
> > > +
> > > +    s->cache_clean_interval = r->cache_clean_interval;
> > > +    cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> > > +
> > >       s->l2_slice_size = r->l2_slice_size;
> > >       s->overlap_check = r->overlap_check;
> > > @@ -1239,12 +1310,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
> > >       s->discard_no_unref = r->discard_no_unref;
> > > -    if (s->cache_clean_interval != r->cache_clean_interval) {
> > > -        cache_clean_timer_del(bs);
> > > -        s->cache_clean_interval = r->cache_clean_interval;
> > > -        cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
> > > -    }
> > > -
> > I think the del/init pair here won't be necessary any more after
> > switching to the background coroutine. It will just start using the new
> > value of s->cache_clean_interval the next time it sleeps.
> 
> One problem is that if we don’t lock s->lock, the coroutine can read
> s->l2_table_cache and s->refcount_block_cache while they’re invalid, which
> is why I moved the deletion above.

I see. This is a preexisting problem, right? The timer runs in the BDS
main context, while qcow2_update_options_commit() runs in the main loop
or... essentially anywhere else?

Should it be a separate patch then? A comment in the code wouldn't hurt
either.

> We also still need to delete if the interval is set to 0 (or
> special-case that in the coroutine to wait forever).

If the coroutine terminates on 0 as you suggested above, that would
automatically be solved.

> We could run all of this in a coroutine so we can lock s->lock, or we
> have to force-stop the timer/coroutine at the start.  Maybe running it
> in a coroutine is better…

So qcow2_reopen_commit() would immediately enter a coroutine and
BDRV_POLL_WHILE() for its completion?

It's not exactly pretty (and I hope polling in reopen callbacks is even
allowed), but maybe more local ugliness than having additional state
(like s->cache_clean_running) to allow BDRV_POLL_WHILE() to wait for the
cache clean coroutine to stop.
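
That would essentially be the same pattern qcow2_open() already uses
for qcow2_open_entry(), roughly (names made up):

    QCow2ReopenCommitCo data = { .state = state, .done = false };
    Coroutine *co = qemu_coroutine_create(qcow2_reopen_commit_co, &data);

    qemu_coroutine_enter(co);
    BDRV_POLL_WHILE(state->bs, !data.done);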

Kevin


Re: [PATCH 10/16] qcow2: Fix cache_clean_timer
Posted by Hanna Czenczek 1 week, 1 day ago
On 31.10.25 14:03, Kevin Wolf wrote:
> Am 31.10.2025 um 10:29 hat Hanna Czenczek geschrieben:
>> On 29.10.25 21:23, Kevin Wolf wrote:
>>> Am 28.10.2025 um 17:33 hat Hanna Czenczek geschrieben:
[...]

>>>> @@ -1239,12 +1310,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
>>>>        s->discard_no_unref = r->discard_no_unref;
>>>> -    if (s->cache_clean_interval != r->cache_clean_interval) {
>>>> -        cache_clean_timer_del(bs);
>>>> -        s->cache_clean_interval = r->cache_clean_interval;
>>>> -        cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
>>>> -    }
>>>> -
>>> I think the del/init pair here won't be necessary any more after
>>> switching to the background coroutine. It will just start using the new
>>> value of s->cache_clean_interval the next time it sleeps.
>> One problem is that if we don’t lock s->lock, the coroutine can read
>> s->l2_table_cache and s->refcount_block_cache while they’re invalid, which
>> is why I moved the deletion above.
> I see. This is a preexisting problem, right? The timer runs in the BDS
> main context, while qcow2_update_options_commit() runs in the main loop
> or... essentially anywhere else?

Sorry for the late reply, yes.  There may be the additional problem 
noted in the commit message, specifically that the timer may fire and 
run the CB in the BDS’s context while it is concurrently deleted from 
the main context.  It will then still run, so just moving the delete up 
is not a 100 % solution, I think.  We also need to make sure the timer 
code really isn’t running.

> Should it be a separate patch then? A comment in the code wouldn't hurt
> either.

I’ll add a comment.

>> We also still need to delete if the interval is set to 0 (or
>> special-case that in the coroutine to wait forever).
> If the coroutine terminates on 0 as you suggested above, that would
> automatically be solved.

I don’t think so, because we still need to be able to restart it (when 
transitioning from a value of 0 to a different value).
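
I.e. the commit path would still need something along these lines
(untested):

    unsigned old_interval = s->cache_clean_interval;

    s->cache_clean_interval = r->cache_clean_interval;
    if (old_interval == 0 && s->cache_clean_interval > 0) {
        /* cleaner is not running (it exits on 0), so start it again */
        cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
    }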

>> We could run all of this in a coroutine so we can lock s->lock, or we
>> have to force-stop the timer/coroutine at the start.  Maybe running it
>> in a coroutine is better…
> So qcow2_reopen_commit() would immediately enter a coroutine and
> BDRV_POLL_WHILE() for its completion?
>
> It's not exactly pretty (and I hope polling in reopen callbacks is even
> allowed), but maybe more local ugliness than having additional state
> (like s->cache_clean_running) to allow BDRV_POLL_WHILE() to wait for the
> cache clean coroutine to stop.

I think I’ll keep it as-is.  We’ll probably actually want to wait in 
detach_aio_context, too.

Hanna