system/physmem: Synchronize ram_list accesses

[PATCH] system/physmem: Synchronize ram_list accesses

Posted by Akihiko Odaki 1 week ago

Alex Bennée reported a ThreadSanitizer warning about a plain concurrent
access to ram_list [1]. Ensure the concurrent accesses to ram_list are
properly synchronized with atomic accesses, mutexes, or RCU.

First, the plain assignments of ram_list.mru_block are replaced with
qatomic_set(). A comment in qemu_get_ram_block() explains why the
ordering requirement is relaxed, but it still needs to be atomically
accessed. include/qemu/atomic.h says:

> The C11 memory model says that variables that are accessed from
> different threads should at least be done with __ATOMIC_RELAXED
> primitives or the result is undefined. Generally this has little to
> no effect on the generated code but not using the atomic primitives
> will get flagged by sanitizers as a violation.

Second, ram_list.version accesses are replaced with atomic operations
or protected with a mutex. Unlike ram_list.mru_block, ram_list.version
has tighter ordering requirements for one of its goals: ensuring that
the reader-held rs->last_seen_block value is invalidated whenever a RAM
block is reclaimed between two RCU reader critical sections. Below are
steps a reader and an updater follow:

Reader:

R-1. Enter the first RCU read-side critical section:
R-1-1. rs->last_version = qatomic_load_acquire(&ram_list.version)
R-1-2. rs->last_seen_block = an element of ram_list.blocks
R-2. Enter the second RCU read-side critical section:
R-2-1. if (qatomic_read(&ram_list.version) != rs->last_version)
R-2-2.     rs->last_seen_block = NULL

Updater:

W-1. Enter a ram_list.mutex critical section
W-1-1. Update ram_list.blocks
W-1-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
W-2. Enter another ram_list.mutex critical section
W-2-1. QLIST_REMOVE_RCU(block, next)
W-2-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
W-2-3. call_rcu(block, reclaim_ramblock, rcu)

W-1-2 represents the write observed by R-1-1.

ram_list.version is read non-atomically on the update side because the
update side is serialized with ram_list.mutex. The other ram_list
accesses in these steps are reasoned about in two cases.

When the grace period of W-2-3 contains R-2:
    qatomic_load_acquire() at R-1-1 and qatomic_store_release() at W-1-2
    enforce the following ordering:
        W-1-1 -> W-1-2 -> R-1-1 -> R-1-2
    The value of ram_list.blocks stored by W-1-1 or a newer value that
    was loaded by R-1-2 is still valid because of the grace period.

When the grace period of W-2-3 ends before R-2:
    call_rcu() at W-2-3 and the read-side critical section at R-2 ensure
    the following ordering:
        W-2-2 -> W-2-3 -> the grace period -> R-2 -> R-2-1
    The value of ram_list.version stored by W-2-2 or a newer value that
    was loaded by R-2-1 differs from rs->last_version and the reader
    invalidates rs->last_seen_block.

Together, these steps ensure that rs->last_seen_block is invalidated
whenever necessary. With added atomic operations, pre-existing memory
barriers are no longer necessary and are removed.

Any other ram_list accesses are already properly synchronized.

[1] https://lore.kernel.org/qemu-devel/878q9fbmap.fsf@draig.linaro.org/

Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
---
 migration/ram.c  | 10 +++++-----
 system/physmem.c | 16 +++++++---------
 2 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index fc38ffbf8af1..6da24d725861 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2495,7 +2495,10 @@ static void ram_state_reset(RAMState *rs)
 
     rs->last_seen_block = NULL;
     rs->last_page = 0;
-    rs->last_version = ram_list.version;
+
+    /* Read version before ram_list.blocks */
+    rs->last_version = qatomic_load_acquire(&ram_list.version);
+
     rs->xbzrle_started = false;
 
     ram_page_hint_reset(&rs->page_hint);
@@ -3270,13 +3273,10 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
      */
     WITH_QEMU_LOCK_GUARD(&rs->bitmap_mutex) {
         WITH_RCU_READ_LOCK_GUARD() {
-            if (ram_list.version != rs->last_version) {
+            if (qatomic_read(&ram_list.version) != rs->last_version) {
                 ram_state_reset(rs);
             }
 
-            /* Read version before ram_list.blocks */
-            smp_rmb();
-
             ret = rdma_registration_start(f, RAM_CONTROL_ROUND);
             if (ret < 0) {
                 qemu_file_set_error(f, ret);
diff --git a/system/physmem.c b/system/physmem.c
index 7bcbf8757361..9e1ac13e8259 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -839,12 +839,12 @@ found:
     /* It is safe to write mru_block outside the BQL.  This
      * is what happens:
      *
-     *     mru_block = xxx
+     *     qatomic_set(&mru_block, xxx)
      *     rcu_read_unlock()
      *                                        xxx removed from list
      *                  rcu_read_lock()
      *                  read mru_block
-     *                                        mru_block = NULL;
+     *                                        qatomic_set(&mru_block, NULL);
      *                                        call_rcu(reclaim_ramblock, xxx);
      *                  rcu_read_unlock()
      *
@@ -852,7 +852,7 @@ found:
      * when it was placed into the list.  Here we're just making an extra
      * copy of the pointer.
      */
-    ram_list.mru_block = block;
+    qatomic_set(&ram_list.mru_block, block);
     return block;
 }
 
@@ -2260,11 +2260,10 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
     } else { /* list is empty */
         QLIST_INSERT_HEAD_RCU(&ram_list.blocks, new_block, next);
     }
-    ram_list.mru_block = NULL;
+    qatomic_set(&ram_list.mru_block, NULL);
 
     /* Write list before version */
-    smp_wmb();
-    ram_list.version++;
+    qatomic_store_release(&ram_list.version, ram_list.version + 1);
     qemu_mutex_unlock_ramlist();
 
     physical_memory_set_dirty_range(new_block->offset,
@@ -2608,10 +2607,9 @@ void qemu_ram_free(RAMBlock *block)
     name = cpr_name(block->mr);
     cpr_delete_fd(name, 0);
     QLIST_REMOVE_RCU(block, next);
-    ram_list.mru_block = NULL;
+    qatomic_set(&ram_list.mru_block, NULL);
     /* Write list before version */
-    smp_wmb();
-    ram_list.version++;
+    qatomic_store_release(&ram_list.version, ram_list.version + 1);
     call_rcu(block, reclaim_ramblock, rcu);
     qemu_mutex_unlock_ramlist();
 }

---
base-commit: e89049b3ba5f1f0468bc0d294173345597514a1b
change-id: 20260520-tsan-c6f1aa909a90

Best regards,
--  
Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>

Re: [PATCH] system/physmem: Synchronize ram_list accesses

Posted by Peter Xu 3 days, 19 hours ago

On Sat, May 23, 2026 at 09:38:28PM +0900, Akihiko Odaki wrote:
> Alex Bennée reported a ThreadSanitizer warning about a plain concurrent
> access to ram_list [1]. Ensure the concurrent accesses to ram_list are
> properly synchronized with atomic accesses, mutexes, or RCU.
> 
> First, the plain assignments of ram_list.mru_block are replaced with
> qatomic_set(). A comment in qemu_get_ram_block() explains why the
> ordering requirement is relaxed, but it still needs to be atomically
> accessed. include/qemu/atomic.h says:
> 
> > The C11 memory model says that variables that are accessed from
> > different threads should at least be done with __ATOMIC_RELAXED
> > primitives or the result is undefined. Generally this has little to
> > no effect on the generated code but not using the atomic primitives
> > will get flagged by sanitizers as a violation.
> 
> Second, ram_list.version accesses are replaced with atomic operations
> or protected with a mutex. Unlike ram_list.mru_block, ram_list.version
> has tighter ordering requirements for one of its goals: ensuring that
> the reader-held rs->last_seen_block value is invalidated whenever a RAM
> block is reclaimed between two RCU reader critical sections. Below are
> steps a reader and an updater follow:
> 
> Reader:
> 
> R-1. Enter the first RCU read-side critical section:
> R-1-1. rs->last_version = qatomic_load_acquire(&ram_list.version)
> R-1-2. rs->last_seen_block = an element of ram_list.blocks
> R-2. Enter the second RCU read-side critical section:
> R-2-1. if (qatomic_read(&ram_list.version) != rs->last_version)
> R-2-2.     rs->last_seen_block = NULL
> 
> Updater:
> 
> W-1. Enter a ram_list.mutex critical section
> W-1-1. Update ram_list.blocks
> W-1-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
> W-2. Enter another ram_list.mutex critical section
> W-2-1. QLIST_REMOVE_RCU(block, next)
> W-2-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
> W-2-3. call_rcu(block, reclaim_ramblock, rcu)
> 
> W-1-2 represents the write observed by R-1-1.
> 
> ram_list.version is read non-atomically on the update side because the
> update side is serialized with ram_list.mutex. The other ram_list
> accesses in these steps are reasoned about in two cases.
> 
> When the grace period of W-2-3 contains R-2:
>     qatomic_load_acquire() at R-1-1 and qatomic_store_release() at W-1-2
>     enforce the following ordering:
>         W-1-1 -> W-1-2 -> R-1-1 -> R-1-2
>     The value of ram_list.blocks stored by W-1-1 or a newer value that
>     was loaded by R-1-2 is still valid because of the grace period.
> 
> When the grace period of W-2-3 ends before R-2:
>     call_rcu() at W-2-3 and the read-side critical section at R-2 ensure
>     the following ordering:
>         W-2-2 -> W-2-3 -> the grace period -> R-2 -> R-2-1
>     The value of ram_list.version stored by W-2-2 or a newer value that
>     was loaded by R-2-1 differs from rs->last_version and the reader
>     invalidates rs->last_seen_block.
> 
> Together, these steps ensure that rs->last_seen_block is invalidated
> whenever necessary. With added atomic operations, pre-existing memory
> barriers are no longer necessary and are removed.
> 
> Any other ram_list accesses are already properly synchronized.
> 
> [1] https://lore.kernel.org/qemu-devel/878q9fbmap.fsf@draig.linaro.org/
> 
> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>

Above link [1] also mentioned about a hang.  Just to double check: we're
only trying to silent the TSAN warning (which is a false positive), right?

> ---
>  migration/ram.c  | 10 +++++-----
>  system/physmem.c | 16 +++++++---------
>  2 files changed, 12 insertions(+), 14 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index fc38ffbf8af1..6da24d725861 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2495,7 +2495,10 @@ static void ram_state_reset(RAMState *rs)
>  
>      rs->last_seen_block = NULL;
>      rs->last_page = 0;
> -    rs->last_version = ram_list.version;
> +
> +    /* Read version before ram_list.blocks */
> +    rs->last_version = qatomic_load_acquire(&ram_list.version);
> +
>      rs->xbzrle_started = false;
>  
>      ram_page_hint_reset(&rs->page_hint);
> @@ -3270,13 +3273,10 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>       */
>      WITH_QEMU_LOCK_GUARD(&rs->bitmap_mutex) {
>          WITH_RCU_READ_LOCK_GUARD() {
> -            if (ram_list.version != rs->last_version) {
> +            if (qatomic_read(&ram_list.version) != rs->last_version) {
>                  ram_state_reset(rs);
>              }

Not directly relevant to this patch, but I never thought into why this few
lines are needed at all, and then I found I don't have an answer..

Here, ram_list version change should only happen during ramblock
add/remove.  I can think of a few special cases:

a) device add/remove (e.g. dimms) causing new / removing ramblocks, should
   never happen as qdev code fails any device add/remove attempt during
   migration (qdev_device_add_from_qdict, qdev_unplug)

b) virtio-mem, which only manages one memory backend, hence one ramblock,
   no add/remove

c) virtio-gpu, will do random add/remove with memory_region_init_ram_ptr()
   which means ramblocks are not migratable

PS: I recall we have other special cases like ROM sizes can change on dest
QEMU, but it's about used_length, not about version boosts, so doesn't
matter here.

Both a) and b) should make sure no change on version, c) will change
version, but in a case where the ramblock is destined to be not migratable,
hence not visible to migration core.

The patch itself looks all fine, but I'm just wondering if we should just
remove this check completely or even try to somehow trap it from happening
(if we can rule out things like virtio-gpu): if something really changed,
it'll be a serious problem because we sync ramblock list already during
ram_save_setup.

Thanks,

>  
> -            /* Read version before ram_list.blocks */
> -            smp_rmb();
> -
>              ret = rdma_registration_start(f, RAM_CONTROL_ROUND);
>              if (ret < 0) {
>                  qemu_file_set_error(f, ret);
> diff --git a/system/physmem.c b/system/physmem.c
> index 7bcbf8757361..9e1ac13e8259 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -839,12 +839,12 @@ found:
>      /* It is safe to write mru_block outside the BQL.  This
>       * is what happens:
>       *
> -     *     mru_block = xxx
> +     *     qatomic_set(&mru_block, xxx)
>       *     rcu_read_unlock()
>       *                                        xxx removed from list
>       *                  rcu_read_lock()
>       *                  read mru_block
> -     *                                        mru_block = NULL;
> +     *                                        qatomic_set(&mru_block, NULL);
>       *                                        call_rcu(reclaim_ramblock, xxx);
>       *                  rcu_read_unlock()
>       *
> @@ -852,7 +852,7 @@ found:
>       * when it was placed into the list.  Here we're just making an extra
>       * copy of the pointer.
>       */
> -    ram_list.mru_block = block;
> +    qatomic_set(&ram_list.mru_block, block);
>      return block;
>  }
>  
> @@ -2260,11 +2260,10 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>      } else { /* list is empty */
>          QLIST_INSERT_HEAD_RCU(&ram_list.blocks, new_block, next);
>      }
> -    ram_list.mru_block = NULL;
> +    qatomic_set(&ram_list.mru_block, NULL);
>  
>      /* Write list before version */
> -    smp_wmb();
> -    ram_list.version++;
> +    qatomic_store_release(&ram_list.version, ram_list.version + 1);
>      qemu_mutex_unlock_ramlist();
>  
>      physical_memory_set_dirty_range(new_block->offset,
> @@ -2608,10 +2607,9 @@ void qemu_ram_free(RAMBlock *block)
>      name = cpr_name(block->mr);
>      cpr_delete_fd(name, 0);
>      QLIST_REMOVE_RCU(block, next);
> -    ram_list.mru_block = NULL;
> +    qatomic_set(&ram_list.mru_block, NULL);
>      /* Write list before version */
> -    smp_wmb();
> -    ram_list.version++;
> +    qatomic_store_release(&ram_list.version, ram_list.version + 1);
>      call_rcu(block, reclaim_ramblock, rcu);
>      qemu_mutex_unlock_ramlist();
>  }
> 
> ---
> base-commit: e89049b3ba5f1f0468bc0d294173345597514a1b
> change-id: 20260520-tsan-c6f1aa909a90
> 
> Best regards,
> --  
> Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
> 

-- 
Peter Xu

Re: [PATCH] system/physmem: Synchronize ram_list accesses

Posted by Akihiko Odaki 3 days, 9 hours ago

On 2026/05/27 6:35, Peter Xu wrote:
> On Sat, May 23, 2026 at 09:38:28PM +0900, Akihiko Odaki wrote:
>> Alex Bennée reported a ThreadSanitizer warning about a plain concurrent
>> access to ram_list [1]. Ensure the concurrent accesses to ram_list are
>> properly synchronized with atomic accesses, mutexes, or RCU.
>>
>> First, the plain assignments of ram_list.mru_block are replaced with
>> qatomic_set(). A comment in qemu_get_ram_block() explains why the
>> ordering requirement is relaxed, but it still needs to be atomically
>> accessed. include/qemu/atomic.h says:
>>
>>> The C11 memory model says that variables that are accessed from
>>> different threads should at least be done with __ATOMIC_RELAXED
>>> primitives or the result is undefined. Generally this has little to
>>> no effect on the generated code but not using the atomic primitives
>>> will get flagged by sanitizers as a violation.
>>
>> Second, ram_list.version accesses are replaced with atomic operations
>> or protected with a mutex. Unlike ram_list.mru_block, ram_list.version
>> has tighter ordering requirements for one of its goals: ensuring that
>> the reader-held rs->last_seen_block value is invalidated whenever a RAM
>> block is reclaimed between two RCU reader critical sections. Below are
>> steps a reader and an updater follow:
>>
>> Reader:
>>
>> R-1. Enter the first RCU read-side critical section:
>> R-1-1. rs->last_version = qatomic_load_acquire(&ram_list.version)
>> R-1-2. rs->last_seen_block = an element of ram_list.blocks
>> R-2. Enter the second RCU read-side critical section:
>> R-2-1. if (qatomic_read(&ram_list.version) != rs->last_version)
>> R-2-2.     rs->last_seen_block = NULL
>>
>> Updater:
>>
>> W-1. Enter a ram_list.mutex critical section
>> W-1-1. Update ram_list.blocks
>> W-1-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
>> W-2. Enter another ram_list.mutex critical section
>> W-2-1. QLIST_REMOVE_RCU(block, next)
>> W-2-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
>> W-2-3. call_rcu(block, reclaim_ramblock, rcu)
>>
>> W-1-2 represents the write observed by R-1-1.
>>
>> ram_list.version is read non-atomically on the update side because the
>> update side is serialized with ram_list.mutex. The other ram_list
>> accesses in these steps are reasoned about in two cases.
>>
>> When the grace period of W-2-3 contains R-2:
>>      qatomic_load_acquire() at R-1-1 and qatomic_store_release() at W-1-2
>>      enforce the following ordering:
>>          W-1-1 -> W-1-2 -> R-1-1 -> R-1-2
>>      The value of ram_list.blocks stored by W-1-1 or a newer value that
>>      was loaded by R-1-2 is still valid because of the grace period.
>>
>> When the grace period of W-2-3 ends before R-2:
>>      call_rcu() at W-2-3 and the read-side critical section at R-2 ensure
>>      the following ordering:
>>          W-2-2 -> W-2-3 -> the grace period -> R-2 -> R-2-1
>>      The value of ram_list.version stored by W-2-2 or a newer value that
>>      was loaded by R-2-1 differs from rs->last_version and the reader
>>      invalidates rs->last_seen_block.
>>
>> Together, these steps ensure that rs->last_seen_block is invalidated
>> whenever necessary. With added atomic operations, pre-existing memory
>> barriers are no longer necessary and are removed.
>>
>> Any other ram_list accesses are already properly synchronized.
>>
>> [1] https://lore.kernel.org/qemu-devel/878q9fbmap.fsf@draig.linaro.org/
>>
>> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
> 
> Above link [1] also mentioned about a hang.  Just to double check: we're
> only trying to silent the TSAN warning (which is a false positive), right?

The hang is irrelevant and this is indeed to address the TSAN warning, 
though I'm a bit reluctant to call it false positive because, strictly 
speaking, the code actually relies on undefined behavior.

> 
>> ---
>>   migration/ram.c  | 10 +++++-----
>>   system/physmem.c | 16 +++++++---------
>>   2 files changed, 12 insertions(+), 14 deletions(-)
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index fc38ffbf8af1..6da24d725861 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -2495,7 +2495,10 @@ static void ram_state_reset(RAMState *rs)
>>   
>>       rs->last_seen_block = NULL;
>>       rs->last_page = 0;
>> -    rs->last_version = ram_list.version;
>> +
>> +    /* Read version before ram_list.blocks */
>> +    rs->last_version = qatomic_load_acquire(&ram_list.version);
>> +
>>       rs->xbzrle_started = false;
>>   
>>       ram_page_hint_reset(&rs->page_hint);
>> @@ -3270,13 +3273,10 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>>        */
>>       WITH_QEMU_LOCK_GUARD(&rs->bitmap_mutex) {
>>           WITH_RCU_READ_LOCK_GUARD() {
>> -            if (ram_list.version != rs->last_version) {
>> +            if (qatomic_read(&ram_list.version) != rs->last_version) {
>>                   ram_state_reset(rs);
>>               }
> 
> Not directly relevant to this patch, but I never thought into why this few
> lines are needed at all, and then I found I don't have an answer..
> 
> Here, ram_list version change should only happen during ramblock
> add/remove.  I can think of a few special cases:
> 
> a) device add/remove (e.g. dimms) causing new / removing ramblocks, should
>     never happen as qdev code fails any device add/remove attempt during
>     migration (qdev_device_add_from_qdict, qdev_unplug)
> 
> b) virtio-mem, which only manages one memory backend, hence one ramblock,
>     no add/remove
> 
> c) virtio-gpu, will do random add/remove with memory_region_init_ram_ptr()
>     which means ramblocks are not migratable
> 
> PS: I recall we have other special cases like ROM sizes can change on dest
> QEMU, but it's about used_length, not about version boosts, so doesn't
> matter here.
> 
> Both a) and b) should make sure no change on version, c) will change
> version, but in a case where the ramblock is destined to be not migratable,
> hence not visible to migration core.
> 
> The patch itself looks all fine, but I'm just wondering if we should just
> remove this check completely or even try to somehow trap it from happening
> (if we can rule out things like virtio-gpu): if something really changed,
> it'll be a serious problem because we sync ramblock list already during
> ram_save_setup.

There is another edge case: a ram_list.blocks change may be propagated 
to the thread calling ram_save_setup() and ram_save_iterate() at 
different timing. We are only ensuring that ram_list.blocks changes 
invalidating old blocks are delivered with WITH_RCU_READ_LOCK_GUARD().

So I think there is a theoretical race condition that breaks migration, 
though triggering it in practice may not be realistic.

Regards,
Akihiko Odaki

> 
> Thanks,
> 
>>   
>> -            /* Read version before ram_list.blocks */
>> -            smp_rmb();
>> -
>>               ret = rdma_registration_start(f, RAM_CONTROL_ROUND);
>>               if (ret < 0) {
>>                   qemu_file_set_error(f, ret);
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 7bcbf8757361..9e1ac13e8259 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -839,12 +839,12 @@ found:
>>       /* It is safe to write mru_block outside the BQL.  This
>>        * is what happens:
>>        *
>> -     *     mru_block = xxx
>> +     *     qatomic_set(&mru_block, xxx)
>>        *     rcu_read_unlock()
>>        *                                        xxx removed from list
>>        *                  rcu_read_lock()
>>        *                  read mru_block
>> -     *                                        mru_block = NULL;
>> +     *                                        qatomic_set(&mru_block, NULL);
>>        *                                        call_rcu(reclaim_ramblock, xxx);
>>        *                  rcu_read_unlock()
>>        *
>> @@ -852,7 +852,7 @@ found:
>>        * when it was placed into the list.  Here we're just making an extra
>>        * copy of the pointer.
>>        */
>> -    ram_list.mru_block = block;
>> +    qatomic_set(&ram_list.mru_block, block);
>>       return block;
>>   }
>>   
>> @@ -2260,11 +2260,10 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>       } else { /* list is empty */
>>           QLIST_INSERT_HEAD_RCU(&ram_list.blocks, new_block, next);
>>       }
>> -    ram_list.mru_block = NULL;
>> +    qatomic_set(&ram_list.mru_block, NULL);
>>   
>>       /* Write list before version */
>> -    smp_wmb();
>> -    ram_list.version++;
>> +    qatomic_store_release(&ram_list.version, ram_list.version + 1);
>>       qemu_mutex_unlock_ramlist();
>>   
>>       physical_memory_set_dirty_range(new_block->offset,
>> @@ -2608,10 +2607,9 @@ void qemu_ram_free(RAMBlock *block)
>>       name = cpr_name(block->mr);
>>       cpr_delete_fd(name, 0);
>>       QLIST_REMOVE_RCU(block, next);
>> -    ram_list.mru_block = NULL;
>> +    qatomic_set(&ram_list.mru_block, NULL);
>>       /* Write list before version */
>> -    smp_wmb();
>> -    ram_list.version++;
>> +    qatomic_store_release(&ram_list.version, ram_list.version + 1);
>>       call_rcu(block, reclaim_ramblock, rcu);
>>       qemu_mutex_unlock_ramlist();
>>   }
>>
>> ---
>> base-commit: e89049b3ba5f1f0468bc0d294173345597514a1b
>> change-id: 20260520-tsan-c6f1aa909a90
>>
>> Best regards,
>> --
>> Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
>>
>

Re: [PATCH] system/physmem: Synchronize ram_list accesses

Posted by Peter Xu 2 days, 1 hour ago

On Wed, May 27, 2026 at 04:54:42PM +0900, Akihiko Odaki wrote:
> On 2026/05/27 6:35, Peter Xu wrote:
> > On Sat, May 23, 2026 at 09:38:28PM +0900, Akihiko Odaki wrote:
> > > Alex Bennée reported a ThreadSanitizer warning about a plain concurrent
> > > access to ram_list [1]. Ensure the concurrent accesses to ram_list are
> > > properly synchronized with atomic accesses, mutexes, or RCU.
> > > 
> > > First, the plain assignments of ram_list.mru_block are replaced with
> > > qatomic_set(). A comment in qemu_get_ram_block() explains why the
> > > ordering requirement is relaxed, but it still needs to be atomically
> > > accessed. include/qemu/atomic.h says:
> > > 
> > > > The C11 memory model says that variables that are accessed from
> > > > different threads should at least be done with __ATOMIC_RELAXED
> > > > primitives or the result is undefined. Generally this has little to
> > > > no effect on the generated code but not using the atomic primitives
> > > > will get flagged by sanitizers as a violation.
> > > 
> > > Second, ram_list.version accesses are replaced with atomic operations
> > > or protected with a mutex. Unlike ram_list.mru_block, ram_list.version
> > > has tighter ordering requirements for one of its goals: ensuring that
> > > the reader-held rs->last_seen_block value is invalidated whenever a RAM
> > > block is reclaimed between two RCU reader critical sections. Below are
> > > steps a reader and an updater follow:
> > > 
> > > Reader:
> > > 
> > > R-1. Enter the first RCU read-side critical section:
> > > R-1-1. rs->last_version = qatomic_load_acquire(&ram_list.version)
> > > R-1-2. rs->last_seen_block = an element of ram_list.blocks
> > > R-2. Enter the second RCU read-side critical section:
> > > R-2-1. if (qatomic_read(&ram_list.version) != rs->last_version)
> > > R-2-2.     rs->last_seen_block = NULL
> > > 
> > > Updater:
> > > 
> > > W-1. Enter a ram_list.mutex critical section
> > > W-1-1. Update ram_list.blocks
> > > W-1-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
> > > W-2. Enter another ram_list.mutex critical section
> > > W-2-1. QLIST_REMOVE_RCU(block, next)
> > > W-2-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
> > > W-2-3. call_rcu(block, reclaim_ramblock, rcu)
> > > 
> > > W-1-2 represents the write observed by R-1-1.
> > > 
> > > ram_list.version is read non-atomically on the update side because the
> > > update side is serialized with ram_list.mutex. The other ram_list
> > > accesses in these steps are reasoned about in two cases.
> > > 
> > > When the grace period of W-2-3 contains R-2:
> > >      qatomic_load_acquire() at R-1-1 and qatomic_store_release() at W-1-2
> > >      enforce the following ordering:
> > >          W-1-1 -> W-1-2 -> R-1-1 -> R-1-2
> > >      The value of ram_list.blocks stored by W-1-1 or a newer value that
> > >      was loaded by R-1-2 is still valid because of the grace period.
> > > 
> > > When the grace period of W-2-3 ends before R-2:
> > >      call_rcu() at W-2-3 and the read-side critical section at R-2 ensure
> > >      the following ordering:
> > >          W-2-2 -> W-2-3 -> the grace period -> R-2 -> R-2-1
> > >      The value of ram_list.version stored by W-2-2 or a newer value that
> > >      was loaded by R-2-1 differs from rs->last_version and the reader
> > >      invalidates rs->last_seen_block.
> > > 
> > > Together, these steps ensure that rs->last_seen_block is invalidated
> > > whenever necessary. With added atomic operations, pre-existing memory
> > > barriers are no longer necessary and are removed.
> > > 
> > > Any other ram_list accesses are already properly synchronized.
> > > 
> > > [1] https://lore.kernel.org/qemu-devel/878q9fbmap.fsf@draig.linaro.org/
> > > 
> > > Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
> > 
> > Above link [1] also mentioned about a hang.  Just to double check: we're
> > only trying to silent the TSAN warning (which is a false positive), right?
> 
> The hang is irrelevant and this is indeed to address the TSAN warning,
> though I'm a bit reluctant to call it false positive because, strictly
> speaking, the code actually relies on undefined behavior.
> 
> > 
> > > ---
> > >   migration/ram.c  | 10 +++++-----
> > >   system/physmem.c | 16 +++++++---------
> > >   2 files changed, 12 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index fc38ffbf8af1..6da24d725861 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -2495,7 +2495,10 @@ static void ram_state_reset(RAMState *rs)
> > >       rs->last_seen_block = NULL;
> > >       rs->last_page = 0;
> > > -    rs->last_version = ram_list.version;
> > > +
> > > +    /* Read version before ram_list.blocks */
> > > +    rs->last_version = qatomic_load_acquire(&ram_list.version);
> > > +
> > >       rs->xbzrle_started = false;
> > >       ram_page_hint_reset(&rs->page_hint);
> > > @@ -3270,13 +3273,10 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
> > >        */
> > >       WITH_QEMU_LOCK_GUARD(&rs->bitmap_mutex) {
> > >           WITH_RCU_READ_LOCK_GUARD() {
> > > -            if (ram_list.version != rs->last_version) {
> > > +            if (qatomic_read(&ram_list.version) != rs->last_version) {
> > >                   ram_state_reset(rs);
> > >               }
> > 
> > Not directly relevant to this patch, but I never thought into why this few
> > lines are needed at all, and then I found I don't have an answer..
> > 
> > Here, ram_list version change should only happen during ramblock
> > add/remove.  I can think of a few special cases:
> > 
> > a) device add/remove (e.g. dimms) causing new / removing ramblocks, should
> >     never happen as qdev code fails any device add/remove attempt during
> >     migration (qdev_device_add_from_qdict, qdev_unplug)
> > 
> > b) virtio-mem, which only manages one memory backend, hence one ramblock,
> >     no add/remove
> > 
> > c) virtio-gpu, will do random add/remove with memory_region_init_ram_ptr()
> >     which means ramblocks are not migratable
> > 
> > PS: I recall we have other special cases like ROM sizes can change on dest
> > QEMU, but it's about used_length, not about version boosts, so doesn't
> > matter here.
> > 
> > Both a) and b) should make sure no change on version, c) will change
> > version, but in a case where the ramblock is destined to be not migratable,
> > hence not visible to migration core.
> > 
> > The patch itself looks all fine, but I'm just wondering if we should just
> > remove this check completely or even try to somehow trap it from happening
> > (if we can rule out things like virtio-gpu): if something really changed,
> > it'll be a serious problem because we sync ramblock list already during
> > ram_save_setup.
> 
> There is another edge case: a ram_list.blocks change may be propagated to
> the thread calling ram_save_setup() and ram_save_iterate() at different
> timing. We are only ensuring that ram_list.blocks changes invalidating old
> blocks are delivered with WITH_RCU_READ_LOCK_GUARD().
> 
> So I think there is a theoretical race condition that breaks migration,
> though triggering it in practice may not be realistic.

Could you elaborate more on the theoretical race mentioned?

My comment above was trying to say there should have no such happening, the
virtio-gpu boosting the version seems to be an accident to me because
migration doesn't need that (because virtio-gpu ramblocks are not
migratable anyway).

While looking at this once more, I found that I'm lost why mru_block is
race free..

The problem is mru_block is written by the RCU readers, then... see whether
this race can happen (even with this patch):

  Reader A                              Writer
  --------                              ------
  rcu_read_lock()
  walk list, find block X
                                        QLIST_REMOVE_RCU(X)
                                        qatomic_set(&mru_block, NULL)
                                        call_rcu(X, reclaim_ramblock)
  qatomic_set(&mru_block, X)  <----------- overwrites NULL
  rcu_read_unlock()
                                        grace period ends, X freed

  Reader C
  --------
  rcu_read_lock()
  qatomic_rcu_read(&mru_block) -> X
  Read X's block->offset ...  <----------- UAF

One thing I can think of so far is this (pesudo code only, ignoring atomic
ops for now):

qemu_ram_free():
    qemu_mutex_lock_ramlist();
    ...
    QLIST_REMOVE_RCU(block, next);
    /*
     * 1st attempt to reset it, note: concurrent reader may set it again,
     * even to the ramblock being freed.
     */
    ram_list.mru_block = NULL;
    ram_list.version++;
    /*
     * Only to make sure whoever might be using this ramblock being
     * removed to finish using it.
     */
    call_rcu(block, rcu_noop_fn, rcu);
    drain_call_rcu();
    /*
     * 2nd attempt to clear it: after call_rcu() we make sure nobody yet
     * has access to the ramblock being removed, hence this clear makes
     * sure it won't get set again by anyone to the removed ramblock.
     */
    ram_list.mru_block = NULL;
    call_rcu(block, reclaim_ramblock, rcu);
    qemu_mutex_unlock_ramlist();

But it looks a bit ugly.  Thoughts?

Thanks,

-- 
Peter Xu

Re: [PATCH] system/physmem: Synchronize ram_list accesses

Posted by Akihiko Odaki 2 days, 1 hour ago


On 2026/05/29 0:18, Peter Xu wrote:
> On Wed, May 27, 2026 at 04:54:42PM +0900, Akihiko Odaki wrote:
>> On 2026/05/27 6:35, Peter Xu wrote:
>>> On Sat, May 23, 2026 at 09:38:28PM +0900, Akihiko Odaki wrote:
>>>> Alex Bennée reported a ThreadSanitizer warning about a plain concurrent
>>>> access to ram_list [1]. Ensure the concurrent accesses to ram_list are
>>>> properly synchronized with atomic accesses, mutexes, or RCU.
>>>>
>>>> First, the plain assignments of ram_list.mru_block are replaced with
>>>> qatomic_set(). A comment in qemu_get_ram_block() explains why the
>>>> ordering requirement is relaxed, but it still needs to be atomically
>>>> accessed. include/qemu/atomic.h says:
>>>>
>>>>> The C11 memory model says that variables that are accessed from
>>>>> different threads should at least be done with __ATOMIC_RELAXED
>>>>> primitives or the result is undefined. Generally this has little to
>>>>> no effect on the generated code but not using the atomic primitives
>>>>> will get flagged by sanitizers as a violation.
>>>>
>>>> Second, ram_list.version accesses are replaced with atomic operations
>>>> or protected with a mutex. Unlike ram_list.mru_block, ram_list.version
>>>> has tighter ordering requirements for one of its goals: ensuring that
>>>> the reader-held rs->last_seen_block value is invalidated whenever a RAM
>>>> block is reclaimed between two RCU reader critical sections. Below are
>>>> steps a reader and an updater follow:
>>>>
>>>> Reader:
>>>>
>>>> R-1. Enter the first RCU read-side critical section:
>>>> R-1-1. rs->last_version = qatomic_load_acquire(&ram_list.version)
>>>> R-1-2. rs->last_seen_block = an element of ram_list.blocks
>>>> R-2. Enter the second RCU read-side critical section:
>>>> R-2-1. if (qatomic_read(&ram_list.version) != rs->last_version)
>>>> R-2-2.     rs->last_seen_block = NULL
>>>>
>>>> Updater:
>>>>
>>>> W-1. Enter a ram_list.mutex critical section
>>>> W-1-1. Update ram_list.blocks
>>>> W-1-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
>>>> W-2. Enter another ram_list.mutex critical section
>>>> W-2-1. QLIST_REMOVE_RCU(block, next)
>>>> W-2-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
>>>> W-2-3. call_rcu(block, reclaim_ramblock, rcu)
>>>>
>>>> W-1-2 represents the write observed by R-1-1.
>>>>
>>>> ram_list.version is read non-atomically on the update side because the
>>>> update side is serialized with ram_list.mutex. The other ram_list
>>>> accesses in these steps are reasoned about in two cases.
>>>>
>>>> When the grace period of W-2-3 contains R-2:
>>>>       qatomic_load_acquire() at R-1-1 and qatomic_store_release() at W-1-2
>>>>       enforce the following ordering:
>>>>           W-1-1 -> W-1-2 -> R-1-1 -> R-1-2
>>>>       The value of ram_list.blocks stored by W-1-1 or a newer value that
>>>>       was loaded by R-1-2 is still valid because of the grace period.
>>>>
>>>> When the grace period of W-2-3 ends before R-2:
>>>>       call_rcu() at W-2-3 and the read-side critical section at R-2 ensure
>>>>       the following ordering:
>>>>           W-2-2 -> W-2-3 -> the grace period -> R-2 -> R-2-1
>>>>       The value of ram_list.version stored by W-2-2 or a newer value that
>>>>       was loaded by R-2-1 differs from rs->last_version and the reader
>>>>       invalidates rs->last_seen_block.
>>>>
>>>> Together, these steps ensure that rs->last_seen_block is invalidated
>>>> whenever necessary. With added atomic operations, pre-existing memory
>>>> barriers are no longer necessary and are removed.
>>>>
>>>> Any other ram_list accesses are already properly synchronized.
>>>>
>>>> [1] https://lore.kernel.org/qemu-devel/878q9fbmap.fsf@draig.linaro.org/
>>>>
>>>> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
>>>
>>> Above link [1] also mentioned about a hang.  Just to double check: we're
>>> only trying to silent the TSAN warning (which is a false positive), right?
>>
>> The hang is irrelevant and this is indeed to address the TSAN warning,
>> though I'm a bit reluctant to call it false positive because, strictly
>> speaking, the code actually relies on undefined behavior.
>>
>>>
>>>> ---
>>>>    migration/ram.c  | 10 +++++-----
>>>>    system/physmem.c | 16 +++++++---------
>>>>    2 files changed, 12 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>> index fc38ffbf8af1..6da24d725861 100644
>>>> --- a/migration/ram.c
>>>> +++ b/migration/ram.c
>>>> @@ -2495,7 +2495,10 @@ static void ram_state_reset(RAMState *rs)
>>>>        rs->last_seen_block = NULL;
>>>>        rs->last_page = 0;
>>>> -    rs->last_version = ram_list.version;
>>>> +
>>>> +    /* Read version before ram_list.blocks */
>>>> +    rs->last_version = qatomic_load_acquire(&ram_list.version);
>>>> +
>>>>        rs->xbzrle_started = false;
>>>>        ram_page_hint_reset(&rs->page_hint);
>>>> @@ -3270,13 +3273,10 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>>>>         */
>>>>        WITH_QEMU_LOCK_GUARD(&rs->bitmap_mutex) {
>>>>            WITH_RCU_READ_LOCK_GUARD() {
>>>> -            if (ram_list.version != rs->last_version) {
>>>> +            if (qatomic_read(&ram_list.version) != rs->last_version) {
>>>>                    ram_state_reset(rs);
>>>>                }
>>>
>>> Not directly relevant to this patch, but I never thought into why this few
>>> lines are needed at all, and then I found I don't have an answer..
>>>
>>> Here, ram_list version change should only happen during ramblock
>>> add/remove.  I can think of a few special cases:
>>>
>>> a) device add/remove (e.g. dimms) causing new / removing ramblocks, should
>>>      never happen as qdev code fails any device add/remove attempt during
>>>      migration (qdev_device_add_from_qdict, qdev_unplug)
>>>
>>> b) virtio-mem, which only manages one memory backend, hence one ramblock,
>>>      no add/remove
>>>
>>> c) virtio-gpu, will do random add/remove with memory_region_init_ram_ptr()
>>>      which means ramblocks are not migratable
>>>
>>> PS: I recall we have other special cases like ROM sizes can change on dest
>>> QEMU, but it's about used_length, not about version boosts, so doesn't
>>> matter here.
>>>
>>> Both a) and b) should make sure no change on version, c) will change
>>> version, but in a case where the ramblock is destined to be not migratable,
>>> hence not visible to migration core.
>>>
>>> The patch itself looks all fine, but I'm just wondering if we should just
>>> remove this check completely or even try to somehow trap it from happening
>>> (if we can rule out things like virtio-gpu): if something really changed,
>>> it'll be a serious problem because we sync ramblock list already during
>>> ram_save_setup.
>>
>> There is another edge case: a ram_list.blocks change may be propagated to
>> the thread calling ram_save_setup() and ram_save_iterate() at different
>> timing. We are only ensuring that ram_list.blocks changes invalidating old
>> blocks are delivered with WITH_RCU_READ_LOCK_GUARD().
>>
>> So I think there is a theoretical race condition that breaks migration,
>> though triggering it in practice may not be realistic.
> 
> Could you elaborate more on the theoretical race mentioned?
> 
> My comment above was trying to say there should have no such happening, the
> virtio-gpu boosting the version seems to be an accident to me because
> migration doesn't need that (because virtio-gpu ramblocks are not
> migratable anyway).

For example, a device may be removed before migration, but the removal 
of the block from ram_list.blocks and the increment of ram_list.version 
may not be observed by the thread calling ram_save_setup(); RCU only 
guarantees that these writes will be visible in read-side critical 
sections *after* the grace period ends, which can be after the 
ram_save_setup() call and before following ram_save_iterate() calls.

Regards,
Akihiko Odaki

> 
> While looking at this once more, I found that I'm lost why mru_block is
> race free..
> 
> The problem is mru_block is written by the RCU readers, then... see whether
> this race can happen (even with this patch):
> 
>    Reader A                              Writer
>    --------                              ------
>    rcu_read_lock()
>    walk list, find block X
>                                          QLIST_REMOVE_RCU(X)
>                                          qatomic_set(&mru_block, NULL)
>                                          call_rcu(X, reclaim_ramblock)
>    qatomic_set(&mru_block, X)  <----------- overwrites NULL
>    rcu_read_unlock()
>                                          grace period ends, X freed
> 
>    Reader C
>    --------
>    rcu_read_lock()
>    qatomic_rcu_read(&mru_block) -> X
>    Read X's block->offset ...  <----------- UAF
> 
> One thing I can think of so far is this (pesudo code only, ignoring atomic
> ops for now):
> 
> qemu_ram_free():
>      qemu_mutex_lock_ramlist();
>      ...
>      QLIST_REMOVE_RCU(block, next);
>      /*
>       * 1st attempt to reset it, note: concurrent reader may set it again,
>       * even to the ramblock being freed.
>       */
>      ram_list.mru_block = NULL;
>      ram_list.version++;
>      /*
>       * Only to make sure whoever might be using this ramblock being
>       * removed to finish using it.
>       */
>      call_rcu(block, rcu_noop_fn, rcu);
>      drain_call_rcu();
>      /*
>       * 2nd attempt to clear it: after call_rcu() we make sure nobody yet
>       * has access to the ramblock being removed, hence this clear makes
>       * sure it won't get set again by anyone to the removed ramblock.
>       */
>      ram_list.mru_block = NULL;
>      call_rcu(block, reclaim_ramblock, rcu);
>      qemu_mutex_unlock_ramlist();
> 
> But it looks a bit ugly.  Thoughts?
> 
> Thanks,
>

Re: [PATCH] system/physmem: Synchronize ram_list accesses

Posted by Peter Xu 2 days ago

On Fri, May 29, 2026 at 12:32:18AM +0900, Akihiko Odaki wrote:
> 
> 
> On 2026/05/29 0:18, Peter Xu wrote:
> > On Wed, May 27, 2026 at 04:54:42PM +0900, Akihiko Odaki wrote:
> > > On 2026/05/27 6:35, Peter Xu wrote:
> > > > On Sat, May 23, 2026 at 09:38:28PM +0900, Akihiko Odaki wrote:
> > > > > Alex Bennée reported a ThreadSanitizer warning about a plain concurrent
> > > > > access to ram_list [1]. Ensure the concurrent accesses to ram_list are
> > > > > properly synchronized with atomic accesses, mutexes, or RCU.
> > > > > 
> > > > > First, the plain assignments of ram_list.mru_block are replaced with
> > > > > qatomic_set(). A comment in qemu_get_ram_block() explains why the
> > > > > ordering requirement is relaxed, but it still needs to be atomically
> > > > > accessed. include/qemu/atomic.h says:
> > > > > 
> > > > > > The C11 memory model says that variables that are accessed from
> > > > > > different threads should at least be done with __ATOMIC_RELAXED
> > > > > > primitives or the result is undefined. Generally this has little to
> > > > > > no effect on the generated code but not using the atomic primitives
> > > > > > will get flagged by sanitizers as a violation.
> > > > > 
> > > > > Second, ram_list.version accesses are replaced with atomic operations
> > > > > or protected with a mutex. Unlike ram_list.mru_block, ram_list.version
> > > > > has tighter ordering requirements for one of its goals: ensuring that
> > > > > the reader-held rs->last_seen_block value is invalidated whenever a RAM
> > > > > block is reclaimed between two RCU reader critical sections. Below are
> > > > > steps a reader and an updater follow:
> > > > > 
> > > > > Reader:
> > > > > 
> > > > > R-1. Enter the first RCU read-side critical section:
> > > > > R-1-1. rs->last_version = qatomic_load_acquire(&ram_list.version)
> > > > > R-1-2. rs->last_seen_block = an element of ram_list.blocks
> > > > > R-2. Enter the second RCU read-side critical section:
> > > > > R-2-1. if (qatomic_read(&ram_list.version) != rs->last_version)
> > > > > R-2-2.     rs->last_seen_block = NULL
> > > > > 
> > > > > Updater:
> > > > > 
> > > > > W-1. Enter a ram_list.mutex critical section
> > > > > W-1-1. Update ram_list.blocks
> > > > > W-1-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
> > > > > W-2. Enter another ram_list.mutex critical section
> > > > > W-2-1. QLIST_REMOVE_RCU(block, next)
> > > > > W-2-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
> > > > > W-2-3. call_rcu(block, reclaim_ramblock, rcu)
> > > > > 
> > > > > W-1-2 represents the write observed by R-1-1.
> > > > > 
> > > > > ram_list.version is read non-atomically on the update side because the
> > > > > update side is serialized with ram_list.mutex. The other ram_list
> > > > > accesses in these steps are reasoned about in two cases.
> > > > > 
> > > > > When the grace period of W-2-3 contains R-2:
> > > > >       qatomic_load_acquire() at R-1-1 and qatomic_store_release() at W-1-2
> > > > >       enforce the following ordering:
> > > > >           W-1-1 -> W-1-2 -> R-1-1 -> R-1-2
> > > > >       The value of ram_list.blocks stored by W-1-1 or a newer value that
> > > > >       was loaded by R-1-2 is still valid because of the grace period.
> > > > > 
> > > > > When the grace period of W-2-3 ends before R-2:
> > > > >       call_rcu() at W-2-3 and the read-side critical section at R-2 ensure
> > > > >       the following ordering:
> > > > >           W-2-2 -> W-2-3 -> the grace period -> R-2 -> R-2-1
> > > > >       The value of ram_list.version stored by W-2-2 or a newer value that
> > > > >       was loaded by R-2-1 differs from rs->last_version and the reader
> > > > >       invalidates rs->last_seen_block.
> > > > > 
> > > > > Together, these steps ensure that rs->last_seen_block is invalidated
> > > > > whenever necessary. With added atomic operations, pre-existing memory
> > > > > barriers are no longer necessary and are removed.
> > > > > 
> > > > > Any other ram_list accesses are already properly synchronized.
> > > > > 
> > > > > [1] https://lore.kernel.org/qemu-devel/878q9fbmap.fsf@draig.linaro.org/
> > > > > 
> > > > > Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
> > > > 
> > > > Above link [1] also mentioned about a hang.  Just to double check: we're
> > > > only trying to silent the TSAN warning (which is a false positive), right?
> > > 
> > > The hang is irrelevant and this is indeed to address the TSAN warning,
> > > though I'm a bit reluctant to call it false positive because, strictly
> > > speaking, the code actually relies on undefined behavior.
> > > 
> > > > 
> > > > > ---
> > > > >    migration/ram.c  | 10 +++++-----
> > > > >    system/physmem.c | 16 +++++++---------
> > > > >    2 files changed, 12 insertions(+), 14 deletions(-)
> > > > > 
> > > > > diff --git a/migration/ram.c b/migration/ram.c
> > > > > index fc38ffbf8af1..6da24d725861 100644
> > > > > --- a/migration/ram.c
> > > > > +++ b/migration/ram.c
> > > > > @@ -2495,7 +2495,10 @@ static void ram_state_reset(RAMState *rs)
> > > > >        rs->last_seen_block = NULL;
> > > > >        rs->last_page = 0;
> > > > > -    rs->last_version = ram_list.version;
> > > > > +
> > > > > +    /* Read version before ram_list.blocks */
> > > > > +    rs->last_version = qatomic_load_acquire(&ram_list.version);
> > > > > +
> > > > >        rs->xbzrle_started = false;
> > > > >        ram_page_hint_reset(&rs->page_hint);
> > > > > @@ -3270,13 +3273,10 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
> > > > >         */
> > > > >        WITH_QEMU_LOCK_GUARD(&rs->bitmap_mutex) {
> > > > >            WITH_RCU_READ_LOCK_GUARD() {
> > > > > -            if (ram_list.version != rs->last_version) {
> > > > > +            if (qatomic_read(&ram_list.version) != rs->last_version) {
> > > > >                    ram_state_reset(rs);
> > > > >                }
> > > > 
> > > > Not directly relevant to this patch, but I never thought into why this few
> > > > lines are needed at all, and then I found I don't have an answer..
> > > > 
> > > > Here, ram_list version change should only happen during ramblock
> > > > add/remove.  I can think of a few special cases:
> > > > 
> > > > a) device add/remove (e.g. dimms) causing new / removing ramblocks, should
> > > >      never happen as qdev code fails any device add/remove attempt during
> > > >      migration (qdev_device_add_from_qdict, qdev_unplug)
> > > > 
> > > > b) virtio-mem, which only manages one memory backend, hence one ramblock,
> > > >      no add/remove
> > > > 
> > > > c) virtio-gpu, will do random add/remove with memory_region_init_ram_ptr()
> > > >      which means ramblocks are not migratable
> > > > 
> > > > PS: I recall we have other special cases like ROM sizes can change on dest
> > > > QEMU, but it's about used_length, not about version boosts, so doesn't
> > > > matter here.
> > > > 
> > > > Both a) and b) should make sure no change on version, c) will change
> > > > version, but in a case where the ramblock is destined to be not migratable,
> > > > hence not visible to migration core.
> > > > 
> > > > The patch itself looks all fine, but I'm just wondering if we should just
> > > > remove this check completely or even try to somehow trap it from happening
> > > > (if we can rule out things like virtio-gpu): if something really changed,
> > > > it'll be a serious problem because we sync ramblock list already during
> > > > ram_save_setup.
> > > 
> > > There is another edge case: a ram_list.blocks change may be propagated to
> > > the thread calling ram_save_setup() and ram_save_iterate() at different
> > > timing. We are only ensuring that ram_list.blocks changes invalidating old
> > > blocks are delivered with WITH_RCU_READ_LOCK_GUARD().
> > > 
> > > So I think there is a theoretical race condition that breaks migration,
> > > though triggering it in practice may not be realistic.
> > 
> > Could you elaborate more on the theoretical race mentioned?
> > 
> > My comment above was trying to say there should have no such happening, the
> > virtio-gpu boosting the version seems to be an accident to me because
> > migration doesn't need that (because virtio-gpu ramblocks are not
> > migratable anyway).
> 
> For example, a device may be removed before migration, but the removal of
> the block from ram_list.blocks and the increment of ram_list.version may not
> be observed by the thread calling ram_save_setup(); RCU only guarantees that
> these writes will be visible in read-side critical sections *after* the
> grace period ends, which can be after the ram_save_setup() call and before
> following ram_save_iterate() calls.

Ah. I think it should be fine, since save_setup() holds BQL, done by all
callers when invoking qemu_savevm_state_do_setup().  I expect removal of a
device needs BQL too while invoking qemu_ram_free(), so serialized with it.

The other thing is, very early we'll set MIGRATION_STATUS_SETUP, covering
both save_setup() and save_iterate(), so things will start to get blocked
very early too.

OTOH, any comments on below?

Thanks,

> 
> Regards,
> Akihiko Odaki
> 
> > 
> > While looking at this once more, I found that I'm lost why mru_block is
> > race free..
> > 
> > The problem is mru_block is written by the RCU readers, then... see whether
> > this race can happen (even with this patch):
> > 
> >    Reader A                              Writer
> >    --------                              ------
> >    rcu_read_lock()
> >    walk list, find block X
> >                                          QLIST_REMOVE_RCU(X)
> >                                          qatomic_set(&mru_block, NULL)
> >                                          call_rcu(X, reclaim_ramblock)
> >    qatomic_set(&mru_block, X)  <----------- overwrites NULL
> >    rcu_read_unlock()
> >                                          grace period ends, X freed
> > 
> >    Reader C
> >    --------
> >    rcu_read_lock()
> >    qatomic_rcu_read(&mru_block) -> X
> >    Read X's block->offset ...  <----------- UAF
> > 
> > One thing I can think of so far is this (pesudo code only, ignoring atomic
> > ops for now):
> > 
> > qemu_ram_free():
> >      qemu_mutex_lock_ramlist();
> >      ...
> >      QLIST_REMOVE_RCU(block, next);
> >      /*
> >       * 1st attempt to reset it, note: concurrent reader may set it again,
> >       * even to the ramblock being freed.
> >       */
> >      ram_list.mru_block = NULL;
> >      ram_list.version++;
> >      /*
> >       * Only to make sure whoever might be using this ramblock being
> >       * removed to finish using it.
> >       */
> >      call_rcu(block, rcu_noop_fn, rcu);
> >      drain_call_rcu();
> >      /*
> >       * 2nd attempt to clear it: after call_rcu() we make sure nobody yet
> >       * has access to the ramblock being removed, hence this clear makes
> >       * sure it won't get set again by anyone to the removed ramblock.
> >       */
> >      ram_list.mru_block = NULL;
> >      call_rcu(block, reclaim_ramblock, rcu);
> >      qemu_mutex_unlock_ramlist();
> > 
> > But it looks a bit ugly.  Thoughts?
> > 
> > Thanks,
> > 
> 

-- 
Peter Xu

Re: [PATCH] system/physmem: Synchronize ram_list accesses

Posted by Akihiko Odaki 1 day, 8 hours ago

On 2026/05/29 1:03, Peter Xu wrote:
> On Fri, May 29, 2026 at 12:32:18AM +0900, Akihiko Odaki wrote:
>>
>>
>> On 2026/05/29 0:18, Peter Xu wrote:
>>> On Wed, May 27, 2026 at 04:54:42PM +0900, Akihiko Odaki wrote:
>>>> On 2026/05/27 6:35, Peter Xu wrote:
>>>>> On Sat, May 23, 2026 at 09:38:28PM +0900, Akihiko Odaki wrote:
>>>>>> Alex Bennée reported a ThreadSanitizer warning about a plain concurrent
>>>>>> access to ram_list [1]. Ensure the concurrent accesses to ram_list are
>>>>>> properly synchronized with atomic accesses, mutexes, or RCU.
>>>>>>
>>>>>> First, the plain assignments of ram_list.mru_block are replaced with
>>>>>> qatomic_set(). A comment in qemu_get_ram_block() explains why the
>>>>>> ordering requirement is relaxed, but it still needs to be atomically
>>>>>> accessed. include/qemu/atomic.h says:
>>>>>>
>>>>>>> The C11 memory model says that variables that are accessed from
>>>>>>> different threads should at least be done with __ATOMIC_RELAXED
>>>>>>> primitives or the result is undefined. Generally this has little to
>>>>>>> no effect on the generated code but not using the atomic primitives
>>>>>>> will get flagged by sanitizers as a violation.
>>>>>>
>>>>>> Second, ram_list.version accesses are replaced with atomic operations
>>>>>> or protected with a mutex. Unlike ram_list.mru_block, ram_list.version
>>>>>> has tighter ordering requirements for one of its goals: ensuring that
>>>>>> the reader-held rs->last_seen_block value is invalidated whenever a RAM
>>>>>> block is reclaimed between two RCU reader critical sections. Below are
>>>>>> steps a reader and an updater follow:
>>>>>>
>>>>>> Reader:
>>>>>>
>>>>>> R-1. Enter the first RCU read-side critical section:
>>>>>> R-1-1. rs->last_version = qatomic_load_acquire(&ram_list.version)
>>>>>> R-1-2. rs->last_seen_block = an element of ram_list.blocks
>>>>>> R-2. Enter the second RCU read-side critical section:
>>>>>> R-2-1. if (qatomic_read(&ram_list.version) != rs->last_version)
>>>>>> R-2-2.     rs->last_seen_block = NULL
>>>>>>
>>>>>> Updater:
>>>>>>
>>>>>> W-1. Enter a ram_list.mutex critical section
>>>>>> W-1-1. Update ram_list.blocks
>>>>>> W-1-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
>>>>>> W-2. Enter another ram_list.mutex critical section
>>>>>> W-2-1. QLIST_REMOVE_RCU(block, next)
>>>>>> W-2-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
>>>>>> W-2-3. call_rcu(block, reclaim_ramblock, rcu)
>>>>>>
>>>>>> W-1-2 represents the write observed by R-1-1.
>>>>>>
>>>>>> ram_list.version is read non-atomically on the update side because the
>>>>>> update side is serialized with ram_list.mutex. The other ram_list
>>>>>> accesses in these steps are reasoned about in two cases.
>>>>>>
>>>>>> When the grace period of W-2-3 contains R-2:
>>>>>>        qatomic_load_acquire() at R-1-1 and qatomic_store_release() at W-1-2
>>>>>>        enforce the following ordering:
>>>>>>            W-1-1 -> W-1-2 -> R-1-1 -> R-1-2
>>>>>>        The value of ram_list.blocks stored by W-1-1 or a newer value that
>>>>>>        was loaded by R-1-2 is still valid because of the grace period.
>>>>>>
>>>>>> When the grace period of W-2-3 ends before R-2:
>>>>>>        call_rcu() at W-2-3 and the read-side critical section at R-2 ensure
>>>>>>        the following ordering:
>>>>>>            W-2-2 -> W-2-3 -> the grace period -> R-2 -> R-2-1
>>>>>>        The value of ram_list.version stored by W-2-2 or a newer value that
>>>>>>        was loaded by R-2-1 differs from rs->last_version and the reader
>>>>>>        invalidates rs->last_seen_block.
>>>>>>
>>>>>> Together, these steps ensure that rs->last_seen_block is invalidated
>>>>>> whenever necessary. With added atomic operations, pre-existing memory
>>>>>> barriers are no longer necessary and are removed.
>>>>>>
>>>>>> Any other ram_list accesses are already properly synchronized.
>>>>>>
>>>>>> [1] https://lore.kernel.org/qemu-devel/878q9fbmap.fsf@draig.linaro.org/
>>>>>>
>>>>>> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
>>>>>
>>>>> Above link [1] also mentioned about a hang.  Just to double check: we're
>>>>> only trying to silent the TSAN warning (which is a false positive), right?
>>>>
>>>> The hang is irrelevant and this is indeed to address the TSAN warning,
>>>> though I'm a bit reluctant to call it false positive because, strictly
>>>> speaking, the code actually relies on undefined behavior.
>>>>
>>>>>
>>>>>> ---
>>>>>>     migration/ram.c  | 10 +++++-----
>>>>>>     system/physmem.c | 16 +++++++---------
>>>>>>     2 files changed, 12 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>>>> index fc38ffbf8af1..6da24d725861 100644
>>>>>> --- a/migration/ram.c
>>>>>> +++ b/migration/ram.c
>>>>>> @@ -2495,7 +2495,10 @@ static void ram_state_reset(RAMState *rs)
>>>>>>         rs->last_seen_block = NULL;
>>>>>>         rs->last_page = 0;
>>>>>> -    rs->last_version = ram_list.version;
>>>>>> +
>>>>>> +    /* Read version before ram_list.blocks */
>>>>>> +    rs->last_version = qatomic_load_acquire(&ram_list.version);
>>>>>> +
>>>>>>         rs->xbzrle_started = false;
>>>>>>         ram_page_hint_reset(&rs->page_hint);
>>>>>> @@ -3270,13 +3273,10 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>>>>>>          */
>>>>>>         WITH_QEMU_LOCK_GUARD(&rs->bitmap_mutex) {
>>>>>>             WITH_RCU_READ_LOCK_GUARD() {
>>>>>> -            if (ram_list.version != rs->last_version) {
>>>>>> +            if (qatomic_read(&ram_list.version) != rs->last_version) {
>>>>>>                     ram_state_reset(rs);
>>>>>>                 }
>>>>>
>>>>> Not directly relevant to this patch, but I never thought into why this few
>>>>> lines are needed at all, and then I found I don't have an answer..
>>>>>
>>>>> Here, ram_list version change should only happen during ramblock
>>>>> add/remove.  I can think of a few special cases:
>>>>>
>>>>> a) device add/remove (e.g. dimms) causing new / removing ramblocks, should
>>>>>       never happen as qdev code fails any device add/remove attempt during
>>>>>       migration (qdev_device_add_from_qdict, qdev_unplug)
>>>>>
>>>>> b) virtio-mem, which only manages one memory backend, hence one ramblock,
>>>>>       no add/remove
>>>>>
>>>>> c) virtio-gpu, will do random add/remove with memory_region_init_ram_ptr()
>>>>>       which means ramblocks are not migratable
>>>>>
>>>>> PS: I recall we have other special cases like ROM sizes can change on dest
>>>>> QEMU, but it's about used_length, not about version boosts, so doesn't
>>>>> matter here.
>>>>>
>>>>> Both a) and b) should make sure no change on version, c) will change
>>>>> version, but in a case where the ramblock is destined to be not migratable,
>>>>> hence not visible to migration core.
>>>>>
>>>>> The patch itself looks all fine, but I'm just wondering if we should just
>>>>> remove this check completely or even try to somehow trap it from happening
>>>>> (if we can rule out things like virtio-gpu): if something really changed,
>>>>> it'll be a serious problem because we sync ramblock list already during
>>>>> ram_save_setup.
>>>>
>>>> There is another edge case: a ram_list.blocks change may be propagated to
>>>> the thread calling ram_save_setup() and ram_save_iterate() at different
>>>> timing. We are only ensuring that ram_list.blocks changes invalidating old
>>>> blocks are delivered with WITH_RCU_READ_LOCK_GUARD().
>>>>
>>>> So I think there is a theoretical race condition that breaks migration,
>>>> though triggering it in practice may not be realistic.
>>>
>>> Could you elaborate more on the theoretical race mentioned?
>>>
>>> My comment above was trying to say there should have no such happening, the
>>> virtio-gpu boosting the version seems to be an accident to me because
>>> migration doesn't need that (because virtio-gpu ramblocks are not
>>> migratable anyway).
>>
>> For example, a device may be removed before migration, but the removal of
>> the block from ram_list.blocks and the increment of ram_list.version may not
>> be observed by the thread calling ram_save_setup(); RCU only guarantees that
>> these writes will be visible in read-side critical sections *after* the
>> grace period ends, which can be after the ram_save_setup() call and before
>> following ram_save_iterate() calls.
> 
> Ah. I think it should be fine, since save_setup() holds BQL, done by all
> callers when invoking qemu_savevm_state_do_setup().  I expect removal of a
> device needs BQL too while invoking qemu_ram_free(), so serialized with it.
> 
> The other thing is, very early we'll set MIGRATION_STATUS_SETUP, covering
> both save_setup() and save_iterate(), so things will start to get blocked
> very early too.
> 
> OTOH, any comments on below?

Sure.

> 
> Thanks,
> 
>>
>> Regards,
>> Akihiko Odaki
>>
>>>
>>> While looking at this once more, I found that I'm lost why mru_block is
>>> race free..
>>>
>>> The problem is mru_block is written by the RCU readers, then... see whether
>>> this race can happen (even with this patch):
>>>
>>>     Reader A                              Writer
>>>     --------                              ------
>>>     rcu_read_lock()
>>>     walk list, find block X
>>>                                           QLIST_REMOVE_RCU(X)
>>>                                           qatomic_set(&mru_block, NULL)
>>>                                           call_rcu(X, reclaim_ramblock)
>>>     qatomic_set(&mru_block, X)  <----------- overwrites NULL
>>>     rcu_read_unlock()
>>>                                           grace period ends, X freed
>>>
>>>     Reader C
>>>     --------
>>>     rcu_read_lock()
>>>     qatomic_rcu_read(&mru_block) -> X
>>>     Read X's block->offset ...  <----------- UAF

I agree this is problematic.

>>>
>>> One thing I can think of so far is this (pesudo code only, ignoring atomic
>>> ops for now):
>>>
>>> qemu_ram_free():
>>>       qemu_mutex_lock_ramlist();
>>>       ...
>>>       QLIST_REMOVE_RCU(block, next);
>>>       /*
>>>        * 1st attempt to reset it, note: concurrent reader may set it again,
>>>        * even to the ramblock being freed.
>>>        */
>>>       ram_list.mru_block = NULL;
>>>       ram_list.version++;
>>>       /*
>>>        * Only to make sure whoever might be using this ramblock being
>>>        * removed to finish using it.
>>>        */
>>>       call_rcu(block, rcu_noop_fn, rcu);
>>>       drain_call_rcu();
>>>       /*
>>>        * 2nd attempt to clear it: after call_rcu() we make sure nobody yet
>>>        * has access to the ramblock being removed, hence this clear makes
>>>        * sure it won't get set again by anyone to the removed ramblock.
>>>        */
>>>       ram_list.mru_block = NULL;
>>>       call_rcu(block, reclaim_ramblock, rcu);
>>>       qemu_mutex_unlock_ramlist();
>>>
>>> But it looks a bit ugly.  Thoughts?

Indeed it's not really nice. drain_call_rcu() is something we definitely 
want to avoid; it unlocks the BQL, which is too hard to prove safety.

The best solution I have currently in mind is to keep the mru_block 
inside an allocation managed my RCU:

struct RAMListVersion {
	struct rcu_head rcu;
	RAMBlock *mru_block;
	RAMBlock *reclaim_block;
};

And qemu_ram_free() will do:
     qemu_mutex_lock_ramlist();
     ...
     QLIST_REMOVE_RCU(block, next);
     ram_list.version->reclaim_block = block;
     call_rcu(ram_list.version, reclaim_ramlist_version, rcu);
     ram_list.version = g_new0(RamListVersion, 1);
     ram_list.version_id++;
     qemu_mutex_unlock_ramlist();

qemu_get_ram_block() will do:
     qatomic_rcu_read(&ram_list.version)->mru_block = block;

reclaim_ramlist_version will do:
     if (version->reclaim_block) {
          reclaim_ramblock(version->reclaim_block);
     }

     g_free(version);

Regards,
Akihiko Odaki

Re: [PATCH] system/physmem: Synchronize ram_list accesses

Posted by Peter Xu 1 day ago

On Fri, May 29, 2026 at 05:24:54PM +0900, Akihiko Odaki wrote:
> On 2026/05/29 1:03, Peter Xu wrote:
> > On Fri, May 29, 2026 at 12:32:18AM +0900, Akihiko Odaki wrote:
> > > 
> > > 
> > > On 2026/05/29 0:18, Peter Xu wrote:
> > > > On Wed, May 27, 2026 at 04:54:42PM +0900, Akihiko Odaki wrote:
> > > > > On 2026/05/27 6:35, Peter Xu wrote:
> > > > > > On Sat, May 23, 2026 at 09:38:28PM +0900, Akihiko Odaki wrote:
> > > > > > > Alex Bennée reported a ThreadSanitizer warning about a plain concurrent
> > > > > > > access to ram_list [1]. Ensure the concurrent accesses to ram_list are
> > > > > > > properly synchronized with atomic accesses, mutexes, or RCU.
> > > > > > > 
> > > > > > > First, the plain assignments of ram_list.mru_block are replaced with
> > > > > > > qatomic_set(). A comment in qemu_get_ram_block() explains why the
> > > > > > > ordering requirement is relaxed, but it still needs to be atomically
> > > > > > > accessed. include/qemu/atomic.h says:
> > > > > > > 
> > > > > > > > The C11 memory model says that variables that are accessed from
> > > > > > > > different threads should at least be done with __ATOMIC_RELAXED
> > > > > > > > primitives or the result is undefined. Generally this has little to
> > > > > > > > no effect on the generated code but not using the atomic primitives
> > > > > > > > will get flagged by sanitizers as a violation.
> > > > > > > 
> > > > > > > Second, ram_list.version accesses are replaced with atomic operations
> > > > > > > or protected with a mutex. Unlike ram_list.mru_block, ram_list.version
> > > > > > > has tighter ordering requirements for one of its goals: ensuring that
> > > > > > > the reader-held rs->last_seen_block value is invalidated whenever a RAM
> > > > > > > block is reclaimed between two RCU reader critical sections. Below are
> > > > > > > steps a reader and an updater follow:
> > > > > > > 
> > > > > > > Reader:
> > > > > > > 
> > > > > > > R-1. Enter the first RCU read-side critical section:
> > > > > > > R-1-1. rs->last_version = qatomic_load_acquire(&ram_list.version)
> > > > > > > R-1-2. rs->last_seen_block = an element of ram_list.blocks
> > > > > > > R-2. Enter the second RCU read-side critical section:
> > > > > > > R-2-1. if (qatomic_read(&ram_list.version) != rs->last_version)
> > > > > > > R-2-2.     rs->last_seen_block = NULL
> > > > > > > 
> > > > > > > Updater:
> > > > > > > 
> > > > > > > W-1. Enter a ram_list.mutex critical section
> > > > > > > W-1-1. Update ram_list.blocks
> > > > > > > W-1-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
> > > > > > > W-2. Enter another ram_list.mutex critical section
> > > > > > > W-2-1. QLIST_REMOVE_RCU(block, next)
> > > > > > > W-2-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
> > > > > > > W-2-3. call_rcu(block, reclaim_ramblock, rcu)
> > > > > > > 
> > > > > > > W-1-2 represents the write observed by R-1-1.
> > > > > > > 
> > > > > > > ram_list.version is read non-atomically on the update side because the
> > > > > > > update side is serialized with ram_list.mutex. The other ram_list
> > > > > > > accesses in these steps are reasoned about in two cases.
> > > > > > > 
> > > > > > > When the grace period of W-2-3 contains R-2:
> > > > > > >        qatomic_load_acquire() at R-1-1 and qatomic_store_release() at W-1-2
> > > > > > >        enforce the following ordering:
> > > > > > >            W-1-1 -> W-1-2 -> R-1-1 -> R-1-2
> > > > > > >        The value of ram_list.blocks stored by W-1-1 or a newer value that
> > > > > > >        was loaded by R-1-2 is still valid because of the grace period.
> > > > > > > 
> > > > > > > When the grace period of W-2-3 ends before R-2:
> > > > > > >        call_rcu() at W-2-3 and the read-side critical section at R-2 ensure
> > > > > > >        the following ordering:
> > > > > > >            W-2-2 -> W-2-3 -> the grace period -> R-2 -> R-2-1
> > > > > > >        The value of ram_list.version stored by W-2-2 or a newer value that
> > > > > > >        was loaded by R-2-1 differs from rs->last_version and the reader
> > > > > > >        invalidates rs->last_seen_block.
> > > > > > > 
> > > > > > > Together, these steps ensure that rs->last_seen_block is invalidated
> > > > > > > whenever necessary. With added atomic operations, pre-existing memory
> > > > > > > barriers are no longer necessary and are removed.
> > > > > > > 
> > > > > > > Any other ram_list accesses are already properly synchronized.
> > > > > > > 
> > > > > > > [1] https://lore.kernel.org/qemu-devel/878q9fbmap.fsf@draig.linaro.org/
> > > > > > > 
> > > > > > > Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
> > > > > > 
> > > > > > Above link [1] also mentioned about a hang.  Just to double check: we're
> > > > > > only trying to silent the TSAN warning (which is a false positive), right?
> > > > > 
> > > > > The hang is irrelevant and this is indeed to address the TSAN warning,
> > > > > though I'm a bit reluctant to call it false positive because, strictly
> > > > > speaking, the code actually relies on undefined behavior.
> > > > > 
> > > > > > 
> > > > > > > ---
> > > > > > >     migration/ram.c  | 10 +++++-----
> > > > > > >     system/physmem.c | 16 +++++++---------
> > > > > > >     2 files changed, 12 insertions(+), 14 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/migration/ram.c b/migration/ram.c
> > > > > > > index fc38ffbf8af1..6da24d725861 100644
> > > > > > > --- a/migration/ram.c
> > > > > > > +++ b/migration/ram.c
> > > > > > > @@ -2495,7 +2495,10 @@ static void ram_state_reset(RAMState *rs)
> > > > > > >         rs->last_seen_block = NULL;
> > > > > > >         rs->last_page = 0;
> > > > > > > -    rs->last_version = ram_list.version;
> > > > > > > +
> > > > > > > +    /* Read version before ram_list.blocks */
> > > > > > > +    rs->last_version = qatomic_load_acquire(&ram_list.version);
> > > > > > > +
> > > > > > >         rs->xbzrle_started = false;
> > > > > > >         ram_page_hint_reset(&rs->page_hint);
> > > > > > > @@ -3270,13 +3273,10 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
> > > > > > >          */
> > > > > > >         WITH_QEMU_LOCK_GUARD(&rs->bitmap_mutex) {
> > > > > > >             WITH_RCU_READ_LOCK_GUARD() {
> > > > > > > -            if (ram_list.version != rs->last_version) {
> > > > > > > +            if (qatomic_read(&ram_list.version) != rs->last_version) {
> > > > > > >                     ram_state_reset(rs);
> > > > > > >                 }
> > > > > > 
> > > > > > Not directly relevant to this patch, but I never thought into why this few
> > > > > > lines are needed at all, and then I found I don't have an answer..
> > > > > > 
> > > > > > Here, ram_list version change should only happen during ramblock
> > > > > > add/remove.  I can think of a few special cases:
> > > > > > 
> > > > > > a) device add/remove (e.g. dimms) causing new / removing ramblocks, should
> > > > > >       never happen as qdev code fails any device add/remove attempt during
> > > > > >       migration (qdev_device_add_from_qdict, qdev_unplug)
> > > > > > 
> > > > > > b) virtio-mem, which only manages one memory backend, hence one ramblock,
> > > > > >       no add/remove
> > > > > > 
> > > > > > c) virtio-gpu, will do random add/remove with memory_region_init_ram_ptr()
> > > > > >       which means ramblocks are not migratable
> > > > > > 
> > > > > > PS: I recall we have other special cases like ROM sizes can change on dest
> > > > > > QEMU, but it's about used_length, not about version boosts, so doesn't
> > > > > > matter here.
> > > > > > 
> > > > > > Both a) and b) should make sure no change on version, c) will change
> > > > > > version, but in a case where the ramblock is destined to be not migratable,
> > > > > > hence not visible to migration core.
> > > > > > 
> > > > > > The patch itself looks all fine, but I'm just wondering if we should just
> > > > > > remove this check completely or even try to somehow trap it from happening
> > > > > > (if we can rule out things like virtio-gpu): if something really changed,
> > > > > > it'll be a serious problem because we sync ramblock list already during
> > > > > > ram_save_setup.
> > > > > 
> > > > > There is another edge case: a ram_list.blocks change may be propagated to
> > > > > the thread calling ram_save_setup() and ram_save_iterate() at different
> > > > > timing. We are only ensuring that ram_list.blocks changes invalidating old
> > > > > blocks are delivered with WITH_RCU_READ_LOCK_GUARD().
> > > > > 
> > > > > So I think there is a theoretical race condition that breaks migration,
> > > > > though triggering it in practice may not be realistic.
> > > > 
> > > > Could you elaborate more on the theoretical race mentioned?
> > > > 
> > > > My comment above was trying to say there should have no such happening, the
> > > > virtio-gpu boosting the version seems to be an accident to me because
> > > > migration doesn't need that (because virtio-gpu ramblocks are not
> > > > migratable anyway).
> > > 
> > > For example, a device may be removed before migration, but the removal of
> > > the block from ram_list.blocks and the increment of ram_list.version may not
> > > be observed by the thread calling ram_save_setup(); RCU only guarantees that
> > > these writes will be visible in read-side critical sections *after* the
> > > grace period ends, which can be after the ram_save_setup() call and before
> > > following ram_save_iterate() calls.
> > 
> > Ah. I think it should be fine, since save_setup() holds BQL, done by all
> > callers when invoking qemu_savevm_state_do_setup().  I expect removal of a
> > device needs BQL too while invoking qemu_ram_free(), so serialized with it.
> > 
> > The other thing is, very early we'll set MIGRATION_STATUS_SETUP, covering
> > both save_setup() and save_iterate(), so things will start to get blocked
> > very early too.
> > 
> > OTOH, any comments on below?
> 
> Sure.
> 
> > 
> > Thanks,
> > 
> > > 
> > > Regards,
> > > Akihiko Odaki
> > > 
> > > > 
> > > > While looking at this once more, I found that I'm lost why mru_block is
> > > > race free..
> > > > 
> > > > The problem is mru_block is written by the RCU readers, then... see whether
> > > > this race can happen (even with this patch):
> > > > 
> > > >     Reader A                              Writer
> > > >     --------                              ------
> > > >     rcu_read_lock()
> > > >     walk list, find block X
> > > >                                           QLIST_REMOVE_RCU(X)
> > > >                                           qatomic_set(&mru_block, NULL)
> > > >                                           call_rcu(X, reclaim_ramblock)
> > > >     qatomic_set(&mru_block, X)  <----------- overwrites NULL
> > > >     rcu_read_unlock()
> > > >                                           grace period ends, X freed
> > > > 
> > > >     Reader C
> > > >     --------
> > > >     rcu_read_lock()
> > > >     qatomic_rcu_read(&mru_block) -> X
> > > >     Read X's block->offset ...  <----------- UAF
> 
> I agree this is problematic.
> 
> > > > 
> > > > One thing I can think of so far is this (pesudo code only, ignoring atomic
> > > > ops for now):
> > > > 
> > > > qemu_ram_free():
> > > >       qemu_mutex_lock_ramlist();
> > > >       ...
> > > >       QLIST_REMOVE_RCU(block, next);
> > > >       /*
> > > >        * 1st attempt to reset it, note: concurrent reader may set it again,
> > > >        * even to the ramblock being freed.
> > > >        */
> > > >       ram_list.mru_block = NULL;
> > > >       ram_list.version++;
> > > >       /*
> > > >        * Only to make sure whoever might be using this ramblock being
> > > >        * removed to finish using it.
> > > >        */
> > > >       call_rcu(block, rcu_noop_fn, rcu);
> > > >       drain_call_rcu();
> > > >       /*
> > > >        * 2nd attempt to clear it: after call_rcu() we make sure nobody yet
> > > >        * has access to the ramblock being removed, hence this clear makes
> > > >        * sure it won't get set again by anyone to the removed ramblock.
> > > >        */
> > > >       ram_list.mru_block = NULL;
> > > >       call_rcu(block, reclaim_ramblock, rcu);
> > > >       qemu_mutex_unlock_ramlist();
> > > > 
> > > > But it looks a bit ugly.  Thoughts?
> 
> Indeed it's not really nice. drain_call_rcu() is something we definitely
> want to avoid; it unlocks the BQL, which is too hard to prove safety.

Ah, yeah I forgot that.

> 
> The best solution I have currently in mind is to keep the mru_block inside
> an allocation managed my RCU:
> 
> struct RAMListVersion {
> 	struct rcu_head rcu;
> 	RAMBlock *mru_block;
> 	RAMBlock *reclaim_block;
> };
> 
> And qemu_ram_free() will do:
>     qemu_mutex_lock_ramlist();
>     ...
>     QLIST_REMOVE_RCU(block, next);
>     ram_list.version->reclaim_block = block;
>     call_rcu(ram_list.version, reclaim_ramlist_version, rcu);
>     ram_list.version = g_new0(RamListVersion, 1);
>     ram_list.version_id++;
>     qemu_mutex_unlock_ramlist();
> 
> qemu_get_ram_block() will do:
>     qatomic_rcu_read(&ram_list.version)->mru_block = block;
> 
> reclaim_ramlist_version will do:
>     if (version->reclaim_block) {
>          reclaim_ramblock(version->reclaim_block);
>     }
> 
>     g_free(version);

This one almost works, except I think such race can still happen:

    writer                                   reader
    ------                                   ------
  qemu_get_ram_block for block X
                                             qemu_get_ram_block
    qemu_mutex_lock_ramlist();
                                               found block X
    QLIST_REMOVE_RCU(block, next);
    ram_list.version->reclaim_block = block;
    call_rcu(ram_list.version, ...);
    ram_list.version = g_new0(RamListVersion, 1);
                                             ram_list.version->mru_block = X <-----------------
    qemu_mutex_unlock_ramlist();

The fix should be simple, IMHO, my previous solution only misses that BQL
thing, so we can do a nested rcu free with:

reclaim_ramblock_prepare():
  /*
   * 1st round rcu free guarantees the last reader that can access
   * this ramblock is gone.  Reset this field making sure it will
   * never points to the ramblock being freed.
   */
  ram_list.mru_block = NULL;
  call_rcu(block, reclaim_ramblock, rcu);

qemu_get_ram_block():
  qemu_mutex_lock_ramlist();
  ...
  QLIST_REMOVE_RCU(block, next);
  ram_list.mru_block = NULL;
  call_rcu(block, reclaim_ramblock_prepare, rcu);
  qemu_mutex_unlock_ramlist();

Thanks,

-- 
Peter Xu

Re: [PATCH] system/physmem: Synchronize ram_list accesses

Posted by Akihiko Odaki 1 day ago

On 2026/05/30 1:13, Peter Xu wrote:
> On Fri, May 29, 2026 at 05:24:54PM +0900, Akihiko Odaki wrote:
>> On 2026/05/29 1:03, Peter Xu wrote:
>>> On Fri, May 29, 2026 at 12:32:18AM +0900, Akihiko Odaki wrote:
>>>>
>>>>
>>>> On 2026/05/29 0:18, Peter Xu wrote:
>>>>> On Wed, May 27, 2026 at 04:54:42PM +0900, Akihiko Odaki wrote:
>>>>>> On 2026/05/27 6:35, Peter Xu wrote:
>>>>>>> On Sat, May 23, 2026 at 09:38:28PM +0900, Akihiko Odaki wrote:
>>>>>>>> Alex Bennée reported a ThreadSanitizer warning about a plain concurrent
>>>>>>>> access to ram_list [1]. Ensure the concurrent accesses to ram_list are
>>>>>>>> properly synchronized with atomic accesses, mutexes, or RCU.
>>>>>>>>
>>>>>>>> First, the plain assignments of ram_list.mru_block are replaced with
>>>>>>>> qatomic_set(). A comment in qemu_get_ram_block() explains why the
>>>>>>>> ordering requirement is relaxed, but it still needs to be atomically
>>>>>>>> accessed. include/qemu/atomic.h says:
>>>>>>>>
>>>>>>>>> The C11 memory model says that variables that are accessed from
>>>>>>>>> different threads should at least be done with __ATOMIC_RELAXED
>>>>>>>>> primitives or the result is undefined. Generally this has little to
>>>>>>>>> no effect on the generated code but not using the atomic primitives
>>>>>>>>> will get flagged by sanitizers as a violation.
>>>>>>>>
>>>>>>>> Second, ram_list.version accesses are replaced with atomic operations
>>>>>>>> or protected with a mutex. Unlike ram_list.mru_block, ram_list.version
>>>>>>>> has tighter ordering requirements for one of its goals: ensuring that
>>>>>>>> the reader-held rs->last_seen_block value is invalidated whenever a RAM
>>>>>>>> block is reclaimed between two RCU reader critical sections. Below are
>>>>>>>> steps a reader and an updater follow:
>>>>>>>>
>>>>>>>> Reader:
>>>>>>>>
>>>>>>>> R-1. Enter the first RCU read-side critical section:
>>>>>>>> R-1-1. rs->last_version = qatomic_load_acquire(&ram_list.version)
>>>>>>>> R-1-2. rs->last_seen_block = an element of ram_list.blocks
>>>>>>>> R-2. Enter the second RCU read-side critical section:
>>>>>>>> R-2-1. if (qatomic_read(&ram_list.version) != rs->last_version)
>>>>>>>> R-2-2.     rs->last_seen_block = NULL
>>>>>>>>
>>>>>>>> Updater:
>>>>>>>>
>>>>>>>> W-1. Enter a ram_list.mutex critical section
>>>>>>>> W-1-1. Update ram_list.blocks
>>>>>>>> W-1-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
>>>>>>>> W-2. Enter another ram_list.mutex critical section
>>>>>>>> W-2-1. QLIST_REMOVE_RCU(block, next)
>>>>>>>> W-2-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
>>>>>>>> W-2-3. call_rcu(block, reclaim_ramblock, rcu)
>>>>>>>>
>>>>>>>> W-1-2 represents the write observed by R-1-1.
>>>>>>>>
>>>>>>>> ram_list.version is read non-atomically on the update side because the
>>>>>>>> update side is serialized with ram_list.mutex. The other ram_list
>>>>>>>> accesses in these steps are reasoned about in two cases.
>>>>>>>>
>>>>>>>> When the grace period of W-2-3 contains R-2:
>>>>>>>>         qatomic_load_acquire() at R-1-1 and qatomic_store_release() at W-1-2
>>>>>>>>         enforce the following ordering:
>>>>>>>>             W-1-1 -> W-1-2 -> R-1-1 -> R-1-2
>>>>>>>>         The value of ram_list.blocks stored by W-1-1 or a newer value that
>>>>>>>>         was loaded by R-1-2 is still valid because of the grace period.
>>>>>>>>
>>>>>>>> When the grace period of W-2-3 ends before R-2:
>>>>>>>>         call_rcu() at W-2-3 and the read-side critical section at R-2 ensure
>>>>>>>>         the following ordering:
>>>>>>>>             W-2-2 -> W-2-3 -> the grace period -> R-2 -> R-2-1
>>>>>>>>         The value of ram_list.version stored by W-2-2 or a newer value that
>>>>>>>>         was loaded by R-2-1 differs from rs->last_version and the reader
>>>>>>>>         invalidates rs->last_seen_block.
>>>>>>>>
>>>>>>>> Together, these steps ensure that rs->last_seen_block is invalidated
>>>>>>>> whenever necessary. With added atomic operations, pre-existing memory
>>>>>>>> barriers are no longer necessary and are removed.
>>>>>>>>
>>>>>>>> Any other ram_list accesses are already properly synchronized.
>>>>>>>>
>>>>>>>> [1] https://lore.kernel.org/qemu-devel/878q9fbmap.fsf@draig.linaro.org/
>>>>>>>>
>>>>>>>> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
>>>>>>>
>>>>>>> Above link [1] also mentioned about a hang.  Just to double check: we're
>>>>>>> only trying to silent the TSAN warning (which is a false positive), right?
>>>>>>
>>>>>> The hang is irrelevant and this is indeed to address the TSAN warning,
>>>>>> though I'm a bit reluctant to call it false positive because, strictly
>>>>>> speaking, the code actually relies on undefined behavior.
>>>>>>
>>>>>>>
>>>>>>>> ---
>>>>>>>>      migration/ram.c  | 10 +++++-----
>>>>>>>>      system/physmem.c | 16 +++++++---------
>>>>>>>>      2 files changed, 12 insertions(+), 14 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>>>>>> index fc38ffbf8af1..6da24d725861 100644
>>>>>>>> --- a/migration/ram.c
>>>>>>>> +++ b/migration/ram.c
>>>>>>>> @@ -2495,7 +2495,10 @@ static void ram_state_reset(RAMState *rs)
>>>>>>>>          rs->last_seen_block = NULL;
>>>>>>>>          rs->last_page = 0;
>>>>>>>> -    rs->last_version = ram_list.version;
>>>>>>>> +
>>>>>>>> +    /* Read version before ram_list.blocks */
>>>>>>>> +    rs->last_version = qatomic_load_acquire(&ram_list.version);
>>>>>>>> +
>>>>>>>>          rs->xbzrle_started = false;
>>>>>>>>          ram_page_hint_reset(&rs->page_hint);
>>>>>>>> @@ -3270,13 +3273,10 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>>>>>>>>           */
>>>>>>>>          WITH_QEMU_LOCK_GUARD(&rs->bitmap_mutex) {
>>>>>>>>              WITH_RCU_READ_LOCK_GUARD() {
>>>>>>>> -            if (ram_list.version != rs->last_version) {
>>>>>>>> +            if (qatomic_read(&ram_list.version) != rs->last_version) {
>>>>>>>>                      ram_state_reset(rs);
>>>>>>>>                  }
>>>>>>>
>>>>>>> Not directly relevant to this patch, but I never thought into why this few
>>>>>>> lines are needed at all, and then I found I don't have an answer..
>>>>>>>
>>>>>>> Here, ram_list version change should only happen during ramblock
>>>>>>> add/remove.  I can think of a few special cases:
>>>>>>>
>>>>>>> a) device add/remove (e.g. dimms) causing new / removing ramblocks, should
>>>>>>>        never happen as qdev code fails any device add/remove attempt during
>>>>>>>        migration (qdev_device_add_from_qdict, qdev_unplug)
>>>>>>>
>>>>>>> b) virtio-mem, which only manages one memory backend, hence one ramblock,
>>>>>>>        no add/remove
>>>>>>>
>>>>>>> c) virtio-gpu, will do random add/remove with memory_region_init_ram_ptr()
>>>>>>>        which means ramblocks are not migratable
>>>>>>>
>>>>>>> PS: I recall we have other special cases like ROM sizes can change on dest
>>>>>>> QEMU, but it's about used_length, not about version boosts, so doesn't
>>>>>>> matter here.
>>>>>>>
>>>>>>> Both a) and b) should make sure no change on version, c) will change
>>>>>>> version, but in a case where the ramblock is destined to be not migratable,
>>>>>>> hence not visible to migration core.
>>>>>>>
>>>>>>> The patch itself looks all fine, but I'm just wondering if we should just
>>>>>>> remove this check completely or even try to somehow trap it from happening
>>>>>>> (if we can rule out things like virtio-gpu): if something really changed,
>>>>>>> it'll be a serious problem because we sync ramblock list already during
>>>>>>> ram_save_setup.
>>>>>>
>>>>>> There is another edge case: a ram_list.blocks change may be propagated to
>>>>>> the thread calling ram_save_setup() and ram_save_iterate() at different
>>>>>> timing. We are only ensuring that ram_list.blocks changes invalidating old
>>>>>> blocks are delivered with WITH_RCU_READ_LOCK_GUARD().
>>>>>>
>>>>>> So I think there is a theoretical race condition that breaks migration,
>>>>>> though triggering it in practice may not be realistic.
>>>>>
>>>>> Could you elaborate more on the theoretical race mentioned?
>>>>>
>>>>> My comment above was trying to say there should have no such happening, the
>>>>> virtio-gpu boosting the version seems to be an accident to me because
>>>>> migration doesn't need that (because virtio-gpu ramblocks are not
>>>>> migratable anyway).
>>>>
>>>> For example, a device may be removed before migration, but the removal of
>>>> the block from ram_list.blocks and the increment of ram_list.version may not
>>>> be observed by the thread calling ram_save_setup(); RCU only guarantees that
>>>> these writes will be visible in read-side critical sections *after* the
>>>> grace period ends, which can be after the ram_save_setup() call and before
>>>> following ram_save_iterate() calls.
>>>
>>> Ah. I think it should be fine, since save_setup() holds BQL, done by all
>>> callers when invoking qemu_savevm_state_do_setup().  I expect removal of a
>>> device needs BQL too while invoking qemu_ram_free(), so serialized with it.
>>>
>>> The other thing is, very early we'll set MIGRATION_STATUS_SETUP, covering
>>> both save_setup() and save_iterate(), so things will start to get blocked
>>> very early too.
>>>
>>> OTOH, any comments on below?
>>
>> Sure.
>>
>>>
>>> Thanks,
>>>
>>>>
>>>> Regards,
>>>> Akihiko Odaki
>>>>
>>>>>
>>>>> While looking at this once more, I found that I'm lost why mru_block is
>>>>> race free..
>>>>>
>>>>> The problem is mru_block is written by the RCU readers, then... see whether
>>>>> this race can happen (even with this patch):
>>>>>
>>>>>      Reader A                              Writer
>>>>>      --------                              ------
>>>>>      rcu_read_lock()
>>>>>      walk list, find block X
>>>>>                                            QLIST_REMOVE_RCU(X)
>>>>>                                            qatomic_set(&mru_block, NULL)
>>>>>                                            call_rcu(X, reclaim_ramblock)
>>>>>      qatomic_set(&mru_block, X)  <----------- overwrites NULL
>>>>>      rcu_read_unlock()
>>>>>                                            grace period ends, X freed
>>>>>
>>>>>      Reader C
>>>>>      --------
>>>>>      rcu_read_lock()
>>>>>      qatomic_rcu_read(&mru_block) -> X
>>>>>      Read X's block->offset ...  <----------- UAF
>>
>> I agree this is problematic.
>>
>>>>>
>>>>> One thing I can think of so far is this (pesudo code only, ignoring atomic
>>>>> ops for now):
>>>>>
>>>>> qemu_ram_free():
>>>>>        qemu_mutex_lock_ramlist();
>>>>>        ...
>>>>>        QLIST_REMOVE_RCU(block, next);
>>>>>        /*
>>>>>         * 1st attempt to reset it, note: concurrent reader may set it again,
>>>>>         * even to the ramblock being freed.
>>>>>         */
>>>>>        ram_list.mru_block = NULL;
>>>>>        ram_list.version++;
>>>>>        /*
>>>>>         * Only to make sure whoever might be using this ramblock being
>>>>>         * removed to finish using it.
>>>>>         */
>>>>>        call_rcu(block, rcu_noop_fn, rcu);
>>>>>        drain_call_rcu();
>>>>>        /*
>>>>>         * 2nd attempt to clear it: after call_rcu() we make sure nobody yet
>>>>>         * has access to the ramblock being removed, hence this clear makes
>>>>>         * sure it won't get set again by anyone to the removed ramblock.
>>>>>         */
>>>>>        ram_list.mru_block = NULL;
>>>>>        call_rcu(block, reclaim_ramblock, rcu);
>>>>>        qemu_mutex_unlock_ramlist();
>>>>>
>>>>> But it looks a bit ugly.  Thoughts?
>>
>> Indeed it's not really nice. drain_call_rcu() is something we definitely
>> want to avoid; it unlocks the BQL, which is too hard to prove safety.
> 
> Ah, yeah I forgot that.
> 
>>
>> The best solution I have currently in mind is to keep the mru_block inside
>> an allocation managed my RCU:
>>
>> struct RAMListVersion {
>> 	struct rcu_head rcu;
>> 	RAMBlock *mru_block;
>> 	RAMBlock *reclaim_block;
>> };
>>
>> And qemu_ram_free() will do:
>>      qemu_mutex_lock_ramlist();
>>      ...
>>      QLIST_REMOVE_RCU(block, next);
>>      ram_list.version->reclaim_block = block;
>>      call_rcu(ram_list.version, reclaim_ramlist_version, rcu);
>>      ram_list.version = g_new0(RamListVersion, 1);
>>      ram_list.version_id++;
>>      qemu_mutex_unlock_ramlist();
>>
>> qemu_get_ram_block() will do:
>>      qatomic_rcu_read(&ram_list.version)->mru_block = block;
>>
>> reclaim_ramlist_version will do:
>>      if (version->reclaim_block) {
>>           reclaim_ramblock(version->reclaim_block);
>>      }
>>
>>      g_free(version);
> 
> This one almost works, except I think such race can still happen:
> 
>      writer                                   reader
>      ------                                   ------
>    qemu_get_ram_block for block X
>                                               qemu_get_ram_block
>      qemu_mutex_lock_ramlist();
>                                                 found block X
>      QLIST_REMOVE_RCU(block, next);
>      ram_list.version->reclaim_block = block;
>      call_rcu(ram_list.version, ...);
>      ram_list.version = g_new0(RamListVersion, 1);
>                                               ram_list.version->mru_block = X <-----------------
>      qemu_mutex_unlock_ramlist();

You are right.

> 
> The fix should be simple, IMHO, my previous solution only misses that BQL
> thing, so we can do a nested rcu free with:
> 
> reclaim_ramblock_prepare():
>    /*
>     * 1st round rcu free guarantees the last reader that can access
>     * this ramblock is gone.  Reset this field making sure it will
>     * never points to the ramblock being freed.
>     */
>    ram_list.mru_block = NULL;
>    call_rcu(block, reclaim_ramblock, rcu);
> 
> qemu_get_ram_block():
>    qemu_mutex_lock_ramlist();
>    ...
>    QLIST_REMOVE_RCU(block, next);
>    ram_list.mru_block = NULL;
>    call_rcu(block, reclaim_ramblock_prepare, rcu);
>    qemu_mutex_unlock_ramlist();

Now I get it and this looks nice to me. Having a nested call_rcu() is 
not that ugly if the underlying reasoning is described in a comment. I 
would describe it something like following in reclaim_ramblock_prepare():

Now the last reader that retrieves the pointer to the block via 
ram_list.blocks is gone. The reader may have assigned the block to 
ram_list.mru_block, so reset the field making sure there is no remaining 
pointer to the block.

Regards,
Akihiko Odaki

Re: [PATCH] system/physmem: Synchronize ram_list accesses

Posted by Philippe Mathieu-Daudé 4 days, 7 hours ago

On 23/5/26 14:38, Akihiko Odaki wrote:
> Alex Bennée reported a ThreadSanitizer warning about a plain concurrent
> access to ram_list [1]. Ensure the concurrent accesses to ram_list are
> properly synchronized with atomic accesses, mutexes, or RCU.
> 
> First, the plain assignments of ram_list.mru_block are replaced with
> qatomic_set(). A comment in qemu_get_ram_block() explains why the
> ordering requirement is relaxed, but it still needs to be atomically
> accessed. include/qemu/atomic.h says:
> 
>> The C11 memory model says that variables that are accessed from
>> different threads should at least be done with __ATOMIC_RELAXED
>> primitives or the result is undefined. Generally this has little to
>> no effect on the generated code but not using the atomic primitives
>> will get flagged by sanitizers as a violation.
> 
> Second, ram_list.version accesses are replaced with atomic operations
> or protected with a mutex. Unlike ram_list.mru_block, ram_list.version
> has tighter ordering requirements for one of its goals: ensuring that
> the reader-held rs->last_seen_block value is invalidated whenever a RAM
> block is reclaimed between two RCU reader critical sections. Below are
> steps a reader and an updater follow:
> 
> Reader:
> 
> R-1. Enter the first RCU read-side critical section:
> R-1-1. rs->last_version = qatomic_load_acquire(&ram_list.version)
> R-1-2. rs->last_seen_block = an element of ram_list.blocks
> R-2. Enter the second RCU read-side critical section:
> R-2-1. if (qatomic_read(&ram_list.version) != rs->last_version)
> R-2-2.     rs->last_seen_block = NULL
> 
> Updater:
> 
> W-1. Enter a ram_list.mutex critical section
> W-1-1. Update ram_list.blocks
> W-1-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
> W-2. Enter another ram_list.mutex critical section
> W-2-1. QLIST_REMOVE_RCU(block, next)
> W-2-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
> W-2-3. call_rcu(block, reclaim_ramblock, rcu)
> 
> W-1-2 represents the write observed by R-1-1.
> 
> ram_list.version is read non-atomically on the update side because the
> update side is serialized with ram_list.mutex. The other ram_list
> accesses in these steps are reasoned about in two cases.
> 
> When the grace period of W-2-3 contains R-2:
>      qatomic_load_acquire() at R-1-1 and qatomic_store_release() at W-1-2
>      enforce the following ordering:
>          W-1-1 -> W-1-2 -> R-1-1 -> R-1-2
>      The value of ram_list.blocks stored by W-1-1 or a newer value that
>      was loaded by R-1-2 is still valid because of the grace period.
> 
> When the grace period of W-2-3 ends before R-2:
>      call_rcu() at W-2-3 and the read-side critical section at R-2 ensure
>      the following ordering:
>          W-2-2 -> W-2-3 -> the grace period -> R-2 -> R-2-1
>      The value of ram_list.version stored by W-2-2 or a newer value that
>      was loaded by R-2-1 differs from rs->last_version and the reader
>      invalidates rs->last_seen_block.
> 
> Together, these steps ensure that rs->last_seen_block is invalidated
> whenever necessary. With added atomic operations, pre-existing memory
> barriers are no longer necessary and are removed.
> 
> Any other ram_list accesses are already properly synchronized.
> 
> [1] https://lore.kernel.org/qemu-devel/878q9fbmap.fsf@draig.linaro.org/
> 
> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
> ---
>   migration/ram.c  | 10 +++++-----
>   system/physmem.c | 16 +++++++---------
>   2 files changed, 12 insertions(+), 14 deletions(-)


> @@ -2260,11 +2260,10 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>       } else { /* list is empty */
>           QLIST_INSERT_HEAD_RCU(&ram_list.blocks, new_block, next);
>       }
> -    ram_list.mru_block = NULL;
> +    qatomic_set(&ram_list.mru_block, NULL);
>   
>       /* Write list before version */
> -    smp_wmb();
> -    ram_list.version++;
> +    qatomic_store_release(&ram_list.version, ram_list.version + 1);

Should this be:

        qatomic_store_release(&ram_list.version,
                              qatomic_read(&ram_list.version) + 1);

>       qemu_mutex_unlock_ramlist();
>   
>       physical_memory_set_dirty_range(new_block->offset,
> @@ -2608,10 +2607,9 @@ void qemu_ram_free(RAMBlock *block)
>       name = cpr_name(block->mr);
>       cpr_delete_fd(name, 0);
>       QLIST_REMOVE_RCU(block, next);
> -    ram_list.mru_block = NULL;
> +    qatomic_set(&ram_list.mru_block, NULL);
>       /* Write list before version */
> -    smp_wmb();
> -    ram_list.version++;
> +    qatomic_store_release(&ram_list.version, ram_list.version + 1);

ditto?

>       call_rcu(block, reclaim_ramblock, rcu);
>       qemu_mutex_unlock_ramlist();
>   }
> 
> ---
> base-commit: e89049b3ba5f1f0468bc0d294173345597514a1b
> change-id: 20260520-tsan-c6f1aa909a90
> 
> Best regards,
> --
> Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
>

Re: [PATCH] system/physmem: Synchronize ram_list accesses

Posted by Akihiko Odaki 4 days, 4 hours ago


On 2026/05/26 18:22, Philippe Mathieu-Daudé wrote:
> On 23/5/26 14:38, Akihiko Odaki wrote:
>> Alex Bennée reported a ThreadSanitizer warning about a plain concurrent
>> access to ram_list [1]. Ensure the concurrent accesses to ram_list are
>> properly synchronized with atomic accesses, mutexes, or RCU.
>>
>> First, the plain assignments of ram_list.mru_block are replaced with
>> qatomic_set(). A comment in qemu_get_ram_block() explains why the
>> ordering requirement is relaxed, but it still needs to be atomically
>> accessed. include/qemu/atomic.h says:
>>
>>> The C11 memory model says that variables that are accessed from
>>> different threads should at least be done with __ATOMIC_RELAXED
>>> primitives or the result is undefined. Generally this has little to
>>> no effect on the generated code but not using the atomic primitives
>>> will get flagged by sanitizers as a violation.
>>
>> Second, ram_list.version accesses are replaced with atomic operations
>> or protected with a mutex. Unlike ram_list.mru_block, ram_list.version
>> has tighter ordering requirements for one of its goals: ensuring that
>> the reader-held rs->last_seen_block value is invalidated whenever a RAM
>> block is reclaimed between two RCU reader critical sections. Below are
>> steps a reader and an updater follow:
>>
>> Reader:
>>
>> R-1. Enter the first RCU read-side critical section:
>> R-1-1. rs->last_version = qatomic_load_acquire(&ram_list.version)
>> R-1-2. rs->last_seen_block = an element of ram_list.blocks
>> R-2. Enter the second RCU read-side critical section:
>> R-2-1. if (qatomic_read(&ram_list.version) != rs->last_version)
>> R-2-2.     rs->last_seen_block = NULL
>>
>> Updater:
>>
>> W-1. Enter a ram_list.mutex critical section
>> W-1-1. Update ram_list.blocks
>> W-1-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
>> W-2. Enter another ram_list.mutex critical section
>> W-2-1. QLIST_REMOVE_RCU(block, next)
>> W-2-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
>> W-2-3. call_rcu(block, reclaim_ramblock, rcu)
>>
>> W-1-2 represents the write observed by R-1-1.
>>
>> ram_list.version is read non-atomically on the update side because the
>> update side is serialized with ram_list.mutex. The other ram_list
>> accesses in these steps are reasoned about in two cases.
>>
>> When the grace period of W-2-3 contains R-2:
>>      qatomic_load_acquire() at R-1-1 and qatomic_store_release() at W-1-2
>>      enforce the following ordering:
>>          W-1-1 -> W-1-2 -> R-1-1 -> R-1-2
>>      The value of ram_list.blocks stored by W-1-1 or a newer value that
>>      was loaded by R-1-2 is still valid because of the grace period.
>>
>> When the grace period of W-2-3 ends before R-2:
>>      call_rcu() at W-2-3 and the read-side critical section at R-2 ensure
>>      the following ordering:
>>          W-2-2 -> W-2-3 -> the grace period -> R-2 -> R-2-1
>>      The value of ram_list.version stored by W-2-2 or a newer value that
>>      was loaded by R-2-1 differs from rs->last_version and the reader
>>      invalidates rs->last_seen_block.
>>
>> Together, these steps ensure that rs->last_seen_block is invalidated
>> whenever necessary. With added atomic operations, pre-existing memory
>> barriers are no longer necessary and are removed.
>>
>> Any other ram_list accesses are already properly synchronized.
>>
>> [1] https://lore.kernel.org/qemu-devel/878q9fbmap.fsf@draig.linaro.org/
>>
>> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
>> ---
>>   migration/ram.c  | 10 +++++-----
>>   system/physmem.c | 16 +++++++---------
>>   2 files changed, 12 insertions(+), 14 deletions(-)
> 
> 
>> @@ -2260,11 +2260,10 @@ static void ram_block_add(RAMBlock *new_block, 
>> Error **errp)
>>       } else { /* list is empty */
>>           QLIST_INSERT_HEAD_RCU(&ram_list.blocks, new_block, next);
>>       }
>> -    ram_list.mru_block = NULL;
>> +    qatomic_set(&ram_list.mru_block, NULL);
>>       /* Write list before version */
>> -    smp_wmb();
>> -    ram_list.version++;
>> +    qatomic_store_release(&ram_list.version, ram_list.version + 1);
> 
> Should this be:
> 
>         qatomic_store_release(&ram_list.version,
>                               qatomic_read(&ram_list.version) + 1);
> 
>>       qemu_mutex_unlock_ramlist();
>>       physical_memory_set_dirty_range(new_block->offset,
>> @@ -2608,10 +2607,9 @@ void qemu_ram_free(RAMBlock *block)
>>       name = cpr_name(block->mr);
>>       cpr_delete_fd(name, 0);
>>       QLIST_REMOVE_RCU(block, next);
>> -    ram_list.mru_block = NULL;
>> +    qatomic_set(&ram_list.mru_block, NULL);
>>       /* Write list before version */
>> -    smp_wmb();
>> -    ram_list.version++;
>> +    qatomic_store_release(&ram_list.version, ram_list.version + 1);
> 
> ditto?

ram_list.version is read non-atomically on the update side because the 
update side is serialized with ram_list.mutex.

> 
>>       call_rcu(block, reclaim_ramblock, rcu);
>>       qemu_mutex_unlock_ramlist();
>>   }
>>
>> ---
>> base-commit: e89049b3ba5f1f0468bc0d294173345597514a1b
>> change-id: 20260520-tsan-c6f1aa909a90
>>
>> Best regards,
>> -- 
>> Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
>>
>

Re: [PATCH] system/physmem: Synchronize ram_list accesses

Posted by Philippe Mathieu-Daudé 4 days, 4 hours ago

On 26/5/26 13:58, Akihiko Odaki wrote:
> 
> 
> On 2026/05/26 18:22, Philippe Mathieu-Daudé wrote:
>> On 23/5/26 14:38, Akihiko Odaki wrote:
>>> Alex Bennée reported a ThreadSanitizer warning about a plain concurrent
>>> access to ram_list [1]. Ensure the concurrent accesses to ram_list are
>>> properly synchronized with atomic accesses, mutexes, or RCU.
>>>
>>> First, the plain assignments of ram_list.mru_block are replaced with
>>> qatomic_set(). A comment in qemu_get_ram_block() explains why the
>>> ordering requirement is relaxed, but it still needs to be atomically
>>> accessed. include/qemu/atomic.h says:
>>>
>>>> The C11 memory model says that variables that are accessed from
>>>> different threads should at least be done with __ATOMIC_RELAXED
>>>> primitives or the result is undefined. Generally this has little to
>>>> no effect on the generated code but not using the atomic primitives
>>>> will get flagged by sanitizers as a violation.
>>>
>>> Second, ram_list.version accesses are replaced with atomic operations
>>> or protected with a mutex. Unlike ram_list.mru_block, ram_list.version
>>> has tighter ordering requirements for one of its goals: ensuring that
>>> the reader-held rs->last_seen_block value is invalidated whenever a RAM
>>> block is reclaimed between two RCU reader critical sections. Below are
>>> steps a reader and an updater follow:
>>>
>>> Reader:
>>>
>>> R-1. Enter the first RCU read-side critical section:
>>> R-1-1. rs->last_version = qatomic_load_acquire(&ram_list.version)
>>> R-1-2. rs->last_seen_block = an element of ram_list.blocks
>>> R-2. Enter the second RCU read-side critical section:
>>> R-2-1. if (qatomic_read(&ram_list.version) != rs->last_version)
>>> R-2-2.     rs->last_seen_block = NULL
>>>
>>> Updater:
>>>
>>> W-1. Enter a ram_list.mutex critical section
>>> W-1-1. Update ram_list.blocks
>>> W-1-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
>>> W-2. Enter another ram_list.mutex critical section
>>> W-2-1. QLIST_REMOVE_RCU(block, next)
>>> W-2-2. qatomic_store_release(&ram_list.version, ram_list.version + 1)
>>> W-2-3. call_rcu(block, reclaim_ramblock, rcu)
>>>
>>> W-1-2 represents the write observed by R-1-1.
>>>
>>> ram_list.version is read non-atomically on the update side because the
>>> update side is serialized with ram_list.mutex. The other ram_list
>>> accesses in these steps are reasoned about in two cases.
>>>
>>> When the grace period of W-2-3 contains R-2:
>>>      qatomic_load_acquire() at R-1-1 and qatomic_store_release() at 
>>> W-1-2
>>>      enforce the following ordering:
>>>          W-1-1 -> W-1-2 -> R-1-1 -> R-1-2
>>>      The value of ram_list.blocks stored by W-1-1 or a newer value that
>>>      was loaded by R-1-2 is still valid because of the grace period.
>>>
>>> When the grace period of W-2-3 ends before R-2:
>>>      call_rcu() at W-2-3 and the read-side critical section at R-2 
>>> ensure
>>>      the following ordering:
>>>          W-2-2 -> W-2-3 -> the grace period -> R-2 -> R-2-1
>>>      The value of ram_list.version stored by W-2-2 or a newer value that
>>>      was loaded by R-2-1 differs from rs->last_version and the reader
>>>      invalidates rs->last_seen_block.
>>>
>>> Together, these steps ensure that rs->last_seen_block is invalidated
>>> whenever necessary. With added atomic operations, pre-existing memory
>>> barriers are no longer necessary and are removed.
>>>
>>> Any other ram_list accesses are already properly synchronized.
>>>
>>> [1] https://lore.kernel.org/qemu-devel/878q9fbmap.fsf@draig.linaro.org/
>>>
>>> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
>>> ---
>>>   migration/ram.c  | 10 +++++-----
>>>   system/physmem.c | 16 +++++++---------
>>>   2 files changed, 12 insertions(+), 14 deletions(-)
>>
>>
>>> @@ -2260,11 +2260,10 @@ static void ram_block_add(RAMBlock 
>>> *new_block, Error **errp)
>>>       } else { /* list is empty */
>>>           QLIST_INSERT_HEAD_RCU(&ram_list.blocks, new_block, next);
>>>       }
>>> -    ram_list.mru_block = NULL;
>>> +    qatomic_set(&ram_list.mru_block, NULL);
>>>       /* Write list before version */
>>> -    smp_wmb();
>>> -    ram_list.version++;
>>> +    qatomic_store_release(&ram_list.version, ram_list.version + 1);
>>
>> Should this be:
>>
>>         qatomic_store_release(&ram_list.version,
>>                               qatomic_read(&ram_list.version) + 1);
>>
>>>       qemu_mutex_unlock_ramlist();
>>>       physical_memory_set_dirty_range(new_block->offset,
>>> @@ -2608,10 +2607,9 @@ void qemu_ram_free(RAMBlock *block)
>>>       name = cpr_name(block->mr);
>>>       cpr_delete_fd(name, 0);
>>>       QLIST_REMOVE_RCU(block, next);
>>> -    ram_list.mru_block = NULL;
>>> +    qatomic_set(&ram_list.mru_block, NULL);
>>>       /* Write list before version */
>>> -    smp_wmb();
>>> -    ram_list.version++;
>>> +    qatomic_store_release(&ram_list.version, ram_list.version + 1);
>>
>> ditto?
> 
> ram_list.version is read non-atomically on the update side because the 
> update side is serialized with ram_list.mutex.

Great.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>

> 
>>
>>>       call_rcu(block, reclaim_ramblock, rcu);
>>>       qemu_mutex_unlock_ramlist();
>>>   }
>>>
>>> ---
>>> base-commit: e89049b3ba5f1f0468bc0d294173345597514a1b
>>> change-id: 20260520-tsan-c6f1aa909a90
>>>
>>> Best regards,
>>> -- 
>>> Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
>>>
>>
>