[RFC v1 2/2] migration: merge fragmented clear_dirty ioctls

Chuang Xu posted 2 patches 6 days, 12 hours ago
Maintainers: Richard Henderson <richard.henderson@linaro.org>, Paolo Bonzini <pbonzini@redhat.com>, "Michael S. Tsirkin" <mst@redhat.com>, Stefano Garzarella <sgarzare@redhat.com>, Peter Xu <peterx@redhat.com>, David Hildenbrand <david@kernel.org>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, Fabiano Rosas <farosas@suse.de>
[RFC v1 2/2] migration: merge fragmented clear_dirty ioctls
Posted by Chuang Xu 6 days, 12 hours ago
From: xuchuangxclwt <xuchuangxclwt@bytedance.com>

When the addresses processed are not aligned, a large number of
clear_dirty ioctl occur (e.g. a 16MB misaligned memory can generate
4096 clear_dirty ioctls), which increases the time required for
bitmap_sync and makes it more difficult for dirty pages to converge.

Attempt to merge those fragmented clear_dirty ioctls.

Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com>
---
 accel/tcg/cputlb.c       |  5 +++--
 include/system/physmem.h |  3 ++-
 migration/ram.c          | 17 ++++++++++++++++-
 system/memory.c          |  2 +-
 system/physmem.c         | 19 +++++++++++--------
 5 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index fd1606c856..8a054cb545 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -857,8 +857,9 @@ void tlb_flush_page_bits_by_mmuidx_all_cpus_synced(CPUState *src_cpu,
 void tlb_protect_code(ram_addr_t ram_addr)
 {
     physical_memory_test_and_clear_dirty(ram_addr & TARGET_PAGE_MASK,
-                                             TARGET_PAGE_SIZE,
-                                             DIRTY_MEMORY_CODE);
+                                         TARGET_PAGE_SIZE,
+                                         DIRTY_MEMORY_CODE,
+                                         true);
 }
 
 /* update the TLB so that writes in physical page 'phys_addr' are no longer
diff --git a/include/system/physmem.h b/include/system/physmem.h
index 879f6eae38..8529f0510a 100644
--- a/include/system/physmem.h
+++ b/include/system/physmem.h
@@ -41,7 +41,8 @@ void physical_memory_dirty_bits_cleared(ram_addr_t start, ram_addr_t length);
 
 bool physical_memory_test_and_clear_dirty(ram_addr_t start,
                                           ram_addr_t length,
-                                          unsigned client);
+                                          unsigned client,
+                                          bool clear_dirty);
 
 DirtyBitmapSnapshot *
 physical_memory_snapshot_and_clear_dirty(MemoryRegion *mr, hwaddr offset,
diff --git a/migration/ram.c b/migration/ram.c
index 29f016cb25..329168d081 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -995,18 +995,33 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
         }
     } else {
         ram_addr_t offset = rb->offset;
+        unsigned long end, start_page;
+        uint64_t mr_offset, mr_size;
 
         for (addr = 0; addr < length; addr += TARGET_PAGE_SIZE) {
+            /*
+             * Here we don't expect so many clear_dirty, so we call
+             * physical_memory_test_and_clear_dirty with clear_dirty
+             * set to false. Later we'll do make an overall clear_dirty
+             * outside the loop.
+             */
             if (physical_memory_test_and_clear_dirty(
                         start + addr + offset,
                         TARGET_PAGE_SIZE,
-                        DIRTY_MEMORY_MIGRATION)) {
+                        DIRTY_MEMORY_MIGRATION,
+                        false)) {
                 long k = (start + addr) >> TARGET_PAGE_BITS;
                 if (!test_and_set_bit(k, dest)) {
                     num_dirty++;
                 }
             }
         }
+        end = TARGET_PAGE_ALIGN(start + addr + offset + TARGET_PAGE_SIZE)
+              >> TARGET_PAGE_BITS;
+        start_page = (start + offset) >> TARGET_PAGE_BITS;
+        mr_offset = (ram_addr_t)(start_page << TARGET_PAGE_BITS) - offset;
+        mr_size = (end - start_page) << TARGET_PAGE_BITS;
+        memory_region_clear_dirty_bitmap(rb->mr, mr_offset, mr_size);
     }
 
     return num_dirty;
diff --git a/system/memory.c b/system/memory.c
index 8b84661ae3..7b81b84f30 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2424,7 +2424,7 @@ void memory_region_reset_dirty(MemoryRegion *mr, hwaddr addr,
 {
     assert(mr->ram_block);
     physical_memory_test_and_clear_dirty(
-        memory_region_get_ram_addr(mr) + addr, size, client);
+        memory_region_get_ram_addr(mr) + addr, size, client, true);
 }
 
 int memory_region_get_fd(MemoryRegion *mr)
diff --git a/system/physmem.c b/system/physmem.c
index c9869e4049..21b2db3884 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1091,8 +1091,9 @@ void physical_memory_set_dirty_range(ram_addr_t start, ram_addr_t length,
 
 /* Note: start and end must be within the same ram block.  */
 bool physical_memory_test_and_clear_dirty(ram_addr_t start,
-                                              ram_addr_t length,
-                                              unsigned client)
+                                          ram_addr_t length,
+                                          unsigned client,
+                                          bool clear_dirty)
 {
     DirtyMemoryBlocks *blocks;
     unsigned long end, page, start_page;
@@ -1126,9 +1127,11 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
             page += num;
         }
 
-        mr_offset = (ram_addr_t)(start_page << TARGET_PAGE_BITS) - ramblock->offset;
-        mr_size = (end - start_page) << TARGET_PAGE_BITS;
-        memory_region_clear_dirty_bitmap(ramblock->mr, mr_offset, mr_size);
+        if (clear_dirty) {
+            mr_offset = (ram_addr_t)(start_page << TARGET_PAGE_BITS) - ramblock->offset;
+            mr_size = (end - start_page) << TARGET_PAGE_BITS;
+            memory_region_clear_dirty_bitmap(ramblock->mr, mr_offset, mr_size);
+        }
     }
 
     if (dirty) {
@@ -1140,9 +1143,9 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
 
 static void physical_memory_clear_dirty_range(ram_addr_t addr, ram_addr_t length)
 {
-    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION);
-    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA);
-    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE);
+    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION, true);
+    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA, true);
+    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE, true);
 }
 
 DirtyBitmapSnapshot *physical_memory_snapshot_and_clear_dirty
-- 
2.20.1
Re: [RFC v1 2/2] migration: merge fragmented clear_dirty ioctls
Posted by Peter Xu 5 days, 3 hours ago
On Mon, Dec 08, 2025 at 08:09:52PM +0800, Chuang Xu wrote:
> From: xuchuangxclwt <xuchuangxclwt@bytedance.com>
> 
> When the addresses processed are not aligned, a large number of
> clear_dirty ioctl occur (e.g. a 16MB misaligned memory can generate
> 4096 clear_dirty ioctls), which increases the time required for
> bitmap_sync and makes it more difficult for dirty pages to converge.
> 
> Attempt to merge those fragmented clear_dirty ioctls.

(besides separate perf results I requested as in the cover letter reply..)

Could you add something into the commit log explaining at least one example
that you observe?  E.g. what is the VM setup, how many ramblocks are the
ones not aligned?

Have you considered setting rb->clear_bmap when it's available?  It'll
postpone the remote clear even further until page sent.  Logically it
should be more efficient, but it may depend on the size of unaligned
ramblocks that you're hitting indeed, as clear_bmap isn't PAGE_SIZE based
but it can be much bigger.  Some discussion around this would be nice on
how you chose the current solution.

> 
> Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com>
> ---
>  accel/tcg/cputlb.c       |  5 +++--
>  include/system/physmem.h |  3 ++-
>  migration/ram.c          | 17 ++++++++++++++++-
>  system/memory.c          |  2 +-
>  system/physmem.c         | 19 +++++++++++--------
>  5 files changed, 33 insertions(+), 13 deletions(-)
> 
> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> index fd1606c856..8a054cb545 100644
> --- a/accel/tcg/cputlb.c
> +++ b/accel/tcg/cputlb.c
> @@ -857,8 +857,9 @@ void tlb_flush_page_bits_by_mmuidx_all_cpus_synced(CPUState *src_cpu,
>  void tlb_protect_code(ram_addr_t ram_addr)
>  {
>      physical_memory_test_and_clear_dirty(ram_addr & TARGET_PAGE_MASK,
> -                                             TARGET_PAGE_SIZE,
> -                                             DIRTY_MEMORY_CODE);
> +                                         TARGET_PAGE_SIZE,
> +                                         DIRTY_MEMORY_CODE,
> +                                         true);
>  }
>  
>  /* update the TLB so that writes in physical page 'phys_addr' are no longer
> diff --git a/include/system/physmem.h b/include/system/physmem.h
> index 879f6eae38..8529f0510a 100644
> --- a/include/system/physmem.h
> +++ b/include/system/physmem.h
> @@ -41,7 +41,8 @@ void physical_memory_dirty_bits_cleared(ram_addr_t start, ram_addr_t length);
>  
>  bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>                                            ram_addr_t length,
> -                                          unsigned client);
> +                                          unsigned client,
> +                                          bool clear_dirty);
>  
>  DirtyBitmapSnapshot *
>  physical_memory_snapshot_and_clear_dirty(MemoryRegion *mr, hwaddr offset,
> diff --git a/migration/ram.c b/migration/ram.c
> index 29f016cb25..329168d081 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -995,18 +995,33 @@ static uint64_t physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>          }
>      } else {
>          ram_addr_t offset = rb->offset;
> +        unsigned long end, start_page;
> +        uint64_t mr_offset, mr_size;
>  
>          for (addr = 0; addr < length; addr += TARGET_PAGE_SIZE) {
> +            /*
> +             * Here we don't expect so many clear_dirty, so we call
> +             * physical_memory_test_and_clear_dirty with clear_dirty
> +             * set to false. Later we'll do make an overall clear_dirty
> +             * outside the loop.
> +             */
>              if (physical_memory_test_and_clear_dirty(
>                          start + addr + offset,
>                          TARGET_PAGE_SIZE,
> -                        DIRTY_MEMORY_MIGRATION)) {
> +                        DIRTY_MEMORY_MIGRATION,
> +                        false)) {
>                  long k = (start + addr) >> TARGET_PAGE_BITS;
>                  if (!test_and_set_bit(k, dest)) {
>                      num_dirty++;
>                  }
>              }
>          }
> +        end = TARGET_PAGE_ALIGN(start + addr + offset + TARGET_PAGE_SIZE)
> +              >> TARGET_PAGE_BITS;
> +        start_page = (start + offset) >> TARGET_PAGE_BITS;
> +        mr_offset = (ram_addr_t)(start_page << TARGET_PAGE_BITS) - offset;
> +        mr_size = (end - start_page) << TARGET_PAGE_BITS;
> +        memory_region_clear_dirty_bitmap(rb->mr, mr_offset, mr_size);
>      }
>  
>      return num_dirty;
> diff --git a/system/memory.c b/system/memory.c
> index 8b84661ae3..7b81b84f30 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2424,7 +2424,7 @@ void memory_region_reset_dirty(MemoryRegion *mr, hwaddr addr,
>  {
>      assert(mr->ram_block);
>      physical_memory_test_and_clear_dirty(
> -        memory_region_get_ram_addr(mr) + addr, size, client);
> +        memory_region_get_ram_addr(mr) + addr, size, client, true);
>  }
>  
>  int memory_region_get_fd(MemoryRegion *mr)
> diff --git a/system/physmem.c b/system/physmem.c
> index c9869e4049..21b2db3884 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1091,8 +1091,9 @@ void physical_memory_set_dirty_range(ram_addr_t start, ram_addr_t length,
>  
>  /* Note: start and end must be within the same ram block.  */
>  bool physical_memory_test_and_clear_dirty(ram_addr_t start,
> -                                              ram_addr_t length,
> -                                              unsigned client)
> +                                          ram_addr_t length,
> +                                          unsigned client,
> +                                          bool clear_dirty)
>  {
>      DirtyMemoryBlocks *blocks;
>      unsigned long end, page, start_page;
> @@ -1126,9 +1127,11 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>              page += num;
>          }
>  
> -        mr_offset = (ram_addr_t)(start_page << TARGET_PAGE_BITS) - ramblock->offset;
> -        mr_size = (end - start_page) << TARGET_PAGE_BITS;
> -        memory_region_clear_dirty_bitmap(ramblock->mr, mr_offset, mr_size);
> +        if (clear_dirty) {
> +            mr_offset = (ram_addr_t)(start_page << TARGET_PAGE_BITS) - ramblock->offset;
> +            mr_size = (end - start_page) << TARGET_PAGE_BITS;
> +            memory_region_clear_dirty_bitmap(ramblock->mr, mr_offset, mr_size);
> +        }
>      }
>  
>      if (dirty) {
> @@ -1140,9 +1143,9 @@ bool physical_memory_test_and_clear_dirty(ram_addr_t start,
>  
>  static void physical_memory_clear_dirty_range(ram_addr_t addr, ram_addr_t length)
>  {
> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION);
> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA);
> -    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE);
> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_MIGRATION, true);
> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_VGA, true);
> +    physical_memory_test_and_clear_dirty(addr, length, DIRTY_MEMORY_CODE, true);
>  }
>  
>  DirtyBitmapSnapshot *physical_memory_snapshot_and_clear_dirty
> -- 
> 2.20.1
> 

-- 
Peter Xu
Re: [RFC v1 2/2] migration: merge fragmented clear_dirty ioctls
Posted by Chuang Xu 4 days, 10 hours ago
Hi, Peter

On 10/12/2025 05:10, Peter Xu wrote:
> On Mon, Dec 08, 2025 at 08:09:52PM +0800, Chuang Xu wrote:
>> From: xuchuangxclwt <xuchuangxclwt@bytedance.com>
>>
>> When the addresses processed are not aligned, a large number of
>> clear_dirty ioctl occur (e.g. a 16MB misaligned memory can generate
>> 4096 clear_dirty ioctls), which increases the time required for
>> bitmap_sync and makes it more difficult for dirty pages to converge.
>>
>> Attempt to merge those fragmented clear_dirty ioctls.
> (besides separate perf results I requested as in the cover letter reply..)
>
> Could you add something into the commit log explaining at least one example
> that you observe?  E.g. what is the VM setup, how many ramblocks are the
> ones not aligned?
>
> Have you considered setting rb->clear_bmap when it's available?  It'll
> postpone the remote clear even further until page sent.  Logically it
> should be more efficient, but it may depend on the size of unaligned
> ramblocks that you're hitting indeed, as clear_bmap isn't PAGE_SIZE based
> but it can be much bigger.  Some discussion around this would be nice on
> how you chose the current solution.
>

On my Intel(R) Xeon(R) 6986P-C(previous tests were based on Cascade Lake),
I add some logs. Here are some examples of unaligned memory I observed:
size 1966080: system.flash0
size 131072: /rom@etc/acpi/tables, isa-bios, system.flash1, pc.rom
size 65536: cirrus_vga.rom

Taking system.flash0 as an example, judging from its size, this should 
be the OVMF I'm using.
This memory segment will trigger clear_dirty in both memory_listener 
"kvm-memory" and
memory_listener "kvm-smram" simultaneously, ultimately resulting in a 
total of 960 kvm_ioctl calls.
If a larger OVMF is used, this number will obviously worsen.

On the machine I tested, clear system.flash0 took a total of 49ms,
and clear all unaligned memory took a total of 61ms.

Regarding why the current solution was chosen, because I'm not sure if 
there were any
special considerations in the initial design of clear_dirty_log for not 
applying unaligned memory paths.
Therefore, I chose to keep most of the logic the same as the existing one,
only extracting and merging the actual clear_dirty operations.

Thanks.
Re: [RFC v1 2/2] migration: merge fragmented clear_dirty ioctls
Posted by Peter Xu 4 days, 4 hours ago
On Wed, Dec 10, 2025 at 10:18:21PM +0800, Chuang Xu wrote:
> Hi, Peter
> 
> On 10/12/2025 05:10, Peter Xu wrote:
> > On Mon, Dec 08, 2025 at 08:09:52PM +0800, Chuang Xu wrote:
> >> From: xuchuangxclwt <xuchuangxclwt@bytedance.com>
> >>
> >> When the addresses processed are not aligned, a large number of
> >> clear_dirty ioctl occur (e.g. a 16MB misaligned memory can generate
> >> 4096 clear_dirty ioctls), which increases the time required for
> >> bitmap_sync and makes it more difficult for dirty pages to converge.
> >>
> >> Attempt to merge those fragmented clear_dirty ioctls.
> > (besides separate perf results I requested as in the cover letter reply..)
> >
> > Could you add something into the commit log explaining at least one example
> > that you observe?  E.g. what is the VM setup, how many ramblocks are the
> > ones not aligned?
> >
> > Have you considered setting rb->clear_bmap when it's available?  It'll
> > postpone the remote clear even further until page sent.  Logically it
> > should be more efficient, but it may depend on the size of unaligned
> > ramblocks that you're hitting indeed, as clear_bmap isn't PAGE_SIZE based
> > but it can be much bigger.  Some discussion around this would be nice on
> > how you chose the current solution.
> >
> 
> On my Intel(R) Xeon(R) 6986P-C(previous tests were based on Cascade Lake),
> I add some logs. Here are some examples of unaligned memory I observed:
> size 1966080: system.flash0
> size 131072: /rom@etc/acpi/tables, isa-bios, system.flash1, pc.rom
> size 65536: cirrus_vga.rom
> 
> Taking system.flash0 as an example, judging from its size, this should 
> be the OVMF I'm using.
> This memory segment will trigger clear_dirty in both memory_listener 
> "kvm-memory" and
> memory_listener "kvm-smram" simultaneously, ultimately resulting in a 
> total of 960 kvm_ioctl calls.
> If a larger OVMF is used, this number will obviously worsen.
> 
> On the machine I tested, clear system.flash0 took a total of 49ms,
> and clear all unaligned memory took a total of 61ms.

Thanks for the numbers.  It might be more helpful if you share the other
relevant numbers, e.g. referring to your cover letter definitions,

  - total bitmap sync time (x)
  - iteration transfer time (t)

that'll provide a full picture of how much wasted on per-page log_clear().

You can use those to compare to the clear() operation it took after your
patch applied.  I think that might be a better way to justify the patch.

In general, it looks reasonable to me.

Having a bool parameter for the old physical_memory_test_and_clear_dirty()
is fine but might be slightly error prone for new callers if they set it to
false by accident.

Another way to do it is, since physical_memory_test_and_clear_dirty()
always takes a range, we can pass the bitmap ("dest") into it when
available, then changing the retval to be num_dirty:

  (1) when dest provided, only account it when "dest" bitmap wasn't set
  already

  (2) when dest==NULL, treat all dirty bits into num_dirty

Maybe that'll look better, you can double check.

Looks like patch 1 will be dropped. You can repost this patch alone when
you feel ready whatever way you see fit.

Thanks,

> 
> Regarding why the current solution was chosen, because I'm not sure if 
> there were any
> special considerations in the initial design of clear_dirty_log for not 
> applying unaligned memory paths.
> Therefore, I chose to keep most of the logic the same as the existing one,
> only extracting and merging the actual clear_dirty operations.
> 
> Thanks.
> 

-- 
Peter Xu