[PATCH v6] migration/rdma: add x-rdma-chunk-size parameter

Posted by Samuel Zhang 4 weeks ago

The default 1MB RDMA chunk size causes slow live migration because
each chunk triggers a write_flush (ibv_post_send). For 8GB RAM,
1MB chunk size produces ~15000 flushes vs ~3700 with 1024MB chunk size.

Add x-rdma-chunk-size parameter to configure the RDMA chunk size for
faster migration.
Usage: `migrate_set_parameter x-rdma-chunk-size 1024M`

Performance with RDMA live migration of 8GB RAM VM:

| x-rdma-chunk-size (B) | time (s) | throughput (MB/s) |
|-----------------------|----------|-------------------|
| 1M (default)          | 37.915   |  1,007            |
| 32M                   | 17.880   |  2,260            |
| 1024M                 |  4.368   | 17,529            |

Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Li Zhijian <lizhijian@fujitsu.com>
Tested-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Acked-by: Peter Xu <peterx@redhat.com>
---
v2:
- Renamed x-rdma-chunk-shift to x-rdma-chunk-size (byte count)
- Added validation in migrate_params_check()
- Added hmp_migrate_set_parameter() support
- Added hmp_info_migrate_parameters() support
- Added migrate_mark_all_params_present()
- Use qemu_strtosz() for size suffix support
v3: [Markus]
- Use visit_type_size() in HMP set parameter
- Use MiB/GiB constants
v4: [Markus]
- Remove superfluous comment on DEFAULT_MIGRATE_X_RDMA_CHUNK_SIZE
- Use "Only applies when migrating via RDMA" in QAPI doc
v5:
- Document that x-rdma-chunk-size must be set to the same value on both
  source and destination before migration starts
- Add Acked-by and Tested-by from Li Zhijian
v6:
- Add Acked-by from Peter Xu and Fabiano Rosas
- Rebase to https://gitlab.com/peterx/qemu/-/tree/next

 migration/migration-hmp-cmds.c | 11 +++++++++++
 migration/options.c            | 33 ++++++++++++++++++++++++++++++++-
 migration/options.h            |  1 +
 migration/rdma.c               | 30 ++++++++++++++++--------------
 qapi/migration.json            | 13 +++++++++++--
 5 files changed, 71 insertions(+), 17 deletions(-)

diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 0a193b8f54..4f6c1dbf89 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -451,6 +451,13 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
                            params->direct_io ? "on" : "off");
         }
 
+        if (params->has_x_rdma_chunk_size) {
+            monitor_printf(mon, "%s: %" PRIu64 " bytes\n",
+                           MigrationParameter_str(
+                               MIGRATION_PARAMETER_X_RDMA_CHUNK_SIZE),
+                           params->x_rdma_chunk_size);
+        }
+
         assert(params->has_cpr_exec_command);
         monitor_print_cpr_exec_command(mon, params->cpr_exec_command);
     }
@@ -734,6 +741,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         p->has_direct_io = true;
         visit_type_bool(v, param, &p->direct_io, &err);
         break;
+    case MIGRATION_PARAMETER_X_RDMA_CHUNK_SIZE:
+        p->has_x_rdma_chunk_size = true;
+        visit_type_size(v, param, &p->x_rdma_chunk_size, &err);
+        break;
     case MIGRATION_PARAMETER_CPR_EXEC_COMMAND: {
         /*
          * NOTE: g_autofree will only auto g_free() the strv array when
diff --git a/migration/options.c b/migration/options.c
index 68441f0276..5cbfd29099 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -13,6 +13,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
+#include "qemu/units.h"
 #include "exec/target_page.h"
 #include "qapi/clone-visitor.h"
 #include "qapi/error.h"
@@ -90,6 +91,7 @@ const PropertyInfo qdev_prop_StrOrNull;
 
 #define DEFAULT_MIGRATE_VCPU_DIRTY_LIMIT_PERIOD     1000    /* milliseconds */
 #define DEFAULT_MIGRATE_VCPU_DIRTY_LIMIT            1       /* MB/s */
+#define DEFAULT_MIGRATE_X_RDMA_CHUNK_SIZE           MiB
 
 const Property migration_properties[] = {
     DEFINE_PROP_BOOL("store-global-state", MigrationState,
@@ -183,6 +185,9 @@ const Property migration_properties[] = {
     DEFINE_PROP_ZERO_PAGE_DETECTION("zero-page-detection", MigrationState,
                        parameters.zero_page_detection,
                        ZERO_PAGE_DETECTION_MULTIFD),
+    DEFINE_PROP_UINT64("x-rdma-chunk-size", MigrationState,
+                      parameters.x_rdma_chunk_size,
+                      DEFAULT_MIGRATE_X_RDMA_CHUNK_SIZE),
 
     /* Migration capabilities */
     DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
@@ -1000,6 +1005,15 @@ ZeroPageDetection migrate_zero_page_detection(void)
     return s->parameters.zero_page_detection;
 }
 
+uint64_t migrate_rdma_chunk_size(void)
+{
+    MigrationState *s = migrate_get_current();
+    uint64_t size = s->parameters.x_rdma_chunk_size;
+
+    assert(MiB <= size && size <= GiB && is_power_of_2(size));
+    return size;
+}
+
 /* parameters helpers */
 
 AnnounceParameters *migrate_announce_params(void)
@@ -1062,7 +1076,7 @@ static void migrate_mark_all_params_present(MigrationParameters *p)
         &p->has_announce_step, &p->has_block_bitmap_mapping,
         &p->has_x_vcpu_dirty_limit_period, &p->has_vcpu_dirty_limit,
         &p->has_mode, &p->has_zero_page_detection, &p->has_direct_io,
-        &p->has_cpr_exec_command,
+        &p->has_x_rdma_chunk_size, &p->has_cpr_exec_command,
     };
 
     len = ARRAY_SIZE(has_fields);
@@ -1273,6 +1287,15 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
         return false;
     }
 
+    if (params->has_x_rdma_chunk_size &&
+        (params->x_rdma_chunk_size < MiB ||
+         params->x_rdma_chunk_size > GiB ||
+         !is_power_of_2(params->x_rdma_chunk_size))) {
+        error_setg(errp, "Option x_rdma_chunk_size expects "
+                   "a power of 2 in the range 1MiB to 1024MiB");
+        return false;
+    }
+
     return true;
 }
 
@@ -1393,6 +1416,10 @@ static void migrate_params_test_apply(MigrationParameters *params,
         dest->direct_io = params->direct_io;
     }
 
+    if (params->has_x_rdma_chunk_size) {
+        dest->x_rdma_chunk_size = params->x_rdma_chunk_size;
+    }
+
     if (params->has_cpr_exec_command) {
         qapi_free_strList(dest->cpr_exec_command);
         dest->cpr_exec_command = QAPI_CLONE(strList, params->cpr_exec_command);
@@ -1520,6 +1547,10 @@ static void migrate_params_apply(MigrationParameters *params)
         s->parameters.direct_io = params->direct_io;
     }
 
+    if (params->has_x_rdma_chunk_size) {
+        s->parameters.x_rdma_chunk_size = params->x_rdma_chunk_size;
+    }
+
     if (params->has_cpr_exec_command) {
         qapi_free_strList(s->parameters.cpr_exec_command);
         s->parameters.cpr_exec_command =
diff --git a/migration/options.h b/migration/options.h
index b502871097..b46221998a 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -87,6 +87,7 @@ const char *migrate_tls_creds(void);
 const char *migrate_tls_hostname(void);
 uint64_t migrate_xbzrle_cache_size(void);
 ZeroPageDetection migrate_zero_page_detection(void);
+uint64_t migrate_rdma_chunk_size(void);
 
 /* parameters helpers */
 
diff --git a/migration/rdma.c b/migration/rdma.c
index 55ab85650a..3e37a1d440 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -45,10 +45,12 @@
 #define RDMA_RESOLVE_TIMEOUT_MS 10000
 
 /* Do not merge data if larger than this. */
-#define RDMA_MERGE_MAX (2 * 1024 * 1024)
-#define RDMA_SIGNALED_SEND_MAX (RDMA_MERGE_MAX / 4096)
+static inline uint64_t rdma_merge_max(void)
+{
+    return migrate_rdma_chunk_size() * 2;
+}
 
-#define RDMA_REG_CHUNK_SHIFT 20 /* 1 MB */
+#define RDMA_SIGNALED_SEND_MAX 512
 
 /*
  * This is only for non-live state being migrated.
@@ -527,21 +529,21 @@ static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
 static inline uint64_t ram_chunk_index(const uint8_t *start,
                                        const uint8_t *host)
 {
-    return ((uintptr_t) host - (uintptr_t) start) >> RDMA_REG_CHUNK_SHIFT;
+    return ((uintptr_t) host - (uintptr_t) start) / migrate_rdma_chunk_size();
 }
 
 static inline uint8_t *ram_chunk_start(const RDMALocalBlock *rdma_ram_block,
                                        uint64_t i)
 {
     return (uint8_t *)(uintptr_t)(rdma_ram_block->local_host_addr +
-                                  (i << RDMA_REG_CHUNK_SHIFT));
+                                  (i * migrate_rdma_chunk_size()));
 }
 
 static inline uint8_t *ram_chunk_end(const RDMALocalBlock *rdma_ram_block,
                                      uint64_t i)
 {
     uint8_t *result = ram_chunk_start(rdma_ram_block, i) +
-                                         (1UL << RDMA_REG_CHUNK_SHIFT);
+                                         migrate_rdma_chunk_size();
 
     if (result > (rdma_ram_block->local_host_addr + rdma_ram_block->length)) {
         result = rdma_ram_block->local_host_addr + rdma_ram_block->length;
@@ -1841,6 +1843,7 @@ static int qemu_rdma_write_one(RDMAContext *rdma,
     struct ibv_send_wr *bad_wr;
     int reg_result_idx, ret, count = 0;
     uint64_t chunk, chunks;
+    uint64_t chunk_size = migrate_rdma_chunk_size();
     uint8_t *chunk_start, *chunk_end;
     RDMALocalBlock *block = &(rdma->local_ram_blocks.block[current_index]);
     RDMARegister reg;
@@ -1861,22 +1864,21 @@ retry:
     chunk_start = ram_chunk_start(block, chunk);
 
     if (block->is_ram_block) {
-        chunks = length / (1UL << RDMA_REG_CHUNK_SHIFT);
+        chunks = length / chunk_size;
 
-        if (chunks && ((length % (1UL << RDMA_REG_CHUNK_SHIFT)) == 0)) {
+        if (chunks && ((length % chunk_size) == 0)) {
             chunks--;
         }
     } else {
-        chunks = block->length / (1UL << RDMA_REG_CHUNK_SHIFT);
+        chunks = block->length / chunk_size;
 
-        if (chunks && ((block->length % (1UL << RDMA_REG_CHUNK_SHIFT)) == 0)) {
+        if (chunks && ((block->length % chunk_size) == 0)) {
             chunks--;
         }
     }
 
     trace_qemu_rdma_write_one_top(chunks + 1,
-                                  (chunks + 1) *
-                                  (1UL << RDMA_REG_CHUNK_SHIFT) / 1024 / 1024);
+                                  (chunks + 1) * chunk_size / 1024 / 1024);
 
     chunk_end = ram_chunk_end(block, chunk + chunks);
 
@@ -2176,7 +2178,7 @@ static int qemu_rdma_write(RDMAContext *rdma,
     rdma->current_length += len;
 
     /* flush it if buffer is too large */
-    if (rdma->current_length >= RDMA_MERGE_MAX) {
+    if (rdma->current_length >= rdma_merge_max()) {
         return qemu_rdma_write_flush(rdma, errp);
     }
 
@@ -3522,7 +3524,7 @@ int rdma_registration_handle(QEMUFile *f)
                 } else {
                     chunk = reg->key.chunk;
                     host_addr = block->local_host_addr +
-                        (reg->key.chunk * (1UL << RDMA_REG_CHUNK_SHIFT));
+                        (reg->key.chunk * migrate_rdma_chunk_size());
                     /* Check for particularly bad chunk value */
                     if (host_addr < (void *)block->local_host_addr) {
                         error_report("rdma: bad chunk for block %s"
diff --git a/qapi/migration.json b/qapi/migration.json
index 7134d4ce47..0db115ec5e 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -806,7 +806,7 @@
 #
 # Features:
 #
-# @unstable: Members @x-checkpoint-delay and
+# @unstable: Members @x-checkpoint-delay, @x-rdma-chunk-size, and
 #     @x-vcpu-dirty-limit-period are experimental.
 #
 # Since: 2.4
@@ -831,6 +831,7 @@
            'mode',
            'zero-page-detection',
            'direct-io',
+           { 'name': 'x-rdma-chunk-size', 'features': [ 'unstable' ] },
            'cpr-exec-command'] }
 
 ##
@@ -1007,9 +1008,15 @@
 #     is @cpr-exec.  The first list element is the program's filename,
 #     the remainder its arguments.  (Since 10.2)
 #
+# @x-rdma-chunk-size: RDMA memory registration chunk size in bytes.
+#     Default is 1MiB.  Must be a power of 2 in the range
+#     [1MiB, 1024MiB].  Only applies when migrating via RDMA.
+#     Must be set to the same value on both source and destination
+#     before migration starts.  (Since 11.1)
+#
 # Features:
 #
-# @unstable: Members @x-checkpoint-delay and
+# @unstable: Members @x-checkpoint-delay, @x-rdma-chunk-size, and
 #     @x-vcpu-dirty-limit-period are experimental.
 #
 # Since: 2.4
@@ -1046,6 +1053,8 @@
             '*mode': 'MigMode',
             '*zero-page-detection': 'ZeroPageDetection',
             '*direct-io': 'bool',
+            '*x-rdma-chunk-size': { 'type': 'uint64',
+                                    'features': [ 'unstable' ] },
             '*cpr-exec-command': [ 'str' ]} }
 
 ##
-- 
2.43.7
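
For readers who want to try the new bounds check outside QEMU, here is a minimal standalone sketch of the condition added to migrate_params_check() and of the division-based chunk index that replaces the old shift. It is not QEMU code: `is_pow2`, `chunk_size_is_valid` and `chunk_index` are hypothetical names, and MiB/GiB are defined locally instead of coming from qemu/units.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the constants in qemu/units.h (assumption). */
#define MiB (UINT64_C(1) << 20)
#define GiB (UINT64_C(1) << 30)

/* Local stand-in for QEMU's is_power_of_2(); zero is not a power of 2. */
static bool is_pow2(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Same condition as the patch: a power of 2 in [1 MiB, 1024 MiB]. */
static bool chunk_size_is_valid(uint64_t size)
{
    return size >= MiB && size <= GiB && is_pow2(size);
}

/*
 * Byte offset within a RAM block -> chunk index, using division as
 * ram_chunk_index() does after the patch.  For a power-of-2 chunk size
 * this gives the same result as the old fixed right shift.
 */
static uint64_t chunk_index(uint64_t offset, uint64_t chunk_size)
{
    assert(chunk_size_is_valid(chunk_size));
    return offset / chunk_size;
}
```

With the default 1 MiB chunk size, `chunk_index(off, MiB)` equals `off >> 20`, which is what the removed RDMA_REG_CHUNK_SHIFT computed.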

Re: [PATCH v6] migration/rdma: add x-rdma-chunk-size parameter

Posted by Daniel P. Berrangé 4 weeks ago

On Mon, Apr 27, 2026 at 11:14:01AM +0800, Samuel Zhang wrote:
> The default 1MB RDMA chunk size causes slow live migration because
> each chunk triggers a write_flush (ibv_post_send). For 8GB RAM,
> 1MB chunk size produces ~15000 flushes vs ~3700 with 1024MB chunk size.
> 
> Add x-rdma-chunk-size parameter to configure the RDMA chunk size for
> faster migration.
> Usage: `migrate_set_parameter x-rdma-chunk-size 1024M`
> 
> Performance with RDMA live migration of 8GB RAM VM:
> 
> | x-rdma-chunk-size (B) | time (s) | throughput (MB/s) |
> |-----------------------|----------|-------------------|
> | 1M (default)          | 37.915   |  1,007            |
> | 32M                   | 17.880   |  2,260            |
> | 1024M                 |  4.368   | 17,529            |

What is the downside of setting a larger chunk size ?

IOW, why should we keep 1M as the default when it gives
such terrible relative performance ?  Why not make 1G
be the default instead of creating this flag and requiring
people to know about setting it ?

Re: [PATCH v6] migration/rdma: add x-rdma-chunk-size parameter

Posted by Zhang, GuoQing (Sam) 3 weeks, 4 days ago

On 2026/4/27 15:17, Daniel P. Berrangé wrote:
> What is the downside of setting a larger chunk size ?
>
> IOW, why should we keep 1M as the default when it gives
> such terrible relative performance ?  Why not make 1G
> be the default instead of creating this flag and requiring
> people to know about setting it ?
>
Hi Daniel,

Thank you for the very good question.
I dug into the git history. The 1M chunk size dates back to the original 
RDMA implementation by Michael R. Hines in 2013 (commit 2da776db48).
I agree 1M is too conservative for modern hardware. However, I found 
that 1G is not necessarily the optimal chunk size either.

I collected the following performance data on my server:

NIC: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller
qemu version: v11.0.0-rc2-139-g25fcd86805
qemu config: pin-all off (default setting)
VM system RAM size: 8GB
guest workload: `stress-ng --vm 4 --vm-bytes 1G --vm-method rand-set --timeout 0`
```
chunk_size  total(ms)  setup(ms)  down(ms)  Throughput(Mbps)  total_size  transferred
1m             45,156        864     1,166          1,252.50    8.02 GiB     6.46 GiB
2m             41,848        853     1,161          1,354.11    8.02 GiB     6.46 GiB
4m             37,836        861     1,435          1,523.33    8.02 GiB     6.56 GiB
8m             37,684        852     1,176          1,537.98    8.02 GiB     6.59 GiB
16m            37,620        852     1,173          1,538.96    8.02 GiB     6.59 GiB
32m            15,034        963     1,864          3,401.26    8.02 GiB     5.57 GiB
64m             4,492        868     1,554         13,637.46    8.02 GiB     5.75 GiB
128m            3,940        851     1,662         16,860.59    8.02 GiB     6.06 GiB
256m            3,640        852     2,206         19,390.99    8.02 GiB     6.29 GiB
512m            3,645        852     2,179         23,200.67    8.02 GiB     7.54 GiB
1024m           3,665        865     2,238         24,676.59    8.02 GiB     8.04 GiB
```

The downside of a larger chunk size:
A larger chunk causes more data to be transferred per dirty region. For
example, with a 1G chunk size, a single dirty 4K page causes the full 1G
chunk to be transferred. As a result, the largest chunk size does not
necessarily give the shortest total migration time. Compare the 256m and
1024m rows in the table: 1024m transferred 8.04 GiB in 3,665 ms, while
256m transferred 6.29 GiB in 3,640 ms.
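
To put a number on that worst case, here is a small standalone sketch (not part of the patch). It takes the statement above at face value, i.e. that one dirty page makes the whole chunk go over the wire, and assumes a 4 KiB page. `worst_case_amplification` and `PAGE_SIZE_BYTES` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define KiB (UINT64_C(1) << 10)
#define MiB (UINT64_C(1) << 20)
#define GiB (UINT64_C(1) << 30)

/* Assumed page size; matches the "4K" dirty page in the example. */
#define PAGE_SIZE_BYTES (4 * KiB)

/*
 * Worst-case bytes sent per dirty byte when exactly one page in a
 * chunk is dirty and the whole chunk is retransmitted.
 */
static uint64_t worst_case_amplification(uint64_t chunk_size)
{
    assert(chunk_size >= PAGE_SIZE_BYTES);
    assert(chunk_size % PAGE_SIZE_BYTES == 0);
    return chunk_size / PAGE_SIZE_BYTES;
}
```

Under these assumptions the factor is 256 for a 1 MiB chunk, 32768 for 128 MiB and 262144 for 1 GiB. A real workload dirties many pages per chunk, so the observed overhead is far smaller than this upper bound.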

Based on my data, 128m appears to be the sweet spot for my hardware and
workload, but different configurations may have different optimal
values. I think increasing the default (e.g., to 64m or 128m) while
keeping this parameter for user tuning would be a good approach.

Hi Zhijian, what do you think about (1) increasing the default chunk 
size, and (2) keeping this tunable parameter?

Thanks,
Sam

Re: [PATCH v6] migration/rdma: add x-rdma-chunk-size parameter

Posted by Daniel P. Berrangé 3 weeks, 4 days ago

On Thu, Apr 30, 2026 at 05:46:43PM +0800, Zhang, GuoQing (Sam) wrote:
> The downside of a larger chunk size:
> A larger chunk causes more data to be transferred per dirty region. For
> example, a single dirty page (4K) will cause a full 1G chunk to be
> transferred when chunk size is 1G. As a result, the total migration
> time may not be the shortest with the largest chunk size. See 256m row
> and 1024m row in the table as an example.
> 
> Based on my data, 128m appears to be the sweet spot for my hardware and
> workload, but different configurations may have different optimal
> values. I think increasing the default (e.g., to 64m or 128m) while
> keeping this parameter for user tuning would be a good approach.

Yep, that sounds reasonable.  Also I think it possibly justifies
adding this tunable without the 'x-' prefix.

With regards,
Daniel