On 5/21/2026 6:20 PM, Peter Xu wrote:
> On Thu, May 21, 2026 at 04:53:54PM +0300, Avihai Horon wrote:
>> On 5/19/2026 11:09 PM, Peter Xu wrote:
>>> On Tue, May 05, 2026 at 11:14:09AM +0300, Avihai Horon wrote:
>>>> Performance tests were done by migrating a single VM with:
>>>> * 8 GB RAM
>>>> * 4 mlx5 VFIO devices:
>>>> - One device with 1GB of device data (stopcopy data) that runs
>>>> workload during precopy so VFIO_PRECOPY_INFO_REINIT is exercised
>>>> (generate new initial_bytes chunks during precopy).
>>> Could you elaborate a bit more on what workload is executed, and how that
>>> will affect REINIT reportings (e.g. is only one REINIT generated, or it
>>> keeps generating)?
>> Basically, I create and destroy RDMA resources (MRs, QPs, CQs, etc.) on the
>> VFIO device in a loop for several iterations.
>> This generates several REINITs.
>>
>>> Can I understand it in this way: without REINIT, device is forced to put
>>> those data into stopcopy size; then with REINIT, some stopcopy size is
>>> essentially moved back to precopy phase?
>> Almost:
>> Without REINIT, the device is forced to put this data in precopy
>> dirty_bytes.
>> With REINIT, this data can be put in precopy init_bytes (and do the
>> switchover-ack dance again).
> Hmm, then I don't understand why moving some chunk of data from
> precopy_bytes to init_bytes helps downtime.
>
> Essentially, QEMU makes the switchover decision based on the math of:
>
>     init + dirty + stop
>     ------------------- <= downtime_limit
>              bw
>
> The possible min of above is:
>
>     stop
>     ----
>      bw
>
> Here whether some data would be in init or precopy portion shouldn't matter
> for a min downtime, since both portions are allowed to be moved during
> precopy phase.
>
> OTOH, if stop_bytes unchanged, min downtime is still the same before /
> after supporting REINIT, if we try harder.
>
> Say, with below testing results:
>
> With VFIO_PRECOPY_INFO_REINIT:
> 1335ms total (~520ms from the VFIO device running the workload).
>
> Without VFIO_PRECOPY_INFO_REINIT:
> 2352ms total (~1600ms from the VFIO device running the workload).
>
> What is the downtime_limit you specified for both cases? Have you tried to
> specify lower downtime_limit than what you specified, so that both results
> will become even closer (until they become, statistically, identical)?
>
> In general, I can understand the REINIT will stop converging too early, but
> it'll be the same IIUC just to turn the downtime_limit smaller.. IOW, I
> may still miss some important piece of info that how this REINIT feature
> helps downtime..
The init_bytes are special in the sense that it's crucial that they are
transferred before switching over. Otherwise, VFIO precopy may not take
full effect, which could make VFIO migration slower.
Accordingly, their contribution to downtime may be more than just the
time it takes to transfer them.
Specifically for mlx5, init_bytes hold a small amount of metadata that is
used for time-consuming pre-allocations on the destination side. So, we
may have 10MB of init_bytes which take a fraction of a second to transfer,
but once they reach the destination, loading them could take as long as a
few seconds.
By moving this data from dirty_bytes to init_bytes, together with
switchover-ack, we guarantee that this long pre-allocation doesn't
happen during downtime. This is the time difference you see in the test
results.
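
A rough sketch of this reasoning, with made-up numbers (they are not the
measured ones, and the function and constant names are hypothetical). It
assumes the worst case without REINIT, where the metadata is still pending
at switchover, and models downtime as the transfer time of what is still
pending plus any destination-side load work that lands inside downtime:

```python
MB = 1024 * 1024


def downtime_ms(pending_bytes, bw_bytes_per_s, load_ms_in_downtime):
    """Estimate downtime as transfer time of pending data plus load time
    that the destination has to spend while the guest is stopped."""
    transfer_ms = pending_bytes / bw_bytes_per_s * 1000.0
    return transfer_ms + load_ms_in_downtime


# All values below are illustrative assumptions.
BW = 1000 * MB          # migration bandwidth, bytes per second
STOP_BYTES = 100 * MB   # stopcopy data that must move during downtime
META_BYTES = 10 * MB    # metadata that triggers pre-allocations
META_LOAD_MS = 2000.0   # destination-side load time for that metadata

# Without REINIT: the metadata sits in dirty_bytes, so it may still be
# pending at switchover and gets transferred and loaded during downtime.
without_reinit = downtime_ms(STOP_BYTES + META_BYTES, BW, META_LOAD_MS)

# With REINIT: the metadata is init_bytes, so it is transferred and loaded
# before switchover-ack, and adds nothing to downtime.
with_reinit = downtime_ms(STOP_BYTES, BW, 0.0)

print(f"without REINIT: {without_reinit:.0f} ms")
print(f"with REINIT:    {with_reinit:.0f} ms")
```

In this model the 10MB transfer itself is only a small part of the gap;
almost all of it comes from the load time.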
Thanks.