[PATCH v2 0/4] Migration time prediction using calc-dirty-rate

Andrei Gudkov via posted 4 patches 1 year ago
git fetch https://github.com/patchew-project/qemu tags/patchew/cover.1682598010.git.gudkov.andrei@huawei.com
Maintainers: Juan Quintela <quintela@redhat.com>, Peter Xu <peterx@redhat.com>, Leonardo Bras <leobras@redhat.com>, Eric Blake <eblake@redhat.com>, Markus Armbruster <armbru@redhat.com>, John Snow <jsnow@redhat.com>, Cleber Rosa <crosa@redhat.com>
V1 -> V2:
  - Extracted CRC32->xxHash into separate commit
  - Extracted @n-zero-samples metric into separate commit
  - Added description to qapi about connection between
    @n-dirty-samples and @periods arrays
  - Added (Since ...) tag to new metrics

---

The overall goal of this patch series is to predict the time it would
take to migrate a VM in precopy mode, based on the maximum allowed downtime,
network bandwidth, and metrics collected with "calc-dirty-rate".
The predictor itself is a simple Python script that closely follows the
iterations of the migration algorithm: compute how long it would take to copy
the currently dirty pages, estimate the number of pages dirtied by the VM since
the beginning of that iteration, and repeat until the estimated iteration time
fits into the maximum allowed downtime. To get reasonable accuracy, however,
the predictor requires additional metrics, which have been implemented in
"calc-dirty-rate".
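The iteration loop described above can be sketched as follows. This is a simplified model, not the actual scripts/predict_migration.py code; `dirtied_pages` is a hypothetical callable standing in for the dirty-page function reconstructed from calc-dirty-rate samples:

```python
def predict_total_time(ram_pages, page_size, bandwidth, max_downtime,
                       dirtied_pages, max_iters=1000):
    """Simulate precopy iterations.

    bandwidth        -- network bandwidth, bytes per second
    max_downtime     -- maximum allowed downtime, seconds
    dirtied_pages(t) -- estimated pages dirtied during t seconds of copying
    Returns predicted total migration time in seconds, or None if the
    simulation does not converge within max_iters iterations.
    """
    total = 0.0
    pages_to_copy = ram_pages
    for _ in range(max_iters):
        iter_time = pages_to_copy * page_size / bandwidth
        total += iter_time
        if iter_time <= max_downtime:
            return total        # final iteration fits into the downtime
        # pages dirtied while this iteration ran form the next iteration
        pages_to_copy = dirtied_pages(iter_time)
    return None                 # dirty rate outpaces the network
```

For example, a dormant VM (nothing dirtied) converges in a single pass, while a constant dirty volume larger than what the link can move per iteration never converges.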

Summary of calc-dirty-rate changes:

1. The most important change is that calc-dirty-rate now produces
   a *vector* of dirty page measurements for progressively increasing time
   periods: 125ms, 250ms, 500ms, 750ms, 1000ms, 1500ms, ..., up to the
   specified calc-time. The motivation behind this change is that the number
   of dirtied pages as a function of time, starting from a "clean state"
   (a new migration iteration), is far from linear. The shape of this function
   depends on the workload type and intensity. Measuring the number of dirty
   pages at progressively increasing periods allows this function to be
   reconstructed using piece-wise interpolation.

2. New metric added: the number of all-zero pages.
   The predictor needs to distinguish between zero and non-zero pages
   because during migration only an 8-byte header is placed on the wire for
   an all-zero page.

3. The hashing function was changed from CRC32 to xxHash.
   This reduces the overhead of sampling by a factor of ~10, which is
   important now that some of the measurement periods are sub-second.

4. Other trivial metrics were added for convenience: the total number
   of VM pages, the number of sampled pages, and the page size.
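The piece-wise reconstruction mentioned in point 1 can be done with plain linear interpolation over the measured vector. This is an illustrative sketch, not the actual predictor code:

```python
import bisect

def dirty_pages_at(periods_ms, n_dirty_pages, t_ms):
    """Piece-wise linear interpolation of the dirtied-pages function.

    periods_ms    -- increasing measurement periods from calc-dirty-rate
    n_dirty_pages -- dirty-page counts observed over those periods
    Growth is assumed linear from (0, 0) up to the first sample, and the
    final segment's slope is reused beyond the last sample.
    """
    if t_ms <= periods_ms[0]:
        return n_dirty_pages[0] * t_ms / periods_ms[0]
    i = bisect.bisect_left(periods_ms, t_ms)
    if i >= len(periods_ms):
        i = len(periods_ms) - 1          # extrapolate from the last segment
    t0, t1 = periods_ms[i - 1], periods_ms[i]
    y0, y1 = n_dirty_pages[i - 1], n_dirty_pages[i]
    return y0 + (y1 - y0) * (t_ms - t0) / (t1 - t0)
```

With the first few samples from the example output further down, `dirty_pages_at([125, 250, 375, 500], [33, 78, 119, 151], 187.5)` evaluates to 55.5.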


After these changes, the output of calc-dirty-rate looks like this:

{
  "page-size": 4096,
  "periods": [125, 250, 375, 500, 750, 1000, 1500,
              2000, 3000, 4001, 6000, 8000, 10000,
              15000, 20000, 25000, 30000, 35000,
              40000, 45000, 50000, 60000],
  "status": "measured",
  "sample-pages": 512,
  "dirty-rate": 98,
  "mode": "page-sampling",
  "n-dirty-pages": [33, 78, 119, 151, 217, 236, 293, 336,
                    425, 505, 620, 756, 898, 1204, 1457,
                    1723, 1934, 2141, 2328, 2522, 2675, 2958],
  "n-sampled-pages": 16392,
  "n-zero-pages": 10060,
  "n-total-pages": 8392704,
  "start-time": 2916750,
  "calc-time": 60
}
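As a sanity check, these numbers hang together: scaling the sampled 60-second dirty count up to the whole VM reproduces the reported dirty rate (assuming dirty-rate is reported in MiB/s), and the zero-page fraction gives a rough first-pass transfer size:

```python
page_size = 4096
n_total = 8392704      # "n-total-pages"
n_sampled = 16392      # "n-sampled-pages"
n_zero = 10060         # "n-zero-pages"
n_dirty_60s = 2958     # last entry of "n-dirty-pages" (full 60s window)
calc_time = 60         # seconds

scale = n_total / n_sampled   # each sampled page stands for 512 real pages
rate_mib = n_dirty_60s * scale * page_size / calc_time / (1 << 20)
# rate_mib is ~98.6, matching the reported "dirty-rate": 98

# rough first-pass volume: zero pages cost an 8-byte header, the rest full size
first_pass = n_zero * scale * 8 + (n_sampled - n_zero) * scale * page_size
print(f"{rate_mib:.1f} MiB/s, first pass ~{first_pass / (1 << 30):.1f} GiB")
```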

Feeding this data into the prediction script, we get the following estimates:

Downtime> |    125ms |    250ms |    500ms |   1000ms |   5000ms |    unlim
---------------------------------------------------------------------------
 100 Mbps |        - |        - |        - |        - |        - |   16m59s  
   1 Gbps |        - |        - |        - |        - |        - |    1m40s
   2 Gbps |        - |        - |        - |        - |    1m41s |      50s  
 2.5 Gbps |        - |        - |        - |        - |    1m07s |      40s
   5 Gbps |      48s |      46s |      31s |      28s |      25s |      20s
  10 Gbps |      13s |      12s |      12s |      12s |      12s |      10s
  25 Gbps |       5s |       5s |       5s |       5s |       4s |       4s
  40 Gbps |       3s |       3s |       3s |       3s |       3s |       3s


The quality of prediction was tested with the YCSB benchmark. A Memcached
instance was installed in a 32GiB VM, and a client generated a stream of
requests. Between experiments we varied the request size distribution, the
number of threads, and the location of the client (inside or outside the VM).
After a short preheat phase, we measured calc-dirty-rate:
1. {"execute": "calc-dirty-rate", "arguments":{"calc-time":60}}
2. Wait 60 seconds
3. Collect results with {"execute": "query-dirty-rate"}
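The three steps above can be driven over the QMP socket. This is a minimal sketch; the socket path is an example, and the guest must have been started with a QMP server on it:

```python
import json, socket

def qmp_round_trip(sock_path, execute, arguments=None):
    """One-shot QMP exchange: greeting, capability negotiation, one command."""
    cmd = {"execute": execute}
    if arguments is not None:
        cmd["arguments"] = arguments
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        f = s.makefile("rw")
        json.loads(f.readline())                       # server greeting
        f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
        f.flush()
        json.loads(f.readline())                       # empty "return"
        f.write(json.dumps(cmd) + "\n")
        f.flush()
        return json.loads(f.readline())

# Usage against a live guest (socket path is an example):
#   qmp_round_trip("/tmp/qmp.sock", "calc-dirty-rate", {"calc-time": 60})
#   ... wait for the 60-second measurement window to elapse ...
#   result = qmp_round_trip("/tmp/qmp.sock", "query-dirty-rate")["return"]
```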

Afterwards we tried to migrate the VM with a randomly selected max downtime
and bandwidth limit. The typical prediction error is 6-7%, with only 180 out
of 5779 experiments failing badly: prediction error >=25%, or predicting
migration success when in fact the migration did not converge.


Andrei Gudkov (4):
  migration/calc-dirty-rate: replaced CRC32 with xxHash
  migration/calc-dirty-rate: detailed stats in sampling mode
  migration/calc-dirty-rate: added n-zero-pages metric
  migration/calc-dirty-rate: tool to predict migration time

 MAINTAINERS                  |   1 +
 migration/dirtyrate.c        | 228 +++++++++++++++++++++-------
 migration/dirtyrate.h        |  26 +++-
 migration/trace-events       |   4 +-
 qapi/migration.json          |  28 +++-
 scripts/predict_migration.py | 283 +++++++++++++++++++++++++++++++++++
 6 files changed, 511 insertions(+), 59 deletions(-)
 create mode 100644 scripts/predict_migration.py

-- 
2.30.2
Re: [PATCH v2 0/4] Migration time prediction using calc-dirty-rate
Posted by Peter Xu 11 months ago
Hi, Andrei,

On Thu, Apr 27, 2023 at 03:42:56PM +0300, Andrei Gudkov via wrote:
> Afterwards we tried to migrate VM after randomly selecting max downtime
> and bandwidth limit. Typical prediction error is 6-7%, with only 180 out
> of 5779 experiments failing badly: prediction error >=25% or incorrectly
> predicting migration success when in fact it didn't converge.

What's the normal size of the VMs when you did the measurements?

A major challenge of convergence comes from huge VMs, and I'm
wondering whether those are covered in the prediction verification.

Thanks,

-- 
Peter Xu
Re: [PATCH v2 0/4] Migration time prediction using calc-dirty-rate
Posted by gudkov.andrei--- via 11 months ago
On Tue, May 30, 2023 at 11:46:50AM -0400, Peter Xu wrote:
> Hi, Andrei,
> 
> On Thu, Apr 27, 2023 at 03:42:56PM +0300, Andrei Gudkov via wrote:
> > Afterwards we tried to migrate VM after randomly selecting max downtime
> > and bandwidth limit. Typical prediction error is 6-7%, with only 180 out
> > of 5779 experiments failing badly: prediction error >=25% or incorrectly
> > predicting migration success when in fact it didn't converge.
> 
> What's the normal size of the VMs when you did the measurements?

VM size in all experiments was 32GiB. However, since some of the pages
are zero, the effective VM size was smaller. I checked the value of the
precopy-bytes counter after the first migration iteration. The median value
among all experiments is 24.3GiB.

> 
> A major challenge of convergence issues come from huge VMs and I'm
> wondering whether those are covered in the prediction verifications.

Hmmm... My understanding is that convergence primarily depends on how
aggressively a VM dirties pages and not on VM size. A small VM with aggressive
writes would be impossible to migrate without throttling. On the contrary,
migration of a huge dormant VM will converge in just a single iteration
(although a long one). The only reason I can imagine why a large VM size could
negatively affect convergence is the following chain: larger VM
size => more vCPUs => more memory writes per second.
Or do you perhaps mean that during each iteration we perform
KVM_CLEAR_DIRTY_LOG, which is (I suspect) linear in time and can become a
bottleneck for large VMs? Anyway, I will conduct experiments with
large VMs.


I think that the easiest way to predict whether VM migration will converge
is the following. Run calc-dirty-rate with calc-time equal to the
desired downtime. If it reports that the volume of memory dirtied over the
calc-time period is larger than what can be copied over the network in the
same time, then you are out of luck. Alas, at the moment calc-time accepts
values in units of seconds, while a reasonable downtime lies in the range of
50-300ms. I am preparing a separate patch that will allow calc-time to be
specified in milliseconds. I hope this approach will be cleaner than the
array of hardcoded values I introduced in my original patch.
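Sketched in code, that rule of thumb reduces to a single comparison (hypothetical helper, not part of the patch; assumes the dirtied volume for the downtime-sized window has already been extracted from calc-dirty-rate output):

```python
def can_converge(dirtied_bytes, downtime_s, bandwidth_bytes_per_s):
    """Final-iteration feasibility check: the data dirtied during one
    downtime-sized window must itself be transferable within that window."""
    return dirtied_bytes <= bandwidth_bytes_per_s * downtime_s
```

For instance, with 300ms of allowed downtime on a 10 Gbps link (~1.25 GB/s), the workload may dirty at most ~375 MB per window.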

> 
> Thanks,
> 
> -- 
> Peter Xu
Re: [PATCH v2 0/4] Migration time prediction using calc-dirty-rate
Posted by Peter Xu 11 months ago
On Wed, May 31, 2023 at 05:46:40PM +0300, gudkov.andrei@huawei.com wrote:
> On Tue, May 30, 2023 at 11:46:50AM -0400, Peter Xu wrote:
> > Hi, Andrei,
> > 
> > On Thu, Apr 27, 2023 at 03:42:56PM +0300, Andrei Gudkov via wrote:
> > > Afterwards we tried to migrate VM after randomly selecting max downtime
> > > and bandwidth limit. Typical prediction error is 6-7%, with only 180 out
> > > of 5779 experiments failing badly: prediction error >=25% or incorrectly
> > > predicting migration success when in fact it didn't converge.
> > 
> > What's the normal size of the VMs when you did the measurements?
> 
> VM size in all experiments was 32GiB. However, since some of the pages
> are zero, the effective VM size was smaller. I checked the value of
> precopy-bytes counter after the first migration iteration. Median value
> among all experiments is 24.3GiB.
> 
> > 
> > A major challenge of convergence issues come from huge VMs and I'm
> > wondering whether those are covered in the prediction verifications.
> 
> Hmmm... My understanding is that convergence primarily depends on how
> agressive VM dirties pages and not on VM size. Small VM with agressive
> writes would be impossible to migrate without throttling. On the contrary,
> migration of the huge dormant VM will converge in just single iteration
> (although a long one). The only reason I can imagine why large VM size can
> negatively affect convergence is due to the following reasoning: larger VM
> size => bigger number of vCPUs => more memory writes per second.
> Or do you probably mean that during each iteration we perform
> KVM_CLEAR_DIRTY_LOG, which is (I suspect) linear in time and can become
> bottleneck for large VMs?

Partly yes, but not specifically CLEAR_LOG; more the whole process, which
may still depend on the size of guest memory. I was curious whether the
prediction can stay accurate as the memory size grows.

I assume a huge VM normally has more cores too, and it's even less
likely to be idle if there's a real customer using it (rather than a lab
setup; if I were a huge-VM tenant I wouldn't want it idle at any time). Then
with more cores there's definitely more chance of higher dirty rates,
especially with the larger memory pool.

> Anyway, I will conduct experiments with large VMs.

Thanks.

> 
> I think that the easiest way to predict whether VM migration will converge
> or not is the following. Run calc-dirty-rate with calc-time equal to
> desired downtime. If it reports that the volume of dirtied memory over
> calc-time period is larger than you can copy over network in the same time,
> then you are out of luck. Alas, at the current moment calc-time accepts
> values in units of seconds, while reasonable downtime lies in range 50-300ms.
> I am preparing a separate patch that will allow to specify calc-time in
> milliseconds. I hope that this approach will be cleaner than an array of
> hardcoded values I introduced in my original patch.

I actually haven't personally gone through the details of the new
interface yet, but what you said sounds reasonable, and I'm happy to read
the new version.

-- 
Peter Xu