V1 -> V2:
- Extracted CRC32->xxHash into separate commit
- Extracted @n-zero-samples metric into separate commit
- Added description to qapi about connection between @n-dirty-samples
  and @periods arrays
- Added (Since ...) tag to new metrics
---

The overall goal of this patch is to be able to predict the time it
would take to migrate a VM in precopy mode based on the max allowed
downtime, the network bandwidth, and metrics collected with
"calc-dirty-rate".

The predictor itself is a simple Python script that closely follows the
iterations of the migration algorithm: compute how long it would take
to copy the dirty pages, estimate the number of pages dirtied by the VM
since the beginning of the last iteration, and repeat until the
estimated iteration time fits the max allowed downtime. However, to get
reasonable accuracy, the predictor requires more metrics, which have
been implemented in "calc-dirty-rate".

Summary of calc-dirty-rate changes:

1. The most important change is that calc-dirty-rate now produces a
   *vector* of dirty page measurements for progressively increasing
   time periods: 125ms, 250, 500, 750, 1000, 1500, ..., up to the
   specified calc-time. The motivation behind this change is that the
   number of dirtied pages as a function of time, starting from a
   "clean state" (new migration iteration), is far from linear. The
   shape of this function depends on the workload type and intensity.
   Measuring the number of dirty pages at progressively increasing
   periods allows this function to be reconstructed using piece-wise
   interpolation.

2. A new metric was added -- the number of all-zero pages. The
   predictor needs to distinguish between the number of zero and
   non-zero pages because during migration only an 8-byte header is
   placed on the wire for an all-zero page.

3. The hashing function was changed from CRC32 to xxHash. This reduces
   the overhead of sampling by ~10 times, which is important since
   some of the measurement periods are now sub-second.

4. Other trivial metrics were added for convenience: total number of
   VM pages, number of sampled pages, page size.
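The iterative scheme described above can be sketched roughly as
follows. This is an illustrative model only, not the actual
scripts/predict_migration.py code: the function name is hypothetical
and, for brevity, it assumes a constant dirty rate instead of the
piece-wise interpolated curve the real predictor uses.

```python
def predict_migration_time(ram_bytes, dirty_rate_bps, bandwidth_bps,
                           max_downtime_s, max_iterations=100):
    """Return estimated total precopy migration time in seconds,
    or None if the simulated loop does not converge."""
    total = 0.0
    to_send = ram_bytes  # first iteration copies all of RAM
    for _ in range(max_iterations):
        iter_time = to_send / bandwidth_bps
        if iter_time <= max_downtime_s:
            # remaining dirty pages fit into the downtime window:
            # final stop-and-copy phase, migration converges
            return total + iter_time
        total += iter_time
        # pages dirtied while copying must be re-sent next iteration
        to_send = min(ram_bytes, dirty_rate_bps * iter_time)
    return None  # dirty rate outpaces bandwidth; did not converge
```

For example, a 32GiB VM dirtying 100MiB/s over a 1GiB/s link with
300ms allowed downtime converges after a few iterations, while the
same VM dirtying 2GiB/s never does.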
After these changes, output from calc-dirty-rate looks like this:

{
  "page-size": 4096,
  "periods": [125, 250, 375, 500, 750, 1000, 1500, 2000, 3000, 4001,
              6000, 8000, 10000, 15000, 20000, 25000, 30000, 35000,
              40000, 45000, 50000, 60000],
  "status": "measured",
  "sample-pages": 512,
  "dirty-rate": 98,
  "mode": "page-sampling",
  "n-dirty-pages": [33, 78, 119, 151, 217, 236, 293, 336, 425, 505,
                    620, 756, 898, 1204, 1457, 1723, 1934, 2141, 2328,
                    2522, 2675, 2958],
  "n-sampled-pages": 16392,
  "n-zero-pages": 10060,
  "n-total-pages": 8392704,
  "start-time": 2916750,
  "calc-time": 60
}

Passing this data into the prediction script, we get the following
estimations:

Downtime> |   125ms |   250ms |   500ms |  1000ms |  5000ms |   unlim
---------------------------------------------------------------------
100 Mbps  |       - |       - |       - |       - |       - |  16m59s
  1 Gbps  |       - |       - |       - |       - |       - |   1m40s
  2 Gbps  |       - |       - |       - |       - |   1m41s |     50s
2.5 Gbps  |       - |       - |       - |       - |   1m07s |     40s
  5 Gbps  |     48s |     46s |     31s |     28s |     25s |     20s
 10 Gbps  |     13s |     12s |     12s |     12s |     12s |     10s
 25 Gbps  |      5s |      5s |      5s |      5s |      4s |      4s
 40 Gbps  |      3s |      3s |      3s |      3s |      3s |      3s

The quality of prediction was tested with the YCSB benchmark. A
memcached instance was installed into a 32GiB VM, and a client
generated a stream of requests. Between experiments we varied the
request size distribution, the number of threads, and the location of
the client (inside or outside the VM). After a short preheat phase, we
measured calc-dirty-rate:

1. {"execute": "calc-dirty-rate", "arguments": {"calc-time": 60}}
2. Wait 60 seconds
3. Collect results with {"execute": "query-dirty-rate"}

Afterwards we tried to migrate the VM after randomly selecting the max
downtime and bandwidth limit. Typical prediction error is 6-7%, with
only 180 out of 5779 experiments failing badly: prediction error >=25%
or incorrectly predicting migration success when in fact it didn't
converge.
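The piece-wise interpolation that the "periods"/"n-dirty-pages"
vectors enable can be sketched as below. This is an illustrative
reconstruction, not code from the series; the function name is
hypothetical and the vectors are abridged from the sample output
above.

```python
import bisect

# First few entries of the calc-dirty-rate vectors shown above
periods_ms = [125, 250, 375, 500, 750, 1000]
n_dirty    = [33,  78,  119, 151, 217, 236]

def dirty_pages_at(t_ms):
    """Estimate sampled dirty pages t_ms after a 'clean state'
    (start of a migration iteration) by linear interpolation
    between the measured points."""
    if t_ms <= periods_ms[0]:
        # assume linear growth within the first measured period
        return n_dirty[0] * t_ms / periods_ms[0]
    if t_ms >= periods_ms[-1]:
        return n_dirty[-1]  # clamp at the last measurement
    i = bisect.bisect_left(periods_ms, t_ms)
    t0, t1 = periods_ms[i - 1], periods_ms[i]
    d0, d1 = n_dirty[i - 1], n_dirty[i]
    return d0 + (d1 - d0) * (t_ms - t0) / (t1 - t0)
```

This captures the non-linear shape of the dirty-page curve that a
single averaged dirty-rate number would miss.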
Andrei Gudkov (4):
  migration/calc-dirty-rate: replaced CRC32 with xxHash
  migration/calc-dirty-rate: detailed stats in sampling mode
  migration/calc-dirty-rate: added n-zero-pages metric
  migration/calc-dirty-rate: tool to predict migration time

 MAINTAINERS                  |   1 +
 migration/dirtyrate.c        | 228 +++++++++++++++++++++-------
 migration/dirtyrate.h        |  26 +++-
 migration/trace-events       |   4 +-
 qapi/migration.json          |  28 +++-
 scripts/predict_migration.py | 283 +++++++++++++++++++++++++++++++++++
 6 files changed, 511 insertions(+), 59 deletions(-)
 create mode 100644 scripts/predict_migration.py

-- 
2.30.2
Hi, Andrei,

On Thu, Apr 27, 2023 at 03:42:56PM +0300, Andrei Gudkov via wrote:
> Afterwards we tried to migrate VM after randomly selecting max downtime
> and bandwidth limit. Typical prediction error is 6-7%, with only 180 out
> of 5779 experiments failing badly: prediction error >=25% or incorrectly
> predicting migration success when in fact it didn't converge.

What's the normal size of the VMs when you did the measurements?

A major challenge of convergence issues comes from huge VMs, and I'm
wondering whether those are covered in the prediction verifications.

Thanks,

-- 
Peter Xu
On Tue, May 30, 2023 at 11:46:50AM -0400, Peter Xu wrote:
> Hi, Andrei,
> 
> On Thu, Apr 27, 2023 at 03:42:56PM +0300, Andrei Gudkov via wrote:
> > Afterwards we tried to migrate VM after randomly selecting max downtime
> > and bandwidth limit. Typical prediction error is 6-7%, with only 180 out
> > of 5779 experiments failing badly: prediction error >=25% or incorrectly
> > predicting migration success when in fact it didn't converge.
> 
> What's the normal size of the VMs when you did the measurements?

VM size in all experiments was 32GiB. However, since some of the pages
are zero, the effective VM size was smaller. I checked the value of the
precopy-bytes counter after the first migration iteration. The median
value among all experiments is 24.3GiB.

> 
> A major challenge of convergence issues comes from huge VMs, and I'm
> wondering whether those are covered in the prediction verifications.

Hmmm... My understanding is that convergence primarily depends on how
aggressively the VM dirties pages and not on VM size. A small VM with
aggressive writes would be impossible to migrate without throttling. On
the contrary, migration of a huge dormant VM will converge in just a
single iteration (although a long one). The only reason I can imagine
why large VM size can negatively affect convergence is the following
chain: larger VM size => bigger number of vCPUs => more memory writes
per second. Or do you perhaps mean that during each iteration we
perform KVM_CLEAR_DIRTY_LOG, which is (I suspect) linear in time and
can become a bottleneck for large VMs?

Anyway, I will conduct experiments with large VMs.

I think that the easiest way to predict whether VM migration will
converge or not is the following. Run calc-dirty-rate with calc-time
equal to the desired downtime. If it reports that the volume of memory
dirtied over the calc-time period is larger than you can copy over the
network in the same time, then you are out of luck.
Alas, at the current moment calc-time accepts values in units of
seconds, while reasonable downtimes lie in the range of 50-300ms. I am
preparing a separate patch that will allow specifying calc-time in
milliseconds. I hope that this approach will be cleaner than the array
of hardcoded values I introduced in my original patch.

> 
> Thanks,
> 
> -- 
> Peter Xu
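The convergence test proposed above reduces to a single comparison.
The sketch below is an editor's illustration of that reasoning, with
a hypothetical function name and made-up example numbers:

```python
def can_converge(dirtied_bytes, calc_time_s, bandwidth_bytes_per_s):
    """True if the memory dirtied during calc_time_s (equal to the
    desired downtime) could be copied over the network in that time."""
    return dirtied_bytes <= bandwidth_bytes_per_s * calc_time_s

# e.g. 120 MiB dirtied over a 300ms window vs a 10 Gbps link
link = 10e9 / 8                         # 10 Gbps in bytes/s
ok = can_converge(120 * 2**20, 0.3, link)
```

If the check fails, no amount of extra precopy iterations will help
without throttling, since each iteration re-dirties more than the link
can drain in the downtime window.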
On Wed, May 31, 2023 at 05:46:40PM +0300, gudkov.andrei@huawei.com wrote:
> On Tue, May 30, 2023 at 11:46:50AM -0400, Peter Xu wrote:
> > Hi, Andrei,
> > 
> > On Thu, Apr 27, 2023 at 03:42:56PM +0300, Andrei Gudkov via wrote:
> > > Afterwards we tried to migrate VM after randomly selecting max downtime
> > > and bandwidth limit. Typical prediction error is 6-7%, with only 180 out
> > > of 5779 experiments failing badly: prediction error >=25% or incorrectly
> > > predicting migration success when in fact it didn't converge.
> > 
> > What's the normal size of the VMs when you did the measurements?
> 
> VM size in all experiments was 32GiB. However, since some of the pages
> are zero, the effective VM size was smaller. I checked the value of the
> precopy-bytes counter after the first migration iteration. The median
> value among all experiments is 24.3GiB.
> 
> > 
> > A major challenge of convergence issues comes from huge VMs, and I'm
> > wondering whether those are covered in the prediction verifications.
> 
> Hmmm... My understanding is that convergence primarily depends on how
> aggressively the VM dirties pages and not on VM size. A small VM with
> aggressive writes would be impossible to migrate without throttling. On
> the contrary, migration of a huge dormant VM will converge in just a
> single iteration (although a long one). The only reason I can imagine
> why large VM size can negatively affect convergence is the following
> chain: larger VM size => bigger number of vCPUs => more memory writes
> per second. Or do you perhaps mean that during each iteration we
> perform KVM_CLEAR_DIRTY_LOG, which is (I suspect) linear in time and
> can become a bottleneck for large VMs?

Partly yes, but not explicitly due to CLEAR_LOG -- rather due to the
whole process, which may still depend on the size of guest memory. I
was curious whether the prediction can stay accurate as memory size
grows.
I assume a huge VM would normally have more cores too, and it's even
less likely to be idle if there's a real customer using it (rather than
in a lab; if I were a huge VM tenant I wouldn't want to leave it idle).
Then with more cores there's definitely more chance of higher dirty
rates, especially with the larger memory pool.

> Anyway, I will conduct experiments with large VMs.

Thanks.

> 
> I think that the easiest way to predict whether VM migration will
> converge or not is the following. Run calc-dirty-rate with calc-time
> equal to the desired downtime. If it reports that the volume of memory
> dirtied over the calc-time period is larger than you can copy over the
> network in the same time, then you are out of luck. Alas, at the
> current moment calc-time accepts values in units of seconds, while
> reasonable downtimes lie in the range of 50-300ms. I am preparing a
> separate patch that will allow specifying calc-time in milliseconds. I
> hope that this approach will be cleaner than the array of hardcoded
> values I introduced in my original patch.

I actually haven't personally gone through the details of the new
interface, but what you said sounds reasonable, and I'm happy to read
the new version.

-- 
Peter Xu