> -----Original Message-----
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, July 12, 2024 6:49 AM
> To: Wang, Yichen <yichen.wang@bytedance.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>; Marc-André Lureau
> <marcandre.lureau@redhat.com>; Daniel P. Berrangé <berrange@redhat.com>;
> Thomas Huth <thuth@redhat.com>; Philippe Mathieu-Daudé
> <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; Fabiano Rosas
> <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus Armbruster
> <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> qemu-devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1
> <yuan1.liu@intel.com>; Kumar, Shivam <shivam.kumar1@nutanix.com>; Ho-Ren
> (Jack) Chuang <horenchuang@bytedance.com>
> Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> zero page checking in multifd live migration.
>
> On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> > * Performance:
> >
> > We use two Intel 4th generation Xeon servers for testing.
> >
> > Architecture:        x86_64
> > CPU(s):              192
> > Thread(s) per core:  2
> > Core(s) per socket:  48
> > Socket(s):           2
> > NUMA node(s):        2
> > Vendor ID:           GenuineIntel
> > CPU family:          6
> > Model:               143
> > Model name:          Intel(R) Xeon(R) Platinum 8457C
> > Stepping:            8
> > CPU MHz:             2538.624
> > CPU max MHz:         3800.0000
> > CPU min MHz:         800.0000
> >
> > We perform multifd live migration with below setup:
> > 1. VM has 100GB memory.
> > 2. Use the new migration option multifd-set-normal-page-ratio to control the total
> >    size of the payload sent over the network.
> > 3. Use 8 multifd channels.
> > 4. Use tcp for live migration.
> > 4. Use CPU to perform zero page checking as the baseline.
> > 5. Use one DSA device to offload zero page checking to compare with the baseline.
> > 6. Use "perf sched record" and "perf sched timehist" to analyze CPU usage.
> >
> > A) Scenario 1: 50% (50GB) normal pages on an 100GB vm.
> >
> > CPU usage
> >
> > |---------------|---------------|---------------|---------------|
> > |               |comm           |runtime(msec)  |totaltime(msec)|
> > |---------------|---------------|---------------|---------------|
> > |Baseline       |live_migration |5657.58        |               |
> > |               |multifdsend_0  |3931.563       |               |
> > |               |multifdsend_1  |4405.273       |               |
> > |               |multifdsend_2  |3941.968       |               |
> > |               |multifdsend_3  |5032.975       |               |
> > |               |multifdsend_4  |4533.865       |               |
> > |               |multifdsend_5  |4530.461       |               |
> > |               |multifdsend_6  |5171.916       |               |
> > |               |multifdsend_7  |4722.769       |41922          |
> > |---------------|---------------|---------------|---------------|
> > |DSA            |live_migration |6129.168       |               |
> > |               |multifdsend_0  |2954.717       |               |
> > |               |multifdsend_1  |2766.359       |               |
> > |               |multifdsend_2  |2853.519       |               |
> > |               |multifdsend_3  |2740.717       |               |
> > |               |multifdsend_4  |2824.169       |               |
> > |               |multifdsend_5  |2966.908       |               |
> > |               |multifdsend_6  |2611.137       |               |
> > |               |multifdsend_7  |3114.732       |               |
> > |               |dsa_completion |3612.564       |32568          |
> > |---------------|---------------|---------------|---------------|
> >
> > Baseline total runtime is calculated by adding up all multifdsend_X
> > and live_migration threads runtime. DSA offloading total runtime is
> > calculated by adding up all multifdsend_X, live_migration and
> > dsa_completion threads runtime. 41922 msec VS 32568 msec runtime and
> > that is 23% total CPU usage savings.
>
> Here the DSA was mostly idle.
>
> Sounds good but a question: what if several qemu instances are
> migrated in parallel?
>
> Some accelerators tend to basically stall if several tasks
> are trying to use them at the same time.
>
> Where is the boundary here?

A DSA device can be assigned to multiple QEMU instances.
The DSA resource used by each process is called a work queue. Each DSA
device supports up to 8 work queues, and work queues are classified into
dedicated queues and shared queues.

A dedicated queue can serve only one process. In theory, there is no limit
on the number of processes that can use a shared queue, because shared
queues are based on enqcmd + SVM technology.

https://www.kernel.org/doc/html/v5.17/x86/sva.html

> --
> MST
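
As a cross-check of the CPU usage table quoted above, the per-thread runtimes can simply be summed. The sketch below is illustrative only and is not part of the patch series; the variable names are made up for this example, and the values are copied from the table. The sums land within a few milliseconds of the quoted totals, and the relative saving works out to a little over 22%, close to the quoted 23%.

```python
# Per-thread runtimes in msec, copied from the quoted "CPU usage" table.
baseline_runtimes = {
    "live_migration": 5657.58,
    "multifdsend_0": 3931.563,
    "multifdsend_1": 4405.273,
    "multifdsend_2": 3941.968,
    "multifdsend_3": 5032.975,
    "multifdsend_4": 4533.865,
    "multifdsend_5": 4530.461,
    "multifdsend_6": 5171.916,
    "multifdsend_7": 4722.769,
}
dsa_runtimes = {
    "live_migration": 6129.168,
    "multifdsend_0": 2954.717,
    "multifdsend_1": 2766.359,
    "multifdsend_2": 2853.519,
    "multifdsend_3": 2740.717,
    "multifdsend_4": 2824.169,
    "multifdsend_5": 2966.908,
    "multifdsend_6": 2611.137,
    "multifdsend_7": 3114.732,
    "dsa_completion": 3612.564,  # extra thread that only exists with DSA
}

baseline_total = sum(baseline_runtimes.values())
dsa_total = sum(dsa_runtimes.values())

# Fraction of total CPU time saved by offloading zero-page checking.
saving = (baseline_total - dsa_total) / baseline_total

print(f"baseline: {baseline_total:.2f} msec")
print(f"DSA:      {dsa_total:.2f} msec")
print(f"saving:   {saving:.1%}")
```

Note that the DSA total includes the dsa_completion thread, so the saving already accounts for the CPU cost of polling for completions.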
On Mon, Jul 15, 2024 at 08:29:03AM +0000, Liu, Yuan1 wrote:
> A DSA device can be assigned to multiple QEMU instances.
> The DSA resource used by each process is called a work queue. Each DSA
> device supports up to 8 work queues, and work queues are classified into
> dedicated queues and shared queues.
>
> A dedicated queue can serve only one process. In theory, there is no limit
> on the number of processes that can use a shared queue, because shared
> queues are based on enqcmd + SVM technology.
>
> https://www.kernel.org/doc/html/v5.17/x86/sva.html

This server has 200 CPUs, which could conceivably migrate around 100
single-CPU QEMU instances with no issue. What happens if you do this with DSA?
> -----Original Message-----
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, July 15, 2024 8:24 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> zero page checking in multifd live migration.
>
> This server has 200 CPUs, which could conceivably migrate around 100
> single-CPU QEMU instances with no issue. What happens if you do this with DSA?

First, the DSA work queue needs to be configured in shared mode, and one
queue is enough.

The maximum depth of a DSA hardware work queue is 128, which means that no
more than 128 zero-page detection tasks can be queued at the same time.
Beyond that, enqcmd returns an error until the work queue has space again.

I don't have any data yet for 100 QEMU instances migrating concurrently.
I think the 100 zero-page detection tasks can be submitted successfully to
the DSA hardware work queue, but the throughput of DSA's zero-page detection
also needs to be considered. Once the DSA maximum throughput is reached, the
work queue may fill up quickly, which will leave some QEMU instances
temporarily unable to submit new tasks to DSA. This is likely to happen in
the first round of migration memory iteration.
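
The queue-full behaviour described above can be modelled in a few lines. This is a toy simulation, not QEMU or idxd driver code: `SharedWorkQueue`, `submit`, `complete_one` and `submit_with_retry` are hypothetical names invented for this sketch, and the only figure taken from the thread is the queue depth of 128. A real submitter would issue enqcmd and retry when the hardware reports that the queue is full.

```python
from collections import deque

WQ_DEPTH = 128  # maximum work queue depth quoted in the thread


class SharedWorkQueue:
    """Toy model of one shared work queue (hypothetical, for illustration)."""

    def __init__(self, depth=WQ_DEPTH):
        self.depth = depth
        self.pending = deque()

    def submit(self, task):
        """Return True if the task was accepted, False if the queue is full."""
        if len(self.pending) >= self.depth:
            return False  # caller has to retry later
        self.pending.append(task)
        return True

    def complete_one(self):
        """Model the device finishing the oldest task, freeing one slot."""
        return self.pending.popleft() if self.pending else None


def submit_with_retry(wq, task, max_attempts=3):
    """Try to submit; between attempts, let the device drain one task.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if wq.submit(task):
            return attempt
        wq.complete_one()  # stand-in for waiting until a slot frees up
    return None


wq = SharedWorkQueue()
# 130 submissions arrive back to back; only the first 128 fit.
accepted = sum(wq.submit(("qemu", i)) for i in range(130))
# A later submitter is rejected once, then succeeds after one task completes.
retry_attempts = submit_with_retry(wq, ("qemu", "late"))
print(accepted, retry_attempts)
```

In this model the rejected submitters are the ones that would be "temporarily unable to submit new tasks": how long they wait depends entirely on how fast the device drains the queue, which is the throughput question raised in the message.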
On Mon, Jul 15, 2024 at 01:09:59PM +0000, Liu, Yuan1 wrote:
> First, the DSA work queue needs to be configured in shared mode, and one
> queue is enough.
>
> The maximum depth of a DSA hardware work queue is 128, which means that no
> more than 128 zero-page detection tasks can be queued at the same time.
> Beyond that, enqcmd returns an error until the work queue has space again.
>
> I don't have any data yet for 100 QEMU instances migrating concurrently.
> I think the 100 zero-page detection tasks can be submitted successfully to
> the DSA hardware work queue, but the throughput of DSA's zero-page detection
> also needs to be considered. Once the DSA maximum throughput is reached, the
> work queue may fill up quickly, which will leave some QEMU instances
> temporarily unable to submit new tasks to DSA.

The unfortunate reality here would be that there's likely no QoS; this
is purely FIFO, right?

> This is likely to happen in the first round of migration
> memory iteration.

Try testing this and see then?

--
MST
> -----Original Message-----
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, July 15, 2024 10:43 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> zero page checking in multifd live migration.
>
> The unfortunate reality here would be that there's likely no QoS; this
> is purely FIFO, right?

Yes, this scenario may be FIFO, assuming that each task covers the same
number of pages, because the DSA hardware consists of multiple work engines
that process tasks concurrently and usually fetch tasks from the work queue
in a round-robin way.

DSA supports priority and flow control at work queue granularity.
https://github.com/intel/idxd-config/blob/stable/Documentation/accfg/accel-config-config-wq.txt

> > This is likely to happen in the first round of migration
> > memory iteration.
>
> Try testing this and see then?

Yes, I can test based on this patch set. Please review the test scenario:
My server has 192 CPUs, 8 DSA devices, and a 100Gbps NIC.
All 8 DSA devices serve 100 QEMU instances for simultaneous live migration.
Each VM has 1 vCPU and 1 GB of memory, with no workload in the VM.

You want to know if some QEMU instances are stalled because of DSA, right?

> --
> MST
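
The remark that the behaviour is FIFO only when every task has the same size can be illustrated with a small model. This is a simplified sketch, not a description of the actual DSA scheduler: it assumes tasks leave one queue in submission order and that whichever engine becomes free first takes the next task. The function name `completion_order` and the task sizes are invented for the example.

```python
import heapq


def completion_order(task_sizes, num_engines):
    """Dispatch tasks in FIFO order to whichever engine frees up first.

    task_sizes: processing time of each task, in submission order.
    Returns task indices sorted by completion time (ties keep FIFO order).
    """
    # Each heap entry is (time the engine becomes free, engine id).
    engines = [(0, e) for e in range(num_engines)]
    heapq.heapify(engines)
    finished = []
    for idx, size in enumerate(task_sizes):
        free_at, engine = heapq.heappop(engines)
        done_at = free_at + size
        finished.append((done_at, idx))
        heapq.heappush(engines, (done_at, engine))
    return [idx for _, idx in sorted(finished)]


# Equal-sized tasks complete in the order they were submitted.
equal = completion_order([4, 4, 4, 4, 4, 4], num_engines=2)
# One large task at the head lets later small tasks finish first.
mixed = completion_order([8, 1, 1, 1], num_engines=2)
print(equal)
print(mixed)
```

With equal sizes the completion order matches the submission order. With mixed sizes the small tasks overtake the large one, which is why the answer above is qualified with the same-number-of-pages assumption.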
> -----Original Message----- > From: Liu, Yuan1 > Sent: Monday, July 15, 2024 11:23 PM > To: Michael S. Tsirkin <mst@redhat.com> > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini > <pbonzini@redhat.com>; Marc-André Lureau <marcandre.lureau@redhat.com>; > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth <thuth@redhat.com>; > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu- > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang > <horenchuang@bytedance.com> > Subject: RE: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload > zero page checking in multifd live migration. > > > -----Original Message----- > > From: Michael S. Tsirkin <mst@redhat.com> > > Sent: Monday, July 15, 2024 10:43 PM > > To: Liu, Yuan1 <yuan1.liu@intel.com> > > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini > > <pbonzini@redhat.com>; Marc-André Lureau <marcandre.lureau@redhat.com>; > > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth > <thuth@redhat.com>; > > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu > <peterx@redhat.com>; > > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus > > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu- > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam > > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang > > <horenchuang@bytedance.com> > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload > > zero page checking in multifd live migration. > > > > On Mon, Jul 15, 2024 at 01:09:59PM +0000, Liu, Yuan1 wrote: > > > > -----Original Message----- > > > > From: Michael S. 
Tsirkin <mst@redhat.com> > > > > Sent: Monday, July 15, 2024 8:24 PM > > > > To: Liu, Yuan1 <yuan1.liu@intel.com> > > > > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini > > > > <pbonzini@redhat.com>; Marc-André Lureau > > <marcandre.lureau@redhat.com>; > > > > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth > > <thuth@redhat.com>; > > > > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu > > <peterx@redhat.com>; > > > > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; > > Markus > > > > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; > > qemu- > > > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam > > > > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang > > > > <horenchuang@bytedance.com> > > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to > > offload > > > > zero page checking in multifd live migration. > > > > > > > > On Mon, Jul 15, 2024 at 08:29:03AM +0000, Liu, Yuan1 wrote: > > > > > > -----Original Message----- > > > > > > From: Michael S. Tsirkin <mst@redhat.com> > > > > > > Sent: Friday, July 12, 2024 6:49 AM > > > > > > To: Wang, Yichen <yichen.wang@bytedance.com> > > > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>; Marc-André Lureau > > > > > > <marcandre.lureau@redhat.com>; Daniel P. 
Berrangé > > > > <berrange@redhat.com>; > > > > > > Thomas Huth <thuth@redhat.com>; Philippe Mathieu-Daudé > > > > > > <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; Fabiano Rosas > > > > > > <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus > > Armbruster > > > > > > <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu- > > > > > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1 > > > > > > <yuan1.liu@intel.com>; Kumar, Shivam > <shivam.kumar1@nutanix.com>; > > Ho- > > > > Ren > > > > > > (Jack) Chuang <horenchuang@bytedance.com> > > > > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to > > > > offload > > > > > > zero page checking in multifd live migration. > > > > > > > > > > > > On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote: > > > > > > > * Performance: > > > > > > > > > > > > > > We use two Intel 4th generation Xeon servers for testing. > > > > > > > > > > > > > > Architecture: x86_64 > > > > > > > CPU(s): 192 > > > > > > > Thread(s) per core: 2 > > > > > > > Core(s) per socket: 48 > > > > > > > Socket(s): 2 > > > > > > > NUMA node(s): 2 > > > > > > > Vendor ID: GenuineIntel > > > > > > > CPU family: 6 > > > > > > > Model: 143 > > > > > > > Model name: Intel(R) Xeon(R) Platinum 8457C > > > > > > > Stepping: 8 > > > > > > > CPU MHz: 2538.624 > > > > > > > CPU max MHz: 3800.0000 > > > > > > > CPU min MHz: 800.0000 > > > > > > > > > > > > > > We perform multifd live migration with below setup: > > > > > > > 1. VM has 100GB memory. > > > > > > > 2. Use the new migration option multifd-set-normal-page-ratio > to > > > > control > > > > > > the total > > > > > > > size of the payload sent over the network. > > > > > > > 3. Use 8 multifd channels. > > > > > > > 4. Use tcp for live migration. > > > > > > > 4. Use CPU to perform zero page checking as the baseline. > > > > > > > 5. Use one DSA device to offload zero page checking to compare > > with > > > > the > > > > > > baseline. > > > > > > > 6. 
Use "perf sched record" and "perf sched timehist" to > analyze > > CPU > > > > > > usage. > > > > > > > > > > > > > > A) Scenario 1: 50% (50GB) normal pages on an 100GB vm. > > > > > > > > > > > > > > CPU usage > > > > > > > > > > > > > > |---------------|---------------|---------------|------- > ------ > > > > --| > > > > > > > | |comm |runtime(msec) > |totaltime(msec)| > > > > > > > |---------------|---------------|---------------|------- > ------ > > > > --| > > > > > > > |Baseline |live_migration |5657.58 | | > > > > > > > | |multifdsend_0 |3931.563 | | > > > > > > > | |multifdsend_1 |4405.273 | | > > > > > > > | |multifdsend_2 |3941.968 | | > > > > > > > | |multifdsend_3 |5032.975 | | > > > > > > > | |multifdsend_4 |4533.865 | | > > > > > > > | |multifdsend_5 |4530.461 | | > > > > > > > | |multifdsend_6 |5171.916 | | > > > > > > > | |multifdsend_7 |4722.769 |41922 > | > > > > > > > |---------------|---------------|---------------|------- > ------ > > > > --| > > > > > > > |DSA |live_migration |6129.168 | | > > > > > > > | |multifdsend_0 |2954.717 | | > > > > > > > | |multifdsend_1 |2766.359 | | > > > > > > > | |multifdsend_2 |2853.519 | | > > > > > > > | |multifdsend_3 |2740.717 | | > > > > > > > | |multifdsend_4 |2824.169 | | > > > > > > > | |multifdsend_5 |2966.908 | | > > > > > > > | |multifdsend_6 |2611.137 | | > > > > > > > | |multifdsend_7 |3114.732 | | > > > > > > > | |dsa_completion |3612.564 |32568 > | > > > > > > > |---------------|---------------|---------------|------- > ------ > > > > --| > > > > > > > > > > > > > > Baseline total runtime is calculated by adding up all > > multifdsend_X > > > > > > > and live_migration threads runtime. DSA offloading total > runtime > > is > > > > > > > calculated by adding up all multifdsend_X, live_migration and > > > > > > > dsa_completion threads runtime. 41922 msec VS 32568 msec > runtime > > and > > > > > > > that is 23% total CPU usage savings. 
> > > > > > > > > > > > > > > > > > Here the DSA was mostly idle. > > > > > > > > > > > > Sounds good but a question: what if several qemu instances are > > > > > > migrated in parallel? > > > > > > > > > > > > Some accelerators tend to basically stall if several tasks > > > > > > are trying to use them at the same time. > > > > > > > > > > > > Where is the boundary here? If I understand correctly, you are concerned that in some scenarios the accelerator itself is the migration bottleneck, causing the migration performance to be degraded. My understanding is to make full use of the accelerator bandwidth, and once the accelerator is the bottleneck, it will fall back to zero-page detection by the CPU. For example, when the enqcmd command returns an error which means the work queue is full, then we can add some retry mechanisms or directly use CPU detection. > > > > > A DSA device can be assigned to multiple Qemu instances. > > > > > The DSA resource used by each process is called a work queue, each > > DSA > > > > > device can support up to 8 work queues and work queues are > > classified > > > > into > > > > > dedicated queues and shared queues. > > > > > > > > > > A dedicated queue can only serve one process. Theoretically, there > > is no > > > > limit > > > > > on the number of processes in a shared queue, it is based on > enqcmd > > + > > > > SVM technology. > > > > > > > > > > https://www.kernel.org/doc/html/v5.17/x86/sva.html > > > > > > > > This server has 200 CPUs which can thinkably migrate around 100 > single > > > > cpu qemu instances with no issue. What happens if you do this with > > DSA? > > > > > > First, the DSA work queue needs to be configured in shared mode, and > one > > > queue is enough. 
> > > > > > The maximum depth of the work queue of the DSA hardware is 128, which > > means > > > that the number of zero-page detection tasks submitted cannot exceed > > 128, > > > otherwise, enqcmd will return an error until the work queue is > available > > again > > > > > > 100 Qemu instances need to be migrated concurrently, I don't have any > > data on > > > this yet, I think the 100 zero-page detection tasks can be > successfully > > submitted > > > to the DSA hardware work queue, but the throughput of DSA's zero-page > > detection also > > > needs to be considered. Once the DSA maximum throughput is reached, > the > > work queue > > > may be filled up quickly, this will cause some Qemu instances to be > > temporarily unable > > > to submit new tasks to DSA. > > > > The unfortunate reality here would be that there's likely no QoS, this > > is purely fifo, right? > > Yes, this scenario may be fifo, assuming that the number of pages each > task > is the same, because DSA hardware consists of multiple work engines, they > can > process tasks concurrently, usually in a round-robin way to get tasks from > the > work queue. > > DSA supports priority and flow control based on work queue granularity. > https://github.com/intel/idxd- > config/blob/stable/Documentation/accfg/accel-config-config-wq.txt > > > > This is likely to happen in the first round of migration > > > memory iteration. > > > > Try testing this and see then? > > Yes, I can test based on this patch set. Please review the test scenario > My server has 192 CPUs, and 8 DSA devices, 100Gbps NIC. > All 8 DSA devices serve 100 Qemu instances for simultaneous live > migration. > Each VM has 1 vCPU, and 1G memory, with no workload in the VM. > > You want to know if some Qemu instances are stalled because of DSA, right? > > > -- > > MST
On Mon, Jul 15, 2024 at 03:57:42PM +0000, Liu, Yuan1 wrote:
> > > > > > > > that is 23% total CPU usage savings.
> > > > > > >
> > > > > > >
> > > > > > > Here the DSA was mostly idle.
> > > > > > >
> > > > > > > Sounds good but a question: what if several qemu instances are
> > > > > > > migrated in parallel?
> > > > > > >
> > > > > > > Some accelerators tend to basically stall if several tasks
> > > > > > > are trying to use them at the same time.
> > > > > > >
> > > > > > > Where is the boundary here?
>
> If I understand correctly, you are concerned that in some scenarios the
> accelerator itself is the migration bottleneck, causing the migration performance
> to be degraded.
>
> My understanding is to make full use of the accelerator bandwidth, and once
> the accelerator is the bottleneck, it will fall back to zero-page detection
> by the CPU.
>
> For example, when the enqcmd command returns an error which means the work queue
> is full, then we can add some retry mechanisms or directly use CPU detection.

How is it handled in your patch? If you just abort migration unless
enqcmd succeeds, then would that not be a bug, where loading the system
leads to migration failures?

--
MST
> -----Original Message----- > From: Michael S. Tsirkin <mst@redhat.com> > Sent: Tuesday, July 16, 2024 12:24 AM > To: Liu, Yuan1 <yuan1.liu@intel.com> > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini > <pbonzini@redhat.com>; Marc-André Lureau <marcandre.lureau@redhat.com>; > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth <thuth@redhat.com>; > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu- > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang > <horenchuang@bytedance.com> > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload > zero page checking in multifd live migration. > > On Mon, Jul 15, 2024 at 03:57:42PM +0000, Liu, Yuan1 wrote: > > > > > > > > > that is 23% total CPU usage savings. > > > > > > > > > > > > > > > > > > > > > > > > Here the DSA was mostly idle. > > > > > > > > > > > > > > > > Sounds good but a question: what if several qemu instances > are > > > > > > > > migrated in parallel? > > > > > > > > > > > > > > > > Some accelerators tend to basically stall if several tasks > > > > > > > > are trying to use them at the same time. > > > > > > > > > > > > > > > > Where is the boundary here? > > > > If I understand correctly, you are concerned that in some scenarios the > > accelerator itself is the migration bottleneck, causing the migration > performance > > to be degraded. > > > > My understanding is to make full use of the accelerator bandwidth, and > once > > the accelerator is the bottleneck, it will fall back to zero-page > detection > > by the CPU. > > > > For example, when the enqcmd command returns an error which means the > work queue > > is full, then we can add some retry mechanisms or directly use CPU > detection. > > > How is it handled in your patch? 
If you just abort migration unless
> enqcmd succeeds then would that not be a bug, where loading the system
> leads to migration failures?

Sorry about that, I have only just started reviewing this patch. What we
discussed earlier concerns only the DSA device itself and may not reflect
this patch's implementation.

I will carefully review the issue you raised. Thank you for the reminder.

> --
> MST
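The retry-then-fall-back idea raised above (a full shared work queue makes
the submission fail, so the caller retries a few times and otherwise checks
the page on the CPU instead of aborting) can be sketched roughly as below.
This is only an illustration under stated assumptions and not how the patch
set handles it: MockWorkQueue, mock_submit, page_is_zero_cpu,
check_zero_page and MAX_RETRIES are all hypothetical names, the queue is a
plain counter rather than real DSA hardware, and the only figure taken from
the thread is the work queue depth of 128.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ZP_PAGE_SIZE 4096   /* assumed page size for the sketch */
#define WQ_DEPTH     128    /* work queue depth quoted in the thread */
#define MAX_RETRIES  3      /* arbitrary retry budget, illustration only */

/* Hypothetical stand-in for a shared work queue: it only counts
 * descriptors that are currently in flight. */
typedef struct {
    int in_flight;
} MockWorkQueue;

typedef enum {
    CHECKED_BY_ACCEL,   /* submission was accepted by the (mock) queue */
    CHECKED_BY_CPU      /* queue stayed full, CPU did the check */
} CheckPath;

/* Hypothetical submit: returns false when the queue is full, modelling
 * the "work queue is full" error described in the thread. */
static bool mock_submit(MockWorkQueue *wq)
{
    if (wq->in_flight >= WQ_DEPTH) {
        return false;
    }
    wq->in_flight++;
    return true;
}

/* Plain CPU zero-page check, used as the fallback path. */
static bool page_is_zero_cpu(const uint8_t *page)
{
    for (size_t i = 0; i < ZP_PAGE_SIZE; i++) {
        if (page[i] != 0) {
            return false;
        }
    }
    return true;
}

/* Try the queue up to MAX_RETRIES times, then fall back to the CPU.
 * The migration is never failed just because the queue is busy. */
static CheckPath check_zero_page(MockWorkQueue *wq, const uint8_t *page,
                                 bool *is_zero)
{
    assert(wq != NULL && page != NULL && is_zero != NULL);

    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        if (mock_submit(wq)) {
            /* The mock has no device, so the result is computed inline
             * and the queue slot is released immediately. */
            *is_zero = page_is_zero_cpu(page);
            wq->in_flight--;
            return CHECKED_BY_ACCEL;
        }
    }

    *is_zero = page_is_zero_cpu(page);
    return CHECKED_BY_CPU;
}
```

With an idle queue, check_zero_page() reports CHECKED_BY_ACCEL. With a queue
that is already holding WQ_DEPTH descriptors, it reports CHECKED_BY_CPU and
still returns the correct zero/non-zero answer. That is the property the
question above is asking about: load on the accelerator degrades to CPU
work rather than to a migration failure.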
On Mon, Jul 15, 2024 at 03:23:13PM +0000, Liu, Yuan1 wrote: > > -----Original Message----- > > From: Michael S. Tsirkin <mst@redhat.com> > > Sent: Monday, July 15, 2024 10:43 PM > > To: Liu, Yuan1 <yuan1.liu@intel.com> > > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini > > <pbonzini@redhat.com>; Marc-André Lureau <marcandre.lureau@redhat.com>; > > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth <thuth@redhat.com>; > > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; > > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus > > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu- > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam > > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang > > <horenchuang@bytedance.com> > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload > > zero page checking in multifd live migration. > > > > On Mon, Jul 15, 2024 at 01:09:59PM +0000, Liu, Yuan1 wrote: > > > > -----Original Message----- > > > > From: Michael S. Tsirkin <mst@redhat.com> > > > > Sent: Monday, July 15, 2024 8:24 PM > > > > To: Liu, Yuan1 <yuan1.liu@intel.com> > > > > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini > > > > <pbonzini@redhat.com>; Marc-André Lureau > > <marcandre.lureau@redhat.com>; > > > > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth > > <thuth@redhat.com>; > > > > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu > > <peterx@redhat.com>; > > > > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; > > Markus > > > > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; > > qemu- > > > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam > > > > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang > > > > <horenchuang@bytedance.com> > > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to > > offload > > > > zero page checking in multifd live migration. 
> > > > > > > > On Mon, Jul 15, 2024 at 08:29:03AM +0000, Liu, Yuan1 wrote: > > > > > > -----Original Message----- > > > > > > From: Michael S. Tsirkin <mst@redhat.com> > > > > > > Sent: Friday, July 12, 2024 6:49 AM > > > > > > To: Wang, Yichen <yichen.wang@bytedance.com> > > > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>; Marc-André Lureau > > > > > > <marcandre.lureau@redhat.com>; Daniel P. Berrangé > > > > <berrange@redhat.com>; > > > > > > Thomas Huth <thuth@redhat.com>; Philippe Mathieu-Daudé > > > > > > <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; Fabiano Rosas > > > > > > <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus > > Armbruster > > > > > > <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu- > > > > > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1 > > > > > > <yuan1.liu@intel.com>; Kumar, Shivam <shivam.kumar1@nutanix.com>; > > Ho- > > > > Ren > > > > > > (Jack) Chuang <horenchuang@bytedance.com> > > > > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to > > > > offload > > > > > > zero page checking in multifd live migration. > > > > > > > > > > > > On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote: > > > > > > > * Performance: > > > > > > > > > > > > > > We use two Intel 4th generation Xeon servers for testing. > > > > > > > > > > > > > > Architecture: x86_64 > > > > > > > CPU(s): 192 > > > > > > > Thread(s) per core: 2 > > > > > > > Core(s) per socket: 48 > > > > > > > Socket(s): 2 > > > > > > > NUMA node(s): 2 > > > > > > > Vendor ID: GenuineIntel > > > > > > > CPU family: 6 > > > > > > > Model: 143 > > > > > > > Model name: Intel(R) Xeon(R) Platinum 8457C > > > > > > > Stepping: 8 > > > > > > > CPU MHz: 2538.624 > > > > > > > CPU max MHz: 3800.0000 > > > > > > > CPU min MHz: 800.0000 > > > > > > > > > > > > > > We perform multifd live migration with below setup: > > > > > > > 1. VM has 100GB memory. > > > > > > > 2. 
Use the new migration option multifd-set-normal-page-ratio to > > > > control > > > > > > the total > > > > > > > size of the payload sent over the network. > > > > > > > 3. Use 8 multifd channels. > > > > > > > 4. Use tcp for live migration. > > > > > > > 4. Use CPU to perform zero page checking as the baseline. > > > > > > > 5. Use one DSA device to offload zero page checking to compare > > with > > > > the > > > > > > baseline. > > > > > > > 6. Use "perf sched record" and "perf sched timehist" to analyze > > CPU > > > > > > usage. > > > > > > > > > > > > > > A) Scenario 1: 50% (50GB) normal pages on an 100GB vm. > > > > > > > > > > > > > > CPU usage > > > > > > > > > > > > > > |---------------|---------------|---------------|------------- > > > > --| > > > > > > > | |comm |runtime(msec) |totaltime(msec)| > > > > > > > |---------------|---------------|---------------|------------- > > > > --| > > > > > > > |Baseline |live_migration |5657.58 | | > > > > > > > | |multifdsend_0 |3931.563 | | > > > > > > > | |multifdsend_1 |4405.273 | | > > > > > > > | |multifdsend_2 |3941.968 | | > > > > > > > | |multifdsend_3 |5032.975 | | > > > > > > > | |multifdsend_4 |4533.865 | | > > > > > > > | |multifdsend_5 |4530.461 | | > > > > > > > | |multifdsend_6 |5171.916 | | > > > > > > > | |multifdsend_7 |4722.769 |41922 | > > > > > > > |---------------|---------------|---------------|------------- > > > > --| > > > > > > > |DSA |live_migration |6129.168 | | > > > > > > > | |multifdsend_0 |2954.717 | | > > > > > > > | |multifdsend_1 |2766.359 | | > > > > > > > | |multifdsend_2 |2853.519 | | > > > > > > > | |multifdsend_3 |2740.717 | | > > > > > > > | |multifdsend_4 |2824.169 | | > > > > > > > | |multifdsend_5 |2966.908 | | > > > > > > > | |multifdsend_6 |2611.137 | | > > > > > > > | |multifdsend_7 |3114.732 | | > > > > > > > | |dsa_completion |3612.564 |32568 | > > > > > > > |---------------|---------------|---------------|------------- > > > > --| > > > > > > > > > > > > > > 
Baseline total runtime is calculated by adding up all > > multifdsend_X > > > > > > > and live_migration threads runtime. DSA offloading total runtime > > is > > > > > > > calculated by adding up all multifdsend_X, live_migration and > > > > > > > dsa_completion threads runtime. 41922 msec VS 32568 msec runtime > > and > > > > > > > that is 23% total CPU usage savings. > > > > > > > > > > > > > > > > > > Here the DSA was mostly idle. > > > > > > > > > > > > Sounds good but a question: what if several qemu instances are > > > > > > migrated in parallel? > > > > > > > > > > > > Some accelerators tend to basically stall if several tasks > > > > > > are trying to use them at the same time. > > > > > > > > > > > > Where is the boundary here? > > > > > > > > > > A DSA device can be assigned to multiple Qemu instances. > > > > > The DSA resource used by each process is called a work queue, each > > DSA > > > > > device can support up to 8 work queues and work queues are > > classified > > > > into > > > > > dedicated queues and shared queues. > > > > > > > > > > A dedicated queue can only serve one process. Theoretically, there > > is no > > > > limit > > > > > on the number of processes in a shared queue, it is based on enqcmd > > + > > > > SVM technology. > > > > > > > > > > https://www.kernel.org/doc/html/v5.17/x86/sva.html > > > > > > > > This server has 200 CPUs which can thinkably migrate around 100 single > > > > cpu qemu instances with no issue. What happens if you do this with > > DSA? > > > > > > First, the DSA work queue needs to be configured in shared mode, and one > > > queue is enough. 
> > > > > > The maximum depth of the work queue of the DSA hardware is 128, which > > means > > > that the number of zero-page detection tasks submitted cannot exceed > > 128, > > > otherwise, enqcmd will return an error until the work queue is available > > again > > > > > > 100 Qemu instances need to be migrated concurrently, I don't have any > > data on > > > this yet, I think the 100 zero-page detection tasks can be successfully > > submitted > > > to the DSA hardware work queue, but the throughput of DSA's zero-page > > detection also > > > needs to be considered. Once the DSA maximum throughput is reached, the > > work queue > > > may be filled up quickly, this will cause some Qemu instances to be > > temporarily unable > > > to submit new tasks to DSA. > > > > The unfortunate reality here would be that there's likely no QoS, this > > is purely fifo, right? > > Yes, this scenario may be fifo, assuming that the number of pages each task > is the same, because DSA hardware consists of multiple work engines, they can > process tasks concurrently, usually in a round-robin way to get tasks from the > work queue. > > DSA supports priority and flow control based on work queue granularity. > https://github.com/intel/idxd-config/blob/stable/Documentation/accfg/accel-config-config-wq.txt Right but it seems clear there aren't enough work queues for a typical setup. > > > This is likely to happen in the first round of migration > > > memory iteration. > > > > Try testing this and see then? > > Yes, I can test based on this patch set. Please review the test scenario > My server has 192 CPUs, and 8 DSA devices, 100Gbps NIC. > All 8 DSA devices serve 100 Qemu instances for simultaneous live migration. > Each VM has 1 vCPU, and 1G memory, with no workload in the VM. > > You want to know if some Qemu instances are stalled because of DSA, right? And generally just run same benchmark you did compared to cpu: worst case and average numbers would be interesting. > > -- > > MST
> -----Original Message----- > From: Michael S. Tsirkin <mst@redhat.com> > Sent: Tuesday, July 16, 2024 12:09 AM > To: Liu, Yuan1 <yuan1.liu@intel.com> > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini > <pbonzini@redhat.com>; Marc-André Lureau <marcandre.lureau@redhat.com>; > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth <thuth@redhat.com>; > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu- > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang > <horenchuang@bytedance.com> > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload > zero page checking in multifd live migration. > > On Mon, Jul 15, 2024 at 03:23:13PM +0000, Liu, Yuan1 wrote: > > > -----Original Message----- > > > From: Michael S. Tsirkin <mst@redhat.com> > > > Sent: Monday, July 15, 2024 10:43 PM > > > To: Liu, Yuan1 <yuan1.liu@intel.com> > > > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini > > > <pbonzini@redhat.com>; Marc-André Lureau > <marcandre.lureau@redhat.com>; > > > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth > <thuth@redhat.com>; > > > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu > <peterx@redhat.com>; > > > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; > Markus > > > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; > qemu- > > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam > > > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang > > > <horenchuang@bytedance.com> > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to > offload > > > zero page checking in multifd live migration. > > > > > > On Mon, Jul 15, 2024 at 01:09:59PM +0000, Liu, Yuan1 wrote: > > > > > -----Original Message----- > > > > > From: Michael S. 
Tsirkin <mst@redhat.com> > > > > > Sent: Monday, July 15, 2024 8:24 PM > > > > > To: Liu, Yuan1 <yuan1.liu@intel.com> > > > > > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini > > > > > <pbonzini@redhat.com>; Marc-André Lureau > > > <marcandre.lureau@redhat.com>; > > > > > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth > > > <thuth@redhat.com>; > > > > > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu > > > <peterx@redhat.com>; > > > > > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; > > > Markus > > > > > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; > > > qemu- > > > > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam > > > > > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang > > > > > <horenchuang@bytedance.com> > > > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to > > > offload > > > > > zero page checking in multifd live migration. > > > > > > > > > > On Mon, Jul 15, 2024 at 08:29:03AM +0000, Liu, Yuan1 wrote: > > > > > > > -----Original Message----- > > > > > > > From: Michael S. Tsirkin <mst@redhat.com> > > > > > > > Sent: Friday, July 12, 2024 6:49 AM > > > > > > > To: Wang, Yichen <yichen.wang@bytedance.com> > > > > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>; Marc-André Lureau > > > > > > > <marcandre.lureau@redhat.com>; Daniel P. 
Berrangé > > > > > <berrange@redhat.com>; > > > > > > > Thomas Huth <thuth@redhat.com>; Philippe Mathieu-Daudé > > > > > > > <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; Fabiano > Rosas > > > > > > > <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus > > > Armbruster > > > > > > > <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu- > > > > > > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1 > > > > > > > <yuan1.liu@intel.com>; Kumar, Shivam > <shivam.kumar1@nutanix.com>; > > > Ho- > > > > > Ren > > > > > > > (Jack) Chuang <horenchuang@bytedance.com> > > > > > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator > to > > > > > offload > > > > > > > zero page checking in multifd live migration. > > > > > > > > > > > > > > On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote: > > > > > > > > * Performance: > > > > > > > > > > > > > > > > We use two Intel 4th generation Xeon servers for testing. > > > > > > > > > > > > > > > > Architecture: x86_64 > > > > > > > > CPU(s): 192 > > > > > > > > Thread(s) per core: 2 > > > > > > > > Core(s) per socket: 48 > > > > > > > > Socket(s): 2 > > > > > > > > NUMA node(s): 2 > > > > > > > > Vendor ID: GenuineIntel > > > > > > > > CPU family: 6 > > > > > > > > Model: 143 > > > > > > > > Model name: Intel(R) Xeon(R) Platinum 8457C > > > > > > > > Stepping: 8 > > > > > > > > CPU MHz: 2538.624 > > > > > > > > CPU max MHz: 3800.0000 > > > > > > > > CPU min MHz: 800.0000 > > > > > > > > > > > > > > > > We perform multifd live migration with below setup: > > > > > > > > 1. VM has 100GB memory. > > > > > > > > 2. Use the new migration option multifd-set-normal-page- > ratio to > > > > > control > > > > > > > the total > > > > > > > > size of the payload sent over the network. > > > > > > > > 3. Use 8 multifd channels. > > > > > > > > 4. Use tcp for live migration. > > > > > > > > 4. Use CPU to perform zero page checking as the baseline. > > > > > > > > 5. 
Use one DSA device to offload zero page checking to > compare > > > with > > > > > the > > > > > > > baseline. > > > > > > > > 6. Use "perf sched record" and "perf sched timehist" to > analyze > > > CPU > > > > > > > usage. > > > > > > > > > > > > > > > > A) Scenario 1: 50% (50GB) normal pages on an 100GB vm. > > > > > > > > > > > > > > > > CPU usage > > > > > > > > > > > > > > > > |---------------|---------------|---------------|------- > ------ > > > > > --| > > > > > > > > | |comm |runtime(msec) > |totaltime(msec)| > > > > > > > > |---------------|---------------|---------------|------- > ------ > > > > > --| > > > > > > > > |Baseline |live_migration |5657.58 | | > > > > > > > > | |multifdsend_0 |3931.563 | | > > > > > > > > | |multifdsend_1 |4405.273 | | > > > > > > > > | |multifdsend_2 |3941.968 | | > > > > > > > > | |multifdsend_3 |5032.975 | | > > > > > > > > | |multifdsend_4 |4533.865 | | > > > > > > > > | |multifdsend_5 |4530.461 | | > > > > > > > > | |multifdsend_6 |5171.916 | | > > > > > > > > | |multifdsend_7 |4722.769 |41922 > | > > > > > > > > |---------------|---------------|---------------|------- > ------ > > > > > --| > > > > > > > > |DSA |live_migration |6129.168 | | > > > > > > > > | |multifdsend_0 |2954.717 | | > > > > > > > > | |multifdsend_1 |2766.359 | | > > > > > > > > | |multifdsend_2 |2853.519 | | > > > > > > > > | |multifdsend_3 |2740.717 | | > > > > > > > > | |multifdsend_4 |2824.169 | | > > > > > > > > | |multifdsend_5 |2966.908 | | > > > > > > > > | |multifdsend_6 |2611.137 | | > > > > > > > > | |multifdsend_7 |3114.732 | | > > > > > > > > | |dsa_completion |3612.564 |32568 > | > > > > > > > > |---------------|---------------|---------------|------- > ------ > > > > > --| > > > > > > > > > > > > > > > > Baseline total runtime is calculated by adding up all > > > multifdsend_X > > > > > > > > and live_migration threads runtime. 
DSA offloading total > runtime > > > is > > > > > > > > calculated by adding up all multifdsend_X, live_migration > and > > > > > > > > dsa_completion threads runtime. 41922 msec VS 32568 msec > runtime > > > and > > > > > > > > that is 23% total CPU usage savings. > > > > > > > > > > > > > > > > > > > > > Here the DSA was mostly idle. > > > > > > > > > > > > > > Sounds good but a question: what if several qemu instances are > > > > > > > migrated in parallel? > > > > > > > > > > > > > > Some accelerators tend to basically stall if several tasks > > > > > > > are trying to use them at the same time. > > > > > > > > > > > > > > Where is the boundary here? > > > > > > > > > > > > A DSA device can be assigned to multiple Qemu instances. > > > > > > The DSA resource used by each process is called a work queue, > each > > > DSA > > > > > > device can support up to 8 work queues and work queues are > > > classified > > > > > into > > > > > > dedicated queues and shared queues. > > > > > > > > > > > > A dedicated queue can only serve one process. Theoretically, > there > > > is no > > > > > limit > > > > > > on the number of processes in a shared queue, it is based on > enqcmd > > > + > > > > > SVM technology. > > > > > > > > > > > > https://www.kernel.org/doc/html/v5.17/x86/sva.html > > > > > > > > > > This server has 200 CPUs which can thinkably migrate around 100 > single > > > > > cpu qemu instances with no issue. What happens if you do this with > > > DSA? > > > > > > > > First, the DSA work queue needs to be configured in shared mode, and > one > > > > queue is enough. 
> > > > > > > > The maximum depth of the work queue of the DSA hardware is 128, > which > > > means > > > > that the number of zero-page detection tasks submitted cannot exceed > > > 128, > > > > otherwise, enqcmd will return an error until the work queue is > available > > > again > > > > > > > > 100 Qemu instances need to be migrated concurrently, I don't have > any > > > data on > > > > this yet, I think the 100 zero-page detection tasks can be > successfully > > > submitted > > > > to the DSA hardware work queue, but the throughput of DSA's zero- > page > > > detection also > > > > needs to be considered. Once the DSA maximum throughput is reached, > the > > > work queue > > > > may be filled up quickly, this will cause some Qemu instances to be > > > temporarily unable > > > > to submit new tasks to DSA. > > > > > > The unfortunate reality here would be that there's likely no QoS, this > > > is purely fifo, right? > > > > Yes, this scenario may be fifo, assuming that the number of pages each > task > > is the same, because DSA hardware consists of multiple work engines, > they can > > process tasks concurrently, usually in a round-robin way to get tasks > from the > > work queue. > > > > DSA supports priority and flow control based on work queue granularity. > > https://github.com/intel/idxd- > config/blob/stable/Documentation/accfg/accel-config-config-wq.txt > > Right but it seems clear there aren't enough work queues for a typical > setup. > > > > > This is likely to happen in the first round of migration > > > > memory iteration. > > > > > > Try testing this and see then? > > > > Yes, I can test based on this patch set. Please review the test scenario > > My server has 192 CPUs, and 8 DSA devices, 100Gbps NIC. > > All 8 DSA devices serve 100 Qemu instances for simultaneous live > migration. > > Each VM has 1 vCPU, and 1G memory, with no workload in the VM. > > > > You want to know if some Qemu instances are stalled because of DSA, > right? 
>
> And generally just run same benchmark you did compared to cpu:
> worst case and average numbers would be interesting.

Sure, I will run this test.

> > > --
> > > MST