> -----Original Message-----
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Tuesday, July 16, 2024 12:09 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> <pbonzini@redhat.com>; Marc-André Lureau <marcandre.lureau@redhat.com>;
> Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth <thuth@redhat.com>;
> Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>;
> Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> qemu-devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam
> <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> <horenchuang@bytedance.com>
> Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> zero page checking in multifd live migration.
>
> On Mon, Jul 15, 2024 at 03:23:13PM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Monday, July 15, 2024 10:43 PM
> > >
> > > On Mon, Jul 15, 2024 at 01:09:59PM +0000, Liu, Yuan1 wrote:
> > > > > -----Original Message-----
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Monday, July 15, 2024 8:24 PM
> > > > >
> > > > > On Mon, Jul 15, 2024 at 08:29:03AM +0000, Liu, Yuan1 wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, July 12, 2024 6:49 AM
> > > > > > >
> > > > > > > On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> > > > > > > > * Performance:
> > > > > > > >
> > > > > > > > We use two Intel 4th generation Xeon servers for testing.
> > > > > > > >
> > > > > > > > Architecture:        x86_64
> > > > > > > > CPU(s):              192
> > > > > > > > Thread(s) per core:  2
> > > > > > > > Core(s) per socket:  48
> > > > > > > > Socket(s):           2
> > > > > > > > NUMA node(s):        2
> > > > > > > > Vendor ID:           GenuineIntel
> > > > > > > > CPU family:          6
> > > > > > > > Model:               143
> > > > > > > > Model name:          Intel(R) Xeon(R) Platinum 8457C
> > > > > > > > Stepping:            8
> > > > > > > > CPU MHz:             2538.624
> > > > > > > > CPU max MHz:         3800.0000
> > > > > > > > CPU min MHz:         800.0000
> > > > > > > >
> > > > > > > > We perform multifd live migration with the below setup:
> > > > > > > > 1. The VM has 100GB memory.
> > > > > > > > 2. Use the new migration option multifd-set-normal-page-ratio to
> > > > > > > >    control the total size of the payload sent over the network.
> > > > > > > > 3. Use 8 multifd channels.
> > > > > > > > 4. Use tcp for live migration.
> > > > > > > > 5. Use CPU to perform zero page checking as the baseline.
> > > > > > > > 6. Use one DSA device to offload zero page checking to compare
> > > > > > > >    with the baseline.
> > > > > > > > 7. Use "perf sched record" and "perf sched timehist" to analyze
> > > > > > > >    CPU usage.
> > > > > > > >
> > > > > > > > A) Scenario 1: 50% (50GB) normal pages on a 100GB VM.
> > > > > > > >
> > > > > > > > CPU usage
> > > > > > > >
> > > > > > > > |---------|---------------|---------------|-----------------|
> > > > > > > > |         |comm           |runtime (msec) |total time (msec)|
> > > > > > > > |---------|---------------|---------------|-----------------|
> > > > > > > > |Baseline |live_migration |5657.58        |                 |
> > > > > > > > |         |multifdsend_0  |3931.563       |                 |
> > > > > > > > |         |multifdsend_1  |4405.273       |                 |
> > > > > > > > |         |multifdsend_2  |3941.968       |                 |
> > > > > > > > |         |multifdsend_3  |5032.975       |                 |
> > > > > > > > |         |multifdsend_4  |4533.865       |                 |
> > > > > > > > |         |multifdsend_5  |4530.461       |                 |
> > > > > > > > |         |multifdsend_6  |5171.916       |                 |
> > > > > > > > |         |multifdsend_7  |4722.769       |41922            |
> > > > > > > > |---------|---------------|---------------|-----------------|
> > > > > > > > |DSA      |live_migration |6129.168       |                 |
> > > > > > > > |         |multifdsend_0  |2954.717       |                 |
> > > > > > > > |         |multifdsend_1  |2766.359       |                 |
> > > > > > > > |         |multifdsend_2  |2853.519       |                 |
> > > > > > > > |         |multifdsend_3  |2740.717       |                 |
> > > > > > > > |         |multifdsend_4  |2824.169       |                 |
> > > > > > > > |         |multifdsend_5  |2966.908       |                 |
> > > > > > > > |         |multifdsend_6  |2611.137       |                 |
> > > > > > > > |         |multifdsend_7  |3114.732       |                 |
> > > > > > > > |         |dsa_completion |3612.564       |32568            |
> > > > > > > > |---------|---------------|---------------|-----------------|
> > > > > > > >
> > > > > > > > The Baseline total runtime is calculated by adding up the runtime
> > > > > > > > of all multifdsend_X threads and the live_migration thread. The
> > > > > > > > DSA offloading total runtime is calculated by adding up the
> > > > > > > > runtime of all multifdsend_X threads, the live_migration thread,
> > > > > > > > and the dsa_completion thread. 41922 msec vs. 32568 msec runtime,
> > > > > > > > which is about a 22% saving in total CPU usage.
> > > > > > >
> > > > > > > Here the DSA was mostly idle.
> > > > > > >
> > > > > > > Sounds good, but a question: what if several qemu instances are
> > > > > > > migrated in parallel?
> > > > > > >
> > > > > > > Some accelerators tend to basically stall if several tasks
> > > > > > > are trying to use them at the same time.
> > > > > > >
> > > > > > > Where is the boundary here?
> > > > > >
> > > > > > A DSA device can be assigned to multiple QEMU instances.
> > > > > > The DSA resource used by each process is called a work queue. Each
> > > > > > DSA device can support up to 8 work queues, and work queues are
> > > > > > classified into dedicated queues and shared queues.
> > > > > >
> > > > > > A dedicated queue can only serve one process. Theoretically, there
> > > > > > is no limit on the number of processes sharing a shared queue; it is
> > > > > > based on ENQCMD + SVM technology.
> > > > > >
> > > > > > https://www.kernel.org/doc/html/v5.17/x86/sva.html
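To make the shared-work-queue model above a bit more concrete, here is a rough
userspace sketch of submitting a single zero-page check to a shared work queue
with ENQCMD. This is not code from this patch set: the work queue path
/dev/dsa/wq0.0 is only an example, the descriptor and completion-record fields
are taken from linux/idxd.h as far as I understand them (the zero check maps to
the compare-with-value operation, DSA_OPCODE_COMPVAL), and real code would
batch many pages and handle faults and errors properly.

/*
 * Rough sketch: check one page for all-zero content by submitting a
 * compare-with-value descriptor to a DSA shared work queue via ENQCMD.
 * Device path and error handling are illustrative only.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <x86intrin.h>
#include <linux/idxd.h>   /* struct dsa_hw_desc, struct dsa_completion_record */

/* ENQCMD rax, [rdx]; a return value of 1 (ZF set) means the queue was full. */
static int enqcmd(volatile void *portal, struct dsa_hw_desc *desc)
{
    uint8_t retry;
    asm volatile(".byte 0xf2, 0x0f, 0x38, 0xf8, 0x02\n\t"
                 "setz %0"
                 : "=r"(retry)
                 : "a"(portal), "d"(desc)
                 : "memory");
    return retry;
}

static int dsa_page_is_zero(volatile void *portal, void *page, uint32_t len)
{
    struct dsa_completion_record comp __attribute__((aligned(32))) = { 0 };
    struct dsa_hw_desc desc __attribute__((aligned(64))) = { 0 };

    desc.opcode = DSA_OPCODE_COMPVAL;                  /* compare against a pattern */
    desc.flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV; /* request a completion record */
    desc.src_addr = (uintptr_t)page;
    desc.comp_pattern = 0;                             /* all-zero 8-byte pattern */
    desc.xfer_size = len;
    desc.completion_addr = (uintptr_t)&comp;

    /* When the shared queue is full, ENQCMD asks the submitter to retry. */
    while (enqcmd(portal, &desc)) {
        _mm_pause();
    }

    /* Busy-poll the completion record; the patch set instead collects
     * completions in a dedicated dsa_completion thread. */
    while (__atomic_load_n(&comp.status, __ATOMIC_ACQUIRE) == 0) {
        _mm_pause();
    }

    /* result == 0 means the whole buffer matched the zero pattern. */
    return comp.status == DSA_COMP_SUCCESS && comp.result == 0;
}

int main(void)
{
    /* Open and map a shared work queue's submission portal; the idxd
     * driver sets up the SVM/PASID binding for the process. */
    int fd = open("/dev/dsa/wq0.0", O_RDWR);
    if (fd < 0) {
        perror("open wq");
        return 1;
    }
    volatile void *portal = mmap(NULL, 0x1000, PROT_WRITE,
                                 MAP_SHARED | MAP_POPULATE, fd, 0);
    if (portal == MAP_FAILED) {
        perror("mmap portal");
        return 1;
    }

    static char page[4096];   /* page to check; all zeroes here */
    printf("page is zero: %d\n", dsa_page_is_zero(portal, page, sizeof(page)));

    munmap((void *)portal, 0x1000);
    close(fd);
    return 0;
}

The part relevant to this discussion is the retry loop around ENQCMD: when the
shared queue is full, submitters simply spin or back off until a slot frees up,
which is the FIFO back-pressure behaviour discussed below.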
> > > > >
> > > > > This server has 200 CPUs which can thinkably migrate around 100 single
> > > > > cpu qemu instances with no issue. What happens if you do this with DSA?
> > > >
> > > > First, the DSA work queue needs to be configured in shared mode, and one
> > > > queue is enough.
> > > >
> > > > The maximum depth of a DSA hardware work queue is 128, which means that
> > > > no more than 128 zero-page detection tasks can be outstanding at a time;
> > > > beyond that, ENQCMD returns an error until space in the work queue
> > > > becomes available again.
> > > >
> > > > For migrating 100 QEMU instances concurrently I don't have any data yet.
> > > > I think the 100 zero-page detection tasks can be submitted successfully
> > > > to the DSA hardware work queue, but the throughput of DSA's zero-page
> > > > detection also needs to be considered. Once the DSA maximum throughput
> > > > is reached, the work queue may fill up quickly, which will cause some
> > > > QEMU instances to be temporarily unable to submit new tasks to DSA.
> > >
> > > The unfortunate reality here would be that there's likely no QoS; this
> > > is purely FIFO, right?
> >
> > Yes, this scenario is effectively FIFO, assuming the number of pages in
> > each task is the same. The DSA hardware consists of multiple work engines
> > that can process tasks concurrently, usually fetching tasks from the work
> > queue in a round-robin way.
> >
> > DSA supports priority and flow control at work queue granularity:
> > https://github.com/intel/idxd-config/blob/stable/Documentation/accfg/accel-config-config-wq.txt
>
> Right, but it seems clear there aren't enough work queues for a typical
> setup.
>
> > > > This is likely to happen in the first round of the migration memory
> > > > iteration.
> > >
> > > Try testing this and see then?
> >
> > Yes, I can test based on this patch set. Please review the test scenario:
> > My server has 192 CPUs, 8 DSA devices, and a 100Gbps NIC.
> > All 8 DSA devices serve 100 QEMU instances for simultaneous live migration.
> > Each VM has 1 vCPU and 1GB memory, with no workload in the VM.
> >
> > You want to know whether some QEMU instances are stalled because of DSA,
> > right?
>
> And generally, just run the same benchmark you did compared to the CPU:
> worst case and average numbers would be interesting.

Sure, I will have a test for this.

> > > --
> > > MST
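For contrast with the DSA path sketched earlier, the Baseline rows in the
CPU-usage table above correspond to each multifdsend thread doing the zero-page
scan on the CPU; in QEMU this is buffer_is_zero(), which is vectorized. A
deliberately simplified scalar sketch of that check, just to show the work
being offloaded:

/*
 * Simplified scalar stand-in for the CPU-side zero-page check done by the
 * multifdsend_* threads in the baseline (QEMU's real buffer_is_zero() is
 * SIMD-accelerated).  Assumes the buffer is page-sized and 8-byte aligned.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool page_is_zero(const void *buf, size_t len)
{
    const uint64_t *p = buf;
    size_t i;

    for (i = 0; i < len / sizeof(*p); i++) {
        if (p[i] != 0) {
            return false;          /* stop at the first non-zero word */
        }
    }
    return true;
}

With DSA offload, this per-page scan is replaced by descriptor submission and
completion handling (the dsa_completion thread in the table), which is where
the lower multifdsend runtimes in the DSA rows come from.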