[PATCH] vhost: Perform memory section dirty scans once per iteration

Joao Martins posted 1 patch 7 months, 1 week ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20230927111428.15982-1-joao.m.martins@oracle.com
Maintainers: "Michael S. Tsirkin" <mst@redhat.com>
hw/virtio/vhost.c | 44 ++++++++++++++++++++++++++++++++++++++------
1 file changed, 38 insertions(+), 6 deletions(-)
[PATCH] vhost: Perform memory section dirty scans once per iteration
Posted by Joao Martins 7 months, 1 week ago
On setups with one or more virtio-net devices with vhost on, the cost
of each dirty tracking iteration increases with the number of queues
that are set up, e.g. on an idle guest migration the following is
observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89%     [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory dirtying rates the symptom is lack of convergence as
soon as there is a vhost device with a sufficiently high number of
queues, or a sufficient number of vhost devices.

On every migration iteration (every 100 msecs) the *shared log* is
redundantly queried once per queue configured with vhost in the guest.
For the virtqueue data this is necessary, but not for the memory
sections, which are the same for all of them. So essentially we end up
scanning the dirty log too often.

To fix that, select one vhost device responsible for scanning the
log with respect to memory section dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger.

The real problem, however, is exactly that: a device per vhost
worker/queue pair, when there should be a single device representing a
netdev (for N vhost workers). Given this problem exists for any QEMU in
use today, a simpler solution seems preferable to maximize stable tree
coverage; thus don't change the device model of software vhost to fix
this "over log scan" issue.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
I am not fully sure the heuristic captures the myriad of different vhost
devices -- I think it does. IIUC, the log is always shared; it's just a
question of whether it lives in QEMU heap memory or in /dev/shm when other
processes want to access it.
---
 hw/virtio/vhost.c | 44 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e2f6ffb446b7..70646c2b533c 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -44,6 +44,7 @@
 
 static struct vhost_log *vhost_log;
 static struct vhost_log *vhost_log_shm;
+static struct vhost_dev *vhost_log_dev;
 
 static unsigned int used_memslots;
 static QLIST_HEAD(, vhost_dev) vhost_devices =
@@ -124,6 +125,21 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
     }
 }
 
+static bool vhost_log_dev_enabled(struct vhost_dev *dev)
+{
+    return dev == vhost_log_dev;
+}
+
+static void vhost_log_set_dev(struct vhost_dev *dev)
+{
+    vhost_log_dev = dev;
+}
+
+static bool vhost_log_dev_is_set(void)
+{
+    return vhost_log_dev != NULL;
+}
+
 static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
                                    MemoryRegionSection *section,
                                    hwaddr first,
@@ -141,13 +157,16 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
     start_addr = MAX(first, start_addr);
     end_addr = MIN(last, end_addr);
 
-    for (i = 0; i < dev->mem->nregions; ++i) {
-        struct vhost_memory_region *reg = dev->mem->regions + i;
-        vhost_dev_sync_region(dev, section, start_addr, end_addr,
-                              reg->guest_phys_addr,
-                              range_get_last(reg->guest_phys_addr,
-                                             reg->memory_size));
+    if (vhost_log_dev_enabled(dev)) {
+        for (i = 0; i < dev->mem->nregions; ++i) {
+            struct vhost_memory_region *reg = dev->mem->regions + i;
+            vhost_dev_sync_region(dev, section, start_addr, end_addr,
+                                  reg->guest_phys_addr,
+                                  range_get_last(reg->guest_phys_addr,
+                                                 reg->memory_size));
+        }
     }
+
     for (i = 0; i < dev->nvqs; ++i) {
         struct vhost_virtqueue *vq = dev->vqs + i;
 
@@ -943,6 +962,19 @@ static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log)
             goto err_vq;
         }
     }
+
+    /*
+     * During migration devices can't be removed, so at log start we
+     * select the vhost device that will scan the memory sections and
+     * skip them for the others. This is possible because the log is
+     * shared amongst all vhost devices.
+     */
+    if (enable_log && !vhost_log_dev_is_set()) {
+        vhost_log_set_dev(dev);
+    } else if (!enable_log) {
+        vhost_log_set_dev(NULL);
+    }
+
     return 0;
 err_vq:
     for (; i >= 0; --i) {
-- 
2.39.3
Re: [PATCH] vhost: Perform memory section dirty scans once per iteration
Posted by Michael S. Tsirkin 7 months ago
On Wed, Sep 27, 2023 at 12:14:28PM +0100, Joao Martins wrote:
> On setups with one or more virtio-net devices with vhost on,
> dirty tracking iteration increases cost the bigger the number
> amount of queues are set up e.g. on idle guests migration the
> following is observed with virtio-net with vhost=on:
> 
> 48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
> 8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
> 1 queue -> 6.89%     [.] vhost_dev_sync_region.isra.13
> 2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
> 
> With high memory rates the symptom is lack of convergence as soon
> as it has a vhost device with a sufficiently high number of queues,
> the sufficient number of vhost devices.
> 
> On every migration iteration (every 100msecs) it will redundantly
> query the *shared log* the number of queues configured with vhost
> that exist in the guest. For the virtqueue data, this is necessary,
> but not for the memory sections which are the same. So
> essentially we end up scanning the dirty log too often.
> 
> To fix that, select a vhost device responsible for scanning the
> log with regards to memory sections dirty tracking. It is selected
> when we enable the logger (during migration) and cleared when we
> disable the logger.
> 
> The real problem, however, is exactly that: a device per vhost worker/qp,
> when there should be a device representing a netdev (for N vhost workers).
> Given this problem exists for any Qemu these days, figured a simpler
> solution is better to increase stable tree's coverage; thus don't
> change the device model of sw vhost to fix this "over log scan" issue.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> I am not fully sure the heuristic captures the myriad of different vhost
> devices -- I think so. IIUC, the log is always shared, it's just whether
> it's qemu head memory or via /dev/shm when other processes want to
> access it.

Thanks for working on this.

I don't think this works as is, because different types of
vhost devices have different regions - see e.g. vhost_region_add_section.
I am also not sure all devices are running at the same time - e.g.
some could be disconnected, and vhost_sync_dirty_bitmap takes this
into account.

But the idea is I think a good one - I just feel more refactoring is
needed.

We also have a FIXME:

static void vhost_log_sync_range(struct vhost_dev *dev,
                                 hwaddr first, hwaddr last)
{
    int i;
    /* FIXME: this is N^2 in number of sections */
    for (i = 0; i < dev->n_mem_sections; ++i) {
        MemoryRegionSection *section = &dev->mem_sections[i];
        vhost_sync_dirty_bitmap(dev, section, first, last);
    }
}

that it would be nice to address. Thanks!


> ---
>  hw/virtio/vhost.c | 44 ++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 38 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index e2f6ffb446b7..70646c2b533c 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -44,6 +44,7 @@
>  
>  static struct vhost_log *vhost_log;
>  static struct vhost_log *vhost_log_shm;
> +static struct vhost_dev *vhost_log_dev;
>  
>  static unsigned int used_memslots;
>  static QLIST_HEAD(, vhost_dev) vhost_devices =
> @@ -124,6 +125,21 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
>      }
>  }
>  
> +static bool vhost_log_dev_enabled(struct vhost_dev *dev)
> +{
> +    return dev == vhost_log_dev;
> +}
> +
> +static void vhost_log_set_dev(struct vhost_dev *dev)
> +{
> +    vhost_log_dev = dev;
> +}
> +
> +static bool vhost_log_dev_is_set(void)
> +{
> +    return vhost_log_dev != NULL;
> +}
> +
>  static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
>                                     MemoryRegionSection *section,
>                                     hwaddr first,
> @@ -141,13 +157,16 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
>      start_addr = MAX(first, start_addr);
>      end_addr = MIN(last, end_addr);
>  
> -    for (i = 0; i < dev->mem->nregions; ++i) {
> -        struct vhost_memory_region *reg = dev->mem->regions + i;
> -        vhost_dev_sync_region(dev, section, start_addr, end_addr,
> -                              reg->guest_phys_addr,
> -                              range_get_last(reg->guest_phys_addr,
> -                                             reg->memory_size));
> +    if (vhost_log_dev_enabled(dev)) {
> +        for (i = 0; i < dev->mem->nregions; ++i) {
> +            struct vhost_memory_region *reg = dev->mem->regions + i;
> +            vhost_dev_sync_region(dev, section, start_addr, end_addr,
> +                                  reg->guest_phys_addr,
> +                                  range_get_last(reg->guest_phys_addr,
> +                                                 reg->memory_size));
> +        }
>      }
> +
>      for (i = 0; i < dev->nvqs; ++i) {
>          struct vhost_virtqueue *vq = dev->vqs + i;
>  
> @@ -943,6 +962,19 @@ static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log)
>              goto err_vq;
>          }
>      }
> +
> +    /*
> +     * During migration devices can't be removed, so we at log start
> +     * we select our vhost_device that will scan the memory sections
> +     * and skip for the others. This is possible because the log is shared
> +     * amongst all vhost devices.
> +     */
> +    if (enable_log && !vhost_log_dev_is_set()) {
> +        vhost_log_set_dev(dev);
> +    } else if (!enable_log) {
> +        vhost_log_set_dev(NULL);
> +    }
> +
>      return 0;
>  err_vq:
>      for (; i >= 0; --i) {
> -- 
> 2.39.3
Re: [PATCH] vhost: Perform memory section dirty scans once per iteration
Posted by Joao Martins 7 months ago
On 03/10/2023 15:01, Michael S. Tsirkin wrote:
> On Wed, Sep 27, 2023 at 12:14:28PM +0100, Joao Martins wrote:
>> On setups with one or more virtio-net devices with vhost on,
>> dirty tracking iteration increases cost the bigger the number
>> amount of queues are set up e.g. on idle guests migration the
>> following is observed with virtio-net with vhost=on:
>>
>> 48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
>> 8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
>> 1 queue -> 6.89%     [.] vhost_dev_sync_region.isra.13
>> 2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
>>
>> With high memory rates the symptom is lack of convergence as soon
>> as it has a vhost device with a sufficiently high number of queues,
>> the sufficient number of vhost devices.
>>
>> On every migration iteration (every 100msecs) it will redundantly
>> query the *shared log* the number of queues configured with vhost
>> that exist in the guest. For the virtqueue data, this is necessary,
>> but not for the memory sections which are the same. So
>> essentially we end up scanning the dirty log too often.
>>
>> To fix that, select a vhost device responsible for scanning the
>> log with regards to memory sections dirty tracking. It is selected
>> when we enable the logger (during migration) and cleared when we
>> disable the logger.
>>
>> The real problem, however, is exactly that: a device per vhost worker/qp,
>> when there should be a device representing a netdev (for N vhost workers).
>> Given this problem exists for any Qemu these days, figured a simpler
>> solution is better to increase stable tree's coverage; thus don't
>> change the device model of sw vhost to fix this "over log scan" issue.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> I am not fully sure the heuristic captures the myriad of different vhost
>> devices -- I think so. IIUC, the log is always shared, it's just whether
>> it's qemu head memory or via /dev/shm when other processes want to
>> access it.
> 
> Thanks for working on this.
> 
> I don't think this works like this because different types of different
> vhost devices have different regions - see e.g. vhost_region_add_section
> I am also not sure all devices are running at the same time - e.g.
> some could be disconnected, and vhost_sync_dirty_bitmap takes this
> into account.
> 

Good point. But this all means the logic for selecting the 'logger' should
take into consideration whether vhost_dev::log_enabled or vhost_dev::started
is set, right?

With respect to regions, it seems like this can only differ depending on
whether one of the vhost devices has backend_type VHOST_BACKEND_TYPE_USER
*and* whether the backend sets vhost_backend_can_merge?

With respect to 'could be disconnected': devices can't be added or removed
during migration, so that might not be something that occurs while migrating.
I placed this in log_sync exactly to cover just migration, unless there's
some other way that disconnects the vhost backend and changes these variables
during migration.
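
For the election I was imagining something along these lines -- a rough,
untested sketch on top of the vhost_log_dev global from this patch, assuming
that checking vhost_dev::started and vhost_dev::log_enabled is enough and
that we simply re-elect whenever the current scanner stops logging:

static bool vhost_log_dev_eligible(struct vhost_dev *dev)
{
    return dev->started && dev->log_enabled;
}

static void vhost_log_elect_dev(void)
{
    struct vhost_dev *dev;

    /* Pick the first started device that is logging, if any. */
    vhost_log_dev = NULL;
    QLIST_FOREACH(dev, &vhost_devices, entry) {
        if (vhost_log_dev_eligible(dev)) {
            vhost_log_dev = dev;
            return;
        }
    }
}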

> But the idea is I think a good one - I just feel more refactoring is
> needed.

Can you expand on what refactoring you were thinking of for this fix?

My thinking on this bug was mostly to address the inefficiency with the
smallest, least intrusive fix (if at all possible!) given that virtually all
QEMU versions with multiqueue vhost support have this problem. And then move
to a 'vhost device for all queues' model, as it feels like the problem here
is that 'device per queue pair' doesn't scale.

At the end of the day the problem here is the vhost object model in log_sync
not scaling with the number of queues. But you could also argue that if the
log is shared you can just scan it once for all devices, plus once more for
each deviation from normal behaviour, like the points you made in the earlier
paragraph, and thus the thinking behind this patch would still apply?
Re: [PATCH] vhost: Perform memory section dirty scans once per iteration
Posted by Michael S. Tsirkin 7 months ago
On Fri, Oct 06, 2023 at 09:58:30AM +0100, Joao Martins wrote:
> On 03/10/2023 15:01, Michael S. Tsirkin wrote:
> > On Wed, Sep 27, 2023 at 12:14:28PM +0100, Joao Martins wrote:
> >> On setups with one or more virtio-net devices with vhost on,
> >> dirty tracking iteration increases cost the bigger the number
> >> amount of queues are set up e.g. on idle guests migration the
> >> following is observed with virtio-net with vhost=on:
> >>
> >> 48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
> >> 8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
> >> 1 queue -> 6.89%     [.] vhost_dev_sync_region.isra.13
> >> 2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
> >>
> >> With high memory rates the symptom is lack of convergence as soon
> >> as it has a vhost device with a sufficiently high number of queues,
> >> the sufficient number of vhost devices.
> >>
> >> On every migration iteration (every 100msecs) it will redundantly
> >> query the *shared log* the number of queues configured with vhost
> >> that exist in the guest. For the virtqueue data, this is necessary,
> >> but not for the memory sections which are the same. So
> >> essentially we end up scanning the dirty log too often.
> >>
> >> To fix that, select a vhost device responsible for scanning the
> >> log with regards to memory sections dirty tracking. It is selected
> >> when we enable the logger (during migration) and cleared when we
> >> disable the logger.
> >>
> >> The real problem, however, is exactly that: a device per vhost worker/qp,
> >> when there should be a device representing a netdev (for N vhost workers).
> >> Given this problem exists for any Qemu these days, figured a simpler
> >> solution is better to increase stable tree's coverage; thus don't
> >> change the device model of sw vhost to fix this "over log scan" issue.
> >>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >> I am not fully sure the heuristic captures the myriad of different vhost
> >> devices -- I think so. IIUC, the log is always shared, it's just whether
> >> it's qemu head memory or via /dev/shm when other processes want to
> >> access it.
> > 
> > Thanks for working on this.
> > 
> > I don't think this works like this because different types of different
> > vhost devices have different regions - see e.g. vhost_region_add_section
> > I am also not sure all devices are running at the same time - e.g.
> > some could be disconnected, and vhost_sync_dirty_bitmap takes this
> > into account.
> > 
> 
> Good point. But this all means logic in selecting the 'logger' to take into
> considering whether vhost_dev::log_enabled or vhost_dev::started right?
> 
> With respect to regions it seems like this can only change depending on whether
> one of the vhost devices, backend_type is VHOST_BACKEND_TYPE_USER *and* whether
> the backend sets vhost_backend_can_merge?
> 
> With respect to 'could be disconnected' during migration not devices can be
> added or removed during migration, so might not be something that occurs during
> migration.
> I placed this in log_sync exactly to just cover migration, unless
> there's some other way that disconnects the vhost and changes these variables
> during migration.

The *frontend* can't be added or removed (ATM - this is just because we lack
good ways to describe devices that can be migrated, so all we
came up with is passing the same command line on both sides,
and this breaks if you add/remove things in the process).
We really shouldn't bake this assumption into code if we can
help it though.

But I digress.

The *backend* can disconnect at any time as this is not guest visible.

> 
> > But the idea is I think a good one - I just feel more refactoring is
> > needed.
> 
> Can you expand on what refactoring you were thinking for this fix?

Better to separate the idea of logging from the device. Then we can
have a single logger that collects data from devices to decide
what needs to be logged.
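
Roughly something like this, just to illustrate the direction -- every
identifier below is made up and this is not a worked-out design:

/* One logger object per shared log, decoupled from any vhost_dev. */
typedef struct VhostLogger {
    struct vhost_log *log;           /* the shared dirty log */
    struct vhost_dev *section_dev;   /* device elected to scan mem sections */
    int refcount;                    /* devices currently using the logger */
} VhostLogger;

static VhostLogger vhost_logger;

/* Devices register with the logger when they start logging ... */
static void vhost_logger_add(struct vhost_dev *dev)
{
    if (!vhost_logger.refcount++) {
        vhost_logger.section_dev = dev;  /* first comer scans the sections */
    }
}

/* ... and drop out when they stop logging or disconnect. */
static void vhost_logger_remove(struct vhost_dev *dev)
{
    if (vhost_logger.section_dev == dev) {
        vhost_logger.section_dev = NULL; /* re-elect on the next log_sync */
    }
    vhost_logger.refcount--;
}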

> My thinking on this bug was mostly to address the inneficiency with the smallest
> intrusive fix (if at all possible!) given that virtually all multiqueue vhost
> supported QEMU have this problem. And then move into a 'vhost-device for all
> queues' as it feels like the problem here is the 'device per queue pair' doesn't
> scale.
> 
> At the end of the day the problem on this is the vhost object model in log_sync
> not scaling to amount of queues. But you could also argue that if the log is
> shared that you can just log once for all, plus another one for each deviation
> of normal behaviour, like the points you made in the earlier paragraph, and thus
> the thinking behind this patch would still apply?

The thinking is good, but not the implementation.

-- 
MST
Re: [PATCH] vhost: Perform memory section dirty scans once per iteration
Posted by Si-Wei Liu 6 months, 2 weeks ago

On 10/6/2023 2:48 AM, Michael S. Tsirkin wrote:
> On Fri, Oct 06, 2023 at 09:58:30AM +0100, Joao Martins wrote:
>> On 03/10/2023 15:01, Michael S. Tsirkin wrote:
>>> On Wed, Sep 27, 2023 at 12:14:28PM +0100, Joao Martins wrote:
>>>> On setups with one or more virtio-net devices with vhost on,
>>>> dirty tracking iteration increases cost the bigger the number
>>>> amount of queues are set up e.g. on idle guests migration the
>>>> following is observed with virtio-net with vhost=on:
>>>>
>>>> 48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
>>>> 8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
>>>> 1 queue -> 6.89%     [.] vhost_dev_sync_region.isra.13
>>>> 2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
>>>>
>>>> With high memory rates the symptom is lack of convergence as soon
>>>> as it has a vhost device with a sufficiently high number of queues,
>>>> the sufficient number of vhost devices.
>>>>
>>>> On every migration iteration (every 100msecs) it will redundantly
>>>> query the *shared log* the number of queues configured with vhost
>>>> that exist in the guest. For the virtqueue data, this is necessary,
>>>> but not for the memory sections which are the same. So
>>>> essentially we end up scanning the dirty log too often.
>>>>
>>>> To fix that, select a vhost device responsible for scanning the
>>>> log with regards to memory sections dirty tracking. It is selected
>>>> when we enable the logger (during migration) and cleared when we
>>>> disable the logger.
>>>>
>>>> The real problem, however, is exactly that: a device per vhost worker/qp,
>>>> when there should be a device representing a netdev (for N vhost workers).
>>>> Given this problem exists for any Qemu these days, figured a simpler
>>>> solution is better to increase stable tree's coverage; thus don't
>>>> change the device model of sw vhost to fix this "over log scan" issue.
>>>>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> ---
>>>> I am not fully sure the heuristic captures the myriad of different vhost
>>>> devices -- I think so. IIUC, the log is always shared, it's just whether
>>>> it's qemu head memory or via /dev/shm when other processes want to
>>>> access it.
>>> Thanks for working on this.
>>>
>>> I don't think this works like this because different types of different
>>> vhost devices have different regions - see e.g. vhost_region_add_section
>>> I am also not sure all devices are running at the same time - e.g.
>>> some could be disconnected, and vhost_sync_dirty_bitmap takes this
>>> into account.
>>>
>> Good point. But this all means logic in selecting the 'logger' to take into
>> considering whether vhost_dev::log_enabled or vhost_dev::started right?
>>
>> With respect to regions it seems like this can only change depending on whether
>> one of the vhost devices, backend_type is VHOST_BACKEND_TYPE_USER *and* whether
>> the backend sets vhost_backend_can_merge?
>>
>> With respect to 'could be disconnected' during migration not devices can be
>> added or removed during migration, so might not be something that occurs during
>> migration.
>> I placed this in log_sync exactly to just cover migration, unless
>> there's some other way that disconnects the vhost and changes these variables
>> during migration.
> The *frontend* can't be added or removed (ATM - this is just because we lack
> good ways to describe devices that can be migrated, so all we
> came up with is passing same command line on both sides,
> and this breaks if you add/remove things in the process).
> We really shouldn't bake this assumption into code if we can
> help it though.
>
> But I digress.
>
> The *backend* can disconnect at any time as this is not guest visible.
>
>>> But the idea is I think a good one - I just feel more refactoring is
>>> needed.
>> Can you expand on what refactoring you were thinking for this fix?
> Better separate the idea of logging from device. then we can
> have a single logger that collects data from devices to decide
> what needs to be logged.
Discussion: I think the troublemaker here is the vhost-user clients, which 
attempt to round down & up to the (huge) page boundary and then have to 
merge adjacent sections, leading to differing views between vhost devices. 
While I agree it is a great idea to separate logging from the device, it 
isn't clear to me how that can help the case where there is a mix 
of both vhost-user and vhost-kernel clients in the same QEMU process, in 
which case it would need at least 2 separate vhost loggers, one per 
vhost type? Or do you think there's value in unifying the two 
distinct subsystems with one single vhost logger facility? Note that the 
vhost logging interface (vhost kernel or vhost userspace) doesn't 
support the notion of logging memory buffer sections separately from 
those for VQs; all QEMU can rely on is the various sections in the memory 
table, and basically a single dirty bitmap for both guest buffers and VQs 
is indistinctly shared by all vhost devices. How it helps to 
just refactor the QEMU part of the code on top of today's vhost backend 
interface, I am not sure.

Regardless, IMHO from a stable-fix p.o.v. it might be less risky and still 
valuable to just limit the fix to the vhost-kernel case (to be more precise, 
non-vhost-user backend types without vhost_backend_can_merge defined), my 2c.
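
e.g. the election in the original patch could be gated on something like
this (rough sketch, untested; the helper name is made up):

static bool vhost_log_backend_eligible(struct vhost_dev *dev)
{
    /*
     * Only non-vhost-user backends without vhost_backend_can_merge see
     * the same region layout, so only those are safe candidates for the
     * single memory-section scanner.
     */
    if (dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_USER) {
        return false;
    }
    if (dev->vhost_ops->vhost_backend_can_merge) {
        return false;
    }
    return true;
}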


Regards,
-Siwei
>
>> My thinking on this bug was mostly to address the inneficiency with the smallest
>> intrusive fix (if at all possible!) given that virtually all multiqueue vhost
>> supported QEMU have this problem. And then move into a 'vhost-device for all
>> queues' as it feels like the problem here is the 'device per queue pair' doesn't
>> scale.
>>
>> At the end of the day the problem on this is the vhost object model in log_sync
>> not scaling to amount of queues. But you could also argue that if the log is
>> shared that you can just log once for all, plus another one for each deviation
>> of normal behaviour, like the points you made in the earlier paragraph, and thus
>> the thinking behind this patch would still apply?
> The thinking is good, but not the implementation.
>
Re: [PATCH] vhost: Perform memory section dirty scans once per iteration
Posted by Michael S. Tsirkin 6 months, 2 weeks ago
On Tue, Oct 17, 2023 at 05:32:34PM -0700, Si-Wei Liu wrote:
> 
> 
> On 10/6/2023 2:48 AM, Michael S. Tsirkin wrote:
> > On Fri, Oct 06, 2023 at 09:58:30AM +0100, Joao Martins wrote:
> > > On 03/10/2023 15:01, Michael S. Tsirkin wrote:
> > > > On Wed, Sep 27, 2023 at 12:14:28PM +0100, Joao Martins wrote:
> > > > > On setups with one or more virtio-net devices with vhost on,
> > > > > dirty tracking iteration increases cost the bigger the number
> > > > > amount of queues are set up e.g. on idle guests migration the
> > > > > following is observed with virtio-net with vhost=on:
> > > > > 
> > > > > 48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
> > > > > 8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
> > > > > 1 queue -> 6.89%     [.] vhost_dev_sync_region.isra.13
> > > > > 2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
> > > > > 
> > > > > With high memory rates the symptom is lack of convergence as soon
> > > > > as it has a vhost device with a sufficiently high number of queues,
> > > > > the sufficient number of vhost devices.
> > > > > 
> > > > > On every migration iteration (every 100msecs) it will redundantly
> > > > > query the *shared log* the number of queues configured with vhost
> > > > > that exist in the guest. For the virtqueue data, this is necessary,
> > > > > but not for the memory sections which are the same. So
> > > > > essentially we end up scanning the dirty log too often.
> > > > > 
> > > > > To fix that, select a vhost device responsible for scanning the
> > > > > log with regards to memory sections dirty tracking. It is selected
> > > > > when we enable the logger (during migration) and cleared when we
> > > > > disable the logger.
> > > > > 
> > > > > The real problem, however, is exactly that: a device per vhost worker/qp,
> > > > > when there should be a device representing a netdev (for N vhost workers).
> > > > > Given this problem exists for any Qemu these days, figured a simpler
> > > > > solution is better to increase stable tree's coverage; thus don't
> > > > > change the device model of sw vhost to fix this "over log scan" issue.
> > > > > 
> > > > > Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> > > > > ---
> > > > > I am not fully sure the heuristic captures the myriad of different vhost
> > > > > devices -- I think so. IIUC, the log is always shared, it's just whether
> > > > > it's qemu head memory or via /dev/shm when other processes want to
> > > > > access it.
> > > > Thanks for working on this.
> > > > 
> > > > I don't think this works like this because different types of different
> > > > vhost devices have different regions - see e.g. vhost_region_add_section
> > > > I am also not sure all devices are running at the same time - e.g.
> > > > some could be disconnected, and vhost_sync_dirty_bitmap takes this
> > > > into account.
> > > > 
> > > Good point. But this all means logic in selecting the 'logger' to take into
> > > considering whether vhost_dev::log_enabled or vhost_dev::started right?
> > > 
> > > With respect to regions it seems like this can only change depending on whether
> > > one of the vhost devices, backend_type is VHOST_BACKEND_TYPE_USER *and* whether
> > > the backend sets vhost_backend_can_merge?
> > > 
> > > With respect to 'could be disconnected' during migration not devices can be
> > > added or removed during migration, so might not be something that occurs during
> > > migration.
> > > I placed this in log_sync exactly to just cover migration, unless
> > > there's some other way that disconnects the vhost and changes these variables
> > > during migration.
> > The *frontend* can't be added or removed (ATM - this is just because we lack
> > good ways to describe devices that can be migrated, so all we
> > came up with is passing same command line on both sides,
> > and this breaks if you add/remove things in the process).
> > We really shouldn't bake this assumption into code if we can
> > help it though.
> > 
> > But I digress.
> > 
> > The *backend* can disconnect at any time as this is not guest visible.
> > 
> > > > But the idea is I think a good one - I just feel more refactoring is
> > > > needed.
> > > Can you expand on what refactoring you were thinking for this fix?
> > Better separate the idea of logging from device. then we can
> > have a single logger that collects data from devices to decide
> > what needs to be logged.
> Discussion. I think the troublemaker here is the vhost-user clients that
> attempt to round down&up to (huge) page boundary and then has to merge
> adjacent sections, leading to differing views between vhost devices. While I
> agree it is a great idea to separate logging from device, it isn't clear to
> me how that can help the case where there could be a mix of both vhost-user
> and vhost-kernel clients in the same qemu process, in which case it would
> need at least 2 separate vhost loggers for the specific vhost type? Or you
> would think there's value to unify the two distinct subsystems with one
> single vhost logger facility?

Yes - I think we need a logger per backend type. Reference-count them, too.
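
(Sketch only, reusing the made-up VhostLogger from earlier in the thread --
an array indexed by backend type, each entry refcounted:)

static VhostLogger vhost_loggers[VHOST_BACKEND_TYPE_MAX];

static VhostLogger *vhost_logger_get(struct vhost_dev *dev)
{
    VhostLogger *logger = &vhost_loggers[dev->vhost_ops->backend_type];

    if (!logger->refcount++) {
        logger->section_dev = dev;   /* first device of this type scans */
    }
    return logger;
}

static void vhost_logger_put(struct vhost_dev *dev, VhostLogger *logger)
{
    if (logger->section_dev == dev) {
        logger->section_dev = NULL;
    }
    logger->refcount--;
}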

> Noted the vhost logging interface (vhost
> kernel or vhost userspace) doesn't support the notion of separate logging of
> memory buffer sections against those for VQs, all QEMU can rely on is
> various sections in the memory table and basically a single dirty bitmap for
> both guest buffers and VQs are indistinctively shared by all vhost devices.
> How does it help to just refactor QEMU part of code using today's vhost
> backend interface, I am not sure.
> 
> Regardless, IMHO for fixing stable p.o.v it might be less risky and valuable
> to just limit the fix to vhost-kernel case (to be more precise,
> non-vhost-user type and without vhost_backend_can_merge defined), my 2c.
> 
> 
> Regards,
> -Siwei
> > 
> > > My thinking on this bug was mostly to address the inneficiency with the smallest
> > > intrusive fix (if at all possible!) given that virtually all multiqueue vhost
> > > supported QEMU have this problem. And then move into a 'vhost-device for all
> > > queues' as it feels like the problem here is the 'device per queue pair' doesn't
> > > scale.
> > > 
> > > At the end of the day the problem on this is the vhost object model in log_sync
> > > not scaling to amount of queues. But you could also argue that if the log is
> > > shared that you can just log once for all, plus another one for each deviation
> > > of normal behaviour, like the points you made in the earlier paragraph, and thus
> > > the thinking behind this patch would still apply?
> > The thinking is good, but not the implementation.
> >
Re: [PATCH] vhost: Perform memory section dirty scans once per iteration
Posted by Joao Martins 7 months ago

On 06/10/2023 10:48, Michael S. Tsirkin wrote:
> On Fri, Oct 06, 2023 at 09:58:30AM +0100, Joao Martins wrote:
>> On 03/10/2023 15:01, Michael S. Tsirkin wrote:
>>> On Wed, Sep 27, 2023 at 12:14:28PM +0100, Joao Martins wrote:
>>>> On setups with one or more virtio-net devices with vhost on,
>>>> dirty tracking iteration increases cost the bigger the number
>>>> amount of queues are set up e.g. on idle guests migration the
>>>> following is observed with virtio-net with vhost=on:
>>>>
>>>> 48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
>>>> 8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
>>>> 1 queue -> 6.89%     [.] vhost_dev_sync_region.isra.13
>>>> 2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
>>>>
>>>> With high memory rates the symptom is lack of convergence as soon
>>>> as it has a vhost device with a sufficiently high number of queues,
>>>> the sufficient number of vhost devices.
>>>>
>>>> On every migration iteration (every 100msecs) it will redundantly
>>>> query the *shared log* the number of queues configured with vhost
>>>> that exist in the guest. For the virtqueue data, this is necessary,
>>>> but not for the memory sections which are the same. So
>>>> essentially we end up scanning the dirty log too often.
>>>>
>>>> To fix that, select a vhost device responsible for scanning the
>>>> log with regards to memory sections dirty tracking. It is selected
>>>> when we enable the logger (during migration) and cleared when we
>>>> disable the logger.
>>>>
>>>> The real problem, however, is exactly that: a device per vhost worker/qp,
>>>> when there should be a device representing a netdev (for N vhost workers).
>>>> Given this problem exists for any Qemu these days, figured a simpler
>>>> solution is better to increase stable tree's coverage; thus don't
>>>> change the device model of sw vhost to fix this "over log scan" issue.
>>>>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> ---
>>>> I am not fully sure the heuristic captures the myriad of different vhost
>>>> devices -- I think so. IIUC, the log is always shared, it's just whether
>>>> it's qemu head memory or via /dev/shm when other processes want to
>>>> access it.
>>>
>>> Thanks for working on this.
>>>
>>> I don't think this works like this because different types of different
>>> vhost devices have different regions - see e.g. vhost_region_add_section
>>> I am also not sure all devices are running at the same time - e.g.
>>> some could be disconnected, and vhost_sync_dirty_bitmap takes this
>>> into account.
>>>
>>
>> Good point. But this all means logic in selecting the 'logger' to take into
>> considering whether vhost_dev::log_enabled or vhost_dev::started right?
>>
>> With respect to regions it seems like this can only change depending on whether
>> one of the vhost devices, backend_type is VHOST_BACKEND_TYPE_USER *and* whether
>> the backend sets vhost_backend_can_merge?
>>
>> With respect to 'could be disconnected' during migration not devices can be
>> added or removed during migration, so might not be something that occurs during
>> migration.
>> I placed this in log_sync exactly to just cover migration, unless
>> there's some other way that disconnects the vhost and changes these variables
>> during migration.
> 
> The *frontend* can't be added or removed (ATM - this is just because we lack
> good ways to describe devices that can be migrated, so all we
> came up with is passing same command line on both sides,
> and this breaks if you add/remove things in the process).
> We really shouldn't bake this assumption into code if we can
> help it though.
> 
> But I digress.
> 
Regardless of what the guest is doing, I was talking more about the VMM. This
wasn't so much about baking the assumption into the code, for the reasons you
just enumerated, but (...)

> The *backend* can disconnect at any time as this is not guest visible.
> 

(...) more about what can lead QEMU to disconnect the backend.

I guess I am reading your comment as: the backend, being the secondary process
for vhost-user (or the vhost kernel thread), can just disconnect voluntarily,
regardless of what QEMU might be doing.

>>
>>> But the idea is I think a good one - I just feel more refactoring is
>>> needed.
>>
>> Can you expand on what refactoring you were thinking for this fix?
> 
> Better separate the idea of logging from device. then we can
> have a single logger that collects data from devices to decide
> what needs to be logged.
> 

OK, that makes sense. Presumably, such a logger abstraction would be aware of
all the different sections from every vhost device representation, and it
would scan everything related to sections. The VQs might still make sense to
keep as part of the device.

>> My thinking on this bug was mostly to address the inneficiency with the smallest
>> intrusive fix (if at all possible!) given that virtually all multiqueue vhost
>> supported QEMU have this problem. And then move into a 'vhost-device for all
>> queues' as it feels like the problem here is the 'device per queue pair' doesn't
>> scale.
>>
>> At the end of the day the problem on this is the vhost object model in log_sync
>> not scaling to amount of queues. But you could also argue that if the log is
>> shared that you can just log once for all, plus another one for each deviation
>> of normal behaviour, like the points you made in the earlier paragraph, and thus
>> the thinking behind this patch would still apply?
> 
> The thinking is good, but not the implementation.
> 
Yeah, I got that from the beginning :)