[Qemu-devel] [RFC 23/29] vub+postcopy: madvises

Dr. David Alan Gilbert (git) posted 29 patches 8 years, 7 months ago
There is a newer version of this series
[Qemu-devel] [RFC 23/29] vub+postcopy: madvises
Posted by Dr. David Alan Gilbert (git) 8 years, 7 months ago
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Clear the area and turn off THP.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 0658b6e847..ceddeac74f 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -451,11 +451,39 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
         }
 
         if (dev->postcopy_listening) {
+            int ret;
             /* We should already have an open ufd need to mark each memory
              * range as ufd.
-             * Note: Do we need any madvises? Well it's not been accessed
-             * yet, still probably need no THP to be safe, discard to be safe?
              */
+
+            /* Discard any mapping we have here; note I can't use MADV_REMOVE
+             * or fallocate to make the hole since I don't want to lose
+             * data that's already arrived in the shared process.
+             * TODO: How to do hugepage
+             */
+            ret = madvise((void *)dev_region->mmap_addr,
+                          dev_region->size + dev_region->mmap_offset,
+                          MADV_DONTNEED);
+            if (ret) {
+                fprintf(stderr,
+                        "%s: Failed to madvise(DONTNEED) region %d: %s\n",
+                        __func__, i, strerror(errno));
+            }
+            /* Turn off transparent hugepages so we dont get lose wakeups
+             * in neighbouring pages.
+             * TODO: Turn this backon later.
+             */
+            ret = madvise((void *)dev_region->mmap_addr,
+                          dev_region->size + dev_region->mmap_offset,
+                          MADV_NOHUGEPAGE);
+            if (ret) {
+                /* Note: This can happen legally on kernels that are configured
+                 * without madvise'able hugepages
+                 */
+                fprintf(stderr,
+                        "%s: Failed to madvise(NOHUGEPAGE) region %d: %s\n",
+                        __func__, i, strerror(errno));
+            }
             struct uffdio_register reg_struct;
             /* Note: We might need to go back to using mmap_addr and
              * len + mmap_offset for * huge pages, but then we do hope not to
-- 
2.13.0


Re: [Qemu-devel] [RFC 23/29] vub+postcopy: madvises
Posted by Alexey Perevalov 8 years, 6 months ago
On 06/28/2017 10:00 PM, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Clear the area and turn off THP.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>   contrib/libvhost-user/libvhost-user.c | 32 ++++++++++++++++++++++++++++++--
>   1 file changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index 0658b6e847..ceddeac74f 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -451,11 +451,39 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
>           }
>   
>           if (dev->postcopy_listening) {
> +            int ret;
>               /* We should already have an open ufd need to mark each memory
>                * range as ufd.
> -             * Note: Do we need any madvises? Well it's not been accessed
> -             * yet, still probably need no THP to be safe, discard to be safe?
>                */
> +
> +            /* Discard any mapping we have here; note I can't use MADV_REMOVE
> +             * or fallocate to make the hole since I don't want to lose
> +             * data that's already arrived in the shared process.
> +             * TODO: How to do hugepage
> +             */
Hi, David, frankly saying, I stuck with my solution, and I have also 
another issues,
but here I could suggest solution for hugepages. I think we could 
transmit a received pages
bitmap in VHOST_USER_SET_MEM_TABLE (VhostUserMemoryRegion), but it will 
raise a compatibility issue,
or introduce special message type for that and send it before 
VHOST_USER_SET_MEM_TABLE.
So it will be  possible to do fallocate on received bitmap basis, just 
skip already copied pages.
If you wish, I could send patches, rebased on yours, for doing it.

> +            ret = madvise((void *)dev_region->mmap_addr,
> +                          dev_region->size + dev_region->mmap_offset,
> +                          MADV_DONTNEED);
> +            if (ret) {
> +                fprintf(stderr,
> +                        "%s: Failed to madvise(DONTNEED) region %d: %s\n",
> +                        __func__, i, strerror(errno));
> +            }
> +            /* Turn off transparent hugepages so we dont get lose wakeups
> +             * in neighbouring pages.
> +             * TODO: Turn this backon later.
> +             */
> +            ret = madvise((void *)dev_region->mmap_addr,
> +                          dev_region->size + dev_region->mmap_offset,
> +                          MADV_NOHUGEPAGE);
> +            if (ret) {
> +                /* Note: This can happen legally on kernels that are configured
> +                 * without madvise'able hugepages
> +                 */
> +                fprintf(stderr,
> +                        "%s: Failed to madvise(NOHUGEPAGE) region %d: %s\n",
> +                        __func__, i, strerror(errno));
> +            }
>               struct uffdio_register reg_struct;
>               /* Note: We might need to go back to using mmap_addr and
>                * len + mmap_offset for * huge pages, but then we do hope not to


-- 
Best regards,
Alexey Perevalov

Re: [Qemu-devel] [RFC 23/29] vub+postcopy: madvises
Posted by Dr. David Alan Gilbert 8 years, 6 months ago
* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> On 06/28/2017 10:00 PM, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Clear the area and turn off THP.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >   contrib/libvhost-user/libvhost-user.c | 32 ++++++++++++++++++++++++++++++--
> >   1 file changed, 30 insertions(+), 2 deletions(-)
> > 
> > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > index 0658b6e847..ceddeac74f 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -451,11 +451,39 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
> >           }
> >           if (dev->postcopy_listening) {
> > +            int ret;
> >               /* We should already have an open ufd need to mark each memory
> >                * range as ufd.
> > -             * Note: Do we need any madvises? Well it's not been accessed
> > -             * yet, still probably need no THP to be safe, discard to be safe?
> >                */
> > +
> > +            /* Discard any mapping we have here; note I can't use MADV_REMOVE
> > +             * or fallocate to make the hole since I don't want to lose
> > +             * data that's already arrived in the shared process.
> > +             * TODO: How to do hugepage
> > +             */
> Hi, David, frankly saying, I stuck with my solution, and I have also another
> issues,
> but here I could suggest solution for hugepages. I think we could transmit a
> received pages
> bitmap in VHOST_USER_SET_MEM_TABLE (VhostUserMemoryRegion), but it will
> raise a compatibility issue,
> or introduce special message type for that and send it before
> VHOST_USER_SET_MEM_TABLE.
> So it will be  possible to do fallocate on received bitmap basis, just skip
> already copied pages.
> If you wish, I could send patches, rebased on yours, for doing it.

What we found works is that actually we don't need to do a discard -
since we've only just done the mmap of the arena, nothing will be
occupying it on the shared client, so we don't need to discard.

We've had a postcopy migrate work now, with a few hacks we're still
cleaning up, both on vhost-user-bridge and dpdk; so I'll get this
updated and reposted.

Dave

> > +            ret = madvise((void *)dev_region->mmap_addr,
> > +                          dev_region->size + dev_region->mmap_offset,
> > +                          MADV_DONTNEED);
> > +            if (ret) {
> > +                fprintf(stderr,
> > +                        "%s: Failed to madvise(DONTNEED) region %d: %s\n",
> > +                        __func__, i, strerror(errno));
> > +            }
> > +            /* Turn off transparent hugepages so we dont get lose wakeups
> > +             * in neighbouring pages.
> > +             * TODO: Turn this backon later.
> > +             */
> > +            ret = madvise((void *)dev_region->mmap_addr,
> > +                          dev_region->size + dev_region->mmap_offset,
> > +                          MADV_NOHUGEPAGE);
> > +            if (ret) {
> > +                /* Note: This can happen legally on kernels that are configured
> > +                 * without madvise'able hugepages
> > +                 */
> > +                fprintf(stderr,
> > +                        "%s: Failed to madvise(NOHUGEPAGE) region %d: %s\n",
> > +                        __func__, i, strerror(errno));
> > +            }
> >               struct uffdio_register reg_struct;
> >               /* Note: We might need to go back to using mmap_addr and
> >                * len + mmap_offset for * huge pages, but then we do hope not to
> 
> 
> -- 
> Best regards,
> Alexey Perevalov
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [RFC 23/29] vub+postcopy: madvises
Posted by Alexey Perevalov 8 years, 6 months ago
On 08/08/2017 08:06 PM, Dr. David Alan Gilbert wrote:
> * Alexey Perevalov (a.perevalov@samsung.com) wrote:
>> On 06/28/2017 10:00 PM, Dr. David Alan Gilbert (git) wrote:
>>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>>
>>> Clear the area and turn off THP.
>>>
>>> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>> ---
>>>    contrib/libvhost-user/libvhost-user.c | 32 ++++++++++++++++++++++++++++++--
>>>    1 file changed, 30 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
>>> index 0658b6e847..ceddeac74f 100644
>>> --- a/contrib/libvhost-user/libvhost-user.c
>>> +++ b/contrib/libvhost-user/libvhost-user.c
>>> @@ -451,11 +451,39 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
>>>            }
>>>            if (dev->postcopy_listening) {
>>> +            int ret;
>>>                /* We should already have an open ufd need to mark each memory
>>>                 * range as ufd.
>>> -             * Note: Do we need any madvises? Well it's not been accessed
>>> -             * yet, still probably need no THP to be safe, discard to be safe?
>>>                 */
>>> +
>>> +            /* Discard any mapping we have here; note I can't use MADV_REMOVE
>>> +             * or fallocate to make the hole since I don't want to lose
>>> +             * data that's already arrived in the shared process.
>>> +             * TODO: How to do hugepage
>>> +             */
>> Hi, David, frankly saying, I stuck with my solution, and I have also another
>> issues,
>> but here I could suggest solution for hugepages. I think we could transmit a
>> received pages
>> bitmap in VHOST_USER_SET_MEM_TABLE (VhostUserMemoryRegion), but it will
>> raise a compatibility issue,
>> or introduce special message type for that and send it before
>> VHOST_USER_SET_MEM_TABLE.
>> So it will be  possible to do fallocate on received bitmap basis, just skip
>> already copied pages.
>> If you wish, I could send patches, rebased on yours, for doing it.
> What we found works is that actually we don't need to do a discard -
> since we've only just done the mmap of the arena, nothing will be
> occupying it on the shared client, so we don't need to discard.
Looks like yes, I checked on kernel from Andrea's git,
there is any more EEXIST error in case when client doesn't
fallocate.

>
> We've had a postcopy migrate work now, with a few hacks we're still
> cleaning up, both on vhost-user-bridge and dpdk; so I'll get this
> updated and reposted.
In you patch series vring is disabling in case of VHOST_USER_GET_VRING_BASE.
It's being called when vhost-user server want's to stop vring.
QEMU is enabling vring as soon as virtual machine is started, so I 
didn't see
explicit vring disabling for migrating VRING.
So migrating VRING is protected just by uffd_register, isn't it? And PMD 
thread (any
vhost-user thread which accessing migrating VRING) will wait page 
copying in this case,
right?

>
> Dave
>
>>> +            ret = madvise((void *)dev_region->mmap_addr,
>>> +                          dev_region->size + dev_region->mmap_offset,
>>> +                          MADV_DONTNEED);
>>> +            if (ret) {
>>> +                fprintf(stderr,
>>> +                        "%s: Failed to madvise(DONTNEED) region %d: %s\n",
>>> +                        __func__, i, strerror(errno));
>>> +            }
>>> +            /* Turn off transparent hugepages so we dont get lose wakeups
>>> +             * in neighbouring pages.
>>> +             * TODO: Turn this backon later.
>>> +             */
>>> +            ret = madvise((void *)dev_region->mmap_addr,
>>> +                          dev_region->size + dev_region->mmap_offset,
>>> +                          MADV_NOHUGEPAGE);
>>> +            if (ret) {
>>> +                /* Note: This can happen legally on kernels that are configured
>>> +                 * without madvise'able hugepages
>>> +                 */
>>> +                fprintf(stderr,
>>> +                        "%s: Failed to madvise(NOHUGEPAGE) region %d: %s\n",
>>> +                        __func__, i, strerror(errno));
>>> +            }
>>>                struct uffdio_register reg_struct;
>>>                /* Note: We might need to go back to using mmap_addr and
>>>                 * len + mmap_offset for * huge pages, but then we do hope not to
>>
>> -- 
>> Best regards,
>> Alexey Perevalov
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>
>

-- 
Best regards,
Alexey Perevalov

Re: [Qemu-devel] [RFC 23/29] vub+postcopy: madvises
Posted by Dr. David Alan Gilbert 8 years, 6 months ago
* Alexey Perevalov (a.perevalov@samsung.com) wrote:
> On 08/08/2017 08:06 PM, Dr. David Alan Gilbert wrote:
> > * Alexey Perevalov (a.perevalov@samsung.com) wrote:
> > > On 06/28/2017 10:00 PM, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > 
> > > > Clear the area and turn off THP.
> > > > 
> > > > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > ---
> > > >    contrib/libvhost-user/libvhost-user.c | 32 ++++++++++++++++++++++++++++++--
> > > >    1 file changed, 30 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > > > index 0658b6e847..ceddeac74f 100644
> > > > --- a/contrib/libvhost-user/libvhost-user.c
> > > > +++ b/contrib/libvhost-user/libvhost-user.c
> > > > @@ -451,11 +451,39 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
> > > >            }
> > > >            if (dev->postcopy_listening) {
> > > > +            int ret;
> > > >                /* We should already have an open ufd need to mark each memory
> > > >                 * range as ufd.
> > > > -             * Note: Do we need any madvises? Well it's not been accessed
> > > > -             * yet, still probably need no THP to be safe, discard to be safe?
> > > >                 */
> > > > +
> > > > +            /* Discard any mapping we have here; note I can't use MADV_REMOVE
> > > > +             * or fallocate to make the hole since I don't want to lose
> > > > +             * data that's already arrived in the shared process.
> > > > +             * TODO: How to do hugepage
> > > > +             */
> > > Hi, David, frankly saying, I stuck with my solution, and I have also another
> > > issues,
> > > but here I could suggest solution for hugepages. I think we could transmit a
> > > received pages
> > > bitmap in VHOST_USER_SET_MEM_TABLE (VhostUserMemoryRegion), but it will
> > > raise a compatibility issue,
> > > or introduce special message type for that and send it before
> > > VHOST_USER_SET_MEM_TABLE.
> > > So it will be  possible to do fallocate on received bitmap basis, just skip
> > > already copied pages.
> > > If you wish, I could send patches, rebased on yours, for doing it.
> > What we found works is that actually we don't need to do a discard -
> > since we've only just done the mmap of the arena, nothing will be
> > occupying it on the shared client, so we don't need to discard.
> Looks like yes, I checked on kernel from Andrea's git,
> there is any more EEXIST error in case when client doesn't
> fallocate.
> 
> > 
> > We've had a postcopy migrate work now, with a few hacks we're still
> > cleaning up, both on vhost-user-bridge and dpdk; so I'll get this
> > updated and reposted.
> In you patch series vring is disabling in case of VHOST_USER_GET_VRING_BASE.
> It's being called when vhost-user server want's to stop vring.
> QEMU is enabling vring as soon as virtual machine is started, so I didn't
> see
> explicit vring disabling for migrating VRING.
> So migrating VRING is protected just by uffd_register, isn't it? And PMD
> thread (any
> vhost-user thread which accessing migrating VRING) will wait page copying in
> this case,
> right?

Yes I believe that's the case; although I don't know the structure of
dpdk to know the effect of that.

Dave

> 
> > 
> > Dave
> > 
> > > > +            ret = madvise((void *)dev_region->mmap_addr,
> > > > +                          dev_region->size + dev_region->mmap_offset,
> > > > +                          MADV_DONTNEED);
> > > > +            if (ret) {
> > > > +                fprintf(stderr,
> > > > +                        "%s: Failed to madvise(DONTNEED) region %d: %s\n",
> > > > +                        __func__, i, strerror(errno));
> > > > +            }
> > > > +            /* Turn off transparent hugepages so we dont get lose wakeups
> > > > +             * in neighbouring pages.
> > > > +             * TODO: Turn this backon later.
> > > > +             */
> > > > +            ret = madvise((void *)dev_region->mmap_addr,
> > > > +                          dev_region->size + dev_region->mmap_offset,
> > > > +                          MADV_NOHUGEPAGE);
> > > > +            if (ret) {
> > > > +                /* Note: This can happen legally on kernels that are configured
> > > > +                 * without madvise'able hugepages
> > > > +                 */
> > > > +                fprintf(stderr,
> > > > +                        "%s: Failed to madvise(NOHUGEPAGE) region %d: %s\n",
> > > > +                        __func__, i, strerror(errno));
> > > > +            }
> > > >                struct uffdio_register reg_struct;
> > > >                /* Note: We might need to go back to using mmap_addr and
> > > >                 * len + mmap_offset for * huge pages, but then we do hope not to
> > > 
> > > -- 
> > > Best regards,
> > > Alexey Perevalov
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> > 
> > 
> 
> -- 
> Best regards,
> Alexey Perevalov
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK