[v1] postcopy+vhost-user/shared ram

[Qemu-devel] [RFC 22/29] vhost+postcopy: Call wakeups

Posted by Dr. David Alan Gilbert (git) 8 years, 7 months ago

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Cause the vhost-user client to be woken up whenever:
  a) We place a page in postcopy mode
  b) We get a fault and the page has already been received

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/postcopy-ram.c | 14 ++++++++++----
 migration/trace-events   |  1 +
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index e6b8160f09..b97fc9398b 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -491,7 +491,11 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
 
     trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb),
                                        rb_offset);
-    /* TODO: Check bitmap to see if we already have the page */
+    if (ramblock_recv_bitmap_test_byte_offset(aligned_rbo, rb)) {
+        trace_postcopy_request_shared_page_present(pcfd->idstr,
+                                        qemu_ram_get_idstr(rb), rb_offset);
+        return postcopy_wake_shared(pcfd, client_addr, rb);
+    }
     if (rb != mis->last_rb) {
         mis->last_rb = rb;
         migrate_send_rp_req_pages(mis, qemu_ram_get_idstr(rb),
@@ -792,7 +796,8 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
     }
 
     trace_postcopy_place_page(host);
-    return 0;
+    return postcopy_notify_shared_wake(rb,
+                                       qemu_ram_block_host_offset(rb, host));
 }
 
 /*
@@ -813,6 +818,9 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
 
             return -e;
         }
+        return postcopy_notify_shared_wake(rb,
+                                           qemu_ram_block_host_offset(rb,
+                                                                      host));
     } else {
         /* The kernel can't use UFFDIO_ZEROPAGE for hugepages */
         if (!mis->postcopy_tmp_zero_page) {
@@ -832,8 +840,6 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
         return postcopy_place_page(mis, host, mis->postcopy_tmp_zero_page,
                                    rb);
     }
-
-    return 0;
 }
 
 /*
diff --git a/migration/trace-events b/migration/trace-events
index 85a35be518..b2f7d85704 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -198,6 +198,7 @@ postcopy_ram_incoming_cleanup_entry(void) ""
 postcopy_ram_incoming_cleanup_exit(void) ""
 postcopy_ram_incoming_cleanup_join(void) ""
 postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset %"PRIx64
+postcopy_request_shared_page_present(const char *sharer, const char *rb, uint64_t rb_offset) "%s already %s offset %"PRIx64
 postcopy_wake_shared(uint64_t client_addr, const char *rb) "at %"PRIx64" in %s"
 
 save_xbzrle_page_skipping(void) ""
-- 
2.13.0

Re: [Qemu-devel] [RFC 22/29] vhost+postcopy: Call wakeups

Posted by Peter Xu 8 years, 7 months ago

On Wed, Jun 28, 2017 at 08:00:40PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Cause the vhost-user client to be woken up whenever:
>   a) We place a page in postcopy mode

Just to make sure I understand it correctly - UFFDIO_COPY will only
wake up the waiters on the same userfaultfd context, so we don't need
to wake up QEMU userfaultfd (vcpu threads), but we need to explicitly
wake up other ufds/threads, like vhost-user backends. Am I right?

Thanks,

>   b) We get a fault and the page has already been received

-- 
Peter Xu

Re: [Qemu-devel] [RFC 22/29] vhost+postcopy: Call wakeups

Posted by Andrea Arcangeli 8 years, 7 months ago

On Tue, Jul 11, 2017 at 12:22:32PM +0800, Peter Xu wrote:
> On Wed, Jun 28, 2017 at 08:00:40PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Cause the vhost-user client to be woken up whenever:
> >   a) We place a page in postcopy mode
> 
> Just to make sure I understand it correctly - UFFDIO_COPY will only
> wake up the waiters on the same userfaultfd context, so we don't need
> to wake up QEMU userfaultfd (vcpu threads), but we need to explicitly
> wake up other ufds/threads, like vhost-user backends. Am I right?

Yes.

Every "uffd" represents one and only one "mm" (i.e. a process). So
there is no way a single UFFDIO_COPY can wake the faults happening on
a process different from the "mm" the uffd is associated with.

vhost-bridge being a different process requires a UFFDIO_WAKE on its
own uffd it passed to qemu in addition of the UFFDIO_COPY that like
you said implicitly wakes the userfaults happening on the qemu process
(vcpus iothread, dataplane etc..).

On a side note there's a way not to wake userfaults implicitly in
UFFDIO_COPY in case you want to wake userfaults in batches but nobody
uses that for now (uffdio_copy.mode |= UFFDIO_COPY_MODE_DONTWAKE).

It'd be theoretically nice to optimize away the additional enter/exit
kernel introduced by the UFFDIO_WAKE and the translation table as
well.

What we could do is to add a UFFDIO_BIND that takes an "fd" as
parameter to the ioctl to bind the two uffd together. Then we could
push logical offsets in addition to the virtual address ranges when
calling UFFDIO_REGISTER_LOGICAL (the logical offsets would then match
the guest physical addresses) so that the UFFDIO_COPY_LOGICAL would
then be able to get a logical range to wakeup that the kernel would
translate into virtual addresses for all uffds bind together. Pushing
offsets into UFFDIO_REGISTER was David's idea.

That would eliminate the enter/exit kernel for the explicit
UFFDIO_WAKE and calling a single UFFDIO_COPY would be enough.

Alternatively we should make the uffd work based on file offsets
instead of virtual addresses but that would involve changes to
filesystems and it only would move the needle on top of tmpfs
(shared=on/off no difference) and hugetlbfs. It would be enough for
vhost-bridge.

Usually the uffd fault lives at the higher level of the virtual memory
subsystem and never deals with file offsets so if we can get away with
logical ranges per-uffd for UFFDIO_REGISTER and UFFDIO_COPY, it may be
simpler and easier to extend automatically to all memory types
supported by uffd (including anon which has no file offset).

No major improvement is to be expected by such an enhancement though
so it's not very high priority to implement. It's not even clear if
the complexity is worth it. Doing one more syscall per page I think
might be measurable only on very fast network. The current way of
operation where uffd are independent of each other and the translation
table is transferred by userland means is quite optimal already and
much simpler. Furthermore for hugetlbfs the performance difference
most certainly wouldn't be measurable, as the enter/exit kernel would
be diluted by a factor of 512 compared to 4k userfaults.

Thanks,
Andrea

Re: [Qemu-devel] [RFC 22/29] vhost+postcopy: Call wakeups

Posted by Peter Xu 8 years, 6 months ago

On Wed, Jul 12, 2017 at 05:00:04PM +0200, Andrea Arcangeli wrote:
> On Tue, Jul 11, 2017 at 12:22:32PM +0800, Peter Xu wrote:
> > On Wed, Jun 28, 2017 at 08:00:40PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > Cause the vhost-user client to be woken up whenever:
> > >   a) We place a page in postcopy mode
> > 
> > Just to make sure I understand it correctly - UFFDIO_COPY will only
> > wake up the waiters on the same userfaultfd context, so we don't need
> > to wake up QEMU userfaultfd (vcpu threads), but we need to explicitly
> > wake up other ufds/threads, like vhost-user backends. Am I right?
> 
> Yes.
> 
> Every "uffd" represents one and only one "mm" (i.e. a process). So
> there is no way a single UFFDIO_COPY can wake the faults happening on
> a process different from the "mm" the uffd is associated with.
> 
> vhost-bridge being a different process requires a UFFDIO_WAKE on its
> own uffd it passed to qemu in addition of the UFFDIO_COPY that like
> you said implicitly wakes the userfaults happening on the qemu process
> (vcpus iothread, dataplane etc..).
> 
> On a side note there's a way not to wake userfaults implicitly in
> UFFDIO_COPY in case you want to wake userfaults in batches but nobody
> uses that for now (uffdio_copy.mode |= UFFDIO_COPY_MODE_DONTWAKE).
> 
> It'd be theoretically nice to optimize away the additional enter/exit
> kernel introduced by the UFFDIO_WAKE and the translation table as
> well.
> 
> What we could do is to add a UFFDIO_BIND that takes an "fd" as
> parameter to the ioctl to bind the two uffd together. Then we could
> push logical offsets in addition to the virtual address ranges when
> calling UFFDIO_REGISTER_LOGICAL (the logical offsets would then match
> the guest physical addresses) so that the UFFDIO_COPY_LOGICAL would
> then be able to get a logical range to wakeup that the kernel would
> translate into virtual addresses for all uffds bind together. Pushing
> offsets into UFFDIO_REGISTER was David's idea.
> 
> That would eliminate the enter/exit kernel for the explicit
> UFFDIO_WAKE and calling a single UFFDIO_COPY would be enough.
> 
> Alternatively we should make the uffd work based on file offsets
> instead of virtual addresses but that would involve changes to
> filesystems and it only would move the needle on top of tmpfs
> (shared=on/off no difference) and hugetlbfs. It would be enough for
> vhost-bridge.

Really glad to know these ideas.

> 
> Usually the uffd fault lives at the higher level of the virtual memory
> subsystem and never deals with file offsets so if we can get away with
> logical ranges per-uffd for UFFDIO_REGISTER and UFFDIO_COPY, it may be
> simpler and easier to extend automatically to all memory types
> supported by uffd (including anon which has no file offset).
> 
> No major improvement is to be expected by such an enhancement though
> so it's not very high priority to implement. It's not even clear if
> the complexity is worth it. Doing one more syscall per page I think
> might be measurable only on very fast network. The current way of
> operation where uffd are independent of each other and the translation
> table is transferred by userland means is quite optimal already and
> much simpler. Furthermore for hugetlbfs the performance difference
> most certainly wouldn't be measurable, as the enter/exit kernel would
> be diluted by a factor of 512 compared to 4k userfaults.

Indeed, performance critical scenarios should be using huge pages, and
that means that extra WAKE will have even smaller impact.

Thanks Andrea!

-- 
Peter Xu

Re: [Qemu-devel] [RFC 22/29] vhost+postcopy: Call wakeups

Posted by Michael S. Tsirkin 8 years, 6 months ago

On Wed, Jul 12, 2017 at 05:00:04PM +0200, Andrea Arcangeli wrote:
> What we could do is to add a UFFDIO_BIND that takes an "fd" as
> parameter to the ioctl to bind the two uffd together. Then we could
> push logical offsets in addition to the virtual address ranges when
> calling UFFDIO_REGISTER_LOGICAL (the logical offsets would then match
> the guest physical addresses) so that the UFFDIO_COPY_LOGICAL would
> then be able to get a logical range to wakeup that the kernel would
> translate into virtual addresses for all uffds bind together. Pushing
> offsets into UFFDIO_REGISTER was David's idea.

I think it was mine originally just in an off-list discussion.
To me it seems cleaner to do UFFDIO_DUP which gives you
a new fd bound to the current one, though.

> That would eliminate the enter/exit kernel for the explicit
> UFFDIO_WAKE and calling a single UFFDIO_COPY would be enough.
> 
> Alternatively we should make the uffd work based on file offsets
> instead of virtual addresses but that would involve changes to
> filesystems and it only would move the needle on top of tmpfs
> (shared=on/off no difference) and hugetlbfs. It would be enough for
> vhost-bridge.
> 
> Usually the uffd fault lives at the higher level of the virtual memory
> subsystem and never deals with file offsets so if we can get away with
> logical ranges per-uffd for UFFDIO_REGISTER and UFFDIO_COPY, it may be
> simpler and easier to extend automatically to all memory types
> supported by uffd (including anon which has no file offset).
> 
> No major improvement is to be expected by such an enhancement though
> so it's not very high priority to implement. It's not even clear if
> the complexity is worth it. Doing one more syscall per page I think
> might be measurable only on very fast network. The current way of
> operation where uffd are independent of each other and the translation
> table is transferred by userland means is quite optimal already and
> much simpler. Furthermore for hugetlbfs the performance difference
> most certainly wouldn't be measurable, as the enter/exit kernel would
> be diluted by a factor of 512 compared to 4k userfaults.
> 
> Thanks,
> Andrea