[PATCH] migration: savevm_state_insert_handler: constant-time element insertion

Scott Cheloha posted 1 patch 4 years, 6 months ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20191016164156.4506-1-cheloha@linux.vnet.ibm.com
Maintainers: Juan Quintela <quintela@redhat.com>, "Dr. David Alan Gilbert" <dgilbert@redhat.com>
[PATCH] migration: savevm_state_insert_handler: constant-time element insertion
Posted by Scott Cheloha 4 years, 6 months ago
Registering a SaveStateEntry object via savevm_state_insert_handler()
is an O(n) operation because the list is a priority queue maintained by
walking the list from head to tail to find a suitable insertion point.

This adds considerable overhead for VMs with many such objects.  For
instance, ppc64 machines with large maxmem (8T+) spend ~10% or more of
their CPU time in savevm_state_insert_handler() before attempting to
boot a kernel.

If we track the head for each priority's subqueue we can insert new
elements in constant time.

This commit also introduces a new function, savevm_state_remove_handler(),
which abstracts the logic for replacing the head of an element's subqueue
when removing it.

Signed-off-by: Scott Cheloha <cheloha@linux.vnet.ibm.com>
---
 migration/savevm.c | 35 ++++++++++++++++++++++++++++++-----
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index 8d95e261f6..f7a2d36bba 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -250,6 +250,7 @@ typedef struct SaveStateEntry {
 
 typedef struct SaveState {
     QTAILQ_HEAD(, SaveStateEntry) handlers;
+    SaveStateEntry *handler_pri_head[MIG_PRI_MAX + 1];
     int global_section_id;
     uint32_t len;
     const char *name;
@@ -261,6 +262,7 @@ typedef struct SaveState {
 
 static SaveState savevm_state = {
     .handlers = QTAILQ_HEAD_INITIALIZER(savevm_state.handlers),
+    .handler_pri_head = { [MIG_PRI_DEFAULT ... MIG_PRI_MAX] = NULL },
     .global_section_id = 0,
 };
 
@@ -709,20 +711,43 @@ static void savevm_state_handler_insert(SaveStateEntry *nse)
 {
     MigrationPriority priority = save_state_priority(nse);
     SaveStateEntry *se;
+    int i;
 
     assert(priority <= MIG_PRI_MAX);
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
-        if (save_state_priority(se) < priority) {
+    for (i = priority - 1; i >= 0; i--) {
+        se = savevm_state.handler_pri_head[i];
+        if (se != NULL) {
+            assert(save_state_priority(se) < priority);
             break;
         }
     }
 
-    if (se) {
+    if (i >= 0) {
         QTAILQ_INSERT_BEFORE(se, nse, entry);
     } else {
         QTAILQ_INSERT_TAIL(&savevm_state.handlers, nse, entry);
     }
+
+    if (savevm_state.handler_pri_head[priority] == NULL) {
+        savevm_state.handler_pri_head[priority] = nse;
+    }
+}
+
+static void savevm_state_handler_remove(SaveStateEntry *se)
+{
+    SaveStateEntry *next;
+    MigrationPriority priority = save_state_priority(se);
+
+    if (se == savevm_state.handler_pri_head[priority]) {
+        next = QTAILQ_NEXT(se, entry);
+        if (next != NULL && save_state_priority(next) == priority) {
+            savevm_state.handler_pri_head[priority] = next;
+        } else {
+            savevm_state.handler_pri_head[priority] = NULL;
+        }
+    }
+    QTAILQ_REMOVE(&savevm_state.handlers, se, entry);
 }
 
 /* TODO: Individual devices generally have very little idea about the rest
@@ -777,7 +802,7 @@ void unregister_savevm(DeviceState *dev, const char *idstr, void *opaque)
 
     QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
         if (strcmp(se->idstr, id) == 0 && se->opaque == opaque) {
-            QTAILQ_REMOVE(&savevm_state.handlers, se, entry);
+            savevm_state_handler_remove(se);
             g_free(se->compat);
             g_free(se);
         }
@@ -841,7 +866,7 @@ void vmstate_unregister(DeviceState *dev, const VMStateDescription *vmsd,
 
     QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
         if (se->vmsd == vmsd && se->opaque == opaque) {
-            QTAILQ_REMOVE(&savevm_state.handlers, se, entry);
+            savevm_state_handler_remove(se);
             g_free(se->compat);
             g_free(se);
         }
-- 
2.23.0


Re: [PATCH] migration: savevm_state_insert_handler: constant-time element insertion
Posted by Juan Quintela 4 years, 6 months ago
Scott Cheloha <cheloha@linux.vnet.ibm.com> wrote:

Hi

> Registering a SaveStateEntry object via savevm_state_insert_handler()
> is an O(n) operation because the list is a priority queue maintained by
> walking the list from head to tail to find a suitable insertion point.
>
> This adds considerable overhead for VMs with many such objects.  For
> instance, ppc64 machines with large maxmem (8T+) spend ~10% or more of
> their CPU time in savevm_state_insert_handler() before attempting to
> boot a kernel.

Ouch ...


> If we track the head for each priority's subqueue we can insert new
> elements in constant time.

We are adding a subqueue by priority, right? (see later comments)

> This commit also introduces a new function,
> savevm_state_remove_handler(),

savevm_state_handler_remove()

search didn't find it O:-)

> which abstracts the logic for replacing the head of an element's subqueue
> when removing it.

I think that it is better if you split out the new function creation
into its own commit.  Makes the commit easier to write O:-)


>  static SaveState savevm_state = {
>      .handlers = QTAILQ_HEAD_INITIALIZER(savevm_state.handlers),
> +    .handler_pri_head = { [MIG_PRI_DEFAULT ... MIG_PRI_MAX] = NULL },

Why are you still maintaining the handlers QTAILQ?  Once here, wouldn't
it be easier to just change the handlers field to be a
handlers[MIG_PRI_MAX] array of queues, and adjust the callers?

Changes are only inside this file.

The code to maintain the subqueue inside the other queue is just as
complex as changing all the callers.  What do you think?

savevm_state_handler_insert() for instance becomes even easier, just a
QTAILQ_INSERT_TAIL() in the proper queue, right?
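
Something like this, completely untested and with made-up names, just to
show the shape I have in mind:

    /* In SaveState, instead of the single handlers list: */
    QTAILQ_HEAD(, SaveStateEntry) handlers[MIG_PRI_MAX + 1];

    static void savevm_state_handler_insert(SaveStateEntry *nse)
    {
        MigrationPriority priority = save_state_priority(nse);

        assert(priority <= MIG_PRI_MAX);
        /* Constant time: no list walk, just append to this priority's queue. */
        QTAILQ_INSERT_TAIL(&savevm_state.handlers[priority], nse, entry);
    }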


I agree with the idea of the patch.  Especially when you told us how bad
the performance of the current code is.

Out of curiosity, how many objects are we talking about?

Later, Juan.

Re: [PATCH] migration: savevm_state_insert_handler: constant-time element insertion
Posted by Scott Cheloha 4 years, 6 months ago
On Thu, Oct 17, 2019 at 10:43:08AM +0200, Juan Quintela wrote:
> Scott Cheloha <cheloha@linux.vnet.ibm.com> wrote:
> 
> > Registering a SaveStateEntry object via savevm_state_insert_handler()
> > is an O(n) operation because the list is a priority queue maintained by
> > walking the list from head to tail to find a suitable insertion point.
> >
> > This adds considerable overhead for VMs with many such objects.  For
> > instance, ppc64 machines with large maxmem (8T+) spend ~10% or more of
> > their CPU time in savevm_state_insert_handler() before attempting to
> > boot a kernel.
> 
> Ouch ...
> 
> > If we track the head for each priority's subqueue we can insert new
> > elements in constant time.
> 
> We are adding a subqueue by priority, right? (see later comments)

One already exists.  This patch would just make insertion way, way
faster by memoizing the subqueue heads.
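
To illustrate, suppose the list currently looks like this (higher
priorities sort toward the head):

    handlers:          A(pri 3) -> B(pri 3) -> C(pri 1) -> D(pri 0)
    handler_pri_head:  [3] = A, [2] = NULL, [1] = C, [0] = D

Inserting N at priority 2 scans handler_pri_head[1], [0], ... for the
first non-NULL head (C here) and does QTAILQ_INSERT_BEFORE(C, N), giving
A, B, N, C, D; handler_pri_head[2] then becomes N.  The scan is bounded
by MIG_PRI_MAX, not by the number of registered entries.  Removal only
has to check whether the removed entry was its priority's head and, if
so, bump the head to the next entry of the same priority (or NULL).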

> > This commit also introduces a new function,
> > savevm_state_remove_handler(),
> 
> savevm_state_handler_remove()
> 
> search didn't find it O:-)

Whoops, my bad, will fix the commit message for v2.

> > which abstracts the logic for replacing the head of an element's subqueue
> > when removing it.
> 
> I think that it is better if you split the new function creation.  Make
> commit easier to write O:-)

Sure, I'll do that in the v2 patch in the next mail.

> >  static SaveState savevm_state = {
> >      .handlers = QTAILQ_HEAD_INITIALIZER(savevm_state.handlers),
> > +    .handler_pri_head = { [MIG_PRI_DEFAULT ... MIG_PRI_MAX] = NULL },
> 
> Why are you still maintaining the handlers QTAILQ?  Once here, wouldn't
> it be easier to just change the handlers field to be a
> handlers[MIG_PRI_MAX] array of queues, and adjust the callers?
> 
> Changes are only inside this file.
> 
> The code to maintain the subqueue inside the other queue is just as
> complex as changing all the callers.  What do you think?

I was trying to avoid churning the file more than absolutely
necessary.  There are 18 QTAILQ_FOREACH() loops in savevm.c right now.
Making ~15 of them double-loops doesn't make the code easier to read.
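
For example (sketch only, assuming handlers became an array indexed by
priority), every loop that today reads

    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
        /* ... per-entry work ... */
    }

would have to become something like

    for (i = 0; i <= MIG_PRI_MAX; i++) {
        QTAILQ_FOREACH(se, &savevm_state.handlers[i], entry) {
            /* ... per-entry work ... */
        }
    }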

I think incurring slight complexity on insertion/removal to make
insertion fast is well worth the conceptual simplicity of addressing
one big list of elements for every other operation.

> savevm_state_handler_insert() for instance becomes even easier, just a
> QTAILQ_INSERT_TAIL() in the proper queue, right?

Yes, insertion becomes extremely obvious: you just append the element
to the tail of its priority queue, which must already exist.

But see above for the cost.

> I agree with the idea of the patch.  Especially when you told us how bad
> the performance of the current code is.
> 
> Out of curiosity, how many objects are we talking about?

At maxmem=8T I'm seeing about 40000 elements in that list.  At
maxmem=64T I'm seeing around 262000.  The vast majority of these
elements are "spapr_drc" objects, each of which (IIRC) corresponds to
a 256MB chunk of address space.
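
(If the 256MB figure is right, the numbers line up: 64T / 256M = 262144
possible DRCs, and 8T / 256M = 32768, with the remainder of the ~40000
being other devices.)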

Re: [PATCH] migration: savevm_state_insert_handler: constant-time element insertion
Posted by Juan Quintela 4 years, 6 months ago
Scott Cheloha <cheloha@linux.vnet.ibm.com> wrote:
> On Thu, Oct 17, 2019 at 10:43:08AM +0200, Juan Quintela wrote:
>> Scott Cheloha <cheloha@linux.vnet.ibm.com> wrote:
>> 
>> > Registering a SaveStateEntry object via savevm_state_insert_handler()
>> > is an O(n) operation because the list is a priority queue maintained by
>> > walking the list from head to tail to find a suitable insertion point.
>> >
>> > This adds considerable overhead for VMs with many such objects.  For
>> > instance, ppc64 machines with large maxmem (8T+) spend ~10% or more of
>> > their CPU time in savevm_state_insert_handler() before attempting to
>> > boot a kernel.

> I was trying to avoid churning the file more than absolutely
> necessary.  There are 18 QTAILQ_FOREACH() loops in savevm.c right now.
> Making ~15 of them double-loops doesn't make the code easier to read.

Changing the code to be something different is more churn, I agree,
but ...

>
> I think incurring slight complexity on insertion/removal to make
> insertion fast is well worth the conceptual simplicity of addressing
> one big list of elements for every other operation.
>
>> savevm_state_handler_insert() for instance becomes even easier, just a
>> QTAILQ_INSERT_TAIL() in the proper queue, right?
>
> Yes, insertion becomes extremely obvious: you just append the element
> to the tail of its priority queue, which must already exist.
>
> But see above for the cost.
>
>> I agree with the idea of the patch.  Especially when you told us how bad
>> the performance of the current code is.
>> 
>> Out of curiosity, how many objects are we talking about?
>
> At maxmem=8T I'm seeing about 40000 elements in that list.  At
> maxmem=64T I'm seeing around 262000.  The vast majority of these
> elements are "spapr_drc" objects, each of which (IIRC) corresponds to
> a 256MB chunk of address space.

We are having trouble because we have too many objects.  So, the right
approach IMHO is just to use an array of queues.  Looking into the
functions:

static int calculate_new_instance_id(const char *idstr)
static int calculate_compat_instance_id(const char *idstr)
   * We can call QTAILQ_FOREACH on the proper subqueue

static void savevm_state_handler_insert(SaveStateEntry *nse)
   * We don't need the list walk if we have a proper subqueue array

void unregister_savevm(DeviceState *dev, const char *idstr, void
   *opaque)
   * We can use the proper subqueue

vmstate_unregister
   * We can use the proper subqueue

bool qemu_savevm_state_blocked(Error **errp)
  * We need to loop over all queues

void qemu_savevm_state_setup(QEMUFile *f)
int qemu_savevm_state_resume_prepare(MigrationState *s)
int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
void qemu_savevm_state_complete_postcopy(QEMUFile *f)
int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool
  in_postcopy)
int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
void qemu_savevm_state_pending(QEMUFile *f, uint64_t threshold_size,
void qemu_savevm_state_cleanup(void)
int qemu_save_device_state(QEMUFile *f)
static int qemu_loadvm_state_setup(QEMUFile *f)
void qemu_loadvm_state_cleanup(void)
 * Loop over all queues

static SaveStateEntry *find_se(const char *idstr, int instance_id)
qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis)
* We know the proper queue


But basically all the ones where we need to loop over all queues don't
have local state, so we can create a loop_over_all_handlers() function
that takes a callback and does all the work.  They don't share state
between iterations.
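
Rough sketch, names made up and untested:

    typedef void (*HandlerHook)(SaveStateEntry *se, void *opaque);

    static void loop_over_all_handlers(HandlerHook hook, void *opaque)
    {
        SaveStateEntry *se;
        int i;

        for (i = 0; i <= MIG_PRI_MAX; i++) {
            QTAILQ_FOREACH(se, &savevm_state.handlers[i], entry) {
                hook(se, opaque);
            }
        }
    }

    /* Callers that need to stop early or to remove entries would need a
     * variant (early-exit return code, QTAILQ_FOREACH_SAFE, etc.). */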

What do you think?
My problem with your approach is that it makes insertion/removal more
expensive, and that is exactly where you are seeing the performance
problems.  In the places where we need to loop over all queues, we have
to visit every element anyway, so the performance difference is going to
be negligible.

That said, having 40000 elements on that queue will make the "downtime"
for migration "interesting", to say the least, no?  How much data are we
talking about?  Should we consider moving it to a live section?


Later, Juan.