[libvirt] [PATCH v3 00/11] qemu: replace nested job with interruptible async job state

Hello, everyone.

This is the successor to [1]
"[PATCH v2 RFC 0/4] qemu: replace nested job with interruptible async job state".
I tried to expand on the series rationale, to split the patches better and
to add comments.

ABSTRACT

Let's use an interruptible async job state instead of nested jobs
to control concurrent execution of async and regular jobs.

RATIONALE

I'm not quite happy with the current implementation of concurrent
execution of async and regular jobs because it is not straightforward.

What is the current implementation of job concurrency? While an async job is
running, any compatible regular job is allowed to run concurrently. In practice
this means that as soon as the async job drops the domain lock, another rpc
thread can run in, find the domain in the domain list and start a regular job.
This is what happens, for example, when migration drops the lock before waiting
for the migration completion event or before sending a request to the
destination side. But if we are going to send a command to the qemu monitor and
accordingly drop the domain lock, this concurrency is not desired because the
qemu monitor cannot handle simultaneous commands now. (This is probably not the
only reason. I guess we don't want to think about what can happen if a regular
job runs concurrently with, for example, migration at an arbitrary place. The
ability to run concurrently during disk mirroring or state transfer, which take
most of the migration time, looks like enough.) Thus nested jobs were
introduced. Now if an async job wants to send a command to the monitor, it
starts a special nested job beforehand, thus effectively blocking concurrent
regular jobs.
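
For readers less familiar with the code, here is a rough sketch of the current
nested-job pattern as seen from an async job. The surrounding function is
hypothetical and the signatures are simplified from memory, so treat it as
illustration only:

#include "qemu_domain.h"   /* qemuDomainObjEnterMonitorAsync() and friends */

/* Illustration only: a hypothetical async-job step that needs to talk
 * to the qemu monitor.  The nested job is taken inside
 * qemuDomainObjEnterMonitorAsync(), which is what keeps concurrent
 * regular jobs out for the duration of the monitor call. */
static int
asyncJobQueryMonitor(virQEMUDriverPtr driver,
                     virDomainObjPtr vm,
                     qemuDomainAsyncJob asyncJob)
{
    if (qemuDomainObjEnterMonitorAsync(driver, vm, asyncJob) < 0)
        return -1;          /* could not start the nested job */

    /* ... send monitor command(s); the domain lock is dropped here, but
     * regular jobs are still blocked by the nested job ... */

    if (qemuDomainObjExitMonitor(driver, vm) < 0)
        return -1;          /* domain went away while unlocked */

    return 0;
}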

To me this sounds like a hacky way to express the fact that the async job
allows concurrency only in a limited set of places. Let's call those places
interruptible. So I suggest introducing an interruptible state for the async
job. Before we are going to wait, for example, for migration completion, we
enter this state, allowing regular jobs to run concurrently. Later, when the
migration completion condition is met, we wait for the concurrent regular job,
if any, to finish before proceeding. This final step is required by the very
definition of the approach: when the async job is not interruptible, no
regular job can run concurrently. In practice this protects us from sending a
qemu monitor command from the async job while a concurrent regular job has not
yet finished its own qemu monitor communication.
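
To make the proposal more concrete, here is a hypothetical sketch of the
interruptible pattern; the helper names below are placeholders I made up for
illustration and are not necessarily the ones used in the series:

/* Illustration only: placeholder helper names.  Regular jobs may run
 * only while the async job is inside the interruptible section. */
static void
asyncJobWaitForCompletion(virDomainObjPtr vm)
{
    /* Allow compatible regular jobs (e.g. a concurrent domstats call)
     * to run while we sleep waiting for the event. */
    qemuDomainObjEnterInterruptible(vm);

    /* ... wait for the migration completion event; the domain lock is
     * dropped here ... */

    /* Leaving the interruptible state waits for a concurrent regular
     * job, if any, to finish, so past this point the async job again
     * has the qemu monitor to itself. */
    qemuDomainObjExitInterruptible(vm);
}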

It sounds like this final wait can have consequences for migration. Like, what
if finishing the concurrent job takes too much time? In this case we get
increased migration downtime. Well, we get this increase with the current
approach too, because right after the migration completion condition is met we
fetch migration stats and wait in the enter-monitor function anyway until the
concurrent job is finished. This also demonstrates that the current approach
isolates the async job and a concurrent regular job only loosely. It is
possible that a regular job starts, then the async job resumes and continues
its execution until it requires the qemu monitor, and only then does the
regular job finish its execution. I guess it will be easier to reason about
concurrency if, once a concurrent regular job is started, the async job cannot
continue until the regular job is finished. Check patch 4 for a demonstration
that good job isolation helps.

TESTS

To test the concurrency I ran a migration of a non-empty domain with a shared
disk about 100 times, each run with concurrent domstats calls for this domain
in a tight loop. Looks good.

MISSING PARTS

- Enter/exit the interruptible state at the end/beginning of migration phases.
  Until this is done we have a bit of degradation because the domain won't
  respond, for example, to queries between migration phases.

- Update THREADS.txt

I'm going to implement the missing parts as soon as the approach gets overall
approval.

TODO

1. Drop the driver argument from calls to qemuDomainObjEnterMonitor.
2. Replace qemuDomainObjEnterMonitorAsync with qemuDomainObjEnterMonitor.
3. Make qemuDomainObjExitMonitor void and drop its driver argument.

[1] https://www.redhat.com/archives/libvir-list/2017-May/msg00018.html

Nikolay Shirokovskiy (11):
  qemu: introduce routines for interruptible state
  qemu: check interruptible state instead of taking nested job
  qemu: enter interruptlible state where appropriate
  qemu: remove nested job usage from qemuProcessStop
  qemu: remove unused qemuDomainObjBeginNestedJob
  qemu: remove QEMU_JOB_ASYNC_NESTED
  qemu: remove liveness check from qemuDomainObjExitMonitor
  qemu: drop qemuDomainObjEnterMonitorInternal
  qemu: drop qemuDomainObjExitMonitorInternal
  conf: cleanup: drop virDomainObjWait
  qemu: rename qemuDomainNestedJobAllowed

 src/conf/domain_conf.c    |  19 ----
 src/conf/domain_conf.h    |   1 -
 src/libvirt_private.syms  |   1 -
 src/qemu/qemu_domain.c    | 265 ++++++++++++++++++++++++++++++----------------
 src/qemu/qemu_domain.h    |  17 ++-
 src/qemu/qemu_driver.c    |   2 +-
 src/qemu/qemu_migration.c |  18 ++--
 src/qemu/qemu_process.c   |  19 +---
 8 files changed, 195 insertions(+), 147 deletions(-)

-- 
1.8.3.1
