[PATCH] vm : forbid to start a removing vm

Hogan Wang posted 1 patch 3 years, 1 month ago
Test syntax-check failed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/libvirt tags/patchew/20210305011952.960-1-hogan.wang@huawei.com
There is a newer version of this series
src/qemu/qemu_driver.c | 6 ++++++
1 file changed, 6 insertions(+)
[PATCH] vm : forbid to start a removing vm
Posted by Hogan Wang 3 years, 1 month ago
From: Zhuang Shengen <zhuangshengen@huawei.com>

When a vm is doing migration phase confirm, and then start it concurrently,
it will lead to the vm out of libvirtd control.

Cause Analysis:
1. thread1 migrate vm out.
2. thread2 start the migrating vm.
3. thread1 remove vm from domain list after migrate success.
4. thread2 acquired the vm job success and start the vm.
5. cannot find the vm any more by 'virsh list' command. Actually,
   the started vm is not exist in the domain list.

Solution:
Check the vm->removing state before start.


Signed-off-by: Zhuang Shengen <zhuangshengen@huawei.com>
Reviewed-by: Hogan Wang <hogan.wang@huawei.com>
---
 src/qemu/qemu_driver.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index d1a3659774..a5dfea94cb 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -6637,6 +6637,12 @@ qemuDomainCreateWithFlags(virDomainPtr dom, unsigned int flags)
         goto endjob;
     }
 
+    if (vm->removing) {
+        virReportError(VIR_ERR_OPERATION_INVALID,
+                       "%s", _("domain is already removing"));
+        goto endjob;
+    }
+
     if (qemuDomainObjStart(dom->conn, driver, vm, flags,
                            QEMU_ASYNC_JOB_START) < 0)
         goto endjob;
-- 
2.23.0


Re: [PATCH] vm : forbid to start a removing vm
Posted by Martin Kletzander 3 years, 1 month ago
On Fri, Mar 05, 2021 at 09:19:52AM +0800, Hogan Wang wrote:
>From: Zhuang Shengen <zhuangshengen@huawei.com>
>
>When a vm is doing migration phase confirm, and then start it concurrently,
>it will lead to the vm out of libvirtd control.
>
>Cause Analysis:
>1. thread1 migrate vm out.
>2. thread2 start the migrating vm.
>3. thread1 remove vm from domain list after migrate success.
>4. thread2 acquired the vm job success and start the vm.
>5. cannot find the vm any more by 'virsh list' command. Actually,
>   the started vm is not exist in the domain list.
>
>Solution:
>Check the vm->removing state before start.
>

Well, this would only fix starting it, but there could be other ways that domain
can be started, right?  Like restoring it from a save.

Anyway, I think the issue here is that the CreateWithFlags is even able to get a
job started, I think you should look into that.

>
>Signed-off-by: Zhuang Shengen <zhuangshengen@huawei.com>
>Reviewed-by: Hogan Wang <hogan.wang@huawei.com>
>---
> src/qemu/qemu_driver.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
>diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
>index d1a3659774..a5dfea94cb 100644
>--- a/src/qemu/qemu_driver.c
>+++ b/src/qemu/qemu_driver.c
>@@ -6637,6 +6637,12 @@ qemuDomainCreateWithFlags(virDomainPtr dom, unsigned int flags)
>         goto endjob;
>     }
>
>+    if (vm->removing) {
>+        virReportError(VIR_ERR_OPERATION_INVALID,
>+                       "%s", _("domain is already removing"));
>+        goto endjob;
>+    }
>+
>     if (qemuDomainObjStart(dom->conn, driver, vm, flags,
>                            QEMU_ASYNC_JOB_START) < 0)
>         goto endjob;
>-- 
>2.23.0
>
>
Re: Re: [PATCH] vm : forbid to start a removing vm
Posted by Hogan Wang 3 years, 1 month ago
> On Fri, Mar 05, 2021 at 09:19:52AM +0800, Hogan Wang wrote:
> >From: Zhuang Shengen <zhuangshengen@huawei.com>
> >
> >When a vm is doing migration phase confirm, and then start it 
> >concurrently, it will lead to the vm out of libvirtd control.
> >
> >Cause Analysis:
> >1. thread1 migrate vm out.
> >2. thread2 start the migrating vm.
> >3. thread1 remove vm from domain list after migrate success.
> >4. thread2 acquired the vm job success and start the vm.
> >5. cannot find the vm any more by 'virsh list' command. Actually,
> >   the started vm is not exist in the domain list.
> >
> >Solution:
> >Check the vm->removing state before start.
> >
> 
> Well, this would only fix starting it, but there could be other ways
> that domain can be started, right?  Like restoring it from a save.
> 
> Anyway, I think the issue here is that the CreateWithFlags is even
> able to get a job started, I think you should look into that.
> 

Yes, it's, a removed vm begin a job may cause unanticipated results.
Therefore, I think qemuDomainObjBeginJobInternal should return error
after a removed vm acquire job success. I will push an new patch to
fix it.

> >
> >Signed-off-by: Zhuang Shengen <zhuangshengen@huawei.com>
> >Reviewed-by: Hogan Wang <hogan.wang@huawei.com>
> >---
> > src/qemu/qemu_driver.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> >diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 
> >d1a3659774..a5dfea94cb 100644
> >--- a/src/qemu/qemu_driver.c
> >+++ b/src/qemu/qemu_driver.c
> >@@ -6637,6 +6637,12 @@ qemuDomainCreateWithFlags(virDomainPtr dom, unsigned int flags)
> >         goto endjob;
> >     }
> >
> >+    if (vm->removing) {
> >+        virReportError(VIR_ERR_OPERATION_INVALID,
> >+                       "%s", _("domain is already removing"));
> >+        goto endjob;
> >+    }
> >+
> >     if (qemuDomainObjStart(dom->conn, driver, vm, flags,
> >                            QEMU_ASYNC_JOB_START) < 0)
> >         goto endjob;
> >--
> >2.23.0
> >
> >


[PATCH] already removed obj is not allowed to acquire job
Posted by Hogan Wang 3 years ago
From: Zhuang Shengen <zhuangshengen@huawei.com>

a removed vm begin a job may cause unanticipated results.so
add judgement in qemuDomainObjBeginJobInternal to forbid
a removed obj acquire the job

Signed-off-by: Zhuang Shengen <zhuangshengen@huawei.com>
Reviewed-by: Hogan Wang <hogan.wang@huawei.com>
---
 src/qemu/qemu_domainjob.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/qemu/qemu_domainjob.c b/src/qemu/qemu_domainjob.c
index 50cfc45f5b..246c3208fc 100644
--- a/src/qemu/qemu_domainjob.c
+++ b/src/qemu/qemu_domainjob.c
@@ -880,6 +880,9 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver,
     if (!nested && !qemuDomainNestedJobAllowed(&priv->job, job))
         goto retry;
 
+    if (obj->removing == 1)
+        goto error;
+
     ignore_value(virTimeMillisNow(&now));
 
     if (job) {
@@ -1011,6 +1014,10 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver,
                              "due to max_queued limit"));
         }
         ret = -2;
+    } else if (obj->removing == 1) {
+        virReportError(VIR_ERR_OPERATION_FAILED, "%s",
+                       _("cannot acquire job mutex for removed domain"));
+        ret = -2;
     } else {
         virReportSystemError(errno, "%s", _("cannot acquire job mutex"));
     }
-- 
2.23.0


Re: [PATCH] already removed obj is not allowed to acquire job
Posted by Martin Kletzander 3 years ago
On Tue, Mar 30, 2021 at 04:36:41PM +0800, Hogan Wang wrote:
>From: Zhuang Shengen <zhuangshengen@huawei.com>
>
>a removed vm begin a job may cause unanticipated results.so
>add judgement in qemuDomainObjBeginJobInternal to forbid
>a removed obj acquire the job
>

Actually what I meant was that there is some part of the code that probably is
not taking a job even though it should be. I see Michal replied to your other
post of the (same??) patch here:

   https://listman.redhat.com/archives/libvir-list/2021-March/msg00933.html

Suggesting what I meant. Have you tried that approach? The mails get confusing
with multiple postings.

>Signed-off-by: Zhuang Shengen <zhuangshengen@huawei.com>
>Reviewed-by: Hogan Wang <hogan.wang@huawei.com>
>---
> src/qemu/qemu_domainjob.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
>diff --git a/src/qemu/qemu_domainjob.c b/src/qemu/qemu_domainjob.c
>index 50cfc45f5b..246c3208fc 100644
>--- a/src/qemu/qemu_domainjob.c
>+++ b/src/qemu/qemu_domainjob.c
>@@ -880,6 +880,9 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver,
>     if (!nested && !qemuDomainNestedJobAllowed(&priv->job, job))
>         goto retry;
>
>+    if (obj->removing == 1)
>+        goto error;
>+
>     ignore_value(virTimeMillisNow(&now));
>
>     if (job) {
>@@ -1011,6 +1014,10 @@ qemuDomainObjBeginJobInternal(virQEMUDriverPtr driver,
>                              "due to max_queued limit"));
>         }
>         ret = -2;
>+    } else if (obj->removing == 1) {
>+        virReportError(VIR_ERR_OPERATION_FAILED, "%s",
>+                       _("cannot acquire job mutex for removed domain"));
>+        ret = -2;
>     } else {
>         virReportSystemError(errno, "%s", _("cannot acquire job mutex"));
>     }
>-- 
>2.23.0
>