[libvirt] [PATCH] libvirt: add daemon itself as shutdown reason

Nikolay Shirokovskiy posted 1 patch 5 years, 6 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/libvirt tags/patchew/1538997668-625740-1-git-send-email-nshirokovskiy@virtuozzo.com
Test syntax-check passed
There is a newer version of this series
include/libvirt/libvirt-domain.h |  1 +
src/conf/domain_conf.c           |  3 ++-
src/qemu/qemu_process.c          | 11 ++++-------
tools/virsh-domain-monitor.c     |  3 ++-
4 files changed, 9 insertions(+), 9 deletions(-)
[libvirt] [PATCH] libvirt: add daemon itself as shutdown reason
Posted by Nikolay Shirokovskiy 5 years, 6 months ago
Let's introduce shutdown reason "daemon" which means we have to
kill running domain ourselves as the best action we can take at
that moment. Failure to pick up domain on daemon restart is
one example of such case. Using reason "crashed" is a bit misleading
as it is used when qemu is actually crashed or qemu destroyed on
guest panic as result of user choice that is the problem was
in qemu/guest itself. So I propose to use "crashed" only for
qemu side issues and introduce "daemon" when we have to kill the qemu
for good.

There is another example where "daemon" will be useful. If we can
not reboot domain we kill it and got "crashed" reason now. Looks
like good candidate for "daemon" reason.

Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
---
 include/libvirt/libvirt-domain.h |  1 +
 src/conf/domain_conf.c           |  3 ++-
 src/qemu/qemu_process.c          | 11 ++++-------
 tools/virsh-domain-monitor.c     |  3 ++-
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
index fdd2d6b..11fdab5 100644
--- a/include/libvirt/libvirt-domain.h
+++ b/include/libvirt/libvirt-domain.h
@@ -145,6 +145,7 @@ typedef enum {
     VIR_DOMAIN_SHUTOFF_FAILED = 6,      /* domain failed to start */
     VIR_DOMAIN_SHUTOFF_FROM_SNAPSHOT = 7, /* restored from a snapshot which was
                                            * taken while domain was shutoff */
+    VIR_DOMAIN_SHUTOFF_DAEMON = 8,      /* daemon have to kill domain */
 # ifdef VIR_ENUM_SENTINELS
     VIR_DOMAIN_SHUTOFF_LAST
 # endif
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 9911d56..e441723 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -779,7 +779,8 @@ VIR_ENUM_IMPL(virDomainShutoffReason, VIR_DOMAIN_SHUTOFF_LAST,
               "migrated",
               "saved",
               "failed",
-              "from snapshot")
+              "from snapshot",
+              "daemon")
 
 VIR_ENUM_IMPL(virDomainCrashedReason, VIR_DOMAIN_CRASHED_LAST,
               "unknown",
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index e9c7618..c4bc9ca 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -7988,15 +7988,12 @@ qemuProcessReconnect(void *opaque)
         /* We can't get the monitor back, so must kill the VM
          * to remove danger of it ending up running twice if
          * user tries to start it again later
-         * If we couldn't get the monitor since QEMU supports
-         * no-shutdown, we can safely say that the domain
-         * crashed ... */
-        state = VIR_DOMAIN_SHUTOFF_CRASHED;
-        /* If BeginJob failed, we jumped here without a job, let's hope another
+         * If BeginJob failed, we jumped here without a job, let's hope another
          * thread didn't have a chance to start playing with the domain yet
          * (it's all we can do anyway).
          */
-        qemuProcessStop(driver, obj, state, QEMU_ASYNC_JOB_NONE, stopFlags);
+        qemuProcessStop(driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
+                        QEMU_ASYNC_JOB_NONE, stopFlags);
     }
     goto cleanup;
 }
@@ -8035,7 +8032,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
          * is no thread that could be doing anything else with the same domain
          * object.
          */
-        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
+        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
                         QEMU_ASYNC_JOB_NONE, 0);
         qemuDomainRemoveInactiveJobLocked(src->driver, obj);
 
diff --git a/tools/virsh-domain-monitor.c b/tools/virsh-domain-monitor.c
index 3a26363..f0ad558 100644
--- a/tools/virsh-domain-monitor.c
+++ b/tools/virsh-domain-monitor.c
@@ -212,7 +212,8 @@ VIR_ENUM_IMPL(virshDomainShutoffReason,
               N_("migrated"),
               N_("saved"),
               N_("failed"),
-              N_("from snapshot"))
+              N_("from snapshot"),
+              N_("daemon"))
 
 VIR_ENUM_DECL(virshDomainCrashedReason)
 VIR_ENUM_IMPL(virshDomainCrashedReason,
-- 
1.8.3.1

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] libvirt: add daemon itself as shutdown reason
Posted by John Ferlan 5 years, 6 months ago

On 10/8/18 7:21 AM, Nikolay Shirokovskiy wrote:
> Let's introduce shutdown reason "daemon" which means we have to
> kill running domain ourselves as the best action we can take at
> that moment. Failure to pick up domain on daemon restart is
> one example of such case. Using reason "crashed" is a bit misleading
> as it is used when qemu is actually crashed or qemu destroyed on
> guest panic as result of user choice that is the problem was
> in qemu/guest itself. So I propose to use "crashed" only for
> qemu side issues and introduce "daemon" when we have to kill the qemu
> for good.

How about (although reading below may shed some more context):

This patch introduces a new shutdown reason "daemon" in order
to indicate that the daemon needed to force shutdown the domain
as the best course of action to take at the moment.

This action would occur during Reconnect processing when it's
determined that either the QEMU monitor no longer exists for
some unknown reason or creation of a thread to reconnect the
domain failed.

> 
> There is another example where "daemon" will be useful. If we can
> not reboot domain we kill it and got "crashed" reason now. Looks
> like good candidate for "daemon" reason.

If you feel this way, then a followup patch could/should be posted;
otherwise, this'll be lost to commit message heaven where all good ideas
go to die ;-).

> 
> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
> ---
>  include/libvirt/libvirt-domain.h |  1 +
>  src/conf/domain_conf.c           |  3 ++-
>  src/qemu/qemu_process.c          | 11 ++++-------
>  tools/virsh-domain-monitor.c     |  3 ++-
>  4 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
> index fdd2d6b..11fdab5 100644
> --- a/include/libvirt/libvirt-domain.h
> +++ b/include/libvirt/libvirt-domain.h
> @@ -145,6 +145,7 @@ typedef enum {
>      VIR_DOMAIN_SHUTOFF_FAILED = 6,      /* domain failed to start */
>      VIR_DOMAIN_SHUTOFF_FROM_SNAPSHOT = 7, /* restored from a snapshot which was
>                                             * taken while domain was shutoff */
> +    VIR_DOMAIN_SHUTOFF_DAEMON = 8,      /* daemon have to kill domain */

s/have to/decides to/

>  # ifdef VIR_ENUM_SENTINELS
>      VIR_DOMAIN_SHUTOFF_LAST
>  # endif
> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
> index 9911d56..e441723 100644
> --- a/src/conf/domain_conf.c
> +++ b/src/conf/domain_conf.c
> @@ -779,7 +779,8 @@ VIR_ENUM_IMPL(virDomainShutoffReason, VIR_DOMAIN_SHUTOFF_LAST,
>                "migrated",
>                "saved",
>                "failed",
> -              "from snapshot")
> +              "from snapshot",
> +              "daemon")
>  
>  VIR_ENUM_IMPL(virDomainCrashedReason, VIR_DOMAIN_CRASHED_LAST,
>                "unknown",
> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> index e9c7618..c4bc9ca 100644
> --- a/src/qemu/qemu_process.c
> +++ b/src/qemu/qemu_process.c
> @@ -7988,15 +7988,12 @@ qemuProcessReconnect(void *opaque)
>          /* We can't get the monitor back, so must kill the VM
>           * to remove danger of it ending up running twice if
>           * user tries to start it again later

Since we're changing anyway, let's put the stop there, e.g.
"s/later/later./"

> -         * If we couldn't get the monitor since QEMU supports
> -         * no-shutdown, we can safely say that the domain
> -         * crashed ... */
> -        state = VIR_DOMAIN_SHUTOFF_CRASHED;
> -        /* If BeginJob failed, we jumped here without a job, let's hope another
> +         * If BeginJob failed, we jumped here without a job, let's hope another
>           * thread didn't have a chance to start playing with the domain yet
>           * (it's all we can do anyway).
>           */
> -        qemuProcessStop(driver, obj, state, QEMU_ASYNC_JOB_NONE, stopFlags);
> +        qemuProcessStop(driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
> +                        QEMU_ASYNC_JOB_NONE, stopFlags);
>      }
>      goto cleanup;
>  }
> @@ -8035,7 +8032,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
>           * is no thread that could be doing anything else with the same domain
>           * object.
>           */
> -        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
> +        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>                          QEMU_ASYNC_JOB_NONE, 0);
>          qemuDomainRemoveInactiveJobLocked(src->driver, obj);
>  

When qemuProcessReconnectHelper was introduced (commit d38897a5d)
reconnection failure used VIR_DOMAIN_SHUTOFF_FAILED; however, that was
changed in commit bda2f17d to either VIR_DOMAIN_SHUTOFF_CRASHED or
VIR_DOMAIN_SHUTOFF_UNKNOWN. When QEMU_CAPS_NO_SHUTDOWN checking was
removed in commit fe35b1ad6 the conditional state was just left at
VIR_DOMAIN_SHUTOFF_CRASHED.

Still I agree w/ Martin's thoughts, which unfortunately wasn't commented
on in the code leading to the latter revert commit losing that unknown
reason. The reason for the "-no-shutdown" *did* change and that probably
should have been reflected in the qemuProcessReconnect logic.

So I think perhaps we should have an initial patch which references or
uses that first paragraph in order to change the code to:

        /* We cannot get the monitor back, so kill the VM to
         * remove possibility of it ending up running twice if
         * user tries to start it again later.
         *
         * If we cannot get to the monitor when the QEMU command
         * line used -no-shutdown, then we can safely say that the
         * domain crashed; otherwise we don't really know. */
        if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES)
            state = VIR_DOMAIN_SHUTOFF_CRASHED;
        else
            state = VIR_DOMAIN_SHUTOFF_UNKNOWN;

and uses the short commit msg of "qemu: Restore lost shutdown reason".

Then this followup would change the "UNKNOWN" to "DAEMON" keeping the
CRASHED otherwise.

John

> diff --git a/tools/virsh-domain-monitor.c b/tools/virsh-domain-monitor.c
> index 3a26363..f0ad558 100644
> --- a/tools/virsh-domain-monitor.c
> +++ b/tools/virsh-domain-monitor.c
> @@ -212,7 +212,8 @@ VIR_ENUM_IMPL(virshDomainShutoffReason,
>                N_("migrated"),
>                N_("saved"),
>                N_("failed"),
> -              N_("from snapshot"))
> +              N_("from snapshot"),
> +              N_("daemon"))
>  
>  VIR_ENUM_DECL(virshDomainCrashedReason)
>  VIR_ENUM_IMPL(virshDomainCrashedReason,
> 

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] libvirt: add daemon itself as shutdown reason
Posted by Nikolay Shirokovskiy 5 years, 6 months ago

On 16.10.2018 15:48, John Ferlan wrote:
> 
> 
> On 10/8/18 7:21 AM, Nikolay Shirokovskiy wrote:
>> Let's introduce shutdown reason "daemon" which means we have to
>> kill running domain ourselves as the best action we can take at
>> that moment. Failure to pick up domain on daemon restart is
>> one example of such case. Using reason "crashed" is a bit misleading
>> as it is used when qemu is actually crashed or qemu destroyed on
>> guest panic as result of user choice that is the problem was
>> in qemu/guest itself. So I propose to use "crashed" only for
>> qemu side issues and introduce "daemon" when we have to kill the qemu
>> for good.
> 
> How about (although reading below may shed some more context):
> 
> This patch introduces a new shutdown reason "daemon" in order
> to indicate that the daemon needed to force shutdown the domain
> as the best course of action to take at the moment.
> 
> This action would occur during Reconnect processing when it's
> determined that either the QEMU monitor no longer exists for
> some unknown reason or creation of a thread to reconnect the
> domain failed.

Mostly I'm fine with this one except "monitor no longer exists"
is condition for reason "crashed" or "unknown" (Martin's contribution)

> 
>>
>> There is another example where "daemon" will be useful. If we can
>> not reboot domain we kill it and got "crashed" reason now. Looks
>> like good candidate for "daemon" reason.
> 
> If you feel this way, then a followup patch could/should be posted;
> otherwise, this'll be lost to commit message heaven where all good ideas
> go to die ;-).

Sure!

> 
>>
>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>> ---
>>  include/libvirt/libvirt-domain.h |  1 +
>>  src/conf/domain_conf.c           |  3 ++-
>>  src/qemu/qemu_process.c          | 11 ++++-------
>>  tools/virsh-domain-monitor.c     |  3 ++-
>>  4 files changed, 9 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
>> index fdd2d6b..11fdab5 100644
>> --- a/include/libvirt/libvirt-domain.h
>> +++ b/include/libvirt/libvirt-domain.h
>> @@ -145,6 +145,7 @@ typedef enum {
>>      VIR_DOMAIN_SHUTOFF_FAILED = 6,      /* domain failed to start */
>>      VIR_DOMAIN_SHUTOFF_FROM_SNAPSHOT = 7, /* restored from a snapshot which was
>>                                             * taken while domain was shutoff */
>> +    VIR_DOMAIN_SHUTOFF_DAEMON = 8,      /* daemon have to kill domain */
> 
> s/have to/decides to/
> 
>>  # ifdef VIR_ENUM_SENTINELS
>>      VIR_DOMAIN_SHUTOFF_LAST
>>  # endif
>> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
>> index 9911d56..e441723 100644
>> --- a/src/conf/domain_conf.c
>> +++ b/src/conf/domain_conf.c
>> @@ -779,7 +779,8 @@ VIR_ENUM_IMPL(virDomainShutoffReason, VIR_DOMAIN_SHUTOFF_LAST,
>>                "migrated",
>>                "saved",
>>                "failed",
>> -              "from snapshot")
>> +              "from snapshot",
>> +              "daemon")
>>  
>>  VIR_ENUM_IMPL(virDomainCrashedReason, VIR_DOMAIN_CRASHED_LAST,
>>                "unknown",
>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>> index e9c7618..c4bc9ca 100644
>> --- a/src/qemu/qemu_process.c
>> +++ b/src/qemu/qemu_process.c
>> @@ -7988,15 +7988,12 @@ qemuProcessReconnect(void *opaque)
>>          /* We can't get the monitor back, so must kill the VM
>>           * to remove danger of it ending up running twice if
>>           * user tries to start it again later
> 
> Since we're changing anyway, let's put the stop there, e.g.
> "s/later/later./"
> 
>> -         * If we couldn't get the monitor since QEMU supports
>> -         * no-shutdown, we can safely say that the domain
>> -         * crashed ... */
>> -        state = VIR_DOMAIN_SHUTOFF_CRASHED;
>> -        /* If BeginJob failed, we jumped here without a job, let's hope another
>> +         * If BeginJob failed, we jumped here without a job, let's hope another
>>           * thread didn't have a chance to start playing with the domain yet
>>           * (it's all we can do anyway).
>>           */
>> -        qemuProcessStop(driver, obj, state, QEMU_ASYNC_JOB_NONE, stopFlags);
>> +        qemuProcessStop(driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>> +                        QEMU_ASYNC_JOB_NONE, stopFlags);
>>      }
>>      goto cleanup;
>>  }
>> @@ -8035,7 +8032,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
>>           * is no thread that could be doing anything else with the same domain
>>           * object.
>>           */
>> -        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
>> +        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>>                          QEMU_ASYNC_JOB_NONE, 0);
>>          qemuDomainRemoveInactiveJobLocked(src->driver, obj);
>>  
> 
> When qemuProcessReconnectHelper was introduced (commit d38897a5d)
> reconnection failure used VIR_DOMAIN_SHUTOFF_FAILED; however, that was
> changed in commit bda2f17d to either VIR_DOMAIN_SHUTOFF_CRASHED or
> VIR_DOMAIN_SHUTOFF_UNKNOWN. When QEMU_CAPS_NO_SHUTDOWN checking was
> removed in commit fe35b1ad6 the conditional state was just left at
> VIR_DOMAIN_SHUTOFF_CRASHED.
> 
> Still I agree w/ Martin's thoughts, which unfortunately wasn't commented
> on in the code leading to the latter revert commit losing that unknown
> reason. The reason for the "-no-shutdown" *did* change and that probably
> should have been reflected in the qemuProcessReconnect logic.
> 
> So I think perhaps we should have an initial patch which references or
> uses that first paragraph in order to change the code to:
> 
>         /* We cannot get the monitor back, so kill the VM to
>          * remove possibility of it ending up running twice if
>          * user tries to start it again later.
>          *
>          * If we cannot get to the monitor when the QEMU command
>          * line used -no-shutdown, then we can safely say that the
>          * domain crashed; otherwise we don't really know. */
>         if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES)
>             state = VIR_DOMAIN_SHUTOFF_CRASHED;
>         else
>             state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
> 
> and uses the short commit msg of "qemu: Restore lost shutdown reason".
> 
> Then this followup would change the "UNKNOWN" to "DAEMON" keeping the
> CRASHED otherwise.
> 

Then we will get "crashed" reason always for modern qemu and domains
that can reboot which is not true.

However I see now that the issue is more complicated then I thought. I agree
with Martin too but we have to additionally check that connect(2) for domain
monitor socket was called and error was ECONNREFUSED. So finally state is
resolved like this:

if (connect(2) was not called) {
    state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
else if (connect(2) result is NOT ECONNREFUSED) {
    state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
} else if (connect(2) result is ECONNREFUSED) {
    if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES)
        state = VIR_DOMAIN_SHUTOFF_CRASHED;
    else
        state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
} else {
    state = VIR_DOMAIN_SHUTOFF_DAEMON;
}

Moreover we should not later kill process by remembered pid in all cases except
last one in the above code because we can kill some other process.

Looks like a bit complicated, not sure should we pursuit this one or just left
VIR_DOMAIN_SHUTOFF_UNKNOWN only (in this case we can introduce
VIR_DOMAIN_SHUTOFF_DAEMON for reboot failures where reason is obvisous).

Nikolay

> 
>> diff --git a/tools/virsh-domain-monitor.c b/tools/virsh-domain-monitor.c
>> index 3a26363..f0ad558 100644
>> --- a/tools/virsh-domain-monitor.c
>> +++ b/tools/virsh-domain-monitor.c
>> @@ -212,7 +212,8 @@ VIR_ENUM_IMPL(virshDomainShutoffReason,
>>                N_("migrated"),
>>                N_("saved"),
>>                N_("failed"),
>> -              N_("from snapshot"))
>> +              N_("from snapshot"),
>> +              N_("daemon"))
>>  
>>  VIR_ENUM_DECL(virshDomainCrashedReason)
>>  VIR_ENUM_IMPL(virshDomainCrashedReason,
>>

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] libvirt: add daemon itself as shutdown reason
Posted by John Ferlan 5 years, 6 months ago

On 10/17/18 4:16 AM, Nikolay Shirokovskiy wrote:
> 
> 
> On 16.10.2018 15:48, John Ferlan wrote:
>>
>>
>> On 10/8/18 7:21 AM, Nikolay Shirokovskiy wrote:
>>> Let's introduce shutdown reason "daemon" which means we have to
>>> kill running domain ourselves as the best action we can take at
>>> that moment. Failure to pick up domain on daemon restart is
>>> one example of such case. Using reason "crashed" is a bit misleading
>>> as it is used when qemu is actually crashed or qemu destroyed on
>>> guest panic as result of user choice that is the problem was
>>> in qemu/guest itself. So I propose to use "crashed" only for
>>> qemu side issues and introduce "daemon" when we have to kill the qemu
>>> for good.
>>
>> How about (although reading below may shed some more context):
>>
>> This patch introduces a new shutdown reason "daemon" in order
>> to indicate that the daemon needed to force shutdown the domain
>> as the best course of action to take at the moment.
>>
>> This action would occur during Reconnect processing when it's
>> determined that either the QEMU monitor no longer exists for
>> some unknown reason or creation of a thread to reconnect the
>> domain failed.
> 
> Mostly I'm fine with this one except "monitor no longer exists"
> is condition for reason "crashed" or "unknown" (Martin's contribution)
> 
>>
>>>
>>> There is another example where "daemon" will be useful. If we can
>>> not reboot domain we kill it and got "crashed" reason now. Looks
>>> like good candidate for "daemon" reason.
>>
>> If you feel this way, then a followup patch could/should be posted;
>> otherwise, this'll be lost to commit message heaven where all good ideas
>> go to die ;-).
> 
> Sure!
> 
>>
>>>
>>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>> ---
>>>  include/libvirt/libvirt-domain.h |  1 +
>>>  src/conf/domain_conf.c           |  3 ++-
>>>  src/qemu/qemu_process.c          | 11 ++++-------
>>>  tools/virsh-domain-monitor.c     |  3 ++-
>>>  4 files changed, 9 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
>>> index fdd2d6b..11fdab5 100644
>>> --- a/include/libvirt/libvirt-domain.h
>>> +++ b/include/libvirt/libvirt-domain.h
>>> @@ -145,6 +145,7 @@ typedef enum {
>>>      VIR_DOMAIN_SHUTOFF_FAILED = 6,      /* domain failed to start */
>>>      VIR_DOMAIN_SHUTOFF_FROM_SNAPSHOT = 7, /* restored from a snapshot which was
>>>                                             * taken while domain was shutoff */
>>> +    VIR_DOMAIN_SHUTOFF_DAEMON = 8,      /* daemon have to kill domain */
>>
>> s/have to/decides to/
>>
>>>  # ifdef VIR_ENUM_SENTINELS
>>>      VIR_DOMAIN_SHUTOFF_LAST
>>>  # endif
>>> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
>>> index 9911d56..e441723 100644
>>> --- a/src/conf/domain_conf.c
>>> +++ b/src/conf/domain_conf.c
>>> @@ -779,7 +779,8 @@ VIR_ENUM_IMPL(virDomainShutoffReason, VIR_DOMAIN_SHUTOFF_LAST,
>>>                "migrated",
>>>                "saved",
>>>                "failed",
>>> -              "from snapshot")
>>> +              "from snapshot",
>>> +              "daemon")
>>>  
>>>  VIR_ENUM_IMPL(virDomainCrashedReason, VIR_DOMAIN_CRASHED_LAST,
>>>                "unknown",
>>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>>> index e9c7618..c4bc9ca 100644
>>> --- a/src/qemu/qemu_process.c
>>> +++ b/src/qemu/qemu_process.c
>>> @@ -7988,15 +7988,12 @@ qemuProcessReconnect(void *opaque)
>>>          /* We can't get the monitor back, so must kill the VM
>>>           * to remove danger of it ending up running twice if
>>>           * user tries to start it again later
>>
>> Since we're changing anyway, let's put the stop there, e.g.
>> "s/later/later./"
>>
>>> -         * If we couldn't get the monitor since QEMU supports
>>> -         * no-shutdown, we can safely say that the domain
>>> -         * crashed ... */
>>> -        state = VIR_DOMAIN_SHUTOFF_CRASHED;
>>> -        /* If BeginJob failed, we jumped here without a job, let's hope another
>>> +         * If BeginJob failed, we jumped here without a job, let's hope another
>>>           * thread didn't have a chance to start playing with the domain yet
>>>           * (it's all we can do anyway).
>>>           */
>>> -        qemuProcessStop(driver, obj, state, QEMU_ASYNC_JOB_NONE, stopFlags);
>>> +        qemuProcessStop(driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>>> +                        QEMU_ASYNC_JOB_NONE, stopFlags);
>>>      }
>>>      goto cleanup;
>>>  }
>>> @@ -8035,7 +8032,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
>>>           * is no thread that could be doing anything else with the same domain
>>>           * object.
>>>           */
>>> -        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
>>> +        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>>>                          QEMU_ASYNC_JOB_NONE, 0);
>>>          qemuDomainRemoveInactiveJobLocked(src->driver, obj);
>>>  
>>
>> When qemuProcessReconnectHelper was introduced (commit d38897a5d)
>> reconnection failure used VIR_DOMAIN_SHUTOFF_FAILED; however, that was
>> changed in commit bda2f17d to either VIR_DOMAIN_SHUTOFF_CRASHED or
>> VIR_DOMAIN_SHUTOFF_UNKNOWN. When QEMU_CAPS_NO_SHUTDOWN checking was
>> removed in commit fe35b1ad6 the conditional state was just left at
>> VIR_DOMAIN_SHUTOFF_CRASHED.
>>
>> Still I agree w/ Martin's thoughts, which unfortunately wasn't commented
>> on in the code leading to the latter revert commit losing that unknown
>> reason. The reason for the "-no-shutdown" *did* change and that probably
>> should have been reflected in the qemuProcessReconnect logic.
>>
>> So I think perhaps we should have an initial patch which references or
>> uses that first paragraph in order to change the code to:
>>
>>         /* We cannot get the monitor back, so kill the VM to
>>          * remove possibility of it ending up running twice if
>>          * user tries to start it again later.
>>          *
>>          * If we cannot get to the monitor when the QEMU command
>>          * line used -no-shutdown, then we can safely say that the
>>          * domain crashed; otherwise we don't really know. */
>>         if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES)
>>             state = VIR_DOMAIN_SHUTOFF_CRASHED;
>>         else
>>             state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
>>
>> and uses the short commit msg of "qemu: Restore lost shutdown reason".

I've posted a patch for the above, see:

https://www.redhat.com/archives/libvir-list/2018-October/msg00945.html

while we work through the rest.

>>
>> Then this followup would change the "UNKNOWN" to "DAEMON" keeping the
>> CRASHED otherwise.
>>
> 
> Then we will get "crashed" reason always for modern qemu and domains
> that can reboot which is not true.
> 
> However I see now that the issue is more complicated then I thought. I agree
> with Martin too but we have to additionally check that connect(2) for domain
> monitor socket was called and error was ECONNREFUSED. So finally state is
> resolved like this:
> 
> if (connect(2) was not called) {
>     state = VIR_DOMAIN_SHUTOFF_UNKNOWN;> else if (connect(2) result is NOT ECONNREFUSED) {
>     state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
> } else if (connect(2) result is ECONNREFUSED) {
>     if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES)
>         state = VIR_DOMAIN_SHUTOFF_CRASHED;
>     else
>         state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
> } else {
>     state = VIR_DOMAIN_SHUTOFF_DAEMON;
> }

Do we really want to dig into why qemuConnectMonitor failed? Or is it
good enough to do something because it did fail. IOW: What condition of
using DAEMON as the reason is most important?

Failure before we try to open monitor?
Failure to open the monitor?
Failure after opening the monitor?

Once we open the monitor, priv->mon is set and we could use that.

> 
> Moreover we should not later kill process by remembered pid in all cases except
> last one in the above code because we can kill some other process.

Not quite sure I follow. This is the reconnect path to connect so some
existing/running domain. The {vm|obj}->pid would be something filled via
/var/run/libvirt/qemu processing for {domain}.{xml|pid}. Different hunk
of code. Maybe I'm misreading your comment.

John

> 
> Looks like a bit complicated, not sure should we pursuit this one or just left
> VIR_DOMAIN_SHUTOFF_UNKNOWN only (in this case we can introduce
> VIR_DOMAIN_SHUTOFF_DAEMON for reboot failures where reason is obvisous).
> 
> Nikolay
> 
>>
>>> diff --git a/tools/virsh-domain-monitor.c b/tools/virsh-domain-monitor.c
>>> index 3a26363..f0ad558 100644
>>> --- a/tools/virsh-domain-monitor.c
>>> +++ b/tools/virsh-domain-monitor.c
>>> @@ -212,7 +212,8 @@ VIR_ENUM_IMPL(virshDomainShutoffReason,
>>>                N_("migrated"),
>>>                N_("saved"),
>>>                N_("failed"),
>>> -              N_("from snapshot"))
>>> +              N_("from snapshot"),
>>> +              N_("daemon"))
>>>  
>>>  VIR_ENUM_DECL(virshDomainCrashedReason)
>>>  VIR_ENUM_IMPL(virshDomainCrashedReason,
>>>

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] libvirt: add daemon itself as shutdown reason
Posted by Nikolay Shirokovskiy 5 years, 6 months ago

On 18.10.2018 18:30, John Ferlan wrote:
> 
> 
> On 10/17/18 4:16 AM, Nikolay Shirokovskiy wrote:
>>
>>
>> On 16.10.2018 15:48, John Ferlan wrote:
>>>
>>>
>>> On 10/8/18 7:21 AM, Nikolay Shirokovskiy wrote:
>>>> Let's introduce shutdown reason "daemon" which means we have to
>>>> kill running domain ourselves as the best action we can take at
>>>> that moment. Failure to pick up domain on daemon restart is
>>>> one example of such case. Using reason "crashed" is a bit misleading
>>>> as it is used when qemu is actually crashed or qemu destroyed on
>>>> guest panic as result of user choice that is the problem was
>>>> in qemu/guest itself. So I propose to use "crashed" only for
>>>> qemu side issues and introduce "daemon" when we have to kill the qemu
>>>> for good.
>>>
>>> How about (although reading below may shed some more context):
>>>
>>> This patch introduces a new shutdown reason "daemon" in order
>>> to indicate that the daemon needed to force shutdown the domain
>>> as the best course of action to take at the moment.
>>>
>>> This action would occur during Reconnect processing when it's
>>> determined that either the QEMU monitor no longer exists for
>>> some unknown reason or creation of a thread to reconnect the
>>> domain failed.
>>
>> Mostly I'm fine with this one except "monitor no longer exists"
>> is condition for reason "crashed" or "unknown" (Martin's contribution)
>>
>>>
>>>>
>>>> There is another example where "daemon" will be useful. If we can
>>>> not reboot domain we kill it and got "crashed" reason now. Looks
>>>> like good candidate for "daemon" reason.
>>>
>>> If you feel this way, then a followup patch could/should be posted;
>>> otherwise, this'll be lost to commit message heaven where all good ideas
>>> go to die ;-).
>>
>> Sure!
>>
>>>
>>>>
>>>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>>> ---
>>>>  include/libvirt/libvirt-domain.h |  1 +
>>>>  src/conf/domain_conf.c           |  3 ++-
>>>>  src/qemu/qemu_process.c          | 11 ++++-------
>>>>  tools/virsh-domain-monitor.c     |  3 ++-
>>>>  4 files changed, 9 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
>>>> index fdd2d6b..11fdab5 100644
>>>> --- a/include/libvirt/libvirt-domain.h
>>>> +++ b/include/libvirt/libvirt-domain.h
>>>> @@ -145,6 +145,7 @@ typedef enum {
>>>>      VIR_DOMAIN_SHUTOFF_FAILED = 6,      /* domain failed to start */
>>>>      VIR_DOMAIN_SHUTOFF_FROM_SNAPSHOT = 7, /* restored from a snapshot which was
>>>>                                             * taken while domain was shutoff */
>>>> +    VIR_DOMAIN_SHUTOFF_DAEMON = 8,      /* daemon have to kill domain */
>>>
>>> s/have to/decides to/
>>>
>>>>  # ifdef VIR_ENUM_SENTINELS
>>>>      VIR_DOMAIN_SHUTOFF_LAST
>>>>  # endif
>>>> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
>>>> index 9911d56..e441723 100644
>>>> --- a/src/conf/domain_conf.c
>>>> +++ b/src/conf/domain_conf.c
>>>> @@ -779,7 +779,8 @@ VIR_ENUM_IMPL(virDomainShutoffReason, VIR_DOMAIN_SHUTOFF_LAST,
>>>>                "migrated",
>>>>                "saved",
>>>>                "failed",
>>>> -              "from snapshot")
>>>> +              "from snapshot",
>>>> +              "daemon")
>>>>  
>>>>  VIR_ENUM_IMPL(virDomainCrashedReason, VIR_DOMAIN_CRASHED_LAST,
>>>>                "unknown",
>>>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>>>> index e9c7618..c4bc9ca 100644
>>>> --- a/src/qemu/qemu_process.c
>>>> +++ b/src/qemu/qemu_process.c
>>>> @@ -7988,15 +7988,12 @@ qemuProcessReconnect(void *opaque)
>>>>          /* We can't get the monitor back, so must kill the VM
>>>>           * to remove danger of it ending up running twice if
>>>>           * user tries to start it again later
>>>
>>> Since we're changing anyway, let's put the stop there, e.g.
>>> "s/later/later./"
>>>
>>>> -         * If we couldn't get the monitor since QEMU supports
>>>> -         * no-shutdown, we can safely say that the domain
>>>> -         * crashed ... */
>>>> -        state = VIR_DOMAIN_SHUTOFF_CRASHED;
>>>> -        /* If BeginJob failed, we jumped here without a job, let's hope another
>>>> +         * If BeginJob failed, we jumped here without a job, let's hope another
>>>>           * thread didn't have a chance to start playing with the domain yet
>>>>           * (it's all we can do anyway).
>>>>           */
>>>> -        qemuProcessStop(driver, obj, state, QEMU_ASYNC_JOB_NONE, stopFlags);
>>>> +        qemuProcessStop(driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>>>> +                        QEMU_ASYNC_JOB_NONE, stopFlags);
>>>>      }
>>>>      goto cleanup;
>>>>  }
>>>> @@ -8035,7 +8032,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
>>>>           * is no thread that could be doing anything else with the same domain
>>>>           * object.
>>>>           */
>>>> -        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
>>>> +        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>>>>                          QEMU_ASYNC_JOB_NONE, 0);
>>>>          qemuDomainRemoveInactiveJobLocked(src->driver, obj);
>>>>  
>>>
>>> When qemuProcessReconnectHelper was introduced (commit d38897a5d)
>>> reconnection failure used VIR_DOMAIN_SHUTOFF_FAILED; however, that was
>>> changed in commit bda2f17d to either VIR_DOMAIN_SHUTOFF_CRASHED or
>>> VIR_DOMAIN_SHUTOFF_UNKNOWN. When QEMU_CAPS_NO_SHUTDOWN checking was
>>> removed in commit fe35b1ad6 the conditional state was just left at
>>> VIR_DOMAIN_SHUTOFF_CRASHED.
>>>
>>> Still I agree w/ Martin's thoughts, which unfortunately wasn't commented
>>> on in the code leading to the latter revert commit losing that unknown
>>> reason. The reason for the "-no-shutdown" *did* change and that probably
>>> should have been reflected in the qemuProcessReconnect logic.
>>>
>>> So I think perhaps we should have an initial patch which references or
>>> uses that first paragraph in order to change the code to:
>>>
>>>         /* We cannot get the monitor back, so kill the VM to
>>>          * remove possibility of it ending up running twice if
>>>          * user tries to start it again later.
>>>          *
>>>          * If we cannot get to the monitor when the QEMU command
>>>          * line used -no-shutdown, then we can safely say that the
>>>          * domain crashed; otherwise we don't really know. */
>>>         if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES)
>>>             state = VIR_DOMAIN_SHUTOFF_CRASHED;
>>>         else
>>>             state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
>>>
>>> and uses the short commit msg of "qemu: Restore lost shutdown reason".
> 
> I've posted a patch for the above, see:
> 
> https://www.redhat.com/archives/libvir-list/2018-October/msg00945.html
> 
> while we work through the rest.
> 
>>>
>>> Then this followup would change the "UNKNOWN" to "DAEMON" keeping the
>>> CRASHED otherwise.
>>>
>>
>> Then we will get "crashed" reason always for modern qemu and domains
>> that can reboot which is not true.
>>
>> However I see now that the issue is more complicated then I thought. I agree
>> with Martin too but we have to additionally check that connect(2) for domain
>> monitor socket was called and error was ECONNREFUSED. So finally state is
>> resolved like this:
>>
>> if (connect(2) was not called) {
>>     state = VIR_DOMAIN_SHUTOFF_UNKNOWN;> else if (connect(2) result is NOT ECONNREFUSED) {
>>     state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
>> } else if (connect(2) result is ECONNREFUSED) {
>>     if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES)
>>         state = VIR_DOMAIN_SHUTOFF_CRASHED;
>>     else
>>         state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
>> } else {
>>     state = VIR_DOMAIN_SHUTOFF_DAEMON;
>> }
> 
> Do we really want to dig into why qemuConnectMonitor failed? Or is it
> good enough to do something because it did fail. IOW: What condition of
> using DAEMON as the reason is most important?
> 
> Failure before we try to open monitor?
> Failure to open the monitor?
> Failure after opening the monitor?
> 
> Once we open the monitor, priv->mon is set and we could use that.
> 

Sure if priv->mon != NULL then the reason to shutdown is DAEMON only.

But for example the patch you sent is not quite correct. If we don't even try
to call connect or connect result is something unexpected then we can't be sure
of anything - the process may exist or not exist so we can't tell it crashed if
it was started with -noshutdown option. The virDomainObjIsActive is only
remembered state at this point. We need to set reason UNKNOWN in this case.
Let's call this the first case. The second case is when connect returns
ECONNREFUSED - this means nobody listening on monitor socket - this is the case 
your patch should apply to. The problem is you can not distinguish case 1 and
2 now. In both cases you'll have priv->mon = NULL. Moreover in some cases we
  can close monitor right after successful connection thus sometimes even if
priv->mon == NULL we need to set DAEMON reason. That's why I don't use
priv->mon in conditions in the above if/else snippet. 

>>
>> Moreover we should not later kill process by remembered pid in all cases except
>> last one in the above code because we can kill some other process.
> 
> Not quite sure I follow. This is the reconnect path to connect so some
> existing/running domain. The {vm|obj}->pid would be something filled via
> /var/run/libvirt/qemu processing for {domain}.{xml|pid}. Different hunk
> of code. Maybe I'm misreading your comment.

We can't be sure domain is running. And if it is not running some unrelated
process can reuse pid. So imagine we don't even have chance to call connect(2)
because of early error and qemu really died and some process with same pid
is running - we call qemuProcessStop on error path and shoot innocent process.

Even if we connect and get ECONNREFUSED which means there is no qemu process
pid can be reused and we should not kill. In short we should clear
vm->pid before calling qemuProcessStop if shutdown reason is different from DAEMON.

Nikolay

> 
>>
>> Looks like a bit complicated, not sure should we pursuit this one or just left
>> VIR_DOMAIN_SHUTOFF_UNKNOWN only (in this case we can introduce
>> VIR_DOMAIN_SHUTOFF_DAEMON for reboot failures where reason is obvisous).
>>
>> Nikolay
>>
>>>
>>>> diff --git a/tools/virsh-domain-monitor.c b/tools/virsh-domain-monitor.c
>>>> index 3a26363..f0ad558 100644
>>>> --- a/tools/virsh-domain-monitor.c
>>>> +++ b/tools/virsh-domain-monitor.c
>>>> @@ -212,7 +212,8 @@ VIR_ENUM_IMPL(virshDomainShutoffReason,
>>>>                N_("migrated"),
>>>>                N_("saved"),
>>>>                N_("failed"),
>>>> -              N_("from snapshot"))
>>>> +              N_("from snapshot"),
>>>> +              N_("daemon"))
>>>>  
>>>>  VIR_ENUM_DECL(virshDomainCrashedReason)
>>>>  VIR_ENUM_IMPL(virshDomainCrashedReason,
>>>>

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] libvirt: add daemon itself as shutdown reason
Posted by John Ferlan 5 years, 6 months ago

On 10/19/18 5:24 AM, Nikolay Shirokovskiy wrote:
> 
> 
> On 18.10.2018 18:30, John Ferlan wrote:
>>
>>
>> On 10/17/18 4:16 AM, Nikolay Shirokovskiy wrote:
>>>
>>>
>>> On 16.10.2018 15:48, John Ferlan wrote:
>>>>
>>>>
>>>> On 10/8/18 7:21 AM, Nikolay Shirokovskiy wrote:
>>>>> Let's introduce shutdown reason "daemon" which means we have to
>>>>> kill running domain ourselves as the best action we can take at
>>>>> that moment. Failure to pick up domain on daemon restart is
>>>>> one example of such case. Using reason "crashed" is a bit misleading
>>>>> as it is used when qemu is actually crashed or qemu destroyed on
>>>>> guest panic as result of user choice that is the problem was
>>>>> in qemu/guest itself. So I propose to use "crashed" only for
>>>>> qemu side issues and introduce "daemon" when we have to kill the qemu
>>>>> for good.
>>>>
>>>> How about (although reading below may shed some more context):
>>>>
>>>> This patch introduces a new shutdown reason "daemon" in order
>>>> to indicate that the daemon needed to force shutdown the domain
>>>> as the best course of action to take at the moment.
>>>>
>>>> This action would occur during Reconnect processing when it's
>>>> determined that either the QEMU monitor no longer exists for
>>>> some unknown reason or creation of a thread to reconnect the
>>>> domain failed.
>>>
>>> Mostly I'm fine with this one except "monitor no longer exists"
>>> is condition for reason "crashed" or "unknown" (Martin's contribution)
>>>
>>>>
>>>>>
>>>>> There is another example where "daemon" will be useful. If we can
>>>>> not reboot domain we kill it and got "crashed" reason now. Looks
>>>>> like good candidate for "daemon" reason.
>>>>
>>>> If you feel this way, then a followup patch could/should be posted;
>>>> otherwise, this'll be lost to commit message heaven where all good ideas
>>>> go to die ;-).
>>>
>>> Sure!
>>>
>>>>
>>>>>
>>>>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>>>> ---
>>>>>  include/libvirt/libvirt-domain.h |  1 +
>>>>>  src/conf/domain_conf.c           |  3 ++-
>>>>>  src/qemu/qemu_process.c          | 11 ++++-------
>>>>>  tools/virsh-domain-monitor.c     |  3 ++-
>>>>>  4 files changed, 9 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
>>>>> index fdd2d6b..11fdab5 100644
>>>>> --- a/include/libvirt/libvirt-domain.h
>>>>> +++ b/include/libvirt/libvirt-domain.h
>>>>> @@ -145,6 +145,7 @@ typedef enum {
>>>>>      VIR_DOMAIN_SHUTOFF_FAILED = 6,      /* domain failed to start */
>>>>>      VIR_DOMAIN_SHUTOFF_FROM_SNAPSHOT = 7, /* restored from a snapshot which was
>>>>>                                             * taken while domain was shutoff */
>>>>> +    VIR_DOMAIN_SHUTOFF_DAEMON = 8,      /* daemon have to kill domain */
>>>>
>>>> s/have to/decides to/
>>>>
>>>>>  # ifdef VIR_ENUM_SENTINELS
>>>>>      VIR_DOMAIN_SHUTOFF_LAST
>>>>>  # endif
>>>>> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
>>>>> index 9911d56..e441723 100644
>>>>> --- a/src/conf/domain_conf.c
>>>>> +++ b/src/conf/domain_conf.c
>>>>> @@ -779,7 +779,8 @@ VIR_ENUM_IMPL(virDomainShutoffReason, VIR_DOMAIN_SHUTOFF_LAST,
>>>>>                "migrated",
>>>>>                "saved",
>>>>>                "failed",
>>>>> -              "from snapshot")
>>>>> +              "from snapshot",
>>>>> +              "daemon")
>>>>>  
>>>>>  VIR_ENUM_IMPL(virDomainCrashedReason, VIR_DOMAIN_CRASHED_LAST,
>>>>>                "unknown",
>>>>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>>>>> index e9c7618..c4bc9ca 100644
>>>>> --- a/src/qemu/qemu_process.c
>>>>> +++ b/src/qemu/qemu_process.c
>>>>> @@ -7988,15 +7988,12 @@ qemuProcessReconnect(void *opaque)
>>>>>          /* We can't get the monitor back, so must kill the VM
>>>>>           * to remove danger of it ending up running twice if
>>>>>           * user tries to start it again later
>>>>
>>>> Since we're changing anyway, let's put the stop there, e.g.
>>>> "s/later/later./"
>>>>
>>>>> -         * If we couldn't get the monitor since QEMU supports
>>>>> -         * no-shutdown, we can safely say that the domain
>>>>> -         * crashed ... */
>>>>> -        state = VIR_DOMAIN_SHUTOFF_CRASHED;
>>>>> -        /* If BeginJob failed, we jumped here without a job, let's hope another
>>>>> +         * If BeginJob failed, we jumped here without a job, let's hope another
>>>>>           * thread didn't have a chance to start playing with the domain yet
>>>>>           * (it's all we can do anyway).
>>>>>           */
>>>>> -        qemuProcessStop(driver, obj, state, QEMU_ASYNC_JOB_NONE, stopFlags);
>>>>> +        qemuProcessStop(driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>>>>> +                        QEMU_ASYNC_JOB_NONE, stopFlags);
>>>>>      }
>>>>>      goto cleanup;
>>>>>  }
>>>>> @@ -8035,7 +8032,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
>>>>>           * is no thread that could be doing anything else with the same domain
>>>>>           * object.
>>>>>           */
>>>>> -        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
>>>>> +        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>>>>>                          QEMU_ASYNC_JOB_NONE, 0);
>>>>>          qemuDomainRemoveInactiveJobLocked(src->driver, obj);
>>>>>  
>>>>
>>>> When qemuProcessReconnectHelper was introduced (commit d38897a5d)
>>>> reconnection failure used VIR_DOMAIN_SHUTOFF_FAILED; however, that was
>>>> changed in commit bda2f17d to either VIR_DOMAIN_SHUTOFF_CRASHED or
>>>> VIR_DOMAIN_SHUTOFF_UNKNOWN. When QEMU_CAPS_NO_SHUTDOWN checking was
>>>> removed in commit fe35b1ad6 the conditional state was just left at
>>>> VIR_DOMAIN_SHUTOFF_CRASHED.
>>>>
>>>> Still I agree w/ Martin's thoughts, which unfortunately wasn't commented
>>>> on in the code leading to the latter revert commit losing that unknown
>>>> reason. The reason for the "-no-shutdown" *did* change and that probably
>>>> should have been reflected in the qemuProcessReconnect logic.
>>>>
>>>> So I think perhaps we should have an initial patch which references or
>>>> uses that first paragraph in order to change the code to:
>>>>
>>>>         /* We cannot get the monitor back, so kill the VM to
>>>>          * remove possibility of it ending up running twice if
>>>>          * user tries to start it again later.
>>>>          *
>>>>          * If we cannot get to the monitor when the QEMU command
>>>>          * line used -no-shutdown, then we can safely say that the
>>>>          * domain crashed; otherwise we don't really know. */
>>>>         if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES)
>>>>             state = VIR_DOMAIN_SHUTOFF_CRASHED;
>>>>         else
>>>>             state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
>>>>
>>>> and uses the short commit msg of "qemu: Restore lost shutdown reason".
>>
>> I've posted a patch for the above, see:
>>
>> https://www.redhat.com/archives/libvir-list/2018-October/msg00945.html
>>
>> while we work through the rest.
>>
>>>>
>>>> Then this followup would change the "UNKNOWN" to "DAEMON" keeping the
>>>> CRASHED otherwise.
>>>>
>>>
>>> Then we will get "crashed" reason always for modern qemu and domains
>>> that can reboot which is not true.
>>>
>>> However I see now that the issue is more complicated then I thought. I agree
>>> with Martin too but we have to additionally check that connect(2) for domain
>>> monitor socket was called and error was ECONNREFUSED. So finally state is
>>> resolved like this:
>>>
>>> if (connect(2) was not called) {
>>>     state = VIR_DOMAIN_SHUTOFF_UNKNOWN;> else if (connect(2) result is NOT ECONNREFUSED) {
>>>     state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
>>> } else if (connect(2) result is ECONNREFUSED) {
>>>     if (priv->monJSON && priv->allowReboot == VIR_TRISTATE_BOOL_YES)
>>>         state = VIR_DOMAIN_SHUTOFF_CRASHED;
>>>     else
>>>         state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
>>> } else {
>>>     state = VIR_DOMAIN_SHUTOFF_DAEMON;
>>> }
>>
>> Do we really want to dig into why qemuConnectMonitor failed? Or is it
>> good enough to do something because it did fail. IOW: What condition of
>> using DAEMON as the reason is most important?
>>
>> Failure before we try to open monitor?
>> Failure to open the monitor?
>> Failure after opening the monitor?
>>
>> Once we open the monitor, priv->mon is set and we could use that.
>>
> 
> Sure if priv->mon != NULL then the reason to shutdown is DAEMON only.
> 
> But for example the patch you sent is not quite correct. If we don't even try

Please comment on it in that patch!  I was going for "same check" used
to place the "-no-shutdown" on the command line.

> to call connect or connect result is something unexpected then we can't be sure
> of anything - the process may exist or not exist so we can't tell it crashed if
> it was started with -noshutdown option. The virDomainObjIsActive is only
> remembered state at this point. We need to set reason UNKNOWN in this case.
> Let's call this the first case. The second case is when connect returns
> ECONNREFUSED - this means nobody listening on monitor socket - this is the case 
> your patch should apply to. The problem is you can not distinguish case 1 and
> 2 now. In both cases you'll have priv->mon = NULL. Moreover in some cases we
>   can close monitor right after successful connection thus sometimes even if
> priv->mon == NULL we need to set DAEMON reason. That's why I don't use
> priv->mon in conditions in the above if/else snippet. 
> 
>>>
>>> Moreover we should not later kill process by remembered pid in all cases except
>>> last one in the above code because we can kill some other process.
>>
>> Not quite sure I follow. This is the reconnect path to connect so some
>> existing/running domain. The {vm|obj}->pid would be something filled via
>> /var/run/libvirt/qemu processing for {domain}.{xml|pid}. Different hunk
>> of code. Maybe I'm misreading your comment.
> 
> We can't be sure domain is running. And if it is not running some unrelated
> process can reuse pid. So imagine we don't even have chance to call connect(2)
> because of early error and qemu really died and some process with same pid
> is running - we call qemuProcessStop on error path and shoot innocent process.

I'd have to dig through all the code that handles the reconnection, but
I'm working under the assumption that the .xml and .pid files found in
the /var/run/libvirt/qemu directory would have been vetted previously to
determine whether the pid mentioned in the file related to a qemu
process or not. I certainly understand the downside of doing something
to someone else's process!

John

> 
> Even if we connect and get ECONNREFUSED which means there is no qemu process
> pid can be reused and we should not kill. In short we should clear
> vm->pid before calling qemuProcessStop if shutdown reason is different from DAEMON.
> 
> Nikolay
> 
>>
>>>
>>> Looks like a bit complicated, not sure should we pursuit this one or just left
>>> VIR_DOMAIN_SHUTOFF_UNKNOWN only (in this case we can introduce
>>> VIR_DOMAIN_SHUTOFF_DAEMON for reboot failures where reason is obvisous).
>>>
>>> Nikolay
>>>
>>>>
>>>>> diff --git a/tools/virsh-domain-monitor.c b/tools/virsh-domain-monitor.c
>>>>> index 3a26363..f0ad558 100644
>>>>> --- a/tools/virsh-domain-monitor.c
>>>>> +++ b/tools/virsh-domain-monitor.c
>>>>> @@ -212,7 +212,8 @@ VIR_ENUM_IMPL(virshDomainShutoffReason,
>>>>>                N_("migrated"),
>>>>>                N_("saved"),
>>>>>                N_("failed"),
>>>>> -              N_("from snapshot"))
>>>>> +              N_("from snapshot"),
>>>>> +              N_("daemon"))
>>>>>  
>>>>>  VIR_ENUM_DECL(virshDomainCrashedReason)
>>>>>  VIR_ENUM_IMPL(virshDomainCrashedReason,
>>>>>

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] libvirt: add daemon itself as shutdown reason
Posted by John Ferlan 5 years, 5 months ago

On 10/8/18 7:21 AM, Nikolay Shirokovskiy wrote:
> Let's introduce shutdown reason "daemon" which means we have to
> kill running domain ourselves as the best action we can take at
> that moment. Failure to pick up domain on daemon restart is
> one example of such case. Using reason "crashed" is a bit misleading
> as it is used when qemu is actually crashed or qemu destroyed on
> guest panic as result of user choice that is the problem was
> in qemu/guest itself. So I propose to use "crashed" only for
> qemu side issues and introduce "daemon" when we have to kill the qemu
> for good.
> 
> There is another example where "daemon" will be useful. If we can
> not reboot domain we kill it and got "crashed" reason now. Looks
> like good candidate for "daemon" reason.
> 
> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
> ---
>  include/libvirt/libvirt-domain.h |  1 +
>  src/conf/domain_conf.c           |  3 ++-
>  src/qemu/qemu_process.c          | 11 ++++-------
>  tools/virsh-domain-monitor.c     |  3 ++-
>  4 files changed, 9 insertions(+), 9 deletions(-)
> 

[...]

Now that I've pushed commits 296e05b54 and 8f0f8425d, let's revisit
this...  Merging in that set of changes with this patch means that
instead of:

> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> index e9c7618..c4bc9ca 100644
> --- a/src/qemu/qemu_process.c
> +++ b/src/qemu/qemu_process.c
> @@ -7988,15 +7988,12 @@ qemuProcessReconnect(void *opaque)
>          /* We can't get the monitor back, so must kill the VM
>           * to remove danger of it ending up running twice if
>           * user tries to start it again later
> -         * If we couldn't get the monitor since QEMU supports
> -         * no-shutdown, we can safely say that the domain
> -         * crashed ... */
> -        state = VIR_DOMAIN_SHUTOFF_CRASHED;
> -        /* If BeginJob failed, we jumped here without a job, let's hope another
> +         * If BeginJob failed, we jumped here without a job, let's hope another
>           * thread didn't have a chance to start playing with the domain yet
>           * (it's all we can do anyway).
>           */
> -        qemuProcessStop(driver, obj, state, QEMU_ASYNC_JOB_NONE, stopFlags);
> +        qemuProcessStop(driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
> +                        QEMU_ASYNC_JOB_NONE, stopFlags);
>      }
>      goto cleanup;
>  }
> @@ -8035,7 +8032,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
>           * is no thread that could be doing anything else with the same domain
>           * object.
>           */
> -        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
> +        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>                          QEMU_ASYNC_JOB_NONE, 0);
>          qemuDomainRemoveInactiveJobLocked(src->driver, obj);
>  

We'd have:

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 3bf84834ec..023e187729 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -7995,10 +7995,14 @@ qemuProcessReconnect(void *opaque)
          *
          * If we cannot get to the monitor when the QEMU command
          * line used -no-shutdown, then we can safely say that the
-         * domain crashed; otherwise, we don't really know. */
+         * domain crashed; otherwise, if the monitor was started,
+         * then we can blame ourselves, else we failed before the
+         * monitor started so we don't really know. */
         if (!priv->mon && tryMonReconn &&
             qemuDomainIsUsingNoShutdown(priv))
             state = VIR_DOMAIN_SHUTOFF_CRASHED;
+        else if (priv->mon)
+            state = VIR_DOMAIN_SHUTOFF_DAEMON;
         else
             state = VIR_DOMAIN_SHUTOFF_UNKNOWN;

@@ -8045,7 +8049,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
          * is no thread that could be doing anything else with the same
domain
          * object.
          */
-        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
+        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
                         QEMU_ASYNC_JOB_NONE, 0);
         qemuDomainRemoveInactiveJobLocked(src->driver, obj);

[...]

So does this essentially meet/handle the condition you were trying to
alter?  That is - once we get beyond the monitor reconnection, then if
we end up in error that means the daemon has decided to stop things.
Leaving the UNKNOWN to be the period prior to attempting to reconnect
and CRASHED to the period during which we try to connect to the monitor.

This also captures failures after connection is successful (when
priv->mon in qemuConnectMonitor).

Getting more specific failures of why the connection failed is perhaps a
future task if it really matters.

If this looks good to you let me know, I can merge and push with the
previously noted changes and a commit message of:

"This patch introduces a new shutdown reason "daemon" in order
to indicate that the daemon needed to force shutdown the domain
as the best course of action to take at the moment.

This action would occur during reconnection when processing encounters
an error once the monitor reconnection is successful."


John

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] libvirt: add daemon itself as shutdown reason
Posted by Nikolay Shirokovskiy 5 years, 5 months ago

On 07.11.2018 17:29, John Ferlan wrote:
> 
> 
> On 10/8/18 7:21 AM, Nikolay Shirokovskiy wrote:
>> Let's introduce shutdown reason "daemon" which means we have to
>> kill running domain ourselves as the best action we can take at
>> that moment. Failure to pick up domain on daemon restart is
>> one example of such case. Using reason "crashed" is a bit misleading
>> as it is used when qemu is actually crashed or qemu destroyed on
>> guest panic as result of user choice that is the problem was
>> in qemu/guest itself. So I propose to use "crashed" only for
>> qemu side issues and introduce "daemon" when we have to kill the qemu
>> for good.
>>
>> There is another example where "daemon" will be useful. If we can
>> not reboot domain we kill it and got "crashed" reason now. Looks
>> like good candidate for "daemon" reason.
>>
>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>> ---
>>  include/libvirt/libvirt-domain.h |  1 +
>>  src/conf/domain_conf.c           |  3 ++-
>>  src/qemu/qemu_process.c          | 11 ++++-------
>>  tools/virsh-domain-monitor.c     |  3 ++-
>>  4 files changed, 9 insertions(+), 9 deletions(-)
>>
> 
> [...]
> 
> Now that I've pushed commits 296e05b54 and 8f0f8425d, let's revisit
> this...  Merging in that set of changes with this patch means that
> instead of:
> 
>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>> index e9c7618..c4bc9ca 100644
>> --- a/src/qemu/qemu_process.c
>> +++ b/src/qemu/qemu_process.c
>> @@ -7988,15 +7988,12 @@ qemuProcessReconnect(void *opaque)
>>          /* We can't get the monitor back, so must kill the VM
>>           * to remove danger of it ending up running twice if
>>           * user tries to start it again later
>> -         * If we couldn't get the monitor since QEMU supports
>> -         * no-shutdown, we can safely say that the domain
>> -         * crashed ... */
>> -        state = VIR_DOMAIN_SHUTOFF_CRASHED;
>> -        /* If BeginJob failed, we jumped here without a job, let's hope another
>> +         * If BeginJob failed, we jumped here without a job, let's hope another
>>           * thread didn't have a chance to start playing with the domain yet
>>           * (it's all we can do anyway).
>>           */
>> -        qemuProcessStop(driver, obj, state, QEMU_ASYNC_JOB_NONE, stopFlags);
>> +        qemuProcessStop(driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>> +                        QEMU_ASYNC_JOB_NONE, stopFlags);
>>      }
>>      goto cleanup;
>>  }
>> @@ -8035,7 +8032,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
>>           * is no thread that could be doing anything else with the same domain
>>           * object.
>>           */
>> -        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
>> +        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>>                          QEMU_ASYNC_JOB_NONE, 0);
>>          qemuDomainRemoveInactiveJobLocked(src->driver, obj);
>>  
> 
> We'd have:
> 
> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> index 3bf84834ec..023e187729 100644
> --- a/src/qemu/qemu_process.c
> +++ b/src/qemu/qemu_process.c
> @@ -7995,10 +7995,14 @@ qemuProcessReconnect(void *opaque)
>           *
>           * If we cannot get to the monitor when the QEMU command
>           * line used -no-shutdown, then we can safely say that the
> -         * domain crashed; otherwise, we don't really know. */
> +         * domain crashed; otherwise, if the monitor was started,
> +         * then we can blame ourselves, else we failed before the
> +         * monitor started so we don't really know. */
>          if (!priv->mon && tryMonReconn &&
>              qemuDomainIsUsingNoShutdown(priv))
>              state = VIR_DOMAIN_SHUTOFF_CRASHED;
> +        else if (priv->mon)
> +            state = VIR_DOMAIN_SHUTOFF_DAEMON;
>          else
>              state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
> 
> @@ -8045,7 +8049,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
>           * is no thread that could be doing anything else with the same
> domain
>           * object.
>           */
> -        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
> +        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>                          QEMU_ASYNC_JOB_NONE, 0);

I think we need to report VIR_DOMAIN_SHUTOFF_UNKNOWN here. This is similar to 
qemuProcessReconnect before we try to reconnect. May be this hunk need to be 
placed in distinct patch therefore.

>          qemuDomainRemoveInactiveJobLocked(src->driver, obj);
> 
> [...]
> 
> So does this essentially meet/handle the condition you were trying to
> alter?  That is - once we get beyond the monitor reconnection, then if
> we end up in error that means the daemon has decided to stop things.
> Leaving the UNKNOWN to be the period prior to attempting to reconnect
> and CRASHED to the period during which we try to connect to the monitor.
> 
> This also captures failures after connection is successful (when
> priv->mon in qemuConnectMonitor).
> 


> Getting more specific failures of why the connection failed is perhaps a
> future task if it really matters.

This looks good enough for me too.

> 
> If this looks good to you let me know, I can merge and push with the
> previously noted changes and a commit message of:
> 
> "This patch introduces a new shutdown reason "daemon" in order
> to indicate that the daemon needed to force shutdown the domain
> as the best course of action to take at the moment.
> 
> This action would occur during reconnection when processing encounters
> an error once the monitor reconnection is successful."
> 
> 

Besides the first comment I'm fine with this change. I'm going
to send a patch to change reason to VIR_DOMAIN_SHUTOFF_DAEMON in 
appropriate reboot failure cases too. Also one day let's fix the code not to kill
unrelated process.

Nikolay

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] libvirt: add daemon itself as shutdown reason
Posted by John Ferlan 5 years, 5 months ago

On 11/8/18 2:03 AM, Nikolay Shirokovskiy wrote:
> 
> 
> On 07.11.2018 17:29, John Ferlan wrote:
>>
>>
>> On 10/8/18 7:21 AM, Nikolay Shirokovskiy wrote:
>>> Let's introduce shutdown reason "daemon" which means we have to
>>> kill running domain ourselves as the best action we can take at
>>> that moment. Failure to pick up domain on daemon restart is
>>> one example of such case. Using reason "crashed" is a bit misleading
>>> as it is used when qemu is actually crashed or qemu destroyed on
>>> guest panic as result of user choice that is the problem was
>>> in qemu/guest itself. So I propose to use "crashed" only for
>>> qemu side issues and introduce "daemon" when we have to kill the qemu
>>> for good.
>>>
>>> There is another example where "daemon" will be useful. If we can
>>> not reboot domain we kill it and got "crashed" reason now. Looks
>>> like good candidate for "daemon" reason.
>>>
>>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>> ---
>>>  include/libvirt/libvirt-domain.h |  1 +
>>>  src/conf/domain_conf.c           |  3 ++-
>>>  src/qemu/qemu_process.c          | 11 ++++-------
>>>  tools/virsh-domain-monitor.c     |  3 ++-
>>>  4 files changed, 9 insertions(+), 9 deletions(-)
>>>

[...]

>> @@ -8045,7 +8049,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
>>           * is no thread that could be doing anything else with the same
>> domain
>>           * object.
>>           */
>> -        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
>> +        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>>                          QEMU_ASYNC_JOB_NONE, 0);
> 
> I think we need to report VIR_DOMAIN_SHUTOFF_UNKNOWN here. This is similar to 
> qemuProcessReconnect before we try to reconnect. May be this hunk need to be 
> placed in distinct patch therefore.
> 

Perhaps keeping "FAILED" would be better - that way (and I can add this
to the qemuProcessReconnect description):

 * Failure to reconnect will generate the following reasons:
 *    VIR_DOMAIN_SHUTOFF_FAILED  => Tried to create this thread, but failed
 *    VIR_DOMAIN_SHUTOFF_UNKNOWN => Thread started, but failed before reconnect
 *                                  to the monitor
 *    VIR_DOMAIN_SHUTOFF_CRASHED => Attempted to reconnect to the monitor, but
 *                                  failed to open, assume domain crash
 *    VIR_DOMAIN_SHUTOFF_DAEMON  => Reconnected to the monitor, but decided
 *                                  afterwards we needed to stop the process

Look good?

Tks -

John


[...]

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] libvirt: add daemon itself as shutdown reason
Posted by Nikolay Shirokovskiy 5 years, 5 months ago

On 08.11.2018 15:58, John Ferlan wrote:
> 
> 
> On 11/8/18 2:03 AM, Nikolay Shirokovskiy wrote:
>>
>>
>> On 07.11.2018 17:29, John Ferlan wrote:
>>>
>>>
>>> On 10/8/18 7:21 AM, Nikolay Shirokovskiy wrote:
>>>> Let's introduce shutdown reason "daemon" which means we have to
>>>> kill running domain ourselves as the best action we can take at
>>>> that moment. Failure to pick up domain on daemon restart is
>>>> one example of such case. Using reason "crashed" is a bit misleading
>>>> as it is used when qemu is actually crashed or qemu destroyed on
>>>> guest panic as result of user choice that is the problem was
>>>> in qemu/guest itself. So I propose to use "crashed" only for
>>>> qemu side issues and introduce "daemon" when we have to kill the qemu
>>>> for good.
>>>>
>>>> There is another example where "daemon" will be useful. If we can
>>>> not reboot domain we kill it and got "crashed" reason now. Looks
>>>> like good candidate for "daemon" reason.
>>>>
>>>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>>> ---
>>>>  include/libvirt/libvirt-domain.h |  1 +
>>>>  src/conf/domain_conf.c           |  3 ++-
>>>>  src/qemu/qemu_process.c          | 11 ++++-------
>>>>  tools/virsh-domain-monitor.c     |  3 ++-
>>>>  4 files changed, 9 insertions(+), 9 deletions(-)
>>>>
> 
> [...]
> 
>>> @@ -8045,7 +8049,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
>>>           * is no thread that could be doing anything else with the same
>>> domain
>>>           * object.
>>>           */
>>> -        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
>>> +        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>>>                          QEMU_ASYNC_JOB_NONE, 0);
>>
>> I think we need to report VIR_DOMAIN_SHUTOFF_UNKNOWN here. This is similar to 
>> qemuProcessReconnect before we try to reconnect. May be this hunk need to be 
>> placed in distinct patch therefore.
>>
> 
> Perhaps keeping "FAILED" would be better - that way (and I can add this
> to the qemuProcessReconnect description):
> 
>  * Failure to reconnect will generate the following reasons:
>  *    VIR_DOMAIN_SHUTOFF_FAILED  => Tried to create this thread, but failed
>  *    VIR_DOMAIN_SHUTOFF_UNKNOWN => Thread started, but failed before reconnect
>  *                                  to the monitor
>  *    VIR_DOMAIN_SHUTOFF_CRASHED => Attempted to reconnect to the monitor, but
>  *                                  failed to open, assume domain crash
>  *    VIR_DOMAIN_SHUTOFF_DAEMON  => Reconnected to the monitor, but decided
>  *                                  afterwards we needed to stop the process
> 
> Look good?

Unfortunately meaning of each is fixed in API:

    VIR_DOMAIN_SHUTOFF_UNKNOWN = 0,     /* the reason is unknown */                              

    VIR_DOMAIN_SHUTOFF_CRASHED = 3,     /* domain crashed */                                     

    VIR_DOMAIN_SHUTOFF_FAILED = 6,      /* domain failed to start */

So this is definetly not FAILED. I guess we can't change meaning now also, so 
we have to use UNKNOWN and explaining reasons again is qemuProcessReconnect is
needless.

By the way how from reconnection POV failing to spawn a thread differ from any
error in qemuProcessReconnect before monitor connectiton.

Nikolay

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] libvirt: add daemon itself as shutdown reason
Posted by John Ferlan 5 years, 5 months ago

On 11/8/18 8:10 AM, Nikolay Shirokovskiy wrote:
> 
> 
> On 08.11.2018 15:58, John Ferlan wrote:
>>
>>
>> On 11/8/18 2:03 AM, Nikolay Shirokovskiy wrote:
>>>
>>>
>>> On 07.11.2018 17:29, John Ferlan wrote:
>>>>
>>>>
>>>> On 10/8/18 7:21 AM, Nikolay Shirokovskiy wrote:
>>>>> Let's introduce shutdown reason "daemon" which means we have to
>>>>> kill running domain ourselves as the best action we can take at
>>>>> that moment. Failure to pick up domain on daemon restart is
>>>>> one example of such case. Using reason "crashed" is a bit misleading
>>>>> as it is used when qemu is actually crashed or qemu destroyed on
>>>>> guest panic as result of user choice that is the problem was
>>>>> in qemu/guest itself. So I propose to use "crashed" only for
>>>>> qemu side issues and introduce "daemon" when we have to kill the qemu
>>>>> for good.
>>>>>
>>>>> There is another example where "daemon" will be useful. If we can
>>>>> not reboot domain we kill it and got "crashed" reason now. Looks
>>>>> like good candidate for "daemon" reason.
>>>>>
>>>>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>>>> ---
>>>>>  include/libvirt/libvirt-domain.h |  1 +
>>>>>  src/conf/domain_conf.c           |  3 ++-
>>>>>  src/qemu/qemu_process.c          | 11 ++++-------
>>>>>  tools/virsh-domain-monitor.c     |  3 ++-
>>>>>  4 files changed, 9 insertions(+), 9 deletions(-)
>>>>>
>>
>> [...]
>>
>>>> @@ -8045,7 +8049,7 @@ qemuProcessReconnectHelper(virDomainObjPtr obj,
>>>>           * is no thread that could be doing anything else with the same
>>>> domain
>>>>           * object.
>>>>           */
>>>> -        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_FAILED,
>>>> +        qemuProcessStop(src->driver, obj, VIR_DOMAIN_SHUTOFF_DAEMON,
>>>>                          QEMU_ASYNC_JOB_NONE, 0);
>>>
>>> I think we need to report VIR_DOMAIN_SHUTOFF_UNKNOWN here. This is similar to 
>>> qemuProcessReconnect before we try to reconnect. May be this hunk need to be 
>>> placed in distinct patch therefore.
>>>
>>
>> Perhaps keeping "FAILED" would be better - that way (and I can add this
>> to the qemuProcessReconnect description):
>>
>>  * Failure to reconnect will generate the following reasons:
>>  *    VIR_DOMAIN_SHUTOFF_FAILED  => Tried to create this thread, but failed
>>  *    VIR_DOMAIN_SHUTOFF_UNKNOWN => Thread started, but failed before reconnect
>>  *                                  to the monitor
>>  *    VIR_DOMAIN_SHUTOFF_CRASHED => Attempted to reconnect to the monitor, but
>>  *                                  failed to open, assume domain crash
>>  *    VIR_DOMAIN_SHUTOFF_DAEMON  => Reconnected to the monitor, but decided
>>  *                                  afterwards we needed to stop the process
>>
>> Look good?
> 
> Unfortunately meaning of each is fixed in API:
> 
>     VIR_DOMAIN_SHUTOFF_UNKNOWN = 0,     /* the reason is unknown */                              
> 
>     VIR_DOMAIN_SHUTOFF_CRASHED = 3,     /* domain crashed */                                     
> 
>     VIR_DOMAIN_SHUTOFF_FAILED = 6,      /* domain failed to start */
> 
> So this is definetly not FAILED. I guess we can't change meaning now also, so 
> we have to use UNKNOWN and explaining reasons again is qemuProcessReconnect is
> needless.
> 
> By the way how from reconnection POV failing to spawn a thread differ from any
> error in qemuProcessReconnect before monitor connectiton.
> 

True, but let's walk through going through history

The FAILED reason was introduced in commit b046c55d4 (v0.9.2).

Then, commit d38897a5d (v0.9.5) introduced qemuProcessReconnectHelper to
run qemuProcessReconnect in a thread. This commit used FAILED for both
APIs because that kept the same reason.

Eventually commit bda2f17d7 (v0.9.12) changed FAILED to CRASHED and
UNKNOWN based on whether the domain was started w/ -no-shutdown. With
this commit, failure to start a thread would still use FAILED.

That stayed that way until commit fe35b1ad6 (v4.3.0) in inadvertently
removed UNKNOWN while removing the capability used to determine whether
-no-shutdown was used. This left CRASHED as the only reason.

That was repaired as a result of this patch by commit 296e05b54 (4.10.0).

So now the proposal which seems fair to me is to create a category
DAEMON which reduces the UNKNOWN window to better describe why we're
stopping the domain after reconnection.

So, going back to FAILED in qemuProcessReconnectHelper IMO is the right
thing to do at this point. If some patch comes along afterwards to
changed FAILED to UNKNOWN it can be debated then.

So, I'll restore the original FAILED reason, but not document the
various values since the reader of the code could figure it out.

John

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] [PATCHv2] libvirt: add daemon itself as shutdown reason
Posted by John Ferlan 5 years, 5 months ago
From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>

This patch introduces a new shutdown reason "daemon" in order
to indicate that the daemon needed to force shutdown the domain
as the best course of action to take at the moment.

This action would occur during reconnection when processing
encounters an error once the monitor reconnection is successful.

Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
Reviewed-by: John Ferlan <jferlan@redhat.com>
---

 Since I hadn't seen a yay or nay on my last response related to
 changes - figured I'd post things as they are now just to make it
 easier to determine whether the changes made were acceptible
 This is essentially your original patch merged with my other
 changes and the removal of the qemu_process.c change.

 include/libvirt/libvirt-domain.h | 2 ++
 src/conf/domain_conf.c           | 3 ++-
 src/qemu/qemu_process.c          | 6 +++++-
 tools/virsh-domain-monitor.c     | 3 ++-
 4 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
index fdd2d6b8ea..71debd92b8 100644
--- a/include/libvirt/libvirt-domain.h
+++ b/include/libvirt/libvirt-domain.h
@@ -145,6 +145,8 @@ typedef enum {
     VIR_DOMAIN_SHUTOFF_FAILED = 6,      /* domain failed to start */
     VIR_DOMAIN_SHUTOFF_FROM_SNAPSHOT = 7, /* restored from a snapshot which was
                                            * taken while domain was shutoff */
+    VIR_DOMAIN_SHUTOFF_DAEMON = 8,      /* daemon decides to kill domain
+                                           during reconnection processing */
 # ifdef VIR_ENUM_SENTINELS
     VIR_DOMAIN_SHUTOFF_LAST
 # endif
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index e8e0adc819..b45f25fabb 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -779,7 +779,8 @@ VIR_ENUM_IMPL(virDomainShutoffReason, VIR_DOMAIN_SHUTOFF_LAST,
               "migrated",
               "saved",
               "failed",
-              "from snapshot")
+              "from snapshot",
+              "daemon")
 
 VIR_ENUM_IMPL(virDomainCrashedReason, VIR_DOMAIN_CRASHED_LAST,
               "unknown",
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 1850923914..59a94bb74a 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -7978,10 +7978,14 @@ qemuProcessReconnect(void *opaque)
          *
          * If we cannot get to the monitor when the QEMU command
          * line used -no-shutdown, then we can safely say that the
-         * domain crashed; otherwise, we don't really know. */
+         * domain crashed; otherwise, if the monitor was started,
+         * then we can blame ourselves, else we failed before the
+         * monitor started so we don't really know. */
         if (!priv->mon && tryMonReconn &&
             qemuDomainIsUsingNoShutdown(priv))
             state = VIR_DOMAIN_SHUTOFF_CRASHED;
+        else if (priv->mon)
+            state = VIR_DOMAIN_SHUTOFF_DAEMON;
         else
             state = VIR_DOMAIN_SHUTOFF_UNKNOWN;
 
diff --git a/tools/virsh-domain-monitor.c b/tools/virsh-domain-monitor.c
index 3a2636377d..f0ad558f87 100644
--- a/tools/virsh-domain-monitor.c
+++ b/tools/virsh-domain-monitor.c
@@ -212,7 +212,8 @@ VIR_ENUM_IMPL(virshDomainShutoffReason,
               N_("migrated"),
               N_("saved"),
               N_("failed"),
-              N_("from snapshot"))
+              N_("from snapshot"),
+              N_("daemon"))
 
 VIR_ENUM_DECL(virshDomainCrashedReason)
 VIR_ENUM_IMPL(virshDomainCrashedReason,
-- 
2.17.2

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCHv2] libvirt: add daemon itself as shutdown reason
Posted by Andrea Bolognani 5 years, 5 months ago
On Tue, 2018-11-13 at 15:36 -0500, John Ferlan wrote:
> From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
> 
> This patch introduces a new shutdown reason "daemon" in order
> to indicate that the daemon needed to force shutdown the domain
> as the best course of action to take at the moment.
> 
> This action would occur during reconnection when processing
> encounters an error once the monitor reconnection is successful.
> 
> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
> Reviewed-by: John Ferlan <jferlan@redhat.com>

This made language bindings sad:

  https://ci.centos.org/view/libvirt/job/libvirt-go-check/106/
  https://ci.centos.org/view/libvirt/job/libvirt-perl-check/107/

Interestingly, Python doesn't seem to be bothered at all.

Anyway, it would be great if someone would add support for the new
enum value to Go and Perl bindings and make CI happy again :)

-- 
Andrea Bolognani / Red Hat / Virtualization

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCHv2] libvirt: add daemon itself as shutdown reason
Posted by Daniel P. Berrangé 5 years, 5 months ago
On Wed, Nov 14, 2018 at 05:19:35PM +0100, Andrea Bolognani wrote:
> On Tue, 2018-11-13 at 15:36 -0500, John Ferlan wrote:
> > From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
> > 
> > This patch introduces a new shutdown reason "daemon" in order
> > to indicate that the daemon needed to force shutdown the domain
> > as the best course of action to take at the moment.
> > 
> > This action would occur during reconnection when processing
> > encounters an error once the monitor reconnection is successful.
> > 
> > Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
> > Reviewed-by: John Ferlan <jferlan@redhat.com>
> 
> This made language bindings sad:
> 
>   https://ci.centos.org/view/libvirt/job/libvirt-go-check/106/
>   https://ci.centos.org/view/libvirt/job/libvirt-perl-check/107/

This is normal - the check jobs are explicitly intended to fail
whenever public API additions are made to remind us (me) to add
support there.

> Interestingly, Python doesn't seem to be bothered at all.

python auto-generates some of its code.

> Anyway, it would be great if someone would add support for the new
> enum value to Go and Perl bindings and make CI happy again :)

I'm doing it..

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCHv2] libvirt: add daemon itself as shutdown reason
Posted by Erik Skultety 5 years, 5 months ago
On Tue, Nov 13, 2018 at 03:36:59PM -0500, John Ferlan wrote:
> From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>
> This patch introduces a new shutdown reason "daemon" in order
> to indicate that the daemon needed to force shutdown the domain
> as the best course of action to take at the moment.
>
> This action would occur during reconnection when processing
> encounters an error once the monitor reconnection is successful.
>
> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
> Reviewed-by: John Ferlan <jferlan@redhat.com>
> ---

FWIW:
Reviewed-by: Erik Skultety <eskultet@redhat.com>

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCHv2] libvirt: add daemon itself as shutdown reason
Posted by Nikolay Shirokovskiy 5 years, 5 months ago

On 13.11.2018 23:36, John Ferlan wrote:
> From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
> 
> This patch introduces a new shutdown reason "daemon" in order
> to indicate that the daemon needed to force shutdown the domain
> as the best course of action to take at the moment.
> 
> This action would occur during reconnection when processing
> encounters an error once the monitor reconnection is successful.
> 
> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
> Reviewed-by: John Ferlan <jferlan@redhat.com>
> ---
> 
>  Since I hadn't seen a yay or nay on my last response related to
>  changes - figured I'd post things as they are now just to make it
>  easier to determine whether the changes made were acceptible
>  This is essentially your original patch merged with my other
>  changes and the removal of the qemu_process.c change.
> 
>  include/libvirt/libvirt-domain.h | 2 ++
>  src/conf/domain_conf.c           | 3 ++-
>  src/qemu/qemu_process.c          | 6 +++++-
>  tools/virsh-domain-monitor.c     | 3 ++-
>  4 files changed, 11 insertions(+), 3 deletions(-)
> 

I thought you are going to push the patch after last letter)

ACK

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list