[PATCH v2 1/2] xl: Keep monitoring suspended domain

Jason Andryuk posted 2 patches 1 month ago
[PATCH v2 1/2] xl: Keep monitoring suspended domain
Posted by Jason Andryuk 1 month ago
When a VM transitioned to LIBXL_SHUTDOWN_REASON_SUSPEND, the xl daemon
was exiting as 0 = DOMAIN_RESTART_NONE "No domain restart".
Later, when the VM actually shutdown, the missing xl daemon meant the
domain wasn't cleaned up properly.

Add a new DOMAIN_RESTART_SUSPENDED to handle the case.  The xl daemon
keeps running to react to future shutdown events.

The domain death event needs to be re-enabled to catch subsequent
events.  The libxl_evgen_domain_death is moved from death_list to
death_reported, and then it isn't found on subsequent iterations through
death_list.  We enable the new event before disabling the old event, to
keep the xenstore watch active.  If it is unregistered and
re-registered, it'll fire immediately for our suspended domain which
will end up continuously re-triggering.

Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
---
 tools/xl/xl.h           |  1 +
 tools/xl/xl_vmcontrol.c | 18 +++++++++++++++++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/tools/xl/xl.h b/tools/xl/xl.h
index 9c86bb1d98..967d034cfe 100644
--- a/tools/xl/xl.h
+++ b/tools/xl/xl.h
@@ -301,6 +301,7 @@ typedef enum {
     DOMAIN_RESTART_NORMAL,       /* Domain should be restarted */
     DOMAIN_RESTART_RENAME,       /* Domain should be renamed and restarted */
     DOMAIN_RESTART_SOFT_RESET,   /* Soft reset should be performed */
+    DOMAIN_RESTART_SUSPENDED,    /* Domain suspended - keep looping */
 } domain_restart_type;
 
 extern void printf_info_sexp(int domid, libxl_domain_config *d_config, FILE *fh);
diff --git a/tools/xl/xl_vmcontrol.c b/tools/xl/xl_vmcontrol.c
index fa1a4420e3..c45d497c28 100644
--- a/tools/xl/xl_vmcontrol.c
+++ b/tools/xl/xl_vmcontrol.c
@@ -417,7 +417,7 @@ static domain_restart_type handle_domain_death(uint32_t *r_domid,
         break;
     case LIBXL_SHUTDOWN_REASON_SUSPEND:
         LOG("Domain has suspended.");
-        return 0;
+        return DOMAIN_RESTART_SUSPENDED;
     case LIBXL_SHUTDOWN_REASON_CRASH:
         action = d_config->on_crash;
         break;
@@ -1030,6 +1030,7 @@ start:
         }
     }
     while (1) {
+        libxl_evgen_domain_death *deathw2 = NULL;
         libxl_event *event;
         ret = domain_wait_event(domid, &event);
         if (ret) goto out;
@@ -1100,9 +1101,24 @@ start:
                 ret = 0;
                 goto out;
 
+            case DOMAIN_RESTART_SUSPENDED:
+                LOG("Continue waiting for domain %u", domid);
+                /*
+                 * Enable a new event before disabling the old.  This ensures
+                 * the xenstore watch remains active.  Otherwise it'll fire
+                 * immediately on re-registration and find our suspended domain.
+                 */
+                ret = libxl_evenable_domain_death(ctx, domid, 0, &deathw2);
+                if (ret) goto out;
+                libxl_evdisable_domain_death(ctx, deathw);
+                deathw = deathw2;
+                deathw2 = NULL;
+                break;
+
             default:
                 abort();
             }
+            break;
 
         case LIBXL_EVENT_TYPE_DOMAIN_DEATH:
             LOG("Domain %u has been destroyed.", domid);
-- 
2.34.1
Re: [PATCH v2 1/2] xl: Keep monitoring suspended domain
Posted by Anthony PERARD 4 weeks ago
On Tue, Nov 26, 2024 at 12:19:40PM -0500, Jason Andryuk wrote:
> When a VM transitioned to LIBXL_SHUTDOWN_REASON_SUSPEND, the xl daemon
> was exiting as 0 = DOMAIN_RESTART_NONE "No domain restart".
> Later, when the VM actually shutdown, the missing xl daemon meant the
> domain wasn't cleaned up properly.
> 
> Add a new DOMAIN_RESTART_SUSPENDED to handle the case.  The xl daemon
> keeps running to react to future shutdown events.
> 
> The domain death event needs to be re-enabled to catch subsequent
> events.  The libxl_evgen_domain_death is moved from death_list to
> death_reported, and then it isn't found on subsequent iterations through
> death_list.  We enable the new event before disabling the old event, to
> keep the xenstore watch active.  If it is unregistered and
> re-registered, it'll fire immediately for our suspended domain which
> will end up continuously re-triggering.
> 
> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>

Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>

Thanks,

-- 

Anthony Perard | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech
Re: [PATCH v2 1/2] xl: Keep monitoring suspended domain
Posted by Jan Beulich 3 weeks, 3 days ago
On 28.11.2024 17:45, Anthony PERARD wrote:
> On Tue, Nov 26, 2024 at 12:19:40PM -0500, Jason Andryuk wrote:
>> When a VM transitioned to LIBXL_SHUTDOWN_REASON_SUSPEND, the xl daemon
>> was exiting as 0 = DOMAIN_RESTART_NONE "No domain restart".
>> Later, when the VM actually shutdown, the missing xl daemon meant the
>> domain wasn't cleaned up properly.
>>
>> Add a new DOMAIN_RESTART_SUSPENDED to handle the case.  The xl daemon
>> keeps running to react to future shutdown events.
>>
>> The domain death event needs to be re-enabled to catch subsequent
>> events.  The libxl_evgen_domain_death is moved from death_list to
>> death_reported, and then it isn't found on subsequent iterations through
>> death_list.  We enable the new event before disabling the old event, to
>> keep the xenstore watch active.  If it is unregistered and
>> re-registered, it'll fire immediately for our suspended domain which
>> will end up continuously re-triggering.
>>
>> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
> 
> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>

While committing I was wondering: Does this want/need backporting (and hence
was it perhaps lacking a Fixes: tag)?

Jan
Re: [PATCH v2 1/2] xl: Keep monitoring suspended domain
Posted by Jason Andryuk 3 weeks, 2 days ago
On 2024-12-02 03:53, Jan Beulich wrote:
> On 28.11.2024 17:45, Anthony PERARD wrote:
>> On Tue, Nov 26, 2024 at 12:19:40PM -0500, Jason Andryuk wrote:
>>> When a VM transitioned to LIBXL_SHUTDOWN_REASON_SUSPEND, the xl daemon
>>> was exiting as 0 = DOMAIN_RESTART_NONE "No domain restart".
>>> Later, when the VM actually shutdown, the missing xl daemon meant the
>>> domain wasn't cleaned up properly.
>>>
>>> Add a new DOMAIN_RESTART_SUSPENDED to handle the case.  The xl daemon
>>> keeps running to react to future shutdown events.
>>>
>>> The domain death event needs to be re-enabled to catch subsequent
>>> events.  The libxl_evgen_domain_death is moved from death_list to
>>> death_reported, and then it isn't found on subsequent iterations through
>>> death_list.  We enable the new event before disabling the old event, to
>>> keep the xenstore watch active.  If it is unregistered and
>>> re-registered, it'll fire immediately for our suspended domain which
>>> will end up continuously re-triggering.
>>>
>>> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
>>
>> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
> 
> While committing I was wondering: Does this want/need backporting (and hence
> was it perhaps lacking a Fixes: tag)?

Thanks, Jan.

I don't think it's really worth backporting.  Mainly, it hasn't been an 
issue in the last 14 years.  A Linux domU doesn't suspend itself - it 
only does so in response to a xenstore watch.  A domU *could* suspend 
itself without the xenstore watch, but that doesn't seem to happen in 
practice.  Since xl has not been able to generate those xenstore events 
prior to the `xl suspend` introduction, this code path hasn't run or 
been an issue.

The tag would be:
Fixes: 1a0e17891f ("xl: support on_{poweroff,reboot,crash} domain 
configuration options.")

Regards,
Jason