[PATCH 08/16] xen/arm: add watchdog domain suspend/resume helpers

Mykola Kvach posted 16 patches 11 months, 1 week ago
[PATCH 08/16] xen/arm: add watchdog domain suspend/resume helpers
Posted by Mykola Kvach 11 months, 1 week ago
From: Mykola Kvach <mykola_kvach@epam.com>

This patch implements suspend/resume helpers for the watchdog.
While a domain is suspended its watchdogs must be paused. Otherwise,
if the domain stays in the suspend state for a longer period of time
compared to the watchdog period, the domain would be shutdown on resume.
Proper solution to this problem is to stop (suspend) the watchdog timers
after the domain suspends and to restart (resume) the watchdog timers
before the domain resumes. The suspend/resume of watchdog timers is done
in Xen and is invisible to the guests.

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Signed-off-by: Saeed Nowshadi <saeed.nowshadi@xilinx.com>
Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
Signed-off-by: Mykola Kvach <mykola_kvach@epam.com>
---
Changes in v3:
- cover the code with CONFIG_SYSTEM_SUSPEND

Changes in v2:
- drop suspended field from timer structure
- drop the call of watchdog_domain_resume from ctxt_switch_to
---
 xen/common/sched/core.c | 39 +++++++++++++++++++++++++++++++++++++++
 xen/include/xen/sched.h |  9 +++++++++
 2 files changed, 48 insertions(+)

diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index b1c6b6b9fa..6c2231826a 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -1605,6 +1605,45 @@ void watchdog_domain_destroy(struct domain *d)
         kill_timer(&d->watchdog_timer[i].timer);
 }
 
+#ifdef CONFIG_SYSTEM_SUSPEND
+
+void watchdog_domain_suspend(struct domain *d)
+{
+    unsigned int i;
+
+    spin_lock(&d->watchdog_lock);
+
+    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
+    {
+        if ( test_bit(i, &d->watchdog_inuse_map) )
+        {
+            stop_timer(&d->watchdog_timer[i].timer);
+        }
+    }
+
+    spin_unlock(&d->watchdog_lock);
+}
+
+void watchdog_domain_resume(struct domain *d)
+{
+    unsigned int i;
+
+    spin_lock(&d->watchdog_lock);
+
+    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
+    {
+        if ( test_bit(i, &d->watchdog_inuse_map) )
+        {
+            set_timer(&d->watchdog_timer[i].timer,
+                      NOW() + SECONDS(d->watchdog_timer[i].timeout));
+        }
+    }
+
+    spin_unlock(&d->watchdog_lock);
+}
+
+#endif /* CONFIG_SYSTEM_SUSPEND */
+
 /*
  * Pin a vcpu temporarily to a specific CPU (or restore old pinning state if
  * cpu is NR_CPUS).
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index d0d10612ce..caab4aad93 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -1109,6 +1109,15 @@ void scheduler_disable(void);
 void watchdog_domain_init(struct domain *d);
 void watchdog_domain_destroy(struct domain *d);
 
+#ifdef CONFIG_SYSTEM_SUSPEND
+/*
+ * Suspend/resume watchdogs of domain (while the domain is suspended its
+ * watchdogs should be on pause)
+ */
+void watchdog_domain_suspend(struct domain *d);
+void watchdog_domain_resume(struct domain *d);
+#endif /* CONFIG_SYSTEM_SUSPEND */
+
 /*
  * Use this check when the following are both true:
  *  - Using this feature or interface requires full access to the hardware
-- 
2.43.0
Re: [PATCH 08/16] xen/arm: add watchdog domain suspend/resume helpers
Posted by Grygorii Strashko 10 months, 3 weeks ago

On 05.03.25 11:11, Mykola Kvach wrote:
> From: Mykola Kvach <mykola_kvach@epam.com>
> 
> This patch implements suspend/resume helpers for the watchdog.
> While a domain is suspended its watchdogs must be paused. Otherwise,
> if the domain stays in the suspend state for a longer period of time
> compared to the watchdog period, the domain would be shutdown on resume.
> Proper solution to this problem is to stop (suspend) the watchdog timers
> after the domain suspends and to restart (resume) the watchdog timers
> before the domain resumes. The suspend/resume of watchdog timers is done
> in Xen and is invisible to the guests.
> 
> Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
> Signed-off-by: Saeed Nowshadi <saeed.nowshadi@xilinx.com>
> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
> Signed-off-by: Mykola Kvach <mykola_kvach@epam.com>
> ---
> Changes in v3:
> - cover the code with CONFIG_SYSTEM_SUSPEND
> 
> Changes in v2:
> - drop suspended field from timer structure
> - drop the call of watchdog_domain_resume from ctxt_switch_to
> ---
>   xen/common/sched/core.c | 39 +++++++++++++++++++++++++++++++++++++++
>   xen/include/xen/sched.h |  9 +++++++++
>   2 files changed, 48 insertions(+)
> 
> diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
> index b1c6b6b9fa..6c2231826a 100644
> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -1605,6 +1605,45 @@ void watchdog_domain_destroy(struct domain *d)
>           kill_timer(&d->watchdog_timer[i].timer);
>   }
>   
> +#ifdef CONFIG_SYSTEM_SUSPEND
> +
> +void watchdog_domain_suspend(struct domain *d)
> +{
> +    unsigned int i;
> +
> +    spin_lock(&d->watchdog_lock);
> +
> +    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
> +    {
> +        if ( test_bit(i, &d->watchdog_inuse_map) )
> +        {
> +            stop_timer(&d->watchdog_timer[i].timer);
> +        }
> +    }
> +
> +    spin_unlock(&d->watchdog_lock);
> +}
> +
> +void watchdog_domain_resume(struct domain *d)
> +{
> +    unsigned int i;
> +
> +    spin_lock(&d->watchdog_lock);
> +
> +    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
> +    {
> +        if ( test_bit(i, &d->watchdog_inuse_map) )
> +        {
> +            set_timer(&d->watchdog_timer[i].timer,
> +                      NOW() + SECONDS(d->watchdog_timer[i].timeout));
> +        }
> +    }
> +
> +    spin_unlock(&d->watchdog_lock);
> +}
> +
> +#endif /* CONFIG_SYSTEM_SUSPEND */

My understanding is that domain's watchdogs support are not mandatory requirement
for enabling basic System suspend2ram feature, as they are not enabled automatically.
So, domain's watchdog patches can be separated and posted after basic functionality
is in place.

[...]

-- 
Best regards,
-grygorii
Re: [PATCH 08/16] xen/arm: add watchdog domain suspend/resume helpers
Posted by Mykola Kvach 10 months, 3 weeks ago
Hi,

On Thu, Mar 20, 2025 at 1:25 PM Grygorii Strashko
<grygorii_strashko@epam.com> wrote:
>
>
>
> On 05.03.25 11:11, Mykola Kvach wrote:
> > From: Mykola Kvach <mykola_kvach@epam.com>
> >
> > This patch implements suspend/resume helpers for the watchdog.
> > While a domain is suspended its watchdogs must be paused. Otherwise,
> > if the domain stays in the suspend state for a longer period of time
> > compared to the watchdog period, the domain would be shutdown on resume.
> > Proper solution to this problem is to stop (suspend) the watchdog timers
> > after the domain suspends and to restart (resume) the watchdog timers
> > before the domain resumes. The suspend/resume of watchdog timers is done
> > in Xen and is invisible to the guests.
> >
> > Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
> > Signed-off-by: Saeed Nowshadi <saeed.nowshadi@xilinx.com>
> > Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
> > Signed-off-by: Mykola Kvach <mykola_kvach@epam.com>
> > ---
> > Changes in v3:
> > - cover the code with CONFIG_SYSTEM_SUSPEND
> >
> > Changes in v2:
> > - drop suspended field from timer structure
> > - drop the call of watchdog_domain_resume from ctxt_switch_to
> > ---
> >   xen/common/sched/core.c | 39 +++++++++++++++++++++++++++++++++++++++
> >   xen/include/xen/sched.h |  9 +++++++++
> >   2 files changed, 48 insertions(+)
> >
> > diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
> > index b1c6b6b9fa..6c2231826a 100644
> > --- a/xen/common/sched/core.c
> > +++ b/xen/common/sched/core.c
> > @@ -1605,6 +1605,45 @@ void watchdog_domain_destroy(struct domain *d)
> >           kill_timer(&d->watchdog_timer[i].timer);
> >   }
> >
> > +#ifdef CONFIG_SYSTEM_SUSPEND
> > +
> > +void watchdog_domain_suspend(struct domain *d)
> > +{
> > +    unsigned int i;
> > +
> > +    spin_lock(&d->watchdog_lock);
> > +
> > +    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
> > +    {
> > +        if ( test_bit(i, &d->watchdog_inuse_map) )
> > +        {
> > +            stop_timer(&d->watchdog_timer[i].timer);
> > +        }
> > +    }
> > +
> > +    spin_unlock(&d->watchdog_lock);
> > +}
> > +
> > +void watchdog_domain_resume(struct domain *d)
> > +{
> > +    unsigned int i;
> > +
> > +    spin_lock(&d->watchdog_lock);
> > +
> > +    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
> > +    {
> > +        if ( test_bit(i, &d->watchdog_inuse_map) )
> > +        {
> > +            set_timer(&d->watchdog_timer[i].timer,
> > +                      NOW() + SECONDS(d->watchdog_timer[i].timeout));
> > +        }
> > +    }
> > +
> > +    spin_unlock(&d->watchdog_lock);
> > +}
> > +
> > +#endif /* CONFIG_SYSTEM_SUSPEND */
>
> My understanding is that domain's watchdogs support are not mandatory requirement
> for enabling basic System suspend2ram feature, as they are not enabled automatically.
> So, domain's watchdog patches can be separated and posted after basic functionality
> is in place.

AFAIK, the hardware domain always has the watchdog enabled, at least for now.

>
> [...]
>
> --
> Best regards,
> -grygorii

Best regards,
Mykola
Re: [PATCH 08/16] xen/arm: add watchdog domain suspend/resume helpers
Posted by Jan Beulich 11 months ago
On 05.03.2025 10:11, Mykola Kvach wrote:
> From: Mykola Kvach <mykola_kvach@epam.com>
> 
> This patch implements suspend/resume helpers for the watchdog.
> While a domain is suspended its watchdogs must be paused. Otherwise,
> if the domain stays in the suspend state for a longer period of time
> compared to the watchdog period, the domain would be shutdown on resume.
> Proper solution to this problem is to stop (suspend) the watchdog timers
> after the domain suspends and to restart (resume) the watchdog timers
> before the domain resumes. The suspend/resume of watchdog timers is done
> in Xen and is invisible to the guests.
> 
> Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>

From: != first S-o-b: is always raising the question of who's the original
author of a patch.

> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -1605,6 +1605,45 @@ void watchdog_domain_destroy(struct domain *d)
>          kill_timer(&d->watchdog_timer[i].timer);
>  }
>  
> +#ifdef CONFIG_SYSTEM_SUSPEND
> +
> +void watchdog_domain_suspend(struct domain *d)
> +{
> +    unsigned int i;
> +
> +    spin_lock(&d->watchdog_lock);
> +
> +    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
> +    {
> +        if ( test_bit(i, &d->watchdog_inuse_map) )
> +        {
> +            stop_timer(&d->watchdog_timer[i].timer);
> +        }

We generally prefer to omit the braces if the body of an if() (or
whatever it is) is a single statement / line.

> +    }
> +
> +    spin_unlock(&d->watchdog_lock);
> +}
> +
> +void watchdog_domain_resume(struct domain *d)
> +{
> +    unsigned int i;
> +
> +    spin_lock(&d->watchdog_lock);
> +
> +    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
> +    {
> +        if ( test_bit(i, &d->watchdog_inuse_map) )
> +        {
> +            set_timer(&d->watchdog_timer[i].timer,
> +                      NOW() + SECONDS(d->watchdog_timer[i].timeout));

The timeout may have almost expired before suspending; restoring to the
full original period feels wrong. At the very least, if that's indeed
intended behavior, imo this needs spelling out explicitly.

> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -1109,6 +1109,15 @@ void scheduler_disable(void);
>  void watchdog_domain_init(struct domain *d);
>  void watchdog_domain_destroy(struct domain *d);
>  
> +#ifdef CONFIG_SYSTEM_SUSPEND
> +/*
> + * Suspend/resume watchdogs of domain (while the domain is suspended its
> + * watchdogs should be on pause)
> + */
> +void watchdog_domain_suspend(struct domain *d);
> +void watchdog_domain_resume(struct domain *d);
> +#endif /* CONFIG_SYSTEM_SUSPEND */

I don't think the #ifdef is strictly needed here?

Jan
Re: [PATCH 08/16] xen/arm: add watchdog domain suspend/resume helpers
Posted by Mykola Kvach 10 months, 3 weeks ago
Hi,

On Thu, Mar 13, 2025 at 5:34 PM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 05.03.2025 10:11, Mykola Kvach wrote:
> > From: Mykola Kvach <mykola_kvach@epam.com>
> >
> > This patch implements suspend/resume helpers for the watchdog.
> > While a domain is suspended its watchdogs must be paused. Otherwise,
> > if the domain stays in the suspend state for a longer period of time
> > compared to the watchdog period, the domain would be shutdown on resume.
> > Proper solution to this problem is to stop (suspend) the watchdog timers
> > after the domain suspends and to restart (resume) the watchdog timers
> > before the domain resumes. The suspend/resume of watchdog timers is done
> > in Xen and is invisible to the guests.
> >
> > Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
>
> From: != first S-o-b: is always raising the question of who's the original
> author of a patch.

I'll try to change the From field if possible. Thank you for pointing that out.

>
> > --- a/xen/common/sched/core.c
> > +++ b/xen/common/sched/core.c
> > @@ -1605,6 +1605,45 @@ void watchdog_domain_destroy(struct domain *d)
> >          kill_timer(&d->watchdog_timer[i].timer);
> >  }
> >
> > +#ifdef CONFIG_SYSTEM_SUSPEND
> > +
> > +void watchdog_domain_suspend(struct domain *d)
> > +{
> > +    unsigned int i;
> > +
> > +    spin_lock(&d->watchdog_lock);
> > +
> > +    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
> > +    {
> > +        if ( test_bit(i, &d->watchdog_inuse_map) )
> > +        {
> > +            stop_timer(&d->watchdog_timer[i].timer);
> > +        }
>
> We generally prefer to omit the braces if the body of an if() (or
> whatever it is) is a single statement / line.

will change

>
> > +    }
> > +
> > +    spin_unlock(&d->watchdog_lock);
> > +}
> > +
> > +void watchdog_domain_resume(struct domain *d)
> > +{
> > +    unsigned int i;
> > +
> > +    spin_lock(&d->watchdog_lock);
> > +
> > +    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
> > +    {
> > +        if ( test_bit(i, &d->watchdog_inuse_map) )
> > +        {
> > +            set_timer(&d->watchdog_timer[i].timer,
> > +                      NOW() + SECONDS(d->watchdog_timer[i].timeout));
>
> The timeout may have almost expired before suspending; restoring to the
> full original period feels wrong. At the very least, if that's indeed
> intended behavior, imo this needs spelling out explicitly.

It takes some time to wake up the domain, but the watchdog timeout is
reset by a userspace daemon. As a result, we can easily encounter a
watchdog trigger during the  resume process. It looks like we should
stop the watchdog timer from the guest, and in that case, we can drop
all changes related to the watchdog in this patch series.
In any case, in this patch, we restore the default timeout instead of
using the real remaining time, so the behavior won't change. However,
I'm not sure exactly how much effort this would require. We can
enable/disable the watchdog using the Linux kernel driver and the Xen
watchdog daemon, but the Linux kernel already handles suspend/resume
of the Xen watchdog timer.

>
> > --- a/xen/include/xen/sched.h
> > +++ b/xen/include/xen/sched.h
> > @@ -1109,6 +1109,15 @@ void scheduler_disable(void);
> >  void watchdog_domain_init(struct domain *d);
> >  void watchdog_domain_destroy(struct domain *d);
> >
> > +#ifdef CONFIG_SYSTEM_SUSPEND
> > +/*
> > + * Suspend/resume watchdogs of domain (while the domain is suspended its
> > + * watchdogs should be on pause)
> > + */
> > +void watchdog_domain_suspend(struct domain *d);
> > +void watchdog_domain_resume(struct domain *d);
> > +#endif /* CONFIG_SYSTEM_SUSPEND */
>
> I don't think the #ifdef is strictly needed here?

ok, I'll drop it

>
> Jan

Best regards,
Mykola
Re: [PATCH 08/16] xen/arm: add watchdog domain suspend/resume helpers
Posted by Jan Beulich 10 months, 3 weeks ago
On 21.03.2025 10:50, Mykola Kvach wrote:
> On Thu, Mar 13, 2025 at 5:34 PM Jan Beulich <jbeulich@suse.com> wrote:
>> On 05.03.2025 10:11, Mykola Kvach wrote:
>>> +void watchdog_domain_resume(struct domain *d)
>>> +{
>>> +    unsigned int i;
>>> +
>>> +    spin_lock(&d->watchdog_lock);
>>> +
>>> +    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
>>> +    {
>>> +        if ( test_bit(i, &d->watchdog_inuse_map) )
>>> +        {
>>> +            set_timer(&d->watchdog_timer[i].timer,
>>> +                      NOW() + SECONDS(d->watchdog_timer[i].timeout));
>>
>> The timeout may have almost expired before suspending; restoring to the
>> full original period feels wrong. At the very least, if that's indeed
>> intended behavior, imo this needs spelling out explicitly.
> 
> It takes some time to wake up the domain, but the watchdog timeout is
> reset by a userspace daemon. As a result, we can easily encounter a
> watchdog trigger during the  resume process.

Which may mean the restoring is done too early, or needs doing in two
phases.

> It looks like we should
> stop the watchdog timer from the guest, and in that case, we can drop
> all changes related to the watchdog in this patch series.

Except that then you require a guest to be aware of host suspend. Which
may not be desirable.

Jan

Re: [PATCH 08/16] xen/arm: add watchdog domain suspend/resume helpers
Posted by Mykola Kvach 10 months, 3 weeks ago
On Fri, Mar 21, 2025 at 4:04 PM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 21.03.2025 10:50, Mykola Kvach wrote:
> > On Thu, Mar 13, 2025 at 5:34 PM Jan Beulich <jbeulich@suse.com> wrote:
> >> On 05.03.2025 10:11, Mykola Kvach wrote:
> >>> +void watchdog_domain_resume(struct domain *d)
> >>> +{
> >>> +    unsigned int i;
> >>> +
> >>> +    spin_lock(&d->watchdog_lock);
> >>> +
> >>> +    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
> >>> +    {
> >>> +        if ( test_bit(i, &d->watchdog_inuse_map) )
> >>> +        {
> >>> +            set_timer(&d->watchdog_timer[i].timer,
> >>> +                      NOW() + SECONDS(d->watchdog_timer[i].timeout));
> >>
> >> The timeout may have almost expired before suspending; restoring to the
> >> full original period feels wrong. At the very least, if that's indeed
> >> intended behavior, imo this needs spelling out explicitly.
> >
> > It takes some time to wake up the domain, but the watchdog timeout is
> > reset by a userspace daemon. As a result, we can easily encounter a
> > watchdog trigger during the  resume process.
>
> Which may mean the restoring is done too early, or needs doing in two
> phases.
>
> > It looks like we should
> > stop the watchdog timer from the guest, and in that case, we can drop
> > all changes related to the watchdog in this patch series.
>
> Except that then you require a guest to be aware of host suspend. Which
> may not be desirable.

I think this is not a problem; at least, I don't see how the guest
could be aware of the host suspend.

For now, we have three cases:

1) The guest is suspended (actually paused) because the system
suspends, and we pause all non-hardware domains.
2) The guest is suspended via the `xl` tool (x86 only, at least for now).
3) The guest requests S2R via `echo mem > /sys/power/state` or
`systemctl suspend`.

Let's review all these cases:

1) There is no action required here; it should be handled correctly by
domain pause. However, I think it is not handled properly right
now—but that is not an issue with the current patch series.
2) There are potential problems here. We should either notify the
domain that it will be suspended (which is hard to implement and the
guest will be aware of the host suspending) or suspend watchdog
directly during the execution of `xl` commands (more preferable).
3) As far as I know, if `watchdogd` is running, we can simply add an
action to it on suspend/resume events (need to review not Linux kernel
cases). In the case of the Linux kernel driver, it already handles
suspending/resuming the Xen watchdog correctly.

So, if I am not mistaken, we can drop all patches related to watchdog
suspend in Xen until `xl suspend/resume` for ARM64 is implemented. For
other cases, we should handle suspend/resume of the watchdog via the
`watchdogd` service.

Note: As far as I know, only the control domain has `watchdogd`
(though we could potentially set it up for other domains). DomUs can
only use the Xen watchdog Linux kernel driver.

>
> Jan

~Mykola
Re: [PATCH 08/16] xen/arm: add watchdog domain suspend/resume helpers
Posted by Jan Beulich 10 months, 3 weeks ago
On 24.03.2025 12:00, Mykola Kvach wrote:
> On Fri, Mar 21, 2025 at 4:04 PM Jan Beulich <jbeulich@suse.com> wrote:
>>
>> On 21.03.2025 10:50, Mykola Kvach wrote:
>>> On Thu, Mar 13, 2025 at 5:34 PM Jan Beulich <jbeulich@suse.com> wrote:
>>>> On 05.03.2025 10:11, Mykola Kvach wrote:
>>>>> +void watchdog_domain_resume(struct domain *d)
>>>>> +{
>>>>> +    unsigned int i;
>>>>> +
>>>>> +    spin_lock(&d->watchdog_lock);
>>>>> +
>>>>> +    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
>>>>> +    {
>>>>> +        if ( test_bit(i, &d->watchdog_inuse_map) )
>>>>> +        {
>>>>> +            set_timer(&d->watchdog_timer[i].timer,
>>>>> +                      NOW() + SECONDS(d->watchdog_timer[i].timeout));
>>>>
>>>> The timeout may have almost expired before suspending; restoring to the
>>>> full original period feels wrong. At the very least, if that's indeed
>>>> intended behavior, imo this needs spelling out explicitly.
>>>
>>> It takes some time to wake up the domain, but the watchdog timeout is
>>> reset by a userspace daemon. As a result, we can easily encounter a
>>> watchdog trigger during the  resume process.
>>
>> Which may mean the restoring is done too early, or needs doing in two
>> phases.
>>
>>> It looks like we should
>>> stop the watchdog timer from the guest, and in that case, we can drop
>>> all changes related to the watchdog in this patch series.

Noting this, ...

>> Except that then you require a guest to be aware of host suspend. Which
>> may not be desirable.
> 
> I think this is not a problem; at least, I don't see how the guest
> could be aware of the host suspend.

... perhaps it is me who is confused, but: With an unaware guest, how can
the stopping be done from the guest? I.e. ..

> For now, we have three cases:
> 
> 1) The guest is suspended (actually paused) because the system
> suspends, and we pause all non-hardware domains.

... in this case in particular, which this series is about aiui.

Jan

> 2) The guest is suspended via the `xl` tool (x86 only, at least for now).
> 3) The guest requests S2R via `echo mem > /sys/power/state` or
> `systemctl suspend`.
> 
> Let's review all these cases:
> 
> 1) There is no action required here; it should be handled correctly by
> domain pause. However, I think it is not handled properly right
> now—but that is not an issue with the current patch series.
> 2) There are potential problems here. We should either notify the
> domain that it will be suspended (which is hard to implement and the
> guest will be aware of the host suspending) or suspend watchdog
> directly during the execution of `xl` commands (more preferable).
> 3) As far as I know, if `watchdogd` is running, we can simply add an
> action to it on suspend/resume events (need to review not Linux kernel
> cases). In the case of the Linux kernel driver, it already handles
> suspending/resuming the Xen watchdog correctly.
> 
> So, if I am not mistaken, we can drop all patches related to watchdog
> suspend in Xen until `xl suspend/resume` for ARM64 is implemented. For
> other cases, we should handle suspend/resume of the watchdog via the
> `watchdogd` service.
> 
> Note: As far as I know, only the control domain has `watchdogd`
> (though we could potentially set it up for other domains). DomUs can
> only use the Xen watchdog Linux kernel driver.
> 
> ~Mykola


Re: [PATCH 08/16] xen/arm: add watchdog domain suspend/resume helpers
Posted by Julien Grall 11 months ago
Hi Mykola,

On 05/03/2025 09:11, Mykola Kvach wrote:
> From: Mykola Kvach <mykola_kvach@epam.com>
> 
> This patch implements suspend/resume helpers for the watchdog.
> While a domain is suspended its watchdogs must be paused. Otherwise,
> if the domain stays in the suspend state for a longer period of time
> compared to the watchdog period, the domain would be shutdown on resume.
> Proper solution to this problem is to stop (suspend) the watchdog timers
> after the domain suspends and to restart (resume) the watchdog timers
> before the domain resumes. The suspend/resume of watchdog timers is done
> in Xen and is invisible to the guests.
> 
> Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
> Signed-off-by: Saeed Nowshadi <saeed.nowshadi@xilinx.com>
> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
> Signed-off-by: Mykola Kvach <mykola_kvach@epam.com>
> ---
> Changes in v3:
> - cover the code with CONFIG_SYSTEM_SUSPEND
> 
> Changes in v2:
> - drop suspended field from timer structure
> - drop the call of watchdog_domain_resume from ctxt_switch_to
> ---
>   xen/common/sched/core.c | 39 +++++++++++++++++++++++++++++++++++++++
>   xen/include/xen/sched.h |  9 +++++++++
>   2 files changed, 48 insertions(+)
> 
> diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
> index b1c6b6b9fa..6c2231826a 100644
> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -1605,6 +1605,45 @@ void watchdog_domain_destroy(struct domain *d)
>           kill_timer(&d->watchdog_timer[i].timer);
>   }
>   
> +#ifdef CONFIG_SYSTEM_SUSPEND

The config option is Arm specific, yet this is common code. As x86, 
already suspend/resume, then shouldn't the config option be common?

But more importantly, why do we need to save/restore the watchdogs for 
Arm but not x86? Is this a latent issue or design choice?

> +
> +void watchdog_domain_suspend(struct domain *d)
> +{
> +    unsigned int i;
> +
> +    spin_lock(&d->watchdog_lock);
> +
> +    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
> +    {
> +        if ( test_bit(i, &d->watchdog_inuse_map) )
> +        {
> +            stop_timer(&d->watchdog_timer[i].timer);
> +        }
> +    }
> +
> +    spin_unlock(&d->watchdog_lock);
> +}
> +
> +void watchdog_domain_resume(struct domain *d)
> +{
> +    unsigned int i;
> +
> +    spin_lock(&d->watchdog_lock);
> +
> +    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
> +    {
> +        if ( test_bit(i, &d->watchdog_inuse_map) )
> +        {
> +            set_timer(&d->watchdog_timer[i].timer,
> +                      NOW() + SECONDS(d->watchdog_timer[i].timeout));
> +        }
> +    }
> +
> +    spin_unlock(&d->watchdog_lock);
> +}
> +
> +#endif /* CONFIG_SYSTEM_SUSPEND */
> +
>   /*
>    * Pin a vcpu temporarily to a specific CPU (or restore old pinning state if
>    * cpu is NR_CPUS).
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index d0d10612ce..caab4aad93 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -1109,6 +1109,15 @@ void scheduler_disable(void);
>   void watchdog_domain_init(struct domain *d);
>   void watchdog_domain_destroy(struct domain *d);
>   
> +#ifdef CONFIG_SYSTEM_SUSPEND
> +/*
> + * Suspend/resume watchdogs of domain (while the domain is suspended its
> + * watchdogs should be on pause)
> + */
> +void watchdog_domain_suspend(struct domain *d);
> +void watchdog_domain_resume(struct domain *d);
> +#endif /* CONFIG_SYSTEM_SUSPEND */
> +
>   /*
>    * Use this check when the following are both true:
>    *  - Using this feature or interface requires full access to the hardware

Cheers,

-- 
Julien Grall
Re: [PATCH 08/16] xen/arm: add watchdog domain suspend/resume helpers
Posted by Mykola Kvach 8 months, 2 weeks ago
Hi, @Julien Grall

On Tue, Mar 11, 2025 at 11:04 PM Julien Grall <julien@xen.org> wrote:
>
> Hi Mykola,
>
> On 05/03/2025 09:11, Mykola Kvach wrote:
> > From: Mykola Kvach <mykola_kvach@epam.com>
> >
> > This patch implements suspend/resume helpers for the watchdog.
> > While a domain is suspended its watchdogs must be paused. Otherwise,
> > if the domain stays in the suspend state for a longer period of time
> > compared to the watchdog period, the domain would be shutdown on resume.
> > Proper solution to this problem is to stop (suspend) the watchdog timers
> > after the domain suspends and to restart (resume) the watchdog timers
> > before the domain resumes. The suspend/resume of watchdog timers is done
> > in Xen and is invisible to the guests.
> >
> > Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
> > Signed-off-by: Saeed Nowshadi <saeed.nowshadi@xilinx.com>
> > Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
> > Signed-off-by: Mykola Kvach <mykola_kvach@epam.com>
> > ---
> > Changes in v3:
> > - cover the code with CONFIG_SYSTEM_SUSPEND
> >
> > Changes in v2:
> > - drop suspended field from timer structure
> > - drop the call of watchdog_domain_resume from ctxt_switch_to
> > ---
> >   xen/common/sched/core.c | 39 +++++++++++++++++++++++++++++++++++++++
> >   xen/include/xen/sched.h |  9 +++++++++
> >   2 files changed, 48 insertions(+)
> >
> > diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
> > index b1c6b6b9fa..6c2231826a 100644
> > --- a/xen/common/sched/core.c
> > +++ b/xen/common/sched/core.c
> > @@ -1605,6 +1605,45 @@ void watchdog_domain_destroy(struct domain *d)
> >           kill_timer(&d->watchdog_timer[i].timer);
> >   }
> >
> > +#ifdef CONFIG_SYSTEM_SUSPEND
>
> The config option is Arm specific, yet this is common code. As x86,
> already suspend/resume, then shouldn't the config option be common?
>
> But more importantly, why do we need to save/restore the watchdogs for
> Arm but not x86? Is this a latent issue or design choice?

I’ve looked into this a bit. Here's what I've found:

A watchdog timer is created and initialized (but not started) for each
domain during the domain_create call. This timer can be triggered either by
the Linux kernel (I refer to Linux kernel–based operating systems because I
have access to the code and can confirm that the Xen watchdog timer is
implemented there. I don’t have knowledge about the existence or
implementation of the Xen watchdog driver in other operating systems) Xen
watchdog driver or by the xenwatchdogd service.

Case 1: Watchdog started by the Linux kernel driver (I hope that operating
systems not based on the Linux kernel also implement proper handling for
the Xen watchdog timer driver during suspend and resume)
When the Xen watchdog is started by the Linux kernel driver, everything
works as expected. The driver correctly handles system suspend events:
https://elixir.bootlin.com/linux/v6.14.8/source/drivers/watchdog/xen_wdt.c#L169

Case 2: Watchdog started by xenwatchdogd service
However, when the watchdog is started by the xenwatchdogd service, neither
the underlying OS nor the daemon takes care of stopping the watchdog timer
during suspend:

https://elixir.bootlin.com/xen/v4.20.0/source/tools/hotplug/Linux/init.d/xen-watchdog.in

https://elixir.bootlin.com/xen/v4.20.0/source/tools/hotplug/NetBSD/rc.d/xen-watchdog

Behavior on x86 during suspend:
- Linux guest is configured with xenwatchdogd, and the Xen watchdog is
started at boot
- the OS initiates suspend (we request)
- at that moment, there's an active watchdog timer in Xen for the domain,
set to, say, 15 seconds
- after suspend preparations, domain_shutdown() is called with the
SHUTDOWN_suspend argument inside Xen hypervisor internals
- inside this function, the is_shutting_down flag is set in the domain
structure
- when the watchdog timer expires, the Xen handler skips the reset action
because the domain is marked as shutting down:

https://elixir.bootlin.com/xen/v4.20.0/source/xen/common/sched/core.c#L1539

So far, everything behaves correctly.

*BUT* *for the second case there is another flow. The domain starts
resuming from suspend. As part of the resume process, the is_shutting_down
flag inside the domain is cleared, which re-enables normal watchdog
behavior. However, the watchdog timer—set before suspend—has nearly
expired. Because the OS and its user-space services (such as the watchdog
pinging daemon) have not yet fully resumed and restarted, the watchdog
timeout occurs before the ping can be sent. As a result, the watchdog
triggers a reset or shutdown (as far as i know can't be another action of
watchdog expiry, but we aren't interested in these options right now)
before the service has a chance to take control again — effectively making
a clean resume impossible.*

It's also unclear how common this situation is on x86 systems —
specifically, whether xenwatchdogd is typically used in domU or dom0, or
whether the kernel driver is more commonly relied upon instead.
---

In my opinion, since the guest OS is the one starting the Xen watchdog, it
should also manage its state during suspend/resume transitions. The general
expectation across architectures is that the OS handles device state
management when transitioning between power states. This includes stopping
or parking watchdog timers during suspend.
I think proper handling should be added to the relevant services to avoid
unexpected triggers.

What’s your take on this?

>
> > +
> > +void watchdog_domain_suspend(struct domain *d)
> > +{
> > +    unsigned int i;
> > +
> > +    spin_lock(&d->watchdog_lock);
> > +
> > +    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
> > +    {
> > +        if ( test_bit(i, &d->watchdog_inuse_map) )
> > +        {
> > +            stop_timer(&d->watchdog_timer[i].timer);
> > +        }
> > +    }
> > +
> > +    spin_unlock(&d->watchdog_lock);
> > +}
> > +
> > +void watchdog_domain_resume(struct domain *d)
> > +{
> > +    unsigned int i;
> > +
> > +    spin_lock(&d->watchdog_lock);
> > +
> > +    for ( i = 0; i < NR_DOMAIN_WATCHDOG_TIMERS; i++ )
> > +    {
> > +        if ( test_bit(i, &d->watchdog_inuse_map) )
> > +        {
> > +            set_timer(&d->watchdog_timer[i].timer,
> > +                      NOW() + SECONDS(d->watchdog_timer[i].timeout));
> > +        }
> > +    }
> > +
> > +    spin_unlock(&d->watchdog_lock);
> > +}
> > +
> > +#endif /* CONFIG_SYSTEM_SUSPEND */
> > +
> >   /*
> >    * Pin a vcpu temporarily to a specific CPU (or restore old pinning
state if
> >    * cpu is NR_CPUS).
> > diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> > index d0d10612ce..caab4aad93 100644
> > --- a/xen/include/xen/sched.h
> > +++ b/xen/include/xen/sched.h
> > @@ -1109,6 +1109,15 @@ void scheduler_disable(void);
> >   void watchdog_domain_init(struct domain *d);
> >   void watchdog_domain_destroy(struct domain *d);
> >
> > +#ifdef CONFIG_SYSTEM_SUSPEND
> > +/*
> > + * Suspend/resume watchdogs of domain (while the domain is suspended
its
> > + * watchdogs should be on pause)
> > + */
> > +void watchdog_domain_suspend(struct domain *d);
> > +void watchdog_domain_resume(struct domain *d);
> > +#endif /* CONFIG_SYSTEM_SUSPEND */
> > +
> >   /*
> >    * Use this check when the following are both true:
> >    *  - Using this feature or interface requires full access to the
hardware
>
> Cheers,
>
> --
> Julien Grall
>

Kind regards,
Mykola