A CPU system-wakeup QoS limit may have been requested by user-space. To
avoid breaking this constraint when entering a low-power state during
s2idle through genpd, let's extend the corresponding genpd governor for
CPUs. More precisely, during s2idle let the genpd governor select a
suitable low-power state, by taking into account the QoS limit.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
---
Changes in v2:
- Limit the change to the genpd governor for CPUs.
---
drivers/pmdomain/core.c | 10 ++++++++--
drivers/pmdomain/governor.c | 27 +++++++++++++++++++++++++++
include/linux/pm_domain.h | 1 +
3 files changed, 36 insertions(+), 2 deletions(-)
diff --git a/drivers/pmdomain/core.c b/drivers/pmdomain/core.c
index 61c2277c9ce3..4fd546ef0448 100644
--- a/drivers/pmdomain/core.c
+++ b/drivers/pmdomain/core.c
@@ -1425,8 +1425,14 @@ static void genpd_sync_power_off(struct generic_pm_domain *genpd, bool use_lock,
return;
}
- /* Choose the deepest state when suspending */
- genpd->state_idx = genpd->state_count - 1;
+ if (genpd->gov && genpd->gov->system_power_down_ok) {
+ if (!genpd->gov->system_power_down_ok(&genpd->domain))
+ return;
+ } else {
+ /* Default to the deepest state. */
+ genpd->state_idx = genpd->state_count - 1;
+ }
+
if (_genpd_power_off(genpd, false)) {
genpd->states[genpd->state_idx].rejected++;
return;
diff --git a/drivers/pmdomain/governor.c b/drivers/pmdomain/governor.c
index 39359811a930..bd1b9d66d4a5 100644
--- a/drivers/pmdomain/governor.c
+++ b/drivers/pmdomain/governor.c
@@ -415,9 +415,36 @@ static bool cpu_power_down_ok(struct dev_pm_domain *pd)
return false;
}
+static bool cpu_system_power_down_ok(struct dev_pm_domain *pd)
+{
+ s64 constraint_ns = cpu_wakeup_latency_qos_limit() * NSEC_PER_USEC;
+ struct generic_pm_domain *genpd = pd_to_genpd(pd);
+ int state_idx = genpd->state_count - 1;
+
+ if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN)) {
+ genpd->state_idx = state_idx;
+ return true;
+ }
+
+ /* Find the deepest state for the latency constraint. */
+ while (state_idx >= 0) {
+ s64 latency_ns = genpd->states[state_idx].power_off_latency_ns +
+ genpd->states[state_idx].power_on_latency_ns;
+
+ if (latency_ns <= constraint_ns) {
+ genpd->state_idx = state_idx;
+ return true;
+ }
+ state_idx--;
+ }
+
+ return false;
+}
+
struct dev_power_governor pm_domain_cpu_gov = {
.suspend_ok = default_suspend_ok,
.power_down_ok = cpu_power_down_ok,
+ .system_power_down_ok = cpu_system_power_down_ok,
};
#endif
diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
index f67a2cb7d781..93ba0143ca47 100644
--- a/include/linux/pm_domain.h
+++ b/include/linux/pm_domain.h
@@ -153,6 +153,7 @@ enum genpd_sync_state {
};
struct dev_power_governor {
+ bool (*system_power_down_ok)(struct dev_pm_domain *domain);
bool (*power_down_ok)(struct dev_pm_domain *domain);
bool (*suspend_ok)(struct device *dev);
};
--
2.43.0
Ulf Hansson <ulf.hansson@linaro.org> writes:

> A CPU system-wakeup QoS limit may have been requested by user-space. To
> avoid breaking this constraint when entering a low-power state during
> s2idle through genpd, let's extend the corresponding genpd governor for
> CPUs. More precisely, during s2idle let the genpd governor select a
> suitable low-power state, by taking into account the QoS limit.

In addition to a QoS limit requested by userspace, shouldn't any
per-device QoS limits from devices in the PM domain with CPUs/clusters
also be considered?

Right now, if a device is in a PM domain that also contains CPUs, any
per-device QoS constraints for those devices should also impact the
state chosen by s2idle.

I just tried a quick hack of extending your cpu_system_power_down_ok()
function to look for any per-device QoS constraints set on all devices in
the PM domain (and subdomains). It takes the min of all the per-device
QoS constraints, compares it to the one from userspace, and uses the min
of those to decide the genpd state.

That has the effect I'm after, but curious what you think about the
use case and the idea?

Kevin
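A minimal sketch of the aggregation Kevin describes, for illustration only (not his actual hack; the helper name is made up and subdomain traversal is omitted): take the smallest per-device resume_latency QoS value among the devices in the domain and combine it with the user-space CPU wakeup limit.

static s64 genpd_s2idle_constraint_ns(struct generic_pm_domain *genpd)
{
	s64 constraint_ns = cpu_wakeup_latency_qos_limit() * NSEC_PER_USEC;
	struct pm_domain_data *pdd;

	list_for_each_entry(pdd, &genpd->dev_list, list_node) {
		s32 latency_us = dev_pm_qos_read_value(pdd->dev,
						DEV_PM_QOS_RESUME_LATENCY);

		/* Devices without a constraint don't limit the state. */
		if (latency_us == PM_QOS_RESUME_LATENCY_NO_CONSTRAINT)
			continue;

		constraint_ns = min_t(s64, constraint_ns,
				      (s64)latency_us * NSEC_PER_USEC);
	}

	return constraint_ns;
}

The returned value could then stand in for constraint_ns in cpu_system_power_down_ok() from the patch above.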
On Sat, 1 Nov 2025 at 01:11, Kevin Hilman <khilman@baylibre.com> wrote:
>
> Ulf Hansson <ulf.hansson@linaro.org> writes:
>
> > A CPU system-wakeup QoS limit may have been requested by user-space. To
> > avoid breaking this constraint when entering a low-power state during
> > s2idle through genpd, let's extend the corresponding genpd governor for
> > CPUs. More precisely, during s2idle let the genpd governor select a
> > suitable low-power state, by taking into account the QoS limit.
>
> In addition to a QoS limit requested by userspace, shouldn't any
> per-device QoS limits from devices in the PM domain with CPUs/clusters
> also be considered?
>
> Right now, if a device is in a PM domain that also contains CPUs, any
> per-device QoS constraints for those devices should also impact the
> state chosen by s2idle.

I am not so sure about that. The existing dev PM QoS latency is
targeted towards runtime suspend of a device and the genpd governor
also takes it into account for this use case.

If we would start to take the same dev PM QoS latency constraint into
account for system suspend (s2idle), it may not be what the user
really intended. Instead, I think we should consider extending the dev
PM QoS interface, to allow the user to set a specific latency limit
for system wakeup. Then the genpd governor should take that into
account for s2idle.

> I just tried a quick hack of extending your cpu_system_power_down_ok()
> function to look for any per-device QoS constraints set on all devices in
> the PM domain (and subdomains). It takes the min of all the per-device
> QoS constraints, compares it to the one from userspace, and uses the min
> of those to decide the genpd state.
>
> That has the effect I'm after, but curious what you think about the
> use case and the idea?

It makes sense, but as stated above, I think we need a new QoS limit
specific for system suspend.

Rafael, what are your thoughts around this?

Kind regards
Uffe
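A rough sketch of that direction, purely illustrative: a dedicated per-device QoS request type for system-wakeup latency next to today's runtime-PM resume latency. The DEV_PM_QOS_SYSTEM_WAKEUP_LATENCY entry is made up; the other members are the existing dev_pm_qos_req_type values.

enum dev_pm_qos_req_type {
	DEV_PM_QOS_RESUME_LATENCY = 1,
	DEV_PM_QOS_LATENCY_TOLERANCE,
	DEV_PM_QOS_MIN_FREQUENCY,
	DEV_PM_QOS_MAX_FREQUENCY,
	DEV_PM_QOS_FLAGS,
	DEV_PM_QOS_SYSTEM_WAKEUP_LATENCY,	/* hypothetical addition */
};

The genpd governor could then read such a per-device value during s2idle instead of reusing the runtime resume_latency constraint.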
Ulf Hansson <ulf.hansson@linaro.org> writes:
> On Sat, 1 Nov 2025 at 01:11, Kevin Hilman <khilman@baylibre.com> wrote:
>>
>> Ulf Hansson <ulf.hansson@linaro.org> writes:
>>
>> > A CPU system-wakeup QoS limit may have been requested by user-space. To
>> > avoid breaking this constraint when entering a low-power state during
>> > s2idle through genpd, let's extend the corresponding genpd governor for
>> > CPUs. More precisely, during s2idle let the genpd governor select a
>> > suitable low-power state, by taking into account the QoS limit.
>>
>> In addition to a QoS limit requested by userspace, shouldn't any
>> per-device QoS limits from devices in the PM domain with CPUs/clusters
>> also be considered?
>>
>> Right now, if a device is in a PM domain that also contains CPUs, any
>> per-device QoS constraints for those devices should also impact the
>> state chosen by s2idle.
>
> I am not so sure about that. The existing dev PM QoS latency is
> targeted towards runtime suspend of a device and the genpd governor
> also takes it into account for this use case.
>
> If we would start to take the same dev PM QoS latency constraint into
> account for system suspend (s2idle), it may not be what the user
> really intended. Instead, I think we should consider extending the dev
> PM QoS interface, to allow the user to set a specific latency limit
> for system wakeup. Then the genpd governor should take that into
> account for s2idle.
>>
>> I just tried a quick hack of extending your cpu_system_power_down_ok()
>> function to look for any per-device QoS constraints set on all devices in
>> the PM domain (and subdomains). It takes the min of all the per-device
>> QoS constraints, compares it to the one from userspace, and uses the min
>> of those to decide the genpd state.
>>
>> That has the effect I'm after, but curious what you think about the
>> use case and the idea?
>
> It makes sense, but as stated above, I think we need a new QoS limit
> specific for system suspend.
OK, got it. I understand the need to distinguish between QoS
constraints set for runtime PM and system-wide suspend/resume.
So, instead of adding a completely separate entry for system-wide
suspend, and duplicating a bunch of code/API, what about using the QoS
flags, and adding a new flag that indicates if the resume_latency
constraint applies only to runtime PM (the default) or if it *also*
applies to system-wide suspend? (RFC patch below[1])
With this, I was able to extend your new cpu_system_power_down_ok()
function to check for any devices in the domain with a resume_latency
*and* this new flag before applying it to s2idle, and that's working
great locally.
Kevin
[1]
From d25b871c2044ee0e993fd75f5f1df36a35633d1f Mon Sep 17 00:00:00 2001
From: "Kevin Hilman (TI.com)" <khilman@baylibre.com>
Date: Thu, 13 Nov 2025 17:02:38 -0800
Subject: [RFC/PATCH] PM / QoS: add flag to indicate latency applies
system-wide
Add new PM_QOS_FLAG_LATENCY_SYS flag to indicate that the
resume_latency QoS constraint applies not just to runtime PM
transitions but also to system-wide suspend/s2idle.
Signed-off-by: Kevin Hilman (TI.com) <khilman@baylibre.com>
---
drivers/base/power/sysfs.c | 27 +++++++++++++++++++++++++++
include/linux/pm_qos.h | 2 ++
2 files changed, 29 insertions(+)
diff --git a/drivers/base/power/sysfs.c b/drivers/base/power/sysfs.c
index 13b31a3adc77..9136c13c17e4 100644
--- a/drivers/base/power/sysfs.c
+++ b/drivers/base/power/sysfs.c
@@ -316,6 +316,32 @@ static ssize_t pm_qos_no_power_off_store(struct device *dev,
static DEVICE_ATTR_RW(pm_qos_no_power_off);
+static ssize_t pm_qos_latency_sys_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "%d\n", !!(dev_pm_qos_requested_flags(dev)
+ & PM_QOS_FLAG_LATENCY_SYS));
+}
+
+static ssize_t pm_qos_latency_sys_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t n)
+{
+ int ret;
+
+ if (kstrtoint(buf, 0, &ret))
+ return -EINVAL;
+
+ if (ret != 0 && ret != 1)
+ return -EINVAL;
+
+ ret = dev_pm_qos_update_flags(dev, PM_QOS_FLAG_LATENCY_SYS, ret);
+ return ret < 0 ? ret : n;
+}
+
+static DEVICE_ATTR_RW(pm_qos_latency_sys);
+
#ifdef CONFIG_PM_SLEEP
static const char _enabled[] = "enabled";
static const char _disabled[] = "disabled";
@@ -681,6 +707,7 @@ static const struct attribute_group pm_qos_latency_tolerance_attr_group = {
static struct attribute *pm_qos_flags_attrs[] = {
&dev_attr_pm_qos_no_power_off.attr,
+ &dev_attr_pm_qos_latency_sys.attr,
NULL,
};
static const struct attribute_group pm_qos_flags_attr_group = {
diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h
index 4a69d4af3ff8..b95f52775dfb 100644
--- a/include/linux/pm_qos.h
+++ b/include/linux/pm_qos.h
@@ -37,6 +37,8 @@ enum pm_qos_flags_status {
#define PM_QOS_LATENCY_TOLERANCE_NO_CONSTRAINT (-1)
#define PM_QOS_FLAG_NO_POWER_OFF (1 << 0)
+/* latency value applies to system-wide suspend/s2idle */
+#define PM_QOS_FLAG_LATENCY_SYS (1 << 1)
enum pm_qos_type {
PM_QOS_UNITIALIZED,
--
2.51.0
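Based on the description above, the governor-side check Kevin refers to could be as small as the predicate sketched below (illustrative only, not his actual local change). User space would then opt a device in by writing 1 to its power/pm_qos_latency_sys attribute added by the RFC patch.

/*
 * Only let a device's resume_latency constraint limit the s2idle domain
 * state when the device has opted in via PM_QOS_FLAG_LATENCY_SYS.
 */
static bool dev_resume_latency_applies_to_s2idle(struct device *dev)
{
	return dev_pm_qos_flags(dev, PM_QOS_FLAG_LATENCY_SYS) == PM_QOS_FLAGS_ALL;
}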
On Tue, Nov 4, 2025 at 5:10 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Sat, 1 Nov 2025 at 01:11, Kevin Hilman <khilman@baylibre.com> wrote:
> >
> > Ulf Hansson <ulf.hansson@linaro.org> writes:
> >
> > > A CPU system-wakeup QoS limit may have been requested by user-space. To
> > > avoid breaking this constraint when entering a low-power state during
> > > s2idle through genpd, let's extend the corresponding genpd governor for
> > > CPUs. More precisely, during s2idle let the genpd governor select a
> > > suitable low-power state, by taking into account the QoS limit.
> >
> > In addition to a QoS limit requested by userspace, shouldn't any
> > per-device QoS limits from devices in the PM domain with CPUs/clusters
> > also be considered?
> >
> > Right now, if a device is in a PM domain that also contains CPUs, any
> > per-device QoS constraints for those devices should also impact the
> > state chosen by s2idle.
>
> I am not so sure about that. The existing dev PM QoS latency is
> targeted towards runtime suspend of a device and the genpd governor
> also takes it into account for this use case.
>
> If we would start to take the same dev PM QoS latency constraint into
> account for system suspend (s2idle), it may not be what the user
> really intended. Instead, I think we should consider extending the dev
> PM QoS interface, to allow the user to set a specific latency limit
> for system wakeup. Then the genpd governor should take that into
> account for s2idle.
>
> > I just tried a quick hack of extending your cpu_system_power_down_ok()
> > function to look for any per-device QoS constraints set on all devices in
> > the PM domain (and subdomains). It takes the min of all the per-device
> > QoS constraints, compares it to the one from userspace, and uses the min
> > of those to decide the genpd state.
> >
> > That has the effect I'm after, but curious what you think about the
> > use case and the idea?
>
> It makes sense, but as stated above, I think we need a new QoS limit
> specific for system suspend.
>
> Rafael, what are your thoughts around this?

Well, it's analogous to the CPU latency limit case, so if a new
"system suspend" QoS limit is introduced for CPUs, that also needs to
be done for the other devices.

However, as in the CPU case, my personal view is that the "system
suspend" latency limits should be greater than or equal to the
corresponding latency limits for runtime PM.

One more thing that has just occurred to me is that there are systems
in which I don't want to enable the "system suspend" limits at all.
IOW, all of this needs to be disabled unless the platform opts in.
On Tue, 4 Nov 2025 at 17:37, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Tue, Nov 4, 2025 at 5:10 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >
> > On Sat, 1 Nov 2025 at 01:11, Kevin Hilman <khilman@baylibre.com> wrote:
> > >
> > > Ulf Hansson <ulf.hansson@linaro.org> writes:
> > >
> > > > A CPU system-wakeup QoS limit may have been requested by user-space. To
> > > > avoid breaking this constraint when entering a low-power state during
> > > > s2idle through genpd, let's extend the corresponding genpd governor for
> > > > CPUs. More precisely, during s2idle let the genpd governor select a
> > > > suitable low-power state, by taking into account the QoS limit.
> > >
> > > In addition to a QoS limit requested by userspace, shouldn't any
> > > per-device QoS limits from devices in the PM domain with CPUs/clusters
> > > also be considered?
> > >
> > > Right now, if a device is in a PM domain that also contains CPUs, any
> > > per-device QoS constraints for those devices should also impact the
> > > state chosen by s2idle.
> >
> > I am not so sure about that. The existing dev PM QoS latency is
> > targeted towards runtime suspend of a device and the genpd governor
> > also takes it into account for this use case.
> >
> > If we would start to take the same dev PM QoS latency constraint into
> > account for system suspend (s2idle), it may not be what the user
> > really intended. Instead, I think we should consider extending the dev
> > PM QoS interface, to allow the user to set a specific latency limit
> > for system wakeup. Then the genpd governor should take that into
> > account for s2idle.
> >
> > > I just tried a quick hack of extending your cpu_system_power_down_ok()
> > > function to look for any per-device QoS constraints set on all devices in
> > > the PM domain (and subdomains). It takes the min of all the per-device
> > > QoS constraints, compares it to the one from userspace, and uses the min
> > > of those to decide the genpd state.
> > >
> > > That has the effect I'm after, but curious what you think about the
> > > use case and the idea?
> >
> > It makes sense, but as stated above, I think we need a new QoS limit
> > specific for system suspend.
> >
> > Rafael, what are your thoughts around this?
>
> Well, it's analogous to the CPU latency limit case, so if a new
> "system suspend" QoS limit is introduced for CPUs, that also needs to
> be done for the other devices.

Agreed!

> However, as in the CPU case, my personal view is that the "system
> suspend" latency limits should be greater than or equal to the
> corresponding latency limits for runtime PM.

Right, we should treat general devices similar to CPUs.

> One more thing that has just occurred to me is that there are systems
> in which I don't want to enable the "system suspend" limits at all.
> IOW, all of this needs to be disabled unless the platform opts in.

Okay. So are you thinking of using a Kconfig for this or better to
manage this in runtime?

Kind regards
Uffe
On Tue, Nov 4, 2025 at 5:52 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Tue, 4 Nov 2025 at 17:37, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Tue, Nov 4, 2025 at 5:10 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > >
> > > On Sat, 1 Nov 2025 at 01:11, Kevin Hilman <khilman@baylibre.com> wrote:
> > > >
> > > > Ulf Hansson <ulf.hansson@linaro.org> writes:
> > > >
> > > > > A CPU system-wakeup QoS limit may have been requested by user-space. To
> > > > > avoid breaking this constraint when entering a low-power state during
> > > > > s2idle through genpd, let's extend the corresponding genpd governor for
> > > > > CPUs. More precisely, during s2idle let the genpd governor select a
> > > > > suitable low-power state, by taking into account the QoS limit.
> > > >
> > > > In addition to a QoS limit requested by userspace, shouldn't any
> > > > per-device QoS limits from devices in the PM domain with CPUs/clusters
> > > > also be considered?
> > > >
> > > > Right now, if a device is in a PM domain that also contains CPUs, any
> > > > per-device QoS constraints for those devices should also impact the
> > > > state chosen by s2idle.
> > >
> > > I am not so sure about that. The existing dev PM QoS latency is
> > > targeted towards runtime suspend of a device and the genpd governor
> > > also takes it into account for this use case.
> > >
> > > If we would start to take the same dev PM QoS latency constraint into
> > > account for system suspend (s2idle), it may not be what the user
> > > really intended. Instead, I think we should consider extending the dev
> > > PM QoS interface, to allow the user to set a specific latency limit
> > > for system wakeup. Then the genpd governor should take that into
> > > account for s2idle.
> > >
> > > > I just tried a quick hack of extending your cpu_system_power_down_ok()
> > > > function to look for any per-device QoS constraints set on all devices in
> > > > the PM domain (and subdomains). It takes the min of all the per-device
> > > > QoS constraints, compares it to the one from userspace, and uses the min
> > > > of those to decide the genpd state.
> > > >
> > > > That has the effect I'm after, but curious what you think about the
> > > > use case and the idea?
> > >
> > > It makes sense, but as stated above, I think we need a new QoS limit
> > > specific for system suspend.
> > >
> > > Rafael, what are your thoughts around this?
> >
> > Well, it's analogous to the CPU latency limit case, so if a new
> > "system suspend" QoS limit is introduced for CPUs, that also needs to
> > be done for the other devices.
>
> Agreed!
>
> > However, as in the CPU case, my personal view is that the "system
> > suspend" latency limits should be greater than or equal to the
> > corresponding latency limits for runtime PM.
>
> Right, we should treat general devices similar to CPUs.
>
> > One more thing that has just occurred to me is that there are systems
> > in which I don't want to enable the "system suspend" limits at all.
> > IOW, all of this needs to be disabled unless the platform opts in.
>
> Okay. So are you thinking of using a Kconfig for this or better to
> manage this in runtime?

Hmm, Kconfig might be a good fit because, for instance, I don't quite
see how this can be made to generally work on x86 except for some
corner cases.
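One way such an opt-in could be expressed, assuming a hypothetical CONFIG_PM_S2IDLE_QOS Kconfig symbol (not an existing option) and a made-up helper standing in for the QoS-aware selection from the patch:

static bool cpu_system_power_down_ok(struct dev_pm_domain *pd)
{
	struct generic_pm_domain *genpd = pd_to_genpd(pd);

	/*
	 * Platforms that do not opt in keep today's behaviour of always
	 * picking the deepest state during s2idle.
	 */
	if (!IS_ENABLED(CONFIG_PM_S2IDLE_QOS)) {
		genpd->state_idx = genpd->state_count - 1;
		return true;
	}

	/* Hypothetical helper: the QoS-aware selection from the patch. */
	return cpu_system_power_down_ok_qos(pd);
}

Whether such a gate would live in the governor, in genpd core, or in the QoS layer itself is of course part of the open question here.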
On Thu, Oct 16, 2025 at 5:19 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> A CPU system-wakeup QoS limit may have been requested by user-space. To
> avoid breaking this constraint when entering a low-power state during
> s2idle through genpd, let's extend the corresponding genpd governor for
> CPUs. More precisely, during s2idle let the genpd governor select a
> suitable low-power state, by taking into account the QoS limit.
>
> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> ---
>
> Changes in v2:
> - Limit the change to the genpd governor for CPUs.
>
> ---
> drivers/pmdomain/core.c | 10 ++++++++--
> drivers/pmdomain/governor.c | 27 +++++++++++++++++++++++++++
> include/linux/pm_domain.h | 1 +
> 3 files changed, 36 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pmdomain/core.c b/drivers/pmdomain/core.c
> index 61c2277c9ce3..4fd546ef0448 100644
> --- a/drivers/pmdomain/core.c
> +++ b/drivers/pmdomain/core.c
> @@ -1425,8 +1425,14 @@ static void genpd_sync_power_off(struct generic_pm_domain *genpd, bool use_lock,
> return;
> }
>
> - /* Choose the deepest state when suspending */
> - genpd->state_idx = genpd->state_count - 1;
> + if (genpd->gov && genpd->gov->system_power_down_ok) {
> + if (!genpd->gov->system_power_down_ok(&genpd->domain))
> + return;
> + } else {
> + /* Default to the deepest state. */
> + genpd->state_idx = genpd->state_count - 1;
> + }
> +
> if (_genpd_power_off(genpd, false)) {
> genpd->states[genpd->state_idx].rejected++;
> return;
> diff --git a/drivers/pmdomain/governor.c b/drivers/pmdomain/governor.c
> index 39359811a930..bd1b9d66d4a5 100644
> --- a/drivers/pmdomain/governor.c
> +++ b/drivers/pmdomain/governor.c
> @@ -415,9 +415,36 @@ static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> return false;
> }
>
> +static bool cpu_system_power_down_ok(struct dev_pm_domain *pd)
> +{
> + s64 constraint_ns = cpu_wakeup_latency_qos_limit() * NSEC_PER_USEC;
I'm not sure why genpd needs to take cpu_wakeup_latency_qos_limit()
into account directly.
It should be told by cpuidle which state has been selected on the CPU
side and it should not go any deeper than that anyway.
> + struct generic_pm_domain *genpd = pd_to_genpd(pd);
> + int state_idx = genpd->state_count - 1;
> +
> + if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN)) {
> + genpd->state_idx = state_idx;
> + return true;
> + }
> +
> + /* Find the deepest state for the latency constraint. */
> + while (state_idx >= 0) {
> + s64 latency_ns = genpd->states[state_idx].power_off_latency_ns +
> + genpd->states[state_idx].power_on_latency_ns;
> +
> + if (latency_ns <= constraint_ns) {
> + genpd->state_idx = state_idx;
> + return true;
> + }
> + state_idx--;
> + }
> +
> + return false;
> +}
> +
> struct dev_power_governor pm_domain_cpu_gov = {
> .suspend_ok = default_suspend_ok,
> .power_down_ok = cpu_power_down_ok,
> + .system_power_down_ok = cpu_system_power_down_ok,
> };
> #endif
>
> diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
> index f67a2cb7d781..93ba0143ca47 100644
> --- a/include/linux/pm_domain.h
> +++ b/include/linux/pm_domain.h
> @@ -153,6 +153,7 @@ enum genpd_sync_state {
> };
>
> struct dev_power_governor {
> + bool (*system_power_down_ok)(struct dev_pm_domain *domain);
> bool (*power_down_ok)(struct dev_pm_domain *domain);
> bool (*suspend_ok)(struct device *dev);
> };
> --
> 2.43.0
>
On Thu, 30 Oct 2025 at 11:45, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Thu, Oct 16, 2025 at 5:19 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >
> > A CPU system-wakeup QoS limit may have been requested by user-space. To
> > avoid breaking this constraint when entering a low-power state during
> > s2idle through genpd, let's extend the corresponding genpd governor for
> > CPUs. More precisely, during s2idle let the genpd governor select a
> > suitable low-power state, by taking into account the QoS limit.
> >
> > Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> > ---
> >
> > Changes in v2:
> > - Limit the change to the genpd governor for CPUs.
> >
> > ---
> > drivers/pmdomain/core.c | 10 ++++++++--
> > drivers/pmdomain/governor.c | 27 +++++++++++++++++++++++++++
> > include/linux/pm_domain.h | 1 +
> > 3 files changed, 36 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pmdomain/core.c b/drivers/pmdomain/core.c
> > index 61c2277c9ce3..4fd546ef0448 100644
> > --- a/drivers/pmdomain/core.c
> > +++ b/drivers/pmdomain/core.c
> > @@ -1425,8 +1425,14 @@ static void genpd_sync_power_off(struct generic_pm_domain *genpd, bool use_lock,
> > return;
> > }
> >
> > - /* Choose the deepest state when suspending */
> > - genpd->state_idx = genpd->state_count - 1;
> > + if (genpd->gov && genpd->gov->system_power_down_ok) {
> > + if (!genpd->gov->system_power_down_ok(&genpd->domain))
> > + return;
> > + } else {
> > + /* Default to the deepest state. */
> > + genpd->state_idx = genpd->state_count - 1;
> > + }
> > +
> > if (_genpd_power_off(genpd, false)) {
> > genpd->states[genpd->state_idx].rejected++;
> > return;
> > diff --git a/drivers/pmdomain/governor.c b/drivers/pmdomain/governor.c
> > index 39359811a930..bd1b9d66d4a5 100644
> > --- a/drivers/pmdomain/governor.c
> > +++ b/drivers/pmdomain/governor.c
> > @@ -415,9 +415,36 @@ static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > return false;
> > }
> >
> > +static bool cpu_system_power_down_ok(struct dev_pm_domain *pd)
> > +{
> > + s64 constraint_ns = cpu_wakeup_latency_qos_limit() * NSEC_PER_USEC;
>
> I'm not sure why genpd needs to take cpu_wakeup_latency_qos_limit()
> into account directly.
>
> It should be told by cpuidle which state has been selected on the CPU
> side and it should not go any deeper than that anyway.
For PSCI OS-initiated mode, cpuidle doesn't know about the states that
may be shared among a group of CPUs.
Instead, those states are controlled through the PM domain topology by
genpd and its governor, hence this is needed too.
>
> > + struct generic_pm_domain *genpd = pd_to_genpd(pd);
> > + int state_idx = genpd->state_count - 1;
> > +
> > + if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN)) {
> > + genpd->state_idx = state_idx;
> > + return true;
> > + }
> > +
> > + /* Find the deepest state for the latency constraint. */
> > + while (state_idx >= 0) {
> > + s64 latency_ns = genpd->states[state_idx].power_off_latency_ns +
> > + genpd->states[state_idx].power_on_latency_ns;
> > +
> > + if (latency_ns <= constraint_ns) {
> > + genpd->state_idx = state_idx;
> > + return true;
> > + }
> > + state_idx--;
> > + }
> > +
> > + return false;
> > +}
> > +
> > struct dev_power_governor pm_domain_cpu_gov = {
> > .suspend_ok = default_suspend_ok,
> > .power_down_ok = cpu_power_down_ok,
> > + .system_power_down_ok = cpu_system_power_down_ok,
> > };
> > #endif
> >
> > diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
> > index f67a2cb7d781..93ba0143ca47 100644
> > --- a/include/linux/pm_domain.h
> > +++ b/include/linux/pm_domain.h
> > @@ -153,6 +153,7 @@ enum genpd_sync_state {
> > };
> >
> > struct dev_power_governor {
> > + bool (*system_power_down_ok)(struct dev_pm_domain *domain);
> > bool (*power_down_ok)(struct dev_pm_domain *domain);
> > bool (*suspend_ok)(struct device *dev);
> > };
> > --
> > 2.43.0
> >
Kind regards
Uffe
On Thu, Oct 30, 2025 at 1:00 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Thu, 30 Oct 2025 at 11:45, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Thu, Oct 16, 2025 at 5:19 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > >
> > > A CPU system-wakeup QoS limit may have been requested by user-space. To
> > > avoid breaking this constraint when entering a low-power state during
> > > s2idle through genpd, let's extend the corresponding genpd governor for
> > > CPUs. More precisely, during s2idle let the genpd governor select a
> > > suitable low-power state, by taking into account the QoS limit.
> > >
> > > Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> > > ---
> > >
> > > Changes in v2:
> > > - Limit the change to the genpd governor for CPUs.
> > >
> > > ---
> > > drivers/pmdomain/core.c | 10 ++++++++--
> > > drivers/pmdomain/governor.c | 27 +++++++++++++++++++++++++++
> > > include/linux/pm_domain.h | 1 +
> > > 3 files changed, 36 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/pmdomain/core.c b/drivers/pmdomain/core.c
> > > index 61c2277c9ce3..4fd546ef0448 100644
> > > --- a/drivers/pmdomain/core.c
> > > +++ b/drivers/pmdomain/core.c
> > > @@ -1425,8 +1425,14 @@ static void genpd_sync_power_off(struct generic_pm_domain *genpd, bool use_lock,
> > > return;
> > > }
> > >
> > > - /* Choose the deepest state when suspending */
> > > - genpd->state_idx = genpd->state_count - 1;
> > > + if (genpd->gov && genpd->gov->system_power_down_ok) {
> > > + if (!genpd->gov->system_power_down_ok(&genpd->domain))
> > > + return;
> > > + } else {
> > > + /* Default to the deepest state. */
> > > + genpd->state_idx = genpd->state_count - 1;
> > > + }
> > > +
> > > if (_genpd_power_off(genpd, false)) {
> > > genpd->states[genpd->state_idx].rejected++;
> > > return;
> > > diff --git a/drivers/pmdomain/governor.c b/drivers/pmdomain/governor.c
> > > index 39359811a930..bd1b9d66d4a5 100644
> > > --- a/drivers/pmdomain/governor.c
> > > +++ b/drivers/pmdomain/governor.c
> > > @@ -415,9 +415,36 @@ static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > > return false;
> > > }
> > >
> > > +static bool cpu_system_power_down_ok(struct dev_pm_domain *pd)
> > > +{
> > > + s64 constraint_ns = cpu_wakeup_latency_qos_limit() * NSEC_PER_USEC;
> >
> > I'm not sure why genpd needs to take cpu_wakeup_latency_qos_limit()
> > into account directly.
> >
> > It should be told by cpuidle which state has been selected on the CPU
> > side and it should not go any deeper than that anyway.
>
> For PSCI OS-initiated mode, cpuidle doesn't know about the states that
> may be shared among a group of CPUs.
>
> Instead, those states are controlled through the PM domain topology by
> genpd and its governor, hence this is needed too.
All right, but I'd like to understand how all of that works.
So cpuidle selects a state to enter for the given CPU and then genpd
is invoked. It has to take the exit latency of that state into
account, so it doesn't go too deep. How does it do that?
On Thu, 30 Oct 2025 at 13:23, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Thu, Oct 30, 2025 at 1:00 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >
> > On Thu, 30 Oct 2025 at 11:45, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > On Thu, Oct 16, 2025 at 5:19 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > >
> > > > A CPU system-wakeup QoS limit may have been requested by user-space. To
> > > > avoid breaking this constraint when entering a low-power state during
> > > > s2idle through genpd, let's extend the corresponding genpd governor for
> > > > CPUs. More precisely, during s2idle let the genpd governor select a
> > > > suitable low-power state, by taking into account the QoS limit.
> > > >
> > > > Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> > > > ---
> > > >
> > > > Changes in v2:
> > > > - Limit the change to the genpd governor for CPUs.
> > > >
> > > > ---
> > > > drivers/pmdomain/core.c | 10 ++++++++--
> > > > drivers/pmdomain/governor.c | 27 +++++++++++++++++++++++++++
> > > > include/linux/pm_domain.h | 1 +
> > > > 3 files changed, 36 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/pmdomain/core.c b/drivers/pmdomain/core.c
> > > > index 61c2277c9ce3..4fd546ef0448 100644
> > > > --- a/drivers/pmdomain/core.c
> > > > +++ b/drivers/pmdomain/core.c
> > > > @@ -1425,8 +1425,14 @@ static void genpd_sync_power_off(struct generic_pm_domain *genpd, bool use_lock,
> > > > return;
> > > > }
> > > >
> > > > - /* Choose the deepest state when suspending */
> > > > - genpd->state_idx = genpd->state_count - 1;
> > > > + if (genpd->gov && genpd->gov->system_power_down_ok) {
> > > > + if (!genpd->gov->system_power_down_ok(&genpd->domain))
> > > > + return;
> > > > + } else {
> > > > + /* Default to the deepest state. */
> > > > + genpd->state_idx = genpd->state_count - 1;
> > > > + }
> > > > +
> > > > if (_genpd_power_off(genpd, false)) {
> > > > genpd->states[genpd->state_idx].rejected++;
> > > > return;
> > > > diff --git a/drivers/pmdomain/governor.c b/drivers/pmdomain/governor.c
> > > > index 39359811a930..bd1b9d66d4a5 100644
> > > > --- a/drivers/pmdomain/governor.c
> > > > +++ b/drivers/pmdomain/governor.c
> > > > @@ -415,9 +415,36 @@ static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > > > return false;
> > > > }
> > > >
> > > > +static bool cpu_system_power_down_ok(struct dev_pm_domain *pd)
> > > > +{
> > > > + s64 constraint_ns = cpu_wakeup_latency_qos_limit() * NSEC_PER_USEC;
> > >
> > > I'm not sure why genpd needs to take cpu_wakeup_latency_qos_limit()
> > > into account directly.
> > >
> > > It should be told by cpuidle which state has been selected on the CPU
> > > side and it should not go any deeper than that anyway.
> >
> > For PSCI OS-initiated mode, cpuidle doesn't know about the states that
> > may be shared among a group of CPUs.
> >
> > Instead, those states are controlled through the PM domain topology by
> > genpd and its governor, hence this is needed too.
>
> All right, but I'd like to understand how all of that works.
>
> So cpuidle selects a state to enter for the given CPU and then genpd
> is invoked. It has to take the exit latency of that state into
> account, so it doesn't go too deep. How does it do that?
Depending on the state selected, in cpuidle-psci.c we may end up
calling __psci_enter_domain_idle_state() (only for the deepest
CPU-state).
For s2idle this means we call dev_pm_genpd_suspend|resume(), to manage
the reference counting of the PM domains via genpd. This then may lead
to that genpd_sync_power_off() tries to select a state by calling the
new governor function above.
Did that make sense?
Kind regards
Uffe
On Thu, Oct 30, 2025 at 1:32 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Thu, 30 Oct 2025 at 13:23, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Thu, Oct 30, 2025 at 1:00 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > >
> > > On Thu, 30 Oct 2025 at 11:45, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > >
> > > > On Thu, Oct 16, 2025 at 5:19 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > >
> > > > > A CPU system-wakeup QoS limit may have been requested by user-space. To
> > > > > avoid breaking this constraint when entering a low-power state during
> > > > > s2idle through genpd, let's extend the corresponding genpd governor for
> > > > > CPUs. More precisely, during s2idle let the genpd governor select a
> > > > > suitable low-power state, by taking into account the QoS limit.
> > > > >
> > > > > Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> > > > > ---
> > > > >
> > > > > Changes in v2:
> > > > > - Limit the change to the genpd governor for CPUs.
> > > > >
> > > > > ---
> > > > > drivers/pmdomain/core.c | 10 ++++++++--
> > > > > drivers/pmdomain/governor.c | 27 +++++++++++++++++++++++++++
> > > > > include/linux/pm_domain.h | 1 +
> > > > > 3 files changed, 36 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/pmdomain/core.c b/drivers/pmdomain/core.c
> > > > > index 61c2277c9ce3..4fd546ef0448 100644
> > > > > --- a/drivers/pmdomain/core.c
> > > > > +++ b/drivers/pmdomain/core.c
> > > > > @@ -1425,8 +1425,14 @@ static void genpd_sync_power_off(struct generic_pm_domain *genpd, bool use_lock,
> > > > > return;
> > > > > }
> > > > >
> > > > > - /* Choose the deepest state when suspending */
> > > > > - genpd->state_idx = genpd->state_count - 1;
> > > > > + if (genpd->gov && genpd->gov->system_power_down_ok) {
> > > > > + if (!genpd->gov->system_power_down_ok(&genpd->domain))
> > > > > + return;
> > > > > + } else {
> > > > > + /* Default to the deepest state. */
> > > > > + genpd->state_idx = genpd->state_count - 1;
> > > > > + }
> > > > > +
> > > > > if (_genpd_power_off(genpd, false)) {
> > > > > genpd->states[genpd->state_idx].rejected++;
> > > > > return;
> > > > > diff --git a/drivers/pmdomain/governor.c b/drivers/pmdomain/governor.c
> > > > > index 39359811a930..bd1b9d66d4a5 100644
> > > > > --- a/drivers/pmdomain/governor.c
> > > > > +++ b/drivers/pmdomain/governor.c
> > > > > @@ -415,9 +415,36 @@ static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > > > > return false;
> > > > > }
> > > > >
> > > > > +static bool cpu_system_power_down_ok(struct dev_pm_domain *pd)
> > > > > +{
> > > > > + s64 constraint_ns = cpu_wakeup_latency_qos_limit() * NSEC_PER_USEC;
> > > >
> > > > I'm not sure why genpd needs to take cpu_wakeup_latency_qos_limit()
> > > > into account directly.
> > > >
> > > > It should be told by cpuidle which state has been selected on the CPU
> > > > side and it should not go any deeper than that anyway.
> > >
> > > For PSCI OS-initiated mode, cpuidle doesn't know about the states that
> > > may be shared among a group of CPUs.
> > >
> > > Instead, those states are controlled through the PM domain topology by
> > > genpd and its governor, hence this is needed too.
> >
> > All right, but I'd like to understand how all of that works.
> >
> > So cpuidle selects a state to enter for the given CPU and then genpd
> > is invoked. It has to take the exit latency of that state into
> > account, so it doesn't go too deep. How does it do that?
>
> Depending on the state selected, in cpuidle-psci.c we may end up
> calling __psci_enter_domain_idle_state() (only for the deepest
> CPU-state).
>
> For s2idle this means we call dev_pm_genpd_suspend|resume(), to manage
> the reference counting of the PM domains via genpd. This then may lead
> to that genpd_sync_power_off() tries to select a state by calling the
> new governor function above.
>
> Did that make sense?
So IIUC this will only happen if the deepest idle state is selected in
which case the cpu_wakeup_latency_qos_limit() value is greater than
the exit latency of that state, but it may still need to be taken into
account when selecting the domain state. However, this means that the
exit latency number for the deepest idle state is too low (it should
represent the worst-case exit latency which means the maximum domain
exit latency in this particular case).
Moreover, it looks like the "runtime" cpuidle has the same problem, doesn't it?
On Thu, 30 Oct 2025 at 15:02, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Thu, Oct 30, 2025 at 1:32 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >
> > On Thu, 30 Oct 2025 at 13:23, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > On Thu, Oct 30, 2025 at 1:00 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > >
> > > > On Thu, 30 Oct 2025 at 11:45, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > >
> > > > > On Thu, Oct 16, 2025 at 5:19 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > > >
> > > > > > A CPU system-wakeup QoS limit may have been requested by user-space. To
> > > > > > avoid breaking this constraint when entering a low-power state during
> > > > > > s2idle through genpd, let's extend the corresponding genpd governor for
> > > > > > CPUs. More precisely, during s2idle let the genpd governor select a
> > > > > > suitable low-power state, by taking into account the QoS limit.
> > > > > >
> > > > > > Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> > > > > > ---
> > > > > >
> > > > > > Changes in v2:
> > > > > > - Limit the change to the genpd governor for CPUs.
> > > > > >
> > > > > > ---
> > > > > > drivers/pmdomain/core.c | 10 ++++++++--
> > > > > > drivers/pmdomain/governor.c | 27 +++++++++++++++++++++++++++
> > > > > > include/linux/pm_domain.h | 1 +
> > > > > > 3 files changed, 36 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/pmdomain/core.c b/drivers/pmdomain/core.c
> > > > > > index 61c2277c9ce3..4fd546ef0448 100644
> > > > > > --- a/drivers/pmdomain/core.c
> > > > > > +++ b/drivers/pmdomain/core.c
> > > > > > @@ -1425,8 +1425,14 @@ static void genpd_sync_power_off(struct generic_pm_domain *genpd, bool use_lock,
> > > > > > return;
> > > > > > }
> > > > > >
> > > > > > - /* Choose the deepest state when suspending */
> > > > > > - genpd->state_idx = genpd->state_count - 1;
> > > > > > + if (genpd->gov && genpd->gov->system_power_down_ok) {
> > > > > > + if (!genpd->gov->system_power_down_ok(&genpd->domain))
> > > > > > + return;
> > > > > > + } else {
> > > > > > + /* Default to the deepest state. */
> > > > > > + genpd->state_idx = genpd->state_count - 1;
> > > > > > + }
> > > > > > +
> > > > > > if (_genpd_power_off(genpd, false)) {
> > > > > > genpd->states[genpd->state_idx].rejected++;
> > > > > > return;
> > > > > > diff --git a/drivers/pmdomain/governor.c b/drivers/pmdomain/governor.c
> > > > > > index 39359811a930..bd1b9d66d4a5 100644
> > > > > > --- a/drivers/pmdomain/governor.c
> > > > > > +++ b/drivers/pmdomain/governor.c
> > > > > > @@ -415,9 +415,36 @@ static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > > > > > return false;
> > > > > > }
> > > > > >
> > > > > > +static bool cpu_system_power_down_ok(struct dev_pm_domain *pd)
> > > > > > +{
> > > > > > + s64 constraint_ns = cpu_wakeup_latency_qos_limit() * NSEC_PER_USEC;
> > > > >
> > > > > I'm not sure why genpd needs to take cpu_wakeup_latency_qos_limit()
> > > > > into account directly.
> > > > >
> > > > > It should be told by cpuidle which state has been selected on the CPU
> > > > > side and it should not go any deeper than that anyway.
> > > >
> > > > For PSCI OS-initiated mode, cpuidle doesn't know about the states that
> > > > may be shared among a group of CPUs.
> > > >
> > > > Instead, those states are controlled through the PM domain topology by
> > > > genpd and its governor, hence this is needed too.
> > >
> > > All right, but I'd like to understand how all of that works.
> > >
> > > So cpuidle selects a state to enter for the given CPU and then genpd
> > > is invoked. It has to take the exit latency of that state into
> > > account, so it doesn't go too deep. How does it do that?
> >
> > Depending on the state selected, in cpuidle-psci.c we may end up
> > calling __psci_enter_domain_idle_state() (only for the deepest
> > CPU-state).
> >
> > For s2idle this means we call dev_pm_genpd_suspend|resume(), to manage
> > the reference counting of the PM domains via genpd. This then may lead
> > to that genpd_sync_power_off() tries to select a state by calling the
> > new governor function above.
> >
> > Did that make sense?
>
> So IIUC this will only happen if the deepest idle state is selected in
> which case the cpu_wakeup_latency_qos_limit() value is greater than
> the exit latency of that state, but it may still need to be taken into
> account when selecting the domain state. However, this means that the
Correct.
> exit latency number for the deepest idle state is too low (it should
> represent the worst-case exit latency which means the maximum domain
> exit latency in this particular case).
Yes, from the cpuidle state-selection point of view, but how is that a problem?
If the genpd-governor doesn't find a suitable "domain-idle-state", we
fall back to using the one cpuidle selected.
>
> Moreover, it looks like the "runtime" cpuidle has the same problem, doesn't it?
It works in a very similar way, but I fail to understand why you think
there is a problem.
Kind regards
Uffe
On Thu, Oct 30, 2025 at 4:07 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Thu, 30 Oct 2025 at 15:02, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Thu, Oct 30, 2025 at 1:32 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > >
> > > On Thu, 30 Oct 2025 at 13:23, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > >
> > > > On Thu, Oct 30, 2025 at 1:00 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > >
> > > > > On Thu, 30 Oct 2025 at 11:45, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > > >
> > > > > > On Thu, Oct 16, 2025 at 5:19 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > > > >
> > > > > > > A CPU system-wakeup QoS limit may have been requested by user-space. To
> > > > > > > avoid breaking this constraint when entering a low-power state during
> > > > > > > s2idle through genpd, let's extend the corresponding genpd governor for
> > > > > > > CPUs. More precisely, during s2idle let the genpd governor select a
> > > > > > > suitable low-power state, by taking into account the QoS limit.
> > > > > > >
> > > > > > > Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> > > > > > > ---
> > > > > > >
> > > > > > > Changes in v2:
> > > > > > > - Limit the change to the genpd governor for CPUs.
> > > > > > >
> > > > > > > ---
> > > > > > > drivers/pmdomain/core.c | 10 ++++++++--
> > > > > > > drivers/pmdomain/governor.c | 27 +++++++++++++++++++++++++++
> > > > > > > include/linux/pm_domain.h | 1 +
> > > > > > > 3 files changed, 36 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/pmdomain/core.c b/drivers/pmdomain/core.c
> > > > > > > index 61c2277c9ce3..4fd546ef0448 100644
> > > > > > > --- a/drivers/pmdomain/core.c
> > > > > > > +++ b/drivers/pmdomain/core.c
> > > > > > > @@ -1425,8 +1425,14 @@ static void genpd_sync_power_off(struct generic_pm_domain *genpd, bool use_lock,
> > > > > > > return;
> > > > > > > }
> > > > > > >
> > > > > > > - /* Choose the deepest state when suspending */
> > > > > > > - genpd->state_idx = genpd->state_count - 1;
> > > > > > > + if (genpd->gov && genpd->gov->system_power_down_ok) {
> > > > > > > + if (!genpd->gov->system_power_down_ok(&genpd->domain))
> > > > > > > + return;
> > > > > > > + } else {
> > > > > > > + /* Default to the deepest state. */
> > > > > > > + genpd->state_idx = genpd->state_count - 1;
> > > > > > > + }
> > > > > > > +
> > > > > > > if (_genpd_power_off(genpd, false)) {
> > > > > > > genpd->states[genpd->state_idx].rejected++;
> > > > > > > return;
> > > > > > > diff --git a/drivers/pmdomain/governor.c b/drivers/pmdomain/governor.c
> > > > > > > index 39359811a930..bd1b9d66d4a5 100644
> > > > > > > --- a/drivers/pmdomain/governor.c
> > > > > > > +++ b/drivers/pmdomain/governor.c
> > > > > > > @@ -415,9 +415,36 @@ static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > > > > > > return false;
> > > > > > > }
> > > > > > >
> > > > > > > +static bool cpu_system_power_down_ok(struct dev_pm_domain *pd)
> > > > > > > +{
> > > > > > > + s64 constraint_ns = cpu_wakeup_latency_qos_limit() * NSEC_PER_USEC;
> > > > > >
> > > > > > I'm not sure why genpd needs to take cpu_wakeup_latency_qos_limit()
> > > > > > into account directly.
> > > > > >
> > > > > > It should be told by cpuidle which state has been selected on the CPU
> > > > > > side and it should not go any deeper than that anyway.
> > > > >
> > > > > For PSCI OS-initiated mode, cpuidle doesn't know about the states that
> > > > > may be shared among a group of CPUs.
> > > > >
> > > > > Instead, those states are controlled through the PM domain topology by
> > > > > genpd and its governor, hence this is needed too.
> > > >
> > > > All right, but I'd like to understand how all of that works.
> > > >
> > > > So cpuidle selects a state to enter for the given CPU and then genpd
> > > > is invoked. It has to take the exit latency of that state into
> > > > account, so it doesn't go too deep. How does it do that?
> > >
> > > Depending on the state selected, in cpuidle-psci.c we may end up
> > > calling __psci_enter_domain_idle_state() (only for the deepest
> > > CPU-state).
> > >
> > > For s2idle this means we call dev_pm_genpd_suspend|resume(), to manage
> > > the reference counting of the PM domains via genpd. This then may lead
> > > to that genpd_sync_power_off() tries to select a state by calling the
> > > new governor function above.
> > >
> > > Did that make sense?
> >
> > So IIUC this will only happen if the deepest idle state is selected in
> > which case the cpu_wakeup_latency_qos_limit() value is greater than
> > the exit latency of that state, but it may still need to be taken into
> > account when selecting the domain state. However, this means that the
>
> Correct.
>
> > exit latency number for the deepest idle state is too low (it should
> > represent the worst-case exit latency which means the maximum domain
> > exit latency in this particular case).
>
> Yes, from the cpuidle state-selection point of view, but how is that a problem?
It is confusing. Otherwise, for s2idle, I guess it is not a big deal.
I guess what happens is that genpd has a range of states with
different latency values to choose from and it is not practical to
expose all of them as CPU idle states, so you end up exposing just one
of them with the lowest latency value to allow cpuidle to involve
genpd often enough.
If that's the case, I'd make a note of that somewhere if I were you,
or people will routinely get confused by it.
> If the genpd-governor doesn't find a suitable "domain-idle-state", we
> fall back to using the one cpuidle selected.
>
> >
> > Moreover, it looks like the "runtime" cpuidle has the same problem, doesn't it?
>
> It works in a very similar way, but I fail to understand why you think
> there is a problem.
There is a problem because it may violate a "runtime" latency constraint.
Say you expose 2 CPU idle states, a shallow one and a genpd one. The
advertised exit latency of the genpd state is X and the current
latency constraint is Y > X. The genpd state is selected and genpd
doesn't look at the cpuidle_governor_latency_req() return value, so it
chooses a real state with exit latency Z > Y.
At a minimum, genpd should be made aware of
cpuidle_governor_latency_req(), but even then cpuidle governors take
exit latency into consideration in their computations, so things may
get confused somewhat.
On Thu, 30 Oct 2025 at 19:11, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Thu, Oct 30, 2025 at 4:07 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >
> > On Thu, 30 Oct 2025 at 15:02, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > On Thu, Oct 30, 2025 at 1:32 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > >
> > > > On Thu, 30 Oct 2025 at 13:23, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > >
> > > > > On Thu, Oct 30, 2025 at 1:00 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > > >
> > > > > > On Thu, 30 Oct 2025 at 11:45, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > > > >
> > > > > > > On Thu, Oct 16, 2025 at 5:19 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > > > > >
> > > > > > > > A CPU system-wakeup QoS limit may have been requested by user-space. To
> > > > > > > > avoid breaking this constraint when entering a low-power state during
> > > > > > > > s2idle through genpd, let's extend the corresponding genpd governor for
> > > > > > > > CPUs. More precisely, during s2idle let the genpd governor select a
> > > > > > > > suitable low-power state, by taking into account the QoS limit.
> > > > > > > >
> > > > > > > > Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> > > > > > > > ---
> > > > > > > >
> > > > > > > > Changes in v2:
> > > > > > > > - Limit the change to the genpd governor for CPUs.
> > > > > > > >
> > > > > > > > ---
> > > > > > > > drivers/pmdomain/core.c | 10 ++++++++--
> > > > > > > > drivers/pmdomain/governor.c | 27 +++++++++++++++++++++++++++
> > > > > > > > include/linux/pm_domain.h | 1 +
> > > > > > > > 3 files changed, 36 insertions(+), 2 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/drivers/pmdomain/core.c b/drivers/pmdomain/core.c
> > > > > > > > index 61c2277c9ce3..4fd546ef0448 100644
> > > > > > > > --- a/drivers/pmdomain/core.c
> > > > > > > > +++ b/drivers/pmdomain/core.c
> > > > > > > > @@ -1425,8 +1425,14 @@ static void genpd_sync_power_off(struct generic_pm_domain *genpd, bool use_lock,
> > > > > > > > return;
> > > > > > > > }
> > > > > > > >
> > > > > > > > - /* Choose the deepest state when suspending */
> > > > > > > > - genpd->state_idx = genpd->state_count - 1;
> > > > > > > > + if (genpd->gov && genpd->gov->system_power_down_ok) {
> > > > > > > > + if (!genpd->gov->system_power_down_ok(&genpd->domain))
> > > > > > > > + return;
> > > > > > > > + } else {
> > > > > > > > + /* Default to the deepest state. */
> > > > > > > > + genpd->state_idx = genpd->state_count - 1;
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > if (_genpd_power_off(genpd, false)) {
> > > > > > > > genpd->states[genpd->state_idx].rejected++;
> > > > > > > > return;
> > > > > > > > diff --git a/drivers/pmdomain/governor.c b/drivers/pmdomain/governor.c
> > > > > > > > index 39359811a930..bd1b9d66d4a5 100644
> > > > > > > > --- a/drivers/pmdomain/governor.c
> > > > > > > > +++ b/drivers/pmdomain/governor.c
> > > > > > > > @@ -415,9 +415,36 @@ static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > > > > > > > return false;
> > > > > > > > }
> > > > > > > >
> > > > > > > > +static bool cpu_system_power_down_ok(struct dev_pm_domain *pd)
> > > > > > > > +{
> > > > > > > > + s64 constraint_ns = cpu_wakeup_latency_qos_limit() * NSEC_PER_USEC;
> > > > > > >
> > > > > > > I'm not sure why genpd needs to take cpu_wakeup_latency_qos_limit()
> > > > > > > into account directly.
> > > > > > >
> > > > > > > It should be told by cpuidle which state has been selected on the CPU
> > > > > > > side and it should not go any deeper than that anyway.
> > > > > >
> > > > > > For PSCI OS-initiated mode, cpuidle doesn't know about the states that
> > > > > > may be shared among a group of CPUs.
> > > > > >
> > > > > > Instead, those states are controlled through the PM domain topology by
> > > > > > genpd and its governor, hence this is needed too.
> > > > >
> > > > > All right, but I'd like to understand how all of that works.
> > > > >
> > > > > So cpuidle selects a state to enter for the given CPU and then genpd
> > > > > is invoked. It has to take the exit latency of that state into
> > > > > account, so it doesn't go too deep. How does it do that?
> > > >
> > > > Depending on the state selected, in cpuidle-psci.c we may end up
> > > > calling __psci_enter_domain_idle_state() (only for the deepest
> > > > CPU-state).
> > > >
> > > > For s2idle this means we call dev_pm_genpd_suspend|resume(), to manage
> > > > the reference counting of the PM domains via genpd. This then may lead
> > > > to that genpd_sync_power_off() tries to select a state by calling the
> > > > new governor function above.
> > > >
> > > > Did that make sense?
> > >
> > > So IIUC this will only happen if the deepest idle state is selected in
> > > which case the cpu_wakeup_latency_qos_limit() value is greater than
> > > the exit latency of that state, but it may still need to be taken into
> > > account when selecting the domain state. However, this means that the
> >
> > Correct.
> >
> > > exit latency number for the deepest idle state is too low (it should
> > > represent the worst-case exit latency which means the maximum domain
> > > exit latency in this particular case).
> >
> > Yes, from the cpuidle state-selection point of view, but how is that a problem?
>
> It is confusing. Otherwise, for s2idle, I guess it is not a big deal.
>
> I guess what happens is that genpd has a range of states with
> different latency values to choose from and it is not practical to
> expose all of them as CPU idle states, so you end up exposing just one
> of them with the lowest latency value to allow cpuidle to involve
> genpd often enough.
Yes, the states that are CPU specific are exposed to CPU-idle.
The states that are shared with other CPUs are managed by genpd,
because those need reference counting. Not even limited to CPUs.
>
> If that's the case, I'd make a note of that somewhere if I were you,
> or people will routinely get confused by it.
Documentation is always nice.
We have DT docs and the PSCI spec, but we lack proper documentation of
the whole genpd interface. Actually, I have started working on
documentation for genpd, but haven't reached the point of submitting a
patch for it.
>
> > If the genpd-governor doesn't find a suitable "domain-idle-state", we
> > fall back to using the one cpuidle selected.
> >
> > >
> > > Moreover, it looks like the "runtime" cpuidle has the same problem, doesn't it?
> >
> > It works in a very similar way, but I fail to understand why you think
> > there is a problem.
>
> There is a problem because it may violate a "runtime" latency constraint.
>
> Say you expose 2 CPU idle states, a shallow one and a genpd one. The
> advertised exit latency of the genpd state is X and the current
> latency constraint is Y > X. The genpd state is selected and genpd
> doesn't look at the cpuidle_governor_latency_req() return value, so it
> chooses a real state with exit latency Z > Y.
>
> At a minimum, genpd should be made aware of
> cpuidle_governor_latency_req(), but even then cpuidle governors take
> exit latency into consideration in their computations, so things may
> get confused somewhat.
Please have a look at cpu_power_down_ok(), which is the function that
runs to select the domain-idle-state. It does take the constraints
into account during runtime, even though it doesn't call
cpuidle_governor_latency_req() explicitly.
Kind regards
Uffe