[RFC/PATCH 1/3] PM: QoS: Introduce a system-wakeup QoS limit
Posted by Ulf Hansson 2 months, 3 weeks ago
Some platforms and devices support multiple low-power states that can be
used for system-wide suspend. Today these states are selected on a
per-subsystem basis and in most cases the deepest possible state is
selected.

For some use-cases this is a problem, as entering these deeper states
during system-wide suspend may be unsuitable or may even break the
system-wakeup latency constraint.

Therefore, let's introduce an interface for user-space, allowing it to
specify a system-wakeup QoS limit. Subsequent changes will start taking
the QoS limit into account.
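
For example, user-space could request a 2000 us limit through the new
character device with something like the sketch below (the node name
follows the miscdevice registered in this patch; error handling is
mostly omitted):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          /* The request stays active for as long as the fd is open. */
          int fd = open("/dev/system_wakeup_latency", O_WRONLY);

          if (fd < 0)
                  return 1;

          /* A write takes a raw s32 or an ASCII hex string (in us). */
          dprintf(fd, "%x", 2000);

          pause();        /* closing the fd drops the request */
          return 0;
  }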

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
---
 include/linux/pm_qos.h |   9 ++++
 kernel/power/qos.c     | 114 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 123 insertions(+)

diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h
index 4a69d4af3ff8..5f84084f19c8 100644
--- a/include/linux/pm_qos.h
+++ b/include/linux/pm_qos.h
@@ -143,6 +143,15 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf,
 			 struct pm_qos_flags_request *req,
 			 enum pm_qos_req_action action, s32 val);
 
+#ifdef CONFIG_PM_SLEEP
+s32 system_wakeup_latency_qos_limit(void);
+#else
+static inline s32 system_wakeup_latency_qos_limit(void)
+{
+	return PM_QOS_RESUME_LATENCY_NO_CONSTRAINT;
+}
+#endif
+
 #ifdef CONFIG_CPU_IDLE
 s32 cpu_latency_qos_limit(void);
 bool cpu_latency_qos_request_active(struct pm_qos_request *req);
diff --git a/kernel/power/qos.c b/kernel/power/qos.c
index 4244b069442e..fb496c220ffe 100644
--- a/kernel/power/qos.c
+++ b/kernel/power/qos.c
@@ -209,6 +209,120 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf,
 	return prev_value != curr_value;
 }
 
+#ifdef CONFIG_PM_SLEEP
+static struct pm_qos_constraints system_wakeup_latency_constraints = {
+	.list = PLIST_HEAD_INIT(system_wakeup_latency_constraints.list),
+	.target_value = PM_QOS_RESUME_LATENCY_NO_CONSTRAINT,
+	.default_value = PM_QOS_RESUME_LATENCY_NO_CONSTRAINT,
+	.no_constraint_value = PM_QOS_RESUME_LATENCY_NO_CONSTRAINT,
+	.type = PM_QOS_MIN,
+};
+
+/**
+ * system_wakeup_latency_qos_limit - Current system wakeup latency QoS limit.
+ *
+ * Returns the current system wakeup latency QoS limit that may have been
+ * requested by user-space.
+ */
+s32 system_wakeup_latency_qos_limit(void)
+{
+	return pm_qos_read_value(&system_wakeup_latency_constraints);
+}
+
+static int system_wakeup_latency_qos_open(struct inode *inode,
+					  struct file *filp)
+{
+	struct pm_qos_request *req;
+
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	req->qos = &system_wakeup_latency_constraints;
+	pm_qos_update_target(req->qos, &req->node, PM_QOS_ADD_REQ,
+			     PM_QOS_RESUME_LATENCY_NO_CONSTRAINT);
+	filp->private_data = req;
+
+	return 0;
+}
+
+static int system_wakeup_latency_qos_release(struct inode *inode,
+					     struct file *filp)
+{
+	struct pm_qos_request *req = filp->private_data;
+
+	filp->private_data = NULL;
+	pm_qos_update_target(req->qos, &req->node, PM_QOS_REMOVE_REQ,
+			     PM_QOS_RESUME_LATENCY_NO_CONSTRAINT);
+	kfree(req);
+
+	return 0;
+}
+
+static ssize_t system_wakeup_latency_qos_read(struct file *filp,
+					      char __user *buf,
+					      size_t count,
+					      loff_t *f_pos)
+{
+	s32 value = pm_qos_read_value(&system_wakeup_latency_constraints);
+
+	return simple_read_from_buffer(buf, count, f_pos, &value, sizeof(s32));
+}
+
+static ssize_t system_wakeup_latency_qos_write(struct file *filp,
+					       const char __user *buf,
+					       size_t count, loff_t *f_pos)
+{
+	struct pm_qos_request *req = filp->private_data;
+	s32 value;
+
+	if (count == sizeof(s32)) {
+		if (copy_from_user(&value, buf, sizeof(s32)))
+			return -EFAULT;
+	} else {
+		int ret;
+
+		ret = kstrtos32_from_user(buf, count, 16, &value);
+		if (ret)
+			return ret;
+	}
+
+	if (value < 0)
+		return -EINVAL;
+
+	pm_qos_update_target(req->qos, &req->node, PM_QOS_UPDATE_REQ, value);
+
+	return count;
+}
+
+static const struct file_operations system_wakeup_latency_qos_fops = {
+	.open = system_wakeup_latency_qos_open,
+	.release = system_wakeup_latency_qos_release,
+	.read = system_wakeup_latency_qos_read,
+	.write = system_wakeup_latency_qos_write,
+	.llseek = noop_llseek,
+};
+
+static struct miscdevice system_wakeup_latency_qos_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "system_wakeup_latency",
+	.fops = &system_wakeup_latency_qos_fops,
+};
+
+static int __init system_wakeup_latency_qos_init(void)
+{
+	int ret;
+
+	ret = misc_register(&system_wakeup_latency_qos_miscdev);
+	if (ret < 0)
+		pr_err("%s: %s setup failed\n", __func__,
+		       system_wakeup_latency_qos_miscdev.name);
+
+	return ret;
+}
+late_initcall(system_wakeup_latency_qos_init);
+#endif /* CONFIG_PM_SLEEP */
+
 #ifdef CONFIG_CPU_IDLE
 /* Definitions related to the CPU latency QoS. */
 
-- 
2.43.0
Re: [RFC/PATCH 1/3] PM: QoS: Introduce a system-wakeup QoS limit
Posted by Rafael J. Wysocki 2 months, 2 weeks ago
On Wed, Jul 16, 2025 at 2:33 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> Some platforms and devices supports multiple low-power-states than can be
> used for system-wide suspend. Today these states are selected on per
> subsystem basis and in most cases it's the deepest possible state that
> becomes selected.
>
> For some use-cases this is a problem as it isn't suitable or even breaks
> the system-wakeup latency constraint, when we decide to enter these deeper
> states during system-wide suspend.
>
> Therefore, let's introduce an interface for user-space, allowing us to
> specify the system-wakeup QoS limit. Subsequent changes will start taking
> into account the QoS limit.

Well, this is not really a system-wakeup limit, but a CPU idle state
latency limit for states entered in the last step of suspend-to-idle.

It looks like the problem is that the existing CPU latency QoS is not
taken into account by suspend-to-idle, so instead of adding an
entirely new interface to overcome this, would it make sense to add an
ioctl() to the existing one that would allow the user of it to
indicate that the given request should also be respected by
suspend-to-idle?

There are two basic reasons why I think so:
(1) The requests that you want to be respected by suspend-to-idle
should also be respected by the regular "runtime" idle, or at least I
don't see a reason why it wouldn't be the case.
(2) The new interface introduced by this patch basically duplicates
the existing one.

The flow related to this that I envision would be as follows:
(1) User space opens /dev/cpu_dma_latency and a single CPU latency QoS
request is created via cpu_latency_qos_add_request().
(2) User space calls a new ioctl() on the open file descriptor to
indicate that the request should also apply to suspend-to-idle.  A new
request is created with the same value and added to a new list of
constraints.  That new list of constraints will be used by
suspend-to-idle.
(3) Writing to the open file descriptor causes both requests to be updated.
(4) If user space does not want the request to apply to
suspend-to-idle any more, it can use another new ioctl() to achieve
that.  It would cause the second (suspend-to-idle) copy of the request
to be dropped.
(5) Closing the file descriptor causes both copies of the request to be dropped.
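
A rough user-space sketch of this flow could look as follows; note that
the two ioctl definitions below are invented placeholders for the
proposal above, nothing like them exists today:

  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <unistd.h>

  /* Hypothetical ioctls, for illustration only. */
  #define CPU_LATENCY_QOS_APPLY_TO_S2IDLE   _IO('q', 1)
  #define CPU_LATENCY_QOS_RUNTIME_ONLY      _IO('q', 2)

  int main(void)
  {
          int32_t value = 2000;   /* microseconds */
          int fd = open("/dev/cpu_dma_latency", O_WRONLY);        /* (1) */

          ioctl(fd, CPU_LATENCY_QOS_APPLY_TO_S2IDLE);     /* (2) */
          write(fd, &value, sizeof(value));               /* (3) updates both */
          ioctl(fd, CPU_LATENCY_QOS_RUNTIME_ONLY);        /* (4) drops the s2idle copy */
          close(fd);                                      /* (5) drops both requests */
          return 0;
  }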

> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
> ---
>  include/linux/pm_qos.h |   9 ++++
>  kernel/power/qos.c     | 114 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 123 insertions(+)
>
> diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h
> index 4a69d4af3ff8..5f84084f19c8 100644
> --- a/include/linux/pm_qos.h
> +++ b/include/linux/pm_qos.h
> @@ -143,6 +143,15 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf,
>                          struct pm_qos_flags_request *req,
>                          enum pm_qos_req_action action, s32 val);
>
> +#ifdef CONFIG_PM_SLEEP
> +s32 system_wakeup_latency_qos_limit(void);
> +#else
> +static inline s32 system_wakeup_latency_qos_limit(void)
> +{
> +       return PM_QOS_RESUME_LATENCY_NO_CONSTRAINT;
> +}
> +#endif
> +
>  #ifdef CONFIG_CPU_IDLE
>  s32 cpu_latency_qos_limit(void);
>  bool cpu_latency_qos_request_active(struct pm_qos_request *req);
> diff --git a/kernel/power/qos.c b/kernel/power/qos.c
> index 4244b069442e..fb496c220ffe 100644
> --- a/kernel/power/qos.c
> +++ b/kernel/power/qos.c
> @@ -209,6 +209,120 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf,
>         return prev_value != curr_value;
>  }
>
> +#ifdef CONFIG_PM_SLEEP
> +static struct pm_qos_constraints system_wakeup_latency_constraints = {
> +       .list = PLIST_HEAD_INIT(system_wakeup_latency_constraints.list),
> +       .target_value = PM_QOS_RESUME_LATENCY_NO_CONSTRAINT,
> +       .default_value = PM_QOS_RESUME_LATENCY_NO_CONSTRAINT,
> +       .no_constraint_value = PM_QOS_RESUME_LATENCY_NO_CONSTRAINT,
> +       .type = PM_QOS_MIN,
> +};
> +
> +/**
> + * system_wakeup_latency_qos_limit - Current system wakeup latency QoS limit.
> + *
> + * Returns the current system wakeup latency QoS limit that may have been
> + * requested by user-space.
> + */
> +s32 system_wakeup_latency_qos_limit(void)
> +{
> +       return pm_qos_read_value(&system_wakeup_latency_constraints);
> +}
> +
> +static int system_wakeup_latency_qos_open(struct inode *inode,
> +                                         struct file *filp)
> +{
> +       struct pm_qos_request *req;
> +
> +       req = kzalloc(sizeof(*req), GFP_KERNEL);
> +       if (!req)
> +               return -ENOMEM;
> +
> +       req->qos = &system_wakeup_latency_constraints;
> +       pm_qos_update_target(req->qos, &req->node, PM_QOS_ADD_REQ,
> +                            PM_QOS_RESUME_LATENCY_NO_CONSTRAINT);
> +       filp->private_data = req;
> +
> +       return 0;
> +}
> +
> +static int system_wakeup_latency_qos_release(struct inode *inode,
> +                                            struct file *filp)
> +{
> +       struct pm_qos_request *req = filp->private_data;
> +
> +       filp->private_data = NULL;
> +       pm_qos_update_target(req->qos, &req->node, PM_QOS_REMOVE_REQ,
> +                            PM_QOS_RESUME_LATENCY_NO_CONSTRAINT);
> +       kfree(req);
> +
> +       return 0;
> +}
> +
> +static ssize_t system_wakeup_latency_qos_read(struct file *filp,
> +                                             char __user *buf,
> +                                             size_t count,
> +                                             loff_t *f_pos)
> +{
> +       s32 value = pm_qos_read_value(&system_wakeup_latency_constraints);
> +
> +       return simple_read_from_buffer(buf, count, f_pos, &value, sizeof(s32));
> +}
> +
> +static ssize_t system_wakeup_latency_qos_write(struct file *filp,
> +                                              const char __user *buf,
> +                                              size_t count, loff_t *f_pos)
> +{
> +       struct pm_qos_request *req = filp->private_data;
> +       s32 value;
> +
> +       if (count == sizeof(s32)) {
> +               if (copy_from_user(&value, buf, sizeof(s32)))
> +                       return -EFAULT;
> +       } else {
> +               int ret;
> +
> +               ret = kstrtos32_from_user(buf, count, 16, &value);
> +               if (ret)
> +                       return ret;
> +       }
> +
> +       if (value < 0)
> +               return -EINVAL;
> +
> +       pm_qos_update_target(req->qos, &req->node, PM_QOS_UPDATE_REQ, value);
> +
> +       return count;
> +}
> +
> +static const struct file_operations system_wakeup_latency_qos_fops = {
> +       .open = system_wakeup_latency_qos_open,
> +       .release = system_wakeup_latency_qos_release,
> +       .read = system_wakeup_latency_qos_read,
> +       .write = system_wakeup_latency_qos_write,
> +       .llseek = noop_llseek,
> +};
> +
> +static struct miscdevice system_wakeup_latency_qos_miscdev = {
> +       .minor = MISC_DYNAMIC_MINOR,
> +       .name = "system_wakeup_latency",
> +       .fops = &system_wakeup_latency_qos_fops,
> +};
> +
> +static int __init system_wakeup_latency_qos_init(void)
> +{
> +       int ret;
> +
> +       ret = misc_register(&system_wakeup_latency_qos_miscdev);
> +       if (ret < 0)
> +               pr_err("%s: %s setup failed\n", __func__,
> +                      system_wakeup_latency_qos_miscdev.name);
> +
> +       return ret;
> +}
> +late_initcall(system_wakeup_latency_qos_init);
> +#endif /* CONFIG_PM_SLEEP */
> +
>  #ifdef CONFIG_CPU_IDLE
>  /* Definitions related to the CPU latency QoS. */
>
> --
Re: [RFC/PATCH 1/3] PM: QoS: Introduce a system-wakeup QoS limit
Posted by Kevin Hilman 1 month, 3 weeks ago
"Rafael J. Wysocki" <rafael@kernel.org> writes:

> On Wed, Jul 16, 2025 at 2:33 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>>
>> Some platforms and devices supports multiple low-power-states than can be
>> used for system-wide suspend. Today these states are selected on per
>> subsystem basis and in most cases it's the deepest possible state that
>> becomes selected.
>>
>> For some use-cases this is a problem as it isn't suitable or even breaks
>> the system-wakeup latency constraint, when we decide to enter these deeper
>> states during system-wide suspend.
>>
>> Therefore, let's introduce an interface for user-space, allowing us to
>> specify the system-wakeup QoS limit. Subsequent changes will start taking
>> into account the QoS limit.
>
> Well, this is not really a system-wakeup limit, but a CPU idle state
> latency limit for states entered in the last step of suspend-to-idle.
>
> It looks like the problem is that the existing CPU latency QoS is not
> taken into account by suspend-to-idle, so instead of adding an
> entirely new interface to overcome this, would it make sense to add an
> ioctl() to the existing one that would allow the user of it to
> indicate that the given request should also be respected by
> suspend-to-idle?
>
> There are two basic reasons why I think so:
> (1) The requests that you want to be respected by suspend-to-idle
> should also be respected by the regular "runtime" idle, or at least I
> don't see a reason why it wouldn't be the case.
> (2) The new interface introduced by this patch basically duplicates
> the existing one.

I also think that just using the existing /dev/cpu_dma_latency is the
right approach here, and simply teaching s2idle to respect this value.

I'm curious about the need for a new ioctl() though.  Under what
conditions do you want normal/runtime CPUidle to respect this value and
s2idle to not respect this value?

Kevin
Re: [RFC/PATCH 1/3] PM: QoS: Introduce a system-wakeup QoS limit
Posted by Rafael J. Wysocki 1 month, 3 weeks ago
On Mon, Aug 11, 2025 at 7:16 PM Kevin Hilman <khilman@baylibre.com> wrote:
>
> "Rafael J. Wysocki" <rafael@kernel.org> writes:
>
> > On Wed, Jul 16, 2025 at 2:33 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >>
> >> Some platforms and devices supports multiple low-power-states than can be
> >> used for system-wide suspend. Today these states are selected on per
> >> subsystem basis and in most cases it's the deepest possible state that
> >> becomes selected.
> >>
> >> For some use-cases this is a problem as it isn't suitable or even breaks
> >> the system-wakeup latency constraint, when we decide to enter these deeper
> >> states during system-wide suspend.
> >>
> >> Therefore, let's introduce an interface for user-space, allowing us to
> >> specify the system-wakeup QoS limit. Subsequent changes will start taking
> >> into account the QoS limit.
> >
> > Well, this is not really a system-wakeup limit, but a CPU idle state
> > latency limit for states entered in the last step of suspend-to-idle.
> >
> > It looks like the problem is that the existing CPU latency QoS is not
> > taken into account by suspend-to-idle, so instead of adding an
> > entirely new interface to overcome this, would it make sense to add an
> > ioctl() to the existing one that would allow the user of it to
> > indicate that the given request should also be respected by
> > suspend-to-idle?
> >
> > There are two basic reasons why I think so:
> > (1) The requests that you want to be respected by suspend-to-idle
> > should also be respected by the regular "runtime" idle, or at least I
> > don't see a reason why it wouldn't be the case.
> > (2) The new interface introduced by this patch basically duplicates
> > the existing one.
>
> I also think that just using the existing /dev/cpu_dma_latency is the
> right approach here, and simply teaching s2idle to respect this value.
>
> I'm curious about the need for a new ioctl() though.  Under what
> conditions do you want normal/runtime CPUidle to respect this value and
> s2idle to not respect this value?

In a typical PC environment s2idle is a replacement for ACPI S3, which
does not take any QoS constraints into account, so users may want to
set QoS limits for run-time and then suspend with the expectation that
QoS will not affect it.
Re: [RFC/PATCH 1/3] PM: QoS: Introduce a system-wakeup QoS limit
Posted by Ulf Hansson 1 month, 3 weeks ago
On Mon, 11 Aug 2025 at 21:16, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Mon, Aug 11, 2025 at 7:16 PM Kevin Hilman <khilman@baylibre.com> wrote:
> >
> > "Rafael J. Wysocki" <rafael@kernel.org> writes:
> >
> > > On Wed, Jul 16, 2025 at 2:33 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > >>
> > >> Some platforms and devices supports multiple low-power-states than can be
> > >> used for system-wide suspend. Today these states are selected on per
> > >> subsystem basis and in most cases it's the deepest possible state that
> > >> becomes selected.
> > >>
> > >> For some use-cases this is a problem as it isn't suitable or even breaks
> > >> the system-wakeup latency constraint, when we decide to enter these deeper
> > >> states during system-wide suspend.
> > >>
> > >> Therefore, let's introduce an interface for user-space, allowing us to
> > >> specify the system-wakeup QoS limit. Subsequent changes will start taking
> > >> into account the QoS limit.
> > >
> > > Well, this is not really a system-wakeup limit, but a CPU idle state
> > > latency limit for states entered in the last step of suspend-to-idle.
> > >
> > > It looks like the problem is that the existing CPU latency QoS is not
> > > taken into account by suspend-to-idle, so instead of adding an
> > > entirely new interface to overcome this, would it make sense to add an
> > > ioctl() to the existing one that would allow the user of it to
> > > indicate that the given request should also be respected by
> > > suspend-to-idle?
> > >
> > > There are two basic reasons why I think so:
> > > (1) The requests that you want to be respected by suspend-to-idle
> > > should also be respected by the regular "runtime" idle, or at least I
> > > don't see a reason why it wouldn't be the case.
> > > (2) The new interface introduced by this patch basically duplicates
> > > the existing one.
> >
> > I also think that just using the existing /dev/cpu_dma_latency is the
> > right approach here, and simply teaching s2idle to respect this value.
> >
> > I'm curious about the need for a new ioctl() though.  Under what
> > conditions do you want normal/runtime CPUidle to respect this value and
> > s2idle to not respect this value?
>
> In a typical PC environment s2idle is a replacement for ACPI S3 which
> does not take any QoS constraints into account, so users may want to
> set QoS limits for run-time and then suspend with the expectation that
> QoS will not affect it.

Yes, I agree. To me, these are orthogonal use-cases which could have
different wakeup latency constraints.

Adding an ioctl for /dev/cpu_dma_latency, as suggested by Rafael, would
allow this to be managed, I think.

Although, I am not fully convinced yet that re-using
/dev/cpu_dma_latency is the right path. The main reason is that I
don't want us to limit the use-case to CPU latencies, but rather allow
the QoS constraint to be system-wide for any type of device. For
example, it could be used by storage drivers too (like NVMe, UFS,
eMMC), as a way to understand what low-power state to pick for
system-wide suspend. If you have a closer look at patch2 [1], I suggest
we extend the genpd-governor for *both* CPU-cluster-PM-domains and for
other PM-domains too.

Interested to hear your thoughts around this.
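
As a rough illustration of what I mean (my sketch, not part of this
series), a storage driver's system-suspend path could consult the new
helper to pick a low-power state; the state table and the function here
are made up:

  struct foo_lp_state {
          u32 cmd;                /* controller-specific state id */
          s32 exit_latency_us;    /* worst-case resume cost */
  };

  /* states[] is assumed to be ordered from deepest to shallowest. */
  static int foo_pick_suspend_state(const struct foo_lp_state *states,
                                    int num_states)
  {
          s32 limit = system_wakeup_latency_qos_limit();
          int i;

          for (i = 0; i < num_states; i++) {
                  if (states[i].exit_latency_us <= limit)
                          return i;
          }

          /* No state fits, fall back to the shallowest one. */
          return num_states - 1;
  }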

Kind regards
Uffe

[1]
https://lore.kernel.org/all/20250716123323.65441-3-ulf.hansson@linaro.org/
Re: [RFC/PATCH 1/3] PM: QoS: Introduce a system-wakeup QoS limit
Posted by Ulf Hansson 3 weeks, 3 days ago
On Tue, 12 Aug 2025 at 11:26, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Mon, 11 Aug 2025 at 21:16, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Mon, Aug 11, 2025 at 7:16 PM Kevin Hilman <khilman@baylibre.com> wrote:
> > >
> > > "Rafael J. Wysocki" <rafael@kernel.org> writes:
> > >
> > > > On Wed, Jul 16, 2025 at 2:33 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > >>
> > > >> Some platforms and devices supports multiple low-power-states than can be
> > > >> used for system-wide suspend. Today these states are selected on per
> > > >> subsystem basis and in most cases it's the deepest possible state that
> > > >> becomes selected.
> > > >>
> > > >> For some use-cases this is a problem as it isn't suitable or even breaks
> > > >> the system-wakeup latency constraint, when we decide to enter these deeper
> > > >> states during system-wide suspend.
> > > >>
> > > >> Therefore, let's introduce an interface for user-space, allowing us to
> > > >> specify the system-wakeup QoS limit. Subsequent changes will start taking
> > > >> into account the QoS limit.
> > > >
> > > > Well, this is not really a system-wakeup limit, but a CPU idle state
> > > > latency limit for states entered in the last step of suspend-to-idle.
> > > >
> > > > It looks like the problem is that the existing CPU latency QoS is not
> > > > taken into account by suspend-to-idle, so instead of adding an
> > > > entirely new interface to overcome this, would it make sense to add an
> > > > ioctl() to the existing one that would allow the user of it to
> > > > indicate that the given request should also be respected by
> > > > suspend-to-idle?
> > > >
> > > > There are two basic reasons why I think so:
> > > > (1) The requests that you want to be respected by suspend-to-idle
> > > > should also be respected by the regular "runtime" idle, or at least I
> > > > don't see a reason why it wouldn't be the case.
> > > > (2) The new interface introduced by this patch basically duplicates
> > > > the existing one.
> > >
> > > I also think that just using the existing /dev/cpu_dma_latency is the
> > > right approach here, and simply teaching s2idle to respect this value.
> > >
> > > I'm curious about the need for a new ioctl() though.  Under what
> > > conditions do you want normal/runtime CPUidle to respect this value and
> > > s2idle to not respect this value?
> >
> > In a typical PC environment s2idle is a replacement for ACPI S3 which
> > does not take any QoS constraints into account, so users may want to
> > set QoS limits for run-time and then suspend with the expectation that
> > QoS will not affect it.
>
> Yes, I agree. To me, these are orthogonal use-cases which could have
> different wakeup latency constraints.
>
> Adding an ioctl for /dev/cpu_dma_latency, as suggested by Rafael would
> allow this to be managed, I think.
>
> Although, I am not fully convinced yet that re-using
> /dev/cpu_dma_latency is the right path. The main reason is that I
> don't want us to limit the use-case to CPU latencies, but rather allow
> the QoS constraint to be system-wide for any type of device. For
> example, it could be used by storage drivers too (like NVMe, UFS,
> eMMC), as a way to understand what low power state to pick as system
> wide suspend. If you have a closer look at patch2 [1] , I suggest we
> extend the genpd-governor for *both* CPU-cluster-PM-domains and for
> other PM-domains too.
>
> Interested to hear your thoughts around this.

Hey, just wanted to see if you have managed to digest this and have
any further comments?

I can totally implement the ioctl approach, but just wanted to make
sure that is the direction you think we should take.

Kind regards
Uffe

>
> [1]
> https://lore.kernel.org/all/20250716123323.65441-3-ulf.hansson@linaro.org/
Re: [RFC/PATCH 1/3] PM: QoS: Introduce a system-wakeup QoS limit
Posted by Rafael J. Wysocki 2 weeks, 5 days ago
Hi,

Sorry for the delay.

On Fri, Sep 12, 2025 at 3:58 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Tue, 12 Aug 2025 at 11:26, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >
> > On Mon, 11 Aug 2025 at 21:16, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > On Mon, Aug 11, 2025 at 7:16 PM Kevin Hilman <khilman@baylibre.com> wrote:
> > > >
> > > > "Rafael J. Wysocki" <rafael@kernel.org> writes:
> > > >
> > > > > On Wed, Jul 16, 2025 at 2:33 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > >>
> > > > >> Some platforms and devices supports multiple low-power-states than can be
> > > > >> used for system-wide suspend. Today these states are selected on per
> > > > >> subsystem basis and in most cases it's the deepest possible state that
> > > > >> becomes selected.
> > > > >>
> > > > >> For some use-cases this is a problem as it isn't suitable or even breaks
> > > > >> the system-wakeup latency constraint, when we decide to enter these deeper
> > > > >> states during system-wide suspend.
> > > > >>
> > > > >> Therefore, let's introduce an interface for user-space, allowing us to
> > > > >> specify the system-wakeup QoS limit. Subsequent changes will start taking
> > > > >> into account the QoS limit.
> > > > >
> > > > > Well, this is not really a system-wakeup limit, but a CPU idle state
> > > > > latency limit for states entered in the last step of suspend-to-idle.
> > > > >
> > > > > It looks like the problem is that the existing CPU latency QoS is not
> > > > > taken into account by suspend-to-idle, so instead of adding an
> > > > > entirely new interface to overcome this, would it make sense to add an
> > > > > ioctl() to the existing one that would allow the user of it to
> > > > > indicate that the given request should also be respected by
> > > > > suspend-to-idle?
> > > > >
> > > > > There are two basic reasons why I think so:
> > > > > (1) The requests that you want to be respected by suspend-to-idle
> > > > > should also be respected by the regular "runtime" idle, or at least I
> > > > > don't see a reason why it wouldn't be the case.
> > > > > (2) The new interface introduced by this patch basically duplicates
> > > > > the existing one.
> > > >
> > > > I also think that just using the existing /dev/cpu_dma_latency is the
> > > > right approach here, and simply teaching s2idle to respect this value.
> > > >
> > > > I'm curious about the need for a new ioctl() though.  Under what
> > > > conditions do you want normal/runtime CPUidle to respect this value and
> > > > s2idle to not respect this value?
> > >
> > > In a typical PC environment s2idle is a replacement for ACPI S3 which
> > > does not take any QoS constraints into account, so users may want to
> > > set QoS limits for run-time and then suspend with the expectation that
> > > QoS will not affect it.
> >
> > Yes, I agree. To me, these are orthogonal use-cases which could have
> > different wakeup latency constraints.
> >
> > Adding an ioctl for /dev/cpu_dma_latency, as suggested by Rafael would
> > allow this to be managed, I think.
> >
> > Although, I am not fully convinced yet that re-using
> > /dev/cpu_dma_latency is the right path. The main reason is that I
> > don't want us to limit the use-case to CPU latencies, but rather allow
> > the QoS constraint to be system-wide for any type of device. For
> > example, it could be used by storage drivers too (like NVMe, UFS,
> > eMMC), as a way to understand what low power state to pick as system
> > wide suspend. If you have a closer look at patch2 [1] , I suggest we
> > extend the genpd-governor for *both* CPU-cluster-PM-domains and for
> > other PM-domains too.
> >
> > Interested to hear your thoughts around this.
>
> Hey, just wanted to see if you have managed to digest this and have
> any possible further comment?

The reason why I thought about reusing /dev/cpu_dma_latency is that I
think the s2idle limit should also apply to cpuidle.  Of course,
cpuidle may be limited further, but IMV it should observe the limit set
for system suspend (it would be kind of inconsistent to allow cpuidle
to use deeper idle states than can be used by s2idle).

I also don't think that having a per-CPU s2idle limit would be
particularly useful (and it might be problematic).

Now, it is not as straightforward as I thought, because someone may
want to set a more restrictive limit on cpuidle, in which case they
would need to open the same special device file twice etc. and that
would be quite cumbersome.

So in the end I think that what you did in the $subject patch is
better, but I still would like it to also affect cpuidle.
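
In code terms I'd imagine something as simple as taking the tighter of
the two aggregated values, along these lines (a sketch only, assuming
both limits are expressed in microseconds):

  /* Sketch: let idle state selection honor both the runtime CPU
   * latency QoS and the system-wakeup limit, whichever is stricter. */
  static s32 effective_idle_latency_limit(void)
  {
          return min(cpu_latency_qos_limit(),
                     system_wakeup_latency_qos_limit());
  }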

And it needs to be made clear that this is a limit on the resume
latency of one device.  Worst case, the system wakeup latency may be a
sum of those limits if the devices in question are resumed
sequentially, so in fact this is a limit on the contribution of a
given device to the system wakeup latency.
Re: [RFC/PATCH 1/3] PM: QoS: Introduce a system-wakeup QoS limit
Posted by Ulf Hansson 2 weeks, 4 days ago
On Wed, 17 Sept 2025 at 21:24, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> Hi,
>
> Sorry for the delay.
>
> On Fri, Sep 12, 2025 at 3:58 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >
> > On Tue, 12 Aug 2025 at 11:26, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > >
> > > On Mon, 11 Aug 2025 at 21:16, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > >
> > > > On Mon, Aug 11, 2025 at 7:16 PM Kevin Hilman <khilman@baylibre.com> wrote:
> > > > >
> > > > > "Rafael J. Wysocki" <rafael@kernel.org> writes:
> > > > >
> > > > > > On Wed, Jul 16, 2025 at 2:33 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > > >>
> > > > > >> Some platforms and devices supports multiple low-power-states than can be
> > > > > >> used for system-wide suspend. Today these states are selected on per
> > > > > >> subsystem basis and in most cases it's the deepest possible state that
> > > > > >> becomes selected.
> > > > > >>
> > > > > >> For some use-cases this is a problem as it isn't suitable or even breaks
> > > > > >> the system-wakeup latency constraint, when we decide to enter these deeper
> > > > > >> states during system-wide suspend.
> > > > > >>
> > > > > >> Therefore, let's introduce an interface for user-space, allowing us to
> > > > > >> specify the system-wakeup QoS limit. Subsequent changes will start taking
> > > > > >> into account the QoS limit.
> > > > > >
> > > > > > Well, this is not really a system-wakeup limit, but a CPU idle state
> > > > > > latency limit for states entered in the last step of suspend-to-idle.
> > > > > >
> > > > > > It looks like the problem is that the existing CPU latency QoS is not
> > > > > > taken into account by suspend-to-idle, so instead of adding an
> > > > > > entirely new interface to overcome this, would it make sense to add an
> > > > > > ioctl() to the existing one that would allow the user of it to
> > > > > > indicate that the given request should also be respected by
> > > > > > suspend-to-idle?
> > > > > >
> > > > > > There are two basic reasons why I think so:
> > > > > > (1) The requests that you want to be respected by suspend-to-idle
> > > > > > should also be respected by the regular "runtime" idle, or at least I
> > > > > > don't see a reason why it wouldn't be the case.
> > > > > > (2) The new interface introduced by this patch basically duplicates
> > > > > > the existing one.
> > > > >
> > > > > I also think that just using the existing /dev/cpu_dma_latency is the
> > > > > right approach here, and simply teaching s2idle to respect this value.
> > > > >
> > > > > I'm curious about the need for a new ioctl() though.  Under what
> > > > > conditions do you want normal/runtime CPUidle to respect this value and
> > > > > s2idle to not respect this value?
> > > >
> > > > In a typical PC environment s2idle is a replacement for ACPI S3 which
> > > > does not take any QoS constraints into account, so users may want to
> > > > set QoS limits for run-time and then suspend with the expectation that
> > > > QoS will not affect it.
> > >
> > > Yes, I agree. To me, these are orthogonal use-cases which could have
> > > different wakeup latency constraints.
> > >
> > > Adding an ioctl for /dev/cpu_dma_latency, as suggested by Rafael would
> > > allow this to be managed, I think.
> > >
> > > Although, I am not fully convinced yet that re-using
> > > /dev/cpu_dma_latency is the right path. The main reason is that I
> > > don't want us to limit the use-case to CPU latencies, but rather allow
> > > the QoS constraint to be system-wide for any type of device. For
> > > example, it could be used by storage drivers too (like NVMe, UFS,
> > > eMMC), as a way to understand what low power state to pick as system
> > > wide suspend. If you have a closer look at patch2 [1] , I suggest we
> > > extend the genpd-governor for *both* CPU-cluster-PM-domains and for
> > > other PM-domains too.
> > >
> > > Interested to hear your thoughts around this.
> >
> > Hey, just wanted to see if you have managed to digest this and have
> > any possible further comment?
>
> The reason why I thought about reusing /dev/cpu_dma_latency is because
> I think that the s2idle limit should also apply to cpuidle.  Of
> course, cpuidle may be limited further, but IMV it should observe the
> limit set on system suspend (it would be kind of inconsistent to allow
> cpuidle to use deeper idle states than can be used by s2idle).

Agreed!

>
> I also don't think that having a per-CPU s2idle limit would be
> particularly useful (and it might be problematic).
>
> Now, it is not as straightforward as I thought because someone may
> want to set a more restrictive limit on cpuidle, in which case they
> would need to open the same special device file twice etc and that
> would be quite cumbersome.
>
> So in the end I think that what you did in the $subject patch is
> better, but I still would like it to also affect cpuidle.

Okay. I will update the patches according to your suggestions!

>
> And it needs to be made clear that this is a limit on the resume
> latency of one device.  Worst case, the system wakeup latency may be a
> sum of those limits if the devices in question are resumed
> sequentially, so in fact this is a limit on the contribution of a
> given device to the system wakeup latency.

Indeed, that's a very good point! I will keep this in mind when
working on adding the documentation part.

Again, thanks a lot for your feedback!

Kind regards
Uffe
Re: [RFC/PATCH 1/3] PM: QoS: Introduce a system-wakeup QoS limit
Posted by Rafael J. Wysocki 2 weeks ago
On Thu, Sep 18, 2025 at 5:34 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Wed, 17 Sept 2025 at 21:24, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > Hi,
> >
> > Sorry for the delay.
> >
> > On Fri, Sep 12, 2025 at 3:58 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > >
> > > On Tue, 12 Aug 2025 at 11:26, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > >
> > > > On Mon, 11 Aug 2025 at 21:16, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > >
> > > > > On Mon, Aug 11, 2025 at 7:16 PM Kevin Hilman <khilman@baylibre.com> wrote:
> > > > > >
> > > > > > "Rafael J. Wysocki" <rafael@kernel.org> writes:
> > > > > >
> > > > > > > On Wed, Jul 16, 2025 at 2:33 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > > > >>
> > > > > > >> Some platforms and devices supports multiple low-power-states than can be
> > > > > > >> used for system-wide suspend. Today these states are selected on per
> > > > > > >> subsystem basis and in most cases it's the deepest possible state that
> > > > > > >> becomes selected.
> > > > > > >>
> > > > > > >> For some use-cases this is a problem as it isn't suitable or even breaks
> > > > > > >> the system-wakeup latency constraint, when we decide to enter these deeper
> > > > > > >> states during system-wide suspend.
> > > > > > >>
> > > > > > >> Therefore, let's introduce an interface for user-space, allowing us to
> > > > > > >> specify the system-wakeup QoS limit. Subsequent changes will start taking
> > > > > > >> into account the QoS limit.
> > > > > > >
> > > > > > > Well, this is not really a system-wakeup limit, but a CPU idle state
> > > > > > > latency limit for states entered in the last step of suspend-to-idle.
> > > > > > >
> > > > > > > It looks like the problem is that the existing CPU latency QoS is not
> > > > > > > taken into account by suspend-to-idle, so instead of adding an
> > > > > > > entirely new interface to overcome this, would it make sense to add an
> > > > > > > ioctl() to the existing one that would allow the user of it to
> > > > > > > indicate that the given request should also be respected by
> > > > > > > suspend-to-idle?
> > > > > > >
> > > > > > > There are two basic reasons why I think so:
> > > > > > > (1) The requests that you want to be respected by suspend-to-idle
> > > > > > > should also be respected by the regular "runtime" idle, or at least I
> > > > > > > don't see a reason why it wouldn't be the case.
> > > > > > > (2) The new interface introduced by this patch basically duplicates
> > > > > > > the existing one.
> > > > > >
> > > > > > I also think that just using the existing /dev/cpu_dma_latency is the
> > > > > > right approach here, and simply teaching s2idle to respect this value.
> > > > > >
> > > > > > I'm curious about the need for a new ioctl() though.  Under what
> > > > > > conditions do you want normal/runtime CPUidle to respect this value and
> > > > > > s2idle to not respect this value?
> > > > >
> > > > > In a typical PC environment s2idle is a replacement for ACPI S3 which
> > > > > does not take any QoS constraints into account, so users may want to
> > > > > set QoS limits for run-time and then suspend with the expectation that
> > > > > QoS will not affect it.
> > > >
> > > > Yes, I agree. To me, these are orthogonal use-cases which could have
> > > > different wakeup latency constraints.
> > > >
> > > > Adding an ioctl for /dev/cpu_dma_latency, as suggested by Rafael would
> > > > allow this to be managed, I think.
> > > >
> > > > Although, I am not fully convinced yet that re-using
> > > > /dev/cpu_dma_latency is the right path. The main reason is that I
> > > > don't want us to limit the use-case to CPU latencies, but rather allow
> > > > the QoS constraint to be system-wide for any type of device. For
> > > > example, it could be used by storage drivers too (like NVMe, UFS,
> > > > eMMC), as a way to understand what low power state to pick as system
> > > > wide suspend. If you have a closer look at patch2 [1] , I suggest we
> > > > extend the genpd-governor for *both* CPU-cluster-PM-domains and for
> > > > other PM-domains too.
> > > >
> > > > Interested to hear your thoughts around this.
> > >
> > > Hey, just wanted to see if you have managed to digest this and have
> > > any possible further comment?
> >
> > The reason why I thought about reusing /dev/cpu_dma_latency is because
> > I think that the s2idle limit should also apply to cpuidle.  Of
> > course, cpuidle may be limited further, but IMV it should observe the
> > limit set on system suspend (it would be kind of inconsistent to allow
> > cpuidle to use deeper idle states than can be used by s2idle).
>
> Agreed!
>
> >
> > I also don't think that having a per-CPU s2idle limit would be
> > particularly useful (and it might be problematic).
> >
> > Now, it is not as straightforward as I thought because someone may
> > want to set a more restrictive limit on cpuidle, in which case they
> > would need to open the same special device file twice etc and that
> > would be quite cumbersome.
> >
> > So in the end I think that what you did in the $subject patch is
> > better, but I still would like it to also affect cpuidle.
>
> Okay. I will update the patches according to your suggestions!
>
> >
> > And it needs to be made clear that this is a limit on the resume
> > latency of one device.  Worst case, the system wakeup latency may be a
> > sum of those limits if the devices in question are resumed
> > sequentially, so in fact this is a limit on the contribution of a
> > given device to the system wakeup latency.
>
> Indeed, that's a very good point! I will keep this in mind when
> working on adding the documentation part.

Well, this also means that using one limit for all of the different
devices is not likely to be very practical because the goal is to save
as much energy as reasonably possible in system suspend while
respecting a global resume latency constraint at the same time.

Using the same limit on a local contribution from each device to the
combined latency is not likely to be effective here.  Rather, I'd
expect that the best results can be achieved by setting different
resume latency limits on different devices, depending on how much
power they draw in each of their idle states and what the exit latency
values for all of those states are.  In other words, this appears to
be an optimization problem in which the resume latency limits for
individual devices need to be chosen to satisfy the global resume
latency constraint and minimize the total system power.
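
Spelled out (my rough formulation), with s_i the idle state chosen for
device i, P_i(s_i) its power draw in that state and L_i(s_i) its exit
latency, that would be:

  minimize    sum_i P_i(s_i)
  subject to  sum_i L_i(s_i) <= L_wakeup

where L_wakeup is the global resume latency constraint and the sum in
the second line reflects the worst case of strictly sequential resume.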

> Again, thanks a lot for your feedback!

Anytime!
Re: [RFC/PATCH 1/3] PM: QoS: Introduce a system-wakeup QoS limit
Posted by Ulf Hansson 1 week, 6 days ago
On Mon, 22 Sept 2025 at 20:55, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Thu, Sep 18, 2025 at 5:34 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> >
> > On Wed, 17 Sept 2025 at 21:24, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > Hi,
> > >
> > > Sorry for the delay.
> > >
> > > On Fri, Sep 12, 2025 at 3:58 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > >
> > > > On Tue, 12 Aug 2025 at 11:26, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > >
> > > > > On Mon, 11 Aug 2025 at 21:16, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > > >
> > > > > > On Mon, Aug 11, 2025 at 7:16 PM Kevin Hilman <khilman@baylibre.com> wrote:
> > > > > > >
> > > > > > > "Rafael J. Wysocki" <rafael@kernel.org> writes:
> > > > > > >
> > > > > > > > On Wed, Jul 16, 2025 at 2:33 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > > > > >>
> > > > > > > >> Some platforms and devices supports multiple low-power-states than can be
> > > > > > > >> used for system-wide suspend. Today these states are selected on per
> > > > > > > >> subsystem basis and in most cases it's the deepest possible state that
> > > > > > > >> becomes selected.
> > > > > > > >>
> > > > > > > >> For some use-cases this is a problem as it isn't suitable or even breaks
> > > > > > > >> the system-wakeup latency constraint, when we decide to enter these deeper
> > > > > > > >> states during system-wide suspend.
> > > > > > > >>
> > > > > > > >> Therefore, let's introduce an interface for user-space, allowing us to
> > > > > > > >> specify the system-wakeup QoS limit. Subsequent changes will start taking
> > > > > > > >> into account the QoS limit.
> > > > > > > >
> > > > > > > > Well, this is not really a system-wakeup limit, but a CPU idle state
> > > > > > > > latency limit for states entered in the last step of suspend-to-idle.
> > > > > > > >
> > > > > > > > It looks like the problem is that the existing CPU latency QoS is not
> > > > > > > > taken into account by suspend-to-idle, so instead of adding an
> > > > > > > > entirely new interface to overcome this, would it make sense to add an
> > > > > > > > ioctl() to the existing one that would allow the user of it to
> > > > > > > > indicate that the given request should also be respected by
> > > > > > > > suspend-to-idle?
> > > > > > > >
> > > > > > > > There are two basic reasons why I think so:
> > > > > > > > (1) The requests that you want to be respected by suspend-to-idle
> > > > > > > > should also be respected by the regular "runtime" idle, or at least I
> > > > > > > > don't see a reason why it wouldn't be the case.
> > > > > > > > (2) The new interface introduced by this patch basically duplicates
> > > > > > > > the existing one.
> > > > > > >
> > > > > > > I also think that just using the existing /dev/cpu_dma_latency is the
> > > > > > > right approach here, and simply teaching s2idle to respect this value.
> > > > > > >
> > > > > > > I'm curious about the need for a new ioctl() though.  Under what
> > > > > > > conditions do you want normal/runtime CPUidle to respect this value and
> > > > > > > s2idle to not respect this value?
> > > > > >
> > > > > > In a typical PC environment s2idle is a replacement for ACPI S3 which
> > > > > > does not take any QoS constraints into account, so users may want to
> > > > > > set QoS limits for run-time and then suspend with the expectation that
> > > > > > QoS will not affect it.
> > > > >
> > > > > Yes, I agree. To me, these are orthogonal use-cases which could have
> > > > > different wakeup latency constraints.
> > > > >
> > > > > Adding an ioctl for /dev/cpu_dma_latency, as suggested by Rafael would
> > > > > allow this to be managed, I think.
> > > > >
> > > > > Although, I am not fully convinced yet that re-using
> > > > > /dev/cpu_dma_latency is the right path. The main reason is that I
> > > > > don't want us to limit the use-case to CPU latencies, but rather allow
> > > > > the QoS constraint to be system-wide for any type of device. For
> > > > > example, it could be used by storage drivers too (like NVMe, UFS,
> > > > > eMMC), as a way to understand what low power state to pick as system
> > > > > wide suspend. If you have a closer look at patch2 [1] , I suggest we
> > > > > extend the genpd-governor for *both* CPU-cluster-PM-domains and for
> > > > > other PM-domains too.
> > > > >
> > > > > Interested to hear your thoughts around this.
> > > >
> > > > Hey, just wanted to see if you have managed to digest this and have
> > > > any possible further comment?
> > >
> > > The reason why I thought about reusing /dev/cpu_dma_latency is because
> > > I think that the s2idle limit should also apply to cpuidle.  Of
> > > course, cpuidle may be limited further, but IMV it should observe the
> > > limit set on system suspend (it would be kind of inconsistent to allow
> > > cpuidle to use deeper idle states than can be used by s2idle).
> >
> > Agreed!
> >
> > >
> > > I also don't think that having a per-CPU s2idle limit would be
> > > particularly useful (and it might be problematic).
> > >
> > > Now, it is not as straightforward as I thought because someone may
> > > want to set a more restrictive limit on cpuidle, in which case they
> > > would need to open the same special device file twice etc and that
> > > would be quite cumbersome.
> > >
> > > So in the end I think that what you did in the $subject patch is
> > > better, but I still would like it to also affect cpuidle.
> >
> > Okay. I will update the patches according to your suggestions!
> >
> > >
> > > And it needs to be made clear that this is a limit on the resume
> > > latency of one device.  Worst case, the system wakeup latency may be a
> > > sum of those limits if the devices in question are resumed
> > > sequentially, so in fact this is a limit on the contribution of a
> > > given device to the system wakeup latency.
> >
> > Indeed, that's a very good point! I will keep this in mind when
> > working on adding the documentation part.
>
> Well, this also means that using one limit for all of the different
> devices is not likely to be very practical because the goal is to save
> as much energy as reasonably possible in system suspend while
> respecting a global resume latency constraint at the same time.
>
> Using the same limit on a local contribution from each device to the
> combined latency is not likely to be effective here.  Rather, I'd
> expect that the best results can be achieved by setting different
> resume latency limits on different devices, depending on how much
> power they draw in each of their idle states and what the exit latency
> values for all of those states are.  In other words, this appears to
> be an optimization problem in which the resume latency limits for
> individual devices need to be chosen to satisfy the global resume
> latency constraint and minimize the total system power.

I am following your reasoning and I agree!

Perhaps we should start with extending /dev/cpu_dma_latency with an
ioctl after all? This would allow userspace to specify constraints
applicable to system-wide suspend (s2idle), but it would still be
limited to CPUs/CPU-clusters.

For other devices, we should probably explore the per-device PM QoS
(pm_qos_latency_tolerance_us) instead. Currently
pm_qos_latency_tolerance_us is used for "runtime_suspend", so perhaps
we could add another per-device sysfs file, like
"pm_qos_system_wakeup_latency_us", that we can use for the
system-wide wakeup latency constraint?
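
By analogy with pm_qos_latency_tolerance_us, setting such a per-device
limit could then look like the hypothetical sketch below (the attribute
name and the device path are not final):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          /* Request a 2000 us system-wakeup limit for one device. */
          int fd = open("/sys/bus/platform/devices/foo/power/"
                        "pm_qos_system_wakeup_latency_us", O_WRONLY);

          if (fd < 0)
                  return 1;

          dprintf(fd, "2000\n");
          close(fd);
          return 0;
  }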

Would this make better sense, you think?

Kind regards
Uffe
Re: [RFC/PATCH 1/3] PM: QoS: Introduce a system-wakeup QoS limit
Posted by Rafael J. Wysocki 1 week, 6 days ago
On Tue, Sep 23, 2025 at 11:42 AM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Mon, 22 Sept 2025 at 20:55, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Thu, Sep 18, 2025 at 5:34 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > >
> > > On Wed, 17 Sept 2025 at 21:24, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Sorry for the delay.
> > > >
> > > > On Fri, Sep 12, 2025 at 3:58 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > >
> > > > > On Tue, 12 Aug 2025 at 11:26, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > > >
> > > > > > On Mon, 11 Aug 2025 at 21:16, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > > > >
> > > > > > > On Mon, Aug 11, 2025 at 7:16 PM Kevin Hilman <khilman@baylibre.com> wrote:
> > > > > > > >
> > > > > > > > "Rafael J. Wysocki" <rafael@kernel.org> writes:
> > > > > > > >
> > > > > > > > > On Wed, Jul 16, 2025 at 2:33 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > > > > > >>
> > > > > > > > >> Some platforms and devices supports multiple low-power-states than can be
> > > > > > > > >> used for system-wide suspend. Today these states are selected on per
> > > > > > > > >> subsystem basis and in most cases it's the deepest possible state that
> > > > > > > > >> becomes selected.
> > > > > > > > >>
> > > > > > > > >> For some use-cases this is a problem as it isn't suitable or even breaks
> > > > > > > > >> the system-wakeup latency constraint, when we decide to enter these deeper
> > > > > > > > >> states during system-wide suspend.
> > > > > > > > >>
> > > > > > > > >> Therefore, let's introduce an interface for user-space, allowing us to
> > > > > > > > >> specify the system-wakeup QoS limit. Subsequent changes will start taking
> > > > > > > > >> into account the QoS limit.
> > > > > > > > >
> > > > > > > > > Well, this is not really a system-wakeup limit, but a CPU idle state
> > > > > > > > > latency limit for states entered in the last step of suspend-to-idle.
> > > > > > > > >
> > > > > > > > > It looks like the problem is that the existing CPU latency QoS is not
> > > > > > > > > taken into account by suspend-to-idle, so instead of adding an
> > > > > > > > > entirely new interface to overcome this, would it make sense to add an
> > > > > > > > > ioctl() to the existing one that would allow the user of it to
> > > > > > > > > indicate that the given request should also be respected by
> > > > > > > > > suspend-to-idle?
> > > > > > > > >
> > > > > > > > > There are two basic reasons why I think so:
> > > > > > > > > (1) The requests that you want to be respected by suspend-to-idle
> > > > > > > > > should also be respected by the regular "runtime" idle, or at least I
> > > > > > > > > don't see a reason why it wouldn't be the case.
> > > > > > > > > (2) The new interface introduced by this patch basically duplicates
> > > > > > > > > the existing one.
> > > > > > > >
> > > > > > > > I also think that just using the existing /dev/cpu_dma_latency is the
> > > > > > > > right approach here, and simply teaching s2idle to respect this value.
> > > > > > > >
> > > > > > > > I'm curious about the need for a new ioctl() though.  Under what
> > > > > > > > conditions do you want normal/runtime CPUidle to respect this value and
> > > > > > > > s2idle to not respect this value?
> > > > > > >
> > > > > > > In a typical PC environment s2idle is a replacement for ACPI S3 which
> > > > > > > does not take any QoS constraints into account, so users may want to
> > > > > > > set QoS limits for run-time and then suspend with the expectation that
> > > > > > > QoS will not affect it.
> > > > > >
> > > > > > Yes, I agree. To me, these are orthogonal use-cases which could have
> > > > > > different wakeup latency constraints.
> > > > > >
> > > > > > Adding an ioctl for /dev/cpu_dma_latency, as suggested by Rafael would
> > > > > > allow this to be managed, I think.
> > > > > >
> > > > > > Although, I am not fully convinced yet that re-using
> > > > > > /dev/cpu_dma_latency is the right path. The main reason is that I
> > > > > > don't want us to limit the use-case to CPU latencies, but rather allow
> > > > > > the QoS constraint to be system-wide for any type of device. For
> > > > > > example, it could be used by storage drivers too (like NVMe, UFS,
> > > > > > eMMC), as a way to understand what low power state to pick as system
> > > > > > wide suspend. If you have a closer look at patch2 [1] , I suggest we
> > > > > > extend the genpd-governor for *both* CPU-cluster-PM-domains and for
> > > > > > other PM-domains too.
> > > > > >
> > > > > > Interested to hear your thoughts around this.
> > > > >
> > > > > Hey, just wanted to see if you have managed to digest this and have
> > > > > any possible further comment?
> > > >
> > > > The reason why I thought about reusing /dev/cpu_dma_latency is because
> > > > I think that the s2idle limit should also apply to cpuidle.  Of
> > > > course, cpuidle may be limited further, but IMV it should observe the
> > > > limit set on system suspend (it would be kind of inconsistent to allow
> > > > cpuidle to use deeper idle states than can be used by s2idle).
> > >
> > > Agreed!
> > >
> > > >
> > > > I also don't think that having a per-CPU s2idle limit would be
> > > > particularly useful (and it might be problematic).
> > > >
> > > > Now, it is not as straightforward as I thought because someone may
> > > > want to set a more restrictive limit on cpuidle, in which case they
> > > > would need to open the same special device file twice etc and that
> > > > would be quite cumbersome.
> > > >
> > > > So in the end I think that what you did in the $subject patch is
> > > > better, but I still would like it to also affect cpuidle.
> > >
> > > Okay. I will update the patches according to your suggestions!
> > >
> > > >
> > > > And it needs to be made clear that this is a limit on the resume
> > > > latency of one device.  Worst case, the system wakeup latency may be a
> > > > sum of those limits if the devices in question are resumed
> > > > sequentially, so in fact this is a limit on the contribution of a
> > > > given device to the system wakeup latency.
> > >
> > > Indeed, that's a very good point! I will keep this in mind when
> > > working on adding the documentation part.
> >
> > Well, this also means that using one limit for all of the different
> > devices is not likely to be very practical because the goal is to save
> > as much energy as reasonably possible in system suspend while
> > respecting a global resume latency constraint at the same time.
> >
> > Using the same limit on a local contribution from each device to the
> > combined latency is not likely to be effective here.  Rather, I'd
> > expect that the best results can be achieved by setting different
> > resume latency limits on different devices, depending on how much
> > power they draw in each of their idle states and what the exit latency
> > values for all of those states are.  In other words, this appears to
> > be an optimization problem in which the resume latency limits for
> > individual devices need to be chosen to satisfy the global resume
> > latency constraint and minimize the total system power.
>
> I am following your reasoning and I agree!
>
> Perhaps we should start with extending /dev/cpu_dma_latency with an
> ioctl after all? This would allow userspace to specify constraints
> applicable to system-wide suspend (s2idle), but it would still be
> limited to CPUs/CPU-clusters.

Right.

Adding a separate device special file to represent the limit affecting
s2idle may be somewhat cleaner though as mentioned before.

> For other devices, we should probably explore the per-device PM QoS
> (pm_qos_latency_tolerance_us) instead. Currently,
> pm_qos_latency_tolerance_us is used for "runtime_suspend", so perhaps
> we could add another per-device sysfs file, like
> "pm_qos_system_wakeup_latency_us", to use for the system-wide wakeup
> latency constraint?
>
> Would this make better sense, do you think?

I think that this can be made to work.
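If the per-device route is explored, the consumer side might look
roughly like the sketch below. DEV_PM_QOS_SYSTEM_WAKEUP_LATENCY is a
hypothetical new request type backing the proposed sysfs file, modelled
on the existing DEV_PM_QOS_RESUME_LATENCY, and
foo_enter_low_power_state() is a made-up driver helper:

#include <linux/pm_qos.h>

static int foo_suspend(struct device *dev)
{
	/* Aggregated per-device system-wakeup latency limit, in usec. */
	s32 limit_us = dev_pm_qos_read_value(dev,
					     DEV_PM_QOS_SYSTEM_WAKEUP_LATENCY);

	/* Pick the deepest device low-power state that can resume in time. */
	return foo_enter_low_power_state(dev, limit_us);
}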
Re: [RFC/PATCH 1/3] PM: QoS: Introduce a system-wakeup QoS limit
Posted by Ulf Hansson 1 week, 6 days ago
On Tue, 23 Sept 2025 at 13:39, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> > [...]
> > I am following your reasoning and I agree!
> >
> > Perhaps we should start with extending /dev/cpu_dma_latency with an
> > ioctl after all? This would allow userspace to specify constraints
> > applicable to system-wide suspend (s2idle), but it would still be
> > limited to CPUs/CPU-clusters.
>
> Right.
>
> Adding a separate device special file to represent the limit affecting
> s2idle may be somewhat cleaner though as mentioned before.

Okay, sounds good to me too!

>
> > For other devices, we should probably explore the per-device PM QoS
> > (pm_qos_latency_tolerance_us) instead. Currently,
> > pm_qos_latency_tolerance_us is used for "runtime_suspend", so perhaps
> > we could add another per-device sysfs file, like
> > "pm_qos_system_wakeup_latency_us", to use for the system-wide wakeup
> > latency constraint?
> >
> > Would this make better sense, do you think?
>
> I think that this can be made to work.

Okay, I will explore this approach.

Thanks for your feedback!

Kind regards
Uffe
Re: [RFC/PATCH 1/3] PM: QoS: Introduce a system-wakeup QoS limit
Posted by Dhruva Gole 1 week, 6 days ago
Hi Ulf,

On Sep 23, 2025 at 14:36:53 +0200, Ulf Hansson wrote:
> On Tue, 23 Sept 2025 at 13:39, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Tue, Sep 23, 2025 at 11:42 AM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > >
> > > On Mon, 22 Sept 2025 at 20:55, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > >
> > > > On Thu, Sep 18, 2025 at 5:34 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > >
> > > > > On Wed, 17 Sept 2025 at 21:24, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Sorry for the delay.
> > > > > >
> > > > > > On Fri, Sep 12, 2025 at 3:58 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > > > >
> > > > > > > On Tue, 12 Aug 2025 at 11:26, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, 11 Aug 2025 at 21:16, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Aug 11, 2025 at 7:16 PM Kevin Hilman <khilman@baylibre.com> wrote:
> > > > > > > > > >
> > > > > > > > > > "Rafael J. Wysocki" <rafael@kernel.org> writes:
> > > > > > > > > >
> > > > > > > > > > > On Wed, Jul 16, 2025 at 2:33 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> Some platforms and devices supports multiple low-power-states than can be
> > > > > > > > > > >> used for system-wide suspend. Today these states are selected on per
> > > > > > > > > > >> subsystem basis and in most cases it's the deepest possible state that
> > > > > > > > > > >> becomes selected.
> > > > > > > > > > >>
> > > > > > > > > > >> For some use-cases this is a problem as it isn't suitable or even breaks
> > > > > > > > > > >> the system-wakeup latency constraint, when we decide to enter these deeper
> > > > > > > > > > >> states during system-wide suspend.
> > > > > > > > > > >>
> > > > > > > > > > >> Therefore, let's introduce an interface for user-space, allowing us to
> > > > > > > > > > >> specify the system-wakeup QoS limit. Subsequent changes will start taking
> > > > > > > > > > >> into account the QoS limit.
> > > > > > > > > > >
> > > > > > > > > > > Well, this is not really a system-wakeup limit, but a CPU idle state
> > > > > > > > > > > latency limit for states entered in the last step of suspend-to-idle.
> > > > > > > > > > >
> > > > > > > > > > > It looks like the problem is that the existing CPU latency QoS is not
> > > > > > > > > > > taken into account by suspend-to-idle, so instead of adding an
> > > > > > > > > > > entirely new interface to overcome this, would it make sense to add an
> > > > > > > > > > > ioctl() to the existing one that would allow the user of it to
> > > > > > > > > > > indicate that the given request should also be respected by
> > > > > > > > > > > suspend-to-idle?
> > > > > > > > > > >
> > > > > > > > > > > There are two basic reasons why I think so:
> > > > > > > > > > > (1) The requests that you want to be respected by suspend-to-idle
> > > > > > > > > > > should also be respected by the regular "runtime" idle, or at least I
> > > > > > > > > > > don't see a reason why it wouldn't be the case.
> > > > > > > > > > > (2) The new interface introduced by this patch basically duplicates
> > > > > > > > > > > the existing one.
> > > > > > > > > >
> > > > > > > > > > I also think that just using the existing /dev/cpu_dma_latency is the
> > > > > > > > > > right approach here, and simply teaching s2idle to respect this value.
> > > > > > > > > >
> > > > > > > > > > I'm curious about the need for a new ioctl() though.  Under what
> > > > > > > > > > conditions do you want normal/runtime CPUidle to respect this value and
> > > > > > > > > > s2idle to not respect this value?
> > > > > > > > >
> > > > > > > > > In a typical PC environment s2idle is a replacement for ACPI S3 which
> > > > > > > > > does not take any QoS constraints into account, so users may want to
> > > > > > > > > set QoS limits for run-time and then suspend with the expectation that
> > > > > > > > > QoS will not affect it.
> > > > > > > >
> > > > > > > > Yes, I agree. To me, these are orthogonal use-cases which could have
> > > > > > > > different wakeup latency constraints.
> > > > > > > >
> > > > > > > > Adding an ioctl for /dev/cpu_dma_latency, as suggested by Rafael would
> > > > > > > > allow this to be managed, I think.
> > > > > > > >
> > > > > > > > Although, I am not fully convinced yet that re-using
> > > > > > > > /dev/cpu_dma_latency is the right path. The main reason is that I
> > > > > > > > don't want us to limit the use-case to CPU latencies, but rather allow
> > > > > > > > the QoS constraint to be system-wide for any type of device. For
> > > > > > > > example, it could be used by storage drivers too (like NVMe, UFS,
> > > > > > > > eMMC), as a way to understand what low power state to pick as system
> > > > > > > > wide suspend. If you have a closer look at patch2 [1] , I suggest we
> > > > > > > > extend the genpd-governor for *both* CPU-cluster-PM-domains and for
> > > > > > > > other PM-domains too.
> > > > > > > >
> > > > > > > > Interested to hear your thoughts around this.
> > > > > > >
> > > > > > > Hey, just wanted to see if you have managed to digest this and have
> > > > > > > any possible further comment?
> > > > > >
> > > > > > The reason why I thought about reusing /dev/cpu_dma_latency is because
> > > > > > I think that the s2idle limit should also apply to cpuidle.  Of
> > > > > > course, cpuidle may be limited further, but IMV it should observe the
> > > > > > limit set on system suspend (it would be kind of inconsistent to allow
> > > > > > cpuidle to use deeper idle states than can be used by s2idle).
> > > > >
> > > > > Agreed!
> > > > >
> > > > > >
> > > > > > I also don't think that having a per-CPU s2idle limit would be
> > > > > > particularly useful (and it might be problematic).
> > > > > >
> > > > > > Now, it is not as straightforward as I thought because someone may
> > > > > > want to set a more restrictive limit on cpuidle, in which case they
> > > > > > would need to open the same special device file twice etc and that
> > > > > > would be quite cumbersome.
> > > > > >
> > > > > > So in the end I think that what you did in the $subject patch is
> > > > > > better, but I still would like it to also affect cpuidle.
> > > > >
> > > > > Okay. I will update the patches according to your suggestions!
> > > > >
> > > > > >
> > > > > > And it needs to be made clear that this is a limit on the resume
> > > > > > latency of one device.  Worst case, the system wakeup latency may be a
> > > > > > sum of those limits if the devices in question are resumed
> > > > > > sequentially, so in fact this is a limit on the contribution of a
> > > > > > given device to the system wakeup latency.
> > > > >
> > > > > Indeed, that's a very good point! I will keep this in mind when
> > > > > working on adding the documentation part.
> > > >
> > > > Well, this also means that using one limit for all of the different
> > > > devices is not likely to be very practical because the goal is to save
> > > > as much energy as reasonably possible in system suspend while
> > > > respecting a global resume latency constraint at the same time.
> > > >
> > > > Using the same limit on a local contribution from each device to the
> > > > combined latency is not likely to be effective here.  Rather, I'd
> > > > expect that the best results can be achieved by setting different
> > > > resume latency limits on different devices, depending on how much
> > > > power they draw in each of their idle states and what the exit latency
> > > > values for all of those states are.  In other words, this appears to
> > > > be an optimization problem in which the resume latency limits for
> > > > individual devices need to be chosen to satisfy the global resume
> > > > latency constraint and minimize the total system power.
> > >
> > > I am following your reasoning and I agree!
> > >
> > > Perhaps we should start with extending the cpu_dma_latency with an
> > > ioctl after all? This would allow userspace to specify constraints to
> > > be applicable for system-wide-suspend (s2idle), but it would still be
> > > limited for CPUs/CPU-clusters.
> >
> > Right.
> >
> > Adding a separate device special file to represent the limit affecting
> > s2idle may be somewhat cleaner though as mentioned before.
> 
> Okay, sounds good to me too!
> 
> >
> > > For other devices, we should probably explore the per-device PM QoS
> > > (pm_qos_latency_tolerance_us) instead. Currently,
> > > pm_qos_latency_tolerance_us is used for "runtime_suspend", so perhaps
> > > we could add another per-device sysfs file, like
> > > "pm_qos_system_wakeup_latency_us", to use for the system-wide wakeup
> > > latency constraint?
> > >
> > > Would this make better sense, do you think?
> >
> > I think that this can be made to work.
> 
> Okay, I will explore this approach.

I think this is quite similar to how we did it in the TI SCI
pmdomains driver for TI SoCs. See Kevin's patch [1], where we read the
constraint via dev_pm_qos_read_value() and, based on that, set a
constraint on the firmware entity, which then chose which low-power
mode to enter. It's nice to see this logic finally moving into a much
more central part of the kernel PM.
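For reference, the shape of what [1] does, as a rough sketch;
fw_set_latency_constraint() stands in for the actual TI SCI firmware
interface:

#include <linux/pm_qos.h>

static int ti_sci_pd_suspend(struct device *dev)
{
	/* Aggregated resume-latency QoS for this device, in usec. */
	s32 val = dev_pm_qos_read_value(dev, DEV_PM_QOS_RESUME_LATENCY);

	/* Forward the constraint so the firmware can pick the low-power mode. */
	if (val != PM_QOS_RESUME_LATENCY_NO_CONSTRAINT)
		return fw_set_latency_constraint(dev, val);

	return 0;
}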

About this series itself, Kevin and I have been working on a
vendor-kernel branch that combines platform-specific support for the
TI AM62L SoC with this series applied on top. I think it is good
enough as a proof of concept that we can finally do mode selection
while using s2idle.

So yeah - I was able to write values into /dev/system_wakeup_latency
and observe the effect: before setting any value, the deepest
idle-state in the DT was picked, while after programming a latency
constraint into /dev/system_wakeup_latency, cpuidle_enter_s2idle()
picked the shallower idle-state.
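In case it helps others reproduce this, a minimal user-space sketch of
that flow, assuming /dev/system_wakeup_latency behaves like
/dev/cpu_dma_latency (writing a raw s32, in usec, installs the
constraint, which stays in effect while the fd is kept open):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
	int32_t limit_us = 2000; /* example wakeup latency limit */
	int fd = open("/dev/system_wakeup_latency", O_RDWR);

	if (fd < 0)
		return 1;
	if (write(fd, &limit_us, sizeof(limit_us)) != sizeof(limit_us))
		return 1;

	pause(); /* hold the fd, and thus the constraint, until signalled */
	return 0;
}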

The idle-states we had were modelling two different low-power-mode
variants of suspend-to-RAM. Based on the suspend-param that I received
in the firmware (in this case TF-A via PSCI), I did the mode-selection
bits and switched between the low-power modes purely based on
system_wakeup_latency. There's definitely more work to do, and I will
continue to closely monitor the next revisions of this series, so
please feel free to include me in To/CC.

[1] https://lore.kernel.org/linux-pm/20241206-lpm-v6-10-constraints-pmdomain-v6-1-833980158c68@baylibre.com/

-- 
Best regards,
Dhruva Gole
Texas Instruments Incorporated