[v5] PM: QoS: Introduce boot parameter pm_qos_resume_latency_us

[PATCH v5] PM: QoS: Introduce boot parameter pm_qos_resume_latency_us

Posted by Aaron Tomlin 1 month, 2 weeks ago

Hi Rafael, Danilo, Pavel, Len,

Users currently lack a mechanism to define granular, per-CPU PM QoS
resume latency constraints during the early boot phase.

While the idle=poll boot parameter exists, it enforces a global
override, forcing all CPUs in the system to "poll". This global approach
is not suitable for asymmetric workloads where strict latency guarantees
are required only on specific critical CPUs, while housekeeping or
non-critical CPUs should be allowed to enter deeper idle states to save
energy.

Additionally, the existing sysfs interface
(/sys/devices/system/cpu/cpuN/power/pm_qos_resume_latency_us) becomes
available only after userspace initialisation. This is too late to
prevent deep C-state entry during the early kernel boot phase, which may
be required for debugging early boot hangs related to C-state
transitions or for workloads requiring strict latency guarantees
immediately upon system start.

This patch introduces the pm_qos_resume_latency_us kernel boot
parameter, which allows users to specify distinct resume latency
constraints for specific CPU ranges.

	Syntax: pm_qos_resume_latency_us=range:value;range:value...

This boot parameter mirrors the sysfs interface behaviour: the special
string "n/a" imposes a 0us latency constraint (polling), while the
integer 0 removes the constraint entirely.

For example:

	"pm_qos_resume_latency_us=0:n/a;1-15:20"

Forces CPU 0 to poll on idle; constrains CPUs 1-15 to not enter a sleep
state that takes longer than 20 us to wake up. All other CPUs keep the
default (no resume latency constraint).
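
As an illustration only, the value mapping described above ("n/a" means a
0us limit, the integer 0 means no constraint) can be sketched in user space
as follows. This is not the code from the patch: parse_latency_value() and
NO_CONSTRAINT are hypothetical names, NO_CONSTRAINT is an assumed stand-in
for PM_QOS_RESUME_LATENCY_NO_CONSTRAINT, and strtol() stands in for the
kernel's kstrtos32().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed stand-in for the kernel's PM_QOS_RESUME_LATENCY_NO_CONSTRAINT. */
#define NO_CONSTRAINT INT32_MAX

/*
 * Translate the <value> part of one "<cpu-list>:<value>" tuple.
 * Returns 0 and stores the effective limit in *out, or -1 if the
 * value is malformed, negative, or out of range.
 */
static int parse_latency_value(const char *val, int32_t *out)
{
	char *end;
	long n;

	assert(val && out);

	/* "n/a" requests a 0us limit, i.e. polling on idle. */
	if (strcmp(val, "n/a") == 0) {
		*out = 0;
		return 0;
	}

	errno = 0;
	n = strtol(val, &end, 0);
	if (errno || end == val || *end != '\0' || n < 0 || n > INT32_MAX)
		return -1;

	/* The integer 0 removes the constraint rather than requesting 0us. */
	*out = (n == 0) ? NO_CONSTRAINT : (int32_t)n;
	return 0;
}
```

With this sketch, the example string above would yield 0 for CPU 0 and 20
for CPUs 1-15, while a tuple such as "16-31:0" would leave those CPUs
unconstrained.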

Implementation Details:

	- The parameter string is captured via __setup() and parsed in
	  an early_initcall() to ensure suitable memory allocators are
	  available.

	- Constraints are stored in a read-only linked list.

	- The constraints are queried and applied in register_cpu().
	  This ensures the latency requirement is active immediately
	  upon CPU registration, effectively acting as a "birth"
	  constraint before the cpuidle governor takes over.

	- The parsing logic enforces a "First Match Wins" policy: if a
	  CPU falls into multiple specified ranges, the latency value
	  from the first matching entry is used.

	- The constraints persist across CPU hotplug events.
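
The "First Match Wins" policy listed above can be sketched in user space as
shown below. This is a simplified illustration under stated assumptions, not
the patch itself: a 64-bit mask replaces struct cpumask, an array replaces
the linked list, and struct boot_entry, resolve_first_match() and
lookup_latency() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for the kernel's PM_QOS_RESUME_LATENCY_NO_CONSTRAINT. */
#define NO_CONSTRAINT INT32_MAX

struct boot_entry {
	uint64_t mask;		/* bit N set: entry applies to CPU N */
	int32_t latency;	/* resume latency limit in microseconds */
};

/*
 * Strip CPUs already claimed by earlier entries, so that each CPU keeps
 * the value from the first entry that named it. Entries left with an
 * empty mask are dropped. Returns the number of entries kept.
 */
static size_t resolve_first_match(struct boot_entry *e, size_t n)
{
	uint64_t covered = 0;
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t mask = e[i].mask & ~covered;

		if (!mask)
			continue;	/* fully covered by earlier entries */
		covered |= mask;
		e[kept].mask = mask;
		e[kept].latency = e[i].latency;
		kept++;
	}
	return kept;
}

/* Return the limit for @cpu, or NO_CONSTRAINT if no entry names it. */
static int32_t lookup_latency(const struct boot_entry *e, size_t n,
			      unsigned int cpu)
{
	assert(cpu < 64);

	for (size_t i = 0; i < n; i++)
		if (e[i].mask & (UINT64_C(1) << cpu))
			return e[i].latency;
	return NO_CONSTRAINT;
}
```

Given the overlapping tuples "0-3:n/a;2-5:20", this sketch trims the second
entry to CPUs 4-5, so CPUs 2 and 3 keep the 0us limit from the first entry.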

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
Changes since v4 [1]:
 - Modified the parsing logic so the boot parameter perfectly mirrors the
   existing sysfs interface. Passing the integer 0 now explicitly removes
   the constraint (i.e., maps to PM_QOS_RESUME_LATENCY_NO_CONSTRAINT), and
   the special string "n/a" safely imposes a 0us latency constraint
   (polling on idle)

 - Changed the outer tuple delimiter from a comma (",") to a semicolon
   (";"). This prevents strsep() from breaking standard, non-contiguous
   CPU lists (e.g., "0,2:n/a;4-7:20")

 - Moved the cpumask_or() coverage update to the end of the parsing loop

 - Updated Documentation/admin-guide/kernel-parameters.txt to reflect the
   new semicolon delimiter and the revised "0" and "n/a" semantics

Changes since v3 [2]:
 - Moved pm_qos_get_boot_cpu_latency_limit() declaration out of the
   CONFIG_PM #ifdef block, as qos.c is compiled regardless

Changes since v2 [3]:
 - Add pr_fmt() to standardise log prefixes (Zhongqiu Han)
 - Drop <asm/setup.h> by duplicating the command line with kstrdup()
   (Zhongqiu Han)
 - Fix init_pm_qos_resume_latency_us_setup() error path to return -ENOMEM
   (Zhongqiu Han)

Changes since v1 [4]:
 - Removed boot_option_idle_override == IDLE_POLL check
 - Decoupled implementation from CONFIG_CPU_IDLE
 - Added kernel-parameters.txt documentation
 - Renamed internal setup functions for consistency

[1]: https://lore.kernel.org/lkml/20260308190421.46657-1-atomlin@atomlin.com/
[2]: https://lore.kernel.org/lkml/20260307200736.4192234-1-atomlin@atomlin.com/
[3]: https://lore.kernel.org/lkml/20260128033143.3456074-2-atomlin@atomlin.com/
[4]: https://lore.kernel.org/lkml/20260123010024.3301276-1-atomlin@atomlin.com/

 .../admin-guide/kernel-parameters.txt         |  22 +++
 drivers/base/cpu.c                            |   5 +-
 include/linux/pm_qos.h                        |   1 +
 kernel/power/qos.c                            | 153 ++++++++++++++++++
 4 files changed, 179 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6a3d6bd0746c..1beb4f82e038 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2238,6 +2238,28 @@ Kernel parameters
 	icn=		[HW,ISDN]
 			Format: <io>[,<membase>[,<icn_id>[,<icn_id2>]]]
 
+	pm_qos_resume_latency_us=	[KNL,EARLY]
+			Format: <cpu-list>:<value>[;<cpu-list>:<value>...]
+
+			Establish per-CPU resume latency constraints. These constraints
+			are applied immediately upon CPU registration and persist
+			across CPU hotplug events.
+
+			For example:
+				"pm_qos_resume_latency_us=0:n/a;1-15:20"
+
+			This restricts CPU 0 to a 0us resume latency (effectively
+			forcing polling) and limits CPUs 1-15 to C-states with a
+			maximum exit latency of 20us. All other CPUs remain
+			unconstrained by this parameter.
+
+			This boot parameter mirrors the sysfs interface behaviour.
+			The special string "n/a" imposes a 0us latency constraint
+			(polling), while the integer 0 removes the constraint.
+
+			NOTE: The parsing logic enforces a "First Match Wins" policy.
+			If a CPU is included in multiple specified ranges, the latency
+			value from the first matching entry takes precedence.
 
 	idle=		[X86,EARLY]
 			Format: idle=poll, idle=halt, idle=nomwait
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index c6c57b6f61c6..1dea5bcd76a0 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -416,6 +416,7 @@ EXPORT_SYMBOL_GPL(cpu_subsys);
 int register_cpu(struct cpu *cpu, int num)
 {
 	int error;
+	s32 resume_latency;
 
 	cpu->node_id = cpu_to_node(num);
 	memset(&cpu->dev, 0x00, sizeof(struct device));
@@ -436,8 +437,8 @@ int register_cpu(struct cpu *cpu, int num)
 
 	per_cpu(cpu_sys_devices, num) = &cpu->dev;
 	register_cpu_under_node(num, cpu_to_node(num));
-	dev_pm_qos_expose_latency_limit(&cpu->dev,
-					PM_QOS_RESUME_LATENCY_NO_CONSTRAINT);
+	resume_latency = pm_qos_get_boot_cpu_latency_limit(num);
+	dev_pm_qos_expose_latency_limit(&cpu->dev, resume_latency);
 	set_cpu_enabled(num, true);
 
 	return 0;
diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h
index 6cea4455f867..65ce276282e8 100644
--- a/include/linux/pm_qos.h
+++ b/include/linux/pm_qos.h
@@ -142,6 +142,7 @@ int pm_qos_update_target(struct pm_qos_constraints *c, struct plist_node *node,
 bool pm_qos_update_flags(struct pm_qos_flags *pqf,
 			 struct pm_qos_flags_request *req,
 			 enum pm_qos_req_action action, s32 val);
+s32 pm_qos_get_boot_cpu_latency_limit(unsigned int cpu);
 
 #ifdef CONFIG_CPU_IDLE
 s32 cpu_latency_qos_limit(void);
diff --git a/kernel/power/qos.c b/kernel/power/qos.c
index f7d8064e9adc..1c854e02ada0 100644
--- a/kernel/power/qos.c
+++ b/kernel/power/qos.c
@@ -18,6 +18,8 @@
  * global CPU latency QoS requests and frequency QoS requests are provided.
  */
 
+#define pr_fmt(fmt) "pm_qos: " fmt
+
 /*#define DEBUG*/
 
 #include <linux/pm_qos.h>
@@ -34,6 +36,9 @@
 #include <linux/kernel.h>
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
+#include <linux/cpumask.h>
+#include <linux/cpu.h>
+#include <linux/list.h>
 
 #include <linux/uaccess.h>
 #include <linux/export.h>
@@ -209,6 +214,154 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf,
 	return prev_value != curr_value;
 }
 
+static LIST_HEAD(pm_qos_boot_list);
+static char *pm_qos_resume_latency_cmdline __initdata;
+
+struct pm_qos_boot_entry {
+	struct list_head node;
+	struct cpumask mask;
+	s32 latency;
+};
+
+static int __init pm_qos_resume_latency_us_setup(char *str)
+{
+	pm_qos_resume_latency_cmdline = str;
+	return 1;
+}
+__setup("pm_qos_resume_latency_us=", pm_qos_resume_latency_us_setup);
+
+/**
+ * init_pm_qos_resume_latency_us_setup - Parse the pm_qos_resume_latency_us boot parameter.
+ *
+ * Parses the kernel command line option "pm_qos_resume_latency_us=" to establish
+ * per-CPU resume latency constraints. These constraints are applied
+ * immediately when a CPU is registered.
+ *
+ * Syntax: pm_qos_resume_latency_us=<cpu-list>:<value>[;<cpu-list>:<value>...]
+ * Example: pm_qos_resume_latency_us=0-3:n/a;4-7:20
+ *
+ * The parsing logic enforces a "First Match Wins" policy. If a CPU is
+ * covered by multiple entries in the list, only the first valid entry
+ * applies. Any subsequent overlapping ranges for that CPU are ignored.
+ *
+ * Return: 0 on success, or a negative error code on failure.
+ */
+static int __init init_pm_qos_resume_latency_us_setup(void)
+{
+	char *token, *cmd, *cmd_copy;
+	struct pm_qos_boot_entry *entry, *tentry;
+	cpumask_var_t covered;
+	int ret = 0;
+
+	if (!pm_qos_resume_latency_cmdline)
+		return 0;
+
+	cmd_copy = kstrdup(pm_qos_resume_latency_cmdline, GFP_KERNEL);
+	if (!cmd_copy)
+		return -ENOMEM;
+
+	if (!zalloc_cpumask_var(&covered, GFP_KERNEL)) {
+		pr_warn("Failed to allocate memory for parsing boot parameter\n");
+		ret = -ENOMEM;
+		goto free_cmd_copy;
+	}
+
+	cmd = cmd_copy;
+	while ((token = strsep(&cmd, ";")) != NULL) {
+		char *str_range, *str_val;
+
+		str_range = strsep(&token, ":");
+		str_val = token;
+
+		if (!str_val) {
+			pr_warn("Missing value for range %s\n", str_range);
+			continue;
+		}
+
+		entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+		if (!entry) {
+			pr_warn("Failed to allocate memory for boot entry\n");
+			ret = -ENOMEM;
+			goto cleanup;
+		}
+
+		if (cpulist_parse(str_range, &entry->mask)) {
+			pr_warn("Failed to parse cpulist range %s\n", str_range);
+			kfree(entry);
+			continue;
+		}
+
+		cpumask_andnot(&entry->mask, &entry->mask, covered);
+		if (cpumask_empty(&entry->mask)) {
+			pr_warn("Entry %s already covered, ignoring\n", str_range);
+			kfree(entry);
+			continue;
+		}
+
+		if (!strcmp(str_val, "n/a")) {
+			entry->latency = 0;
+		} else if (kstrtos32(str_val, 0, &entry->latency)) {
+			pr_warn("Invalid latency requirement value %s\n", str_val);
+			kfree(entry);
+			continue;
+		} else if (entry->latency == 0) {
+			entry->latency = PM_QOS_RESUME_LATENCY_NO_CONSTRAINT;
+		}
+
+		if (entry->latency < 0) {
+			pr_warn("Latency requirement cannot be negative: %d\n", entry->latency);
+			kfree(entry);
+			continue;
+		}
+
+		cpumask_or(covered, covered, &entry->mask);
+
+		list_add_tail(&entry->node, &pm_qos_boot_list);
+	}
+
+	free_cpumask_var(covered);
+	kfree(cmd_copy);
+	return 0;
+
+cleanup:
+	list_for_each_entry_safe(entry, tentry, &pm_qos_boot_list, node) {
+		list_del(&entry->node);
+		kfree(entry);
+	}
+	free_cpumask_var(covered);
+free_cmd_copy:
+	kfree(cmd_copy);
+	return ret;
+}
+early_initcall(init_pm_qos_resume_latency_us_setup);
+
+/**
+ * pm_qos_get_boot_cpu_latency_limit - Get boot-time latency limit for a CPU.
+ * @cpu: Logical CPU number to check.
+ *
+ * Checks the read-only boot-time constraints list to see if a specific
+ * PM QoS latency override was requested for this CPU via the kernel
+ * command line.
+ *
+ * Return: The latency limit in microseconds if a constraint exists,
+ * or PM_QOS_RESUME_LATENCY_NO_CONSTRAINT if no boot override applies.
+ */
+s32 pm_qos_get_boot_cpu_latency_limit(unsigned int cpu)
+{
+	struct pm_qos_boot_entry *entry;
+
+	if (list_empty(&pm_qos_boot_list))
+		return PM_QOS_RESUME_LATENCY_NO_CONSTRAINT;
+
+	list_for_each_entry(entry, &pm_qos_boot_list, node) {
+		if (cpumask_test_cpu(cpu, &entry->mask))
+			return entry->latency;
+	}
+
+	return PM_QOS_RESUME_LATENCY_NO_CONSTRAINT;
+}
+EXPORT_SYMBOL_GPL(pm_qos_get_boot_cpu_latency_limit);
+
 #ifdef CONFIG_CPU_IDLE
 /* Definitions related to the CPU latency QoS. */
 
-- 
2.51.0

Re: [PATCH v5] PM: QoS: Introduce boot parameter pm_qos_resume_latency_us

Posted by Aaron Tomlin 1 week, 1 day ago

On Sun, Apr 26, 2026 at 12:01:27PM -0400, Aaron Tomlin wrote:
> This patch introduces the pm_qos_resume_latency_us kernel boot
> parameter, which allows users to specify distinct resume latency
> constraints for specific CPU ranges.
> 
> 	Syntax: pm_qos_resume_latency_us=range:value;range:value...
> 
> This boot parameter mirrors the sysfs interface behaviour: the special
> string "n/a" imposes a 0us latency constraint (polling), while the
> integer 0 removes the constraint entirely.

Hi Greg, Rafael, Danilo, Pavel, Len,

It has been over a month since I submitted this, so I just wanted to gently
ping this thread.

As a quick reminder, this parameter is highly beneficial for deployments
that prefer to establish strict latency constraints early in the boot
process, eliminating the need to rely on custom user-space tooling later
on.

Patch link: https://lore.kernel.org/lkml/20260426160127.292486-1-atomlin@atomlin.com/

Please let me know if you have any conceptual concerns with this approach,
if any further adjustments are required, or if you simply need me to rebase
and resend this against the latest power management tree.

Thank you for your time.

Kind regards,
-- 
Aaron Tomlin

Re: [PATCH v5] PM: QoS: Introduce boot parameter pm_qos_resume_latency_us

Posted by Rafael J. Wysocki 1 week, 1 day ago

On Mon, Jun 1, 2026 at 8:55 PM Aaron Tomlin <atomlin@atomlin.com> wrote:
>
> On Sun, Apr 26, 2026 at 12:01:27PM -0400, Aaron Tomlin wrote:
> > This patch introduces the pm_qos_resume_latency_us kernel boot
> > parameter, which allows users to specify distinct resume latency
> > constraints for specific CPU ranges.
> >
> >       Syntax: pm_qos_resume_latency_us=range:value;range:value...
> >
> > This boot parameter mirrors the sysfs interface behaviour: the special
> > string "n/a" imposes a 0us latency constraint (polling), while the
> > integer 0 removes the constraint entirely.
>
> Hi Greg, Rafael, Danilo, Pavel, Len,
>
> It has been over a month since I submitted this, so I just wanted to gently
> ping this thread.
>
> As a quick reminder, this parameter is highly beneficial for deployments
> that prefer to establish strict latency constraints early in the boot
> process, eliminating the need to rely on custom user-space tooling later
> on.
>
> Patch link: https://lore.kernel.org/lkml/20260426160127.292486-1-atomlin@atomlin.com/
>
> Please let me know if you have any conceptual concerns with this approach,
> if any further adjustments are required,

IMV it would be better to call the new command line arg something like
"cpu_idle_exit_latency_us" to make it clear what it is about.

Also, generally speaking, it should be part of cpuidle rather than the
generic QoS code that also applies to devices other than CPUs.

> or if you simply need me to rebase and resend this against the latest
> power management tree.

And that too.

Thanks!

Re: [PATCH v5] PM: QoS: Introduce boot parameter pm_qos_resume_latency_us

Posted by Aaron Tomlin 1 week ago

On Mon, Jun 01, 2026 at 09:08:33PM +0200, Rafael J. Wysocki wrote:
> IMV it would be better to call the new command line arg something like
> "cpu_idle_exit_latency_us" to make it clear what it is about.
> 
> Also, generally speaking, it should be part of cpuidle rather than the
> generic QoS code that also applies to devices other than CPUs.
> 
> > or if you simply need me to rebase and resend this against the latest
> > power management tree.
> 
> And that too.

Hi Rafael,

Thank you for your review and the constructive feedback.

I certainly appreciate your reasoning concerning the naming convention and
its current CPU-specific scope.

However, before I prepare the next iteration, I thought it prudent to
briefly outline my original architectural reasoning for housing it within
the generic PM QoS code, simply to ascertain whether you remain of the view
that cpuidle is the most appropriate home.

My primary motivation for retaining it within the generic PM QoS framework
was to maintain strict symmetry with the existing sysfs interface.

The parameter is designed to align precisely with
/sys/devices/system/cpu/cpuN/power/pm_qos_resume_latency_us. That sysfs
attribute is exposed and managed by the generic device PM QoS code, quite
independently of cpuidle. Furthermore, the boot constraint itself is
applied via dev_pm_qos_expose_latency_limit(&cpu->dev, ...), which is
fundamentally a core QoS API.

Should we move the boot parameter parsing into cpuidle, the consequence is
that the cpuidle subsystem becomes responsible for parsing a boot string,
only to immediately pass that data back into the generic PM QoS framework
during CPU registration. Keeping the implementation within qos.c ensures
that the parsing logic and the underlying data structures remain cohesively
in the same subsystem. Crucially, it also preserves the flexibility to
extend this syntax to non-CPU devices in the future without necessitating
further refactoring.

I look forward to hearing your preference.

Kind regards,
-- 
Aaron Tomlin