amd-cppc has 2 operation modes: autonomous (active) mode,
non-autonomous (passive) mode.
In active mode, platform ignores the requestd done in the Desired
Performance Target register and takes into account only the values
set to the minimum, maximum and energy performance preference(EPP)
registers.
The EPP is used in the CCLK DPM controller to drive the frequency
that a core is going to operate during short periods of activity.
The SOC EPP targets are configured on a scale from 0 to 255 where 0
represents maximum performance and 255 represents maximum efficiency.
This commit implements one new AMD CPU frequency driver `amd-cppc-epp`
for active mode. It also introduce `active` tag for users to explicitly
select active mode and a new variable `opt_active_mode` to keep track of
which mode is currently enabled.
Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
v1 -> v2:
- Remove redundant epp_mode
- Remove pointless initializer
- Define sole caller read_epp_init_once and epp_init value to read
pre-defined BIOS epp value only once
- Combine the commit "xen/cpufreq: introduce policy type when
cpufreq_driver->setpolicy exists"
---
v2 -> v3:
- Combined with commit "x86/cpufreq: add "cpufreq=amd-cppc,active" para"
- Refactor doc about "active mode"
- Change opt_cpufreq_active to opt_active_mode
- Let caller pass epp_init when unspecified to allow the function parameter
to be of uint8_t
- Make epp_init per-cpu value
---
docs/misc/xen-command-line.pandoc | 8 +-
xen/arch/x86/acpi/cpufreq/amd-cppc.c | 119 +++++++++++++++++++++++++--
xen/drivers/cpufreq/utility.c | 11 +++
xen/include/acpi/cpufreq/cpufreq.h | 12 +++
xen/include/public/sysctl.h | 1 +
5 files changed, 145 insertions(+), 6 deletions(-)
diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index b3c3ca2377..19094070b3 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -515,7 +515,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
available support.
### cpufreq
-> `= none | {{ <boolean> | xen } { [:[powersave|performance|ondemand|userspace][,[<maxfreq>]][,[<minfreq>]]] } [,verbose]} | dom0-kernel | hwp[:[<hdc>][,verbose]] | amd-cppc[:[verbose]]`
+> `= none | {{ <boolean> | xen } { [:[powersave|performance|ondemand|userspace][,[<maxfreq>]][,[<minfreq>]]] } [,verbose]} | dom0-kernel | hwp[:[<hdc>][,verbose]] | amd-cppc[:[active][,verbose]]`
> Default: `xen`
@@ -537,6 +537,12 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
* `amd-cppc` selects ACPI Collaborative Performance and Power Control (CPPC)
on supported AMD hardware to provide finer grained frequency control
mechanism. The default is disabled.
+* `active` is to enable amd-cppc driver in active(autonomous) mode. In this
+ mode, users could write to energy performance preference register to tell
+ hardware if they want to bias toward performance or energy efficiency. Then
+ built-in CPPC power algorithm will calculate the runtime workload and adjust
+ the realtime cores frequency automatically according to the power supply and
+ thermal, core voltage and some other hardware conditions.
User could use `;`-separated options to support universal options which they
would like to try on any agnostic platform, *but* under priority order, like
diff --git a/xen/arch/x86/acpi/cpufreq/amd-cppc.c b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
index bf30990c74..606bb648b3 100644
--- a/xen/arch/x86/acpi/cpufreq/amd-cppc.c
+++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
@@ -30,6 +30,9 @@
printk(XENLOG_DEBUG "AMD_CPPC: " fmt, ## args); \
})
+static bool __ro_after_init opt_active_mode;
+static DEFINE_PER_CPU_READ_MOSTLY(uint8_t, epp_init);
+
struct amd_cppc_drv_data
{
const struct xen_processor_cppc *cppc_data;
@@ -70,6 +73,13 @@ static bool __init amd_cppc_handle_option(const char *s, const char *end)
return true;
}
+ ret = parse_boolean("active", s, end);
+ if ( ret >= 0 )
+ {
+ opt_active_mode = ret;
+ return true;
+ }
+
return false;
}
@@ -226,14 +236,19 @@ static void amd_cppc_write_request_msrs(void *info)
}
static int cf_check amd_cppc_write_request(unsigned int cpu, uint8_t min_perf,
- uint8_t des_perf, uint8_t max_perf)
+ uint8_t des_perf, uint8_t max_perf,
+ uint8_t epp)
{
struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
uint64_t prev = data->req.raw;
data->req.min_perf = min_perf;
data->req.max_perf = max_perf;
- data->req.des_perf = des_perf;
+ if ( !opt_active_mode )
+ data->req.des_perf = des_perf;
+ else
+ data->req.des_perf = 0;
+ data->req.epp = epp;
if ( prev == data->req.raw )
return 0;
@@ -261,7 +276,20 @@ static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
return res;
return amd_cppc_write_request(policy->cpu, data->caps.lowest_nonlinear_perf,
- des_perf, data->caps.highest_perf);
+ des_perf, data->caps.highest_perf,
+ /* Pre-defined BIOS value for passive mode */
+ per_cpu(epp_init, policy->cpu));
+}
+
+static int read_epp_init(void)
+{
+ uint64_t val;
+
+ if ( rdmsr_safe(MSR_AMD_CPPC_REQ, val) )
+ return -EINVAL;
+ this_cpu(epp_init) = (val >> 24) & 0xFF;
+
+ return 0;
}
static void cf_check amd_cppc_init_msrs(void *info)
@@ -343,6 +371,8 @@ static void cf_check amd_cppc_init_msrs(void *info)
*/
policy->cur = cpufreq_driver_getavg(policy->cpu, GOV_GETAVG);
+ data->err = read_epp_init();
+
return;
err:
@@ -372,7 +402,7 @@ static int cf_check amd_cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
return 0;
}
-static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
+static int amd_cppc_cpufreq_init_perf(struct cpufreq_policy *policy)
{
unsigned int cpu = policy->cpu;
struct amd_cppc_drv_data *data;
@@ -411,12 +441,78 @@ static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
amd_cppc_boost_init(policy, data);
+ return 0;
+}
+
+static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+ int ret;
+
+ ret = amd_cppc_cpufreq_init_perf(policy);
+ if ( ret )
+ return ret;
+
amd_cppc_verbose("CPU %u initialized with amd-cppc passive mode\n",
policy->cpu);
return 0;
}
+static int cf_check amd_cppc_epp_cpu_init(struct cpufreq_policy *policy)
+{
+ int ret;
+
+ ret = amd_cppc_cpufreq_init_perf(policy);
+ if ( ret )
+ return ret;
+
+ policy->policy = cpufreq_parse_policy(policy->governor);
+
+ amd_cppc_verbose("CPU %u initialized with amd-cppc active mode\n", policy->cpu);
+
+ return 0;
+}
+
+static int amd_cppc_epp_update_limit(const struct cpufreq_policy *policy)
+{
+ const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data,
+ policy->cpu);
+ uint8_t max_perf, min_perf, epp;
+
+ /* Initial min/max values for CPPC Performance Controls Register */
+ /*
+ * Continuous CPPC performance scale in active mode is [lowest_perf,
+ * highest_perf]
+ */
+ max_perf = data->caps.highest_perf;
+ min_perf = data->caps.lowest_perf;
+
+ epp = per_cpu(epp_init, policy->cpu);
+ if ( policy->policy == CPUFREQ_POLICY_PERFORMANCE )
+ {
+ /* Force the epp value to be zero for performance policy */
+ epp = CPPC_ENERGY_PERF_MAX_PERFORMANCE;
+ min_perf = max_perf;
+ }
+ else if ( policy->policy == CPUFREQ_POLICY_POWERSAVE )
+ /* Force the epp value to be 0xff for powersave policy */
+ /*
+ * If set max_perf = min_perf = lowest_perf, we are putting
+ * cpu cores in idle.
+ */
+ epp = CPPC_ENERGY_PERF_MAX_POWERSAVE;
+
+ return amd_cppc_write_request(policy->cpu, min_perf,
+ /* des_perf = 0 for epp mode */
+ 0,
+ max_perf, epp);
+}
+
+static int cf_check amd_cppc_epp_set_policy(struct cpufreq_policy *policy)
+{
+ return amd_cppc_epp_update_limit(policy);
+}
+
static const struct cpufreq_driver __initconst_cf_clobber
amd_cppc_cpufreq_driver =
{
@@ -427,6 +523,16 @@ amd_cppc_cpufreq_driver =
.exit = amd_cppc_cpufreq_cpu_exit,
};
+static const struct cpufreq_driver __initconst_cf_clobber
+amd_cppc_epp_driver =
+{
+ .name = XEN_AMD_CPPC_EPP_DRIVER_NAME,
+ .verify = amd_cppc_cpufreq_verify,
+ .setpolicy = amd_cppc_epp_set_policy,
+ .init = amd_cppc_epp_cpu_init,
+ .exit = amd_cppc_cpufreq_cpu_exit,
+};
+
int __init amd_cppc_register_driver(void)
{
int ret;
@@ -437,7 +543,10 @@ int __init amd_cppc_register_driver(void)
return -ENODEV;
}
- ret = cpufreq_register_driver(&amd_cppc_cpufreq_driver);
+ if ( opt_active_mode )
+ ret = cpufreq_register_driver(&amd_cppc_epp_driver);
+ else
+ ret = cpufreq_register_driver(&amd_cppc_cpufreq_driver);
if ( ret )
return ret;
diff --git a/xen/drivers/cpufreq/utility.c b/xen/drivers/cpufreq/utility.c
index f1fd2fdbce..abde499d40 100644
--- a/xen/drivers/cpufreq/utility.c
+++ b/xen/drivers/cpufreq/utility.c
@@ -491,3 +491,14 @@ int __cpufreq_set_policy(struct cpufreq_policy *data,
return __cpufreq_governor(data, CPUFREQ_GOV_LIMITS);
}
+
+unsigned int cpufreq_parse_policy(const struct cpufreq_governor *gov)
+{
+ if ( !strncasecmp(gov->name, "performance", CPUFREQ_NAME_LEN) )
+ return CPUFREQ_POLICY_PERFORMANCE;
+
+ if ( !strncasecmp(gov->name, "powersave", CPUFREQ_NAME_LEN) )
+ return CPUFREQ_POLICY_POWERSAVE;
+
+ return CPUFREQ_POLICY_UNKNOWN;
+}
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 3c2b951830..7c36634d40 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -83,6 +83,7 @@ struct cpufreq_policy {
int8_t turbo; /* tristate flag: 0 for unsupported
* -1 for disable, 1 for enabled
* See CPUFREQ_TURBO_* below for defines */
+ unsigned int policy; /* CPUFREQ_POLICY_* */
};
DECLARE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_policy);
@@ -133,6 +134,17 @@ extern int cpufreq_register_governor(struct cpufreq_governor *governor);
extern struct cpufreq_governor *__find_governor(const char *governor);
#define CPUFREQ_DEFAULT_GOVERNOR &cpufreq_gov_dbs
+#define CPUFREQ_POLICY_UNKNOWN 0
+/*
+ * If cpufreq_driver->target() exists, the ->governor decides what frequency
+ * within the limits is used. If cpufreq_driver->setpolicy() exists, these
+ * two generic policies are available:
+ */
+#define CPUFREQ_POLICY_POWERSAVE 1
+#define CPUFREQ_POLICY_PERFORMANCE 2
+
+unsigned int cpufreq_parse_policy(const struct cpufreq_governor *gov);
+
/* pass a target to the cpufreq driver */
extern int __cpufreq_driver_target(struct cpufreq_policy *policy,
unsigned int target_freq,
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 42997252ef..fa431fd983 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -424,6 +424,7 @@ struct xen_set_cppc_para {
};
#define XEN_AMD_CPPC_DRIVER_NAME "amd-cppc"
+#define XEN_AMD_CPPC_EPP_DRIVER_NAME "amd-cppc-epp"
#define XEN_HWP_DRIVER_NAME "hwp"
/*
--
2.34.1
On 06.03.2025 09:39, Penny Zheng wrote:
> amd-cppc has 2 operation modes: autonomous (active) mode,
> non-autonomous (passive) mode.
> In active mode, platform ignores the requestd done in the Desired
> Performance Target register and takes into account only the values
> set to the minimum, maximum and energy performance preference(EPP)
> registers.
> The EPP is used in the CCLK DPM controller to drive the frequency
> that a core is going to operate during short periods of activity.
> The SOC EPP targets are configured on a scale from 0 to 255 where 0
> represents maximum performance and 255 represents maximum efficiency.
So this is the other way around from "perf" values, where aiui 0xff is
"highest"?
> @@ -537,6 +537,12 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
> * `amd-cppc` selects ACPI Collaborative Performance and Power Control (CPPC)
> on supported AMD hardware to provide finer grained frequency control
> mechanism. The default is disabled.
> +* `active` is to enable amd-cppc driver in active(autonomous) mode. In this
> + mode, users could write to energy performance preference register to tell
> + hardware if they want to bias toward performance or energy efficiency. Then
> + built-in CPPC power algorithm will calculate the runtime workload and adjust
> + the realtime cores frequency automatically according to the power supply and
What are "the realtime cores"?
> + thermal, core voltage and some other hardware conditions.
I think there better would be only one "and" in the enumeration of conditions.
> @@ -261,7 +276,20 @@ static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
> return res;
>
> return amd_cppc_write_request(policy->cpu, data->caps.lowest_nonlinear_perf,
> - des_perf, data->caps.highest_perf);
> + des_perf, data->caps.highest_perf,
> + /* Pre-defined BIOS value for passive mode */
> + per_cpu(epp_init, policy->cpu));
> +}
> +
> +static int read_epp_init(void)
> +{
> + uint64_t val;
> +
> + if ( rdmsr_safe(MSR_AMD_CPPC_REQ, val) )
> + return -EINVAL;
I'm unconvinced of using rdmsr_safe() everywhere (i.e. this also goes for earlier
patches). Unless you can give a halfway reasonable scenario under which by the
time we get here there's still a chance that the MSR isn't implemented in the
next lower layer (hardware or another hypervisor, just to explain what's meant,
without me assuming that the driver should come into play in the first place when
we run virtualized ourselves).
Furthermore you call this function unconditionally, i.e. if there was a chance
for the MSR read to fail, CPU init would needlessly fail when in passive mode.
> + this_cpu(epp_init) = (val >> 24) & 0xFF;
Please can you #define a suitable mask constant in msr-index.h, such that you can
use MASK_EXTR() here?
> @@ -411,12 +441,78 @@ static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
>
> amd_cppc_boost_init(policy, data);
>
> + return 0;
> +}
> +
> +static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
> +{
> + int ret;
> +
> + ret = amd_cppc_cpufreq_init_perf(policy);
> + if ( ret )
> + return ret;
> +
> amd_cppc_verbose("CPU %u initialized with amd-cppc passive mode\n",
> policy->cpu);
>
> return 0;
> }
>
> +static int cf_check amd_cppc_epp_cpu_init(struct cpufreq_policy *policy)
> +{
> + int ret;
> +
> + ret = amd_cppc_cpufreq_init_perf(policy);
> + if ( ret )
> + return ret;
> +
> + policy->policy = cpufreq_parse_policy(policy->governor);
> +
> + amd_cppc_verbose("CPU %u initialized with amd-cppc active mode\n", policy->cpu);
> +
> + return 0;
> +}
> +
> +static int amd_cppc_epp_update_limit(const struct cpufreq_policy *policy)
> +{
> + const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data,
> + policy->cpu);
Nit: Indentation is off by one here.
> + uint8_t max_perf, min_perf, epp;
> +
> + /* Initial min/max values for CPPC Performance Controls Register */
> + /*
> + * Continuous CPPC performance scale in active mode is [lowest_perf,
> + * highest_perf]
> + */
> + max_perf = data->caps.highest_perf;
> + min_perf = data->caps.lowest_perf;
> +
> + epp = per_cpu(epp_init, policy->cpu);
> + if ( policy->policy == CPUFREQ_POLICY_PERFORMANCE )
This may want to be switch() instead.
> + {
> + /* Force the epp value to be zero for performance policy */
> + epp = CPPC_ENERGY_PERF_MAX_PERFORMANCE;
> + min_perf = max_perf;
> + }
> + else if ( policy->policy == CPUFREQ_POLICY_POWERSAVE )
> + /* Force the epp value to be 0xff for powersave policy */
> + /*
> + * If set max_perf = min_perf = lowest_perf, we are putting
> + * cpu cores in idle.
> + */
Nit: Such two successive comments want combining. (Same near the top of the
function, as I notice only now.)
Furthermore I'm in trouble with interpreting this comment: To me "lowest"
doesn't mean "doing nothing" but "doing things as efficiently in terms of
power use as possible". IOW that's not idle. Yet the comment reads as if it
was meant to be an explanation of why we can't set max_perf from min_perf
here. That is, not matter what's meant to be said, I think this needs re-
wording (and possibly using subjunctive mood).
> + epp = CPPC_ENERGY_PERF_MAX_POWERSAVE;
> +
> + return amd_cppc_write_request(policy->cpu, min_perf,
> + /* des_perf = 0 for epp mode */
> + 0,
The comment could do with putting on the same line as the 0, e.g.
(slightly adjusted)
return amd_cppc_write_request(policy->cpu, min_perf,
0 /* no des_perf for epp mode */,
max_perf, epp);
> +static int cf_check amd_cppc_epp_set_policy(struct cpufreq_policy *policy)
> +{
> + return amd_cppc_epp_update_limit(policy);
> +}
So the purpose of this wrapper is solely to have the actual function's
parameter be pointer-to-const? I don't think that's worth it; I also don't
think we do such elsewhere.
> --- a/xen/drivers/cpufreq/utility.c
> +++ b/xen/drivers/cpufreq/utility.c
> @@ -491,3 +491,14 @@ int __cpufreq_set_policy(struct cpufreq_policy *data,
>
> return __cpufreq_governor(data, CPUFREQ_GOV_LIMITS);
> }
> +
> +unsigned int cpufreq_parse_policy(const struct cpufreq_governor *gov)
> +{
> + if ( !strncasecmp(gov->name, "performance", CPUFREQ_NAME_LEN) )
> + return CPUFREQ_POLICY_PERFORMANCE;
> +
> + if ( !strncasecmp(gov->name, "powersave", CPUFREQ_NAME_LEN) )
> + return CPUFREQ_POLICY_POWERSAVE;
> +
> + return CPUFREQ_POLICY_UNKNOWN;
> +}
Hmm, this isn't really parsing (in the sense of dealing with e.g. command
line elements). Maybe cpufreq_get_policy() or, more explicitly,
cpufreq_policy_from_governor()? Or something along these lines?
I also don't see why the more expensive case-insensitive comparison
routine needs using here.
Jan
[Public]
Hi,
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, March 25, 2025 6:49 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Anthony PERARD <anthony.perard@vates.tech>;
> Orzel, Michal <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Roger
> Pau Monné <roger.pau@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>;
> xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v3 12/15] xen/x86: implement EPP support for the amd-cppc
> driver in active mode
>
> On 06.03.2025 09:39, Penny Zheng wrote:
> > amd-cppc has 2 operation modes: autonomous (active) mode,
> > non-autonomous (passive) mode.
> > In active mode, platform ignores the requestd done in the Desired
> > Performance Target register and takes into account only the values set
> > to the minimum, maximum and energy performance preference(EPP)
> > registers.
> > The EPP is used in the CCLK DPM controller to drive the frequency that
> > a core is going to operate during short periods of activity.
> > The SOC EPP targets are configured on a scale from 0 to 255 where 0
> > represents maximum performance and 255 represents maximum efficiency.
>
> So this is the other way around from "perf" values, where aiui 0xff is "highest"?
>
Yes, it is not the perf value. It is an arbitrary value on a scale from 0 to 255
> > @@ -261,7 +276,20 @@ static int cf_check amd_cppc_cpufreq_target(struct
> cpufreq_policy *policy,
> > return res;
> >
> > return amd_cppc_write_request(policy->cpu, data-
> >caps.lowest_nonlinear_perf,
> > - des_perf, data->caps.highest_perf);
> > + des_perf, data->caps.highest_perf,
> > + /* Pre-defined BIOS value for passive mode */
> > + per_cpu(epp_init, policy->cpu)); }
> > +
> > +static int read_epp_init(void)
> > +{
> > + uint64_t val;
> > +
> > + if ( rdmsr_safe(MSR_AMD_CPPC_REQ, val) )
> > + return -EINVAL;
>
> I'm unconvinced of using rdmsr_safe() everywhere (i.e. this also goes for earlier
> patches). Unless you can give a halfway reasonable scenario under which by the
> time we get here there's still a chance that the MSR isn't implemented in the next
> lower layer (hardware or another hypervisor, just to explain what's meant, without
> me assuming that the driver should come into play in the first place when we run
> virtualized ourselves).
>
Correct me if I understand wrongly, we are concerning that the driver may not always
have the privilege to directly access the MSR in all scenarios, so rdmsr_safe with exception
handling isn't always suitable. Then maybe I shall switch them all into rdmsrl() ?
> Furthermore you call this function unconditionally, i.e. if there was a chance for the
> MSR read to fail, CPU init would needlessly fail when in passive mode.
>
The reason why I also run read_epp_init() for passive mode is to avoid setting epp with zero value
for MSR_AMD_CPPC_REQ in passive mode. I want to give it pre-defined BIOS value in passive mode.
If we wrap read_epp_init() with active mode check, maybe we shall add extra read before setting request register MSR_AMD_CPPC_REQ,
introducing MSR_AMD_CPPC_EPP_MASK to reserve original value for epp in passive mode, or any better suggestion?
> > + {
> > + /* Force the epp value to be zero for performance policy */
> > + epp = CPPC_ENERGY_PERF_MAX_PERFORMANCE;
> > + min_perf = max_perf;
> > + }
> > + else if ( policy->policy == CPUFREQ_POLICY_POWERSAVE )
> > + /* Force the epp value to be 0xff for powersave policy */
> > + /*
> > + * If set max_perf = min_perf = lowest_perf, we are putting
> > + * cpu cores in idle.
> > + */
>
> Nit: Such two successive comments want combining. (Same near the top of the
> function, as I notice only now.)
>
> Furthermore I'm in trouble with interpreting this comment: To me "lowest"
> doesn't mean "doing nothing" but "doing things as efficiently in terms of power use
> as possible". IOW that's not idle. Yet the comment reads as if it was meant to be an
> explanation of why we can't set max_perf from min_perf here. That is, not matter
> what's meant to be said, I think this needs re- wording (and possibly using
> subjunctive mood).
>
How about:
The lowest non-linear perf is equivalent as P2 frequency. Reducing performance below this
point does not lead to total energy savings for a given computation (although it reduces momentary power).
So we are not suggesting to set max_perf smaller than lowest non-linear perf, or even the lowest perf.
> Jan
On 28.03.2025 05:07, Penny, Zheng wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Tuesday, March 25, 2025 6:49 PM
>>
>> On 06.03.2025 09:39, Penny Zheng wrote:
>>> @@ -261,7 +276,20 @@ static int cf_check amd_cppc_cpufreq_target(struct
>> cpufreq_policy *policy,
>>> return res;
>>>
>>> return amd_cppc_write_request(policy->cpu, data-
>>> caps.lowest_nonlinear_perf,
>>> - des_perf, data->caps.highest_perf);
>>> + des_perf, data->caps.highest_perf,
>>> + /* Pre-defined BIOS value for passive mode */
>>> + per_cpu(epp_init, policy->cpu)); }
>>> +
>>> +static int read_epp_init(void)
>>> +{
>>> + uint64_t val;
>>> +
>>> + if ( rdmsr_safe(MSR_AMD_CPPC_REQ, val) )
>>> + return -EINVAL;
>>
>> I'm unconvinced of using rdmsr_safe() everywhere (i.e. this also goes for earlier
>> patches). Unless you can give a halfway reasonable scenario under which by the
>> time we get here there's still a chance that the MSR isn't implemented in the next
>> lower layer (hardware or another hypervisor, just to explain what's meant, without
>> me assuming that the driver should come into play in the first place when we run
>> virtualized ourselves).
>>
>
> Correct me if I understand wrongly, we are concerning that the driver may not always
> have the privilege to directly access the MSR in all scenarios, so rdmsr_safe with exception
> handling isn't always suitable. Then maybe I shall switch them all into rdmsrl() ?
There's no privilege question here - we're running at the highest possible privilege
level. The only question in MSR access can concern the existence of these MSRs (on
bare hardware) or improper emulation of MSRs by an underlying hypervisor. The latter
case I think we can pretty much exclude for a driver like this one - the driver
simply has no (real) use when running virtualized. Which leaves errata on hardware.
Those would better be dealt with by checking once up front (and then disabling the
driver if need be). IOW except for perhaps a single probing access early in driver
init, I think these better would all be plain accesses. And even such an early
probing access would likely only need switching to when a relevant erratum becomes
known.
>> Furthermore you call this function unconditionally, i.e. if there was a chance for the
>> MSR read to fail, CPU init would needlessly fail when in passive mode.
>>
>
> The reason why I also run read_epp_init() for passive mode is to avoid setting epp with zero value
> for MSR_AMD_CPPC_REQ in passive mode. I want to give it pre-defined BIOS value in passive mode.
> If we wrap read_epp_init() with active mode check, maybe we shall add extra read before setting request register MSR_AMD_CPPC_REQ,
> introducing MSR_AMD_CPPC_EPP_MASK to reserve original value for epp in passive mode, or any better suggestion?
Well, not using rdmsr_safe() here would make the function impossible to fail, and
hence the question itself would be moot. Otherwise my suggestion would be to ignore
the error (perhaps associated with a warning) in passive mode.
>>> + {
>>> + /* Force the epp value to be zero for performance policy */
>>> + epp = CPPC_ENERGY_PERF_MAX_PERFORMANCE;
>>> + min_perf = max_perf;
>>> + }
>>> + else if ( policy->policy == CPUFREQ_POLICY_POWERSAVE )
>>> + /* Force the epp value to be 0xff for powersave policy */
>>> + /*
>>> + * If set max_perf = min_perf = lowest_perf, we are putting
>>> + * cpu cores in idle.
>>> + */
>>
>> Nit: Such two successive comments want combining. (Same near the top of the
>> function, as I notice only now.)
>>
>> Furthermore I'm in trouble with interpreting this comment: To me "lowest"
>> doesn't mean "doing nothing" but "doing things as efficiently in terms of power use
>> as possible". IOW that's not idle. Yet the comment reads as if it was meant to be an
>> explanation of why we can't set max_perf from min_perf here. That is, not matter
>> what's meant to be said, I think this needs re- wording (and possibly using
>> subjunctive mood).
>
> How about:
> The lowest non-linear perf is equivalent as P2 frequency. Reducing performance below this
> point does not lead to total energy savings for a given computation (although it reduces momentary power).
> So we are not suggesting to set max_perf smaller than lowest non-linear perf, or even the lowest perf.
In an abstract way I think I can follow this. In the context of the code being
commented, however, I'm afraid I still can't make sense of it. Main point being
that the code commented doesn't use any of the *_perf values. It only sets the
"epp" local variable. Maybe the point of the comment is to explain why non of
the *_perf are used here, but I can't read this out of either of the proposed
texts.
Jan
[Public]
Hi,
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Friday, March 28, 2025 3:18 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Anthony PERARD <anthony.perard@vates.tech>;
> Orzel, Michal <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Roger
> Pau Monné <roger.pau@citrix.com>; Stefano Stabellini <sstabellini@kernel.org>;
> xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v3 12/15] xen/x86: implement EPP support for the amd-cppc
> driver in active mode
>
> On 28.03.2025 05:07, Penny, Zheng wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Tuesday, March 25, 2025 6:49 PM
> >>
> >> On 06.03.2025 09:39, Penny Zheng wrote:
>
> >>> + {
> >>> + /* Force the epp value to be zero for performance policy */
> >>> + epp = CPPC_ENERGY_PERF_MAX_PERFORMANCE;
> >>> + min_perf = max_perf;
> >>> + }
> >>> + else if ( policy->policy == CPUFREQ_POLICY_POWERSAVE )
> >>> + /* Force the epp value to be 0xff for powersave policy */
> >>> + /*
> >>> + * If set max_perf = min_perf = lowest_perf, we are putting
> >>> + * cpu cores in idle.
> >>> + */
> >>
> >> Nit: Such two successive comments want combining. (Same near the top
> >> of the function, as I notice only now.)
> >>
> >> Furthermore I'm in trouble with interpreting this comment: To me "lowest"
> >> doesn't mean "doing nothing" but "doing things as efficiently in
> >> terms of power use as possible". IOW that's not idle. Yet the comment
> >> reads as if it was meant to be an explanation of why we can't set
> >> max_perf from min_perf here. That is, not matter what's meant to be
> >> said, I think this needs re- wording (and possibly using subjunctive mood).
> >
> > How about:
> > The lowest non-linear perf is equivalent as P2 frequency. Reducing
> > performance below this point does not lead to total energy savings for a given
> computation (although it reduces momentary power).
> > So we are not suggesting to set max_perf smaller than lowest non-linear perf, or
> even the lowest perf.
>
> In an abstract way I think I can follow this. In the context of the code being
> commented, however, I'm afraid I still can't make sense of it. Main point being that
> the code commented doesn't use any of the *_perf values. It only sets the "epp"
> local variable. Maybe the point of the comment is to explain why non of the *_perf
> are used here, but I can't read this out of either of the proposed texts.
>
I've checked some internal test suites for CPPC in windows. Maybe setting max_perf = nominal_perf
is a fair option for powersave mode
> Jan
© 2016 - 2026 Red Hat, Inc.