[PATCH v4 09/15] xen/x86: introduce a new amd cppc driver for cpufreq scaling

Penny Zheng posted 15 patches 1 month ago
Posted by Penny Zheng 1 month ago
amd-cppc is the AMD CPU performance scaling driver that introduces a
new CPU frequency control mechanism, first on AMD Zen based CPU series.
The new mechanism is based on Collaborative Processor Performance
Control (CPPC), which provides finer-grained frequency management
than legacy ACPI hardware P-states.
Current AMD CPU platforms use the ACPI P-states driver to manage
CPU frequency and clocks, switching between only three P-states.
The new amd-cppc driver provides a more flexible, low-latency interface
for Xen to communicate performance hints directly to the hardware.

This first version of "amd-cppc" leverages common governors such as
*ondemand*, *performance*, etc., to manage the performance hints. In the
future, we will introduce an advanced active mode to enable autonomous
performance level selection.
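As a quick illustration of the passive-mode idea (not part of the patch): the governor's target frequency is mapped onto the abstract 0-255 perf scale by linear interpolation between the (lowest_perf, lowest_mhz) and (nominal_perf, nominal_mhz) points. The standalone sketch below uses invented capability values, not figures from any real platform:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CPPC capability values, for illustration only. */
struct cppc_caps {
    unsigned int lowest_perf, nominal_perf;
    unsigned int lowest_mhz, nominal_mhz;
};

/* Map a target frequency in kHz onto the abstract 8-bit perf scale. */
static uint8_t khz_to_perf(const struct cppc_caps *c, unsigned int freq_khz)
{
    uint64_t mul = c->nominal_perf - c->lowest_perf;
    uint64_t div = c->nominal_mhz - c->lowest_mhz;
    /* The MHz/MHz ratio cancels the unit, so no kHz conversion here. */
    int64_t offset = c->nominal_perf - (int64_t)(mul * c->nominal_mhz) / div;
    int64_t res = offset + (int64_t)(mul * freq_khz) / (div * 1000);

    /* Clamp to the architectural 8-bit perf range. */
    if ( res > 255 )
        return 255;
    if ( res < 0 )
        return 0;
    return res;
}
```

With lowest = (50 perf, 1000 MHz) and nominal = (100 perf, 2000 MHz), a 1.5 GHz request lands exactly halfway, at perf 75.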

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
v1 -> v2:
- re-construct union caps and req to have anonymous struct instead
- avoid "else" when the earlier if() ends in an unconditional control flow statement
- Add check to avoid chopping off set bits from cast
- make pointers pointer-to-const wherever possible
- remove noisy log
- exclude families before 0x17 before CPPC-feature MSR op
- remove useless variable helpers
- use xvzalloc and XVFREE
- refactor error handling as ENABLE bit can only be cleared by reset
---
v2 -> v3:
- Move all MSR definitions to msr-index.h and follow the required style
- Refactor opening figure braces for struct/union
- Sort overlong lines throughout the series
- Make offset/res int covering underflow scenario
- Error out when amd_max_freq_mhz isn't set
- Introduce amd_get_freq(name) macro to decrease redundancy
- Supported CPU family checked ahead of smp-function
- Nominal freq shall be checked between the [min, max]
- Use APERF/MPERF to calculate current frequency
- Use amd_cppc_cpufreq_cpu_exit() to tidy error path
---
v3 -> v4:
- verbose print shall come with a CPU number
- deal with res <= 0 in amd_cppc_khz_to_perf()
- introduce a single helper amd_get_lowest_or_nominal_freq() to cover both
the lowest and nominal scenarios
- reduce abuse of wrmsr_safe()/rdmsr_safe() with wrmsrl()/rdmsrl()
- move cf_check from amd_cppc_write_request() to amd_cppc_write_request_msrs()
- add comment to explain why setting non_linear_lowest in passive mode
- add check to ensure perf values satisfy
lowest <= non_linear_lowest <= nominal <= highest
- refactor comment for "data->err != 0" scenario
- use "data->err" instead of -ENODEV
- add U suffixes for all MSR macros
---
 xen/arch/x86/acpi/cpufreq/amd-cppc.c | 393 +++++++++++++++++++++++++++
 xen/arch/x86/include/asm/msr-index.h |   5 +
 2 files changed, 398 insertions(+)

diff --git a/xen/arch/x86/acpi/cpufreq/amd-cppc.c b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
index 8a081e5523..2fdfd17f59 100644
--- a/xen/arch/x86/acpi/cpufreq/amd-cppc.c
+++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
@@ -14,7 +14,56 @@
 #include <xen/domain.h>
 #include <xen/init.h>
 #include <xen/param.h>
+#include <xen/percpu.h>
+#include <xen/xvmalloc.h>
 #include <acpi/cpufreq/cpufreq.h>
+#include <asm/amd.h>
+#include <asm/msr-index.h>
+
+#define amd_cppc_err(cpu, fmt, args...)                             \
+    printk(XENLOG_ERR "AMD_CPPC: CPU%u error: " fmt, cpu, ## args)
+#define amd_cppc_warn(cpu, fmt, args...)                            \
+    printk(XENLOG_WARNING "AMD_CPPC: CPU%u warning: " fmt, cpu, ## args)
+#define amd_cppc_verbose(cpu, fmt, args...)                         \
+({                                                                  \
+    if ( cpufreq_verbose )                                          \
+        printk(XENLOG_DEBUG "AMD_CPPC: CPU%u " fmt, cpu, ## args);  \
+})
+
+struct amd_cppc_drv_data
+{
+    const struct xen_processor_cppc *cppc_data;
+    union {
+        uint64_t raw;
+        struct {
+            unsigned int lowest_perf:8;
+            unsigned int lowest_nonlinear_perf:8;
+            unsigned int nominal_perf:8;
+            unsigned int highest_perf:8;
+            unsigned int :32;
+        };
+    } caps;
+    union {
+        uint64_t raw;
+        struct {
+            unsigned int max_perf:8;
+            unsigned int min_perf:8;
+            unsigned int des_perf:8;
+            unsigned int epp:8;
+            unsigned int :32;
+        };
+    } req;
+
+    int err;
+};
+
+static DEFINE_PER_CPU_READ_MOSTLY(struct amd_cppc_drv_data *,
+                                  amd_cppc_drv_data);
+/*
+ * Core max frequency read from PstateDef as anchor point
+ * for freq-to-perf transition
+ */
+static DEFINE_PER_CPU_READ_MOSTLY(uint64_t, amd_max_pxfreq_mhz);
 
 static bool __init amd_cppc_handle_option(const char *s, const char *end)
 {
@@ -51,10 +100,354 @@ int __init amd_cppc_cmdline_parse(const char *s, const char *e)
     return 0;
 }
 
+/*
+ * If CPPC lowest_freq and nominal_freq registers are exposed then we can
+ * use them to convert perf to freq and vice versa. The conversion is
+ * extrapolated as an linear function passing by the 2 points:
+ *  - (Low perf, Low freq)
+ *  - (Nominal perf, Nominal freq)
+ */
+static int amd_cppc_khz_to_perf(const struct amd_cppc_drv_data *data,
+                                unsigned int freq, uint8_t *perf)
+{
+    const struct xen_processor_cppc *cppc_data = data->cppc_data;
+    uint64_t mul, div;
+    int offset = 0, res;
+
+    if ( freq == (cppc_data->cpc.nominal_mhz * 1000) )
+    {
+        *perf = data->caps.nominal_perf;
+        return 0;
+    }
+
+    if ( freq == (cppc_data->cpc.lowest_mhz * 1000) )
+    {
+        *perf = data->caps.lowest_perf;
+        return 0;
+    }
+
+    if ( cppc_data->cpc.lowest_mhz && cppc_data->cpc.nominal_mhz &&
+         cppc_data->cpc.lowest_mhz < cppc_data->cpc.nominal_mhz )
+    {
+        mul = data->caps.nominal_perf - data->caps.lowest_perf;
+        div = cppc_data->cpc.nominal_mhz - cppc_data->cpc.lowest_mhz;
+
+        /*
+         * We don't need to convert to kHz for computing offset and can
+         * directly use nominal_mhz and lowest_mhz as the division
+         * will remove the frequency unit.
+         */
+        offset = data->caps.nominal_perf -
+                 (mul * cppc_data->cpc.nominal_mhz) / div;
+    }
+    else
+    {
+        /* Read Processor Max Speed(MHz) as anchor point */
+        mul = data->caps.highest_perf;
+        div = this_cpu(amd_max_pxfreq_mhz);
+        if ( !div || div == INVAL_FREQ_MHZ )
+            return -EINVAL;
+    }
+
+    res = offset + (mul * freq) / (div * 1000);
+    if ( res > UINT8_MAX )
+    {
+        printk_once(XENLOG_WARNING
+                    "Perf value exceeds maximum value 255: %d\n", res);
+        *perf = 0xff;
+        return 0;
+    } else if ( res < 0 )
+    {
+        printk_once(XENLOG_WARNING
+                    "Perf value smaller than minimum value 0: %d\n", res);
+        *perf = 0;
+        return 0;
+    }
+    *perf = (uint8_t)res;
+
+    return 0;
+}
+
+static int amd_get_lowest_or_nominal_freq(const struct amd_cppc_drv_data *data,
+                                          uint32_t cpc_mhz, uint8_t perf,
+                                          unsigned int *freq)
+{
+    uint64_t mul, div, res;
+
+    if ( !freq )
+        return -EINVAL;
+
+    if ( cpc_mhz )
+    {
+        /* Switch to khz */
+        *freq = cpc_mhz * 1000;
+        return 0;
+    }
+
+    /* Read Processor Max Speed(MHz) as anchor point */
+    mul = this_cpu(amd_max_pxfreq_mhz);
+    if ( mul == INVAL_FREQ_MHZ || !mul )
+    {
+        printk(XENLOG_ERR
+               "Failed to read valid processor max frequency as anchor point: %lu\n",
+               mul);
+        return -EINVAL;
+    }
+    div = data->caps.highest_perf;
+    res = (mul * perf * 1000) / div;
+
+    if ( res > UINT_MAX || !res )
+    {
+        printk(XENLOG_ERR
+               "Frequeny exceeds maximum value UINT_MAX or being zero value: %lu\n",
+               res);
+        return -EINVAL;
+    }
+    *freq = (unsigned int)res;
+
+    return 0;
+}
+
+static int amd_get_max_freq(const struct amd_cppc_drv_data *data,
+                            unsigned int *max_freq)
+{
+    unsigned int nom_freq = 0, boost_ratio;
+    int res;
+
+    res = amd_get_lowest_or_nominal_freq(data,
+                                         data->cppc_data->cpc.nominal_mhz,
+                                         data->caps.nominal_perf,
+                                         &nom_freq);
+    if ( res )
+        return res;
+
+    boost_ratio = (unsigned int)(data->caps.highest_perf /
+                                 data->caps.nominal_perf);
+    *max_freq = nom_freq * boost_ratio;
+
+    return 0;
+}
+
+static int cf_check amd_cppc_cpufreq_verify(struct cpufreq_policy *policy)
+{
+    cpufreq_verify_within_limits(policy, policy->cpuinfo.min_freq,
+                                 policy->cpuinfo.max_freq);
+
+    return 0;
+}
+
+static void cf_check amd_cppc_write_request_msrs(void *info)
+{
+    const struct amd_cppc_drv_data *data = info;
+
+    wrmsrl(MSR_AMD_CPPC_REQ, data->req.raw);
+}
+
+static void amd_cppc_write_request(unsigned int cpu, uint8_t min_perf,
+                                   uint8_t des_perf, uint8_t max_perf)
+{
+    struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
+    uint64_t prev = data->req.raw;
+
+    data->req.min_perf = min_perf;
+    data->req.max_perf = max_perf;
+    data->req.des_perf = des_perf;
+
+    if ( prev == data->req.raw )
+        return;
+
+    on_selected_cpus(cpumask_of(cpu), amd_cppc_write_request_msrs, data, 1);
+}
+
+static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
+                                            unsigned int target_freq,
+                                            unsigned int relation)
+{
+    unsigned int cpu = policy->cpu;
+    const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
+    uint8_t des_perf;
+    int res;
+
+    if ( unlikely(!target_freq) )
+        return 0;
+
+    res = amd_cppc_khz_to_perf(data, target_freq, &des_perf);
+    if ( res )
+        return res;
+
+    /*
+     * Setting with "lowest_nonlinear_perf" to ensure governoring
+     * performance in P-state range.
+     */
+    amd_cppc_write_request(policy->cpu, data->caps.lowest_nonlinear_perf,
+                           des_perf, data->caps.highest_perf);
+    return 0;
+}
+
+static void cf_check amd_cppc_init_msrs(void *info)
+{
+    struct cpufreq_policy *policy = info;
+    struct amd_cppc_drv_data *data = this_cpu(amd_cppc_drv_data);
+    uint64_t val;
+    unsigned int min_freq = 0, nominal_freq = 0, max_freq;
+
+    /* Package level MSR */
+    rdmsrl(MSR_AMD_CPPC_ENABLE, val);
+    /*
+     * Only when Enable bit is on, the hardware will calculate the processor’s
+     * performance capabilities and initialize the performance level fields in
+     * the CPPC capability registers.
+     */
+    if ( !(val & AMD_CPPC_ENABLE) )
+    {
+        val |= AMD_CPPC_ENABLE;
+        wrmsrl(MSR_AMD_CPPC_ENABLE, val);
+    }
+
+    rdmsrl(MSR_AMD_CPPC_CAP1, data->caps.raw);
+
+    if ( data->caps.highest_perf == 0 || data->caps.lowest_perf == 0 ||
+         data->caps.nominal_perf == 0 || data->caps.lowest_nonlinear_perf == 0 ||
+         data->caps.lowest_perf > data->caps.lowest_nonlinear_perf ||
+         data->caps.lowest_nonlinear_perf > data->caps.nominal_perf ||
+         data->caps.nominal_perf > data->caps.highest_perf )
+    {
+        amd_cppc_err(policy->cpu,
+                     "Platform malfunction, read CPPC capability highest(%u), lowest(%u), nominal(%u), lowest_nonlinear(%u) zero value\n",
+                     data->caps.highest_perf, data->caps.lowest_perf,
+                     data->caps.nominal_perf, data->caps.lowest_nonlinear_perf);
+        goto err;
+    }
+
+    amd_process_freq(cpu_data + policy->cpu,
+                     NULL, NULL, &this_cpu(amd_max_pxfreq_mhz));
+
+    data->err = amd_get_lowest_or_nominal_freq(data,
+                                               data->cppc_data->cpc.lowest_mhz,
+                                               data->caps.lowest_perf,
+                                               &min_freq);
+    if ( data->err )
+        return;
+
+    data->err = amd_get_lowest_or_nominal_freq(data,
+                                               data->cppc_data->cpc.nominal_mhz,
+                                               data->caps.nominal_perf,
+                                               &nominal_freq);
+    if ( data->err )
+        return;
+
+    data->err = amd_get_max_freq(data, &max_freq);
+    if ( data->err )
+        return;
+
+    if ( min_freq > nominal_freq || nominal_freq > max_freq )
+    {
+        amd_cppc_err(policy->cpu,
+                     "min(%u), or max(%u), or nominal(%u) freq value is incorrect\n",
+                     min_freq, max_freq, nominal_freq);
+        goto err;
+    }
+
+    policy->min = min_freq;
+    policy->max = max_freq;
+
+    policy->cpuinfo.min_freq = min_freq;
+    policy->cpuinfo.max_freq = max_freq;
+    policy->cpuinfo.perf_freq = nominal_freq;
+    /*
+     * Set after policy->cpuinfo.perf_freq, as we are taking
+     * APERF/MPERF average frequency as current frequency.
+     */
+    policy->cur = cpufreq_driver_getavg(policy->cpu, GOV_GETAVG);
+
+    return;
+
+ err:
+    /*
+     * No fallback shceme is available here, see more explanation at call
+     * site in amd_cppc_cpufreq_cpu_init().
+     */
+    data->err = -EINVAL;
+}
+
+/*
+ * AMD CPPC driver is different than legacy ACPI hardware P-State,
+ * which has a finer grain frequency range between the highest and lowest
+ * frequency. And boost frequency is actually the frequency which is mapped on
+ * highest performance ratio. The legacy P0 frequency is actually mapped on
+ * nominal performance ratio.
+ */
+static void amd_cppc_boost_init(struct cpufreq_policy *policy,
+                                const struct amd_cppc_drv_data *data)
+{
+    if ( data->caps.highest_perf <= data->caps.nominal_perf )
+        return;
+
+    policy->turbo = CPUFREQ_TURBO_ENABLED;
+}
+
+static int cf_check amd_cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+    XVFREE(per_cpu(amd_cppc_drv_data, policy->cpu));
+
+    return 0;
+}
+
+static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+    unsigned int cpu = policy->cpu;
+    struct amd_cppc_drv_data *data;
+    const struct cpuinfo_x86 *c = cpu_data + cpu;
+
+    data = xvzalloc(struct amd_cppc_drv_data);
+    if ( !data )
+        return -ENOMEM;
+
+    data->cppc_data = &processor_pminfo[cpu]->cppc_data;
+
+    per_cpu(amd_cppc_drv_data, cpu) = data;
+
+    /* Feature CPPC is firstly introduced on Zen2 */
+    if ( c->x86 < 0x17 )
+    {
+        printk_once("Unsupported cpu family: %x\n", c->x86);
+        return -EOPNOTSUPP;
+    }
+
+    on_selected_cpus(cpumask_of(cpu), amd_cppc_init_msrs, policy, 1);
+
+    /*
+     * The enable bit is sticky, as we need to enable it at the very first
+     * begining, before CPPC capability sanity check.
+     * If error path takes effective, not only amd-cppc cpufreq driver fails
+     * to initialize, but also we could not fall back to legacy P-states
+     * driver, irrespective of the command line specifying a fallback option.
+     */
+    if ( data->err )
+    {
+        amd_cppc_err(cpu, "Could not initialize AMD CPPC MSR properly\n");
+        amd_cppc_cpufreq_cpu_exit(policy);
+        return data->err;
+    }
+
+    policy->governor = cpufreq_opt_governor ? : CPUFREQ_DEFAULT_GOVERNOR;
+
+    amd_cppc_boost_init(policy, data);
+
+    amd_cppc_verbose(policy->cpu,
+                     "CPU initialized with amd-cppc passive mode\n");
+
+    return 0;
+}
+
 static const struct cpufreq_driver __initconst_cf_clobber
 amd_cppc_cpufreq_driver =
 {
     .name   = XEN_AMD_CPPC_DRIVER_NAME,
+    .verify = amd_cppc_cpufreq_verify,
+    .target = amd_cppc_cpufreq_target,
+    .init   = amd_cppc_cpufreq_cpu_init,
+    .exit   = amd_cppc_cpufreq_cpu_exit,
 };
 
 int __init amd_cppc_register_driver(void)
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index 22d9e76e55..3ffa613df0 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -238,6 +238,11 @@
 
 #define MSR_AMD_CSTATE_CFG                  0xc0010296U
 
+#define MSR_AMD_CPPC_CAP1                   0xc00102b0U
+#define MSR_AMD_CPPC_ENABLE                 0xc00102b1U
+#define  AMD_CPPC_ENABLE                    (_AC(1, ULL) <<  0)
+#define MSR_AMD_CPPC_REQ                    0xc00102b3U
+
 /*
  * Legacy MSR constants in need of cleanup.  No new MSRs below this comment.
  */
-- 
2.34.1


Re: [PATCH v4 09/15] xen/x86: introduce a new amd cppc driver for cpufreq scaling
Posted by Jan Beulich 2 weeks, 2 days ago
On 14.04.2025 09:40, Penny Zheng wrote:
> --- a/xen/arch/x86/acpi/cpufreq/amd-cppc.c
> +++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
> @@ -14,7 +14,56 @@
>  #include <xen/domain.h>
>  #include <xen/init.h>
>  #include <xen/param.h>
> +#include <xen/percpu.h>
> +#include <xen/xvmalloc.h>
>  #include <acpi/cpufreq/cpufreq.h>
> +#include <asm/amd.h>
> +#include <asm/msr-index.h>
> +
> +#define amd_cppc_err(cpu, fmt, args...)                             \
> +    printk(XENLOG_ERR "AMD_CPPC: CPU%u error: " fmt, cpu, ## args)
> +#define amd_cppc_warn(cpu, fmt, args...)                            \
> +    printk(XENLOG_WARNING "AMD_CPPC: CPU%u warning: " fmt, cpu, ## args)
> +#define amd_cppc_verbose(cpu, fmt, args...)                         \
> +({                                                                  \
> +    if ( cpufreq_verbose )                                          \
> +        printk(XENLOG_DEBUG "AMD_CPPC: CPU%u " fmt, cpu, ## args);  \
> +})
> +
> +struct amd_cppc_drv_data
> +{
> +    const struct xen_processor_cppc *cppc_data;
> +    union {
> +        uint64_t raw;
> +        struct {
> +            unsigned int lowest_perf:8;
> +            unsigned int lowest_nonlinear_perf:8;
> +            unsigned int nominal_perf:8;
> +            unsigned int highest_perf:8;
> +            unsigned int :32;
> +        };
> +    } caps;
> +    union {
> +        uint64_t raw;
> +        struct {
> +            unsigned int max_perf:8;
> +            unsigned int min_perf:8;
> +            unsigned int des_perf:8;
> +            unsigned int epp:8;
> +            unsigned int :32;
> +        };
> +    } req;
> +
> +    int err;
> +};
> +
> +static DEFINE_PER_CPU_READ_MOSTLY(struct amd_cppc_drv_data *,
> +                                  amd_cppc_drv_data);
> +/*
> + * Core max frequency read from PstateDef as anchor point
> + * for freq-to-perf transition
> + */
> +static DEFINE_PER_CPU_READ_MOSTLY(uint64_t, amd_max_pxfreq_mhz);

Throughout here (and maybe also further down) - is the amd_ prefix really
needed everywhere? Even the cppc part of the identifiers seems excessive
in at least some of the cases.

For this last one, also once again see the type related comment on patch 07.

> @@ -51,10 +100,354 @@ int __init amd_cppc_cmdline_parse(const char *s, const char *e)
>      return 0;
>  }
>  
> +/*
> + * If CPPC lowest_freq and nominal_freq registers are exposed then we can
> + * use them to convert perf to freq and vice versa. The conversion is
> + * extrapolated as an linear function passing by the 2 points:
> + *  - (Low perf, Low freq)
> + *  - (Nominal perf, Nominal freq)
> + */
> +static int amd_cppc_khz_to_perf(const struct amd_cppc_drv_data *data,
> +                                unsigned int freq, uint8_t *perf)
> +{
> +    const struct xen_processor_cppc *cppc_data = data->cppc_data;
> +    uint64_t mul, div;
> +    int offset = 0, res;
> +
> +    if ( freq == (cppc_data->cpc.nominal_mhz * 1000) )

I'm pretty sure I commented on this before: The expression here _suggests_
that "freq" is in kHz, but that's not being made explicit anywhere.

> +    {
> +        *perf = data->caps.nominal_perf;
> +        return 0;
> +    }
> +
> +    if ( freq == (cppc_data->cpc.lowest_mhz * 1000) )
> +    {
> +        *perf = data->caps.lowest_perf;
> +        return 0;
> +    }

How likely is it that these two early return paths are taken, when the
incoming "freq" is 25 or 5 MHz granular? IOW - is it relevant to shortcut
these two cases?

> +    if ( cppc_data->cpc.lowest_mhz && cppc_data->cpc.nominal_mhz &&
> +         cppc_data->cpc.lowest_mhz < cppc_data->cpc.nominal_mhz )

Along the lines of a comment on an earlier patch, the middle part of the
condition here is redundant with the 3rd one. Also, don't you check this
relation already during init? IOW isn't it the 3rd part which can be
dropped?

> +    {
> +        mul = data->caps.nominal_perf - data->caps.lowest_perf;
> +        div = cppc_data->cpc.nominal_mhz - cppc_data->cpc.lowest_mhz;
> +
> +        /*
> +         * We don't need to convert to kHz for computing offset and can
> +         * directly use nominal_mhz and lowest_mhz as the division
> +         * will remove the frequency unit.
> +         */
> +        offset = data->caps.nominal_perf -
> +                 (mul * cppc_data->cpc.nominal_mhz) / div;
> +    }
> +    else
> +    {
> +        /* Read Processor Max Speed(MHz) as anchor point */
> +        mul = data->caps.highest_perf;
> +        div = this_cpu(amd_max_pxfreq_mhz);
> +        if ( !div || div == INVAL_FREQ_MHZ )

Seeing this - do we need INVAL_FREQ_MHZ at all? Isn't 0 good enough an indicator
of "data not available"?

> +            return -EINVAL;
> +    }
> +
> +    res = offset + (mul * freq) / (div * 1000);
> +    if ( res > UINT8_MAX )
> +    {
> +        printk_once(XENLOG_WARNING
> +                    "Perf value exceeds maximum value 255: %d\n", res);
> +        *perf = 0xff;
> +        return 0;
> +    } else if ( res < 0 )

Nit: Style. And no real need for "else" here anyway (or alternatively for the
"return 0", with an "else" added below).

> +    {
> +        printk_once(XENLOG_WARNING
> +                    "Perf value smaller than minimum value 0: %d\n", res);
> +        *perf = 0;
> +        return 0;
> +    }
> +    *perf = (uint8_t)res;

We don't normally spell out such conversions (i.e. please drop the cast).

> +    return 0;
> +}
> +
> +static int amd_get_lowest_or_nominal_freq(const struct amd_cppc_drv_data *data,

Nothing in the function looks to limit it to "nominal" or "lowest", so them
being in the identifier feels misleading.

> +                                          uint32_t cpc_mhz, uint8_t perf,
> +                                          unsigned int *freq)
> +{
> +    uint64_t mul, div, res;
> +
> +    if ( !freq )
> +        return -EINVAL;

Is this needed? It's an internal function.

> +    if ( cpc_mhz )
> +    {
> +        /* Switch to khz */
> +        *freq = cpc_mhz * 1000;
> +        return 0;
> +    }
> +
> +    /* Read Processor Max Speed(MHz) as anchor point */
> +    mul = this_cpu(amd_max_pxfreq_mhz);
> +    if ( mul == INVAL_FREQ_MHZ || !mul )
> +    {
> +        printk(XENLOG_ERR
> +               "Failed to read valid processor max frequency as anchor point: %lu\n",
> +               mul);
> +        return -EINVAL;
> +    }
> +    div = data->caps.highest_perf;
> +    res = (mul * perf * 1000) / div;

While there is a comment further up, clarifying function-wide that the result is
to be in kHz would perhaps be better.

> +    if ( res > UINT_MAX || !res )
> +    {
> +        printk(XENLOG_ERR
> +               "Frequeny exceeds maximum value UINT_MAX or being zero value: %lu\n",

Just "out of range"?

> +               res);
> +        return -EINVAL;

And then -ERANGE here?

> +    }
> +    *freq = (unsigned int)res;

See above.
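For reference, the perf-to-kHz direction being discussed here, with Jan's suggested "out of range" message and -ERANGE return folded in, could look roughly like this sketch (the capability numbers in the usage note are invented):

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/*
 * Convert a perf value to kHz using the processor max frequency (in MHz)
 * as anchor point: freq_khz = max_mhz * perf * 1000 / highest_perf.
 * Returns 0 on success, -EINVAL for a missing anchor, -ERANGE when the
 * result is out of range.
 */
static int perf_to_khz(uint64_t max_mhz, uint8_t highest_perf, uint8_t perf,
                       unsigned int *freq_khz)
{
    uint64_t res;

    if ( !max_mhz || !highest_perf )
        return -EINVAL;

    res = (max_mhz * perf * 1000) / highest_perf;
    if ( !res || res > UINT_MAX )
        return -ERANGE;

    *freq_khz = res;
    return 0;
}
```

E.g. with a 4000 MHz anchor and highest_perf 200, perf 100 maps to 2000000 kHz.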

> +static int amd_get_max_freq(const struct amd_cppc_drv_data *data,
> +                            unsigned int *max_freq)
> +{
> +    unsigned int nom_freq = 0, boost_ratio;
> +    int res;
> +
> +    res = amd_get_lowest_or_nominal_freq(data,
> +                                         data->cppc_data->cpc.nominal_mhz,
> +                                         data->caps.nominal_perf,
> +                                         &nom_freq);
> +    if ( res )
> +        return res;
> +
> +    boost_ratio = (unsigned int)(data->caps.highest_perf /
> +                                 data->caps.nominal_perf);

I may have seen logic ensuring nominal_perf isn't 0, so that part may be
fine. What guarantees this division to yield a positive value, though?
If it yields zero (say 0xff / 0x80), ...

> +    *max_freq = nom_freq * boost_ratio;

... zero will be reported back here. I think you want to scale the
calculations here to avoid such.

> +static void cf_check amd_cppc_init_msrs(void *info)
> +{
> +    struct cpufreq_policy *policy = info;
> +    struct amd_cppc_drv_data *data = this_cpu(amd_cppc_drv_data);
> +    uint64_t val;
> +    unsigned int min_freq = 0, nominal_freq = 0, max_freq;
> +
> +    /* Package level MSR */
> +    rdmsrl(MSR_AMD_CPPC_ENABLE, val);
> +    /*
> +     * Only when Enable bit is on, the hardware will calculate the processor’s
> +     * performance capabilities and initialize the performance level fields in
> +     * the CPPC capability registers.
> +     */
> +    if ( !(val & AMD_CPPC_ENABLE) )
> +    {
> +        val |= AMD_CPPC_ENABLE;
> +        wrmsrl(MSR_AMD_CPPC_ENABLE, val);
> +    }
> +
> +    rdmsrl(MSR_AMD_CPPC_CAP1, data->caps.raw);
> +
> +    if ( data->caps.highest_perf == 0 || data->caps.lowest_perf == 0 ||
> +         data->caps.nominal_perf == 0 || data->caps.lowest_nonlinear_perf == 0 ||
> +         data->caps.lowest_perf > data->caps.lowest_nonlinear_perf ||

Same question as asked elsewhere - where is this relation specified?

> +         data->caps.lowest_nonlinear_perf > data->caps.nominal_perf ||
> +         data->caps.nominal_perf > data->caps.highest_perf )
> +    {
> +        amd_cppc_err(policy->cpu,
> +                     "Platform malfunction, read CPPC capability highest(%u), lowest(%u), nominal(%u), lowest_nonlinear(%u) zero value\n",

The message wants shortening if at all possible, and it wants to not be
misleading (saying "zero" when the issue may be something else).

> +                     data->caps.highest_perf, data->caps.lowest_perf,
> +                     data->caps.nominal_perf, data->caps.lowest_nonlinear_perf);
> +        goto err;
> +    }
> +
> +    amd_process_freq(cpu_data + policy->cpu,

&cpu_data[policy->cpu] is generally preferred in such cases.

> +static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct amd_cppc_drv_data *data;
> +    const struct cpuinfo_x86 *c = cpu_data + cpu;
> +
> +    data = xvzalloc(struct amd_cppc_drv_data);
> +    if ( !data )
> +        return -ENOMEM;
> +
> +    data->cppc_data = &processor_pminfo[cpu]->cppc_data;
> +
> +    per_cpu(amd_cppc_drv_data, cpu) = data;
> +
> +    /* Feature CPPC is firstly introduced on Zen2 */
> +    if ( c->x86 < 0x17 )
> +    {
> +        printk_once("Unsupported cpu family: %x\n", c->x86);
> +        return -EOPNOTSUPP;
> +    }

With the cpu_has_cppc check in amd_cppc_register_driver(), is this check
really needed?

Jan

RE: [PATCH v4 09/15] xen/x86: introduce a new amd cppc driver for cpufreq scaling
Posted by Penny, Zheng 1 week, 1 day ago

Hi,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, April 29, 2025 10:29 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Roger Pau Monné <roger.pau@citrix.com>; xen-
> devel@lists.xenproject.org
> Subject: Re: [PATCH v4 09/15] xen/x86: introduce a new amd cppc driver for
> cpufreq scaling
>
> On 14.04.2025 09:40, Penny Zheng wrote:
> > --- a/xen/arch/x86/acpi/cpufreq/amd-cppc.c
> > +++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
> > +/*
> > + * If CPPC lowest_freq and nominal_freq registers are exposed then we
> > +can
> > + * use them to convert perf to freq and vice versa. The conversion is
> > + * extrapolated as an linear function passing by the 2 points:
> > + *  - (Low perf, Low freq)
> > + *  - (Nominal perf, Nominal freq)
> > + */
> > +static int amd_cppc_khz_to_perf(const struct amd_cppc_drv_data *data,
> > +                                unsigned int freq, uint8_t *perf) {
> > +    const struct xen_processor_cppc *cppc_data = data->cppc_data;
> > +    uint64_t mul, div;
> > +    int offset = 0, res;
> > +
> > +    if ( freq == (cppc_data->cpc.nominal_mhz * 1000) )
>
> I'm pretty sure I commented on this before: The expression here _suggests_ that
> "freq" is in kHz, but that's not being made explicit anywhere.
>

Sorry, I may have overlooked it; I'll be more careful.
I have clarified it in the function comment, but maybe that's not enough. I'll change the
parameter name from "freq" to "freq_khz" to be more explicit.

> > +    {
> > +        *perf = data->caps.nominal_perf;
> > +        return 0;
> > +    }
> > +
> > +    if ( freq == (cppc_data->cpc.lowest_mhz * 1000) )
> > +    {
> > +        *perf = data->caps.lowest_perf;
> > +        return 0;
> > +    }
>
> How likely is it that these two early return paths are taken, when the incoming
> "freq" is 25 or 5 MHz granular? IOW - is it relevant to shortcut these two cases?
>

Sorry, I may not fully understand what you mean here.
Answering "How likely is it that these two early return paths are taken":
it's rare, I guess. The *ondemand* governor may frequently request frequencies around the
nominal frequency, but hitting the exact value is rare, I guess.
I'm confused about "when the incoming "freq" is 25 or 5 MHz granular".
Are we discussing whether it is worth having these two early return paths, given they are rarely taken?

> > +    if ( cppc_data->cpc.lowest_mhz && cppc_data->cpc.nominal_mhz &&
> > +         cppc_data->cpc.lowest_mhz < cppc_data->cpc.nominal_mhz )
>
> Along the lines of a comment on an earlier patch, the middle part of the condition
> here is redundant with the 3rd one. Also, don't you check this relation already
> during init? IOW isn't it the 3rd part which can be dropped?
>

Yes, you're right. I've checked this in set_cppc_pminfo() already, and only gave warnings there.
I shall delete the check here. Besides giving a warning message during init, if the values are
invalid, instead of storing them we shall set cppc_data->cpc.lowest_mhz / cppc_data->cpc.nominal_mhz to
zero. Then wherever we try to use them, like here, a non-zero value guarantees a valid value.
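That sanitise-at-init idea could look something like the sketch below; set_cppc_pminfo() and the field names follow the series, but the simplified struct layout here is assumed for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the relevant _CPC frequency fields. */
struct cpc_freqs {
    uint32_t lowest_mhz;
    uint32_t nominal_mhz;
};

/*
 * Zero the optional frequency registers when they don't describe a
 * usable (lowest < nominal) pair, so later consumers only need a
 * non-zero test to know the values are valid.
 */
static void sanitize_cpc_freqs(struct cpc_freqs *cpc)
{
    if ( !cpc->lowest_mhz || !cpc->nominal_mhz ||
         cpc->lowest_mhz >= cpc->nominal_mhz )
        cpc->lowest_mhz = cpc->nominal_mhz = 0;
}
```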

> > +static int amd_get_max_freq(const struct amd_cppc_drv_data *data,
> > +                            unsigned int *max_freq) {
> > +    unsigned int nom_freq = 0, boost_ratio;
> > +    int res;
> > +
> > +    res = amd_get_lowest_or_nominal_freq(data,
> > +                                         data->cppc_data->cpc.nominal_mhz,
> > +                                         data->caps.nominal_perf,
> > +                                         &nom_freq);
> > +    if ( res )
> > +        return res;
> > +
> > +    boost_ratio = (unsigned int)(data->caps.highest_perf /
> > +                                 data->caps.nominal_perf);
>
> I may have seen logic ensuring nominal_perf isn't 0, so that part may be fine. What
> guarantees this division to yield a positive value, though?
> If it yields zero (say 0xff / 0x80), ...
>

I think maybe you meant 0x80/0xff yielding a zero value. For that case, we checked that highest_perf
must not be smaller than nominal_perf during init, see ...

> > +    *max_freq = nom_freq * boost_ratio;
>
> ... zero will be reported back here. I think you want to scale the calculations here to
> avoid such.
>
> > +static void cf_check amd_cppc_init_msrs(void *info) {
> > +    struct cpufreq_policy *policy = info;
> > +    struct amd_cppc_drv_data *data = this_cpu(amd_cppc_drv_data);
> > +    uint64_t val;
> > +    unsigned int min_freq = 0, nominal_freq = 0, max_freq;
> > +
> > +    /* Package level MSR */
> > +    rdmsrl(MSR_AMD_CPPC_ENABLE, val);
> > +    /*
> > +     * Only when Enable bit is on, the hardware will calculate the processor’s
> > +     * performance capabilities and initialize the performance level fields in
> > +     * the CPPC capability registers.
> > +     */
> > +    if ( !(val & AMD_CPPC_ENABLE) )
> > +    {
> > +        val |= AMD_CPPC_ENABLE;
> > +        wrmsrl(MSR_AMD_CPPC_ENABLE, val);
> > +    }
> > +
> > +    rdmsrl(MSR_AMD_CPPC_CAP1, data->caps.raw);
> > +
> > +    if ( data->caps.highest_perf == 0 || data->caps.lowest_perf == 0 ||
> > +         data->caps.nominal_perf == 0 || data->caps.lowest_nonlinear_perf == 0 ||
> > +         data->caps.lowest_perf > data->caps.lowest_nonlinear_perf ||
>
> Same question as asked elsewhere - where is this relation specified?
>
> > +         data->caps.lowest_nonlinear_perf > data->caps.nominal_perf ||
> > +         data->caps.nominal_perf > data->caps.highest_perf )

here ...

>
> Jan
Re: [PATCH v4 09/15] xen/x86: introduce a new amd cppc driver for cpufreq scaling
Posted by Jan Beulich 2 days, 7 hours ago
On 07.05.2025 11:19, Penny, Zheng wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Tuesday, April 29, 2025 10:29 PM
>>
>> On 14.04.2025 09:40, Penny Zheng wrote:
>>> --- a/xen/arch/x86/acpi/cpufreq/amd-cppc.c
>>> +++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
>>> +/*
>>> + * If CPPC lowest_freq and nominal_freq registers are exposed then we can
>>> + * use them to convert perf to freq and vice versa. The conversion is
>>> + * extrapolated as a linear function passing through the 2 points:
>>> + *  - (Low perf, Low freq)
>>> + *  - (Nominal perf, Nominal freq)
>>> + */
>>> +static int amd_cppc_khz_to_perf(const struct amd_cppc_drv_data *data,
>>> +                                unsigned int freq, uint8_t *perf)
>>> +{
>>> +    const struct xen_processor_cppc *cppc_data = data->cppc_data;
>>> +    uint64_t mul, div;
>>> +    int offset = 0, res;
>>> +
>>> +    if ( freq == (cppc_data->cpc.nominal_mhz * 1000) )
>>> +    {
>>> +        *perf = data->caps.nominal_perf;
>>> +        return 0;
>>> +    }
>>> +
>>> +    if ( freq == (cppc_data->cpc.lowest_mhz * 1000) )
>>> +    {
>>> +        *perf = data->caps.lowest_perf;
>>> +        return 0;
>>> +    }
>>
>> How likely is it that these two early return paths are taken, when the incoming
>> "freq" is 25 or 5 MHz granular? IOW - is it relevant to shortcut these two cases?
>>
> 
> Sorry, I may not fully understand what you mean here.
> Answering "How likely is it that these two early return paths are taken":
> it's rare, I guess... maybe the *ondemand* governor will frequently request frequencies
> around the nominal frequency, but hitting the exact value is rare.
> I'm confused about "when the incoming "freq" is 25 or 5 MHz granular".
> Are we asking whether it is worth having these two early return paths, considering they
> are rarely taken?

Yes - why have extra code which is expected to be hardly ever used. Things
would be different if such shortcuts covered common cases.
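For reference, the two-point linear extrapolation the patch comment describes can be sketched without any shortcuts as a standalone toy function (parameter layout and names here are illustrative assumptions, not the patch's actual interface):

```c
#include <stdint.h>

/*
 * Linear interpolation through (lowest_mhz, lowest_perf) and
 * (nominal_mhz, nominal_perf), taking freq in kHz.  64-bit
 * intermediates avoid overflow; signed arithmetic covers the
 * freq < lowest case, so no early-return special cases are needed.
 */
static uint8_t khz_to_perf(unsigned int freq_khz,
                           unsigned int lowest_mhz, uint8_t lowest_perf,
                           unsigned int nominal_mhz, uint8_t nominal_perf)
{
    int64_t num = (int64_t)(nominal_perf - lowest_perf) *
                  ((int64_t)freq_khz - (int64_t)lowest_mhz * 1000);
    int64_t den = ((int64_t)nominal_mhz - lowest_mhz) * 1000;

    return (uint8_t)(lowest_perf + num / den);
}
```

Since the interpolation reproduces both anchor points exactly, the two early returns add no precision, only code size.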

>>> +static int amd_get_max_freq(const struct amd_cppc_drv_data *data,
>>> +                            unsigned int *max_freq)
>>> +{
>>> +    unsigned int nom_freq = 0, boost_ratio;
>>> +    int res;
>>> +
>>> +    res = amd_get_lowest_or_nominal_freq(data,
>>> +                                         data->cppc_data->cpc.nominal_mhz,
>>> +                                         data->caps.nominal_perf,
>>> +                                         &nom_freq);
>>> +    if ( res )
>>> +        return res;
>>> +
>>> +    boost_ratio = (unsigned int)(data->caps.highest_perf /
>>> +                                 data->caps.nominal_perf);
>>
>> I may have seen logic ensuring nominal_perf isn't 0, so that part may be fine. What
>> guarantees this division to yield a positive value, though?
>> If it yields zero (say 0xff / 0x80), ...
> 
> I think maybe you meant 0x80 / 0xff, which would yield a zero value. For that, we checked
> during init that highest_perf must not be smaller than nominal_perf, see ...

No, I indeed meant 0xff / 0x80, but yes, that will yield 1, not 0. My main
point though was about this calculation being pretty far off the real value
(rather closer to 2), i.e. in the overall calculation (including ...

>>> +    *max_freq = nom_freq * boost_ratio;

... this multiplication) some scaling may want introducing, or the
multiplication may want doing ahead of the division.
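A minimal sketch of the reordering being suggested, multiplying before dividing so the truncated integer ratio (e.g. 0xff / 0x80 yielding 1 rather than ~2) doesn't swallow most of the boost headroom (the helper and its signature are illustrative, not the patch's code):

```c
#include <stdint.h>

/*
 * Compute max_freq = nom_freq * highest_perf / nominal_perf with a
 * 64-bit intermediate product, instead of first truncating the
 * highest_perf / nominal_perf ratio to an integer boost_ratio.
 * Caller must guarantee nominal_perf != 0 (checked during init).
 */
static unsigned int calc_max_freq(unsigned int nom_freq,
                                  uint8_t highest_perf,
                                  uint8_t nominal_perf)
{
    /* nom_freq (kHz) * an 8-bit perf value fits easily in 64 bits. */
    return (unsigned int)((uint64_t)nom_freq * highest_perf /
                          nominal_perf);
}
```

For nom_freq = 2000000 kHz, highest_perf = 0xff, nominal_perf = 0x80, this yields 3984375 kHz rather than the 2000000 kHz the truncated-ratio version would report.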

Jan