[PATCH v8 3/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode

Posted by Penny Zheng 2 months ago
amd-cppc is the AMD CPU performance scaling driver that introduces a
new CPU frequency control mechanism. The new mechanism is based on
Collaborative Processor Performance Control (CPPC), which provides
finer-grained frequency management than legacy ACPI hardware P-states.
Current AMD CPU platforms use the ACPI P-states driver to manage CPU
frequency and clocks, switching among only 3 P-states, while the
new amd-cppc driver offers a more flexible, low-latency interface for Xen
to communicate performance hints directly to the hardware.

The "amd-cppc" driver is responsible for implementing CPPC in passive mode,
which still leverages Xen governors such as *ondemand*, *performance*, etc.,
to calculate the performance hints. In the future, we will introduce an
advanced active mode to enable autonomous performance level selection.

The epp (energy performance preference) field only has meaning when active
mode is enabled and will be introduced in detail later, so in passive mode
we read and preserve the pre-defined BIOS value for it.

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
---
v1 -> v2:
- re-construct union caps and req to have anonymous struct instead
- avoid "else" when the earlier if() ends in an unconditional control flow statement
- Add check to avoid chopping off set bits from cast
- make pointers pointer-to-const wherever possible
- remove noisy log
- exclude families before 0x17 before CPPC-feature MSR op
- remove useless variable helpers
- use xvzalloc and XVFREE
- refactor error handling as ENABLE bit can only be cleared by reset
---
v2 -> v3:
- Move all MSR definitions to msr-index.h and follow the required style
- Refactor opening figure braces for struct/union
- Sort out overlong lines throughout the series
- Make offset/res int covering underflow scenario
- Error out when amd_max_freq_mhz isn't set
- Introduce amd_get_freq(name) macro to decrease redundancy
- Supported CPU family checked ahead of smp-function
- Nominal freq shall be checked between the [min, max]
- Use APERF/MPERF to calculate current frequency
- Use amd_cppc_cpufreq_cpu_exit() to tidy error path
---
v3 -> v4:
- verbose print shall come with a CPU number
- deal with res <= 0 in amd_cppc_khz_to_perf()
- introduce a single helper amd_get_lowest_or_nominal_freq() to cover both
lowest and nominal scenarios
- reduce abuse of wrmsr_safe()/rdmsr_safe() with wrmsrl()/rdmsrl()
- move cf_check from amd_cppc_write_request() to amd_cppc_write_request_msrs()
- add comment to explain why setting non_linear_lowest in passive mode
- add check to ensure perf values in
lowest <= non_linear_lowest <= nominal <= highest
- refactor comment for "data->err != 0" scenario
- use "data->err" instead of -ENODEV
- add U suffixes for all MSR macros
---
v4 -> v5:
- all freq-values shall be unsigned int type
- remove shortcuts as it is rarely taken
- checking cpc.nominal_mhz and cpc.lowest_mhz are non-zero values is enough
- drop the explicit type cast
- null pointer check is in no need for internal functions
- change amd_get_lowest_or_nominal_freq() to amd_get_cpc_freq()
- clarifying function-wide that the calculated frequency result is to be in kHz
- use array notation
- with cpu_has_cppc check, no need to do cpu family check
---
v5 -> v6
- replace "AMD_CPPC" with "AMD-CPPC" in message
- add equation(mul,div) non-zero check
- replace -EINVAL with -EOPNOTSUPP
- refactor comment
---
v6 -> v7
- used > in place of !=, to not only serve a doc aspect, but also allow to
drop one part
- unify with UINT8_MAX
- return -ERANGE as we reject perf values of 0 as invalid
- replace uint32_t with unsigned int
- Move some of the epp introduction here, otherwise we would mis-handle this
field by always clearing it
---
v7 -> v8:
- refine message text by removing 0
---
 xen/arch/x86/acpi/cpufreq/amd-cppc.c | 418 ++++++++++++++++++++++++++-
 xen/arch/x86/cpu/amd.c               |   8 +-
 xen/arch/x86/include/asm/amd.h       |   2 +
 xen/arch/x86/include/asm/msr-index.h |   6 +
 xen/include/public/sysctl.h          |   1 +
 5 files changed, 430 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/acpi/cpufreq/amd-cppc.c b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
index 3377783f7e..5b99b86fb7 100644
--- a/xen/arch/x86/acpi/cpufreq/amd-cppc.c
+++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
@@ -14,7 +14,98 @@
 #include <xen/domain.h>
 #include <xen/init.h>
 #include <xen/param.h>
+#include <xen/percpu.h>
+#include <xen/xvmalloc.h>
 #include <acpi/cpufreq/cpufreq.h>
+#include <asm/amd.h>
+#include <asm/msr.h>
+
+#define amd_cppc_err(cpu, fmt, args...)                             \
+    printk(XENLOG_ERR "AMD-CPPC: CPU%u error: " fmt, cpu, ## args)
+#define amd_cppc_warn(cpu, fmt, args...)                            \
+    printk(XENLOG_WARNING "AMD-CPPC: CPU%u warning: " fmt, cpu, ## args)
+#define amd_cppc_verbose(cpu, fmt, args...)                         \
+({                                                                  \
+    if ( cpufreq_verbose )                                          \
+        printk(XENLOG_DEBUG "AMD-CPPC: CPU%u " fmt, cpu, ## args);  \
+})
+
+/*
+ * Fields highest_perf, nominal_perf, lowest_nonlinear_perf, and lowest_perf
+ * contain the values read from the CPPC capability MSR. They represent the
+ * limits of the managed performance range as well as the dynamic capability,
+ * which may change during processor operation.
+ * Field highest_perf represents highest performance, which is the absolute
+ * maximum performance an individual processor may reach, assuming ideal
+ * conditions. This performance level may not be sustainable for long
+ * durations and may only be achievable if other platform components
+ * are in a specific state; for example, it may require other processors be
+ * in an idle state. This would be equivalent to the highest frequencies
+ * supported by the processor.
+ * Field nominal_perf represents maximum sustained performance level of the
+ * processor, assuming ideal operating conditions. All cores/processors are
+ * expected to be able to sustain their nominal performance state
+ * simultaneously.
+ * Field lowest_nonlinear_perf represents Lowest Nonlinear Performance, which
+ * is the lowest performance level at which nonlinear power savings are
+ * achieved. Above this threshold, lower performance levels should be
+ * generally more energy efficient than higher performance levels. So in
+ * traditional terms, this represents the P-state range of performance levels.
+ * Field lowest_perf represents the absolute lowest performance level of the
+ * platform. Selecting it may cause an efficiency penalty but should reduce
+ * the instantaneous power consumption of the processor. So in traditional
+ * terms, this represents the T-state range of performance levels.
+ *
+ * Field max_perf, min_perf, des_perf store the values for CPPC request MSR.
+ * Software passes performance goals through these fields.
+ * Field max_perf conveys the maximum performance level at which the platform
+ * may run. And it may be set to any performance value in the range
+ * [lowest_perf, highest_perf], inclusive.
+ * Field min_perf conveys the minimum performance level at which the platform
+ * may run. And it may be set to any performance value in the range
+ * [lowest_perf, highest_perf], inclusive but must be less than or equal to
+ * max_perf.
+ * Field des_perf conveys the performance level the Xen governor requests. It
+ * may be set to any performance value in the range [min_perf, max_perf],
+ * inclusive.
+ * Field epp represents energy performance preference, which only has meaning
+ * when active mode is enabled.
+ */
+struct amd_cppc_drv_data
+{
+    const struct xen_processor_cppc *cppc_data;
+    union {
+        uint64_t raw;
+        struct {
+            unsigned int lowest_perf:8;
+            unsigned int lowest_nonlinear_perf:8;
+            unsigned int nominal_perf:8;
+            unsigned int highest_perf:8;
+            unsigned int :32;
+        };
+    } caps;
+    union {
+        uint64_t raw;
+        struct {
+            unsigned int max_perf:8;
+            unsigned int min_perf:8;
+            unsigned int des_perf:8;
+            unsigned int epp:8;
+            unsigned int :32;
+        };
+    } req;
+
+    int err;
+};
+
+static DEFINE_PER_CPU_READ_MOSTLY(struct amd_cppc_drv_data *,
+                                  amd_cppc_drv_data);
+/*
+ * Core max frequency read from PstateDef as anchor point
+ * for freq-to-perf transition
+ */
+static DEFINE_PER_CPU_READ_MOSTLY(unsigned int, pxfreq_mhz);
+static DEFINE_PER_CPU_READ_MOSTLY(uint8_t, epp_init);
 
 static bool __init amd_cppc_handle_option(const char *s, const char *end)
 {
@@ -50,10 +141,335 @@ int __init amd_cppc_cmdline_parse(const char *s, const char *e)
     return 0;
 }
 
+/*
+ * If CPPC lowest_freq and nominal_freq registers are exposed then we can
+ * use them to convert perf to freq and vice versa. The conversion is
+ * extrapolated as a linear function passing through the 2 points:
+ *  - (Low perf, Low freq)
+ *  - (Nominal perf, Nominal freq)
+ * Parameter freq is always in kHz.
+ */
+static int amd_cppc_khz_to_perf(const struct amd_cppc_drv_data *data,
+                                unsigned int freq, uint8_t *perf)
+{
+    const struct xen_processor_cppc *cppc_data = data->cppc_data;
+    unsigned int mul, div;
+    int offset = 0, res;
+
+    if ( cppc_data->cpc.lowest_mhz &&
+         data->caps.nominal_perf > data->caps.lowest_perf &&
+         cppc_data->cpc.nominal_mhz > cppc_data->cpc.lowest_mhz )
+    {
+        mul = data->caps.nominal_perf - data->caps.lowest_perf;
+        div = cppc_data->cpc.nominal_mhz - cppc_data->cpc.lowest_mhz;
+
+        /*
+         * We don't need to convert to kHz for computing offset and can
+         * directly use nominal_mhz and lowest_mhz as the division
+         * will remove the frequency unit.
+         */
+        offset = data->caps.nominal_perf -
+                 (mul * cppc_data->cpc.nominal_mhz) / div;
+    }
+    else
+    {
+        /* Read Processor Max Speed(MHz) as anchor point */
+        mul = data->caps.highest_perf;
+        div = this_cpu(pxfreq_mhz);
+        if ( !div )
+            return -EOPNOTSUPP;
+    }
+
+    res = offset + (mul * freq) / (div * 1000);
+    if ( res > UINT8_MAX )
+    {
+        printk_once(XENLOG_WARNING
+                    "Perf value exceeds maximum value 255: %d\n", res);
+        *perf = UINT8_MAX;
+        return 0;
+    }
+    if ( res <= 0 )
+    {
+        printk_once(XENLOG_WARNING
+                    "Perf value smaller than minimum value: %d\n", res);
+        return -ERANGE;
+    }
+    *perf = res;
+
+    return 0;
+}
+
+/*
+ * _CPC may define nominal frequency and lowest frequency; if not, use
+ * Processor Max Speed as anchor point to calculate.
+ * Output freq stores cpc frequency in kHz
+ */
+static int amd_get_cpc_freq(const struct amd_cppc_drv_data *data,
+                            unsigned int cpc_mhz, uint8_t perf,
+                            unsigned int *freq)
+{
+    unsigned int mul, div, res;
+
+    if ( cpc_mhz )
+    {
+        /* Switch to kHz */
+        *freq = cpc_mhz * 1000;
+        return 0;
+    }
+
+    /* Read Processor Max Speed(MHz) as anchor point */
+    mul = this_cpu(pxfreq_mhz);
+    if ( !mul )
+        return -EOPNOTSUPP;
+    div = data->caps.highest_perf;
+    res = (mul * perf * 1000) / div;
+    if ( unlikely(!res) )
+        return -EOPNOTSUPP;
+
+    *freq = res;
+
+    return 0;
+}
+
+/* Output max_freq stores calculated maximum frequency in kHz */
+static int amd_get_max_freq(const struct amd_cppc_drv_data *data,
+                            unsigned int *max_freq)
+{
+    unsigned int nom_freq = 0;
+    int res;
+
+    res = amd_get_cpc_freq(data, data->cppc_data->cpc.nominal_mhz,
+                           data->caps.nominal_perf, &nom_freq);
+    if ( res )
+        return res;
+
+    *max_freq = (data->caps.highest_perf * nom_freq) / data->caps.nominal_perf;
+
+    return 0;
+}
+
+static int cf_check amd_cppc_cpufreq_verify(struct cpufreq_policy *policy)
+{
+    cpufreq_verify_within_limits(policy, policy->cpuinfo.min_freq,
+                                 policy->cpuinfo.max_freq);
+
+    return 0;
+}
+
+static void cf_check amd_cppc_write_request_msrs(void *info)
+{
+    const struct amd_cppc_drv_data *data = info;
+
+    wrmsrl(MSR_AMD_CPPC_REQ, data->req.raw);
+}
+
+static void amd_cppc_write_request(unsigned int cpu, uint8_t min_perf,
+                                   uint8_t des_perf, uint8_t max_perf,
+                                   uint8_t epp)
+{
+    struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
+    uint64_t prev = data->req.raw;
+
+    data->req.min_perf = min_perf;
+    data->req.max_perf = max_perf;
+    data->req.des_perf = des_perf;
+    data->req.epp = epp;
+
+    if ( prev == data->req.raw )
+        return;
+
+    on_selected_cpus(cpumask_of(cpu), amd_cppc_write_request_msrs, data, 1);
+}
+
+static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
+                                            unsigned int target_freq,
+                                            unsigned int relation)
+{
+    unsigned int cpu = policy->cpu;
+    const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
+    uint8_t des_perf;
+    int res;
+
+    if ( unlikely(!target_freq) )
+        return 0;
+
+    res = amd_cppc_khz_to_perf(data, target_freq, &des_perf);
+    if ( res )
+        return res;
+
+    /*
+     * Having a performance level lower than the lowest nonlinear
+     * performance level, i.e. lowest_perf <= perf <= lowest_nonlinear_perf,
+     * may actually cause an efficiency penalty. So when deciding the min_perf
+     * value, we prefer lowest nonlinear performance over lowest performance.
+     */
+    amd_cppc_write_request(policy->cpu, data->caps.lowest_nonlinear_perf,
+                           des_perf, data->caps.highest_perf,
+                           /* Pre-defined BIOS value for passive mode */
+                           per_cpu(epp_init, policy->cpu));
+    return 0;
+}
+
+static void cf_check amd_cppc_init_msrs(void *info)
+{
+    struct cpufreq_policy *policy = info;
+    struct amd_cppc_drv_data *data = this_cpu(amd_cppc_drv_data);
+    uint64_t val;
+    unsigned int min_freq = 0, nominal_freq = 0, max_freq;
+
+    /* Package level MSR */
+    rdmsrl(MSR_AMD_CPPC_ENABLE, val);
+    /*
+     * Only when Enable bit is on, the hardware will calculate the processor’s
+     * performance capabilities and initialize the performance level fields in
+     * the CPPC capability registers.
+     */
+    if ( !(val & AMD_CPPC_ENABLE) )
+    {
+        val |= AMD_CPPC_ENABLE;
+        wrmsrl(MSR_AMD_CPPC_ENABLE, val);
+    }
+
+    rdmsrl(MSR_AMD_CPPC_CAP1, data->caps.raw);
+
+    if ( data->caps.highest_perf == 0 || data->caps.lowest_perf == 0 ||
+         data->caps.nominal_perf == 0 || data->caps.lowest_nonlinear_perf == 0 ||
+         data->caps.lowest_perf > data->caps.lowest_nonlinear_perf ||
+         data->caps.lowest_nonlinear_perf > data->caps.nominal_perf ||
+         data->caps.nominal_perf > data->caps.highest_perf )
+    {
+        amd_cppc_err(policy->cpu,
+                     "Out of range values: highest(%u), lowest(%u), nominal(%u), lowest_nonlinear(%u)\n",
+                     data->caps.highest_perf, data->caps.lowest_perf,
+                     data->caps.nominal_perf, data->caps.lowest_nonlinear_perf);
+        goto err;
+    }
+
+    amd_process_freq(&cpu_data[policy->cpu],
+                     NULL, NULL, &this_cpu(pxfreq_mhz));
+
+    data->err = amd_get_cpc_freq(data, data->cppc_data->cpc.lowest_mhz,
+                                 data->caps.lowest_perf, &min_freq);
+    if ( data->err )
+        return;
+
+    data->err = amd_get_cpc_freq(data, data->cppc_data->cpc.nominal_mhz,
+                                 data->caps.nominal_perf, &nominal_freq);
+    if ( data->err )
+        return;
+
+    data->err = amd_get_max_freq(data, &max_freq);
+    if ( data->err )
+        return;
+
+    if ( min_freq > nominal_freq || nominal_freq > max_freq )
+    {
+        amd_cppc_err(policy->cpu,
+                     "min(%u), or max(%u), or nominal(%u) freq value is incorrect\n",
+                     min_freq, max_freq, nominal_freq);
+        goto err;
+    }
+
+    policy->min = min_freq;
+    policy->max = max_freq;
+
+    policy->cpuinfo.min_freq = min_freq;
+    policy->cpuinfo.max_freq = max_freq;
+    policy->cpuinfo.perf_freq = nominal_freq;
+    /*
+     * Set after policy->cpuinfo.perf_freq, as we are taking
+     * APERF/MPERF average frequency as current frequency.
+     */
+    policy->cur = cpufreq_driver_getavg(policy->cpu, GOV_GETAVG);
+
+    /* Store pre-defined BIOS value for passive mode */
+    rdmsrl(MSR_AMD_CPPC_REQ, val);
+    this_cpu(epp_init) = MASK_EXTR(val, AMD_CPPC_EPP_MASK);
+
+    return;
+
+ err:
+    /*
+     * No fallback scheme is available here; see more explanation at the call
+     * site in amd_cppc_cpufreq_cpu_init().
+     */
+    data->err = -EINVAL;
+}
+
+/*
+ * The AMD CPPC driver differs from the legacy ACPI hardware P-states driver
+ * in that it offers a finer-grained frequency range between the highest and
+ * lowest frequency. The boost frequency is the frequency mapped onto the
+ * highest performance ratio, while the legacy P0 frequency is mapped onto
+ * the nominal performance ratio.
+ */
+static void amd_cppc_boost_init(struct cpufreq_policy *policy,
+                                const struct amd_cppc_drv_data *data)
+{
+    if ( data->caps.highest_perf <= data->caps.nominal_perf )
+        return;
+
+    policy->turbo = CPUFREQ_TURBO_ENABLED;
+}
+
+static int cf_check amd_cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+    XVFREE(per_cpu(amd_cppc_drv_data, policy->cpu));
+
+    return 0;
+}
+
+static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+    unsigned int cpu = policy->cpu;
+    struct amd_cppc_drv_data *data;
+
+    data = xvzalloc(struct amd_cppc_drv_data);
+    if ( !data )
+        return -ENOMEM;
+
+    data->cppc_data = &processor_pminfo[cpu]->cppc_data;
+
+    per_cpu(amd_cppc_drv_data, cpu) = data;
+
+    on_selected_cpus(cpumask_of(cpu), amd_cppc_init_msrs, policy, 1);
+
+    /*
+     * The enable bit is sticky, as we need to enable it at the very
+     * beginning, before the CPPC capability values sanity check.
+     * If the error path is taken, not only does the amd-cppc cpufreq core
+     * fail to initialize, but we also cannot fall back to the legacy
+     * P-states driver, irrespective of the command line specifying a
+     * fallback option.
+     */
+    if ( data->err )
+    {
+        amd_cppc_err(cpu, "Could not initialize cpufreq core in CPPC mode\n");
+        amd_cppc_cpufreq_cpu_exit(policy);
+        return data->err;
+    }
+
+    policy->governor = cpufreq_opt_governor ? : CPUFREQ_DEFAULT_GOVERNOR;
+
+    amd_cppc_boost_init(policy, data);
+
+    amd_cppc_verbose(policy->cpu,
+                     "CPU initialized with amd-cppc passive mode\n");
+
+    return 0;
+}
+
+static const struct cpufreq_driver __initconst_cf_clobber
+amd_cppc_cpufreq_driver =
+{
+    .name   = XEN_AMD_CPPC_DRIVER_NAME,
+    .verify = amd_cppc_cpufreq_verify,
+    .target = amd_cppc_cpufreq_target,
+    .init   = amd_cppc_cpufreq_cpu_init,
+    .exit   = amd_cppc_cpufreq_cpu_exit,
+};
+
 int __init amd_cppc_register_driver(void)
 {
     if ( !cpu_has_cppc )
         return -ENODEV;
 
-    return -EOPNOTSUPP;
+    return cpufreq_register_driver(&amd_cppc_cpufreq_driver);
 }
diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
index 567b992a9f..9767f63539 100644
--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -613,10 +613,10 @@ static unsigned int attr_const amd_parse_freq(unsigned int family,
 	return freq;
 }
 
-static void amd_process_freq(const struct cpuinfo_x86 *c,
-			     unsigned int *low_mhz,
-			     unsigned int *nom_mhz,
-			     unsigned int *hi_mhz)
+void amd_process_freq(const struct cpuinfo_x86 *c,
+		      unsigned int *low_mhz,
+		      unsigned int *nom_mhz,
+		      unsigned int *hi_mhz)
 {
 	unsigned int idx = 0, h;
 	uint64_t hi, lo, val;
diff --git a/xen/arch/x86/include/asm/amd.h b/xen/arch/x86/include/asm/amd.h
index 9c9599a622..72df42a6f6 100644
--- a/xen/arch/x86/include/asm/amd.h
+++ b/xen/arch/x86/include/asm/amd.h
@@ -173,5 +173,7 @@ extern bool amd_virt_spec_ctrl;
 bool amd_setup_legacy_ssbd(void);
 void amd_set_legacy_ssbd(bool enable);
 void amd_set_cpuid_user_dis(bool enable);
+void amd_process_freq(const struct cpuinfo_x86 *c, unsigned int *low_mhz,
+                      unsigned int *nom_mhz, unsigned int *hi_mhz);
 
 #endif /* __AMD_H__ */
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index 428d993ee8..6abf154887 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -241,6 +241,12 @@
 
 #define MSR_AMD_CSTATE_CFG                  0xc0010296U
 
+#define MSR_AMD_CPPC_CAP1                   0xc00102b0U
+#define MSR_AMD_CPPC_ENABLE                 0xc00102b1U
+#define  AMD_CPPC_ENABLE                    (_AC(1, ULL) << 0)
+#define MSR_AMD_CPPC_REQ                    0xc00102b3U
+#define  AMD_CPPC_EPP_MASK                  (_AC(0xff, ULL) << 24)
+
 /*
  * Legacy MSR constants in need of cleanup.  No new MSRs below this comment.
  */
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index aafa7fcf2b..aa29a5401c 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -453,6 +453,7 @@ struct xen_set_cppc_para {
     uint32_t activity_window;
 };
 
+#define XEN_AMD_CPPC_DRIVER_NAME "amd-cppc"
 #define XEN_HWP_DRIVER_NAME "hwp"
 
 /*
-- 
2.34.1


Re: [PATCH v8 3/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
Posted by Jan Beulich 2 months ago
On 28.08.2025 12:03, Penny Zheng wrote:
> +static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
> +                                            unsigned int target_freq,
> +                                            unsigned int relation)
> +{
> +    unsigned int cpu = policy->cpu;
> +    const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);

I fear there's a problem here that I so far overlooked. As it happens, just
yesterday I made a patch to eliminate cpufreq_drv_data[] global. In the
course of doing so it became clear that in principle the CPU denoted by
policy->cpu can be offline. Hence its per-CPU data is also unavailable. See
cpufreq_add_cpu()'s invocation of .init() and cpufreq_del_cpu()'s invocation
of .exit(). Is there anything well-hidden (and likely lacking some suitable
comment) which guarantees that no two CPUs (threads) will be in the same
domain? If not, I fear you simply can't use per-CPU data here.

Since initially I was thinking of using per-CPU data also in my patch, I'm
reproducing this in raw form below, for your reference. It's generally only
4.22 material now, of course. Yet in turn for your driver the new drv_data
field may want to become a union, with an "acpi" and a "cppc" sub-struct.
And possibly a "hwp" one: Jason, looks like the HWP driver has a similar
issue (unless again something guarantees that no two CPUs / threads will
be in the same domain).

Jan

cpufreq: eliminate cpufreq_drv_data[]

Possibly many slots of it may be unused (all of them when the HWP or CPPC
drivers are in use), as it's always strictly associated with the CPU
recorded in a policy (irrespective of that CPU intermediately being taken
offline). It is shared by all CPUs sharing a policy. We could therefore
put the respective pointers in struct cpufreq_policy, but even that level
of indirection is pointless. Embed the driver data structure directly in
the policy one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/acpi/cpufreq/acpi.c
+++ b/xen/arch/x86/acpi/cpufreq/acpi.c
@@ -174,17 +174,18 @@ static u32 get_cur_val(const cpumask_t *
         return 0;
 
     policy = per_cpu(cpufreq_cpu_policy, cpu);
-    if (!policy || !cpufreq_drv_data[policy->cpu])
+    if ( !policy )
         return 0;
 
-    switch (cpufreq_drv_data[policy->cpu]->arch_cpu_flags) {
+    switch ( policy->drv_data.arch_cpu_flags )
+    {
     case SYSTEM_INTEL_MSR_CAPABLE:
         cmd.type = SYSTEM_INTEL_MSR_CAPABLE;
         cmd.addr.msr.reg = MSR_IA32_PERF_STATUS;
         break;
     case SYSTEM_IO_CAPABLE:
         cmd.type = SYSTEM_IO_CAPABLE;
-        perf = cpufreq_drv_data[policy->cpu]->acpi_data;
+        perf = policy->drv_data.acpi_data;
         cmd.addr.io.port = perf->control_register.address;
         cmd.addr.io.bit_width = perf->control_register.bit_width;
         break;
@@ -210,9 +211,8 @@ static unsigned int cf_check get_cur_fre
     if (!policy)
         return 0;
 
-    data = cpufreq_drv_data[policy->cpu];
-    if (unlikely(data == NULL ||
-        data->acpi_data == NULL || data->freq_table == NULL))
+    data = &policy->drv_data;
+    if ( !data->acpi_data || !data->freq_table )
         return 0;
 
     return extract_freq(get_cur_val(cpumask_of(cpu)), data);
@@ -255,7 +255,7 @@ static int cf_check acpi_cpufreq_target(
     struct cpufreq_policy *policy,
     unsigned int target_freq, unsigned int relation)
 {
-    struct acpi_cpufreq_data *data = cpufreq_drv_data[policy->cpu];
+    struct acpi_cpufreq_data *data = &policy->drv_data;
     struct processor_performance *perf;
     struct cpufreq_freqs freqs;
     cpumask_t online_policy_cpus;
@@ -265,10 +265,8 @@ static int cf_check acpi_cpufreq_target(
     unsigned int j;
     int result = 0;
 
-    if (unlikely(data == NULL ||
-        data->acpi_data == NULL || data->freq_table == NULL)) {
+    if ( !data->acpi_data || !data->freq_table )
         return -ENODEV;
-    }
 
     if (policy->turbo == CPUFREQ_TURBO_DISABLED)
         if (target_freq > policy->cpuinfo.second_max_freq)
@@ -334,11 +332,9 @@ static int cf_check acpi_cpufreq_target(
 
 static int cf_check acpi_cpufreq_verify(struct cpufreq_policy *policy)
 {
-    struct acpi_cpufreq_data *data;
     struct processor_performance *perf;
 
-    if (!policy || !(data = cpufreq_drv_data[policy->cpu]) ||
-        !processor_pminfo[policy->cpu])
+    if ( !policy || !processor_pminfo[policy->cpu] )
         return -EINVAL;
 
     perf = &processor_pminfo[policy->cpu]->perf;
@@ -346,7 +342,7 @@ static int cf_check acpi_cpufreq_verify(
     cpufreq_verify_within_limits(policy, 0,
         perf->states[perf->platform_limit].core_frequency * 1000);
 
-    return cpufreq_frequency_table_verify(policy, data->freq_table);
+    return cpufreq_frequency_table_verify(policy, policy->drv_data.freq_table);
 }
 
 static unsigned long
@@ -382,17 +378,11 @@ static int cf_check acpi_cpufreq_cpu_ini
     unsigned int i;
     unsigned int valid_states = 0;
     unsigned int cpu = policy->cpu;
-    struct acpi_cpufreq_data *data;
+    struct acpi_cpufreq_data *data = &policy->drv_data;
     unsigned int result = 0;
     struct cpuinfo_x86 *c = &cpu_data[policy->cpu];
     struct processor_performance *perf;
 
-    data = xzalloc(struct acpi_cpufreq_data);
-    if (!data)
-        return -ENOMEM;
-
-    cpufreq_drv_data[cpu] = data;
-
     data->acpi_data = &processor_pminfo[cpu]->perf;
 
     perf = data->acpi_data;
@@ -409,23 +399,18 @@ static int cf_check acpi_cpufreq_cpu_ini
         if (cpufreq_verbose)
             printk("xen_pminfo: @acpi_cpufreq_cpu_init,"
                    "HARDWARE addr space\n");
-        if (!cpu_has(c, X86_FEATURE_EIST)) {
-            result = -ENODEV;
-            goto err_unreg;
-        }
+        if ( !cpu_has(c, X86_FEATURE_EIST) )
+            return -ENODEV;
         data->arch_cpu_flags = SYSTEM_INTEL_MSR_CAPABLE;
         break;
     default:
-        result = -ENODEV;
-        goto err_unreg;
+        return -ENODEV;
     }
 
     data->freq_table = xmalloc_array(struct cpufreq_frequency_table,
                                     (perf->state_count+1));
-    if (!data->freq_table) {
-        result = -ENOMEM;
-        goto err_unreg;
-    }
+    if ( !data->freq_table )
+        return -ENOMEM;
 
     /* detect transition latency */
     policy->cpuinfo.transition_latency = 0;
@@ -483,23 +468,14 @@ static int cf_check acpi_cpufreq_cpu_ini
     return result;
 
 err_freqfree:
-    xfree(data->freq_table);
-err_unreg:
-    xfree(data);
-    cpufreq_drv_data[cpu] = NULL;
+    XFREE(data->freq_table);
 
     return result;
 }
 
 static int cf_check acpi_cpufreq_cpu_exit(struct cpufreq_policy *policy)
 {
-    struct acpi_cpufreq_data *data = cpufreq_drv_data[policy->cpu];
-
-    if (data) {
-        cpufreq_drv_data[policy->cpu] = NULL;
-        xfree(data->freq_table);
-        xfree(data);
-    }
+    XFREE(policy->drv_data.freq_table);
 
     return 0;
 }
--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
+++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
@@ -35,8 +35,6 @@
 
 #include <acpi/cpufreq/cpufreq.h>
 
-struct acpi_cpufreq_data *cpufreq_drv_data[NR_CPUS];
-
 struct perf_pair {
     union {
         struct {
--- a/xen/arch/x86/acpi/cpufreq/powernow.c
+++ b/xen/arch/x86/acpi/cpufreq/powernow.c
@@ -84,16 +84,14 @@ static int cf_check powernow_cpufreq_tar
     struct cpufreq_policy *policy,
     unsigned int target_freq, unsigned int relation)
 {
-    struct acpi_cpufreq_data *data = cpufreq_drv_data[policy->cpu];
+    struct acpi_cpufreq_data *data = &policy->drv_data;
     struct processor_performance *perf;
     unsigned int next_state; /* Index into freq_table */
     unsigned int next_perf_state; /* Index into perf table */
     int result;
 
-    if (unlikely(data == NULL ||
-        data->acpi_data == NULL || data->freq_table == NULL)) {
+    if ( !data->acpi_data || !data->freq_table )
         return -ENODEV;
-    }
 
     perf = data->acpi_data;
     result = cpufreq_frequency_table_target(policy,
@@ -185,11 +183,9 @@ static void cf_check get_cpu_data(void *
 
 static int cf_check powernow_cpufreq_verify(struct cpufreq_policy *policy)
 {
-    struct acpi_cpufreq_data *data;
     struct processor_performance *perf;
 
-    if (!policy || !(data = cpufreq_drv_data[policy->cpu]) ||
-        !processor_pminfo[policy->cpu])
+    if ( !policy || !processor_pminfo[policy->cpu] )
         return -EINVAL;
 
     perf = &processor_pminfo[policy->cpu]->perf;
@@ -197,7 +193,7 @@ static int cf_check powernow_cpufreq_ver
     cpufreq_verify_within_limits(policy, 0, 
         perf->states[perf->platform_limit].core_frequency * 1000);
 
-    return cpufreq_frequency_table_verify(policy, data->freq_table);
+    return cpufreq_frequency_table_verify(policy, policy->drv_data.freq_table);
 }
 
 static int cf_check powernow_cpufreq_cpu_init(struct cpufreq_policy *policy)
@@ -205,18 +201,12 @@ static int cf_check powernow_cpufreq_cpu
     unsigned int i;
     unsigned int valid_states = 0;
     unsigned int cpu = policy->cpu;
-    struct acpi_cpufreq_data *data;
+    struct acpi_cpufreq_data *data = &policy->drv_data;
     unsigned int result = 0;
     struct processor_performance *perf;
     struct amd_cpu_data info;
     struct cpuinfo_x86 *c = &cpu_data[policy->cpu];
 
-    data = xzalloc(struct acpi_cpufreq_data);
-    if (!data)
-        return -ENOMEM;
-
-    cpufreq_drv_data[cpu] = data;
-
     data->acpi_data = &processor_pminfo[cpu]->perf;
 
     info.perf = perf = data->acpi_data;
@@ -228,8 +218,7 @@ static int cf_check powernow_cpufreq_cpu
         if (cpumask_weight(policy->cpus) != 1) {
             printk(XENLOG_WARNING "Unsupported sharing type %d (%u CPUs)\n",
                    policy->shared_type, cpumask_weight(policy->cpus));
-            result = -ENODEV;
-            goto err_unreg;
+            return -ENODEV;
         }
     } else {
         cpumask_copy(policy->cpus, cpumask_of(cpu));
@@ -238,21 +227,16 @@ static int cf_check powernow_cpufreq_cpu
     /* capability check */
     if (perf->state_count <= 1) {
         printk("No P-States\n");
-        result = -ENODEV;
-        goto err_unreg;
+        return -ENODEV;
     }
 
-    if (perf->control_register.space_id != perf->status_register.space_id) {
-        result = -ENODEV;
-        goto err_unreg;
-    }
+    if ( perf->control_register.space_id != perf->status_register.space_id )
+        return -ENODEV;
 
     data->freq_table = xmalloc_array(struct cpufreq_frequency_table, 
                                     (perf->state_count+1));
-    if (!data->freq_table) {
-        result = -ENOMEM;
-        goto err_unreg;
-    }
+    if ( !data->freq_table )
+        return -ENOMEM;
 
     /* detect transition latency */
     policy->cpuinfo.transition_latency = 0;
@@ -298,23 +282,14 @@ static int cf_check powernow_cpufreq_cpu
     return result;
 
 err_freqfree:
-    xfree(data->freq_table);
-err_unreg:
-    xfree(data);
-    cpufreq_drv_data[cpu] = NULL;
+    XFREE(data->freq_table);
 
     return result;
 }
 
 static int cf_check powernow_cpufreq_cpu_exit(struct cpufreq_policy *policy)
 {
-    struct acpi_cpufreq_data *data = cpufreq_drv_data[policy->cpu];
-
-    if (data) {
-        cpufreq_drv_data[policy->cpu] = NULL;
-        xfree(data->freq_table);
-        xfree(data);
-    }
+    XFREE(policy->drv_data.freq_table);
 
     return 0;
 }
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -37,8 +37,6 @@ struct acpi_cpufreq_data {
     unsigned int arch_cpu_flags;
 };
 
-extern struct acpi_cpufreq_data *cpufreq_drv_data[NR_CPUS];
-
 struct cpufreq_cpuinfo {
     unsigned int        max_freq;
     unsigned int        second_max_freq;    /* P1 if Turbo Mode is on */
@@ -80,6 +78,8 @@ struct cpufreq_policy {
     int8_t              turbo;  /* tristate flag: 0 for unsupported
                                  * -1 for disable, 1 for enabled
                                  * See CPUFREQ_TURBO_* below for defines */
+
+    struct acpi_cpufreq_data drv_data;
 };
 DECLARE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_policy);
RE: [PATCH v8 3/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
Posted by Penny, Zheng 2 months ago
[Public]

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Thursday, August 28, 2025 7:23 PM
> To: Penny, Zheng <penny.zheng@amd.com>; Andryuk, Jason
> <Jason.Andryuk@amd.com>
> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Roger Pau Monné <roger.pau@citrix.com>;
> Anthony PERARD <anthony.perard@vates.tech>; Orzel, Michal
> <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v8 3/8] xen/cpufreq: implement amd-cppc driver for CPPC
> in passive mode
>
> On 28.08.2025 12:03, Penny Zheng wrote:
> > +static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
> > +                                            unsigned int target_freq,
> > +                                            unsigned int relation)
> > +{
> > +    unsigned int cpu = policy->cpu;
> > +    const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
>
> I fear there's a problem here that I so far overlooked. As it happens, just
> yesterday I made a patch to eliminate cpufreq_drv_data[] global. In the course of
> doing so it became clear that in principle the CPU denoted by
> policy->cpu can be offline. Hence its per-CPU data is also unavailable. See
> cpufreq_add_cpu()'s invocation of .init() and cpufreq_del_cpu()'s invocation
> of .exit(). Is there anything well-hidden (and likely lacking some suitable
> comment) which guarantees that no two CPUs (threads) will be in the same
> domain? If not, I fear you simply can't use per-CPU data here.
>

Correct me if I understand you wrongly:
No, my environment always has one pcpu per cpufreq domain. So it never occurred to me that CPUs other than the first one in a domain never call .init(), and of course no per_cpu(amd_cppc_drv_data) ever gets allocated for them.

> Since initially I was thinking of using per-CPU data also in my patch, I'm
> reproducing this in raw form below, for your reference. It's generally only
> 4.22 material now, of course. Yet in turn for your driver the new drv_data field
> may want to become a union, with an "acpi" and a "cppc" sub-struct.

How about I embed my new driver data, a "struct amd_cppc_drv_data *", into the cpufreq policy? Maybe a pointer is enough.
Later, maybe, all of "cppc", "acpi" and "hwp" could constitute a union in the policy.

> And possibly a "hwp" one: Jason, looks like the HWP driver has a similar issue
> (unless again something guarantees that no two CPUs / threads will be in the
> same domain).
>
> Jan
>
> cpufreq: eliminate cpufreq_drv_data[]
>
> Possibly many slots of it may be unused (all of them when the HWP or CPPC
> drivers are in use), as it's always strictly associated with the CPU recorded in a
> policy (irrespective of that CPU intermediately being taken offline). It is shared by
> all CPUs sharing a policy. We could therefore put the respective pointers in struct
> cpufreq_policy, but even that level of indirection is pointless. Embed the driver
> data structure directly in the policy one.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/xen/arch/x86/acpi/cpufreq/acpi.c
> +++ b/xen/arch/x86/acpi/cpufreq/acpi.c
> @@ -174,17 +174,18 @@ static u32 get_cur_val(const cpumask_t *
>          return 0;
>
>      policy = per_cpu(cpufreq_cpu_policy, cpu);
> -    if (!policy || !cpufreq_drv_data[policy->cpu])
> +    if ( !policy )
>          return 0;
>
> -    switch (cpufreq_drv_data[policy->cpu]->arch_cpu_flags) {
> +    switch ( policy->drv_data.arch_cpu_flags )
> +    {
>      case SYSTEM_INTEL_MSR_CAPABLE:
>          cmd.type = SYSTEM_INTEL_MSR_CAPABLE;
>          cmd.addr.msr.reg = MSR_IA32_PERF_STATUS;
>          break;
>      case SYSTEM_IO_CAPABLE:
>          cmd.type = SYSTEM_IO_CAPABLE;
> -        perf = cpufreq_drv_data[policy->cpu]->acpi_data;
> +        perf = policy->drv_data.acpi_data;
>          cmd.addr.io.port = perf->control_register.address;
>          cmd.addr.io.bit_width = perf->control_register.bit_width;
>          break;
> @@ -210,9 +211,8 @@ static unsigned int cf_check get_cur_fre
>      if (!policy)
>          return 0;
>
> -    data = cpufreq_drv_data[policy->cpu];
> -    if (unlikely(data == NULL ||
> -        data->acpi_data == NULL || data->freq_table == NULL))
> +    data = &policy->drv_data;
> +    if ( !data->acpi_data || !data->freq_table )
>          return 0;
>
>      return extract_freq(get_cur_val(cpumask_of(cpu)), data);
> @@ -255,7 +255,7 @@ static int cf_check acpi_cpufreq_target(
>      struct cpufreq_policy *policy,
>      unsigned int target_freq, unsigned int relation)
>  {
> -    struct acpi_cpufreq_data *data = cpufreq_drv_data[policy->cpu];
> +    struct acpi_cpufreq_data *data = &policy->drv_data;
>      struct processor_performance *perf;
>      struct cpufreq_freqs freqs;
>      cpumask_t online_policy_cpus;
> @@ -265,10 +265,8 @@ static int cf_check acpi_cpufreq_target(
>      unsigned int j;
>      int result = 0;
>
> -    if (unlikely(data == NULL ||
> -        data->acpi_data == NULL || data->freq_table == NULL)) {
> +    if ( !data->acpi_data || !data->freq_table )
>          return -ENODEV;
> -    }
>
>      if (policy->turbo == CPUFREQ_TURBO_DISABLED)
>          if (target_freq > policy->cpuinfo.second_max_freq)
> @@ -334,11 +332,9 @@ static int cf_check acpi_cpufreq_target(
>
>  static int cf_check acpi_cpufreq_verify(struct cpufreq_policy *policy)
>  {
> -    struct acpi_cpufreq_data *data;
>      struct processor_performance *perf;
>
> -    if (!policy || !(data = cpufreq_drv_data[policy->cpu]) ||
> -        !processor_pminfo[policy->cpu])
> +    if ( !policy || !processor_pminfo[policy->cpu] )
>          return -EINVAL;
>
>      perf = &processor_pminfo[policy->cpu]->perf;
> @@ -346,7 +342,7 @@ static int cf_check acpi_cpufreq_verify(
>      cpufreq_verify_within_limits(policy, 0,
>          perf->states[perf->platform_limit].core_frequency * 1000);
>
> -    return cpufreq_frequency_table_verify(policy, data->freq_table);
> +    return cpufreq_frequency_table_verify(policy, policy->drv_data.freq_table);
>  }
>
>  static unsigned long
> @@ -382,17 +378,11 @@ static int cf_check acpi_cpufreq_cpu_ini
>      unsigned int i;
>      unsigned int valid_states = 0;
>      unsigned int cpu = policy->cpu;
> -    struct acpi_cpufreq_data *data;
> +    struct acpi_cpufreq_data *data = &policy->drv_data;
>      unsigned int result = 0;
>      struct cpuinfo_x86 *c = &cpu_data[policy->cpu];
>      struct processor_performance *perf;
>
> -    data = xzalloc(struct acpi_cpufreq_data);
> -    if (!data)
> -        return -ENOMEM;
> -
> -    cpufreq_drv_data[cpu] = data;
> -
>      data->acpi_data = &processor_pminfo[cpu]->perf;
>
>      perf = data->acpi_data;
> @@ -409,23 +399,18 @@ static int cf_check acpi_cpufreq_cpu_ini
>          if (cpufreq_verbose)
>              printk("xen_pminfo: @acpi_cpufreq_cpu_init,"
>                     "HARDWARE addr space\n");
> -        if (!cpu_has(c, X86_FEATURE_EIST)) {
> -            result = -ENODEV;
> -            goto err_unreg;
> -        }
> +        if ( !cpu_has(c, X86_FEATURE_EIST) )
> +            return -ENODEV;
>          data->arch_cpu_flags = SYSTEM_INTEL_MSR_CAPABLE;
>          break;
>      default:
> -        result = -ENODEV;
> -        goto err_unreg;
> +        return -ENODEV;
>      }
>
>      data->freq_table = xmalloc_array(struct cpufreq_frequency_table,
>                                      (perf->state_count+1));
> -    if (!data->freq_table) {
> -        result = -ENOMEM;
> -        goto err_unreg;
> -    }
> +    if ( !data->freq_table )
> +        return -ENOMEM;
>
>      /* detect transition latency */
>      policy->cpuinfo.transition_latency = 0;
> @@ -483,23 +468,14 @@ static int cf_check acpi_cpufreq_cpu_ini
>      return result;
>
>  err_freqfree:
> -    xfree(data->freq_table);
> -err_unreg:
> -    xfree(data);
> -    cpufreq_drv_data[cpu] = NULL;
> +    XFREE(data->freq_table);
>
>      return result;
>  }
>
>  static int cf_check acpi_cpufreq_cpu_exit(struct cpufreq_policy *policy)
>  {
> -    struct acpi_cpufreq_data *data = cpufreq_drv_data[policy->cpu];
> -
> -    if (data) {
> -        cpufreq_drv_data[policy->cpu] = NULL;
> -        xfree(data->freq_table);
> -        xfree(data);
> -    }
> +    XFREE(policy->drv_data.freq_table);
>
>      return 0;
>  }
> --- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
> +++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
> @@ -35,8 +35,6 @@
>
>  #include <acpi/cpufreq/cpufreq.h>
>
> -struct acpi_cpufreq_data *cpufreq_drv_data[NR_CPUS];
> -
>  struct perf_pair {
>      union {
>          struct {
> --- a/xen/arch/x86/acpi/cpufreq/powernow.c
> +++ b/xen/arch/x86/acpi/cpufreq/powernow.c
> @@ -84,16 +84,14 @@ static int cf_check powernow_cpufreq_tar
>      struct cpufreq_policy *policy,
>      unsigned int target_freq, unsigned int relation)
>  {
> -    struct acpi_cpufreq_data *data = cpufreq_drv_data[policy->cpu];
> +    struct acpi_cpufreq_data *data = &policy->drv_data;
>      struct processor_performance *perf;
>      unsigned int next_state; /* Index into freq_table */
>      unsigned int next_perf_state; /* Index into perf table */
>      int result;
>
> -    if (unlikely(data == NULL ||
> -        data->acpi_data == NULL || data->freq_table == NULL)) {
> +    if ( !data->acpi_data || !data->freq_table )
>          return -ENODEV;
> -    }
>
>      perf = data->acpi_data;
>      result = cpufreq_frequency_table_target(policy,
> @@ -185,11 +183,9 @@ static void cf_check get_cpu_data(void *
>
>  static int cf_check powernow_cpufreq_verify(struct cpufreq_policy *policy)  {
> -    struct acpi_cpufreq_data *data;
>      struct processor_performance *perf;
>
> -    if (!policy || !(data = cpufreq_drv_data[policy->cpu]) ||
> -        !processor_pminfo[policy->cpu])
> +    if ( !policy || !processor_pminfo[policy->cpu] )
>          return -EINVAL;
>
>      perf = &processor_pminfo[policy->cpu]->perf;
> @@ -197,7 +193,7 @@ static int cf_check powernow_cpufreq_ver
>      cpufreq_verify_within_limits(policy, 0,
>          perf->states[perf->platform_limit].core_frequency * 1000);
>
> -    return cpufreq_frequency_table_verify(policy, data->freq_table);
> +    return cpufreq_frequency_table_verify(policy, policy->drv_data.freq_table);
>  }
>
>  static int cf_check powernow_cpufreq_cpu_init(struct cpufreq_policy *policy)
> @@ -205,18 +201,12 @@ static int cf_check powernow_cpufreq_cpu
>      unsigned int i;
>      unsigned int valid_states = 0;
>      unsigned int cpu = policy->cpu;
> -    struct acpi_cpufreq_data *data;
> +    struct acpi_cpufreq_data *data = &policy->drv_data;
>      unsigned int result = 0;
>      struct processor_performance *perf;
>      struct amd_cpu_data info;
>      struct cpuinfo_x86 *c = &cpu_data[policy->cpu];
>
> -    data = xzalloc(struct acpi_cpufreq_data);
> -    if (!data)
> -        return -ENOMEM;
> -
> -    cpufreq_drv_data[cpu] = data;
> -
>      data->acpi_data = &processor_pminfo[cpu]->perf;
>
>      info.perf = perf = data->acpi_data;
> @@ -228,8 +218,7 @@ static int cf_check powernow_cpufreq_cpu
>          if (cpumask_weight(policy->cpus) != 1) {
>              printk(XENLOG_WARNING "Unsupported sharing type %d (%u CPUs)\n",
>                     policy->shared_type, cpumask_weight(policy->cpus));
> -            result = -ENODEV;
> -            goto err_unreg;
> +            return -ENODEV;
>          }
>      } else {
>          cpumask_copy(policy->cpus, cpumask_of(cpu));
> @@ -238,21 +227,16 @@ static int cf_check powernow_cpufreq_cpu
>      /* capability check */
>      if (perf->state_count <= 1) {
>          printk("No P-States\n");
> -        result = -ENODEV;
> -        goto err_unreg;
> +        return -ENODEV;
>      }
>
> -    if (perf->control_register.space_id != perf->status_register.space_id) {
> -        result = -ENODEV;
> -        goto err_unreg;
> -    }
> +    if ( perf->control_register.space_id != perf->status_register.space_id )
> +        return -ENODEV;
>
>      data->freq_table = xmalloc_array(struct cpufreq_frequency_table,
>                                      (perf->state_count+1));
> -    if (!data->freq_table) {
> -        result = -ENOMEM;
> -        goto err_unreg;
> -    }
> +    if ( !data->freq_table )
> +        return -ENOMEM;
>
>      /* detect transition latency */
>      policy->cpuinfo.transition_latency = 0;
> @@ -298,23 +282,14 @@ static int cf_check powernow_cpufreq_cpu
>      return result;
>
>  err_freqfree:
> -    xfree(data->freq_table);
> -err_unreg:
> -    xfree(data);
> -    cpufreq_drv_data[cpu] = NULL;
> +    XFREE(data->freq_table);
>
>      return result;
>  }
>
>  static int cf_check powernow_cpufreq_cpu_exit(struct cpufreq_policy *policy)
>  {
> -    struct acpi_cpufreq_data *data = cpufreq_drv_data[policy->cpu];
> -
> -    if (data) {
> -        cpufreq_drv_data[policy->cpu] = NULL;
> -        xfree(data->freq_table);
> -        xfree(data);
> -    }
> +    XFREE(policy->drv_data.freq_table);
>
>      return 0;
>  }
> --- a/xen/include/acpi/cpufreq/cpufreq.h
> +++ b/xen/include/acpi/cpufreq/cpufreq.h
> @@ -37,8 +37,6 @@ struct acpi_cpufreq_data {
>      unsigned int arch_cpu_flags;
>  };
>
> -extern struct acpi_cpufreq_data *cpufreq_drv_data[NR_CPUS];
> -
>  struct cpufreq_cpuinfo {
>      unsigned int        max_freq;
>      unsigned int        second_max_freq;    /* P1 if Turbo Mode is on */
> @@ -80,6 +78,8 @@ struct cpufreq_policy {
>      int8_t              turbo;  /* tristate flag: 0 for unsupported
>                                   * -1 for disable, 1 for enabled
>                                   * See CPUFREQ_TURBO_* below for defines */
> +
> +    struct acpi_cpufreq_data drv_data;
>  };
>  DECLARE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_policy);
>

Re: [PATCH v8 3/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
Posted by Jan Beulich 2 months ago
On 29.08.2025 05:30, Penny, Zheng wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Thursday, August 28, 2025 7:23 PM
>>
>> On 28.08.2025 12:03, Penny Zheng wrote:
>>> +static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
>>> +                                            unsigned int target_freq,
>>> +                                            unsigned int relation)
>>> +{
>>> +    unsigned int cpu = policy->cpu;
>>> +    const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
>>
>> I fear there's a problem here that I so far overlooked. As it happens, just
>> yesterday I made a patch to eliminate cpufreq_drv_data[] global. In the course of
>> doing so it became clear that in principle the CPU denoted by
>> policy->cpu can be offline. Hence its per-CPU data is also unavailable. See
>> cpufreq_add_cpu()'s invocation of .init() and cpufreq_del_cpu()'s invocation
>> of .exit(). Is there anything well-hidden (and likely lacking some suitable
>> comment) which guarantees that no two CPUs (threads) will be in the same
>> domain? If not, I fear you simply can't use per-CPU data here.
>>
> 
> Correct me if I understand you wrongly:
> No, my env is always per pcpu per cpufreq domain. So it never occurred to me that cpus, other than the first one in domain, will never call .init(), and of course, no per_cpu(amd_cppc_drv_data) ever gets allocated then.

Well, the question is how domains are organized when using the CPPC driver.
Aiui that's still driven by data passed in by Dom0, so in turn the question
is whether there are any constraints on what ACPI may surface. If there are,
all that may be necessary is adding a check. If there aren't, ...

>> Since initially I was thinking of using per-CPU data also in my patch, I'm
>> reproducing this in raw form below, for your reference. It's generally only
>> 4.22 material now, of course. Yet in turn for your driver the new drv_data field
>> may want to become a union, with an "acpi" and a "cppc" sub-struct.
> 
> How about I embed my new driver data " struct amd_cppc_drv_data * " into cpufreq policy, maybe pointer is enough?
> Later, maybe, all "cppc", "acpi" and "hwp" could constitute an union in policy.

... I'd prefer to go the union approach right away. Whether then to take my
patch as a prereq is tbd; that largely depends on what (if anything) is
needed on the HWP side. If HWP needs fixing, that wants to come first, as
it would want backporting.

Jan
RE: [PATCH v8 3/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
Posted by Penny, Zheng 2 months ago
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Friday, August 29, 2025 2:12 PM
> To: Penny, Zheng <penny.zheng@amd.com>
>
> On 29.08.2025 05:30, Penny, Zheng wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Thursday, August 28, 2025 7:23 PM
> >>
> >> On 28.08.2025 12:03, Penny Zheng wrote:
> >>> +static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
> >>> +                                            unsigned int target_freq,
> >>> +                                            unsigned int relation) {
> >>> +    unsigned int cpu = policy->cpu;
> >>> +    const struct amd_cppc_drv_data *data =
> >>> +per_cpu(amd_cppc_drv_data, cpu);
> >>
> >> I fear there's a problem here that I so far overlooked. As it
> >> happens, just yesterday I made a patch to eliminate
> >> cpufreq_drv_data[] global. In the course of doing so it became clear
> >> that in principle the CPU denoted by
> >> policy->cpu can be offline. Hence its per-CPU data is also unavailable. See
> >> cpufreq_add_cpu()'s invocation of .init() and cpufreq_del_cpu()'s
> >> invocation of .exit(). Is there anything well-hidden (and likely
> >> lacking some suitable
> >> comment) which guarantees that no two CPUs (threads) will be in the
> >> same domain? If not, I fear you simply can't use per-CPU data here.
> >>
> >
> > Correct me if I understand you wrongly:
> > No, my env is always per pcpu per cpufreq domain. So it never occurred to me
> > that cpus, other than the first one in domain, will never call .init(), and of course, no
> > per_cpu(amd_cppc_drv_data) ever gets allocated then.
>
> Well, the question is how domains are organized when using the CPPC driver.
> Aiui that's still driven by data passed in by Dom0, so in turn the question is whether
> there are any constraints on what ACPI may surface. If there are, all that may be
> necessary is adding a check. If there aren't, ...
>

According to the ACPI spec, _PSD controls both P-states and CPPC, so in my implementation of getting CPPC data passed by Dom0 (set_cppc_pminfo()), I require both entries, _PSD and _CPC, to exist.
```
        if ( cppc_data->flags == (XEN_CPPC_PSD | XEN_CPPC_CPC) )
        {
                ...
                pm_info->init = XEN_CPPC_INIT;
                ret = cpufreq_cpu_init(cpuid);
                ...
        }
```

> >> Since initially I was thinking of using per-CPU data also in my
> >> patch, I'm reproducing this in raw form below, for your reference.
> >> It's generally only
> >> 4.22 material now, of course. Yet in turn for your driver the new
> >> drv_data field may want to become a union, with an "acpi" and a "cppc" sub-
> struct.
> >
> > How about I embed my new driver data " struct amd_cppc_drv_data * " into
> cpufreq policy, maybe pointer is enough?
> > Later, maybe, all "cppc", "acpi" and "hwp" could constitute an union in policy.
>
> ... I'd prefer to go the union approach right away. Whether then to take my patch as
> a prereq is tbd; that largely depends on what (if anything) is needed on the HWP
> side. If HWP needs fixing, that wants to to come first, as it would want backporting.
>
> Jan
Re: [PATCH v8 3/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
Posted by Jan Beulich 1 month, 4 weeks ago
On 01.09.2025 05:21, Penny, Zheng wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Friday, August 29, 2025 2:12 PM
>> To: Penny, Zheng <penny.zheng@amd.com>
>> Cc: Huang, Ray <Ray.Huang@amd.com>; Andrew Cooper
>> <andrew.cooper3@citrix.com>; Roger Pau Monné <roger.pau@citrix.com>;
>> Anthony PERARD <anthony.perard@vates.tech>; Orzel, Michal
>> <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Stefano Stabellini
>> <sstabellini@kernel.org>; xen-devel@lists.xenproject.org; Andryuk, Jason
>> <Jason.Andryuk@amd.com>
>> Subject: Re: [PATCH v8 3/8] xen/cpufreq: implement amd-cppc driver for CPPC in
>> passive mode
>>
>> On 29.08.2025 05:30, Penny, Zheng wrote:
>>>> -----Original Message-----
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: Thursday, August 28, 2025 7:23 PM
>>>>
>>>> On 28.08.2025 12:03, Penny Zheng wrote:
>>>>> +static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
>>>>> +                                            unsigned int target_freq,
>>>>> +                                            unsigned int relation)
>>>>> +{
>>>>> +    unsigned int cpu = policy->cpu;
>>>>> +    const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
>>>>
>>>> I fear there's a problem here that I so far overlooked. As it
>>>> happens, just yesterday I made a patch to eliminate
>>>> cpufreq_drv_data[] global. In the course of doing so it became clear
>>>> that in principle the CPU denoted by
>>>> policy->cpu can be offline. Hence its per-CPU data is also unavailable. See
>>>> cpufreq_add_cpu()'s invocation of .init() and cpufreq_del_cpu()'s
>>>> invocation of .exit(). Is there anything well-hidden (and likely
>>>> lacking some suitable
>>>> comment) which guarantees that no two CPUs (threads) will be in the
>>>> same domain? If not, I fear you simply can't use per-CPU data here.
>>>>
>>>
>>> Correct me if I understand you wrongly:
>>> No, my env is always per pcpu per cpufreq domain. So it never occurred to me
>>> that cpus, other than the first one in domain, will never call .init(), and of course, no
>>> per_cpu(amd_cppc_drv_data) ever gets allocated then.
>>
>> Well, the question is how domains are organized when using the CPPC driver.
>> Aiui that's still driven by data passed in by Dom0, so in turn the question is whether
>> there are any constraints on what ACPI may surface. If there are, all that may be
>> necessary is adding a check. If there aren't, ...
>>
> 
> According to ACPI spec, _PSD controls both P-state or CPPC, so in my implementation of getting CPPC data passed by Dom0(set_cppc_pminfo()), I demand both entry exist, _PSD and _CPC.
> ```
>         if ( cppc_data->flags == (XEN_CPPC_PSD | XEN_CPPC_CPC) )
>         {
>                 ...
>                 pm_info->init = XEN_CPPC_INIT;
>                 ret = cpufreq_cpu_init(cpuid);
>                 ...
>         }
> ```

That's only about presence of the data though. My remark, otoh, was about its
content. It could in principle be specified somewhere that no domain may cover
more than a single CPU / thread. In the absence of such, the case needs
handling correctly.

Jan

Re: [PATCH v8 3/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
Posted by Jason Andryuk 2 months ago
On 2025-08-28 07:22, Jan Beulich wrote:
> On 28.08.2025 12:03, Penny Zheng wrote:
>> +static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
>> +                                            unsigned int target_freq,
>> +                                            unsigned int relation)
>> +{
>> +    unsigned int cpu = policy->cpu;
>> +    const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
> 
> I fear there's a problem here that I so far overlooked. As it happens, just
> yesterday I made a patch to eliminate cpufreq_drv_data[] global. In the
> course of doing so it became clear that in principle the CPU denoted by
> policy->cpu can be offline. Hence its per-CPU data is also unavailable. See
> cpufreq_add_cpu()'s invocation of .init() and cpufreq_del_cpu()'s invocation
> of .exit(). Is there anything well-hidden (and likely lacking some suitable
> comment) which guarantees that no two CPUs (threads) will be in the same
> domain? If not, I fear you simply can't use per-CPU data here.

Sorry, I'm confused by your use of "domain" here.  Do you mean a 
per_cpu(..., policy->cpu) access racing with a cpu offline?  I'm not 
aware of anything preventing that, though I'm not particularly familiar
with it.

do_pm_op() has:
     if ( op->cpuid >= nr_cpu_ids || !cpu_online(op->cpuid) )
         return -EINVAL;
     pmpt = processor_pminfo[op->cpuid];

and do_get_pm_info() has:
     if ( !op || (op->cpuid >= nr_cpu_ids) || !cpu_online(op->cpuid) )
         return -EINVAL;
     pmpt = processor_pminfo[op->cpuid];

But those are only at entry.

Regards,
Jason

> -        cpufreq_drv_data[policy->cpu] = NULL;
> -        xfree(data->freq_table);
> -        xfree(data);
> -    }
> +    XFREE(policy->drv_data.freq_table);
>   
>       return 0;
>   }
> --- a/xen/arch/x86/acpi/cpufreq/cpufreq.c
> +++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c
> @@ -35,8 +35,6 @@
>   
>   #include <acpi/cpufreq/cpufreq.h>
>   
> -struct acpi_cpufreq_data *cpufreq_drv_data[NR_CPUS];
> -
>   struct perf_pair {
>       union {
>           struct {
> --- a/xen/arch/x86/acpi/cpufreq/powernow.c
> +++ b/xen/arch/x86/acpi/cpufreq/powernow.c
> @@ -84,16 +84,14 @@ static int cf_check powernow_cpufreq_tar
>       struct cpufreq_policy *policy,
>       unsigned int target_freq, unsigned int relation)
>   {
> -    struct acpi_cpufreq_data *data = cpufreq_drv_data[policy->cpu];
> +    struct acpi_cpufreq_data *data = &policy->drv_data;
>       struct processor_performance *perf;
>       unsigned int next_state; /* Index into freq_table */
>       unsigned int next_perf_state; /* Index into perf table */
>       int result;
>   
> -    if (unlikely(data == NULL ||
> -        data->acpi_data == NULL || data->freq_table == NULL)) {
> +    if ( !data->acpi_data || !data->freq_table )
>           return -ENODEV;
> -    }
>   
>       perf = data->acpi_data;
>       result = cpufreq_frequency_table_target(policy,
> @@ -185,11 +183,9 @@ static void cf_check get_cpu_data(void *
>   
>   static int cf_check powernow_cpufreq_verify(struct cpufreq_policy *policy)
>   {
> -    struct acpi_cpufreq_data *data;
>       struct processor_performance *perf;
>   
> -    if (!policy || !(data = cpufreq_drv_data[policy->cpu]) ||
> -        !processor_pminfo[policy->cpu])
> +    if ( !policy || !processor_pminfo[policy->cpu] )
>           return -EINVAL;
>   
>       perf = &processor_pminfo[policy->cpu]->perf;
> @@ -197,7 +193,7 @@ static int cf_check powernow_cpufreq_ver
>       cpufreq_verify_within_limits(policy, 0,
>           perf->states[perf->platform_limit].core_frequency * 1000);
>   
> -    return cpufreq_frequency_table_verify(policy, data->freq_table);
> +    return cpufreq_frequency_table_verify(policy, policy->drv_data.freq_table);
>   }
>   
>   static int cf_check powernow_cpufreq_cpu_init(struct cpufreq_policy *policy)
> @@ -205,18 +201,12 @@ static int cf_check powernow_cpufreq_cpu
>       unsigned int i;
>       unsigned int valid_states = 0;
>       unsigned int cpu = policy->cpu;
> -    struct acpi_cpufreq_data *data;
> +    struct acpi_cpufreq_data *data = &policy->drv_data;
>       unsigned int result = 0;
>       struct processor_performance *perf;
>       struct amd_cpu_data info;
>       struct cpuinfo_x86 *c = &cpu_data[policy->cpu];
>   
> -    data = xzalloc(struct acpi_cpufreq_data);
> -    if (!data)
> -        return -ENOMEM;
> -
> -    cpufreq_drv_data[cpu] = data;
> -
>       data->acpi_data = &processor_pminfo[cpu]->perf;
>   
>       info.perf = perf = data->acpi_data;
> @@ -228,8 +218,7 @@ static int cf_check powernow_cpufreq_cpu
>           if (cpumask_weight(policy->cpus) != 1) {
>               printk(XENLOG_WARNING "Unsupported sharing type %d (%u CPUs)\n",
>                      policy->shared_type, cpumask_weight(policy->cpus));
> -            result = -ENODEV;
> -            goto err_unreg;
> +            return -ENODEV;
>           }
>       } else {
>           cpumask_copy(policy->cpus, cpumask_of(cpu));
> @@ -238,21 +227,16 @@ static int cf_check powernow_cpufreq_cpu
>       /* capability check */
>       if (perf->state_count <= 1) {
>           printk("No P-States\n");
> -        result = -ENODEV;
> -        goto err_unreg;
> +        return -ENODEV;
>       }
>   
> -    if (perf->control_register.space_id != perf->status_register.space_id) {
> -        result = -ENODEV;
> -        goto err_unreg;
> -    }
> +    if ( perf->control_register.space_id != perf->status_register.space_id )
> +        return -ENODEV;
>   
>       data->freq_table = xmalloc_array(struct cpufreq_frequency_table,
>                                       (perf->state_count+1));
> -    if (!data->freq_table) {
> -        result = -ENOMEM;
> -        goto err_unreg;
> -    }
> +    if ( !data->freq_table )
> +        return -ENOMEM;
>   
>       /* detect transition latency */
>       policy->cpuinfo.transition_latency = 0;
> @@ -298,23 +282,14 @@ static int cf_check powernow_cpufreq_cpu
>       return result;
>   
>   err_freqfree:
> -    xfree(data->freq_table);
> -err_unreg:
> -    xfree(data);
> -    cpufreq_drv_data[cpu] = NULL;
> +    XFREE(data->freq_table);
>   
>       return result;
>   }
>   
>   static int cf_check powernow_cpufreq_cpu_exit(struct cpufreq_policy *policy)
>   {
> -    struct acpi_cpufreq_data *data = cpufreq_drv_data[policy->cpu];
> -
> -    if (data) {
> -        cpufreq_drv_data[policy->cpu] = NULL;
> -        xfree(data->freq_table);
> -        xfree(data);
> -    }
> +    XFREE(policy->drv_data.freq_table);
>   
>       return 0;
>   }
> --- a/xen/include/acpi/cpufreq/cpufreq.h
> +++ b/xen/include/acpi/cpufreq/cpufreq.h
> @@ -37,8 +37,6 @@ struct acpi_cpufreq_data {
>       unsigned int arch_cpu_flags;
>   };
>   
> -extern struct acpi_cpufreq_data *cpufreq_drv_data[NR_CPUS];
> -
>   struct cpufreq_cpuinfo {
>       unsigned int        max_freq;
>       unsigned int        second_max_freq;    /* P1 if Turbo Mode is on */
> @@ -80,6 +78,8 @@ struct cpufreq_policy {
>       int8_t              turbo;  /* tristate flag: 0 for unsupported
>                                    * -1 for disable, 1 for enabled
>                                    * See CPUFREQ_TURBO_* below for defines */
> +
> +    struct acpi_cpufreq_data drv_data;
>   };
>   DECLARE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_policy);
>   
>
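The core of the refactor in the hunks above — dropping the global `cpufreq_drv_data[]` pointer array and embedding the driver data directly in `struct cpufreq_policy` — can be sketched in isolation as a toy model (simplified shapes, not Xen's actual types; `cpu_init`/`cpu_exit` stand in for the `.init`/`.exit` handlers):

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified shapes mirroring the patch: the driver data
 * becomes an embedded member of the policy instead of a separately
 * allocated object tracked in a global cpufreq_drv_data[] array. */
struct freq_entry { unsigned int frequency; };

struct acpi_cpufreq_data {
    struct freq_entry *freq_table;   /* still separately allocated */
    unsigned int arch_cpu_flags;
};

struct cpufreq_policy {
    unsigned int cpu;
    struct acpi_cpufreq_data drv_data;   /* embedded: no xzalloc() needed */
};

/* Init now only allocates the frequency table; the containing object's
 * lifetime is tied to the policy itself, so the err_unreg-style cleanup
 * path disappears. */
static int cpu_init(struct cpufreq_policy *policy, unsigned int nstates)
{
    struct acpi_cpufreq_data *data = &policy->drv_data;

    data->freq_table = calloc(nstates + 1, sizeof(*data->freq_table));
    return data->freq_table ? 0 : -1;   /* stand-in for -ENOMEM */
}

/* Exit frees only the table; free-and-NULL in one step mirrors the
 * XFREE() idiom the patch switches to. */
static void cpu_exit(struct cpufreq_policy *policy)
{
    free(policy->drv_data.freq_table);
    policy->drv_data.freq_table = NULL;
}
```
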
Re: [PATCH v8 3/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
Posted by Jan Beulich 2 months ago
On 29.08.2025 02:16, Jason Andryuk wrote:
> On 2025-08-28 07:22, Jan Beulich wrote:
>> On 28.08.2025 12:03, Penny Zheng wrote:
>>> +static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
>>> +                                            unsigned int target_freq,
>>> +                                            unsigned int relation)
>>> +{
>>> +    unsigned int cpu = policy->cpu;
>>> +    const struct amd_cppc_drv_data *data = per_cpu(amd_cppc_drv_data, cpu);
>>
>> I fear there's a problem here that I so far overlooked. As it happens, just
>> yesterday I made a patch to eliminate cpufreq_drv_data[] global. In the
>> course of doing so it became clear that in principle the CPU denoted by
>> policy->cpu can be offline. Hence its per-CPU data is also unavailable. See
>> cpufreq_add_cpu()'s invocation of .init() and cpufreq_del_cpu()'s invocation
>> of .exit(). Is there anything well-hidden (and likely lacking some suitable
>> comment) which guarantees that no two CPUs (threads) will be in the same
>> domain? If not, I fear you simply can't use per-CPU data here.
> 
> Sorry, I'm confused by your use of "domain" here.

I agree it's confusing, but that's the terminology used in cpufreq.c (see
e.g. "struct cpufreq_dom" or "const struct xen_psd_package *domain_info").

>  Do you mean a 
> per_cpu(..., policy->cpu) access racing with a cpu offline?

Yes (I wouldn't call it "racing" though, as it's not a timing issue).

>  I'm not 
> aware of anything preventing that, though I'm not particularly familiar 
> with it.
> do_pm_op() has:
>      if ( op->cpuid >= nr_cpu_ids || !cpu_online(op->cpuid) )
>          return -EINVAL;
>      pmpt = processor_pminfo[op->cpuid];
> 
> and do_get_pm_info() has:
>      if ( !op || (op->cpuid >= nr_cpu_ids) || !cpu_online(op->cpuid) )
>          return -EINVAL;
>      pmpt = processor_pminfo[op->cpuid];
> 
> But those are only at entry.

That's not accessing struct cpufreq_policy, though. Per-CPU accesses
using policy->cpu are the problematic ones, as - from all I can tell -
the CPU named there can have gone offline, with the policy surviving
when some other CPU is also part of the same "domain".
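The hazard described here can be modelled in a few lines (a toy simulation, not Xen code — `per_cpu_drv_data`, `cpu_online_init`, and `cpu_offline_teardown` are illustrative names standing in for the per-CPU machinery):

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy model of the hazard: per-CPU driver data is torn down when a CPU
 * goes offline, but a surviving policy may still name that CPU, so a
 * later per_cpu(..., policy->cpu)-style lookup yields NULL (or, worse,
 * stale memory). */
#define NR_CPUS 4

struct drv_data { unsigned int cached_req; };

static struct drv_data *per_cpu_drv_data[NR_CPUS];

static void cpu_online_init(unsigned int cpu)
{
    per_cpu_drv_data[cpu] = calloc(1, sizeof(struct drv_data));
}

static void cpu_offline_teardown(unsigned int cpu)
{
    free(per_cpu_drv_data[cpu]);
    per_cpu_drv_data[cpu] = NULL;
}

/* A .target() handler therefore must not assume the lookup succeeds
 * when policy->cpu may have gone offline under a surviving policy. */
static int set_target(unsigned int policy_cpu, unsigned int req)
{
    struct drv_data *data = per_cpu_drv_data[policy_cpu];

    if ( !data )          /* policy_cpu went offline; data is gone */
        return -1;        /* stand-in for -ENODEV */
    data->cached_req = req;
    return 0;
}
```
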

As said in the reply to Penny, main question is whether the data
controlling what a "domain" covers may be constrained in the HWP case,
demanding that no two CPUs (threads) can be in the same "domain". Then
adding merely a sanity check somewhere would suffice.
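A minimal sketch of such a sanity check, under the assumption that HWP/CPPC control is per-thread so a coordination "domain" spanning more than one CPU is unexpected (cpumask handling reduced to a popcount over a plain bitmask; all names illustrative):

```c
/* If no two CPUs (threads) can share a "domain", then policy->cpu is
 * always the sole member of its domain and its per-CPU data cannot
 * outlive the policy. Init can enforce that invariant up front. */
typedef unsigned long cpumask_t;

static unsigned int cpumask_weight(cpumask_t mask)
{
    unsigned int n = 0;

    /* Kernighan popcount: clear the lowest set bit each iteration. */
    for ( ; mask; mask &= mask - 1 )
        n++;
    return n;
}

static int domain_sanity_check(cpumask_t domain_cpus)
{
    /* Using per-CPU data keyed on policy->cpu is only safe if no
     * sibling can survive that CPU going offline, i.e. the domain
     * covers exactly one CPU. */
    return cpumask_weight(domain_cpus) == 1 ? 0 : -1;   /* -ENODEV */
}
```
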

Jan