The sched_ext schedulers [1] currently access the energy model through the
debugfs to make energy-aware scheduling decisions [2]. The userspace part
of a sched_ext scheduler feeds the necessary (post-processed) energy-model
information to the BPF part of the scheduler.
However, there is a limitation in the current debugfs support of the energy
model. When the energy model is updated (em_dev_update_perf_domain), there
is no way for the userspace part to know such changes (besides polling the
debugfs files).
Therefore, add inotify support (IN_MODIFY) when the energy model is updated.
With this inotify support, the directory of an updated performance domain
(e.g., /sys/kernel/debug/energy_model/cpu0) and its parent directory (e.g.,
/sys/kernel/debug/energy_model) receive an IN_MODIFY event. A sched_ext
scheduler (or any userspace application) can then monitor energy model
changes from userspace using the regular inotify interface.
Note that accessing the energy model information from userspace has many
advantages over other alternatives, especially adding new BPF kfuncs.
Userspace has much more freedom than BPF code (e.g., using external
libraries and floating-point arithmetic), which may be infeasible (if not
impossible) in BPF/kernel code.
[1] https://lwn.net/Articles/922405/
[2] https://github.com/sched-ext/scx/pull/1624
Signed-off-by: Changwoo Min <changwoo@igalia.com>
---
ChangeLog v1 -> v2:
- Change em_debug_update() to only inotify the directory of an updated
performance domain (and its parent directory).
- Move the em_debug_update() call outside of the mutex lock.
- Update the commit message to clarify its motivation and what will be
inotified when updated.
kernel/power/energy_model.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index d9b7e2b38c7a..590e90e8cb66 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -14,6 +14,7 @@
 #include <linux/cpumask.h>
 #include <linux/debugfs.h>
 #include <linux/energy_model.h>
+#include <linux/fsnotify.h>
 #include <linux/sched/topology.h>
 #include <linux/slab.h>
@@ -156,9 +157,18 @@ static int __init em_debug_init(void)
 	return 0;
 }
 fs_initcall(em_debug_init);
+
+static void em_debug_update(struct device *dev)
+{
+	struct dentry *d;
+
+	d = debugfs_lookup(dev_name(dev), rootdir);
+	fsnotify_dentry(d, FS_MODIFY);
+}
 #else /* CONFIG_DEBUG_FS */
 static void em_debug_create_pd(struct device *dev) {}
 static void em_debug_remove_pd(struct device *dev) {}
+static void em_debug_update(struct device *dev) {}
 #endif
 static void em_release_table_kref(struct kref *kref)
@@ -324,6 +334,8 @@ int em_dev_update_perf_domain(struct device *dev,
 	em_table_free(old_table);
 	mutex_unlock(&em_pd_mutex);
+
+	em_debug_update(dev);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(em_dev_update_perf_domain);
--
2.49.0
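
For readers who want to see what the userspace side of this proposal could look like, here is a minimal sketch using the regular inotify API. It assumes the patch above is applied, that debugfs is mounted at /sys/kernel/debug, and that the caller has permission to read it (usually root). The helper names em_watch_open() and em_watch_wait() are hypothetical and do not come from the patch or from any sched_ext scheduler.

```c
#include <poll.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Open an inotify instance that watches `dir` for IN_MODIFY.
 * Returns the inotify fd, or -1 on error.
 */
static int em_watch_open(const char *dir)
{
	int fd = inotify_init1(IN_CLOEXEC | IN_NONBLOCK);

	if (fd < 0)
		return -1;
	if (inotify_add_watch(fd, dir, IN_MODIFY) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Wait up to timeout_ms for an IN_MODIFY event on the watched directory.
 * Returns 1 if one arrived, 0 on timeout (or if only other events were
 * queued), and -1 on error.
 */
static int em_watch_wait(int fd, int timeout_ms)
{
	char buf[4096]
		__attribute__((aligned(__alignof__(struct inotify_event))));
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	const struct inotify_event *ev;
	ssize_t len;
	char *p;
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret <= 0)
		return ret;
	len = read(fd, buf, sizeof(buf));
	if (len < 0)
		return -1;
	/* Events are variable-length: a header followed by ev->len name bytes. */
	for (p = buf; p < buf + len; p += sizeof(struct inotify_event) + ev->len) {
		ev = (const struct inotify_event *)p;
		if (ev->mask & IN_MODIFY)
			return 1;
	}
	return 0;
}
```

A scheduler's userspace part might call em_watch_open("/sys/kernel/debug/energy_model") once and re-read the per-domain files whenever em_watch_wait() returns 1, instead of polling the files on a timer.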
Hi Changwoo,

On 5/7/25 02:47, Changwoo Min wrote:
> The sched_ext schedulers [1] currently access the energy model through the
> debugfs to make energy-aware scheduling decisions [2]. The userspace part
> of a sched_ext scheduler feeds the necessary (post-processed) energy-model
> information to the BPF part of the scheduler.
>
> However, there is a limitation in the current debugfs support of the energy
> model. When the energy model is updated (em_dev_update_perf_domain), there
> is no way for the userspace part to know such changes (besides polling the
> debugfs files).
>
> Therefore, add inotify support (IN_MODIFY) when the energy model is updated.
> With this inotify support, the directory of an updated performance domain
> (e.g., /sys/kernel/debug/energy_model/cpu0) and its parent directory (e.g.,
> /sys/kernel/debug/energy_model) are inotified. Therefore, a sched_ext
> scheduler (or any userspace application) monitors the energy model change
> in userspace using the regular inotify interface.
>
> Note that accessing the energy model information from userspace has many
> advantages over other alternatives, especially adding new BPF kfuncs. The
> userspace has much more freedom than the BPF code (e.g., using external
> libraries and floating point arithmetics), which may be infeasible (if not
> impossible) in the BPF/kernel code.
>
> [1] https://lwn.net/Articles/922405/
> [2] https://github.com/sched-ext/scx/pull/1624
>
> Signed-off-by: Changwoo Min <changwoo@igalia.com>
> ---
>
> ChangeLog v1 -> v2:
> - Change em_debug_update() to only inotify the directory of an updated
> performance domain (and its parent directory).
> - Move the em_debug_update() call outside of the mutex lock.
> - Update the commit message to clarify its motivation and what will be
> inotified when updated.
>
> kernel/power/energy_model.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>

I have discussed that with Rafael and we have similar view.
The EM debugfs is not the right interface for this purpose.

A better design and mechanism for your purpose would be the netlink
notification. It is present in the kernel in thermal framework
and e.g. is used by Intel HFI
- drivers/thermal/intel/intel_hfi.c
- drivers/thermal/thermal_netlink.c
It's able to send to the user space the information from FW about
the CPUs' efficiency changes, which is similar to this EM modification.

Would you be interested in writing similar mechanism in the EM fwk?

Regards,
Lukasz
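
As background on the mechanism suggested above: thermal notifications are delivered over generic netlink, so a userspace listener first asks the generic netlink controller to resolve a family name into a numeric ID and its multicast groups. The sketch below only builds that CTRL_CMD_GETFAMILY request. build_getfamily_req() is a hypothetical helper; "thermal" is the family name defined in the thermal UAPI header. Nothing here describes an EM netlink interface, since none is defined in this thread.

```c
#include <linux/genetlink.h>
#include <linux/netlink.h>
#include <stdint.h>
#include <string.h>

/*
 * Build a CTRL_CMD_GETFAMILY request asking the generic netlink controller
 * about the family named `family`. Returns the total message length in
 * bytes, or 0 if `buf` is too small. `buf` must be 4-byte aligned.
 */
static size_t build_getfamily_req(void *buf, size_t buflen,
				  const char *family, uint32_t seq)
{
	size_t name_len = strlen(family) + 1;	/* include the trailing NUL */
	size_t attr_len = NLA_HDRLEN + name_len;
	size_t total = NLMSG_LENGTH(GENL_HDRLEN) + NLA_ALIGN(attr_len);
	struct nlmsghdr *nlh = buf;
	struct genlmsghdr *genl;
	struct nlattr *nla;

	if (total > buflen)
		return 0;
	memset(buf, 0, total);

	/* Netlink header: addressed to the controller family. */
	nlh->nlmsg_len = total;
	nlh->nlmsg_type = GENL_ID_CTRL;
	nlh->nlmsg_flags = NLM_F_REQUEST;
	nlh->nlmsg_seq = seq;

	/* Generic netlink header: which controller command to run. */
	genl = NLMSG_DATA(nlh);
	genl->cmd = CTRL_CMD_GETFAMILY;
	genl->version = 1;

	/* Single attribute carrying the NUL-terminated family name. */
	nla = (struct nlattr *)((char *)genl + GENL_HDRLEN);
	nla->nla_type = CTRL_ATTR_FAMILY_NAME;
	nla->nla_len = attr_len;
	memcpy((char *)nla + NLA_HDRLEN, family, name_len);

	return total;
}
```

A listener would send this message on a socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC) socket, read the multicast group IDs from the CTRL_ATTR_MCAST_GROUPS attribute of the reply, and join a group with setsockopt(SOL_NETLINK, NETLINK_ADD_MEMBERSHIP). After that, events arrive without polling.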
Hi Lukasz and Rafael,

Thank you for the pointers and guidance.

On 5/9/25 19:55, Lukasz Luba wrote:
> Hi Changwoo,
>
> On 5/7/25 02:47, Changwoo Min wrote:
>> The sched_ext schedulers [1] currently access the energy model through the
>> debugfs to make energy-aware scheduling decisions [2]. The userspace part
>> of a sched_ext scheduler feeds the necessary (post-processed) energy-model
>> information to the BPF part of the scheduler.
>>
>> However, there is a limitation in the current debugfs support of the energy
>> model. When the energy model is updated (em_dev_update_perf_domain), there
>> is no way for the userspace part to know such changes (besides polling the
>> debugfs files).
>>
>> Therefore, add inotify support (IN_MODIFY) when the energy model is updated.
>> With this inotify support, the directory of an updated performance domain
>> (e.g., /sys/kernel/debug/energy_model/cpu0) and its parent directory (e.g.,
>> /sys/kernel/debug/energy_model) are inotified. Therefore, a sched_ext
>> scheduler (or any userspace application) monitors the energy model change
>> in userspace using the regular inotify interface.
>>
>> Note that accessing the energy model information from userspace has many
>> advantages over other alternatives, especially adding new BPF kfuncs. The
>> userspace has much more freedom than the BPF code (e.g., using external
>> libraries and floating point arithmetics), which may be infeasible (if not
>> impossible) in the BPF/kernel code.
>>
>> [1] https://lwn.net/Articles/922405/
>> [2] https://github.com/sched-ext/scx/pull/1624
>>
>> Signed-off-by: Changwoo Min <changwoo@igalia.com>
>> ---
>>
>> ChangeLog v1 -> v2:
>> - Change em_debug_update() to only inotify the directory of an updated
>> performance domain (and its parent directory).
>> - Move the em_debug_update() call outside of the mutex lock.
>> - Update the commit message to clarify its motivation and what will be
>> inotified when updated.
>>
>> kernel/power/energy_model.c | 12 ++++++++++++
>> 1 file changed, 12 insertions(+)
>>
>
> I have discussed that with Rafael and we have similar view.
> The EM debugfs is not the right interface for this purpose.
>
> A better design and mechanism for your purpose would be the netlink
> notification. It is present in the kernel in thermal framework
> and e.g. is used by Intel HFI
> - drivers/thermal/intel/intel_hfi.c
> - drivers/thermal/thermal_netlink.c
> It's able to send to the user space the information from FW about
> the CPUs' efficiency changes, which is similar to this EM modification.

I have considered netlink before. However, I chose the debugfs-inotify
path since it requires fewer changes. However, if the netlink interface
is better for this purpose (I agree *debugfs* is not ideal), sure let's
go with that direction.

>
> Would you be interested in writing similar mechanism in the EM fwk?

Sure, I will work on it and send another patch set.

>
> Regards,
> Lukasz
>
> _______________________________________________
> Kernel-dev mailing list -- kernel-dev@igalia.com
> To unsubscribe send an email to kernel-dev-leave@igalia.com
>
On Fri, May 9, 2025 at 12:55 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Changwoo,
>
> On 5/7/25 02:47, Changwoo Min wrote:
> > The sched_ext schedulers [1] currently access the energy model through the
> > debugfs to make energy-aware scheduling decisions [2]. The userspace part
> > of a sched_ext scheduler feeds the necessary (post-processed) energy-model
> > information to the BPF part of the scheduler.
> >
> > However, there is a limitation in the current debugfs support of the energy
> > model. When the energy model is updated (em_dev_update_perf_domain), there
> > is no way for the userspace part to know such changes (besides polling the
> > debugfs files).
> >
> > Therefore, add inotify support (IN_MODIFY) when the energy model is updated.
> > With this inotify support, the directory of an updated performance domain
> > (e.g., /sys/kernel/debug/energy_model/cpu0) and its parent directory (e.g.,
> > /sys/kernel/debug/energy_model) are inotified. Therefore, a sched_ext
> > scheduler (or any userspace application) monitors the energy model change
> > in userspace using the regular inotify interface.
> >
> > Note that accessing the energy model information from userspace has many
> > advantages over other alternatives, especially adding new BPF kfuncs. The
> > userspace has much more freedom than the BPF code (e.g., using external
> > libraries and floating point arithmetics), which may be infeasible (if not
> > impossible) in the BPF/kernel code.
> >
> > [1] https://lwn.net/Articles/922405/
> > [2] https://github.com/sched-ext/scx/pull/1624
> >
> > Signed-off-by: Changwoo Min <changwoo@igalia.com>
> > ---
> >
> > ChangeLog v1 -> v2:
> > - Change em_debug_update() to only inotify the directory of an updated
> > performance domain (and its parent directory).
> > - Move the em_debug_update() call outside of the mutex lock.
> > - Update the commit message to clarify its motivation and what will be
> > inotified when updated.
> >
> > kernel/power/energy_model.c | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
>
> I have discussed that with Rafael and we have similar view.
> The EM debugfs is not the right interface for this purpose.
>
> A better design and mechanism for your purpose would be the netlink
> notification. It is present in the kernel in thermal framework
> and e.g. is used by Intel HFI
> - drivers/thermal/intel/intel_hfi.c
> - drivers/thermal/thermal_netlink.c
> It's able to send to the user space the information from FW about
> the CPUs' efficiency changes, which is similar to this EM modification.

In addition, after this patch

https://lore.kernel.org/linux-pm/3637203.iIbC2pHGDl@rjwysocki.net/

which is about to get into linux-next, em_dev_update_perf_domain()
will not be the only place where the Energy Model can be updated.

Thanks!
Thank you, Rafael, for the pointer.

On 5/10/25 01:41, Rafael J. Wysocki wrote:
>>
>> I have discussed that with Rafael and we have similar view.
>> The EM debugfs is not the right interface for this purpose.
>>
>> A better design and mechanism for your purpose would be the netlink
>> notification. It is present in the kernel in thermal framework
>> and e.g. is used by Intel HFI
>> - drivers/thermal/intel/intel_hfi.c
>> - drivers/thermal/thermal_netlink.c
>> It's able to send to the user space the information from FW about
>> the CPUs' efficiency changes, which is similar to this EM modification.
>
> In addition, after this patch
>
> https://lore.kernel.org/linux-pm/3637203.iIbC2pHGDl@rjwysocki.net/
>
> which is about to get into linux-next, em_dev_update_perf_domain()
> will not be the only place where the Energy Model can be updated.

I am curious about whether the energy model is likely to be updated more
often with this change. How often the energy model is likely to be
updated is a factor to be considered for the interface and the model
for post-processing the energy model (in the BPF schedulers).

Regards,
Changwoo Min

>
> Thanks!
On Sat, May 10, 2025 at 7:07 AM Changwoo Min <changwoo@igalia.com> wrote:
>
> Thank you, Rafael, for the pointer.
>
> On 5/10/25 01:41, Rafael J. Wysocki wrote:
> >>
> >> I have discussed that with Rafael and we have similar view.
> >> The EM debugfs is not the right interface for this purpose.
> >>
> >> A better design and mechanism for your purpose would be the netlink
> >> notification. It is present in the kernel in thermal framework
> >> and e.g. is used by Intel HFI
> >> - drivers/thermal/intel/intel_hfi.c
> >> - drivers/thermal/thermal_netlink.c
> >> It's able to send to the user space the information from FW about
> >> the CPUs' efficiency changes, which is similar to this EM modification.
> >
> > In addition, after this patch
> >
> > https://lore.kernel.org/linux-pm/3637203.iIbC2pHGDl@rjwysocki.net/
> >
> > which is about to get into linux-next, em_dev_update_perf_domain()
> > will not be the only place where the Energy Model can be updated.
>
> I am curious about whether the energy model is likely to be updated more
> often with this change. How often the energy model is likely to be
> updated is a factor to be considered for the interface and the model
> for post-processing the energy model (in the BPF schedulers).

It really is hard to say precisely because eventually this will depend
on the platform firmware. Hopefully, this is not going to happen too
often, but if the thermal envelope of the platform is tight, for
instance, it may not be the case.
On 5/10/25 12:34, Rafael J. Wysocki wrote:
> On Sat, May 10, 2025 at 7:07 AM Changwoo Min <changwoo@igalia.com> wrote:
>>
>> Thank you, Rafael, for the pointer.
>>
>> On 5/10/25 01:41, Rafael J. Wysocki wrote:
>>>>
>>>> I have discussed that with Rafael and we have similar view.
>>>> The EM debugfs is not the right interface for this purpose.
>>>>
>>>> A better design and mechanism for your purpose would be the netlink
>>>> notification. It is present in the kernel in thermal framework
>>>> and e.g. is used by Intel HFI
>>>> - drivers/thermal/intel/intel_hfi.c
>>>> - drivers/thermal/thermal_netlink.c
>>>> It's able to send to the user space the information from FW about
>>>> the CPUs' efficiency changes, which is similar to this EM modification.
>>>
>>> In addition, after this patch
>>>
>>> https://lore.kernel.org/linux-pm/3637203.iIbC2pHGDl@rjwysocki.net/
>>>
>>> which is about to get into linux-next, em_dev_update_perf_domain()
>>> will not be the only place where the Energy Model can be updated.
>>
>> I am curious about whether the energy model is likely to be updated more
>> often with this change. How often the energy model is likely to be
>> updated is a factor to be considered for the interface and the model
>> for post-processing the energy model (in the BPF schedulers).
>
> It really is hard to say precisely because eventually this will depend
> on the platform firmware. Hopefully, this is not going to happen too
> often, but if the thermal envelope of the platform is tight, for
> instance, it may not be the case.
It's hard to say for all use cases, but there are some easy to measure
and understand:
1. Long scenarios with heavy GPU usage (e.g. gaming). Power on CPUs
built from High-Performance cells can be affected by +20% after
~1min
2. Longer recording with heavy ISP usage, similar to above
In those two, it's sufficient to update the EM every 1-3sec to reach
this +20% after 60sec. Although, at the beginning when the GPU starts
heating the updates should happen a bit more often.
There are some more complex cases, e.g. when more than 1 Big CPU does
heavy computations and the heat is higher than normal EM model of
single CPU (even for that scenario profile). Then the updates to EM
can go a bit more often (it depends what the platform would like
to leverage and achieve w/ SW).
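
As a back-of-the-envelope reading of the figures above, the sketch below works out how many updates fit in the 60-second window and how large each one would be on average. It assumes the +20% is reached in equal additive steps, which is a simplification: the message itself says updates should be more frequent at the start. The helper names are hypothetical, and integer basis points (1 bp = 0.01%) are used to avoid floating point.

```c
#include <stdint.h>

/* Number of EM updates in window_s seconds at one update every period_s. */
static uint32_t em_updates_in_window(uint32_t window_s, uint32_t period_s)
{
	return period_s ? window_s / period_s : 0;
}

/*
 * Average power change per update in basis points, assuming the total
 * change total_bp is spread evenly over all updates in the window.
 */
static uint32_t em_step_bp(uint32_t total_bp, uint32_t window_s,
			   uint32_t period_s)
{
	uint32_t n = em_updates_in_window(window_s, period_s);

	return n ? total_bp / n : 0;
}
```

With a 60-second window, update periods of 1 to 3 seconds give between 60 and 20 updates. Under the even-step assumption that is roughly 0.33% to 1% of power change per update.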
On 5/22/25 17:19, Lukasz Luba wrote:
>
>
> On 5/10/25 12:34, Rafael J. Wysocki wrote:
>> On Sat, May 10, 2025 at 7:07 AM Changwoo Min <changwoo@igalia.com> wrote:

>>> I am curious about whether the energy model is likely to be updated more
>>> often with this change. How often the energy model is likely to be
>>> updated is a factor to be considered for the interface and the model
>>> for post-processing the energy model (in the BPF schedulers).
>>
>> It really is hard to say precisely because eventually this will depend
>> on the platform firmware. Hopefully, this is not going to happen too
>> often, but if the thermal envelope of the platform is tight, for
>> instance, it may not be the case.
>
> It's hard to say for all use cases, but there are some easy to measure
> and understand:
>
> 1. Long scenarios with heavy GPU usage (e.g. gaming). Power on CPUs
> built from High-Performance cells can be affected by +20% after
> ~1min
> 2. Longer recording with heavy ISP usage, similar to above
>
> In those two, it's sufficient to update the EM every 1-3sec to reach
> this +20% after 60sec. Although, at the beginning when the GPU starts
> heating the updates should happen a bit more often.
>
> There are some more complex cases, e.g. when more than 1 Big CPU does
> heavy computations and the heat is higher than normal EM model of
> single CPU (even for that scenario profile). Then the updates to EM
> can go a bit more often (it depends what the platform would like
> to leverage and achieve w/ SW).

Thank you for the further clarification. I think the netlink
notification should be fast and efficient enough to cover these scenarios.

Regards,
Changwoo Min
On 5/22/25 09:35, Changwoo Min wrote:
>
>
> On 5/22/25 17:19, Lukasz Luba wrote:
>>
>>
>> On 5/10/25 12:34, Rafael J. Wysocki wrote:
>>> On Sat, May 10, 2025 at 7:07 AM Changwoo Min <changwoo@igalia.com>
>>> wrote:
>
>>>> I am curious about whether the energy model is likely to be updated more
>>>> often with this change. How often the energy model is likely to be
>>>> updated is a factor to be considered for the interface and the model
>>>> for post-processing the energy model (in the BPF schedulers).
>>>
>>> It really is hard to say precisely because eventually this will depend
>>> on the platform firmware. Hopefully, this is not going to happen too
>>> often, but if the thermal envelope of the platform is tight, for
>>> instance, it may not be the case.
>>
>> It's hard to say for all use cases, but there are some easy to measure
>> and understand:
>>
>> 1. Long scenarios with heavy GPU usage (e.g. gaming). Power on CPUs
>> built from High-Performance cells can be affected by +20% after
>> ~1min
>> 2. Longer recording with heavy ISP usage, similar to above
>>
>> In those two, it's sufficient to update the EM every 1-3sec to reach
>> this +20% after 60sec. Although, at the beginning when the GPU starts
>> heating the updates should happen a bit more often.
>>
>> There are some more complex cases, e.g. when more than 1 Big CPU does
>> heavy computations and the heat is higher than normal EM model of
>> single CPU (even for that scenario profile). Then the updates to EM
>> can go a bit more often (it depends what the platform would like
>> to leverage and achieve w/ SW).
>
> Thank you for the further clarification. I think the netlink
> notification should be fast and efficient enough to cover these scenarios.

Yes, I agree
Hello,
On Wed, May 07, 2025 at 10:47:28AM +0900, Changwoo Min wrote:
> The sched_ext schedulers [1] currently access the energy model through the
> debugfs to make energy-aware scheduling decisions [2]. The userspace part
> of a sched_ext scheduler feeds the necessary (post-processed) energy-model
> information to the BPF part of the scheduler.
>
> However, there is a limitation in the current debugfs support of the energy
> model. When the energy model is updated (em_dev_update_perf_domain), there
> is no way for the userspace part to know such changes (besides polling the
> debugfs files).
>
> Therefore, add inotify support (IN_MODIFY) when the energy model is updated.
> With this inotify support, the directory of an updated performance domain
> (e.g., /sys/kernel/debug/energy_model/cpu0) and its parent directory (e.g.,
> /sys/kernel/debug/energy_model) are inotified. Therefore, a sched_ext
> scheduler (or any userspace application) monitors the energy model change
> in userspace using the regular inotify interface.
>
> Note that accessing the energy model information from userspace has many
> advantages over other alternatives, especially adding new BPF kfuncs. The
> userspace has much more freedom than the BPF code (e.g., using external
> libraries and floating point arithmetics), which may be infeasible (if not
> impossible) in the BPF/kernel code.
>
> [1] https://lwn.net/Articles/922405/
> [2] https://github.com/sched-ext/scx/pull/1624
>
> Signed-off-by: Changwoo Min <changwoo@igalia.com>
FWIW, this looks simple enough and workable to me. Just a nit below:
> +static void em_debug_update(struct device *dev)
> +{
> + struct dentry *d;
> +
> + d = debugfs_lookup(dev_name(dev), rootdir);
> + fsnotify_dentry(d, FS_MODIFY);
> +}
Would something like em_debug_notify_updated() or em_debug_updated() be
better? em_debug_update() sounds like it's actively updating something.
Thanks.
--
tejun
Hi Tejun,
Thanks for the comments!
On 5/8/25 02:04, Tejun Heo wrote:
> Hello,
>
> On Wed, May 07, 2025 at 10:47:28AM +0900, Changwoo Min wrote:
>> The sched_ext schedulers [1] currently access the energy model through the
>> debugfs to make energy-aware scheduling decisions [2]. The userspace part
>> of a sched_ext scheduler feeds the necessary (post-processed) energy-model
>> information to the BPF part of the scheduler.
>>
>> However, there is a limitation in the current debugfs support of the energy
>> model. When the energy model is updated (em_dev_update_perf_domain), there
>> is no way for the userspace part to know such changes (besides polling the
>> debugfs files).
>>
>> Therefore, add inotify support (IN_MODIFY) when the energy model is updated.
>> With this inotify support, the directory of an updated performance domain
>> (e.g., /sys/kernel/debug/energy_model/cpu0) and its parent directory (e.g.,
>> /sys/kernel/debug/energy_model) are inotified. Therefore, a sched_ext
>> scheduler (or any userspace application) monitors the energy model change
>> in userspace using the regular inotify interface.
>>
>> Note that accessing the energy model information from userspace has many
>> advantages over other alternatives, especially adding new BPF kfuncs. The
>> userspace has much more freedom than the BPF code (e.g., using external
>> libraries and floating point arithmetics), which may be infeasible (if not
>> impossible) in the BPF/kernel code.
>>
>> [1] https://lwn.net/Articles/922405/
>> [2] https://github.com/sched-ext/scx/pull/1624
>>
>> Signed-off-by: Changwoo Min <changwoo@igalia.com>
>
> FWIW, this looks simple enough and workable to me. Just a nit below:
>
>> +static void em_debug_update(struct device *dev)
>> +{
>> + struct dentry *d;
>> +
>> + d = debugfs_lookup(dev_name(dev), rootdir);
>> + fsnotify_dentry(d, FS_MODIFY);
>> +}
>
> Would something like em_debug_notify_updated() or em_debug_updated() be
> better? em_debug_update() sounds like it's actively updating something.
I agree that em_debug_update() sounds misleading.
em_debug_notify_updated() delivers clear meaning, so I will change it as
you suggested.
Regards,
Changwoo Min
>
> Thanks.
>