resctrl_exit() was intended for use when the 'resctrl' module was unloaded.
resctrl can't be built as a module, and the kernfs helpers are not exported
so this is unlikely to change. MPAM has an error interrupt which indicates
the MPAM driver has gone haywire. Should this occur tasks could run with
the wrong control values, leading to bad performance for important tasks.
In this scenario the MPAM driver will reset the hardware, but it needs
a way to tell resctrl that no further configuration should be attempted.
In particular, moving tasks between control or monitor groups does not
interact with the architecture code, so there is no opportunity for the
arch code to indicate that the hardware is no-longer functioning.
Using resctrl_exit() for this leaves the system in a funny state as
resctrl is still mounted, but cannot be un-mounted because the sysfs
directory that is typically used has been removed. Dave Martin suggests
this may cause systemd trouble in the future as not all filesystems
can be unmounted.
Add calls to remove all the files and directories in resctrl, and
remove the sysfs_remove_mount_point() call that leaves the system
in a funny state. When triggered, this causes all the resctrl files
to disappear. resctrl can be unmounted, but not mounted again.
Signed-off-by: James Morse <james.morse@arm.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
Changes since v7:
* Moved kernfs cleanup functions to resctrl_fs_teardown(), and call this
from rescrl_exit().
* Added description of MPAM resetting the hardware to the commit message.
Changes since v6:
* Added kdoc and comment to resctrl_exit().
Changes since v5:
* Serialise rdtgroup_destroy_root() against umount().
* Check rdtgroup_default.kn to protect against duplicate calls.
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 42 +++++++++++++++++++++-----
1 file changed, 35 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index fdf2616c7ca0..3f9c37637d7e 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3074,6 +3074,22 @@ static void rmdir_all_sub(void)
kernfs_remove(kn_mondata);
}
+static void resctrl_fs_teardown(void)
+{
+ lockdep_assert_held(&rdtgroup_mutex);
+
+ /* Cleared by rdtgroup_destroy_root() */
+ if (!rdtgroup_default.kn)
+ return;
+
+ rmdir_all_sub();
+ rdt_pseudo_lock_release();
+ rdtgroup_default.mode = RDT_MODE_SHAREABLE;
+ closid_exit();
+ schemata_list_destroy();
+ rdtgroup_destroy_root();
+}
+
static void rdt_kill_sb(struct super_block *sb)
{
struct rdt_resource *r;
@@ -3087,12 +3103,7 @@ static void rdt_kill_sb(struct super_block *sb)
for_each_alloc_capable_rdt_resource(r)
resctrl_arch_reset_all_ctrls(r);
- rmdir_all_sub();
- rdt_pseudo_lock_release();
- rdtgroup_default.mode = RDT_MODE_SHAREABLE;
- closid_exit();
- schemata_list_destroy();
- rdtgroup_destroy_root();
+ resctrl_fs_teardown();
if (resctrl_arch_alloc_capable())
resctrl_arch_disable_alloc();
if (resctrl_arch_mon_capable())
@@ -4123,6 +4134,8 @@ static int rdtgroup_setup_root(struct rdt_fs_context *ctx)
static void rdtgroup_destroy_root(void)
{
+ lockdep_assert_held(&rdtgroup_mutex);
+
kernfs_destroy_root(rdt_root);
rdtgroup_default.kn = NULL;
}
@@ -4416,11 +4429,26 @@ int __init resctrl_init(void)
return ret;
}
+/**
+ * resctrl_exit() - Remove the resctrl filesystem and free resources.
+ *
+ * Called by the architecture code in response to a fatal error.
+ * Resctrl files and structures are removed from kernfs to prevent further
+ * configuration.
+ */
void __exit resctrl_exit(void)
{
+ mutex_lock(&rdtgroup_mutex);
+ resctrl_fs_teardown();
+ mutex_unlock(&rdtgroup_mutex);
+
debugfs_remove_recursive(debugfs_resctrl);
unregister_filesystem(&rdt_fs_type);
- sysfs_remove_mount_point(fs_kobj, "resctrl");
+
+ /*
+ * The sysfs mount point added by resctrl_init() is not removed so that
+ * it can be used to umount resctrl.
+ */
resctrl_mon_resource_exit();
}
--
2.20.1
Hi James,
On 4/11/25 9:42 AM, James Morse wrote:
> resctrl_exit() was intended for use when the 'resctrl' module was unloaded.
> resctrl can't be built as a module, and the kernfs helpers are not exported
> so this is unlikely to change. MPAM has an error interrupt which indicates
> the MPAM driver has gone haywire. Should this occur tasks could run with
> the wrong control values, leading to bad performance for important tasks.
> In this scenario the MPAM driver will reset the hardware, but it needs
> a way to tell resctrl that no further configuration should be attempted.
>
> In particular, moving tasks between control or monitor groups does not
> interact with the architecture code, so there is no opportunity for the
> arch code to indicate that the hardware is no-longer functioning.
>
> Using resctrl_exit() for this leaves the system in a funny state as
> resctrl is still mounted, but cannot be un-mounted because the sysfs
> directory that is typically used has been removed. Dave Martin suggests
> this may cause systemd trouble in the future as not all filesystems
> can be unmounted.
>
> Add calls to remove all the files and directories in resctrl, and
> remove the sysfs_remove_mount_point() call that leaves the system
> in a funny state. When triggered, this causes all the resctrl files
> to disappear. resctrl can be unmounted, but not mounted again.
>
The caveat here is that resctrl pretends to be mounted (resctrl_mounted == true)
but there is nothing there. The undocumented part of this is that for this
to work resctrl fs depends (a lot) on the architecture's callbacks to know
if they are being called after a resctrl_exit() call so that they return data
that will direct resctrl fs behavior to safest exit for those
resctrl fs flows that are still possible after a resctrl_exit(). Not ideal
layering.
I understand from a previous comment [1] that one of the Arm "tricks" is to
offline all domains. This seems to be a good "catch all" to ensure that at least
current flows of concern are not running anymore. Considering this,
what if there is a new resctrl_error_exit() that does something like below?
void resctrl_error_exit(void)
{
mutex_lock(&rdtgroup_mutex);
WARN_ON_ONCE(resctrl_new_function_returns_true_if_any_resource_has_a_control_or_monitor_domain());
resctrl_fs_teardown();
mutex_unlock(&rdtgroup_mutex);
resctrl_exit();
}
I do not see this as requiring anything new from architecture but instead
making what Arm already does a requirement and keeping existing behavior?
This leaves proc_resctrl_show() that relies on resctrl_mounted but as I see
the resctrl_fs_cleanup() will remove all resource groups that should result
in the output being as it will be if resctrl is not mounted. No dependence
on architecture callbacks returning resctrl_exit() aware data here.
> Signed-off-by: James Morse <james.morse@arm.com>
> Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
> Tested-by: Peter Newman <peternewman@google.com>
> Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
> Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
> Tested-by: Babu Moger <babu.moger@amd.com>
> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> ---
...
> ---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 42 +++++++++++++++++++++-----
> 1 file changed, 35 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index fdf2616c7ca0..3f9c37637d7e 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3074,6 +3074,22 @@ static void rmdir_all_sub(void)
> kernfs_remove(kn_mondata);
> }
>
> +static void resctrl_fs_teardown(void)
> +{
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> + /* Cleared by rdtgroup_destroy_root() */
> + if (!rdtgroup_default.kn)
> + return;
> +
> + rmdir_all_sub();
> + rdt_pseudo_lock_release();
> + rdtgroup_default.mode = RDT_MODE_SHAREABLE;
> + closid_exit();
> + schemata_list_destroy();
> + rdtgroup_destroy_root();
> +}
> +
> static void rdt_kill_sb(struct super_block *sb)
> {
> struct rdt_resource *r;
> @@ -3087,12 +3103,7 @@ static void rdt_kill_sb(struct super_block *sb)
> for_each_alloc_capable_rdt_resource(r)
> resctrl_arch_reset_all_ctrls(r);
>
> - rmdir_all_sub();
> - rdt_pseudo_lock_release();
> - rdtgroup_default.mode = RDT_MODE_SHAREABLE;
> - closid_exit();
> - schemata_list_destroy();
> - rdtgroup_destroy_root();
> + resctrl_fs_teardown();
> if (resctrl_arch_alloc_capable())
> resctrl_arch_disable_alloc();
> if (resctrl_arch_mon_capable())
> @@ -4123,6 +4134,8 @@ static int rdtgroup_setup_root(struct rdt_fs_context *ctx)
>
> static void rdtgroup_destroy_root(void)
> {
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> kernfs_destroy_root(rdt_root);
> rdtgroup_default.kn = NULL;
> }
> @@ -4416,11 +4429,26 @@ int __init resctrl_init(void)
> return ret;
> }
>
> +/**
> + * resctrl_exit() - Remove the resctrl filesystem and free resources.
> + *
> + * Called by the architecture code in response to a fatal error.
> + * Resctrl files and structures are removed from kernfs to prevent further
> + * configuration.
Please write with imperative tone. For example, "Remove resctrl files and structures ..."
> + */
> void __exit resctrl_exit(void)
> {
> + mutex_lock(&rdtgroup_mutex);
> + resctrl_fs_teardown();
> + mutex_unlock(&rdtgroup_mutex);
> +
> debugfs_remove_recursive(debugfs_resctrl);
Is it possible for the fatal error handling to trigger multiple calls here?
To protect against multiple calls causing issues debugfs_resctrl can be set to NULL here.
> unregister_filesystem(&rdt_fs_type);
unregister_filesystem() seems to handle an already-unregistered filesystem.
> - sysfs_remove_mount_point(fs_kobj, "resctrl");
> +
> + /*
> + * The sysfs mount point added by resctrl_init() is not removed so that
> + * it can be used to umount resctrl.
> + */
(needs imperative)
>
> resctrl_mon_resource_exit();
> }
Reinette
[1] https://lore.kernel.org/lkml/5ab63ad9-f7f5-4d3d-b0cb-fd229aa269db@arm.com/
Hi Reinette,
On 16/04/2025 01:25, Reinette Chatre wrote:
> On 4/11/25 9:42 AM, James Morse wrote:
>> resctrl_exit() was intended for use when the 'resctrl' module was unloaded.
>> resctrl can't be built as a module, and the kernfs helpers are not exported
>> so this is unlikely to change. MPAM has an error interrupt which indicates
>> the MPAM driver has gone haywire. Should this occur tasks could run with
>> the wrong control values, leading to bad performance for important tasks.
>> In this scenario the MPAM driver will reset the hardware, but it needs
>> a way to tell resctrl that no further configuration should be attempted.
>>
>> In particular, moving tasks between control or monitor groups does not
>> interact with the architecture code, so there is no opportunity for the
>> arch code to indicate that the hardware is no-longer functioning.
>>
>> Using resctrl_exit() for this leaves the system in a funny state as
>> resctrl is still mounted, but cannot be un-mounted because the sysfs
>> directory that is typically used has been removed. Dave Martin suggests
>> this may cause systemd trouble in the future as not all filesystems
>> can be unmounted.
>>
>> Add calls to remove all the files and directories in resctrl, and
>> remove the sysfs_remove_mount_point() call that leaves the system
>> in a funny state. When triggered, this causes all the resctrl files
>> to disappear. resctrl can be unmounted, but not mounted again.
> The caveat here is that resctrl pretends to be mounted (resctrl_mounted == true)
> but there is nothing there. The undocumented part of this is that for this
> to work resctrl fs depends (a lot) on the architecture's callbacks to know
> if they are being called after a resctrl_exit() call so that they return data
> that will direct resctrl fs behavior to safest exit for those
> resctrl fs flows that are still possible after a resctrl_exit(). Not ideal
> layering.
It was the arch code that called resctrl_exit() - there is no other path into it.
I don't think its a problem for the arch code to also know to return an error.
I haven't found anything where which error is returned actually matter - so there
is no 'direction', only errors.
I agree the documentation can be improved.
> I understand from a previous comment [1] that one of the Arm "tricks" is to
> offline all domains. This seems to be a good "catch all" to ensure that at least
> current flows of concern are not running anymore.
Yup, that is necessary to stop the limbo and overflow workers for trying to read the
counters - which is a waste of time.
> Considering this,
> what if there is a new resctrl_error_exit() that does something like below?
>
> void resctrl_error_exit(void)
> {
> mutex_lock(&rdtgroup_mutex);
> WARN_ON_ONCE(resctrl_new_function_returns_true_if_any_resource_has_a_control_or_monitor_domain());
> resctrl_fs_teardown();
> mutex_unlock(&rdtgroup_mutex);
> resctrl_exit();
> }
Makes sense - the alternative would be to dig around to cancel the limbo/overflow
work, and a subsequent CPU-online might start them again.
> I do not see this as requiring anything new from architecture but instead
> making what Arm already does a requirement and keeping existing behavior?
I agree.
> This leaves proc_resctrl_show() that relies on resctrl_mounted but as I see
> the resctrl_fs_cleanup() will remove all resource groups that should result
> in the output being as it will be if resctrl is not mounted. No dependence
> on architecture callbacks returning resctrl_exit() aware data here.
Great - I'd missed that one,
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index fdf2616c7ca0..3f9c37637d7e 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -4416,11 +4429,26 @@ int __init resctrl_init(void)
>> return ret;
>> }
>>
>> +/**
>> + * resctrl_exit() - Remove the resctrl filesystem and free resources.
>> + *
>> + * Called by the architecture code in response to a fatal error.
>> + * Resctrl files and structures are removed from kernfs to prevent further
>> + * configuration.
>
> Please write with imperative tone. For example, "Remove resctrl files and structures ..."
>
>> + */
>> void __exit resctrl_exit(void)
>> {
>> + mutex_lock(&rdtgroup_mutex);
>> + resctrl_fs_teardown();
>> + mutex_unlock(&rdtgroup_mutex);
>> +
>> debugfs_remove_recursive(debugfs_resctrl);
>
> Is it possible for the fatal error handling to trigger multiple calls here?
> To protect against multiple calls causing issues debugfs_resctrl can be set to NULL here.
It's not, the driver keeps track of whether resctrl_init() had been called, and only calls
resctrl_exit() once. But I agree it would be better to make it robust to this.
>> unregister_filesystem(&rdt_fs_type);
>
> unregister_filesystem() seems to handle an already-unregistered filesystem.
>
>> - sysfs_remove_mount_point(fs_kobj, "resctrl");
>> +
>> + /*
>> + * The sysfs mount point added by resctrl_init() is not removed so that
>> + * it can be used to umount resctrl.
>> + */
>
> (needs imperative)
>
>>
>> resctrl_mon_resource_exit();
>> }
Thanks,
James
© 2016 - 2025 Red Hat, Inc.