[PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface

Chengming Zhou posted 10 patches 3 years, 8 months ago
There is a newer version of this series
Posted by Chengming Zhou 3 years, 8 months ago
PSI accounts stalls for each cgroup separately and aggregates them
at each level of the hierarchy. This can cause non-negligible overhead
for some workloads that run deep in the cgroup hierarchy.

Commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
made it possible for PSI to skip per-cgroup stall accounting and account
only system-wide, avoiding this per-level overhead.

But for our use case, we also want PSI stats accounted on the leaf
cgroups, so userspace can make adjustments on those cgroups, not just
system-wide adjustments.

So this patch introduces a per-cgroup PSI stats disable/re-enable
interface, "cgroup.psi": a read-write single value file whose allowed
values are "0" and "1". The default is "1", so per-cgroup PSI stats
are enabled by default.
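
For illustration, on a kernel carrying this patch the interface would be
used roughly as in the transcript below ("workload" is a hypothetical
cgroup name, and the cgroup2 mount point is assumed to be /sys/fs/cgroup):

```shell
# Per-cgroup PSI stats are enabled by default:
cat /sys/fs/cgroup/workload/cgroup.psi       # prints "1"

# Disable PSI stats accounting for this cgroup only
# (system-wide accounting in /proc/pressure/* is unaffected):
echo 0 > /sys/fs/cgroup/workload/cgroup.psi

# Re-enable; the history from the disabled window is lost,
# so the avgs restart from 0 and total= will have gaps:
echo 1 > /sys/fs/cgroup/workload/cgroup.psi
```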

Implementation details:

It should be relatively straightforward to disable and re-enable
state aggregation, time tracking, and averaging on a per-cgroup level,
if we can live with losing the history from while it was disabled,
i.e. the avgs will restart from 0 and total= will have gaps.

But it is hard, or at least complex, to stop and restart the
groupc->tasks[] updates, which is not implemented in this patch. So we
always update groupc->tasks[] and the PSI_ONCPU bit in psi_group_change(),
even when the cgroup's PSI stats are disabled.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  7 ++++
 include/linux/psi.h                     |  2 ++
 include/linux/psi_types.h               |  2 ++
 kernel/cgroup/cgroup.c                  | 43 +++++++++++++++++++++++++
 kernel/sched/psi.c                      | 40 +++++++++++++++++++----
 5 files changed, 87 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index dd84e34bc051..ade40506ab80 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -968,6 +968,13 @@ All cgroup core files are prefixed with "cgroup."
 	killing cgroups is a process directed operation, i.e. it affects
 	the whole thread-group.
 
+  cgroup.psi
+	A read-write single value file whose allowed values are "0" and "1".
+	The default is "1".
+
+	Writing "0" to the file will disable the cgroup PSI stats accounting.
+	Writing "1" to the file will re-enable the cgroup PSI stats accounting.
+
   irq.pressure
 	A read-write nested-keyed file.
 
diff --git a/include/linux/psi.h b/include/linux/psi.h
index aa168a038242..1138ccffd76b 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -33,6 +33,7 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
 int psi_cgroup_alloc(struct cgroup *cgrp);
 void psi_cgroup_free(struct cgroup *cgrp);
 void cgroup_move_task(struct task_struct *p, struct css_set *to);
+void psi_cgroup_enable(struct psi_group *group, bool enable);
 #endif
 
 #else /* CONFIG_PSI */
@@ -54,6 +55,7 @@ static inline void cgroup_move_task(struct task_struct *p, struct css_set *to)
 {
 	rcu_assign_pointer(p->cgroups, to);
 }
+static inline void psi_cgroup_enable(struct psi_group *group, bool enable) {}
 #endif
 
 #endif /* CONFIG_PSI */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 4677655f6ca1..fced39e255aa 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -147,6 +147,8 @@ struct psi_trigger {
 };
 
 struct psi_group {
+	bool enabled;
+
 	/* Protects data used by the aggregator */
 	struct mutex avgs_lock;
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 91de8ff7fa50..6ba56983b5a5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3709,6 +3709,43 @@ static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
 }
 #endif
 
+static int cgroup_psi_show(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
+
+	seq_printf(seq, "%d\n", psi->enabled);
+
+	return 0;
+}
+
+static ssize_t cgroup_psi_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	ssize_t ret;
+	int enable;
+	struct cgroup *cgrp;
+	struct psi_group *psi;
+
+	ret = kstrtoint(strstrip(buf), 0, &enable);
+	if (ret)
+		return ret;
+
+	if (enable < 0 || enable > 1)
+		return -ERANGE;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENOENT;
+
+	psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
+	psi_cgroup_enable(psi, enable);
+
+	cgroup_kn_unlock(of->kn);
+
+	return nbytes;
+}
+
 static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
 					  poll_table *pt)
 {
@@ -5115,6 +5152,12 @@ static struct cftype cgroup_base_files[] = {
 		.release = cgroup_pressure_release,
 	},
 #endif
+	{
+		.name = "cgroup.psi",
+		.flags = CFTYPE_PRESSURE,
+		.seq_show = cgroup_psi_show,
+		.write = cgroup_psi_write,
+	},
 #endif /* CONFIG_PSI */
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 58f8092c938f..9df1686ee02d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -181,6 +181,7 @@ static void group_init(struct psi_group *group)
 {
 	int cpu;
 
+	group->enabled = true;
 	for_each_possible_cpu(cpu)
 		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
 	group->avg_last_update = sched_clock();
@@ -700,17 +701,16 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	groupc = per_cpu_ptr(group->pcpu, cpu);
 
 	/*
-	 * First we assess the aggregate resource states this CPU's
-	 * tasks have been in since the last change, and account any
-	 * SOME and FULL time these may have resulted in.
-	 *
-	 * Then we update the task counts according to the state
+	 * First we update the task counts according to the state
 	 * change requested through the @clear and @set bits.
+	 *
+	 * Then, if cgroup PSI stats accounting is enabled, we
+	 * assess the aggregate resource states this CPU's tasks
+	 * have been in since the last change, and account any
+	 * SOME and FULL time these may have resulted in.
 	 */
 	write_seqcount_begin(&groupc->seq);
 
-	record_times(groupc, now);
-
 	/*
 	 * Start with TSK_ONCPU, which doesn't have a corresponding
 	 * task count - it's just a boolean flag directly encoded in
@@ -750,6 +750,14 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		if (set & (1 << t))
 			groupc->tasks[t]++;
 
+	if (!group->enabled) {
+		if (groupc->state_mask & (1 << PSI_NONIDLE))
+			record_times(groupc, now);
+		groupc->state_mask = state_mask;
+		write_seqcount_end(&groupc->seq);
+		return;
+	}
+
 	for (s = 0; s < NR_PSI_STATES; s++) {
 		if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU))
 			state_mask |= (1 << s);
@@ -766,6 +774,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall))
 		state_mask |= (1 << PSI_MEM_FULL);
 
+	record_times(groupc, now);
 	groupc->state_mask = state_mask;
 
 	write_seqcount_end(&groupc->seq);
@@ -1088,6 +1097,23 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 
 	task_rq_unlock(rq, task, &rf);
 }
+
+void psi_cgroup_enable(struct psi_group *group, bool enable)
+{
+	struct psi_group_cpu *groupc;
+	int cpu;
+	u64 now;
+
+	if (group->enabled == enable)
+		return;
+	group->enabled = enable;
+
+	for_each_possible_cpu(cpu) {
+		groupc = per_cpu_ptr(group->pcpu, cpu);
+		now = cpu_clock(cpu);
+		psi_group_change(group, cpu, 0, 0, now, true);
+	}
+}
 #endif /* CONFIG_CGROUPS */
 
 int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
-- 
2.36.1
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Johannes Weiner 3 years, 7 months ago
On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou wrote:
> +static ssize_t cgroup_psi_write(struct kernfs_open_file *of,
> +				char *buf, size_t nbytes, loff_t off)
> +{
> +	ssize_t ret;
> +	int enable;
> +	struct cgroup *cgrp;
> +	struct psi_group *psi;
> +
> +	ret = kstrtoint(strstrip(buf), 0, &enable);
> +	if (ret)
> +		return ret;
> +
> +	if (enable < 0 || enable > 1)
> +		return -ERANGE;
> +
> +	cgrp = cgroup_kn_lock_live(of->kn, false);
> +	if (!cgrp)
> +		return -ENOENT;
> +
> +	psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
> +	psi_cgroup_enable(psi, enable);

I think it should also add/remove the pressure files when enabling and
disabling the aggregation, since their contents would be stale and
misleading.

Take a look at cgroup_add_dfl_cftypes() and cgroup_rm_cftypes()

> @@ -5115,6 +5152,12 @@ static struct cftype cgroup_base_files[] = {
>  		.release = cgroup_pressure_release,
>  	},
>  #endif
> +	{
> +		.name = "cgroup.psi",
> +		.flags = CFTYPE_PRESSURE,
> +		.seq_show = cgroup_psi_show,
> +		.write = cgroup_psi_write,
> +	},
>  #endif /* CONFIG_PSI */
>  	{ }	/* terminate */
>  };
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 58f8092c938f..9df1686ee02d 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -181,6 +181,7 @@ static void group_init(struct psi_group *group)
>  {
>  	int cpu;
>  
> +	group->enabled = true;
>  	for_each_possible_cpu(cpu)
>  		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
>  	group->avg_last_update = sched_clock();
> @@ -700,17 +701,16 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  	groupc = per_cpu_ptr(group->pcpu, cpu);
>  
>  	/*
> -	 * First we assess the aggregate resource states this CPU's
> -	 * tasks have been in since the last change, and account any
> -	 * SOME and FULL time these may have resulted in.
> -	 *
> -	 * Then we update the task counts according to the state
> +	 * First we update the task counts according to the state
>  	 * change requested through the @clear and @set bits.
> +	 *
> +	 * Then if the cgroup PSI stats accounting enabled, we
> +	 * assess the aggregate resource states this CPU's tasks
> +	 * have been in since the last change, and account any
> +	 * SOME and FULL time these may have resulted in.
>  	 */
>  	write_seqcount_begin(&groupc->seq);
>  
> -	record_times(groupc, now);
> -
>  	/*
>  	 * Start with TSK_ONCPU, which doesn't have a corresponding
>  	 * task count - it's just a boolean flag directly encoded in
> @@ -750,6 +750,14 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  		if (set & (1 << t))
>  			groupc->tasks[t]++;
>  
> +	if (!group->enabled) {
> +		if (groupc->state_mask & (1 << PSI_NONIDLE))
> +			record_times(groupc, now);

Why record the nonidle time? It's only used for aggregation, which is
stopped as well.

> @@ -1088,6 +1097,23 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
>  
>  	task_rq_unlock(rq, task, &rf);
>  }
> +
> +void psi_cgroup_enable(struct psi_group *group, bool enable)
> +{
> +	struct psi_group_cpu *groupc;
> +	int cpu;
> +	u64 now;
> +
> +	if (group->enabled == enable)
> +		return;
> +	group->enabled = enable;
> +
> +	for_each_possible_cpu(cpu) {
> +		groupc = per_cpu_ptr(group->pcpu, cpu);
> +		now = cpu_clock(cpu);
> +		psi_group_change(group, cpu, 0, 0, now, true);

This loop deserves a comment, IMO.
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Chengming Zhou 3 years, 7 months ago
On 2022/8/15 23:49, Johannes Weiner wrote:
> On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou wrote:
>> +static ssize_t cgroup_psi_write(struct kernfs_open_file *of,
>> +				char *buf, size_t nbytes, loff_t off)
>> +{
>> +	ssize_t ret;
>> +	int enable;
>> +	struct cgroup *cgrp;
>> +	struct psi_group *psi;
>> +
>> +	ret = kstrtoint(strstrip(buf), 0, &enable);
>> +	if (ret)
>> +		return ret;
>> +
>> +	if (enable < 0 || enable > 1)
>> +		return -ERANGE;
>> +
>> +	cgrp = cgroup_kn_lock_live(of->kn, false);
>> +	if (!cgrp)
>> +		return -ENOENT;
>> +
>> +	psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
>> +	psi_cgroup_enable(psi, enable);
> 
> I think it should also add/remove the pressure files when enabling and
> disabling the aggregation, since their contents would be stale and
> misleading.
> 
> Take a look at cgroup_add_dfl_cftypes() and cgroup_rm_cftypes()

Ok, I will look.

> 
>> @@ -5115,6 +5152,12 @@ static struct cftype cgroup_base_files[] = {
>>  		.release = cgroup_pressure_release,
>>  	},
>>  #endif
>> +	{
>> +		.name = "cgroup.psi",
>> +		.flags = CFTYPE_PRESSURE,
>> +		.seq_show = cgroup_psi_show,
>> +		.write = cgroup_psi_write,
>> +	},
>>  #endif /* CONFIG_PSI */
>>  	{ }	/* terminate */
>>  };
>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>> index 58f8092c938f..9df1686ee02d 100644
>> --- a/kernel/sched/psi.c
>> +++ b/kernel/sched/psi.c
>> @@ -181,6 +181,7 @@ static void group_init(struct psi_group *group)
>>  {
>>  	int cpu;
>>  
>> +	group->enabled = true;
>>  	for_each_possible_cpu(cpu)
>>  		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
>>  	group->avg_last_update = sched_clock();
>> @@ -700,17 +701,16 @@ static void psi_group_change(struct psi_group *group, int cpu,
>>  	groupc = per_cpu_ptr(group->pcpu, cpu);
>>  
>>  	/*
>> -	 * First we assess the aggregate resource states this CPU's
>> -	 * tasks have been in since the last change, and account any
>> -	 * SOME and FULL time these may have resulted in.
>> -	 *
>> -	 * Then we update the task counts according to the state
>> +	 * First we update the task counts according to the state
>>  	 * change requested through the @clear and @set bits.
>> +	 *
>> +	 * Then if the cgroup PSI stats accounting enabled, we
>> +	 * assess the aggregate resource states this CPU's tasks
>> +	 * have been in since the last change, and account any
>> +	 * SOME and FULL time these may have resulted in.
>>  	 */
>>  	write_seqcount_begin(&groupc->seq);
>>  
>> -	record_times(groupc, now);
>> -
>>  	/*
>>  	 * Start with TSK_ONCPU, which doesn't have a corresponding
>>  	 * task count - it's just a boolean flag directly encoded in
>> @@ -750,6 +750,14 @@ static void psi_group_change(struct psi_group *group, int cpu,
>>  		if (set & (1 << t))
>>  			groupc->tasks[t]++;
>>  
>> +	if (!group->enabled) {
>> +		if (groupc->state_mask & (1 << PSI_NONIDLE))
>> +			record_times(groupc, now);
> 
> Why record the nonidle time? It's only used for aggregation, which is
> stopped as well.

I'm considering this situation: disabled at t2 and re-enabled at t3

state1(t1) --> state2(t2) --> state3(t3)

If the aggregator has run get_recent_times() at some t in [t1, t2],
groupc->times_prev[aggregator] will include that delta of (t - t1).

Then on re-enable at t3, the delta of (t3 - t1) is discarded, which may
make that aggregator see times < groupc->times_prev[aggregator]?

Maybe I missed something, not sure whether this is a problem.


> 
>> @@ -1088,6 +1097,23 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
>>  
>>  	task_rq_unlock(rq, task, &rf);
>>  }
>> +
>> +void psi_cgroup_enable(struct psi_group *group, bool enable)
>> +{
>> +	struct psi_group_cpu *groupc;
>> +	int cpu;
>> +	u64 now;
>> +
>> +	if (group->enabled == enable)
>> +		return;
>> +	group->enabled = enable;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		groupc = per_cpu_ptr(group->pcpu, cpu);
>> +		now = cpu_clock(cpu);
>> +		psi_group_change(group, cpu, 0, 0, now, true);
> 
> This loop deserves a comment, IMO.

I have added some comments below, could you help take a look?

+
+void psi_cgroup_enable(struct psi_group *group, bool enable)
+{
+       int cpu;
+       u64 now;
+
+       if (group->enabled == enable)
+               return;
+       group->enabled = enable;
+
+       /*
+        * We use psi_group_change() to disable or re-enable the
+        * record_times(), test_state() loop and averaging worker
+        * in each psi_group_cpu of the psi_group, use .clear = 0
+        * and .set = 0 here since no task status really changed.
+        */
+       for_each_possible_cpu(cpu) {
+               now = cpu_clock(cpu);
+               psi_group_change(group, cpu, 0, 0, now, true);
+       }
+}

Thanks!
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Tejun Heo 3 years, 7 months ago
Hello,

On Mon, Aug 15, 2022 at 11:49:55AM -0400, Johannes Weiner wrote:
> I think it should also add/remove the pressure files when enabling and
> disabling the aggregation, since their contents would be stale and
> misleading.
> 
> Take a look at cgroup_add_dfl_cftypes() and cgroup_rm_cftypes()

The problem with adding cftypes dynamically is that it can fail, which isn't
the end of the world here but still kinda sucks. I think what we actually
wanna do is hiding and unhiding while keeping all the data structures in
place which is needed somewhere else anyway.

Thanks.

-- 
tejun
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Michal Koutný 3 years, 8 months ago
Hello Chengming.

On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou <zhouchengming@bytedance.com> wrote:
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index dd84e34bc051..ade40506ab80 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -968,6 +968,13 @@ All cgroup core files are prefixed with "cgroup."
>  	killing cgroups is a process directed operation, i.e. it affects
>  	the whole thread-group.
>  
> +  cgroup.psi
> +	A read-write single value file that allowed values are "0" and "1".
> +	The default is "1".
> +
> +	Writing "0" to the file will disable the cgroup PSI stats accounting.
> +	Writing "1" to the file will re-enable the cgroup PSI stats accounting.
> +

I'd suggest explaining here explicitly that this control attribute is not
hierarchical (i.e. PSI accounting in a cgroup does not affect accounting
in descendants and doesn't require enablement to be passed down via
ancestors from the root), and also the purpose: that it "saves" cycles
(and where).

Regards,
Michal
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Chengming Zhou 3 years, 8 months ago
On 2022/8/12 18:14, Michal Koutný wrote:
> Hello Chengming.
> 
> On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou <zhouchengming@bytedance.com> wrote:
>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>> index dd84e34bc051..ade40506ab80 100644
>> --- a/Documentation/admin-guide/cgroup-v2.rst
>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>> @@ -968,6 +968,13 @@ All cgroup core files are prefixed with "cgroup."
>>  	killing cgroups is a process directed operation, i.e. it affects
>>  	the whole thread-group.
>>  
>> +  cgroup.psi
>> +	A read-write single value file that allowed values are "0" and "1".
>> +	The default is "1".
>> +
>> +	Writing "0" to the file will disable the cgroup PSI stats accounting.
>> +	Writing "1" to the file will re-enable the cgroup PSI stats accounting.
>> +
> 
> I'd suggest explaining here explicitely, this control attribute is not
> hierarchical (i.e. PSI accounting in a cgroup does not affect accounting
> in descendants and doesn't need pass enablement via ancestors from
> root). And the purpose that it "saves" cycles (where).

Thanks for the suggestion and explanation!

Could you help take a look if there is anything to improve?


--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -968,6 +968,23 @@ All cgroup core files are prefixed with "cgroup."
        killing cgroups is a process directed operation, i.e. it affects
        the whole thread-group.

+  cgroup.pressure
+       A read-write single value file that allowed values are "0" and "1".
+       The default is "1".
+
+       Writing "0" to the file will disable the cgroup PSI accounting.
+       Writing "1" to the file will re-enable the cgroup PSI accounting.
+
+       This control attribute is not hierarchical, so disabling or enabling PSI
+       accounting in a cgroup does not affect PSI accounting in descendants,
+       and doesn't require enablement to be passed down from the root.
+
+       The reason this control attribute exists is that PSI accounts stalls for
+       each cgroup separately and aggregates it at each level of the hierarchy.
+       This may cause non-negligible overhead for some workloads when under
+       deep level of the hierarchy, in which case this control attribute can
+       be used to disable PSI accounting in the cgroups.
+
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Michal Koutný 3 years, 7 months ago
On Fri, Aug 12, 2022 at 08:36:17PM +0800, Chengming Zhou <zhouchengming@bytedance.com> wrote:
> Could you help take a look if there is anything to improve?

Thanks, just a little nit.

> +       The reason this control attribute exists is that PSI accounts stalls for
> +       each cgroup separately and aggregates it at each level of the hierarchy.
> +       This may cause non-negligible overhead for some workloads when under
> +       deep level of the hierarchy, in which case this control attribute can
> +       be used to disable PSI accounting in the cgroups.

s/in the cgroups/in the non-leaf cgroups/
or
s/in the cgroups/in the uninteresting cgroups/

(I'm concerned that it may result in lots of disabling if you want the
performance. I'll expand on it in 2nd subthread.)

Michal
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Tejun Heo 3 years, 8 months ago
Hello,

On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou wrote:
> So this patch introduce a per-cgroup PSI stats disable/re-enable
> interface "cgroup.psi", which is a read-write single value file that
> allowed values are "0" and "1", the defaults is "1" so per-cgroup
> PSI stats is enabled by default.

Given that the knobs are named {cpu|memory|io}.pressure, I wonder whether
"cgroup.psi" is the best name. Also, it doesn't convey that it's the
enable/disable knob. I think it needs a better name.

Thanks.

-- 
tejun
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Chengming Zhou 3 years, 8 months ago
On 2022/8/10 01:48, Tejun Heo wrote:
> Hello,
> 
> On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou wrote:
>> So this patch introduce a per-cgroup PSI stats disable/re-enable
>> interface "cgroup.psi", which is a read-write single value file that
>> allowed values are "0" and "1", the defaults is "1" so per-cgroup
>> PSI stats is enabled by default.
> 
> Given that the knobs are named {cpu|memory|io}.pressure, I wonder whether
> "cgroup.psi" is the best name. Also, it doesn't convey that it's the
> enable/disable knob. I think it needs a better name.

Yes, "cgroup.psi" is not good. What about "pressure.enable" or "cgroup.psi_enable"?

Thanks.
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Chengming Zhou 3 years, 8 months ago
On 2022/8/10 08:39, Chengming Zhou wrote:
> On 2022/8/10 01:48, Tejun Heo wrote:
>> Hello,
>>
>> On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou wrote:
>>> So this patch introduce a per-cgroup PSI stats disable/re-enable
>>> interface "cgroup.psi", which is a read-write single value file that
>>> allowed values are "0" and "1", the defaults is "1" so per-cgroup
>>> PSI stats is enabled by default.
>>
>> Given that the knobs are named {cpu|memory|io}.pressure, I wonder whether
>> "cgroup.psi" is the best name. Also, it doesn't convey that it's the
>> enable/disable knob. I think it needs a better name.
> 
> Yes, "cgroup.psi" is not good. What abort "pressure.enable" or "cgroup.psi_enable"?

Doesn't look good either, what do you think of "cgroup.pressure.enable"?

Thanks.
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Johannes Weiner 3 years, 8 months ago
On Wed, Aug 10, 2022 at 09:30:59AM +0800, Chengming Zhou wrote:
> On 2022/8/10 08:39, Chengming Zhou wrote:
> > On 2022/8/10 01:48, Tejun Heo wrote:
> >> Hello,
> >>
> >> On Mon, Aug 08, 2022 at 07:03:40PM +0800, Chengming Zhou wrote:
> >>> So this patch introduce a per-cgroup PSI stats disable/re-enable
> >>> interface "cgroup.psi", which is a read-write single value file that
> >>> allowed values are "0" and "1", the defaults is "1" so per-cgroup
> >>> PSI stats is enabled by default.
> >>
> >> Given that the knobs are named {cpu|memory|io}.pressure, I wonder whether
> >> "cgroup.psi" is the best name. Also, it doesn't convey that it's the
> >> enable/disable knob. I think it needs a better name.
> > 
> > Yes, "cgroup.psi" is not good. What abort "pressure.enable" or "cgroup.psi_enable"?
> 
> Doesn't look good either, what do you think of "cgroup.pressure.enable"?

How about just cgroup.pressure? Too ambiguous?

cgroup.pressure.enable sounds good to me too. Or, because it's
default-enabled and that likely won't change, cgroup.pressure.disable.
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Michal Koutný 3 years, 7 months ago
On Wed, Aug 10, 2022 at 11:25:07AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> cgroup.pressure.enable sounds good to me too. Or, because it's
> default-enabled and that likely won't change, cgroup.pressure.disable.

Will it not change?

I'd say that user would be interested in particular level or even just
level in subtree for PSI, so the opt-out may result in lots of explicit
disablements (or even watch for cgroups created and disable PSI there)
to get some performance back.

I have two suggestions based on the above:
1) Make the default globally configurable (mount option?)
2) Allow implicit enablement upon trigger creation

WDYT?

Michal
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Chengming Zhou 3 years, 7 months ago
On 2022/8/15 21:23, Michal Koutný wrote:
> On Wed, Aug 10, 2022 at 11:25:07AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> cgroup.pressure.enable sounds good to me too. Or, because it's
>> default-enabled and that likely won't change, cgroup.pressure.disable.
> 
> Will it not change?
> 
> I'd say that user would be interested in particular level or even just
> level in subtree for PSI, so the opt-out may result in lots of explicit
> disablements (or even watch for cgroups created and disable PSI there)
> to get some performance back.
> 
> I have two suggestions based on the above:
> 1) Make the default globally configurable (mount option?)
> 2) Allow implicit enablement upon trigger creation
> 

I think suggestion 1) makes sense in some use cases, like making per-cgroup
PSI disabled by default via a mount option, then enabling it using the
"cgroup.pressure" interface.

But for suggestion 2), auto-enabling upon trigger creation: if we hide the
{cpu,memory,io}.pressure files when disabled, how can we create a trigger?

I'd like to see what Johannes and Tejun think about these suggestions.

Thanks.
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Johannes Weiner 3 years, 7 months ago
On Tue, Aug 23, 2022 at 02:18:21PM +0800, Chengming Zhou wrote:
> On 2022/8/15 21:23, Michal Koutný wrote:
> > On Wed, Aug 10, 2022 at 11:25:07AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >> cgroup.pressure.enable sounds good to me too. Or, because it's
> >> default-enabled and that likely won't change, cgroup.pressure.disable.
> > 
> > Will it not change?
> > 
> > I'd say that user would be interested in particular level or even just
> > level in subtree for PSI, so the opt-out may result in lots of explicit
> > disablements (or even watch for cgroups created and disable PSI there)
> > to get some performance back.
> > 
> > I have two suggestions based on the above:
> > 1) Make the default globally configurable (mount option?)
> > 2) Allow implicit enablement upon trigger creation
> > 
> 
> I think suggestion 1) make sense in some use case, like make per-cgroup
> PSI disabled by default using a mount option, then enable using the
> "cgroup.pressure" interface.
> 
> But suggestion 2) auto enable upon trigger creation, if we hide the
> {cpu,memory,io}.pressure files when disabled, how can we create trigger?
> 
> Want to see what do Johannes and Tejun think about these suggestions?

Re 1: I agree. If desired in the future we can make the default
configurable. Kconfig, mount option, what have you. cgroup.pressure
will work fine as a name regardless of what the default is.

Re 2: Not all consumers of the pressure metrics create trigger. I
would argue that few do. So it isn't the best signal to decide on
whether aggregation should occur. And yes, it's further complicated by
the triggers being written to the very pressure files. If we don't
hide them, we have to come up with another way to mark them as stale,
lest they confuse the heck out of users. Without breaking format...

So IMO, default-enable, "cgroup.pressure" as a name, and hiding the
pressure files should be good for now while allowing to make the
default configurable down the line.
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Tejun Heo 3 years, 7 months ago
Hello,

On Tue, Aug 23, 2022 at 11:35:59AM -0400, Johannes Weiner wrote:
> Re 1: I agree. If desired in the future we can make the default
> configurable. Kconfig, mount option, what have you. cgroup.pressure
> will work fine as a name regardless of what the default is.

Given that there's already cgroup_disable=pressure for cases which want it
fully disabled, I'm not sure we'd need to add more complex disabling
options. The only difference that'd make is for users who are configuring
cgroups manually which is pretty rare and it'd create a clear downside of
increasing confusion as the base assumption becomes dynamic. So, I think the
current default-on with opting-out is and will be just fine.

> Re 2: Not all consumers of the pressure metrics create trigger. I
> would argue that few do. So it isn't the best signal to decide on
> whether aggregation should occur. And yes, it's further complicated by
> the triggers being written to the very pressure files. If we don't
> hide them, we have to come up with another way to mark them as stale,
> lest they confuse the heck out of users. Without breaking format...
> 
> So IMO, default-enable, "cgroup.pressure" as a name, and hiding the
> pressure files should be good for now while allowing to make the
> default configurable down the line.

Sounds great.

Thanks.

-- 
tejun
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Chengming Zhou 3 years, 7 months ago
On 2022/8/23 23:35, Johannes Weiner wrote:
> On Tue, Aug 23, 2022 at 02:18:21PM +0800, Chengming Zhou wrote:
>> On 2022/8/15 21:23, Michal Koutný wrote:
>>> On Wed, Aug 10, 2022 at 11:25:07AM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
>>>> cgroup.pressure.enable sounds good to me too. Or, because it's
>>>> default-enabled and that likely won't change, cgroup.pressure.disable.
>>>
>>> Will it not change?
>>>
>>> I'd say that user would be interested in particular level or even just
>>> level in subtree for PSI, so the opt-out may result in lots of explicit
>>> disablements (or even watch for cgroups created and disable PSI there)
>>> to get some performance back.
>>>
>>> I have two suggestions based on the above:
>>> 1) Make the default globally configurable (mount option?)
>>> 2) Allow implicit enablement upon trigger creation
>>>
>>
>> I think suggestion 1) make sense in some use case, like make per-cgroup
>> PSI disabled by default using a mount option, then enable using the
>> "cgroup.pressure" interface.
>>
>> But suggestion 2) auto enable upon trigger creation, if we hide the
>> {cpu,memory,io}.pressure files when disabled, how can we create trigger?
>>
>> Want to see what do Johannes and Tejun think about these suggestions?
> 
> Re 1: I agree. If desired in the future we can make the default
> configurable. Kconfig, mount option, what have you. cgroup.pressure
> will work fine as a name regardless of what the default is.
> 
> Re 2: Not all consumers of the pressure metrics create trigger. I
> would argue that few do. So it isn't the best signal to decide on
> whether aggregation should occur. And yes, it's further complicated by
> the triggers being written to the very pressure files. If we don't
> hide them, we have to come up with another way to mark them as stale,
> lest they confuse the heck out of users. Without breaking format...
> 
> So IMO, default-enable, "cgroup.pressure" as a name, and hiding the
> pressure files should be good for now while allowing to make the
> default configurable down the line.

Agree, it's what we want for now. Thanks for your reply!


Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Tejun Heo 3 years, 8 months ago
Hello,

On Wed, Aug 10, 2022 at 11:25:07AM -0400, Johannes Weiner wrote:
> How about just cgroup.pressure? Too ambiguous?
> 
> cgroup.pressure.enable sounds good to me too. Or, because it's
> default-enabled and that likely won't change, cgroup.pressure.disable.

.disable sounds more logical but I like .enable better for some reason. As
for just cgroup.pressure, yeah, maybe? The conundrum is that the prettiness
order is the exact reverse of the logical order. So, I'm okay with any of
the three.

Thanks.

-- 
tejun
Re: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Posted by Chengming Zhou 3 years, 8 months ago
On 2022/8/11 01:27, Tejun Heo wrote:
> Hello,
> 
> On Wed, Aug 10, 2022 at 11:25:07AM -0400, Johannes Weiner wrote:
>> How about just cgroup.pressure? Too ambiguous?
>>
>> cgroup.pressure.enable sounds good to me too. Or, because it's
>> default-enabled and that likely won't change, cgroup.pressure.disable.
> 
> .disable sounds more logical but I like .enable better for some reason. As
> for just cgroup.pressure, yeah, maybe? The conundrum is that the prettiness
> order is the exact reverse of the logical order. So, I'm okay with any of
> the three.

Ok, so I would like to pick the prettiest, "cgroup.pressure"; it also looks
more consistent with {cpu|memory|io}.pressure.

Thanks!