[RESEND PATCH] sched/syscalls: Allow setting niceness using sched_param struct

Michael Pratt posted 1 patch 2 months, 2 weeks ago
There is a newer version of this series
kernel/sched/syscalls.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
[RESEND PATCH] sched/syscalls: Allow setting niceness using sched_param struct
Posted by Michael Pratt 2 months, 2 weeks ago
From userspace, spawning a new process with, for example,
posix_spawn(), only allows the user to work with
the scheduling priority value defined by POSIX
in the sched_param struct.

However, sched_setparam() and similar syscalls lead to
__sched_setscheduler() which rejects any new value
for the priority other than 0 for non-RT schedule classes,
a behavior kept since Linux 2.6 or earlier.

Linux translates the usage of the sched_param struct
into it's own internal sched_attr struct during the syscall,
but the user has no way to manage the other values
within the sched_attr struct using only POSIX functions.

The only other way to adjust niceness while using posix_spawn()
would be to set the value after the process has started,
but this introduces the risk of the process being dead
before the next syscall can set the priority after the fact.

To resolve this, allow the use of the priority value
originally from the POSIX sched_param struct in order to
set the niceness value instead of rejecting the priority value.

Edit the sched_get_priority_*() POSIX syscalls
in order to reflect the range of values accepted.

Cc: stable@vger.kernel.org # Apply to kernel/sched/core.c
Signed-off-by: Michael Pratt <mcpratt@pm.me>
---
 kernel/sched/syscalls.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index ae1b42775ef9..52c02b80f037 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -853,6 +853,19 @@ static int _sched_setscheduler(struct task_struct *p, int policy,
 		attr.sched_policy = policy;
 	}
 
+	if (attr.sched_priority > MAX_PRIO-1)
+		return -EINVAL;
+
+	/*
+	 * If priority is set for SCHED_NORMAL or SCHED_BATCH,
+	 * set the niceness instead, but only for user calls.
+	 */
+	if (check && attr.sched_priority > MAX_RT_PRIO-1 &&
+	   ((policy != SETPARAM_POLICY && fair_policy(policy)) || fair_policy(p->policy))) {
+		attr.sched_nice = PRIO_TO_NICE(attr.sched_priority);
+		attr.sched_priority = 0;
+	}
+
 	return __sched_setscheduler(p, &attr, check, true);
 }
 /**
@@ -1598,9 +1611,11 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
 	case SCHED_RR:
 		ret = MAX_RT_PRIO-1;
 		break;
-	case SCHED_DEADLINE:
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
+		ret = MAX_PRIO-1;
+		break;
+	case SCHED_DEADLINE:
 	case SCHED_IDLE:
 		ret = 0;
 		break;
@@ -1625,9 +1640,11 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
 	case SCHED_RR:
 		ret = 1;
 		break;
-	case SCHED_DEADLINE:
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
+		ret = MAX_RT_PRIO;
+		break;
+	case SCHED_DEADLINE:
 	case SCHED_IDLE:
 		ret = 0;
 	}

base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
-- 
2.30.2
Re: [RESEND PATCH] sched/syscalls: Allow setting niceness using sched_param struct
Posted by Peter Zijlstra 2 months, 2 weeks ago
On Mon, Sep 16, 2024 at 05:08:49AM +0000, Michael Pratt wrote:
> From userspace, spawning a new process with, for example,
> posix_spawn(), only allows the user to work with
> the scheduling priority value defined by POSIX
> in the sched_param struct.
> 
> However, sched_setparam() and similar syscalls lead to
> __sched_setscheduler() which rejects any new value
> for the priority other than 0 for non-RT schedule classes,
> a behavior kept since Linux 2.6 or earlier.

Right, and the current behaviour is entirely in-line with the POSIX
specs.

I realize this might be a pain, but why should be change this spec
conforming and very long standing behaviour?

Worse, you're proposing a nice ABI that is entirely different from the
normal [-20,19] range.

Why do you feel this is the best way forward? Would not adding
POSIX_SPAWN_SETSCHEDATTR be a more future proof mechanism?
Re: [RESEND PATCH] sched/syscalls: Allow setting niceness using sched_param struct
Posted by Michael Pratt 1 month, 3 weeks ago
Hi everyone, again,

I'm still waiting for a serious review of this very short patch after about a month.
I'll be getting ready to send this a 3rd time, and to more people, if I don't hear back again.

I believe that I have correctly addressed Peter's concerns in my replies,
and that my patch is still appropriate for inclusion into Linux as it is now.

If I'm wrong please let me know why instead of just silence.

--
MCP
Re: [RESEND PATCH] sched/syscalls: Allow setting niceness using sched_param struct
Posted by Michael Pratt 2 months ago
Hi Peter,

Have I addressed your concerns?

In my opinion, this is a simple change that adds ABI functionality,
without affecting any other parts of the syscalls involved, whether breaking or non-breaking.

Like I said in previous replies, POSIX doesn't allow passing a negative number
as a successful value, and setting the static priority directly can expose
the scheduler to invalid priority values for CFS policies in later syscalls.

As far as I can see, this is the only way for usage of posix_spawnattr_setschedparam()
to be both compliant _and_ functional for SCHED_NORMAL and SCHED_BATCH.

--
Thanks for your time,
MCP
Re: [RESEND PATCH] sched/syscalls: Allow setting niceness using sched_param struct
Posted by Michael Pratt 2 months, 1 week ago
Hi Peter,

On Monday, September 16th, 2024 at 07:13, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, Sep 16, 2024 at 05:08:49AM +0000, Michael Pratt wrote:
> 
> > From userspace, spawning a new process with, for example,
> > posix_spawn(), only allows the user to work with
> > the scheduling priority value defined by POSIX
> > in the sched_param struct.
> > 
> > However, sched_setparam() and similar syscalls lead to
> > __sched_setscheduler() which rejects any new value
> > for the priority other than 0 for non-RT schedule classes,
> > a behavior kept since Linux 2.6 or earlier.
> 
> 
> Right, and the current behaviour is entirely in-line with the POSIX
> specs.

I'm just mentioning this for context.
In this case, "in-line with POSIX specs" has nothing to do with
whether or not the feature works. POSIX says nothing about which policies
should be accepting which values or not and how they are processed.
Like many things, it is simply implementation-specific.

The current behavior is that it doesn't work, and I would like it to work.

> I realize this might be a pain, but why should we change this spec
> conforming and very long standing behavior?

The fact that the overall behavior is "very long standing" is a coincidence.
The code here conforms to the specs both before and after the patch,
and the difference is functionality.

In fact, I am not aiming to change
the exact behavior of "reject every priority value other than 0"
but rather work around that by translating it to niceness
so long as it is a valid range passed as the priority by the user.
This method is not just to maintain that priority must be 0, but I found it necessary,
because if the syscall were allowed to change the static priority,
then a future change in the "niceness" value would theoretically allow the priority
to pass into the RT range for non-RT policies.

> Worse, you're proposing a nice ABI that is entirely different from the
> normal [-20,19] range.

Please take a closer look... The resulting niceness value is exactly that range.
  PRIO_TO_NICE([MAX_RT_PRIO,MAX_PRIO-1]) = [-20,19]

I am not writing this so that the value passed as a "priority" value should be assumed
to be the "niceness" value instead by the user, but rather that the user should
pass a value for "priority" that will actually result in that value,
but with the "niceness" adjusted instead,
as that is the user-specific method to effectively do the same thing.

The "niceness" value has no meaning in the world of POSIX, it only means something
in the world of Linux, and just like the translation from sched_param to sched_attr structs,
this is the place where we would translate priority to niceness.
Everything outside the internals of the kernel should be understood as the "actual" priority,
because POSIX is a userspace that doesn't acknowledge or understand the kernel's ABIs,
not the other way around.

Otherwise, we have a confusing conflation between the meaning of the two values,
where a value of 19 makes sense for niceness, but is obviously invalid for priority
for SCHED_NORMAL, and a negative value makes sense for niceness, but is obviously invalid
for priority in any policy.

Implementations of posix_spawn functions ask for the "priority",
and POSIX states that the value passed in with the sched_param struct should be the "priority"
and that the usage is implementation-specific, not the other way around, where
the meaning of the value would be implementation-specific, but the usage of the value
would be clearly defined instead. I'm trying to stay in-line with the semantics as well.

> Why do you feel this is the best way forward? Would not adding
> POSIX_SPAWN_SETSCHEDATTR be a more future proof mechanism?

New flags don't change the fact that the value will be rejected in the kernel,
unless I am misunderstanding what you mean...

I believe this is the simplest and the smallest possible change
that is conforming both to POSIX and the kernel's styling
in order to make posix_spawnattr_setschedparam()
work instead of _just_ being "conforming and compliant",
which like I said is a low requirement of "just reject all values".

Flags like POSIX_SPAWN_SETSCHEDATTR would be used at the library level
and we have no problems at the library level, except for Linux-only libraries
that have not implemented posix_spawnattr_setschedparam() because it currently fails.
Notably, the musl C library is an example of this, but that might change
if we finally add support for this.

It would be nice if POSIX would add a flag to specifically cater to linux,
however, that would likely require them to add the sched_attr struct definition
or replace the sched_param struct, and as we know things usually work the other way around.

Thanks for your time.

--
MCP
Re: [RESEND PATCH] sched/syscalls: Allow setting niceness using sched_param struct
Posted by Michael Pratt 2 months, 1 week ago
one more detail I forgot:
it actually would not be compliant for niceness as the input...

On Monday, September 16th, 2024 at 15:23, Michael Pratt <mcpratt@pm.me> wrote:

> On Monday, September 16th, 2024 at 07:13, Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Worse, you're proposing a nice ABI that is entirely different from the
> > normal [-20,19] range.
> 
> ...
> ...
> ...
> 
> Otherwise, we have a confusing conflation between the meaning of the two values,
> where a value of 19 makes sense for niceness, but is obviously invalid for priority
> for SCHED_NORMAL, and a negative value makes sense for niceness, but is obviously invalid
> for priority in any policy.
> 

POSIX doesn't allow a negative value for the ABI at all:

  If successful, the sched_get_priority_max() and sched_get_priority_min() functions return
  the appropriate maximum or minimum values, respectively.
  If unsuccessful, they return a value of -1 and set errno to indicate the error.


--
MCP