In the EEVDF framework with Run-to-Parity protection, `se->vprot` is an
independent variable defining the virtual protection timestamp.
When `reweight_entity()` is called (e.g., via nice/renice), it performs
the following actions to preserve Lag consistency:
1. Scales `se->vlag` based on the new weight.
2. Calls `place_entity()`, which recalculates `se->vruntime` based on
the new weight and scaled lag.
However, the current implementation fails to update `se->vprot`, leading
to mismatches between the task's actual runtime and its expected duration.
This patch fixes the issue by calling `set_protect_slice()` at the end of
`reweight_entity()`. This ensures that a new, valid protection slice is
committed based on the updated `vruntime` and the new weight, restoring
Run-to-Parity consistency immediately after a weight change.
Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption")
Suggested-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Wang Tao <wangtao554@huawei.com>
---
kernel/sched/fair.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e71302282671..bdd8c4e688ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3792,6 +3792,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
if (!curr)
__enqueue_entity(cfs_rq, se);
cfs_rq->nr_queued++;
+ if (curr)
+ set_protect_slice(cfs_rq, se);
}
}
--
2.34.1
Hello Wang,
On 1/20/2026 6:01 PM, Wang Tao wrote:
> In the EEVDF framework with Run-to-Parity protection, `se->vprot` is an
> independent variable defining the virtual protection timestamp.
>
> When `reweight_entity()` is called (e.g., via nice/renice), it performs
> the following actions to preserve Lag consistency:
> 1. Scales `se->vlag` based on the new weight.
> 2. Calls `place_entity()`, which recalculates `se->vruntime` based on
> the new weight and scaled lag.
>
> However, the current implementation fails to update `se->vprot`, leading
> to mismatches between the task's actual runtime and its expected duration.
I don't think that is correct. "vprot" allows for "min_slice" worth of
runtime from the beginning of the pick; however, if we do a
set_protect_slice() after reweight, we'll essentially grant another
"min_slice" worth of time from the current "se->vruntime" (or until the
deadline if it is sooner), which is not correct.
>
> This patch fixes the issue by calling `set_protect_slice()` at the end of
> `reweight_entity()`. This ensures that a new, valid protection slice is
> committed based on the updated `vruntime` and the new weight, restoring
> Run-to-Parity consistency immediately after a weight change.
>
> Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption")
> Suggested-by: Zhang Qiao <zhangqiao22@huawei.com>
> Signed-off-by: Wang Tao <wangtao554@huawei.com>
> ---
> kernel/sched/fair.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e71302282671..bdd8c4e688ae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3792,6 +3792,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
At the beginning of reweight_entity, we should first check if
protect_slice() for curr is true or not:
bool protect_slice = curr && protect_slice(se);
> if (!curr)
> __enqueue_entity(cfs_rq, se);
> cfs_rq->nr_queued++;
> + if (curr)
> + set_protect_slice(cfs_rq, se);
If protect_slice() was true to begin with, we should do:
if (protect_slice)
se->vprot = min_vruntime(se->vprot, se->deadline);
This ensures that if our deadline has moved back, we only protect until
the new deadline and the scheduler can re-evaluate after that. If there
was an entity with a shorter slice at the beginning of the pick, the
"vprot" should still reflect the old value that was calculated using
"se->vruntime" at the time of the pick.
--
Thanks and Regards,
Prateek
Hello Prateek,
K Prateek Nayak wrote:
> Hello Wang,
>
> On 1/20/2026 6:01 PM, Wang Tao wrote:
>> In the EEVDF framework with Run-to-Parity protection, `se->vprot` is an
>> independent variable defining the virtual protection timestamp.
>>
>> When `reweight_entity()` is called (e.g., via nice/renice), it performs
>> the following actions to preserve Lag consistency:
>> 1. Scales `se->vlag` based on the new weight.
>> 2. Calls `place_entity()`, which recalculates `se->vruntime` based on
>> the new weight and scaled lag.
>>
>> However, the current implementation fails to update `se->vprot`, leading
>> to mismatches between the task's actual runtime and its expected duration.
>
> I don't think that is correct. "vprot" allows for "min_slice" worth of
> runtime from the beginning of the pick however, if we do a
> set_protect_slice() after reweight, we'll essentially grant another
> "min_slice" worth of time from the current "se->vruntime" (or until
> deadline if it is sooner) which is not correct.
>
>>
>> This patch fixes the issue by calling `set_protect_slice()` at the end of
>> `reweight_entity()`. This ensures that a new, valid protection slice is
>> committed based on the updated `vruntime` and the new weight, restoring
>> Run-to-Parity consistency immediately after a weight change.
>>
>> Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption")
>> Suggested-by: Zhang Qiao <zhangqiao22@huawei.com>
>> Signed-off-by: Wang Tao <wangtao554@huawei.com>
>> ---
>> kernel/sched/fair.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e71302282671..bdd8c4e688ae 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3792,6 +3792,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>
> At the beginning of reweight_entity, we should first check if
> protect_slice() for curr is true or not:
>
> bool protect_slice = curr && protect_slice(se);
>
>> if (!curr)
>> __enqueue_entity(cfs_rq, se);
>> cfs_rq->nr_queued++;
>> + if (curr)
>> + set_protect_slice(cfs_rq, se);
>
>
> If protect_slice() was true to begin with, we should do:
>
> if (protect_slice)
> se->vprot = min_vruntime(se->vprot, se->deadline);
>
> This ensures that if our deadline has moved back, we only protect until
> the new deadline and the scheduler can re-evaluate after that. If there
> was an entity with a shorter slice at the beginning of the pick, the
> "vprot" should still reflect the old value that was calculated using
> "se->vruntime" at the time of the pick.
>
Regarding your suggestion, I have a concern.
When a task's weight changes, I believe its "vprot" should also change
accordingly. If we keep the original "vprot" unchanged, the task's
actual runtime will not reach the expected "vprot".
For example, if the weight decreases, the "vruntime" increases faster.
If "vprot" is not updated, the task will hit the limit much earlier than
expected in physical time.
Therefore, I made this modification to ensure the protection slice is
valid. I would like to hear your thoughts on this.
if (protect_slice) {
	se->vprot -= se->vruntime;
	se->vprot = div_s64(se->vprot * se->load.weight, weight);
	se->vprot += se->vruntime;

	se->vprot = min_vruntime(se->vprot, se->deadline);
}
Best Regards,
Tao
On Wed, 21 Jan 2026 at 10:33, wangtao (EQ) <wangtao554@huawei.com> wrote:
>
> Hello Prateek,
>
> K Prateek Nayak wrote:
> > Hello Wang,
> >
> > On 1/20/2026 6:01 PM, Wang Tao wrote:
> >> In the EEVDF framework with Run-to-Parity protection, `se->vprot` is an
> >> independent variable defining the virtual protection timestamp.
> >>
> >> When `reweight_entity()` is called (e.g., via nice/renice), it performs
> >> the following actions to preserve Lag consistency:
> >> 1. Scales `se->vlag` based on the new weight.
> >> 2. Calls `place_entity()`, which recalculates `se->vruntime` based on
> >> the new weight and scaled lag.
> >>
> >> However, the current implementation fails to update `se->vprot`, leading
> >> to mismatches between the task's actual runtime and its expected duration.
> >
> > I don't think that is correct. "vprot" allows for "min_slice" worth of
> > runtime from the beginning of the pick however, if we do a
> > set_protect_slice() after reweight, we'll essentially grant another
> > "min_slice" worth of time from the current "se->vruntime" (or until
> > deadline if it is sooner) which is not correct.
> >
> >>
> >> This patch fixes the issue by calling `set_protect_slice()` at the end of
> >> `reweight_entity()`. This ensures that a new, valid protection slice is
> >> committed based on the updated `vruntime` and the new weight, restoring
> >> Run-to-Parity consistency immediately after a weight change.
> >>
> >> Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption")
> >> Suggested-by: Zhang Qiao <zhangqiao22@huawei.com>
> >> Signed-off-by: Wang Tao <wangtao554@huawei.com>
> >> ---
> >> kernel/sched/fair.c | 2 ++
> >> 1 file changed, 2 insertions(+)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index e71302282671..bdd8c4e688ae 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -3792,6 +3792,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> >
> > At the beginning of reweight_entity, we should first check if
> > protect_slice() for curr is true or not:
> >
> > bool protect_slice = curr && protect_slice(se);
> >
> >> if (!curr)
> >> __enqueue_entity(cfs_rq, se);
> >> cfs_rq->nr_queued++;
> >> + if (curr)
> >> + set_protect_slice(cfs_rq, se);
> >
> >
> > If protect_slice() was true to begin with, we should do:
> >
> > if (protect_slice)
> > se->vprot = min_vruntime(se->vprot, se->deadline);
> >
> > This ensures that if our deadline has moved back, we only protect until
> > the new deadline and the scheduler can re-evaluate after that. If there
> > was an entity with a shorter slice at the beginning of the pick, the
> > "vprot" should still reflect the old value that was calculated using
> > "se->vruntime" at the time of the pick.
> >
>
> Regarding your suggestion, I have a concern.
> When a task's weight changes, I believe its "vprot" should also change
> accordingly. If we keep the original "vprot" unchanged, the task's
> actual runtime will not reach the expected "vprot".
>
> For example, if the weight decreases, the "vruntime" increases faster.
> If "vprot" is not updated, the task will hit the limit much earlier than
> expected in physical time.
Ok, but it's the safest way to stay within the min runnable slice limit,
so you need to demonstrate that its lag will remain below that bound.
Run to parity already goes close to this limit by not looking at a new
running entity at every tick, so we need to be sure not to break this
rule.
>
> Therefore, I made this modification to ensure the protection slice is
> valid. I would like to hear your thoughts on this.
>
> if (protect_slice) {
> se->vprot -= se->vruntime;
> se->vprot = div_s64(se->vprot * se->load.weight, weight);
> se->vprot += se->vruntime;
>
> se->vprot = min_vruntime(se->vprot, se->deadline);
Why not use update_protect_slice() like when a new task with a shorter
slice is added ?
> }
>
> Best Regards,
> Tao
On Wed, Jan 21, 2026 at 06:00:15PM +0100, Vincent Guittot wrote:
> Why not use update_protect_slice() like when a new task with a shorter
> slice is added ?
That seems wrong too...
I was going over this, and should we not limit set_protect_slice() to
the first set_next_entity(). That is, AFAICT we'll re-set it on
sched_change -- which sounds wrong to me.
Anyway, I ended up with something like so (should probably be split in
two patches).
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eca642295c4b..bab51da3d179 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3790,6 +3790,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
unsigned long weight)
{
bool curr = cfs_rq->curr == se;
+ u64 vprot = 0;
if (se->on_rq) {
/* commit outstanding execution time */
@@ -3797,6 +3798,9 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
update_entity_lag(cfs_rq, se);
se->deadline -= se->vruntime;
se->rel_deadline = 1;
+ if (curr && protect_slice(se))
+ vprot = se->vprot - se->vruntime;
+
cfs_rq->nr_queued--;
if (!curr)
__dequeue_entity(cfs_rq, se);
@@ -3812,6 +3816,9 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
if (se->rel_deadline)
se->deadline = div_s64(se->deadline * se->load.weight, weight);
+ if (vprot)
+ vprot = div_s64(vprot * se->load.weight, weight);
+
update_load_set(&se->load, weight);
do {
@@ -3823,6 +3830,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
enqueue_load_avg(cfs_rq, se);
if (se->on_rq) {
place_entity(cfs_rq, se, 0);
+ if (vprot)
+ se->vprot = se->vruntime + vprot;
update_load_add(&cfs_rq->load, se->load.weight);
if (!curr)
__enqueue_entity(cfs_rq, se);
@@ -5420,7 +5429,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
}
static void
-set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, bool first)
{
clear_buddies(cfs_rq, se);
@@ -5435,7 +5444,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
__dequeue_entity(cfs_rq, se);
update_load_avg(cfs_rq, se, UPDATE_TG);
- set_protect_slice(cfs_rq, se);
+ if (first)
+ set_protect_slice(cfs_rq, se);
}
update_stats_curr_start(cfs_rq, se);
@@ -8958,13 +8968,13 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
pse = parent_entity(pse);
}
if (se_depth >= pse_depth) {
- set_next_entity(cfs_rq_of(se), se);
+ set_next_entity(cfs_rq_of(se), se, true);
se = parent_entity(se);
}
}
put_prev_entity(cfs_rq, pse);
- set_next_entity(cfs_rq, se);
+ set_next_entity(cfs_rq, se, true);
__set_next_task_fair(rq, p, true);
}
@@ -13578,7 +13588,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- set_next_entity(cfs_rq, se);
+ set_next_entity(cfs_rq, se, first);
/* ensure bandwidth has been allocated on our new cfs_rq */
account_cfs_rq_runtime(cfs_rq, 0);
}
On Fri, 23 Jan 2026 at 16:39, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jan 21, 2026 at 06:00:15PM +0100, Vincent Guittot wrote:
>
> > Why not use update_protect_slice() like when a new task with a shorter
> > slice is added ?
>
> That seems wrong too...
>
> I was going over this, and should we not limit set_protect_slice() to
> the first set_next_entity(). That is, AFAICT we'll re-set it on
> sched_change -- which sounds wrong to me.
Fair enough
I never noticed this
>
> Anyway, I ended up with something like so (should probably be split in
> two patches).
yes 2 patches should be better
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eca642295c4b..bab51da3d179 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3790,6 +3790,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> unsigned long weight)
> {
> bool curr = cfs_rq->curr == se;
> + u64 vprot = 0;
>
> if (se->on_rq) {
> /* commit outstanding execution time */
> @@ -3797,6 +3798,9 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> update_entity_lag(cfs_rq, se);
> se->deadline -= se->vruntime;
> se->rel_deadline = 1;
> + if (curr && protect_slice(se))
> + vprot = se->vprot - se->vruntime;
> +
> cfs_rq->nr_queued--;
> if (!curr)
> __dequeue_entity(cfs_rq, se);
> @@ -3812,6 +3816,9 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> if (se->rel_deadline)
> se->deadline = div_s64(se->deadline * se->load.weight, weight);
>
> + if (vprot)
> + vprot = div_s64(vprot * se->load.weight, weight);
> +
> update_load_set(&se->load, weight);
>
> do {
> @@ -3823,6 +3830,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> enqueue_load_avg(cfs_rq, se);
> if (se->on_rq) {
> place_entity(cfs_rq, se, 0);
> + if (vprot)
> + se->vprot = se->vruntime + vprot;
> update_load_add(&cfs_rq->load, se->load.weight);
> if (!curr)
> __enqueue_entity(cfs_rq, se);
> @@ -5420,7 +5429,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> }
>
> static void
> -set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, bool first)
> {
> clear_buddies(cfs_rq, se);
>
> @@ -5435,7 +5444,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> __dequeue_entity(cfs_rq, se);
> update_load_avg(cfs_rq, se, UPDATE_TG);
>
> - set_protect_slice(cfs_rq, se);
> + if (first)
> + set_protect_slice(cfs_rq, se);
> }
>
> update_stats_curr_start(cfs_rq, se);
> @@ -8958,13 +8968,13 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> pse = parent_entity(pse);
> }
> if (se_depth >= pse_depth) {
> - set_next_entity(cfs_rq_of(se), se);
> + set_next_entity(cfs_rq_of(se), se, true);
> se = parent_entity(se);
> }
> }
>
> put_prev_entity(cfs_rq, pse);
> - set_next_entity(cfs_rq, se);
> + set_next_entity(cfs_rq, se, true);
>
> __set_next_task_fair(rq, p, true);
> }
> @@ -13578,7 +13588,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
> for_each_sched_entity(se) {
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
> - set_next_entity(cfs_rq, se);
> + set_next_entity(cfs_rq, se, first);
> /* ensure bandwidth has been allocated on our new cfs_rq */
> account_cfs_rq_runtime(cfs_rq, 0);
> }
Hello Peter,
On 1/23/2026 9:08 PM, Peter Zijlstra wrote:
> On Wed, Jan 21, 2026 at 06:00:15PM +0100, Vincent Guittot wrote:
>
>> Why not use update_protect_slice() like when a new task with a shorter
>> slice is added ?
>
> That seems wrong too...
>
> I was going over this, and should we not limit set_protect_slice() to
> the first set_next_entity(). That is, AFAICT we'll re-set it on
> sched_change -- which sounds wrong to me.
That makes sense.
>
> Anyway, I ended up with something like so (should probably be split in
> two patches).
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eca642295c4b..bab51da3d179 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3790,6 +3790,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> unsigned long weight)
> {
> bool curr = cfs_rq->curr == se;
> + u64 vprot = 0;
>
> if (se->on_rq) {
> /* commit outstanding execution time */
> @@ -3797,6 +3798,9 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> update_entity_lag(cfs_rq, se);
> se->deadline -= se->vruntime;
> se->rel_deadline = 1;
> + if (curr && protect_slice(se))
> + vprot = se->vprot - se->vruntime;
> +
> cfs_rq->nr_queued--;
> if (!curr)
> __dequeue_entity(cfs_rq, se);
> @@ -3812,6 +3816,9 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> if (se->rel_deadline)
> se->deadline = div_s64(se->deadline * se->load.weight, weight);
>
> + if (vprot)
> + vprot = div_s64(vprot * se->load.weight, weight);
> +
> update_load_set(&se->load, weight);
>
> do {
> @@ -3823,6 +3830,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> enqueue_load_avg(cfs_rq, se);
> if (se->on_rq) {
> place_entity(cfs_rq, se, 0);
> + if (vprot)
> + se->vprot = se->vruntime + vprot;
Scaling vprot makes sense too. Can there be a problem where "vprot"
turns to zero on scaling (weight > (vprot * se->load.weight)) but we
don't end up updating "se->vprot"?
Probably rare, but a higher weight implies it might take a tad bit
longer to hit that old "vprot"; since we are losing precision here,
can the pending "se->vprot" delta be big enough to cause any issue?
Since we are looking at these bits, can you also please take a look at
https://lore.kernel.org/lkml/20251226001731.3730586-1-quzicheng@huawei.com/
which I feel might be a genuine issue when we reweight the curr's
vruntime.
> update_load_add(&cfs_rq->load, se->load.weight);
> if (!curr)
> __enqueue_entity(cfs_rq, se);
--
Thanks and Regards,
Prateek
On Fri, Jan 23, 2026 at 10:09:05PM +0530, K Prateek Nayak wrote:
> > @@ -3823,6 +3830,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> > enqueue_load_avg(cfs_rq, se);
> > if (se->on_rq) {
> > place_entity(cfs_rq, se, 0);
> > + if (vprot)
> > + se->vprot = se->vruntime + vprot;
>
> Scaling vprot makes sense too. Can there be a problem where "vprot"
> turns to zero on scaling (weight > (vprot * se->load.weight)) but we
> don't end up updating "se->vprot"?
Yeah, this can happen, but in that case the thing is 'small'. I'll see
if I can fix it though.
> Since we are looking at these bits, can you also please take a look at
> https://lore.kernel.org/lkml/20251226001731.3730586-1-quzicheng@huawei.com/
> which I feel might be a genuine issue when we are reweight the curr's
> vruntime.
Oh, I hadn't found that yet -- people should know better than to send
mail during x-mas ;-)
Hi Vincent,
Thanks for the feedback.
Vincent Guittot wrote:
> On Wed, 21 Jan 2026 at 10:33, wangtao (EQ) <wangtao554@huawei.com> wrote:
>>
>> Hello Prateek,
>>
>> K Prateek Nayak wrote:
>>> Hello Wang,
>>>
>>> On 1/20/2026 6:01 PM, Wang Tao wrote:
>>>> In the EEVDF framework with Run-to-Parity protection, `se->vprot` is an
>>>> independent variable defining the virtual protection timestamp.
>>>>
>>>> When `reweight_entity()` is called (e.g., via nice/renice), it performs
>>>> the following actions to preserve Lag consistency:
>>>> 1. Scales `se->vlag` based on the new weight.
>>>> 2. Calls `place_entity()`, which recalculates `se->vruntime` based on
>>>> the new weight and scaled lag.
>>>>
>>>> However, the current implementation fails to update `se->vprot`, leading
>>>> to mismatches between the task's actual runtime and its expected duration.
>>>
>>> I don't think that is correct. "vprot" allows for "min_slice" worth of
>>> runtime from the beginning of the pick however, if we do a
>>> set_protect_slice() after reweight, we'll essentially grant another
>>> "min_slice" worth of time from the current "se->vruntime" (or until
>>> deadline if it is sooner) which is not correct.
>>>
>>>>
>>>> This patch fixes the issue by calling `set_protect_slice()` at the end of
>>>> `reweight_entity()`. This ensures that a new, valid protection slice is
>>>> committed based on the updated `vruntime` and the new weight, restoring
>>>> Run-to-Parity consistency immediately after a weight change.
>>>>
>>>> Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption")
>>>> Suggested-by: Zhang Qiao <zhangqiao22@huawei.com>
>>>> Signed-off-by: Wang Tao <wangtao554@huawei.com>
>>>> ---
>>>> kernel/sched/fair.c | 2 ++
>>>> 1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index e71302282671..bdd8c4e688ae 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -3792,6 +3792,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>>
>>> At the beginning of reweight_entity, we should first check if
>>> protect_slice() for curr is true or not:
>>>
>>> bool protect_slice = curr && protect_slice(se);
>>>
>>>> if (!curr)
>>>> __enqueue_entity(cfs_rq, se);
>>>> cfs_rq->nr_queued++;
>>>> + if (curr)
>>>> + set_protect_slice(cfs_rq, se);
>>>
>>>
>>> If protect_slice() was true to begin with, we should do:
>>>
>>> if (protect_slice)
>>> se->vprot = min_vruntime(se->vprot, se->deadline);
>>>
>>> This ensures that if our deadline has moved back, we only protect until
>>> the new deadline and the scheduler can re-evaluate after that. If there
>>> was an entity with a shorter slice at the beginning of the pick, the
>>> "vprot" should still reflect the old value that was calculated using
>>> "se->vruntime" at the time of the pick.
>>>
>>
>> Regarding your suggestion, I have a concern.
>> When a task's weight changes, I believe its "vprot" should also change
>> accordingly. If we keep the original "vprot" unchanged, the task's
>> actual runtime will not reach the expected "vprot".
>>
>> For example, if the weight decreases, the "vruntime" increases faster.
>> If "vprot" is not updated, the task will hit the limit much earlier than
>> expected in physical time.
>
> Ok but it's the safest way to stay in the min runnable slice limit so
> you need to demonstrate that its lag will remain below.
> Run to parity already goes close to this limit by not looking at a new
> running entity at every tick so we need to be sure to not break this
> rule.
>
I fully agree that staying within the Lag limit is paramount and we must
ensure the scheduler invariant is preserved.
>>
>> Therefore, I made this modification to ensure the protection slice is
>> valid. I would like to hear your thoughts on this.
>>
>> if (protect_slice) {
>> se->vprot -= se->vruntime;
>> se->vprot = div_s64(se->vprot * se->load.weight, weight);
>> se->vprot += se->vruntime;
>>
>> se->vprot = min_vruntime(se->vprot, se->deadline);
>
> Why not use update_protect_slice() like when a new task with a shorter
> slice is added ?
I looked into reusing update_protect_slice(), but there is a
mathematical hurdle specific to reweight_entity(): the coordinate space
distortion.
In reweight_entity(), se->vruntime is scaled to match the new weight,
but se->vprot remains in the old weight's coordinate system. If we
simply call update_protect_slice() or compare them directly, we are
comparing values with different virtual time speeds, which leads to
incorrect truncation.
>
>> }
>>
>> Best Regards,
>> Tao
>
To address your safety concerns while fixing the coordinate mismatch, I
have implemented some changes:
Capture: Calculate the remaining physical slice before se->vruntime is
updated (while both are still in the old coordinate system).
Scale: Project this physical slice to the new weight's virtual domain.
Clamp (Safety): Apply the new slice to the new vruntime, but strictly
clamp it with min_vruntime(..., se->deadline).
This approach preserves the physical time promise (fixing the bug) but
includes the min_vruntime guardrail you suggested to ensure we never
exceed the deadline/lag limit.
Does this align with what you had in mind?
Best,
Tao
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bdd8c4e688ae..52dc7deea7a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3755,10 +3755,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
unsigned long weight)
{
bool curr = cfs_rq->curr == se;
+ bool protect_slice = curr && protect_slice(se);
+ u64 vslice_rem = 0;
if (se->on_rq) {
+
+ if (protect_slice) {
+ vslice_rem = se->vprot - se->vruntime;
+ }
+
update_entity_lag(cfs_rq, se);
se->deadline -= se->vruntime;
se->rel_deadline = 1;
@@ -3792,8 +3799,13 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
if (!curr)
__enqueue_entity(cfs_rq, se);
cfs_rq->nr_queued++;
- if (curr)
- set_protect_slice(cfs_rq, se);
+
+ if (protect_slice && vslice_rem > 0) {
+ vslice_rem = div_s64(vslice_rem * se->load.weight, weight);
+ se->vprot = se->vruntime + vslice_rem;
+
+ se->vprot = min_vruntime(se->vprot, se->deadline);
+ }
}
}
On Thu, 22 Jan 2026 at 08:52, wangtao (EQ) <wangtao554@huawei.com> wrote:
>
> Hi Vincent,
> Thanks for the feedback.
>
> Vincent Guittot wrote:
> > On Wed, 21 Jan 2026 at 10:33, wangtao (EQ) <wangtao554@huawei.com> wrote:
> >>
> >> Hello Prateek,
> >>
> >> K Prateek Nayak wrote:
> >>> Hello Wang,
> >>>
> >>> On 1/20/2026 6:01 PM, Wang Tao wrote:
> >>>> In the EEVDF framework with Run-to-Parity protection, `se->vprot` is an
> >>>> independent variable defining the virtual protection timestamp.
> >>>>
> >>>> When `reweight_entity()` is called (e.g., via nice/renice), it performs
> >>>> the following actions to preserve Lag consistency:
> >>>> 1. Scales `se->vlag` based on the new weight.
> >>>> 2. Calls `place_entity()`, which recalculates `se->vruntime` based on
> >>>> the new weight and scaled lag.
> >>>>
> >>>> However, the current implementation fails to update `se->vprot`, leading
> >>>> to mismatches between the task's actual runtime and its expected duration.
> >>>
> >>> I don't think that is correct. "vprot" allows for "min_slice" worth of
> >>> runtime from the beginning of the pick however, if we do a
> >>> set_protect_slice() after reweight, we'll essentially grant another
> >>> "min_slice" worth of time from the current "se->vruntime" (or until
> >>> deadline if it is sooner) which is not correct.
> >>>
> >>>>
> >>>> This patch fixes the issue by calling `set_protect_slice()` at the end of
> >>>> `reweight_entity()`. This ensures that a new, valid protection slice is
> >>>> committed based on the updated `vruntime` and the new weight, restoring
> >>>> Run-to-Parity consistency immediately after a weight change.
> >>>>
> >>>> Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption")
> >>>> Suggested-by: Zhang Qiao <zhangqiao22@huawei.com>
> >>>> Signed-off-by: Wang Tao <wangtao554@huawei.com>
> >>>> ---
> >>>> kernel/sched/fair.c | 2 ++
> >>>> 1 file changed, 2 insertions(+)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index e71302282671..bdd8c4e688ae 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -3792,6 +3792,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> >>>
> >>> At the beginning of reweight_entity, we should first check if
> >>> protect_slice() for curr is true or not:
> >>>
> >>> bool protect_slice = curr && protect_slice(se);
> >>>
> >>>> if (!curr)
> >>>> __enqueue_entity(cfs_rq, se);
> >>>> cfs_rq->nr_queued++;
> >>>> + if (curr)
> >>>> + set_protect_slice(cfs_rq, se);
> >>>
> >>>
> >>> If protect_slice() was true to begin with, we should do:
> >>>
> >>> if (protect_slice)
> >>> se->vprot = min_vruntime(se->vprot, se->deadline);
> >>>
> >>> This ensures that if our deadline has moved back, we only protect until
> >>> the new deadline and the scheduler can re-evaluate after that. If there
> >>> was an entity with a shorter slice at the beginning of the pick, the
> >>> "vprot" should still reflect the old value that was calculated using
> >>> "se->vruntime" at the time of the pick.
> >>>
> >>
> >> Regarding your suggestion, I have a concern.
> >> When a task's weight changes, I believe its "vprot" should also change
> >> accordingly. If we keep the original "vprot" unchanged, the task's
> >> actual runtime will not reach the expected "vprot".
> >>
> >> For example, if the weight decreases, the "vruntime" increases faster.
> >> If "vprot" is not updated, the task will hit the limit much earlier than
> >> expected in physical time.
> >
> > Ok but it's the safest way to stay in the min runnable slice limit so
> > you need to demonstrate that its lag will remain below.
> > Run to parity already goes close to this limit by not looking at a new
> > running entity at every tick so we need to be sure to not break this
> > rule.
> >
>
> I fully agree that staying within the Lag limit is paramount and we must
> ensure the scheduler invariant is preserved.
>
> >>
> >> Therefore, I made this modification to ensure the protection slice is
> >> valid. I would like to hear your thoughts on this.
> >>
> >> if (protect_slice) {
> >> se->vprot -= se->vruntime;
> >> se->vprot = div_s64(se->vprot * se->load.weight, weight);
> >> se->vprot += se->vruntime;
> >>
> >> se->vprot = min_vruntime(se->vprot, se->deadline);
> >
> > Why not use update_protect_slice() like when a new task with a shorter
> > slice is added ?
>
> I looked into reusing update_protect_slice(), but there is a
> mathematical hurdle specific to reweight_entity(): the coordinate space
> distortion.
>
> In reweight_entity(), se->vruntime is scaled to match the new weight,
> but se->vprot remains in the old weight's coordinate system. If we
> simply call update_protect_slice() or compare them directly, we are
> comparing values with different virtual time speeds, which leads to
> incorrect truncation.
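A small sketch of that coordinate mismatch (to_virtual() is a hypothetical helper mirroring delta * NICE_0_LOAD / weight; values are illustrative):

```python
NICE_0_LOAD = 1024

def to_virtual(phys_ns, weight):
    # Hypothetical helper: physical ns -> virtual ns at a given weight.
    return phys_ns * NICE_0_LOAD // weight

# 1 ms of protection granted at weight 1024 spans 1_000_000 virtual ns ...
span_old = to_virtual(1_000_000, 1024)
# ... but after a reweight to 512, the same 1 ms spans twice as many
# virtual ns, so comparing an unconverted vprot against the new-weight
# vruntime truncates the protection to half its intended duration.
span_new = to_virtual(1_000_000, 512)
```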
>
> >
> >> }
> >>
> >> Best Regards,
> >> Tao
> >
>
> To address your safety concerns while fixing the coordinate mismatch, I
> have made the following changes:
>
> 1. Capture: calculate the remaining physical slice before se->vruntime
>    is updated (while both values are still in the old coordinate system).
>
> 2. Scale: project this physical slice into the new weight's virtual
>    domain.
>
> 3. Clamp (safety): apply the new slice to the new vruntime, but strictly
>    clamp it with min_vruntime(..., se->deadline).
>
> This approach preserves the physical-time promise (fixing the bug) but
> includes the min_vruntime guardrail you suggested to ensure we never
> exceed the deadline/lag limit.
>
> Does this align with what you had in mind?
yes
>
> Best,
> Tao
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bdd8c4e688ae..52dc7deea7a1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3755,10 +3755,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>                             unsigned long weight)
>  {
>         bool curr = cfs_rq->curr == se;
> +       bool protect = curr && protect_slice(se);
> +       s64 vslice_rem = 0;
>
>         if (se->on_rq) {
> +
> +               if (protect) {
> +                       vslice_rem = se->vprot - se->vruntime;
> +               }
> +
>                 update_entity_lag(cfs_rq, se);
>                 se->deadline -= se->vruntime;
>                 se->rel_deadline = 1;
> @@ -3792,8 +3799,13 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>                 if (!curr)
>                         __enqueue_entity(cfs_rq, se);
>                 cfs_rq->nr_queued++;
> -               if (curr)
> -                       set_protect_slice(cfs_rq, se);
> +
> +               if (protect && vslice_rem > 0) {
> +                       vslice_rem = div_s64(vslice_rem * se->load.weight, weight);
> +                       se->vprot = se->vruntime + vslice_rem;
> +
> +                       se->vprot = min_vruntime(se->vprot, se->deadline);
> +               }
>         }
>  }
>
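Wang's capture/scale/clamp sequence can be modeled outside the kernel as follows (a sketch; the helper name and all numbers are illustrative, not kernel API):

```python
def reweight_vprot(vruntime_old, vprot, vruntime_new, deadline,
                   weight_old, weight_new):
    """Sketch of the capture/scale/clamp steps (illustrative, not kernel code)."""
    # Capture: remaining protection, measured in old-weight virtual time.
    vrem = vprot - vruntime_old
    # Scale: express the same physical duration in new-weight virtual time;
    # halving the weight doubles the virtual span of the same physical slice.
    vrem = vrem * weight_old // weight_new
    # Clamp: never protect beyond the deadline, preserving the lag bound.
    return min(vruntime_new + vrem, deadline)

# Weight halves (1024 -> 512): a 1000-unit window doubles virtually,
# then the deadline clamp caps it at 7_500.
new_vprot = reweight_vprot(vruntime_old=4_000, vprot=5_000,
                           vruntime_new=6_000, deadline=7_500,
                           weight_old=1024, weight_new=512)
```

Without the clamp, the scaled protection would land at 8_000, past the deadline.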
Hello Vincent,
On 1/21/2026 10:30 PM, Vincent Guittot wrote:
>>> If protect_slice() was true to begin with, we should do:
>>>
>>> if (protect_slice)
>>> se->vprot = min_vruntime(se->vprot, se->deadline);
>>>
>>> This ensures that if our deadline has moved back, we only protect until
>>> the new deadline and the scheduler can re-evaluate after that. If there
>>> was an entity with a shorter slice at the beginning of the pick, the
>>> "vprot" should still reflect the old value that was calculated using
>>> "se->vruntime" at the time of the pick.
>>>
>>
>> Regarding your suggestion, I have a concern.
>> When a task's weight changes, I believe its "vprot" should also change
>> accordingly. If we keep the original "vprot" unchanged, the task's
>> actual runtime will not reach the expected "vprot".
>>
>> For example, if the weight decreases, the "vruntime" increases faster.
>> If "vprot" is not updated, the task will hit the limit much earlier than
>> expected in physical time.
>
> Ok but it's the safest way to stay in the min runnable slice limit so
> you need to demonstrate that its lag will remain below.
> Run to parity already goes close to this limit by not looking at a new
> running entity at every tick so we need to be sure to not break this
> rule.
Wakeups will do an update_protect_slice() at the common ancestors for
RUN_TO_PARITY, but this is also a problem for cgroups where the current
sched_entities can undergo a reweight at every tick.

For Wang's specific case, the problem is that "se->vprot" can expand if
the entity's weight goes down, but update_protect_slice() does a

    min(se->vprot, se->vruntime + scale_delta_fair(min_slice, se))

which is capped by the original vprot. The reweight can end up with a
case where:

    se->vprot < se->deadline < se->vruntime + scale_delta_fair(min_slice, se)
    (at time of pick)          (after reweight)

Ideally, we should redo set_protect_slice() with the same situation as
at the time of pick, with adjusted vruntime and deadline, and considering
the current set of tasks queued, but it boils down to the same way we
do "rel_deadline" today, right?

wakeup_preempt_fair() would have already updated the vprot based on
the min_slice, and since we do a scale_delta_fair(min_slice, se) in
{set,update}_protect_slice(), scaling it with the current entity's
weight shouldn't be a problem, I suppose.
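A numeric sketch of that ordering (modeling scale_delta_fair() as delta * NICE_0_LOAD / weight; all values are hypothetical):

```python
NICE_0_LOAD = 1024

def scale_delta_fair(delta, weight):
    # Model of the kernel helper: physical delta -> virtual delta for `se`.
    return delta * NICE_0_LOAD // weight

min_slice = 700_000               # 0.7 ms, in ns (hypothetical)
vruntime, deadline = 0, 1_500_000

# At pick time (weight 1024), protection is capped below the deadline.
vprot = min(deadline, vruntime + scale_delta_fair(min_slice, 1024))

# After reweighting down to 256, the min_slice-based bound balloons past
# the deadline, yielding: se->vprot < se->deadline < scaled bound.
candidate = vruntime + scale_delta_fair(min_slice, 256)
```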
>
>>
>> Therefore, I made this modification to ensure the protection slice is
>> valid. I would like to hear your thoughts on this.
>>
>> if (protect_slice) {
>> se->vprot -= se->vruntime;
>> se->vprot = div_s64(se->vprot * se->load.weight, weight);
>> se->vprot += se->vruntime;
>>
>> se->vprot = min_vruntime(se->vprot, se->deadline);
>
> Why not use update_protect_slice() like when a new task with a shorter
> slice is added ?
Are you suggesting something like the following to keep
protect_slice() updated at enqueue?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dc905b853c4b..c21fb039038b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6937,15 +6937,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
        }

        for_each_sched_entity(se) {
+               struct sched_entity *curr;
+
                cfs_rq = cfs_rq_of(se);
+               curr = cfs_rq->curr;

                update_load_avg(cfs_rq, se, UPDATE_TG);
                se_update_runnable(se);
                update_cfs_group(se);

                se->slice = slice;
-               if (se != cfs_rq->curr)
+               if (se != curr)
                        min_vruntime_cb_propagate(&se->run_node, NULL);
+               if (curr && curr->on_rq)
+                       update_protect_slice(cfs_rq, curr);
                slice = cfs_rq_min_slice(cfs_rq);

                cfs_rq->h_nr_runnable += h_nr_runnable;
--
Thanks and Regards,
Prateek
On Thu, 22 Jan 2026 at 05:38, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Vincent,
>
> [...]
>
> >
> >>
> >> Therefore, I made this modification to ensure the protection slice is
> >> valid. I would like to hear your thoughts on this.
> >>
> >> if (protect_slice) {
> >> se->vprot -= se->vruntime;
> >> se->vprot = div_s64(se->vprot * se->load.weight, weight);
> >> se->vprot += se->vruntime;
> >>
> >> se->vprot = min_vruntime(se->vprot, se->deadline);
> >
> > Why not use update_protect_slice() like when a new task with a shorter
> > slice is added ?
>
> Are you suggesting something like the following to keep
> protect_slice() updated at enqueue?
No, I was suggesting to never increase vprot in order to be sure that
we will not break the lag rule. But wangtao and Peter made another
proposal.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dc905b853c4b..c21fb039038b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6937,15 +6937,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> }
>
> for_each_sched_entity(se) {
> + struct sched_entity *curr;
> +
> cfs_rq = cfs_rq_of(se);
> + curr = cfs_rq->curr;
>
> update_load_avg(cfs_rq, se, UPDATE_TG);
> se_update_runnable(se);
> update_cfs_group(se);
>
> se->slice = slice;
> - if (se != cfs_rq->curr)
> + if (se != curr)
> min_vruntime_cb_propagate(&se->run_node, NULL);
> + if (curr && curr->on_rq)
> + update_protect_slice(cfs_rq, curr);
> slice = cfs_rq_min_slice(cfs_rq);
>
> cfs_rq->h_nr_runnable += h_nr_runnable;
> --
> Thanks and Regards,
> Prateek
>