An entity is supposed to get an earlier deadline with
PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
overwritten soon after in enqueue_entity() the first time a forked
entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
Placing in task_fork_fair() seems unnecessary since none of the values
that get set (slice, vruntime, deadline) are used before they're set
again at enqueue time, so get rid of that and just pass ENQUEUE_INITIAL
to enqueue_entity() via wake_up_new_task().
Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
Tested on top of peterz/sched/eevdf from 2023-10-03.
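For reference, the PLACE_DEADLINE_INITIAL handling this is about lives in
place_entity().  Trimmed down to the relevant part (a sketch of the eevdf
branch, not verbatim source):

    static void
    place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
    {
            u64 vslice = calc_delta_fair(se->slice, se);
            ...
            /* Halve the first deadline so a forked task gets picked sooner. */
            if (sched_feat(PLACE_DEADLINE_INITIAL) && (flags & ENQUEUE_INITIAL))
                    vslice /= 2;

            /* EEVDF: vd_i = ve_i + r_i/w_i */
            se->deadline = se->vruntime + vslice;
    }

The fork-time call computes the halved vslice, but the first
enqueue_entity() calls place_entity() again without ENQUEUE_INITIAL and
recomputes se->deadline, so the shorter deadline never survives to the
runqueue.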
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 1 -
2 files changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 779cdc7969c81..500e2dbfd41dd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4854,7 +4854,7 @@ void wake_up_new_task(struct task_struct *p)
update_rq_clock(rq);
post_init_entity_util_avg(p);
- activate_task(rq, p, ENQUEUE_NOCLOCK);
+ activate_task(rq, p, ENQUEUE_INITIAL | ENQUEUE_NOCLOCK);
trace_sched_wakeup_new(p);
wakeup_preempt(rq, p, WF_FORK);
#ifdef CONFIG_SMP
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a0b4dac2662c9..5872b8a3f5891 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12446,7 +12446,6 @@ static void task_fork_fair(struct task_struct *p)
curr = cfs_rq->curr;
if (curr)
update_curr(cfs_rq);
- place_entity(cfs_rq, se, ENQUEUE_INITIAL);
rq_unlock(rq, &rf);
}
--
2.41.0
Hello Daniel,
On 10/4/2023 6:47 AM, Daniel Jordan wrote:
> An entity is supposed to get an earlier deadline with
> PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
> overwritten soon after in enqueue_entity() the first time a forked
> entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
>
> Placing in task_fork_fair() seems unnecessary since none of the values
> that get set (slice, vruntime, deadline) are used before they're set
> again at enqueue time, so get rid of that and just pass ENQUEUE_INITIAL
> to enqueue_entity() via wake_up_new_task().
>
> Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
I got a chance to test this on a 3rd Generation EPYC system. I don't
see anything out of the ordinary except for a small regression on
hackbench. I'll leave the full results below.
o System details
- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- Boost enabled, C2 Disabled (POLL and MWAIT based C1 remained enabled)
o Kernel Details
- tip: tip:sched/core at commit d4d6596b4386 ("sched/headers: Remove
duplicate header inclusions")
- place-deadline-fix: tip + this patch
o Benchmark Results
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) place-deadline-fix[pct imp](CV)
1-groups 1.00 [ -0.00]( 2.58) 1.04 [ -3.63]( 3.14)
2-groups 1.00 [ -0.00]( 1.87) 1.03 [ -2.98]( 1.85)
4-groups 1.00 [ -0.00]( 1.63) 1.02 [ -2.35]( 1.59)
8-groups 1.00 [ -0.00]( 1.38) 1.03 [ -2.92]( 1.20)
16-groups 1.00 [ -0.00]( 2.67) 1.02 [ -1.61]( 2.08)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) place-deadline-fix[pct imp](CV)
1 1.00 [ 0.00]( 0.59) 1.02 [ 2.09]( 0.07)
2 1.00 [ 0.00]( 1.19) 1.02 [ 2.38]( 0.82)
4 1.00 [ 0.00]( 0.33) 1.03 [ 2.89]( 0.99)
8 1.00 [ 0.00]( 0.76) 1.02 [ 2.10]( 0.46)
16 1.00 [ 0.00]( 1.10) 1.01 [ 0.81]( 0.49)
32 1.00 [ 0.00]( 1.47) 1.02 [ 1.77]( 0.58)
64 1.00 [ 0.00]( 1.77) 1.02 [ 1.83]( 1.77)
128 1.00 [ 0.00]( 0.41) 1.02 [ 2.49]( 0.52)
256 1.00 [ 0.00]( 0.63) 1.03 [ 3.03]( 1.38)
512 1.00 [ 0.00]( 0.32) 1.02 [ 1.61]( 0.45)
1024 1.00 [ 0.00]( 0.22) 1.01 [ 1.00]( 0.26)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) place-deadline-fix[pct imp](CV)
Copy 1.00 [ 0.00]( 9.30) 0.85 [-15.36](11.26)
Scale 1.00 [ 0.00]( 6.67) 0.98 [ -2.36]( 7.53)
Add 1.00 [ 0.00]( 6.77) 0.92 [ -7.86]( 7.83)
Triad 1.00 [ 0.00]( 7.36) 0.94 [ -5.57]( 6.82)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) place-deadline-fix[pct imp](CV)
Copy 1.00 [ 0.00]( 1.83) 0.96 [ -3.68]( 5.08)
Scale 1.00 [ 0.00]( 6.41) 1.03 [ 2.66]( 5.28)
Add 1.00 [ 0.00]( 6.23) 1.02 [ 1.54]( 4.97)
Triad 1.00 [ 0.00]( 0.89) 0.94 [ -5.68]( 6.78)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) place-deadline-fix[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.05) 1.02 [ 1.83]( 1.98)
2-clients 1.00 [ 0.00]( 0.93) 1.02 [ 1.87]( 2.45)
4-clients 1.00 [ 0.00]( 0.54) 1.02 [ 2.19]( 1.99)
8-clients 1.00 [ 0.00]( 0.48) 1.02 [ 2.29]( 2.27)
16-clients 1.00 [ 0.00]( 0.42) 1.02 [ 1.60]( 1.70)
32-clients 1.00 [ 0.00]( 0.78) 1.02 [ 1.88]( 2.08)
64-clients 1.00 [ 0.00]( 1.45) 1.02 [ 2.33]( 2.18)
128-clients 1.00 [ 0.00]( 0.97) 1.02 [ 2.38]( 1.95)
256-clients 1.00 [ 0.00]( 4.57) 1.02 [ 2.50]( 5.42)
512-clients 1.00 [ 0.00](52.74) 1.03 [ 3.38](49.69)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) place-deadline-fix[pct imp](CV)
1 1.00 [ -0.00]( 3.95) 0.90 [ 10.26](31.80)
2 1.00 [ -0.00](10.45) 1.08 [ -7.89](15.33)
4 1.00 [ -0.00]( 4.76) 0.93 [ 7.14]( 3.95)
8 1.00 [ -0.00]( 9.35) 1.06 [ -6.25]( 8.90)
16 1.00 [ -0.00]( 8.84) 0.92 [ 8.06]( 4.39)
32 1.00 [ -0.00]( 3.33) 1.04 [ -4.40]( 3.68)
64 1.00 [ -0.00]( 6.70) 0.96 [ 4.17]( 2.75)
128 1.00 [ -0.00]( 0.71) 0.96 [ 3.55]( 1.26)
256 1.00 [ -0.00](31.20) 1.28 [-28.21]( 9.69)
512 1.00 [ -0.00]( 4.98) 1.00 [ 0.48]( 2.76)
==================================================================
Test : ycsb-cassandra
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Metric tip place-deadline-fix(pct imp)
Throughput 1.00 1.01 (%diff: 1.06%)
==================================================================
Test : ycsb-mongodb
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Metric tip place-deadline-fix(pct imp)
Throughput 1.00 1.00 (%diff: 0.25%)
==================================================================
Test : DeathStarBench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Pinning scaling tip place-deadline-fix(pct imp)
1CCD 1 1.00 1.00 (%diff: -0.32%)
2CCD 2 1.00 1.00 (%diff: -0.26%)
4CCD 4 1.00 1.00 (%diff: 0.17%)
8CCD 8 1.00 1.00 (%diff: -0.17%)
--
I see there is a v2. I'll give that a spin as well.
> ---
>
> Tested on top of peterz/sched/eevdf from 2023-10-03.
>
> kernel/sched/core.c | 2 +-
> kernel/sched/fair.c | 1 -
> 2 files changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 779cdc7969c81..500e2dbfd41dd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4854,7 +4854,7 @@ void wake_up_new_task(struct task_struct *p)
> update_rq_clock(rq);
> post_init_entity_util_avg(p);
>
> - activate_task(rq, p, ENQUEUE_NOCLOCK);
> + activate_task(rq, p, ENQUEUE_INITIAL | ENQUEUE_NOCLOCK);
> trace_sched_wakeup_new(p);
> wakeup_preempt(rq, p, WF_FORK);
> #ifdef CONFIG_SMP
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a0b4dac2662c9..5872b8a3f5891 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12446,7 +12446,6 @@ static void task_fork_fair(struct task_struct *p)
> curr = cfs_rq->curr;
> if (curr)
> update_curr(cfs_rq);
> - place_entity(cfs_rq, se, ENQUEUE_INITIAL);
> rq_unlock(rq, &rf);
> }
>
--
Thanks and Regards,
Prateek
Hi Prateek,
On Thu, Oct 05, 2023 at 11:26:07AM +0530, K Prateek Nayak wrote:
> Hello Daniel,
>
> On 10/4/2023 6:47 AM, Daniel Jordan wrote:
> > An entity is supposed to get an earlier deadline with
> > PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
> > overwritten soon after in enqueue_entity() the first time a forked
> > entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
> >
> > Placing in task_fork_fair() seems unnecessary since none of the values
> > that get set (slice, vruntime, deadline) are used before they're set
> > again at enqueue time, so get rid of that and just pass ENQUEUE_INITIAL
> > to enqueue_entity() via wake_up_new_task().
> >
> > Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> > Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
>
> I got a chance to test this on a 3rd Generation EPYC system. I don't
> see anything out of the ordinary except for a small regression on
> hackbench. I'll leave the full results below.
Thanks for testing!
> o System details
>
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - Boost enabled, C2 Disabled (POLL and MWAIT based C1 remained enabled)
>
>
> o Kernel Details
>
> - tip: tip:sched/core at commit d4d6596b4386 ("sched/headers: Remove
> duplicate header inclusions")
>
> - place-deadline-fix: tip + this patch
>
>
> o Benchmark Results
>
> ==================================================================
> Test : hackbench
> Units : Normalized time in seconds
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
> Case: tip[pct imp](CV) place-deadline-fix[pct imp](CV)
> 1-groups 1.00 [ -0.00]( 2.58) 1.04 [ -3.63]( 3.14)
> 2-groups 1.00 [ -0.00]( 1.87) 1.03 [ -2.98]( 1.85)
> 4-groups 1.00 [ -0.00]( 1.63) 1.02 [ -2.35]( 1.59)
> 8-groups 1.00 [ -0.00]( 1.38) 1.03 [ -2.92]( 1.20)
> 16-groups 1.00 [ -0.00]( 2.67) 1.02 [ -1.61]( 2.08)
Huh, the numbers do seem a bit outside the noise. Doesn't hackbench only
fork at the beginning? I glanced at the perf messaging source just now,
but I'm not sure whether that's the version you use. Anyway, I wouldn't
expect this patch to have much of an effect in that case.
An entity is supposed to get an earlier deadline with
PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
overwritten soon after in enqueue_entity() the first time a forked
entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
Placing in task_fork_fair() seems unnecessary since none of the values
that get set (slice, vruntime, deadline) are used before they're set
again at enqueue time, so get rid of that (and with it all of
task_fork_fair()) and just pass ENQUEUE_INITIAL to enqueue_entity() via
wake_up_new_task().
Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
v2
- place_entity() seems like the only reason for task_fork_fair() to exist
after the recent removal of sysctl_sched_child_runs_first, so take out
the whole function.
Still based on today's peterz/sched/eevdf
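For clarity, the flag now reaches the placement through the normal enqueue
path, roughly (call chain simplified):

    wake_up_new_task(p)
      activate_task(rq, p, ENQUEUE_INITIAL | ENQUEUE_NOCLOCK)
        enqueue_task(rq, p, flags)
          enqueue_task_fair(rq, p, flags)      /* p->sched_class->enqueue_task */
            enqueue_entity(cfs_rq, se, flags)
              place_entity(cfs_rq, se, flags)  /* ENQUEUE_INITIAL takes effect here */

so the initial placement happens once, at the enqueue that actually puts
the new task on a runqueue, rather than in task_fork_fair() where it was
immediately overwritten.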
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 24 ------------------------
2 files changed, 1 insertion(+), 25 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 779cdc7969c81..500e2dbfd41dd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4854,7 +4854,7 @@ void wake_up_new_task(struct task_struct *p)
update_rq_clock(rq);
post_init_entity_util_avg(p);
- activate_task(rq, p, ENQUEUE_NOCLOCK);
+ activate_task(rq, p, ENQUEUE_INITIAL | ENQUEUE_NOCLOCK);
trace_sched_wakeup_new(p);
wakeup_preempt(rq, p, WF_FORK);
#ifdef CONFIG_SMP
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a0b4dac2662c9..3827b302eeb9b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12427,29 +12427,6 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
task_tick_core(rq, curr);
}
-/*
- * called on fork with the child task as argument from the parent's context
- * - child not yet on the tasklist
- * - preemption disabled
- */
-static void task_fork_fair(struct task_struct *p)
-{
- struct sched_entity *se = &p->se, *curr;
- struct cfs_rq *cfs_rq;
- struct rq *rq = this_rq();
- struct rq_flags rf;
-
- rq_lock(rq, &rf);
- update_rq_clock(rq);
-
- cfs_rq = task_cfs_rq(current);
- curr = cfs_rq->curr;
- if (curr)
- update_curr(cfs_rq);
- place_entity(cfs_rq, se, ENQUEUE_INITIAL);
- rq_unlock(rq, &rf);
-}
-
/*
* Priority of the task has changed. Check to see if we preempt
* the current task.
@@ -12953,7 +12930,6 @@ DEFINE_SCHED_CLASS(fair) = {
#endif
.task_tick = task_tick_fair,
- .task_fork = task_fork_fair,
.prio_changed = prio_changed_fair,
.switched_from = switched_from_fair,
--
2.41.0
Hello Daniel,
As with v1, I do not see any regressions with this version either.
I'll leave the full results below.
o Machine details
- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)
o Kernel Details
- tip: tip:sched/core at commit 238437d88cea ("intel_idle: Add ibrs_off
module parameter to force-disable IBRS")
[For DeathStarBench comparisons alone, since I ran into the issue
which the commit below solves]
+ min_deadline fix commit 8dafa9d0eb1a ("sched/eevdf: Fix
min_deadline heap integrity") from tip:sched/urgent
- place-initial-fix: tip + this patch as is
o Benchmark Results
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) place-initial-fix[pct imp](CV)
1-groups 1.00 [ -0.00]( 2.11) 1.01 [ -1.08]( 2.60)
2-groups 1.00 [ -0.00]( 1.31) 1.01 [ -0.93]( 1.61)
4-groups 1.00 [ -0.00]( 1.04) 1.00 [ -0.00]( 1.25)
8-groups 1.00 [ -0.00]( 1.34) 0.99 [ 1.15]( 0.85)
16-groups 1.00 [ -0.00]( 2.45) 1.00 [ -0.27]( 2.32)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) place-initial-fix[pct imp](CV)
1 1.00 [ 0.00]( 0.46) 0.99 [ -0.59]( 0.88)
2 1.00 [ 0.00]( 0.64) 0.99 [ -1.43]( 0.69)
4 1.00 [ 0.00]( 0.59) 0.99 [ -1.49]( 0.76)
8 1.00 [ 0.00]( 0.34) 1.00 [ -0.35]( 0.20)
16 1.00 [ 0.00]( 0.72) 0.98 [ -1.96]( 1.97)
32 1.00 [ 0.00]( 0.65) 1.00 [ -0.24]( 1.07)
64 1.00 [ 0.00]( 0.59) 1.00 [ -0.14]( 1.18)
128 1.00 [ 0.00]( 1.19) 0.99 [ -1.04]( 0.93)
256 1.00 [ 0.00]( 0.16) 1.00 [ -0.18]( 0.34)
512 1.00 [ 0.00]( 0.20) 0.99 [ -0.62]( 0.02)
1024 1.00 [ 0.00]( 0.06) 1.00 [ -0.49]( 0.37)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) place-initial-fix[pct imp](CV)
Copy 1.00 [ 0.00]( 6.04) 1.00 [ -0.21]( 7.98)
Scale 1.00 [ 0.00]( 5.44) 0.99 [ -0.75]( 5.75)
Add 1.00 [ 0.00]( 5.44) 0.99 [ -1.48]( 5.40)
Triad 1.00 [ 0.00]( 7.82) 1.02 [ 2.21]( 8.33)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) place-initial-fix[pct imp](CV)
Copy 1.00 [ 0.00]( 1.14) 1.00 [ 0.40]( 1.12)
Scale 1.00 [ 0.00]( 4.60) 1.01 [ 1.05]( 4.99)
Add 1.00 [ 0.00]( 4.91) 1.00 [ -0.14]( 4.97)
Triad 1.00 [ 0.00]( 0.60) 0.96 [ -3.53]( 6.13)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) place-initial-fix[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.61) 1.00 [ 0.40]( 0.75)
2-clients 1.00 [ 0.00]( 0.44) 1.00 [ -0.47]( 0.91)
4-clients 1.00 [ 0.00]( 0.75) 1.00 [ -0.23]( 0.84)
8-clients 1.00 [ 0.00]( 0.65) 1.00 [ -0.07]( 0.62)
16-clients 1.00 [ 0.00]( 0.49) 1.00 [ -0.29]( 0.56)
32-clients 1.00 [ 0.00]( 0.57) 1.00 [ -0.14]( 0.46)
64-clients 1.00 [ 0.00]( 1.67) 1.00 [ -0.14]( 1.81)
128-clients 1.00 [ 0.00]( 1.11) 1.01 [ 0.64]( 1.04)
256-clients 1.00 [ 0.00]( 2.64) 0.99 [ -1.29]( 5.25)
512-clients 1.00 [ 0.00](52.49) 0.99 [ -0.57](53.01)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) place-initial-fix[pct imp](CV)
1 1.00 [ -0.00]( 8.41) 1.05 [ -5.41](13.45)
2 1.00 [ -0.00]( 5.29) 0.88 [ 12.50](13.21)
4 1.00 [ -0.00]( 1.32) 1.00 [ -0.00]( 4.80)
8 1.00 [ -0.00]( 9.52) 0.94 [ 6.25]( 8.85)
16 1.00 [ -0.00]( 1.61) 0.97 [ 3.23]( 5.00)
32 1.00 [ -0.00]( 7.27) 0.88 [ 12.50]( 2.30)
64 1.00 [ -0.00]( 6.96) 1.07 [ -6.94]( 4.94)
128 1.00 [ -0.00]( 3.41) 0.99 [ 1.44]( 2.69)
256 1.00 [ -0.00](32.95) 0.81 [ 19.17](16.38)
512 1.00 [ -0.00]( 3.20) 0.98 [ 1.66]( 2.35)
==================================================================
Test : ycsb-cassandra
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
metric tip place-initial-fix(%diff)
throughput 1.00 0.99 (%diff: -0.67%)
==================================================================
Test : ycsb-mongodb
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
metric tip place-initial-fix(%diff)
throughput 1.00 0.99 (%diff: -0.68%)
==================================================================
Test : DeathStarBench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
Note : Comparisons contains additional commit 8dafa9d0eb1a
("sched/eevdf: Fix min_deadline heap integrity") from
tip:sched/urgent to fix an EEVDF issue being hit
==================================================================
Pinning scaling tip place-initial-fix (%diff)
1CCD 1 1.00 1.00 (%diff: -0.09%)
2CCD 2 1.00 1.02 (%diff: 2.46%)
4CCD 4 1.00 1.00 (%diff: 0.45%)
8CCD 8 1.00 1.00 (%diff: -0.46%)
--
On 10/4/2023 6:39 PM, Daniel Jordan wrote:
> An entity is supposed to get an earlier deadline with
> PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
> overwritten soon after in enqueue_entity() the first time a forked
> entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
>
> Placing in task_fork_fair() seems unnecessary since none of the values
> that get set (slice, vruntime, deadline) are used before they're set
> again at enqueue time, so get rid of that (and with it all of
> task_fork_fair()) and just pass ENQUEUE_INITIAL to enqueue_entity() via
> wake_up_new_task().
>
> Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>
> v2
> - place_entity() seems like the only reason for task_fork_fair() to exist
> after the recent removal of sysctl_sched_child_runs_first, so take out
> the whole function.
>
> Still based on today's peterz/sched/eevdf
>
> kernel/sched/core.c | 2 +-
> kernel/sched/fair.c | 24 ------------------------
> 2 files changed, 1 insertion(+), 25 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 779cdc7969c81..500e2dbfd41dd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4854,7 +4854,7 @@ void wake_up_new_task(struct task_struct *p)
> update_rq_clock(rq);
> post_init_entity_util_avg(p);
>
> - activate_task(rq, p, ENQUEUE_NOCLOCK);
> + activate_task(rq, p, ENQUEUE_INITIAL | ENQUEUE_NOCLOCK);
> trace_sched_wakeup_new(p);
> wakeup_preempt(rq, p, WF_FORK);
> #ifdef CONFIG_SMP
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a0b4dac2662c9..3827b302eeb9b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12427,29 +12427,6 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> task_tick_core(rq, curr);
> }
>
> -/*
> - * called on fork with the child task as argument from the parent's context
> - * - child not yet on the tasklist
> - * - preemption disabled
> - */
> -static void task_fork_fair(struct task_struct *p)
> -{
> - struct sched_entity *se = &p->se, *curr;
> - struct cfs_rq *cfs_rq;
> - struct rq *rq = this_rq();
> - struct rq_flags rf;
> -
> - rq_lock(rq, &rf);
> - update_rq_clock(rq);
> -
> - cfs_rq = task_cfs_rq(current);
> - curr = cfs_rq->curr;
> - if (curr)
> - update_curr(cfs_rq);
> - place_entity(cfs_rq, se, ENQUEUE_INITIAL);
> - rq_unlock(rq, &rf);
> -}
> -
> /*
> * Priority of the task has changed. Check to see if we preempt
> * the current task.
> @@ -12953,7 +12930,6 @@ DEFINE_SCHED_CLASS(fair) = {
> #endif
>
> .task_tick = task_tick_fair,
> - .task_fork = task_fork_fair,
>
> .prio_changed = prio_changed_fair,
> .switched_from = switched_from_fair,
--
Thanks and Regards,
Prateek
Hi Daniel,
On 2023-10-04 at 09:09:08 -0400, Daniel Jordan wrote:
> An entity is supposed to get an earlier deadline with
> PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
> overwritten soon after in enqueue_entity() the first time a forked
> entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
>
> Placing in task_fork_fair() seems unnecessary since none of the values
> that get set (slice, vruntime, deadline) are used before they're set
> again at enqueue time, so get rid of that (and with it all of
> task_fork_fair()) and just pass ENQUEUE_INITIAL to enqueue_entity() via
> wake_up_new_task().
>
> Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> ---
>
> v2
> - place_entity() seems like the only reason for task_fork_fair() to exist
> after the recent removal of sysctl_sched_child_runs_first, so take out
> the whole function.
At first glance I thought: if we remove task_fork_fair(), do we lose one chance to
update the parent task's statistics in update_curr()? We might end up with an
out-of-date deadline for the parent task and make a preemption decision based on
stale data in wake_up_new_task() -> wakeup_preempt() -> pick_eevdf(). But on second
thought, I found that wake_up_new_task() -> enqueue_entity() itself invokes
update_curr(), so this should not be a problem.
Then I wondered why we can't just skip place_entity() in enqueue_entity()
if ENQUEUE_WAKEUP is not set, just like the code before e8f331bcc270. That way
the newly forked task's deadline would not be overwritten by wake_up_new_task() ->
enqueue_entity(). Then I realized that, after e8f331bcc270, the task's vruntime
and deadline are both calculated by place_entity() rather than being renormalised
to cfs_rq->min_vruntime in enqueue_entity(), so we cannot simply skip place_entity()
in enqueue_entity().
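To double check, enqueue_entity() after e8f331bcc270 looks roughly like
this (heavily trimmed, treat it as a sketch rather than exact source):

    static void
    enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
    {
            bool curr = cfs_rq->curr == se;

            /* current must be renormalised before update_curr() */
            if (curr)
                    place_entity(cfs_rq, se, flags);

            update_curr(cfs_rq);
            ...
            if (!curr)
                    place_entity(cfs_rq, se, flags);

            account_entity_enqueue(cfs_rq, se);
            ...
    }

so place_entity() is where vruntime and deadline get set up on every
enqueue now, and there is no separate renormalisation against
cfs_rq->min_vruntime left to fall back on.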
Per my understanding, this patch looks good,
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
thanks,
Chenyu
Hi Chenyu,
On Wed, Oct 04, 2023 at 11:46:21PM +0800, Chen Yu wrote:
> Hi Daniel,
>
> On 2023-10-04 at 09:09:08 -0400, Daniel Jordan wrote:
> > An entity is supposed to get an earlier deadline with
> > PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
> > overwritten soon after in enqueue_entity() the first time a forked
> > entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
> >
> > Placing in task_fork_fair() seems unnecessary since none of the values
> > that get set (slice, vruntime, deadline) are used before they're set
> > again at enqueue time, so get rid of that (and with it all of
> > task_fork_fair()) and just pass ENQUEUE_INITIAL to enqueue_entity() via
> > wake_up_new_task().
> >
> > Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> > Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> > ---
> >
> > v2
> > - place_entity() seems like the only reason for task_fork_fair() to exist
> > after the recent removal of sysctl_sched_child_runs_first, so take out
> > the whole function.
>
> At first glance I thought: if we remove task_fork_fair(), do we lose one chance to
> update the parent task's statistics in update_curr()? We might end up with an
> out-of-date deadline for the parent task and make a preemption decision based on
> stale data in wake_up_new_task() -> wakeup_preempt() -> pick_eevdf(). But on second
> thought, I found that wake_up_new_task() -> enqueue_entity() itself invokes
> update_curr(), so this should not be a problem.
>
> Then I wondered why we can't just skip place_entity() in enqueue_entity()
> if ENQUEUE_WAKEUP is not set, just like the code before e8f331bcc270. That way
> the newly forked task's deadline would not be overwritten by wake_up_new_task() ->
> enqueue_entity(). Then I realized that, after e8f331bcc270, the task's vruntime
> and deadline are both calculated by place_entity() rather than being renormalised
> to cfs_rq->min_vruntime in enqueue_entity(), so we cannot simply skip place_entity()
> in enqueue_entity().
This all made me wonder whether the order of update_curr() for the parent
and place_entity() for the child matters. And it does, since placing uses
avg_vruntime(), which wants an up-to-date vruntime for current and an
up-to-date min_vruntime for the cfs_rq. It's good that 'curr' in
enqueue_entity() is false on fork, so the parent's vruntime is up to date,
but it seems placing should always happen after update_curr().
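To spell that out, avg_vruntime() is roughly (trimmed, so a sketch rather
than verbatim source):

    u64 avg_vruntime(struct cfs_rq *cfs_rq)
    {
            struct sched_entity *curr = cfs_rq->curr;
            s64 avg = cfs_rq->avg_vruntime;
            long load = cfs_rq->avg_load;

            if (curr && curr->on_rq) {
                    unsigned long weight = scale_load_down(curr->load.weight);

                    /* fold current's key (vruntime - min_vruntime) into the average */
                    avg += entity_key(cfs_rq, curr) * weight;
                    load += weight;
            }

            if (load)
                    avg = div_s64(avg, load);

            return cfs_rq->min_vruntime + avg;
    }

Current's vruntime feeds straight into the average that the placement is
relative to, which is why the update_curr() ordering matters.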
> Per my understanding, this patch looks good,
>
> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Thanks!
Daniel