[v3] sched/fair: Manage lag and run to parity with different slices

[PATCH v3 0/6] sched/fair: Manage lag and run to parity with different slices
Posted by Vincent Guittot 7 months ago
This follows the attempt to better track maximum lag of tasks in presence
of different slices duration:
[1]  https://lore.kernel.org/all/20250418151225.3006867-1-vincent.guittot@linaro.org/

Since v2:
- Fixed some typos.
- Created a union of vlag and vprot to help understanding the usage
  of the field as suggested by Peter
- Use min_vruntime instead of min
- Removed resched_next_quantum() which is equals to !protect_slice()

Since v1, tracking of the max slice has been removed from the patchset
because we now ensure that the lag of an entity remains in the range of:
   
  [-(slice + tick) : (slice + tick)] with run_to_parity
and
  [max(-slice, -(0.7+tick) : max(slice , (0.7+tick)] without run to parity
  
As a result, there is no need the max slice of enqueued entities anymore.

Patch 1 is a simple cleanup to ease following changes.

Patch 2 fixes the lag for NO_RUN_TO_PARITY. It has been put 1st because of
its simplicity. The running task has a minimum protection of 0.7ms before
eevdf looks for another task.

Patch 3 ensures that the protection is canceled only if the waking task
will be selected by pick_task_fair. This case has been mentionned by Peter
will reviewing v1.

Patch 4 modifes the duration of the protection to take into account the
shortest slice of enqueued tasks instead of the slice of the running task.

Patch 5 fixes the case of tasks not being eligible at wakeup or after
migrating  but with a shorter slice. We need to update the duration of the
protection to not exceed the lag.

Patch 6 fixes the case of tasks still being eligible after the protected
period but others must run to no exceed lag limit. This has been
highlighted in a test with delayed entities being dequeued with a positive
lag larger than their slice but it can happen for delayed dequeue entity
too.

The patchset has been tested with rt-app on 37 different use cases, some a
simple and should never trigger any problem but have been kept to increase
the test coverage. The tests have been run on dragon rb5 with affinity on
biggest cores. The lag has been checked when we update the entity's lag at
dequeue and every time we check if an entity is eligible.

             RUN_TO_PARITY    NO_RUN_TO_PARITY
	     lag error        lag_error 
mainline       14/37            14/37
+ patch 1-2    14/37             0/37
+ patch 3-5     1/37             0/37
+ patch 6       0/37             0/37

Vincent Guittot (6):
  sched/fair: Use protect_slice() instead of direct comparison
  sched/fair: Fix NO_RUN_TO_PARITY case
  sched/fair: Remove spurious shorter slice preemption
  sched/fair: Limit run to parity to the min slice of enqueued entities
  sched/fair: Fix entity's lag with run to parity
  sched/fair: Always trigger resched at the end of a protected period

 include/linux/sched.h | 10 ++++-
 kernel/sched/fair.c   | 96 +++++++++++++++++++++----------------------
 2 files changed, 56 insertions(+), 50 deletions(-)

-- 
2.43.0