[PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not

Valentin Schneider posted 2 patches 4 years, 5 months ago
include/linux/sched.h             | 19 ++++++++++++++++---
include/trace/events/sched.h      | 11 +++++++----
kernel/sched/core.c               |  4 ++--
kernel/trace/fgraph.c             |  4 +++-
kernel/trace/ftrace.c             |  4 +++-
kernel/trace/trace_events.c       |  8 ++++++--
kernel/trace/trace_osnoise.c      |  4 +++-
kernel/trace/trace_sched_switch.c |  1 +
kernel/trace/trace_sched_wakeup.c |  1 +
9 files changed, 42 insertions(+), 14 deletions(-)
[PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not
Posted by Valentin Schneider 4 years, 5 months ago
Hi folks,

Problem
=======

Abhijeet pointed out that the following sequence of trace events can be
observed:

  cat-1676  [001] d..3 103.010411: sched_waking: comm=grep pid=1677 prio=120 target_cpu=004
  grep-1677 [004] d..2 103.010440: sched_switch: prev_comm=grep prev_pid=1677 prev_prio=120 prev_state=R 0x0 ==> next_comm=swapper/4 next_pid=0 next_prio=120
  <idle>-0  [004] dNh3 103.0100459: sched_wakeup: comm=grep pid=1677 prio=120 target_cpu=004

IOW, not-yet-woken task gets presented as runnable and switched out in
favor of the idle task... Dietmar and I had a look, turns out this sequence
can happen: 

		      p->__state = TASK_INTERRUPTIBLE;
		      __schedule()
			deactivate_task(p);
  ttwu()
    READ !p->on_rq
    p->__state=TASK_WAKING
			trace_sched_switch()
			  __trace_sched_switch_state()
			    task_state_index()
			      return 0;

TASK_WAKING isn't in TASK_REPORT, hence why task_state_index(p) returns 0.
This can happen as of commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
spinning on p->on_cpu") which punted the TASK_WAKING write to the other
side of

  smp_cond_load_acquire(&p->on_cpu, !VAL);

in ttwu().

Uwe reported on #linux-rt what I think is a similar issue with
TASK_RTLOCK_WAIT on PREEMPT_RT; again that state isn't in TASK_REPORT so
you get prev_state=0 despite the task blocking on a lock.

Both of those are very confusing for tooling or even human observers.

Patches
=======

For the TASK_WAKING case, pushing the prev_state read in __schedule() down
to __trace_sched_switch_state() seems sufficient. That's patch 1.

For TASK_RTLOCK_WAIT, it's a bit trickier. One could use ->saved_state as
prev_state, but
a) that is also susceptible to racing (see ttwu() / ttwu_state_match())
b) some lock substitutions (e.g. rt_spin_lock()) leave ->saved_state as
   TASK_RUNNING.

Patch 2 adds TASK_RTLOCK_WAIT to TASK_REPORT instead.

Testing
=======

Briefly tested on an Arm Juno via ftrace+hackbench against
o tip/sched/core@82762d2af31a
o v5.16-rt15-rebase w/ CONFIG_PREEMPT_RT


I also had a look at the __schedule() disassembly as suggested by Peter; on x86
prev_state gets pushed to the stack and popped prior to the trace event static
call, the rest seems mostly unchanged (GCC 9.3).

Revisions
=========

v2 -> v3
++++++++

o Dropped TASK_RTLOCK_WAIT from TASK_REPORT, made it appear as
  TASK_UNINTERRUPTIBLE instead (Eric)

RFC v1 -> v2
++++++++++++

o Collected tags (Steven, Sebastian)
o Patched missed trace_switch event clients (Steven)

Cheers,
Valentin

Valentin Schneider (2):
  sched/tracing: Don't re-read p->state when emitting sched_switch event
  sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE

 include/linux/sched.h             | 19 ++++++++++++++++---
 include/trace/events/sched.h      | 11 +++++++----
 kernel/sched/core.c               |  4 ++--
 kernel/trace/fgraph.c             |  4 +++-
 kernel/trace/ftrace.c             |  4 +++-
 kernel/trace/trace_events.c       |  8 ++++++--
 kernel/trace/trace_osnoise.c      |  4 +++-
 kernel/trace/trace_sched_switch.c |  1 +
 kernel/trace/trace_sched_wakeup.c |  1 +
 9 files changed, 42 insertions(+), 14 deletions(-)

--
2.25.1

Re: [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not
Posted by Steven Rostedt 4 years, 5 months ago
On Thu, 20 Jan 2022 16:25:18 +0000
Valentin Schneider <valentin.schneider@arm.com> wrote:

> Hi folks,
> 
> Problem
> =======
> 
> Abhijeet pointed out that the following sequence of trace events can be
> observed:
> 
>   cat-1676  [001] d..3 103.010411: sched_waking: comm=grep pid=1677 prio=120 target_cpu=004
>   grep-1677 [004] d..2 103.010440: sched_switch: prev_comm=grep prev_pid=1677 prev_prio=120 prev_state=R 0x0 ==> next_comm=swapper/4 next_pid=0 next_prio=120
>   <idle>-0  [004] dNh3 103.0100459: sched_wakeup: comm=grep pid=1677 prio=120 target_cpu=004
> 
> IOW, not-yet-woken task gets presented as runnable and switched out in
> favor of the idle task... Dietmar and I had a look, turns out this sequence
> can happen: 
> 
> 		      p->__state = TASK_INTERRUPTIBLE;
> 		      __schedule()
> 			deactivate_task(p);
>   ttwu()
>     READ !p->on_rq
>     p->__state=TASK_WAKING
> 			trace_sched_switch()
> 			  __trace_sched_switch_state()
> 			    task_state_index()
> 			      return 0;
> 
> TASK_WAKING isn't in TASK_REPORT, hence why task_state_index(p) returns 0.
> This can happen as of commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
> spinning on p->on_cpu") which punted the TASK_WAKING write to the other
> side of
> 
>   smp_cond_load_acquire(&p->on_cpu, !VAL);
> 
> in ttwu().
> 
> Uwe reported on #linux-rt what I think is a similar issue with
> TASK_RTLOCK_WAIT on PREEMPT_RT; again that state isn't in TASK_REPORT so
> you get prev_state=0 despite the task blocking on a lock.
> 
> Both of those are very confusing for tooling or even human observers.



This all looks fine to me:

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Peter, want to take this through your tree?

-- Steve
Re: [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not
Posted by Peter Zijlstra 4 years, 4 months ago
On Fri, Jan 21, 2022 at 12:15:58PM -0500, Steven Rostedt wrote:
> 
> This all looks fine to me:
> 
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> 
> Peter, want to take this through your tree?

Queued up now. Thanks!