[PATCH] hung_task: configurable hung-task stacktrace loglevel

Sergey Senozhatsky posted 1 patch 9 months, 2 weeks ago
include/linux/sched/debug.h |  1 +
kernel/hung_task.c          |  6 ++++--
kernel/sched/core.c         | 19 +++++++++++++------
lib/Kconfig.debug           |  8 ++++++++
4 files changed, 26 insertions(+), 8 deletions(-)
[PATCH] hung_task: configurable hung-task stacktrace loglevel
Posted by Sergey Senozhatsky 9 months, 2 weeks ago
Currently, hung-task watchdog uses two different loglevels
to report hung-tasks: a) KERN_INFO for all the important task
information (e.g. sched_show_task()) and b)  KERN_ERR for the
rest.  This makes it a little inconvenient, especially for
automated kernel logs parsing.

Introduce CONFIG_HUNG_TASK_STACKTRACE_LOGLEVEL so that (a)
becomes configurable.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 include/linux/sched/debug.h |  1 +
 kernel/hung_task.c          |  6 ++++--
 kernel/sched/core.c         | 19 +++++++++++++------
 lib/Kconfig.debug           |  8 ++++++++
 4 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched/debug.h b/include/linux/sched/debug.h
index 35ed4577a6cc..80c17cdd35a8 100644
--- a/include/linux/sched/debug.h
+++ b/include/linux/sched/debug.h
@@ -34,6 +34,7 @@ extern void show_stack(struct task_struct *task, unsigned long *sp,
 		       const char *loglvl);
 
 extern void sched_show_task(struct task_struct *p);
+extern void sched_show_task_log_lvl(struct task_struct *p, const char *lvl);
 
 struct seq_file;
 extern void proc_sched_show_task(struct task_struct *p,
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index d2432df2b905..8817016fd45b 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -26,6 +26,8 @@
 
 #include <trace/events/sched.h>
 
+#define PR_LEVEL KERN_SOH __stringify(CONFIG_HUNG_TASK_STACKTRACE_LOGLEVEL)
+
 /*
  * The number of tasks checked:
  */
@@ -153,7 +155,7 @@ static void debug_show_blocker(struct task_struct *task)
 			       task->comm, task->pid, t->comm, t->pid);
 			break;
 		}
-		sched_show_task(t);
+		sched_show_task_log_lvl(t, PR_LEVEL);
 		return;
 	}
 }
@@ -221,7 +223,7 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 			pr_err("      Blocked by coredump.\n");
 		pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
 			" disables this message.\n");
-		sched_show_task(t);
+		sched_show_task_log_lvl(t, PR_LEVEL);
 		debug_show_blocker(t);
 		hung_task_show_lock = true;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 54e7d63f7785..e9033b049092 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7751,7 +7751,7 @@ void __sched io_schedule(void)
 }
 EXPORT_SYMBOL(io_schedule);
 
-void sched_show_task(struct task_struct *p)
+void sched_show_task_log_lvl(struct task_struct *p, const char *lvl)
 {
 	unsigned long free;
 	int ppid;
@@ -7759,7 +7759,8 @@ void sched_show_task(struct task_struct *p)
 	if (!try_get_task_stack(p))
 		return;
 
-	pr_info("task:%-15.15s state:%c", p->comm, task_state_to_char(p));
+	printk("%stask:%-15.15s state:%c", lvl,
+	       p->comm, task_state_to_char(p));
 
 	if (task_is_running(p))
 		pr_cont("  running task    ");
@@ -7773,12 +7774,18 @@ void sched_show_task(struct task_struct *p)
 		free, task_pid_nr(p), task_tgid_nr(p),
 		ppid, p->flags, read_task_thread_flags(p));
 
-	print_worker_info(KERN_INFO, p);
-	print_stop_info(KERN_INFO, p);
-	print_scx_info(KERN_INFO, p);
-	show_stack(p, NULL, KERN_INFO);
+	print_worker_info(lvl, p);
+	print_stop_info(lvl, p);
+	print_scx_info(lvl, p);
+	show_stack(p, NULL, lvl);
 	put_task_stack(p);
 }
+EXPORT_SYMBOL_GPL(sched_show_task_log_lvl);
+
+void sched_show_task(struct task_struct *p)
+{
+	sched_show_task_log_lvl(p, KERN_INFO);
+}
 EXPORT_SYMBOL_GPL(sched_show_task);
 
 static inline bool
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 3e40f68a4a4f..6cd266651447 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1290,6 +1290,14 @@ config DETECT_HUNG_TASK_BLOCKER
 	  This will add overhead a bit but shows suspicious tasks and
 	  call trace if it comes from waiting a mutex.
 
+config HUNG_TASK_STACKTRACE_LOGLEVEL
+	int "Log-level for hung task stacktrace (1-7)"
+	depends on DETECT_HUNG_TASK
+	default "6"
+	help
+	  This option controls the log-level used to print the stacktrace
+	  of the hung task. The default is 6 (KERN_INFO).
+
 config WQ_WATCHDOG
 	bool "Detect Workqueue Stalls"
 	depends on DEBUG_KERNEL
-- 
2.49.0.805.g082f7c87e0-goog
Re: [PATCH] hung_task: configurable hung-task stacktrace loglevel
Posted by Petr Mladek 9 months, 2 weeks ago
On Thu 2025-04-24 16:02:43, Sergey Senozhatsky wrote:
> Currently, hung-task watchdog uses two different loglevels
> to report hung-tasks: a) KERN_INFO for all the important task
> information (e.g. sched_show_task()) and b)  KERN_ERR for the
> rest.

IMHO, the two different loglevels make sense. The KERN_ERR
message seems to inform about that a task gets blocked for too long.
And KERN_INFO is used for an extra debug information.

> This makes it a little inconvenient, especially for
> automated kernel logs parsing.

Anyway, what is the exact problem, please?
Are the KERN_INFO messages filtered because of console_loglevel?
Or is it a problem to match all the related lines?

> Introduce CONFIG_HUNG_TASK_STACKTRACE_LOGLEVEL so that (a)
> becomes configurable.

I am not sure if adding hung-task-specific config option is
the right solution. I guess that other watchdogs or other
similar reports have the same problem.

It seems that several other reports, for example,
watchdog_hardlockup_check(), or __die(), are using KERN_DEFAULT
which is configurable via CONFIG_MESSAGE_LOGLEVEL_DEFAULT.

A solution might be using KERN_DEFAULT for sched_show_task()
in hung_tasks detector as well.

Alternatively, if the problem is console_loglevel filtering then
it might make sense to create a config option which would force
using the same loglevel in all similar reports. I would call it:

   CONFIG_FULL_REPORT_USING_SAME_LOGLEVEL

And support it for other reports.

If the problem is matching all related lines. Then a solution
would be printing some help lines around the report, similar
to

    ------------[ cut here ]------------

in include/asm-generic/bug.h

Plus, it would be needed to filter out messages from other CPUs.
CONFIG_PRINTK_CALLER should help with this.

Best Regards,
Petr
Re: [PATCH] hung_task: configurable hung-task stacktrace loglevel
Posted by Tomasz Figa 9 months, 2 weeks ago
Hi Petr,

On Thu, Apr 24, 2025 at 7:59 PM Petr Mladek <pmladek@suse.com> wrote:
>
> On Thu 2025-04-24 16:02:43, Sergey Senozhatsky wrote:
> > Currently, hung-task watchdog uses two different loglevels
> > to report hung-tasks: a) KERN_INFO for all the important task
> > information (e.g. sched_show_task()) and b)  KERN_ERR for the
> > rest.
>
> IMHO, the two different loglevels make sense. The KERN_ERR
> message seems to inform about that a task gets blocked for too long.
> And KERN_INFO is used for an extra debug information.
>

I agree that two different levels make sense, but I think that
KERN_INFO is not necessarily the best one to use, because we have
quite a lot of usual/expected things logged with that level, but this
clearly is not an unusual/expected event that we're logging.

My preference would be on KERN_NOTICE.

> > This makes it a little inconvenient, especially for
> > automated kernel logs parsing.
>
> Anyway, what is the exact problem, please?
> Are the KERN_INFO messages filtered because of console_loglevel?
> Or is it a problem to match all the related lines?

The problem is that when we're looking at the hundreds of reports with
various problems from the production fleet, we want to be able to
filter out some of the usual/expected logs. The easiest way to do it
is by using the message log level. However, if we set the filters to
anything more severe than KERNEL_INFO, we lose the task dumps and we
need to go and fetch the entire unfiltered log, which is tedious.

(FWIW, we're also developing an automated analysis tool and it would
make the implementation much easier if we could simply use the log
level to filter out expected vs unexpected events from the logs - and
most of the time that already works, the case Sergey's patch is
addressing is just one of the small number of exceptions.)

>
> > Introduce CONFIG_HUNG_TASK_STACKTRACE_LOGLEVEL so that (a)
> > becomes configurable.
>
> I am not sure if adding hung-task-specific config option is
> the right solution. I guess that other watchdogs or other
> similar reports have the same problem.
>
> It seems that several other reports, for example,
> watchdog_hardlockup_check(), or __die(), are using KERN_DEFAULT
> which is configurable via CONFIG_MESSAGE_LOGLEVEL_DEFAULT.
>
> A solution might be using KERN_DEFAULT for sched_show_task()
> in hung_tasks detector as well.

I have to admit that I don't really know what else KERN_DEFAULT is
used for, but wouldn't that mean that again some typical messages
would end up being mixed in with messages for unexpected events?

>
> Alternatively, if the problem is console_loglevel filtering then
> it might make sense to create a config option which would force
> using the same loglevel in all similar reports. I would call it:
>
>    CONFIG_FULL_REPORT_USING_SAME_LOGLEVEL
>
> And support it for other reports.

I think that would work for us too, but I kind of also think that
having two different levels for the main part and then a higher (lower
severity) one for the other tasks makes sense and would be useful for
our analysis too.

>
> If the problem is matching all related lines. Then a solution
> would be printing some help lines around the report, similar
> to
>
>     ------------[ cut here ]------------
>
> in include/asm-generic/bug.h
>
> Plus, it would be needed to filter out messages from other CPUs.
> CONFIG_PRINTK_CALLER should help with this.

I'm not really in love with that idea - it would make things so much
more complicated, despite already having the right tool to
differentiate between the importance of various logs - after all the
log level is exactly that.

Best,
Tomasz
Re: [PATCH] hung_task: configurable hung-task stacktrace loglevel
Posted by Petr Mladek 9 months, 2 weeks ago
On Fri 2025-04-25 15:58:46, Tomasz Figa wrote:
> Hi Petr,
> 
> On Thu, Apr 24, 2025 at 7:59 PM Petr Mladek <pmladek@suse.com> wrote:
> >
> > On Thu 2025-04-24 16:02:43, Sergey Senozhatsky wrote:
> > > Currently, hung-task watchdog uses two different loglevels
> > > to report hung-tasks: a) KERN_INFO for all the important task
> > > information (e.g. sched_show_task()) and b)  KERN_ERR for the
> > > rest.
> >
> > IMHO, the two different loglevels make sense. The KERN_ERR
> > message seems to inform about that a task gets blocked for too long.
> > And KERN_INFO is used for an extra debug information.
> >
> 
> I agree that two different levels make sense, but I think that
> KERN_INFO is not necessarily the best one to use, because we have
> quite a lot of usual/expected things logged with that level, but this
> clearly is not an unusual/expected event that we're logging.
> 
> My preference would be on KERN_NOTICE.

Sigh, this is the problem with loglevels. Different people have
different feeling about them.

A solution would be to add an extra log level. But the full 0-7
(3 bit) range is already taken.

> > > This makes it a little inconvenient, especially for
> > > automated kernel logs parsing.
> >
> > Anyway, what is the exact problem, please?
> > Are the KERN_INFO messages filtered because of console_loglevel?
> > Or is it a problem to match all the related lines?
> 
> The problem is that when we're looking at the hundreds of reports with
> various problems from the production fleet, we want to be able to
> filter out some of the usual/expected logs. The easiest way to do it
> is by using the message log level. However, if we set the filters to
> anything more severe than KERNEL_INFO, we lose the task dumps and we
> need to go and fetch the entire unfiltered log, which is tedious.

Good to know.

This might be an argument for using the same log level for the entire
report. But it might create new problems. It would be more complicated
to filter-out known problems. I mean that a single known
warning/error/emergency message can be filtered easily. But
creating a filter for the entire to-be-ignored backtrace is more
complicated.


> (FWIW, we're also developing an automated analysis tool and it would
> make the implementation much easier if we could simply use the log
> level to filter out expected vs unexpected events from the logs - and
> most of the time that already works, the case Sergey's patch is
> addressing is just one of the small number of exceptions.)

It might be interesting to see the list of exceptions. Maybe, we
could find some common pattern...

It would be nice to handle all the reports of critical situations
similar way. It would help everyone. This is why I am not happy with
a hung-stask-detector-specific setting.

> > If the problem is matching all related lines. Then a solution
> > would be printing some help lines around the report, similar
> > to
> >
> >     ------------[ cut here ]------------
> >
> > in include/asm-generic/bug.h
> >
> > Plus, it would be needed to filter out messages from other CPUs.
> > CONFIG_PRINTK_CALLER should help with this.
> 
> I'm not really in love with that idea - it would make things so much
> more complicated, despite already having the right tool to
> differentiate between the importance of various logs - after all the
> log level is exactly that.

Honestly, the more I think about it the more I like the prefix/postfix
lines + the caller_id. I am afraid that manipulating log levels is a
lost fight  because different people might have different opinion
about how various messages are important.

Best Regards,
Petr
Re: [PATCH] hung_task: configurable hung-task stacktrace loglevel
Posted by Tomasz Figa 9 months, 2 weeks ago
On Sat, Apr 26, 2025 at 12:32 AM Petr Mladek <pmladek@suse.com> wrote:
>
> On Fri 2025-04-25 15:58:46, Tomasz Figa wrote:
> > Hi Petr,
> >
> > On Thu, Apr 24, 2025 at 7:59 PM Petr Mladek <pmladek@suse.com> wrote:
> > >
> > > On Thu 2025-04-24 16:02:43, Sergey Senozhatsky wrote:
> > > > Currently, hung-task watchdog uses two different loglevels
> > > > to report hung-tasks: a) KERN_INFO for all the important task
> > > > information (e.g. sched_show_task()) and b)  KERN_ERR for the
> > > > rest.
> > >
> > > IMHO, the two different loglevels make sense. The KERN_ERR
> > > message seems to inform about that a task gets blocked for too long.
> > > And KERN_INFO is used for an extra debug information.
> > >
> >
> > I agree that two different levels make sense, but I think that
> > KERN_INFO is not necessarily the best one to use, because we have
> > quite a lot of usual/expected things logged with that level, but this
> > clearly is not an unusual/expected event that we're logging.
> >
> > My preference would be on KERN_NOTICE.
>
> Sigh, this is the problem with loglevels. Different people have
> different feeling about them.
>
> A solution would be to add an extra log level. But the full 0-7
> (3 bit) range is already taken.
>
> > > > This makes it a little inconvenient, especially for
> > > > automated kernel logs parsing.
> > >
> > > Anyway, what is the exact problem, please?
> > > Are the KERN_INFO messages filtered because of console_loglevel?
> > > Or is it a problem to match all the related lines?
> >
> > The problem is that when we're looking at the hundreds of reports with
> > various problems from the production fleet, we want to be able to
> > filter out some of the usual/expected logs. The easiest way to do it
> > is by using the message log level. However, if we set the filters to
> > anything more severe than KERNEL_INFO, we lose the task dumps and we
> > need to go and fetch the entire unfiltered log, which is tedious.
>
> Good to know.
>
> This might be an argument for using the same log level for the entire
> report. But it might create new problems. It would be more complicated
> to filter-out known problems. I mean that a single known
> warning/error/emergency message can be filtered easily. But
> creating a filter for the entire to-be-ignored backtrace is more
> complicated.
>

Yeah, but since this filtering would be happening in whatever code
reads those logs, outside of the kernel, we first need to get the
candidate backtraces from the kernel, so having them a log level
appropriate for an unusual event would definitely help.

>
> > (FWIW, we're also developing an automated analysis tool and it would
> > make the implementation much easier if we could simply use the log
> > level to filter out expected vs unexpected events from the logs - and
> > most of the time that already works, the case Sergey's patch is
> > addressing is just one of the small number of exceptions.)
>
> It might be interesting to see the list of exceptions. Maybe, we
> could find some common pattern...
>
> It would be nice to handle all the reports of critical situations
> similar way. It would help everyone. This is why I am not happy with
> a hung-stask-detector-specific setting.
>

I was convinced that the soft/hard_lockup detector also does the same,
but I double checked and it seems like it [1] just calls dump_stack()
that uses KERN_DEFAULT and in our case that defaults to 4
(KERN_WARNING), which is more severe than 6 (KERN_INFO) that we want
to filter out, so I guess making the hung task watchdog behave the
same way would also work for us.

[1] https://elixir.bootlin.com/linux/v6.14.4/source/lib/nmi_backtrace.c#L94
[2] https://elixir.bootlin.com/linux/v6.14.4/source/lib/dump_stack.c#L127

Besides that, I don't actually have any other cases at hand. I guess
we'll find out once we look at more crash reports.

> > > If the problem is matching all related lines. Then a solution
> > > would be printing some help lines around the report, similar
> > > to
> > >
> > >     ------------[ cut here ]------------
> > >
> > > in include/asm-generic/bug.h
> > >
> > > Plus, it would be needed to filter out messages from other CPUs.
> > > CONFIG_PRINTK_CALLER should help with this.
> >
> > I'm not really in love with that idea - it would make things so much
> > more complicated, despite already having the right tool to
> > differentiate between the importance of various logs - after all the
> > log level is exactly that.
>
> Honestly, the more I think about it the more I like the prefix/postfix
> lines + the caller_id. I am afraid that manipulating log levels is a
> lost fight  because different people might have different opinion
> about how various messages are important.

The problem with the special lines is that it completely breaks any
line-based processing in a data pipeline. For a piece of
infrastructure that needs to deal with thousands of reports, on an
on-demand basis, that would mean quite a bit of sequential work done
instead of doing it in parallel and taking much more time to answer
users' queries.

That could be worked around, though, if we could prefix each line
separately with some special tag in addition to log level, timestamp
and caller, though. Borrowing from Sergey's earlier example:

<3>[  125.297687][  T140][E] INFO: task zsh:470 blocked for more than
61 seconds.
<3>[  125.302321][  T140][E]       Not tainted
6.15.0-rc3-next-20250424-00001-g258d8df78c77-dirty #154
<3>[  125.309333][  T140][E] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[  125.315040][  T140][E] task:zsh             state:D stack:0
pid:470   tgid:470   ppid:430    task_flags:0x400100 flags:0x00004002
<6>[  125.320594][  T140][E] Call Trace:
<6>[  125.322327][  T140][E]  <TASK>
<6>[  125.323852][  T140][E]  __schedule+0x13b4/0x2120
<6>[  125.325459][  T140][E]  ? schedule+0xdc/0x280
<6>[  125.327100][  T140][E]  schedule+0xdc/0x280
<6>[  125.328590][  T140][E]  schedule_preempt_disabled+0x10/0x20
<6>[  125.330589][  T140][E]  __mutex_lock+0x698/0x1200
<6>[  125.332291][  T140][E]  ? __mutex_lock+0x485/0x1200
<6>[  125.334074][  T140][E]  mutex_lock+0x81/0x90
<6>[  125.335113][  T140][E]  drop_caches_sysctl_handler+0x3e/0x140
<6>[  125.336665][  T140][E]  proc_sys_call_handler+0x327/0x4f0
<6>[  125.338069][  T140][E]  vfs_write+0x794/0xb60
<6>[  125.339216][  T140][E]  ? proc_sys_read+0x10/0x10
<6>[  125.340568][  T140][E]  ksys_write+0xb8/0x170
<6>[  125.341701][  T140][E]  do_syscall_64+0xd0/0x1a0
<6>[  125.343009][  T140][E]  ? arch_exit_to_user_mode_prepare+0x11/0x60
<6>[  125.344612][  T140][E]  ? irqentry_exit_to_user_mode+0x7e/0xa0
<6>[  125.346260][  T140][E]  entry_SYSCALL_64_after_hwframe+0x4b/0x53

where [E] would mean an "emergency" message, rather than something
usual, regardless of the loglevel.

Best,
Tomasz
Re: [PATCH] hung_task: configurable hung-task stacktrace loglevel
Posted by Petr Mladek 9 months, 1 week ago
On Wed 2025-04-30 17:42:51, Tomasz Figa wrote:
> On Sat, Apr 26, 2025 at 12:32 AM Petr Mladek <pmladek@suse.com> wrote:
> >
> > On Fri 2025-04-25 15:58:46, Tomasz Figa wrote:
> > > Hi Petr,
> > >
> > > On Thu, Apr 24, 2025 at 7:59 PM Petr Mladek <pmladek@suse.com> wrote:
> > > >
> > > > On Thu 2025-04-24 16:02:43, Sergey Senozhatsky wrote:
> > > > > Currently, hung-task watchdog uses two different loglevels
> > > > > to report hung-tasks: a) KERN_INFO for all the important task
> > > > > information (e.g. sched_show_task()) and b)  KERN_ERR for the
> > > > > rest.
> > > >
> > > > IMHO, the two different loglevels make sense. The KERN_ERR
> > > > message seems to inform about that a task gets blocked for too long.
> > > > And KERN_INFO is used for an extra debug information.
> > > >
> > >
> > > > If the problem is matching all related lines. Then a solution
> > > > would be printing some help lines around the report, similar
> > > > to
> > > >
> > > >     ------------[ cut here ]------------
> > > >
> > > > in include/asm-generic/bug.h
> > > >
> > > > Plus, it would be needed to filter out messages from other CPUs.
> > > > CONFIG_PRINTK_CALLER should help with this.
> > >
> > > I'm not really in love with that idea - it would make things so much
> > > more complicated, despite already having the right tool to
> > > differentiate between the importance of various logs - after all the
> > > log level is exactly that.
> >
> > Honestly, the more I think about it the more I like the prefix/postfix
> > lines + the caller_id. I am afraid that manipulating log levels is a
> > lost fight  because different people might have different opinion
> > about how various messages are important.
> 
> The problem with the special lines is that it completely breaks any
> line-based processing in a data pipeline. For a piece of
> infrastructure that needs to deal with thousands of reports, on an
> on-demand basis, that would mean quite a bit of sequential work done
> instead of doing it in parallel and taking much more time to answer
> users' queries.
> 
> That could be worked around, though, if we could prefix each line
> separately with some special tag in addition to log level, timestamp
> and caller, though. Borrowing from Sergey's earlier example:
> 
> <3>[  125.297687][  T140][E] INFO: task zsh:470 blocked for more than
> 61 seconds.
> <3>[  125.302321][  T140][E]       Not tainted
> 6.15.0-rc3-next-20250424-00001-g258d8df78c77-dirty #154
> <3>[  125.309333][  T140][E] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> <6>[  125.315040][  T140][E] task:zsh             state:D stack:0
> pid:470   tgid:470   ppid:430    task_flags:0x400100 flags:0x00004002
> <6>[  125.320594][  T140][E] Call Trace:
> <6>[  125.322327][  T140][E]  <TASK>
> <6>[  125.323852][  T140][E]  __schedule+0x13b4/0x2120
> <6>[  125.325459][  T140][E]  ? schedule+0xdc/0x280
> <6>[  125.327100][  T140][E]  schedule+0xdc/0x280
> <6>[  125.328590][  T140][E]  schedule_preempt_disabled+0x10/0x20
> <6>[  125.330589][  T140][E]  __mutex_lock+0x698/0x1200
> <6>[  125.332291][  T140][E]  ? __mutex_lock+0x485/0x1200
> <6>[  125.334074][  T140][E]  mutex_lock+0x81/0x90
> <6>[  125.335113][  T140][E]  drop_caches_sysctl_handler+0x3e/0x140
> <6>[  125.336665][  T140][E]  proc_sys_call_handler+0x327/0x4f0
> <6>[  125.338069][  T140][E]  vfs_write+0x794/0xb60
> <6>[  125.339216][  T140][E]  ? proc_sys_read+0x10/0x10
> <6>[  125.340568][  T140][E]  ksys_write+0xb8/0x170
> <6>[  125.341701][  T140][E]  do_syscall_64+0xd0/0x1a0
> <6>[  125.343009][  T140][E]  ? arch_exit_to_user_mode_prepare+0x11/0x60
> <6>[  125.344612][  T140][E]  ? irqentry_exit_to_user_mode+0x7e/0xa0
> <6>[  125.346260][  T140][E]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> 
> where [E] would mean an "emergency" message, rather than something
> usual, regardless of the loglevel.

This is an interesting idea. It has several advantages. It would:

  + still allow to filter out the extra details on too slow consoles [1]
  + work even when the "cut here" prefix/postfix lines get lost
  + obsolete the config option forcing the same loglevel in emergency
      section => safe space in struct task_struct. [2]

[1] Note that there is still floating a patchset which allows to define
     per-console loglevel, see
     https://lore.kernel.org/r/cover.1730133890.git.chris@chrisdown.name

[2] It might be eventually replaced by a config option which would show
    all emergency messages on consoles.

Best Regards,
Petr
Re: [PATCH] hung_task: configurable hung-task stacktrace loglevel
Posted by John Ogness 9 months, 1 week ago
On 2025-05-02, Petr Mladek <pmladek@suse.com> wrote:
>> The problem with the special lines is that it completely breaks any
>> line-based processing in a data pipeline. For a piece of
>> infrastructure that needs to deal with thousands of reports, on an
>> on-demand basis, that would mean quite a bit of sequential work done
>> instead of doing it in parallel and taking much more time to answer
>> users' queries.
>> 
>> That could be worked around, though, if we could prefix each line
>> separately with some special tag in addition to log level, timestamp
>> and caller, though. Borrowing from Sergey's earlier example:
>> 
>> <3>[  125.297687][  T140][E] INFO: task zsh:470 blocked for more than
>> 61 seconds.
>> <3>[  125.302321][  T140][E]       Not tainted
>> 6.15.0-rc3-next-20250424-00001-g258d8df78c77-dirty #154
>> <3>[  125.309333][  T140][E] "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> <6>[  125.315040][  T140][E] task:zsh             state:D stack:0
>> pid:470   tgid:470   ppid:430    task_flags:0x400100 flags:0x00004002
>> <6>[  125.320594][  T140][E] Call Trace:
>> <6>[  125.322327][  T140][E]  <TASK>
>> <6>[  125.323852][  T140][E]  __schedule+0x13b4/0x2120
>> <6>[  125.325459][  T140][E]  ? schedule+0xdc/0x280
>> <6>[  125.327100][  T140][E]  schedule+0xdc/0x280
>> <6>[  125.328590][  T140][E]  schedule_preempt_disabled+0x10/0x20
>> <6>[  125.330589][  T140][E]  __mutex_lock+0x698/0x1200
>> <6>[  125.332291][  T140][E]  ? __mutex_lock+0x485/0x1200
>> <6>[  125.334074][  T140][E]  mutex_lock+0x81/0x90
>> <6>[  125.335113][  T140][E]  drop_caches_sysctl_handler+0x3e/0x140
>> <6>[  125.336665][  T140][E]  proc_sys_call_handler+0x327/0x4f0
>> <6>[  125.338069][  T140][E]  vfs_write+0x794/0xb60
>> <6>[  125.339216][  T140][E]  ? proc_sys_read+0x10/0x10
>> <6>[  125.340568][  T140][E]  ksys_write+0xb8/0x170
>> <6>[  125.341701][  T140][E]  do_syscall_64+0xd0/0x1a0
>> <6>[  125.343009][  T140][E]  ? arch_exit_to_user_mode_prepare+0x11/0x60
>> <6>[  125.344612][  T140][E]  ? irqentry_exit_to_user_mode+0x7e/0xa0
>> <6>[  125.346260][  T140][E]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
>> 
>> where [E] would mean an "emergency" message, rather than something
>> usual, regardless of the loglevel.
>
> This is an interesting idea. It has several advantages. It would:
>
>   + still allow to filter out the extra details on too slow consoles [1]
>   + work even when the "cut here" prefix/postfix lines get lost
>   + obsolete the config option forcing the same loglevel in emergency
>       section => safe space in struct task_struct. [2]

So I guess this would introduce a new printk_info_flags emergency
flag. The information needs to be stored in the ringbuffer.

> [1] Note that there is still floating a patchset which allows to define
>      per-console loglevel, see
>      https://lore.kernel.org/r/cover.1730133890.git.chris@chrisdown.name
>
> [2] It might be eventually replaced by a config option which would show
>     all emergency messages on consoles.

Which, when enabled, would simply result in setting LOG_FORCE_CON
whenever the new emergency flag is set.

John
Re: [PATCH] hung_task: configurable hung-task stacktrace loglevel
Posted by John Ogness 9 months, 2 weeks ago
On 2025-04-25, Petr Mladek <pmladek@suse.com> wrote:
> I am afraid that manipulating log levels is a lost fight because
> different people might have different opinion about how various
> messages are important.

Wasn't that the whole point of Sergey's patch? To make it configurable?

I must admit that I am not happy with the patch. Mostly because it is
too specific. And I am not sure if we really want to try to make it all
dynamic with a report API either. At least we need to think about it
more carefully.

One thing that crossed my mind was that we have enter/exit markers for
emergency mode, which should be used whenever something "bad" happens. I
am wondering if a fixed loglevel could be configured for all messages
stored by a CPU in emergency mode. This might also encourage developers
to track down and mark more emergency sections. For the nbcon series, I
really only picked a few obvious ones, but I am sure there are more.

In other words, I would prefer to recycle the emergenceny enter/exit
markers rather than introduce new ones. (Unless we are also talking
about reports that are totally normal and acceptable during runtime.)

John
Re: [PATCH] hung_task: configurable hung-task stacktrace loglevel
Posted by Sergey Senozhatsky 9 months, 2 weeks ago
On (25/04/28 10:11), John Ogness wrote:
> On 2025-04-25, Petr Mladek <pmladek@suse.com> wrote:
> > I am afraid that manipulating log levels is a lost fight because
> > different people might have different opinion about how various
> > messages are important.
> 
> Wasn't that the whole point of Sergey's patch? To make it configurable?
> 
> I must admit that I am not happy with the patch.

Same here ;)

[..]
> In other words, I would prefer to recycle the emergenceny enter/exit
> markers rather than introduce new ones. (Unless we are also talking
> about reports that are totally normal and acceptable during runtime.)

Adding "--- report begin ---"/"--- report end ---" markers everywhere
probably will suffice.  Paired with printk-owner these should make
log parsing error level independent, I guess.
Re: [PATCH] hung_task: configurable hung-task stacktrace loglevel
Posted by Sergey Senozhatsky 9 months, 2 weeks ago
Hi Petr,

On (25/04/24 12:58), Petr Mladek wrote:
> On Thu 2025-04-24 16:02:43, Sergey Senozhatsky wrote:
> > Currently, hung-task watchdog uses two different loglevels
> > to report hung-tasks: a) KERN_INFO for all the important task
> > information (e.g. sched_show_task()) and b)  KERN_ERR for the
> > rest.
> 
> IMHO, the two different loglevels make sense. The KERN_ERR
> message seems to inform about that a task gets blocked for too long.
> And KERN_INFO is used for an extra debug information.
> 
> > This makes it a little inconvenient, especially for
> > automated kernel logs parsing.
> 
> Anyway, what is the exact problem, please?
> Are the KERN_INFO messages filtered because of console_loglevel?
> Or is it a problem to match all the related lines?

The latter one.  A made up example, just to demonstrate what we are
getting now:

<3>[  125.297687][  T140] INFO: task zsh:470 blocked for more than 61 seconds.
<3>[  125.302321][  T140]       Not tainted 6.15.0-rc3-next-20250424-00001-g258d8df78c77-dirty #154
<3>[  125.309333][  T140] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[  125.315040][  T140] task:zsh             state:D stack:0     pid:470   tgid:470   ppid:430    task_flags:0x400100 flags:0x00004002
<6>[  125.320594][  T140] Call Trace:
<6>[  125.322327][  T140]  <TASK>
<6>[  125.323852][  T140]  __schedule+0x13b4/0x2120
<6>[  125.325459][  T140]  ? schedule+0xdc/0x280
<6>[  125.327100][  T140]  schedule+0xdc/0x280
<6>[  125.328590][  T140]  schedule_preempt_disabled+0x10/0x20
<6>[  125.330589][  T140]  __mutex_lock+0x698/0x1200
<6>[  125.332291][  T140]  ? __mutex_lock+0x485/0x1200
<6>[  125.334074][  T140]  mutex_lock+0x81/0x90
<6>[  125.335113][  T140]  drop_caches_sysctl_handler+0x3e/0x140
<6>[  125.336665][  T140]  proc_sys_call_handler+0x327/0x4f0
<6>[  125.338069][  T140]  vfs_write+0x794/0xb60
<6>[  125.339216][  T140]  ? proc_sys_read+0x10/0x10
<6>[  125.340568][  T140]  ksys_write+0xb8/0x170
<6>[  125.341701][  T140]  do_syscall_64+0xd0/0x1a0
<6>[  125.343009][  T140]  ? arch_exit_to_user_mode_prepare+0x11/0x60
<6>[  125.344612][  T140]  ? irqentry_exit_to_user_mode+0x7e/0xa0
<6>[  125.346260][  T140]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
<6>[  125.347772][  T140] RIP: 0033:0x7fa4bd8be687
<6>[  125.348958][  T140] RSP: 002b:00007ffecf417820 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
<6>[  125.351161][  T140] RAX: ffffffffffffffda RBX: 00007fa4bd82e300 RCX: 00007fa4bd8be687
<6>[  125.353221][  T140] RDX: 0000000000000002 RSI: 00005621f5c65860 RDI: 0000000000000001
<6>[  125.355338][  T140] RBP: 00005621f5c65860 R08: 0000000000000000 R09: 0000000000000000
<6>[  125.357424][  T140] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
<6>[  125.359677][  T140] R13: 00007fa4bda175c0 R14: 00007fa4bda14e80 R15: 00007fa4bdb59f70
<6>[  125.361551][  T140]  </TASK>
<3>[  125.362363][  T140] INFO: task zsh:470 is blocked on a mutex likely owned by task zsh:470.
<6>[  125.364467][  T140] task:zsh             state:D stack:0     pid:470   tgid:470   ppid:430    task_flags:0x400100 flags:0x00004002
<6>[  125.367493][  T140] Call Trace:
<6>[  125.368359][  T140]  <TASK>
<6>[  125.369180][  T140]  __schedule+0x13b4/0x2120
<6>[  125.370364][  T140]  ? schedule+0xdc/0x280
<6>[  125.371486][  T140]  schedule+0xdc/0x280
<6>[  125.372518][  T140]  schedule_preempt_disabled+0x10/0x20
<6>[  125.374049][  T140]  __mutex_lock+0x698/0x1200
<6>[  125.375326][  T140]  ? __mutex_lock+0x485/0x1200
<6>[  125.376572][  T140]  mutex_lock+0x81/0x90
<6>[  125.377773][  T140]  drop_caches_sysctl_handler+0x3e/0x140
<6>[  125.379391][  T140]  proc_sys_call_handler+0x327/0x4f0
<6>[  125.380715][  T140]  vfs_write+0x794/0xb60
<6>[  125.381951][  T140]  ? proc_sys_read+0x10/0x10
<6>[  125.383083][  T140]  ksys_write+0xb8/0x170
<6>[  125.384329][  T140]  do_syscall_64+0xd0/0x1a0
<6>[  125.385461][  T140]  ? arch_exit_to_user_mode_prepare+0x11/0x60
<6>[  125.387110][  T140]  ? irqentry_exit_to_user_mode+0x7e/0xa0
<6>[  125.388566][  T140]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
<6>[  125.390194][  T140] RIP: 0033:0x7fa4bd8be687
<6>[  125.391268][  T140] RSP: 002b:00007ffecf417820 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
<6>[  125.393367][  T140] RAX: ffffffffffffffda RBX: 00007fa4bd82e300 RCX: 00007fa4bd8be687
<6>[  125.395506][  T140] RDX: 0000000000000002 RSI: 00005621f5c65860 RDI: 0000000000000001
<6>[  125.397494][  T140] RBP: 00005621f5c65860 R08: 0000000000000000 R09: 0000000000000000
<6>[  125.399701][  T140] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
<6>[  125.401663][  T140] R13: 00007fa4bda175c0 R14: 00007fa4bda14e80 R15: 00007fa4bdb59f70
<6>[  125.403579][  T140]  </TASK>

So there are <3> and then <6> lines and we can't easily tell when to
stop parsing and what to consider part of the automated error report
and what not to, there also can be some other <6> lines in between.

> > Introduce CONFIG_HUNG_TASK_STACKTRACE_LOGLEVEL so that (a)
> > becomes configurable.
> 
> I am not sure if adding hung-task-specific config option is
> the right solution. I guess that other watchdogs or other
> similar reports have the same problem.
>
> It seems that several other reports, for example,
> watchdog_hardlockup_check(), or __die(), are using KERN_DEFAULT
> which is configurable via CONFIG_MESSAGE_LOGLEVEL_DEFAULT.
> 
> A solution might be using KERN_DEFAULT for sched_show_task()
> in hung_tasks detector as well.

So my worry with CONFIG_MESSAGE_LOGLEVEL_DEFAULT was that, if we
set it, say, to 3 (KERN_ERR), then how many "false positives" we
will get?  There are many printk("") calls in the kernel that
default to MESSAGE_LOGLEVEL_DEFAULT, as far as I understand it:
	git grep "printk(\"" | grep -v TP_printk | wc -l
	9001

But maybe that is the solution.  Do you want to switch sched_show_task()
to KERN_DEFAULT for all or would it be better to still introduce
sched_show_task_log_lvl() can call it with KERN_DEFAULT only from
kernel/hung_task.c?
Re: [PATCH] hung_task: configurable hung-task stacktrace loglevel
Posted by Petr Mladek 9 months, 2 weeks ago
On Fri 2025-04-25 13:49:13, Sergey Senozhatsky wrote:
> Hi Petr,
> 
> On (25/04/24 12:58), Petr Mladek wrote:
> > On Thu 2025-04-24 16:02:43, Sergey Senozhatsky wrote:
> > > Currently, hung-task watchdog uses two different loglevels
> > > to report hung-tasks: a) KERN_INFO for all the important task
> > > information (e.g. sched_show_task()) and b)  KERN_ERR for the
> > > rest.
> > 
> > IMHO, the two different loglevels make sense. The KERN_ERR
> > message seems to inform about that a task gets blocked for too long.
> > And KERN_INFO is used for an extra debug information.
> > 
> > > This makes it a little inconvenient, especially for
> > > automated kernel logs parsing.
> > 
> > Anyway, what is the exact problem, please?
> > Are the KERN_INFO messages filtered because of console_loglevel?
> > Or is it a problem to match all the related lines?
> 
> The latter one.  A made up example, just to demonstrate what we are
> getting now:
> 

Note: I do not have strong opinion. I am rather thinking loudly.

When I look at it. A prefix line:

<6>[  125.297687][  T140] ------------[ cut here ]------------

> <3>[  125.297687][  T140] INFO: task zsh:470 blocked for more than 61 seconds.
> <3>[  125.302321][  T140]       Not tainted 6.15.0-rc3-next-20250424-00001-g258d8df78c77-dirty #154
> <3>[  125.309333][  T140] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> <6>[  125.315040][  T140] task:zsh             state:D stack:0     pid:470   tgid:470   ppid:430    task_flags:0x400100 flags:0x00004002
> <6>[  125.320594][  T140] Call Trace:
> <6>[  125.322327][  T140]  <TASK>
> <6>[  125.323852][  T140]  __schedule+0x13b4/0x2120
> <6>[  125.325459][  T140]  ? schedule+0xdc/0x280
> <6>[  125.327100][  T140]  schedule+0xdc/0x280
> <6>[  125.328590][  T140]  schedule_preempt_disabled+0x10/0x20
> <6>[  125.330589][  T140]  __mutex_lock+0x698/0x1200
> <6>[  125.332291][  T140]  ? __mutex_lock+0x485/0x1200
> <6>[  125.334074][  T140]  mutex_lock+0x81/0x90
> <6>[  125.335113][  T140]  drop_caches_sysctl_handler+0x3e/0x140
> <6>[  125.336665][  T140]  proc_sys_call_handler+0x327/0x4f0
> <6>[  125.338069][  T140]  vfs_write+0x794/0xb60
> <6>[  125.339216][  T140]  ? proc_sys_read+0x10/0x10
> <6>[  125.340568][  T140]  ksys_write+0xb8/0x170
> <6>[  125.341701][  T140]  do_syscall_64+0xd0/0x1a0
> <6>[  125.343009][  T140]  ? arch_exit_to_user_mode_prepare+0x11/0x60
> <6>[  125.344612][  T140]  ? irqentry_exit_to_user_mode+0x7e/0xa0
> <6>[  125.346260][  T140]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> <6>[  125.347772][  T140] RIP: 0033:0x7fa4bd8be687
> <6>[  125.348958][  T140] RSP: 002b:00007ffecf417820 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
> <6>[  125.351161][  T140] RAX: ffffffffffffffda RBX: 00007fa4bd82e300 RCX: 00007fa4bd8be687
> <6>[  125.353221][  T140] RDX: 0000000000000002 RSI: 00005621f5c65860 RDI: 0000000000000001
> <6>[  125.355338][  T140] RBP: 00005621f5c65860 R08: 0000000000000000 R09: 0000000000000000
> <6>[  125.357424][  T140] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
> <6>[  125.359677][  T140] R13: 00007fa4bda175c0 R14: 00007fa4bda14e80 R15: 00007fa4bdb59f70
> <6>[  125.361551][  T140]  </TASK>
> <3>[  125.362363][  T140] INFO: task zsh:470 is blocked on a mutex likely owned by task zsh:470.
> <6>[  125.364467][  T140] task:zsh             state:D stack:0     pid:470   tgid:470   ppid:430    task_flags:0x400100 flags:0x00004002
> <6>[  125.367493][  T140] Call Trace:
> <6>[  125.368359][  T140]  <TASK>
> <6>[  125.369180][  T140]  __schedule+0x13b4/0x2120
> <6>[  125.370364][  T140]  ? schedule+0xdc/0x280
> <6>[  125.371486][  T140]  schedule+0xdc/0x280
> <6>[  125.372518][  T140]  schedule_preempt_disabled+0x10/0x20
> <6>[  125.374049][  T140]  __mutex_lock+0x698/0x1200
> <6>[  125.375326][  T140]  ? __mutex_lock+0x485/0x1200
> <6>[  125.376572][  T140]  mutex_lock+0x81/0x90
> <6>[  125.377773][  T140]  drop_caches_sysctl_handler+0x3e/0x140
> <6>[  125.379391][  T140]  proc_sys_call_handler+0x327/0x4f0
> <6>[  125.380715][  T140]  vfs_write+0x794/0xb60
> <6>[  125.381951][  T140]  ? proc_sys_read+0x10/0x10
> <6>[  125.383083][  T140]  ksys_write+0xb8/0x170
> <6>[  125.384329][  T140]  do_syscall_64+0xd0/0x1a0
> <6>[  125.385461][  T140]  ? arch_exit_to_user_mode_prepare+0x11/0x60
> <6>[  125.387110][  T140]  ? irqentry_exit_to_user_mode+0x7e/0xa0
> <6>[  125.388566][  T140]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> <6>[  125.390194][  T140] RIP: 0033:0x7fa4bd8be687
> <6>[  125.391268][  T140] RSP: 002b:00007ffecf417820 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
> <6>[  125.393367][  T140] RAX: ffffffffffffffda RBX: 00007fa4bd82e300 RCX: 00007fa4bd8be687
> <6>[  125.395506][  T140] RDX: 0000000000000002 RSI: 00005621f5c65860 RDI: 0000000000000001
> <6>[  125.397494][  T140] RBP: 00005621f5c65860 R08: 0000000000000000 R09: 0000000000000000
> <6>[  125.399701][  T140] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
> <6>[  125.401663][  T140] R13: 00007fa4bda175c0 R14: 00007fa4bda14e80 R15: 00007fa4bdb59f70
> <6>[  125.403579][  T140]  </TASK>

and a suffix line

<6>[  125.403579][  T140] ------------[ cut here ]------------

might be helpful for both automated monitors and humans => win win solution.

> So there are <3> and then <6> lines and we can't easily tell when to
> stop parsing and what to consider part of the automated error report
> and what not to, there also can be some other <6> lines in between.

All the related information should be printed from the same context.
It is "[  T140]" in this case. The only exception would be backtraces
from other CPUs.

> > > Introduce CONFIG_HUNG_TASK_STACKTRACE_LOGLEVEL so that (a)
> > > becomes configurable.
> > 
> > I am not sure if adding hung-task-specific config option is
> > the right solution. I guess that other watchdogs or other
> > similar reports have the same problem.
> >
> > It seems that several other reports, for example,
> > watchdog_hardlockup_check(), or __die(), are using KERN_DEFAULT
> > which is configurable via CONFIG_MESSAGE_LOGLEVEL_DEFAULT.
> > 
> > A solution might be using KERN_DEFAULT for sched_show_task()
> > in hung_tasks detector as well.
> 
> So my worry with CONFIG_MESSAGE_LOGLEVEL_DEFAULT was that, if we
> set it, say, to 3 (KERN_ERR), then how many "false positives" we
> will get?  There are many printk("") calls in the kernel that
> default to MESSAGE_LOGLEVEL_DEFAULT, as far as I understand it:
> 	git grep "printk(\"" | grep -v TP_printk | wc -l
> 	9001

Just for record, this still counts also many other printk() wrappers,
e.g. bpf_printk(), srm_printk(), dprintk().

It should be more precise with -w option:

	$> git grep -w "printk(\"" | wc -l
	2830

That said, I agree that it might still mean a lot of false
positives.

> But maybe that is the solution.  Do you want to switch sched_show_task()
> to KERN_DEFAULT for all or would it be better to still introduce
> sched_show_task_log_lvl() can call it with KERN_DEFAULT only from
> kernel/hung_task.c?

If you want to go this way then I would introduce sched_show_task_log_lvl().
It would allow to fix the various reports incrementally.

For example, I see sched_show_task() used in show_cpu_pool_hog() in
kernel/workqueue.c and some other messages are KERN_INFO. So, using
another log level in sched_show_task() would cause mess here.

That said, you probably want to fix workqueue stall report
as well. I believe that you have the workqueue watchdog enabled
and it has the same problem as the hung_task detector.


My opinion:

I think that using the prefix/postfix "cut here" line is a win-win
solution. And I would personally go this way.

But I understand this might have other problems, like the need
of the full log (the other reply from Thomasz Figa). You really
might want to use the same loglevel for the entire report.

If you want the same loglevel then I really suggest to introduce
a generic CONFIG option, e.g. CONFIG_CONSISTENT_REPORT_LOGLEVEL
and use is everywhere to force the consistent loglevel,
for example, hung taskdetector, workqueue watchdog, soft lockup
detector, Oops, ...


Idea:

Some people complained about a non-consistent loglevel when pr_cont()
lines were not connected because another context (CPU, task,
interrupt) added a message in between.

I had an idea to store the last used loglevel in struct task_struct.
But it did not look worth it just for continues lines.

This might be another use-case. So we could add something like:

struct printk_context {
	u8	deferred;	/* printk_safe/deferred nesting	counter. */
	u8      emergency;	/* nbcon_emergency nesting counter */
	u8	level:3;	/* last used or forced log level */
	u8	flags:5;
}

and one of the flags might be

enum printk_context_flag {
	pr_ctxt_force_level = 0x0,	/* Force the same loglevel for	all messages */
};

and then we could create an API:

void printk_force_loglevel_enter(const char *lvl)
{
	struct printk_context *pr_ctxt;

	pr_ctxt = printk_get_context();
	/* FIXME: make sure the it returns a valid level 0..7 */
	pr_ctxt.level = printk_get_level(lvl);
	pt_ctxt.flags |= pr_ctct_force_level;
}

void printk_force_loglevel_exit(void)
{
	struct printk_context *pr_ctxt;

	pr_ctxt = printk_get_context();
	pt_ctxt.flags &= ~pr_ctct_force_level;
}

, where printk_get_context() would return the right struct
printk_context either from struct task_struct for task context or
a static per-CPU variable for IRQ or NMI context.

FIXME: We should integrate the above with the existing:

	  + printk_context used by printk_safe_enter()/exit()
	  + nbcon_pcpu_emergency_nesting
	  + ??? (Did I miss any other printk context counter/flag?)


Also we could even create an API for consistent reports:

static atomic_t printk_report_count = ATOMIC_INIT(0);
static bool printk_force_report_loglevel = IS_ENABLED(CONFIG_PRINTK_FORCE_REPORT_LOGLEVEL);

int printk_report_start(const char *lvl)
{
	int report_id = atomic_inc_return(&printk_report_count);

	if (printk_force_report_loglevel)
		printk_force_loglevel_enter(lvl);

	printk("%s---------- Report No. %d start --------\n", lvl, report_id);

	return report_id;
}

void printk_report_end(const char *lvl, int report_id)
{
	printk("%s---------- Report No. %d end --------\n", lvl, report_id);

	if (printk_force_report_loglevel)
		printk_force_loglevel_exit();
}

This API would help to standardize the reports from various watchdogs:

  + The "start/end" lines would help both humans and bots.

  + The optional loglevel forcing would help to filter the messages depending
    on whether people want to see all details using the same loglevel
    or not.


My problem with the log level manipulation is that I do not know what
loglevel right. The loglevel might be important with slow consoles which
need not be suitable for all the details.

Best Regards,
Petr
Re: [PATCH] hung_task: configurable hung-task stacktrace loglevel
Posted by Sergey Senozhatsky 9 months, 2 weeks ago
On (25/04/25 16:55), Petr Mladek wrote:
> > > Anyway, what is the exact problem, please?
> > > Are the KERN_INFO messages filtered because of console_loglevel?
> > > Or is it a problem to match all the related lines?
> > 
> > The latter one.  A made up example, just to demonstrate what we are
> > getting now:
> > 
> 
> Note: I do not have strong opinion. I am rather thinking loudly.

Sure.

> When I look at it. A prefix line:
> 
> <6>[  125.297687][  T140] ------------[ cut here ]------------
> 
> > <3>[  125.297687][  T140] INFO: task zsh:470 blocked for more than 61 seconds.
> > <3>[  125.302321][  T140]       Not tainted 6.15.0-rc3-next-20250424-00001-g258d8df78c77-dirty #154
> > <3>[  125.309333][  T140] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > <6>[  125.315040][  T140] task:zsh             state:D stack:0     pid:470   tgid:470   ppid:430    task_flags:0x400100 flags:0x00004002
> > <6>[  125.320594][  T140] Call Trace:
> > <6>[  125.322327][  T140]  <TASK>
> > <6>[  125.323852][  T140]  __schedule+0x13b4/0x2120
> > <6>[  125.325459][  T140]  ? schedule+0xdc/0x280
> > <6>[  125.327100][  T140]  schedule+0xdc/0x280
> > <6>[  125.328590][  T140]  schedule_preempt_disabled+0x10/0x20
> > <6>[  125.330589][  T140]  __mutex_lock+0x698/0x1200
> > <6>[  125.332291][  T140]  ? __mutex_lock+0x485/0x1200
> > <6>[  125.334074][  T140]  mutex_lock+0x81/0x90
[..]
> > <3>[  125.362363][  T140] INFO: task zsh:470 is blocked on a mutex likely owned by task zsh:470.
> > <6>[  125.364467][  T140] task:zsh             state:D stack:0     pid:470   tgid:470   ppid:430    task_flags:0x400100 flags:0x00004002
> > <6>[  125.367493][  T140] Call Trace:
> > <6>[  125.368359][  T140]  <TASK>
> > <6>[  125.369180][  T140]  __schedule+0x13b4/0x2120
> > <6>[  125.370364][  T140]  ? schedule+0xdc/0x280
> > <6>[  125.371486][  T140]  schedule+0xdc/0x280
> > <6>[  125.372518][  T140]  schedule_preempt_disabled+0x10/0x20
> > <6>[  125.374049][  T140]  __mutex_lock+0x698/0x1200
> > <6>[  125.375326][  T140]  ? __mutex_lock+0x485/0x1200
> > <6>[  125.376572][  T140]  mutex_lock+0x81/0x90
[..]

> and a suffix line
> 
> <6>[  125.403579][  T140] ------------[ cut here ]------------
> 
> might be helpful for both automated monitors and humans => win win solution.

suffix/prefix lines should improve things, I think.  We don't enable
printk-owner in the fleet, we probably should, but we don't for whatever
reason.

[..]
> > So my worry with CONFIG_MESSAGE_LOGLEVEL_DEFAULT was that, if we
> > set it, say, to 3 (KERN_ERR), then how many "false positives" we
> > will get?  There are many printk("") calls in the kernel that
> > default to MESSAGE_LOGLEVEL_DEFAULT, as far as I understand it:
> > 	git grep "printk(\"" | grep -v TP_printk | wc -l
> > 	9001
> 
> Just for record, this still counts also many other printk() wrappers,
> e.g. bpf_printk(), srm_printk(), dprintk().
> 
> It should be more precise with -w option:
> 
> 	$> git grep -w "printk(\"" | wc -l
> 	2830

Good point.

> That said, I agree that it might still mean a lot of false
> positives.

Agreed.

> > But maybe that is the solution.  Do you want to switch sched_show_task()
> > to KERN_DEFAULT for all or would it be better to still introduce
> > sched_show_task_log_lvl() can call it with KERN_DEFAULT only from
> > kernel/hung_task.c?
> 
> If you want to go this way then I would introduce sched_show_task_log_lvl().
> It would allow to fix the various reports incrementally.
> 
> For example, I see sched_show_task() used in show_cpu_pool_hog() in
> kernel/workqueue.c and some other messages are KERN_INFO. So, using
> another log level in sched_show_task() would cause mess here.

Good point.

[..]
> If you want the same loglevel then I really suggest to introduce
> a generic CONFIG option, e.g. CONFIG_CONSISTENT_REPORT_LOGLEVEL
> and use is everywhere to force the consistent loglevel,
> for example, hung taskdetector, workqueue watchdog, soft lockup
> detector, Oops, ...

Yeah, that's a logical next step, if we go this way.

[..]
> int printk_report_start(const char *lvl)
> {
> 	int report_id = atomic_inc_return(&printk_report_count);
> 
> 	if (printk_force_report_loglevel)
> 		printk_force_loglevel_enter(lvl);
> 
> 	printk("%s---------- Report No. %d start --------\n", lvl, report_id);
> 
> 	return report_id;
> }
> 
> void printk_report_end(const char *lvl, int report_id)
> {
> 	printk("%s---------- Report No. %d end --------\n", lvl, report_id);
> 
> 	if (printk_force_report_loglevel)
> 		printk_force_loglevel_exit();
> }

Looks good to me, however

> My problem with the log level manipulation is that I do not know what
> loglevel right.

Excellent point.  Sort of feels like we are back to square one.
If we just KERN_INFO everything, then someone might come up with
a patch to make it KERN_ERR, or ALERT (because they panic the system
right after that print-out and it cannot be INFO.)
Re: [PATCH] hung_task: configurable hung-task stacktrace loglevel
Posted by Sergey Senozhatsky 9 months, 2 weeks ago
On (25/04/30 10:34), Sergey Senozhatsky wrote:
[..]
> > int printk_report_start(const char *lvl)
> > {
> > 	int report_id = atomic_inc_return(&printk_report_count);
> > 
> > 	if (printk_force_report_loglevel)
> > 		printk_force_loglevel_enter(lvl);
> > 
> > 	printk("%s---------- Report No. %d start --------\n", lvl, report_id);
> > 
> > 	return report_id;
> > }
> > 
> > void printk_report_end(const char *lvl, int report_id)
> > {
> > 	printk("%s---------- Report No. %d end --------\n", lvl, report_id);
> > 
> > 	if (printk_force_report_loglevel)
> > 		printk_force_loglevel_exit();
> > }
> 
> Looks good to me, however
> 
> > My problem with the log level manipulation is that I do not know what
> > loglevel right.

We still might want to introduce CONFIG_CONSISTENT_REPORT_LOGLEVEL.