profiling: remove prof_cpu_mask

[PATCH] profiling: remove prof_cpu_mask

Posted by Tetsuo Handa 1 year, 6 months ago

syzbot is reporting uninit-value at profile_hits(), for there is a race
window between

  if (!alloc_cpumask_var(&prof_cpu_mask, GFP_KERNEL))
    return -ENOMEM;
  cpumask_copy(prof_cpu_mask, cpu_possible_mask);

in profile_init() and

  cpumask_available(prof_cpu_mask) &&
  cpumask_test_cpu(smp_processor_id(), prof_cpu_mask))

in profile_tick(); prof_cpu_mask remains uninitialzed until cpumask_copy()
completes while cpumask_available(prof_cpu_mask) returns true as soon as
alloc_cpumask_var(&prof_cpu_mask) completes.

We could replace alloc_cpumask_var() with zalloc_cpumask_var() and
call cpumask_copy() from create_proc_profile() on only UP kernels, for
profile_online_cpu() calls cpumask_set_cpu() as needed via
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN) on SMP kernels. But this patch
removes prof_cpu_mask because it seems unnecessary.

The cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) test
in profile_tick() is likely always true due to

  a CPU cannot call profile_tick() if that CPU is offline

and

  cpumask_set_cpu(cpu, prof_cpu_mask) is called when that CPU becomes
  online and cpumask_clear_cpu(cpu, prof_cpu_mask) is called when that
  CPU becomes offline

. This test could be false during transition between online and offline.

But according to include/linux/cpuhotplug.h , CPUHP_PROFILE_PREPARE
belongs to PREPARE section, which means that the CPU subjected to
profile_dead_cpu() cannot be inside profile_tick() (i.e. no risk of
use-after-free bug) because interrupt for that CPU is disabled during
PREPARE section. Therefore, this test is guaranteed to be true, and
can be removed. (Since profile_hits() checks prof_buffer != NULL, we
don't need to check prof_buffer != NULL here unless get_irq_regs() or
user_mode() is such slow that we want to avoid when prof_buffer == NULL).

do_profile_hits() is called from profile_tick() from timer interrupt
only if cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) is true and
prof_buffer is not NULL. But syzbot is also reporting that sometimes
do_profile_hits() is called while current thread is still doing vzalloc(),
where prof_buffer must be NULL at this moment. This indicates that multiple
threads concurrently tried to write to /sys/kernel/profiling interface,
which caused that somebody else try to re-allocate prof_buffer despite
somebody has already allocated prof_buffer. Fix this by using
serialization.

Reported-by: syzbot <syzbot+b1a83ab2a9eb9321fbdd@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=b1a83ab2a9eb9321fbdd
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+b1a83ab2a9eb9321fbdd@syzkaller.appspotmail.com>
---
 kernel/ksysfs.c  |  7 +++++++
 kernel/profile.c | 46 ++++++----------------------------------------
 2 files changed, 13 insertions(+), 40 deletions(-)

diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index 07fb5987b42b..1bab21b4718f 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -92,7 +92,14 @@ static ssize_t profiling_store(struct kobject *kobj,
 				   const char *buf, size_t count)
 {
 	int ret;
+	static DEFINE_MUTEX(lock);
 
+	/*
+	 * We need serialization, for profile_setup() initializes prof_on
+	 * value and profile_init() must not reallocate prof_buffer after
+	 * once allocated.
+	 */
+	guard(mutex)(&lock);
 	if (prof_on)
 		return -EEXIST;
 	/*
diff --git a/kernel/profile.c b/kernel/profile.c
index 2b775cc5c28f..4654c6cd984e 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -47,7 +47,6 @@ static unsigned short int prof_shift;
 int prof_on __read_mostly;
 EXPORT_SYMBOL_GPL(prof_on);
 
-static cpumask_var_t prof_cpu_mask;
 #if defined(CONFIG_SMP) && defined(CONFIG_PROC_FS)
 static DEFINE_PER_CPU(struct profile_hit *[2], cpu_profile_hits);
 static DEFINE_PER_CPU(int, cpu_profile_flip);
@@ -114,11 +113,6 @@ int __ref profile_init(void)
 
 	buffer_bytes = prof_len*sizeof(atomic_t);
 
-	if (!alloc_cpumask_var(&prof_cpu_mask, GFP_KERNEL))
-		return -ENOMEM;
-
-	cpumask_copy(prof_cpu_mask, cpu_possible_mask);
-
 	prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL|__GFP_NOWARN);
 	if (prof_buffer)
 		return 0;
@@ -132,7 +126,6 @@ int __ref profile_init(void)
 	if (prof_buffer)
 		return 0;
 
-	free_cpumask_var(prof_cpu_mask);
 	return -ENOMEM;
 }
 
@@ -267,9 +260,6 @@ static int profile_dead_cpu(unsigned int cpu)
 	struct page *page;
 	int i;
 
-	if (cpumask_available(prof_cpu_mask))
-		cpumask_clear_cpu(cpu, prof_cpu_mask);
-
 	for (i = 0; i < 2; i++) {
 		if (per_cpu(cpu_profile_hits, cpu)[i]) {
 			page = virt_to_page(per_cpu(cpu_profile_hits, cpu)[i]);
@@ -302,14 +292,6 @@ static int profile_prepare_cpu(unsigned int cpu)
 	return 0;
 }
 
-static int profile_online_cpu(unsigned int cpu)
-{
-	if (cpumask_available(prof_cpu_mask))
-		cpumask_set_cpu(cpu, prof_cpu_mask);
-
-	return 0;
-}
-
 #else /* !CONFIG_SMP */
 #define profile_flip_buffers()		do { } while (0)
 #define profile_discard_flip_buffers()	do { } while (0)
@@ -334,8 +316,8 @@ void profile_tick(int type)
 {
 	struct pt_regs *regs = get_irq_regs();
 
-	if (!user_mode(regs) && cpumask_available(prof_cpu_mask) &&
-	    cpumask_test_cpu(smp_processor_id(), prof_cpu_mask))
+	/* This is the old kernel-only legacy profiling */
+	if (!user_mode(regs))
 		profile_hit(type, (void *)profile_pc(regs));
 }
 
@@ -418,10 +400,6 @@ static const struct proc_ops profile_proc_ops = {
 int __ref create_proc_profile(void)
 {
 	struct proc_dir_entry *entry;
-#ifdef CONFIG_SMP
-	enum cpuhp_state online_state;
-#endif
-
 	int err = 0;
 
 	if (!prof_on)
@@ -431,26 +409,14 @@ int __ref create_proc_profile(void)
 				profile_prepare_cpu, profile_dead_cpu);
 	if (err)
 		return err;
-
-	err = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "AP_PROFILE_ONLINE",
-				profile_online_cpu, NULL);
-	if (err < 0)
-		goto err_state_prep;
-	online_state = err;
-	err = 0;
 #endif
 	entry = proc_create("profile", S_IWUSR | S_IRUGO,
 			    NULL, &profile_proc_ops);
-	if (!entry)
-		goto err_state_onl;
-	proc_set_size(entry, (1 + prof_len) * sizeof(atomic_t));
-
-	return err;
-err_state_onl:
+	if (entry)
+		proc_set_size(entry, (1 + prof_len) * sizeof(atomic_t));
 #ifdef CONFIG_SMP
-	cpuhp_remove_state(online_state);
-err_state_prep:
-	cpuhp_remove_state(CPUHP_PROFILE_PREPARE);
+	else
+		cpuhp_remove_state(CPUHP_PROFILE_PREPARE);
 #endif
 	return err;
 }
-- 
2.43.5

Re: [PATCH] profiling: remove prof_cpu_mask

Posted by Linus Torvalds 1 year, 6 months ago

On Sat, 27 Jul 2024 at 04:00, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> We could replace alloc_cpumask_var() with zalloc_cpumask_var() and
> call cpumask_copy() from create_proc_profile() on only UP kernels, for
> profile_online_cpu() calls cpumask_set_cpu() as needed via
> cpuhp_setup_state(CPUHP_AP_ONLINE_DYN) on SMP kernels. But this patch
> removes prof_cpu_mask because it seems unnecessary.

So I like this one a lot more, but it actually makes me wonder: could
we just remove the horrid per-cpu profile flip buffers entirely?

This code predates the whole "we have _real_ profiling", and dates
back to literally two decades ago.

It's there due to ancient NUMA concerns on old SGI Altix hardware in 2004:

   https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=ad02973d42f6b538c7ed76c7c0a5ae8560f65913

and I think it's past time to just take all this code out of its misery.

Nobody sane should use the old profile code for any real work any
more. I'd be more than happy to just remove all the magic code and see
if anybody even notices..

                  Linus

Re: [PATCH] profiling: remove prof_cpu_mask

Posted by Tetsuo Handa 1 year, 6 months ago

On 2024/07/28 1:54, Linus Torvalds wrote:
> On Sat, 27 Jul 2024 at 04:00, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>
>> We could replace alloc_cpumask_var() with zalloc_cpumask_var() and
>> call cpumask_copy() from create_proc_profile() on only UP kernels, for
>> profile_online_cpu() calls cpumask_set_cpu() as needed via
>> cpuhp_setup_state(CPUHP_AP_ONLINE_DYN) on SMP kernels. But this patch
>> removes prof_cpu_mask because it seems unnecessary.
> 
> So I like this one a lot more, but it actually makes me wonder: could
> we just remove the horrid per-cpu profile flip buffers entirely?

I dropped "profiling: initialize prof_cpu_mask from profile_online_cpu()" from
my tree so that someone can take this patch. Andrew, can you take this patch?
Whether somebody notices performance/accuracy problem by removing the horrid
per-cpu profile flip buffers will be attempted by a separate patch.

> 
> This code predates the whole "we have _real_ profiling", and dates
> back to literally two decades ago.
> 
> It's there due to ancient NUMA concerns on old SGI Altix hardware in 2004:
> 
>    https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=ad02973d42f6b538c7ed76c7c0a5ae8560f65913
> 
> and I think it's past time to just take all this code out of its misery.
> 
> Nobody sane should use the old profile code for any real work any
> more. I'd be more than happy to just remove all the magic code and see
> if anybody even notices..
> 
>                   Linus

Re: [PATCH] profiling: remove prof_cpu_mask

Posted by Thomas Gleixner 1 year, 6 months ago

On Sat, Jul 27 2024 at 09:54, Linus Torvalds wrote:
> On Sat, 27 Jul 2024 at 04:00, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>
>> We could replace alloc_cpumask_var() with zalloc_cpumask_var() and
>> call cpumask_copy() from create_proc_profile() on only UP kernels, for
>> profile_online_cpu() calls cpumask_set_cpu() as needed via
>> cpuhp_setup_state(CPUHP_AP_ONLINE_DYN) on SMP kernels. But this patch
>> removes prof_cpu_mask because it seems unnecessary.
>
> So I like this one a lot more, but it actually makes me wonder: could
> we just remove the horrid per-cpu profile flip buffers entirely?
>
> This code predates the whole "we have _real_ profiling", and dates
> back to literally two decades ago.
>
> It's there due to ancient NUMA concerns on old SGI Altix hardware in 2004:
>
>    https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=ad02973d42f6b538c7ed76c7c0a5ae8560f65913
>
> and I think it's past time to just take all this code out of its misery.
>
> Nobody sane should use the old profile code for any real work any
> more. I'd be more than happy to just remove all the magic code and see
> if anybody even notices..

Only people who indulge in nostalgia will notice :)

Re: [PATCH] profiling: remove prof_cpu_mask

Posted by Linus Torvalds 1 year, 6 months ago

On Sat, 27 Jul 2024 at 14:20, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Only people who indulge in nostalgia will notice :)

So let's see. Any bets on whether anybody actually notices?

I took Tetsuo's fix for the syzbot issue, and then I did a "remove all
the flip buffer code" in commit 2accfdb7eff6 ("profiling: attempt to
remove per-cpu profile flip buffer").

If somebody ends up having a real use-case for this old horrid code,
we may just have to remove it.

But it might also be the case that somebody actually does want the
boot-time profiling, and then the runtime overhead is annoying just
because they are on a multi-socket machine and the profiling just
keeps going - even after better profilers are available.

So it might be that nobody wants to actually re-instate the flip
buffer thing, but instead just turn the thing off entirely.

Technically you can do that by writing to /proc/profile with a
"profiling multiplier" that effectively turns it off, but very few
architectures actually support that (see "setup_profiling_timer()").

End result: maybe we should add a way to just say "I'm done profiling
now" if somebody reports that it causes performance issues after boot.

But I hope (and think it's very possible) that nobody will ever notice
any other way than from following this LKML discussion.

              Linus

Re: [PATCH] profiling: remove prof_cpu_mask

Posted by Linus Torvalds 1 year, 6 months ago

On Sat, 27 Jul 2024 at 14:20, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Only people who indulge in nostalgia will notice :)

... and the *really* nostalgic people will be happy that we're back to
the old days (pre-2004) with just a simple array that is addressed by
shifting the instruction pointer...

I wonder how many people actually use this ancient kernel profiling
thing. I get the feeling that it's "one real user and a hundred syzbot
test failures".

            Linus

Re: [PATCH] profiling: remove prof_cpu_mask

Posted by Tetsuo Handa 1 year, 6 months ago

On 2024/07/28 6:22, Linus Torvalds wrote:
> I wonder how many people actually use this ancient kernel profiling
> thing. I get the feeling that it's "one real user and a hundred syzbot
> test failures".

What about emitting some kernel messages for investigating whether there
are users who need this code, and wait for two years for whether someone
says "I need this code" ?

For example, hfs ( https://syzkaller.appspot.com/upstream/s/hfs ) has
many open bugs. Some of them have patches but nobody can review/take them.
Unless a filesystem needs to be mounted as a native filesystem, I think
that re-implementing such filesystem as a fuse-based filesystem will help
reducing overall bugs.

Anyway, it seems that the kernel sleep profiling is no longer working
after commit 42a20f86dc19 ("sched: Add wrapper for get_wchan() to keep
task blocked").

Re: [PATCH] profiling: remove prof_cpu_mask

Posted by Linus Torvalds 1 year, 6 months ago

On Sat, 27 Jul 2024 at 16:48, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> What about emitting some kernel messages for investigating whether there
> are users who need this code, and wait for two years for whether someone
> says "I need this code" ?

Heh. We've tried that. Nobody reads the kernel messages (or the docs -
people have tried documenting "this will go away in a year" in our
kernel documentation instead).

People notice and let you know when the feature is gone, and generally
not one second before that.

Actually, what *has* worked is WARN_ONCE() kind of big really scary
messages, and that will actually make some people notice just because
they are *so* big that people see them almost by accident if they ever
happen to look at any logs.

But then that causes other problems (ie the denial-of-service on
platforms that have panic-on-warn set).

So while we have done that too, it's only workable for some "let's see
if anybody hits this during the release cycle", because you can't
release with the warning in place.

Generally the best option is probably just to remove it, see if
anybody notices, and add it back if it turns out to have a real and
valid use.

              Linus