[PATCH v4 2/5] timers: Add the available mask in timer migration

Posted by Gabriele Monaco 7 months, 2 weeks ago
Keep track of the CPUs available for timer migration in a cpumask. This
prepares the ground to generalise the concept of unavailable CPUs.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 kernel/time/timer_migration.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 7efd897c7959..25439f961ccf 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -422,6 +422,9 @@ static unsigned int tmigr_crossnode_level __read_mostly;
 
 static DEFINE_PER_CPU(struct tmigr_cpu, tmigr_cpu);
 
+/* CPUs available for timer migration */
+static cpumask_var_t tmigr_available_cpumask;
+
 #define TMIGR_NONE	0xFF
 #define BIT_CNT		8
 
@@ -1449,6 +1452,7 @@ static int tmigr_cpu_unavailable(unsigned int cpu)
 	raw_spin_lock_irq(&tmc->lock);
 	tmc->available = false;
 	WRITE_ONCE(tmc->wakeup, KTIME_MAX);
+	cpumask_clear_cpu(cpu, tmigr_available_cpumask);
 
 	/*
 	 * CPU has to handle the local events on his own, when on the way to
@@ -1459,7 +1463,7 @@ static int tmigr_cpu_unavailable(unsigned int cpu)
 	raw_spin_unlock_irq(&tmc->lock);
 
 	if (firstexp != KTIME_MAX) {
-		migrator = cpumask_any_but(cpu_online_mask, cpu);
+		migrator = cpumask_any(tmigr_available_cpumask);
 		work_on_cpu(migrator, tmigr_trigger_active, NULL);
 	}
 
@@ -1480,6 +1484,7 @@ static int tmigr_cpu_available(unsigned int cpu)
 	if (!tmc->idle)
 		__tmigr_cpu_activate(tmc);
 	tmc->available = true;
+	cpumask_set_cpu(cpu, tmigr_available_cpumask);
 	raw_spin_unlock_irq(&tmc->lock);
 	return 0;
 }
@@ -1801,6 +1806,11 @@ static int __init tmigr_init(void)
 	if (ncpus == 1)
 		return 0;
 
+	if (!zalloc_cpumask_var(&tmigr_available_cpumask, GFP_KERNEL)) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
 	/*
 	 * Calculate the required hierarchy levels. Unfortunately there is no
 	 * reliable information available, unless all possible CPUs have been
-- 
2.49.0
Re: [PATCH v4 2/5] timers: Add the available mask in timer migration
Posted by Frederic Weisbecker 7 months, 1 week ago
Le Tue, May 06, 2025 at 11:15:37AM +0200, Gabriele Monaco a écrit :
> [...]
>  	if (firstexp != KTIME_MAX) {
> -		migrator = cpumask_any_but(cpu_online_mask, cpu);
> +		migrator = cpumask_any(tmigr_available_cpumask);

Considering that nohz_full CPUs should still be available:

I don't think there is anything ensuring that, in nohz_full mode,
there must be at least one housekeeping CPU that is not domain
isolated.

For example if we have two CPUs with CPU 0 being domain isolated
and CPU 1 being nohz_full, then there is no migrator to handle CPU 1's
global timers.

Thanks.

>  		work_on_cpu(migrator, tmigr_trigger_active, NULL);
>  	}

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH v4 2/5] timers: Add the available mask in timer migration
Posted by Gabriele Monaco 7 months, 1 week ago

On Tue, 2025-05-06 at 18:07 +0200, Frederic Weisbecker wrote:
> Le Tue, May 06, 2025 at 11:15:37AM +0200, Gabriele Monaco a écrit :
> > [...]
> 
> Considering nohz_full CPUs should be still available.
> 
> I don't think there is anything ensuring that, in nohz_full mode,
> there must be at least one housekeeping CPU that is not domain
> isolated.
> 
> For example if we have two CPUs with CPU 0 being domain isolated
> and CPU 1 being nohz_full, then there is no migrator to handle
> CPU 1's global timers.
> 

Mmh, good point, I didn't think about the domain isolated and
nohz_full masks being disjoint..

When that's really the case, how do you think we should fall back?

In the situation you describe, no one is going to be able to handle
global timers on the nohz_full CPUs, right?

When this situation really occurs, we could keep one of the domain
isolated CPUs in the hierarchy.
Now, I see that on x86 CPU 0 cannot be offlined and is not added to
nohz_full, which would make things considerably easier, but ARM doesn't
seem to work the same way.

We could elect a lucky winner (e.g. the first or last CPU becoming
domain isolated) and swap it whenever it goes offline, until we
actually run out of candidates (no non-nohz_full CPU is left online),
but I believe this shouldn't happen..

Does this make sense to you?

Thanks,
Gabriele
Re: [PATCH v4 2/5] timers: Add the available mask in timer migration
Posted by Frederic Weisbecker 7 months, 1 week ago
Le Wed, May 07, 2025 at 09:57:38AM +0200, Gabriele Monaco a écrit :
> 
> 
> On Tue, 2025-05-06 at 18:07 +0200, Frederic Weisbecker wrote:
> > [...]
> > 
> 
> Mmh, good point, I didn't think about the domain isolated and
> nohz_full masks being disjoint..
> 
> When that's really the case, how do you think we should fall back?
> 
> In the situation you describe, no one is going to be able to handle
> global timers on the nohz_full CPUs, right?
> 
> When this situation really occurs, we could keep one of the domain
> isolated CPUs in the hierarchy.
> Now, I see that on x86 CPU 0 cannot be offlined and is not added to
> nohz_full, which would make things considerably easier, but ARM
> doesn't seem to work the same way.
> 
> We could elect a lucky winner (e.g. the first or last CPU becoming
> domain isolated) and swap it whenever it goes offline, until we
> actually run out of candidates (no non-nohz_full CPU is left online),
> but I believe this shouldn't happen..
> 
> Does this make sense to you?

Well, nohz_full= and isolcpus=, when they are passed together, must contain the
same set of CPUs. And if there is no housekeeping CPU then one is forced, so
it's well handled at this point.

But if nohz_full= is passed on boot and cpusets later create an isolated
partition which spans the housekeeping set, then the isolated partition must
be rejected.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH v4 2/5] timers: Add the available mask in timer migration
Posted by Gabriele Monaco 7 months, 1 week ago

On Wed, 2025-05-07 at 14:25 +0200, Frederic Weisbecker wrote:
> Le Wed, May 07, 2025 at 09:57:38AM +0200, Gabriele Monaco a écrit :
> > 
> > 
> > [...]
> > 
> > Does this make sense to you?
> 
> Well, nohz_full= and isolcpus=, when they are passed together, must
> contain the same set of CPUs. And if there is no housekeeping CPU then
> one is forced, so it's well handled at this point.

I'm not so sure about this one though.
As far as I understand, [1] prevents the user from setting different
CPUs when using isolcpus=nohz together with nohz_full= (which are now
equivalent). But I seem to be able to do isolcpus=0-3 and nohz_full=4-7
without any problem, and I believe I'd hit the issue you're mentioning.
(The same would work if I swapped the masks, as 0 cannot be nohz_full.)

  # vng -a isolcpus=0-7 -a nohz_full=8-15 head /sys/devices/system/cpu/{isolated,nohz_full}

  ==> /sys/devices/system/cpu/isolated <==
  0-7

  ==> /sys/devices/system/cpu/nohz_full <==
  8-15

(where probably some CPUs are set up to do housekeeping stuff anyway,
but if we just look at the masks, we won't notice)

Then I assume this should not be allowed either, should it?
Or am I missing something here?

> 
> But if nohz_full= is passed on boot and cpusets later create an
> isolated partition which spans the housekeeping set, then the isolated
> partition must be rejected.

Mmh, that would make things easier actually.
I assume there's no real use case for that kind of hybrid setup with
half the CPUs nohz_full and the other half domain isolated..

Thanks,
Gabriele

[1] -
https://elixir.bootlin.com/linux/v6.14.5/source/kernel/sched/isolation.c#L163
Re: [PATCH v4 2/5] timers: Add the available mask in timer migration
Posted by Frederic Weisbecker 7 months, 1 week ago
Le Wed, May 07, 2025 at 02:46:39PM +0200, Gabriele Monaco a écrit :
> 
> I'm not so sure about this one though.
> As far as I understand, [1] prevents the user from setting different
> CPUs when using isolcpus=nohz together with nohz_full= (which are now
> equivalent). But I seem to be able to do isolcpus=0-3 and nohz_full=4-7
> without any problem, and I believe I'd hit the issue you're mentioning.

Duh!

> (The same would work if I swap the masks as 0 cannot be nohz_full).

Unfortunately 0 can be nohz_full...

> 
>   # vng -a isolcpus=0-7 -a nohz_full=8-15 head /sys/devices/system/cpu/{isolated,nohz_full}
> 
>   ==> /sys/devices/system/cpu/isolated <==
>   0-7
> 
>   ==> /sys/devices/system/cpu/nohz_full <==
>   8-15
> 
> (where probably some CPUs are set up to do housekeeping stuff anyway,
> but if we just look at the masks, we won't notice)
> 
> Then I assume this should not be allowed either, should it?
> Or am I missing something here?

Exactly then. housekeeping_setup() already handles cases where there
is no housekeeping CPU left. I guess that section could be made aware
of nohz_full + isolcpus together not leaving any housekeeping CPU.

> 
> > 
> > But if nohz_full= is passed on boot and cpusets later create an
> > isolated
> > partition which spans the housekeeping set, then the isolated
> > partition must
> > be rejected.
> 
> Mmh, that would make things easier actually.
> I assume there's no real use case for that kind of hybrid setup with
> half CPUs nohz_full and half domain isolated..

I guess we can accept nohz_full + isolated partition as long as a housekeeping
CPU remains.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH v4 2/5] timers: Add the available mask in timer migration
Posted by Gabriele Monaco 7 months, 1 week ago

On Wed, 2025-05-07 at 15:40 +0200, Frederic Weisbecker wrote:
> Le Wed, May 07, 2025 at 02:46:39PM +0200, Gabriele Monaco a écrit :
> > 
> > I'm not so sure about this one though.
> > As far as I understand, [1] prevents the user from setting different
> > CPUs when using isolcpus=nohz together with nohz_full= (which are
> > now equivalent). But I seem to be able to do isolcpus=0-3 and
> > nohz_full=4-7 without any problem, and I believe I'd hit the issue
> > you're mentioning.
> 
> Duh!
> 
> > (The same would work if I swap the masks as 0 cannot be nohz_full).
> 
> Unfortunately 0 can be nohz_full...

Well, I haven't found what enforces it, but I wasn't able to set 0 as
nohz_full, no matter what I tried on x86 and arm64. I'm not sure
whether 0 being excluded was just by chance in this case (I'm guessing
it has something to do with tick_do_timer_cpu; I'm not quite familiar
with this code).

I was trying to see if we could make some assumptions in tmigr, but
what you propose (enforcing fully housekeeping CPUs everywhere) seems
much neater.

> 
> > [...]
> > 
> > Then I assume this should not be allowed either, should it?
> > Or am I missing something here?
> 
> Exactly then. housekeeping_setup() already handles cases when
> there is no housekeeping left. I guess that section could be
> made aware of nohz_full + isolcpus not leaving any housekeeping left.
> 
> > 
> > > 
> > > But if nohz_full= is passed on boot and cpusets later create an
> > > isolated partition which spans the housekeeping set, then the
> > > isolated partition must be rejected.
> > 
> > Mmh, that would make things easier actually.
> > I assume there's no real use case for that kind of hybrid setup with
> > half the CPUs nohz_full and the other half domain isolated..
> 
> I guess we can accept nohz_full + isolated partition as long as a
> housekeeping CPU remains.

Yeah makes sense, I'll explore that.

Thanks,
Gabriele
Re: [PATCH v4 2/5] timers: Add the available mask in timer migration
Posted by Frederic Weisbecker 7 months, 1 week ago
Le Wed, May 07, 2025 at 03:54:32PM +0200, Gabriele Monaco a écrit :
> [...]
> 
> > Unfortunately 0 can be nohz_full...
> 
> Well, I haven't found what enforces it, but I wasn't able to set 0 as
> nohz_full, no matter what I tried on x86 and arm64. Not sure if 0 was
> just by chance in this case (I'm guessing it has something to do with
> tick_do_timer_cpu, I'm not quite familiar with this code).

Ah, looks like you need CONFIG_PM_SLEEP_SMP_NONZERO_CPU. IIUC it's
only set on powerpc.

You can try to deactivate CONFIG_SUSPEND and CONFIG_HIBERNATE_CALLBACKS
otherwise.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs