From: Thomas Gleixner
To: Linus Torvalds
Cc: linux-kernel@vger.kernel.org, x86@kernel.org
Subject: [GIT pull] timers/core for v6.19-rc1
References: <176457119565.1888260.10012195384143368631.tglx@xen13>
Message-ID: <176457122251.1888260.91531689314335034.tglx@xen13>
Date: Mon, 1 Dec 2025 07:51:38 +0100 (CET)

Linus,

please pull the latest timers/core branch from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers-core-2025-11-30

up to:  7dec062cfcf2: timers/migration: Exclude isolated cpus from hierarchy


Update to the time/timers core:

  - Prevent a thundering herd problem when the timekeeper CPU is delayed
    and a large number of CPUs compete to acquire jiffies_lock to do the
    update. Limit it to one CPU with a separate "uncontended" atomic
    variable.

  - A set of improvements for the timer migration mechanism:

    - Support imbalanced NUMA trees correctly

    - Support dynamic exclusion of CPUs from the migrator duty to allow
      the cpuset/isolation mechanism to exclude them from handling
      timers of remote idle CPUs.

  - The usual small updates, cleanups and enhancements

Thanks,

	tglx

------------------>
Frederic Weisbecker (6):
      timers/migration: Convert "while" loops to use "for"
      timers/migration: Remove locking on group connection
      timers/migration: Fix imbalanced NUMA trees
      timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy
      timers/migration: Remove unused "cpu" parameter from tmigr_get_group()
      timers/migration: Remove dead code handling idle CPU checking for remote timers

Gabriele Monaco (6):
      timers/migration: Rename 'online' bit to 'available'
      timers/migration: Add mask for CPUs available in the hierarchy
      timers/migration: Use scoped_guard on available flag set/clear
      cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks()
      sched/isolation: Force housekeeping if isolcpus and nohz_full don't leave any
      timers/migration: Exclude isolated cpus from hierarchy

Jianyun Gao (1):
      time: Fix a few typos in time[r] related code comments

Steve Wahl (1):
      tick/sched: Limit non-timekeeper CPUs calling jiffies update

Sunday Adelodun (1):
      time: tick-oneshot: Add missing Return and parameter descriptions to kernel-doc

Thomas Weißschuh (2):
      hrtimer: Store time as ktime_t in restart block
      selftests/timers/nanosleep: Add tests for return of remaining time

Wake Liu (1):
      selftests/timers: Clean up kernel version check in posix_timers

Yury Norov (1):
      cpumask: Add initialiser to use cleanup helpers


 include/linux/cpumask.h                       |   2 +
 include/linux/delay.h                         |   8 +-
 include/linux/restart_block.h                 |   2 +-
 include/linux/timer.h                         |   9 +
 include/trace/events/timer_migration.h        |   4 +-
 kernel/cgroup/cpuset.c                        |  15 +-
 kernel/sched/isolation.c                      |  23 ++
 kernel/time/hrtimer.c                         |   4 +-
 kernel/time/posix-cpu-timers.c                |   4 +-
 kernel/time/posix-timers.c                    |   2 +-
 kernel/time/tick-oneshot.c                    |  20 +-
 kernel/time/tick-sched.c                      |  30 +-
 kernel/time/timer_migration.c                 | 487 +++++++++++++++++---------
 kernel/time/timer_migration.h                 |   2 +-
 tools/testing/selftests/timers/nanosleep.c    |  55 +++
 tools/testing/selftests/timers/posix_timers.c |  32 +-
 16 files changed, 503 insertions(+), 196 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index ff8f41ab7ce6..68be522449ec 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -1005,6 +1005,7 @@ static __always_inline unsigned int cpumask_size(void)
 
 #define this_cpu_cpumask_var_ptr(x) this_cpu_read(x)
 #define __cpumask_var_read_mostly  __read_mostly
+#define CPUMASK_VAR_NULL NULL
 
 bool alloc_cpumask_var_node(cpumask_var_t *mask, gfp_t flags, int node);
 
@@ -1051,6 +1052,7 @@ static __always_inline bool cpumask_available(cpumask_var_t mask)
 
 #define this_cpu_cpumask_var_ptr(x) this_cpu_ptr(x)
 #define __cpumask_var_read_mostly
+#define CPUMASK_VAR_NULL {}
 
 static __always_inline bool alloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
 {
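The CPUMASK_VAR_NULL initialiser above exists because cpumask_var_t is a
pointer with CONFIG_CPUMASK_OFFSTACK=y and an embedded bitmap otherwise, so
scope-based cleanup via __free(free_cpumask_var) needs an initial value that
is valid for both representations before the first possible early return.
The kernel macros are not compilable standalone; the following is only a
minimal userspace sketch of the same scope-cleanup idea, built on the
GCC/Clang cleanup attribute, with all names hypothetical:

	#include <stdio.h>
	#include <stdlib.h>

	/* Userspace stand-in for free_cpumask_var(): freeing NULL is a
	 * no-op, which is exactly why the variable must be initialised
	 * before the first possible scope exit. */
	static void free_mask(unsigned long **mask)
	{
		free(*mask);
	}

	#define AUTO_FREE __attribute__((cleanup(free_mask)))

	static int use_mask(int fail_early)
	{
		/* Without the "= NULL" initialiser, an early return would
		 * hand an uninitialised pointer to the cleanup handler -
		 * the hole CPUMASK_VAR_NULL closes for the kernel variant. */
		AUTO_FREE unsigned long *mask = NULL;

		if (fail_early)
			return -1;	/* cleanup runs: free(NULL), harmless */

		mask = calloc(4, sizeof(*mask));
		if (!mask)
			return -1;

		mask[0] = 1;
		printf("mask[0] = %lu\n", mask[0]);
		return 0;		/* cleanup runs: free(mask) */
	}

	int main(void)
	{
		use_mask(1);
		return use_mask(0);
	}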
diff --git a/include/linux/delay.h b/include/linux/delay.h
index 89866bab100d..46412c00033a 100644
--- a/include/linux/delay.h
+++ b/include/linux/delay.h
@@ -68,7 +68,7 @@ void usleep_range_state(unsigned long min, unsigned long max,
  * @min: Minimum time in microseconds to sleep
  * @max: Maximum time in microseconds to sleep
  *
- * For basic information please refere to usleep_range_state().
+ * For basic information please refer to usleep_range_state().
  *
  * The task will be in the state TASK_UNINTERRUPTIBLE during the sleep.
  */
@@ -82,10 +82,10 @@ static inline void usleep_range(unsigned long min, unsigned long max)
  * @min: Minimum time in microseconds to sleep
  * @max: Maximum time in microseconds to sleep
  *
- * For basic information please refere to usleep_range_state().
+ * For basic information please refer to usleep_range_state().
  *
  * The sleeping task has the state TASK_IDLE during the sleep to prevent
- * contribution to the load avarage.
+ * contribution to the load average.
  */
 static inline void usleep_range_idle(unsigned long min, unsigned long max)
 {
@@ -96,7 +96,7 @@ static inline void usleep_range_idle(unsigned long min, unsigned long max)
  * ssleep - wrapper for seconds around msleep
  * @seconds: Requested sleep duration in seconds
  *
- * Please refere to msleep() for detailed information.
+ * Please refer to msleep() for detailed information.
  */
 static inline void ssleep(unsigned int seconds)
 {

diff --git a/include/linux/restart_block.h b/include/linux/restart_block.h
index 7e50bbc94e47..36ddfa1ec301 100644
--- a/include/linux/restart_block.h
+++ b/include/linux/restart_block.h
@@ -43,7 +43,7 @@ struct restart_block {
 			struct __kernel_timespec __user *rmtp;
 			struct old_timespec32 __user *compat_rmtp;
 		};
-		u64 expires;
+		ktime_t expires;
 	} nanosleep;
 	/* For poll */
 	struct {

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 0414d9e6b4fc..62e1cea71125 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -188,4 +188,13 @@ int timers_dead_cpu(unsigned int cpu);
 #define timers_dead_cpu NULL
 #endif
 
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+extern int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask);
+#else
+static inline int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
+{
+	return 0;
+}
+#endif
+
 #endif

diff --git a/include/trace/events/timer_migration.h b/include/trace/events/timer_migration.h
index 47db5eaf2f9a..61171b13c687 100644
--- a/include/trace/events/timer_migration.h
+++ b/include/trace/events/timer_migration.h
@@ -173,14 +173,14 @@ DEFINE_EVENT(tmigr_cpugroup, tmigr_cpu_active,
 	TP_ARGS(tmc)
 );
 
-DEFINE_EVENT(tmigr_cpugroup, tmigr_cpu_online,
+DEFINE_EVENT(tmigr_cpugroup, tmigr_cpu_available,
 
 	TP_PROTO(struct tmigr_cpu *tmc),
 
 	TP_ARGS(tmc)
 );
 
-DEFINE_EVENT(tmigr_cpugroup, tmigr_cpu_offline,
+DEFINE_EVENT(tmigr_cpugroup, tmigr_cpu_unavailable,
 
 	TP_PROTO(struct tmigr_cpu *tmc),
 

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 27adb04df675..bfc3b319e1c0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1339,7 +1339,7 @@ static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
 	return isolcpus_updated;
 }
 
-static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
+static void update_isolation_cpumasks(bool isolcpus_updated)
 {
 	int ret;
 
@@ -1350,6 +1350,9 @@ static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
 
 	ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
 	WARN_ON_ONCE(ret < 0);
+
+	ret = tmigr_isolated_exclude_cpumask(isolated_cpus);
+	WARN_ON_ONCE(ret < 0);
 }
 
 /**
@@ -1470,7 +1473,7 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
 	list_add(&cs->remote_sibling, &remote_children);
 	cpumask_copy(cs->effective_xcpus, tmp->new_cpus);
 	spin_unlock_irq(&callback_lock);
-	update_unbound_workqueue_cpumask(isolcpus_updated);
+	update_isolation_cpumasks(isolcpus_updated);
 	cpuset_force_rebuild();
 	cs->prs_err = 0;
 
@@ -1511,7 +1514,7 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
 	compute_effective_exclusive_cpumask(cs, NULL, NULL);
 	reset_partition_data(cs);
 	spin_unlock_irq(&callback_lock);
-	update_unbound_workqueue_cpumask(isolcpus_updated);
+	update_isolation_cpumasks(isolcpus_updated);
 	cpuset_force_rebuild();
 
 	/*
@@ -1580,7 +1583,7 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
 	if (xcpus)
 		cpumask_copy(cs->exclusive_cpus, xcpus);
 	spin_unlock_irq(&callback_lock);
-	update_unbound_workqueue_cpumask(isolcpus_updated);
+	update_isolation_cpumasks(isolcpus_updated);
 	if (adding || deleting)
 		cpuset_force_rebuild();
 
@@ -1943,7 +1946,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 		WARN_ON_ONCE(parent->nr_subparts < 0);
 	}
 	spin_unlock_irq(&callback_lock);
-	update_unbound_workqueue_cpumask(isolcpus_updated);
+	update_isolation_cpumasks(isolcpus_updated);
 
 	if ((old_prs != new_prs) && (cmd == partcmd_update))
 		update_partition_exclusive_flag(cs, new_prs);
@@ -2968,7 +2971,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	else if (isolcpus_updated)
 		isolated_cpus_update(old_prs, new_prs, cs->effective_xcpus);
 	spin_unlock_irq(&callback_lock);
-	update_unbound_workqueue_cpumask(isolcpus_updated);
+	update_isolation_cpumasks(isolcpus_updated);
 
 	/* Force update if switching back to member & update effective_xcpus */
 	update_cpumasks_hier(cs, &tmpmask, !new_prs);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index a4cf17b1fab0..3ad0d6df6a0a 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -167,6 +167,29 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
 		}
 	}
 
+	/*
+	 * Check the combination of nohz_full and isolcpus=domain,
+	 * necessary to avoid problems with the timer migration
+	 * hierarchy. managed_irq is ignored by this check since it
+	 * isn't considered in the timer migration logic.
+	 */
+	iter_flags = housekeeping.flags & (HK_FLAG_KERNEL_NOISE | HK_FLAG_DOMAIN);
+	type = find_first_bit(&iter_flags, HK_TYPE_MAX);
+	/*
+	 * Pass the check if none of these flags were previously set or
+	 * are not in the current selection.
+	 */
+	iter_flags = flags & (HK_FLAG_KERNEL_NOISE | HK_FLAG_DOMAIN);
+	first_cpu = (type == HK_TYPE_MAX || !iter_flags) ? 0 :
+		cpumask_first_and_and(cpu_present_mask,
+				      housekeeping_staging, housekeeping.cpumasks[type]);
+	if (first_cpu >= min(nr_cpu_ids, setup_max_cpus)) {
+		pr_warn("Housekeeping: must include one present CPU "
+			"neither in nohz_full= nor in isolcpus=domain, "
+			"ignoring setting %s\n", str);
+		goto free_housekeeping_staging;
+	}
+
 	iter_flags = flags & ~housekeeping.flags;
 
 	for_each_set_bit(type, &iter_flags, HK_TYPE_MAX)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 7e7b2b471bae..9c77e5c72556 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -2145,7 +2145,7 @@ static long __sched hrtimer_nanosleep_restart(struct restart_block *restart)
 	int ret;
 
 	hrtimer_setup_sleeper_on_stack(&t, restart->nanosleep.clockid, HRTIMER_MODE_ABS);
-	hrtimer_set_expires_tv64(&t.timer, restart->nanosleep.expires);
+	hrtimer_set_expires(&t.timer, restart->nanosleep.expires);
 	ret = do_nanosleep(&t, HRTIMER_MODE_ABS);
 	destroy_hrtimer_on_stack(&t.timer);
 	return ret;
@@ -2172,7 +2172,7 @@ long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
 
 	restart = &current->restart_block;
 	restart->nanosleep.clockid = t.timer.base->clockid;
-	restart->nanosleep.expires = hrtimer_get_expires_tv64(&t.timer);
+	restart->nanosleep.expires = hrtimer_get_expires(&t.timer);
 	set_restart_fn(restart, hrtimer_nanosleep_restart);
 out:
 	destroy_hrtimer_on_stack(&t.timer);

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 2e5b89d7d866..0de2bb7cbec0 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -1557,7 +1557,7 @@ static int do_cpu_nanosleep(const clockid_t which_clock, int flags,
 			 * Report back to the user the time still remaining.
 			 */
 			restart = &current->restart_block;
-			restart->nanosleep.expires = expires;
+			restart->nanosleep.expires = ns_to_ktime(expires);
 			if (restart->nanosleep.type != TT_NONE)
 				error = nanosleep_copyout(restart, &it.it_value);
 		}
@@ -1599,7 +1599,7 @@ static long posix_cpu_nsleep_restart(struct restart_block *restart_block)
 	clockid_t which_clock = restart_block->nanosleep.clockid;
 	struct timespec64 t;
 
-	t = ns_to_timespec64(restart_block->nanosleep.expires);
+	t = ktime_to_timespec64(restart_block->nanosleep.expires);
 
 	return do_cpu_nanosleep(which_clock, TIMER_ABSTIME, &t);
 }

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index aa3120104a51..36dbb8146517 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -1242,7 +1242,7 @@ SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock,
  * sys_clock_settime(). The kernel internal timekeeping is always using
  * nanoseconds precision independent of the clocksource device which is
  * used to read the time from. The resolution of that device only
- * affects the presicion of the time returned by sys_clock_gettime().
+ * affects the precision of the time returned by sys_clock_gettime().
  *
  * Returns:
  *  0		Success. @tp contains the resolution
diff --git a/kernel/time/tick-oneshot.c b/kernel/time/tick-oneshot.c
index 5e2c2c26b3cc..ffee943d796d 100644
--- a/kernel/time/tick-oneshot.c
+++ b/kernel/time/tick-oneshot.c
@@ -19,6 +19,10 @@
 
 /**
  * tick_program_event - program the CPU local timer device for the next event
+ * @expires: the time at which the next timer event should occur
+ * @force: flag to force reprogramming even if the event time hasn't changed
+ *
+ * Return: 0 on success, negative error code on failure
  */
 int tick_program_event(ktime_t expires, int force)
 {
@@ -57,6 +61,13 @@ void tick_resume_oneshot(void)
 
 /**
  * tick_setup_oneshot - setup the event device for oneshot mode (hres or nohz)
+ * @newdev: Pointer to the clock event device to configure
+ * @handler: Function to be called when the event device triggers an interrupt
+ * @next_event: Initial expiry time for the next event (in ktime)
+ *
+ * Configures the specified clock event device for oneshot mode,
+ * assigns the given handler as its event callback, and programs
+ * the device to trigger at the specified next event time.
  */
 void tick_setup_oneshot(struct clock_event_device *newdev,
 			void (*handler)(struct clock_event_device *),
@@ -69,6 +80,10 @@ void tick_setup_oneshot(struct clock_event_device *newdev,
 
 /**
  * tick_switch_to_oneshot - switch to oneshot mode
+ * @handler: function to call when an event occurs on the tick device
+ *
+ * Return: 0 on success, -EINVAL if the tick device is not present,
+ * not functional, or does not support oneshot mode.
  */
 int tick_switch_to_oneshot(void (*handler)(struct clock_event_device *))
 {
@@ -101,7 +116,7 @@ int tick_switch_to_oneshot(void (*handler)(struct clock_event_device *))
 /**
  * tick_oneshot_mode_active - check whether the system is in oneshot mode
  *
- * returns 1 when either nohz or highres are enabled. otherwise 0.
+ * Return: 1 when either nohz or highres are enabled, otherwise 0.
  */
 int tick_oneshot_mode_active(void)
 {
@@ -120,6 +135,9 @@ int tick_oneshot_mode_active(void)
  * tick_init_highres - switch to high resolution mode
  *
  * Called with interrupts disabled.
+ *
+ * Return: 0 on success, -EINVAL if the tick device cannot switch
+ * to oneshot/high-resolution mode.
  */
 int tick_init_highres(void)
 {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c527b421c865..3ff3eb1f90d0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -201,6 +201,27 @@ static inline void tick_sched_flag_clear(struct tick_sched *ts,
 	ts->flags &= ~flag;
 }
 
+/*
+ * Allow only one non-timekeeper CPU at a time to update jiffies from
+ * the timer tick.
+ *
+ * Returns true if update was run.
+ */
+static bool tick_limited_update_jiffies64(struct tick_sched *ts, ktime_t now)
+{
+	static atomic_t in_progress;
+	int inp;
+
+	inp = atomic_read(&in_progress);
+	if (inp || !atomic_try_cmpxchg(&in_progress, &inp, 1))
+		return false;
+
+	if (ts->last_tick_jiffies == jiffies)
+		tick_do_update_jiffies64(now);
+	atomic_set(&in_progress, 0);
+	return true;
+}
+
 #define MAX_STALLED_JIFFIES 5
 
 static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
@@ -239,10 +260,11 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
 		ts->stalled_jiffies = 0;
 		ts->last_tick_jiffies = READ_ONCE(jiffies);
 	} else {
-		if (++ts->stalled_jiffies == MAX_STALLED_JIFFIES) {
-			tick_do_update_jiffies64(now);
-			ts->stalled_jiffies = 0;
-			ts->last_tick_jiffies = READ_ONCE(jiffies);
+		if (++ts->stalled_jiffies >= MAX_STALLED_JIFFIES) {
+			if (tick_limited_update_jiffies64(ts, now)) {
+				ts->stalled_jiffies = 0;
+				ts->last_tick_jiffies = READ_ONCE(jiffies);
+			}
 		}
 	}
 
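The hunk above is the thundering herd fix from the summary: instead of every
stalled CPU piling onto jiffies_lock, a single atomic flag elects one updater
and the rest back off after a plain read. A minimal, runnable userspace sketch
of the same election pattern using C11 atomics follows; thread count and all
names are illustrative, not taken from the kernel:

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <pthread.h>

	static atomic_int in_progress;
	static _Atomic long fake_jiffies;

	/* Mirror of the kernel pattern: read first, so contended threads
	 * bail out with a plain load instead of hammering the cache line
	 * with read-modify-write operations. */
	static bool limited_update(void)
	{
		int inp = atomic_load(&in_progress);

		if (inp || !atomic_compare_exchange_strong(&in_progress, &inp, 1))
			return false;		/* someone else is updating */

		atomic_fetch_add(&fake_jiffies, 1);	/* stand-in for the jiffies update */
		atomic_store(&in_progress, 0);
		return true;
	}

	static void *cpu_thread(void *arg)
	{
		long wins = 0;

		for (int i = 0; i < 100000; i++)
			wins += limited_update();

		printf("thread %ld won %ld elections\n", (long)arg, wins);
		return NULL;
	}

	int main(void)
	{
		pthread_t t[4];

		for (long i = 0; i < 4; i++)
			pthread_create(&t[i], NULL, cpu_thread, (void *)i);
		for (int i = 0; i < 4; i++)
			pthread_join(t[i], NULL);
		printf("fake_jiffies = %ld\n", atomic_load(&fake_jiffies));
		return 0;
	}

Build with "cc -pthread". Each update still happens, but at most one thread
at a time pays for it; the others detect the in-flight update and retry on
their next tick, which is exactly the back-off the kernel change relies on.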
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index c0c54dc5314c..18dda1aa782d 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -10,6 +10,7 @@
 #include
 #include
 #include
+#include
 
 #include "timer_migration.h"
 #include "tick-internal.h"
@@ -420,14 +421,54 @@ static struct list_head *tmigr_level_list __read_mostly;
 static unsigned int tmigr_hierarchy_levels __read_mostly;
 static unsigned int tmigr_crossnode_level __read_mostly;
 
+static struct tmigr_group *tmigr_root;
+
 static DEFINE_PER_CPU(struct tmigr_cpu, tmigr_cpu);
 
+/*
+ * CPUs available for timer migration.
+ * Protected by cpuset_mutex (with cpus_read_lock held) or cpus_write_lock.
+ * Additionally tmigr_available_mutex serializes set/clear operations with each other.
+ */
+static cpumask_var_t tmigr_available_cpumask;
+static DEFINE_MUTEX(tmigr_available_mutex);
+
+/* Enabled during late initcall */
+static DEFINE_STATIC_KEY_FALSE(tmigr_exclude_isolated);
+
 #define TMIGR_NONE	0xFF
 #define BIT_CNT		8
 
 static inline bool tmigr_is_not_available(struct tmigr_cpu *tmc)
 {
-	return !(tmc->tmgroup && tmc->online);
+	return !(tmc->tmgroup && tmc->available);
+}
+
+/*
+ * Returns true if @cpu should be excluded from the hierarchy as isolated.
+ * Domain isolated CPUs don't participate in timer migration, nohz_full CPUs
+ * are still part of the hierarchy but become idle (from a tick and timer
+ * migration perspective) when they stop their tick. This lets the timekeeping
+ * CPU handle their global timers. Marking also isolated CPUs as idle would be
+ * too costly, hence they are completely excluded from the hierarchy.
+ * This check is necessary, for instance, to prevent offline isolated CPUs from
+ * being incorrectly marked as available once getting back online.
+ *
+ * This function returns false during early boot and the isolation logic is
+ * enabled only after isolated CPUs are marked as unavailable at late boot.
+ * The tick CPU can be isolated at boot, however we cannot mark it as
+ * unavailable to avoid having no global migrator for the nohz_full CPUs. This
+ * should be ensured by the callers of this function: implicitly from hotplug
+ * callbacks and explicitly in tmigr_init_isolation() and
+ * tmigr_isolated_exclude_cpumask().
+ */
+static inline bool tmigr_is_isolated(int cpu)
+{
+	if (!static_branch_unlikely(&tmigr_exclude_isolated))
+		return false;
+	return (!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) ||
+		cpuset_cpu_is_isolated(cpu)) &&
+	       housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE);
 }
 
 /*
@@ -502,11 +543,6 @@ static bool tmigr_check_lonely(struct tmigr_group *group)
  * @now:	timer base monotonic
  * @check:	is set if there is the need to handle remote timers;
  *		required in tmigr_requires_handle_remote() only
- * @tmc_active:	this flag indicates, whether the CPU which triggers
- *		the hierarchy walk is !idle in the timer migration
- *		hierarchy. When the CPU is idle and the whole hierarchy is
- *		idle, only the first event of the top level has to be
- *		considered.
  */
 struct tmigr_walk {
 	u64			nextexp;
@@ -517,16 +553,13 @@ struct tmigr_walk {
 	unsigned long		basej;
 	u64			now;
 	bool			check;
-	bool			tmc_active;
 };
 
 typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, struct tmigr_walk *);
 
-static void __walk_groups(up_f up, struct tmigr_walk *data,
-			  struct tmigr_cpu *tmc)
+static void __walk_groups_from(up_f up, struct tmigr_walk *data,
+			       struct tmigr_group *child, struct tmigr_group *group)
 {
-	struct tmigr_group *child = NULL, *group = tmc->tmgroup;
-
 	do {
 		WARN_ON_ONCE(group->level >= tmigr_hierarchy_levels);
 
@@ -544,6 +577,12 @@ static void __walk_groups(up_f up, struct tmigr_walk *data,
 	} while (group);
 }
 
+static void __walk_groups(up_f up, struct tmigr_walk *data,
+			  struct tmigr_cpu *tmc)
+{
+	__walk_groups_from(up, data, NULL, tmc->tmgroup);
+}
+
 static void walk_groups(up_f up, struct tmigr_walk *data, struct tmigr_cpu *tmc)
 {
 	lockdep_assert_held(&tmc->lock);
@@ -708,7 +747,7 @@ void tmigr_cpu_activate(void)
 /*
  * Returns true, if there is nothing to be propagated to the next level
  *
- * @data->firstexp is set to expiry of first gobal event of the (top level of
+ * @data->firstexp is set to expiry of first global event of the (top level of
  * the) hierarchy, but only when hierarchy is completely idle.
 *
 * The child and group states need to be read under the lock, to prevent a race
@@ -926,7 +965,7 @@ static void tmigr_handle_remote_cpu(unsigned int cpu, u64 now,
 	 * updated the event takes care when hierarchy is completely
 	 * idle. Otherwise the migrator does it as the event is enqueued.
 	 */
-	if (!tmc->online || tmc->remote || tmc->cpuevt.ignore ||
+	if (!tmc->available || tmc->remote || tmc->cpuevt.ignore ||
 	    now < tmc->cpuevt.nextevt.expires) {
 		raw_spin_unlock_irq(&tmc->lock);
 		return;
@@ -973,7 +1012,7 @@ static void tmigr_handle_remote_cpu(unsigned int cpu, u64 now,
 	 * (See also section "Required event and timerqueue update after a
 	 * remote expiry" in the documentation at the top)
 	 */
-	if (!tmc->online || !tmc->idle) {
+	if (!tmc->available || !tmc->idle) {
 		timer_unlock_remote_bases(cpu);
 		goto unlock;
 	}
@@ -1113,15 +1152,6 @@ static bool tmigr_requires_handle_remote_up(struct tmigr_group *group,
 	 */
 	if (!tmigr_check_migrator(group, childmask))
 		return true;
-
-	/*
-	 * When there is a parent group and the CPU which triggered the
-	 * hierarchy walk is not active, proceed the walk to reach the top level
-	 * group before reading the next_expiry value.
-	 */
-	if (group->parent && !data->tmc_active)
-		return false;
-
 	/*
 	 * The lock is required on 32bit architectures to read the variable
 	 * consistently with a concurrent writer. On 64bit the lock is not
@@ -1166,7 +1196,6 @@ bool tmigr_requires_handle_remote(void)
 	data.now = get_jiffies_update(&jif);
 	data.childmask = tmc->groupmask;
 	data.firstexp = KTIME_MAX;
-	data.tmc_active = !tmc->idle;
 	data.check = false;
 
 	/*
@@ -1432,38 +1461,43 @@ static long tmigr_trigger_active(void *unused)
 {
 	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
 
-	WARN_ON_ONCE(!tmc->online || tmc->idle);
+	WARN_ON_ONCE(!tmc->available || tmc->idle);
 
 	return 0;
 }
 
-static int tmigr_cpu_offline(unsigned int cpu)
+static int tmigr_clear_cpu_available(unsigned int cpu)
 {
 	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
 	int migrator;
 	u64 firstexp;
 
-	raw_spin_lock_irq(&tmc->lock);
-	tmc->online = false;
-	WRITE_ONCE(tmc->wakeup, KTIME_MAX);
+	guard(mutex)(&tmigr_available_mutex);
 
-	/*
-	 * CPU has to handle the local events on his own, when on the way to
-	 * offline; Therefore nextevt value is set to KTIME_MAX
-	 */
-	firstexp = __tmigr_cpu_deactivate(tmc, KTIME_MAX);
-	trace_tmigr_cpu_offline(tmc);
-	raw_spin_unlock_irq(&tmc->lock);
+	cpumask_clear_cpu(cpu, tmigr_available_cpumask);
+	scoped_guard(raw_spinlock_irq, &tmc->lock) {
+		if (!tmc->available)
+			return 0;
+		tmc->available = false;
+		WRITE_ONCE(tmc->wakeup, KTIME_MAX);
+
+		/*
+		 * CPU has to handle the local events on his own, when on the way to
+		 * offline; Therefore nextevt value is set to KTIME_MAX
+		 */
+		firstexp = __tmigr_cpu_deactivate(tmc, KTIME_MAX);
+		trace_tmigr_cpu_unavailable(tmc);
+	}
 
 	if (firstexp != KTIME_MAX) {
-		migrator = cpumask_any_but(cpu_online_mask, cpu);
+		migrator = cpumask_any(tmigr_available_cpumask);
 		work_on_cpu(migrator, tmigr_trigger_active, NULL);
 	}
 
 	return 0;
 }
 
-static int tmigr_cpu_online(unsigned int cpu)
+static int tmigr_set_cpu_available(unsigned int cpu)
 {
 	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
 
@@ -1471,16 +1505,123 @@ static int tmigr_cpu_online(unsigned int cpu)
 	/* Check whether CPU data was successfully initialized */
 	if (WARN_ON_ONCE(!tmc->tmgroup))
 		return -EINVAL;
 
-	raw_spin_lock_irq(&tmc->lock);
-	trace_tmigr_cpu_online(tmc);
-	tmc->idle = timer_base_is_idle();
-	if (!tmc->idle)
-		__tmigr_cpu_activate(tmc);
-	tmc->online = true;
-	raw_spin_unlock_irq(&tmc->lock);
+	if (tmigr_is_isolated(cpu))
+		return 0;
+
+	guard(mutex)(&tmigr_available_mutex);
+
+	cpumask_set_cpu(cpu, tmigr_available_cpumask);
+	scoped_guard(raw_spinlock_irq, &tmc->lock) {
+		if (tmc->available)
+			return 0;
+		trace_tmigr_cpu_available(tmc);
+		tmc->idle = timer_base_is_idle();
+		if (!tmc->idle)
+			__tmigr_cpu_activate(tmc);
+		tmc->available = true;
+	}
 	return 0;
 }
 
+static void tmigr_cpu_isolate(struct work_struct *ignored)
+{
+	tmigr_clear_cpu_available(smp_processor_id());
+}
+
+static void tmigr_cpu_unisolate(struct work_struct *ignored)
+{
+	tmigr_set_cpu_available(smp_processor_id());
+}
+
+/**
+ * tmigr_isolated_exclude_cpumask - Exclude given CPUs from hierarchy
+ * @exclude_cpumask: the cpumask to be excluded from timer migration hierarchy
+ *
+ * This function can be called from cpuset code to provide the new set of
+ * isolated CPUs that should be excluded from the hierarchy.
+ * Online CPUs not present in exclude_cpumask but already excluded are brought
+ * back to the hierarchy.
+ * Functions to isolate/unisolate need to be called locally and can sleep.
+ */
+int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
+{
+	struct work_struct __percpu *works __free(free_percpu) =
+			alloc_percpu(struct work_struct);
+	cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	int cpu;
+
+	lockdep_assert_cpus_held();
+
+	if (!works)
+		return -ENOMEM;
+	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
+		return -ENOMEM;
+
+	/*
+	 * First set previously isolated CPUs as available (unisolate).
+	 * This cpumask contains only CPUs that switched to available now.
+	 */
+	cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
+	cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
+
+	for_each_cpu(cpu, cpumask) {
+		struct work_struct *work = per_cpu_ptr(works, cpu);
+
+		INIT_WORK(work, tmigr_cpu_unisolate);
+		schedule_work_on(cpu, work);
+	}
+	for_each_cpu(cpu, cpumask)
+		flush_work(per_cpu_ptr(works, cpu));
+
+	/*
+	 * Then clear previously available CPUs (isolate).
+	 * This cpumask contains only CPUs that switched to not available now.
+	 * There cannot be overlap with the newly available ones.
+	 */
+	cpumask_and(cpumask, exclude_cpumask, tmigr_available_cpumask);
+	cpumask_and(cpumask, cpumask, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
+	/*
+	 * Handle this here and not in the cpuset code because exclude_cpumask
+	 * might include also the tick CPU if included in isolcpus.
+	 */
+	for_each_cpu(cpu, cpumask) {
+		if (!tick_nohz_cpu_hotpluggable(cpu)) {
+			cpumask_clear_cpu(cpu, cpumask);
+			break;
+		}
+	}
+
+	for_each_cpu(cpu, cpumask) {
+		struct work_struct *work = per_cpu_ptr(works, cpu);
+
+		INIT_WORK(work, tmigr_cpu_isolate);
+		schedule_work_on(cpu, work);
+	}
+	for_each_cpu(cpu, cpumask)
+		flush_work(per_cpu_ptr(works, cpu));
+
+	return 0;
+}
+
+static int __init tmigr_init_isolation(void)
+{
+	cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+
+	static_branch_enable(&tmigr_exclude_isolated);
+
+	if (!housekeeping_enabled(HK_TYPE_DOMAIN))
+		return 0;
+	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
+		return -ENOMEM;
+
+	cpumask_andnot(cpumask, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN));
+
+	/* Protect against RCU torture hotplug testing */
+	guard(cpus_read_lock)();
+	return tmigr_isolated_exclude_cpumask(cpumask);
+}
+late_initcall(tmigr_init_isolation);
+
 static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
 			     int node)
 {
@@ -1498,21 +1639,6 @@ static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
 	s.seq = 0;
 	atomic_set(&group->migr_state, s.state);
 
-	/*
-	 * If this is a new top-level, prepare its groupmask in advance.
-	 * This avoids accidents where yet another new top-level is
-	 * created in the future and made visible before the current groupmask.
-	 */
-	if (list_empty(&tmigr_level_list[lvl])) {
-		group->groupmask = BIT(0);
-		/*
-		 * The previous top level has prepared its groupmask already,
-		 * simply account it as the first child.
-		 */
-		if (lvl > 0)
-			group->num_children = 1;
-	}
-
 	timerqueue_init_head(&group->events);
 	timerqueue_init(&group->groupevt.nextevt);
 	group->groupevt.nextevt.expires = KTIME_MAX;
@@ -1520,8 +1646,7 @@ static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
 	group->groupevt.ignore = true;
 }
 
-static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node,
-					   unsigned int lvl)
+static struct tmigr_group *tmigr_get_group(int node, unsigned int lvl)
 {
 	struct tmigr_group *tmp, *group = NULL;
 
@@ -1567,25 +1692,51 @@ static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node,
 	return group;
 }
 
+static bool tmigr_init_root(struct tmigr_group *group, bool activate)
+{
+	if (!group->parent && group != tmigr_root) {
+		/*
+		 * This is the new top-level, prepare its groupmask in advance
+		 * to avoid accidents where yet another new top-level is
+		 * created in the future and made visible before this groupmask.
+		 */
+		group->groupmask = BIT(0);
+		WARN_ON_ONCE(activate);
+
+		return true;
+	}
+
+	return false;
+}
+
 static void tmigr_connect_child_parent(struct tmigr_group *child,
 				       struct tmigr_group *parent,
 				       bool activate)
 {
-	struct tmigr_walk data;
-
-	raw_spin_lock_irq(&child->lock);
-	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
+	if (tmigr_init_root(parent, activate)) {
+		/*
+		 * The previous top level had prepared its groupmask already,
+		 * simply account it in advance as the first child. If some groups
+		 * have been created between the old and new root due to node
+		 * mismatch, the new root's child will be initialized accordingly.
+		 */
+		parent->num_children = 1;
+	}
 
-	if (activate) {
+	/* Connecting old root to new root ? */
+	if (!parent->parent && activate) {
 		/*
-		 * @child is the old top and @parent the new one. In this
-		 * case groupmask is pre-initialized and @child already
-		 * accounted, along with its new sibling corresponding to the
-		 * CPU going up.
+		 * @child is the old top, or in case of node mismatch, some
+		 * intermediate group between the old top and the new one in
+		 * @parent. In this case the @child must be pre-accounted above
+		 * as the first child. Its new inactive sibling corresponding
+		 * to the CPU going up has been accounted as the second child.
 		 */
-		WARN_ON_ONCE(child->groupmask != BIT(0) || parent->num_children != 2);
+		WARN_ON_ONCE(parent->num_children != 2);
+		child->groupmask = BIT(0);
 	} else {
-		/* Adding @child for the CPU going up to @parent. */
+		/* Common case adding @child for the CPU going up to @parent. */
 		child->groupmask = BIT(parent->num_children++);
 	}
 
@@ -1596,87 +1747,61 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	 */
 	smp_store_release(&child->parent, parent);
 
-	raw_spin_unlock(&parent->lock);
-	raw_spin_unlock_irq(&child->lock);
-
 	trace_tmigr_connect_child_parent(child);
-
-	if (!activate)
-		return;
-
-	/*
-	 * To prevent inconsistent states, active children need to be active in
-	 * the new parent as well. Inactive children are already marked inactive
-	 * in the parent group:
-	 *
-	 * * When new groups were created by tmigr_setup_groups() starting from
-	 *   the lowest level (and not higher then one level below the current
-	 *   top level), then they are not active. They will be set active when
-	 *   the new online CPU comes active.
-	 *
-	 * * But if a new group above the current top level is required, it is
-	 *   mandatory to propagate the active state of the already existing
-	 *   child to the new parent.
-	 *   So tmigr_connect_child_parent() is
-	 *   executed with the formerly top level group (child) and the newly
-	 *   created group (parent).
-	 *
-	 * * It is ensured that the child is active, as this setup path is
-	 *   executed in hotplug prepare callback. This is exectued by an
-	 *   already connected and !idle CPU. Even if all other CPUs go idle,
-	 *   the CPU executing the setup will be responsible up to current top
-	 *   level group. And the next time it goes inactive, it will release
-	 *   the new childmask and parent to subsequent walkers through this
-	 *   @child. Therefore propagate active state unconditionally.
-	 */
-	data.childmask = child->groupmask;
-
-	/*
-	 * There is only one new level per time (which is protected by
-	 * tmigr_mutex). When connecting the child and the parent and set the
-	 * child active when the parent is inactive, the parent needs to be the
-	 * uppermost level. Otherwise there went something wrong!
-	 */
-	WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
 }
 
-static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
+static int tmigr_setup_groups(unsigned int cpu, unsigned int node,
+			      struct tmigr_group *start, bool activate)
 {
 	struct tmigr_group *group, *child, **stack;
-	int top = 0, err = 0, i = 0;
-	struct list_head *lvllist;
+	int i, top = 0, err = 0, start_lvl = 0;
+	bool root_mismatch = false;
 
 	stack = kcalloc(tmigr_hierarchy_levels, sizeof(*stack), GFP_KERNEL);
 	if (!stack)
 		return -ENOMEM;
 
-	do {
-		group = tmigr_get_group(cpu, node, i);
+	if (start) {
+		stack[start->level] = start;
+		start_lvl = start->level + 1;
+	}
+
+	if (tmigr_root)
+		root_mismatch = tmigr_root->numa_node != node;
+
+	for (i = start_lvl; i < tmigr_hierarchy_levels; i++) {
+		group = tmigr_get_group(node, i);
 		if (IS_ERR(group)) {
 			err = PTR_ERR(group);
+			i--;
 			break;
 		}
 
 		top = i;
-		stack[i++] = group;
+		stack[i] = group;
 
 		/*
 		 * When booting only less CPUs of a system than CPUs are
-		 * available, not all calculated hierarchy levels are required.
+		 * available, not all calculated hierarchy levels are required,
+		 * unless a node mismatch is detected.
 		 *
 		 * The loop is aborted as soon as the highest level, which might
 		 * be different from tmigr_hierarchy_levels, contains only a
-		 * single group.
+		 * single group, unless the nodes mismatch below tmigr_crossnode_level
 		 */
-		if (group->parent || list_is_singular(&tmigr_level_list[i - 1]))
-			break;
+		if (group->parent)
+			break;
+		if ((!root_mismatch || i >= tmigr_crossnode_level) &&
+		    list_is_singular(&tmigr_level_list[i]))
+			break;
+	}
 
-	} while (i < tmigr_hierarchy_levels);
-
-	/* Assert single root */
-	WARN_ON_ONCE(!err && !group->parent && !list_is_singular(&tmigr_level_list[top]));
-
-	while (i > 0) {
-		group = stack[--i];
+	/* Assert single root without parent */
+	if (WARN_ON_ONCE(i >= tmigr_hierarchy_levels))
+		return -EINVAL;
+
+	for (; i >= start_lvl; i--) {
+		group = stack[i];
 
 		if (err < 0) {
 			list_del(&group->list);
@@ -1692,12 +1817,10 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 		if (i == 0) {
 			struct tmigr_cpu *tmc = per_cpu_ptr(&tmigr_cpu, cpu);
 
-			raw_spin_lock_irq(&group->lock);
-
 			tmc->tmgroup = group;
 			tmc->groupmask = BIT(group->num_children++);
 
-			raw_spin_unlock_irq(&group->lock);
+			tmigr_init_root(group, activate);
 
 			trace_tmigr_connect_cpu_parent(tmc);
 
@@ -1705,42 +1828,58 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 			continue;
 		} else {
 			child = stack[i - 1];
-			/* Will be activated at online time */
-			tmigr_connect_child_parent(child, group, false);
+			tmigr_connect_child_parent(child, group, activate);
 		}
+	}
 
-		/* check if uppermost level was newly created */
-		if (top != i)
-			continue;
-
-		WARN_ON_ONCE(top == 0);
+	if (err < 0)
+		goto out;
 
-		lvllist = &tmigr_level_list[top];
+	if (activate) {
+		struct tmigr_walk data;
+		union tmigr_state state;
 
 		/*
-		 * Newly created root level should have accounted the upcoming
-		 * CPU's child group and pre-accounted the old root.
+		 * To prevent inconsistent states, active children need to be active in
+		 * the new parent as well. Inactive children are already marked inactive
+		 * in the parent group:
+		 *
+		 * * When new groups were created by tmigr_setup_groups() starting from
+		 *   the lowest level, then they are not active. They will be set active
+		 *   when the new online CPU comes active.
+		 *
+		 * * But if new groups above the current top level are required, it is
+		 *   mandatory to propagate the active state of the already existing
+		 *   child to the new parents. So tmigr_active_up() activates the
+		 *   new parents while walking up from the old root to the new.
+		 *
+		 * * It is ensured that @start is active, as this setup path is
+		 *   executed in hotplug prepare callback. This is executed by an
+		 *   already connected and !idle CPU. Even if all other CPUs go idle,
+		 *   the CPU executing the setup will be responsible up to current top
+		 *   level group. And the next time it goes inactive, it will release
+		 *   the new childmask and parent to subsequent walkers through this
+		 *   @child. Therefore propagate active state unconditionally.
 		 */
-		if (group->num_children == 2 && list_is_singular(lvllist)) {
-			/*
-			 * The target CPU must never do the prepare work, except
-			 * on early boot when the boot CPU is the target. Otherwise
-			 * it may spuriously activate the old top level group inside
-			 * the new one (nevertheless whether old top level group is
-			 * active or not) and/or release an uninitialized childmask.
-			 */
-			WARN_ON_ONCE(cpu == raw_smp_processor_id());
-
-			lvllist = &tmigr_level_list[top - 1];
-			list_for_each_entry(child, lvllist, list) {
-				if (child->parent)
-					continue;
+		state.state = atomic_read(&start->migr_state);
+		WARN_ON_ONCE(!state.active);
+		WARN_ON_ONCE(!start->parent);
+		data.childmask = start->groupmask;
+		__walk_groups_from(tmigr_active_up, &data, start, start->parent);
+	}
 
-				tmigr_connect_child_parent(child, group, true);
-			}
+	/* Root update */
+	if (list_is_singular(&tmigr_level_list[top])) {
+		group = list_first_entry(&tmigr_level_list[top],
+					 typeof(*group), list);
+		WARN_ON_ONCE(group->parent);
+		if (tmigr_root) {
+			/* Old root should be the same or below */
+			WARN_ON_ONCE(tmigr_root->level > top);
 		}
+		tmigr_root = group;
 	}
-
+out:
 	kfree(stack);
 
 	return err;
@@ -1748,12 +1887,31 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 
 static int tmigr_add_cpu(unsigned int cpu)
 {
+	struct tmigr_group *old_root = tmigr_root;
 	int node = cpu_to_node(cpu);
 	int ret;
 
-	mutex_lock(&tmigr_mutex);
-	ret = tmigr_setup_groups(cpu, node);
-	mutex_unlock(&tmigr_mutex);
+	guard(mutex)(&tmigr_mutex);
+
+	ret = tmigr_setup_groups(cpu, node, NULL, false);
+
+	/* Root has changed? Connect the old one to the new */
+	if (ret >= 0 && old_root && old_root != tmigr_root) {
+		/*
+		 * The target CPU must never do the prepare work, except
+		 * on early boot when the boot CPU is the target. Otherwise
+		 * it may spuriously activate the old top level group inside
+		 * the new one (nevertheless whether old top level group is
+		 * active or not) and/or release an uninitialized childmask.
+		 */
+		WARN_ON_ONCE(cpu == raw_smp_processor_id());
+		/*
+		 * The (likely) current CPU is expected to be online in the hierarchy,
+		 * otherwise the old root may not be active as expected.
+		 */
+		WARN_ON_ONCE(!per_cpu_ptr(&tmigr_cpu, raw_smp_processor_id())->available);
+		ret = tmigr_setup_groups(-1, old_root->numa_node, old_root, true);
+	}
 
 	return ret;
 }
@@ -1798,6 +1956,11 @@ static int __init tmigr_init(void)
 	if (ncpus == 1)
 		return 0;
 
+	if (!zalloc_cpumask_var(&tmigr_available_cpumask, GFP_KERNEL)) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
 	/*
 	 * Calculate the required hierarchy levels. Unfortunately there is no
Unfortunately there is no * reliable information available, unless all possible CPUs have been @@ -1847,7 +2010,7 @@ static int __init tmigr_init(void) goto err; =20 ret =3D cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online", - tmigr_cpu_online, tmigr_cpu_offline); + tmigr_set_cpu_available, tmigr_clear_cpu_available); if (ret) goto err; =20 diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h index ae19f70f8170..70879cde6fdd 100644 --- a/kernel/time/timer_migration.h +++ b/kernel/time/timer_migration.h @@ -97,7 +97,7 @@ struct tmigr_group { */ struct tmigr_cpu { raw_spinlock_t lock; - bool online; + bool available; bool idle; bool remote; struct tmigr_group *tmgroup; diff --git a/tools/testing/selftests/timers/nanosleep.c b/tools/testing/sel= ftests/timers/nanosleep.c index 252c6308c569..10badae13ebe 100644 --- a/tools/testing/selftests/timers/nanosleep.c +++ b/tools/testing/selftests/timers/nanosleep.c @@ -116,6 +116,56 @@ int nanosleep_test(int clockid, long long ns) return 0; } =20 +static void dummy_event_handler(int val) +{ + /* No action needed */ +} + +static int nanosleep_test_remaining(int clockid) +{ + struct timespec rqtp =3D {}, rmtp =3D {}; + struct itimerspec itimer =3D {}; + struct sigaction sa =3D {}; + timer_t timer; + int ret; + + sa.sa_handler =3D dummy_event_handler; + ret =3D sigaction(SIGALRM, &sa, NULL); + if (ret) + return -1; + + ret =3D timer_create(clockid, NULL, &timer); + if (ret) + return -1; + + itimer.it_value.tv_nsec =3D NSEC_PER_SEC / 4; + ret =3D timer_settime(timer, 0, &itimer, NULL); + if (ret) + return -1; + + rqtp.tv_nsec =3D NSEC_PER_SEC / 2; + ret =3D clock_nanosleep(clockid, 0, &rqtp, &rmtp); + if (ret !=3D EINTR) + return -1; + + ret =3D timer_delete(timer); + if (ret) + return -1; + + sa.sa_handler =3D SIG_DFL; + ret =3D sigaction(SIGALRM, &sa, NULL); + if (ret) + return -1; + + if (!in_order((struct timespec) {}, rmtp)) + return -1; + + if (!in_order(rmtp, rqtp)) + return -1; + + return 0; +} + int main(int argc, char **argv) { long long length; @@ -150,6 +200,11 @@ int main(int argc, char **argv) } length *=3D 100; } + ret =3D nanosleep_test_remaining(clockid); + if (ret < 0) { + ksft_test_result_fail("%-31s\n", clockstring(clockid)); + ksft_exit_fail(); + } ksft_test_result_pass("%-31s\n", clockstring(clockid)); next: ret =3D 0; diff --git a/tools/testing/selftests/timers/posix_timers.c b/tools/testing/= selftests/timers/posix_timers.c index f0eceb0faf34..a563c438ac79 100644 --- a/tools/testing/selftests/timers/posix_timers.c +++ b/tools/testing/selftests/timers/posix_timers.c @@ -18,6 +18,7 @@ #include #include #include +#include =20 #include "../kselftest.h" =20 @@ -670,8 +671,14 @@ static void check_timer_create_exact(void) =20 int main(int argc, char **argv) { + bool run_sig_ign_tests =3D ksft_min_kernel_version(6, 13); + ksft_print_header(); - ksft_set_plan(19); + if (run_sig_ign_tests) { + ksft_set_plan(19); + } else { + ksft_set_plan(10); + } =20 ksft_print_msg("Testing posix timers. 
diff --git a/tools/testing/selftests/timers/posix_timers.c b/tools/testing/selftests/timers/posix_timers.c
index f0eceb0faf34..a563c438ac79 100644
--- a/tools/testing/selftests/timers/posix_timers.c
+++ b/tools/testing/selftests/timers/posix_timers.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 
 #include "../kselftest.h"
 
@@ -670,8 +671,14 @@ static void check_timer_create_exact(void)
 
 int main(int argc, char **argv)
 {
+	bool run_sig_ign_tests = ksft_min_kernel_version(6, 13);
+
 	ksft_print_header();
-	ksft_set_plan(19);
+	if (run_sig_ign_tests) {
+		ksft_set_plan(19);
+	} else {
+		ksft_set_plan(10);
+	}
 
 	ksft_print_msg("Testing posix timers. False negative may happen on CPU execution \n");
 	ksft_print_msg("based timers if other threads run on the CPU...\n");
@@ -695,15 +702,20 @@ int main(int argc, char **argv)
 	check_timer_create(CLOCK_PROCESS_CPUTIME_ID, "CLOCK_PROCESS_CPUTIME_ID");
 	check_timer_distribution();
 
-	check_sig_ign(0);
-	check_sig_ign(1);
-	check_rearm();
-	check_delete();
-	check_sigev_none(CLOCK_MONOTONIC, "CLOCK_MONOTONIC");
-	check_sigev_none(CLOCK_PROCESS_CPUTIME_ID, "CLOCK_PROCESS_CPUTIME_ID");
-	check_gettime(CLOCK_MONOTONIC, "CLOCK_MONOTONIC");
-	check_gettime(CLOCK_PROCESS_CPUTIME_ID, "CLOCK_PROCESS_CPUTIME_ID");
-	check_gettime(CLOCK_THREAD_CPUTIME_ID, "CLOCK_THREAD_CPUTIME_ID");
+	if (run_sig_ign_tests) {
+		check_sig_ign(0);
+		check_sig_ign(1);
+		check_rearm();
+		check_delete();
+		check_sigev_none(CLOCK_MONOTONIC, "CLOCK_MONOTONIC");
+		check_sigev_none(CLOCK_PROCESS_CPUTIME_ID, "CLOCK_PROCESS_CPUTIME_ID");
+		check_gettime(CLOCK_MONOTONIC, "CLOCK_MONOTONIC");
+		check_gettime(CLOCK_PROCESS_CPUTIME_ID, "CLOCK_PROCESS_CPUTIME_ID");
+		check_gettime(CLOCK_THREAD_CPUTIME_ID, "CLOCK_THREAD_CPUTIME_ID");
+	} else {
+		ksft_print_msg("Skipping SIG_IGN tests on kernel < 6.13\n");
+	}
+
 	check_overrun(CLOCK_MONOTONIC, "CLOCK_MONOTONIC");
 	check_overrun(CLOCK_PROCESS_CPUTIME_ID, "CLOCK_PROCESS_CPUTIME_ID");
 	check_overrun(CLOCK_THREAD_CPUTIME_ID, "CLOCK_THREAD_CPUTIME_ID");