Linus,
please pull the latest core/debugobjects branch from:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core-debugobjects-2026-04-12
up to: 723ddce93e8d: debugobjects: Drop likely() around !IS_ERR_OR_NULL()
A trivial update for debugobjects to drop a pointless likely() around
IS_ERR_OR_NULL().
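
For readers wondering why the hint is pointless: as far as I recall, the kernel's IS_ERR_OR_NULL() already wraps both of its checks in unlikely(), so an outer likely() around the negation adds no information. The userspace sketch below mimics that with stand-in helpers. The names is_err_value(), is_err_or_null() and lookup_ok() are illustrative assumptions, not the definitions from include/linux/err.h, and __builtin_expect() assumes GCC or Clang.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO	4095

/* Branch hints in the style of the kernel macros (GCC/Clang builtin) */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Stand-in: error pointers occupy the top MAX_ERRNO addresses */
static inline bool is_err_value(uintptr_t v)
{
	return unlikely(v >= (uintptr_t)-MAX_ERRNO);
}

/* Stand-in: both failure cases are already marked as unlikely */
static inline bool is_err_or_null(const void *ptr)
{
	return unlikely(!ptr) || is_err_value((uintptr_t)ptr);
}

/* The success check needs no additional likely() around it */
static inline bool lookup_ok(const void *obj)
{
	return !is_err_or_null(obj);
}
```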
Thanks,
tglx
------------------>
Philipp Hahn (1):
debugobjects: Drop likely() around !IS_ERR_OR_NULL()
lib/debugobjects.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/debugobjects.c b/lib/debugobjects.c
index 12f50de85b62..12e2e42e6a31 100644
--- a/lib/debugobjects.c
+++ b/lib/debugobjects.c
@@ -1024,7 +1024,7 @@ void debug_object_assert_init(void *addr, const struct debug_obj_descr *descr)
raw_spin_lock_irqsave(&db->lock, flags);
obj = lookup_object_or_alloc(addr, db, descr, false, true);
raw_spin_unlock_irqrestore(&db->lock, flags);
- if (likely(!IS_ERR_OR_NULL(obj)))
+ if (!IS_ERR_OR_NULL(obj))
return;
/* If NULL the allocation has hit OOM */
The pull request you sent on Sun, 12 Apr 2026 19:45:55 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core-debugobjects-2026-04-12

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/2ad332b0e221dedc4c483faef2003be3655f9d77

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
Linus,
please pull the latest timers/core branch from:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers-core-2026-04-12
up to: ff1c0c5d0702: Merge branch 'timers/urgent' into timers/core
Updates for the timer/timekeeping core:
- A rework of the hrtimer subsystem to reduce the overhead for frequently
armed timers, especially the hrtick scheduler timer.
- Better timer locality decision
- Simplification of the evaluation of the first expiry time by
keeping track of the neighbor timers in the RB-tree by providing a
RB-tree variant with neighbor links. That avoids walking the
RB-tree on removal to find the next expiry time and, more
importantly, makes it cheap to evaluate whether a rearmed timer
changes its position in the RB-tree with the modified expiry time.
If it does not, the dequeue/enqueue sequence, both halves of which
can end up in rebalancing, can be avoided completely.
- Deferred reprogramming of the underlying clock event device. This
optimizes for the situation where a hrtimer callback sets the need
resched bit. In that case the code attempts to defer the
re-programming of the clock event device up to the point where the
scheduler has picked the next task and has the next hrtick timer
armed. If there is no immediate reschedule, or soft interrupts
have to be handled before reaching the reschedule point in the
interrupt entry code, the clock event is reprogrammed in one of
those code paths to prevent the timer from becoming stale.
- Support for clocksource coupled clockevents
The TSC deadline timer is coupled to the TSC. The next event is
programmed in TSC time. Currently this is done by converting the
CLOCK_MONOTONIC based expiry value into a relative timeout,
converting it into TSC ticks, reading the TSC, adding the delta
ticks and writing the deadline MSR.
As the timekeeping core has the conversion factors for the TSC
already, the whole back and forth conversion can be completely
avoided. The timekeeping core calculates the reverse conversion
factors from nanoseconds to TSC ticks and utilizes the base
timestamps of TSC and CLOCK_MONOTONIC which are updated once per
tick. This allows a direct conversion into the TSC deadline value
without reading the time and as a bonus keeps the deadline
conversion in sync with the TSC conversion factors, which are
updated by adjtimex() on systems with NTP/PTP enabled.
- Allow inlining of the clocksource read and clockevent write
functions when they are tiny enough, e.g. on x86 RDTSC and WRMSR.
With all those enhancements in place a hrtick enabled scheduler
provides the same performance as without hrtick. But also other hrtimer
users obviously benefit from these optimizations.
- Robustness improvements and cleanups of historical sins in the hrtimer
and timekeeping code.
- Rewrite of the clocksource watchdog.
The clocksource watchdog code has over time reached the state of an
impenetrable maze of duct tape and staples. The original design, which was
made in the context of systems far smaller than today, is based on the
assumption that the to be monitored clocksource (TSC) can be trivially
compared against a known to be stable clocksource (HPET/ACPI-PM timer).
Over the years this rather naive approach turned out to have major
flaws. Long delays between the watchdog invocations can cause wrap
arounds of the reference clocksource. The access to the reference
clocksource degrades on large multi-socket systems due to
interconnect congestion. This has been addressed with various
heuristics which degraded the accuracy of the watchdog to the point
that it fails to detect actual TSC problems on older hardware which
exposes slow inter CPU drifts due to firmware manipulating the TSC to
hide SMI time.
The rewrite addresses this by:
- Restricting the validation against the reference clocksource to the
boot CPU which is usually closest to the legacy block which
contains the reference clocksource (HPET/ACPI-PM).
- Doing a round-robin validation between the boot CPU and the other CPUs
based only on the TSC with an algorithm similar to the TSC
synchronization code during CPU hotplug.
- Being more lenient towards remote timeouts
- The usual tiny fixes, cleanups and enhancements all over the place
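
To illustrate the in-place modification enabled by the neighbor links described above: once every node knows its in-order predecessor and successor, checking whether a new expiry time keeps the node in the same position takes two comparisons. The sketch below is a simplified userspace model. struct linked_node and can_modify_in_place() are hypothetical names and do not reflect the actual rbtree or timerqueue interfaces in this series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a tree node with neighbor links */
struct linked_node {
	int64_t			expires;
	struct linked_node	*prev;	/* in-order predecessor or NULL */
	struct linked_node	*next;	/* in-order successor or NULL */
};

/*
 * Returns true when updating the expiry time leaves the node between
 * its current neighbors, so no dequeue/enqueue (and therefore no
 * rebalancing) is required. Equal expiry times are accepted here
 * because the order stays sorted; a real implementation may handle
 * ties differently.
 */
static bool can_modify_in_place(const struct linked_node *node, int64_t new_expires)
{
	if (node->prev && new_expires < node->prev->expires)
		return false;
	if (node->next && new_expires > node->next->expires)
		return false;
	return true;
}
```

In this model the first expiring timer is simply the node without a predecessor, which is why removal does not need a tree walk to find the next expiry.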
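
The clocksource coupled clockevent conversion described above can be modeled roughly as follows: with a base pair of (counter value, CLOCK_MONOTONIC nanoseconds) captured at the same instant and a nanoseconds-to-cycles factor, an absolute expiry time maps directly to a deadline value without reading the counter. This is a minimal sketch under stated assumptions. struct coupled_base, its fields and expiry_to_deadline() are hypothetical and are not the timekeeping core's data structures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot, assumed to be refreshed once per tick */
struct coupled_base {
	uint64_t	cycles_base;	/* counter value at the snapshot */
	int64_t		ns_base;	/* CLOCK_MONOTONIC ns at the snapshot */
	uint32_t	mult;		/* ns to cycles multiplier */
	uint32_t	shift;		/* ns to cycles shift */
};

/*
 * Convert an absolute expiry time in nanoseconds into an absolute
 * counter deadline. No counter read is involved. The multiplication
 * can overflow for very large deltas; a real implementation has to
 * bound the delta, which this sketch omits.
 */
static uint64_t expiry_to_deadline(const struct coupled_base *base, int64_t expires_ns)
{
	uint64_t delta_ns;

	assert(base->shift < 64);

	/* Already expired: fire as early as the model allows */
	if (expires_ns <= base->ns_base)
		return base->cycles_base;

	delta_ns = (uint64_t)(expires_ns - base->ns_base);
	return base->cycles_base + ((delta_ns * base->mult) >> base->shift);
}
```

For example, with mult = 2048 and shift = 10 (two cycles per nanosecond), an expiry 1000 ns after the snapshot yields a deadline 2000 cycles after cycles_base.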
Thanks,
tglx
------------------>
Ingo Molnar (1):
sched/hrtick: Mark hrtick_clear() as always used
Josh Snyder (1):
tick/nohz: Fix inverted return value in check_tick_dependency() fast path
Peter Zijlstra (12):
sched/eevdf: Fix HRTICK duration
hrtimer: Avoid pointless reprogramming in __hrtimer_start_range_ns()
hrtimer: Provide LAZY_REARM mode
sched/hrtick: Mark hrtick timer LAZY_REARM
hrtimer: Re-arrange hrtimer_interrupt()
hrtimer: Prepare stubs for deferred rearming
entry: Prepare for deferred hrtimer rearming
softirq: Prepare for deferred hrtimer rearming
sched/core: Prepare for deferred hrtimer rearming
hrtimer: Push reprogramming timers into the interrupt return path
sched: Default enable HRTICK when deferred rearming is enabled
hrtimer: Less agressive interrupt 'hang' handling
Peter Zijlstra (Intel) (2):
sched/fair: Simplify hrtick_update()
sched/fair: Make hrtick resched hard
Petr Pavlu (1):
jiffies: Remove unused __jiffy_arch_data
Ryota Sakamoto (1):
time/kunit: Add .kunitconfig
Shrikanth Hegde (1):
timers: Get this_cpu once while clearing the idle state
Thomas Gleixner (43):
sched: Avoid ktime_get() indirection
hrtimer: Provide a static branch based hrtimer_hres_enabled()
sched: Use hrtimer_highres_enabled()
sched: Optimize hrtimer handling
sched/hrtick: Avoid tiny hrtick rearms
tick/sched: Avoid hrtimer_cancel/start() sequence
clockevents: Remove redundant CLOCK_EVT_FEAT_KTIME
timekeeping: Allow inlining clocksource::read()
x86: Inline TSC reads in timekeeping
x86/apic: Remove pointless fence in lapic_next_deadline()
x86/apic: Avoid the PVOPS indirection for the TSC deadline timer
timekeeping: Provide infrastructure for coupled clockevents
clockevents: Provide support for clocksource coupled comparators
x86/apic: Enable TSC coupled programming mode
hrtimer: Add debug object init assertion
hrtimer: Reduce trace noise in hrtimer_start()
hrtimer: Use guards where appropriate
hrtimer: Cleanup coding style and comments
hrtimer: Evaluate timer expiry only once
hrtimer: Replace the bitfield in hrtimer_cpu_base
hrtimer: Convert state and properties to boolean
hrtimer: Optimize for local timers
hrtimer: Use NOHZ information for locality
hrtimer: Separate remove/enqueue handling for local timers
hrtimer: Add hrtimer_rearm tracepoint
hrtimer: Rename hrtimer_cpu_base::in_hrtirq to deferred_rearm
hrtimer: Avoid re-evaluation when nothing changed
hrtimer: Keep track of first expiring timer per clock base
hrtimer: Rework next event evaluation
hrtimer: Simplify run_hrtimer_queues()
hrtimer: Optimize for_each_active_base()
rbtree: Provide rbtree with links
timerqueue: Provide linked timerqueue
hrtimer: Use linked timerqueue
hrtimer: Try to modify timers in place
timekeeping: Initialize the coupled clocksource conversion completely
clocksource: Update clocksource::freq_khz on registration
parisc: Remove unused clocksource flags
MIPS: Don't select CLOCKSOURCE_WATCHDOG
x86/tsc: Handle CLOCK_SOURCE_VALID_FOR_HRES correctly
clocksource: Don't use non-continuous clocksources as watchdog
clocksource: Rewrite watchdog code completely
clockevents: Prevent timer interrupt starvation
Thomas Weißschuh (Schneider Electric) (12):
scripts/gdb: timerlist: Adapt to move of tk_core
tracing: Use explicit array size instead of sentinel elements in symbol printing
timer_list: Print offset as signed integer
timekeeping/auxclock: Consistently use raw timekeeper for tk_setup_internals()
timekeeping: Mark offsets array as const
hrtimer: Remove hrtimer_get_expires_ns()
hrtimer: Don't zero-initialize ret in hrtimer_nanosleep()
hrtimer: Drop spurious space in 'enum hrtimer_base_type'
hrtimer: Drop unnecessary pointer indirection in hrtimer_expire_entry event
hrtimer: Mark index and clockid of clock base as const
hrtimer: Remove trailing comma after HRTIMER_MAX_CLOCK_BASES
hrtimer: Add a helper to retrieve a hrtimer from its timerqueue node
Zhan Xusheng (3):
posix-timers: Fix stale function name in comment
hrtimer: Fix incorrect #endif comment for BITS_PER_LONG check
alarmtimer: Access timerqueue node under lock in suspend
Documentation/admin-guide/kernel-parameters.txt | 7 +-
MAINTAINERS | 1 +
arch/mips/Kconfig | 1 -
arch/parisc/kernel/time.c | 5 +-
arch/x86/Kconfig | 2 +
arch/x86/include/asm/clock_inlined.h | 22 +
arch/x86/include/asm/time.h | 1 -
arch/x86/kernel/apic/apic.c | 41 +-
arch/x86/kernel/hpet.c | 4 +-
arch/x86/kernel/tsc.c | 61 +-
drivers/clocksource/Kconfig | 1 -
drivers/clocksource/acpi_pm.c | 4 +-
include/asm-generic/thread_info_tif.h | 5 +-
include/linux/clockchips.h | 12 +-
include/linux/clocksource.h | 27 +-
include/linux/hrtimer.h | 64 +-
include/linux/hrtimer_defs.h | 83 +-
include/linux/hrtimer_rearm.h | 83 ++
include/linux/hrtimer_types.h | 19 +-
include/linux/irq-entry-common.h | 25 +-
include/linux/jiffies.h | 6 +-
include/linux/rbtree.h | 81 +-
include/linux/rbtree_types.h | 16 +
include/linux/rseq_entry.h | 16 +-
include/linux/timekeeper_internal.h | 8 +
include/linux/timerqueue.h | 56 +-
include/linux/timerqueue_types.h | 15 +-
include/linux/trace_events.h | 13 +-
include/trace/events/timer.h | 42 +-
include/trace/stages/stage3_trace_output.h | 40 +-
kernel/entry/common.c | 4 +-
kernel/sched/core.c | 91 +-
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 55 +-
kernel/sched/features.h | 5 +
kernel/sched/sched.h | 41 +-
kernel/softirq.c | 15 +-
kernel/time/.kunitconfig | 2 +
kernel/time/Kconfig | 28 +-
kernel/time/alarmtimer.c | 12 +-
kernel/time/clockevents.c | 71 +-
kernel/time/clocksource-wdtest.c | 268 +++---
kernel/time/clocksource.c | 805 ++++++++--------
kernel/time/hrtimer.c | 1128 +++++++++++++----------
kernel/time/jiffies.c | 1 -
kernel/time/posix-timers.c | 2 +-
kernel/time/tick-broadcast-hrtimer.c | 1 -
kernel/time/tick-broadcast.c | 8 +-
kernel/time/tick-common.c | 1 +
kernel/time/tick-sched.c | 30 +-
kernel/time/timekeeping.c | 203 +++-
kernel/time/timekeeping.h | 2 +
kernel/time/timer.c | 5 +-
kernel/time/timer_list.c | 16 +-
kernel/trace/trace_events_synth.c | 4 +-
kernel/trace/trace_output.c | 20 +-
kernel/trace/trace_syscalls.c | 3 +-
lib/rbtree.c | 17 +
lib/timerqueue.c | 14 +
scripts/gdb/linux/timerlist.py | 2 +-
60 files changed, 2222 insertions(+), 1395 deletions(-)
create mode 100644 arch/x86/include/asm/clock_inlined.h
create mode 100644 include/linux/hrtimer_rearm.h
create mode 100644 kernel/time/.kunitconfig
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 03a550630644..bd4e6c0b2f0a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7963,12 +7963,7 @@ Kernel parameters
(HPET or PM timer) on systems whose TSC frequency was
obtained from HW or FW using either an MSR or CPUID(0x15).
Warn if the difference is more than 500 ppm.
- [x86] watchdog: Use TSC as the watchdog clocksource with
- which to check other HW timers (HPET or PM timer), but
- only on systems where TSC has been deemed trustworthy.
- This will be suppressed by an earlier tsc=nowatchdog and
- can be overridden by a later tsc=nowatchdog. A console
- message will flag any such suppression or overriding.
+ [x86] watchdog: Enforce the clocksource watchdog on TSC
tsc_early_khz= [X86,EARLY] Skip early TSC calibration and use the given
value instead. Useful when the early TSC frequency discovery
diff --git a/MAINTAINERS b/MAINTAINERS
index c3fe46d7c4bc..292e9ce3b65e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -26621,6 +26621,7 @@ F: include/linux/timekeeping.h
F: include/linux/timex.h
F: include/uapi/linux/time.h
F: include/uapi/linux/timex.h
+F: kernel/time/.kunitconfig
F: kernel/time/alarmtimer.c
F: kernel/time/clocksource*
F: kernel/time/ntp*
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index e48b62b4dc48..4364f3dba688 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -1131,7 +1131,6 @@ config CSRC_IOASIC
bool
config CSRC_R4K
- select CLOCKSOURCE_WATCHDOG if CPU_FREQ
bool
config CSRC_SB1250
diff --git a/arch/parisc/kernel/time.c b/arch/parisc/kernel/time.c
index 94dc48455dc6..71c9d5426995 100644
--- a/arch/parisc/kernel/time.c
+++ b/arch/parisc/kernel/time.c
@@ -210,12 +210,9 @@ static struct clocksource clocksource_cr16 = {
.read = read_cr16,
.mask = CLOCKSOURCE_MASK(BITS_PER_LONG),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
- CLOCK_SOURCE_VALID_FOR_HRES |
- CLOCK_SOURCE_MUST_VERIFY |
- CLOCK_SOURCE_VERIFY_PERCPU,
+ CLOCK_SOURCE_VALID_FOR_HRES,
};
-
/*
* timer interrupt and sched_clock() initialization
*/
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e2df1b147184..560d2ce8cedd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -141,6 +141,7 @@ config X86
select ARCH_USE_SYM_ANNOTATIONS
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
select ARCH_WANT_DEFAULT_BPF_JIT if X86_64
+ select ARCH_WANTS_CLOCKSOURCE_READ_INLINE if X86_64
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANTS_NO_INSTR
select ARCH_WANT_GENERAL_HUGETLB
@@ -163,6 +164,7 @@ config X86
select EDAC_SUPPORT
select GENERIC_CLOCKEVENTS_BROADCAST if X86_64 || (X86_32 && X86_LOCAL_APIC)
select GENERIC_CLOCKEVENTS_BROADCAST_IDLE if GENERIC_CLOCKEVENTS_BROADCAST
+ select GENERIC_CLOCKEVENTS_COUPLED_INLINE if X86_64
select GENERIC_CLOCKEVENTS_MIN_ADJUST
select GENERIC_CMOS_UPDATE
select GENERIC_CPU_AUTOPROBE
diff --git a/arch/x86/include/asm/clock_inlined.h b/arch/x86/include/asm/clock_inlined.h
new file mode 100644
index 000000000000..b2dee8db2fb9
--- /dev/null
+++ b/arch/x86/include/asm/clock_inlined.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CLOCK_INLINED_H
+#define _ASM_X86_CLOCK_INLINED_H
+
+#include <asm/tsc.h>
+
+struct clocksource;
+
+static __always_inline u64 arch_inlined_clocksource_read(struct clocksource *cs)
+{
+ return (u64)rdtsc_ordered();
+}
+
+struct clock_event_device;
+
+static __always_inline void
+arch_inlined_clockevent_set_next_coupled(u64 cycles, struct clock_event_device *evt)
+{
+ native_wrmsrq(MSR_IA32_TSC_DEADLINE, cycles);
+}
+
+#endif
diff --git a/arch/x86/include/asm/time.h b/arch/x86/include/asm/time.h
index f360104ed172..459780c3ed1f 100644
--- a/arch/x86/include/asm/time.h
+++ b/arch/x86/include/asm/time.h
@@ -7,7 +7,6 @@
extern void hpet_time_init(void);
extern bool pit_timer_init(void);
-extern bool tsc_clocksource_watchdog_disabled(void);
extern struct clock_event_device *global_clock_event;
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 961714e6adae..0c8970c4c3e3 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -412,23 +412,21 @@ EXPORT_SYMBOL_GPL(setup_APIC_eilvt);
/*
* Program the next event, relative to now
*/
-static int lapic_next_event(unsigned long delta,
- struct clock_event_device *evt)
+static int lapic_next_event(unsigned long delta, struct clock_event_device *evt)
{
apic_write(APIC_TMICT, delta);
return 0;
}
-static int lapic_next_deadline(unsigned long delta,
- struct clock_event_device *evt)
+static int lapic_next_deadline(unsigned long delta, struct clock_event_device *evt)
{
- u64 tsc;
-
- /* This MSR is special and need a special fence: */
- weak_wrmsr_fence();
+ /*
+ * There is no weak_wrmsr_fence() required here as all of this is purely
+ * CPU local. Avoid the [ml]fence overhead.
+ */
+ u64 tsc = rdtsc();
- tsc = rdtsc();
- wrmsrq(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
+ native_wrmsrq(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
return 0;
}
@@ -452,7 +450,7 @@ static int lapic_timer_shutdown(struct clock_event_device *evt)
* the timer _and_ zero the counter registers:
*/
if (v & APIC_LVT_TIMER_TSCDEADLINE)
- wrmsrq(MSR_IA32_TSC_DEADLINE, 0);
+ native_wrmsrq(MSR_IA32_TSC_DEADLINE, 0);
else
apic_write(APIC_TMICT, 0);
@@ -549,6 +547,11 @@ static __init bool apic_validate_deadline_timer(void)
if (!boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
return false;
+
+ /* XEN_PV does not support it, but be paranoia about it */
+ if (boot_cpu_has(X86_FEATURE_XENPV))
+ goto clear;
+
if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
return true;
@@ -561,9 +564,11 @@ static __init bool apic_validate_deadline_timer(void)
if (boot_cpu_data.microcode >= rev)
return true;
- setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
pr_err(FW_BUG "TSC_DEADLINE disabled due to Errata; "
"please update microcode to version: 0x%x (or later)\n", rev);
+
+clear:
+ setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
return false;
}
@@ -586,14 +591,14 @@ static void setup_APIC_timer(void)
if (this_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER)) {
levt->name = "lapic-deadline";
- levt->features &= ~(CLOCK_EVT_FEAT_PERIODIC |
- CLOCK_EVT_FEAT_DUMMY);
+ levt->features &= ~(CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_DUMMY);
+ levt->features |= CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED;
+ levt->cs_id = CSID_X86_TSC;
levt->set_next_event = lapic_next_deadline;
- clockevents_config_and_register(levt,
- tsc_khz * (1000 / TSC_DIVISOR),
- 0xF, ~0UL);
- } else
+ clockevents_config_and_register(levt, tsc_khz * (1000 / TSC_DIVISOR), 0xF, ~0UL);
+ } else {
clockevents_register_device(levt);
+ }
apic_update_vector(smp_processor_id(), LOCAL_TIMER_VECTOR, true);
}
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 610590e83445..8dc7b710e125 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -854,7 +854,7 @@ static struct clocksource clocksource_hpet = {
.rating = 250,
.read = read_hpet,
.mask = HPET_MASK,
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_CALIBRATED,
.resume = hpet_resume_counter,
};
@@ -1082,8 +1082,6 @@ int __init hpet_enable(void)
if (!hpet_counting())
goto out_nohpet;
- if (tsc_clocksource_watchdog_disabled())
- clocksource_hpet.flags |= CLOCK_SOURCE_MUST_VERIFY;
clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);
if (id & HPET_ID_LEGSUP) {
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index d9aa694e43f3..c5110eb554bc 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -322,12 +322,16 @@ int __init notsc_setup(char *str)
return 1;
}
#endif
-
__setup("notsc", notsc_setup);
+enum {
+ TSC_WATCHDOG_AUTO,
+ TSC_WATCHDOG_OFF,
+ TSC_WATCHDOG_ON,
+};
+
static int no_sched_irq_time;
-static int no_tsc_watchdog;
-static int tsc_as_watchdog;
+static int tsc_watchdog;
static int __init tsc_setup(char *str)
{
@@ -337,25 +341,14 @@ static int __init tsc_setup(char *str)
no_sched_irq_time = 1;
if (!strcmp(str, "unstable"))
mark_tsc_unstable("boot parameter");
- if (!strcmp(str, "nowatchdog")) {
- no_tsc_watchdog = 1;
- if (tsc_as_watchdog)
- pr_alert("%s: Overriding earlier tsc=watchdog with tsc=nowatchdog\n",
- __func__);
- tsc_as_watchdog = 0;
- }
+ if (!strcmp(str, "nowatchdog"))
+ tsc_watchdog = TSC_WATCHDOG_OFF;
if (!strcmp(str, "recalibrate"))
tsc_force_recalibrate = 1;
- if (!strcmp(str, "watchdog")) {
- if (no_tsc_watchdog)
- pr_alert("%s: tsc=watchdog overridden by earlier tsc=nowatchdog\n",
- __func__);
- else
- tsc_as_watchdog = 1;
- }
+ if (!strcmp(str, "watchdog"))
+ tsc_watchdog = TSC_WATCHDOG_ON;
return 1;
}
-
__setup("tsc=", tsc_setup);
#define MAX_RETRIES 5
@@ -1175,7 +1168,6 @@ static int tsc_cs_enable(struct clocksource *cs)
static struct clocksource clocksource_tsc_early = {
.name = "tsc-early",
.rating = 299,
- .uncertainty_margin = 32 * NSEC_PER_MSEC,
.read = read_tsc,
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
@@ -1200,9 +1192,9 @@ static struct clocksource clocksource_tsc = {
.read = read_tsc,
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS |
- CLOCK_SOURCE_VALID_FOR_HRES |
+ CLOCK_SOURCE_CAN_INLINE_READ |
CLOCK_SOURCE_MUST_VERIFY |
- CLOCK_SOURCE_VERIFY_PERCPU,
+ CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT,
.id = CSID_X86_TSC,
.vdso_clock_mode = VDSO_CLOCKMODE_TSC,
.enable = tsc_cs_enable,
@@ -1230,16 +1222,12 @@ EXPORT_SYMBOL_GPL(mark_tsc_unstable);
static void __init tsc_disable_clocksource_watchdog(void)
{
+ if (tsc_watchdog == TSC_WATCHDOG_ON)
+ return;
clocksource_tsc_early.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
}
-bool tsc_clocksource_watchdog_disabled(void)
-{
- return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY) &&
- tsc_as_watchdog && !no_tsc_watchdog;
-}
-
static void __init check_system_tsc_reliable(void)
{
#if defined(CONFIG_MGEODEGX1) || defined(CONFIG_MGEODE_LX) || defined(CONFIG_X86_GENERIC)
@@ -1394,6 +1382,8 @@ static void tsc_refine_calibration_work(struct work_struct *work)
(unsigned long)tsc_khz / 1000,
(unsigned long)tsc_khz % 1000);
+ clocksource_tsc.flags |= CLOCK_SOURCE_CALIBRATED;
+
/* Inform the TSC deadline clockevent devices about the recalibration */
lapic_update_tsc_freq();
@@ -1409,6 +1399,15 @@ static void tsc_refine_calibration_work(struct work_struct *work)
have_art = true;
clocksource_tsc.base = &art_base_clk;
}
+
+ /*
+ * Transfer the valid for high resolution flag if it was set on the
+ * early TSC already. That guarantees that there is no intermediate
+ * clocksource selected once the early TSC is unregistered.
+ */
+ if (clocksource_tsc_early.flags & CLOCK_SOURCE_VALID_FOR_HRES)
+ clocksource_tsc.flags |= CLOCK_SOURCE_VALID_FOR_HRES;
+
clocksource_register_khz(&clocksource_tsc, tsc_khz);
unreg:
clocksource_unregister(&clocksource_tsc_early);
@@ -1460,12 +1459,10 @@ static bool __init determine_cpu_tsc_frequencies(bool early)
if (early) {
cpu_khz = x86_platform.calibrate_cpu();
- if (tsc_early_khz) {
+ if (tsc_early_khz)
tsc_khz = tsc_early_khz;
- } else {
+ else
tsc_khz = x86_platform.calibrate_tsc();
- clocksource_tsc.freq_khz = tsc_khz;
- }
} else {
/* We should not be here with non-native cpu calibration */
WARN_ON(x86_platform.calibrate_cpu != native_calibrate_cpu);
@@ -1569,7 +1566,7 @@ void __init tsc_init(void)
return;
}
- if (tsc_clocksource_reliable || no_tsc_watchdog)
+ if (tsc_clocksource_reliable || tsc_watchdog == TSC_WATCHDOG_OFF)
tsc_disable_clocksource_watchdog();
clocksource_register_khz(&clocksource_tsc_early, tsc_khz);
diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
index fd9112706545..d1a33a231a44 100644
--- a/drivers/clocksource/Kconfig
+++ b/drivers/clocksource/Kconfig
@@ -596,7 +596,6 @@ config CLKSRC_VERSATILE
config CLKSRC_MIPS_GIC
bool
depends on MIPS_GIC
- select CLOCKSOURCE_WATCHDOG
select TIMER_OF
config CLKSRC_PXA
diff --git a/drivers/clocksource/acpi_pm.c b/drivers/clocksource/acpi_pm.c
index b4330a01a566..67792937242f 100644
--- a/drivers/clocksource/acpi_pm.c
+++ b/drivers/clocksource/acpi_pm.c
@@ -98,7 +98,7 @@ static struct clocksource clocksource_acpi_pm = {
.rating = 200,
.read = acpi_pm_read,
.mask = (u64)ACPI_PM_MASK,
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_CALIBRATED,
.suspend = acpi_pm_suspend,
.resume = acpi_pm_resume,
};
@@ -243,8 +243,6 @@ static int __init init_acpi_pm_clocksource(void)
return -ENODEV;
}
- if (tsc_clocksource_watchdog_disabled())
- clocksource_acpi_pm.flags |= CLOCK_SOURCE_MUST_VERIFY;
return clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
}
diff --git a/include/asm-generic/thread_info_tif.h b/include/asm-generic/thread_info_tif.h
index da1610a78f92..528e6fc7efe9 100644
--- a/include/asm-generic/thread_info_tif.h
+++ b/include/asm-generic/thread_info_tif.h
@@ -41,11 +41,14 @@
#define _TIF_PATCH_PENDING BIT(TIF_PATCH_PENDING)
#ifdef HAVE_TIF_RESTORE_SIGMASK
-# define TIF_RESTORE_SIGMASK 10 // Restore signal mask in do_signal() */
+# define TIF_RESTORE_SIGMASK 10 // Restore signal mask in do_signal()
# define _TIF_RESTORE_SIGMASK BIT(TIF_RESTORE_SIGMASK)
#endif
#define TIF_RSEQ 11 // Run RSEQ fast path
#define _TIF_RSEQ BIT(TIF_RSEQ)
+#define TIF_HRTIMER_REARM 12 // re-arm the timer
+#define _TIF_HRTIMER_REARM BIT(TIF_HRTIMER_REARM)
+
#endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index b0df28ddd394..6adb72761246 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -43,9 +43,9 @@ enum clock_event_state {
/*
* Clock event features
*/
-# define CLOCK_EVT_FEAT_PERIODIC 0x000001
-# define CLOCK_EVT_FEAT_ONESHOT 0x000002
-# define CLOCK_EVT_FEAT_KTIME 0x000004
+# define CLOCK_EVT_FEAT_PERIODIC 0x000001
+# define CLOCK_EVT_FEAT_ONESHOT 0x000002
+# define CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED 0x000004
/*
* x86(64) specific (mis)features:
@@ -73,6 +73,7 @@ enum clock_event_state {
* level handler of the event source
* @set_next_event: set next event function using a clocksource delta
* @set_next_ktime: set next event function using a direct ktime value
+ * @set_next_coupled: set next event function for clocksource coupled mode
* @next_event: local storage for the next event in oneshot mode
* @max_delta_ns: maximum delta value in ns
* @min_delta_ns: minimum delta value in ns
@@ -80,6 +81,8 @@ enum clock_event_state {
* @shift: nanoseconds to cycles divisor (power of two)
* @state_use_accessors:current state of the device, assigned by the core code
* @features: features
+ * @cs_id: Clocksource ID to denote the clocksource for coupled mode
+ * @next_event_forced: True if the last programming was a forced event
* @retries: number of forced programming retries
* @set_state_periodic: switch state to periodic
* @set_state_oneshot: switch state to oneshot
@@ -101,6 +104,7 @@ struct clock_event_device {
void (*event_handler)(struct clock_event_device *);
int (*set_next_event)(unsigned long evt, struct clock_event_device *);
int (*set_next_ktime)(ktime_t expires, struct clock_event_device *);
+ void (*set_next_coupled)(u64 cycles, struct clock_event_device *);
ktime_t next_event;
u64 max_delta_ns;
u64 min_delta_ns;
@@ -108,6 +112,8 @@ struct clock_event_device {
u32 shift;
enum clock_event_state state_use_accessors;
unsigned int features;
+ enum clocksource_ids cs_id;
+ unsigned int next_event_forced;
unsigned long retries;
int (*set_state_periodic)(struct clock_event_device *);
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 65b7c41471c3..ccf5c0ca26b7 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -44,8 +44,6 @@ struct module;
* @shift: Cycle to nanosecond divisor (power of two)
* @max_idle_ns: Maximum idle time permitted by the clocksource (nsecs)
* @maxadj: Maximum adjustment value to mult (~11%)
- * @uncertainty_margin: Maximum uncertainty in nanoseconds per half second.
- * Zero says to use default WATCHDOG_THRESHOLD.
* @archdata: Optional arch-specific data
* @max_cycles: Maximum safe cycle value which won't overflow on
* multiplication
@@ -105,7 +103,6 @@ struct clocksource {
u32 shift;
u64 max_idle_ns;
u32 maxadj;
- u32 uncertainty_margin;
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
struct arch_clocksource_data archdata;
#endif
@@ -133,6 +130,7 @@ struct clocksource {
struct list_head wd_list;
u64 cs_last;
u64 wd_last;
+ unsigned int wd_cpu;
#endif
struct module *owner;
};
@@ -142,13 +140,19 @@ struct clocksource {
*/
#define CLOCK_SOURCE_IS_CONTINUOUS 0x01
#define CLOCK_SOURCE_MUST_VERIFY 0x02
+#define CLOCK_SOURCE_CALIBRATED 0x04
#define CLOCK_SOURCE_WATCHDOG 0x10
#define CLOCK_SOURCE_VALID_FOR_HRES 0x20
#define CLOCK_SOURCE_UNSTABLE 0x40
#define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80
#define CLOCK_SOURCE_RESELECT 0x100
-#define CLOCK_SOURCE_VERIFY_PERCPU 0x200
+#define CLOCK_SOURCE_CAN_INLINE_READ 0x200
+#define CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT 0x400
+
+#define CLOCK_SOURCE_WDTEST 0x800
+#define CLOCK_SOURCE_WDTEST_PERCPU 0x1000
+
/* simplify initialization of mask field */
#define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0)
@@ -298,21 +302,6 @@ static inline void timer_probe(void) {}
#define TIMER_ACPI_DECLARE(name, table_id, fn) \
ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
-static inline unsigned int clocksource_get_max_watchdog_retry(void)
-{
- /*
- * When system is in the boot phase or under heavy workload, there
- * can be random big latencies during the clocksource/watchdog
- * read, so allow retries to filter the noise latency. As the
- * latency's frequency and maximum value goes up with the number of
- * CPUs, scale the number of retries with the number of online
- * CPUs.
- */
- return (ilog2(num_online_cpus()) / 2) + 1;
-}
-
-void clocksource_verify_percpu(struct clocksource *cs);
-
/**
* struct clocksource_base - hardware abstraction for clock on which a clocksource
* is based
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 74adbd4e7003..9ced498fefaa 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -13,6 +13,7 @@
#define _LINUX_HRTIMER_H
#include <linux/hrtimer_defs.h>
+#include <linux/hrtimer_rearm.h>
#include <linux/hrtimer_types.h>
#include <linux/init.h>
#include <linux/list.h>
@@ -31,6 +32,13 @@
* soft irq context
* HRTIMER_MODE_HARD - Timer callback function will be executed in
* hard irq context even on PREEMPT_RT.
+ * HRTIMER_MODE_LAZY_REARM - Avoid reprogramming if the timer was the
+ * first expiring timer and is moved into the
+ * future. Special mode for the HRTICK timer to
+ * avoid extensive reprogramming of the hardware,
+ * which is expensive in virtual machines. Risks
+ * a pointless expiry, but that's better than
+ * reprogramming on every context switch,
*/
enum hrtimer_mode {
HRTIMER_MODE_ABS = 0x00,
@@ -38,6 +46,7 @@ enum hrtimer_mode {
HRTIMER_MODE_PINNED = 0x02,
HRTIMER_MODE_SOFT = 0x04,
HRTIMER_MODE_HARD = 0x08,
+ HRTIMER_MODE_LAZY_REARM = 0x10,
HRTIMER_MODE_ABS_PINNED = HRTIMER_MODE_ABS | HRTIMER_MODE_PINNED,
HRTIMER_MODE_REL_PINNED = HRTIMER_MODE_REL | HRTIMER_MODE_PINNED,
@@ -55,33 +64,6 @@ enum hrtimer_mode {
HRTIMER_MODE_REL_PINNED_HARD = HRTIMER_MODE_REL_PINNED | HRTIMER_MODE_HARD,
};
-/*
- * Values to track state of the timer
- *
- * Possible states:
- *
- * 0x00 inactive
- * 0x01 enqueued into rbtree
- *
- * The callback state is not part of the timer->state because clearing it would
- * mean touching the timer after the callback, this makes it impossible to free
- * the timer from the callback function.
- *
- * Therefore we track the callback state in:
- *
- * timer->base->cpu_base->running == timer
- *
- * On SMP it is possible to have a "callback function running and enqueued"
- * status. It happens for example when a posix timer expired and the callback
- * queued a signal. Between dropping the lock which protects the posix timer
- * and reacquiring the base lock of the hrtimer, another CPU can deliver the
- * signal and rearm the timer.
- *
- * All state transitions are protected by cpu_base->lock.
- */
-#define HRTIMER_STATE_INACTIVE 0x00
-#define HRTIMER_STATE_ENQUEUED 0x01
-
/**
* struct hrtimer_sleeper - simple sleeper structure
* @timer: embedded timer structure
@@ -134,11 +116,6 @@ static inline ktime_t hrtimer_get_softexpires(const struct hrtimer *timer)
return timer->_softexpires;
}
-static inline s64 hrtimer_get_expires_ns(const struct hrtimer *timer)
-{
- return ktime_to_ns(timer->node.expires);
-}
-
ktime_t hrtimer_cb_get_time(const struct hrtimer *timer);
static inline ktime_t hrtimer_expires_remaining(const struct hrtimer *timer)
@@ -146,24 +123,23 @@ static inline ktime_t hrtimer_expires_remaining(const struct hrtimer *timer)
return ktime_sub(timer->node.expires, hrtimer_cb_get_time(timer));
}
-static inline int hrtimer_is_hres_active(struct hrtimer *timer)
-{
- return IS_ENABLED(CONFIG_HIGH_RES_TIMERS) ?
- timer->base->cpu_base->hres_active : 0;
-}
-
#ifdef CONFIG_HIGH_RES_TIMERS
+extern unsigned int hrtimer_resolution;
struct clock_event_device;
extern void hrtimer_interrupt(struct clock_event_device *dev);
-extern unsigned int hrtimer_resolution;
+extern struct static_key_false hrtimer_highres_enabled_key;
-#else
+static inline bool hrtimer_highres_enabled(void)
+{
+ return static_branch_likely(&hrtimer_highres_enabled_key);
+}
+#else /* CONFIG_HIGH_RES_TIMERS */
#define hrtimer_resolution (unsigned int)LOW_RES_NSEC
-
-#endif
+static inline bool hrtimer_highres_enabled(void) { return false; }
+#endif /* !CONFIG_HIGH_RES_TIMERS */
static inline ktime_t
__hrtimer_expires_remaining_adjusted(const struct hrtimer *timer, ktime_t now)
@@ -293,8 +269,8 @@ extern bool hrtimer_active(const struct hrtimer *timer);
*/
static inline bool hrtimer_is_queued(struct hrtimer *timer)
{
- /* The READ_ONCE pairs with the update functions of timer->state */
- return !!(READ_ONCE(timer->state) & HRTIMER_STATE_ENQUEUED);
+ /* The READ_ONCE pairs with the update functions of timer->is_queued */
+ return READ_ONCE(timer->is_queued);
}
/*
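The HRTIMER_MODE_LAZY_REARM comment in the hunk above describes a trade-off: a first-expiring timer which is moved into the future leaves the already programmed, earlier event in place and accepts one pointless expiry. A minimal user-space sketch of that decision follows. The names model_cpu_base and model_rearm_first() are hypothetical, and the model ignores all other queued timers, so it illustrates the idea and is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one CPU's clock event state; not a kernel structure. */
struct model_cpu_base {
	int64_t programmed;		/* expiry currently programmed into the "hardware" */
	unsigned int reprograms;	/* number of simulated hardware writes */
};

/*
 * Rearm the first expiring timer to new_exp. Returns true when the simulated
 * hardware was reprogrammed, false when the stale earlier event was kept.
 */
static bool model_rearm_first(struct model_cpu_base *cb, int64_t new_exp, bool lazy)
{
	assert(cb);

	/* A lazy timer moved into the future keeps the earlier event armed. */
	if (lazy && new_exp > cb->programmed)
		return false;

	/* Moving earlier, or a non-lazy timer, always touches the hardware. */
	cb->programmed = new_exp;
	cb->reprograms++;
	return true;
}
```

With a base programmed to 100, a lazy rearm to 200 leaves the hardware untouched, while a lazy rearm to 50 or any non-lazy rearm reprograms it.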
diff --git a/include/linux/hrtimer_defs.h b/include/linux/hrtimer_defs.h
index 02b010df6570..52ed9e46ff13 100644
--- a/include/linux/hrtimer_defs.h
+++ b/include/linux/hrtimer_defs.h
@@ -19,21 +19,23 @@
* timer to a base on another cpu.
* @clockid: clock id for per_cpu support
* @seq: seqcount around __run_hrtimer
+ * @expires_next: Absolute time of the next event in this clock base
* @running: pointer to the currently running hrtimer
* @active: red black tree root node for the active timers
* @offset: offset of this clock to the monotonic base
*/
struct hrtimer_clock_base {
- struct hrtimer_cpu_base *cpu_base;
- unsigned int index;
- clockid_t clockid;
- seqcount_raw_spinlock_t seq;
- struct hrtimer *running;
- struct timerqueue_head active;
- ktime_t offset;
+ struct hrtimer_cpu_base *cpu_base;
+ const unsigned int index;
+ const clockid_t clockid;
+ seqcount_raw_spinlock_t seq;
+ ktime_t expires_next;
+ struct hrtimer *running;
+ struct timerqueue_linked_head active;
+ ktime_t offset;
} __hrtimer_clock_base_align;
-enum hrtimer_base_type {
+enum hrtimer_base_type {
HRTIMER_BASE_MONOTONIC,
HRTIMER_BASE_REALTIME,
HRTIMER_BASE_BOOTTIME,
@@ -42,37 +44,36 @@ enum hrtimer_base_type {
HRTIMER_BASE_REALTIME_SOFT,
HRTIMER_BASE_BOOTTIME_SOFT,
HRTIMER_BASE_TAI_SOFT,
- HRTIMER_MAX_CLOCK_BASES,
+ HRTIMER_MAX_CLOCK_BASES
};
/**
* struct hrtimer_cpu_base - the per cpu clock bases
- * @lock: lock protecting the base and associated clock bases
- * and timers
- * @cpu: cpu number
- * @active_bases: Bitfield to mark bases with active timers
- * @clock_was_set_seq: Sequence counter of clock was set events
- * @hres_active: State of high resolution mode
- * @in_hrtirq: hrtimer_interrupt() is currently executing
- * @hang_detected: The last hrtimer interrupt detected a hang
- * @softirq_activated: displays, if the softirq is raised - update of softirq
- * related settings is not required then.
- * @nr_events: Total number of hrtimer interrupt events
- * @nr_retries: Total number of hrtimer interrupt retries
- * @nr_hangs: Total number of hrtimer interrupt hangs
- * @max_hang_time: Maximum time spent in hrtimer_interrupt
- * @softirq_expiry_lock: Lock which is taken while softirq based hrtimer are
- * expired
- * @online: CPU is online from an hrtimers point of view
- * @timer_waiters: A hrtimer_cancel() invocation waits for the timer
- * callback to finish.
- * @expires_next: absolute time of the next event, is required for remote
- * hrtimer enqueue; it is the total first expiry time (hard
- * and soft hrtimer are taken into account)
- * @next_timer: Pointer to the first expiring timer
- * @softirq_expires_next: Time to check, if soft queues needs also to be expired
- * @softirq_next_timer: Pointer to the first expiring softirq based timer
- * @clock_base: array of clock bases for this cpu
+ * @lock: lock protecting the base and associated clock bases and timers
+ * @cpu: cpu number
+ * @active_bases: Bitfield to mark bases with active timers
+ * @clock_was_set_seq: Sequence counter of clock was set events
+ * @hres_active: State of high resolution mode
+ * @deferred_rearm: A deferred rearm is pending
+ * @deferred_needs_update: The deferred rearm must re-evaluate the first timer
+ * @hang_detected: The last hrtimer interrupt detected a hang
+ * @softirq_activated: displays, if the softirq is raised - update of softirq
+ * related settings is not required then.
+ * @nr_events: Total number of hrtimer interrupt events
+ * @nr_retries: Total number of hrtimer interrupt retries
+ * @nr_hangs: Total number of hrtimer interrupt hangs
+ * @max_hang_time: Maximum time spent in hrtimer_interrupt
+ * @softirq_expiry_lock: Lock which is taken while softirq based hrtimer are expired
+ * @online: CPU is online from an hrtimers point of view
+ * @timer_waiters: A hrtimer_cancel() invocation waits for the timer callback to finish.
+ * @expires_next: Absolute time of the next event, is required for remote
+ * hrtimer enqueue; it is the total first expiry time (hard
+ * and soft hrtimer are taken into account)
+ * @next_timer: Pointer to the first expiring timer
+ * @softirq_expires_next: Time to check, if soft queues needs also to be expired
+ * @softirq_next_timer: Pointer to the first expiring softirq based timer
+ * @deferred_expires_next: Cached expires next value for deferred rearm
+ * @clock_base: Array of clock bases for this cpu
*
* Note: next_timer is just an optimization for __remove_hrtimer().
* Do not dereference the pointer because it is not reliable on
@@ -83,11 +84,12 @@ struct hrtimer_cpu_base {
unsigned int cpu;
unsigned int active_bases;
unsigned int clock_was_set_seq;
- unsigned int hres_active : 1,
- in_hrtirq : 1,
- hang_detected : 1,
- softirq_activated : 1,
- online : 1;
+ bool hres_active;
+ bool deferred_rearm;
+ bool deferred_needs_update;
+ bool hang_detected;
+ bool softirq_activated;
+ bool online;
#ifdef CONFIG_HIGH_RES_TIMERS
unsigned int nr_events;
unsigned short nr_retries;
@@ -102,6 +104,7 @@ struct hrtimer_cpu_base {
struct hrtimer *next_timer;
ktime_t softirq_expires_next;
struct hrtimer *softirq_next_timer;
+ ktime_t deferred_expires_next;
struct hrtimer_clock_base clock_base[HRTIMER_MAX_CLOCK_BASES];
call_single_data_t csd;
} ____cacheline_aligned;
diff --git a/include/linux/hrtimer_rearm.h b/include/linux/hrtimer_rearm.h
new file mode 100644
index 000000000000..a6f2e5d5e1c7
--- /dev/null
+++ b/include/linux/hrtimer_rearm.h
@@ -0,0 +1,83 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _LINUX_HRTIMER_REARM_H
+#define _LINUX_HRTIMER_REARM_H
+
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+#include <linux/thread_info.h>
+
+void __hrtimer_rearm_deferred(void);
+
+/*
+ * This is purely CPU local, so check the TIF bit first to avoid the overhead of
+ * the atomic test_and_clear_bit() operation for the common case where the bit
+ * is not set.
+ */
+static __always_inline bool hrtimer_test_and_clear_rearm_deferred_tif(unsigned long tif_work)
+{
+ lockdep_assert_irqs_disabled();
+
+ if (unlikely(tif_work & _TIF_HRTIMER_REARM)) {
+ clear_thread_flag(TIF_HRTIMER_REARM);
+ return true;
+ }
+ return false;
+}
+
+#define TIF_REARM_MASK (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | _TIF_HRTIMER_REARM)
+
+/* Invoked on the exit to user mode path before invoking exit_to_user_mode_loop() */
+static __always_inline bool
+hrtimer_rearm_deferred_user_irq(unsigned long *tif_work, const unsigned long tif_mask)
+{
+ /* Help the compiler to optimize the function out for syscall returns */
+ if (!(tif_mask & _TIF_HRTIMER_REARM))
+ return false;
+ /*
+ * Rearm the timer if none of the resched flags is set before going into
+ * the loop which re-enables interrupts.
+ */
+ if (unlikely((*tif_work & TIF_REARM_MASK) == _TIF_HRTIMER_REARM)) {
+ clear_thread_flag(TIF_HRTIMER_REARM);
+ __hrtimer_rearm_deferred();
+ /* Don't go into the loop if HRTIMER_REARM was the only flag */
+ *tif_work &= ~_TIF_HRTIMER_REARM;
+ return !*tif_work;
+ }
+ return false;
+}
+
+/* Invoked from the time slice extension decision function */
+static __always_inline void hrtimer_rearm_deferred_tif(unsigned long tif_work)
+{
+ if (hrtimer_test_and_clear_rearm_deferred_tif(tif_work))
+ __hrtimer_rearm_deferred();
+}
+
+/*
+ * This is to be called on all irqentry_exit() paths that will enable
+ * interrupts.
+ */
+static __always_inline void hrtimer_rearm_deferred(void)
+{
+ hrtimer_rearm_deferred_tif(read_thread_flags());
+}
+
+/*
+ * Invoked from the scheduler on entry to __schedule() so it can defer
+ * rearming after the load balancing callbacks which might change hrtick.
+ */
+static __always_inline bool hrtimer_test_and_clear_rearm_deferred(void)
+{
+ return hrtimer_test_and_clear_rearm_deferred_tif(read_thread_flags());
+}
+
+#else /* CONFIG_HRTIMER_REARM_DEFERRED */
+static __always_inline void __hrtimer_rearm_deferred(void) { }
+static __always_inline void hrtimer_rearm_deferred(void) { }
+static __always_inline void hrtimer_rearm_deferred_tif(unsigned long tif_work) { }
+static __always_inline bool
+hrtimer_rearm_deferred_user_irq(unsigned long *tif_work, const unsigned long tif_mask) { return false; }
+static __always_inline bool hrtimer_test_and_clear_rearm_deferred(void) { return false; }
+#endif /* !CONFIG_HRTIMER_REARM_DEFERRED */
+
+#endif
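The mask logic of hrtimer_rearm_deferred_user_irq() in the new header above can be modelled in user space. This is a sketch under stated assumptions: the M_* bit values and the model_* names are hypothetical stand-ins for the architecture specific _TIF_* masks and for __hrtimer_rearm_deferred(), and thread flags are reduced to a plain unsigned long.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the real _TIF_* masks are architecture specific. */
#define M_NEED_RESCHED		(1UL << 0)
#define M_NEED_RESCHED_LAZY	(1UL << 1)
#define M_HRTIMER_REARM		(1UL << 2)
#define M_SIGPENDING		(1UL << 3)
#define M_REARM_MASK	(M_NEED_RESCHED | M_NEED_RESCHED_LAZY | M_HRTIMER_REARM)

static unsigned int model_rearm_calls;

/* Stand-in for the real rearm; only counts invocations. */
static void model_rearm(void)
{
	model_rearm_calls++;
}

/* Returns true when the caller can skip the exit work loop entirely. */
static bool model_rearm_deferred_user_irq(unsigned long *work, unsigned long mask)
{
	assert(work);

	/* Syscall exit masks do not contain the rearm bit: nothing to do. */
	if (!(mask & M_HRTIMER_REARM))
		return false;

	/* Rearm right away only if no reschedule is pending. */
	if ((*work & M_REARM_MASK) == M_HRTIMER_REARM) {
		model_rearm();
		*work &= ~M_HRTIMER_REARM;
		return !*work;
	}
	return false;
}
```

When a reschedule is pending the rearm stays deferred, which matches the comment in the header that the scheduler may still change hrtick.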
diff --git a/include/linux/hrtimer_types.h b/include/linux/hrtimer_types.h
index 8fbbb6bdf7a1..b5dacc8271a4 100644
--- a/include/linux/hrtimer_types.h
+++ b/include/linux/hrtimer_types.h
@@ -17,7 +17,7 @@ enum hrtimer_restart {
/**
* struct hrtimer - the basic hrtimer structure
- * @node: timerqueue node, which also manages node.expires,
+ * @node: Linked timerqueue node, which also manages node.expires,
* the absolute expiry time in the hrtimers internal
* representation. The time is related to the clock on
* which the timer is based. Is setup by adding
@@ -28,23 +28,26 @@ enum hrtimer_restart {
* was armed.
* @function: timer expiry callback function
* @base: pointer to the timer base (per cpu and per clock)
- * @state: state information (See bit values above)
+ * @is_queued: Indicates whether a timer is enqueued or not
* @is_rel: Set if the timer was armed relative
* @is_soft: Set if hrtimer will be expired in soft interrupt context.
* @is_hard: Set if hrtimer will be expired in hard interrupt context
* even on RT.
+ * @is_lazy: Set if the timer is frequently rearmed to avoid updates
+ * of the clock event device
*
* The hrtimer structure must be initialized by hrtimer_setup()
*/
struct hrtimer {
- struct timerqueue_node node;
+ struct timerqueue_linked_node node;
+ struct hrtimer_clock_base *base;
+ bool is_queued;
+ bool is_rel;
+ bool is_soft;
+ bool is_hard;
+ bool is_lazy;
ktime_t _softexpires;
enum hrtimer_restart (*__private function)(struct hrtimer *);
- struct hrtimer_clock_base *base;
- u8 state;
- u8 is_rel;
- u8 is_soft;
- u8 is_hard;
};
#endif /* _LINUX_HRTIMER_TYPES_H */
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index d26d1b1bcbfb..b976946b3cdb 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -3,6 +3,7 @@
#define __LINUX_IRQENTRYCOMMON_H
#include <linux/context_tracking.h>
+#include <linux/hrtimer_rearm.h>
#include <linux/kmsan.h>
#include <linux/rseq_entry.h>
#include <linux/static_call_types.h>
@@ -33,6 +34,14 @@
_TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ | \
ARCH_EXIT_TO_USER_MODE_WORK)
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+# define EXIT_TO_USER_MODE_WORK_SYSCALL (EXIT_TO_USER_MODE_WORK)
+# define EXIT_TO_USER_MODE_WORK_IRQ (EXIT_TO_USER_MODE_WORK | _TIF_HRTIMER_REARM)
+#else
+# define EXIT_TO_USER_MODE_WORK_SYSCALL (EXIT_TO_USER_MODE_WORK)
+# define EXIT_TO_USER_MODE_WORK_IRQ (EXIT_TO_USER_MODE_WORK)
+#endif
+
/**
* arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
* @regs: Pointer to currents pt_regs
@@ -203,6 +212,7 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work
/**
* __exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
* @regs: Pointer to pt_regs on entry stack
+ * @work_mask: Which TIF bits need to be evaluated
*
* 1) check that interrupts are disabled
* 2) call tick_nohz_user_enter_prepare()
@@ -212,7 +222,8 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work
*
* Don't invoke directly, use the syscall/irqentry_ prefixed variants below
*/
-static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs,
+ const unsigned long work_mask)
{
unsigned long ti_work;
@@ -222,8 +233,10 @@ static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
tick_nohz_user_enter_prepare();
ti_work = read_thread_flags();
- if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
- ti_work = exit_to_user_mode_loop(regs, ti_work);
+ if (unlikely(ti_work & work_mask)) {
+ if (!hrtimer_rearm_deferred_user_irq(&ti_work, work_mask))
+ ti_work = exit_to_user_mode_loop(regs, ti_work);
+ }
arch_exit_to_user_mode_prepare(regs, ti_work);
}
@@ -239,7 +252,7 @@ static __always_inline void __exit_to_user_mode_validate(void)
/* Temporary workaround to keep ARM64 alive */
static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *regs)
{
- __exit_to_user_mode_prepare(regs);
+ __exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK);
rseq_exit_to_user_mode_legacy();
__exit_to_user_mode_validate();
}
@@ -253,7 +266,7 @@ static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *reg
*/
static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
{
- __exit_to_user_mode_prepare(regs);
+ __exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK_SYSCALL);
rseq_syscall_exit_to_user_mode();
__exit_to_user_mode_validate();
}
@@ -267,7 +280,7 @@ static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *re
*/
static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
{
- __exit_to_user_mode_prepare(regs);
+ __exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK_IRQ);
rseq_irqentry_exit_to_user_mode();
__exit_to_user_mode_validate();
}
diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
index d1c3d4941854..bbd57061802c 100644
--- a/include/linux/jiffies.h
+++ b/include/linux/jiffies.h
@@ -67,10 +67,6 @@ extern void register_refined_jiffies(long clock_tick_rate);
/* USER_TICK_USEC is the time between ticks in usec assuming fake USER_HZ */
#define USER_TICK_USEC ((1000000UL + USER_HZ/2) / USER_HZ)
-#ifndef __jiffy_arch_data
-#define __jiffy_arch_data
-#endif
-
/*
* The 64-bit value is not atomic on 32-bit systems - you MUST NOT read it
* without sampling the sequence number in jiffies_lock.
@@ -83,7 +79,7 @@ extern void register_refined_jiffies(long clock_tick_rate);
* See arch/ARCH/kernel/vmlinux.lds.S
*/
extern u64 __cacheline_aligned_in_smp jiffies_64;
-extern unsigned long volatile __cacheline_aligned_in_smp __jiffy_arch_data jiffies;
+extern unsigned long volatile __cacheline_aligned_in_smp jiffies;
#if (BITS_PER_LONG < 64)
u64 get_jiffies_64(void);
diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 4091e978aef2..48acdc3889dd 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -35,10 +35,15 @@
#define RB_CLEAR_NODE(node) \
((node)->__rb_parent_color = (unsigned long)(node))
+#define RB_EMPTY_LINKED_NODE(lnode) RB_EMPTY_NODE(&(lnode)->node)
+#define RB_CLEAR_LINKED_NODE(lnode) ({ \
+ RB_CLEAR_NODE(&(lnode)->node); \
+ (lnode)->prev = (lnode)->next = NULL; \
+})
extern void rb_insert_color(struct rb_node *, struct rb_root *);
extern void rb_erase(struct rb_node *, struct rb_root *);
-
+extern bool rb_erase_linked(struct rb_node_linked *, struct rb_root_linked *);
/* Find logical next and previous nodes in a tree */
extern struct rb_node *rb_next(const struct rb_node *);
@@ -213,15 +218,10 @@ rb_add_cached(struct rb_node *node, struct rb_root_cached *tree,
return leftmost ? node : NULL;
}
-/**
- * rb_add() - insert @node into @tree
- * @node: node to insert
- * @tree: tree to insert @node into
- * @less: operator defining the (partial) node order
- */
static __always_inline void
-rb_add(struct rb_node *node, struct rb_root *tree,
- bool (*less)(struct rb_node *, const struct rb_node *))
+__rb_add(struct rb_node *node, struct rb_root *tree,
+ bool (*less)(struct rb_node *, const struct rb_node *),
+ void (*linkop)(struct rb_node *, struct rb_node *, struct rb_node **))
{
struct rb_node **link = &tree->rb_node;
struct rb_node *parent = NULL;
@@ -234,10 +234,73 @@ rb_add(struct rb_node *node, struct rb_root *tree,
link = &parent->rb_right;
}
+ linkop(node, parent, link);
rb_link_node(node, parent, link);
rb_insert_color(node, tree);
}
+#define __node_2_linked_node(_n) \
+ rb_entry((_n), struct rb_node_linked, node)
+
+static inline void
+rb_link_linked_node(struct rb_node *node, struct rb_node *parent, struct rb_node **link)
+{
+ if (!parent)
+ return;
+
+ struct rb_node_linked *nnew = __node_2_linked_node(node);
+ struct rb_node_linked *npar = __node_2_linked_node(parent);
+
+ if (link == &parent->rb_left) {
+ nnew->prev = npar->prev;
+ nnew->next = npar;
+ npar->prev = nnew;
+ if (nnew->prev)
+ nnew->prev->next = nnew;
+ } else {
+ nnew->next = npar->next;
+ nnew->prev = npar;
+ npar->next = nnew;
+ if (nnew->next)
+ nnew->next->prev = nnew;
+ }
+}
+
+/**
+ * rb_add_linked() - insert @node into the leftmost linked tree @tree
+ * @node: node to insert
+ * @tree: linked tree to insert @node into
+ * @less: operator defining the (partial) node order
+ *
+ * Returns @true when @node is the new leftmost, @false otherwise.
+ */
+static __always_inline bool
+rb_add_linked(struct rb_node_linked *node, struct rb_root_linked *tree,
+ bool (*less)(struct rb_node *, const struct rb_node *))
+{
+ __rb_add(&node->node, &tree->rb_root, less, rb_link_linked_node);
+ if (!node->prev)
+ tree->rb_leftmost = node;
+ return !node->prev;
+}
+
+/* Empty linkop function which is optimized away by the compiler */
+static __always_inline void
+rb_link_noop(struct rb_node *n, struct rb_node *p, struct rb_node **l) { }
+
+/**
+ * rb_add() - insert @node into @tree
+ * @node: node to insert
+ * @tree: tree to insert @node into
+ * @less: operator defining the (partial) node order
+ */
+static __always_inline void
+rb_add(struct rb_node *node, struct rb_root *tree,
+ bool (*less)(struct rb_node *, const struct rb_node *))
+{
+ __rb_add(node, tree, less, rb_link_noop);
+}
+
/**
* rb_find_add_cached() - find equivalent @node in @tree, or add @node
* @node: node to look-for / insert
diff --git a/include/linux/rbtree_types.h b/include/linux/rbtree_types.h
index 45b6ecde3665..3c7ae53e8139 100644
--- a/include/linux/rbtree_types.h
+++ b/include/linux/rbtree_types.h
@@ -9,6 +9,12 @@ struct rb_node {
} __attribute__((aligned(sizeof(long))));
/* The alignment might seem pointless, but allegedly CRIS needs it */
+struct rb_node_linked {
+ struct rb_node node;
+ struct rb_node_linked *prev;
+ struct rb_node_linked *next;
+};
+
struct rb_root {
struct rb_node *rb_node;
};
@@ -28,7 +34,17 @@ struct rb_root_cached {
struct rb_node *rb_leftmost;
};
+/*
+ * Leftmost tree with links. This would allow a trivial rb_rightmost update,
+ * but that has been omitted due to the lack of users.
+ */
+struct rb_root_linked {
+ struct rb_root rb_root;
+ struct rb_node_linked *rb_leftmost;
+};
+
#define RB_ROOT (struct rb_root) { NULL, }
#define RB_ROOT_CACHED (struct rb_root_cached) { {NULL, }, NULL }
+#define RB_ROOT_LINKED (struct rb_root_linked) { {NULL, }, NULL }
#endif
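The neighbor link update done by rb_link_linked_node() relies on a property of any binary search tree insertion at a leaf position. A node linked as the left child sits directly before its parent in sort order, and a node linked as the right child sits directly after it. The following user-space sketch shows that on a plain unbalanced tree. The lnode/ltree types and ltree_add() are hypothetical, and rebalancing as done by rb_insert_color() is left out because it does not change the in-order sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node: tree links plus in-order neighbor links. */
struct lnode {
	struct lnode *left, *right;
	struct lnode *prev, *next;
	long key;
};

struct ltree {
	struct lnode *root;
	struct lnode *leftmost;
};

/* Insert n and return true when it became the new leftmost node. */
static bool ltree_add(struct ltree *t, struct lnode *n)
{
	struct lnode **link = &t->root, *parent = NULL;

	assert(t && n);
	n->left = n->right = n->prev = n->next = NULL;

	/* Ordinary descent to the leaf position. */
	while (*link) {
		parent = *link;
		link = n->key < parent->key ? &parent->left : &parent->right;
	}

	if (parent) {
		if (link == &parent->left) {
			/* New node goes directly before the parent. */
			n->prev = parent->prev;
			n->next = parent;
			parent->prev = n;
			if (n->prev)
				n->prev->next = n;
		} else {
			/* New node goes directly after the parent. */
			n->next = parent->next;
			n->prev = parent;
			parent->next = n;
			if (n->next)
				n->next->prev = n;
		}
	}
	*link = n;

	/* No predecessor means the node is the first in sort order. */
	if (!n->prev)
		t->leftmost = n;
	return !n->prev;
}
```

After inserting a few keys, following next from leftmost visits them in ascending order without walking the tree.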
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index c6831c93cd6e..f11ebd34f8b9 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -40,6 +40,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
#endif /* !CONFIG_RSEQ_STATS */
#ifdef CONFIG_RSEQ
+#include <linux/hrtimer_rearm.h>
#include <linux/jump_label.h>
#include <linux/rseq.h>
#include <linux/sched/signal.h>
@@ -110,7 +111,7 @@ static __always_inline void rseq_slice_clear_grant(struct task_struct *t)
t->rseq.slice.state.granted = false;
}
-static __always_inline bool rseq_grant_slice_extension(bool work_pending)
+static __always_inline bool __rseq_grant_slice_extension(bool work_pending)
{
struct task_struct *curr = current;
struct rseq_slice_ctrl usr_ctrl;
@@ -215,11 +216,20 @@ static __always_inline bool rseq_grant_slice_extension(bool work_pending)
return false;
}
+static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask)
+{
+ if (unlikely(__rseq_grant_slice_extension(ti_work & mask))) {
+ hrtimer_rearm_deferred_tif(ti_work);
+ return true;
+ }
+ return false;
+}
+
#else /* CONFIG_RSEQ_SLICE_EXTENSION */
static __always_inline bool rseq_slice_extension_enabled(void) { return false; }
static __always_inline bool rseq_arm_slice_extension_timer(void) { return false; }
static __always_inline void rseq_slice_clear_grant(struct task_struct *t) { }
-static __always_inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
+static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; }
#endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
@@ -778,7 +788,7 @@ static inline void rseq_syscall_exit_to_user_mode(void) { }
static inline void rseq_irqentry_exit_to_user_mode(void) { }
static inline void rseq_exit_to_user_mode_legacy(void) { }
static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
-static inline bool rseq_grant_slice_extension(bool work_pending) { return false; }
+static inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; }
#endif /* !CONFIG_RSEQ */
#endif /* _LINUX_RSEQ_ENTRY_H */
diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index b8ae89ea28ab..e36d11e33e0c 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -72,6 +72,10 @@ struct tk_read_base {
* @id: The timekeeper ID
* @tkr_raw: The readout base structure for CLOCK_MONOTONIC_RAW
* @raw_sec: CLOCK_MONOTONIC_RAW time in seconds
+ * @cs_id: The ID of the current clocksource
+ * @cs_ns_to_cyc_mult: Multiplicator for nanoseconds to cycles conversion
+ * @cs_ns_to_cyc_shift: Shift value for nanoseconds to cycles conversion
+ * @cs_ns_to_cyc_maxns: Maximum nanoseconds to cycles conversion range
* @clock_was_set_seq: The sequence number of clock was set events
* @cs_was_changed_seq: The sequence number of clocksource change events
* @clock_valid: Indicator for valid clock
@@ -159,6 +163,10 @@ struct timekeeper {
u64 raw_sec;
/* Cachline 3 and 4 (timekeeping internal variables): */
+ enum clocksource_ids cs_id;
+ u32 cs_ns_to_cyc_mult;
+ u32 cs_ns_to_cyc_shift;
+ u64 cs_ns_to_cyc_maxns;
unsigned int clock_was_set_seq;
u8 cs_was_changed_seq;
u8 clock_valid;
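The new cs_ns_to_cyc_* fields are documented only as multiplier, shift and maximum range of a nanoseconds to cycles conversion. The sketch below assumes the usual mult/shift fixed point convention, cycles = (ns * mult) >> shift, and assumes that inputs above the maximum range are clamped. Both are assumptions for illustration, as are the model_* names. The diff shown here does not include the code which consumes these fields.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container mirroring the three new timekeeper fields. */
struct model_ns_to_cyc {
	uint32_t mult;
	uint32_t shift;
	uint64_t maxns;
};

/* Convert nanoseconds to clocksource cycles with mult/shift arithmetic. */
static uint64_t model_ns_to_cycles(const struct model_ns_to_cyc *c, uint64_t ns)
{
	assert(c && c->shift < 64);

	/* Assumption: clamp so that ns * mult cannot overflow 64 bits. */
	if (ns > c->maxns)
		ns = c->maxns;

	return (ns * c->mult) >> c->shift;
}
```

For a counter running at three cycles per nanosecond, mult = 3 << 10 and shift = 10 convert 1000 ns into 3000 cycles.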
diff --git a/include/linux/timerqueue.h b/include/linux/timerqueue.h
index d306d9dd2207..7d0aaa766580 100644
--- a/include/linux/timerqueue.h
+++ b/include/linux/timerqueue.h
@@ -5,12 +5,11 @@
#include <linux/rbtree.h>
#include <linux/timerqueue_types.h>
-extern bool timerqueue_add(struct timerqueue_head *head,
- struct timerqueue_node *node);
-extern bool timerqueue_del(struct timerqueue_head *head,
- struct timerqueue_node *node);
-extern struct timerqueue_node *timerqueue_iterate_next(
- struct timerqueue_node *node);
+bool timerqueue_add(struct timerqueue_head *head, struct timerqueue_node *node);
+bool timerqueue_del(struct timerqueue_head *head, struct timerqueue_node *node);
+struct timerqueue_node *timerqueue_iterate_next(struct timerqueue_node *node);
+
+bool timerqueue_linked_add(struct timerqueue_linked_head *head, struct timerqueue_linked_node *node);
/**
* timerqueue_getnext - Returns the timer with the earliest expiration time
@@ -19,8 +18,7 @@ extern struct timerqueue_node *timerqueue_iterate_next(
*
* Returns a pointer to the timer node that has the earliest expiration time.
*/
-static inline
-struct timerqueue_node *timerqueue_getnext(struct timerqueue_head *head)
+static inline struct timerqueue_node *timerqueue_getnext(struct timerqueue_head *head)
{
struct rb_node *leftmost = rb_first_cached(&head->rb_root);
@@ -41,4 +39,46 @@ static inline void timerqueue_init_head(struct timerqueue_head *head)
{
head->rb_root = RB_ROOT_CACHED;
}
+
+/* Timer queues with linked nodes */
+
+static __always_inline
+struct timerqueue_linked_node *timerqueue_linked_first(struct timerqueue_linked_head *head)
+{
+ return rb_entry_safe(head->rb_root.rb_leftmost, struct timerqueue_linked_node, node);
+}
+
+static __always_inline
+struct timerqueue_linked_node *timerqueue_linked_next(struct timerqueue_linked_node *node)
+{
+ return rb_entry_safe(node->node.next, struct timerqueue_linked_node, node);
+}
+
+static __always_inline
+struct timerqueue_linked_node *timerqueue_linked_prev(struct timerqueue_linked_node *node)
+{
+ return rb_entry_safe(node->node.prev, struct timerqueue_linked_node, node);
+}
+
+static __always_inline
+bool timerqueue_linked_del(struct timerqueue_linked_head *head, struct timerqueue_linked_node *node)
+{
+ return rb_erase_linked(&node->node, &head->rb_root);
+}
+
+static __always_inline void timerqueue_linked_init(struct timerqueue_linked_node *node)
+{
+ RB_CLEAR_LINKED_NODE(&node->node);
+}
+
+static __always_inline bool timerqueue_linked_node_queued(struct timerqueue_linked_node *node)
+{
+ return !RB_EMPTY_LINKED_NODE(&node->node);
+}
+
+static __always_inline void timerqueue_linked_init_head(struct timerqueue_linked_head *head)
+{
+ head->rb_root = RB_ROOT_LINKED;
+}
+
#endif /* _LINUX_TIMERQUEUE_H */
diff --git a/include/linux/timerqueue_types.h b/include/linux/timerqueue_types.h
index dc298d0923e3..be2218b147c4 100644
--- a/include/linux/timerqueue_types.h
+++ b/include/linux/timerqueue_types.h
@@ -6,12 +6,21 @@
#include <linux/types.h>
struct timerqueue_node {
- struct rb_node node;
- ktime_t expires;
+ struct rb_node node;
+ ktime_t expires;
};
struct timerqueue_head {
- struct rb_root_cached rb_root;
+ struct rb_root_cached rb_root;
+};
+
+struct timerqueue_linked_node {
+ struct rb_node_linked node;
+ ktime_t expires;
+};
+
+struct timerqueue_linked_head {
+ struct rb_root_linked rb_root;
};
#endif /* _LINUX_TIMERQUEUE_TYPES_H */
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 37eb2f0f3dd8..40a43a4c7caf 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -22,20 +22,23 @@ union bpf_attr;
const char *trace_print_flags_seq(struct trace_seq *p, const char *delim,
unsigned long flags,
- const struct trace_print_flags *flag_array);
+ const struct trace_print_flags *flag_array,
+ size_t flag_array_size);
const char *trace_print_symbols_seq(struct trace_seq *p, unsigned long val,
- const struct trace_print_flags *symbol_array);
+ const struct trace_print_flags *symbol_array,
+ size_t symbol_array_size);
#if BITS_PER_LONG == 32
const char *trace_print_flags_seq_u64(struct trace_seq *p, const char *delim,
unsigned long long flags,
- const struct trace_print_flags_u64 *flag_array);
+ const struct trace_print_flags_u64 *flag_array,
+ size_t flag_array_size);
const char *trace_print_symbols_seq_u64(struct trace_seq *p,
unsigned long long val,
- const struct trace_print_flags_u64
- *symbol_array);
+ const struct trace_print_flags_u64 *symbol_array,
+ size_t symbol_array_size);
#endif
struct trace_iterator;
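The prototypes above gain an explicit array size, which replaces the { -1, NULL } terminator entry that the print macros used to append. A minimal user-space sketch of a size-bounded flag decoder follows. The model_flag type, model_print_flags() and MODEL_ARRAY_SIZE() are hypothetical, and the output goes into a plain character buffer instead of a struct trace_seq.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical equivalent of a flag table entry. */
struct model_flag {
	unsigned long mask;
	const char *name;
};

/* Write the names of all set flags into buf, separated by delim. */
static char *model_print_flags(char *buf, size_t len, const char *delim,
			       unsigned long flags,
			       const struct model_flag *arr, size_t n)
{
	assert(buf && len);
	buf[0] = '\0';

	/* The loop is bounded by n, so no terminator entry is needed. */
	for (size_t i = 0; i < n && flags; i++) {
		if ((flags & arr[i].mask) != arr[i].mask)
			continue;
		if (buf[0])
			strncat(buf, delim, len - strlen(buf) - 1);
		strncat(buf, arr[i].name, len - strlen(buf) - 1);
		flags &= ~arr[i].mask;
	}
	return buf;
}
```

Passing the size computed at the definition site keeps the table one entry smaller and removes the need to reserve a value as end marker.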
diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index 1641ae3e6ca0..07cbb9836b91 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -218,12 +218,13 @@ TRACE_EVENT(hrtimer_setup,
* hrtimer_start - called when the hrtimer is started
* @hrtimer: pointer to struct hrtimer
* @mode: the hrtimers mode
+ * @was_armed: Was armed when hrtimer_start*() was invoked
*/
TRACE_EVENT(hrtimer_start,
- TP_PROTO(struct hrtimer *hrtimer, enum hrtimer_mode mode),
+ TP_PROTO(struct hrtimer *hrtimer, enum hrtimer_mode mode, bool was_armed),
- TP_ARGS(hrtimer, mode),
+ TP_ARGS(hrtimer, mode, was_armed),
TP_STRUCT__entry(
__field( void *, hrtimer )
@@ -231,6 +232,7 @@ TRACE_EVENT(hrtimer_start,
__field( s64, expires )
__field( s64, softexpires )
__field( enum hrtimer_mode, mode )
+ __field( bool, was_armed )
),
TP_fast_assign(
@@ -239,26 +241,26 @@ TRACE_EVENT(hrtimer_start,
__entry->expires = hrtimer_get_expires(hrtimer);
__entry->softexpires = hrtimer_get_softexpires(hrtimer);
__entry->mode = mode;
+ __entry->was_armed = was_armed;
),
TP_printk("hrtimer=%p function=%ps expires=%llu softexpires=%llu "
- "mode=%s", __entry->hrtimer, __entry->function,
+ "mode=%s was_armed=%d", __entry->hrtimer, __entry->function,
(unsigned long long) __entry->expires,
(unsigned long long) __entry->softexpires,
- decode_hrtimer_mode(__entry->mode))
+ decode_hrtimer_mode(__entry->mode), __entry->was_armed)
);
/**
* hrtimer_expire_entry - called immediately before the hrtimer callback
* @hrtimer: pointer to struct hrtimer
- * @now: pointer to variable which contains current time of the
- * timers base.
+ * @now: variable which contains current time of the timers base.
*
* Allows to determine the timer latency.
*/
TRACE_EVENT(hrtimer_expire_entry,
- TP_PROTO(struct hrtimer *hrtimer, ktime_t *now),
+ TP_PROTO(struct hrtimer *hrtimer, ktime_t now),
TP_ARGS(hrtimer, now),
@@ -270,7 +272,7 @@ TRACE_EVENT(hrtimer_expire_entry,
TP_fast_assign(
__entry->hrtimer = hrtimer;
- __entry->now = *now;
+ __entry->now = now;
__entry->function = ACCESS_PRIVATE(hrtimer, function);
),
@@ -321,6 +323,30 @@ DEFINE_EVENT(hrtimer_class, hrtimer_cancel,
TP_ARGS(hrtimer)
);
+/**
+ * hrtimer_rearm - Invoked when the clockevent device is rearmed
+ * @next_event: The next expiry time (CLOCK_MONOTONIC)
+ */
+TRACE_EVENT(hrtimer_rearm,
+
+ TP_PROTO(ktime_t next_event, bool deferred),
+
+ TP_ARGS(next_event, deferred),
+
+ TP_STRUCT__entry(
+ __field( s64, next_event )
+ __field( bool, deferred )
+ ),
+
+ TP_fast_assign(
+ __entry->next_event = next_event;
+ __entry->deferred = deferred;
+ ),
+
+ TP_printk("next_event=%llu deferred=%d",
+ (unsigned long long) __entry->next_event, __entry->deferred)
+);
+
/**
* itimer_state - called when itimer is started or canceled
* @which: name of the interval timer
diff --git a/include/trace/stages/stage3_trace_output.h b/include/trace/stages/stage3_trace_output.h
index fce85ea2df1c..b7d8ef4b9fe1 100644
--- a/include/trace/stages/stage3_trace_output.h
+++ b/include/trace/stages/stage3_trace_output.h
@@ -64,36 +64,36 @@
#define __get_rel_sockaddr(field) ((struct sockaddr *)__get_rel_dynamic_array(field))
#undef __print_flags
-#define __print_flags(flag, delim, flag_array...) \
- ({ \
- static const struct trace_print_flags __flags[] = \
- { flag_array, { -1, NULL }}; \
- trace_print_flags_seq(p, delim, flag, __flags); \
+#define __print_flags(flag, delim, flag_array...) \
+ ({ \
+ static const struct trace_print_flags __flags[] = \
+ { flag_array }; \
+ trace_print_flags_seq(p, delim, flag, __flags, ARRAY_SIZE(__flags)); \
})
#undef __print_symbolic
-#define __print_symbolic(value, symbol_array...) \
- ({ \
- static const struct trace_print_flags symbols[] = \
- { symbol_array, { -1, NULL }}; \
- trace_print_symbols_seq(p, value, symbols); \
+#define __print_symbolic(value, symbol_array...) \
+ ({ \
+ static const struct trace_print_flags symbols[] = \
+ { symbol_array }; \
+ trace_print_symbols_seq(p, value, symbols, ARRAY_SIZE(symbols)); \
})
#undef __print_flags_u64
#undef __print_symbolic_u64
#if BITS_PER_LONG == 32
-#define __print_flags_u64(flag, delim, flag_array...) \
- ({ \
- static const struct trace_print_flags_u64 __flags[] = \
- { flag_array, { -1, NULL } }; \
- trace_print_flags_seq_u64(p, delim, flag, __flags); \
+#define __print_flags_u64(flag, delim, flag_array...) \
+ ({ \
+ static const struct trace_print_flags_u64 __flags[] = \
+ { flag_array }; \
+ trace_print_flags_seq_u64(p, delim, flag, __flags, ARRAY_SIZE(__flags)); \
})
-#define __print_symbolic_u64(value, symbol_array...) \
- ({ \
- static const struct trace_print_flags_u64 symbols[] = \
- { symbol_array, { -1, NULL } }; \
- trace_print_symbols_seq_u64(p, value, symbols); \
+#define __print_symbolic_u64(value, symbol_array...) \
+ ({ \
+ static const struct trace_print_flags_u64 symbols[] = \
+ { symbol_array }; \
+ trace_print_symbols_seq_u64(p, value, symbols, ARRAY_SIZE(symbols)); \
})
#else
#define __print_flags_u64(flag, delim, flag_array...) \
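For readers following the stage3_trace_output.h change: the macros no longer append a { -1, NULL } terminator and instead pass ARRAY_SIZE() to the print helpers. A minimal userspace sketch of a size-bounded table lookup is below. The names sym_entry, sym_lookup, demo_syms and demo_name are hypothetical stand-ins for illustration only and are not the tracing core's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct trace_print_flags. */
struct sym_entry {
	unsigned long mask;
	const char *name;
};

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Size-bounded lookup: the loop stops at n, so no terminator entry is needed. */
static const char *sym_lookup(unsigned long val, const struct sym_entry *tbl, size_t n)
{
	assert(tbl != NULL);

	for (size_t i = 0; i < n; i++) {
		if (tbl[i].mask == val)
			return tbl[i].name;
	}
	return NULL;
}

/* Table without a sentinel; its length is taken at compile time. */
static const struct sym_entry demo_syms[] = {
	{ 0, "ABS" },
	{ 1, "REL" },
	{ 2, "PINNED" },
};

static const char *demo_name(unsigned long val)
{
	return sym_lookup(val, demo_syms, ARRAY_SIZE(demo_syms));
}
```

In this sketch demo_name(1) yields "REL" and an unknown value yields NULL. The array is one entry shorter than a sentinel-terminated table would be.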
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 9ef63e414791..9e1a6afb07f2 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -50,7 +50,7 @@ static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *re
local_irq_enable_exit_to_user(ti_work);
if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
- if (!rseq_grant_slice_extension(ti_work & TIF_SLICE_EXT_DENY))
+ if (!rseq_grant_slice_extension(ti_work, TIF_SLICE_EXT_DENY))
schedule();
}
@@ -225,6 +225,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
*/
if (state.exit_rcu) {
instrumentation_begin();
+ hrtimer_rearm_deferred();
/* Tell the tracer that IRET will enable interrupts */
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare();
@@ -238,6 +239,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
if (IS_ENABLED(CONFIG_PREEMPTION))
irqentry_exit_cond_resched();
+ hrtimer_rearm_deferred();
/* Covers both tracing and lockdep */
trace_hardirqs_on();
instrumentation_end();
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 496dff740dca..4495929f4c9b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -872,7 +872,14 @@ void update_rq_clock(struct rq *rq)
* Use HR-timers to deliver accurate preemption points.
*/
-static void hrtick_clear(struct rq *rq)
+enum {
+ HRTICK_SCHED_NONE = 0,
+ HRTICK_SCHED_DEFER = BIT(1),
+ HRTICK_SCHED_START = BIT(2),
+ HRTICK_SCHED_REARM_HRTIMER = BIT(3)
+};
+
+static void __used hrtick_clear(struct rq *rq)
{
if (hrtimer_active(&rq->hrtick_timer))
hrtimer_cancel(&rq->hrtick_timer);
@@ -897,12 +904,24 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
-static void __hrtick_restart(struct rq *rq)
+static inline bool hrtick_needs_rearm(struct hrtimer *timer, ktime_t expires)
+{
+ /*
+ * Queued is false when the timer is not started or currently
+ * running the callback. In both cases, restart. If queued check
+ * whether the expiry time actually changes substantially.
+ */
+ return !hrtimer_is_queued(timer) ||
+ abs(expires - hrtimer_get_expires(timer)) > 5000;
+}
+
+static void hrtick_cond_restart(struct rq *rq)
{
struct hrtimer *timer = &rq->hrtick_timer;
ktime_t time = rq->hrtick_time;
- hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
+ if (hrtick_needs_rearm(timer, time))
+ hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
}
/*
@@ -914,7 +933,7 @@ static void __hrtick_start(void *arg)
struct rq_flags rf;
rq_lock(rq, &rf);
- __hrtick_restart(rq);
+ hrtick_cond_restart(rq);
rq_unlock(rq, &rf);
}
@@ -925,7 +944,6 @@ static void __hrtick_start(void *arg)
*/
void hrtick_start(struct rq *rq, u64 delay)
{
- struct hrtimer *timer = &rq->hrtick_timer;
s64 delta;
/*
@@ -933,27 +951,67 @@ void hrtick_start(struct rq *rq, u64 delay)
* doesn't make sense and can cause timer DoS.
*/
delta = max_t(s64, delay, 10000LL);
- rq->hrtick_time = ktime_add_ns(hrtimer_cb_get_time(timer), delta);
+
+ /*
+ * If this is in the middle of schedule() only note the delay
+ * and let hrtick_schedule_exit() deal with it.
+ */
+ if (rq->hrtick_sched) {
+ rq->hrtick_sched |= HRTICK_SCHED_START;
+ rq->hrtick_delay = delta;
+ return;
+ }
+
+ rq->hrtick_time = ktime_add_ns(ktime_get(), delta);
+ if (!hrtick_needs_rearm(&rq->hrtick_timer, rq->hrtick_time))
+ return;
if (rq == this_rq())
- __hrtick_restart(rq);
+ hrtimer_start(&rq->hrtick_timer, rq->hrtick_time, HRTIMER_MODE_ABS_PINNED_HARD);
else
smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd);
}
-static void hrtick_rq_init(struct rq *rq)
+static inline void hrtick_schedule_enter(struct rq *rq)
{
- INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
- hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
+ rq->hrtick_sched = HRTICK_SCHED_DEFER;
+ if (hrtimer_test_and_clear_rearm_deferred())
+ rq->hrtick_sched |= HRTICK_SCHED_REARM_HRTIMER;
}
-#else /* !CONFIG_SCHED_HRTICK: */
-static inline void hrtick_clear(struct rq *rq)
+
+static inline void hrtick_schedule_exit(struct rq *rq)
{
+ if (rq->hrtick_sched & HRTICK_SCHED_START) {
+ rq->hrtick_time = ktime_add_ns(ktime_get(), rq->hrtick_delay);
+ hrtick_cond_restart(rq);
+ } else if (idle_rq(rq)) {
+ /*
+ * No need for using hrtimer_active(). The timer is CPU local
+ * and interrupts are disabled, so the callback cannot be
+ * running and the queued state is valid.
+ */
+ if (hrtimer_is_queued(&rq->hrtick_timer))
+ hrtimer_cancel(&rq->hrtick_timer);
+ }
+
+ if (rq->hrtick_sched & HRTICK_SCHED_REARM_HRTIMER)
+ __hrtimer_rearm_deferred();
+
+ rq->hrtick_sched = HRTICK_SCHED_NONE;
}
-static inline void hrtick_rq_init(struct rq *rq)
+static void hrtick_rq_init(struct rq *rq)
{
+ INIT_CSD(&rq->hrtick_csd, __hrtick_start, rq);
+ rq->hrtick_sched = HRTICK_SCHED_NONE;
+ hrtimer_setup(&rq->hrtick_timer, hrtick, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL_HARD | HRTIMER_MODE_LAZY_REARM);
}
+#else /* !CONFIG_SCHED_HRTICK: */
+static inline void hrtick_clear(struct rq *rq) { }
+static inline void hrtick_rq_init(struct rq *rq) { }
+static inline void hrtick_schedule_enter(struct rq *rq) { }
+static inline void hrtick_schedule_exit(struct rq *rq) { }
#endif /* !CONFIG_SCHED_HRTICK */
/*
@@ -5032,6 +5090,7 @@ static inline void finish_lock_switch(struct rq *rq)
*/
spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
__balance_callbacks(rq, NULL);
+ hrtick_schedule_exit(rq);
raw_spin_rq_unlock_irq(rq);
}
@@ -6785,9 +6844,6 @@ static void __sched notrace __schedule(int sched_mode)
schedule_debug(prev, preempt);
- if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
- hrtick_clear(rq);
-
klp_sched_try_switch(prev);
local_irq_disable();
@@ -6814,6 +6870,8 @@ static void __sched notrace __schedule(int sched_mode)
rq_lock(rq, &rf);
smp_mb__after_spinlock();
+ hrtick_schedule_enter(rq);
+
/* Promote REQ to ACT */
rq->clock_update_flags <<= 1;
update_rq_clock(rq);
@@ -6916,6 +6974,7 @@ static void __sched notrace __schedule(int sched_mode)
rq_unpin_lock(rq, &rf);
__balance_callbacks(rq, NULL);
+ hrtick_schedule_exit(rq);
raw_spin_rq_unlock_irq(rq);
}
trace_sched_exit_tp(is_switch);
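The rearm filter introduced by hrtick_needs_rearm() can be modelled in userspace as follows. This is only a sketch of the decision logic visible in the hunk above: struct model_timer and model_needs_rearm() are hypothetical names, and the 5000ns threshold is copied from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Threshold used by hrtick_needs_rearm() in the patch, in nanoseconds. */
#define REARM_THRESHOLD_NS 5000

/* Hypothetical model of a timer: only the state the check looks at. */
struct model_timer {
	bool queued;
	int64_t expires_ns;
};

/*
 * Restart when the timer is not queued. When it is queued, restart only
 * if the expiry moves by more than the threshold in either direction.
 */
static bool model_needs_rearm(const struct model_timer *t, int64_t new_expires_ns)
{
	int64_t diff;

	assert(t != NULL);

	if (!t->queued)
		return true;

	diff = new_expires_ns - t->expires_ns;
	if (diff < 0)
		diff = -diff;
	return diff > REARM_THRESHOLD_NS;
}
```

With a queued timer expiring at 1000000ns, a new expiry of 1004000ns is filtered out, while 1005001ns or 990000ns triggers a restart. A difference of exactly 5000ns is filtered out because the comparison is strict.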
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d08b00429323..9d619a4ec3d1 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1097,7 +1097,7 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
act = ns_to_ktime(dl_next_period(dl_se));
}
- now = hrtimer_cb_get_time(timer);
+ now = ktime_get();
delta = ktime_to_ns(now) - rq_clock(rq);
act = ktime_add_ns(act, delta);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab4114712be7..2be80780ff51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5600,7 +5600,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
* validating it and just reschedule.
*/
if (queued) {
- resched_curr_lazy(rq_of(cfs_rq));
+ resched_curr(rq_of(cfs_rq));
return;
}
#endif
@@ -6805,27 +6805,41 @@ static inline void sched_fair_update_stop_tick(struct rq *rq, struct task_struct
static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
struct sched_entity *se = &p->se;
+ unsigned long scale = 1024;
+ unsigned long util = 0;
+ u64 vdelta;
+ u64 delta;
WARN_ON_ONCE(task_rq(p) != rq);
- if (rq->cfs.h_nr_queued > 1) {
- u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
- u64 slice = se->slice;
- s64 delta = slice - ran;
+ if (rq->cfs.h_nr_queued <= 1)
+ return;
- if (delta < 0) {
- if (task_current_donor(rq, p))
- resched_curr(rq);
- return;
- }
- hrtick_start(rq, delta);
+ /*
+ * Compute time until virtual deadline
+ */
+ vdelta = se->deadline - se->vruntime;
+ if ((s64)vdelta < 0) {
+ if (task_current_donor(rq, p))
+ resched_curr(rq);
+ return;
}
+ delta = (se->load.weight * vdelta) / NICE_0_LOAD;
+
+ /*
+ * Correct for instantaneous load of other classes.
+ */
+ util += cpu_util_irq(rq);
+ if (util && util < 1024) {
+ scale *= 1024;
+ scale /= (1024 - util);
+ }
+
+ hrtick_start(rq, (scale * delta) / 1024);
}
/*
- * called from enqueue/dequeue and updates the hrtick when the
- * current task is from our class and nr_running is low enough
- * to matter.
+ * Called on enqueue to start the hrtick when h_nr_queued becomes more than 1.
*/
static void hrtick_update(struct rq *rq)
{
@@ -6834,6 +6848,9 @@ static void hrtick_update(struct rq *rq)
if (!hrtick_enabled_fair(rq) || donor->sched_class != &fair_sched_class)
return;
+ if (hrtick_active(rq))
+ return;
+
hrtick_start_fair(rq, donor);
}
#else /* !CONFIG_SCHED_HRTICK: */
@@ -7156,9 +7173,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
WARN_ON_ONCE(!task_sleep);
WARN_ON_ONCE(p->on_rq != 1);
- /* Fix-up what dequeue_task_fair() skipped */
- hrtick_update(rq);
-
/*
* Fix-up what block_task() skipped.
*
@@ -7192,8 +7206,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/*
* Must not reference @p after dequeue_entities(DEQUEUE_DELAYED).
*/
-
- hrtick_update(rq);
return true;
}
@@ -13435,11 +13447,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
entity_tick(cfs_rq, se, queued);
}
- if (queued) {
- if (!need_resched())
- hrtick_start_fair(rq, curr);
+ if (queued)
return;
- }
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
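The new delay computation in hrtick_start_fair() is easier to follow with numbers. The sketch below reproduces only the arithmetic from the hunk above in userspace. model_hrtick_delay() is a hypothetical name, and MODEL_NICE_0_LOAD is assumed to be 1024 with the weight in the same units; the kernel's NICE_0_LOAD and se->load.weight may use a different fixed-point scale depending on the configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: nice-0 weight in the same units as the weight argument. */
#define MODEL_NICE_0_LOAD	1024ULL
#define MODEL_CAPACITY		1024ULL

/*
 * Convert the virtual time left until the deadline into wall time and
 * stretch it by the share of the CPU consumed by interrupts.
 * The caller must have handled the "deadline already passed" case,
 * which the patch resolves with resched_curr() instead of a timer.
 */
static uint64_t model_hrtick_delay(uint64_t weight, uint64_t vdelta_ns, uint64_t irq_util)
{
	uint64_t scale = MODEL_CAPACITY;
	uint64_t delta;

	assert(vdelta_ns <= (uint64_t)INT64_MAX);

	delta = (weight * vdelta_ns) / MODEL_NICE_0_LOAD;

	/* Same guard as the patch: scale only for 0 < util < 1024. */
	if (irq_util && irq_util < MODEL_CAPACITY) {
		scale *= MODEL_CAPACITY;
		scale /= (MODEL_CAPACITY - irq_util);
	}

	return (scale * delta) / MODEL_CAPACITY;
}
```

For a nice-0 task with 3ms of virtual time left, the model returns 3000000ns without interrupt load and 6000000ns when interrupts consume half of the capacity (irq_util = 512).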
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 136a6584be79..d06228462607 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,8 +63,13 @@ SCHED_FEAT(DELAY_ZERO, true)
*/
SCHED_FEAT(WAKEUP_PREEMPTION, true)
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+SCHED_FEAT(HRTICK, true)
+SCHED_FEAT(HRTICK_DL, true)
+#else
SCHED_FEAT(HRTICK, false)
SCHED_FEAT(HRTICK_DL, false)
+#endif
/*
* Decrement CPU capacity based on time not spent running tasks
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1ef9ba480f51..a67c73ecdf79 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1288,6 +1288,8 @@ struct rq {
call_single_data_t hrtick_csd;
struct hrtimer hrtick_timer;
ktime_t hrtick_time;
+ ktime_t hrtick_delay;
+ unsigned int hrtick_sched;
#endif
#ifdef CONFIG_SCHEDSTATS
@@ -3033,46 +3035,31 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
* - enabled by features
* - hrtimer is actually high res
*/
-static inline int hrtick_enabled(struct rq *rq)
+static inline bool hrtick_enabled(struct rq *rq)
{
- if (!cpu_active(cpu_of(rq)))
- return 0;
- return hrtimer_is_hres_active(&rq->hrtick_timer);
+ return cpu_active(cpu_of(rq)) && hrtimer_highres_enabled();
}
-static inline int hrtick_enabled_fair(struct rq *rq)
+static inline bool hrtick_enabled_fair(struct rq *rq)
{
- if (!sched_feat(HRTICK))
- return 0;
- return hrtick_enabled(rq);
+ return sched_feat(HRTICK) && hrtick_enabled(rq);
}
-static inline int hrtick_enabled_dl(struct rq *rq)
+static inline bool hrtick_enabled_dl(struct rq *rq)
{
- if (!sched_feat(HRTICK_DL))
- return 0;
- return hrtick_enabled(rq);
+ return sched_feat(HRTICK_DL) && hrtick_enabled(rq);
}
extern void hrtick_start(struct rq *rq, u64 delay);
-
-#else /* !CONFIG_SCHED_HRTICK: */
-
-static inline int hrtick_enabled_fair(struct rq *rq)
-{
- return 0;
-}
-
-static inline int hrtick_enabled_dl(struct rq *rq)
-{
- return 0;
-}
-
-static inline int hrtick_enabled(struct rq *rq)
+static inline bool hrtick_active(struct rq *rq)
{
- return 0;
+ return hrtimer_active(&rq->hrtick_timer);
}
+#else /* !CONFIG_SCHED_HRTICK: */
+static inline bool hrtick_enabled_fair(struct rq *rq) { return false; }
+static inline bool hrtick_enabled_dl(struct rq *rq) { return false; }
+static inline bool hrtick_enabled(struct rq *rq) { return false; }
#endif /* !CONFIG_SCHED_HRTICK */
#ifndef arch_scale_freq_tick
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 77198911b8dd..4425d8dce44b 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -663,6 +663,13 @@ void irq_enter_rcu(void)
{
__irq_enter_raw();
+ /*
+ * If this is a nested interrupt that hits the exit_to_user_mode_loop
+ * where it has enabled interrupts but before it has hit schedule() we
+ * could have hrtimers in an undefined state. Fix it up here.
+ */
+ hrtimer_rearm_deferred();
+
if (tick_nohz_full_cpu(smp_processor_id()) ||
(is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
tick_irq_enter();
@@ -719,8 +726,14 @@ static inline void __irq_exit_rcu(void)
#endif
account_hardirq_exit(current);
preempt_count_sub(HARDIRQ_OFFSET);
- if (!in_interrupt() && local_softirq_pending())
+ if (!in_interrupt() && local_softirq_pending()) {
+ /*
+ * If we left hrtimers unarmed, make sure to arm them now,
+ * before enabling interrupts to run SoftIRQ.
+ */
+ hrtimer_rearm_deferred();
invoke_softirq();
+ }
if (IS_ENABLED(CONFIG_IRQ_FORCED_THREADING) && force_irqthreads() &&
local_timers_pending_force_th() && !(in_nmi() | in_hardirq()))
diff --git a/kernel/time/.kunitconfig b/kernel/time/.kunitconfig
new file mode 100644
index 000000000000..d60a611b2853
--- /dev/null
+++ b/kernel/time/.kunitconfig
@@ -0,0 +1,2 @@
+CONFIG_KUNIT=y
+CONFIG_TIME_KUNIT_TEST=y
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 7c6a52f7836c..6a11964377e6 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -17,6 +17,9 @@ config ARCH_CLOCKSOURCE_DATA
config ARCH_CLOCKSOURCE_INIT
bool
+config ARCH_WANTS_CLOCKSOURCE_READ_INLINE
+ bool
+
# Timekeeping vsyscall support
config GENERIC_TIME_VSYSCALL
bool
@@ -44,10 +47,23 @@ config GENERIC_CLOCKEVENTS_BROADCAST_IDLE
config GENERIC_CLOCKEVENTS_MIN_ADJUST
bool
+config GENERIC_CLOCKEVENTS_COUPLED
+ bool
+
+config GENERIC_CLOCKEVENTS_COUPLED_INLINE
+ select GENERIC_CLOCKEVENTS_COUPLED
+ bool
+
# Generic update of CMOS clock
config GENERIC_CMOS_UPDATE
bool
+# Deferred rearming of the hrtimer interrupt
+config HRTIMER_REARM_DEFERRED
+ def_bool y
+ depends on GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+ depends on HIGH_RES_TIMERS && SCHED_HRTICK
+
# Select to handle posix CPU timers from task_work
# and not from the timer interrupt context
config HAVE_POSIX_CPU_TIMERS_TASK_WORK
@@ -196,18 +212,6 @@ config HIGH_RES_TIMERS
hardware is not capable then this option only increases
the size of the kernel image.
-config CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
- int "Clocksource watchdog maximum allowable skew (in microseconds)"
- depends on CLOCKSOURCE_WATCHDOG
- range 50 1000
- default 125
- help
- Specify the maximum amount of allowable watchdog skew in
- microseconds before reporting the clocksource to be unstable.
- The default is based on a half-second clocksource watchdog
- interval and NTP's maximum frequency drift of 500 parts
- per million. If the clocksource is good enough for NTP,
- it is good enough for the clocksource watchdog!
endif
config POSIX_AUX_CLOCKS
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index b64db405ba5c..6e173d70d825 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -234,19 +234,23 @@ static int alarmtimer_suspend(struct device *dev)
if (!rtc)
return 0;
- /* Find the soonest timer to expire*/
+ /* Find the soonest timer to expire */
for (i = 0; i < ALARM_NUMTYPE; i++) {
struct alarm_base *base = &alarm_bases[i];
struct timerqueue_node *next;
+ ktime_t next_expires;
ktime_t delta;
- scoped_guard(spinlock_irqsave, &base->lock)
+ scoped_guard(spinlock_irqsave, &base->lock) {
next = timerqueue_getnext(&base->timerqueue);
+ if (next)
+ next_expires = next->expires;
+ }
if (!next)
continue;
- delta = ktime_sub(next->expires, base->get_ktime());
+ delta = ktime_sub(next_expires, base->get_ktime());
if (!min || (delta < min)) {
- expires = next->expires;
+ expires = next_expires;
min = delta;
type = i;
}
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index eaae1ce9f060..b4d730604972 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -172,6 +172,7 @@ void clockevents_shutdown(struct clock_event_device *dev)
{
clockevents_switch_state(dev, CLOCK_EVT_STATE_SHUTDOWN);
dev->next_event = KTIME_MAX;
+ dev->next_event_forced = 0;
}
/**
@@ -292,6 +293,38 @@ static int clockevents_program_min_delta(struct clock_event_device *dev)
#endif /* CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST */
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_COUPLED
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE
+#include <asm/clock_inlined.h>
+#else
+static __always_inline void
+arch_inlined_clockevent_set_next_coupled(u64 cycles, struct clock_event_device *dev) { }
+#endif
+
+static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
+{
+ u64 cycles;
+
+ if (unlikely(!(dev->features & CLOCK_EVT_FEAT_CLOCKSOURCE_COUPLED)))
+ return false;
+
+ if (unlikely(!ktime_expiry_to_cycles(dev->cs_id, expires, &cycles)))
+ return false;
+
+ if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED_INLINE))
+ arch_inlined_clockevent_set_next_coupled(cycles, dev);
+ else
+ dev->set_next_coupled(cycles, dev);
+ return true;
+}
+
+#else
+static inline bool clockevent_set_next_coupled(struct clock_event_device *dev, ktime_t expires)
+{
+ return false;
+}
+#endif
+
/**
* clockevents_program_event - Reprogram the clock event device.
* @dev: device to program
@@ -300,12 +333,10 @@ static int clockevents_program_min_delta(struct clock_event_device *dev)
*
* Returns 0 on success, -ETIME when the event is in the past.
*/
-int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
- bool force)
+int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, bool force)
{
- unsigned long long clc;
int64_t delta;
- int rc;
+ u64 cycles;
if (WARN_ON_ONCE(expires < 0))
return -ETIME;
@@ -319,21 +350,35 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
WARN_ONCE(!clockevent_state_oneshot(dev), "Current state: %d\n",
clockevent_get_state(dev));
- /* Shortcut for clockevent devices that can deal with ktime. */
- if (dev->features & CLOCK_EVT_FEAT_KTIME)
+ /* ktime_t based reprogramming for the broadcast hrtimer device */
+ if (unlikely(dev->features & CLOCK_EVT_FEAT_HRTIMER))
return dev->set_next_ktime(expires, dev);
+ if (likely(clockevent_set_next_coupled(dev, expires)))
+ return 0;
+
delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
- if (delta <= 0)
- return force ? clockevents_program_min_delta(dev) : -ETIME;
- delta = min(delta, (int64_t) dev->max_delta_ns);
- delta = max(delta, (int64_t) dev->min_delta_ns);
+ /* Required for tick_periodic() during early boot */
+ if (delta <= 0 && !force)
+ return -ETIME;
+
+ if (delta > (int64_t)dev->min_delta_ns) {
+ delta = min(delta, (int64_t) dev->max_delta_ns);
+ cycles = ((u64)delta * dev->mult) >> dev->shift;
+ if (!dev->set_next_event((unsigned long) cycles, dev))
+ return 0;
+ }
- clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
- rc = dev->set_next_event((unsigned long) clc, dev);
+ if (dev->next_event_forced)
+ return 0;
- return (rc && force) ? clockevents_program_min_delta(dev) : rc;
+ if (dev->set_next_event(dev->min_delta_ticks, dev)) {
+ if (!force || clockevents_program_min_delta(dev))
+ return -ETIME;
+ }
+ dev->next_event_forced = 1;
+ return 0;
}
/*
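The reworked non-coupled path of clockevents_program_event() converts the delta to device cycles only when it exceeds min_delta_ns and otherwise falls back to min_delta_ticks. A simplified userspace sketch of that selection is below. struct model_dev and model_delta_to_cycles() are hypothetical names; the model ignores set_next_event() failures, the force argument and the next_event_forced bookkeeping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of struct clock_event_device used by the model. */
struct model_dev {
	int64_t min_delta_ns;
	int64_t max_delta_ns;
	uint64_t min_delta_ticks;
	uint32_t mult;
	uint32_t shift;
};

/*
 * Pick the cycle count to program:
 *  - delta above the minimum: clamp to the maximum and scale with
 *    (delta * mult) >> shift, as in the patch
 *  - otherwise: use the device's minimum tick count
 */
static uint64_t model_delta_to_cycles(const struct model_dev *dev, int64_t delta_ns)
{
	assert(dev != NULL);

	if (delta_ns > dev->min_delta_ns) {
		if (delta_ns > dev->max_delta_ns)
			delta_ns = dev->max_delta_ns;
		return ((uint64_t)delta_ns * dev->mult) >> dev->shift;
	}
	return dev->min_delta_ticks;
}
```

With mult = 1 << 31 and shift = 32 (one cycle per 2ns), a 4000ns delta maps to 2000 cycles, a delta above max_delta_ns is clamped first, and a delta at or below min_delta_ns, including an already expired one, maps to min_delta_ticks.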
diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
index 38dae590b29f..b4cf17b4aeed 100644
--- a/kernel/time/clocksource-wdtest.c
+++ b/kernel/time/clocksource-wdtest.c
@@ -3,202 +3,196 @@
* Unit test for the clocksource watchdog.
*
* Copyright (C) 2021 Facebook, Inc.
+ * Copyright (C) 2026 Intel Corp.
*
* Author: Paul E. McKenney <paulmck@kernel.org>
+ * Author: Thomas Gleixner <tglx@kernel.org>
*/
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-#include <linux/device.h>
#include <linux/clocksource.h>
-#include <linux/init.h>
+#include <linux/delay.h>
#include <linux/module.h>
-#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
-#include <linux/tick.h>
#include <linux/kthread.h>
-#include <linux/delay.h>
-#include <linux/prandom.h>
-#include <linux/cpu.h>
#include "tick-internal.h"
+#include "timekeeping_internal.h"
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Clocksource watchdog unit test");
MODULE_AUTHOR("Paul E. McKenney <paulmck@kernel.org>");
+MODULE_AUTHOR("Thomas Gleixner <tglx@kernel.org>");
+
+enum wdtest_states {
+ WDTEST_INJECT_NONE,
+ WDTEST_INJECT_DELAY,
+ WDTEST_INJECT_POSITIVE,
+ WDTEST_INJECT_NEGATIVE,
+ WDTEST_INJECT_PERCPU = 0x100,
+};
-static int holdoff = IS_BUILTIN(CONFIG_TEST_CLOCKSOURCE_WATCHDOG) ? 10 : 0;
-module_param(holdoff, int, 0444);
-MODULE_PARM_DESC(holdoff, "Time to wait to start test (s).");
+static enum wdtest_states wdtest_state;
+static unsigned long wdtest_test_count;
+static ktime_t wdtest_last_ts, wdtest_offset;
-/* Watchdog kthread's task_struct pointer for debug purposes. */
-static struct task_struct *wdtest_task;
+#define SHIFT_4000PPM 8
-static u64 wdtest_jiffies_read(struct clocksource *cs)
+static ktime_t wdtest_get_offset(struct clocksource *cs)
{
- return (u64)jiffies;
-}
-
-static struct clocksource clocksource_wdtest_jiffies = {
- .name = "wdtest-jiffies",
- .rating = 1, /* lowest valid rating*/
- .uncertainty_margin = TICK_NSEC,
- .read = wdtest_jiffies_read,
- .mask = CLOCKSOURCE_MASK(32),
- .flags = CLOCK_SOURCE_MUST_VERIFY,
- .mult = TICK_NSEC << JIFFIES_SHIFT, /* details above */
- .shift = JIFFIES_SHIFT,
- .max_cycles = 10,
-};
+ if (wdtest_state < WDTEST_INJECT_PERCPU)
+ return wdtest_test_count & 0x1 ? 0 : wdtest_offset >> SHIFT_4000PPM;
-static int wdtest_ktime_read_ndelays;
-static bool wdtest_ktime_read_fuzz;
+ /* Only affect the readout of the "remote" CPU */
+ return cs->wd_cpu == smp_processor_id() ? 0 : NSEC_PER_MSEC;
+}
static u64 wdtest_ktime_read(struct clocksource *cs)
{
- int wkrn = READ_ONCE(wdtest_ktime_read_ndelays);
- static int sign = 1;
- u64 ret;
+ ktime_t now = ktime_get_raw_fast_ns();
+ ktime_t intv = now - wdtest_last_ts;
- if (wkrn) {
- udelay(cs->uncertainty_margin / 250);
- WRITE_ONCE(wdtest_ktime_read_ndelays, wkrn - 1);
- }
- ret = ktime_get_real_fast_ns();
- if (READ_ONCE(wdtest_ktime_read_fuzz)) {
- sign = -sign;
- ret = ret + sign * 100 * NSEC_PER_MSEC;
+ /*
+ * Only increment the test counter once per watchdog interval and
+ * store the interval for the offset calculation of this step. This
+ * guarantees a consistent behaviour even if the other side needs
+ * to repeat due to a watchdog read timeout.
+ */
+ if (intv > (NSEC_PER_SEC / 4)) {
+ WRITE_ONCE(wdtest_test_count, wdtest_test_count + 1);
+ wdtest_last_ts = now;
+ wdtest_offset = intv;
}
- return ret;
-}
-static void wdtest_ktime_cs_mark_unstable(struct clocksource *cs)
-{
- pr_info("--- Marking %s unstable due to clocksource watchdog.\n", cs->name);
+ switch (wdtest_state & ~WDTEST_INJECT_PERCPU) {
+ case WDTEST_INJECT_POSITIVE:
+ return now + wdtest_get_offset(cs);
+ case WDTEST_INJECT_NEGATIVE:
+ return now - wdtest_get_offset(cs);
+ case WDTEST_INJECT_DELAY:
+ udelay(500);
+ return now;
+ default:
+ return now;
+ }
}
-#define KTIME_FLAGS (CLOCK_SOURCE_IS_CONTINUOUS | \
- CLOCK_SOURCE_VALID_FOR_HRES | \
- CLOCK_SOURCE_MUST_VERIFY | \
- CLOCK_SOURCE_VERIFY_PERCPU)
+#define KTIME_FLAGS (CLOCK_SOURCE_IS_CONTINUOUS | \
+ CLOCK_SOURCE_CALIBRATED | \
+ CLOCK_SOURCE_MUST_VERIFY | \
+ CLOCK_SOURCE_WDTEST)
static struct clocksource clocksource_wdtest_ktime = {
.name = "wdtest-ktime",
- .rating = 300,
+ .rating = 10,
.read = wdtest_ktime_read,
.mask = CLOCKSOURCE_MASK(64),
.flags = KTIME_FLAGS,
- .mark_unstable = wdtest_ktime_cs_mark_unstable,
.list = LIST_HEAD_INIT(clocksource_wdtest_ktime.list),
};
-/* Reset the clocksource if needed. */
-static void wdtest_ktime_clocksource_reset(void)
+static void wdtest_clocksource_reset(enum wdtest_states which, bool percpu)
+{
+ clocksource_unregister(&clocksource_wdtest_ktime);
+
+ pr_info("Test: State %d percpu %d\n", which, percpu);
+
+ wdtest_state = which;
+ if (percpu)
+ wdtest_state |= WDTEST_INJECT_PERCPU;
+ wdtest_test_count = 0;
+ wdtest_last_ts = 0;
+
+ clocksource_wdtest_ktime.rating = 10;
+ clocksource_wdtest_ktime.flags = KTIME_FLAGS;
+ if (percpu)
+ clocksource_wdtest_ktime.flags |= CLOCK_SOURCE_WDTEST_PERCPU;
+ clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
+}
+
+static bool wdtest_execute(enum wdtest_states which, bool percpu, unsigned int expect,
+ unsigned long calls)
{
- if (clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE) {
- clocksource_unregister(&clocksource_wdtest_ktime);
- clocksource_wdtest_ktime.flags = KTIME_FLAGS;
- schedule_timeout_uninterruptible(HZ / 10);
- clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
+ wdtest_clocksource_reset(which, percpu);
+
+ for (; READ_ONCE(wdtest_test_count) < calls; msleep(100)) {
+ unsigned int flags = READ_ONCE(clocksource_wdtest_ktime.flags);
+
+ if (kthread_should_stop())
+ return false;
+
+ if (flags & CLOCK_SOURCE_UNSTABLE) {
+ if (expect & CLOCK_SOURCE_UNSTABLE)
+ return true;
+ pr_warn("Fail: Unexpected unstable\n");
+ return false;
+ }
+ if (flags & CLOCK_SOURCE_VALID_FOR_HRES) {
+ if (expect & CLOCK_SOURCE_VALID_FOR_HRES)
+ return true;
+ pr_warn("Fail: Unexpected valid for highres\n");
+ return false;
+ }
}
+
+ if (!expect)
+ return true;
+
+ pr_warn("Fail: Timed out\n");
+ return false;
}
-/* Run the specified series of watchdog tests. */
-static int wdtest_func(void *arg)
+static bool wdtest_run(bool percpu)
{
- unsigned long j1, j2;
- int i, max_retries;
- char *s;
+ if (!wdtest_execute(WDTEST_INJECT_NONE, percpu, CLOCK_SOURCE_VALID_FOR_HRES, 8))
+ return false;
- schedule_timeout_uninterruptible(holdoff * HZ);
+ if (!wdtest_execute(WDTEST_INJECT_DELAY, percpu, 0, 4))
+ return false;
- /*
- * Verify that jiffies-like clocksources get the manually
- * specified uncertainty margin.
- */
- pr_info("--- Verify jiffies-like uncertainty margin.\n");
- __clocksource_register(&clocksource_wdtest_jiffies);
- WARN_ON_ONCE(clocksource_wdtest_jiffies.uncertainty_margin != TICK_NSEC);
+ if (!wdtest_execute(WDTEST_INJECT_POSITIVE, percpu, CLOCK_SOURCE_UNSTABLE, 8))
+ return false;
- j1 = clocksource_wdtest_jiffies.read(&clocksource_wdtest_jiffies);
- schedule_timeout_uninterruptible(HZ);
- j2 = clocksource_wdtest_jiffies.read(&clocksource_wdtest_jiffies);
- WARN_ON_ONCE(j1 == j2);
+ if (!wdtest_execute(WDTEST_INJECT_NEGATIVE, percpu, CLOCK_SOURCE_UNSTABLE, 8))
+ return false;
- clocksource_unregister(&clocksource_wdtest_jiffies);
+ return true;
+}
- /*
- * Verify that tsc-like clocksources are assigned a reasonable
- * uncertainty margin.
- */
- pr_info("--- Verify tsc-like uncertainty margin.\n");
+static int wdtest_func(void *arg)
+{
clocksource_register_khz(&clocksource_wdtest_ktime, 1000 * 1000);
- WARN_ON_ONCE(clocksource_wdtest_ktime.uncertainty_margin < NSEC_PER_USEC);
-
- j1 = clocksource_wdtest_ktime.read(&clocksource_wdtest_ktime);
- udelay(1);
- j2 = clocksource_wdtest_ktime.read(&clocksource_wdtest_ktime);
- pr_info("--- tsc-like times: %lu - %lu = %lu.\n", j2, j1, j2 - j1);
- WARN_ONCE(time_before(j2, j1 + NSEC_PER_USEC),
- "Expected at least 1000ns, got %lu.\n", j2 - j1);
-
- /* Verify tsc-like stability with various numbers of errors injected. */
- max_retries = clocksource_get_max_watchdog_retry();
- for (i = 0; i <= max_retries + 1; i++) {
- if (i <= 1 && i < max_retries)
- s = "";
- else if (i <= max_retries)
- s = ", expect message";
- else
- s = ", expect clock skew";
- pr_info("--- Watchdog with %dx error injection, %d retries%s.\n", i, max_retries, s);
- WRITE_ONCE(wdtest_ktime_read_ndelays, i);
- schedule_timeout_uninterruptible(2 * HZ);
- WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
- WARN_ON_ONCE((i <= max_retries) !=
- !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
- wdtest_ktime_clocksource_reset();
+ if (wdtest_run(false)) {
+ if (wdtest_run(true))
+ pr_info("Success: All tests passed\n");
}
-
- /* Verify tsc-like stability with clock-value-fuzz error injection. */
- pr_info("--- Watchdog clock-value-fuzz error injection, expect clock skew and per-CPU mismatches.\n");
- WRITE_ONCE(wdtest_ktime_read_fuzz, true);
- schedule_timeout_uninterruptible(2 * HZ);
- WARN_ON_ONCE(!(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
- clocksource_verify_percpu(&clocksource_wdtest_ktime);
- WRITE_ONCE(wdtest_ktime_read_fuzz, false);
-
clocksource_unregister(&clocksource_wdtest_ktime);
- pr_info("--- Done with test.\n");
- return 0;
-}
+ if (!IS_MODULE(CONFIG_TEST_CLOCKSOURCE_WATCHDOG))
+ return 0;
-static void wdtest_print_module_parms(void)
-{
- pr_alert("--- holdoff=%d\n", holdoff);
+ while (!kthread_should_stop())
+ schedule_timeout_interruptible(3600 * HZ);
+ return 0;
}
-/* Cleanup function. */
-static void clocksource_wdtest_cleanup(void)
-{
-}
+static struct task_struct *wdtest_thread;
static int __init clocksource_wdtest_init(void)
{
- int ret = 0;
-
- wdtest_print_module_parms();
+ struct task_struct *t = kthread_run(wdtest_func, NULL, "wdtest");
- /* Create watchdog-test task. */
- wdtest_task = kthread_run(wdtest_func, NULL, "wdtest");
- if (IS_ERR(wdtest_task)) {
- ret = PTR_ERR(wdtest_task);
- pr_warn("%s: Failed to create wdtest kthread.\n", __func__);
- wdtest_task = NULL;
- return ret;
+ if (IS_ERR(t)) {
+ pr_warn("Failed to create wdtest kthread.\n");
+ return PTR_ERR(t);
}
-
+ wdtest_thread = t;
return 0;
}
-
module_init(clocksource_wdtest_init);
+
+static void clocksource_wdtest_cleanup(void)
+{
+ if (wdtest_thread)
+ kthread_stop(wdtest_thread);
+}
module_exit(clocksource_wdtest_cleanup);
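The error injection in wdtest_get_offset() shifts the measured interval right by SHIFT_4000PPM (8), i.e. divides it by 256, which is roughly 3906 ppm. In the non-percpu case the offset is applied only on even test counts. A userspace sketch of that arithmetic is below; model_injected_offset() and model_offset_ppm() are hypothetical helper names and cover only the non-percpu branch.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SHIFT_4000PPM 8	/* same value as SHIFT_4000PPM in the patch */

/* Offset injected for a given test count: zero on odd counts. */
static int64_t model_injected_offset(uint64_t test_count, int64_t interval_ns)
{
	assert(interval_ns >= 0);

	return (test_count & 0x1) ? 0 : interval_ns >> MODEL_SHIFT_4000PPM;
}

/* Injected offset relative to the interval, in parts per million (truncated). */
static int64_t model_offset_ppm(int64_t interval_ns)
{
	assert(interval_ns > 0);

	return ((interval_ns >> MODEL_SHIFT_4000PPM) * 1000000) / interval_ns;
}
```

For a 0.5s interval (500000000ns) the injected offset is 1953125ns on even counts and zero on odd counts, which corresponds to 3906 ppm after truncation.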
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index df7194961658..baee13a1f87f 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -7,15 +7,17 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-#include <linux/device.h>
#include <linux/clocksource.h>
+#include <linux/cpu.h>
+#include <linux/delay.h>
+#include <linux/device.h>
#include <linux/init.h>
-#include <linux/module.h>
-#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
-#include <linux/tick.h>
#include <linux/kthread.h>
+#include <linux/module.h>
#include <linux/prandom.h>
-#include <linux/cpu.h>
+#include <linux/sched.h>
+#include <linux/tick.h>
+#include <linux/topology.h>
#include "tick-internal.h"
#include "timekeeping_internal.h"
@@ -107,48 +109,6 @@ static char override_name[CS_NAME_LEN];
static int finished_booting;
static u64 suspend_start;
-/*
- * Interval: 0.5sec.
- */
-#define WATCHDOG_INTERVAL (HZ >> 1)
-#define WATCHDOG_INTERVAL_MAX_NS ((2 * WATCHDOG_INTERVAL) * (NSEC_PER_SEC / HZ))
-
-/*
- * Threshold: 0.0312s, when doubled: 0.0625s.
- */
-#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 5)
-
-/*
- * Maximum permissible delay between two readouts of the watchdog
- * clocksource surrounding a read of the clocksource being validated.
- * This delay could be due to SMIs, NMIs, or to VCPU preemptions. Used as
- * a lower bound for cs->uncertainty_margin values when registering clocks.
- *
- * The default of 500 parts per million is based on NTP's limits.
- * If a clocksource is good enough for NTP, it is good enough for us!
- *
- * In other words, by default, even if a clocksource is extremely
- * precise (for example, with a sub-nanosecond period), the maximum
- * permissible skew between the clocksource watchdog and the clocksource
- * under test is not permitted to go below the 500ppm minimum defined
- * by MAX_SKEW_USEC. This 500ppm minimum may be overridden using the
- * CLOCKSOURCE_WATCHDOG_MAX_SKEW_US Kconfig option.
- */
-#ifdef CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
-#define MAX_SKEW_USEC CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
-#else
-#define MAX_SKEW_USEC (125 * WATCHDOG_INTERVAL / HZ)
-#endif
-
-/*
- * Default for maximum permissible skew when cs->uncertainty_margin is
- * not specified, and the lower bound even when cs->uncertainty_margin
- * is specified. This is also the default that is used when registering
- * clocks with unspecified cs->uncertainty_margin, so this macro is used
- * even in CONFIG_CLOCKSOURCE_WATCHDOG=n kernels.
- */
-#define WATCHDOG_MAX_SKEW (MAX_SKEW_USEC * NSEC_PER_USEC)
-
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
static void clocksource_watchdog_work(struct work_struct *work);
static void clocksource_select(void);
@@ -160,7 +120,42 @@ static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
static DEFINE_SPINLOCK(watchdog_lock);
static int watchdog_running;
static atomic_t watchdog_reset_pending;
-static int64_t watchdog_max_interval;
+
+/* Watchdog interval: 0.5sec. */
+#define WATCHDOG_INTERVAL (HZ >> 1)
+#define WATCHDOG_INTERVAL_NS (WATCHDOG_INTERVAL * (NSEC_PER_SEC / HZ))
+
+/* Maximum time between two reference watchdog readouts */
+#define WATCHDOG_READOUT_MAX_NS (50U * NSEC_PER_USEC)
+
+/*
+ * Maximum time between two remote readouts for NUMA=n. On NUMA enabled systems
+ * the timeout is calculated from the NUMA distance.
+ */
+#define WATCHDOG_DEFAULT_TIMEOUT_NS (50U * NSEC_PER_USEC)
+
+/*
+ * Remote timeout NUMA distance multiplier. The local distance is 10. The
+ * default remote distance is 20. ACPI tables provide more accurate numbers
+ * which are guaranteed to be greater than the local distance.
+ *
+ * This results in a 5us value per distance unit, which for the local
+ * distance of 10 is equivalent to the above !NUMA default.
+ */
+#define WATCHDOG_NUMA_MULTIPLIER_NS ((u64)(WATCHDOG_DEFAULT_TIMEOUT_NS / LOCAL_DISTANCE))
+
+/* Limit the NUMA timeout in case the distance values are insanely big */
+#define WATCHDOG_NUMA_MAX_TIMEOUT_NS ((u64)(500U * NSEC_PER_USEC))
+
+/* Shift values to calculate the approximate $N ppm of a given delta. */
+#define SHIFT_500PPM 11
+#define SHIFT_4000PPM 8
+
+/* Number of attempts to read the watchdog */
+#define WATCHDOG_FREQ_RETRIES 3
+
+/* Five reads local and remote for inter CPU skew detection */
+#define WATCHDOG_REMOTE_MAX_SEQ 10
static inline void clocksource_watchdog_lock(unsigned long *flags)
{
@@ -241,204 +236,422 @@ void clocksource_mark_unstable(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}
-static int verify_n_cpus = 8;
-module_param(verify_n_cpus, int, 0644);
+static inline void clocksource_reset_watchdog(void)
+{
+ struct clocksource *cs;
-enum wd_read_status {
- WD_READ_SUCCESS,
- WD_READ_UNSTABLE,
- WD_READ_SKIP
+ list_for_each_entry(cs, &watchdog_list, wd_list)
+ cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
+}
+
+enum wd_result {
+ WD_SUCCESS,
+ WD_FREQ_NO_WATCHDOG,
+ WD_FREQ_TIMEOUT,
+ WD_FREQ_RESET,
+ WD_FREQ_SKEWED,
+ WD_CPU_TIMEOUT,
+ WD_CPU_SKEWED,
+};
+
+struct watchdog_cpu_data {
+ /* Keep first as it is 32 byte aligned */
+ call_single_data_t csd;
+ atomic_t remote_inprogress;
+ enum wd_result result;
+ u64 cpu_ts[2];
+ struct clocksource *cs;
+ /* Ensure that the sequence is in a separate cache line */
+ atomic_t seq ____cacheline_aligned;
+ /* Set by the control CPU according to NUMA distance */
+ u64 timeout_ns;
};
-static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
-{
- int64_t md = watchdog->uncertainty_margin;
- unsigned int nretries, max_retries;
- int64_t wd_delay, wd_seq_delay;
- u64 wd_end, wd_end2;
-
- max_retries = clocksource_get_max_watchdog_retry();
- for (nretries = 0; nretries <= max_retries; nretries++) {
- local_irq_disable();
- *wdnow = watchdog->read(watchdog);
- *csnow = cs->read(cs);
- wd_end = watchdog->read(watchdog);
- wd_end2 = watchdog->read(watchdog);
- local_irq_enable();
-
- wd_delay = cycles_to_nsec_safe(watchdog, *wdnow, wd_end);
- if (wd_delay <= md + cs->uncertainty_margin) {
- if (nretries > 1 && nretries >= max_retries) {
- pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
- smp_processor_id(), watchdog->name, nretries);
+struct watchdog_data {
+ raw_spinlock_t lock;
+ enum wd_result result;
+
+ u64 wd_seq;
+ u64 wd_delta;
+ u64 cs_delta;
+ u64 cpu_ts[2];
+
+ unsigned int curr_cpu;
+} ____cacheline_aligned_in_smp;
+
+static void watchdog_check_skew_remote(void *unused);
+
+static DEFINE_PER_CPU_ALIGNED(struct watchdog_cpu_data, watchdog_cpu_data) = {
+ .csd = CSD_INIT(watchdog_check_skew_remote, NULL),
+};
+
+static struct watchdog_data watchdog_data = {
+ .lock = __RAW_SPIN_LOCK_UNLOCKED(watchdog_data.lock),
+};
+
+static inline void watchdog_set_result(struct watchdog_cpu_data *wd, enum wd_result result)
+{
+ guard(raw_spinlock)(&watchdog_data.lock);
+ if (!wd->result) {
+ atomic_set(&wd->seq, WATCHDOG_REMOTE_MAX_SEQ);
+ WRITE_ONCE(wd->result, result);
+ }
+}
+
+/* Wait for the sequence number to hand over control. */
+static bool watchdog_wait_seq(struct watchdog_cpu_data *wd, u64 start, int seq)
+{
+ for (int cnt = 0; atomic_read(&wd->seq) < seq; cnt++) {
+ /* Bail if the other side set an error result */
+ if (READ_ONCE(wd->result) != WD_SUCCESS)
+ return false;
+
+ /* Prevent endless loops if the other CPU does not react. */
+ if (cnt == 5000) {
+ u64 nsecs = ktime_get_raw_fast_ns();
+
+ if (nsecs - start >= wd->timeout_ns) {
+ watchdog_set_result(wd, WD_CPU_TIMEOUT);
+ return false;
}
- return WD_READ_SUCCESS;
+ cnt = 0;
}
+ cpu_relax();
+ }
+ return seq < WATCHDOG_REMOTE_MAX_SEQ;
+}
- /*
- * Now compute delay in consecutive watchdog read to see if
- * there is too much external interferences that cause
- * significant delay in reading both clocksource and watchdog.
- *
- * If consecutive WD read-back delay > md, report
- * system busy, reinit the watchdog and skip the current
- * watchdog test.
- */
- wd_seq_delay = cycles_to_nsec_safe(watchdog, wd_end, wd_end2);
- if (wd_seq_delay > md)
- goto skip_test;
+static void watchdog_check_skew(struct watchdog_cpu_data *wd, int index)
+{
+ u64 prev, now, delta, start = ktime_get_raw_fast_ns();
+ int local = index, remote = (index + 1) & 0x1;
+ struct clocksource *cs = wd->cs;
+
+ /* Set the local timestamp so that the first iteration works correctly */
+ wd->cpu_ts[local] = cs->read(cs);
+
+ /* Signal arrival */
+ atomic_inc(&wd->seq);
+
+ for (int seq = local + 2; seq < WATCHDOG_REMOTE_MAX_SEQ; seq += 2) {
+ if (!watchdog_wait_seq(wd, start, seq))
+ return;
+
+ /* Capture local timestamp before possible non-local coherency overhead */
+ now = cs->read(cs);
+
+ /* Store local timestamp before reading remote to limit coherency stalls */
+ wd->cpu_ts[local] = now;
+
+ prev = wd->cpu_ts[remote];
+ delta = (now - prev) & cs->mask;
+
+ if (delta > cs->max_raw_delta) {
+ watchdog_set_result(wd, WD_CPU_SKEWED);
+ return;
+ }
+
+ /* Hand over to the remote CPU */
+ atomic_inc(&wd->seq);
}
+}
- pr_warn("timekeeping watchdog on CPU%d: wd-%s-wd excessive read-back delay of %lldns vs. limit of %ldns, wd-wd read-back delay only %lldns, attempt %d, marking %s unstable\n",
- smp_processor_id(), cs->name, wd_delay, WATCHDOG_MAX_SKEW, wd_seq_delay, nretries, cs->name);
- return WD_READ_UNSTABLE;
+static void watchdog_check_skew_remote(void *unused)
+{
+ struct watchdog_cpu_data *wd = this_cpu_ptr(&watchdog_cpu_data);
-skip_test:
- pr_info("timekeeping watchdog on CPU%d: %s wd-wd read-back delay of %lldns\n",
- smp_processor_id(), watchdog->name, wd_seq_delay);
- pr_info("wd-%s-wd read-back delay of %lldns, clock-skew test skipped!\n",
- cs->name, wd_delay);
- return WD_READ_SKIP;
+ atomic_inc(&wd->remote_inprogress);
+ watchdog_check_skew(wd, 1);
+ atomic_dec(&wd->remote_inprogress);
}
-static u64 csnow_mid;
-static cpumask_t cpus_ahead;
-static cpumask_t cpus_behind;
-static cpumask_t cpus_chosen;
+static inline bool wd_csd_locked(struct watchdog_cpu_data *wd)
+{
+ return READ_ONCE(wd->csd.node.u_flags) & CSD_FLAG_LOCK;
+}
+
+/*
+ * This is only invoked for remote CPUs. See watchdog_check_cpu_skew().
+ */
+static inline u64 wd_get_remote_timeout(unsigned int remote_cpu)
+{
+ unsigned int n1, n2;
+ u64 ns;
+
+ if (nr_node_ids == 1)
+ return WATCHDOG_DEFAULT_TIMEOUT_NS;
+
+ n1 = cpu_to_node(smp_processor_id());
+ n2 = cpu_to_node(remote_cpu);
+ ns = WATCHDOG_NUMA_MULTIPLIER_NS * node_distance(n1, n2);
+ return min(ns, WATCHDOG_NUMA_MAX_TIMEOUT_NS);
+}
-static void clocksource_verify_choose_cpus(void)
+static void __watchdog_check_cpu_skew(struct clocksource *cs, unsigned int cpu)
{
- int cpu, i, n = verify_n_cpus;
+ struct watchdog_cpu_data *wd;
- if (n < 0 || n >= num_online_cpus()) {
- /* Check all of the CPUs. */
- cpumask_copy(&cpus_chosen, cpu_online_mask);
- cpumask_clear_cpu(smp_processor_id(), &cpus_chosen);
+ wd = per_cpu_ptr(&watchdog_cpu_data, cpu);
+ if (atomic_read(&wd->remote_inprogress) || wd_csd_locked(wd)) {
+ watchdog_data.result = WD_CPU_TIMEOUT;
return;
}
- /* If no checking desired, or no other CPU to check, leave. */
- cpumask_clear(&cpus_chosen);
- if (n == 0 || num_online_cpus() <= 1)
+ atomic_set(&wd->seq, 0);
+ wd->result = WD_SUCCESS;
+ wd->cs = cs;
+ /* Store the current CPU ID for the watchdog test unit */
+ cs->wd_cpu = smp_processor_id();
+
+ wd->timeout_ns = wd_get_remote_timeout(cpu);
+
+ /* Kick the remote CPU into the watchdog function */
+ if (WARN_ON_ONCE(smp_call_function_single_async(cpu, &wd->csd))) {
+ watchdog_data.result = WD_CPU_TIMEOUT;
+ return;
+ }
+
+ scoped_guard(irq)
+ watchdog_check_skew(wd, 0);
+
+ scoped_guard(raw_spinlock_irq, &watchdog_data.lock) {
+ watchdog_data.result = wd->result;
+ memcpy(watchdog_data.cpu_ts, wd->cpu_ts, sizeof(wd->cpu_ts));
+ }
+}
+
+static void watchdog_check_cpu_skew(struct clocksource *cs)
+{
+ unsigned int cpu = watchdog_data.curr_cpu;
+
+ cpu = cpumask_next_wrap(cpu, cpu_online_mask);
+ watchdog_data.curr_cpu = cpu;
+
+ /* Skip the current CPU. Handles num_online_cpus() == 1 as well */
+ if (cpu == smp_processor_id())
return;
- /* Make sure to select at least one CPU other than the current CPU. */
- cpu = cpumask_any_but(cpu_online_mask, smp_processor_id());
- if (WARN_ON_ONCE(cpu >= nr_cpu_ids))
+ /* Don't interfere with the test mechanics */
+ if ((cs->flags & CLOCK_SOURCE_WDTEST) && !(cs->flags & CLOCK_SOURCE_WDTEST_PERCPU))
return;
- cpumask_set_cpu(cpu, &cpus_chosen);
- /* Force a sane value for the boot parameter. */
- if (n > nr_cpu_ids)
- n = nr_cpu_ids;
+ __watchdog_check_cpu_skew(cs, cpu);
+}
+
+static bool watchdog_check_freq(struct clocksource *cs, bool reset_pending)
+{
+ unsigned int ppm_shift = SHIFT_4000PPM;
+ u64 wd_ts0, wd_ts1, cs_ts;
+
+ watchdog_data.result = WD_SUCCESS;
+ if (!watchdog) {
+ watchdog_data.result = WD_FREQ_NO_WATCHDOG;
+ return false;
+ }
+
+ if (cs->flags & CLOCK_SOURCE_WDTEST_PERCPU)
+ return true;
/*
- * Randomly select the specified number of CPUs. If the same
- * CPU is selected multiple times, that CPU is checked only once,
- * and no replacement CPU is selected. This gracefully handles
- * situations where verify_n_cpus is greater than the number of
- * CPUs that are currently online.
+ * If both the clocksource and the watchdog claim they are
+ * calibrated use 500ppm limit. Uncalibrated clocksources need a
+ * larger allowance because the firmware supplied frequencies can be
+ * way off.
*/
- for (i = 1; i < n; i++) {
- cpu = cpumask_random(cpu_online_mask);
- if (!WARN_ON_ONCE(cpu >= nr_cpu_ids))
- cpumask_set_cpu(cpu, &cpus_chosen);
+ if (watchdog->flags & CLOCK_SOURCE_CALIBRATED && cs->flags & CLOCK_SOURCE_CALIBRATED)
+ ppm_shift = SHIFT_500PPM;
+
+ for (int retries = 0; retries < WATCHDOG_FREQ_RETRIES; retries++) {
+ s64 wd_last, cs_last, wd_seq, wd_delta, cs_delta, max_delta;
+
+ scoped_guard(irq) {
+ wd_ts0 = watchdog->read(watchdog);
+ cs_ts = cs->read(cs);
+ wd_ts1 = watchdog->read(watchdog);
+ }
+
+ wd_last = cs->wd_last;
+ cs_last = cs->cs_last;
+
+ /* Validate the watchdog readout window */
+ wd_seq = cycles_to_nsec_safe(watchdog, wd_ts0, wd_ts1);
+ if (wd_seq > WATCHDOG_READOUT_MAX_NS) {
+ /* Store for printout in case all retries fail */
+ watchdog_data.wd_seq = wd_seq;
+ continue;
+ }
+
+ /* Store for subsequent processing */
+ cs->wd_last = wd_ts0;
+ cs->cs_last = cs_ts;
+
+ /* First round or reset pending? */
+ if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) || reset_pending)
+ goto reset;
+
+ /* Calculate the nanosecond deltas from the last invocation */
+ wd_delta = cycles_to_nsec_safe(watchdog, wd_last, wd_ts0);
+ cs_delta = cycles_to_nsec_safe(cs, cs_last, cs_ts);
+
+ watchdog_data.wd_delta = wd_delta;
+ watchdog_data.cs_delta = cs_delta;
+
+ /*
+ * Ensure that the deltas are within the readout limits of
+ * the clocksource and the watchdog. Long delays can cause
+ * clocksources to overflow.
+ */
+ max_delta = max(wd_delta, cs_delta);
+ if (max_delta > cs->max_idle_ns || max_delta > watchdog->max_idle_ns)
+ goto reset;
+
+ /*
+ * Calculate and validate the skew against the allowed PPM
+ * value of the maximum delta plus the watchdog readout
+ * time.
+ */
+ if (abs(wd_delta - cs_delta) < (max_delta >> ppm_shift) + wd_seq)
+ return true;
+
+ watchdog_data.result = WD_FREQ_SKEWED;
+ return false;
}
- /* Don't verify ourselves. */
- cpumask_clear_cpu(smp_processor_id(), &cpus_chosen);
+ watchdog_data.result = WD_FREQ_TIMEOUT;
+ return false;
+
+reset:
+ cs->flags |= CLOCK_SOURCE_WATCHDOG;
+ watchdog_data.result = WD_FREQ_RESET;
+ return false;
}
-static void clocksource_verify_one_cpu(void *csin)
+/* Synchronization for sched clock */
+static void clocksource_tick_stable(struct clocksource *cs)
{
- struct clocksource *cs = (struct clocksource *)csin;
-
- csnow_mid = cs->read(cs);
+ if (cs == curr_clocksource && cs->tick_stable)
+ cs->tick_stable(cs);
}
-void clocksource_verify_percpu(struct clocksource *cs)
+/* Conditionally enable high resolution mode */
+static void clocksource_enable_highres(struct clocksource *cs)
{
- int64_t cs_nsec, cs_nsec_max = 0, cs_nsec_min = LLONG_MAX;
- u64 csnow_begin, csnow_end;
- int cpu, testcpu;
- s64 delta;
+ if ((cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) ||
+ !(cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) ||
+ !watchdog || !(watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS))
+ return;
+
+ /* Mark it valid for high-res. */
+ cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
- if (verify_n_cpus == 0)
+ /*
+ * Can't schedule work before finished_booting is
+ * true. clocksource_done_booting will take care of it.
+ */
+ if (!finished_booting)
return;
- cpumask_clear(&cpus_ahead);
- cpumask_clear(&cpus_behind);
- cpus_read_lock();
- migrate_disable();
- clocksource_verify_choose_cpus();
- if (cpumask_empty(&cpus_chosen)) {
- migrate_enable();
- cpus_read_unlock();
- pr_warn("Not enough CPUs to check clocksource '%s'.\n", cs->name);
+
+ if (cs->flags & CLOCK_SOURCE_WDTEST)
return;
+
+ /*
+ * If this is not the current clocksource let the watchdog thread
+ * reselect it. Due to the change to high res this clocksource
+ * might be preferred now. If it is the current clocksource let the
+ * tick code know about that change.
+ */
+ if (cs != curr_clocksource) {
+ cs->flags |= CLOCK_SOURCE_RESELECT;
+ schedule_work(&watchdog_work);
+ } else {
+ tick_clock_notify();
}
- testcpu = smp_processor_id();
- pr_info("Checking clocksource %s synchronization from CPU %d to CPUs %*pbl.\n",
- cs->name, testcpu, cpumask_pr_args(&cpus_chosen));
- preempt_disable();
- for_each_cpu(cpu, &cpus_chosen) {
- if (cpu == testcpu)
- continue;
- csnow_begin = cs->read(cs);
- smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1);
- csnow_end = cs->read(cs);
- delta = (s64)((csnow_mid - csnow_begin) & cs->mask);
- if (delta < 0)
- cpumask_set_cpu(cpu, &cpus_behind);
- delta = (csnow_end - csnow_mid) & cs->mask;
- if (delta < 0)
- cpumask_set_cpu(cpu, &cpus_ahead);
- cs_nsec = cycles_to_nsec_safe(cs, csnow_begin, csnow_end);
- if (cs_nsec > cs_nsec_max)
- cs_nsec_max = cs_nsec;
- if (cs_nsec < cs_nsec_min)
- cs_nsec_min = cs_nsec;
+}
+
+static DEFINE_RATELIMIT_STATE(ratelimit_state, 5 * HZ, 2);
+
+static void watchdog_print_freq_timeout(struct clocksource *cs)
+{
+ if (!__ratelimit(&ratelimit_state))
+ return;
+ pr_info("Watchdog %s read timed out. Readout sequence took: %lluns\n",
+ watchdog->name, watchdog_data.wd_seq);
+}
+
+static void watchdog_print_freq_skew(struct clocksource *cs)
+{
+ pr_warn("Marking clocksource %s unstable due to frequency skew\n", cs->name);
+ pr_warn("Watchdog %20s interval: %16lluns\n", watchdog->name, watchdog_data.wd_delta);
+ pr_warn("Clocksource %20s interval: %16lluns\n", cs->name, watchdog_data.cs_delta);
+}
+
+static void watchdog_handle_remote_timeout(struct clocksource *cs)
+{
+ pr_info_once("Watchdog remote CPU %u read timed out\n", watchdog_data.curr_cpu);
+}
+
+static void watchdog_print_remote_skew(struct clocksource *cs)
+{
+ pr_warn("Marking clocksource %s unstable due to inter CPU skew\n", cs->name);
+ if (watchdog_data.cpu_ts[0] < watchdog_data.cpu_ts[1]) {
+ pr_warn("CPU%u %16llu < CPU%u %16llu (cycles)\n", smp_processor_id(),
+ watchdog_data.cpu_ts[0], watchdog_data.curr_cpu, watchdog_data.cpu_ts[1]);
+ } else {
+ pr_warn("CPU%u %16llu < CPU%u %16llu (cycles)\n", watchdog_data.curr_cpu,
+ watchdog_data.cpu_ts[1], smp_processor_id(), watchdog_data.cpu_ts[0]);
}
- preempt_enable();
- migrate_enable();
- cpus_read_unlock();
- if (!cpumask_empty(&cpus_ahead))
- pr_warn(" CPUs %*pbl ahead of CPU %d for clocksource %s.\n",
- cpumask_pr_args(&cpus_ahead), testcpu, cs->name);
- if (!cpumask_empty(&cpus_behind))
- pr_warn(" CPUs %*pbl behind CPU %d for clocksource %s.\n",
- cpumask_pr_args(&cpus_behind), testcpu, cs->name);
- pr_info(" CPU %d check durations %lldns - %lldns for clocksource %s.\n",
- testcpu, cs_nsec_min, cs_nsec_max, cs->name);
-}
-EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
+}
-static inline void clocksource_reset_watchdog(void)
+static void watchdog_check_result(struct clocksource *cs)
{
- struct clocksource *cs;
+ switch (watchdog_data.result) {
+ case WD_SUCCESS:
+ clocksource_tick_stable(cs);
+ clocksource_enable_highres(cs);
+ return;
- list_for_each_entry(cs, &watchdog_list, wd_list)
+ case WD_FREQ_TIMEOUT:
+ watchdog_print_freq_timeout(cs);
+ /* Try again later and invalidate the reference timestamps. */
cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
-}
+ return;
+ case WD_FREQ_NO_WATCHDOG:
+ case WD_FREQ_RESET:
+ /*
+ * Nothing to do when the reference timestamps were reset
+ * or no watchdog clocksource registered.
+ */
+ return;
+
+ case WD_FREQ_SKEWED:
+ watchdog_print_freq_skew(cs);
+ break;
+
+ case WD_CPU_TIMEOUT:
+ /* Remote check timed out. Try again next cycle. */
+ watchdog_handle_remote_timeout(cs);
+ return;
+
+ case WD_CPU_SKEWED:
+ watchdog_print_remote_skew(cs);
+ break;
+ }
+ __clocksource_unstable(cs);
+}
static void clocksource_watchdog(struct timer_list *unused)
{
- int64_t wd_nsec, cs_nsec, interval;
- u64 csnow, wdnow, cslast, wdlast;
- int next_cpu, reset_pending;
struct clocksource *cs;
- enum wd_read_status read_ret;
- unsigned long extra_wait = 0;
- u32 md;
+ bool reset_pending;
- spin_lock(&watchdog_lock);
+ guard(spinlock)(&watchdog_lock);
if (!watchdog_running)
- goto out;
+ return;
reset_pending = atomic_read(&watchdog_reset_pending);
list_for_each_entry(cs, &watchdog_list, wd_list) {
-
/* Clocksource already marked unstable? */
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
if (finished_booting)
@@ -446,170 +659,40 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}
- read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
-
- if (read_ret == WD_READ_UNSTABLE) {
- /* Clock readout unreliable, so give it up. */
- __clocksource_unstable(cs);
- continue;
- }
-
- /*
- * When WD_READ_SKIP is returned, it means the system is likely
- * under very heavy load, where the latency of reading
- * watchdog/clocksource is very big, and affect the accuracy of
- * watchdog check. So give system some space and suspend the
- * watchdog check for 5 minutes.
- */
- if (read_ret == WD_READ_SKIP) {
- /*
- * As the watchdog timer will be suspended, and
- * cs->last could keep unchanged for 5 minutes, reset
- * the counters.
- */
- clocksource_reset_watchdog();
- extra_wait = HZ * 300;
- break;
- }
-
- /* Clocksource initialized ? */
- if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
- atomic_read(&watchdog_reset_pending)) {
- cs->flags |= CLOCK_SOURCE_WATCHDOG;
- cs->wd_last = wdnow;
- cs->cs_last = csnow;
- continue;
+ /* Compare against watchdog clocksource if available */
+ if (watchdog_check_freq(cs, reset_pending)) {
+ /* Check for inter CPU skew */
+ watchdog_check_cpu_skew(cs);
}
- wd_nsec = cycles_to_nsec_safe(watchdog, cs->wd_last, wdnow);
- cs_nsec = cycles_to_nsec_safe(cs, cs->cs_last, csnow);
- wdlast = cs->wd_last; /* save these in case we print them */
- cslast = cs->cs_last;
- cs->cs_last = csnow;
- cs->wd_last = wdnow;
-
- if (atomic_read(&watchdog_reset_pending))
- continue;
-
- /*
- * The processing of timer softirqs can get delayed (usually
- * on account of ksoftirqd not getting to run in a timely
- * manner), which causes the watchdog interval to stretch.
- * Skew detection may fail for longer watchdog intervals
- * on account of fixed margins being used.
- * Some clocksources, e.g. acpi_pm, cannot tolerate
- * watchdog intervals longer than a few seconds.
- */
- interval = max(cs_nsec, wd_nsec);
- if (unlikely(interval > WATCHDOG_INTERVAL_MAX_NS)) {
- if (system_state > SYSTEM_SCHEDULING &&
- interval > 2 * watchdog_max_interval) {
- watchdog_max_interval = interval;
- pr_warn("Long readout interval, skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
- cs_nsec, wd_nsec);
- }
- watchdog_timer.expires = jiffies;
- continue;
- }
-
- /* Check the deviation from the watchdog clocksource. */
- md = cs->uncertainty_margin + watchdog->uncertainty_margin;
- if (abs(cs_nsec - wd_nsec) > md) {
- s64 cs_wd_msec;
- s64 wd_msec;
- u32 wd_rem;
-
- pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
- smp_processor_id(), cs->name);
- pr_warn(" '%s' wd_nsec: %lld wd_now: %llx wd_last: %llx mask: %llx\n",
- watchdog->name, wd_nsec, wdnow, wdlast, watchdog->mask);
- pr_warn(" '%s' cs_nsec: %lld cs_now: %llx cs_last: %llx mask: %llx\n",
- cs->name, cs_nsec, csnow, cslast, cs->mask);
- cs_wd_msec = div_s64_rem(cs_nsec - wd_nsec, 1000 * 1000, &wd_rem);
- wd_msec = div_s64_rem(wd_nsec, 1000 * 1000, &wd_rem);
- pr_warn(" Clocksource '%s' skewed %lld ns (%lld ms) over watchdog '%s' interval of %lld ns (%lld ms)\n",
- cs->name, cs_nsec - wd_nsec, cs_wd_msec, watchdog->name, wd_nsec, wd_msec);
- if (curr_clocksource == cs)
- pr_warn(" '%s' is current clocksource.\n", cs->name);
- else if (curr_clocksource)
- pr_warn(" '%s' (not '%s') is current clocksource.\n", curr_clocksource->name, cs->name);
- else
- pr_warn(" No current clocksource.\n");
- __clocksource_unstable(cs);
- continue;
- }
-
- if (cs == curr_clocksource && cs->tick_stable)
- cs->tick_stable(cs);
-
- if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) &&
- (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) &&
- (watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS)) {
- /* Mark it valid for high-res. */
- cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
-
- /*
- * clocksource_done_booting() will sort it if
- * finished_booting is not set yet.
- */
- if (!finished_booting)
- continue;
-
- /*
- * If this is not the current clocksource let
- * the watchdog thread reselect it. Due to the
- * change to high res this clocksource might
- * be preferred now. If it is the current
- * clocksource let the tick code know about
- * that change.
- */
- if (cs != curr_clocksource) {
- cs->flags |= CLOCK_SOURCE_RESELECT;
- schedule_work(&watchdog_work);
- } else {
- tick_clock_notify();
- }
- }
+ watchdog_check_result(cs);
}
- /*
- * We only clear the watchdog_reset_pending, when we did a
- * full cycle through all clocksources.
- */
+ /* Clear after the full clocksource walk */
if (reset_pending)
atomic_dec(&watchdog_reset_pending);
- /*
- * Cycle through CPUs to check if the CPUs stay synchronized
- * to each other.
- */
- next_cpu = cpumask_next_wrap(raw_smp_processor_id(), cpu_online_mask);
-
- /*
- * Arm timer if not already pending: could race with concurrent
- * pair clocksource_stop_watchdog() clocksource_start_watchdog().
- */
+ /* Could have been rearmed by a stop/start cycle */
if (!timer_pending(&watchdog_timer)) {
- watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
- add_timer_on(&watchdog_timer, next_cpu);
+ watchdog_timer.expires += WATCHDOG_INTERVAL;
+ add_timer_local(&watchdog_timer);
}
-out:
- spin_unlock(&watchdog_lock);
}
static inline void clocksource_start_watchdog(void)
{
- if (watchdog_running || !watchdog || list_empty(&watchdog_list))
+ if (watchdog_running || list_empty(&watchdog_list))
return;
- timer_setup(&watchdog_timer, clocksource_watchdog, 0);
+ timer_setup(&watchdog_timer, clocksource_watchdog, TIMER_PINNED);
watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
- add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask));
+
+ add_timer_on(&watchdog_timer, get_boot_cpu_id());
watchdog_running = 1;
}
static inline void clocksource_stop_watchdog(void)
{
- if (!watchdog_running || (watchdog && !list_empty(&watchdog_list)))
+ if (!watchdog_running || !list_empty(&watchdog_list))
return;
timer_delete(&watchdog_timer);
watchdog_running = 0;
@@ -651,6 +734,13 @@ static void clocksource_select_watchdog(bool fallback)
if (cs->flags & CLOCK_SOURCE_MUST_VERIFY)
continue;
+ /*
+ * If it's not continuous, don't put the fox in charge of
+ * the henhouse.
+ */
+ if (!(cs->flags & CLOCK_SOURCE_IS_CONTINUOUS))
+ continue;
+
/* Skip current if we were requested for a fallback. */
if (fallback && cs == old_wd)
continue;
@@ -690,12 +780,6 @@ static int __clocksource_watchdog_kthread(void)
unsigned long flags;
int select = 0;
- /* Do any required per-CPU skew verification. */
- if (curr_clocksource &&
- curr_clocksource->flags & CLOCK_SOURCE_UNSTABLE &&
- curr_clocksource->flags & CLOCK_SOURCE_VERIFY_PERCPU)
- clocksource_verify_percpu(curr_clocksource);
-
spin_lock_irqsave(&watchdog_lock, flags);
list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list) {
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
@@ -1016,6 +1100,8 @@ static struct clocksource *clocksource_find_best(bool oneshot, bool skipcur)
continue;
if (oneshot && !(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES))
continue;
+ if (cs->flags & CLOCK_SOURCE_WDTEST)
+ continue;
return cs;
}
return NULL;
@@ -1040,6 +1126,8 @@ static void __clocksource_select(bool skipcur)
continue;
if (strcmp(cs->name, override_name) != 0)
continue;
+ if (cs->flags & CLOCK_SOURCE_WDTEST)
+ continue;
/*
* Check to make sure we don't switch to a non-highres
* capable clocksource if the tick code is in oneshot
@@ -1169,31 +1257,10 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
NSEC_PER_SEC / scale, sec * scale);
- }
- /*
- * If the uncertainty margin is not specified, calculate it. If
- * both scale and freq are non-zero, calculate the clock period, but
- * bound below at 2*WATCHDOG_MAX_SKEW, that is, 500ppm by default.
- * However, if either of scale or freq is zero, be very conservative
- * and take the tens-of-milliseconds WATCHDOG_THRESHOLD value
- * for the uncertainty margin. Allow stupidly small uncertainty
- * margins to be specified by the caller for testing purposes,
- * but warn to discourage production use of this capability.
- *
- * Bottom line: The sum of the uncertainty margins of the
- * watchdog clocksource and the clocksource under test will be at
- * least 500ppm by default. For more information, please see the
- * comment preceding CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US above.
- */
- if (scale && freq && !cs->uncertainty_margin) {
- cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
- if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
- cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
- } else if (!cs->uncertainty_margin) {
- cs->uncertainty_margin = WATCHDOG_THRESHOLD;
+ /* Update cs::freq_khz */
+ cs->freq_khz = div_u64((u64)freq * scale, 1000);
}
- WARN_ON_ONCE(cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW);
/*
* Ensure clocksources that have large 'mult' values don't overflow
@@ -1241,6 +1308,10 @@ int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
if (WARN_ON_ONCE((unsigned int)cs->id >= CSID_MAX))
cs->id = CSID_GENERIC;
+
+ if (WARN_ON_ONCE(!freq && cs->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT))
+ cs->flags &= ~CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT;
+
if (cs->vdso_clock_mode < 0 ||
cs->vdso_clock_mode >= VDSO_CLOCKMODE_MAX) {
pr_warn("clocksource %s registered with invalid VDSO mode %d. Disabling VDSO support.\n",
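
Side note on the shift-based tolerance used by watchdog_check_freq() above: the
following is a minimal userspace sketch of that comparison, not kernel code.
The two shift values are copied from the patch; the helper name
skew_within_limit() and its parameter names are hypothetical and only serve
the illustration. A right shift by 11 divides by 2048 (roughly 488ppm) and a
right shift by 8 divides by 256 (roughly 3906ppm), which is why the patch
calls them approximate 500ppm and 4000ppm limits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Shift values as defined in the patch. */
#define SHIFT_500PPM	11	/* delta / 2048, roughly 488 ppm */
#define SHIFT_4000PPM	8	/* delta / 256, roughly 3906 ppm */

/*
 * Hypothetical helper (not part of the patch): returns true when the
 * difference between the watchdog interval and the clocksource interval
 * is below the shift-derived tolerance of the larger interval plus the
 * time the watchdog readout window itself took. All values are in
 * nanoseconds.
 */
static bool skew_within_limit(int64_t wd_delta_ns, int64_t cs_delta_ns,
			      int64_t wd_seq_ns, unsigned int ppm_shift)
{
	int64_t max_delta = wd_delta_ns > cs_delta_ns ? wd_delta_ns : cs_delta_ns;
	int64_t diff = wd_delta_ns - cs_delta_ns;

	/* Guard against a nonsensical shift in this sketch. */
	assert(ppm_shift < 63);

	if (diff < 0)
		diff = -diff;

	return diff < (max_delta >> ppm_shift) + wd_seq_ns;
}
```

With a nominal 0.5s interval, a 200us difference passes the tighter limit, a
300us difference fails it but passes the 4000ppm limit, and a long readout
window widens the allowance by the same amount.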
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 860af7a58428..5bd6efe598f0 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -49,6 +49,28 @@
#include "tick-internal.h"
+/*
+ * Constants to set the queued state of the timer (INACTIVE, ENQUEUED)
+ *
+ * The callback state is kept separate in the CPU base because having it in
+ * the timer would require touching the timer after the callback, which
+ * makes it impossible to free the timer from the callback function.
+ *
+ * Therefore we track the callback state in:
+ *
+ * timer->base->cpu_base->running == timer
+ *
+ * On SMP it is possible to have a "callback function running and enqueued"
+ * status. It happens for example when a posix timer expired and the callback
+ * queued a signal. Between dropping the lock which protects the posix timer
+ * and reacquiring the base lock of the hrtimer, another CPU can deliver the
+ * signal and rearm the timer.
+ *
+ * All state transitions are protected by cpu_base->lock.
+ */
+#define HRTIMER_STATE_INACTIVE false
+#define HRTIMER_STATE_ENQUEUED true
+
/*
* The resolution of the clocks. The resolution value is returned in
* the clock_getres() system call to give application programmers an
@@ -77,43 +99,22 @@ static ktime_t __hrtimer_cb_get_time(clockid_t clock_id);
* to reach a base using a clockid, hrtimer_clockid_to_base()
* is used to convert from clockid to the proper hrtimer_base_type.
*/
+
+#define BASE_INIT(idx, cid) \
+ [idx] = { .index = idx, .clockid = cid }
+
DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
{
.lock = __RAW_SPIN_LOCK_UNLOCKED(hrtimer_bases.lock),
- .clock_base =
- {
- {
- .index = HRTIMER_BASE_MONOTONIC,
- .clockid = CLOCK_MONOTONIC,
- },
- {
- .index = HRTIMER_BASE_REALTIME,
- .clockid = CLOCK_REALTIME,
- },
- {
- .index = HRTIMER_BASE_BOOTTIME,
- .clockid = CLOCK_BOOTTIME,
- },
- {
- .index = HRTIMER_BASE_TAI,
- .clockid = CLOCK_TAI,
- },
- {
- .index = HRTIMER_BASE_MONOTONIC_SOFT,
- .clockid = CLOCK_MONOTONIC,
- },
- {
- .index = HRTIMER_BASE_REALTIME_SOFT,
- .clockid = CLOCK_REALTIME,
- },
- {
- .index = HRTIMER_BASE_BOOTTIME_SOFT,
- .clockid = CLOCK_BOOTTIME,
- },
- {
- .index = HRTIMER_BASE_TAI_SOFT,
- .clockid = CLOCK_TAI,
- },
+ .clock_base = {
+ BASE_INIT(HRTIMER_BASE_MONOTONIC, CLOCK_MONOTONIC),
+ BASE_INIT(HRTIMER_BASE_REALTIME, CLOCK_REALTIME),
+ BASE_INIT(HRTIMER_BASE_BOOTTIME, CLOCK_BOOTTIME),
+ BASE_INIT(HRTIMER_BASE_TAI, CLOCK_TAI),
+ BASE_INIT(HRTIMER_BASE_MONOTONIC_SOFT, CLOCK_MONOTONIC),
+ BASE_INIT(HRTIMER_BASE_REALTIME_SOFT, CLOCK_REALTIME),
+ BASE_INIT(HRTIMER_BASE_BOOTTIME_SOFT, CLOCK_BOOTTIME),
+ BASE_INIT(HRTIMER_BASE_TAI_SOFT, CLOCK_TAI),
},
.csd = CSD_INIT(retrigger_next_event, NULL)
};
@@ -126,23 +127,43 @@ static inline bool hrtimer_base_is_online(struct hrtimer_cpu_base *base)
return likely(base->online);
}
+#ifdef CONFIG_HIGH_RES_TIMERS
+DEFINE_STATIC_KEY_FALSE(hrtimer_highres_enabled_key);
+
+static void hrtimer_hres_workfn(struct work_struct *work)
+{
+ static_branch_enable(&hrtimer_highres_enabled_key);
+}
+
+static DECLARE_WORK(hrtimer_hres_work, hrtimer_hres_workfn);
+
+static inline void hrtimer_schedule_hres_work(void)
+{
+ if (!hrtimer_highres_enabled())
+ schedule_work(&hrtimer_hres_work);
+}
+#else
+static inline void hrtimer_schedule_hres_work(void) { }
+#endif
+
/*
* Functions and macros which are different for UP/SMP systems are kept in a
* single place
*/
#ifdef CONFIG_SMP
-
/*
* We require the migration_base for lock_hrtimer_base()/switch_hrtimer_base()
* such that hrtimer_callback_running() can unconditionally dereference
* timer->base->cpu_base
*/
static struct hrtimer_cpu_base migration_cpu_base = {
- .clock_base = { {
- .cpu_base = &migration_cpu_base,
- .seq = SEQCNT_RAW_SPINLOCK_ZERO(migration_cpu_base.seq,
- &migration_cpu_base.lock),
- }, },
+ .clock_base = {
+ [0] = {
+ .cpu_base = &migration_cpu_base,
+ .seq = SEQCNT_RAW_SPINLOCK_ZERO(migration_cpu_base.seq,
+ &migration_cpu_base.lock),
+ },
+ },
};
#define migration_base migration_cpu_base.clock_base[0]
@@ -159,15 +180,13 @@ static struct hrtimer_cpu_base migration_cpu_base = {
* possible to set timer->base = &migration_base and drop the lock: the timer
* remains locked.
*/
-static
-struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
- unsigned long *flags)
+static struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
+ unsigned long *flags)
__acquires(&timer->base->lock)
{
- struct hrtimer_clock_base *base;
-
for (;;) {
- base = READ_ONCE(timer->base);
+ struct hrtimer_clock_base *base = READ_ONCE(timer->base);
+
if (likely(base != &migration_base)) {
raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
if (likely(base == timer->base))
@@ -220,7 +239,7 @@ static bool hrtimer_suitable_target(struct hrtimer *timer, struct hrtimer_clock_
return expires >= new_base->cpu_base->expires_next;
}
-static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, int pinned)
+static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base, bool pinned)
{
if (!hrtimer_base_is_online(base)) {
int cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER));
@@ -248,8 +267,7 @@ static inline struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *
* the timer callback is currently running.
*/
static inline struct hrtimer_clock_base *
-switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
- int pinned)
+switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base, bool pinned)
{
struct hrtimer_cpu_base *new_cpu_base, *this_cpu_base;
struct hrtimer_clock_base *new_base;
@@ -262,13 +280,12 @@ switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
if (base != new_base) {
/*
- * We are trying to move timer to new_base.
- * However we can't change timer's base while it is running,
- * so we keep it on the same CPU. No hassle vs. reprogramming
- * the event source in the high resolution case. The softirq
- * code will take care of this when the timer function has
- * completed. There is no conflict as we hold the lock until
- * the timer is enqueued.
+ * We are trying to move timer to new_base. However we can't
+ * change timer's base while it is running, so we keep it on
+ * the same CPU. No hassle vs. reprogramming the event source
+ * in the high resolution case. The remote CPU will take care
+ * of this when the timer function has completed. There is no
+ * conflict as we hold the lock until the timer is enqueued.
*/
if (unlikely(hrtimer_callback_running(timer)))
return base;
@@ -278,8 +295,7 @@ switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
raw_spin_unlock(&base->cpu_base->lock);
raw_spin_lock(&new_base->cpu_base->lock);
- if (!hrtimer_suitable_target(timer, new_base, new_cpu_base,
- this_cpu_base)) {
+ if (!hrtimer_suitable_target(timer, new_base, new_cpu_base, this_cpu_base)) {
raw_spin_unlock(&new_base->cpu_base->lock);
raw_spin_lock(&base->cpu_base->lock);
new_cpu_base = this_cpu_base;
@@ -298,14 +314,13 @@ switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
#else /* CONFIG_SMP */
-static inline struct hrtimer_clock_base *
-lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
+static inline struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
+ unsigned long *flags)
__acquires(&timer->base->cpu_base->lock)
{
struct hrtimer_clock_base *base = timer->base;
raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
-
return base;
}
@@ -340,7 +355,7 @@ s64 __ktime_divns(const ktime_t kt, s64 div)
return dclc < 0 ? -tmp : tmp;
}
EXPORT_SYMBOL_GPL(__ktime_divns);
-#endif /* BITS_PER_LONG >= 64 */
+#endif /* BITS_PER_LONG < 64 */
/*
* Add two ktime values and do a safety check for overflow:
@@ -422,12 +437,37 @@ static bool hrtimer_fixup_free(void *addr, enum debug_obj_state state)
}
}
+/* Stub timer callback for improperly used timers. */
+static enum hrtimer_restart stub_timer(struct hrtimer *unused)
+{
+ WARN_ON_ONCE(1);
+ return HRTIMER_NORESTART;
+}
+
+/*
+ * hrtimer_fixup_assert_init is called when:
+ * - an untracked/uninit-ed object is found
+ */
+static bool hrtimer_fixup_assert_init(void *addr, enum debug_obj_state state)
+{
+ struct hrtimer *timer = addr;
+
+ switch (state) {
+ case ODEBUG_STATE_NOTAVAILABLE:
+ hrtimer_setup(timer, stub_timer, CLOCK_MONOTONIC, 0);
+ return true;
+ default:
+ return false;
+ }
+}
+
static const struct debug_obj_descr hrtimer_debug_descr = {
- .name = "hrtimer",
- .debug_hint = hrtimer_debug_hint,
- .fixup_init = hrtimer_fixup_init,
- .fixup_activate = hrtimer_fixup_activate,
- .fixup_free = hrtimer_fixup_free,
+ .name = "hrtimer",
+ .debug_hint = hrtimer_debug_hint,
+ .fixup_init = hrtimer_fixup_init,
+ .fixup_activate = hrtimer_fixup_activate,
+ .fixup_free = hrtimer_fixup_free,
+ .fixup_assert_init = hrtimer_fixup_assert_init,
};
static inline void debug_hrtimer_init(struct hrtimer *timer)
@@ -440,8 +480,7 @@ static inline void debug_hrtimer_init_on_stack(struct hrtimer *timer)
debug_object_init_on_stack(timer, &hrtimer_debug_descr);
}
-static inline void debug_hrtimer_activate(struct hrtimer *timer,
- enum hrtimer_mode mode)
+static inline void debug_hrtimer_activate(struct hrtimer *timer, enum hrtimer_mode mode)
{
debug_object_activate(timer, &hrtimer_debug_descr);
}
@@ -451,6 +490,11 @@ static inline void debug_hrtimer_deactivate(struct hrtimer *timer)
debug_object_deactivate(timer, &hrtimer_debug_descr);
}
+static inline void debug_hrtimer_assert_init(struct hrtimer *timer)
+{
+ debug_object_assert_init(timer, &hrtimer_debug_descr);
+}
+
void destroy_hrtimer_on_stack(struct hrtimer *timer)
{
debug_object_free(timer, &hrtimer_debug_descr);
@@ -461,9 +505,9 @@ EXPORT_SYMBOL_GPL(destroy_hrtimer_on_stack);
static inline void debug_hrtimer_init(struct hrtimer *timer) { }
static inline void debug_hrtimer_init_on_stack(struct hrtimer *timer) { }
-static inline void debug_hrtimer_activate(struct hrtimer *timer,
- enum hrtimer_mode mode) { }
+static inline void debug_hrtimer_activate(struct hrtimer *timer, enum hrtimer_mode mode) { }
static inline void debug_hrtimer_deactivate(struct hrtimer *timer) { }
+static inline void debug_hrtimer_assert_init(struct hrtimer *timer) { }
#endif
static inline void debug_setup(struct hrtimer *timer, clockid_t clockid, enum hrtimer_mode mode)
@@ -479,80 +523,80 @@ static inline void debug_setup_on_stack(struct hrtimer *timer, clockid_t clockid
trace_hrtimer_setup(timer, clockid, mode);
}
-static inline void debug_activate(struct hrtimer *timer,
- enum hrtimer_mode mode)
+static inline void debug_activate(struct hrtimer *timer, enum hrtimer_mode mode, bool was_armed)
{
debug_hrtimer_activate(timer, mode);
- trace_hrtimer_start(timer, mode);
+ trace_hrtimer_start(timer, mode, was_armed);
}
-static inline void debug_deactivate(struct hrtimer *timer)
-{
- debug_hrtimer_deactivate(timer);
- trace_hrtimer_cancel(timer);
-}
+#define for_each_active_base(base, cpu_base, active) \
+ for (unsigned int idx = ffs(active); idx--; idx = ffs((active))) \
+ for (bool done = false; !done; active &= ~(1U << idx)) \
+ for (base = &cpu_base->clock_base[idx]; !done; done = true)
-static struct hrtimer_clock_base *
-__next_base(struct hrtimer_cpu_base *cpu_base, unsigned int *active)
+#define hrtimer_from_timerqueue_node(_n) container_of_const(_n, struct hrtimer, node)
+
+#if defined(CONFIG_NO_HZ_COMMON)
+/*
+ * Same as hrtimer_bases_next_event() below, but skips the excluded timer and
+ * does not update cpu_base->next_timer/expires.
+ */
+static ktime_t hrtimer_bases_next_event_without(struct hrtimer_cpu_base *cpu_base,
+ const struct hrtimer *exclude,
+ unsigned int active, ktime_t expires_next)
{
- unsigned int idx;
+ struct hrtimer_clock_base *base;
+ ktime_t expires;
- if (!*active)
- return NULL;
+ lockdep_assert_held(&cpu_base->lock);
- idx = __ffs(*active);
- *active &= ~(1U << idx);
+ for_each_active_base(base, cpu_base, active) {
+ expires = ktime_sub(base->expires_next, base->offset);
+ if (expires >= expires_next)
+ continue;
+
+ /*
+ * If the excluded timer is the first on this base evaluate the
+ * next timer.
+ */
+ struct timerqueue_linked_node *node = timerqueue_linked_first(&base->active);
- return &cpu_base->clock_base[idx];
+ if (unlikely(&exclude->node == node)) {
+ node = timerqueue_linked_next(node);
+ if (!node)
+ continue;
+ expires = ktime_sub(node->expires, base->offset);
+ if (expires >= expires_next)
+ continue;
+ }
+ expires_next = expires;
+ }
+ /* If base->offset changed, the result might be negative */
+ return max(expires_next, 0);
}
+#endif
-#define for_each_active_base(base, cpu_base, active) \
- while ((base = __next_base((cpu_base), &(active))))
+static __always_inline struct hrtimer *clock_base_next_timer(struct hrtimer_clock_base *base)
+{
+ struct timerqueue_linked_node *next = timerqueue_linked_first(&base->active);
-static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
- const struct hrtimer *exclude,
- unsigned int active,
- ktime_t expires_next)
+ return hrtimer_from_timerqueue_node(next);
+}
+
+/* Find the base with the earliest expiry */
+static void hrtimer_bases_first(struct hrtimer_cpu_base *cpu_base, unsigned int active,
+ ktime_t *expires_next, struct hrtimer **next_timer)
{
struct hrtimer_clock_base *base;
ktime_t expires;
for_each_active_base(base, cpu_base, active) {
- struct timerqueue_node *next;
- struct hrtimer *timer;
-
- next = timerqueue_getnext(&base->active);
- timer = container_of(next, struct hrtimer, node);
- if (timer == exclude) {
- /* Get to the next timer in the queue. */
- next = timerqueue_iterate_next(next);
- if (!next)
- continue;
-
- timer = container_of(next, struct hrtimer, node);
- }
- expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
- if (expires < expires_next) {
- expires_next = expires;
-
- /* Skip cpu_base update if a timer is being excluded. */
- if (exclude)
- continue;
-
- if (timer->is_soft)
- cpu_base->softirq_next_timer = timer;
- else
- cpu_base->next_timer = timer;
+ expires = ktime_sub(base->expires_next, base->offset);
+ if (expires < *expires_next) {
+ *expires_next = expires;
+ *next_timer = clock_base_next_timer(base);
}
}
- /*
- * clock_was_set() might have changed base->offset of any of
- * the clock bases so the result might be negative. Fix it up
- * to prevent a false positive in clockevents_program_event().
- */
- if (expires_next < 0)
- expires_next = 0;
- return expires_next;
}
/*
@@ -575,30 +619,28 @@ static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
* - HRTIMER_ACTIVE_SOFT, or
* - HRTIMER_ACTIVE_HARD.
*/
-static ktime_t
-__hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_mask)
+static ktime_t __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_mask)
{
- unsigned int active;
struct hrtimer *next_timer = NULL;
ktime_t expires_next = KTIME_MAX;
+ unsigned int active;
+
+ lockdep_assert_held(&cpu_base->lock);
if (!cpu_base->softirq_activated && (active_mask & HRTIMER_ACTIVE_SOFT)) {
active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
- cpu_base->softirq_next_timer = NULL;
- expires_next = __hrtimer_next_event_base(cpu_base, NULL,
- active, KTIME_MAX);
-
- next_timer = cpu_base->softirq_next_timer;
+ if (active)
+ hrtimer_bases_first(cpu_base, active, &expires_next, &next_timer);
+ cpu_base->softirq_next_timer = next_timer;
}
if (active_mask & HRTIMER_ACTIVE_HARD) {
active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
+ if (active)
+ hrtimer_bases_first(cpu_base, active, &expires_next, &next_timer);
cpu_base->next_timer = next_timer;
- expires_next = __hrtimer_next_event_base(cpu_base, NULL, active,
- expires_next);
}
-
- return expires_next;
+ return max(expires_next, 0);
}
static ktime_t hrtimer_update_next_event(struct hrtimer_cpu_base *cpu_base)
@@ -638,8 +680,8 @@ static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
ktime_t *offs_boot = &base->clock_base[HRTIMER_BASE_BOOTTIME].offset;
ktime_t *offs_tai = &base->clock_base[HRTIMER_BASE_TAI].offset;
- ktime_t now = ktime_get_update_offsets_now(&base->clock_was_set_seq,
- offs_real, offs_boot, offs_tai);
+ ktime_t now = ktime_get_update_offsets_now(&base->clock_was_set_seq, offs_real,
+ offs_boot, offs_tai);
base->clock_base[HRTIMER_BASE_REALTIME_SOFT].offset = *offs_real;
base->clock_base[HRTIMER_BASE_BOOTTIME_SOFT].offset = *offs_boot;
@@ -649,7 +691,9 @@ static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
}
/*
- * Is the high resolution mode active ?
+ * Is the high resolution mode active in the CPU base? This cannot use the
+ * static key as the CPUs are switched to high resolution mode
+ * asynchronously.
*/
static inline int hrtimer_hres_active(struct hrtimer_cpu_base *cpu_base)
{
@@ -657,8 +701,13 @@ static inline int hrtimer_hres_active(struct hrtimer_cpu_base *cpu_base)
cpu_base->hres_active : 0;
}
-static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base,
- struct hrtimer *next_timer,
+static inline void hrtimer_rearm_event(ktime_t expires_next, bool deferred)
+{
+ trace_hrtimer_rearm(expires_next, deferred);
+ tick_program_event(expires_next, 1);
+}
+
+static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, struct hrtimer *next_timer,
ktime_t expires_next)
{
cpu_base->expires_next = expires_next;
@@ -683,20 +732,13 @@ static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base,
if (!hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
return;
- tick_program_event(expires_next, 1);
+ hrtimer_rearm_event(expires_next, false);
}
-/*
- * Reprogram the event source with checking both queues for the
- * next event
- * Called with interrupts disabled and base->lock held
- */
-static void
-hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
+/* Reprogram the event source with an evaluation of all clock bases */
+static void hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, bool skip_equal)
{
- ktime_t expires_next;
-
- expires_next = hrtimer_update_next_event(cpu_base);
+ ktime_t expires_next = hrtimer_update_next_event(cpu_base);
if (skip_equal && expires_next == cpu_base->expires_next)
return;
@@ -707,57 +749,49 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
/* High resolution timer related functions */
#ifdef CONFIG_HIGH_RES_TIMERS
-/*
- * High resolution timer enabled ?
- */
+/* High resolution timer enabled ? */
static bool hrtimer_hres_enabled __read_mostly = true;
unsigned int hrtimer_resolution __read_mostly = LOW_RES_NSEC;
EXPORT_SYMBOL_GPL(hrtimer_resolution);
-/*
- * Enable / Disable high resolution mode
- */
+/* Enable / Disable high resolution mode */
static int __init setup_hrtimer_hres(char *str)
{
return (kstrtobool(str, &hrtimer_hres_enabled) == 0);
}
-
__setup("highres=", setup_hrtimer_hres);
-/*
- * hrtimer_high_res_enabled - query, if the highres mode is enabled
- */
-static inline int hrtimer_is_hres_enabled(void)
+/* hrtimer_is_hres_enabled - query whether the highres mode is enabled */
+static inline bool hrtimer_is_hres_enabled(void)
{
return hrtimer_hres_enabled;
}
-/*
- * Switch to high resolution mode
- */
+/* Switch to high resolution mode */
static void hrtimer_switch_to_hres(void)
{
struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);
if (tick_init_highres()) {
- pr_warn("Could not switch to high resolution mode on CPU %u\n",
- base->cpu);
+ pr_warn("Could not switch to high resolution mode on CPU %u\n", base->cpu);
return;
}
- base->hres_active = 1;
+ base->hres_active = true;
hrtimer_resolution = HIGH_RES_NSEC;
tick_setup_sched_timer(true);
/* "Retrigger" the interrupt to get things going */
retrigger_next_event(NULL);
+ hrtimer_schedule_hres_work();
}
#else
-static inline int hrtimer_is_hres_enabled(void) { return 0; }
+static inline bool hrtimer_is_hres_enabled(void) { return 0; }
static inline void hrtimer_switch_to_hres(void) { }
#endif /* CONFIG_HIGH_RES_TIMERS */
+
/*
* Retrigger next event is called after clock was set with interrupts
* disabled through an SMP function call or directly from low level
@@ -792,13 +826,12 @@ static void retrigger_next_event(void *arg)
* In periodic low resolution mode, the next softirq expiration
* must also be updated.
*/
- raw_spin_lock(&base->lock);
+ guard(raw_spinlock)(&base->lock);
hrtimer_update_base(base);
if (hrtimer_hres_active(base))
- hrtimer_force_reprogram(base, 0);
+ hrtimer_force_reprogram(base, /* skip_equal */ false);
else
hrtimer_update_next_event(base);
- raw_spin_unlock(&base->lock);
}
/*
@@ -812,10 +845,11 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
{
struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
struct hrtimer_clock_base *base = timer->base;
- ktime_t expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
+ ktime_t expires = hrtimer_get_expires(timer);
- WARN_ON_ONCE(hrtimer_get_expires(timer) < 0);
+ WARN_ON_ONCE(expires < 0);
+ expires = ktime_sub(expires, base->offset);
/*
* CLOCK_REALTIME timer might be requested with an absolute
* expiry time which is less than base->offset. Set it to 0.
@@ -842,8 +876,7 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
timer_cpu_base->softirq_next_timer = timer;
timer_cpu_base->softirq_expires_next = expires;
- if (!ktime_before(expires, timer_cpu_base->expires_next) ||
- !reprogram)
+ if (!ktime_before(expires, timer_cpu_base->expires_next) || !reprogram)
return;
}
@@ -857,11 +890,8 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
if (expires >= cpu_base->expires_next)
return;
- /*
- * If the hrtimer interrupt is running, then it will reevaluate the
- * clock bases and reprogram the clock event device.
- */
- if (cpu_base->in_hrtirq)
+ /* If a deferred rearm is pending skip reprogramming the device */
+ if (cpu_base->deferred_rearm)
return;
cpu_base->next_timer = timer;
@@ -869,8 +899,7 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
__hrtimer_reprogram(cpu_base, timer, expires);
}
-static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
- unsigned int active)
+static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base, unsigned int active)
{
struct hrtimer_clock_base *base;
unsigned int seq;
@@ -896,13 +925,11 @@ static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
if (seq == cpu_base->clock_was_set_seq)
return false;
- /*
- * If the remote CPU is currently handling an hrtimer interrupt, it
- * will reevaluate the first expiring timer of all clock bases
- * before reprogramming. Nothing to do here.
- */
- if (cpu_base->in_hrtirq)
+ /* If a deferred rearm is pending the remote CPU will take care of it */
+ if (cpu_base->deferred_rearm) {
+ cpu_base->deferred_needs_update = true;
return false;
+ }
/*
* Walk the affected clock bases and check whether the first expiring
@@ -913,9 +940,9 @@ static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
active &= cpu_base->active_bases;
for_each_active_base(base, cpu_base, active) {
- struct timerqueue_node *next;
+ struct timerqueue_linked_node *next;
- next = timerqueue_getnext(&base->active);
+ next = timerqueue_linked_first(&base->active);
expires = ktime_sub(next->expires, base->offset);
if (expires < cpu_base->expires_next)
return true;
@@ -947,11 +974,9 @@ static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
*/
void clock_was_set(unsigned int bases)
{
- struct hrtimer_cpu_base *cpu_base = raw_cpu_ptr(&hrtimer_bases);
cpumask_var_t mask;
- int cpu;
- if (!hrtimer_hres_active(cpu_base) && !tick_nohz_is_active())
+ if (!hrtimer_highres_enabled() && !tick_nohz_is_active())
goto out_timerfd;
if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
@@ -960,23 +985,19 @@ void clock_was_set(unsigned int bases)
}
/* Avoid interrupting CPUs if possible */
- cpus_read_lock();
- for_each_online_cpu(cpu) {
- unsigned long flags;
-
- cpu_base = &per_cpu(hrtimer_bases, cpu);
- raw_spin_lock_irqsave(&cpu_base->lock, flags);
+ scoped_guard(cpus_read_lock) {
+ int cpu;
- if (update_needs_ipi(cpu_base, bases))
- cpumask_set_cpu(cpu, mask);
+ for_each_online_cpu(cpu) {
+ struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
- raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+ guard(raw_spinlock_irqsave)(&cpu_base->lock);
+ if (update_needs_ipi(cpu_base, bases))
+ cpumask_set_cpu(cpu, mask);
+ }
+ scoped_guard(preempt)
+ smp_call_function_many(mask, retrigger_next_event, NULL, 1);
}
-
- preempt_disable();
- smp_call_function_many(mask, retrigger_next_event, NULL, 1);
- preempt_enable();
- cpus_read_unlock();
free_cpumask_var(mask);
out_timerfd:
@@ -1011,11 +1032,8 @@ void hrtimers_resume_local(void)
retrigger_next_event(NULL);
}
-/*
- * Counterpart to lock_hrtimer_base above:
- */
-static inline
-void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
+/* Counterpart to lock_hrtimer_base above */
+static inline void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
__releases(&timer->base->cpu_base->lock)
{
raw_spin_unlock_irqrestore(&timer->base->cpu_base->lock, *flags);
@@ -1032,7 +1050,7 @@ void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
* .. note::
* This only updates the timer expiry value and does not requeue the timer.
*
- * There is also a variant of the function hrtimer_forward_now().
+ * There is also a variant of this function: hrtimer_forward_now().
*
* Context: Can be safely called from the callback function of @timer. If called
* from other contexts @timer must neither be enqueued nor running the
@@ -1042,15 +1060,15 @@ void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
*/
u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
{
- u64 orun = 1;
ktime_t delta;
+ u64 orun = 1;
delta = ktime_sub(now, hrtimer_get_expires(timer));
if (delta < 0)
return 0;
- if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
+ if (WARN_ON(timer->is_queued))
return 0;
if (interval < hrtimer_resolution)
@@ -1079,73 +1097,98 @@ EXPORT_SYMBOL_GPL(hrtimer_forward);
* enqueue_hrtimer - internal function to (re)start a timer
*
* The timer is inserted in expiry order. Insertion into the
- * red black tree is O(log(n)). Must hold the base lock.
+ * red black tree is O(log(n)).
*
* Returns true when the new timer is the leftmost timer in the tree.
*/
static bool enqueue_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
- enum hrtimer_mode mode)
+ enum hrtimer_mode mode, bool was_armed)
{
- debug_activate(timer, mode);
+ lockdep_assert_held(&base->cpu_base->lock);
+
+ debug_activate(timer, mode, was_armed);
WARN_ON_ONCE(!base->cpu_base->online);
base->cpu_base->active_bases |= 1 << base->index;
/* Pairs with the lockless read in hrtimer_is_queued() */
- WRITE_ONCE(timer->state, HRTIMER_STATE_ENQUEUED);
+ WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);
+
+ if (!timerqueue_linked_add(&base->active, &timer->node))
+ return false;
+
+ base->expires_next = hrtimer_get_expires(timer);
+ return true;
+}
- return timerqueue_add(&base->active, &timer->node);
+static inline void base_update_next_timer(struct hrtimer_clock_base *base)
+{
+ struct timerqueue_linked_node *next = timerqueue_linked_first(&base->active);
+
+ base->expires_next = next ? next->expires : KTIME_MAX;
}
/*
* __remove_hrtimer - internal function to remove a timer
*
- * Caller must hold the base lock.
- *
* High resolution timer mode reprograms the clock event device when the
* timer is the one which expires next. The caller can disable this by setting
* reprogram to zero. This is useful, when the context does a reprogramming
* anyway (e.g. timer interrupt)
*/
-static void __remove_hrtimer(struct hrtimer *timer,
- struct hrtimer_clock_base *base,
- u8 newstate, int reprogram)
+static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+ bool newstate, bool reprogram)
{
struct hrtimer_cpu_base *cpu_base = base->cpu_base;
- u8 state = timer->state;
+ bool was_first;
- /* Pairs with the lockless read in hrtimer_is_queued() */
- WRITE_ONCE(timer->state, newstate);
- if (!(state & HRTIMER_STATE_ENQUEUED))
+ lockdep_assert_held(&cpu_base->lock);
+
+ if (!timer->is_queued)
return;
- if (!timerqueue_del(&base->active, &timer->node))
+ /* Pairs with the lockless read in hrtimer_is_queued() */
+ WRITE_ONCE(timer->is_queued, newstate);
+
+ was_first = !timerqueue_linked_prev(&timer->node);
+
+ if (!timerqueue_linked_del(&base->active, &timer->node))
cpu_base->active_bases &= ~(1 << base->index);
+ /* Nothing to update if this was not the first timer in the base */
+ if (!was_first)
+ return;
+
+ base_update_next_timer(base);
+
/*
- * Note: If reprogram is false we do not update
- * cpu_base->next_timer. This happens when we remove the first
- * timer on a remote cpu. No harm as we never dereference
- * cpu_base->next_timer. So the worst thing what can happen is
- * an superfluous call to hrtimer_force_reprogram() on the
- * remote cpu later on if the same timer gets enqueued again.
+ * If reprogram is false don't update cpu_base->next_timer and do not
+ * touch the clock event device.
+ *
+ * This happens when removing the first timer on a remote CPU, which
+ * will be handled by the remote CPU's interrupt. It also happens when
+ * a local timer is removed to be immediately restarted. That's handled
+ * at the call site.
*/
- if (reprogram && timer == cpu_base->next_timer)
- hrtimer_force_reprogram(cpu_base, 1);
+ if (!reprogram || timer != cpu_base->next_timer || timer->is_lazy)
+ return;
+
+ if (cpu_base->deferred_rearm)
+ cpu_base->deferred_needs_update = true;
+ else
+ hrtimer_force_reprogram(cpu_base, /* skip_equal */ true);
}
-/*
- * remove hrtimer, called with base lock held
- */
-static inline int
-remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
- bool restart, bool keep_local)
+static inline bool remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+ bool newstate)
{
- u8 state = timer->state;
+ lockdep_assert_held(&base->cpu_base->lock);
- if (state & HRTIMER_STATE_ENQUEUED) {
+ if (timer->is_queued) {
bool reprogram;
+ debug_hrtimer_deactivate(timer);
+
/*
* Remove the timer and force reprogramming when high
* resolution mode is active and the timer is on the current
@@ -1154,24 +1197,81 @@ remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
* reprogramming happens in the interrupt handler. This is a
* rare case and less expensive than a smp call.
*/
- debug_deactivate(timer);
reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
- /*
- * If the timer is not restarted then reprogramming is
- * required if the timer is local. If it is local and about
- * to be restarted, avoid programming it twice (on removal
- * and a moment later when it's requeued).
- */
- if (!restart)
- state = HRTIMER_STATE_INACTIVE;
- else
- reprogram &= !keep_local;
+ __remove_hrtimer(timer, base, newstate, reprogram);
+ return true;
+ }
+ return false;
+}
+
+/*
+ * Update in place has to retrieve the expiry times of the neighbour nodes
+ * if they exist. That is cache line neutral because the dequeue/enqueue
+ * operation is going to need the same cache lines. But there is a big win
+ * when the dequeue/enqueue can be avoided because the RB tree does not
+ * have to be rebalanced twice.
+ */
+static inline bool
+hrtimer_can_update_in_place(struct hrtimer *timer, struct hrtimer_clock_base *base, ktime_t expires)
+{
+ struct timerqueue_linked_node *next = timerqueue_linked_next(&timer->node);
+ struct timerqueue_linked_node *prev = timerqueue_linked_prev(&timer->node);
+
+ /* If the new expiry goes behind the next timer, requeue is required */
+ if (next && expires > next->expires)
+ return false;
+
+ /* If this is the first timer, update in place */
+ if (!prev)
+ return true;
+
+ /* Update in place when it does not go ahead of the previous one */
+ return expires >= prev->expires;
+}
+
+static inline bool
+remove_and_enqueue_same_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
+ const enum hrtimer_mode mode, ktime_t expires, u64 delta_ns)
+{
+ bool was_first = false;
+
+ /* Remove it from the timer queue if active */
+ if (timer->is_queued) {
+ was_first = !timerqueue_linked_prev(&timer->node);
+
+ /* Try to update in place to avoid the de/enqueue dance */
+ if (hrtimer_can_update_in_place(timer, base, expires)) {
+ hrtimer_set_expires_range_ns(timer, expires, delta_ns);
+ trace_hrtimer_start(timer, mode, true);
+ if (was_first)
+ base->expires_next = expires;
+ return was_first;
+ }
- __remove_hrtimer(timer, base, state, reprogram);
- return 1;
+ debug_hrtimer_deactivate(timer);
+ timerqueue_linked_del(&base->active, &timer->node);
}
- return 0;
+
+ /* Set the new expiry time */
+ hrtimer_set_expires_range_ns(timer, expires, delta_ns);
+
+ debug_activate(timer, mode, timer->is_queued);
+ base->cpu_base->active_bases |= 1 << base->index;
+
+ /* Pairs with the lockless read in hrtimer_is_queued() */
+ WRITE_ONCE(timer->is_queued, HRTIMER_STATE_ENQUEUED);
+
+ /* If it's the first expiring timer now or again, update base */
+ if (timerqueue_linked_add(&base->active, &timer->node)) {
+ base->expires_next = expires;
+ return true;
+ }
+
+ if (was_first)
+ base_update_next_timer(base);
+
+ return false;
}
static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,
@@ -1190,55 +1290,93 @@ static inline ktime_t hrtimer_update_lowres(struct hrtimer *timer, ktime_t tim,
return tim;
}
-static void
-hrtimer_update_softirq_timer(struct hrtimer_cpu_base *cpu_base, bool reprogram)
+static void hrtimer_update_softirq_timer(struct hrtimer_cpu_base *cpu_base, bool reprogram)
{
- ktime_t expires;
-
- /*
- * Find the next SOFT expiration.
- */
- expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
+ ktime_t expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
/*
- * reprogramming needs to be triggered, even if the next soft
- * hrtimer expires at the same time than the next hard
+ * Reprogramming needs to be triggered, even if the next soft
+ * hrtimer expires at the same time as the next hard
* hrtimer. cpu_base->softirq_expires_next needs to be updated!
*/
if (expires == KTIME_MAX)
return;
/*
- * cpu_base->*next_timer is recomputed by __hrtimer_get_next_event()
- * cpu_base->*expires_next is only set by hrtimer_reprogram()
+ * cpu_base->next_timer is recomputed by __hrtimer_get_next_event()
+ * cpu_base->expires_next is only set by hrtimer_reprogram()
*/
hrtimer_reprogram(cpu_base->softirq_next_timer, reprogram);
}
-static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
- u64 delta_ns, const enum hrtimer_mode mode,
- struct hrtimer_clock_base *base)
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+static __always_inline bool hrtimer_prefer_local(bool is_local, bool is_first, bool is_pinned)
+{
+ if (static_branch_likely(&timers_migration_enabled)) {
+ /*
+ * If it is local and the first expiring timer keep it on the local
+ * CPU to optimize reprogramming of the clockevent device. Also
+ * avoid switch_hrtimer_base() overhead when local and pinned.
+ */
+ if (!is_local)
+ return false;
+ if (is_first || is_pinned)
+ return true;
+
+ /* Honour the NOHZ full restrictions */
+ if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_KERNEL_NOISE))
+ return false;
+
+ /*
+ * If the tick is not stopped or need_resched() is set, then
+ * there is no point in moving the timer somewhere else.
+ */
+ return !tick_nohz_tick_stopped() || need_resched();
+ }
+ return is_local;
+}
+#else
+static __always_inline bool hrtimer_prefer_local(bool is_local, bool is_first, bool is_pinned)
+{
+ return is_local;
+}
+#endif
+
+static inline bool hrtimer_keep_base(struct hrtimer *timer, bool is_local, bool is_first,
+ bool is_pinned)
+{
+ /* If the timer is running the callback it has to stay on its CPU base. */
+ if (unlikely(timer->base->running == timer))
+ return true;
+
+ return hrtimer_prefer_local(is_local, is_first, is_pinned);
+}
+
+static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 delta_ns,
+ const enum hrtimer_mode mode, struct hrtimer_clock_base *base)
{
struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
- struct hrtimer_clock_base *new_base;
- bool force_local, first;
+ bool is_pinned, first, was_first, keep_base = false;
+ struct hrtimer_cpu_base *cpu_base = base->cpu_base;
- /*
- * If the timer is on the local cpu base and is the first expiring
- * timer then this might end up reprogramming the hardware twice
- * (on removal and on enqueue). To avoid that by prevent the
- * reprogram on removal, keep the timer local to the current CPU
- * and enforce reprogramming after it is queued no matter whether
- * it is the new first expiring timer again or not.
- */
- force_local = base->cpu_base == this_cpu_base;
- force_local &= base->cpu_base->next_timer == timer;
+ was_first = cpu_base->next_timer == timer;
+ is_pinned = !!(mode & HRTIMER_MODE_PINNED);
/*
- * Don't force local queuing if this enqueue happens on a unplugged
- * CPU after hrtimer_cpu_dying() has been invoked.
+ * Don't keep it local if this enqueue happens on an unplugged CPU
+ * after hrtimer_cpu_dying() has been invoked.
*/
- force_local &= this_cpu_base->online;
+ if (likely(this_cpu_base->online)) {
+ bool is_local = cpu_base == this_cpu_base;
+
+ keep_base = hrtimer_keep_base(timer, is_local, was_first, is_pinned);
+ }
+
+ /* Calculate absolute expiry time for relative timers */
+ if (mode & HRTIMER_MODE_REL)
+ tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));
+ /* Compensate for low resolution granularity */
+ tim = hrtimer_update_lowres(timer, tim, mode);
/*
* Remove an active timer from the queue. In case it is not queued
@@ -1250,32 +1388,41 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
* reprogramming later if it was the first expiring timer. This
* avoids programming the underlying clock event twice (once at
* removal and once after enqueue).
+ *
+ * @keep_base is also true if the timer callback is running on a
+ * remote CPU and for local pinned timers.
*/
- remove_hrtimer(timer, base, true, force_local);
+ if (likely(keep_base)) {
+ first = remove_and_enqueue_same_base(timer, base, mode, tim, delta_ns);
+ } else {
+ /* Keep the ENQUEUED state in case it is queued */
+ bool was_armed = remove_hrtimer(timer, base, HRTIMER_STATE_ENQUEUED);
- if (mode & HRTIMER_MODE_REL)
- tim = ktime_add_safe(tim, __hrtimer_cb_get_time(base->clockid));
+ hrtimer_set_expires_range_ns(timer, tim, delta_ns);
- tim = hrtimer_update_lowres(timer, tim, mode);
+ /* Switch the timer base, if necessary: */
+ base = switch_hrtimer_base(timer, base, is_pinned);
+ cpu_base = base->cpu_base;
- hrtimer_set_expires_range_ns(timer, tim, delta_ns);
+ first = enqueue_hrtimer(timer, base, mode, was_armed);
+ }
- /* Switch the timer base, if necessary: */
- if (!force_local) {
- new_base = switch_hrtimer_base(timer, base,
- mode & HRTIMER_MODE_PINNED);
- } else {
- new_base = base;
+ /* If a deferred rearm is pending skip reprogramming the device */
+ if (cpu_base->deferred_rearm) {
+ cpu_base->deferred_needs_update = true;
+ return false;
}
- first = enqueue_hrtimer(timer, new_base, mode);
- if (!force_local) {
+ if (!was_first || cpu_base != this_cpu_base) {
/*
- * If the current CPU base is online, then the timer is
- * never queued on a remote CPU if it would be the first
- * expiring timer there.
+ * If the current CPU base is online, then the timer is never
+ * queued on a remote CPU if it would be the first expiring
+ * timer there unless the timer callback is currently executed
+ * on the remote CPU. In the latter case the remote CPU will
+ * re-evaluate the first expiring timer after completing the
+ * callbacks.
*/
- if (hrtimer_base_is_online(this_cpu_base))
+ if (likely(hrtimer_base_is_online(this_cpu_base)))
return first;
/*
@@ -1283,21 +1430,33 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
* already offline. If the timer is the first to expire,
* kick the remote CPU to reprogram the clock event.
*/
- if (first) {
- struct hrtimer_cpu_base *new_cpu_base = new_base->cpu_base;
+ if (first)
+ smp_call_function_single_async(cpu_base->cpu, &cpu_base->csd);
+ return false;
+ }
- smp_call_function_single_async(new_cpu_base->cpu, &new_cpu_base->csd);
- }
- return 0;
+ /*
+ * Special case for the HRTICK timer. It is frequently rearmed and most
+ * of the time moves the expiry into the future. That's expensive in
+ * virtual machines and it's better to take the pointless already armed
+ * interrupt than reprogramming the hardware on every context switch.
+ *
+ * If the new expiry is before the armed time, then reprogramming is
+ * required.
+ */
+ if (timer->is_lazy) {
+ if (cpu_base->expires_next <= hrtimer_get_expires(timer))
+ return false;
}
/*
- * Timer was forced to stay on the current CPU to avoid
- * reprogramming on removal and enqueue. Force reprogram the
- * hardware by evaluating the new first expiring timer.
+ * Timer was the first expiring timer and forced to stay on the
+ * current CPU to avoid reprogramming on removal and enqueue. Force
+ * reprogram the hardware by evaluating the new first expiring
+ * timer.
*/
- hrtimer_force_reprogram(new_base->cpu_base, 1);
- return 0;
+ hrtimer_force_reprogram(cpu_base, /* skip_equal */ true);
+ return false;
}
/**
@@ -1309,12 +1468,14 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
* relative (HRTIMER_MODE_REL), and pinned (HRTIMER_MODE_PINNED);
* softirq based mode is considered for debug purpose only!
*/
-void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
- u64 delta_ns, const enum hrtimer_mode mode)
+void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 delta_ns,
+ const enum hrtimer_mode mode)
{
struct hrtimer_clock_base *base;
unsigned long flags;
+ debug_hrtimer_assert_init(timer);
+
/*
* Check whether the HRTIMER_MODE_SOFT bit and hrtimer.is_soft
* match on CONFIG_PREEMPT_RT = n. With PREEMPT_RT check the hard
@@ -1362,8 +1523,11 @@ int hrtimer_try_to_cancel(struct hrtimer *timer)
base = lock_hrtimer_base(timer, &flags);
- if (!hrtimer_callback_running(timer))
- ret = remove_hrtimer(timer, base, false, false);
+ if (!hrtimer_callback_running(timer)) {
+ ret = remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE);
+ if (ret)
+ trace_hrtimer_cancel(timer);
+ }
unlock_hrtimer_base(timer, &flags);
@@ -1397,8 +1561,7 @@ static void hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base)
* the timer callback to finish. Drop expiry_lock and reacquire it. That
* allows the waiter to acquire the lock and make progress.
*/
-static void hrtimer_sync_wait_running(struct hrtimer_cpu_base *cpu_base,
- unsigned long flags)
+static void hrtimer_sync_wait_running(struct hrtimer_cpu_base *cpu_base, unsigned long flags)
{
if (atomic_read(&cpu_base->timer_waiters)) {
raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
@@ -1463,14 +1626,10 @@ void hrtimer_cancel_wait_running(const struct hrtimer *timer)
spin_unlock_bh(&base->cpu_base->softirq_expiry_lock);
}
#else
-static inline void
-hrtimer_cpu_base_init_expiry_lock(struct hrtimer_cpu_base *base) { }
-static inline void
-hrtimer_cpu_base_lock_expiry(struct hrtimer_cpu_base *base) { }
-static inline void
-hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base) { }
-static inline void hrtimer_sync_wait_running(struct hrtimer_cpu_base *base,
- unsigned long flags) { }
+static inline void hrtimer_cpu_base_init_expiry_lock(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_cpu_base_lock_expiry(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_cpu_base_unlock_expiry(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_sync_wait_running(struct hrtimer_cpu_base *base, unsigned long fl) { }
#endif
/**
@@ -1526,15 +1685,11 @@ u64 hrtimer_get_next_event(void)
{
struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
u64 expires = KTIME_MAX;
- unsigned long flags;
-
- raw_spin_lock_irqsave(&cpu_base->lock, flags);
+ guard(raw_spinlock_irqsave)(&cpu_base->lock);
if (!hrtimer_hres_active(cpu_base))
expires = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
- raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
-
return expires;
}
@@ -1549,26 +1704,20 @@ u64 hrtimer_next_event_without(const struct hrtimer *exclude)
{
struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
u64 expires = KTIME_MAX;
- unsigned long flags;
-
- raw_spin_lock_irqsave(&cpu_base->lock, flags);
-
- if (hrtimer_hres_active(cpu_base)) {
- unsigned int active;
+ unsigned int active;
- if (!cpu_base->softirq_activated) {
- active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
- expires = __hrtimer_next_event_base(cpu_base, exclude,
- active, KTIME_MAX);
- }
- active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
- expires = __hrtimer_next_event_base(cpu_base, exclude, active,
- expires);
- }
+ guard(raw_spinlock_irqsave)(&cpu_base->lock);
+ if (!hrtimer_hres_active(cpu_base))
+ return expires;
- raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+ active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
+ if (active && !cpu_base->softirq_activated)
+ expires = hrtimer_bases_next_event_without(cpu_base, exclude, active, KTIME_MAX);
- return expires;
+ active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
+ if (!active)
+ return expires;
+ return hrtimer_bases_next_event_without(cpu_base, exclude, active, expires);
}
#endif
@@ -1612,8 +1761,7 @@ ktime_t hrtimer_cb_get_time(const struct hrtimer *timer)
}
EXPORT_SYMBOL_GPL(hrtimer_cb_get_time);
-static void __hrtimer_setup(struct hrtimer *timer,
- enum hrtimer_restart (*function)(struct hrtimer *),
+static void __hrtimer_setup(struct hrtimer *timer, enum hrtimer_restart (*fn)(struct hrtimer *),
clockid_t clock_id, enum hrtimer_mode mode)
{
bool softtimer = !!(mode & HRTIMER_MODE_SOFT);
@@ -1645,13 +1793,14 @@ static void __hrtimer_setup(struct hrtimer *timer,
base += hrtimer_clockid_to_base(clock_id);
timer->is_soft = softtimer;
timer->is_hard = !!(mode & HRTIMER_MODE_HARD);
+ timer->is_lazy = !!(mode & HRTIMER_MODE_LAZY_REARM);
timer->base = &cpu_base->clock_base[base];
- timerqueue_init(&timer->node);
+ timerqueue_linked_init(&timer->node);
- if (WARN_ON_ONCE(!function))
+ if (WARN_ON_ONCE(!fn))
ACCESS_PRIVATE(timer, function) = hrtimer_dummy_timeout;
else
- ACCESS_PRIVATE(timer, function) = function;
+ ACCESS_PRIVATE(timer, function) = fn;
}
/**
@@ -1710,12 +1859,10 @@ bool hrtimer_active(const struct hrtimer *timer)
base = READ_ONCE(timer->base);
seq = raw_read_seqcount_begin(&base->seq);
- if (timer->state != HRTIMER_STATE_INACTIVE ||
- base->running == timer)
+ if (timer->is_queued || base->running == timer)
return true;
- } while (read_seqcount_retry(&base->seq, seq) ||
- base != READ_ONCE(timer->base));
+ } while (read_seqcount_retry(&base->seq, seq) || base != READ_ONCE(timer->base));
return false;
}
@@ -1729,7 +1876,7 @@ EXPORT_SYMBOL_GPL(hrtimer_active);
* - callback: the timer is being ran
* - post: the timer is inactive or (re)queued
*
- * On the read side we ensure we observe timer->state and cpu_base->running
+ * On the read side we ensure we observe timer->is_queued and cpu_base->running
* from the same section, if anything changed while we looked at it, we retry.
* This includes timer->base changing because sequence numbers alone are
* insufficient for that.
@@ -1738,11 +1885,9 @@ EXPORT_SYMBOL_GPL(hrtimer_active);
* a false negative if the read side got smeared over multiple consecutive
* __run_hrtimer() invocations.
*/
-
-static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
- struct hrtimer_clock_base *base,
- struct hrtimer *timer, ktime_t *now,
- unsigned long flags) __must_hold(&cpu_base->lock)
+static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base, struct hrtimer_clock_base *base,
+ struct hrtimer *timer, ktime_t now, unsigned long flags)
+ __must_hold(&cpu_base->lock)
{
enum hrtimer_restart (*fn)(struct hrtimer *);
bool expires_in_hardirq;
@@ -1754,15 +1899,15 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
base->running = timer;
/*
- * Separate the ->running assignment from the ->state assignment.
+ * Separate the ->running assignment from the ->is_queued assignment.
*
* As with a regular write barrier, this ensures the read side in
* hrtimer_active() cannot observe base->running == NULL &&
- * timer->state == INACTIVE.
+ * timer->is_queued == INACTIVE.
*/
raw_write_seqcount_barrier(&base->seq);
- __remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, 0);
+ __remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE, false);
fn = ACCESS_PRIVATE(timer, function);
/*
@@ -1797,16 +1942,15 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
* hrtimer_start_range_ns() can have popped in and enqueued the timer
* for us already.
*/
- if (restart != HRTIMER_NORESTART &&
- !(timer->state & HRTIMER_STATE_ENQUEUED))
- enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS);
+ if (restart == HRTIMER_RESTART && !timer->is_queued)
+ enqueue_hrtimer(timer, base, HRTIMER_MODE_ABS, false);
/*
- * Separate the ->running assignment from the ->state assignment.
+ * Separate the ->running assignment from the ->is_queued assignment.
*
* As with a regular write barrier, this ensures the read side in
* hrtimer_active() cannot observe base->running.timer == NULL &&
- * timer->state == INACTIVE.
+ * timer->is_queued == INACTIVE.
*/
raw_write_seqcount_barrier(&base->seq);
@@ -1814,23 +1958,24 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
base->running = NULL;
}
+static __always_inline struct hrtimer *clock_base_next_timer_safe(struct hrtimer_clock_base *base)
+{
+ struct timerqueue_linked_node *next = timerqueue_linked_first(&base->active);
+
+ return next ? hrtimer_from_timerqueue_node(next) : NULL;
+}
+
static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
unsigned long flags, unsigned int active_mask)
{
- struct hrtimer_clock_base *base;
unsigned int active = cpu_base->active_bases & active_mask;
+ struct hrtimer_clock_base *base;
for_each_active_base(base, cpu_base, active) {
- struct timerqueue_node *node;
- ktime_t basenow;
-
- basenow = ktime_add(now, base->offset);
-
- while ((node = timerqueue_getnext(&base->active))) {
- struct hrtimer *timer;
-
- timer = container_of(node, struct hrtimer, node);
+ ktime_t basenow = ktime_add(now, base->offset);
+ struct hrtimer *timer;
+ while ((timer = clock_base_next_timer_safe(base))) {
/*
* The immediate goal for using the softexpires is
* minimizing wakeups, not running timers at the
@@ -1846,7 +1991,7 @@ static void __hrtimer_run_queues(struct hrtimer_cpu_base *cpu_base, ktime_t now,
if (basenow < hrtimer_get_softexpires(timer))
break;
- __run_hrtimer(cpu_base, base, timer, &basenow, flags);
+ __run_hrtimer(cpu_base, base, timer, basenow, flags);
if (active_mask == HRTIMER_ACTIVE_SOFT)
hrtimer_sync_wait_running(cpu_base, flags);
}
@@ -1865,7 +2010,7 @@ static __latent_entropy void hrtimer_run_softirq(void)
now = hrtimer_update_base(cpu_base);
__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_SOFT);
- cpu_base->softirq_activated = 0;
+ cpu_base->softirq_activated = false;
hrtimer_update_softirq_timer(cpu_base, true);
raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
@@ -1874,6 +2019,63 @@ static __latent_entropy void hrtimer_run_softirq(void)
#ifdef CONFIG_HIGH_RES_TIMERS
+/*
+ * Very similar to hrtimer_force_reprogram(), except it deals with
+ * deferred_rearm and hang_detected.
+ */
+static void hrtimer_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next, bool deferred)
+{
+ cpu_base->expires_next = expires_next;
+ cpu_base->deferred_rearm = false;
+
+ if (unlikely(cpu_base->hang_detected)) {
+ /*
+ * Give the system a chance to do something else than looping
+ * on hrtimer interrupts.
+ */
+ expires_next = ktime_add_ns(ktime_get(),
+ min(100 * NSEC_PER_MSEC, cpu_base->max_hang_time));
+ }
+ hrtimer_rearm_event(expires_next, deferred);
+}
+
+#ifdef CONFIG_HRTIMER_REARM_DEFERRED
+void __hrtimer_rearm_deferred(void)
+{
+ struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
+ ktime_t expires_next;
+
+ if (!cpu_base->deferred_rearm)
+ return;
+
+ guard(raw_spinlock)(&cpu_base->lock);
+ if (cpu_base->deferred_needs_update) {
+ hrtimer_update_base(cpu_base);
+ expires_next = hrtimer_update_next_event(cpu_base);
+ } else {
+ /* No timer added/removed. Use the cached value */
+ expires_next = cpu_base->deferred_expires_next;
+ }
+ hrtimer_rearm(cpu_base, expires_next, true);
+}
+
+static __always_inline void
+hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
+{
+ /* hrtimer_interrupt() just re-evaluated the first expiring timer */
+ cpu_base->deferred_needs_update = false;
+ /* Cache the expiry time */
+ cpu_base->deferred_expires_next = expires_next;
+ set_thread_flag(TIF_HRTIMER_REARM);
+}
+#else /* CONFIG_HRTIMER_REARM_DEFERRED */
+static __always_inline void
+hrtimer_interrupt_rearm(struct hrtimer_cpu_base *cpu_base, ktime_t expires_next)
+{
+ hrtimer_rearm(cpu_base, expires_next, false);
+}
+#endif /* !CONFIG_HRTIMER_REARM_DEFERRED */
+
/*
* High resolution timer interrupt
* Called with interrupts disabled
@@ -1888,86 +2090,55 @@ void hrtimer_interrupt(struct clock_event_device *dev)
BUG_ON(!cpu_base->hres_active);
cpu_base->nr_events++;
dev->next_event = KTIME_MAX;
+ dev->next_event_forced = 0;
raw_spin_lock_irqsave(&cpu_base->lock, flags);
entry_time = now = hrtimer_update_base(cpu_base);
retry:
- cpu_base->in_hrtirq = 1;
+ cpu_base->deferred_rearm = true;
/*
- * We set expires_next to KTIME_MAX here with cpu_base->lock
- * held to prevent that a timer is enqueued in our queue via
- * the migration code. This does not affect enqueueing of
- * timers which run their callback and need to be requeued on
- * this CPU.
+ * Set expires_next to KTIME_MAX, which prevents that remote CPUs queue
+ * timers while __hrtimer_run_queues() is expiring the clock bases.
+ * Timers which are re/enqueued on the local CPU are not affected by
+ * this.
*/
cpu_base->expires_next = KTIME_MAX;
if (!ktime_before(now, cpu_base->softirq_expires_next)) {
cpu_base->softirq_expires_next = KTIME_MAX;
- cpu_base->softirq_activated = 1;
+ cpu_base->softirq_activated = true;
raise_timer_softirq(HRTIMER_SOFTIRQ);
}
__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
- /* Reevaluate the clock bases for the [soft] next expiry */
- expires_next = hrtimer_update_next_event(cpu_base);
- /*
- * Store the new expiry value so the migration code can verify
- * against it.
- */
- cpu_base->expires_next = expires_next;
- cpu_base->in_hrtirq = 0;
- raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
-
- /* Reprogramming necessary ? */
- if (!tick_program_event(expires_next, 0)) {
- cpu_base->hang_detected = 0;
- return;
- }
-
/*
* The next timer was already expired due to:
* - tracing
* - long lasting callbacks
* - being scheduled away when running in a VM
*
- * We need to prevent that we loop forever in the hrtimer
- * interrupt routine. We give it 3 attempts to avoid
- * overreacting on some spurious event.
- *
- * Acquire base lock for updating the offsets and retrieving
- * the current time.
+ * We need to prevent that we loop forever in the hrtimer interrupt
+ * routine. We give it 3 attempts to avoid overreacting on some
+ * spurious event.
*/
- raw_spin_lock_irqsave(&cpu_base->lock, flags);
now = hrtimer_update_base(cpu_base);
- cpu_base->nr_retries++;
- if (++retries < 3)
- goto retry;
- /*
- * Give the system a chance to do something else than looping
- * here. We stored the entry time, so we know exactly how long
- * we spent here. We schedule the next event this amount of
- * time away.
- */
- cpu_base->nr_hangs++;
- cpu_base->hang_detected = 1;
- raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+ expires_next = hrtimer_update_next_event(cpu_base);
+ cpu_base->hang_detected = false;
+ if (expires_next < now) {
+ if (++retries < 3)
+ goto retry;
+
+ delta = ktime_sub(now, entry_time);
+ cpu_base->max_hang_time = max_t(unsigned int, cpu_base->max_hang_time, delta);
+ cpu_base->nr_hangs++;
+ cpu_base->hang_detected = true;
+ }
- delta = ktime_sub(now, entry_time);
- if ((unsigned int)delta > cpu_base->max_hang_time)
- cpu_base->max_hang_time = (unsigned int) delta;
- /*
- * Limit it to a sensible value as we enforce a longer
- * delay. Give the CPU at least 100ms to catch up.
- */
- if (delta > 100 * NSEC_PER_MSEC)
- expires_next = ktime_add_ns(now, 100 * NSEC_PER_MSEC);
- else
- expires_next = ktime_add(now, delta);
- tick_program_event(expires_next, 1);
- pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta));
+ hrtimer_interrupt_rearm(cpu_base, expires_next);
+ raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
}
+
#endif /* !CONFIG_HIGH_RES_TIMERS */
/*
@@ -1999,7 +2170,7 @@ void hrtimer_run_queues(void)
if (!ktime_before(now, cpu_base->softirq_expires_next)) {
cpu_base->softirq_expires_next = KTIME_MAX;
- cpu_base->softirq_activated = 1;
+ cpu_base->softirq_activated = true;
raise_timer_softirq(HRTIMER_SOFTIRQ);
}
@@ -2012,8 +2183,7 @@ void hrtimer_run_queues(void)
*/
static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
{
- struct hrtimer_sleeper *t =
- container_of(timer, struct hrtimer_sleeper, timer);
+ struct hrtimer_sleeper *t = container_of(timer, struct hrtimer_sleeper, timer);
struct task_struct *task = t->task;
t->task = NULL;
@@ -2031,8 +2201,7 @@ static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
* Wrapper around hrtimer_start_expires() for hrtimer_sleeper based timers
* to allow PREEMPT_RT to tweak the delivery mode (soft/hardirq context)
*/
-void hrtimer_sleeper_start_expires(struct hrtimer_sleeper *sl,
- enum hrtimer_mode mode)
+void hrtimer_sleeper_start_expires(struct hrtimer_sleeper *sl, enum hrtimer_mode mode)
{
/*
* Make the enqueue delivery mode check work on RT. If the sleeper
@@ -2048,8 +2217,8 @@ void hrtimer_sleeper_start_expires(struct hrtimer_sleeper *sl,
}
EXPORT_SYMBOL_GPL(hrtimer_sleeper_start_expires);
-static void __hrtimer_setup_sleeper(struct hrtimer_sleeper *sl,
- clockid_t clock_id, enum hrtimer_mode mode)
+static void __hrtimer_setup_sleeper(struct hrtimer_sleeper *sl, clockid_t clock_id,
+ enum hrtimer_mode mode)
{
/*
* On PREEMPT_RT enabled kernels hrtimers which are not explicitly
@@ -2085,8 +2254,8 @@ static void __hrtimer_setup_sleeper(struct hrtimer_sleeper *sl,
* @clock_id: the clock to be used
* @mode: timer mode abs/rel
*/
-void hrtimer_setup_sleeper_on_stack(struct hrtimer_sleeper *sl,
- clockid_t clock_id, enum hrtimer_mode mode)
+void hrtimer_setup_sleeper_on_stack(struct hrtimer_sleeper *sl, clockid_t clock_id,
+ enum hrtimer_mode mode)
{
debug_setup_on_stack(&sl->timer, clock_id, mode);
__hrtimer_setup_sleeper(sl, clock_id, mode);
@@ -2159,12 +2328,11 @@ static long __sched hrtimer_nanosleep_restart(struct restart_block *restart)
return ret;
}
-long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode,
- const clockid_t clockid)
+long hrtimer_nanosleep(ktime_t rqtp, const enum hrtimer_mode mode, const clockid_t clockid)
{
struct restart_block *restart;
struct hrtimer_sleeper t;
- int ret = 0;
+ int ret;
hrtimer_setup_sleeper_on_stack(&t, clockid, mode);
hrtimer_set_expires_range_ns(&t.timer, rqtp, current->timer_slack_ns);
@@ -2203,8 +2371,7 @@ SYSCALL_DEFINE2(nanosleep, struct __kernel_timespec __user *, rqtp,
current->restart_block.fn = do_no_restart_syscall;
current->restart_block.nanosleep.type = rmtp ? TT_NATIVE : TT_NONE;
current->restart_block.nanosleep.rmtp = rmtp;
- return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL,
- CLOCK_MONOTONIC);
+ return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL, CLOCK_MONOTONIC);
}
#endif
@@ -2212,7 +2379,7 @@ SYSCALL_DEFINE2(nanosleep, struct __kernel_timespec __user *, rqtp,
#ifdef CONFIG_COMPAT_32BIT_TIME
SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user *, rqtp,
- struct old_timespec32 __user *, rmtp)
+ struct old_timespec32 __user *, rmtp)
{
struct timespec64 tu;
@@ -2225,8 +2392,7 @@ SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user *, rqtp,
current->restart_block.fn = do_no_restart_syscall;
current->restart_block.nanosleep.type = rmtp ? TT_COMPAT : TT_NONE;
current->restart_block.nanosleep.compat_rmtp = rmtp;
- return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL,
- CLOCK_MONOTONIC);
+ return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL, CLOCK_MONOTONIC);
}
#endif
@@ -2236,14 +2402,13 @@ SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user *, rqtp,
int hrtimers_prepare_cpu(unsigned int cpu)
{
struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
- int i;
- for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
+ for (int i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
struct hrtimer_clock_base *clock_b = &cpu_base->clock_base[i];
clock_b->cpu_base = cpu_base;
seqcount_raw_spinlock_init(&clock_b->seq, &cpu_base->lock);
- timerqueue_init_head(&clock_b->active);
+ timerqueue_linked_init_head(&clock_b->active);
}
cpu_base->cpu = cpu;
@@ -2257,13 +2422,14 @@ int hrtimers_cpu_starting(unsigned int cpu)
/* Clear out any left over state from a CPU down operation */
cpu_base->active_bases = 0;
- cpu_base->hres_active = 0;
- cpu_base->hang_detected = 0;
+ cpu_base->hres_active = false;
+ cpu_base->hang_detected = false;
cpu_base->next_timer = NULL;
cpu_base->softirq_next_timer = NULL;
cpu_base->expires_next = KTIME_MAX;
cpu_base->softirq_expires_next = KTIME_MAX;
- cpu_base->online = 1;
+ cpu_base->softirq_activated = false;
+ cpu_base->online = true;
return 0;
}
@@ -2272,20 +2438,20 @@ int hrtimers_cpu_starting(unsigned int cpu)
static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
struct hrtimer_clock_base *new_base)
{
+ struct timerqueue_linked_node *node;
struct hrtimer *timer;
- struct timerqueue_node *node;
- while ((node = timerqueue_getnext(&old_base->active))) {
- timer = container_of(node, struct hrtimer, node);
+ while ((node = timerqueue_linked_first(&old_base->active))) {
+ timer = hrtimer_from_timerqueue_node(node);
BUG_ON(hrtimer_callback_running(timer));
- debug_deactivate(timer);
+ debug_hrtimer_deactivate(timer);
/*
* Mark it as ENQUEUED not INACTIVE otherwise the
* timer could be seen as !active and just vanish away
* under us on another CPU
*/
- __remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, 0);
+ __remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, false);
timer->base = new_base;
/*
* Enqueue the timers on the new cpu. This does not
@@ -2295,13 +2461,13 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
* sort out already expired timers and reprogram the
* event device.
*/
- enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS);
+ enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS, true);
}
}
int hrtimers_cpu_dying(unsigned int dying_cpu)
{
- int i, ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
+ int ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
struct hrtimer_cpu_base *old_base, *new_base;
old_base = this_cpu_ptr(&hrtimer_bases);
@@ -2314,16 +2480,14 @@ int hrtimers_cpu_dying(unsigned int dying_cpu)
raw_spin_lock(&old_base->lock);
raw_spin_lock_nested(&new_base->lock, SINGLE_DEPTH_NESTING);
- for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
- migrate_hrtimer_list(&old_base->clock_base[i],
- &new_base->clock_base[i]);
- }
+ for (int i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
+ migrate_hrtimer_list(&old_base->clock_base[i], &new_base->clock_base[i]);
/* Tell the other CPU to retrigger the next event */
smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
raw_spin_unlock(&new_base->lock);
- old_base->online = 0;
+ old_base->online = false;
raw_spin_unlock(&old_base->lock);
return 0;
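For readers following the locality change, the decision made by hrtimer_prefer_local() in the kernel/time/hrtimer.c hunks above can be modelled in plain userspace C. This is only a sketch: struct cpu_state and prefer_local_model() are hypothetical names, and the four flags stand in for the timers_migration_enabled static key, housekeeping_cpu(), tick_nohz_tick_stopped() and need_resched().

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state the kernel helper queries. */
struct cpu_state {
	bool migration_enabled;	/* models timers_migration_enabled */
	bool housekeeping;	/* models housekeeping_cpu(..., HK_TYPE_KERNEL_NOISE) */
	bool tick_stopped;	/* models tick_nohz_tick_stopped() */
	bool need_resched;	/* models need_resched() */
};

/* Userspace model of the keep-local decision, not the kernel code itself. */
static bool prefer_local_model(const struct cpu_state *cpu, bool is_local,
			       bool is_first, bool is_pinned)
{
	/* Without timer migration the timer simply stays where it is. */
	if (!cpu->migration_enabled)
		return is_local;

	/* A remote timer is never preferred local. */
	if (!is_local)
		return false;

	/* First expiring or pinned local timers stay on this CPU. */
	if (is_first || is_pinned)
		return true;

	/* Honour the NOHZ full restrictions. */
	if (!cpu->housekeeping)
		return false;

	/* A running tick or a pending reschedule makes moving pointless. */
	return !cpu->tick_stopped || cpu->need_resched;
}
```

In this model a local, non-pinned, non-first timer on a housekeeping CPU with a stopped tick and no pending reschedule is the only local case that is allowed to migrate.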
diff --git a/kernel/time/jiffies.c b/kernel/time/jiffies.c
index 9daf8c5d9687..1c954f330dfe 100644
--- a/kernel/time/jiffies.c
+++ b/kernel/time/jiffies.c
@@ -32,7 +32,6 @@ static u64 jiffies_read(struct clocksource *cs)
static struct clocksource clocksource_jiffies = {
.name = "jiffies",
.rating = 1, /* lowest valid rating*/
- .uncertainty_margin = 32 * NSEC_PER_MSEC,
.read = jiffies_read,
.mask = CLOCKSOURCE_MASK(32),
.mult = TICK_NSEC << JIFFIES_SHIFT, /* details above */
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 413e2389f0a5..9331e1614124 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -1092,7 +1092,7 @@ void exit_itimers(struct task_struct *tsk)
}
/*
- * There should be no timers on the ignored list. itimer_delete() has
+ * There should be no timers on the ignored list. posix_timer_delete() has
* mopped them up.
*/
if (!WARN_ON_ONCE(!hlist_empty(&tsk->signal->ignored_posix_timers)))
diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index a88b72b0f35e..51f6a1032c83 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -78,7 +78,6 @@ static struct clock_event_device ce_broadcast_hrtimer = {
.set_state_shutdown = bc_shutdown,
.set_next_ktime = bc_set_next,
.features = CLOCK_EVT_FEAT_ONESHOT |
- CLOCK_EVT_FEAT_KTIME |
CLOCK_EVT_FEAT_HRTIMER,
.rating = 0,
.bound_on = -1,
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index f63c65881364..7e57fa31ee26 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -76,8 +76,10 @@ const struct clock_event_device *tick_get_wakeup_device(int cpu)
*/
static void tick_broadcast_start_periodic(struct clock_event_device *bc)
{
- if (bc)
+ if (bc) {
+ bc->next_event_forced = 0;
tick_setup_periodic(bc, 1);
+ }
}
/*
@@ -403,6 +405,7 @@ static void tick_handle_periodic_broadcast(struct clock_event_device *dev)
bool bc_local;
raw_spin_lock(&tick_broadcast_lock);
+ tick_broadcast_device.evtdev->next_event_forced = 0;
/* Handle spurious interrupts gracefully */
if (clockevent_state_shutdown(tick_broadcast_device.evtdev)) {
@@ -696,6 +699,7 @@ static void tick_handle_oneshot_broadcast(struct clock_event_device *dev)
raw_spin_lock(&tick_broadcast_lock);
dev->next_event = KTIME_MAX;
+ tick_broadcast_device.evtdev->next_event_forced = 0;
next_event = KTIME_MAX;
cpumask_clear(tmpmask);
now = ktime_get();
@@ -1063,6 +1067,7 @@ static void tick_broadcast_setup_oneshot(struct clock_event_device *bc,
bc->event_handler = tick_handle_oneshot_broadcast;
+ bc->next_event_forced = 0;
bc->next_event = KTIME_MAX;
/*
@@ -1175,6 +1180,7 @@ void hotplug_cpu__broadcast_tick_pull(int deadcpu)
}
/* This moves the broadcast assignment to this CPU: */
+ bc->next_event_forced = 0;
clockevents_program_event(bc, bc->next_event, 1);
}
raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index d305d8521896..6a9198a4279b 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -110,6 +110,7 @@ void tick_handle_periodic(struct clock_event_device *dev)
int cpu = smp_processor_id();
ktime_t next = dev->next_event;
+ dev->next_event_forced = 0;
tick_periodic(cpu);
/*
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f7907fadd63f..cbbb87a0c6e7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -345,7 +345,7 @@ static bool check_tick_dependency(atomic_t *dep)
int val = atomic_read(dep);
if (likely(!tracepoint_enabled(tick_stop)))
- return !val;
+ return !!val;
if (val & TICK_DEP_MASK_POSIX_TIMER) {
trace_tick_stop(0, TICK_DEP_MASK_POSIX_TIMER);
@@ -864,19 +864,32 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
}
EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);
+/* Simplified variant of hrtimer_forward_now() */
+static ktime_t tick_forward_now(ktime_t expires, ktime_t now)
+{
+ ktime_t delta = now - expires;
+
+ if (likely(delta < TICK_NSEC))
+ return expires + TICK_NSEC;
+
+ expires += TICK_NSEC * ktime_divns(delta, TICK_NSEC);
+ if (expires > now)
+ return expires;
+ return expires + TICK_NSEC;
+}
+
static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
{
- hrtimer_cancel(&ts->sched_timer);
- hrtimer_set_expires(&ts->sched_timer, ts->last_tick);
+ ktime_t expires = ts->last_tick;
- /* Forward the time to expire in the future */
- hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
+ if (now >= expires)
+ expires = tick_forward_now(expires, now);
if (tick_sched_flag_test(ts, TS_FLAG_HIGHRES)) {
- hrtimer_start_expires(&ts->sched_timer,
- HRTIMER_MODE_ABS_PINNED_HARD);
+ hrtimer_start(&ts->sched_timer, expires, HRTIMER_MODE_ABS_PINNED_HARD);
} else {
- tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
+ hrtimer_set_expires(&ts->sched_timer, expires);
+ tick_program_event(expires, 1);
}
/*
@@ -1513,6 +1526,7 @@ static void tick_nohz_lowres_handler(struct clock_event_device *dev)
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
dev->next_event = KTIME_MAX;
+ dev->next_event_forced = 0;
if (likely(tick_nohz_handler(&ts->sched_timer) == HRTIMER_RESTART))
tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
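The arithmetic in tick_forward_now() above can be tried outside the kernel. The following is only a user-space sketch: the 4 ms tick length (HZ=250) and the `_demo` names are assumptions made for illustration, and plain 64-bit division stands in for ktime_divns(). As in tick_nohz_restart(), the caller is expected to invoke it only when now >= expires.

```c
#include <stdint.h>

/* Assumed tick length for illustration only: 4 ms, i.e. HZ=250. */
#define TICK_NSEC_DEMO 4000000LL

/*
 * User-space model of the forwarding logic: return the first tick
 * boundary after 'now' that is a whole number of ticks past 'expires'.
 * Call only with now >= expires.
 */
static int64_t tick_forward_now_demo(int64_t expires, int64_t now)
{
    int64_t delta = now - expires;

    /* Common case: less than one tick late, a single step is enough. */
    if (delta < TICK_NSEC_DEMO)
        return expires + TICK_NSEC_DEMO;

    /* Skip all whole ticks that have already elapsed. */
    expires += TICK_NSEC_DEMO * (delta / TICK_NSEC_DEMO);
    if (expires > now)
        return expires;

    /* Landed on or before 'now': take one more tick. */
    return expires + TICK_NSEC_DEMO;
}
```

With these assumed values, a timer that was due at 0 and is restarted at 9 ms is forwarded to 12 ms, which keeps the result on the original tick grid and strictly in the future.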
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index c07e562ee4c1..c493a4010305 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -3,34 +3,30 @@
* Kernel timekeeping code and accessor functions. Based on code from
* timer.c, moved in commit 8524070b7982.
*/
-#include <linux/timekeeper_internal.h>
-#include <linux/module.h>
-#include <linux/interrupt.h>
+#include <linux/audit.h>
+#include <linux/clocksource.h>
+#include <linux/compiler.h>
+#include <linux/jiffies.h>
#include <linux/kobject.h>
-#include <linux/percpu.h>
-#include <linux/init.h>
-#include <linux/mm.h>
+#include <linux/module.h>
#include <linux/nmi.h>
-#include <linux/sched.h>
-#include <linux/sched/loadavg.h>
+#include <linux/pvclock_gtod.h>
+#include <linux/random.h>
#include <linux/sched/clock.h>
+#include <linux/sched/loadavg.h>
+#include <linux/static_key.h>
+#include <linux/stop_machine.h>
#include <linux/syscore_ops.h>
-#include <linux/clocksource.h>
-#include <linux/jiffies.h>
+#include <linux/tick.h>
#include <linux/time.h>
#include <linux/timex.h>
-#include <linux/tick.h>
-#include <linux/stop_machine.h>
-#include <linux/pvclock_gtod.h>
-#include <linux/compiler.h>
-#include <linux/audit.h>
-#include <linux/random.h>
+#include <linux/timekeeper_internal.h>
#include <vdso/auxclock.h>
#include "tick-internal.h"
-#include "ntp_internal.h"
#include "timekeeping_internal.h"
+#include "ntp_internal.h"
#define TK_CLEAR_NTP (1 << 0)
#define TK_CLOCK_WAS_SET (1 << 1)
@@ -275,6 +271,11 @@ static inline void tk_update_sleep_time(struct timekeeper *tk, ktime_t delta)
tk->monotonic_to_boot = ktime_to_timespec64(tk->offs_boot);
}
+#ifdef CONFIG_ARCH_WANTS_CLOCKSOURCE_READ_INLINE
+#include <asm/clock_inlined.h>
+
+static DEFINE_STATIC_KEY_FALSE(clocksource_read_inlined);
+
/*
* tk_clock_read - atomic clocksource read() helper
*
@@ -288,12 +289,35 @@ static inline void tk_update_sleep_time(struct timekeeper *tk, ktime_t delta)
* a read of the fast-timekeeper tkrs (which is protected by its own locking
* and update logic).
*/
-static inline u64 tk_clock_read(const struct tk_read_base *tkr)
+static __always_inline u64 tk_clock_read(const struct tk_read_base *tkr)
+{
+ struct clocksource *clock = READ_ONCE(tkr->clock);
+
+ if (static_branch_likely(&clocksource_read_inlined))
+ return arch_inlined_clocksource_read(clock);
+
+ return clock->read(clock);
+}
+
+static inline void clocksource_disable_inline_read(void)
+{
+ static_branch_disable(&clocksource_read_inlined);
+}
+
+static inline void clocksource_enable_inline_read(void)
+{
+ static_branch_enable(&clocksource_read_inlined);
+}
+#else
+static __always_inline u64 tk_clock_read(const struct tk_read_base *tkr)
{
struct clocksource *clock = READ_ONCE(tkr->clock);
return clock->read(clock);
}
+static inline void clocksource_disable_inline_read(void) { }
+static inline void clocksource_enable_inline_read(void) { }
+#endif
/**
* tk_setup_internals - Set up internals to use clocksource clock.
@@ -367,6 +391,27 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
tk->tkr_raw.mult = clock->mult;
tk->ntp_err_mult = 0;
tk->skip_second_overflow = 0;
+
+ tk->cs_id = clock->id;
+
+ /* Coupled clockevent data */
+ if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED) &&
+ clock->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT) {
+ /*
+ * Aim for a one hour maximum delta and use KHz to handle
+ * clocksources with a frequency above 4GHz correctly as
+ * the frequency argument of clocks_calc_mult_shift() is u32.
+ */
+ clocks_calc_mult_shift(&tk->cs_ns_to_cyc_mult, &tk->cs_ns_to_cyc_shift,
+ NSEC_PER_MSEC, clock->freq_khz, 3600 * 1000);
+ /*
+ * Initialize the conversion limit as the previous clocksource
+ * might have the same shift/mult pair so the quick check in
+ * tk_update_ns_to_cyc() fails to update it after a clocksource
+ * change leaving it effectively zero.
+ */
+ tk->cs_ns_to_cyc_maxns = div_u64(clock->mask, tk->cs_ns_to_cyc_mult);
+ }
}
/* Timekeeper helper functions. */
@@ -375,7 +420,7 @@ static noinline u64 delta_to_ns_safe(const struct tk_read_base *tkr, u64 delta)
return mul_u64_u32_add_u64_shr(delta, tkr->mult, tkr->xtime_nsec, tkr->shift);
}
-static inline u64 timekeeping_cycles_to_ns(const struct tk_read_base *tkr, u64 cycles)
+static __always_inline u64 timekeeping_cycles_to_ns(const struct tk_read_base *tkr, u64 cycles)
{
/* Calculate the delta since the last update_wall_time() */
u64 mask = tkr->mask, delta = (cycles - tkr->cycle_last) & mask;
@@ -696,6 +741,36 @@ static inline void tk_update_ktime_data(struct timekeeper *tk)
tk->tkr_raw.base = ns_to_ktime(tk->raw_sec * NSEC_PER_SEC);
}
+static inline void tk_update_ns_to_cyc(struct timekeeper *tks, struct timekeeper *tkc)
+{
+ struct tk_read_base *tkrs = &tks->tkr_mono;
+ struct tk_read_base *tkrc = &tkc->tkr_mono;
+ unsigned int shift;
+
+ if (!IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_COUPLED) ||
+ !(tkrs->clock->flags & CLOCK_SOURCE_HAS_COUPLED_CLOCK_EVENT))
+ return;
+
+ if (tkrs->mult == tkrc->mult && tkrs->shift == tkrc->shift)
+ return;
+ /*
+ * The conversion math is simple:
+ *
+ * CS::MULT (1 << NS_TO_CYC_SHIFT)
+ * --------------- = ----------------------
+ * (1 << CS:SHIFT) NS_TO_CYC_MULT
+ *
+ * Ergo:
+ *
+ * NS_TO_CYC_MULT = (1 << (CS::SHIFT + NS_TO_CYC_SHIFT)) / CS::MULT
+ *
+ * NS_TO_CYC_SHIFT has been set up in tk_setup_internals()
+ */
+ shift = tkrs->shift + tks->cs_ns_to_cyc_shift;
+ tks->cs_ns_to_cyc_mult = (u32)div_u64(1ULL << shift, tkrs->mult);
+ tks->cs_ns_to_cyc_maxns = div_u64(tkrs->clock->mask, tks->cs_ns_to_cyc_mult);
+}
+
/*
* Restore the shadow timekeeper from the real timekeeper.
*/
@@ -730,6 +805,7 @@ static void timekeeping_update_from_shadow(struct tk_data *tkd, unsigned int act
tk->tkr_mono.base_real = tk->tkr_mono.base + tk->offs_real;
if (tk->id == TIMEKEEPER_CORE) {
+ tk_update_ns_to_cyc(tk, &tkd->timekeeper);
update_vsyscall(tk);
update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);
@@ -784,6 +860,71 @@ static void timekeeping_forward_now(struct timekeeper *tk)
tk_update_coarse_nsecs(tk);
}
+/*
+ * ktime_expiry_to_cycles - Convert an expiry time to clocksource cycles
+ * @id: Clocksource ID which is required for validity
+ * @expires_ns: Absolute CLOCK_MONOTONIC expiry time (nsecs) to be converted
+ * @cycles: Pointer to storage for corresponding absolute cycles value
+ *
+ * Convert a CLOCK_MONOTONIC based absolute expiry time to a cycles value
+ * based on the correlated clocksource of the clockevent device by using
+ * the base nanoseconds and cycles values of the last timekeeper update and
+ * converting the delta between @expires_ns and base nanoseconds to cycles.
+ *
+ * This only works for clockevent devices which are using a less than or
+ * equal comparator against the clocksource.
+ *
+ * Utilizing this avoids two clocksource reads for such devices, the
+ * ktime_get() in clockevents_program_event() to calculate the delta expiry
+ * value and the readout in the device::set_next_event() callback to
+ * convert the delta back to an absolute comparator value.
+ *
+ * Returns: True if @id matches the current clocksource ID, false otherwise
+ */
+bool ktime_expiry_to_cycles(enum clocksource_ids id, ktime_t expires_ns, u64 *cycles)
+{
+ struct timekeeper *tk = &tk_core.timekeeper;
+ struct tk_read_base *tkrm = &tk->tkr_mono;
+ ktime_t base_ns, delta_ns, max_ns;
+ u64 base_cycles, delta_cycles;
+ unsigned int seq;
+ u32 mult, shift;
+
+ /*
+ * Racy check to avoid the seqcount overhead when ID does not match. If
+ * the relevant clocksource is installed concurrently, then this will
+ * just delay the switch over to this mechanism until the next event is
+ * programmed. If the ID is not matching the clock events code will use
+ * the regular relative set_next_event() callback as before.
+ */
+ if (data_race(tk->cs_id) != id)
+ return false;
+
+ do {
+ seq = read_seqcount_begin(&tk_core.seq);
+
+ if (tk->cs_id != id)
+ return false;
+
+ base_cycles = tkrm->cycle_last;
+ base_ns = tkrm->base + (tkrm->xtime_nsec >> tkrm->shift);
+
+ mult = tk->cs_ns_to_cyc_mult;
+ shift = tk->cs_ns_to_cyc_shift;
+ max_ns = tk->cs_ns_to_cyc_maxns;
+
+ } while (read_seqcount_retry(&tk_core.seq, seq));
+
+ /* Prevent negative deltas and multiplication overflows */
+ delta_ns = min(expires_ns - base_ns, max_ns);
+ delta_ns = max(delta_ns, 0);
+
+ /* Convert to cycles */
+ delta_cycles = ((u64)delta_ns * mult) >> shift;
+ *cycles = base_cycles + delta_cycles;
+ return true;
+}
+
/**
* ktime_get_real_ts64 - Returns the time of day in a timespec64.
* @ts: pointer to the timespec to be set
@@ -848,7 +989,7 @@ u32 ktime_get_resolution_ns(void)
}
EXPORT_SYMBOL_GPL(ktime_get_resolution_ns);
-static ktime_t *offsets[TK_OFFS_MAX] = {
+static const ktime_t *const offsets[TK_OFFS_MAX] = {
[TK_OFFS_REAL] = &tk_core.timekeeper.offs_real,
[TK_OFFS_BOOT] = &tk_core.timekeeper.offs_boot,
[TK_OFFS_TAI] = &tk_core.timekeeper.offs_tai,
@@ -857,8 +998,9 @@ static ktime_t *offsets[TK_OFFS_MAX] = {
ktime_t ktime_get_with_offset(enum tk_offsets offs)
{
struct timekeeper *tk = &tk_core.timekeeper;
+ const ktime_t *offset = offsets[offs];
unsigned int seq;
- ktime_t base, *offset = offsets[offs];
+ ktime_t base;
u64 nsecs;
WARN_ON(timekeeping_suspended);
@@ -878,8 +1020,9 @@ EXPORT_SYMBOL_GPL(ktime_get_with_offset);
ktime_t ktime_get_coarse_with_offset(enum tk_offsets offs)
{
struct timekeeper *tk = &tk_core.timekeeper;
- ktime_t base, *offset = offsets[offs];
+ const ktime_t *offset = offsets[offs];
unsigned int seq;
+ ktime_t base;
u64 nsecs;
WARN_ON(timekeeping_suspended);
@@ -902,7 +1045,7 @@ EXPORT_SYMBOL_GPL(ktime_get_coarse_with_offset);
*/
ktime_t ktime_mono_to_any(ktime_t tmono, enum tk_offsets offs)
{
- ktime_t *offset = offsets[offs];
+ const ktime_t *offset = offsets[offs];
unsigned int seq;
ktime_t tconv;
@@ -1631,7 +1774,19 @@ int timekeeping_notify(struct clocksource *clock)
if (tk->tkr_mono.clock == clock)
return 0;
+
+ /* Disable inlined reads across the clocksource switch */
+ clocksource_disable_inline_read();
+
stop_machine(change_clocksource, clock, NULL);
+
+ /*
+ * If the clocksource has been selected and supports inlined reads
+ * enable the branch.
+ */
+ if (tk->tkr_mono.clock == clock && clock->flags & CLOCK_SOURCE_CAN_INLINE_READ)
+ clocksource_enable_inline_read();
+
tick_clock_notify();
return tk->tkr_mono.clock == clock ? 0 : -1;
}
@@ -2834,7 +2989,7 @@ static void tk_aux_update_clocksource(void)
continue;
timekeeping_forward_now(tks);
- tk_setup_internals(tks, tk_core.timekeeper.tkr_mono.clock);
+ tk_setup_internals(tks, tk_core.timekeeper.tkr_raw.clock);
timekeeping_update_from_shadow(tkd, TK_UPDATE_ALL);
}
}
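The nanoseconds-to-cycles conversion described in the tk_update_ns_to_cyc() comment and used by ktime_expiry_to_cycles() can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel code: the struct and function names are hypothetical, the NS_TO_CYC shift is passed in by the caller instead of coming from clocks_calc_mult_shift(), and there is no seqcount protection.

```c
#include <stdint.h>

/* Hypothetical container for the inverse (ns -> cycles) conversion. */
struct ns_to_cyc_demo {
    uint32_t mult;
    uint32_t shift;
    int64_t  max_ns;   /* largest delta that is converted unclamped */
};

/*
 * Derive the ns -> cycles multiplier from a cycles -> ns mult/shift
 * pair: NS_TO_CYC_MULT = (1 << (CS_SHIFT + NS_TO_CYC_SHIFT)) / CS_MULT.
 * The caller must keep cs_shift + ns_to_cyc_shift below 64.
 */
static void ns_to_cyc_setup(struct ns_to_cyc_demo *c, uint32_t cs_mult,
                            uint32_t cs_shift, uint32_t ns_to_cyc_shift,
                            uint64_t cs_mask)
{
    c->shift = ns_to_cyc_shift;
    c->mult = (uint32_t)((1ULL << (cs_shift + ns_to_cyc_shift)) / cs_mult);
    c->max_ns = (int64_t)(cs_mask / c->mult);
}

/* Convert an absolute expiry (ns) into an absolute cycles value. */
static uint64_t expiry_to_cycles(const struct ns_to_cyc_demo *c,
                                 uint64_t base_cycles, int64_t base_ns,
                                 int64_t expires_ns)
{
    int64_t delta_ns = expires_ns - base_ns;

    /* Clamp to [0, max_ns] to avoid negative deltas and overflow. */
    if (delta_ns > c->max_ns)
        delta_ns = c->max_ns;
    if (delta_ns < 0)
        delta_ns = 0;

    return base_cycles + (((uint64_t)delta_ns * c->mult) >> c->shift);
}
```

For an assumed 2 GHz clocksource (cs_mult = 1 << 23, cs_shift = 24, i.e. 0.5 ns per cycle) and a shift of 10, the derived multiplier is 2048, so a delta of 1000 ns maps to 2000 cycles.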
diff --git a/kernel/time/timekeeping.h b/kernel/time/timekeeping.h
index 543beba096c7..198d0608db74 100644
--- a/kernel/time/timekeeping.h
+++ b/kernel/time/timekeeping.h
@@ -9,6 +9,8 @@ extern ktime_t ktime_get_update_offsets_now(unsigned int *cwsseq,
ktime_t *offs_boot,
ktime_t *offs_tai);
+bool ktime_expiry_to_cycles(enum clocksource_ids id, ktime_t expires_ns, u64 *cycles);
+
extern int timekeeping_valid_for_hres(void);
extern u64 timekeeping_max_deferment(void);
extern void timekeeping_warp_clock(void);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 7e1e3bde6b8b..04d928c21aba 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -2319,6 +2319,7 @@ u64 timer_base_try_to_set_idle(unsigned long basej, u64 basem, bool *idle)
*/
void timer_clear_idle(void)
{
+ int this_cpu = smp_processor_id();
/*
* We do this unlocked. The worst outcome is a remote pinned timer
* enqueue sending a pointless IPI, but taking the lock would just
@@ -2327,9 +2328,9 @@ void timer_clear_idle(void)
* path. Required for BASE_LOCAL only.
*/
__this_cpu_write(timer_bases[BASE_LOCAL].is_idle, false);
- if (tick_nohz_full_cpu(smp_processor_id()))
+ if (tick_nohz_full_cpu(this_cpu))
__this_cpu_write(timer_bases[BASE_GLOBAL].is_idle, false);
- trace_timer_base_idle(false, smp_processor_id());
+ trace_timer_base_idle(false, this_cpu);
/* Activate without holding the timer_base->lock */
tmigr_cpu_activate();
diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index 488e47e96e93..427d7ddea3af 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -47,7 +47,7 @@ print_timer(struct seq_file *m, struct hrtimer *taddr, struct hrtimer *timer,
int idx, u64 now)
{
SEQ_printf(m, " #%d: <%p>, %ps", idx, taddr, ACCESS_PRIVATE(timer, function));
- SEQ_printf(m, ", S:%02x", timer->state);
+ SEQ_printf(m, ", S:%02x", timer->is_queued);
SEQ_printf(m, "\n");
SEQ_printf(m, " # expires at %Lu-%Lu nsecs [in %Ld to %Ld nsecs]\n",
(unsigned long long)ktime_to_ns(hrtimer_get_softexpires(timer)),
@@ -56,13 +56,11 @@ print_timer(struct seq_file *m, struct hrtimer *taddr, struct hrtimer *timer,
(long long)(ktime_to_ns(hrtimer_get_expires(timer)) - now));
}
-static void
-print_active_timers(struct seq_file *m, struct hrtimer_clock_base *base,
- u64 now)
+static void print_active_timers(struct seq_file *m, struct hrtimer_clock_base *base, u64 now)
{
+ struct timerqueue_linked_node *curr;
struct hrtimer *timer, tmp;
unsigned long next = 0, i;
- struct timerqueue_node *curr;
unsigned long flags;
next_one:
@@ -72,13 +70,13 @@ print_active_timers(struct seq_file *m, struct hrtimer_clock_base *base,
raw_spin_lock_irqsave(&base->cpu_base->lock, flags);
- curr = timerqueue_getnext(&base->active);
+ curr = timerqueue_linked_first(&base->active);
/*
* Crude but we have to do this O(N*N) thing, because
* we have to unlock the base when printing:
*/
while (curr && i < next) {
- curr = timerqueue_iterate_next(curr);
+ curr = timerqueue_linked_next(curr);
i++;
}
@@ -103,8 +101,8 @@ print_base(struct seq_file *m, struct hrtimer_clock_base *base, u64 now)
SEQ_printf(m, " .resolution: %u nsecs\n", hrtimer_resolution);
#ifdef CONFIG_HIGH_RES_TIMERS
- SEQ_printf(m, " .offset: %Lu nsecs\n",
- (unsigned long long) ktime_to_ns(base->offset));
+ SEQ_printf(m, " .offset: %Ld nsecs\n",
+ (long long) base->offset);
#endif
SEQ_printf(m, "active timers:\n");
print_active_timers(m, base, now + ktime_to_ns(base->offset));
diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
index 8bb95b2a6fcf..39ac4eba0702 100644
--- a/kernel/trace/trace_events_synth.c
+++ b/kernel/trace/trace_events_synth.c
@@ -395,7 +395,7 @@ static enum print_line_t print_synth_event(struct trace_iterator *iter,
n_u64++;
} else {
struct trace_print_flags __flags[] = {
- __def_gfpflag_names, {-1, NULL} };
+ __def_gfpflag_names };
char *space = (i == se->n_fields - 1 ? "" : " ");
print_synth_event_num_val(s, print_fmt,
@@ -408,7 +408,7 @@ static enum print_line_t print_synth_event(struct trace_iterator *iter,
trace_seq_puts(s, " (");
trace_print_flags_seq(s, "|",
entry->fields[n_u64].as_u64,
- __flags);
+ __flags, ARRAY_SIZE(__flags));
trace_seq_putc(s, ')');
}
n_u64++;
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 1996d7aba038..96e2d22b4364 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -69,14 +69,15 @@ enum print_line_t trace_print_printk_msg_only(struct trace_iterator *iter)
const char *
trace_print_flags_seq(struct trace_seq *p, const char *delim,
unsigned long flags,
- const struct trace_print_flags *flag_array)
+ const struct trace_print_flags *flag_array,
+ size_t flag_array_size)
{
unsigned long mask;
const char *str;
const char *ret = trace_seq_buffer_ptr(p);
int i, first = 1;
- for (i = 0; flag_array[i].name && flags; i++) {
+ for (i = 0; i < flag_array_size && flags; i++) {
mask = flag_array[i].mask;
if ((flags & mask) != mask)
@@ -106,12 +107,13 @@ EXPORT_SYMBOL(trace_print_flags_seq);
const char *
trace_print_symbols_seq(struct trace_seq *p, unsigned long val,
- const struct trace_print_flags *symbol_array)
+ const struct trace_print_flags *symbol_array,
+ size_t symbol_array_size)
{
int i;
const char *ret = trace_seq_buffer_ptr(p);
- for (i = 0; symbol_array[i].name; i++) {
+ for (i = 0; i < symbol_array_size; i++) {
if (val != symbol_array[i].mask)
continue;
@@ -133,14 +135,15 @@ EXPORT_SYMBOL(trace_print_symbols_seq);
const char *
trace_print_flags_seq_u64(struct trace_seq *p, const char *delim,
unsigned long long flags,
- const struct trace_print_flags_u64 *flag_array)
+ const struct trace_print_flags_u64 *flag_array,
+ size_t flag_array_size)
{
unsigned long long mask;
const char *str;
const char *ret = trace_seq_buffer_ptr(p);
int i, first = 1;
- for (i = 0; flag_array[i].name && flags; i++) {
+ for (i = 0; i < flag_array_size && flags; i++) {
mask = flag_array[i].mask;
if ((flags & mask) != mask)
@@ -170,12 +173,13 @@ EXPORT_SYMBOL(trace_print_flags_seq_u64);
const char *
trace_print_symbols_seq_u64(struct trace_seq *p, unsigned long long val,
- const struct trace_print_flags_u64 *symbol_array)
+ const struct trace_print_flags_u64 *symbol_array,
+ size_t symbol_array_size)
{
int i;
const char *ret = trace_seq_buffer_ptr(p);
- for (i = 0; symbol_array[i].name; i++) {
+ for (i = 0; i < symbol_array_size; i++) {
if (val != symbol_array[i].mask)
continue;
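The change above replaces sentinel-terminated flag tables with an explicit element count. A user-space sketch of the same loop shape is shown below; the struct, macro and function names are hypothetical stand-ins, output goes to a plain buffer instead of a trace_seq, and the printing of leftover unnamed bits is omitted.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct trace_print_flags. */
struct flag_name_demo {
    unsigned long mask;
    const char *name;
};

#define ARRAY_SIZE_DEMO(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Append the names of all set flags to buf, separated by delim.
 * The table length is passed explicitly, so no { -1, NULL }
 * terminator entry is needed. Returns the number of bytes written.
 */
static size_t print_flags_demo(char *buf, size_t len, const char *delim,
                               unsigned long flags,
                               const struct flag_name_demo *arr, size_t n)
{
    size_t used = 0;
    int first = 1;

    if (len)
        buf[0] = '\0';

    for (size_t i = 0; i < n && flags; i++) {
        unsigned long mask = arr[i].mask;
        int w;

        if ((flags & mask) != mask)
            continue;
        flags &= ~mask;

        w = snprintf(buf + used, len - used, "%s%s",
                     first ? "" : delim, arr[i].name);
        if (w < 0 || (size_t)w >= len - used)
            break;   /* output would be truncated, stop here */
        used += (size_t)w;
        first = 0;
    }
    return used;
}
```

A caller passes ARRAY_SIZE_DEMO(table) as the last argument, mirroring the ARRAY_SIZE(__flags) arguments added in the hunks above.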
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 37317b81fcda..8ad72e17d8eb 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -174,7 +174,6 @@ sys_enter_openat_print(struct syscall_trace_enter *trace, struct syscall_metadat
{ O_NOFOLLOW, "O_NOFOLLOW" },
{ O_NOATIME, "O_NOATIME" },
{ O_CLOEXEC, "O_CLOEXEC" },
- { -1, NULL }
};
trace_seq_printf(s, "%s(", entry->name);
@@ -205,7 +204,7 @@ sys_enter_openat_print(struct syscall_trace_enter *trace, struct syscall_metadat
trace_seq_puts(s, "O_RDONLY|");
}
- trace_print_flags_seq(s, "|", bits, __flags);
+ trace_print_flags_seq(s, "|", bits, __flags, ARRAY_SIZE(__flags));
/*
* trace_print_flags_seq() adds a '\0' to the
* buffer, but this needs to append more to the seq.
diff --git a/lib/rbtree.c b/lib/rbtree.c
index 18d42bcf4ec9..5790d6ecba4e 100644
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -446,6 +446,23 @@ void rb_erase(struct rb_node *node, struct rb_root *root)
}
EXPORT_SYMBOL(rb_erase);
+bool rb_erase_linked(struct rb_node_linked *node, struct rb_root_linked *root)
+{
+ if (node->prev)
+ node->prev->next = node->next;
+ else
+ root->rb_leftmost = node->next;
+
+ if (node->next)
+ node->next->prev = node->prev;
+
+ rb_erase(&node->node, &root->rb_root);
+ RB_CLEAR_LINKED_NODE(node);
+
+ return !!root->rb_leftmost;
+}
+EXPORT_SYMBOL_GPL(rb_erase_linked);
+
/*
* Augmented rbtree manipulation functions.
*
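The neighbor-link bookkeeping in rb_erase_linked() can be looked at in isolation. The sketch below models only the prev/next links and the leftmost pointer; the rb_erase() call and the real struct layouts are left out, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node carrying only the neighbor links. */
struct linked_node_demo {
    struct linked_node_demo *prev;
    struct linked_node_demo *next;
    long key;
};

struct linked_root_demo {
    struct linked_node_demo *leftmost;
};

/*
 * Unlink 'node' from its neighbors and update the leftmost pointer.
 * Returns true if the structure still contains nodes afterwards.
 */
static bool erase_linked_demo(struct linked_node_demo *node,
                              struct linked_root_demo *root)
{
    if (node->prev)
        node->prev->next = node->next;
    else
        root->leftmost = node->next;   /* node was the first entry */

    if (node->next)
        node->next->prev = node->prev;

    node->prev = NULL;
    node->next = NULL;

    return root->leftmost != NULL;
}
```

Because the successor is reachable through node->next, the new first entry is known in constant time when the leftmost node is removed, without walking the tree.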
diff --git a/lib/timerqueue.c b/lib/timerqueue.c
index cdb9c7658478..e2a1e08cb4bd 100644
--- a/lib/timerqueue.c
+++ b/lib/timerqueue.c
@@ -82,3 +82,17 @@ struct timerqueue_node *timerqueue_iterate_next(struct timerqueue_node *node)
return container_of(next, struct timerqueue_node, node);
}
EXPORT_SYMBOL_GPL(timerqueue_iterate_next);
+
+#define __node_2_tq_linked(_n) \
+ container_of(rb_entry((_n), struct rb_node_linked, node), struct timerqueue_linked_node, node)
+
+static __always_inline bool __tq_linked_less(struct rb_node *a, const struct rb_node *b)
+{
+ return __node_2_tq_linked(a)->expires < __node_2_tq_linked(b)->expires;
+}
+
+bool timerqueue_linked_add(struct timerqueue_linked_head *head, struct timerqueue_linked_node *node)
+{
+ return rb_add_linked(&node->node, &head->rb_root, __tq_linked_less);
+}
+EXPORT_SYMBOL_GPL(timerqueue_linked_add);
diff --git a/scripts/gdb/linux/timerlist.py b/scripts/gdb/linux/timerlist.py
index ccc24d30de80..9fb3436a217c 100644
--- a/scripts/gdb/linux/timerlist.py
+++ b/scripts/gdb/linux/timerlist.py
@@ -20,7 +20,7 @@ def ktime_get():
We can't read the hardware timer itself to add any nanoseconds
that need to be added since we last stored the time in the
timekeeper. But this is probably good enough for debug purposes."""
- tk_core = gdb.parse_and_eval("&tk_core")
+ tk_core = gdb.parse_and_eval("&timekeeper_data[TIMEKEEPER_CORE]")
return tk_core['timekeeper']['tkr_mono']['base']
On Sun, Apr 12, 2026 at 07:46:25PM -0800, Thomas Gleixner wrote:
> Linus,
>
> please pull the latest timers/core branch from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers-core-2026-04-12
>
> up to: ff1c0c5d0702: Merge branch 'timers/urgent' into timers/core
>
> Updates for the timer/timekeeping core:
Looks like it breaks the boot for me.
It hangs at
[ 1.036914] cfg80211: failed to load regulatory.db
[ 1.037396] clk: Disabling unused clocks
if I wait few minutes and then Ctrl-C then I see:
^C[ 190.319574] ata6: SATA link down (SStatus 0 SControl 300)
[ 190.320442] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 190.320869] ata3.00: ATAPI: QEMU DVD-ROM, 2.5+, max UDMA/100
[ 190.321168] ata3.00: applying bridge limits
but it still hangs later.
Reverting the whole pull fixes the issue.
The boot is standard
qemu-system-x86_64 --enable-kvm -machine q35 -smp 8 -s -cpu host
On Tue, 14 Apr 2026 at 21:10, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> Looks like it breaks the boot for me.
I can confirm.
Trying to narrow it down now.
Linus
On Tue, Apr 14, 2026 at 9:38 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 14 Apr 2026 at 21:10, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > Looks like it breaks the boot for me.
>
> I can confirm.
>
> Trying to narrow it down now.
Instead of full revert. the following helped:
config HRTIMER_REARM_DEFERRED
- def_bool y
- depends on GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
- depends on HIGH_RES_TIMERS && SCHED_HRTICK
+ bool
+ default n
On Tue, Apr 14, 2026 at 9:40 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Apr 14, 2026 at 9:38 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Tue, 14 Apr 2026 at 21:10, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > Looks like it breaks the boot for me.
> >
> > I can confirm.
> >
> > Trying to narrow it down now.
>
> Instead of full revert. the following helped:
>
> config HRTIMER_REARM_DEFERRED
> - def_bool y
> - depends on GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
> - depends on HIGH_RES_TIMERS && SCHED_HRTICK
> + bool
> + default n
and here is a fix mainly by claude with a lot of nudging
from my side.
Fixes the boot for me and looks correct.
From 3b57e2477d1e0c74e91c6ce7e7b67a2f63da2a83 Mon Sep 17 00:00:00 2001
From: Alexei Starovoitov <ast@kernel.org>
Date: Tue, 14 Apr 2026 21:42:04 -0700
Subject: [PATCH] hrtimer: Rearm deferred hrtimer on kernel interrupt return
path
0e98eb14814e ("entry: Prepare for deferred hrtimer rearming") added
hrtimer_rearm_deferred() to irqentry_exit_to_kernel_mode(). Then
041aa7a85390 ("entry: Split preemption from
irqentry_exit_to_kernel_mode()") split that function into
irqentry_exit_to_kernel_mode_preempt() and
irqentry_exit_to_kernel_mode_after_preempt(). When the two were
merged in c43267e6794a, hrtimer_rearm_deferred() ended up after the
regs_irqs_disabled() early return in the _preempt() path.
When the system is executing only in kernel mode (e.g. during boot),
hrtimer interrupts return to kernel context exclusively. Without the
rearm call on this path, the clock event device is never reprogrammed
after the first hrtimer interrupt and the tick dies.
On KVM guests this happens reliably because virtual interrupt
injection timing causes the hrtimer interrupt to frequently arrive
while the interrupted kernel code has IRQs disabled.
Move hrtimer_rearm_deferred() before the early return.
Fixes: c43267e6794a ("Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/irq-entry-common.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 7ab41eec549f..08f671cbd1a5 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -469,13 +469,13 @@ static __always_inline irqentry_state_t irqentry_enter_from_kernel_mode(struct p
static inline void irqentry_exit_to_kernel_mode_preempt(struct pt_regs *regs,
irqentry_state_t state)
{
+ hrtimer_rearm_deferred();
+
if (regs_irqs_disabled(regs) || state.exit_rcu)
return;
if (IS_ENABLED(CONFIG_PREEMPTION))
irqentry_exit_cond_resched();
-
- hrtimer_rearm_deferred();
}
/**
--
2.52.0
On Tue, 14 Apr 2026 at 22:01, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> and here is a fix mainly by claude with a lot of nudging
> from my side.
>
> Fixes the boot for me and looks correct.
So I think Claude isn't quite right in the explanation. Your patch
works, but I think it causes the double arming when preempting that
PeterZ tried to avoid.
Here's what I think I'll actually apply, generated literally as the
difference between my original merge and the final one that was
influenced by the state of linux-next.
And it's not the "irqs disabled" test that I think causes problems,
it's the "state.exit_rcu" one.
We should indeed only re-arm the deferred hrtimer if interrupts are
enabled, and that's what the code did when it was in
kernel/entry/common.c irqentry_exit().
But it should be re-armed regardless of that state.exit_rcu thing.
It would be lovely if you can still verify that yes, this version also
fixes things for you?
Linus
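The three orderings discussed in this thread differ only in when hrtimer_rearm_deferred() is reached. The truth-table model below is a sketch for comparing them and is not kernel code: the enum and function names are hypothetical, and the third variant is a reading of the description in the mail above, since the applied patch itself is not quoted here.

```c
#include <stdbool.h>

/* Hypothetical labels for the three orderings under discussion. */
enum exit_variant_demo {
    VARIANT_MERGED,       /* rearm after the combined early return */
    VARIANT_REARM_FIRST,  /* rearm moved before the early return */
    VARIANT_DESCRIBED,    /* rearm gated on IRQ state only (assumed) */
};

/*
 * Return whether the deferred rearm would be reached on the
 * interrupt-return-to-kernel path for the given state.
 */
static bool rearm_called_demo(enum exit_variant_demo v,
                              bool irqs_disabled, bool exit_rcu)
{
    switch (v) {
    case VARIANT_MERGED:
        /* Early return on either condition skips the rearm. */
        return !(irqs_disabled || exit_rcu);
    case VARIANT_REARM_FIRST:
        /* Rearm happens unconditionally before any check. */
        return true;
    case VARIANT_DESCRIBED:
        /* Rearm only with IRQs enabled, regardless of exit_rcu. */
        return !irqs_disabled;
    }
    return false;
}
```

The case that separates the merged code from the described fix is irqs_disabled == false with exit_rcu == true: the merged ordering skips the rearm there, while the described one performs it.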
On Tue, Apr 14, 2026 at 10:36 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It would be lovely if you can still verify that yes, this version also
> fixes things for you?
Yep. Works.
Tested-by: Alexei Starovoitov <ast@kernel.org>
On Tue, 14 Apr 2026 at 22:46, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> Yep. Works.
> Tested-by: Alexei Starovoitov <ast@kernel.org>
Thanks. Fix pushed out,
Linus
On Tue, 14 Apr 2026 at 22:01, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> and here is a fix mainly by claude with a lot of nudging
> from my side.
>
> Fixes the boot for me and looks correct.
Hah. This was something I had initially done differently.
But then looked at linux-next, where Thomas had done what appeared to
be a smarter resolution than the straightforward one:
https://lore.kernel.org/all/CAHk-=wg8+BER4VyFKG3rnPi2gXxbf-jbHS=EU+xhFqGVQfbutw@mail.gmail.com/
and I picked that "smarter" one. That seems to have been a mistake.
Apparently nobody actually runs linux-next. I knew it didn't get a lot
of testing, but apparently it's more like "no testing at all" than
"not a lot".
Oh well.
Linus
On Tue, Apr 14, 2026 at 10:16:29PM -0700, Linus Torvalds wrote:
> Apparently nobody actually runs linux-next. I knew it didn't get a lot
> of testing, but apparently it's more like "no testing at all" than
> "not a lot".
Looking at my own results this doesn't seem to have caused obvious
explosions in my test lab for whatever reason, and the KUnit tests that
run as part of the -next merge didn't notice anything here.
It does look like this will have overlapped with the kselftest issues
which have been messing up a bunch of the CI systems in ways that fell
between the cracks of build and runtime testing:
https://lore.kernel.org/linux-kselftest/20260320-selftests-fixes-v1-0-79144f76be01@suse.com/T/#mecec7793b0e4c8f316a32600f48835479cf056f3
which severely impacted the coverage from several of the CI systems and
took an unfortunately long time to get addressed as a result - runtime
people were seeing it as a build or infra issue, but build people
weren't noticing since anyone specifically build testing the selftests
would've set things up so they wouldn't have seen the issue.
The pull request you sent on Sun, 12 Apr 2026 19:46:25 +0200:
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers-core-2026-04-12
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/c1fe867b5bf9c57ab7856486d342720e2b205eed
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
Linus,
please pull the latest core/entry branch from:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core-entry-2026-04-12
up to: c291cfac49a6: entry: Add missing kernel-doc for arch_ptrace_report_syscall functions
A trivial update for the entry code adding missing kernel documentation for
function arguments.
Thanks,
tglx
------------------>
Kit Dallege (1):
entry: Add missing kernel-doc for arch_ptrace_report_syscall functions
include/linux/entry-common.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index f83ca0abf2cd..d223246401bc 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -48,6 +48,7 @@
/**
* arch_ptrace_report_syscall_entry - Architecture specific ptrace_report_syscall_entry() wrapper
+ * @regs: Pointer to the register state at syscall entry
*
* Invoked from syscall_trace_enter() to wrap ptrace_report_syscall_entry().
*
@@ -205,6 +206,8 @@ static __always_inline bool report_single_step(unsigned long work)
/**
* arch_ptrace_report_syscall_exit - Architecture specific ptrace_report_syscall_exit()
+ * @regs: Pointer to the register state at syscall exit
+ * @step: Indicates a single-step exit rather than a normal syscall exit
*
* This allows architecture specific ptrace_report_syscall_exit()
* implementations. If not defined by the architecture this falls back to
The pull request you sent on Sun, 12 Apr 2026 19:46:01 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core-entry-2026-04-12

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/15a1bccddccba6cab63fec1345fbd24102d9e0b8

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
Linus,
please pull the latest irq/core branch from:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq-core-2026-04-12
up to: e8be82c2d77e: Drivers: hv: Move add_interrupt_randomness() to hypervisor callback sysvec
Update for the core interrupt subsystem:
- Invoke add_interrupt_randomness() in handle_percpu_devid_irq() and
clean up the workaround in the Hyper-V driver, which would now invoke
it twice on ARM64. Removing it from the driver requires adding it to
the x86 system vector entry point.
- Remove the pointless cpus_read_lock() around reading the CPU possible
mask, which is read-only after init.
- Add documentation for the interaction between device tree bindings and
the interrupt type defines in irq.h.
- Delete stale defines in the matrix allocator and the equivalent in
loongarch.
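The documentation item above rests on the C enum values and the DT binding macros being numerically identical. A minimal userspace sketch of that invariant follows. The DEMO_ and DT_DEMO_ names are hypothetical stand-ins chosen so that both sets can coexist in one file, and the values are reproduced here as an assumption rather than copied from this pull.

```c
#include <assert.h>

/* Hypothetical mirror of the first six trigger types in the C enum. */
enum demo_irq_type {
	DEMO_IRQ_TYPE_NONE		= 0x00000000,
	DEMO_IRQ_TYPE_EDGE_RISING	= 0x00000001,
	DEMO_IRQ_TYPE_EDGE_FALLING	= 0x00000002,
	DEMO_IRQ_TYPE_EDGE_BOTH		= (DEMO_IRQ_TYPE_EDGE_FALLING |
					   DEMO_IRQ_TYPE_EDGE_RISING),
	DEMO_IRQ_TYPE_LEVEL_HIGH	= 0x00000004,
	DEMO_IRQ_TYPE_LEVEL_LOW		= 0x00000008,
};

/* Hypothetical mirror of the preprocessor defines used by device trees. */
#define DT_DEMO_IRQ_TYPE_NONE		0
#define DT_DEMO_IRQ_TYPE_EDGE_RISING	1
#define DT_DEMO_IRQ_TYPE_EDGE_FALLING	2
#define DT_DEMO_IRQ_TYPE_EDGE_BOTH	3
#define DT_DEMO_IRQ_TYPE_LEVEL_HIGH	4
#define DT_DEMO_IRQ_TYPE_LEVEL_LOW	8

/* Compile-time checks that both views agree, so shadowing is harmless. */
static_assert(DEMO_IRQ_TYPE_NONE == DT_DEMO_IRQ_TYPE_NONE, "NONE differs");
static_assert(DEMO_IRQ_TYPE_EDGE_RISING == DT_DEMO_IRQ_TYPE_EDGE_RISING, "RISING differs");
static_assert(DEMO_IRQ_TYPE_EDGE_FALLING == DT_DEMO_IRQ_TYPE_EDGE_FALLING, "FALLING differs");
static_assert(DEMO_IRQ_TYPE_EDGE_BOTH == DT_DEMO_IRQ_TYPE_EDGE_BOTH, "BOTH differs");
static_assert(DEMO_IRQ_TYPE_LEVEL_HIGH == DT_DEMO_IRQ_TYPE_LEVEL_HIGH, "HIGH differs");
static_assert(DEMO_IRQ_TYPE_LEVEL_LOW == DT_DEMO_IRQ_TYPE_LEVEL_LOW, "LOW differs");

/* A DT cell can therefore be used as the C-side type without translation. */
static enum demo_irq_type demo_type_from_dt_cell(unsigned int cell)
{
	return (enum demo_irq_type)cell;
}
```

If one of the two definitions ever drifted, the build would fail at the matching static_assert() instead of misbehaving at run time.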
Thanks,
tglx
------------------>
Geert Uytterhoeven (1):
genirq: Document interaction between <linux/irq.h> and DT binding defines
Michael Kelley (2):
genirq/chip: Invoke add_interrupt_randomness() in handle_percpu_devid_irq()
Drivers: hv: Move add_interrupt_randomness() to hypervisor callback sysvec
Nam Cao (1):
genirq/matrix, LoongArch: Delete IRQ_MATRIX_BITS leftovers
Sebastian Andrzej Siewior (1):
genirq/affinity: Remove cpus_read_lock() while reading cpu_possible_mask
arch/loongarch/include/asm/irq.h | 1 -
arch/x86/kernel/cpu/mshyperv.c | 2 ++
drivers/hv/mshv_synic.c | 3 ---
drivers/hv/vmbus_drv.c | 3 ---
include/linux/irq.h | 4 ++++
kernel/irq/affinity.c | 7 ++-----
kernel/irq/chip.c | 3 +++
kernel/irq/matrix.c | 2 +-
8 files changed, 12 insertions(+), 13 deletions(-)
diff --git a/arch/loongarch/include/asm/irq.h b/arch/loongarch/include/asm/irq.h
index 3943647503a9..537add26daf4 100644
--- a/arch/loongarch/include/asm/irq.h
+++ b/arch/loongarch/include/asm/irq.h
@@ -48,7 +48,6 @@ void spurious_interrupt(void);
*/
#define NR_VECTORS 256
#define NR_LEGACY_VECTORS 16
-#define IRQ_MATRIX_BITS NR_VECTORS
#define AVEC_IRQ_SHIFT 4
#define AVEC_IRQ_BIT 8
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 9befdc557d9e..a7dfc29d3470 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -161,6 +161,8 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
if (vmbus_handler)
vmbus_handler();
+ add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR);
+
if (ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED)
apic_eoi();
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 43f1bcbbf2d3..e2288a726fec 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -12,7 +12,6 @@
#include <linux/mm.h>
#include <linux/interrupt.h>
#include <linux/io.h>
-#include <linux/random.h>
#include <linux/cpuhotplug.h>
#include <linux/reboot.h>
#include <asm/mshyperv.h>
@@ -445,8 +444,6 @@ void mshv_isr(void)
mb();
if (msg->header.message_flags.msg_pending)
hv_set_non_nested_msr(HV_MSR_EOM, 0);
-
- add_interrupt_randomness(mshv_sint_vector);
} else {
pr_warn_once("%s: unknown message type 0x%x\n", __func__,
msg->header.message_type);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index bc4fc1951ae1..e7ac79e2fb49 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -32,7 +32,6 @@
#include <linux/ptrace.h>
#include <linux/sysfb.h>
#include <linux/efi.h>
-#include <linux/random.h>
#include <linux/kernel.h>
#include <linux/syscore_ops.h>
#include <linux/dma-map-ops.h>
@@ -1361,8 +1360,6 @@ static void __vmbus_isr(void)
vmbus_message_sched(hv_cpu, hv_cpu->hyp_synic_message_page);
vmbus_message_sched(hv_cpu, hv_cpu->para_synic_message_page);
-
- add_interrupt_randomness(vmbus_interrupt);
}
static DEFINE_PER_CPU(bool, vmbus_irq_pending);
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 951acbdb9f84..efa514ee562f 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -35,6 +35,10 @@ enum irqchip_irq_state;
*
* Bits 0-7 are the same as the IRQF_* bits in linux/interrupt.h
*
+ * Note that the first 6 definitions are shadowed by C preprocessor definitions
+ * in include/dt-bindings/interrupt-controller/irq.h. This is not an issue, as
+ * the actual values must be the same, due to being part of the stable DT ABI.
+ *
* IRQ_TYPE_NONE - default, unspecified type
* IRQ_TYPE_EDGE_RISING - rising edge triggered
* IRQ_TYPE_EDGE_FALLING - falling edge triggered
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 85c45cfe7223..78f2418a8925 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -115,13 +115,10 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, unsigned int maxvec,
if (resv > minvec)
return 0;
- if (affd->calc_sets) {
+ if (affd->calc_sets)
set_vecs = maxvec - resv;
- } else {
- cpus_read_lock();
+ else
set_vecs = cpumask_weight(cpu_possible_mask);
- cpus_read_unlock();
- }
return resv + min(set_vecs, maxvec - resv);
}
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 6147a07d0127..6c9b1dc4e7d4 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -14,6 +14,7 @@
#include <linux/interrupt.h>
#include <linux/kernel_stat.h>
#include <linux/irqdomain.h>
+#include <linux/random.h>
#include <trace/events/irq.h>
@@ -929,6 +930,8 @@ void handle_percpu_devid_irq(struct irq_desc *desc)
enabled ? " and unmasked" : "", irq, cpu);
}
+ add_interrupt_randomness(irq);
+
if (chip->irq_eoi)
chip->irq_eoi(&desc->irq_data);
}
diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 0f79a4abea05..faafb43a4e61 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -39,7 +39,7 @@ struct irq_matrix {
/**
* irq_alloc_matrix - Allocate a irq_matrix structure and initialize it
- * @matrix_bits: Number of matrix bits must be <= IRQ_MATRIX_BITS
+ * @matrix_bits: Number of matrix bits
* @alloc_start: From which bit the allocation search starts
* @alloc_end: At which bit the allocation search ends, i.e first
* invalid bit
The pull request you sent on Sun, 12 Apr 2026 19:46:05 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq-core-2026-04-12

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/db23954eeaf23464669043ddbb38a64f7b301ebd

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
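The kernel/irq/affinity.c hunk in the irq/core pull request above boils down to a small piece of arithmetic, which can be sketched in userspace as follows. All names are hypothetical. The "possible" parameter stands in for cpumask_weight(cpu_possible_mask), and "resv" for the reserved vector count that the hunk compares against minvec.

```c
#include <assert.h>

static unsigned int demo_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical userspace rendering of the calculation in the hunk.
 *
 * resv:          vectors reserved outside the spread set
 * possible:      number of possible CPUs, a constant after init
 * has_calc_sets: nonzero when the caller supplies its own set calculation
 */
static unsigned int demo_calc_affinity_vectors(unsigned int minvec,
					       unsigned int maxvec,
					       unsigned int resv,
					       unsigned int possible,
					       int has_calc_sets)
{
	unsigned int set_vecs;

	assert(minvec <= maxvec);

	/* Not enough room for even the reserved vectors. */
	if (resv > minvec)
		return 0;

	/* No locking around "possible": it cannot change after init. */
	set_vecs = has_calc_sets ? maxvec - resv : possible;

	return resv + demo_min(set_vecs, maxvec - resv);
}
```

For example, with minvec 2, maxvec 8, two reserved vectors and four possible CPUs, the sketch yields 2 + min(4, 6) = 6 vectors.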
Linus,
please pull the latest irq/msi branch from:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq-msi-2026-04-12
up to: aa80869b77e1: irqchip/msi-lib: Refuse initialization when irq_write_msi_msg() is missing
A small update for the MSI interrupt library to check for callers which
fail to provide the mandatory irq_write_msi_msg() callback, which prevents
a NULL pointer dereference later.
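The idea of refusing initialization when a mandatory callback is missing can be sketched in plain userspace C. Everything below is hypothetical (struct layout, names, the warn-once helper) and only illustrates the pattern. In particular, the kernel's WARN_ON_ONCE() tracks each call site separately, whereas this sketch uses a single flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct demo_msi_msg {
	unsigned int address_lo;
	unsigned int data;
};

/* Hypothetical chip descriptor with one mandatory callback. */
struct demo_irq_chip {
	const char *name;
	void (*write_msi_msg)(const struct demo_msi_msg *msg);
};

/* Print a warning the first time cond is true; always return cond. */
static bool demo_warn_on_once(bool cond, const char *what)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Example callback so that a valid chip can be constructed. */
static void demo_write_noop(const struct demo_msi_msg *msg)
{
	(void)msg;
}

/* Fail early at init time instead of crashing on a NULL call later. */
static bool demo_init_msi_info(const struct demo_irq_chip *chip)
{
	assert(chip != NULL);

	if (demo_warn_on_once(!chip->write_msi_msg,
			      "mandatory write_msi_msg callback missing"))
		return false;

	return true;
}
```

A chip set up with demo_write_noop passes the check, while one that leaves the pointer NULL is rejected before anything can call through it.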
Thanks,
tglx
------------------>
Thomas Gleixner (1):
irqchip/msi-lib: Refuse initialization when irq_write_msi_msg() is missing
drivers/irqchip/irq-msi-lib.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/irqchip/irq-msi-lib.c b/drivers/irqchip/irq-msi-lib.c
index d5eefc3d7215..45e0ed3134ce 100644
--- a/drivers/irqchip/irq-msi-lib.c
+++ b/drivers/irqchip/irq-msi-lib.c
@@ -48,6 +48,9 @@ bool msi_lib_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
return false;
}
+ if (WARN_ON_ONCE(!chip->irq_write_msi_msg))
+ return false;
+
required_flags = pops->required_flags;
/* Is the target domain bus token supported? */
The pull request you sent on Sun, 12 Apr 2026 19:46:15 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq-msi-2026-04-12

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/1d5e40351e7d521d7d143447d57315b6eb1e1160

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
Linus,
please pull the latest smp/core branch from:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp-core-2026-04-12
up to: 7eb28030f641: smp: Use system_percpu_wq instead of system_wq
Updates for the SMP core code:
- Switch smp_call_on_cpu() to use system_percpu_wq instead of system_wq
as part of the ongoing workqueue restructuring
- Improve the CSD-lock diagnostics for smp_call_function_single() to
provide better debug mechanisms on weakly ordered systems.
- Cache the current CPU number once in smp_call_function*() instead of
retrieving it over and over.
- Add missing kernel-doc comments all over the place
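The "cache the current CPU number once" item can be illustrated with a small userspace sketch. It assumes the caller cannot migrate between CPUs while the loop runs, which is what makes a single read sufficient. demo_processor_id() is a hypothetical stand-in for smp_processor_id() that counts its invocations, and the request structure is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4

/* Counts how often the (hypothetical) CPU id lookup is performed. */
static unsigned int demo_cpu_id_reads;

/* Hypothetical stand-in for smp_processor_id(); pretends we run on CPU 1. */
static int demo_processor_id(void)
{
	demo_cpu_id_reads++;
	return 1;
}

struct demo_csd {
	int src;	/* CPU that queued the request */
	int dst;	/* CPU that should run it */
};

/* Fill one request per remote CPU, reading the local CPU id only once. */
static size_t demo_queue_requests(struct demo_csd *csd, size_t n)
{
	const int this_cpu = demo_processor_id();	/* read once, up front */
	size_t queued = 0;

	assert(csd != NULL);

	for (int cpu = 0; cpu < DEMO_NR_CPUS && queued < n; cpu++) {
		if (cpu == this_cpu)
			continue;
		csd[queued].src = this_cpu;	/* reuse the cached value */
		csd[queued].dst = cpu;
		queued++;
	}
	return queued;
}
```

Calling demo_processor_id() inside the loop body would give the same result in this sketch, but with one lookup per request instead of one in total.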
Thanks,
tglx
------------------>
Marco Crivellari (1):
smp: Use system_percpu_wq instead of system_wq
Paul E. McKenney (1):
smp: Improve smp_call_function_single() CSD-lock diagnostics
Randy Dunlap (1):
smp: Add missing kernel-doc comments
Shrikanth Hegde (1):
smp: Get this_cpu once in smp_call_function
include/linux/smp.h | 38 ++++++++++++++++++---------------
kernel/smp.c | 60 ++++++++++++++++++++++++++++++++++++++---------------
2 files changed, 64 insertions(+), 34 deletions(-)
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 1ebd88026119..6925d15ccaa7 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -73,7 +73,7 @@ static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
}
/**
- * on_each_cpu_mask(): Run a function on processors specified by
+ * on_each_cpu_mask() - Run a function on processors specified by
* cpumask, which may include the local processor.
* @mask: The set of cpus to run on (only runs on online subset).
* @func: The function to run. This must be fast and non-blocking.
@@ -239,13 +239,30 @@ static inline int get_boot_cpu_id(void)
#endif /* !SMP */
-/**
+/*
* raw_smp_processor_id() - get the current (unstable) CPU id
*
- * For then you know what you are doing and need an unstable
+ * raw_smp_processor_id() is arch-specific/arch-defined and
+ * may be a macro or a static inline function.
+ *
+ * For when you know what you are doing and need an unstable
* CPU id.
*/
+/*
+ * Allow the architecture to differentiate between a stable and unstable read.
+ * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
+ * regular asm read for the stable.
+ */
+#ifndef __smp_processor_id
+#define __smp_processor_id() raw_smp_processor_id()
+#endif
+
+#ifdef CONFIG_DEBUG_PREEMPT
+ extern unsigned int debug_smp_processor_id(void);
+# define smp_processor_id() debug_smp_processor_id()
+
+#else
/**
* smp_processor_id() - get the current (stable) CPU id
*
@@ -258,23 +275,10 @@ static inline int get_boot_cpu_id(void)
* - preemption is disabled;
* - the task is CPU affine.
*
- * When CONFIG_DEBUG_PREEMPT; we verify these assumption and WARN
+ * When CONFIG_DEBUG_PREEMPT=y, we verify these assumptions and WARN
* when smp_processor_id() is used when the CPU id is not stable.
*/
-/*
- * Allow the architecture to differentiate between a stable and unstable read.
- * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
- * regular asm read for the stable.
- */
-#ifndef __smp_processor_id
-#define __smp_processor_id() raw_smp_processor_id()
-#endif
-
-#ifdef CONFIG_DEBUG_PREEMPT
- extern unsigned int debug_smp_processor_id(void);
-# define smp_processor_id() debug_smp_processor_id()
-#else
# define smp_processor_id() __smp_processor_id()
#endif
diff --git a/kernel/smp.c b/kernel/smp.c
index f349960f79ca..6c77848d91f3 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -215,7 +215,7 @@ static atomic_t n_csd_lock_stuck;
/**
* csd_lock_is_stuck - Has a CSD-lock acquisition been stuck too long?
*
- * Returns @true if a CSD-lock acquisition is stuck and has been stuck
+ * Returns: @true if a CSD-lock acquisition is stuck and has been stuck
* long enough for a "non-responsive CSD lock" message to be printed.
*/
bool csd_lock_is_stuck(void)
@@ -377,6 +377,20 @@ static __always_inline void csd_unlock(call_single_data_t *csd)
static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
+#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
+static call_single_data_t *get_single_csd_data(int cpu)
+{
+ if (static_branch_unlikely(&csdlock_debug_enabled))
+ return per_cpu_ptr(&csd_data, cpu);
+ return this_cpu_ptr(&csd_data);
+}
+#else
+static call_single_data_t *get_single_csd_data(int cpu)
+{
+ return this_cpu_ptr(&csd_data);
+}
+#endif
+
void __smp_call_single_queue(int cpu, struct llist_node *node)
{
/*
@@ -625,13 +639,14 @@ void flush_smp_call_function_queue(void)
local_irq_restore(flags);
}
-/*
+/**
* smp_call_function_single - Run a function on a specific CPU
+ * @cpu: Specific target CPU for this function.
* @func: The function to run. This must be fast and non-blocking.
* @info: An arbitrary pointer to pass to the function.
* @wait: If true, wait until function has completed on other CPUs.
*
- * Returns 0 on success, else a negative status code.
+ * Returns: %0 on success, else a negative status code.
*/
int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
int wait)
@@ -670,14 +685,14 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
csd = &csd_stack;
if (!wait) {
- csd = this_cpu_ptr(&csd_data);
+ csd = get_single_csd_data(cpu);
csd_lock(csd);
}
csd->func = func;
csd->info = info;
#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
- csd->node.src = smp_processor_id();
+ csd->node.src = this_cpu;
csd->node.dst = cpu;
#endif
@@ -738,18 +753,18 @@ int smp_call_function_single_async(int cpu, call_single_data_t *csd)
}
EXPORT_SYMBOL_GPL(smp_call_function_single_async);
-/*
+/**
* smp_call_function_any - Run a function on any of the given cpus
* @mask: The mask of cpus it can run on.
* @func: The function to run. This must be fast and non-blocking.
* @info: An arbitrary pointer to pass to the function.
* @wait: If true, wait until function has completed.
*
- * Returns 0 on success, else a negative status code (if no cpus were online).
- *
* Selection preference:
* 1) current cpu if in @mask
* 2) nearest cpu in @mask, based on NUMA topology
+ *
+ * Returns: %0 on success, else a negative status code (if no cpus were online).
*/
int smp_call_function_any(const struct cpumask *mask,
smp_call_func_t func, void *info, int wait)
@@ -832,7 +847,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
csd->func = func;
csd->info = info;
#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
- csd->node.src = smp_processor_id();
+ csd->node.src = this_cpu;
csd->node.dst = cpu;
#endif
trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
@@ -880,7 +895,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
}
/**
- * smp_call_function_many(): Run a function on a set of CPUs.
+ * smp_call_function_many() - Run a function on a set of CPUs.
* @mask: The set of cpus to run on (only runs on online subset).
* @func: The function to run. This must be fast and non-blocking.
* @info: An arbitrary pointer to pass to the function.
@@ -902,14 +917,12 @@ void smp_call_function_many(const struct cpumask *mask,
EXPORT_SYMBOL(smp_call_function_many);
/**
- * smp_call_function(): Run a function on all other CPUs.
+ * smp_call_function() - Run a function on all other CPUs.
* @func: The function to run. This must be fast and non-blocking.
* @info: An arbitrary pointer to pass to the function.
* @wait: If true, wait (atomically) until function has completed
* on other CPUs.
*
- * Returns 0.
- *
* If @wait is true, then returns once @func has returned; otherwise
* it returns just before the target cpu calls @func.
*
@@ -1009,8 +1022,8 @@ void __init smp_init(void)
smp_cpus_done(setup_max_cpus);
}
-/*
- * on_each_cpu_cond(): Call a function on each processor for which
+/**
+ * on_each_cpu_cond_mask() - Call a function on each processor for which
* the supplied function cond_func returns true, optionally waiting
* for all the required CPUs to finish. This may include the local
* processor.
@@ -1024,6 +1037,7 @@ void __init smp_init(void)
* @info: An arbitrary pointer to pass to both functions.
* @wait: If true, wait (atomically) until function has
* completed on other CPUs.
+ * @mask: The set of cpus to run on (only runs on online subset).
*
* Preemption is disabled to protect against CPUs going offline but not online.
* CPUs going online during the call will not be seen or sent an IPI.
@@ -1095,7 +1109,7 @@ EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);
* scheduled, for any of the CPUs in the @mask. It does not guarantee
* correctness as it only provides a racy snapshot.
*
- * Returns true if there is a pending IPI scheduled and false otherwise.
+ * Returns: true if there is a pending IPI scheduled and false otherwise.
*/
bool cpus_peek_for_pending_ipi(const struct cpumask *mask)
{
@@ -1145,6 +1159,18 @@ static void smp_call_on_cpu_callback(struct work_struct *work)
complete(&sscs->done);
}
+/**
+ * smp_call_on_cpu() - Call a function on a specific CPU and wait
+ * for it to return.
+ * @cpu: The CPU to run on.
+ * @func: The function to run
+ * @par: An arbitrary pointer parameter for @func.
+ * @phys: If @true, force to run on physical @cpu. See
+ * &struct smp_call_on_cpu_struct for more info.
+ *
+ * Returns: %-ENXIO if the @cpu is invalid; otherwise the return value
+ * from @func.
+ */
int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par, bool phys)
{
struct smp_call_on_cpu_struct sscs = {
@@ -1159,7 +1185,7 @@ int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par, bool phys)
if (cpu >= nr_cpu_ids || !cpu_online(cpu))
return -ENXIO;
- queue_work_on(cpu, system_wq, &sscs.work);
+ queue_work_on(cpu, system_percpu_wq, &sscs.work);
wait_for_completion(&sscs.done);
destroy_work_on_stack(&sscs.work);
The pull request you sent on Sun, 12 Apr 2026 19:46:20 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp-core-2026-04-12

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/e80d033851b3bc94c3d254ac66660ddd0a49d72c

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
Linus,
please pull the latest irq/drivers branch from:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq-drivers-2026-04-12
up to: 1fac04a0a473: irqchip/irq-pic32-evic: Add __maybe_unused for board_bind_eic_interrupt in COMPILE_TEST
Updates for the interrupt chip driver subsystem:
- A large refactoring for the Renesas RZV2H driver to add new interrupt
types cleanly.
- A large refactoring for the Renesas RZG2L driver to add support for the
new RZ/G3L variant.
- Add support for the new NXP S32N79 chip in the IMX irq-steer driver.
- Add support for the Apple AICv3 variant
- Enhance the Loongson PCH LPC driver so it can be used on MIPS with
device tree firmware
- Allow the PIC32 EVIC driver to be built independent of MIPS in compile
tests.
- The usual small fixes and enhancements all over the place
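Part of the Loongson PCH LPC work in this pull overrides arch_dynirq_lower_bound() so that dynamically allocated interrupt numbers never land in the legacy range reserved for LPC. A minimal userspace sketch of that clamp follows. The allocator, the table size and the legacy range size of 16 are assumptions made for illustration, not values taken from this pull.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_IRQS_LEGACY	16u	/* assumed size of the reserved legacy range */
#define DEMO_NR_IRQS		64u	/* assumed size of the demo IRQ table */

/* Never start a dynamic allocation search below the legacy range. */
static unsigned int demo_dynirq_lower_bound(unsigned int from)
{
	return from > DEMO_NR_IRQS_LEGACY ? from : DEMO_NR_IRQS_LEGACY;
}

/* Hypothetical first-fit allocator that honours the lower bound. */
static int demo_alloc_irq(bool *used, unsigned int from)
{
	assert(used != NULL);

	for (unsigned int irq = demo_dynirq_lower_bound(from);
	     irq < DEMO_NR_IRQS; irq++) {
		if (!used[irq]) {
			used[irq] = true;
			return (int)irq;
		}
	}
	return -1;	/* no free interrupt number left */
}
```

With an empty table, a request starting at 0 is pushed up to the first number after the assumed legacy range, while a request starting above that range is left alone.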
Thanks,
tglx
------------------>
Biju Das (19):
dt-bindings: interrupt-controller: renesas,rzg2l-irqc: Use pattern for interrupt-names
dt-bindings: interrupt-controller: renesas,rzg2l-irqc: Document RZ/G3L SoC
irqchip/renesas-rzg2l: Fix error path in rzg2l_irqc_common_probe()
irqchip/renesas-rzg2l: Drop redundant IRQC_TINT_START check in rzg2l_irqc_alloc()
irqchip/renesas-rzg2l: Replace single irq_chip with per-region irq_chip instances
irqchip/renesas-rzg2l: Split EOI handler into separate IRQ and TINT functions
irqchip/renesas-rzg2l: Split set_type handler into separate IRQ and TINT functions
irqchip/renesas-rzg2l: Replace rzg2l_irqc_irq_{enable,disable} with TINT-specific handlers
irqchip/renesas-rzg2l: Split rzfive_tint_irq_endisable() into separate IRQ and TINT helpers
irqchip/renesas-rzg2l: Split rzfive_irqc_{mask,unmask} into separate IRQ and TINT handlers
irqchip/renesas-rzg2l: Dynamically allocate fwspec array
irqchip/renesas-rzg2l: Drop IRQC_NUM_IRQ macro
irqchip/renesas-rzg2l: Drop IRQC_TINT_START macro
irqchip/renesas-rzg2l: Drop IRQC_IRQ_COUNT macro
irqchip/renesas-rzg2l: Add RZ/G3L support
irqchip/renesas-rzg2l: Add shared interrupt support
irqchip/renesas-rzg2l: Replace raw_spin_{lock,unlock} with guard() in rzg2l_irq_set_type()
irqchip/renesas-rzg2l: Clear the shared interrupt bit in rzg2l_irqc_free()
irqchip/renesas-rzg2l: Add NMI support
Brian Masney (6):
irqchip/irq-pic32-evic: Address warning related to wrong printf() formatter
irqchip/irq-pic32-evic: Don't define plat_irq_dispatch() for !MIPS builds
irqchip/irq-pic32-evic: Define board_bind_eic_interrupt for !MIPS builds
irqchip/irq-pic32-evic: Only include asm headers when compiling for MIPS
irqchip/irq-pic32-evic: Allow driver to be compiled with COMPILE_TEST
irqchip/irq-pic32-evic: Add __maybe_unused for board_bind_eic_interrupt in COMPILE_TEST
Ciprian Marian Costea (1):
irqchip/imx-irqsteer: Add NXP S32N79 support
Geert Uytterhoeven (4):
irqchip/gic-v3: Print a warning for out-of-range interrupt numbers
irqchip/renesas-rzv2h: Kill swint_idx[]
irqchip/renesas-rzv2h: Kill swint_names[]
irqchip/renesas-rzv2h: Kill icu_err string
Icenowy Zheng (6):
MIPS: loongson64: Override arch_dynirq_lower_bound to reserve LPC IRQs
LoongArch: Override arch_dynirq_lower_bound to reserve LPC IRQs
dt-bindings: interrupt-controller: Add LS7A PCH LPC
irqchip/loongson-pch-lpc: Extract non-ACPI-related code from ACPI init
irqchip/loongson-pch-lpc: Add OF init code
irqchip/loongson-pch-lpc: Enable building on MIPS Loongson64
Janne Grunau (2):
dt-bindings: interrupt-controller: apple,aic2: Add AICv3
irqchip/apple-aic: Add support for "apple,t8122-aic3"
Lad Prabhakar (7):
irqchip/renesas-rzv2h: Use local node pointer
irqchip/renesas-rzv2h: Use local device pointer in ICU probe
irqchip/renesas-rzv2h: Switch to using dev_err_probe()
irqchip/renesas-rzv2h: Clarify IRQ range definitions and tighten TINT validation
irqchip/renesas-rzv2h: Replace single irq_chip with per-region irq_chip instances
irqchip/renesas-rzv2h: Add CA55 software interrupt support
irqchip/renesas-rzv2h: Handle ICU error IRQ and add SWPE trigger
Philipp Hahn (1):
irqchip: Use IS_ERR_OR_NULL() instead of NULL and IS_ERR() checks
.../bindings/interrupt-controller/apple,aic2.yaml | 30 +-
.../interrupt-controller/loongson,pch-lpc.yaml | 52 ++
.../interrupt-controller/renesas,rzg2l-irqc.yaml | 157 ++----
arch/loongarch/kernel/irq.c | 6 +
arch/mips/loongson64/init.c | 6 +
arch/mips/pic32/Kconfig | 1 -
drivers/irqchip/Kconfig | 12 +-
drivers/irqchip/irq-apple-aic.c | 24 +-
drivers/irqchip/irq-gic-v3.c | 10 +-
drivers/irqchip/irq-imx-irqsteer.c | 53 +-
drivers/irqchip/irq-loongson-pch-lpc.c | 92 +++-
drivers/irqchip/irq-mvebu-odmi.c | 2 +-
drivers/irqchip/irq-pic32-evic.c | 8 +-
drivers/irqchip/irq-renesas-rzg2l.c | 576 +++++++++++++++++----
drivers/irqchip/irq-renesas-rzv2h.c | 467 +++++++++++++----
15 files changed, 1128 insertions(+), 368 deletions(-)
create mode 100644 Documentation/devicetree/bindings/interrupt-controller/loongson,pch-lpc.yaml
diff --git a/Documentation/devicetree/bindings/interrupt-controller/apple,aic2.yaml b/Documentation/devicetree/bindings/interrupt-controller/apple,aic2.yaml
index ee5a0dfff437..d0d9a90e96e7 100644
--- a/Documentation/devicetree/bindings/interrupt-controller/apple,aic2.yaml
+++ b/Documentation/devicetree/bindings/interrupt-controller/apple,aic2.yaml
@@ -4,10 +4,10 @@
$id: http://devicetree.org/schemas/interrupt-controller/apple,aic2.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Apple Interrupt Controller 2
+title: Apple Interrupt Controller 2 and 3
maintainers:
- - Hector Martin <marcan@marcan.st>
+ - Janne Grunau <j@jannau.net>
description: |
The Apple Interrupt Controller 2 is a simple interrupt controller present on
@@ -28,14 +28,24 @@ description: |
which do not go through a discrete interrupt controller. It also handles
FIQ-based Fast IPIs.
+ The Apple Interrupt Controller 3 is in its base functionality very similar to
+ the Apple Interrupt Controller 2 and uses the same device tree bindings. It is
+ found on Apple ARM SoCs platforms starting with t8122 (M3).
+
properties:
compatible:
- items:
- - enum:
- - apple,t8112-aic
- - apple,t6000-aic
- - apple,t6020-aic
- - const: apple,aic2
+ oneOf:
+ - items:
+ - enum:
+ - apple,t6000-aic
+ - apple,t6020-aic
+ - apple,t8112-aic
+ - const: apple,aic2
+ - items:
+ - enum:
+ - apple,t6030-aic3
+ - const: apple,t8122-aic3
+ - const: apple,t8122-aic3
interrupt-controller: true
@@ -117,7 +127,9 @@ allOf:
properties:
compatible:
contains:
- const: apple,t8112-aic
+ enum:
+ - apple,t8112-aic
+ - apple,t8122-aic3
then:
properties:
'#interrupt-cells':
diff --git a/Documentation/devicetree/bindings/interrupt-controller/loongson,pch-lpc.yaml b/Documentation/devicetree/bindings/interrupt-controller/loongson,pch-lpc.yaml
new file mode 100644
index 000000000000..ff2a425b6f0b
--- /dev/null
+++ b/Documentation/devicetree/bindings/interrupt-controller/loongson,pch-lpc.yaml
@@ -0,0 +1,52 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/interrupt-controller/loongson,pch-lpc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Loongson PCH LPC Controller
+
+maintainers:
+ - Jiaxun Yang <jiaxun.yang@flygoat.com>
+
+description:
+ This interrupt controller is found in the Loongson LS7A family of PCH for
+ accepting interrupts sent by LPC-connected peripherals and signalling PIC
+ via a single interrupt line when interrupts are available.
+
+properties:
+ compatible:
+ const: loongson,ls7a-lpc
+
+ reg:
+ maxItems: 1
+
+ interrupt-controller: true
+
+ interrupts:
+ maxItems: 1
+
+ '#interrupt-cells':
+ const: 2
+
+required:
+ - compatible
+ - reg
+ - interrupt-controller
+ - interrupts
+ - '#interrupt-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+ lpc: interrupt-controller@10002000 {
+ compatible = "loongson,ls7a-lpc";
+ reg = <0x10002000 0x400>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ interrupt-parent = <&pic>;
+ interrupts = <19 IRQ_TYPE_LEVEL_HIGH>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml b/Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml
index 44b6ae5fc802..3a221e1800a0 100644
--- a/Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml
+++ b/Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml
@@ -30,7 +30,9 @@ properties:
- renesas,r9a08g045-irqc # RZ/G3S
- const: renesas,rzg2l-irqc
- - const: renesas,r9a07g043f-irqc # RZ/Five
+ - enum:
+ - renesas,r9a07g043f-irqc # RZ/Five
+ - renesas,r9a08g046-irqc # RZ/G3L
'#interrupt-cells':
description: The first cell should contain a macro RZG2L_{NMI,IRQX} included in the
@@ -48,107 +50,35 @@ properties:
interrupts:
minItems: 45
- items:
- - description: NMI interrupt
- - description: IRQ0 interrupt
- - description: IRQ1 interrupt
- - description: IRQ2 interrupt
- - description: IRQ3 interrupt
- - description: IRQ4 interrupt
- - description: IRQ5 interrupt
- - description: IRQ6 interrupt
- - description: IRQ7 interrupt
- - description: GPIO interrupt, TINT0
- - description: GPIO interrupt, TINT1
- - description: GPIO interrupt, TINT2
- - description: GPIO interrupt, TINT3
- - description: GPIO interrupt, TINT4
- - description: GPIO interrupt, TINT5
- - description: GPIO interrupt, TINT6
- - description: GPIO interrupt, TINT7
- - description: GPIO interrupt, TINT8
- - description: GPIO interrupt, TINT9
- - description: GPIO interrupt, TINT10
- - description: GPIO interrupt, TINT11
- - description: GPIO interrupt, TINT12
- - description: GPIO interrupt, TINT13
- - description: GPIO interrupt, TINT14
- - description: GPIO interrupt, TINT15
- - description: GPIO interrupt, TINT16
- - description: GPIO interrupt, TINT17
- - description: GPIO interrupt, TINT18
- - description: GPIO interrupt, TINT19
- - description: GPIO interrupt, TINT20
- - description: GPIO interrupt, TINT21
- - description: GPIO interrupt, TINT22
- - description: GPIO interrupt, TINT23
- - description: GPIO interrupt, TINT24
- - description: GPIO interrupt, TINT25
- - description: GPIO interrupt, TINT26
- - description: GPIO interrupt, TINT27
- - description: GPIO interrupt, TINT28
- - description: GPIO interrupt, TINT29
- - description: GPIO interrupt, TINT30
- - description: GPIO interrupt, TINT31
- - description: Bus error interrupt
- - description: ECCRAM0 or combined ECCRAM0/1 1bit error interrupt
- - description: ECCRAM0 or combined ECCRAM0/1 2bit error interrupt
- - description: ECCRAM0 or combined ECCRAM0/1 error overflow interrupt
- - description: ECCRAM1 1bit error interrupt
- - description: ECCRAM1 2bit error interrupt
- - description: ECCRAM1 error overflow interrupt
+ maxItems: 61
interrupt-names:
minItems: 45
+ maxItems: 61
items:
- - const: nmi
- - const: irq0
- - const: irq1
- - const: irq2
- - const: irq3
- - const: irq4
- - const: irq5
- - const: irq6
- - const: irq7
- - const: tint0
- - const: tint1
- - const: tint2
- - const: tint3
- - const: tint4
- - const: tint5
- - const: tint6
- - const: tint7
- - const: tint8
- - const: tint9
- - const: tint10
- - const: tint11
- - const: tint12
- - const: tint13
- - const: tint14
- - const: tint15
- - const: tint16
- - const: tint17
- - const: tint18
- - const: tint19
- - const: tint20
- - const: tint21
- - const: tint22
- - const: tint23
- - const: tint24
- - const: tint25
- - const: tint26
- - const: tint27
- - const: tint28
- - const: tint29
- - const: tint30
- - const: tint31
- - const: bus-err
- - const: ec7tie1-0
- - const: ec7tie2-0
- - const: ec7tiovf-0
- - const: ec7tie1-1
- - const: ec7tie2-1
- - const: ec7tiovf-1
+ oneOf:
+ - description: NMI interrupt
+ const: nmi
+ - description: External IRQ interrupt
+ pattern: '^irq([0-9]|1[0-5])$'
+ - description: GPIO interrupt
+ pattern: '^tint([0-9]|1[0-9]|2[0-9]|3[0-1])$'
+ - description: Bus error interrupt
+ const: bus-err
+ - description: ECCRAM0 or combined ECCRAM0/1 1bit error interrupt
+ const: ec7tie1-0
+ - description: ECCRAM0 or combined ECCRAM0/1 2bit error interrupt
+ const: ec7tie2-0
+ - description: ECCRAM0 or combined ECCRAM0/1 error overflow interrupt
+ const: ec7tiovf-0
+ - description: ECCRAM1 1bit error interrupt
+ const: ec7tie1-1
+ - description: ECCRAM1 2bit error interrupt
+ const: ec7tie2-1
+ - description: ECCRAM1 error overflow interrupt
+ const: ec7tiovf-1
+ - description: Integrated GPT Error interrupt
+ pattern: '^ovfunf([0-7])$'
clocks:
maxItems: 2
@@ -180,6 +110,24 @@ required:
allOf:
- $ref: /schemas/interrupt-controller.yaml#
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - renesas,r9a07g043f-irqc
+ - renesas,r9a07g043u-irqc
+ - renesas,r9a07g044-irqc
+ - renesas,r9a07g054-irqc
+ then:
+ properties:
+ interrupts:
+ minItems: 48
+ maxItems: 48
+ interrupt-names:
+ minItems: 48
+ maxItems: 48
+
- if:
properties:
compatible:
@@ -192,12 +140,19 @@ allOf:
maxItems: 45
interrupt-names:
maxItems: 45
- else:
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - renesas,r9a08g046-irqc
+ then:
properties:
interrupts:
- minItems: 48
+ minItems: 61
interrupt-names:
- minItems: 48
+ minItems: 61
unevaluatedProperties: false
diff --git a/arch/loongarch/kernel/irq.c b/arch/loongarch/kernel/irq.c
index 80946cafaec1..7bf68a7a5f4b 100644
--- a/arch/loongarch/kernel/irq.c
+++ b/arch/loongarch/kernel/irq.c
@@ -11,6 +11,7 @@
#include <linux/irqchip.h>
#include <linux/kernel_stat.h>
#include <linux/proc_fs.h>
+#include <linux/minmax.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/seq_file.h>
@@ -99,6 +100,11 @@ int __init arch_probe_nr_irqs(void)
return NR_IRQS_LEGACY;
}
+unsigned int arch_dynirq_lower_bound(unsigned int from)
+{
+ return MAX(from, NR_IRQS_LEGACY);
+}
+
void __init init_IRQ(void)
{
int i;
diff --git a/arch/mips/loongson64/init.c b/arch/mips/loongson64/init.c
index 5f73f8663ab2..c7cc5a3d7817 100644
--- a/arch/mips/loongson64/init.c
+++ b/arch/mips/loongson64/init.c
@@ -7,6 +7,7 @@
#include <linux/irqchip.h>
#include <linux/logic_pio.h>
#include <linux/memblock.h>
+#include <linux/minmax.h>
#include <linux/of.h>
#include <linux/of_address.h>
#include <asm/bootinfo.h>
@@ -227,3 +228,8 @@ void __init arch_init_irq(void)
reserve_pio_range();
irqchip_init();
}
+
+unsigned int arch_dynirq_lower_bound(unsigned int from)
+{
+ return MAX(from, NR_IRQS_LEGACY);
+}
diff --git a/arch/mips/pic32/Kconfig b/arch/mips/pic32/Kconfig
index bb6ab1f3e80d..cd14a071e631 100644
--- a/arch/mips/pic32/Kconfig
+++ b/arch/mips/pic32/Kconfig
@@ -20,7 +20,6 @@ config PIC32MZDA
select LIBFDT
select USE_OF
select PINCTRL
- select PIC32_EVIC
help
Support for the Microchip PIC32MZDA microcontroller.
diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
index f07b00d7fef9..e755a2a05209 100644
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -252,9 +252,12 @@ config ORION_IRQCHIP
select IRQ_DOMAIN
config PIC32_EVIC
- bool
+ def_bool MACH_PIC32 || COMPILE_TEST
select GENERIC_IRQ_CHIP
select IRQ_DOMAIN
+ help
+ Enable support for the interrupt controller on the Microchip PIC32
+ family of platforms.
config JCORE_AIC
bool "J-Core integrated AIC" if COMPILE_TEST
@@ -541,11 +544,11 @@ config CSKY_APB_INTC
config IMX_IRQSTEER
bool "i.MX IRQSTEER support"
- depends on ARCH_MXC || COMPILE_TEST
- default ARCH_MXC
+ depends on ARCH_MXC || ARCH_S32 || COMPILE_TEST
+ default y if ARCH_MXC || ARCH_S32
select IRQ_DOMAIN
help
- Support for the i.MX IRQSTEER interrupt multiplexer/remapper.
+ Support for the i.MX and S32 IRQSTEER interrupt multiplexer/remapper.
config IMX_INTMUX
bool "i.MX INTMUX support" if COMPILE_TEST
@@ -761,7 +764,6 @@ config LOONGSON_PCH_MSI
config LOONGSON_PCH_LPC
bool "Loongson PCH LPC Controller"
- depends on LOONGARCH
depends on MACH_LOONGSON64 || LOONGARCH
default MACH_LOONGSON64
select IRQ_DOMAIN_HIERARCHY
diff --git a/drivers/irqchip/irq-apple-aic.c b/drivers/irqchip/irq-apple-aic.c
index 2b24c82bb0df..4a3141d9f914 100644
--- a/drivers/irqchip/irq-apple-aic.c
+++ b/drivers/irqchip/irq-apple-aic.c
@@ -134,8 +134,12 @@
#define AIC2_IRQ_CFG 0x2000
+/* AIC v3 registers (MMIO) */
+#define AIC3_IRQ_CFG 0x10000
+
/*
* AIC2 registers are laid out like this, starting at AIC2_IRQ_CFG:
+ * AIC3 registers use the same layout but start at AIC3_IRQ_CFG:
*
* Repeat for each die:
* IRQ_CFG: u32 * MAX_IRQS
@@ -293,6 +297,15 @@ static const struct aic_info aic2_info __initconst = {
.local_fast_ipi = true,
};
+static const struct aic_info aic3_info __initconst = {
+ .version = 3,
+
+ .irq_cfg = AIC3_IRQ_CFG,
+
+ .fast_ipi = true,
+ .local_fast_ipi = true,
+};
+
static const struct of_device_id aic_info_match[] = {
{
.compatible = "apple,t8103-aic",
@@ -310,6 +323,10 @@ static const struct of_device_id aic_info_match[] = {
.compatible = "apple,aic2",
.data = &aic2_info,
},
+ {
+ .compatible = "apple,t8122-aic3",
+ .data = &aic3_info,
+ },
{}
};
@@ -620,7 +637,7 @@ static int aic_irq_domain_map(struct irq_domain *id, unsigned int irq,
u32 type = FIELD_GET(AIC_EVENT_TYPE, hw);
struct irq_chip *chip = &aic_chip;
- if (ic->info.version == 2)
+ if (ic->info.version == 2 || ic->info.version == 3)
chip = &aic2_chip;
if (type == AIC_EVENT_TYPE_IRQ) {
@@ -991,7 +1008,7 @@ static int __init aic_of_ic_init(struct device_node *node, struct device_node *p
break;
}
- case 2: {
+ case 2 ... 3: {
u32 info1, info3;
info1 = aic_ic_read(irqc, AIC2_INFO1);
@@ -1065,7 +1082,7 @@ static int __init aic_of_ic_init(struct device_node *node, struct device_node *p
off += irqc->info.die_stride;
}
- if (irqc->info.version == 2) {
+ if (irqc->info.version == 2 || irqc->info.version == 3) {
u32 config = aic_ic_read(irqc, AIC2_CONFIG);
config |= AIC2_CONFIG_ENABLE;
@@ -1116,3 +1133,4 @@ static int __init aic_of_ic_init(struct device_node *node, struct device_node *p
IRQCHIP_DECLARE(apple_aic, "apple,aic", aic_of_ic_init);
IRQCHIP_DECLARE(apple_aic2, "apple,aic2", aic_of_ic_init);
+IRQCHIP_DECLARE(apple_aic3, "apple,t8122-aic3", aic_of_ic_init);
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 20f13b686ab2..99444a1b2ffa 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1603,15 +1603,23 @@ static int gic_irq_domain_translate(struct irq_domain *d,
switch (fwspec->param[0]) {
case 0: /* SPI */
+ if (fwspec->param[1] > 987)
+ pr_warn_once("SPI %u out of range (use ESPI?)\n", fwspec->param[1]);
*hwirq = fwspec->param[1] + 32;
break;
case 1: /* PPI */
+ if (fwspec->param[1] > 15)
+ pr_warn_once("PPI %u out of range (use EPPI?)\n", fwspec->param[1]);
*hwirq = fwspec->param[1] + 16;
break;
case 2: /* ESPI */
+ if (fwspec->param[1] > 1023)
+ pr_warn_once("ESPI %u out of range\n", fwspec->param[1]);
*hwirq = fwspec->param[1] + ESPI_BASE_INTID;
break;
case 3: /* EPPI */
+ if (fwspec->param[1] > 63)
+ pr_warn_once("EPPI %u out of range\n", fwspec->param[1]);
*hwirq = fwspec->param[1] + EPPI_BASE_INTID;
break;
case GIC_IRQ_TYPE_LPI: /* LPI */
@@ -2252,7 +2260,7 @@ static int __init gic_of_init(struct device_node *node, struct device_node *pare
out_unmap_rdist:
for (i = 0; i < nr_redist_regions; i++)
- if (rdist_regs[i].redist_base && !IS_ERR(rdist_regs[i].redist_base))
+ if (!IS_ERR_OR_NULL(rdist_regs[i].redist_base))
iounmap(rdist_regs[i].redist_base);
kfree(rdist_regs);
out_unmap_dist:
diff --git a/drivers/irqchip/irq-imx-irqsteer.c b/drivers/irqchip/irq-imx-irqsteer.c
index 4682ce5bf8d3..87b07f517be3 100644
--- a/drivers/irqchip/irq-imx-irqsteer.c
+++ b/drivers/irqchip/irq-imx-irqsteer.c
@@ -26,19 +26,38 @@
#define CHAN_MAX_OUTPUT_INT 0xF
+/* SoC does not implement the CHANCTRL register */
+#define IRQSTEER_QUIRK_NO_CHANCTRL BIT(0)
+
+struct irqsteer_devtype_data {
+ u32 quirks;
+};
+
struct irqsteer_data {
- void __iomem *regs;
- struct clk *ipg_clk;
- int irq[CHAN_MAX_OUTPUT_INT];
- int irq_count;
- raw_spinlock_t lock;
- int reg_num;
- int channel;
- struct irq_domain *domain;
- u32 *saved_reg;
- struct device *dev;
+ void __iomem *regs;
+ struct clk *ipg_clk;
+ int irq[CHAN_MAX_OUTPUT_INT];
+ int irq_count;
+ raw_spinlock_t lock;
+ int reg_num;
+ int channel;
+ struct irq_domain *domain;
+ u32 *saved_reg;
+ struct device *dev;
+ const struct irqsteer_devtype_data *devtype_data;
+};
+
+static const struct irqsteer_devtype_data imx_data = { };
+
+static const struct irqsteer_devtype_data s32n79_data = {
+ .quirks = IRQSTEER_QUIRK_NO_CHANCTRL,
};
+static bool irqsteer_has_chanctrl(const struct irqsteer_devtype_data *data)
+{
+ return !(data->quirks & IRQSTEER_QUIRK_NO_CHANCTRL);
+}
+
static int imx_irqsteer_get_reg_index(struct irqsteer_data *data,
unsigned long irqnum)
{
@@ -188,6 +207,10 @@ static int imx_irqsteer_probe(struct platform_device *pdev)
if (ret)
return ret;
+ data->devtype_data = device_get_match_data(&pdev->dev);
+ if (!data->devtype_data)
+ return dev_err_probe(&pdev->dev, -ENODEV, "failed to match device data\n");
+
/*
* There is one output irq for each group of 64 inputs.
* One register bit map can represent 32 input interrupts.
@@ -210,7 +233,8 @@ static int imx_irqsteer_probe(struct platform_device *pdev)
}
/* steer all IRQs into configured channel */
- writel_relaxed(BIT(data->channel), data->regs + CHANCTRL);
+ if (irqsteer_has_chanctrl(data->devtype_data))
+ writel_relaxed(BIT(data->channel), data->regs + CHANCTRL);
data->domain = irq_domain_create_linear(dev_fwnode(&pdev->dev), data->reg_num * 32,
&imx_irqsteer_domain_ops, data);
@@ -279,7 +303,9 @@ static void imx_irqsteer_restore_regs(struct irqsteer_data *data)
{
int i;
- writel_relaxed(BIT(data->channel), data->regs + CHANCTRL);
+ if (irqsteer_has_chanctrl(data->devtype_data))
+ writel_relaxed(BIT(data->channel), data->regs + CHANCTRL);
+
for (i = 0; i < data->reg_num; i++)
writel_relaxed(data->saved_reg[i],
data->regs + CHANMASK(i, data->reg_num));
@@ -319,7 +345,8 @@ static const struct dev_pm_ops imx_irqsteer_pm_ops = {
};
static const struct of_device_id imx_irqsteer_dt_ids[] = {
- { .compatible = "fsl,imx-irqsteer", },
+ { .compatible = "fsl,imx-irqsteer", .data = &imx_data },
+ { .compatible = "nxp,s32n79-irqsteer", .data = &s32n79_data },
{},
};
diff --git a/drivers/irqchip/irq-loongson-pch-lpc.c b/drivers/irqchip/irq-loongson-pch-lpc.c
index 3ad46ec94e3c..7117ca6fc2f0 100644
--- a/drivers/irqchip/irq-loongson-pch-lpc.c
+++ b/drivers/irqchip/irq-loongson-pch-lpc.c
@@ -13,6 +13,8 @@
#include <linux/irqchip/chained_irq.h>
#include <linux/irqdomain.h>
#include <linux/kernel.h>
+#include <linux/of_address.h>
+#include <linux/of_irq.h>
#include <linux/syscore_ops.h>
#include "irq-loongson.h"
@@ -175,13 +177,10 @@ static struct syscore pch_lpc_syscore = {
.ops = &pch_lpc_syscore_ops,
};
-int __init pch_lpc_acpi_init(struct irq_domain *parent,
- struct acpi_madt_lpc_pic *acpi_pchlpc)
+static int __init pch_lpc_init(phys_addr_t addr, unsigned long size,
+ struct fwnode_handle *irq_handle, int parent_irq)
{
- int parent_irq;
struct pch_lpc *priv;
- struct irq_fwspec fwspec;
- struct fwnode_handle *irq_handle;
priv = kzalloc_obj(*priv);
if (!priv)
@@ -189,7 +188,7 @@ int __init pch_lpc_acpi_init(struct irq_domain *parent,
raw_spin_lock_init(&priv->lpc_lock);
- priv->base = ioremap(acpi_pchlpc->address, acpi_pchlpc->size);
+ priv->base = ioremap(addr, size);
if (!priv->base)
goto free_priv;
@@ -198,12 +197,6 @@ int __init pch_lpc_acpi_init(struct irq_domain *parent,
goto iounmap_base;
}
- irq_handle = irq_domain_alloc_named_fwnode("lpcintc");
- if (!irq_handle) {
- pr_err("Unable to allocate domain handle\n");
- goto iounmap_base;
- }
-
/*
* The LPC interrupt controller is a legacy i8259-compatible device,
* which requires a static 1:1 mapping for IRQs 0-15.
@@ -213,15 +206,10 @@ int __init pch_lpc_acpi_init(struct irq_domain *parent,
&pch_lpc_domain_ops, priv);
if (!priv->lpc_domain) {
pr_err("Failed to create IRQ domain\n");
- goto free_irq_handle;
+ goto iounmap_base;
}
pch_lpc_reset(priv);
- fwspec.fwnode = parent->fwnode;
- fwspec.param[0] = acpi_pchlpc->cascade + GSI_MIN_PCH_IRQ;
- fwspec.param[1] = IRQ_TYPE_LEVEL_HIGH;
- fwspec.param_count = 2;
- parent_irq = irq_create_fwspec_mapping(&fwspec);
irq_set_chained_handler_and_data(parent_irq, lpc_irq_dispatch, priv);
pch_lpc_priv = priv;
@@ -230,8 +218,6 @@ int __init pch_lpc_acpi_init(struct irq_domain *parent,
return 0;
-free_irq_handle:
- irq_domain_free_fwnode(irq_handle);
iounmap_base:
iounmap(priv->base);
free_priv:
@@ -239,3 +225,69 @@ int __init pch_lpc_acpi_init(struct irq_domain *parent,
return -ENOMEM;
}
+
+#ifdef CONFIG_ACPI
+int __init pch_lpc_acpi_init(struct irq_domain *parent, struct acpi_madt_lpc_pic *acpi_pchlpc)
+{
+ struct fwnode_handle *irq_handle;
+ struct irq_fwspec fwspec;
+ int parent_irq, ret;
+
+ irq_handle = irq_domain_alloc_named_fwnode("lpcintc");
+ if (!irq_handle) {
+ pr_err("Unable to allocate domain handle\n");
+ return -ENOMEM;
+ }
+
+ fwspec.fwnode = parent->fwnode;
+ fwspec.param[0] = acpi_pchlpc->cascade + GSI_MIN_PCH_IRQ;
+ fwspec.param[1] = IRQ_TYPE_LEVEL_HIGH;
+ fwspec.param_count = 2;
+ parent_irq = irq_create_fwspec_mapping(&fwspec);
+ if (parent_irq <= 0) {
+ pr_err("Unable to map LPC parent interrupt\n");
+ irq_domain_free_fwnode(irq_handle);
+ return -ENOMEM;
+ }
+
+ ret = pch_lpc_init(acpi_pchlpc->address, acpi_pchlpc->size, irq_handle, parent_irq);
+ if (ret) {
+ irq_dispose_mapping(parent_irq);
+ irq_domain_free_fwnode(irq_handle);
+ return ret;
+ }
+
+ return 0;
+}
+#endif /* CONFIG_ACPI */
+
+#ifdef CONFIG_OF
+static int __init pch_lpc_of_init(struct device_node *node, struct device_node *parent)
+{
+ struct fwnode_handle *irq_handle;
+ struct resource res;
+ int parent_irq, ret;
+
+ if (of_address_to_resource(node, 0, &res))
+ return -EINVAL;
+
+ parent_irq = irq_of_parse_and_map(node, 0);
+ if (!parent_irq) {
+ pr_err("Failed to get the parent IRQ for LPC IRQs\n");
+ return -EINVAL;
+ }
+
+ irq_handle = of_fwnode_handle(node);
+
+ ret = pch_lpc_init(res.start, resource_size(&res), irq_handle,
+ parent_irq);
+ if (ret) {
+ irq_dispose_mapping(parent_irq);
+ return ret;
+ }
+
+ return 0;
+}
+
+IRQCHIP_DECLARE(pch_lpc, "loongson,ls7a-lpc", pch_lpc_of_init);
+#endif /* CONFIG_OF */
diff --git a/drivers/irqchip/irq-mvebu-odmi.c b/drivers/irqchip/irq-mvebu-odmi.c
index b99ab9dcc14b..94e7eda46e81 100644
--- a/drivers/irqchip/irq-mvebu-odmi.c
+++ b/drivers/irqchip/irq-mvebu-odmi.c
@@ -217,7 +217,7 @@ static int __init mvebu_odmi_init(struct device_node *node,
for (i = 0; i < odmis_count; i++) {
struct odmi_data *odmi = &odmis[i];
- if (odmi->base && !IS_ERR(odmi->base))
+ if (!IS_ERR_OR_NULL(odmi->base))
iounmap(odmis[i].base);
}
bitmap_free(odmis_bm);
diff --git a/drivers/irqchip/irq-pic32-evic.c b/drivers/irqchip/irq-pic32-evic.c
index e85c3e300701..3c48288c9e6c 100644
--- a/drivers/irqchip/irq-pic32-evic.c
+++ b/drivers/irqchip/irq-pic32-evic.c
@@ -15,8 +15,10 @@
#include <linux/irq.h>
#include <linux/platform_data/pic32.h>
+#ifdef CONFIG_MIPS
#include <asm/irq.h>
#include <asm/traps.h>
+#endif
#define REG_INTCON 0x0000
#define REG_INTSTAT 0x0020
@@ -40,6 +42,7 @@ struct evic_chip_data {
static struct irq_domain *evic_irq_domain;
static void __iomem *evic_base;
+#ifdef CONFIG_MIPS
asmlinkage void __weak plat_irq_dispatch(void)
{
unsigned int hwirq;
@@ -47,6 +50,9 @@ asmlinkage void __weak plat_irq_dispatch(void)
hwirq = readl(evic_base + REG_INTSTAT) & 0xFF;
do_domain_IRQ(evic_irq_domain, hwirq);
}
+#else
+static __maybe_unused void (*board_bind_eic_interrupt)(int irq, int regset);
+#endif
static struct evic_chip_data *irqd_to_priv(struct irq_data *data)
{
@@ -196,7 +202,7 @@ static void __init pic32_ext_irq_of_init(struct irq_domain *domain)
of_property_for_each_u32(node, pname, hwirq) {
if (i >= ARRAY_SIZE(priv->ext_irqs)) {
- pr_warn("More than %d external irq, skip rest\n",
+ pr_warn("More than %zu external irq, skip rest\n",
ARRAY_SIZE(priv->ext_irqs));
break;
}
diff --git a/drivers/irqchip/irq-renesas-rzg2l.c b/drivers/irqchip/irq-renesas-rzg2l.c
index e73d426cea6d..f6b2e69a2f4e 100644
--- a/drivers/irqchip/irq-renesas-rzg2l.c
+++ b/drivers/irqchip/irq-renesas-rzg2l.c
@@ -20,18 +20,21 @@
#include <linux/spinlock.h>
#include <linux/syscore_ops.h>
+#define IRQC_NMI 0
#define IRQC_IRQ_START 1
-#define IRQC_IRQ_COUNT 8
-#define IRQC_TINT_START (IRQC_IRQ_START + IRQC_IRQ_COUNT)
#define IRQC_TINT_COUNT 32
-#define IRQC_NUM_IRQ (IRQC_TINT_START + IRQC_TINT_COUNT)
+#define IRQC_SHARED_IRQ_COUNT 8
+#define IRQC_IRQ_SHARED_START (IRQC_IRQ_START + IRQC_SHARED_IRQ_COUNT)
+#define NSCR 0x0
+#define NITSR 0x4
#define ISCR 0x10
#define IITSR 0x14
#define TSCR 0x20
#define TITSR(n) (0x24 + (n) * 4)
#define TITSR0_MAX_INT 16
#define TITSEL_WIDTH 0x2
+#define INTTSEL 0x2c
#define TSSR(n) (0x30 + ((n) * 4))
#define TIEN BIT(7)
#define TSSEL_SHIFT(n) (8 * (n))
@@ -43,6 +46,10 @@
#define TSSR_OFFSET(n) ((n) % 4)
#define TSSR_INDEX(n) ((n) / 4)
+#define NSCR_NSTAT 0
+#define NITSR_NTSEL_EDGE_FALLING 0
+#define NITSR_NTSEL_EDGE_RISING 1
+
#define TITSR_TITSEL_EDGE_RISING 0
#define TITSR_TITSEL_EDGE_FALLING 1
#define TITSR_TITSEL_LEVEL_HIGH 2
@@ -55,33 +62,62 @@
#define IITSR_IITSEL_EDGE_BOTH 3
#define IITSR_IITSEL_MASK(n) IITSR_IITSEL((n), 3)
+#define INTTSEL_TINTSEL(n) BIT(n)
+#define INTTSEL_TINTSEL_START 24
+
#define TINT_EXTRACT_HWIRQ(x) FIELD_GET(GENMASK(15, 0), (x))
#define TINT_EXTRACT_GPIOINT(x) FIELD_GET(GENMASK(31, 16), (x))
/**
* struct rzg2l_irqc_reg_cache - registers cache (necessary for suspend/resume)
- * @iitsr: IITSR register
- * @titsr: TITSR registers
+ * @nitsr: NITSR register
+ * @iitsr: IITSR register
+ * @inttsel: INTTSEL register
+ * @titsr: TITSR registers
*/
struct rzg2l_irqc_reg_cache {
+ u32 nitsr;
u32 iitsr;
+ u32 inttsel;
u32 titsr[2];
};
+/**
+ * struct rzg2l_hw_info - Interrupt Control Unit controller hardware info structure.
+ * @tssel_lut: TINT lookup table
+ * @irq_count: Number of IRQC interrupts
+ * @tint_start: Start of TINT interrupts
+ * @num_irq: Total Number of interrupts
+ * @shared_irq_cnt: Number of shared interrupts
+ */
+struct rzg2l_hw_info {
+ const u8 *tssel_lut;
+ unsigned int irq_count;
+ unsigned int tint_start;
+ unsigned int num_irq;
+ unsigned int shared_irq_cnt;
+};
+
/**
* struct rzg2l_irqc_priv - IRQ controller private data structure
* @base: Controller's base address
- * @irqchip: Pointer to struct irq_chip
+ * @irq_chip: Pointer to struct irq_chip for irq
+ * @tint_chip: Pointer to struct irq_chip for tint
* @fwspec: IRQ firmware specific data
* @lock: Lock to serialize access to hardware registers
+ * @info: Hardware specific data
* @cache: Registers cache for suspend/resume
+ * @used_irqs: Bitmap to manage the shared interrupts
*/
static struct rzg2l_irqc_priv {
void __iomem *base;
- const struct irq_chip *irqchip;
- struct irq_fwspec fwspec[IRQC_NUM_IRQ];
+ const struct irq_chip *irq_chip;
+ const struct irq_chip *tint_chip;
+ struct irq_fwspec *fwspec;
raw_spinlock_t lock;
+ struct rzg2l_hw_info info;
struct rzg2l_irqc_reg_cache cache;
+ DECLARE_BITMAP(used_irqs, IRQC_SHARED_IRQ_COUNT);
} *rzg2l_irqc_data;
static struct rzg2l_irqc_priv *irq_data_to_priv(struct irq_data *data)
@@ -89,6 +125,28 @@ static struct rzg2l_irqc_priv *irq_data_to_priv(struct irq_data *data)
return data->domain->host_data;
}
+static void rzg2l_clear_nmi_int(struct rzg2l_irqc_priv *priv)
+{
+ u32 bit = BIT(NSCR_NSTAT);
+ u32 reg;
+
+ /*
+ * No locking required as the register is not shared
+ * with other interrupts.
+ *
+ * Writing is allowed only when NSTAT is 1
+ */
+ reg = readl_relaxed(priv->base + NSCR);
+ if (reg & bit) {
+ writel_relaxed(reg & ~bit, priv->base + NSCR);
+ /*
+ * Enforce that the posted write is flushed to prevent that the
+ * just handled interrupt is raised again.
+ */
+ readl_relaxed(priv->base + NSCR);
+ }
+}
+
static void rzg2l_clear_irq_int(struct rzg2l_irqc_priv *priv, unsigned int hwirq)
{
unsigned int hw_irq = hwirq - IRQC_IRQ_START;
@@ -114,7 +172,7 @@ static void rzg2l_clear_irq_int(struct rzg2l_irqc_priv *priv, unsigned int hwirq
static void rzg2l_clear_tint_int(struct rzg2l_irqc_priv *priv, unsigned int hwirq)
{
- u32 bit = BIT(hwirq - IRQC_TINT_START);
+ u32 bit = BIT(hwirq - priv->info.tint_start);
u32 reg;
reg = readl_relaxed(priv->base + TSCR);
@@ -128,17 +186,33 @@ static void rzg2l_clear_tint_int(struct rzg2l_irqc_priv *priv, unsigned int hwir
}
}
-static void rzg2l_irqc_eoi(struct irq_data *d)
+static void rzg2l_irqc_nmi_eoi(struct irq_data *d)
+{
+ struct rzg2l_irqc_priv *priv = irq_data_to_priv(d);
+
+ rzg2l_clear_nmi_int(priv);
+ irq_chip_eoi_parent(d);
+}
+
+static void rzg2l_irqc_irq_eoi(struct irq_data *d)
{
struct rzg2l_irqc_priv *priv = irq_data_to_priv(d);
unsigned int hw_irq = irqd_to_hwirq(d);
- raw_spin_lock(&priv->lock);
- if (hw_irq >= IRQC_IRQ_START && hw_irq <= IRQC_IRQ_COUNT)
+ scoped_guard(raw_spinlock, &priv->lock)
rzg2l_clear_irq_int(priv, hw_irq);
- else if (hw_irq >= IRQC_TINT_START && hw_irq < IRQC_NUM_IRQ)
+
+ irq_chip_eoi_parent(d);
+}
+
+static void rzg2l_irqc_tint_eoi(struct irq_data *d)
+{
+ struct rzg2l_irqc_priv *priv = irq_data_to_priv(d);
+ unsigned int hw_irq = irqd_to_hwirq(d);
+
+ scoped_guard(raw_spinlock, &priv->lock)
rzg2l_clear_tint_int(priv, hw_irq);
- raw_spin_unlock(&priv->lock);
+
irq_chip_eoi_parent(d);
}
@@ -161,7 +235,7 @@ static void rzfive_irqc_unmask_irq_interrupt(struct rzg2l_irqc_priv *priv,
static void rzfive_irqc_mask_tint_interrupt(struct rzg2l_irqc_priv *priv,
unsigned int hwirq)
{
- u32 bit = BIT(hwirq - IRQC_TINT_START);
+ u32 bit = BIT(hwirq - priv->info.tint_start);
writel_relaxed(readl_relaxed(priv->base + TMSK) | bit, priv->base + TMSK);
}
@@ -169,125 +243,170 @@ static void rzfive_irqc_mask_tint_interrupt(struct rzg2l_irqc_priv *priv,
static void rzfive_irqc_unmask_tint_interrupt(struct rzg2l_irqc_priv *priv,
unsigned int hwirq)
{
- u32 bit = BIT(hwirq - IRQC_TINT_START);
+ u32 bit = BIT(hwirq - priv->info.tint_start);
writel_relaxed(readl_relaxed(priv->base + TMSK) & ~bit, priv->base + TMSK);
}
-static void rzfive_irqc_mask(struct irq_data *d)
+static void rzfive_irqc_irq_mask(struct irq_data *d)
{
struct rzg2l_irqc_priv *priv = irq_data_to_priv(d);
unsigned int hwirq = irqd_to_hwirq(d);
- raw_spin_lock(&priv->lock);
- if (hwirq >= IRQC_IRQ_START && hwirq <= IRQC_IRQ_COUNT)
+ scoped_guard(raw_spinlock, &priv->lock)
rzfive_irqc_mask_irq_interrupt(priv, hwirq);
- else if (hwirq >= IRQC_TINT_START && hwirq < IRQC_NUM_IRQ)
+
+ irq_chip_mask_parent(d);
+}
+
+static void rzfive_irqc_tint_mask(struct irq_data *d)
+{
+ struct rzg2l_irqc_priv *priv = irq_data_to_priv(d);
+ unsigned int hwirq = irqd_to_hwirq(d);
+
+ scoped_guard(raw_spinlock, &priv->lock)
rzfive_irqc_mask_tint_interrupt(priv, hwirq);
- raw_spin_unlock(&priv->lock);
+
irq_chip_mask_parent(d);
}
-static void rzfive_irqc_unmask(struct irq_data *d)
+static void rzfive_irqc_irq_unmask(struct irq_data *d)
{
struct rzg2l_irqc_priv *priv = irq_data_to_priv(d);
unsigned int hwirq = irqd_to_hwirq(d);
- raw_spin_lock(&priv->lock);
- if (hwirq >= IRQC_IRQ_START && hwirq <= IRQC_IRQ_COUNT)
+ scoped_guard(raw_spinlock, &priv->lock)
rzfive_irqc_unmask_irq_interrupt(priv, hwirq);
- else if (hwirq >= IRQC_TINT_START && hwirq < IRQC_NUM_IRQ)
+
+ irq_chip_unmask_parent(d);
+}
+
+static void rzfive_irqc_tint_unmask(struct irq_data *d)
+{
+ struct rzg2l_irqc_priv *priv = irq_data_to_priv(d);
+ unsigned int hwirq = irqd_to_hwirq(d);
+
+ scoped_guard(raw_spinlock, &priv->lock)
rzfive_irqc_unmask_tint_interrupt(priv, hwirq);
- raw_spin_unlock(&priv->lock);
+
irq_chip_unmask_parent(d);
}
-static void rzfive_tint_irq_endisable(struct irq_data *d, bool enable)
+static void rzfive_irq_endisable(struct irq_data *d, bool enable)
{
struct rzg2l_irqc_priv *priv = irq_data_to_priv(d);
unsigned int hwirq = irqd_to_hwirq(d);
- if (hwirq >= IRQC_TINT_START && hwirq < IRQC_NUM_IRQ) {
- u32 offset = hwirq - IRQC_TINT_START;
- u32 tssr_offset = TSSR_OFFSET(offset);
- u8 tssr_index = TSSR_INDEX(offset);
- u32 reg;
+ guard(raw_spinlock)(&priv->lock);
+ if (enable)
+ rzfive_irqc_unmask_irq_interrupt(priv, hwirq);
+ else
+ rzfive_irqc_mask_irq_interrupt(priv, hwirq);
+}
- raw_spin_lock(&priv->lock);
- if (enable)
- rzfive_irqc_unmask_tint_interrupt(priv, hwirq);
- else
- rzfive_irqc_mask_tint_interrupt(priv, hwirq);
- reg = readl_relaxed(priv->base + TSSR(tssr_index));
- if (enable)
- reg |= TIEN << TSSEL_SHIFT(tssr_offset);
- else
- reg &= ~(TIEN << TSSEL_SHIFT(tssr_offset));
- writel_relaxed(reg, priv->base + TSSR(tssr_index));
- raw_spin_unlock(&priv->lock);
- } else {
- raw_spin_lock(&priv->lock);
- if (enable)
- rzfive_irqc_unmask_irq_interrupt(priv, hwirq);
- else
- rzfive_irqc_mask_irq_interrupt(priv, hwirq);
- raw_spin_unlock(&priv->lock);
- }
+static void rzfive_tint_endisable(struct irq_data *d, bool enable)
+{
+ struct rzg2l_irqc_priv *priv = irq_data_to_priv(d);
+ unsigned int hwirq = irqd_to_hwirq(d);
+ unsigned int offset = hwirq - priv->info.tint_start;
+ unsigned int tssr_offset = TSSR_OFFSET(offset);
+ unsigned int tssr_index = TSSR_INDEX(offset);
+ u32 reg;
+
+ guard(raw_spinlock)(&priv->lock);
+ if (enable)
+ rzfive_irqc_unmask_tint_interrupt(priv, hwirq);
+ else
+ rzfive_irqc_mask_tint_interrupt(priv, hwirq);
+ reg = readl_relaxed(priv->base + TSSR(tssr_index));
+ if (enable)
+ reg |= TIEN << TSSEL_SHIFT(tssr_offset);
+ else
+ reg &= ~(TIEN << TSSEL_SHIFT(tssr_offset));
+ writel_relaxed(reg, priv->base + TSSR(tssr_index));
}
static void rzfive_irqc_irq_disable(struct irq_data *d)
{
irq_chip_disable_parent(d);
- rzfive_tint_irq_endisable(d, false);
+ rzfive_irq_endisable(d, false);
}
static void rzfive_irqc_irq_enable(struct irq_data *d)
{
- rzfive_tint_irq_endisable(d, true);
+ rzfive_irq_endisable(d, true);
+ irq_chip_enable_parent(d);
+}
+
+static void rzfive_irqc_tint_disable(struct irq_data *d)
+{
+ irq_chip_disable_parent(d);
+ rzfive_tint_endisable(d, false);
+}
+
+static void rzfive_irqc_tint_enable(struct irq_data *d)
+{
+ rzfive_tint_endisable(d, true);
irq_chip_enable_parent(d);
}
static void rzg2l_tint_irq_endisable(struct irq_data *d, bool enable)
{
+ struct rzg2l_irqc_priv *priv = irq_data_to_priv(d);
unsigned int hw_irq = irqd_to_hwirq(d);
+ unsigned int offset = hw_irq - priv->info.tint_start;
+ unsigned int tssr_offset = TSSR_OFFSET(offset);
+ unsigned int tssr_index = TSSR_INDEX(offset);
+ u32 reg;
- if (hw_irq >= IRQC_TINT_START && hw_irq < IRQC_NUM_IRQ) {
- struct rzg2l_irqc_priv *priv = irq_data_to_priv(d);
- u32 offset = hw_irq - IRQC_TINT_START;
- u32 tssr_offset = TSSR_OFFSET(offset);
- u8 tssr_index = TSSR_INDEX(offset);
- u32 reg;
-
- raw_spin_lock(&priv->lock);
- reg = readl_relaxed(priv->base + TSSR(tssr_index));
- if (enable)
- reg |= TIEN << TSSEL_SHIFT(tssr_offset);
- else
- reg &= ~(TIEN << TSSEL_SHIFT(tssr_offset));
- writel_relaxed(reg, priv->base + TSSR(tssr_index));
- raw_spin_unlock(&priv->lock);
- }
+ guard(raw_spinlock)(&priv->lock);
+ reg = readl_relaxed(priv->base + TSSR(tssr_index));
+ if (enable)
+ reg |= TIEN << TSSEL_SHIFT(tssr_offset);
+ else
+ reg &= ~(TIEN << TSSEL_SHIFT(tssr_offset));
+ writel_relaxed(reg, priv->base + TSSR(tssr_index));
}
-static void rzg2l_irqc_irq_disable(struct irq_data *d)
+static void rzg2l_irqc_tint_disable(struct irq_data *d)
{
irq_chip_disable_parent(d);
rzg2l_tint_irq_endisable(d, false);
}
-static void rzg2l_irqc_irq_enable(struct irq_data *d)
+static void rzg2l_irqc_tint_enable(struct irq_data *d)
{
rzg2l_tint_irq_endisable(d, true);
irq_chip_enable_parent(d);
}
+static int rzg2l_nmi_set_type(struct irq_data *d, unsigned int type)
+{
+ struct rzg2l_irqc_priv *priv = irq_data_to_priv(d);
+ u32 sense;
+
+ switch (type & IRQ_TYPE_SENSE_MASK) {
+ case IRQ_TYPE_EDGE_FALLING:
+ sense = NITSR_NTSEL_EDGE_FALLING;
+ break;
+ case IRQ_TYPE_EDGE_RISING:
+ sense = NITSR_NTSEL_EDGE_RISING;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ writel_relaxed(sense, priv->base + NITSR);
+ return 0;
+}
+
static int rzg2l_irq_set_type(struct irq_data *d, unsigned int type)
{
struct rzg2l_irqc_priv *priv = irq_data_to_priv(d);
unsigned int hwirq = irqd_to_hwirq(d);
- u32 iitseln = hwirq - IRQC_IRQ_START;
+ unsigned int iitseln = hwirq - IRQC_IRQ_START;
bool clear_irq_int = false;
- u16 sense, tmp;
+ unsigned int sense, tmp;
switch (type & IRQ_TYPE_SENSE_MASK) {
case IRQ_TYPE_LEVEL_LOW:
@@ -313,14 +432,13 @@ static int rzg2l_irq_set_type(struct irq_data *d, unsigned int type)
return -EINVAL;
}
- raw_spin_lock(&priv->lock);
+ guard(raw_spinlock)(&priv->lock);
tmp = readl_relaxed(priv->base + IITSR);
tmp &= ~IITSR_IITSEL_MASK(iitseln);
tmp |= IITSR_IITSEL(iitseln, sense);
if (clear_irq_int)
rzg2l_clear_irq_int(priv, hwirq);
writel_relaxed(tmp, priv->base + IITSR);
- raw_spin_unlock(&priv->lock);
return 0;
}
@@ -331,6 +449,11 @@ static u32 rzg2l_disable_tint_and_set_tint_source(struct irq_data *d, struct rzg
u32 tint = (u32)(uintptr_t)irq_data_get_irq_chip_data(d);
u32 tien = reg & (TIEN << TSSEL_SHIFT(tssr_offset));
+ if (priv->info.tssel_lut)
+ tint = priv->info.tssel_lut[tint];
+ else
+ tint = (u32)(uintptr_t)irq_data_get_irq_chip_data(d);
+
/* Clear the relevant byte in reg */
reg &= ~(TSSEL_MASK << TSSEL_SHIFT(tssr_offset));
/* Set TINT and leave TIEN clear */
@@ -344,10 +467,10 @@ static int rzg2l_tint_set_edge(struct irq_data *d, unsigned int type)
{
struct rzg2l_irqc_priv *priv = irq_data_to_priv(d);
unsigned int hwirq = irqd_to_hwirq(d);
- u32 titseln = hwirq - IRQC_TINT_START;
- u32 tssr_offset = TSSR_OFFSET(titseln);
- u8 tssr_index = TSSR_INDEX(titseln);
- u8 index, sense;
+ unsigned int titseln = hwirq - priv->info.tint_start;
+ unsigned int tssr_offset = TSSR_OFFSET(titseln);
+ unsigned int tssr_index = TSSR_INDEX(titseln);
+ unsigned int index, sense;
u32 reg, tssr;
switch (type & IRQ_TYPE_SENSE_MASK) {
@@ -383,15 +506,31 @@ static int rzg2l_tint_set_edge(struct irq_data *d, unsigned int type)
return 0;
}
-static int rzg2l_irqc_set_type(struct irq_data *d, unsigned int type)
+static int rzg2l_irqc_irq_set_type(struct irq_data *d, unsigned int type)
{
- unsigned int hw_irq = irqd_to_hwirq(d);
- int ret = -EINVAL;
+ int ret = rzg2l_irq_set_type(d, type);
+
+ if (ret)
+ return ret;
+
+ return irq_chip_set_type_parent(d, IRQ_TYPE_LEVEL_HIGH);
+}
+
+static int rzg2l_irqc_tint_set_type(struct irq_data *d, unsigned int type)
+{
+ int ret = rzg2l_tint_set_edge(d, type);
- if (hw_irq >= IRQC_IRQ_START && hw_irq <= IRQC_IRQ_COUNT)
- ret = rzg2l_irq_set_type(d, type);
- else if (hw_irq >= IRQC_TINT_START && hw_irq < IRQC_NUM_IRQ)
- ret = rzg2l_tint_set_edge(d, type);
+ if (ret)
+ return ret;
+
+ return irq_chip_set_type_parent(d, IRQ_TYPE_LEVEL_HIGH);
+}
+
+static int rzg2l_irqc_nmi_set_type(struct irq_data *d, unsigned int type)
+{
+ int ret;
+
+ ret = rzg2l_nmi_set_type(d, type);
if (ret)
return ret;
@@ -403,7 +542,10 @@ static int rzg2l_irqc_irq_suspend(void *data)
struct rzg2l_irqc_reg_cache *cache = &rzg2l_irqc_data->cache;
void __iomem *base = rzg2l_irqc_data->base;
+ cache->nitsr = readl_relaxed(base + NITSR);
cache->iitsr = readl_relaxed(base + IITSR);
+ if (rzg2l_irqc_data->info.shared_irq_cnt)
+ cache->inttsel = readl_relaxed(base + INTTSEL);
for (u8 i = 0; i < 2; i++)
cache->titsr[i] = readl_relaxed(base + TITSR(i));
@@ -422,7 +564,10 @@ static void rzg2l_irqc_irq_resume(void *data)
*/
for (u8 i = 0; i < 2; i++)
writel_relaxed(cache->titsr[i], base + TITSR(i));
+ if (rzg2l_irqc_data->info.shared_irq_cnt)
+ writel_relaxed(cache->inttsel, base + INTTSEL);
writel_relaxed(cache->iitsr, base + IITSR);
+ writel_relaxed(cache->nitsr, base + NITSR);
}
static const struct syscore_ops rzg2l_irqc_syscore_ops = {
@@ -434,44 +579,162 @@ static struct syscore rzg2l_irqc_syscore = {
.ops = &rzg2l_irqc_syscore_ops,
};
-static const struct irq_chip rzg2l_irqc_chip = {
+static const struct irq_chip rzg2l_irqc_nmi_chip = {
+ .name = "rzg2l-irqc",
+ .irq_eoi = rzg2l_irqc_nmi_eoi,
+ .irq_mask = irq_chip_mask_parent,
+ .irq_unmask = irq_chip_unmask_parent,
+ .irq_disable = irq_chip_disable_parent,
+ .irq_enable = irq_chip_enable_parent,
+ .irq_get_irqchip_state = irq_chip_get_parent_state,
+ .irq_set_irqchip_state = irq_chip_set_parent_state,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .irq_set_type = rzg2l_irqc_nmi_set_type,
+ .irq_set_affinity = irq_chip_set_affinity_parent,
+ .flags = IRQCHIP_MASK_ON_SUSPEND |
+ IRQCHIP_SET_TYPE_MASKED |
+ IRQCHIP_SKIP_SET_WAKE,
+};
+
+static const struct irq_chip rzg2l_irqc_irq_chip = {
.name = "rzg2l-irqc",
- .irq_eoi = rzg2l_irqc_eoi,
+ .irq_eoi = rzg2l_irqc_irq_eoi,
.irq_mask = irq_chip_mask_parent,
.irq_unmask = irq_chip_unmask_parent,
- .irq_disable = rzg2l_irqc_irq_disable,
- .irq_enable = rzg2l_irqc_irq_enable,
+ .irq_disable = irq_chip_disable_parent,
+ .irq_enable = irq_chip_enable_parent,
.irq_get_irqchip_state = irq_chip_get_parent_state,
.irq_set_irqchip_state = irq_chip_set_parent_state,
.irq_retrigger = irq_chip_retrigger_hierarchy,
- .irq_set_type = rzg2l_irqc_set_type,
+ .irq_set_type = rzg2l_irqc_irq_set_type,
.irq_set_affinity = irq_chip_set_affinity_parent,
.flags = IRQCHIP_MASK_ON_SUSPEND |
IRQCHIP_SET_TYPE_MASKED |
IRQCHIP_SKIP_SET_WAKE,
};
-static const struct irq_chip rzfive_irqc_chip = {
+static const struct irq_chip rzg2l_irqc_tint_chip = {
+ .name = "rzg2l-irqc",
+ .irq_eoi = rzg2l_irqc_tint_eoi,
+ .irq_mask = irq_chip_mask_parent,
+ .irq_unmask = irq_chip_unmask_parent,
+ .irq_disable = rzg2l_irqc_tint_disable,
+ .irq_enable = rzg2l_irqc_tint_enable,
+ .irq_get_irqchip_state = irq_chip_get_parent_state,
+ .irq_set_irqchip_state = irq_chip_set_parent_state,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .irq_set_type = rzg2l_irqc_tint_set_type,
+ .irq_set_affinity = irq_chip_set_affinity_parent,
+ .flags = IRQCHIP_MASK_ON_SUSPEND |
+ IRQCHIP_SET_TYPE_MASKED |
+ IRQCHIP_SKIP_SET_WAKE,
+};
+
+static const struct irq_chip rzfive_irqc_irq_chip = {
.name = "rzfive-irqc",
- .irq_eoi = rzg2l_irqc_eoi,
- .irq_mask = rzfive_irqc_mask,
- .irq_unmask = rzfive_irqc_unmask,
+ .irq_eoi = rzg2l_irqc_irq_eoi,
+ .irq_mask = rzfive_irqc_irq_mask,
+ .irq_unmask = rzfive_irqc_irq_unmask,
.irq_disable = rzfive_irqc_irq_disable,
.irq_enable = rzfive_irqc_irq_enable,
.irq_get_irqchip_state = irq_chip_get_parent_state,
.irq_set_irqchip_state = irq_chip_set_parent_state,
.irq_retrigger = irq_chip_retrigger_hierarchy,
- .irq_set_type = rzg2l_irqc_set_type,
+ .irq_set_type = rzg2l_irqc_irq_set_type,
.irq_set_affinity = irq_chip_set_affinity_parent,
.flags = IRQCHIP_MASK_ON_SUSPEND |
IRQCHIP_SET_TYPE_MASKED |
IRQCHIP_SKIP_SET_WAKE,
};
+static const struct irq_chip rzfive_irqc_tint_chip = {
+ .name = "rzfive-irqc",
+ .irq_eoi = rzg2l_irqc_tint_eoi,
+ .irq_mask = rzfive_irqc_tint_mask,
+ .irq_unmask = rzfive_irqc_tint_unmask,
+ .irq_disable = rzfive_irqc_tint_disable,
+ .irq_enable = rzfive_irqc_tint_enable,
+ .irq_get_irqchip_state = irq_chip_get_parent_state,
+ .irq_set_irqchip_state = irq_chip_set_parent_state,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .irq_set_type = rzg2l_irqc_tint_set_type,
+ .irq_set_affinity = irq_chip_set_affinity_parent,
+ .flags = IRQCHIP_MASK_ON_SUSPEND |
+ IRQCHIP_SET_TYPE_MASKED |
+ IRQCHIP_SKIP_SET_WAKE,
+};
+
+static bool rzg2l_irqc_is_shared_irqc(const struct rzg2l_hw_info info, unsigned int hw_irq)
+{
+ return ((hw_irq >= (info.tint_start - info.shared_irq_cnt)) && hw_irq < info.tint_start);
+}
+
+static bool rzg2l_irqc_is_shared_tint(const struct rzg2l_hw_info info, unsigned int hw_irq)
+{
+ return ((hw_irq >= (info.num_irq - info.shared_irq_cnt)) && hw_irq < info.num_irq);
+}
+
+static bool rzg2l_irqc_is_shared_and_get_irq_num(struct rzg2l_irqc_priv *priv,
+ irq_hw_number_t hwirq, unsigned int *irq_num)
+{
+ bool is_shared = false;
+
+ if (rzg2l_irqc_is_shared_irqc(priv->info, hwirq)) {
+ *irq_num = hwirq - IRQC_IRQ_SHARED_START;
+ is_shared = true;
+ } else if (rzg2l_irqc_is_shared_tint(priv->info, hwirq)) {
+ *irq_num = hwirq - IRQC_TINT_COUNT - IRQC_IRQ_SHARED_START;
+ is_shared = true;
+ }
+
+ return is_shared;
+}
+
+static void rzg2l_irqc_set_inttsel(struct rzg2l_irqc_priv *priv, unsigned int offset,
+ unsigned int select_irq)
+{
+ u32 reg;
+
+ guard(raw_spinlock_irqsave)(&priv->lock);
+ reg = readl_relaxed(priv->base + INTTSEL);
+ if (select_irq)
+ reg |= INTTSEL_TINTSEL(offset);
+ else
+ reg &= ~INTTSEL_TINTSEL(offset);
+ writel_relaxed(reg, priv->base + INTTSEL);
+}
+
+static int rzg2l_irqc_shared_irq_alloc(struct rzg2l_irqc_priv *priv, irq_hw_number_t hwirq)
+{
+ unsigned int irq_num;
+
+ if (rzg2l_irqc_is_shared_and_get_irq_num(priv, hwirq, &irq_num)) {
+ if (test_and_set_bit(irq_num, priv->used_irqs))
+ return -EBUSY;
+
+ if (hwirq < priv->info.tint_start)
+ rzg2l_irqc_set_inttsel(priv, INTTSEL_TINTSEL_START + irq_num, 1);
+ else
+ rzg2l_irqc_set_inttsel(priv, INTTSEL_TINTSEL_START + irq_num, 0);
+ }
+
+ return 0;
+}
+
+static void rzg2l_irqc_shared_irq_free(struct rzg2l_irqc_priv *priv, irq_hw_number_t hwirq)
+{
+ unsigned int irq_num;
+
+ if (rzg2l_irqc_is_shared_and_get_irq_num(priv, hwirq, &irq_num) &&
+ test_and_clear_bit(irq_num, priv->used_irqs))
+ rzg2l_irqc_set_inttsel(priv, INTTSEL_TINTSEL_START + irq_num, 0);
+}
+
static int rzg2l_irqc_alloc(struct irq_domain *domain, unsigned int virq,
unsigned int nr_irqs, void *arg)
{
struct rzg2l_irqc_priv *priv = domain->host_data;
+ const struct irq_chip *chip;
unsigned long tint = 0;
irq_hw_number_t hwirq;
unsigned int type;
@@ -488,28 +751,57 @@ static int rzg2l_irqc_alloc(struct irq_domain *domain, unsigned int virq,
* from 16-31 bits. TINT from the pinctrl driver needs to be programmed
* in IRQC registers to enable a given gpio pin as interrupt.
*/
- if (hwirq > IRQC_IRQ_COUNT) {
+ if (hwirq == IRQC_NMI) {
+ chip = &rzg2l_irqc_nmi_chip;
+ } else if (hwirq > priv->info.irq_count) {
tint = TINT_EXTRACT_GPIOINT(hwirq);
hwirq = TINT_EXTRACT_HWIRQ(hwirq);
-
- if (hwirq < IRQC_TINT_START)
- return -EINVAL;
+ chip = priv->tint_chip;
+ } else {
+ chip = priv->irq_chip;
}
- if (hwirq > (IRQC_NUM_IRQ - 1))
+ if (hwirq >= priv->info.num_irq)
return -EINVAL;
- ret = irq_domain_set_hwirq_and_chip(domain, virq, hwirq, priv->irqchip,
- (void *)(uintptr_t)tint);
+ if (priv->info.shared_irq_cnt) {
+ ret = rzg2l_irqc_shared_irq_alloc(priv, hwirq);
+ if (ret)
+ return ret;
+ }
+
+ ret = irq_domain_set_hwirq_and_chip(domain, virq, hwirq, chip, (void *)(uintptr_t)tint);
if (ret)
- return ret;
+ goto shared_irq_free;
+
+ ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, &priv->fwspec[hwirq]);
+ if (ret)
+ goto shared_irq_free;
+
+ return 0;
+
+shared_irq_free:
+ if (priv->info.shared_irq_cnt)
+ rzg2l_irqc_shared_irq_free(priv, hwirq);
- return irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, &priv->fwspec[hwirq]);
+ return ret;
+}
+
+static void rzg2l_irqc_free(struct irq_domain *domain, unsigned int virq, unsigned int nr_irqs)
+{
+ struct irq_data *d = irq_domain_get_irq_data(domain, virq);
+ struct rzg2l_irqc_priv *priv = domain->host_data;
+ irq_hw_number_t hwirq = irqd_to_hwirq(d);
+
+ irq_domain_free_irqs_common(domain, virq, nr_irqs);
+
+ if (priv->info.shared_irq_cnt)
+ rzg2l_irqc_shared_irq_free(priv, hwirq);
}
static const struct irq_domain_ops rzg2l_irqc_domain_ops = {
.alloc = rzg2l_irqc_alloc,
- .free = irq_domain_free_irqs_common,
+ .free = rzg2l_irqc_free,
.translate = irq_domain_translate_twocell,
};
@@ -520,7 +812,7 @@ static int rzg2l_irqc_parse_interrupts(struct rzg2l_irqc_priv *priv,
unsigned int i;
int ret;
- for (i = 0; i < IRQC_NUM_IRQ; i++) {
+ for (i = 0; i < priv->info.num_irq; i++) {
ret = of_irq_parse_one(np, i, &map);
if (ret)
return ret;
@@ -532,7 +824,9 @@ static int rzg2l_irqc_parse_interrupts(struct rzg2l_irqc_priv *priv,
}
static int rzg2l_irqc_common_probe(struct platform_device *pdev, struct device_node *parent,
- const struct irq_chip *irq_chip)
+ const struct irq_chip *irq_chip,
+ const struct irq_chip *tint_chip,
+ const struct rzg2l_hw_info info)
{
struct irq_domain *irq_domain, *parent_domain;
struct device_node *node = pdev->dev.of_node;
@@ -548,12 +842,20 @@ static int rzg2l_irqc_common_probe(struct platform_device *pdev, struct device_n
if (!rzg2l_irqc_data)
return -ENOMEM;
- rzg2l_irqc_data->irqchip = irq_chip;
+ rzg2l_irqc_data->irq_chip = irq_chip;
+ rzg2l_irqc_data->tint_chip = tint_chip;
rzg2l_irqc_data->base = devm_of_iomap(dev, dev->of_node, 0, NULL);
if (IS_ERR(rzg2l_irqc_data->base))
return PTR_ERR(rzg2l_irqc_data->base);
+ rzg2l_irqc_data->info = info;
+
+ rzg2l_irqc_data->fwspec = devm_kcalloc(&pdev->dev, info.num_irq,
+ sizeof(*rzg2l_irqc_data->fwspec), GFP_KERNEL);
+ if (!rzg2l_irqc_data->fwspec)
+ return -ENOMEM;
+
ret = rzg2l_irqc_parse_interrupts(rzg2l_irqc_data, node);
if (ret)
return dev_err_probe(dev, ret, "cannot parse interrupts: %d\n", ret);
@@ -574,10 +876,10 @@ static int rzg2l_irqc_common_probe(struct platform_device *pdev, struct device_n
raw_spin_lock_init(&rzg2l_irqc_data->lock);
- irq_domain = irq_domain_create_hierarchy(parent_domain, 0, IRQC_NUM_IRQ, dev_fwnode(dev),
+ irq_domain = irq_domain_create_hierarchy(parent_domain, 0, info.num_irq, dev_fwnode(dev),
&rzg2l_irqc_domain_ops, rzg2l_irqc_data);
if (!irq_domain) {
- pm_runtime_put(dev);
+ pm_runtime_put_sync(dev);
return -ENOMEM;
}
@@ -586,18 +888,64 @@ static int rzg2l_irqc_common_probe(struct platform_device *pdev, struct device_n
return 0;
}
+/* Mapping based on port index on Table 4.2-1 and GPIOINT on Table 4.6-7 */
+static const u8 rzg3l_tssel_lut[] = {
+ 83, 84, /* P20-P21 */
+ 7, 8, 9, 10, 11, 12, 13, /* P30-P36 */
+ 85, 86, 87, 88, 89, 90, 91, /* P50-P56 */
+ 92, 93, 94, 95, 96, 97, 98, /* P60-P66 */
+ 99, 100, 101, 102, 103, 104, 105, 106, /* P70-P77 */
+ 107, 108, 109, 110, 111, 112, /* P80-P85 */
+ 45, 46, 47, 48, 49, 50, 51, 52, /* PA0-PA7 */
+ 53, 54, 55, 56, 57, 58, 59, 60, /* PB0-PB7 */
+ 61, 62, 63, /* PC0-PC2 */
+ 64, 65, 66, 67, 68, 69, 70, 71, /* PD0-PD7 */
+ 72, 73, 74, 75, 76, 77, 78, 79, /* PE0-PE7 */
+ 80, 81, 82, /* PF0-PF2 */
+ 27, 28, 29, 30, 31, 32, 33, 34, /* PG0-PG7 */
+ 35, 36, 37, 38, 39, 40, /* PH0-PH5 */
+ 2, 3, 4, 5, 6, /* PJ0-PJ4 */
+ 41, 42, 43, 44, /* PK0-PK3 */
+ 14, 15, 16, 17, 26, /* PL0-PL4 */
+ 18, 19, 20, 21, 22, 23, 24, 25, /* PM0-PM7 */
+ 0, 1 /* PS0-PS1 */
+};
+
+static const struct rzg2l_hw_info rzg3l_hw_params = {
+ .tssel_lut = rzg3l_tssel_lut,
+ .irq_count = 16,
+ .tint_start = IRQC_IRQ_START + 16,
+ .num_irq = IRQC_IRQ_START + 16 + IRQC_TINT_COUNT,
+ .shared_irq_cnt = IRQC_SHARED_IRQ_COUNT,
+};
+
+static const struct rzg2l_hw_info rzg2l_hw_params = {
+ .irq_count = 8,
+ .tint_start = IRQC_IRQ_START + 8,
+ .num_irq = IRQC_IRQ_START + 8 + IRQC_TINT_COUNT,
+};
+
static int rzg2l_irqc_probe(struct platform_device *pdev, struct device_node *parent)
{
- return rzg2l_irqc_common_probe(pdev, parent, &rzg2l_irqc_chip);
+ return rzg2l_irqc_common_probe(pdev, parent, &rzg2l_irqc_irq_chip, &rzg2l_irqc_tint_chip,
+ rzg2l_hw_params);
+}
+
+static int rzg3l_irqc_probe(struct platform_device *pdev, struct device_node *parent)
+{
+ return rzg2l_irqc_common_probe(pdev, parent, &rzg2l_irqc_irq_chip, &rzg2l_irqc_tint_chip,
+ rzg3l_hw_params);
}
static int rzfive_irqc_probe(struct platform_device *pdev, struct device_node *parent)
{
- return rzg2l_irqc_common_probe(pdev, parent, &rzfive_irqc_chip);
+ return rzg2l_irqc_common_probe(pdev, parent, &rzfive_irqc_irq_chip, &rzfive_irqc_tint_chip,
+ rzg2l_hw_params);
}
IRQCHIP_PLATFORM_DRIVER_BEGIN(rzg2l_irqc)
IRQCHIP_MATCH("renesas,rzg2l-irqc", rzg2l_irqc_probe)
+IRQCHIP_MATCH("renesas,r9a08g046-irqc", rzg3l_irqc_probe)
IRQCHIP_MATCH("renesas,r9a07g043f-irqc", rzfive_irqc_probe)
IRQCHIP_PLATFORM_DRIVER_END(rzg2l_irqc)
MODULE_AUTHOR("Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>");
diff --git a/drivers/irqchip/irq-renesas-rzv2h.c b/drivers/irqchip/irq-renesas-rzv2h.c
index 03e93b061edd..31c543c876b1 100644
--- a/drivers/irqchip/irq-renesas-rzv2h.c
+++ b/drivers/irqchip/irq-renesas-rzv2h.c
@@ -12,6 +12,7 @@
#include <linux/bitfield.h>
#include <linux/cleanup.h>
#include <linux/err.h>
+#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/irqchip.h>
#include <linux/irqchip/irq-renesas-rzv2h.h>
@@ -25,9 +26,17 @@
/* DT "interrupts" indexes */
#define ICU_IRQ_START 1
#define ICU_IRQ_COUNT 16
-#define ICU_TINT_START (ICU_IRQ_START + ICU_IRQ_COUNT)
+#define ICU_IRQ_LAST (ICU_IRQ_START + ICU_IRQ_COUNT - 1)
+#define ICU_TINT_START (ICU_IRQ_LAST + 1)
#define ICU_TINT_COUNT 32
-#define ICU_NUM_IRQ (ICU_TINT_START + ICU_TINT_COUNT)
+#define ICU_TINT_LAST (ICU_TINT_START + ICU_TINT_COUNT - 1)
+#define ICU_CA55_INT_START (ICU_TINT_LAST + 1)
+#define ICU_CA55_INT_COUNT 4
+#define ICU_CA55_INT_LAST (ICU_CA55_INT_START + ICU_CA55_INT_COUNT - 1)
+#define ICU_ERR_INT_START (ICU_CA55_INT_LAST + 1)
+#define ICU_ERR_INT_COUNT 1
+#define ICU_ERR_INT_LAST (ICU_ERR_INT_START + ICU_ERR_INT_COUNT - 1)
+#define ICU_NUM_IRQ (ICU_ERR_INT_LAST + 1)
/* Registers */
#define ICU_NSCNT 0x00
@@ -40,6 +49,15 @@
#define ICU_TSCLR 0x24
#define ICU_TITSR(k) (0x28 + (k) * 4)
#define ICU_TSSR(k) (0x30 + (k) * 4)
+#define ICU_BEISR(k) (0x70 + (k) * 4)
+#define ICU_BECLR(k) (0x80 + (k) * 4)
+#define ICU_EREISR(k) (0x90 + (k) * 4)
+#define ICU_ERCLR(k) (0xE0 + (k) * 4)
+#define ICU_SWINT 0x130
+#define ICU_ERINTA55CTL(k) (0x338 + (k) * 4)
+#define ICU_ERINTA55CRL(k) (0x348 + (k) * 4)
+#define ICU_ERINTA55MSK(k) (0x358 + (k) * 4)
+#define ICU_SWPE 0x370
#define ICU_DMkSELy(k, y) (0x420 + (k) * 0x20 + (y) * 4)
#define ICU_DMACKSELk(k) (0x500 + (k) * 4)
@@ -90,6 +108,10 @@
#define ICU_RZG3E_TSSEL_MAX_VAL 0x8c
#define ICU_RZV2H_TSSEL_MAX_VAL 0x55
+#define ICU_SWPE_NUM 16
+#define ICU_NUM_BE 4
+#define ICU_NUM_A55ERR 4
+
/**
* struct rzv2h_irqc_reg_cache - registers cache (necessary for suspend/resume)
* @nitsr: ICU_NITSR register
@@ -108,12 +130,16 @@ struct rzv2h_irqc_reg_cache {
* @t_offs: TINT offset
* @max_tssel: TSSEL max value
* @field_width: TSSR field width
+ * @ecc_start: Start index of ECC RAM interrupts
+ * @ecc_end: End index of ECC RAM interrupts
*/
struct rzv2h_hw_info {
const u8 *tssel_lut;
u16 t_offs;
u8 max_tssel;
u8 field_width;
+ u8 ecc_start;
+ u8 ecc_end;
};
/* DMAC */
@@ -167,32 +193,47 @@ static inline struct rzv2h_icu_priv *irq_data_to_priv(struct irq_data *data)
return data->domain->host_data;
}
-static void rzv2h_icu_eoi(struct irq_data *d)
+static void rzv2h_icu_tint_eoi(struct irq_data *d)
{
struct rzv2h_icu_priv *priv = irq_data_to_priv(d);
unsigned int hw_irq = irqd_to_hwirq(d);
unsigned int tintirq_nr;
u32 bit;
- scoped_guard(raw_spinlock, &priv->lock) {
- if (hw_irq >= ICU_TINT_START) {
- tintirq_nr = hw_irq - ICU_TINT_START;
- bit = BIT(tintirq_nr);
- if (!irqd_is_level_type(d))
- writel_relaxed(bit, priv->base + priv->info->t_offs + ICU_TSCLR);
- } else if (hw_irq >= ICU_IRQ_START) {
- tintirq_nr = hw_irq - ICU_IRQ_START;
- bit = BIT(tintirq_nr);
- if (!irqd_is_level_type(d))
- writel_relaxed(bit, priv->base + ICU_ISCLR);
- } else {
- writel_relaxed(ICU_NSCLR_NCLR, priv->base + ICU_NSCLR);
- }
+ if (!irqd_is_level_type(d)) {
+ tintirq_nr = hw_irq - ICU_TINT_START;
+ bit = BIT(tintirq_nr);
+ writel_relaxed(bit, priv->base + priv->info->t_offs + ICU_TSCLR);
}
irq_chip_eoi_parent(d);
}
+static void rzv2h_icu_irq_eoi(struct irq_data *d)
+{
+ struct rzv2h_icu_priv *priv = irq_data_to_priv(d);
+ unsigned int hw_irq = irqd_to_hwirq(d);
+ unsigned int tintirq_nr;
+ u32 bit;
+
+ if (!irqd_is_level_type(d)) {
+ tintirq_nr = hw_irq - ICU_IRQ_START;
+ bit = BIT(tintirq_nr);
+ writel_relaxed(bit, priv->base + ICU_ISCLR);
+ }
+
+ irq_chip_eoi_parent(d);
+}
+
+static void rzv2h_icu_nmi_eoi(struct irq_data *d)
+{
+ struct rzv2h_icu_priv *priv = irq_data_to_priv(d);
+
+ writel_relaxed(ICU_NSCLR_NCLR, priv->base + ICU_NSCLR);
+
+ irq_chip_eoi_parent(d);
+}
+
static void rzv2h_tint_irq_endisable(struct irq_data *d, bool enable)
{
struct rzv2h_icu_priv *priv = irq_data_to_priv(d);
@@ -200,9 +241,6 @@ static void rzv2h_tint_irq_endisable(struct irq_data *d, bool enable)
u32 tint_nr, tssel_n, k, tssr;
u8 nr_tint;
- if (hw_irq < ICU_TINT_START)
- return;
-
tint_nr = hw_irq - ICU_TINT_START;
nr_tint = 32 / priv->info->field_width;
k = tint_nr / nr_tint;
@@ -225,13 +263,13 @@ static void rzv2h_tint_irq_endisable(struct irq_data *d, bool enable)
writel_relaxed(BIT(tint_nr), priv->base + priv->info->t_offs + ICU_TSCLR);
}
-static void rzv2h_icu_irq_disable(struct irq_data *d)
+static void rzv2h_icu_tint_disable(struct irq_data *d)
{
irq_chip_disable_parent(d);
rzv2h_tint_irq_endisable(d, false);
}
-static void rzv2h_icu_irq_enable(struct irq_data *d)
+static void rzv2h_icu_tint_enable(struct irq_data *d)
{
rzv2h_tint_irq_endisable(d, true);
irq_chip_enable_parent(d);
@@ -257,7 +295,7 @@ static int rzv2h_nmi_set_type(struct irq_data *d, unsigned int type)
writel_relaxed(sense, priv->base + ICU_NITSR);
- return 0;
+ return irq_chip_set_type_parent(d, IRQ_TYPE_LEVEL_HIGH);
}
static void rzv2h_clear_irq_int(struct rzv2h_icu_priv *priv, unsigned int hwirq)
@@ -307,14 +345,15 @@ static int rzv2h_irq_set_type(struct irq_data *d, unsigned int type)
return -EINVAL;
}
- guard(raw_spinlock)(&priv->lock);
- iitsr = readl_relaxed(priv->base + ICU_IITSR);
- iitsr &= ~ICU_IITSR_IITSEL_MASK(irq_nr);
- iitsr |= ICU_IITSR_IITSEL_PREP(sense, irq_nr);
- rzv2h_clear_irq_int(priv, hwirq);
- writel_relaxed(iitsr, priv->base + ICU_IITSR);
+ scoped_guard(raw_spinlock, &priv->lock) {
+ iitsr = readl_relaxed(priv->base + ICU_IITSR);
+ iitsr &= ~ICU_IITSR_IITSEL_MASK(irq_nr);
+ iitsr |= ICU_IITSR_IITSEL_PREP(sense, irq_nr);
+ rzv2h_clear_irq_int(priv, hwirq);
+ writel_relaxed(iitsr, priv->base + ICU_IITSR);
+ }
- return 0;
+ return irq_chip_set_type_parent(d, IRQ_TYPE_LEVEL_HIGH);
}
static void rzv2h_clear_tint_int(struct rzv2h_icu_priv *priv, unsigned int hwirq)
@@ -389,49 +428,82 @@ static int rzv2h_tint_set_type(struct irq_data *d, unsigned int type)
titsr_k = ICU_TITSR_K(tint_nr);
titsel_n = ICU_TITSR_TITSEL_N(tint_nr);
- guard(raw_spinlock)(&priv->lock);
+ scoped_guard(raw_spinlock, &priv->lock) {
+ tssr = readl_relaxed(priv->base + priv->info->t_offs + ICU_TSSR(tssr_k));
+ titsr = readl_relaxed(priv->base + priv->info->t_offs + ICU_TITSR(titsr_k));
- tssr = readl_relaxed(priv->base + priv->info->t_offs + ICU_TSSR(tssr_k));
- titsr = readl_relaxed(priv->base + priv->info->t_offs + ICU_TITSR(titsr_k));
+ tssr_cur = field_get(ICU_TSSR_TSSEL_MASK(tssel_n, priv->info->field_width), tssr);
+ titsr_cur = field_get(ICU_TITSR_TITSEL_MASK(titsel_n), titsr);
+ if (tssr_cur == tint && titsr_cur == sense)
+ goto set_parent_type;
- tssr_cur = field_get(ICU_TSSR_TSSEL_MASK(tssel_n, priv->info->field_width), tssr);
- titsr_cur = field_get(ICU_TITSR_TITSEL_MASK(titsel_n), titsr);
- if (tssr_cur == tint && titsr_cur == sense)
- return 0;
+ tssr &= ~(ICU_TSSR_TSSEL_MASK(tssel_n, priv->info->field_width) | tien);
+ tssr |= ICU_TSSR_TSSEL_PREP(tint, tssel_n, priv->info->field_width);
+
+ writel_relaxed(tssr, priv->base + priv->info->t_offs + ICU_TSSR(tssr_k));
+
+ titsr &= ~ICU_TITSR_TITSEL_MASK(titsel_n);
+ titsr |= ICU_TITSR_TITSEL_PREP(sense, titsel_n);
- tssr &= ~(ICU_TSSR_TSSEL_MASK(tssel_n, priv->info->field_width) | tien);
- tssr |= ICU_TSSR_TSSEL_PREP(tint, tssel_n, priv->info->field_width);
+ writel_relaxed(titsr, priv->base + priv->info->t_offs + ICU_TITSR(titsr_k));
+
+ rzv2h_clear_tint_int(priv, hwirq);
+
+ writel_relaxed(tssr | tien, priv->base + priv->info->t_offs + ICU_TSSR(tssr_k));
+ }
+set_parent_type:
+ return irq_chip_set_type_parent(d, IRQ_TYPE_LEVEL_HIGH);
+}
- writel_relaxed(tssr, priv->base + priv->info->t_offs + ICU_TSSR(tssr_k));
+static int rzv2h_icu_swint_set_irqchip_state(struct irq_data *d, enum irqchip_irq_state which,
+ bool state)
+{
+ unsigned int hwirq = irqd_to_hwirq(d);
+ struct rzv2h_icu_priv *priv;
+ unsigned int bit;
- titsr &= ~ICU_TITSR_TITSEL_MASK(titsel_n);
- titsr |= ICU_TITSR_TITSEL_PREP(sense, titsel_n);
+ if (which != IRQCHIP_STATE_PENDING)
+ return irq_chip_set_parent_state(d, which, state);
- writel_relaxed(titsr, priv->base + priv->info->t_offs + ICU_TITSR(titsr_k));
+ if (!state)
+ return 0;
- rzv2h_clear_tint_int(priv, hwirq);
+ priv = irq_data_to_priv(d);
+ bit = BIT(hwirq - ICU_CA55_INT_START);
- writel_relaxed(tssr | tien, priv->base + priv->info->t_offs + ICU_TSSR(tssr_k));
+ /* Trigger the software interrupt */
+ writel_relaxed(bit, priv->base + ICU_SWINT);
return 0;
}
-static int rzv2h_icu_set_type(struct irq_data *d, unsigned int type)
+static int rzv2h_icu_swpe_set_irqchip_state(struct irq_data *d, enum irqchip_irq_state which,
+ bool state)
{
- unsigned int hw_irq = irqd_to_hwirq(d);
- int ret;
+ struct rzv2h_icu_priv *priv;
+ unsigned int bit;
+ static u8 swpe;
- if (hw_irq >= ICU_TINT_START)
- ret = rzv2h_tint_set_type(d, type);
- else if (hw_irq >= ICU_IRQ_START)
- ret = rzv2h_irq_set_type(d, type);
- else
- ret = rzv2h_nmi_set_type(d, type);
+ if (which != IRQCHIP_STATE_PENDING)
+ return irq_chip_set_parent_state(d, which, state);
- if (ret)
- return ret;
+ if (!state)
+ return 0;
- return irq_chip_set_type_parent(d, IRQ_TYPE_LEVEL_HIGH);
+ priv = irq_data_to_priv(d);
+
+ bit = BIT(swpe);
+ /*
+ * SWPE has 16 bits; the bit position is rotated on each trigger
+ * and wraps around once all bits have been used.
+ */
+ if (++swpe >= ICU_SWPE_NUM)
+ swpe = 0;
+
+ /* Trigger the pseudo error interrupt */
+ writel_relaxed(bit, priv->base + ICU_SWPE);
+
+ return 0;
}
static int rzv2h_irqc_irq_suspend(void *data)
@@ -472,27 +544,98 @@ static struct syscore rzv2h_irqc_syscore = {
.ops = &rzv2h_irqc_syscore_ops,
};
-static const struct irq_chip rzv2h_icu_chip = {
+static const struct irq_chip rzv2h_icu_tint_chip = {
+ .name = "rzv2h-icu",
+ .irq_eoi = rzv2h_icu_tint_eoi,
+ .irq_mask = irq_chip_mask_parent,
+ .irq_unmask = irq_chip_unmask_parent,
+ .irq_disable = rzv2h_icu_tint_disable,
+ .irq_enable = rzv2h_icu_tint_enable,
+ .irq_get_irqchip_state = irq_chip_get_parent_state,
+ .irq_set_irqchip_state = irq_chip_set_parent_state,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .irq_set_type = rzv2h_tint_set_type,
+ .irq_set_affinity = irq_chip_set_affinity_parent,
+ .flags = IRQCHIP_MASK_ON_SUSPEND |
+ IRQCHIP_SET_TYPE_MASKED |
+ IRQCHIP_SKIP_SET_WAKE,
+};
+
+static const struct irq_chip rzv2h_icu_irq_chip = {
.name = "rzv2h-icu",
- .irq_eoi = rzv2h_icu_eoi,
+ .irq_eoi = rzv2h_icu_irq_eoi,
.irq_mask = irq_chip_mask_parent,
.irq_unmask = irq_chip_unmask_parent,
- .irq_disable = rzv2h_icu_irq_disable,
- .irq_enable = rzv2h_icu_irq_enable,
+ .irq_disable = irq_chip_disable_parent,
+ .irq_enable = irq_chip_enable_parent,
.irq_get_irqchip_state = irq_chip_get_parent_state,
.irq_set_irqchip_state = irq_chip_set_parent_state,
.irq_retrigger = irq_chip_retrigger_hierarchy,
- .irq_set_type = rzv2h_icu_set_type,
+ .irq_set_type = rzv2h_irq_set_type,
.irq_set_affinity = irq_chip_set_affinity_parent,
.flags = IRQCHIP_MASK_ON_SUSPEND |
IRQCHIP_SET_TYPE_MASKED |
IRQCHIP_SKIP_SET_WAKE,
};
+static const struct irq_chip rzv2h_icu_nmi_chip = {
+ .name = "rzv2h-icu",
+ .irq_eoi = rzv2h_icu_nmi_eoi,
+ .irq_mask = irq_chip_mask_parent,
+ .irq_unmask = irq_chip_unmask_parent,
+ .irq_disable = irq_chip_disable_parent,
+ .irq_enable = irq_chip_enable_parent,
+ .irq_get_irqchip_state = irq_chip_get_parent_state,
+ .irq_set_irqchip_state = irq_chip_set_parent_state,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .irq_set_type = rzv2h_nmi_set_type,
+ .irq_set_affinity = irq_chip_set_affinity_parent,
+ .flags = IRQCHIP_MASK_ON_SUSPEND |
+ IRQCHIP_SET_TYPE_MASKED |
+ IRQCHIP_SKIP_SET_WAKE,
+};
+
+static const struct irq_chip rzv2h_icu_swint_chip = {
+ .name = "rzv2h-icu",
+ .irq_eoi = irq_chip_eoi_parent,
+ .irq_mask = irq_chip_mask_parent,
+ .irq_unmask = irq_chip_unmask_parent,
+ .irq_disable = irq_chip_disable_parent,
+ .irq_enable = irq_chip_enable_parent,
+ .irq_get_irqchip_state = irq_chip_get_parent_state,
+ .irq_set_irqchip_state = rzv2h_icu_swint_set_irqchip_state,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .irq_set_type = irq_chip_set_type_parent,
+ .irq_set_affinity = irq_chip_set_affinity_parent,
+ .flags = IRQCHIP_MASK_ON_SUSPEND |
+ IRQCHIP_SET_TYPE_MASKED |
+ IRQCHIP_SKIP_SET_WAKE,
+};
+
+static const struct irq_chip rzv2h_icu_swpe_err_chip = {
+ .name = "rzv2h-icu",
+ .irq_eoi = irq_chip_eoi_parent,
+ .irq_mask = irq_chip_mask_parent,
+ .irq_unmask = irq_chip_unmask_parent,
+ .irq_disable = irq_chip_disable_parent,
+ .irq_enable = irq_chip_enable_parent,
+ .irq_get_irqchip_state = irq_chip_get_parent_state,
+ .irq_set_irqchip_state = rzv2h_icu_swpe_set_irqchip_state,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .irq_set_type = irq_chip_set_type_parent,
+ .irq_set_affinity = irq_chip_set_affinity_parent,
+ .flags = IRQCHIP_MASK_ON_SUSPEND |
+ IRQCHIP_SET_TYPE_MASKED |
+ IRQCHIP_SKIP_SET_WAKE,
+};
+
+#define hwirq_within(hwirq, which) ((hwirq) >= which##_START && (hwirq) <= which##_LAST)
+
static int rzv2h_icu_alloc(struct irq_domain *domain, unsigned int virq, unsigned int nr_irqs,
void *arg)
{
struct rzv2h_icu_priv *priv = domain->host_data;
+ const struct irq_chip *chip;
unsigned long tint = 0;
irq_hw_number_t hwirq;
unsigned int type;
@@ -508,19 +651,27 @@ static int rzv2h_icu_alloc(struct irq_domain *domain, unsigned int virq, unsigne
* hwirq is embedded in bits 0-15.
* TINT is embedded in bits 16-31.
*/
- if (hwirq >= ICU_TINT_START) {
- tint = ICU_TINT_EXTRACT_GPIOINT(hwirq);
+ tint = ICU_TINT_EXTRACT_GPIOINT(hwirq);
+ if (tint || hwirq_within(hwirq, ICU_TINT)) {
hwirq = ICU_TINT_EXTRACT_HWIRQ(hwirq);
- if (hwirq < ICU_TINT_START)
+ if (!hwirq_within(hwirq, ICU_TINT))
return -EINVAL;
+ chip = &rzv2h_icu_tint_chip;
+ } else if (hwirq_within(hwirq, ICU_IRQ)) {
+ chip = &rzv2h_icu_irq_chip;
+ } else if (hwirq_within(hwirq, ICU_CA55_INT)) {
+ chip = &rzv2h_icu_swint_chip;
+ } else if (hwirq_within(hwirq, ICU_ERR_INT)) {
+ chip = &rzv2h_icu_swpe_err_chip;
+ } else {
+ chip = &rzv2h_icu_nmi_chip;
}
if (hwirq > (ICU_NUM_IRQ - 1))
return -EINVAL;
- ret = irq_domain_set_hwirq_and_chip(domain, virq, hwirq, &rzv2h_icu_chip,
- (void *)(uintptr_t)tint);
+ ret = irq_domain_set_hwirq_and_chip(domain, virq, hwirq, chip, (void *)(uintptr_t)tint);
if (ret)
return ret;
@@ -550,62 +701,160 @@ static int rzv2h_icu_parse_interrupts(struct rzv2h_icu_priv *priv, struct device
return 0;
}
+static irqreturn_t rzv2h_icu_error_irq(int irq, void *data)
+{
+ struct rzv2h_icu_priv *priv = data;
+ const struct rzv2h_hw_info *hw_info = priv->info;
+ void __iomem *base = priv->base;
+ unsigned int k;
+ u32 st;
+
+ /* 1) Bus errors (BEISR0..3) */
+ for (k = 0; k < ICU_NUM_BE; k++) {
+ st = readl(base + ICU_BEISR(k));
+ if (!st)
+ continue;
+
+ writel_relaxed(st, base + ICU_BECLR(k));
+ pr_warn("rzv2h-icu: BUS error k=%u status=0x%08x\n", k, st);
+ }
+
+ /* 2) ECC RAM errors (EREISR0..X) */
+ for (k = hw_info->ecc_start; k <= hw_info->ecc_end; k++) {
+ st = readl(base + ICU_EREISR(k));
+ if (!st)
+ continue;
+
+ writel_relaxed(st, base + ICU_ERCLR(k));
+ pr_warn("rzv2h-icu: ECC error k=%u status=0x%08x\n", k, st);
+ }
+
+ /* 3) IP/CA55 error interrupt status (ERINTA55CTL0..3) */
+ for (k = 0; k < ICU_NUM_A55ERR; k++) {
+ st = readl(base + ICU_ERINTA55CTL(k));
+ if (!st)
+ continue;
+
+ /* there is no relation with status bits so clear all the interrupts */
+ writel_relaxed(0xffffffff, base + ICU_ERINTA55CRL(k));
+ pr_warn("rzv2h-icu: IP/CA55 error k=%u status=0x%08x\n", k, st);
+ }
+
+ return IRQ_HANDLED;
+}
+
+static irqreturn_t rzv2h_icu_swint_irq(int irq, void *data)
+{
+ unsigned int cpu = (uintptr_t)data;
+
+ pr_info("SWINT interrupt for CA55 core %u\n", cpu);
+ return IRQ_HANDLED;
+}
+
+static int rzv2h_icu_setup_irqs(struct platform_device *pdev, struct irq_domain *irq_domain)
+{
+ const struct rzv2h_hw_info *hw_info = rzv2h_icu_data->info;
+ bool irq_inject = IS_ENABLED(CONFIG_GENERIC_IRQ_INJECTION);
+ void __iomem *base = rzv2h_icu_data->base;
+ struct device *dev = &pdev->dev;
+ struct irq_fwspec fwspec;
+ unsigned int i, virq;
+ int ret;
+
+ for (i = 0; i < ICU_CA55_INT_COUNT && irq_inject; i++) {
+ fwspec.fwnode = irq_domain->fwnode;
+ fwspec.param_count = 2;
+ fwspec.param[0] = ICU_CA55_INT_START + i;
+ fwspec.param[1] = IRQ_TYPE_EDGE_RISING;
+
+ virq = irq_create_fwspec_mapping(&fwspec);
+ if (!virq) {
+ return dev_err_probe(dev, -EINVAL,
+ "failed to create int-ca55-%u IRQ mapping\n", i);
+ }
+
+ ret = devm_request_irq(dev, virq, rzv2h_icu_swint_irq, 0, dev_name(dev),
+ (void *)(uintptr_t)i);
+ if (ret)
+ return dev_err_probe(dev, ret, "Failed to request int-ca55-%u IRQ\n", i);
+ }
+
+ /* Unmask and clear all IP/CA55 error interrupts */
+ for (i = 0; i < ICU_NUM_A55ERR; i++) {
+ writel_relaxed(0xffffff, base + ICU_ERINTA55CRL(i));
+ writel_relaxed(0x0, base + ICU_ERINTA55MSK(i));
+ }
+
+ /* Clear all Bus errors */
+ for (i = 0; i < ICU_NUM_BE; i++)
+ writel_relaxed(0xffffffff, base + ICU_BECLR(i));
+
+ /* Clear all ECCRAM errors */
+ for (i = hw_info->ecc_start; i <= hw_info->ecc_end; i++)
+ writel_relaxed(0xffffffff, base + ICU_ERCLR(i));
+
+ fwspec.fwnode = irq_domain->fwnode;
+ fwspec.param_count = 2;
+ fwspec.param[0] = ICU_ERR_INT_START;
+ fwspec.param[1] = IRQ_TYPE_LEVEL_HIGH;
+
+ virq = irq_create_fwspec_mapping(&fwspec);
+ if (!virq)
+ return dev_err_probe(dev, -EINVAL, "failed to create icu-error-ca55 IRQ mapping\n");
+
+ ret = devm_request_irq(dev, virq, rzv2h_icu_error_irq, 0, dev_name(dev), rzv2h_icu_data);
+ if (ret)
+ return dev_err_probe(dev, ret, "Failed to request icu-error-ca55 IRQ\n");
+
+ return 0;
+}
+
static int rzv2h_icu_probe_common(struct platform_device *pdev, struct device_node *parent,
const struct rzv2h_hw_info *hw_info)
{
struct irq_domain *irq_domain, *parent_domain;
struct device_node *node = pdev->dev.of_node;
+ struct device *dev = &pdev->dev;
struct reset_control *resetn;
int ret;
parent_domain = irq_find_host(parent);
- if (!parent_domain) {
- dev_err(&pdev->dev, "cannot find parent domain\n");
- return -ENODEV;
- }
+ if (!parent_domain)
+ return dev_err_probe(dev, -ENODEV, "cannot find parent domain\n");
- rzv2h_icu_data = devm_kzalloc(&pdev->dev, sizeof(*rzv2h_icu_data), GFP_KERNEL);
+ rzv2h_icu_data = devm_kzalloc(dev, sizeof(*rzv2h_icu_data), GFP_KERNEL);
if (!rzv2h_icu_data)
return -ENOMEM;
platform_set_drvdata(pdev, rzv2h_icu_data);
- rzv2h_icu_data->base = devm_of_iomap(&pdev->dev, pdev->dev.of_node, 0, NULL);
+ rzv2h_icu_data->base = devm_of_iomap(dev, node, 0, NULL);
if (IS_ERR(rzv2h_icu_data->base))
return PTR_ERR(rzv2h_icu_data->base);
ret = rzv2h_icu_parse_interrupts(rzv2h_icu_data, node);
- if (ret) {
- dev_err(&pdev->dev, "cannot parse interrupts: %d\n", ret);
- return ret;
- }
+ if (ret)
+ return dev_err_probe(dev, ret, "cannot parse interrupts\n");
- resetn = devm_reset_control_get_exclusive_deasserted(&pdev->dev, NULL);
- if (IS_ERR(resetn)) {
- ret = PTR_ERR(resetn);
- dev_err(&pdev->dev, "failed to acquire deasserted reset: %d\n", ret);
- return ret;
- }
+ resetn = devm_reset_control_get_exclusive_deasserted(dev, NULL);
+ if (IS_ERR(resetn))
+ return dev_err_probe(dev, PTR_ERR(resetn), "failed to acquire deasserted reset\n");
- ret = devm_pm_runtime_enable(&pdev->dev);
- if (ret < 0) {
- dev_err(&pdev->dev, "devm_pm_runtime_enable failed, %d\n", ret);
- return ret;
- }
+ ret = devm_pm_runtime_enable(dev);
+ if (ret < 0)
+ return dev_err_probe(dev, ret, "devm_pm_runtime_enable failed\n");
- ret = pm_runtime_resume_and_get(&pdev->dev);
- if (ret < 0) {
- dev_err(&pdev->dev, "pm_runtime_resume_and_get failed: %d\n", ret);
- return ret;
- }
+ ret = pm_runtime_resume_and_get(dev);
+ if (ret < 0)
+ return dev_err_probe(dev, ret, "pm_runtime_resume_and_get failed\n");
raw_spin_lock_init(&rzv2h_icu_data->lock);
irq_domain = irq_domain_create_hierarchy(parent_domain, 0, ICU_NUM_IRQ,
- dev_fwnode(&pdev->dev), &rzv2h_icu_domain_ops,
+ dev_fwnode(dev), &rzv2h_icu_domain_ops,
rzv2h_icu_data);
if (!irq_domain) {
- dev_err(&pdev->dev, "failed to add irq domain\n");
+ dev_err(dev, "failed to add irq domain\n");
ret = -ENOMEM;
goto pm_put;
}
@@ -614,15 +863,18 @@ static int rzv2h_icu_probe_common(struct platform_device *pdev, struct device_no
register_syscore(&rzv2h_irqc_syscore);
+ ret = rzv2h_icu_setup_irqs(pdev, irq_domain);
+ if (ret)
+ goto pm_put;
+
/*
* coccicheck complains about a missing put_device call before returning, but it's a false
- * positive. We still need &pdev->dev after successfully returning from this function.
+ * positive. We still need dev after successfully returning from this function.
*/
return 0;
pm_put:
- pm_runtime_put_sync(&pdev->dev);
-
+ pm_runtime_put_sync(dev);
return ret;
}
@@ -657,12 +909,24 @@ static const struct rzv2h_hw_info rzg3e_hw_params = {
.t_offs = ICU_RZG3E_TINT_OFFSET,
.max_tssel = ICU_RZG3E_TSSEL_MAX_VAL,
.field_width = 16,
+ .ecc_start = 1,
+ .ecc_end = 4,
+};
+
+static const struct rzv2h_hw_info rzv2n_hw_params = {
+ .t_offs = 0,
+ .max_tssel = ICU_RZV2H_TSSEL_MAX_VAL,
+ .field_width = 8,
+ .ecc_start = 0,
+ .ecc_end = 2,
};
static const struct rzv2h_hw_info rzv2h_hw_params = {
.t_offs = 0,
.max_tssel = ICU_RZV2H_TSSEL_MAX_VAL,
.field_width = 8,
+ .ecc_start = 0,
+ .ecc_end = 11,
};
static int rzg3e_icu_probe(struct platform_device *pdev, struct device_node *parent)
@@ -670,6 +934,11 @@ static int rzg3e_icu_probe(struct platform_device *pdev, struct device_node *par
return rzv2h_icu_probe_common(pdev, parent, &rzg3e_hw_params);
}
+static int rzv2n_icu_probe(struct platform_device *pdev, struct device_node *parent)
+{
+ return rzv2h_icu_probe_common(pdev, parent, &rzv2n_hw_params);
+}
+
static int rzv2h_icu_probe(struct platform_device *pdev, struct device_node *parent)
{
return rzv2h_icu_probe_common(pdev, parent, &rzv2h_hw_params);
@@ -677,7 +946,7 @@ static int rzv2h_icu_probe(struct platform_device *pdev, struct device_node *par
IRQCHIP_PLATFORM_DRIVER_BEGIN(rzv2h_icu)
IRQCHIP_MATCH("renesas,r9a09g047-icu", rzg3e_icu_probe)
-IRQCHIP_MATCH("renesas,r9a09g056-icu", rzv2h_icu_probe)
+IRQCHIP_MATCH("renesas,r9a09g056-icu", rzv2n_icu_probe)
IRQCHIP_MATCH("renesas,r9a09g057-icu", rzv2h_icu_probe)
IRQCHIP_PLATFORM_DRIVER_END(rzv2h_icu)
MODULE_AUTHOR("Fabrizio Castro <fabrizio.castro.jz@renesas.com>");
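For readers following the rzv2h ICU changes above, here is a minimal standalone sketch of the new hwirq index layout and of how rzv2h_icu_alloc() picks a chip per range. The ICU_* index macros and hwirq_within() are copied from the patch. The enum, the icu_classify() helper, and the two extract helpers are hypothetical stand-ins: the extract helpers assume only what the in-code comment states (hwirq in bits 0-15, TINT in bits 16-31) and are not the driver's ICU_TINT_EXTRACT_* macros.

```c
#include <assert.h>
#include <stdint.h>

/* DT "interrupts" indexes, copied from the patch */
#define ICU_IRQ_START		1
#define ICU_IRQ_COUNT		16
#define ICU_IRQ_LAST		(ICU_IRQ_START + ICU_IRQ_COUNT - 1)
#define ICU_TINT_START		(ICU_IRQ_LAST + 1)
#define ICU_TINT_COUNT		32
#define ICU_TINT_LAST		(ICU_TINT_START + ICU_TINT_COUNT - 1)
#define ICU_CA55_INT_START	(ICU_TINT_LAST + 1)
#define ICU_CA55_INT_COUNT	4
#define ICU_CA55_INT_LAST	(ICU_CA55_INT_START + ICU_CA55_INT_COUNT - 1)
#define ICU_ERR_INT_START	(ICU_CA55_INT_LAST + 1)
#define ICU_ERR_INT_COUNT	1
#define ICU_ERR_INT_LAST	(ICU_ERR_INT_START + ICU_ERR_INT_COUNT - 1)
#define ICU_NUM_IRQ		(ICU_ERR_INT_LAST + 1)

#define hwirq_within(hwirq, which) ((hwirq) >= which##_START && (hwirq) <= which##_LAST)

/* NMI at 0, IRQ 1-16, TINT 17-48, CA55 SWINT 49-52, error 53 */
static_assert(ICU_NUM_IRQ == 54, "unexpected index layout");

/* Hypothetical: stands in for the irq_chip pointers chosen by the driver */
enum icu_kind {
	ICU_KIND_NMI, ICU_KIND_IRQ, ICU_KIND_TINT,
	ICU_KIND_SWINT, ICU_KIND_ERR, ICU_KIND_INVALID,
};

/* Assumption: GPIOINT lives in bits 16-31, hwirq in bits 0-15 */
static inline uint32_t extract_gpioint(uint32_t v) { return v >> 16; }
static inline uint32_t extract_hwirq(uint32_t v) { return v & 0xffff; }

/* Mirrors the if/else chain in rzv2h_icu_alloc(), minus the domain calls */
static inline enum icu_kind icu_classify(uint32_t hwirq)
{
	uint32_t tint = extract_gpioint(hwirq);

	if (tint || hwirq_within(hwirq, ICU_TINT)) {
		hwirq = extract_hwirq(hwirq);
		if (!hwirq_within(hwirq, ICU_TINT))
			return ICU_KIND_INVALID;
		return ICU_KIND_TINT;
	}
	if (hwirq_within(hwirq, ICU_IRQ))
		return ICU_KIND_IRQ;
	if (hwirq_within(hwirq, ICU_CA55_INT))
		return ICU_KIND_SWINT;
	if (hwirq_within(hwirq, ICU_ERR_INT))
		return ICU_KIND_ERR;
	/* Everything else falls to the NMI chip, then the range check rejects it */
	if (hwirq > ICU_NUM_IRQ - 1)
		return ICU_KIND_INVALID;
	return ICU_KIND_NMI;
}
```

Under these assumptions, a value with a non-zero upper half is only accepted when its low 16 bits land in the TINT range, which matches the -EINVAL path in the patch.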
The pull request you sent on Sun, 12 Apr 2026 19:46:10 +0200:
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq-drivers-2026-04-12
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/c0ecb2a9eeaa25832c1367ecc865ab2523b8c3d5
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
Linus,
please pull the latest timers/vdso branch from:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers-vdso-2026-04-12
up to: 7138a8698a39: timens: Use task_lock guard in timens_get*()
Updates for the VDSO subsystem:
- Make the handling of compat functions consistent and more robust
- Rework the underlying data store so that it is dynamically
allocated, which allows the conversion of the last holdout SPARC64
to the generic VDSO implementation
- Rework the SPARC64 VDSO to utilize the generic implementation
- Mop up the leftovers of the non-generic VDSO support in the core
code.
- Expand the VDSO selftests and make them more robust
- Allow time namespaces to be enabled independently of the generic
VDSO support, which was not possible before due to SPARC64 not
using it.
- Various cleanups and improvements in the related code.
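For readers unfamiliar with the read side that the SPARC64 conversion hands over to the generic library: the removed vvar_read_begin()/vvar_read_retry() helpers implemented a sequence-counter retry loop. Below is a minimal userspace sketch of that pattern using C11 atomics. It is an illustration only; struct demo_time_data, demo_update() and demo_read() are hypothetical names and are not the kernel's data layout or API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a vDSO data page; not struct vdso_time_data. */
struct demo_time_data {
	atomic_uint seq;		/* odd while an update is in progress */
	_Atomic uint64_t sec;
	_Atomic uint64_t nsec;
};

/* Single writer: make seq odd, publish the payload, make seq even again. */
static void demo_update(struct demo_time_data *d, uint64_t sec, uint64_t nsec)
{
	atomic_fetch_add_explicit(&d->seq, 1, memory_order_relaxed);
	/* Order the odd seq value before the payload stores. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&d->sec, sec, memory_order_relaxed);
	atomic_store_explicit(&d->nsec, nsec, memory_order_relaxed);
	/* Release: payload stores are visible before seq becomes even. */
	atomic_fetch_add_explicit(&d->seq, 1, memory_order_release);
}

/* Reader: retry until a consistent snapshot was read under an even seq. */
static void demo_read(struct demo_time_data *d, uint64_t *sec, uint64_t *nsec)
{
	unsigned int start;

	do {
		start = atomic_load_explicit(&d->seq, memory_order_acquire);
		*sec = atomic_load_explicit(&d->sec, memory_order_relaxed);
		*nsec = atomic_load_explicit(&d->nsec, memory_order_relaxed);
		/* Order the payload loads before re-checking seq. */
		atomic_thread_fence(memory_order_acquire);
	} while ((start & 1) ||
		 atomic_load_explicit(&d->seq, memory_order_relaxed) != start);
}
```

A caller would invoke demo_update() from the single updater and demo_read() from any number of readers. The sketch retries immediately on an odd sequence value, whereas the removed SPARC64 helper called cpu_relax() before retrying.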
Thanks,
tglx
------------------>
Arnd Bergmann (1):
clocksource: Remove ARCH_CLOCKSOURCE_DATA
Randy Dunlap (1):
vdso/datapage: Correct struct member kernel-doc
Thomas Weißschuh (49):
x86/vdso: Use 32-bit CHECKFLAGS for compat vDSO
sparc64: vdso: Use 32-bit CHECKFLAGS for compat vDSO
s390: Add -m64 to KBUILD_CPPFLAGS
powerpc/audit: Directly include unistd_32.h from compat_audit.c
asm-generic/bitsperlong.h: Add sanity checks for __BITS_PER_LONG
vdso/datastore: Reduce scope of some variables in vvar_fault()
vdso/datastore: Drop inclusion of linux/mmap_lock.h
vdso/datastore: Allocate data pages dynamically
sparc64: vdso: Link with -z noexecstack
sparc64: vdso: Remove obsolete "fake section table" reservation
sparc64: vdso: Replace code patching with runtime conditional
sparc64: vdso: Move hardware counter read into header
sparc64: vdso: Move syscall fallbacks into header
sparc64: vdso: Introduce vdso/processor.h
sparc64: vdso: Switch to the generic vDSO library
sparc64: vdso2c: Drop sym_vvar_start handling
sparc64: vdso2c: Remove symbol handling
sparc64: vdso: Implement clock_gettime64()
vdso/gettimeofday: Drop a few usages of __maybe_unused
vdso/gettimeofday: Add a helper to read the sequence lock of a time namespace aware clock
vdso/gettimeofday: Add a helper to test if a clock is namespaced
vdso/gettimeofday: Move the unlikely() into vdso_read_retry()
arm64: vDSO: gettimeofday: Explicitly include vdso/clocksource.h
arm64: vDSO: compat_gettimeofday: Add explicit includes
ARM: vdso: gettimeofday: Add explicit includes
powerpc/vdso/gettimeofday: Explicitly include vdso/time32.h
powerpc/vdso: Explicitly include asm/cputable.h and asm/feature-fixups.h
LoongArch: vDSO: Explicitly include asm/vdso/vdso.h
MIPS: vdso: Add include guard to asm/vdso/vdso.h
MIPS: vdso: Explicitly include asm/vdso/vdso.h
random: vDSO: Add explicit includes
vdso/gettimeofday: Add explicit includes
vdso/helpers: Explicitly include vdso/processor.h
vdso/datapage: Remove inclusion of gettimeofday.h
vdso/datapage: Trim down unnecessary includes
random: vDSO: Trim vDSO includes
random: vDSO: Remove ifdeffery
Revert "selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers"
selftests: vDSO: vdso_test_gettimeofday: Remove nolibc checks
selftests: vDSO: vdso_test_correctness: Drop SYS_getcpu fallbacks
selftests: vDSO: vdso_test_correctness: Handle different tv_usec types
selftests: vDSO: vdso_test_correctness: Use facilities from parse_vdso.c
selftests: vDSO: vdso_test_correctness: Add a test for time()
vdso/timens: Move functions to new file
timens: Remove dependency on the vDSO
timens: Add a __free() wrapper for put_time_ns()
timens: Simplify some calls to put_time_ns()
timens: Use mutex guard in proc_timens_set_offset()
timens: Use task_lock guard in timens_get*()
MAINTAINERS | 2 +
arch/arm/include/asm/vdso/gettimeofday.h | 2 +
arch/arm64/include/asm/vdso/compat_gettimeofday.h | 3 +
arch/arm64/include/asm/vdso/gettimeofday.h | 2 +
arch/loongarch/kernel/process.c | 1 +
arch/loongarch/kernel/vdso.c | 1 +
arch/mips/include/asm/vdso/vdso.h | 5 +
arch/mips/kernel/vdso.c | 1 +
arch/powerpc/include/asm/vdso/gettimeofday.h | 1 +
arch/powerpc/include/asm/vdso/processor.h | 3 +
arch/powerpc/kernel/compat_audit.c | 3 +-
arch/s390/Makefile | 3 +-
arch/sparc/Kconfig | 3 +-
arch/sparc/include/asm/clocksource.h | 9 -
arch/sparc/include/asm/processor.h | 3 +
arch/sparc/include/asm/processor_32.h | 2 -
arch/sparc/include/asm/processor_64.h | 25 --
arch/sparc/include/asm/vdso.h | 2 -
arch/sparc/include/asm/vdso/clocksource.h | 10 +
arch/sparc/include/asm/vdso/gettimeofday.h | 184 ++++++++++
arch/sparc/include/asm/vdso/processor.h | 41 +++
arch/sparc/include/asm/vdso/vsyscall.h | 10 +
arch/sparc/include/asm/vvar.h | 75 ----
arch/sparc/kernel/Makefile | 1 -
arch/sparc/kernel/time_64.c | 6 +-
arch/sparc/kernel/vdso.c | 69 ----
arch/sparc/vdso/Makefile | 11 +-
arch/sparc/vdso/vclock_gettime.c | 380 ++-------------------
arch/sparc/vdso/vdso-layout.lds.S | 26 +-
arch/sparc/vdso/vdso.lds.S | 2 -
arch/sparc/vdso/vdso2c.c | 24 --
arch/sparc/vdso/vdso2c.h | 45 +--
arch/sparc/vdso/vdso32/vdso32.lds.S | 4 +-
arch/sparc/vdso/vma.c | 274 +--------------
arch/x86/entry/vdso/vdso32/Makefile | 4 +
drivers/char/random.c | 16 +-
include/asm-generic/bitsperlong.h | 9 +
include/linux/clocksource.h | 6 +-
include/linux/time_namespace.h | 39 ++-
include/linux/vdso_datastore.h | 6 +
include/vdso/datapage.h | 27 +-
include/vdso/helpers.h | 31 +-
init/Kconfig | 4 +-
init/main.c | 2 +
kernel/time/Kconfig | 4 -
kernel/time/Makefile | 1 +
kernel/time/namespace.c | 203 ++---------
kernel/time/namespace_internal.h | 28 ++
kernel/time/namespace_vdso.c | 160 +++++++++
lib/vdso/datastore.c | 122 +++----
lib/vdso/getrandom.c | 3 +
lib/vdso/gettimeofday.c | 99 +++---
tools/testing/selftests/vDSO/Makefile | 6 +-
tools/testing/selftests/vDSO/parse_vdso.c | 3 +-
.../testing/selftests/vDSO/vdso_test_correctness.c | 112 ++++--
.../selftests/vDSO/vdso_test_gettimeofday.c | 2 -
56 files changed, 829 insertions(+), 1291 deletions(-)
create mode 100644 arch/sparc/include/asm/vdso/clocksource.h
create mode 100644 arch/sparc/include/asm/vdso/gettimeofday.h
create mode 100644 arch/sparc/include/asm/vdso/processor.h
create mode 100644 arch/sparc/include/asm/vdso/vsyscall.h
delete mode 100644 arch/sparc/include/asm/vvar.h
delete mode 100644 arch/sparc/kernel/vdso.c
create mode 100644 kernel/time/namespace_internal.h
create mode 100644 kernel/time/namespace_vdso.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 77fdfcb55f06..6ad74a5196d1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10768,6 +10768,7 @@ S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/vdso
F: include/asm-generic/vdso/vsyscall.h
F: include/vdso/
+F: kernel/time/namespace_vdso.c
F: kernel/time/vsyscall.c
F: lib/vdso/
F: tools/testing/selftests/vDSO/
@@ -21000,6 +21001,7 @@ F: include/trace/events/timer*
F: kernel/time/itimer.c
F: kernel/time/posix-*
F: kernel/time/namespace.c
+F: kernel/time/namespace_vdso.c
POWER MANAGEMENT CORE
M: "Rafael J. Wysocki" <rafael@kernel.org>
diff --git a/arch/arm/include/asm/vdso/gettimeofday.h b/arch/arm/include/asm/vdso/gettimeofday.h
index 1e9f81639c88..26da5d8621cc 100644
--- a/arch/arm/include/asm/vdso/gettimeofday.h
+++ b/arch/arm/include/asm/vdso/gettimeofday.h
@@ -11,6 +11,8 @@
#include <asm/errno.h>
#include <asm/unistd.h>
#include <asm/vdso/cp15.h>
+#include <vdso/clocksource.h>
+#include <vdso/time32.h>
#include <uapi/linux/time.h>
#define VDSO_HAS_CLOCK_GETRES 1
diff --git a/arch/arm64/include/asm/vdso/compat_gettimeofday.h b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
index 0d513f924321..a03e34b572f1 100644
--- a/arch/arm64/include/asm/vdso/compat_gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
@@ -7,6 +7,9 @@
#ifndef __ASSEMBLER__
+#include <vdso/clocksource.h>
+#include <vdso/time32.h>
+
#include <asm/barrier.h>
#include <asm/unistd_compat_32.h>
#include <asm/errno.h>
diff --git a/arch/arm64/include/asm/vdso/gettimeofday.h b/arch/arm64/include/asm/vdso/gettimeofday.h
index 3658a757e255..96d2eccd4995 100644
--- a/arch/arm64/include/asm/vdso/gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/gettimeofday.h
@@ -9,6 +9,8 @@
#ifndef __ASSEMBLER__
+#include <vdso/clocksource.h>
+
#include <asm/alternative.h>
#include <asm/arch_timer.h>
#include <asm/barrier.h>
diff --git a/arch/loongarch/kernel/process.c b/arch/loongarch/kernel/process.c
index 4ac1c3086152..ac3a0baa5d00 100644
--- a/arch/loongarch/kernel/process.c
+++ b/arch/loongarch/kernel/process.c
@@ -52,6 +52,7 @@
#include <asm/switch_to.h>
#include <asm/unwind.h>
#include <asm/vdso.h>
+#include <asm/vdso/vdso.h>
#ifdef CONFIG_STACKPROTECTOR
#include <linux/stackprotector.h>
diff --git a/arch/loongarch/kernel/vdso.c b/arch/loongarch/kernel/vdso.c
index 0aa10cadb959..8ce8159c10b9 100644
--- a/arch/loongarch/kernel/vdso.c
+++ b/arch/loongarch/kernel/vdso.c
@@ -18,6 +18,7 @@
#include <asm/page.h>
#include <asm/vdso.h>
+#include <asm/vdso/vdso.h>
#include <vdso/helpers.h>
#include <vdso/vsyscall.h>
#include <vdso/datapage.h>
diff --git a/arch/mips/include/asm/vdso/vdso.h b/arch/mips/include/asm/vdso/vdso.h
index 6889e0f2e5db..ef50d33f3439 100644
--- a/arch/mips/include/asm/vdso/vdso.h
+++ b/arch/mips/include/asm/vdso/vdso.h
@@ -4,6 +4,9 @@
* Author: Alex Smith <alex.smith@imgtec.com>
*/
+#ifndef __ASM_VDSO_VDSO_H
+#define __ASM_VDSO_VDSO_H
+
#include <asm/sgidefs.h>
#include <vdso/page.h>
@@ -70,3 +73,5 @@ static inline void __iomem *get_gic(const struct vdso_time_data *data)
#endif /* CONFIG_CLKSRC_MIPS_GIC */
#endif /* __ASSEMBLER__ */
+
+#endif /* __ASM_VDSO_VDSO_H */
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index de096777172f..2fa4df3e46e4 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -21,6 +21,7 @@
#include <asm/mips-cps.h>
#include <asm/page.h>
#include <asm/vdso.h>
+#include <asm/vdso/vdso.h>
#include <vdso/helpers.h>
#include <vdso/vsyscall.h>
diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h b/arch/powerpc/include/asm/vdso/gettimeofday.h
index 8ea397e26ad0..a853f853da6c 100644
--- a/arch/powerpc/include/asm/vdso/gettimeofday.h
+++ b/arch/powerpc/include/asm/vdso/gettimeofday.h
@@ -8,6 +8,7 @@
#include <asm/barrier.h>
#include <asm/unistd.h>
#include <uapi/linux/time.h>
+#include <vdso/time32.h>
#define VDSO_HAS_CLOCK_GETRES 1
diff --git a/arch/powerpc/include/asm/vdso/processor.h b/arch/powerpc/include/asm/vdso/processor.h
index c1f3d7aaf3ee..4c6802c3a580 100644
--- a/arch/powerpc/include/asm/vdso/processor.h
+++ b/arch/powerpc/include/asm/vdso/processor.h
@@ -4,6 +4,9 @@
#ifndef __ASSEMBLER__
+#include <asm/cputable.h>
+#include <asm/feature-fixups.h>
+
/* Macros for adjusting thread priority (hardware multi-threading) */
#ifdef CONFIG_PPC64
#define HMT_very_low() asm volatile("or 31, 31, 31 # very low priority")
diff --git a/arch/powerpc/kernel/compat_audit.c b/arch/powerpc/kernel/compat_audit.c
index 57b38c592b9f..b4d81a57b2d9 100644
--- a/arch/powerpc/kernel/compat_audit.c
+++ b/arch/powerpc/kernel/compat_audit.c
@@ -1,7 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
-#undef __powerpc64__
#include <linux/audit_arch.h>
-#include <asm/unistd.h>
+#include <asm/unistd_32.h>
#include "audit_32.h"
diff --git a/arch/s390/Makefile b/arch/s390/Makefile
index d78ad6885ca2..02bc948a4a56 100644
--- a/arch/s390/Makefile
+++ b/arch/s390/Makefile
@@ -12,8 +12,7 @@ LD_BFD := elf64-s390
KBUILD_LDFLAGS := -m elf64_s390
KBUILD_AFLAGS_MODULE += -fPIC
KBUILD_CFLAGS_MODULE += -fPIC
-KBUILD_AFLAGS += -m64
-KBUILD_CFLAGS += -m64
+KBUILD_CPPFLAGS += -m64
KBUILD_CFLAGS += -fPIC
LDFLAGS_vmlinux := $(call ld-option,-no-pie)
extra_tools := relocs
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 8699be91fca9..a6b787efc2c4 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -104,7 +104,6 @@ config SPARC64
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
select GENERIC_TIME_VSYSCALL
- select ARCH_CLOCKSOURCE_DATA
select ARCH_HAS_PTE_SPECIAL
select PCI_DOMAINS if PCI
select ARCH_HAS_GIGANTIC_PAGE
@@ -115,6 +114,8 @@ config SPARC64
select ARCH_SUPPORTS_SCHED_SMT if SMP
select ARCH_SUPPORTS_SCHED_MC if SMP
select ARCH_HAS_LAZY_MMU_MODE
+ select HAVE_GENERIC_VDSO
+ select GENERIC_GETTIMEOFDAY
config ARCH_PROC_KCORE_TEXT
def_bool y
diff --git a/arch/sparc/include/asm/clocksource.h b/arch/sparc/include/asm/clocksource.h
index d63ef224befe..68303ad26eb2 100644
--- a/arch/sparc/include/asm/clocksource.h
+++ b/arch/sparc/include/asm/clocksource.h
@@ -5,13 +5,4 @@
#ifndef _ASM_SPARC_CLOCKSOURCE_H
#define _ASM_SPARC_CLOCKSOURCE_H
-/* VDSO clocksources */
-#define VCLOCK_NONE 0 /* Nothing userspace can do. */
-#define VCLOCK_TICK 1 /* Use %tick. */
-#define VCLOCK_STICK 2 /* Use %stick. */
-
-struct arch_clocksource_data {
- int vclock_mode;
-};
-
#endif /* _ASM_SPARC_CLOCKSOURCE_H */
diff --git a/arch/sparc/include/asm/processor.h b/arch/sparc/include/asm/processor.h
index 18295ea625dd..e34de956519a 100644
--- a/arch/sparc/include/asm/processor.h
+++ b/arch/sparc/include/asm/processor.h
@@ -1,6 +1,9 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef ___ASM_SPARC_PROCESSOR_H
#define ___ASM_SPARC_PROCESSOR_H
+
+#include <asm/vdso/processor.h>
+
#if defined(__sparc__) && defined(__arch64__)
#include <asm/processor_64.h>
#else
diff --git a/arch/sparc/include/asm/processor_32.h b/arch/sparc/include/asm/processor_32.h
index ba8b70ffec08..a074d313f4f8 100644
--- a/arch/sparc/include/asm/processor_32.h
+++ b/arch/sparc/include/asm/processor_32.h
@@ -91,8 +91,6 @@ unsigned long __get_wchan(struct task_struct *);
extern struct task_struct *last_task_used_math;
int do_mathemu(struct pt_regs *regs, struct task_struct *fpt);
-#define cpu_relax() barrier()
-
extern void (*sparc_idle)(void);
#endif
diff --git a/arch/sparc/include/asm/processor_64.h b/arch/sparc/include/asm/processor_64.h
index 321859454ca4..485070495263 100644
--- a/arch/sparc/include/asm/processor_64.h
+++ b/arch/sparc/include/asm/processor_64.h
@@ -182,31 +182,6 @@ unsigned long __get_wchan(struct task_struct *task);
#define KSTK_EIP(tsk) (task_pt_regs(tsk)->tpc)
#define KSTK_ESP(tsk) (task_pt_regs(tsk)->u_regs[UREG_FP])
-/* Please see the commentary in asm/backoff.h for a description of
- * what these instructions are doing and how they have been chosen.
- * To make a long story short, we are trying to yield the current cpu
- * strand during busy loops.
- */
-#ifdef BUILD_VDSO
-#define cpu_relax() asm volatile("\n99:\n\t" \
- "rd %%ccr, %%g0\n\t" \
- "rd %%ccr, %%g0\n\t" \
- "rd %%ccr, %%g0\n\t" \
- ::: "memory")
-#else /* ! BUILD_VDSO */
-#define cpu_relax() asm volatile("\n99:\n\t" \
- "rd %%ccr, %%g0\n\t" \
- "rd %%ccr, %%g0\n\t" \
- "rd %%ccr, %%g0\n\t" \
- ".section .pause_3insn_patch,\"ax\"\n\t"\
- ".word 99b\n\t" \
- "wr %%g0, 128, %%asr27\n\t" \
- "nop\n\t" \
- "nop\n\t" \
- ".previous" \
- ::: "memory")
-#endif
-
/* Prefetch support. This is tuned for UltraSPARC-III and later.
* UltraSPARC-I will treat these as nops, and UltraSPARC-II has
* a shallower prefetch queue than later chips.
diff --git a/arch/sparc/include/asm/vdso.h b/arch/sparc/include/asm/vdso.h
index 59e79d35cd73..f08562d10215 100644
--- a/arch/sparc/include/asm/vdso.h
+++ b/arch/sparc/include/asm/vdso.h
@@ -8,8 +8,6 @@
struct vdso_image {
void *data;
unsigned long size; /* Always a multiple of PAGE_SIZE */
-
- long sym_vvar_start; /* Negative offset to the vvar area */
};
#ifdef CONFIG_SPARC64
diff --git a/arch/sparc/include/asm/vdso/clocksource.h b/arch/sparc/include/asm/vdso/clocksource.h
new file mode 100644
index 000000000000..007aa8ceaf52
--- /dev/null
+++ b/arch/sparc/include/asm/vdso/clocksource.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_VDSO_CLOCKSOURCE_H
+#define __ASM_VDSO_CLOCKSOURCE_H
+
+/* VDSO clocksources */
+#define VDSO_ARCH_CLOCKMODES \
+ VDSO_CLOCKMODE_TICK, \
+ VDSO_CLOCKMODE_STICK
+
+#endif /* __ASM_VDSO_CLOCKSOURCE_H */
diff --git a/arch/sparc/include/asm/vdso/gettimeofday.h b/arch/sparc/include/asm/vdso/gettimeofday.h
new file mode 100644
index 000000000000..b0c80c8a28bb
--- /dev/null
+++ b/arch/sparc/include/asm/vdso/gettimeofday.h
@@ -0,0 +1,184 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2006 Andi Kleen, SUSE Labs.
+ */
+
+#ifndef _ASM_SPARC_VDSO_GETTIMEOFDAY_H
+#define _ASM_SPARC_VDSO_GETTIMEOFDAY_H
+
+#include <uapi/linux/time.h>
+#include <uapi/linux/unistd.h>
+
+#include <vdso/align.h>
+#include <vdso/clocksource.h>
+#include <vdso/datapage.h>
+#include <vdso/page.h>
+
+#include <linux/types.h>
+
+#ifdef CONFIG_SPARC64
+static __always_inline u64 vread_tick(void)
+{
+ u64 ret;
+
+ __asm__ __volatile__("rd %%tick, %0" : "=r" (ret));
+ return ret;
+}
+
+static __always_inline u64 vread_tick_stick(void)
+{
+ u64 ret;
+
+ __asm__ __volatile__("rd %%asr24, %0" : "=r" (ret));
+ return ret;
+}
+#else
+static __always_inline u64 vdso_shift_ns(u64 val, u32 amt)
+{
+ u64 ret;
+
+ __asm__ __volatile__("sllx %H1, 32, %%g1\n\t"
+ "srl %L1, 0, %L1\n\t"
+ "or %%g1, %L1, %%g1\n\t"
+ "srlx %%g1, %2, %L0\n\t"
+ "srlx %L0, 32, %H0"
+ : "=r" (ret)
+ : "r" (val), "r" (amt)
+ : "g1");
+ return ret;
+}
+#define vdso_shift_ns vdso_shift_ns
+
+static __always_inline u64 vread_tick(void)
+{
+ register unsigned long long ret asm("o4");
+
+ __asm__ __volatile__("rd %%tick, %L0\n\t"
+ "srlx %L0, 32, %H0"
+ : "=r" (ret));
+ return ret;
+}
+
+static __always_inline u64 vread_tick_stick(void)
+{
+ register unsigned long long ret asm("o4");
+
+ __asm__ __volatile__("rd %%asr24, %L0\n\t"
+ "srlx %L0, 32, %H0"
+ : "=r" (ret));
+ return ret;
+}
+#endif
+
+static __always_inline u64 __arch_get_hw_counter(s32 clock_mode, const struct vdso_time_data *vd)
+{
+ if (likely(clock_mode == VDSO_CLOCKMODE_STICK))
+ return vread_tick_stick();
+ else
+ return vread_tick();
+}
+
+#ifdef CONFIG_SPARC64
+#define SYSCALL_STRING \
+ "ta 0x6d;" \
+ "bcs,a 1f;" \
+ " sub %%g0, %%o0, %%o0;" \
+ "1:"
+#else
+#define SYSCALL_STRING \
+ "ta 0x10;" \
+ "bcs,a 1f;" \
+ " sub %%g0, %%o0, %%o0;" \
+ "1:"
+#endif
+
+#define SYSCALL_CLOBBERS \
+ "f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7", \
+ "f8", "f9", "f10", "f11", "f12", "f13", "f14", "f15", \
+ "f16", "f17", "f18", "f19", "f20", "f21", "f22", "f23", \
+ "f24", "f25", "f26", "f27", "f28", "f29", "f30", "f31", \
+ "f32", "f34", "f36", "f38", "f40", "f42", "f44", "f46", \
+ "f48", "f50", "f52", "f54", "f56", "f58", "f60", "f62", \
+ "cc", "memory"
+
+#ifdef CONFIG_SPARC64
+
+static __always_inline
+long clock_gettime_fallback(clockid_t clock, struct __kernel_timespec *ts)
+{
+ register long num __asm__("g1") = __NR_clock_gettime;
+ register long o0 __asm__("o0") = clock;
+ register long o1 __asm__("o1") = (long) ts;
+
+ __asm__ __volatile__(SYSCALL_STRING : "=r" (o0) : "r" (num),
+ "0" (o0), "r" (o1) : SYSCALL_CLOBBERS);
+ return o0;
+}
+
+#else /* !CONFIG_SPARC64 */
+
+static __always_inline
+long clock_gettime_fallback(clockid_t clock, struct __kernel_timespec *ts)
+{
+ register long num __asm__("g1") = __NR_clock_gettime64;
+ register long o0 __asm__("o0") = clock;
+ register long o1 __asm__("o1") = (long) ts;
+
+ __asm__ __volatile__(SYSCALL_STRING : "=r" (o0) : "r" (num),
+ "0" (o0), "r" (o1) : SYSCALL_CLOBBERS);
+ return o0;
+}
+
+static __always_inline
+long clock_gettime32_fallback(clockid_t clock, struct old_timespec32 *ts)
+{
+ register long num __asm__("g1") = __NR_clock_gettime;
+ register long o0 __asm__("o0") = clock;
+ register long o1 __asm__("o1") = (long) ts;
+
+ __asm__ __volatile__(SYSCALL_STRING : "=r" (o0) : "r" (num),
+ "0" (o0), "r" (o1) : SYSCALL_CLOBBERS);
+ return o0;
+}
+
+#endif /* CONFIG_SPARC64 */
+
+static __always_inline
+long gettimeofday_fallback(struct __kernel_old_timeval *tv, struct timezone *tz)
+{
+ register long num __asm__("g1") = __NR_gettimeofday;
+ register long o0 __asm__("o0") = (long) tv;
+ register long o1 __asm__("o1") = (long) tz;
+
+ __asm__ __volatile__(SYSCALL_STRING : "=r" (o0) : "r" (num),
+ "0" (o0), "r" (o1) : SYSCALL_CLOBBERS);
+ return o0;
+}
+
+static __always_inline const struct vdso_time_data *__arch_get_vdso_u_time_data(void)
+{
+ unsigned long ret;
+
+ /*
+ * SPARC does not support native PC-relative code relocations.
+ * Calculate the address manually, works for 32 and 64 bit code.
+ */
+ __asm__ __volatile__(
+ "1:\n"
+ "call 3f\n" // Jump over the embedded data and set up %o7
+ "nop\n" // Delay slot
+ "2:\n"
+ ".word vdso_u_time_data - .\n" // Embedded offset to external symbol
+ "3:\n"
+ "add %%o7, 2b - 1b, %%o7\n" // Point %o7 to the embedded offset
+ "ldsw [%%o7], %0\n" // Load the offset
+ "add %0, %%o7, %0\n" // Calculate the absolute address
+ : "=r" (ret)
+ :
+ : "o7");
+
+ return (const struct vdso_time_data *)ret;
+}
+#define __arch_get_vdso_u_time_data __arch_get_vdso_u_time_data
+
+#endif /* _ASM_SPARC_VDSO_GETTIMEOFDAY_H */
diff --git a/arch/sparc/include/asm/vdso/processor.h b/arch/sparc/include/asm/vdso/processor.h
new file mode 100644
index 000000000000..f7a9adc807f7
--- /dev/null
+++ b/arch/sparc/include/asm/vdso/processor.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _ASM_SPARC_VDSO_PROCESSOR_H
+#define _ASM_SPARC_VDSO_PROCESSOR_H
+
+#include <linux/compiler.h>
+
+#if defined(__arch64__)
+
+/* Please see the commentary in asm/backoff.h for a description of
+ * what these instructions are doing and how they have been chosen.
+ * To make a long story short, we are trying to yield the current cpu
+ * strand during busy loops.
+ */
+#ifdef BUILD_VDSO
+#define cpu_relax() asm volatile("\n99:\n\t" \
+ "rd %%ccr, %%g0\n\t" \
+ "rd %%ccr, %%g0\n\t" \
+ "rd %%ccr, %%g0\n\t" \
+ ::: "memory")
+#else /* ! BUILD_VDSO */
+#define cpu_relax() asm volatile("\n99:\n\t" \
+ "rd %%ccr, %%g0\n\t" \
+ "rd %%ccr, %%g0\n\t" \
+ "rd %%ccr, %%g0\n\t" \
+ ".section .pause_3insn_patch,\"ax\"\n\t"\
+ ".word 99b\n\t" \
+ "wr %%g0, 128, %%asr27\n\t" \
+ "nop\n\t" \
+ "nop\n\t" \
+ ".previous" \
+ ::: "memory")
+#endif /* BUILD_VDSO */
+
+#else /* ! __arch64__ */
+
+#define cpu_relax() barrier()
+
+#endif /* __arch64__ */
+
+#endif /* _ASM_SPARC_VDSO_PROCESSOR_H */
diff --git a/arch/sparc/include/asm/vdso/vsyscall.h b/arch/sparc/include/asm/vdso/vsyscall.h
new file mode 100644
index 000000000000..8bfe703fedc5
--- /dev/null
+++ b/arch/sparc/include/asm/vdso/vsyscall.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _ASM_SPARC_VDSO_VSYSCALL_H
+#define _ASM_SPARC_VDSO_VSYSCALL_H
+
+#define __VDSO_PAGES 4
+
+#include <asm-generic/vdso/vsyscall.h>
+
+#endif /* _ASM_SPARC_VDSO_VSYSCALL_H */
diff --git a/arch/sparc/include/asm/vvar.h b/arch/sparc/include/asm/vvar.h
deleted file mode 100644
index 6eaf5cfcaae1..000000000000
--- a/arch/sparc/include/asm/vvar.h
+++ /dev/null
@@ -1,75 +0,0 @@
-/*
- * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
- */
-
-#ifndef _ASM_SPARC_VVAR_DATA_H
-#define _ASM_SPARC_VVAR_DATA_H
-
-#include <asm/clocksource.h>
-#include <asm/processor.h>
-#include <asm/barrier.h>
-#include <linux/time.h>
-#include <linux/types.h>
-
-struct vvar_data {
- unsigned int seq;
-
- int vclock_mode;
- struct { /* extract of a clocksource struct */
- u64 cycle_last;
- u64 mask;
- int mult;
- int shift;
- } clock;
- /* open coded 'struct timespec' */
- u64 wall_time_sec;
- u64 wall_time_snsec;
- u64 monotonic_time_snsec;
- u64 monotonic_time_sec;
- u64 monotonic_time_coarse_sec;
- u64 monotonic_time_coarse_nsec;
- u64 wall_time_coarse_sec;
- u64 wall_time_coarse_nsec;
-
- int tz_minuteswest;
- int tz_dsttime;
-};
-
-extern struct vvar_data *vvar_data;
-extern int vdso_fix_stick;
-
-static inline unsigned int vvar_read_begin(const struct vvar_data *s)
-{
- unsigned int ret;
-
-repeat:
- ret = READ_ONCE(s->seq);
- if (unlikely(ret & 1)) {
- cpu_relax();
- goto repeat;
- }
- smp_rmb(); /* Finish all reads before we return seq */
- return ret;
-}
-
-static inline int vvar_read_retry(const struct vvar_data *s,
- unsigned int start)
-{
- smp_rmb(); /* Finish all reads before checking the value of seq */
- return unlikely(s->seq != start);
-}
-
-static inline void vvar_write_begin(struct vvar_data *s)
-{
- ++s->seq;
- smp_wmb(); /* Makes sure that increment of seq is reflected */
-}
-
-static inline void vvar_write_end(struct vvar_data *s)
-{
- smp_wmb(); /* Makes the value of seq current before we increment */
- ++s->seq;
-}
-
-
-#endif /* _ASM_SPARC_VVAR_DATA_H */
diff --git a/arch/sparc/kernel/Makefile b/arch/sparc/kernel/Makefile
index 22170d4f8e06..497b5714fa8f 100644
--- a/arch/sparc/kernel/Makefile
+++ b/arch/sparc/kernel/Makefile
@@ -41,7 +41,6 @@ obj-$(CONFIG_SPARC32) += systbls_32.o
obj-y += time_$(BITS).o
obj-$(CONFIG_SPARC32) += windows.o
obj-y += cpu.o
-obj-$(CONFIG_SPARC64) += vdso.o
obj-$(CONFIG_SPARC32) += devices.o
obj-y += ptrace_$(BITS).o
obj-y += unaligned_$(BITS).o
diff --git a/arch/sparc/kernel/time_64.c b/arch/sparc/kernel/time_64.c
index b32f27f929d1..87b267043ccd 100644
--- a/arch/sparc/kernel/time_64.c
+++ b/arch/sparc/kernel/time_64.c
@@ -838,14 +838,14 @@ void __init time_init_early(void)
if (tlb_type == spitfire) {
if (is_hummingbird()) {
init_tick_ops(&hbtick_operations);
- clocksource_tick.archdata.vclock_mode = VCLOCK_NONE;
+ clocksource_tick.vdso_clock_mode = VDSO_CLOCKMODE_NONE;
} else {
init_tick_ops(&tick_operations);
- clocksource_tick.archdata.vclock_mode = VCLOCK_TICK;
+ clocksource_tick.vdso_clock_mode = VDSO_CLOCKMODE_TICK;
}
} else {
init_tick_ops(&stick_operations);
- clocksource_tick.archdata.vclock_mode = VCLOCK_STICK;
+ clocksource_tick.vdso_clock_mode = VDSO_CLOCKMODE_STICK;
}
}
diff --git a/arch/sparc/kernel/vdso.c b/arch/sparc/kernel/vdso.c
deleted file mode 100644
index 0e27437eb97b..000000000000
--- a/arch/sparc/kernel/vdso.c
+++ /dev/null
@@ -1,69 +0,0 @@
-/*
- * Copyright (C) 2001 Andrea Arcangeli <andrea@suse.de> SuSE
- * Copyright 2003 Andi Kleen, SuSE Labs.
- *
- * Thanks to hpa@transmeta.com for some useful hint.
- * Special thanks to Ingo Molnar for his early experience with
- * a different vsyscall implementation for Linux/IA32 and for the name.
- */
-
-#include <linux/time.h>
-#include <linux/timekeeper_internal.h>
-
-#include <asm/vvar.h>
-
-void update_vsyscall_tz(void)
-{
- if (unlikely(vvar_data == NULL))
- return;
-
- vvar_data->tz_minuteswest = sys_tz.tz_minuteswest;
- vvar_data->tz_dsttime = sys_tz.tz_dsttime;
-}
-
-void update_vsyscall(struct timekeeper *tk)
-{
- struct vvar_data *vdata = vvar_data;
-
- if (unlikely(vdata == NULL))
- return;
-
- vvar_write_begin(vdata);
- vdata->vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode;
- vdata->clock.cycle_last = tk->tkr_mono.cycle_last;
- vdata->clock.mask = tk->tkr_mono.mask;
- vdata->clock.mult = tk->tkr_mono.mult;
- vdata->clock.shift = tk->tkr_mono.shift;
-
- vdata->wall_time_sec = tk->xtime_sec;
- vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec;
-
- vdata->monotonic_time_sec = tk->xtime_sec +
- tk->wall_to_monotonic.tv_sec;
- vdata->monotonic_time_snsec = tk->tkr_mono.xtime_nsec +
- (tk->wall_to_monotonic.tv_nsec <<
- tk->tkr_mono.shift);
-
- while (vdata->monotonic_time_snsec >=
- (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) {
- vdata->monotonic_time_snsec -=
- ((u64)NSEC_PER_SEC) << tk->tkr_mono.shift;
- vdata->monotonic_time_sec++;
- }
-
- vdata->wall_time_coarse_sec = tk->xtime_sec;
- vdata->wall_time_coarse_nsec =
- (long)(tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift);
-
- vdata->monotonic_time_coarse_sec =
- vdata->wall_time_coarse_sec + tk->wall_to_monotonic.tv_sec;
- vdata->monotonic_time_coarse_nsec =
- vdata->wall_time_coarse_nsec + tk->wall_to_monotonic.tv_nsec;
-
- while (vdata->monotonic_time_coarse_nsec >= NSEC_PER_SEC) {
- vdata->monotonic_time_coarse_nsec -= NSEC_PER_SEC;
- vdata->monotonic_time_coarse_sec++;
- }
-
- vvar_write_end(vdata);
-}
diff --git a/arch/sparc/vdso/Makefile b/arch/sparc/vdso/Makefile
index 683b2d408224..83fb2aca59cb 100644
--- a/arch/sparc/vdso/Makefile
+++ b/arch/sparc/vdso/Makefile
@@ -3,6 +3,9 @@
# Building vDSO images for sparc.
#
+# Include the generic Makefile to check the built vDSO:
+include $(srctree)/lib/vdso/Makefile.include
+
# files to link into the vdso
vobjs-y := vdso-note.o vclock_gettime.o
@@ -90,6 +93,9 @@ KBUILD_CFLAGS_32 += -DDISABLE_BRANCH_PROFILING
KBUILD_CFLAGS_32 += -mv8plus
$(obj)/vdso32.so.dbg: KBUILD_CFLAGS = $(KBUILD_CFLAGS_32)
+CHECKFLAGS_32 := $(filter-out -m64 -D__sparc_v9__ -D__arch64__, $(CHECKFLAGS)) -m32
+$(obj)/vdso32.so.dbg: CHECKFLAGS = $(CHECKFLAGS_32)
+
$(obj)/vdso32.so.dbg: FORCE \
$(obj)/vdso32/vdso32.lds \
$(obj)/vdso32/vclock_gettime.o \
@@ -102,6 +108,7 @@ $(obj)/vdso32.so.dbg: FORCE \
quiet_cmd_vdso = VDSO $@
cmd_vdso = $(LD) -nostdlib -o $@ \
$(VDSO_LDFLAGS) $(VDSO_LDFLAGS_$(filter %.lds,$(^F))) \
- -T $(filter %.lds,$^) $(filter %.o,$^)
+ -T $(filter %.lds,$^) $(filter %.o,$^); \
+ $(cmd_vdso_check)
-VDSO_LDFLAGS = -shared --hash-style=both --build-id=sha1 -Bsymbolic --no-undefined
+VDSO_LDFLAGS = -shared --hash-style=both --build-id=sha1 -Bsymbolic --no-undefined -z noexecstack
diff --git a/arch/sparc/vdso/vclock_gettime.c b/arch/sparc/vdso/vclock_gettime.c
index 79607804ea1b..1d9859392e4c 100644
--- a/arch/sparc/vdso/vclock_gettime.c
+++ b/arch/sparc/vdso/vclock_gettime.c
@@ -12,382 +12,48 @@
* Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
*/
-#include <linux/kernel.h>
-#include <linux/time.h>
-#include <linux/string.h>
-#include <asm/io.h>
-#include <asm/unistd.h>
-#include <asm/timex.h>
-#include <asm/clocksource.h>
-#include <asm/vvar.h>
+#include <linux/compiler.h>
+#include <linux/types.h>
-#ifdef CONFIG_SPARC64
-#define SYSCALL_STRING \
- "ta 0x6d;" \
- "bcs,a 1f;" \
- " sub %%g0, %%o0, %%o0;" \
- "1:"
-#else
-#define SYSCALL_STRING \
- "ta 0x10;" \
- "bcs,a 1f;" \
- " sub %%g0, %%o0, %%o0;" \
- "1:"
-#endif
-
-#define SYSCALL_CLOBBERS \
- "f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7", \
- "f8", "f9", "f10", "f11", "f12", "f13", "f14", "f15", \
- "f16", "f17", "f18", "f19", "f20", "f21", "f22", "f23", \
- "f24", "f25", "f26", "f27", "f28", "f29", "f30", "f31", \
- "f32", "f34", "f36", "f38", "f40", "f42", "f44", "f46", \
- "f48", "f50", "f52", "f54", "f56", "f58", "f60", "f62", \
- "cc", "memory"
-
-/*
- * Compute the vvar page's address in the process address space, and return it
- * as a pointer to the vvar_data.
- */
-notrace static __always_inline struct vvar_data *get_vvar_data(void)
-{
- unsigned long ret;
-
- /*
- * vdso data page is the first vDSO page so grab the PC
- * and move up a page to get to the data page.
- */
- __asm__("rd %%pc, %0" : "=r" (ret));
- ret &= ~(8192 - 1);
- ret -= 8192;
-
- return (struct vvar_data *) ret;
-}
+#include <vdso/gettime.h>
-notrace static long vdso_fallback_gettime(long clock, struct __kernel_old_timespec *ts)
-{
- register long num __asm__("g1") = __NR_clock_gettime;
- register long o0 __asm__("o0") = clock;
- register long o1 __asm__("o1") = (long) ts;
+#include <asm/vdso/gettimeofday.h>
- __asm__ __volatile__(SYSCALL_STRING : "=r" (o0) : "r" (num),
- "0" (o0), "r" (o1) : SYSCALL_CLOBBERS);
- return o0;
-}
+#include "../../../../lib/vdso/gettimeofday.c"
-notrace static long vdso_fallback_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
+int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
{
- register long num __asm__("g1") = __NR_gettimeofday;
- register long o0 __asm__("o0") = (long) tv;
- register long o1 __asm__("o1") = (long) tz;
-
- __asm__ __volatile__(SYSCALL_STRING : "=r" (o0) : "r" (num),
- "0" (o0), "r" (o1) : SYSCALL_CLOBBERS);
- return o0;
+ return __cvdso_gettimeofday(tv, tz);
}
-#ifdef CONFIG_SPARC64
-notrace static __always_inline u64 __shr64(u64 val, int amt)
-{
- return val >> amt;
-}
+int gettimeofday(struct __kernel_old_timeval *, struct timezone *)
+ __weak __alias(__vdso_gettimeofday);
-notrace static __always_inline u64 vread_tick(void)
+#if defined(CONFIG_SPARC64)
+int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
{
- u64 ret;
-
- __asm__ __volatile__("rd %%tick, %0" : "=r" (ret));
- return ret;
+ return __cvdso_clock_gettime(clock, ts);
}
-notrace static __always_inline u64 vread_tick_stick(void)
-{
- u64 ret;
+int clock_gettime(clockid_t, struct __kernel_timespec *)
+ __weak __alias(__vdso_clock_gettime);
- __asm__ __volatile__("rd %%asr24, %0" : "=r" (ret));
- return ret;
-}
#else
-notrace static __always_inline u64 __shr64(u64 val, int amt)
-{
- u64 ret;
-
- __asm__ __volatile__("sllx %H1, 32, %%g1\n\t"
- "srl %L1, 0, %L1\n\t"
- "or %%g1, %L1, %%g1\n\t"
- "srlx %%g1, %2, %L0\n\t"
- "srlx %L0, 32, %H0"
- : "=r" (ret)
- : "r" (val), "r" (amt)
- : "g1");
- return ret;
-}
-
-notrace static __always_inline u64 vread_tick(void)
-{
- register unsigned long long ret asm("o4");
-
- __asm__ __volatile__("rd %%tick, %L0\n\t"
- "srlx %L0, 32, %H0"
- : "=r" (ret));
- return ret;
-}
-
-notrace static __always_inline u64 vread_tick_stick(void)
-{
- register unsigned long long ret asm("o4");
-
- __asm__ __volatile__("rd %%asr24, %L0\n\t"
- "srlx %L0, 32, %H0"
- : "=r" (ret));
- return ret;
-}
-#endif
-notrace static __always_inline u64 vgetsns(struct vvar_data *vvar)
+int __vdso_clock_gettime(clockid_t clock, struct old_timespec32 *ts)
{
- u64 v;
- u64 cycles;
-
- cycles = vread_tick();
- v = (cycles - vvar->clock.cycle_last) & vvar->clock.mask;
- return v * vvar->clock.mult;
+ return __cvdso_clock_gettime32(clock, ts);
}
-notrace static __always_inline u64 vgetsns_stick(struct vvar_data *vvar)
-{
- u64 v;
- u64 cycles;
+int clock_gettime(clockid_t, struct old_timespec32 *)
+ __weak __alias(__vdso_clock_gettime);
- cycles = vread_tick_stick();
- v = (cycles - vvar->clock.cycle_last) & vvar->clock.mask;
- return v * vvar->clock.mult;
-}
-
-notrace static __always_inline int do_realtime(struct vvar_data *vvar,
- struct __kernel_old_timespec *ts)
+int __vdso_clock_gettime64(clockid_t clock, struct __kernel_timespec *ts)
{
- unsigned long seq;
- u64 ns;
-
- do {
- seq = vvar_read_begin(vvar);
- ts->tv_sec = vvar->wall_time_sec;
- ns = vvar->wall_time_snsec;
- ns += vgetsns(vvar);
- ns = __shr64(ns, vvar->clock.shift);
- } while (unlikely(vvar_read_retry(vvar, seq)));
-
- ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
- ts->tv_nsec = ns;
-
- return 0;
+ return __cvdso_clock_gettime(clock, ts);
}
-notrace static __always_inline int do_realtime_stick(struct vvar_data *vvar,
- struct __kernel_old_timespec *ts)
-{
- unsigned long seq;
- u64 ns;
-
- do {
- seq = vvar_read_begin(vvar);
- ts->tv_sec = vvar->wall_time_sec;
- ns = vvar->wall_time_snsec;
- ns += vgetsns_stick(vvar);
- ns = __shr64(ns, vvar->clock.shift);
- } while (unlikely(vvar_read_retry(vvar, seq)));
-
- ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
- ts->tv_nsec = ns;
+int clock_gettime64(clockid_t, struct __kernel_timespec *)
+ __weak __alias(__vdso_clock_gettime64);
- return 0;
-}
-
-notrace static __always_inline int do_monotonic(struct vvar_data *vvar,
- struct __kernel_old_timespec *ts)
-{
- unsigned long seq;
- u64 ns;
-
- do {
- seq = vvar_read_begin(vvar);
- ts->tv_sec = vvar->monotonic_time_sec;
- ns = vvar->monotonic_time_snsec;
- ns += vgetsns(vvar);
- ns = __shr64(ns, vvar->clock.shift);
- } while (unlikely(vvar_read_retry(vvar, seq)));
-
- ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
- ts->tv_nsec = ns;
-
- return 0;
-}
-
-notrace static __always_inline int do_monotonic_stick(struct vvar_data *vvar,
- struct __kernel_old_timespec *ts)
-{
- unsigned long seq;
- u64 ns;
-
- do {
- seq = vvar_read_begin(vvar);
- ts->tv_sec = vvar->monotonic_time_sec;
- ns = vvar->monotonic_time_snsec;
- ns += vgetsns_stick(vvar);
- ns = __shr64(ns, vvar->clock.shift);
- } while (unlikely(vvar_read_retry(vvar, seq)));
-
- ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
- ts->tv_nsec = ns;
-
- return 0;
-}
-
-notrace static int do_realtime_coarse(struct vvar_data *vvar,
- struct __kernel_old_timespec *ts)
-{
- unsigned long seq;
-
- do {
- seq = vvar_read_begin(vvar);
- ts->tv_sec = vvar->wall_time_coarse_sec;
- ts->tv_nsec = vvar->wall_time_coarse_nsec;
- } while (unlikely(vvar_read_retry(vvar, seq)));
- return 0;
-}
-
-notrace static int do_monotonic_coarse(struct vvar_data *vvar,
- struct __kernel_old_timespec *ts)
-{
- unsigned long seq;
-
- do {
- seq = vvar_read_begin(vvar);
- ts->tv_sec = vvar->monotonic_time_coarse_sec;
- ts->tv_nsec = vvar->monotonic_time_coarse_nsec;
- } while (unlikely(vvar_read_retry(vvar, seq)));
-
- return 0;
-}
-
-notrace int
-__vdso_clock_gettime(clockid_t clock, struct __kernel_old_timespec *ts)
-{
- struct vvar_data *vvd = get_vvar_data();
-
- switch (clock) {
- case CLOCK_REALTIME:
- if (unlikely(vvd->vclock_mode == VCLOCK_NONE))
- break;
- return do_realtime(vvd, ts);
- case CLOCK_MONOTONIC:
- if (unlikely(vvd->vclock_mode == VCLOCK_NONE))
- break;
- return do_monotonic(vvd, ts);
- case CLOCK_REALTIME_COARSE:
- return do_realtime_coarse(vvd, ts);
- case CLOCK_MONOTONIC_COARSE:
- return do_monotonic_coarse(vvd, ts);
- }
- /*
- * Unknown clock ID ? Fall back to the syscall.
- */
- return vdso_fallback_gettime(clock, ts);
-}
-int
-clock_gettime(clockid_t, struct __kernel_old_timespec *)
- __attribute__((weak, alias("__vdso_clock_gettime")));
-
-notrace int
-__vdso_clock_gettime_stick(clockid_t clock, struct __kernel_old_timespec *ts)
-{
- struct vvar_data *vvd = get_vvar_data();
-
- switch (clock) {
- case CLOCK_REALTIME:
- if (unlikely(vvd->vclock_mode == VCLOCK_NONE))
- break;
- return do_realtime_stick(vvd, ts);
- case CLOCK_MONOTONIC:
- if (unlikely(vvd->vclock_mode == VCLOCK_NONE))
- break;
- return do_monotonic_stick(vvd, ts);
- case CLOCK_REALTIME_COARSE:
- return do_realtime_coarse(vvd, ts);
- case CLOCK_MONOTONIC_COARSE:
- return do_monotonic_coarse(vvd, ts);
- }
- /*
- * Unknown clock ID ? Fall back to the syscall.
- */
- return vdso_fallback_gettime(clock, ts);
-}
-
-notrace int
-__vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
-{
- struct vvar_data *vvd = get_vvar_data();
-
- if (likely(vvd->vclock_mode != VCLOCK_NONE)) {
- if (likely(tv != NULL)) {
- union tstv_t {
- struct __kernel_old_timespec ts;
- struct __kernel_old_timeval tv;
- } *tstv = (union tstv_t *) tv;
- do_realtime(vvd, &tstv->ts);
- /*
- * Assign before dividing to ensure that the division is
- * done in the type of tv_usec, not tv_nsec.
- *
- * There cannot be > 1 billion usec in a second:
- * do_realtime() has already distributed such overflow
- * into tv_sec. So we can assign it to an int safely.
- */
- tstv->tv.tv_usec = tstv->ts.tv_nsec;
- tstv->tv.tv_usec /= 1000;
- }
- if (unlikely(tz != NULL)) {
- /* Avoid memcpy. Some old compilers fail to inline it */
- tz->tz_minuteswest = vvd->tz_minuteswest;
- tz->tz_dsttime = vvd->tz_dsttime;
- }
- return 0;
- }
- return vdso_fallback_gettimeofday(tv, tz);
-}
-int
-gettimeofday(struct __kernel_old_timeval *, struct timezone *)
- __attribute__((weak, alias("__vdso_gettimeofday")));
-
-notrace int
-__vdso_gettimeofday_stick(struct __kernel_old_timeval *tv, struct timezone *tz)
-{
- struct vvar_data *vvd = get_vvar_data();
-
- if (likely(vvd->vclock_mode != VCLOCK_NONE)) {
- if (likely(tv != NULL)) {
- union tstv_t {
- struct __kernel_old_timespec ts;
- struct __kernel_old_timeval tv;
- } *tstv = (union tstv_t *) tv;
- do_realtime_stick(vvd, &tstv->ts);
- /*
- * Assign before dividing to ensure that the division is
- * done in the type of tv_usec, not tv_nsec.
- *
- * There cannot be > 1 billion usec in a second:
- * do_realtime() has already distributed such overflow
- * into tv_sec. So we can assign it to an int safely.
- */
- tstv->tv.tv_usec = tstv->ts.tv_nsec;
- tstv->tv.tv_usec /= 1000;
- }
- if (unlikely(tz != NULL)) {
- /* Avoid memcpy. Some old compilers fail to inline it */
- tz->tz_minuteswest = vvd->tz_minuteswest;
- tz->tz_dsttime = vvd->tz_dsttime;
- }
- return 0;
- }
- return vdso_fallback_gettimeofday(tv, tz);
-}
+#endif
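Note on the exported entry points above: each unprefixed name (gettimeofday,
clock_gettime, clock_gettime64) is declared as a weak alias of its __vdso_*
implementation. A minimal userspace sketch of that pattern follows, assuming
GCC or Clang on an ELF target. The names demo_impl and demo_entry are
hypothetical and do not appear in the patch.

```c
/* Userspace sketch of a weak alias; demo_impl and demo_entry are hypothetical. */
int demo_impl(int x)
{
	return x + 1;
}

/*
 * demo_entry is a second symbol that resolves to the code of demo_impl.
 * Being weak, it can be overridden by a strong definition elsewhere.
 */
int demo_entry(int x) __attribute__((weak, alias("demo_impl")));
```

Calling demo_entry() runs the body of demo_impl(), which is how a caller
resolving clock_gettime in the vDSO ends up in __vdso_clock_gettime.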
diff --git a/arch/sparc/vdso/vdso-layout.lds.S b/arch/sparc/vdso/vdso-layout.lds.S
index d31e57e8a3bb..180e5d0ee071 100644
--- a/arch/sparc/vdso/vdso-layout.lds.S
+++ b/arch/sparc/vdso/vdso-layout.lds.S
@@ -4,15 +4,9 @@
* This script controls its layout.
*/
-#if defined(BUILD_VDSO64)
-# define SHDR_SIZE 64
-#elif defined(BUILD_VDSO32)
-# define SHDR_SIZE 40
-#else
-# error unknown VDSO target
-#endif
-
-#define NUM_FAKE_SHDRS 7
+#include <vdso/datapage.h>
+#include <vdso/page.h>
+#include <asm/vdso/vsyscall.h>
SECTIONS
{
@@ -23,8 +17,7 @@ SECTIONS
* segment. Page size is 8192 for both 64-bit and 32-bit vdso binaries
*/
- vvar_start = . -8192;
- vvar_data = vvar_start;
+ VDSO_VVAR_SYMS
. = SIZEOF_HEADERS;
@@ -47,19 +40,8 @@ SECTIONS
*(.bss*)
*(.dynbss*)
*(.gnu.linkonce.b.*)
-
- /*
- * Ideally this would live in a C file: kept in here for
- * compatibility with x86-64.
- */
- VDSO_FAKE_SECTION_TABLE_START = .;
- . = . + NUM_FAKE_SHDRS * SHDR_SIZE;
- VDSO_FAKE_SECTION_TABLE_END = .;
} :text
- .fake_shstrtab : { *(.fake_shstrtab) } :text
-
-
.note : { *(.note.*) } :text :note
.eh_frame_hdr : { *(.eh_frame_hdr) } :text :eh_frame_hdr
diff --git a/arch/sparc/vdso/vdso.lds.S b/arch/sparc/vdso/vdso.lds.S
index 629ab6900df7..f3caa29a331c 100644
--- a/arch/sparc/vdso/vdso.lds.S
+++ b/arch/sparc/vdso/vdso.lds.S
@@ -18,10 +18,8 @@ VERSION {
global:
clock_gettime;
__vdso_clock_gettime;
- __vdso_clock_gettime_stick;
gettimeofday;
__vdso_gettimeofday;
- __vdso_gettimeofday_stick;
local: *;
};
}
diff --git a/arch/sparc/vdso/vdso2c.c b/arch/sparc/vdso/vdso2c.c
index dc81240aab6f..e5c61214a0e2 100644
--- a/arch/sparc/vdso/vdso2c.c
+++ b/arch/sparc/vdso/vdso2c.c
@@ -58,28 +58,6 @@
const char *outfilename;
-/* Symbols that we need in vdso2c. */
-enum {
- sym_vvar_start,
- sym_VDSO_FAKE_SECTION_TABLE_START,
- sym_VDSO_FAKE_SECTION_TABLE_END,
-};
-
-struct vdso_sym {
- const char *name;
- int export;
-};
-
-struct vdso_sym required_syms[] = {
- [sym_vvar_start] = {"vvar_start", 1},
- [sym_VDSO_FAKE_SECTION_TABLE_START] = {
- "VDSO_FAKE_SECTION_TABLE_START", 0
- },
- [sym_VDSO_FAKE_SECTION_TABLE_END] = {
- "VDSO_FAKE_SECTION_TABLE_END", 0
- },
-};
-
__attribute__((format(printf, 1, 2))) __attribute__((noreturn))
static void fail(const char *format, ...)
{
@@ -119,8 +97,6 @@ static void fail(const char *format, ...)
#define PUT_BE(x, val) \
PBE(x, val, 64, PBE(x, val, 32, PBE(x, val, 16, LAST_PBE(x, val))))
-#define NSYMS ARRAY_SIZE(required_syms)
-
#define BITSFUNC3(name, bits, suffix) name##bits##suffix
#define BITSFUNC2(name, bits, suffix) BITSFUNC3(name, bits, suffix)
#define BITSFUNC(name) BITSFUNC2(name, ELF_BITS, )
diff --git a/arch/sparc/vdso/vdso2c.h b/arch/sparc/vdso/vdso2c.h
index 60d69acc748f..bad6a0593f4c 100644
--- a/arch/sparc/vdso/vdso2c.h
+++ b/arch/sparc/vdso/vdso2c.h
@@ -17,11 +17,9 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
unsigned long mapping_size;
int i;
unsigned long j;
- ELF(Shdr) *symtab_hdr = NULL, *strtab_hdr;
+ ELF(Shdr) *symtab_hdr = NULL;
ELF(Ehdr) *hdr = (ELF(Ehdr) *)raw_addr;
ELF(Dyn) *dyn = 0, *dyn_end = 0;
- INT_BITS syms[NSYMS] = {};
-
ELF(Phdr) *pt = (ELF(Phdr) *)(raw_addr + GET_BE(&hdr->e_phoff));
/* Walk the segment table. */
@@ -72,42 +70,6 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
if (!symtab_hdr)
fail("no symbol table\n");
- strtab_hdr = raw_addr + GET_BE(&hdr->e_shoff) +
- GET_BE(&hdr->e_shentsize) * GET_BE(&symtab_hdr->sh_link);
-
- /* Walk the symbol table */
- for (i = 0;
- i < GET_BE(&symtab_hdr->sh_size) / GET_BE(&symtab_hdr->sh_entsize);
- i++) {
- int k;
-
- ELF(Sym) *sym = raw_addr + GET_BE(&symtab_hdr->sh_offset) +
- GET_BE(&symtab_hdr->sh_entsize) * i;
- const char *name = raw_addr + GET_BE(&strtab_hdr->sh_offset) +
- GET_BE(&sym->st_name);
-
- for (k = 0; k < NSYMS; k++) {
- if (!strcmp(name, required_syms[k].name)) {
- if (syms[k]) {
- fail("duplicate symbol %s\n",
- required_syms[k].name);
- }
-
- /*
- * Careful: we use negative addresses, but
- * st_value is unsigned, so we rely
- * on syms[k] being a signed type of the
- * correct width.
- */
- syms[k] = GET_BE(&sym->st_value);
- }
- }
- }
-
- /* Validate mapping addresses. */
- if (syms[sym_vvar_start] % 8192)
- fail("vvar_begin must be a multiple of 8192\n");
-
if (!name) {
fwrite(stripped_addr, stripped_len, 1, outfile);
return;
@@ -133,10 +95,5 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
fprintf(outfile, "const struct vdso_image %s_builtin = {\n", name);
fprintf(outfile, "\t.data = raw_data,\n");
fprintf(outfile, "\t.size = %lu,\n", mapping_size);
- for (i = 0; i < NSYMS; i++) {
- if (required_syms[i].export && syms[i])
- fprintf(outfile, "\t.sym_%s = %" PRIi64 ",\n",
- required_syms[i].name, (int64_t)syms[i]);
- }
fprintf(outfile, "};\n");
}
diff --git a/arch/sparc/vdso/vdso32/vdso32.lds.S b/arch/sparc/vdso/vdso32/vdso32.lds.S
index 218930fdff03..a14e4f77e6f2 100644
--- a/arch/sparc/vdso/vdso32/vdso32.lds.S
+++ b/arch/sparc/vdso/vdso32/vdso32.lds.S
@@ -17,10 +17,10 @@ VERSION {
global:
clock_gettime;
__vdso_clock_gettime;
- __vdso_clock_gettime_stick;
+ clock_gettime64;
+ __vdso_clock_gettime64;
gettimeofday;
__vdso_gettimeofday;
- __vdso_gettimeofday_stick;
local: *;
};
}
diff --git a/arch/sparc/vdso/vma.c b/arch/sparc/vdso/vma.c
index c454689ce5fa..60029d60f4d3 100644
--- a/arch/sparc/vdso/vma.c
+++ b/arch/sparc/vdso/vma.c
@@ -16,17 +16,16 @@
#include <linux/linkage.h>
#include <linux/random.h>
#include <linux/elf.h>
+#include <linux/vdso_datastore.h>
#include <asm/cacheflush.h>
#include <asm/spitfire.h>
#include <asm/vdso.h>
-#include <asm/vvar.h>
#include <asm/page.h>
-unsigned int __read_mostly vdso_enabled = 1;
+#include <vdso/datapage.h>
+#include <asm/vdso/vsyscall.h>
-static struct vm_special_mapping vvar_mapping = {
- .name = "[vvar]"
-};
+unsigned int __read_mostly vdso_enabled = 1;
#ifdef CONFIG_SPARC64
static struct vm_special_mapping vdso_mapping64 = {
@@ -40,207 +39,8 @@ static struct vm_special_mapping vdso_mapping32 = {
};
#endif
-struct vvar_data *vvar_data;
-
-struct vdso_elfinfo32 {
- Elf32_Ehdr *hdr;
- Elf32_Sym *dynsym;
- unsigned long dynsymsize;
- const char *dynstr;
- unsigned long text;
-};
-
-struct vdso_elfinfo64 {
- Elf64_Ehdr *hdr;
- Elf64_Sym *dynsym;
- unsigned long dynsymsize;
- const char *dynstr;
- unsigned long text;
-};
-
-struct vdso_elfinfo {
- union {
- struct vdso_elfinfo32 elf32;
- struct vdso_elfinfo64 elf64;
- } u;
-};
-
-static void *one_section64(struct vdso_elfinfo64 *e, const char *name,
- unsigned long *size)
-{
- const char *snames;
- Elf64_Shdr *shdrs;
- unsigned int i;
-
- shdrs = (void *)e->hdr + e->hdr->e_shoff;
- snames = (void *)e->hdr + shdrs[e->hdr->e_shstrndx].sh_offset;
- for (i = 1; i < e->hdr->e_shnum; i++) {
- if (!strcmp(snames+shdrs[i].sh_name, name)) {
- if (size)
- *size = shdrs[i].sh_size;
- return (void *)e->hdr + shdrs[i].sh_offset;
- }
- }
- return NULL;
-}
-
-static int find_sections64(const struct vdso_image *image, struct vdso_elfinfo *_e)
-{
- struct vdso_elfinfo64 *e = &_e->u.elf64;
-
- e->hdr = image->data;
- e->dynsym = one_section64(e, ".dynsym", &e->dynsymsize);
- e->dynstr = one_section64(e, ".dynstr", NULL);
-
- if (!e->dynsym || !e->dynstr) {
- pr_err("VDSO64: Missing symbol sections.\n");
- return -ENODEV;
- }
- return 0;
-}
-
-static Elf64_Sym *find_sym64(const struct vdso_elfinfo64 *e, const char *name)
-{
- unsigned int i;
-
- for (i = 0; i < (e->dynsymsize / sizeof(Elf64_Sym)); i++) {
- Elf64_Sym *s = &e->dynsym[i];
- if (s->st_name == 0)
- continue;
- if (!strcmp(e->dynstr + s->st_name, name))
- return s;
- }
- return NULL;
-}
-
-static int patchsym64(struct vdso_elfinfo *_e, const char *orig,
- const char *new)
-{
- struct vdso_elfinfo64 *e = &_e->u.elf64;
- Elf64_Sym *osym = find_sym64(e, orig);
- Elf64_Sym *nsym = find_sym64(e, new);
-
- if (!nsym || !osym) {
- pr_err("VDSO64: Missing symbols.\n");
- return -ENODEV;
- }
- osym->st_value = nsym->st_value;
- osym->st_size = nsym->st_size;
- osym->st_info = nsym->st_info;
- osym->st_other = nsym->st_other;
- osym->st_shndx = nsym->st_shndx;
-
- return 0;
-}
-
-static void *one_section32(struct vdso_elfinfo32 *e, const char *name,
- unsigned long *size)
-{
- const char *snames;
- Elf32_Shdr *shdrs;
- unsigned int i;
-
- shdrs = (void *)e->hdr + e->hdr->e_shoff;
- snames = (void *)e->hdr + shdrs[e->hdr->e_shstrndx].sh_offset;
- for (i = 1; i < e->hdr->e_shnum; i++) {
- if (!strcmp(snames+shdrs[i].sh_name, name)) {
- if (size)
- *size = shdrs[i].sh_size;
- return (void *)e->hdr + shdrs[i].sh_offset;
- }
- }
- return NULL;
-}
-
-static int find_sections32(const struct vdso_image *image, struct vdso_elfinfo *_e)
-{
- struct vdso_elfinfo32 *e = &_e->u.elf32;
-
- e->hdr = image->data;
- e->dynsym = one_section32(e, ".dynsym", &e->dynsymsize);
- e->dynstr = one_section32(e, ".dynstr", NULL);
-
- if (!e->dynsym || !e->dynstr) {
- pr_err("VDSO32: Missing symbol sections.\n");
- return -ENODEV;
- }
- return 0;
-}
-
-static Elf32_Sym *find_sym32(const struct vdso_elfinfo32 *e, const char *name)
-{
- unsigned int i;
-
- for (i = 0; i < (e->dynsymsize / sizeof(Elf32_Sym)); i++) {
- Elf32_Sym *s = &e->dynsym[i];
- if (s->st_name == 0)
- continue;
- if (!strcmp(e->dynstr + s->st_name, name))
- return s;
- }
- return NULL;
-}
-
-static int patchsym32(struct vdso_elfinfo *_e, const char *orig,
- const char *new)
-{
- struct vdso_elfinfo32 *e = &_e->u.elf32;
- Elf32_Sym *osym = find_sym32(e, orig);
- Elf32_Sym *nsym = find_sym32(e, new);
-
- if (!nsym || !osym) {
- pr_err("VDSO32: Missing symbols.\n");
- return -ENODEV;
- }
- osym->st_value = nsym->st_value;
- osym->st_size = nsym->st_size;
- osym->st_info = nsym->st_info;
- osym->st_other = nsym->st_other;
- osym->st_shndx = nsym->st_shndx;
-
- return 0;
-}
-
-static int find_sections(const struct vdso_image *image, struct vdso_elfinfo *e,
- bool elf64)
-{
- if (elf64)
- return find_sections64(image, e);
- else
- return find_sections32(image, e);
-}
-
-static int patch_one_symbol(struct vdso_elfinfo *e, const char *orig,
- const char *new_target, bool elf64)
-{
- if (elf64)
- return patchsym64(e, orig, new_target);
- else
- return patchsym32(e, orig, new_target);
-}
-
-static int stick_patch(const struct vdso_image *image, struct vdso_elfinfo *e, bool elf64)
-{
- int err;
-
- err = find_sections(image, e, elf64);
- if (err)
- return err;
-
- err = patch_one_symbol(e,
- "__vdso_gettimeofday",
- "__vdso_gettimeofday_stick", elf64);
- if (err)
- return err;
-
- return patch_one_symbol(e,
- "__vdso_clock_gettime",
- "__vdso_clock_gettime_stick", elf64);
- return 0;
-}
-
/*
- * Allocate pages for the vdso and vvar, and copy in the vdso text from the
+ * Allocate pages for the vdso and copy in the vdso text from the
* kernel image.
*/
static int __init init_vdso_image(const struct vdso_image *image,
@@ -248,16 +48,8 @@ static int __init init_vdso_image(const struct vdso_image *image,
bool elf64)
{
int cnpages = (image->size) / PAGE_SIZE;
- struct page *dp, **dpp = NULL;
struct page *cp, **cpp = NULL;
- struct vdso_elfinfo ei;
- int i, dnpages = 0;
-
- if (tlb_type != spitfire) {
- int err = stick_patch(image, &ei, elf64);
- if (err)
- return err;
- }
+ int i;
/*
* First, the vdso text. This is initialied data, an integral number of
@@ -280,31 +72,6 @@ static int __init init_vdso_image(const struct vdso_image *image,
copy_page(page_address(cp), image->data + i * PAGE_SIZE);
}
- /*
- * Now the vvar page. This is uninitialized data.
- */
-
- if (vvar_data == NULL) {
- dnpages = (sizeof(struct vvar_data) / PAGE_SIZE) + 1;
- if (WARN_ON(dnpages != 1))
- goto oom;
- dpp = kzalloc_objs(struct page *, dnpages);
- vvar_mapping.pages = dpp;
-
- if (!dpp)
- goto oom;
-
- dp = alloc_page(GFP_KERNEL);
- if (!dp)
- goto oom;
-
- dpp[0] = dp;
- vvar_data = page_address(dp);
- memset(vvar_data, 0, PAGE_SIZE);
-
- vvar_data->seq = 0;
- }
-
return 0;
oom:
if (cpp != NULL) {
@@ -316,15 +83,6 @@ static int __init init_vdso_image(const struct vdso_image *image,
vdso_mapping->pages = NULL;
}
- if (dpp != NULL) {
- for (i = 0; i < dnpages; i++) {
- if (dpp[i] != NULL)
- __free_page(dpp[i]);
- }
- kfree(dpp);
- vvar_mapping.pages = NULL;
- }
-
pr_warn("Cannot allocate vdso\n");
vdso_enabled = 0;
return -ENOMEM;
@@ -359,9 +117,12 @@ static unsigned long vdso_addr(unsigned long start, unsigned int len)
return start + (offset << PAGE_SHIFT);
}
+static_assert(VDSO_NR_PAGES == __VDSO_PAGES);
+
static int map_vdso(const struct vdso_image *image,
struct vm_special_mapping *vdso_mapping)
{
+ const size_t area_size = image->size + VDSO_NR_PAGES * PAGE_SIZE;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long text_start, addr = 0;
@@ -374,23 +135,20 @@ static int map_vdso(const struct vdso_image *image,
* region is free.
*/
if (current->flags & PF_RANDOMIZE) {
- addr = get_unmapped_area(NULL, 0,
- image->size - image->sym_vvar_start,
- 0, 0);
+ addr = get_unmapped_area(NULL, 0, area_size, 0, 0);
if (IS_ERR_VALUE(addr)) {
ret = addr;
goto up_fail;
}
- addr = vdso_addr(addr, image->size - image->sym_vvar_start);
+ addr = vdso_addr(addr, area_size);
}
- addr = get_unmapped_area(NULL, addr,
- image->size - image->sym_vvar_start, 0, 0);
+ addr = get_unmapped_area(NULL, addr, area_size, 0, 0);
if (IS_ERR_VALUE(addr)) {
ret = addr;
goto up_fail;
}
- text_start = addr - image->sym_vvar_start;
+ text_start = addr + VDSO_NR_PAGES * PAGE_SIZE;
current->mm->context.vdso = (void __user *)text_start;
/*
@@ -408,11 +166,7 @@ static int map_vdso(const struct vdso_image *image,
goto up_fail;
}
- vma = _install_special_mapping(mm,
- addr,
- -image->sym_vvar_start,
- VM_READ|VM_MAYREAD,
- &vvar_mapping);
+ vma = vdso_install_vvar_mapping(mm, addr);
if (IS_ERR(vma)) {
ret = PTR_ERR(vma);
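Note on the map_vdso() arithmetic above: the reserved area now consists of
VDSO_NR_PAGES data pages followed by the vDSO text, so the text start is the
area base plus the data pages. A small sketch of that calculation, assuming
the 8192-byte page size mentioned in the linker script and a hypothetical
data page count of 4 (the real value comes from enum vdso_pages):

```c
/* Hypothetical constants for illustration only. */
#define DEMO_PAGE_SIZE     8192UL
#define DEMO_VDSO_NR_PAGES 4UL

/* Total span to reserve: data pages first, then the vDSO text image. */
static unsigned long demo_area_size(unsigned long image_size)
{
	return image_size + DEMO_VDSO_NR_PAGES * DEMO_PAGE_SIZE;
}

/* The text mapping begins directly after the data pages. */
static unsigned long demo_text_start(unsigned long addr)
{
	return addr + DEMO_VDSO_NR_PAGES * DEMO_PAGE_SIZE;
}
```

This replaces the old scheme, where the offset was derived from the negative
sym_vvar_start symbol extracted by vdso2c.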
diff --git a/arch/x86/entry/vdso/vdso32/Makefile b/arch/x86/entry/vdso/vdso32/Makefile
index add6afb484ba..ded4fc6a48cd 100644
--- a/arch/x86/entry/vdso/vdso32/Makefile
+++ b/arch/x86/entry/vdso/vdso32/Makefile
@@ -15,6 +15,10 @@ flags-y := -DBUILD_VDSO32 -m32 -mregparm=0
flags-$(CONFIG_X86_64) += -include $(src)/fake_32bit_build.h
flags-remove-y := -m64
+# Checker flags
+CHECKFLAGS := $(subst -m64,-m32,$(CHECKFLAGS))
+CHECKFLAGS := $(subst -D__x86_64__,-D__i386__,$(CHECKFLAGS))
+
# The location of this include matters!
include $(src)/../common/Makefile.include
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 7ff4d29911fd..b4da1fb976c1 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -56,11 +56,7 @@
#include <linux/sched/isolation.h>
#include <crypto/chacha.h>
#include <crypto/blake2s.h>
-#ifdef CONFIG_VDSO_GETRANDOM
-#include <vdso/getrandom.h>
#include <vdso/datapage.h>
-#include <vdso/vsyscall.h>
-#endif
#include <asm/archrandom.h>
#include <asm/processor.h>
#include <asm/irq.h>
@@ -269,7 +265,7 @@ static void crng_reseed(struct work_struct *work)
if (next_gen == ULONG_MAX)
++next_gen;
WRITE_ONCE(base_crng.generation, next_gen);
-#ifdef CONFIG_VDSO_GETRANDOM
+
/* base_crng.generation's invalid value is ULONG_MAX, while
* vdso_k_rng_data->generation's invalid value is 0, so add one to the
* former to arrive at the latter. Use smp_store_release so that this
@@ -283,8 +279,9 @@ static void crng_reseed(struct work_struct *work)
* because the vDSO side only checks whether the value changed, without
* actually using or interpreting the value.
*/
- smp_store_release((unsigned long *)&vdso_k_rng_data->generation, next_gen + 1);
-#endif
+ if (IS_ENABLED(CONFIG_VDSO_GETRANDOM))
+ smp_store_release((unsigned long *)&vdso_k_rng_data->generation, next_gen + 1);
+
if (!static_branch_likely(&crng_is_ready))
crng_init = CRNG_READY;
spin_unlock_irqrestore(&base_crng.lock, flags);
@@ -734,9 +731,8 @@ static void __cold _credit_init_bits(size_t bits)
if (system_dfl_wq)
queue_work(system_dfl_wq, &set_ready);
atomic_notifier_call_chain(&random_ready_notifier, 0, NULL);
-#ifdef CONFIG_VDSO_GETRANDOM
- WRITE_ONCE(vdso_k_rng_data->is_ready, true);
-#endif
+ if (IS_ENABLED(CONFIG_VDSO_GETRANDOM))
+ WRITE_ONCE(vdso_k_rng_data->is_ready, true);
wake_up_interruptible(&crng_init_wait);
kill_fasync(&fasync, SIGIO, POLL_IN);
pr_notice("crng init done\n");
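Note on the #ifdef to IS_ENABLED() conversion above: with IS_ENABLED() the
guarded statement is always parsed and type-checked, and the compiler drops
it as dead code when the option is off. That is also why the referenced
declarations must be visible unconditionally. Below is a simplified userspace
sketch of the mechanism, modeled on the kernel's macro but handling only the
"defined to 1" case. All demo_/DEMO_ names are hypothetical.

```c
/* Expands to 1 if the option macro is defined as 1, else to 0. */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define demo_take_second_arg(ignored, val, ...) val
#define demo_is_defined(x) demo__is_defined(x)
#define demo__is_defined(val) demo___is_defined(DEMO_ARG_PLACEHOLDER_##val)
#define demo___is_defined(arg1_or_junk) demo_take_second_arg(arg1_or_junk 1, 0, 0)
#define DEMO_IS_ENABLED(option) demo_is_defined(option)

#define CONFIG_DEMO_ON 1
/* CONFIG_DEMO_OFF is deliberately left undefined. */

static int demo_generation;

static void demo_publish(int next_gen)
{
	/* Taken: the option is defined as 1. */
	if (DEMO_IS_ENABLED(CONFIG_DEMO_ON))
		demo_generation = next_gen + 1;

	/* Compiled but never executed: the condition is the constant 0. */
	if (DEMO_IS_ENABLED(CONFIG_DEMO_OFF))
		demo_generation = -1;
}
```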
diff --git a/include/asm-generic/bitsperlong.h b/include/asm-generic/bitsperlong.h
index 1023e2a4bd37..90e8aeebfd2f 100644
--- a/include/asm-generic/bitsperlong.h
+++ b/include/asm-generic/bitsperlong.h
@@ -19,6 +19,15 @@
#error Inconsistent word size. Check asm/bitsperlong.h
#endif
+#if __CHAR_BIT__ * __SIZEOF_LONG__ != __BITS_PER_LONG
+#error Inconsistent word size. Check asm/bitsperlong.h
+#endif
+
+#ifndef __ASSEMBLER__
+_Static_assert(sizeof(long) * 8 == __BITS_PER_LONG,
+ "Inconsistent word size. Check asm/bitsperlong.h");
+#endif
+
#ifndef BITS_PER_LONG_LONG
#define BITS_PER_LONG_LONG 64
#endif
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 65b7c41471c3..12d853b18832 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -25,8 +25,7 @@ struct clocksource_base;
struct clocksource;
struct module;
-#if defined(CONFIG_ARCH_CLOCKSOURCE_DATA) || \
- defined(CONFIG_GENERIC_GETTIMEOFDAY)
+#if defined(CONFIG_GENERIC_GETTIMEOFDAY)
#include <asm/clocksource.h>
#endif
@@ -106,9 +105,6 @@ struct clocksource {
u64 max_idle_ns;
u32 maxadj;
u32 uncertainty_margin;
-#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
- struct arch_clocksource_data archdata;
-#endif
u64 max_cycles;
u64 max_raw_delta;
const char *name;
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index c514d0e5a45c..58bd9728df58 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -8,6 +8,7 @@
#include <linux/ns_common.h>
#include <linux/err.h>
#include <linux/time64.h>
+#include <linux/cleanup.h>
struct user_namespace;
extern struct user_namespace init_user_ns;
@@ -25,7 +26,9 @@ struct time_namespace {
struct ucounts *ucounts;
struct ns_common ns;
struct timens_offsets offsets;
+#ifdef CONFIG_TIME_NS_VDSO
struct page *vvar_page;
+#endif
/* If set prevents changing offsets after any task joined namespace. */
bool frozen_offsets;
} __randomize_layout;
@@ -38,9 +41,6 @@ static inline struct time_namespace *to_time_ns(struct ns_common *ns)
return container_of(ns, struct time_namespace, ns);
}
void __init time_ns_init(void);
-extern int vdso_join_timens(struct task_struct *task,
- struct time_namespace *ns);
-extern void timens_commit(struct task_struct *tsk, struct time_namespace *ns);
static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
{
@@ -53,7 +53,6 @@ struct time_namespace *copy_time_ns(u64 flags,
struct time_namespace *old_ns);
void free_time_ns(struct time_namespace *ns);
void timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk);
-struct page *find_timens_vvar_page(struct vm_area_struct *vma);
static inline void put_time_ns(struct time_namespace *ns)
{
@@ -117,17 +116,6 @@ static inline void __init time_ns_init(void)
{
}
-static inline int vdso_join_timens(struct task_struct *task,
- struct time_namespace *ns)
-{
- return 0;
-}
-
-static inline void timens_commit(struct task_struct *tsk,
- struct time_namespace *ns)
-{
-}
-
static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
{
return NULL;
@@ -154,11 +142,6 @@ static inline void timens_on_fork(struct nsproxy *nsproxy,
return;
}
-static inline struct page *find_timens_vvar_page(struct vm_area_struct *vma)
-{
- return NULL;
-}
-
static inline void timens_add_monotonic(struct timespec64 *ts) { }
static inline void timens_add_boottime(struct timespec64 *ts) { }
@@ -175,4 +158,20 @@ static inline ktime_t timens_ktime_to_host(clockid_t clockid, ktime_t tim)
}
#endif
+#ifdef CONFIG_TIME_NS_VDSO
+extern void timens_commit(struct task_struct *tsk, struct time_namespace *ns);
+struct page *find_timens_vvar_page(struct vm_area_struct *vma);
+#else /* !CONFIG_TIME_NS_VDSO */
+static inline void timens_commit(struct task_struct *tsk, struct time_namespace *ns)
+{
+}
+
+static inline struct page *find_timens_vvar_page(struct vm_area_struct *vma)
+{
+ return NULL;
+}
+#endif /* CONFIG_TIME_NS_VDSO */
+
+DEFINE_FREE(time_ns, struct time_namespace *, if (_T) put_time_ns(_T))
+
#endif /* _LINUX_TIMENS_H */
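Note on the DEFINE_FREE(time_ns, ...) line above: it registers a scope-based
cleanup that calls put_time_ns() on a non-NULL pointer when the annotated
variable goes out of scope. The underlying compiler feature is the cleanup
variable attribute of GCC and Clang. A minimal userspace sketch follows; the
demo_ names are hypothetical stand-ins and not the kernel helpers.

```c
struct demo_ns {
	int refcount;
};

static int demo_puts; /* counts how often the cleanup dropped a reference */

static void demo_put_ns(struct demo_ns *ns)
{
	ns->refcount--;
	demo_puts++;
}

/* The cleanup handler receives a pointer to the annotated variable. */
static void demo_free_ns(struct demo_ns **p)
{
	if (*p)
		demo_put_ns(*p);
}

static int demo_use(struct demo_ns *ns, int fail)
{
	/* The reference is dropped automatically on every return path. */
	struct demo_ns *ref __attribute__((cleanup(demo_free_ns))) = ns;

	ref->refcount++;
	if (fail)
		return -1;
	return 0;
}
```

The benefit is that early error returns no longer need a matching manual put
call.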
diff --git a/include/linux/vdso_datastore.h b/include/linux/vdso_datastore.h
index a91fa24b06e0..0b530428db71 100644
--- a/include/linux/vdso_datastore.h
+++ b/include/linux/vdso_datastore.h
@@ -2,9 +2,15 @@
#ifndef _LINUX_VDSO_DATASTORE_H
#define _LINUX_VDSO_DATASTORE_H
+#ifdef CONFIG_HAVE_GENERIC_VDSO
#include <linux/mm_types.h>
extern const struct vm_special_mapping vdso_vvar_mapping;
struct vm_area_struct *vdso_install_vvar_mapping(struct mm_struct *mm, unsigned long addr);
+void __init vdso_setup_data_pages(void);
+#else /* !CONFIG_HAVE_GENERIC_VDSO */
+static inline void vdso_setup_data_pages(void) { }
+#endif /* CONFIG_HAVE_GENERIC_VDSO */
+
#endif /* _LINUX_VDSO_DATASTORE_H */
diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h
index 23c39b96190f..5977723fb3b5 100644
--- a/include/vdso/datapage.h
+++ b/include/vdso/datapage.h
@@ -4,24 +4,16 @@
#ifndef __ASSEMBLY__
-#include <linux/compiler.h>
+#include <linux/types.h>
+
#include <uapi/linux/bits.h>
#include <uapi/linux/time.h>
-#include <uapi/linux/types.h>
-#include <uapi/asm-generic/errno-base.h>
#include <vdso/align.h>
#include <vdso/bits.h>
#include <vdso/cache.h>
-#include <vdso/clocksource.h>
-#include <vdso/ktime.h>
-#include <vdso/limits.h>
-#include <vdso/math64.h>
#include <vdso/page.h>
-#include <vdso/processor.h>
#include <vdso/time.h>
-#include <vdso/time32.h>
-#include <vdso/time64.h>
#ifdef CONFIG_ARCH_HAS_VDSO_TIME_DATA
#include <asm/vdso/time_data.h>
@@ -80,8 +72,8 @@ struct vdso_timestamp {
* @mask: clocksource mask
* @mult: clocksource multiplier
* @shift: clocksource shift
- * @basetime[clock_id]: basetime per clock_id
- * @offset[clock_id]: time namespace offset per clock_id
+ * @basetime: basetime per clock_id
+ * @offset: time namespace offset per clock_id
*
* See also struct vdso_time_data for basic access and ordering information as
* struct vdso_clock is used there.
@@ -184,17 +176,6 @@ enum vdso_pages {
VDSO_NR_PAGES
};
-/*
- * The generic vDSO implementation requires that gettimeofday.h
- * provides:
- * - __arch_get_hw_counter(): to get the hw counter based on the
- * clock_mode.
- * - gettimeofday_fallback(): fallback for gettimeofday.
- * - clock_gettime_fallback(): fallback for clock_gettime.
- * - clock_getres_fallback(): fallback for clock_getres.
- */
-#include <asm/vdso/gettimeofday.h>
-
#else /* !__ASSEMBLY__ */
#ifdef CONFIG_VDSO_GETRANDOM
diff --git a/include/vdso/helpers.h b/include/vdso/helpers.h
index 1a5ee9d9052c..a3bf4f1c0d37 100644
--- a/include/vdso/helpers.h
+++ b/include/vdso/helpers.h
@@ -6,6 +6,13 @@
#include <asm/barrier.h>
#include <vdso/datapage.h>
+#include <vdso/processor.h>
+#include <vdso/clocksource.h>
+
+static __always_inline bool vdso_is_timens_clock(const struct vdso_clock *vc)
+{
+ return IS_ENABLED(CONFIG_TIME_NS) && vc->clock_mode == VDSO_CLOCKMODE_TIMENS;
+}
static __always_inline u32 vdso_read_begin(const struct vdso_clock *vc)
{
@@ -18,6 +25,28 @@ static __always_inline u32 vdso_read_begin(const struct vdso_clock *vc)
return seq;
}
+/*
+ * Variant of vdso_read_begin() to handle VDSO_CLOCKMODE_TIMENS.
+ *
+ * Time namespace enabled tasks have a special VVAR page installed which has
+ * vc->seq set to 1 and vc->clock_mode set to VDSO_CLOCKMODE_TIMENS. For non
+ * time namespace affected tasks this does not affect performance because if
+ * vc->seq is odd, i.e. a concurrent update is in progress the extra check for
+ * vc->clock_mode is just a few extra instructions while spin waiting for
+ * vc->seq to become even again.
+ */
+static __always_inline bool vdso_read_begin_timens(const struct vdso_clock *vc, u32 *seq)
+{
+ while (unlikely((*seq = READ_ONCE(vc->seq)) & 1)) {
+ if (vdso_is_timens_clock(vc))
+ return true;
+ cpu_relax();
+ }
+ smp_rmb();
+
+ return false;
+}
+
static __always_inline u32 vdso_read_retry(const struct vdso_clock *vc,
u32 start)
{
@@ -25,7 +54,7 @@ static __always_inline u32 vdso_read_retry(const struct vdso_clock *vc,
smp_rmb();
seq = READ_ONCE(vc->seq);
- return seq != start;
+ return unlikely(seq != start);
}
static __always_inline void vdso_write_seq_begin(struct vdso_clock *vc)
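Note on vdso_read_begin_timens() above: it folds the time namespace check into
the wait loop for an odd sequence count, so the common path pays nothing
extra. A single-threaded userspace sketch of how a reader might use it is
given below. The struct layout, the DEMO_CLOCKMODE_TIMENS value and the
acquire fences standing in for smp_rmb() are assumptions made for
illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CLOCKMODE_TIMENS 0x7fffffff /* hypothetical marker value */

struct demo_clock {
	uint32_t seq;
	int32_t clock_mode;
	uint64_t value;
};

/* Returns true for the time namespace page, else stores an even seq. */
static bool demo_read_begin_timens(const volatile struct demo_clock *vc,
				   uint32_t *seq)
{
	while ((*seq = vc->seq) & 1) {
		if (vc->clock_mode == DEMO_CLOCKMODE_TIMENS)
			return true;
		/* A real reader would cpu_relax() here while a writer is active. */
	}
	__atomic_thread_fence(__ATOMIC_ACQUIRE); /* stand-in for smp_rmb() */
	return false;
}

static bool demo_read_retry(const volatile struct demo_clock *vc, uint32_t start)
{
	__atomic_thread_fence(__ATOMIC_ACQUIRE); /* stand-in for smp_rmb() */
	return vc->seq != start;
}

/* Returns 0 with a consistent value, or -1 if the timens path is needed. */
static int demo_read(const volatile struct demo_clock *vc, uint64_t *out)
{
	uint32_t seq;

	do {
		if (demo_read_begin_timens(vc, &seq))
			return -1;
		*out = vc->value;
	} while (demo_read_retry(vc, seq));

	return 0;
}
```

A page with seq permanently set to 1 and the timens clock mode makes the
reader bail out immediately instead of spinning.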
diff --git a/init/Kconfig b/init/Kconfig
index 444ce811ea67..5e710b03a27a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1386,12 +1386,14 @@ config UTS_NS
config TIME_NS
bool "TIME namespace"
- depends on GENERIC_GETTIMEOFDAY
default y
help
In this namespace boottime and monotonic clocks can be set.
The time will keep going with the same pace.
+config TIME_NS_VDSO
+ def_bool TIME_NS && GENERIC_GETTIMEOFDAY
+
config IPC_NS
bool "IPC namespace"
depends on (SYSVIPC || POSIX_MQUEUE)
diff --git a/init/main.c b/init/main.c
index 1cb395dd94e4..de867b2693d2 100644
--- a/init/main.c
+++ b/init/main.c
@@ -105,6 +105,7 @@
#include <linux/ptdump.h>
#include <linux/time_namespace.h>
#include <linux/unaligned.h>
+#include <linux/vdso_datastore.h>
#include <net/net_namespace.h>
#include <asm/io.h>
@@ -1119,6 +1120,7 @@ void start_kernel(void)
srcu_init();
hrtimers_init();
softirq_init();
+ vdso_setup_data_pages();
timekeeping_init();
time_init();
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 7c6a52f7836c..fe3311877097 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -9,10 +9,6 @@
config CLOCKSOURCE_WATCHDOG
bool
-# Architecture has extra clocksource data
-config ARCH_CLOCKSOURCE_DATA
- bool
-
# Architecture has extra clocksource init called from registration
config ARCH_CLOCKSOURCE_INIT
bool
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index f7d52d9543cc..eaf290c972f9 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -30,5 +30,6 @@ obj-$(CONFIG_GENERIC_GETTIMEOFDAY) += vsyscall.o
obj-$(CONFIG_DEBUG_FS) += timekeeping_debug.o
obj-$(CONFIG_TEST_UDELAY) += test_udelay.o
obj-$(CONFIG_TIME_NS) += namespace.o
+obj-$(CONFIG_TIME_NS_VDSO) += namespace_vdso.o
obj-$(CONFIG_TEST_CLOCKSOURCE_WATCHDOG) += clocksource-wdtest.o
obj-$(CONFIG_TIME_KUNIT_TEST) += time_test.o
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 652744e00eb4..4bca3f78c8ea 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -18,8 +18,9 @@
#include <linux/cred.h>
#include <linux/err.h>
#include <linux/mm.h>
+#include <linux/cleanup.h>
-#include <vdso/datapage.h>
+#include "namespace_internal.h"
ktime_t do_timens_ktime_to_host(clockid_t clockid, ktime_t tim,
struct timens_offsets *ns_offsets)
@@ -93,8 +94,8 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
if (!ns)
goto fail_dec;
- ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
- if (!ns->vvar_page)
+ err = timens_vdso_alloc_vvar_page(ns);
+ if (err)
goto fail_free;
err = ns_common_init(ns);
@@ -109,7 +110,7 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
return ns;
fail_free_page:
- __free_page(ns->vvar_page);
+ timens_vdso_free_vvar_page(ns);
fail_free:
kfree(ns);
fail_dec:
@@ -138,117 +139,7 @@ struct time_namespace *copy_time_ns(u64 flags,
return clone_time_ns(user_ns, old_ns);
}
-static struct timens_offset offset_from_ts(struct timespec64 off)
-{
- struct timens_offset ret;
-
- ret.sec = off.tv_sec;
- ret.nsec = off.tv_nsec;
-
- return ret;
-}
-
-/*
- * A time namespace VVAR page has the same layout as the VVAR page which
- * contains the system wide VDSO data.
- *
- * For a normal task the VVAR pages are installed in the normal ordering:
- * VVAR
- * PVCLOCK
- * HVCLOCK
- * TIMENS <- Not really required
- *
- * Now for a timens task the pages are installed in the following order:
- * TIMENS
- * PVCLOCK
- * HVCLOCK
- * VVAR
- *
- * The check for vdso_clock->clock_mode is in the unlikely path of
- * the seq begin magic. So for the non-timens case most of the time
- * 'seq' is even, so the branch is not taken.
- *
- * If 'seq' is odd, i.e. a concurrent update is in progress, the extra check
- * for vdso_clock->clock_mode is a non-issue. The task is spin waiting for the
- * update to finish and for 'seq' to become even anyway.
- *
- * Timens page has vdso_clock->clock_mode set to VDSO_CLOCKMODE_TIMENS which
- * enforces the time namespace handling path.
- */
-static void timens_setup_vdso_clock_data(struct vdso_clock *vc,
- struct time_namespace *ns)
-{
- struct timens_offset *offset = vc->offset;
- struct timens_offset monotonic = offset_from_ts(ns->offsets.monotonic);
- struct timens_offset boottime = offset_from_ts(ns->offsets.boottime);
-
- vc->seq = 1;
- vc->clock_mode = VDSO_CLOCKMODE_TIMENS;
- offset[CLOCK_MONOTONIC] = monotonic;
- offset[CLOCK_MONOTONIC_RAW] = monotonic;
- offset[CLOCK_MONOTONIC_COARSE] = monotonic;
- offset[CLOCK_BOOTTIME] = boottime;
- offset[CLOCK_BOOTTIME_ALARM] = boottime;
-}
-
-struct page *find_timens_vvar_page(struct vm_area_struct *vma)
-{
- if (likely(vma->vm_mm == current->mm))
- return current->nsproxy->time_ns->vvar_page;
-
- /*
- * VM_PFNMAP | VM_IO protect .fault() handler from being called
- * through interfaces like /proc/$pid/mem or
- * process_vm_{readv,writev}() as long as there's no .access()
- * in special_mapping_vmops().
- * For more details check_vma_flags() and __access_remote_vm()
- */
-
- WARN(1, "vvar_page accessed remotely");
-
- return NULL;
-}
-
-/*
- * Protects possibly multiple offsets writers racing each other
- * and tasks entering the namespace.
- */
-static DEFINE_MUTEX(offset_lock);
-
-static void timens_set_vvar_page(struct task_struct *task,
- struct time_namespace *ns)
-{
- struct vdso_time_data *vdata;
- struct vdso_clock *vc;
- unsigned int i;
-
- if (ns == &init_time_ns)
- return;
-
- /* Fast-path, taken by every task in namespace except the first. */
- if (likely(ns->frozen_offsets))
- return;
-
- mutex_lock(&offset_lock);
- /* Nothing to-do: vvar_page has been already initialized. */
- if (ns->frozen_offsets)
- goto out;
-
- ns->frozen_offsets = true;
- vdata = page_address(ns->vvar_page);
- vc = vdata->clock_data;
-
- for (i = 0; i < CS_BASES; i++)
- timens_setup_vdso_clock_data(&vc[i], ns);
-
- if (IS_ENABLED(CONFIG_POSIX_AUX_CLOCKS)) {
- for (i = 0; i < ARRAY_SIZE(vdata->aux_clock_data); i++)
- timens_setup_vdso_clock_data(&vdata->aux_clock_data[i], ns);
- }
-
-out:
- mutex_unlock(&offset_lock);
-}
+DEFINE_MUTEX(timens_offset_lock);
void free_time_ns(struct time_namespace *ns)
{
@@ -256,41 +147,39 @@ void free_time_ns(struct time_namespace *ns)
dec_time_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
ns_common_free(ns);
- __free_page(ns->vvar_page);
+ timens_vdso_free_vvar_page(ns);
/* Concurrent nstree traversal depends on a grace period. */
kfree_rcu(ns, ns.ns_rcu);
}
static struct ns_common *timens_get(struct task_struct *task)
{
- struct time_namespace *ns = NULL;
+ struct time_namespace *ns;
struct nsproxy *nsproxy;
- task_lock(task);
+ guard(task_lock)(task);
nsproxy = task->nsproxy;
- if (nsproxy) {
- ns = nsproxy->time_ns;
- get_time_ns(ns);
- }
- task_unlock(task);
+ if (!nsproxy)
+ return NULL;
- return ns ? &ns->ns : NULL;
+ ns = nsproxy->time_ns;
+ get_time_ns(ns);
+ return &ns->ns;
}
static struct ns_common *timens_for_children_get(struct task_struct *task)
{
- struct time_namespace *ns = NULL;
+ struct time_namespace *ns;
struct nsproxy *nsproxy;
- task_lock(task);
+ guard(task_lock)(task);
nsproxy = task->nsproxy;
- if (nsproxy) {
- ns = nsproxy->time_ns_for_children;
- get_time_ns(ns);
- }
- task_unlock(task);
+ if (!nsproxy)
+ return NULL;
- return ns ? &ns->ns : NULL;
+ ns = nsproxy->time_ns_for_children;
+ get_time_ns(ns);
+ return &ns->ns;
}
static void timens_put(struct ns_common *ns)
@@ -298,12 +187,6 @@ static void timens_put(struct ns_common *ns)
put_time_ns(to_time_ns(ns));
}
-void timens_commit(struct task_struct *tsk, struct time_namespace *ns)
-{
- timens_set_vvar_page(tsk, ns);
- vdso_join_timens(tsk, ns);
-}
-
static int timens_install(struct nsset *nsset, struct ns_common *new)
{
struct nsproxy *nsproxy = nsset->nsproxy;
@@ -367,36 +250,33 @@ static void show_offset(struct seq_file *m, int clockid, struct timespec64 *ts)
void proc_timens_show_offsets(struct task_struct *p, struct seq_file *m)
{
- struct ns_common *ns;
- struct time_namespace *time_ns;
+ struct time_namespace *time_ns __free(time_ns) = NULL;
+ struct ns_common *ns = timens_for_children_get(p);
- ns = timens_for_children_get(p);
if (!ns)
return;
+
time_ns = to_time_ns(ns);
show_offset(m, CLOCK_MONOTONIC, &time_ns->offsets.monotonic);
show_offset(m, CLOCK_BOOTTIME, &time_ns->offsets.boottime);
- put_time_ns(time_ns);
}
int proc_timens_set_offset(struct file *file, struct task_struct *p,
struct proc_timens_offset *offsets, int noffsets)
{
- struct ns_common *ns;
- struct time_namespace *time_ns;
+ struct time_namespace *time_ns __free(time_ns) = NULL;
+ struct ns_common *ns = timens_for_children_get(p);
struct timespec64 tp;
- int i, err;
+ int i;
- ns = timens_for_children_get(p);
if (!ns)
return -ESRCH;
+
time_ns = to_time_ns(ns);
- if (!file_ns_capable(file, time_ns->user_ns, CAP_SYS_TIME)) {
- put_time_ns(time_ns);
+ if (!file_ns_capable(file, time_ns->user_ns, CAP_SYS_TIME))
return -EPERM;
- }
for (i = 0; i < noffsets; i++) {
struct proc_timens_offset *off = &offsets[i];
@@ -409,15 +289,12 @@ int proc_timens_set_offset(struct file *file, struct task_struct *p,
ktime_get_boottime_ts64(&tp);
break;
default:
- err = -EINVAL;
- goto out;
+ return -EINVAL;
}
- err = -ERANGE;
-
if (off->val.tv_sec > KTIME_SEC_MAX ||
off->val.tv_sec < -KTIME_SEC_MAX)
- goto out;
+ return -ERANGE;
tp = timespec64_add(tp, off->val);
/*
@@ -425,16 +302,13 @@ int proc_timens_set_offset(struct file *file, struct task_struct *p,
* still unreachable.
*/
if (tp.tv_sec < 0 || tp.tv_sec > KTIME_SEC_MAX / 2)
- goto out;
+ return -ERANGE;
}
- mutex_lock(&offset_lock);
- if (time_ns->frozen_offsets) {
- err = -EACCES;
- goto out_unlock;
- }
+ guard(mutex)(&timens_offset_lock);
+ if (time_ns->frozen_offsets)
+ return -EACCES;
- err = 0;
/* Don't report errors after this line */
for (i = 0; i < noffsets; i++) {
struct proc_timens_offset *off = &offsets[i];
@@ -452,12 +326,7 @@ int proc_timens_set_offset(struct file *file, struct task_struct *p,
*offset = off->val;
}
-out_unlock:
- mutex_unlock(&offset_lock);
-out:
- put_time_ns(time_ns);
-
- return err;
+ return 0;
}
const struct proc_ns_operations timens_operations = {
diff --git a/kernel/time/namespace_internal.h b/kernel/time/namespace_internal.h
new file mode 100644
index 000000000000..b37ba179f43b
--- /dev/null
+++ b/kernel/time/namespace_internal.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TIME_NAMESPACE_INTERNAL_H
+#define _TIME_NAMESPACE_INTERNAL_H
+
+#include <linux/mutex.h>
+
+struct time_namespace;
+
+/*
+ * Protects possibly multiple offsets writers racing each other
+ * and tasks entering the namespace.
+ */
+extern struct mutex timens_offset_lock;
+
+#ifdef CONFIG_TIME_NS_VDSO
+int timens_vdso_alloc_vvar_page(struct time_namespace *ns);
+void timens_vdso_free_vvar_page(struct time_namespace *ns);
+#else /* !CONFIG_TIME_NS_VDSO */
+static inline int timens_vdso_alloc_vvar_page(struct time_namespace *ns)
+{
+ return 0;
+}
+static inline void timens_vdso_free_vvar_page(struct time_namespace *ns)
+{
+}
+#endif /* CONFIG_TIME_NS_VDSO */
+
+#endif /* _TIME_NAMESPACE_INTERNAL_H */
diff --git a/kernel/time/namespace_vdso.c b/kernel/time/namespace_vdso.c
new file mode 100644
index 000000000000..88c075cd16a3
--- /dev/null
+++ b/kernel/time/namespace_vdso.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Author: Andrei Vagin <avagin@openvz.org>
+ * Author: Dmitry Safonov <dima@arista.com>
+ */
+
+#include <linux/cleanup.h>
+#include <linux/mm.h>
+#include <linux/time_namespace.h>
+#include <linux/time.h>
+#include <linux/vdso_datastore.h>
+
+#include <vdso/clocksource.h>
+#include <vdso/datapage.h>
+
+#include "namespace_internal.h"
+
+static struct timens_offset offset_from_ts(struct timespec64 off)
+{
+ struct timens_offset ret;
+
+ ret.sec = off.tv_sec;
+ ret.nsec = off.tv_nsec;
+
+ return ret;
+}
+
+/*
+ * A time namespace VVAR page has the same layout as the VVAR page which
+ * contains the system wide VDSO data.
+ *
+ * For a normal task the VVAR pages are installed in the normal ordering:
+ * VVAR
+ * PVCLOCK
+ * HVCLOCK
+ * TIMENS <- Not really required
+ *
+ * Now for a timens task the pages are installed in the following order:
+ * TIMENS
+ * PVCLOCK
+ * HVCLOCK
+ * VVAR
+ *
+ * The check for vdso_clock->clock_mode is in the unlikely path of
+ * the seq begin magic. So for the non-timens case most of the time
+ * 'seq' is even, so the branch is not taken.
+ *
+ * If 'seq' is odd, i.e. a concurrent update is in progress, the extra check
+ * for vdso_clock->clock_mode is a non-issue. The task is spin waiting for the
+ * update to finish and for 'seq' to become even anyway.
+ *
+ * Timens page has vdso_clock->clock_mode set to VDSO_CLOCKMODE_TIMENS which
+ * enforces the time namespace handling path.
+ */
+static void timens_setup_vdso_clock_data(struct vdso_clock *vc,
+ struct time_namespace *ns)
+{
+ struct timens_offset *offset = vc->offset;
+ struct timens_offset monotonic = offset_from_ts(ns->offsets.monotonic);
+ struct timens_offset boottime = offset_from_ts(ns->offsets.boottime);
+
+ vc->seq = 1;
+ vc->clock_mode = VDSO_CLOCKMODE_TIMENS;
+ offset[CLOCK_MONOTONIC] = monotonic;
+ offset[CLOCK_MONOTONIC_RAW] = monotonic;
+ offset[CLOCK_MONOTONIC_COARSE] = monotonic;
+ offset[CLOCK_BOOTTIME] = boottime;
+ offset[CLOCK_BOOTTIME_ALARM] = boottime;
+}
+
+struct page *find_timens_vvar_page(struct vm_area_struct *vma)
+{
+ if (likely(vma->vm_mm == current->mm))
+ return current->nsproxy->time_ns->vvar_page;
+
+ /*
+ * VM_PFNMAP | VM_IO protect .fault() handler from being called
+ * through interfaces like /proc/$pid/mem or
+ * process_vm_{readv,writev}() as long as there's no .access()
+ * in special_mapping_vmops().
+ * For more details check_vma_flags() and __access_remote_vm()
+ */
+
+ WARN(1, "vvar_page accessed remotely");
+
+ return NULL;
+}
+
+static void timens_set_vvar_page(struct task_struct *task,
+ struct time_namespace *ns)
+{
+ struct vdso_time_data *vdata;
+ struct vdso_clock *vc;
+ unsigned int i;
+
+ if (ns == &init_time_ns)
+ return;
+
+ /* Fast-path, taken by every task in namespace except the first. */
+ if (likely(ns->frozen_offsets))
+ return;
+
+ guard(mutex)(&timens_offset_lock);
+ /* Nothing to-do: vvar_page has been already initialized. */
+ if (ns->frozen_offsets)
+ return;
+
+ ns->frozen_offsets = true;
+ vdata = page_address(ns->vvar_page);
+ vc = vdata->clock_data;
+
+ for (i = 0; i < CS_BASES; i++)
+ timens_setup_vdso_clock_data(&vc[i], ns);
+
+ if (IS_ENABLED(CONFIG_POSIX_AUX_CLOCKS)) {
+ for (i = 0; i < ARRAY_SIZE(vdata->aux_clock_data); i++)
+ timens_setup_vdso_clock_data(&vdata->aux_clock_data[i], ns);
+ }
+}
+
+/*
+ * The vvar page layout depends on whether a task belongs to the root or
+ * non-root time namespace. Whenever a task changes its namespace, the VVAR
+ * page tables are cleared and then they will be re-faulted with a
+ * corresponding layout.
+ * See also the comment near timens_setup_vdso_clock_data() for details.
+ */
+static int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
+{
+ struct mm_struct *mm = task->mm;
+ struct vm_area_struct *vma;
+ VMA_ITERATOR(vmi, mm, 0);
+
+ guard(mmap_read_lock)(mm);
+ for_each_vma(vmi, vma) {
+ if (vma_is_special_mapping(vma, &vdso_vvar_mapping))
+ zap_vma_pages(vma);
+ }
+ return 0;
+}
+
+void timens_commit(struct task_struct *tsk, struct time_namespace *ns)
+{
+ timens_set_vvar_page(tsk, ns);
+ vdso_join_timens(tsk, ns);
+}
+
+int timens_vdso_alloc_vvar_page(struct time_namespace *ns)
+{
+ ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ if (!ns->vvar_page)
+ return -ENOMEM;
+
+ return 0;
+}
+
+void timens_vdso_free_vvar_page(struct time_namespace *ns)
+{
+ __free_page(ns->vvar_page);
+}
diff --git a/lib/vdso/datastore.c b/lib/vdso/datastore.c
index a565c30c71a0..cf5d784a4a5a 100644
--- a/lib/vdso/datastore.c
+++ b/lib/vdso/datastore.c
@@ -1,64 +1,92 @@
// SPDX-License-Identifier: GPL-2.0-only
-#include <linux/linkage.h>
-#include <linux/mmap_lock.h>
+#include <linux/gfp.h>
+#include <linux/init.h>
#include <linux/mm.h>
#include <linux/time_namespace.h>
#include <linux/types.h>
#include <linux/vdso_datastore.h>
#include <vdso/datapage.h>
-/*
- * The vDSO data page.
- */
+static u8 vdso_initdata[VDSO_NR_PAGES * PAGE_SIZE] __aligned(PAGE_SIZE) __initdata = {};
+
#ifdef CONFIG_GENERIC_GETTIMEOFDAY
-static union {
- struct vdso_time_data data;
- u8 page[PAGE_SIZE];
-} vdso_time_data_store __page_aligned_data;
-struct vdso_time_data *vdso_k_time_data = &vdso_time_data_store.data;
-static_assert(sizeof(vdso_time_data_store) == PAGE_SIZE);
+struct vdso_time_data *vdso_k_time_data __refdata =
+ (void *)&vdso_initdata[VDSO_TIME_PAGE_OFFSET * PAGE_SIZE];
+
+static_assert(sizeof(struct vdso_time_data) <= PAGE_SIZE);
#endif /* CONFIG_GENERIC_GETTIMEOFDAY */
#ifdef CONFIG_VDSO_GETRANDOM
-static union {
- struct vdso_rng_data data;
- u8 page[PAGE_SIZE];
-} vdso_rng_data_store __page_aligned_data;
-struct vdso_rng_data *vdso_k_rng_data = &vdso_rng_data_store.data;
-static_assert(sizeof(vdso_rng_data_store) == PAGE_SIZE);
+struct vdso_rng_data *vdso_k_rng_data __refdata =
+ (void *)&vdso_initdata[VDSO_RNG_PAGE_OFFSET * PAGE_SIZE];
+
+static_assert(sizeof(struct vdso_rng_data) <= PAGE_SIZE);
#endif /* CONFIG_VDSO_GETRANDOM */
#ifdef CONFIG_ARCH_HAS_VDSO_ARCH_DATA
-static union {
- struct vdso_arch_data data;
- u8 page[VDSO_ARCH_DATA_SIZE];
-} vdso_arch_data_store __page_aligned_data;
-struct vdso_arch_data *vdso_k_arch_data = &vdso_arch_data_store.data;
+struct vdso_arch_data *vdso_k_arch_data __refdata =
+ (void *)&vdso_initdata[VDSO_ARCH_PAGES_START * PAGE_SIZE];
#endif /* CONFIG_ARCH_HAS_VDSO_ARCH_DATA */
+void __init vdso_setup_data_pages(void)
+{
+ unsigned int order = get_order(VDSO_NR_PAGES * PAGE_SIZE);
+ struct page *pages;
+
+ /*
+ * Allocate the data pages dynamically. SPARC does not support mapping
+ * static pages into userspace.
+ * It is also a requirement for mlockall() support.
+ *
+ * Do not use folios. In time namespaces the pages are mapped in a different order
+ * to userspace, which is not handled by the folio optimizations in finish_fault().
+ */
+ pages = alloc_pages(GFP_KERNEL, order);
+ if (!pages)
+ panic("Unable to allocate VDSO storage pages");
+
+ /* The pages are mapped one-by-one into userspace and each one needs to be refcounted. */
+ split_page(pages, order);
+
+ /* Move the data already written by other subsystems to the new pages */
+ memcpy(page_address(pages), vdso_initdata, VDSO_NR_PAGES * PAGE_SIZE);
+
+ if (IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY))
+ vdso_k_time_data = page_address(pages + VDSO_TIME_PAGE_OFFSET);
+
+ if (IS_ENABLED(CONFIG_VDSO_GETRANDOM))
+ vdso_k_rng_data = page_address(pages + VDSO_RNG_PAGE_OFFSET);
+
+ if (IS_ENABLED(CONFIG_ARCH_HAS_VDSO_ARCH_DATA))
+ vdso_k_arch_data = page_address(pages + VDSO_ARCH_PAGES_START);
+}
+
static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
struct vm_area_struct *vma, struct vm_fault *vmf)
{
- struct page *timens_page = find_timens_vvar_page(vma);
- unsigned long addr, pfn;
- vm_fault_t err;
+ struct page *page, *timens_page;
+
+ timens_page = find_timens_vvar_page(vma);
switch (vmf->pgoff) {
case VDSO_TIME_PAGE_OFFSET:
if (!IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY))
return VM_FAULT_SIGBUS;
- pfn = __phys_to_pfn(__pa_symbol(vdso_k_time_data));
+ page = virt_to_page(vdso_k_time_data);
if (timens_page) {
/*
* Fault in VVAR page too, since it will be accessed
* to get clock data anyway.
*/
+ unsigned long addr;
+ vm_fault_t err;
+
addr = vmf->address + VDSO_TIMENS_PAGE_OFFSET * PAGE_SIZE;
- err = vmf_insert_pfn(vma, addr, pfn);
+ err = vmf_insert_page(vma, addr, page);
if (unlikely(err & VM_FAULT_ERROR))
return err;
- pfn = page_to_pfn(timens_page);
+ page = timens_page;
}
break;
case VDSO_TIMENS_PAGE_OFFSET:
@@ -71,24 +99,25 @@ static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
*/
if (!IS_ENABLED(CONFIG_TIME_NS) || !timens_page)
return VM_FAULT_SIGBUS;
- pfn = __phys_to_pfn(__pa_symbol(vdso_k_time_data));
+ page = virt_to_page(vdso_k_time_data);
break;
case VDSO_RNG_PAGE_OFFSET:
if (!IS_ENABLED(CONFIG_VDSO_GETRANDOM))
return VM_FAULT_SIGBUS;
- pfn = __phys_to_pfn(__pa_symbol(vdso_k_rng_data));
+ page = virt_to_page(vdso_k_rng_data);
break;
case VDSO_ARCH_PAGES_START ... VDSO_ARCH_PAGES_END:
if (!IS_ENABLED(CONFIG_ARCH_HAS_VDSO_ARCH_DATA))
return VM_FAULT_SIGBUS;
- pfn = __phys_to_pfn(__pa_symbol(vdso_k_arch_data)) +
- vmf->pgoff - VDSO_ARCH_PAGES_START;
+ page = virt_to_page(vdso_k_arch_data) + vmf->pgoff - VDSO_ARCH_PAGES_START;
break;
default:
return VM_FAULT_SIGBUS;
}
- return vmf_insert_pfn(vma, vmf->address, pfn);
+ get_page(page);
+ vmf->page = page;
+ return 0;
}
const struct vm_special_mapping vdso_vvar_mapping = {
@@ -100,31 +129,6 @@ struct vm_area_struct *vdso_install_vvar_mapping(struct mm_struct *mm, unsigned
{
return _install_special_mapping(mm, addr, VDSO_NR_PAGES * PAGE_SIZE,
VM_READ | VM_MAYREAD | VM_IO | VM_DONTDUMP |
- VM_PFNMAP | VM_SEALED_SYSMAP,
+ VM_MIXEDMAP | VM_SEALED_SYSMAP,
&vdso_vvar_mapping);
}
-
-#ifdef CONFIG_TIME_NS
-/*
- * The vvar page layout depends on whether a task belongs to the root or
- * non-root time namespace. Whenever a task changes its namespace, the VVAR
- * page tables are cleared and then they will be re-faulted with a
- * corresponding layout.
- * See also the comment near timens_setup_vdso_clock_data() for details.
- */
-int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
-{
- struct mm_struct *mm = task->mm;
- struct vm_area_struct *vma;
- VMA_ITERATOR(vmi, mm, 0);
-
- mmap_read_lock(mm);
- for_each_vma(vmi, vma) {
- if (vma_is_special_mapping(vma, &vdso_vvar_mapping))
- zap_vma_pages(vma);
- }
- mmap_read_unlock(mm);
-
- return 0;
-}
-#endif
diff --git a/lib/vdso/getrandom.c b/lib/vdso/getrandom.c
index 440f8a6203a6..7e29005aa208 100644
--- a/lib/vdso/getrandom.c
+++ b/lib/vdso/getrandom.c
@@ -7,8 +7,11 @@
#include <linux/minmax.h>
#include <vdso/datapage.h>
#include <vdso/getrandom.h>
+#include <vdso/limits.h>
#include <vdso/unaligned.h>
+#include <asm/barrier.h>
#include <asm/vdso/getrandom.h>
+#include <uapi/linux/errno.h>
#include <uapi/linux/mman.h>
#include <uapi/linux/random.h>
diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index 4399e143d43a..a5798bd26d20 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -3,8 +3,25 @@
* Generic userspace implementations of gettimeofday() and similar.
*/
#include <vdso/auxclock.h>
+#include <vdso/clocksource.h>
#include <vdso/datapage.h>
#include <vdso/helpers.h>
+#include <vdso/ktime.h>
+#include <vdso/limits.h>
+#include <vdso/math64.h>
+#include <vdso/time32.h>
+#include <vdso/time64.h>
+
+/*
+ * The generic vDSO implementation requires that gettimeofday.h
+ * provides:
+ * - __arch_get_hw_counter(): to get the hw counter based on the
+ * clock_mode.
+ * - gettimeofday_fallback(): fallback for gettimeofday.
+ * - clock_gettime_fallback(): fallback for clock_gettime.
+ * - clock_getres_fallback(): fallback for clock_getres.
+ */
+#include <asm/vdso/gettimeofday.h>
/* Bring in default accessors */
#include <vdso/vsyscall.h>
@@ -135,7 +152,7 @@ bool do_hres_timens(const struct vdso_time_data *vdns, const struct vdso_clock *
if (!vdso_get_timestamp(vd, vc, clk, &sec, &ns))
return false;
- } while (unlikely(vdso_read_retry(vc, seq)));
+ } while (vdso_read_retry(vc, seq));
/* Add the namespace offset */
sec += offs->sec;
@@ -158,28 +175,12 @@ bool do_hres(const struct vdso_time_data *vd, const struct vdso_clock *vc,
return false;
do {
- /*
- * Open coded function vdso_read_begin() to handle
- * VDSO_CLOCKMODE_TIMENS. Time namespace enabled tasks have a
- * special VVAR page installed which has vc->seq set to 1 and
- * vc->clock_mode set to VDSO_CLOCKMODE_TIMENS. For non time
- * namespace affected tasks this does not affect performance
- * because if vc->seq is odd, i.e. a concurrent update is in
- * progress the extra check for vc->clock_mode is just a few
- * extra instructions while spin waiting for vc->seq to become
- * even again.
- */
- while (unlikely((seq = READ_ONCE(vc->seq)) & 1)) {
- if (IS_ENABLED(CONFIG_TIME_NS) &&
- vc->clock_mode == VDSO_CLOCKMODE_TIMENS)
- return do_hres_timens(vd, vc, clk, ts);
- cpu_relax();
- }
- smp_rmb();
+ if (vdso_read_begin_timens(vc, &seq))
+ return do_hres_timens(vd, vc, clk, ts);
if (!vdso_get_timestamp(vd, vc, clk, &sec, &ns))
return false;
- } while (unlikely(vdso_read_retry(vc, seq)));
+ } while (vdso_read_retry(vc, seq));
vdso_set_timespec(ts, sec, ns);
@@ -204,7 +205,7 @@ bool do_coarse_timens(const struct vdso_time_data *vdns, const struct vdso_clock
seq = vdso_read_begin(vc);
sec = vdso_ts->sec;
nsec = vdso_ts->nsec;
- } while (unlikely(vdso_read_retry(vc, seq)));
+ } while (vdso_read_retry(vc, seq));
/* Add the namespace offset */
sec += offs->sec;
@@ -223,21 +224,12 @@ bool do_coarse(const struct vdso_time_data *vd, const struct vdso_clock *vc,
u32 seq;
do {
- /*
- * Open coded function vdso_read_begin() to handle
- * VDSO_CLOCK_TIMENS. See comment in do_hres().
- */
- while ((seq = READ_ONCE(vc->seq)) & 1) {
- if (IS_ENABLED(CONFIG_TIME_NS) &&
- vc->clock_mode == VDSO_CLOCKMODE_TIMENS)
- return do_coarse_timens(vd, vc, clk, ts);
- cpu_relax();
- }
- smp_rmb();
+ if (vdso_read_begin_timens(vc, &seq))
+ return do_coarse_timens(vd, vc, clk, ts);
ts->tv_sec = vdso_ts->sec;
ts->tv_nsec = vdso_ts->nsec;
- } while (unlikely(vdso_read_retry(vc, seq)));
+ } while (vdso_read_retry(vc, seq));
return true;
}
@@ -256,20 +248,12 @@ bool do_aux(const struct vdso_time_data *vd, clockid_t clock, struct __kernel_ti
vc = &vd->aux_clock_data[idx];
do {
- /*
- * Open coded function vdso_read_begin() to handle
- * VDSO_CLOCK_TIMENS. See comment in do_hres().
- */
- while ((seq = READ_ONCE(vc->seq)) & 1) {
- if (IS_ENABLED(CONFIG_TIME_NS) && vc->clock_mode == VDSO_CLOCKMODE_TIMENS) {
- vd = __arch_get_vdso_u_timens_data(vd);
- vc = &vd->aux_clock_data[idx];
- /* Re-read from the real time data page */
- continue;
- }
- cpu_relax();
+ if (vdso_read_begin_timens(vc, &seq)) {
+ vd = __arch_get_vdso_u_timens_data(vd);
+ vc = &vd->aux_clock_data[idx];
+ /* Re-read from the real time data page */
+ continue;
}
- smp_rmb();
/* Auxclock disabled? */
if (vc->clock_mode == VDSO_CLOCKMODE_NONE)
@@ -277,7 +261,7 @@ bool do_aux(const struct vdso_time_data *vd, clockid_t clock, struct __kernel_ti
if (!vdso_get_timestamp(vd, vc, VDSO_BASE_AUX, &sec, &ns))
return false;
- } while (unlikely(vdso_read_retry(vc, seq)));
+ } while (vdso_read_retry(vc, seq));
vdso_set_timespec(ts, sec, ns);
@@ -313,7 +297,7 @@ __cvdso_clock_gettime_common(const struct vdso_time_data *vd, clockid_t clock,
return do_hres(vd, vc, clock, ts);
}
-static __maybe_unused int
+static int
__cvdso_clock_gettime_data(const struct vdso_time_data *vd, clockid_t clock,
struct __kernel_timespec *ts)
{
@@ -333,7 +317,7 @@ __cvdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
}
#ifdef BUILD_VDSO32
-static __maybe_unused int
+static int
__cvdso_clock_gettime32_data(const struct vdso_time_data *vd, clockid_t clock,
struct old_timespec32 *res)
{
@@ -359,7 +343,7 @@ __cvdso_clock_gettime32(clockid_t clock, struct old_timespec32 *res)
}
#endif /* BUILD_VDSO32 */
-static __maybe_unused int
+static int
__cvdso_gettimeofday_data(const struct vdso_time_data *vd,
struct __kernel_old_timeval *tv, struct timezone *tz)
{
@@ -376,8 +360,7 @@ __cvdso_gettimeofday_data(const struct vdso_time_data *vd,
}
if (unlikely(tz != NULL)) {
- if (IS_ENABLED(CONFIG_TIME_NS) &&
- vc->clock_mode == VDSO_CLOCKMODE_TIMENS)
+ if (vdso_is_timens_clock(vc))
vd = __arch_get_vdso_u_timens_data(vd);
tz->tz_minuteswest = vd[CS_HRES_COARSE].tz_minuteswest;
@@ -394,14 +377,13 @@ __cvdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
}
#ifdef VDSO_HAS_TIME
-static __maybe_unused __kernel_old_time_t
+static __kernel_old_time_t
__cvdso_time_data(const struct vdso_time_data *vd, __kernel_old_time_t *time)
{
const struct vdso_clock *vc = vd->clock_data;
__kernel_old_time_t t;
- if (IS_ENABLED(CONFIG_TIME_NS) &&
- vc->clock_mode == VDSO_CLOCKMODE_TIMENS) {
+ if (vdso_is_timens_clock(vc)) {
vd = __arch_get_vdso_u_timens_data(vd);
vc = vd->clock_data;
}
@@ -432,8 +414,7 @@ bool __cvdso_clock_getres_common(const struct vdso_time_data *vd, clockid_t cloc
if (!vdso_clockid_valid(clock))
return false;
- if (IS_ENABLED(CONFIG_TIME_NS) &&
- vc->clock_mode == VDSO_CLOCKMODE_TIMENS)
+ if (vdso_is_timens_clock(vc))
vd = __arch_get_vdso_u_timens_data(vd);
/*
@@ -464,7 +445,7 @@ bool __cvdso_clock_getres_common(const struct vdso_time_data *vd, clockid_t cloc
return true;
}
-static __maybe_unused
+static
int __cvdso_clock_getres_data(const struct vdso_time_data *vd, clockid_t clock,
struct __kernel_timespec *res)
{
@@ -484,7 +465,7 @@ int __cvdso_clock_getres(clockid_t clock, struct __kernel_timespec *res)
}
#ifdef BUILD_VDSO32
-static __maybe_unused int
+static int
__cvdso_clock_getres_time32_data(const struct vdso_time_data *vd, clockid_t clock,
struct old_timespec32 *res)
{
diff --git a/tools/testing/selftests/vDSO/Makefile b/tools/testing/selftests/vDSO/Makefile
index e361aca22a74..a61047bdcd57 100644
--- a/tools/testing/selftests/vDSO/Makefile
+++ b/tools/testing/selftests/vDSO/Makefile
@@ -19,8 +19,6 @@ endif
include ../lib.mk
-CFLAGS += $(TOOLS_INCLUDES)
-
CFLAGS_NOLIBC := -nostdlib -nostdinc -ffreestanding -fno-asynchronous-unwind-tables \
-fno-stack-protector -include $(top_srcdir)/tools/include/nolibc/nolibc.h \
-I$(top_srcdir)/tools/include/nolibc/ $(KHDR_INCLUDES)
@@ -28,13 +26,11 @@ CFLAGS_NOLIBC := -nostdlib -nostdinc -ffreestanding -fno-asynchronous-unwind-tab
$(OUTPUT)/vdso_test_gettimeofday: parse_vdso.c vdso_test_gettimeofday.c
$(OUTPUT)/vdso_test_getcpu: parse_vdso.c vdso_test_getcpu.c
$(OUTPUT)/vdso_test_abi: parse_vdso.c vdso_test_abi.c
+$(OUTPUT)/vdso_test_correctness: parse_vdso.c vdso_test_correctness.c
$(OUTPUT)/vdso_standalone_test_x86: vdso_standalone_test_x86.c parse_vdso.c | headers
$(OUTPUT)/vdso_standalone_test_x86: CFLAGS:=$(CFLAGS_NOLIBC) $(CFLAGS)
-$(OUTPUT)/vdso_test_correctness: vdso_test_correctness.c
-$(OUTPUT)/vdso_test_correctness: LDFLAGS += -ldl
-
$(OUTPUT)/vdso_test_getrandom: parse_vdso.c
$(OUTPUT)/vdso_test_getrandom: CFLAGS += -isystem $(top_srcdir)/tools/include \
$(KHDR_INCLUDES) \
diff --git a/tools/testing/selftests/vDSO/parse_vdso.c b/tools/testing/selftests/vDSO/parse_vdso.c
index 3ff00fb624a4..c6ff4413ea36 100644
--- a/tools/testing/selftests/vDSO/parse_vdso.c
+++ b/tools/testing/selftests/vDSO/parse_vdso.c
@@ -19,8 +19,7 @@
#include <stdint.h>
#include <string.h>
#include <limits.h>
-#include <linux/auxvec.h>
-#include <linux/elf.h>
+#include <elf.h>
#include "parse_vdso.h"
diff --git a/tools/testing/selftests/vDSO/vdso_test_correctness.c b/tools/testing/selftests/vDSO/vdso_test_correctness.c
index 055af95aa552..5c5a07dd1128 100644
--- a/tools/testing/selftests/vDSO/vdso_test_correctness.c
+++ b/tools/testing/selftests/vDSO/vdso_test_correctness.c
@@ -11,28 +11,22 @@
#include <time.h>
#include <stdlib.h>
#include <unistd.h>
+#include <sys/auxv.h>
#include <sys/syscall.h>
-#include <dlfcn.h>
#include <string.h>
#include <errno.h>
#include <sched.h>
#include <stdbool.h>
#include <limits.h>
+#include "parse_vdso.h"
#include "vdso_config.h"
#include "vdso_call.h"
#include "kselftest.h"
+static const char *version;
static const char **name;
-#ifndef SYS_getcpu
-# ifdef __x86_64__
-# define SYS_getcpu 309
-# else
-# define SYS_getcpu 318
-# endif
-#endif
-
#ifndef __NR_clock_gettime64
#define __NR_clock_gettime64 403
#endif
@@ -61,6 +55,10 @@ typedef long (*vgtod_t)(struct timeval *tv, struct timezone *tz);
vgtod_t vdso_gettimeofday;
+typedef time_t (*vtime_t)(__kernel_time_t *tloc);
+
+vtime_t vdso_time;
+
typedef long (*getcpu_t)(unsigned *, unsigned *, void *);
getcpu_t vgetcpu;
@@ -110,42 +108,39 @@ static void *vsyscall_getcpu(void)
static void fill_function_pointers(void)
{
- void *vdso = dlopen("linux-vdso.so.1",
- RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
- if (!vdso)
- vdso = dlopen("linux-gate.so.1",
- RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
- if (!vdso)
- vdso = dlopen("linux-vdso32.so.1",
- RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
- if (!vdso)
- vdso = dlopen("linux-vdso64.so.1",
- RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
- if (!vdso) {
+ unsigned long sysinfo_ehdr = getauxval(AT_SYSINFO_EHDR);
+
+ if (!sysinfo_ehdr) {
printf("[WARN]\tfailed to find vDSO\n");
return;
}
- vdso_getcpu = (getcpu_t)dlsym(vdso, name[4]);
+ vdso_init_from_sysinfo_ehdr(sysinfo_ehdr);
+
+ vdso_getcpu = (getcpu_t)vdso_sym(version, name[4]);
if (!vdso_getcpu)
printf("Warning: failed to find getcpu in vDSO\n");
vgetcpu = (getcpu_t) vsyscall_getcpu();
- vdso_clock_gettime = (vgettime_t)dlsym(vdso, name[1]);
+ vdso_clock_gettime = (vgettime_t)vdso_sym(version, name[1]);
if (!vdso_clock_gettime)
printf("Warning: failed to find clock_gettime in vDSO\n");
#if defined(VDSO_32BIT)
- vdso_clock_gettime64 = (vgettime64_t)dlsym(vdso, name[5]);
+ vdso_clock_gettime64 = (vgettime64_t)vdso_sym(version, name[5]);
if (!vdso_clock_gettime64)
printf("Warning: failed to find clock_gettime64 in vDSO\n");
#endif
- vdso_gettimeofday = (vgtod_t)dlsym(vdso, name[0]);
+ vdso_gettimeofday = (vgtod_t)vdso_sym(version, name[0]);
if (!vdso_gettimeofday)
printf("Warning: failed to find gettimeofday in vDSO\n");
+ vdso_time = (vtime_t)vdso_sym(version, name[2]);
+ if (!vdso_time)
+ printf("Warning: failed to find time in vDSO\n");
+
}
static long sys_getcpu(unsigned * cpu, unsigned * node,
@@ -169,6 +164,16 @@ static inline int sys_gettimeofday(struct timeval *tv, struct timezone *tz)
return syscall(__NR_gettimeofday, tv, tz);
}
+static inline __kernel_old_time_t sys_time(__kernel_old_time_t *tloc)
+{
+#ifdef __NR_time
+ return syscall(__NR_time, tloc);
+#else
+ errno = ENOSYS;
+ return -1;
+#endif
+}
+
static void test_getcpu(void)
{
printf("[RUN]\tTesting getcpu...\n");
@@ -412,10 +417,10 @@ static void test_gettimeofday(void)
return;
}
- printf("\t%llu.%06ld %llu.%06ld %llu.%06ld\n",
- (unsigned long long)start.tv_sec, start.tv_usec,
- (unsigned long long)vdso.tv_sec, vdso.tv_usec,
- (unsigned long long)end.tv_sec, end.tv_usec);
+ printf("\t%llu.%06lld %llu.%06lld %llu.%06lld\n",
+ (unsigned long long)start.tv_sec, (long long)start.tv_usec,
+ (unsigned long long)vdso.tv_sec, (long long)vdso.tv_usec,
+ (unsigned long long)end.tv_sec, (long long)end.tv_usec);
if (!tv_leq(&start, &vdso) || !tv_leq(&vdso, &end)) {
printf("[FAIL]\tTimes are out of sequence\n");
@@ -435,8 +440,56 @@ static void test_gettimeofday(void)
VDSO_CALL(vdso_gettimeofday, 2, &vdso, NULL);
}
+static void test_time(void)
+{
+ __kernel_old_time_t start, end, vdso_ret, vdso_param;
+
+ if (!vdso_time)
+ return;
+
+ printf("[RUN]\tTesting time...\n");
+
+ if (sys_time(&start) < 0) {
+ if (errno == ENOSYS) {
+ printf("[SKIP]\tNo time() support\n");
+ } else {
+ printf("[FAIL]\tsys_time failed (%d)\n", errno);
+ nerrs++;
+ }
+ return;
+ }
+
+ vdso_ret = VDSO_CALL(vdso_time, 1, &vdso_param);
+ end = sys_time(NULL);
+
+ if (vdso_ret < 0 || end < 0) {
+ printf("[FAIL]\tvDSO returned %d, syscall errno=%d\n",
+ (int)vdso_ret, errno);
+ nerrs++;
+ return;
+ }
+
+ printf("\t%lld %lld %lld\n",
+ (long long)start,
+ (long long)vdso_ret,
+ (long long)end);
+
+ if (vdso_ret != vdso_param) {
+ printf("[FAIL]\tinconsistent return values: %lld %lld\n",
+ (long long)vdso_ret, (long long)vdso_param);
+ nerrs++;
+ return;
+ }
+
+ if (!(start <= vdso_ret) || !(vdso_ret <= end)) {
+ printf("[FAIL]\tTimes are out of sequence\n");
+ nerrs++;
+ }
+}
+
int main(int argc, char **argv)
{
+ version = versions[VDSO_VERSION];
name = (const char **)&names[VDSO_NAMES];
fill_function_pointers();
@@ -444,6 +497,7 @@ int main(int argc, char **argv)
test_clock_gettime();
test_clock_gettime64();
test_gettimeofday();
+ test_time();
/*
* Test getcpu() last so that, if something goes wrong setting affinity,
diff --git a/tools/testing/selftests/vDSO/vdso_test_gettimeofday.c b/tools/testing/selftests/vDSO/vdso_test_gettimeofday.c
index 912edadad92c..990b29e0e272 100644
--- a/tools/testing/selftests/vDSO/vdso_test_gettimeofday.c
+++ b/tools/testing/selftests/vDSO/vdso_test_gettimeofday.c
@@ -11,10 +11,8 @@
*/
#include <stdio.h>
-#ifndef NOLIBC
#include <sys/auxv.h>
#include <sys/time.h>
-#endif
#include "kselftest.h"
#include "parse_vdso.h"
The pull request you sent on Sun, 12 Apr 2026 19:46:34 +0200:
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers-vdso-2026-04-12
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/f21f7b5162e9dbde6d3d5ce727d4ca2552d76ce9
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html