[v1] tracing/osnoise: Add option to align tlat threads

[PATCH] tracing/osnoise: Add option to align tlat threads

Posted by Tomas Glozar 1 month, 1 week ago

Add an option called TIMERLAT_ALIGN to osnoise/options, together with a
corresponding setting osnoise/timerlat_align_us.

This option sets the alignment of wakeup times between different
timerlat threads, similarly to cyclictest's -A/--aligned option. If
TIMERLAT_ALIGN is set, the first thread that reaches the first cycle
records its first wake-up time. Each following thread sets its first
wake-up time to a fixed offset from the recorded time, and incremenets
it by the same offset.

Example:

osnoise/timerlat_period is set to 1000, osnoise/timerlat_align_us is
set to 50. There are four threads, on CPUs 1 to 4.

- CPU 4 enters first cycle first. The current time is 20000us, so
the wake-up of the first cycle is set to 21000us. This time is recorded.
- CPU 2 enter first cycle next. It reads the recorded time, increments
it to 21020us, and uses this value as its own wake-up time for the first
cycle.
- CPU 3 enters first cycle next. It reads the recorded time, increments
it to 21040 us, and uses the value as its own wake-up time.
- CPU 1 proceeds analogically.

In each next cycle, the wake-up time (called "absolute period" in
timerlat code) is incremented by the (relative) period of 1000us. Thus,
the wake-ups in the following cycles (provided the times are reached and
not in the past) will be as follows:

CPU 1		CPU 2		CPU 3	 	CPU 4
21080us		21020us		21040us		21000us
22080us		22020us		22040us		22000us
...		...		...		...

Even if any cycle is skipped due to e.g. the first cycle calculation
happening later, the alignment stays in place.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---

I tested this option using the following command:

$ bpftrace -e 'tracepoint:osnoise:timerlat_sample /!@time[cpu]/ {
      if (!@begin) { @begin = nsecs; }
      @time[cpu] = ((nsecs - @begin) / 1000) % 1000;
  }
 END { clear(@begin); }' -c 'rtla timerlat hist -d 1s -c 1-10'

This captures the alignment of first timerlat sample (which is +-
equivalent to the wake-up time).

With timerlat_align_us = 20:

@time[1]: 2
@time[2]: 18
@time[3]: 38
@time[4]: 57
@time[5]: 83
@time[6]: 103
@time[7]: 123
@time[8]: 143
@time[9]: 162
@time[10]: 182

With timerlat_align_us = 0

@time[1]: 1
@time[5]: 4
@time[7]: 4
@time[6]: 4
@time[8]: 4
@time[9]: 4
@time[10]: 4
@time[4]: 5
@time[3]: 5
@time[2]: 5

Only thing I am not too sure about is the absense of barriers. I feel
like I only touch that one atomic variable concurrently, so it should
be fine (unlike e.g. a mutex protecting another variable, where you need
acquire-release semantics) with relaxed variants of atomic functions;
but I don't have any other experience with barriers so far.

 kernel/trace/trace_osnoise.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index dee610e465b9..df1d4529d226 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -58,6 +58,7 @@ enum osnoise_options_index {
 	OSN_PANIC_ON_STOP,
 	OSN_PREEMPT_DISABLE,
 	OSN_IRQ_DISABLE,
+	OSN_TIMERLAT_ALIGN,
 	OSN_MAX
 };
 
@@ -66,7 +67,8 @@ static const char * const osnoise_options_str[OSN_MAX] = {
 							"OSNOISE_WORKLOAD",
 							"PANIC_ON_STOP",
 							"OSNOISE_PREEMPT_DISABLE",
-							"OSNOISE_IRQ_DISABLE" };
+							"OSNOISE_IRQ_DISABLE",
+							"TIMERLAT_ALIGN" };
 
 #define OSN_DEFAULT_OPTIONS		0x2
 static unsigned long osnoise_options	= OSN_DEFAULT_OPTIONS;
@@ -326,6 +328,7 @@ static struct osnoise_data {
 	u64	stop_tracing_total;	/* stop trace in the final operation (report/thread) */
 #ifdef CONFIG_TIMERLAT_TRACER
 	u64	timerlat_period;	/* timerlat period */
+	u64	timerlat_align_us;	/* timerlat alignment */
 	u64	print_stack;		/* print IRQ stack if total > */
 	int	timerlat_tracer;	/* timerlat tracer */
 #endif
@@ -338,6 +341,7 @@ static struct osnoise_data {
 #ifdef CONFIG_TIMERLAT_TRACER
 	.print_stack			= 0,
 	.timerlat_period		= DEFAULT_TIMERLAT_PERIOD,
+	.timerlat_align_us		= 0,
 	.timerlat_tracer		= 0,
 #endif
 };
@@ -1820,6 +1824,7 @@ static int wait_next_period(struct timerlat_variables *tlat)
 {
 	ktime_t next_abs_period, now;
 	u64 rel_period = osnoise_data.timerlat_period * 1000;
+	static atomic64_t align_next;
 
 	now = hrtimer_cb_get_time(&tlat->timer);
 	next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
@@ -1829,6 +1834,17 @@ static int wait_next_period(struct timerlat_variables *tlat)
 	 */
 	tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
 
+	if (test_bit(OSN_TIMERLAT_ALIGN, &osnoise_options) && !tlat->count
+	    && atomic64_cmpxchg_relaxed(&align_next, 0, tlat->abs_period)) {
+		/*
+		 * Align thread in first cycle on each CPU to the set alignment.
+		 */
+		tlat->abs_period = atomic64_fetch_add_relaxed(osnoise_data.timerlat_align_us * 1000,
+			&align_next);
+		tlat->abs_period += osnoise_data.timerlat_align_us * 1000;
+		next_abs_period = ns_to_ktime(tlat->abs_period);
+	}
+
 	/*
 	 * If the new abs_period is in the past, skip the activation.
 	 */
@@ -2650,6 +2666,17 @@ static struct trace_min_max_param timerlat_period = {
 	.min	= &timerlat_min_period,
 };
 
+/*
+ * osnoise/timerlat_align_us: align the first wakeup of all timerlat
+ * threads to a common boundary (in us). 0 means disabled.
+ */
+static struct trace_min_max_param timerlat_align_us = {
+	.lock	= &interface_lock,
+	.val	= &osnoise_data.timerlat_align_us,
+	.max	= NULL,
+	.min	= NULL,
+};
+
 static const struct file_operations timerlat_fd_fops = {
 	.open		= timerlat_fd_open,
 	.read		= timerlat_fd_read,
@@ -2746,6 +2773,11 @@ static int init_timerlat_tracefs(struct dentry *top_dir)
 	if (!tmp)
 		return -ENOMEM;
 
+	tmp = tracefs_create_file("timerlat_align_us", TRACE_MODE_WRITE, top_dir,
+				  &timerlat_align_us, &trace_min_max_fops);
+	if (!tmp)
+		return -ENOMEM;
+
 	retval = osnoise_create_cpu_timerlat_fd(top_dir);
 	if (retval)
 		return retval;
-- 
2.53.0

Re: [PATCH] tracing/osnoise: Add option to align tlat threads

Posted by Crystal Wood 1 month, 1 week ago

On Fri, 2026-02-27 at 16:04 +0100, Tomas Glozar wrote:
> Add an option called TIMERLAT_ALIGN to osnoise/options, together with a
> corresponding setting osnoise/timerlat_align_us.
> 
> This option sets the alignment of wakeup times between different
> timerlat threads, similarly to cyclictest's -A/--aligned option. If
> TIMERLAT_ALIGN is set, the first thread that reaches the first cycle
> records its first wake-up time. Each following thread sets its first
> wake-up time to a fixed offset from the recorded time, and incremenets
> it by the same offset.

Why not just set the initial timer expiration to be 
"period + cpu * align_us"?  Then you wouldn't need any interaction
between CPUs.

>  kernel/trace/trace_osnoise.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)

Documentation needs to be updated as well.

Should mention that updating align_us while the timer is running won't
take effect immediately (unlike period, which does).

> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> index dee610e465b9..df1d4529d226 100644
> --- a/kernel/trace/trace_osnoise.c
> +++ b/kernel/trace/trace_osnoise.c
> @@ -58,6 +58,7 @@ enum osnoise_options_index {
>  	OSN_PANIC_ON_STOP,
>  	OSN_PREEMPT_DISABLE,
>  	OSN_IRQ_DISABLE,
> +	OSN_TIMERLAT_ALIGN,
>  	OSN_MAX
>  };
>  
> @@ -66,7 +67,8 @@ static const char * const osnoise_options_str[OSN_MAX] = {
>  							"OSNOISE_WORKLOAD",
>  							"PANIC_ON_STOP",
>  							"OSNOISE_PREEMPT_DISABLE",
> -							"OSNOISE_IRQ_DISABLE" };
> +							"OSNOISE_IRQ_DISABLE",
> +							"TIMERLAT_ALIGN" };

Do we really need a flag for this, or can we just interpret a non-zero
align_us value as enabling the feature?

> @@ -1820,6 +1824,7 @@ static int wait_next_period(struct timerlat_variables *tlat)
>  {
>  	ktime_t next_abs_period, now;
>  	u64 rel_period = osnoise_data.timerlat_period * 1000;
> +	static atomic64_t align_next;

How will this get reset if the tracer is stopped and restarted?

>  	now = hrtimer_cb_get_time(&tlat->timer);
>  	next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
> @@ -1829,6 +1834,17 @@ static int wait_next_period(struct timerlat_variables *tlat)
>  	 */
>  	tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
>  
> +	if (test_bit(OSN_TIMERLAT_ALIGN, &osnoise_options) && !tlat->count
> +	    && atomic64_cmpxchg_relaxed(&align_next, 0, tlat->abs_period)) {
> +		/*
> +		 * Align thread in first cycle on each CPU to the set alignment.
> +		 */
> +		tlat->abs_period = atomic64_fetch_add_relaxed(osnoise_data.timerlat_align_us * 1000,
> +			&align_next);
> +		tlat->abs_period += osnoise_data.timerlat_align_us * 1000;
> +		next_abs_period = ns_to_ktime(tlat->abs_period);
> +	}

I'm already unclear about the existing purpose of next_abs_period, but
if it has any use at all shouldn't it be to avoid writing intermediate
values like this back to tlat?

-Crystal

Re: [PATCH] tracing/osnoise: Add option to align tlat threads

Posted by Tomas Glozar 1 month ago

so 28. 2. 2026 v 0:50 odesílatel Crystal Wood <crwood@redhat.com> napsal:
> > Add an option called TIMERLAT_ALIGN to osnoise/options, together with a
> > corresponding setting osnoise/timerlat_align_us.
> >
> > This option sets the alignment of wakeup times between different
> > timerlat threads, similarly to cyclictest's -A/--aligned option. If
> > TIMERLAT_ALIGN is set, the first thread that reaches the first cycle
> > records its first wake-up time. Each following thread sets its first
> > wake-up time to a fixed offset from the recorded time, and incremenets
> > it by the same offset.
>
> Why not just set the initial timer expiration to be
> "period + cpu * align_us"?  Then you wouldn't need any interaction
> between CPUs.

"period + cpu * align_us" wouldn't quite do it, for two reasons:

1. The wake-up timers are set to absolute time, and are incremented by
"period" (once or multiple times, if the timer is significantly
delayed) each cycle. What can be done as an alternative to what v1
does is this: record the current time when starting the timerlat
tracer (I need to reset align_next to zero anyway even with the v1
design, that is a bug in the patch), and increment from that.

2. "cpu" makes a poor thread ID here. If my period is 1000us, and I
run on CPUs 0 and 100 with alignment 10, suddenly, the space between
the threads becomes 1000us, which is equivalent to 0us. I would need
to go through the cpuset and assign numbers from 0 to n to each CPU.
That would guarantee a fixed spacing of the threads independent of
when the threads wake up in the first cycle (unlike the v1 design),
but it would make the implementation more complex, since I would have
to store the numbers.

If I implemented both of those ideas, the interaction between the CPUs
can indeed be gotten rid of. I'm not sure if it is a better solution,
though. Another motivation of recording the first thread wake-up was
that when using user threads, the first thread might be created some
time after the tracer is enabled, and I did not want to have a large
gap that would have to be corrected by the while loop at the end of
wait_next_period().

>
> >  kernel/trace/trace_osnoise.c | 34 +++++++++++++++++++++++++++++++++-
> >  1 file changed, 33 insertions(+), 1 deletion(-)
>
> Documentation needs to be updated as well.
>
> Should mention that updating align_us while the timer is running won't
> take effect immediately (unlike period, which does).
>

Good idea, thanks! In general, I'm not expecting the user to change
timerlat parameters during a measurement - but it is supported, and
should be documented.

> > diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> > index dee610e465b9..df1d4529d226 100644
> > --- a/kernel/trace/trace_osnoise.c
> > +++ b/kernel/trace/trace_osnoise.c
> > @@ -58,6 +58,7 @@ enum osnoise_options_index {
> >       OSN_PANIC_ON_STOP,
> >       OSN_PREEMPT_DISABLE,
> >       OSN_IRQ_DISABLE,
> > +     OSN_TIMERLAT_ALIGN,
> >       OSN_MAX
> >  };
> >
> > @@ -66,7 +67,8 @@ static const char * const osnoise_options_str[OSN_MAX] = {
> >                                                       "OSNOISE_WORKLOAD",
> >                                                       "PANIC_ON_STOP",
> >                                                       "OSNOISE_PREEMPT_DISABLE",
> > -                                                     "OSNOISE_IRQ_DISABLE" };
> > +                                                     "OSNOISE_IRQ_DISABLE",
> > +                                                     "TIMERLAT_ALIGN" };
>
> Do we really need a flag for this, or can we just interpret a non-zero
> align_us value as enabling the feature?
>

Yes, we need a flag for this, because a zero alignment is a common use case.

I used it in cyclictest to measure the overhead of a large number of
threads waking up at the same time. Similarly, a non-zero alignment
will get rid of most of that overhead. Without alignment set, the
thread wake-ups offsets are semi-random, depending on how the threads
wake up, which might lead to inconsistent results where one run has
good numbers and another run bad numbers, since the alignment is
determined in the first cycle.

For example, here are some bpftrace numbers (from the same command as
in the original message) with wake-ups with timerlat_align_us = 0:

@time[12]: 1
@time[11]: 1
@time[13]: 1
@time[10]: 3
@time[16]: 6
@time[19]: 6
@time[15]: 6
@time[18]: 7
@time[14]: 7
@time[17]: 71
@time[20]: 997

> > @@ -1820,6 +1824,7 @@ static int wait_next_period(struct timerlat_variables *tlat)
> >  {
> >       ktime_t next_abs_period, now;
> >       u64 rel_period = osnoise_data.timerlat_period * 1000;
> > +     static atomic64_t align_next;
>
> How will this get reset if the tracer is stopped and restarted?
>

It won't, I forgot to reset it. See my comment above in my reply to Steve,

> >       now = hrtimer_cb_get_time(&tlat->timer);
> >       next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
> > @@ -1829,6 +1834,17 @@ static int wait_next_period(struct timerlat_variables *tlat)
> >        */
> >       tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
> >
> > +     if (test_bit(OSN_TIMERLAT_ALIGN, &osnoise_options) && !tlat->count
> > +         && atomic64_cmpxchg_relaxed(&align_next, 0, tlat->abs_period)) {
> > +             /*
> > +              * Align thread in first cycle on each CPU to the set alignment.
> > +              */
> > +             tlat->abs_period = atomic64_fetch_add_relaxed(osnoise_data.timerlat_align_us * 1000,
> > +                     &align_next);
> > +             tlat->abs_period += osnoise_data.timerlat_align_us * 1000;
> > +             next_abs_period = ns_to_ktime(tlat->abs_period);
> > +     }
>
> I'm already unclear about the existing purpose of next_abs_period, but
> if it has any use at all shouldn't it be to avoid writing intermediate
> values like this back to tlat?
>

next_abs_period is basically just the ktime_t variant of
tlat->abs_period for local calculations of the next period inside
wait_next_period(). Its only purpose is the ktime_compare() call that
increments tlat->abs_period by the period until it lands into the
future, if it happens to be in the past. This is necessary to do for
both a regular cycle (which might take long due to noise) and the
first cycle with alignment (because the other thread's first wake up
might be late), so it has to be set in the new code as well,
otherwise, the while loop won't see the time is in the past.

I agree that this part of the code is confusing. There is also a
field, timerlat_variables.rel_period (tlat->rel_period), that is not
used anywhere, since the relative period is pulled out of
osnoise_variables. Something like this would be easier to read and
comprehend, IMHO:

/*
 * wait_next_period - Wait for the next period for timerlat
 */
static int wait_next_period(struct timerlat_variables *tlat)
{
    ktime_t now;
    u64 rel_period = osnoise_data.timerlat_period * 1000;

    now = hrtimer_cb_get_time(&tlat->timer);

    /*
     * Set the next abs_period.
     */
    tlat->abs_period += rel_period;

    /*
     * If the new abs_period is in the past, skip the activation.
     */
    while (ktime_compare(now, ns_to_ktime(tlat->abs_period) > 0) {
        next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
        tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
    }

    set_current_state(TASK_INTERRUPTIBLE);

    hrtimer_start(&tlat->timer, next_abs_period, HRTIMER_MODE_ABS_PINNED_HARD);
    schedule();
    return 1;
}

(Excluding the changes from this patch.) What do you think?

Tomas

Re: [PATCH] tracing/osnoise: Add option to align tlat threads

Posted by Crystal Wood 1 month ago

On Mon, 2026-03-02 at 09:48 +0100, Tomas Glozar wrote:
> so 28. 2. 2026 v 0:50 odesílatel Crystal Wood <crwood@redhat.com> napsal:
> > > Add an option called TIMERLAT_ALIGN to osnoise/options, together with a
> > > corresponding setting osnoise/timerlat_align_us.
> > > 
> > > This option sets the alignment of wakeup times between different
> > > timerlat threads, similarly to cyclictest's -A/--aligned option. If
> > > TIMERLAT_ALIGN is set, the first thread that reaches the first cycle
> > > records its first wake-up time. Each following thread sets its first
> > > wake-up time to a fixed offset from the recorded time, and incremenets
> > > it by the same offset.
> > 
> > Why not just set the initial timer expiration to be
> > "period + cpu * align_us"?  Then you wouldn't need any interaction
> > between CPUs.
> 
> "period + cpu * align_us" wouldn't quite do it, for two reasons:
> 
> 1. The wake-up timers are set to absolute time, and are incremented by
> "period" (once or multiple times, if the timer is significantly
> delayed) each cycle. What can be done as an alternative to what v1
> does is this: record the current time when starting the timerlat
> tracer (I need to reset align_next to zero anyway even with the v1
> design, that is a bug in the patch), and increment from that.

I was only talking about doing this for the initial expiration, not on
increment.

> 2. "cpu" makes a poor thread ID here. If my period is 1000us, and I
> run on CPUs 0 and 100 with alignment 10, suddenly, the space between
> the threads becomes 1000us, which is equivalent to 0us. I would need
> to go through the cpuset and assign numbers from 0 to n to each CPU.
> That would guarantee a fixed spacing of the threads independent of
> when the threads wake up in the first cycle (unlike the v1 design),
> but it would make the implementation more complex, since I would have
> to store the numbers.

Right, I was thinking of just a few CPUs missing not being a big deal,
but on big systems with only a few CPUs running the test it does
matter.

Instead of assigning numbers, could you just loop over each CPU's
tlat->abs_period and set the initial expiration with appropriate
offset (prior to starting any of the threads)?  Then the thread would
not need to care about anything other than the usual increment.

> If I implemented both of those ideas, the interaction between the CPUs
> can indeed be gotten rid of. I'm not sure if it is a better solution,
> though. Another motivation of recording the first thread wake-up was
> that when using user threads, the first thread might be created some
> time after the tracer is enabled, and I did not want to have a large
> gap that would have to be corrected by the while loop at the end of
> wait_next_period().

What is the actual concerning impact here?

If we want to be really paranoid we could check for pending signals
during the loop, in case userspace delayed so long (with a very short
period) that the user has a hard time with ctrl+c and such... but that
could happen already if userspace does something silly (e.g. stopped
by a debugger?) between loops.

> > >  kernel/trace/trace_osnoise.c | 34 +++++++++++++++++++++++++++++++++-
> > >  1 file changed, 33 insertions(+), 1 deletion(-)
> > 
> > Documentation needs to be updated as well.
> > 
> > Should mention that updating align_us while the timer is running won't
> > take effect immediately (unlike period, which does).
> > 
> 
> Good idea, thanks! In general, I'm not expecting the user to change
> timerlat parameters during a measurement - but it is supported, and
> should be documented.

Maybe better to phrase it as "not guaranteed to take effect
immediately" :-)

> > > diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> > > index dee610e465b9..df1d4529d226 100644
> > > --- a/kernel/trace/trace_osnoise.c
> > > +++ b/kernel/trace/trace_osnoise.c
> > > @@ -58,6 +58,7 @@ enum osnoise_options_index {
> > >       OSN_PANIC_ON_STOP,
> > >       OSN_PREEMPT_DISABLE,
> > >       OSN_IRQ_DISABLE,
> > > +     OSN_TIMERLAT_ALIGN,
> > >       OSN_MAX
> > >  };
> > > 
> > > @@ -66,7 +67,8 @@ static const char * const osnoise_options_str[OSN_MAX] = {
> > >                                                       "OSNOISE_WORKLOAD",
> > >                                                       "PANIC_ON_STOP",
> > >                                                       "OSNOISE_PREEMPT_DISABLE",
> > > -                                                     "OSNOISE_IRQ_DISABLE" };
> > > +                                                     "OSNOISE_IRQ_DISABLE",
> > > +                                                     "TIMERLAT_ALIGN" };
> > 
> > Do we really need a flag for this, or can we just interpret a non-zero
> > align_us value as enabling the feature?
> > 
> 
> Yes, we need a flag for this, because a zero alignment is a common use case.
> 
> I used it in cyclictest to measure the overhead of a large number of
> threads waking up at the same time. Similarly, a non-zero alignment
> will get rid of most of that overhead. Without alignment set, the
> thread wake-ups offsets are semi-random, depending on how the threads
> wake up, which might lead to inconsistent results where one run has
> good numbers and another run bad numbers, since the alignment is
> determined in the first cycle.

OK, I was viewing the staggering as the main point, but I see how the
alignment itself helps too.

Is there a use case for not always doing the alignment?
Other than people asking why their numbers suddenly got worse...

> > I'm already unclear about the existing purpose of next_abs_period, but
> > if it has any use at all shouldn't it be to avoid writing intermediate
> > values like this back to tlat?
> > 
> 
> next_abs_period is basically just the ktime_t variant of
> tlat->abs_period for local calculations of the next period inside
> wait_next_period(). Its only purpose is the ktime_compare() call that
> increments tlat->abs_period by the period until it lands into the
> future, if it happens to be in the past. This is necessary to do for
> both a regular cycle (which might take long due to noise) and the
> first cycle with alignment (because the other thread's first wake up
> might be late), so it has to be set in the new code as well,
> otherwise, the while loop won't see the time is in the past.

Oh, I missed the unit difference (though it's basically just a typedef
at this point).

> I agree that this part of the code is confusing. There is also a
> field, timerlat_variables.rel_period (tlat->rel_period), that is not
> used anywhere, since the relative period is pulled out of
> osnoise_variables. Something like this would be easier to read and
> comprehend, IMHO:

Yeah, I noticed that as well... we should remove it if we're not going
to use it.

> /*
>  * wait_next_period - Wait for the next period for timerlat
>  */
> static int wait_next_period(struct timerlat_variables *tlat)
> {
>     ktime_t now;
>     u64 rel_period = osnoise_data.timerlat_period * 1000;
> 
>     now = hrtimer_cb_get_time(&tlat->timer);
> 
>     /*
>      * Set the next abs_period.
>      */
>     tlat->abs_period += rel_period;
> 
>     /*
>      * If the new abs_period is in the past, skip the activation.
>      */
>     while (ktime_compare(now, ns_to_ktime(tlat->abs_period) > 0) {
>         next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
>         tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
>     }
> 
>     set_current_state(TASK_INTERRUPTIBLE);
> 
>     hrtimer_start(&tlat->timer, next_abs_period, HRTIMER_MODE_ABS_PINNED_HARD);
>     schedule();
>     return 1;
> }
> 
> (Excluding the changes from this patch.) What do you think?

Why can't we just make tlat->abs_period and every other time variable
in this file be ktime_t?  Other than atomic stuff if we do go that
route.

Not saying that that should hold up this patch,  just an idea to simplify things.

-Crystal

Re: [PATCH] tracing/osnoise: Add option to align tlat threads

Posted by Tomas Glozar 1 month ago

út 3. 3. 2026 v 4:21 odesílatel Crystal Wood <crwood@redhat.com> napsal:
> > 1. The wake-up timers are set to absolute time, and are incremented by
> > "period" (once or multiple times, if the timer is significantly
> > delayed) each cycle. What can be done as an alternative to what v1
> > does is this: record the current time when starting the timerlat
> > tracer (I need to reset align_next to zero anyway even with the v1
> > design, that is a bug in the patch), and increment from that.
>
> I was only talking about doing this for the initial expiration, not on
> increment.
>

Ah, I misread "period" as "relative period", since the initial
absolute period is determined from the current time in the first cycle
of each thread.

> > 2. "cpu" makes a poor thread ID here. If my period is 1000us, and I
> > run on CPUs 0 and 100 with alignment 10, suddenly, the space between
> > the threads becomes 1000us, which is equivalent to 0us. I would need
> > to go through the cpuset and assign numbers from 0 to n to each CPU.
> > That would guarantee a fixed spacing of the threads independent of
> > when the threads wake up in the first cycle (unlike the v1 design),
> > but it would make the implementation more complex, since I would have
> > to store the numbers.
>
> Right, I was thinking of just a few CPUs missing not being a big deal,
> but on big systems with only a few CPUs running the test it does
> matter.
>

My point is mostly that the spacing of the threads shouldn't depend on
the CPU numbers. One has to make sure the alignment doesn't overflow
the period anyway if they want to have completely time-isolated
wake-ups.

> Instead of assigning numbers, could you just loop over each CPU's
> tlat->abs_period and set the initial expiration with appropriate
> offset (prior to starting any of the threads)?  Then the thread would
> not need to care about anything other than the usual increment.
>
> > If I implemented both of those ideas, the interaction between the CPUs
> > can indeed be gotten rid of. I'm not sure if it is a better solution,
> > though. Another motivation of recording the first thread wake-up was
> > that when using user threads, the first thread might be created some
> > time after the tracer is enabled, and I did not want to have a large
> > gap that would have to be corrected by the while loop at the end of
> > wait_next_period().
>
> What is the actual concerning impact here?
>
> If we want to be really paranoid we could check for pending signals
> during the loop, in case userspace delayed so long (with a very short
> period) that the user has a hard time with ctrl+c and such... but that
> could happen already if userspace does something silly (e.g. stopped
> by a debugger?) between loops.
>

The while loop is designed only to handle "small" time differences,
with respect to the relative period. When using timerlat manually with
a user workload, it might take the user a few seconds/seconds/hours
before they start the user process (typing the command line, or if the
user e.g. has a snack or coffee in between), which has to be corrected
by the while loop. This does not interact well with a low period.
Consider the following scenario, assuming the initial absolute period
is set on timerlat tracer enablement:

1. The user enables the timerlat tracer with NO_OSNOISE_WORKLOAD and
100us period.
2. The user steps away for 1 hour.
3. After 10 seconds, tlat->abs_period is 3 600 000 000us in the past.
The while loop starts incrementing tlat->abs_period by 100us, taking
36 000 000 loops. If one iteration takes 10 CPU cycles on a 1GHz CPU,
the while loop itself will take 360us (which is >100us).
4. The timerlat thread never wakes up, since the wake-up time even
after the correction is in the past.

This is much more reasonable than the user stopping the thread inside
the while loop. Actually, this scenario can already happen without
alignment, since the scheduler might preempt the thread inside the
while loop for more than the period - but that is a separate issue.

Also, the current implementation is relatively simple (and hopefully
also easy to understand with the comments in v2), so my idea is that
we can use it for now, and if we want deterministing alignment in the
future, we can always improve it.

> > Good idea, thanks! In general, I'm not expecting the user to change
> > timerlat parameters during a measurement - but it is supported, and
> > should be documented.
>
> Maybe better to phrase it as "not guaranteed to take effect
> immediately" :-)
>

I think we can afford to be less vague than that here :) something
like, "setting the timerlat_align_us option will take affect upon next
timerlat tracer start".

> > I used it in cyclictest to measure the overhead of a large number of
> > threads waking up at the same time. Similarly, a non-zero alignment
> > will get rid of most of that overhead. Without alignment set, the
> > thread wake-ups offsets are semi-random, depending on how the threads
> > wake up, which might lead to inconsistent results where one run has
> > good numbers and another run bad numbers, since the alignment is
> > determined in the first cycle.
>
> OK, I was viewing the staggering as the main point, but I see how the
> alignment itself helps too.
>
> Is there a use case for not always doing the alignment?
> Other than people asking why their numbers suddenly got worse...
>

Yes - to simulate the default behavior of cyclictest without
-A/--aligned, and of multi-thread cyclic workloads that do not align
their threads respectively. Even if we wanted to always use the
alignment, it should not default to 0 IMHO, so that users don't see a
degradation like you mention.

> > next_abs_period is basically just the ktime_t variant of
> > tlat->abs_period for local calculations of the next period inside
> > wait_next_period(). Its only purpose is the ktime_compare() call that
> > increments tlat->abs_period by the period until it lands into the
> > future, if it happens to be in the past. This is necessary to do for
> > both a regular cycle (which might take long due to noise) and the
> > first cycle with alignment (because the other thread's first wake up
> > might be late), so it has to be set in the new code as well,
> > otherwise, the while loop won't see the time is in the past.
>
> Oh, I missed the unit difference (though it's basically just a typedef
> at this point).
>
> > I agree that this part of the code is confusing. There is also a
> > field, timerlat_variables.rel_period (tlat->rel_period), that is not
> > used anywhere, since the relative period is pulled out of
> > osnoise_variables. Something like this would be easier to read and
> > comprehend, IMHO:
>
> Yeah, I noticed that as well... we should remove it if we're not going
> to use it.
>

Yeah.

> > /*
> >  * wait_next_period - Wait for the next period for timerlat
> >  */
> > static int wait_next_period(struct timerlat_variables *tlat)
> > {
> >     ktime_t now;
> >     u64 rel_period = osnoise_data.timerlat_period * 1000;
> >
> >     now = hrtimer_cb_get_time(&tlat->timer);
> >
> >     /*
> >      * Set the next abs_period.
> >      */
> >     tlat->abs_period += rel_period;
> >
> >     /*
> >      * If the new abs_period is in the past, skip the activation.
> >      */
> >     while (ktime_compare(now, ns_to_ktime(tlat->abs_period) > 0) {
> >         next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
> >         tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
> >     }
> >
> >     set_current_state(TASK_INTERRUPTIBLE);
> >
> >     hrtimer_start(&tlat->timer, next_abs_period, HRTIMER_MODE_ABS_PINNED_HARD);
> >     schedule();
> >     return 1;
> > }
> >
> > (Excluding the changes from this patch.) What do you think?
>
> Why can't we just make tlat->abs_period and every other time variable
> in this file be ktime_t?  Other than atomic stuff if we do go that
> route.
>
> Not saying that that should hold up this patch,  just an idea to simplify things.
>

The reason for that is that the code does arithmetic directly on the
ns unit form of the time, without the need to use ktime_add(). I don't
see anything on top of that.  I see ktime_add(x, y) is just x + y
nowadays, so that would work.

Tomas

Re: [PATCH] tracing/osnoise: Add option to align tlat threads

Posted by Crystal Wood 1 month ago

On Fri, 2026-03-06 at 16:15 +0100, Tomas Glozar wrote:
> út 3. 3. 2026 v 4:21 odesílatel Crystal Wood <crwood@redhat.com> napsal:
> > > 1. The wake-up timers are set to absolute time, and are incremented by
> > > "period" (once or multiple times, if the timer is significantly
> > > delayed) each cycle. What can be done as an alternative to what v1
> > > does is this: record the current time when starting the timerlat
> > > tracer (I need to reset align_next to zero anyway even with the v1
> > > design, that is a bug in the patch), and increment from that.
> > 
> > I was only talking about doing this for the initial expiration, not on
> > increment.
> > 
> 
> Ah, I misread "period" as "relative period", since the initial
> absolute period is determined from the current time in the first cycle
> of each thread.

I did mean relative period ("absolute period" isn't a phrase that makes
sense to me; that's just an expiration); I just forgot to make the
"now +" part explicit.

> > > 2. "cpu" makes a poor thread ID here. If my period is 1000us, and I
> > > run on CPUs 0 and 100 with alignment 10, suddenly, the space between
> > > the threads becomes 1000us, which is equivalent to 0us. I would need
> > > to go through the cpuset and assign numbers from 0 to n to each CPU.
> > > That would guarantee a fixed spacing of the threads independent of
> > > when the threads wake up in the first cycle (unlike the v1 design),
> > > but it would make the implementation more complex, since I would have
> > > to store the numbers.
> > 
> > Right, I was thinking of just a few CPUs missing not being a big deal,
> > but on big systems with only a few CPUs running the test it does
> > matter.
> > 
> 
> My point is mostly that the spacing of the threads shouldn't depend on
> the CPU numbers. One has to make sure the alignment doesn't overflow
> the period anyway if they want to have completely time-isolated
> wake-ups.

Yeah, I underestimated how easy it would be for that to be a problem.  I
think this is a good enough reason to use the atomic approach.

> The while loop is designed only to handle "small" time differences,
> with respect to the relative period. When using timerlat manually with
> a user workload, it might take the user a few seconds/seconds/hours
> before they start the user process (typing the command line, or if the
> user e.g. has a snack or coffee in between), which has to be corrected
> by the while loop. This does not interact well with a low period.
> Consider the following scenario, assuming the initial absolute period
> is set on timerlat tracer enablement:
> 
> 1. The user enables the timerlat tracer with NO_OSNOISE_WORKLOAD and
> 100us period.
> 2. The user steps away for 1 hour.
> 3. After 10 seconds, tlat->abs_period is 3 600 000 000us in the past.
> The while loop starts incrementing tlat->abs_period by 100us, taking
> 36 000 000 loops. If one iteration takes 10 CPU cycles on a 1GHz CPU,
> the while loop itself will take 360us (which is >100us).
> 4. The timerlat thread never wakes up, since the wake-up time even
> after the correction is in the past.

(assuming you meant "after one hour" instead of 10 seconds)

Oh, I see it doesn't update "now" inside the loop.  Arming a timer for
the past should cause it to fire immediately though; otherwise there
would be all sorts of nasty races.

In any case, regardless of what we do with alignment, we could simplify
by replacing the while loop with hrtimer_forward_now() which does
division (if necessary) instead of a loop, and get rid of
tlat->abs_period (the handler can call hrtimer_get_expires()).

> This is much more reasonable than the user stopping the thread inside
> the while loop. Actually, this scenario can already happen without
> alignment, since the scheduler might preempt the thread inside the
> while loop for more than the period - but that is a separate issue.
> 
> Also, the current implementation is relatively simple (and hopefully
> also easy to understand with the comments in v2), so my idea is that
> we can use it for now, and if we want deterministing alignment in the
> future, we can always improve it.

The comments do help, thanks.  v1 took me a few tries to figure out,
given how different it is from usual cmpxchg usage.  v2 looks good to
me.

> > > I used it in cyclictest to measure the overhead of a large number of
> > > threads waking up at the same time. Similarly, a non-zero alignment
> > > will get rid of most of that overhead. Without alignment set, the
> > > thread wake-ups offsets are semi-random, depending on how the threads
> > > wake up, which might lead to inconsistent results where one run has
> > > good numbers and another run bad numbers, since the alignment is
> > > determined in the first cycle.
> > 
> > OK, I was viewing the staggering as the main point, but I see how the
> > alignment itself helps too.
> > 
> > Is there a use case for not always doing the alignment?
> > Other than people asking why their numbers suddenly got worse...
> > 
> 
> Yes - to simulate the default behavior of cyclictest without
> -A/--aligned, and of multi-thread cyclic workloads that do not align
> their threads respectively. Even if we wanted to always use the
> alignment, it should not default to 0 IMHO, so that users don't see a
> degradation like you mention.

It just feels a bit weird to preserve "maybe they're bunched up, maybe
not, who knows?" as a feature (much less the default), especially for a
tool meant to help with determinism.

> > Why can't we just make tlat->abs_period and every other time variable
> > in this file be ktime_t?  Other than atomic stuff if we do go that
> > route.
> > 
> > Not saying that that should hold up this patch,  just an idea to simplify things.
> > 
> 
> The reason for that is that the code does arithmetic directly on the
> ns unit form of the time, without the need to use ktime_add(). I don't
> see anything on top of that.  I see ktime_add(x, y) is just x + y
> nowadays, so that would work.

We should still use ktime_add() etc. in order to be nice users of the
interface.  I just think the conversions are worse clutter than a
couple ktime add/sub calls.

-Crystal

Re: [PATCH] tracing/osnoise: Add option to align tlat threads

Posted by Steven Rostedt 1 month, 1 week ago

On Fri, 27 Feb 2026 16:04:20 +0100
Tomas Glozar <tglozar@redhat.com> wrote:

> Add an option called TIMERLAT_ALIGN to osnoise/options, together with a
> corresponding setting osnoise/timerlat_align_us.
> 
> This option sets the alignment of wakeup times between different
> timerlat threads, similarly to cyclictest's -A/--aligned option. If
> TIMERLAT_ALIGN is set, the first thread that reaches the first cycle
> records its first wake-up time. Each following thread sets its first
> wake-up time to a fixed offset from the recorded time, and incremenets
> it by the same offset.
> 
> Example:
> 
> osnoise/timerlat_period is set to 1000, osnoise/timerlat_align_us is
> set to 50. There are four threads, on CPUs 1 to 4.

Is it set to 50 or 20?

> 
> - CPU 4 enters first cycle first. The current time is 20000us, so
> the wake-up of the first cycle is set to 21000us. This time is recorded.
> - CPU 2 enter first cycle next. It reads the recorded time, increments
> it to 21020us, and uses this value as its own wake-up time for the first
> cycle.
> - CPU 3 enters first cycle next. It reads the recorded time, increments
> it to 21040 us, and uses the value as its own wake-up time.

As the increments are off by 20 and not 50.

> - CPU 1 proceeds analogically.
> 
> In each next cycle, the wake-up time (called "absolute period" in
> timerlat code) is incremented by the (relative) period of 1000us. Thus,
> the wake-ups in the following cycles (provided the times are reached and
> not in the past) will be as follows:
> 
> CPU 1		CPU 2		CPU 3	 	CPU 4
> 21080us		21020us		21040us		21000us
> 22080us		22020us		22040us		22000us
> ...		...		...		...
> 
> Even if any cycle is skipped due to e.g. the first cycle calculation
> happening later, the alignment stays in place.
> 
> Signed-off-by: Tomas Glozar <tglozar@redhat.com>
> ---
> 
> I tested this option using the following command:
> 
> $ bpftrace -e 'tracepoint:osnoise:timerlat_sample /!@time[cpu]/ {
>       if (!@begin) { @begin = nsecs; }
>       @time[cpu] = ((nsecs - @begin) / 1000) % 1000;
>   }
>  END { clear(@begin); }' -c 'rtla timerlat hist -d 1s -c 1-10'
> 
> This captures the alignment of first timerlat sample (which is +-
> equivalent to the wake-up time).
> 
> With timerlat_align_us = 20:
> 
> @time[1]: 2
> @time[2]: 18
> @time[3]: 38
> @time[4]: 57
> @time[5]: 83
> @time[6]: 103
> @time[7]: 123
> @time[8]: 143
> @time[9]: 162
> @time[10]: 182
> 
> With timerlat_align_us = 0
> 
> @time[1]: 1
> @time[5]: 4
> @time[7]: 4
> @time[6]: 4
> @time[8]: 4
> @time[9]: 4
> @time[10]: 4
> @time[4]: 5
> @time[3]: 5
> @time[2]: 5
> 
> Only thing I am not too sure about is the absense of barriers. I feel
> like I only touch that one atomic variable concurrently, so it should
> be fine (unlike e.g. a mutex protecting another variable, where you need
> acquire-release semantics) with relaxed variants of atomic functions;
> but I don't have any other experience with barriers so far.
> 
>  kernel/trace/trace_osnoise.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> index dee610e465b9..df1d4529d226 100644
> --- a/kernel/trace/trace_osnoise.c
> +++ b/kernel/trace/trace_osnoise.c
> @@ -58,6 +58,7 @@ enum osnoise_options_index {
>  	OSN_PANIC_ON_STOP,
>  	OSN_PREEMPT_DISABLE,
>  	OSN_IRQ_DISABLE,
> +	OSN_TIMERLAT_ALIGN,
>  	OSN_MAX
>  };
>  
> @@ -66,7 +67,8 @@ static const char * const osnoise_options_str[OSN_MAX] = {
>  							"OSNOISE_WORKLOAD",
>  							"PANIC_ON_STOP",
>  							"OSNOISE_PREEMPT_DISABLE",
> -							"OSNOISE_IRQ_DISABLE" };
> +							"OSNOISE_IRQ_DISABLE",
> +							"TIMERLAT_ALIGN" };
>  
>  #define OSN_DEFAULT_OPTIONS		0x2
>  static unsigned long osnoise_options	= OSN_DEFAULT_OPTIONS;
> @@ -326,6 +328,7 @@ static struct osnoise_data {
>  	u64	stop_tracing_total;	/* stop trace in the final operation (report/thread) */
>  #ifdef CONFIG_TIMERLAT_TRACER
>  	u64	timerlat_period;	/* timerlat period */
> +	u64	timerlat_align_us;	/* timerlat alignment */
>  	u64	print_stack;		/* print IRQ stack if total > */
>  	int	timerlat_tracer;	/* timerlat tracer */
>  #endif
> @@ -338,6 +341,7 @@ static struct osnoise_data {
>  #ifdef CONFIG_TIMERLAT_TRACER
>  	.print_stack			= 0,
>  	.timerlat_period		= DEFAULT_TIMERLAT_PERIOD,
> +	.timerlat_align_us		= 0,
>  	.timerlat_tracer		= 0,
>  #endif
>  };
> @@ -1820,6 +1824,7 @@ static int wait_next_period(struct timerlat_variables *tlat)
>  {
>  	ktime_t next_abs_period, now;
>  	u64 rel_period = osnoise_data.timerlat_period * 1000;
> +	static atomic64_t align_next;
>  
>  	now = hrtimer_cb_get_time(&tlat->timer);
>  	next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
> @@ -1829,6 +1834,17 @@ static int wait_next_period(struct timerlat_variables *tlat)
>  	 */
>  	tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
>  
> +	if (test_bit(OSN_TIMERLAT_ALIGN, &osnoise_options) && !tlat->count
> +	    && atomic64_cmpxchg_relaxed(&align_next, 0, tlat->abs_period)) {

So the first one here sets 'align_next' and all others fall into this path.

As 'align_next' is a static variable for this function, what happens if you
run timerlat a second time with different values?

-- Steve

> +		/*
> +		 * Align thread in first cycle on each CPU to the set alignment.
> +		 */
> +		tlat->abs_period = atomic64_fetch_add_relaxed(osnoise_data.timerlat_align_us * 1000,
> +			&align_next);
> +		tlat->abs_period += osnoise_data.timerlat_align_us * 1000;
> +		next_abs_period = ns_to_ktime(tlat->abs_period);
> +	}
> +
>  	/*
>  	 * If the new abs_period is in the past, skip the activation.
>  	 */
> @@ -2650,6 +2666,17 @@ static struct trace_min_max_param timerlat_period = {
>  	.min	= &timerlat_min_period,
>  };
>  
> +/*
> + * osnoise/timerlat_align_us: align the first wakeup of all timerlat
> + * threads to a common boundary (in us). 0 means disabled.
> + */
> +static struct trace_min_max_param timerlat_align_us = {
> +	.lock	= &interface_lock,
> +	.val	= &osnoise_data.timerlat_align_us,
> +	.max	= NULL,
> +	.min	= NULL,
> +};
> +
>  static const struct file_operations timerlat_fd_fops = {
>  	.open		= timerlat_fd_open,
>  	.read		= timerlat_fd_read,
> @@ -2746,6 +2773,11 @@ static int init_timerlat_tracefs(struct dentry *top_dir)
>  	if (!tmp)
>  		return -ENOMEM;
>  
> +	tmp = tracefs_create_file("timerlat_align_us", TRACE_MODE_WRITE, top_dir,
> +				  &timerlat_align_us, &trace_min_max_fops);
> +	if (!tmp)
> +		return -ENOMEM;
> +
>  	retval = osnoise_create_cpu_timerlat_fd(top_dir);
>  	if (retval)
>  		return retval;

Re: [PATCH] tracing/osnoise: Add option to align tlat threads

Posted by Tomas Glozar 1 month ago

pá 27. 2. 2026 v 16:51 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal:
> > Example:
> >
> > osnoise/timerlat_period is set to 1000, osnoise/timerlat_align_us is
> > set to 50. There are four threads, on CPUs 1 to 4.
>
> Is it set to 50 or 20?
>

That is a typo, good catch.

>
> So the first one here sets 'align_next' and all others fall into this path.
>
> As 'align_next' is a static variable for this function, what happens if you
> run timerlat a second time with different values?
>

That is an oversight. It worked during my testing, since timerlat took
the align_next of the previous run, and since it was far in the past,
it incremented it to be in the future, and everything worked. But that
is not the intended behavior (and will be slow if a lot of time passes
between the timerlat runs, since we increment it multiple times by the
period)

It should be a global variable reset with each run. I got too excited
about the patch and forgot this is not RTLA where static variables get
reset with each run (unless I want the user to reboot after each run).
I will fix it.

Tomas