From: Josh Poimboeuf <jpoimboe@kernel.org>
Make unwind_deferred_request() NMI-safe so tracers in NMI context can
call it and safely request a user space stacktrace when the task exits.
An "nmi_timestamp" is added to the unwind_task_info structure; NMIs
update it instead of info->timestamp so that they do not race with the
normal path setting info->timestamp.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v9: https://lore.kernel.org/linux-trace-kernel/20250513223552.636076711@goodmis.org/
- Check for ret < 0 instead of ret != 0 when testing the return code of
  task_work_add(). Don't treat any non-zero value as an error;
  task_work_add() returns a negative value on error.
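
For context, a minimal sketch of how a tracer might use this from NMI
context once the patch is applied. The tracer-side names below are
hypothetical and not part of this series; only unwind_deferred_request()
is taken from this patch, the work registration comes from earlier
patches in the series:

	/* Illustrative only -- tracer-side names are made up. */
	static struct unwind_work my_unwind_work;	/* registered at tracer init */

	static void my_tracer_nmi_handler(void)
	{
		u64 timestamp;
		int ret;

		/* Safe from NMI context after this patch. */
		ret = unwind_deferred_request(&my_unwind_work, &timestamp);
		if (ret < 0)
			return;		/* could not queue the deferred unwind */

		/*
		 * ret == 0: the task_work was queued; ret == 1: a request was
		 * already pending for this task.  Either way @timestamp
		 * identifies the current user-space "epoch", and the registered
		 * callback delivers the user stacktrace later from task context.
		 */
		emit_event_with_cookie(timestamp);	/* hypothetical tracer hook */
	}
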
include/linux/unwind_deferred_types.h | 1 +
kernel/unwind/deferred.c | 91 ++++++++++++++++++++++++---
2 files changed, 84 insertions(+), 8 deletions(-)
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 5df264cf81ad..ae27a02234b8 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -11,6 +11,7 @@ struct unwind_task_info {
struct unwind_cache *cache;
struct callback_head work;
u64 timestamp;
+ u64 nmi_timestamp;
int pending;
};
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index b76c704ddc6d..88c867c32c01 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -25,8 +25,27 @@ static u64 get_timestamp(struct unwind_task_info *info)
{
lockdep_assert_irqs_disabled();
- if (!info->timestamp)
- info->timestamp = local_clock();
+ /*
+ * Note, the timestamp is generated on the first request.
+ * If it exists here, then the timestamp is earlier than
+ * this request and it means that this request will be
+ * valid for the stacktrace.
+ */
+ if (!info->timestamp) {
+ WRITE_ONCE(info->timestamp, local_clock());
+ barrier();
+ /*
+ * If an NMI came in and set a timestamp, it means that
+ * it happened before this timestamp was set (otherwise
+ * the NMI would have used this one). Use the NMI timestamp
+ * instead.
+ */
+ if (unlikely(info->nmi_timestamp)) {
+ WRITE_ONCE(info->timestamp, info->nmi_timestamp);
+ barrier();
+ WRITE_ONCE(info->nmi_timestamp, 0);
+ }
+ }
return info->timestamp;
}
@@ -103,6 +122,13 @@ static void unwind_deferred_task_work(struct callback_head *head)
unwind_deferred_trace(&trace);
+ /* Check if the timestamp was only set by NMI */
+ if (info->nmi_timestamp) {
+ WRITE_ONCE(info->timestamp, info->nmi_timestamp);
+ barrier();
+ WRITE_ONCE(info->nmi_timestamp, 0);
+ }
+
timestamp = info->timestamp;
guard(mutex)(&callback_mutex);
@@ -111,6 +137,48 @@ static void unwind_deferred_task_work(struct callback_head *head)
}
}
+static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
+{
+ struct unwind_task_info *info = &current->unwind_info;
+ bool inited_timestamp = false;
+ int ret;
+
+ /* Always use the nmi_timestamp first */
+ *timestamp = info->nmi_timestamp ? : info->timestamp;
+
+ if (!*timestamp) {
+ /*
+ * This is the first unwind request since the most recent entry
+ * from user space. Initialize the task timestamp.
+ *
+ * Don't write to info->timestamp directly, otherwise it may race
+ * with an interruption of get_timestamp().
+ */
+ info->nmi_timestamp = local_clock();
+ *timestamp = info->nmi_timestamp;
+ inited_timestamp = true;
+ }
+
+ if (info->pending)
+ return 1;
+
+ ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
+ if (ret < 0) {
+ /*
+ * If this set nmi_timestamp and is not using it,
+ * there's no guarantee that it will be used.
+ * Set it back to zero.
+ */
+ if (inited_timestamp)
+ info->nmi_timestamp = 0;
+ return ret;
+ }
+
+ info->pending = 1;
+
+ return 0;
+}
+
/**
* unwind_deferred_request - Request a user stacktrace on task exit
* @work: Unwind descriptor requesting the trace
@@ -139,31 +207,38 @@ static void unwind_deferred_task_work(struct callback_head *head)
int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
{
struct unwind_task_info *info = &current->unwind_info;
+ int pending;
int ret;
*timestamp = 0;
- if (WARN_ON_ONCE(in_nmi()))
- return -EINVAL;
-
if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
!user_mode(task_pt_regs(current)))
return -EINVAL;
+ if (in_nmi())
+ return unwind_deferred_request_nmi(work, timestamp);
+
guard(irqsave)();
*timestamp = get_timestamp(info);
/* callback already pending? */
- if (info->pending)
+ pending = READ_ONCE(info->pending);
+ if (pending)
+ return 1;
+
+ /* Claim the work unless an NMI just now swooped in to do so. */
+ if (!try_cmpxchg(&info->pending, &pending, 1))
return 1;
/* The work has been claimed, now schedule it. */
ret = task_work_add(current, &info->work, TWA_RESUME);
- if (WARN_ON_ONCE(ret))
+ if (WARN_ON_ONCE(ret)) {
+ WRITE_ONCE(info->pending, 0);
return ret;
+ }
- info->pending = 1;
return 0;
}
--
2.47.2
On Tue, Jun 10, 2025 at 08:54:28PM -0400, Steven Rostedt wrote:

> +static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
> +{
> +	struct unwind_task_info *info = &current->unwind_info;
[...]
> +	info->pending = 1;
> +
> +	return 0;
> +}

[...]

> +	if (in_nmi())
> +		return unwind_deferred_request_nmi(work, timestamp);

So nested NMI is a thing -- AFAICT this is broken in the face of nested
NMI.

Specifically, we mark all exceptions that can happen with IRQs disabled
as NMI like (so that they don't go about taking locks etc.).

So imagine you're in #DB, you're asking for an unwind, you do the above
dance and get hit with NMI.

Then you get the NMI setting nmi_timestamp, and #DB overwriting it with
a later value, and you're back up the creek without no paddles.

Mix that with local_clock() that is only monotonic on a single CPU. And
you ask for an unwind on CPU0, get migrated to CPU1 which for the
argument will be behind, and see a timestamp 'far' in the future.
On June 19, 2025 4:57:17 AM EDT, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jun 10, 2025 at 08:54:28PM -0400, Steven Rostedt wrote:
[...]
> So imagine you're in #DB, you're asking for an unwind, you do the above
> dance and get hit with NMI.

Does #DB make in_nmi() true? If that's the case then we do need to
handle that.

-- Steve

> Then you get the NMI setting nmi_timestamp, and #DB overwriting it with
> a later value, and you're back up the creek without no paddles.
>
> Mix that with local_clock() that is only monotonic on a single CPU. And
> you ask for an unwind on CPU0, get migrated to CPU1 which for the
> argument will be behind, and see a timestamp 'far' in the future.
On Thu, Jun 19, 2025 at 05:07:10AM -0400, Steven Rostedt wrote:

> Does #DB make in_nmi() true? If that's the case then we do need to
> handle that.

Yes: #DF, #MC, #BP (int3), #DB and NMI all have in_nmi() true.

Ignoring #DF because that's mostly game over, you can get them all
nested for up to 4 (you're well aware of the normal NMI recursion
crap).

Then there is the SEV #VC stuff, which is also NMI like. So if you're a
CoCo-nut, you can perhaps get it up to 5.
On June 19, 2025 5:32:26 AM EDT, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jun 19, 2025 at 05:07:10AM -0400, Steven Rostedt wrote:
>
>> Does #DB make in_nmi() true? If that's the case then we do need to handle that.
>
> Yes: #DF, #MC, #BP (int3), #DB and NMI all have in_nmi() true.
>
> Ignoring #DF because that's mostly game over, you can get them all
> nested for up to 4 (you're well aware of the normal NMI recursion
> crap).

We probably can implement this with stacked counters.

> Then there is the SEV #VC stuff, which is also NMI like. So if you're a
> CoCo-nut, you can perhaps get it up to 5.

The rest of the tracing infrastructure goes 4 deep and hasn't had
issues, so 4 is probably sufficient.

-- Steve
On Thu, Jun 19, 2025 at 05:42:31AM -0400, Steven Rostedt wrote:

> On June 19, 2025 5:32:26 AM EDT, Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Ignoring #DF because that's mostly game over, you can get them all
> > nested for up to 4 (you're well aware of the normal NMI recursion
> > crap).
>
> We probably can implement this with stacked counters.

I would seriously consider dropping support for anything that can't do
cmpxchg at the width you need.
On June 19, 2025 5:45:05 AM EDT, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jun 19, 2025 at 05:42:31AM -0400, Steven Rostedt wrote:
>
>> We probably can implement this with stacked counters.
>
> I would seriously consider dropping support for anything that can't do
> cmpxchg at the width you need.

That may be something we can do as it's a new feature and unlike the
ftrace ring buffer, it won't be a regression not to support them.

We currently care about x86-64, arm64, ppc 64 and s390. I'm assuming
they all have a proper 64 bit cmpxchg.

-- Steve
On Thu, Jun 19, 2025 at 06:19:28AM -0400, Steven Rostedt wrote:

> We currently care about x86-64, arm64, ppc 64 and s390. I'm assuming
> they all have a proper 64 bit cmpxchg.

They do. The only 64bit architecture that does not is HPPA IIRC.
On Thu, 19 Jun 2025 12:39:51 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jun 19, 2025 at 06:19:28AM -0400, Steven Rostedt wrote:
>
> > We currently care about x86-64, arm64, ppc 64 and s390. I'm assuming
> > they all have a proper 64 bit cmpxchg.
>
> They do. The only 64bit architecture that does not is HPPA IIRC.

It appears that I refactored the ring buffer code to get rid of all 64
bit cmpxchg(), but I do have this in the code:

	if ((!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) ||
	     IS_ENABLED(CONFIG_GENERIC_ATOMIC64)) &&
	    (unlikely(in_nmi()))) {
		return NULL;
	}

We could do something similar, in the function that asks for a deferred
stack trace:

	/* NMI requires having safe 64 bit cmpxchg operations */
	if ((!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) ||
	     !IS_ENABLED(CONFIG_64BIT)) && in_nmi())
		return -EINVAL;

As the only thing not supported is requesting a deferred stack trace
from NMI context when 64 bit cmpxchg is not available, there's no reason
not to support the rest of the functionality.

I'll have to wrap the cmpxchg too, so that it is skipped and a plain
update is done instead, as for these archs NMI is not an issue and
interrupts will be disabled, so no cmpxchg is needed.

-- Steve
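
For illustration, one way the wrapping mentioned above might look. The
helper below is hypothetical and not from the patch; on configs without
an NMI-safe 64 bit cmpxchg the NMI case has already been rejected by the
check above, so a plain store with interrupts disabled is sufficient:

	/* Hypothetical helper: install the timestamp only if not yet set. */
	static bool unwind_try_set_timestamp(struct unwind_task_info *info, u64 ts)
	{
	#if defined(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) && defined(CONFIG_64BIT)
		u64 old = 0;

		/* NMI-safe: only the first requester installs the timestamp */
		return try_cmpxchg(&info->timestamp, &old, ts);
	#else
		/* NMI requests were rejected earlier; IRQs are disabled here */
		if (info->timestamp)
			return false;
		WRITE_ONCE(info->timestamp, ts);
		return true;
	#endif
	}
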
On Thu, Jun 19, 2025 at 11:32:26AM +0200, Peter Zijlstra wrote:

> On Thu, Jun 19, 2025 at 05:07:10AM -0400, Steven Rostedt wrote:
>
> > Does #DB make in_nmi() true? If that's the case then we do need to
> > handle that.
>
> Yes: #DF, #MC, #BP (int3), #DB and NMI all have in_nmi() true.

Note: these are all the from-kernel parts of those exceptions. The
from-user side is significantly different.

> Ignoring #DF because that's mostly game over, you can get them all
> nested for up to 4 (you're well aware of the normal NMI recursion
> crap).
>
> Then there is the SEV #VC stuff, which is also NMI like. So if you're a
> CoCo-nut, you can perhaps get it up to 5.
On Tue, Jun 10, 2025 at 08:54:28PM -0400, Steven Rostedt wrote:

> From: Josh Poimboeuf <jpoimboe@kernel.org>
>
> Make unwind_deferred_request() NMI-safe so tracers in NMI context can
> call it and safely request a user space stacktrace when the task exits.
>
> An "nmi_timestamp" is added to the unwind_task_info structure; NMIs
> update it instead of info->timestamp so that they do not race with the
> normal path setting info->timestamp.

I feel this is missing something... or I am.

[...]

> @@ -25,8 +25,27 @@ static u64 get_timestamp(struct unwind_task_info *info)
> +	if (!info->timestamp) {
> +		WRITE_ONCE(info->timestamp, local_clock());
> +		barrier();
> +		/*
> +		 * If an NMI came in and set a timestamp, it means that
> +		 * it happened before this timestamp was set (otherwise
> +		 * the NMI would have used this one). Use the NMI timestamp
> +		 * instead.
> +		 */
> +		if (unlikely(info->nmi_timestamp)) {
> +			WRITE_ONCE(info->timestamp, info->nmi_timestamp);
> +			barrier();
> +			WRITE_ONCE(info->nmi_timestamp, 0);
> +		}
> +	}

[...]

So what's the actual problem here, something like this:

	if (!info->timestamp)
		<NMI>
		if (!info->timestamp)
			info->timestamp = local_clock();	/* Ta */
		</NMI>
		info->timestamp = local_clock();	/* Tb */

And now info has Tb which is after Ta, which was recorded for the NMI
request?

Why can't we cmpxchg_local() the thing and avoid this horrible stuff?

	static u64 get_timestamp(struct unwind_task_info *info)
	{
		u64 new, old = info->timestamp;

		if (old)
			return old;

		new = local_clock();
		old = cmpxchg_local(&info->timestamp, old, new);
		if (old)
			return old;

		return new;
	}

Seems simple enough; what's wrong with it?
On Thu, 19 Jun 2025 10:34:15 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Why can't we cmpxchg_local() the thing and avoid this horrible stuff?
>
> 	static u64 get_timestamp(struct unwind_task_info *info)
> 	{
> 		u64 new, old = info->timestamp;
>
> 		if (old)
> 			return old;
>
> 		new = local_clock();
> 		old = cmpxchg_local(&info->timestamp, old, new);
> 		if (old)
> 			return old;
>
> 		return new;
> 	}
>
> Seems simple enough; what's wrong with it?

It's a 64 bit number where most 32 bit architectures don't have any
decent cmpxchg on 64 bit values. That's given me hell in the ring
buffer code :-p

-- Steve
On Thu, Jun 19, 2025 at 04:37:33AM -0400, Steven Rostedt wrote:

> It's a 64 bit number where most 32 bit architectures don't have any
> decent cmpxchg on 64 bit values. That's given me hell in the ring
> buffer code :-p

Do we really have to support 32bit?

But IIRC a previous version of all this had a syscall counter. If you
make this a per task syscall counter, unsigned long is plenty.

I suppose that was dropped because adding that counter increment to all
syscalls blows. But if you really want to support 32bit, that might be
a fallback.

Luckily, x86 dropped support for !CMPXCHG8B right along with !TSC. So on
x86 we good with timestamps, even on 32bit.
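
For illustration, a rough sketch of that fallback. The field and
function names below are made up and not from this series:

	/*
	 * Sketch only: bump a per-task counter on every entry from user
	 * space and hand it out as the cookie.  It is unique per user-space
	 * "epoch", fits in an unsigned long, and needs no 64 bit cmpxchg.
	 */
	struct unwind_task_info_alt {
		unsigned long	entries;	/* bumped on entry from user space */
		int		pending;
	};

	/* called from the syscall/irq entry path */
	static inline void unwind_note_user_entry(struct unwind_task_info_alt *info)
	{
		info->entries++;
	}

	/* every requester in the same user-space epoch sees the same cookie */
	static inline unsigned long unwind_get_cookie(struct unwind_task_info_alt *info)
	{
		return info->entries;
	}
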
On Thu, Jun 19, 2025 at 10:44:27AM +0200, Peter Zijlstra wrote:

> Luckily, x86 dropped support for !CMPXCHG8B right along with !TSC. So on
> x86 we good with timestamps, even on 32bit.

Well, not entirely true, local_clock() is not guaranteed monotonic. So
you might be in for quite a bit of hurt if you rely on that.
On June 19, 2025 4:48:13 AM EDT, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jun 19, 2025 at 10:44:27AM +0200, Peter Zijlstra wrote:
>
>> Luckily, x86 dropped support for !CMPXCHG8B right along with !TSC. So on
>> x86 we good with timestamps, even on 32bit.
>
> Well, not entirely true, local_clock() is not guaranteed monotonic. So
> you might be in for quite a bit of hurt if you rely on that.

As long as it is monotonic per task. If it is not, then pretty much all
tracers that use it are broken.

-- Steve
On Thu, Jun 19, 2025 at 05:10:20AM -0400, Steven Rostedt wrote:

> As long as it is monotonic per task. If it is not, then pretty much all
> tracers that use it are broken.

It is monotonic per CPU. It says so in the comment. The inter-CPU drift
is bounded to a tick or something. The trade-off is that it can return
the same value for the majority of that tick.

The way that thing is set up is that we use GTOD (HPET if your TSC is
buggered) snapshots at ticks, set up a window to the next tick, and fill
out with TSC deltas and a (local) monotonicity filter.

So if TSC is really wild, it can hit that window boundary real quick and
get stuck there until the next tick.

Some of the early chips had TSC affected by DVFS, so you change CPU
speed, TSC speed changes along with it. We sorta try and compensate for
that.

Anyway, welcome to the wonderful world of trying to tell time on x86 :-(

Today, most x86_64 chips made in the last few years will have a
relatively sane TSC, but still we get the rare random case where it gets
marked unstable (they're becoming few and far between though).

x86 is one of the worst architectures in this regard -- but IIRC there
were a few others out there.

Also, virt, lets not talk about virt.
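
For illustration only, a drastically simplified sketch of the windowing
described above; this is not the actual kernel/sched/clock.c code, just
its rough shape:

	/*
	 * Rough shape only: each tick anchors the clock with a GTOD snapshot,
	 * TSC deltas fill the window until the next tick, and the result is
	 * clamped so it stays monotonic per CPU and cannot run past the
	 * window edge (a wild TSC gets "stuck" there until the next tick).
	 */
	static u64 clock_in_window(u64 tick_gtod, u64 tsc_delta, u64 last_clock)
	{
		u64 clock = tick_gtod + tsc_delta;
		u64 lo = max(tick_gtod, last_clock);			/* per-CPU monotonicity */
		u64 hi = max(last_clock, tick_gtod + TICK_NSEC);	/* window boundary */

		return clamp(clock, lo, hi);
	}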