[v2] rv: Add monitors to validate task switch

[RFC PATCH v2 10/12] rv: Retry when da monitor detects race conditions

Posted by Gabriele Monaco 9 months ago

DA monitor can be accessed from multiple cores simultaneously, this is
likely, for instance when dealing with per-task monitors reacting on
events that do not always occur on the CPU where the task is running.
This can cause race conditions where two events change the next state
and we see inconsistent values. E.g.:

  [62] event_srs: 27: sleepable x sched_wakeup -> running (final)
  [63] event_srs: 27: sleepable x sched_set_state_sleepable -> sleepable
  [63] error_srs: 27: event sched_switch_suspend not expected in the state running

In this case the monitor fails because the event on CPU 62 wins against
the one on CPU 63, although the correct state should have been
sleepable, since the task get suspended.

Detect if the current state was modified by using try_cmpxchg while
storing the next value. If it was, try again reading the current state.
After a maximum number of failed retries, react as if it was an error
with invalid current state (we cannot determine it).

Monitors where this type of condition can occur must be able to account
for racing events in any possible order, as we cannot know the winner.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/linux/rv.h      |  3 ++-
 include/rv/da_monitor.h | 53 +++++++++++++++++++++++++++++++----------
 2 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/include/linux/rv.h b/include/linux/rv.h
index 3452b5e4b29e..a83a81ac6e46 100644
--- a/include/linux/rv.h
+++ b/include/linux/rv.h
@@ -7,7 +7,8 @@
 #ifndef _LINUX_RV_H
 #define _LINUX_RV_H
 
-#define MAX_DA_NAME_LEN	32
+#define MAX_DA_NAME_LEN			32
+#define MAX_DA_RETRY_RACING_EVENTS	3
 
 #ifdef CONFIG_RV
 /*
diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 215c3eb770cc..8b714e3085a5 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -82,16 +82,19 @@ static inline void da_monitor_reset_##name(struct da_monitor *da_mon)				\
  */												\
 static inline type da_monitor_curr_state_##name(struct da_monitor *da_mon)			\
 {												\
-	return da_mon->curr_state;								\
+	return READ_ONCE(da_mon->curr_state);							\
 }												\
 												\
 /*												\
  * da_monitor_set_state_##name - set the new current state					\
+ *												\
+ * return false without the change in case the state was modified elsewhere			\
  */												\
-static inline void										\
-da_monitor_set_state_##name(struct da_monitor *da_mon, enum states_##name state)		\
+static inline bool										\
+da_monitor_set_state_##name(struct da_monitor *da_mon, enum states_##name prev_state,		\
+			    enum states_##name state)						\
 {												\
-	da_mon->curr_state = state;								\
+	return try_cmpxchg(&da_mon->curr_state, &prev_state, state);				\
 }												\
 												\
 /*												\
@@ -150,17 +153,29 @@ static inline bool da_monitor_handling_event_##name(struct da_monitor *da_mon)
  * Event handler for implicit monitors. Implicit monitor is the one which the
  * handler does not need to specify which da_monitor to manipulate. Examples
  * of implicit monitor are the per_cpu or the global ones.
+ *
+ * Retry, in case there is a race while getting and setting the next state
+ * return an invalid current state if we run out of retries. The monitor should
+ * be able to handle various orders.
  */
 #define DECLARE_DA_MON_MODEL_HANDLER_IMPLICIT(name, type)					\
 												\
 static inline bool										\
 da_event_##name(struct da_monitor *da_mon, enum events_##name event)				\
 {												\
-	type curr_state = da_monitor_curr_state_##name(da_mon);					\
-	type next_state = model_get_next_state_##name(curr_state, event);			\
+	bool changed;										\
+	type curr_state, next_state;								\
 												\
-	if (next_state != INVALID_STATE) {							\
-		da_monitor_set_state_##name(da_mon, next_state);				\
+	for (int i = 0; i < MAX_DA_RETRY_RACING_EVENTS; i++) {					\
+		curr_state = da_monitor_curr_state_##name(da_mon);				\
+		next_state = model_get_next_state_##name(curr_state, event);			\
+		if (next_state == INVALID_STATE)						\
+			break;									\
+		changed = da_monitor_set_state_##name(da_mon, curr_state, next_state);		\
+		if (unlikely(!changed))	{							\
+			curr_state = -1;							\
+			continue;								\
+		}										\
 												\
 		trace_event_##name(model_get_state_name_##name(curr_state),			\
 				   model_get_event_name_##name(event),				\
@@ -181,17 +196,29 @@ da_event_##name(struct da_monitor *da_mon, enum events_##name event)				\
 
 /*
  * Event handler for per_task monitors.
+ *
+ * Retry, in case there is a race while getting and setting the next state
+ * return an invalid current state if we run out of retries. The monitor should
+ * be able to handle various orders.
  */
 #define DECLARE_DA_MON_MODEL_HANDLER_PER_TASK(name, type)					\
 												\
 static inline bool da_event_##name(struct da_monitor *da_mon, struct task_struct *tsk,		\
 				   enum events_##name event)					\
 {												\
-	type curr_state = da_monitor_curr_state_##name(da_mon);					\
-	type next_state = model_get_next_state_##name(curr_state, event);			\
-												\
-	if (next_state != INVALID_STATE) {							\
-		da_monitor_set_state_##name(da_mon, next_state);				\
+	bool changed;										\
+	type curr_state, next_state;								\
+												\
+	for (int i = 0; i < MAX_DA_RETRY_RACING_EVENTS; i++) {					\
+		curr_state = da_monitor_curr_state_##name(da_mon);				\
+		next_state = model_get_next_state_##name(curr_state, event);			\
+		if (next_state == INVALID_STATE)						\
+			break;									\
+		changed = da_monitor_set_state_##name(da_mon, curr_state, next_state);		\
+		if (unlikely(!changed))	{							\
+			curr_state = -1;							\
+			continue;								\
+		}										\
 												\
 		trace_event_##name(tsk->pid,							\
 				   model_get_state_name_##name(curr_state),			\
-- 
2.49.0

Re: [RFC PATCH v2 10/12] rv: Retry when da monitor detects race conditions

Posted by Nam Cao 8 months, 3 weeks ago

On Wed, May 14, 2025 at 10:43:12AM +0200, Gabriele Monaco wrote:
> -static inline void										\
> -da_monitor_set_state_##name(struct da_monitor *da_mon, enum states_##name state)		\
> +static inline bool										\
> +da_monitor_set_state_##name(struct da_monitor *da_mon, enum states_##name prev_state,		\
> +			    enum states_##name state)						\
>  {												\
> -	da_mon->curr_state = state;								\
> +	return try_cmpxchg(&da_mon->curr_state, &prev_state, state);				\
>  }												\

This is a very thin wrapper. Should we just call try_cmpxchg() directly?

>  static inline bool										\
>  da_event_##name(struct da_monitor *da_mon, enum events_##name event)				\
>  {												\
> -	type curr_state = da_monitor_curr_state_##name(da_mon);					\
> -	type next_state = model_get_next_state_##name(curr_state, event);			\
> +	bool changed;										\
> +	type curr_state, next_state;								\
>  												\
> -	if (next_state != INVALID_STATE) {							\
> -		da_monitor_set_state_##name(da_mon, next_state);				\
> +	for (int i = 0; i < MAX_DA_RETRY_RACING_EVENTS; i++) {					\
> +		curr_state = da_monitor_curr_state_##name(da_mon);				\

For the follow-up iterations (i > 0), it is not necessary to read
curr_state again here, we already have its value from try_cmpxchg() below.

Also, thinking about memory barrier hurts my main, but I'm not entirely
sure if reading curr_state without memory barrier here is okay.

How about something like below?

curr_state = da_monitor_curr_state_##name(da_mon);
for (int i = 0; i < MAX_DA_RETRY_RACING_EVENTS; i++) {
	next_state = model_get_next_state_##name(curr_state, event);
	if (next_state == INVALID_STATE)
		break;
	if (try_cmpxchg(&da_mon->curr_state, &curr_state, next_state))
		break;
}

Furthermore, it is possible to replace for(...) with while (1)? I don't
think we can have a live lock, because if we fail to do try_cmpxchg(),
the "other guy" surely succeed.

Best regards,
Nam

Re: [RFC PATCH v2 10/12] rv: Retry when da monitor detects race conditions

Posted by Gabriele Monaco 8 months, 3 weeks ago

On Mon, 2025-05-19 at 11:06 +0200, Nam Cao wrote:
> On Wed, May 14, 2025 at 10:43:12AM +0200, Gabriele Monaco wrote:
> > -static inline
> > void										\
> > -da_monitor_set_state_##name(struct da_monitor *da_mon, enum
> > states_##name state)		\
> > +static inline
> > bool										\
> > +da_monitor_set_state_##name(struct da_monitor *da_mon, enum
> > states_##name prev_state,		\
> > +			    enum states_##name
> > state)						\
> >  {								
> > 				\
> > -	da_mon->curr_state =
> > state;								\
> > +	return try_cmpxchg(&da_mon->curr_state, &prev_state,
> > state);				\
> >  }								
> > 				\
> 
> This is a very thin wrapper. Should we just call try_cmpxchg()
> directly?

Mmh, right, at this point the wrapper does nothing but making the code
more obscure, will do.

> 
> >  static inline
> > bool										\
> >  da_event_##name(struct da_monitor *da_mon, enum events_##name
> > event)				\
> >  {								
> > 				\
> > -	type curr_state =
> > da_monitor_curr_state_##name(da_mon);					\
> > -	type next_state = model_get_next_state_##name(curr_state,
> > event);			\
> > +	bool
> > changed;										\
> > +	type curr_state,
> > next_state;								\
> >  								
> > 				\
> > -	if (next_state != INVALID_STATE)
> > {							\
> > -		da_monitor_set_state_##name(da_mon,
> > next_state);				\
> > +	for (int i = 0; i < MAX_DA_RETRY_RACING_EVENTS; i++)
> > {					\
> > +		curr_state =
> > da_monitor_curr_state_##name(da_mon);				\
> 
> For the follow-up iterations (i > 0), it is not necessary to read
> curr_state again here, we already have its value from try_cmpxchg()
> below.

Yeah good point.

> 
> Also, thinking about memory barrier hurts my main, but I'm not
> entirely
> sure if reading curr_state without memory barrier here is okay.
> 

I guess we are on the same boat here. I couldn't really understand how
much of a barrier is the try_cmpxchg imposing (if any), but didn't see
any noticeable difference adding an explicit smp write barrier to pair
with the READ_ONCE in da_monitor_curr_state, so straight assumed we can
do without it.
But I definitely appreciate opinions on this.

> How about something like below?
> 
> curr_state = da_monitor_curr_state_##name(da_mon);
> for (int i = 0; i < MAX_DA_RETRY_RACING_EVENTS; i++) {
> 	next_state = model_get_next_state_##name(curr_state, event);
> 	if (next_state == INVALID_STATE)
> 		break;
> 	if (try_cmpxchg(&da_mon->curr_state, &curr_state,
> next_state))
> 		break;
> }
> 

Yeah, that's neater.

> Furthermore, it is possible to replace for(...) with while (1)? I
> don't
> think we can have a live lock, because if we fail to do
> try_cmpxchg(),
> the "other guy" surely succeed.
> 

Mmh, although definitely unlikely, I'm thinking of a case in which the
event starts on one CPU and at the same time we see events in IRQ and 
on another CPU, let's say continuously. Nothing forbids that between
any two consecutive try_cmpxchg another CPU/context changes the next
state (making the local try_cmpxchg fail).
In practice I've never seen it going on the second iteration, as the
critical section is really tiny, but I'm not sure we can guarantee this
never happens.
Or am I missing something?

Thanks,
Gabriele

Re: [RFC PATCH v2 10/12] rv: Retry when da monitor detects race conditions

Posted by Nam Cao 8 months, 3 weeks ago

On Mon, May 19, 2025 at 12:28:12PM +0200, Gabriele Monaco wrote:
> Mmh, although definitely unlikely, I'm thinking of a case in which the
> event starts on one CPU and at the same time we see events in IRQ and 
> on another CPU, let's say continuously. Nothing forbids that between
> any two consecutive try_cmpxchg another CPU/context changes the next
> state (making the local try_cmpxchg fail).
> In practice I've never seen it going on the second iteration, as the
> critical section is really tiny, but I'm not sure we can guarantee this
> never happens.
> Or am I missing something?

I have a feeling that you missed my point. I agree that the retrying is
needed, because we may race with another.

What I am proposing is that we drop the MAX_DA_RETRY_RACING_EVENTS, and
just keep retrying until we succeed.

And that's safe to do, because the maximum number of retries is the number
of tasks contending with us to set the monitor's state. So we know we won't
be retrying for long.

Best regards,
Nam

Re: [RFC PATCH v2 10/12] rv: Retry when da monitor detects race conditions

Posted by Gabriele Monaco 8 months, 3 weeks ago

On Mon, 2025-05-19 at 12:38 +0200, Nam Cao wrote:
> On Mon, May 19, 2025 at 12:28:12PM +0200, Gabriele Monaco wrote:
> > Mmh, although definitely unlikely, I'm thinking of a case in which
> > the
> > event starts on one CPU and at the same time we see events in IRQ
> > and 
> > on another CPU, let's say continuously. Nothing forbids that
> > between
> > any two consecutive try_cmpxchg another CPU/context changes the
> > next
> > state (making the local try_cmpxchg fail).
> > In practice I've never seen it going on the second iteration, as
> > the
> > critical section is really tiny, but I'm not sure we can guarantee
> > this
> > never happens.
> > Or am I missing something?
> 
> I have a feeling that you missed my point. I agree that the retrying
> is
> needed, because we may race with another.
> 
> What I am proposing is that we drop the MAX_DA_RETRY_RACING_EVENTS,
> and
> just keep retrying until we succeed.
> 
> And that's safe to do, because the maximum number of retries is the
> number
> of tasks contending with us to set the monitor's state. So we know we
> won't
> be retrying for long.

I get this point, what I mean is: can we really guarantee the number of
contending tasks (or contexts) is finite?
In other words, the try_cmpxchg guarantees 1 and only 1 actor wins
every time, but cannot guarantee all actors will eventually win, an
actor /could/ be hanging there forever.

This handler is running for each event in the monitor and tracepoint
handlers can be interrupted as well as run in interrupt context (where
of course they cannot be interrupted). I don't think the number of
actors is bounded by the number of CPUs.
I see this situation is extremely unlikely, but in an exotic scenario
where a CPU is sufficiently slower than others (e.g. in a VM) I believe
we can see this critical section large enough for this to potentially
happen.

I'm not quite afraid of infinite loops, but rather RV introducing
unbounded latency very hard to track and without any reporting.
Chances are, since tracepoints and actual traced events are not atomic,
that by the time this delayed context /wins/ the RV event is no longer
current, so we may see an error already.

Does it make sense to you or am I making it more complex than it should
be?

Thanks,
Gabriele

Re: [RFC PATCH v2 10/12] rv: Retry when da monitor detects race conditions

Posted by Nam Cao 8 months, 3 weeks ago

On Mon, May 19, 2025 at 01:13:01PM +0200, Gabriele Monaco wrote:
> 
> 
> On Mon, 2025-05-19 at 12:38 +0200, Nam Cao wrote:
> > On Mon, May 19, 2025 at 12:28:12PM +0200, Gabriele Monaco wrote:
> > > Mmh, although definitely unlikely, I'm thinking of a case in which
> > > the
> > > event starts on one CPU and at the same time we see events in IRQ
> > > and 
> > > on another CPU, let's say continuously. Nothing forbids that
> > > between
> > > any two consecutive try_cmpxchg another CPU/context changes the
> > > next
> > > state (making the local try_cmpxchg fail).
> > > In practice I've never seen it going on the second iteration, as
> > > the
> > > critical section is really tiny, but I'm not sure we can guarantee
> > > this
> > > never happens.
> > > Or am I missing something?
> > 
> > I have a feeling that you missed my point. I agree that the retrying
> > is
> > needed, because we may race with another.
> > 
> > What I am proposing is that we drop the MAX_DA_RETRY_RACING_EVENTS,
> > and
> > just keep retrying until we succeed.
> > 
> > And that's safe to do, because the maximum number of retries is the
> > number
> > of tasks contending with us to set the monitor's state. So we know we
> > won't
> > be retrying for long.
> 
> I get this point, what I mean is: can we really guarantee the number of
> contending tasks (or contexts) is finite?
> In other words, the try_cmpxchg guarantees 1 and only 1 actor wins
> every time, but cannot guarantee all actors will eventually win, an
> actor /could/ be hanging there forever.
> 
> This handler is running for each event in the monitor and tracepoint
> handlers can be interrupted as well as run in interrupt context (where
> of course they cannot be interrupted). I don't think the number of
> actors is bounded by the number of CPUs.
> I see this situation is extremely unlikely, but in an exotic scenario
> where a CPU is sufficiently slower than others (e.g. in a VM) I believe
> we can see this critical section large enough for this to potentially
> happen.
> 
> I'm not quite afraid of infinite loops, but rather RV introducing
> unbounded latency very hard to track and without any reporting.
> Chances are, since tracepoints and actual traced events are not atomic,
> that by the time this delayed context /wins/ the RV event is no longer
> current, so we may see an error already.
> 
> Does it make sense to you or am I making it more complex than it should
> be?

Right, I can see that being a problem. But I don't know enough about it to
comment further, so do as you think best, maybe someone else can help.

Best regards,
Nam