[PATCH 0/1] documentation: seqlock: fix the wrong documentation of read_seqbegin_or_lock/need_seqretry
Posted by Oleg Nesterov 4 months, 1 week ago
Hello,

I already tried to discuss this about 2 years ago, but nobody replied.
Let me make another attempt.

1/1 is the trivial doc fix, but what do you think about "RFC 2/1" ?
Does it make any sense?

Oleg.
[PATCH v2 0/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
Only 1/4 was changed according to comments from Linus and Waiman,
but let me resend 2-4 as well.

Oleg.

 fs/proc/array.c         |  9 ++-----
 fs/proc/base.c          | 10 ++------
 include/linux/seqlock.h | 64 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/cputime.c  | 15 +++---------
 4 files changed, 71 insertions(+), 27 deletions(-)
Re: [PATCH v2 0/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Peter Zijlstra 4 months ago
On Wed, Oct 08, 2025 at 02:30:14PM +0200, Oleg Nesterov wrote:
> Only 1/4 was changed according to comments from Linus and Waiman,
> but let me resend 2-4 as well.

Shall I put this in tip/locking/core once -rc1 happens or so?
Re: [PATCH v2 0/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
On 10/08, Peter Zijlstra wrote:
>
> On Wed, Oct 08, 2025 at 02:30:14PM +0200, Oleg Nesterov wrote:
> > Only 1/4 was changed according to comments from Linus and Waiman,
> > but let me resend 2-4 as well.
>
> Shall I put this in tip/locking/core once -rc1 happens or so?

Would be great ;) If nobody objects.

Can you also take the trivial

	[PATCH 1/1] documentation: seqlock: fix the wrong documentation of read_seqbegin_or_lock/need_seqretry
	https://lore.kernel.org/all/20250928162029.GA3121@redhat.com/

?

OTOH, if this series is merged, this doc fix should be updated to
mention scoped_seqlock_read()...

Oleg.
Re: [PATCH v2 0/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Peter Zijlstra 4 months ago
On Wed, Oct 08, 2025 at 03:13:02PM +0200, Oleg Nesterov wrote:
> On 10/08, Peter Zijlstra wrote:
> >
> > On Wed, Oct 08, 2025 at 02:30:14PM +0200, Oleg Nesterov wrote:
> > > Only 1/4 was changed according to comments from Linus and Waiman,
> > > but let me resend 2-4 as well.
> >
> > Shall I put this in tip/locking/core once -rc1 happens or so?
> 
> Would be great ;) If nobody objects.
> 
> Can you also take the trivial
> 
> 	[PATCH 1/1] documentation: seqlock: fix the wrong documentation of read_seqbegin_or_lock/need_seqretry
> 	https://lore.kernel.org/all/20250928162029.GA3121@redhat.com/
> 
> ?

I've picked that up too and stuck it in front.

> OTOH, if this series is merged, this doc fix should be updated to
> mention scoped_seqlock_read()...

I see a patch [5/4] in your future ;-)
[PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
The read_seqbegin/need_seqretry/done_seqretry API is cumbersome and
error prone. With the new helper the "typical" code like

	int seq, nextseq;
	unsigned long flags;

	nextseq = 0;
	do {
		seq = nextseq;
		flags = read_seqbegin_or_lock_irqsave(&seqlock, &seq);

		// read-side critical section

		nextseq = 1;
	} while (need_seqretry(&seqlock, seq));
	done_seqretry_irqrestore(&seqlock, seq, flags);

can be rewritten as

	scoped_seqlock_read_irqsave (&seqlock) {
		// read-side critical section
	}

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/seqlock.h | 64 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 5ce48eab7a2a..fa8d73668f4b 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -1209,4 +1209,68 @@ done_seqretry_irqrestore(seqlock_t *lock, int seq, unsigned long flags)
 	if (seq & 1)
 		read_sequnlock_excl_irqrestore(lock, flags);
 }
+
+/* internal helper for scoped_seqlock_read/scoped_seqlock_read_irqsave */
+static inline bool
+__scoped_seqlock_read_retry(seqlock_t *lock, int *seq, unsigned long *flags)
+{
+	bool retry = false;
+
+	if (*seq & 1) {
+		if (flags)
+			read_sequnlock_excl_irqrestore(lock, *flags);
+		else
+			read_sequnlock_excl(lock);
+	} else if (read_seqretry(lock, *seq)) {
+		*seq = 1;
+		retry = true;
+		if (flags)
+			read_seqlock_excl_irqsave(lock, *flags);
+		else
+			read_seqlock_excl(lock);
+	}
+
+	return retry;
+}
+
+#define __scoped_seqlock_read(lock, s)	\
+	for (struct { bool lockless; int seq; }					\
+		s = { .lockless = true, .seq = read_seqbegin(lock) };		\
+	     s.lockless || __scoped_seqlock_read_retry(lock, &s.seq, NULL);	\
+	     s.lockless = false)
+
+#define __scoped_seqlock_read_irqsave(lock, s)	\
+	for (struct { bool lockless; int seq; unsigned long flags; }		\
+		s = { .lockless = true, .seq = read_seqbegin(lock) };		\
+	     s.lockless || __scoped_seqlock_read_retry(lock, &s.seq, &s.flags);	\
+	     s.lockless = false)
+
+/**
+ * scoped_seqlock_read(lock) - execute the read side critical section
+ *                             without manual sequence counter handling
+ *                             or calls to other helpers
+ * @lock: pointer to the seqlock_t protecting the data
+ *
+ * Example:
+ *
+ *	scoped_seqlock_read(&lock) {
+ *		// read-side critical section
+ *	}
+ *
+ * Starts with a lockless pass first. If it fails, restarts the critical
+ * section with the lock held.
+ *
+ * The critical section must not contain control flow that escapes the loop.
+ */
+#define scoped_seqlock_read(lock)	\
+	__scoped_seqlock_read(lock, __UNIQUE_ID(s))
+
+/**
+ * scoped_seqlock_read_irqsave(lock) - same as scoped_seqlock_read() but
+ *                                     disables irqs on a locking pass
+ * @lock: pointer to the seqlock_t protecting the data
+ */
+#define scoped_seqlock_read_irqsave(lock)	\
+	__scoped_seqlock_read_irqsave(lock, __UNIQUE_ID(s))
+
 #endif /* __LINUX_SEQLOCK_H */
-- 
2.25.1.362.g51ebf55
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Linus Torvalds 4 months ago
Bah.

Now I'm reading this again, and going through it more carefully, and
now I hate your helper.

Why?

Because I think that with the new organization, you can do so much better.

So the old code used "seq" as two different values:

 - the first loop around, it's the sequence count for the lockless
attempt and is always even because of how read_seqbegin() works (and
because the original code had that whole 'nextseq' hackery).

 - the second loop around, it's just a flag that now we hold the lock

and you converted the new scheme to that same model.

Which works, but in the new scheme, it's just all *pointless*.

Why do I say that?

Because in the new scheme, you have that explicit "lockless" flag
which is so much more legible in the first place.

You use it in the for-loop, but you don't use it in the helper function.

So all the games with

        if (*seq & 1) {
       ...
                *seq = 1;

are horribly non-intuitive and shouldn't exist because it would be
much more logical to just use the 'lockless' boolean instead.

Those games only make the code harder to understand, and I think they
also make the compiler unable to just unroll the "loop" twice.

So instead of this:

  static inline bool
  __scoped_seqlock_read_retry(seqlock_t *lock, int *seq, unsigned long *flags)
  {
         bool retry = false;

         if (*seq & 1) {
                 if (flags)
                         read_sequnlock_excl_irqrestore(lock, *flags);
                 else
                         read_sequnlock_excl(lock);
         } else if (read_seqretry(lock, *seq)) {
                 *seq = 1;
                 retry = true;
                 if (flags)
                         read_seqlock_excl_irqsave(lock, *flags);
                 else
                         read_seqlock_excl(lock);
         }

         return retry;
  }

I think that you should make that helper function be a macro - so that
it can take the unnamed union type as its argument - and then make it
do something like this instead:

  #define __scoped_seqlock_read_retry(s) ({ s.lockless ?       \
        (s.lockless = false) || read_seqretry(lock, s.seq) ?   \
                ({ read_seqlock_excl(lock); true }) : false :  \
        ({ read_sequnlock_excl(lock); false }) })

Ok, that hasn't even been close to a compiler and may be completely
buggy, but basically the logic is

 - if lockless: set lockless to false (unconditionally to make it real
easy for the compiler), then do the seq_retry, and if that says we
should retry, it does the locking and returns true (so that the loop
is re-done)

 - if not lockless, just unlock the lock and return false to say we're done

And yes, you'd need the irqsafe version too, and yes, you could do it
with the "if (flags)" thing or you could duplicate that macro, but you
could also just pass the lock/unlock sequence as an argument, and now
that macro looks even simpler

  #define __scoped_seqlock_retry(s, lock, unlock) ({ s.lockless ?  \
        (s.lockless = false) || read_seqretry(lock, s.seq) ?       \
                ({ lock ; true; }) : false :                       \
        ({ unlock ; false }) })

although then the users of that macro will be a bit uglier (maybe do
that with another level of macro to make each step simpler).

And I think the compiler will be able to optimize this better, because
the compiler actually can just follow the 's.lockless' tests and turn
all those tests into static jumps when it unrolls that loop twice
(first the lockless = true, then the lockless = false).

Again: I DID NOT FEED THIS TO A COMPILER. It may have syntax errors.
It may have logic errors.

Or it may be me having a complete breakdown and spouting nonsense. But
when re-reading your __scoped_seqlock_read_retry() helper, I really
really hated how it played those games with even/odd *seq, and I think
it's unnecessary.

Am I making sense? Who knows?

             Linus
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Linus Torvalds 4 months ago
On Wed, 8 Oct 2025 at 09:05, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I think that you should make that helper function be a macro - so that
> it can take the unnamed union type as its argument - and then make it
> do something like this instead:
>
>   #define __scoped_seqlock_read_retry(s) ({ s.lockless ?       \
>         (s.lockless = false) || read_seqretry(lock, s.seq) ?   \
>                 ({ read_seqlock_excl(lock); true }) : false :  \
>         ({ read_sequnlock_excl(lock); false }) })

Actually, it shouldn't do that "s.lockless = false" thing at all -
because that's done by the for() loop.

So that was just a thinko from trying to translate the whole "*seq =
1" hackery from the old model.

But that hackery shouldn't even exist since it's all handled naturally
and much more cleanly by the surrounding for-loop.

The only thing that the macro needs to worry about is whether it
should retry after the lockless case (and take the lock if so), or
whether it should just unlock after the locked case.

             Linus
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Linus Torvalds 4 months ago
On Wed, 8 Oct 2025 at 22:31, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But that hackery shouldn't even exist since it's all handled naturally
> and much more cleanly by the surrounding for-loop.

I'm batting zero. I still think it's true that this logic should be
handled by the for-loop, but the "*seq = 1" hackery did actually do
something.

Because it, together with the 'lockless' flag, enumerated three
states, not two. It's just that the first state didn't _do_ anything,
so it was kind of easy to miss it when I read through the code.

So the three states are

 (a) before anything has been done (lockless, seq is even)

 (b) the "we've done the lockless, we need to end the loop or take the
lock" (!lockless, seq is even)

 (c) we've done the locked, we need to unlock and end (seq is odd)

and so relying on just a 'bool lockless' in the loop isn't enough.

I do still think this could be done at the for-loop level, so that the
compiler can statically see the stages and unroll it all to not be a
loop at all.

But I think it means that "lockless" would have to be a "phase", and go 0/1/2.

And I did get it to work that way, and did get gcc to unroll it to
generate really nice code. IOW, I can make this code

        scoped_seqlock_read(lock) {
                asm volatile("TEST");
        }

generate this assembly:

.L12:
        movl    (%rdi), %eax    #* lock, _25
        testb   $1, %al #, _25
        jne     .L14    #,
        TEST
        movl    (%rdi), %edx    # MEM[(const volatile unsigned int
*)lock_6(D)], _70
        cmpl    %edx, %eax      # _70, _25
        jne     .L15    #,
        ret
.L15:
        subq    $8, %rsp        #,
        addq    $4, %rdi        #, _74
        movq    %rdi, (%rsp)    # _74, %sfp
        call    _raw_spin_lock  #
        TEST
        movq    (%rsp), %rdi    # %sfp, _74
        addq    $8, %rsp        #,
        jmp     _raw_spin_unlock        #
.L14:
        pause
        jmp     .L12    #

which is basically optimal, but I have to say, the hoops I had to jump
through to get there make me suspect it's not worth it.

(Note above how there is no sign of 'phase' left in the code, and the
only conditionals are on the actual sequence count state, and the nice
default action of no sequence count problems is all linear
fall-through code).

The irqsave version looks the same, just (obviously) calling a
different set of spinlock/unlock functions and having one extra
register argument for that.

But to get there, I had to use not only linux/unroll.h, but also make
that for-loop explicitly have that s.phase < 3 test just to get the
unrolling to actually trigger.

I'm attaching the test-case I wrote that does this. It's not
*horrific*, but the extra unrolling hackery makes me doubt that it's
really worth it.

So I suspect your patch is fine as-is, and not only was I wrong to
rely on the (insufficient) "lockless" boolean, but the more complete
solution is just too damn ugly.

But hey, I *did* get gcc to generate nice code. So you might look at
my test-case below and decide that maybe there's something to be said
for this.

                 Linus
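
A sketch of the phase-driven condition helper such a loop would need,
reconstructed only from the description above (the attached test case is
not shown here, so the helper name and details are assumptions):

	/*
	 * Assumed reconstruction, not the attached test case: a helper used
	 * as the loop condition, driven by s.phase (0 = start the lockless
	 * pass, 1 = decide whether to retry under the lock, 2 = drop the
	 * lock and stop).
	 */
	static inline bool
	__scoped_seqlock_phase(seqlock_t *lock, unsigned long phase, unsigned long *data)
	{
		switch (phase) {
		case 0:			/* begin the lockless pass */
			*data = read_seqbegin(lock);
			return true;
		case 1:			/* lockless pass done: retry locked? */
			if (!read_seqretry(lock, *data))
				return false;
			read_seqlock_excl(lock);
			return true;
		default:		/* locked pass done: unlock, stop */
			read_sequnlock_excl(lock);
			return false;
		}
	}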
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
On 10/09, Linus Torvalds wrote:
>
> I'm batting zero. I still think it's true that this logic should be
> handled by the for-loop, but the *seq=1" hackery did actually do
> something.

Yes...

> But I think it means that "lockless" would have to be a "phase", and go 0/1/2.

OK, I like your version more than mine. In particular, it puts seq/flags
into the unsigned long data "union"; somehow I didn't think about this
option. Plus, yes, it removes the "make/check seq is odd" games. Nice.

But do we really want to unroll the loop? This code should be optimized
for the likely case when the lockless pass succeeds.

Let me think a bit more before I send V3...


> #define __scoped_seqlock_read(lock, s, cond) \
> 	unrolled_full for (struct { unsigned long phase, data; } s = { 0 }; \
> 		s.phase < 3 && cond(lock, s.phase, &s.data); s.phase++)

I guess in your version "struct" doesn't buy too much, just saves one
__UNIQUE_ID(). But this is a matter of taste.

Thanks!

Oleg.
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Peter Zijlstra 4 months ago
On Thu, Oct 09, 2025 at 04:37:49PM +0200, Oleg Nesterov wrote:

> Let me think a bit more before I send V3...

How do we feel about something a little like so?


diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 5ce48eab7a2a..9786b8d14164 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -1209,4 +1209,85 @@ done_seqretry_irqrestore(seqlock_t *lock, int seq, unsigned long flags)
 	if (seq & 1)
 		read_sequnlock_excl_irqrestore(lock, flags);
 }
+
+enum ss_state {
+	sss_done = 0,
+	sss_locked,
+	sss_lockless,
+};
+
+enum ss_type {
+	ss_lockless,
+	ss_lock,
+	ss_lock_irqsave,
+};
+
+struct ss_tmp {
+	enum ss_type	type;
+	enum ss_state	state;
+	int		seq;
+	unsigned long	flags;
+	spinlock_t	*lock;
+};
+
+static inline void __scoped_seqlock_cleanup(struct ss_tmp *sst)
+{
+	if (!sst->lock)
+		return;
+
+	if (sst->type == ss_lock_irqsave) {
+		spin_unlock_irqrestore(sst->lock, sst->flags);
+		return;
+	}
+
+	spin_unlock(sst->lock);
+}
+
+static inline void
+__scoped_seqlock_next(struct ss_tmp *sst, seqlock_t *lock)
+{
+	switch (sst->state) {
+	case sss_lockless:
+		if (!read_seqretry(lock, sst->seq)) {
+			sst->state = sss_done;
+			return;
+		}
+
+		switch (sst->type) {
+		case ss_lock:
+			sst->lock = &lock->lock;
+			spin_lock(sst->lock);
+			sst->state = sss_locked;
+			return;
+
+		case ss_lock_irqsave:
+			sst->lock = &lock->lock;
+			spin_lock_irqsave(sst->lock, sst->flags);
+			sst->state = sss_locked;
+			return;
+
+		case ss_lockless:
+			sst->seq = read_seqbegin(lock);
+			return;
+		}
+
+	case sss_locked:
+		sst->state = sss_done;
+		return;
+
+	case sss_done:
+		BUG();
+	}
+}
+
+#define __scoped_seqlock_read(_seqlock, _type, _s)			\
+	for (struct ss_tmp _s __cleanup(__scoped_seqlock_cleanup) = {	\
+		.type = _type, .state = sss_lockless,			\
+		.seq = read_seqbegin(_seqlock), .lock = NULL };		\
+	     _s.state != sss_done;					\
+	     __scoped_seqlock_next(&_s, _seqlock))
+
+#define scoped_seqlock_read(_seqlock, _type)				\
+	__scoped_seqlock_read(_seqlock, _type, __UNIQUE_ID(seqlock))
+
 #endif /* __LINUX_SEQLOCK_H */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 7097de2c8cda..d2b3f987c888 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -313,10 +313,8 @@ static u64 read_sum_exec_runtime(struct task_struct *t)
 void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 {
 	struct signal_struct *sig = tsk->signal;
-	u64 utime, stime;
 	struct task_struct *t;
-	unsigned int seq, nextseq;
-	unsigned long flags;
+	u64 utime, stime;
 
 	/*
 	 * Update current task runtime to account pending time since last
@@ -329,27 +327,19 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	if (same_thread_group(current, tsk))
 		(void) task_sched_runtime(current);
 
-	rcu_read_lock();
-	/* Attempt a lockless read on the first round. */
-	nextseq = 0;
-	do {
-		seq = nextseq;
-		flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
+	guard(rcu)();
+	scoped_seqlock_read(&sig->stats_lock, ss_lock_irqsave) {
 		times->utime = sig->utime;
 		times->stime = sig->stime;
 		times->sum_exec_runtime = sig->sum_sched_runtime;
 
-		for_each_thread(tsk, t) {
+		__for_each_thread(sig, t) {
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
 			times->sum_exec_runtime += read_sum_exec_runtime(t);
 		}
-		/* If lockless access failed, take the lock. */
-		nextseq = 1;
-	} while (need_seqretry(&sig->stats_lock, seq));
-	done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
-	rcu_read_unlock();
+	}
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Peter Zijlstra 4 months ago
On Thu, Oct 09, 2025 at 09:50:24PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 09, 2025 at 04:37:49PM +0200, Oleg Nesterov wrote:
> 
> > Let me think a bit more before I send V3...
> 
> How do we feel about something a little like so?

Slightly nicer version that's actually compiled :-)

---
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 5ce48eab7a2a..7273fddc19a2 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -1209,4 +1209,87 @@ done_seqretry_irqrestore(seqlock_t *lock, int seq, unsigned long flags)
 	if (seq & 1)
 		read_sequnlock_excl_irqrestore(lock, flags);
 }
+
+enum ss_state {
+	ss_done = 0,
+	ss_lock,
+	ss_lock_irqsave,
+	ss_lockless,
+};
+
+struct ss_tmp {
+	enum ss_state	state;
+	int		seq;
+	unsigned long	flags;
+	spinlock_t	*lock;
+};
+
+static inline void __scoped_seqlock_cleanup(struct ss_tmp *sst)
+{
+	if (!sst->lock)
+		return;
+
+	if (sst->state == ss_lock_irqsave) {
+		spin_unlock_irqrestore(sst->lock, sst->flags);
+		return;
+	}
+
+	spin_unlock(sst->lock);
+}
+
+extern void __scoped_seqlock_fail(void);
+
+static inline void
+__scoped_seqlock_next(struct ss_tmp *sst, seqlock_t *lock, enum ss_state target)
+{
+	switch (sst->state) {
+	case ss_lockless:
+		if (!read_seqretry(lock, sst->seq)) {
+			sst->state = ss_done;
+			return;
+		}
+
+		switch (target) {
+		case ss_done:
+			__scoped_seqlock_fail();
+			return;
+
+		case ss_lock:
+			sst->lock = &lock->lock;
+			spin_lock(sst->lock);
+			sst->state = ss_lock;
+			return;
+
+		case ss_lock_irqsave:
+			sst->lock = &lock->lock;
+			spin_lock_irqsave(sst->lock, sst->flags);
+		sst->state = ss_lock_irqsave;
+			return;
+
+		case ss_lockless:
+			sst->seq = read_seqbegin(lock);
+			return;
+		}
+		BUG();
+
+	case ss_lock:
+	case ss_lock_irqsave:
+		sst->state = ss_done;
+		return;
+
+	case ss_done:
+		BUG();
+	}
+}
+
+#define __scoped_seqlock_read(_seqlock, _target, _s)			\
+	for (struct ss_tmp _s __cleanup(__scoped_seqlock_cleanup) = {	\
+		.state = ss_lockless, .seq = read_seqbegin(_seqlock),	\
+		.lock = NULL };						\
+	     _s.state != ss_done;					\
+	     __scoped_seqlock_next(&_s, _seqlock, _target))
+
+#define scoped_seqlock_read(_seqlock, _target)				\
+	__scoped_seqlock_read(_seqlock, _target, __UNIQUE_ID(seqlock))
+
 #endif /* __LINUX_SEQLOCK_H */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 7097de2c8cda..d2b3f987c888 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -313,10 +313,8 @@ static u64 read_sum_exec_runtime(struct task_struct *t)
 void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 {
 	struct signal_struct *sig = tsk->signal;
-	u64 utime, stime;
 	struct task_struct *t;
-	unsigned int seq, nextseq;
-	unsigned long flags;
+	u64 utime, stime;
 
 	/*
 	 * Update current task runtime to account pending time since last
@@ -329,27 +327,19 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	if (same_thread_group(current, tsk))
 		(void) task_sched_runtime(current);
 
-	rcu_read_lock();
-	/* Attempt a lockless read on the first round. */
-	nextseq = 0;
-	do {
-		seq = nextseq;
-		flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
+	guard(rcu)();
+	scoped_seqlock_read(&sig->stats_lock, ss_lock_irqsave) {
 		times->utime = sig->utime;
 		times->stime = sig->stime;
 		times->sum_exec_runtime = sig->sum_sched_runtime;
 
-		for_each_thread(tsk, t) {
+		__for_each_thread(sig, t) {
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
 			times->sum_exec_runtime += read_sum_exec_runtime(t);
 		}
-		/* If lockless access failed, take the lock. */
-		nextseq = 1;
-	} while (need_seqretry(&sig->stats_lock, seq));
-	done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
-	rcu_read_unlock();
+	}
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Linus Torvalds 4 months ago
On Thu, 9 Oct 2025 at 13:12, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Slightly nicer version that's actually compiled :-)

I assume that "target of ss_lockless" is an intentional extension to
just make the loop never take a lock at all?

I do like it, but I think you'll find that having a separate 'seq' and
'flags' like this:

> +struct ss_tmp {
> +       enum ss_state   state;
> +       int             seq;
> +       unsigned long   flags;
> +       spinlock_t      *lock;
> +};

makes it unnecessarily waste a register.

You never need both seq and flags at the same time, since if you take
the spinlock the sequence number is pointless.

So please make that a union, and I think it will help avoid wasting a
register in the loop.

Other than that I like it. Except that "BUG()" really bugs me. It will
generate horrendous code for no reason and we *really* shouldn't add
BUG statements anyway.

Either that inline function is fine, or it isn't. Don't make it
generate stupid code for "I'm not fine" that will also be a huge pain
to debug because if that code is buggy it will presumably trigger in
context where the machine will be dead, dead, dead.

             Linus
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Peter Zijlstra 4 months ago
On Thu, Oct 09, 2025 at 01:24:51PM -0700, Linus Torvalds wrote:
> On Thu, 9 Oct 2025 at 13:12, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Slightly nicer version that's actually compiled :-)
> 
> I assume that "target of ss_lockless" is an intentional extension to
> just make the loop never take a lock at all?

Yep. Almost came for free, so might as well do it.

> I do like it, but I think you'll find that having a separate 'seq' and
> 'flags' like this:
> 
> > +struct ss_tmp {
> > +       enum ss_state   state;
> > +       int             seq;
> > +       unsigned long   flags;
> > +       spinlock_t      *lock;
> > +};
> 
> makes it unnecessarily waste a register.
> 
> You never need both seq and flags at the same time, since if you take
> the spinlock the sequence number is pointless.
> 
> So please make that a union, and I think it will help avoid wasting a
> register in the loop.

Sure; otoh compiler should be able to tell the same using liveness
analysis I suppose, but perhaps they're not *that* clever.

> Other than that I like it. Except that "BUG()" really bugs me. It will
> generate horrendous code for no reason and we *really* shouldn't add
> BUG statements anyway.
> 
> Either that inline function is fine, or it isn't. Don't make it
> generate stupid code for "I'm not fine" that will also be a huge pain
> to debug because if that code is buggy it will presumably trigger in
> context where the machine will be dead, dead, dead.

So I thought they were fine; we handle all the enum cases with 'return'
so it's impossible to not exit the switch(), but the silly compiler was
complaining about possible fall-through, so clearly it was getting
confused.

The ss_done case should never trip either -- perhaps I should make that
one of those __scoped_seqlock_fail like things as well.

Anyway, I'll play around with it some (and look at actual code-gen)
tomorrow.
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Peter Zijlstra 4 months ago
On Fri, Oct 10, 2025 at 12:12:42AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 09, 2025 at 01:24:51PM -0700, Linus Torvalds wrote:
> > On Thu, 9 Oct 2025 at 13:12, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > Slightly nicer version that's actually compiled :-)
> > 
> > I assume that "target of ss_lockless" is an intentional extension to
> > just make the loop never take a lock at all?
> 
> Yep. Almost came for free, so might as well do it.
> 
> > I do like it, but I think you'll find that having a separate 'seq' and
> > 'flags' like this:
> > 
> > > +struct ss_tmp {
> > > +       enum ss_state   state;
> > > +       int             seq;
> > > +       unsigned long   flags;
> > > +       spinlock_t      *lock;
> > > +};
> > 
> > makes it unnecessarily waste a register.
> > 
> > You never need both seq and flags at the same time, since if you take
> > the spinlock the sequence number is pointless.
> > 
> > So please make that a union, and I think it will help avoid wasting a
> > register in the loop.
> 
> Sure; otoh compiler should be able to tell the same using liveness
> analysis I suppose, but perhaps they're not *that* clever.

The moment I use a union gcc-14 creates the whole structure on the
stack, while without the union it mostly manages to keep the things
in registers without too much spilling.

Also, there's a rather silly bug in the code: the __cleanup function
looks at ->state to determine which spin_unlock variant to pick, but
->state will be ss_done. Additionally, when someone does 'break' or
'goto' out of the scope the state isn't reliable anyway.

The sanest code gen happens when I do:

struct ss_tmp {
	enum ss_state	state;
	int		seq;
	unsigned long	flags;
	spinlock_t	*lock;
	spinlock_t	*lock_irqsave;
}

static inline void __scoped_seqlock_cleanup(struct ss_tmp *sst)
{
	if (sst->lock)
		spin_unlock(sst->lock);
	if (sst->lock_irqsave)
		spin_unlock_irqrestore(sst->lock_irqsave, sst->flags);
}

liveness analysis seems to work well, and it is able to determine which
fields are active and only track those in registers and utterly discard
the rest. Specifically in the thread_group_cputime code, there are only
_raw_spin_lock_irqsave and _raw_spin_unlock_irqrestore calls left, all
the other stuff is just optimized out.

I'll look at what clang does ... but that's really tomorrow, or rather,
somewhat later this morning :/
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Linus Torvalds 4 months ago
On Thu, 9 Oct 2025 at 16:20, Peter Zijlstra <peterz@infradead.org> wrote:
>
> The moment I use a union gcc-14 creates the whole structure on the
> stack, while without the union it mostly manages to keep the things
> in registers without too much spilling.

Ouch.

Do the union manually like I did by just using a "unsigned long" for
both 'seq' and 'flags'.

I bet that works fine - it seemed to work well for my - admittedly
somewhat different - version.

              Linus
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Linus Torvalds 4 months ago
On Thu, 9 Oct 2025 at 15:12, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Sure; otoh compiler should be able to tell the same using liveness
> analysis I suppose, but perhaps they're not *that* clever.

They are that clever, but only if they end up unrolling the loop
statically. If the loop remains a loop, the two variables end up live
in the same code.

And while a compiler _could_ in theory still see that they aren't
actually live in the same _iteration_, I don't think any compiler
actually ends up being that clever in practice.

So making it a union then hopefully gets the compiler to basically use
that explicit information.

> So I thought they were fine; we handle all the enum cases with 'return'
> so its impossible to not exit the switch() but the silly compiler was
> complaining about possible fall-through, so clearly it was getting
> confused.

Yeah, I found the same thing with the 0/1/2 approach - the compiler
wouldn't realize that the range was limited until I added a very
explicit limit check that "shouldn't matter", but did.

This might obviously end up depending on compiler version and other
random things, but in general the whole value range analysis tends to
be a pretty fragile thing.

In practice, compilers tend to be good at doing value range analysis
if they see particular patterns (like initializing it to some value,
always incrementing it by one, and comparing against another value).

But when it's written more like a state machine like this, it's
clearly very hit and miss.

             Linus
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Peter Zijlstra 4 months ago
On Thu, Oct 09, 2025 at 03:55:15PM -0700, Linus Torvalds wrote:
> On Thu, 9 Oct 2025 at 15:12, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Sure; otoh compiler should be able to tell the same using liveness
> > analysis I suppose, but perhaps they're not *that* clever.
> 
> They are that clever, but only if they end up unrolling the loop
> statically. If the loop remains a loop, the two variables end up live
> in the same code.
> 
> And while a compiler _could_ in theory still see that they aren't
> actually live in the same _iteration_, I don't think any compiler
> actually ends up being that clever in practice.
> 
> So making it a union then hopefully gets the compiler to basically use
> that explicit information.

Right, so I had to use -Os to not make it unroll the thing, but then
indeed, sharing the variable helps it.

> > So I thought they were fine; we handle all the enum cases with 'return'
> > so its impossible to not exit the switch() but the silly compiler was
> > complaining about possible fall-through, so clearly it was getting
> > confused.
> 
> Yeah, I found the same thing with the 0/1/2 approach - the compiler
> wouldn't realize that the range was limited until I added a very
> explicit limit check that "shouldn't matter", but did.
> 
> This might obviously end up depending on compiler version and other
> random things, but in general the whole value range analysis tends to
> be a pretty fragile thing.
> 
> In practice, compilers tend to be good at doing value range analysis
> if they see particular patterns (like initializing it to some value,
> always incrementing it by one, and comparing against another value).
> 
> But when it's written more like a state machine like this, it's
> clearly very hit and miss.

I reordered the code, it is happier now.

Anyway, the below seems to generate decent code for
{-O2,-Os}x{gcc-14,clang-22}. Yay for optimizing compilers I suppose :-)

---
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 5ce48eab7a2a..45fab026f7d6 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -1209,4 +1209,83 @@ done_seqretry_irqrestore(seqlock_t *lock, int seq, unsigned long flags)
 	if (seq & 1)
 		read_sequnlock_excl_irqrestore(lock, flags);
 }
+
+enum ss_state {
+	ss_done = 0,
+	ss_lock,
+	ss_lock_irqsave,
+	ss_lockless,
+};
+
+struct ss_tmp {
+	enum ss_state	state;
+	unsigned long	data;
+	spinlock_t	*lock;
+	spinlock_t	*lock_irqsave;
+};
+
+static inline void __scoped_seqlock_cleanup(struct ss_tmp *sst)
+{
+	if (sst->lock)
+		spin_unlock(sst->lock);
+	if (sst->lock_irqsave)
+		spin_unlock_irqrestore(sst->lock_irqsave, sst->data);
+}
+
+extern void __scoped_seqlock_invalid_target(void);
+extern void __scoped_seqlock_bug(void);
+
+static inline void
+__scoped_seqlock_next(struct ss_tmp *sst, seqlock_t *lock, enum ss_state target)
+{
+	switch (sst->state) {
+	case ss_done:
+		__scoped_seqlock_bug();
+		return;
+
+	case ss_lock:
+	case ss_lock_irqsave:
+		sst->state = ss_done;
+		return;
+
+	case ss_lockless:
+		if (!read_seqretry(lock, sst->data)) {
+			sst->state = ss_done;
+			return;
+		}
+		break;
+	}
+
+	switch (target) {
+	case ss_done:
+		__scoped_seqlock_invalid_target();
+		return;
+
+	case ss_lock:
+		sst->lock = &lock->lock;
+		spin_lock(sst->lock);
+		sst->state = ss_lock;
+		return;
+
+	case ss_lock_irqsave:
+		sst->lock_irqsave = &lock->lock;
+		spin_lock_irqsave(sst->lock_irqsave, sst->data);
+		sst->state = ss_lock_irqsave;
+		return;
+
+	case ss_lockless:
+		sst->data = read_seqbegin(lock);
+		return;
+	}
+}
+
+#define __scoped_seqlock_read(_seqlock, _target, _s)			\
+	for (struct ss_tmp _s __cleanup(__scoped_seqlock_cleanup) =	\
+	     { .state = ss_lockless, .data = read_seqbegin(_seqlock) };	\
+	     _s.state != ss_done;					\
+	     __scoped_seqlock_next(&_s, _seqlock, _target))
+
+#define scoped_seqlock_read(_seqlock, _target)				\
+	__scoped_seqlock_read(_seqlock, _target, __UNIQUE_ID(seqlock))
+
 #endif /* __LINUX_SEQLOCK_H */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 7097de2c8cda..d2b3f987c888 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -313,10 +313,8 @@ static u64 read_sum_exec_runtime(struct task_struct *t)
 void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 {
 	struct signal_struct *sig = tsk->signal;
-	u64 utime, stime;
 	struct task_struct *t;
-	unsigned int seq, nextseq;
-	unsigned long flags;
+	u64 utime, stime;
 
 	/*
 	 * Update current task runtime to account pending time since last
@@ -329,27 +327,19 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	if (same_thread_group(current, tsk))
 		(void) task_sched_runtime(current);
 
-	rcu_read_lock();
-	/* Attempt a lockless read on the first round. */
-	nextseq = 0;
-	do {
-		seq = nextseq;
-		flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
+	guard(rcu)();
+	scoped_seqlock_read(&sig->stats_lock, ss_lock_irqsave) {
 		times->utime = sig->utime;
 		times->stime = sig->stime;
 		times->sum_exec_runtime = sig->sum_sched_runtime;
 
-		for_each_thread(tsk, t) {
+		__for_each_thread(sig, t) {
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
 			times->sum_exec_runtime += read_sum_exec_runtime(t);
 		}
-		/* If lockless access failed, take the lock. */
-		nextseq = 1;
-	} while (need_seqretry(&sig->stats_lock, seq));
-	done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
-	rcu_read_unlock();
+	}
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
On 10/10, Peter Zijlstra wrote:
>
> I reordered the code, it is happier now.
>
> Anyway, the below seems to generate decent code for
> {-O2,-Os}x{gcc-14,clang-22}. Yay for optimizing compilers I suppose :-)

Another approach which looks better than mine ;)

Linus's version is simpler, but yours can handle break/return and
the "only lockless" case, good.

I leave this patch to you and Linus, he seems to like your code too.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>


But... perhaps we should not "export" the _target names and instead
add the additional defines, something like

	scoped_seqlock_read()
	scoped_seqlock_read_or_lock()
	scoped_seqlock_read_or_lock_irqsave()

?
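
A minimal sketch of what such wrappers could look like on top of the
__scoped_seqlock_read() macro from the patch above (hypothetical names
and shapes, not from a posted patch):

	/*
	 * Hypothetical wrappers hiding the ss_* target names; built on the
	 * posted __scoped_seqlock_read() macro.  The single-argument
	 * scoped_seqlock_read() here would replace the two-argument form.
	 */
	#define scoped_seqlock_read(_seqlock)				\
		__scoped_seqlock_read(_seqlock, ss_lockless, __UNIQUE_ID(seqlock))

	#define scoped_seqlock_read_or_lock(_seqlock)			\
		__scoped_seqlock_read(_seqlock, ss_lock, __UNIQUE_ID(seqlock))

	#define scoped_seqlock_read_or_lock_irqsave(_seqlock)		\
		__scoped_seqlock_read(_seqlock, ss_lock_irqsave, __UNIQUE_ID(seqlock))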

Oleg.
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Linus Torvalds 3 months, 4 weeks ago
On Fri, 10 Oct 2025 at 05:33, Oleg Nesterov <oleg@redhat.com> wrote:
>
> I leave this patch to you and Linus, he seems to like your code too.

Yeah, I think PeterZ's is the best of the bunch, I like how it looks
in that latest iteration.

           Linus
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
On 10/10, Oleg Nesterov wrote:
>
> On 10/10, Peter Zijlstra wrote:
> >
> > I reordered the code, it is happier now.
> >
> > Anyway, the below seems to generate decent code for
> > {-O2,-Os}x{gcc-14,clang-22}. Yay for optimizing compilers I suppose :-)
>
> Another approach which looks better than mine ;)
>
> Linus's version is simpler, but yours can handle break/return and
> the "only lockless" case, good.
>
> I leave this patch to you and Linus, he seems to like your code too.
>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
>
>
> But... perhaps we should not "export" the _target names and instead
> add the additional defines, something like
>
> 	scoped_seqlock_read()
> 	scoped_seqlock_read_or_lock()
> 	scoped_seqlock_read_or_lock_irqsave()
>
> ?

And... perhaps we can simplify this code a little bit? I mean

	enum ss_state {
		ss_lockless	= 0,
		ss_lock		= 1,
		ss_lock_irqsave	= 2,
		ss_done		= 4,
	};

	struct ss_tmp {
		enum ss_state	state;
		unsigned long	data;
		seqlock_t	*lock;
	};

	static inline void __scoped_seqlock_cleanup(struct ss_tmp *sst)
	{
		if (sst->state & ss_lock)
			spin_unlock(&sst->lock.lock);
		if (sst->state & ss_lock_irqsave)
			spin_unlock_irqrestore(&sst->lock.lock, sst->data);
	}

	static inline void
	__scoped_seqlock_next(struct ss_tmp *sst, enum ss_state target)
	{
		switch (sst->state) {
		case ss_lock:
		case ss_lock_irqsave:
			sst->state |= ss_done;
			return;

		case ss_lockless:
			if (!read_seqretry(sst->lock, sst->data)) {
				sst->state = ss_done;
				return;
			}
			break;
		}

		switch (target) {
		case ss_lock:
			spin_lock(&sst->lock.lock);
			sst->state = ss_lock;
			return;

		case ss_lock_irqsave:
			spin_lock_irqsave(&sst->lock.lock, sst->data);
			sst->state = ss_lock_irqsave;
			return;

		case ss_lockless:
			sst->data = read_seqbegin(sst->lock);
			return;
		}
	}

	#define __scoped_seqlock_read(_seqlock, _target, _s)			\
		for (struct ss_tmp _s __cleanup(__scoped_seqlock_cleanup) =				\
		     { .state = ss_lockless, .data = read_seqbegin(_seqlock), .lock = __seqlock };	\
		     !(_s.state & ss_done);								\
		     __scoped_seqlock_next(&_s, _target))


(I removed __scoped_seqlock_invalid_target/__scoped_seqlock_bug to lessen the code).

Not sure this makes sense. Plus I didn't even try to compile this code and I have
no idea how this change can affect the code generation. But let me ask anyway...

Oleg.
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Peter Zijlstra 3 months, 3 weeks ago
On Fri, Oct 10, 2025 at 03:14:39PM +0200, Oleg Nesterov wrote:
> On 10/10, Oleg Nesterov wrote:
> >
> > On 10/10, Peter Zijlstra wrote:
> > >
> > > I reordered the code, it is happier now.
> > >
> > > Anyway, the below seems to generate decent code for
> > > {-O2,-Os}x{gcc-14,clang-22}. Yay for optimizing compilers I suppose :-)
> >
> > Another approach which looks better than mine ;)
> >
> > Linus's version is simpler, but yours can handle break/return and
> > the "only lockless" case, good.
> >
> > I leave this patch to you and Linus, he seems to like your code too.
> >
> > Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> >
> >
> > But... perhaps we should not "export" the _target names and instead
> > add the additional defines, something like
> >
> > 	scoped_seqlock_read()
> > 	scoped_seqlock_read_or_lock()
> > 	scoped_seqlock_read_or_lock_irqsave()
> >
> > ?
> 
> And... perhaps we can simplify this code a little bit? I mean
> 
> 	enum ss_state {
> 		ss_lockless	= 0,
> 		ss_lock		= 1,
> 		ss_lock_irqsave	= 2,
> 		ss_done		= 4,
> 	};
> 
> 	struct ss_tmp {
> 		enum ss_state	state;
> 		unsigned long	data;
> 		seqlock_t	*lock;
> 	};
> 
> 	static inline void __scoped_seqlock_cleanup(struct ss_tmp *sst)
> 	{
> 		if (sst->state & ss_lock)
> 			spin_unlock(&sst->lock.lock);
> 		if (sst->state & ss_lock_irqsave)
> 			spin_unlock_irqrestore(&sst->lock.lock, sst->data);
> 	}
> 
> 	static inline void
> 	__scoped_seqlock_next(struct ss_tmp *sst, enum ss_state target)
> 	{
> 		switch (sst->state) {
> 		case ss_lock:
> 		case ss_lock_irqsave:
> 			sst->state |= ss_done;
> 			return;
> 
> 		case ss_lockless:
> 			if (!read_seqretry(sst->lock, sst->data)) {
> 				sst->state = ss_done;
> 				return;
> 			}
> 			break;
> 		}
> 
> 		switch (target) {
> 		case ss_lock:
> 			spin_lock(&sst->lock.lock);
> 			sst->state = ss_lock;
> 			return;
> 
> 		case ss_lock_irqsave:
> 			spin_lock_irqsave(&sst->lock.lock, sst->data);
> 			sst->state = ss_lock_irqsave;
> 			return;
> 
> 		case ss_lockless:
> 			sst->data = read_seqbegin(sst->lock);
> 			return;
> 		}
> 	}
> 
> 	#define __scoped_seqlock_read(_seqlock, _target, _s)			\
> 		for (struct ss_tmp _s __cleanup(__scoped_seqlock_cleanup) =				\
> 		     { .state = ss_lockless, .data = read_seqbegin(_seqlock), .lock = __seqlock };	\
> 		     !(_s.state & ss_done);								\
> 		     __scoped_seqlock_next(&_s, _target))
> 
> 
> (I removed __scoped_seqlock_invalid_target/__scoped_seqlock_bug to lessen the code).
> 
> Not sure this makes sense. Plus I didn't even try to compile this code and I have
> no idea how this change can affect the code generation. But let me ask anyway...

So GCC is clever enough to see through this scheme, but Clang gets
confused and generates worse code. Specifically it emits the whole
__scoped_seqlock_cleanup() sequence, testing both bits and both unlock
options.

Where previously it would only have to discover which field was written
to and could delete all code for the unwritten field, it now has to
track the state and discover ss_lock|ss_done is not possible while
ss_lock_irqsave|ss_done is.

So while that additional pointer might seem wasteful, it actually makes
the state tracking easier and allows the compiler to more easily throw
away stuff.
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 3 months, 3 weeks ago
On 10/13, Peter Zijlstra wrote:
>
> On Fri, Oct 10, 2025 at 03:14:39PM +0200, Oleg Nesterov wrote:
> > 	static inline void __scoped_seqlock_cleanup(struct ss_tmp *sst)
> > 	{
> > 		if (sst->state & ss_lock)
> > 			spin_unlock(&sst->lock.lock);
> > 		if (sst->state & ss_lock_irqsave)
> > 			spin_unlock_irqrestore(&sst->lock.lock, sst->data);
> > 	}
> > 
> > 	static inline void
> > 	__scoped_seqlock_next(struct ss_tmp *sst, enum ss_state target)
> > 	{
> > 		switch (sst->state) {
> > 		case ss_lock:
> > 		case ss_lock_irqsave:
> > 			sst->state |= ss_done;
> 
> So GCC is clever enough to see through this scheme, but Clang gets
> confused and generates worse code. Specifically it emits the whole
> __scoped_seqlock_cleanup() sequence, testing both bits and both unlock
> options.

OK, thanks and sorry for the noise then.

> So while that additional pointer might seem wasteful, it actually makes
> the state tracking easier and allows the compiler to more easily throw
> away stuff.

Great ;)

Oleg.
[tip: locking/core] seqlock: Introduce scoped_seqlock_read()
Posted by tip-bot2 for Peter Zijlstra 3 months, 2 weeks ago
The following commit has been merged into the locking/core branch of tip:

Commit-ID:     cc39f3872c0865bef992b713338df369554fa9e0
Gitweb:        https://git.kernel.org/tip/cc39f3872c0865bef992b713338df369554fa9e0
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Thu, 09 Oct 2025 22:11:54 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 21 Oct 2025 12:31:57 +02:00

seqlock: Introduce scoped_seqlock_read()

The read_seqbegin/need_seqretry/done_seqretry API is cumbersome and
error prone. With the new helper the "typical" code like

	int seq, nextseq;
	unsigned long flags;

	nextseq = 0;
	do {
		seq = nextseq;
		flags = read_seqbegin_or_lock_irqsave(&seqlock, &seq);

		// read-side critical section

		nextseq = 1;
	} while (need_seqretry(&seqlock, seq));
	done_seqretry_irqrestore(&seqlock, seq, flags);

can be rewritten as

	scoped_seqlock_read (&seqlock, ss_lock_irqsave) {
		// read-side critical section
	}

Original idea by Oleg Nesterov; with contributions from Linus.

Originally-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/seqlock.h | 111 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 111 insertions(+)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 5ce48ea..b7bcc41 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -1209,4 +1209,115 @@ done_seqretry_irqrestore(seqlock_t *lock, int seq, unsigned long flags)
 	if (seq & 1)
 		read_sequnlock_excl_irqrestore(lock, flags);
 }
+
+enum ss_state {
+	ss_done = 0,
+	ss_lock,
+	ss_lock_irqsave,
+	ss_lockless,
+};
+
+struct ss_tmp {
+	enum ss_state	state;
+	unsigned long	data;
+	spinlock_t	*lock;
+	spinlock_t	*lock_irqsave;
+};
+
+static inline void __scoped_seqlock_cleanup(struct ss_tmp *sst)
+{
+	if (sst->lock)
+		spin_unlock(sst->lock);
+	if (sst->lock_irqsave)
+		spin_unlock_irqrestore(sst->lock_irqsave, sst->data);
+}
+
+extern void __scoped_seqlock_invalid_target(void);
+
+#if defined(CONFIG_CC_IS_GCC) && CONFIG_GCC_VERSION < 90000
+/*
+ * For some reason some GCC-8 architectures (nios2, alpha) have trouble
+ * determining that the ss_done state is impossible in __scoped_seqlock_next()
+ * below.
+ */
+static inline void __scoped_seqlock_bug(void) { }
+#else
+/*
+ * Canary for compiler optimization -- if the compiler doesn't realize this is
+ * an impossible state, it very likely generates sub-optimal code here.
+ */
+extern void __scoped_seqlock_bug(void);
+#endif
+
+static inline void
+__scoped_seqlock_next(struct ss_tmp *sst, seqlock_t *lock, enum ss_state target)
+{
+	switch (sst->state) {
+	case ss_done:
+		__scoped_seqlock_bug();
+		return;
+
+	case ss_lock:
+	case ss_lock_irqsave:
+		sst->state = ss_done;
+		return;
+
+	case ss_lockless:
+		if (!read_seqretry(lock, sst->data)) {
+			sst->state = ss_done;
+			return;
+		}
+		break;
+	}
+
+	switch (target) {
+	case ss_done:
+		__scoped_seqlock_invalid_target();
+		return;
+
+	case ss_lock:
+		sst->lock = &lock->lock;
+		spin_lock(sst->lock);
+		sst->state = ss_lock;
+		return;
+
+	case ss_lock_irqsave:
+		sst->lock_irqsave = &lock->lock;
+		spin_lock_irqsave(sst->lock_irqsave, sst->data);
+		sst->state = ss_lock_irqsave;
+		return;
+
+	case ss_lockless:
+		sst->data = read_seqbegin(lock);
+		return;
+	}
+}
+
+#define __scoped_seqlock_read(_seqlock, _target, _s)			\
+	for (struct ss_tmp _s __cleanup(__scoped_seqlock_cleanup) =	\
+	     { .state = ss_lockless, .data = read_seqbegin(_seqlock) };	\
+	     _s.state != ss_done;					\
+	     __scoped_seqlock_next(&_s, _seqlock, _target))
+
+/**
+ * scoped_seqlock_read (lock, ss_state) - execute the read side critical
+ *                                        section without manual sequence
+ *                                        counter handling or calls to other
+ *                                        helpers
+ * @lock: pointer to seqlock_t protecting the data
+ * @ss_state: one of {ss_lock, ss_lock_irqsave, ss_lockless} indicating
+ *            the type of critical read section
+ *
+ * Example:
+ *
+ *     scoped_seqlock_read (&lock, ss_lock) {
+ *         // read-side critical section
+ *     }
+ *
+ * Starts with a lockless pass first. If it fails, restarts the critical
+ * section with the lock held.
+ */
+#define scoped_seqlock_read(_seqlock, _target)				\
+	__scoped_seqlock_read(_seqlock, _target, __UNIQUE_ID(seqlock))
+
 #endif /* __LINUX_SEQLOCK_H */
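
For reference, the ss_lockless target turns this into a pure retry loop
that never takes the spinlock; a hypothetical caller (illustrative sketch,
not part of the commit) could look like:

	/* Hypothetical example: never falls back to the spinlock, just retries. */
	static u64 read_group_runtime(struct signal_struct *sig)
	{
		u64 sum = 0;

		scoped_seqlock_read(&sig->stats_lock, ss_lockless) {
			sum = sig->sum_sched_runtime;
		}
		return sum;
	}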
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Linus Torvalds 4 months ago
On Thu, 9 Oct 2025 at 07:39, Oleg Nesterov <oleg@redhat.com> wrote:
>
> But do we really want to unroll the loop? This code should be optimized
> for the likely case when the lockless pass succeeds.

Dropping the unroll magic certainly makes the patch more palatable,
and maybe it doesn't matter that much.

It really was mainly me going "this should be easy for the compiler to
see the flow statically", and the need to keep two state variables
annoyed me.

But I guess many of the users are complicated enough that you actually
don't want to have two copies of the inner body. If they weren't
complicated, they'd just use the simpler

   do { } while (read_seqcount_retry(...));

model.

So you are right, I was probably chasing that optimal code generation
for no real reason.

            Linus
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
On 10/08, Linus Torvalds wrote:
>
> Bah.
>
> Now I'm reading this again, and going through it more carefully, and
> now I hate your helper.

OK ;)

let me read/check your email and reply tomorrow, after sleep.

Oleg.
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Peter Zijlstra 4 months ago
On Wed, Oct 08, 2025 at 02:30:45PM +0200, Oleg Nesterov wrote:
> The read_seqbegin/need_seqretry/done_seqretry API is cumbersome and
> error prone. With the new helper the "typical" code like
> 
> 	int seq, nextseq;
> 	unsigned long flags;
> 
> 	nextseq = 0;
> 	do {
> 		seq = nextseq;
> 		flags = read_seqbegin_or_lock_irqsave(&seqlock, &seq);
> 
> 		// read-side critical section
> 
> 		nextseq = 1;
> 	} while (need_seqretry(&seqlock, seq));
> 	done_seqretry_irqrestore(&seqlock, seq, flags);
> 
> can be rewritten as
> 
> 	scoped_seqlock_read_irqsave (&seqlock) {
> 		// read-side critical section
> 	}
> 

Hmm, on first reading I was expecting that to be:

	do {
		seq = read_seqbegin(&seqlock);

		// read-side section

	} while (read_seqretry(&seqlock, seq));

for lack of that _or_lock() wording, but I suppose we can make that
something like:

 	scoped_seqbegin_read (&seqlock) {
		// read-side section
	}

which is distinctive enough.
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
On 10/08, Peter Zijlstra wrote:
>
> On Wed, Oct 08, 2025 at 02:30:45PM +0200, Oleg Nesterov wrote:
> >
> > 	scoped_seqlock_read_irqsave (&seqlock) {
> > 		// read-side critical section
> > 	}
> >
>
> Hmm, on first reading I was expecting that to be:
>
> 	do {
> 		seq = read_seqbegin(&seqlock);
>
> 		// read-side section
>
> 	} while (read_seqretry(&seqlock, seq));
>
> for lack of that _or_lock() wording, but I suppose we can make that
> something like:
>
>  	scoped_seqbegin_read (&seqlock) {

I was thinking about scoped_seqcount_read() but I agree with any naming

Oleg.
Re: [PATCH v2 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Peter Zijlstra 4 months ago
On Wed, Oct 08, 2025 at 02:59:03PM +0200, Oleg Nesterov wrote:
> On 10/08, Peter Zijlstra wrote:
> >
> > On Wed, Oct 08, 2025 at 02:30:45PM +0200, Oleg Nesterov wrote:
> > >
> > > 	scoped_seqlock_read_irqsave (&seqlock) {
> > > 		// read-side critical section
> > > 	}
> > >
> >
> > Hmm, on first reading I was expecting that to be:
> >
> > 	do {
> > 		seq = read_seqbegin(&seqlock);
> >
> > 		// read-side section
> >
> > 	} while (read_seqretry(&seqlock, seq));
> >
> > for lack of that _or_lock() wording, but I suppose we can make that
> > something like:
> >
> >  	scoped_seqbegin_read (&seqlock) {
> 
> I was thinking about scoped_seqcount_read() but I agree with any naming

Ah, but that would take a seqcount_*t, not a seqlock_t, no?

Or we can do:

	scoped_seqcount_read (&seqlock.seqcount)

But yeah, for later I suppose.
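
A rough sketch of what a lockless-only, seqcount-based variant could look
like (hypothetical; nothing like it was posted in this thread), used as
scoped_seqcount_read(&seqlock.seqcount, seq) { ... } with an explicit
sequence variable name:

	/*
	 * Hypothetical scoped_seqcount_read(): lockless only, re-reads the
	 * sequence count and reruns the body until the read was consistent.
	 */
	#define scoped_seqcount_read(_sc, _seq)					\
		for (unsigned _seq = read_seqcount_begin(_sc), _seq##_retry = 1; \
		     _seq##_retry;						\
		     _seq##_retry = read_seqcount_retry(_sc, _seq),		\
		     _seq = _seq##_retry ? read_seqcount_begin(_sc) : _seq)

A real version would presumably hide _seq behind __UNIQUE_ID(), as the
seqlock macros above do.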
[PATCH v2 2/4] seqlock: change thread_group_cputime() to use scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
To simplify the code and make it more readable.

While at it, change thread_group_cputime() to use __for_each_thread(sig).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/sched/cputime.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 7097de2c8cda..94b445e947d2 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -315,8 +315,6 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	struct signal_struct *sig = tsk->signal;
 	u64 utime, stime;
 	struct task_struct *t;
-	unsigned int seq, nextseq;
-	unsigned long flags;
 
 	/*
 	 * Update current task runtime to account pending time since last
@@ -330,25 +328,18 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 		(void) task_sched_runtime(current);
 
 	rcu_read_lock();
-	/* Attempt a lockless read on the first round. */
-	nextseq = 0;
-	do {
-		seq = nextseq;
-		flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
+	scoped_seqlock_read_irqsave(&sig->stats_lock) {
 		times->utime = sig->utime;
 		times->stime = sig->stime;
 		times->sum_exec_runtime = sig->sum_sched_runtime;
 
-		for_each_thread(tsk, t) {
+		__for_each_thread(sig, t) {
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
 			times->sum_exec_runtime += read_sum_exec_runtime(t);
 		}
-		/* If lockless access failed, take the lock. */
-		nextseq = 1;
-	} while (need_seqretry(&sig->stats_lock, seq));
-	done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
+	}
 	rcu_read_unlock();
 }
 
-- 
2.25.1.362.g51ebf55
[tip: locking/core] seqlock: Change thread_group_cputime() to use scoped_seqlock_read()
Posted by tip-bot2 for Oleg Nesterov 3 months, 2 weeks ago
The following commit has been merged into the locking/core branch of tip:

Commit-ID:     488f48b32654dc6be04d9cc12f75ce030c9cb21b
Gitweb:        https://git.kernel.org/tip/488f48b32654dc6be04d9cc12f75ce030c9cb21b
Author:        Oleg Nesterov <oleg@redhat.com>
AuthorDate:    Wed, 08 Oct 2025 14:30:52 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 21 Oct 2025 12:31:57 +02:00

seqlock: Change thread_group_cputime() to use scoped_seqlock_read()

To simplify the code and make it more readable.

While at it, change thread_group_cputime() to use __for_each_thread(sig).

[peterz: update to new interface]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/cputime.c | 20 +++++---------------
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 7097de2..4f97896 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -313,10 +313,8 @@ static u64 read_sum_exec_runtime(struct task_struct *t)
 void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 {
 	struct signal_struct *sig = tsk->signal;
-	u64 utime, stime;
 	struct task_struct *t;
-	unsigned int seq, nextseq;
-	unsigned long flags;
+	u64 utime, stime;
 
 	/*
 	 * Update current task runtime to account pending time since last
@@ -329,27 +327,19 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	if (same_thread_group(current, tsk))
 		(void) task_sched_runtime(current);
 
-	rcu_read_lock();
-	/* Attempt a lockless read on the first round. */
-	nextseq = 0;
-	do {
-		seq = nextseq;
-		flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
+	guard(rcu)();
+	scoped_seqlock_read (&sig->stats_lock, ss_lock_irqsave) {
 		times->utime = sig->utime;
 		times->stime = sig->stime;
 		times->sum_exec_runtime = sig->sum_sched_runtime;
 
-		for_each_thread(tsk, t) {
+		__for_each_thread(sig, t) {
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
 			times->sum_exec_runtime += read_sum_exec_runtime(t);
 		}
-		/* If lockless access failed, take the lock. */
-		nextseq = 1;
-	} while (need_seqretry(&sig->stats_lock, seq));
-	done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
-	rcu_read_unlock();
+	}
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
[PATCH v2 3/4] seqlock: change do_task_stat() to use scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
To simplify the code and make it more readable.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/array.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index d6a0369caa93..e9c61448f3ee 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -483,7 +483,6 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long flags;
 	int exit_code = task->exit_code;
 	struct signal_struct *sig = task->signal;
-	unsigned int seq = 1;
 
 	state = *get_task_state(task);
 	vsize = eip = esp = 0;
@@ -540,10 +539,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	if (permitted && (!whole || num_threads < 2))
 		wchan = !task_is_running(task);
 
-	do {
-		seq++; /* 2 on the 1st/lockless path, otherwise odd */
-		flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
-
+	scoped_seqlock_read_irqsave(&sig->stats_lock) {
 		cmin_flt = sig->cmin_flt;
 		cmaj_flt = sig->cmaj_flt;
 		cutime = sig->cutime;
@@ -565,8 +561,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 			}
 			rcu_read_unlock();
 		}
-	} while (need_seqretry(&sig->stats_lock, seq));
-	done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
+	}
 
 	if (whole) {
 		thread_group_cputime_adjusted(task, &utime, &stime);
-- 
2.25.1.362.g51ebf55
[tip: locking/core] seqlock: Change do_task_stat() to use scoped_seqlock_read()
Posted by tip-bot2 for Oleg Nesterov 3 months, 2 weeks ago
The following commit has been merged into the locking/core branch of tip:

Commit-ID:     b76f72bea2c601afec81829ea427fc0d20f83216
Gitweb:        https://git.kernel.org/tip/b76f72bea2c601afec81829ea427fc0d20f83216
Author:        Oleg Nesterov <oleg@redhat.com>
AuthorDate:    Wed, 08 Oct 2025 14:30:59 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 21 Oct 2025 12:31:57 +02:00

seqlock: Change do_task_stat() to use scoped_seqlock_read()

To simplify the code and make it more readable.

[peterz: change to new interface]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 fs/proc/array.c |  9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 2ae6318..cbd4bc4 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -481,7 +481,6 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long flags;
 	int exit_code = task->exit_code;
 	struct signal_struct *sig = task->signal;
-	unsigned int seq = 1;
 
 	state = *get_task_state(task);
 	vsize = eip = esp = 0;
@@ -538,10 +537,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	if (permitted && (!whole || num_threads < 2))
 		wchan = !task_is_running(task);
 
-	do {
-		seq++; /* 2 on the 1st/lockless path, otherwise odd */
-		flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
-
+	scoped_seqlock_read (&sig->stats_lock, ss_lock_irqsave) {
 		cmin_flt = sig->cmin_flt;
 		cmaj_flt = sig->cmaj_flt;
 		cutime = sig->cutime;
@@ -563,8 +559,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 			}
 			rcu_read_unlock();
 		}
-	} while (need_seqretry(&sig->stats_lock, seq));
-	done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
+	}
 
 	if (whole) {
 		thread_group_cputime_adjusted(task, &utime, &stime);
[PATCH v2 4/4] seqlock: change do_io_accounting() to use scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
To simplify the code and make it more readable.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/base.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 62d35631ba8c..08113bb6af1d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3041,20 +3041,14 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
 	if (whole) {
 		struct signal_struct *sig = task->signal;
 		struct task_struct *t;
-		unsigned int seq = 1;
-		unsigned long flags;
 
 		rcu_read_lock();
-		do {
-			seq++; /* 2 on the 1st/lockless path, otherwise odd */
-			flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
-
+		scoped_seqlock_read_irqsave(&sig->stats_lock) {
 			acct = sig->ioac;
 			__for_each_thread(sig, t)
 				task_io_accounting_add(&acct, &t->ioac);
 
-		} while (need_seqretry(&sig->stats_lock, seq));
-		done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
+		}
 		rcu_read_unlock();
 	} else {
 		acct = task->ioac;
-- 
2.25.1.362.g51ebf55
[tip: locking/core] seqlock: Change do_io_accounting() to use scoped_seqlock_read()
Posted by tip-bot2 for Oleg Nesterov 3 months, 2 weeks ago
The following commit has been merged into the locking/core branch of tip:

Commit-ID:     795aab353d0650b2d04dc3aa2e22a51000cb2aaa
Gitweb:        https://git.kernel.org/tip/795aab353d0650b2d04dc3aa2e22a51000cb2aaa
Author:        Oleg Nesterov <oleg@redhat.com>
AuthorDate:    Wed, 08 Oct 2025 14:31:05 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 21 Oct 2025 12:31:57 +02:00

seqlock: Change do_io_accounting() to use scoped_seqlock_read()

To simplify the code and make it more readable.

[peterz: change to new interface]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 fs/proc/base.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6299878..407b41c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3043,21 +3043,14 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
 	if (whole) {
 		struct signal_struct *sig = task->signal;
 		struct task_struct *t;
-		unsigned int seq = 1;
-		unsigned long flags;
-
-		rcu_read_lock();
-		do {
-			seq++; /* 2 on the 1st/lockless path, otherwise odd */
-			flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
 
+		guard(rcu)();
+		scoped_seqlock_read (&sig->stats_lock, ss_lock_irqsave) {
 			acct = sig->ioac;
 			__for_each_thread(sig, t)
 				task_io_accounting_add(&acct, &t->ioac);
 
-		} while (need_seqretry(&sig->stats_lock, seq));
-		done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
-		rcu_read_unlock();
+		}
 	} else {
 		acct = task->ioac;
 	}
[PATCH 0/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
OK, please see V2. scoped_seqlock_read_irqsave() uses the
"for (struct {} s ..." hack to make "ulong flags" local.

I can change scoped_seqlock_read() to do the same, to make
it symmetric with _irqsave() if you think it makes sense.

I've dropped 5/5 which changed __dentry_path() for now.

Oleg.

 fs/proc/array.c         |  9 ++------
 fs/proc/base.c          | 10 ++------
 include/linux/seqlock.h | 61 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/cputime.c  | 15 +++---------
 4 files changed, 68 insertions(+), 27 deletions(-)
Re: [PATCH 0/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Linus Torvalds 4 months ago
On Tue, 7 Oct 2025 at 07:22, Oleg Nesterov <oleg@redhat.com> wrote:
>
> OK, please see V2. scoped_seqlock_read_irqsave() uses the
> "for (struct {} s ..." hack to make "ulong flags" local.

Well, patches 2-4 certainly look pretty. That's clearly a *much* nicer
interface.

The macros to introduce that nicer interface may not be winning any
beauty contests, but hey, that's pretty much par for the course.

So this does look like a clear improvement to the interface.

> I can change scoped_seqlock_read() to do the same, to make
> it symmetric with _irqsave() if you think it makes sense.

I don't think it's visible to users, but maybe it would make the
macros look slightly cleaner.

And it would allow making 'lockless' an actual 'bool' - which you
admittedly didn't actually do in the irqsave version either. Not that
that would matter either - I don't think compilers would care one way
or the other.

So a matter of taste. I'd personally lean towards doing it just to
make that 'use a struct in a for-loop' pattern less of an outlier, and
perhaps make people more aware of it.

For example, one advantage of doing it that way was that you only
needed one single use of __UNIQUE_ID() in the
scoped_seqlock_read_irqsave(), because the only ID that ends up being
unique is the name of the struct, and then you can have multiple
different members. I hadn't even thought of that detail, but I thought
it was a nice bonus.

But I really don't think it *matters*, so I'm happy either way.

          Linus
Re: [PATCH 0/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
On 10/07, Linus Torvalds wrote:
>
> On Tue, 7 Oct 2025 at 07:22, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > I can change scoped_seqlock_read() to do the same, to make
> > it symmetric with _irqsave() if you think it makes sense.
>
> I don't think it's visible to users, but maybe it would make the
> macros look slightly cleaner.

OK, agreed, will do tomorrow,

> And it would allow making 'lockless' an actual 'bool'

Yeees... This will slightly increase the "size" of defines, but I
guess this doesn't really matter in this case.
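
IOW, something like this for the plain variant (completely untested sketch,
just to show the idea):

	#define __scoped_seqlock_read(lock, s)						\
		for (struct { bool lockless; int seq; } s = { true, read_seqbegin(lock) };	\
		     s.lockless || scoped_seqlock_read_retry(lock, &s.seq, NULL);		\
		     s.lockless = false)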

Thanks,

Oleg.
[PATCH 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
The read_seqbegin/need_seqretry/done_seqretry API is cumbersome and
error prone. With the new helper the "typical" code like

	int seq, nextseq;
	unsigned long flags;

	nextseq = 0;
	do {
		seq = nextseq;
		flags = read_seqbegin_or_lock_irqsave(&seqlock, &seq);

		// read-side critical section

		nextseq = 1;
	} while (need_seqretry(&seqlock, seq));
	done_seqretry_irqrestore(&seqlock, seq, flags);

can be rewritten as

	scoped_seqlock_read_irqsave (&seqlock) {
		// read-side critical section
	}

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/seqlock.h | 61 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 5ce48eab7a2a..9012702fd0a8 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -1209,4 +1209,65 @@ done_seqretry_irqrestore(seqlock_t *lock, int seq, unsigned long flags)
 	if (seq & 1)
 		read_sequnlock_excl_irqrestore(lock, flags);
 }
+
+/* internal helper for scoped_seqlock_read/scoped_seqlock_read_irqsave */
+static inline int
+scoped_seqlock_read_retry(seqlock_t *lock, int *seq, unsigned long *flags)
+{
+	int retry = 0;
+
+	if (*seq & 1) {
+		if (flags)
+			read_sequnlock_excl_irqrestore(lock, *flags);
+		else
+			read_sequnlock_excl(lock);
+	} else if (read_seqretry(lock, *seq)) {
+		retry = *seq = 1;
+		if (flags)
+			read_seqlock_excl_irqsave(lock, *flags);
+		else
+			read_seqlock_excl(lock);
+	}
+
+	return retry;
+}
+
+#define __scoped_seqlock_read(lock, lockless, seq)	\
+	for (int lockless = 1, seq = read_seqbegin(lock);		\
+	     lockless || scoped_seqlock_read_retry(lock, &seq, NULL);	\
+	     lockless = 0)
+
+/**
+ * scoped_seqlock_read(lock) - execute the read side critical section
+ *                             without manual sequence counter handling
+ *                             or calls to other helpers
+ * @lock: pointer to the seqlock_t protecting the data
+ *
+ * Example:
+ *
+ *	scoped_seqlock_read(&lock) {
+ *		// read-side critical section
+ *	}
+ *
+ * Starts with a lockless pass first. If it fails, restarts the critical
+ * section with the lock held.
+ *
+ * The critical section must not contain control flow that escapes the loop.
+ */
+#define scoped_seqlock_read(lock)	\
+	__scoped_seqlock_read(lock, __UNIQUE_ID(lockless), __UNIQUE_ID(seq))
+
+#define __scoped_seqlock_read_irqsave(lock, s)	\
+	for (struct { int lockless, seq; ulong flags; } s = { 1, read_seqbegin(lock) }; \
+	     s.lockless || scoped_seqlock_read_retry(lock, &s.seq, &s.flags);		\
+	     s.lockless = 0)
+
+/**
+ * scoped_seqlock_read_irqsave(lock) - same as scoped_seqlock_read() but
+ *                                     disables irqs on a locking pass
+ * @lock: pointer to the seqlock_t protecting the data
+ */
+#define scoped_seqlock_read_irqsave(lock)	\
+	__scoped_seqlock_read_irqsave(lock, __UNIQUE_ID(s))
+
 #endif /* __LINUX_SEQLOCK_H */
-- 
2.25.1.362.g51ebf55
Re: [PATCH 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Waiman Long 4 months ago
On 10/7/25 10:21 AM, Oleg Nesterov wrote:
> The read_seqbegin/need_seqretry/done_seqretry API is cumbersome and
> error prone. With the new helper the "typical" code like
>
> 	int seq, nextseq;
> 	unsigned long flags;
>
> 	nextseq = 0;
> 	do {
> 		seq = nextseq;
> 		flags = read_seqbegin_or_lock_irqsave(&seqlock, &seq);
>
> 		// read-side critical section
>
> 		nextseq = 1;
> 	} while (need_seqretry(&seqlock, seq));
> 	done_seqretry_irqrestore(&seqlock, seq, flags);
>
> can be rewritten as
>
> 	scoped_seqlock_read_irqsave (&seqlock) {
> 		// read-side critical section
> 	}
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>   include/linux/seqlock.h | 61 +++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 61 insertions(+)
>
> diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> index 5ce48eab7a2a..9012702fd0a8 100644
> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -1209,4 +1209,65 @@ done_seqretry_irqrestore(seqlock_t *lock, int seq, unsigned long flags)
>   	if (seq & 1)
>   		read_sequnlock_excl_irqrestore(lock, flags);
>   }
> +
> +/* internal helper for scoped_seqlock_read/scoped_seqlock_read_irqsave */
> +static inline int
> +scoped_seqlock_read_retry(seqlock_t *lock, int *seq, unsigned long *flags)
I would suggest adding the "__" prefix to indicate that this is an 
internal helper that shouldn't be called directly.
> +{
> +	int retry = 0;
> +
> +	if (*seq & 1) {
> +		if (flags)
> +			read_sequnlock_excl_irqrestore(lock, *flags);
> +		else
> +			read_sequnlock_excl(lock);
> +	} else if (read_seqretry(lock, *seq)) {
> +		retry = *seq = 1;
> +		if (flags)
> +			read_seqlock_excl_irqsave(lock, *flags);
> +		else
> +			read_seqlock_excl(lock);
> +	}
> +
> +	return retry;
> +}
> +
> +#define __scoped_seqlock_read(lock, lockless, seq)	\
> +	for (int lockless = 1, seq = read_seqbegin(lock);		\
> +	     lockless || scoped_seqlock_read_retry(lock, &seq, NULL);	\
> +	     lockless = 0)

I like Linus' suggestion of putting lockless and seq into a struct to 
make it more consistent with __scoped_seqlock_read_irqsave().

> +
> +/**
> + * scoped_seqlock_read(lock) - execute the read side critical section
> + *                             without manual sequence counter handling
> + *                             or calls to other helpers
> + * @lock: pointer to the seqlock_t protecting the data
> + *
> + * Example:
> + *
> + *	scoped_seqlock_read(&lock) {
> + *		// read-side critical section
> + *	}
> + *
> + * Starts with a lockless pass first. If it fails, restarts the critical
> + * section with the lock held.
> + *
> + * The critical section must not contain control flow that escapes the loop.
> + */
> +#define scoped_seqlock_read(lock)	\
> +	__scoped_seqlock_read(lock, __UNIQUE_ID(lockless), __UNIQUE_ID(seq))
> +
> +#define __scoped_seqlock_read_irqsave(lock, s)	\
> +	for (struct { int lockless, seq; ulong flags; } s = { 1, read_seqbegin(lock) }; \
> +	     s.lockless || scoped_seqlock_read_retry(lock, &s.seq, &s.flags);		\
> +	     s.lockless = 0)
> +
> +/**
> + * scoped_seqlock_read_irqsave(lock) - same as scoped_seqlock_read() but
> + *                                     disables irqs on a locking pass
> + * @lock: pointer to the seqlock_t protecting the data
Maybe we should add a comment saying that this API is similar
to scoped_seqlock_read() but with irqs disabled.
> + */
> +#define scoped_seqlock_read_irqsave(lock)	\
> +	__scoped_seqlock_read_irqsave(lock, __UNIQUE_ID(s))
> +
>   #endif /* __LINUX_SEQLOCK_H */

Other than these minor nits, the patch looks good to me.

Cheers,
Longman
Re: [PATCH 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
On 10/07, Waiman Long wrote:
>
> On 10/7/25 10:21 AM, Oleg Nesterov wrote:
> >+
> >+/* internal helper for scoped_seqlock_read/scoped_seqlock_read_irqsave */
> >+static inline int
> >+scoped_seqlock_read_retry(seqlock_t *lock, int *seq, unsigned long *flags)
> I would suggest adding the "__" prefix to indicate that this is an internal
> helper that shouldn't be called directly.

OK, I will add "__", but I thought that "internal helper" makes it clear that
it shouldn't be called directly. Nevermind, will do.

> >+#define __scoped_seqlock_read(lock, lockless, seq)	\
> >+	for (int lockless = 1, seq = read_seqbegin(lock);		\
> >+	     lockless || scoped_seqlock_read_retry(lock, &seq, NULL);	\
> >+	     lockless = 0)
>
> I like Linus' suggestion of putting lockless and seq into a struct to make
> it more consistent with __scoped_seqlock_read_irqsave().

Again, will do. See my reply to Linus.

> >+/**
> >+ * scoped_seqlock_read_irqsave(lock) - same as scoped_seqlock_read() but
> >+ *                                     disables irqs on a locking pass
> >+ * @lock: pointer to the seqlock_t protecting the data
> Maybe we should add a comment saying that this API is similar to
> scoped_seqlock_read() but with irqs disabled.

Hmm... This is what the comment above tries to say... Do you think it can
be improved?

Oleg.
Re: [PATCH 1/4] seqlock: introduce scoped_seqlock_read() and scoped_seqlock_read_irqsave()
Posted by Waiman Long 4 months ago
On 10/7/25 1:18 PM, Oleg Nesterov wrote:
> On 10/07, Waiman Long wrote:
>> On 10/7/25 10:21 AM, Oleg Nesterov wrote:
>>> +
>>> +/* internal helper for scoped_seqlock_read/scoped_seqlock_read_irqsave */
>>> +static inline int
>>> +scoped_seqlock_read_retry(seqlock_t *lock, int *seq, unsigned long *flags)
>> I would suggest adding the "__" prefix to indicate that this is an internal
>> helper that shouldn't be called directly.
> OK, I will add "__", but I thought that "internal helper" makes it clear that
> it shouldn't be called directly. Never mind, will do.
>
>>> +#define __scoped_seqlock_read(lock, lockless, seq)	\
>>> +	for (int lockless = 1, seq = read_seqbegin(lock);		\
>>> +	     lockless || scoped_seqlock_read_retry(lock, &seq, NULL);	\
>>> +	     lockless = 0)
>> I like Linus' suggestion of putting lockless and seq into a struct to make
>> it more consistent with __scoped_seqlock_read_irqsave().
> Again, will do. See my reply to Linus.
>
>>> +/**
>>> + * scoped_seqlock_read_irqsave(lock) - same as scoped_seqlock_read() but
>>> + *                                     disables irqs on a locking pass
>>> + * @lock: pointer to the seqlock_t protecting the data
>> Maybe we should add a comment saying that this API is similar to
>> scoped_seqlock_read() but with irqs disabled.
> Hmm... This is what the comment above tries to say... Do you think it can
> be improved?

Sorry, I missed that. Never mind :-)

Cheers,
Longman
[PATCH 2/4] seqlock: change thread_group_cputime() to use scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
To simplify the code and make it more readable.

While at it, change thread_group_cputime() to use __for_each_thread(sig).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/sched/cputime.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 7097de2c8cda..94b445e947d2 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -315,8 +315,6 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	struct signal_struct *sig = tsk->signal;
 	u64 utime, stime;
 	struct task_struct *t;
-	unsigned int seq, nextseq;
-	unsigned long flags;
 
 	/*
 	 * Update current task runtime to account pending time since last
@@ -330,25 +328,18 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 		(void) task_sched_runtime(current);
 
 	rcu_read_lock();
-	/* Attempt a lockless read on the first round. */
-	nextseq = 0;
-	do {
-		seq = nextseq;
-		flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
+	scoped_seqlock_read_irqsave(&sig->stats_lock) {
 		times->utime = sig->utime;
 		times->stime = sig->stime;
 		times->sum_exec_runtime = sig->sum_sched_runtime;
 
-		for_each_thread(tsk, t) {
+		__for_each_thread(sig, t) {
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
 			times->sum_exec_runtime += read_sum_exec_runtime(t);
 		}
-		/* If lockless access failed, take the lock. */
-		nextseq = 1;
-	} while (need_seqretry(&sig->stats_lock, seq));
-	done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
+	}
 	rcu_read_unlock();
 }
 
-- 
2.25.1.362.g51ebf55
[PATCH 3/4] seqlock: change do_task_stat() to use scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
To simplify the code and make it more readable.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/array.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index d6a0369caa93..e9c61448f3ee 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -483,7 +483,6 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long flags;
 	int exit_code = task->exit_code;
 	struct signal_struct *sig = task->signal;
-	unsigned int seq = 1;
 
 	state = *get_task_state(task);
 	vsize = eip = esp = 0;
@@ -540,10 +539,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	if (permitted && (!whole || num_threads < 2))
 		wchan = !task_is_running(task);
 
-	do {
-		seq++; /* 2 on the 1st/lockless path, otherwise odd */
-		flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
-
+	scoped_seqlock_read_irqsave(&sig->stats_lock) {
 		cmin_flt = sig->cmin_flt;
 		cmaj_flt = sig->cmaj_flt;
 		cutime = sig->cutime;
@@ -565,8 +561,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 			}
 			rcu_read_unlock();
 		}
-	} while (need_seqretry(&sig->stats_lock, seq));
-	done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
+	}
 
 	if (whole) {
 		thread_group_cputime_adjusted(task, &utime, &stime);
-- 
2.25.1.362.g51ebf55
[PATCH 4/4] seqlock: change do_io_accounting() to use scoped_seqlock_read_irqsave()
Posted by Oleg Nesterov 4 months ago
To simplify the code and make it more readable.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/base.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 62d35631ba8c..08113bb6af1d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3041,20 +3041,14 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
 	if (whole) {
 		struct signal_struct *sig = task->signal;
 		struct task_struct *t;
-		unsigned int seq = 1;
-		unsigned long flags;
 
 		rcu_read_lock();
-		do {
-			seq++; /* 2 on the 1st/lockless path, otherwise odd */
-			flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
-
+		scoped_seqlock_read_irqsave(&sig->stats_lock) {
 			acct = sig->ioac;
 			__for_each_thread(sig, t)
 				task_io_accounting_add(&acct, &t->ioac);
 
-		} while (need_seqretry(&sig->stats_lock, seq));
-		done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
+		}
 		rcu_read_unlock();
 	} else {
 		acct = task->ioac;
-- 
2.25.1.362.g51ebf55
[PATCH 0/5] seqlock: introduce SEQLOCK_READ_SECTION()
Posted by Oleg Nesterov 4 months ago
OK, let me name it SEQLOCK_READ_SECTION().

Al, could you look at 5/5? Please nack it if you think it makes no sense
or is wrong. I abused __dentry_path() to add another example. To simplify the
review, this is how it looks after the patch:

	static char *__dentry_path(const struct dentry *d, struct prepend_buffer *p)
	{
		const struct dentry *dentry;
		struct prepend_buffer b;

		rcu_read_lock();
		__SEQLOCK_READ_SECTION(&rename_lock, lockless, seq, NULL) {
			dentry = d;
			b = *p;
			while (!IS_ROOT(dentry)) {
				const struct dentry *parent = dentry->d_parent;

				prefetch(parent);
				if (!prepend_name(&b, &dentry->d_name))
					break;
				dentry = parent;
			}
			if (lockless)
				rcu_read_unlock();
		}

		if (b.len == p->len)
			prepend_char(&b, '/');
		return extract_string(&b);
	}


TODO: add another trivial helper

	static inline int need_seqretry_or_lock(seqlock_t *lock, int *seq)
	{
		int ret = !(*seq & 1) && read_seqretry(lock, *seq);

		if (ret)
			*seq = 1; /* make this counter odd */

		return ret;
	}

which can be used when the read section is more complex. Say, d_walk().
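
For example, such a reader can keep its explicit loop and only use the
helper for the retry/lock transition, roughly (untested):

	int seq = 0;

	do {
		read_seqbegin_or_lock(&rename_lock, &seq);

		/* complex read-side section, inner loops, early breaks, ... */

	} while (need_seqretry_or_lock(&rename_lock, &seq));
	done_seqretry(&rename_lock, seq);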

Oleg.
---

 fs/d_path.c             | 31 +++++++++++++------------------
 fs/proc/array.c         |  9 ++-------
 fs/proc/base.c          |  9 ++-------
 include/linux/seqlock.h | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/cputime.c  | 14 +++-----------
 5 files changed, 69 insertions(+), 43 deletions(-)
Re: [PATCH 0/5] seqlock: introduce SEQLOCK_READ_SECTION()
Posted by Al Viro 4 months ago
On Sun, Oct 05, 2025 at 04:49:29PM +0200, Oleg Nesterov wrote:
> OK, let me name it SEQLOCK_READ_SECTION().
> 
> Al, could you look at 5/5? Please nack it if you think it makes no sense
> or is wrong. I abused __dentry_path() to add another example. To simplify the
> review, this is how it looks after the patch:
> 
> 	static char *__dentry_path(const struct dentry *d, struct prepend_buffer *p)
> 	{
> 		const struct dentry *dentry;
> 		struct prepend_buffer b;
> 
> 		rcu_read_lock();
> 		__SEQLOCK_READ_SECTION(&rename_lock, lockless, seq, NULL) {
> 			dentry = d;
> 			b = *p;
> 			while (!IS_ROOT(dentry)) {
> 				const struct dentry *parent = dentry->d_parent;
> 
> 				prefetch(parent);
> 				if (!prepend_name(&b, &dentry->d_name))
> 					break;
> 				dentry = parent;
> 			}
> 			if (lockless)
> 				rcu_read_unlock();

<cringe>

> 		}
> 
> 		if (b.len == p->len)
> 			prepend_char(&b, '/');
> 		return extract_string(&b);
> 	}
> 
> 
> TODO: add another trivial helper
> 
> 	static inline int need_seqretry_or_lock(seqlock_t *lock, int *seq)
> 	{
> 		int ret = !(*seq & 1) && read_seqretry(lock, *seq);
> 
> 		if (ret)
> 			*seq = 1; /* make this counter odd */
> 
> 		return ret;
> 	}
> 
> which can be used when the read section is more complex. Say, d_walk().

prepend_path() would be interesting to look at...
Re: [PATCH 0/5] seqlock: introduce SEQLOCK_READ_SECTION()
Posted by Oleg Nesterov 4 months ago
On 10/05, Al Viro wrote:
>
> On Sun, Oct 05, 2025 at 04:49:29PM +0200, Oleg Nesterov wrote:
>
> > 	static char *__dentry_path(const struct dentry *d, struct prepend_buffer *p)
> > 	{
> > 		const struct dentry *dentry;
> > 		struct prepend_buffer b;
> >
> > 		rcu_read_lock();
> > 		__SEQLOCK_READ_SECTION(&rename_lock, lockless, seq, NULL) {
> > 			dentry = d;
> > 			b = *p;
> > 			while (!IS_ROOT(dentry)) {
> > 				const struct dentry *parent = dentry->d_parent;
> >
> > 				prefetch(parent);
> > 				if (!prepend_name(&b, &dentry->d_name))
> > 					break;
> > 				dentry = parent;
> > 			}
> > 			if (lockless)
> > 				rcu_read_unlock();
>
> <cringe>

Weird or buggy?

> prepend_path() would be interesting to look at...

Yeah, I have already looked at it.

Not sure yet what the "good" cleanup could do. Will think about it more.

Oleg.
[PATCH 1/5] seqlock: introduce SEQLOCK_READ_SECTION()
Posted by Oleg Nesterov 4 months ago
The read_seqbegin/need_seqretry/done_seqretry API is cumbersome and
error prone. With the new helper the "typical" code like

	nextseq = 0;
	do {
		seq = nextseq;
		flags = read_seqbegin_or_lock_irqsave(&seqlock, &seq);

		// read-side critical section

		nextseq = 1;
	} while (need_seqretry(&seqlock, seq));
	done_seqretry_irqrestore(&seqlock, seq, flags);

can be rewritten as

	SEQLOCK_READ_SECTION(&seqlock, &flags) {
		// read-side critical section
	}

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/seqlock.h | 49 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 5ce48eab7a2a..8cd36a7c1638 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -1209,4 +1209,53 @@ done_seqretry_irqrestore(seqlock_t *lock, int seq, unsigned long flags)
 	if (seq & 1)
 		read_sequnlock_excl_irqrestore(lock, flags);
 }
+
+/* internal helper for SEQLOCK_READ_SECTION() */
+static inline int
+seqlock_read_section_retry(seqlock_t *lock, int *seq, unsigned long *flags)
+{
+	int retry = 0;
+
+	if (*seq & 1) {
+		if (flags)
+			read_sequnlock_excl_irqrestore(lock, *flags);
+		else
+			read_sequnlock_excl(lock);
+	} else if (read_seqretry(lock, *seq)) {
+		retry = *seq = 1;
+		if (flags)
+			read_seqlock_excl_irqsave(lock, *flags);
+		else
+			read_seqlock_excl(lock);
+	}
+
+	return retry;
+}
+
+#define __SEQLOCK_READ_SECTION(lock, lockless, seq, flags)	\
+	for (int lockless = 1, seq = read_seqbegin(lock);		\
+	     lockless || seqlock_read_section_retry(lock, &seq, flags);	\
+	     lockless = 0)
+
+/**
+ * SEQLOCK_READ_SECTION(lock, flags) - execute the read side critical section
+ *                                     without manual sequence counter handling
+ *                                     or calls to other helpers
+ * @lock: pointer to the seqlock_t protecting the data
+ * @flags: pointer to unsigned long for irqsave/irqrestore or NULL
+ *
+ * Example:
+ *
+ *	SEQLOCK_READ_SECTION(&lock, &flags) {
+ *		// read-side critical section
+ *	}
+ *
+ * Starts with a lockless pass first. If it fails, restarts the critical section
+ * with the lock held and, if @flags != NULL, with irqs disabled.
+ *
+ * The critical section must not contain control flow that escapes the loop.
+ */
+#define SEQLOCK_READ_SECTION(lock, flags)	\
+	__SEQLOCK_READ_SECTION(lock, __UNIQUE_ID(lockless), __UNIQUE_ID(seq), flags)
+
 #endif /* __LINUX_SEQLOCK_H */
-- 
2.25.1.362.g51ebf55
Re: [PATCH 1/5] seqlock: introduce SEQLOCK_READ_SECTION()
Posted by Linus Torvalds 4 months ago
On Sun, 5 Oct 2025 at 07:51, Oleg Nesterov <oleg@redhat.com> wrote:
>
> +#define __SEQLOCK_READ_SECTION(lock, lockless, seq, flags)     \
> +       for (int lockless = 1, seq = read_seqbegin(lock);               \
> +            lockless || seqlock_read_section_retry(lock, &seq, flags); \
> +            lockless = 0)
> ...
> +#define SEQLOCK_READ_SECTION(lock, flags)      \
> +       __SEQLOCK_READ_SECTION(lock, __UNIQUE_ID(lockless), __UNIQUE_ID(seq), flags)

Ok, I like the helper wrapper, but I don't actually think we need it
to be shouty.

As far as the users are concerned, the result doesn't end up being
really any different from our scoped guards, so I'd actually suggest
you just make this helper look like our scoped_guard() macro does.

And instead of making people pass in a NULL 'flags', just do a
separate version of it, exactly like we do for locking. Even if the
internal implementation then ends up sharing most of the code, please
don't make people pass in NULL just because they don't want the
irqsave version.

So make it two different things:

   scoped_seqlock_read(lock) { .... }

   scoped_seqlock_read_irqsave(lock, flags) { ... }

or something.

(Maybe 'flags' can even be local to that scope?)

Hmm?

                Linus
Re: [PATCH 1/5] seqlock: introduce SEQLOCK_READ_SECTION()
Posted by Oleg Nesterov 4 months ago
OK. I'll send v2 but let me ask...

On 10/05, Linus Torvalds wrote:
>
> As far as the users are concerned, the result doesn't end up being
> really any different from our scoped guards, so I'd actually suggest
> you just make this helper look like our scoped_guard() macro does.

I swear, I too thought about scoped_seqlock_xxx ;)

> And instead of making people pass in a NULL 'flags', just do a
> separate version of it, exactly like we do for locking. Even if the
> internal implementation then ends up sharing most of the code, please
> don't make people pass in NULL just because they don't want the
> irqsave version.
>
> So make it two different things:
>
>    scoped_seqlock_read(lock) { .... }
>
>    scoped_seqlock_read_irqsave(lock, flags) { ... }

OK. But if you don't object I'd like to avoid another DEFINE_LOCK_GUARD()
or something like it in this case. To me it won't buy anything.

And I think that the "generic" seqlock_read_section_retry(flags) makes
sense, the "if (flags)" checks should not generate the extra code.

Will you agree?

> (Maybe 'flags' can even be local to that scope?)

The problem is that you can't declare "int lockless/seq" and
"unsigned long flags" inside "for (...)",  but I'll try to think about
it more.

Oleg.
Re: [PATCH 1/5] seqlock: introduce SEQLOCK_READ_SECTION()
Posted by Linus Torvalds 4 months ago
On Sun, 5 Oct 2025 at 09:09, Oleg Nesterov <oleg@redhat.com> wrote:
>
> OK. But if you don't object I'd like to avoid another DEFINE_LOCK_GUARD()
> or something like it in this case. To me it won't buy anything.

Oh, absolutely. I didn't mean that you actually *use* the fancy GUARD
infrastructure we have: this code doesn't do the whole '__cleanup()'
thing at all.

I only mean that as far as users are concerned, this all looks
similar, so please use a similar interface even if from an
implementation standpoint it is very different.

> The problem is that you can't declare "int lockless/seq" and
> "unsigned long flags" inside "for (...)",  but I'll try to think about
> it more.

I agree, it's an annoying limitation of the for-loop variable
declaration thing that you can only declare one type.

There's a somewhat strange solution to it, though: declare a structure
with multiple members of the types you want:

        for (struct { char *a; int b; } c = {NULL, 1}; ... }

and then you use 'c.a' and 'c.b' and friends inside the loop.

It looks pretty odd, but it actually works fine and we should probably
use it more.

[ The *really* hacky solution is to just use one type and pick the
biggest type and then possibly mis-use that type.

  That's effectively what you did with 'lockless': it's declared as an
'int', but it's really effectively used as a 'bool' anyway.

  Of course, in that 'lockless' case it's not as noticeable, since we
often use int/bool rather interchangeably in the kernel for historical
reasons anyway, so you probably didn't think of it as a hacky solution
]

               Linus
[PATCH 2/5] seqlock: change thread_group_cputime() to use SEQLOCK_READ_SECTION()
Posted by Oleg Nesterov 4 months ago
To simplify the code and make it more readable.

While at it, change thread_group_cputime() to use __for_each_thread(sig).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/sched/cputime.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 7097de2c8cda..2d691075b6d6 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -315,7 +315,6 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	struct signal_struct *sig = tsk->signal;
 	u64 utime, stime;
 	struct task_struct *t;
-	unsigned int seq, nextseq;
 	unsigned long flags;
 
 	/*
@@ -330,25 +329,18 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 		(void) task_sched_runtime(current);
 
 	rcu_read_lock();
-	/* Attempt a lockless read on the first round. */
-	nextseq = 0;
-	do {
-		seq = nextseq;
-		flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
+	SEQLOCK_READ_SECTION(&sig->stats_lock, &flags) {
 		times->utime = sig->utime;
 		times->stime = sig->stime;
 		times->sum_exec_runtime = sig->sum_sched_runtime;
 
-		for_each_thread(tsk, t) {
+		__for_each_thread(sig, t) {
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
 			times->sum_exec_runtime += read_sum_exec_runtime(t);
 		}
-		/* If lockless access failed, take the lock. */
-		nextseq = 1;
-	} while (need_seqretry(&sig->stats_lock, seq));
-	done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
+	}
 	rcu_read_unlock();
 }
 
-- 
2.25.1.362.g51ebf55
[PATCH 3/5] seqlock: change do_task_stat() to use SEQLOCK_READ_SECTION()
Posted by Oleg Nesterov 4 months ago
To simplify the code and make it more readable.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/array.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index d6a0369caa93..99bcaa58a2bd 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -483,7 +483,6 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long flags;
 	int exit_code = task->exit_code;
 	struct signal_struct *sig = task->signal;
-	unsigned int seq = 1;
 
 	state = *get_task_state(task);
 	vsize = eip = esp = 0;
@@ -540,10 +539,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	if (permitted && (!whole || num_threads < 2))
 		wchan = !task_is_running(task);
 
-	do {
-		seq++; /* 2 on the 1st/lockless path, otherwise odd */
-		flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
-
+	SEQLOCK_READ_SECTION(&sig->stats_lock, &flags) {
 		cmin_flt = sig->cmin_flt;
 		cmaj_flt = sig->cmaj_flt;
 		cutime = sig->cutime;
@@ -565,8 +561,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 			}
 			rcu_read_unlock();
 		}
-	} while (need_seqretry(&sig->stats_lock, seq));
-	done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
+	}
 
 	if (whole) {
 		thread_group_cputime_adjusted(task, &utime, &stime);
-- 
2.25.1.362.g51ebf55
[PATCH 4/5] seqlock: change do_io_accounting() to use SEQLOCK_READ_SECTION()
Posted by Oleg Nesterov 4 months ago
To simplify the code and make it more readable.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/proc/base.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 62d35631ba8c..7ec507efbb99 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3041,20 +3041,15 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
 	if (whole) {
 		struct signal_struct *sig = task->signal;
 		struct task_struct *t;
-		unsigned int seq = 1;
 		unsigned long flags;
 
 		rcu_read_lock();
-		do {
-			seq++; /* 2 on the 1st/lockless path, otherwise odd */
-			flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
-
+		SEQLOCK_READ_SECTION(&sig->stats_lock, &flags) {
 			acct = sig->ioac;
 			__for_each_thread(sig, t)
 				task_io_accounting_add(&acct, &t->ioac);
 
-		} while (need_seqretry(&sig->stats_lock, seq));
-		done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
+		}
 		rcu_read_unlock();
 	} else {
 		acct = task->ioac;
-- 
2.25.1.362.g51ebf55
[PATCH 5/5] seqlock: change __dentry_path() to use __SEQLOCK_READ_SECTION()
Posted by Oleg Nesterov 4 months ago
To simplify the code and make it more readable.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/d_path.c | 31 +++++++++++++------------------
 1 file changed, 13 insertions(+), 18 deletions(-)

diff --git a/fs/d_path.c b/fs/d_path.c
index bb365511066b..0dab1ab1e78d 100644
--- a/fs/d_path.c
+++ b/fs/d_path.c
@@ -332,28 +332,23 @@ static char *__dentry_path(const struct dentry *d, struct prepend_buffer *p)
 {
 	const struct dentry *dentry;
 	struct prepend_buffer b;
-	int seq = 0;
 
 	rcu_read_lock();
-restart:
-	dentry = d;
-	b = *p;
-	read_seqbegin_or_lock(&rename_lock, &seq);
-	while (!IS_ROOT(dentry)) {
-		const struct dentry *parent = dentry->d_parent;
+	__SEQLOCK_READ_SECTION(&rename_lock, lockless, seq, NULL) {
+		dentry = d;
+		b = *p;
+		while (!IS_ROOT(dentry)) {
+			const struct dentry *parent = dentry->d_parent;
 
-		prefetch(parent);
-		if (!prepend_name(&b, &dentry->d_name))
-			break;
-		dentry = parent;
-	}
-	if (!(seq & 1))
-		rcu_read_unlock();
-	if (need_seqretry(&rename_lock, seq)) {
-		seq = 1;
-		goto restart;
+			prefetch(parent);
+			if (!prepend_name(&b, &dentry->d_name))
+				break;
+			dentry = parent;
+		}
+		if (lockless)
+			rcu_read_unlock();
 	}
-	done_seqretry(&rename_lock, seq);
+
 	if (b.len == p->len)
 		prepend_char(&b, '/');
 	return extract_string(&b);
-- 
2.25.1.362.g51ebf55
Re: [PATCH 5/5] seqlock: change __dentry_path() to use __SEQLOCK_READ_SECTION()
Posted by Linus Torvalds 4 months ago
On Sun, 5 Oct 2025 at 07:52, Oleg Nesterov <oleg@redhat.com> wrote:
>
> To simplify the code and make it more readable.

Ok, so the other ones looked fine. This one I'm not convinced about.

The end result makes no sense taken alone. It does that odd
"rcu_read_unlock()" after the first loop, and it basically ends up
being very incestuous with that implementation, which means that now
you have to understand both pieces intimately to figure it out, and
they are not near each other.

It *might* be solved by just renaming 'lockless' to
'first_lockless_iteration' - not because it changes the code, but
because it makes the logic much more explicit, and then that

        if (first_lockless_iteration)
                rcu_read_unlock();

test inside that loop would at least make a lot more conceptual sense
without knowing the internal implementation of that macro.

But honestly, while I think that would turn it from "too ugly to live"
to "I don't love it but I can deal with it", I would wish for
something better.

Side note: that whole internal loop:

        while (!IS_ROOT(dentry)) {
                const struct dentry *parent = dentry->d_parent;

                prefetch(parent);
                if (!prepend_name(&b, &dentry->d_name))
                        break;
                dentry = parent;
        }

should be a helper function of its own, I think. And if you do that,
maybe you can switch the whole thing over to just making the first
non-locked iteration be an explicitly separate phase?
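
E.g. something like this (rough sketch, name made up, completely untested):

	static bool prepend_dentry_names(const struct dentry *dentry,
					 struct prepend_buffer *b)
	{
		/* returns false if the buffer ran out of space */
		while (!IS_ROOT(dentry)) {
			const struct dentry *parent = dentry->d_parent;

			prefetch(parent);
			if (!prepend_name(b, &dentry->d_name))
				return false;
			dentry = parent;
		}
		return true;
	}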

I dunno. I don't love that code in the existing format - but I think
you ended up hiding that subtlety even more.

             Linus
[RFC 2/1] seqlock: make the read_seqbegin_or_lock() API more simple and less error-prone ?
Posted by Oleg Nesterov 4 months, 1 week ago
Another problem is that this API is error-prone. Two years ago half of the
read_seqbegin_or_lock() users were buggy (they followed the wrong example
from Documentation/locking/seqlock.rst). And even if the code is mostly
correct, it is very easy to add a hard-to-detect mistake; see for example

	[PATCH][v3] afs: Remove erroneous seq |= 1 in volume lookup loop
	https://lore.kernel.org/all/20250910084235.2630-1-lirongqing@baidu.com/

Can we improve this API?

-------------------------------------------------------------------------------
To simplify, suppose we add the new helper

	static inline int need_seqretry_xxx(seqlock_t *lock, int *seq)
	{
		int ret = !(*seq & 1) && read_seqretry(lock, *seq);

		if (ret)
			++*seq;	/* make this counter odd */

		return ret;
	}

which can be used instead of need_seqretry(). This way the users do not
need to modify "seq" in the main loop. For example, we can simplify
thread_group_cputime() as follows:

	--- a/kernel/sched/cputime.c
	+++ b/kernel/sched/cputime.c
	@@ -314,7 +314,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
		struct signal_struct *sig = tsk->signal;
		u64 utime, stime;
		struct task_struct *t;
	-	unsigned int seq, nextseq;
	+	unsigned int seq;
		unsigned long flags;
	 
		/*
	@@ -330,9 +330,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
	 
		rcu_read_lock();
		/* Attempt a lockless read on the first round. */
	-	nextseq = 0;
	+	seq = 0;
		do {
	-		seq = nextseq;
			flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
			times->utime = sig->utime;
			times->stime = sig->stime;
	@@ -344,9 +343,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
				times->stime += stime;
				times->sum_exec_runtime += read_sum_exec_runtime(t);
			}
	-		/* If lockless access failed, take the lock. */
	-		nextseq = 1;
	-	} while (need_seqretry(&sig->stats_lock, seq));
	+	} while (need_seqretry_xxx(&sig->stats_lock, &seq));
		done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
		rcu_read_unlock();
	 }

most (if not all) of other users can be changed the same way.

-------------------------------------------------------------------------------
Or perhaps we can even add a helper that hides all the details, something like

	int xxx(seqlock_t *lock, int *seq, int lockless)
	{
		if (lockless) {
			*seq = read_seqbegin(lock);
			return 1;
		}

		if (*seq & 1) {
			read_sequnlock_excl(lock);
			return 0;
		}

		if (read_seqretry(lock, *seq)) {
			read_seqlock_excl(lock);
			*seq = 1;
			return 1;
		}

		return 0;

	}

	#define __XXX(lock, seq, lockless)	\
		for (int lockless = 1, seq; xxx(lock, &seq, lockless); lockless = 0)

	#define XXX(lock)	\
		__XXX(lock, __UNIQUE_ID(seq), __UNIQUE_ID(lockless))


?

This way the users can simply do:

	seqlock_t sl;

	void func(void)
	{
		XXX(&sl) {
			... read-side critical section ...
		}
	}

using only the new XXX() helper. No need to declare/initialize seq, no need
for need_seqretry/done_seqretry.

What do you think?

Oleg.
Re: [RFC 2/1] seqlock: make the read_seqbegin_or_lock() API more simple and less error-prone ?
Posted by Peter Zijlstra 4 months, 1 week ago
On Sun, Sep 28, 2025 at 06:20:54PM +0200, Oleg Nesterov wrote:

> Can we improve this API?

Please :-)

> -------------------------------------------------------------------------------
> To simplify, suppose we add the new helper
> 
> 	static inline int need_seqretry_xxx(seqlock_t *lock, int *seq)
> 	{
> 		int ret = !(*seq & 1) && read_seqretry(lock, *seq);
> 
> 		if (ret)
> 			++*seq;	/* make this counter odd */
> 
> 		return ret;
> 	}

How about need_seqretry_or_lock() to stay in theme with
read_seqbegin_or_lock()?

But then there's done_seqretry() without the _or_lock() :/

> which can be used instead of need_seqretry(). This way the users do not
> need to modify "seq" in the main loop. For example, we can simplify
> thread_group_cputime() as follows:
> 
> 	--- a/kernel/sched/cputime.c
> 	+++ b/kernel/sched/cputime.c
> 	@@ -314,7 +314,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> 		struct signal_struct *sig = tsk->signal;
> 		u64 utime, stime;
> 		struct task_struct *t;
> 	-	unsigned int seq, nextseq;
> 	+	unsigned int seq;
> 		unsigned long flags;
> 	 
> 		/*
> 	@@ -330,9 +330,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> 	 
> 		rcu_read_lock();
> 		/* Attempt a lockless read on the first round. */
> 	-	nextseq = 0;
> 	+	seq = 0;
> 		do {
> 	-		seq = nextseq;
> 			flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
> 			times->utime = sig->utime;
> 			times->stime = sig->stime;
> 	@@ -344,9 +343,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> 				times->stime += stime;
> 				times->sum_exec_runtime += read_sum_exec_runtime(t);
> 			}
> 	-		/* If lockless access failed, take the lock. */
> 	-		nextseq = 1;
> 	-	} while (need_seqretry(&sig->stats_lock, seq));
> 	+	} while (need_seqretry_xxx(&sig->stats_lock, &seq));
> 		done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
> 		rcu_read_unlock();
> 	 }
> 
> most (if not all) of other users can be changed the same way.
> 
> -------------------------------------------------------------------------------
> Or perhaps we can even add a helper that hides all the details, something like
> 
> 	int xxx(seqlock_t *lock, int *seq, int lockless)
> 	{
> 		if (lockless) {
> 			*seq = read_seqbegin(lock);
> 			return 1;
> 		}
> 
> 		if (*seq & 1) {
> 			read_sequnlock_excl(lock);
> 			return 0;
> 		}
> 
> 		if (read_seqretry(lock, *seq)) {
> 			read_seqlock_excl(lock);
> 			*seq = 1;
> 			return 1;
> 		}
> 
> 		return 0;
> 
> 	}
> 
> 	#define __XXX(lock, seq, lockless)	\
> 		for (int lockless = 1, seq; xxx(lock, &seq, lockless); lockless = 0)
> 
> 	#define XXX(lock)	\
> 		__XXX(lock, __UNIQUE_ID(seq), __UNIQUE_ID(lockless))
> 
> 
> ?

Oh gawd, that thing had better not have control flow escape that loop.

But yes, I suppose something like this is far more useable than the
current thing.

> This way the users can simply do:
> 
> 	seqlock_t sl;
> 
> 	void func(void)
> 	{
> 		XXX(&sl) {
> 			... read-side critical section ...
> 		}
> 	}
> 
> using only the new XXX() helper. No need to declare/initialize seq, no need
> for need_seqretry/done_seqretry.
> 
> What do you think?

I'm thinking we want something like this for the normal seqcount loops
too :-)
Re: [RFC 2/1] seqlock: make the read_seqbegin_or_lock() API more simple and less error-prone ?
Posted by Oleg Nesterov 4 months, 1 week ago
On 10/01, Peter Zijlstra wrote:
> On Sun, Sep 28, 2025 at 06:20:54PM +0200, Oleg Nesterov wrote:
>
> > To simplify, suppose we add the new helper
> >
> > 	static inline int need_seqretry_xxx(seqlock_t *lock, int *seq)
> > 	{
> > 		int ret = !(*seq & 1) && read_seqretry(lock, *seq);
> >
> > 		if (ret)
> > 			++*seq;	/* make this counter odd */
                        ^^^^^^
Hmm. just
			*seq = 1;
makes more sense

> How about need_seqretry_or_lock() to stay in theme with
> read_seqbegin_or_lock().

I am fine with any name ;) This one looks good to me.

> > 	#define __XXX(lock, seq, lockless)	\
> > 		for (int lockless = 1, seq; xxx(lock, &seq, lockless); lockless = 0)
> >
> > 	#define XXX(lock)	\
> > 		__XXX(lock, __UNIQUE_ID(seq), __UNIQUE_ID(lockless))
> >
> >
> > ?
>
> Oh gawd, that thing had better not have control flow escape that loop.

Yes, yes. "continue" is fine, but break/return won't work.
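
For example (hypothetical code, using the one-argument XXX() form from the
RFC; "sl" and "value" are made-up names):

	seqlock_t sl;
	int value;

	int read_value(void)
	{
		int v;

		XXX(&sl) {
			v = value;
			if (v < 0)
				return -EINVAL;	/* WRONG: on the locking pass this
						 * returns with the lock still held */
		}
		return v;
	}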

> But yes, I suppose something like this is far more useable than the
> current thing.

OK, great. So, modulo naming, how about the patch below?

The new stuff should obviously go to include/linux/seqlock.h; xxx() can
probably be uninlined. thread_group_cputime() is changed as an example.

Oleg.


--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -306,6 +306,35 @@ static u64 read_sum_exec_runtime(struct task_struct *t)
 }
 #endif /* !CONFIG_64BIT */
 
+static inline int xxx(seqlock_t *lock, int lockless, int *seq, unsigned long *flags)
+{
+	if (lockless) {
+		*seq = read_seqbegin(lock);
+		return 1;
+	} else if (*seq & 1) {
+		if (flags)
+			read_sequnlock_excl_irqrestore(lock, *flags);
+		else
+			read_sequnlock_excl(lock);
+		return 0;
+	} else if (read_seqretry(lock, *seq)) {
+		if (flags)
+			read_seqlock_excl_irqsave(lock, *flags);
+		else
+			read_seqlock_excl(lock);
+		*seq = 1;
+		return 1;
+	} else {
+		return 0;
+	}
+}
+
+#define __XXX(lock, lockless, seq, flags)	\
+	for (int lockless = 1, seq; xxx(lock, lockless, &seq, flags); lockless = 0)
+
+#define XXX(lock, flags)	\
+	__XXX(lock, __UNIQUE_ID(lockless), __UNIQUE_ID(seq), flags)
+
 /*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
@@ -315,7 +344,6 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	struct signal_struct *sig = tsk->signal;
 	u64 utime, stime;
 	struct task_struct *t;
-	unsigned int seq, nextseq;
 	unsigned long flags;
 
 	/*
@@ -330,11 +358,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 		(void) task_sched_runtime(current);
 
 	rcu_read_lock();
-	/* Attempt a lockless read on the first round. */
-	nextseq = 0;
-	do {
-		seq = nextseq;
-		flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
+	XXX(&sig->stats_lock, &flags) {
 		times->utime = sig->utime;
 		times->stime = sig->stime;
 		times->sum_exec_runtime = sig->sum_sched_runtime;
@@ -345,10 +369,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 			times->stime += stime;
 			times->sum_exec_runtime += read_sum_exec_runtime(t);
 		}
-		/* If lockless access failed, take the lock. */
-		nextseq = 1;
-	} while (need_seqretry(&sig->stats_lock, seq));
-	done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
+	}
 	rcu_read_unlock();
 }
Re: [RFC 2/1] seqlock: make the read_seqbegin_or_lock() API more simple and less error-prone ?
Posted by Oleg Nesterov 4 months, 1 week ago
On 10/01, Oleg Nesterov wrote:
>
> +static inline int xxx(seqlock_t *lock, int lockless, int *seq, unsigned long *flags)
> +{
> +	if (lockless) {
> +		*seq = read_seqbegin(lock);
> +		return 1;
> +	} else if (*seq & 1) {
> +		if (flags)
> +			read_sequnlock_excl_irqrestore(lock, *flags);
> +		else
> +			read_sequnlock_excl(lock);
> +		return 0;
> +	} else if (read_seqretry(lock, *seq)) {
> +		if (flags)
> +			read_seqlock_excl_irqsave(lock, *flags);
> +		else
> +			read_seqlock_excl(lock);
> +		*seq = 1;
> +		return 1;
> +	} else {
> +		return 0;
> +	}
> +}
> +
> +#define __XXX(lock, lockless, seq, flags)	\
> +	for (int lockless = 1, seq; xxx(lock, lockless, &seq, flags); lockless = 0)
> +
> +#define XXX(lock, flags)	\
> +	__XXX(lock, __UNIQUE_ID(lockless), __UNIQUE_ID(seq), flags)

Or, better

	static inline int xxx(seqlock_t *lock, int *seq, unsigned long *flags)
	{
		int retry = 0;

		if (*seq & 1) {
			if (flags)
				read_sequnlock_excl_irqrestore(lock, *flags);
			else
				read_sequnlock_excl(lock);
		} else if (read_seqretry(lock, *seq)) {
			retry = *seq = 1;
			if (flags)
				read_seqlock_excl_irqsave(lock, *flags);
			else
				read_seqlock_excl(lock);
		}

		return retry;
	}

	#define __XXX(lock, lockless, seq, flags)			\
		for (int lockless = 1, seq = read_seqbegin(lock);	\
		     lockless || xxx(lock, &seq, flags);		\
		     lockless = 0)

	#define XXX(lock, flags)	\
		__XXX(lock, __UNIQUE_ID(lockless), __UNIQUE_ID(seq), flags)


With this version the change in thread_group_cputime() even shrinks .text a little bit.

I'd like to send the patch, but I still can't find a good name... Peter, would
you be OK with SEQBEGIN_OR_LOCK(lock, flags)?
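
Whatever the final name, the irqsave case would then read (sketch only,
reusing the stats_lock example from the patch):

	unsigned long flags;

	SEQBEGIN_OR_LOCK(&sig->stats_lock, &flags) {
		/* read-side critical section: lockless first, locked on retry */
	}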

Oleg.
Re: [RFC 2/1] seqlock: make the read_seqbegin_or_lock() API more simple and less error-prone ?
Posted by Oleg Nesterov 4 months, 1 week ago
On 10/01, Oleg Nesterov wrote:
>
> +static inline int xxx(seqlock_t *lock, int lockless, int *seq, unsigned long *flags)
> +{
> +	if (lockless) {
> +		*seq = read_seqbegin(lock);
> +		return 1;
> +	} else if (*seq & 1) {
> +		if (flags)
> +			read_sequnlock_excl_irqrestore(lock, *flags);
> +		else
> +			read_sequnlock_excl(lock);
> +		return 0;
> +	} else if (read_seqretry(lock, *seq)) {
> +		if (flags)
> +			read_seqlock_excl_irqsave(lock, *flags);
> +		else
> +			read_seqlock_excl(lock);
> +		*seq = 1;
> +		return 1;
> +	} else {
> +		return 0;
> +	}
> +}
> +
> +#define __XXX(lock, lockless, seq, flags)	\
> +	for (int lockless = 1, seq; xxx(lock, lockless, &seq, flags); lockless = 0)
> +
> +#define XXX(lock, flags)	\
> +	__XXX(lock, __UNIQUE_ID(lockless), __UNIQUE_ID(seq), flags)

Note that __XXX() can have users too, see the patch below. Not that it makes a
lot of sense, it's just an example.

Oleg.


diff --git a/fs/d_path.c b/fs/d_path.c
index bb365511066b..b4cde188be4f 100644
--- a/fs/d_path.c
+++ b/fs/d_path.c
@@ -332,28 +332,23 @@ static char *__dentry_path(const struct dentry *d, struct prepend_buffer *p)
 {
 	const struct dentry *dentry;
 	struct prepend_buffer b;
-	int seq = 0;
 
 	rcu_read_lock();
-restart:
-	dentry = d;
-	b = *p;
-	read_seqbegin_or_lock(&rename_lock, &seq);
-	while (!IS_ROOT(dentry)) {
-		const struct dentry *parent = dentry->d_parent;
+	__XXX(&rename_lock, lockless, seq, NULL) {
+		dentry = d;
+		b = *p;
+		while (!IS_ROOT(dentry)) {
+			const struct dentry *parent = dentry->d_parent;
 
-		prefetch(parent);
-		if (!prepend_name(&b, &dentry->d_name))
-			break;
-		dentry = parent;
-	}
-	if (!(seq & 1))
-		rcu_read_unlock();
-	if (need_seqretry(&rename_lock, seq)) {
-		seq = 1;
-		goto restart;
+			prefetch(parent);
+			if (!prepend_name(&b, &dentry->d_name))
+				break;
+			dentry = parent;
+		}
+		if (lockless)
+			rcu_read_unlock();
 	}
-	done_seqretry(&rename_lock, seq);
+
 	if (b.len == p->len)
 		prepend_char(&b, '/');
 	return extract_string(&b);
Re: [RFC 2/1] seqlock: make the read_seqbegin_or_lock() API more simple and less error-prone ?
Posted by David Howells 4 months, 1 week ago
Oleg Nesterov <oleg@redhat.com> wrote:

> 
> Can we improve this API?

It would also be nice to fix the static lock-balance detection stuff that you
get when you enable advanced checking during a kernel build.  It doesn't
seem to understand seqlocks.

> 	-	nextseq = 0;
> 	+	seq = 0;

Perhaps an init function or macro that hides this bit?

	void init_read_seqlock(int *seq)
	{
		*seq = 0;
	}

	init_read_seqlock(&seq);

or:

	#define INIT_READ_SEQBEGIN 0

	seq = INIT_READ_SEQBEGIN;
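
Plugged into the need_seqretry_xxx() variant from the RFC, that would give
something like (sketch, using the thread_group_cputime() lock as an example):

	seq = INIT_READ_SEQBEGIN;
	do {
		flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
		/* read-side critical section */
	} while (need_seqretry_xxx(&sig->stats_lock, &seq));
	done_seqretry_irqrestore(&sig->stats_lock, seq, flags);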

Though if we can fold the whole loop inside a macro, that might make it easier
to use.

d_walk() in fs/dcache.c might give you issues, though.

David
Re: [RFC 2/1] seqlock: make the read_seqbegin_or_lock() API more simple and less error-prone ?
Posted by Oleg Nesterov 4 months, 1 week ago
On 09/30, David Howells wrote:
>
> It would also be nice to fix the static lock-balance detection stuff that you
> get when you enable advanced checking during a kernel build.  It doesn't
> seem to understand seqlocks.
>
> > 	-	nextseq = 0;
> > 	+	seq = 0;
>
> Perhaps an init function or macro that hides this bit?
>
> 	void init_read_seqlock(int *seq)
> 	{
> 		*seq = 0;
> 	}
>
> 	init_read_seqlock(&seq);
>
> or:
>
> 	#define INIT_READ_SEQBEGIN 0
>
> 	seq = INIT_READ_SEQBEGIN;

Agreed,

> Though if we can fold the whole loop inside a macro, that might make it easier
> to use.

Yes...

> d_walk() in fs/dcache.c might give you issues, though.

Yes. But note that it can use need_seqretry_xxx() too; see the patch below.

So, do you think it makes any sense to add the new helper for a start? Can you
suggest a better name, say, check_restart()?
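
For reference, need_seqretry_xxx() in the patch below is the helper proposed
at the start of this thread:

	static inline int need_seqretry_xxx(seqlock_t *lock, int *seq)
	{
		int ret = !(*seq & 1) && read_seqretry(lock, *seq);

		if (ret)
			++*seq;	/* make this counter odd */

		return ret;
	}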

Oleg.

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1355,7 +1355,7 @@ static void d_walk(struct dentry *parent, void *data,
 		spin_lock(&this_parent->d_lock);
 
 		/* might go back up the wrong parent if we have had a rename. */
-		if (need_seqretry(&rename_lock, seq))
+		if (need_seqretry_xxx(&rename_lock, &seq))
 			goto rename_retry;
 		/* go into the first sibling still alive */
 		hlist_for_each_entry_continue(dentry, d_sib) {
@@ -1366,7 +1366,7 @@ static void d_walk(struct dentry *parent, void *data,
 		}
 		goto ascend;
 	}
-	if (need_seqretry(&rename_lock, seq))
+	if (need_seqretry_xxx(&rename_lock, &seq))
 		goto rename_retry;
 	rcu_read_unlock();
 
@@ -1378,11 +1378,8 @@ static void d_walk(struct dentry *parent, void *data,
 rename_retry:
 	spin_unlock(&this_parent->d_lock);
 	rcu_read_unlock();
-	BUG_ON(seq & 1);
-	if (!retry)
-		return;
-	seq = 1;
-	goto again;
+	if (retry)
+		goto again;
 }
 
 struct check_mount {
RE: [RFC 2/1] seqlock: make the read_seqbegin_or_lock() API more simple and less error-prone ?
Posted by Li,Rongqing 4 months, 1 week ago

> Another problem is that this API is error prone. Two years ago half of the
> read_seqbegin_or_lock() users were buggy (they followed the wrong example
> from Documentation/locking/seqlock.rst). And even if the code is mostly
> correct it is very easy to add a hard-to-detect mistake, see for example
> 
> 	[PATCH][v3] afs: Remove erroneous seq |= 1 in volume lookup loop
> 	https://lore.kernel.org/all/20250910084235.2630-1-lirongqing@baidu.com/
> 
> Can we improve this API?
> 
> -------------------------------------------------------------------------------
> To simplify, suppose we add the new helper
> 
> 	static inline int need_seqretry_xxx(seqlock_t *lock, int *seq)
> 	{
> 		int ret = !(*seq & 1) && read_seqretry(lock, *seq);
> 
> 		if (ret)
> 			++*seq;	/* make this counter odd */
> 
> 		return ret;
> 	}
> 
> which can be used instead of need_seqretry(). This way the users do not need
> to modify "seq" in the main loop. For example, we can simplify
> thread_group_cputime() as follows:
> 
> 	--- a/kernel/sched/cputime.c
> 	+++ b/kernel/sched/cputime.c
> 	@@ -314,7 +314,7 @@ void thread_group_cputime(struct task_struct
> *tsk, struct task_cputime *times)
> 		struct signal_struct *sig = tsk->signal;
> 		u64 utime, stime;
> 		struct task_struct *t;
> 	-	unsigned int seq, nextseq;
> 	+	unsigned int seq;
> 		unsigned long flags;
> 
> 		/*
> 	@@ -330,9 +330,8 @@ void thread_group_cputime(struct task_struct
> *tsk, struct task_cputime *times)
> 
> 		rcu_read_lock();
> 		/* Attempt a lockless read on the first round. */
> 	-	nextseq = 0;
> 	+	seq = 0;
> 		do {
> 	-		seq = nextseq;
> 			flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
> 			times->utime = sig->utime;
> 			times->stime = sig->stime;
> 	@@ -344,9 +343,7 @@ void thread_group_cputime(struct task_struct
> *tsk, struct task_cputime *times)
> 				times->stime += stime;
> 				times->sum_exec_runtime += read_sum_exec_runtime(t);
> 			}
> 	-		/* If lockless access failed, take the lock. */
> 	-		nextseq = 1;
> 	-	} while (need_seqretry(&sig->stats_lock, seq));
> 	+	} while (need_seqretry_xxx(&sig->stats_lock, &seq));
> 		done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
> 		rcu_read_unlock();
> 	 }
> 

If this API can be simplified, it should prevent future errors.

I submitted a patch, inspired by your previous patch, in the hope that all places
can use one fixed syntax and avoid such errors in the future:

https://lkml.org/lkml/2025/7/31/616

thanks

-Li

> most (if not all) of other users can be changed the same way.
> 
> -------------------------------------------------------------------------------
> Or perhaps we can even add a helper that hides all the details, something like
> 
> 	int xxx(seqlock_t *lock, int *seq, int lockless)
> 	{
> 		if (lockless) {
> 			*seq = read_seqbegin(lock);
> 			return 1;
> 		}
> 
> 		if (*seq & 1) {
> 			read_sequnlock_excl(lock);
> 			return 0;
> 		}
> 
> 		if (read_seqretry(lock, *seq)) {
> 			read_seqlock_excl(lock);
> 			*seq = 1;
> 			return 1;
> 		}
> 
> 		return 0;
> 
> 	}
> 
> 	#define __XXX(lock, seq, lockless)	\
> 		for (int lockless = 1, seq; xxx(lock, &seq, lockless); lockless = 0)
> 
> 	#define XXX(lock)	\
> 		__XXX(lock, __UNIQUE_ID(seq), __UNIQUE_ID(lockless))
> 
> 
> ?
> 
> This way the users can simply do:
> 
> 	seqlock_t sl;
> 
> 	void func(void)
> 	{
> 		XXX(&sl) {
> 			... read-side critical section ...
> 		}
> 	}
> 
> using only the new XXX() helper. No need to declare/initialize seq, no need for
> need_seqretry/done_seqretry.
> 
> What do you think?
> 
> Oleg.
Re: [RFC 2/1] seqlock: make the read_seqbegin_or_lock() API more simple and less error-prone ?
Posted by Oleg Nesterov 4 months, 1 week ago
On 09/29, Li,Rongqing wrote:
>
> > Another problem is that this API is error prone. Two years ago half of the
> > read_seqbegin_or_lock() users were buggy (they followed the wrong example
> > from Documentation/locking/seqlock.rst). And even if the code is mostly
> > correct it is very easy to add a hard-to-detect mistake, see for example
> >
> > 	[PATCH][v3] afs: Remove erroneous seq |= 1 in volume lookup loop
> > 	https://lore.kernel.org/all/20250910084235.2630-1-lirongqing@baidu.com/
> >
> > Can we improve this API?
> >
> > -------------------------------------------------------------------------------
> > To simplify, suppose we add the new helper
> >
> > 	static inline int need_seqretry_xxx(seqlock_t *lock, int *seq)
> > 	{
> > 		int ret = !(*seq & 1) && read_seqretry(lock, *seq);
> >
> > 		if (ret)
> > 			++*seq;	/* make this counter odd */
> >
> > 		return ret;
> > 	}
> >
> > which can be used instead of need_seqretry(). This way the users do not need
> > to modify "seq" in the main loop. For example, we can simplify
> > thread_group_cputime() as follows:
> >
> > 	--- a/kernel/sched/cputime.c
> > 	+++ b/kernel/sched/cputime.c
> > 	@@ -314,7 +314,7 @@ void thread_group_cputime(struct task_struct
> > *tsk, struct task_cputime *times)
> > 		struct signal_struct *sig = tsk->signal;
> > 		u64 utime, stime;
> > 		struct task_struct *t;
> > 	-	unsigned int seq, nextseq;
> > 	+	unsigned int seq;
> > 		unsigned long flags;
> >
> > 		/*
> > 	@@ -330,9 +330,8 @@ void thread_group_cputime(struct task_struct
> > *tsk, struct task_cputime *times)
> >
> > 		rcu_read_lock();
> > 		/* Attempt a lockless read on the first round. */
> > 	-	nextseq = 0;
> > 	+	seq = 0;
> > 		do {
> > 	-		seq = nextseq;
> > 			flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
> > 			times->utime = sig->utime;
> > 			times->stime = sig->stime;
> > 	@@ -344,9 +343,7 @@ void thread_group_cputime(struct task_struct
> > *tsk, struct task_cputime *times)
> > 				times->stime += stime;
> > 				times->sum_exec_runtime += read_sum_exec_runtime(t);
> > 			}
> > 	-		/* If lockless access failed, take the lock. */
> > 	-		nextseq = 1;
> > 	-	} while (need_seqretry(&sig->stats_lock, seq));
> > 	+	} while (need_seqretry_xxx(&sig->stats_lock, &seq));
> > 		done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
> > 		rcu_read_unlock();
> > 	 }
> >
>
> If this API can be simplified, it should prevent future errors.
>
> I submitted a patch, inspired by your previous patch, in the hope that all places
> can use one fixed syntax and avoid such errors in the future:
>
> https://lkml.org/lkml/2025/7/31/616

Well, I am not sure it makes a lot of sense to change thread_group_cputime()
this way; "nextseq" vs the "seq++" trick is a matter of taste. I tried to
suggest a simplified API that avoids the manipulation of "seq" altogether.

Oleg.

> > most (if not all) of other users can be changed the same way.
> >
> > -------------------------------------------------------------------------------
> > Or perhaps we can even add a helper that hides all the details, something like
> >
> > 	int xxx(seqlock_t *lock, int *seq, int lockless)
> > 	{
> > 		if (lockless) {
> > 			*seq = read_seqbegin(lock);
> > 			return 1;
> > 		}
> >
> > 		if (*seq & 1) {
> > 			read_sequnlock_excl(lock);
> > 			return 0;
> > 		}
> >
> > 		if (read_seqretry(lock, *seq)) {
> > 			read_seqlock_excl(lock);
> > 			*seq = 1;
> > 			return 1;
> > 		}
> >
> > 		return 0;
> >
> > 	}
> >
> > 	#define __XXX(lock, seq, lockless)	\
> > 		for (int lockless = 1, seq; xxx(lock, &seq, lockless); lockless = 0)
> >
> > 	#define XXX(lock)	\
> > 		__XXX(lock, __UNIQUE_ID(seq), __UNIQUE_ID(lockless))
> >
> >
> > ?
> >
> > This way the users can simply do:
> >
> > 	seqlock_t sl;
> >
> > 	void func(void)
> > 	{
> > 		XXX(&sl) {
> > 			... read-side critical section ...
> > 		}
> > 	}
> >
> > using only the new XXX() helper. No need to declare/initialize seq, no need for
> > need_seqretry/done_seqretry.
> >
> > What do you think?
> >
> > Oleg.
>