`sequence` is a concurrently accessed shared variable on the reader
side. Therefore, it needs to be wrapped in WRITE_ONCE() in order to
prevent unwanted compiler optimizations like store tearing.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
include/linux/seqlock.h | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 5298765d6ca4..f4c6f2507742 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -45,7 +45,7 @@ static inline void __seqcount_init(seqcount_t *s, const char *name,
* Make sure we are not reinitializing a held lock:
*/
lockdep_init_map(&s->dep_map, name, key, 0);
- s->sequence = 0;
+ WRITE_ONCE(s->sequence, 0);
}
#ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -405,7 +405,7 @@ do { \
static inline void do_raw_write_seqcount_begin(seqcount_t *s)
{
kcsan_nestable_atomic_begin();
- s->sequence++;
+ WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
smp_wmb();
}
@@ -426,7 +426,7 @@ do { \
static inline void do_raw_write_seqcount_end(seqcount_t *s)
{
smp_wmb();
- s->sequence++;
+ WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
kcsan_nestable_atomic_end();
}
@@ -548,9 +548,9 @@ static inline void do_write_seqcount_end(seqcount_t *s)
static inline void do_raw_write_seqcount_barrier(seqcount_t *s)
{
kcsan_nestable_atomic_begin();
- s->sequence++;
+ WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
smp_wmb();
- s->sequence++;
+ WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
kcsan_nestable_atomic_end();
}
@@ -569,7 +569,7 @@ static inline void do_write_seqcount_invalidate(seqcount_t *s)
{
smp_wmb();
kcsan_nestable_atomic_begin();
- s->sequence+=2;
+ WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 2);
kcsan_nestable_atomic_end();
}
@@ -673,7 +673,7 @@ read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)
static __always_inline void raw_write_seqcount_latch(seqcount_latch_t *s)
{
smp_wmb(); /* prior stores before incrementing "sequence" */
- s->seqcount.sequence++;
+ WRITE_ONCE(s->seqcount.sequence, READ_ONCE(s->seqcount.sequence) + 1);
smp_wmb(); /* increment "sequence" before following stores */
}
--
2.47.0
Hi all,

Thanks for the very interesting discussion - I learned quite a few things.

Please let me know if the following is a correct summary:

1. The intent of the change is valid. New compiler optimizations may break
   current assumptions.

2. My proposed change is unacceptable due to bad codegen from GCC. Clang is
   OK b/c it can optimize the `WRITE_ONCE(.., READ_ONCE(..) + 1)` into a
   single `incl` instruction.

3. The counter-proposed INC_ONCE() change from Peter is also incorrect,
   despite codegen for both clang and GCC currently being correct. There is
   no guarantee that the GCC variant of INC_ONCE() will continue to work in
   the future.

It sounds like the next step for this is to follow up w/ GCC folks about
improving codegen for `WRITE_ONCE(.., READ_ONCE(..) + 1)`.

Does that sound right?

Thanks,
Daniel
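For reference, a self-contained test case for that GCC follow-up could look
roughly like the sketch below. The macro definitions are simplified stand-ins
for the kernel ones (enough for codegen comparisons on godbolt), and the
function names are made up for illustration:

/* Simplified stand-ins for the kernel macros, for codegen tests only. */
#define READ_ONCE(x)		(*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile typeof(x) *)&(x) = (val))

struct seq { unsigned int sequence; };

/* What seqlock.h does today. */
void inc_plain(struct seq *s)
{
	s->sequence++;
}

/* Peter's volatile-cast variant. */
void inc_cast(struct seq *s)
{
	(*(volatile unsigned int *)&s->sequence)++;
}

/* The form proposed in this patch. */
void inc_once(struct seq *s)
{
	WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
}

Comparing what GCC and Clang emit for these three functions reproduces the
behaviour discussed below: Clang folds all of them into a single to-memory
increment, while GCC keeps separate loads and stores for the volatile forms.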
On Tue, Dec 17, 2024 at 03:17:36PM -0800, Daniel Xu wrote:
> `sequence` is a concurrently accessed shared variable on the reader
> side. Therefore, it needs to be wrapped in WRITE_ONCE() in order to
> prevent unwanted compiler optimizations like store tearing.
>
> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> ---
> include/linux/seqlock.h | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> index 5298765d6ca4..f4c6f2507742 100644
> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -45,7 +45,7 @@ static inline void __seqcount_init(seqcount_t *s, const char *name,
> * Make sure we are not reinitializing a held lock:
> */
> lockdep_init_map(&s->dep_map, name, key, 0);
> - s->sequence = 0;
> + WRITE_ONCE(s->sequence, 0);
> }
>
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> @@ -405,7 +405,7 @@ do { \
> static inline void do_raw_write_seqcount_begin(seqcount_t *s)
> {
> kcsan_nestable_atomic_begin();
> - s->sequence++;
> + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> smp_wmb();
> }
This results in significantly worse code-gen, it will change an inc to
memory with a load,inc,store.
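(Roughly, for x86-64 the difference being described here is the one sketched
below; this is an illustration based on the godbolt results mentioned later
in the thread, not authoritative compiler output.)

struct seq { unsigned int sequence; };

/* Today's code: a single read-modify-write to memory, e.g. "addl $1, (%rdi)". */
void begin_plain(struct seq *s)
{
	s->sequence++;
}

/*
 * WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1), hand expanded.
 * GCC currently keeps the volatile accesses separate:
 *	movl	(%rdi), %eax
 *	addl	$1, %eax
 *	movl	%eax, (%rdi)
 * while Clang folds it back into the single to-memory increment.
 */
void begin_once(struct seq *s)
{
	*(volatile unsigned int *)&s->sequence =
			*(volatile unsigned int *)&s->sequence + 1;
}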
On Wed, Dec 18, 2024 at 11:30:00AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 17, 2024 at 03:17:36PM -0800, Daniel Xu wrote:
> > `sequence` is a concurrently accessed shared variable on the reader
> > side. Therefore, it needs to be wrapped in WRITE_ONCE() in order to
> > prevent unwanted compiler optimizations like store tearing.
> >
> > Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> > ---
> > include/linux/seqlock.h | 14 +++++++-------
> > 1 file changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> > index 5298765d6ca4..f4c6f2507742 100644
> > --- a/include/linux/seqlock.h
> > +++ b/include/linux/seqlock.h
> > @@ -45,7 +45,7 @@ static inline void __seqcount_init(seqcount_t *s, const char *name,
> > * Make sure we are not reinitializing a held lock:
> > */
> > lockdep_init_map(&s->dep_map, name, key, 0);
> > - s->sequence = 0;
> > + WRITE_ONCE(s->sequence, 0);
> > }
> >
> > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > @@ -405,7 +405,7 @@ do { \
> > static inline void do_raw_write_seqcount_begin(seqcount_t *s)
> > {
> > kcsan_nestable_atomic_begin();
> > - s->sequence++;
> > + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> > smp_wmb();
> > }
>
> This results in significantly worse code-gen, it will change an inc to
> memory with a load,inc,store.
Isn't that code-generation bug in the process of being fixed? And,
either way, given the likely cache miss, should we really care?
Thanx, Paul
On Wed, Dec 18, 2024 at 07:43:41AM -0800, Paul E. McKenney wrote:
> On Wed, Dec 18, 2024 at 11:30:00AM +0100, Peter Zijlstra wrote:
> > On Tue, Dec 17, 2024 at 03:17:36PM -0800, Daniel Xu wrote:
> > > @@ -405,7 +405,7 @@ do { \
> > > static inline void do_raw_write_seqcount_begin(seqcount_t *s)
> > > {
> > > kcsan_nestable_atomic_begin();
> > > - s->sequence++;
> > > + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> > > smp_wmb();
> > > }
> >
> > This results in significantly worse code-gen, it will change an inc to
> > memory with a load,inc,store.
>
> Isn't that code-generation bug in the process of being fixed?
Last time I looked the compiler wasn't allowed to touch it because of
all the volatile going around. Did anything change?
> And, either way, given the likely cache miss, should we really care?
Yeah, extra register pressure too.
On Wed, Dec 18, 2024 at 05:23:25PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 18, 2024 at 07:43:41AM -0800, Paul E. McKenney wrote:
> > On Wed, Dec 18, 2024 at 11:30:00AM +0100, Peter Zijlstra wrote:
> > > On Tue, Dec 17, 2024 at 03:17:36PM -0800, Daniel Xu wrote:
>
>
> > > > @@ -405,7 +405,7 @@ do { \
> > > > static inline void do_raw_write_seqcount_begin(seqcount_t *s)
> > > > {
> > > > kcsan_nestable_atomic_begin();
> > > > - s->sequence++;
> > > > + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> > > > smp_wmb();
> > > > }
> > >
> > > This results in significantly worse code-gen, it will change an inc to
> > > memory with a load,inc,store.
> >
> > Isn't that code-generation bug in the process of being fixed?
>
> Last time I looked the compiler wasn't allowed to touch it because of
> all the volatile going around. Did anything change?
>
> > And, either way, given the likely cache miss, should we really care?
>
> Yeah, extra register pressure too.
Perhaps something like: (*(volatile unsigned int *)&s->sequence)++; ?
I'd have to check what the compiler makes of that.
/me mucks about with godbolt for a bit...
GCC doesn't optimize that, but Clang does.
I would still very much refrain from making this change until both
compilers can generate sane code for it.
On Wed, Dec 18, 2024 at 05:29:34PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 18, 2024 at 05:23:25PM +0100, Peter Zijlstra wrote:
> > On Wed, Dec 18, 2024 at 07:43:41AM -0800, Paul E. McKenney wrote:
> > > On Wed, Dec 18, 2024 at 11:30:00AM +0100, Peter Zijlstra wrote:
> > > > On Tue, Dec 17, 2024 at 03:17:36PM -0800, Daniel Xu wrote:
> >
> >
> > > > > @@ -405,7 +405,7 @@ do { \
> > > > > static inline void do_raw_write_seqcount_begin(seqcount_t *s)
> > > > > {
> > > > > kcsan_nestable_atomic_begin();
> > > > > - s->sequence++;
> > > > > + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> > > > > smp_wmb();
> > > > > }
> > > >
> > > > This results in significantly worse code-gen, it will change an inc to
> > > > memory with a load,inc,store.
> > >
> > > Isn't that code-generation bug in the process of being fixed?
> >
> > Last time I looked the compiler wasn't allowed to touch it because of
> > all the volatile going around. Did anything change?
> >
> > > And, either way, given the likely cache miss, should we really care?
> >
> > Yeah, extra register pressure too.
OK, fair point.
> Perhaps something like: (*(volatile unsigned int *)&s->sequence)++; ?
> I'd have to check what the compiler makes of that.
>
> /me mucks about with godbolt for a bit...
>
> GCC doesn't optimize that, but Clang does.
>
> I would still very much refrain from making this change until both
> compilers can generate sane code for it.
Is GCC on track to do this, or do we need to encourage them?
Just to make sure I understand, your proposal is to create an INC_ONCE()
or similar, and add it once compiler support is there? Seems reasonable
to me, just checking.
Thanx, Paul
+linux-toolchains
On Wed, Dec 18, 2024 at 08:59:47AM -0800, Paul E. McKenney wrote:
> > Perhaps something like: (*(volatile unsigned int *)&s->sequence)++; ?
> > I'd have to check what the compiler makes of that.
> >
> > /me mucks about with godbolt for a bit...
> >
> > GCC doesn't optimize that, but Clang does.
> >
> > I would still very much refrain from making this change until both
> > compilers can generate sane code for it.
>
> Is GCC on track to do this, or do we need to encourage them?
I have no clue; probably wise to offer encouragement.
> Just to make sure I understand, your proposal is to create an INC_ONCE()
> or similar, and add it once compiler support is there? Seems reasonable
> to me, just checking.
I suppose we can work in parallel, add INC_ONCE() now, but not have a
proper definition for GCC.
---
arch/riscv/kvm/vmid.c | 2 +-
arch/s390/kernel/idle.c | 2 +-
drivers/md/dm-vdo/indexer/index-session.c | 2 +-
fs/xfs/libxfs/xfs_iext_tree.c | 2 +-
include/linux/compiler-gcc.h | 5 +++++
include/linux/compiler.h | 4 ++++
include/linux/rcupdate_trace.h | 2 +-
include/linux/srcutiny.h | 2 +-
io_uring/io_uring.c | 2 +-
kernel/rcu/srcutree.c | 4 ++--
kernel/rcu/tree_plugin.h | 2 +-
kernel/sched/fair.c | 2 +-
mm/kfence/kfence_test.c | 2 +-
security/apparmor/apparmorfs.c | 2 +-
14 files changed, 22 insertions(+), 13 deletions(-)
diff --git a/arch/riscv/kvm/vmid.c b/arch/riscv/kvm/vmid.c
index ddc98714ce8e..805a5acf669c 100644
--- a/arch/riscv/kvm/vmid.c
+++ b/arch/riscv/kvm/vmid.c
@@ -90,7 +90,7 @@ void kvm_riscv_gstage_vmid_update(struct kvm_vcpu *vcpu)
/* First user of a new VMID version? */
if (unlikely(vmid_next == 0)) {
- WRITE_ONCE(vmid_version, READ_ONCE(vmid_version) + 1);
+ INC_ONCE(vmid_version);
vmid_next = 1;
/*
diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c
index 39cb8d0ae348..8fb7cd75fe62 100644
--- a/arch/s390/kernel/idle.c
+++ b/arch/s390/kernel/idle.c
@@ -45,7 +45,7 @@ void account_idle_time_irq(void)
/* Account time spent with enabled wait psw loaded as idle time. */
WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time);
- WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1);
+ INC_ONC(idle->idle_count);
account_idle_time(cputime_to_nsecs(idle_time));
}
diff --git a/drivers/md/dm-vdo/indexer/index-session.c b/drivers/md/dm-vdo/indexer/index-session.c
index aee0914d604a..c5a7dee9dc66 100644
--- a/drivers/md/dm-vdo/indexer/index-session.c
+++ b/drivers/md/dm-vdo/indexer/index-session.c
@@ -152,7 +152,7 @@ static void enter_callback_stage(struct uds_request *request)
static inline void count_once(u64 *count_ptr)
{
- WRITE_ONCE(*count_ptr, READ_ONCE(*count_ptr) + 1);
+ INC_ONCE(*count_ptr);
}
static void update_session_stats(struct uds_request *request)
diff --git a/fs/xfs/libxfs/xfs_iext_tree.c b/fs/xfs/libxfs/xfs_iext_tree.c
index 8796f2b3e534..a1fcd4cf2424 100644
--- a/fs/xfs/libxfs/xfs_iext_tree.c
+++ b/fs/xfs/libxfs/xfs_iext_tree.c
@@ -626,7 +626,7 @@ xfs_iext_realloc_root(
*/
static inline void xfs_iext_inc_seq(struct xfs_ifork *ifp)
{
- WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1);
+ INC_ONCE(ifp->if_seq);
}
void
diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index c9b58188ec61..77c253e29758 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -137,3 +137,8 @@
#if GCC_VERSION < 90100
#undef __alloc_size__
#endif
+
+/*
+ * GCC can't properly optimize the real one with volatile on.
+ */
+#define INC_ONCE(v) (v)++
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index efd43df3a99a..b1b13dac1b9e 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -8,6 +8,10 @@
#ifdef __KERNEL__
+#ifndef INC_ONCE
+#define INC_ONCE(v) (*(volatile typeof(v) *)&(v))++
+#endif
+
/*
* Note: DISABLE_BRANCH_PROFILING can be used by special lowlevel code
* to disable branch tracing on a per file basis.
diff --git a/include/linux/rcupdate_trace.h b/include/linux/rcupdate_trace.h
index e6c44eb428ab..adb12e7304da 100644
--- a/include/linux/rcupdate_trace.h
+++ b/include/linux/rcupdate_trace.h
@@ -50,7 +50,7 @@ static inline void rcu_read_lock_trace(void)
{
struct task_struct *t = current;
- WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + 1);
+ INC_ONCE(t->trc_reader_nesting);
barrier();
if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) &&
t->trc_reader_special.b.need_mb)
diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h
index 1321da803274..bd4362323733 100644
--- a/include/linux/srcutiny.h
+++ b/include/linux/srcutiny.h
@@ -66,7 +66,7 @@ static inline int __srcu_read_lock(struct srcu_struct *ssp)
preempt_disable(); // Needed for PREEMPT_AUTO
idx = ((READ_ONCE(ssp->srcu_idx) + 1) & 0x2) >> 1;
- WRITE_ONCE(ssp->srcu_lock_nesting[idx], READ_ONCE(ssp->srcu_lock_nesting[idx]) + 1);
+ INC_ONCE(ssp->srcu_lock_nesting[idx]);
preempt_enable();
return idx;
}
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 06ff41484e29..ef3d4871e775 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -373,7 +373,7 @@ static void io_account_cq_overflow(struct io_ring_ctx *ctx)
{
struct io_rings *r = ctx->rings;
- WRITE_ONCE(r->cq_overflow, READ_ONCE(r->cq_overflow) + 1);
+ INC_ONCE(r->cq_overflow);
ctx->cq_extra--;
}
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 5e2e53464794..a812af81dbff 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -656,7 +656,7 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp)
jbase += j - gpstart;
if (!jbase) {
ASSERT_EXCLUSIVE_WRITER(sup->srcu_n_exp_nodelay);
- WRITE_ONCE(sup->srcu_n_exp_nodelay, READ_ONCE(sup->srcu_n_exp_nodelay) + 1);
+ INC_ONCE(sup->srcu_n_exp_nodelay);
if (READ_ONCE(sup->srcu_n_exp_nodelay) > srcu_max_nodelay_phase)
jbase = 1;
}
@@ -1856,7 +1856,7 @@ static void process_srcu(struct work_struct *work)
j = jiffies;
if (READ_ONCE(sup->reschedule_jiffies) == j) {
ASSERT_EXCLUSIVE_WRITER(sup->reschedule_count);
- WRITE_ONCE(sup->reschedule_count, READ_ONCE(sup->reschedule_count) + 1);
+ INC_ONCE(sup->reschedule_count);
if (READ_ONCE(sup->reschedule_count) > srcu_max_nodelay)
curdelay = 1;
} else {
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 3927ea5f7955..51f525089c27 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -387,7 +387,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
static void rcu_preempt_read_enter(void)
{
- WRITE_ONCE(current->rcu_read_lock_nesting, READ_ONCE(current->rcu_read_lock_nesting) + 1);
+ INC_ONCE(current->rcu_read_lock_nesting);
}
static int rcu_preempt_read_exit(void)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f5329672815b..f0927407abbe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3242,7 +3242,7 @@ static void reset_ptenuma_scan(struct task_struct *p)
* statistical sampling. Use READ_ONCE/WRITE_ONCE, which are not
* expensive, to avoid any form of compiler optimizations:
*/
- WRITE_ONCE(p->mm->numa_scan_seq, READ_ONCE(p->mm->numa_scan_seq) + 1);
+ INC_ONCE(p->mm->numa_scan_seq);
p->mm->numa_scan_offset = 0;
}
diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
index f65fb182466d..e0bf31b1875e 100644
--- a/mm/kfence/kfence_test.c
+++ b/mm/kfence/kfence_test.c
@@ -517,7 +517,7 @@ static void test_kmalloc_aligned_oob_write(struct kunit *test)
* fault immediately after it.
*/
expect.addr = buf + size;
- WRITE_ONCE(*expect.addr, READ_ONCE(*expect.addr) + 1);
+ INC_ONCE(*expect.addr);
KUNIT_EXPECT_FALSE(test, report_available());
test_free(buf);
KUNIT_EXPECT_TRUE(test, report_matches(&expect));
diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index 2c0185ebc900..63f5c8387764 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -596,7 +596,7 @@ static __poll_t ns_revision_poll(struct file *file, poll_table *pt)
void __aa_bump_ns_revision(struct aa_ns *ns)
{
- WRITE_ONCE(ns->revision, READ_ONCE(ns->revision) + 1);
+ INC_ONCE(ns->revision);
wake_up_interruptible(&ns->wait);
}
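With the definitions above, the seqlock.h hunks from the original patch would
then presumably become something like the following (sketch only, not part of
the posted diff):

static inline void do_raw_write_seqcount_begin(seqcount_t *s)
{
	kcsan_nestable_atomic_begin();
	INC_ONCE(s->sequence);
	smp_wmb();
}

static inline void do_raw_write_seqcount_end(seqcount_t *s)
{
	smp_wmb();
	INC_ONCE(s->sequence);
	kcsan_nestable_atomic_end();
}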
On Wed, Dec 18, 2024 at 06:12:41PM +0100, Peter Zijlstra wrote:
>
> +linux-toolchains
>
> On Wed, Dec 18, 2024 at 08:59:47AM -0800, Paul E. McKenney wrote:
>
> > > Perhaps something like: (*(volatile unsigned int *)&s->sequence)++; ?
> > > I'd have to check what the compiler makes of that.
> > >
> > > /me mucks about with godbolt for a bit...
> > >
> > > GCC doesn't optimize that, but Clang does.
> > >
> > > I would still very much refrain from making this change until both
> > > compilers can generate sane code for it.
> >
> > Is GCC on track to do this, or do we need to encourage them?
>
> I have no clue; probably wise to offer encouragement.
>
> > Just to make sure I understand, your proposal is to create an INC_ONCE()
> > or similar, and add it once compiler support is there? Seems reasonable
> > to me, just checking.
>
> I suppose we can work in parallel, add INC_ONCE() now, but not have a
> proper definition for GCC.

[...]

> diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c
> index 39cb8d0ae348..8fb7cd75fe62 100644
> --- a/arch/s390/kernel/idle.c
> +++ b/arch/s390/kernel/idle.c
> @@ -45,7 +45,7 @@ void account_idle_time_irq(void)
>
> /* Account time spent with enabled wait psw loaded as idle time. */
> WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time);
> - WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1);
> + INC_ONC(idle->idle_count);

(nit: drive-by typo)

[...]

> diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> index c9b58188ec61..77c253e29758 100644
> --- a/include/linux/compiler-gcc.h
> +++ b/include/linux/compiler-gcc.h
> @@ -137,3 +137,8 @@
> #if GCC_VERSION < 90100
> #undef __alloc_size__
> #endif
> +
> +/*
> + * GCC can't properly optimize the real one with volatile on.
> + */
> +#define INC_ONCE(v) (v)++

I think I'm missing what we're trying to do here, but how is this a
safe fallback? I'd have thought we'd need the whole READ_ONCE() +
WRITE_ONCE() sequence if the compiler doesn't give us what we need...

> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index efd43df3a99a..b1b13dac1b9e 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -8,6 +8,10 @@
>
> #ifdef __KERNEL__
>
> +#ifndef INC_ONCE
> +#define INC_ONCE(v) (*(volatile typeof(v) *)&(v))++
> +#endif

... but I'm honestly not sure what this is aiming to elide. Is this about
having the compiler emit an increment instruction with memory operands
on architectures that support it? If so, what about architectures that
_don't_ support that? We still need to guarantee that the load/store
instructions are single-copy atomic for the type in that case.

Given how hard it can be to get the compiler to break atomicity for
plain accesses, I'm pretty worried about the robustness of this.

Will
On Thu, Dec 19, 2024 at 04:34:55PM +0000, Will Deacon wrote: > On Wed, Dec 18, 2024 at 06:12:41PM +0100, Peter Zijlstra wrote: > > > > +linux-toolchains > > > > On Wed, Dec 18, 2024 at 08:59:47AM -0800, Paul E. McKenney wrote: > > > > > > Perhaps something like: (*(volatile unsigned int *)&s->sequence)++; ? > > > > I'd have to check what the compiler makes of that. > > > > > > > > /me mucks about with godbolt for a bit... > > > > > > > > GCC doesn't optimize that, but Clang does. > > > > > > > > I would still very much refrain from making this change until both > > > > compilers can generate sane code for it. > > > > > > Is GCC on track to do this, or do we need to encourage them? > > > > I have no clue; probably wise to offer encouragement. > > > > > Just to make sure I understand, your proposal is to create an INC_ONCE() > > > or similar, and add it once compiler support is there? Seems reasonable > > > to me, just checking. > > > > I suppose we can work in parallel, add INC_ONCE() now, but not have a > > proper definition for GCC. > > [...] > > > diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c > > index 39cb8d0ae348..8fb7cd75fe62 100644 > > --- a/arch/s390/kernel/idle.c > > +++ b/arch/s390/kernel/idle.c > > @@ -45,7 +45,7 @@ void account_idle_time_irq(void) > > > > /* Account time spent with enabled wait psw loaded as idle time. */ > > WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time); > > - WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1); > > + INC_ONC(idle->idle_count); > > (nit: drive-by typo) Heh! I was clearly building with GCC. ;-) > [...] > > > diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h > > index c9b58188ec61..77c253e29758 100644 > > --- a/include/linux/compiler-gcc.h > > +++ b/include/linux/compiler-gcc.h > > @@ -137,3 +137,8 @@ > > #if GCC_VERSION < 90100 > > #undef __alloc_size__ > > #endif > > + > > +/* > > + * GCC can't properly optimize the real one with volatile on. > > + */ > > +#define INC_ONCE(v) (v)++ > > I think I'm missing what we're trying to do here, but how is this a > safe fallback? I'd have thought we'd need the whole READ_ONCE() + > WRITE_ONCE() sequence if the compiler doesn't give us what we need... > > > diff --git a/include/linux/compiler.h b/include/linux/compiler.h > > index efd43df3a99a..b1b13dac1b9e 100644 > > --- a/include/linux/compiler.h > > +++ b/include/linux/compiler.h > > @@ -8,6 +8,10 @@ > > > > #ifdef __KERNEL__ > > > > +#ifndef INC_ONCE > > +#define INC_ONCE(v) (*(volatile typeof(v) *)&(v))++ > > +#endif > > ... but I'm honestly not sure what this is aiming to elide. Is this about > having the compiler emit an increment instruction with memory operands > on architectures that support it? If so, what about architectures that > _don't_ support that? We still need to guarantee that the load/store > instructions are single-copy atomic for the type in that case. Architectures whose increment instructions lack memory operands can still load, increment, then store. So yes, if two of these run concurrently, you can lose counts when incrementing normal memory. (Of course, with device memory, you get what you get.) > Given how hard it can be to get the compiler to break atomicity for > plain accesses, I'm pretty worried about the robustness of this. Your point being that the current plain C-language postfix ++ is unlikely to be adversely optimized? If so, although I agree here in 2024, the years pass quickly and increasingly clever compiler optimizations accumulate. Thanx, Paul
On Thu, Dec 19, 2024 at 09:53:48AM -0800, Paul E. McKenney wrote: > On Thu, Dec 19, 2024 at 04:34:55PM +0000, Will Deacon wrote: > > On Wed, Dec 18, 2024 at 06:12:41PM +0100, Peter Zijlstra wrote: > > > diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c > > > index 39cb8d0ae348..8fb7cd75fe62 100644 > > > --- a/arch/s390/kernel/idle.c > > > +++ b/arch/s390/kernel/idle.c > > > @@ -45,7 +45,7 @@ void account_idle_time_irq(void) > > > > > > /* Account time spent with enabled wait psw loaded as idle time. */ > > > WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time); > > > - WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1); > > > + INC_ONC(idle->idle_count); > > > > (nit: drive-by typo) > > Heh! I was clearly building with GCC. ;-) And not for S390 :p > > > diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h > > > index c9b58188ec61..77c253e29758 100644 > > > --- a/include/linux/compiler-gcc.h > > > +++ b/include/linux/compiler-gcc.h > > > @@ -137,3 +137,8 @@ > > > #if GCC_VERSION < 90100 > > > #undef __alloc_size__ > > > #endif > > > + > > > +/* > > > + * GCC can't properly optimize the real one with volatile on. > > > + */ > > > +#define INC_ONCE(v) (v)++ > > > > I think I'm missing what we're trying to do here, but how is this a > > safe fallback? I'd have thought we'd need the whole READ_ONCE() + > > WRITE_ONCE() sequence if the compiler doesn't give us what we need... > > > > > diff --git a/include/linux/compiler.h b/include/linux/compiler.h > > > index efd43df3a99a..b1b13dac1b9e 100644 > > > --- a/include/linux/compiler.h > > > +++ b/include/linux/compiler.h > > > @@ -8,6 +8,10 @@ > > > > > > #ifdef __KERNEL__ > > > > > > +#ifndef INC_ONCE > > > +#define INC_ONCE(v) (*(volatile typeof(v) *)&(v))++ > > > +#endif > > > > ... but I'm honestly not sure what this is aiming to elide. Is this about > > having the compiler emit an increment instruction with memory operands > > on architectures that support it? If so, what about architectures that > > _don't_ support that? We still need to guarantee that the load/store > > instructions are single-copy atomic for the type in that case. > > Architectures whose increment instructions lack memory operands can still > load, increment, then store. So yes, if two of these run concurrently, > you can lose counts when incrementing normal memory. (Of course, with > device memory, you get what you get.) ... but can we guarantee that those load and store instructions won't now be torn? > > Given how hard it can be to get the compiler to break atomicity for > > plain accesses, I'm pretty worried about the robustness of this. > > Your point being that the current plain C-language postfix ++ is unlikely > to be adversely optimized? If so, although I agree here in 2024, > the years pass quickly and increasingly clever compiler optimizations > accumulate. Right, I think we're in agreement. I'm saying that just because this proposed INC_ONCE() macro appears to generate the assembly code we want, I don't have even an instinctive feeling that it's deliberate or something that we should/can rely upon. Will
On Thu, Dec 19, 2024 at 05:58:03PM +0000, Will Deacon wrote: > On Thu, Dec 19, 2024 at 09:53:48AM -0800, Paul E. McKenney wrote: > > On Thu, Dec 19, 2024 at 04:34:55PM +0000, Will Deacon wrote: > > > On Wed, Dec 18, 2024 at 06:12:41PM +0100, Peter Zijlstra wrote: > > > > diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c > > > > index 39cb8d0ae348..8fb7cd75fe62 100644 > > > > --- a/arch/s390/kernel/idle.c > > > > +++ b/arch/s390/kernel/idle.c > > > > @@ -45,7 +45,7 @@ void account_idle_time_irq(void) > > > > > > > > /* Account time spent with enabled wait psw loaded as idle time. */ > > > > WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time); > > > > - WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1); > > > > + INC_ONC(idle->idle_count); > > > > > > (nit: drive-by typo) > > > > Heh! I was clearly building with GCC. ;-) > > And not for S390 :p True enough! ;-) > > > > diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h > > > > index c9b58188ec61..77c253e29758 100644 > > > > --- a/include/linux/compiler-gcc.h > > > > +++ b/include/linux/compiler-gcc.h > > > > @@ -137,3 +137,8 @@ > > > > #if GCC_VERSION < 90100 > > > > #undef __alloc_size__ > > > > #endif > > > > + > > > > +/* > > > > + * GCC can't properly optimize the real one with volatile on. > > > > + */ > > > > +#define INC_ONCE(v) (v)++ > > > > > > I think I'm missing what we're trying to do here, but how is this a > > > safe fallback? I'd have thought we'd need the whole READ_ONCE() + > > > WRITE_ONCE() sequence if the compiler doesn't give us what we need... > > > > > > > diff --git a/include/linux/compiler.h b/include/linux/compiler.h > > > > index efd43df3a99a..b1b13dac1b9e 100644 > > > > --- a/include/linux/compiler.h > > > > +++ b/include/linux/compiler.h > > > > @@ -8,6 +8,10 @@ > > > > > > > > #ifdef __KERNEL__ > > > > > > > > +#ifndef INC_ONCE > > > > +#define INC_ONCE(v) (*(volatile typeof(v) *)&(v))++ > > > > +#endif > > > > > > ... but I'm honestly not sure what this is aiming to elide. Is this about > > > having the compiler emit an increment instruction with memory operands > > > on architectures that support it? If so, what about architectures that > > > _don't_ support that? We still need to guarantee that the load/store > > > instructions are single-copy atomic for the type in that case. > > > > Architectures whose increment instructions lack memory operands can still > > load, increment, then store. So yes, if two of these run concurrently, > > you can lose counts when incrementing normal memory. (Of course, with > > device memory, you get what you get.) > > ... but can we guarantee that those load and store instructions won't > now be torn? The volatile type qualifier must take care of that, or device drivers no longer work. Or are you worried about the C-language ++ fallback? In that case, as I understand it, this fallback is only there until GCC learns to generate a increment-to-memory instruction for x86. > > > Given how hard it can be to get the compiler to break atomicity for > > > plain accesses, I'm pretty worried about the robustness of this. > > > > Your point being that the current plain C-language postfix ++ is unlikely > > to be adversely optimized? If so, although I agree here in 2024, > > the years pass quickly and increasingly clever compiler optimizations > > accumulate. > > Right, I think we're in agreement. 
I'm saying that just because this > proposed INC_ONCE() macro appears to generate the assembly code we want, > I don't have even an instinctive feeling that it's deliberate or > something that we should/can rely upon. My perhaps naive hope is that GCC updates permit the plain C-language postfix ++ fallback goes away sooner rather than later. Or am I still missing your point? Thanx, Paul
On Thu, Dec 19, 2024 at 10:06:23AM -0800, Paul E. McKenney wrote: > On Thu, Dec 19, 2024 at 05:58:03PM +0000, Will Deacon wrote: > > On Thu, Dec 19, 2024 at 09:53:48AM -0800, Paul E. McKenney wrote: > > > On Thu, Dec 19, 2024 at 04:34:55PM +0000, Will Deacon wrote: > > > > On Wed, Dec 18, 2024 at 06:12:41PM +0100, Peter Zijlstra wrote: > > > > > diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c > > > > > index 39cb8d0ae348..8fb7cd75fe62 100644 > > > > > --- a/arch/s390/kernel/idle.c > > > > > +++ b/arch/s390/kernel/idle.c > > > > > @@ -45,7 +45,7 @@ void account_idle_time_irq(void) > > > > > > > > > > /* Account time spent with enabled wait psw loaded as idle time. */ > > > > > WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time); > > > > > - WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1); > > > > > + INC_ONC(idle->idle_count); > > > > > > > > (nit: drive-by typo) > > > > > > Heh! I was clearly building with GCC. ;-) > > > > And not for S390 :p > > True enough! ;-) > > > > > > diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h > > > > > index c9b58188ec61..77c253e29758 100644 > > > > > --- a/include/linux/compiler-gcc.h > > > > > +++ b/include/linux/compiler-gcc.h > > > > > @@ -137,3 +137,8 @@ > > > > > #if GCC_VERSION < 90100 > > > > > #undef __alloc_size__ > > > > > #endif > > > > > + > > > > > +/* > > > > > + * GCC can't properly optimize the real one with volatile on. > > > > > + */ > > > > > +#define INC_ONCE(v) (v)++ > > > > > > > > I think I'm missing what we're trying to do here, but how is this a > > > > safe fallback? I'd have thought we'd need the whole READ_ONCE() + > > > > WRITE_ONCE() sequence if the compiler doesn't give us what we need... > > > > > > > > > diff --git a/include/linux/compiler.h b/include/linux/compiler.h > > > > > index efd43df3a99a..b1b13dac1b9e 100644 > > > > > --- a/include/linux/compiler.h > > > > > +++ b/include/linux/compiler.h > > > > > @@ -8,6 +8,10 @@ > > > > > > > > > > #ifdef __KERNEL__ > > > > > > > > > > +#ifndef INC_ONCE > > > > > +#define INC_ONCE(v) (*(volatile typeof(v) *)&(v))++ > > > > > +#endif > > > > > > > > ... but I'm honestly not sure what this is aiming to elide. Is this about > > > > having the compiler emit an increment instruction with memory operands > > > > on architectures that support it? If so, what about architectures that > > > > _don't_ support that? We still need to guarantee that the load/store > > > > instructions are single-copy atomic for the type in that case. > > > > > > Architectures whose increment instructions lack memory operands can still > > > load, increment, then store. So yes, if two of these run concurrently, > > > you can lose counts when incrementing normal memory. (Of course, with > > > device memory, you get what you get.) > > > > ... but can we guarantee that those load and store instructions won't > > now be torn? > > The volatile type qualifier must take care of that, or device drivers > no longer work. > > Or are you worried about the C-language ++ fallback? In that case, > as I understand it, this fallback is only there until GCC learns to > generate a increment-to-memory instruction for x86. As I understand it, both the clang and the GCC implementations in the proposed diff are written using the C-language ++ operator, no? 
The GCC one doesn't have 'volatile' at all: #define INC_ONCE(v) (v)++ so I don't understand why it's a correct replacement for e.g.: WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time); If it was, then we wouldn't need any of these macros in the first place! The clang version does have 'volatile': #define INC_ONCE(v) (*(volatile typeof(v) *)&(v))++ but then I don't understand why this is enough for the compiler to generate an 'inc' with memory operands on x86 if it cannot do that for the READ_ONCE + WRITE_ONCE implementation. If somehow the above is more "optimisable" then I worry about the atomicity being preserved as well. So, basically, I don't understand what's going on here :) I think I was expecting to see arch-specific implementations of INC_ONCE() written in assembly where they have the relevant instructions but that doesn't seem to be the case. Will
On Thu, Dec 19, 2024 at 06:31:15PM +0000, Will Deacon wrote: > On Thu, Dec 19, 2024 at 10:06:23AM -0800, Paul E. McKenney wrote: > > On Thu, Dec 19, 2024 at 05:58:03PM +0000, Will Deacon wrote: > > > On Thu, Dec 19, 2024 at 09:53:48AM -0800, Paul E. McKenney wrote: > > > > On Thu, Dec 19, 2024 at 04:34:55PM +0000, Will Deacon wrote: > > > > > On Wed, Dec 18, 2024 at 06:12:41PM +0100, Peter Zijlstra wrote: > > > > > > diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c > > > > > > index 39cb8d0ae348..8fb7cd75fe62 100644 > > > > > > --- a/arch/s390/kernel/idle.c > > > > > > +++ b/arch/s390/kernel/idle.c > > > > > > @@ -45,7 +45,7 @@ void account_idle_time_irq(void) > > > > > > > > > > > > /* Account time spent with enabled wait psw loaded as idle time. */ > > > > > > WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time); > > > > > > - WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1); > > > > > > + INC_ONC(idle->idle_count); > > > > > > > > > > (nit: drive-by typo) > > > > > > > > Heh! I was clearly building with GCC. ;-) > > > > > > And not for S390 :p > > > > True enough! ;-) > > > > > > > > diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h > > > > > > index c9b58188ec61..77c253e29758 100644 > > > > > > --- a/include/linux/compiler-gcc.h > > > > > > +++ b/include/linux/compiler-gcc.h > > > > > > @@ -137,3 +137,8 @@ > > > > > > #if GCC_VERSION < 90100 > > > > > > #undef __alloc_size__ > > > > > > #endif > > > > > > + > > > > > > +/* > > > > > > + * GCC can't properly optimize the real one with volatile on. > > > > > > + */ > > > > > > +#define INC_ONCE(v) (v)++ > > > > > > > > > > I think I'm missing what we're trying to do here, but how is this a > > > > > safe fallback? I'd have thought we'd need the whole READ_ONCE() + > > > > > WRITE_ONCE() sequence if the compiler doesn't give us what we need... > > > > > > > > > > > diff --git a/include/linux/compiler.h b/include/linux/compiler.h > > > > > > index efd43df3a99a..b1b13dac1b9e 100644 > > > > > > --- a/include/linux/compiler.h > > > > > > +++ b/include/linux/compiler.h > > > > > > @@ -8,6 +8,10 @@ > > > > > > > > > > > > #ifdef __KERNEL__ > > > > > > > > > > > > +#ifndef INC_ONCE > > > > > > +#define INC_ONCE(v) (*(volatile typeof(v) *)&(v))++ > > > > > > +#endif > > > > > > > > > > ... but I'm honestly not sure what this is aiming to elide. Is this about > > > > > having the compiler emit an increment instruction with memory operands > > > > > on architectures that support it? If so, what about architectures that > > > > > _don't_ support that? We still need to guarantee that the load/store > > > > > instructions are single-copy atomic for the type in that case. > > > > > > > > Architectures whose increment instructions lack memory operands can still > > > > load, increment, then store. So yes, if two of these run concurrently, > > > > you can lose counts when incrementing normal memory. (Of course, with > > > > device memory, you get what you get.) > > > > > > ... but can we guarantee that those load and store instructions won't > > > now be torn? > > > > The volatile type qualifier must take care of that, or device drivers > > no longer work. > > > > Or are you worried about the C-language ++ fallback? In that case, > > as I understand it, this fallback is only there until GCC learns to > > generate a increment-to-memory instruction for x86. 
> > As I understand it, both the clang and the GCC implementations in the > proposed diff are written using the C-language ++ operator, no? > > The GCC one doesn't have 'volatile' at all: > > #define INC_ONCE(v) (v)++ > > so I don't understand why it's a correct replacement for e.g.: > > WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time); > > If it was, then we wouldn't need any of these macros in the first place! You are right, it is not a correct replacement. It is instead a behavior-preserving fallback until GCC does the optimizations. Here the behavior is the code generation. > The clang version does have 'volatile': > > #define INC_ONCE(v) (*(volatile typeof(v) *)&(v))++ > > but then I don't understand why this is enough for the compiler to > generate an 'inc' with memory operands on x86 if it cannot do that for > the READ_ONCE + WRITE_ONCE implementation. If somehow the above is more > "optimisable" then I worry about the atomicity being preserved as well. > > So, basically, I don't understand what's going on here :) Peter learned that clang does in fact optimize the original READ_ONCE + WRITE_ONCE implementation. > I think I was expecting to see arch-specific implementations of > INC_ONCE() written in assembly where they have the relevant instructions > but that doesn't seem to be the case. That is yet another fallback, though with quite a bit more code. :-( Thanx, Paul
* Peter Zijlstra:

> +linux-toolchains
>
> On Wed, Dec 18, 2024 at 08:59:47AM -0800, Paul E. McKenney wrote:
>
>> > Perhaps something like: (*(volatile unsigned int *)&s->sequence)++; ?
>> > I'd have to check what the compiler makes of that.
>> >
>> > /me mucks about with godbolt for a bit...
>> >
>> > GCC doesn't optimize that, but Clang does.
>> >
>> > I would still very much refrain from making this change until both
>> > compilers can generate sane code for it.
>>
>> Is GCC on track to do this, or do we need to encourage them?
>
> I have no clue; probably wise to offer encouragement.

What do you consider sane code?

Clang's choice to generate an incl instruction (on x86-64 at least) is a
bit surprising. Curiously, the C11 abstract machine has a value-less
increment-in-place operation, so it's probably not in violation of the
volatile rules. (C doesn't specify x++ in terms of ++x and x += 1.)

Thanks,
Florian
On Wed, Dec 18, 2024 at 08:56:07PM +0100, Florian Weimer wrote:
> * Peter Zijlstra:
>
> > +linux-toolchains
> >
> > On Wed, Dec 18, 2024 at 08:59:47AM -0800, Paul E. McKenney wrote:
> >
> >> > Perhaps something like: (*(volatile unsigned int *)&s->sequence)++; ?
> >> > I'd have to check what the compiler makes of that.
> >> >
> >> > /me mucks about with godbolt for a bit...
> >> >
> >> > GCC doesn't optimize that, but Clang does.
> >> >
> >> > I would still very much refrain from making this change until both
> >> > compilers can generate sane code for it.
> >>
> >> Is GCC on track to do this, or do we need to encourage them?
> >
> > I have no clue; probably wise to offer encouragement.
>
> What do you consider sane code?

Peter's "(*(volatile unsigned int *)&s->sequence)++;" qualifies as sane.

> Clang's choice to generate an incl instruction (on x86-64 at least) is a
> bit surprising. Curiously, the C11 abstract machine has a value-less
> increment-in-place operation, so it's probably not in violation of the
> volatile rules. (C doesn't specify x++ in terms of ++x and x += 1.)

Very good! Should I do something like file a bug somewhere to help
this along?

Thanx, Paul
* Paul E. McKenney:

> On Wed, Dec 18, 2024 at 08:56:07PM +0100, Florian Weimer wrote:
>> * Peter Zijlstra:
>>
>> > +linux-toolchains
>> >
>> > On Wed, Dec 18, 2024 at 08:59:47AM -0800, Paul E. McKenney wrote:
>> >
>> >> > Perhaps something like: (*(volatile unsigned int *)&s->sequence)++; ?
>> >> > I'd have to check what the compiler makes of that.
>> >> >
>> >> > /me mucks about with godbolt for a bit...
>> >> >
>> >> > GCC doesn't optimize that, but Clang does.
>> >> >
>> >> > I would still very much refrain from making this change until both
>> >> > compilers can generate sane code for it.
>> >>
>> >> Is GCC on track to do this, or do we need to encourage them?
>> >
>> > I have no clue; probably wise to offer encouragement.
>>
>> What do you consider sane code?
>
> Peter's "(*(volatile unsigned int *)&s->sequence)++;" qualifies as sane.

I think the reference was originally to machine code.

>> Clang's choice to generate an incl instruction (on x86-64 at least) is a
>> bit surprising. Curiously, the C11 abstract machine has a value-less
>> increment-in-place operation, so it's probably not in violation of the
>> volatile rules. (C doesn't specify x++ in terms of ++x and x += 1.)
>
> Very good! Should I do something like file a bug somewhere to help
> this along?

I don't know. It seems that Clang/LLVM is cheating. It's doing this
optimization even for

  i = i + 1;

with a volatile i. That doesn't look like “strictly according to the
abstract machine” anymore. A proper implementation would need explicit
representation of volatile increment/decrement in the IR. Given that
volatile increment/decrement is deprecated, that seems quite a bit of
effort.

Thanks,
Florian
On Thu, Dec 19, 2024 at 05:45:15PM +0100, Florian Weimer wrote:
> > Peter's "(*(volatile unsigned int *)&s->sequence)++;" qualifies as sane.
>
> I think the reference was originally to machine code.

Correct; however more tinkering with godbolt got me:

  https://godbolt.org/z/6xoxjdYPx

Where we can see that clang can even optimize the WRITE_ONCE(foo,
READ_ONCE(foo) + 1) case, which I thought it would not be able to do.

So perhaps we just need to get GCC fixed.
On Thu, Dec 19, 2024 at 05:45:15PM +0100, Florian Weimer wrote:
> * Paul E. McKenney:
>
> > On Wed, Dec 18, 2024 at 08:56:07PM +0100, Florian Weimer wrote:
> >> * Peter Zijlstra:
> >>
> >> > +linux-toolchains
> >> >
> >> > On Wed, Dec 18, 2024 at 08:59:47AM -0800, Paul E. McKenney wrote:
> >> >
> >> >> > Perhaps something like: (*(volatile unsigned int *)&s->sequence)++; ?
> >> >> > I'd have to check what the compiler makes of that.
> >> >> >
> >> >> > /me mucks about with godbolt for a bit...
> >> >> >
> >> >> > GCC doesn't optimize that, but Clang does.
> >> >> >
> >> >> > I would still very much refrain from making this change until both
> >> >> > compilers can generate sane code for it.
> >> >>
> >> >> Is GCC on track to do this, or do we need to encourage them?
> >> >
> >> > I have no clue; probably wise to offer encouragement.
> >>
> >> What do you consider sane code?
> >
> > Peter's "(*(volatile unsigned int *)&s->sequence)++;" qualifies as sane.
>
> I think the reference was originally to machine code.

Very well, then compiling this to a to-memory increment instruction
qualifies as sane.

> >> Clang's choice to generate an incl instruction (on x86-64 at least) is a
> >> bit surprising. Curiously, the C11 abstract machine has a value-less
> >> increment-in-place operation, so it's probably not in violation of the
> >> volatile rules. (C doesn't specify x++ in terms of ++x and x += 1.)
> >
> > Very good! Should I do something like file a bug somewhere to help
> > this along?
>
> I don't know. It seems that Clang/LLVM is cheating. It's doing this
> optimization even for
>
>   i = i + 1;
>
> with a volatile i. That doesn't look like “strictly according to the
> abstract machine” anymore.

How so? It is in fact adding one to that volatile variable, which is
in accord with the abstract machine.

> A proper implementation would need explicit
> representation of volatile increment/decrement in the IR. Given that
> volatile increment/decrement is deprecated, that seems quite a bit of
> effort.

If I remember correctly, there was a discussion in SG21 about
de-deprecating volatile increment/decrement.

Thanx, Paul
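(For anyone following along, the two spellings under discussion are roughly
these; illustration only, with the codegen notes taken from the observations
earlier in the thread.)

volatile unsigned int i;

void inc_postfix(void)
{
	i++;		/* C does not define this in terms of i = i + 1 */
}

void inc_assign(void)
{
	i = i + 1;	/* a volatile read, an add, then a volatile write */
}

/*
 * Per the thread: Clang emits a single to-memory increment for both,
 * while GCC keeps the separate load and store for the volatile object.
 */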
On Wed, Dec 18, 2024 at 06:12:41PM +0100, Peter Zijlstra wrote:
>
> +linux-toolchains
>
> On Wed, Dec 18, 2024 at 08:59:47AM -0800, Paul E. McKenney wrote:
>
> > > Perhaps something like: (*(volatile unsigned int *)&s->sequence)++; ?
> > > I'd have to check what the compiler makes of that.
> > >
> > > /me mucks about with godbolt for a bit...
> > >
> > > GCC doesn't optimize that, but Clang does.
> > >
> > > I would still very much refrain from making this change until both
> > > compilers can generate sane code for it.
> >
> > Is GCC on track to do this, or do we need to encourage them?
>
> I have no clue; probably wise to offer encouragement.
Hopefully your +linux-toolchain is a start.
> > Just to make sure I understand, your proposal is to create an INC_ONCE()
> > or similar, and add it once compiler support is there? Seems reasonable
> > to me, just checking.
>
> I suppose we can work in parallel, add INC_ONCE() now, but not have a
> proper definition for GCC.
This passes an rcutorture smoke test, so feel free to add:
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
> ---
> arch/riscv/kvm/vmid.c | 2 +-
> arch/s390/kernel/idle.c | 2 +-
> drivers/md/dm-vdo/indexer/index-session.c | 2 +-
> fs/xfs/libxfs/xfs_iext_tree.c | 2 +-
> include/linux/compiler-gcc.h | 5 +++++
> include/linux/compiler.h | 4 ++++
> include/linux/rcupdate_trace.h | 2 +-
> include/linux/srcutiny.h | 2 +-
> io_uring/io_uring.c | 2 +-
> kernel/rcu/srcutree.c | 4 ++--
> kernel/rcu/tree_plugin.h | 2 +-
> kernel/sched/fair.c | 2 +-
> mm/kfence/kfence_test.c | 2 +-
> security/apparmor/apparmorfs.c | 2 +-
> 14 files changed, 22 insertions(+), 13 deletions(-)
>
> diff --git a/arch/riscv/kvm/vmid.c b/arch/riscv/kvm/vmid.c
> index ddc98714ce8e..805a5acf669c 100644
> --- a/arch/riscv/kvm/vmid.c
> +++ b/arch/riscv/kvm/vmid.c
> @@ -90,7 +90,7 @@ void kvm_riscv_gstage_vmid_update(struct kvm_vcpu *vcpu)
>
> /* First user of a new VMID version? */
> if (unlikely(vmid_next == 0)) {
> - WRITE_ONCE(vmid_version, READ_ONCE(vmid_version) + 1);
> + INC_ONCE(vmid_version);
> vmid_next = 1;
>
> /*
> diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c
> index 39cb8d0ae348..8fb7cd75fe62 100644
> --- a/arch/s390/kernel/idle.c
> +++ b/arch/s390/kernel/idle.c
> @@ -45,7 +45,7 @@ void account_idle_time_irq(void)
>
> /* Account time spent with enabled wait psw loaded as idle time. */
> WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time);
> - WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1);
> + INC_ONC(idle->idle_count);
> account_idle_time(cputime_to_nsecs(idle_time));
> }
>
> diff --git a/drivers/md/dm-vdo/indexer/index-session.c b/drivers/md/dm-vdo/indexer/index-session.c
> index aee0914d604a..c5a7dee9dc66 100644
> --- a/drivers/md/dm-vdo/indexer/index-session.c
> +++ b/drivers/md/dm-vdo/indexer/index-session.c
> @@ -152,7 +152,7 @@ static void enter_callback_stage(struct uds_request *request)
>
> static inline void count_once(u64 *count_ptr)
> {
> - WRITE_ONCE(*count_ptr, READ_ONCE(*count_ptr) + 1);
> + INC_ONCE(*count_ptr);
> }
>
> static void update_session_stats(struct uds_request *request)
> diff --git a/fs/xfs/libxfs/xfs_iext_tree.c b/fs/xfs/libxfs/xfs_iext_tree.c
> index 8796f2b3e534..a1fcd4cf2424 100644
> --- a/fs/xfs/libxfs/xfs_iext_tree.c
> +++ b/fs/xfs/libxfs/xfs_iext_tree.c
> @@ -626,7 +626,7 @@ xfs_iext_realloc_root(
> */
> static inline void xfs_iext_inc_seq(struct xfs_ifork *ifp)
> {
> - WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1);
> + INC_ONCE(ifp->if_seq);
> }
>
> void
> diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> index c9b58188ec61..77c253e29758 100644
> --- a/include/linux/compiler-gcc.h
> +++ b/include/linux/compiler-gcc.h
> @@ -137,3 +137,8 @@
> #if GCC_VERSION < 90100
> #undef __alloc_size__
> #endif
> +
> +/*
> + * GCC can't properly optimize the real one with volatile on.
> + */
> +#define INC_ONCE(v) (v)++
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index efd43df3a99a..b1b13dac1b9e 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -8,6 +8,10 @@
>
> #ifdef __KERNEL__
>
> +#ifndef INC_ONCE
> +#define INC_ONCE(v) (*(volatile typeof(v) *)&(v))++
> +#endif
> +
> /*
> * Note: DISABLE_BRANCH_PROFILING can be used by special lowlevel code
> * to disable branch tracing on a per file basis.
> diff --git a/include/linux/rcupdate_trace.h b/include/linux/rcupdate_trace.h
> index e6c44eb428ab..adb12e7304da 100644
> --- a/include/linux/rcupdate_trace.h
> +++ b/include/linux/rcupdate_trace.h
> @@ -50,7 +50,7 @@ static inline void rcu_read_lock_trace(void)
> {
> struct task_struct *t = current;
>
> - WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + 1);
> + INC_ONCE(t->trc_reader_nesting);
> barrier();
> if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) &&
> t->trc_reader_special.b.need_mb)
> diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h
> index 1321da803274..bd4362323733 100644
> --- a/include/linux/srcutiny.h
> +++ b/include/linux/srcutiny.h
> @@ -66,7 +66,7 @@ static inline int __srcu_read_lock(struct srcu_struct *ssp)
>
> preempt_disable(); // Needed for PREEMPT_AUTO
> idx = ((READ_ONCE(ssp->srcu_idx) + 1) & 0x2) >> 1;
> - WRITE_ONCE(ssp->srcu_lock_nesting[idx], READ_ONCE(ssp->srcu_lock_nesting[idx]) + 1);
> + INC_ONCE(ssp->srcu_lock_nesting[idx]);
> preempt_enable();
> return idx;
> }
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 06ff41484e29..ef3d4871e775 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -373,7 +373,7 @@ static void io_account_cq_overflow(struct io_ring_ctx *ctx)
> {
> struct io_rings *r = ctx->rings;
>
> - WRITE_ONCE(r->cq_overflow, READ_ONCE(r->cq_overflow) + 1);
> + INC_ONCE(r->cq_overflow);
> ctx->cq_extra--;
> }
>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 5e2e53464794..a812af81dbff 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -656,7 +656,7 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> jbase += j - gpstart;
> if (!jbase) {
> ASSERT_EXCLUSIVE_WRITER(sup->srcu_n_exp_nodelay);
> - WRITE_ONCE(sup->srcu_n_exp_nodelay, READ_ONCE(sup->srcu_n_exp_nodelay) + 1);
> + INC_ONCE(sup->srcu_n_exp_nodelay);
> if (READ_ONCE(sup->srcu_n_exp_nodelay) > srcu_max_nodelay_phase)
> jbase = 1;
> }
> @@ -1856,7 +1856,7 @@ static void process_srcu(struct work_struct *work)
> j = jiffies;
> if (READ_ONCE(sup->reschedule_jiffies) == j) {
> ASSERT_EXCLUSIVE_WRITER(sup->reschedule_count);
> - WRITE_ONCE(sup->reschedule_count, READ_ONCE(sup->reschedule_count) + 1);
> + INC_ONCE(sup->reschedule_count);
> if (READ_ONCE(sup->reschedule_count) > srcu_max_nodelay)
> curdelay = 1;
> } else {
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 3927ea5f7955..51f525089c27 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -387,7 +387,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
>
> static void rcu_preempt_read_enter(void)
> {
> - WRITE_ONCE(current->rcu_read_lock_nesting, READ_ONCE(current->rcu_read_lock_nesting) + 1);
> + INC_ONCE(current->rcu_read_lock_nesting);
> }
>
> static int rcu_preempt_read_exit(void)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f5329672815b..f0927407abbe 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3242,7 +3242,7 @@ static void reset_ptenuma_scan(struct task_struct *p)
> * statistical sampling. Use READ_ONCE/WRITE_ONCE, which are not
> * expensive, to avoid any form of compiler optimizations:
> */
> - WRITE_ONCE(p->mm->numa_scan_seq, READ_ONCE(p->mm->numa_scan_seq) + 1);
> + INC_ONCE(p->mm->numa_scan_seq);
> p->mm->numa_scan_offset = 0;
> }
>
> diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
> index f65fb182466d..e0bf31b1875e 100644
> --- a/mm/kfence/kfence_test.c
> +++ b/mm/kfence/kfence_test.c
> @@ -517,7 +517,7 @@ static void test_kmalloc_aligned_oob_write(struct kunit *test)
> * fault immediately after it.
> */
> expect.addr = buf + size;
> - WRITE_ONCE(*expect.addr, READ_ONCE(*expect.addr) + 1);
> + INC_ONCE(*expect.addr);
> KUNIT_EXPECT_FALSE(test, report_available());
> test_free(buf);
> KUNIT_EXPECT_TRUE(test, report_matches(&expect));
> diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
> index 2c0185ebc900..63f5c8387764 100644
> --- a/security/apparmor/apparmorfs.c
> +++ b/security/apparmor/apparmorfs.c
> @@ -596,7 +596,7 @@ static __poll_t ns_revision_poll(struct file *file, poll_table *pt)
>
> void __aa_bump_ns_revision(struct aa_ns *ns)
> {
> - WRITE_ONCE(ns->revision, READ_ONCE(ns->revision) + 1);
> + INC_ONCE(ns->revision);
> wake_up_interruptible(&ns->wait);
> }
>
On 12/17/24 6:17 PM, Daniel Xu wrote:
> `sequence` is a concurrently accessed shared variable on the reader
> side. Therefore, it needs to be wrapped in WRITE_ONCE() in order to
> prevent unwanted compiler optimizations like store tearing.
>
> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> ---
> include/linux/seqlock.h | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> index 5298765d6ca4..f4c6f2507742 100644
> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -45,7 +45,7 @@ static inline void __seqcount_init(seqcount_t *s, const char *name,
> * Make sure we are not reinitializing a held lock:
> */
> lockdep_init_map(&s->dep_map, name, key, 0);
> - s->sequence = 0;
> + WRITE_ONCE(s->sequence, 0);
> }
>
The init function certainly doesn't need to use WRITE_ONCE().
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> @@ -405,7 +405,7 @@ do { \
> static inline void do_raw_write_seqcount_begin(seqcount_t *s)
> {
> kcsan_nestable_atomic_begin();
> - s->sequence++;
> + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> smp_wmb();
> }
>
> @@ -426,7 +426,7 @@ do { \
> static inline void do_raw_write_seqcount_end(seqcount_t *s)
> {
> smp_wmb();
> - s->sequence++;
> + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> kcsan_nestable_atomic_end();
> }
>
> @@ -548,9 +548,9 @@ static inline void do_write_seqcount_end(seqcount_t *s)
> static inline void do_raw_write_seqcount_barrier(seqcount_t *s)
> {
> kcsan_nestable_atomic_begin();
> - s->sequence++;
> + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> smp_wmb();
> - s->sequence++;
> + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> kcsan_nestable_atomic_end();
> }
>
> @@ -569,7 +569,7 @@ static inline void do_write_seqcount_invalidate(seqcount_t *s)
> {
> smp_wmb();
> kcsan_nestable_atomic_begin();
> - s->sequence+=2;
> + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 2);
> kcsan_nestable_atomic_end();
> }
>
> @@ -673,7 +673,7 @@ read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)
> static __always_inline void raw_write_seqcount_latch(seqcount_latch_t *s)
> {
> smp_wmb(); /* prior stores before incrementing "sequence" */
> - s->seqcount.sequence++;
> + WRITE_ONCE(s->seqcount.sequence, READ_ONCE(s->seqcount.sequence) + 1);
> smp_wmb(); /* increment "sequence" before following stores */
> }
>
For seqcount, its actual value isn't important. What is important is
whether the value changes and whether it is even or odd. So even if
store tearing is happening, it shouldn't affect its operation. I doubt
we need to use WRITE_ONCE() here. Could you come up with a scenario
where store tearing will make it behave incorrectly?
Cheers,
Longman
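[Editorial note: to make the reader-side behaviour Longman describes concrete, here is a simplified sketch of the usual seqcount read pattern, using the existing read_seqcount_begin()/read_seqcount_retry() API. The seqcount, the data structure and the function names are made up for illustration:]

#include <linux/types.h>
#include <linux/seqlock.h>

struct foo_data { u64 a, b; };		/* hypothetical protected data */
static struct foo_data foo;
static seqcount_t foo_seq;		/* assume seqcount_init() was called */

static struct foo_data foo_read(void)
{
        struct foo_data copy;
        unsigned int seq;

        do {
                /* Waits until the count is even, i.e. no writer is active. */
                seq = read_seqcount_begin(&foo_seq);
                copy = foo;
                /* Retry if the count changed, i.e. a writer intervened. */
        } while (read_seqcount_retry(&foo_seq, seq));

        return copy;
}

The reader only ever compares the count against its starting value and relies on writers holding it odd, which is the property Longman's argument is based on.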
On Tue, Dec 17, 2024 at 10:30:39PM -0500, Waiman Long wrote:
> On 12/17/24 6:17 PM, Daniel Xu wrote:
> > `sequence` is a concurrently accessed shared variable on the reader
> > side. Therefore, it needs to be wrapped in WRITE_ONCE() in order to
> > prevent unwanted compiler optimizations like store tearing.
> >
> > Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> > ---
> > include/linux/seqlock.h | 14 +++++++-------
> > 1 file changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> > index 5298765d6ca4..f4c6f2507742 100644
> > --- a/include/linux/seqlock.h
> > +++ b/include/linux/seqlock.h
> > @@ -45,7 +45,7 @@ static inline void __seqcount_init(seqcount_t *s, const char *name,
> > * Make sure we are not reinitializing a held lock:
> > */
> > lockdep_init_map(&s->dep_map, name, key, 0);
> > - s->sequence = 0;
> > + WRITE_ONCE(s->sequence, 0);
> > }
> The init function certainly doesn't need to use WRITE_ONCE().
> > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > @@ -405,7 +405,7 @@ do { \
> > static inline void do_raw_write_seqcount_begin(seqcount_t *s)
> > {
> > kcsan_nestable_atomic_begin();
> > - s->sequence++;
> > + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> > smp_wmb();
> > }
> > @@ -426,7 +426,7 @@ do { \
> > static inline void do_raw_write_seqcount_end(seqcount_t *s)
> > {
> > smp_wmb();
> > - s->sequence++;
> > + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> > kcsan_nestable_atomic_end();
> > }
> > @@ -548,9 +548,9 @@ static inline void do_write_seqcount_end(seqcount_t *s)
> > static inline void do_raw_write_seqcount_barrier(seqcount_t *s)
> > {
> > kcsan_nestable_atomic_begin();
> > - s->sequence++;
> > + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> > smp_wmb();
> > - s->sequence++;
> > + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> > kcsan_nestable_atomic_end();
> > }
> > @@ -569,7 +569,7 @@ static inline void do_write_seqcount_invalidate(seqcount_t *s)
> > {
> > smp_wmb();
> > kcsan_nestable_atomic_begin();
> > - s->sequence+=2;
> > + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 2);
> > kcsan_nestable_atomic_end();
> > }
> > @@ -673,7 +673,7 @@ read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)
> > static __always_inline void raw_write_seqcount_latch(seqcount_latch_t *s)
> > {
> > smp_wmb(); /* prior stores before incrementing "sequence" */
> > - s->seqcount.sequence++;
> > + WRITE_ONCE(s->seqcount.sequence, READ_ONCE(s->seqcount.sequence) + 1);
> > smp_wmb(); /* increment "sequence" before following stores */
> > }
>
> For seqcount, its actual value isn't important. What is important is whether
> the value changes and whether it is even or odd. So even if store tearing is
> happening, it shouldn't affect its operation. I doubt we need to use
> WRITE_ONCE() here. Could you come up with a scenario where store tearing
> will make it behave incorrectly?
But why expand the state space?
Also, there are potentially "optimizations" other than store tearing.
No, I haven't seen them yet, but then again, there were a great many
optimizations that were not being used back when I started coding C.
Thanx, Paul
On 12/18/24 10:45 AM, Paul E. McKenney wrote:
> On Tue, Dec 17, 2024 at 10:30:39PM -0500, Waiman Long wrote:
>> On 12/17/24 6:17 PM, Daniel Xu wrote:
>>> `sequence` is a concurrently accessed shared variable on the reader
>>> side. Therefore, it needs to be wrapped in WRITE_ONCE() in order to
>>> prevent unwanted compiler optimizations like store tearing.
>>>
>>> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
>>> ---
>>> include/linux/seqlock.h | 14 +++++++-------
>>> 1 file changed, 7 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
>>> index 5298765d6ca4..f4c6f2507742 100644
>>> --- a/include/linux/seqlock.h
>>> +++ b/include/linux/seqlock.h
>>> @@ -45,7 +45,7 @@ static inline void __seqcount_init(seqcount_t *s, const char *name,
>>> * Make sure we are not reinitializing a held lock:
>>> */
>>> lockdep_init_map(&s->dep_map, name, key, 0);
>>> - s->sequence = 0;
>>> + WRITE_ONCE(s->sequence, 0);
>>> }
>> The init function certainly doesn't need to use WRITE_ONCE().
>>> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>> @@ -405,7 +405,7 @@ do { \
>>> static inline void do_raw_write_seqcount_begin(seqcount_t *s)
>>> {
>>> kcsan_nestable_atomic_begin();
>>> - s->sequence++;
>>> + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
>>> smp_wmb();
>>> }
>>> @@ -426,7 +426,7 @@ do { \
>>> static inline void do_raw_write_seqcount_end(seqcount_t *s)
>>> {
>>> smp_wmb();
>>> - s->sequence++;
>>> + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
>>> kcsan_nestable_atomic_end();
>>> }
>>> @@ -548,9 +548,9 @@ static inline void do_write_seqcount_end(seqcount_t *s)
>>> static inline void do_raw_write_seqcount_barrier(seqcount_t *s)
>>> {
>>> kcsan_nestable_atomic_begin();
>>> - s->sequence++;
>>> + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
>>> smp_wmb();
>>> - s->sequence++;
>>> + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
>>> kcsan_nestable_atomic_end();
>>> }
>>> @@ -569,7 +569,7 @@ static inline void do_write_seqcount_invalidate(seqcount_t *s)
>>> {
>>> smp_wmb();
>>> kcsan_nestable_atomic_begin();
>>> - s->sequence+=2;
>>> + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 2);
>>> kcsan_nestable_atomic_end();
>>> }
>>> @@ -673,7 +673,7 @@ read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)
>>> static __always_inline void raw_write_seqcount_latch(seqcount_latch_t *s)
>>> {
>>> smp_wmb(); /* prior stores before incrementing "sequence" */
>>> - s->seqcount.sequence++;
>>> + WRITE_ONCE(s->seqcount.sequence, READ_ONCE(s->seqcount.sequence) + 1);
>>> smp_wmb(); /* increment "sequence" before following stores */
>>> }
>> For seqcount, its actual value isn't important. What is important is whether
>> the value changes and whether it is even or odd. So even if store tearing is
>> happening, it shouldn't affect its operation. I doubt we need to use
>> WRITE_ONCE() here. Could you come up with a scenario where store tearing
>> will make it behave incorrectly?
> But why expand the state space?
I don't quite get what you mean by "state space".
>
> Also, there are potentially "optimizations" other than store tearing.
> No, I haven't seen them yet, but then again, there were a great many
> optimizations that were not being used back when I started coding C.
All the inc's are bounded by smp_wmb()'s which should limit the kind of
optimizations that can be done by the compiler. That is why I don't see
a need to use WRITE_ONCE() here. Of course, there may be some corner
cases that I may miss. So I ask for whether such a case exists.
Cheers,
Longman
On Wed, Dec 18, 2024 at 11:10:01AM -0500, Waiman Long wrote:
> On 12/18/24 10:45 AM, Paul E. McKenney wrote:
> > On Tue, Dec 17, 2024 at 10:30:39PM -0500, Waiman Long wrote:
> > > On 12/17/24 6:17 PM, Daniel Xu wrote:
> > > > `sequence` is a concurrently accessed shared variable on the reader
> > > > side. Therefore, it needs to be wrapped in WRITE_ONCE() in order to
> > > > prevent unwanted compiler optimizations like store tearing.
> > > >
> > > > Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> > > > ---
> > > > include/linux/seqlock.h | 14 +++++++-------
> > > > 1 file changed, 7 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> > > > index 5298765d6ca4..f4c6f2507742 100644
> > > > --- a/include/linux/seqlock.h
> > > > +++ b/include/linux/seqlock.h
> > > > @@ -45,7 +45,7 @@ static inline void __seqcount_init(seqcount_t *s, const char *name,
> > > > * Make sure we are not reinitializing a held lock:
> > > > */
> > > > lockdep_init_map(&s->dep_map, name, key, 0);
> > > > - s->sequence = 0;
> > > > + WRITE_ONCE(s->sequence, 0);
> > > > }
> > > The init function certainly doesn't need to use WRITE_ONCE().
> > > > #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > > > @@ -405,7 +405,7 @@ do { \
> > > > static inline void do_raw_write_seqcount_begin(seqcount_t *s)
> > > > {
> > > > kcsan_nestable_atomic_begin();
> > > > - s->sequence++;
> > > > + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> > > > smp_wmb();
> > > > }
> > > > @@ -426,7 +426,7 @@ do { \
> > > > static inline void do_raw_write_seqcount_end(seqcount_t *s)
> > > > {
> > > > smp_wmb();
> > > > - s->sequence++;
> > > > + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> > > > kcsan_nestable_atomic_end();
> > > > }
> > > > @@ -548,9 +548,9 @@ static inline void do_write_seqcount_end(seqcount_t *s)
> > > > static inline void do_raw_write_seqcount_barrier(seqcount_t *s)
> > > > {
> > > > kcsan_nestable_atomic_begin();
> > > > - s->sequence++;
> > > > + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> > > > smp_wmb();
> > > > - s->sequence++;
> > > > + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1);
> > > > kcsan_nestable_atomic_end();
> > > > }
> > > > @@ -569,7 +569,7 @@ static inline void do_write_seqcount_invalidate(seqcount_t *s)
> > > > {
> > > > smp_wmb();
> > > > kcsan_nestable_atomic_begin();
> > > > - s->sequence+=2;
> > > > + WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 2);
> > > > kcsan_nestable_atomic_end();
> > > > }
> > > > @@ -673,7 +673,7 @@ read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)
> > > > static __always_inline void raw_write_seqcount_latch(seqcount_latch_t *s)
> > > > {
> > > > smp_wmb(); /* prior stores before incrementing "sequence" */
> > > > - s->seqcount.sequence++;
> > > > + WRITE_ONCE(s->seqcount.sequence, READ_ONCE(s->seqcount.sequence) + 1);
> > > > smp_wmb(); /* increment "sequence" before following stores */
> > > > }
> > > For seqcount, its actual value isn't important. What is important is whether
> > > the value changes and whether it is even or odd. So even if store tearing is
> > > happening, it shouldn't affect its operation. I doubt we need to use
> > > WRITE_ONCE() here. Could you come up with a scenario where store tearing
> > > will make it behave incorrectly?
> > But why expand the state space?
> I don't quite get what you mean by "state space".
The state space includes the set of possible values that might be stored
in .sequence at any given point in time. So if you allow the compiler to
tear the store, you are increasing the number of values that a concurrent
reader can return, and thus also increasing the number of states that must
be considered. On the other hand, if you use WRITE_ONCE() to constrain
the compiler to do a single store, then you don't have to even worry
about the added states due to these partial updates.
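[Editorial note: the "single store" constraint comes from the volatile access that WRITE_ONCE() performs internally. A simplified sketch of what the store in, say, do_raw_write_seqcount_begin() amounts to, modeled loosely on include/asm-generic/rwonce.h with instrumentation elided:]

#include <linux/seqlock.h>

/* Sketch only: the difference between the plain and the marked store. */
static inline void seq_bump_plain(seqcount_t *s)
{
        s->sequence++;		/* may, in principle, be torn or otherwise transformed */
}

static inline void seq_bump_once(seqcount_t *s)
{
        /* Roughly what WRITE_ONCE(s->sequence, READ_ONCE(s->sequence) + 1)
         * boils down to: a single volatile store of the full unsigned int,
         * which the compiler must emit exactly once and may not split. */
        *(volatile unsigned int *)&s->sequence = READ_ONCE(s->sequence) + 1;
}
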
> > Also, there are potentially "optimizations" other than store tearing.
> > No, I haven't seen them yet, but then again, there were a great many
> > optimizations that were not being used back when I started coding C.
>
> All the inc's are bounded by smp_wmb()'s which should limit the kind of
> optimizations that can be done by the compiler. That is why I don't see a
> need to use WRITE_ONCE() here. Of course, there may be some corner cases
> that I may miss. So I ask for whether such a case exists.
One other potential optimization is using the variable for scratch storage
immediately prior to the update, which could store any value there.
I haven't seen this myself, but the standard permits it, and in some
cases register pressure might encourage it.
And who knows what else our creative compiler writers will come up with?
Thanx, Paul
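[Editorial note: to illustrate the scratch-storage possibility Paul describes: for a plain (non-volatile) store, nothing in the C standard forbids a transformation along the following lines. This is purely a sketch of what a conforming compiler could do under register pressure, not something any known compiler currently does; "shared" is a made-up variable that other CPUs read concurrently:]

int shared;				/* hypothetically read by other CPUs */

void writer(int a, int b)
{
        /*
         * Source as written:
         *	int tmp = a * b;
         *	...use tmp...
         *	shared = 1;
         *
         * Possible transformation: spill the temporary into "shared", since
         * the intermediate value of a plain store is assumed unobservable by
         * the single-threaded abstract machine:
         */
        shared = a * b;		/* scratch value, briefly visible to readers */
        /* ...code that reloads the temporary from "shared"... */
        shared = 1;		/* intended final value */
}

/* WRITE_ONCE(shared, 1) rules this out: a volatile store may only store the
 * value actually being written. */
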
On 12/18/24 11:57 AM, Paul E. McKenney wrote:
> On Wed, Dec 18, 2024 at 11:10:01AM -0500, Waiman Long wrote:
>> On 12/18/24 10:45 AM, Paul E. McKenney wrote:
>>> On Tue, Dec 17, 2024 at 10:30:39PM -0500, Waiman Long wrote:
>>>> For seqcount, its actual value isn't important. What is important is whether
>>>> the value changes and whether it is even or odd. So even if store tearing is
>>>> happening, it shouldn't affect its operation. I doubt we need to use
>>>> WRITE_ONCE() here. Could you come up with a scenario where store tearing
>>>> will make it behave incorrectly?
>>> But why expand the state space?
>> I don't quite get what you mean by "state space".
> The state space includes the set of possible values that might be stored
> in .sequence at any given point in time. So if you allow the compiler to
> tear the store, you are increasing the number of values that a concurrent
> reader can return, and thus also increasing the number of states that must
> be considered. On the other hand, if you use WRITE_ONCE() to constrain
> the compiler to do a single store, then you don't have to even worry
> about the added states due to these partial updates.

Thanks for the explanation. Now I know what you mean.

As I said above, the correct operation of seqcount does not depend on
getting its intended value right. It is just its odd/even-ness as well as if
the value has been changed from previous read. That is the state space where
seqcount is operated on.

>>> Also, there are potentially "optimizations" other than store tearing.
>>> No, I haven't seen them yet, but then again, there were a great many
>>> optimizations that were not being used back when I started coding C.
>> All the inc's are bounded by smp_wmb()'s which should limit the kind of
>> optimizations that can be done by the compiler. That is why I don't see a
>> need to use WRITE_ONCE() here. Of course, there may be some corner cases
>> that I may miss. So I ask for whether such a case exists.
> One other potential optimization is using the variable for scratch storage
> immediately prior to the update, which could store any value there.
> I haven't seen this myself, but the standard permits it, and in some
> cases register pressure might encourage it.
>
> And who knows what else our creative compiler writers will come up with?

If the compiler really uses the variable as a scratch storage, it will be a
problem if the variable can be accessed concurrently from multiple CPUs. It
is a new compiler optimization strategy that I was not aware of before. In
that case, we may really need a way to mark these variables as not suitable
for such advanced optimization.

Thanks,
Longman

>
> Thanx, Paul
>
On Wed, Dec 18, 2024 at 01:38:17PM -0500, Waiman Long wrote:
> On 12/18/24 11:57 AM, Paul E. McKenney wrote:
> > On Wed, Dec 18, 2024 at 11:10:01AM -0500, Waiman Long wrote:
> > > On 12/18/24 10:45 AM, Paul E. McKenney wrote:
> > > > On Tue, Dec 17, 2024 at 10:30:39PM -0500, Waiman Long wrote:
> > > > > For seqcount, its actual value isn't important. What is important is whether
> > > > > the value changes and whether it is even or odd. So even if store tearing is
> > > > > happening, it shouldn't affect its operation. I doubt we need to use
> > > > > WRITE_ONCE() here. Could you come up with a scenario where store tearing
> > > > > will make it behave incorrectly?
> > > > But why expand the state space?
> > > I don't quite get what you mean by "state space".
> > The state space includes the set of possible values that might be stored
> > in .sequence at any given point in time. So if you allow the compiler to
> > tear the store, you are increasing the number of values that a concurrent
> > reader can return, and thus also increasing the number of states that must
> > be considered. On the other hand, if you use WRITE_ONCE() to constrain
> > the compiler to do a single store, then you don't have to even worry
> > about the added states due to these partial updates.
>
> Thanks for the explanation. Now I know what you mean.
>
> As I said above, the correct operation of seqcount does not depend on
> getting its intended value right. It is just its odd/even-ness as well as if
> the value has been changed from previous read. That is the state space where
> seqcount is operated on.

Perhaps an example would help. Given 16-bit store tearing on the 32-bit
unsigned sequence, the following could happen:

	..., 0x0000fffe, 0x0000ffff, 0x00000000, 0x00010000, 0x00010001, ...

With the WRITE_ONCE() prohibiting store tearing, one of those values
cannot happen:

	..., 0x0000fffe, 0x0000ffff, 0x00010000, 0x00010001, ...

Hence the smaller state space.

Yes, I understand that you can easily reason your way around this, but
would a younger version of yourself, chasing some other bug through this
code while sleep-deprived?

> > > > Also, there are potentially "optimizations" other than store tearing.
> > > > No, I haven't seen them yet, but then again, there were a great many
> > > > optimizations that were not being used back when I started coding C.
> > > All the inc's are bounded by smp_wmb()'s which should limit the kind of
> > > optimizations that can be done by the compiler. That is why I don't see a
> > > need to use WRITE_ONCE() here. Of course, there may be some corner cases
> > > that I may miss. So I ask for whether such a case exists.
> > One other potential optimization is using the variable for scratch storage
> > immediately prior to the update, which could store any value there.
> > I haven't seen this myself, but the standard permits it, and in some
> > cases register pressure might encourage it.
> >
> > And who knows what else our creative compiler writers will come up with?
>
> If the compiler really uses the variable as a scratch storage, it will be a
> problem if the variable can be accessed concurrently from multiple CPUs. It
> is a new compiler optimization strategy that I was not aware of before. In
> that case, we may really need a way to mark these variables as not suitable
> for such advanced optimization.

These markings already exist, namely, the "volatile" keyword, READ_ONCE(),
WRITE_ONCE(), and hopefully soon INC_ONCE(). These last three use
volatile accesses internally.

The scratch-storage possibility exists only just before a normal
C-language store, not before volatile accesses. So a compiler is
forbidden from doing that scratch-value-store trick before a volatile
store.

Thanx, Paul
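[Editorial note: Paul's 16-bit example can also be written out as code. This is purely illustrative of what a compiler that chose to tear the store could emit for the 0x0000ffff -> 0x00010000 increment on a little-endian machine; it is not code anyone would write by hand:]

#include <linux/types.h>	/* u32, u16 */

u32 seq = 0x0000ffff;		/* value before the increment */
u16 *half = (u16 *)&seq;	/* half[0] = low 16 bits, half[1] = high 16 bits */

/* Models two compiler-generated half-word stores instead of one word store: */
half[0] = 0x0000;	/* low half first: a concurrent reader can now observe 0x00000000 */
half[1] = 0x0001;	/* high half next: the reader finally observes 0x00010000 */

/* With WRITE_ONCE(seq, 0x00010000) the intermediate 0x00000000 state cannot
 * occur, hence the smaller state space. */
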
On 12/18/24 1:52 PM, Paul E. McKenney wrote:
>> If the compiler really uses the variable as a scratch storage, it will be a
>> problem if the variable can be accessed concurrently from multiple CPUs. It
>> is a new compiler optimization strategy that I was not aware of before. In
>> that case, we may really need a way to mark these variables as not suitable
>> for such advanced optimization.
> These markings already exist, namely, the "volatile" keyword, READ_ONCE(),
> WRITE_ONCE(), and hopefully soon INC_ONCE(). These last three use
> volatile accesses internally.
>
> The scratch-storage possibility exists only just before a normal
> C-language store, not before volatile accesses. So a compiler is
> forbidden from doing that scratch-value-store trick before a volatile
> store.

I am aware of that in READ_ONCE() and WRITE_ONCE(), and I am looking forward
to having INC_ONCE() and maybe DEC_ONCE() soon.

Cheers,
Longman