Hazard pointers [1] provide a way to dynamically distribute refcounting
and can be used to improve the scalability of refcounting without
significant space cost.

Hazard pointers are similar to RCU: they build synchronization between
two parties, readers and updaters. Readers are the refcount users: they
acquire and release refcounts. Updaters clean up objects once no reader
is referencing them (via call_hazptr()). The difference is that instead
of waiting for a grace period, hazard pointers can free an object as
soon as nothing references it. This means that, for a particular
workload, hazard pointers may have a smaller memory footprint due to
fewer pending callbacks.

The synchronization between readers and updaters is built around
"hazard pointer slots": slots that readers can use to store a pointer
value.

Reader side protection (a sketch in code follows this list):

1. Read the value of a pointer to the target data element.
2. Store it to a hazard pointer slot.
3. Enforce full ordering (e.g. smp_mb()).
4. Re-read the original pointer; if the value changed, reset the slot
   and retry.
5. Otherwise, the continued existence of the target data element is
   guaranteed.
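
A minimal open-coded sketch of these reader-side steps, assuming a
hypothetical global pointer 'gp' of type 'struct foo *' and a hazard
pointer slot 'hzp' (a hazptr_t *) obtained from hazptr_alloc() in this
patch; the patch implements the same logic in __hazptr_tryprotect(),
which additionally offsets the published value to the object's embedded
callback_head:

	struct foo *first, *ptr;

	for (;;) {
		first = READ_ONCE(gp);		/* 1. read the pointer      */
		if (!first)
			break;			/*    nothing to protect    */
		WRITE_ONCE(*hzp, first);	/* 2. publish it in the slot */
		smp_mb();			/* 3. full ordering         */
		ptr = READ_ONCE(gp);		/* 4. re-read               */
		if (ptr == first)
			break;			/* 5. *ptr cannot be freed  */
		WRITE_ONCE(*hzp, NULL);		/*    changed: reset, retry */
	}

Note that later accesses should dereference the re-read value 'ptr'
rather than 'first', so that they are ordered after the re-read by an
address dependency (this subtlety is what the follow-up discussion
below is about).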

Updater side check (a simplified sketch follows this list):

1. Unpublish the target data element (e.g. set the pointer value to
   NULL).
2. Enforce full ordering.
3. Read the value from a hazard pointer slot.
4. If the value doesn't match the target data element, then this slot
   doesn't represent a reference to it.
5. Otherwise, the updater needs to re-check (go back to step 3).
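
A correspondingly simplified updater-side sketch, assuming the same
hypothetical 'gp', a single slot 'hzp' and a hypothetical foo_release()
function; the real implementation instead scans every slot of every
hazptr_context and defers the release to a callback queued via
call_hazptr():

	struct foo *tofree = gp;

	WRITE_ONCE(gp, NULL);			/* 1. unpublish          */
	smp_mb();				/* 2. full ordering      */
	while (READ_ONCE(*hzp) == (void *)tofree) /* 3+4. check the slot */
		cpu_relax();			/* 5. re-check           */
	foo_release(tofree);			/* no reference remains  */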

To distribute the accesses of hazptr slots among different contexts, a
hazptr_context structure is introduced: users define or allocate their
own hazptr_context and allocate hazard pointer slots from it.

For the updater side to confirm that no reference exists, it needs to
scan all the possible slots. To speed up this process, hazptr_context
also contains an rbtree node for each slot, so that the updater can
cache the reader-scan results in an rbtree. Pre-allocating the rbtree
nodes this way avoids having to "allocate memory in order to free
memory" in extreme cases.
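
A minimal usage sketch of the API added by this patch ('struct foo',
'gp', 'ctx', foo_free_cb(), reader()/updater() and the single-updater
assumption are all illustrative, and error handling is omitted):

	struct foo {
		int data;
		struct callback_head head;	/* used by call_hazptr() */
	};

	static struct foo *gp;			/* published pointer */
	static struct hazptr_context ctx;	/* init_hazptr_context(&ctx)
						 * is assumed to have run once */

	static void foo_free_cb(struct callback_head *head)
	{
		kfree(container_of(head, struct foo, head));
	}

	/* Reader. */
	static int reader(void)
	{
		hazptr_t *hzp;
		struct foo *p;
		int val = -1;

		hzp = hazptr_alloc(&ctx);	/* NULL if all slots are in use */
		if (!hzp)
			return val;

		p = hazptr_tryprotect(hzp, gp, head);
		if (p) {
			val = p->data;		/* p cannot be freed here */
			hazptr_clear(hzp);	/* drop the protection */
		}
		hazptr_free(&ctx, hzp);

		return val;
	}

	/* Updater (a single updater is assumed). */
	static void updater(struct foo *newp)
	{
		struct foo *old = gp;

		WRITE_ONCE(gp, newp);		/* unpublish the old object */
		if (old)			/* freed once no slot references it */
			call_hazptr(&old->head, foo_free_cb);
	}
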
[1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
lock-free objects," in IEEE Transactions on Parallel and
Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
Co-developed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Co-developed-by: Neeraj Upadhyay <neeraj.upadhyay@amd.com>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@amd.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
include/linux/hazptr.h | 83 ++++++++
kernel/Makefile | 1 +
kernel/hazptr.c | 463 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 547 insertions(+)
create mode 100644 include/linux/hazptr.h
create mode 100644 kernel/hazptr.c
diff --git a/include/linux/hazptr.h b/include/linux/hazptr.h
new file mode 100644
index 000000000000..4548ca8c75eb
--- /dev/null
+++ b/include/linux/hazptr.h
@@ -0,0 +1,83 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_HAZPTR_H
+#define _LINUX_HAZPTR_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rbtree.h>
+
+/* Hazard pointer slot. */
+typedef void* hazptr_t;
+
+#define HAZPTR_SLOT_PER_CTX 8
+
+struct hazptr_slot_snap {
+ struct rb_node node;
+ hazptr_t slot;
+};
+
+/*
+ * A set of hazard pointer slots for a context.
+ *
+ * The context can be a task, CPU or reader (depends on the use case). It may
+ * be allocated statically or dynamically. It can only be used after calling
+ * init_hazptr_context(), and users are also responsible for calling
+ * cleanup_hazptr_context() when it is not used any more.
+ */
+struct hazptr_context {
+ // The lock of the percpu context lists.
+ spinlock_t *lock;
+
+ struct list_head list;
+ struct hazptr_slot_snap snaps[HAZPTR_SLOT_PER_CTX];
+ ____cacheline_aligned hazptr_t slots[HAZPTR_SLOT_PER_CTX];
+};
+
+void init_hazptr_context(struct hazptr_context *hzcp);
+void cleanup_hazptr_context(struct hazptr_context *hzcp);
+hazptr_t *hazptr_alloc(struct hazptr_context *hzcp);
+void hazptr_free(struct hazptr_context *hzcp, hazptr_t *hzp);
+
+#define hazptr_tryprotect(hzp, gp, field) (typeof(gp))__hazptr_tryprotect(hzp, (void **)&(gp), offsetof(typeof(*gp), field))
+#define hazptr_protect(hzp, gp, field) ({ \
+ typeof(gp) ___p; \
+ \
+ ___p = hazptr_tryprotect(hzp, gp, field); \
+ BUG_ON(!___p); \
+ ___p; \
+})
+
+static inline void *__hazptr_tryprotect(hazptr_t *hzp,
+ void *const *p,
+ unsigned long head_offset)
+{
+ void *ptr;
+ struct callback_head *head;
+
+ ptr = READ_ONCE(*p);
+
+ if (ptr == NULL)
+ return NULL;
+
+ head = (struct callback_head *)(ptr + head_offset);
+
+ WRITE_ONCE(*hzp, head);
+ smp_mb();
+
+ ptr = READ_ONCE(*p); // read again
+
+ if (ptr + head_offset != head) { // pointer changed
+ WRITE_ONCE(*hzp, NULL); // reset hazard pointer
+ return NULL;
+ } else
+ return ptr;
+}
+
+static inline void hazptr_clear(hazptr_t *hzp)
+{
+ /* Pairs with smp_load_acquire() in reader scan. */
+ smp_store_release(hzp, NULL);
+}
+
+void call_hazptr(struct callback_head *head, rcu_callback_t func);
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 3c13240dfc9f..7927264b9870 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,6 +50,7 @@ obj-y += rcu/
obj-y += livepatch/
obj-y += dma/
obj-y += entry/
+obj-y += hazptr.o
obj-$(CONFIG_MODULES) += module/
obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/hazptr.c b/kernel/hazptr.c
new file mode 100644
index 000000000000..f22ccc2a4a62
--- /dev/null
+++ b/kernel/hazptr.c
@@ -0,0 +1,463 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/spinlock.h>
+#include <linux/cleanup.h>
+#include <linux/hazptr.h>
+#include <linux/percpu.h>
+#include <linux/workqueue.h>
+
+#define HAZPTR_UNUSED (1ul)
+
+/* Per-CPU data for hazard pointer management. */
+struct hazptr_percpu {
+ spinlock_t ctx_lock; /* hazptr context list lock */
+ struct list_head ctx_list; /* hazptr context list */
+ spinlock_t callback_lock; /* Per-CPU callback queue lock */
+ struct callback_head *queued; /* Per-CPU callback queue */
+ struct callback_head **tail;
+};
+
+/*
+ * Per-CPU data contains context lists and callbacks, which are maintained in a
+ * per-CPU manner to reduce potential contention. This means a global scan (for
+ * readers or callbacks) has to take each CPU's lock, but it's less problematic
+ * because that's a slowpath.
+ */
+DEFINE_PER_CPU(struct hazptr_percpu, hzpcpu);
+
+/* An rbtree that stores the reader-scan results of all hazptr slots. */
+struct hazptr_reader_tree {
+ spinlock_t lock;
+ struct rb_root root;
+};
+
+static void init_hazptr_reader_tree(struct hazptr_reader_tree *tree)
+{
+ spin_lock_init(&tree->lock);
+ tree->root = RB_ROOT;
+}
+
+static bool is_null_or_unused(hazptr_t slot)
+{
+ return slot == NULL || ((unsigned long)slot) == HAZPTR_UNUSED;
+}
+
+static int hazptr_node_cmp(const void *key, const struct rb_node *curr)
+{
+ unsigned long slot = (unsigned long)key;
+ struct hazptr_slot_snap *curr_snap = container_of(curr, struct hazptr_slot_snap, node);
+ unsigned long curr_slot = (unsigned long)(curr_snap->slot);
+
+ if (slot < curr_slot)
+ return -1;
+ else if (slot > curr_slot)
+ return 1;
+ else
+ return 0;
+}
+
+static bool hazptr_node_less(struct rb_node *new, const struct rb_node *curr)
+{
+ struct hazptr_slot_snap *new_snap = container_of(new, struct hazptr_slot_snap, node);
+
+ return hazptr_node_cmp((void *)new_snap->slot, curr) < 0;
+}
+
+/* Add the snapshot into a search tree. tree->lock must be held. */
+static inline void reader_add_locked(struct hazptr_reader_tree *tree,
+ struct hazptr_slot_snap *snap)
+{
+ lockdep_assert_held(&tree->lock);
+ BUG_ON(is_null_or_unused(snap->slot));
+
+ rb_add(&snap->node, &tree->root, hazptr_node_less);
+}
+
+static inline void reader_add(struct hazptr_reader_tree *tree,
+ struct hazptr_slot_snap *snap)
+{
+ guard(spinlock_irqsave)(&tree->lock);
+
+ reader_add_locked(tree, snap);
+}
+
+/* Delete the snapshot from a search tree. tree->lock must be held. */
+static inline void reader_del_locked(struct hazptr_reader_tree *tree,
+ struct hazptr_slot_snap *snap)
+{
+ lockdep_assert_held(&tree->lock);
+
+ rb_erase(&snap->node, &tree->root);
+}
+
+static inline void reader_del(struct hazptr_reader_tree *tree,
+ struct hazptr_slot_snap *snap)
+{
+ guard(spinlock_irqsave)(&tree->lock);
+
+ reader_del_locked(tree, snap);
+}
+
+/* Find whether a reader exists. tree->lock must be held. */
+static inline bool reader_exist_locked(struct hazptr_reader_tree *tree,
+ unsigned long slot)
+{
+ lockdep_assert_held(&tree->lock);
+
+ return !!rb_find((void *)slot, &tree->root, hazptr_node_cmp);
+}
+
+static inline bool reader_exist(struct hazptr_reader_tree *tree,
+ unsigned long slot)
+{
+ guard(spinlock_irqsave)(&tree->lock);
+
+ return reader_exist_locked(tree, slot);
+}
+
+/*
+ * Scan the readers from one hazptr context and update the global readers tree.
+ *
+ * Must be called with hzcp->lock held.
+ */
+static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree,
+ struct hazptr_context *hzcp)
+{
+ lockdep_assert_held(hzcp->lock);
+
+ for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
+ /*
+ * Pairs with smp_store_release() in hazptr_{clear,free}().
+ *
+ * Ensure
+ *
+ * <reader> <updater>
+ *
+ * [access protected pointers]
+ * hazptr_clear();
+ * smp_store_release()
+ * // in reader scan.
+ * smp_load_acquire(); // is null or unused.
+ * [run callbacks] // all accesses from
+ * // reader must be
+ * // observed.
+ */
+ hazptr_t val = smp_load_acquire(&hzcp->slots[i]);
+
+ if (!is_null_or_unused(val)) {
+ struct hazptr_slot_snap *snap = &hzcp->snaps[i];
+
+ // Already in the tree, need to remove first.
+ if (!is_null_or_unused(snap->slot)) {
+ reader_del(tree, snap);
+ }
+ snap->slot = val;
+ reader_add(tree, snap);
+ }
+ }
+}
+
+/*
+ * Initialize a hazptr context.
+ *
+ * Must be called before using the context for hazptr allocation.
+ */
+void init_hazptr_context(struct hazptr_context *hzcp)
+{
+ struct hazptr_percpu *pcpu = this_cpu_ptr(&hzpcpu);
+
+ for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
+ hzcp->slots[i] = (hazptr_t)HAZPTR_UNUSED;
+ hzcp->snaps[i].slot = (hazptr_t)HAZPTR_UNUSED;
+ }
+
+ guard(spinlock_irqsave)(&pcpu->ctx_lock);
+ list_add(&hzcp->list, &pcpu->ctx_list);
+ hzcp->lock = &pcpu->ctx_lock;
+}
+
+struct hazptr_struct {
+ struct work_struct work;
+ bool scheduled;
+
+ // List of callbacks: per-CPU queued callbacks are moved onto this
+ // global queued list in the workqueue function.
+ spinlock_t callback_lock;
+ struct callback_head *queued;
+ struct callback_head **tail;
+
+ struct hazptr_reader_tree readers;
+};
+
+static struct hazptr_struct hazptr_struct;
+
+/*
+ * Clean up hazptr context.
+ *
+ * Must be called before freeing the context. This function also removes any
+ * reference held by the hazard pointer slots in the context, even if
+ * hazptr_clear() or hazptr_free() was not called previously.
+ */
+void cleanup_hazptr_context(struct hazptr_context *hzcp)
+{
+ if (hzcp->lock) {
+ scoped_guard(spinlock_irqsave, hzcp->lock) {
+ list_del(&hzcp->list);
+ hzcp->lock = NULL;
+ }
+
+ for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
+ struct hazptr_slot_snap *snap = &hzcp->snaps[i];
+
+ if (!is_null_or_unused(snap->slot))
+ reader_del(&hazptr_struct.readers, snap);
+ }
+ }
+}
+
+/*
+ * Allocate a hazptr slot from a hazptr_context.
+ *
+ * Return: NULL if the allocation fails, otherwise the address of the
+ * allocated slot.
+ */
+hazptr_t *hazptr_alloc(struct hazptr_context *hzcp)
+{
+ unsigned long unused;
+
+ for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) {
+ if (((unsigned long)READ_ONCE(hzcp->slots[i])) == HAZPTR_UNUSED) {
+ unused = HAZPTR_UNUSED;
+
+ /*
+ * rwm-sequence is relied on here.
+ *
+ * This is needed since in case of a previous reader:
+ *
+ * <reader 1> <reader 2> <updater>
+ * [access protected pointers]
+ * hazptr_free():
+ * smp_store_release(); // hzptr == UNUSED
+ * hazptr_alloc():
+ * try_cmpxchg_relaxed(); // hzptr == NULL
+ *
+ * // in reader scan
+ * smp_load_acquire(); // is null
+ * [run callbacks]
+ *
+ * Because of the rwm-sequence of release operations,
+ * when callbacks are run, accesses from reader 1 must
+ * be already observed by the updater.
+ */
+ if (try_cmpxchg_relaxed(&hzcp->slots[i], &unused, NULL)) {
+ return (hazptr_t *)&hzcp->slots[i];
+ }
+ }
+ }
+
+ return NULL;
+}
+
+/* Free a hazptr slot. */
+void hazptr_free(struct hazptr_context *hzcp, hazptr_t *hzptr)
+{
+ WARN_ON(((unsigned long)*hzptr) == HAZPTR_UNUSED);
+
+ /* Pairs with smp_load_acquire() in reader scan */
+ smp_store_release(hzptr, (void *)HAZPTR_UNUSED);
+}
+
+/* Scan all possible readers, and update the global reader tree. */
+static void check_readers(struct hazptr_struct *hzst)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct hazptr_percpu *pcpu = per_cpu_ptr(&hzpcpu, cpu);
+ struct hazptr_context *ctx;
+
+ guard(spinlock_irqsave)(&pcpu->ctx_lock);
+ list_for_each_entry(ctx, &pcpu->ctx_list, list)
+ hazptr_context_snap_readers_locked(&hzst->readers, ctx);
+ }
+}
+
+/*
+ * Start the background work for hazptr callback handling if not started.
+ *
+ * Must be called with hazptr_struct.callback_lock held.
+ */
+static void kick_hazptr_work(void)
+{
+ if (hazptr_struct.scheduled)
+ return;
+
+ queue_work(system_wq, &hazptr_struct.work);
+ hazptr_struct.scheduled = true;
+}
+
+/*
+ * Check which callbacks are ready to be called.
+ *
+ * Return: a list of callbacks whose corresponding objects are no longer
+ * referenced by any reader.
+ */
+static struct callback_head *do_hazptr(struct hazptr_struct *hzst)
+{
+ struct callback_head *tmp, **curr;
+ struct callback_head *todo = NULL, **todo_tail = &todo;
+
+ // Currently, the lock is unnecessary, but maybe we will have multiple
+ // work_structs sharing the same callback list in the future.
+ guard(spinlock_irqsave)(&hzst->callback_lock);
+
+ curr = &hzst->queued;
+
+ while ((tmp = *curr)) {
+ // No reader, we can free the object.
+ if (!reader_exist(&hzst->readers, (unsigned long)tmp)) {
+ // Add tmp into todo.
+ *todo_tail = tmp;
+ todo_tail = &tmp->next;
+
+ // Delete tmp from ->queued and move to the next entry.
+ *curr = tmp->next;
+ } else
+ curr = &tmp->next;
+ }
+
+ hzst->tail = curr;
+
+ // Keep checking if callback list is not empty.
+ if (hzst->queued)
+ kick_hazptr_work();
+
+ *todo_tail = NULL;
+
+ return todo;
+}
+
+static void hazptr_work_func(struct work_struct *work)
+{
+ struct hazptr_struct *hzst = container_of(work, struct hazptr_struct, work);
+ struct callback_head *todo;
+
+ // Advance callbacks from hzpcpu to hzst
+ scoped_guard(spinlock_irqsave, &hzst->callback_lock) {
+ int cpu;
+
+ hzst->scheduled = false;
+ for_each_possible_cpu(cpu) {
+ struct hazptr_percpu *pcpu = per_cpu_ptr(&hzpcpu, cpu);
+
+ guard(spinlock)(&pcpu->callback_lock);
+
+ if (pcpu->queued) {
+ *(hzst->tail) = pcpu->queued;
+ hzst->tail = pcpu->tail;
+ pcpu->queued = NULL;
+ pcpu->tail = &pcpu->queued;
+ }
+ }
+ }
+
+ // Pairs with the smp_mb() on the reader side. This guarantees that if
+ // the hazptr work picked up the callback from an updater and the
+ // updater set the global pointer to NULL before enqueueing the callback,
+ // the hazptr work must observe a reader that protects the hazptr before
+ // the updater.
+ //
+ // <reader> <updater> <hazptr work>
+ // ptr = READ_ONCE(gp);
+ // WRITE_ONCE(*hazptr, ptr);
+ // smp_mb(); // => ->strong-fence
+ // tofree = gp;
+ // ptr = READ_ONCE(gp); // re-read, gp is not NULL
+ // // => ->fre
+ // WRITE_ONCE(gp, NULL);
+ // call_hazptr(gp, ...):
+ // lock(->callback_lock);
+ // [queued the callback]
+ // unlock(->callback_lock);// => ->po-unlock-lock-po
+ // lock(->callback_lock);
+ // [move from hzpcpu to hzst]
+ //
+ // smp_mb(); => ->strong-fence
+ // ptr = READ_ONCE(*hazptr);
+ // // ptr == gp, otherwise => ->fre
+ //
+ // If ptr != gp, it means there exists a circle with the following
+ // memory ordering relationships:
+ // ->strong-fence ->fre ->po-unlock-lock-po ->strong-fence ->fre
+ // => (due to the definition of prop)
+ // ->strong-fence ->prop ->strong-fence ->hb ->prop
+ // => (rotate the circle)
+ // ->prop ->strong-fence ->prop ->strong-fence ->hb
+ // => (due to the definition of pb)
+ // ->pb ->pb
+ // but pb is acyclic according to LKMM, so this cannot happen.
+ smp_mb();
+ check_readers(hzst);
+
+ todo = do_hazptr(hzst);
+
+ while (todo) {
+ struct callback_head *next = todo->next;
+ void (*func)(struct callback_head *) = todo->func;
+
+ func(todo);
+
+ todo = next;
+ }
+}
+
+/*
+ * Put @head into the cleanup callback queue.
+ *
+ * @func(@head) will be called after no one is referencing the object. Caller
+ * must ensure the object has been unpublished before calling this.
+ */
+void call_hazptr(struct callback_head *head, rcu_callback_t func)
+{
+ struct hazptr_percpu *pcpu = this_cpu_ptr(&hzpcpu);
+
+ head->func = func;
+ head->next = NULL;
+
+ scoped_guard(spinlock_irqsave, &pcpu->callback_lock) {
+ *(pcpu->tail) = head;
+ pcpu->tail = &head->next;
+ }
+
+ guard(spinlock_irqsave)(&hazptr_struct.callback_lock);
+ kick_hazptr_work();
+}
+
+static int init_hazptr_struct(void)
+{
+ int cpu;
+
+ INIT_WORK(&hazptr_struct.work, hazptr_work_func);
+
+ spin_lock_init(&hazptr_struct.callback_lock);
+ hazptr_struct.queued = NULL;
+ hazptr_struct.tail = &hazptr_struct.queued;
+
+ for_each_possible_cpu(cpu) {
+ struct hazptr_percpu *pcpu = per_cpu_ptr(&hzpcpu, cpu);
+
+ spin_lock_init(&pcpu->ctx_lock);
+ INIT_LIST_HEAD(&pcpu->ctx_list);
+
+ spin_lock_init(&pcpu->callback_lock);
+ pcpu->queued = NULL;
+ pcpu->tail = &pcpu->queued;
+
+ }
+
+ init_hazptr_reader_tree(&hazptr_struct.readers);
+
+ return 0;
+}
+
+early_initcall(init_hazptr_struct);
--
2.45.2
Am 9/17/2024 um 4:33 PM schrieb Boqun Feng:
> +static inline void *__hazptr_tryprotect(hazptr_t *hzp,
> +					 void *const *p,
> +					 unsigned long head_offset)
> +{
> +	void *ptr;
> +	struct callback_head *head;
> +
> +	ptr = READ_ONCE(*p);
> +
> +	if (ptr == NULL)
> +		return NULL;
> +
> +	head = (struct callback_head *)(ptr + head_offset);
> +
> +	WRITE_ONCE(*hzp, head);
> +	smp_mb();
> +
> +	ptr = READ_ONCE(*p); // read again
> +
> +	if (ptr + head_offset != head) { // pointer changed
> +		WRITE_ONCE(*hzp, NULL);  // reset hazard pointer
> +		return NULL;
> +	} else
> +		return ptr;
> +}

There is a subtle potential for ABA issues here.

If the compiler replaces 'return ptr;' with 'return head - head_offset;',
then you do not have an address dependency from the second read.

In this case, in ABA, the first read can read from a stale store, then
the second read reads the same value from a newer store but only
establishes control-dependency based synchronization with that store;
any reads from *ptr could be speculatively executed before doing the
second ptr = READ_ONCE(*p).

Therefore you could read the object state before it is properly
reinitialized by the second store.

I'm not sure what the most efficient fix is or if you just want to
gamble that "the compiler will never do that".
I guess either READ_ONCE(ptr) or a compiler barrier before return ptr
might do it?

Have fun,
jonas
Hi Jonas, On Fri, Sep 20, 2024 at 09:41:15AM +0200, Jonas Oberhauser wrote: > > > Am 9/17/2024 um 4:33 PM schrieb Boqun Feng: > > +static inline void *__hazptr_tryprotect(hazptr_t *hzp, > > + void *const *p, > > + unsigned long head_offset) > > +{ > > + void *ptr; > > + struct callback_head *head; > > + > > + ptr = READ_ONCE(*p); > > + > > + if (ptr == NULL) > > + return NULL; > > + > > + head = (struct callback_head *)(ptr + head_offset); > > + > > + WRITE_ONCE(*hzp, head); > > + smp_mb(); > > + > > + ptr = READ_ONCE(*p); // read again > > + > > + if (ptr + head_offset != head) { // pointer changed > > + WRITE_ONCE(*hzp, NULL); // reset hazard pointer > > + return NULL; > > + } else > > + return ptr; > > +} > > There is a subtle potential for ABA issues here. > > If the compiler replaces 'return ptr;' with 'return head - head_offset;', > then you do not have an address dependency from the second read. > > In this case, in ABA, the first read can read from a stale store, then the > second read reads the same value from a newer store but only establishes > control-dependency based synchronization with that store; any reads from > *ptr could be speculatively executed before doing the second ptr = > READ_ONCE(*p). > > Therefore you could read the object state before it is properly > reinitialized by the second store. > Thanks for taking a look, and nice find! > I'm not sure what the most efficient fix is or if you just want to gamble > that "the compiler will never do that". > I guess either READ_ONCE(ptr) or a compiler barrier before return ptr might > do it? > I think the root cause of this is that compiler can replace 'ptr' with 'head - head_offset' based on pointer value comparison. A fix would be converting pointer to unsigned long and doing the comparison: if ((unsigned long)ptr + head_offset != (unsigned long)head) { WRITE_ONCE(*hzp, NULL); return NULL; } else return ptr; because of the conversion, compilers lose the information of pointer equality, therefore cannot replace 'ptr' with 'head - head_offset'. Of course, if we are really worried about compilers being too "smart", we can always do the comparison in asm code, then compilers don't know anything of the equality between 'ptr' and 'head - head_offset'. Regards, boqun > Have fun, > jonas >
Am 9/25/2024 um 12:02 PM schrieb Boqun Feng:
> Hi Jonas,
>
> Of course, if we are really worried about compilers being too "smart"

Ah, I see you know me better and better...

> we can always do the comparison in asm code, then compilers don't know
> anything of the equality between 'ptr' and 'head - head_offset'.

Yes, but then a simple compiler barrier between the comparison and
returning ptr would also do the trick, right? And maybe easier on the
eyes.

Have fun,
jonas
On Wed, Sep 25, 2024 at 12:11:52PM +0200, Jonas Oberhauser wrote: > > > Am 9/25/2024 um 12:02 PM schrieb Boqun Feng: > > Hi Jonas, > > > > Of > > course, if we are really worried about compilers being too "smart" > > Ah, I see you know me better and better... > > > we can always do the comparison in asm code, then compilers don't know > > anything of the equality between 'ptr' and 'head - head_offset'. > Yes, but then a simple compiler barrier between the comparison and returning > ptr would also do the trick, right? And maybe easier on the eyes. > The thing about putting a compiler barrier is that it will prevent all compiler reorderings, and some of the reordering may contribute to better codegen. (I know in this case, we have a smp_mb(), but still compilers can move unrelated code upto the second load for optimization purpose). Asm comparison is cheaper in this way. But TBH, compilers should provide a way to compare pointer values without using the result for pointer equality proof, if "convert to unsigned long" doesn't work, some other ways should work. Regards, Boqun > > Have fun, > jonas >
On 2024-09-25 12:45, Boqun Feng wrote: > On Wed, Sep 25, 2024 at 12:11:52PM +0200, Jonas Oberhauser wrote: >> >> >> Am 9/25/2024 um 12:02 PM schrieb Boqun Feng: >>> Hi Jonas, >>> >>> Of >>> course, if we are really worried about compilers being too "smart" >> >> Ah, I see you know me better and better... >> >>> we can always do the comparison in asm code, then compilers don't know >>> anything of the equality between 'ptr' and 'head - head_offset'. >> Yes, but then a simple compiler barrier between the comparison and returning >> ptr would also do the trick, right? And maybe easier on the eyes. >> > > The thing about putting a compiler barrier is that it will prevent all > compiler reorderings, and some of the reordering may contribute to > better codegen. (I know in this case, we have a smp_mb(), but still > compilers can move unrelated code upto the second load for optimization > purpose). Asm comparison is cheaper in this way. But TBH, compilers > should provide a way to compare pointer values without using the result > for pointer equality proof, if "convert to unsigned long" doesn't work, > some other ways should work. > Based on Documentation/RCU/rcu_dereference.rst : - Be very careful about comparing pointers obtained from rcu_dereference() against non-NULL values. As Linus Torvalds explained, if the two pointers are equal, the compiler could substitute the pointer you are comparing against for the pointer obtained from rcu_dereference(). For example:: p = rcu_dereference(gp); if (p == &default_struct) do_default(p->a); Because the compiler now knows that the value of "p" is exactly the address of the variable "default_struct", it is free to transform this code into the following:: p = rcu_dereference(gp); if (p == &default_struct) do_default(default_struct.a); On ARM and Power hardware, the load from "default_struct.a" can now be speculated, such that it might happen before the rcu_dereference(). This could result in bugs due to misordering. So I am not only concerned about compiler proofs here, as it appears that the speculation done by the CPU can also cause issues on some architectures. Thanks, Mathieu > Regards, > Boqun > >> >> Have fun, >> jonas >> -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Am 9/25/2024 um 1:59 PM schrieb Mathieu Desnoyers: > On 2024-09-25 12:45, Boqun Feng wrote: >> On Wed, Sep 25, 2024 at 12:11:52PM +0200, Jonas Oberhauser wrote: >>> >>> >>> Am 9/25/2024 um 12:02 PM schrieb Boqun Feng: >>>> Hi Jonas, >>>> >>>> Of >>>> course, if we are really worried about compilers being too "smart" >>> >>> Ah, I see you know me better and better... >>> >>>> we can always do the comparison in asm code, then compilers don't know >>>> anything of the equality between 'ptr' and 'head - head_offset'. >>> Yes, but then a simple compiler barrier between the comparison and >>> returning >>> ptr would also do the trick, right? And maybe easier on the eyes. >>> >> >> The thing about putting a compiler barrier is that it will prevent all >> compiler reorderings, and some of the reordering may contribute to >> better codegen. (I know in this case, we have a smp_mb(), but still >> compilers can move unrelated code upto the second load for optimization >> purpose). Asm comparison is cheaper in this way. But TBH, compilers >> should provide a way to compare pointer values without using the result >> for pointer equality proof, if "convert to unsigned long" doesn't work, >> some other ways should work. >> > > Based on Documentation/RCU/rcu_dereference.rst : > > - Be very careful about comparing pointers obtained from > rcu_dereference() against non-NULL values. As Linus Torvalds > explained, if the two pointers are equal, the compiler could > substitute the pointer you are comparing against for the pointer > obtained from rcu_dereference(). For example:: > > p = rcu_dereference(gp); > if (p == &default_struct) > do_default(p->a); > > Because the compiler now knows that the value of "p" is exactly > the address of the variable "default_struct", it is free to > transform this code into the following:: > > p = rcu_dereference(gp); > if (p == &default_struct) > do_default(default_struct.a); > > On ARM and Power hardware, the load from "default_struct.a" > can now be speculated, such that it might happen before the > rcu_dereference(). This could result in bugs due to misordering. > > So I am not only concerned about compiler proofs here, as it appears > that the speculation done by the CPU can also cause issues on some > architectures. No, this is only possible in this example because of the compiler first doing some other optimizations (like what I mentioned on Boqun's original patch). If you can ensure that the instruction sequence corresponds to more or less t = load p // again // on alpha: dep fence ... *t then you can be sure that there is an address dependency which orders the access. This is guaranteed by LKMM, or if you don't trust LKMM, also by Arm, Power, Alpha etc. The extra dep fence on alpha is automatically inserted if you use READ_ONCE as boqun did (and I assumed your uatomic_load or whatever is doing the same thing, but I didn't check). Given that the hazard-pointer-protected object presumably is not a single static non-freeable object, but some dynamically allocated object, it is pretty much impossible for the compiler to guess the address like in the example you shared above. Note that inside the if, the code after transform is do_default(default_struct.a); which is an address that is known to the hardware before it loads from gp. That would not be the case here (if the compiler optimization is ruled out). jonas
On Wed, Sep 25, 2024 at 01:59:06PM +0200, Mathieu Desnoyers wrote: > On 2024-09-25 12:45, Boqun Feng wrote: > > On Wed, Sep 25, 2024 at 12:11:52PM +0200, Jonas Oberhauser wrote: > > > > > > > > > Am 9/25/2024 um 12:02 PM schrieb Boqun Feng: > > > > Hi Jonas, > > > > > > > > Of > > > > course, if we are really worried about compilers being too "smart" > > > > > > Ah, I see you know me better and better... > > > > > > > we can always do the comparison in asm code, then compilers don't know > > > > anything of the equality between 'ptr' and 'head - head_offset'. > > > Yes, but then a simple compiler barrier between the comparison and returning > > > ptr would also do the trick, right? And maybe easier on the eyes. > > > > > > > The thing about putting a compiler barrier is that it will prevent all > > compiler reorderings, and some of the reordering may contribute to > > better codegen. (I know in this case, we have a smp_mb(), but still > > compilers can move unrelated code upto the second load for optimization > > purpose). Asm comparison is cheaper in this way. But TBH, compilers > > should provide a way to compare pointer values without using the result > > for pointer equality proof, if "convert to unsigned long" doesn't work, > > some other ways should work. > > > > Based on Documentation/RCU/rcu_dereference.rst : > > - Be very careful about comparing pointers obtained from > rcu_dereference() against non-NULL values. As Linus Torvalds > explained, if the two pointers are equal, the compiler could > substitute the pointer you are comparing against for the pointer > obtained from rcu_dereference(). For example:: > > p = rcu_dereference(gp); > if (p == &default_struct) > do_default(p->a); > > Because the compiler now knows that the value of "p" is exactly > the address of the variable "default_struct", it is free to > transform this code into the following:: > > p = rcu_dereference(gp); > if (p == &default_struct) > do_default(default_struct.a); > > On ARM and Power hardware, the load from "default_struct.a" > can now be speculated, such that it might happen before the > rcu_dereference(). This could result in bugs due to misordering. > > So I am not only concerned about compiler proofs here, as it appears > that the speculation done by the CPU can also cause issues on some > architectures. > Note that the issue is caused by compiler replacing one pointer with the other, CPU speculation won't cause the issue solely, because all architectures Linux supports honor address dependencies (except for Alpha, but on Alpha, READ_ONCE() has a smp_mb() after the load). That is: if the compiler generates code as the code is written, there should be no problem. Regards, Boqun > Thanks, > > Mathieu > > > Regards, > > Boqun > > > > > > > > Have fun, > > > jonas > > > > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
On 2024-09-25 14:16, Boqun Feng wrote: > On Wed, Sep 25, 2024 at 01:59:06PM +0200, Mathieu Desnoyers wrote: >> On 2024-09-25 12:45, Boqun Feng wrote: >>> On Wed, Sep 25, 2024 at 12:11:52PM +0200, Jonas Oberhauser wrote: >>>> >>>> >>>> Am 9/25/2024 um 12:02 PM schrieb Boqun Feng: >>>>> Hi Jonas, >>>>> >>>>> Of >>>>> course, if we are really worried about compilers being too "smart" >>>> >>>> Ah, I see you know me better and better... >>>> >>>>> we can always do the comparison in asm code, then compilers don't know >>>>> anything of the equality between 'ptr' and 'head - head_offset'. >>>> Yes, but then a simple compiler barrier between the comparison and returning >>>> ptr would also do the trick, right? And maybe easier on the eyes. >>>> >>> >>> The thing about putting a compiler barrier is that it will prevent all >>> compiler reorderings, and some of the reordering may contribute to >>> better codegen. (I know in this case, we have a smp_mb(), but still >>> compilers can move unrelated code upto the second load for optimization >>> purpose). Asm comparison is cheaper in this way. But TBH, compilers >>> should provide a way to compare pointer values without using the result >>> for pointer equality proof, if "convert to unsigned long" doesn't work, >>> some other ways should work. >>> >> >> Based on Documentation/RCU/rcu_dereference.rst : >> >> - Be very careful about comparing pointers obtained from >> rcu_dereference() against non-NULL values. As Linus Torvalds >> explained, if the two pointers are equal, the compiler could >> substitute the pointer you are comparing against for the pointer >> obtained from rcu_dereference(). For example:: >> >> p = rcu_dereference(gp); >> if (p == &default_struct) >> do_default(p->a); >> >> Because the compiler now knows that the value of "p" is exactly >> the address of the variable "default_struct", it is free to >> transform this code into the following:: >> >> p = rcu_dereference(gp); >> if (p == &default_struct) >> do_default(default_struct.a); >> >> On ARM and Power hardware, the load from "default_struct.a" >> can now be speculated, such that it might happen before the >> rcu_dereference(). This could result in bugs due to misordering. >> >> So I am not only concerned about compiler proofs here, as it appears >> that the speculation done by the CPU can also cause issues on some >> architectures. >> > > Note that the issue is caused by compiler replacing one pointer with > the other, CPU speculation won't cause the issue solely, because all > architectures Linux supports honor address dependencies (except for > Alpha, but on Alpha, READ_ONCE() has a smp_mb() after the load). That > is: if the compiler generates code as the code is written, there should > be no problem. I am starting to wonder if it would not be more robust to just bite the bullet and add the inline asm helpers to do the pointer comparison outside of the compiler knowledge for each architecture. This should work fine in the short term, until compilers eventually give us a "compare pointers without allowing the compiler to infer anything about pointer equality". 
Like so: #include <stdio.h> #define __str_1(x) #x #define __str(x) __str_1(x) /* x86-64 */ #define bne_ptr(_a, _b, _label) \ asm goto ( \ "cmpq %[a], %[b]\n\t" \ "jne %l[" __str(_label) "]\n\t" \ : : [a] "r" (_a), [b] "r" (_b) \ : : _label) int x; int v[2]; int main(void) { bne_ptr(v, v + 1, label_same); x = 1; label_same: printf("%d\n", x); return 0; } > > Regards, > Boqun > >> Thanks, >> >> Mathieu >> >>> Regards, >>> Boqun >>> >>>> >>>> Have fun, >>>> jonas >>>> >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >> -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On 2024-09-25 14:47, Mathieu Desnoyers wrote: [...] > Like so: > > #include <stdio.h> > > #define __str_1(x) #x > #define __str(x) __str_1(x) > > /* x86-64 */ > #define bne_ptr(_a, _b, _label) \ > asm goto ( \ > "cmpq %[a], %[b]\n\t" \ > "jne %l[" __str(_label) "]\n\t" \ > : : [a] "r" (_a), [b] "r" (_b) \ > : : _label) > > int x; > > int v[2]; > > int main(void) > { > bne_ptr(v, v + 1, label_same); > x = 1; > label_same: Note that this label should probably be called "label_ne". I flipped the macro logic without changing the labels. Thanks, Mathieu > printf("%d\n", x); > return 0; > } > > >> >> Regards, >> Boqun >> >>> Thanks, >>> >>> Mathieu >>> >>>> Regards, >>>> Boqun >>>> >>>>> >>>>> Have fun, >>>>> jonas >>>>> >>> >>> -- >>> Mathieu Desnoyers >>> EfficiOS Inc. >>> https://www.efficios.com >>> > -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On 2024-09-25 15:10, Mathieu Desnoyers wrote: [...] Cleaner without goto in the user code: #include <stdio.h> #include <stdbool.h> static inline bool same_ptr(void *a, void *b) { asm goto ( "cmpq %[a], %[b]\n\t" "jne %l[ne]\n\t" : : [a] "r" (a), [b] "r" (b) : : ne); return true; ne: return false; } int x; int v[2]; int main(void) { if (same_ptr(v, v + 1)) x = 1; printf("%d\n", x); return 0; } -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On 2024-09-25 15:20, Mathieu Desnoyers wrote: [...] > static inline > bool same_ptr(void *a, void *b) > { > asm goto ( > "cmpq %[a], %[b]\n\t" > "jne %l[ne]\n\t" > : : [a] "r" (a), [b] "r" (b) > : : ne); > return true; > ne: > return false; > } Based on the information provided in this email thread, it appears the only concern when it comes to comparing a pointer loaded by rcu_dereference() with another pointer is the possibility of compiler optimizations. In the specific case of hazard pointer hpref_hp_get(), this function does both loads which are then compared with one another. Therefore, it is not possible for the user to provide a comparison value known at compile-time, except in the unlikely scenario where the hazard pointer would be const, which does not really make much sense. Therefore, just using rcu_dereference() for the second load should be fine. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Am 9/26/2024 um 8:16 AM schrieb Mathieu Desnoyers: > On 2024-09-25 15:20, Mathieu Desnoyers wrote: > [...] >> static inline >> bool same_ptr(void *a, void *b) >> { >> asm goto ( >> "cmpq %[a], %[b]\n\t" >> "jne %l[ne]\n\t" >> : : [a] "r" (a), [b] "r" (b) >> : : ne); >> return true; >> ne: >> return false; >> } > > Based on the information provided in this email thread, > it appears the only concern when it comes to comparing a > pointer loaded by rcu_dereference() with another pointer > is the possibility of compiler optimizations. > > In the specific case of hazard pointer hpref_hp_get(), this > function does both loads which are then compared with one > another. Therefore, it is not possible for the user to > provide a comparison value known at compile-time, except > in the unlikely scenario where the hazard pointer would > be const, which does not really make much sense. > > Therefore, just using rcu_dereference() for the second > load should be fine. > > Thanks, > > Mathieu > No, the issue introduced by the compiler optimization (or by your original patch) is that the CPU can speculatively load from the first pointer as soon as it has completeled the load of that poitner: node = ... // t's value can be computed here, since the value of node is known ... node2 = ... if (node == node2) // just a control dependency, doesn't prevent speculatively loading node t = *node if we can force the compiler's hand, it will generate this code which provides the necessary ordering at hardware level: node = ... // t can't be computed here since the value of node2 is not known here ... node2 = ... // t can be computed here if (node == node2) t = *node2 Kind regards, jonas
On Thu, 26 Sept 2024 at 08:54, Jonas Oberhauser
<jonas.oberhauser@huaweicloud.com> wrote:
>
> No, the issue introduced by the compiler optimization (or by your
> original patch) is that the CPU can speculatively load from the first
> pointer as soon as it has completed the load of that pointer:

You mean the compiler can do it. The inline asm has no impact on what
the CPU does. The conditional isn't a barrier for the actual hardware.

But once the compiler doesn't try to do it, the data dependency on the
address does end up being an ordering constraint on the hardware too
(I'm happy to say that I haven't heard from the crazies that want
value prediction in a long time).

Just use a barrier. Or make sure to use the proper ordered memory
accesses when possible. Don't use an inline asm for the compare - we
don't even have anything insane like that as a portable helper, and we
shouldn't have it.

            Linus
On 2024-09-26 18:12, Linus Torvalds wrote: > On Thu, 26 Sept 2024 at 08:54, Jonas Oberhauser > <jonas.oberhauser@huaweicloud.com> wrote: >> >> No, the issue introduced by the compiler optimization (or by your >> original patch) is that the CPU can speculatively load from the first >> pointer as soon as it has completed the load of that pointer: > > You mean the compiler can do it. The inline asm has no impact on what > the CPU does. The conditional isn't a barrier for the actual hardware. > But once the compiler doesn't try to do it, the data dependency on the > address does end up being an ordering constraint on the hardware too > (I'm happy to say that I haven't heard from the crazies that want > value prediction in a long time). > > Just use a barrier. Or make sure to use the proper ordered memory > accesses when possible. Don't use an inline asm for the compare - we > don't even have anything insane like that as a portable helper, and we > shouldn't have it. How does the compiler barrier help in any way here ? I am concerned about the compiler SSA GVN (Global Value Numbering) optimizations, and I don't think a compiler barrier solves anything. (or I'm missing something obvious) I was concerned about the suggestion from Jonas to use "node2" rather than "node" after the equality check as a way to ensure the intended register is used to return the pointer, because after the SSA GVN optimization pass, AFAIU this won't help in any way. I have a set of examples below that show gcc use the result of the first load, and clang use the result of the second load (on both x86-64 and aarch64). Likewise when a load-acquire is used as second load, which I find odd. Hopefully mixing this optimization from gcc with speculation still abide by the memory model. Only the asm goto approach ensures that gcc uses the result from the second load. 
#include <stdbool.h> #define READ_ONCE(x) \ (*(__volatile__ __typeof__(x) *)&(x)) static inline bool same_ptr(void *a, void *b) { asm goto ( #ifdef __x86_64__ "cmpq %[a], %[b]\n\t" "jne %l[ne]\n\t" #elif defined(__aarch64__) "cmp %[a], %[b]\n\t" "bne %l[ne]\n\t" #else # error "unimplemented" #endif : : [a] "r" (a), [b] "r" (b) : : ne); return true; ne: return false; } int *p; int fct_2_volatile(void) { int *a, *b; do { a = READ_ONCE(p); asm volatile ("" : : : "memory"); b = READ_ONCE(p); } while (a != b); return *b; } int fct_volatile_acquire(void) { int *a, *b; do { a = READ_ONCE(p); asm volatile ("" : : : "memory"); b = __atomic_load_n(&p, __ATOMIC_ACQUIRE); } while (a != b); return *b; } int fct_asm_compare(void) { int *a, *b; do { a = READ_ONCE(p); asm volatile ("" : : : "memory"); b = READ_ONCE(p); } while (!same_ptr(a, b)); return *b; } x86-64 gcc 14.2: fct_2_volatile: mov rax,QWORD PTR [rip+0x0] # 7 <fct_2_volatile+0x7> mov rdx,QWORD PTR [rip+0x0] # e <fct_2_volatile+0xe> cmp rax,rdx jne 0 <fct_2_volatile> mov eax,DWORD PTR [rax] ret cs nop WORD PTR [rax+rax*1+0x0] fct_volatile_acquire: mov rax,QWORD PTR [rip+0x0] # 27 <fct_volatile_acquire+0x7> mov rdx,QWORD PTR [rip+0x0] # 2e <fct_volatile_acquire+0xe> cmp rax,rdx jne 20 <fct_volatile_acquire> mov eax,DWORD PTR [rax] ret cs nop WORD PTR [rax+rax*1+0x0] fct_asm_compare: mov rdx,QWORD PTR [rip+0x0] # 47 <fct_asm_compare+0x7> mov rax,QWORD PTR [rip+0x0] # 4e <fct_asm_compare+0xe> cmp rax,rdx jne 40 <fct_asm_compare> mov eax,DWORD PTR [rax] ret main: xor eax,eax ret x86-64 clang 19.1.0: fct_2_volatile: mov rcx,QWORD PTR [rip+0x0] # 7 <fct_2_volatile+0x7> mov rax,QWORD PTR [rip+0x0] # e <fct_2_volatile+0xe> cmp rcx,rax jne 0 <fct_2_volatile> mov eax,DWORD PTR [rax] ret cs nop WORD PTR [rax+rax*1+0x0] fct_volatile_acquire: mov rcx,QWORD PTR [rip+0x0] # 27 <fct_volatile_acquire+0x7> mov rax,QWORD PTR [rip+0x0] # 2e <fct_volatile_acquire+0xe> cmp rcx,rax jne 20 <fct_volatile_acquire> mov eax,DWORD PTR [rax] ret cs nop WORD PTR [rax+rax*1+0x0] fct_asm_compare: mov rcx,QWORD PTR [rip+0x0] # 47 <fct_asm_compare+0x7> mov rax,QWORD PTR [rip+0x0] # 4e <fct_asm_compare+0xe> cmp rax,rcx jne 40 <fct_asm_compare> mov eax,DWORD PTR [rax] ret cs nop WORD PTR [rax+rax*1+0x0] main: xor eax,eax ret ARM64 gcc 14.2.0: fct_2_volatile: adrp x0, .LANCHOR0 add x0, x0, :lo12:.LANCHOR0 .L2: ldr x1, [x0] ldr x2, [x0] cmp x1, x2 bne .L2 ldr w0, [x1] ret fct_volatile_acquire: adrp x0, .LANCHOR0 add x0, x0, :lo12:.LANCHOR0 .L6: ldr x1, [x0] ldar x2, [x0] cmp x1, x2 bne .L6 ldr w0, [x1] ret fct_asm_compare: adrp x1, .LANCHOR0 add x1, x1, :lo12:.LANCHOR0 .L9: ldr x2, [x1] ldr x0, [x1] cmp x2, x0 bne .L9 ldr w0, [x0] ret p: .zero 8 armv8-a clang (trunk): fct_2_volatile: adrp x8, 0 <fct_2_volatile> ldr x10, [x8] ldr x9, [x8] cmp x10, x9 b.ne 4 <fct_2_volatile+0x4> // b.any ldr w0, [x9] ret fct_volatile_acquire: adrp x8, 0 <fct_2_volatile> add x8, x8, #0x0 ldr x10, [x8] ldar x9, [x8] cmp x10, x9 b.ne 24 <fct_volatile_acquire+0x8> // b.any ldr w0, [x9] ret fct_asm_compare: adrp x8, 0 <fct_2_volatile> ldr x9, [x8] ldr x8, [x8] cmp x9, x8 b.ne 3c <fct_asm_compare> // b.any ldr w0, [x8] ret main: mov w0, wzr -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On Fri, Sep 27, 2024 at 03:20:40AM +0200, Mathieu Desnoyers wrote: > On 2024-09-26 18:12, Linus Torvalds wrote: > > On Thu, 26 Sept 2024 at 08:54, Jonas Oberhauser > > <jonas.oberhauser@huaweicloud.com> wrote: > > > > > > No, the issue introduced by the compiler optimization (or by your > > > original patch) is that the CPU can speculatively load from the first > > > pointer as soon as it has completed the load of that pointer: > > > > You mean the compiler can do it. The inline asm has no impact on what > > the CPU does. The conditional isn't a barrier for the actual hardware. > > But once the compiler doesn't try to do it, the data dependency on the > > address does end up being an ordering constraint on the hardware too > > (I'm happy to say that I haven't heard from the crazies that want > > value prediction in a long time). > > > > Just use a barrier. Or make sure to use the proper ordered memory > > accesses when possible. Don't use an inline asm for the compare - we > > don't even have anything insane like that as a portable helper, and we > > shouldn't have it. > > How does the compiler barrier help in any way here ? > > I am concerned about the compiler SSA GVN (Global Value Numbering) > optimizations, and I don't think a compiler barrier solves anything. > (or I'm missing something obvious) I think you're right, a compiler barrier doesn't help here: head = READ_ONCE(p); smp_mb(); WRITE_ONCE(*slot, head); ptr = READ_ONCE(p); if (ptr != head) { ... } else { barrier(); return ptr; } compilers can replace 'ptr' with 'head' because of the equality, and even putting barrier() here cannot prevent compiler to rewrite the else branch into: else { barrier(); return head; } because that's just using a different register, unrelated to memory accesses. Jonas, am I missing something subtle? Or this is different than what you proposed? Regards, Boqun > > I was concerned about the suggestion from Jonas to use "node2" > rather than "node" after the equality check as a way to ensure > the intended register is used to return the pointer, because after > the SSA GVN optimization pass, AFAIU this won't help in any way. > I have a set of examples below that show gcc use the result of the > first load, and clang use the result of the second load (on > both x86-64 and aarch64). Likewise when a load-acquire is used as > second load, which I find odd. Hopefully mixing this optimization > from gcc with speculation still abide by the memory model. > > Only the asm goto approach ensures that gcc uses the result from > the second load. > [...]
Two comments inline, Am 9/27/2024 um 6:38 AM schrieb Boqun Feng: > compilers can replace 'ptr' with 'head' because of the equality, and > even putting barrier() here cannot prevent compiler to rewrite the else > branch into: > > else { > barrier(); > return head; > } > > because that's just using a different register, unrelated to memory > accesses. Yeah, that was the solution I had in mind. My reasoning was that from the C abstract machine, head is still a memory access, and the barrier() should make the compiler forget everything it knew about the value of head from before the barrier(). So I had a gap in my understanding of the strength of barrier(). Can I understand it to mean that the compiler can do escape analysis to reason that a volatile asm code with ::memory can't possibly be manipulating some variables (like head in this case)? That idea seems to be confirmed by this (atrocious, not to be copied!) example: int fct_escape_address_of_b(void) { int *a, *b; do { a = READ_ONCE(p); asm volatile ("" : : : "memory"); b = READ_ONCE(p); } while (a != b); // really really hide b int **p = &b; OPTIMIZER_HIDE_VAR(p); asm volatile ("" : : : "memory"); return *b; } This also does not generate any additional instructions, unlike just using OPTIMIZER_HIDE_VAR(b). What is the advantage of defining OPTIMIZE_HIDE_VAR the way it currently works instead of like above? > On Fri, Sep 27, 2024 at 03:20:40AM +0200, Mathieu Desnoyers wrote: >> I have a set of examples below that show gcc use the result of the >> first load, and clang use the result of the second load (on >> both x86-64 and aarch64). Likewise when a load-acquire is used as >> second load, which I find odd. Note that if you use acquire on the second load, the code is correct even if the compiler uses the result of the first load. That is because the acquire load will still synchronize sufficiently with the second publishing store after the ABA, and then we can get the right data even from the old address. Best wishes, jonas
On 2024-09-27 21:23, Jonas Oberhauser wrote: [...] > That idea seems to be confirmed by this (atrocious, not to be copied!) > example: > > int fct_escape_address_of_b(void) > { > int *a, *b; > > do { > a = READ_ONCE(p); > asm volatile ("" : : : "memory"); > b = READ_ONCE(p); > } while (a != b); > > // really really hide b > int **p = &b; > OPTIMIZER_HIDE_VAR(p); > > asm volatile ("" : : : "memory"); > return *b; > } > > This also does not generate any additional instructions, unlike just > using OPTIMIZER_HIDE_VAR(b). > > What is the advantage of defining OPTIMIZE_HIDE_VAR the way it currently > works instead of like above? Did you try it on godbolt.org ? Does it have the intended effect ? By the looks of it, you're just creating another version of @b called "p", which is then never used and would be discarded by further optimization. I'm unsure what you are trying to achieve here. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Am 9/27/2024 um 10:10 PM schrieb Mathieu Desnoyers: > On 2024-09-27 21:23, Jonas Oberhauser wrote: > [...] >> That idea seems to be confirmed by this (atrocious, not to be copied!) >> example: >> >> int fct_escape_address_of_b(void) >> { >> int *a, *b; >> >> do { >> a = READ_ONCE(p); >> asm volatile ("" : : : "memory"); >> b = READ_ONCE(p); >> } while (a != b); >> >> // really really hide b >> int **p = &b; >> OPTIMIZER_HIDE_VAR(p); >> >> asm volatile ("" : : : "memory"); >> return *b; >> } >> >> This also does not generate any additional instructions, unlike just >> using OPTIMIZER_HIDE_VAR(b). >> >> What is the advantage of defining OPTIMIZE_HIDE_VAR the way it >> currently works instead of like above? > > Did you try it on godbolt.org ? Does it have the intended effect ? I certainly did try and certainly read it as having the intended effect, otherwise I wouldn't have written that it seems confirmed. However, just because my eyes read it doesn't mean that's what happened, and even if it happened doesn't mean that it is guaranteed to happen. > By the looks of it, you're just creating another version of @b called > "p", which is then never used and would be discarded by further > optimization. > > I'm unsure what you are trying to achieve here. Simply put I'm trying to let the compiler think that I leaked the address of b. After that, the memory barrier should let it think that the b after the memory barrier might not be the same as the one before it (which was equal to a), forcing it to read from b. However, I suppose on second thought that that might not be enough, because the compiler could still simply do b = a right after exiting the while loop. And that is true no matter what we put behind the while loop or before the condition, as long as the condition compares a and b, right after it the compiler can do b = a. Just took me a while to see :)) I'm not sure why gcc does the b=a with the normal OPTIMIZER_HIDE_VAR but (as far as I read the code) doesn't do it with the above. Maybe just a weird corner case... Have fun, jonas
2024年9月28日 06:18,Jonas Oberhauser <jonas.oberhauser@huaweicloud.com> wrote: > > > > Am 9/27/2024 um 10:10 PM schrieb Mathieu Desnoyers: >> On 2024-09-27 21:23, Jonas Oberhauser wrote: >> [...] >>> That idea seems to be confirmed by this (atrocious, not to be copied!) example: >>> >>> int fct_escape_address_of_b(void) >>> { >>> int *a, *b; >>> >>> do { >>> a = READ_ONCE(p); >>> asm volatile ("" : : : "memory"); >>> b = READ_ONCE(p); >>> } while (a != b); >>> >>> // really really hide b >>> int **p = &b; >>> OPTIMIZER_HIDE_VAR(p); >>> >>> asm volatile ("" : : : "memory"); >>> return *b; >>> } >>> >>> This also does not generate any additional instructions, unlike just using OPTIMIZER_HIDE_VAR(b). >>> >>> What is the advantage of defining OPTIMIZE_HIDE_VAR the way it currently works instead of like above? >> Did you try it on godbolt.org ? Does it have the intended effect ? > > I certainly did try and certainly read it as having the intended effect, otherwise I wouldn't have written that it seems confirmed. > > However, just because my eyes read it doesn't mean that's what happened, and even if it happened doesn't mean that it is guaranteed to happen. > >> By the looks of it, you're just creating another version of @b called >> "p", which is then never used and would be discarded by further >> optimization. > >> I'm unsure what you are trying to achieve here. > > Simply put I'm trying to let the compiler think that I leaked the address of b. After that, the memory barrier should let it think that the b after the memory barrier might not be the same as the one before it (which was equal to a), forcing it to read from b. > > However, I suppose on second thought that that might not be enough, because the compiler could still simply do b = a right after exiting the while loop. > > And that is true no matter what we put behind the while loop or before the condition, as long as the condition compares a and b, right after it the compiler can do b = a. Just took me a while to see :)) > > I'm not sure why gcc does the b=a with the normal OPTIMIZER_HIDE_VAR but (as far as I read the code) doesn't do it with the above. Maybe just a weird corner case... Let the p to be a static variable out of the function will make a difference. Or the following: int **p = &b; barrier_data(p); also works. BTW, barrier_data(&b) generates more instructions than godbolt when build the kernel. > > Have fun, > jonas > >
2024年9月29日 06:10,Alan Huang <mmpgouride@gmail.com> wrote: > > 2024年9月28日 06:18,Jonas Oberhauser <jonas.oberhauser@huaweicloud.com> wrote: >> >> >> >> Am 9/27/2024 um 10:10 PM schrieb Mathieu Desnoyers: >>> On 2024-09-27 21:23, Jonas Oberhauser wrote: >>> [...] >>>> That idea seems to be confirmed by this (atrocious, not to be copied!) example: >>>> >>>> int fct_escape_address_of_b(void) >>>> { >>>> int *a, *b; >>>> >>>> do { >>>> a = READ_ONCE(p); >>>> asm volatile ("" : : : "memory"); >>>> b = READ_ONCE(p); >>>> } while (a != b); >>>> >>>> // really really hide b >>>> int **p = &b; >>>> OPTIMIZER_HIDE_VAR(p); >>>> >>>> asm volatile ("" : : : "memory"); >>>> return *b; >>>> } >>>> >>>> This also does not generate any additional instructions, unlike just using OPTIMIZER_HIDE_VAR(b). >>>> >>>> What is the advantage of defining OPTIMIZE_HIDE_VAR the way it currently works instead of like above? >>> Did you try it on godbolt.org ? Does it have the intended effect ? >> >> I certainly did try and certainly read it as having the intended effect, otherwise I wouldn't have written that it seems confirmed. >> >> However, just because my eyes read it doesn't mean that's what happened, and even if it happened doesn't mean that it is guaranteed to happen. >> >>> By the looks of it, you're just creating another version of @b called >>> "p", which is then never used and would be discarded by further >>> optimization. > >>> I'm unsure what you are trying to achieve here. >> >> Simply put I'm trying to let the compiler think that I leaked the address of b. After that, the memory barrier should let it think that the b after the memory barrier might not be the same as the one before it (which was equal to a), forcing it to read from b. >> >> However, I suppose on second thought that that might not be enough, because the compiler could still simply do b = a right after exiting the while loop. >> >> And that is true no matter what we put behind the while loop or before the condition, as long as the condition compares a and b, right after it the compiler can do b = a. Just took me a while to see :)) >> >> I'm not sure why gcc does the b=a with the normal OPTIMIZER_HIDE_VAR but (as far as I read the code) doesn't do it with the above. Maybe just a weird corner case... > > Let the p to be a static variable out of the function will make a difference. > > Or the following: > > int **p = &b; > barrier_data(p); Or the following: int **t = &b; WRITE_ONCE(t, &b); barrier(); return *b; also works. > > also works. > > BTW, barrier_data(&b) generates more instructions than godbolt when build the kernel. > >> >> Have fun, >> jonas
Am 9/26/2024 um 6:12 PM schrieb Linus Torvalds: > On Thu, 26 Sept 2024 at 08:54, Jonas Oberhauser > <jonas.oberhauser@huaweicloud.com> wrote: >> >> No, the issue introduced by the compiler optimization (or by your >> original patch) is that the CPU can speculatively load from the first >> pointer as soon as it has completed the load of that pointer: > > You mean the compiler can do it. What I mean is that if we only use rcu_dereference for the second load (and not either some form of compiler barrier or an acquire load), then the compiler can transform the second program from my previous e-mail (which if mapped 1:1 to hardware would be correct because hardware ensures the ordering based on the address dependency) into the first one (which is incorrect). In particular, the compiler can change if (node == node2) t = *node2; into if (node == node2) t = *node; and then the CPU can speculatively read *node before knowing the value of node2. The compiler can also speculatively read *node in this case, but that is not what I meant. The code in Mathieu's original patch is already like the latter one and is broken even if the compiler does not do any optimizations. > The inline asm has no impact on what > the CPU does. The conditional isn't a barrier for the actual hardware. > But once the compiler doesn't try to do it, the data dependency on the > address does end up being an ordering constraint on the hardware too Exactly. The inline asm would prevent the compiler from doing the transformation though, which would mean that the address dependency appears in the final compiler output. > Just use a barrier. Or make sure to use the proper ordered memory > accesses when possible. > > Don't use an inline asm for the compare - we > don't even have anything insane like that as a portable helper, and we > shouldn't have it. I'm glad you say that :)) I would also just use a barrier before returing the pointer. Boqun seems to be unhappy with a barrier though, because it would theoretically also forbid unrelated optimizations. But I have not seen any evidence that there are any unrelated optimizations going on in the first place that would be forbidden by this. Have fun, jonas
On Thu, 26 Sept 2024 at 09:40, Jonas Oberhauser <jonas.oberhauser@huaweicloud.com> wrote: > > Boqun seems to be unhappy with a barrier though, because it would > theoretically also forbid unrelated optimizations. Well, doing a "barrier()" is kind of a big hammer thing, but honestly, I don't think we've ever seen any real situation where it makes a noticeable difference. Yes, it can pessimize compiler output more than strictly necessary, but the kind of code generation issues it causes tends to be the non-problematic kind (and particularly the kind that even a trivial OoO core will deal with well). We do have some more directed compiler barriers available, and this code might be able to use OPTIMIZER_HIDE_VAR() for example. It's kind of a "single variable value barrier". Honestly, we don't use it much. It just tends to be _too_specific. But it is there if somebody wants to use it. > But I have not seen any evidence that there are any unrelated > optimizations going on in the first place that would be forbidden by this. Compared to something like "smp_mb()", which is not just a compiler barrier but also generates typically very expensive instructions that completely mess with an OoO core, a regular compiler barrier is a complete non-issue. When you have those two close to each other, you'd have to make up some very odd situation where the plain "barrier()" is even noticeable. Linus
On Thu, Sep 26, 2024 at 09:54:33AM -0700, Linus Torvalds wrote: > On Thu, 26 Sept 2024 at 09:40, Jonas Oberhauser > <jonas.oberhauser@huaweicloud.com> wrote: > > > > Boqun seems to be unhappy with a barrier though, because it would > > theoretically also forbid unrelated optimizations. > > Well, doing a "barrier()" is kind of a big hammer thing, but honestly, > I don't think we've ever seen any real situation where it makes a > noticeable difference. Yes, it can pessimize compiler output more than > strictly necessary, but the kind of code generation issues it causes > tends to be the non-problematic kind (and particularly the kind that > even a trivial OoO core will deal with well). > > We do have some more directed compiler barriers available, and this > code might be able to use OPTIMIZER_HIDE_VAR() for example. It's kind > of a "single variable value barrier". > Hmm.. this seems can do the trick? #define ADDRESS_EQ(var, expr) \ ({ \ bool _____cmp_res = (unsigned long)(var) == (unsigned long)(expr); \ \ OPTIMIZER_HIDE_VAR(var); \ _____cmp_res; \ }) i.e. compare the address and hide the equality information immediately, so in hazptr code: ptr = READ_ONCE(*p); // first read if (ptr == NULL) return NULL; head = (struct callback_head *)(ptr + head_offset); WRITE_ONCE(*hzp, head); smp_mb(); ptr = READ_ONCE(*p); // read again if (!ADDRESS_EQ(ptr, (void *)head - head_offset)) { // pointer changed WRITE_ONCE(*hzp, NULL); // reset hazard pointer return NULL; } else { // Optimizer lost the information on the value of 'ptr', // so it cannot replace it with head - head_offset. return ptr; } Regards, Boqun > Honestly, we don't use it much. It just tends to be _too_specific. But > it is there if somebody wants to use it. > > > But I have not seen any evidence that there are any unrelated > > optimizations going on in the first place that would be forbidden by this. > > Compared to something like "smp_mb()", which is not just a compiler > barrier but also generates typically very expensive instructions that > completely mess with an OoO core, a regular compiler barrier is a > complete non-issue. When you have those two close to each other, you'd > have to make up some very odd situation where the plain "barrier()" is > even noticeable. > > Linus >
On 2024-09-27 02:01, Boqun Feng wrote: > #define ADDRESS_EQ(var, expr) \ > ({ \ > bool _____cmp_res = (unsigned long)(var) == (unsigned long)(expr); \ > \ > OPTIMIZER_HIDE_VAR(var); \ > _____cmp_res; \ > }) If the goal is to ensure gcc uses the register populated by the second, I'm afraid it does not work. AFAIU, "hiding" the dependency chain does not prevent the SSA GVN optimization from combining the registers as being one and choosing one arbitrary source. "hiding" the dependency chain before or after the comparison won't help here. int fct_hide_var_compare(void) { int *a, *b; do { a = READ_ONCE(p); asm volatile ("" : : : "memory"); b = READ_ONCE(p); } while (!ADDRESS_EQ(a, b)); return *b; } gcc 14.2 x86-64: fct_hide_var_compare: mov rax,QWORD PTR [rip+0x0] # 67 <fct_hide_var_compare+0x7> mov rdx,QWORD PTR [rip+0x0] # 6e <fct_hide_var_compare+0xe> cmp rax,rdx jne 60 <fct_hide_var_compare> mov eax,DWORD PTR [rax] ret main: xor eax,eax ret gcc 14.2.0 ARM64: fct_hide_var_compare: adrp x0, .LANCHOR0 add x0, x0, :lo12:.LANCHOR0 .L12: ldr x1, [x0] ldr x2, [x0] cmp x1, x2 bne .L12 ldr w0, [x1] ret p: .zero 8 -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On Fri, Sep 27, 2024, at 9:30 AM, Mathieu Desnoyers wrote: > On 2024-09-27 02:01, Boqun Feng wrote: >> #define ADDRESS_EQ(var, expr) \ >> ({ \ >> bool _____cmp_res = (unsigned long)(var) == (unsigned long)(expr); \ >> \ >> OPTIMIZER_HIDE_VAR(var); \ >> _____cmp_res; \ >> }) > > If the goal is to ensure gcc uses the register populated by the > second, I'm afraid it does not work. AFAIU, "hiding" the dependency > chain does not prevent the SSA GVN optimization from combining the > registers as being one and choosing one arbitrary source. "hiding" > the dependency chain before or after the comparison won't help here. > > int fct_hide_var_compare(void) > { > int *a, *b; > > do { > a = READ_ONCE(p); > asm volatile ("" : : : "memory"); > b = READ_ONCE(p); > } while (!ADDRESS_EQ(a, b)); Note that ADDRESS_EQ() only hide first parameter, so this should be ADDRESS_EQ(b, a). Regards, Boqun > return *b; > } > > gcc 14.2 x86-64: > > fct_hide_var_compare: > mov rax,QWORD PTR [rip+0x0] # 67 <fct_hide_var_compare+0x7> > mov rdx,QWORD PTR [rip+0x0] # 6e <fct_hide_var_compare+0xe> > cmp rax,rdx > jne 60 <fct_hide_var_compare> > mov eax,DWORD PTR [rax] > ret > main: > xor eax,eax > ret > > gcc 14.2.0 ARM64: > > fct_hide_var_compare: > adrp x0, .LANCHOR0 > add x0, x0, :lo12:.LANCHOR0 > .L12: > ldr x1, [x0] > ldr x2, [x0] > cmp x1, x2 > bne .L12 > ldr w0, [x1] > ret > p: > .zero 8 > > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com
On Thu, 26 Sept 2024 at 18:38, Boqun Feng <boqun.feng@gmail.com> wrote: > > Note that ADDRESS_EQ() only hide first parameter, so this should be ADDRESS_EQ(b, a). Yeah, please stop making things unnecessarily complicated. Just use a barrier(). Please. Stop these stupid games until you can show why it matters. And by "why it matters" I mean "major difference in code generation", not some "it uses one more register and has to spill" kind of small detail. At this point, I'm not even convinced the whole hazard pointer approach makes sense. And you're not helping by making it more complicated than it needs to be. Linus
On 2024-09-27 18:44, Linus Torvalds wrote: > On Thu, 26 Sept 2024 at 18:38, Boqun Feng <boqun.feng@gmail.com> wrote: >> >> Note that ADDRESS_EQ() only hide first parameter, so this should be ADDRESS_EQ(b, a). > > Yeah, please stop making things unnecessarily complicated. > > Just use a barrier(). Please. Stop these stupid games until you can > show why it matters. The barrier() is ineffective at fixing the issue. It does not prevent the compiler CSE from losing the address dependency: int fct_2_volatile_barriers(void) { int *a, *b; do { a = READ_ONCE(p); asm volatile ("" : : : "memory"); b = READ_ONCE(p); } while (a != b); asm volatile ("" : : : "memory"); <----- where you would like your barrier return *b; } With gcc 14.2 (arm64): fct_2_volatile_barriers: adrp x0, .LANCHOR0 add x0, x0, :lo12:.LANCHOR0 .L2: ldr x1, [x0] <------ x1 populated by first load. ldr x2, [x0] cmp x1, x2 bne .L2 ldr w0, [x1] <------ x1 is used for access which should depend on b. ret On weakly-ordered architectures, this lets CPU speculation use the result from the first load to speculate "ldr w0, [x1]" before "ldr x2, [x0]". Based on the RCU documentation, the control dependency does not prevent the CPU from speculating loads. There are a few ways to fix this: - Compile everything with -fno-cse-follow-jumps. - Make the compiler unaware of the relationship between the address equality and address-dependent use of b. This can be done either using ADDRESS_EQ() or asm goto. I prefer ADDRESS_EQ() because it is arch-agnostic. I don't care that it adds one more register movement instruction. > And by "why it matters" I mean "major difference in code generation", > not some "it uses one more register and has to spill" kind of small > detail. Why it matters is because gcc generates code that does not preserve address dependency of the second READ_ONCE(). > At this point, I'm not even convinced the whole hazard pointer > approach makes sense. And you're not helping by making it more > complicated than it needs to be. I'm preparing a small series that aims to show how a minimal hazard pointer implementation can help improve common scenarios: - Guarantee object existence on pointer dereference to increment a reference count: - replace locking used for that purpose in drivers (e.g. usb), - replace RCU + inc_not_zero pattern, - rtmutex: I suspect we can improve situations where locks need to be taken in reverse dependency chain order by guaranteeing existence of first and second locks in traversal order, allowing them to be locked in the correct order (which is reverse from traversal order) rather than try-lock+retry on nested lock. This can be done with hazard pointers without requiring object reclaim to be delayed by an RCU grace period. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
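To make the "replace RCU + inc_not_zero pattern" point concrete, a lookup could move from an RCU read-side critical section to a hazard pointer slot roughly as sketched below. Everything here is illustrative: 'struct obj', 'slot' and the two helpers are made up, __hazptr_tryprotect() follows the signature quoted elsewhere in this thread, and hazptr_clear() is assumed to be the matching release of the slot.

	struct obj {
		refcount_t ref;
		struct callback_head head;	/* reclaimed via call_hazptr() */
	};

	struct obj *slot;	/* published pointer, unpublished before reclaim */

	/* Today: RCU read-side critical section + refcount_inc_not_zero(). */
	static struct obj *obj_get_rcu(void)
	{
		struct obj *o;

		rcu_read_lock();
		o = rcu_dereference(slot);
		if (o && !refcount_inc_not_zero(&o->ref))
			o = NULL;
		rcu_read_unlock();
		return o;
	}

	/*
	 * Sketch with a hazard pointer: the object cannot be reclaimed while
	 * the slot points at it, so the refcount only covers the longer-term
	 * reference the caller walks away with.
	 */
	static struct obj *obj_get_hazptr(hazptr_t *hzp)
	{
		struct obj *o;

		o = (struct obj *)__hazptr_tryprotect(hzp, (void * const *)&slot,
						      offsetof(struct obj, head));
		if (o && !refcount_inc_not_zero(&o->ref))
			o = NULL;
		hazptr_clear(hzp);
		return o;
	}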
On Fri, 27 Sept 2024 at 10:17, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > The barrier() is ineffective at fixing the issue. > It does not prevent the compiler CSE from losing the > address dependency: Ok. Thanks for actually specifying code. That needs to be (a) in a comment (b) the value barrier needs to be on *both* values so that the order of the equality testing doesn't matter. > I'm preparing a small series that aims to show how a minimal > hazard pointer implementation can help improve common scenarios: I want actual numbers on real loads. Just so you know. Not "this can help". But "this actually really _does_ help". Linus
On 2024-09-27 19:23, Linus Torvalds wrote: > On Fri, 27 Sept 2024 at 10:17, Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> The barrier() is ineffective at fixing the issue. >> It does not prevent the compiler CSE from losing the >> address dependency: > > Ok. Thanks for actually specifying code. > > That needs to be > > (a) in a comment OK. I'll add the code/asm examples to the comment above ADDRESS_EQ(). > > (b) the value barrier needs to be on *both* values so that the order > of the equality testing doesn't matter. If we use OPTIMIZER_HIDE_VAR() on both parameters, it indeed minimizes the odds that someone get the order wrong, but it disallows using ADDRESS_EQ() with a constant parameter (e.g. ADDRESS_EQ(ptr, &mystruct)), which would be nice to cover. It works fine with using OPTIMIZER_HIDE_VAR() on the first argument, but it opens the door to misuses. Perhaps there is a trick with compiler builtins we could do to only use OPTIMIZER_HIDE_VAR() on non-constant arguments, but I can't get it to work so far. > >> I'm preparing a small series that aims to show how a minimal >> hazard pointer implementation can help improve common scenarios: > > I want actual numbers on real loads. Just so you know. Not "this can > help". But "this actually really _does_ help". Noted, thanks! Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On Fri, 27 Sept 2024 at 10:53, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > > (b) the value barrier needs to be on *both* values so that the order > > of the equality testing doesn't matter. > > If we use OPTIMIZER_HIDE_VAR() on both parameters, it indeed minimizes > the odds that someone get the order wrong, but it disallows using > ADDRESS_EQ() with a constant parameter No it doesn't. This is trivial - just hide the source of the *comparison*, so that the compiler doesn't know what you are comparing, and can't use it to then replace one with the other: static __always_inline bool compare_ptr(const volatile void *a, const volatile void *b) { OPTIMIZER_HIDE_VAR(a); OPTIMIZER_HIDE_VAR(b); return a == b; } compares two arbitrary pointer values without allowing the compiler to then use the comparison result to use either of the original values as a replacement for the other. And yes, that "hide both" model will cause a bit more register pressure, because the compiler will now compare two values that it really thinks are potentially different from the originals. So you'll have two "useless" temporaries that contain the same values as the source pointers, but if that's the cost of having a comparison that the compiler can't see, that's fine. Making it a bit less obvious, you can hide just one of the variables - you don't actually need to hide both (but for clarity, maybe you want to). Because even hiding the value one from the compiler will mean that it can't use the comparison to decide that the originals are equal, even if one of them is unhidden. No? Linus
On 2024-09-27 20:13, Linus Torvalds wrote: > On Fri, 27 Sept 2024 at 10:53, Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >>> (b) the value barrier needs to be on *both* values so that the order >>> of the equality testing doesn't matter. >> >> If we use OPTIMIZER_HIDE_VAR() on both parameters, it indeed minimizes >> the odds that someone get the order wrong, but it disallows using >> ADDRESS_EQ() with a constant parameter > > No it doesn't. > > This is trivial - just hide the source of the *comparison*, so that > the compiler doesn't know what you are comparing, and can't use it to > then replace one with the other: > > static __always_inline bool compare_ptr(const volatile void *a, > const volatile void *b) > { > OPTIMIZER_HIDE_VAR(a); > OPTIMIZER_HIDE_VAR(b); > return a == b; > } Cool! It works! Thanks! > > compares two arbitrary pointer values without allowing the compiler to > then use the comparison result to use either of the original values as > a replacement for the other. Yep. And the static inline is much cleaner as it allows users to pass constants as well. > > And yes, that "hide both" model will cause a bit more register > pressure, because the compiler will now compare two values that it > really thinks are potentially different from the originals. So you'll > have two "useless" temporaries that contain the same values as the > source pointers, but if that's the cost of having a comparison that > the compiler can't see, that's fine. I've tried it and it seems that the compiler only leaves one "mov" extra there, since the extra register movement on the input that is not used afterwards can then be optimized away. > > Making it a bit less obvious, you can hide just one of the variables - > you don't actually need to hide both m(but for clarity, maybe you want > to). > > Because even hiding the value one from the compiler will mean that it > can't use the comparison to decide that the originals are equal, even > if one of them is unhidden. > > No? I would prefer hiding the two input variables. Hust hiding one variable might work for CSE (and light godbolt attempts seem to confirm this), but I'm worried that it eventually breaks when compilers start making SSA GVN optimizations smarter. AFAIU, in the SSA GVN model, if we just hide @a before the comparison and don't hide @b, we'd be in a situation where the compiler could know that the version of the variable generated by hiding @a (call it a') is equal to @b, and therefore when using @b afterward could instead use a', which is derived from @a rather than @b. It may not happen in practice just because why would a sane optimization would prefer using a version that is deeper in the dependency chain (a') rather than @b, but that would be based on assumptions on how specific heuristics work, and would therefore be fragile. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On 9/27/2024 at 8:13 PM, Linus Torvalds wrote: > Because even hiding the value one from the compiler will mean that it > can't use the comparison to decide that the originals are equal, even > if one of them is unhidden. > > No? > > Linus I think it depends on which one you hide. If you do z = b; hide(z); if (a==z) { *b; } then it will be fine, because it knows a==z but nothing about the relation of b with a or z. But for z = a; hide(z); if (z==b) { *b; } then it would still know that b == z, and could replace *b with *z (which really is *a). Best wishes, jonas
On Fri, 27 Sept 2024 at 12:12, Jonas Oberhauser <jonas.oberhauser@huaweicloud.com> wrote: > > I think it depends on which one you hide. No. Dammit, people, read the code I posted. > But for > > z = a; > hide(z); > if (z==b) { *b; } No. I *intentionally* made it an inline function, and only hid the arguments to the equality comparison. So the "hide(z)" hides the argument to the inline function - NOT THE ORIGINAL. > then it would still know that b == z, and could replace *b with *z > (which really is *a). No. The hiding is literally *ONLY* for the comparison. It's inside the helper function. It doesn't affect the originals at all. Which means that the compiler CANNOT KNOW anything about the original pointers when it compares for equality (or inequality). Basically, the comparison is now a black box to the compiler, and the compiler cannot use the result of the comparison to make ANY judgment on whether the two original pointers were related or not. Linus
On Fri, 27 Sept 2024 at 12:28, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Dammit, people, read the code I posted. Actually, no, apologies. You were right, and I was wrong. It does need both of the sources of the comparison to be hidden, because while even hiding just one side makes the comparison result "meaningless" as far as the compiler is concerned, the compiler will still have generated a pseudo for the hidden value, and might decide that it can re-use that pseudo for the non-hidden pointer if the two then match. So yeah, the example function I posted should be safe, but my "you can probably make do with hiding just one side" was actually a bogus and invalid optimization. You do need to hide both sides. Not just for clarity, but for correctness. Linus
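Tying this back to the hazptr code that started the thread: with the helper hiding both sides, the recheck can compare the re-read pointer against the first read and keep returning the re-read value, along the lines of the sketch below (based on the __hazptr_tryprotect() excerpt quoted elsewhere in this thread, not necessarily what the next version of the patch will do; compare_ptr() is the inline helper from a few messages up).

	static inline void *__hazptr_tryprotect(hazptr_t *hzp,
						void *const *p,
						unsigned long head_offset)
	{
		void *ptr, *ptr2;
		struct callback_head *head;

		ptr = READ_ONCE(*p);
		if (ptr == NULL)
			return NULL;

		head = (struct callback_head *)(ptr + head_offset);

		WRITE_ONCE(*hzp, head);
		smp_mb();

		ptr2 = READ_ONCE(*p);		/* read again */

		/* Opaque comparison: the compiler cannot conclude ptr2 == ptr. */
		if (!compare_ptr(ptr2, ptr)) {	/* pointer changed */
			WRITE_ONCE(*hzp, NULL);	/* reset hazard pointer */
			return NULL;
		}

		/* Later dereferences keep their dependency on the second read. */
		return ptr2;
	}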
On Fri, Sep 27, 2024 at 09:37:50AM +0800, Boqun Feng wrote: > > > On Fri, Sep 27, 2024, at 9:30 AM, Mathieu Desnoyers wrote: > > On 2024-09-27 02:01, Boqun Feng wrote: > >> #define ADDRESS_EQ(var, expr) \ > >> ({ \ > >> bool _____cmp_res = (unsigned long)(var) == (unsigned long)(expr); \ > >> \ > >> OPTIMIZER_HIDE_VAR(var); \ > >> _____cmp_res; \ > >> }) > > > > If the goal is to ensure gcc uses the register populated by the > > second, I'm afraid it does not work. AFAIU, "hiding" the dependency > > chain does not prevent the SSA GVN optimization from combining the Note it's not hiding the dependency, rather the equality, > > registers as being one and choosing one arbitrary source. "hiding" after OPTIMIZER_HIDE_VAR(var), compiler doesn't know whether 'var' is equal to 'expr' anymore, because OPTIMIZER_HIDE_VAR(var) uses "=r"(var) to indicate the output is overwritten. So when 'var' is referred later, compiler cannot use the register for a 'expr' value or any other register that has the same value, because 'var' may have a different value from the compiler's POV. > > the dependency chain before or after the comparison won't help here. > > > > int fct_hide_var_compare(void) > > { > > int *a, *b; > > > > do { > > a = READ_ONCE(p); > > asm volatile ("" : : : "memory"); > > b = READ_ONCE(p); > > } while (!ADDRESS_EQ(a, b)); > > Note that ADDRESS_EQ() only hide first parameter, so this should be ADDRESS_EQ(b, a). > I replaced ADDRESS_EQ(a, b) with ADDRESS_EQ(b, a), and the compile result shows it can prevent the issue: gcc 14.2 x86-64: fct_hide_var_compare: .L2: mov rcx, QWORD PTR p[rip] mov rdx, QWORD PTR p[rip] mov rax, rdx cmp rcx, rdx jne .L2 mov eax, DWORD PTR [rax] ret gcc 14.2.0 ARM64: fct_hide_var_compare: adrp x2, p add x2, x2, :lo12:p .L2: ldr x3, [x2] ldr x1, [x2] mov x0, x1 cmp x3, x1 bne .L2 ldr w0, [x0] ret Link to godbolt: https://godbolt.org/z/a7jsfzjxY Regards, Boqun > Regards, > Boqun > > > return *b; > > } > > > > gcc 14.2 x86-64: > > > > fct_hide_var_compare: > > mov rax,QWORD PTR [rip+0x0] # 67 <fct_hide_var_compare+0x7> > > mov rdx,QWORD PTR [rip+0x0] # 6e <fct_hide_var_compare+0xe> > > cmp rax,rdx > > jne 60 <fct_hide_var_compare> > > mov eax,DWORD PTR [rax] > > ret > > main: > > xor eax,eax > > ret > > > > gcc 14.2.0 ARM64: > > > > fct_hide_var_compare: > > adrp x0, .LANCHOR0 > > add x0, x0, :lo12:.LANCHOR0 > > .L12: > > ldr x1, [x0] > > ldr x2, [x0] > > cmp x1, x2 > > bne .L12 > > ldr w0, [x1] > > ret > > p: > > .zero 8 > > > > > > -- > > Mathieu Desnoyers > > EfficiOS Inc. > > https://www.efficios.com >
2024年9月27日 12:28,Boqun Feng <boqun.feng@gmail.com> wrote: > > On Fri, Sep 27, 2024 at 09:37:50AM +0800, Boqun Feng wrote: >> >> >> On Fri, Sep 27, 2024, at 9:30 AM, Mathieu Desnoyers wrote: >>> On 2024-09-27 02:01, Boqun Feng wrote: >>>> #define ADDRESS_EQ(var, expr) \ >>>> ({ \ >>>> bool _____cmp_res = (unsigned long)(var) == (unsigned long)(expr); \ >>>> \ >>>> OPTIMIZER_HIDE_VAR(var); \ >>>> _____cmp_res; \ >>>> }) >>> >>> If the goal is to ensure gcc uses the register populated by the >>> second, I'm afraid it does not work. AFAIU, "hiding" the dependency >>> chain does not prevent the SSA GVN optimization from combining the > > Note it's not hiding the dependency, rather the equality, > >>> registers as being one and choosing one arbitrary source. "hiding" > > after OPTIMIZER_HIDE_VAR(var), compiler doesn't know whether 'var' is > equal to 'expr' anymore, because OPTIMIZER_HIDE_VAR(var) uses "=r"(var) > to indicate the output is overwritten. So when 'var' is referred later, > compiler cannot use the register for a 'expr' value or any other > register that has the same value, because 'var' may have a different > value from the compiler's POV. > >>> the dependency chain before or after the comparison won't help here. >>> >>> int fct_hide_var_compare(void) >>> { >>> int *a, *b; >>> >>> do { >>> a = READ_ONCE(p); >>> asm volatile ("" : : : "memory"); >>> b = READ_ONCE(p); >>> } while (!ADDRESS_EQ(a, b)); >> >> Note that ADDRESS_EQ() only hide first parameter, so this should be ADDRESS_EQ(b, a). >> > > I replaced ADDRESS_EQ(a, b) with ADDRESS_EQ(b, a), and the compile > result shows it can prevent the issue: > > gcc 14.2 x86-64: > > fct_hide_var_compare: > .L2: > mov rcx, QWORD PTR p[rip] > mov rdx, QWORD PTR p[rip] > mov rax, rdx > cmp rcx, rdx > jne .L2 > mov eax, DWORD PTR [rax] > ret > > gcc 14.2.0 ARM64: > > fct_hide_var_compare: > adrp x2, p > add x2, x2, :lo12:p > .L2: > ldr x3, [x2] > ldr x1, [x2] > mov x0, x1 > cmp x3, x1 > bne .L2 > ldr w0, [x0] > ret > > Link to godbolt: > > https://godbolt.org/z/a7jsfzjxY Checking the assembly generated by different compilers for the kernel on the local machine will yield more accurate results. Some optimizations are restricted by the kernel. Therefore, if you use Godbolt, ensure that the compiler arguments match those used for the kernel. > > Regards, > Boqun > >> Regards, >> Boqun >> >>> return *b; >>> } >>> >>> gcc 14.2 x86-64: >>> >>> fct_hide_var_compare: >>> mov rax,QWORD PTR [rip+0x0] # 67 <fct_hide_var_compare+0x7> >>> mov rdx,QWORD PTR [rip+0x0] # 6e <fct_hide_var_compare+0xe> >>> cmp rax,rdx >>> jne 60 <fct_hide_var_compare> >>> mov eax,DWORD PTR [rax] >>> ret >>> main: >>> xor eax,eax >>> ret >>> >>> gcc 14.2.0 ARM64: >>> >>> fct_hide_var_compare: >>> adrp x0, .LANCHOR0 >>> add x0, x0, :lo12:.LANCHOR0 >>> .L12: >>> ldr x1, [x0] >>> ldr x2, [x0] >>> cmp x1, x2 >>> bne .L12 >>> ldr w0, [x1] >>> ret >>> p: >>> .zero 8 >>> >>> >>> -- >>> Mathieu Desnoyers >>> EfficiOS Inc. >>> https://www.efficios.com
On 2024-09-27 06:28, Boqun Feng wrote: > On Fri, Sep 27, 2024 at 09:37:50AM +0800, Boqun Feng wrote: >> >> >> On Fri, Sep 27, 2024, at 9:30 AM, Mathieu Desnoyers wrote: >>> On 2024-09-27 02:01, Boqun Feng wrote: >>>> #define ADDRESS_EQ(var, expr) \ >>>> ({ \ >>>> bool _____cmp_res = (unsigned long)(var) == (unsigned long)(expr); \ >>>> \ >>>> OPTIMIZER_HIDE_VAR(var); \ >>>> _____cmp_res; \ >>>> }) >>> >>> If the goal is to ensure gcc uses the register populated by the >>> second, I'm afraid it does not work. AFAIU, "hiding" the dependency >>> chain does not prevent the SSA GVN optimization from combining the > > Note it's not hiding the dependency, rather the equality, > >>> registers as being one and choosing one arbitrary source. "hiding" > > after OPTIMIZER_HIDE_VAR(var), compiler doesn't know whether 'var' is > equal to 'expr' anymore, because OPTIMIZER_HIDE_VAR(var) uses "=r"(var) > to indicate the output is overwritten. So when 'var' is referred later, > compiler cannot use the register for a 'expr' value or any other > register that has the same value, because 'var' may have a different > value from the compiler's POV. > >>> the dependency chain before or after the comparison won't help here. >>> >>> int fct_hide_var_compare(void) >>> { >>> int *a, *b; >>> >>> do { >>> a = READ_ONCE(p); >>> asm volatile ("" : : : "memory"); >>> b = READ_ONCE(p); >>> } while (!ADDRESS_EQ(a, b)); >> >> Note that ADDRESS_EQ() only hide first parameter, so this should be ADDRESS_EQ(b, a). >> > > I replaced ADDRESS_EQ(a, b) with ADDRESS_EQ(b, a), and the compile > result shows it can prevent the issue: I see, yes. It prevents the issue by making the compiler create a copy of the value "modified" by the asm before doing the equality comparison. This means the compiler cannot derive the value for b from the first load when b is used after after the equality comparison. The only downside of OPTIMIZER_HIDE_VAR() is that it adds an extra "mov" instruction to move the content across registers. I don't think it matters performance wise though, so that solution is appealing because it is arch-agnostic. One small improvement over your proposed solution would be to apply OPTIMIZER_HIDE_VAR() on both inputs. Because this is not a volatile asm, it is simply optimized away if var1 or var2 is unused following the equality comparison. It is more convenient to prevent replacement of both addresses being compared by the other rather than providing the guarantee only on a single parameter: #define OPTIMIZER_HIDE_VAR(var) \ __asm__ ("" : "+r" (var)) #define ADDRESS_EQ(var1, var2) \ ({ \ bool _____cmp_res = (var1) == (var2); \ \ OPTIMIZER_HIDE_VAR(var1); \ OPTIMIZER_HIDE_VAR(var2); \ _____cmp_res; \ }) Thanks, Mathieu > > gcc 14.2 x86-64: > > fct_hide_var_compare: > .L2: > mov rcx, QWORD PTR p[rip] > mov rdx, QWORD PTR p[rip] > mov rax, rdx > cmp rcx, rdx > jne .L2 > mov eax, DWORD PTR [rax] > ret > > gcc 14.2.0 ARM64: > > fct_hide_var_compare: > adrp x2, p > add x2, x2, :lo12:p > .L2: > ldr x3, [x2] > ldr x1, [x2] > mov x0, x1 > cmp x3, x1 > bne .L2 > ldr w0, [x0] > ret > > Link to godbolt: > > https://godbolt.org/z/a7jsfzjxY-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On 2024-09-27 12:59, Mathieu Desnoyers wrote: > On 2024-09-27 06:28, Boqun Feng wrote: [...] >> I replaced ADDRESS_EQ(a, b) with ADDRESS_EQ(b, a), and the compile >> result shows it can prevent the issue: > > I see, yes. It prevents the issue by making the compiler create > a copy of the value "modified" by the asm before doing the equality > comparison. > > This means the compiler cannot derive the value for b from the first > load when b is used after after the equality comparison. > > The only downside of OPTIMIZER_HIDE_VAR() is that it adds an extra > "mov" instruction to move the content across registers. I don't think > it matters performance wise though, so that solution is appealing > because it is arch-agnostic. > > One small improvement over your proposed solution would be to apply > OPTIMIZER_HIDE_VAR() on both inputs. Because this is not a volatile > asm, it is simply optimized away if var1 or var2 is unused following > the equality comparison. It is more convenient to prevent replacement > of both addresses being compared by the other rather than providing > the guarantee only on a single parameter: Actually, your approach is better (only preserving the address dependency on the first parameter), because it allows the second parameter to be a constant. Here is a diff. Please let me know if I need to improve anything wrt comments or implementation: diff --git a/include/linux/compiler.h b/include/linux/compiler.h index 2df665fa2964..52434eccd715 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -186,6 +186,32 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val, __asm__ ("" : "=r" (var) : "0" (var)) #endif +/* + * Compare an address with an expression while preserving the address + * dependencies for later use of the address. It should be used when + * comparing an address returned by rcu_dereference() with another + * address (either constant or in registers). + * + * This is needed to prevent the compiler SSA GVN optimization pass from + * replacing the register holding @addr by @expr (either a constant or a + * register) based on their equality, which does not preserve address + * dependencies and allows the following misordering speculations: + * + * - If @expr is a constant, the compiler can issue the loads which depend + * on @addr before the load of @addr. + * - If @expr is a register populated by a prior load, weakly-ordered + * CPUs can speculate loads which depend on @addr before the load of the + * address they depend on. + */ +#ifndef ADDRESS_EQ +#define ADDRESS_EQ(addr, expr) \ + ({ \ + bool __res = (addr) == (expr); \ + OPTIMIZER_HIDE_VAR(addr); \ + __res; \ + }) +#endif + #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__) /** -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On 2024-09-27 16:43, Mathieu Desnoyers wrote: > On 2024-09-27 12:59, Mathieu Desnoyers wrote: >> On 2024-09-27 06:28, Boqun Feng wrote: > [...] >>> I replaced ADDRESS_EQ(a, b) with ADDRESS_EQ(b, a), and the compile >>> result shows it can prevent the issue: >> >> I see, yes. It prevents the issue by making the compiler create >> a copy of the value "modified" by the asm before doing the equality >> comparison. >> >> This means the compiler cannot derive the value for b from the first >> load when b is used after after the equality comparison. >> >> The only downside of OPTIMIZER_HIDE_VAR() is that it adds an extra >> "mov" instruction to move the content across registers. I don't think >> it matters performance wise though, so that solution is appealing >> because it is arch-agnostic. >> >> One small improvement over your proposed solution would be to apply >> OPTIMIZER_HIDE_VAR() on both inputs. Because this is not a volatile >> asm, it is simply optimized away if var1 or var2 is unused following >> the equality comparison. It is more convenient to prevent replacement >> of both addresses being compared by the other rather than providing >> the guarantee only on a single parameter: > > Actually, your approach is better (only preserving the address > dependency on the first parameter), because it allows the second > parameter to be a constant. > > Here is a diff. Please let me know if I need to improve anything wrt > comments or implementation: > > diff --git a/include/linux/compiler.h b/include/linux/compiler.h > index 2df665fa2964..52434eccd715 100644 > --- a/include/linux/compiler.h > +++ b/include/linux/compiler.h > @@ -186,6 +186,32 @@ void ftrace_likely_update(struct ftrace_likely_data > *f, int val, > __asm__ ("" : "=r" (var) : "0" (var)) > #endif > > +/* > + * Compare an address with an expression while preserving the address > + * dependencies for later use of the address. It should be used when > + * comparing an address returned by rcu_dereference() with another > + * address (either constant or in registers). > + * > + * This is needed to prevent the compiler SSA GVN optimization pass from > + * replacing the register holding @addr by @expr (either a constant or a Actually, I've done a bit on testing on godbolt, and it appears that disabling this specific optimization makes the problem disappear: -fcse-follow-jumps (by use of -fno-cse-follow-jumps). I will update this comment to state that both CSE and SSA GVN optimizations can cause issues there. Thanks, Mathieu > + * register) based on their equality, which does not preserve address > + * dependencies and allows the following misordering speculations: > + * > + * - If @expr is a constant, the compiler can issue the loads which depend > + * on @addr before the load of @addr. > + * - If @expr is a register populated by a prior load, weakly-ordered > + * CPUs can speculate loads which depend on @addr before the load of the > + * address they depend on. > + */ > +#ifndef ADDRESS_EQ > +#define ADDRESS_EQ(addr, expr) \ > + ({ \ > + bool __res = (addr) == (expr); \ > + OPTIMIZER_HIDE_VAR(addr); \ > + __res; \ > + }) > +#endif > + > #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), > __COUNTER__) > > /** > -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On 9/17/2024 at 4:33 PM, Boqun Feng wrote: > +#define hazptr_protect(hzp, gp, field) ({ \ > + typeof(gp) ___p; \ > + \ > + ___p = hazptr_tryprotect(hzp, gp, field); \ > + BUG_ON(!___p); \ > + ___p; \ > +}) Hi Boqun, why crash instead of retry? jonas
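For reference, a retry flavour would presumably look something like the sketch below (a guess at what the question is after, not something in the series). It only makes sense when the caller knows the pointer stays published: hazptr_tryprotect() also returns NULL once *gp has been set to NULL, and this loop would then spin forever.

	#define hazptr_protect_retry(hzp, gp, field) ({			\
		typeof(gp) ___p;					\
									\
		do {							\
			___p = hazptr_tryprotect(hzp, gp, field);	\
			if (!___p)					\
				cpu_relax();				\
		} while (!___p);					\
		___p;							\
	})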
On Tue, Sep 17, 2024 at 10:34 PM Boqun Feng <boqun.feng@gmail.com> wrote: > +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree, > + struct hazptr_context *hzcp) > +{ > + lockdep_assert_held(hzcp->lock); > + > + for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) { > + /* > + * Pairs with smp_store_release() in hazptr_{clear,free}(). > + * > + * Ensure > + * > + * <reader> <updater> > + * > + * [access protected pointers] > + * hazptr_clear(); > + * smp_store_release() > + * // in reader scan. > + * smp_load_acquire(); // is null or unused. > + * [run callbacks] // all accesses from > + * // reader must be > + * // observed. > + */ > + hazptr_t val = smp_load_acquire(&hzcp->slots[i]); > + > + if (!is_null_or_unused(val)) { > + struct hazptr_slot_snap *snap = &hzcp->snaps[i]; > + > + // Already in the tree, need to remove first. > + if (!is_null_or_unused(snap->slot)) { > + reader_del(tree, snap); > + } > + snap->slot = val; > + reader_add(tree, snap); > + } > + } > +} Hello I'm curious about whether there are any possible memory leaks here. It seems that call_hazptr() never frees the memory until the slot is set to another valid value. In the code here, the snap is not deleted when hzcp->snaps[i] is null/unused and snap->slot is not which I think it should be. And it can cause unneeded deletion and addition of the snap if the slot value is unchanged. I'm not so sure... Thanks Lai
On Thu, Sep 19, 2024 at 02:39:13PM +0800, Lai Jiangshan wrote: > On Tue, Sep 17, 2024 at 10:34 PM Boqun Feng <boqun.feng@gmail.com> wrote: > > > +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree, > > + struct hazptr_context *hzcp) > > +{ > > + lockdep_assert_held(hzcp->lock); > > + > > + for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) { > > + /* > > + * Pairs with smp_store_release() in hazptr_{clear,free}(). > > + * > > + * Ensure > > + * > > + * <reader> <updater> > > + * > > + * [access protected pointers] > > + * hazptr_clear(); > > + * smp_store_release() > > + * // in reader scan. > > + * smp_load_acquire(); // is null or unused. > > + * [run callbacks] // all accesses from > > + * // reader must be > > + * // observed. > > + */ > > + hazptr_t val = smp_load_acquire(&hzcp->slots[i]); > > + > > + if (!is_null_or_unused(val)) { > > + struct hazptr_slot_snap *snap = &hzcp->snaps[i]; > > + > > + // Already in the tree, need to remove first. > > + if (!is_null_or_unused(snap->slot)) { > > + reader_del(tree, snap); > > + } > > + snap->slot = val; > > + reader_add(tree, snap); > > + } > > + } > > +} > > Hello > > I'm curious about whether there are any possible memory leaks here. > > It seems that call_hazptr() never frees the memory until the slot is > set to another valid value. > > In the code here, the snap is not deleted when hzcp->snaps[i] is null/unused > and snap->slot is not which I think it should be. > > And it can cause unneeded deletion and addition of the snap if the slot > value is unchanged. > I think you're right. (Although the node will be eventually deleted at cleanup_hazptr_context(), however there could be a long-live hazptr_context). It should be: hazptr_t val = smp_load_acquire(&hzcp->slots[i]); struct hazptr_slot_snap *snap = &hzcp->snaps[i]; if (val != snap->slot) { // val changed, need to update the tree node. // Already in the tree, need to remove first. if (!is_null_or_unused(snap->slot)) { reader_del(tree, snap); } // use the latest snapshot. snap->slot = val; // Add it into tree if there is a reader if (!is_null_or_unused(val)) reader_add(tree, snap); } Regards, Boqun > I'm not so sure... > > Thanks > Lai
2024年9月19日 15:10,Boqun Feng <boqun.feng@gmail.com> wrote: > > On Thu, Sep 19, 2024 at 02:39:13PM +0800, Lai Jiangshan wrote: >> On Tue, Sep 17, 2024 at 10:34 PM Boqun Feng <boqun.feng@gmail.com> wrote: >> >>> +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree, >>> + struct hazptr_context *hzcp) >>> +{ >>> + lockdep_assert_held(hzcp->lock); >>> + >>> + for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) { >>> + /* >>> + * Pairs with smp_store_release() in hazptr_{clear,free}(). >>> + * >>> + * Ensure >>> + * >>> + * <reader> <updater> >>> + * >>> + * [access protected pointers] >>> + * hazptr_clear(); >>> + * smp_store_release() >>> + * // in reader scan. >>> + * smp_load_acquire(); // is null or unused. >>> + * [run callbacks] // all accesses from >>> + * // reader must be >>> + * // observed. >>> + */ >>> + hazptr_t val = smp_load_acquire(&hzcp->slots[i]); >>> + >>> + if (!is_null_or_unused(val)) { >>> + struct hazptr_slot_snap *snap = &hzcp->snaps[i]; >>> + >>> + // Already in the tree, need to remove first. >>> + if (!is_null_or_unused(snap->slot)) { >>> + reader_del(tree, snap); >>> + } >>> + snap->slot = val; >>> + reader_add(tree, snap); >>> + } >>> + } >>> +} >> >> Hello >> >> I'm curious about whether there are any possible memory leaks here. >> >> It seems that call_hazptr() never frees the memory until the slot is >> set to another valid value. >> >> In the code here, the snap is not deleted when hzcp->snaps[i] is null/unused >> and snap->slot is not which I think it should be. >> >> And it can cause unneeded deletion and addition of the snap if the slot >> value is unchanged. >> > > I think you're right. (Although the node will be eventually deleted at > cleanup_hazptr_context(), however there could be a long-live > hazptr_context). It should be: > > hazptr_t val = smp_load_acquire(&hzcp->slots[i]); > struct hazptr_slot_snap *snap = &hzcp->snaps[i]; > > if (val != snap->slot) { // val changed, need to update the tree node. > // Already in the tree, need to remove first. > if (!is_null_or_unused(snap->slot)) { > reader_del(tree, snap); > } > > // use the latest snapshot. > snap->slot = val; > > // Add it into tree if there is a reader > if (!is_null_or_unused(val)) > reader_add(tree, snap); > } Even using the same context, if two slots are used to protect the same pointer, let it be ptr1, then if the second slot is reused for ptr2, ptr1’s callback will be invoked even the first slot still has the ptr1. The original patch also has this problem. > > Regards, > Boqun > >> I'm not so sure... >> >> Thanks >> Lai
2024年9月19日 15:10,Boqun Feng <boqun.feng@gmail.com> wrote: > > On Thu, Sep 19, 2024 at 02:39:13PM +0800, Lai Jiangshan wrote: >> On Tue, Sep 17, 2024 at 10:34 PM Boqun Feng <boqun.feng@gmail.com> wrote: >> >>> +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree, >>> + struct hazptr_context *hzcp) >>> +{ >>> + lockdep_assert_held(hzcp->lock); >>> + >>> + for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) { >>> + /* >>> + * Pairs with smp_store_release() in hazptr_{clear,free}(). >>> + * >>> + * Ensure >>> + * >>> + * <reader> <updater> >>> + * >>> + * [access protected pointers] >>> + * hazptr_clear(); >>> + * smp_store_release() >>> + * // in reader scan. >>> + * smp_load_acquire(); // is null or unused. >>> + * [run callbacks] // all accesses from >>> + * // reader must be >>> + * // observed. >>> + */ >>> + hazptr_t val = smp_load_acquire(&hzcp->slots[i]); >>> + >>> + if (!is_null_or_unused(val)) { >>> + struct hazptr_slot_snap *snap = &hzcp->snaps[i]; >>> + >>> + // Already in the tree, need to remove first. >>> + if (!is_null_or_unused(snap->slot)) { >>> + reader_del(tree, snap); >>> + } >>> + snap->slot = val; >>> + reader_add(tree, snap); >>> + } >>> + } >>> +} >> >> Hello >> >> I'm curious about whether there are any possible memory leaks here. >> >> It seems that call_hazptr() never frees the memory until the slot is >> set to another valid value. >> >> In the code here, the snap is not deleted when hzcp->snaps[i] is null/unused >> and snap->slot is not which I think it should be. >> >> And it can cause unneeded deletion and addition of the snap if the slot >> value is unchanged. >> > > I think you're right. (Although the node will be eventually deleted at > cleanup_hazptr_context(), however there could be a long-live > hazptr_context). It should be: > > hazptr_t val = smp_load_acquire(&hzcp->slots[i]); > struct hazptr_slot_snap *snap = &hzcp->snaps[i]; > > if (val != snap->slot) { // val changed, need to update the tree node. > // Already in the tree, need to remove first. > if (!is_null_or_unused(snap->slot)) { > reader_del(tree, snap); > } > > // use the latest snapshot. > snap->slot = val; > > // Add it into tree if there is a reader > if (!is_null_or_unused(val)) > reader_add(tree, snap); > } It seems like that two different hazptr_context can’t be used to protect the same pointer? Otherwise the following can happen? thread1 thread2 thread3(worker) thread4 hazptr_tryprotect(hzp1, ptr1) hazptr_tryprotect(hzp2, ptr1) add ptr1 to tree hazptr_clear(hzp1) hazptr_tryprotect(hzp1, ptr2) delete ptr1 from tree unpub ptr1 call_hazptr(ptr1) oops: invoke ptr1's callback Or am I missing something? > > Regards, > Boqun > >> I'm not so sure... >> >> Thanks >> Lai
On Thu, Sep 19, 2024 at 09:57:12PM +0800, Alan Huang wrote: [...] > > > > I think you're right. (Although the node will be eventually deleted at > > cleanup_hazptr_context(), however there could be a long-live > > hazptr_context). It should be: > > > > hazptr_t val = smp_load_acquire(&hzcp->slots[i]); > > struct hazptr_slot_snap *snap = &hzcp->snaps[i]; > > > > if (val != snap->slot) { // val changed, need to update the tree node. > > // Already in the tree, need to remove first. > > if (!is_null_or_unused(snap->slot)) { > > reader_del(tree, snap); > > } > > > > // use the latest snapshot. > > snap->slot = val; > > > > // Add it into tree if there is a reader > > if (!is_null_or_unused(val)) > > reader_add(tree, snap); > > } > > It seems like that two different hazptr_context can’t be used to protect the same pointer? > > Otherwise the following can happen? > > thread1 thread2 thread3(worker) thread4 > hazptr_tryprotect(hzp1, ptr1) hazptr_tryprotect(hzp2, ptr1) > add ptr1 to tree Note that we have snapshot rb_node for each hazard pointer slot, so here thread3 actually would add two rb_nodes with ->slot == ptr1 here. > hazptr_clear(hzp1) > hazptr_tryprotect(hzp1, ptr2) > delete ptr1 from tree unpub ptr1 Therefore, there is still one rb_node with ->slot == ptr1 in the tree after the deletion, so updaters won't invoke ptr1's callback. Regards, Boqun > call_hazptr(ptr1) > oops: invoke ptr1's callback > Or am I missing something? > > > > > Regards, > > Boqun > > > >> I'm not so sure... > >> > >> Thanks > >> Lai > >
2024年9月20日 02:58,Boqun Feng <boqun.feng@gmail.com> wrote: > > On Thu, Sep 19, 2024 at 09:57:12PM +0800, Alan Huang wrote: > [...] >>> >>> I think you're right. (Although the node will be eventually deleted at >>> cleanup_hazptr_context(), however there could be a long-live >>> hazptr_context). It should be: >>> >>> hazptr_t val = smp_load_acquire(&hzcp->slots[i]); >>> struct hazptr_slot_snap *snap = &hzcp->snaps[i]; >>> >>> if (val != snap->slot) { // val changed, need to update the tree node. >>> // Already in the tree, need to remove first. >>> if (!is_null_or_unused(snap->slot)) { >>> reader_del(tree, snap); >>> } >>> >>> // use the latest snapshot. >>> snap->slot = val; >>> >>> // Add it into tree if there is a reader >>> if (!is_null_or_unused(val)) >>> reader_add(tree, snap); >>> } >> >> It seems like that two different hazptr_context can’t be used to protect the same pointer? >> >> Otherwise the following can happen? >> >> thread1 thread2 thread3(worker) thread4 >> hazptr_tryprotect(hzp1, ptr1) hazptr_tryprotect(hzp2, ptr1) >> add ptr1 to tree > > Note that we have snapshot rb_node for each hazard pointer slot, so here > thread3 actually would add two rb_nodes with ->slot == ptr1 here. Ok, good to know the rbtree can have multiple nodes with the same key. Thanks for the explanation! > >> hazptr_clear(hzp1) >> hazptr_tryprotect(hzp1, ptr2) >> delete ptr1 from tree unpub ptr1 > > Therefore, there is still one rb_node with ->slot == ptr1 in the tree > after the deletion, so updaters won't invoke ptr1's callback. > > Regards, > Boqun > >> call_hazptr(ptr1) >> oops: invoke ptr1's callback >> Or am I missing something? >> >>> >>> Regards, >>> Boqun >>> >>>> I'm not so sure... >>>> >>>> Thanks >>>> Lai
2024年9月19日 15:10,Boqun Feng <boqun.feng@gmail.com> wrote: > > On Thu, Sep 19, 2024 at 02:39:13PM +0800, Lai Jiangshan wrote: >> On Tue, Sep 17, 2024 at 10:34 PM Boqun Feng <boqun.feng@gmail.com> wrote: >> >>> +static void hazptr_context_snap_readers_locked(struct hazptr_reader_tree *tree, >>> + struct hazptr_context *hzcp) >>> +{ >>> + lockdep_assert_held(hzcp->lock); >>> + >>> + for (int i = 0; i < HAZPTR_SLOT_PER_CTX; i++) { >>> + /* >>> + * Pairs with smp_store_release() in hazptr_{clear,free}(). >>> + * >>> + * Ensure >>> + * >>> + * <reader> <updater> >>> + * >>> + * [access protected pointers] >>> + * hazptr_clear(); >>> + * smp_store_release() >>> + * // in reader scan. >>> + * smp_load_acquire(); // is null or unused. >>> + * [run callbacks] // all accesses from >>> + * // reader must be >>> + * // observed. >>> + */ >>> + hazptr_t val = smp_load_acquire(&hzcp->slots[i]); >>> + >>> + if (!is_null_or_unused(val)) { >>> + struct hazptr_slot_snap *snap = &hzcp->snaps[i]; >>> + >>> + // Already in the tree, need to remove first. >>> + if (!is_null_or_unused(snap->slot)) { >>> + reader_del(tree, snap); >>> + } >>> + snap->slot = val; >>> + reader_add(tree, snap); >>> + } >>> + } >>> +} >> >> Hello >> >> I'm curious about whether there are any possible memory leaks here. >> >> It seems that call_hazptr() never frees the memory until the slot is >> set to another valid value. >> >> In the code here, the snap is not deleted when hzcp->snaps[i] is null/unused >> and snap->slot is not which I think it should be. >> >> And it can cause unneeded deletion and addition of the snap if the slot >> value is unchanged. >> > > I think you're right. (Although the node will be eventually deleted at > cleanup_hazptr_context(), however there could be a long-live > hazptr_context). It should be: > > hazptr_t val = smp_load_acquire(&hzcp->slots[i]); > struct hazptr_slot_snap *snap = &hzcp->snaps[i]; > > if (val != snap->slot) { // val changed, need to update the tree node. > // Already in the tree, need to remove first. > if (!is_null_or_unused(snap->slot)) { > reader_del(tree, snap); > } > > // use the latest snapshot. > snap->slot = val; > > // Add it into tree if there is a reader > if (!is_null_or_unused(val)) > reader_add(tree, snap); > } With this changed, and force users call hazptr_clear() like rcu_read_unlock(), we could remove the reader_del() in cleanup_hazptr_context(), then remove the tree->lock? > > Regards, > Boqun > >> I'm not so sure... >> >> Thanks >> Lai
On Tue, Sep 17, 2024 at 4:33 PM Boqun Feng <boqun.feng@gmail.com> wrote: > Hazard pointers [1] provide a way to dynamically distribute refcounting > and can be used to improve the scalability of refcounting without > significant space cost. > +static inline void *__hazptr_tryprotect(hazptr_t *hzp, > + void *const *p, > + unsigned long head_offset) > +{ > + void *ptr; > + struct callback_head *head; > + > + ptr = READ_ONCE(*p); > + > + if (ptr == NULL) > + return NULL; > + > + head = (struct callback_head *)(ptr + head_offset); > + > + WRITE_ONCE(*hzp, head); > + smp_mb(); > + > + ptr = READ_ONCE(*p); // read again > + > + if (ptr + head_offset != head) { // pointer changed > + WRITE_ONCE(*hzp, NULL); // reset hazard pointer > + return NULL; > + } else > + return ptr; > +} I got nerdsniped by the Plumbers talk. So, about that smp_mb()... I think you should be able to avoid the smp_mb() using relaxed atomics (on architectures that have those), at the cost of something like a cmpxchg-acquire sandwiched between a load-acquire and a relaxed load? I'm not sure how their cost compares to an smp_mb() though. Something like this, *assuming there can only be one context at a time waiting for a given hazptr_t*: typedef struct { /* consists of: marker bit in least significant bit, rest is normal pointer */ atomic_long_t value; } hazptr_t; /* assumes that hzp is currently set to NULL (but it may contain a marker bit) */ static inline void *__hazptr_tryprotect(hazptr_t *hzp, void *const *p) { /* note that the loads of these three operations are ordered using acquire semantics */ void *ptr = smp_load_acquire(p); /* set pointer while leaving marker bit intact */ unsigned long hazard_scanning = atomic_long_fetch_or_acquire((unsigned long)ptr, &hzp->value); if (unlikely(hazard_scanning)) { BUG_ON(hazard_scanning != 1); /* slowpath, concurrent hazard pointer waiter */ smp_mb(); } if (READ_ONCE(*p) == ptr) { /* recheck */ atomic_long_and(~1UL, &hzp->value); return NULL; } return ptr; } /* simplified for illustration, assumes there's only a single hazard pointer @hzp that could point to @ptr */ static void remove_pointer_and_wait_for_hazard(hazptr_t *hzp, void *ptr, void *const *p) { WRITE_ONCE(*p, NULL); smb_mb(); /* set marker bit */ atomic_long_or(1UL, &hzp->value); while ((void*)(atomic_long_read(&hzp->value) & ~1UL) == ptr)) wait(); /* clear marker bit */ atomic_long_and(~1UL, &hzp->value); } The idea would be that the possible orderings when these two functions race against each other are: Ordering A: The atomic_long_fetch_or_acquire() in __hazptr_tryprotect() happens after the atomic_long_or(), two subcases: Ordering A1 (slowpath): atomic_long_fetch_or_acquire() is ordered before the atomic_long_and() in remove_pointer_and_wait_for_hazard(), so the marker bit is observed set, "hazard_scanning" is true. We go on the safe slowpath which is like the original patch, so it's safe. Ordering A2 (recheck fails): atomic_long_fetch_or_acquire() is ordered after the atomic_long_and() in remove_pointer_and_wait_for_hazard(), so the subsequent READ_ONCE(*p) is also ordered after the atomic_long_and(), which is ordered after the WRITE_ONCE(*p, NULL), so the READ_ONCE(*p) recheck must see a NULL pointer and fail. Ordering B (hazard pointer visible): The atomic_long_fetch_or_acquire() in __hazptr_tryprotect() happens before the atomic_long_or(). 
In that case, it also happens before the atomic_long_read() in remove_pointer_and_wait_for_hazard(), so the hazard pointer will be visible to remove_pointer_and_wait_for_hazard(). But this seems pretty gnarly/complicated, so even if my 2AM reasoning ability is correct, actually implementing this might not be a good idea... and it definitely wouldn't help on X86 at all, since X86 doesn't have such nice relaxed RMW ops.
Am 9/19/2024 um 2:12 AM schrieb Jann Horn:
> On Tue, Sep 17, 2024 at 4:33 PM Boqun Feng <boqun.feng@gmail.com> wrote:
>> Hazard pointers [1] provide a way to dynamically distribute refcounting
>> and can be used to improve the scalability of refcounting without
>> significant space cost.
>
>> +static inline void *__hazptr_tryprotect(hazptr_t *hzp,
>> +					void *const *p,
>> +					unsigned long head_offset)
>> +{
>> +	void *ptr;
>> +	struct callback_head *head;
>> +
>> +	ptr = READ_ONCE(*p);
>> +
>> +	if (ptr == NULL)
>> +		return NULL;
>> +
>> +	head = (struct callback_head *)(ptr + head_offset);
>> +
>> +	WRITE_ONCE(*hzp, head);
>> +	smp_mb();
>> +
>> +	ptr = READ_ONCE(*p); // read again
>> +
>> +	if (ptr + head_offset != head) { // pointer changed
>> +		WRITE_ONCE(*hzp, NULL);  // reset hazard pointer
>> +		return NULL;
>> +	} else
>> +		return ptr;
>> +}
>
> I got nerdsniped by the Plumbers talk. So, about that smp_mb()...
>
> I think you should be able to avoid the smp_mb() using relaxed atomics
> (on architectures that have those), at the cost of something like a
> cmpxchg-acquire sandwiched between a load-acquire and a relaxed load?
> I'm not sure how their cost compares to an smp_mb() though.

We have done a similar scheme before, and on some architectures (not
x86) the RMW is slightly cheaper than the mb.

Your reasoning is a bit simplified because it seems to assume a stronger
concept of ordering than LKMM has, but even with LKMM's ordering your
code seems fine.

I feel it can even be simplified a little, the hazard bit does not seem
necessary.

Assuming atomic operations for everything racy, relaxed unless stated
otherwise:

(R)eader:

    old = read p            // I don't think this needs acq, because of address dependencies (*)
    haz ||=_acq old
    if (read p != old) retry;
    *old

(W)riter:

    p =_??? ...             // assuming we don't set it to null this needs a rel
    --- mb ---
    haz ||= 0
    while (read_acq haz == old) retry;
    delete old

In order to get a use-after-free, both of the R:read p need to read
before W:p = ..., so because of the W:mb, they execute before
W:haz||=0.

Also, for the use-after-free, W:read_acq haz needs to read before (the
write part of) R:haz||=_acq old. Then the W:haz||=0 also needs to read
before (the write part of) R:haz||=_acq old because of coherence on the
same location. Since both of them are atomic RMW, the W:haz||=0 also
needs to write before (the write part of) R:haz||=_acq old, and in the
absence of non-RMW writes (so assuming no thread will just store to the
hazard pointer), this implies that the latter reads-from an
rmw-sequence-later store than W:haz||=0, which therefore executes
before R:haz||=_acq old.

But because of the acquire barrier, R:haz||=_acq old executes before
the second R:read p. This gives a cycle

    (2nd)R:read p ->xb W:haz||=0 ->xb R:haz||=_acq ->xb (2nd)R:read p

and therefore is forbidden.

Note that in this argument, the two R:read p are not necessarily
reading from the same store. Because of ABA, it can happen that they
read from distinct stores and see the same value.

It does require clearing hazard pointers with an RMW like atomic_and(0)
(**). The combination of the two RMW (for setting & clearing the hazard
pointer) might in total be slower again than an mb.

(I took the liberty to add an acquire barrier in the writer's while
loop for the case where we read from the (not shown) release store of
the reader that clears the hazard pointer. It's arguable whether that
acq is needed since one could argue that a delete is a kind of store,
in which case control dependency would handle it.)

have fun,
jonas

(* you talk about sandwiching, and the data dependency does guarantee
some weaker form of sandwiching than the acq, but I don't think
sandwiching is required at all. If you happened to be able to set the
hazard pointer before reading the pointer, that would just extend the
protected area, wouldn't it?)

(** I think if you clear the pointer with a store, the hazard bit does
not help. You could be overwriting the hazard bit, and the RMWs that
set the hazard bit might never propagate to your CPU. Also in your code
you probably meant to clear the whole hazard pointer in the retry code
of the reader, not just the hazard bit.)
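For concreteness, the writer side of this simplified scheme could be
sketched with Linux atomic_long_t primitives as below. It is only a
sketch under the assumptions above: the slot is an atomic_long_t rather
than the patch's hazptr_t, a single slot stands in for the full scan
over all reader slots, and all names are invented. The reader side is
revisited in the next message.

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/processor.h>
#include <linux/slab.h>

/* Retire "old", previously published through *gp. */
static void writer_retire(void **gp, atomic_long_t *slot, void *old)
{
	WRITE_ONCE(*gp, NULL);		/* unpublish; needs release if a new object is published instead */
	smp_mb();			/* the "--- mb ---" above */

	/* haz ||= 0: a no-op RMW so that the rmw-sequence argument above applies. */
	atomic_long_fetch_or(0, slot);	/* relaxed would suffice per the argument; plain fetch_or is stronger */

	/* Wait until the slot no longer holds "old"; acquire pairs with the reader's clearing RMW. */
	while (atomic_long_read_acquire(slot) == (unsigned long)old)
		cpu_relax();

	kfree(old);			/* no reader can still hold a reference */
}

/* Readers must clear their slot with an RMW, per footnote (**). */
static inline void reader_clear(atomic_long_t *slot)
{
	atomic_long_fetch_and(0, slot);	/* fully ordered; a release RMW would be enough */
}

Whether the resulting pair of RMWs on the reader side ends up cheaper
than the smp_mb() in the posted __hazptr_tryprotect() is exactly the
open question raised above.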
Am 9/19/2024 um 10:30 PM schrieb Jonas Oberhauser:
>
> Am 9/19/2024 um 2:12 AM schrieb Jann Horn:
>> On Tue, Sep 17, 2024 at 4:33 PM Boqun Feng <boqun.feng@gmail.com> wrote:
[...]
> Assuming atomic operations for everything racy, relaxed unless stated
> otherwise:
>
> (R)eader:
>
>     old = read p            // I don't think this needs acq, because of address dependencies (*)
>     haz ||=_acq old
>     if (read p != old) retry;

I realized before going to bed that I copied a subtle mistake here from
Jann's code, we need an address dependency from this read, or it is not
ABA safe. This can be done with the small detour that Boqun used:

    head = read p            // I don't think this needs acq, because of address dependencies (*)
    haz ||=_acq head
    old = read p             // again
    if (head != old) retry;
    barrier();               // ensure compiler does not use its knowledge that head == old to do *head instead!
    *old                     // definitely use the second read pointer so hardware can't speculatively dereference this before the second read!

Have a good time,
jonas
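A C rendering of this corrected reader, under the same assumptions as
the writer sketch above (an atomic_long_t slot that is 0 on entry,
invented names); an illustration of the idea, not the patch's
interface:

#include <linux/atomic.h>
#include <linux/compiler.h>

static inline void *reader_protect(void * const *gp, atomic_long_t *slot)
{
	void *head, *old;

	head = READ_ONCE(*gp);			/* address dependency, no acquire needed */
	if (!head)
		return NULL;

	/* haz ||=_acq head: assumes *slot is currently 0. */
	atomic_long_fetch_or_acquire((unsigned long)head, slot);

	old = READ_ONCE(*gp);			/* read again */
	if (head != old) {
		atomic_long_fetch_and(0, slot);	/* reset with an RMW, see (**) */
		return NULL;			/* caller retries or gives up */
	}

	barrier();	/* keep the compiler from substituting head for old */
	return old;	/* dereference only through the second read */
}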
On Sep 17, 2024, at 22:33, Boqun Feng <boqun.feng@gmail.com> wrote:
[...]
> +#define hazptr_tryprotect(hzp, gp, field) (typeof(gp))__hazptr_tryprotect(hzp, (void **)&(gp), offsetof(typeof(*gp), field))
> +#define hazptr_protect(hzp, gp, field) ({ \
> +	typeof(gp) ___p; \
> +	\
> +	___p = hazptr_tryprotect(hzp, gp, field); \
> +	BUG_ON(!___p); \

hazptr_tryprotect might return NULL, do you need a loop here?

> +	___p; \
> +})
> +
[...]
On Wed, Sep 18, 2024 at 11:17:37PM +0800, Alan Huang wrote:
[...]
> > +#define hazptr_tryprotect(hzp, gp, field) (typeof(gp))__hazptr_tryprotect(hzp, (void **)&(gp), offsetof(typeof(*gp), field))
> > +#define hazptr_protect(hzp, gp, field) ({ \
> > +	typeof(gp) ___p; \
> > +	\
> > +	___p = hazptr_tryprotect(hzp, gp, field); \
> > +	BUG_ON(!___p); \
>
> hazptr_tryprotect might return NULL, do you need a loop here?
>

Thanks for the review. It's me who didn't do a good job here on the
documentation. hazptr_protect() is supposed to be used for the case
where readers know the gp won't change.

Regards,
Boqun

> > +	___p; \
> > +})
> > +
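For illustration, such a reader could look roughly like the sketch
below. struct foo, its head field and stable_gp are invented for the
example; only the hazptr API calls come from the patch, and a caller
that cannot tolerate hazptr_alloc() failure would handle it instead of
BUG_ON():

#include <linux/hazptr.h>
#include <linux/bug.h>

struct foo {
	int data;
	struct callback_head head;	/* reclaimed via call_hazptr() elsewhere */
};

struct foo *stable_gp;	/* published once; known not to change while readers run */

static int read_stable_foo(struct hazptr_context *ctx)
{
	hazptr_t *hzp = hazptr_alloc(ctx);
	struct foo *p;
	int val;

	BUG_ON(!hzp);					/* sketch: assume a free slot */
	p = hazptr_protect(hzp, stable_gp, head);	/* no retry loop: gp won't change */
	val = p->data;
	hazptr_clear(hzp);
	hazptr_free(ctx, hzp);
	return val;
}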
Am 9/19/2024 um 8:56 AM schrieb Boqun Feng:
> On Wed, Sep 18, 2024 at 11:17:37PM +0800, Alan Huang wrote:
> [...]
>>> +#define hazptr_tryprotect(hzp, gp, field) (typeof(gp))__hazptr_tryprotect(hzp, (void **)&(gp), offsetof(typeof(*gp), field))
>>> +#define hazptr_protect(hzp, gp, field) ({ \
>>> +	typeof(gp) ___p; \
>>> +	\
>>> +	___p = hazptr_tryprotect(hzp, gp, field); \
>>> +	BUG_ON(!___p); \
>>
>> hazptr_tryprotect might return NULL, do you need a loop here?
>>
>
> Thanks for the review. It's me who didn't do a good job here on the
> documentation. hazptr_protect() is supposed to be used for the case
> where readers know the gp won't change.
>
> Regards,
> Boqun
>
>>> +	___p; \
>>> +})
>>> +

Oh, disregard my other e-mail, I hadn't seen this discussion.

Do you have any specific use case of this in mind?

If you know that the pointer can't change, I would assume you can also
just read the pointer normally and assign to the hazard pointer without
a fence, no?

have fun,
jonas
On 2024-09-17 16:33, Boqun Feng wrote:
[...]
> Reader side protection:
[...]
> Updater side check:
[...]

Cool! I look forward to seeing where this is meant to be used. I would
expect it to be a useful tool to speed up reference counting of things
like the mm_struct and for TLB flush IPIs.

On a related note, with a userspace port in mind, the membarrier(2)
syscall can be useful to turn the smp_mb() in (3) from the reader side
into a simple compiler barrier, assuming (2) from the updater is using
membarrier. If IPIs are acceptable (or already required) for some
kernel use-cases, this means a similar asymmetric fence scheme could be
used to speed up readers.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
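As a rough illustration of that asymmetric scheme for a hypothetical
userspace port: the names, the single slot and the missing error
handling below are all invented, and only the membarrier(2) commands
are real.

#define _GNU_SOURCE
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

#define barrier() __asm__ __volatile__("" ::: "memory")

static int sys_membarrier(int cmd, unsigned int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

/* Once per process, before MEMBARRIER_CMD_PRIVATE_EXPEDITED may be used. */
static void hazptr_fence_init(void)
{
	sys_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);
}

/* Reader: step (3) shrinks to a compiler barrier. */
static void *reader_protect(_Atomic(void *) *gp, _Atomic(void *) *slot)
{
	void *p;

	for (;;) {
		p = atomic_load_explicit(gp, memory_order_relaxed);
		if (!p)
			return NULL;
		atomic_store_explicit(slot, p, memory_order_relaxed);
		barrier();	/* paired with the updater's expedited membarrier */
		if (atomic_load_explicit(gp, memory_order_relaxed) == p)
			return p;
		atomic_store_explicit(slot, NULL, memory_order_relaxed);
	}
}

/* Updater: step (2) becomes an expedited membarrier instead of smp_mb(). */
static void updater_unpublish(_Atomic(void *) *gp)
{
	atomic_store_explicit(gp, NULL, memory_order_relaxed);
	sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
	/* ... now scan the hazard pointer slots before freeing ... */
}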