From nobody Fri Dec 19 15:00:29 2025
From: Anatoly Stepanov
Subject: [RFC PATCH v2 1/5] kernel: introduce Hierarchical Queued spinlock
Date: Sat, 6 Dec 2025 14:21:02 +0800
Message-ID: <20251206062106.2109014-2-stepanov.anatoly@huawei.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20251206062106.2109014-1-stepanov.anatoly@huawei.com>
References: <20251206062106.2109014-1-stepanov.anatoly@huawei.com>

We introduce the Hierarchical Queued spinlock (HQ spinlock), which aims to
reduce cross-NUMA cache-line traffic and thus improve lock/unlock throughput
in high-contention cases. The idea can be seen as a combination of
cohort locking by Dave Dice and the Linux kernel queued spinlock.

[Design and workflow]

Contenders are organized into a two-level scheme in which each NUMA node has
its own FIFO contender queue. The NUMA queues are linked into a
singly-linked-list-like structure, while maintaining FIFO order between
them. When no contenders are left in a NUMA queue, the queue is removed from
the list. A contender first tries to enqueue into its local NUMA queue; if
no such queue exists yet, it links a new one into the list.

As with qspinlock, only the first contender spins globally; all others do
MCS-style spinning. The handoff operation becomes two-staged:
- local handoff: between contenders within a single NUMA queue
- remote handoff: between different NUMA queues

If a remote handoff reaches the end of the NUMA-queue linked list, it wraps
around to the list head. To avoid potential starvation issues, we use
handoff thresholds. A conceptual sketch of this two-level scheme follows
below.
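
The following is an illustration only, with hypothetical type and function
names; the patch's actual structures are struct numa_qnode and
struct numa_queue in hqlock_types.h below, which pack more state:

	struct hq_waiter {                  /* one MCS-style spinning CPU */
		struct hq_waiter *next;
		int locked;                 /* flag set by a local handoff */
	};

	struct hq_node_queue {              /* per-NUMA-node FIFO of waiters */
		struct hq_waiter *head;
		struct hq_waiter *tail;
		unsigned short prev_node;   /* links in the global node list, */
		unsigned short next_node;   /* 0 == no neighbour */
	};

	/*
	 * Unlock-time decision, conceptually: keep handing the lock off
	 * within the local node queue until a threshold is reached, then
	 * hand off to the next node queue (wrapping to the list head when
	 * the end of the node list is reached).
	 */
	static int should_handoff_locally(const struct hq_node_queue *q,
					  unsigned int local_handoffs,
					  unsigned int threshold)
	{
		return q->head != NULL && local_handoffs < threshold;
	}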
As for "qspinlock" only the first contender is spinning globally, all others - MCS-spinning. "Handoff" operation becomes two-staged: - local handoff: between contenders in a single NUMA-queue - remote handoff: between different NUMA-queues. If "remote handoff" reaches the end of the NUMA-queue linked list it goes to the list head. To avoid potential starvation issues, we use handoff thresholds. Key challenge here was keeping the same "qspinlock" structure size and "bind" a given lock to related NUMA-queues. We came up with dynamic lock metadata concept, where we can dynamically "bind" a given lock to it's NUMA-related metadata, and then "unbind" when the lock is released. This approach allows to avoid metadata reservation for every lock in advanc= e, thus giving the upperbound of metadata instance number to ~ (NR_CPUS x nr_c= ontexts / 2). Which corresponds to maximum amount of different locks falling into the slo= wpath. [Dynamic mode switching] HQ-lock supports switching between "default qspinlock" mode to "NUMA-aware = lock" mode and backwards. Here we introduce simple scheme: when the contention is high enough a lock = switches to NUMA-aware mode until it is completely released. If for some reason "NUMA-aware" mode cannot be enabled we fallback to defau= lt qspinlock mode. [Usage] If someone wants to try the HQ-lock in some subsystem, just change lock initialization code from "spin_lock_init()" to "spin_lock_init_= hq()". The dedicated bit in the lock structure is used to distiguish between the t= wo lock types. [Performance measurements] Performance measurements were done on x86 (AMD EPYC) and arm64 (Kunpeng 920) platforms with the following scenarious: - Locktorture benchmark - Memcached + memtier benchmark - Ngnix + Wrk benchmark [Locktorture] NPS stands for "Nodes per socket" +------------------------------+-----------------------+-------+-------+---= -----+ | AMD EPYC 9654 | +------------------------------+-----------------------+-------+-------+---= -----+ | 192 cores (x2 hyper-threads) | | | | = | | 2 sockets | | | | = | | Locktorture 60 sec. | NUMA nodes per-socket | | | = | | Average gain (single lock) | 1 NPS | 2 NPS | 4 NPS | 12= NPS | | Total contender threads | | | | = | | 8 | 19% | 21% | 12% | 12= % | | 16 | 13% | 18% | 34% | 75= % | | 32 | 8% | 14% | 25% | 11= 2% | | 64 | 11% | 12% | 30% | 15= 2% | | 128 | 9% | 17% | 37% | 16= 3% | | 256 | 2% | 16% | 40% | 16= 8% | | 384 | -1% | 14% | 44% | 18= 6% | +------------------------------+-----------------------+-------+-------+---= -----+ +-----------------+-------+-------+-------+--------+ | Fairness factor | 1 NPS | 2 NPS | 4 NPS | 12 NPS | +-----------------+-------+-------+-------+--------+ | 8 | 0.54 | 0.57 | 0.57 | 0.55 | | 16 | 0.52 | 0.53 | 0.60 | 0.58 | | 32 | 0.53 | 0.53 | 0.53 | 0.61 | | 64 | 0.52 | 0.56 | 0.54 | 0.56 | | 128 | 0.51 | 0.54 | 0.54 | 0.53 | | 256 | 0.52 | 0.52 | 0.52 | 0.52 | | 384 | 0.51 | 0.51 | 0.51 | 0.51 | +-----------------+-------+-------+-------+--------+ +-------------------------+--------------+ | Kunpeng 920 (arm64) | | +-------------------------+--------------+ | 96 cores (no MT) | | | 2 sockets, 4 NUMA nodes | | | Locktorture 60 sec. 
[Performance measurements]

Performance measurements were done on x86 (AMD EPYC) and arm64
(Kunpeng 920) platforms with the following scenarios:
- Locktorture benchmark
- Memcached + memtier benchmark
- Nginx + wrk benchmark

[Locktorture]

NPS stands for "nodes per socket".

+------------------------------+-----------------------+-------+-------+--------+
| AMD EPYC 9654                                                                 |
+------------------------------+-----------------------+-------+-------+--------+
| 192 cores (x2 hyper-threads) |                       |       |       |        |
| 2 sockets                    |                       |       |       |        |
| Locktorture 60 sec.          | NUMA nodes per socket |       |       |        |
| Average gain (single lock)   | 1 NPS                 | 2 NPS | 4 NPS | 12 NPS |
| Total contender threads      |                       |       |       |        |
| 8                            | 19%                   | 21%   | 12%   | 12%    |
| 16                           | 13%                   | 18%   | 34%   | 75%    |
| 32                           | 8%                    | 14%   | 25%   | 112%   |
| 64                           | 11%                   | 12%   | 30%   | 152%   |
| 128                          | 9%                    | 17%   | 37%   | 163%   |
| 256                          | 2%                    | 16%   | 40%   | 168%   |
| 384                          | -1%                   | 14%   | 44%   | 186%   |
+------------------------------+-----------------------+-------+-------+--------+

Fairness factor:

+-------------------------+-------+-------+-------+--------+
| Total contender threads | 1 NPS | 2 NPS | 4 NPS | 12 NPS |
+-------------------------+-------+-------+-------+--------+
| 8                       | 0.54  | 0.57  | 0.57  | 0.55   |
| 16                      | 0.52  | 0.53  | 0.60  | 0.58   |
| 32                      | 0.53  | 0.53  | 0.53  | 0.61   |
| 64                      | 0.52  | 0.56  | 0.54  | 0.56   |
| 128                     | 0.51  | 0.54  | 0.54  | 0.53   |
| 256                     | 0.52  | 0.52  | 0.52  | 0.52   |
| 384                     | 0.51  | 0.51  | 0.51  | 0.51   |
+-------------------------+-------+-------+-------+--------+

+-------------------------+--------------+
| Kunpeng 920 (arm64)     |              |
+-------------------------+--------------+
| 96 cores (no MT)        |              |
| 2 sockets, 4 NUMA nodes |              |
| Locktorture 60 sec.     |              |
| Total contender threads | Average gain |
| 8                       | 93%          |
| 16                      | 142%         |
| 32                      | 129%         |
| 64                      | 152%         |
| 96                      | 158%         |
+-------------------------+--------------+

[Memcached]

+---------------------------------+-----------------+-------------------+
| AMD EPYC 9654                   |                 |                   |
+---------------------------------+-----------------+-------------------+
| 192 cores (x2 hyper-threads)    |                 |                   |
| 2 sockets, NPS=4                |                 |                   |
|                                 |                 |                   |
| Memtier+memcached 1:1 R/W ratio |                 |                   |
| Workers                         | Throughput gain | Latency reduction |
| 32                              | 1%              | -1%               |
| 64                              | 1%              | -1%               |
| 128                             | 3%              | -4%               |
| 256                             | 7%              | -6%               |
| 384                             | 10%             | -8%               |
+---------------------------------+-----------------+-------------------+

+---------------------------------+-----------------+-------------------+
| Kunpeng 920 (arm64)             |                 |                   |
+---------------------------------+-----------------+-------------------+
| 96 cores (no MT)                |                 |                   |
| 2 sockets, 4 NUMA nodes         |                 |                   |
|                                 |                 |                   |
| Memtier+memcached 1:1 R/W ratio |                 |                   |
| Workers                         | Throughput gain | Latency reduction |
| 32                              | 4%              | -3%               |
| 64                              | 6%              | -6%               |
| 80                              | 8%              | -7%               |
| 96                              | 8%              | -8%               |
+---------------------------------+-----------------+-------------------+

[Nginx]

+-----------------------------------------------------------------------+-----------------+
| Kunpeng 920 (arm64)                                                   |                 |
+-----------------------------------------------------------------------+-----------------+
| 96 cores (no MT)                                                      |                 |
| 2 sockets, 4 NUMA nodes                                               |                 |
|                                                                       |                 |
| Nginx + wrk benchmark, single file (lockref spinlock contention case) |                 |
| Workers                                                               | Throughput gain |
| 32                                                                    | 1%              |
| 64                                                                    | 68%             |
| 80                                                                    | 72%             |
| 96                                                                    | 78%             |
+-----------------------------------------------------------------------+-----------------+

Although this is a single-file test, it reflects real-life cases where some
HTML pages are accessed much more frequently than others (index.html, etc.).

[Low contention remarks]

With the simple contention-detection scheme presented in this patch we
observe some degradation compared to qspinlock in low-contention scenarios
(< 8 contenders). We are working on a more sophisticated and effective
contention-detection method that addresses the issue.

Signed-off-by: Anatoly Stepanov
Co-authored-by: Stepanov Anatoly
Co-authored-by: Fedorov Nikita
---
 kernel/locking/hqlock_core.h  | 812 ++++++++++++++++++++++++++++++++++
 kernel/locking/hqlock_meta.h  | 477 ++++++++++++++++++++
 kernel/locking/hqlock_types.h | 118 +++++
 3 files changed, 1407 insertions(+)
 create mode 100644 kernel/locking/hqlock_core.h
 create mode 100644 kernel/locking/hqlock_meta.h
 create mode 100644 kernel/locking/hqlock_types.h

diff --git a/kernel/locking/hqlock_core.h b/kernel/locking/hqlock_core.h
new file mode 100644
index 000000000..c69d6c4f4
--- /dev/null
+++ b/kernel/locking/hqlock_core.h
@@ -0,0 +1,812 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_HQ_SPINLOCK_SLOWPATH
+#error "Do not include this file!"
+#endif
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+/* Contains queues for all possible lock ids */
+static struct numa_queue *queue_table[MAX_NUMNODES];
+
+#include "hqlock_types.h"
+#include "hqlock_proc.h"
+
+/* Takes a 1-based node_id (1..N) */
+static inline struct numa_queue *
+get_queue(u16 lock_id, u16 node_id)
+{
+	return &queue_table[node_id - 1][lock_id];
+}
+
+static inline struct numa_queue *
+get_local_queue(struct numa_qnode *qnode)
+{
+	return get_queue(qnode->lock_id, qnode->numa_node);
+}
+
+static inline void init_queue_link(struct numa_queue *queue)
+{
+	queue->prev_node = 0;
+	queue->next_node = 0;
+}
+
+static inline void init_queue(struct numa_qnode *qnode)
+{
+	struct numa_queue *queue = get_local_queue(qnode);
+
+	queue->head = qnode;
+	queue->handoffs_not_head = 0;
+	init_queue_link(queue);
+}
+
+static void set_next_queue(u16 lock_id, u16 prev_node_id, u16 node_id)
+{
+	struct numa_queue *local_queue = get_queue(lock_id, node_id);
+	struct numa_queue *prev_queue =
+		get_queue(lock_id, prev_node_id);
+
+	WRITE_ONCE(local_queue->prev_node, prev_node_id);
+	/*
+	 * The following must be guaranteed:
+	 * when appending "local_queue", if the "prev_queue->next_node" link
+	 * is observed, then "local_queue->prev_node" is also observed.
+	 *
+	 * We need this to guarantee correctness of a concurrent
+	 * "unlink_node_queue" for "prev_queue", if "prev_queue" is the first
+	 * in the list:
+	 * [prev_queue] <-> [local_queue]
+	 *
+	 * In that case "unlink_node_queue" would set
+	 * "local_queue->prev_node = 0"; without the smp barrier it might race
+	 * with "set_next_queue", so that
+	 * "local_queue->prev_node = prev_node_id" happens afterwards, leading
+	 * to a corrupted list.
+	 */
+	smp_wmb();
+	WRITE_ONCE(prev_queue->next_node, node_id);
+}
+
+static inline struct lock_metadata *get_meta(u16 lock_id);
+
+/**
+ * Put a new node's queue into the global NUMA-level queue
+ */
+static inline u16 append_node_queue(u16 lock_id, u16 node_id)
+{
+	struct lock_metadata *lock_meta = get_meta(lock_id);
+	u16 prev_node_id = xchg(&lock_meta->tail_node, node_id);
+
+	if (prev_node_id)
+		set_next_queue(lock_id, prev_node_id, node_id);
+	else
+		WRITE_ONCE(lock_meta->head_node, node_id);
+	return prev_node_id;
+}
+
+#include "hqlock_meta.h"
+
+/**
+ * Update tail
+ *
+ * Call the proper function depending on the lock's mode
+ * until queuing succeeds.
+ */
+static inline u32 hqlock_xchg_tail(struct qspinlock *lock, u32 tail,
+		struct mcs_spinlock *node, bool *numa_awareness_on)
+{
+	struct numa_qnode *qnode = (struct numa_qnode *)node;
+
+	u16 lock_id;
+	u32 old_tail;
+	u32 next_tail = tail;
+
+	/*
+	 * Key points about the lock's mode switches:
+	 * - After init the lock is in LOCK_MODE_QSPINLOCK
+	 * - If many contenders arrived while the lock was in
+	 *   LOCK_MODE_QSPINLOCK, we want this lock to use NUMA awareness next
+	 *   time, so we clear LOCK_MODE_QSPINLOCK, see
+	 *   'low_contention_try_clear_tail'
+	 * - On subsequent uses of the lock we try to go through the NUMA-aware
+	 *   path. We can fail here, because we use shared metadata
+	 *   and can have a conflict with another lock, see 'hqlock_meta.h' for
+	 *   details. In this case we fall back to the generic qspinlock
+	 *   approach.
+	 *
+	 * In other words, the lock can be in 3 mode states:
+	 *
+	 * 1. LOCK_MODE_QSPINLOCK - there was low contention (or none at all)
+	 *    earlier, or (unlikely) a conflict in metadata
+	 * 2. LOCK_NO_MODE - there was contention on the lock earlier,
+	 *    now there are no contenders in the queue (we are likely the first)
+	 *    and we need to try using NUMA awareness
+	 * 3. LOCK_MODE_HQLOCK - the lock is currently under contention
+	 *    and using NUMA awareness.
+	 */
+
+	/*
+	 * numa_awareness_on == false means we saw LOCK_MODE_QSPINLOCK (1st
+	 * state) before starting the slowpath, see 'queued_spin_lock_slowpath'
+	 */
+	if (*numa_awareness_on == false &&
+	    try_update_tail_qspinlock_mode(lock, tail, &old_tail, &next_tail))
+		return old_tail;
+
+	/* Calculate the lock_id hash here once */
+	qnode->lock_id = lock_id = hash_ptr(lock, LOCK_ID_BITS);
+
+try_again:
+	/*
+	 * Lock is in state 2 or 3 - go through the NUMA-aware path
+	 */
+	if (try_update_tail_hqlock_mode(lock, lock_id, qnode, tail, &next_tail, &old_tail)) {
+		*numa_awareness_on = true;
+		return old_tail;
+	}
+
+	/*
+	 * We have failed (conflict in metadata), now the lock is in
+	 * LOCK_MODE_QSPINLOCK again
+	 */
+	if (try_update_tail_qspinlock_mode(lock, tail, &old_tail, &next_tail)) {
+		*numa_awareness_on = false;
+		return old_tail;
+	}
+
+	/*
+	 * We were slow and a clear_tail after high contention has already
+	 * happened (very unlikely situation)
+	 */
+	goto try_again;
+}
+
+static inline void hqlock_clear_pending(struct qspinlock *lock, u32 old_val)
+{
+	WRITE_ONCE(lock->pending, (old_val & _Q_LOCK_TYPE_MODE_MASK) >> _Q_PENDING_OFFSET);
+}
+
+static inline void hqlock_clear_pending_set_locked(struct qspinlock *lock, u32 old_val)
+{
+	WRITE_ONCE(lock->locked_pending,
+		   _Q_LOCKED_VAL | (old_val & _Q_LOCK_TYPE_MODE_MASK));
+}
+
+static inline void unlink_node_queue(u16 lock_id,
+				     u16 prev_node_id,
+				     u16 next_node_id)
+{
+	struct numa_queue *prev_queue =
+		prev_node_id ? get_queue(lock_id, prev_node_id) : NULL;
+	struct numa_queue *next_queue = get_queue(lock_id, next_node_id);
+
+	if (prev_queue)
+		WRITE_ONCE(prev_queue->next_node, next_node_id);
+	/*
+	 * This is guaranteed to be ordered "after" the next_node_id
+	 * observation by an implicit full barrier in the caller code.
+	 */
+	WRITE_ONCE(next_queue->prev_node, prev_node_id);
+}
+
+static inline bool try_clear_queue_tail(struct numa_queue *queue, u32 tail)
+{
+	/*
+	 * We need full ordering here to:
+	 * - ensure all prior operations on the global tail and prev_queue
+	 *   are observed before clearing the local tail
+	 * - guarantee all subsequent operations
+	 *   (metadata release, unlink, etc.) will be observed after clearing
+	 *   the local tail
+	 */
+	return cmpxchg(&queue->tail, tail, 0) == tail;
+}
+
+/*
+ * Determine whether we have other local and global contenders.
+ * Try to clear the local and global tails, and determine which handoff type
+ * we need to perform. In case we are the last, free the lock's metadata.
+ */
+static inline bool hqlock_try_clear_tail(struct qspinlock *lock, u32 val,
+					 u32 tail, struct mcs_spinlock *node,
+					 int *p_next_node)
+{
+	bool ret = false;
+	struct numa_qnode *qnode = (void *)node;
+
+	u16 lock_id = qnode->lock_id;
+	u16 local_node = qnode->numa_node;
+	struct numa_queue *queue = get_queue(lock_id, qnode->numa_node);
+
+	struct lock_metadata *lock_meta = get_meta(lock_id);
+
+	u16 prev_node = 0, next_node = 0;
+	u16 node_tail;
+
+	u32 old_val;
+
+	bool lock_tail_updated = false;
+	bool lock_tail_cleared = false;
+
+	/* Has the *next node* arrived? */
+	bool pending_next_node = false;
+
+	tail >>= _Q_TAIL_OFFSET;
+
+	/* Do we have other CPUs in the node queue? */
+	if (READ_ONCE(queue->tail) != tail) {
+		*p_next_node = HQLOCK_HANDOFF_LOCAL;
+		goto out;
+	}
+
+	/*
+	 * Key observations and actions:
+	 * 1) next queue isn't observed:
+	 *    a) if a prev queue is observed, try to unpublish the local queue
+	 *    b) if a prev queue is not observed, try to clear the global tail
+	 *    Either way, perform these operations before clearing the local
+	 *    tail.
+	 *
+	 *    This trick is essential to safely unlink the local queue;
+	 *    otherwise we could race with upcoming local contenders,
+	 *    which would perform 'append_node_queue' while our unlink is not
+	 *    properly done.
+	 *
+	 * 2) next queue is observed:
+	 *    safely perform 'try_clear_queue_tail' and unlink the local node
+	 *    if it succeeded.
+	 */
+
+	prev_node = READ_ONCE(queue->prev_node);
+	pending_next_node = READ_ONCE(lock_meta->tail_node) != local_node;
+
+	/*
+	 * Tail case:
+	 * [prev_node] -> [local_node], lock->tail_node == local_node
+	 *
+	 * There are no nodes after us at the moment; try updating
+	 * "lock->tail_node"
+	 */
+	if (!pending_next_node && prev_node) {
+		struct numa_queue *prev_queue =
+			get_queue(lock_id, prev_node);
+
+		/* Reset next_node, in case no one comes after us */
+		WRITE_ONCE(prev_queue->next_node, 0);
+
+		/*
+		 * release to publish prev_queue->next_node = 0
+		 * and to ensure ordering with 'READ_ONCE(queue->tail) != tail'
+		 */
+		if (cmpxchg_release(&lock_meta->tail_node, local_node, prev_node) == local_node) {
+			lock_tail_updated = true;
+
+			queue->next_node = 0;
+			queue->prev_node = 0;
+			next_node = 0;
+		} else {
+			/* If some node arrived after the local one meanwhile, reset next_node back */
+			WRITE_ONCE(prev_queue->next_node, local_node);
+
+			/* We either observe the updated "queue->next" or it equals zero */
+			next_node = READ_ONCE(queue->next_node);
+		}
+	}
+
+	node_tail = READ_ONCE(lock_meta->tail_node);
+
+	/* If nobody else is waiting, try to clear the global tail */
+	if (node_tail == local_node && !prev_node) {
+		old_val = (((u32)local_node) | (((u32)local_node) << 16));
+		/* release to ensure ordering with 'READ_ONCE(queue->tail) != tail' */
+		lock_tail_cleared = try_cmpxchg_release(&lock_meta->nodes_tail, &old_val, 0);
+	}
+
+	/*
+	 * lock->tail_node was neither updated nor cleared,
+	 * so we have at least a single non-empty node after us
+	 */
+	if (!lock_tail_updated && !lock_tail_cleared) {
+		/*
+		 * If a node arrived before the node queue was cleared - wait
+		 * for it to link properly. We need this for a correct upcoming
+		 * *unlink*; otherwise the *unlink* might race with a parallel
+		 * set_next_node()
+		 */
+		if (!next_node) {
+			next_node =
+				smp_cond_load_relaxed(&queue->next_node, (VAL));
+		}
+	}
+
+	/* If we're the last one in the queue - clear the queue tail */
+	if (try_clear_queue_tail(queue, tail)) {
+		/*
+		 * "lock_tail_cleared == true"
+		 * It means we cleared "lock->tail_node" and "lock->head_node".
+		 *
+		 * The first new contender will do a "global spin" anyway, so
+		 * no handoff is needed:
+		 * "ret == true"
+		 */
+		if (lock_tail_cleared) {
+			ret = true;
+
+			/*
+			 * If someone has arrived in the meanwhile,
+			 * don't try to free the metadata.
+			 */
+			old_val = READ_ONCE(lock_meta->nodes_tail);
+			if (!old_val) {
+				/*
+				 * We are probably the last contender,
+				 * so we need to free the lock's metadata.
+				 */
+				release_lock_meta(lock, lock_meta, qnode);
+			}
+			goto out;
+		}
+
+		/*
+		 * "lock_tail_updated == true" (implies "lock_tail_cleared == false")
+		 * It means we have at least "prev_node" and unlinked the
+		 * "local node".
+		 *
+		 * As we unlinked the "local node", we only need to guarantee a
+		 * correct remote handoff, thus we have:
+		 * "ret == false"
+		 * "next_node == HQLOCK_HANDOFF_REMOTE_HEAD"
+		 */
+		if (lock_tail_updated) {
+			*p_next_node = HQLOCK_HANDOFF_REMOTE_HEAD;
+			goto out;
+		}
+
+		/*
+		 * "!lock_tail_cleared && !lock_tail_updated"
+		 * It means we have at least a single node after us.
+		 *
+		 * A remote handoff and a correct "local node" unlink are
+		 * needed.
+		 *
+		 * "next_node" visibility guarantees that we correctly observe
+		 * the addition of "next_node", so the following unlink
+		 * is safe and correct.
+		 *
+		 * "next_node > 0"
+		 * "ret == false"
+		 */
+		unlink_node_queue(lock_id, prev_node, next_node);
+
+		/*
+		 * If at the head - update it.
+		 *
+		 * The other place where "lock->head_node" is updated is
+		 * "append_node_queue". But we're safe, as that happens only
+		 * with the first node on an empty "node list".
+		 */
+		if (!prev_node)
+			WRITE_ONCE(lock_meta->head_node, next_node);
+
+		*p_next_node = next_node;
+	} else {
+		/*
+		 * The local queue has other contenders.
+		 *
+		 * 1) "lock_tail_updated == true":
+		 * It means we have at least "prev_node" and unlinked the
+		 * "local node". Also, some new nodes can arrive and link after
+		 * "prev_node". We should just re-add the "local node":
+		 * (prev_node) => ... => (local_node)
+		 * and perform a local handoff, as other CPUs from the local
+		 * node do an "mcs spin".
+		 *
+		 * 2) "lock_tail_cleared == true"
+		 * It means we cleared "lock->tail_node" and "lock->head_node".
+		 * We need to re-add the "local node" and move
+		 * "local_queue->head" to the next "mcs-node", which is in the
+		 * process of linking after the current "mcs-node"
+		 * (that's why we couldn't clear the "local_queue->tail").
+		 *
+		 * Meanwhile other nodes can arrive: (new_node) => (...)
+		 * That "new_node" will spin in "global spin" mode.
+		 * In this case no handoff is needed.
+		 *
+		 * 3) "!lock_tail_cleared && !lock_tail_updated"
+		 * It means we had at least one node after us before
+		 * 'try_clear_queue_tail' and only need to perform a local
+		 * handoff
+		 */
+
+		/* Cases 1) and 2) */
+		if (lock_tail_updated || lock_tail_cleared) {
+			u16 prev_node_id;
+
+			init_queue_link(queue);
+			prev_node_id =
+				append_node_queue(lock_id, local_node);
+
+			if (prev_node_id && lock_tail_cleared) {
+				/* Case 2) */
+				ret = true;
+				WRITE_ONCE(queue->head,
+					(void *) smp_cond_load_relaxed(&node->next, (VAL)));
+				goto out;
+			}
+		}
+
+		/* Cases 1) and 3) */
+		*p_next_node = HQLOCK_HANDOFF_LOCAL;
+		ret = false;
+	}
+out:
+	/*
+	 * Either a handoff for the current node,
+	 * or a remote handoff if the quota has expired
+	 */
+	return ret;
+}
+
+static inline void hqlock_handoff(struct qspinlock *lock,
+				  struct mcs_spinlock *node,
+				  struct mcs_spinlock *next, u32 tail,
+				  int handoff_info);
+
+
+/*
+ * Check whether contention has risen and we need to set NUMA-aware mode
+ */
+static __always_inline bool determine_contention_qspinlock_mode(struct mcs_spinlock *node)
+{
+	struct numa_qnode *qnode = (void *)node;
+
+	if (qnode->general_handoffs > READ_ONCE(hqlock_general_handoffs_turn_numa))
+		return true;
+	return false;
+}
+
+static __always_inline bool low_contention_try_clear_tail(struct qspinlock *lock,
+					u32 val,
+					struct mcs_spinlock *node)
+{
+	u32 update_val = _Q_LOCKED_VAL | _Q_LOCKTYPE_HQ;
+
+	bool high_contention = determine_contention_qspinlock_mode(node);
+
+	/*
+	 * If we have high contention, we set _Q_LOCK_INVALID_TAIL
+	 * to notify upcoming contenders, which have seen QSPINLOCK mode,
+	 * that performing a generic 'xchg_tail' is wrong.
+	 *
+	 * We also cannot set HQLOCK mode here,
+	 * because the first contender in the updated mode
+	 * should check whether the lock's metadata is free
+	 */
+	if (!high_contention)
+		update_val |= _Q_LOCK_MODE_QSPINLOCK_VAL;
+	else
+		update_val |= _Q_LOCK_INVALID_TAIL;
+
+	return atomic_try_cmpxchg_relaxed(&lock->val, &val, update_val);
+}
+
+static __always_inline void low_contention_mcs_lock_handoff(struct mcs_spinlock *node,
+		struct mcs_spinlock *next, struct mcs_spinlock *prev)
+{
+	struct numa_qnode *qnode = (void *)node;
+	struct numa_qnode *qnext = (void *)next;
+
+	static u16 max_u16 = (u16)(-1);
+
+	u16 general_handoffs = qnode->general_handoffs;
+
+	if (next != prev && likely(general_handoffs + 1 != max_u16))
+		general_handoffs++;
+
+#ifdef CONFIG_HQSPINLOCKS_DEBUG
+	if (READ_ONCE(max_general_handoffs) < general_handoffs)
+		WRITE_ONCE(max_general_handoffs, general_handoffs);
+#endif
+	qnext->general_handoffs = general_handoffs;
+	arch_mcs_spin_unlock_contended(&next->locked);
+}
+
+static inline void hqlock_clear_tail_handoff(struct qspinlock *lock, u32 val,
+					     u32 tail,
+					     struct mcs_spinlock *node,
+					     struct mcs_spinlock *next,
+					     struct mcs_spinlock *prev,
+					     bool is_numa_lock)
+{
+	int handoff_info;
+	struct numa_qnode *qnode = (void *)node;
+
+	/*
+	 * qnode->wrong_fallback_tail means we have queued globally
+	 * in 'try_update_tail_qspinlock_mode' after another contender,
+	 * but the lock's mode was not QSPINLOCK at that moment.
+	 *
+	 * The first confused contender has restored _Q_LOCK_INVALID_TAIL in
+	 * the global tail and put us in its local queue.
+	 */
+	if (is_numa_lock || qnode->wrong_fallback_tail) {
+		/*
+		 * Because the generic tail and the NUMA tail are split, we
+		 * must set locked before clearing the tail, otherwise a
+		 * double lock is possible
+		 */
+		set_locked(lock);
+
+		if (hqlock_try_clear_tail(lock, val, tail, node, &handoff_info))
+			return;
+
+		hqlock_handoff(lock, node, next, tail, handoff_info);
+	} else {
+		if ((val & _Q_TAIL_MASK) == tail) {
+			if (low_contention_try_clear_tail(lock, val, node))
+				return;
+		}
+
+		set_locked(lock);
+
+		if (!next)
+			next = smp_cond_load_relaxed(&node->next, (VAL));
+
+		low_contention_mcs_lock_handoff(node, next, prev);
+	}
+}
+
+static inline void hqlock_init_node(struct mcs_spinlock *node)
+{
+	struct numa_qnode *qnode = (void *)node;
+
+	qnode->general_handoffs = 0;
+	qnode->numa_node = numa_node_id() + 1;
+	qnode->lock_id = 0;
+	qnode->wrong_fallback_tail = 0;
+}
+
+static inline void reset_handoff_counter(struct numa_qnode *qnode)
+{
+	qnode->general_handoffs = 0;
+}
+
+static inline void handoff_local(struct mcs_spinlock *node,
+				 struct mcs_spinlock *next,
+				 u32 tail)
+{
+	static u16 max_u16 = (u16)(-1);
+
+	struct numa_qnode *qnode = (struct numa_qnode *)node;
+	struct numa_qnode *qnext = (struct numa_qnode *)next;
+
+	u16 general_handoffs = qnode->general_handoffs;
+
+	if (likely(general_handoffs + 1 != max_u16))
+		general_handoffs++;
+
+	qnext->general_handoffs = general_handoffs;
+
+	u16 wrong_fallback_tail = qnode->wrong_fallback_tail;
+
+	if (wrong_fallback_tail != 0 && wrong_fallback_tail != (tail >> _Q_TAIL_OFFSET)) {
+		qnext->numa_node = qnode->numa_node;
+		qnext->wrong_fallback_tail = wrong_fallback_tail;
+		qnext->lock_id = qnode->lock_id;
+	}
+
+	arch_mcs_spin_unlock_contended(&next->locked);
+}
+
+static inline void handoff_remote(struct qspinlock *lock,
+				  struct numa_qnode *qnode,
+				  u32 tail, int handoff_info)
+{
+	struct numa_queue *next_queue = NULL;
+	struct mcs_spinlock *mcs_head = NULL;
+	struct numa_qnode *qhead = NULL;
+	u16 lock_id = qnode->lock_id;
+
+	struct lock_metadata *lock_meta = get_meta(lock_id);
+	struct numa_queue *queue = get_local_queue(qnode);
+
+	u16 next_node_id;
+	u16 node_head, node_tail;
+
+	node_tail = READ_ONCE(lock_meta->tail_node);
+	node_head = READ_ONCE(lock_meta->head_node);
+
+	/*
+	 * 'handoffs_not_head > 0' means that at the head of the NUMA-level
+	 * queue we have a node which is heavily loaded and has performed a
+	 * remote handoff upon reaching the threshold.
+	 *
+	 * Perform the handoff to the head instead of the next node in the
+	 * NUMA-level queue if handoffs_not_head >= nr_online_nodes
+	 * (it means other contended nodes have taken the lock at least once
+	 * after the head one)
+	 */
+	u16 handoffs_not_head = READ_ONCE(queue->handoffs_not_head);
+
+	if (handoff_info > 0 && (handoffs_not_head < nr_online_nodes)) {
+		next_node_id = handoff_info;
+		if (node_head != qnode->numa_node)
+			handoffs_not_head++;
+	} else {
+		if (!node_head) {
+			/* If we're here - we definitely have other node contenders, let's wait */
+			next_node_id = smp_cond_load_relaxed(&lock_meta->head_node, (VAL));
+		} else {
+			next_node_id = node_head;
+		}
+
+		handoffs_not_head = 0;
+	}
+
+	next_queue = get_queue(lock_id, next_node_id);
+	WRITE_ONCE(next_queue->handoffs_not_head, handoffs_not_head);
+
+	qhead = READ_ONCE(next_queue->head);
+
+	mcs_head = (void *) qhead;
+
+	/* arch_mcs_spin_unlock_contended implies an smp barrier */
+	arch_mcs_spin_unlock_contended(&mcs_head->locked);
+}
+
+static inline bool has_other_nodes(struct qspinlock *lock,
+				   struct numa_qnode *qnode)
+{
+	struct lock_metadata *lock_meta = get_meta(qnode->lock_id);
+
+	return lock_meta->tail_node != qnode->numa_node;
+}
+
+static inline bool is_node_threshold_reached(struct numa_qnode *qnode)
+{
+	return qnode->general_handoffs > hqlock_fairness_threshold;
+}
+
+static inline void hqlock_handoff(struct qspinlock *lock,
+				  struct mcs_spinlock *node,
+				  struct mcs_spinlock *next, u32 tail,
+				  int handoff_info)
+{
+	struct numa_qnode *qnode = (void *)node;
+	u16 lock_id = qnode->lock_id;
+	struct lock_metadata *lock_meta = get_meta(lock_id);
+	struct numa_queue *queue = get_local_queue(qnode);
+
+	if (handoff_info == HQLOCK_HANDOFF_LOCAL) {
+		if (!next)
+			next = smp_cond_load_relaxed(&node->next, (VAL));
+		WRITE_ONCE(queue->head, (void *) next);
+
+		bool threshold_expired = is_node_threshold_reached(qnode);
+
+		if (!threshold_expired || qnode->wrong_fallback_tail) {
+			handoff_local(node, next, tail);
+			return;
+		}
+
+		u16 queue_next = READ_ONCE(queue->next_node);
+		bool has_others = has_other_nodes(lock, qnode);
+
+		/*
+		 * This check is racy, but that's OK,
+		 * because we fall back to the local node in the worst case
+		 * and do not call reset_handoff_counter.
+		 * The next local contender will perform a remote handoff
+		 * after the next queue is properly linked
+		 */
+		if (has_others) {
+			handoff_info =
+				queue_next > 0 ?
+				queue_next : HQLOCK_HANDOFF_LOCAL;
+		} else {
+			handoff_info = HQLOCK_HANDOFF_REMOTE_HEAD;
+		}
+
+		if (handoff_info == HQLOCK_HANDOFF_LOCAL ||
+		    (handoff_info == HQLOCK_HANDOFF_REMOTE_HEAD &&
+		     READ_ONCE(lock_meta->head_node) == qnode->numa_node)) {
+			/*
+			 * No other nodes have come yet, so we can clear the
+			 * fairness counter
+			 */
+			if (handoff_info == HQLOCK_HANDOFF_REMOTE_HEAD)
+				reset_handoff_counter(qnode);
+			handoff_local(node, next, tail);
+			return;
+		}
+	}
+
+	handoff_remote(lock, qnode, tail, handoff_info);
+	reset_handoff_counter(qnode);
+}
+
+static void __init hqlock_alloc_global_queues(void)
+{
+	int nid;
+	phys_addr_t phys_addr;
+
+	// meta_pool
+	unsigned long meta_pool_size =
+		sizeof(struct lock_metadata) * LOCK_ID_MAX;
+
+	pr_info("Init HQspinlock lock_metadata info: size = %lu B\n",
+		meta_pool_size);
+
+	phys_addr = memblock_alloc_range_nid(
+		meta_pool_size, L1_CACHE_BYTES, 0,
+		MEMBLOCK_ALLOC_ACCESSIBLE, NUMA_NO_NODE, false);
+
+	if (!phys_addr)
+		panic("HQspinlock lock_metadata metadata info: allocation failure.\n");
+
+	meta_pool = phys_to_virt(phys_addr);
+	memset(meta_pool, 0, meta_pool_size);
+
+	for (int i = 0; i < LOCK_ID_MAX; i++)
+		atomic_set(&meta_pool[i].seq_counter, 0);
+
+	/* Total queues size for all buckets */
+	unsigned long queues_size =
+		LOCK_ID_MAX *
+		ALIGN(sizeof(struct numa_queue), L1_CACHE_BYTES);
+
+	pr_info("Init HQspinlock per-NUMA metadata (per-node size = %lu B)\n",
+		queues_size);
+
+
+	for_each_node(nid) {
+		phys_addr = memblock_alloc_range_nid(
+			queues_size, L1_CACHE_BYTES, 0, MEMBLOCK_ALLOC_ACCESSIBLE,
+			nid, false);
+
+		if (!phys_addr)
+			panic("HQspinlock per-NUMA metadata: allocation failure for node %d.\n", nid);
+
+		queue_table[nid] = phys_to_virt(phys_addr);
+		memset(queue_table[nid], 0, queues_size);
+	}
+}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define hq_queued_spin_lock_slowpath pv_ops.lock.queued_spin_lock_slowpath
+#else
+void (*hq_queued_spin_lock_slowpath)(struct qspinlock *lock, u32 val) =
+	native_queued_spin_lock_slowpath;
+EXPORT_SYMBOL(hq_queued_spin_lock_slowpath);
+#endif
+
+static int numa_spinlock_flag;
+
+static int __init numa_spinlock_setup(char *str)
+{
+	if (!strcmp(str, "auto")) {
+		numa_spinlock_flag = 0;
+		return 1;
+	} else if (!strcmp(str, "on")) {
+		numa_spinlock_flag = 1;
+		return 1;
+	} else if (!strcmp(str, "off")) {
+		numa_spinlock_flag = -1;
+		return 1;
+	}
+
+	return 0;
+}
+__setup("numa_spinlock=", numa_spinlock_setup);
+
+void __hq_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+void __init hq_configure_spin_lock_slowpath(void)
+{
+	if (numa_spinlock_flag < 0)
+		return;
+
+	if (numa_spinlock_flag == 0 && (nr_node_ids < 2 ||
+	    hq_queued_spin_lock_slowpath !=
+	    native_queued_spin_lock_slowpath)) {
+		return;
+	}
+
+	numa_spinlock_flag = 1;
+	hq_queued_spin_lock_slowpath = __hq_queued_spin_lock_slowpath;
+	pr_info("Enabling HQspinlock\n");
+	hqlock_alloc_global_queues();
+}
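
As a usage note, inferred from numa_spinlock_setup() and
hq_configure_spin_lock_slowpath() above (example values; the parameter is
not documented elsewhere in this patch), the slowpath is selected with a
boot parameter:

	numa_spinlock=auto    # enable on multi-node systems, unless another
	                      # slowpath (e.g. paravirt) is already installed
	numa_spinlock=on      # force-enable the HQ slowpath
	numa_spinlock=off     # keep the native qspinlock slowpath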
diff --git a/kernel/locking/hqlock_meta.h b/kernel/locking/hqlock_meta.h
new file mode 100644
index 000000000..d9fb70af3
--- /dev/null
+++ b/kernel/locking/hqlock_meta.h
@@ -0,0 +1,477 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_HQ_SPINLOCK_SLOWPATH
+#error "Do not include this file!"
+#endif
+
+/* Lock metadata pool */
+static struct lock_metadata *meta_pool;
+
+static inline struct lock_metadata *get_meta(u16 lock_id)
+{
+	return &meta_pool[lock_id];
+}
+
+static inline hqlock_mode_t set_lock_mode(struct qspinlock *lock, int __val, u16 lock_id)
+{
+	u32 val = (u32)__val;
+	u32 new_val = 0;
+	u32 lock_mode = encode_lock_mode(lock_id);
+
+	while (!(val & _Q_LOCK_MODE_MASK)) {
+		/*
+		 * We need to wait until pending is gone.
+		 * Otherwise, clearing pending could erase the NUMA mode we
+		 * set here
+		 */
+		if (val & _Q_PENDING_VAL) {
+			val = atomic_cond_read_relaxed(&lock->val, !(VAL & _Q_PENDING_VAL));
+
+			if (val & _Q_LOCK_MODE_MASK)
+				return LOCK_NO_MODE;
+		}
+
+		/*
+		 * If we are enabling NUMA awareness, we should keep the
+		 * previous value in lock->tail in case there are contenders
+		 * that saw LOCK_MODE_QSPINLOCK and set their tails via
+		 * xchg_tail (they will restore it to _Q_LOCK_INVALID_TAIL
+		 * later).
+		 * If we are setting LOCK_MODE_QSPINLOCK, remove
+		 * _Q_LOCK_INVALID_TAIL
+		 */
+		if (lock_id != LOCK_ID_NONE)
+			new_val = val | lock_mode;
+		else
+			new_val = (val & ~_Q_LOCK_INVALID_TAIL) | lock_mode;
+
+		/*
+		 * If we're setting LOCK_MODE_HQLOCK, make sure all
+		 * "seq_counter" updates (per-queue, lock_meta) are observed
+		 * before the lock mode update.
+		 * Paired with smp_rmb() in setup_lock_mode().
+		 */
+		if (lock_id != LOCK_ID_NONE)
+			smp_wmb();
+
+		bool updated = atomic_try_cmpxchg_relaxed(&lock->val, &val, new_val);
+
+		if (updated) {
+			return (lock_id == LOCK_ID_NONE) ?
+				LOCK_MODE_QSPINLOCK : LOCK_MODE_HQLOCK;
+		}
+	}
+
+	return LOCK_NO_MODE;
+}
+
+static inline hqlock_mode_t set_mode_hqlock(struct qspinlock *lock, int val, u16 lock_id)
+{
+	return set_lock_mode(lock, val, lock_id);
+}
+
+static inline hqlock_mode_t set_mode_qspinlock(struct qspinlock *lock, int val)
+{
+	return set_lock_mode(lock, val, LOCK_ID_NONE);
+}
+
+/* Dynamic lock-mode conditions */
+static inline bool is_mode_hqlock(int val)
+{
+	return decode_lock_mode(val) == LOCK_MODE_HQLOCK;
+}
+
+static inline bool is_mode_qspinlock(int val)
+{
+	return decode_lock_mode(val) == LOCK_MODE_QSPINLOCK;
+}
+
+enum meta_status {
+	META_CONFLICT = 0,
+	META_GRABBED,
+	META_SHARED,
+};
+
+static inline enum meta_status grab_lock_meta(struct qspinlock *lock, u32 lock_id, u32 *seq)
+{
+	int nid, seq_counter;
+	struct qspinlock *old = READ_ONCE(meta_pool[lock_id].lock_ptr);
+
+	if (old && old != lock)
+		return META_CONFLICT;
+
+	if (old && old == lock)
+		return META_SHARED;
+
+	old = cmpxchg_acquire(&meta_pool[lock_id].lock_ptr, NULL, lock);
+	if (!old)
+		goto init_meta;
+
+	/* Hash conflict */
+	if (old != lock)
+		return META_CONFLICT;
+
+	return META_SHARED;
+init_meta:
+	/*
+	 * Update the allocations counter and propagate it to the per-NUMA
+	 * queues to prevent upcoming contenders from parking on deallocated
+	 * queues
+	 */
+	seq_counter = atomic_inc_return_relaxed(&meta_pool[lock_id].seq_counter);
+
+	/* It is very unlikely that we can overflow */
+	if (unlikely(seq_counter == 0))
+		seq_counter = atomic_inc_return_relaxed(&meta_pool[lock_id].seq_counter);
+
+	for_each_online_node(nid) {
+		struct numa_queue *queue = &queue_table[nid][lock_id];
+		WRITE_ONCE(queue->seq_counter, (u32)seq_counter);
+	}
+
+	*seq = seq_counter;
+#ifdef CONFIG_HQSPINLOCKS_DEBUG
+	int current_used = atomic_inc_return_relaxed(&cur_buckets_in_use);
+
+	if (READ_ONCE(max_buckets_in_use) < current_used)
+		WRITE_ONCE(max_buckets_in_use, current_used);
+#endif
+	return META_GRABBED;
+}
+
+/*
+ * Try to set up the current
+ * lock mode:
+ *
+ * LOCK_MODE_HQLOCK, or fall back to the default LOCK_MODE_QSPINLOCK
+ * if there's a hash conflict with another lock in the system.
+ *
+ * In general the setup consists of grabbing lock-related metadata and
+ * publishing the mode in the global lock variable.
+ *
+ * For quick meta lookup, pointer hashing is used.
+ *
+ * To identify an "occupied/free" metadata record, we use "meta->lock_ptr",
+ * which is set to the corresponding spinlock pointer or "NULL".
+ *
+ * The action sequence from the initial state is the following:
+ *
+ * "Find lock-meta by hash" => "Occupy lock-meta" => publish
+ * "LOCK_MODE_HQLOCK" in the global lock variable.
+ *
+ */
+static inline
+hqlock_mode_t setup_lock_mode(struct qspinlock *lock, u16 lock_id, u32 *meta_seq_counter)
+{
+	hqlock_mode_t mode;
+
+	do {
+		enum meta_status status;
+		int val = atomic_read(&lock->val);
+
+		if (is_mode_hqlock(val)) {
+			struct lock_metadata *lock_meta = get_meta(lock_id);
+			/*
+			 * The lock is currently in LOCK_MODE_HQLOCK; we need
+			 * to make sure the associated metadata isn't used by
+			 * another lock.
+			 *
+			 * In the meanwhile several situations can occur:
+			 *
+			 * [Case 1] Another lock is using the meta
+			 * (hash conflict)
+			 *
+			 * If a "release + reallocate" of the meta happened in
+			 * the meanwhile, we're guaranteed to observe the
+			 * lock-mode change in "lock->val", due to the
+			 * following event ordering:
+			 *
+			 * [release_lock_meta]
+			 *	Clear the lock mode in "lock->val", so we
+			 *	wouldn't observe LOCK_MODE_HQLOCK mode.
+			 * =>
+			 * [setup_lock_mode]
+			 *	Update lock->seq_counter
+			 *
+			 * [Case 2] For the exact same lock, some contender
+			 * did a "release + reallocate" of the meta
+			 *
+			 * Either we'll get the newly set "seq_counter", or, in
+			 * the worst case, we'll get an outdated "seq_counter"
+			 * and fail in the CAS(queue) in the caller function.
+			 *
+			 * [Case 3] The meta is free, nobody is using it
+			 * [Case 4] The lock mode is changed to
+			 * LOCK_MODE_QSPINLOCK.
+			 */
+			int seq_counter = atomic_read(&lock_meta->seq_counter);
+
+			/*
+			 * "seq_counter" and "lock->val" should be read in
+			 * program order. Otherwise we might observe
+			 * "seq_counter" updated on behalf of another lock.
+			 * Paired with smp_wmb() in set_lock_mode().
+			 */
+			smp_rmb();
+			val = atomic_read(&lock->val);
+
+			if (is_mode_hqlock(val)) {
+				*meta_seq_counter = (u32)seq_counter;
+				return LOCK_MODE_HQLOCK;
+			}
+			/*
+			 * [else] Here there can be 2 options:
+			 *
+			 * 1. The lock-meta is free and nobody is using it.
+			 *    In this case, we need to try occupying the meta
+			 *    and publish lock mode LOCK_MODE_HQLOCK again.
+			 *
+			 * 2. The lock mode transitioned to
+			 *    LOCK_MODE_QSPINLOCK.
+			 */
+			continue;
+		} else if (is_mode_qspinlock(val)) {
+			return LOCK_MODE_QSPINLOCK;
+		}
+
+		/*
+		 * Trying to get temporary "weak" metadata ownership.
+		 * Three situations might happen:
+		 *
+		 * 1. The metadata isn't used by anyone
+		 *    Just take ownership.
+		 *
+		 * 2. The metadata is already grabbed by one of the lock's
+		 *    contenders.
+		 *
+		 * 3. Hash conflict: the metadata is owned by another lock
+		 *    Give up, fall back to LOCK_MODE_QSPINLOCK.
+		 */
+		status = grab_lock_meta(lock, lock_id, meta_seq_counter);
+		if (status == META_SHARED) {
+			/*
+			 * Someone started publishing lock_id for us:
+			 * 1. We can catch the "LOCK_MODE_HQLOCK" mode quickly
+			 * 2. We can loop several times before we see the
+			 *    "LOCK_MODE_HQLOCK" mode set (lightweight check).
+			 * 3. Another contender might be able to release the
+			 *    lock meta meanwhile. Either we catch it in the
+			 *    above "seq_counter" check, or we'll grab the
+			 *    lock meta first and try publishing lock_id.
+			 */
+			continue;
+		}
+
+		/* Set up the lock mode */
+		if (status == META_GRABBED)
+			mode = set_mode_hqlock(lock, val, lock_id);
+		else if (status == META_CONFLICT)
+			mode = set_mode_qspinlock(lock, val);
+		else
+			BUG_ON(1);
+		/*
+		 * If we grabbed the meta but were unable to publish
+		 * LOCK_MODE_HQLOCK, release it, just by resetting the pointer.
+		 */
+		if (status == META_GRABBED && mode != LOCK_MODE_HQLOCK) {
+			smp_store_release(&meta_pool[lock_id].lock_ptr, NULL);
+#ifdef CONFIG_HQSPINLOCKS_DEBUG
+			atomic_dec(&cur_buckets_in_use);
+#endif
+		}
+	} while (mode == LOCK_NO_MODE);
+
+	return mode;
+}
+
+static inline void release_lock_meta(struct qspinlock *lock,
+				     struct lock_metadata *meta,
+				     struct numa_qnode *qnode)
+{
+	int nid;
+	bool cleared = false;
+	u32 upd_val = _Q_LOCKTYPE_HQ | _Q_LOCKED_VAL;
+	u16 lock_id = qnode->lock_id;
+	int seq_counter = atomic_read(&meta->seq_counter);
+
+	/*
+	 * First, go across the per-NUMA queues and set the seq counter to 0;
+	 * this prevents possible contenders, which haven't even queued
+	 * locally, from using already-vacated metadata.
+	 *
+	 * We need to perform the counter reset with a CAS,
+	 * because local contenders (which we didn't see during
+	 * try_clear_lock_tail and try_clear_queue_tail)
+	 * may have appeared while we were getting to this point.
+	 *
+	 * If any CAS is unsuccessful, it means someone has already queued
+	 * locally; in that case we should restore the usability of all local
+	 * queues and return the seq counter to every per-NUMA queue.
+	 *
+	 * If all CASes are successful, nobody will queue on this metadata's
+	 * queues, and we can free it and allow other locks to use it.
+	 */
+
+	/*
+	 * Before the metadata release, read every queue tail;
+	 * if we have at least one contender, don't do the CASes and leave
+	 * (reads are much faster and also prefetch the local queues'
+	 * cachelines)
+	 */
+	for_each_online_node(nid) {
+		struct numa_queue *queue = get_queue(lock_id, nid + 1);
+
+		if (READ_ONCE(queue->tail) != 0)
+			return;
+	}
+
+	for_each_online_node(nid) {
+		struct numa_queue *queue = get_queue(lock_id, nid + 1);
+
+		if (cmpxchg_relaxed(&queue->seq_counter_tail, encode_tc(0, seq_counter), 0)
+		    != encode_tc(0, seq_counter))
+			/* Some contender arrived - roll back */
+			goto do_rollback;
+	}
+
+#ifdef CONFIG_HQSPINLOCKS_DEBUG
+	atomic_dec(&cur_buckets_in_use);
+#endif
+	/*
+	 * We need to wait until pending is gone.
+	 * Otherwise, clearing pending could erase the mode we set here
+	 */
+	while (!cleared) {
+		u32 old_lock_val = atomic_cond_read_relaxed(&lock->val, !(VAL & _Q_PENDING_VAL));
+
+		cleared = atomic_try_cmpxchg_relaxed(&lock->val,
+			&old_lock_val, upd_val | (old_lock_val & _Q_TAIL_MASK));
+	}
+
+	/*
+	 * Guarantee the current seq counter is erased from every local queue
+	 * and the lock mode has been updated before another lock can use the
+	 * metadata
+	 */
+	smp_store_release(&meta_pool[qnode->lock_id].lock_ptr, NULL);
+	return;
+
+do_rollback:
+	for_each_online_node(nid) {
+		struct numa_queue *queue = get_queue(lock_id, nid + 1);
+		WRITE_ONCE(queue->seq_counter, seq_counter);
+	}
+}
+
+/*
+ * Call this if we observe LOCK_MODE_QSPINLOCK.
+ *
+ * We can do a generic xchg_tail in this case;
+ * if the lock's mode has already been changed, we will get
+ * _Q_LOCK_INVALID_TAIL.
+ *
+ * In that situation, we perform a CAS cycle
+ * to restore _Q_LOCK_INVALID_TAIL or wait until the lock's mode is
+ * LOCK_MODE_QSPINLOCK.
+ *
+ * All upcoming confused contenders will see a valid tail.
+ * We will remember the last one before the successful CAS and put its tail
+ * in the local queue. During handoff we will notify them about the mode
+ * change via qnext->wrong_fallback_tail
+ */
+static inline bool try_update_tail_qspinlock_mode(struct qspinlock *lock, u32 tail, u32 *old_tail, u32 *next_tail)
+{
+	/*
+	 * next_tail may be tail or the last cpu from a previous unsuccessful
+	 * call (highly unlikely, but still)
+	 */
+	u32 xchged_tail = xchg_tail(lock, *next_tail);
+
+	if (likely(xchged_tail != _Q_LOCK_INVALID_TAIL)) {
+		*old_tail = xchged_tail;
+		return true;
+	}
+
+	/*
+	 * If we got _Q_LOCK_INVALID_TAIL, it means the lock was not in
+	 * LOCK_MODE_QSPINLOCK. In this case we should restore
+	 * _Q_LOCK_INVALID_TAIL and remember the next contenders that got
+	 * confused. Later we will update the lock's or the local queue's tail
+	 * to the last contender seen here.
+	 */
+	u32 val = atomic_read(&lock->val);
+
+	bool fixed = false;
+
+	while (!fixed) {
+		if (decode_lock_mode(val) == LOCK_MODE_QSPINLOCK) {
+			*old_tail = 0;
+			return true;
+		}
+
+		/*
+		 * A CAS is needed here to catch a possible lock mode change
+		 * from LOCK_MODE_HQLOCK to LOCK_MODE_QSPINLOCK in the
+		 * meanwhile, thus preventing us from publishing
+		 * _Q_LOCK_INVALID_TAIL when LOCK_MODE_QSPINLOCK is enabled.
+		 */
+		fixed = atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCK_INVALID_TAIL |
+			(val & (_Q_LOCKED_PENDING_MASK | _Q_LOCK_TYPE_MODE_MASK)));
+	}
+
+	if ((val & _Q_TAIL_MASK) != tail)
+		*next_tail = val & _Q_TAIL_MASK;
+
+	return false;
+}
+
+/*
+ * Call this if we observe LOCK_MODE_HQLOCK or LOCK_NO_MODE in the lock.
+ *
+ * Actions performed:
+ * - Call setup_lock_mode to set or read the lock's mode,
+ *   and read the metadata's sequence counter for valid local queueing
+ * - CAS on the union of the local tail and meta_seq_counter
+ *   to guarantee metadata usage correctness.
+ *   Repeat from the beginning on failure.
+ * - If we are the first local contender,
+ *   update the global tail with our NUMA node
+ */
+static inline bool try_update_tail_hqlock_mode(struct qspinlock *lock, u16 lock_id,
+		struct numa_qnode *qnode, u32 tail, u32 *next_tail, u32 *old_tail)
+{
+	u32 meta_seq_counter;
+	hqlock_mode_t mode;
+
+	struct numa_queue *queue;
+	u64 old_counter_tail;
+	bool updated_queue_tail = false;
+
+re_setup:
+	mode = setup_lock_mode(lock, lock_id, &meta_seq_counter);
+
+	if (mode == LOCK_MODE_QSPINLOCK)
+		return false;
+
+	queue = get_local_queue(qnode);
+
+	/*
+	 * While queueing locally, perform a CAS cycle
+	 * on the union of the tail and meta_seq_counter.
+	 *
+	 * meta_seq_counter is taken from the lock metadata during allocation;
+	 * it's updated every time the metadata is used by a next lock.
+	 * It shows that the queue is used correctly
+	 * and the metadata hasn't been vacated before we queued locally.
+	 */
+	old_counter_tail = READ_ONCE(queue->seq_counter_tail);
+
+	while (!updated_queue_tail &&
+	       decode_tc_counter(old_counter_tail) == meta_seq_counter) {
+		updated_queue_tail =
+			try_cmpxchg_relaxed(&queue->seq_counter_tail, &old_counter_tail,
+				encode_tc((*next_tail) >> _Q_TAIL_OFFSET, meta_seq_counter));
+	}
+
+	/* queue->seq_counter changed */
+	if (!updated_queue_tail)
+		goto re_setup;
+
+	/*
+	 * The condition means we tried to perform a generic tail update in
+	 * try_update_tail_qspinlock_mode, but before we did, the lock type
+	 * was changed.
+	 * Moreover, some contenders came after us in LOCK_MODE_QSPINLOCK;
+	 * during handoff we must notify them that they are placed in
+	 * LOCK_MODE_HQLOCK in our node's local queue
+	 */
+	if (unlikely(*next_tail != tail))
+		qnode->wrong_fallback_tail = *next_tail >> _Q_TAIL_OFFSET;
+
+	*old_tail = decode_tc_tail(old_counter_tail);
+
+	if (!(*old_tail)) {
+		u16 prev_node_id;
+
+		init_queue(qnode);
+		prev_node_id = append_node_queue(lock_id, qnode->numa_node);
+		*old_tail = prev_node_id ? Q_NEW_NODE_QUEUE : 0;
+	} else {
+		*old_tail <<= _Q_TAIL_OFFSET;
+	}
+
+	return true;
+}
diff --git a/kernel/locking/hqlock_types.h b/kernel/locking/hqlock_types.h
new file mode 100644
index 000000000..2a30c87ee
--- /dev/null
+++ b/kernel/locking/hqlock_types.h
@@ -0,0 +1,118 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_HQ_SPINLOCK_SLOWPATH
+#error "Do not include this file!"
+#endif
+
+#define IRQ_NODE		(MAX_NUMNODES + 1)
+#define Q_NEW_NODE_QUEUE	1
+#define LOCK_ID_BITS		(12)
+
+#define LOCK_ID_MAX		(1 << LOCK_ID_BITS)
+#define LOCK_ID_NONE		(LOCK_ID_MAX + 1)
+
+#define _QUEUE_TAIL_MASK		(((1ULL << 32) - 1) << 0)
+#define _QUEUE_SEQ_COUNTER_MASK		(((1ULL << 32) - 1) << 32)
+
+/*
+ * Output code for the handoff logic:
+ *
+ * == 0 (HQLOCK_HANDOFF_LOCAL) - has local nodes to hand off to
+ *  > 0 - has a remote node to hand off to, its id is visible
+ * == -1 (HQLOCK_HANDOFF_REMOTE_HEAD) - has a remote node to hand off to,
+ *	  its id isn't visible yet, it will be in *lock->head_node*
+ */
+enum {
+	HQLOCK_HANDOFF_LOCAL = 0,
+	HQLOCK_HANDOFF_REMOTE_HEAD = -1
+};
+
+typedef enum {
+	LOCK_NO_MODE = 0,
+	LOCK_MODE_QSPINLOCK = 1,
+	LOCK_MODE_HQLOCK = 2,
+} hqlock_mode_t;
+
+struct numa_qnode {
+	struct mcs_spinlock mcs;
+
+	u16 lock_id;
+	u16 wrong_fallback_tail;
+	u16 general_handoffs;
+
+	u16 numa_node;
+};
+
+struct numa_queue {
+	struct numa_qnode *head;
+	union {
+		u64 seq_counter_tail;
+		struct {
+			u32 tail;
+			u32 seq_counter;
+		};
+	};
+
+	u16 next_node;
+	u16 prev_node;
+
+	u16 handoffs_not_head;
+} ____cacheline_aligned;
+
+/**
+ * Lock metadata,
+ * "allocated"/"freed" on demand.
+ *
+ * Used to dynamically bind a numa_queue to a lock and
+ * maintain FIFO order for the NUMA queues.
+ *
+ * seq_counter is needed to distinguish metadata usage
+ * by different locks, preventing local contenders
+ * from queueing in the wrong per-NUMA queue
+ *
+ * @see set_bucket
+ * @see numa_xchg_tail
+ */
+struct lock_metadata {
+	atomic_t seq_counter;
+	struct qspinlock *lock_ptr;
+
+	/* NUMA queues of contenders are kept in FIFO order */
+	union {
+		u32 nodes_tail;
+		struct {
+			u16 tail_node;
+			u16 head_node;
+		};
+	};
+};
+
+static inline int decode_lock_mode(u32 lock_val)
+{
+	return (lock_val & _Q_LOCK_MODE_MASK) >> _Q_LOCK_MODE_OFFSET;
+}
+
+static inline u32 encode_lock_mode(u16 lock_id)
+{
+	if (lock_id == LOCK_ID_NONE)
+		return LOCK_MODE_QSPINLOCK << _Q_LOCK_MODE_OFFSET;
+
+	return LOCK_MODE_HQLOCK << _Q_LOCK_MODE_OFFSET;
+}
+
+static inline u64 encode_tc(u32 tail, u32 counter)
+{
+	u64 __tail = (u64)tail;
+	u64 __counter = (u64)counter;
+
+	return __tail | (__counter << 32);
+}
+
+static inline u32 decode_tc_tail(u64 tail_counter)
+{
+	return (u32)(tail_counter & _QUEUE_TAIL_MASK);
+}
+
+static inline u32 decode_tc_counter(u64 tail_counter)
+{
+	return (u32)((tail_counter & _QUEUE_SEQ_COUNTER_MASK) >> 32);
+}
-- 
2.34.1