From: yongli-oc
Subject: [PATCH v2 1/3] locking/osq_lock: The Kconfig for dynamic numa-aware osq lock.
Date: Mon, 30 Sep 2024 15:23:53 +0800
Message-ID: <20240930072355.2892-2-yongli-oc@zhaoxin.com>
In-Reply-To: <20240930072355.2892-1-yongli-oc@zhaoxin.com>
References: <20240914085327.32912-1-yongli-oc@zhaoxin.com> <20240930072355.2892-1-yongli-oc@zhaoxin.com>

Add a Kconfig menu entry to choose whether to compile the dynamic
NUMA-aware osq lock; the default is N.

Signed-off-by: yongli-oc
---
 kernel/Kconfig.numalocks | 17 +++++++++++++++++
 lib/Kconfig.debug        |  1 +
 2 files changed, 18 insertions(+)
 create mode 100644 kernel/Kconfig.numalocks

diff --git a/kernel/Kconfig.numalocks b/kernel/Kconfig.numalocks
new file mode 100644
index 000000000000..feb1751b637c
--- /dev/null
+++ b/kernel/Kconfig.numalocks
@@ -0,0 +1,17 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+menu "NUMA Lock Supporting (OSQ)"
+
+config LOCK_SPIN_ON_OWNER_NUMA
+	bool "Enable dynamic numa-aware osq lock"
+	depends on LOCK_SPIN_ON_OWNER && X86_64
+	default n
+	help
+	  According to the CPU NUMA architecture, the NUMA-aware lock
+	  always unlocks a waiter on the same NUMA node first, unlike
+	  the kernel's in-tree osq_lock. The dynamic NUMA-aware osq
+	  lock switches between osq_lock and the NUMA-aware lock
+	  automatically, according to the contention level.
+	  Enable: echo 1 > /proc/zx_numa_lock/dynamic_enable.
+
+endmenu # NUMA Lock Supporting
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 27539c2626bf..cf9344eb61a4 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1337,6 +1337,7 @@ config DEBUG_PREEMPT
 	  depending on workload as it triggers debugging routines for each
 	  this_cpu operation. It should only be used for debugging purposes.
 
+source "kernel/Kconfig.numalocks"
 menu "Lock Debugging (spinlocks, mutexes, etc...)"
 
 config LOCK_DEBUGGING_SUPPORT
-- 
2.34.1


From: yongli-oc
Subject: [PATCH v2 2/3] locking/osq_lock: Define osq by union to support dynamic numa-aware lock.
Date: Mon, 30 Sep 2024 15:23:54 +0800
Message-ID: <20240930072355.2892-3-yongli-oc@zhaoxin.com>
In-Reply-To: <20240930072355.2892-1-yongli-oc@zhaoxin.com>
References: <20240914085327.32912-1-yongli-oc@zhaoxin.com> <20240930072355.2892-1-yongli-oc@zhaoxin.com>

To support the NUMA-aware osq lock, struct optimistic_spin_queue is
accessed through a union as three members: numa_enable, index and
tail16. The size of the struct is the same as before.

Signed-off-by: yongli-oc
---
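[Editor's note: the following is an illustrative, standalone userspace
sketch, not part of the patch. It approximates the layout this patch
gives the lock word, assuming a little-endian build and substituting
plain uint32_t for atomic_t; "osq_sketch" is a made-up name.]

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors the union added to struct optimistic_spin_queue below. */
struct osq_sketch {
	union {
		uint32_t tail;			/* stands in for atomic_t tail */
		uint32_t val;			/* whole-word view */
		struct {			/* little-endian layout */
			uint16_t tail16;	/* encoded osq tail CPU */
			uint8_t  index;		/* numa lock entry index */
			uint8_t  numa_enable;	/* osq/numa state */
		};
	};
};

int main(void)
{
	struct osq_sketch q = { .val = 0x01020304u };

	/* Same size as a bare atomic_t tail: the union adds no bytes. */
	static_assert(sizeof(struct osq_sketch) == sizeof(uint32_t),
		      "union must not grow the lock word");

	/* Prints tail16=0x304 index=0x2 numa_enable=0x1 on little-endian. */
	printf("tail16=%#x index=%#x numa_enable=%#x\n",
	       q.tail16, q.index, q.numa_enable);
	return 0;
}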
 include/linux/osq_lock.h | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index ea8fb31379e3..37a7bc4ab530 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -12,14 +12,42 @@ struct optimistic_spin_queue {
 	 * Stores an encoded value of the CPU # of the tail node in the queue.
 	 * If the queue is empty, then it's set to OSQ_UNLOCKED_VAL.
 	 */
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
+	union {
+		atomic_t tail;
+		u32 val;
+#ifdef __LITTLE_ENDIAN
+		struct {
+			u16 tail16;
+			u8 index;
+			u8 numa_enable;
+		};
+#else
+		struct {
+			u8 numa_enable;
+			u8 index;
+			u16 tail16;
+		};
+#endif
+	};
+#else
 	atomic_t tail;
+#endif
 };
 
 #define OSQ_UNLOCKED_VAL (0)
 
 /* Init macro and function. */
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
+
+#define OSQ_LOCK_UNLOCKED { .tail = ATOMIC_INIT(OSQ_UNLOCKED_VAL) }
+
+#else
+
 #define OSQ_LOCK_UNLOCKED { ATOMIC_INIT(OSQ_UNLOCKED_VAL) }
 
+#endif
+
 static inline void osq_lock_init(struct optimistic_spin_queue *lock)
 {
 	atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
@@ -28,9 +56,12 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
 extern bool osq_lock(struct optimistic_spin_queue *lock);
 extern void osq_unlock(struct optimistic_spin_queue *lock);
 
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
+extern bool osq_is_locked(struct optimistic_spin_queue *lock);
+#else
 static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
 {
 	return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
 }
-
+#endif
 #endif
-- 
2.34.1

From: yongli-oc
Subject: [PATCH v2 3/3] locking/osq_lock: Turn from 2-byte osq_lock/unlock to numa lock/unlock.
Date: Mon, 30 Sep 2024 15:23:55 +0800
Message-ID: <20240930072355.2892-4-yongli-oc@zhaoxin.com>
In-Reply-To: <20240930072355.2892-1-yongli-oc@zhaoxin.com>
References: <20240914085327.32912-1-yongli-oc@zhaoxin.com> <20240930072355.2892-1-yongli-oc@zhaoxin.com>

Switch from osq_lock to the NUMA-aware osq lock according to the
contention level.

numa.h holds the definitions for the NUMA-aware lock; numa_osq.h holds
the definitions for the osq-to-NUMA switching.

x_osq_lock.c implements the crossing from the two-byte osq lock to the
NUMA-aware lock. zx_numa_osq.c implements the two-level osq lock itself.

zx_numa.c prepares the kernel memory cache for the NUMA-aware lock, runs
a workqueue that turns the NUMA-aware lock back into an osq lock, and
provides the /proc interface. Enable dynamic switching with:
echo 1 > /proc/zx_numa_lock/dynamic_enable

Add the Makefile rules for the dynamic NUMA-aware osq lock.
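[Editor's note: a small userspace usage sketch, not part of the patch;
it is the programmatic equivalent of the echo command above and assumes
a kernel built with CONFIG_LOCK_SPIN_ON_OWNER_NUMA=y, so that
/proc/zx_numa_lock/dynamic_enable exists.]

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* The proc entry is created by numalock_proc_init() in zx_numa.c. */
	int fd = open("/proc/zx_numa_lock/dynamic_enable", O_WRONLY);

	if (fd < 0) {
		perror("open /proc/zx_numa_lock/dynamic_enable");
		return 1;
	}
	/* Writing "1" enables dynamic switching; "0" disables it again. */
	if (write(fd, "1", 1) != 1)
		perror("write");
	close(fd);
	return 0;
}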
Signed-off-by: yongli-oc --- kernel/locking/Makefile | 3 + kernel/locking/numa.h | 90 ++++++ kernel/locking/numa_osq.h | 29 ++ kernel/locking/x_osq_lock.c | 371 ++++++++++++++++++++++++ kernel/locking/zx_numa.c | 540 +++++++++++++++++++++++++++++++++++ kernel/locking/zx_numa_osq.c | 497 ++++++++++++++++++++++++++++++++ 6 files changed, 1530 insertions(+) create mode 100644 kernel/locking/numa.h create mode 100644 kernel/locking/numa_osq.h create mode 100644 kernel/locking/x_osq_lock.c create mode 100644 kernel/locking/zx_numa.c create mode 100644 kernel/locking/zx_numa_osq.c diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile index 0db4093d17b8..297bfad88fda 100644 --- a/kernel/locking/Makefile +++ b/kernel/locking/Makefile @@ -21,7 +21,10 @@ ifeq ($(CONFIG_PROC_FS),y) obj-$(CONFIG_LOCKDEP) +=3D lockdep_proc.o endif obj-$(CONFIG_SMP) +=3D spinlock.o +obj-$(CONFIG_LOCK_SPIN_ON_OWNER_NUMA) +=3D x_osq_lock.o zx_numa_osq.o zx_n= uma.o +else obj-$(CONFIG_LOCK_SPIN_ON_OWNER) +=3D osq_lock.o +endif obj-$(CONFIG_PROVE_LOCKING) +=3D spinlock.o obj-$(CONFIG_QUEUED_SPINLOCKS) +=3D qspinlock.o obj-$(CONFIG_RT_MUTEXES) +=3D rtmutex_api.o diff --git a/kernel/locking/numa.h b/kernel/locking/numa.h new file mode 100644 index 000000000000..790c27ed18e5 --- /dev/null +++ b/kernel/locking/numa.h @@ -0,0 +1,90 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __LINUX_NUMA_LOCK_H +#define __LINUX_NUMA_LOCK_H +#include +#include "mcs_spinlock.h" + +struct optimistic_spin_node { + struct optimistic_spin_node *next, *prev; + int locked; /* 1 if lock acquired */ + int cpu; /* encoded CPU # + 1 value */ + u32 serial; +}; + + +struct _numa_buf { + void *numa_ptr; + struct list_head list; + u32 lockaddr; + u32 highaddr; + u8 idle; + u8 type; + u16 index; +}; + +struct _numa_lock { + atomic_t tail ____cacheline_aligned_in_smp; + atomic_t addr; + u8 shift; + u8 stopping; + u16 numa_nodes; + u32 accessed; + uint64_t totalaccessed; + u32 nodeswitched; + atomic_t initlock; + atomic_t pending; + union { + struct mcs_spinlock mcs_node; + struct optimistic_spin_node osq_node; + }; + CACHELINE_PADDING(pad); +}; + +struct numa_cpu_info { + __u8 x86_model; + /* CPU family */ + __u8 x86; + /* CPU vendor */ + __u8 x86_vendor; + __u8 x86_reserved; + u32 feature1; +}; + +#define NUMAEXPAND 1 + +#define COHORT_START 1 +#define ACQUIRE_NUMALOCK (UINT_MAX-1) +#define NODE_WAIT UINT_MAX +#define LOCK_NUMALOCK 1 +#define UNLOCK_NUMALOCK 0 + +#define NUMALOCKDYNAMIC 0xff +#define TURNTONUMAREADY 0xa + +#define NUMA_LOCKED_VAL 0xffffff +#define NUMA_UNLOCKED_VAL 0 + +#define HIGH32BITMASK 0xffffffff00000000 +#define LOW32MASK 0xffffffff + +extern int numa_shift; +extern int numa_clusters; +extern int zx_numa_lock_total; +extern struct _numa_buf *zx_numa_entry; +extern atomic_t numa_count; +extern int enable_zx_numa_osq_lock; +extern u32 zx_numa_lock; +extern int dynamic_enable; +extern struct kmem_cache *zx_numa_lock_cachep; + +static inline u32 ptrmask(void *s) +{ + return (uint64_t)s & LOW32MASK; +} +inline void *get_numa_lock(int index); + +int zx_check_numa_dynamic_locked(u32 lockaddr, struct _numa_lock *_numa_lo= ck); +int zx_numa_lock_ptr_get(void *p); +void numa_lock_init_data(struct _numa_lock *s, int clusters, u32 lockval, + u32 lockaddr); +#endif diff --git a/kernel/locking/numa_osq.h b/kernel/locking/numa_osq.h new file mode 100644 index 000000000000..5c4675abc4fc --- /dev/null +++ b/kernel/locking/numa_osq.h @@ -0,0 +1,29 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __LINUX_NUMA_OSQ_H +#define 
__LINUX_NUMA_OSQ_H + +#include +#include "mcs_spinlock.h" + +#define OSQLOCKINITED 0 +#define OSQTONUMADETECT 0x10 +#define OSQLOCKSTOPPING 0xfc +#define OSQ_LOCKED_VAL 0xffff + +extern u16 osq_keep_times; +extern u16 osq_lock_depth; +extern int osq_node_max; + +inline int encode_cpu(int cpu_nr); +inline int node_cpu(struct optimistic_spin_node *node); +inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val); + +void zx_osq_lock_stopping(struct optimistic_spin_queue *lock); +void zx_osq_numa_start(struct optimistic_spin_queue *lock); +void zx_osq_turn_numa_waiting(struct optimistic_spin_queue *lock); + +inline void zx_numa_osq_unlock(struct optimistic_spin_queue *qslock, + struct _numa_lock *n); +inline bool zx_numa_osq_lock(struct optimistic_spin_queue *qslock, + struct _numa_lock *n); +#endif diff --git a/kernel/locking/x_osq_lock.c b/kernel/locking/x_osq_lock.c new file mode 100644 index 000000000000..82cde1f6355b --- /dev/null +++ b/kernel/locking/x_osq_lock.c @@ -0,0 +1,371 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * crossing from osq_lock to numa-aware lock + */ +#include +#include +#include +#include "numa.h" +#include "numa_osq.h" + +u16 osq_lock_depth =3D 8; +u16 osq_keep_times =3D 32; + +/* + * An MCS like lock especially tailored for optimistic spinning for sleepi= ng + * lock implementations (mutex, rwsem, etc). + * + * Using a single mcs node per CPU is safe because sleeping locks should n= ot be + * called from interrupt context and we have preemption disabled while + * spinning. + */ +static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node= ); +/* + * We use the value 0 to represent "no CPU", thus the encoded value + * will be the CPU number incremented by 1. + */ +inline int encode_cpu(int cpu_nr) +{ + return cpu_nr + 1; +} + +inline int node_cpu(struct optimistic_spin_node *node) +{ + return node->cpu - 1; +} + +inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val) +{ + int cpu_nr =3D encoded_cpu_val - 1; + + return per_cpu_ptr(&osq_node, cpu_nr); +} + +/* + * Get a stable @node->next pointer, either for unlock() or unqueue() purp= oses. + * Can return NULL in case we were the last queued and we updated @lock in= stead. + * + * If osq_lock() is being cancelled there must be a previous node + * and 'old_cpu' is its CPU #. + * For osq_unlock() there is never a previous node and old_cpu is + * set to OSQ_UNLOCKED_VAL. + * If the osq is in OSQLOCKSTOPPING state, + * the tail will be set to OSQ_LOCKED_VAL to stop the queue. + */ +static inline struct optimistic_spin_node * +osq_wait_next_stop(struct optimistic_spin_queue *lock, + struct optimistic_spin_node *node, + int old_cpu) +{ + u16 curr =3D encode_cpu(smp_processor_id()); + u16 old =3D old_cpu; + + if (lock->numa_enable =3D=3D OSQLOCKSTOPPING && old =3D=3D OSQ_UNLOCKED_V= AL) + old =3D OSQ_LOCKED_VAL; + + for (;;) { + if (READ_ONCE(lock->tail16) =3D=3D curr && + cmpxchg(&lock->tail16, curr, old) =3D=3D curr) { + + /* + * We were the last queued, we moved @lock back. @prev + * will now observe @lock and will complete its + * unlock()/unqueue(). + */ + return NULL; + } + + /* + * We must xchg() the @node->next value, because if we were to + * leave it in, a concurrent unlock()/unqueue() from + * @node->next might complete Step-A and think its @prev is + * still valid. + * + * If the concurrent unlock()/unqueue() wins the race, we'll + * wait for either @lock to point to us, through its Step-B, or + * wait for a new @node->next from its Step-C. 
+ */ + if (node->next) { + struct optimistic_spin_node *next; + + next =3D xchg(&node->next, NULL); + if (next) + return next; + } + + cpu_relax(); + } +} + +bool osq_lock(struct optimistic_spin_queue *lock) +{ + struct optimistic_spin_node *node =3D this_cpu_ptr(&osq_node); + struct optimistic_spin_node *prev, *next; + int cpu =3D smp_processor_id(); + u16 curr =3D encode_cpu(cpu); + struct optimistic_spin_queue tail; + u16 old; + + tail.val =3D READ_ONCE(lock->val); + if (unlikely(tail.numa_enable =3D=3D OSQLOCKSTOPPING)) { + zx_osq_turn_numa_waiting(lock); + return osq_lock(lock); + } + + if (unlikely(tail.numa_enable =3D=3D NUMALOCKDYNAMIC)) { + struct _numa_lock *_numa_lock =3D NULL; + struct _numa_lock *node_lock =3D NULL; + + _numa_lock =3D get_numa_lock(tail.index); + node_lock =3D (struct _numa_lock *) _numa_lock + + (cpu >> numa_shift); + + prefetch(node_lock); + return zx_numa_osq_lock(lock, _numa_lock); + } + + node->locked =3D 0; + node->next =3D NULL; + node->cpu =3D curr; + node->serial =3D 0; + + /* + * if ss.tail16 is OSQ_LOCKED_VAL, keeps the LOCKED state + * and waiting numa-aware lock ready. + */ + + if (likely(tail.numa_enable >=3D OSQTONUMADETECT)) { + struct optimistic_spin_queue ss; + + while (1) { + ss.val =3D atomic_read(&lock->tail); + if (ss.tail16 =3D=3D OSQ_LOCKED_VAL) { + zx_osq_turn_numa_waiting(lock); + return osq_lock(lock); + } + if (cmpxchg(&lock->tail16, ss.tail16, curr) + =3D=3D ss.tail16) { + old =3D ss.tail16; + break; + } + cpu_relax(); + } + } else + old =3D xchg(&lock->tail16, curr); + + if (old =3D=3D OSQ_UNLOCKED_VAL) { + node->serial =3D 1; + return true; + } + + prev =3D decode_cpu(old); + node->prev =3D prev; + + // Record osq serial number for the lock. + node->serial =3D prev->serial + 1; + /* + * osq_lock() unqueue + * + * node->prev =3D prev osq_wait_next() + * WMB MB + * prev->next =3D node next->prev =3D prev // unqueue-C + * + * Here 'node->prev' and 'next->prev' are the same variable and we need + * to ensure these stores happen in-order to avoid corrupting the list. + */ + smp_wmb(); + + WRITE_ONCE(prev->next, node); + + /* + * Normally @prev is untouchable after the above store; because at that + * moment unlock can proceed and wipe the node element from stack. + * + * However, since our nodes are static per-cpu storage, we're + * guaranteed their existence -- this allows us to apply + * cmpxchg in an attempt to undo our queueing. + */ + + /* + * Wait to acquire the lock or cancellation. Note that need_resched() + * will come with an IPI, which will wake smp_cond_load_relaxed() if it + * is implemented with a monitor-wait. vcpu_is_preempted() relies on + * polling, be careful. + */ + if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() || + vcpu_is_preempted(node_cpu(node->prev)))) + return true; + + /* unqueue */ + /* + * Step - A -- stabilize @prev + * + * Undo our @prev->next assignment; this will make @prev's + * unlock()/unqueue() wait for a next pointer since @lock points to us + * (or later). + */ + + for (;;) { + /* + * cpu_relax() below implies a compiler barrier which would + * prevent this comparison being optimized away. + */ + if (data_race(prev->next) =3D=3D node && + cmpxchg(&prev->next, node, NULL) =3D=3D node) + break; + + /* + * We can only fail the cmpxchg() racing against an unlock(), + * in which case we should observe @node->locked becoming + * true. 
+ */ + if (smp_load_acquire(&node->locked)) + return true; + + cpu_relax(); + + /* + * Or we race against a concurrent unqueue()'s step-B, in which + * case its step-C will write us a new @node->prev pointer. + */ + prev =3D READ_ONCE(node->prev); + } + + /* + * Step - B -- stabilize @next + * + * Similar to unlock(), wait for @node->next or move @lock from @node + * back to @prev. + */ + + next =3D osq_wait_next_stop(lock, node, prev->cpu); + if (!next) + return false; + + /* + * Step - C -- unlink + * + * @prev is stable because its still waiting for a new @prev->next + * pointer, @next is stable because our @node->next pointer is NULL and + * it will wait in Step-A. + */ + + WRITE_ONCE(next->prev, prev); + WRITE_ONCE(prev->next, next); + + return false; +} + +/* + * In osq_unlock(), changes the osq state for switching, then unlock next. + * OSQTONUMADETECT: Starts to detect lock contention when first osq tail r= eached + * after dynamic enable. + * OSQLOCKSTOPPING: If lock contention keeps more than osq_lock_depth + * osq_keep_times, set the osq to STOPPING state and get + * numa-aware lock memory entry. + * NUMALOCKDYNAMIC: numa-aware lock initialized and ready. + */ + +void osq_unlock(struct optimistic_spin_queue *lock) +{ + struct optimistic_spin_node *node, *next; + int threadshold =3D osq_lock_depth; + int cpu =3D smp_processor_id(); + u16 curr =3D encode_cpu(cpu); + int depth =3D 0; + u32 count =3D 0; + + if (unlikely(lock->numa_enable =3D=3D NUMALOCKDYNAMIC)) { + struct _numa_lock *_numa_lock =3D get_numa_lock(lock->index); + + prefetch((struct _numa_lock *) _numa_lock + (cpu >> numa_shift)); + return zx_numa_osq_unlock(lock, _numa_lock); + } + /* + * Fast path for the uncontended case. + */ + if (unlikely(lock->numa_enable =3D=3D OSQTONUMADETECT)) { + struct optimistic_spin_node *node_last =3D NULL; + u16 tail =3D 0; + + tail =3D cmpxchg(&lock->tail16, curr, OSQ_UNLOCKED_VAL); + if (tail =3D=3D curr) + return; + + node =3D this_cpu_ptr(&osq_node); + node_last =3D decode_cpu(tail); + //Get the contention level + depth =3D node_last->serial - node->serial; + count =3D READ_ONCE(node->locked); + if (count > osq_keep_times && (dynamic_enable & 0x1)) + zx_osq_lock_stopping(lock); + } else if (unlikely(lock->numa_enable =3D=3D OSQLOCKSTOPPING)) { + if (cmpxchg(&lock->tail16, curr, OSQ_LOCKED_VAL) + =3D=3D curr) { + //All osq stopped, start to run as numa-aware lock. + zx_osq_numa_start(lock); + return; + } + } else { + struct optimistic_spin_queue t; + + /* + * After dynamic enable, when osq reaches tail, set the osq + * to DETECT mode to detect the contention. + */ + t.val =3D 0; + if (dynamic_enable & 0x1) { + if (atomic_read(&numa_count) < zx_numa_lock_total) + t.numa_enable =3D OSQTONUMADETECT; + } + if (t.numa_enable =3D=3D OSQTONUMADETECT) { + if (atomic_cmpxchg_release(&lock->tail, curr, + (t.val | OSQ_UNLOCKED_VAL)) =3D=3D curr) + return; + } else if (cmpxchg(&lock->tail16, curr, + OSQ_UNLOCKED_VAL) =3D=3D curr) + return; + } + + /* + * Second most likely case. 
+ */ + node =3D this_cpu_ptr(&osq_node); + next =3D xchg(&node->next, NULL); + if (next) { + if (depth > threadshold) + WRITE_ONCE(next->locked, count + 1); + else + WRITE_ONCE(next->locked, 1); + return; + } + + next =3D osq_wait_next_stop(lock, node, OSQ_UNLOCKED_VAL); + if (next) { + if (depth > threadshold) + WRITE_ONCE(next->locked, count + 1); + else + WRITE_ONCE(next->locked, 1); + } +} + +bool osq_is_locked(struct optimistic_spin_queue *lock) +{ + struct optimistic_spin_queue val; + + val.val =3D atomic_read(&lock->tail); + if (val.tail16 =3D=3D OSQ_UNLOCKED_VAL) + return false; + + if (val.tail16 =3D=3D OSQ_LOCKED_VAL) { + + //state changing + if (val.numa_enable !=3D NUMALOCKDYNAMIC) + return true; + + return zx_check_numa_dynamic_locked(ptrmask(lock), + get_numa_lock(val.index)); + } + + return true; +} diff --git a/kernel/locking/zx_numa.c b/kernel/locking/zx_numa.c new file mode 100644 index 000000000000..31d247261474 --- /dev/null +++ b/kernel/locking/zx_numa.c @@ -0,0 +1,540 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Dynamic numa-aware osq lock + * Crossing from numa-aware lock to osq_lock + * Numa lock memory initialize and /proc interface + * Author: LiYong + * + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "numa.h" +#include "numa_osq.h" + +int enable_zx_numa_osq_lock; +struct delayed_work zx_numa_start_work; +struct delayed_work zx_numa_cleanup_work; + +atomic_t numa_count; +struct _numa_buf *zx_numa_entry; +int zx_numa_lock_total =3D 256; +LIST_HEAD(_zx_numa_head); +LIST_HEAD(_zx_numa_lock_head); + +struct kmem_cache *zx_numa_entry_cachep; +struct kmem_cache *zx_numa_lock_cachep; +int numa_shift; +int numa_clusters; +static atomic_t lockindex; +int dynamic_enable; + +static const struct numa_cpu_info numa_cpu_list[] =3D { + /*feature1=3D1, a numa node includes two clusters*/ + //{1, 23, X86_VENDOR_AMD, 0, 1}, + {0x5b, 7, X86_VENDOR_CENTAUR, 0, 1}, + {0x5b, 7, X86_VENDOR_ZHAOXIN, 0, 1} +}; + +inline void *get_numa_lock(int index) +{ + if (index >=3D 0 && index < zx_numa_lock_total) + return zx_numa_entry[index].numa_ptr; + else + return NULL; +} + +static int zx_get_numa_shift(int all_cpus, int clusters) +{ + int cpus =3D (int) all_cpus/clusters; + int count =3D 0; + + while (cpus) { + cpus >>=3D 1; + count++; + } + return count-1; +} + +void numa_lock_init_data(struct _numa_lock *s, int clusters, + u32 lockval, u32 lockaddr) +{ + int j =3D 0; + + for (j =3D 0; j < clusters + NUMAEXPAND; j++) { + atomic_set(&(s + j)->tail, lockval); + atomic_set(&(s + j)->addr, lockaddr); + (s + j)->shift =3D numa_shift; + (s + j)->stopping =3D 0; + (s + j)->numa_nodes =3D clusters; + (s + j)->accessed =3D 0; + (s + j)->totalaccessed =3D 0; + (s + j)->nodeswitched =3D 0; + atomic_set(&(s + j)->initlock, 0); + atomic_set(&(s + j)->pending, 0); + } +} +/* + * The lockaddr of zx_numa_enry is key value to know which index is occupi= ed. 
+ */ +int zx_numa_lock_ptr_get(void *p) +{ + int i =3D 0; + int index =3D 0; + + if (atomic_read(&numa_count) >=3D zx_numa_lock_total) + return zx_numa_lock_total; + + index =3D atomic_inc_return(&lockindex); + + for (i =3D 0; i < zx_numa_lock_total; i++) { + if (index >=3D zx_numa_lock_total) + index =3D 0; + if (cmpxchg(&zx_numa_entry[index].lockaddr, + 0, ptrmask(p)) =3D=3D 0) { + while (1) { + struct _numa_lock *node_lock =3D + zx_numa_entry[index].numa_ptr; + struct _numa_lock *numa_lock =3D node_lock + + node_lock->numa_nodes; + + if (atomic_read(&numa_lock->tail) =3D=3D + NUMA_LOCKED_VAL) + break; + cpu_relax(); + + } + atomic_inc(&numa_count); + zx_numa_entry[index].highaddr =3D ((u64)p) >> 32; + atomic_set(&lockindex, index); + return index; + } + index++; + if (atomic_read(&numa_count) >=3D zx_numa_lock_total) + break; + } + return zx_numa_lock_total; +} + +int zx_check_numa_dynamic_locked(u32 lockaddr, + struct _numa_lock *_numa_lock) +{ + struct _numa_lock *node_lock =3D NULL; + u64 s =3D -1; + int i =3D 0; + + //in switching if the pending is not 0 + if (atomic_read(&_numa_lock->pending) !=3D 0) + return 1; + + for (i =3D 0; i < _numa_lock->numa_nodes + 1; i++) { + node_lock =3D _numa_lock + i; + cpu_relax(); + s =3D atomic64_read((atomic64_t *) &node_lock->tail); + if ((s >> 32) !=3D lockaddr) + continue; + if ((s & LOW32MASK) =3D=3D NUMA_LOCKED_VAL + || (s & LOW32MASK) =3D=3D NUMA_UNLOCKED_VAL) + continue; + break; + } + + if (i =3D=3D _numa_lock->numa_nodes + 1) + return 0; + return i+1; +} + +static int zx_numa_lock64_try_to_freeze(u32 lockaddr, struct _numa_lock *_= numa_lock, + int index) +{ + struct _numa_lock *node_lock =3D NULL; + u64 addr =3D ((u64)lockaddr) << 32; + u64 s =3D 0; + u64 ff =3D 0; + int i =3D 0; + + //check and set the tail to LOCKED from first node to last node, + //if, all node tail are LOCKED, try to set the NUMA node to LOCKED + for (i =3D 0; i < _numa_lock->numa_nodes+1; i++) { + node_lock =3D _numa_lock + i; + cpu_relax(); + + s =3D atomic64_read((atomic64_t *)&node_lock->tail); + if ((s & HIGH32BITMASK) !=3D addr) + continue; + + if ((s & LOW32MASK) =3D=3D NUMA_LOCKED_VAL) + continue; + + if ((s & LOW32MASK) =3D=3D NUMA_UNLOCKED_VAL) { + ff =3D atomic64_cmpxchg((atomic64_t *)&node_lock->tail, + (addr|NUMA_UNLOCKED_VAL), NUMA_LOCKED_VAL); + if (ff =3D=3D (addr|NUMA_UNLOCKED_VAL)) + continue; + } + break; + } + //All node's tail and numa node's tail are LOCKED, set numa lock memory + //referenced by index to un-occupied. + if (i =3D=3D _numa_lock->numa_nodes + 1) { + zx_numa_entry[index].idle =3D 0; + zx_numa_entry[index].type =3D 0; + zx_numa_entry[index].highaddr =3D 0; + xchg(&zx_numa_entry[index].lockaddr, 0); + } + + return i; +} + +static void zx_numa_lock_stopping(struct _numa_lock *_numa_lock) +{ + struct _numa_lock *node_lock =3D NULL; + int i =3D 0; + + for (i =3D 0; i < _numa_lock->numa_nodes+1; i++) { + node_lock =3D _numa_lock + i; + WRITE_ONCE(node_lock->stopping, 1); + } +} + +static void zx_numa_cleanup(struct work_struct *work) +{ + int i =3D 0; + int checktimes =3D 2; + + //reboot or power off state + if (READ_ONCE(enable_zx_numa_osq_lock) =3D=3D 0xf) + return; + + //If dynamic enable and numa-aware lock count is 0, reschedule the + //workqueue. 
+ if (atomic_read(&numa_count) =3D=3D 0) { + if (READ_ONCE(dynamic_enable) !=3D 0) + schedule_delayed_work(&zx_numa_cleanup_work, 60*HZ); + return; + } + + for (i =3D 0; i < zx_numa_lock_total; i++) { + int s =3D 0; + u32 lockaddr =3D READ_ONCE(zx_numa_entry[i].lockaddr); + u32 type =3D zx_numa_entry[i].type; + struct _numa_lock *buf =3D zx_numa_entry[i].numa_ptr; + int nodes =3D 0; + + if (lockaddr =3D=3D 0 || type =3D=3D 3 || zx_numa_entry[i].idle =3D=3D 0) + continue; + nodes =3D buf->numa_nodes; + if (zx_numa_entry[i].idle < checktimes) { + //check if all node is idle + s =3D zx_check_numa_dynamic_locked(lockaddr, buf); + if (s !=3D 0) { + zx_numa_entry[i].idle =3D 1; + continue; + } + zx_numa_entry[i].idle++; + } + + if (zx_numa_entry[i].idle =3D=3D checktimes) { + //set each node to stopping mode + zx_numa_lock_stopping(buf); + zx_numa_entry[i].idle++; + + } + + if (zx_numa_entry[i].idle =3D=3D checktimes+1) { + while (1) { + //try to freezed all nodes + if (zx_numa_lock64_try_to_freeze(lockaddr, buf, + i) =3D=3D nodes + 1) { + //all node has been locked + atomic_dec(&numa_count); + break; + } + cpu_relax(); + } + } + } + schedule_delayed_work(&zx_numa_cleanup_work, 60*HZ); +} + +static int create_numa_buffer_list(int clusters, int len) +{ + int i =3D 0; + + for (i =3D 0; i < zx_numa_lock_total; i++) { + struct _numa_lock *s =3D (struct _numa_lock *)kmem_cache_alloc( + zx_numa_lock_cachep, GFP_KERNEL); + if (!s) { + while (i > 0) { + kmem_cache_free(zx_numa_lock_cachep, + zx_numa_entry[i-1].numa_ptr); + i--; + } + return 0; + } + memset((char *)s, 0, + len * L1_CACHE_BYTES * (clusters + NUMAEXPAND)); + numa_lock_init_data(s, clusters, NUMA_LOCKED_VAL, 0); + zx_numa_entry[i].numa_ptr =3D s; + zx_numa_entry[i].lockaddr =3D 0; + zx_numa_entry[i].highaddr =3D 0; + zx_numa_entry[i].idle =3D 0; + zx_numa_entry[i].type =3D 0; + } + + for (i =3D 0; i < zx_numa_lock_total; i++) { + zx_numa_entry[i].index =3D i; + list_add_tail(&(zx_numa_entry[i].list), &_zx_numa_lock_head); + } + return 1; +} + +static int zx_numa_lock_init(int numa) +{ + int align =3D max_t(int, L1_CACHE_BYTES, ARCH_MIN_TASKALIGN); + int d =3D 0; + int status =3D 0; + + atomic_set(&lockindex, 0); + atomic_set(&numa_count, 0); + + if (sizeof(struct _numa_lock) & 0x3f) + d =3D (int)((sizeof(struct _numa_lock) + L1_CACHE_BYTES) / + L1_CACHE_BYTES); + else + d =3D (int)(sizeof(struct _numa_lock) / L1_CACHE_BYTES); + + zx_numa_entry_cachep =3D kmem_cache_create( + "zx_numa_entry", + sizeof(struct _numa_buf) * zx_numa_lock_total, align, + SLAB_PANIC | SLAB_ACCOUNT, NULL); + + zx_numa_lock_cachep =3D kmem_cache_create( + "zx_numa_lock", + d * L1_CACHE_BYTES * (numa + NUMAEXPAND), align, + SLAB_PANIC | SLAB_ACCOUNT, NULL); + + + if (zx_numa_entry_cachep && zx_numa_lock_cachep) { + zx_numa_entry =3D (struct _numa_buf *)kmem_cache_alloc( + zx_numa_entry_cachep, GFP_KERNEL); + if (zx_numa_entry) { + memset((char *)zx_numa_entry, 0, + sizeof(struct _numa_buf) * zx_numa_lock_total); + create_numa_buffer_list(numa, d); + status =3D 1; + } + } + + pr_info("enable dynamic numa-aware osq_lock, clusters %d\n", + numa); + return status; +} + + +#define numa_lock_proc_dir "zx_numa_lock" +#define numa_entry_total 8 +struct proc_dir_entry *numa_lock_proc; +struct proc_dir_entry *numa_lock_enable; +struct proc_dir_entry *numa_proc_entry[numa_entry_total]; + +static ssize_t numa_lock_proc_read(struct file *file, + char __user *usrbuf, size_t len, loff_t *off) +{ + int id =3D (long) pde_data(file_inode(file)); + char kbuffer[128]; + ssize_t retval 
=3D 0; + size_t n =3D 0; + + memset(kbuffer, 0, sizeof(kbuffer)); + if (id =3D=3D 0) + n =3D sprintf(kbuffer, "%d\n", READ_ONCE(dynamic_enable)); + else if (id =3D=3D 1) + n =3D sprintf(kbuffer, "%d\n", READ_ONCE(osq_lock_depth)); + else if (id =3D=3D 2) + n =3D sprintf(kbuffer, "%d\n", READ_ONCE(osq_keep_times)); + else if (id =3D=3D 3) + n =3D sprintf(kbuffer, "%d\n", READ_ONCE(osq_node_max)); + else if (id =3D=3D 4) + n =3D sprintf(kbuffer, "%d\n", atomic_read(&numa_count)); + retval =3D simple_read_from_buffer(usrbuf, len, off, kbuffer, n); + + return retval; +} + +static ssize_t numa_lock_proc_write(struct file *file, + const char __user *buffer, size_t count, loff_t *f_pos) +{ + int id =3D (long) pde_data(file_inode(file)); + char kbuffer[128]; + unsigned long new =3D 0; + int err =3D 0; + + memset(kbuffer, 0, sizeof(kbuffer)); + if (copy_from_user(kbuffer, buffer, count)) + return count; + kbuffer[count] =3D '\0'; + err =3D kstrtoul(kbuffer, 10, &new); + if (err !=3D 0) + return count; + + if (id =3D=3D 0) { + int last =3D READ_ONCE(dynamic_enable); + + if (new < 0 || new >=3D 2 || last =3D=3D new) + return count; + + if (last =3D=3D 0) + schedule_delayed_work(&zx_numa_cleanup_work, 60*HZ); + prefetchw(&dynamic_enable); + WRITE_ONCE(dynamic_enable, new); + return count; + } + + if (READ_ONCE(dynamic_enable) !=3D 0) { + pr_info("dynamic %d: change setting should disable dynamic\n", + dynamic_enable); + return count; + } + if (id =3D=3D 1 && new > 4 && new <=3D 32) + WRITE_ONCE(osq_lock_depth, new); + else if (id =3D=3D 2 && new >=3D 16 && new <=3D 2048) + WRITE_ONCE(osq_keep_times, new); + else if (id =3D=3D 3 && new > 4 && new <=3D 2048) + WRITE_ONCE(osq_node_max, new); + return count; +} +static int numa_lock_proc_show(struct seq_file *m, void *v) +{ + return 0; +} + +static int numa_lock_proc_open(struct inode *inode, struct file *file) +{ + return single_open(file, numa_lock_proc_show, NULL); +} +static const struct proc_ops numa_lock_proc_fops =3D { + .proc_open =3D numa_lock_proc_open, + .proc_read =3D numa_lock_proc_read, + .proc_write =3D numa_lock_proc_write +}; + +static int numalock_proc_init(void) +{ + int index =3D 0; + int i =3D 0; + + numa_lock_proc =3D proc_mkdir(numa_lock_proc_dir, NULL); + if (numa_lock_proc =3D=3D NULL) { + pr_info("%s proc create %s failed\n", __func__, + numa_lock_proc_dir); + return -EINVAL; + } + + numa_lock_enable =3D proc_create_data("dynamic_enable", 0666, + numa_lock_proc, &numa_lock_proc_fops, (void *)(long)index++); + if (!numa_lock_enable) { + pr_info("%s proc_create_data %s failed!\n", __func__, + "dynamic_enable"); + return -ENOMEM; + } + + for (i =3D 0; i < numa_entry_total; i++) + numa_proc_entry[i] =3D NULL; + + numa_proc_entry[0] =3D proc_create_data("osq_lock_depth", 0664, + numa_lock_proc, &numa_lock_proc_fops, (void *)(long)index++); + numa_proc_entry[1] =3D proc_create_data("osq_keep_times", 0664, + numa_lock_proc, &numa_lock_proc_fops, (void *)(long)index++); + numa_proc_entry[2] =3D proc_create_data("osq_node_max", 0664, + numa_lock_proc, &numa_lock_proc_fops, (void *)(long)index++); + numa_proc_entry[3] =3D proc_create_data("numa_osq_lock", 0444, + numa_lock_proc, &numa_lock_proc_fops, (void *)(long)index++); + return 0; +} + +static void numalock_proc_exit(void) +{ + int i =3D 0; + + for (i =3D 0; i < numa_entry_total; i++) { + if (numa_proc_entry[i]) + proc_remove(numa_proc_entry[i]); + } + if (numa_lock_enable) + proc_remove(numa_lock_enable); + if (numa_lock_proc) + remove_proc_entry(numa_lock_proc_dir, NULL); + +} + 
+static int numalock_shutdown_notify(struct notifier_block *unused1, + unsigned long unused2, void *unused3) +{ + if (READ_ONCE(enable_zx_numa_osq_lock) =3D=3D 1) { + WRITE_ONCE(dynamic_enable, 0); + WRITE_ONCE(enable_zx_numa_osq_lock, 0xf); + } + return NOTIFY_DONE; +} +static struct notifier_block numalock_shutdown_nb =3D { + .notifier_call =3D numalock_shutdown_notify, +}; +static int __init zx_numa_base_init(void) +{ + int cpu =3D num_possible_cpus(); + int i =3D 0; + + WRITE_ONCE(enable_zx_numa_osq_lock, 0); + if (kvm_para_available()) + return 0; + if (cpu >=3D 65534 || cpu < 16 || (cpu & 0x7) !=3D 0) + return 0; + + for (i =3D 0; i < ARRAY_SIZE(numa_cpu_list); i++) { + if (boot_cpu_data.x86_vendor =3D=3D numa_cpu_list[i].x86_vendor && + boot_cpu_data.x86 =3D=3D numa_cpu_list[i].x86 && + boot_cpu_data.x86_model =3D=3D numa_cpu_list[i].x86_model) { + + if (numa_cpu_list[i].feature1 =3D=3D 1) + numa_clusters =3D nr_node_ids + nr_node_ids; + numa_shift =3D zx_get_numa_shift(num_possible_cpus(), + numa_clusters); + + if (zx_numa_lock_init(numa_clusters) =3D=3D 0) + return -ENOMEM; + register_reboot_notifier(&numalock_shutdown_nb); + numalock_proc_init(); + INIT_DELAYED_WORK(&zx_numa_cleanup_work, + zx_numa_cleanup); + prefetchw(&enable_zx_numa_osq_lock); + WRITE_ONCE(enable_zx_numa_osq_lock, 1); + return 0; + } + } + return 0; +} + +static void __exit zx_numa_lock_exit(void) +{ + numalock_proc_exit(); + prefetchw(&dynamic_enable); + WRITE_ONCE(dynamic_enable, 0); +} + +late_initcall(zx_numa_base_init); +module_exit(zx_numa_lock_exit); +MODULE_AUTHOR("LiYong "); +MODULE_DESCRIPTION("zx dynamic numa-aware osq lock"); +MODULE_LICENSE("GPL"); + diff --git a/kernel/locking/zx_numa_osq.c b/kernel/locking/zx_numa_osq.c new file mode 100644 index 000000000000..9fc329f33c36 --- /dev/null +++ b/kernel/locking/zx_numa_osq.c @@ -0,0 +1,497 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Dynamic numa-aware osq lock + * Author: LiYong + * + */ +#include +#include +#include +#include +#include +#include +#include "numa.h" +#include "numa_osq.h" + +int osq_node_max =3D 256; + + +static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_cpu_= node); + +/* + * We use the value 0 to represent "no CPU", thus the encoded value + * will be the CPU number incremented by 1. + */ +static inline int decode(int cpu_nr) +{ + return cpu_nr - 1; +} + +static inline struct optimistic_spin_node *decode_curr(int encoded_cpu_val) +{ + int cpu_nr =3D decode(encoded_cpu_val); + + return per_cpu_ptr(&osq_cpu_node, cpu_nr); +} +/* + * Exchange the tail to get the last one in the queue. + * If the high 32 bits of tail is not equal to the osq lock addr, or + * the low 32 bits is NUMA_LOCKED_VAL, the queue is in LOCKED state. + */ +static int atomic64_cmpxchg_notequal(void *qslock, atomic_t *tail, int cur= r) +{ + u64 ss =3D 0; + u32 addr =3D ptrmask(qslock); + u64 addrcurr =3D (((u64)addr) << 32) | curr; + + while (1) { + ss =3D atomic64_read((atomic64_t *) tail); + if ((ss >> 32) !=3D addr) + return NUMA_LOCKED_VAL; + if ((ss & LOW32MASK) =3D=3D NUMA_LOCKED_VAL) + return NUMA_LOCKED_VAL; + if (atomic64_cmpxchg((atomic64_t *) tail, ss, addrcurr) =3D=3D ss) + return ss & LOW32MASK; + cpu_relax(); + } +} +/* + * Get numa-aware lock memory and set the osq to STOPPING state. + * Look for the zx_numa_entry array, get a free one and set each node's ad= dr + * to the address of the osq lock, then set the struct optimistic_spin_que= ue's + * index to s. 
+ * + * From the index, lock()/unlock() gets the position of struct _numa_lock, + * atomic check the struct _numa_lock's addr with the struct optimistic_s= pin_queue + * lock's address, equal means the link is valid. if the addr is not equal, + * the link is broken by the numa_cleanup workqueue since idle. + */ +void zx_osq_lock_stopping(struct optimistic_spin_queue *lock) +{ + int s =3D 0; + + s =3D zx_numa_lock_ptr_get(lock); + if (s < zx_numa_lock_total) { + numa_lock_init_data(zx_numa_entry[s].numa_ptr, + numa_clusters, NUMA_UNLOCKED_VAL, + ptrmask(lock)); + + WRITE_ONCE(lock->index, s); + zx_numa_entry[s].type =3D 1; + smp_mb();/*should set these before enable*/ + prefetchw(&lock->numa_enable); + WRITE_ONCE(lock->numa_enable, OSQLOCKSTOPPING); + } else { + prefetchw(&lock->numa_enable); + WRITE_ONCE(lock->numa_enable, OSQLOCKINITED); + } +} +/* + * Set numa_enable to NUMALOCKDYNAMIC to start by numa-aware lock, + * then set _numa_lock->initlock to TURNTONUMAREADY, breaks the + * loop of zx_osq_turn_numa_waiting. + */ +void zx_osq_numa_start(struct optimistic_spin_queue *lock) +{ + struct _numa_lock *_numa_lock =3D get_numa_lock(lock->index); + + prefetchw(&lock->numa_enable); + WRITE_ONCE(lock->numa_enable, NUMALOCKDYNAMIC); + smp_mb(); /*should keep lock->numa_enable modified first*/ + atomic_set(&_numa_lock->initlock, TURNTONUMAREADY); +} + +/* + * Waiting numa-aware lock ready + */ +void zx_osq_turn_numa_waiting(struct optimistic_spin_queue *lock) +{ + struct _numa_lock *_numa_lock =3D get_numa_lock(lock->index); + + atomic_inc(&_numa_lock->pending); + while (1) { + int s =3D atomic_read(&_numa_lock->initlock); + + if (s =3D=3D TURNTONUMAREADY) + break; + cpu_relax(); + + } + atomic_dec(&_numa_lock->pending); +} + +static struct optimistic_spin_node * +zx_numa_osq_wait_next(struct _numa_lock *lock, + struct optimistic_spin_node *node, + struct optimistic_spin_node *prev, int cpu) +{ + struct optimistic_spin_node *next =3D NULL; + int curr =3D encode_cpu(cpu); + int old; + + old =3D prev ? prev->cpu : OSQ_UNLOCKED_VAL; + for (;;) { + if (atomic_read(&lock->tail) =3D=3D curr && + atomic_cmpxchg_acquire(&lock->tail, curr, old) =3D=3D curr) { + + break; + } + if (node->next) { + next =3D xchg(&node->next, NULL); + if (next) + break; + } + cpu_relax(); + } + return next; +} +/* + * If the numa_lock tail is in LOCKED state or the high 32 bits addr + * is not equal to the osq lock, the numa-aware lock is stopped, and + * break the loop to osq_lock() + * if not, exchange the tail to enqueue. 
+ */ +static void zx_numa_turn_osq_waiting(struct optimistic_spin_queue *lock, + struct _numa_lock *_numa_lock) +{ + struct _numa_lock *numa_lock =3D _numa_lock + _numa_lock->numa_nodes; + int lockaddr =3D ptrmask(lock); + u64 s =3D 0; + struct optimistic_spin_queue tail; + + tail.numa_enable =3D NUMALOCKDYNAMIC; + tail.index =3D lock->index; + tail.tail16 =3D OSQ_LOCKED_VAL; + while (1) { + cpu_relax(); + s =3D atomic64_read((atomic64_t *) &numa_lock->tail); + if ((s >> 32) !=3D lockaddr) + break; + if ((s & LOW32MASK) =3D=3D NUMA_LOCKED_VAL) + break; + } + prefetchw(&lock->tail); + if (atomic_cmpxchg(&lock->tail, tail.val, OSQ_UNLOCKED_VAL) + =3D=3D tail.val) { + ; + } + +} + +static int _zx_node_osq_lock_internal(struct optimistic_spin_queue *qslock, + struct optimistic_spin_node *node, struct optimistic_spin_node *prev, + struct _numa_lock *node_lock, int cpu, int *cur_status) +{ + struct optimistic_spin_node *next =3D NULL; + + for (;;) { + struct optimistic_spin_node *node_prev =3D NULL; + + /* + * cpu_relax() below implies a compiler barrier which would + * prevent this comparison being optimized away. + */ + if (data_race(prev->next) =3D=3D node && + cmpxchg(&prev->next, node, NULL) =3D=3D node) { + break; + } + /*load locked first each time*/ + *cur_status =3D smp_load_acquire(&node->locked); + + if (*cur_status !=3D NODE_WAIT) + return 0; //goto NODE_UNLOCK; + + cpu_relax(); + /* + * Or we race against a concurrent unqueue()'s step-B, in which + * case its step-C will write us a new @node->prev pointer. + */ + node_prev =3D READ_ONCE(node->prev); + if (node_prev !=3D prev) + prev =3D node_prev; + } + + /* + * Step - B -- stabilize @next + * + * Similar to unlock(), wait for @node->next or move @lock from @node + * back to @prev. + */ + next =3D zx_numa_osq_wait_next(node_lock, node, prev, cpu); + if (!next) + return -1; + + WRITE_ONCE(next->prev, prev); + WRITE_ONCE(prev->next, next); + + return -1; +} + +static int _zx_node_osq_lock(struct optimistic_spin_queue *qslock, + struct _numa_lock *_numa_lock) +{ + struct optimistic_spin_node *node =3D this_cpu_ptr(&osq_cpu_node); + struct optimistic_spin_node *prev =3D NULL; + int cpu =3D smp_processor_id(); + int curr =3D encode_cpu(cpu); + int numa =3D cpu >> _numa_lock->shift; + struct _numa_lock *node_lock =3D _numa_lock + numa; + int cur_status =3D 0; + int old =3D 0; + + node->locked =3D NODE_WAIT; + node->next =3D NULL; + node->cpu =3D curr; + + old =3D atomic64_cmpxchg_notequal(qslock, &node_lock->tail, curr); + + if (old =3D=3D NUMA_LOCKED_VAL) { + bool s =3D true; + + zx_numa_turn_osq_waiting(qslock, _numa_lock); + s =3D osq_lock(qslock); + if (s =3D=3D true) + return 1; + else + return -1; + } + + if (old =3D=3D 0) { + node->locked =3D COHORT_START; + return ACQUIRE_NUMALOCK; + } + + prev =3D decode_curr(old); + node->prev =3D prev; + + smp_mb(); /* make sure node set before set pre->next */ + + WRITE_ONCE(prev->next, node); + /* + * Normally @prev is untouchable after the above store; because at that + * moment unlock can proceed and wipe the node element from stack. + * + * However, since our nodes are static per-cpu storage, we're + * guaranteed their existence -- this allows us to apply + * cmpxchg in an attempt to undo our queueing. + */ + + /* + * Wait to acquire the lock or cancellation. Note that need_resched() + * will come with an IPI, which will wake smp_cond_load_relaxed() if it + * is implemented with a monitor-wait. vcpu_is_preempted() relies on + * polling, be careful. 
+ */ + while ((cur_status =3D READ_ONCE(node->locked)) =3D=3D NODE_WAIT) { + if (need_resched() || vcpu_is_preempted(node_cpu(node->prev))) { + int ddd =3D _zx_node_osq_lock_internal(qslock, node, prev, + node_lock, cpu, &cur_status); + + if (cur_status !=3D NODE_WAIT) + goto NODE_UNLOCK; + if (ddd =3D=3D -1) + return -1; + } + cpu_relax(); + } +NODE_UNLOCK: + if (cur_status =3D=3D ACQUIRE_NUMALOCK) + node->locked =3D COHORT_START; + return cur_status; +} +static int _zx_numa_osq_lock(struct optimistic_spin_queue *qslock, int cpu, + struct _numa_lock *_numa_lock) +{ + int numacpu =3D cpu >> _numa_lock->shift; + int numacurr =3D encode_cpu(numacpu); + + struct optimistic_spin_node *node =3D &(_numa_lock + numacpu)->osq_node; + struct _numa_lock *numa_lock =3D _numa_lock + _numa_lock->numa_nodes; + struct optimistic_spin_node *prevnode =3D NULL; + int prev =3D 0; + + node->next =3D NULL; + node->locked =3D LOCK_NUMALOCK; + node->cpu =3D numacurr; + + prev =3D atomic_xchg(&numa_lock->tail, numacurr); + if (prev =3D=3D 0) { + node->locked =3D UNLOCK_NUMALOCK; + return 0; + } + + prevnode =3D &(_numa_lock + prev - 1)->osq_node; + node->prev =3D prevnode; + smp_mb(); /*node->prev should be set before next*/ + WRITE_ONCE(prevnode->next, node); + + while (READ_ONCE(node->locked) =3D=3D LOCK_NUMALOCK) + cpu_relax(); + + return 0; +} +/* + * Two level osq_lock + * Check current node's tail then check the numa tail, the queue goes to + * run or wait according the two tail's state. + */ + +inline bool zx_numa_osq_lock(struct optimistic_spin_queue *qslock, + struct _numa_lock *_numa_lock) +{ + struct _numa_lock *node_lock =3D NULL; + int cpu =3D smp_processor_id(); + int numa =3D cpu >> _numa_lock->shift; + int status =3D 0; + + node_lock =3D _numa_lock + numa; + + if (node_lock->stopping) { + zx_numa_turn_osq_waiting(qslock, _numa_lock); + return osq_lock(qslock); + } + + status =3D _zx_node_osq_lock(qslock, _numa_lock); + if (status =3D=3D ACQUIRE_NUMALOCK) + status =3D _zx_numa_osq_lock(qslock, smp_processor_id(), + _numa_lock); + + if (status =3D=3D -1) + return false; + return true; +} +/* + * Check the end of the queue on current node. + * Keep the addr of the node_lock when set the queue to UNLOCKED. + * If the node is stopping, the node_lock will be set LOCKED. + */ +static int atomic64_checktail_osq(struct optimistic_spin_queue *qslock, + struct _numa_lock *node_lock, int ctail) +{ + u64 addr =3D ((u64)ptrmask(qslock)) << 32; + u64 addrtail =3D addr | ctail; + u64 ss =3D 0; + bool mark; + + ss =3D atomic64_read((atomic64_t *) &node_lock->tail); + if (node_lock->stopping =3D=3D 0) + mark =3D (ss =3D=3D addrtail && + atomic64_cmpxchg_acquire( + (atomic64_t *) &node_lock->tail, + addrtail, addr|NUMA_UNLOCKED_VAL) =3D=3D addrtail); + else + mark =3D (ss =3D=3D addrtail && + atomic64_cmpxchg_acquire( + (atomic64_t *) &node_lock->tail, + addrtail, NUMA_LOCKED_VAL) =3D=3D addrtail); + return mark; +} + +/* + * Set current node's tail to ACQUIRE_NUMALOCK to check the numa tail busy= or + * UNLOCKED + */ +static void node_lock_release(struct optimistic_spin_queue *qslock, + struct _numa_lock *node_lock, struct optimistic_spin_node *node, + int val, int cpu, int numa_end) +{ + struct optimistic_spin_node *next =3D NULL; + int curr =3D encode_cpu(cpu); + + while (1) { + if (atomic64_checktail_osq(qslock, node_lock, curr)) { + if (qslock->numa_enable =3D=3D NUMALOCKDYNAMIC) { + int index =3D qslock->index; + //starts numa_lock idle checking in cleanup workqueue. 
+ if (numa_end =3D=3D OSQ_UNLOCKED_VAL && + zx_numa_entry[index].idle =3D=3D 0) { + cmpxchg(&zx_numa_entry[index].idle, + 0, 1); + } + } + return; + } + if (node->next) { + next =3D xchg(&node->next, NULL); + if (next) { + WRITE_ONCE(next->locked, val); + return; + } + } + cpu_relax(); + } +} +/* + * Unlocks the queue waiting on the next NUMA node + */ +static int numa_lock_release(struct optimistic_spin_queue *qslock, + struct _numa_lock *numa_lock, + struct optimistic_spin_node *node, int cpu) +{ + struct optimistic_spin_node *next =3D NULL; + int curr =3D cpu >> numa_lock->shift; + int numacurr =3D encode_cpu(curr); + + while (1) { + if (atomic_read(&numa_lock->tail) =3D=3D numacurr && + atomic_cmpxchg_acquire(&numa_lock->tail, numacurr, + OSQ_UNLOCKED_VAL) =3D=3D numacurr) { + return OSQ_UNLOCKED_VAL; + } + + if (node->next) { + next =3D xchg(&node->next, NULL); + if (next) { + WRITE_ONCE(next->locked, UNLOCK_NUMALOCK); + return 1; + } + } + cpu_relax(); + } +} +/* + * Two level osq_unlock. + */ +inline void zx_numa_osq_unlock(struct optimistic_spin_queue *qslock, + struct _numa_lock *_numa_lock) +{ + u32 cpu =3D smp_processor_id(); + struct optimistic_spin_node *node =3D this_cpu_ptr(&osq_cpu_node); + int numa =3D cpu >> _numa_lock->shift; + struct _numa_lock *numa_lock =3D _numa_lock + _numa_lock->numa_nodes; + struct _numa_lock *node_lock =3D _numa_lock + numa; + struct optimistic_spin_node *numa_node =3D + &(_numa_lock + numa)->osq_node; + struct optimistic_spin_node *next =3D NULL; + int cur_count =3D 0; + int numa_end =3D 0; + + cur_count =3D READ_ONCE(node->locked); + + /* + * Turns to the queue waiting on the next node if unlocked times more + * than osq_node_max on current node. + */ + if (cur_count >=3D osq_node_max - 1) { + //Unlocks the queue on next node. + numa_end =3D numa_lock_release(qslock, + numa_lock, numa_node, cpu); + node_lock_release(qslock, node_lock, node, + ACQUIRE_NUMALOCK, cpu, numa_end); + return; + } + /* + * Unlocks the next on current node. + */ + next =3D xchg(&node->next, NULL); + if (next) { + WRITE_ONCE(next->locked, cur_count + 1); + return; + } + /* + * The queue on current node reaches end. + */ + numa_end =3D numa_lock_release(qslock, numa_lock, numa_node, cpu); + node_lock_release(qslock, node_lock, node, ACQUIRE_NUMALOCK, + cpu, numa_end); +} --=20 2.34.1