From: Sebastian Andrzej Siewior
To: linux-kernel@vger.kernel.org
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider, Waiman Long, Sebastian Andrzej Siewior
Subject: [PATCH v12 01/21] rcuref: Provide rcuref_is_dead()
Date: Wed, 16 Apr 2025 18:29:01 +0200
Message-ID: <20250416162921.513656-2-bigeasy@linutronix.de>
In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de>
References: <20250416162921.513656-1-bigeasy@linutronix.de>

rcuref_read() returns the number of references that are currently held.
If 0 is returned then it is not safe to assume that the object can be
scheduled for deconstruction because it is marked DEAD. This happens if
the return value of rcuref_put() is ignored and assumptions are made.

If 0 is returned then the counter transitioned from 0 to RCUREF_NOREF.
If rcuref_put() did not return to the caller then the counter did not
yet transition from RCUREF_NOREF to RCUREF_DEAD. This means that there
is still a chance that the counter will transition from RCUREF_NOREF to
0, meaning it is still valid and must not be deconstructed. In this
brief window rcuref_read() will return 0.

Provide rcuref_is_dead() to determine if the counter is marked as
RCUREF_DEAD.

Signed-off-by: Sebastian Andrzej Siewior
---
 include/linux/rcuref.h | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/rcuref.h b/include/linux/rcuref.h
index 6322d8c1c6b42..2fb2af6d98249 100644
--- a/include/linux/rcuref.h
+++ b/include/linux/rcuref.h
@@ -30,7 +30,11 @@ static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
  * rcuref_read - Read the number of held reference counts of a rcuref
  * @ref:	Pointer to the reference count
  *
- * Return: The number of held references (0 ... N)
+ * Return: The number of held references (0 ... N). The value 0 does not
+ * indicate that it is safe to schedule the object, protected by this reference
+ * counter, for deconstruction.
+ * If you want to know if the reference counter has been marked DEAD (as
+ * signaled by rcuref_put()) please use rcuref_is_dead().
  */
 static inline unsigned int rcuref_read(rcuref_t *ref)
 {
@@ -40,6 +44,22 @@ static inline unsigned int rcuref_read(rcuref_t *ref)
 	return c >= RCUREF_RELEASED ? 0 : c + 1;
 }
 
+/**
+ * rcuref_is_dead -	Check if the rcuref has been already marked dead
+ * @ref:		Pointer to the reference count
+ *
+ * Return: True if the object has been marked DEAD. This signals that a previous
+ * invocation of rcuref_put() returned true on this reference counter meaning
+ * the protected object can safely be scheduled for deconstruction.
+ * Otherwise, returns false.
+ */
+static inline bool rcuref_is_dead(rcuref_t *ref)
+{
+	unsigned int c = atomic_read(&ref->refcnt);
+
+	return (c >= RCUREF_RELEASED) && (c < RCUREF_NOREF);
+}
+
 extern __must_check bool rcuref_get_slowpath(rcuref_t *ref);
 
 /**
-- 
2.49.0

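The window described in the rcuref_is_dead() patch can be illustrated with a small user-space model. This is only a sketch: the MODEL_* values are assumed to mirror the kernel's RCUREF_* zone markers (they are not part of the patch), and model_read(), model_is_dead() and model_selfcheck() are hypothetical helpers that work on a plain integer instead of an atomic_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel's RCUREF_* zone markers; not taken from the patch. */
#define MODEL_ONEREF	0x00000000U	/* counter value for exactly one reference */
#define MODEL_RELEASED	0xC0000000U	/* start of the released/dead zone */
#define MODEL_DEAD	0xE0000000U	/* final put completed, object marked dead */
#define MODEL_NOREF	0xFFFFFFFFU	/* last reference dropped, not yet marked dead */

/* Same arithmetic as rcuref_read() in the patch, on a plain integer. */
static unsigned int model_read(uint32_t c)
{
	return c >= MODEL_RELEASED ? 0 : c + 1;
}

/* Same range check as rcuref_is_dead() in the patch. */
static bool model_is_dead(uint32_t c)
{
	return (c >= MODEL_RELEASED) && (c < MODEL_NOREF);
}

/* Walks the three states that matter for the commit message. */
void model_selfcheck(void)
{
	/* One live reference: read reports 1, not dead. */
	assert(model_read(MODEL_ONEREF) == 1);
	assert(!model_is_dead(MODEL_ONEREF));

	/* The brief window: read already reports 0, but not dead yet. */
	assert(model_read(MODEL_NOREF) == 0);
	assert(!model_is_dead(MODEL_NOREF));

	/* After the final put finished: read reports 0 and dead is true. */
	assert(model_read(MODEL_DEAD) == 0);
	assert(model_is_dead(MODEL_DEAD));
}
```

Under these assumptions, a read value of 0 covers both the NOREF and the DEAD state, which is why the patch adds a separate predicate instead of overloading rcuref_read().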
From: Sebastian Andrzej Siewior
To: linux-kernel@vger.kernel.org
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider, Waiman Long, Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm@kvack.org, Sebastian Andrzej Siewior
Subject: [PATCH v12 02/21] mm: Add vmalloc_huge_node()
Date: Wed, 16 Apr 2025 18:29:02 +0200
Message-ID: <20250416162921.513656-3-bigeasy@linutronix.de>
In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de>
References: <20250416162921.513656-1-bigeasy@linutronix.de>

From: Peter Zijlstra

Add vmalloc_huge_node() to enable node-specific hash tables that use
huge pages if possible.

[bigeasy: use __vmalloc_node_range_noprof(), add nommu bits, inline
 vmalloc_huge]

Cc: Andrew Morton
Cc: Uladzislau Rezki
Cc: Christoph Hellwig
Cc: linux-mm@kvack.org
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Christoph Hellwig
Signed-off-by: Sebastian Andrzej Siewior
---
 include/linux/vmalloc.h |  9 +++++++--
 mm/nommu.c              | 18 +++++++++++++++++-
 mm/vmalloc.c            | 11 ++++++-----
 3 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 31e9ffd936e39..de95794777ad6 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -168,8 +168,13 @@ void *__vmalloc_node_noprof(unsigned long size, unsigned long align, gfp_t gfp_m
 		int node, const void *caller) __alloc_size(1);
 #define __vmalloc_node(...)	alloc_hooks(__vmalloc_node_noprof(__VA_ARGS__))
 
-void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
-#define vmalloc_huge(...)	alloc_hooks(vmalloc_huge_noprof(__VA_ARGS__))
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node) __alloc_size(1);
+#define vmalloc_huge_node(...)	alloc_hooks(vmalloc_huge_node_noprof(__VA_ARGS__))
+
+static inline void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
+{
+	return vmalloc_huge_node(size, gfp_mask, NUMA_NO_NODE);
+}
 
 extern void *__vmalloc_array_noprof(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
 #define __vmalloc_array(...)	alloc_hooks(__vmalloc_array_noprof(__VA_ARGS__))
diff --git a/mm/nommu.c b/mm/nommu.c
index 617e7ba8022f5..70f92f9a7fab3 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -200,7 +200,23 @@ void *vmalloc_noprof(unsigned long size)
 }
 EXPORT_SYMBOL(vmalloc_noprof);
 
-void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __weak __alias(__vmalloc_noprof);
+/*
+ * vmalloc_huge_node - allocate virtually contiguous memory, on a node
+ *
+ * @size:	allocation size
+ * @gfp_mask:	flags for the page level allocator
+ * @node:	node to use for allocation or NUMA_NO_NODE
+ *
+ * Allocate enough pages to cover @size from the page level
+ * allocator and map them into contiguous kernel virtual space.
+ *
+ * Due to NOMMU implications the node argument and HUGE page attribute is
+ * ignored.
+ */
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
+{
+	return __vmalloc_noprof(size, gfp_mask);
+}
 
 /*
  * vzalloc - allocate virtually contiguous memory with zero fill
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3ed720a787ecd..8b9f6d3c099dd 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3943,9 +3943,10 @@ void *vmalloc_noprof(unsigned long size)
 EXPORT_SYMBOL(vmalloc_noprof);
 
 /**
- * vmalloc_huge - allocate virtually contiguous memory, allow huge pages
+ * vmalloc_huge_node - allocate virtually contiguous memory, allow huge pages
  * @size:	allocation size
  * @gfp_mask:	flags for the page level allocator
+ * @node:	node to use for allocation or NUMA_NO_NODE
  *
  * Allocate enough pages to cover @size from the page level
  * allocator and map them into contiguous kernel virtual space.
@@ -3954,13 +3955,13 @@ EXPORT_SYMBOL(vmalloc_noprof);
  *
  * Return: pointer to the allocated memory or %NULL on error
  */
-void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask)
+void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
 {
 	return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
-				    gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
-				    NUMA_NO_NODE, __builtin_return_address(0));
+					   gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+					   node, __builtin_return_address(0));
 }
-EXPORT_SYMBOL_GPL(vmalloc_huge_noprof);
+EXPORT_SYMBOL_GPL(vmalloc_huge_node_noprof);
 
 /**
  * vzalloc - allocate virtually contiguous memory with zero fill
-- 
2.49.0

From: Sebastian Andrzej Siewior
To: linux-kernel@vger.kernel.org
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider, Waiman Long, Sebastian Andrzej Siewior
Subject: [PATCH v12 03/21] futex: Move futex_queue() into futex_wait_setup()
Date: Wed, 16 Apr 2025 18:29:03 +0200
Message-ID: <20250416162921.513656-4-bigeasy@linutronix.de>
In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de>
References: <20250416162921.513656-1-bigeasy@linutronix.de>

From: Peter Zijlstra

futex_wait_setup() has a weird calling convention in order to return
hb to use as an argument to futex_queue().

This exists mostly so that requeue can have an extra test in between.

Reorder code a little to get rid of this and keep the hb usage inside
futex_wait_setup().

[bigeasy: fixes]

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Sebastian Andrzej Siewior
---
 io_uring/futex.c        |  4 +---
 kernel/futex/futex.h    |  6 +++---
 kernel/futex/requeue.c  | 28 ++++++++++--------------
 kernel/futex/waitwake.c | 47 +++++++++++++++++++++++------------------
 4 files changed, 42 insertions(+), 43 deletions(-)

diff --git a/io_uring/futex.c b/io_uring/futex.c
index 0ea4820cd8ff8..e89c0897117ae 100644
--- a/io_uring/futex.c
+++ b/io_uring/futex.c
@@ -273,7 +273,6 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_futex *iof = io_kiocb_to_cmd(req, struct io_futex);
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_futex_data *ifd = NULL;
-	struct futex_hash_bucket *hb;
 	int ret;
 
 	if (!iof->futex_mask) {
@@ -295,12 +294,11 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
 	ifd->req = req;
 
 	ret = futex_wait_setup(iof->uaddr, iof->futex_val, iof->futex_flags,
-			       &ifd->q, &hb);
+			       &ifd->q, NULL, NULL);
 	if (!ret) {
 		hlist_add_head(&req->hash_node, &ctx->futex_list);
 		io_ring_submit_unlock(ctx, issue_flags);
 
-		futex_queue(&ifd->q, hb, NULL);
 		return IOU_ISSUE_SKIP_COMPLETE;
 	}
 
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 6b2f4c7eb720f..16aafd0113442 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -219,9 +219,9 @@ static inline int futex_match(union futex_key *key1, union futex_key *key2)
 }
 
 extern int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
-			    struct futex_q *q, struct futex_hash_bucket **hb);
-extern void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
-			     struct hrtimer_sleeper *timeout);
+			    struct futex_q *q, union futex_key *key2,
+			    struct task_struct *task);
+extern void futex_do_wait(struct futex_q *q, struct hrtimer_sleeper *timeout);
 extern bool __futex_wake_mark(struct futex_q *q);
 extern void futex_wake_mark(struct wake_q_head *wake_q, struct futex_q *q);
 
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index b47bb764b3520..0e55975af515c 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -769,7 +769,6 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 {
 	struct hrtimer_sleeper timeout, *to;
 	struct rt_mutex_waiter rt_waiter;
-	struct futex_hash_bucket *hb;
 	union futex_key key2 = FUTEX_KEY_INIT;
 	struct futex_q q = futex_q_init;
 	struct rt_mutex_base *pi_mutex;
@@ -805,29 +804,24 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	 * Prepare to wait on uaddr. On success, it holds hb->lock and q
 	 * is initialized.
 	 */
-	ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
+	ret = futex_wait_setup(uaddr, val, flags, &q, &key2, current);
 	if (ret)
 		goto out;
 
-	/*
-	 * The check above which compares uaddrs is not sufficient for
-	 * shared futexes. We need to compare the keys:
-	 */
-	if (futex_match(&q.key, &key2)) {
-		futex_q_unlock(hb);
-		ret = -EINVAL;
-		goto out;
-	}
-
 	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
-	futex_wait_queue(hb, &q, to);
+	futex_do_wait(&q, to);
 
 	switch (futex_requeue_pi_wakeup_sync(&q)) {
 	case Q_REQUEUE_PI_IGNORE:
-		/* The waiter is still on uaddr1 */
-		spin_lock(&hb->lock);
-		ret = handle_early_requeue_pi_wakeup(hb, &q, to);
-		spin_unlock(&hb->lock);
+		{
+			struct futex_hash_bucket *hb;
+
+			hb = futex_hash(&q.key);
+			/* The waiter is still on uaddr1 */
+			spin_lock(&hb->lock);
+			ret = handle_early_requeue_pi_wakeup(hb, &q, to);
+			spin_unlock(&hb->lock);
+		}
 		break;
 
 	case Q_REQUEUE_PI_LOCKED:
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 25877d4f2f8f3..6cf10701294b4 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -339,18 +339,8 @@ static long futex_wait_restart(struct restart_block *restart);
  * @q:		the futex_q to queue up on
  * @timeout:	the prepared hrtimer_sleeper, or null for no timeout
  */
-void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
-			    struct hrtimer_sleeper *timeout)
+void futex_do_wait(struct futex_q *q, struct hrtimer_sleeper *timeout)
 {
-	/*
-	 * The task state is guaranteed to be set before another task can
-	 * wake it. set_current_state() is implemented using smp_store_mb() and
-	 * futex_queue() calls spin_unlock() upon completion, both serializing
-	 * access to the hash list and forcing another memory barrier.
-	 */
-	set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
-	futex_queue(q, hb, current);
-
 	/* Arm the timer */
 	if (timeout)
 		hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
@@ -578,7 +568,8 @@ int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
  * @val:	the expected value
  * @flags:	futex flags (FLAGS_SHARED, etc.)
  * @q:		the associated futex_q
- * @hb:		storage for hash_bucket pointer to be returned to caller
+ * @key2:	the second futex_key if used for requeue PI
+ * @task:	Task queueing this futex
  *
  * Setup the futex_q and locate the hash_bucket. Get the futex value and
  * compare it with the expected value. Handle atomic faults internally.
@@ -589,8 +580,10 @@ int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
  *  - <1 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
  */
 int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
-		     struct futex_q *q, struct futex_hash_bucket **hb)
+		     struct futex_q *q, union futex_key *key2,
+		     struct task_struct *task)
 {
+	struct futex_hash_bucket *hb;
 	u32 uval;
 	int ret;
 
@@ -618,12 +611,12 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 		return ret;
 
 retry_private:
-	*hb = futex_q_lock(q);
+	hb = futex_q_lock(q);
 
 	ret = futex_get_value_locked(&uval, uaddr);
 
 	if (ret) {
-		futex_q_unlock(*hb);
+		futex_q_unlock(hb);
 
 		ret = get_user(uval, uaddr);
 		if (ret)
@@ -636,10 +629,25 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 	}
 
 	if (uval != val) {
-		futex_q_unlock(*hb);
-		ret = -EWOULDBLOCK;
+		futex_q_unlock(hb);
+		return -EWOULDBLOCK;
 	}
 
+	if (key2 && futex_match(&q->key, key2)) {
+		futex_q_unlock(hb);
+		return -EINVAL;
+	}
+
+	/*
+	 * The task state is guaranteed to be set before another task can
+	 * wake it. set_current_state() is implemented using smp_store_mb() and
+	 * futex_queue() calls spin_unlock() upon completion, both serializing
+	 * access to the hash list and forcing another memory barrier.
+	 */
+	if (task == current)
+		set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+	futex_queue(q, hb, task);
+
 	return ret;
 }
 
@@ -647,7 +655,6 @@ int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 		 struct hrtimer_sleeper *to, u32 bitset)
 {
 	struct futex_q q = futex_q_init;
-	struct futex_hash_bucket *hb;
 	int ret;
 
 	if (!bitset)
@@ -660,12 +667,12 @@ int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 	 * Prepare to wait on uaddr. On success, it holds hb->lock and q
 	 * is initialized.
 	 */
-	ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
+	ret = futex_wait_setup(uaddr, val, flags, &q, NULL, current);
 	if (ret)
 		return ret;
 
 	/* futex_queue and wait for wakeup, timeout, or a signal. */
-	futex_wait_queue(hb, &q, to);
+	futex_do_wait(&q, to);
 
 	/* If we were woken (and unqueued), we succeeded, whatever. */
 	if (!futex_unqueue(&q))
-- 
2.49.0

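The reordered futex_wait_setup() from patch 03/21 can be sketched as a user-space toy model. Everything here is hypothetical: struct model_q, struct model_task, model_current and model_wait_setup() are stand-ins invented for illustration, there is no locking, and the key is simply the address of the futex word. The sketch only shows the control flow the patch moves into one function: value check, optional key2 check, conditional task-state update, then queueing.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct futex_q; not a kernel type. */
struct model_q {
	uintptr_t key;		/* stands in for union futex_key */
	bool queued;		/* set once the waiter is on the hash bucket */
	const void *task;	/* task recorded at queue time, may be NULL */
};

/* Hypothetical stand-in for the task state touched by set_current_state(). */
struct model_task {
	bool interruptible;
};

/* Plays the role of "current" in this model. */
static struct model_task model_current;

/*
 * Models the new calling convention: the bucket never leaves the function,
 * so callers pass key2 and task instead of receiving an hb pointer.
 */
int model_wait_setup(const uint32_t *uaddr, uint32_t val, struct model_q *q,
		     const uintptr_t *key2, struct model_task *task)
{
	q->key = (uintptr_t)uaddr;

	/* The futex word no longer holds the expected value. */
	if (*uaddr != val)
		return -EWOULDBLOCK;

	/* Only the requeue-PI caller passes key2. */
	if (key2 && q->key == *key2)
		return -EINVAL;

	/* A NULL task (as io_uring passes) leaves the task state alone. */
	if (task == &model_current)
		task->interruptible = true;

	q->task = task;
	q->queued = true;
	return 0;
}
```

In this model a caller that passes NULL for both key2 and task is queued without any task-state change, which corresponds to the io_futex_wait() call site in the patch.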
From: Sebastian Andrzej Siewior
To: linux-kernel@vger.kernel.org
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider, Waiman Long, Sebastian Andrzej Siewior
Subject: [PATCH v12 04/21] futex: Pull futex_hash() out of futex_q_lock()
Date: Wed, 16 Apr 2025 18:29:04 +0200
Message-ID: <20250416162921.513656-5-bigeasy@linutronix.de>
In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de>
References: <20250416162921.513656-1-bigeasy@linutronix.de>

From: Peter Zijlstra

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Sebastian Andrzej Siewior
---
 kernel/futex/core.c     | 7 +------
 kernel/futex/futex.h    | 2 +-
 kernel/futex/pi.c       | 3 ++-
 kernel/futex/waitwake.c | 6 ++++--
 4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index cca15859a50be..7adc914878933 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -502,13 +502,9 @@ void __futex_unqueue(struct futex_q *q)
 }
 
 /* The key must be already stored in q->key. */
-struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
+void futex_q_lock(struct futex_q *q, struct futex_hash_bucket *hb)
 	__acquires(&hb->lock)
 {
-	struct futex_hash_bucket *hb;
-
-	hb = futex_hash(&q->key);
-
 	/*
 	 * Increment the counter before taking the lock so that
 	 * a potential waker won't miss a to-be-slept task that is
@@ -522,7 +518,6 @@ struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
 	q->lock_ptr = &hb->lock;
 
 	spin_lock(&hb->lock);
-	return hb;
 }
 
 void futex_q_unlock(struct futex_hash_bucket *hb)
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 16aafd0113442..a219903e52084 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -354,7 +354,7 @@ static inline int futex_hb_waiters_pending(struct futex_hash_bucket *hb)
 #endif
 }
 
-extern struct futex_hash_bucket *futex_q_lock(struct futex_q *q);
+extern void futex_q_lock(struct futex_q *q, struct futex_hash_bucket *hb);
 extern void futex_q_unlock(struct futex_hash_bucket *hb);
 
 
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 7a941845f7eee..3bf942e9400ac 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -939,7 +939,8 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 		goto out;
 
 retry_private:
-	hb = futex_q_lock(&q);
+	hb = futex_hash(&q.key);
+	futex_q_lock(&q, hb);
 
 	ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current,
 				   &exiting, 0);
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 6cf10701294b4..1108f373fd315 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -441,7 +441,8 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
 		struct futex_q *q = &vs[i].q;
 		u32 val = vs[i].w.val;
 
-		hb = futex_q_lock(q);
+		hb = futex_hash(&q->key);
+		futex_q_lock(q, hb);
 		ret = futex_get_value_locked(&uval, uaddr);
 
 		if (!ret && uval == val) {
@@ -611,7 +612,8 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 		return ret;
 
 retry_private:
-	hb = futex_hash(&q->key);
+	hb = futex_hash(&q->key);
+	futex_q_lock(q, hb);
 
 	ret = futex_get_value_locked(&uval, uaddr);
 
-- 
2.49.0

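Patch 04/21 carries no commit message, so the caller-side pattern it introduces (hash first, then lock the bucket that was found) may be easier to see in a toy user-space model. All names below are hypothetical stand-ins: model_hash() plays the role of futex_hash(), model_q_lock() the role of the new two-argument futex_q_lock(), and an integer flag replaces the spinlock. The modulo hash is an assumption made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_BUCKETS 4

/* Hypothetical bucket; the real one holds a spinlock and a waiter count. */
struct model_bucket {
	int waiters;
	int locked;
};

/* Hypothetical waiter; only the fields needed for the sketch. */
struct model_waiter {
	uintptr_t key;
	int *lock_ptr;
};

static struct model_bucket model_table[MODEL_BUCKETS];

/* Stand-in for futex_hash(): map a key to its bucket. */
struct model_bucket *model_hash(uintptr_t key)
{
	return &model_table[key % MODEL_BUCKETS];
}

/* Stand-in for the new futex_q_lock(): the caller supplies the bucket. */
void model_q_lock(struct model_waiter *q, struct model_bucket *hb)
{
	hb->waiters++;			/* announce the waiter before locking */
	q->lock_ptr = &hb->locked;
	hb->locked = 1;			/* models spin_lock(&hb->lock) */
}
```

A caller then does the lookup and the locking as two visible steps, mirroring the changed call sites:

    hb = model_hash(q.key);
    model_q_lock(&q, hb);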
b=Mws/Jq9GryNOun/Diic76hEkOk6oItlks5DHteuUTgSwb6WYKZijkMIy5JTmj621I69u3fUkjETt+Ll/UPXAOVq02GOxwzTDrsDIOhfEljIkFjy9k03a8UEhSXF/AxEdDy9zg3v2W+QDJMr/PeP7Zs8uzEocKbFbsu0Lj0VFYW8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=0Pa6/aTH; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=ATTIcCoA; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="0Pa6/aTH"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="ATTIcCoA" From: Sebastian Andrzej Siewior DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1744820968; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Cen5VSo1OQjjp3ZWjthUTsaTM3ccNr9b55GURHvoC48=; b=0Pa6/aTHJna60mMgFMK8u9RBQf8k1+PP+15AyX/FoGwtiRHpYX4AhctJDEKz8TBW/mnFDO swcEChYYuDCmDFm3i0ZMdUUmwENjjuYAUKA2+ywV1SNkbV0G7RYycvn4062Meooe8odOwr 7cUmQEgp3cLahO2VfAJlAXc5UVYYB+EkpAkQbZTsaD9OU/9wnjwVjbekmowMY6EylszV+I P1/Ykt/0LWHMIS4EjnDsihzV0LFwLTIJd5rJcQ4ah+spN4VM0P07p2S4yIEHUtqADigdWo gBqKiYGTlDiYbr1eCMbBJtiEPD5p8czxz6txld8k82Pexr1JmDiXszNeLEzJWA== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1744820968; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: 
in-reply-to:in-reply-to:references:references; bh=Cen5VSo1OQjjp3ZWjthUTsaTM3ccNr9b55GURHvoC48=; b=ATTIcCoA8zqSpvhJiHCz4jkaT989HQuW497E1Q/Oa5XdU94V6uElGnw0moktdDt0VbSNdV Eo/xOcIT8V9gFzDw== To: linux-kernel@vger.kernel.org Cc: =?UTF-8?q?Andr=C3=A9=20Almeida?= , Darren Hart , Davidlohr Bueso , Ingo Molnar , Juri Lelli , Peter Zijlstra , Thomas Gleixner , Valentin Schneider , Waiman Long , Sebastian Andrzej Siewior Subject: [PATCH v12 05/21] futex: Create hb scopes Date: Wed, 16 Apr 2025 18:29:05 +0200 Message-ID: <20250416162921.513656-6-bigeasy@linutronix.de> In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de> References: <20250416162921.513656-1-bigeasy@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Peter Zijlstra Create explicit scopes for hb variables; almost pure re-indent. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior --- kernel/futex/core.c | 81 ++++---- kernel/futex/pi.c | 282 +++++++++++++------------- kernel/futex/requeue.c | 433 ++++++++++++++++++++-------------------- kernel/futex/waitwake.c | 193 +++++++++--------- 4 files changed, 504 insertions(+), 485 deletions(-) diff --git a/kernel/futex/core.c b/kernel/futex/core.c index 7adc914878933..e4cb5ce9785b1 100644 --- a/kernel/futex/core.c +++ b/kernel/futex/core.c @@ -944,7 +944,6 @@ static void exit_pi_state_list(struct task_struct *curr) { struct list_head *next, *head =3D &curr->pi_state_list; struct futex_pi_state *pi_state; - struct futex_hash_bucket *hb; union futex_key key =3D FUTEX_KEY_INIT; =20 /* @@ -957,50 +956,54 @@ static void exit_pi_state_list(struct task_struct *cu= rr) next =3D head->next; pi_state =3D list_entry(next, struct futex_pi_state, list); key =3D pi_state->key; - hb =3D futex_hash(&key); + if (1) { + struct futex_hash_bucket *hb; =20 - /* - * We can race 
against put_pi_state() removing itself from the - * list (a waiter going away). put_pi_state() will first - * decrement the reference count and then modify the list, so - * its possible to see the list entry but fail this reference - * acquire. - * - * In that case; drop the locks to let put_pi_state() make - * progress and retry the loop. - */ - if (!refcount_inc_not_zero(&pi_state->refcount)) { + hb =3D futex_hash(&key); + + /* + * We can race against put_pi_state() removing itself from the + * list (a waiter going away). put_pi_state() will first + * decrement the reference count and then modify the list, so + * its possible to see the list entry but fail this reference + * acquire. + * + * In that case; drop the locks to let put_pi_state() make + * progress and retry the loop. + */ + if (!refcount_inc_not_zero(&pi_state->refcount)) { + raw_spin_unlock_irq(&curr->pi_lock); + cpu_relax(); + raw_spin_lock_irq(&curr->pi_lock); + continue; + } raw_spin_unlock_irq(&curr->pi_lock); - cpu_relax(); - raw_spin_lock_irq(&curr->pi_lock); - continue; - } - raw_spin_unlock_irq(&curr->pi_lock); =20 - spin_lock(&hb->lock); - raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); - raw_spin_lock(&curr->pi_lock); - /* - * We dropped the pi-lock, so re-check whether this - * task still owns the PI-state: - */ - if (head->next !=3D next) { - /* retain curr->pi_lock for the loop invariant */ - raw_spin_unlock(&pi_state->pi_mutex.wait_lock); + spin_lock(&hb->lock); + raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); + raw_spin_lock(&curr->pi_lock); + /* + * We dropped the pi-lock, so re-check whether this + * task still owns the PI-state: + */ + if (head->next !=3D next) { + /* retain curr->pi_lock for the loop invariant */ + raw_spin_unlock(&pi_state->pi_mutex.wait_lock); + spin_unlock(&hb->lock); + put_pi_state(pi_state); + continue; + } + + WARN_ON(pi_state->owner !=3D curr); + WARN_ON(list_empty(&pi_state->list)); + list_del_init(&pi_state->list); + pi_state->owner =3D NULL; + + 
raw_spin_unlock(&curr->pi_lock); + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); spin_unlock(&hb->lock); - put_pi_state(pi_state); - continue; } =20 - WARN_ON(pi_state->owner !=3D curr); - WARN_ON(list_empty(&pi_state->list)); - list_del_init(&pi_state->list); - pi_state->owner =3D NULL; - - raw_spin_unlock(&curr->pi_lock); - raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); - spin_unlock(&hb->lock); - rt_mutex_futex_unlock(&pi_state->pi_mutex); put_pi_state(pi_state); =20 diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c index 3bf942e9400ac..a56f28fda58dd 100644 --- a/kernel/futex/pi.c +++ b/kernel/futex/pi.c @@ -920,7 +920,6 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags= , ktime_t *time, int tryl struct hrtimer_sleeper timeout, *to; struct task_struct *exiting =3D NULL; struct rt_mutex_waiter rt_waiter; - struct futex_hash_bucket *hb; struct futex_q q =3D futex_q_init; DEFINE_WAKE_Q(wake_q); int res, ret; @@ -939,152 +938,169 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int f= lags, ktime_t *time, int tryl goto out; =20 retry_private: - hb =3D futex_hash(&q.key); - futex_q_lock(&q, hb); + if (1) { + struct futex_hash_bucket *hb; =20 - ret =3D futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, - &exiting, 0); - if (unlikely(ret)) { - /* - * Atomic work succeeded and we got the lock, - * or failed. Either way, we do _not_ block. - */ - switch (ret) { - case 1: - /* We got the lock. */ - ret =3D 0; - goto out_unlock_put_key; - case -EFAULT: - goto uaddr_faulted; - case -EBUSY: - case -EAGAIN: + hb =3D futex_hash(&q.key); + futex_q_lock(&q, hb); + + ret =3D futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, + &exiting, 0); + if (unlikely(ret)) { /* - * Two reasons for this: - * - EBUSY: Task is exiting and we just wait for the - * exit to complete. - * - EAGAIN: The user space value changed. + * Atomic work succeeded and we got the lock, + * or failed. Either way, we do _not_ block. 
*/ - futex_q_unlock(hb); - /* - * Handle the case where the owner is in the middle of - * exiting. Wait for the exit to complete otherwise - * this task might loop forever, aka. live lock. - */ - wait_for_owner_exiting(ret, exiting); - cond_resched(); - goto retry; - default: - goto out_unlock_put_key; + switch (ret) { + case 1: + /* We got the lock. */ + ret =3D 0; + goto out_unlock_put_key; + case -EFAULT: + goto uaddr_faulted; + case -EBUSY: + case -EAGAIN: + /* + * Two reasons for this: + * - EBUSY: Task is exiting and we just wait for the + * exit to complete. + * - EAGAIN: The user space value changed. + */ + futex_q_unlock(hb); + /* + * Handle the case where the owner is in the middle of + * exiting. Wait for the exit to complete otherwise + * this task might loop forever, aka. live lock. + */ + wait_for_owner_exiting(ret, exiting); + cond_resched(); + goto retry; + default: + goto out_unlock_put_key; + } } - } =20 - WARN_ON(!q.pi_state); + WARN_ON(!q.pi_state); =20 - /* - * Only actually queue now that the atomic ops are done: - */ - __futex_queue(&q, hb, current); + /* + * Only actually queue now that the atomic ops are done: + */ + __futex_queue(&q, hb, current); =20 - if (trylock) { - ret =3D rt_mutex_futex_trylock(&q.pi_state->pi_mutex); - /* Fixup the trylock return value: */ - ret =3D ret ? 0 : -EWOULDBLOCK; - goto no_block; - } + if (trylock) { + ret =3D rt_mutex_futex_trylock(&q.pi_state->pi_mutex); + /* Fixup the trylock return value: */ + ret =3D ret ? 0 : -EWOULDBLOCK; + goto no_block; + } =20 - /* - * Must be done before we enqueue the waiter, here is unfortunately - * under the hb lock, but that *should* work because it does nothing. - */ - rt_mutex_pre_schedule(); + /* + * Must be done before we enqueue the waiter, here is unfortunately + * under the hb lock, but that *should* work because it does nothing. 
+ */ + rt_mutex_pre_schedule(); =20 - rt_mutex_init_waiter(&rt_waiter); + rt_mutex_init_waiter(&rt_waiter); =20 - /* - * On PREEMPT_RT, when hb->lock becomes an rt_mutex, we must not - * hold it while doing rt_mutex_start_proxy(), because then it will - * include hb->lock in the blocking chain, even through we'll not in - * fact hold it while blocking. This will lead it to report -EDEADLK - * and BUG when futex_unlock_pi() interleaves with this. - * - * Therefore acquire wait_lock while holding hb->lock, but drop the - * latter before calling __rt_mutex_start_proxy_lock(). This - * interleaves with futex_unlock_pi() -- which does a similar lock - * handoff -- such that the latter can observe the futex_q::pi_state - * before __rt_mutex_start_proxy_lock() is done. - */ - raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock); - spin_unlock(q.lock_ptr); - /* - * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter - * such that futex_unlock_pi() is guaranteed to observe the waiter when - * it sees the futex_q::pi_state. - */ - ret =3D __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, cu= rrent, &wake_q); - raw_spin_unlock_irq_wake(&q.pi_state->pi_mutex.wait_lock, &wake_q); + /* + * On PREEMPT_RT, when hb->lock becomes an rt_mutex, we must not + * hold it while doing rt_mutex_start_proxy(), because then it will + * include hb->lock in the blocking chain, even through we'll not in + * fact hold it while blocking. This will lead it to report -EDEADLK + * and BUG when futex_unlock_pi() interleaves with this. + * + * Therefore acquire wait_lock while holding hb->lock, but drop the + * latter before calling __rt_mutex_start_proxy_lock(). This + * interleaves with futex_unlock_pi() -- which does a similar lock + * handoff -- such that the latter can observe the futex_q::pi_state + * before __rt_mutex_start_proxy_lock() is done. 
+ */ + raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock); + spin_unlock(q.lock_ptr); + /* + * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter + * such that futex_unlock_pi() is guaranteed to observe the waiter when + * it sees the futex_q::pi_state. + */ + ret =3D __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, c= urrent, &wake_q); + raw_spin_unlock_irq_wake(&q.pi_state->pi_mutex.wait_lock, &wake_q); =20 - if (ret) { - if (ret =3D=3D 1) - ret =3D 0; - goto cleanup; - } + if (ret) { + if (ret =3D=3D 1) + ret =3D 0; + goto cleanup; + } =20 - if (unlikely(to)) - hrtimer_sleeper_start_expires(to, HRTIMER_MODE_ABS); + if (unlikely(to)) + hrtimer_sleeper_start_expires(to, HRTIMER_MODE_ABS); =20 - ret =3D rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter); + ret =3D rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter); =20 cleanup: - /* - * If we failed to acquire the lock (deadlock/signal/timeout), we must - * must unwind the above, however we canont lock hb->lock because - * rt_mutex already has a waiter enqueued and hb->lock can itself try - * and enqueue an rt_waiter through rtlock. - * - * Doing the cleanup without holding hb->lock can cause inconsistent - * state between hb and pi_state, but only in the direction of not - * seeing a waiter that is leaving. - * - * See futex_unlock_pi(), it deals with this inconsistency. - * - * There be dragons here, since we must deal with the inconsistency on - * the way out (here), it is impossible to detect/warn about the race - * the other way around (missing an incoming waiter). - * - * What could possibly go wrong... - */ - if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter= )) - ret =3D 0; + /* + * If we failed to acquire the lock (deadlock/signal/timeout), we must + * unwind the above, however we canont lock hb->lock because + * rt_mutex already has a waiter enqueued and hb->lock can itself try + * and enqueue an rt_waiter through rtlock. 
+ * + * Doing the cleanup without holding hb->lock can cause inconsistent + * state between hb and pi_state, but only in the direction of not + * seeing a waiter that is leaving. + * + * See futex_unlock_pi(), it deals with this inconsistency. + * + * There be dragons here, since we must deal with the inconsistency on + * the way out (here), it is impossible to detect/warn about the race + * the other way around (missing an incoming waiter). + * + * What could possibly go wrong... + */ + if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waite= r)) + ret =3D 0; =20 - /* - * Now that the rt_waiter has been dequeued, it is safe to use - * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up - * the - */ - spin_lock(q.lock_ptr); - /* - * Waiter is unqueued. - */ - rt_mutex_post_schedule(); + /* + * Now that the rt_waiter has been dequeued, it is safe to use + * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up + * the + */ + spin_lock(q.lock_ptr); + /* + * Waiter is unqueued. + */ + rt_mutex_post_schedule(); no_block: - /* - * Fixup the pi_state owner and possibly acquire the lock if we - * haven't already. - */ - res =3D fixup_pi_owner(uaddr, &q, !ret); - /* - * If fixup_pi_owner() returned an error, propagate that. If it acquired - * the lock, clear our -ETIMEDOUT or -EINTR. - */ - if (res) - ret =3D (res < 0) ? res : 0; + /* + * Fixup the pi_state owner and possibly acquire the lock if we + * haven't already. + */ + res =3D fixup_pi_owner(uaddr, &q, !ret); + /* + * If fixup_pi_owner() returned an error, propagate that. If it acquired + * the lock, clear our -ETIMEDOUT or -EINTR. + */ + if (res) + ret =3D (res < 0) ? 
res : 0; =20 - futex_unqueue_pi(&q); - spin_unlock(q.lock_ptr); - goto out; + futex_unqueue_pi(&q); + spin_unlock(q.lock_ptr); + goto out; =20 out_unlock_put_key: - futex_q_unlock(hb); + futex_q_unlock(hb); + goto out; + +uaddr_faulted: + futex_q_unlock(hb); + + ret =3D fault_in_user_writeable(uaddr); + if (ret) + goto out; + + if (!(flags & FLAGS_SHARED)) + goto retry_private; + + goto retry; + } =20 out: if (to) { @@ -1092,18 +1108,6 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int fl= ags, ktime_t *time, int tryl destroy_hrtimer_on_stack(&to->timer); } return ret !=3D -EINTR ? ret : -ERESTARTNOINTR; - -uaddr_faulted: - futex_q_unlock(hb); - - ret =3D fault_in_user_writeable(uaddr); - if (ret) - goto out; - - if (!(flags & FLAGS_SHARED)) - goto retry_private; - - goto retry; } =20 /* diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c index 0e55975af515c..209794cad6f2f 100644 --- a/kernel/futex/requeue.c +++ b/kernel/futex/requeue.c @@ -371,7 +371,6 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flag= s1, union futex_key key1 =3D FUTEX_KEY_INIT, key2 =3D FUTEX_KEY_INIT; int task_count =3D 0, ret; struct futex_pi_state *pi_state =3D NULL; - struct futex_hash_bucket *hb1, *hb2; struct futex_q *this, *next; DEFINE_WAKE_Q(wake_q); =20 @@ -443,240 +442,244 @@ int futex_requeue(u32 __user *uaddr1, unsigned int = flags1, if (requeue_pi && futex_match(&key1, &key2)) return -EINVAL; =20 - hb1 =3D futex_hash(&key1); - hb2 =3D futex_hash(&key2); - retry_private: - futex_hb_waiters_inc(hb2); - double_lock_hb(hb1, hb2); + if (1) { + struct futex_hash_bucket *hb1, *hb2; =20 - if (likely(cmpval !=3D NULL)) { - u32 curval; + hb1 =3D futex_hash(&key1); + hb2 =3D futex_hash(&key2); =20 - ret =3D futex_get_value_locked(&curval, uaddr1); + futex_hb_waiters_inc(hb2); + double_lock_hb(hb1, hb2); =20 - if (unlikely(ret)) { - double_unlock_hb(hb1, hb2); - futex_hb_waiters_dec(hb2); + if (likely(cmpval !=3D NULL)) { + u32 curval; =20 - ret =3D get_user(curval, 
uaddr1); - if (ret) - return ret; + ret =3D futex_get_value_locked(&curval, uaddr1); =20 - if (!(flags1 & FLAGS_SHARED)) - goto retry_private; + if (unlikely(ret)) { + double_unlock_hb(hb1, hb2); + futex_hb_waiters_dec(hb2); =20 - goto retry; - } - if (curval !=3D *cmpval) { - ret =3D -EAGAIN; - goto out_unlock; - } - } + ret =3D get_user(curval, uaddr1); + if (ret) + return ret; =20 - if (requeue_pi) { - struct task_struct *exiting =3D NULL; + if (!(flags1 & FLAGS_SHARED)) + goto retry_private; =20 - /* - * Attempt to acquire uaddr2 and wake the top waiter. If we - * intend to requeue waiters, force setting the FUTEX_WAITERS - * bit. We force this here where we are able to easily handle - * faults rather in the requeue loop below. - * - * Updates topwaiter::requeue_state if a top waiter exists. - */ - ret =3D futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1, - &key2, &pi_state, - &exiting, nr_requeue); - - /* - * At this point the top_waiter has either taken uaddr2 or - * is waiting on it. In both cases pi_state has been - * established and an initial refcount on it. In case of an - * error there's nothing. - * - * The top waiter's requeue_state is up to date: - * - * - If the lock was acquired atomically (ret =3D=3D 1), then - * the state is Q_REQUEUE_PI_LOCKED. - * - * The top waiter has been dequeued and woken up and can - * return to user space immediately. The kernel/user - * space state is consistent. In case that there must be - * more waiters requeued the WAITERS bit in the user - * space futex is set so the top waiter task has to go - * into the syscall slowpath to unlock the futex. This - * will block until this requeue operation has been - * completed and the hash bucket locks have been - * dropped. - * - * - If the trylock failed with an error (ret < 0) then - * the state is either Q_REQUEUE_PI_NONE, i.e. "nothing - * happened", or Q_REQUEUE_PI_IGNORE when there was an - * interleaved early wakeup. 
- * - * - If the trylock did not succeed (ret =3D=3D 0) then the - * state is either Q_REQUEUE_PI_IN_PROGRESS or - * Q_REQUEUE_PI_WAIT if an early wakeup interleaved. - * This will be cleaned up in the loop below, which - * cannot fail because futex_proxy_trylock_atomic() did - * the same sanity checks for requeue_pi as the loop - * below does. - */ - switch (ret) { - case 0: - /* We hold a reference on the pi state. */ - break; - - case 1: - /* - * futex_proxy_trylock_atomic() acquired the user space - * futex. Adjust task_count. - */ - task_count++; - ret =3D 0; - break; - - /* - * If the above failed, then pi_state is NULL and - * waiter::requeue_state is correct. - */ - case -EFAULT: - double_unlock_hb(hb1, hb2); - futex_hb_waiters_dec(hb2); - ret =3D fault_in_user_writeable(uaddr2); - if (!ret) goto retry; - return ret; - case -EBUSY: - case -EAGAIN: - /* - * Two reasons for this: - * - EBUSY: Owner is exiting and we just wait for the - * exit to complete. - * - EAGAIN: The user space value changed. - */ - double_unlock_hb(hb1, hb2); - futex_hb_waiters_dec(hb2); - /* - * Handle the case where the owner is in the middle of - * exiting. Wait for the exit to complete otherwise - * this task might loop forever, aka. live lock. - */ - wait_for_owner_exiting(ret, exiting); - cond_resched(); - goto retry; - default: - goto out_unlock; - } - } - - plist_for_each_entry_safe(this, next, &hb1->chain, list) { - if (task_count - nr_wake >=3D nr_requeue) - break; - - if (!futex_match(&this->key, &key1)) - continue; - - /* - * FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI should always - * be paired with each other and no other futex ops. - * - * We should never be requeueing a futex_q with a pi_state, - * which is awaiting a futex_unlock_pi(). 
- */ - if ((requeue_pi && !this->rt_waiter) || - (!requeue_pi && this->rt_waiter) || - this->pi_state) { - ret =3D -EINVAL; - break; + } + if (curval !=3D *cmpval) { + ret =3D -EAGAIN; + goto out_unlock; + } } =20 - /* Plain futexes just wake or requeue and are done */ - if (!requeue_pi) { - if (++task_count <=3D nr_wake) - this->wake(&wake_q, this); - else + if (requeue_pi) { + struct task_struct *exiting =3D NULL; + + /* + * Attempt to acquire uaddr2 and wake the top waiter. If we + * intend to requeue waiters, force setting the FUTEX_WAITERS + * bit. We force this here where we are able to easily handle + * faults rather in the requeue loop below. + * + * Updates topwaiter::requeue_state if a top waiter exists. + */ + ret =3D futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1, + &key2, &pi_state, + &exiting, nr_requeue); + + /* + * At this point the top_waiter has either taken uaddr2 or + * is waiting on it. In both cases pi_state has been + * established and an initial refcount on it. In case of an + * error there's nothing. + * + * The top waiter's requeue_state is up to date: + * + * - If the lock was acquired atomically (ret =3D=3D 1), then + * the state is Q_REQUEUE_PI_LOCKED. + * + * The top waiter has been dequeued and woken up and can + * return to user space immediately. The kernel/user + * space state is consistent. In case that there must be + * more waiters requeued the WAITERS bit in the user + * space futex is set so the top waiter task has to go + * into the syscall slowpath to unlock the futex. This + * will block until this requeue operation has been + * completed and the hash bucket locks have been + * dropped. + * + * - If the trylock failed with an error (ret < 0) then + * the state is either Q_REQUEUE_PI_NONE, i.e. "nothing + * happened", or Q_REQUEUE_PI_IGNORE when there was an + * interleaved early wakeup. 
+ * + * - If the trylock did not succeed (ret =3D=3D 0) then the + * state is either Q_REQUEUE_PI_IN_PROGRESS or + * Q_REQUEUE_PI_WAIT if an early wakeup interleaved. + * This will be cleaned up in the loop below, which + * cannot fail because futex_proxy_trylock_atomic() did + * the same sanity checks for requeue_pi as the loop + * below does. + */ + switch (ret) { + case 0: + /* We hold a reference on the pi state. */ + break; + + case 1: + /* + * futex_proxy_trylock_atomic() acquired the user space + * futex. Adjust task_count. + */ + task_count++; + ret =3D 0; + break; + + /* + * If the above failed, then pi_state is NULL and + * waiter::requeue_state is correct. + */ + case -EFAULT: + double_unlock_hb(hb1, hb2); + futex_hb_waiters_dec(hb2); + ret =3D fault_in_user_writeable(uaddr2); + if (!ret) + goto retry; + return ret; + case -EBUSY: + case -EAGAIN: + /* + * Two reasons for this: + * - EBUSY: Owner is exiting and we just wait for the + * exit to complete. + * - EAGAIN: The user space value changed. + */ + double_unlock_hb(hb1, hb2); + futex_hb_waiters_dec(hb2); + /* + * Handle the case where the owner is in the middle of + * exiting. Wait for the exit to complete otherwise + * this task might loop forever, aka. live lock. + */ + wait_for_owner_exiting(ret, exiting); + cond_resched(); + goto retry; + default: + goto out_unlock; + } + } + + plist_for_each_entry_safe(this, next, &hb1->chain, list) { + if (task_count - nr_wake >=3D nr_requeue) + break; + + if (!futex_match(&this->key, &key1)) + continue; + + /* + * FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI should always + * be paired with each other and no other futex ops. + * + * We should never be requeueing a futex_q with a pi_state, + * which is awaiting a futex_unlock_pi(). 
+ */ + if ((requeue_pi && !this->rt_waiter) || + (!requeue_pi && this->rt_waiter) || + this->pi_state) { + ret =3D -EINVAL; + break; + } + + /* Plain futexes just wake or requeue and are done */ + if (!requeue_pi) { + if (++task_count <=3D nr_wake) + this->wake(&wake_q, this); + else + requeue_futex(this, hb1, hb2, &key2); + continue; + } + + /* Ensure we requeue to the expected futex for requeue_pi. */ + if (!futex_match(this->requeue_pi_key, &key2)) { + ret =3D -EINVAL; + break; + } + + /* + * Requeue nr_requeue waiters and possibly one more in the case + * of requeue_pi if we couldn't acquire the lock atomically. + * + * Prepare the waiter to take the rt_mutex. Take a refcount + * on the pi_state and store the pointer in the futex_q + * object of the waiter. + */ + get_pi_state(pi_state); + + /* Don't requeue when the waiter is already on the way out. */ + if (!futex_requeue_pi_prepare(this, pi_state)) { + /* + * Early woken waiter signaled that it is on the + * way out. Drop the pi_state reference and try the + * next waiter. @this->pi_state is still NULL. + */ + put_pi_state(pi_state); + continue; + } + + ret =3D rt_mutex_start_proxy_lock(&pi_state->pi_mutex, + this->rt_waiter, + this->task); + + if (ret =3D=3D 1) { + /* + * We got the lock. We do neither drop the refcount + * on pi_state nor clear this->pi_state because the + * waiter needs the pi_state for cleaning up the + * user space value. It will drop the refcount + * after doing so. this::requeue_state is updated + * in the wakeup as well. + */ + requeue_pi_wake_futex(this, &key2, hb2); + task_count++; + } else if (!ret) { + /* Waiter is queued, move it to hb2 */ requeue_futex(this, hb1, hb2, &key2); - continue; - } - - /* Ensure we requeue to the expected futex for requeue_pi. 
*/ - if (!futex_match(this->requeue_pi_key, &key2)) { - ret =3D -EINVAL; - break; + futex_requeue_pi_complete(this, 0); + task_count++; + } else { + /* + * rt_mutex_start_proxy_lock() detected a potential + * deadlock when we tried to queue that waiter. + * Drop the pi_state reference which we took above + * and remove the pointer to the state from the + * waiters futex_q object. + */ + this->pi_state =3D NULL; + put_pi_state(pi_state); + futex_requeue_pi_complete(this, ret); + /* + * We stop queueing more waiters and let user space + * deal with the mess. + */ + break; + } } =20 /* - * Requeue nr_requeue waiters and possibly one more in the case - * of requeue_pi if we couldn't acquire the lock atomically. - * - * Prepare the waiter to take the rt_mutex. Take a refcount - * on the pi_state and store the pointer in the futex_q - * object of the waiter. + * We took an extra initial reference to the pi_state in + * futex_proxy_trylock_atomic(). We need to drop it here again. */ - get_pi_state(pi_state); - - /* Don't requeue when the waiter is already on the way out. */ - if (!futex_requeue_pi_prepare(this, pi_state)) { - /* - * Early woken waiter signaled that it is on the - * way out. Drop the pi_state reference and try the - * next waiter. @this->pi_state is still NULL. - */ - put_pi_state(pi_state); - continue; - } - - ret =3D rt_mutex_start_proxy_lock(&pi_state->pi_mutex, - this->rt_waiter, - this->task); - - if (ret =3D=3D 1) { - /* - * We got the lock. We do neither drop the refcount - * on pi_state nor clear this->pi_state because the - * waiter needs the pi_state for cleaning up the - * user space value. It will drop the refcount - * after doing so. this::requeue_state is updated - * in the wakeup as well. 
- */ - requeue_pi_wake_futex(this, &key2, hb2); - task_count++; - } else if (!ret) { - /* Waiter is queued, move it to hb2 */ - requeue_futex(this, hb1, hb2, &key2); - futex_requeue_pi_complete(this, 0); - task_count++; - } else { - /* - * rt_mutex_start_proxy_lock() detected a potential - * deadlock when we tried to queue that waiter. - * Drop the pi_state reference which we took above - * and remove the pointer to the state from the - * waiters futex_q object. - */ - this->pi_state =3D NULL; - put_pi_state(pi_state); - futex_requeue_pi_complete(this, ret); - /* - * We stop queueing more waiters and let user space - * deal with the mess. - */ - break; - } - } - - /* - * We took an extra initial reference to the pi_state in - * futex_proxy_trylock_atomic(). We need to drop it here again. - */ - put_pi_state(pi_state); + put_pi_state(pi_state); =20 out_unlock: - double_unlock_hb(hb1, hb2); + double_unlock_hb(hb1, hb2); + futex_hb_waiters_dec(hb2); + } wake_up_q(&wake_q); - futex_hb_waiters_dec(hb2); return ret ? 
ret : task_count; } =20 diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c index 1108f373fd315..7dc35be09e436 100644 --- a/kernel/futex/waitwake.c +++ b/kernel/futex/waitwake.c @@ -253,7 +253,6 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flag= s, u32 __user *uaddr2, int nr_wake, int nr_wake2, int op) { union futex_key key1 =3D FUTEX_KEY_INIT, key2 =3D FUTEX_KEY_INIT; - struct futex_hash_bucket *hb1, *hb2; struct futex_q *this, *next; int ret, op_ret; DEFINE_WAKE_Q(wake_q); @@ -266,67 +265,71 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int fl= ags, u32 __user *uaddr2, if (unlikely(ret !=3D 0)) return ret; =20 - hb1 =3D futex_hash(&key1); - hb2 =3D futex_hash(&key2); - retry_private: - double_lock_hb(hb1, hb2); - op_ret =3D futex_atomic_op_inuser(op, uaddr2); - if (unlikely(op_ret < 0)) { - double_unlock_hb(hb1, hb2); + if (1) { + struct futex_hash_bucket *hb1, *hb2; =20 - if (!IS_ENABLED(CONFIG_MMU) || - unlikely(op_ret !=3D -EFAULT && op_ret !=3D -EAGAIN)) { - /* - * we don't get EFAULT from MMU faults if we don't have - * an MMU, but we might get them from range checking - */ - ret =3D op_ret; - return ret; - } + hb1 =3D futex_hash(&key1); + hb2 =3D futex_hash(&key2); =20 - if (op_ret =3D=3D -EFAULT) { - ret =3D fault_in_user_writeable(uaddr2); - if (ret) + double_lock_hb(hb1, hb2); + op_ret =3D futex_atomic_op_inuser(op, uaddr2); + if (unlikely(op_ret < 0)) { + double_unlock_hb(hb1, hb2); + + if (!IS_ENABLED(CONFIG_MMU) || + unlikely(op_ret !=3D -EFAULT && op_ret !=3D -EAGAIN)) { + /* + * we don't get EFAULT from MMU faults if we don't have + * an MMU, but we might get them from range checking + */ + ret =3D op_ret; return ret; - } - - cond_resched(); - if (!(flags & FLAGS_SHARED)) - goto retry_private; - goto retry; - } - - plist_for_each_entry_safe(this, next, &hb1->chain, list) { - if (futex_match (&this->key, &key1)) { - if (this->pi_state || this->rt_waiter) { - ret =3D -EINVAL; - goto out_unlock; } - this->wake(&wake_q, this); - 
if (++ret >=3D nr_wake) - break; - } - } =20 - if (op_ret > 0) { - op_ret =3D 0; - plist_for_each_entry_safe(this, next, &hb2->chain, list) { - if (futex_match (&this->key, &key2)) { + if (op_ret =3D=3D -EFAULT) { + ret =3D fault_in_user_writeable(uaddr2); + if (ret) + return ret; + } + + cond_resched(); + if (!(flags & FLAGS_SHARED)) + goto retry_private; + goto retry; + } + + plist_for_each_entry_safe(this, next, &hb1->chain, list) { + if (futex_match(&this->key, &key1)) { if (this->pi_state || this->rt_waiter) { ret =3D -EINVAL; goto out_unlock; } this->wake(&wake_q, this); - if (++op_ret >=3D nr_wake2) + if (++ret >=3D nr_wake) break; } } - ret +=3D op_ret; - } + + if (op_ret > 0) { + op_ret =3D 0; + plist_for_each_entry_safe(this, next, &hb2->chain, list) { + if (futex_match(&this->key, &key2)) { + if (this->pi_state || this->rt_waiter) { + ret =3D -EINVAL; + goto out_unlock; + } + this->wake(&wake_q, this); + if (++op_ret >=3D nr_wake2) + break; + } + } + ret +=3D op_ret; + } =20 out_unlock: - double_unlock_hb(hb1, hb2); + double_unlock_hb(hb1, hb2); + } wake_up_q(&wake_q); return ret; } @@ -402,7 +405,6 @@ int futex_unqueue_multiple(struct futex_vector *v, int = count) */ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *wok= en) { - struct futex_hash_bucket *hb; bool retry =3D false; int ret, i; u32 uval; @@ -441,21 +443,25 @@ int futex_wait_multiple_setup(struct futex_vector *vs= , int count, int *woken) struct futex_q *q =3D &vs[i].q; u32 val =3D vs[i].w.val; =20 - hb =3D futex_hash(&q->key); - futex_q_lock(q, hb); - ret =3D futex_get_value_locked(&uval, uaddr); + if (1) { + struct futex_hash_bucket *hb; =20 - if (!ret && uval =3D=3D val) { - /* - * The bucket lock can't be held while dealing with the - * next futex. Queue each futex at this moment so hb can - * be unlocked. 
- */ - futex_queue(q, hb, current); - continue; + hb =3D futex_hash(&q->key); + futex_q_lock(q, hb); + ret =3D futex_get_value_locked(&uval, uaddr); + + if (!ret && uval =3D=3D val) { + /* + * The bucket lock can't be held while dealing with the + * next futex. Queue each futex at this moment so hb can + * be unlocked. + */ + futex_queue(q, hb, current); + continue; + } + + futex_q_unlock(hb); } - - futex_q_unlock(hb); __set_current_state(TASK_RUNNING); =20 /* @@ -584,7 +590,6 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsign= ed int flags, struct futex_q *q, union futex_key *key2, struct task_struct *task) { - struct futex_hash_bucket *hb; u32 uval; int ret; =20 @@ -612,44 +617,48 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsi= gned int flags, return ret; =20 retry_private: - hb =3D futex_hash(&q->key); - futex_q_lock(q, hb); + if (1) { + struct futex_hash_bucket *hb; =20 - ret =3D futex_get_value_locked(&uval, uaddr); + hb =3D futex_hash(&q->key); + futex_q_lock(q, hb); =20 - if (ret) { - futex_q_unlock(hb); + ret =3D futex_get_value_locked(&uval, uaddr); =20 - ret =3D get_user(uval, uaddr); - if (ret) - return ret; + if (ret) { + futex_q_unlock(hb); =20 - if (!(flags & FLAGS_SHARED)) - goto retry_private; + ret =3D get_user(uval, uaddr); + if (ret) + return ret; =20 - goto retry; + if (!(flags & FLAGS_SHARED)) + goto retry_private; + + goto retry; + } + + if (uval !=3D val) { + futex_q_unlock(hb); + return -EWOULDBLOCK; + } + + if (key2 && futex_match(&q->key, key2)) { + futex_q_unlock(hb); + return -EINVAL; + } + + /* + * The task state is guaranteed to be set before another task can + * wake it. set_current_state() is implemented using smp_store_mb() and + * futex_queue() calls spin_unlock() upon completion, both serializing + * access to the hash list and forcing another memory barrier. 
+ */ + if (task =3D=3D current) + set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); + futex_queue(q, hb, task); } =20 - if (uval !=3D val) { - futex_q_unlock(hb); - return -EWOULDBLOCK; - } - - if (key2 && futex_match(&q->key, key2)) { - futex_q_unlock(hb); - return -EINVAL; - } - - /* - * The task state is guaranteed to be set before another task can - * wake it. set_current_state() is implemented using smp_store_mb() and - * futex_queue() calls spin_unlock() upon completion, both serializing - * access to the hash list and forcing another memory barrier. - */ - if (task =3D=3D current) - set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); - futex_queue(q, hb, task); - return ret; } =20 --=20 2.49.0 From nobody Fri Dec 19 02:59:33 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 00B7F217679 for ; Wed, 16 Apr 2025 16:29:31 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820974; cv=none; b=CMVfO9GfIxwE2dhNgRtmhwBI6u59n8LVoIFP+hYWA0kM+JhNZ67NlpbCZYxg2XY5VmOPate9/DP07bU7RPV9F20xpyEND9ahTlGfEdwbJY9Zo6omDQphB+YGwES29ZAV1lSB1l+gSjnSdAvIgWQFEYNc/uTZZVrzJA3Mohmy2DM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820974; c=relaxed/simple; bh=4RVs5yFFOy+skWJbJ0/tGqbCUNWFm8fmZA1dFb0K0h0=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=pMvSqYLxHHPRGSnCxPAc3SRcTVjwgWaIZmaWW/ArVA3Leoabjox/Z/I9OdREXQFCqpm0CI5GJCu8RKjgKgjeONRojWrfzO6OrVYgPS4hBLLb2gO1hnRfDxoX1UFgehhfBt3mn+BrO4i5mg7m0xxA3AzqQ9DiaaBM03SNB7SUhhc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) 
header.d=linutronix.de header.i=@linutronix.de header.b=bs+/vFS9; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=X1FU8SgH; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="bs+/vFS9"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="X1FU8SgH" From: Sebastian Andrzej Siewior DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1744820968; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=FZv5mCM9VZtVeecvyTzEV6NiEWP8t+URKBFUyOq84PE=; b=bs+/vFS9WTwXQO/ULpH2/0vAgTqygfJRMTST3dyCKbH8nSGdKe9U51TZUiZXLa7QnXYEz8 463EJsWre3yJIYgBOjc2momDvLzs8K0NMhCTysOVpeo6/fGjSkWUiDEzAG3bFEdxv96UvA re392JFyzlgze3r86WDOuqmzx8Id+NBl9y3hiuXhF2yME4fRX5RerMbsgeZ1vDsbiukDrM a9hfjnXc3btBPuTWh7Aos1gCQ3IrQq03TQprm65wCiEEyRx0iMPq2i8F5GQdhykzxNK/hO IYfPddceWXjSio42WNvOdpNFKZhY8iCmWQQcYFKuAkXWqNXM5pYhOovZfjJusQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1744820968; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=FZv5mCM9VZtVeecvyTzEV6NiEWP8t+URKBFUyOq84PE=; b=X1FU8SgHYV9tJmvuUF9LaFnIbjMUSgni+Be+pd00We03+HwDPidtxcOr9QscacWWN1gzla 0xx1zenkGyq6ybBw== To: linux-kernel@vger.kernel.org Cc: =?UTF-8?q?Andr=C3=A9=20Almeida?= , Darren Hart , Davidlohr Bueso , Ingo Molnar , Juri Lelli , Peter Zijlstra , Thomas Gleixner , Valentin Schneider , 
Waiman Long , Sebastian Andrzej Siewior Subject: [PATCH v12 06/21] futex: Create futex_hash() get/put class Date: Wed, 16 Apr 2025 18:29:06 +0200 Message-ID: <20250416162921.513656-7-bigeasy@linutronix.de> In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de> References: <20250416162921.513656-1-bigeasy@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Peter Zijlstra This gets us: hb =3D futex_hash(key) /* gets hb and inc users */ futex_hash_get(hb) /* inc users */ futex_hash_put(hb) /* dec users */ Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior --- kernel/futex/core.c | 6 +++--- kernel/futex/futex.h | 7 +++++++ kernel/futex/pi.c | 16 ++++++++++++---- kernel/futex/requeue.c | 10 +++------- kernel/futex/waitwake.c | 15 +++++---------- 5 files changed, 30 insertions(+), 24 deletions(-) diff --git a/kernel/futex/core.c b/kernel/futex/core.c index e4cb5ce9785b1..56a5653e450cb 100644 --- a/kernel/futex/core.c +++ b/kernel/futex/core.c @@ -122,6 +122,8 @@ struct futex_hash_bucket *futex_hash(union futex_key *k= ey) return &futex_queues[hash & futex_hashmask]; } =20 +void futex_hash_get(struct futex_hash_bucket *hb) { } +void futex_hash_put(struct futex_hash_bucket *hb) { } =20 /** * futex_setup_timer - set up the sleeping hrtimer. 
@@ -957,9 +959,7 @@ static void exit_pi_state_list(struct task_struct *curr) pi_state =3D list_entry(next, struct futex_pi_state, list); key =3D pi_state->key; if (1) { - struct futex_hash_bucket *hb; - - hb =3D futex_hash(&key); + CLASS(hb, hb)(&key); =20 /* * We can race against put_pi_state() removing itself from the diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h index a219903e52084..77d9b3509f75c 100644 --- a/kernel/futex/futex.h +++ b/kernel/futex/futex.h @@ -7,6 +7,7 @@ #include #include #include +#include =20 #ifdef CONFIG_PREEMPT_RT #include @@ -202,6 +203,12 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleepe= r *timeout, int flags, u64 range_ns); =20 extern struct futex_hash_bucket *futex_hash(union futex_key *key); +extern void futex_hash_get(struct futex_hash_bucket *hb); +extern void futex_hash_put(struct futex_hash_bucket *hb); + +DEFINE_CLASS(hb, struct futex_hash_bucket *, + if (_T) futex_hash_put(_T), + futex_hash(key), union futex_key *key); =20 /** * futex_match - Check whether two futex keys are equal diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c index a56f28fda58dd..e52f540e81b6a 100644 --- a/kernel/futex/pi.c +++ b/kernel/futex/pi.c @@ -939,9 +939,8 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags= , ktime_t *time, int tryl =20 retry_private: if (1) { - struct futex_hash_bucket *hb; + CLASS(hb, hb)(&q.key); =20 - hb =3D futex_hash(&q.key); futex_q_lock(&q, hb); =20 ret =3D futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, @@ -994,6 +993,16 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flag= s, ktime_t *time, int tryl goto no_block; } =20 + /* + * Caution; releasing @hb in-scope. The hb->lock is still locked + * while the reference is dropped. The reference can not be dropped + * after the unlock because if a user initiated resize is in progress + * then we might need to wake him. This can not be done after the + * rt_mutex_pre_schedule() invocation. 
The hb will remain valid because + * the thread, performing resize, will block on hb->lock during + * the requeue. + */ + futex_hash_put(no_free_ptr(hb)); /* * Must be done before we enqueue the waiter, here is unfortunately * under the hb lock, but that *should* work because it does nothing. @@ -1119,7 +1128,6 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int f= lags) { u32 curval, uval, vpid =3D task_pid_vnr(current); union futex_key key =3D FUTEX_KEY_INIT; - struct futex_hash_bucket *hb; struct futex_q *top_waiter; int ret; =20 @@ -1139,7 +1147,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int f= lags) if (ret) return ret; =20 - hb =3D futex_hash(&key); + CLASS(hb, hb)(&key); spin_lock(&hb->lock); retry_hb: =20 diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c index 209794cad6f2f..992e3ce005c6f 100644 --- a/kernel/futex/requeue.c +++ b/kernel/futex/requeue.c @@ -444,10 +444,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int fla= gs1, =20 retry_private: if (1) { - struct futex_hash_bucket *hb1, *hb2; - - hb1 =3D futex_hash(&key1); - hb2 =3D futex_hash(&key2); + CLASS(hb, hb1)(&key1); + CLASS(hb, hb2)(&key2); =20 futex_hb_waiters_inc(hb2); double_lock_hb(hb1, hb2); @@ -817,9 +815,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned i= nt flags, switch (futex_requeue_pi_wakeup_sync(&q)) { case Q_REQUEUE_PI_IGNORE: { - struct futex_hash_bucket *hb; - - hb =3D futex_hash(&q.key); + CLASS(hb, hb)(&q.key); /* The waiter is still on uaddr1 */ spin_lock(&hb->lock); ret =3D handle_early_requeue_pi_wakeup(hb, &q, to); diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c index 7dc35be09e436..d52541bcc07e9 100644 --- a/kernel/futex/waitwake.c +++ b/kernel/futex/waitwake.c @@ -154,7 +154,6 @@ void futex_wake_mark(struct wake_q_head *wake_q, struct= futex_q *q) */ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bit= set) { - struct futex_hash_bucket *hb; struct futex_q *this, *next; union futex_key key =3D 
FUTEX_KEY_INIT; DEFINE_WAKE_Q(wake_q); @@ -170,7 +169,7 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, i= nt nr_wake, u32 bitset) if ((flags & FLAGS_STRICT) && !nr_wake) return 0; =20 - hb =3D futex_hash(&key); + CLASS(hb, hb)(&key); =20 /* Make sure we really have tasks to wakeup */ if (!futex_hb_waiters_pending(hb)) @@ -267,10 +266,8 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int fla= gs, u32 __user *uaddr2, =20 retry_private: if (1) { - struct futex_hash_bucket *hb1, *hb2; - - hb1 =3D futex_hash(&key1); - hb2 =3D futex_hash(&key2); + CLASS(hb, hb1)(&key1); + CLASS(hb, hb2)(&key2); =20 double_lock_hb(hb1, hb2); op_ret =3D futex_atomic_op_inuser(op, uaddr2); @@ -444,9 +441,8 @@ int futex_wait_multiple_setup(struct futex_vector *vs, = int count, int *woken) u32 val =3D vs[i].w.val; =20 if (1) { - struct futex_hash_bucket *hb; + CLASS(hb, hb)(&q->key); =20 - hb =3D futex_hash(&q->key); futex_q_lock(q, hb); ret =3D futex_get_value_locked(&uval, uaddr); =20 @@ -618,9 +614,8 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsign= ed int flags, =20 retry_private: if (1) { - struct futex_hash_bucket *hb; + CLASS(hb, hb)(&q->key); =20 - hb =3D futex_hash(&q->key); futex_q_lock(q, hb); =20 ret =3D futex_get_value_locked(&uval, uaddr); --=20 2.49.0 From nobody Fri Dec 19 02:59:33 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 00C8921767B for ; Wed, 16 Apr 2025 16:29:31 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820973; cv=none; b=BRhEz4wzzJbd3vX1/5UjBO0pOzMg6qUXR1woHkPewes6oYTSd0yRf56mAigZ2zx8zgNMj0lJaMT51Ph59nV8jhY6LKr1H/BvUZtQXQ8mgIRH30xxmH+DNS0PW7r9Lz+U34Tq2zuljI4Ap0SYrlA8vwkRKxAuVQs6Vf3sSa1+B18= ARC-Message-Signature: i=1; 
a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820973; c=relaxed/simple; bh=LaZS6UPiImzSyxqxJCcwrXkWJyg0hTcn7pnVQilLcwk=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=mZYfwEq6SNV1BifVeWKrty77mUpIkRUKfADQdyFG/DGT72TS9QJ7LfhgQ6I6Kpqu/ohN4ANrXCpIZ73ZSa8xLHW/GN7c7rYz98sQ46TkR/hxQO5GszhlCl8IuZ/sLgPS/cIMycLIZJJ1clC/csgGqkCVgtxPjhkrZu+9DdsejyE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=1PSDD/U8; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=fjdaCeM0; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="1PSDD/U8"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="fjdaCeM0" From: Sebastian Andrzej Siewior DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1744820968; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=8ADpzXL+em4lIAaBMKlTSOE1lA/L53fXWSsZH7Iu6lc=; b=1PSDD/U8BBFkEdLcLEG516WDWYKDp3x9nLx/cmlw7fl+AqSkldQ6/Go7cdMRkYvDWsyhaQ iwRErPoPx6uPPLROZmXXQV3SHKQ9alzI5UaMt80HF2LMey8RlrzoJhVaOMXWxAClNVb1l8 9Zts+hPC/lg2Zzp5EOTjYmsLUWj0Q7UVuE9HMGJRLHILhwiM1Efu7LukSyLZ3XEvE17OkX pWD/SGILcnuApyyaR9sg5WIIoZEfExdN5q6uPnHqy7qFLrgTJx77Q8S+NN8JIazY0wjjwm POasnCjd7X5rk1CcHn7D58U3HpHdHiQyzzCedY7B/A/NQIGamfA5sNf7XoQh5A== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; 
s=2020e; t=1744820968; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=8ADpzXL+em4lIAaBMKlTSOE1lA/L53fXWSsZH7Iu6lc=; b=fjdaCeM0yYeFzKy9r/kCrpHbaVWDmOVbG5S+jB0TVipIj3lMSOAlCVtzmJ9s3V20pTQggu 3r4VNnBFJEAxMRAg== To: linux-kernel@vger.kernel.org Cc: =?UTF-8?q?Andr=C3=A9=20Almeida?= , Darren Hart , Davidlohr Bueso , Ingo Molnar , Juri Lelli , Peter Zijlstra , Thomas Gleixner , Valentin Schneider , Waiman Long , Sebastian Andrzej Siewior Subject: [PATCH v12 07/21] futex: Create private_hash() get/put class Date: Wed, 16 Apr 2025 18:29:07 +0200 Message-ID: <20250416162921.513656-8-bigeasy@linutronix.de> In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de> References: <20250416162921.513656-1-bigeasy@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Peter Zijlstra This gets us: fph =3D futex_private_hash(key) /* gets fph and inc users */ futex_private_hash_get(fph) /* inc users */ futex_private_hash_put(fph) /* dec users */ Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior --- kernel/futex/core.c | 12 ++++++++++++ kernel/futex/futex.h | 8 ++++++++ 2 files changed, 20 insertions(+) diff --git a/kernel/futex/core.c b/kernel/futex/core.c index 56a5653e450cb..6a1d6b14277f4 100644 --- a/kernel/futex/core.c +++ b/kernel/futex/core.c @@ -107,6 +107,18 @@ late_initcall(fail_futex_debugfs); =20 #endif /* CONFIG_FAIL_FUTEX */ =20 +struct futex_private_hash *futex_private_hash(void) +{ + return NULL; +} + +bool futex_private_hash_get(struct futex_private_hash *fph) +{ + return false; +} + +void futex_private_hash_put(struct futex_private_hash *fph) { } + /** * futex_hash - Return the hash bucket in the global 
hash * @key: Pointer to the futex key for which the hash is calculated diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h index 77d9b3509f75c..bc76e366f9a77 100644 --- a/kernel/futex/futex.h +++ b/kernel/futex/futex.h @@ -206,10 +206,18 @@ extern struct futex_hash_bucket *futex_hash(union fut= ex_key *key); extern void futex_hash_get(struct futex_hash_bucket *hb); extern void futex_hash_put(struct futex_hash_bucket *hb); =20 +extern struct futex_private_hash *futex_private_hash(void); +extern bool futex_private_hash_get(struct futex_private_hash *fph); +extern void futex_private_hash_put(struct futex_private_hash *fph); + DEFINE_CLASS(hb, struct futex_hash_bucket *, if (_T) futex_hash_put(_T), futex_hash(key), union futex_key *key); =20 +DEFINE_CLASS(private_hash, struct futex_private_hash *, + if (_T) futex_private_hash_put(_T), + futex_private_hash(), void); + /** * futex_match - Check whether two futex keys are equal * @key1: Pointer to key1 --=20 2.49.0 From nobody Fri Dec 19 02:59:33 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 00C2C21767A for ; Wed, 16 Apr 2025 16:29:31 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820974; cv=none; b=mDZACD3KKUVUrtaphFn4HJVblReH0vbPPn3orb88Ur/1gY6muM9kK2VizfI3/RbRfJ0JFAs+JwYWa8tU8Bn19zVC7m8P8W37Q6GF8zJqM5l3zRloDp/UGojfsf+C4ZpRCOAcDyNiC+geOELjFh4OZEPjFRgU7aJu0wDMinWsdZ0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820974; c=relaxed/simple; bh=NMtemmKnPmyQcs6axqNmBbUMi56qsYrZm/MNY3rHVaw=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; 
b=Aacd8Ozf1DoUn+Vez5IqWMzydpEuEVLY0FbmOubmFY0TcCITuzW6P1cFPTzkJGJK5aMCpRc3BD3kvL6X+pr3fZTct31ekLsDNFrU1iGXkYn3vTVBnBTJINQJjB4AsgsGYv+0nL9pHB01fu0mF9GYsQ8ctcP+D7AdCXSD5KqMZOg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=YoETatFi; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=IMRNbKaq; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="YoETatFi"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="IMRNbKaq" From: Sebastian Andrzej Siewior DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1744820969; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=0GEZ2YVGNjbG0flB6FM5Jwo67WRq8Fq4s1znrOXO320=; b=YoETatFiGDluO6Ep8ZEOYVmHp0mqSfCmlsnzDw6Y/QtqrcmSXGWD6uzDKU6Arrsi0Je3Mi qLAG7Hq71GwFveQFdV6oA2vQXVJxOKJLWg+MlFCVtPv5aOA8WRT5uT5u3Fxk+V22TvnXqF WuVjli/PFT+n/N4ws2ZYDV3pMAO4GUzuaC8SgG+QnnF23nIbw73bPQy3Mzs9Nonq4JEYbg kaZZurGf8Uvdr/izALeKbxN+kfNmIpjtg3BQ2nY/s3wYEkBcd9kU/p8wSuj8N9/TNhg4sM 7K6QDI0psKg0QaO81hBC5Q4xYMkQ702fox5tlfK82Q5VZTMPgpSG60qzS4rmDg== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1744820969; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: 
in-reply-to:in-reply-to:references:references; bh=0GEZ2YVGNjbG0flB6FM5Jwo67WRq8Fq4s1znrOXO320=; b=IMRNbKaq4fnAhJboRshzX7YG9P3vtJXCjc2Gz1DrcFCOh2H/VAHqNPSHW1DN7DZeh2x1XS LyUnvZTwAWmrHFDQ== To: linux-kernel@vger.kernel.org Cc: =?UTF-8?q?Andr=C3=A9=20Almeida?= , Darren Hart , Davidlohr Bueso , Ingo Molnar , Juri Lelli , Peter Zijlstra , Thomas Gleixner , Valentin Schneider , Waiman Long , Sebastian Andrzej Siewior Subject: [PATCH v12 08/21] futex: Acquire a hash reference in futex_wait_multiple_setup() Date: Wed, 16 Apr 2025 18:29:08 +0200 Message-ID: <20250416162921.513656-9-bigeasy@linutronix.de> In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de> References: <20250416162921.513656-1-bigeasy@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" futex_wait_multiple_setup() changes task_struct::__state to !TASK_RUNNING and then enqueues on multiple futexes. Every futex_q_lock() acquires a reference on the global hash which is dropped later. If a rehash is in progress then the loop will block on mm_struct::futex_hash_bucket for the rehash to complete and this will lose the previously set task_struct::__state. Acquire a reference on the local hash to avoid blocking on mm_struct::futex_hash_bucket. Signed-off-by: Sebastian Andrzej Siewior --- kernel/futex/waitwake.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c index d52541bcc07e9..bd8fef0f8d180 100644 --- a/kernel/futex/waitwake.c +++ b/kernel/futex/waitwake.c @@ -406,6 +406,12 @@ int futex_wait_multiple_setup(struct futex_vector *vs,= int count, int *woken) int ret, i; u32 uval; =20 + /* + * Make sure to have a reference on the private_hash such that we + * don't block on rehash after changing the task state below. 
+ */ + guard(private_hash)(); + /* * Enqueuing multiple futexes is tricky, because we need to enqueue * each futex on the list before dealing with the next one to avoid --=20 2.49.0 From nobody Fri Dec 19 02:59:33 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 495B3212D8D for ; Wed, 16 Apr 2025 16:29:32 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820974; cv=none; b=qlE1MTMDSyrM3N2vRO4HfssPKNS75GfEnvajjE2cUKNNbmt5rP5yYOL4+bBGQ9wXWPbAHq5wuaKFlvZkYdlSvOr1NNixv9heiwKcOyp9obfVwvyqJZ5GBgXF1OwT0/BQ25jj1FfTPy9ro0IaxGj/8xnWBwoOPsTK0NvwddaatM0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820974; c=relaxed/simple; bh=HFlqT1mLvmo+UShZEWo/EejFFqRQ7TcsnK9RJNrJVVA=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=B8gb1dr+gjI1tNI0JLObpmgypIjaJbukNymyS1z2uP9gW9Q6/12zzasTScK8JNadKTZkrh4Z28b5bMDMH9EEZ5nmrdpiZeDiKuxZxrEFr99ljM/bpGUDnwT95NsreG/QQAisA3N6DDVJPeFAoliOGKb0RcNLYGxWWDg3zrEeB+k= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=sYz9K38R; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=c8Pzc1fi; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="sYz9K38R"; dkim=permerror 
(0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="c8Pzc1fi" From: Sebastian Andrzej Siewior DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1744820969; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=qsSIweOJOeVPMO/KHuLxj3Vruac4eGemiYIOVUirD2Y=; b=sYz9K38RoqrysK0wQDqTfE/18Fm6QdHNgiocw5d+SodH8XXkMZRmb4TAVy+p1PfJ5cCC/m XpoAsx+uT9cInA0OzYmI2HRO+A6tivvQwvzAdMYCUiWqu93P1RqE7C2waeZWGNVgntjBf+ KsYQSQXOb2h56Fa/ClwYTC/oEo/Og62RjuzcbO3yvxtLZFBtNSo88siYBI2WfTWOAIGg5Z 2N1oOks7YZYs8ZXmx1tBWAQqP3YQK4/bnnm/p/jGwvWzecVPetHO48jpY9nMN6vE5xfrff IC2ny+RCuIKDSOTJOQiGYZCkZbe3dC+HJ6G59R5r7XFnP0POIhTwB5yes3wGww== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1744820969; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=qsSIweOJOeVPMO/KHuLxj3Vruac4eGemiYIOVUirD2Y=; b=c8Pzc1fixL07Kd7xRRTKl90ULCuHTf6J6crB0b4tPqeN3gYNaBi3N1I96nNtpzpdZ9Yzei fPLg6//djUiRSiBQ== To: linux-kernel@vger.kernel.org Cc: =?UTF-8?q?Andr=C3=A9=20Almeida?= , Darren Hart , Davidlohr Bueso , Ingo Molnar , Juri Lelli , Peter Zijlstra , Thomas Gleixner , Valentin Schneider , Waiman Long , Sebastian Andrzej Siewior Subject: [PATCH v12 09/21] futex: Decrease the waiter count before the unlock operation Date: Wed, 16 Apr 2025 18:29:09 +0200 Message-ID: <20250416162921.513656-10-bigeasy@linutronix.de> In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de> References: <20250416162921.513656-1-bigeasy@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; 
charset="utf-8" To support runtime resizing of the process private hash, it's required to not use the obtained hash bucket once the reference count has been dropped. The reference will be dropped after the unlock of the hash bucket. The number of waiters is decremented after the unlock operation. There is no requirement that this needs to happen after the unlock. The increment happens before acquiring the lock to signal early that there will be a waiter. A waker can avoid blocking on the lock if it is known that there will be no waiter. There is no difference in terms of ordering if the decrement happens before or after the unlock. Decrease the waiter count before the unlock operation. Signed-off-by: Sebastian Andrzej Siewior --- kernel/futex/core.c | 2 +- kernel/futex/requeue.c | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/futex/core.c b/kernel/futex/core.c index 6a1d6b14277f4..5e70cb8eb2507 100644 --- a/kernel/futex/core.c +++ b/kernel/futex/core.c @@ -537,8 +537,8 @@ void futex_q_lock(struct futex_q *q, struct futex_hash_= bucket *hb) void futex_q_unlock(struct futex_hash_bucket *hb) __releases(&hb->lock) { - spin_unlock(&hb->lock); futex_hb_waiters_dec(hb); + spin_unlock(&hb->lock); } =20 void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb, diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c index 992e3ce005c6f..023c028d2fce3 100644 --- a/kernel/futex/requeue.c +++ b/kernel/futex/requeue.c @@ -456,8 +456,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flag= s1, ret =3D futex_get_value_locked(&curval, uaddr1); =20 if (unlikely(ret)) { - double_unlock_hb(hb1, hb2); futex_hb_waiters_dec(hb2); + double_unlock_hb(hb1, hb2); =20 ret =3D get_user(curval, uaddr1); if (ret) @@ -542,8 +542,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flag= s1, * waiter::requeue_state is correct. 
*/ case -EFAULT: - double_unlock_hb(hb1, hb2); futex_hb_waiters_dec(hb2); + double_unlock_hb(hb1, hb2); ret =3D fault_in_user_writeable(uaddr2); if (!ret) goto retry; @@ -556,8 +556,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flag= s1, * exit to complete. * - EAGAIN: The user space value changed. */ - double_unlock_hb(hb1, hb2); futex_hb_waiters_dec(hb2); + double_unlock_hb(hb1, hb2); /* * Handle the case where the owner is in the middle of * exiting. Wait for the exit to complete otherwise @@ -674,8 +674,8 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flag= s1, put_pi_state(pi_state); =20 out_unlock: - double_unlock_hb(hb1, hb2); futex_hb_waiters_dec(hb2); + double_unlock_hb(hb1, hb2); } wake_up_q(&wake_q); return ret ? ret : task_count; --=20 2.49.0 From nobody Fri Dec 19 02:59:33 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 458C1217657 for ; Wed, 16 Apr 2025 16:29:32 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820974; cv=none; b=S+WzAbnla0mJzPsy6s7ZSuu67J97i3t/C6NvT5KS42KomuC/0xU/yNl4kCOaietgfHWocWTO8Usg+GfDNiVO8b+He28pRPx8XdOgzG9ooODlZhCERD716Y3gnVMyLvqq8bBbFjphMFT0u5GmvRKOMwi5Uy+S9oC9HKMxlMFyBAA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820974; c=relaxed/simple; bh=1wEU3kqWniX4oeLgQyszFiXMvT6lV0eU4OnN68ez7Qg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=q+AxsXK58liiqoC0WJ88TR5s0uWlFiLgUEkuxcKPEfWd9Vr+19L25KUQIy/QtFSs//hzq8QI4yBxkA7bM01kwdVtHWvXHL6qVhcTn1a0fcQP9CWOhhcgk1biGYZDgk3hGcOdBFzbXxgXfR3vssQD0bkq4fX6xC4oJV8pQMLW9mQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass 
smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=rpq7mQy+; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=EQFskdHA; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="rpq7mQy+"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="EQFskdHA" From: Sebastian Andrzej Siewior DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1744820970; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=MHrIMWrBknhDvpUrvZqL2/sWbWTxrgQvEOQEnBSnNUI=; b=rpq7mQy+FNjD250HeJIsiGVEwU6Ppgm6SaiD5cssohBe22fkDVoQaN7Ctf6kQ3hpZxeaVp P0bZZOTbmn1yzB9fESL6iSgCBFjycjWC2Xd/XYCQXVbD9GsKiAy5fD+4WQEQP9C2IYvTVE yNxkEXVAt2a3zT6UweLywPwNQvCWiaHbGihRLX5huSFyDkmF5jIxmqMIhHMgdnbGYKooBp 6uC5I5exjgxaUI9+mDIFD6HpVAeePX+d9I6ivNVde/LWlsV5pGU7D8sTj6A1xoHrf4baSr tv7QAI8akRCufofyNzbrolceQnXKnjunOcZBw7l3U00/TW5t4FB5zr8wY5JfKg== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1744820970; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=MHrIMWrBknhDvpUrvZqL2/sWbWTxrgQvEOQEnBSnNUI=; b=EQFskdHAnM/Mepfx+GilASw8dX/aCyIdUdA7ytOii8lBcXJfZnqO8xtBcRknUssEZiFs3C 3tR9x7DL7/k+8GBQ== To: linux-kernel@vger.kernel.org Cc: =?UTF-8?q?Andr=C3=A9=20Almeida?= , Darren Hart , Davidlohr Bueso , Ingo Molnar , Juri Lelli , Peter 
Zijlstra , Thomas Gleixner , Valentin Schneider , Waiman Long , Sebastian Andrzej Siewior Subject: [PATCH v12 10/21] futex: Introduce futex_q_lockptr_lock() Date: Wed, 16 Apr 2025 18:29:10 +0200 Message-ID: <20250416162921.513656-11-bigeasy@linutronix.de> In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de> References: <20250416162921.513656-1-bigeasy@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" futex_lock_pi() and __fixup_pi_state_owner() acquire futex_q::lock_ptr without holding a reference, assuming that the previously obtained hash bucket and the assigned lock_ptr are still valid. This is no longer the case once the private hash can be resized: the bucket becomes invalid after the reference is dropped. Introduce futex_q_lockptr_lock() to lock the hash bucket recorded in futex_q::lock_ptr. The lock pointer is read in an RCU section to ensure that it does not go away if the hash bucket has been replaced and the old pointer has been observed. After locking, the pointer needs to be compared to check whether it changed. If so, the hash bucket has been replaced, the user has been moved to the new one and lock_ptr has been updated. The lock operation needs to be redone in this case. The locked hash bucket is not returned. A special case is an early return in futex_lock_pi() (due to signal or timeout) and a successful futex_wait_requeue_pi(). In both cases a valid futex_q::lock_ptr (and its matching hash bucket) is expected, but since the waiter has been removed from the hash this can no longer be guaranteed. Therefore, before the waiter is removed, a reference is acquired which is later dropped by the waiter to avoid a resize. Add futex_q_lockptr_lock() and use it.
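For illustration only, the lock/compare/retry scheme described above can be sketched in plain user-space C. This is not the kernel code: struct demo_lock, struct demo_waiter and demo_waiter_lock() are hypothetical names, a C11 atomic_flag stands in for the bucket spinlock, and the sketch assumes the locks are never freed, so it needs no equivalent of the RCU section. Unlike futex_q_lockptr_lock(), the helper returns the lock it ended up holding, purely to make the sketch easy to check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the hash bucket spinlock. */
struct demo_lock {
	atomic_flag locked;
};

/* Hypothetical stand-in for struct futex_q: only the lock pointer matters. */
struct demo_waiter {
	_Atomic(struct demo_lock *) lock_ptr;
};

static void demo_lock_acquire(struct demo_lock *l)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
		;
}

static void demo_lock_release(struct demo_lock *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/*
 * Lock whatever w->lock_ptr points to. If the pointer changed between
 * reading it and taking the lock, the lock we hold is stale: drop it
 * and start over with the new pointer.
 */
static struct demo_lock *demo_waiter_lock(struct demo_waiter *w)
{
	struct demo_lock *lock;

	assert(w != NULL);
	for (;;) {
		lock = atomic_load(&w->lock_ptr);
		demo_lock_acquire(lock);
		if (lock == atomic_load(&w->lock_ptr))
			return lock;
		demo_lock_release(lock);
	}
}
```

A caller would pair demo_waiter_lock() with demo_lock_release() on the returned pointer. The compare after locking is what makes the acquired lock trustworthy: a writer that moves the waiter must hold the old lock while updating lock_ptr, so a matching pointer observed under the lock cannot be stale.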
Acquire an additional reference in requeue_pi_wake_futex() and futex_unlock_pi() while the futex_q is removed, record this extra reference in futex_q::drop_hb_ref, and let the waiter drop the reference in this case. Signed-off-by: Sebastian Andrzej Siewior --- kernel/futex/core.c | 25 +++++++++++++++++++++++++ kernel/futex/futex.h | 3 ++- kernel/futex/pi.c | 15 +++++++++++++-- kernel/futex/requeue.c | 16 +++++++++++++--- 4 files changed, 53 insertions(+), 6 deletions(-) diff --git a/kernel/futex/core.c b/kernel/futex/core.c index 5e70cb8eb2507..1443a98dfa7fa 100644 --- a/kernel/futex/core.c +++ b/kernel/futex/core.c @@ -134,6 +134,13 @@ struct futex_hash_bucket *futex_hash(union futex_key *= key) return &futex_queues[hash & futex_hashmask]; } =20 +/** + * futex_hash_get - Get an additional reference for the local hash. + * @hb: ptr to the private local hash. + * + * Obtain an additional reference for the already obtained hash bucket. The + * caller must already own a reference. + */ void futex_hash_get(struct futex_hash_bucket *hb) { } void futex_hash_put(struct futex_hash_bucket *hb) { } =20 @@ -615,6 +622,24 @@ int futex_unqueue(struct futex_q *q) return ret; } =20 +void futex_q_lockptr_lock(struct futex_q *q) +{ + spinlock_t *lock_ptr; + + /* + * See futex_unqueue() why lock_ptr can change. + */ + guard(rcu)(); +retry: + lock_ptr =3D READ_ONCE(q->lock_ptr); + spin_lock(lock_ptr); + + if (unlikely(lock_ptr !=3D q->lock_ptr)) { + spin_unlock(lock_ptr); + goto retry; + } +} + /* * PI futexes can not be requeued and must remove themselves from the hash * bucket. The hash bucket lock (i.e. lock_ptr) is held.
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h index bc76e366f9a77..26e69333cb745 100644 --- a/kernel/futex/futex.h +++ b/kernel/futex/futex.h @@ -183,6 +183,7 @@ struct futex_q { union futex_key *requeue_pi_key; u32 bitset; atomic_t requeue_state; + bool drop_hb_ref; #ifdef CONFIG_PREEMPT_RT struct rcuwait requeue_wait; #endif @@ -197,7 +198,7 @@ enum futex_access { =20 extern int get_futex_key(u32 __user *uaddr, unsigned int flags, union fute= x_key *key, enum futex_access rw); - +extern void futex_q_lockptr_lock(struct futex_q *q); extern struct hrtimer_sleeper * futex_setup_timer(ktime_t *time, struct hrtimer_sleeper *timeout, int flags, u64 range_ns); diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c index e52f540e81b6a..dacb2330f1fbc 100644 --- a/kernel/futex/pi.c +++ b/kernel/futex/pi.c @@ -806,7 +806,7 @@ static int __fixup_pi_state_owner(u32 __user *uaddr, st= ruct futex_q *q, break; } =20 - spin_lock(q->lock_ptr); + futex_q_lockptr_lock(q); raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); =20 /* @@ -1072,7 +1072,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int fla= gs, ktime_t *time, int tryl * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up * the */ - spin_lock(q.lock_ptr); + futex_q_lockptr_lock(&q); /* * Waiter is unqueued. */ @@ -1092,6 +1092,11 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int fl= ags, ktime_t *time, int tryl =20 futex_unqueue_pi(&q); spin_unlock(q.lock_ptr); + if (q.drop_hb_ref) { + CLASS(hb, hb)(&q.key); + /* Additional reference from futex_unlock_pi() */ + futex_hash_put(hb); + } goto out; =20 out_unlock_put_key: @@ -1200,6 +1205,12 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int = flags) */ rt_waiter =3D rt_mutex_top_waiter(&pi_state->pi_mutex); if (!rt_waiter) { + /* + * Acquire a reference for the leaving waiter to ensure + * valid futex_q::lock_ptr. 
+ */ + futex_hash_get(hb); + top_waiter->drop_hb_ref =3D true; __futex_unqueue(top_waiter); raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); goto retry_hb; diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c index 023c028d2fce3..b0e64fd454d96 100644 --- a/kernel/futex/requeue.c +++ b/kernel/futex/requeue.c @@ -231,7 +231,12 @@ void requeue_pi_wake_futex(struct futex_q *q, union fu= tex_key *key, =20 WARN_ON(!q->rt_waiter); q->rt_waiter =3D NULL; - + /* + * Acquire a reference for the waiter to ensure valid + * futex_q::lock_ptr. + */ + futex_hash_get(hb); + q->drop_hb_ref =3D true; q->lock_ptr =3D &hb->lock; =20 /* Signal locked state to the waiter */ @@ -826,7 +831,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned i= nt flags, case Q_REQUEUE_PI_LOCKED: /* The requeue acquired the lock */ if (q.pi_state && (q.pi_state->owner !=3D current)) { - spin_lock(q.lock_ptr); + futex_q_lockptr_lock(&q); ret =3D fixup_pi_owner(uaddr2, &q, true); /* * Drop the reference to the pi state which the @@ -853,7 +858,7 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned i= nt flags, if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter)) ret =3D 0; =20 - spin_lock(q.lock_ptr); + futex_q_lockptr_lock(&q); debug_rt_mutex_free_waiter(&rt_waiter); /* * Fixup the pi_state owner and possibly acquire the lock if we @@ -885,6 +890,11 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned = int flags, default: BUG(); } + if (q.drop_hb_ref) { + CLASS(hb, hb)(&q.key); + /* Additional reference from requeue_pi_wake_futex() */ + futex_hash_put(hb); + } =20 out: if (to) { --=20 2.49.0 From nobody Fri Dec 19 02:59:33 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EBF7C218593 for ; Wed, 16 Apr 2025 16:29:32 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none 
smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820974; cv=none; b=ROdCKcFJeEPH/L/Dxiq6vr7mm9iil650SPXhAJdCJhdBf2GUPFiPKg1rcZDosA0xyllwROj5sY1lIfE0ogU103+b1onYhWynNQy10fPcP5fIhJjrDKBqMlMW4QdIYu2gq+R8m2AV8HkdKjee3OUs+p3N8saY5+f3Ve8N+m7HwOo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820974; c=relaxed/simple; bh=/HU00ieQBIUIR0aAgwtAN1WOYXF/cezNNGTBq+PfTj8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=nnDLGjFDDB0Cd2ct+IunblxU7YGPV341qO5c8EksXis3ZURATLrpHUEGD3/VcR0zNisjPQd8tCLP18ILb5T6C1cLS97r176gQyJNUtO7rlPwFNBFDE/UcsPeyMzaNZzL/xGqYH7VVSSaKMEwMDgQwXyQ5ezxrok01LLGjWG5diE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=NlaJcZrQ; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=DQNvwQ7m; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="NlaJcZrQ"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="DQNvwQ7m" From: Sebastian Andrzej Siewior DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1744820970; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=j6vUDPLxCGc9wD33fZGUNE2aAZ3Nc4AuJ6xxO4D9DEk=; b=NlaJcZrQScjEmlBuKYmTydsoEN07cDRv/EdvY9ds5/iFLZ97q+n4FOpRs8lWJ9UHyjnDeQ 
M3pD12pLgD/+WHPbFcctColDQb77Qcrwoi/GTPDZ7+FL8BdfokniLbyQKY91EGMQTO/uky 85xKW4oEyPhaxjgu7FVIk68NjLmEaqS5r+i+xKVa5PscdNghsYUwv0OJjecH9gVoXEL3+1 ziz8FpADwSx9blGP6homxKK+9LMSrW8NV0ZeUJ/zti/CDJ+Jx/9EJYK8C0Sa3tZPvC62f/ ubaIx2yH8kCu2742mQ/PHSGq3Q64cCYlDPeTTwAOoBFmh8mLKE4aBqTouv0BoQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1744820970; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=j6vUDPLxCGc9wD33fZGUNE2aAZ3Nc4AuJ6xxO4D9DEk=; b=DQNvwQ7mnpj5ycM87W5QO/RBkaCCuc+mNAE3LbXn0FTx9R2AkhRdeKVMJaeGIVEtAB1VAr rcj3ds4eQytpp7Dg== To: linux-kernel@vger.kernel.org Cc: =?UTF-8?q?Andr=C3=A9=20Almeida?= , Darren Hart , Davidlohr Bueso , Ingo Molnar , Juri Lelli , Peter Zijlstra , Thomas Gleixner , Valentin Schneider , Waiman Long , Sebastian Andrzej Siewior Subject: [PATCH v12 11/21] futex: Create helper function to initialize a hash slot Date: Wed, 16 Apr 2025 18:29:11 +0200 Message-ID: <20250416162921.513656-12-bigeasy@linutronix.de> In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de> References: <20250416162921.513656-1-bigeasy@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Factor out the futex_hash_bucket initialisation into a helper function. The helper function will be used in a follow-up patch implementing process private hash buckets.
Signed-off-by: Sebastian Andrzej Siewior --- kernel/futex/core.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/kernel/futex/core.c b/kernel/futex/core.c index 1443a98dfa7fa..afc66780f84fc 100644 --- a/kernel/futex/core.c +++ b/kernel/futex/core.c @@ -1160,6 +1160,13 @@ void futex_exit_release(struct task_struct *tsk) futex_cleanup_end(tsk, FUTEX_STATE_DEAD); } =20 +static void futex_hash_bucket_init(struct futex_hash_bucket *fhb) +{ + atomic_set(&fhb->waiters, 0); + plist_head_init(&fhb->chain); + spin_lock_init(&fhb->lock); +} + static int __init futex_init(void) { unsigned long hashsize, i; @@ -1177,11 +1184,8 @@ static int __init futex_init(void) hashsize, hashsize); hashsize =3D 1UL << futex_shift; =20 - for (i =3D 0; i < hashsize; i++) { - atomic_set(&futex_queues[i].waiters, 0); - plist_head_init(&futex_queues[i].chain); - spin_lock_init(&futex_queues[i].lock); - } + for (i =3D 0; i < hashsize; i++) + futex_hash_bucket_init(&futex_queues[i]); =20 futex_hashmask =3D hashsize - 1; return 0; --=20 2.49.0 From nobody Fri Dec 19 02:59:33 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EBF1B21771F for ; Wed, 16 Apr 2025 16:29:32 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820975; cv=none; b=NEVdZp23giED3TwWshZQ0VN/42S3YDxxpswHYXcQencqmC0gDfgQm3Z6TYcjO2zKzt/6MATQUjcDZj53U1wBBFeLkbxmef1Ivz4qNAp0DMfxWFymfclVrhuk4dFTi3NMkd5Iz8z91xvvsmF+YXEVYwdPV2Vvu2/JEkU+GSR/Hjc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820975; c=relaxed/simple; bh=N6yrzpAAQFKwXl4BfBp8AhSXc89KA6CVmJh+mBfxcsk=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; 
b=NEWdKrWuQFwS0O2Ogj+IRn/618q9/rrSzNKezcgQSCFRpjNLB0klhk7KYkRd1JZYUiw/8n4qeeOV0SwbFp3OUEdWavL5Q68Dnvxedi9y3xiQ3ZDeST1Ul+pjLFQgjMIvo3mlaDtSWg0X8eDCEa3vNfparL+Mcl8cRh9+Bn2807E= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=z/0EFdua; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=hmCqIHJJ; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="z/0EFdua"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="hmCqIHJJ" From: Sebastian Andrzej Siewior DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1744820971; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ECLPi4hPWp9ONs+iACcBH/xjjhCYhJom7ym/nPuUq8o=; b=z/0EFdua8GhQANWVHDIGXCM6+OzYbZCJDt81KvEGw89s48GcK7ylNN3N7KbDVvtXR0x6om J8SHw1M0L9QuYl4AdRy0zOASZFeiDcLXv64vBz2jyAkybsQGqMON67Ze9o4lgAciSllms2 A3O2LrU+CA4AjX8WO+MvVf98YdnV2+/o7+BsdgKDDgal8djVFW9ReGaHMy9WbppCTtUmjN EDZdJeBoqqa6a7mrXt5XHq3dZvIWdIJJg4ehFUY1ISaWANBL9HAs5gcJs3IPCA/NmEMDN5 18d/0VF7hLGOsO366xxeiRpC/EVA3mr2hHTyjQrJgPI91AHiSbMiIebiaJWSUw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1744820971; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: 
in-reply-to:in-reply-to:references:references; bh=ECLPi4hPWp9ONs+iACcBH/xjjhCYhJom7ym/nPuUq8o=; b=hmCqIHJJnfM9Rhi5rMSLKXDF1+8UY7goK1Y8mwOk+kphIFE6xs4b6HrzNYAosfZn9icrfR ecB8IJ2Jmio5sJCQ== To: linux-kernel@vger.kernel.org Cc: =?UTF-8?q?Andr=C3=A9=20Almeida?= , Darren Hart , Davidlohr Bueso , Ingo Molnar , Juri Lelli , Peter Zijlstra , Thomas Gleixner , Valentin Schneider , Waiman Long , Sebastian Andrzej Siewior Subject: [PATCH v12 12/21] futex: Add basic infrastructure for local task local hash Date: Wed, 16 Apr 2025 18:29:12 +0200 Message-ID: <20250416162921.513656-13-bigeasy@linutronix.de> In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de> References: <20250416162921.513656-1-bigeasy@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The futex hash is system wide and shared by all tasks. Each slot is hashed based on the futex address and the VMA of the thread. Due to randomized VMAs (and memory allocations) the same logical lock (pointer) can end up in a different hash bucket on each invocation of the application. This in turn means that different applications may share a hash bucket on the first invocation but not on the second, and it is not always clear which applications will be involved. This can result in high latencies when acquiring futex_hash_bucket::lock, especially if the lock owner is limited to a CPU and cannot be effectively PI boosted. Introduce basic infrastructure for a process local hash which is shared by all threads of a process. This hash will only be used for a PROCESS_PRIVATE FUTEX operation. The hashmap can be allocated via prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, num) A `num' of 0 means that the global hash is used instead of a private hash. Other values for `num' specify the number of slots for the hash and the number must be a power of two, starting with two.
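As a rough user-space sketch of the allocation call described above: the constants are copied from the uapi hunk of this patch and are only defined here in case the installed headers predate it. demo_slots_valid() and demo_futex_hash_set_slots() are hypothetical helper names, not part of any library; the first mirrors the stated rule for `num', the second wraps the prctl() and reports a negative errno on failure. On a kernel without this series the prctl() itself is expected to fail.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/prctl.h>

/* Values taken from the include/uapi/linux/prctl.h hunk of this patch. */
#ifndef PR_FUTEX_HASH
# define PR_FUTEX_HASH			78
# define PR_FUTEX_HASH_SET_SLOTS	1
# define PR_FUTEX_HASH_GET_SLOTS	2
#endif

/* Hypothetical helper: 0 selects the global hash, else a power of two >= 2. */
static bool demo_slots_valid(unsigned long num)
{
	return num == 0 || (num >= 2 && (num & (num - 1)) == 0);
}

/*
 * Hypothetical wrapper: request `num' private hash slots for the calling
 * process. Returns 0 on success or a negative errno value.
 */
static int demo_futex_hash_set_slots(unsigned long num)
{
	/* Reject obviously invalid values without entering the kernel. */
	if (!demo_slots_valid(num))
		return -EINVAL;

	/* The fourth prctl() argument must be 0 for this operation. */
	if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, num, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}
```

A process would call demo_futex_hash_set_slots(16) early in main(), for example, and fall back to the global hash if the call returns an error.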
The prctl() returns zero on success. This function can only be used before a thread is created. The current status for the private hash can be queried via num =3D prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS); which returns the current number of slots. The value 0 means that the global hash is used. Values greater than 0 indicate the number of slots that are used. A negative number indicates an error. As an optimisation, jhash2() for the private hash uses only two arguments: the address and the offset. This omits the VMA which is always the same. [peterz: Use 0 for global hash. A bit shuffling and renaming. ] Signed-off-by: Sebastian Andrzej Siewior --- include/linux/futex.h | 26 ++++- include/linux/mm_types.h | 5 +- include/uapi/linux/prctl.h | 5 + init/Kconfig | 5 + kernel/fork.c | 2 + kernel/futex/core.c | 208 +++++++++++++++++++++++++++++++++---- kernel/futex/futex.h | 10 ++ kernel/sys.c | 4 + 8 files changed, 244 insertions(+), 21 deletions(-) diff --git a/include/linux/futex.h b/include/linux/futex.h index b70df27d7e85c..8f1be08bef18d 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -4,11 +4,11 @@ =20 #include #include +#include =20 #include =20 struct inode; -struct mm_struct; struct task_struct; =20 /* @@ -77,7 +77,22 @@ void futex_exec_release(struct task_struct *tsk); =20 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, u32 __user *uaddr2, u32 val2, u32 val3); -#else +int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long= arg4); + +#ifdef CONFIG_FUTEX_PRIVATE_HASH +void futex_hash_free(struct mm_struct *mm); + +static inline void futex_mm_init(struct mm_struct *mm) +{ + mm->futex_phash =3D NULL; +} + +#else /* !CONFIG_FUTEX_PRIVATE_HASH */ +static inline void futex_hash_free(struct mm_struct *mm) { } +static inline void futex_mm_init(struct mm_struct *mm) { } +#endif /* CONFIG_FUTEX_PRIVATE_HASH */ + +#else /* !CONFIG_FUTEX */ static inline void futex_init_task(struct task_struct *tsk) { } static inline
void futex_exit_recursive(struct task_struct *tsk) { } static inline void futex_exit_release(struct task_struct *tsk) { } @@ -88,6 +103,13 @@ static inline long do_futex(u32 __user *uaddr, int op, = u32 val, { return -EINVAL; } +static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3,= unsigned long arg4) +{ + return -EINVAL; +} +static inline void futex_hash_free(struct mm_struct *mm) { } +static inline void futex_mm_init(struct mm_struct *mm) { } + #endif =20 #endif diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 56d07edd01f91..a4b5661e41770 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -31,6 +31,7 @@ #define INIT_PASID 0 =20 struct address_space; +struct futex_private_hash; struct mem_cgroup; =20 /* @@ -1031,7 +1032,9 @@ struct mm_struct { */ seqcount_t mm_lock_seq; #endif - +#ifdef CONFIG_FUTEX_PRIVATE_HASH + struct futex_private_hash *futex_phash; +#endif =20 unsigned long hiwater_rss; /* High-watermark of RSS usage */ unsigned long hiwater_vm; /* High-water virtual memory usage */ diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 15c18ef4eb11a..3b93fb906e3c5 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -364,4 +364,9 @@ struct prctl_mm_map { # define PR_TIMER_CREATE_RESTORE_IDS_ON 1 # define PR_TIMER_CREATE_RESTORE_IDS_GET 2 =20 +/* FUTEX hash management */ +#define PR_FUTEX_HASH 78 +# define PR_FUTEX_HASH_SET_SLOTS 1 +# define PR_FUTEX_HASH_GET_SLOTS 2 + #endif /* _LINUX_PRCTL_H */ diff --git a/init/Kconfig b/init/Kconfig index dd2ea3b9a7992..b308b98d79347 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1699,6 +1699,11 @@ config FUTEX_PI depends on FUTEX && RT_MUTEXES default y =20 +config FUTEX_PRIVATE_HASH + bool + depends on FUTEX && !BASE_SMALL && MMU + default y + config EPOLL bool "Enable eventpoll support" if EXPERT default y diff --git a/kernel/fork.c b/kernel/fork.c index c4b26cd8998b8..831dfec450544 100644 --- a/kernel/fork.c 
+++ b/kernel/fork.c @@ -1305,6 +1305,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm= , struct task_struct *p, RCU_INIT_POINTER(mm->exe_file, NULL); mmu_notifier_subscriptions_init(mm); init_tlb_flush_pending(mm); + futex_mm_init(mm); #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLO= CKS) mm->pmd_huge_pte =3D NULL; #endif @@ -1387,6 +1388,7 @@ static inline void __mmput(struct mm_struct *mm) if (mm->binfmt) module_put(mm->binfmt->module); lru_gen_del_mm(mm); + futex_hash_free(mm); mmdrop(mm); } =20 diff --git a/kernel/futex/core.c b/kernel/futex/core.c index afc66780f84fc..818df7420a1a9 100644 --- a/kernel/futex/core.c +++ b/kernel/futex/core.c @@ -39,6 +39,7 @@ #include #include #include +#include =20 #include "futex.h" #include "../locking/rtmutex_common.h" @@ -55,6 +56,12 @@ static struct { #define futex_queues (__futex_data.queues) #define futex_hashmask (__futex_data.hashmask) =20 +struct futex_private_hash { + unsigned int hash_mask; + void *mm; + bool custom; + struct futex_hash_bucket queues[]; +}; =20 /* * Fault injections for futexes. @@ -107,9 +114,17 @@ late_initcall(fail_futex_debugfs); =20 #endif /* CONFIG_FAIL_FUTEX */ =20 -struct futex_private_hash *futex_private_hash(void) +static struct futex_hash_bucket * +__futex_hash(union futex_key *key, struct futex_private_hash *fph); + +#ifdef CONFIG_FUTEX_PRIVATE_HASH +static inline bool futex_key_is_private(union futex_key *key) { - return NULL; + /* + * Relies on get_futex_key() to set either bit for shared + * futexes -- see comment with union futex_key. 
+ */ + return !(key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED)); } =20 bool futex_private_hash_get(struct futex_private_hash *fph) @@ -117,21 +132,8 @@ bool futex_private_hash_get(struct futex_private_hash = *fph) return false; } =20 -void futex_private_hash_put(struct futex_private_hash *fph) { } - -/** - * futex_hash - Return the hash bucket in the global hash - * @key: Pointer to the futex key for which the hash is calculated - * - * We hash on the keys returned from get_futex_key (see below) and return = the - * corresponding hash bucket in the global hash. - */ -struct futex_hash_bucket *futex_hash(union futex_key *key) +void futex_private_hash_put(struct futex_private_hash *fph) { - u32 hash =3D jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4, - key->both.offset); - - return &futex_queues[hash & futex_hashmask]; } =20 /** @@ -144,6 +146,84 @@ struct futex_hash_bucket *futex_hash(union futex_key *= key) void futex_hash_get(struct futex_hash_bucket *hb) { } void futex_hash_put(struct futex_hash_bucket *hb) { } =20 +static struct futex_hash_bucket * +__futex_hash_private(union futex_key *key, struct futex_private_hash *fph) +{ + u32 hash; + + if (!futex_key_is_private(key)) + return NULL; + + if (!fph) + fph =3D key->private.mm->futex_phash; + if (!fph || !fph->hash_mask) + return NULL; + + hash =3D jhash2((void *)&key->private.address, + sizeof(key->private.address) / 4, + key->both.offset); + return &fph->queues[hash & fph->hash_mask]; +} + +struct futex_private_hash *futex_private_hash(void) +{ + struct mm_struct *mm =3D current->mm; + struct futex_private_hash *fph; + + fph =3D mm->futex_phash; + return fph; +} + +struct futex_hash_bucket *futex_hash(union futex_key *key) +{ + struct futex_hash_bucket *hb; + + hb =3D __futex_hash(key, NULL); + return hb; +} + +#else /* !CONFIG_FUTEX_PRIVATE_HASH */ + +static struct futex_hash_bucket * +__futex_hash_private(union futex_key *key, struct futex_private_hash *fph) +{ + return NULL; +} + +struct 
futex_hash_bucket *futex_hash(union futex_key *key) +{ + return __futex_hash(key, NULL); +} + +#endif /* CONFIG_FUTEX_PRIVATE_HASH */ + +/** + * __futex_hash - Return the hash bucket + * @key: Pointer to the futex key for which the hash is calculated + * @fph: Pointer to private hash if known + * + * We hash on the keys returned from get_futex_key (see below) and return = the + * corresponding hash bucket. + * If the FUTEX is PROCESS_PRIVATE then a per-process hash bucket (from the + * private hash) is returned if existing. Otherwise a hash bucket from the + * global hash is returned. + */ +static struct futex_hash_bucket * +__futex_hash(union futex_key *key, struct futex_private_hash *fph) +{ + struct futex_hash_bucket *hb; + u32 hash; + + hb =3D __futex_hash_private(key, fph); + if (hb) + return hb; + + hash =3D jhash2((u32 *)key, + offsetof(typeof(*key), both.offset) / 4, + key->both.offset); + return &futex_queues[hash & futex_hashmask]; +} + /** * futex_setup_timer - set up the sleeping hrtimer. * @time: ptr to the given timeout value @@ -985,6 +1065,13 @@ static void exit_pi_state_list(struct task_struct *cu= rr) struct futex_pi_state *pi_state; union futex_key key =3D FUTEX_KEY_INIT; =20 + /* + * Ensure the hash remains stable (no resize) during the while loop + * below. The hb pointer is acquired under the pi_lock so we can't block + * on the mutex. 
+ */ + WARN_ON(curr !=3D current); + guard(private_hash)(); /* * We are a ZOMBIE and nobody can enqueue itself on * pi_state_list anymore, but we have to be careful @@ -1160,13 +1247,98 @@ void futex_exit_release(struct task_struct *tsk) futex_cleanup_end(tsk, FUTEX_STATE_DEAD); } =20 -static void futex_hash_bucket_init(struct futex_hash_bucket *fhb) +static void futex_hash_bucket_init(struct futex_hash_bucket *fhb, + struct futex_private_hash *fph) { +#ifdef CONFIG_FUTEX_PRIVATE_HASH + fhb->priv =3D fph; +#endif atomic_set(&fhb->waiters, 0); plist_head_init(&fhb->chain); spin_lock_init(&fhb->lock); } =20 +#ifdef CONFIG_FUTEX_PRIVATE_HASH +void futex_hash_free(struct mm_struct *mm) +{ + kvfree(mm->futex_phash); +} + +static int futex_hash_allocate(unsigned int hash_slots, bool custom) +{ + struct mm_struct *mm =3D current->mm; + struct futex_private_hash *fph; + int i; + + if (hash_slots && (hash_slots =3D=3D 1 || !is_power_of_2(hash_slots))) + return -EINVAL; + + if (mm->futex_phash) + return -EALREADY; + + if (!thread_group_empty(current)) + return -EINVAL; + + fph =3D kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT= | __GFP_NOWARN); + if (!fph) + return -ENOMEM; + + fph->hash_mask =3D hash_slots ? 
hash_slots - 1 : 0; + fph->custom =3D custom; + fph->mm =3D mm; + + for (i =3D 0; i < hash_slots; i++) + futex_hash_bucket_init(&fph->queues[i], fph); + + mm->futex_phash =3D fph; + return 0; +} + +static int futex_hash_get_slots(void) +{ + struct futex_private_hash *fph; + + fph =3D current->mm->futex_phash; + if (fph && fph->hash_mask) + return fph->hash_mask + 1; + return 0; +} + +#else + +static int futex_hash_allocate(unsigned int hash_slots, bool custom) +{ + return -EINVAL; +} + +static int futex_hash_get_slots(void) +{ + return 0; +} +#endif + +int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long= arg4) +{ + int ret; + + switch (arg2) { + case PR_FUTEX_HASH_SET_SLOTS: + if (arg4 !=3D 0) + return -EINVAL; + ret =3D futex_hash_allocate(arg3, true); + break; + + case PR_FUTEX_HASH_GET_SLOTS: + ret =3D futex_hash_get_slots(); + break; + + default: + ret =3D -EINVAL; + break; + } + return ret; +} + static int __init futex_init(void) { unsigned long hashsize, i; @@ -1185,7 +1357,7 @@ static int __init futex_init(void) hashsize =3D 1UL << futex_shift; =20 for (i =3D 0; i < hashsize; i++) - futex_hash_bucket_init(&futex_queues[i]); + futex_hash_bucket_init(&futex_queues[i], NULL); =20 futex_hashmask =3D hashsize - 1; return 0; diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h index 26e69333cb745..899aed5acde12 100644 --- a/kernel/futex/futex.h +++ b/kernel/futex/futex.h @@ -118,6 +118,7 @@ struct futex_hash_bucket { atomic_t waiters; spinlock_t lock; struct plist_head chain; + struct futex_private_hash *priv; } ____cacheline_aligned_in_smp; =20 /* @@ -204,6 +205,7 @@ futex_setup_timer(ktime_t *time, struct hrtimer_sleeper= *timeout, int flags, u64 range_ns); =20 extern struct futex_hash_bucket *futex_hash(union futex_key *key); +#ifdef CONFIG_FUTEX_PRIVATE_HASH extern void futex_hash_get(struct futex_hash_bucket *hb); extern void futex_hash_put(struct futex_hash_bucket *hb); =20 @@ -211,6 +213,14 @@ extern struct futex_private_hash 
*futex_private_hash(void);
 extern bool futex_private_hash_get(struct futex_private_hash *fph);
 extern void futex_private_hash_put(struct futex_private_hash *fph);
 
+#else /* !CONFIG_FUTEX_PRIVATE_HASH */
+static inline void futex_hash_get(struct futex_hash_bucket *hb) { }
+static inline void futex_hash_put(struct futex_hash_bucket *hb) { }
+static inline struct futex_private_hash *futex_private_hash(void) { return NULL; }
+static inline bool futex_private_hash_get(void) { return false; }
+static inline void futex_private_hash_put(struct futex_private_hash *fph) { }
+#endif
+
 DEFINE_CLASS(hb, struct futex_hash_bucket *,
 	     if (_T) futex_hash_put(_T),
 	     futex_hash(key), union futex_key *key);
diff --git a/kernel/sys.c b/kernel/sys.c
index c434968e9f5dd..adc0de0aa364a 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -52,6 +52,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -2820,6 +2821,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINVAL;
 		error = posixtimer_create_prctl(arg2);
 		break;
+	case PR_FUTEX_HASH:
+		error = futex_hash_prctl(arg2, arg3, arg4);
+		break;
 	default:
 		trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
 		error = -EINVAL;
-- 
2.49.0

From nobody Fri Dec 19 02:59:33 2025
From: Sebastian Andrzej Siewior
To: linux-kernel@vger.kernel.org
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider, Waiman Long, Sebastian Andrzej Siewior
Subject: [PATCH v12 13/21] futex: Allow automatic allocation of process wide futex hash
Date: Wed, 16 Apr 2025 18:29:13 +0200
Message-ID: <20250416162921.513656-14-bigeasy@linutronix.de>
In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de>
References: <20250416162921.513656-1-bigeasy@linutronix.de>

Allocate a private futex hash with 16 slots if a task forks its first
thread.

Signed-off-by: Sebastian Andrzej Siewior
---
 include/linux/futex.h |  6 ++++++
 kernel/fork.c         | 22 ++++++++++++++++++++++
 kernel/futex/core.c   | 11 +++++++++++
 3 files changed, 39 insertions(+)

diff --git a/include/linux/futex.h b/include/linux/futex.h
index 8f1be08bef18d..1d3f7555825ec 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -80,6 +80,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4);
 
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
+int futex_hash_allocate_default(void);
 void futex_hash_free(struct mm_struct *mm);
 
 static inline void futex_mm_init(struct mm_struct *mm)
@@ -88,6 +89,7 @@ static inline void futex_mm_init(struct mm_struct *mm)
 }
 
 #else /* !CONFIG_FUTEX_PRIVATE_HASH */
+static inline int futex_hash_allocate_default(void) { return 0; }
 static inline void futex_hash_free(struct mm_struct *mm) { }
 static inline void futex_mm_init(struct mm_struct *mm) { }
 #endif /* CONFIG_FUTEX_PRIVATE_HASH */
@@ -107,6 +109,10 @@ static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsig
 {
 	return -EINVAL;
 }
+static inline int futex_hash_allocate_default(void)
+{
+	return 0;
+}
 static inline void futex_hash_free(struct mm_struct *mm) { }
 static inline void futex_mm_init(struct mm_struct *mm) { }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 831dfec450544..1f5d8083eeb25 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2164,6 +2164,13 @@ static void rv_task_fork(struct task_struct *p)
 #define rv_task_fork(p) do {} while (0)
 #endif
 
+static bool need_futex_hash_allocate_default(u64 clone_flags)
+{
+	if ((clone_flags & (CLONE_THREAD | CLONE_VM)) != (CLONE_THREAD | CLONE_VM))
+		return false;
+	return true;
+}
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
@@ -2544,6 +2551,21 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cancel_cgroup;
 
+	/*
+	 * Allocate a default futex hash for the user process once the first
+	 * thread spawns.
+	 */
+	if (need_futex_hash_allocate_default(clone_flags)) {
+		retval = futex_hash_allocate_default();
+		if (retval)
+			goto bad_fork_core_free;
+		/*
+		 * If we fail beyond this point we don't free the allocated
+		 * futex hash map. We assume that another thread will be created
+		 * and makes use of it. The hash map will be freed once the main
+		 * thread terminates.
+		 */
+	}
 	/*
 	 * From this point on we must avoid any synchronous user-space
 	 * communication until we take the tasklist-lock. In particular, we do
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 818df7420a1a9..53b3a00a92539 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1294,6 +1294,17 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 	return 0;
 }
 
+int futex_hash_allocate_default(void)
+{
+	if (!current->mm)
+		return 0;
+
+	if (current->mm->futex_phash)
+		return 0;
+
+	return futex_hash_allocate(16, false);
+}
+
 static int futex_hash_get_slots(void)
 {
 	struct futex_private_hash *fph;
-- 
2.49.0

From nobody Fri Dec 19 02:59:33 2025
From: Sebastian Andrzej Siewior
To: linux-kernel@vger.kernel.org
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider, Waiman Long, Sebastian Andrzej Siewior
Subject: [PATCH v12 14/21] futex: Allow to resize the private local hash
Date: Wed, 16 Apr 2025 18:29:14 +0200
Message-ID: <20250416162921.513656-15-bigeasy@linutronix.de>
In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de>
References: <20250416162921.513656-1-bigeasy@linutronix.de>

The mm_struct::futex_hash_lock guards the futex_hash_bucket assignment/
replacement. The futex_hash_allocate()/PR_FUTEX_HASH_SET_SLOTS operation
can now be invoked at runtime to resize an already existing internal
private futex_hash_bucket to another size.

The reallocation is based on an idea by Thomas Gleixner: The initial
allocation of struct futex_private_hash sets the reference count to one.
Every user acquires a reference on the local hash before using it and
drops it after it has enqueued itself on the hash bucket. No reference is
held while the task is scheduled out waiting for the wake up.

The resize process allocates a new struct futex_private_hash and drops
the initial reference. Synchronized with mm_struct::futex_hash_lock, it
is checked whether the reference counter for the currently used
mm_struct::futex_phash is marked as DEAD. If so, all users enqueued on
the current private hash are requeued on the new private hash and the new
private hash is assigned to mm_struct::futex_phash. Otherwise the newly
allocated private hash is saved as mm_struct::futex_phash_new and the
rehashing and reassigning is delayed to the futex_hash() caller once the
reference counter is marked DEAD.

The replacement is not performed at rcuref_put() time because certain
callers, such as futex_wait_queue(), drop their reference after changing
the task state. That task state change would be lost as soon as the
futex_hash_lock is acquired.

The user can change the number of slots with PR_FUTEX_HASH_SET_SLOTS
multiple times. Both an increase and a decrease are allowed, and the
request blocks until the assignment is done.

The private hash allocated at thread creation is changed from 16 to
	16 <= 4 * number_of_threads <= global_hash_size
where number_of_threads cannot exceed the number of online CPUs. Should
the user invoke PR_FUTEX_HASH_SET_SLOTS, the auto scaling is disabled.

[peterz: reorganize the code to avoid state tracking and simplify new
object handling, block the user until changes are in effect, allow
increase and decrease of the hash].

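The default sizing rule stated above (16 <= 4 * number_of_threads <=
global_hash_size, with number_of_threads capped at the number of online
CPUs) can be sketched in userspace C. This is only an illustration of
the arithmetic, not the kernel code: default_buckets() and
round_up_pow2() are hypothetical names, and the global hash size is
passed in as a plain parameter.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round v up to the next power of two (v >= 1). Userspace stand-in for
 * the rounding step; hypothetical helper, not a kernel function.
 */
static uint32_t round_up_pow2(uint32_t v)
{
	uint32_t p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/*
 * Hypothetical helper: number of buckets the default allocation would
 * ask for, following 16 <= 4 * threads <= global_hash_size.
 */
static uint32_t default_buckets(uint32_t nr_threads, uint32_t online_cpus,
				uint32_t global_hash_size)
{
	uint32_t threads, buckets;

	/* Assumed preconditions for this sketch. */
	assert(nr_threads >= 1 && online_cpus >= 1);
	assert(global_hash_size >= 16);

	/* number_of_threads cannot exceed the number of online CPUs. */
	threads = nr_threads < online_cpus ? nr_threads : online_cpus;

	buckets = round_up_pow2(4 * threads);
	if (buckets < 16)
		buckets = 16;
	if (buckets > global_hash_size)
		buckets = global_hash_size;
	return buckets;
}
```

Under these assumptions a process with at least 32 threads on a 32 CPU
machine ends up with 128 buckets, and a single extra thread still gets
the minimum of 16.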
Signed-off-by: Sebastian Andrzej Siewior
Reported-by: "Lai, Yi"
Tested-by: "Lai, Yi"
---
 include/linux/futex.h    |   3 +-
 include/linux/mm_types.h |   4 +-
 kernel/futex/core.c      | 290 ++++++++++++++++++++++++++++++++++++---
 kernel/futex/requeue.c   |   5 +
 4 files changed, 281 insertions(+), 21 deletions(-)

diff --git a/include/linux/futex.h b/include/linux/futex.h
index 1d3f7555825ec..40bc778b2bb45 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -85,7 +85,8 @@ void futex_hash_free(struct mm_struct *mm);
 
 static inline void futex_mm_init(struct mm_struct *mm)
 {
-	mm->futex_phash = NULL;
+	rcu_assign_pointer(mm->futex_phash, NULL);
+	mutex_init(&mm->futex_hash_lock);
 }
 
 #else /* !CONFIG_FUTEX_PRIVATE_HASH */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a4b5661e41770..32ba5126e2214 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1033,7 +1033,9 @@ struct mm_struct {
 		seqcount_t mm_lock_seq;
 #endif
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
-		struct futex_private_hash	*futex_phash;
+		struct mutex			futex_hash_lock;
+		struct futex_private_hash	__rcu *futex_phash;
+		struct futex_private_hash	*futex_phash_new;
 #endif
 
 		unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 53b3a00a92539..9e7dad52abea8 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -40,6 +40,7 @@
 #include
 #include
 #include
+#include
 
 #include "futex.h"
 #include "../locking/rtmutex_common.h"
@@ -57,7 +58,9 @@ static struct {
 #define futex_hashmask		(__futex_data.hashmask)
 
 struct futex_private_hash {
+	rcuref_t	users;
 	unsigned int	hash_mask;
+	struct rcu_head	rcu;
 	void		*mm;
 	bool		custom;
 	struct futex_hash_bucket queues[];
@@ -129,11 +132,14 @@ static inline bool futex_key_is_private(union futex_key *key)
 
 bool futex_private_hash_get(struct futex_private_hash *fph)
 {
-	return false;
+	return rcuref_get(&fph->users);
 }
 
 void futex_private_hash_put(struct futex_private_hash *fph)
 {
+	/* Ignore return value, last put is verified via rcuref_is_dead() */
+	if (rcuref_put(&fph->users))
+		wake_up_var(fph->mm);
 }
 
 /**
@@ -143,8 +149,23 @@ void futex_private_hash_put(struct futex_private_hash *fph)
  * Obtain an additional reference for the already obtained hash bucket. The
  * caller must already own an reference.
  */
-void futex_hash_get(struct futex_hash_bucket *hb) { }
-void futex_hash_put(struct futex_hash_bucket *hb) { }
+void futex_hash_get(struct futex_hash_bucket *hb)
+{
+	struct futex_private_hash *fph = hb->priv;
+
+	if (!fph)
+		return;
+	WARN_ON_ONCE(!futex_private_hash_get(fph));
+}
+
+void futex_hash_put(struct futex_hash_bucket *hb)
+{
+	struct futex_private_hash *fph = hb->priv;
+
+	if (!fph)
+		return;
+	futex_private_hash_put(fph);
+}
 
 static struct futex_hash_bucket *
 __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
@@ -155,7 +176,7 @@ __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
 		return NULL;
 
 	if (!fph)
-		fph = key->private.mm->futex_phash;
+		fph = rcu_dereference(key->private.mm->futex_phash);
 	if (!fph || !fph->hash_mask)
 		return NULL;
 
@@ -165,21 +186,119 @@ __futex_hash_private(union futex_key *key, struct futex_private_hash *fph)
 	return &fph->queues[hash & fph->hash_mask];
 }
 
+static void futex_rehash_private(struct futex_private_hash *old,
+				 struct futex_private_hash *new)
+{
+	struct futex_hash_bucket *hb_old, *hb_new;
+	unsigned int slots = old->hash_mask + 1;
+	unsigned int i;
+
+	for (i = 0; i < slots; i++) {
+		struct futex_q *this, *tmp;
+
+		hb_old = &old->queues[i];
+
+		spin_lock(&hb_old->lock);
+		plist_for_each_entry_safe(this, tmp, &hb_old->chain, list) {
+
+			plist_del(&this->list, &hb_old->chain);
+			futex_hb_waiters_dec(hb_old);
+
+			WARN_ON_ONCE(this->lock_ptr != &hb_old->lock);
+
+			hb_new = __futex_hash(&this->key, new);
+			futex_hb_waiters_inc(hb_new);
+			/*
+			 * The new pointer isn't published yet but an already
+			 * moved user can be unqueued due to timeout or signal.
+			 */
+			spin_lock_nested(&hb_new->lock, SINGLE_DEPTH_NESTING);
+			plist_add(&this->list, &hb_new->chain);
+			this->lock_ptr = &hb_new->lock;
+			spin_unlock(&hb_new->lock);
+		}
+		spin_unlock(&hb_old->lock);
+	}
+}
+
+static bool __futex_pivot_hash(struct mm_struct *mm,
+			       struct futex_private_hash *new)
+{
+	struct futex_private_hash *fph;
+
+	WARN_ON_ONCE(mm->futex_phash_new);
+
+	fph = rcu_dereference_protected(mm->futex_phash,
+					lockdep_is_held(&mm->futex_hash_lock));
+	if (fph) {
+		if (!rcuref_is_dead(&fph->users)) {
+			mm->futex_phash_new = new;
+			return false;
+		}
+
+		futex_rehash_private(fph, new);
+	}
+	rcu_assign_pointer(mm->futex_phash, new);
+	kvfree_rcu(fph, rcu);
+	return true;
+}
+
+static void futex_pivot_hash(struct mm_struct *mm)
+{
+	scoped_guard(mutex, &mm->futex_hash_lock) {
+		struct futex_private_hash *fph;
+
+		fph = mm->futex_phash_new;
+		if (fph) {
+			mm->futex_phash_new = NULL;
+			__futex_pivot_hash(mm, fph);
+		}
+	}
+}
+
 struct futex_private_hash *futex_private_hash(void)
 {
 	struct mm_struct *mm = current->mm;
-	struct futex_private_hash *fph;
+	/*
+	 * Ideally we don't loop. If there is a replacement in progress
+	 * then a new private hash is already prepared and a reference can't be
+	 * obtained once the last user dropped its.
+	 * In that case we block on mm_struct::futex_hash_lock and either have
+	 * to perform the replacement or wait while someone else is doing the
+	 * job. Either way, on the second iteration we acquire a reference on the
+	 * new private hash or loop again because a new replacement has been
+	 * requested.
+	 */
+again:
+	scoped_guard(rcu) {
+		struct futex_private_hash *fph;
 
-	fph = mm->futex_phash;
-	return fph;
+		fph = rcu_dereference(mm->futex_phash);
+		if (!fph)
+			return NULL;
+
+		if (rcuref_get(&fph->users))
+			return fph;
+	}
+	futex_pivot_hash(mm);
+	goto again;
 }
 
 struct futex_hash_bucket *futex_hash(union futex_key *key)
 {
+	struct futex_private_hash *fph;
 	struct futex_hash_bucket *hb;
 
-	hb = __futex_hash(key, NULL);
-	return hb;
+again:
+	scoped_guard(rcu) {
+		hb = __futex_hash(key, NULL);
+		fph = hb->priv;
+
+		if (!fph || futex_private_hash_get(fph))
+			return hb;
+	}
+	futex_pivot_hash(key->private.mm);
+	goto again;
 }
 
 #else /* !CONFIG_FUTEX_PRIVATE_HASH */
@@ -664,6 +783,8 @@ int futex_unqueue(struct futex_q *q)
 	spinlock_t *lock_ptr;
 	int ret = 0;
 
+	/* RCU so lock_ptr is not going away during locking. */
+	guard(rcu)();
 	/* In the common case we don't take the spinlock, which is nice. */
 retry:
 	/*
@@ -1065,6 +1186,10 @@ static void exit_pi_state_list(struct task_struct *curr)
 	struct futex_pi_state *pi_state;
 	union futex_key key = FUTEX_KEY_INIT;
 
+	/*
+	 * The mutex mm_struct::futex_hash_lock might be acquired.
+	 */
+	might_sleep();
 	/*
 	 * Ensure the hash remains stable (no resize) during the while loop
 	 * below. The hb pointer is acquired under the pi_lock so we can't block
@@ -1261,7 +1386,51 @@ static void futex_hash_bucket_init(struct futex_hash_bucket *fhb,
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
 void futex_hash_free(struct mm_struct *mm)
 {
-	kvfree(mm->futex_phash);
+	struct futex_private_hash *fph;
+
+	kvfree(mm->futex_phash_new);
+	fph = rcu_dereference_raw(mm->futex_phash);
+	if (fph) {
+		WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
+		kvfree(fph);
+	}
+}
+
+static bool futex_pivot_pending(struct mm_struct *mm)
+{
+	struct futex_private_hash *fph;
+
+	guard(rcu)();
+
+	if (!mm->futex_phash_new)
+		return true;
+
+	fph = rcu_dereference(mm->futex_phash);
+	return rcuref_is_dead(&fph->users);
+}
+
+static bool futex_hash_less(struct futex_private_hash *a,
+			    struct futex_private_hash *b)
+{
+	/* user provided always wins */
+	if (!a->custom && b->custom)
+		return true;
+	if (a->custom && !b->custom)
+		return false;
+
+	/* zero-sized hash wins */
+	if (!b->hash_mask)
+		return true;
+	if (!a->hash_mask)
+		return false;
+
+	/* keep the biggest */
+	if (a->hash_mask < b->hash_mask)
+		return true;
+	if (a->hash_mask > b->hash_mask)
+		return false;
+
+	return false; /* equal */
 }
 
 static int futex_hash_allocate(unsigned int hash_slots, bool custom)
@@ -1273,16 +1442,23 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 	if (hash_slots && (hash_slots == 1 || !is_power_of_2(hash_slots)))
 		return -EINVAL;
 
-	if (mm->futex_phash)
-		return -EALREADY;
-
-	if (!thread_group_empty(current))
-		return -EINVAL;
+	/*
+	 * Once we've disabled the global hash there is no way back.
+	 */
+	scoped_guard(rcu) {
+		fph = rcu_dereference(mm->futex_phash);
+		if (fph && !fph->hash_mask) {
+			if (custom)
+				return -EBUSY;
+			return 0;
+		}
+	}
 
 	fph = kvzalloc(struct_size(fph, queues, hash_slots), GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
 	if (!fph)
 		return -ENOMEM;
 
+	rcuref_init(&fph->users, 1);
 	fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
 	fph->custom = custom;
 	fph->mm = mm;
@@ -1290,26 +1466,102 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 	for (i = 0; i < hash_slots; i++)
 		futex_hash_bucket_init(&fph->queues[i], fph);
 
-	mm->futex_phash = fph;
+	if (custom) {
+		/*
+		 * Only let prctl() wait / retry; don't unduly delay clone().
+		 */
+again:
+		wait_var_event(mm, futex_pivot_pending(mm));
+	}
+
+	scoped_guard(mutex, &mm->futex_hash_lock) {
+		struct futex_private_hash *free __free(kvfree) = NULL;
+		struct futex_private_hash *cur, *new;
+
+		cur = rcu_dereference_protected(mm->futex_phash,
+						lockdep_is_held(&mm->futex_hash_lock));
+		new = mm->futex_phash_new;
+		mm->futex_phash_new = NULL;
+
+		if (fph) {
+			if (cur && !new) {
+				/*
+				 * If we have an existing hash, but do not yet have
+				 * allocated a replacement hash, drop the initial
+				 * reference on the existing hash.
+				 */
+				futex_private_hash_put(cur);
+			}
+
+			if (new) {
+				/*
+				 * Two updates raced; throw out the lesser one.
+				 */
+				if (futex_hash_less(new, fph)) {
+					free = new;
+					new = fph;
+				} else {
+					free = fph;
+				}
+			} else {
+				new = fph;
+			}
+			fph = NULL;
+		}
+
+		if (new) {
+			/*
+			 * Will set mm->futex_phash_new on failure;
+			 * futex_private_hash_get() will try again.
+			 */
+			if (!__futex_pivot_hash(mm, new) && custom)
+				goto again;
+		}
+	}
 	return 0;
 }
 
 int futex_hash_allocate_default(void)
 {
+	unsigned int threads, buckets, current_buckets = 0;
+	struct futex_private_hash *fph;
+
 	if (!current->mm)
 		return 0;
 
-	if (current->mm->futex_phash)
+	scoped_guard(rcu) {
+		threads = min_t(unsigned int,
+				get_nr_threads(current),
+				num_online_cpus());
+
+		fph = rcu_dereference(current->mm->futex_phash);
+		if (fph) {
+			if (fph->custom)
+				return 0;
+
+			current_buckets = fph->hash_mask + 1;
+		}
+	}
+
+	/*
+	 * The default allocation will remain within
+	 *   16 <= threads * 4 <= global hash size
+	 */
+	buckets = roundup_pow_of_two(4 * threads);
+	buckets = clamp(buckets, 16, futex_hashmask + 1);
+
+	if (current_buckets >= buckets)
 		return 0;
 
-	return futex_hash_allocate(16, false);
+	return futex_hash_allocate(buckets, false);
 }
 
 static int futex_hash_get_slots(void)
 {
 	struct futex_private_hash *fph;
 
-	fph = current->mm->futex_phash;
+	guard(rcu)();
+	fph = rcu_dereference(current->mm->futex_phash);
 	if (fph && fph->hash_mask)
 		return fph->hash_mask + 1;
 	return 0;
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index b0e64fd454d96..c716a66f86929 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -87,6 +87,11 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
 		futex_hb_waiters_inc(hb2);
 		plist_add(&q->list, &hb2->chain);
 		q->lock_ptr = &hb2->lock;
+		/*
+		 * hb1 and hb2 belong to the same futex_hash_bucket_private
+		 * because if we managed to get a reference on hb1 then it can't be
+		 * replaced. Therefore we avoid put(hb1)+get(hb2) here.
+		 */
 	}
 	q->key = *key2;
 }
-- 
2.49.0

From nobody Fri Dec 19 02:59:33 2025
From: Sebastian Andrzej Siewior
To: linux-kernel@vger.kernel.org
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider, Waiman Long, Sebastian Andrzej Siewior, Shrikanth Hegde
Subject: [PATCH v12 15/21] futex: Allow to make the private hash immutable
Date: Wed, 16 Apr 2025 18:29:15 +0200
Message-ID: <20250416162921.513656-16-bigeasy@linutronix.de>
In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de>
References: <20250416162921.513656-1-bigeasy@linutronix.de>

My initial testing showed that "perf bench futex hash" reported fewer
operations/sec with the private hash. After using the same number of
buckets in the private hash as used by the global hash, the
operations/sec were about the same.

This changed once the private hash became resizable. This feature added
an RCU section and reference counting via atomic inc+dec operations into
the hot path.
The reference counting can be avoided if the private hash is made
immutable.

Extend PR_FUTEX_HASH_SET_SLOTS by a fourth argument which denotes whether
the private hash should be made immutable. Once set (to true), a further
resize is not allowed (same if set to global hash).
Add PR_FUTEX_HASH_GET_IMMUTABLE which returns true if the hash cannot be
changed.
Update "perf bench" suite.

For comparison, results of "perf bench futex hash -s":
- Xeon CPU E5-2650, 2 NUMA nodes, total 32 CPUs:
  - Before introducing the task-local hash
    shared   Averaged 1.487.148 operations/sec (+- 0,53%), total secs = 10
    private  Averaged 2.192.405 operations/sec (+- 0,07%), total secs = 10

  - With the series
    shared   Averaged 1.326.342 operations/sec (+- 0,41%), total secs = 10
    -b128    Averaged   141.394 operations/sec (+- 1,15%), total secs = 10
    -Ib128   Averaged   851.490 operations/sec (+- 0,67%), total secs = 10
    -b8192   Averaged   131.321 operations/sec (+- 2,13%), total secs = 10
    -Ib8192  Averaged 1.923.077 operations/sec (+- 0,61%), total secs = 10

  128 is the default allocation of hash buckets.
  8192 was the previous amount of allocated hash buckets.

- Xeon(R) CPU E7-8890 v3, 4 NUMA nodes, total 144 CPUs:
  - Before introducing the task-local hash
    shared   Averaged 1.810.936 operations/sec (+- 0,26%), total secs = 20
    private  Averaged 2.505.801 operations/sec (+- 0,05%), total secs = 20

  - With the series
    shared   Averaged 1.589.002 operations/sec (+- 0,25%), total secs = 20
    -b1024   Averaged    42.410 operations/sec (+- 0,20%), total secs = 20
    -Ib1024  Averaged   740.638 operations/sec (+- 1,51%), total secs = 20
    -b65536  Averaged    48.811 operations/sec (+- 1,35%), total secs = 20
    -Ib65536 Averaged 1.963.165 operations/sec (+- 0,18%), total secs = 20

  1024 is the default allocation of hash buckets.
  65536 was the previous amount of allocated hash buckets.

Acked-by: Shrikanth Hegde
Signed-off-by: Sebastian Andrzej Siewior
---
 include/uapi/linux/prctl.h |  1 +
 kernel/futex/core.c        | 44 ++++++++++++++++++++++++++++++++------
 2 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 3b93fb906e3c5..21f30b3ded74b 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -368,5 +368,6 @@ struct prctl_mm_map {
 #define PR_FUTEX_HASH			78
 # define PR_FUTEX_HASH_SET_SLOTS	1
 # define PR_FUTEX_HASH_GET_SLOTS	2
+# define PR_FUTEX_HASH_GET_IMMUTABLE	3
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 9e7dad52abea8..81c5705b6af5e 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -63,6 +63,7 @@ struct futex_private_hash {
 	struct rcu_head	rcu;
 	void		*mm;
 	bool		custom;
+	bool		immutable;
 	struct futex_hash_bucket queues[];
 };
 
@@ -132,12 +133,16 @@ static inline bool futex_key_is_private(union futex_key *key)
 
 bool futex_private_hash_get(struct futex_private_hash *fph)
 {
+	if (fph->immutable)
+		return true;
 	return rcuref_get(&fph->users);
 }
 
 void futex_private_hash_put(struct futex_private_hash *fph)
 {
 	/* Ignore return value, last put is verified via rcuref_is_dead() */
+	if (fph->immutable)
+		return;
 	if (rcuref_put(&fph->users))
 		wake_up_var(fph->mm);
 }
@@ -277,6 +282,8 @@ struct futex_private_hash *futex_private_hash(void)
 		if (!fph)
 			return NULL;
 
+		if (fph->immutable)
+			return fph;
 		if (rcuref_get(&fph->users))
 			return fph;
 	}
@@ -1433,7 +1440,7 @@ static bool futex_hash_less(struct futex_private_hash *a,
 	return false; /* equal */
 }
 
-static int futex_hash_allocate(unsigned int hash_slots, bool custom)
+static int futex_hash_allocate(unsigned int hash_slots, unsigned int immutable, bool custom)
 {
 	struct mm_struct *mm = current->mm;
 	struct futex_private_hash *fph;
@@ -1441,13 +1448,15 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 
 	if (hash_slots && (hash_slots == 1 || !is_power_of_2(hash_slots)))
 		return -EINVAL;
+	if (immutable > 2)
+		return -EINVAL;
 
 	/*
 	 * Once we've disabled the global hash there is no way back.
 	 */
 	scoped_guard(rcu) {
 		fph = rcu_dereference(mm->futex_phash);
-		if (fph && !fph->hash_mask) {
+		if (fph && (!fph->hash_mask || fph->immutable)) {
 			if (custom)
 				return -EBUSY;
 			return 0;
@@ -1461,6 +1470,7 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 	rcuref_init(&fph->users, 1);
 	fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
 	fph->custom = custom;
+	fph->immutable = !!immutable;
 	fph->mm = mm;
 
 	for (i = 0; i < hash_slots; i++)
@@ -1553,7 +1563,7 @@ int futex_hash_allocate_default(void)
 	if (current_buckets >= buckets)
 		return 0;
 
-	return futex_hash_allocate(buckets, false);
+	return futex_hash_allocate(buckets, 0, false);
 }
 
 static int futex_hash_get_slots(void)
@@ -1567,9 +1577,22 @@ static int futex_hash_get_slots(void)
 	return 0;
 }
 
+static int futex_hash_get_immutable(void)
+{
+	struct futex_private_hash *fph;
+
+	guard(rcu)();
+	fph = rcu_dereference(current->mm->futex_phash);
+	if (fph && fph->immutable)
+		return 1;
+	if (fph && !fph->hash_mask)
+		return 1;
+	return 0;
+}
+
 #else
 
-static int futex_hash_allocate(unsigned int hash_slots, bool custom)
+static int futex_hash_allocate(unsigned int hash_slots, unsigned int immutable, bool custom)
 {
 	return -EINVAL;
 }
@@ -1578,6 +1601,11 @@ static int futex_hash_get_slots(void)
 {
 	return 0;
 }
+
+static int futex_hash_get_immutable(void)
+{
+	return 0;
+}
 #endif
 
 int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
@@ -1586,15 +1614,17 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
 
 	switch (arg2) {
 	case PR_FUTEX_HASH_SET_SLOTS:
-		if (arg4 != 0)
-			return -EINVAL;
-		ret = futex_hash_allocate(arg3, true);
+		ret = futex_hash_allocate(arg3, arg4, true);
 		break;
 
 	case PR_FUTEX_HASH_GET_SLOTS:
 		ret = futex_hash_get_slots();
 		break;
 
+	case PR_FUTEX_HASH_GET_IMMUTABLE:
+		ret = futex_hash_get_immutable();
+		break;
+
 	default:
 		ret = -EINVAL;
 		break;
-- 
2.49.0

From nobody Fri Dec 19 02:59:33 2025
From: Sebastian Andrzej Siewior
To: linux-kernel@vger.kernel.org
Cc: André Almeida, Darren Hart, Davidlohr Bueso, Ingo Molnar, Juri Lelli, Peter Zijlstra, Thomas Gleixner, Valentin Schneider, Waiman Long, Sebastian Andrzej Siewior
Subject: [PATCH v12 16/21] futex: Implement FUTEX2_NUMA
Date: Wed, 16 Apr 2025 18:29:16 +0200
Message-ID: <20250416162921.513656-17-bigeasy@linutronix.de>
In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de>
References: <20250416162921.513656-1-bigeasy@linutronix.de>

From: Peter Zijlstra

Extend the futex2 interface to be NUMA aware.

When FUTEX2_NUMA is specified for a futex, the user value is extended
to two words (of the same size). The first is the user value we all
know, the second one will be the node to place this futex on.

  struct futex_numa_32 {
	u32 val;
	u32 node;
  };

When node is set to ~0, WAIT will set it to the current node_id such
that WAKE knows where to find it.
If userspace corrupts the node value between WAIT and WAKE, the futex will not be found and no wakeup will happen. When FUTEX2_NUMA is not set, the node is simply an extension of the hash, such that traditional futexes are still interleaved over the nodes. This is done to avoid having to have a separate !numa hash-table. [bigeasy: ensure to have at least hashsize of 4 in futex_init(), add pr_info() for size and allocation information. Cast the naddr math to void*] Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior --- include/linux/futex.h | 3 ++ include/uapi/linux/futex.h | 8 +++ kernel/futex/core.c | 100 ++++++++++++++++++++++++++++++------- kernel/futex/futex.h | 33 ++++++++++-- 4 files changed, 124 insertions(+), 20 deletions(-) diff --git a/include/linux/futex.h b/include/linux/futex.h index 40bc778b2bb45..eccc99751bd94 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -34,6 +34,7 @@ union futex_key { u64 i_seq; unsigned long pgoff; unsigned int offset; + /* unsigned int node; */ } shared; struct { union { @@ -42,11 +43,13 @@ union futex_key { }; unsigned long address; unsigned int offset; + /* unsigned int node; */ } private; struct { u64 ptr; unsigned long word; unsigned int offset; + unsigned int node; /* NOT hashed! */ } both; }; =20 diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h index d2ee625ea1890..0435025beaae8 100644 --- a/include/uapi/linux/futex.h +++ b/include/uapi/linux/futex.h @@ -74,6 +74,14 @@ /* do not use */ #define FUTEX_32 FUTEX2_SIZE_U32 /* historical accident :-( */ =20 + +/* + * When FUTEX2_NUMA doubles the futex word, the second word is a node valu= e. + * The special value -1 indicates no-node. This is the same value as + * NUMA_NO_NODE, except that value is not ABI, this is. 
+ */ +#define FUTEX_NO_NODE (-1) + /* * Max numbers of elements in a futex_waitv array */ diff --git a/kernel/futex/core.c b/kernel/futex/core.c index 81c5705b6af5e..b5be2d4a34a53 100644 --- a/kernel/futex/core.c +++ b/kernel/futex/core.c @@ -36,6 +36,8 @@ #include #include #include +#include +#include #include #include #include @@ -51,11 +53,14 @@ * reside in the same cacheline. */ static struct { - struct futex_hash_bucket *queues; unsigned long hashmask; + unsigned int hashshift; + struct futex_hash_bucket *queues[MAX_NUMNODES]; } __futex_data __read_mostly __aligned(2*sizeof(long)); -#define futex_queues (__futex_data.queues) -#define futex_hashmask (__futex_data.hashmask) + +#define futex_hashmask (__futex_data.hashmask) +#define futex_hashshift (__futex_data.hashshift) +#define futex_queues (__futex_data.queues) =20 struct futex_private_hash { rcuref_t users; @@ -339,15 +344,35 @@ __futex_hash(union futex_key *key, struct futex_priva= te_hash *fph) { struct futex_hash_bucket *hb; u32 hash; + int node; =20 hb =3D __futex_hash_private(key, fph); if (hb) return hb; =20 hash =3D jhash2((u32 *)key, - offsetof(typeof(*key), both.offset) / 4, + offsetof(typeof(*key), both.offset) / sizeof(u32), key->both.offset); - return &futex_queues[hash & futex_hashmask]; + node =3D key->both.node; + + if (node =3D=3D FUTEX_NO_NODE) { + /* + * In case of !FLAGS_NUMA, use some unused hash bits to pick a + * node -- this ensures regular futexes are interleaved across + * the nodes and avoids having to allocate multiple + * hash-tables. + * + * NOTE: this isn't perfectly uniform, but it is fast and + * handles sparse node masks. 
+ */ + node =3D (hash >> futex_hashshift) % nr_node_ids; + if (!node_possible(node)) { + node =3D find_next_bit_wrap(node_possible_map.bits, + nr_node_ids, node); + } + } + + return &futex_queues[node][hash & futex_hashmask]; } =20 /** @@ -454,25 +479,49 @@ int get_futex_key(u32 __user *uaddr, unsigned int fla= gs, union futex_key *key, struct page *page; struct folio *folio; struct address_space *mapping; - int err, ro =3D 0; + int node, err, size, ro =3D 0; bool fshared; =20 fshared =3D flags & FLAGS_SHARED; + size =3D futex_size(flags); + if (flags & FLAGS_NUMA) + size *=3D 2; =20 /* * The futex address must be "naturally" aligned. */ key->both.offset =3D address % PAGE_SIZE; - if (unlikely((address % sizeof(u32)) !=3D 0)) + if (unlikely((address % size) !=3D 0)) return -EINVAL; address -=3D key->both.offset; =20 - if (unlikely(!access_ok(uaddr, sizeof(u32)))) + if (unlikely(!access_ok(uaddr, size))) return -EFAULT; =20 if (unlikely(should_fail_futex(fshared))) return -EFAULT; =20 + if (flags & FLAGS_NUMA) { + u32 __user *naddr =3D (void *)uaddr + size / 2; + + if (futex_get_value(&node, naddr)) + return -EFAULT; + + if (node =3D=3D FUTEX_NO_NODE) { + node =3D numa_node_id(); + if (futex_put_value(node, naddr)) + return -EFAULT; + + } else if (node >=3D MAX_NUMNODES || !node_possible(node)) { + return -EINVAL; + } + + key->both.node =3D node; + + } else { + key->both.node =3D FUTEX_NO_NODE; + } + /* * PROCESS_PRIVATE futexes are fast. 
* As the mm cannot disappear under us and the 'key' only needs @@ -1635,24 +1684,41 @@ int futex_hash_prctl(unsigned long arg2, unsigned l= ong arg3, unsigned long arg4) static int __init futex_init(void) { unsigned long hashsize, i; - unsigned int futex_shift; + unsigned int order, n; + unsigned long size; =20 #ifdef CONFIG_BASE_SMALL hashsize =3D 16; #else - hashsize =3D roundup_pow_of_two(256 * num_possible_cpus()); + hashsize =3D 256 * num_possible_cpus(); + hashsize /=3D num_possible_nodes(); + hashsize =3D max(4, hashsize); + hashsize =3D roundup_pow_of_two(hashsize); #endif + futex_hashshift =3D ilog2(hashsize); + size =3D sizeof(struct futex_hash_bucket) * hashsize; + order =3D get_order(size); =20 - futex_queues =3D alloc_large_system_hash("futex", sizeof(*futex_queues), - hashsize, 0, 0, - &futex_shift, NULL, - hashsize, hashsize); - hashsize =3D 1UL << futex_shift; + for_each_node(n) { + struct futex_hash_bucket *table; =20 - for (i =3D 0; i < hashsize; i++) - futex_hash_bucket_init(&futex_queues[i], NULL); + if (order > MAX_PAGE_ORDER) + table =3D vmalloc_huge_node(size, GFP_KERNEL, n); + else + table =3D alloc_pages_exact_nid(n, size, GFP_KERNEL); + + BUG_ON(!table); + + for (i =3D 0; i < hashsize; i++) + futex_hash_bucket_init(&table[i], NULL); + + futex_queues[n] =3D table; + } =20 futex_hashmask =3D hashsize - 1; + pr_info("futex hash table entries: %lu (%lu bytes on %d NUMA nodes, total= %lu KiB, %s).\n", + hashsize, size, num_possible_nodes(), size * num_possible_nodes() / 1024, + order > MAX_PAGE_ORDER ? 
"vmalloc" : "linear"); return 0; } core_initcall(futex_init); diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h index 899aed5acde12..acc7953678898 100644 --- a/kernel/futex/futex.h +++ b/kernel/futex/futex.h @@ -54,7 +54,7 @@ static inline unsigned int futex_to_flags(unsigned int op) return flags; } =20 -#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_PRIVATE) +#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_PRIVATE) =20 /* FUTEX2_ to FLAGS_ */ static inline unsigned int futex2_to_flags(unsigned int flags2) @@ -87,6 +87,19 @@ static inline bool futex_flags_valid(unsigned int flags) if ((flags & FLAGS_SIZE_MASK) !=3D FLAGS_SIZE_32) return false; =20 + /* + * Must be able to represent both FUTEX_NO_NODE and every valid nodeid + * in a futex word. + */ + if (flags & FLAGS_NUMA) { + int bits =3D 8 * futex_size(flags); + u64 max =3D ~0ULL; + + max >>=3D 64 - bits; + if (nr_node_ids >=3D max) + return false; + } + return true; } =20 @@ -282,7 +295,7 @@ static inline int futex_cmpxchg_value_locked(u32 *curva= l, u32 __user *uaddr, u32 * This looks a bit overkill, but generally just results in a couple * of instructions. 
 */ -static __always_inline int futex_read_inatomic(u32 *dest, u32 __user *from) +static __always_inline int futex_get_value(u32 *dest, u32 __user *from) { u32 val; =20 @@ -299,12 +312,26 @@ static __always_inline int futex_read_inatomic(u32 *d= est, u32 __user *from) return -EFAULT; } =20 +static __always_inline int futex_put_value(u32 val, u32 __user *to) +{ + if (can_do_masked_user_access()) + to =3D masked_user_access_begin(to); + else if (!user_write_access_begin(to, sizeof(*to))) + return -EFAULT; + unsafe_put_user(val, to, Efault); + user_write_access_end(); + return 0; +Efault: + user_write_access_end(); + return -EFAULT; +} + static inline int futex_get_value_locked(u32 *dest, u32 __user *from) { int ret; =20 pagefault_disable(); - ret =3D futex_read_inatomic(dest, from); + ret =3D futex_get_value(dest, from); pagefault_enable(); =20 return ret; --=20 2.49.0 From nobody Fri Dec 19 02:59:33 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0ED2621B9E5 for ; Wed, 16 Apr 2025 16:29:35 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820978; cv=none; b=WRUzxj3D5qmibRi7z7xwGsogcotXEbx68zYqpEb37OqC1p7hXp067Pv0+uxtIOQo8r2Tkn3n9w6+53KDkl5VR2ywQI+YzWzW/CYXJAuhTEwY8/0m5eKd2DnhkUO0WeWwrpi+HvLA3rcYEPDh9hBAgqJLXH7IIAxvRC0q2ncIits= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820978; c=relaxed/simple; bh=d9QnKurb5SH5mTyAlKLm+x2veeu8q55k+fNPL7B78yI=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=tuTExe8Q/bswauQZyYaKbNleDpU9GBCRGnvGDkSaVWqndGdxzyXK/tZ/8xt6K/2K2/NA/kZOH7HdXLuKF/lhUnwLmwd4rIrxUdMY9VCiTjDzSb+wUAfbQsK2F6Y9Xt8ZbEYF3WNPmf8VA0GoEo+PNbUZFOxIc6jD9vuSKdPTacA= ARC-Authentication-Results: 
i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=BZWh+ysD; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=3tqKL4sw; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="BZWh+ysD"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="3tqKL4sw" From: Sebastian Andrzej Siewior DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1744820972; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=w4uAkwnBy7IBc0gSxMrL7lL3oQSyk01HKxQFhkjakhg=; b=BZWh+ysDeeGKokqy2feS1iDf3PGFnpYyYQqBmO99kibJhu42CEsvTv2uXCZIY9cEE0AGqK 1JRn7fMkpNUH7WmgpbxJJTTOWcOcGoQwCG1R7ZFhzdB41wIWPNpJ8uqFXyVAab8EtRBaIb Q/7nr7Wt7rlE3p0PTVyHRnYoIzKcUx1LYayZqyZyiDgChSBkC5voUZCnslFuMx3hYQWph8 5UfTQeEDSATtSBsDVDzmaWJ9A1h4vvEo2pkQ7cfITcG61WJ5Wo33PBvSK76vX4juszZ+MD tGkWjeWMxJHCHsAv+0bZFPEK0WQ/N6xTMXCJGOmlzFLABtADTeB9Ms5BXQpQFA== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1744820972; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=w4uAkwnBy7IBc0gSxMrL7lL3oQSyk01HKxQFhkjakhg=; b=3tqKL4swFjwYoALRjxBvalSVtcIF5aLDjr7tiIEXX/4mI/npLo5P++wA3dMyVi7jUFXXQw Ahn1aK4n634LlACw== To: linux-kernel@vger.kernel.org Cc: 
=?UTF-8?q?Andr=C3=A9=20Almeida?= , Darren Hart , Davidlohr Bueso , Ingo Molnar , Juri Lelli , Peter Zijlstra , Thomas Gleixner , Valentin Schneider , Waiman Long , Sebastian Andrzej Siewior Subject: [PATCH v12 17/21] futex: Implement FUTEX2_MPOL Date: Wed, 16 Apr 2025 18:29:17 +0200 Message-ID: <20250416162921.513656-18-bigeasy@linutronix.de> In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de> References: <20250416162921.513656-1-bigeasy@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Peter Zijlstra Extend the futex2 interface to be aware of mempolicy. When FUTEX2_MPOL is specified and there is a MPOL_PREFERRED or home_node specified covering the futex address, use that hash-map. Notably, in this case the futex will go to the global node hashtable, even if it is a PRIVATE futex. When FUTEX2_NUMA|FUTEX2_MPOL is specified and the user specified node value is FUTEX_NO_NODE, the MPOL lookup (as described above) will be tried first before reverting to setting node to the local node. 
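The lookup order described above can be modelled in user space. The snippet below is only a sketch that mirrors the node selection this patch adds to get_futex_key(). resolve_node(), struct node_result and the M_FLAG_* bits are hypothetical names made up for the illustration, mpol_node and local_node stand in for the results of futex_mpol() and numa_node_id(), and the range check on a user supplied node is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define FUTEX_NO_NODE	(-1)

/* Hypothetical flag bits for this model, not the kernel's FLAGS_* values. */
#define M_FLAG_NUMA	0x1
#define M_FLAG_MPOL	0x2

struct node_result {
	int node;		/* value that would end up in key->both.node */
	bool write_back;	/* true if the user's node word would be rewritten */
};

/*
 * Model of the node selection: an explicit user node wins, then the
 * mempolicy lookup (FUTEX2_MPOL), then the local node (FUTEX2_NUMA only).
 */
static struct node_result resolve_node(unsigned int flags, int user_node,
				       int mpol_node, int local_node)
{
	struct node_result res = { FUTEX_NO_NODE, false };
	bool updated = false;
	int node = FUTEX_NO_NODE;

	/* Stand-in for numa_node_id(), which never returns a negative node. */
	assert(local_node >= 0);

	/* The node word is only read when FUTEX2_NUMA is set. */
	if (flags & M_FLAG_NUMA)
		node = user_node;

	/* No explicit node: try the mempolicy covering the address. */
	if (node == FUTEX_NO_NODE && (flags & M_FLAG_MPOL)) {
		node = mpol_node;
		updated = true;
	}

	if (flags & M_FLAG_NUMA) {
		/* Still no node: fall back to the local node. */
		if (node == FUTEX_NO_NODE) {
			node = local_node;
			updated = true;
		}
		/* Only write to user memory if the kernel picked the node. */
		res.write_back = updated;
	}

	res.node = node;
	return res;
}
```

In this model, resolve_node(M_FLAG_NUMA | M_FLAG_MPOL, FUTEX_NO_NODE, 1, 0) selects node 1 and reports a write back, while an explicit user node is left untouched. With FUTEX2_MPOL alone there is no node word, so nothing is written to user memory.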
[bigeasy: add CONFIG_FUTEX_MPOL, add MPOL to FUTEX2_VALID_MASK, write the node only to user if FUTEX_NO_NODE was supplied] Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior --- include/linux/mmap_lock.h | 4 ++ include/uapi/linux/futex.h | 2 +- init/Kconfig | 5 ++ kernel/futex/core.c | 114 +++++++++++++++++++++++++++++++------ kernel/futex/futex.h | 6 +- 5 files changed, 113 insertions(+), 18 deletions(-) diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index 4706c67699027..e0eddfd306ef3 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -7,6 +7,7 @@ #include #include #include +#include =20 #define MMAP_LOCK_INITIALIZER(name) \ .mmap_lock =3D __RWSEM_INITIALIZER((name).mmap_lock), @@ -211,6 +212,9 @@ static inline void mmap_read_unlock(struct mm_struct *m= m) up_read(&mm->mmap_lock); } =20 +DEFINE_GUARD(mmap_read_lock, struct mm_struct *, + mmap_read_lock(_T), mmap_read_unlock(_T)) + static inline void mmap_read_unlock_non_owner(struct mm_struct *mm) { __mmap_lock_trace_released(mm, false); diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h index 0435025beaae8..247c425e175ef 100644 --- a/include/uapi/linux/futex.h +++ b/include/uapi/linux/futex.h @@ -63,7 +63,7 @@ #define FUTEX2_SIZE_U32 0x02 #define FUTEX2_SIZE_U64 0x03 #define FUTEX2_NUMA 0x04 - /* 0x08 */ +#define FUTEX2_MPOL 0x08 /* 0x10 */ /* 0x20 */ /* 0x40 */ diff --git a/init/Kconfig b/init/Kconfig index b308b98d79347..174633bc9810b 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1704,6 +1704,11 @@ config FUTEX_PRIVATE_HASH depends on FUTEX && !BASE_SMALL && MMU default y =20 +config FUTEX_MPOL + bool + depends on FUTEX && NUMA + default y + config EPOLL bool "Enable eventpoll support" if EXPERT default y diff --git a/kernel/futex/core.c b/kernel/futex/core.c index b5be2d4a34a53..ee1d7182ce0c0 100644 --- a/kernel/futex/core.c +++ b/kernel/futex/core.c @@ -43,6 +43,8 @@ #include #include #include +#include +#include =20 
#include "futex.h" #include "../locking/rtmutex_common.h" @@ -328,6 +330,73 @@ struct futex_hash_bucket *futex_hash(union futex_key *= key) =20 #endif /* CONFIG_FUTEX_PRIVATE_HASH */ =20 +#ifdef CONFIG_FUTEX_MPOL +static int __futex_key_to_node(struct mm_struct *mm, unsigned long addr) +{ + struct vm_area_struct *vma =3D vma_lookup(mm, addr); + struct mempolicy *mpol; + int node =3D FUTEX_NO_NODE; + + if (!vma) + return FUTEX_NO_NODE; + + mpol =3D vma_policy(vma); + if (!mpol) + return FUTEX_NO_NODE; + + switch (mpol->mode) { + case MPOL_PREFERRED: + node =3D first_node(mpol->nodes); + break; + case MPOL_PREFERRED_MANY: + case MPOL_BIND: + if (mpol->home_node !=3D NUMA_NO_NODE) + node =3D mpol->home_node; + break; + default: + break; + } + + return node; +} + +static int futex_key_to_node_opt(struct mm_struct *mm, unsigned long addr) +{ + int seq, node; + + guard(rcu)(); + + if (!mmap_lock_speculate_try_begin(mm, &seq)) + return -EBUSY; + + node =3D __futex_key_to_node(mm, addr); + + if (mmap_lock_speculate_retry(mm, seq)) + return -EAGAIN; + + return node; +} + +static int futex_mpol(struct mm_struct *mm, unsigned long addr) +{ + int node; + + node =3D futex_key_to_node_opt(mm, addr); + if (node >=3D FUTEX_NO_NODE) + return node; + + guard(mmap_read_lock)(mm); + return __futex_key_to_node(mm, addr); +} +#else /* !CONFIG_FUTEX_MPOL */ + +static int futex_mpol(struct mm_struct *mm, unsigned long addr) +{ + return FUTEX_NO_NODE; +} + +#endif /* CONFIG_FUTEX_MPOL */ + /** * __futex_hash - Return the hash bucket * @key: Pointer to the futex key for which the hash is calculated @@ -342,18 +411,20 @@ struct futex_hash_bucket *futex_hash(union futex_key = *key) static struct futex_hash_bucket * __futex_hash(union futex_key *key, struct futex_private_hash *fph) { - struct futex_hash_bucket *hb; + int node =3D key->both.node; u32 hash; - int node; =20 - hb =3D __futex_hash_private(key, fph); - if (hb) - return hb; + if (node =3D=3D FUTEX_NO_NODE) { + struct 
futex_hash_bucket *hb; + + hb =3D __futex_hash_private(key, fph); + if (hb) + return hb; + } =20 hash =3D jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / sizeof(u32), key->both.offset); - node =3D key->both.node; =20 if (node =3D=3D FUTEX_NO_NODE) { /* @@ -480,6 +551,7 @@ int get_futex_key(u32 __user *uaddr, unsigned int flags= , union futex_key *key, struct folio *folio; struct address_space *mapping; int node, err, size, ro =3D 0; + bool node_updated =3D false; bool fshared; =20 fshared =3D flags & FLAGS_SHARED; @@ -501,27 +573,37 @@ int get_futex_key(u32 __user *uaddr, unsigned int fla= gs, union futex_key *key, if (unlikely(should_fail_futex(fshared))) return -EFAULT; =20 + node =3D FUTEX_NO_NODE; + if (flags & FLAGS_NUMA) { u32 __user *naddr =3D (void *)uaddr + size / 2; =20 if (futex_get_value(&node, naddr)) return -EFAULT; =20 + if (node !=3D FUTEX_NO_NODE && + (node >=3D MAX_NUMNODES || !node_possible(node))) + return -EINVAL; + } + + if (node =3D=3D FUTEX_NO_NODE && (flags & FLAGS_MPOL)) { + node =3D futex_mpol(mm, address); + node_updated =3D true; + } + + if (flags & FLAGS_NUMA) { + u32 __user *naddr =3D (void *)uaddr + size / 2; + if (node =3D=3D FUTEX_NO_NODE) { node =3D numa_node_id(); - if (futex_put_value(node, naddr)) - return -EFAULT; - - } else if (node >=3D MAX_NUMNODES || !node_possible(node)) { - return -EINVAL; + node_updated =3D true; } - - key->both.node =3D node; - - } else { - key->both.node =3D FUTEX_NO_NODE; + if (node_updated && futex_put_value(node, naddr)) + return -EFAULT; } =20 + key->both.node =3D node; + /* * PROCESS_PRIVATE futexes are fast. 
* As the mm cannot disappear under us and the 'key' only needs diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h index acc7953678898..069fc2a83080d 100644 --- a/kernel/futex/futex.h +++ b/kernel/futex/futex.h @@ -39,6 +39,7 @@ #define FLAGS_HAS_TIMEOUT 0x0040 #define FLAGS_NUMA 0x0080 #define FLAGS_STRICT 0x0100 +#define FLAGS_MPOL 0x0200 =20 /* FUTEX_ to FLAGS_ */ static inline unsigned int futex_to_flags(unsigned int op) @@ -54,7 +55,7 @@ static inline unsigned int futex_to_flags(unsigned int op) return flags; } =20 -#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_PRIVATE) +#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_NUMA | FUTEX2_MPOL | = FUTEX2_PRIVATE) =20 /* FUTEX2_ to FLAGS_ */ static inline unsigned int futex2_to_flags(unsigned int flags2) @@ -67,6 +68,9 @@ static inline unsigned int futex2_to_flags(unsigned int f= lags2) if (flags2 & FUTEX2_NUMA) flags |=3D FLAGS_NUMA; =20 + if (flags2 & FUTEX2_MPOL) + flags |=3D FLAGS_MPOL; + return flags; } =20 --=20 2.49.0 From nobody Fri Dec 19 02:59:33 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A9DCF21B1BC; Wed, 16 Apr 2025 16:29:35 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820977; cv=none; b=Fs9jjlytjFeEqbXIasD4Qfa1zuk9MIs/0ICQMKGmOuP92YoEJTLEKe24DahvzHStGMlfwnJzSyH0KC88G71d3497MvpYSSqa8b2vK67KjCF9wc5c4sdd9iDjjOlNoj96zuX+hRvHjxrguKQ/NDSGJp/XjXKVmiOZHLu49e5+45k= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820977; c=relaxed/simple; bh=Uf6MF+4cqr2cCfzLE6uv2eShicTCfEarLEKPuph+oKI=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; 
b=Az9J9uHIkM4+Ahng62vQ124RUF8w+PMaKHUwYwD5K1vverPK1c/iggsxX9xL/Fpn5ZOo1vy4Yqm8B7uJpjwySdCfz+6RE8N9MQ7qciS3K0UPv7sx5O6dXLBPPd9qjXxrLZShvGEp9hfNQCVr96qlQp8uKLJ5vIOcUo7r/mjmIdY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=NrV3UKlC; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=gQ2S8Vq1; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="NrV3UKlC"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="gQ2S8Vq1" From: Sebastian Andrzej Siewior DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1744820973; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=uhYrjCjllIT09FWGjatSJnsamFQi/vOB94QRhXHdb6E=; b=NrV3UKlCwR0oT60ieABqQ7IVT8zgqla0ZxAx6XsEEJMr1iZl/QS/dO5eEamVy+ynMe8855 Tq9gcsL7eZp1qvtKbUeKNFSve4DW8E9cFqHDIuj+4GL+iwSFv2HR4gqmGYsMEuTtxgOvwa dKdvsbKbX7gMa+Mlb/g1kynmiZwq17/jzlC1XKRxPKUwdLlP+lS3JwTBKwrcVAqTqAiuSU SfsK7ZQH+ESVi70cZJlDP+dYYf8kyrNCSzdv2DSJMYZehJV5CyBqGmjts/VCH859nwTqig 8/DgE4JDE2ivJmqKaBpT8Z2lNvoCR43HFnXG93lfqU9l+pKLqN15vphHpaOqHQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1744820973; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: 
in-reply-to:in-reply-to:references:references; bh=uhYrjCjllIT09FWGjatSJnsamFQi/vOB94QRhXHdb6E=; b=gQ2S8Vq1DN/XS+/lEnfAk5vIDZuNGVlxxD6SKij+o4NlEWLE92lv9C0UQ1LQm+xkkRBAot aXBf8Y+42wCV/pBQ== To: linux-kernel@vger.kernel.org Cc: =?UTF-8?q?Andr=C3=A9=20Almeida?= , Darren Hart , Davidlohr Bueso , Ingo Molnar , Juri Lelli , Peter Zijlstra , Thomas Gleixner , Valentin Schneider , Waiman Long , Sebastian Andrzej Siewior , "Liang, Kan" , Adrian Hunter , Alexander Shishkin , Arnaldo Carvalho de Melo , Ian Rogers , Jiri Olsa , Mark Rutland , Namhyung Kim , linux-perf-users@vger.kernel.org Subject: [PATCH v12 18/21] tools headers: Synchronize prctl.h ABI header Date: Wed, 16 Apr 2025 18:29:18 +0200 Message-ID: <20250416162921.513656-19-bigeasy@linutronix.de> In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de> References: <20250416162921.513656-1-bigeasy@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Synchronize prctl.h with current uapi version after adding PR_FUTEX_HASH. 
Cc: "Liang, Kan" Cc: Adrian Hunter Cc: Alexander Shishkin Cc: Arnaldo Carvalho de Melo Cc: Ian Rogers Cc: Ingo Molnar Cc: Jiri Olsa Cc: Mark Rutland Cc: Namhyung Kim Cc: linux-perf-users@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior --- tools/include/uapi/linux/prctl.h | 44 +++++++++++++++++++++++++++++++- 1 file changed, 43 insertions(+), 1 deletion(-) diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/pr= ctl.h index 35791791a879b..21f30b3ded74b 100644 --- a/tools/include/uapi/linux/prctl.h +++ b/tools/include/uapi/linux/prctl.h @@ -230,7 +230,7 @@ struct prctl_mm_map { # define PR_PAC_APDBKEY (1UL << 3) # define PR_PAC_APGAKEY (1UL << 4) =20 -/* Tagged user address controls for arm64 */ +/* Tagged user address controls for arm64 and RISC-V */ #define PR_SET_TAGGED_ADDR_CTRL 55 #define PR_GET_TAGGED_ADDR_CTRL 56 # define PR_TAGGED_ADDR_ENABLE (1UL << 0) @@ -244,6 +244,9 @@ struct prctl_mm_map { # define PR_MTE_TAG_MASK (0xffffUL << PR_MTE_TAG_SHIFT) /* Unused; kept only for source compatibility */ # define PR_MTE_TCF_SHIFT 1 +/* RISC-V pointer masking tag length */ +# define PR_PMLEN_SHIFT 24 +# define PR_PMLEN_MASK (0x7fUL << PR_PMLEN_SHIFT) =20 /* Control reclaim behavior when allocating memory */ #define PR_SET_IO_FLUSHER 57 @@ -328,4 +331,43 @@ struct prctl_mm_map { # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */ # define PR_PPC_DEXCR_CTRL_MASK 0x1f =20 +/* + * Get the current shadow stack configuration for the current thread, + * this will be the value configured via PR_SET_SHADOW_STACK_STATUS. + */ +#define PR_GET_SHADOW_STACK_STATUS 74 + +/* + * Set the current shadow stack configuration. Enabling the shadow + * stack will cause a shadow stack to be allocated for the thread. 
+ */ +#define PR_SET_SHADOW_STACK_STATUS 75 +# define PR_SHADOW_STACK_ENABLE (1UL << 0) +# define PR_SHADOW_STACK_WRITE (1UL << 1) +# define PR_SHADOW_STACK_PUSH (1UL << 2) + +/* + * Prevent further changes to the specified shadow stack + * configuration. All bits may be locked via this call, including + * undefined bits. + */ +#define PR_LOCK_SHADOW_STACK_STATUS 76 + +/* + * Controls the mode of timer_create() for CRIU restore operations. + * Enabling this allows CRIU to restore timers with explicit IDs. + * + * Don't use for normal operations as the result might be undefined. + */ +#define PR_TIMER_CREATE_RESTORE_IDS 77 +# define PR_TIMER_CREATE_RESTORE_IDS_OFF 0 +# define PR_TIMER_CREATE_RESTORE_IDS_ON 1 +# define PR_TIMER_CREATE_RESTORE_IDS_GET 2 + +/* FUTEX hash management */ +#define PR_FUTEX_HASH 78 +# define PR_FUTEX_HASH_SET_SLOTS 1 +# define PR_FUTEX_HASH_GET_SLOTS 2 +# define PR_FUTEX_HASH_GET_IMMUTABLE 3 + #endif /* _LINUX_PRCTL_H */ --=20 2.49.0 From nobody Fri Dec 19 02:59:33 2025 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 50F5D21C9FD; Wed, 16 Apr 2025 16:29:35 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820981; cv=none; b=iVfJYUcx6i5c0cFzBcWyI452jZ+NE76UyGJrskfXwit8DJ2pBmFQmhsxCfnKmhmphWhzV9tBTrYHtq7a+nTghppqeKwxAk4OxrkvCsDMYsr8ee7wPqlxq5wSOfb4cpUxyhQc5/HOrcSvjbJh5WXvX629MpDAZows7LGU1mK3eMA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820981; c=relaxed/simple; bh=4MvYlUcGhYQjwlZwLAwo1RILr6ykcjeL8wbAjqFO940=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; 
b=QmlArANM2y8ROYCtb6iPdGMSBKPBcQEIe81ljcJNYRDYPcbyY42+Bj7BD+In/CB6m/ZRxfcl/orYKNv7Z7vqaf7KqszNuqic2HYmYOV1KlgvuYzk+0aHQtM9amxiS4Cr6gb6/b9MRe64HBIhxcI9GHH//Fxs30dldbRy7T5TSPk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=n6yJ2xo4; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=LArePtHJ; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="n6yJ2xo4"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="LArePtHJ" From: Sebastian Andrzej Siewior DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1744820974; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=k2Lf8SyuDvhuwsrlXnosYopSjakalKZtdBvos5UVJiQ=; b=n6yJ2xo4SvVASGbwe7E6Fk3gGwT4ufh2vJ3mZj2qnEhAHcgq1wEY6XZFSSNlv/rDY96ook /K5IKca5u8zJOV55ybsBZlnuppAE3qU8mL76q3LrzJtPXTsBPmhsWO3H3HRa+YleHfD9US lrbmLspsZddEELwpDonNU884VYMSuNLdkPN0SH8f5hckmXwMv1FxWuLbdv35zsCf2QCcbj DKIQ6OhOWfi6QfiEQ2XMOkxs3Zhllr6VGKyYZofRHA3DyD+O9/S6dLcZ/kbfKj4JtAuBMl YZ9Xdg6941yKDXGHw7HxI7dJjTRzhI1q8+wGJj9H2vCjbNjBl5R7aP/C9OLbHw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1744820974; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: 
in-reply-to:in-reply-to:references:references; bh=k2Lf8SyuDvhuwsrlXnosYopSjakalKZtdBvos5UVJiQ=; b=LArePtHJlZ5a3Xnl+k7MfSpBDnDVj4q5ep8/WeT15s5Lbf+gM1pl8CcqftKnjq1VOAnVv9 kifftXPhvcvzmmDQ==
To: linux-kernel@vger.kernel.org
Cc: André Almeida , Darren Hart , Davidlohr Bueso , Ingo Molnar , Juri Lelli , Peter Zijlstra , Thomas Gleixner , Valentin Schneider , Waiman Long , Sebastian Andrzej Siewior , "Liang, Kan" , Adrian Hunter , Alexander Shishkin , Arnaldo Carvalho de Melo , Ian Rogers , Jiri Olsa , Mark Rutland , Namhyung Kim , linux-perf-users@vger.kernel.org
Subject: [PATCH v12 19/21] tools/perf: Allow to select the number of hash buckets
Date: Wed, 16 Apr 2025 18:29:19 +0200
Message-ID: <20250416162921.513656-20-bigeasy@linutronix.de>
In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de>
References: <20250416162921.513656-1-bigeasy@linutronix.de>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Add the -b/--buckets argument to specify the number of hash buckets for
the private futex hash. The value is passed directly to
	prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, buckets, immutable)
and, if the argument is given, the call must return without an error.
`immutable' is 0 by default and can be set to 1 via the -I/--immutable
argument.
The size of the private hash is verified with PR_FUTEX_HASH_GET_SLOTS.
If PR_FUTEX_HASH_GET_SLOTS fails, it is assumed that an older kernel
without support for the private hash is in use and that the global hash
is used.
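For illustration only, and not part of this patch: the prctl() calls
described above could be wrapped roughly as in the sketch below. The
sketch_*() helper names are hypothetical, the constants are the ones added
to prctl.h earlier in this series, and the (buckets, immutable) argument
order is taken from this changelog; it is a minimal sketch under those
assumptions rather than the benchmark's actual code.

```c
#include <stdio.h>
#include <sys/prctl.h>

/*
 * Constants as added to prctl.h earlier in this series; repeated here only
 * so that the sketch builds against older headers.
 */
#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH			78
# define PR_FUTEX_HASH_SET_SLOTS	1
# define PR_FUTEX_HASH_GET_SLOTS	2
# define PR_FUTEX_HASH_GET_IMMUTABLE	3
#endif

/* Hypothetical helper: request @buckets slots; 0 on success, -1 on error. */
static int sketch_set_buckets(int buckets, int immutable)
{
	return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS,
		     (unsigned long)buckets, (unsigned long)immutable, 0UL);
}

/* Hypothetical helper: current number of slots, or -1 if the query fails. */
static int sketch_get_buckets(void)
{
	return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS, 0UL, 0UL, 0UL);
}

/*
 * Hypothetical helper: map the GET_SLOTS result to the hash in use. A failed
 * query is treated like 0, i.e. as an older kernel with only the global hash.
 */
static const char *sketch_hash_mode(int slots)
{
	return slots > 0 ? "private hash" : "global hash";
}
```

A caller would typically invoke sketch_set_buckets() once before creating
threads and print sketch_hash_mode(sketch_get_buckets()) in its summary.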
Cc: "Liang, Kan"
Cc: Adrian Hunter
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Ian Rogers
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Mark Rutland
Cc: Namhyung Kim
Cc: linux-perf-users@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior
---
 tools/perf/bench/Build                 |  1 +
 tools/perf/bench/futex-hash.c          |  7 +++
 tools/perf/bench/futex-lock-pi.c       |  5 ++
 tools/perf/bench/futex-requeue.c       |  6 +++
 tools/perf/bench/futex-wake-parallel.c |  9 +++-
 tools/perf/bench/futex-wake.c          |  4 ++
 tools/perf/bench/futex.c               | 65 ++++++++++++++++++++++++++
 tools/perf/bench/futex.h               |  5 ++
 8 files changed, 101 insertions(+), 1 deletion(-)
 create mode 100644 tools/perf/bench/futex.c

diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
index 279ab2ab4abe4..b558ab98719f9 100644
--- a/tools/perf/bench/Build
+++ b/tools/perf/bench/Build
@@ -3,6 +3,7 @@ perf-bench-y += sched-pipe.o
 perf-bench-y += sched-seccomp-notify.o
 perf-bench-y += syscall.o
 perf-bench-y += mem-functions.o
+perf-bench-y += futex.o
 perf-bench-y += futex-hash.o
 perf-bench-y += futex-wake.o
 perf-bench-y += futex-wake-parallel.o
diff --git a/tools/perf/bench/futex-hash.c b/tools/perf/bench/futex-hash.c
index b472eded521b1..fdf133c9520f7 100644
--- a/tools/perf/bench/futex-hash.c
+++ b/tools/perf/bench/futex-hash.c
@@ -18,9 +18,11 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
+#include
 #include
 
 #include "../util/mutex.h"
@@ -50,9 +52,12 @@ struct worker {
 static struct bench_futex_parameters params = {
 	.nfutexes = 1024,
 	.runtime = 10,
+	.nbuckets = -1,
 };
 
 static const struct option options[] = {
+	OPT_INTEGER( 'b', "buckets", &params.nbuckets, "Specify amount of hash buckets"),
+	OPT_BOOLEAN( 'I', "immutable", &params.buckets_immutable, "Make the hash buckets immutable"),
 	OPT_UINTEGER('t', "threads", &params.nthreads, "Specify amount of threads"),
 	OPT_UINTEGER('r', "runtime", &params.runtime, "Specify runtime (in seconds)"),
 	OPT_UINTEGER('f', "futexes", &params.nfutexes, "Specify amount of futexes per threads"),
@@ -118,6 +123,7 @@ static void print_summary(void)
 	printf("%sAveraged %ld operations/sec (+- %.2f%%), total secs = %d\n",
 	       !params.silent ? "\n" : "", avg, rel_stddev_stats(stddev, avg),
 	       (int)bench__runtime.tv_sec);
+	futex_print_nbuckets(&params);
 }
 
 int bench_futex_hash(int argc, const char **argv)
@@ -161,6 +167,7 @@ int bench_futex_hash(int argc, const char **argv)
 
 	if (!params.fshared)
 		futex_flag = FUTEX_PRIVATE_FLAG;
+	futex_set_nbuckets_param(&params);
 
 	printf("Run summary [PID %d]: %d threads, each operating on %d [%s] futexes for %d secs.\n\n",
 	       getpid(), params.nthreads, params.nfutexes, params.fshared ? "shared":"private", params.runtime);
diff --git a/tools/perf/bench/futex-lock-pi.c b/tools/perf/bench/futex-lock-pi.c
index 0416120c091b2..5144a158512cc 100644
--- a/tools/perf/bench/futex-lock-pi.c
+++ b/tools/perf/bench/futex-lock-pi.c
@@ -41,10 +41,13 @@ static struct stats throughput_stats;
 static struct cond thread_parent, thread_worker;
 
 static struct bench_futex_parameters params = {
+	.nbuckets = -1,
 	.runtime = 10,
 };
 
 static const struct option options[] = {
+	OPT_INTEGER( 'b', "buckets", &params.nbuckets, "Specify amount of hash buckets"),
+	OPT_BOOLEAN( 'I', "immutable", &params.buckets_immutable, "Make the hash buckets immutable"),
 	OPT_UINTEGER('t', "threads", &params.nthreads, "Specify amount of threads"),
 	OPT_UINTEGER('r', "runtime", &params.runtime, "Specify runtime (in seconds)"),
 	OPT_BOOLEAN( 'M', "multi", &params.multi, "Use multiple futexes"),
@@ -67,6 +70,7 @@ static void print_summary(void)
 	printf("%sAveraged %ld operations/sec (+- %.2f%%), total secs = %d\n",
 	       !params.silent ? "\n" : "", avg, rel_stddev_stats(stddev, avg),
 	       (int)bench__runtime.tv_sec);
+	futex_print_nbuckets(&params);
 }
 
 static void toggle_done(int sig __maybe_unused,
@@ -203,6 +207,7 @@ int bench_futex_lock_pi(int argc, const char **argv)
 	mutex_init(&thread_lock);
 	cond_init(&thread_parent);
 	cond_init(&thread_worker);
+	futex_set_nbuckets_param(&params);
 
 	threads_starting = params.nthreads;
 	gettimeofday(&bench__start, NULL);
diff --git a/tools/perf/bench/futex-requeue.c b/tools/perf/bench/futex-requeue.c
index aad5bfc4fe188..a2f91ee1950b3 100644
--- a/tools/perf/bench/futex-requeue.c
+++ b/tools/perf/bench/futex-requeue.c
@@ -42,6 +42,7 @@ static unsigned int threads_starting;
 static int futex_flag = 0;
 
 static struct bench_futex_parameters params = {
+	.nbuckets = -1,
 	/*
 	 * How many tasks to requeue at a time.
 	 * Default to 1 in order to make the kernel work more.
@@ -50,6 +51,8 @@ static struct bench_futex_parameters params = {
 };
 
 static const struct option options[] = {
+	OPT_INTEGER( 'b', "buckets", &params.nbuckets, "Specify amount of hash buckets"),
+	OPT_BOOLEAN( 'I', "immutable", &params.buckets_immutable, "Make the hash buckets immutable"),
 	OPT_UINTEGER('t', "threads", &params.nthreads, "Specify amount of threads"),
 	OPT_UINTEGER('q', "nrequeue", &params.nrequeue, "Specify amount of threads to requeue at once"),
 	OPT_BOOLEAN( 's', "silent", &params.silent, "Silent mode: do not display data/details"),
@@ -77,6 +80,7 @@ static void print_summary(void)
 	       params.nthreads,
 	       requeuetime_avg / USEC_PER_MSEC,
 	       rel_stddev_stats(requeuetime_stddev, requeuetime_avg));
+	futex_print_nbuckets(&params);
 }
 
 static void *workerfn(void *arg __maybe_unused)
@@ -204,6 +208,8 @@ int bench_futex_requeue(int argc, const char **argv)
 	if (params.broadcast)
 		params.nrequeue = params.nthreads;
 
+	futex_set_nbuckets_param(&params);
+
 	printf("Run summary [PID %d]: Requeuing %d threads (from [%s] %p to %s%p), "
 	       "%d at a time.\n\n", getpid(), params.nthreads,
 	       params.fshared ? "shared":"private", &futex1,
diff --git a/tools/perf/bench/futex-wake-parallel.c b/tools/perf/bench/futex-wake-parallel.c
index 4352e318631e9..ee66482c29fd1 100644
--- a/tools/perf/bench/futex-wake-parallel.c
+++ b/tools/perf/bench/futex-wake-parallel.c
@@ -57,9 +57,13 @@ static struct stats waketime_stats, wakeup_stats;
 static unsigned int threads_starting;
 static int futex_flag = 0;
 
-static struct bench_futex_parameters params;
+static struct bench_futex_parameters params = {
+	.nbuckets = -1,
+};
 
 static const struct option options[] = {
+	OPT_INTEGER( 'b', "buckets", &params.nbuckets, "Specify amount of hash buckets"),
+	OPT_BOOLEAN( 'I', "immutable", &params.buckets_immutable, "Make the hash buckets immutable"),
 	OPT_UINTEGER('t', "threads", &params.nthreads, "Specify amount of threads"),
 	OPT_UINTEGER('w', "nwakers", &params.nwakes, "Specify amount of waking threads"),
 	OPT_BOOLEAN( 's', "silent", &params.silent, "Silent mode: do not display data/details"),
@@ -218,6 +222,7 @@ static void print_summary(void)
 	       params.nthreads,
 	       waketime_avg / USEC_PER_MSEC,
 	       rel_stddev_stats(waketime_stddev, waketime_avg));
+	futex_print_nbuckets(&params);
 }
 
 
@@ -291,6 +296,8 @@ int bench_futex_wake_parallel(int argc, const char **argv)
 	if (!params.fshared)
 		futex_flag = FUTEX_PRIVATE_FLAG;
 
+	futex_set_nbuckets_param(&params);
+
 	printf("Run summary [PID %d]: blocking on %d threads (at [%s] "
 	       "futex %p), %d threads waking up %d at a time.\n\n",
 	       getpid(), params.nthreads, params.fshared ? "shared":"private",
diff --git a/tools/perf/bench/futex-wake.c b/tools/perf/bench/futex-wake.c
index 49b3c89b0b35d..8d6107f7cd941 100644
--- a/tools/perf/bench/futex-wake.c
+++ b/tools/perf/bench/futex-wake.c
@@ -42,6 +42,7 @@ static unsigned int threads_starting;
 static int futex_flag = 0;
 
 static struct bench_futex_parameters params = {
+	.nbuckets = -1,
 	/*
 	 * How many wakeups to do at a time.
 	 * Default to 1 in order to make the kernel work more.
@@ -50,6 +51,8 @@ static struct bench_futex_parameters params = {
 };
 
 static const struct option options[] = {
+	OPT_INTEGER( 'b', "buckets", &params.nbuckets, "Specify amount of hash buckets"),
+	OPT_BOOLEAN( 'I', "immutable", &params.buckets_immutable, "Make the hash buckets immutable"),
 	OPT_UINTEGER('t', "threads", &params.nthreads, "Specify amount of threads"),
 	OPT_UINTEGER('w', "nwakes", &params.nwakes, "Specify amount of threads to wake at once"),
 	OPT_BOOLEAN( 's', "silent", &params.silent, "Silent mode: do not display data/details"),
@@ -93,6 +96,7 @@ static void print_summary(void)
 	       params.nthreads,
 	       waketime_avg / USEC_PER_MSEC,
 	       rel_stddev_stats(waketime_stddev, waketime_avg));
+	futex_print_nbuckets(&params);
 }
 
 static void block_threads(pthread_t *w, struct perf_cpu_map *cpu)
diff --git a/tools/perf/bench/futex.c b/tools/perf/bench/futex.c
new file mode 100644
index 0000000000000..02ae6c52ba881
--- /dev/null
+++ b/tools/perf/bench/futex.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include
+
+#include "futex.h"
+
+void futex_set_nbuckets_param(struct bench_futex_parameters *params)
+{
+	int ret;
+
+	if (params->nbuckets < 0)
+		return;
+
+	ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, params->nbuckets, params->buckets_immutable);
+	if (ret) {
+		printf("Requesting %d hash buckets failed: %d/%m\n",
+		       params->nbuckets, ret);
+		err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
+	}
+}
+
+void futex_print_nbuckets(struct bench_futex_parameters *params)
+{
+	char *futex_hash_mode;
+	int ret;
+
+	ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
+	if (params->nbuckets >= 0) {
+		if (ret != params->nbuckets) {
+			if (ret < 0) {
+				printf("Can't query number of buckets: %m\n");
+				err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
+			}
+			printf("Requested number of hash buckets is not currently used.\n");
+			printf("Requested: %d in usage: %d\n", params->nbuckets, ret);
+			err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
+		}
+		if (params->nbuckets == 0) {
+			ret = asprintf(&futex_hash_mode, "Futex hashing: global hash");
+		} else {
+			ret = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE);
+			if (ret < 0) {
+				printf("Can't check if the hash is immutable: %m\n");
+				err(EXIT_FAILURE, "prctl(PR_FUTEX_HASH)");
+			}
+			ret = asprintf(&futex_hash_mode, "Futex hashing: %d hash buckets %s",
+				       params->nbuckets,
+				       ret == 1 ? "(immutable)" : "");
+		}
+	} else {
+		if (ret <= 0) {
+			ret = asprintf(&futex_hash_mode, "Futex hashing: global hash");
+		} else {
+			ret = asprintf(&futex_hash_mode, "Futex hashing: auto resized to %d buckets",
+				       ret);
+		}
+	}
+	if (ret < 0)
+		err(EXIT_FAILURE, "ENOMEM, futex_hash_mode");
+	printf("%s\n", futex_hash_mode);
+	free(futex_hash_mode);
+}
diff --git a/tools/perf/bench/futex.h b/tools/perf/bench/futex.h
index ebdc2b032afc1..9c9a73f9d865e 100644
--- a/tools/perf/bench/futex.h
+++ b/tools/perf/bench/futex.h
@@ -25,6 +25,8 @@ struct bench_futex_parameters {
 	unsigned int nfutexes;
 	unsigned int nwakes;
 	unsigned int nrequeue;
+	int nbuckets;
+	bool buckets_immutable;
 };
 
 /**
@@ -143,4 +145,7 @@ futex_cmp_requeue_pi(u_int32_t *uaddr, u_int32_t val, u_int32_t *uaddr2,
 		     val, opflags);
 }
 
+void futex_set_nbuckets_param(struct bench_futex_parameters *params);
+void futex_print_nbuckets(struct bench_futex_parameters *params);
+
 #endif /* _FUTEX_H */
-- 
2.49.0

From nobody Fri Dec 19 02:59:33 2025
Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 82B1E212F9A for ; Wed, 16 Apr 2025 16:29:36 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820979; cv=none;
b=Y0oHa0Osc1EJiXD/ACaFt10IWD5qKOjo3yqfWCHPTIt9Nva/nSgnXLpdHnE83otA1Mlysr3ye2iFlDg0QKOxU7ZWkZ+Ds7m68ohzdabg2rcD1y9ZZt09oOwE/kvK0QnZmQ/VY3VodFDWaCOOZcyObSU3BxTVdJCfdBqw6L9Cmu4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820979; c=relaxed/simple; bh=LejZZ2spwtKIitLCsmp0xqCJ9M7wTNnxW8YjPik6nE8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=bOCwR/t7Sr6iVaOCTj2BdEoAe/8qhkndbrA5SMS1pOQmKKIvxId6E+vny91pTOpnujGvTRSm3EEgLfh5hPY77gpZHL3emhW1y6f+aihvkK5LuYBsT6gX7ABumq28lARE43WK2xCEf+yYTTF/UX0U+At7VIHDh99OAG+R3KtCRbs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=qQe0eMdf; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=/Mt8oejq; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="qQe0eMdf"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="/Mt8oejq" From: Sebastian Andrzej Siewior DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1744820974; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=sTntC0IAr7OMmtpTw+HekjeIoPDIhXwXUNcMKwKnyz8=; b=qQe0eMdfCF80WPHou54JiBKl1bMnw9vye1okjYvxnr4YqU8uL51/Hu9ShXBRlFMx2yJtg5 tr+tGX9y3NvBnOh41u0w5r/F6zGRUFS6xsKlMbwZ2oYS3znnNrmmMxy2dIKvI4q9WbA+B8 GBfWmw9CqIbrigjz6CLwDQt6vIFRcg+2R2HLhfX5A2WkBt8jo9Hz6JixNCmpSJ7hvApYGa 
MdlK7bsNPCIbx9UzWEmlCDsv6Zz/4yDkpBacwh14lkAxtnUqiF2zh17+4sdDxLyiuJf8+A r9eB36nExxq/Q7Ef97J6KXz1kWpfJipgTfi+aNNM25XRY4DI0EONQtWQTmx4DQ==
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1744820974; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=sTntC0IAr7OMmtpTw+HekjeIoPDIhXwXUNcMKwKnyz8=; b=/Mt8oejqchbpsPGnQvl1erzIzcsIjQX1QO72Eg0fkyf/j4v0EAJMJdKvqxFClj5csLzcR1 DYFH0mTtZ2bcsDDg==
To: linux-kernel@vger.kernel.org
Cc: André Almeida , Darren Hart , Davidlohr Bueso , Ingo Molnar , Juri Lelli , Peter Zijlstra , Thomas Gleixner , Valentin Schneider , Waiman Long , Sebastian Andrzej Siewior
Subject: [PATCH v12 20/21] selftests/futex: Add futex_priv_hash
Date: Wed, 16 Apr 2025 18:29:20 +0200
Message-ID: <20250416162921.513656-21-bigeasy@linutronix.de>
In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de>
References: <20250416162921.513656-1-bigeasy@linutronix.de>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Test the basic functionality of the private hash:
- Upon start, with no threads, there is no private hash.
- The first thread initializes the private hash.
- More than four threads will increase the size of the private hash if
  the system has more than 16 CPUs online.
- Once the user sets the size of the private hash, auto scaling is
  disabled.
- The user is only allowed to request a number of slots that is a power
  of two.
- The user may request the global hash or make the hash immutable.
- Once the global hash has been set or the hash has been made immutable,
  further changes are not allowed.
- Futex operations should work the whole time.
  It must be possible to hold a lock, such as a PI-initialised mutex,
  during the resize operation.

Signed-off-by: Sebastian Andrzej Siewior
---
 .../selftests/futex/functional/.gitignore     |   5 +-
 .../selftests/futex/functional/Makefile       |   1 +
 .../futex/functional/futex_priv_hash.c        | 315 ++++++++++++++++++
 .../testing/selftests/futex/functional/run.sh |   4 +
 4 files changed, 323 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/futex/functional/futex_priv_hash.c

diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore
index fbcbdb6963b3a..d37ae7c6e879e 100644
--- a/tools/testing/selftests/futex/functional/.gitignore
+++ b/tools/testing/selftests/futex/functional/.gitignore
@@ -1,11 +1,12 @@
 # SPDX-License-Identifier: GPL-2.0-only
+futex_priv_hash
+futex_requeue
 futex_requeue_pi
 futex_requeue_pi_mismatched_ops
 futex_requeue_pi_signal_restart
+futex_wait
 futex_wait_private_mapped_file
 futex_wait_timeout
 futex_wait_uninitialized_heap
 futex_wait_wouldblock
-futex_wait
-futex_requeue
 futex_waitv
diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
index f79f9bac7918b..67d9e16d8a1f8 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
@@ -17,6 +17,7 @@ TEST_GEN_PROGS := \
 	futex_wait_private_mapped_file \
 	futex_wait \
 	futex_requeue \
+	futex_priv_hash \
 	futex_waitv
 
 TEST_PROGS := run.sh
diff --git a/tools/testing/selftests/futex/functional/futex_priv_hash.c b/tools/testing/selftests/futex/functional/futex_priv_hash.c
new file mode 100644
index 0000000000000..4d37650baa192
--- /dev/null
+++ b/tools/testing/selftests/futex/functional/futex_priv_hash.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2025 Sebastian Andrzej Siewior
+ */
+
+#define _GNU_SOURCE
+
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+
+#include "logging.h"
+
+#define MAX_THREADS	64
+
+static pthread_barrier_t barrier_main;
+static pthread_mutex_t global_lock;
+static pthread_t threads[MAX_THREADS];
+static int counter;
+
+#ifndef PR_FUTEX_HASH
+#define PR_FUTEX_HASH	78
+# define PR_FUTEX_HASH_SET_SLOTS	1
+# define PR_FUTEX_HASH_GET_SLOTS	2
+# define PR_FUTEX_HASH_GET_IMMUTABLE	3
+#endif
+
+static int futex_hash_slots_set(unsigned int slots, int immutable)
+{
+	return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, slots, immutable);
+}
+
+static int futex_hash_slots_get(void)
+{
+	return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
+}
+
+static int futex_hash_immutable_get(void)
+{
+	return prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_IMMUTABLE);
+}
+
+static void futex_hash_slots_set_verify(int slots)
+{
+	int ret;
+
+	ret = futex_hash_slots_set(slots, 0);
+	if (ret != 0) {
+		error("Failed to set slots to %d\n", errno, slots);
+		exit(1);
+	}
+	ret = futex_hash_slots_get();
+	if (ret != slots) {
+		error("Set %d slots but PR_FUTEX_HASH_GET_SLOTS returns: %d\n",
+		      errno, slots, ret);
+		exit(1);
+	}
+}
+
+static void futex_hash_slots_set_must_fail(int slots, int immutable)
+{
+	int ret;
+
+	ret = futex_hash_slots_set(slots, immutable);
+	if (ret < 0)
+		return;
+
+	fail("futex_hash_slots_set(%d, %d) expected to fail but succeeded.\n",
+	     slots, immutable);
+	exit(1);
+}
+
+static void *thread_return_fn(void *arg)
+{
+	return NULL;
+}
+
+static void *thread_lock_fn(void *arg)
+{
+	pthread_barrier_wait(&barrier_main);
+
+	pthread_mutex_lock(&global_lock);
+	counter++;
+	usleep(20);
+	pthread_mutex_unlock(&global_lock);
+	return NULL;
+}
+
+static void create_max_threads(void *(*thread_fn)(void *))
+{
+	int i, ret;
+
+	for (i = 0; i < MAX_THREADS; i++) {
+		ret = pthread_create(&threads[i], NULL, thread_fn, NULL);
+		if (ret) {
+			error("pthread_create failed\n", errno);
+			exit(1);
+		}
+	}
+}
+
+static void join_max_threads(void)
+{
+	int i, ret;
+
+	for (i = 0; i < MAX_THREADS; i++) {
+		ret = pthread_join(threads[i], NULL);
+		if (ret) {
+			error("pthread_join failed for thread %d\n", errno, i);
+			exit(1);
+		}
+	}
+}
+
+static void usage(char *prog)
+{
+	printf("Usage: %s\n", prog);
+	printf(" -c Use color\n");
+	printf(" -g Test global hash instead of local immutable\n");
+	printf(" -h Display this help message\n");
+	printf(" -v L Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+	       VQUIET, VCRITICAL, VINFO);
+}
+
+int main(int argc, char *argv[])
+{
+	int futex_slots1, futex_slotsn, online_cpus;
+	pthread_mutexattr_t mutex_attr_pi;
+	int use_global_hash = 0;
+	int ret;
+	char c;
+
+	while ((c = getopt(argc, argv, "cghv:")) != -1) {
+		switch (c) {
+		case 'c':
+			log_color(1);
+			break;
+		case 'g':
+			use_global_hash = 1;
+			break;
+		case 'h':
+			usage(basename(argv[0]));
+			exit(0);
+			break;
+		case 'v':
+			log_verbosity(atoi(optarg));
+			break;
+		default:
+			usage(basename(argv[0]));
+			exit(1);
+		}
+	}
+
+
+	ret = pthread_mutexattr_init(&mutex_attr_pi);
+	ret |= pthread_mutexattr_setprotocol(&mutex_attr_pi, PTHREAD_PRIO_INHERIT);
+	ret |= pthread_mutex_init(&global_lock, &mutex_attr_pi);
+	if (ret != 0) {
+		fail("Failed to initialize pthread mutex.\n");
+		return 1;
+	}
+
+	/* First thread, expect to be 0, not yet initialized */
+	ret = futex_hash_slots_get();
+	if (ret != 0) {
+		error("futex_hash_slots_get() failed: %d\n", errno, ret);
+		return 1;
+	}
+	ret = futex_hash_immutable_get();
+	if (ret != 0) {
+		error("futex_hash_immutable_get() failed: %d\n", errno, ret);
+		return 1;
+	}
+
+	ret = pthread_create(&threads[0], NULL, thread_return_fn, NULL);
+	if (ret != 0) {
+		error("pthread_create() failed: %d\n", errno, ret);
+		return 1;
+	}
+	ret = pthread_join(threads[0], NULL);
+	if (ret != 0) {
+		error("pthread_join() failed: %d\n", errno, ret);
+		return 1;
+	}
+	/* First thread, has to initialize private hash */
+	futex_slots1 = futex_hash_slots_get();
+	if (futex_slots1 <= 0) {
+		fail("Expected > 0 hash buckets, got: %d\n", futex_slots1);
+		return 1;
+	}
+
+	online_cpus = sysconf(_SC_NPROCESSORS_ONLN);
+	ret = pthread_barrier_init(&barrier_main, NULL, MAX_THREADS + 1);
+	if (ret != 0) {
+		error("pthread_barrier_init failed.\n", errno);
+		return 1;
+	}
+
+	ret = pthread_mutex_lock(&global_lock);
+	if (ret != 0) {
+		error("pthread_mutex_lock failed.\n", errno);
+		return 1;
+	}
+
+	counter = 0;
+	create_max_threads(thread_lock_fn);
+	pthread_barrier_wait(&barrier_main);
+
+	/*
+	 * The current default size of hash buckets is 16. The auto increase
+	 * works only if more than 16 CPUs are available.
+	 */
+	if (online_cpus > 16) {
+		futex_slotsn = futex_hash_slots_get();
+		if (futex_slotsn < 0 || futex_slots1 == futex_slotsn) {
+			fail("Expected increase of hash buckets but got: %d -> %d\n",
+			     futex_slots1, futex_slotsn);
+			info("Online CPUs: %d\n", online_cpus);
+			return 1;
+		}
+	}
+	ret = pthread_mutex_unlock(&global_lock);
+
+	/* Once the user changes it, it has to be what is set */
+	futex_hash_slots_set_verify(2);
+	futex_hash_slots_set_verify(4);
+	futex_hash_slots_set_verify(8);
+	futex_hash_slots_set_verify(32);
+	futex_hash_slots_set_verify(16);
+
+	ret = futex_hash_slots_set(15, 0);
+	if (ret >= 0) {
+		fail("Expected to fail with 15 slots but succeeded: %d.\n", ret);
+		return 1;
+	}
+	futex_hash_slots_set_verify(2);
+	join_max_threads();
+	if (counter != MAX_THREADS) {
+		fail("Expected thread counter at %d but is %d\n",
+		     MAX_THREADS, counter);
+		return 1;
+	}
+	counter = 0;
+	/* Once the user set something, auto resize must be disabled */
+	ret = pthread_barrier_init(&barrier_main, NULL, MAX_THREADS);
+
+	create_max_threads(thread_lock_fn);
+	join_max_threads();
+
+	ret = futex_hash_slots_get();
+	if (ret != 2) {
+		printf("Expected 2 slots, no auto-resize, got %d\n", ret);
+		return 1;
+	}
+
+	futex_hash_slots_set_must_fail(1 << 29, 0);
+
+	/*
+	 * Once the private hash has been made immutable or global hash has been requested,
+	 * then this request cannot be undone.
+	 */
+	if (use_global_hash) {
+		ret = futex_hash_slots_set(0, 0);
+		if (ret != 0) {
+			printf("Can't request global hash: %m\n");
+			return 1;
+		}
+	} else {
+		ret = futex_hash_slots_set(4, 1);
+		if (ret != 0) {
+			printf("Immutable resize to 4 failed: %m\n");
+			return 1;
+		}
+	}
+
+	futex_hash_slots_set_must_fail(4, 0);
+	futex_hash_slots_set_must_fail(4, 1);
+	futex_hash_slots_set_must_fail(8, 0);
+	futex_hash_slots_set_must_fail(8, 1);
+	futex_hash_slots_set_must_fail(0, 1);
+	futex_hash_slots_set_must_fail(6, 1);
+
+	ret = pthread_barrier_init(&barrier_main, NULL, MAX_THREADS);
+	if (ret != 0) {
+		error("pthread_barrier_init failed.\n", errno);
+		return 1;
+	}
+	create_max_threads(thread_lock_fn);
+	join_max_threads();
+
+	ret = futex_hash_slots_get();
+	if (use_global_hash) {
+		if (ret != 0) {
+			error("Expected global hash, got %d\n", errno, ret);
+			return 1;
+		}
+	} else {
+		if (ret != 4) {
+			error("Expected 4 slots, no auto-resize, got %d\n", errno, ret);
+			return 1;
+		}
+	}
+
+	ret = futex_hash_immutable_get();
+	if (ret != 1) {
+		fail("Expected immutable private hash, got %d\n", ret);
+		return 1;
+	}
+	return 0;
+}
diff --git a/tools/testing/selftests/futex/functional/run.sh b/tools/testing/selftests/futex/functional/run.sh
index 5ccd599da6c30..f0f0d2b683d7e 100755
--- a/tools/testing/selftests/futex/functional/run.sh
+++ b/tools/testing/selftests/futex/functional/run.sh
@@ -82,3 +82,7 @@ echo
 
 echo
 ./futex_waitv $COLOR
+
+echo
+./futex_priv_hash $COLOR
+./futex_priv_hash -g $COLOR
-- 
2.49.0

From nobody Fri Dec 19 02:59:33 2025
Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4893C21C193 for ; Wed, 16 Apr 2025 16:29:36 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none
smtp.client-ip=193.142.43.55 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820979; cv=none; b=UpPuPgckD7tfndi6Er9V89d7z88WsDKWuEGHHBVIwh27zWsH+E4G14culAJtF6R6zeLoS4hfPbA3HUUxQCO1114YpGa9f/Fct8rpzzUj7SoO6B6MnZncqS4XwPvUVhmLQi5/mFwWYpfOcotjPhGwh2sI9vx5iVH7eBGJYw33NLk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744820979; c=relaxed/simple; bh=lt2/Wz0SjlzjDIBrfeAz7OUWJL3HzSiG8nn58ykLWHc=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=IxKFiuo6PAUkHW7X9ugUmUrJAAz/qoCJ/JdlvSJG3EEnlWo6sXa823LClEsi1nxp8SH9PBxhOAMJ56zxJS3kEwaxDvXqIH4V2A9+IhFyFt0IYdF0lsgC+5RFPtiaeAfpIitaMWUOzYwsyKeLsB+3ZSQfsoN5NM3MrVjDeI+5YqQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=ihgWRmF1; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=HMHHdQQz; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="ihgWRmF1"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="HMHHdQQz" From: Sebastian Andrzej Siewior DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1744820974; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=lpOZgZc3132Vq4fULyCZrzofdl0twCFBC2bsdRhjlfw=; b=ihgWRmF1+gQpnKUExmKZPIueCgibq0kOzoLv51TnQnW2xlR1yt6T4EBqmggblVU+NWroBQ 
hUG58d3Ptx8mx2XmSgljG2/7yz25KVZGo0MQzURarq2xpebJ9NYPh+iPdawGP3VjQnS9fz pcXENi2Hvpwx3yctj7ZLaBtrJjzgRe3bQQ8Vca+30C3Li07ocZ1mO1G9ztA3s9uws42ZDd 5Vv8yJjxSArYCRavjsIn5XluETuz9g7Z1dGDZ2qUt3g6+4+1jk4AxUetYenqeQRK/FmpIC d6lv01wD9+Gg0rCidueFFDICtdJ4xFfFtv4Z6Lf2mGt53M3gUUMQ6J2FdE13Ww==
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1744820974; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=lpOZgZc3132Vq4fULyCZrzofdl0twCFBC2bsdRhjlfw=; b=HMHHdQQz4N38yU5YQDpJnCujHuh1RA1tnBn02XZ202b3CTdXPAp8codo4TEOYIIPe8SaDa /vrhQfJXBCiuUFAw==
To: linux-kernel@vger.kernel.org
Cc: André Almeida , Darren Hart , Davidlohr Bueso , Ingo Molnar , Juri Lelli , Peter Zijlstra , Thomas Gleixner , Valentin Schneider , Waiman Long , Sebastian Andrzej Siewior
Subject: [PATCH v12 21/21] selftests/futex: Add futex_numa_mpol
Date: Wed, 16 Apr 2025 18:29:21 +0200
Message-ID: <20250416162921.513656-22-bigeasy@linutronix.de>
In-Reply-To: <20250416162921.513656-1-bigeasy@linutronix.de>
References: <20250416162921.513656-1-bigeasy@linutronix.de>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Test the basic functionality for the NUMA and MPOL flags:
- FUTEX2_NUMA should take the NUMA node which is stored directly after
  the futex word at uaddr and use it.
- The node is only updated if the user set it to FUTEX_NO_NODE.
- FUTEX2_MPOL should use the node based on the memory policy. I
  attempted to set the node with mbind() and then use it with MPOL, but
  this fails and futex falls back to the default node for the current
  CPU.
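For illustration only, and not part of this patch: the first two points
above can be modelled in plain userspace C as sketched below. The struct
and function names are hypothetical, the layout (node word directly after
a 32-bit futex word) is an assumption derived from the description above,
and the function only models the expected rule; it is not the kernel
implementation.

```c
#include <stddef.h>
#include <stdint.h>

#ifndef FUTEX_NO_NODE
#define FUTEX_NO_NODE (-1)
#endif

/*
 * Assumed layout for a 32-bit FUTEX2_NUMA futex: the node value is stored
 * directly after the futex word. The struct name is hypothetical.
 */
struct sketch_futex32_numa {
	uint32_t futex;
	uint32_t numa;
};

/*
 * Userspace model of the rule under test, not kernel code: a node supplied
 * by the user is used as-is, and only FUTEX_NO_NODE is replaced by the node
 * that was picked for the operation (@picked_node stands in for that value).
 */
static uint32_t sketch_resolve_node(struct sketch_futex32_numa *f,
				    uint32_t picked_node)
{
	if (f->numa == (uint32_t)FUTEX_NO_NODE)
		f->numa = picked_node;
	return f->numa;
}
```

With this model, a futex initialised with numa = FUTEX_NO_NODE ends up
with a concrete node after the first operation, and later operations keep
using that node.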
Signed-off-by: Sebastian Andrzej Siewior --- .../selftests/futex/functional/.gitignore | 1 + .../selftests/futex/functional/Makefile | 3 +- .../futex/functional/futex_numa_mpol.c | 232 ++++++++++++++++++ .../testing/selftests/futex/functional/run.sh | 3 + .../selftests/futex/include/futex2test.h | 34 +++ 5 files changed, 272 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/futex/functional/futex_numa_mpo= l.c diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/te= sting/selftests/futex/functional/.gitignore index d37ae7c6e879e..7b24ae89594a9 100644 --- a/tools/testing/selftests/futex/functional/.gitignore +++ b/tools/testing/selftests/futex/functional/.gitignore @@ -1,4 +1,5 @@ # SPDX-License-Identifier: GPL-2.0-only +futex_numa_mpol futex_priv_hash futex_requeue futex_requeue_pi diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/test= ing/selftests/futex/functional/Makefile index 67d9e16d8a1f8..a4881fd2cd540 100644 --- a/tools/testing/selftests/futex/functional/Makefile +++ b/tools/testing/selftests/futex/functional/Makefile @@ -1,7 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 INCLUDES :=3D -I../include -I../../ $(KHDR_INCLUDES) CFLAGS :=3D $(CFLAGS) -g -O2 -Wall -pthread $(INCLUDES) $(KHDR_INCLUDES) -LDLIBS :=3D -lpthread -lrt +LDLIBS :=3D -lpthread -lrt -lnuma =20 LOCAL_HDRS :=3D \ ../include/futextest.h \ @@ -18,6 +18,7 @@ TEST_GEN_PROGS :=3D \ futex_wait \ futex_requeue \ futex_priv_hash \ + futex_numa_mpol \ futex_waitv =20 TEST_PROGS :=3D run.sh diff --git a/tools/testing/selftests/futex/functional/futex_numa_mpol.c b/t= ools/testing/selftests/futex/functional/futex_numa_mpol.c new file mode 100644 index 0000000000000..30302691303f0 --- /dev/null +++ b/tools/testing/selftests/futex/functional/futex_numa_mpol.c @@ -0,0 +1,232 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2025 Sebastian Andrzej Siewior + */ + +#define _GNU_SOURCE + +#include +#include +#include +#include 
+#include
+#include
+#include
+
+#include
+#include
+
+#include "logging.h"
+#include "futextest.h"
+#include "futex2test.h"
+
+#define MAX_THREADS	64
+
+static pthread_barrier_t barrier_main;
+static pthread_t threads[MAX_THREADS];
+
+struct thread_args {
+	void *futex_ptr;
+	unsigned int flags;
+	int result;
+};
+
+static struct thread_args thread_args[MAX_THREADS];
+
+#ifndef FUTEX_NO_NODE
+#define FUTEX_NO_NODE (-1)
+#endif
+
+#ifndef FUTEX2_MPOL
+#define FUTEX2_MPOL	0x08
+#endif
+
+static void *thread_lock_fn(void *arg)
+{
+	struct thread_args *args = arg;
+	int ret;
+
+	pthread_barrier_wait(&barrier_main);
+	ret = futex2_wait2(args->futex_ptr, 0, args->flags, NULL, 0);
+	args->result = ret;
+	return NULL;
+}
+
+static void create_max_threads(void *futex_ptr)
+{
+	int i, ret;
+
+	for (i = 0; i < MAX_THREADS; i++) {
+		thread_args[i].futex_ptr = futex_ptr;
+		thread_args[i].flags = FUTEX2_SIZE_U32 | FUTEX_PRIVATE_FLAG | FUTEX2_NUMA;
+		thread_args[i].result = 0;
+		ret = pthread_create(&threads[i], NULL, thread_lock_fn, &thread_args[i]);
+		if (ret) {
+			error("pthread_create failed\n", ret);
+			exit(1);
+		}
+	}
+}
+
+static void join_max_threads(void)
+{
+	int i, ret;
+
+	for (i = 0; i < MAX_THREADS; i++) {
+		ret = pthread_join(threads[i], NULL);
+		if (ret) {
+			error("pthread_join failed for thread %d\n", ret, i);
+			exit(1);
+		}
+	}
+}
+
+static void __test_futex(void *futex_ptr, int must_fail, unsigned int futex_flags)
+{
+	int to_wake, ret, i, need_exit = 0;
+
+	pthread_barrier_init(&barrier_main, NULL, MAX_THREADS + 1);
+	create_max_threads(futex_ptr);
+	pthread_barrier_wait(&barrier_main);
+	to_wake = MAX_THREADS;
+
+	do {
+		ret = futex2_wake(futex_ptr, to_wake, futex_flags);
+		if (must_fail) {
+			if (ret < 0)
+				break;
+			fail("Should fail, but didn't\n");
+			exit(1);
+		}
+		if (ret < 0) {
+			error("Failed futex2_wake(%d)\n", errno, to_wake);
+			exit(1);
+		}
+		if (!ret)
+			usleep(50);
+		to_wake -= ret;
+
+	} while (to_wake);
+	join_max_threads();
+
+	for (i = 0; i < MAX_THREADS; i++) {
+		if (must_fail && thread_args[i].result != -1) {
+			fail("Thread %d should fail but succeeded (%d)\n", i, thread_args[i].result);
+			need_exit = 1;
+		}
+		if (!must_fail && thread_args[i].result != 0) {
+			fail("Thread %d failed (%d)\n", i, thread_args[i].result);
+			need_exit = 1;
+		}
+	}
+	if (need_exit)
+		exit(1);
+}
+
+static void test_futex(void *futex_ptr, int must_fail)
+{
+	__test_futex(futex_ptr, must_fail, FUTEX2_SIZE_U32 | FUTEX_PRIVATE_FLAG | FUTEX2_NUMA);
+}
+
+static void test_futex_mpol(void *futex_ptr, int must_fail)
+{
+	__test_futex(futex_ptr, must_fail, FUTEX2_SIZE_U32 | FUTEX_PRIVATE_FLAG | FUTEX2_NUMA | FUTEX2_MPOL);
+}
+
+static void usage(char *prog)
+{
+	printf("Usage: %s\n", prog);
+	printf(" -c Use color\n");
+	printf(" -h Display this help message\n");
+	printf(" -v L Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+	       VQUIET, VCRITICAL, VINFO);
+}
+
+int main(int argc, char *argv[])
+{
+	struct futex32_numa *futex_numa;
+	int mem_size, i;
+	void *futex_ptr;
+	int c;
+
+	while ((c = getopt(argc, argv, "chv:")) != -1) {
+		switch (c) {
+		case 'c':
+			log_color(1);
+			break;
+		case 'h':
+			usage(basename(argv[0]));
+			exit(0);
+			break;
+		case 'v':
+			log_verbosity(atoi(optarg));
+			break;
+		default:
+			usage(basename(argv[0]));
+			exit(1);
+		}
+	}
+
+	mem_size = sysconf(_SC_PAGE_SIZE);
+	futex_ptr = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+	if (futex_ptr == MAP_FAILED) {
+		error("mmap() for %d bytes failed\n", errno, mem_size);
+		return 1;
+	}
+	futex_numa = futex_ptr;
+
+	info("Regular test\n");
+	futex_numa->futex = 0;
+	futex_numa->numa = FUTEX_NO_NODE;
+	test_futex(futex_ptr, 0);
+
+	if (futex_numa->numa == FUTEX_NO_NODE) {
+		fail("NUMA node is left uninitialized\n");
+		return 1;
+	}
+
+	info("Memory too small\n");
+	test_futex(futex_ptr + mem_size - 4, 1);
+
+	info("Memory out of range\n");
+	test_futex(futex_ptr + mem_size, 1);
+
+	futex_numa->numa = FUTEX_NO_NODE;
+	mprotect(futex_ptr, mem_size, PROT_READ);
+	info("Memory, RO\n");
+	test_futex(futex_ptr, 1);
+
+	mprotect(futex_ptr, mem_size, PROT_NONE);
+	info("Memory, no access\n");
+	test_futex(futex_ptr, 1);
+
+	mprotect(futex_ptr, mem_size, PROT_READ | PROT_WRITE);
+	info("Memory back to RW\n");
+	test_futex(futex_ptr, 0);
+
+	/* MPOL test. Does not work as expected */
+	for (i = 0; i < 4; i++) {
+		unsigned long nodemask;
+		int ret;
+
+		nodemask = 1 << i;
+		ret = mbind(futex_ptr, mem_size, MPOL_BIND, &nodemask,
+			    sizeof(nodemask) * 8, 0);
+		if (ret == 0) {
+			info("Node %d test\n", i);
+			futex_numa->futex = 0;
+			futex_numa->numa = FUTEX_NO_NODE;
+
+			ret = futex2_wake(futex_ptr, 0, FUTEX2_SIZE_U32 | FUTEX_PRIVATE_FLAG | FUTEX2_NUMA | FUTEX2_MPOL);
+			if (ret < 0)
+				error("Failed to wake 0 with MPOL.\n", errno);
+			if (0)
+				test_futex_mpol(futex_numa, 0);
+			if (futex_numa->numa != i) {
+				fail("Returned NUMA node is %d expected %d\n",
+				     futex_numa->numa, i);
+			}
+		}
+	}
+	return 0;
+}
diff --git a/tools/testing/selftests/futex/functional/run.sh b/tools/testing/selftests/futex/functional/run.sh
index f0f0d2b683d7e..81739849f2994 100755
--- a/tools/testing/selftests/futex/functional/run.sh
+++ b/tools/testing/selftests/futex/functional/run.sh
@@ -86,3 +86,6 @@ echo
 echo
 ./futex_priv_hash $COLOR
 ./futex_priv_hash -g $COLOR
+
+echo
+./futex_numa_mpol $COLOR
diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h
index 9d305520e849b..b664e8f92bfd7 100644
--- a/tools/testing/selftests/futex/include/futex2test.h
+++ b/tools/testing/selftests/futex/include/futex2test.h
@@ -8,6 +8,11 @@
 
 #define u64_to_ptr(x) ((void *)(uintptr_t)(x))
 
+struct futex32_numa {
+	futex_t futex;
+	futex_t numa;
+};
+
 /**
  * futex_waitv - Wait at multiple futexes, wake on any
  * @waiters: Array of waiters
@@ -20,3 +25,32 @@ static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned lon
 {
 	return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
 }
+
+static inline int futex2_wait(volatile struct futex_waitv *waiters, unsigned long nr_waiters,
+			      unsigned long flags, struct timespec *timo, clockid_t clockid)
+{
+	return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
+}
+
+/*
+ * futex2_wait2() - block on uaddr with optional timeout
+ * @val: Expected value
+ * @flags: FUTEX2 flags
+ * @timeout: Relative timeout
+ * @clockid: Clock id for the timeout
+ */
+static inline int futex2_wait2(void *uaddr, long val, unsigned int flags,
+			       struct timespec *timeout, clockid_t clockid)
+{
+	return syscall(__NR_futex_wait, uaddr, val, 1, flags, timeout, clockid);
+}
+
+/*
+ * futex2_wake() - Wake a number of futexes
+ * @nr: Number of threads to wake at most
+ * @flags: FUTEX2 flags
+ */
+static inline int futex2_wake(void *uaddr, int nr, unsigned int flags)
+{
+	return syscall(__NR_futex_wake, uaddr, 1, nr, flags);
+}
-- 
2.49.0
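
The MPOL loop in main() builds a one-node mask for mbind() and passes the
width of an unsigned long in bits as maxnode. A minimal sketch of that
arithmetic is below. It assumes nothing about mbind() itself, and
single_node_mask() and mask_bits() are hypothetical helpers that are not
part of the patch. The sketch shifts 1UL instead of 1 so the mask stays
well defined for node numbers beyond the width of int, which the test
(nodes 0..3) never reaches.

```c
#include <assert.h>
#include <limits.h>

static_assert(sizeof(unsigned long) * CHAR_BIT >= 4,
	      "the test loop uses node bits 0..3");

/* Hypothetical helper: node bits that fit into one unsigned long. */
static unsigned long mask_bits(void)
{
	return sizeof(unsigned long) * CHAR_BIT;
}

/*
 * Hypothetical helper: mask with only bit @node set, or 0 when @node
 * does not fit into a single unsigned long.
 */
static unsigned long single_node_mask(unsigned int node)
{
	if (node >= mask_bits())
		return 0;
	return 1UL << node;
}
```

For node 3 this yields 0x8, the same value the loop computes with 1 << i.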