From nobody Sat Jun 27 21:21:51 2026
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Thomas Gleixner,
 "Paul E . McKenney", Boqun Feng, "H . Peter Anvin", Paul Turner,
 linux-api@vger.kernel.org, Christian Brauner, Florian Weimer,
 David.Laight@ACULAB.COM, carlos@redhat.com, Peter Oskolkov,
 Mathieu Desnoyers
Subject: [RFC PATCH v2 01/11] rseq: Introduce feature size and alignment ELF auxiliary vector entries
Date: Fri, 18 Feb 2022 16:06:23 -0500
Message-Id: <20220218210633.23345-2-mathieu.desnoyers@efficios.com>
In-Reply-To: <20220218210633.23345-1-mathieu.desnoyers@efficios.com>
References: <20220218210633.23345-1-mathieu.desnoyers@efficios.com>

Export the rseq feature size supported by the kernel as well as the
required allocation alignment for the rseq per-thread area to user-space
through ELF auxiliary vector entries.

This is part of the extensible rseq ABI.
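
For illustration, user-space could read the two new entries roughly as
follows. This is a minimal sketch and not part of the patch: the helper
name rseq_query_auxv() and struct rseq_auxv_info are hypothetical, the
AT_* values are copied from the auxvec.h hunk of this patch, and the
fallback values (20-byte feature size, 32-byte alignment) assume the
original struct rseq layout on kernels that do not export the entries.

```c
#include <assert.h>
#include <sys/auxv.h>

/* Values from this patch; only defined if the libc headers lack them. */
#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE	27
#endif
#ifndef AT_RSEQ_ALIGN
#define AT_RSEQ_ALIGN		28
#endif

/* Hypothetical container for the two values (not a kernel type). */
struct rseq_auxv_info {
	unsigned long feature_size;	/* bytes of struct rseq the kernel populates */
	unsigned long align;		/* required alignment of the rseq area */
};

/* Hypothetical helper: query the auxiliary vector, with fallbacks. */
static struct rseq_auxv_info rseq_query_auxv(void)
{
	struct rseq_auxv_info info;

	/* getauxval() returns 0 when the entry is absent. */
	info.feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);
	info.align = getauxval(AT_RSEQ_ALIGN);

	/* Assumption: fall back to the original struct rseq layout. */
	if (!info.feature_size)
		info.feature_size = 20;
	if (!info.align)
		info.align = 32;

	/* Sanity check: an alignment is expected to be a power of two. */
	assert((info.align & (info.align - 1)) == 0);
	return info;
}
```

A caller would then allocate its per-thread area with at least
info.feature_size bytes, aligned on info.align.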

Signed-off-by: Mathieu Desnoyers
---
 fs/binfmt_elf.c             | 5 +++++
 include/uapi/linux/auxvec.h | 2 ++
 include/uapi/linux/rseq.h   | 5 +++++
 3 files changed, 12 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 605017eb9349..77776582e76d 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -46,6 +46,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -286,6 +287,10 @@ create_elf_tables(struct linux_binprm *bprm, const struct elfhdr *exec,
 	if (bprm->have_execfd) {
 		NEW_AUX_ENT(AT_EXECFD, bprm->execfd);
 	}
+#ifdef CONFIG_RSEQ
+	NEW_AUX_ENT(AT_RSEQ_FEATURE_SIZE, offsetof(struct rseq, end));
+	NEW_AUX_ENT(AT_RSEQ_ALIGN, __alignof__(struct rseq));
+#endif
 #undef NEW_AUX_ENT
 	/* AT_NULL is zero; clear the rest too */
 	memset(elf_info, 0, (char *)mm->saved_auxv +
diff --git a/include/uapi/linux/auxvec.h b/include/uapi/linux/auxvec.h
index c7e502bf5a6f..6991c4b8ab18 100644
--- a/include/uapi/linux/auxvec.h
+++ b/include/uapi/linux/auxvec.h
@@ -30,6 +30,8 @@
  * differ from AT_PLATFORM. */
 #define AT_RANDOM 25 /* address of 16 random bytes */
 #define AT_HWCAP2 26 /* extension of AT_HWCAP */
+#define AT_RSEQ_FEATURE_SIZE 27 /* rseq supported feature size */
+#define AT_RSEQ_ALIGN 28 /* rseq allocation alignment */
 
 #define AT_EXECFN 31 /* filename of program */
 
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 77ee207623a9..05d3c4cdeb40 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -130,6 +130,11 @@ struct rseq {
 	 * this thread.
 	 */
 	__u32 flags;
+
+	/*
+	 * Flexible array member at end of structure, after last feature field.
+	 */
+	char end[];
 } __attribute__((aligned(4 * sizeof(__u64))));
 
 #endif /* _UAPI_LINUX_RSEQ_H */
-- 
2.17.1

From nobody Sat Jun 27 21:21:51 2026
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Thomas Gleixner,
 "Paul E . McKenney", Boqun Feng, "H . Peter Anvin", Paul Turner,
 linux-api@vger.kernel.org, Christian Brauner, Florian Weimer,
 David.Laight@ACULAB.COM, carlos@redhat.com, Peter Oskolkov,
 Mathieu Desnoyers
Subject: [RFC PATCH v2 02/11] rseq: Introduce extensible rseq ABI
Date: Fri, 18 Feb 2022 16:06:24 -0500
Message-Id: <20220218210633.23345-3-mathieu.desnoyers@efficios.com>
In-Reply-To: <20220218210633.23345-1-mathieu.desnoyers@efficios.com>
References: <20220218210633.23345-1-mathieu.desnoyers@efficios.com>

Introduce the extensible rseq ABI, where the feature size supported by
the kernel and the required alignment are communicated to user-space
through ELF auxiliary vectors.

This allows user-space to call rseq registration with a rseq_len of
either 32 bytes for the original struct rseq size (which includes
padding), or larger.

If rseq_len is larger than 32 bytes, then it must be large enough to
contain the feature size communicated to user-space through ELF
auxiliary vectors.
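
The length rule above can be restated as a small predicate. This is a
sketch only: rseq_len_is_valid() is a hypothetical user-space helper
that mirrors the condition added to the rseq system call in this patch,
and feature_size stands for the value the kernel computes as
offsetof(struct rseq, end) and exports through AT_RSEQ_FEATURE_SIZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The original rseq structure size (including padding) is 32 bytes. */
#define ORIG_RSEQ_SIZE	32

/*
 * Hypothetical helper: returns true when rseq_len would pass the
 * registration length check described in this patch.
 */
static bool rseq_len_is_valid(uint32_t rseq_len, uint32_t feature_size)
{
	if (rseq_len < ORIG_RSEQ_SIZE)
		return false;		/* smaller than the original ABI */
	if (rseq_len == ORIG_RSEQ_SIZE)
		return true;		/* original ABI is always accepted */
	return rseq_len >= feature_size; /* extended: must cover all features */
}

/* Usage example with an assumed feature size of 40 bytes. */
static void rseq_len_examples(void)
{
	assert(rseq_len_is_valid(32, 40));	/* original size */
	assert(!rseq_len_is_valid(36, 40));	/* extended but too small */
	assert(rseq_len_is_valid(64, 40));	/* extended and large enough */
}
```

Alignment of the area is checked separately and is not covered by this
helper.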

Signed-off-by: Mathieu Desnoyers
---
 include/linux/sched.h |  4 ++++
 kernel/ptrace.c       |  2 +-
 kernel/rseq.c         | 33 +++++++++++++++++++++++++++------
 3 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 508b91d57470..838c9e0b4cae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1291,6 +1291,7 @@ struct task_struct {
 
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
+	u32 rseq_len;
 	u32 rseq_sig;
 	/*
 	 * RmW on rseq_event_mask must be performed atomically
@@ -2260,10 +2261,12 @@ static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 {
 	if (clone_flags & CLONE_VM) {
 		t->rseq = NULL;
+		t->rseq_len = 0;
 		t->rseq_sig = 0;
 		t->rseq_event_mask = 0;
 	} else {
 		t->rseq = current->rseq;
+		t->rseq_len = current->rseq_len;
 		t->rseq_sig = current->rseq_sig;
 		t->rseq_event_mask = current->rseq_event_mask;
 	}
@@ -2272,6 +2275,7 @@ static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 static inline void rseq_execve(struct task_struct *t)
 {
 	t->rseq = NULL;
+	t->rseq_len = 0;
 	t->rseq_sig = 0;
 	t->rseq_event_mask = 0;
 }
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index eea265082e97..f5edde5b7805 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -800,7 +800,7 @@ static long ptrace_get_rseq_configuration(struct task_struct *task,
 {
 	struct ptrace_rseq_configuration conf = {
 		.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
-		.rseq_abi_size = sizeof(*task->rseq),
+		.rseq_abi_size = task->rseq_len,
 		.signature = task->rseq_sig,
 		.flags = 0,
 	};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 97ac20b4f738..46dc5c2ce2b7 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -18,6 +18,9 @@
 #define CREATE_TRACE_POINTS
 #include
 
+/* The original rseq structure size (including padding) is 32 bytes. */
+#define ORIG_RSEQ_SIZE 32
+
 #define RSEQ_CS_PREEMPT_MIGRATE_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE | \
 				       RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)
 
@@ -86,10 +89,15 @@ static int rseq_update_cpu_id(struct task_struct *t)
 	u32 cpu_id = raw_smp_processor_id();
 	struct rseq __user *rseq = t->rseq;
 
-	if (!user_write_access_begin(rseq, sizeof(*rseq)))
+	if (!user_write_access_begin(rseq, t->rseq_len))
 		goto efault;
 	unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
 	unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
+	/*
+	 * Additional feature fields added after ORIG_RSEQ_SIZE
+	 * need to be conditionally updated only if
+	 * t->rseq_len != ORIG_RSEQ_SIZE.
+	 */
 	user_write_access_end();
 	trace_rseq_update(t);
 	return 0;
@@ -116,6 +124,11 @@ static int rseq_reset_rseq_cpu_id(struct task_struct *t)
 	 */
 	if (put_user(cpu_id, &t->rseq->cpu_id))
 		return -EFAULT;
+	/*
+	 * Additional feature fields added after ORIG_RSEQ_SIZE
+	 * need to be conditionally reset only if
+	 * t->rseq_len != ORIG_RSEQ_SIZE.
+	 */
 	return 0;
 }
 
@@ -336,7 +349,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		/* Unregister rseq for current thread. */
 		if (current->rseq != rseq || !current->rseq)
 			return -EINVAL;
-		if (rseq_len != sizeof(*rseq))
+		if (rseq_len != current->rseq_len)
 			return -EINVAL;
 		if (current->rseq_sig != sig)
 			return -EPERM;
@@ -345,6 +358,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 			return ret;
 		current->rseq = NULL;
 		current->rseq_sig = 0;
+		current->rseq_len = 0;
 		return 0;
 	}
 
@@ -357,7 +371,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		 * the provided address differs from the prior
 		 * one.
 		 */
-		if (current->rseq != rseq || rseq_len != sizeof(*rseq))
+		if (current->rseq != rseq || rseq_len != current->rseq_len)
 			return -EINVAL;
 		if (current->rseq_sig != sig)
 			return -EPERM;
@@ -366,15 +380,22 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 	}
 
 	/*
-	 * If there was no rseq previously registered,
-	 * ensure the provided rseq is properly aligned and valid.
+	 * If there was no rseq previously registered, ensure the provided rseq
+	 * is properly aligned, as communicated to user-space through the ELF
+	 * auxiliary vector AT_RSEQ_ALIGN.
+	 *
+	 * In order to be valid, rseq_len is either the original rseq size, or
+	 * large enough to contain all supported fields, as communicated to
+	 * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE.
 	 */
 	if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq)) ||
-	    rseq_len != sizeof(*rseq))
+	    rseq_len < ORIG_RSEQ_SIZE ||
+	    (rseq_len != ORIG_RSEQ_SIZE && rseq_len < offsetof(struct rseq, end)))
 		return -EINVAL;
 	if (!access_ok(rseq, rseq_len))
 		return -EFAULT;
 	current->rseq = rseq;
+	current->rseq_len = rseq_len;
 	current->rseq_sig = sig;
 	/*
 	 * If rseq was previously inactive, and has just been
-- 
2.17.1

From nobody Sat Jun 27 21:21:51 2026
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Thomas Gleixner,
 "Paul E . McKenney", Boqun Feng, "H . Peter Anvin", Paul Turner,
 linux-api@vger.kernel.org, Christian Brauner, Florian Weimer,
 David.Laight@ACULAB.COM, carlos@redhat.com, Peter Oskolkov,
 Mathieu Desnoyers
Subject: [RFC PATCH v2 03/11] rseq: extend struct rseq with numa node id
Date: Fri, 18 Feb 2022 16:06:25 -0500
Message-Id: <20220218210633.23345-4-mathieu.desnoyers@efficios.com>
In-Reply-To: <20220218210633.23345-1-mathieu.desnoyers@efficios.com>
References: <20220218210633.23345-1-mathieu.desnoyers@efficios.com>

Adding the NUMA node id to struct rseq is a straightforward thing to do,
and a good way to figure out if anything in the user-space ecosystem
prevents extending struct rseq.

This NUMA node id field allows memory allocators such as tcmalloc to
take advantage of fast access to the current NUMA node id to perform
NUMA-aware memory allocation.

It can also be useful for implementing fast-paths for NUMA-aware
user-space mutexes.

It also allows implementing getcpu(2) purely in user-space.
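
As a rough illustration of the last point, a user-space getcpu could
read both fields from the registered rseq area. The sketch below is not
part of this series: struct rseq_area_sketch is a hypothetical mirror of
the leading fields of struct rseq as extended by this patch, and
rseq_getcpu_sketch() is a hypothetical helper that reports failure so
the caller can fall back to getcpu(2). The re-read of cpu_id only
narrows the window in which a migration could yield a mismatched pair;
it does not close it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the leading fields of the extended struct rseq. */
struct rseq_area_sketch {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;
	uint32_t flags;
	uint32_t node_id;	/* field added by this patch */
};

/*
 * Hypothetical helper: returns 0 and fills *cpu and *node on success,
 * -1 when node_id is not covered by feature_size or when the area is
 * not registered (negative cpu_id).
 */
static int rseq_getcpu_sketch(const volatile struct rseq_area_sketch *area,
			      size_t feature_size,
			      unsigned int *cpu, unsigned int *node)
{
	uint32_t cpu_id, node_id;

	assert(area && cpu && node);
	if (feature_size < offsetof(struct rseq_area_sketch, node_id) +
			   sizeof(area->node_id))
		return -1;	/* kernel does not populate node_id */
	do {
		cpu_id = area->cpu_id;
		if ((int32_t)cpu_id < 0)
			return -1;	/* uninitialized or registration failed */
		node_id = area->node_id;
		/* Retry if cpu_id changed while node_id was being read. */
	} while (area->cpu_id != cpu_id);
	*cpu = cpu_id;
	*node = node_id;
	return 0;
}
```

Like getcpu(2) itself, the returned pair may already be stale by the
time the caller uses it unless the thread is pinned.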

Signed-off-by: Mathieu Desnoyers
---
 include/trace/events/rseq.h |  4 +++-
 include/uapi/linux/rseq.h   |  8 ++++++++
 kernel/rseq.c               | 19 +++++++++++++------
 3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/rseq.h b/include/trace/events/rseq.h
index a04a64bc1a00..6bd442697354 100644
--- a/include/trace/events/rseq.h
+++ b/include/trace/events/rseq.h
@@ -16,13 +16,15 @@ TRACE_EVENT(rseq_update,
 
 	TP_STRUCT__entry(
 		__field(s32, cpu_id)
+		__field(s32, node_id)
 	),
 
 	TP_fast_assign(
 		__entry->cpu_id = raw_smp_processor_id();
+		__entry->node_id = cpu_to_node(raw_smp_processor_id());
 	),
 
-	TP_printk("cpu_id=%d", __entry->cpu_id)
+	TP_printk("cpu_id=%d node_id=%d", __entry->cpu_id, __entry->node_id)
 );
 
 TRACE_EVENT(rseq_ip_fixup,
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 05d3c4cdeb40..1cb90a435c5c 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -131,6 +131,14 @@ struct rseq {
 	 */
 	__u32 flags;
 
+	/*
+	 * Restartable sequences node_id field. Updated by the kernel. Read by
+	 * user-space with single-copy atomicity semantics. This field should
+	 * only be read by the thread which registered this data structure.
+	 * Aligned on 32-bit. Contains the current NUMA node ID.
+	 */
+	__u32 node_id;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 46dc5c2ce2b7..cb7d8a5afc82 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -84,15 +84,17 @@
  * F1.
  */
 
-static int rseq_update_cpu_id(struct task_struct *t)
+static int rseq_update_cpu_node_id(struct task_struct *t)
 {
-	u32 cpu_id = raw_smp_processor_id();
 	struct rseq __user *rseq = t->rseq;
+	u32 cpu_id = raw_smp_processor_id();
+	u32 node_id = cpu_to_node(cpu_id);
 
 	if (!user_write_access_begin(rseq, t->rseq_len))
 		goto efault;
 	unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
 	unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
+	unsafe_put_user(node_id, &rseq->node_id, efault_end);
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally updated only if
@@ -108,9 +110,9 @@ static int rseq_update_cpu_id(struct task_struct *t)
 	return -EFAULT;
 }
 
-static int rseq_reset_rseq_cpu_id(struct task_struct *t)
+static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
 {
-	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
+	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0;
 
 	/*
 	 * Reset cpu_id_start to its initial state (0).
@@ -124,6 +126,11 @@ static int rseq_reset_rseq_cpu_id(struct task_struct *t)
 	 */
 	if (put_user(cpu_id, &t->rseq->cpu_id))
 		return -EFAULT;
+	/*
+	 * Reset node_id to its initial state (0).
+	 */
+	if (put_user(node_id, &t->rseq->node_id))
+		return -EFAULT;
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally reset only if
@@ -306,7 +313,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 		if (unlikely(ret < 0))
 			goto error;
 	}
-	if (unlikely(rseq_update_cpu_id(t)))
+	if (unlikely(rseq_update_cpu_node_id(t)))
 		goto error;
 	return;
 
@@ -353,7 +360,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 			return -EINVAL;
 		if (current->rseq_sig != sig)
 			return -EPERM;
-		ret = rseq_reset_rseq_cpu_id(current);
+		ret = rseq_reset_rseq_cpu_node_id(current);
 		if (ret)
 			return ret;
 		current->rseq = NULL;
-- 
2.17.1

From nobody Sat Jun 27 21:21:51 2026
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Thomas Gleixner,
 "Paul E . McKenney", Boqun Feng, "H . Peter Anvin", Paul Turner,
 linux-api@vger.kernel.org, Christian Brauner, Florian Weimer,
 David.Laight@ACULAB.COM, carlos@redhat.com, Peter Oskolkov,
 Mathieu Desnoyers
Subject: [RFC PATCH v2 04/11] selftests/rseq: Use ELF auxiliary vector for extensible rseq
Date: Fri, 18 Feb 2022 16:06:26 -0500
Message-Id: <20220218210633.23345-5-mathieu.desnoyers@efficios.com>
In-Reply-To: <20220218210633.23345-1-mathieu.desnoyers@efficios.com>
References: <20220218210633.23345-1-mathieu.desnoyers@efficios.com>

Signed-off-by: Mathieu Desnoyers
---
 tools/testing/selftests/rseq/rseq-abi.h |  5 ++
 tools/testing/selftests/rseq/rseq.c     | 68 ++++++++++++++++++++++---
 tools/testing/selftests/rseq/rseq.h     | 15 ++++--
 3 files changed, 76 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/rseq/rseq-abi.h b/tools/testing/selftests/rseq/rseq-abi.h
index a8c44d9af71f..00ac846d85b0 100644
--- a/tools/testing/selftests/rseq/rseq-abi.h
+++ b/tools/testing/selftests/rseq/rseq-abi.h
@@ -146,6 +146,11 @@ struct rseq_abi {
 	 * this thread.
 	 */
 	__u32 flags;
+
+	/*
+	 * Flexible array member at end of structure, after last feature field.
+	 */
+	char end[];
 } __attribute__((aligned(4 * sizeof(__u64))));
 
 #endif /* _RSEQ_ABI_H */
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
index 986b9458efb2..506f2b17aea6 100644
--- a/tools/testing/selftests/rseq/rseq.c
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -28,6 +28,8 @@
 #include
 #include
 #include
+#include
+#include
 
 #include "../kselftest.h"
 #include "rseq.h"
@@ -39,17 +41,35 @@ static const unsigned int *libc_rseq_flags_p;
 /* Offset from the thread pointer to the rseq area. */
 ptrdiff_t rseq_offset;
 
-/* Size of the registered rseq area. 0 if the registration was
-   unsuccessful. */
+/*
+ * Size of the registered rseq area. 0 if the registration was
+ * unsuccessful.
+ */
 unsigned int rseq_size = -1U;
 
-/* Flags used during rseq registration.  */
+/* Flags used during rseq registration. */
 unsigned int rseq_flags;
 
+/*
+ * rseq feature size supported by the kernel. 0 if the registration was
+ * unsuccessful.
+ */
+unsigned int rseq_feature_size = -1U;
+
 static int rseq_ownership;
+static int rseq_reg_success;	/* At least one rseq registration has succeeded. */
+
+/* Allocate a large area for the TLS. */
+#define RSEQ_THREAD_AREA_ALLOC_SIZE 1024
+
+/* Original struct rseq feature size is 20 bytes. */
+#define ORIG_RSEQ_FEATURE_SIZE 20
+
+/* Original struct rseq allocation size is 32 bytes. */
+#define ORIG_RSEQ_ALLOC_SIZE 32
 
 static
-__thread struct rseq_abi __rseq_abi __attribute__((tls_model("initial-exec"))) = {
+__thread struct rseq_abi __rseq_abi __attribute__((tls_model("initial-exec"), aligned(RSEQ_THREAD_AREA_ALLOC_SIZE))) = {
 	.cpu_id = RSEQ_ABI_CPU_ID_UNINITIALIZED,
 };
 
@@ -84,10 +104,18 @@ int rseq_register_current_thread(void)
 		/* Treat libc's ownership as a successful registration. */
 		return 0;
 	}
-	rc = sys_rseq(&__rseq_abi, sizeof(struct rseq_abi), 0, RSEQ_SIG);
-	if (rc)
+	rc = sys_rseq(&__rseq_abi, rseq_size, 0, RSEQ_SIG);
+	if (rc) {
+		if (RSEQ_READ_ONCE(rseq_reg_success)) {
+			/* Incoherent success/failure within process. */
+			abort();
+		}
+		rseq_size = 0;
+		rseq_feature_size = 0;
 		return -1;
+	}
 	assert(rseq_current_cpu_raw() >= 0);
+	RSEQ_WRITE_ONCE(rseq_reg_success, 1);
 	return 0;
 }
 
@@ -99,12 +127,28 @@ int rseq_unregister_current_thread(void)
 		/* Treat libc's ownership as a successful unregistration. */
 		return 0;
 	}
-	rc = sys_rseq(&__rseq_abi, sizeof(struct rseq_abi), RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG);
+	rc = sys_rseq(&__rseq_abi, rseq_size, RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG);
 	if (rc)
 		return -1;
 	return 0;
 }
 
+static
+unsigned int get_rseq_feature_size(void)
+{
+	unsigned long auxv_rseq_feature_size, auxv_rseq_align;
+
+	auxv_rseq_align = getauxval(AT_RSEQ_ALIGN);
+	assert(!auxv_rseq_align || auxv_rseq_align <= RSEQ_THREAD_AREA_ALLOC_SIZE);
+
+	auxv_rseq_feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);
+	assert(!auxv_rseq_feature_size || auxv_rseq_feature_size <= RSEQ_THREAD_AREA_ALLOC_SIZE);
+	if (auxv_rseq_feature_size)
+		return auxv_rseq_feature_size;
+	else
+		return ORIG_RSEQ_FEATURE_SIZE;
+}
+
 static __attribute__((constructor))
 void rseq_init(void)
 {
@@ -116,14 +160,21 @@ void rseq_init(void)
 		rseq_offset = *libc_rseq_offset_p;
 		rseq_size = *libc_rseq_size_p;
 		rseq_flags = *libc_rseq_flags_p;
+		rseq_feature_size = get_rseq_feature_size();
+		if (rseq_feature_size > rseq_size)
+			rseq_feature_size = rseq_size;
 		return;
 	}
 	if (!rseq_available())
 		return;
 	rseq_ownership = 1;
 	rseq_offset = (void *)&__rseq_abi - rseq_thread_pointer();
-	rseq_size = sizeof(struct rseq_abi);
 	rseq_flags = 0;
+	rseq_feature_size = get_rseq_feature_size();
+	if (rseq_feature_size == ORIG_RSEQ_FEATURE_SIZE)
+		rseq_size = ORIG_RSEQ_ALLOC_SIZE;
+	else
+		rseq_size = RSEQ_THREAD_AREA_ALLOC_SIZE;
 }
 
 static __attribute__((destructor))
@@ -133,6 +184,7 @@ void rseq_exit(void)
 		return;
 	rseq_offset = 0;
 	rseq_size = -1U;
+	rseq_feature_size = -1U;
 	rseq_ownership = 0;
 }
 
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
index 9d850b290c2e..e73db2e82a11 100644
--- a/tools/testing/selftests/rseq/rseq.h
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -47,13 +47,20 @@
 
 #include "rseq-thread-pointer.h"
 
-/* Offset from the thread pointer to the rseq area.  */
+/* Offset from the thread pointer to the rseq area. */
 extern ptrdiff_t rseq_offset;
-/* Size of the registered rseq area. 0 if the registration was
-   unsuccessful. */
+/*
+ * Size of the registered rseq area. 0 if the registration was
+ * unsuccessful.
+ */
 extern unsigned int rseq_size;
-/* Flags used during rseq registration.  */
+/* Flags used during rseq registration. */
 extern unsigned int rseq_flags;
+/*
+ * rseq feature size supported by the kernel. 0 if the registration was
+ * unsuccessful.
+ */
+extern unsigned int rseq_feature_size;
 
 static inline struct rseq_abi *rseq_get_abi(void)
 {
-- 
2.17.1

From nobody Sat Jun 27 21:21:51 2026
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Thomas Gleixner,
 "Paul E . McKenney", Boqun Feng, "H . Peter Anvin", Paul Turner,
 linux-api@vger.kernel.org, Christian Brauner, Florian Weimer,
 David.Laight@ACULAB.COM, carlos@redhat.com, Peter Oskolkov,
 Mathieu Desnoyers
Subject: [RFC PATCH v2 05/11] selftests/rseq: Implement rseq numa node id field selftest
Date: Fri, 18 Feb 2022 16:06:27 -0500
Message-Id: <20220218210633.23345-6-mathieu.desnoyers@efficios.com>
In-Reply-To: <20220218210633.23345-1-mathieu.desnoyers@efficios.com>
References: <20220218210633.23345-1-mathieu.desnoyers@efficios.com>

Test the NUMA node id extension rseq field. Compare it against the value
returned by the getcpu(2) system call while pinned on a specific core.
Signed-off-by: Mathieu Desnoyers --- tools/testing/selftests/rseq/basic_test.c | 5 ++++ tools/testing/selftests/rseq/rseq-abi.h | 8 ++++++ tools/testing/selftests/rseq/rseq.c | 18 +++++++++++++ tools/testing/selftests/rseq/rseq.h | 31 +++++++++++++++++++++++ 4 files changed, 62 insertions(+) diff --git a/tools/testing/selftests/rseq/basic_test.c b/tools/testing/self= tests/rseq/basic_test.c index d8efbfb89193..a49b88cb20a3 100644 --- a/tools/testing/selftests/rseq/basic_test.c +++ b/tools/testing/selftests/rseq/basic_test.c @@ -22,6 +22,8 @@ void test_cpu_pointer(void) CPU_ZERO(&test_affinity); for (i =3D 0; i < CPU_SETSIZE; i++) { if (CPU_ISSET(i, &affinity)) { + int node; + CPU_SET(i, &test_affinity); sched_setaffinity(0, sizeof(test_affinity), &test_affinity); @@ -29,6 +31,9 @@ void test_cpu_pointer(void) assert(rseq_current_cpu() =3D=3D i); assert(rseq_current_cpu_raw() =3D=3D i); assert(rseq_cpu_start() =3D=3D i); + node =3D rseq_fallback_current_node(); + assert(rseq_current_node() =3D=3D node); + assert(rseq_current_node_raw() =3D=3D node); CPU_CLR(i, &test_affinity); } } diff --git a/tools/testing/selftests/rseq/rseq-abi.h b/tools/testing/selfte= sts/rseq/rseq-abi.h index 00ac846d85b0..a1faa9162d52 100644 --- a/tools/testing/selftests/rseq/rseq-abi.h +++ b/tools/testing/selftests/rseq/rseq-abi.h @@ -147,6 +147,14 @@ struct rseq_abi { */ __u32 flags; =20 + /* + * Restartable sequences node_id field. Updated by the kernel. Read by + * user-space with single-copy atomicity semantics. This field should + * only be read by the thread which registered this data structure. + * Aligned on 32-bit. Contains the current NUMA node ID. + */ + __u32 node_id; + /* * Flexible array member at end of structure, after last feature field. 
*/ diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/= rseq/rseq.c index 506f2b17aea6..470fc0f73e22 100644 --- a/tools/testing/selftests/rseq/rseq.c +++ b/tools/testing/selftests/rseq/rseq.c @@ -79,6 +79,11 @@ static int sys_rseq(struct rseq_abi *rseq_abi, uint32_t = rseq_len, return syscall(__NR_rseq, rseq_abi, rseq_len, flags, sig); } =20 +static int sys_getcpu(unsigned int *cpu, unsigned int *node) +{ + return syscall(__NR_getcpu, cpu, node, NULL); +} + int rseq_available(void) { int rc; @@ -199,3 +204,16 @@ int32_t rseq_fallback_current_cpu(void) } return cpu; } + +int32_t rseq_fallback_current_node(void) +{ + uint32_t cpu_id, node_id; + int ret; + + ret =3D sys_getcpu(&cpu_id, &node_id); + if (ret) { + perror("sys_getcpu()"); + return ret; + } + return (int32_t) node_id; +} diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/= rseq/rseq.h index e73db2e82a11..4f1954cd12ff 100644 --- a/tools/testing/selftests/rseq/rseq.h +++ b/tools/testing/selftests/rseq/rseq.h @@ -20,6 +20,15 @@ #include "rseq-abi.h" #include "compiler.h" =20 +#ifndef rseq_sizeof_field +#define rseq_sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER)) +#endif + +#ifndef rseq_offsetofend +#define rseq_offsetofend(TYPE, MEMBER) \ + (offsetof(TYPE, MEMBER) + rseq_sizeof_field(TYPE, MEMBER)) +#endif + /* * Empty code injection macros, override when testing. * It is important to consider that the ASM injection macros need to be @@ -123,6 +132,11 @@ int rseq_unregister_current_thread(void); */ int32_t rseq_fallback_current_cpu(void); =20 +/* + * Restartable sequence fallback for reading the current node number. + */ +int32_t rseq_fallback_current_node(void); + /* * Values returned can be either the current CPU number, -1 (rseq is * uninitialized), or -2 (rseq initialization has failed). @@ -132,6 +146,15 @@ static inline int32_t rseq_current_cpu_raw(void) return RSEQ_ACCESS_ONCE(rseq_get_abi()->cpu_id); } =20 +/* + * Current NUMA node number. 
+ */ +static inline uint32_t rseq_current_node_raw(void) +{ + assert((int) rseq_feature_size >=3D rseq_offsetofend(struct rseq_abi, nod= e_id)); + return RSEQ_ACCESS_ONCE(rseq_get_abi()->node_id); +} + /* * Returns a possible CPU number, which is typically the current CPU. * The returned CPU number can be used to prepare for an rseq critical @@ -158,6 +181,14 @@ static inline uint32_t rseq_current_cpu(void) return cpu; } =20 +static inline uint32_t rseq_current_node(void) +{ + if (rseq_likely((int) rseq_feature_size >=3D rseq_offsetofend(struct rseq= _abi, node_id))) + return rseq_current_node_raw(); + else + return rseq_fallback_current_node(); +} + static inline void rseq_clear_rseq_cs(void) { RSEQ_WRITE_ONCE(rseq_get_abi()->rseq_cs.arch.ptr, 0); --=20 2.17.1 From nobody Sat Jun 27 21:21:51 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 67A9BC433F5 for ; Fri, 18 Feb 2022 21:16:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239711AbiBRVQf (ORCPT ); Fri, 18 Feb 2022 16:16:35 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:42434 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239658AbiBRVQL (ORCPT ); Fri, 18 Feb 2022 16:16:11 -0500 Received: from mail.efficios.com (mail.efficios.com [167.114.26.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EE1C22402C1; Fri, 18 Feb 2022 13:15:52 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by mail.efficios.com (Postfix) with ESMTP id 5D0923BAC81; Fri, 18 Feb 2022 16:06:46 -0500 (EST) Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail03.efficios.com [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id C4s-otjKdnzx; Fri, 18 Feb 2022 16:06:45 -0500 (EST) Received: from localhost (localhost [127.0.0.1]) 
by mail.efficios.com (Postfix) with ESMTP id 3C0BC3BAC08; Fri, 18 Feb 2022 16:06:45 -0500 (EST) DKIM-Filter: OpenDKIM Filter v2.10.3 mail.efficios.com 3C0BC3BAC08 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=efficios.com; s=default; t=1645218405; bh=H1BI137TQVD0IUIV2Akwp3DX/CDXZ2SZLX9a456H6tM=; h=From:To:Date:Message-Id; b=tKEzkfG7PGZjAzJkK4utsGhL+ZAcBe2p49eg1uSb6X63SQBFwUQtZQuJQJIfByPxQ 9qgyMzi8Ep0a/6jaiA+PeL7Ke/lbFl37ERUnjRzVF+Uu7KNKQ4q302ijvg0GQ3DzSK msmreRhyfKA9eoyW6tjqm8/re8pSF8vYRWUo/EW32xe3T9jaiECg+KevZHfDIxMPun rsXzuXGF+7rJqvVodNlhGoacGC03UxnH2CA4kGxn6fjo8hS66rC1PAqliTjPdTphw9 uJIgVn9X7Vs3ZLy2/6/bXFDafE1I5fT2nM84/7HPhsO32E4gz3wgi4+TjX3gXvudXA nyglD/BLCXGsw== X-Virus-Scanned: amavisd-new at efficios.com Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail03.efficios.com [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id aVdYg6aUs7zb; Fri, 18 Feb 2022 16:06:45 -0500 (EST) Received: from localhost.localdomain (192-222-180-24.qc.cable.ebox.net [192.222.180.24]) by mail.efficios.com (Postfix) with ESMTPSA id B61673BAC00; Fri, 18 Feb 2022 16:06:43 -0500 (EST) From: Mathieu Desnoyers To: Peter Zijlstra Cc: linux-kernel@vger.kernel.org, Thomas Gleixner , "Paul E . McKenney" , Boqun Feng , "H . 
Peter Anvin" , Paul Turner , linux-api@vger.kernel.org, Christian Brauner , Florian Weimer , David.Laight@ACULAB.COM, carlos@redhat.com, Peter Oskolkov , Mathieu Desnoyers Subject: [RFC PATCH v2 06/11] lib: invert _find_next_bit source arguments Date: Fri, 18 Feb 2022 16:06:28 -0500 Message-Id: <20220218210633.23345-7-mathieu.desnoyers@efficios.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220218210633.23345-1-mathieu.desnoyers@efficios.com> References: <20220218210633.23345-1-mathieu.desnoyers@efficios.com> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Apply bit-invert operations before the AND operation in _find_next_bit. Allows AND operations on combined bitmasks in which we search either for one or zero, e.g.: find first bit which is both zero in one bitmask AND one in the second bitmask. The existing use for find first zero bit does not use the second argument, so whether the inversion is performed before or after the AND operator does not matter. 
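As a rough illustration of the reordering (a user-space sketch only, not the kernel code; word_old(), word_new() and check_single_source() are hypothetical names), the two orderings can be modelled on a single word:

```c
#include <assert.h>
#include <stddef.h>

/* Old ordering: AND the sources first, then XOR the result with "invert". */
static unsigned long word_old(unsigned long w1, const unsigned long *w2,
			      unsigned long invert)
{
	unsigned long tmp = w1;

	if (w2)
		tmp &= *w2;
	return tmp ^ invert;
}

/* New ordering: XOR each source with its own mask, then AND the results. */
static unsigned long word_new(unsigned long w1, const unsigned long *w2,
			      unsigned long invert_src1,
			      unsigned long invert_src2)
{
	unsigned long tmp = w1 ^ invert_src1;

	if (w2)
		tmp &= *w2 ^ invert_src2;
	return tmp;
}

/*
 * With no second source, both orderings give the same word, which is why
 * the existing find-zero-bit callers are unaffected.
 */
static void check_single_source(unsigned long w1)
{
	assert(word_old(w1, NULL, 0UL) == word_new(w1, NULL, 0UL, 0UL));
	assert(word_old(w1, NULL, ~0UL) == word_new(w1, NULL, ~0UL, 0UL));
}
```

With w1 = 0x0c and w2 = 0x0a, word_new(w1, &w2, 0UL, ~0UL) yields 0x04 (one in w1, zero in w2), whereas the old ordering with an all-ones mask can only produce the complement of the AND, ~(w1 & w2).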
Signed-off-by: Mathieu Desnoyers --- include/linux/find.h | 13 +++++++------ lib/find_bit.c | 17 ++++++++--------- tools/include/linux/find.h | 9 +++++---- tools/lib/find_bit.c | 17 ++++++++--------- 4 files changed, 28 insertions(+), 28 deletions(-) diff --git a/include/linux/find.h b/include/linux/find.h index 5bb6db213bcb..41941cb9cad7 100644 --- a/include/linux/find.h +++ b/include/linux/find.h @@ -10,7 +10,8 @@ =20 extern unsigned long _find_next_bit(const unsigned long *addr1, const unsigned long *addr2, unsigned long nbits, - unsigned long start, unsigned long invert, unsigned long le); + unsigned long start, unsigned long invert_src1, + unsigned long src2, unsigned long le); extern unsigned long _find_first_bit(const unsigned long *addr, unsigned l= ong size); extern unsigned long _find_first_and_bit(const unsigned long *addr1, const unsigned long *addr2, unsigned long size); @@ -41,7 +42,7 @@ unsigned long find_next_bit(const unsigned long *addr, un= signed long size, return val ? __ffs(val) : size; } =20 - return _find_next_bit(addr, NULL, size, offset, 0UL, 0); + return _find_next_bit(addr, NULL, size, offset, 0UL, 0UL, 0); } #endif =20 @@ -71,7 +72,7 @@ unsigned long find_next_and_bit(const unsigned long *addr= 1, return val ? __ffs(val) : size; } =20 - return _find_next_bit(addr1, addr2, size, offset, 0UL, 0); + return _find_next_bit(addr1, addr2, size, offset, 0UL, 0UL, 0); } #endif =20 @@ -99,7 +100,7 @@ unsigned long find_next_zero_bit(const unsigned long *ad= dr, unsigned long size, return val =3D=3D ~0UL ? size : ffz(val); } =20 - return _find_next_bit(addr, NULL, size, offset, ~0UL, 0); + return _find_next_bit(addr, NULL, size, offset, ~0UL, 0UL, 0); } #endif =20 @@ -247,7 +248,7 @@ unsigned long find_next_zero_bit_le(const void *addr, u= nsigned return val =3D=3D ~0UL ? 
size : ffz(val); } =20 - return _find_next_bit(addr, NULL, size, offset, ~0UL, 1); + return _find_next_bit(addr, NULL, size, offset, ~0UL, 0UL, 1); } #endif =20 @@ -266,7 +267,7 @@ unsigned long find_next_bit_le(const void *addr, unsign= ed return val ? __ffs(val) : size; } =20 - return _find_next_bit(addr, NULL, size, offset, 0UL, 1); + return _find_next_bit(addr, NULL, size, offset, 0UL, 0UL, 1); } #endif =20 diff --git a/lib/find_bit.c b/lib/find_bit.c index 1b8e4b2a9cba..73e78565e691 100644 --- a/lib/find_bit.c +++ b/lib/find_bit.c @@ -25,23 +25,23 @@ /* * This is a common helper function for find_next_bit, find_next_zero_bit,= and * find_next_and_bit. The differences are: - * - The "invert" argument, which is XORed with each fetched word before - * searching it for one bits. + * - The "invert_src1" and "invert_src2" arguments, which are XORed to + * each source word before applying the 'and' operator. * - The optional "addr2", which is anded with "addr1" if present. */ unsigned long _find_next_bit(const unsigned long *addr1, const unsigned long *addr2, unsigned long nbits, - unsigned long start, unsigned long invert, unsigned long le) + unsigned long start, unsigned long invert_src1, + unsigned long invert_src2, unsigned long le) { unsigned long tmp, mask; =20 if (unlikely(start >=3D nbits)) return nbits; =20 - tmp =3D addr1[start / BITS_PER_LONG]; + tmp =3D addr1[start / BITS_PER_LONG] ^ invert_src1; if (addr2) - tmp &=3D addr2[start / BITS_PER_LONG]; - tmp ^=3D invert; + tmp &=3D addr2[start / BITS_PER_LONG] ^ invert_src2; =20 /* Handle 1st word. 
*/ mask =3D BITMAP_FIRST_WORD_MASK(start); @@ -57,10 +57,9 @@ unsigned long _find_next_bit(const unsigned long *addr1, if (start >=3D nbits) return nbits; =20 - tmp =3D addr1[start / BITS_PER_LONG]; + tmp =3D addr1[start / BITS_PER_LONG] ^ invert_src1; if (addr2) - tmp &=3D addr2[start / BITS_PER_LONG]; - tmp ^=3D invert; + tmp &=3D addr2[start / BITS_PER_LONG] ^ invert_src2; } =20 if (le) diff --git a/tools/include/linux/find.h b/tools/include/linux/find.h index 47e2bd6c5174..5ab0c95086ad 100644 --- a/tools/include/linux/find.h +++ b/tools/include/linux/find.h @@ -10,7 +10,8 @@ =20 extern unsigned long _find_next_bit(const unsigned long *addr1, const unsigned long *addr2, unsigned long nbits, - unsigned long start, unsigned long invert, unsigned long le); + unsigned long start, unsigned long invert_src1, + unsigned long src2, unsigned long le); extern unsigned long _find_first_bit(const unsigned long *addr, unsigned l= ong size); extern unsigned long _find_first_and_bit(const unsigned long *addr1, const unsigned long *addr2, unsigned long size); @@ -41,7 +42,7 @@ unsigned long find_next_bit(const unsigned long *addr, un= signed long size, return val ? __ffs(val) : size; } =20 - return _find_next_bit(addr, NULL, size, offset, 0UL, 0); + return _find_next_bit(addr, NULL, size, offset, 0UL, 0UL, 0); } #endif =20 @@ -71,7 +72,7 @@ unsigned long find_next_and_bit(const unsigned long *addr= 1, return val ? __ffs(val) : size; } =20 - return _find_next_bit(addr1, addr2, size, offset, 0UL, 0); + return _find_next_bit(addr1, addr2, size, offset, 0UL, 0UL, 0); } #endif =20 @@ -99,7 +100,7 @@ unsigned long find_next_zero_bit(const unsigned long *ad= dr, unsigned long size, return val =3D=3D ~0UL ? 
size : ffz(val); } =20 - return _find_next_bit(addr, NULL, size, offset, ~0UL, 0); + return _find_next_bit(addr, NULL, size, offset, ~0UL, 0UL, 0); } #endif =20 diff --git a/tools/lib/find_bit.c b/tools/lib/find_bit.c index ba4b8d94e004..4176232de7f9 100644 --- a/tools/lib/find_bit.c +++ b/tools/lib/find_bit.c @@ -24,13 +24,14 @@ /* * This is a common helper function for find_next_bit, find_next_zero_bit,= and * find_next_and_bit. The differences are: - * - The "invert" argument, which is XORed with each fetched word before - * searching it for one bits. + * - The "invert_src1" and "invert_src2" arguments, which are XORed to + * each source word before applying the 'and' operator. * - The optional "addr2", which is anded with "addr1" if present. */ unsigned long _find_next_bit(const unsigned long *addr1, const unsigned long *addr2, unsigned long nbits, - unsigned long start, unsigned long invert, unsigned long le) + unsigned long start, unsigned long invert_src1, + unsigned long invert_src2, unsigned long le) { unsigned long tmp, mask; (void) le; @@ -38,10 +39,9 @@ unsigned long _find_next_bit(const unsigned long *addr1, if (unlikely(start >=3D nbits)) return nbits; =20 - tmp =3D addr1[start / BITS_PER_LONG]; + tmp =3D addr1[start / BITS_PER_LONG] ^ invert_src1; if (addr2) - tmp &=3D addr2[start / BITS_PER_LONG]; - tmp ^=3D invert; + tmp &=3D addr2[start / BITS_PER_LONG] ^ invert_src2; =20 /* Handle 1st word. 
*/ mask =3D BITMAP_FIRST_WORD_MASK(start); @@ -64,10 +64,9 @@ unsigned long _find_next_bit(const unsigned long *addr1, if (start >=3D nbits) return nbits; =20 - tmp =3D addr1[start / BITS_PER_LONG]; + tmp =3D addr1[start / BITS_PER_LONG] ^ invert_src1; if (addr2) - tmp &=3D addr2[start / BITS_PER_LONG]; - tmp ^=3D invert; + tmp &=3D addr2[start / BITS_PER_LONG] ^ invert_src2; } =20 #if (0) --=20 2.17.1 From nobody Sat Jun 27 21:21:51 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 07D1CC433EF for ; Fri, 18 Feb 2022 21:16:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239756AbiBRVQi (ORCPT ); Fri, 18 Feb 2022 16:16:38 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:42438 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239659AbiBRVQL (ORCPT ); Fri, 18 Feb 2022 16:16:11 -0500 Received: from mail.efficios.com (mail.efficios.com [167.114.26.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1BFAC28B639; Fri, 18 Feb 2022 13:15:52 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by mail.efficios.com (Postfix) with ESMTP id EC6423BAB17; Fri, 18 Feb 2022 16:06:46 -0500 (EST) Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail03.efficios.com [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id fW0y1t-33qnL; Fri, 18 Feb 2022 16:06:46 -0500 (EST) Received: from localhost (localhost [127.0.0.1]) by mail.efficios.com (Postfix) with ESMTP id 5BD6B3BA92E; Fri, 18 Feb 2022 16:06:45 -0500 (EST) DKIM-Filter: OpenDKIM Filter v2.10.3 mail.efficios.com 5BD6B3BA92E DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=efficios.com; s=default; t=1645218405; bh=zb3SusDsVJNg+pjuGfT7yEf39MwlGGyPhgXVO4KRkhM=; h=From:To:Date:Message-Id; 
b=epdNARr8QSHDvXrxYdsz2N7Wm7ivkowqacmFw9oaQsv4DAroRJ2J06SI3qEGdNeMa 6IpjKAqGp+s8QeAJW8ZZuuWILwln0ZGJRy0A+4IuA90ZTUt9RvPE07l7PAVFmtrQJy xLsoeuYru1E6a8R34RrY7GX0uBDBRvO92tnKgVgDO9imWIXfo5BPsaTSkTLlgMoDB3 3Oe5RQqdTz8MchQwZQCTbv5tKd+hjeq7aG8EGEijUvQtqSe5pmbV6Xpv1NnWy5T4iV TatBDih+5ZwnycNubLVrABhJezl1g4yCNgVwkpXt0eQ2WHgTWGNIoMBX9FkK1ZHTpe fx7vd551JhY8Q== X-Virus-Scanned: amavisd-new at efficios.com Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail03.efficios.com [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id E1jGpqx5UH-Q; Fri, 18 Feb 2022 16:06:45 -0500 (EST) Received: from localhost.localdomain (192-222-180-24.qc.cable.ebox.net [192.222.180.24]) by mail.efficios.com (Postfix) with ESMTPSA id 0526A3BAB14; Fri, 18 Feb 2022 16:06:44 -0500 (EST) From: Mathieu Desnoyers To: Peter Zijlstra Cc: linux-kernel@vger.kernel.org, Thomas Gleixner , "Paul E . McKenney" , Boqun Feng , "H . Peter Anvin" , Paul Turner , linux-api@vger.kernel.org, Christian Brauner , Florian Weimer , David.Laight@ACULAB.COM, carlos@redhat.com, Peter Oskolkov , Mathieu Desnoyers Subject: [RFC PATCH v2 07/11] lib: implement find_{first,next}_{zero,one}_and_zero_bit Date: Fri, 18 Feb 2022 16:06:29 -0500 Message-Id: <20220218210633.23345-8-mathieu.desnoyers@efficios.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220218210633.23345-1-mathieu.desnoyers@efficios.com> References: <20220218210633.23345-1-mathieu.desnoyers@efficios.com> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Allow finding the first or next bit within two input bitmasks which is either: - both zero and zero, - respectively one and zero. 
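The intended semantics can be modelled in user space as follows. This is a deliberately simple bit-at-a-time sketch, not the word-at-a-time kernel helper; all sketch_* names are hypothetical and, unlike _find_next_bit, both bitmaps are assumed to be non-NULL:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/*
 * Return the index of the first bit >= start which is set in
 * (addr1 ^ inv1) & (addr2 ^ inv2), or nbits if there is none.
 */
static unsigned long sketch_find_next(const unsigned long *addr1,
				      const unsigned long *addr2,
				      unsigned long nbits, unsigned long start,
				      unsigned long inv1, unsigned long inv2)
{
	unsigned long i;

	assert(addr1 && addr2);
	for (i = start; i < nbits; i++) {
		unsigned long w = i / SKETCH_BITS_PER_LONG;
		unsigned long word = (addr1[w] ^ inv1) & (addr2[w] ^ inv2);

		if (word & (1UL << (i % SKETCH_BITS_PER_LONG)))
			return i;
	}
	return nbits;
}

/* Next bit which is one in addr1 and zero in addr2. */
static unsigned long sketch_find_next_one_and_zero(const unsigned long *addr1,
						   const unsigned long *addr2,
						   unsigned long size,
						   unsigned long offset)
{
	return sketch_find_next(addr1, addr2, size, offset, 0UL, ~0UL);
}

/* Next bit which is zero in both addr1 and addr2. */
static unsigned long sketch_find_next_zero_and_zero(const unsigned long *addr1,
						    const unsigned long *addr2,
						    unsigned long size,
						    unsigned long offset)
{
	return sketch_find_next(addr1, addr2, size, offset, ~0UL, ~0UL);
}
```

As with the existing find helpers, a return value equal to the size argument means that no bit matched.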
Signed-off-by: Mathieu Desnoyers --- include/linux/find.h | 110 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 110 insertions(+) diff --git a/include/linux/find.h b/include/linux/find.h index 41941cb9cad7..9b6edd00b45e 100644 --- a/include/linux/find.h +++ b/include/linux/find.h @@ -76,6 +76,66 @@ unsigned long find_next_and_bit(const unsigned long *add= r1, } #endif =20 +#ifndef find_next_one_and_zero_bit +/** + * find_next_one_and_zero_bit - find the next bit which is one in addr1 an= d zero in addr2 + * @addr1: The first address to base the search on + * @addr2: The second address to base the search on + * @offset: The bitnumber to start searching at + * @size: The bitmap size in bits + * + * Returns the bit number for the next bit set in addr1 and cleared in add= r2 + * If no corresponding bits meet this criterion, returns @size. + */ +static inline +unsigned long find_next_one_and_zero_bit(const unsigned long *addr1, + const unsigned long *addr2, unsigned long size, + unsigned long offset) +{ + if (small_const_nbits(size)) { + unsigned long val; + + if (unlikely(offset >=3D size)) + return size; + + val =3D *addr1 & ~*addr2 & GENMASK(size - 1, offset); + return val ? __ffs(val) : size; + } + + return _find_next_bit(addr1, addr2, size, offset, 0UL, ~0UL, 0); +} +#endif + +#ifndef find_next_zero_and_zero_bit +/** + * find_next_zero_and_zero_bit - find the next bit which is zero in addr1 = and addr2 + * @addr1: The first address to base the search on + * @addr2: The second address to base the search on + * @offset: The bitnumber to start searching at + * @size: The bitmap size in bits + * + * Returns the bit number for the next bit cleared in addr1 and addr2 + * If no corresponding bits meet this criterion, returns @size. 
+ */ +static inline +unsigned long find_next_zero_and_zero_bit(const unsigned long *addr1, + const unsigned long *addr2, unsigned long size, + unsigned long offset) +{ + if (small_const_nbits(size)) { + unsigned long val; + + if (unlikely(offset >=3D size)) + return size; + + val =3D ~*addr1 & ~*addr2 & GENMASK(size - 1, offset); + return val ? __ffs(val) : size; + } + + return _find_next_bit(addr1, addr2, size, offset, ~0UL, ~0UL, 0); +} +#endif + #ifndef find_next_zero_bit /** * find_next_zero_bit - find the next cleared bit in a memory region @@ -173,6 +233,56 @@ unsigned long find_first_zero_bit(const unsigned long = *addr, unsigned long size) } #endif =20 +#ifndef find_first_one_and_zero_bit +/** + * find_first_one_and_zero_bit - find the first bit which is one in addr1 = and zero in addr2 + * @addr1: The first address to base the search on + * @addr2: The second address to base the search on + * @size: The bitmap size in bits + * + * Returns the bit number for the first bit set in addr1 and cleared in ad= dr2 + * If no corresponding bits meet this criterion, returns @size. + */ +static inline +unsigned long find_first_one_and_zero_bit(const unsigned long *addr1, + const unsigned long *addr2, + unsigned long size) +{ + if (small_const_nbits(size)) { + unsigned long val =3D *addr1 & ~*addr2 & GENMASK(size - 1, 0); + + return val ? __ffs(val) : size; + } + + return _find_next_bit(addr1, addr2, size, 0, 0UL, ~0UL, 0); +} +#endif + +#ifndef find_first_zero_and_zero_bit +/** + * find_first_zero_and_zero_bit - find the first bit which is zero in addr= 1 and addr2 + * @addr1: The first address to base the search on + * @addr2: The second address to base the search on + * @size: The bitmap size in bits + * + * Returns the bit number for the first bit cleared in addr1 and addr2 + * If no corresponding bits meet this criterion, returns @size. 
+ */ +static inline +unsigned long find_first_zero_and_zero_bit(const unsigned long *addr1, + const unsigned long *addr2, + unsigned long size) +{ + if (small_const_nbits(size)) { + unsigned long val =3D ~*addr1 & ~*addr2 & GENMASK(size - 1, 0); + + return val ? __ffs(val) : size; + } + + return _find_next_bit(addr1, addr2, size, 0, ~0UL, ~0UL, 0); +} +#endif + #ifndef find_last_bit /** * find_last_bit - find the last set bit in a memory region --=20 2.17.1 From nobody Sat Jun 27 21:21:51 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 61662C433FE for ; Fri, 18 Feb 2022 21:16:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239717AbiBRVQV (ORCPT ); Fri, 18 Feb 2022 16:16:21 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:42202 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236503AbiBRVQJ (ORCPT ); Fri, 18 Feb 2022 16:16:09 -0500 Received: from mail.efficios.com (mail.efficios.com [167.114.26.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1D75923F0B8; Fri, 18 Feb 2022 13:15:49 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by mail.efficios.com (Postfix) with ESMTP id 540153BAC13; Fri, 18 Feb 2022 16:06:47 -0500 (EST) Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail03.efficios.com [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 16lt4ka0EUcY; Fri, 18 Feb 2022 16:06:46 -0500 (EST) Received: from localhost (localhost [127.0.0.1]) by mail.efficios.com (Postfix) with ESMTP id 712053BAB16; Fri, 18 Feb 2022 16:06:45 -0500 (EST) DKIM-Filter: OpenDKIM Filter v2.10.3 mail.efficios.com 712053BAB16 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=efficios.com; s=default; t=1645218405; bh=ONVniBRsB7JS6D+0uBffC8B0zXbx6+sORFJs6B80S7c=; 
h=From:To:Date:Message-Id; b=aQPfd7sWnh9LtxZOD2RPxPRmfb1YbWQurkDTElRbuFYspiJ0q8FkcB01cR9hCF8Kj i9PEDX9MCqrteNePWIJjuao335e3VUc7ApuYnxg33RcT1iqfydEtjaLUwh5oomWD3s ETjEgL/U2EKWEJWe66aYQIN0wtZ8RcXRqgobj8Kc5RsjErEGN4NWLcIz4FfwbY/shA MMC1PArRY5geDNJIncv3ncRn8JZw12cRUGNe9KGXZCvuGgv5apD5rqtgTR6BintHXX DX5o3xPLppbQxxUBgSg1z+sSQbt8jfgnngfuIqSBNrdVULa77P8FavuAVKSF5Y2fAd 2uV4q6Vxa98tQ== X-Virus-Scanned: amavisd-new at efficios.com Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail03.efficios.com [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id C7c68gXqvGRd; Fri, 18 Feb 2022 16:06:45 -0500 (EST) Received: from localhost.localdomain (192-222-180-24.qc.cable.ebox.net [192.222.180.24]) by mail.efficios.com (Postfix) with ESMTPSA id 4AE423BA5F1; Fri, 18 Feb 2022 16:06:44 -0500 (EST) From: Mathieu Desnoyers To: Peter Zijlstra Cc: linux-kernel@vger.kernel.org, Thomas Gleixner , "Paul E . McKenney" , Boqun Feng , "H . Peter Anvin" , Paul Turner , linux-api@vger.kernel.org, Christian Brauner , Florian Weimer , David.Laight@ACULAB.COM, carlos@redhat.com, Peter Oskolkov , Mathieu Desnoyers Subject: [RFC PATCH v2 08/11] cpumask: implement cpumask_{first,next}_{zero,one}_and_zero Date: Fri, 18 Feb 2022 16:06:30 -0500 Message-Id: <20220218210633.23345-9-mathieu.desnoyers@efficios.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220218210633.23345-1-mathieu.desnoyers@efficios.com> References: <20220218210633.23345-1-mathieu.desnoyers@efficios.com> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Allow finding the first or next bit within two input cpumasks which is either: - both zero and zero, - respectively one and zero. 
Signed-off-by: Mathieu Desnoyers --- include/linux/cpumask.h | 94 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 94 insertions(+) diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index 64dae70d31f5..040476134557 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -134,6 +134,18 @@ static inline unsigned int cpumask_first_and(const str= uct cpumask *srcp1, return 0; } =20 +static inline unsigned int cpumask_first_one_and_zero(const struct cpumask= *srcp1, + const struct cpumask *srcp2) +{ + return 0; +} + +static inline unsigned int cpumask_first_zero_and_zero(const struct cpumas= k *srcp1, + const struct cpumask *srcp2) +{ + return 0; +} + static inline unsigned int cpumask_last(const struct cpumask *srcp) { return 0; @@ -157,6 +169,20 @@ static inline unsigned int cpumask_next_and(int n, return n+1; } =20 +static inline unsigned int cpumask_next_one_and_zero(int n, + const struct cpumask *srcp1, + const struct cpumask *srcp2) +{ + return n+1; +} + +static inline unsigned int cpumask_next_zero_and_zero(int n, + const struct cpumask *srcp1, + const struct cpumask *srcp2) +{ + return n+1; +} + static inline unsigned int cpumask_next_wrap(int n, const struct cpumask *= mask, int start, bool wrap) { @@ -230,6 +256,36 @@ unsigned int cpumask_first_and(const struct cpumask *s= rcp1, const struct cpumask return find_first_and_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cp= umask_bits); } =20 +/** + * cpumask_first_one_and_zero - return the first cpu from *srcp1 & ~*srcp2 + * @src1p: the first input + * @src2p: the second input + * + * Returns >=3D nr_cpu_ids if no cpus match in both. 
+ */ +static inline +unsigned int cpumask_first_one_and_zero(const struct cpumask *srcp1, + const struct cpumask *srcp2) +{ + return find_first_one_and_zero_bit(cpumask_bits(srcp1), cpumask_bits(srcp= 2), + nr_cpumask_bits); +} + +/** + * cpumask_first_zero_and_zero - return the first cpu from ~*srcp1 & ~*src= p2 + * @src1p: the first input + * @src2p: the second input + * + * Returns >=3D nr_cpu_ids if no cpus match in both. + */ +static inline +unsigned int cpumask_first_zero_and_zero(const struct cpumask *srcp1, + const struct cpumask *srcp2) +{ + return find_first_zero_and_zero_bit(cpumask_bits(srcp1), cpumask_bits(src= p2), + nr_cpumask_bits); +} + /** * cpumask_last - get the last CPU in a cpumask * @srcp: - the cpumask pointer @@ -258,6 +314,44 @@ static inline unsigned int cpumask_next_zero(int n, co= nst struct cpumask *srcp) return find_next_zero_bit(cpumask_bits(srcp), nr_cpumask_bits, n+1); } =20 +/** + * cpumask_next_one_and_zero - return the next cpu from *srcp1 & ~*srcp2 + * @n: the cpu prior to the place to search (ie. return will be > @n) + * @src1p: the first input + * @src2p: the second input + * + * Returns >=3D nr_cpu_ids if no cpus match in both. + */ +static inline +unsigned int cpumask_next_one_and_zero(int n, const struct cpumask *srcp1, + const struct cpumask *srcp2) +{ + /* -1 is a legal arg here. */ + if (n !=3D -1) + cpumask_check(n); + return find_next_one_and_zero_bit(cpumask_bits(srcp1), cpumask_bits(srcp2= ), + nr_cpumask_bits, n+1); +} + +/** + * cpumask_next_zero_and_zero - return the next cpu from ~*srcp1 & ~*srcp2 + * @n: the cpu prior to the place to search (ie. return will be > @n) + * @src1p: the first input + * @src2p: the second input + * + * Returns >=3D nr_cpu_ids if no cpus match in both. + */ +static inline +unsigned int cpumask_next_zero_and_zero(int n, const struct cpumask *srcp1, + const struct cpumask *srcp2) +{ + /* -1 is a legal arg here. 
*/ + if (n !=3D -1) + cpumask_check(n); + return find_next_zero_and_zero_bit(cpumask_bits(srcp1), cpumask_bits(srcp= 2), + nr_cpumask_bits, n+1); +} + int __pure cpumask_next_and(int n, const struct cpumask *, const struct cp= umask *); int __pure cpumask_any_but(const struct cpumask *mask, unsigned int cpu); unsigned int cpumask_local_spread(unsigned int i, int node); --=20 2.17.1 From nobody Sat Jun 27 21:21:51 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 69CAFC433EF for ; Fri, 18 Feb 2022 21:16:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239749AbiBRVQ0 (ORCPT ); Fri, 18 Feb 2022 16:16:26 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:42368 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239647AbiBRVQK (ORCPT ); Fri, 18 Feb 2022 16:16:10 -0500 Received: from mail.efficios.com (mail.efficios.com [167.114.26.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1760923F0A6; Fri, 18 Feb 2022 13:15:49 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by mail.efficios.com (Postfix) with ESMTP id 85CEF3BAC84; Fri, 18 Feb 2022 16:06:49 -0500 (EST) Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail03.efficios.com [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id OyUQxXtczig6; Fri, 18 Feb 2022 16:06:47 -0500 (EST) Received: from localhost (localhost [127.0.0.1]) by mail.efficios.com (Postfix) with ESMTP id 9BF403BAC80; Fri, 18 Feb 2022 16:06:45 -0500 (EST) DKIM-Filter: OpenDKIM Filter v2.10.3 mail.efficios.com 9BF403BAC80 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=efficios.com; s=default; t=1645218405; bh=uMuAXUKA2soJTkFiGICIDNL1xEJYdKm49C06IYaue1s=; h=From:To:Date:Message-Id; 
b=YEqyAunoUkZ6zUAueSa1Klmm7gUCSBlZWt3N4778T+RtwBc/6eMB/pBgw326FIs4q BN+WDxbznHA30aRsDlH/l6St/0Ag30Gv84Hoy1D5YXMo/2ZMWT1ziLsv0l0kBW2krJ TWBxocB5dpUrY2Xmjy16aNH8/FqEdnArkDrNnpwAotsWmXfqIEVo6ZB3JFeN82VHvN sw4mG8MqGnSxxV7kZEgPi4uoeHNNv+s4IdmJYp1PYLjsNUlHhjjPgN0acfr8OPffKW nAIYs+YaNih4yQJDNLMVH4GltOoJUPhtrcEGfXQtISoGBu8C89/ZQDAHDqZrclbY4K M6Zh7lq95RHzA== X-Virus-Scanned: amavisd-new at efficios.com Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail03.efficios.com [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id jE3P2wCmM7o9; Fri, 18 Feb 2022 16:06:45 -0500 (EST) Received: from localhost.localdomain (192-222-180-24.qc.cable.ebox.net [192.222.180.24]) by mail.efficios.com (Postfix) with ESMTPSA id 86A143BAC07; Fri, 18 Feb 2022 16:06:44 -0500 (EST) From: Mathieu Desnoyers To: Peter Zijlstra Cc: linux-kernel@vger.kernel.org, Thomas Gleixner , "Paul E . McKenney" , Boqun Feng , "H . Peter Anvin" , Paul Turner , linux-api@vger.kernel.org, Christian Brauner , Florian Weimer , David.Laight@ACULAB.COM, carlos@redhat.com, Peter Oskolkov , Mathieu Desnoyers Subject: [RFC PATCH v2 09/11] sched: Introduce per memory space current virtual cpu id Date: Fri, 18 Feb 2022 16:06:31 -0500 Message-Id: <20220218210633.23345-10-mathieu.desnoyers@efficios.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220218210633.23345-1-mathieu.desnoyers@efficios.com> References: <20220218210633.23345-1-mathieu.desnoyers@efficios.com> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" This feature allows the scheduler to expose a current virtual cpu id to user-space. This virtual cpu id is within the possible cpus range, and is temporarily (and uniquely) assigned while threads are actively running within a memory space. 
If a memory space has fewer threads than cores, or is limited to run on
few cores concurrently through sched affinity or cgroup cpusets, the
virtual cpu ids will be values close to 0, thus allowing efficient use
of user-space memory for per-cpu data structures.

The vcpu_ids are NUMA-aware. On NUMA systems, when a vcpu_id is observed
by user-space to be associated with a NUMA node, it is guaranteed to
never change NUMA node unless a kernel-level NUMA configuration change
happens.

This feature is meant to be exposed by a new rseq thread area field.

The primary purpose of this feature is to do the heavy-lifting needed
by memory allocators to allow them to use per-cpu data structures
efficiently in the following situations:

- Single-threaded applications,
- Multi-threaded applications on large systems (many cores) with limited
  cpu affinity mask,
- Multi-threaded applications on large systems (many cores) with
  restricted cgroup cpuset per container,
- Processes using memory from many NUMA nodes.

One of the key concerns from scheduler maintainers is the overhead
associated with additional atomic operations in the scheduler fast-path.
In order to save one atomic set bit and one atomic clear bit on the
scheduler context switch fast path, the following optimizations are
implemented:

1) On context switch between threads belonging to the same memory space,
   transfer the mm_vcpu_id from prev to next without any atomic ops.
   This takes care of use-cases involving frequent context switch
   between threads belonging to the same memory space.

2) Threads belonging to a memory space with single user (mm_users==1)
   can be assigned mm_vcpu_id=0 without any atomic operation on the
   scheduler fast-path. In non-NUMA, when a memory space goes from
   single to multi-threaded, lazily allocate the vcpu_id 0 in the mm
   vcpu mask. This takes care of all single-threaded use-cases
   involving context switching between threads belonging to different
   memory spaces.
   With NUMA, the single-threaded memory space scenario is still
   special-cased to eliminate all atomic operations on the fast path,
   but rather than returning vcpu_id=0, return the current
   numa_node_id to allow single-threaded memory spaces to keep good
   NUMA locality. On systems where the number of cpu ids is lower
   than the number of NUMA node ids, pick the first cpu in the node
   cpumask rather than the node ID.

3) Introduce a per-runqueue cache containing { mm, vcpu_id } entries.
   Keep track of the recently allocated vcpu_id for each mm rather than
   freeing them immediately. This eliminates most atomic ops when
   context switching back and forth between threads belonging to
   different memory spaces in multi-threaded scenarios (many processes,
   each with many threads).

The credit goes to Paul Turner (Google) for the vcpu_id idea. This
feature is implemented based on the discussions with Paul Turner and
Peter Oskolkov (Google), but I took the liberty to implement scheduler
fast-path optimizations and my own NUMA-awareness scheme.

Rumor has it that Google has been running an rseq vcpu_id extension
internally in production for a year. The tcmalloc source code indeed
has comments hinting at a vcpu_id prototype extension to the rseq
system call [1].
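
The "values close to 0" property described above follows from always
reserving the lowest free id in a per-memory-space mask. Below is a
minimal user-space sketch of that idea, not the kernel code from this
patch: it assumes at most 64 ids, uses C11 atomics in place of the
kernel cpumask helpers, and every sketch_* name and SKETCH_NR_IDS is
hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_NR_IDS 64	/* hypothetical stand-in for nr_cpu_ids */

/* One bit per vcpu id: user-space stand-in for the per-mm vcpu mask. */
struct sketch_vcpu_mask {
	_Atomic unsigned long long bits;
};

/*
 * Atomically reserve the lowest clear bit.
 * Returns the reserved id, or -1 when all ids are in use.
 */
static int sketch_vcpu_get(struct sketch_vcpu_mask *mask)
{
	unsigned long long old = atomic_load(&mask->bits);

	for (;;) {
		int id = 0;

		/* Find the lowest id that is free in our snapshot. */
		while (id < SKETCH_NR_IDS && (old & (1ULL << id)))
			id++;
		if (id >= SKETCH_NR_IDS)
			return -1;
		/*
		 * Try to publish the reservation. On failure, 'old' is
		 * refreshed with the current mask and the search restarts.
		 */
		if (atomic_compare_exchange_weak(&mask->bits, &old,
						 old | (1ULL << id)))
			return id;
	}
}

/* Release an id so that a later allocation can reuse it. */
static void sketch_vcpu_put(struct sketch_vcpu_mask *mask, int id)
{
	assert(id >= 0 && id < SKETCH_NR_IDS);
	atomic_fetch_and(&mask->bits, ~(1ULL << id));
}
```

With this scheme, the ids handed out never exceed the number of ids
held concurrently, so a user-space allocator could size its per-vcpu
array by the number of concurrently running threads rather than by the
number of possible cpus.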

schedstats:

* perf bench sched messaging (single instance, multi-process):

On sched-switch:
  single-threaded vcpu-id:              99.985 %
  transfer between threads:                  0 %
  runqueue cache hit:                    0.015 %
  runqueue cache eviction (bit-clear):       0 %
  runqueue cache discard (bit-clear):        0 %
  vcpu-id allocation (bit-set):              0 %
On release mm:
  vcpu-id remove (bit-clear):                0 %
On migration:
  vcpu-id remove (bit-clear):                0 %

* perf bench sched messaging -t (single instance, multi-thread):

On sched-switch:
  single-threaded vcpu-id:               0.128 %
  transfer between threads:             98.260 %
  runqueue cache hit:                    1.075 %
  runqueue cache eviction (bit-clear):   0.001 %
  runqueue cache discard (bit-clear):        0 %
  vcpu-id allocation (bit-set):          0.269 %
On release mm:
  vcpu-id remove (bit-clear):            0.161 %
On migration:
  vcpu-id remove (bit-clear):            0.107 %

* perf bench sched messaging -t (two instances, multi-thread):

On sched-switch:
  single-threaded vcpu-id:               0.081 %
  transfer between threads:             89.512 %
  runqueue cache hit:                    9.659 %
  runqueue cache eviction (bit-clear):   0.003 %
  runqueue cache discard (bit-clear):        0 %
  vcpu-id allocation (bit-set):          0.374 %
On release mm:
  vcpu-id remove (bit-clear):            0.243 %
On migration:
  vcpu-id remove (bit-clear):            0.129 %

* perf bench sched pipe (one instance, multi-process):

On sched-switch:
  single-threaded vcpu-id:              99.993 %
  transfer between threads:              0.001 %
  runqueue cache hit:                    0.002 %
  runqueue cache eviction (bit-clear):       0 %
  runqueue cache discard (bit-clear):        0 %
  vcpu-id allocation (bit-set):          0.002 %
On release mm:
  vcpu-id remove (bit-clear):                0 %
On migration:
  vcpu-id remove (bit-clear):            0.002 %

[1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26

Signed-off-by: Mathieu Desnoyers
---
 fs/exec.c                |   4 +
 include/linux/mm.h       |  25 +++
 include/linux/mm_types.h | 111 ++++++++++++
 include/linux/sched.h    |   5 +
 init/Kconfig             |   4 +
 kernel/fork.c            |  15 +-
 kernel/sched/core.c      |  82 +++++++++
 kernel/sched/deadline.c  |   3 +
 kernel/sched/debug.c     |  13 ++
 kernel/sched/fair.c      |   1 +
 kernel/sched/rt.c        |   2 +
 kernel/sched/sched.h     | 364 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/stats.c     |  16 +-
 13 files changed, 642 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..7b7520b63e95 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1006,6 +1006,10 @@ static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
 	tsk->mm = mm;
+	mm_init_vcpu_users(mm);
+	mm_init_vcpumask(mm);
+	mm_init_node_vcpumask(mm);
+	sched_vcpu_activate_mm(tsk, mm);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e1a84b1e6787..6ca8a4a85fcd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3374,5 +3374,30 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 }
 #endif
 
+#ifdef CONFIG_SCHED_MM_VCPU
+void sched_vcpu_release_mm(struct task_struct *t, struct mm_struct *mm);
+void sched_vcpu_activate_mm(struct task_struct *t, struct mm_struct *mm);
+void sched_vcpu_get_mm(struct task_struct *t, struct mm_struct *mm);
+void sched_vcpu_dup_mm(struct task_struct *t, struct mm_struct *mm);
+static inline int task_mm_vcpu_id(struct task_struct *t)
+{
+	return t->mm_vcpu;
+}
+#else
+static inline void sched_vcpu_release_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline void sched_vcpu_activate_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline void sched_vcpu_get_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline void sched_vcpu_dup_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline int task_mm_vcpu_id(struct task_struct *t)
+{
+	/*
+	 * Use the processor id as a fall-back when the mm vcpu feature is
+	 * disabled. This provides functional per-cpu data structure accesses
+	 * in user-space, although it won't provide the memory usage benefits.
+ */ + return raw_smp_processor_id(); +} +#endif + #endif /* __KERNEL__ */ #endif /* _LINUX_MM_H */ diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 9db36dc5d4cf..40fcc526396f 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -17,6 +17,7 @@ #include #include #include +#include =20 #include =20 @@ -502,6 +503,20 @@ struct mm_struct { */ atomic_t mm_count; =20 +#ifdef CONFIG_SCHED_MM_VCPU + /** + * @mm_vcpu_users: The number of references to &struct mm_struct + * from user-space threads. + * + * Initialized to 1 for the first thread with a reference with + * the mm. Incremented for each thread getting a reference to the + * mm, and decremented on mm release from user-space threads. + * Used to enable single-threaded mm_vcpu accounting (when =3D=3D 1). + */ + + atomic_t mm_vcpu_users; +#endif + #ifdef CONFIG_MMU atomic_long_t pgtables_bytes; /* PTE page table pages */ #endif @@ -659,6 +674,102 @@ static inline cpumask_t *mm_cpumask(struct mm_struct = *mm) return (struct cpumask *)&mm->cpu_bitmap; } =20 +#ifdef CONFIG_SCHED_MM_VCPU +/* Future-safe accessor for struct mm_struct's vcpu_mask. 
*/ +static inline cpumask_t *mm_vcpumask(struct mm_struct *mm) +{ + unsigned long vcpu_bitmap =3D (unsigned long)mm; + + vcpu_bitmap +=3D offsetof(struct mm_struct, cpu_bitmap); + /* Skip cpu_bitmap */ + vcpu_bitmap +=3D cpumask_size(); + return (struct cpumask *)vcpu_bitmap; +} + +static inline void mm_init_vcpu_users(struct mm_struct *mm) +{ + atomic_set(&mm->mm_vcpu_users, 1); +} + +static inline void mm_init_vcpumask(struct mm_struct *mm) +{ + cpumask_clear(mm_vcpumask(mm)); +} + +static inline unsigned int mm_vcpumask_size(void) +{ + return cpumask_size(); +} + +#else +static inline cpumask_t *mm_vcpumask(struct mm_struct *mm) +{ + return NULL; +} + +static inline void mm_init_vcpu_users(struct mm_struct *mm) { } +static inline void mm_init_vcpumask(struct mm_struct *mm) { } + +static inline unsigned int mm_vcpumask_size(void) +{ + return 0; +} +#endif + +#if defined(CONFIG_SCHED_MM_VCPU) && defined(CONFIG_NUMA) +/* + * Layout of node vcpumasks: + * - node_alloc vcpumask: cpumask tracking which vcpu_id were + * allocated (across nodes) in this + * memory space. + * - node vcpumask[nr_node_ids]: per-node cpumask tracking which vcpu_id + * were allocated in this memory space. 
+ */ +static inline cpumask_t *mm_node_alloc_vcpumask(struct mm_struct *mm) +{ + unsigned long vcpu_bitmap =3D (unsigned long)mm_vcpumask(mm); + + /* Skip mm_vcpumask */ + vcpu_bitmap +=3D cpumask_size(); + return (struct cpumask *)vcpu_bitmap; +} + +static inline cpumask_t *mm_node_vcpumask(struct mm_struct *mm, unsigned i= nt node) +{ + unsigned long vcpu_bitmap =3D (unsigned long)mm_node_alloc_vcpumask(mm); + + /* Skip node alloc vcpumask */ + vcpu_bitmap +=3D cpumask_size(); + vcpu_bitmap +=3D node * cpumask_size(); + return (struct cpumask *)vcpu_bitmap; +} + +static inline void mm_init_node_vcpumask(struct mm_struct *mm) +{ + unsigned int node; + + if (num_possible_nodes() =3D=3D 1) + return; + cpumask_clear(mm_node_alloc_vcpumask(mm)); + for (node =3D 0; node < nr_node_ids; node++) + cpumask_clear(mm_node_vcpumask(mm, node)); +} + +static inline unsigned int mm_node_vcpumask_size(void) +{ + if (num_possible_nodes() =3D=3D 1) + return 0; + return (nr_node_ids + 1) * cpumask_size(); +} +#else +static inline void mm_init_node_vcpumask(struct mm_struct *mm) { } + +static inline unsigned int mm_node_vcpumask_size(void) +{ + return 0; +} +#endif + struct mmu_gather; extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm); extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct= *mm); diff --git a/include/linux/sched.h b/include/linux/sched.h index 838c9e0b4cae..c400d44f8716 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1300,6 +1300,11 @@ struct task_struct { unsigned long rseq_event_mask; #endif =20 +#ifdef CONFIG_SCHED_MM_VCPU + int mm_vcpu; /* Current vcpu in mm */ + int vcpu_mm_active; +#endif + struct tlbflush_unmap_batch tlb_ubc; =20 union { diff --git a/init/Kconfig b/init/Kconfig index e9119bf54b1f..6bd40f303a0d 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1023,6 +1023,10 @@ config RT_GROUP_SCHED =20 endif #CGROUP_SCHED =20 +config SCHED_MM_VCPU + def_bool y + depends on SMP && RSEQ + config 
UCLAMP_TASK_GROUP bool "Utilization clamping per group of tasks" depends on CGROUP_SCHED diff --git a/kernel/fork.c b/kernel/fork.c index d75a528f7b21..78fcf3277540 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -970,6 +970,11 @@ static struct task_struct *dup_task_struct(struct task= _struct *orig, int node) #ifdef CONFIG_MEMCG tsk->active_memcg =3D NULL; #endif + +#ifdef CONFIG_SCHED_MM_VCPU + tsk->mm_vcpu =3D 0; + tsk->vcpu_mm_active =3D 0; +#endif return tsk; =20 free_stack: @@ -1079,6 +1084,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm= , struct task_struct *p, goto fail_nocontext; =20 mm->user_ns =3D get_user_ns(user_ns); + mm_init_vcpu_users(mm); + mm_init_vcpumask(mm); + mm_init_node_vcpumask(mm); return mm; =20 fail_nocontext: @@ -1380,6 +1388,8 @@ static int wait_for_vfork_done(struct task_struct *ch= ild, */ static void mm_release(struct task_struct *tsk, struct mm_struct *mm) { + sched_vcpu_release_mm(tsk, mm); + uprobe_free_utask(tsk); =20 /* Get rid of any cached register state */ @@ -1499,10 +1509,12 @@ static int copy_mm(unsigned long clone_flags, struc= t task_struct *tsk) if (clone_flags & CLONE_VM) { mmget(oldmm); mm =3D oldmm; + sched_vcpu_get_mm(tsk, mm); } else { mm =3D dup_mm(tsk, current->mm); if (!mm) return -ENOMEM; + sched_vcpu_dup_mm(tsk, mm); } =20 tsk->mm =3D mm; @@ -2901,7 +2913,8 @@ void __init proc_caches_init(void) * dynamically sized based on the maximum CPU number this system * can have, taking hotplug into account (nr_cpu_ids). 
*/ - mm_size =3D sizeof(struct mm_struct) + cpumask_size(); + mm_size =3D sizeof(struct mm_struct) + cpumask_size() + mm_vcpumask_size(= ) + + mm_node_vcpumask_size(); =20 mm_cachep =3D kmem_cache_create_usercopy("mm_struct", mm_size, ARCH_MIN_MMSTRUCT_ALIGN, diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 1e08b02e0cd5..70bf2899c9b3 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2267,6 +2267,7 @@ static struct rq *move_queued_task(struct rq *rq, str= uct rq_flags *rf, lockdep_assert_rq_held(rq); =20 deactivate_task(rq, p, DEQUEUE_NOCLOCK); + rq_vcpu_cache_remove_mm_locked(rq, p->mm, false); set_task_cpu(p, new_cpu); rq_unlock(rq, rf); =20 @@ -2454,6 +2455,7 @@ int push_cpu_stop(void *arg) // XXX validate p is still the highest prio task if (task_rq(p) =3D=3D rq) { deactivate_task(rq, p, 0); + rq_vcpu_cache_remove_mm_locked(rq, p->mm, false); set_task_cpu(p, lowest_rq->cpu); activate_task(lowest_rq, p, 0); resched_curr(lowest_rq); @@ -3093,6 +3095,7 @@ static void __migrate_swap_task(struct task_struct *p= , int cpu) rq_pin_lock(dst_rq, &drf); =20 deactivate_task(src_rq, p, 0); + rq_vcpu_cache_remove_mm_locked(src_rq, p->mm, false); set_task_cpu(p, cpu); activate_task(dst_rq, p, 0); check_preempt_curr(dst_rq, p, 0); @@ -3716,6 +3719,8 @@ static void __ttwu_queue_wakelist(struct task_struct = *p, int cpu, int wake_flags p->sched_remote_wakeup =3D !!(wake_flags & WF_MIGRATED); =20 WRITE_ONCE(rq->ttwu_pending, 1); + if (WARN_ON_ONCE(task_cpu(p) !=3D cpu_of(rq))) + rq_vcpu_cache_remove_mm(task_rq(p), p->mm, false); __smp_call_single_queue(cpu, &p->wake_entry.llist); } =20 @@ -4125,6 +4130,7 @@ try_to_wake_up(struct task_struct *p, unsigned int st= ate, int wake_flags) =20 wake_flags |=3D WF_MIGRATED; psi_ttwu_dequeue(p); + rq_vcpu_cache_remove_mm(task_rq(p), p->mm, false); set_task_cpu(p, cpu); } #else @@ -4796,6 +4802,7 @@ prepare_task_switch(struct rq *rq, struct task_struct= *prev, sched_info_switch(rq, prev, next); 
perf_event_task_sched_out(prev, next); rseq_preempt(prev); + switch_mm_vcpu(rq, prev, next); fire_sched_out_preempt_notifiers(prev, next); kmap_local_sched_out(); prepare_task(next); @@ -5922,6 +5929,7 @@ static bool try_steal_cookie(int this, int that) goto next; =20 deactivate_task(src, p, 0); + rq_vcpu_cache_remove_mm_locked(src, p->mm, false); set_task_cpu(p, this); activate_task(dst, p, 0); =20 @@ -10927,3 +10935,77 @@ void call_trace_sched_update_nr_running(struct rq = *rq, int count) { trace_sched_update_nr_running_tp(rq, count); } + +#ifdef CONFIG_SCHED_MM_VCPU +void sched_vcpu_release_mm(struct task_struct *t, struct mm_struct *mm) +{ + struct rq_flags rf; + struct rq *rq; + + if (!mm) + return; + WARN_ON_ONCE(t !=3D current); + preempt_disable(); + rq =3D this_rq(); + rq_lock_irqsave(rq, &rf); + t->vcpu_mm_active =3D 0; + atomic_dec(&mm->mm_vcpu_users); + rq_vcpu_cache_remove_mm_locked(rq, mm, true); + rq_unlock_irqrestore(rq, &rf); + t->mm_vcpu =3D -1; + preempt_enable(); +} + +void sched_vcpu_activate_mm(struct task_struct *t, struct mm_struct *mm) +{ + WARN_ON_ONCE(t !=3D current); + preempt_disable(); + t->vcpu_mm_active =3D 1; + /* No need to reserve in cpumask because single-threaded. */ + t->mm_vcpu =3D mm_vcpu_first_node_vcpu(numa_node_id()); + preempt_enable(); +} + +void sched_vcpu_get_mm(struct task_struct *t, struct mm_struct *mm) +{ + int vcpu, mm_vcpu_users; + struct rq_flags rf; + struct rq *rq; + + preempt_disable(); + rq =3D this_rq(); + t->vcpu_mm_active =3D 1; + mm_vcpu_users =3D atomic_read(&mm->mm_vcpu_users); + atomic_inc(&mm->mm_vcpu_users); + t->mm_vcpu =3D -1; + vcpu =3D current->mm_vcpu; + rq_lock_irqsave(rq, &rf); + /* On transition from 1 to 2 mm users, reserve vcpu ids. 
*/ + if (mm_vcpu_users =3D=3D 1) { + mm_vcpu_reserve_nodes(mm); + rq_vcpu_cache_remove_mm_locked(rq, mm, true); + current->mm_vcpu =3D __mm_vcpu_get(rq, mm); + rq_vcpu_cache_add(rq, mm, current->mm_vcpu); + /* + * __mm_vcpu_get could get a different vcpu after going + * multi-threaded, then back single-threaded, then + * multi-threaded on a NUMA configuration using the first CPU + * matching the NUMA node as single-threaded vcpu, with + * leftover vcpu_id matching the NUMA node set from when this + * task was multithreaded. + */ + if (current->mm_vcpu !=3D vcpu) + rseq_set_notify_resume(current); + } + rq_unlock_irqrestore(rq, &rf); + preempt_enable(); +} + +void sched_vcpu_dup_mm(struct task_struct *t, struct mm_struct *mm) +{ + preempt_disable(); + t->vcpu_mm_active =3D 1; + t->mm_vcpu =3D -1; + preempt_enable(); +} +#endif diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index d2c072b0ef01..f4f394db2db8 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -655,6 +655,7 @@ static struct rq *dl_task_offline_migration(struct rq *= rq, struct task_struct *p __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(later_rq->rd->span)); raw_spin_unlock(&dl_b->lock); =20 + rq_vcpu_cache_remove_mm_locked(rq, p->mm, false); set_task_cpu(p, later_rq->cpu); double_unlock_balance(later_rq, rq); =20 @@ -2290,6 +2291,7 @@ static int push_dl_task(struct rq *rq) } =20 deactivate_task(rq, next_task, 0); + rq_vcpu_cache_remove_mm_locked(rq, next_task->mm, false); set_task_cpu(next_task, later_rq->cpu); =20 /* @@ -2386,6 +2388,7 @@ static void pull_dl_task(struct rq *this_rq) push_task =3D get_push_task(src_rq); } else { deactivate_task(src_rq, p, 0); + rq_vcpu_cache_remove_mm_locked(src_rq, p->mm, false); set_task_cpu(p, this_cpu); activate_task(this_rq, p, 0); dmin =3D p->dl.deadline; diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 102d6f70e84d..3b44f8dd064d 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -763,6 +763,19 @@ do { \ 
P(sched_goidle); P(ttwu_count); P(ttwu_local); +#undef P +#define P(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, schedstat_val(rq->n)); + P(nr_vcpu_thread_transfer); + P(nr_vcpu_cache_hit); + P(nr_vcpu_cache_evict); + P(nr_vcpu_cache_discard_wrong_node); + P(nr_vcpu_allocate); + P(nr_vcpu_allocate_node_reuse); + P(nr_vcpu_allocate_node_new); + P(nr_vcpu_allocate_node_rebalance); + P(nr_vcpu_allocate_node_steal); + P(nr_vcpu_remove_release_mm); + P(nr_vcpu_remove_migrate); } #undef P =20 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index dcbd3110c687..9c8b88e57315 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7817,6 +7817,7 @@ static void detach_task(struct task_struct *p, struct= lb_env *env) lockdep_assert_rq_held(env->src_rq); =20 deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK); + rq_vcpu_cache_remove_mm_locked(env->src_rq, p->mm, false); set_task_cpu(p, env->dst_cpu); } =20 diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 7b4f4fbbb404..fd37e23612f9 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -2106,6 +2106,7 @@ static int push_rt_task(struct rq *rq, bool pull) } =20 deactivate_task(rq, next_task, 0); + rq_vcpu_cache_remove_mm_locked(rq, next_task->mm, false); set_task_cpu(next_task, lowest_rq->cpu); activate_task(lowest_rq, next_task, 0); resched_curr(lowest_rq); @@ -2379,6 +2380,7 @@ static void pull_rt_task(struct rq *this_rq) push_task =3D get_push_task(src_rq); } else { deactivate_task(src_rq, p, 0); + rq_vcpu_cache_remove_mm_locked(src_rq, p->mm, false); set_task_cpu(p, this_cpu); activate_task(this_rq, p, 0); resched =3D true; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 3da5718cd641..5034a7372452 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -916,6 +916,19 @@ struct uclamp_rq { DECLARE_STATIC_KEY_FALSE(sched_uclamp_used); #endif /* CONFIG_UCLAMP_TASK */ =20 +#ifdef CONFIG_SCHED_MM_VCPU +# define RQ_VCPU_CACHE_SIZE 8 +struct rq_vcpu_entry { + struct mm_struct *mm; /* NULL if 
unset */ + int vcpu_id; +}; + +struct rq_vcpu_cache { + struct rq_vcpu_entry entry[RQ_VCPU_CACHE_SIZE]; + unsigned int head; +}; +#endif + /* * This is the main, per-CPU runqueue data structure. * @@ -1086,6 +1099,19 @@ struct rq { /* try_to_wake_up() stats */ unsigned int ttwu_count; unsigned int ttwu_local; + + unsigned long long nr_vcpu_single_thread; + unsigned long long nr_vcpu_thread_transfer; + unsigned long long nr_vcpu_cache_hit; + unsigned long long nr_vcpu_cache_evict; + unsigned long long nr_vcpu_cache_discard_wrong_node; + unsigned long long nr_vcpu_allocate; + unsigned long long nr_vcpu_allocate_node_reuse; + unsigned long long nr_vcpu_allocate_node_new; + unsigned long long nr_vcpu_allocate_node_rebalance; + unsigned long long nr_vcpu_allocate_node_steal; + unsigned long long nr_vcpu_remove_release_mm; + unsigned long long nr_vcpu_remove_migrate; #endif =20 #ifdef CONFIG_CPU_IDLE @@ -1116,6 +1142,10 @@ struct rq { unsigned int core_forceidle_occupation; u64 core_forceidle_start; #endif + +#ifdef CONFIG_SCHED_MM_VCPU + struct rq_vcpu_cache vcpu_cache; +#endif }; =20 #ifdef CONFIG_FAIR_GROUP_SCHED @@ -3137,3 +3167,337 @@ extern int sched_dynamic_mode(const char *str); extern void sched_dynamic_update(int mode); #endif =20 +#ifdef CONFIG_SCHED_MM_VCPU +static inline int __mm_vcpu_get_single_node(struct mm_struct *mm) +{ + struct cpumask *cpumask; + int vcpu; + + cpumask =3D mm_vcpumask(mm); + /* Atomically reserve lowest available vcpu number. 
*/ + do { + vcpu =3D cpumask_first_zero(cpumask); + if (vcpu >=3D nr_cpu_ids) + return -1; + } while (cpumask_test_and_set_cpu(vcpu, cpumask)); + return vcpu; +} + +#ifdef CONFIG_NUMA +static inline bool mm_node_vcpumask_test_cpu(struct mm_struct *mm, int vcp= u_id) +{ + if (num_possible_nodes() =3D=3D 1) + return true; + return cpumask_test_cpu(vcpu_id, mm_node_vcpumask(mm, numa_node_id())); +} + +static inline int __mm_vcpu_get(struct rq *rq, struct mm_struct *mm) +{ + struct cpumask *cpumask =3D mm_vcpumask(mm), + *node_cpumask =3D mm_node_vcpumask(mm, numa_node_id()), + *node_alloc_cpumask =3D mm_node_alloc_vcpumask(mm); + unsigned int node; + int vcpu; + + if (num_possible_nodes() =3D=3D 1) + return __mm_vcpu_get_single_node(mm); + + /* + * Try to atomically reserve lowest available vcpu number within those + * already reserved for this NUMA node. + */ + do { + vcpu =3D cpumask_first_one_and_zero(node_cpumask, cpumask); + if (vcpu >=3D nr_cpu_ids) + goto alloc_numa; + } while (cpumask_test_and_set_cpu(vcpu, cpumask)); + schedstat_inc(rq->nr_vcpu_allocate_node_reuse); + goto end; + +alloc_numa: + /* + * Try to atomically reserve lowest available vcpu number within those + * not already allocated for numa nodes. + */ + do { + vcpu =3D cpumask_first_zero_and_zero(node_alloc_cpumask, cpumask); + if (vcpu >=3D nr_cpu_ids) + goto numa_update; + } while (cpumask_test_and_set_cpu(vcpu, cpumask)); + cpumask_set_cpu(vcpu, node_cpumask); + cpumask_set_cpu(vcpu, node_alloc_cpumask); + schedstat_inc(rq->nr_vcpu_allocate_node_new); + goto end; + +numa_update: + /* + * NUMA node id configuration changed for at least one CPU in the system. + * We need to steal a currently unused vcpu_id from an overprovisioned + * node for our current node. Userspace must handle the fact that the + * node id associated with this vcpu_id may change due to node ID + * reconfiguration. 
+ * + * Count how many possible cpus are attached to each (other) node id, + * and compare this with the per-mm node vcpumask cpu count. Find one + * which has too many cpus in its mask to steal from. + */ + for (node =3D 0; node < nr_node_ids; node++) { + struct cpumask *iter_cpumask; + + if (node =3D=3D numa_node_id()) + continue; + iter_cpumask =3D mm_node_vcpumask(mm, node); + if (nr_cpus_node(node) < cpumask_weight(iter_cpumask)) { + /* Try to steal from this node. */ + do { + vcpu =3D cpumask_first_one_and_zero(iter_cpumask, cpumask); + if (vcpu >=3D nr_cpu_ids) + goto steal_fail; + } while (cpumask_test_and_set_cpu(vcpu, cpumask)); + cpumask_clear_cpu(vcpu, iter_cpumask); + cpumask_set_cpu(vcpu, node_cpumask); + schedstat_inc(rq->nr_vcpu_allocate_node_rebalance); + goto end; + } + } + +steal_fail: + /* + * Our attempt at gracefully stealing a vcpu_id from another + * overprovisioned NUMA node failed. Fallback to grabbing the first + * available vcpu_id. + */ + do { + vcpu =3D cpumask_first_zero(cpumask); + if (vcpu >=3D nr_cpu_ids) + return -1; + } while (cpumask_test_and_set_cpu(vcpu, cpumask)); + /* Steal vcpu from its numa node mask. */ + for (node =3D 0; node < nr_node_ids; node++) { + struct cpumask *iter_cpumask; + + if (node =3D=3D numa_node_id()) + continue; + iter_cpumask =3D mm_node_vcpumask(mm, node); + if (cpumask_test_cpu(vcpu, iter_cpumask)) { + cpumask_clear_cpu(vcpu, iter_cpumask); + break; + } + } + cpumask_set_cpu(vcpu, node_cpumask); + schedstat_inc(rq->nr_vcpu_allocate_node_steal); +end: + return vcpu; +} + +static inline int mm_vcpu_first_node_vcpu(int node) +{ + int vcpu; + + if (likely(nr_cpu_ids >=3D nr_node_ids)) + return node; + vcpu =3D cpumask_first(cpumask_of_node(node)); + if (vcpu >=3D nr_cpu_ids) + return -1; + return vcpu; +} + +/* + * Single-threaded processes observe a mapping of vcpu_id->node_id where + * the vcpu_id returned corresponds to mm_vcpu_first_node_vcpu(). 
When goi= ng + * from single to multi-threaded, reserve this same mapping so it stays + * invariant. + */ +static inline void mm_vcpu_reserve_nodes(struct mm_struct *mm) +{ + struct cpumask *node_alloc_cpumask =3D mm_node_alloc_vcpumask(mm); + int node, other_node; + + for (node =3D 0; node < nr_node_ids; node++) { + struct cpumask *iter_cpumask =3D mm_node_vcpumask(mm, node); + int vcpu =3D mm_vcpu_first_node_vcpu(node); + + /* Skip nodes that have no CPU associated with them. */ + if (vcpu < 0) + continue; + cpumask_set_cpu(vcpu, iter_cpumask); + cpumask_set_cpu(vcpu, node_alloc_cpumask); + for (other_node =3D 0; other_node < nr_node_ids; other_node++) { + if (other_node =3D=3D node) + continue; + cpumask_clear_cpu(vcpu, mm_node_vcpumask(mm, other_node)); + } + } +} +#else +static inline bool mm_node_vcpumask_test_cpu(struct mm_struct *mm, int vcp= u_id) +{ + return true; +} +static inline int __mm_vcpu_get(struct rq *rq, struct mm_struct *mm) +{ + return __mm_vcpu_get_single_node(mm); +} +static inline int mm_vcpu_first_node_vcpu(int node) +{ + return 0; +} +static inline void mm_vcpu_reserve_nodes(struct mm_struct *mm) { } +#endif + +static inline void __mm_vcpu_put(struct mm_struct *mm, int vcpu) +{ + if (vcpu < 0) + return; + cpumask_clear_cpu(vcpu, mm_vcpumask(mm)); +} + +static inline struct rq_vcpu_entry *rq_vcpu_cache_lookup(struct rq *rq, st= ruct mm_struct *mm) +{ + struct rq_vcpu_cache *vcpu_cache =3D &rq->vcpu_cache; + int i; + + for (i =3D 0; i < RQ_VCPU_CACHE_SIZE; i++) { + struct rq_vcpu_entry *entry =3D &vcpu_cache->entry[i]; + + if (entry->mm =3D=3D mm) + return entry; + } + return NULL; +} + +/* Removal from cache simply leaves an unused hole. 
*/ +static inline int rq_vcpu_cache_lookup_remove(struct rq *rq, struct mm_str= uct *mm) +{ + struct rq_vcpu_entry *entry =3D rq_vcpu_cache_lookup(rq, mm); + + if (!entry) + return -1; + entry->mm =3D NULL; /* Remove from cache */ + return entry->vcpu_id; +} + +static inline void rq_vcpu_cache_remove_mm_locked(struct rq *rq, struct mm= _struct *mm, + bool release_mm) +{ + int vcpu; + + if (!mm) + return; + /* + * Do not remove the cache entry for a runqueue that runs a task which + * currently uses the target mm. + */ + if (!release_mm && rq->curr->mm =3D=3D mm) + return; + vcpu =3D rq_vcpu_cache_lookup_remove(rq, mm); + if (vcpu < 0) + return; + if (release_mm) + schedstat_inc(rq->nr_vcpu_remove_release_mm); + else + schedstat_inc(rq->nr_vcpu_remove_migrate); + __mm_vcpu_put(mm, vcpu); +} + +static inline void rq_vcpu_cache_remove_mm(struct rq *rq, struct mm_struct= *mm, + bool release_mm) +{ + struct rq_flags rf; + + rq_lock_irqsave(rq, &rf); + rq_vcpu_cache_remove_mm_locked(rq, mm, release_mm); + rq_unlock_irqrestore(rq, &rf); +} + +/* + * Add at head, move head forward. Cheap LRU cache. + * Only need to clear the vcpu mask bit from its own mm_vcpumask(mm) when = we + * overwrite an old entry from the cache. Note that this is not needed if = the + * overwritten entry is an unused hole. This access to the old_mm from an + * unrelated thread requires that cache entry for a given mm gets pruned f= rom + * the cache when a task is dequeued from the runqueue. 
+ */ +static inline void rq_vcpu_cache_add(struct rq *rq, struct mm_struct *mm, + int vcpu_id) +{ + struct rq_vcpu_cache *vcpu_cache =3D &rq->vcpu_cache; + struct mm_struct *old_mm; + struct rq_vcpu_entry *entry; + unsigned int pos; + + pos =3D vcpu_cache->head; + entry =3D &vcpu_cache->entry[pos]; + old_mm =3D entry->mm; + if (old_mm) { + schedstat_inc(rq->nr_vcpu_cache_evict); + __mm_vcpu_put(old_mm, entry->vcpu_id); + } + entry->mm =3D mm; + entry->vcpu_id =3D vcpu_id; + vcpu_cache->head =3D (pos + 1) % RQ_VCPU_CACHE_SIZE; +} + +static inline int mm_vcpu_get(struct rq *rq, struct task_struct *t) +{ + struct rq_vcpu_entry *entry; + struct mm_struct *mm =3D t->mm; + int vcpu; + + /* Skip allocation if mm is single-threaded. */ + if (atomic_read(&mm->mm_vcpu_users) =3D=3D 1) { + schedstat_inc(rq->nr_vcpu_single_thread); + vcpu =3D mm_vcpu_first_node_vcpu(numa_node_id()); + goto end; + } + entry =3D rq_vcpu_cache_lookup(rq, mm); + if (likely(entry)) { + vcpu =3D entry->vcpu_id; + if (likely(mm_node_vcpumask_test_cpu(mm, vcpu))) { + schedstat_inc(rq->nr_vcpu_cache_hit); + goto end; + } else { + schedstat_inc(rq->nr_vcpu_cache_discard_wrong_node); + entry->mm =3D NULL; /* Remove from cache */ + __mm_vcpu_put(mm, vcpu); + } + } + schedstat_inc(rq->nr_vcpu_allocate); + vcpu =3D __mm_vcpu_get(rq, mm); + rq_vcpu_cache_add(rq, mm, vcpu); +end: + return vcpu; +} + +static inline void switch_mm_vcpu(struct rq *rq, struct task_struct *prev, + struct task_struct *next) +{ + if (!(next->flags & PF_EXITING) && !(next->flags & PF_KTHREAD) && + next->mm && next->vcpu_mm_active) { + if (!(prev->flags & PF_EXITING) && !(prev->flags & PF_KTHREAD) && + prev->mm =3D=3D next->mm && prev->vcpu_mm_active && + mm_node_vcpumask_test_cpu(next->mm, prev->mm_vcpu)) { + /* + * Switching between threads with the same mm. Simply pass the + * vcpu token along to the next thread. 
+			 */
+			schedstat_inc(rq->nr_vcpu_thread_transfer);
+			next->mm_vcpu = prev->mm_vcpu;
+		} else {
+			next->mm_vcpu = mm_vcpu_get(rq, next);
+		}
+	}
+	if (!(prev->flags & PF_EXITING) && !(prev->flags & PF_KTHREAD) &&
+	    prev->mm && prev->vcpu_mm_active)
+		prev->mm_vcpu = -1;
+}
+
+#else
+static inline void switch_mm_vcpu(struct rq *rq, struct task_struct *prev,
+				  struct task_struct *next) { }
+static inline void rq_vcpu_cache_remove_mm_locked(struct rq *rq,
+						  struct mm_struct *mm,
+						  bool release_mm) { }
+static inline void rq_vcpu_cache_remove_mm(struct rq *rq, struct mm_struct *mm,
+					   bool release_mm) { }
+#endif
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 07dde2928c79..027d0caf2d14 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -134,12 +134,24 @@ static int show_schedstat(struct seq_file *seq, void *v)
 
 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+		    rq->nr_vcpu_single_thread,
+		    rq->nr_vcpu_thread_transfer,
+		    rq->nr_vcpu_cache_hit,
+		    rq->nr_vcpu_cache_evict,
+		    rq->nr_vcpu_cache_discard_wrong_node,
+		    rq->nr_vcpu_allocate,
+		    rq->nr_vcpu_allocate_node_reuse,
+		    rq->nr_vcpu_allocate_node_new,
+		    rq->nr_vcpu_allocate_node_rebalance,
+		    rq->nr_vcpu_allocate_node_steal,
+		    rq->nr_vcpu_remove_release_mm,
+		    rq->nr_vcpu_remove_migrate);
 
 		seq_printf(seq, "\n");
 
-- 
2.17.1

From nobody Sat Jun 27 21:21:51 2026
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 46FECC433FE for ; Fri, 18 Feb 2022 21:16:03 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239692AbiBRVQS (ORCPT ); Fri, 18 Feb 2022 16:16:18 -0500
Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:42148 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239638AbiBRVQJ (ORCPT ); Fri, 18 Feb 2022 16:16:09 -0500
X-Greylist: delayed 545 seconds by postgrey-1.37 at lindbergh.monkeyblade.net; Fri, 18 Feb 2022 13:15:49 PST
Received: from mail.efficios.com (mail.efficios.com [167.114.26.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D665D23F0A3; Fri, 18 Feb 2022 13:15:49 -0800 (PST)
Received: from localhost (localhost [127.0.0.1]) by mail.efficios.com (Postfix) with ESMTP id C3F1B3BAA1E; Fri, 18 Feb 2022 16:06:49 -0500 (EST)
Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail03.efficios.com [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id FADiSyD0JeD3; Fri, 18 Feb 2022 16:06:49 -0500 (EST)
Received: from localhost (localhost [127.0.0.1]) by mail.efficios.com (Postfix) with ESMTP id BB4D23BAB86; Fri, 18 Feb 2022 16:06:45 -0500 (EST)
DKIM-Filter: OpenDKIM Filter v2.10.3 mail.efficios.com BB4D23BAB86
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=efficios.com; s=default; t=1645218405; bh=SlksVcAEQwHXv/2n1z59vggbF/JR7rR40IRrV+e+NmA=; h=From:To:Date:Message-Id; b=CV9nJHqNa4w3xjK87VcyHWPugPwE8EUcnU5cVmaXWok7hzcYR3j2OScg1NoKoa5Sb 7yM8Bq3Sbesl+NutA8NbEMD3u5+N/2LI8FhintzlwjyGfEBGFqaZ5R6DqJjDSYWWjX IsUwsRn8fm4e6hnDoePjGu8g8urlCiNP3No7WuPg2RQzQ29pci0IDbAozaA5AaUSMA 27Nay6ZGPEtG5EqyP7hrw+3xvfgRfjHkXEHEiWiqJBj9p0fe5J7op/rRltFarC00yU bsJkzjsRz1vT2XxGL7A6s7Df91gOSfZ1N1VZ+oQvlFFwsDitEVqSdCb9cRGzFKRomR vPvuM24SiGfYA==
X-Virus-Scanned: amavisd-new at efficios.com
Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail03.efficios.com [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id Qd3Nh264iJsy; Fri, 18 Feb 2022 16:06:45 -0500 (EST)
Received: from localhost.localdomain (192-222-180-24.qc.cable.ebox.net [192.222.180.24]) by mail.efficios.com (Postfix) with ESMTPSA id D256B3BAB15; Fri, 18 Feb 2022 16:06:44 -0500 (EST)
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Thomas Gleixner , "Paul E . McKenney" , Boqun Feng , "H . Peter Anvin" , Paul Turner , linux-api@vger.kernel.org, Christian Brauner , Florian Weimer , David.Laight@ACULAB.COM, carlos@redhat.com, Peter Oskolkov , Mathieu Desnoyers
Subject: [RFC PATCH v2 10/11] rseq: extend struct rseq with per memory space vcpu id
Date: Fri, 18 Feb 2022 16:06:32 -0500
Message-Id: <20220218210633.23345-11-mathieu.desnoyers@efficios.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20220218210633.23345-1-mathieu.desnoyers@efficios.com>
References: <20220218210633.23345-1-mathieu.desnoyers@efficios.com>
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

If a memory space has fewer threads than cores, or is limited to run on
few cores concurrently through sched affinity or cgroup cpusets, the
virtual cpu ids will be values close to 0, thus allowing efficient use
of user-space memory for per-cpu data structures.

Signed-off-by: Mathieu Desnoyers
---
 include/uapi/linux/rseq.h |  9 +++++++++
 kernel/rseq.c             | 10 +++++++++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 1cb90a435c5c..77a136586ac6 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -139,6 +139,15 @@ struct rseq {
 	 */
 	__u32 node_id;
 
+	/*
+	 * Restartable sequences vm_vcpu_id field. Updated by the kernel. Read by
+	 * user-space with single-copy atomicity semantics. This field should
+	 * only be read by the thread which registered this data structure.
+	 * Aligned on 32-bit. Contains the current thread's virtual CPU ID
+	 * (allocated uniquely within a memory space).
+	 */
+	__u32 vm_vcpu_id;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index cb7d8a5afc82..1b00339c341b 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -89,12 +89,14 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
 	struct rseq __user *rseq = t->rseq;
 	u32 cpu_id = raw_smp_processor_id();
 	u32 node_id = cpu_to_node(cpu_id);
+	u32 vm_vcpu_id = task_mm_vcpu_id(t);
 
 	if (!user_write_access_begin(rseq, t->rseq_len))
 		goto efault;
 	unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
 	unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
 	unsafe_put_user(node_id, &rseq->node_id, efault_end);
+	unsafe_put_user(vm_vcpu_id, &rseq->vm_vcpu_id, efault_end);
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally updated only if
@@ -112,7 +114,8 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
 
 static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
 {
-	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0;
+	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
+	    vm_vcpu_id = 0;
 
 	/*
 	 * Reset cpu_id_start to its initial state (0).
@@ -131,6 +134,11 @@ static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
 	 */
 	if (put_user(node_id, &t->rseq->node_id))
 		return -EFAULT;
+	/*
+	 * Reset vm_vcpu_id to its initial state (0).
+	 */
+	if (put_user(vm_vcpu_id, &t->rseq->vm_vcpu_id))
+		return -EFAULT;
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally reset only if
-- 
2.17.1

From nobody Sat Jun 27 21:21:51 2026
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A4A47C433F5 for ; Fri, 18 Feb 2022 21:16:14 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239683AbiBRVQa (ORCPT ); Fri, 18 Feb 2022 16:16:30 -0500
Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:42436 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239657AbiBRVQL (ORCPT ); Fri, 18 Feb 2022 16:16:11 -0500
Received: from mail.efficios.com (mail.efficios.com [167.114.26.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3083328BF41; Fri, 18 Feb 2022 13:15:53 -0800 (PST)
Received: from localhost (localhost [127.0.0.1]) by mail.efficios.com (Postfix) with ESMTP id EBB513BA9DC; Fri, 18 Feb 2022 16:06:49 -0500 (EST)
Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail03.efficios.com [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id QrN-FJ-75sez; Fri, 18 Feb 2022 16:06:49 -0500 (EST)
Received: from localhost (localhost [127.0.0.1]) by mail.efficios.com (Postfix) with ESMTP id 6E7F73BA9D3; Fri, 18 Feb 2022 16:06:46 -0500 (EST)
DKIM-Filter: OpenDKIM Filter v2.10.3 mail.efficios.com 6E7F73BA9D3
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=efficios.com; s=default; t=1645218406; bh=jU3p85idUR7noONNjLpaS2BHDqtaaxwv6Q/ch9Ss5iw=; h=From:To:Date:Message-Id; b=LsRo9UIlwMUGVY1clr3Qnuzy3HVHBMRT8X7jx/1O46uXqF/nAxJrBBxSZhHILoDyR lZLxFDOOTdxuSp7/PiNn6jhNX0jd1PBvm5VK7TaO1oIgQtcQS2B3ggCJSBLslPdP1c Pt+5SsUHvZNeO1l5TwJ6ID0PL2Pcy3Yj+g958XcV7avusJb7ul87aytm/6kQSfdfsk 53LdmfMk6ZB2EQxEjPqVWI9J4+W5AdvKddHRFCPsj46TMMFePPY9red9bNIMwVlzZA p/yDIG+hnt1r8DF8D4reCyZcjmC2hj+6pS7yUSjwWAlKXTeaGmF12hssnHm5TrsDwd rKXZzl7cAEYng==
X-Virus-Scanned: amavisd-new at efficios.com
Received: from mail.efficios.com ([127.0.0.1]) by localhost (mail03.efficios.com [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id X7ITggOoy7jL; Fri, 18 Feb 2022 16:06:46 -0500 (EST)
Received: from localhost.localdomain (192-222-180-24.qc.cable.ebox.net [192.222.180.24]) by mail.efficios.com (Postfix) with ESMTPSA id 1EF273BA5F2; Fri, 18 Feb 2022 16:06:45 -0500 (EST)
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Thomas Gleixner , "Paul E . McKenney" , Boqun Feng , "H . Peter Anvin" , Paul Turner , linux-api@vger.kernel.org, Christian Brauner , Florian Weimer , David.Laight@ACULAB.COM, carlos@redhat.com, Peter Oskolkov , Mathieu Desnoyers
Subject: [RFC PATCH v2 11/11] selftests/rseq: Implement rseq vm_vcpu_id field support
Date: Fri, 18 Feb 2022 16:06:33 -0500
Message-Id: <20220218210633.23345-12-mathieu.desnoyers@efficios.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20220218210633.23345-1-mathieu.desnoyers@efficios.com>
References: <20220218210633.23345-1-mathieu.desnoyers@efficios.com>
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Signed-off-by: Mathieu Desnoyers
---
 tools/testing/selftests/rseq/rseq-abi.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/tools/testing/selftests/rseq/rseq-abi.h b/tools/testing/selftests/rseq/rseq-abi.h
index a1faa9162d52..1ee4740ebe94 100644
--- a/tools/testing/selftests/rseq/rseq-abi.h
+++ b/tools/testing/selftests/rseq/rseq-abi.h
@@ -155,6 +155,15 @@ struct rseq_abi {
 	 */
 	__u32 node_id;
 
+	/*
+	 * Restartable sequences vm_vcpu_id field. Updated by the kernel. Read by
+	 * user-space with single-copy atomicity semantics. This field should
+	 * only be read by the thread which registered this data structure.
+	 * Aligned on 32-bit. Contains the current thread's virtual CPU ID
+	 * (allocated uniquely within a memory space).
+	 */
+	__u32 vm_vcpu_id;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
-- 
2.17.1
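
The changelog of patch 10/11 says that vcpu ids stay close to 0, which lets user-space size per-cpu data by the number of ids actually in use. The following is only a rough sketch of that indexing pattern and is not part of the series. It does not register rseq with the kernel: struct rseq_abi_sketch, NR_VCPU_SLOTS, percpu_counter_inc() and percpu_counter_sum() are hypothetical names, and the vm_vcpu_id value is filled in by the caller rather than by the kernel. A real user would perform the update inside an rseq critical section instead of the plain increment shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bound on the vcpu ids this process expects to observe. */
#define NR_VCPU_SLOTS 8

/*
 * Hypothetical stand-in for the two fields of struct rseq_abi discussed
 * here. In the real ABI these fields are written by the kernel.
 */
struct rseq_abi_sketch {
	uint32_t node_id;
	uint32_t vm_vcpu_id;
};

/* One counter slot per vcpu id, instead of one per possible CPU. */
struct percpu_counter {
	uint64_t count[NR_VCPU_SLOTS];
};

/*
 * Increment the slot selected by the current vcpu id.
 * Returns 0 on success, -1 if the id does not fit in the array.
 * Assumption: a plain increment stands in for an rseq critical section.
 */
static int percpu_counter_inc(struct percpu_counter *c,
			      const struct rseq_abi_sketch *rs)
{
	/* Single 32-bit load, mirroring the single-copy atomicity note. */
	uint32_t vcpu = *(const volatile uint32_t *)&rs->vm_vcpu_id;

	if (vcpu >= NR_VCPU_SLOTS)
		return -1;
	c->count[vcpu]++;
	return 0;
}

/* Sum all slots to obtain the total count. */
static uint64_t percpu_counter_sum(const struct percpu_counter *c)
{
	uint64_t sum = 0;
	size_t i;

	assert(c != NULL);
	for (i = 0; i < NR_VCPU_SLOTS; i++)
		sum += c->count[i];
	return sum;
}
```

With ids allocated densely from 0, NR_VCPU_SLOTS in such a scheme could track the number of concurrently running threads of the process rather than the number of possible CPUs, which is the memory saving the changelog refers to.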