From: Bo Li
Subject: [RFC v2 01/35] Kbuild: rpal support
Date: Fri, 30 May 2025 17:27:29 +0800

Add kbuild support for RPAL: a new directory, arch/x86/rpal, and a new
config option, CONFIG_RPAL.

Signed-off-by: Bo Li
---
 arch/x86/Kbuild        |  2 ++
 arch/x86/Kconfig       |  2 ++
 arch/x86/rpal/Kconfig  | 11 +++++++++++
 arch/x86/rpal/Makefile |  0
 4 files changed, 15 insertions(+)
 create mode 100644 arch/x86/rpal/Kconfig
 create mode 100644 arch/x86/rpal/Makefile

diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
index f7fb3d88c57b..26c406442d79 100644
--- a/arch/x86/Kbuild
+++ b/arch/x86/Kbuild
@@ -34,5 +34,7 @@ obj-$(CONFIG_KEXEC_FILE) += purgatory/

 obj-y += virt/

+obj-y += rpal/
+
 # for cleaning
 subdir- += boot tools
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 121f9f03bd5c..3f53b6fc943f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2359,6 +2359,8 @@ config X86_BUS_LOCK_DETECT
          Enable Split Lock Detect and Bus Lock Detect functionalities.
          See for more information.
=20 +source "arch/x86/rpal/Kconfig" + endmenu =20 config CC_HAS_NAMED_AS diff --git a/arch/x86/rpal/Kconfig b/arch/x86/rpal/Kconfig new file mode 100644 index 000000000000..e5e6996553ea --- /dev/null +++ b/arch/x86/rpal/Kconfig @@ -0,0 +1,11 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# This Kconfig describes RPAL options +# + +config RPAL + def_bool y + depends on X86_64 + help + This option enables system support for Run Process As + library (RPAL). \ No newline at end of file diff --git a/arch/x86/rpal/Makefile b/arch/x86/rpal/Makefile new file mode 100644 index 000000000000..e69de29bb2d1 --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pj1-f43.google.com (mail-pj1-f43.google.com [209.85.216.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BF7AA220F4E for ; Fri, 30 May 2025 09:28:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.43 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597339; cv=none; b=Pthsy8hsuuuVxx6duykfYVzkdIkgGmmko7K+qu12N/tLgcRtMWl6Dk9ozKJ0leY8lePuvEjZpebRCjfHUUoPBVNu715Sh3Z4YA4wwOYkfQw+l/LSCpye1Xcqt1OHh4O3ovGwcdr1waHFExfLOAC7BJ9vYqNZVG/lSIjg01An+wY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597339; c=relaxed/simple; bh=Vc7NbMjHAO2dZd+RVkA7mhRl8eTmZhbc/UEd+aji6oE=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=taRaqeYszlQhGZE3TLY0u4MzWsezlGOPSQfiammPK6QtIyXvolAc4jLjlvCtYUX9SRtdaHOOuX9bnJFUUSClA0W0G+eFSP5hTzMbRGACuJWJucFR2ViHaShdRVunE8WMCfK36eswiWmnPD3esqkVJqOC8qOvhx8dAtVGNKooFqA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=HZy7M2u+; arc=none smtp.client-ip=209.85.216.43 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="HZy7M2u+" Received: by mail-pj1-f43.google.com with SMTP id 98e67ed59e1d1-3122368d7c4so1174147a91.1 for ; Fri, 30 May 2025 02:28:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597337; x=1749202137; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=tHtdmIKid/axHIhLP0cCYOHEJ6hr+IyJWWADeBRNQ/E=; b=HZy7M2u+7GpR0jjgD3VvbuWXLAZzxDhZ8MFb1aLnTzO/OdeLksVR4Sp4RcuDwi1+uV J8ZKCyEPW6zPmzQtYc0CBUyZG41mWqwWNMPcjBinHleStwkLACO2cXrXusHc5q4kwBQD wciRQn7IEeNK+1xJkLX/PTIq3NVK4yOddbKkR/kw+Hx7gbzEo0rEe6+tNOidiU1DmfkK xAw2isnYIzbo4J1ZCCK2TBSLIdiwDXTDjTE+ZN7fGPKU6yqNek+wg51KzWEdfn6xY9hh gNqWdIXaNdt5XXNDbtOGrJJe8yWtI/eqFBNTr3/MKjcJdUrn8ErZbH9hCYgQm8wQCh6t W3Nw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597337; x=1749202137; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=tHtdmIKid/axHIhLP0cCYOHEJ6hr+IyJWWADeBRNQ/E=; b=BfkObf+EoTohFqbAEmo85lZGo2x7J2VkLjP7tCOdFA4EyJs9JKqODHyqSPC7OODVq+ 

From: Bo Li
Subject: [RFC v2 02/35] RPAL: add struct rpal_service
Date: Fri, 30 May 2025 17:27:30 +0800

Each process that uses RPAL features is called an RPAL service. This
patch adds the RPAL header file rpal.h and defines the rpal_service
structure, which is allocated and freed from a dedicated kmem_cache
and whose lifetime is tracked by an atomic reference count.

Additionally, the patch introduces the rpal_get_service() and
rpal_put_service() interfaces to manage reference counts.

Signed-off-by: Bo Li
---
 arch/x86/rpal/Makefile   |  5 ++++
 arch/x86/rpal/core.c     | 32 +++++++++++++++++++++++
 arch/x86/rpal/internal.h | 13 ++++++++++
 arch/x86/rpal/service.c  | 56 ++++++++++++++++++++++++++++++++++++++++
 include/linux/rpal.h     | 43 ++++++++++++++++++++++++++++++
 5 files changed, 149 insertions(+)
 create mode 100644 arch/x86/rpal/core.c
 create mode 100644 arch/x86/rpal/internal.h
 create mode 100644 arch/x86/rpal/service.c
 create mode 100644 include/linux/rpal.h

diff --git a/arch/x86/rpal/Makefile b/arch/x86/rpal/Makefile
index e69de29bb2d1..ee3698b5a9b3 100644
--- a/arch/x86/rpal/Makefile
+++ b/arch/x86/rpal/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_RPAL) += rpal.o
+
+rpal-y := service.o core.o
diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c
new file mode 100644
index 000000000000..495dbc1b1536
--- /dev/null
+++ b/arch/x86/rpal/core.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * RPAL service level operations
+ * Copyright (c) 2025, ByteDance. All rights reserved.
+ *
+ * Author: Jiadong Sun
+ */
+
+#include
+
+#include "internal.h"
+
+int __init rpal_init(void);
+
+bool rpal_inited;
+
+int __init rpal_init(void)
+{
+        int ret = 0;
+
+        ret = rpal_service_init();
+        if (ret)
+                goto fail;
+
+        rpal_inited = true;
+        return 0;
+
+fail:
+        rpal_err("rpal init fail\n");
+        return -1;
+}
+subsys_initcall(rpal_init);
diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h
new file mode 100644
index 000000000000..e44e6fc79677
--- /dev/null
+++ b/arch/x86/rpal/internal.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * RPAL service level operations
+ * Copyright (c) 2025, ByteDance. All rights reserved.
+ *
+ * Author: Jiadong Sun
+ */
+
+extern bool rpal_inited;
+
+/* service.c */
+int __init rpal_service_init(void);
+void __init rpal_service_exit(void);
diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
new file mode 100644
index 000000000000..c8e609798d4f
--- /dev/null
+++ b/arch/x86/rpal/service.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * RPAL service level operations
+ * Copyright (c) 2025, ByteDance. All rights reserved.
+ *
+ * Author: Jiadong Sun
+ */
+
+#include
+#include
+#include
+#include
+
+#include "internal.h"
+
+static struct kmem_cache *service_cache;
+
+static void __rpal_put_service(struct rpal_service *rs)
+{
+        kmem_cache_free(service_cache, rs);
+}
+
+struct rpal_service *rpal_get_service(struct rpal_service *rs)
+{
+        if (!rs)
+                return NULL;
+        atomic_inc(&rs->refcnt);
+        return rs;
+}
+
+void rpal_put_service(struct rpal_service *rs)
+{
+        if (!rs)
+                return;
+
+        if (atomic_dec_and_test(&rs->refcnt))
+                __rpal_put_service(rs);
+}
+
+int __init rpal_service_init(void)
+{
+        service_cache = kmem_cache_create("rpal_service_cache",
+                                          sizeof(struct rpal_service), 0,
+                                          SLAB_PANIC, NULL);
+        if (!service_cache) {
+                rpal_err("service init fail\n");
+                return -1;
+        }
+
+        return 0;
+}
+
+void __init rpal_service_exit(void)
+{
+        kmem_cache_destroy(service_cache);
+}
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
new file mode 100644
index 000000000000..73468884cc5d
--- /dev/null
+++ b/include/linux/rpal.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * RPAL service level operations
+ * Copyright (c) 2025, ByteDance. All rights reserved.
+ *
+ * Author: Jiadong Sun
+ */
+
+#ifndef _LINUX_RPAL_H
+#define _LINUX_RPAL_H
+
+#include
+#include
+#include
+
+#define RPAL_ERROR_MSG "rpal error: "
+#define rpal_err(x...) pr_err(RPAL_ERROR_MSG x)
+#define rpal_err_ratelimited(x...) pr_err_ratelimited(RPAL_ERROR_MSG x)
+
+struct rpal_service {
+        /* reference count of this struct */
+        atomic_t refcnt;
+};
+
+/**
+ * @brief get new reference to a rpal service, a corresponding
+ * rpal_put_service() should be called later by the caller.
+ *
+ * @param rs The struct rpal_service to get.
+ *
+ * @return new reference of struct rpal_service.
+ */
+struct rpal_service *rpal_get_service(struct rpal_service *rs);
+
+/**
+ * @brief put a reference to a rpal service. If the reference count of
+ * the service drops to 0, then release its struct rpal_service.
+ * rpal_put_service() may be used in an atomic context.
+ *
+ * @param rs The struct rpal_service to put.
+ */
+void rpal_put_service(struct rpal_service *rs);
+#endif
-- 
2.20.1
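
[Editor's note, not part of the posted patch] The header above documents the usual kernel get/put contract: rpal_get_service() is NULL-safe and returns its argument with an extra reference taken, and every successful get must eventually be matched by rpal_put_service(), whose final put releases the structure. A minimal usage sketch follows; the caller function and the candidate pointer are hypothetical, only the two helpers come from this patch.

static void rpal_example_user(struct rpal_service *candidate)
{
        /* take a reference; returns NULL if candidate is NULL */
        struct rpal_service *rs = rpal_get_service(candidate);

        if (!rs)
                return;

        /* ... use rs while the reference is held ... */

        /* drop the reference; the final put releases the structure */
        rpal_put_service(rs);
}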

From: Bo Li
Subject: [RFC v2 03/35] RPAL: add service registration interface
Date: Fri, 30 May 2025 17:27:31 +0800

Every RPAL service should be registered and managed.

Each RPAL service has a 64-bit key as its unique identifier; the key
should never repeat before a kernel reboot. Each RPAL service also has
an ID that selects which 512GB slice of the virtual address space it
may use. Every live RPAL service holds a unique ID, which is never
reused until the service dies.

This patch adds a registration interface for RPAL services. Newly
registered rpal_service instances are assigned keys that increment
starting from 1; the 64-bit key space is practically impossible to
exhaust before a reboot (even at one registration per nanosecond,
wrapping a 64-bit counter would take roughly 584 years). A bitmap is
used to allocate IDs, ensuring no duplicate IDs are handed out. RPAL
services are tracked in a hash table, which allows quick lookup of the
rpal_service that owns a given key.

Signed-off-by: Bo Li
---
 arch/x86/rpal/service.c | 130 ++++++++++++++++++++++++++++++++++++++++
 include/linux/rpal.h    |  31 ++++++++++
 2 files changed, 161 insertions(+)

diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index c8e609798d4f..609c9550540d 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -13,13 +13,56 @@

 #include "internal.h"

+static DECLARE_BITMAP(rpal_id_bitmap, RPAL_NR_ID);
+static atomic64_t service_key_counter;
+static DEFINE_HASHTABLE(service_hash_table, ilog2(RPAL_NR_ID));
+DEFINE_SPINLOCK(hash_table_lock);
 static struct kmem_cache *service_cache;

+static inline void rpal_free_service_id(int id)
+{
+        clear_bit(id, rpal_id_bitmap);
+}
+
 static void __rpal_put_service(struct rpal_service *rs)
 {
         kmem_cache_free(service_cache, rs);
 }

+static int rpal_alloc_service_id(void)
+{
+        int id;
+
+        do {
+                id = find_first_zero_bit(rpal_id_bitmap, RPAL_NR_ID);
+                if (id == RPAL_NR_ID) {
+                        id = RPAL_INVALID_ID;
+                        break;
+                }
+        } while (test_and_set_bit(id, rpal_id_bitmap));
+
+        return id;
+}
+
+static bool is_valid_id(int id)
+{
+        return id >= 0 && id < RPAL_NR_ID;
+}
+
+static u64 rpal_alloc_service_key(void)
+{
+        u64 key;
+
+        /* confirm we do not run out of keys */
+        if (unlikely(atomic64_read(&service_key_counter) == _AC(-1, UL))) {
+                rpal_err("key is exhausted\n");
+                return RPAL_INVALID_KEY;
+        }
+
+        key = atomic64_fetch_inc(&service_key_counter);
+        return key;
+}
+
 struct rpal_service *rpal_get_service(struct rpal_service *rs)
 {
         if (!rs)
@@ -37,6 +80,90 @@ void rpal_put_service(struct rpal_service *rs)
                 __rpal_put_service(rs);
 }

+static u32 get_hash_key(u64 key)
+{
+        return key % RPAL_NR_ID;
+}
+
+struct rpal_service *rpal_get_service_by_key(u64 key)
+{
+        struct rpal_service *rs, *rsp;
+        u32 hash_key = get_hash_key(key);
+
+        rs = NULL;
+        hash_for_each_possible(service_hash_table, rsp, hlist, hash_key) {
+                if (rsp->key == key) {
+                        rs = rsp;
+                        break;
+                }
+        }
+        return rpal_get_service(rs);
+}
+
+static void insert_service(struct rpal_service *rs)
+{
+        unsigned long flags;
+        int hash_key;
+
+        hash_key = get_hash_key(rs->key);
+
+        spin_lock_irqsave(&hash_table_lock, flags);
+        hash_add(service_hash_table, &rs->hlist, hash_key);
+        spin_unlock_irqrestore(&hash_table_lock, flags);
+}
+
+static void delete_service(struct rpal_service *rs)
+{
+        unsigned long flags;
+
+        spin_lock_irqsave(&hash_table_lock, flags);
+        hash_del(&rs->hlist);
+        spin_unlock_irqrestore(&hash_table_lock, flags);
+}
+
+struct rpal_service *rpal_register_service(void)
+{
+        struct rpal_service *rs;
+
+        if (!rpal_inited)
+                return NULL;

+        rs = kmem_cache_zalloc(service_cache, GFP_KERNEL);
+        if (!rs)
+                goto alloc_fail;
+
+        rs->id = rpal_alloc_service_id();
+        if (!is_valid_id(rs->id))
+                goto id_fail;
+
+        rs->key = rpal_alloc_service_key();
+        if (unlikely(rs->key == RPAL_INVALID_KEY))
+                goto key_fail;
+
+        atomic_set(&rs->refcnt, 1);
+
+        insert_service(rs);
+
+        return rs;
+
+key_fail:
+        rpal_free_service_id(rs->id);
+id_fail:
+        kmem_cache_free(service_cache, rs);
+alloc_fail:
+        return NULL;
+}
+
+void rpal_unregister_service(struct rpal_service *rs)
+{
+        if (!rs)
+                return;
+
+        delete_service(rs);
+
+        rpal_put_service(rs);
+}
+
 int __init rpal_service_init(void)
 {
         service_cache = kmem_cache_create("rpal_service_cache",
@@ -47,6 +174,9 @@ int __init rpal_service_init(void)
                 return -1;
         }

+        bitmap_zero(rpal_id_bitmap, RPAL_NR_ID);
+        atomic64_set(&service_key_counter, RPAL_FIRST_KEY);
+
         return 0;
 }

diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 73468884cc5d..75c5acf33844 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -11,13 +11,40 @@

 #include
 #include
+#include
 #include

 #define RPAL_ERROR_MSG "rpal error: "
 #define rpal_err(x...) pr_err(RPAL_ERROR_MSG x)
 #define rpal_err_ratelimited(x...) pr_err_ratelimited(RPAL_ERROR_MSG x)

+/*
+ * The first 512GB is reserved due to mmap_min_addr.
+ * The last 512GB is dropped since stack will be initially
+ * allocated at TASK_SIZE_MAX.
+ */
+#define RPAL_NR_ID 254
+#define RPAL_INVALID_ID -1
+#define RPAL_FIRST_KEY _AC(1, UL)
+#define RPAL_INVALID_KEY _AC(0, UL)
+
+/*
+ * Each RPAL service has a 64-bit key as its unique identifier, and
+ * the 64-bit length ensures that the key will never repeat before
+ * the kernel reboot.
+ *
+ * Each RPAL service has an ID to indicate which 512GB virtual address
+ * space it can use. All alive RPAL processes have unique IDs, ensuring
+ * their address spaces do not overlap. When a process exits, its ID
+ * is released, allowing newly started RPAL services to reuse the ID.
+ */
 struct rpal_service {
+        /* Unique identifier for RPAL service */
+        u64 key;
+        /* virtual address space id */
+        int id;
+        /* Hashtable list for this struct */
+        struct hlist_node hlist;
         /* reference count of this struct */
         atomic_t refcnt;
 };
@@ -40,4 +67,8 @@ struct rpal_service *rpal_get_service(struct rpal_service *rs);
 * @param rs The struct rpal_service to put.
 */
 void rpal_put_service(struct rpal_service *rs);
+
+void rpal_unregister_service(struct rpal_service *rs);
+struct rpal_service *rpal_register_service(void);
+struct rpal_service *rpal_get_service_by_key(u64 key);
 #endif
-- 
2.20.1
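
[Editor's note, not part of the posted patch] rpal_get_service_by_key() already takes a reference on the entry it finds (it returns through rpal_get_service()), so a successful lookup must be paired with rpal_put_service(). A minimal sketch with a hypothetical caller:

static void rpal_example_lookup(u64 key)
{
        struct rpal_service *rs = rpal_get_service_by_key(key);

        if (!rs)
                return; /* no live service is registered under this key */

        pr_info("rpal: key %llu maps to service id %d\n", rs->key, rs->id);

        /* drop the reference taken by the lookup */
        rpal_put_service(rs);
}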

From: Bo Li
Subject: [RFC v2 04/35] RPAL: add member to task_struct and mm_struct
Date: Fri, 30 May 2025 17:27:32 +0800

Lazy switch and memory-related operations need a fast way to find the
rpal_service that a task or an address space belongs to. This patch
therefore adds an rpal_service pointer to both task_struct and
mm_struct, together with the corresponding initialization, and
augments rpal_service with references to the group leader's
task_struct and mm_struct. For threads created via fork, the kernel
acquires a reference to the rpal_service and assigns it to the new
task_struct; the reference is released again when the thread exits.

Regarding deallocation of struct rpal_service: since rpal_put_service()
may be called in atomic context (where mmdrop() cannot be invoked),
this patch frees the structure from delayed work.
The work delay is set to 30 seconds, which ensures that IDs are not
recycled and reused in the short term and prevents other processes
from confusing a reallocated ID with its previous owner due to races.

Signed-off-by: Bo Li
---
 arch/x86/rpal/service.c  | 77 +++++++++++++++++++++++++++++++++++++---
 fs/exec.c                | 11 ++++++
 include/linux/mm_types.h |  3 ++
 include/linux/rpal.h     | 29 +++++++++++++++
 include/linux/sched.h    |  5 +++
 init/init_task.c         |  3 ++
 kernel/exit.c            |  5 +++
 kernel/fork.c            | 16 +++++++++
 8 files changed, 145 insertions(+), 4 deletions(-)

diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c
index 609c9550540d..55ecb7e0ef8c 100644
--- a/arch/x86/rpal/service.c
+++ b/arch/x86/rpal/service.c
@@ -26,9 +26,24 @@ static inline void rpal_free_service_id(int id)

 static void __rpal_put_service(struct rpal_service *rs)
 {
+        pr_debug("rpal: free service %d, tgid: %d\n", rs->id,
+                 rs->group_leader->pid);
+
+        rs->mm->rpal_rs = NULL;
+        mmdrop(rs->mm);
+        put_task_struct(rs->group_leader);
+        rpal_free_service_id(rs->id);
         kmem_cache_free(service_cache, rs);
 }

+static void rpal_put_service_async_fn(struct work_struct *work)
+{
+        struct rpal_service *rs =
+                container_of(work, struct rpal_service, delayed_put_work.work);
+
+        __rpal_put_service(rs);
+}
+
 static int rpal_alloc_service_id(void)
 {
         int id;
@@ -75,9 +90,16 @@ void rpal_put_service(struct rpal_service *rs)
 {
         if (!rs)
                 return;
-
-        if (atomic_dec_and_test(&rs->refcnt))
-                __rpal_put_service(rs);
+        /*
+         * Since __rpal_put_service() calls mmdrop() (which
+         * cannot be invoked in atomic context), we use
+         * delayed work to release rpal_service.
+         */
+        if (atomic_dec_and_test(&rs->refcnt)) {
+                INIT_DELAYED_WORK(&rs->delayed_put_work,
+                                  rpal_put_service_async_fn);
+                schedule_delayed_work(&rs->delayed_put_work, HZ * 30);
+        }
 }

 static u32 get_hash_key(u64 key)
@@ -128,6 +150,12 @@ struct rpal_service *rpal_register_service(void)
         if (!rpal_inited)
                 return NULL;

+        if (!thread_group_leader(current)) {
+                rpal_err("task %d is not group leader %d\n", current->pid,
+                         current->tgid);
+                goto alloc_fail;
+        }
+
         rs = kmem_cache_zalloc(service_cache, GFP_KERNEL);
         if (!rs)
                 goto alloc_fail;
@@ -140,10 +168,27 @@ struct rpal_service *rpal_register_service(void)
         if (unlikely(rs->key == RPAL_INVALID_KEY))
                 goto key_fail;

-        atomic_set(&rs->refcnt, 1);
+        current->rpal_rs = rs;
+
+        rs->group_leader = get_task_struct(current);
+        mmgrab(current->mm);
+        current->mm->rpal_rs = rs;
+        rs->mm = current->mm;
+
+        /*
+         * The reference comes from:
+         * 1. registered service always has one reference
+         * 2. leader_thread also has one reference
+         * 3. mm also holds one reference
+         */
+        atomic_set(&rs->refcnt, 3);

         insert_service(rs);

+        pr_debug(
+                "rpal: register service, key: %llx, id: %d, command: %s, tgid: %d\n",
+                rs->key, rs->id, current->comm, current->tgid);
+
         return rs;

 key_fail:
@@ -161,7 +206,31 @@ void rpal_unregister_service(struct rpal_service *rs)

         delete_service(rs);

+        pr_debug("rpal: unregister service, id: %d, tgid: %d\n", rs->id,
+                 rs->group_leader->tgid);
+
+        rpal_put_service(rs);
+}
+
+void copy_rpal(struct task_struct *p)
+{
+        struct rpal_service *cur = rpal_current_service();
+
+        p->rpal_rs = rpal_get_service(cur);
+}
+
+void exit_rpal(bool group_dead)
+{
+        struct rpal_service *rs = rpal_current_service();
+
+        if (!rs)
+                return;
+
+        current->rpal_rs = NULL;
         rpal_put_service(rs);
+
+        if (group_dead)
+                rpal_unregister_service(rs);
 }

 int __init rpal_service_init(void)
diff --git a/fs/exec.c b/fs/exec.c
index cfbb2b9ee3c9..922728aebebe 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -68,6 +68,7 @@
 #include
 #include
 #include
+#include

 #include
 #include
@@ -1076,6 +1077,16 @@ static int de_thread(struct task_struct *tsk)
         /* we have changed execution domain */
         tsk->exit_signal = SIGCHLD;

+#if IS_ENABLED(CONFIG_RPAL)
+        /*
+         * The rpal process is going to load another binary, so we
+         * need to unregister rpal since it is going to become another
+         * process. Other threads have already exited by the time
+         * we get here, so we set group_dead to true.
+         */
+        exit_rpal(true);
+#endif
+
         BUG_ON(!thread_group_leader(tsk));
         return 0;

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 32ba5126e221..b29adef082c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1172,6 +1172,9 @@ struct mm_struct {
 #ifdef CONFIG_MM_ID
         mm_id_t mm_id;
 #endif /* CONFIG_MM_ID */
+#ifdef CONFIG_RPAL
+        struct rpal_service *rpal_rs;
+#endif
 } __randomize_layout;

 /*
diff --git a/include/linux/rpal.h b/include/linux/rpal.h
index 75c5acf33844..7b9d90b62b3f 100644
--- a/include/linux/rpal.h
+++ b/include/linux/rpal.h
@@ -11,6 +11,8 @@

 #include
 #include
+#include
+#include
 #include
 #include

@@ -29,6 +31,9 @@
 #define RPAL_INVALID_KEY _AC(0, UL)

 /*
+ * Each RPAL process (a.k.a. RPAL service) should have a pointer to
+ * struct rpal_service in all its tasks' task_struct.
+ *
  * Each RPAL service has a 64-bit key as its unique identifier, and
  * the 64-bit length ensures that the key will never repeat before
  * the kernel reboot.
@@ -39,10 +44,23 @@
  * is released, allowing newly started RPAL services to reuse the ID.
  */
 struct rpal_service {
+        /* The task_struct of thread group leader. */
+        struct task_struct *group_leader;
+        /* mm_struct of thread group */
+        struct mm_struct *mm;
         /* Unique identifier for RPAL service */
         u64 key;
         /* virtual address space id */
         int id;
+
+        /*
+         * Fields above should never change after initialization.
+         * Fields below may change after initialization.
+         */
+
+        /* delayed service put work */
+        struct delayed_work delayed_put_work;
+
         /* Hashtable list for this struct */
         struct hlist_node hlist;
         /* reference count of this struct */
@@ -68,7 +86,18 @@ struct rpal_service *rpal_get_service(struct rpal_service *rs);
 */
 void rpal_put_service(struct rpal_service *rs);

+#ifdef CONFIG_RPAL
+static inline struct rpal_service *rpal_current_service(void)
+{
+        return current->rpal_rs;
+}
+#else
+static inline struct rpal_service *rpal_current_service(void) { return NULL; }
+#endif
+
 void rpal_unregister_service(struct rpal_service *rs);
 struct rpal_service *rpal_register_service(void);
 struct rpal_service *rpal_get_service_by_key(u64 key);
+void copy_rpal(struct task_struct *p);
+void exit_rpal(bool group_dead);
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 45e5953b8f32..ad35b197543c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -72,6 +72,7 @@ struct rcu_node;
 struct reclaim_state;
 struct robust_list_head;
 struct root_domain;
+struct rpal_service;
 struct rq;
 struct sched_attr;
 struct sched_dl_entity;
@@ -1645,6 +1646,10 @@ struct task_struct {
         struct user_event_mm *user_event_mm;
 #endif

+#ifdef CONFIG_RPAL
+        struct rpal_service *rpal_rs;
+#endif
+
         /* CPU-specific state of this task: */
         struct thread_struct thread;

diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..0c5b1927da41 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -220,6 +220,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 #ifdef CONFIG_SECCOMP_FILTER
         .seccomp = { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_RPAL
+        .rpal_rs = NULL,
+#endif
 };
 EXPORT_SYMBOL(init_task);

diff --git a/kernel/exit.c b/kernel/exit.c
index 38645039dd8f..0c8387da59da 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -70,6 +70,7 @@
 #include
 #include
 #include
+#include

 #include

@@ -944,6 +945,10 @@ void __noreturn do_exit(long code)
         taskstats_exit(tsk, group_dead);
         trace_sched_process_exit(tsk, group_dead);

+#if IS_ENABLED(CONFIG_RPAL)
+        exit_rpal(group_dead);
+#endif
+
         exit_mm();

         if (group_dead)
diff --git a/kernel/fork.c b/kernel/fork.c
index 85afccfdf3b1..1d1c8484a8f2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
 #include
 #include
 #include
+#include

 #include
 #include
@@ -1216,6 +1217,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
         tsk->mm_cid_active = 0;
         tsk->migrate_from_cpu = -1;
 #endif
+
+#ifdef CONFIG_RPAL
+        tsk->rpal_rs = NULL;
+#endif
         return tsk;

 free_stack:
@@ -1312,6 +1317,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 #endif
         mm_init_uprobes_state(mm);
         hugetlb_count_init(mm);
+#ifdef CONFIG_RPAL
+        mm->rpal_rs = NULL;
+#endif

         if (current->mm) {
                 mm->flags = mmf_init_flags(current->mm->flags);
@@ -2651,6 +2659,14 @@ __latent_entropy struct task_struct *copy_process(
         current->signal->nr_threads++;
         current->signal->quick_threads++;
         atomic_inc(&current->signal->live);
+#if IS_ENABLED(CONFIG_RPAL)
+        /*
+         * For an rpal process, the child thread needs to
+         * inherit p->rpal_rs. Therefore, we can get the
+         * struct rpal_service for any thread of the rpal process.
+         */
+        copy_rpal(p);
+#endif
         refcount_inc(&current->signal->sigcnt);
         task_join_group_stop(p);
         list_add_tail_rcu(&p->thread_node,
-- 
2.20.1
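
[Editor's note, not part of the posted patch] After this patch a freshly registered service starts with a reference count of 3 (the registration itself, the group leader's task_struct, and the mm); each extra thread created by fork takes one more reference through copy_rpal(), exit_rpal() drops them again, and the final put defers the actual release to delayed work so that mmdrop() never runs in atomic context. The sketch below shows how other kernel code is expected to use the new rpal_current_service() helper, which returns current->rpal_rs without taking a reference; the caller function is hypothetical.

static void rpal_example_current(void)
{
        /* take our own reference; NULL if current is not an RPAL task */
        struct rpal_service *rs = rpal_get_service(rpal_current_service());

        if (!rs)
                return;

        pr_info("rpal: current task belongs to service id %d\n", rs->id);

        rpal_put_service(rs);
}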

From: Bo Li
Subject: [RFC v2 05/35] RPAL: enable virtual address space partitions
Date: Fri, 30 May 2025 17:27:33 +0800

Each RPAL service occupies a contiguous 512GB virtual address space
whose base address is determined by the ID assigned during
initialization. The userspace address range outside this 512GB window
is occupied by memory ballooning, which prevents the process from
using those virtual addresses. Since the address space layout is fixed
when the process is loaded, RPAL stores the characters "RPAL" in
otherwise unused fields of the ELF header; the loader recognizes this
marker and changes how the binary is loaded so that the process lands
inside its own 512GB address space.

Signed-off-by: Bo Li
---
 arch/x86/mm/mmap.c      | 10 +++++
 arch/x86/rpal/Makefile  |  2 +-
 arch/x86/rpal/mm.c      | 70 +++++++++++++++++++++++++++++
 arch/x86/rpal/service.c |  8 ++++
 fs/binfmt_elf.c         | 98 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/rpal.h    | 65 +++++++++++++++++++++++++++
 6 files changed, 251 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/rpal/mm.c

diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 5ed2109211da..504f2b9a0e8e 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include
 #include
 #include

@@ -119,6 +120,15 @@ static void arch_pick_mmap_base(unsigned long *base, unsigned long *legacy_base,
         *base = mmap_base(random_factor, task_size, rlim_stack);
 }

+#ifdef CONFIG_RPAL
+void rpal_pick_mmap_base(struct mm_struct *mm, struct rlimit *rlim_stack)
+{
+        arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
+                            arch_rnd(RPAL_MAX_RAND_BITS),
+                            rpal_get_top(mm->rpal_rs), rlim_stack);
+}
+#endif
+
 void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
 {
         if (mmap_is_legacy())
diff --git a/arch/x86/rpal/Makefile b/arch/x86/rpal/Makefile
index ee3698b5a9b3..2c858a8d7b9e 100644
--- a/arch/x86/rpal/Makefile
+++ b/arch/x86/rpal/Makefile
@@ -2,4 +2,4 @@

 obj-$(CONFIG_RPAL) += rpal.o

-rpal-y := service.o core.o
+rpal-y := service.o core.o mm.o
diff --git a/arch/x86/rpal/mm.c b/arch/x86/rpal/mm.c
new file mode 100644
index 000000000000..f469bcf57b66
--- /dev/null
+++ b/arch/x86/rpal/mm.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * RPAL service level operations
+ * Copyright (c) 2025, ByteDance. All rights reserved.
+ *
+ * Author: Jiadong Sun
+ */
+
+#include
+#include
+#include
+#include
+
+static inline int rpal_balloon_mapping(unsigned long base, unsigned long size)
+{
+        struct vm_area_struct *vma;
+        unsigned long addr, populate;
+        int is_fail = 0;
+
+        if (size == 0)
+                return 0;
+
+        addr = do_mmap(NULL, base, size, PROT_NONE,
+                       MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE,
+                       VM_DONTEXPAND | VM_PFNMAP | VM_DONTDUMP, 0, &populate,
+                       NULL);
+
+        is_fail = base != addr;
+
+        if (is_fail) {
+                pr_info("rpal: Balloon mapping 0x%016lx - 0x%016lx, %s, addr: 0x%016lx\n",
+                        base, base + size, is_fail ? "Fail" : "Success", addr);
"Fail" : "Success", addr); + } + vma =3D find_vma(current->mm, addr); + if (vma->vm_start !=3D addr || vma->vm_end !=3D addr + size) { + is_fail =3D 1; + rpal_err("rpal: find vma 0x%016lx - 0x%016lx fail\n", addr, + addr + size); + } + + return is_fail; +} + +#define RPAL_USER_TOP TASK_SIZE + +int rpal_balloon_init(unsigned long base) +{ + unsigned long top; + struct mm_struct *mm =3D current->mm; + int ret; + + top =3D base + RPAL_ADDR_SPACE_SIZE; + + mmap_write_lock(mm); + + if (base > mmap_min_addr) { + ret =3D rpal_balloon_mapping(mmap_min_addr, base - mmap_min_addr); + if (ret) + goto out; + } + + ret =3D rpal_balloon_mapping(top, RPAL_USER_TOP - top); + if (ret && base > mmap_min_addr) + do_munmap(mm, mmap_min_addr, base - mmap_min_addr, NULL); + +out: + mmap_write_unlock(mm); + + return ret; +} diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c index 55ecb7e0ef8c..caa4afa5a2c6 100644 --- a/arch/x86/rpal/service.c +++ b/arch/x86/rpal/service.c @@ -143,6 +143,11 @@ static void delete_service(struct rpal_service *rs) spin_unlock_irqrestore(&hash_table_lock, flags); } =20 +static inline unsigned long calculate_base_address(int id) +{ + return RPAL_ADDRESS_SPACE_LOW + RPAL_ADDR_SPACE_SIZE * id; +} + struct rpal_service *rpal_register_service(void) { struct rpal_service *rs; @@ -168,6 +173,9 @@ struct rpal_service *rpal_register_service(void) if (unlikely(rs->key =3D=3D RPAL_INVALID_KEY)) goto key_fail; =20 + rs->bad_service =3D false; + rs->base =3D calculate_base_address(rs->id); + current->rpal_rs =3D rs; =20 rs->group_leader =3D get_task_struct(current); diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index a43363d593e5..9d27d9922de4 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -47,6 +47,7 @@ #include #include #include +#include #include #include =20 @@ -814,6 +815,61 @@ static int parse_elf_properties(struct file *f, const = struct elf_phdr *phdr, return ret =3D=3D -ENOENT ? 0 : ret; } =20 +#if IS_ENABLED(CONFIG_RPAL) +static int rpal_create_service(char *e_ident, struct rpal_service **rs, + unsigned long *rpal_base, int *retval, + struct linux_binprm *bprm, int executable_stack) +{ + /* + * The first 16 bytes of the elf binary is magic number, and the last + * 7 bytes of that is reserved and ignored. We use the last 4 bytes + * to indicate a rpal binary. If the last 4 bytes is "RPAL", then this + * is a rpal binary and we need to do register routinue. + */ + if (memcmp(e_ident + RPAL_MAGIC_OFFSET, RPAL_MAGIC, RPAL_MAGIC_LEN) =3D= =3D + 0) { + unsigned long rpal_stack_top =3D STACK_TOP; + + *rs =3D rpal_register_service(); + if (*rs !=3D NULL) { + *rpal_base =3D rpal_get_base(*rs); + rpal_stack_top =3D *rpal_base + RPAL_ADDR_SPACE_SIZE; + /* + * We need to recalculate the mmap_base, otherwise the address space + * layout randomization will not make any difference. + */ + rpal_pick_mmap_base(current->mm, &bprm->rlim_stack); + } + /* + * RPAL process only has a contiguous 512GB address space, Whose base + * address is given by its struct rpal_service. We need to rearrange + * the user stack in this 512GB address space. + */ + *retval =3D setup_arg_pages(bprm, + randomize_stack_top(rpal_stack_top), + executable_stack); + /* + * We use memory ballon to avoid kernel allocating vma other than + * the process's 512GB memory. 
+ */ + if (unlikely(*rs !=3D NULL && rpal_balloon_init(*rpal_base))) { + rpal_err("pid: %d, comm: %s: rpal balloon init fail\n", + current->pid, current->comm); + rpal_unregister_service(*rs); + *rs =3D NULL; + *retval =3D -EINVAL; + goto out; + } + } else { + *retval =3D setup_arg_pages(bprm, randomize_stack_top(STACK_TOP), + executable_stack); + } + +out: + return 0; +} +#endif + static int load_elf_binary(struct linux_binprm *bprm) { struct file *interpreter =3D NULL; /* to shut gcc up */ @@ -836,6 +892,10 @@ static int load_elf_binary(struct linux_binprm *bprm) struct arch_elf_state arch_state =3D INIT_ARCH_ELF_STATE; struct mm_struct *mm; struct pt_regs *regs; +#ifdef CONFIG_RPAL + struct rpal_service *rs =3D NULL; + unsigned long rpal_base; +#endif =20 retval =3D -ENOEXEC; /* First of all, some simple consistency checks */ @@ -1008,10 +1068,19 @@ static int load_elf_binary(struct linux_binprm *bpr= m) =20 setup_new_exec(bprm); =20 +#ifdef CONFIG_RPAL + /* call original function if fails */ + if (rpal_create_service((char *)&elf_ex->e_ident, &rs, &rpal_base, + &retval, bprm, executable_stack)) + retval =3D setup_arg_pages(bprm, randomize_stack_top(STACK_TOP), + executable_stack); +#else /* Do this so that we can load the interpreter, if need be. We will change some of these later */ retval =3D setup_arg_pages(bprm, randomize_stack_top(STACK_TOP), executable_stack); +#endif + if (retval < 0) goto out_free_dentry; =20 @@ -1055,6 +1124,22 @@ static int load_elf_binary(struct linux_binprm *bprm) * is needed. */ elf_flags |=3D MAP_FIXED_NOREPLACE; +#ifdef CONFIG_RPAL + /* + * If We load MAP_FIXED binary, it will either fail when + * doing mmap, as we have done the memory balloon before, + * or work well, where we are so lucky to have fixed address + * in it's RPAL address space. A MAP_FIXED binary should + * by no means be a RPAL service. Here we only print + * an error. Maybe we will handle it in the future. + */ + if (unlikely(rs !=3D NULL)) { + rpal_err( + "pid: %d, common: %s, load a binary with MAP_FIXED segment\n", + current->pid, current->comm); + rs->bad_service =3D true; + } +#endif } else if (elf_ex->e_type =3D=3D ET_DYN) { /* * This logic is run once for the first LOAD Program @@ -1128,6 +1213,12 @@ static int load_elf_binary(struct linux_binprm *bprm) /* Adjust alignment as requested. */ if (alignment) load_bias &=3D ~(alignment - 1); +#ifdef CONFIG_RPAL + if (rs !=3D NULL) { + load_bias &=3D RPAL_RAND_ADDR_SPACE_MASK; + load_bias +=3D rpal_base; + } +#endif elf_flags |=3D MAP_FIXED_NOREPLACE; } else { /* @@ -1306,7 +1397,12 @@ static int load_elf_binary(struct linux_binprm *bprm) if (!IS_ENABLED(CONFIG_COMPAT_BRK) && IS_ENABLED(CONFIG_ARCH_HAS_ELF_RANDOMIZE) && elf_ex->e_type =3D=3D ET_DYN && !interpreter) { - elf_brk =3D ELF_ET_DYN_BASE; +#ifdef CONFIG_RPAL + if (rs && !rs->bad_service) + elf_brk =3D rpal_base; + else +#endif + elf_brk =3D ELF_ET_DYN_BASE; /* This counts as moving the brk, so let brk(2) know. */ brk_moved =3D true; } diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 7b9d90b62b3f..f7c0de747f55 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -15,11 +15,17 @@ #include #include #include +#include =20 #define RPAL_ERROR_MSG "rpal error: " #define rpal_err(x...) pr_err(RPAL_ERROR_MSG x) #define rpal_err_ratelimited(x...) 
pr_err_ratelimited(RPAL_ERROR_MSG x) =20 +/* RPAL magic macros in binary elf header */ +#define RPAL_MAGIC "RPAL" +#define RPAL_MAGIC_OFFSET 12 +#define RPAL_MAGIC_LEN 4 + /* * The first 512GB is reserved due to mmap_min_addr. * The last 512GB is dropped since stack will be initially @@ -30,6 +36,47 @@ #define RPAL_FIRST_KEY _AC(1, UL) #define RPAL_INVALID_KEY _AC(0, UL) =20 +/* + * Process Virtual Address Space Layout (For 4-level Paging) + * |-------------| + * | No Mapping | + * |-------------| <-- 64 KB (mmap_min_addr) + * | ... | + * |-------------| <-- 1 * 512GB + * | service 0 | + * |-------------| <-- 2 * 512 GB + * | Service 1 | + * |-------------| <-- 3 * 512 GB + * | Service 2 | + * |-------------| <-- 4 * 512 GB + * | ... | + * |-------------| <-- 255 * 512 GB + * | Service 254 | + * |-------------| <-- 128 TB + * | | + * | ... | + * |-------------| <-- PAGE_OFFSET + * | | + * | Kernel | + * |_____________| + * + */ +#define RPAL_ADDR_SPACE_SIZE (_AC(512, UL) * SZ_1G) +/* + * Since RPAL restricts the virtual address space used by a single + * process to 512GB, the number of bits for address randomization + * must be correspondingly reduced; otherwise, issues such as overlaps + * in randomized addresses could occur. RPAL employs 20-bit (page number) + * address randomization to balance security and usability. + */ +#define RPAL_RAND_ADDR_SPACE_MASK _AC(0xfffffff0, UL) +#define RPAL_MAX_RAND_BITS 20 + +#define RPAL_NR_ADDR_SPACE 256 + +#define RPAL_ADDRESS_SPACE_LOW ((0UL) + RPAL_ADDR_SPACE_SIZE) +#define RPAL_ADDRESS_SPACE_HIGH ((0UL) + RPAL_NR_ADDR_SPACE * RPAL_ADDR_SP= ACE_SIZE) + /* * Each RPAL process (a.k.a RPAL service) should have a pointer to * struct rpal_service in all its tasks' task_struct. @@ -52,6 +99,10 @@ struct rpal_service { u64 key; /* virtual address space id */ int id; + /* virtual address space base address of this service */ + unsigned long base; + /* bad rpal binary */ + bool bad_service; =20 /* * Fields above should never change after initialization. 
@@ -86,6 +137,16 @@ struct rpal_service *rpal_get_service(struct rpal_service *rs);
  */
 void rpal_put_service(struct rpal_service *rs);
 
+static inline unsigned long rpal_get_base(struct rpal_service *rs)
+{
+	return rs->base;
+}
+
+static inline unsigned long rpal_get_top(struct rpal_service *rs)
+{
+	return rs->base + RPAL_ADDR_SPACE_SIZE;
+}
+
 #ifdef CONFIG_RPAL
 static inline struct rpal_service *rpal_current_service(void)
 {
@@ -100,4 +161,8 @@ struct rpal_service *rpal_register_service(void);
 struct rpal_service *rpal_get_service_by_key(u64 key);
 void copy_rpal(struct task_struct *p);
 void exit_rpal(bool group_dead);
+int rpal_balloon_init(unsigned long base);
+
+extern void rpal_pick_mmap_base(struct mm_struct *mm,
+				struct rlimit *rlim_stack);
 #endif
--=20
2.20.1

From nobody Wed Feb 11 03:41:56 2026
From: Bo Li
Subject: [RFC v2 06/35] RPAL: add user interface
Date: Fri, 30 May 2025 17:27:34 +0800
X-Mailer: git-send-email 2.39.5 (Apple Git-154)
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Add the userspace interface of RPAL. The interface makes use of /proc
files. Compared with adding syscalls, /proc files provide more
interfaces, such as mmap, poll, etc.
These interfaces can facilitate RPAL to implement more complex kernel-space/user-space interaction functions in the future. This patch implements the ioctl interface, The interfaces initially implemented include obtaining the RPAL version, and retrieving the key and ID of the RPAL service. Signed-off-by: Bo Li --- arch/x86/rpal/Makefile | 2 +- arch/x86/rpal/core.c | 3 ++ arch/x86/rpal/internal.h | 3 ++ arch/x86/rpal/proc.c | 71 ++++++++++++++++++++++++++++++++++++++++ include/linux/rpal.h | 34 +++++++++++++++++++ 5 files changed, 112 insertions(+), 1 deletion(-) create mode 100644 arch/x86/rpal/proc.c diff --git a/arch/x86/rpal/Makefile b/arch/x86/rpal/Makefile index 2c858a8d7b9e..a5926fc19334 100644 --- a/arch/x86/rpal/Makefile +++ b/arch/x86/rpal/Makefile @@ -2,4 +2,4 @@ =20 obj-$(CONFIG_RPAL) +=3D rpal.o =20 -rpal-y :=3D service.o core.o mm.o +rpal-y :=3D service.o core.o mm.o proc.o diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c index 495dbc1b1536..61f5d40b0157 100644 --- a/arch/x86/rpal/core.c +++ b/arch/x86/rpal/core.c @@ -13,11 +13,14 @@ int __init rpal_init(void); =20 bool rpal_inited; +unsigned long rpal_cap; =20 int __init rpal_init(void) { int ret =3D 0; =20 + rpal_cap =3D 0; + ret =3D rpal_service_init(); if (ret) goto fail; diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h index e44e6fc79677..c102a4c50515 100644 --- a/arch/x86/rpal/internal.h +++ b/arch/x86/rpal/internal.h @@ -6,6 +6,9 @@ * Author: Jiadong Sun */ =20 +#define RPAL_COMPAT_VERSION 1 +#define RPAL_API_VERSION 1 + extern bool rpal_inited; =20 /* service.c */ diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c new file mode 100644 index 000000000000..1ced30e25c15 --- /dev/null +++ b/arch/x86/rpal/proc.c @@ -0,0 +1,71 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * RPAL service level operations + * Copyright (c) 2025, ByteDance. All rights reserved. 
+ * + * Author: Jiadong Sun + */ + +#include +#include + +#include "internal.h" + +static int rpal_open(struct inode *inode, + struct file *file) +{ + return 0; +} + +static int rpal_get_api_version_and_cap(void __user *p) +{ + struct rpal_version_info rvi; + int ret; + + rvi.compat_version =3D RPAL_COMPAT_VERSION; + rvi.api_version =3D RPAL_API_VERSION; + rvi.cap =3D rpal_cap; + + ret =3D copy_to_user(p, &rvi, sizeof(rvi)); + if (ret) + return -EFAULT; + + return 0; +} + +static long rpal_ioctl(struct file *file, unsigned int cmd, unsigned long = arg) +{ + struct rpal_service *cur =3D rpal_current_service(); + int ret =3D 0; + + if (!cur) + return -EINVAL; + + switch (cmd) { + case RPAL_IOCTL_GET_API_VERSION_AND_CAP: + ret =3D rpal_get_api_version_and_cap((void __user *)arg); + break; + case RPAL_IOCTL_GET_SERVICE_KEY: + ret =3D put_user(cur->key, (u64 __user *)arg); + break; + case RPAL_IOCTL_GET_SERVICE_ID: + ret =3D put_user(cur->id, (int __user *)arg); + break; + default: + return -EINVAL; + } + + return ret; +} + +const struct proc_ops proc_rpal_operations =3D { + .proc_open =3D rpal_open, + .proc_ioctl =3D rpal_ioctl, +}; + +static int __init proc_rpal_init(void) +{ + proc_create("rpal", 0644, NULL, &proc_rpal_operations); + return 0; +} +fs_initcall(proc_rpal_init); diff --git a/include/linux/rpal.h b/include/linux/rpal.h index f7c0de747f55..3bc2a2a44265 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -77,6 +77,8 @@ #define RPAL_ADDRESS_SPACE_LOW ((0UL) + RPAL_ADDR_SPACE_SIZE) #define RPAL_ADDRESS_SPACE_HIGH ((0UL) + RPAL_NR_ADDR_SPACE * RPAL_ADDR_SP= ACE_SIZE) =20 +extern unsigned long rpal_cap; + /* * Each RPAL process (a.k.a RPAL service) should have a pointer to * struct rpal_service in all its tasks' task_struct. @@ -118,6 +120,38 @@ struct rpal_service { atomic_t refcnt; }; =20 +/* + * Following structures should have the same memory layout with user. + * It seems nothing being different between kernel and user structure + * padding by different C compilers on x86_64, so we need to do nothing + * special here. + */ +/* Begin */ +struct rpal_version_info { + int compat_version; + int api_version; + unsigned long cap; +}; + +/* End */ + +enum rpal_command_type { + RPAL_CMD_GET_API_VERSION_AND_CAP, + RPAL_CMD_GET_SERVICE_KEY, + RPAL_CMD_GET_SERVICE_ID, + RPAL_NR_CMD, +}; + +/* RPAL ioctl macro */ +#define RPAL_IOCTL_MAGIC 0x33 +#define RPAL_IOCTL_GET_API_VERSION_AND_CAP \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_API_VERSION_AND_CAP, \ + struct rpal_version_info *) +#define RPAL_IOCTL_GET_SERVICE_KEY \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_KEY, u64 *) +#define RPAL_IOCTL_GET_SERVICE_ID \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_ID, int *) + /** * @brief get new reference to a rpal service, a corresponding * rpal_put_service() should be called later by the caller. 
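
From userspace, the new ioctls are expected to be driven roughly as in the
following minimal sketch (illustrative only: the struct and RPAL_IOCTL_*
values are mirrored here by hand since no uapi header is added yet, and
every call returns -EINVAL unless the caller is a registered RPAL service):

/* Illustrative sketch: query version/cap, key and id via /proc/rpal. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <unistd.h>

/* Hand-mirrored copies of the kernel definitions added in this patch. */
struct rpal_version_info {
	int compat_version;
	int api_version;
	unsigned long cap;
};

#define RPAL_IOCTL_MAGIC 0x33
#define RPAL_IOCTL_GET_API_VERSION_AND_CAP \
	_IOWR(RPAL_IOCTL_MAGIC, 0, struct rpal_version_info *)
#define RPAL_IOCTL_GET_SERVICE_KEY _IOWR(RPAL_IOCTL_MAGIC, 1, uint64_t *)
#define RPAL_IOCTL_GET_SERVICE_ID  _IOWR(RPAL_IOCTL_MAGIC, 2, int *)

int main(void)
{
	struct rpal_version_info rvi;
	uint64_t key;
	int id;
	int fd = open("/proc/rpal", O_RDONLY);

	if (fd < 0) {
		perror("open /proc/rpal");
		return 1;
	}

	/* Each ioctl fails with -EINVAL unless the caller is an RPAL service. */
	if (ioctl(fd, RPAL_IOCTL_GET_API_VERSION_AND_CAP, &rvi) == 0)
		printf("compat %d, api %d, cap %#lx\n",
		       rvi.compat_version, rvi.api_version, rvi.cap);
	if (ioctl(fd, RPAL_IOCTL_GET_SERVICE_KEY, &key) == 0)
		printf("service key %llu\n", (unsigned long long)key);
	if (ioctl(fd, RPAL_IOCTL_GET_SERVICE_ID, &id) == 0)
		printf("service id %d\n", id);

	close(fd);
	return 0;
}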
--=20
2.20.1

From nobody Wed Feb 11 03:41:56 2026
From: Bo Li
Subject: [RFC v2 07/35] RPAL: enable shared page mmap
Date: Fri, 30 May 2025 17:27:35 +0800
Message-Id: <11d4a94318efc8af41f77235f5117aabb8795afe.1748594840.git.libo.gcs85@bytedance.com>
X-Mailer: git-send-email 2.39.5 (Apple Git-154)
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

RPAL needs to create shared memory between the kernel and user space for
the transfer of states and data. This patch implements the rpal_mmap()
interface. User processes can create shared memory by calling mmap() on
/proc/rpal. To prevent users from creating excessive memory, rpal_mmap()
limits the total size of the shared memory that can be created. The shared
memory is maintained through reference counting, and rpal_munmap() is
implemented for the release of the shared memory.
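
From userspace, the expected flow is roughly the following minimal sketch
(illustrative only: the single-page size, flags and error handling are
assumptions, and the mapping is only honored for a process that is already
registered as an RPAL service):

/* Illustrative sketch: create one page of kernel/user shared memory. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int fd = open("/proc/rpal", O_RDWR);
	void *p;

	if (fd < 0) {
		perror("open /proc/rpal");
		return 1;
	}

	/*
	 * The length must be a whole, power-of-two number of pages;
	 * rpal_mmap() rejects anything else and caps the total size.
	 */
	p = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap /proc/rpal");
		close(fd);
		return 1;
	}

	/* ... use the shared page; a later patch registers it as a
	 * sender/receiver context ... */

	/* Unmapping drops the last reference and frees the pages (vm_ops->close). */
	munmap(p, page);
	close(fd);
	return 0;
}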
Signed-off-by: Bo Li --- arch/x86/rpal/internal.h | 20 ++++++ arch/x86/rpal/mm.c | 147 +++++++++++++++++++++++++++++++++++++++ arch/x86/rpal/proc.c | 1 + arch/x86/rpal/service.c | 4 ++ include/linux/rpal.h | 15 ++++ mm/mmap.c | 4 ++ 6 files changed, 191 insertions(+) diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h index c102a4c50515..65fd14a26f0e 100644 --- a/arch/x86/rpal/internal.h +++ b/arch/x86/rpal/internal.h @@ -9,8 +9,28 @@ #define RPAL_COMPAT_VERSION 1 #define RPAL_API_VERSION 1 =20 +#include +#include + extern bool rpal_inited; =20 /* service.c */ int __init rpal_service_init(void); void __init rpal_service_exit(void); + +/* mm.c */ +static inline struct rpal_shared_page * +rpal_get_shared_page(struct rpal_shared_page *rsp) +{ + atomic_inc(&rsp->refcnt); + return rsp; +} + +static inline void rpal_put_shared_page(struct rpal_shared_page *rsp) +{ + atomic_dec(&rsp->refcnt); +} + +int rpal_mmap(struct file *filp, struct vm_area_struct *vma); +struct rpal_shared_page *rpal_find_shared_page(struct rpal_service *rs, + unsigned long addr); diff --git a/arch/x86/rpal/mm.c b/arch/x86/rpal/mm.c index f469bcf57b66..8a738c502d1d 100644 --- a/arch/x86/rpal/mm.c +++ b/arch/x86/rpal/mm.c @@ -11,6 +11,8 @@ #include #include =20 +#include "internal.h" + static inline int rpal_balloon_mapping(unsigned long base, unsigned long s= ize) { struct vm_area_struct *vma; @@ -68,3 +70,148 @@ int rpal_balloon_init(unsigned long base) =20 return ret; } + +static void rpal_munmap(struct vm_area_struct *area) +{ + struct mm_struct *mm =3D area->vm_mm; + struct rpal_service *rs =3D mm->rpal_rs; + struct rpal_shared_page *rsp =3D area->vm_private_data; + + if (!rs) { + rpal_err( + "free shared page after exit_mmap or fork a child process\n"); + return; + } + + mutex_lock(&rs->mutex); + if (unlikely(!atomic_dec_and_test(&rsp->refcnt))) { + rpal_err("refcnt(%d) of shared page is not 0\n", atomic_read(&rsp->refcn= t)); + send_sig_info(SIGKILL, SEND_SIG_PRIV, rs->group_leader); + } + + list_del(&rsp->list); + rs->nr_shared_pages -=3D rsp->npage; + __free_pages(virt_to_page(rsp->kernel_start), get_order(rsp->npage)); + kfree(rsp); + mutex_unlock(&rs->mutex); +} + +const struct vm_operations_struct rpal_vm_ops =3D { .close =3D rpal_munmap= }; + +#define RPAL_MAX_SHARED_PAGES 8192 + +int rpal_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct rpal_shared_page *rsp; + struct page *page =3D NULL; + unsigned long size =3D (unsigned long)(vma->vm_end - vma->vm_start); + int npage; + int order =3D -1; + int ret =3D 0; + + if (!cur) { + ret =3D -EINVAL; + goto out; + } + + /* + * Check whether the vma is aligned and whether the page number + * is power of 2. This makes shared pages easy to manage. 
+ */ + if (!IS_ALIGNED(size, PAGE_SIZE) || + !IS_ALIGNED(vma->vm_start, PAGE_SIZE)) { + ret =3D -EINVAL; + goto out; + } + + npage =3D size >> PAGE_SHIFT; + if (!is_power_of_2(npage)) { + ret =3D -EINVAL; + goto out; + } + + order =3D get_order(size); + + mutex_lock(&cur->mutex); + + /* make sure user does not alloc too much pages */ + if (cur->nr_shared_pages + npage > RPAL_MAX_SHARED_PAGES) { + ret =3D -ENOMEM; + goto unlock; + } + + rsp =3D kmalloc(sizeof(*rsp), GFP_KERNEL); + if (!rsp) { + ret =3D -EAGAIN; + goto unlock; + } + + page =3D alloc_pages(GFP_KERNEL | __GFP_ZERO, order); + if (!page) { + ret =3D -ENOMEM; + goto free_rsp; + } + + rsp->user_start =3D vma->vm_start; + rsp->kernel_start =3D (unsigned long)page_address(page); + rsp->npage =3D npage; + atomic_set(&rsp->refcnt, 1); + INIT_LIST_HEAD(&rsp->list); + list_add(&rsp->list, &cur->shared_pages); + + vma->vm_ops =3D &rpal_vm_ops; + vma->vm_private_data =3D rsp; + + /* map to shared pages userspace */ + ret =3D remap_pfn_range(vma, vma->vm_start, page_to_pfn(page), size, + vma->vm_page_prot); + if (ret) + goto free_page; + + cur->nr_shared_pages +=3D npage; + mutex_unlock(&cur->mutex); + + return 0; + +free_page: + __free_pages(page, order); + list_del(&rsp->list); +free_rsp: + kfree(rsp); +unlock: + mutex_unlock(&cur->mutex); +out: + return ret; +} + +struct rpal_shared_page *rpal_find_shared_page(struct rpal_service *rs, + unsigned long addr) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct rpal_shared_page *rsp, *ret =3D NULL; + + mutex_lock(&cur->mutex); + list_for_each_entry(rsp, &rs->shared_pages, list) { + if (rsp->user_start <=3D addr && + addr < rsp->user_start + rsp->npage * PAGE_SIZE) { + ret =3D rpal_get_shared_page(rsp); + break; + } + } + mutex_unlock(&cur->mutex); + + return ret; +} + +void rpal_exit_mmap(struct mm_struct *mm) +{ + struct rpal_service *rs =3D mm->rpal_rs; + + if (rs) { + mm->rpal_rs =3D NULL; + /* all shared pages should be freed at this time */ + WARN_ON_ONCE(rs->nr_shared_pages !=3D 0); + rpal_put_service(rs); + } +} diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c index 1ced30e25c15..86947dc233d0 100644 --- a/arch/x86/rpal/proc.c +++ b/arch/x86/rpal/proc.c @@ -61,6 +61,7 @@ static long rpal_ioctl(struct file *file, unsigned int cm= d, unsigned long arg) const struct proc_ops proc_rpal_operations =3D { .proc_open =3D rpal_open, .proc_ioctl =3D rpal_ioctl, + .proc_mmap =3D rpal_mmap, }; =20 static int __init proc_rpal_init(void) diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c index caa4afa5a2c6..f29a046fc22f 100644 --- a/arch/x86/rpal/service.c +++ b/arch/x86/rpal/service.c @@ -173,6 +173,10 @@ struct rpal_service *rpal_register_service(void) if (unlikely(rs->key =3D=3D RPAL_INVALID_KEY)) goto key_fail; =20 + mutex_init(&rs->mutex); + rs->nr_shared_pages =3D 0; + INIT_LIST_HEAD(&rs->shared_pages); + rs->bad_service =3D false; rs->base =3D calculate_base_address(rs->id); =20 diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 3bc2a2a44265..986dfbd16fc9 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -110,6 +110,12 @@ struct rpal_service { * Fields above should never change after initialization. * Fields below may change after initialization. 
*/ + /* Mutex for time consuming operations */ + struct mutex mutex; + + /* pinned pages */ + int nr_shared_pages; + struct list_head shared_pages; =20 /* delayed service put work */ struct delayed_work delayed_put_work; @@ -135,6 +141,14 @@ struct rpal_version_info { =20 /* End */ =20 +struct rpal_shared_page { + unsigned long user_start; + unsigned long kernel_start; + int npage; + atomic_t refcnt; + struct list_head list; +}; + enum rpal_command_type { RPAL_CMD_GET_API_VERSION_AND_CAP, RPAL_CMD_GET_SERVICE_KEY, @@ -196,6 +210,7 @@ struct rpal_service *rpal_get_service_by_key(u64 key); void copy_rpal(struct task_struct *p); void exit_rpal(bool group_dead); int rpal_balloon_init(unsigned long base); +void rpal_exit_mmap(struct mm_struct *mm); =20 extern void rpal_pick_mmap_base(struct mm_struct *mm, struct rlimit *rlim_stack); diff --git a/mm/mmap.c b/mm/mmap.c index bd210aaf7ebd..98bb33d2091e 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -48,6 +48,7 @@ #include #include #include +#include =20 #include #include @@ -1319,6 +1320,9 @@ void exit_mmap(struct mm_struct *mm) __mt_destroy(&mm->mm_mt); mmap_write_unlock(mm); vm_unacct_memory(nr_accounted); +#if IS_ENABLED(CONFIG_RPAL) + rpal_exit_mmap(mm); +#endif } =20 /* Insert vm structure into process list sorted by address --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pj1-f47.google.com (mail-pj1-f47.google.com [209.85.216.47]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 964921E8323 for ; Fri, 30 May 2025 09:30:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.47 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597432; cv=none; b=Rpm4LmdSmwg8YfYcFS4InE2mr3pCMatHgwX+HY1cqgLs4yLcglmhRqRMx2nJlg06F69UpZXG/nikvBVsA3oHbXEiFwhb+arcxThyLLTEuWiXrulivIa+tGD5kb2uoUGr8nq6XcR/uDDBJeCG9gECgIQpj3sDbxZaj5Vl8VM0+/c= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597432; c=relaxed/simple; bh=DFUGROHVCTniD49Ozv49pSZ2hICpYueXEA09N4l6CVc=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Bv94HmpRQ4PW4Zacn5VNzLXGX7tz56Ek0CqKjvX4BOaxlFoNadD+gbWofi56iExrMoqEWp3xJiL+sStjrS6dGNJG+7JRMRmvO82jJn2vZNdpLsJT4SgHGKLtnCa2ZR9Z19YesaCa29vHeuCjH4oKn4Ep8Dw1T031XW1z6HrnCzs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=ihzlDemt; arc=none smtp.client-ip=209.85.216.47 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="ihzlDemt" Received: by mail-pj1-f47.google.com with SMTP id 98e67ed59e1d1-30e5430ed0bso1620769a91.3 for ; Fri, 30 May 2025 02:30:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597429; x=1749202229; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=It0nagCB2dMHz+A5VnrcLdhvjSzG9DfFTU//7hpwVEI=; 
b=ihzlDemthbTov5PXhPgH3PlMpPQsHOe4zKAOJCRsWUntI5rjURuIxsST5IrMwsmZXr cMDEeCqiJd11fkseRgWWyviZqSJkK90CiEOQCi+63DZPZGAwHHbVuNqC/asOl1MrlB+U B9XEHrLdr001jKPVuSAEHq2+C6XfUM/p8jWCMM4aE2JCGwafNiP5n7tD2YKVlCgT72A2 +YQ9G7z8HD8tX/VLki2z6cgf2nB9c3s2CKVZG0bq3TQqvrfq6HxAW6Vi6S1ieVoG/m0h jJyF82BmP7J2fnlAb4gZu5bAYyrwIZ9M2HKGl+A/8fnY7L3tu3ITXC0hQjFrT7os80aw jvlQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597429; x=1749202229; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=It0nagCB2dMHz+A5VnrcLdhvjSzG9DfFTU//7hpwVEI=; b=MkweBph4W5yhelBknbj1Bs6S/htHdx8UvwCYPVz0u4X5k0OR6fG4P0EJmVsrIvnXFX +FDQD8LOvm4BcGD6G57XKxMAKvXUn7XOxoC/nb7wOf6hu0V7RFo7eUEncPv+ihHMBIUR FTkXmR5URGVAMnFI7Mi5Acp6ZPcSW2u9SYn6r6yec5KKFqDc6XsYTP8NklH3pNJiMSir 8kVRenPmjPB1vMvFl4rZheAe8kwPCKzxQcIWb/vUe25H6lkUIECZTkiwO2cUVQJgBi7Q rSxG8X9TB4Q2ApmwJcJdBwtNEsHQ0v/cvI0dJi1eBW4IIZuAJyFEZlY4Gbxj6XePrFzM 7reQ== X-Forwarded-Encrypted: i=1; AJvYcCWllE6eT2TZJ/OUoFSHDV6DcgBKK7PV4msB7UTMUSV9zxlMckSDNpuOFof4KW4o0oQb9CQzUO68pKmkPqY=@vger.kernel.org X-Gm-Message-State: AOJu0YyQMEW3RBe5J/Zc/zJRKf431KZX+wkjLzaqx0kRQlL+pj2TrQa5 K+7ccDK75LoOXaBDBzAGP6kaZIAELIOMnpfXcGA58XLG+UZZ5JygX73tPvTZ3RIRxdM= X-Gm-Gg: ASbGnct2sLmLeto6yy1A9xyGr+Icv51i3GSfFbYCAcPpEXSiE52eSs7p/qxgurULjdA bPz82f7c31excF3+3XCTMBV3WpxQMnOYNxeY/shViRxbHtSajaZBnmIWMXo6lrRQ2sD0XtM6am4 sen8Lz1GkW6WHSwpPNULTibv0ogYV2LxKH/1rgqoEmoJOATtsebcKjK3GA1J5XyKqRVJuoJbAWP 8tTL0kXjRy1ADnnjO8iwjsctHzYFGpdEqDtNfGzmLsbFXXdhpYIvO3mQjc8v3tsTA6z4sIzOa4o oHT54uY6SQx2mRiu6raQaLtB2MTtWJPB2voFYY6hPxaAhhi1mAU+TdrWKbem6Xb/XEjkvTm4jWk 0VioC/6hawOeVZUH7f4rG X-Google-Smtp-Source: AGHT+IHhFP+YSRyY0vGcXgga6NhFTtrcJFFIO4OfIe90gNYiOLxgktzPA73bwtmVpC1QBkXNF+uurw== X-Received: by 2002:a17:90b:4fd2:b0:311:f99e:7f4a with SMTP id 98e67ed59e1d1-31250427d15mr1963416a91.26.1748597428884; Fri, 30 May 2025 02:30:28 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.30.13 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:30:28 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 08/35] RPAL: enable sender/receiver registration Date: Fri, 30 May 2025 17:27:36 +0800 Message-Id: X-Mailer: 
git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" In RPAL, there are two roles: the sender (caller) and the receiver ( callee). This patch provides an interface for threads to register as a sender or a receiver with the kernel. Each sender and receiver has its own data structure, along with a block of memory shared between the user space and the kernel space, which is allocated through rpal_mmap(). Signed-off-by: Bo Li --- arch/x86/rpal/Makefile | 2 +- arch/x86/rpal/internal.h | 7 ++ arch/x86/rpal/proc.c | 12 +++ arch/x86/rpal/service.c | 6 ++ arch/x86/rpal/thread.c | 165 +++++++++++++++++++++++++++++++++++++++ include/linux/rpal.h | 79 +++++++++++++++++++ include/linux/sched.h | 15 ++++ init/init_task.c | 2 + kernel/fork.c | 2 + 9 files changed, 289 insertions(+), 1 deletion(-) create mode 100644 arch/x86/rpal/thread.c diff --git a/arch/x86/rpal/Makefile b/arch/x86/rpal/Makefile index a5926fc19334..89f745382c51 100644 --- a/arch/x86/rpal/Makefile +++ b/arch/x86/rpal/Makefile @@ -2,4 +2,4 @@ =20 obj-$(CONFIG_RPAL) +=3D rpal.o =20 -rpal-y :=3D service.o core.o mm.o proc.o +rpal-y :=3D service.o core.o mm.o proc.o thread.o diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h index 65fd14a26f0e..3559c9c6e868 100644 --- a/arch/x86/rpal/internal.h +++ b/arch/x86/rpal/internal.h @@ -34,3 +34,10 @@ static inline void rpal_put_shared_page(struct rpal_shar= ed_page *rsp) int rpal_mmap(struct file *filp, struct vm_area_struct *vma); struct rpal_shared_page *rpal_find_shared_page(struct rpal_service *rs, unsigned long addr); + +/* thread.c */ +int rpal_register_sender(unsigned long addr); +int rpal_unregister_sender(void); +int rpal_register_receiver(unsigned long addr); +int rpal_unregister_receiver(void); +void exit_rpal_thread(void); diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c index 86947dc233d0..8a1e4a8a2271 100644 --- a/arch/x86/rpal/proc.c +++ b/arch/x86/rpal/proc.c @@ -51,6 +51,18 @@ static long rpal_ioctl(struct file *file, unsigned int c= md, unsigned long arg) case RPAL_IOCTL_GET_SERVICE_ID: ret =3D put_user(cur->id, (int __user *)arg); break; + case RPAL_IOCTL_REGISTER_SENDER: + ret =3D rpal_register_sender(arg); + break; + case RPAL_IOCTL_UNREGISTER_SENDER: + ret =3D rpal_unregister_sender(); + break; + case RPAL_IOCTL_REGISTER_RECEIVER: + ret =3D rpal_register_receiver(arg); + break; + case RPAL_IOCTL_UNREGISTER_RECEIVER: + ret =3D rpal_unregister_receiver(); + break; default: return -EINVAL; } diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c index f29a046fc22f..42fb719dbb2a 100644 --- a/arch/x86/rpal/service.c +++ b/arch/x86/rpal/service.c @@ -176,6 +176,7 @@ struct rpal_service *rpal_register_service(void) mutex_init(&rs->mutex); rs->nr_shared_pages =3D 0; INIT_LIST_HEAD(&rs->shared_pages); + atomic_set(&rs->thread_cnt, 0); =20 rs->bad_service =3D false; rs->base =3D calculate_base_address(rs->id); @@ -216,6 +217,9 @@ void rpal_unregister_service(struct rpal_service *rs) if (!rs) return; =20 + while (atomic_read(&rs->thread_cnt) !=3D 0) + schedule(); + delete_service(rs); =20 pr_debug("rpal: unregister service, id: %d, tgid: %d\n", rs->id, @@ -238,6 +242,8 @@ void exit_rpal(bool group_dead) if (!rs) return; =20 + exit_rpal_thread(); + current->rpal_rs =3D NULL; rpal_put_service(rs); =20 diff --git a/arch/x86/rpal/thread.c 
b/arch/x86/rpal/thread.c new file mode 100644 index 000000000000..7550ad94b63f --- /dev/null +++ b/arch/x86/rpal/thread.c @@ -0,0 +1,165 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * RPAL service level operations + * Copyright (c) 2025, ByteDance. All rights reserved. + * + * Author: Jiadong Sun + */ + +#include + +#include "internal.h" + +static void rpal_common_data_init(struct rpal_common_data *rcd) +{ + rcd->bp_task =3D current; + rcd->service_id =3D rpal_current_service()->id; +} + +int rpal_register_sender(unsigned long addr) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct rpal_shared_page *rsp; + struct rpal_sender_data *rsd; + long ret =3D 0; + + if (rpal_test_current_thread_flag(RPAL_SENDER_BIT)) { + ret =3D -EINVAL; + goto out; + } + + rsp =3D rpal_find_shared_page(cur, addr); + if (!rsp) { + ret =3D -EINVAL; + goto out; + } + + if (addr + sizeof(struct rpal_sender_call_context) > + rsp->user_start + rsp->npage * PAGE_SIZE) { + ret =3D -EINVAL; + goto put_shared_page; + } + + rsd =3D kzalloc(sizeof(*rsd), GFP_KERNEL); + if (rsd =3D=3D NULL) { + ret =3D -ENOMEM; + goto put_shared_page; + } + + rpal_common_data_init(&rsd->rcd); + rsd->rsp =3D rsp; + rsd->scc =3D (struct rpal_sender_call_context *)(addr - rsp->user_start + + rsp->kernel_start); + + current->rpal_sd =3D rsd; + rpal_set_current_thread_flag(RPAL_SENDER_BIT); + + atomic_inc(&cur->thread_cnt); + + return 0; + +put_shared_page: + rpal_put_shared_page(rsp); +out: + return ret; +} + +int rpal_unregister_sender(void) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct rpal_sender_data *rsd =3D current->rpal_sd; + long ret =3D 0; + + if (!rpal_test_current_thread_flag(RPAL_SENDER_BIT)) { + ret =3D -EINVAL; + goto out; + } + + rpal_put_shared_page(rsd->rsp); + rpal_clear_current_thread_flag(RPAL_SENDER_BIT); + kfree(rsd); + + atomic_dec(&cur->thread_cnt); + +out: + return ret; +} + +int rpal_register_receiver(unsigned long addr) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct rpal_receiver_data *rrd; + struct rpal_shared_page *rsp; + long ret =3D 0; + + if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) { + ret =3D -EINVAL; + goto out; + } + + rsp =3D rpal_find_shared_page(cur, addr); + if (!rsp) { + ret =3D -EINVAL; + goto out; + } + + if (addr + sizeof(struct rpal_receiver_call_context) > + rsp->user_start + rsp->npage * PAGE_SIZE) { + ret =3D -EINVAL; + goto put_shared_page; + } + + rrd =3D kzalloc(sizeof(*rrd), GFP_KERNEL); + if (rrd =3D=3D NULL) { + ret =3D -ENOMEM; + goto put_shared_page; + } + + rpal_common_data_init(&rrd->rcd); + rrd->rsp =3D rsp; + rrd->rcc =3D + (struct rpal_receiver_call_context *)(addr - rsp->user_start + + rsp->kernel_start); + + current->rpal_rd =3D rrd; + rpal_set_current_thread_flag(RPAL_RECEIVER_BIT); + + atomic_inc(&cur->thread_cnt); + + return 0; + +put_shared_page: + rpal_put_shared_page(rsp); +out: + return ret; +} + +int rpal_unregister_receiver(void) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct rpal_receiver_data *rrd =3D current->rpal_rd; + long ret =3D 0; + + if (!rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) { + ret =3D -EINVAL; + goto out; + } + + rpal_put_shared_page(rrd->rsp); + rpal_clear_current_thread_flag(RPAL_RECEIVER_BIT); + kfree(rrd); + + atomic_dec(&cur->thread_cnt); + +out: + return ret; +} + +void exit_rpal_thread(void) +{ + if (rpal_test_current_thread_flag(RPAL_SENDER_BIT)) + rpal_unregister_sender(); + + if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) + 
rpal_unregister_receiver(); +} diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 986dfbd16fc9..c33425e896af 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -79,6 +79,11 @@ =20 extern unsigned long rpal_cap; =20 +enum rpal_task_flag_bits { + RPAL_SENDER_BIT, + RPAL_RECEIVER_BIT, +}; + /* * Each RPAL process (a.k.a RPAL service) should have a pointer to * struct rpal_service in all its tasks' task_struct. @@ -117,6 +122,9 @@ struct rpal_service { int nr_shared_pages; struct list_head shared_pages; =20 + /* sender/receiver thread count */ + atomic_t thread_cnt; + /* delayed service put work */ struct delayed_work delayed_put_work; =20 @@ -149,10 +157,55 @@ struct rpal_shared_page { struct list_head list; }; =20 +struct rpal_common_data { + /* back pointer to task_struct */ + struct task_struct *bp_task; + /* service id of rpal_service */ + int service_id; +}; + +/* User registers state */ +struct rpal_task_context { + u64 r15; + u64 r14; + u64 r13; + u64 r12; + u64 rbx; + u64 rbp; + u64 rip; + u64 rsp; +}; + +struct rpal_receiver_call_context { + struct rpal_task_context rtc; + int receiver_id; +}; + +struct rpal_receiver_data { + struct rpal_common_data rcd; + struct rpal_shared_page *rsp; + struct rpal_receiver_call_context *rcc; +}; + +struct rpal_sender_call_context { + struct rpal_task_context rtc; + int sender_id; +}; + +struct rpal_sender_data { + struct rpal_common_data rcd; + struct rpal_shared_page *rsp; + struct rpal_sender_call_context *scc; +}; + enum rpal_command_type { RPAL_CMD_GET_API_VERSION_AND_CAP, RPAL_CMD_GET_SERVICE_KEY, RPAL_CMD_GET_SERVICE_ID, + RPAL_CMD_REGISTER_SENDER, + RPAL_CMD_UNREGISTER_SENDER, + RPAL_CMD_REGISTER_RECEIVER, + RPAL_CMD_UNREGISTER_RECEIVER, RPAL_NR_CMD, }; =20 @@ -165,6 +218,14 @@ enum rpal_command_type { _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_KEY, u64 *) #define RPAL_IOCTL_GET_SERVICE_ID \ _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_ID, int *) +#define RPAL_IOCTL_REGISTER_SENDER \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REGISTER_SENDER, unsigned long) +#define RPAL_IOCTL_UNREGISTER_SENDER \ + _IO(RPAL_IOCTL_MAGIC, RPAL_CMD_UNREGISTER_SENDER) +#define RPAL_IOCTL_REGISTER_RECEIVER \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REGISTER_RECEIVER, unsigned long) +#define RPAL_IOCTL_UNREGISTER_RECEIVER \ + _IO(RPAL_IOCTL_MAGIC, RPAL_CMD_UNREGISTER_RECEIVER) =20 /** * @brief get new reference to a rpal service, a corresponding @@ -200,8 +261,26 @@ static inline struct rpal_service *rpal_current_servic= e(void) { return current->rpal_rs; } + +static inline void rpal_set_current_thread_flag(unsigned long bit) +{ + set_bit(bit, ¤t->rpal_flag); +} + +static inline void rpal_clear_current_thread_flag(unsigned long bit) +{ + clear_bit(bit, ¤t->rpal_flag); +} + +static inline bool rpal_test_current_thread_flag(unsigned long bit) +{ + return test_bit(bit, ¤t->rpal_flag); +} #else static inline struct rpal_service *rpal_current_service(void) { return NUL= L; } +static inline void rpal_set_current_thread_flag(unsigned long bit) { } +static inline void rpal_clear_current_thread_flag(unsigned long bit) { } +static inline bool rpal_test_current_thread_flag(unsigned long bit) { retu= rn false; } #endif =20 void rpal_unregister_service(struct rpal_service *rs); diff --git a/include/linux/sched.h b/include/linux/sched.h index ad35b197543c..5f25cc09fb71 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -72,6 +72,9 @@ struct rcu_node; struct reclaim_state; struct robust_list_head; struct root_domain; +struct rpal_common_data; +struct 
rpal_receiver_data; +struct rpal_sender_data; struct rpal_service; struct rq; struct sched_attr; @@ -1648,6 +1651,18 @@ struct task_struct { =20 #ifdef CONFIG_RPAL struct rpal_service *rpal_rs; + unsigned long rpal_flag; + /* + * The first member of both rpal_sd and rpal_rd has a type + * of struct rpal_common_data. So if we do not care whether + * it is a struct rpal_sender_data or a struct rpal_receiver_data, + * use rpal_cd instead of rpal_sd or rpal_rd. + */ + union { + struct rpal_common_data *rpal_cd; + struct rpal_sender_data *rpal_sd; + struct rpal_receiver_data *rpal_rd; + }; #endif =20 /* CPU-specific state of this task: */ diff --git a/init/init_task.c b/init/init_task.c index 0c5b1927da41..2eb08b96e66b 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -222,6 +222,8 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { #endif #ifdef CONFIG_RPAL .rpal_rs =3D NULL, + .rpal_flag =3D 0, + .rpal_cd =3D NULL, #endif }; EXPORT_SYMBOL(init_task); diff --git a/kernel/fork.c b/kernel/fork.c index 1d1c8484a8f2..01cd48eadf68 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1220,6 +1220,8 @@ static struct task_struct *dup_task_struct(struct tas= k_struct *orig, int node) =20 #ifdef CONFIG_RPAL tsk->rpal_rs =3D NULL; + tsk->rpal_flag =3D 0; + tsk->rpal_cd =3D NULL; #endif return tsk; =20 --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pg1-f178.google.com (mail-pg1-f178.google.com [209.85.215.178]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4F2A11F4E3B for ; Fri, 30 May 2025 09:30:45 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.178 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597447; cv=none; b=LCkgVKcbbLhx71/bWEWngmCBFkPxnT9eo5VMZfORRoJOnjeuJH/1Zmq/gk5DlaeSJ8EUeqzH2N1n3l2xI/ND9qTupjQxmhf80Yt3YwFC2msYCD13XbiPd1guDqA5gKAIS+rlkyOxa/itrqQe3mU43+elGiNfYC+zKpgJu7zs46c= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597447; c=relaxed/simple; bh=cZEQ0MqKbOvC/as7YQeo0L+z+hUgvHRxc+aL7T32ukk=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=JkdSYPOnpb7myy5fd1uhk10dVQ0FKg7AfGtDm20sDN5+kgmOVKSA6m/UyoUMuO5JyS9T5ROKRlt7LVDUL5QZ62uRtdFwVrwDw3Ts+N4v7yx4sL5AZ2GNN9YqmKphlRwnKMwzAZgNT1wQh62V8d9o+lq/C0VuatpuMNWGvY2GcZs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=ZfFuStJw; arc=none smtp.client-ip=209.85.215.178 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="ZfFuStJw" Received: by mail-pg1-f178.google.com with SMTP id 41be03b00d2f7-b26f7d2c1f1so1765290a12.0 for ; Fri, 30 May 2025 02:30:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597444; x=1749202244; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=MCkN3xSvk7RNOqtTsnso8pu1jgTl1XSXrm3OEL9CcBU=; 
b=ZfFuStJwx/BN5fNeH96ss/AySEuyezo8B0bfDe8tMKyDwgL098ciD8jt8L7GdT+AlT ghBvvwMCIn8m3u2zk+iDtRMlyZ4g0Zcp9gInoNmbl6OGZ4vXHi4HMNUYrwiqlHbvhgXR rOFyRJhGmYpfIytDCZKohBYk3yZnd3yBQ9/4MUiB6ISpvVQSxazUuD+bvWfR3ni8HGYJ 2WTZMlh4DZ8KVZ1OBH3QcQq7I7T58qKyfJaDApCuMtPfvBwndFZarGroYdhHLjBTioAi ksl/sMIeWv6qiH/hLRFjub5+vtaw/gEWfNwOEBIDS+PBybSHccw4P2ANkboLgheKQw8R nL+Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597444; x=1749202244; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=MCkN3xSvk7RNOqtTsnso8pu1jgTl1XSXrm3OEL9CcBU=; b=EAdTqpeYN/r8Z9Pk/s7dkUBZrGEdO8f29QTVPgXIHwGWR1T5cyzkrt2OBrR0LpCSf+ KcbjzoMCJk0C+rdp8ksBCFM646HW33IgRAggnzPSIIAU3fhpayX1sMMCnH3pfbUd8IGW YkrKeZVOwcFbM9xKLnlE51mZDRvDr+jRr96+UFPVyOB3sFoaiuwogu1VWW66a4SVcBzX 1GGLRgVo1OotwZ6jYcWwcsTaHPIhdMNGPP5ql8F3IZPnyEcISkeltzOL8WVSgBKx+Lf5 bY1VUF56MaGmPgDq4fKqYOYG67ve+3FXT0E7cG/j1awepIu0bVxHoqDpdpb1PaI5FSbO bH7w== X-Forwarded-Encrypted: i=1; AJvYcCUHoxRPDRtc/JvvYVlzu5oT2udsNPmAlbEK9BG4KAd8huexcrKmgBQFYP4XjPweKPub7ZF9b8Mla3QOo4w=@vger.kernel.org X-Gm-Message-State: AOJu0YxQbSHv61Ya/A/TYU4adPN+HEjd1k8h9Wp7CLStu3xRPbXfZ9B6 1RyDq2vyM1a6l52M3yhSd2qG/TsrOIEfQst0SHqmr09kCKPbEX21Kg4wS1yn9Q9KVLw= X-Gm-Gg: ASbGnctPN0XtOS48t8yKBJmPUw4rgpKxStmsS4a70uC0GBt+8FU8L+yWhhe8jRkvs31 RyLFVYnovSTzWHLW1vkPwI08+Ifo+GK20MENTg+h/gsniDpEafBQBQWi/Uy6J9QNeXBAOLWi02a GDQFmF0D5O2Lt8j3LGXq9CkyogE9uW6eO5BtoU8PBXUXtqYkmnQLJzAjdBZtvwFqog4S3XvqhXu hzkR/3DmESVSO1ZIxbNVf0kBAC65fr/1InLvqQtPWocN2cOL9XPb+X76ixJDdfHFGIzhgiPDvib C/pBdK7FtiPKI8sIXwmCraBePR9e/ngX4kWCMvTCT4o3Rf8i3WRQS4zZMyx691gFWtXd7ffbwAy 0H4LhQ3NerQ== X-Google-Smtp-Source: AGHT+IGxnH4FN8v0lfXoJ+KLdg1DAlTGeu78x7GO8KMsAYvLDaxWhgMloZ/K96UmZpYHSFOIcwKPxQ== X-Received: by 2002:a17:90b:2248:b0:311:ff18:b84b with SMTP id 98e67ed59e1d1-31250427cdamr2068655a91.25.1748597444187; Fri, 30 May 2025 02:30:44 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.30.29 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:30:43 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 09/35] RPAL: enable address space sharing Date: Fri, 30 May 2025 17:27:37 +0800 Message-Id: 
<2b5378f3686fd2831468e65c49609fbb19072b43.1748594840.git.libo.gcs85@bytedance.com>
X-Mailer: git-send-email 2.39.5 (Apple Git-154)
In-Reply-To:
References:
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

RPAL's memory sharing is implemented by copying p4d entries, which
requires implementing corresponding interfaces. Copying p4d entries can
also cause a process's page table to contain p4d entries that do not
belong to it, so RPAL needs to resolve the compatibility issues this
creates with other subsystems.

This patch implements the rpal_map_service() interface to complete the
mutual copying of p4d entries between two RPAL services. RPAL marks each
copied p4d entry with the _PAGE_RPAL_IGN flag. This flag makes p4d_none()
return true and p4d_present() return false, ensuring that these p4d
entries are invisible to other kernel subsystems. The protection of the
p4d entries is guaranteed by the memory balloon, which ensures that the
address space corresponding to the p4d entries is not used by the current
service.

Signed-off-by: Bo Li
---
 arch/x86/include/asm/pgtable.h       |  25 ++++
 arch/x86/include/asm/pgtable_types.h |  11 ++
 arch/x86/rpal/internal.h             |   2 +
 arch/x86/rpal/mm.c                   | 175 +++++++++++++++++++++++++++
 4 files changed, 213 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5ddba366d3b4..54351bfe4e47 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1137,12 +1137,37 @@ static inline int pud_bad(pud_t pud)
 #if CONFIG_PGTABLE_LEVELS > 3
 static inline int p4d_none(p4d_t p4d)
 {
+#if IS_ENABLED(CONFIG_RPAL)
+	p4dval_t p4dv = native_p4d_val(p4d);
+
+	/*
+	 * RPAL copies p4d entries to share address space, so other
+	 * processes must not manipulate a copied p4d. Make p4d_none()
+	 * return true for such a p4d so that generic page table code
+	 * treats it as empty and leaves it alone.
+	 */
+	return (p4dv & _PAGE_RPAL_IGN) ||
+	       ((p4dv & ~(_PAGE_KNL_ERRATUM_MASK)) == 0);
+#else
 	return (native_p4d_val(p4d) & ~(_PAGE_KNL_ERRATUM_MASK)) == 0;
+#endif
 }
 
 static inline int p4d_present(p4d_t p4d)
 {
+#if IS_ENABLED(CONFIG_RPAL)
+	p4dval_t p4df = p4d_flags(p4d);
+
+	/*
+	 * RPAL copies p4d entries to share address space, so other
+	 * processes must not manipulate a copied p4d. Make p4d_present()
+	 * return false for such a p4d so that generic page table code
+	 * ignores it.
+	 */
+	return ((p4df & (_PAGE_PRESENT | _PAGE_RPAL_IGN)) == _PAGE_PRESENT);
+#else
 	return p4d_flags(p4d) & _PAGE_PRESENT;
+#endif
 }
 
 static inline pud_t *p4d_pgtable(p4d_t p4d)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b74ec5c3643b..781b0f5bc359 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -35,6 +35,13 @@
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_KERNEL_4K	_PAGE_BIT_SOFTW3 /* page must not be converted to large */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
+/*
+ * _PAGE_BIT_SOFTW1 is also used by _PAGE_BIT_SPECIAL,
+ * but this does not conflict with _PAGE_BIT_SPECIAL
+ * because we use it only at the p4d/pud level, while
+ * _PAGE_BIT_SPECIAL is only used at the pte level.
+ */ +#define _PAGE_BIT_RPAL_IGN _PAGE_BIT_SOFTW1 =20 #ifdef CONFIG_X86_64 #define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW5 /* Saved Dirty bit (leaf) */ @@ -95,6 +102,10 @@ #define _PAGE_SOFT_DIRTY (_AT(pteval_t, 0)) #endif =20 +#if IS_ENABLED(CONFIG_RPAL) +#define _PAGE_RPAL_IGN (_AT(pteval_t, 1) << _PAGE_BIT_RPAL_IGN) +#endif + /* * Tracking soft dirty bit when a page goes to a swap is tricky. * We need a bit which can be stored in pte _and_ not conflict diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h index 3559c9c6e868..65f2cf4baf8f 100644 --- a/arch/x86/rpal/internal.h +++ b/arch/x86/rpal/internal.h @@ -34,6 +34,8 @@ static inline void rpal_put_shared_page(struct rpal_share= d_page *rsp) int rpal_mmap(struct file *filp, struct vm_area_struct *vma); struct rpal_shared_page *rpal_find_shared_page(struct rpal_service *rs, unsigned long addr); +int rpal_map_service(struct rpal_service *tgt); +void rpal_unmap_service(struct rpal_service *tgt); =20 /* thread.c */ int rpal_register_sender(unsigned long addr); diff --git a/arch/x86/rpal/mm.c b/arch/x86/rpal/mm.c index 8a738c502d1d..f1003baae001 100644 --- a/arch/x86/rpal/mm.c +++ b/arch/x86/rpal/mm.c @@ -215,3 +215,178 @@ void rpal_exit_mmap(struct mm_struct *mm) rpal_put_service(rs); } } + +/* + * Since the user address space size of rpal process is 512G, which + * is the size of one p4d, we assume p4d entry will never change after + * rpal process is created. + */ +static int mm_link_p4d(struct mm_struct *dst_mm, p4d_t src_p4d, + unsigned long addr) +{ + spinlock_t *dst_ptl =3D &dst_mm->page_table_lock; + unsigned long flags; + pgd_t *dst_pgdp; + p4d_t p4d, *dst_p4dp; + p4dval_t p4dv; + int ret =3D 0; + + BUILD_BUG_ON(CONFIG_PGTABLE_LEVELS < 4); + + mmap_write_lock(dst_mm); + spin_lock_irqsave(dst_ptl, flags); + dst_pgdp =3D pgd_offset(dst_mm, addr); + /* + * dst_pgd must exists, otherwise we need to alloc pgd entry. When + * src_p4d is freed, we also need to free the pgd entry. This should + * be supported in the future. + */ + if (unlikely(pgd_none_or_clear_bad(dst_pgdp))) { + rpal_err("cannot find pgd entry for addr 0x%016lx\n", addr); + ret =3D -EINVAL; + goto unlock; + } + + dst_p4dp =3D p4d_offset(dst_pgdp, addr); + if (unlikely(!p4d_none_or_clear_bad(dst_p4dp))) { + rpal_err("p4d is previously mapped\n"); + ret =3D -EINVAL; + goto unlock; + } + + p4dv =3D p4d_val(src_p4d); + + /* + * Since RPAL copy p4d entry to share address space, + * it is important that other process will not manipulate + * this copied p4d. We need mark the copied p4d and make + * p4d_present() and p4d_none() ignore such p4d. 
+ */ + p4dv |=3D _PAGE_RPAL_IGN; + + if (boot_cpu_has(X86_FEATURE_PTI)) + p4d =3D native_make_p4d((~_PAGE_NX) & p4dv); + else + p4d =3D native_make_p4d(p4dv); + + set_p4d(dst_p4dp, p4d); + spin_unlock_irqrestore(dst_ptl, flags); + mmap_write_unlock(dst_mm); + + return 0; +unlock: + spin_unlock_irqrestore(dst_ptl, flags); + mmap_write_unlock(dst_mm); + return ret; +} + +static void mm_unlink_p4d(struct mm_struct *mm, unsigned long addr) +{ + spinlock_t *ptl =3D &mm->page_table_lock; + unsigned long flags; + pgd_t *pgdp; + p4d_t *p4dp; + + mmap_write_lock(mm); + spin_lock_irqsave(ptl, flags); + pgdp =3D pgd_offset(mm, addr); + p4dp =3D p4d_offset(pgdp, addr); + p4d_clear(p4dp); + spin_unlock_irqrestore(ptl, flags); + mmap_write_unlock(mm); + + flush_tlb_mm(mm); +} + +static int get_mm_p4d(struct mm_struct *mm, unsigned long addr, p4d_t *src= p) +{ + spinlock_t *ptl; + unsigned long flags; + pgd_t *pgdp; + p4d_t *p4dp; + int ret =3D 0; + + ptl =3D &mm->page_table_lock; + spin_lock_irqsave(ptl, flags); + pgdp =3D pgd_offset(mm, addr); + if (pgd_none(*pgdp)) { + ret =3D -EINVAL; + goto out; + } + + p4dp =3D p4d_offset(pgdp, addr); + if (p4d_none(*p4dp) || p4d_bad(*p4dp)) { + ret =3D -EINVAL; + goto out; + } + *srcp =3D *p4dp; + +out: + spin_unlock_irqrestore(ptl, flags); + + return ret; +} + +int rpal_map_service(struct rpal_service *tgt) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct mm_struct *cur_mm, *tgt_mm; + unsigned long cur_addr, tgt_addr; + p4d_t cur_p4d, tgt_p4d; + int ret =3D 0; + + cur_mm =3D current->mm; + tgt_mm =3D tgt->mm; + if (!mmget_not_zero(tgt_mm)) { + ret =3D -EINVAL; + goto out; + } + + cur_addr =3D rpal_get_base(cur); + tgt_addr =3D rpal_get_base(tgt); + + ret =3D get_mm_p4d(tgt_mm, tgt_addr, &tgt_p4d); + if (ret) + goto put_tgt; + + ret =3D get_mm_p4d(cur_mm, cur_addr, &cur_p4d); + if (ret) + goto put_tgt; + + ret =3D mm_link_p4d(cur_mm, tgt_p4d, tgt_addr); + if (ret) + goto put_tgt; + + ret =3D mm_link_p4d(tgt_mm, cur_p4d, cur_addr); + if (ret) { + mm_unlink_p4d(cur_mm, tgt_addr); + goto put_tgt; + } + +put_tgt: + mmput(tgt_mm); +out: + return ret; +} + +void rpal_unmap_service(struct rpal_service *tgt) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct mm_struct *cur_mm, *tgt_mm; + unsigned long cur_addr, tgt_addr; + + cur_mm =3D current->mm; + tgt_mm =3D tgt->mm; + + cur_addr =3D rpal_get_base(cur); + tgt_addr =3D rpal_get_base(tgt); + + if (mmget_not_zero(tgt_mm)) { + mm_unlink_p4d(tgt_mm, cur_addr); + mmput(tgt_mm); + } else { + /* If tgt has exited, then we get a NULL tgt_mm */ + pr_debug("rpal: [%d] cannot find target mm\n", current->pid); + } + mm_unlink_p4d(cur_mm, tgt->base); +} --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pj1-f51.google.com (mail-pj1-f51.google.com [209.85.216.51]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 749D3221560 for ; Fri, 30 May 2025 09:31:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.51 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597461; cv=none; b=FwFyAOiFT+Rf5sCCWEuo1O31BCkDUott2dnsDvilrz/P9ozCBfckF3U+jYPV/fn7IpNHZZINQZYTucXguPXKJaK+W65PjTs2HqrfF3isOXtWrpZy74IwTYE7+27UEdafTuKVoEIsVdIuWFVZwkx1VG8hiX3Rcywyp76zbZ0cj+E= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597461; c=relaxed/simple; bh=zWuwJn1AfK9f6Ju/rr0kFMON4yyaiLCS77HNg2YNluA=; 
h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=geD4hR6aGupV4u6KNuwtjZ1PIQn8uWtnltniK98M00LfhdCoWZBVxNJA/OU2KjYw21/CbdoJpbrNraVJxzbINZchThQQtgSWn27u+zlEUR82jySCBrpf4IklFtSkEX21LQKtH7iiojjMK1YxM6VWukMyEdEQkCaUicJIr7bmjcg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=UTab4Njo; arc=none smtp.client-ip=209.85.216.51 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="UTab4Njo" Received: by mail-pj1-f51.google.com with SMTP id 98e67ed59e1d1-30f0d8628c8so2061994a91.0 for ; Fri, 30 May 2025 02:31:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597460; x=1749202260; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=kZkk3bjH1sfOYp+VjBWd01Gh2LTXey//pWYFXMUcbnQ=; b=UTab4NjoLFxlKeL8jmboXD5UpJMWaUHx3DVCAYwZLB0nlwEIlfiWuXkN22bpxm4Vi9 DTYrVN9yOy9VnBFeRgd2x8I5ZuuGH0WWwucVoUAv1AgR2lBacf6sTSZhDN9NNTQJoIq+ n3v8UiML8CowCq3JUHmtMvuVFYD22l5wG/lZ7TwlFcSvwCvCiD1TKBltNVnDZClzfipx WdFgJsC2Qf54Z2bSF6uFlzzY/0CeWh0mm8ttBgqOtvLWFUdrFaRaSVGW7yjBXCRIWoiK UvmKF9rH2tUV1J9ndAicxUy3UJGVcmSqBxY7jkpzEAWmcB2KeTTE3GLIcbp0y9hjo+17 XyZw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597460; x=1749202260; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=kZkk3bjH1sfOYp+VjBWd01Gh2LTXey//pWYFXMUcbnQ=; b=PDFbFXrXSBZGTWLToirV9oEvVhp8TPLXfi48rWjgYN6MqsqgAjE+5pMV5TPI4LOURH ZNTC8lMTxVxn5fnfF7+OZ+bYpIxL6PxdHBKWKgUZ5aO2TJG9pI3DeQ6r/TGB1pHpy3Zr zvOXLP3YOVrxiD0pfimx8pJtp9EemV3+1j9OeYSQ3slya11bBWL1RIGyb49UCBafDU2y W4f4WeCNnm0S0tP28PiXpV5CgbyI2S/b+YipaWaRuk1xoBZuXmT9P/VkGJNfwK3JMCH2 QBJl/z/znTdl8ZP8g5LQ96DlQuUZ4fUWhR5V85YaBuqhyxudt4xe1mI4LJYxoaTEDKLT HsEA== X-Forwarded-Encrypted: i=1; AJvYcCV8N5r8E7TQrqfFLEV/HS79KlUVpi4ibC8McExAeRtLSOtuvCc+bIq+90v2ePGuSa+v0spfyd+T5iEVzxY=@vger.kernel.org X-Gm-Message-State: AOJu0YzvOPFzRiyK+aRiEn0W+X3xAGWphTbnV72qpknsVnxZ8yFJoWbb gHTXixQEg2fwq2mk4cD3Av6UTuHTGCW/teeXHam/gD/qvLKqP+/dk8XBpMD8DtGmvEs= X-Gm-Gg: ASbGncsvB6SsP72NBwGziQspZKEgVxqnz+yZdm/Qr2udQW7QUCDascPQspFFbtdTA9o M3styXsOz73UDITZeay+BtnvErNXyZhzakNm4byKhr8DDKs+3HP5nzpMPL2uZWWi4wJMefJX/ja YxoIAKaoOWMNt9Xw46HAP0+yliGxm9YIyTijMYoyFwuFF0kAktVaJS1s2yiAiHlS5dMNA27O/tM Q2FG4S85CUNxX2nBJlCeOULu8QBqRWZ6XWK+t1hp/3Bkce5NXY8GwFQ4knjrz7z1mmJjQZ6z6Up 8Ewm/vx1mt25l0ycZu6dCHuJM9ps7pSYI2PnyL5HSFgTor9xI18qzoCGRj5QdPJEf0mW0sXxxEk K6WXUcg+PvnZdJNimnmwGmawf6x8p0gA= X-Google-Smtp-Source: AGHT+IEDp+vpzbXXEsQMocz8uu8WDoupHL7dKOSkVefmCskmswDnpl1KI6oBSxmQepUxjo/3kPsGsg== X-Received: by 2002:a17:90b:28c8:b0:30e:e9f1:8447 with SMTP id 98e67ed59e1d1-31214e11c6cmr9475340a91.4.1748597459572; Fri, 30 May 2025 02:30:59 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.30.44 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 
May 2025 02:30:59 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 10/35] RPAL: allow service enable/disable Date: Fri, 30 May 2025 17:27:38 +0800 Message-Id: <34c12765bcf534c5afedd10ee3e763695c6a045d.1748594840.git.libo.gcs85@bytedance.com> X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Since RPAL involves communication between services, and services require certain preparations (e.g., registering senders/receivers) before communication, the kernel needs to sense whether a service is ready to perform RPAL call-related operations. This patch adds two interfaces: rpal_enable_service() and rpal_disable_service(). rpal_enable_service() passes necessary information to the kernel and marks the service as enabled. RPAL only permits communication between services in the enabled state. rpal_disable_service() clears the service's enabled state, thereby prohibiting communication between the service and others via RPAL. 
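For illustration only, a minimal userspace sketch of the intended call
sequence (not part of this patch). It assumes the RPAL proc interface is
exposed at /proc/rpal and that the RPAL_IOCTL_* and
struct rpal_service_metadata definitions from this series are visible to
userspace:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/rpal.h>   /* assumed uapi export of the RPAL definitions */

  int main(void)
  {
          struct rpal_service_metadata rsm = {
                  .version   = 1,     /* illustrative value */
                  .user_meta = NULL,  /* handed to services that request us */
          };
          int fd = open("/proc/rpal", O_RDWR);  /* assumed path of the proc file */

          if (fd < 0)
                  return 1;

          /* Hand the metadata to the kernel and mark this service enabled. */
          if (ioctl(fd, RPAL_IOCTL_ENABLE_SERVICE, &rsm))
                  perror("RPAL_IOCTL_ENABLE_SERVICE");

          /* ... communicate with other enabled services via RPAL ... */

          /* Clear the enabled state; further RPAL communication is rejected. */
          if (ioctl(fd, RPAL_IOCTL_DISABLE_SERVICE))
                  perror("RPAL_IOCTL_DISABLE_SERVICE");

          close(fd);
          return 0;
  }

The user_meta pointer stored here is what a later request by another
service reads back, so a real caller would point it at shared metadata
rather than NULL.
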
Signed-off-by: Bo Li --- arch/x86/rpal/internal.h | 2 ++ arch/x86/rpal/proc.c | 6 +++++ arch/x86/rpal/service.c | 50 ++++++++++++++++++++++++++++++++++++++++ include/linux/rpal.h | 18 +++++++++++++++ 4 files changed, 76 insertions(+) diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h index 65f2cf4baf8f..769d3bbe5a6b 100644 --- a/arch/x86/rpal/internal.h +++ b/arch/x86/rpal/internal.h @@ -17,6 +17,8 @@ extern bool rpal_inited; /* service.c */ int __init rpal_service_init(void); void __init rpal_service_exit(void); +int rpal_enable_service(unsigned long arg); +int rpal_disable_service(void); =20 /* mm.c */ static inline struct rpal_shared_page * diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c index 8a1e4a8a2271..acd814f31649 100644 --- a/arch/x86/rpal/proc.c +++ b/arch/x86/rpal/proc.c @@ -63,6 +63,12 @@ static long rpal_ioctl(struct file *file, unsigned int c= md, unsigned long arg) case RPAL_IOCTL_UNREGISTER_RECEIVER: ret =3D rpal_unregister_receiver(); break; + case RPAL_IOCTL_ENABLE_SERVICE: + ret =3D rpal_enable_service(arg); + break; + case RPAL_IOCTL_DISABLE_SERVICE: + ret =3D rpal_disable_service(); + break; default: return -EINVAL; } diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c index 42fb719dbb2a..8a7b679bc28b 100644 --- a/arch/x86/rpal/service.c +++ b/arch/x86/rpal/service.c @@ -177,6 +177,7 @@ struct rpal_service *rpal_register_service(void) rs->nr_shared_pages =3D 0; INIT_LIST_HEAD(&rs->shared_pages); atomic_set(&rs->thread_cnt, 0); + rs->enabled =3D false; =20 rs->bad_service =3D false; rs->base =3D calculate_base_address(rs->id); @@ -228,6 +229,52 @@ void rpal_unregister_service(struct rpal_service *rs) rpal_put_service(rs); } =20 +int rpal_enable_service(unsigned long arg) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct rpal_service_metadata rsm; + int ret =3D 0; + + if (cur->bad_service) { + ret =3D -EINVAL; + goto out; + } + + ret =3D copy_from_user(&rsm, (void __user *)arg, sizeof(rsm)); + if (ret) { + ret =3D -EFAULT; + goto out; + } + + mutex_lock(&cur->mutex); + if (!cur->enabled) { + cur->rsm =3D rsm; + cur->enabled =3D true; + } + mutex_unlock(&cur->mutex); + +out: + return ret; +} + +int rpal_disable_service(void) +{ + struct rpal_service *cur =3D rpal_current_service(); + int ret =3D 0; + + mutex_lock(&cur->mutex); + if (cur->enabled) { + cur->enabled =3D false; + } else { + ret =3D -EINVAL; + goto unlock_mutex; + } + +unlock_mutex: + mutex_unlock(&cur->mutex); + return ret; +} + void copy_rpal(struct task_struct *p) { struct rpal_service *cur =3D rpal_current_service(); @@ -244,6 +291,9 @@ void exit_rpal(bool group_dead) =20 exit_rpal_thread(); =20 + if (group_dead) + rpal_disable_service(); + current->rpal_rs =3D NULL; rpal_put_service(rs); =20 diff --git a/include/linux/rpal.h b/include/linux/rpal.h index c33425e896af..2e5010602177 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -84,6 +84,14 @@ enum rpal_task_flag_bits { RPAL_RECEIVER_BIT, }; =20 +/* + * user_meta will be sent to other service when requested. + */ +struct rpal_service_metadata { + unsigned long version; + void __user *user_meta; +}; + /* * Each RPAL process (a.k.a RPAL service) should have a pointer to * struct rpal_service in all its tasks' task_struct. 
@@ -125,6 +133,10 @@ struct rpal_service { /* sender/receiver thread count */ atomic_t thread_cnt; =20 + /* service metadata */ + bool enabled; + struct rpal_service_metadata rsm; + /* delayed service put work */ struct delayed_work delayed_put_work; =20 @@ -206,6 +218,8 @@ enum rpal_command_type { RPAL_CMD_UNREGISTER_SENDER, RPAL_CMD_REGISTER_RECEIVER, RPAL_CMD_UNREGISTER_RECEIVER, + RPAL_CMD_ENABLE_SERVICE, + RPAL_CMD_DISABLE_SERVICE, RPAL_NR_CMD, }; =20 @@ -226,6 +240,10 @@ enum rpal_command_type { _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REGISTER_RECEIVER, unsigned long) #define RPAL_IOCTL_UNREGISTER_RECEIVER \ _IO(RPAL_IOCTL_MAGIC, RPAL_CMD_UNREGISTER_RECEIVER) +#define RPAL_IOCTL_ENABLE_SERVICE \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_ENABLE_SERVICE, unsigned long) +#define RPAL_IOCTL_DISABLE_SERVICE \ + _IO(RPAL_IOCTL_MAGIC, RPAL_CMD_DISABLE_SERVICE) =20 /** * @brief get new reference to a rpal service, a corresponding --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pj1-f49.google.com (mail-pj1-f49.google.com [209.85.216.49]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5E9A522B5AA for ; Fri, 30 May 2025 09:31:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.49 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597478; cv=none; b=tSMjodNPwG4U0g9Rd+9L6HqrM4oBIVn02KShTJtB+ry4sEsb2AReuJgN71dDotTzN0VKpyVfkX4iNlgq9QZDlcsRaVW1sKtP5zY91EOkSECslR7/AECA4FnN6hDVcYZb+aLHfSyUyCyVvP2BZ+LmMsT7LHcHiJXZJ5B/hQYZ29U= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597478; c=relaxed/simple; bh=Tz+y1/2E7hEgjo92We1jW9pwmbJC4Y8VL7rWkTlATqs=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=shgINVY3iMedLC74iitX6mDPuWVZjgII6eIaVdt+Y2P5Kz34e2wvg46qd4gJs7Ba5QXnvhYFccXoQ1qhP2Z4fq3fXNTCm3FEJl/uuT71BhsNF7VfbHo7bOr/S9f8jmxqZkbAtr9zooH52YgjAsd689T6m5zRB+ypW8hgK8tRdEI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=Y/Wa05cw; arc=none smtp.client-ip=209.85.216.49 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="Y/Wa05cw" Received: by mail-pj1-f49.google.com with SMTP id 98e67ed59e1d1-3081fe5987eso1482476a91.3 for ; Fri, 30 May 2025 02:31:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597475; x=1749202275; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=FwsjAroG47yqr2xPwlaPw9IAGrmPcW3kvk56UsePHN4=; b=Y/Wa05cw69+qKOFH87dOlfhVUtl62vYn8/tYhSU5dPdLb2jAyiYYipf7pUeoh72E9R hOjVYM6Ie5QxLl1tI8mDVg7WdN7Bm2P3p95hJZY5pFaWc4PYF4CuWwrpqYIvugN52alF 1DlurKWNoIsr2aOV9KB5opggiRtyFbnUK6Y4XPaJ4bXM7664cf3fQ1ss7473NqnM8oZV li7rFoRIUyHitkUD1AlHf2yN91eISE+ay4aaOGv6YBVYS87IADfLjGuYvYX5uEbWzEbU 3jKTvGou6BBXutBH2P/mXnVK5fD5qElszqHRBBBxrzgmtbSX85YIzUgI+ckYXgu3o8Ic dsLQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; 
s=20230601; t=1748597475; x=1749202275; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=FwsjAroG47yqr2xPwlaPw9IAGrmPcW3kvk56UsePHN4=; b=SztOFuS2NlYXq8gvxQkxfP0NJ0tHIZiP9KmmKenI1HArCjWpOERzDpbE6s5dj4XIoV v+HVZt6/XPomMGMOFsOwy7p1l3YvvmPJOGK5P7E/NQW87a8koGGfhwA3DvZKJZwAfwJ3 UWLJcc5wT0toijwPqHh1mlEB5kDIUAg8E9z0linZJKYNO1qM2YPaqvw9XfLK68wJaQ8E dzRw2OCH6xsXue7yO37CwShFcqnQXE3n/oEUDZNE4uvv5BMsys3Y5JZtpUTHYGkvT9Ar tK4o57hmaIqNjUHjeDM4d3ArmrJ2o1uhmUldq3+DmvW4YHTIYhal5dk6oqCjfDhjsdFK QGmA== X-Forwarded-Encrypted: i=1; AJvYcCXzwScqmplshHB5lmS/KG1JanJTTZrw4x0uvGQsv802ehwIQQ0DPVsIvN3FdVoWQTb3GBq2LGQM7kr6XVU=@vger.kernel.org X-Gm-Message-State: AOJu0YyQOtm2db3SjR4njEj72QmNazXpNLNVAmbkITazkTZAuntKyUKx 8VTffD3PiI154Cytdb0hVCVOXawQBa6KIMnBPJ9SgAxJMKTind7w/jTXOcaNHyHf6OA= X-Gm-Gg: ASbGncs/MA+LvH4BY4hevxYRRQK15oLv+qSHFLVcnvHcj36dYSV/sIgr+6Vumdod5ES K2sHTOFnyD53Ava236qrH798siK6vYLaCwH01jCno9Z0SwBk9h7XIGRdxIHN1idZY+VVt7CcxDt 7fBS232MM6E/QXwYOrb4g/2th2MEYY8QgfA98R/DWD72jJS8MKFTte/7jaqvtvvtnFHW/xACFq8 K+nAjeX5hOrLEgIuUSxvRPNq6mt4VTo/hiQEi6MVoUOu0AESu7pUrU2FRx4fzJMhdbEbzC/eiDO yPpex26motLfXTpIFJZ2ZnOVJU7p7cQVSfMCArD1lZdsn5RzWHvX0d+0op8EmdcJ4/YhDg6K3qu 1GIzFQAxc4nfwHg6w4dy5 X-Google-Smtp-Source: AGHT+IHRc0pSTHqnWWemzLU4cZ03FJzQxrkM5nsu9yl2b6TGMlL/6vitJMQ7NNciKXpMzxjmPBfgdQ== X-Received: by 2002:a17:90a:d448:b0:312:29e:9ed8 with SMTP id 98e67ed59e1d1-31250413834mr2149974a91.20.1748597475019; Fri, 30 May 2025 02:31:15 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.31.00 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:31:14 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 11/35] RPAL: add service request/release Date: Fri, 30 May 2025 17:27:39 +0800 Message-Id: X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Services communicating via RPAL require a series of operations to perform RPAL calls, such as mapping each other's memory and obtaining each other's metadata. 
This patch adds the rpal_request_service() and rpal_release_service() interfaces. Before communication, services must first complete a handshake process by mutually requesting each other. Only after both parties have completed their requests will RPAL copy each other's p4d entries into the other party's page tables, thereby achieving address space sharing. The patch defines RPAL_REQUEST_MAP and RPAL_REVERSE_MAP to indicate whether a service has requested another service or has been requested by another service. rpal_release_service() can release previously requested services, which triggers the removal of mutual p4d entries and terminates address space sharing. When a service exits the enabled state, the kernel will release all services it has ever requested, thereby terminating all address space sharing involving this service. Signed-off-by: Bo Li --- arch/x86/rpal/internal.h | 5 + arch/x86/rpal/proc.c | 6 + arch/x86/rpal/service.c | 265 ++++++++++++++++++++++++++++++++++++++- include/linux/rpal.h | 42 +++++++ 4 files changed, 316 insertions(+), 2 deletions(-) diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h index 769d3bbe5a6b..c504b6efff64 100644 --- a/arch/x86/rpal/internal.h +++ b/arch/x86/rpal/internal.h @@ -12,6 +12,9 @@ #include #include =20 +#define RPAL_REQUEST_MAP 0x1 +#define RPAL_REVERSE_MAP 0x2 + extern bool rpal_inited; =20 /* service.c */ @@ -19,6 +22,8 @@ int __init rpal_service_init(void); void __init rpal_service_exit(void); int rpal_enable_service(unsigned long arg); int rpal_disable_service(void); +int rpal_request_service(unsigned long arg); +int rpal_release_service(u64 key); =20 /* mm.c */ static inline struct rpal_shared_page * diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c index acd814f31649..f001afd40562 100644 --- a/arch/x86/rpal/proc.c +++ b/arch/x86/rpal/proc.c @@ -69,6 +69,12 @@ static long rpal_ioctl(struct file *file, unsigned int c= md, unsigned long arg) case RPAL_IOCTL_DISABLE_SERVICE: ret =3D rpal_disable_service(); break; + case RPAL_IOCTL_REQUEST_SERVICE: + ret =3D rpal_request_service(arg); + break; + case RPAL_IOCTL_RELEASE_SERVICE: + ret =3D rpal_release_service(arg); + break; default: return -EINVAL; } diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c index 8a7b679bc28b..16a2155873a1 100644 --- a/arch/x86/rpal/service.c +++ b/arch/x86/rpal/service.c @@ -178,6 +178,9 @@ struct rpal_service *rpal_register_service(void) INIT_LIST_HEAD(&rs->shared_pages); atomic_set(&rs->thread_cnt, 0); rs->enabled =3D false; + atomic_set(&rs->req_avail_cnt, MAX_REQUEST_SERVICE); + bitmap_zero(rs->requested_service_bitmap, RPAL_NR_ID); + spin_lock_init(&rs->lock); =20 rs->bad_service =3D false; rs->base =3D calculate_base_address(rs->id); @@ -229,6 +232,262 @@ void rpal_unregister_service(struct rpal_service *rs) rpal_put_service(rs); } =20 +static inline void set_requested_service_bitmap(struct rpal_service *rs, i= nt id) +{ + set_bit(id, rs->requested_service_bitmap); +} + +static inline void clear_requested_service_bitmap(struct rpal_service *rs,= int id) +{ + clear_bit(id, rs->requested_service_bitmap); +} + +static int add_mapped_service(struct rpal_service *rs, struct rpal_service= *tgt, + int type_bit) +{ + struct rpal_mapped_service *node; + unsigned long flags; + int ret =3D 0; + + spin_lock_irqsave(&rs->lock, flags); + node =3D rpal_get_mapped_node(rs, tgt->id); + if (type_bit =3D=3D RPAL_REQUEST_MAP) { + if (atomic_read(&rs->req_avail_cnt) =3D=3D 0) { + ret =3D -EINVAL; + goto unlock; + } + } + + if (node->rs =3D=3D 
NULL) { + node->rs =3D rpal_get_service(tgt); + set_bit(type_bit, &node->type); + } else { + if (node->rs !=3D tgt) { + ret =3D -EINVAL; + goto unlock; + } else { + if (test_and_set_bit(type_bit, &node->type)) { + ret =3D -EINVAL; + goto unlock; + } + } + } + + if (type_bit =3D=3D RPAL_REQUEST_MAP) { + set_requested_service_bitmap(rs, tgt->id); + atomic_dec(&rs->req_avail_cnt); + } + +unlock: + spin_unlock_irqrestore(&rs->lock, flags); + return ret; +} + +static void remove_mapped_service(struct rpal_service *rs, int id, int typ= e_bit) +{ + struct rpal_mapped_service *node; + struct rpal_service *t; + unsigned long flags; + + spin_lock_irqsave(&rs->lock, flags); + node =3D rpal_get_mapped_node(rs, id); + if (node->rs =3D=3D NULL) + goto unlock; + + clear_bit(type_bit, &node->type); + if (type_bit =3D=3D RPAL_REQUEST_MAP) { + clear_requested_service_bitmap(rs, id); + atomic_inc(&rs->req_avail_cnt); + } + + if (node->type =3D=3D 0) { + t =3D node->rs; + node->rs =3D NULL; + rpal_put_service(t); + } + +unlock: + spin_unlock_irqrestore(&rs->lock, flags); +} + +static bool ready_to_map(struct rpal_service *cur, int tgt_id) +{ + struct rpal_mapped_service *node; + unsigned long flags; + bool need_map =3D false; + + spin_lock_irqsave(&cur->lock, flags); + node =3D rpal_get_mapped_node(cur, tgt_id); + if (test_bit(RPAL_REQUEST_MAP, &node->type) && + test_bit(RPAL_REVERSE_MAP, &node->type)) { + need_map =3D true; + } + spin_unlock_irqrestore(&cur->lock, flags); + + return need_map; +} + +int rpal_request_service(unsigned long arg) +{ + struct rpal_service *cur, *tgt; + struct rpal_request_arg rra; + long ret =3D 0; + int id; + + cur =3D rpal_current_service(); + + if (copy_from_user(&rra, (void __user *)arg, sizeof(rra))) { + ret =3D -EFAULT; + goto out; + } + + if (cur->key =3D=3D rra.key) { + ret =3D -EINVAL; + goto out; + } + + if (atomic_read(&cur->req_avail_cnt) =3D=3D 0) { + ret =3D -EINVAL; + goto out; + } + + mutex_lock(&cur->mutex); + + if (!cur->enabled) { + ret =3D -EINVAL; + goto unlock_mutex; + } + + tgt =3D rpal_get_service_by_key(rra.key); + if (tgt =3D=3D NULL) { + ret =3D -EINVAL; + goto unlock_mutex; + } + + if (!tgt->enabled) { + ret =3D -EPERM; + goto put_service; + } + + ret =3D put_user((unsigned long)(tgt->rsm.user_meta), rra.user_metap); + if (ret) { + ret =3D -EFAULT; + goto put_service; + } + + ret =3D put_user(tgt->id, rra.id); + if (ret) { + ret =3D -EFAULT; + goto put_service; + } + + id =3D tgt->id; + ret =3D add_mapped_service(cur, tgt, RPAL_REQUEST_MAP); + if (ret < 0) + goto put_service; + + ret =3D add_mapped_service(tgt, cur, RPAL_REVERSE_MAP); + if (ret < 0) + goto remove_request; + + /* only map shared address space when both process request each other */ + if (ready_to_map(cur, id)) { + ret =3D rpal_map_service(tgt); + if (ret < 0) + goto remove_reverse; + } + + mutex_unlock(&cur->mutex); + + rpal_put_service(tgt); + + return 0; + +remove_reverse: + remove_mapped_service(tgt, cur->id, RPAL_REVERSE_MAP); +remove_request: + remove_mapped_service(cur, tgt->id, RPAL_REQUEST_MAP); +put_service: + rpal_put_service(tgt); +unlock_mutex: + mutex_unlock(&cur->mutex); +out: + return ret; +} + +static int release_service(struct rpal_service *cur, struct rpal_service *= tgt) +{ + remove_mapped_service(tgt, cur->id, RPAL_REVERSE_MAP); + remove_mapped_service(cur, tgt->id, RPAL_REQUEST_MAP); + rpal_unmap_service(tgt); + + return 0; +} + +static void rpal_release_service_all(void) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct rpal_service *tgt; + int ret, 
i; + + rpal_for_each_requested_service(cur, i) { + struct rpal_mapped_service *node; + + if (i =3D=3D cur->id) + continue; + node =3D rpal_get_mapped_node(cur, i); + tgt =3D rpal_get_service(node->rs); + if (!tgt) + continue; + + if (test_bit(RPAL_REQUEST_MAP, &node->type)) { + ret =3D release_service(cur, tgt); + if (unlikely(ret)) { + rpal_err("service %d release service %d fail\n", + cur->id, tgt->id); + } + } + rpal_put_service(tgt); + } +} + +int rpal_release_service(u64 key) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct rpal_service *tgt =3D NULL; + struct rpal_mapped_service *node; + int ret =3D 0; + int i; + + mutex_lock(&cur->mutex); + + if (cur->key =3D=3D key) { + ret =3D -EINVAL; + goto unlock_mutex; + } + + rpal_for_each_requested_service(cur, i) { + node =3D rpal_get_mapped_node(cur, i); + if (node->rs->key =3D=3D key) { + tgt =3D rpal_get_service(node->rs); + break; + } + } + + if (!tgt) { + ret =3D -EINVAL; + goto unlock_mutex; + } + + ret =3D release_service(cur, tgt); + + rpal_put_service(tgt); + +unlock_mutex: + mutex_unlock(&cur->mutex); + return ret; +} + int rpal_enable_service(unsigned long arg) { struct rpal_service *cur =3D rpal_current_service(); @@ -270,6 +529,8 @@ int rpal_disable_service(void) goto unlock_mutex; } =20 + rpal_release_service_all(); + unlock_mutex: mutex_unlock(&cur->mutex); return ret; @@ -289,11 +550,11 @@ void exit_rpal(bool group_dead) if (!rs) return; =20 - exit_rpal_thread(); - if (group_dead) rpal_disable_service(); =20 + exit_rpal_thread(); + current->rpal_rs =3D NULL; rpal_put_service(rs); =20 diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 2e5010602177..1fe177523a36 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -77,6 +77,9 @@ #define RPAL_ADDRESS_SPACE_LOW ((0UL) + RPAL_ADDR_SPACE_SIZE) #define RPAL_ADDRESS_SPACE_HIGH ((0UL) + RPAL_NR_ADDR_SPACE * RPAL_ADDR_SP= ACE_SIZE) =20 +/* No more than 15 services can be requested due to limitation of MPK. */ +#define MAX_REQUEST_SERVICE 15 + extern unsigned long rpal_cap; =20 enum rpal_task_flag_bits { @@ -92,6 +95,18 @@ struct rpal_service_metadata { void __user *user_meta; }; =20 +struct rpal_request_arg { + unsigned long version; + u64 key; + unsigned long __user *user_metap; + int __user *id; +}; + +struct rpal_mapped_service { + unsigned long type; + struct rpal_service *rs; +}; + /* * Each RPAL process (a.k.a RPAL service) should have a pointer to * struct rpal_service in all its tasks' task_struct. 
@@ -125,6 +140,8 @@ struct rpal_service { */ /* Mutex for time consuming operations */ struct mutex mutex; + /* spinlock for short operations */ + spinlock_t lock; =20 /* pinned pages */ int nr_shared_pages; @@ -137,6 +154,13 @@ struct rpal_service { bool enabled; struct rpal_service_metadata rsm; =20 + /* the number of services allow to be requested */ + atomic_t req_avail_cnt; + + /* map for services required, being required and mapped */ + struct rpal_mapped_service service_map[RPAL_NR_ID]; + DECLARE_BITMAP(requested_service_bitmap, RPAL_NR_ID); + /* delayed service put work */ struct delayed_work delayed_put_work; =20 @@ -220,6 +244,8 @@ enum rpal_command_type { RPAL_CMD_UNREGISTER_RECEIVER, RPAL_CMD_ENABLE_SERVICE, RPAL_CMD_DISABLE_SERVICE, + RPAL_CMD_REQUEST_SERVICE, + RPAL_CMD_RELEASE_SERVICE, RPAL_NR_CMD, }; =20 @@ -244,6 +270,16 @@ enum rpal_command_type { _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_ENABLE_SERVICE, unsigned long) #define RPAL_IOCTL_DISABLE_SERVICE \ _IO(RPAL_IOCTL_MAGIC, RPAL_CMD_DISABLE_SERVICE) +#define RPAL_IOCTL_REQUEST_SERVICE \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REQUEST_SERVICE, unsigned long) +#define RPAL_IOCTL_RELEASE_SERVICE \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_RELEASE_SERVICE, unsigned long) + +#define rpal_for_each_requested_service(rs, idx) = \ + for (idx =3D find_first_bit(rs->requested_service_bitmap, RPAL_NR_ID); \ + idx < RPAL_NR_ID; \ + idx =3D find_next_bit(rs->requested_service_bitmap, RPAL_NR_ID, \ + idx + 1)) =20 /** * @brief get new reference to a rpal service, a corresponding @@ -274,6 +310,12 @@ static inline unsigned long rpal_get_top(struct rpal_s= ervice *rs) return rs->base + RPAL_ADDR_SPACE_SIZE; } =20 +static inline struct rpal_mapped_service * +rpal_get_mapped_node(struct rpal_service *rs, int id) +{ + return &rs->service_map[id]; +} + #ifdef CONFIG_RPAL static inline struct rpal_service *rpal_current_service(void) { --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pg1-f176.google.com (mail-pg1-f176.google.com [209.85.215.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 902DF212B0A for ; Fri, 30 May 2025 09:31:31 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.176 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597493; cv=none; b=ay+Bf4Qx6lpPWhtaYDdRKgmX1Rj6WuesZoxHgifrNVaduqBp1ScQEdf6dEC7tKw3D6yZ+W/1syIZU6drbz9tc3zE+F0XQ4GvjbLa3JaO+mhXiDe+ll4m4fjHAP+Gm4GS+BG31EN/gEmODTfJQxIhpjTfQ9Kb892CDpr8+z32eoA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597493; c=relaxed/simple; bh=CSDCXobBWZhtRTJYQJOr5+LQO5QZF5XuFGMveHLRtkY=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=BqOeal2+/EsmNGtwnAB+vOAzGg290CE2+dF+qmoO8buSBg/HiaJkQsIlrKVqXXVq3Lm+m3DPBh4A0PjHnKVnC/3leyV1a/NVJEW1fTNOTaAuwNUKVur9rN0xWt0AjQdIQfUC3Dzxvw1YBqIC4i4xPQIKVVqsJmOPQK7xym6pdiE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=QMN/uMDJ; arc=none smtp.client-ip=209.85.215.176 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; 
dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="QMN/uMDJ" Received: by mail-pg1-f176.google.com with SMTP id 41be03b00d2f7-b13e0471a2dso1244359a12.2 for ; Fri, 30 May 2025 02:31:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597491; x=1749202291; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=rWr/E1U9MoZB6J4RM+OE8fwerNiGURyoNSQ2TTxCRRo=; b=QMN/uMDJAC+cia0D5Uie7ylDSB8V955tA2GjAZEnP/hpfefNtYZ2IY9p7iwvn3gjI/ y5nHUjhDjU7dtFQyJkLpP7cilJVsUFSds/l+kvPKRTW/6Ovy1bYNpnvZdgEJ9lQ7Wmb6 DQipz4gUfcFAB15jSSj6KTRpqTq5auBHaRUOTfIzpQUcMVWimE9wD9DmOijdlRvEg04P SvIxTF1nIqA9QIu8kxTJwkl/ANtRZsdPSI3GZfdSyL4r96RjnGHu4Vxni0w7VpDUpldH Szljcpzm05jHfapokWUMaZ8sdnRyXtpaIy0yx4Bd051fqrT6zKzazXQeUtiFLx1YM+C7 SiTA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597491; x=1749202291; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=rWr/E1U9MoZB6J4RM+OE8fwerNiGURyoNSQ2TTxCRRo=; b=PV+cM8sF/1du80yVNnKjQv1VfGeuIRyTam8bNVIHgrXL58RUSXX19Ao77OILHzGHNd YD5xJd4HPGFNYSr9N9Il/nrz8/dDnBsw48yzV+r9BTM4BrdfCAiuqkiZAPpI280b8otV 7BwsHWYGkd0kq9kHTg3fh9Xmu4qgj7Gdwu0c365rudaj0OsQ3u3XxT+Iiz5scsqE5b7M u5zR/e1ikY+2LjNw8QLFfU38XlYmYsVK97hd2Pfv17e4X04Bf2drFWnrRGSPyxOg1a+4 +VZtRyrWoMXph/RmTYhK6ovk2veAMD5Mt1ilt5nQziZD7Ywl8XzSC6360Rs1GblddQ4P FLIQ== X-Forwarded-Encrypted: i=1; AJvYcCWCS3/uHgA/BvpUoOZmnK3oGAgn+Lm0KamJNksiHnpun0x6MiYKUChJsLLeqHjpDQRTz7UEOD9htFRPzSc=@vger.kernel.org X-Gm-Message-State: AOJu0YzU/kz0wj23Hxv3TjslWztHx5y/BrBJ1l+0rmUkjfVhSvTrE/Iv us54uChL6Br9+HcskZBao0TlGbCIG74F+Ai7euNNgo9puo5V/SWYhRys46x5g1pZoSY= X-Gm-Gg: ASbGnctLj7MUg3fqK0gRNoAGPCs5v1w3+18GZ8WSGXm6I74jzudKpm762wGI/7vZgm1 OFy9lLhPVgJpD7BvuQE9afZyyEWdtmnElgKYdSjhf68+AdtEq8oLo/V/UzRvEE6DKLni17OR6zr AwwU12slqaEH+eDA4oi8OKxeT+lZo6TcrQHCeC3azxX2vboCiVdvWsk+sUSzA3abHtFDSOEqEI8 /GqX1c11Wq6vM0zjSXY0C+rXIAuK6x/CU7gbEImgCDqHqRvUUBTxZqSTcPSQEHLt6VsvpeDltlR V5o6YF1ktnxkBX0ZlQ+8S/ZktMSJu/WlqJRdj45oSEd55pRwyTs29im5bDskdUPKx4Y/8JCmlOY PPuxxk9pCtfmRZ8xPMiUMCHeDlzvfo6k= X-Google-Smtp-Source: AGHT+IHWxhBqxJC7pLUfy1wWPlfI3PTo1BKYuowDxnWEkooeR6TUqW9HSM93HT5AUTV5EBB5Y+mNrQ== X-Received: by 2002:a17:90b:48d2:b0:311:ad7f:3281 with SMTP id 98e67ed59e1d1-3125036ba85mr2606005a91.12.1748597490626; Fri, 30 May 2025 02:31:30 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.31.15 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:31:30 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, 
jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 12/35] RPAL: enable service disable notification Date: Fri, 30 May 2025 17:27:40 +0800 Message-Id: <20a49e36e1efc99b1489d81eb7b5fe8787fcee4c.1748594840.git.libo.gcs85@bytedance.com> X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" When a service is disabled, all services that request this service need to be notified. This patch use poll functions of file to implement such notification. When a service is disabled, it will notify all services that request it by set bit in others services' dead_key_bitmap. And the poll function will then issue a poll epoll event, other services can aware the service has been disabled. The key of disabled service can be read from the proc file. Signed-off-by: Bo Li --- arch/x86/rpal/proc.c | 61 +++++++++++++++++++++++++++++++++++++++++ arch/x86/rpal/service.c | 37 +++++++++++++++++++++++-- include/linux/rpal.h | 10 +++++++ 3 files changed, 106 insertions(+), 2 deletions(-) diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c index f001afd40562..16ac9612bfc5 100644 --- a/arch/x86/rpal/proc.c +++ b/arch/x86/rpal/proc.c @@ -8,6 +8,7 @@ =20 #include #include +#include =20 #include "internal.h" =20 @@ -82,10 +83,70 @@ static long rpal_ioctl(struct file *file, unsigned int = cmd, unsigned long arg) return ret; } =20 +static ssize_t rpal_read(struct file *file, char __user *buf, size_t count, + loff_t *pos) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct rpal_poll_data *rpd; + u64 released_keys[MAX_REQUEST_SERVICE]; + unsigned long flags; + int nr_key =3D 0; + int nr_byte =3D 0; + int idx; + + if (!cur) + return -EINVAL; + + rpd =3D &cur->rpd; + + spin_lock_irqsave(&rpd->poll_lock, flags); + idx =3D find_first_bit(rpd->dead_key_bitmap, RPAL_NR_ID); + while (idx < RPAL_NR_ID) { + released_keys[nr_key++] =3D rpd->dead_keys[idx]; + idx =3D find_next_bit(rpd->dead_key_bitmap, RPAL_NR_ID, idx + 1); + } + spin_unlock_irqrestore(&rpd->poll_lock, flags); + nr_byte =3D nr_key * sizeof(u64); + + if (copy_to_user(buf, released_keys, nr_byte)) { + nr_byte =3D -EAGAIN; + goto out; + } +out: + return nr_byte; +} + +static __poll_t rpal_poll(struct file *filep, struct poll_table_struct *wa= it) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct rpal_poll_data *rpd; + unsigned long flags; + __poll_t mask =3D 0; + + if (unlikely(!cur)) { + rpal_err("Not a rpal service\n"); + goto out; + } + + rpd =3D &cur->rpd; + + poll_wait(filep, &rpd->rpal_waitqueue, wait); + + spin_lock_irqsave(&rpd->poll_lock, flags); + if (find_first_bit(rpd->dead_key_bitmap, RPAL_NR_ID) < RPAL_NR_ID) + mask |=3D EPOLLIN | EPOLLRDNORM; + spin_unlock_irqrestore(&rpd->poll_lock, flags); + +out: + return mask; +} + const struct proc_ops proc_rpal_operations =3D { .proc_open =3D rpal_open, + .proc_read =3D rpal_read, .proc_ioctl =3D rpal_ioctl, .proc_mmap =3D rpal_mmap, + .proc_poll =3D rpal_poll, }; 
=20 static int __init proc_rpal_init(void) diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c index 16a2155873a1..f490ab07301d 100644 --- a/arch/x86/rpal/service.c +++ b/arch/x86/rpal/service.c @@ -181,6 +181,9 @@ struct rpal_service *rpal_register_service(void) atomic_set(&rs->req_avail_cnt, MAX_REQUEST_SERVICE); bitmap_zero(rs->requested_service_bitmap, RPAL_NR_ID); spin_lock_init(&rs->lock); + spin_lock_init(&rs->rpd.poll_lock); + bitmap_zero(rs->rpd.dead_key_bitmap, RPAL_NR_ID); + init_waitqueue_head(&rs->rpd.rpal_waitqueue); =20 rs->bad_service =3D false; rs->base =3D calculate_base_address(rs->id); @@ -296,6 +299,7 @@ static void remove_mapped_service(struct rpal_service *= rs, int id, int type_bit) =20 clear_bit(type_bit, &node->type); if (type_bit =3D=3D RPAL_REQUEST_MAP) { + clear_bit(id, rs->rpd.dead_key_bitmap); clear_requested_service_bitmap(rs, id); atomic_inc(&rs->req_avail_cnt); } @@ -424,15 +428,30 @@ static int release_service(struct rpal_service *cur, = struct rpal_service *tgt) return 0; } =20 +static void rpal_notify_disable(struct rpal_poll_data *rpd, u64 key, int i= d) +{ + unsigned long flags; + bool need_wake =3D false; + + spin_lock_irqsave(&rpd->poll_lock, flags); + if (!test_bit(id, rpd->dead_key_bitmap)) { + need_wake =3D true; + rpd->dead_keys[id] =3D key; + set_bit(id, rpd->dead_key_bitmap); + } + spin_unlock_irqrestore(&rpd->poll_lock, flags); + if (need_wake) + wake_up_interruptible(&rpd->rpal_waitqueue); +} + static void rpal_release_service_all(void) { struct rpal_service *cur =3D rpal_current_service(); struct rpal_service *tgt; + struct rpal_mapped_service *node; int ret, i; =20 rpal_for_each_requested_service(cur, i) { - struct rpal_mapped_service *node; - if (i =3D=3D cur->id) continue; node =3D rpal_get_mapped_node(cur, i); @@ -449,6 +468,20 @@ static void rpal_release_service_all(void) } rpal_put_service(tgt); } + + for (i =3D 0; i < RPAL_NR_ID; i++) { + if (i =3D=3D cur->id) + continue; + + node =3D rpal_get_mapped_node(cur, i); + tgt =3D rpal_get_service(node->rs); + if (!tgt) + continue; + + if (test_bit(RPAL_REVERSE_MAP, &node->type)) + rpal_notify_disable(&tgt->rpd, cur->key, cur->id); + rpal_put_service(tgt); + } } =20 int rpal_release_service(u64 key) diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 1fe177523a36..b9622f0235bf 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -107,6 +107,13 @@ struct rpal_mapped_service { struct rpal_service *rs; }; =20 +struct rpal_poll_data { + spinlock_t poll_lock; + u64 dead_keys[RPAL_NR_ID]; + DECLARE_BITMAP(dead_key_bitmap, RPAL_NR_ID); + wait_queue_head_t rpal_waitqueue; +}; + /* * Each RPAL process (a.k.a RPAL service) should have a pointer to * struct rpal_service in all its tasks' task_struct. 
@@ -161,6 +168,9 @@ struct rpal_service { struct rpal_mapped_service service_map[RPAL_NR_ID]; DECLARE_BITMAP(requested_service_bitmap, RPAL_NR_ID); =20 + /* Notify service is released by others */ + struct rpal_poll_data rpd; + /* delayed service put work */ struct delayed_work delayed_put_work; =20 --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pj1-f41.google.com (mail-pj1-f41.google.com [209.85.216.41]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9A30C2222C1 for ; Fri, 30 May 2025 09:31:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.41 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597508; cv=none; b=YLsTIkZyr/Lxj3xCg3gqtxROwEMQ4aCTWGsGdOt2faJrA08J6RLeoajEZvQAE2SeIO26phg5/XB5Sa5mH99x/IDf2lfi7jCybijaUTjRHPojwGwCI4+MtmsUraqnn/FndaRAJ5BATfnLSmWVwFlhPPBExt9u+mKay59n3q3IKpE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597508; c=relaxed/simple; bh=x42OvKF8cu3SGLWZonaQOFG1mLKzqSw7oCqoTulLIhU=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=RnyYX6BhRrqJjsUPgCdp3Cp7NgmeRhZe+UZXmUctQyN0EW1lZGQjc+KxhON7z2iA1fMCpOwMCrZ0dyzIuUSWyWll7oiOLylCNOd7rIqjPHsBVYUtnislMmL1m2LJEEQ9e5aD7KwahPXkIUrFIW50dTd4yaI6s6nY6McS6Lcv+Xk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=UIhFXuhK; arc=none smtp.client-ip=209.85.216.41 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="UIhFXuhK" Received: by mail-pj1-f41.google.com with SMTP id 98e67ed59e1d1-3124f18c214so295304a91.2 for ; Fri, 30 May 2025 02:31:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597506; x=1749202306; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=lqEqFVUYhIQ6uCyFMSKyc3XjCtHWAqaCcpg8jFQsxWM=; b=UIhFXuhKkT4Y96MhzklPzS+CDHR0Hw8VJ3D0QgE5uV/f0ZucyRKOfLHwT9vxTiLErV j/R1Ct0aZGJOi9mfTNDGtDO/U7JHgHzd0D/7f3Z02wZCMehFDya7ysKmcdrLPqfa+Kgq jdeAef2VgY6bSAmZeYcaxI75LLX8O8+ImWRe5C6OQmjNP64UgseKPtNB8/sBDld/OO7E gBRLKl/EAGe2+e3mvcMrEiNKYFfPhqBslo/LfYZt0in/HfDVApFPGY4Fsbwe7TO0Ojl4 7tswhefuv5qkD9ggG7FBUkyGDBpgw0LY3WDOgW40Ng7qocSI1ce1YHVfSPN9ASiwORY7 nRAw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597506; x=1749202306; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=lqEqFVUYhIQ6uCyFMSKyc3XjCtHWAqaCcpg8jFQsxWM=; b=UKpUOSuoLfv1uW43q6yaptbWcjp7xh465EXaXSJSastvM+9VuydNfi88Pd116aYjFi 5ZKtyDEjYWAGs0jSS/hnWbE4EMgONA4/xTuKVcfd3YBdXQhVMkNd0SipBWcuWSPeSApx OS4vgH1QUstpX8waKEKDqHvEzAREto1+iXZ1FSQ/cem8y3L5ZMMGod8hl6nRFSz1Rjzt FeF/XRRmLPdsHHLG1x05cbpgWxMvoGe6dEcFOSCpB8/UKk25Seah/WLinIsTBRk/fPzn /4KqPIPUl4BLF35InyViOSMtJWjaEPsSWAByucUGQFBwaCtdL84XUGA9NQ1usJ+Amufe TC+Q== 
X-Forwarded-Encrypted: i=1; AJvYcCUnNdgfWmbu1POJvhmU7UWHONXjlki01oJytfxO/k1uJbLmKgZ2bjYe7gBW5MecgmEurRquGGu4HuhCXbw=@vger.kernel.org X-Gm-Message-State: AOJu0YxGiRPUhIhEalpoe/onAzJyH7pwyPpUDEMViu1/DWFqbZvLTEE0 srERtIVcCEwJfr8V9fRjyfeNqdBbjC6vd086022ICpmYDR0yyv97RftSt6cAteOjVkc= X-Gm-Gg: ASbGnct+WzVXOsNaxsH7JZme53xj0AdZv3DsA0bdi96bI83UZAmOYCsBBTYmC89E20Z Q/tvl77ELaEAmZVpc1Zv0ZHJA70kukZ06y92SP//a6PWECoWmxpNppbQjyYHZgwt+zR2Zx5RKJg RWp10FImbin9oYGforzrEDdnc34TmRxWQN8gSAaC5vyy8lJk+O4BNaYVanBp20869LaV/mvSgqY UvY95oyAc6qISlm0izmedP6lHhSpDJ1PhTg24AghYuf+xeoQgiFcYnYkjP/w44VOYaKk8yZWNil u6kAMH5nEbG1ML2tPSdbHgQGOei/iWf+axZ5z9QdSKpnbLJUGPQAT8/oCUz1ayy6u8S8DFamrUX CFJj8Bv40UQ== X-Google-Smtp-Source: AGHT+IGF62FX0HQNv9LilxlCiViF0h8d1bY/G1W0X5Q0qQ2cdYg03wna6Pf30NCPziT41+dazZ7wew== X-Received: by 2002:a17:90b:3b50:b0:312:1c83:58e9 with SMTP id 98e67ed59e1d1-3124150e464mr3530070a91.5.1748597505955; Fri, 30 May 2025 02:31:45 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.31.31 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:31:45 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 13/35] RPAL: add tlb flushing support Date: Fri, 30 May 2025 17:27:41 +0800 Message-Id: X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" When a thread flushes the TLB, since the address space is shared, not only other threads in the current process but also other processes that share the address space may access the corresponding memory (related to the TLB flush). Therefore, the cpuset used for TLB flushing should be the union of the mm_cpumasks of all processes that share the address space. This patch extend flush_tlb_info to store other process's mm_struct, and when a CPU in the union of the mm_cpumasks if invoked to handle tlb flushing, it will check whether cpu_tlbstate.loaded_mm matches any of mm_structs stored in flush_tlb_info. If match, the CPU will do local tlb flushing for that mm_struct. 
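In short, the flow added below can be sketched as follows (locking,
reference counting and the local-only fast path are omitted, and
partner_mm() is a hypothetical stand-in for the service_map lookup done
by the real rpal_flush_tlb_mm_range()):

  /* Build the union of mm_cpumask() over every mm sharing the address space. */
  cpumask_copy(&merged_mask, mm_cpumask(mm));
  rpal_for_each_requested_service(cur, i) {
          tgt_mm = partner_mm(cur, i);            /* hypothetical helper */
          info->mm_list[nr_mm] = tgt_mm;
          info->tlb_gen_list[nr_mm++] = inc_mm_tlb_gen(tgt_mm);
          cpumask_or(&merged_mask, &merged_mask, mm_cpumask(tgt_mm));
  }
  info->nr_mm = nr_mm;
  /*
   * Remote CPUs run rpal_flush_tlb_func_remote(): they flush if their
   * loaded_mm equals info->mm or any entry of info->mm_list[].
   */
  rpal_flush_tlb_func_multi(&merged_mask, info);
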
Signed-off-by: Bo Li --- arch/x86/include/asm/tlbflush.h | 10 ++ arch/x86/mm/tlb.c | 172 ++++++++++++++++++++++++++++++++ arch/x86/rpal/internal.h | 3 - include/linux/rpal.h | 12 +++ mm/rmap.c | 4 + 5 files changed, 198 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflus= h.h index e9b81876ebe4..f57b745af75c 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -227,6 +227,11 @@ struct flush_tlb_info { u8 stride_shift; u8 freed_tables; u8 trim_cpumask; +#ifdef CONFIG_RPAL + struct mm_struct **mm_list; + u64 *tlb_gen_list; + int nr_mm; +#endif }; =20 void flush_tlb_local(void); @@ -356,6 +361,11 @@ static inline void arch_tlbbatch_add_pending(struct ar= ch_tlbflush_unmap_batch *b mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL); } =20 +#ifdef CONFIG_RPAL +void rpal_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch, + struct mm_struct *mm); +#endif + static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm) { flush_tlb_mm(mm); diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 39f80111e6f1..a0fe17b13887 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -12,6 +12,7 @@ #include #include #include +#include =20 #include #include @@ -1361,6 +1362,169 @@ void flush_tlb_multi(const struct cpumask *cpumask, __flush_tlb_multi(cpumask, info); } =20 +#ifdef CONFIG_RPAL +static void rpal_flush_tlb_func_remote(void *info) +{ + struct mm_struct *loaded_mm =3D this_cpu_read(cpu_tlbstate.loaded_mm); + struct flush_tlb_info *f =3D info; + struct flush_tlb_info tf =3D *f; + int i; + + /* As it comes from RPAL path, f->mm cannot be NULL */ + if (f->mm =3D=3D loaded_mm) { + flush_tlb_func(f); + return; + } + + for (i =3D 0; i < f->nr_mm; i++) { + /* We always have f->mm_list[i] !=3D NULL */ + if (f->mm_list[i] =3D=3D loaded_mm) { + tf.mm =3D f->mm_list[i]; + tf.new_tlb_gen =3D f->tlb_gen_list[i]; + flush_tlb_func(&tf); + return; + } + } +} + +static void rpal_flush_tlb_func_multi(const struct cpumask *cpumask, + const struct flush_tlb_info *info) +{ + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH); + if (info->end =3D=3D TLB_FLUSH_ALL) + trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL); + else + trace_tlb_flush(TLB_REMOTE_SEND_IPI, + (info->end - info->start) >> PAGE_SHIFT); + + if (info->freed_tables || mm_in_asid_transition(info->mm)) + on_each_cpu_mask(cpumask, rpal_flush_tlb_func_remote, + (void *)info, true); + else + on_each_cpu_cond_mask(should_flush_tlb, + rpal_flush_tlb_func_remote, (void *)info, + 1, cpumask); +} + +static void rpal_flush_tlb_func_local(struct mm_struct *mm, int cpu, + struct flush_tlb_info *info, + u64 new_tlb_gen) +{ + struct mm_struct *loaded_mm =3D this_cpu_read(cpu_tlbstate.loaded_mm); + + if (loaded_mm =3D=3D info->mm) { + lockdep_assert_irqs_enabled(); + local_irq_disable(); + flush_tlb_func(info); + local_irq_enable(); + } else { + int i; + + for (i =3D 0; i < info->nr_mm; i++) { + if (info->mm_list[i] =3D=3D loaded_mm) { + lockdep_assert_irqs_enabled(); + local_irq_disable(); + info->mm =3D info->mm_list[i]; + info->new_tlb_gen =3D info->tlb_gen_list[i]; + flush_tlb_func(info); + info->mm =3D mm; + info->new_tlb_gen =3D new_tlb_gen; + local_irq_enable(); + } + } + } +} + +static void rpal_flush_tlb_mm_range(struct mm_struct *mm, int cpu, + struct flush_tlb_info *info, u64 new_tlb_gen) +{ + struct rpal_service *cur =3D mm->rpal_rs; + cpumask_t merged_mask; + struct mm_struct *mm_list[MAX_REQUEST_SERVICE]; + u64 tlb_gen_list[MAX_REQUEST_SERVICE]; + 
int nr_mm =3D 0; + int i; + + cpumask_copy(&merged_mask, mm_cpumask(mm)); + if (cur) { + struct rpal_service *tgt; + struct mm_struct *tgt_mm; + + rpal_for_each_requested_service(cur, i) { + struct rpal_mapped_service *node; + + if (i =3D=3D cur->id) + continue; + node =3D rpal_get_mapped_node(cur, i); + if (!rpal_service_mapped(node)) + continue; + + tgt =3D rpal_get_service(node->rs); + if (!tgt) + continue; + tgt_mm =3D tgt->mm; + if (!mmget_not_zero(tgt_mm)) { + rpal_put_service(tgt); + continue; + } + mm_list[nr_mm] =3D tgt_mm; + tlb_gen_list[nr_mm] =3D inc_mm_tlb_gen(tgt_mm); + + nr_mm++; + cpumask_or(&merged_mask, &merged_mask, + mm_cpumask(tgt_mm)); + rpal_put_service(tgt); + } + info->mm_list =3D mm_list; + info->tlb_gen_list =3D tlb_gen_list; + info->nr_mm =3D nr_mm; + } + + if (cpumask_any_but(&merged_mask, cpu) < nr_cpu_ids) + rpal_flush_tlb_func_multi(&merged_mask, info); + else + rpal_flush_tlb_func_local(mm, cpu, info, new_tlb_gen); + + for (i =3D 0; i < nr_mm; i++) + mmput_async(mm_list[i]); +} + +void rpal_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch, + struct mm_struct *mm) +{ + struct rpal_service *cur =3D mm->rpal_rs; + struct rpal_service *tgt; + struct mm_struct *tgt_mm; + int i; + + rpal_for_each_requested_service(cur, i) { + struct rpal_mapped_service *node; + + if (i =3D=3D cur->id) + continue; + + node =3D rpal_get_mapped_node(cur, i); + if (!rpal_service_mapped(node)) + continue; + + tgt =3D rpal_get_service(node->rs); + if (!tgt) + continue; + tgt_mm =3D tgt->mm; + if (!mmget_not_zero(tgt_mm)) { + rpal_put_service(tgt); + continue; + } + inc_mm_tlb_gen(tgt_mm); + cpumask_or(&batch->cpumask, &batch->cpumask, + mm_cpumask(tgt_mm)); + mmu_notifier_arch_invalidate_secondary_tlbs(tgt_mm, 0, -1UL); + rpal_put_service(tgt); + mmput_async(tgt_mm); + } +} +#endif + /* * See Documentation/arch/x86/tlb.rst for details. We choose 33 * because it is large enough to cover the vast majority (at @@ -1439,6 +1603,11 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsign= ed long start, info =3D get_flush_tlb_info(mm, start, end, stride_shift, freed_tables, new_tlb_gen); =20 +#if IS_ENABLED(CONFIG_RPAL) + if (mm->rpal_rs) + rpal_flush_tlb_mm_range(mm, cpu, info, new_tlb_gen); + else { +#endif /* * flush_tlb_multi() is not optimized for the common case in which only * a local TLB flush is needed. Optimize this use-case by calling @@ -1456,6 +1625,9 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigne= d long start, flush_tlb_func(info); local_irq_enable(); } +#if IS_ENABLED(CONFIG_RPAL) + } +#endif =20 put_flush_tlb_info(); put_cpu(); diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h index c504b6efff64..cf6d608a994a 100644 --- a/arch/x86/rpal/internal.h +++ b/arch/x86/rpal/internal.h @@ -12,9 +12,6 @@ #include #include =20 -#define RPAL_REQUEST_MAP 0x1 -#define RPAL_REVERSE_MAP 0x2 - extern bool rpal_inited; =20 /* service.c */ diff --git a/include/linux/rpal.h b/include/linux/rpal.h index b9622f0235bf..36be1ab6a9f3 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -80,6 +80,11 @@ /* No more than 15 services can be requested due to limitation of MPK. 
*/ #define MAX_REQUEST_SERVICE 15 =20 +enum { + RPAL_REQUEST_MAP, + RPAL_REVERSE_MAP, +}; + extern unsigned long rpal_cap; =20 enum rpal_task_flag_bits { @@ -326,6 +331,13 @@ rpal_get_mapped_node(struct rpal_service *rs, int id) return &rs->service_map[id]; } =20 +static inline bool rpal_service_mapped(struct rpal_mapped_service *node) +{ + unsigned long type =3D (1 << RPAL_REQUEST_MAP) | (1 << RPAL_REVERSE_MAP); + + return (node->type & type) =3D=3D type; +} + #ifdef CONFIG_RPAL static inline struct rpal_service *rpal_current_service(void) {
diff --git a/mm/rmap.c b/mm/rmap.c index 67bb273dfb80..e68384f97ab9 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -682,6 +682,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct= *mm, pte_t pteval, return; =20 arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, start, end); +#ifdef CONFIG_RPAL + if (mm->rpal_rs) + rpal_tlbbatch_add_pending(&tlb_ubc->arch, mm); +#endif tlb_ubc->flush_required =3D true; =20 /*
--=20
2.20.1

From nobody Wed Feb 11 03:41:56 2026
From: Bo Li
To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org
Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li
Subject: [RFC v2 14/35] RPAL: enable page fault handling
Date: Fri, 30 May 2025 17:27:42 +0800
X-Mailer: git-send-email 2.39.5 (Apple Git-154)
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

RPAL's address space sharing allows one process to access the memory of another process, which may trigger page faults.
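As a rough illustration of how such a faulting address can be attributed to the service that owns it, the sketch below mimics the fixed per-service slicing of the shared address space used by this series (compare rpal_get_mapped_service_by_addr() later in this patch). The constant values, macro names, and helper name are placeholders chosen for illustration, not the patch's actual definitions.

/*
 * Illustrative user-space sketch only (not the patch's code): each RPAL
 * service owns one fixed-size slice of a shared address range, so a
 * faulting address can be mapped back to the service id that owns it.
 * All numeric values below are made-up placeholders.
 */
#include <stdio.h>

#define ADDRESS_SPACE_LOW  0x100000000000UL  /* placeholder base of shared range */
#define ADDR_SPACE_SIZE    0x008000000000UL  /* placeholder slice size per service */
#define MAX_SERVICES       16                /* placeholder service count */

/* Return the id of the service whose slice contains addr, or -1. */
static int addr_to_service_id(unsigned long addr)
{
	if (addr < ADDRESS_SPACE_LOW ||
	    addr >= ADDRESS_SPACE_LOW + (unsigned long)MAX_SERVICES * ADDR_SPACE_SIZE)
		return -1;  /* not an RPAL-managed address */
	return (int)((addr - ADDRESS_SPACE_LOW) / ADDR_SPACE_SIZE);
}

int main(void)
{
	/* An address inside the fourth slice belongs to service 3. */
	unsigned long fault_addr = ADDRESS_SPACE_LOW + 3 * ADDR_SPACE_SIZE + 0x1234;

	printf("fault belongs to service %d\n", addr_to_service_id(fault_addr));
	return 0;
}

A fault at such an address is then handled with that service's mm_struct, as the rest of this commit message describes.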
To ensure programs can run normally, RPAL needs to handle page faults occurring in the address space of other processes. Additionally, to prevent processes from generating coredumps due to invalid memory in other processes, RPAL must also restore the current thread state to a pre-saved state under specific circumstances. For handling page faults, by passing the correct vm_area_struct to handle_page_fault(), RPAL locates the process corresponding to the address where the page fault occurred and uses its mm_struct to handle the page fault. Regarding thread state restoration, RPAL restores the thread's state to a predefined state in userspace when it cannot locate the mm_struct of the corresponding process (i.e., when the process has already exited). Signed-off-by: Bo Li --- arch/x86/mm/fault.c | 271 ++++++++++++++++++++++++++++++++++++++++ arch/x86/rpal/mm.c | 34 +++++ arch/x86/rpal/service.c | 24 ++++ arch/x86/rpal/thread.c | 23 ++++ include/linux/rpal.h | 81 ++++++++---- 5 files changed, 412 insertions(+), 21 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 998bd807fc7b..35f7c60a5e4f 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -19,6 +19,7 @@ #include #include /* find_and_lock_vma() */ #include +#include =20 #include /* boot_cpu_has, ... */ #include /* dotraplinkage, ... */ @@ -1460,6 +1461,268 @@ trace_page_fault_entries(struct pt_regs *regs, unsi= gned long error_code, trace_page_fault_kernel(address, regs, error_code); } =20 +#if IS_ENABLED(CONFIG_RPAL) +static void rpal_do_user_addr_fault(struct pt_regs *regs, unsigned long er= ror_code, + unsigned long address, struct mm_struct *real_mm) +{ + struct vm_area_struct *vma; + vm_fault_t fault; + unsigned int flags =3D FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; + + if (unlikely(error_code & X86_PF_RSVD)) + pgtable_bad(regs, error_code, address); + + if (unlikely(cpu_feature_enabled(X86_FEATURE_SMAP) && + !(error_code & X86_PF_USER) && + !(regs->flags & X86_EFLAGS_AC))) { + page_fault_oops(regs, error_code, address); + return; + } + + if (unlikely(faulthandler_disabled())) { + bad_area_nosemaphore(regs, error_code, address); + return; + } + + if (WARN_ON_ONCE(!(regs->flags & X86_EFLAGS_IF))) { + bad_area_nosemaphore(regs, error_code, address); + return; + } + + local_irq_enable(); + + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); + + if (error_code & X86_PF_SHSTK) + flags |=3D FAULT_FLAG_WRITE; + if (error_code & X86_PF_WRITE) + flags |=3D FAULT_FLAG_WRITE; + if (error_code & X86_PF_INSTR) + flags |=3D FAULT_FLAG_INSTRUCTION; + + if (user_mode(regs)) + flags |=3D FAULT_FLAG_USER; + +#ifdef CONFIG_X86_64 + if (is_vsyscall_vaddr(address)) { + if (emulate_vsyscall(error_code, regs, address)) + return; + } +#endif + + if (!(flags & FAULT_FLAG_USER)) + goto lock_mmap; + + vma =3D lock_vma_under_rcu(real_mm, address); + if (!vma) + goto lock_mmap; + + if (unlikely(access_error(error_code, vma))) { + bad_area_access_error(regs, error_code, address, NULL, vma); + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + return; + } + + fault =3D handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs= ); + if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) + vma_end_read(vma); + + if (!(fault & VM_FAULT_RETRY)) { + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + goto done; + } + count_vm_vma_lock_event(VMA_LOCK_RETRY); + if (fault & VM_FAULT_MAJOR) + flags |=3D FAULT_FLAG_TRIED; + + /* Quick path to respond to signals */ + if (fault_signal_pending(fault, regs)) { + if (!user_mode(regs)) + 
kernelmode_fixup_or_oops(regs, error_code, address, + SIGBUS, BUS_ADRERR, + ARCH_DEFAULT_PKEY); + return; + } +lock_mmap: + +retry: + /* + * Here we don't need to lock current->mm since no vma in + * current->mm is used to handle this page fault. However, + * we do need to lock real_mm, as the address belongs to + * real_mm's vma. + */ + vma =3D lock_mm_and_find_vma(real_mm, address, regs); + if (unlikely(!vma)) { + bad_area_nosemaphore(regs, error_code, address); + return; + } + + if (unlikely(access_error(error_code, vma))) { + bad_area_access_error(regs, error_code, address, real_mm, vma); + return; + } + + fault =3D handle_mm_fault(vma, address, flags, regs); + + if (fault_signal_pending(fault, regs)) { + /* + * Quick path to respond to signals. The core mm code + * has unlocked the mm for us if we get here. + */ + if (!user_mode(regs)) + kernelmode_fixup_or_oops(regs, error_code, address, + SIGBUS, BUS_ADRERR, + ARCH_DEFAULT_PKEY); + return; + } + + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + + if (unlikely(fault & VM_FAULT_RETRY)) { + flags |=3D FAULT_FLAG_TRIED; + goto retry; + } + + mmap_read_unlock(real_mm); +done: + if (likely(!(fault & VM_FAULT_ERROR))) + return; + + if (fatal_signal_pending(current) && !user_mode(regs)) { + kernelmode_fixup_or_oops(regs, error_code, address, 0, 0, + ARCH_DEFAULT_PKEY); + return; + } + + if (fault & VM_FAULT_OOM) { + /* Kernel mode? Handle exceptions or die: */ + if (!user_mode(regs)) { + kernelmode_fixup_or_oops(regs, error_code, address, + SIGSEGV, SEGV_MAPERR, + ARCH_DEFAULT_PKEY); + return; + } + + pagefault_out_of_memory(); + } else { + if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON| + VM_FAULT_HWPOISON_LARGE)) + do_sigbus(regs, error_code, address, fault); + else if (fault & VM_FAULT_SIGSEGV) + bad_area_nosemaphore(regs, error_code, address); + else + BUG(); + } +} +NOKPROBE_SYMBOL(rpal_do_user_addr_fault); + +static inline void rpal_try_to_rebuild_context(struct pt_regs *regs, + unsigned long address, + int error_code) +{ + int handle_more =3D 0; + + /* + * We only rebuild sender's context, as other threads are not supposed + * to access other process's memory, thus they will not trigger a page + * fault. + */ + handle_more =3D rpal_rebuild_sender_context_on_fault(regs, address, -1); + /* + * If we are not able to rebuild sender's context, just + * send a signal to let it coredump. + */ + if (handle_more) + force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address); +} + +/* + * Most logic of this function is copied from do_user_addr_fault(). + * RPAL logic is added to handle special cases, such as find another + * process's mm and rebuild sender's context if such page table is + * not able to be handled. + */ +static bool rpal_try_user_addr_fault(struct pt_regs *regs, unsigned long e= rror_code, + unsigned long address) +{ + struct mm_struct *real_mm; + int rebuild =3D 0; + + /* fast path: avoid mmget and mmput */ + if (unlikely((error_code & (X86_PF_USER | X86_PF_INSTR)) =3D=3D + X86_PF_INSTR)) { + /* + * Whoops, this is kernel mode code trying to execute from + * user memory. Unless this is AMD erratum #93, which + * corrupts RIP such that it looks like a user address, + * this is unrecoverable. Don't even try to look up the + * VMA or look for extable entries. 
+ */ + if (is_errata93(regs, address)) + return true; + + page_fault_oops(regs, error_code, address); + return true; + } + + /* kprobes don't want to hook the spurious faults: */ + if (WARN_ON_ONCE(kprobe_page_fault(regs, X86_TRAP_PF))) + return true; + + real_mm =3D rpal_pf_get_real_mm(address, &rebuild); + + if (real_mm) { +#ifdef CONFIG_MEMCG + struct mem_cgroup *memcg =3D NULL; + + prefetchw(&real_mm->mmap_lock); + /* try to charge page alloc to real_mm's memcg */ + if (!current->active_memcg) { + memcg =3D get_mem_cgroup_from_mm(real_mm); + if (memcg) + set_active_memcg(memcg); + } + rpal_do_user_addr_fault(regs, error_code, address, real_mm); + if (memcg) { + set_active_memcg(NULL); + mem_cgroup_put(memcg); + } +#else + prefetchw(&real_mm->mmap_lock); + rpal_do_user_addr_fault(regs, error_code, address, real_mm); +#endif + mmput_async(real_mm); + return true; + } else if (user_mode(regs) && rebuild) { + rpal_try_to_rebuild_context(regs, address, -1); + return true; + } + + return false; +} + +static bool rpal_handle_page_fault(struct pt_regs *regs, unsigned long err= or_code, + unsigned long address) +{ + struct rpal_service *cur =3D rpal_current_service(); + + /* + * For RPAL process, it may access another process's memory and + * there may be page fault. We handle this case with our own routine. + * If we cannot handle this page fault, just let it go and handle + * it as a normal page fault. + */ + if (cur && !rpal_is_correct_address(cur, address)) { + if (rpal_try_user_addr_fault(regs, error_code, address)) + return true; + } + return false; +} +#endif + static __always_inline void handle_page_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address) @@ -1473,7 +1736,15 @@ handle_page_fault(struct pt_regs *regs, unsigned lon= g error_code, if (unlikely(fault_in_kernel_space(address))) { do_kern_addr_fault(regs, error_code, address); } else { +#ifdef CONFIG_RPAL + if (rpal_handle_page_fault(regs, error_code, address)) { + local_irq_disable(); + return; + } + do_user_addr_fault(regs, error_code, address); +#else /* !CONFIG_RPAL */ do_user_addr_fault(regs, error_code, address); +#endif /* * User address page fault handling might have reenabled * interrupts. 
Fixing up all potential exit points of diff --git a/arch/x86/rpal/mm.c b/arch/x86/rpal/mm.c index f1003baae001..be7714ede2bf 100644 --- a/arch/x86/rpal/mm.c +++ b/arch/x86/rpal/mm.c @@ -390,3 +390,37 @@ void rpal_unmap_service(struct rpal_service *tgt) } mm_unlink_p4d(cur_mm, tgt->base); } + +static inline bool check_service_mapped(struct rpal_service *cur, int tgt_= id) +{ + struct rpal_mapped_service *node; + bool is_mapped =3D true; + unsigned long type =3D (1 << RPAL_REVERSE_MAP) | (1 << RPAL_REQUEST_MAP); + + node =3D rpal_get_mapped_node(cur, tgt_id); + if (unlikely((node->type & type) !=3D type)) + is_mapped =3D false; + + return is_mapped; +} + +struct mm_struct *rpal_pf_get_real_mm(unsigned long address, int *rebuild) +{ + struct rpal_service *cur, *tgt; + struct mm_struct *mm =3D NULL; + + cur =3D rpal_current_service(); + + tgt =3D rpal_get_mapped_service_by_addr(cur, address); + if (tgt =3D=3D NULL) + goto out; + + mm =3D tgt->mm; + if (unlikely(!check_service_mapped(cur, tgt->id) || + !mmget_not_zero(mm))) + mm =3D NULL; + *rebuild =3D 1; + rpal_put_service(tgt); +out: + return mm; +} diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c index f490ab07301d..49458321e7dc 100644 --- a/arch/x86/rpal/service.c +++ b/arch/x86/rpal/service.c @@ -148,6 +148,30 @@ static inline unsigned long calculate_base_address(int= id) return RPAL_ADDRESS_SPACE_LOW + RPAL_ADDR_SPACE_SIZE * id; } =20 +struct rpal_service *rpal_get_mapped_service_by_id(struct rpal_service *rs, + int id) +{ + struct rpal_service *ret; + + if (!is_valid_id(id)) + return NULL; + + ret =3D rpal_get_service(rs->service_map[id].rs); + + return ret; +} + +/* This function must be called after rpal_is_correct_address () */ +struct rpal_service *rpal_get_mapped_service_by_addr(struct rpal_service *= rs, + unsigned long addr) +{ + int id; + + id =3D (addr - RPAL_ADDRESS_SPACE_LOW) / RPAL_ADDR_SPACE_SIZE; + + return rpal_get_mapped_service_by_id(rs, id); +} + struct rpal_service *rpal_register_service(void) { struct rpal_service *rs; diff --git a/arch/x86/rpal/thread.c b/arch/x86/rpal/thread.c index 7550ad94b63f..e50a4c865ff8 100644 --- a/arch/x86/rpal/thread.c +++ b/arch/x86/rpal/thread.c @@ -155,6 +155,29 @@ int rpal_unregister_receiver(void) return ret; } =20 +int rpal_rebuild_sender_context_on_fault(struct pt_regs *regs, + unsigned long addr, int error_code) +{ + if (rpal_test_current_thread_flag(RPAL_SENDER_BIT)) { + struct rpal_sender_call_context *scc =3D current->rpal_sd->scc; + unsigned long erip, ersp; + int magic; + + erip =3D scc->ec.erip; + ersp =3D scc->ec.ersp; + magic =3D scc->ec.magic; + if (magic =3D=3D RPAL_ERROR_MAGIC) { + regs->ax =3D error_code; + regs->ip =3D erip; + regs->sp =3D ersp; + /* avoid rebuild again */ + scc->ec.magic =3D 0; + return 0; + } + } + return -EINVAL; +} + void exit_rpal_thread(void) { if (rpal_test_current_thread_flag(RPAL_SENDER_BIT)) diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 36be1ab6a9f3..3310d222739e 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -85,6 +85,8 @@ enum { RPAL_REVERSE_MAP, }; =20 +#define RPAL_ERROR_MAGIC 0x98CC98CC + extern unsigned long rpal_cap; =20 enum rpal_task_flag_bits { @@ -198,23 +200,6 @@ struct rpal_version_info { unsigned long cap; }; =20 -/* End */ - -struct rpal_shared_page { - unsigned long user_start; - unsigned long kernel_start; - int npage; - atomic_t refcnt; - struct list_head list; -}; - -struct rpal_common_data { - /* back pointer to task_struct */ - struct task_struct *bp_task; - /* service id of 
rpal_service */ - int service_id; -}; - /* User registers state */ struct rpal_task_context { u64 r15; @@ -232,17 +217,44 @@ struct rpal_receiver_call_context { int receiver_id; }; =20 -struct rpal_receiver_data { - struct rpal_common_data rcd; - struct rpal_shared_page *rsp; - struct rpal_receiver_call_context *rcc; +/* recovery point for sender */ +struct rpal_error_context { + unsigned long fsbase; + u64 erip; + u64 ersp; + int state; + int magic; }; =20 struct rpal_sender_call_context { struct rpal_task_context rtc; + struct rpal_error_context ec; int sender_id; }; =20 +/* End */ + +struct rpal_shared_page { + unsigned long user_start; + unsigned long kernel_start; + int npage; + atomic_t refcnt; + struct list_head list; +}; + +struct rpal_common_data { + /* back pointer to task_struct */ + struct task_struct *bp_task; + /* service id of rpal_service */ + int service_id; +}; + +struct rpal_receiver_data { + struct rpal_common_data rcd; + struct rpal_shared_page *rsp; + struct rpal_receiver_call_context *rcc; +}; + struct rpal_sender_data { struct rpal_common_data rcd; struct rpal_shared_page *rsp; @@ -338,6 +350,26 @@ static inline bool rpal_service_mapped(struct rpal_map= ped_service *node) return (node->type & type) =3D=3D type; } =20 +static inline bool rpal_is_correct_address(struct rpal_service *rs, unsign= ed long address) +{ + if (likely(rs->base <=3D address && + address < rs->base + RPAL_ADDR_SPACE_SIZE)) + return true; + + /* + * [rs->base, rs->base + RPAL_ADDR_SPACE_SIZE) is always a + * sub range of [RPAL_ADDRESS_SPACE_LOW, RPAL_ADDRESS_SPACE_HIGH). + * Therefore, we can only check whether the address is in + * [RPAL_ADDRESS_SPACE_LOW, RPAL_ADDRESS_SPACE_HIGH) to determine + * whether the address may belong to another RPAL service. + */ + if (address >=3D RPAL_ADDRESS_SPACE_LOW && + address < RPAL_ADDRESS_SPACE_HIGH) + return false; + + return true; +} + #ifdef CONFIG_RPAL static inline struct rpal_service *rpal_current_service(void) { @@ -372,6 +404,13 @@ void copy_rpal(struct task_struct *p); void exit_rpal(bool group_dead); int rpal_balloon_init(unsigned long base); void rpal_exit_mmap(struct mm_struct *mm); +struct rpal_service *rpal_get_mapped_service_by_addr(struct rpal_service *= rs, + unsigned long addr); +struct rpal_service *rpal_get_mapped_service_by_id(struct rpal_service *rs, + int id); +int rpal_rebuild_sender_context_on_fault(struct pt_regs *regs, + unsigned long addr, int error_code); +struct mm_struct *rpal_pf_get_real_mm(unsigned long address, int *rebuild); =20 extern void rpal_pick_mmap_base(struct mm_struct *mm, struct rlimit *rlim_stack); --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pg1-f182.google.com (mail-pg1-f182.google.com [209.85.215.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C1094224B1C for ; Fri, 30 May 2025 09:32:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.182 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597541; cv=none; b=ZTQlDlu4DzxOGzpQoTZh3kYbRjTI7uMnWQw3rRwIYo3xMTITulRIvM6158JNxSYeIEdLLSJ49QuPkrjHJ4LJcFNXSA8HNCM4lKByAwSpTGi6jl6RKEMxEkTnZi0swjI8fdEmp/Hf17xn5U6BM5bwmK536cEohNLtNDig6epyBYk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597541; c=relaxed/simple; bh=Vo5W68+B44KDyLABkBRZQtt4MP6EbodSR8b0dyMCe9I=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: 
From: Bo Li
To: tglx@linutronix.de,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org
Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li
Subject: [RFC v2 15/35] RPAL: add sender/receiver state
Date: Fri, 30 May 2025 17:27:43 +0800
Message-Id: <6582d600063dd2176558bdf2b62a6a143bd594e2.1748594840.git.libo.gcs85@bytedance.com>
X-Mailer: git-send-email 2.39.5 (Apple Git-154)
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

The lazy switch defines six receiver states, and their state transitions are as follows:

  |<--->READY <----> WAIT <----> CALL ----> LAZY_SWITCH ---> KERNEL_RET
  |                                              |                 |
  RUNNING <--------------------------------------|-----------------|

The receiver thread initially starts in the RUNNING state and can transition to the WAIT state voluntarily. The READY state is a temporary state before entering the WAIT state. A receiver in the WAIT state must be in the TASK_INTERRUPTIBLE state. If the receiver thread is woken up, the WAIT state can transition to the RUNNING state.

Once the receiver is in the WAIT state, the sender thread can initiate an RPAL call, causing the receiver to enter the CALL state. A receiver thread in the CALL state cannot be awakened until a lazy switch occurs or its state changes. The CALL state carries additional service_id and sender_id information. If the sender completes executing the receiver's code without entering the kernel after issuing the RPAL call, the receiver transitions back from the CALL state to the WAIT state. Conversely, if the sender enters the kernel during the RPAL call, the receiver's state changes to LAZY_SWITCH.

From the LAZY_SWITCH state, the receiver thread has two possible state transitions: when the receiver thread finishes execution and switches back to the sender via a lazy switch, it first enters the KERNEL_RET state and then transitions to the RUNNING state; if the receiver thread runs for too long and the scheduler resumes the sender, the receiver transitions directly to the RUNNING state. Transitions to the RUNNING state can be done in userspace.

The lazy switch mechanism defines three states for the sender thread:

- RUNNING: The sender starts in this state. When the sender initiates an RPAL call to switch from user mode to the receiver, it transitions to the CALL state.
- CALL: The sender remains in this state while the receiver is executing the code triggered by the RPAL call. When the receiver switches back to the sender from user mode, the sender returns to the RUNNING state.
- KERNEL_RET: If the receiver takes an extended period to switch back to the sender after a lazy switch, the scheduler may preempt the sender to run other tasks. In this case, the sender enters the KERNEL_RET state while in the kernel. Once the sender resumes execution in user mode, it transitions back to the RUNNING state.
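As a rough illustration of the wakeup rule these states impose, here is a user-space sketch using C11 atomics (not the kernel code added by this patch): a waker may claim a receiver that is READY or WAIT, but must leave a receiver in the CALL state alone, since only a lazy switch may take it out of that state. The enum and helper names are illustrative only.

/*
 * Sketch of the receiver-state wakeup rule described above.
 * READY/WAIT may be moved to RUNNING by a waker; CALL must not be
 * touched; the remaining states need no action from the waker.
 */
#include <stdatomic.h>
#include <stdbool.h>

enum receiver_state {
	R_RUNNING, R_KERNEL_RET, R_READY, R_WAIT, R_CALL, R_LAZY_SWITCH,
};

static bool try_claim_for_wakeup(_Atomic int *state)
{
	int old = atomic_load(state);

	for (;;) {
		switch (old) {
		case R_READY:
		case R_WAIT:
			/* READY/WAIT -> RUNNING on wakeup */
			if (atomic_compare_exchange_weak(state, &old, R_RUNNING))
				return true;
			break;          /* lost a race; old was refreshed, re-examine */
		case R_CALL:
			return false;   /* a sender is running on it: do not wake */
		default:
			return true;    /* RUNNING/KERNEL_RET/LAZY_SWITCH: nothing to do */
		}
	}
}

In the patch itself, this is essentially the check that rpal_check_state() performs before try_to_wake_up() is allowed to proceed.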
This patch implements the handling and transition of the receiver's state. When a receiver leaves the run queue in the READY state, its state transitions to the WAIT state; otherwise, it transitions to the RUNNING state. The patch also modifies try_to_wake_up() to handle the different states: for the READY and WAIT states, try_to_wake_up() causes the state to change to the RUNNING state; for the CALL state, try_to_wake_up() cannot wake up the task. The patch provides a special interface, rpal_try_to_wake_up(), to wake up tasks in the CALL state, which can be used for lazy switches.

Signed-off-by: Bo Li
---
arch/x86/kernel/process_64.c | 43 ++++++++++++ arch/x86/rpal/internal.h | 7 ++ include/linux/rpal.h | 50 ++++++++++++ kernel/sched/core.c | 130 +++++++++++++++++++++++++++++++++++ 4 files changed, 230 insertions(+)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index f39ff02e498d..4830e9215de7 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -40,6 +40,7 @@ #include #include #include +#include =20 #include #include @@ -596,6 +597,36 @@ void compat_start_thread(struct pt_regs *regs, u32 new= _ip, u32 new_sp, bool x32) } #endif =20 +#ifdef CONFIG_RPAL +static void rpal_receiver_enter_wait(struct task_struct *prev_p) +{ + if (READ_ONCE(prev_p->__state) =3D=3D TASK_INTERRUPTIBLE) { + atomic_cmpxchg(&prev_p->rpal_rd->rcc->receiver_state, + RPAL_RECEIVER_STATE_READY, + RPAL_RECEIVER_STATE_WAIT); + } else { + /* + * Simply checking RPAL_RECEIVER_STATE_READY is not enough. It is + * possible that the task's state is TASK_RUNNING. Consider the + * following case: + * + * CPU 0(prev_p) CPU 1(waker) + * set TASK_INTERRUPTIBLE + * set RPAL_RECEIVER_STATE_READY + * check TASK_INTERRUPTIBLE + * clear RPAL_RECEIVER_STATE_READY + * clear TASK_INTERRUPTIBLE + * set TASK_INTERRUPTIBLE + * set RPAL_RECEIVER_STATE_READY + * ttwu_runnable() + * schedule() + */ + atomic_cmpxchg(&prev_p->rpal_rd->rcc->receiver_state, + RPAL_RECEIVER_STATE_READY, + RPAL_RECEIVER_STATE_RUNNING); + } +} +#endif + /* * switch_to(x,y) should switch tasks from x to y. * @@ -704,6 +735,18 @@ __switch_to(struct task_struct *prev_p, struct task_st= ruct *next_p) loadsegment(ss, __KERNEL_DS); } =20 +#ifdef CONFIG_RPAL + /* + * When we get here, the stack switching is finished. Therefore, + * the receiver thread is prepared for a lazy switch. We then change + * the receiver_state from RPAL_RECEIVER_STATE_READY to + * RPAL_RECEIVER_STATE_WAIT, and other threads are able to call it with + * an RPAL call. + */ + if (rpal_test_task_thread_flag(prev_p, RPAL_RECEIVER_BIT)) + rpal_receiver_enter_wait(prev_p); +#endif + /* Load the Intel cache allocation PQR MSR.
*/ resctrl_sched_in(next_p); =20 diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h index cf6d608a994a..6256172bb79e 100644 --- a/arch/x86/rpal/internal.h +++ b/arch/x86/rpal/internal.h @@ -47,3 +47,10 @@ int rpal_unregister_sender(void); int rpal_register_receiver(unsigned long addr); int rpal_unregister_receiver(void); void exit_rpal_thread(void); + +static inline unsigned long +rpal_build_call_state(const struct rpal_sender_data *rsd) +{ + return ((rsd->rcd.service_id << RPAL_SID_SHIFT) | + (rsd->scc->sender_id << RPAL_ID_SHIFT) | RPAL_RECEIVER_STATE_CALL); +} diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 3310d222739e..4f4719bb7eae 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -87,6 +87,13 @@ enum { =20 #define RPAL_ERROR_MAGIC 0x98CC98CC =20 +#define RPAL_SID_SHIFT 24 +#define RPAL_ID_SHIFT 8 +#define RPAL_RECEIVER_STATE_MASK ((1 << RPAL_ID_SHIFT) - 1) +#define RPAL_SID_MASK (~((1 << RPAL_SID_SHIFT) - 1)) +#define RPAL_ID_MASK (~(0 | RPAL_RECEIVER_STATE_MASK | RPAL_SID_MASK)) +#define RPAL_MAX_ID ((1 << (RPAL_SID_SHIFT - RPAL_ID_SHIFT)) - 1) + extern unsigned long rpal_cap; =20 enum rpal_task_flag_bits { @@ -94,6 +101,22 @@ enum rpal_task_flag_bits { RPAL_RECEIVER_BIT, }; =20 +enum rpal_receiver_state { + RPAL_RECEIVER_STATE_RUNNING, + RPAL_RECEIVER_STATE_KERNEL_RET, + RPAL_RECEIVER_STATE_READY, + RPAL_RECEIVER_STATE_WAIT, + RPAL_RECEIVER_STATE_CALL, + RPAL_RECEIVER_STATE_LAZY_SWITCH, + RPAL_RECEIVER_STATE_MAX, +}; + +enum rpal_sender_state { + RPAL_SENDER_STATE_RUNNING, + RPAL_SENDER_STATE_CALL, + RPAL_SENDER_STATE_KERNEL_RET, +}; + /* * user_meta will be sent to other service when requested. */ @@ -215,6 +238,8 @@ struct rpal_task_context { struct rpal_receiver_call_context { struct rpal_task_context rtc; int receiver_id; + atomic_t receiver_state; + atomic_t sender_state; }; =20 /* recovery point for sender */ @@ -390,11 +415,35 @@ static inline bool rpal_test_current_thread_flag(unsi= gned long bit) { return test_bit(bit, ¤t->rpal_flag); } + +static inline bool rpal_test_task_thread_flag(struct task_struct *tsk, + unsigned long bit) +{ + return test_bit(bit, &tsk->rpal_flag); +} + +static inline void rpal_set_task_thread_flag(struct task_struct *tsk, + unsigned long bit) +{ + set_bit(bit, &tsk->rpal_flag); +} + +static inline void rpal_clear_task_thread_flag(struct task_struct *tsk, + unsigned long bit) +{ + clear_bit(bit, &tsk->rpal_flag); +} #else static inline struct rpal_service *rpal_current_service(void) { return NUL= L; } static inline void rpal_set_current_thread_flag(unsigned long bit) { } static inline void rpal_clear_current_thread_flag(unsigned long bit) { } static inline bool rpal_test_current_thread_flag(unsigned long bit) { retu= rn false; } +static inline bool rpal_test_task_thread_flag(struct task_struct *tsk, + unsigned long bit) { return false; } +static inline void rpal_set_task_thread_flag(struct task_struct *tsk, + unsigned long bit) { } +static inline void rpal_clear_task_thread_flag(struct task_struct *tsk, + unsigned long bit) { } #endif =20 void rpal_unregister_service(struct rpal_service *rs); @@ -414,4 +463,5 @@ struct mm_struct *rpal_pf_get_real_mm(unsigned long add= ress, int *rebuild); =20 extern void rpal_pick_mmap_base(struct mm_struct *mm, struct rlimit *rlim_stack); +int rpal_try_to_wake_up(struct task_struct *p); #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 62b3416f5e43..045e92ee2e3b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -67,6 +67,7 @@ #include 
#include #include +#include =20 #ifdef CONFIG_PREEMPT_DYNAMIC # ifdef CONFIG_GENERIC_ENTRY @@ -3820,6 +3821,40 @@ static int ttwu_runnable(struct task_struct *p, int = wake_flags) return ret; } =20 +#ifdef CONFIG_RPAL +static bool rpal_check_state(struct task_struct *p) +{ + bool ret =3D true; + + if (rpal_test_task_thread_flag(p, RPAL_RECEIVER_BIT)) { + struct rpal_receiver_call_context *rcc =3D p->rpal_rd->rcc; + int state; + +retry: + state =3D atomic_read(&rcc->receiver_state) & RPAL_RECEIVER_STATE_MASK; + switch (state) { + case RPAL_RECEIVER_STATE_READY: + case RPAL_RECEIVER_STATE_WAIT: + if (state !=3D atomic_cmpxchg(&rcc->receiver_state, state, + RPAL_RECEIVER_STATE_RUNNING)) + goto retry; + break; + case RPAL_RECEIVER_STATE_KERNEL_RET: + case RPAL_RECEIVER_STATE_LAZY_SWITCH: + case RPAL_RECEIVER_STATE_RUNNING: + break; + case RPAL_RECEIVER_STATE_CALL: + ret =3D false; + break; + default: + rpal_err("%s: invalid state: %d\n", __func__, state); + break; + } + } + return ret; +} +#endif + #ifdef CONFIG_SMP void sched_ttwu_pending(void *arg) { @@ -3841,6 +3876,11 @@ void sched_ttwu_pending(void *arg) if (WARN_ON_ONCE(task_cpu(p) !=3D cpu_of(rq))) set_task_cpu(p, cpu_of(rq)); =20 +#ifdef CONFIG_RPAL + if (!rpal_check_state(p)) + continue; +#endif + ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf); } =20 @@ -4208,6 +4248,17 @@ int try_to_wake_up(struct task_struct *p, unsigned i= nt state, int wake_flags) if (!ttwu_state_match(p, state, &success)) goto out; =20 +#ifdef CONFIG_RPAL + /* + * For rpal thread, we need to check if it can be woken up. If not, + * we do not wake it up here but wake it up later by kernel worker. + * + * For normal thread, nothing happens. + */ + if (!rpal_check_state(p)) + goto out; +#endif + trace_sched_waking(p); ttwu_do_wakeup(p); goto out; @@ -4224,6 +4275,11 @@ int try_to_wake_up(struct task_struct *p, unsigned i= nt state, int wake_flags) if (!ttwu_state_match(p, state, &success)) break; =20 +#ifdef CONFIG_RPAL + if (!rpal_check_state(p)) + break; +#endif + trace_sched_waking(p); =20 /* @@ -4344,6 +4400,56 @@ int try_to_wake_up(struct task_struct *p, unsigned i= nt state, int wake_flags) return success; } =20 +#ifdef CONFIG_RPAL +int rpal_try_to_wake_up(struct task_struct *p) +{ + guard(preempt)(); + int cpu, success =3D 0; + int wake_flags =3D WF_TTWU; + + BUG_ON(READ_ONCE(p->__state) =3D=3D TASK_RUNNING); + + scoped_guard (raw_spinlock_irqsave, &p->pi_lock) { + smp_mb__after_spinlock(); + if (!ttwu_state_match(p, TASK_NORMAL, &success)) + break; + + trace_sched_waking(p); + /* see try_to_wake_up() */ + smp_rmb(); + +#ifdef CONFIG_SMP + smp_acquire__after_ctrl_dep(); + WRITE_ONCE(p->__state, TASK_WAKING); + /* see try_to_wake_up() */ + if (smp_load_acquire(&p->on_cpu) && + ttwu_queue_wakelist(p, task_cpu(p), wake_flags)) + break; + smp_cond_load_acquire(&p->on_cpu, !VAL); + + cpu =3D select_task_rq(p, p->wake_cpu, &wake_flags); + if (task_cpu(p) !=3D cpu) { + if (p->in_iowait) { + delayacct_blkio_end(p); + atomic_dec(&task_rq(p)->nr_iowait); + } + + wake_flags |=3D WF_MIGRATED; + psi_ttwu_dequeue(p); + set_task_cpu(p, cpu); + } +#else + cpu =3D task_cpu(p); +#endif + } + ttwu_queue(p, cpu, wake_flags); + if (success) + ttwu_stat(p, task_cpu(p), wake_flags); + + return success; +} +#endif + static bool __task_needs_rq_lock(struct task_struct *p) { unsigned int state =3D READ_ONCE(p->__state); @@ -6574,6 +6680,18 @@ pick_next_task(struct rq *rq, struct task_struct *pr= ev, struct rq_flags *rf) #define SM_PREEMPT 1 #define 
SM_RTLOCK_WAIT 2 =20 +#ifdef CONFIG_RPAL +static inline void rpal_check_ready_state(struct task_struct *tsk, int sta= te) +{ + if (rpal_test_task_thread_flag(tsk, RPAL_RECEIVER_BIT)) { + struct rpal_receiver_call_context *rcc =3D tsk->rpal_rd->rcc; + + atomic_cmpxchg(&rcc->receiver_state, state, + RPAL_RECEIVER_STATE_RUNNING); + } +} +#endif + /* * Helper function for __schedule() * @@ -6727,7 +6845,19 @@ static void __sched notrace __schedule(int sched_mod= e) goto picked; } } else if (!preempt && prev_state) { +#ifdef CONFIG_RPAL + if (!try_to_block_task(rq, prev, &prev_state)) { + /* + * As the task enter TASK_RUNNING state, we should clean up + * RPAL_RECEIVER_STATE_READY status. Therefore, the receiver's + * state will not be change to RPAL_RECEIVER_STATE_WAIT. Thus, + * there is no RPAL call when a receiver is at TASK_RUNNING state. + */ + rpal_check_ready_state(prev, RPAL_RECEIVER_STATE_READY); + } +#else try_to_block_task(rq, prev, &prev_state); +#endif switch_count =3D &prev->nvcsw; } =20 --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pg1-f170.google.com (mail-pg1-f170.google.com [209.85.215.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3A24E223DC4 for ; Fri, 30 May 2025 09:32:33 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.170 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597554; cv=none; b=kOs4jWXjzjd+V5V9iizfAPjCPWFPNid8wjaPgJ/mw28mDZUTeHMxyEy7WU1XRLDrXFhXqi9C3iH1YH6H+4gmMNTrUF8SCJFzuer+JtIhvdJrAAu54QO1C8iP75RgtdKT7HIB8CZOEp86oXoFEl2QoEqA3fMRTVSo2/nf/aa50lg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597554; c=relaxed/simple; bh=p7fOy/mpX7DgxiI+SapqF82ZoB9lHQGYA8nT1M0yWa8=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=dPtVY/uXf/0CVV8BqSGWFYsodoUDib2KurkA6Qkdwzw2Vggau7bQu1oBhmSs1/dxejmYTkOWc6oklgPz9vLqLvCsqHTjbBJ5Kpxioy1krqADr1KfNuTEnO78dG/YSi8Tar5+act4abdZVpIjQxrJ4QtNFYnGENdgOhTrPFGsknQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=A3BLCd+H; arc=none smtp.client-ip=209.85.215.170 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="A3BLCd+H" Received: by mail-pg1-f170.google.com with SMTP id 41be03b00d2f7-b2c2c762a89so1337908a12.0 for ; Fri, 30 May 2025 02:32:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597552; x=1749202352; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=PX/EGT89O+ZEiBiVi4yPzaBsD+Uha5lcnizaJutApCU=; b=A3BLCd+H0XngyTK/aF3Qe3w7oFLL/TjuuoQLSpAqpevzMPXgXES+cXNeABDMSAxj54 jnJJob7xXOtFqVanAP1tnKnxxegezSXETfo+yFjgYlpoEhoWrQ70MtCjgii2VPGOkACD +qWuJ+yVZmVykAIebhMKB4OkY4V8mNEWJWaeKE6B8MoRkKXXCWoMt2yzGTuIId3eiWNG BBku4qA7Sd31bsvfmRnY7hfj7RCLU1m4vbzep0cDBhxBX8nbfipk058dW/RL2cSaSw7Q 
From: Bo Li
To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org
Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li
Subject: [RFC v2 16/35] RPAL: add cpu lock interface
Date: Fri, 30 May 2025 17:27:44 +0800
Message-Id: <8ff6cea94a6438a0856c86a11d56be462314b1f8.1748594841.git.libo.gcs85@bytedance.com>
X-Mailer: git-send-email 2.39.5 (Apple Git-154)
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Lazy switch enables the kernel to switch from one task to another to keep the kernel context and user context matched. For the scheduler, both tasks involved in the context switch must reside in the same run queue (rq). Therefore, before a lazy switch occurs, the kernel must first bind both tasks to the same CPU to facilitate the subsequent context switch.

This patch introduces the rpal_lock_cpu() interface, which binds two tasks to the same CPU while bypassing cpumask restrictions. The rpal_unlock_cpu() function serves as the inverse operation to release this binding. To ensure consistency, the kernel must prevent other threads from modifying the CPU affinity of tasks locked by rpal_lock_cpu(). Therefore, when using set_cpus_allowed_ptr() to change a task's CPU affinity, other threads must wait until the binding established by rpal_lock_cpu() is released before proceeding with modifications.
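As a loose user-space analogy of this bind/restore pairing, the sketch below uses ordinary pthread affinity calls: save a thread's affinity mask, pin the thread to one CPU so it shares that CPU with its peer, and restore the saved mask afterwards. This is only an analogy; the in-kernel mechanism added by this patch additionally has to bypass cpumask restrictions and hold off concurrent set_cpus_allowed_ptr() callers.

/*
 * User-space analogy only (pthreads, not kernel task_structs):
 * "lock" = remember the old affinity mask and pin to one CPU,
 * "unlock" = restore the remembered mask.  One saved mask is kept
 * global here purely for brevity.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static cpu_set_t saved_mask;

/* Pin 'thread' to target_cpu, remembering its previous affinity. */
static int pin_to_cpu(pthread_t thread, int target_cpu)
{
	cpu_set_t one;

	if (pthread_getaffinity_np(thread, sizeof(saved_mask), &saved_mask))
		return -1;
	CPU_ZERO(&one);
	CPU_SET(target_cpu, &one);
	return pthread_setaffinity_np(thread, sizeof(one), &one);
}

/* Restore the affinity 'thread' had before it was pinned. */
static int unpin(pthread_t thread)
{
	return pthread_setaffinity_np(thread, sizeof(saved_mask), &saved_mask);
}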
Signed-off-by: Bo Li
---
arch/x86/rpal/core.c | 18 +++++++ arch/x86/rpal/thread.c | 14 ++++++ include/linux/rpal.h | 8 +++ kernel/sched/core.c | 109 +++++++++++++++++++++++++++++++++++ 4 files changed, 149 insertions(+)
diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c index 61f5d40b0157..c185a453c1b2 100644 --- a/arch/x86/rpal/core.c +++ b/arch/x86/rpal/core.c @@ -15,6 +15,24 @@ int __init rpal_init(void); bool rpal_inited; unsigned long rpal_cap; =20 +static inline void rpal_lock_cpu(struct task_struct *tsk) +{ + rpal_set_cpus_allowed_ptr(tsk, true); + if (unlikely(!irqs_disabled())) { + local_irq_disable(); + rpal_err("%s: irq is enabled\n", __func__); + } +} + +static inline void rpal_unlock_cpu(struct task_struct *tsk) +{ + rpal_set_cpus_allowed_ptr(tsk, false); + if (unlikely(!irqs_disabled())) { + local_irq_disable(); + rpal_err("%s: irq is enabled\n", __func__); + } +} + int __init rpal_init(void) { int ret =3D 0;
diff --git a/arch/x86/rpal/thread.c b/arch/x86/rpal/thread.c index e50a4c865ff8..bc203e9c6e5e 100644 --- a/arch/x86/rpal/thread.c +++ b/arch/x86/rpal/thread.c @@ -47,6 +47,10 @@ int rpal_register_sender(unsigned long addr) } =20 rpal_common_data_init(&rsd->rcd); + if (rpal_init_thread_pending(&rsd->rcd)) { + ret =3D -ENOMEM; + goto free_rsd; + } rsd->rsp =3D rsp; rsd->scc =3D (struct rpal_sender_call_context *)(addr - rsp->user_start + rsp->kernel_start); @@ -58,6 +62,8 @@ int rpal_register_sender(unsigned long addr) =20 return 0; =20 +free_rsd: + kfree(rsd); put_shared_page: rpal_put_shared_page(rsp); out: @@ -77,6 +83,7 @@ int rpal_unregister_sender(void) =20 rpal_put_shared_page(rsd->rsp); rpal_clear_current_thread_flag(RPAL_SENDER_BIT); + rpal_free_thread_pending(&rsd->rcd); kfree(rsd); =20 atomic_dec(&cur->thread_cnt); @@ -116,6 +123,10 @@ int rpal_register_receiver(unsigned long addr) } =20 rpal_common_data_init(&rrd->rcd); + if (rpal_init_thread_pending(&rrd->rcd)) { + ret =3D -ENOMEM; + goto free_rrd; + } rrd->rsp =3D rsp; rrd->rcc =3D (struct rpal_receiver_call_context *)(addr - rsp->user_start + @@ -128,6 +139,8 @@ int rpal_register_receiver(unsigned long addr) =20 return 0; =20 +free_rrd: + kfree(rrd); put_shared_page: rpal_put_shared_page(rsp); out: @@ -147,6 +160,7 @@ int rpal_unregister_receiver(void) =20 rpal_put_shared_page(rrd->rsp); rpal_clear_current_thread_flag(RPAL_RECEIVER_BIT); + rpal_free_thread_pending(&rrd->rcd); kfree(rrd); =20 atomic_dec(&cur->thread_cnt);
diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 4f4719bb7eae..5b115be14a55 100644 --- a/include/linux/rpal.h +++
b/include/linux/rpal.h @@ -99,6 +99,7 @@ extern unsigned long rpal_cap; enum rpal_task_flag_bits { RPAL_SENDER_BIT, RPAL_RECEIVER_BIT, + RPAL_CPU_LOCKED_BIT, }; =20 enum rpal_receiver_state { @@ -270,8 +271,12 @@ struct rpal_shared_page { struct rpal_common_data { /* back pointer to task_struct */ struct task_struct *bp_task; + /* pending struct for cpu locking */ + void *pending; /* service id of rpal_service */ int service_id; + /* cpumask before locked */ + cpumask_t old_mask; }; =20 struct rpal_receiver_data { @@ -464,4 +469,7 @@ struct mm_struct *rpal_pf_get_real_mm(unsigned long add= ress, int *rebuild); extern void rpal_pick_mmap_base(struct mm_struct *mm, struct rlimit *rlim_stack); int rpal_try_to_wake_up(struct task_struct *p); +int rpal_init_thread_pending(struct rpal_common_data *rcd); +void rpal_free_thread_pending(struct rpal_common_data *rcd); +int rpal_set_cpus_allowed_ptr(struct task_struct *p, bool is_lock); #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 045e92ee2e3b..a862bf4a0161 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3155,6 +3155,104 @@ static int __set_cpus_allowed_ptr_locked(struct tas= k_struct *p, return ret; } =20 +#ifdef CONFIG_RPAL +int rpal_init_thread_pending(struct rpal_common_data *rcd) +{ + struct set_affinity_pending *pending; + + pending =3D kzalloc(sizeof(*pending), GFP_KERNEL); + if (!pending) + return -ENOMEM; + pending->stop_pending =3D 0; + pending->arg =3D (struct migration_arg){ + .task =3D current, + .pending =3D NULL, + }; + rcd->pending =3D pending; + return 0; +} + +void rpal_free_thread_pending(struct rpal_common_data *rcd) +{ + if (rcd->pending !=3D NULL) + kfree(rcd->pending); +} + +/* + * CPU lock is forced and all cpumask will be ignored by RPAL temporary. + */ +int rpal_set_cpus_allowed_ptr(struct task_struct *p, bool is_lock) +{ + const struct cpumask *cpu_valid_mask =3D cpu_active_mask; + struct set_affinity_pending *pending =3D p->rpal_cd->pending; + struct cpumask mask; + unsigned int dest_cpu; + struct rq_flags rf; + struct rq *rq; + int ret =3D 0; + struct affinity_context ac =3D { + .new_mask =3D &mask, + .flags =3D 0, + }; + + if (unlikely(p->flags & PF_KTHREAD)) + rpal_err("p: %d, p->flags & PF_KTHREAD\n", p->pid); + + rq =3D task_rq_lock(p, &rf); + + if (is_lock) { + cpumask_copy(&p->rpal_cd->old_mask, &p->cpus_mask); + cpumask_clear(&mask); + cpumask_set_cpu(smp_processor_id(), &mask); + rpal_set_task_thread_flag(p, RPAL_CPU_LOCKED_BIT); + } else { + cpumask_copy(&mask, &p->rpal_cd->old_mask); + rpal_clear_task_thread_flag(p, RPAL_CPU_LOCKED_BIT); + } + + update_rq_clock(rq); + + if (cpumask_equal(&p->cpus_mask, ac.new_mask)) + goto out; + /* + * Picking a ~random cpu helps in cases where we are changing affinity + * for groups of tasks (ie. cpuset), so that load balancing is not + * immediately required to distribute the tasks within their new mask. 
+ */ + dest_cpu =3D cpumask_any_and_distribute(cpu_valid_mask, ac.new_mask); + if (dest_cpu >=3D nr_cpu_ids) { + ret =3D -EINVAL; + goto out; + } + __do_set_cpus_allowed(p, &ac); + if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) { + preempt_disable(); + task_rq_unlock(rq, p, &rf); + preempt_enable(); + } else { + pending->arg.dest_cpu =3D dest_cpu; + + if (task_on_cpu(rq, p) || + READ_ONCE(p->__state) =3D=3D TASK_WAKING) { + preempt_disable(); + task_rq_unlock(rq, p, &rf); + stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop, + &pending->arg, &pending->stop_work); + } else { + if (task_on_rq_queued(p)) + rq =3D move_queued_task(rq, &rf, p, dest_cpu); + task_rq_unlock(rq, p, &rf); + } + } + + return 0; + +out: + task_rq_unlock(rq, p, &rf); + return ret; +} +#endif + /* * Change a given task's CPU affinity. Migrate the thread to a * proper CPU and schedule it away if the CPU it's executing on @@ -3169,7 +3267,18 @@ int __set_cpus_allowed_ptr(struct task_struct *p, st= ruct affinity_context *ctx) struct rq_flags rf; struct rq *rq; =20 +#ifdef CONFIG_RPAL +retry: + rq =3D task_rq_lock(p, &rf); + if (rpal_test_task_thread_flag(p, RPAL_CPU_LOCKED_BIT)) { + update_rq_clock(rq); + task_rq_unlock(rq, p, &rf); + schedule(); + goto retry; + } +#else rq =3D task_rq_lock(p, &rf); +#endif /* * Masking should be skipped if SCA_USER or any of the SCA_MIGRATE_* * flags are set. --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pg1-f172.google.com (mail-pg1-f172.google.com [209.85.215.172]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9CEF3191499 for ; Fri, 30 May 2025 09:32:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.172 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597570; cv=none; b=YcHWlywGf/LG/SmA0VnbzB1yiWwQGCLzl693T+t8enpcN4JGNrV4MAOv8lSqHc1E0KZYgQxLa3p/oCLKdOp4EB2FMXZBkOkeWocYgxFMz9FbKnBBomRSBOxvvSBaAXu6TiiCtzXREESMk6OBuLdZa5ozYnm5kvAv5p7tgtKa8fk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597570; c=relaxed/simple; bh=gli6f5Pht7GF/KlCotN9/63ZsUUJh74HTo8OtVCQXPg=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=YWJwjktg2KDtYXoMklmtp3isLSvcM0wXnJHGpv7iZrzHhXy1iBFgnU407Kpg+flij+KC7ml6DLdTJQoN661mFMNA3D8LkDSDJac1KayLvX5TQqx9IanraTLsbK6HHZZSQ3eechPHNQyL+FlkFKlqozeYNAh+25FYB0PeaBxXLTM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=TGRyou1+; arc=none smtp.client-ip=209.85.215.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="TGRyou1+" Received: by mail-pg1-f172.google.com with SMTP id 41be03b00d2f7-b26c5fd40a9so2319221a12.1 for ; Fri, 30 May 2025 02:32:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597568; x=1749202368; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; 
From: Bo Li
Subject: [RFC v2 17/35] RPAL: add a mapping between fsbase and tasks
Date: Fri, 30 May
2025 17:27:45 +0800 Message-Id: <964eab3190221c0c880ee9a52957865512c8571c.1748594841.git.libo.gcs85@bytedance.com> X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" RPAL relies on the value of the fsbase register to determine whether a lazy switch is necessary. Therefore, a mapping between fsbase and tasks must be established. This patch allows a thread to register its fsbase value when it is registered as a receiver. The rpal_find_next_task() interface is used to locate the receiver corresponding to a given fsbase value. Additionally, a new rpal_misidentify() interface has been added to check if the current fsbase value matches the current task. If they do not match, the task corresponding to the fsbase is identified, the RPAL_LAZY_SWITCHED_BIT flag is set, and the current task is recorded. The kernel can later use this flag and the recorded task to backtrack to the task before the lazy switch. Signed-off-by: Bo Li --- arch/x86/rpal/core.c | 85 ++++++++++++++++++++++++++++++++++++++++++ arch/x86/rpal/thread.c | 57 +++++++++++++++++++++++++++- include/linux/rpal.h | 15 ++++++++ 3 files changed, 156 insertions(+), 1 deletion(-) diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c index c185a453c1b2..19c4ef38bca3 100644 --- a/arch/x86/rpal/core.c +++ b/arch/x86/rpal/core.c @@ -7,6 +7,7 @@ */ =20 #include +#include =20 #include "internal.h" =20 @@ -33,12 +34,96 @@ static inline void rpal_unlock_cpu(struct task_struct *= tsk) } } =20 + +static inline struct task_struct *rpal_get_sender_task(void) +{ + struct task_struct *next; + + next =3D current->rpal_rd->sender; + current->rpal_rd->sender =3D NULL; + + return next; +} + +/* + * RPAL uses the value of fsbase (which libc uses as the base + * address for thread-local storage) to determine whether a + * lazy switch should be performed. 
+ */ +static inline struct task_struct *rpal_misidentify(void) +{ + struct task_struct *next =3D NULL; + struct rpal_service *cur =3D rpal_current_service(); + unsigned long fsbase; + + fsbase =3D rdfsbase(); + if (unlikely(!rpal_is_correct_address(cur, fsbase))) { + if (rpal_test_current_thread_flag(RPAL_LAZY_SWITCHED_BIT)) { + /* current is receiver, next is sender */ + next =3D rpal_get_sender_task(); + if (unlikely(next =3D=3D NULL)) { + rpal_err("cannot find sender task\n"); + goto out; + } + } else { + /* current is sender, next is receiver */ + next =3D rpal_find_next_task(fsbase); + if (unlikely(next =3D=3D NULL)) { + rpal_err( + "cannot find receiver task, fsbase: 0x%016lx\n", + fsbase); + goto out; + } + rpal_set_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT); + next->rpal_rd->sender =3D current; + } + } +out: + return next; +} + +struct task_struct *rpal_find_next_task(unsigned long fsbase) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct rpal_service *tgt; + struct task_struct *tsk =3D NULL; + int i; + + tgt =3D rpal_get_mapped_service_by_addr(cur, fsbase); + if (unlikely(!tgt)) { + pr_debug("rpal debug: cannot find legal rs, fsbase: 0x%016lx\n", + fsbase); + return NULL; + } + for (i =3D 0; i < RPAL_MAX_RECEIVER_NUM; ++i) { + if (tgt->fs_tsk_map[i].fsbase =3D=3D fsbase) { + tsk =3D tgt->fs_tsk_map[i].tsk; + break; + } + } + rpal_put_service(tgt); + + return tsk; +} + +static bool check_hardware_features(void) +{ + if (!boot_cpu_has(X86_FEATURE_FSGSBASE)) { + rpal_err("no fsgsbase feature\n"); + return false; + } + return true; +} + int __init rpal_init(void) { int ret =3D 0; =20 rpal_cap =3D 0; =20 + if (!check_hardware_features()) + goto fail; + ret =3D rpal_service_init(); if (ret) goto fail; diff --git a/arch/x86/rpal/thread.c b/arch/x86/rpal/thread.c index bc203e9c6e5e..db3b13ff82be 100644 --- a/arch/x86/rpal/thread.c +++ b/arch/x86/rpal/thread.c @@ -7,9 +7,53 @@ */ =20 #include +#include =20 #include "internal.h" =20 +static bool set_fs_tsk_map(void) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct rpal_fsbase_tsk_map *ftm; + unsigned long fsbase =3D rdfsbase(); + bool success =3D false; + int i =3D 0; + + for (i =3D 0; i < RPAL_MAX_RECEIVER_NUM; ++i) { + ftm =3D &cur->fs_tsk_map[i]; + if (ftm->fsbase =3D=3D 0 && + cmpxchg64(&ftm->fsbase, 0, fsbase) =3D=3D 0) { + ftm->tsk =3D current; + success =3D true; + break; + } + } + + return success; +} + +static bool clear_fs_tsk_map(void) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct rpal_fsbase_tsk_map *ftm; + unsigned long fsbase =3D rdfsbase(); + bool success =3D false; + int i =3D 0; + + for (i =3D 0; i < RPAL_MAX_RECEIVER_NUM; ++i) { + ftm =3D &cur->fs_tsk_map[i]; + if (ftm->fsbase =3D=3D fsbase) { + ftm->tsk =3D NULL; + barrier(); + ftm->fsbase =3D 0; + success =3D true; + break; + } + } + + return success; +} + static void rpal_common_data_init(struct rpal_common_data *rcd) { rcd->bp_task =3D current; @@ -54,6 +98,7 @@ int rpal_register_sender(unsigned long addr) rsd->rsp =3D rsp; rsd->scc =3D (struct rpal_sender_call_context *)(addr - rsp->user_start + rsp->kernel_start); + rsd->receiver =3D NULL; =20 current->rpal_sd =3D rsd; rpal_set_current_thread_flag(RPAL_SENDER_BIT); @@ -122,15 +167,21 @@ int rpal_register_receiver(unsigned long addr) goto put_shared_page; } =20 + if (!set_fs_tsk_map()) { + ret =3D -EAGAIN; + goto free_rrd; + } + rpal_common_data_init(&rrd->rcd); if (rpal_init_thread_pending(&rrd->rcd)) { ret =3D -ENOMEM; - goto free_rrd; + goto clear_fs; } 
rrd->rsp =3D rsp; rrd->rcc =3D (struct rpal_receiver_call_context *)(addr - rsp->user_start + rsp->kernel_start); + rrd->sender =3D NULL; =20 current->rpal_rd =3D rrd; rpal_set_current_thread_flag(RPAL_RECEIVER_BIT); @@ -139,6 +190,8 @@ int rpal_register_receiver(unsigned long addr) =20 return 0; =20 +clear_fs: + clear_fs_tsk_map(); free_rrd: kfree(rrd); put_shared_page: @@ -158,6 +211,8 @@ int rpal_unregister_receiver(void) goto out; } =20 + clear_fs_tsk_map(); + rpal_put_shared_page(rrd->rsp); rpal_clear_current_thread_flag(RPAL_RECEIVER_BIT); rpal_free_thread_pending(&rrd->rcd); diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 5b115be14a55..45137770fac6 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -80,6 +80,9 @@ /* No more than 15 services can be requested due to limitation of MPK. */ #define MAX_REQUEST_SERVICE 15 =20 +/* We allow at most 16 receiver thread in one process */ +#define RPAL_MAX_RECEIVER_NUM 16 + enum { RPAL_REQUEST_MAP, RPAL_REVERSE_MAP, @@ -100,6 +103,7 @@ enum rpal_task_flag_bits { RPAL_SENDER_BIT, RPAL_RECEIVER_BIT, RPAL_CPU_LOCKED_BIT, + RPAL_LAZY_SWITCHED_BIT, }; =20 enum rpal_receiver_state { @@ -145,6 +149,11 @@ struct rpal_poll_data { wait_queue_head_t rpal_waitqueue; }; =20 +struct rpal_fsbase_tsk_map { + unsigned long fsbase; + struct task_struct *tsk; +}; + /* * Each RPAL process (a.k.a RPAL service) should have a pointer to * struct rpal_service in all its tasks' task_struct. @@ -202,6 +211,9 @@ struct rpal_service { /* Notify service is released by others */ struct rpal_poll_data rpd; =20 + /* fsbase / pid map */ + struct rpal_fsbase_tsk_map fs_tsk_map[RPAL_MAX_RECEIVER_NUM]; + /* delayed service put work */ struct delayed_work delayed_put_work; =20 @@ -283,12 +295,14 @@ struct rpal_receiver_data { struct rpal_common_data rcd; struct rpal_shared_page *rsp; struct rpal_receiver_call_context *rcc; + struct task_struct *sender; }; =20 struct rpal_sender_data { struct rpal_common_data rcd; struct rpal_shared_page *rsp; struct rpal_sender_call_context *scc; + struct task_struct *receiver; }; =20 enum rpal_command_type { @@ -465,6 +479,7 @@ struct rpal_service *rpal_get_mapped_service_by_id(stru= ct rpal_service *rs, int rpal_rebuild_sender_context_on_fault(struct pt_regs *regs, unsigned long addr, int error_code); struct mm_struct *rpal_pf_get_real_mm(unsigned long address, int *rebuild); +struct task_struct *rpal_find_next_task(unsigned long fsbase); =20 extern void rpal_pick_mmap_base(struct mm_struct *mm, struct rlimit *rlim_stack); --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pj1-f52.google.com (mail-pj1-f52.google.com [209.85.216.52]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2757522170B for ; Fri, 30 May 2025 09:33:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.52 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597586; cv=none; b=TKOZOJn26rLUJ1Xyt9nhdi8SzGvoG4C3WRUlP6B3gKNdyWCKGv5SOkaaQGkjf4RZv78/msIxV4j+WGLmZGOgB9YauK1fBcCWQsdhgO4/O/hPci9uNVqLb4iXnBPbVeeIYJlTOgj4VQbGArF/XaOSqYDJTmzpXXajD3UwW+GJ4pg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597586; c=relaxed/simple; bh=eF1gU64raWj5Emit2P3crNbm/xRze6gEOQkFioffxcs=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; 
From: Bo Li
Subject: [RFC v2 18/35] sched: pick a specified task
Date: Fri, 30 May 2025 17:27:46 +0800
Message-Id: <6e785c48ed266694748e0e71e264b94b27d9fa7b.1748594841.git.libo.gcs85@bytedance.com>

When a lazy switch occurs, the kernel already gets the task_struct of the
next task to switch to. However, the CFS does not provide an interface to
explicitly specify the next task. Therefore, RPAL must implement its own
mechanism to pick a specified task.

This patch introduces two interfaces, rpal_pick_next_task_fair() and
rpal_pick_task_fair(), to achieve this functionality. These interfaces
leverage the sched_entity of the target task to modify the CFS data
structures directly. Additionally, the patch adapts to the SCHED_CORE
feature by temporarily setting the highest weight for the specified task,
ensuring that the core will select this task preferentially during
scheduling decisions.
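A minimal sketch of how these explicit-pick interfaces are meant to be
driven; the wrapper function and the WARN_ON_ONCE below are assumptions
made for illustration, and only the rpal_pick_next_task_fair() /
rpal_pick_task_fair() prototypes come from this patch.

/*
 * Illustrative only: a lazy-switch caller already knows which task must
 * run next, so it bypasses the normal CFS choice and merely asks the
 * fair class to update its bookkeeping for that specific task.
 */
static struct task_struct *
rpal_pick_example(struct rq *rq, struct task_struct *prev,
		  struct task_struct *next, struct rq_flags *rf)
{
	struct task_struct *picked;

	/*
	 * Regular scheduling would use pick_next_task_fair(rq, prev, rf)
	 * and let CFS select the entity with the smallest vruntime.
	 */
	picked = rpal_pick_next_task_fair(prev, next, rq, rf);

	/* An explicit pick must hand back exactly the requested task. */
	WARN_ON_ONCE(picked != next);
	return picked;
}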
Signed-off-by: Bo Li --- kernel/sched/core.c | 212 +++++++++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 109 ++++++++++++++++++++++ kernel/sched/sched.h | 8 ++ 3 files changed, 329 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a862bf4a0161..2e76376c5172 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -11003,3 +11003,215 @@ void sched_enq_and_set_task(struct sched_enq_and_= set_ctx *ctx) set_next_task(rq, ctx->p); } #endif /* CONFIG_SCHED_CLASS_EXT */ + +#ifdef CONFIG_RPAL +#ifdef CONFIG_SCHED_CORE +static inline struct task_struct * +__rpal_pick_next_task(struct rq *rq, struct task_struct *prev, + struct task_struct *next, struct rq_flags *rf) +{ + struct task_struct *p; + + if (likely(prev->sched_class =3D=3D &fair_sched_class && + next->sched_class =3D=3D &fair_sched_class)) { + p =3D rpal_pick_next_task_fair(prev, next, rq, rf); + return p; + } + + BUG(); +} + +static struct task_struct *rpal_pick_next_task(struct rq *rq, + struct task_struct *prev, + struct task_struct *next, + struct rq_flags *rf) +{ + struct task_struct *p; + const struct cpumask *smt_mask; + bool fi_before =3D false; + bool core_clock_updated =3D (rq =3D=3D rq->core); + unsigned long cookie; + int i, cpu, occ =3D 0; + struct rq *rq_i; + bool need_sync; + + if (!sched_core_enabled(rq)) + return __rpal_pick_next_task(rq, prev, next, rf); + + cpu =3D cpu_of(rq); + + /* Stopper task is switching into idle, no need core-wide selection. */ + if (cpu_is_offline(cpu)) { + rq->core_pick =3D NULL; + return __rpal_pick_next_task(rq, prev, next, rf); + } + + if (rq->core->core_pick_seq =3D=3D rq->core->core_task_seq && + rq->core->core_pick_seq !=3D rq->core_sched_seq && + rq->core_pick) { + WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq); + + /* ignore rq->core_pick, always pick next */ + if (rq->core_pick =3D=3D next) { + put_prev_task(rq, prev); + set_next_task(rq, next); + + rq->core_pick =3D NULL; + goto out; + } + } + + put_prev_task_balance(rq, prev, rf); + + smt_mask =3D cpu_smt_mask(cpu); + need_sync =3D !!rq->core->core_cookie; + + /* reset state */ + rq->core->core_cookie =3D 0UL; + if (rq->core->core_forceidle_count) { + if (!core_clock_updated) { + update_rq_clock(rq->core); + core_clock_updated =3D true; + } + sched_core_account_forceidle(rq); + /* reset after accounting force idle */ + rq->core->core_forceidle_start =3D 0; + rq->core->core_forceidle_count =3D 0; + rq->core->core_forceidle_occupation =3D 0; + need_sync =3D true; + fi_before =3D true; + } + + rq->core->core_task_seq++; + + if (!need_sync) { + next =3D rpal_pick_task_fair(rq, next); + if (!next->core_cookie) { + rq->core_pick =3D NULL; + /* + * For robustness, update the min_vruntime_fi for + * unconstrained picks as well. 
+ */ + WARN_ON_ONCE(fi_before); + task_vruntime_update(rq, next, false); + goto out_set_next; + } + } + + for_each_cpu_wrap(i, smt_mask, cpu) { + rq_i =3D cpu_rq(i); + + if (i !=3D cpu && (rq_i !=3D rq->core || !core_clock_updated)) + update_rq_clock(rq_i); + + /* ignore prio, always pick next */ + if (i =3D=3D cpu) + rq_i->core_pick =3D rpal_pick_task_fair(rq, next); + else + rq_i->core_pick =3D pick_task(rq_i); + } + + cookie =3D rq->core->core_cookie =3D next->core_cookie; + + for_each_cpu(i, smt_mask) { + rq_i =3D cpu_rq(i); + p =3D rq_i->core_pick; + + if (!cookie_equals(p, cookie)) { + p =3D NULL; + if (cookie) + p =3D sched_core_find(rq_i, cookie); + if (!p) + p =3D idle_sched_class.pick_task(rq_i); + } + + rq_i->core_pick =3D p; + + if (p =3D=3D rq_i->idle) { + if (rq_i->nr_running) { + rq->core->core_forceidle_count++; + if (!fi_before) + rq->core->core_forceidle_seq++; + } + } else { + occ++; + } + } + + if (schedstat_enabled() && rq->core->core_forceidle_count) { + rq->core->core_forceidle_start =3D rq_clock(rq->core); + rq->core->core_forceidle_occupation =3D occ; + } + + rq->core->core_pick_seq =3D rq->core->core_task_seq; + WARN_ON_ONCE(next !=3D rq->core_pick); + rq->core_sched_seq =3D rq->core->core_pick_seq; + + for_each_cpu(i, smt_mask) { + rq_i =3D cpu_rq(i); + + /* + * An online sibling might have gone offline before a task + * could be picked for it, or it might be offline but later + * happen to come online, but its too late and nothing was + * picked for it. That's Ok - it will pick tasks for itself, + * so ignore it. + */ + if (!rq_i->core_pick) + continue; + + /* + * Update for new !FI->FI transitions, or if continuing to be in !FI: + * fi_before fi update? + * 0 0 1 + * 0 1 1 + * 1 0 1 + * 1 1 0 + */ + if (!(fi_before && rq->core->core_forceidle_count)) + task_vruntime_update(rq_i, rq_i->core_pick, + !!rq->core->core_forceidle_count); + + rq_i->core_pick->core_occupation =3D occ; + + if (i =3D=3D cpu) { + rq_i->core_pick =3D NULL; + continue; + } + + /* Did we break L1TF mitigation requirements? 
*/ + WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick)); + + if (rq_i->curr =3D=3D rq_i->core_pick) { + rq_i->core_pick =3D NULL; + continue; + } + + resched_curr(rq_i); + } + +out_set_next: + set_next_task(rq, next); +out: + if (rq->core->core_forceidle_count && next =3D=3D rq->idle) + queue_core_balance(rq); + + return next; +} +#else +static inline struct task_struct * +rpal_pick_next_task(struct rq *rq, struct task_struct *prev, + struct task_struct *next, struct rq_flags *rf) +{ + struct task_struct *p; + + if (likely(prev->sched_class =3D=3D &fair_sched_class && + next->sched_class =3D=3D &fair_sched_class)) { + p =3D rpal_pick_next_task_fair(prev, next, rq, rf); + return p; + } + + BUG(); +} +#endif +#endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 125912c0e9dd..d9c16d974a47 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8983,6 +8983,115 @@ void fair_server_init(struct rq *rq) dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick_task); } =20 +#ifdef CONFIG_RPAL +/* if the next is throttled, unthrottle it */ +static void rpal_unthrottle(struct rq *rq, struct task_struct *next) +{ + struct sched_entity *se; + struct cfs_rq *cfs_rq; + + se =3D &next->se; + for_each_sched_entity(se) { + cfs_rq =3D cfs_rq_of(se); + if (cfs_rq_throttled(cfs_rq)) + unthrottle_cfs_rq(cfs_rq); + + if (cfs_rq =3D=3D &rq->cfs) + break; + } +} + +struct task_struct *rpal_pick_task_fair(struct rq *rq, struct task_struct = *next) +{ + struct sched_entity *se; + struct cfs_rq *cfs_rq; + + rpal_unthrottle(rq, next); + + se =3D &next->se; + for_each_sched_entity(se) { + cfs_rq =3D cfs_rq_of(se); + + if (cfs_rq->curr && cfs_rq->curr->on_rq) + update_curr(cfs_rq); + + if (unlikely(check_cfs_rq_runtime(cfs_rq))) + continue; + + clear_buddies(cfs_rq, se); + } + + return next; +} + +struct task_struct *rpal_pick_next_task_fair(struct task_struct *prev, + struct task_struct *next, + struct rq *rq, struct rq_flags *rf) +{ + struct cfs_rq *cfs_rq; + struct sched_entity *se; + struct task_struct *p; + + rpal_unthrottle(rq, next); + + p =3D rpal_pick_task_fair(rq, next); + + if (!sched_fair_runnable(rq)) + panic("rpal error: !sched_fair_runnable\n"); + +#ifdef CONFIG_FAIR_GROUP_SCHED + __put_prev_set_next_dl_server(rq, prev, next); + + se =3D &next->se; + p =3D task_of(se); + + /* + * Since we haven't yet done put_prev_entity and if the selected task + * is a different task than we started out with, try and touch the + * least amount of cfs_rqs. + */ + if (prev !=3D p) { + struct sched_entity *pse =3D &prev->se; + + while (!(cfs_rq =3D is_same_group(se, pse))) { + int se_depth =3D se->depth; + int pse_depth =3D pse->depth; + + if (se_depth <=3D pse_depth) { + put_prev_entity(cfs_rq_of(pse), pse); + pse =3D parent_entity(pse); + } + if (se_depth >=3D pse_depth) { + set_next_entity(cfs_rq_of(se), se); + se =3D parent_entity(se); + } + } + + put_prev_entity(cfs_rq, pse); + set_next_entity(cfs_rq, se); + } +#endif +#ifdef CONFIG_SMP + /* + * Move the next running task to the front of + * the list, so our cfs_tasks list becomes MRU + * one. 
+ */ + list_move(&p->se.group_node, &rq->cfs_tasks); +#endif + + WARN_ON_ONCE(se->sched_delayed); + + if (hrtick_enabled_fair(rq)) + hrtick_start_fair(rq, p); + + update_misfit_status(p, rq); + sched_fair_update_stop_tick(rq, p); + + return p; +} +#endif + /* * Account for a descheduled task: */ diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index c5a6a503eb6d..f8fd26b584c9 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2575,6 +2575,14 @@ static inline bool sched_fair_runnable(struct rq *rq) =20 extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_= struct *prev, struct rq_flags *rf); extern struct task_struct *pick_task_idle(struct rq *rq); +#ifdef CONFIG_RPAL +extern struct task_struct *rpal_pick_task_fair(struct rq *rq, + struct task_struct *next); +extern struct task_struct *rpal_pick_next_task_fair(struct task_struct *pr= ev, + struct task_struct *next, + struct rq *rq, + struct rq_flags *rf); +#endif =20 #define SCA_CHECK 0x01 #define SCA_MIGRATE_DISABLE 0x02 --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pl1-f182.google.com (mail-pl1-f182.google.com [209.85.214.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6E10D228CBC for ; Fri, 30 May 2025 09:33:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.182 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597601; cv=none; b=S+WiR7oVLH6WwVAaT5fHc+1Ggxd9QlIyynGGk6RwdTbwT6E88R9t3uH1IIsMqgeRL+IOkhzGrfc165SEkG4Y9kBZLzvwFovp0Te8RJNvdGCnmtlmzsqNIftDy+dDiE+Qusfz3FQtJstrhVgqNfVRifPIO54fDZYUq4i3PA+ul9Y= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597601; c=relaxed/simple; bh=/coDkbmv4cf7oezrwY1QvpvI3EAobs0jwEpk59sloz8=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=sZtDHVlE9kTPy82BkNnK47NbWQ6mnNaoAjT62fN/Tb1obWgmA5H4cWxsv27xw6BtEqRAzwt6kzTzXKWbFIVmx5nF/VLOnfxKPOpM9dopaB3nVdmh+qnVfv80pTCuiNaPB8TmhXmMUuVS5g3lbtNyg9TUlEs8QAtS5h4agtUL1J0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=bTSGMaBj; arc=none smtp.client-ip=209.85.214.182 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="bTSGMaBj" Received: by mail-pl1-f182.google.com with SMTP id d9443c01a7336-234ade5a819so17420825ad.1 for ; Fri, 30 May 2025 02:33:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597598; x=1749202398; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=nfNjslNRihNfiwS5akxUBJSCwiXFfcNf3iOCsj/nqzk=; b=bTSGMaBjRBh1IgEiJ3JOdq5VzJVp+wZ5mTu5WKcDcBiSWf2sLqZDc14pzQ/2sC1ct8 v1HhpaVodkOP6ZirSUp9P3NxEbVso014oEzzpZWt1ejAeICD/0dugdhnDf/rpEN4LxoC SoZjHkqnf0EwMv8Axpx4GbhrsS4+uUgFHbaVHYFSzzQZn8AYL2CaKVi7V0DTY3TUvcpK OJeRvORwn/JeJPBpIN3ma+pgg58xWTnKDEMh27BumuRl546SqXf1HFG1p8n8Rk+qItRM 
From: Bo Li
Subject: [RFC v2 19/35] RPAL: add lazy switch main logic
Date: Fri, 30 May 2025 17:27:47 +0800
Message-Id: <91e9db5ad4a3e1e58a666bd496e55d8f8db2c63c.1748594841.git.libo.gcs85@bytedance.com>
Content-Transfer-Encoding:
quoted-printable Content-Type: text/plain; charset="utf-8" The implementation of lazy switch differs from a regular schedule() in three key aspects: 1. It occurs at the kernel entry with irq disabled. 2. The next task is explicitly pre-determined rather than selected by the scheduler. 3. User-space context (excluding general-purpose registers) remains unchanged across the switch. This patch introduces the rpal_schedule() interface to address these requirements. Firstly, the rpal_schedule() skips irq enabling in finish_lock_switch(), preserving the irq-disabled state required during kernel entry. Secondly, the rpal_pick_next_task() interface is used to explicitly specify the target task, bypassing the default scheduler's decision-making process. Thirdly, non-general-purpose registers (e.g., FPU, vector units) are not restored during the switch, ensuring user space context remains intact. Handling of general-purpose registers will be addressed in a subsequent patch by RPAL before invoking rpal_schedule(). Signed-off-by: Bo Li --- arch/x86/kernel/process_64.c | 75 +++++++++++++++++++++ include/linux/rpal.h | 3 + kernel/sched/core.c | 126 +++++++++++++++++++++++++++++++++++ 3 files changed, 204 insertions(+) diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index 4830e9215de7..efc3f238c486 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -753,6 +753,81 @@ __switch_to(struct task_struct *prev_p, struct task_st= ruct *next_p) return prev_p; } =20 +#ifdef CONFIG_RPAL +__no_kmsan_checks +__visible __notrace_funcgraph struct task_struct * +__rpal_switch_to(struct task_struct *prev_p, struct task_struct *next_p) +{ + struct thread_struct *prev =3D &prev_p->thread; + struct thread_struct *next =3D &next_p->thread; + int cpu =3D smp_processor_id(); + + WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) && + this_cpu_read(hardirq_stack_inuse)); + + /* no need to switch fpu */ + /* __fpu_invalidate_fpregs_state() */ + x86_task_fpu(prev_p)->last_cpu =3D -1; + /* fpregs_activate() */ + __this_cpu_write(fpu_fpregs_owner_ctx, x86_task_fpu(next_p)); + trace_x86_fpu_regs_activated(x86_task_fpu(next_p)); + x86_task_fpu(next_p)->last_cpu =3D cpu; + set_tsk_thread_flag(prev_p, TIF_NEED_FPU_LOAD); + clear_tsk_thread_flag(next_p, TIF_NEED_FPU_LOAD); + + /* no need to save fs */ + savesegment(gs, prev_p->thread.gsindex); + if (static_cpu_has(X86_FEATURE_FSGSBASE)) + prev_p->thread.gsbase =3D __rdgsbase_inactive(); + else + save_base_legacy(prev_p, prev_p->thread.gsindex, GS); + + load_TLS(next, cpu); + + arch_end_context_switch(next_p); + + savesegment(es, prev->es); + if (unlikely(next->es | prev->es)) + loadsegment(es, next->es); + + savesegment(ds, prev->ds); + if (unlikely(next->ds | prev->ds)) + loadsegment(ds, next->ds); + + /* no need to load fs */ + if (static_cpu_has(X86_FEATURE_FSGSBASE)) { + if (unlikely(prev->gsindex || next->gsindex)) + loadseg(GS, next->gsindex); + + __wrgsbase_inactive(next->gsbase); + } else { + load_seg_legacy(prev->gsindex, prev->gsbase, next->gsindex, + next->gsbase, GS); + } + + /* skip pkru load as we will use pkru in RPAL */ + + this_cpu_write(current_task, next_p); + this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p)); + + /* no need to load fpu */ + + update_task_stack(next_p); + switch_to_extra(prev_p, next_p); + + if (static_cpu_has_bug(X86_BUG_SYSRET_SS_ATTRS)) { + unsigned short ss_sel; + + savesegment(ss, ss_sel); + if (ss_sel !=3D __KERNEL_DS) + loadsegment(ss, __KERNEL_DS); + } + resctrl_sched_in(next_p); 
+ + return prev_p; +} +#endif + void set_personality_64bit(void) { /* inherit personality from parent */ diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 45137770fac6..0813db4552c0 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -487,4 +487,7 @@ int rpal_try_to_wake_up(struct task_struct *p); int rpal_init_thread_pending(struct rpal_common_data *rcd); void rpal_free_thread_pending(struct rpal_common_data *rcd); int rpal_set_cpus_allowed_ptr(struct task_struct *p, bool is_lock); +void rpal_schedule(struct task_struct *next); +asmlinkage struct task_struct * +__rpal_switch_to(struct task_struct *prev_p, struct task_struct *next_p); #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2e76376c5172..760d88458b39 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6827,6 +6827,12 @@ static bool try_to_block_task(struct rq *rq, struct = task_struct *p, if (unlikely(is_special_task_state(task_state))) flags |=3D DEQUEUE_SPECIAL; =20 +#ifdef CONFIG_RPAL + /* DELAY_DEQUEUE will cause CPU stalls after lazy switch, skip it */ + if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) + flags |=3D DEQUEUE_SPECIAL; +#endif + /* * __schedule() ttwu() * prev_state =3D prev->state; if (p->on_rq && ...) @@ -11005,6 +11011,62 @@ void sched_enq_and_set_task(struct sched_enq_and_s= et_ctx *ctx) #endif /* CONFIG_SCHED_CLASS_EXT */ =20 #ifdef CONFIG_RPAL +static struct rq *rpal_finish_task_switch(struct task_struct *prev) + __releases(rq->lock) +{ + struct rq *rq =3D this_rq(); + struct mm_struct *mm =3D rq->prev_mm; + + if (WARN_ONCE(preempt_count() !=3D 2*PREEMPT_DISABLE_OFFSET, + "corrupted preempt_count: %s/%d/0x%x\n", + current->comm, current->pid, preempt_count())) + preempt_count_set(FORK_PREEMPT_COUNT); + + rq->prev_mm =3D NULL; + vtime_task_switch(prev); + perf_event_task_sched_in(prev, current); + finish_task(prev); + tick_nohz_task_switch(); + + /* finish_lock_switch, not enable irq */ + spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_); + __balance_callbacks(rq); + raw_spin_rq_unlock(rq); + + finish_arch_post_lock_switch(); + kcov_finish_switch(current); + kmap_local_sched_in(); + + fire_sched_in_preempt_notifiers(current); + if (mm) { + membarrier_mm_sync_core_before_usermode(mm); + mmdrop(mm); + } + + return rq; +} + +static __always_inline struct rq *rpal_context_switch(struct rq *rq, + struct task_struct *prev, + struct task_struct *next, + struct rq_flags *rf) +{ + /* irq is off */ + prepare_task_switch(rq, prev, next); + arch_start_context_switch(prev); + + membarrier_switch_mm(rq, prev->active_mm, next->mm); + switch_mm_irqs_off(prev->active_mm, next->mm, next); + lru_gen_use_mm(next->mm); + + switch_mm_cid(rq, prev, next); + + prepare_lock_switch(rq, next, rf); + __rpal_switch_to(prev, next); + barrier(); + return rpal_finish_task_switch(prev); +} + #ifdef CONFIG_SCHED_CORE static inline struct task_struct * __rpal_pick_next_task(struct rq *rq, struct task_struct *prev, @@ -11214,4 +11276,68 @@ rpal_pick_next_task(struct rq *rq, struct task_str= uct *prev, BUG(); } #endif + +/* enter and exit with irqs disabled() */ +void __sched notrace rpal_schedule(struct task_struct *next) +{ + struct task_struct *prev, *picked; + bool preempt =3D false; + unsigned long *switch_count; + unsigned long prev_state; + struct rq_flags rf; + struct rq *rq; + int cpu; + + /* sched_mode =3D SM_NONE */ + + preempt_disable(); + + trace_sched_entry_tp(preempt, CALLER_ADDR0); + + cpu =3D smp_processor_id(); + rq =3D cpu_rq(cpu); + prev =3D rq->curr; + + 
schedule_debug(prev, preempt); + + if (sched_feat(HRTICK) || sched_feat(HRTICK_DL)) + hrtick_clear(rq); + + rcu_note_context_switch(preempt); + rq_lock(rq, &rf); + smp_mb__after_spinlock(); + + rq->clock_update_flags <<=3D 1; + update_rq_clock(rq); + rq->clock_update_flags =3D RQCF_UPDATED; + + switch_count =3D &prev->nivcsw; + + prev_state =3D READ_ONCE(prev->__state); + if (prev_state) { + try_to_block_task(rq, prev, &prev_state); + switch_count =3D &prev->nvcsw; + } + + picked =3D rpal_pick_next_task(rq, prev, next, &rf); + rq_set_donor(rq, next); + if (unlikely(next !=3D picked)) + panic("rpal error: next !=3D picked\n"); + + clear_tsk_need_resched(prev); + clear_preempt_need_resched(); + rq->last_seen_need_resched_ns =3D 0; + + rq->nr_switches++; + RCU_INIT_POINTER(rq->curr, next); + ++*switch_count; + migrate_disable_switch(rq, prev); + psi_account_irqtime(rq, prev, next); + psi_sched_switch(prev, next, !task_on_rq_queued(prev) || + prev->se.sched_delayed); + trace_sched_switch(preempt, prev, next, prev_state); + rq =3D rpal_context_switch(rq, prev, next, &rf); + trace_sched_exit_tp(true, CALLER_ADDR0); + preempt_enable_no_resched(); +} #endif --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pg1-f171.google.com (mail-pg1-f171.google.com [209.85.215.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D09F422A81F for ; Fri, 30 May 2025 09:33:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.171 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597616; cv=none; b=sdB+QOuMnsH7TSqymGs3I6E7A9s9JATgtQj5MPDMUbgG6nJ/meIl7dp6agGxcFrf/aDD0WFlf3zgORdPYIk0ebSjZ3n0DHvvOSpei8BgQXQ3Bhd9cr130DHY5Nc1T8f+v/rDz75i+Z5jNbbbkhuqt0IPnlwr/r2XKNFo/HbKfFE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597616; c=relaxed/simple; bh=46IfJD6O+6SqRHValc80DBFn7y6T8YogODhoWzSz5NE=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=DhzL+8mKKCmKN5XtflpdhCiDoI6fyTGxnS2r1cznn0VupA+C772gpKoohLQm6YBAsydkLd3aklwhkLviNuTHg1qRqg6GMNTpfbeffIc02VgFYLVPaCbdw8OnWkM74gc19qOHF+//9Yw8mWDtGy9Sbhf0II8ynsHR8lwgOtbnyM8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=hvzumzsQ; arc=none smtp.client-ip=209.85.215.171 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="hvzumzsQ" Received: by mail-pg1-f171.google.com with SMTP id 41be03b00d2f7-b2c3c689d20so1404760a12.3 for ; Fri, 30 May 2025 02:33:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597614; x=1749202414; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=IlVLKhSr7F4Ter0cssioRNY+mLV+eE5Kw7Y7qzhdKJU=; b=hvzumzsQd/1QOcNR9/uhM+UmOTvrOHgj5CQds8LFcKMO4yHbvfVDmRmkOK6fP31723 a2Dw4rKLbTbq4oEyVstg0aHaELNgWLSLwt/7MEYhDfeB0QEX+BXX1stuyUxLyqQGyo11 
From: Bo Li
Subject: [RFC v2 20/35] RPAL: add rpal_ret_from_lazy_switch
Date: Fri, 30 May 2025 17:27:48 +0800
Message-Id: <924aa7959502c4c3271cb311632eb505e894e26e.1748594841.git.libo.gcs85@bytedance.com>
Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" After lazy switch the task before the lazy switch will lose its user mode context (which is passed to the task after the lazy switch). Therefore, RPAL needs to handle the issue of the previous task losing its user mode context. After the lazy switch occurs, the sender can resume execution in two ways. One way is to be scheduled by the scheduler. In this case, RPAL handles this issue in a manner similar to ret_from_fork. the sender will enter rpal_ret_from_lazy_switch through the constructed stack frame by lazy switchto execute the return logic and finally return to the pre-defined user mode (referred to as "kernel return"). The other way is to be switched back to by the receiver through another lazy switch. In this case, the receiver will pass the user mode context to the sender, so there is no need to construct a user mode context for the sender. And the receiver can return to the user mode through the kernel return method. rpal_ret_from_lazy_switch primarily handles scheduler cleanup work, similar to schedule_tail(), but does not perform set_child_tid-otherwise, it might cause set_child_tid to be executed repeatedly. It then calls rpal_kernel_ret(), which is primarily used to set the states of the sender and receiver and attempt to unlock the CPU. Finally, it performs syscall cleanup work and returns to user mode. Signed-off-by: Bo Li --- arch/x86/entry/entry_64.S | 23 ++++++++++++++++++++ arch/x86/rpal/core.c | 45 +++++++++++++++++++++++++++++++++++++-- include/linux/rpal.h | 5 ++++- kernel/sched/core.c | 25 +++++++++++++++++++++- 4 files changed, 94 insertions(+), 4 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index ed04a968cc7d..13b4d0684575 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -169,6 +169,29 @@ SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL) int3 SYM_CODE_END(entry_SYSCALL_64) =20 +#ifdef CONFIG_RPAL +SYM_CODE_START(rpal_ret_from_lazy_switch) + UNWIND_HINT_END_OF_STACK + ANNOTATE_NOENDBR + movq %rax, %rdi + call rpal_schedule_tail + + movq %rsp, %rdi + call rpal_kernel_ret + + movq %rsp, %rdi + call syscall_exit_to_user_mode /* returns with IRQs disabled */ + + UNWIND_HINT_REGS +#ifdef CONFIG_X86_FRED + ALTERNATIVE "jmp swapgs_restore_regs_and_return_to_usermode", \ + "jmp asm_fred_exit_user", X86_FEATURE_FRED +#else + jmp swapgs_restore_regs_and_return_to_usermode +#endif +SYM_CODE_END(rpal_ret_from_lazy_switch) +#endif + /* * %rdi: prev task * %rsi: next task diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c index 19c4ef38bca3..ed4c11e6838c 100644 --- a/arch/x86/rpal/core.c +++ b/arch/x86/rpal/core.c @@ -18,7 +18,7 @@ unsigned long rpal_cap; =20 static inline void rpal_lock_cpu(struct task_struct *tsk) { - rpal_set_cpus_allowed_ptr(tsk, true); + rpal_set_cpus_allowed_ptr(tsk, true, false); if (unlikely(!irqs_disabled())) { local_irq_disable(); rpal_err("%s: irq is enabled\n", __func__); @@ -27,13 +27,54 @@ static inline void rpal_lock_cpu(struct task_struct *ts= k) =20 static inline void rpal_unlock_cpu(struct task_struct *tsk) { - rpal_set_cpus_allowed_ptr(tsk, false); + rpal_set_cpus_allowed_ptr(tsk, false, false); if (unlikely(!irqs_disabled())) { local_irq_disable(); rpal_err("%s: irq is enabled\n", __func__); } } =20 +static inline void rpal_unlock_cpu_kernel_ret(struct task_struct *tsk) +{ + 
rpal_set_cpus_allowed_ptr(tsk, false, true); +} + +void rpal_kernel_ret(struct pt_regs *regs) +{ + struct task_struct *tsk; + struct rpal_receiver_call_context *rcc; + int state; + + if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) { + rcc =3D current->rpal_rd->rcc; + atomic_xchg(&rcc->receiver_state, RPAL_RECEIVER_STATE_KERNEL_RET); + } else { + tsk =3D current->rpal_sd->receiver; + rcc =3D tsk->rpal_rd->rcc; + rpal_clear_task_thread_flag(tsk, RPAL_LAZY_SWITCHED_BIT); + state =3D atomic_xchg(&rcc->sender_state, RPAL_SENDER_STATE_KERNEL_RET); + WARN_ON_ONCE(state !=3D RPAL_SENDER_STATE_CALL); + /* make sure kernel return is finished */ + smp_mb(); + WRITE_ONCE(tsk->rpal_rd->sender, NULL); + /* + * We must unlock receiver first, otherwise we may unlock + * receiver which is already locked by another sender. + * + * Sender A Receiver B Sender C + * lazy switch (A->B) + * kernel return + * unlock cpu A + * epoll_wait + * lazy switch(C->B) + * lock cpu B + * unlock cpu B + * BUG() BUG() + */ + rpal_unlock_cpu_kernel_ret(tsk); + rpal_unlock_cpu_kernel_ret(current); + } +} =20 static inline struct task_struct *rpal_get_sender_task(void) { diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 0813db4552c0..01b582fa821e 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -480,14 +480,17 @@ int rpal_rebuild_sender_context_on_fault(struct pt_re= gs *regs, unsigned long addr, int error_code); struct mm_struct *rpal_pf_get_real_mm(unsigned long address, int *rebuild); struct task_struct *rpal_find_next_task(unsigned long fsbase); +void rpal_kernel_ret(struct pt_regs *regs); =20 extern void rpal_pick_mmap_base(struct mm_struct *mm, struct rlimit *rlim_stack); int rpal_try_to_wake_up(struct task_struct *p); int rpal_init_thread_pending(struct rpal_common_data *rcd); void rpal_free_thread_pending(struct rpal_common_data *rcd); -int rpal_set_cpus_allowed_ptr(struct task_struct *p, bool is_lock); +int rpal_set_cpus_allowed_ptr(struct task_struct *p, bool is_lock, + bool is_kernel_ret); void rpal_schedule(struct task_struct *next); asmlinkage struct task_struct * __rpal_switch_to(struct task_struct *prev_p, struct task_struct *next_p); +asmlinkage __visible void rpal_schedule_tail(struct task_struct *prev); #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 760d88458b39..0f9343698198 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3181,7 +3181,8 @@ void rpal_free_thread_pending(struct rpal_common_data= *rcd) /* * CPU lock is forced and all cpumask will be ignored by RPAL temporary. */ -int rpal_set_cpus_allowed_ptr(struct task_struct *p, bool is_lock) +int rpal_set_cpus_allowed_ptr(struct task_struct *p, bool is_lock, + bool is_kernel_ret) { const struct cpumask *cpu_valid_mask =3D cpu_active_mask; struct set_affinity_pending *pending =3D p->rpal_cd->pending; @@ -3210,6 +3211,9 @@ int rpal_set_cpus_allowed_ptr(struct task_struct *p, = bool is_lock) rpal_clear_task_thread_flag(p, RPAL_CPU_LOCKED_BIT); } =20 + if (is_kernel_ret) + return __set_cpus_allowed_ptr_locked(p, &ac, rq, &rf); + update_rq_clock(rq); =20 if (cpumask_equal(&p->cpus_mask, ac.new_mask)) @@ -11011,6 +11015,25 @@ void sched_enq_and_set_task(struct sched_enq_and_s= et_ctx *ctx) #endif /* CONFIG_SCHED_CLASS_EXT */ =20 #ifdef CONFIG_RPAL +asmlinkage __visible void rpal_schedule_tail(struct task_struct *prev) + __releases(rq->lock) +{ + /* + * New tasks start with FORK_PREEMPT_COUNT, see there and + * finish_task_switch() for details. 
+ * + * finish_task_switch() will drop rq->lock() and lower preempt_count + * and the preempt_enable() will end up enabling preemption (on + * PREEMPT_COUNT kernels). + */ + + finish_task_switch(prev); + trace_sched_exit_tp(true, CALLER_ADDR0); + preempt_enable(); + + calculate_sigpending(); +} + static struct rq *rpal_finish_task_switch(struct task_struct *prev) __releases(rq->lock) { --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pj1-f53.google.com (mail-pj1-f53.google.com [209.85.216.53]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 498A222129F for ; Fri, 30 May 2025 09:33:50 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.53 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597632; cv=none; b=P7Unlc3WJ/9hzb8yxNyFahP4iCDhOt649Knhu240jcfoFdY+dJM312Fm9wjmh/suZOvQf25faRIKd2sy8HwxXlSApIfBPWQ/LMwRNFP4DKsXzSkUdTfNEp3TgLL2ewJq4GL9CHzYWMO2ExHj5pXYJzvwL+Hq/vk5qFktSPWH9IA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597632; c=relaxed/simple; bh=s8bvP/o1nUrwW7KF0DJ/pHvc3b43enUVUs73szJcvsA=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=GiVr5QKQDimoOPKqjeKaO1sRME6qUTVVgl5MhGrHYGtspNu53feOQqNtcpcZO5W6PnJsXJjQeVKW/jYF+LSKVIGH2PCKYqLKNv6UE/fmLXxf1ldygwu2VpF7SuAI5TtFeYgTwS9KunUyMxlyjEA6Eu1J7zi4b2QpxQ/gIkT49GI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=U7jWFD62; arc=none smtp.client-ip=209.85.216.53 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="U7jWFD62" Received: by mail-pj1-f53.google.com with SMTP id 98e67ed59e1d1-3119822df05so1886086a91.0 for ; Fri, 30 May 2025 02:33:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597629; x=1749202429; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=+2NFrx3BFqUZITaGnYPssFKiBOpe/dz6IJRWesRURms=; b=U7jWFD62nxS8TfhhmrTlMIbWjZGLXbkA3Q1GVDdMyhiibiNlV0l5YWrgI+JavnBPpb Gxq6GuT3tq2RN7mnaF9YyNih9RiSM57U7umDgRMf9gzCA7z4SZVNodtbn4WIDsRDWvcb BDTiIIyLlwNpb72o/C/BqOfI6mMv++FKB0YtlMAqTK7QUP77Ct2MzDCe56gVPo2RnuIa ybkovV0jXidA3NGQGQtlHGZDDaLJjNDCp8qJlsHjNJN6dALtCaW3w29v1/hm4x3Bs6B1 T6boHJn4kXyByPOr22bDDLKGgZ4JH8Z/S43M/bOkjz2DaPNk11bLJojgqn7XecTnHmd1 803g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597629; x=1749202429; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=+2NFrx3BFqUZITaGnYPssFKiBOpe/dz6IJRWesRURms=; b=kFvZ04zKeA6TfGUfZvNnzOqeTTtFG58fKASX1c4FDIgiohu/V8qzh1jFcpR+dulmsK M0s80AF6jVsszaOc+ab6pJgbBiYjhZMvGTpW+P8aA5yqmwRJgX66r4UU9JAkxsVnF1D7 EMOuty2QIbOl3WHck0xZX5aSuv+x5Hwm/U3S5lrZXAtsAdQhz4vgZTKLyl5Y5lNDCI7Z 4FuG/yxXfif03JZZDiKHGCcnPW94yPIiJNx9FQ/UfkianssjATHIdo7r/QUsjOLitkPW 
XsO4ijqf3ICXOTMnx7+A63b5hjuMIuVYb6CmaIa+bafFW43Un972vKT2vtiwKlwZLkGC 3IpQ== X-Forwarded-Encrypted: i=1; AJvYcCXZ+u7P1kTRTdr9nXeGiOn8ofieA3iMthsObYWXbUu9ptemyjHwryfFCx3Di1jZlCfPww+Qz3iqabNpo24=@vger.kernel.org X-Gm-Message-State: AOJu0YzteLB5PRcrk5b5FA0fLfW7sC9HQzrEtq5gzIum5NchWBROc6p5 l5FFHqvpYPjPmZR+VyxIOV0LjckXVOj6ZCoIk5G9NW8FrOrghgPHCLROUVgK8j525S8= X-Gm-Gg: ASbGncu4mgnR2USzKoaHIKVjXbeN57lc0w6RYjdyAKToC2QsUJlk8c4jepaXY/tLjTS tATK0wpTgNhTdq/YlRmPVPBpm07mMxpraeIax/ycimY/8kjJCEK8s6GE4kmxydUgBdNr6w6hVWV 7qg4Av17dJeKb0hj6DpvrrwZ5YgHlaqI3mIrQtWLVmVApMkowwlUa/VHsx7PXzsxJfpWA9B/q4l d0r9BT7D5mJezq6+QcGKGxLyEli0Lkk/FT0CT6nzA/6gR5wvdiaQvjj1ODmoE7r0iN/puw9/cc7 cvE1iNLsWMv3ZUYni7zdkIuXzseYBO9ZbntFx+++WKlkcFXBiFUjgAgqNyQxCldJ4xolf3PCkk+ n1aYz2LgYNQ== X-Google-Smtp-Source: AGHT+IHkejFlF7JacpAFc31d67Aa0abDbDTBPS+AvGZCJTX8xpAHAqZtwkMExSrIMlmiA7Dd0bS1gw== X-Received: by 2002:a17:90b:3ec3:b0:310:cea4:e3b9 with SMTP id 98e67ed59e1d1-31250452c5fmr1772636a91.34.1748597629413; Fri, 30 May 2025 02:33:49 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.33.34 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:33:49 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 21/35] RPAL: add kernel entry handling for lazy switch Date: Fri, 30 May 2025 17:27:49 +0800 Message-Id: <924aa7959502c4c3271cb311632eb505e894e26e.1748594841.git.libo.gcs85@bytedance.com> X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" At the kernel entry point, RPAL performs a lazy switch. Therefore, it is necessary to hook all kernel entry points to execute the logic related to the lazy switch. At the kernel entry, apart from some necessary operations related to the lazy switch (such as ensuring that the general-purpose registers remain unchanged before and after the lazy switch), the task before the lazy switch will lose its user mode context (which is passed to the task after the lazy switch). Therefore, the kernel entry also needs to handle the issue of the previous task losing its user mode context. 
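As a rough, self-contained C model of the behaviour described above (an illustration only: the structures and the names entry_point_model, find_task_by_fsbase and saved_user_ctx are invented for this sketch and are not identifiers added by the patch), each hooked entry point essentially has to decide whether the user context that just trapped in still belongs to the current task, and if not, hand the live registers to the task that owns it while rebuilding the previous task's user context from a pre-saved copy:

#include <stddef.h>

/* Small stand-ins for pt_regs and the per-task RPAL data. */
struct uregs {
	unsigned long ip, sp, bp, bx, r12, r13, r14, r15;
};

struct model_task {
	unsigned long fsbase;		/* user fsbase identifying the thread */
	struct uregs entry_regs;	/* registers saved at kernel entry */
	struct uregs saved_user_ctx;	/* context pre-saved in user mode */
};

/* Placeholder lookup over a small table of registered threads. */
static struct model_task *find_task_by_fsbase(struct model_task *tasks,
					       size_t n, unsigned long fsbase)
{
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].fsbase == fsbase)
			return &tasks[i];
	}
	return NULL;
}

/*
 * If the fsbase that trapped in no longer matches the current task, the
 * owning task inherits the live registers and the previous task falls
 * back to its pre-saved user context, so both can later return to user
 * mode with a consistent state.
 */
static struct model_task *entry_point_model(struct model_task *cur,
					    struct model_task *tasks, size_t n,
					    unsigned long user_fsbase)
{
	struct model_task *owner = find_task_by_fsbase(tasks, n, user_fsbase);

	if (!owner || owner == cur)
		return cur;				/* no lazy switch */

	owner->entry_regs = cur->entry_regs;	/* hand over live registers */
	cur->entry_regs = cur->saved_user_ctx;	/* rebuild lost user context */
	return owner;				/* task that should run now */
}

In the series itself the lookup is rpal_find_next_task() keyed by fsbase, the register hand-over happens on pt_regs, and the rebuilt context comes from the state the task pre-saved before the RPAL call.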
This patch hooks all locations where the transition from user mode to kernel mode occurs, including entry_SYSCALL_64, error_entry, and asm_exc_nmi. When the kernel detects a mismatch between the kernel-mode and user mode contexts, it executes the logic related to the lazy switch. Taking the switch from the sender to the receiver as an example, the receiver thread is first locked to the CPU where the sender is located. Then, the receiver thread in the CALL state is woken up through rpal_try_to_wake_up(). The general purpose register state (pt_regs) of the sender is copied to the receiver, and rpal_schedule() is executed to complete the lazy switch. Regarding the issue of the sender losing its context, the kernel loads the pre-saved user mode context of the sender into the sender's pt_regs and constructs the kernel stack frame of the sender in a manner similar to the fork operation. The handling of the switch from the receiver to the sender is similar, except that the receiver will be unlocked from the current CPU, and the receiver can only return to the user mode through the kernel return method. Signed-off-by: Bo Li --- arch/x86/entry/entry_64.S | 137 ++++++++++++++++++++++++++++++++++ arch/x86/kernel/asm-offsets.c | 3 + arch/x86/rpal/core.c | 137 ++++++++++++++++++++++++++++++++++ include/linux/rpal.h | 6 ++ 4 files changed, 283 insertions(+) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 13b4d0684575..59c38627510d 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -118,6 +118,20 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_= GLOBAL) UNTRAIN_RET CLEAR_BRANCH_HISTORY =20 +#ifdef CONFIG_RPAL + /* + * We first check if it is a RPAL sender/receiver with + * current->rpal_cd. For non-RPAL task, we just skip it. + * For rpal task, We may need to check if it needs to do + * lazy switch. + */ + movq PER_CPU_VAR(current_task), %r13 + movq TASK_rpal_cd(%r13), %rax + testq %rax, %rax + jz _do_syscall + jmp do_rpal_syscall +_do_syscall: +#endif call do_syscall_64 /* returns with IRQs disabled */ =20 /* @@ -190,6 +204,101 @@ SYM_CODE_START(rpal_ret_from_lazy_switch) jmp swapgs_restore_regs_and_return_to_usermode #endif SYM_CODE_END(rpal_ret_from_lazy_switch) + +/* return address offset of stack frame */ +#define RPAL_FRAME_RET_ADDR_OFFSET -56 + +SYM_CODE_START(do_rpal_syscall) + movq %rsp, %r14 + call rpal_syscall_64_context_switch + testq %rax, %rax + jz 1f + + /* + * When we come here, everything but stack switching is finished. + * This makes current task use another task's kernel stack. Thus, + * we need to do stack switching here. + * + * At the meanwhile, the previous task's stack content is corrupted, + * we also need to rebuild its stack frames, so that it will jump to + * rpal_ret_from_lazy_switch when it is scheduled in. This is inspired + * by ret_from_fork. + */ + movq TASK_threadsp(%rax), %rsp +#ifdef CONFIG_STACKPROTECTOR + movq TASK_stack_canary(%rax), %rbx + movq %rbx, PER_CPU_VAR(__stack_chk_guard) +#endif + /* rebuild src's frame */ + movq $rpal_ret_from_lazy_switch, -8(%r14) + leaq RPAL_FRAME_RET_ADDR_OFFSET(%r14), %rbx + movq %rbx, TASK_threadsp(%r13) + + movq %r13, %rdi + /* + * Everything of task switch is done, but we still need to do + * a little extra things for lazy switch. 
+ */ + call rpal_lazy_switch_tail + +1: + movq ORIG_RAX(%rsp), %rsi + movq %rsp, %rdi + jmp _do_syscall +SYM_CODE_END(do_rpal_syscall) + +SYM_CODE_START(do_rpal_error) + popq %r12 + movq %rax, %rsp + movq %rax, %r14 + movq %rax, %rdi + call rpal_exception_context_switch + testq %rax, %rax + jz 1f + + movq TASK_threadsp(%rax), %rsp + ENCODE_FRAME_POINTER +#ifdef CONFIG_STACKPROTECTOR + movq TASK_stack_canary(%rax), %rbx + movq %rbx, PER_CPU_VAR(__stack_chk_guard) +#endif + /* rebuild src's frame */ + movq $rpal_ret_from_lazy_switch, -8(%r14) + leaq RPAL_FRAME_RET_ADDR_OFFSET(%r14), %rbx + movq %rbx, TASK_threadsp(%r13) + + movq %r13, %rdi + call rpal_lazy_switch_tail +1: + movq %rsp, %rax + pushq %r12 + jmp _do_error +SYM_CODE_END(do_rpal_error) + +SYM_CODE_START(do_rpal_nmi) + movq %rsp, %r14 + movq %rsp, %rdi + call rpal_nmi_context_switch + testq %rax, %rax + jz 1f + + movq TASK_threadsp(%rax), %rsp + ENCODE_FRAME_POINTER +#ifdef CONFIG_STACKPROTECTOR + movq TASK_stack_canary(%rax), %rbx + movq %rbx, PER_CPU_VAR(__stack_chk_guard) +#endif + /* rebuild src's frame */ + movq $rpal_ret_from_lazy_switch, -8(%r14) + leaq RPAL_FRAME_RET_ADDR_OFFSET(%r14), %rbx + movq %rbx, TASK_threadsp(%r13) + + movq %r13, %rdi + call rpal_lazy_switch_tail + +1: + jmp _do_nmi +SYM_CODE_END(do_rpal_nmi) #endif =20 /* @@ -1047,7 +1156,22 @@ SYM_CODE_START(error_entry) =20 leaq 8(%rsp), %rdi /* arg0 =3D pt_regs pointer */ /* Put us onto the real thread stack. */ +#ifdef CONFIG_RPAL + call sync_regs + /* + * Check whether we need to perform lazy switch after we + * switch to the real thread stack. + */ + movq PER_CPU_VAR(current_task), %r13 + movq TASK_rpal_cd(%r13), %rdi + testq %rdi, %rdi + jz _do_error + jmp do_rpal_error +_do_error: + RET +#else jmp sync_regs +#endif =20 /* * There are two places in the kernel that can potentially fault with @@ -1206,6 +1330,19 @@ SYM_CODE_START(asm_exc_nmi) IBRS_ENTER UNTRAIN_RET =20 +#ifdef CONFIG_RPAL + /* + * Check whether we need to perform lazy switch only when + * we come from userspace. 
+ */ + movq PER_CPU_VAR(current_task), %r13 + movq TASK_rpal_cd(%r13), %rax + testq %rax, %rax + jz _do_nmi + jmp do_rpal_nmi +_do_nmi: +#endif + /* * At this point we no longer need to worry about stack damage * due to nesting -- we're on the normal thread stack and we're diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c index 6259b474073b..010202c31b37 100644 --- a/arch/x86/kernel/asm-offsets.c +++ b/arch/x86/kernel/asm-offsets.c @@ -46,6 +46,9 @@ static void __used common(void) #ifdef CONFIG_STACKPROTECTOR OFFSET(TASK_stack_canary, task_struct, stack_canary); #endif +#ifdef CONFIG_RPAL + OFFSET(TASK_rpal_cd, task_struct, rpal_cd); +#endif =20 BLANK(); OFFSET(pbe_address, pbe, address); diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c index ed4c11e6838c..c48df1ce4324 100644 --- a/arch/x86/rpal/core.c +++ b/arch/x86/rpal/core.c @@ -7,6 +7,7 @@ */ =20 #include +#include #include =20 #include "internal.h" @@ -39,6 +40,20 @@ static inline void rpal_unlock_cpu_kernel_ret(struct tas= k_struct *tsk) rpal_set_cpus_allowed_ptr(tsk, false, true); } =20 +void rpal_lazy_switch_tail(struct task_struct *tsk) +{ + struct rpal_receiver_call_context *rcc; + + if (rpal_test_task_thread_flag(current, RPAL_LAZY_SWITCHED_BIT)) { + rcc =3D current->rpal_rd->rcc; + atomic_cmpxchg(&rcc->receiver_state, rpal_build_call_state(tsk->rpal_sd), + RPAL_RECEIVER_STATE_LAZY_SWITCH); + } else { + rpal_unlock_cpu(tsk); + rpal_unlock_cpu(current); + } +} + void rpal_kernel_ret(struct pt_regs *regs) { struct task_struct *tsk; @@ -76,6 +91,87 @@ void rpal_kernel_ret(struct pt_regs *regs) } } =20 +static inline void rebuild_stack(struct rpal_task_context *ctx, + struct pt_regs *regs) +{ + regs->r12 =3D ctx->r12; + regs->r13 =3D ctx->r13; + regs->r14 =3D ctx->r14; + regs->r15 =3D ctx->r15; + regs->bx =3D ctx->rbx; + regs->bp =3D ctx->rbp; + regs->ip =3D ctx->rip; + regs->sp =3D ctx->rsp; +} + +static inline void rebuild_sender_stack(struct rpal_sender_data *rsd, + struct pt_regs *regs) +{ + rebuild_stack(&rsd->scc->rtc, regs); +} + +static inline void rebuild_receiver_stack(struct rpal_receiver_data *rrd, + struct pt_regs *regs) +{ + rebuild_stack(&rrd->rcc->rtc, regs); +} + +static inline void update_dst_stack(struct task_struct *next, + struct pt_regs *src) +{ + struct pt_regs *dst; + + dst =3D task_pt_regs(next); + *dst =3D *src; + next->thread.sp =3D (unsigned long)dst; +} + +/* + * rpal_do_kernel_context_switch - the main routine of RPAL lazy switch + * @next: task to switch to + * @regs: the user pt_regs saved in kernel entry + * + * This function performs the lazy switch. When switch from sender to + * receiver, we need to lock both task to current CPU to avoid double + * control flow when we perform lazy switch and after then. + */ +static struct task_struct * +rpal_do_kernel_context_switch(struct task_struct *next, struct pt_regs *re= gs) +{ + struct task_struct *prev =3D current; + + if (rpal_test_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT)) { + current->rpal_sd->receiver =3D next; + rpal_lock_cpu(current); + rpal_lock_cpu(next); + rpal_try_to_wake_up(next); + update_dst_stack(next, regs); + /* + * When a lazy switch occurs, we need to set the sender's + * user-mode context to a predefined state by the sender. + * Otherwise, sender's user context will be corrupted. 
+ */ + rebuild_sender_stack(current->rpal_sd, regs); + rpal_schedule(next); + } else { + update_dst_stack(next, regs); + /* + * When a lazy switch occurs, we need to set the receiver's + * user-mode context to a predefined state by the receiver. + * Otherwise, sender's user context will be corrupted. + */ + rebuild_receiver_stack(current->rpal_rd, regs); + rpal_schedule(next); + rpal_clear_task_thread_flag(prev, RPAL_LAZY_SWITCHED_BIT); + prev->rpal_rd->sender =3D NULL; + } + if (unlikely(!irqs_disabled())) { + local_irq_disable(); + rpal_err("%s: irq is enabled\n", __func__); + } + return next; +} + static inline struct task_struct *rpal_get_sender_task(void) { struct task_struct *next; @@ -123,6 +219,18 @@ static inline struct task_struct *rpal_misidentify(voi= d) return next; } =20 +static inline struct task_struct * +rpal_kernel_context_switch(struct pt_regs *regs) +{ + struct task_struct *next =3D NULL; + + next =3D rpal_misidentify(); + if (unlikely(next !=3D NULL)) + next =3D rpal_do_kernel_context_switch(next, regs); + + return next; +} + struct task_struct *rpal_find_next_task(unsigned long fsbase) { struct rpal_service *cur =3D rpal_current_service(); @@ -147,6 +255,35 @@ struct task_struct *rpal_find_next_task(unsigned long = fsbase) return tsk; } =20 +__visible struct task_struct * +rpal_syscall_64_context_switch(struct pt_regs *regs, unsigned long nr) +{ + struct task_struct *next; + + next =3D rpal_kernel_context_switch(regs); + + return next; +} + +__visible struct task_struct * +rpal_exception_context_switch(struct pt_regs *regs) +{ + struct task_struct *next; + + next =3D rpal_kernel_context_switch(regs); + + return next; +} + +__visible struct task_struct *rpal_nmi_context_switch(struct pt_regs *regs) +{ + struct task_struct *next; + + next =3D rpal_kernel_context_switch(regs); + + return next; +} + static bool check_hardware_features(void) { if (!boot_cpu_has(X86_FEATURE_FSGSBASE)) { diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 01b582fa821e..b24176f3f245 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -479,7 +479,13 @@ struct rpal_service *rpal_get_mapped_service_by_id(str= uct rpal_service *rs, int rpal_rebuild_sender_context_on_fault(struct pt_regs *regs, unsigned long addr, int error_code); struct mm_struct *rpal_pf_get_real_mm(unsigned long address, int *rebuild); +__visible struct task_struct * +rpal_syscall_64_context_switch(struct pt_regs *regs, unsigned long nr); +__visible struct task_struct * +rpal_exception_context_switch(struct pt_regs *regs); +__visible struct task_struct *rpal_nmi_context_switch(struct pt_regs *regs= ); struct task_struct *rpal_find_next_task(unsigned long fsbase); +void rpal_lazy_switch_tail(struct task_struct *tsk); void rpal_kernel_ret(struct pt_regs *regs); =20 extern void rpal_pick_mmap_base(struct mm_struct *mm, --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pj1-f49.google.com (mail-pj1-f49.google.com [209.85.216.49]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8C65A220F2A for ; Fri, 30 May 2025 09:34:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.49 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597647; cv=none; b=RZka/JeQIh3HIC9Y4r74+qEI9vkiOo/lIku5YNZbgD9V9uDpwzPNHMez4lg3RKQ0ZPVkHuBwbYo0OBZHbBrjd8cFxCX0Wu1IhM3DaWDXGPzwYwqjrBD255rEFJ9TznyPS+UZZXFjmid0pkOOirrGcXwwfig89SAO6hZcuh1cWGs= 
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597647; c=relaxed/simple; bh=UOL+ld6kJQ8MITB46M/S3E58D1n58tehOr68tDVTVCQ=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=D7IoJKCC8x27n/0dKFKyJBobTxPrKI9/elM5pVskzIt/9Mzvw8T6sW215gT7YqF0uM5vVs5HUM0m5CV60BlsWaiXFpj4co7dxudYWgoauRN7WwGlrlHDkefKIoAobfWDo0mIdvu2lUElVFXUEbRbVPfne3AMrpiJKm6MHDCyxT4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=Tl4Miuqq; arc=none smtp.client-ip=209.85.216.49 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="Tl4Miuqq" Received: by mail-pj1-f49.google.com with SMTP id 98e67ed59e1d1-311d5fdf1f0so1703356a91.1 for ; Fri, 30 May 2025 02:34:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597645; x=1749202445; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=JJCF21JaAHh50QQNMzAFoSnSXgccx/2injFNIwjbQFo=; b=Tl4MiuqqrXmkBATq/jgK2TPawTNlYG/h9+Mlq3CjwrjDxBrPcQxTtygGQhLnjVkR9o FJ5CFUvXkks9Y0N6XRX65o8H2hJj5Lnm9VpK6W3Qs3daNtEEl99YbLh++fGyyNRHfb2r FVIhtgvR/hjYrMqYwujJd+Fj/71j7QLBylWBnJcX2AUOGNWNpdoDJiZ4vIjRJTy8VLRg VIAa5VVs3rGdL6m8hzuKCzcs7w95d+QUje+kWxGzQ8fFBSLLUF2GCAXmmGngQlP+S2iK gbYIKMWKgctZryn5cAeNWLNnx2WfhUDPYT/q0wC4VklpjUb8b0CCV8pmRXUcz7Nnp16O C8Eg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597645; x=1749202445; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=JJCF21JaAHh50QQNMzAFoSnSXgccx/2injFNIwjbQFo=; b=k8mXd9a6Q+N77/gAOaXYC0Vln/jtsvoRVxPQrJ7Hx1fbEZFqfdxCaW1Kggeq+jd2nV Fno9xnKDgR9bAklhJYzbrfmOIUvxxIqYkPNhT+6/yuE+gPlWYZpQcToSQB/VhPhREcuc JaEhC4V8vaGyO1QwS6wkglahhEe6/8/HukmzkhvFPFIjtwxPBcd5o77Wu1PMBVjxHSbG 9NeojQ9yjWVPzGAw6K8C3HkbEyjNFXnMpdlt8IOgYzzJpigEl6ymDbWlhpWAQWp2cWFV lOm/YZvTaRGySkaflKVuyEjVaUOVaJ6xp87xD2B59dI2/YiFsQxa0zDZ4Y8sK/93CwZC PEcA== X-Forwarded-Encrypted: i=1; AJvYcCVj5ZE+UEOM0hYApLbSvJ4xYuw2xTiZGhfQhC3BTKMu9IJX9KUwkMAxRw8yph87jQiTgaiYNviyn/bG4nM=@vger.kernel.org X-Gm-Message-State: AOJu0YyS7hVk1oUjykZFE6gtou7tqVy92BRVYst9yKDXIcRuy6Pnk/5m lUbl8OWfuBYw8IjOPQgR8bJKR7qQ+aAo0yTH7nTL76la7YQvFo0CMMdHDzgRKpiPMwKeBhSqvaM 2jpiv X-Gm-Gg: ASbGncsRhKMJwgNhdDG1XTQs6m5x+XVAkbS6r1klEb7WfwhQNgUz+bu3eU+tvJxR/YM 094u28coLD8PdyexW6qvpbhpIv6f50VFC53HbTkOdpQSt0vVZjjQu853v6WsskaeFbjsPOitDeD 7MQaPy9r3c4kmFiyHR+JMb1m+rgSr1WC1V0Z99xRSK4Fr4VRgDnt2uTGgBbYDVEtFPKiFlOfCB2 n0cCXQ6tmZf0NCNy55JvwbNz9owpPP4OxVJk8xwNFasQfK0xwwDAYYVv3lEG6cEFlba48QZIpPP faCLPeIzbw2FfCvR3dB0Wg9/MoF17QeRPNEpF+6MQCfQ+HfOPHAO5oj5nfk+pTQLY1V4bK/cu/R wlibCI1+E0Q== X-Google-Smtp-Source: AGHT+IHRhTQBcA7s5JBEKqc7+0a9k8cenFZsoQhX8jbpv2vo1Ygv4WOD6zUURQ2dI64sy1qmxMzdzQ== X-Received: by 2002:a17:90a:e7cb:b0:311:b3e7:fb2c with SMTP id 98e67ed59e1d1-312503643c0mr2125489a91.13.1748597644660; Fri, 30 May 2025 02:34:04 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by 
smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.33.49 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:34:04 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 22/35] RPAL: rebuild receiver state Date: Fri, 30 May 2025 17:27:50 +0800 Message-Id: X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" When an RPAL call occurs, the sender modifies the receiver's state. If the sender exits abnormally after modifying the state, or encounters an unhandled page fault and returns to a recovery point, the receiver's state will remain as modified by the sender (e.g., in the CALL state). Since the sender may have exited, the lazy switch will not occur, leaving the receiver unrecoverable (unable to be woken up via try_to_wake_up()). Therefore, the kernel must ensure the receiver's state remains valid in these cases. This patch addresses this by rebuilding the receiver's state during unhandled page faults or sender exits. The kernel reads the fsbase value recorded by the sender and uses it to locate the corresponding receiver. It then checks whether the receiver is in the CALL state set by the sender (using the sender_id and service_id carried in the CALL state). If so, it transitions the receiver from the CALL state to the WAIT state and updates sender_state to record that the RPAL call has completed. This ensures that even if the sender fails, the receiver can recover and resume normal operation by resetting its state and avoiding permanent blocking.
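The recovery step can be pictured with the small user-space model below, using C11 atomics in place of the kernel's atomic_cmpxchg(); the state values and the way the CALL state encodes the sender are simplified stand-ins invented for this sketch, not the patch's actual encoding:

#include <stdatomic.h>
#include <stdio.h>

/*
 * Simplified states; the kernel encodes sender_id/service_id into the
 * CALL state, modeled here by or-ing a sender id into the value.
 */
enum { ST_WAIT, ST_CALL, ST_KERNEL_RET };
#define CALL_STATE(sender_id)	(((sender_id) << 8) | ST_CALL)

struct call_ctx {
	_Atomic int receiver_state;
	_Atomic int sender_state;
};

/*
 * Reset the receiver only if it still carries the CALL state installed
 * by this failing sender; if the state has already moved on, a
 * legitimate transition happened and nothing is touched.
 */
static void rebuild_receiver_state(struct call_ctx *rcc, int sender_id)
{
	int call = CALL_STATE(sender_id);
	int expect;

	if (atomic_load(&rcc->receiver_state) != call)
		return;

	expect = ST_CALL;
	atomic_compare_exchange_strong(&rcc->sender_state, &expect,
				       ST_KERNEL_RET);
	expect = call;
	atomic_compare_exchange_strong(&rcc->receiver_state, &expect,
				       ST_WAIT);
}

int main(void)
{
	/* hypothetical sender id 7 left the receiver stuck in CALL */
	struct call_ctx rcc = { CALL_STATE(7), ST_CALL };

	rebuild_receiver_state(&rcc, 7);
	printf("receiver=%d sender=%d\n",
	       atomic_load(&rcc.receiver_state),
	       atomic_load(&rcc.sender_state));
	return 0;
}

The kernel-side counterpart in this patch performs the same two compare-and-exchange steps with atomic_cmpxchg() on rcc->sender_state and rcc->receiver_state, using the state value built by rpal_build_call_state().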
Signed-off-by: Bo Li --- arch/x86/rpal/thread.c | 44 +++++++++++++++++++++++++++++++++++++++++- 1 file changed, 43 insertions(+), 1 deletion(-) diff --git a/arch/x86/rpal/thread.c b/arch/x86/rpal/thread.c index db3b13ff82be..02c1a9c22dd7 100644 --- a/arch/x86/rpal/thread.c +++ b/arch/x86/rpal/thread.c @@ -224,6 +224,45 @@ int rpal_unregister_receiver(void) return ret; } =20 +/* sender may corrupt receiver's state if unexpectedly exited, rebuild it = */ +static void rpal_rebuild_receiver_context_on_exit(void) +{ + struct task_struct *receiver =3D NULL; + struct rpal_sender_data *rsd =3D current->rpal_sd; + struct rpal_sender_call_context *scc =3D rsd->scc; + struct rpal_receiver_data *rrd; + struct rpal_receiver_call_context *rcc; + unsigned long fsbase; + int state =3D rpal_build_call_state(rsd); + + if (scc->ec.magic !=3D RPAL_ERROR_MAGIC) + goto out; + + fsbase =3D scc->ec.fsbase; + if (rpal_is_correct_address(rpal_current_service(), fsbase)) + goto out; + + receiver =3D rpal_find_next_task(fsbase); + if (!receiver) + goto out; + + rrd =3D receiver->rpal_rd; + if (!rrd) + goto out; + + rcc =3D rrd->rcc; + + if (atomic_read(&rcc->receiver_state) =3D=3D state) { + atomic_cmpxchg(&rcc->sender_state, RPAL_SENDER_STATE_CALL, + RPAL_SENDER_STATE_KERNEL_RET); + atomic_cmpxchg(&rcc->receiver_state, state, + RPAL_RECEIVER_STATE_WAIT); + } + +out: + return; +} + int rpal_rebuild_sender_context_on_fault(struct pt_regs *regs, unsigned long addr, int error_code) { @@ -232,6 +271,7 @@ int rpal_rebuild_sender_context_on_fault(struct pt_regs= *regs, unsigned long erip, ersp; int magic; =20 + rpal_rebuild_receiver_context_on_exit(); erip =3D scc->ec.erip; ersp =3D scc->ec.ersp; magic =3D scc->ec.magic; @@ -249,8 +289,10 @@ int rpal_rebuild_sender_context_on_fault(struct pt_reg= s *regs, =20 void exit_rpal_thread(void) { - if (rpal_test_current_thread_flag(RPAL_SENDER_BIT)) + if (rpal_test_current_thread_flag(RPAL_SENDER_BIT)) { + rpal_rebuild_receiver_context_on_exit(); rpal_unregister_sender(); + } =20 if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) rpal_unregister_receiver(); --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pj1-f47.google.com (mail-pj1-f47.google.com [209.85.216.47]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DEDF228E7 for ; Fri, 30 May 2025 09:34:20 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.47 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597662; cv=none; b=ApUuD1snKakTA9Gc28Aps5tQFi8v0OB9uYzkRFTIWs2TCwLPHGBy5pte3VaTa91G3oI2HrLHgYU/YyxY9MenaJGrsBh5TTPdvp8TMivk5vzHZk7bVxuAox3Zui7NVF8Bf7PEPB6AGNWbGe+GaHKTAEb+1wbRh+zVZSxQ6yMqaQs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597662; c=relaxed/simple; bh=OoRqkHVJJA6Fnn08xoi8Z+2+DiJAYYu6EEBieJEgtFs=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=AmUejweQ2caiIH/YRfFlo6d44BHAK80+eJSnBqLYfDP9+sOU8R2jVcUcFCP8NEVHvVxsPGgZASQ9Z/DaA8ju0DDTosjcqiFp3rs/+AT/Fgai00KXbpCSS5JOuVW901UtLg4FV54ZB/blu2AtPWGwws4XKCRSo77h3HPZYfhEyWo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=MzMp1xwL; arc=none smtp.client-ip=209.85.216.47 Authentication-Results: smtp.subspace.kernel.org; 
dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="MzMp1xwL" Received: by mail-pj1-f47.google.com with SMTP id 98e67ed59e1d1-3122a63201bso1073210a91.0 for ; Fri, 30 May 2025 02:34:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597660; x=1749202460; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=J8WmlWSYokiEZq8ImcTa9QlYOraa+eQKx+5TbFt2hGQ=; b=MzMp1xwLobGJ+PxhL9U31NFiYdW2e6Es6/fKGhgS9fdx/3RJIOx6Z7xgtAlekeM4OF w2FHcZiGe/oaeAj4agy4q2T6ro7WHZAS1OUnyEjYeoN/KU3T2pJlXdCXQth147r09gg1 upJOCi/JaVg7pKSz9GEHtge688GgY5m+rOiNcbvLzR+b48C5+zHCu0ZzGAMHIY0AOGyC ne70J2lkOTTmg/LXAciIlOeaD09k81C+JOCe78G+aFeCLKaqvXAoR5kvpqJHHEp/Nwhn iauXi4dGeW6dZOMk+imu/OBgrpZdQKC1ZZcmC0ljhNIEhpfxQnUphQIRRg6qOjrfu/Kt Dsvw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597660; x=1749202460; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=J8WmlWSYokiEZq8ImcTa9QlYOraa+eQKx+5TbFt2hGQ=; b=kg6fN7xbJvcjqYeYIE/bEdnSq3RiaOq9qoAxSsh1FIH7Hc8RmpHAanYGkdIbkbfxj2 jhwbAa0csEupHgOzI2nrAZQSHVQlI8JbNu6dMjLi20YksIVT6WsUQXtWQWAMSBLiGyDS NfCaLk6e2e7as69l3JNhTns0FpsHIEnliAghffE0IGc0uWFxB5JdJS1tOPqZcc8os3rw WBZIontnF/I7auV7Bh+Rb/IaALL0fhXAHigXDoDaiMfFTvgp3igpnoPL9oe+K1AlUT8+ CNX7x9ummq9c5h4sIrHaFLdBz6TpHjo46CR01oVD+ZeWkLJR3aL8PPLSeGXOCs3cFxtt 874Q== X-Forwarded-Encrypted: i=1; AJvYcCWyX7/B2sbrbrYlE9Xp8B7+1w5cO8Sts7SYrfMw9H66oknqbl6EGitmNKSmoBSe+tnd47dTbi6aDog2Uwo=@vger.kernel.org X-Gm-Message-State: AOJu0YxVPdSW9ck+kXQYYRVjlpmEj3RVF/5NGG3fPb9Y+i7OiDWQwzzW gVY2V7p8id4tjlzG+wK49x+y2VlI2sVVL2f1LJNJltkTXq2JQNHuZI20GlmR0tgukwY= X-Gm-Gg: ASbGncsF7pAnCn8fyGHKpaAY0DffTcb8B3V51u6qG4lulnXVck28izUMcpUZeL1PHBd ZnSNA2c7uVlCn+QsUcw8ro+6GG1tCauBO3KSTS72ci6oX4kUQG8M70qoR0uscrapgzNWfjYnm76 GaqqUmuTN1LecLCWNhixczl/nuqD+E1ITPfmEb55erzL9HDsw7Wo9mzhdTtlhBoXUzGNXhbXjKz Sw48wO1PceiK4QLQJshRS14GRloZgwhhxpZJSPiBZhidEN1g7bssLzD+3h+/EAqob7v9LSGBzDK zNGsoeXZMtBqleiAsgas7bsJTmV1dRmHuUgHvdq2Qo860wsWvGr8b4lK56iAMFS71WJnSqLM8+r Cz21Zm2pN+w== X-Google-Smtp-Source: AGHT+IF4zFLLKAk8uepZIGBZM+3fXRB/hltSj95jGD1vqbYt83ikw4yeQ1JJHBQBbftK0SQTg3CQlA== X-Received: by 2002:a17:90a:d2cf:b0:311:a314:c2c7 with SMTP id 98e67ed59e1d1-3124150e346mr4368890a91.2.1748597660011; Fri, 30 May 2025 02:34:20 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.34.05 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:34:19 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, 
lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 23/35] RPAL: resume cpumask when fork Date: Fri, 30 May 2025 17:27:51 +0800 Message-Id: <45c1884aaf21256ed6fc66b4a4a716bffebb54e1.1748594841.git.libo.gcs85@bytedance.com> X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" After a lazy switch occurs, RPAL locks the receiver to the current CPU by modifying its cpumask. If the receiver performs a fork operation at this point, the kernel will copy the modified cpumask to the new task, causing the new task to be permanently locked on the current CPU. This patch addresses this issue by detecting whether the original task is locked to the current CPU by RPAL during fork. If locked, assigning the cpumask that existed before the lazy switch to the new task. This ensures the new task will not be locked to the current CPU. Signed-off-by: Bo Li --- arch/x86/kernel/process.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index c1d2dac72b9c..be8845e2ca4d 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include #include @@ -88,6 +89,19 @@ EXPORT_PER_CPU_SYMBOL(cpu_tss_rw); DEFINE_PER_CPU(bool, __tss_limit_invalid); EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid); =20 +#ifdef CONFIG_RPAL +static void rpal_fix_task_dump(struct task_struct *dst, + struct task_struct *src) +{ + unsigned long flags; + + raw_spin_lock_irqsave(&src->pi_lock, flags); + if (rpal_test_task_thread_flag(src, RPAL_CPU_LOCKED_BIT)) + cpumask_copy(&dst->cpus_mask, &src->rpal_cd->old_mask); + raw_spin_unlock_irqrestore(&src->pi_lock, flags); +} +#endif + /* * this gets called so that we can store lazy state into memory and copy t= he * current task into the new thread. 
@@ -100,6 +114,10 @@ int arch_dup_task_struct(struct task_struct *dst, stru= ct task_struct *src) #ifdef CONFIG_VM86 dst->thread.vm86 =3D NULL; #endif +#ifdef CONFIG_RPAL + if (src->rpal_rs) + rpal_fix_task_dump(dst, src); +#endif =20 return 0; } --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pg1-f179.google.com (mail-pg1-f179.google.com [209.85.215.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 82A9A22D9E1 for ; Fri, 30 May 2025 09:34:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597678; cv=none; b=n86BDCxj1CGSV7YCd2T3PSJZjoWVdhVgmA3ieBvYv4yNtSLjHdU16yeMkBYxCniKmOasn9X8Hq156sA5SvZaXVUUiIddmOE7rSo90CyQiiOLZZLQnTfuyHEUva1798WDYF+LmAsMPjLpnz5+lEGPR7AD8NohaAIY8GHb5EKXsc4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597678; c=relaxed/simple; bh=AaX4tZdXtUo1twDok859Zwut8WsNeeyIc7hLqSJN5fQ=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=M9DCZY9384K0Rv4t8xooStYlINT9HyZ49KSS1MHkWinfxpO9OgYnI4SDisowBbZFMhnm0f8UevzCVgIliJVgMUmjgxKEe8hNudPXD2KFA6KCaiYTx9i89s1adn19Sxs4J9clhlW0h4gKsybFudXU9QWOvUIi7Zz1+1xzrdBsSSI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=HZMhmXP+; arc=none smtp.client-ip=209.85.215.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="HZMhmXP+" Received: by mail-pg1-f179.google.com with SMTP id 41be03b00d2f7-b271f3ae786so1307585a12.3 for ; Fri, 30 May 2025 02:34:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597676; x=1749202476; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=Gd4M6twRFQoCr4fYLEaBSZS3XlcwhQLeBelobi/ni6A=; b=HZMhmXP+JCNWRJqSd3RPI0w5J/yKoXpICiXXM63dslgXSQCYFZQlNgUy1aj/GGEUZz 2sJr/bbPYkYApZ2CTjK/3k9WQkwH+Oa9CTbVopfxMhWD0W/NnMWUP3CKq2ges1++vfvB bZvHRKYKEp6YotqTTeazXCc1+LrGsG6UpY1theDhxSTnhJCJGiJI93TXuG04TAoJp3Hc 4dwVBNESqs1+xqcNMoLih6S0pS99gk1XkcFXDniAYBXP4zEcMSFHBlFad/4tI2yEO3QA J9cW+JVrLP3HAhCapw1rC/dtLab9bMTkGghearsaGmLnXYtzTxoubOlqkgiSm2OXkZZ2 +DWA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597676; x=1749202476; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Gd4M6twRFQoCr4fYLEaBSZS3XlcwhQLeBelobi/ni6A=; b=hohCW3QuQ4Dncryqme5rVy+hQT412zReJO1s7+YUWArPIQPOzj2QC1ZPic001hE81n 9pHL/Jj4L2/lktw03TcxHbK+/OOzisikEKuRoZ2UHDe/JlgGrMLiREON/DL8cfO+KkMn BtwOaJzDyd8ako7k5pZf7wkyAWPnlXs0Oo0Y8WzAMgfI0QSJ0kUzee0YOwXzi3Ww75Kj iIWr4R0vL7tzej433sWNy3T9Hv39zm9ToEXg6oK6PT/slES+wDgnzgCMzKIf9i/ILTy4 vPt2js8Ntnsw+JU/sKz4Q/Po2by+doWUHyhFn1RCpR2hKB4juyskZErrkKYuE+DgPvtr QDLg== X-Forwarded-Encrypted: i=1; 
AJvYcCVq5EzyD5eVrX8Ye8iqS1aovSBvy3Nya8ziUdna6aS5FD1ik+SldBJRDmM/VUghqZmer/n80A9Lxtluc6c=@vger.kernel.org X-Gm-Message-State: AOJu0YzejVol+y8caldLj/WTzov6ggx7D48jtC07wgGQpT4et8CaTzK2 zMab805m+hzwGfFeBEK78229RQpHS4yX0IjMXLAGM0iPLyc+rmAgqN/ordayBHE5giM= X-Gm-Gg: ASbGncsU7QWIP5Vj/7nr2Shbapl7y1M1BKVpSf/9dFnKnTEsghfvY50HuZmOWQtANUJ 60oH+tYDKaPOImI2SK8IxDvNPDMeZx4ThL1kjwmYsyJTOzR/YDHj1paXbQqawKNtry+N7VxeDhV JPy0VJrpauk6U+O5XmepkDNpEU0OdLipakhDgx1wB9Ds5xcImdY50FNGgCPZFWfXdQPJ+EeV5qj z3S0VxzyB3VxdWeZ+nUsq8ViJg6Y/TDMqmYe/+X1peFIjPCezpjB/BbDtN15GxWHsYfgbDJ97X0 IFbGpXqiQ6cLLIoeRTOAxDUx+e3VaotjnKukQqMcYRikbVoROaSRzSntWC4rox3z2o81Xfe9jzN QAQwfK9I7mg== X-Google-Smtp-Source: AGHT+IFObkOA1vqgzfl2J7tHJ7MiXmVGrPtynvFE/NiK8PUmWtoYGs6g/wJAuSU3bu11FutgVffcuA== X-Received: by 2002:a17:90a:d2ce:b0:311:b3e7:fb3c with SMTP id 98e67ed59e1d1-31241e97f30mr3743435a91.31.1748597675528; Fri, 30 May 2025 02:34:35 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.34.20 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:34:35 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 24/35] RPAL: critical section optimization Date: Fri, 30 May 2025 17:27:52 +0800 Message-Id: <47c919a7d65cb5def07c561e29305d39d9df925f.1748594841.git.libo.gcs85@bytedance.com> X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The critical section is defined as the user mode code segment within the receiver that executes when control returns from the receiver to the sender. This code segment, located in the receiver, involves operations such as switching the fsbase register and changing the stack pointer. Handling the critical section can be categorized into two scenarios: - First Scenario: If no lazy switch has occurred prior to the return and the fsbase switch is incomplete, a lazy switch is triggered to transition the kernel context from the sender to the receiver. After the fsbase is updated in user mode, another lazy switch occurs to revert the kernel context from the receiver back to the sender. 
This results in two unnecessary lazy switches. - Second Scenario: If a lazy switch has already occurred during execution of the critical section, the lazy switch can be preemptively triggered. This avoids re-entering the kernel solely to initiate another lazy switch. The implementation of the critical section involves modifying the fsbase register in kernel mode and setting the sender's user mode context to a predefined state. These steps minimize redundant user/kernel transitions and lazy switches. Signed-off-by: Bo Li --- arch/x86/rpal/core.c | 88 ++++++++++++++++++++++++++++++++++++++++- arch/x86/rpal/service.c | 12 ++++++ include/linux/rpal.h | 6 +++ 3 files changed, 104 insertions(+), 2 deletions(-) diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c index c48df1ce4324..406d54788bac 100644 --- a/arch/x86/rpal/core.c +++ b/arch/x86/rpal/core.c @@ -219,14 +219,98 @@ static inline struct task_struct *rpal_misidentify(vo= id) return next; } =20 +static bool in_ret_section(struct rpal_service *rs, unsigned long ip) +{ + return ip >=3D rs->rsm.rcs.ret_begin && ip < rs->rsm.rcs.ret_end; +} + +/* + * rpal_update_fsbase - fastpath when RPAL call returns + * @regs: pt_regs saved in kernel entry + * + * If the user is executing rpal call return code and it does + * not update fsbase yet, force fsbase update to perform a + * lazy switch immediately. + */ +static inline void rpal_update_fsbase(struct pt_regs *regs) +{ + struct rpal_service *cur =3D rpal_current_service(); + struct task_struct *sender =3D current->rpal_rd->sender; + + if (in_ret_section(cur, regs->ip)) + wrfsbase(sender->thread.fsbase); +} + +/* + * rpal_skip_receiver_code - skip rpal call return code + * @next: the next task to be lazy switched to. + * @regs: pt_regs saved in kernel entry + * + * If the user is executing rpal call return code and we are about + * to perform a lazy switch, skip the remaining return code to + * release receiver's stack. This avoids stack conflict when there + * are more than one senders calls the receiver. + */ +static inline void rpal_skip_receiver_code(struct task_struct *next, + struct pt_regs *regs) +{ + rebuild_sender_stack(next->rpal_sd, regs); +} + +/* + * rpal_skip_receiver_code - skip lazy switch when rpal call return + * @next: the next task to be lazy switched to. + * @regs: pt_regs saved in kernel entry + * + * If the user is executing rpal call return code and we have not + * performed a lazy switch, there is no need to perform lazy switch + * now. Update fsbase and other states to avoid lazy switch. 
+ */ +static inline struct task_struct * +rpal_skip_lazy_switch(struct task_struct *next, struct pt_regs *regs) +{ + struct rpal_service *tgt; + + tgt =3D next->rpal_rs; + if (in_ret_section(tgt, regs->ip)) { + wrfsbase(current->thread.fsbase); + rebuild_sender_stack(current->rpal_sd, regs); + rpal_clear_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT); + next->rpal_rd->sender =3D NULL; + next =3D NULL; + } + return next; +} + +static struct task_struct *rpal_fix_critical_section(struct task_struct *n= ext, + struct pt_regs *regs) +{ + struct rpal_service *cur =3D rpal_current_service(); + + /* sender->receiver */ + if (rpal_test_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT)) + next =3D rpal_skip_lazy_switch(next, regs); + /* receiver->sender */ + else if (rpal_is_correct_address(cur, regs->ip)) + rpal_skip_receiver_code(next, regs); + + return next; +} + static inline struct task_struct * rpal_kernel_context_switch(struct pt_regs *regs) { struct task_struct *next =3D NULL; =20 + if (rpal_test_current_thread_flag(RPAL_LAZY_SWITCHED_BIT)) + rpal_update_fsbase(regs); + next =3D rpal_misidentify(); - if (unlikely(next !=3D NULL)) - next =3D rpal_do_kernel_context_switch(next, regs); + if (unlikely(next !=3D NULL)) { + next =3D rpal_fix_critical_section(next, regs); + if (next) + next =3D rpal_do_kernel_context_switch(next, regs); + } =20 return next; } diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c index 49458321e7dc..16e94d710445 100644 --- a/arch/x86/rpal/service.c +++ b/arch/x86/rpal/service.c @@ -545,6 +545,13 @@ int rpal_release_service(u64 key) return ret; } =20 +static bool rpal_check_critical_section(struct rpal_service *rs, + struct rpal_critical_section *rcs) +{ + return rpal_is_correct_address(rs, rcs->ret_begin) && + rpal_is_correct_address(rs, rcs->ret_end); +} + int rpal_enable_service(unsigned long arg) { struct rpal_service *cur =3D rpal_current_service(); @@ -562,6 +569,11 @@ int rpal_enable_service(unsigned long arg) goto out; } =20 + if (!rpal_check_critical_section(cur, &rsm.rcs)) { + ret =3D -EINVAL; + goto out; + } + mutex_lock(&cur->mutex); if (!cur->enabled) { cur->rsm =3D rsm; diff --git a/include/linux/rpal.h b/include/linux/rpal.h index b24176f3f245..4f1d92053818 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -122,12 +122,18 @@ enum rpal_sender_state { RPAL_SENDER_STATE_KERNEL_RET, }; =20 +struct rpal_critical_section { + unsigned long ret_begin; + unsigned long ret_end; +}; + /* * user_meta will be sent to other service when requested. 
*/ struct rpal_service_metadata { unsigned long version; void __user *user_meta; + struct rpal_critical_section rcs; }; =20 struct rpal_request_arg { --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pj1-f44.google.com (mail-pj1-f44.google.com [209.85.216.44]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F34E52222DF for ; Fri, 30 May 2025 09:34:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.44 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597694; cv=none; b=JICzyLglG0nwnLhQmUgC3wmA143VYZucXMbMuTKTup6F6GuGt6X7kKylf4lt5i8hB446GPc0r0QIlzlRweD3kI47He9YnxhnzQdAYRct98YRAEN6xwohQ+WRBqf+Zx1QTJlSGQs5zp50f1E3p5nGHJ7qqq29j21YdP8ZCF8FSxs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597694; c=relaxed/simple; bh=huOWwdeBYR2ktVOxaLw5MQ08zC/qZOEF1LgzmHEU8B4=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=BGN94+HwkEpGGZ8E8nz7TghW3QLx2ewE0Yy9Htyzqp2Fr3qZtCfFqjIsWAM9C7K5Q+W1R8QsSGqgPbc9WDm1rXiDUF8geKgakV15MkWc+IfQQDkFgjyXKr9M1lmhPbYopZi2zNxOXA7tvzolNr2hFqzn9ZItMtV0lFduQFW95/o= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=lIRmN2ee; arc=none smtp.client-ip=209.85.216.44 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="lIRmN2ee" Received: by mail-pj1-f44.google.com with SMTP id 98e67ed59e1d1-3109fb9f941so2028108a91.3 for ; Fri, 30 May 2025 02:34:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597691; x=1749202491; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=4r9koD1NPFSh7WkJOunYhnXV7czMJGlSW2YfbvvIqZo=; b=lIRmN2eeZ8hSHOyD39mDvrocaoB8bYLso4269wCfjXQI0Z8JKAB6gX6E1JAqesqZ/l zaiBhXOow9Usxm0YRWoacFzS88MYj527OdXbqSVqJ1/1BO+xLI32fpqghiDeoKeOCtCU u8iM/Bfx+qO4chcVapQLofBcqDwsjrgzxfXdYb676HxXu+4OoSAEtNhOwekhobMWxe5b VrmEQwZP8W8pjANYe8ZNtzAHPMNLXeC1Sv3tW4hu7PDctYmDkiIKsRP0vycUImdyIOns w/0C7zy0BB/CD1W6apOtKJLlYzXpQ1a6enbXVSMYtpU7Um7GyF2uvmS9ZZMjLd3ah+Uw zd0A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597691; x=1749202491; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=4r9koD1NPFSh7WkJOunYhnXV7czMJGlSW2YfbvvIqZo=; b=fThKqwJbXgYmaFEmcIHQhoe4CYEvhqFvNk/SYYyOcX2nC96ORyOixTqMu33VKmwNxj UUjZe3emCSDL9/CL5mttwy0dE31VcLGzAzzcTWxCOJV+a3VA8ExoMYvfhVmBH++t/OIv fuXYIXAPDHCmNZR+YZmSy+nvZDka7Z8gAWz3c0V8d9HC4K12MOzqAQqzRd6K6fPs2a03 V3uFdXy2aWMORTP3HmBfHxMaZDE1bb2m1VTj54SDlvhef9YHfV9LrIT5eW2PDn1OYcmI udmKfsFCJtAe56uC+YMGY/PjC/9+NI3Cyc+EId7yBKLgwv3kPOcgeGolE2y0aANbXz+p xWKA== X-Forwarded-Encrypted: i=1; AJvYcCVTJlAM56kypMsUY5Tl56dIodGa76esH4he2D3YNPOJFyIFmg8qTf1WFiETh1ugUxDRc3Gwr4EG3pVoHeM=@vger.kernel.org X-Gm-Message-State: 
AOJu0YxQ9hRgevlnk1xMcUv4a5qTNhSfEmBjGs4fzzOEZT+wGJ2erwbY eNfn6fwythp4wwu9AWLnE1fA1LA1gvT2fziqVc87efUHw8J+j6LxYzy46I2DITPYCEo= X-Gm-Gg: ASbGncvwtYxPvjoU1BnBFS1u/6hb1u4ab8d0AQGHwXPLPJRaFbhMwngiYyZy8clQB11 AQR2ByFsUCjXSRoic/9T4dIkopEI+dHgZh3/jxGX3XGXAqaVob50SAESj+SaIdtMIlDZ7jF6U4j 5SsNu9yhI6tCXAtbjMDlawzIlrsOGs1lrHaVcilpS6sb0krgd+eo96MX3VKO0I0xSHZRnHrIvGQ +x57QgePNFneBEAHu+dGQ3JxsT9WqdP23+5E7VZDygtIbCDnyT25DFUwUkVWjcw+dypIhzuF354 sSGTQp43tLSCRyNVxCQ8On/KXxTustV6YhTivO0Hm56Bj133ErHizTJuvCn1bPi6MoKcMOeqzPy bL1TuyvWMWA== X-Google-Smtp-Source: AGHT+IEYFSuwVcZ5gMKEd63K0LSZFm1at7KJSWQADYXZzWbymIXeS5JhogJZ9S4360x8945TmkY35w== X-Received: by 2002:a17:90b:2e8b:b0:311:e637:287b with SMTP id 98e67ed59e1d1-3124198a8e2mr3896259a91.29.1748597690878; Fri, 30 May 2025 02:34:50 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.34.35 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:34:50 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 25/35] RPAL: add MPK initialization and interface Date: Fri, 30 May 2025 17:27:53 +0800 Message-Id: <569387db40571a03a71506cbec12813c1e5dde62.1748594841.git.libo.gcs85@bytedance.com> X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" RPAL uses MPK (Memory Protection Keys) to protect memory. Therefore, RPAL needs to perform MPK initialization, allocation, and other related tasks, while providing corresponding user-mode interfaces. This patch executes MPK initialization operations, including feature detection, implementation of user mode interfaces for setting and retrieving pkeys, and development of utility functions. For pkey allocation, RPAL prioritizes using pkeys provided by user mode, with user mode responsible for preventing pkey collisions between different services. If user mode does not provide a valid pkey, RPAL generates a pkey via id % arch_max_pkey() to maximize the avoidance of pkey collisions. 
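For illustration, the fallback allocation and the PKRU access mask it leads to can be sketched as below; this is a user-space toy that assumes 16 protection keys (the x86 maximum with PKU) and uses helper names invented for the sketch rather than the kernel's:

#include <stdio.h>

#define MODEL_MAX_PKEYS 16	/* stand-in for arch_max_pkey() */

/* Prefer a pkey supplied by user mode; otherwise fall back to a value
 * derived from the service id, as described above. */
static int pick_pkey(int service_id, int user_pkey)
{
	if (user_pkey >= 0 && user_pkey < MODEL_MAX_PKEYS)
		return user_pkey;
	return service_id % MODEL_MAX_PKEYS;
}

/* PKRU value that leaves only the chosen pkey accessible: two permission
 * bits per key (access-disable, write-disable), all other keys denied. */
static unsigned int pkey_to_pkru(int pkey)
{
	return 0xffffffffu & ~(0x3u << (2 * pkey));
}

int main(void)
{
	int pkey = pick_pkey(42, -1);	/* hypothetical service id, no user pkey */

	printf("pkey=%d pkru=0x%08x\n", pkey, pkey_to_pkru(pkey));
	return 0;
}

The corresponding masks in the patch itself are RPAL_PKRU_BASE_CODE and the rpal_pkey_to_pkru() helper added to include/linux/rpal.h.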
Additionally, RPAL does not permit services to manipulate pkeys independently; thus, all pkeys are marked as allocated, and services are prohibited from releasing pkeys. Signed-off-by: Bo Li --- arch/x86/rpal/Kconfig | 12 +++++++- arch/x86/rpal/Makefile | 1 + arch/x86/rpal/core.c | 13 ++++++++ arch/x86/rpal/internal.h | 5 +++ arch/x86/rpal/pku.c | 47 ++++++++++++++++++++++++++++ arch/x86/rpal/proc.c | 5 +++ arch/x86/rpal/service.c | 24 +++++++++++++++ include/linux/rpal.h | 66 ++++++++++++++++++++++++++++++++++++++++ mm/mprotect.c | 9 ++++++ 9 files changed, 181 insertions(+), 1 deletion(-) create mode 100644 arch/x86/rpal/pku.c diff --git a/arch/x86/rpal/Kconfig b/arch/x86/rpal/Kconfig index e5e6996553ea..5434fdb2940d 100644 --- a/arch/x86/rpal/Kconfig +++ b/arch/x86/rpal/Kconfig @@ -8,4 +8,14 @@ config RPAL depends on X86_64 help This option enables system support for Run Process As - library (RPAL). \ No newline at end of file + library (RPAL). + +config RPAL_PKU + bool "mpk protection for RPAL" + default y + depends on RPAL + help + Memory protection key (MPK) can achieve intra-process + memory separation which is broken by RPAL, Always keep + it on when use RPAL. CPU feature will be detected at + boot time as some CPUs do not support it. \ No newline at end of file diff --git a/arch/x86/rpal/Makefile b/arch/x86/rpal/Makefile index 89f745382c51..42a42b0393be 100644 --- a/arch/x86/rpal/Makefile +++ b/arch/x86/rpal/Makefile @@ -3,3 +3,4 @@ obj-$(CONFIG_RPAL) +=3D rpal.o =20 rpal-y :=3D service.o core.o mm.o proc.o thread.o +rpal-$(CONFIG_RPAL_PKU) +=3D pku.o \ No newline at end of file diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c index 406d54788bac..41111d693994 100644 --- a/arch/x86/rpal/core.c +++ b/arch/x86/rpal/core.c @@ -8,6 +8,7 @@ =20 #include #include +#include #include =20 #include "internal.h" @@ -374,6 +375,14 @@ static bool check_hardware_features(void) rpal_err("no fsgsbase feature\n"); return false; } + +#ifdef CONFIG_RPAL_PKU + if (!arch_pkeys_enabled()) { + rpal_err("MPK is not enabled\n"); + return false; + } +#endif + return true; } =20 @@ -390,6 +399,10 @@ int __init rpal_init(void) if (ret) goto fail; =20 +#ifdef CONFIG_RPAL_PKU + rpal_set_cap(RPAL_CAP_PKU); +#endif + rpal_inited =3D true; return 0; =20 diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h index 6256172bb79e..71afa8225450 100644 --- a/arch/x86/rpal/internal.h +++ b/arch/x86/rpal/internal.h @@ -54,3 +54,8 @@ rpal_build_call_state(const struct rpal_sender_data *rsd) return ((rsd->rcd.service_id << RPAL_SID_SHIFT) | (rsd->scc->sender_id << RPAL_ID_SHIFT) | RPAL_RECEIVER_STATE_CALL); } + +/* pkey.c */ +int rpal_alloc_pkey(struct rpal_service *rs, int pkey); +int rpal_pkey_setup(struct rpal_service *rs, int pkey); +void rpal_service_pku_init(void); diff --git a/arch/x86/rpal/pku.c b/arch/x86/rpal/pku.c new file mode 100644 index 000000000000..4c5151ca5b8b --- /dev/null +++ b/arch/x86/rpal/pku.c @@ -0,0 +1,47 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * RPAL service level operations + * Copyright (c) 2025, ByteDance. All rights reserved. 
+ * + * Author: Jiadong Sun + */ + +#include +#include + +#include "internal.h" + +void rpal_service_pku_init(void) +{ + u16 all_pkeys_mask =3D ((1U << arch_max_pkey()) - 1); + struct mm_struct *mm =3D current->mm; + + /* We consume all pkeys so that no pkeys will be allocated by others */ + mmap_write_lock(mm); + if (mm->context.pkey_allocation_map !=3D 0x1) + rpal_err("pkey has been allocated: %u\n", + mm->context.pkey_allocation_map); + mm->context.pkey_allocation_map =3D all_pkeys_mask; + mmap_write_unlock(mm); +} + +int rpal_pkey_setup(struct rpal_service *rs, int pkey) +{ + int val; + + val =3D rpal_pkey_to_pkru(pkey); + rs->pkey =3D pkey; + return 0; +} + +int rpal_alloc_pkey(struct rpal_service *rs, int pkey) +{ + int ret; + + if (pkey >=3D 0 && pkey < arch_max_pkey()) + return pkey; + + ret =3D rs->id % arch_max_pkey(); + + return ret; +} diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c index 16ac9612bfc5..2f9cceec4992 100644 --- a/arch/x86/rpal/proc.c +++ b/arch/x86/rpal/proc.c @@ -76,6 +76,11 @@ static long rpal_ioctl(struct file *file, unsigned int c= md, unsigned long arg) case RPAL_IOCTL_RELEASE_SERVICE: ret =3D rpal_release_service(arg); break; +#ifdef CONFIG_RPAL_PKU + case RPAL_IOCTL_GET_SERVICE_PKEY: + ret =3D put_user(cur->pkey, (int __user *)arg); + break; +#endif default: return -EINVAL; } diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c index 16e94d710445..ca795dacc90d 100644 --- a/arch/x86/rpal/service.c +++ b/arch/x86/rpal/service.c @@ -208,6 +208,10 @@ struct rpal_service *rpal_register_service(void) spin_lock_init(&rs->rpd.poll_lock); bitmap_zero(rs->rpd.dead_key_bitmap, RPAL_NR_ID); init_waitqueue_head(&rs->rpd.rpal_waitqueue); +#ifdef CONFIG_RPAL_PKU + rs->pkey =3D -1; + rpal_service_pku_init(); +#endif =20 rs->bad_service =3D false; rs->base =3D calculate_base_address(rs->id); @@ -288,6 +292,9 @@ static int add_mapped_service(struct rpal_service *rs, = struct rpal_service *tgt, if (node->rs =3D=3D NULL) { node->rs =3D rpal_get_service(tgt); set_bit(type_bit, &node->type); +#ifdef CONFIG_RPAL_PKU + node->pkey =3D tgt->pkey; +#endif } else { if (node->rs !=3D tgt) { ret =3D -EINVAL; @@ -397,6 +404,19 @@ int rpal_request_service(unsigned long arg) goto put_service; } =20 +#ifdef CONFIG_RPAL_PKU + if (cur->pkey =3D=3D tgt->pkey) { + ret =3D -EINVAL; + goto put_service; + } + + ret =3D put_user(tgt->pkey, rra.pkey); + if (ret) { + ret =3D -EFAULT; + goto put_service; + } +#endif + ret =3D put_user((unsigned long)(tgt->rsm.user_meta), rra.user_metap); if (ret) { ret =3D -EFAULT; @@ -577,6 +597,10 @@ int rpal_enable_service(unsigned long arg) mutex_lock(&cur->mutex); if (!cur->enabled) { cur->rsm =3D rsm; +#ifdef CONFIG_RPAL_PKU + rsm.pkey =3D rpal_alloc_pkey(cur, rsm.pkey); + rpal_pkey_setup(cur, rsm.pkey); +#endif cur->enabled =3D true; } mutex_unlock(&cur->mutex); diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 4f1d92053818..2f2982d281cc 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -97,6 +97,12 @@ enum { #define RPAL_ID_MASK (~(0 | RPAL_RECEIVER_STATE_MASK | RPAL_SID_MASK)) #define RPAL_MAX_ID ((1 << (RPAL_SID_SHIFT - RPAL_ID_SHIFT)) - 1) =20 +#define RPAL_PKRU_BASE_CODE_READ 0xAAAAAAAA +#define RPAL_PKRU_BASE_CODE 0xFFFFFFFF +#define RPAL_PKRU_SET 0 +#define RPAL_PKRU_UNION 1 +#define RPAL_PKRU_INTERSECT 2 + extern unsigned long rpal_cap; =20 enum rpal_task_flag_bits { @@ -122,6 +128,10 @@ enum rpal_sender_state { RPAL_SENDER_STATE_KERNEL_RET, }; =20 +enum rpal_capability { + RPAL_CAP_PKU +}; + struct 
rpal_critical_section { unsigned long ret_begin; unsigned long ret_end; @@ -134,6 +144,7 @@ struct rpal_service_metadata { unsigned long version; void __user *user_meta; struct rpal_critical_section rcs; + int pkey; }; =20 struct rpal_request_arg { @@ -141,11 +152,17 @@ struct rpal_request_arg { u64 key; unsigned long __user *user_metap; int __user *id; +#ifdef CONFIG_RPAL_PKU + int __user *pkey; +#endif }; =20 struct rpal_mapped_service { unsigned long type; struct rpal_service *rs; +#ifdef CONFIG_RPAL_PKU + int pkey; +#endif }; =20 struct rpal_poll_data { @@ -220,6 +237,11 @@ struct rpal_service { /* fsbase / pid map */ struct rpal_fsbase_tsk_map fs_tsk_map[RPAL_MAX_RECEIVER_NUM]; =20 +#ifdef CONFIG_RPAL_PKU + /* pkey */ + int pkey; +#endif + /* delayed service put work */ struct delayed_work delayed_put_work; =20 @@ -323,6 +345,7 @@ enum rpal_command_type { RPAL_CMD_DISABLE_SERVICE, RPAL_CMD_REQUEST_SERVICE, RPAL_CMD_RELEASE_SERVICE, + RPAL_CMD_GET_SERVICE_PKEY, RPAL_NR_CMD, }; =20 @@ -351,6 +374,8 @@ enum rpal_command_type { _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REQUEST_SERVICE, unsigned long) #define RPAL_IOCTL_RELEASE_SERVICE \ _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_RELEASE_SERVICE, unsigned long) +#define RPAL_IOCTL_GET_SERVICE_PKEY \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_PKEY, int *) =20 #define rpal_for_each_requested_service(rs, idx) = \ for (idx =3D find_first_bit(rs->requested_service_bitmap, RPAL_NR_ID); \ @@ -420,6 +445,47 @@ static inline bool rpal_is_correct_address(struct rpal= _service *rs, unsigned lon return true; } =20 +static inline void rpal_set_cap(unsigned long cap) +{ + set_bit(cap, &rpal_cap); +} + +static inline void rpal_clear_cap(unsigned long cap) +{ + clear_bit(cap, &rpal_cap); +} + +static inline bool rpal_has_cap(unsigned long cap) +{ + return test_bit(cap, &rpal_cap); +} + +static inline u32 rpal_pkey_to_pkru(int pkey) +{ + int offset =3D pkey * 2; + u32 mask =3D 0x3 << offset; + + return RPAL_PKRU_BASE_CODE & ~mask; +} + +static inline u32 rpal_pkey_to_pkru_read(int pkey) +{ + int offset =3D pkey * 2; + u32 mask =3D 0x3 << offset; + + return RPAL_PKRU_BASE_CODE_READ & ~mask; +} + +static inline u32 rpal_pkru_union(u32 pkru0, u32 pkru1) +{ + return pkru0 & pkru1; +} + +static inline u32 rpal_pkru_intersect(u32 pkru0, u32 pkru1) +{ + return pkru0 | pkru1; +} + #ifdef CONFIG_RPAL static inline struct rpal_service *rpal_current_service(void) { diff --git a/mm/mprotect.c b/mm/mprotect.c index 62c1f7945741..982f911ffaba 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -33,6 +33,7 @@ #include #include #include +#include #include #include #include @@ -895,6 +896,14 @@ SYSCALL_DEFINE1(pkey_free, int, pkey) { int ret; =20 +#ifdef CONFIG_RPAL_PKU + if (rpal_current_service()) { + rpal_err("try_to_free pkey: %d %s\n", current->pid, + current->comm); + return -EINVAL; + } +#endif + mmap_write_lock(current->mm); ret =3D mm_pkey_free(current->mm, pkey); mmap_write_unlock(current->mm); --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pl1-f179.google.com (mail-pl1-f179.google.com [209.85.214.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 25BEA232395 for ; Fri, 30 May 2025 09:35:06 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597709; cv=none; 
2002:a17:902:ccce:b0:234:bca7:2940 with SMTP id d9443c01a7336-23529a2c094mr43718025ad.38.1748597706235; Fri, 30 May 2025 02:35:06 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.34.51 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:35:05 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 26/35] RPAL: enable MPK support Date: Fri, 30 May 2025 17:27:54 +0800 Message-Id: X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable RPAL leverages Memory Protection Keys (MPK) to safeguard shared memory from illegal access and corruption by other processes. MPK-based memory protection involves two key mechanisms: First, for already allocated memory, when RPAL is enabled, the protection key fields in all page tables must be set to the process=E2=80=99s corresponding pkey value. Second, for = newly allocated memory, when the kernel detects that the process is an RPAL service, it sets the corresponding pkey flag in the relevant memory data structures. Together, these measures ensure that all memory belonging to the current process is protected by its own pkey. For MPK initialization, RPAL needs to set the pkeys of all allocated page table pages to the pkeys assigned by RPAL to the service. This is completed in three steps: First, enable permissions for all pkeys of the service, allowing it to access memory protected by any pkey. Then, update the pkeys in the page tables. Since permissions for all pkeys are already enabled at this stage, even if old and new pkeys coexist during the page table update, the service's memory access remains unaffected. Finally, after the page table update is complete, set the service's pkey permissions to the corresponding values, thereby achieving memory protection. Additionally, RPAL must manage the values of the PKRU register during lazy switch operations and signal handling. This ensures the process avoids coredumps causing by MPK. 
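As a stand-alone illustration (not part of the patch), the PKRU values involved in these steps follow the two rules introduced earlier in the series, rpal_pkey_to_pkru() and rpal_pkru_union() in include/linux/rpal.h: clearing a pkey's two permission bits grants access to it, and the union of two permission sets is the bitwise AND. A minimal user-space sketch of that arithmetic, with example pkey numbers chosen only for the demo:

#include <stdint.h>
#include <stdio.h>

#define RPAL_PKRU_BASE_CODE 0xFFFFFFFFu  /* every pkey denied by default */

/* same rule as rpal_pkey_to_pkru(): clear the access-disable and
 * write-disable bits that belong to this pkey */
static uint32_t pkey_to_pkru(int pkey)
{
	return RPAL_PKRU_BASE_CODE & ~(0x3u << (pkey * 2));
}

/* same rule as rpal_pkru_union(): a cleared bit grants a right,
 * so the union of two permission sets is the bitwise AND */
static uint32_t pkru_union(uint32_t a, uint32_t b)
{
	return a & b;
}

int main(void)
{
	uint32_t svc  = pkey_to_pkru(1);                   /* service pkey only */
	uint32_t both = pkru_union(svc, pkey_to_pkru(2));  /* two services */

	printf("service only:  %08x\n", svc);   /* prints fffffff3 */
	printf("both services: %08x\n", both);  /* prints ffffffc3 */
	return 0;
}

A value like the "both services" one is what the lazy switch paths in this patch install, so a thread can touch both its own memory and the peer service's memory; once the page table update described above finishes, PKRU is tightened back to the single-pkey value.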
Signed-off-by: Bo Li --- arch/x86/kernel/cpu/common.c | 8 +- arch/x86/kernel/fpu/core.c | 8 +- arch/x86/kernel/process.c | 7 +- arch/x86/rpal/core.c | 14 +++- arch/x86/rpal/internal.h | 1 + arch/x86/rpal/pku.c | 139 ++++++++++++++++++++++++++++++++++- arch/x86/rpal/service.c | 1 + arch/x86/rpal/thread.c | 5 ++ include/linux/rpal.h | 3 + kernel/sched/core.c | 3 + mm/mmap.c | 12 +++ mm/mprotect.c | 96 ++++++++++++++++++++++++ mm/vma.c | 18 +++++ 13 files changed, 310 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 8feb8fd2957a..2678453cdf76 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -26,6 +26,7 @@ #include #include #include +#include =20 #include #include @@ -532,7 +533,12 @@ static __always_inline void setup_pku(struct cpuinfo_x= 86 *c) =20 cr4_set_bits(X86_CR4_PKE); /* Load the default PKRU value */ - pkru_write_default(); +#ifdef CONFIG_RPAL_PKU + if (rpal_current_service() && rpal_current_service()->pku_on) + write_pkru(rpal_pkey_to_pkru(rpal_current_service()->pkey)); + else +#endif + pkru_write_default(); } =20 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index ea138583dd92..251b1ddee726 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -20,6 +20,7 @@ #include #include #include +#include =20 #include "context.h" #include "internal.h" @@ -746,7 +747,12 @@ static inline void restore_fpregs_from_init_fpstate(u6= 4 features_mask) else frstor(&init_fpstate.regs.fsave); =20 - pkru_write_default(); +#ifdef CONFIG_RPAL_PKU + if (rpal_current_service() && rpal_current_service()->pku_on) + write_pkru(rpal_pkey_to_pkru(rpal_current_service()->pkey)); + else +#endif + pkru_write_default(); } =20 /* diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index be8845e2ca4d..b74de35218f9 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -285,7 +285,12 @@ static void pkru_flush_thread(void) * If PKRU is enabled the default PKRU value has to be loaded into * the hardware right here (similar to context switch). 
*/ - pkru_write_default(); +#ifdef CONFIG_RPAL_PKU + if (rpal_current_service() && rpal_current_service()->pku_on) + write_pkru(rpal_pkey_to_pkru(rpal_current_service()->pkey)); + else +#endif + pkru_write_default(); } =20 void flush_thread(void) diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c index 41111d693994..47c9e551344e 100644 --- a/arch/x86/rpal/core.c +++ b/arch/x86/rpal/core.c @@ -275,6 +275,13 @@ rpal_skip_lazy_switch(struct task_struct *next, struct= pt_regs *regs) tgt =3D next->rpal_rs; if (in_ret_section(tgt, regs->ip)) { wrfsbase(current->thread.fsbase); +#ifdef CONFIG_RPAL_PKU + rpal_set_current_pkru( + rpal_pkru_union( + rpal_pkey_to_pkru(rpal_current_service()->pkey), + rpal_pkey_to_pkru(next->rpal_rs->pkey)), + RPAL_PKRU_SET); +#endif rebuild_sender_stack(current->rpal_sd, regs); rpal_clear_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT); next->rpal_rd->sender =3D NULL; @@ -292,8 +299,13 @@ static struct task_struct *rpal_fix_critical_section(s= truct task_struct *next, if (rpal_test_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT)) next =3D rpal_skip_lazy_switch(next, regs); /* receiver->sender */ - else if (rpal_is_correct_address(cur, regs->ip)) + else if (rpal_is_correct_address(cur, regs->ip)) { rpal_skip_receiver_code(next, regs); +#ifdef CONFIG_RPAL_PKU + write_pkru(rpal_pkru_union( + rpal_pkey_to_pkru(next->rpal_rs->pkey), rdpkru())); +#endif + } =20 return next; } diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h index 71afa8225450..e49febce8645 100644 --- a/arch/x86/rpal/internal.h +++ b/arch/x86/rpal/internal.h @@ -58,4 +58,5 @@ rpal_build_call_state(const struct rpal_sender_data *rsd) /* pkey.c */ int rpal_alloc_pkey(struct rpal_service *rs, int pkey); int rpal_pkey_setup(struct rpal_service *rs, int pkey); +void rpal_set_current_pkru(u32 val, int mode); void rpal_service_pku_init(void); diff --git a/arch/x86/rpal/pku.c b/arch/x86/rpal/pku.c index 4c5151ca5b8b..26cef324f41f 100644 --- a/arch/x86/rpal/pku.c +++ b/arch/x86/rpal/pku.c @@ -25,12 +25,149 @@ void rpal_service_pku_init(void) mmap_write_unlock(mm); } =20 +void rpal_set_pku_schedule_tail(struct task_struct *prev) +{ + if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) { + struct rpal_service *cur =3D rpal_current_service(); + u32 val =3D rpal_pkey_to_pkru(cur->pkey); + + rpal_set_current_pkru(val, RPAL_PKRU_SET); + } else { + struct rpal_service *cur =3D rpal_current_service(); + u32 val =3D rpal_pkey_to_pkru(cur->pkey); + + val =3D rpal_pkru_union( + val, + rpal_pkey_to_pkru( + current->rpal_sd->receiver->rpal_rs->pkey)); + rpal_set_current_pkru(val, RPAL_PKRU_SET); + } +} + +static inline u32 rpal_get_new_val(u32 old_val, u32 new_val, int mode) +{ + switch (mode) { + case RPAL_PKRU_SET: + return new_val; + case RPAL_PKRU_UNION: + return rpal_pkru_union(old_val, new_val); + case RPAL_PKRU_INTERSECT: + return rpal_pkru_intersect(old_val, new_val); + default: + rpal_err("%s: invalid mode: %d\n", __func__, mode); + return old_val; + } +} + +static int rpal_set_task_fpu_pkru(struct task_struct *task, u32 val, int m= ode) +{ + struct thread_struct *t =3D &task->thread; + + val =3D rpal_get_new_val(t->pkru, val, mode); + t->pkru =3D val; + + return 0; +} + +void rpal_set_current_pkru(u32 val, int mode) +{ + u32 new_val; + + new_val =3D rpal_get_new_val(rdpkru(), val, mode); + write_pkru(new_val); +} + +struct task_function_data { + struct task_struct *task; + u32 val; + int mode; + int ret; +}; + +static void rpal_set_remote_pkru(void *data) +{ + struct task_function_data *tfd =3D 
data; + struct task_struct *task =3D tfd->task; + + if (task) { + /* -EAGAIN */ + if (task_cpu(task) !=3D smp_processor_id()) + return; + + tfd->ret =3D -ESRCH; + if (task =3D=3D current) { + rpal_set_current_pkru(tfd->val, tfd->mode); + tfd->ret =3D 0; + } else { + tfd->ret =3D rpal_set_task_fpu_pkru(task, tfd->val, + tfd->mode); + } + return; + } +} + +static int rpal_task_function_call(struct task_struct *task, u32 val, int = mode) +{ + struct task_function_data data =3D { + .task =3D task, + .val =3D val, + .mode =3D mode, + .ret =3D -EAGAIN, + }; + int ret; + + for (;;) { + smp_call_function_single(task_cpu(task), rpal_set_remote_pkru, + &data, 1); + ret =3D data.ret; + + if (ret !=3D -EAGAIN) + break; + + cond_resched(); + } + + return ret; +} + +static void rpal_set_task_pkru(struct task_struct *task, u32 val, int mode) +{ + if (task =3D=3D current) + rpal_set_current_pkru(val, mode); + else + rpal_task_function_call(task, val, mode); +} + +static void rpal_set_group_pkru(u32 val, int mode) +{ + struct task_struct *p; + + for_each_thread(current, p) { + rpal_set_task_pkru(p, val, mode); + } +} + int rpal_pkey_setup(struct rpal_service *rs, int pkey) { - int val; + int err, val; =20 val =3D rpal_pkey_to_pkru(pkey); + + mmap_write_lock(current->mm); + if (rs->pku_on) { + mmap_write_unlock(current->mm); + return 0; + } rs->pkey =3D pkey; + /* others must see rs->pkey before rs->pku_on */ + barrier(); + rs->pku_on =3D true; + mmap_write_unlock(current->mm); + rpal_set_group_pkru(val, RPAL_PKRU_UNION); + err =3D do_rpal_mprotect_pkey(rs->base, RPAL_ADDR_SPACE_SIZE, pkey); + if (unlikely(err)) + rpal_err("do_rpal_mprotect_key error: %d\n", err); + rpal_set_group_pkru(val, RPAL_PKRU_SET); return 0; } =20 diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c index ca795dacc90d..7a83e85cf096 100644 --- a/arch/x86/rpal/service.c +++ b/arch/x86/rpal/service.c @@ -210,6 +210,7 @@ struct rpal_service *rpal_register_service(void) init_waitqueue_head(&rs->rpd.rpal_waitqueue); #ifdef CONFIG_RPAL_PKU rs->pkey =3D -1; + rs->pku_on =3D false; rpal_service_pku_init(); #endif =20 diff --git a/arch/x86/rpal/thread.c b/arch/x86/rpal/thread.c index 02c1a9c22dd7..fcc592baaac0 100644 --- a/arch/x86/rpal/thread.c +++ b/arch/x86/rpal/thread.c @@ -281,6 +281,11 @@ int rpal_rebuild_sender_context_on_fault(struct pt_reg= s *regs, regs->sp =3D ersp; /* avoid rebuild again */ scc->ec.magic =3D 0; +#ifdef CONFIG_RPAL_PKU + rpal_set_current_pkru( + rpal_pkey_to_pkru(rpal_current_service()->pkey), + RPAL_PKRU_SET); +#endif return 0; } } diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 2f2982d281cc..f2474cb53abe 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -239,6 +239,7 @@ struct rpal_service { =20 #ifdef CONFIG_RPAL_PKU /* pkey */ + bool pku_on; int pkey; #endif =20 @@ -571,4 +572,6 @@ void rpal_schedule(struct task_struct *next); asmlinkage struct task_struct * __rpal_switch_to(struct task_struct *prev_p, struct task_struct *next_p); asmlinkage __visible void rpal_schedule_tail(struct task_struct *prev); +int do_rpal_mprotect_pkey(unsigned long start, size_t len, int pkey); +void rpal_set_pku_schedule_tail(struct task_struct *prev); #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0f9343698198..eb5d5bd51597 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -11029,6 +11029,9 @@ asmlinkage __visible void rpal_schedule_tail(struct= task_struct *prev) =20 finish_task_switch(prev); trace_sched_exit_tp(true, CALLER_ADDR0); +#ifdef CONFIG_RPAL_PKU + 
rpal_set_pku_schedule_tail(prev); +#endif preempt_enable(); =20 calculate_sigpending(); diff --git a/mm/mmap.c b/mm/mmap.c index 98bb33d2091e..d36ea4ea2bd0 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -396,6 +396,18 @@ unsigned long do_mmap(struct file *file, unsigned long= addr, if (pkey < 0) pkey =3D 0; } +#ifdef CONFIG_RPAL_PKU + /* + * For RPAL process, if pku is enabled, we always use + * its service pkey for new vma. + */ + do { + struct rpal_service *cur =3D rpal_current_service(); + + if (cur && cur->pku_on) + pkey =3D cur->pkey; + } while (0); +#endif =20 /* Do simple checking here so the lower-level routines won't have * to. we assume access permissions have been handled by the open diff --git a/mm/mprotect.c b/mm/mprotect.c index 982f911ffaba..e9ae828e377d 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -713,6 +713,18 @@ static int do_mprotect_pkey(unsigned long start, size_= t len, struct mmu_gather tlb; struct vma_iterator vmi; =20 +#ifdef CONFIG_RPAL_PKU + if (pkey !=3D -1) { + struct rpal_service *cur =3D rpal_current_service(); + + if (unlikely(cur) && cur->pku_on) { + rpal_err("%s, pid: %d, try to change pkey\n", + current->comm, current->pid); + return -EINVAL; + } + } +#endif + start =3D untagged_addr(start); =20 prot &=3D ~(PROT_GROWSDOWN|PROT_GROWSUP); @@ -848,6 +860,90 @@ static int do_mprotect_pkey(unsigned long start, size_= t len, return error; } =20 +#ifdef CONFIG_RPAL_PKU +int do_rpal_mprotect_pkey(unsigned long start, size_t len, int pkey) +{ + unsigned long nstart, end, tmp; + struct vm_area_struct *vma, *prev; + struct rpal_service *cur =3D rpal_current_service(); + int error =3D -EINVAL; + struct mmu_gather tlb; + struct vma_iterator vmi; + + start =3D untagged_addr(start); + + if (start & ~PAGE_MASK) + return -EINVAL; + if (!len) + return 0; + len =3D PAGE_ALIGN(len); + end =3D start + len; + if (end <=3D start) + return -ENOMEM; + + if (mmap_write_lock_killable(current->mm)) + return -EINTR; + + /* + * If userspace did not allocate the pkey, do not let + * them use it here. + */ + error =3D -EINVAL; + if ((pkey !=3D -1) && !mm_pkey_is_allocated(current->mm, pkey)) + goto out; + + vma_iter_init(&vmi, current->mm, start); + vma =3D vma_find(&vmi, end); + error =3D -ENOMEM; + if (!vma) + goto out; + + prev =3D vma_prev(&vmi); + if (vma->vm_start > start) + start =3D vma->vm_start; + + if (start > vma->vm_start) + prev =3D vma; + + tlb_gather_mmu(&tlb, current->mm); + nstart =3D start; + tmp =3D vma->vm_start; + for_each_vma_range(vmi, vma, end) { + unsigned long vma_pkey_mask; + unsigned long newflags; + + tmp =3D vma->vm_start; + nstart =3D tmp; + + /* Here we know that vma->vm_start <=3D nstart < vma->vm_end. */ + vma_pkey_mask =3D VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | + VM_PKEY_BIT3; + newflags =3D vma->vm_flags; + newflags &=3D ~vma_pkey_mask; + newflags |=3D ((unsigned long)cur->pkey) << VM_PKEY_SHIFT; + + tmp =3D vma->vm_end; + if (tmp > end) + tmp =3D end; + + if (vma->vm_ops && vma->vm_ops->mprotect) { + error =3D vma->vm_ops->mprotect(vma, nstart, tmp, newflags); + if (error) + break; + } + + error =3D mprotect_fixup(&vmi, &tlb, vma, &prev, nstart, tmp, newflags); + if (error) + break; + } + tlb_finish_mmu(&tlb); + +out: + mmap_write_unlock(current->mm); + return error; +} +#endif + SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len, unsigned long, prot) { diff --git a/mm/vma.c b/mm/vma.c index a468d4c29c0c..fa9d8f694e6e 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -4,6 +4,8 @@ * VMA-specific functions. 
*/ =20 +#include + #include "vma_internal.h" #include "vma.h" =20 @@ -2622,6 +2624,22 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm= _area_struct *vma, { struct mm_struct *mm =3D current->mm; =20 +#ifdef CONFIG_RPAL_PKU + /* + * Any memory need to use RPAL service pkey + * once service is enabled. + */ + struct rpal_service *cur =3D rpal_current_service(); + unsigned long vma_pkey_mask; + + if (cur && cur->pku_on) { + vma_pkey_mask =3D VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | + VM_PKEY_BIT3; + flags &=3D ~vma_pkey_mask; + flags |=3D ((unsigned long)cur->pkey) << VM_PKEY_SHIFT; + } +#endif + /* * Check against address space limits by the changed size * Note: This happens *after* clearing old mappings in some code paths. --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pg1-f182.google.com (mail-pg1-f182.google.com [209.85.215.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 505D022B8B3 for ; Fri, 30 May 2025 09:35:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.182 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597724; cv=none; b=g2wx+7NJWmXj8c+hum1IfHHekJsPDjG6uYzbg1ygp/cqEzw2PWNqjSqF39/3lUi+LIlq2ey326z1Y/KOlB3aVqBdHdjeC5XADo47VvxUTmhgCLj84p8MlaFnx7RLWimwWHngkecC3bLUJvbFre6PQVlINOKZnG+qI6+WfF5Eb+g= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597724; c=relaxed/simple; bh=9CHW/F/J9wY0zWhmMIE46/gcJf++4JvdpcdoqukKOXQ=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=qfsCpBvWhh6/DfbeAGCvu29U4M2+DU13I4dLl2+zWEYvFGFUbs2Cy41tWhJiqp8OVS9vYqDTvcobr6ko4faaoXJl3SQXU8alqmxRHBsVTqjp+S9JB8sOUYIRpubFzemq28cL5Bq9oSuHoGXmAfLZFmbETQ0k0uMIyLns7QwXIsg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=f0RARZhZ; arc=none smtp.client-ip=209.85.215.182 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="f0RARZhZ" Received: by mail-pg1-f182.google.com with SMTP id 41be03b00d2f7-b200047a6a5so2491168a12.0 for ; Fri, 30 May 2025 02:35:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597722; x=1749202522; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=VFn4X8pNflxKPFbQMCWiED+xW2QzJgp0gVRveKZQMVQ=; b=f0RARZhZLjfW/aWmcoFtrJUt9YE2tMtU6hW+dQbnlucaiQQxvRKFiRnO3h/iMy/s2v omUDtrkqOEt/ktuvHS7n8o/6kqJxThZBTjaanLdHcfE1r68tQrAeOHxsfRfLW3NJ1Hyd nd/Aive6GBoywupUoDPQdMrRdo+l8n/7/P60qKSKROcXtd+U80e0W3nNU68j1qoyhmnr mn1MiVkLYfoxSFIz6NDcTU0XKh6O44BoWmqcKTXh53e35VdFfXUyYkby9sBRdZExDvD8 3ou16CG9Vt4ycR+xYcIbyYYpe5bEbsuJZozsHxU4FYXZ5gsNlVcyG2FwKzZAZMjxVrhE iPLA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597722; x=1749202522; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc 
:subject:date:message-id:reply-to; bh=VFn4X8pNflxKPFbQMCWiED+xW2QzJgp0gVRveKZQMVQ=; b=cO7B5OrVRnkB+KFchpungkFJEglXAMpGrp7TP0FeoyU2vJmRzsoY8btgZ2cF/Skw4K gepzHQ75c0HresKK/CbQDTpZKOUuM8x8c8L/bT6TLcTK1BP4Pm0e21lg6a/xWUTt8bEy r6b55Rx3G0pBZqB9lP8cwT4odUCzbloslt5uuSwrImMS4lk8tl9FQv5rctvjIa6/nAEK dB1RFbUsMcuvai3fBZPlTnihNXQF0Eea4gCtfRCeUrcrnNzOAbERSgwmOMXPMG2MK6O9 Cw6TqoW83DPCpjbIj8JcJ+7yr3xtdYDVSTFJr0iZaNBm7/aRmoBqL85TSjCeml1Qokk/ MgQA== X-Forwarded-Encrypted: i=1; AJvYcCX4Y0vjP5TvUhZrCpgCA3TpvsuVUivLw7180O2nesaZjhbhiJjkSGp3do46DrxNH3KSLowvPA8sYm5iHQc=@vger.kernel.org X-Gm-Message-State: AOJu0YwB2TpaTfjPweQXQxiPny7J5WIpAbRu5Ndq06J4peolK3pcZk7u fjdf2hfRqu5kXD5uEqYVZ9xoNLchdny1oCHfC2vVIsd1MN7zvkRmIN1dG5AjvUnKlfg= X-Gm-Gg: ASbGncuZStyn61YIL6ziFiimPGAr4dQp4FlbSmtx/yrP8NZQ4TntrCsAWgAspMRzQLm cTIDACCjcE5FgsfyGJSRUIjTHU8LVvrWy1J6rpuXlE0QRtlHjPRJluYEfJ+jc3AT9oAmDwjycgV TkDyhfnwxtjuhg3+EC8p+k04/vzkLram927tQyj8MdkcR31NdHWsk+/vLQGjjmVL/gqLhFGWdL5 ZH+hvqJ53tNjK5+rRB27n+NBAwFIdhufg+VQFPnk0AmHXuAfu/JdK43NMbvnExYXHWNaVpUoRQQ q+oNckFyY9hQo/oKJp2U/PK9SY4cTO5WTYNOEQdMGLw6AKDBymdCR9gMJoyqLHkdQx0PO4+rNDk Fs7WC4BrKdy8IjnXPSUoc X-Google-Smtp-Source: AGHT+IH46+bZCrX7K5xEEg6cSKu1TR1I7SkSliOWsHgcBqoA6CG8wyIvf12Zvpihk7VZmfu1IuwYXQ== X-Received: by 2002:a17:90b:164b:b0:310:8d4a:4a97 with SMTP id 98e67ed59e1d1-31214ee68f2mr10223135a91.15.1748597721380; Fri, 30 May 2025 02:35:21 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.35.06 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:35:21 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 27/35] RPAL: add epoll support Date: Fri, 30 May 2025 17:27:55 +0800 Message-Id: <7eb30a577e2c6a4f582515357aea25260105eb18.1748594841.git.libo.gcs85@bytedance.com> X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" To support the epoll family, RPAL needs to add new logic for RPAL services to the existing epoll logic, ensuring that user mode can execute RPAL service-related logic through identical interfaces. 
When the receiver thread calls epoll_wait(), it can set RPAL_EP_POLL_MAGIC to notify the kernel to invoke RPAL-related logic. The kernel then sets the receiver's state to RPAL_RECEIVER_STATE_READY and transitions it to RPAL_RECEIVER_STATE_WAIT when the receiver is actually removed from the runqueue, allowing the sender to perform RPAL calls on the receiver thread. Signed-off-by: Bo Li --- arch/x86/rpal/core.c | 4 + fs/eventpoll.c | 200 +++++++++++++++++++++++++++++++++++++++++++ include/linux/rpal.h | 21 +++++ kernel/sched/core.c | 17 ++++ 4 files changed, 242 insertions(+) diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c index 47c9e551344e..6a22b9faa100 100644 --- a/arch/x86/rpal/core.c +++ b/arch/x86/rpal/core.c @@ -9,6 +9,7 @@ #include #include #include +#include #include =20 #include "internal.h" @@ -63,6 +64,7 @@ void rpal_kernel_ret(struct pt_regs *regs) =20 if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) { rcc =3D current->rpal_rd->rcc; + regs->ax =3D rpal_try_send_events(current->rpal_rd->ep, rcc); atomic_xchg(&rcc->receiver_state, RPAL_RECEIVER_STATE_KERNEL_RET); } else { tsk =3D current->rpal_sd->receiver; @@ -142,6 +144,7 @@ rpal_do_kernel_context_switch(struct task_struct *next,= struct pt_regs *regs) struct task_struct *prev =3D current; =20 if (rpal_test_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT)) { + rpal_resume_ep(next); current->rpal_sd->receiver =3D next; rpal_lock_cpu(current); rpal_lock_cpu(next); @@ -154,6 +157,7 @@ rpal_do_kernel_context_switch(struct task_struct *next,= struct pt_regs *regs) */ rebuild_sender_stack(current->rpal_sd, regs); rpal_schedule(next); + fdput(next->rpal_rd->f); } else { update_dst_stack(next, regs); /* diff --git a/fs/eventpoll.c b/fs/eventpoll.c index d4dbffdedd08..437cd5764c03 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -38,6 +38,7 @@ #include #include #include +#include #include =20 /* @@ -2141,6 +2142,187 @@ static int ep_poll(struct eventpoll *ep, struct epo= ll_event __user *events, } } =20 +#ifdef CONFIG_RPAL + +void rpal_resume_ep(struct task_struct *tsk) +{ + struct rpal_receiver_data *rrd =3D tsk->rpal_rd; + struct eventpoll *ep =3D (struct eventpoll *)rrd->ep; + struct rpal_receiver_call_context *rcc =3D rrd->rcc; + + if (rcc->timeout > 0) { + hrtimer_cancel(&rrd->ep_sleeper.timer); + destroy_hrtimer_on_stack(&rrd->ep_sleeper.timer); + } + if (!list_empty_careful(&rrd->ep_wait.entry)) { + write_lock(&ep->lock); + __remove_wait_queue(&ep->wq, &rrd->ep_wait); + write_unlock(&ep->lock); + } +} + +int rpal_try_send_events(void *ep, struct rpal_receiver_call_context *rcc) +{ + int eavail; + int res =3D 0; + + res =3D ep_send_events(ep, rcc->events, rcc->maxevents); + if (res > 0) + ep_suspend_napi_irqs(ep); + + eavail =3D ep_events_available(ep); + if (!eavail) { + atomic_and(~RPAL_KERNEL_PENDING, &rcc->ep_pending); + /* check again to avoid data race on RPAL_KERNEL_PENDING */ + eavail =3D ep_events_available(ep); + if (eavail) + atomic_or(RPAL_KERNEL_PENDING, &rcc->ep_pending); + } + return res; +} + +static int rpal_schedule_hrtimeout_range_clock(ktime_t *expires, u64 delta, + const enum hrtimer_mode mode, + clockid_t clock_id) +{ + struct hrtimer_sleeper *t =3D ¤t->rpal_rd->ep_sleeper; + + /* + * Optimize when a zero timeout value is given. It does not + * matter whether this is an absolute or a relative time. 
+ */ + if (expires && *expires =3D=3D 0) { + __set_current_state(TASK_RUNNING); + return 0; + } + + /* + * A NULL parameter means "infinite" + */ + if (!expires) { + schedule(); + return -EINTR; + } + + hrtimer_setup_sleeper_on_stack(t, clock_id, mode); + hrtimer_set_expires_range_ns(&t->timer, *expires, delta); + hrtimer_sleeper_start_expires(t, mode); + + if (likely(t->task)) + schedule(); + + hrtimer_cancel(&t->timer); + destroy_hrtimer_on_stack(&t->timer); + + __set_current_state(TASK_RUNNING); + + return !t->task ? 0 : -EINTR; +} + +static int rpal_ep_poll(struct eventpoll *ep, struct epoll_event __user *e= vents, + int maxevents, struct timespec64 *timeout) +{ + int res =3D 0, eavail, timed_out =3D 0; + u64 slack =3D 0; + struct rpal_receiver_data *rrd =3D current->rpal_rd; + wait_queue_entry_t *wait =3D &rrd->ep_wait; + ktime_t expires, *to =3D NULL; + + rrd->ep =3D ep; + + lockdep_assert_irqs_enabled(); + + if (timeout && (timeout->tv_sec | timeout->tv_nsec)) { + slack =3D select_estimate_accuracy(timeout); + to =3D &expires; + *to =3D timespec64_to_ktime(*timeout); + } else if (timeout) { + timed_out =3D 1; + } + + eavail =3D ep_events_available(ep); + + while (1) { + if (eavail) { + res =3D rpal_try_send_events(ep, rrd->rcc); + if (res) { + atomic_xchg(&rrd->rcc->receiver_state, + RPAL_RECEIVER_STATE_RUNNING); + return res; + } + } + + if (timed_out) { + atomic_xchg(&rrd->rcc->receiver_state, + RPAL_RECEIVER_STATE_RUNNING); + return 0; + } + + eavail =3D ep_busy_loop(ep); + if (eavail) + continue; + + if (signal_pending(current)) { + atomic_xchg(&rrd->rcc->receiver_state, + RPAL_RECEIVER_STATE_RUNNING); + return -EINTR; + } + + init_wait(wait); + wait->func =3D rpal_ep_autoremove_wake_function; + wait->private =3D rrd; + write_lock_irq(&ep->lock); + + atomic_xchg(&rrd->rcc->receiver_state, + RPAL_RECEIVER_STATE_READY); + __set_current_state(TASK_INTERRUPTIBLE); + + eavail =3D ep_events_available(ep); + if (!eavail) + __add_wait_queue_exclusive(&ep->wq, wait); + + write_unlock_irq(&ep->lock); + + if (!eavail && ep_schedule_timeout(to)) { + if (RPAL_USER_PENDING & atomic_read(&rrd->rcc->ep_pending)) { + timed_out =3D 1; + } else { + timed_out =3D + !rpal_schedule_hrtimeout_range_clock( + to, slack, HRTIMER_MODE_ABS, + CLOCK_MONOTONIC); + } + } + atomic_cmpxchg(&rrd->rcc->receiver_state, + RPAL_RECEIVER_STATE_READY, + RPAL_RECEIVER_STATE_RUNNING); + __set_current_state(TASK_RUNNING); + + /* + * We were woken up, thus go and try to harvest some events. + * If timed out and still on the wait queue, recheck eavail + * carefully under lock, below. + */ + eavail =3D 1; + + if (!list_empty_careful(&wait->entry)) { + write_lock_irq(&ep->lock); + /* + * If the thread timed out and is not on the wait queue, + * it means that the thread was woken up after its + * timeout expired before it could reacquire the lock. + * Thus, when wait.entry is empty, it needs to harvest + * events. + */ + if (timed_out) + eavail =3D list_empty(&wait->entry); + __remove_wait_queue(&ep->wq, wait); + write_unlock_irq(&ep->lock); + } + } +} +#endif + /** * ep_loop_check_proc - verify that adding an epoll file inside another * epoll structure does not violate the constraints, = in @@ -2529,7 +2711,25 @@ static int do_epoll_wait(int epfd, struct epoll_even= t __user *events, ep =3D fd_file(f)->private_data; =20 /* Time to fish for events ... */ +#ifdef CONFIG_RPAL + /* + * For RPAL task, if it is a receiver and it set MAGIC in shared memory, + * We think it is prepared for rpal calls. 
Therefore, we need to handle + * it differently. + * + * In other cases, RPAL task always plays like a normal task. + */ + if (rpal_current_service() && + rpal_test_current_thread_flag(RPAL_RECEIVER_BIT) && + current->rpal_rd->rcc->rpal_ep_poll_magic =3D=3D RPAL_EP_POLL_MAGIC) { + current->rpal_rd->f =3D f; + return rpal_ep_poll(ep, events, maxevents, to); + } else { + return ep_poll(ep, events, maxevents, to); + } +#else return ep_poll(ep, events, maxevents, to); +#endif } =20 SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events, diff --git a/include/linux/rpal.h b/include/linux/rpal.h index f2474cb53abe..5912ffec6e28 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -16,6 +16,8 @@ #include #include #include +#include +#include =20 #define RPAL_ERROR_MSG "rpal error: " #define rpal_err(x...) pr_err(RPAL_ERROR_MSG x) @@ -89,6 +91,7 @@ enum { }; =20 #define RPAL_ERROR_MAGIC 0x98CC98CC +#define RPAL_EP_POLL_MAGIC 0xCC98CC98 =20 #define RPAL_SID_SHIFT 24 #define RPAL_ID_SHIFT 8 @@ -103,6 +106,9 @@ enum { #define RPAL_PKRU_UNION 1 #define RPAL_PKRU_INTERSECT 2 =20 +#define RPAL_KERNEL_PENDING 0x1 +#define RPAL_USER_PENDING 0x2 + extern unsigned long rpal_cap; =20 enum rpal_task_flag_bits { @@ -282,6 +288,12 @@ struct rpal_receiver_call_context { int receiver_id; atomic_t receiver_state; atomic_t sender_state; + atomic_t ep_pending; + int rpal_ep_poll_magic; + int epfd; + void __user *events; + int maxevents; + int timeout; }; =20 /* recovery point for sender */ @@ -325,6 +337,10 @@ struct rpal_receiver_data { struct rpal_shared_page *rsp; struct rpal_receiver_call_context *rcc; struct task_struct *sender; + void *ep; + struct fd f; + struct hrtimer_sleeper ep_sleeper; + wait_queue_entry_t ep_wait; }; =20 struct rpal_sender_data { @@ -574,4 +590,9 @@ __rpal_switch_to(struct task_struct *prev_p, struct tas= k_struct *next_p); asmlinkage __visible void rpal_schedule_tail(struct task_struct *prev); int do_rpal_mprotect_pkey(unsigned long start, size_t len, int pkey); void rpal_set_pku_schedule_tail(struct task_struct *prev); +int rpal_ep_autoremove_wake_function(wait_queue_entry_t *curr, + unsigned int mode, int wake_flags, + void *key); +void rpal_resume_ep(struct task_struct *tsk); +int rpal_try_send_events(void *ep, struct rpal_receiver_call_context *rcc); #endif diff --git a/kernel/sched/core.c b/kernel/sched/core.c index eb5d5bd51597..486d59bdd3fc 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6794,6 +6794,23 @@ pick_next_task(struct rq *rq, struct task_struct *pr= ev, struct rq_flags *rf) #define SM_RTLOCK_WAIT 2 =20 #ifdef CONFIG_RPAL +int rpal_ep_autoremove_wake_function(wait_queue_entry_t *curr, + unsigned int mode, int wake_flags, + void *key) +{ + struct rpal_receiver_data *rrd =3D curr->private; + struct task_struct *tsk =3D rrd->rcd.bp_task; + int ret; + + ret =3D try_to_wake_up(tsk, mode, wake_flags); + + list_del_init_careful(&curr->entry); + if (!ret) + atomic_or(RPAL_KERNEL_PENDING, &rrd->rcc->ep_pending); + + return 1; +} + static inline void rpal_check_ready_state(struct task_struct *tsk, int sta= te) { if (rpal_test_task_thread_flag(tsk, RPAL_RECEIVER_BIT)) { --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pg1-f174.google.com (mail-pg1-f174.google.com [209.85.215.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4BE08224225 for ; Fri, 30 May 2025 09:35:37 +0000 (UTC) Authentication-Results: 
X-Google-Smtp-Source: AGHT+IG6nnxH2FVgtgvPODfS+5ObcESlWEhm+JqnaRiZqSXrOdJI4HZuM5tUxbidkIm+8B4kCWHa5w== X-Received: by 2002:a17:90b:3c49:b0:312:ec:4123 with SMTP id 98e67ed59e1d1-3125036bb61mr2199095a91.13.1748597736482; Fri, 30 May 2025 02:35:36 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.35.21 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:35:36 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 28/35] RPAL: add rpal_uds_fdmap() support Date: Fri, 30 May 2025 17:27:56 +0800 Message-Id: <7d9d805dcfe80358c06f0a02fadd31a7288500b4.1748594841.git.libo.gcs85@bytedance.com> X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" For a UDS connection between a sender and a receiver, neither side knows which file descriptor (fd) the other uses to manage the connection. The sender cannot determine which user space fd's buffer in the receiver to write data to, necessitating a complex process for both sides to inform each other of fd mappings. This process incurs significant overhead when managing a large number of connections, which requires optimization. This patch introduces the RPAL_IOCTL_UDS_FDMAP interface, which simplifies the establishment of fd mappings between sender and receiver processes for files monitored by epoll. This avoids the need for a complex setup process each time a new connection is created. 
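For illustration, the intended call pattern from user space looks roughly like the sketch below. The struct layout and the (receiver_id << 32 | fd) result encoding are taken from this patch; the device node path, the example fd and service id values, and the way the ioctl number reaches user space are assumptions, since no uapi header is part of this mail.

#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* copied from this patch; a real program would pick it up from an RPAL header */
struct rpal_uds_fdmap_arg {
	int service_id;      /* id of the already requested peer service */
	int cfd;             /* our connected unix domain socket fd */
	unsigned long *res;  /* out: (receiver_id << 32) | receiver-side fd */
};

#ifndef RPAL_IOCTL_UDS_FDMAP
#define RPAL_IOCTL_UDS_FDMAP 0  /* placeholder; the real _IOWR() value comes
                                 * from the header added by this series */
#endif

int main(void)
{
	unsigned long res = 0;
	struct rpal_uds_fdmap_arg arg = {
		.service_id = 1,   /* example peer service id */
		.cfd = 7,          /* example connected UDS fd */
		.res = &res,
	};
	int fd = open("/dev/rpal", O_RDWR);  /* device path is an assumption */

	if (fd < 0 || ioctl(fd, RPAL_IOCTL_UDS_FDMAP, &arg) != 0)
		return 1;

	printf("receiver id %lu, receiver-side fd %lu\n",
	       res >> 32, res & 0xffffffffUL);
	close(fd);
	return 0;
}

On success the high 32 bits of *res identify the receiver thread whose eventpoll monitors the peer socket, and the low 32 bits are the fd that receiver uses for the same connection, so the sender can address its writes without any per-connection handshake.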
Signed-off-by: Bo Li --- arch/x86/rpal/internal.h | 3 + arch/x86/rpal/proc.c | 117 +++++++++++++++++++++++++++++++++++++++ fs/eventpoll.c | 19 +++++++ include/linux/rpal.h | 11 ++++ 4 files changed, 150 insertions(+) diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h index e49febce8645..e03f8a90619d 100644 --- a/arch/x86/rpal/internal.h +++ b/arch/x86/rpal/internal.h @@ -11,6 +11,7 @@ =20 #include #include +#include =20 extern bool rpal_inited; =20 @@ -60,3 +61,5 @@ int rpal_alloc_pkey(struct rpal_service *rs, int pkey); int rpal_pkey_setup(struct rpal_service *rs, int pkey); void rpal_set_current_pkru(u32 val, int mode); void rpal_service_pku_init(void); + +extern struct sock *unix_peer_get(struct sock *sk); diff --git a/arch/x86/rpal/proc.c b/arch/x86/rpal/proc.c index 2f9cceec4992..b60c099c4a92 100644 --- a/arch/x86/rpal/proc.c +++ b/arch/x86/rpal/proc.c @@ -9,6 +9,8 @@ #include #include #include +#include +#include =20 #include "internal.h" =20 @@ -34,6 +36,118 @@ static int rpal_get_api_version_and_cap(void __user *p) return 0; } =20 +static void *rpal_uds_peer_data(struct sock *psk, int *pfd) +{ + void *ep =3D NULL; + unsigned long flags; + struct socket_wq *wq; + wait_queue_entry_t *entry; + wait_queue_head_t *whead; + + rcu_read_lock(); + wq =3D rcu_dereference(psk->sk_wq); + if (!skwq_has_sleeper(wq)) + goto unlock_rcu; + + whead =3D &wq->wait; + + spin_lock_irqsave(&whead->lock, flags); + if (list_empty(&whead->head)) { + pr_debug("rpal debug: [%d] cannot find epitem entry\n", + current->pid); + goto unlock_spin; + } + entry =3D list_first_entry(&whead->head, wait_queue_entry_t, entry); + *pfd =3D rpal_get_epitemfd(entry); + if (*pfd < 0) { + pr_debug("rpal debug: [%d] cannot find epitem fd\n", + current->pid); + goto unlock_spin; + } + ep =3D rpal_get_epitemep(entry); + +unlock_spin: + spin_unlock_irqrestore(&whead->lock, flags); +unlock_rcu: + rcu_read_unlock(); + return ep; +} + +static int rpal_find_receiver_rid(int id, void *ep) +{ + struct task_struct *tsk; + struct rpal_service *cur, *tgt; + int rid =3D -1; + + cur =3D rpal_current_service(); + + tgt =3D rpal_get_mapped_service_by_id(cur, id); + if (tgt =3D=3D NULL) + goto out; + + for_each_thread(tgt->group_leader, tsk) { + if (!rpal_test_task_thread_flag(tsk, RPAL_RECEIVER_BIT)) + continue; + if (tsk->rpal_rd->ep =3D=3D ep) { + rid =3D tsk->rpal_rd->rcc->receiver_id; + break; + } + } + + rpal_put_service(tgt); +out: + return rid; +} + +static long rpal_uds_fdmap(unsigned long uarg) +{ + struct rpal_uds_fdmap_arg arg; + struct socket *sock; + struct sock *peer_sk; + void *ep; + int sfd, rid; + struct fd f; + long res; + int ret; + + ret =3D copy_from_user(&arg, (void __user *)uarg, sizeof(arg)); + if (ret) + return ret; + + f =3D fdget(arg.cfd); + if (!fd_file(f)) + goto fd_put; + + sock =3D sock_from_file(fd_file(f)); + if (!sock) + goto fd_put; + + peer_sk =3D unix_peer_get(sock->sk); + if (peer_sk =3D=3D NULL) + goto fd_put; + ep =3D rpal_uds_peer_data(peer_sk, &sfd); + if (ep =3D=3D NULL) { + pr_debug("rpal debug: [%d] cannot find epitem ep\n", + current->pid); + goto peer_sock_put; + } + rid =3D rpal_find_receiver_rid(arg.service_id, ep); + if (rid < 0) { + pr_debug("rpal debug: [%d] rpal: cannot find epitem rid\n", + current->pid); + goto peer_sock_put; + } + res =3D (long)rid << 32 | (long)sfd; + ret =3D put_user(res, arg.res); + +peer_sock_put: + sock_put(peer_sk); +fd_put: + if (fd_file(f)) + fdput(f); + return ret; +} + static long rpal_ioctl(struct file *file, unsigned int cmd, unsigned long = 
arg) { struct rpal_service *cur =3D rpal_current_service(); @@ -81,6 +195,9 @@ static long rpal_ioctl(struct file *file, unsigned int c= md, unsigned long arg) ret =3D put_user(cur->pkey, (int __user *)arg); break; #endif + case RPAL_IOCTL_UDS_FDMAP: + ret =3D rpal_uds_fdmap(arg); + break; default: return -EINVAL; } diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 437cd5764c03..791321639561 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -2143,6 +2143,25 @@ static int ep_poll(struct eventpoll *ep, struct epol= l_event __user *events, } =20 #ifdef CONFIG_RPAL +void *rpal_get_epitemep(wait_queue_entry_t *wait) +{ + struct epitem *epi =3D ep_item_from_wait(wait); + + if (!epi) + return NULL; + + return epi->ep; +} + +int rpal_get_epitemfd(wait_queue_entry_t *wait) +{ + struct epitem *epi =3D ep_item_from_wait(wait); + + if (!epi) + return -1; + + return epi->ffd.fd; +} =20 void rpal_resume_ep(struct task_struct *tsk) { diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 5912ffec6e28..7657e6c6393b 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -350,6 +350,12 @@ struct rpal_sender_data { struct task_struct *receiver; }; =20 +struct rpal_uds_fdmap_arg { + int service_id; + int cfd; + unsigned long *res; +}; + enum rpal_command_type { RPAL_CMD_GET_API_VERSION_AND_CAP, RPAL_CMD_GET_SERVICE_KEY, @@ -363,6 +369,7 @@ enum rpal_command_type { RPAL_CMD_REQUEST_SERVICE, RPAL_CMD_RELEASE_SERVICE, RPAL_CMD_GET_SERVICE_PKEY, + RPAL_CMD_UDS_FDMAP, RPAL_NR_CMD, }; =20 @@ -393,6 +400,8 @@ enum rpal_command_type { _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_RELEASE_SERVICE, unsigned long) #define RPAL_IOCTL_GET_SERVICE_PKEY \ _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_PKEY, int *) +#define RPAL_IOCTL_UDS_FDMAP \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_UDS_FDMAP, unsigned long) =20 #define rpal_for_each_requested_service(rs, idx) = \ for (idx =3D find_first_bit(rs->requested_service_bitmap, RPAL_NR_ID); \ @@ -594,5 +603,7 @@ int rpal_ep_autoremove_wake_function(wait_queue_entry_t= *curr, unsigned int mode, int wake_flags, void *key); void rpal_resume_ep(struct task_struct *tsk); +void *rpal_get_epitemep(wait_queue_entry_t *wait); +int rpal_get_epitemfd(wait_queue_entry_t *wait); int rpal_try_send_events(void *ep, struct rpal_receiver_call_context *rcc); #endif --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pg1-f169.google.com (mail-pg1-f169.google.com [209.85.215.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6D07B224B1C for ; Fri, 30 May 2025 09:35:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.169 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597754; cv=none; b=oiyp6GgSo5d77p7SmnhMv+++TaG09nxTWI+6LQ972WnonfV66L7eQZAVH2HeTB3Xrlh+QWUItnm063KVz63yGveozOSa6Tl0jyDpECbbm+wlZDc8JQXFnlPvwfiuPwDxeTq4aVCrbe4AjEoxtA8q3nqnK1zDkmoz/xyoT3GTr3Q= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597754; c=relaxed/simple; bh=f9zzXMNbxFfZbGJyBMa0POHUK33Y39jv+mYtKpjEoMk=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Bv/BmIzhN12arfeTvrI6wMpIq0oblK5O9cTEYjXElbVGnifBPn4qRpAFvDsaxfvugBPqRWvGwRWuh1+FPmdxr7AVkVbu8CYhRS3IeXWFzp4OBPuaG0Iwye+qIcYZG2y+mZ+JUuwkYlHCEHxCUVGGac9Pg6FlZ9TRZFulM7VLi0U= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass 
smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=NYaqh2v8; arc=none smtp.client-ip=209.85.215.169 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="NYaqh2v8" Received: by mail-pg1-f169.google.com with SMTP id 41be03b00d2f7-afc857702d1so1469093a12.3 for ; Fri, 30 May 2025 02:35:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597751; x=1749202551; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=ruJXCVL33HlYZ/8pvuQkFgYyPjMywOaCJ6OilWvGVoA=; b=NYaqh2v8WiwtvUtfIP8l6dheuVdr6/jF6arATD9keqVx+WpYUV6fkYXqodO9Y3hnNt rp7zHfBulLlNbMg5TFKvqILTzBrh8S7gSK355JgXCx7jRGp8bVvbWVSBDLdDz2Xu6PBM qoFsq2qQUrsZ1rxPQNKydFbrEtgoRQATlkiKjVrENN2/PbqNyD3poYKgHtajBX1ODIRg 5NdnvZqH0mhneUPWc25UHe+YHcC1UOw+SsV7Yl0dEyR0TgwurP7VFTDhJ+zsRWpo5O6T 4bX1uRVDdGXzxk+WixQYEpaeoXe+/QWQ9OzwB+NrpunIFQLx1IDnHcBWAheFU0OcGmSP 26bQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597751; x=1749202551; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=ruJXCVL33HlYZ/8pvuQkFgYyPjMywOaCJ6OilWvGVoA=; b=BvTaJzwHL+ix1L7fgcuJVAlAiOr14reKF+KeQz6UBVLOLaj6DbySE2Z811y/v7xue5 k6YshdfO34B7+5UN888z5gs6BQtcUA9eVn6CM/3Jh06X05R3Jx9igwRKg3unh+lk7DQf 3mmb0RuoxCdk50kp2e21p56CadQzmRzH3YXyuV5xkTuvypQiz0chag5XHRy02QbeFgp0 E401g3afwq78Ji/IVZXGiEyK4ODFp4F5TIumDPwbZVDxK9b1Sg5BnSe/fOXlGCEixAlD zO4lXJiM2SLKSsAfo45LquaY6G9R8FCCOqix1SR5GRdTo784UdsESoEB9VSDy67lqz/9 ODEw== X-Forwarded-Encrypted: i=1; AJvYcCXm866PIJFmJsd1b0qG8P9u6veOgd1t5AWKIjhx5m74pkTVqz9GwD+fhfGwurk9v/KZuf3mEbWvJlJOGbc=@vger.kernel.org X-Gm-Message-State: AOJu0YwOLMokQHHayPdcG6gZ4LokaqGo1NWEpxCaiS11OHa/mhgwG8ju N87xIK1PD4vvrLQSGM/gjzIVTndGHXekbdEuffJ/zNVUDZ8iQeJW6E/wdawsUuxVdM8= X-Gm-Gg: ASbGncuciDlOjCusoGj7IJeCGp3gAE67PB0CxSllWoln2ObYEh2Jq+mfNwCtY7Qok6t MYOeKKUAoCBZU2ORl1IwSXFlhG4qc4JJiRVM2g0mxApW9nzVJhwHSu9b18OFjGN1ge2UIJL/SGl 8re1IJ3GMDNtCTxIXcV0gZA36IytXcxA6E9w+LA+cEFjf09A0JVazIpWvFGpDAg7bAYMcHA14sy n00DxPKv/YR1EKxp8P3gP8PefSuMv3B+mKKRpDMahZ8S4B8cA++DRSp9YYnUdi2V2dQL8aKef2a wNg//V+hLJss2SnwTgnXeK/R30EWZOJFDYkeZMn2OM+NEQFplGIIeFqzXbToztEM27IDOuYVb/S fu0daGcx0VA== X-Google-Smtp-Source: AGHT+IEvXrgn/ooMRW35qDhy0ngiIGcMy7za8V80iv/SemfT2y7R5n/KmWGhxdI6JxqwxTCjuIZGRA== X-Received: by 2002:a17:90b:1dc4:b0:311:afaa:5e25 with SMTP id 98e67ed59e1d1-31241865ecdmr4397369a91.24.1748597751453; Fri, 30 May 2025 02:35:51 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.35.36 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:35:51 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, 
alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 29/35] RPAL: fix race condition in pkru update Date: Fri, 30 May 2025 17:27:57 +0800 Message-Id: <7fbb84a57fc8046738c7196031a3fd97ea8334e2.1748594841.git.libo.gcs85@bytedance.com> X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" When setting up MPK, RPAL uses IPIs to notify tasks running on each core in the thread group to modify their PKRU values and update the PKEY fields in all VMA page tables. A race condition exists here: when updating PKRU, the page table updates may not yet be complete. In such cases, writing PKRU permissions at locations that require calling pkru_write_default() (e.g., during signal handling) must not be restricted to a single PKEY, as this would cause PKRU permissions to fail to accommodate both old and new page table PKEY settings. This patch introduces a pku_on state with values PKU_ON_FALSE, PKU_ON_INIT, and PKU_ON_FINISH, representing the states before, during, and after page table PKEY updates, respectively. For RPAL services, all calls to pkru_write_default() are replaced with rpal_pkru_write_default(). - Before page table setup (PKU_ON_FALSE), rpal_pkru_write_default() directly calls pkru_write_default(). - During page table setup (PKU_ON_INIT), rpal_pkru_write_default() enables permissions for all PKEYs, ensuring the task can access both old and new page tables simultaneously. - After page table setup completes (PKU_ON_FINISH), rpal_pkru_write_default() tightens permissions to match the updated page tables. For newly allocated page tables, the new PKEY is only used when pku_on is PKU_ON_FINISH. The mmap lock is used to ensure no race conditions occur during this process. 
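[Editor's note: for readers unfamiliar with the PKRU layout the three states map onto, a minimal illustration follows. PKRU carries two bits per protection key: AD (access disable) at bit 2*key and WD (write disable) at bit 2*key+1, so writing 0 (the PKU_ON_INIT case) leaves every pkey accessible and therefore covers both the old and the new page-table PKEY settings. The helper below is hypothetical and only shows one plausible way a pkey could be turned into a restrictive PKRU value for the PKU_ON_FINISH case; it is not the series' actual rpal_pkey_to_pkru().]

#include <linux/types.h>

/* Hypothetical example, not the series' rpal_pkey_to_pkru() helper. */
static inline u32 example_pkey_to_pkru(int pkey)
{
	u32 pkru = 0;
	int k;

	/* Deny access through every key except key 0 and the RPAL pkey. */
	for (k = 1; k < 16; k++) {
		if (k != pkey)
			pkru |= 0x1u << (2 * k);	/* set AD bit for key k */
	}
	return pkru;
}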
Signed-off-by: Bo Li --- arch/x86/kernel/cpu/common.c | 4 ++-- arch/x86/kernel/fpu/core.c | 4 ++-- arch/x86/kernel/process.c | 4 ++-- arch/x86/rpal/pku.c | 14 +++++++++++++- arch/x86/rpal/service.c | 2 +- include/linux/rpal.h | 9 ++++++++- mm/mmap.c | 2 +- mm/mprotect.c | 1 + mm/vma.c | 2 +- 9 files changed, 31 insertions(+), 11 deletions(-) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 2678453cdf76..d21f44873b86 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -534,8 +534,8 @@ static __always_inline void setup_pku(struct cpuinfo_x8= 6 *c) cr4_set_bits(X86_CR4_PKE); /* Load the default PKRU value */ #ifdef CONFIG_RPAL_PKU - if (rpal_current_service() && rpal_current_service()->pku_on) - write_pkru(rpal_pkey_to_pkru(rpal_current_service()->pkey)); + if (rpal_current_service()) + rpal_pkru_write_default(); else #endif pkru_write_default(); diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index 251b1ddee726..4b413af0b179 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -748,8 +748,8 @@ static inline void restore_fpregs_from_init_fpstate(u64= features_mask) frstor(&init_fpstate.regs.fsave); =20 #ifdef CONFIG_RPAL_PKU - if (rpal_current_service() && rpal_current_service()->pku_on) - write_pkru(rpal_pkey_to_pkru(rpal_current_service()->pkey)); + if (rpal_current_service()) + rpal_pkru_write_default(); else #endif pkru_write_default(); diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index b74de35218f9..898a9e0b23e7 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -286,8 +286,8 @@ static void pkru_flush_thread(void) * the hardware right here (similar to context switch). */ #ifdef CONFIG_RPAL_PKU - if (rpal_current_service() && rpal_current_service()->pku_on) - write_pkru(rpal_pkey_to_pkru(rpal_current_service()->pkey)); + if (rpal_current_service()) + rpal_pkru_write_default(); else #endif pkru_write_default(); diff --git a/arch/x86/rpal/pku.c b/arch/x86/rpal/pku.c index 26cef324f41f..8e530931fb23 100644 --- a/arch/x86/rpal/pku.c +++ b/arch/x86/rpal/pku.c @@ -161,7 +161,7 @@ int rpal_pkey_setup(struct rpal_service *rs, int pkey) rs->pkey =3D pkey; /* others must see rs->pkey before rs->pku_on */ barrier(); - rs->pku_on =3D true; + rs->pku_on =3D PKU_ON_INIT; mmap_write_unlock(current->mm); rpal_set_group_pkru(val, RPAL_PKRU_UNION); err =3D do_rpal_mprotect_pkey(rs->base, RPAL_ADDR_SPACE_SIZE, pkey); @@ -182,3 +182,15 @@ int rpal_alloc_pkey(struct rpal_service *rs, int pkey) =20 return ret; } + +void rpal_pkru_write_default(void) +{ + struct rpal_service *cur =3D rpal_current_service(); + + if (cur->pku_on =3D=3D PKU_ON_INIT) + write_pkru(0); + else if (cur->pku_on =3D=3D PKU_ON_FINISH) + write_pkru(rpal_pkey_to_pkru(rpal_current_service()->pkey)); + else + pkru_write_default(); +} diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c index 7a83e85cf096..9fd568fa9a29 100644 --- a/arch/x86/rpal/service.c +++ b/arch/x86/rpal/service.c @@ -210,7 +210,7 @@ struct rpal_service *rpal_register_service(void) init_waitqueue_head(&rs->rpd.rpal_waitqueue); #ifdef CONFIG_RPAL_PKU rs->pkey =3D -1; - rs->pku_on =3D false; + rs->pku_on =3D PKU_ON_FALSE; rpal_service_pku_init(); #endif =20 diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 7657e6c6393b..16a3c80383f7 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -138,6 +138,12 @@ enum rpal_capability { RPAL_CAP_PKU }; =20 +enum { + PKU_ON_FALSE, + PKU_ON_INIT, + 
PKU_ON_FINISH, +}; + struct rpal_critical_section { unsigned long ret_begin; unsigned long ret_end; @@ -245,7 +251,7 @@ struct rpal_service { =20 #ifdef CONFIG_RPAL_PKU /* pkey */ - bool pku_on; + int pku_on; int pkey; #endif =20 @@ -599,6 +605,7 @@ __rpal_switch_to(struct task_struct *prev_p, struct tas= k_struct *next_p); asmlinkage __visible void rpal_schedule_tail(struct task_struct *prev); int do_rpal_mprotect_pkey(unsigned long start, size_t len, int pkey); void rpal_set_pku_schedule_tail(struct task_struct *prev); +void rpal_pkru_write_default(void); int rpal_ep_autoremove_wake_function(wait_queue_entry_t *curr, unsigned int mode, int wake_flags, void *key); diff --git a/mm/mmap.c b/mm/mmap.c index d36ea4ea2bd0..85a4a33491ab 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -404,7 +404,7 @@ unsigned long do_mmap(struct file *file, unsigned long = addr, do { struct rpal_service *cur =3D rpal_current_service(); =20 - if (cur && cur->pku_on) + if (cur && cur->pku_on =3D=3D PKU_ON_FINISH) pkey =3D cur->pkey; } while (0); #endif diff --git a/mm/mprotect.c b/mm/mprotect.c index e9ae828e377d..ac162180553e 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -938,6 +938,7 @@ int do_rpal_mprotect_pkey(unsigned long start, size_t l= en, int pkey) } tlb_finish_mmu(&tlb); =20 + rpal_current_service()->pku_on =3D PKU_ON_FINISH; out: mmap_write_unlock(current->mm); return error; diff --git a/mm/vma.c b/mm/vma.c index fa9d8f694e6e..57ec99a5969d 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -2632,7 +2632,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_= area_struct *vma, struct rpal_service *cur =3D rpal_current_service(); unsigned long vma_pkey_mask; =20 - if (cur && cur->pku_on) { + if (cur && cur->pku_on =3D=3D PKU_ON_FINISH) { vma_pkey_mask =3D VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | VM_PKEY_BIT3; flags &=3D ~vma_pkey_mask; --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pg1-f180.google.com (mail-pg1-f180.google.com [209.85.215.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 69CE021E082 for ; Fri, 30 May 2025 09:36:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.180 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597768; cv=none; b=uc6oECp6WOHwwj7CA5ewTGgZ43y+EINAo5Wd+Bx8fAlS6C0ekC01CIfFlAY37yl/dw3pyhvlPkyiKXux6gnDyU6BLSshVYhJkVFseHtIkfUczcJLMhpK8/l5y6Y9zO3/o9zly5fSGdeDjGR7oVZtu4eRTqCuew5jpbszEnxur5Q= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597768; c=relaxed/simple; bh=M3ZKzt1i9qkVfxIhwI+P7qiPBgdAVQp1iyJ0A0C1oKs=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=aEfE2jU18D7LTlbzt9A49xBcwfB3N29xVDczvK6c16LHxdogkJBr8/X/SjSPFNbjcE9WUADM6yES1oCUgiPmZ0g1bwCNickbqxl4vPKp1SWQIiFm8/nN4269qVmRfYcb8r8EaftfqhLqRRzIuxSVuQ7ULARHgWh5HRAwTjsdV5c= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=P5v3RIM5; arc=none smtp.client-ip=209.85.215.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com 
header.i=@bytedance.com header.b="P5v3RIM5" Received: by mail-pg1-f180.google.com with SMTP id 41be03b00d2f7-b26df8f44e6so1846791a12.2 for ; Fri, 30 May 2025 02:36:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597766; x=1749202566; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=JvVWemooqfdC/iIY99ZRbIRy8Qdtq9pGwXsG45AWib0=; b=P5v3RIM52BM+aAtQi9Pl9OO1IkeENABVGz92pzmQn1FFiG8PwrF+QXRI+FsIIMdqjg NM4Z4iKAQ/ldpEJyDjLft52PTL2oc7Ulf9GhtnXTqzORYRRPRUM3cH2GqBbaCY4sAsG9 7Pdz47ziN8ZPYO7H9pbab0qor34/Ezih4X90itfhU5OPGYefwDsAl/Kb6tArsMO8aSee aWoHdVuZS7HiYAhPoZ1C/03n4OGIMKLPj/uwG+KQO7WL5CIUl2T+HIclb6/Gv7q8Sdgh PD6dN7tLf82F2MEC3RVQeO6sWDT7qK5eokyZ7EQnsPqp2y7WWZDlck/HaerjotN8pXXH 5E9A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597766; x=1749202566; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=JvVWemooqfdC/iIY99ZRbIRy8Qdtq9pGwXsG45AWib0=; b=IUDLpjgpM0UeoJ4lameWzQPSyhfB3tHz0FcDMjAjKgZy+1r4I9prTFXGhMLxXSl/o/ mW96nH3krdH/XpIlzA+TX1Xh/QJXx23P/4DuWbzohke2D7xvw4Dg4kSkvv5HH9qljQIe 3uxBSZLOx8RX5Y1M5EB0OBwK8cZIa5e9ux7vPVgnLRZtXp6Ki1JZ1mzOID6xu3qdi25R aHemc8hyI4aYjY9V2XQ8cbBNtpveFFcs1AIBL+QQTuKhVEeXBmbYF2X7VmxblwUdFJ9i nFVMtXPhWIwOGX8hj0FQLUJPFOYsfiM6fBM2t1YKDusPLJbB4zcZsyWE4tJ7CXrGu/ro +H3w== X-Forwarded-Encrypted: i=1; AJvYcCVzIRB62E4iucC2yu1vNMIx4VxQNI95oUJtS4einimzXDUz9juMAnQTszu29zZg6Prw8hxkGYoseAlK+P8=@vger.kernel.org X-Gm-Message-State: AOJu0YzbD3K19pJb+eK8GOmNDqhkGxOGVAvzHNLlffxAL1zHbr3MKHTP NJoPDY95ARWmf32zI4mkTu52WciF39wNUkTBChdUFO+DLQ8wBqqn8rgOtMQWIVWlTNY= X-Gm-Gg: ASbGncsTqCGPdT9/oW1mDnCSt3SCYkVkGb1AUbDNudAGaz/Yl4q3PXH4QG0xxSX2x/A A/GLfvUSPTnoTXfQOeaCWLfCL/hwY/3Qbf7ob42zl4KY/wklWC0fo8ptdrNDKV6P0xvmhkkEgkc JeVW5MxnDmka7SpCaeABTFhXdd/TqaIXME0bOHc2Tcw5P4n4xSbd5WI1TeIx2RhoHM3Oh3VQRMv XkyP/+/F44tWiukP0D/rBEBqy95thpjQ2LBclP9KH5pIhZEGyv17w8z5JEcHTZqAUohY8RZGnst U+AKrS4hH0ZZ2RhArzgdjDkbtoRbrzdFm/IHpq9fmE6SD+5E2iDYY/B5rwUUO20UcMILhrwnN6I wk/cDaUOKFDkrc0BcJahV X-Google-Smtp-Source: AGHT+IFWmfD7z+IdT9GMiEVeQTlh62r7ag47hPnuBnpxyTEiSzaUyDKA69VQl5NNKF6zBMdalqyfiA== X-Received: by 2002:a17:90b:5387:b0:311:e605:f60e with SMTP id 98e67ed59e1d1-31241637ee5mr4240780a91.20.1748597766419; Fri, 30 May 2025 02:36:06 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.35.51 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:36:06 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, 
harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 30/35] RPAL: fix pkru setup when fork Date: Fri, 30 May 2025 17:27:58 +0800 Message-Id: X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" When a task performs a fork operation, the PKRU value of the newly forked task is set to the value read from hardware. At this point, if the service is executing rpal_pkey_setup(), the newly forked task has not yet been added to the task list, so PKRU settings cannot be synchronized to the new task. This results in the new task's PKRU not being set to the correct value when it is woken up. This patch addresses this issue by: - After the newly forked task is added to the task list, further updating its PKRU value. - Acquiring a mutex lock to ensure that the PKRU update occurs either before or after the invocation of rpal_pkey_setup(). This avoids race conditions with rpal_pkey_setup() and guarantees that the re-updated PKRU value is always correct. Signed-off-by: Bo Li --- kernel/fork.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/kernel/fork.c b/kernel/fork.c index 01cd48eadf68..11cba74d07c8 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2683,6 +2683,19 @@ __latent_entropy struct task_struct *copy_process( syscall_tracepoint_update(p); write_unlock_irq(&tasklist_lock); =20 +#ifdef CONFIG_RPAL_PKU + do { + struct rpal_service *cur =3D rpal_current_service(); + + if (cur) { + /* ensure we are not in rpal_enable_service() */ + mutex_lock(&cur->mutex); + p->thread.pkru =3D rdpkru(); + mutex_unlock(&cur->mutex); + } + } while (0); +#endif + if (pidfile) fd_install(pidfd, pidfile); =20 --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pj1-f50.google.com (mail-pj1-f50.google.com [209.85.216.50]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 31CD21E8323 for ; Fri, 30 May 2025 09:36:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.50 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597783; cv=none; b=EfWZZ7MKKJeqMvtTS4YUrdIjkNR//uZ53sYjxBarcbDK0qIWt7Ivd2Ej8tBxasi5Y8aX7xIoIN/23qHbh+NNcBf68h1ihsfyHY2k8xJTdqjv4Hf9gerNe9eZ1AQTL2us8LT8/EkGVwtQ3QAp5e4sxP86SDtOFwjxcbR2SKy3l8M= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597783; c=relaxed/simple; bh=ALYzXLguj2KAo7eUQpHRTBkt+ax6Jk2hscWpGu5UIaI=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=eR1DquaDnRe6ENmcbYXIInZTthekhk5cezpVY1xFh3rZ2OczhkWIoSEzYrm3FeR86rWQ7tUePvEeXXaY0Tu3gNIZyQACALh9PboY6SF3+KOntK0XQgewxyfV2nLscmls2fPX31jxuBrqaT1BGPOXGfAjWjfX9HmjZriamIXLq+g= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com 
header.b=YUmSbtk+; arc=none smtp.client-ip=209.85.216.50 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="YUmSbtk+" Received: by mail-pj1-f50.google.com with SMTP id 98e67ed59e1d1-3124f18c214so298158a91.2 for ; Fri, 30 May 2025 02:36:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597781; x=1749202581; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=TcEpV4r+zMsMwJgdzkjWqZN5teRn9qhnk6f0JAPvEA0=; b=YUmSbtk+S1jrOLRy9s9/X4wpHMB+n75vTpe5rrwsUM9VdT2Q9FmLk3Ww7r6U6iE+ZZ m6NNM+8p4Zb88ynkQTfns8oWZ9g6Nh2VbA6poFf0frPIumbN5ZnjTAMln5Q5730uDC5n s/gPN2nkG4m5mmijf7oslKxgnz+aF93fN8MHtlqq2pG+LA4ycmC2rKybxXj7YM73zo6x FlevlE2ChxhvQyq0ze6ce1YwOK7rZHevQ8AoE4D8YmS2PWzOcmpKVDMh5yZCJD7I8lDQ evgcwWlw8aKXWinQAsjEUsXeRvvhSuq1uZr5PWRhRFDZd4NGIIdcfbJBbAfNJWB0e1Kk 2n8w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597781; x=1749202581; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=TcEpV4r+zMsMwJgdzkjWqZN5teRn9qhnk6f0JAPvEA0=; b=KlrH71e1IM7zWcwDaJ7LrWcGNj6bsK03WpIJ6dKv6UFR4ps5GI/0uncIHmLHPsRGra wpPjde+WiEpP90C/CFNMa0GKYiGNMd5o8AYbTICNqg5RJnLvDnKdEfdaigx0hvoKplG+ z8B+vUFQe7r1qBjwu7GpwvM0UloUDFYNZiCvgKRvzfpATdGR8exdUVXvdlvdoaum3+rA ebU6M3HVya5dXGpwd1ybMG6lCSjC/AV6x8saxuLcuCDIGt2mTqoamnQmby6XqjjlGOW+ ac4h+3toQeBsrslYbWbVsXlFXrwWhjWy9lP352AHo7WeeLC9mxURj6McDjK12S9wltXh RC5Q== X-Forwarded-Encrypted: i=1; AJvYcCVy/BGsKC1tP0ITA/q3scpG78Pd3fJQjIw5k/CkRdg71ZjmKDPmsjKEZByUPDnTYaBm1EndSZw3UbFyStc=@vger.kernel.org X-Gm-Message-State: AOJu0YxLa09VTEXi90bHcKRkrdmVaOZvvEeFcmAFiqBnxz2dz40yPqkO 989taXkGR1++5lRHLg3cDDJinHbSIXJAKnIjc/GarL1MK9izENMnMSNRXnybL+1/Ixc= X-Gm-Gg: ASbGncuMZWCSEnqNLeuGG51tlySrvXLSUf34lSpqStrP+RvzOeU5VXLd19NoHTzH8OI RyjcFxK2S4rCGAaJaAT/RL1B5KT9s2Nf0GApa8RQCGH7BEUpPMVUZwAsnDsKVDSuawUYLU2Rcqx mnZS0HuvIsADHMrK3NeOWQqfUew381vZF4nxZTDp1GobXJ4+X930bWbJcgsyTvHA9gJCngM76Ri ElnjsNO68bfJMneY9eiqbpjIuh/CvBzTFBIHveZ82FWLPCwzHKjTexItI+/FYFyHCXvEUJFMJGu GFXvUkJH/tee6wIvy2lssAvDF3R1sapeEReszcqwj02qvsApy6+DK2JqAU5crLsyjtPpWy7ILSC 5T9RdLgXK3I60EnBmWL2f X-Google-Smtp-Source: AGHT+IG0dsH4ruc0ZVKtNevc7/fyUjoj6Kve3iAa9r38GSH3v9kXrPKHP05YVx3+W/n5pyeS7YOYMA== X-Received: by 2002:a17:90b:55c6:b0:311:df4b:4b82 with SMTP id 98e67ed59e1d1-3124150e360mr4147264a91.4.1748597781411; Fri, 30 May 2025 02:36:21 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.36.06 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:36:21 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, 
adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 31/35] RPAL: add receiver waker Date: Fri, 30 May 2025 17:27:59 +0800 Message-Id: <198278a03d91ab7e0e17d782c657da85cff741bb.1748594841.git.libo.gcs85@bytedance.com> X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" In an RPAL call, the receiver thread is in the TASK_INTERRUPTIBLE state and cannot be awakened, which may lead to missed wakeups. For example, if no kernel event occurs during the entire RPAL call, the receiver thread will remain in the TASK_INTERRUPTIBLE state after the RPAL call completes. To address this issue, RPAL adds a flag to the receiver whenever it encounters an unawakened state and introduces a "waker" work. The waker work runs automatically on every tick to check for receiver threads that have missed wakeups. If any are found, it wakes them up. For epoll, the waker also checks for pending user mode events and wakes the receiver thread if such events exist. 
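[Editor's note: a minimal sketch of the self-rearming delayed-work pattern the description above relies on; the real implementation is in the arch/x86/rpal/service.c hunk below, and all example_* names here are illustrative only. A delayed work item re-queues itself with a one-jiffy delay, so it runs roughly once per tick until it is cancelled.]

#include <linux/workqueue.h>

static struct delayed_work example_waker;	/* illustrative only */

static void example_waker_fn(struct work_struct *work)
{
	/*
	 * Scan the wake list and wake_up_process() any receiver that
	 * missed its wakeup, then re-arm for the next tick.
	 */
	schedule_delayed_work(&example_waker, 1);
}

static void example_waker_start(void)
{
	INIT_DELAYED_WORK(&example_waker, example_waker_fn);
	schedule_delayed_work(&example_waker, 1);
}

static void example_waker_stop(void)
{
	/* Waits for an in-flight run and prevents further re-arming. */
	cancel_delayed_work_sync(&example_waker);
}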
Signed-off-by: Bo Li --- arch/x86/rpal/internal.h | 4 ++ arch/x86/rpal/service.c | 98 ++++++++++++++++++++++++++++++++++++++++ arch/x86/rpal/thread.c | 3 ++ include/linux/rpal.h | 11 +++++ kernel/sched/core.c | 3 ++ 5 files changed, 119 insertions(+) diff --git a/arch/x86/rpal/internal.h b/arch/x86/rpal/internal.h index e03f8a90619d..117357dabdec 100644 --- a/arch/x86/rpal/internal.h +++ b/arch/x86/rpal/internal.h @@ -22,6 +22,10 @@ int rpal_enable_service(unsigned long arg); int rpal_disable_service(void); int rpal_request_service(unsigned long arg); int rpal_release_service(u64 key); +void rpal_insert_wake_list(struct rpal_service *rs, + struct rpal_receiver_data *rrd); +void rpal_remove_wake_list(struct rpal_service *rs, + struct rpal_receiver_data *rrd); =20 /* mm.c */ static inline struct rpal_shared_page * diff --git a/arch/x86/rpal/service.c b/arch/x86/rpal/service.c index 9fd568fa9a29..6fefb7a7729c 100644 --- a/arch/x86/rpal/service.c +++ b/arch/x86/rpal/service.c @@ -143,6 +143,99 @@ static void delete_service(struct rpal_service *rs) spin_unlock_irqrestore(&hash_table_lock, flags); } =20 +void rpal_insert_wake_list(struct rpal_service *rs, + struct rpal_receiver_data *rrd) +{ + unsigned long flags; + struct rpal_waker_struct *waker =3D &rs->waker; + + spin_lock_irqsave(&waker->lock, flags); + list_add_tail(&rrd->wake_list, &waker->wake_head); + spin_unlock_irqrestore(&waker->lock, flags); + pr_debug("rpal debug: [%d] insert wake list\n", current->pid); +} + +void rpal_remove_wake_list(struct rpal_service *rs, + struct rpal_receiver_data *rrd) +{ + unsigned long flags; + struct rpal_waker_struct *waker =3D &rs->waker; + + spin_lock_irqsave(&waker->lock, flags); + list_del(&rrd->wake_list); + spin_unlock_irqrestore(&waker->lock, flags); + pr_debug("rpal debug: [%d] remove wake list\n", current->pid); +} + +/* waker->lock must be hold */ +static inline void rpal_wake_all(struct rpal_waker_struct *waker) +{ + struct task_struct *wake_list[RPAL_MAX_RECEIVER_NUM]; + struct list_head *list; + unsigned long flags; + int i, cnt =3D 0; + + spin_lock_irqsave(&waker->lock, flags); + list_for_each(list, &waker->wake_head) { + struct task_struct *task; + struct rpal_receiver_call_context *rcc; + struct rpal_receiver_data *rrd; + int pending; + + rrd =3D list_entry(list, struct rpal_receiver_data, wake_list); + task =3D rrd->rcd.bp_task; + rcc =3D rrd->rcc; + + pending =3D atomic_read(&rcc->ep_pending) & RPAL_USER_PENDING; + + if (rpal_test_task_thread_flag(task, RPAL_WAKE_BIT) || + (pending && atomic_cmpxchg(&rcc->receiver_state, + RPAL_RECEIVER_STATE_WAIT, + RPAL_RECEIVER_STATE_RUNNING) =3D=3D + RPAL_RECEIVER_STATE_WAIT)) { + wake_list[cnt] =3D task; + cnt++; + } + } + spin_unlock_irqrestore(&waker->lock, flags); + + for (i =3D 0; i < cnt; i++) + wake_up_process(wake_list[i]); +} + +static void rpal_wake_callback(struct work_struct *work) +{ + struct rpal_waker_struct *waker =3D + container_of(work, struct rpal_waker_struct, waker_work.work); + + rpal_wake_all(waker); + /* We check it every ticks */ + schedule_delayed_work(&waker->waker_work, 1); +} + +static void rpal_enable_waker(struct rpal_waker_struct *waker) +{ + INIT_DELAYED_WORK(&waker->waker_work, rpal_wake_callback); + schedule_delayed_work(&waker->waker_work, 1); + pr_debug("rpal debug: [%d] enable waker\n", current->pid); +} + +static void rpal_disable_waker(struct rpal_waker_struct *waker) +{ + unsigned long flags; + struct list_head *p, *n; + + cancel_delayed_work_sync(&waker->waker_work); + rpal_wake_all(waker); + 
spin_lock_irqsave(&waker->lock, flags); + list_for_each_safe(p, n, &waker->wake_head) { + list_del_init(p); + } + INIT_LIST_HEAD(&waker->wake_head); + spin_unlock_irqrestore(&waker->lock, flags); + pr_debug("rpal debug: [%d] disable waker\n", current->pid); +} + static inline unsigned long calculate_base_address(int id) { return RPAL_ADDRESS_SPACE_LOW + RPAL_ADDR_SPACE_SIZE * id; @@ -213,6 +306,10 @@ struct rpal_service *rpal_register_service(void) rs->pku_on =3D PKU_ON_FALSE; rpal_service_pku_init(); #endif + spin_lock_init(&rs->waker.lock); + INIT_LIST_HEAD(&rs->waker.wake_head); + /* receiver may miss wake up if in lazy switch, try to wake it later */ + rpal_enable_waker(&rs->waker); =20 rs->bad_service =3D false; rs->base =3D calculate_base_address(rs->id); @@ -257,6 +354,7 @@ void rpal_unregister_service(struct rpal_service *rs) schedule(); =20 delete_service(rs); + rpal_disable_waker(&rs->waker); =20 pr_debug("rpal: unregister service, id: %d, tgid: %d\n", rs->id, rs->group_leader->tgid); diff --git a/arch/x86/rpal/thread.c b/arch/x86/rpal/thread.c index fcc592baaac0..51c9eec639cb 100644 --- a/arch/x86/rpal/thread.c +++ b/arch/x86/rpal/thread.c @@ -186,6 +186,8 @@ int rpal_register_receiver(unsigned long addr) current->rpal_rd =3D rrd; rpal_set_current_thread_flag(RPAL_RECEIVER_BIT); =20 + rpal_insert_wake_list(cur, rrd); + atomic_inc(&cur->thread_cnt); =20 return 0; @@ -214,6 +216,7 @@ int rpal_unregister_receiver(void) clear_fs_tsk_map(); =20 rpal_put_shared_page(rrd->rsp); + rpal_remove_wake_list(cur, rrd); rpal_clear_current_thread_flag(RPAL_RECEIVER_BIT); rpal_free_thread_pending(&rrd->rcd); kfree(rrd); diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 16a3c80383f7..1d8c1bdc90f2 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -116,6 +116,7 @@ enum rpal_task_flag_bits { RPAL_RECEIVER_BIT, RPAL_CPU_LOCKED_BIT, RPAL_LAZY_SWITCHED_BIT, + RPAL_WAKE_BIT, }; =20 enum rpal_receiver_state { @@ -189,6 +190,12 @@ struct rpal_fsbase_tsk_map { struct task_struct *tsk; }; =20 +struct rpal_waker_struct { + spinlock_t lock; + struct list_head wake_head; + struct delayed_work waker_work; +}; + /* * Each RPAL process (a.k.a RPAL service) should have a pointer to * struct rpal_service in all its tasks' task_struct. 
@@ -255,6 +262,9 @@ struct rpal_service { int pkey; #endif =20 + /* receiver thread waker */ + struct rpal_waker_struct waker; + /* delayed service put work */ struct delayed_work delayed_put_work; =20 @@ -347,6 +357,7 @@ struct rpal_receiver_data { struct fd f; struct hrtimer_sleeper ep_sleeper; wait_queue_entry_t ep_wait; + struct list_head wake_list; }; =20 struct rpal_sender_data { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 486d59bdd3fc..c219ada29d34 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3943,6 +3943,7 @@ static bool rpal_check_state(struct task_struct *p) struct rpal_receiver_call_context *rcc =3D p->rpal_rd->rcc; int state; =20 + rpal_clear_task_thread_flag(p, RPAL_WAKE_BIT); retry: state =3D atomic_read(&rcc->receiver_state) & RPAL_RECEIVER_STATE_MASK; switch (state) { @@ -3957,6 +3958,7 @@ static bool rpal_check_state(struct task_struct *p) case RPAL_RECEIVER_STATE_RUNNING: break; case RPAL_RECEIVER_STATE_CALL: + rpal_set_task_thread_flag(p, RPAL_WAKE_BIT); ret =3D false; break; default: @@ -4522,6 +4524,7 @@ int rpal_try_to_wake_up(struct task_struct *p) =20 BUG_ON(READ_ONCE(p->__state) =3D=3D TASK_RUNNING); =20 + rpal_clear_task_thread_flag(p, RPAL_WAKE_BIT); scoped_guard (raw_spinlock_irqsave, &p->pi_lock) { smp_mb__after_spinlock(); if (!ttwu_state_match(p, TASK_NORMAL, &success)) --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pj1-f49.google.com (mail-pj1-f49.google.com [209.85.216.49]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 31DA6220F30 for ; Fri, 30 May 2025 09:36:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.49 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597799; cv=none; b=DXbQVX78YVyek/s3tqBPq+HgPJKO3l4lJd06hOLEFAyR3uOa1t3ocomBTQIR4KwCf9ugyVAyoCQbBsCtLDFoWFcQmEu8i3F9l08Am9xiML5/NNWj9LQc6vthDco3gNw2NK9c2t8Jasow8N1AR9pI+Q/Zf3BkPsWWWH7FBWrC3sE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597799; c=relaxed/simple; bh=Dtz1Aet7apWTXhLd7VJWKFnNYwGISZ8LO5wYY32y+Hg=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=PNiiZ49Z5SYThRjWy0JqIcYKnVGPru+aJwySUZKq/E7/KGNfmuCrO1Si0xctyX2zLfoLQ0v7JKioPJ/GLGFBeZXQXMLMfqgF9rYRyeTWmVAX6+Bfbf+7k4Vp/FMrEs+lb5S12XpONL2uQijrGSJzopHgdw7889bTCgryXzTuEfw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=Wwt4uAd8; arc=none smtp.client-ip=209.85.216.49 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="Wwt4uAd8" Received: by mail-pj1-f49.google.com with SMTP id 98e67ed59e1d1-311d5fdf1f0so1705323a91.1 for ; Fri, 30 May 2025 02:36:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597796; x=1749202596; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; 
bh=PQ/D1sjhfYHEwDTh2b+Qyte458RPTLSFlphr3fZumJQ=; b=Wwt4uAd8KbjRRSsRvich6T35PFHgOUS+Fouykni2GzUkaA8t/k3UTaCOLNa7A6hdWF x1RPblBdLzzWC1AE4v87JHrM61KF9DVGbUK8DEpp1cUP/DUuDeLidudvWGuW07fFBQvo XsLAUvimfdP6y7CO6hHio+NKe/+P42z+xBNiOHIBXkeee2PwtArHHOplLnjsJ9wNosUg E01vyFZb748GdrTlxKUnbJTyKS8Zn6JhcvT8DxNNlYF0w5Z0r+UmrF7gj63i7iDyBOhz BIAKWOhEjVr7wQQv+6F9Rr1vj4Y9ZuEuEEYS6k3Lpgm8CNT6pPZaykrXHWBUNyJ1Bb5e 7sXA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597796; x=1749202596; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=PQ/D1sjhfYHEwDTh2b+Qyte458RPTLSFlphr3fZumJQ=; b=aAlzonhj2hRf+6SXryDi8t4NsLqj6evbTjGyY4/5UP7sXgxTI9Qu9rxhHCJrGIdrj+ SKNcq5qiSt2+H7iSZ2pbcAKsm3vTHmkTYlzdkipzMGFg8GnAlywDCRBHJddKp3Rz/GtL fc/YO0vVdf8MBesW3a85SlRW4DG0nXywhjQ7u4+pW9Sffk7jP/cQ7p/qfLkUqgnltCZy PysNX8k/7mDDmBgQ7RO7iGHKSNCnHdKwEzbvCuTKG5H0fkk0HqmmlIrbrWBukLT//GWv etYNPwJ4LZFbu3UQIVpvoh7NNNR364fDHpE72RiTsAaQkw0CK2aN6vRnFEcRXxe1JLED PJ2Q== X-Forwarded-Encrypted: i=1; AJvYcCXOY210lrmGfWj5ysAvoaDOZ7MjC6xHx2aEwcMZi9uIKDNp4dSXqeWptqQnBVgbfh1YjL7+b5rf7uK/s6A=@vger.kernel.org X-Gm-Message-State: AOJu0Yz9iiGXS+SnmCVxdpMoQ7oQSr+WBJq02oJwkRBX6W9IiZdyRdCQ xrtrL2su8jH1KG4twNHKYmsYchGTGksF4eII0wwaf73IiWcS21UKqV7PyVqCU9TSOVc= X-Gm-Gg: ASbGnctg+lhn2ZODT0SZvEu4gv+8uBV8nCVlmTK6k472thUyHN3aduFsYTFBt06f4dC RWkGtXrC1O8kZRsUMbzgUIQHWhmS6naLLuRp9TtzwolOSikIzwP1rYPcXk08bY3gb7Tx+2a3ZCx /gpdpm7ymKLvLfZpOJQvz247Ljf5uWwZpfmigyQeU8cMnabzGeQgy/Tkh6FAV2D44psXg/mNp7v GfF1dkf2BjfwTbYEHj8XYh0vXjsZoNkS+Wa4/wXlrMG6nASeBevUKMTH3rZkBm23giVKa7ExYyE 2YiPB9jHDqEJlkfwnRxlZVyzMcRmUuBxgXtSYu9hSBGhHEYBKYe8eewkamaV1Rk6TUniHvsRxWg AQrFDxj8vD1doQZNPsDML X-Google-Smtp-Source: AGHT+IE1J4YJ6wHA6b9CUB/dnKICEd+OlMDbXuOs/bJjQ/NWJ3rv4Cxh7kWsPUyPIoUfIb7mzNdLsw== X-Received: by 2002:a17:90b:4f4d:b0:312:1cd7:b337 with SMTP id 98e67ed59e1d1-3125034a47amr1876977a91.5.1748597796457; Fri, 30 May 2025 02:36:36 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.36.21 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:36:36 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 32/35] RPAL: fix unknown nmi on AMD CPU Date: Fri, 30 May 2025 
17:28:00 +0800 Message-Id: X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" In Lazy switch, the function event_sched_out() will be called. This function deletes the perf event of the task being scheduled out, causing the active_mask in cpu_hw_events to be cleared. In AMD's NMI handler, if the bit corresponding to active_mask is not set, the CPU will not handle the NMI event, ultimately triggering an unknown NMI error. Additionally, event_sched_out() may call amd_pmu_wait_on_overflow(), leading to a busy wait of up to 50us during lazy switch. This patch adds two per_cpu variables. rpal_nmi_handle is set when an NMI occurs. When encountering an unknown NMI, this NMI is skipped. rpal_nmi is set before lazy switch and cleared after lazy switch, preventing the busy wait caused by amd_pmu_wait_on_overflow(). Signed-off-by: Bo Li --- arch/x86/events/amd/core.c | 14 ++++++++++++++ arch/x86/kernel/nmi.c | 20 ++++++++++++++++++++ arch/x86/rpal/core.c | 17 ++++++++++++++++- 3 files changed, 50 insertions(+), 1 deletion(-) diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c index b20661b8621d..633a9ac4e77c 100644 --- a/arch/x86/events/amd/core.c +++ b/arch/x86/events/amd/core.c @@ -719,6 +719,10 @@ static void amd_pmu_wait_on_overflow(int idx) } } =20 +#ifdef CONFIG_RPAL +DEFINE_PER_CPU(bool, rpal_nmi); +#endif + static void amd_pmu_check_overflow(void) { struct cpu_hw_events *cpuc =3D this_cpu_ptr(&cpu_hw_events); @@ -732,6 +736,11 @@ static void amd_pmu_check_overflow(void) if (in_nmi()) return; =20 +#ifdef CONFIG_RPAL + if (this_cpu_read(rpal_nmi)) + return; +#endif + /* * Check each counter for overflow and wait for it to be reset by the * NMI if it has overflowed. This relies on the fact that all active @@ -807,6 +816,11 @@ static void amd_pmu_disable_event(struct perf_event *e= vent) if (in_nmi()) return; =20 +#ifdef CONFIG_RPAL + if (this_cpu_read(rpal_nmi)) + return; +#endif + amd_pmu_wait_on_overflow(event->hw.idx); } =20 diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index be93ec7255bf..dd72b6d1c7f9 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -351,12 +351,23 @@ NOKPROBE_SYMBOL(unknown_nmi_error); =20 static DEFINE_PER_CPU(bool, swallow_nmi); static DEFINE_PER_CPU(unsigned long, last_nmi_rip); +#ifdef CONFIG_RPAL +DEFINE_PER_CPU(bool, rpal_nmi_handle); +#endif =20 static noinstr void default_do_nmi(struct pt_regs *regs) { unsigned char reason =3D 0; int handled; bool b2b =3D false; +#ifdef CONFIG_RPAL + bool rpal_handle =3D false; + + if (__this_cpu_read(rpal_nmi_handle)) { + __this_cpu_write(rpal_nmi_handle, false); + rpal_handle =3D true; + } +#endif =20 /* * Back-to-back NMIs are detected by comparing the RIP of the @@ -471,6 +482,15 @@ static noinstr void default_do_nmi(struct pt_regs *reg= s) */ if (b2b && __this_cpu_read(swallow_nmi)) __this_cpu_add(nmi_stats.swallow, 1); +#ifdef CONFIG_RPAL + /* + * Lazy switch may clear the bit in active_mask, causing + * nmi event not handled. This will lead to unknown nmi, + * try to avoid this. 
+ */ + else if (rpal_handle) + goto out; +#endif else unknown_nmi_error(reason, regs); =20 diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c index 6a22b9faa100..92281b557a6c 100644 --- a/arch/x86/rpal/core.c +++ b/arch/x86/rpal/core.c @@ -376,11 +376,26 @@ rpal_exception_context_switch(struct pt_regs *regs) return next; } =20 +DECLARE_PER_CPU(bool, rpal_nmi_handle); +DECLARE_PER_CPU(bool, rpal_nmi); __visible struct task_struct *rpal_nmi_context_switch(struct pt_regs *regs) { struct task_struct *next; =20 - next =3D rpal_kernel_context_switch(regs); + if (rpal_test_current_thread_flag(RPAL_LAZY_SWITCHED_BIT)) + rpal_update_fsbase(regs); + + next =3D rpal_misidentify(); + if (unlikely(next !=3D NULL)) { + next =3D rpal_fix_critical_section(next, regs); + if (next) { + __this_cpu_write(rpal_nmi_handle, true); + /* avoid wait in amd_pmu_check_overflow */ + __this_cpu_write(rpal_nmi, true); + next =3D rpal_do_kernel_context_switch(next, regs); + __this_cpu_write(rpal_nmi, false); + } + } =20 return next; } --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pg1-f170.google.com (mail-pg1-f170.google.com [209.85.215.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5BC7922A1EF for ; Fri, 30 May 2025 09:36:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.170 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597814; cv=none; b=Fv/hMe8PKkMhNyTI5O00KTZ3zvXyRhR7jLBjpvT5CcyZPmwbZ/H5O6SGKdM0dPOYQsG0TCNQ4ohUlf5G1L6mFbONTUIfRdjoVsuk/XCIkgHztbP7xBMgyeM1PfAKeNdYwkVza8TTA+sxgT6SwTAbPkEOwXZLLEubc5j+2cLJgjg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597814; c=relaxed/simple; bh=vd2C1pogQV46JqBd4R9eC8WzEgBwL6DtuQBJbThvcY0=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Eqhh6pGm2FuiK8BDS+ZNGS8WnbibuWQWMzQrENS458oiLS1kkbqqP6qDGCZjl+PqjIeNqtIE30po1Jitgq8Q7k9i1koVRcr5q3lGeGTrC8HqmcMDxudPGzo1vIcLyh3SizmqltGvdh1Py+HFFupSG0skC/GE/MjCi+6REwvV0zY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=FU4Xbg5J; arc=none smtp.client-ip=209.85.215.170 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="FU4Xbg5J" Received: by mail-pg1-f170.google.com with SMTP id 41be03b00d2f7-b2c4476d381so1642540a12.0 for ; Fri, 30 May 2025 02:36:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597811; x=1749202611; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=9GWxjqZwkmZx5+2pv+QLtRl8XC71PyQccoyY/dIwUoU=; b=FU4Xbg5JkXTsWoP2NVMq4T/VCUnkpyQejsiTEHcl2yure2y1mHWmP11exSYEnJdlhr /FdE7deI2fUaaELz2MqqcsRNahTIaO9I5G1zsgunOCTERvqf58se7Ho9IIgJDjnYAxi2 T0TsibJyUU0LBKhdDjzJXKG7kkB+JQFLDHuReD4ILaEK0GDJSFiA+fa2tB3r8vTZUR/o AGu+W8mrucLYHMnK6b1vbLdNWCa7l3DxnM50Tpq1jlH5jog99k7Vxn3tu36uVfDCA+sa 
HnQw/Oz4L9eF55h0FXhU1zh/9gZRxIYBxVCFyNGqJZmBCq4jRmW/K+4vgzNbw8kuEo5X Jm5g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597811; x=1749202611; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=9GWxjqZwkmZx5+2pv+QLtRl8XC71PyQccoyY/dIwUoU=; b=hNxHvdUNAbghhdVrTAy21aOXHUeQ1xOwluJ92P43wB3Q6BpZc0qmUJ7E4dXdpHPfvy ObguWz9jv0z9rZbEAtYHRPhsfTHNkbY47ViYJTIhxsIw989o2nAXkNMv7/69tilQs478 mB6NUqY8xSF5mJ8NfoGdBNPTyeefWutFc/GMca0iLMqZdyOPXyCllEedfroi6VoVFiek lzSj8hcLOEGIDcgAnxlYMMmkwCcfftuXmRD/YT4wW/KatJsAIKNPI4He0ZwCjtJb1e3U ReXmxSeuaQxast8SQZTJGDPFqS1M4s0zx8O532UUArgRYZvbLhzJwpd0PfQ2SzgJdaqZ 6vfg== X-Forwarded-Encrypted: i=1; AJvYcCUPCK7HgxMc1HXLcUPE0UGAgtJhz7wz05KLB4CcbwJcgA5Me59Xn+k5F/eKcYHAYk7mMUPPG0imn7OhhTQ=@vger.kernel.org X-Gm-Message-State: AOJu0Yw6yn7WZJWf4XNtGpsfVs0zEk+bCdyFVY0Mdh5hvI2erzr1EFIT FC7kwdzZaJx5zydAHnP+O6bBtppTNPAZUQ246lDr8hZBnxtDpslRAT5Q6W6SuyyVtJ0= X-Gm-Gg: ASbGncukCnINC8LyxP+MZ8LnwfwyIO+5s7olYaKLEN4QyPdtowSnfQNlOQrB/Zgr3DN 8GU02p7mVo6RP2d4Qb+SIRWopH+VsMlBE4a4HF4zEfcsPgtf6mN7J5nVbNC4widDBIINBc1ldnf JpR4WjUlMTtCqCXaZXgwxC+orsMCqJadIngsrcqAgbWLg/9HY1wKM1Z1tZxT8aySMSj+cEjkmYe ppnYTYuIqekMJ5Xkh/iTAW/Uv4k+lo/2bx2u3fQ9Tl91HGjv/abBvJdaW0RXSmQSYcBppJ8+MKE NpYcZEr2wx8mEauvOQljWGH/MyzyCjQ4UEwk3hn1GRGdtdAWx9/xm+DhQIJp9D20Vx2WIoZs0tu EOcgTqZSVJA== X-Google-Smtp-Source: AGHT+IEd6eqw3bYsn2JQr9oo7VNa3bHRe5ESaa02VpYg0eRHeS5gGe9cNS3vRZsm3BzNT8I7hOudjA== X-Received: by 2002:a17:90b:5104:b0:302:fc48:4f0a with SMTP id 98e67ed59e1d1-3124446ce79mr4391987a91.0.1748597811484; Fri, 30 May 2025 02:36:51 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.36.36 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:36:51 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 33/35] RPAL: enable time slice correction Date: Fri, 30 May 2025 17:28:01 +0800 Message-Id: <8941a17e12edce00c1cc1c78f4dd3e1bf28e47c0.1748594841.git.libo.gcs85@bytedance.com> X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 
Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" After an RPAL call, the receiver's user mode code executes. However, the kernel incorrectly attributes this CPU time to the sender due to the unchanged kernel context. This results in incorrect runtime statistics. This patch adds a new member total_time to both rpal_sender_call_context and rpal_receiver_call_context. This member tracks how much runtime ( measured in CPU cycles via rdtsc()) has been incorrectly accounted for. The kernel measures total_time at the entry of __schedule() and corrects the delta in the update_rq_clock_task() function. Additionally, since RPAL calls occur in user space, runtime statistics are typically calculated by user space. However, when a lazy switch happens, the kernel takes over. To address this, the patch introduces a start_time member to record when an RPAL call is initiated, enabling the kernel to accurately calculate the runtime that needs correction. Signed-off-by: Bo Li --- arch/x86/rpal/core.c | 8 ++++++++ arch/x86/rpal/thread.c | 6 ++++++ include/linux/rpal.h | 3 +++ include/linux/sched.h | 1 + init/init_task.c | 1 + kernel/fork.c | 1 + kernel/sched/core.c | 42 ++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 62 insertions(+) diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c index 92281b557a6c..2ac5d932f69c 100644 --- a/arch/x86/rpal/core.c +++ b/arch/x86/rpal/core.c @@ -144,6 +144,13 @@ rpal_do_kernel_context_switch(struct task_struct *next= , struct pt_regs *regs) struct task_struct *prev =3D current; =20 if (rpal_test_task_thread_flag(next, RPAL_LAZY_SWITCHED_BIT)) { + struct rpal_receiver_call_context *rcc =3D next->rpal_rd->rcc; + struct rpal_sender_call_context *scc =3D current->rpal_sd->scc; + u64 slice =3D rdtsc_ordered() - scc->start_time; + + rcc->total_time +=3D slice; + scc->total_time +=3D slice; + rpal_resume_ep(next); current->rpal_sd->receiver =3D next; rpal_lock_cpu(current); @@ -169,6 +176,7 @@ rpal_do_kernel_context_switch(struct task_struct *next,= struct pt_regs *regs) rpal_schedule(next); rpal_clear_task_thread_flag(prev, RPAL_LAZY_SWITCHED_BIT); prev->rpal_rd->sender =3D NULL; + next->rpal_sd->scc->start_time =3D rdtsc_ordered(); } if (unlikely(!irqs_disabled())) { local_irq_disable(); diff --git a/arch/x86/rpal/thread.c b/arch/x86/rpal/thread.c index 51c9eec639cb..5cd0be631521 100644 --- a/arch/x86/rpal/thread.c +++ b/arch/x86/rpal/thread.c @@ -99,6 +99,8 @@ int rpal_register_sender(unsigned long addr) rsd->scc =3D (struct rpal_sender_call_context *)(addr - rsp->user_start + rsp->kernel_start); rsd->receiver =3D NULL; + rsd->scc->start_time =3D 0; + rsd->scc->total_time =3D 0; =20 current->rpal_sd =3D rsd; rpal_set_current_thread_flag(RPAL_SENDER_BIT); @@ -182,6 +184,7 @@ int rpal_register_receiver(unsigned long addr) (struct rpal_receiver_call_context *)(addr - rsp->user_start + rsp->kernel_start); rrd->sender =3D NULL; + rrd->rcc->total_time =3D 0; =20 current->rpal_rd =3D rrd; rpal_set_current_thread_flag(RPAL_RECEIVER_BIT); @@ -289,6 +292,9 @@ int rpal_rebuild_sender_context_on_fault(struct pt_regs= *regs, rpal_pkey_to_pkru(rpal_current_service()->pkey), RPAL_PKRU_SET); #endif + if (!rpal_is_correct_address(rpal_current_service(), regs->ip)) + /* receiver has crashed */ + scc->total_time +=3D rdtsc_ordered() - scc->start_time; return 0; } } diff --git a/include/linux/rpal.h b/include/linux/rpal.h index 1d8c1bdc90f2..f5f4da63f28c 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -310,6 +310,7 @@ struct 
rpal_receiver_call_context { void __user *events; int maxevents; int timeout; + int64_t total_time; }; =20 /* recovery point for sender */ @@ -325,6 +326,8 @@ struct rpal_sender_call_context { struct rpal_task_context rtc; struct rpal_error_context ec; int sender_id; + s64 start_time; + s64 total_time; }; =20 /* End */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 5f25cc09fb71..a03113fecdc5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1663,6 +1663,7 @@ struct task_struct { struct rpal_sender_data *rpal_sd; struct rpal_receiver_data *rpal_rd; }; + s64 rpal_steal_time; #endif =20 /* CPU-specific state of this task: */ diff --git a/init/init_task.c b/init/init_task.c index 2eb08b96e66b..3606cf701dfe 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -224,6 +224,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { .rpal_rs =3D NULL, .rpal_flag =3D 0, .rpal_cd =3D NULL, + .rpal_steal_time =3D 0, #endif }; EXPORT_SYMBOL(init_task); diff --git a/kernel/fork.c b/kernel/fork.c index 11cba74d07c8..ff6331a28987 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1222,6 +1222,7 @@ static struct task_struct *dup_task_struct(struct tas= k_struct *orig, int node) tsk->rpal_rs =3D NULL; tsk->rpal_flag =3D 0; tsk->rpal_cd =3D NULL; + tsk->rpal_steal_time =3D 0; #endif return tsk; =20 diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c219ada29d34..d6f8e0d76fc0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -789,6 +789,14 @@ static void update_rq_clock_task(struct rq *rq, s64 de= lta) delta -=3D steal; } #endif +#ifdef CONFIG_RPAL + if (unlikely(current->rpal_steal_time !=3D 0)) { + delta +=3D current->rpal_steal_time; + if (unlikely(delta < 0)) + delta =3D 0; + current->rpal_steal_time =3D 0; + } +#endif =20 rq->clock_task +=3D delta; =20 @@ -6872,6 +6880,36 @@ static bool try_to_block_task(struct rq *rq, struct = task_struct *p, return true; } =20 +#ifdef CONFIG_RPAL +static void rpal_acct_runtime(void) +{ + if (rpal_current_service()) { + if (rpal_test_task_thread_flag(current, RPAL_SENDER_BIT) && + current->rpal_sd->scc->total_time !=3D 0) { + struct rpal_sender_call_context *scc =3D + current->rpal_sd->scc; + + u64 slice =3D + native_sched_clock_from_tsc(scc->total_time) - + native_sched_clock_from_tsc(0); + current->rpal_steal_time -=3D slice; + scc->total_time =3D 0; + } else if (rpal_test_task_thread_flag(current, + RPAL_RECEIVER_BIT) && + current->rpal_rd->rcc->total_time !=3D 0) { + struct rpal_receiver_call_context *rcc =3D + current->rpal_rd->rcc; + + u64 slice =3D + native_sched_clock_from_tsc(rcc->total_time) - + native_sched_clock_from_tsc(0); + current->rpal_steal_time +=3D slice; + rcc->total_time =3D 0; + } + } +} +#endif + /* * __schedule() is the main scheduler function. 
* @@ -6926,6 +6964,10 @@ static void __sched notrace __schedule(int sched_mod= e) struct rq *rq; int cpu; =20 +#ifdef CONFIG_RPAL + rpal_acct_runtime(); +#endif + trace_sched_entry_tp(preempt, CALLER_ADDR0); =20 cpu =3D smp_processor_id(); --=20 2.20.1 From nobody Wed Feb 11 03:41:56 2026 Received: from mail-pj1-f52.google.com (mail-pj1-f52.google.com [209.85.216.52]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 89982220F5F for ; Fri, 30 May 2025 09:37:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.52 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597829; cv=none; b=hB67eKZgTkweJjyjtonfxgytjRD7s9oVPGr1Ffz2vEpjZgsM3c7crCl0aFwDbYX14PzN/xde4VOXOTT7ScuGJuILAPVr/Z8xdsFegVXw1pSTbLjGGfNK4CBJH+lUqdCiWN6p3etY1eVkKJxphNs1VGUlvRPqKvV1GuA2cVPJY/E= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748597829; c=relaxed/simple; bh=OO8xU47yO/bdoL2deRtc+JSUSQ1CGyj22+/NieuKsMM=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=bVPZQ/nQpif/uKs1qu8pexlQ8+1hEIX8bz8pBqWROsoq291OWPc80giT6NrENawpsBTdfrN8Pxcdy0bOVOhT6nNkynp5h6CaS+HBmROL18uuOsc3tcAIQ0ny98B946MMpawIAFlJBzcaVU5eUqyRm9uJn0b6cq7n2ivPYvaoA2Y= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=bQw0UddR; arc=none smtp.client-ip=209.85.216.52 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="bQw0UddR" Received: by mail-pj1-f52.google.com with SMTP id 98e67ed59e1d1-311d5fdf1f0so1705684a91.1 for ; Fri, 30 May 2025 02:37:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1748597827; x=1749202627; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=Ff5bIYNqY4yUnollIgoWVZucxS5phnK+GSArqqp8njA=; b=bQw0UddRSd0zodJFoDpQPUYtbH1/vf+TrFM/oYUAFXfqB280B0OrDjvwoGgX/58GYZ j4hvdO+XitJo8fkmEr2ugoXnozUApYaq5qz1miA5ra+AYe749tpVZer6npWBXRkke9rl 4J9pA24qUUSFt3yjLUYeKiqaXphcZbGmGr4E2mI/FIppeSvb1doRHrPpaGvzlfwxVuko CH3ZTO94pN8EXwoy2vai6ATQnh+0Ijh8Vsj3fLcmjpHubwMPj+kp8l1Xp8olIg2p8bxo JmRg7UYZYAOyDt7wqRwOKkr9jUx6FV0s/w2/PQAYB/BKmaJbzssmmI2i9aOBaStyjc9s tz5w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748597827; x=1749202627; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Ff5bIYNqY4yUnollIgoWVZucxS5phnK+GSArqqp8njA=; b=h4jikp4J+Mij4bzN0xuFGnQBI1U3pLg1U8Sfjpmxd4qBF2wcnqdEyKEUX3ZqttCJ+r qJqWoHe/fPo/bhfDnjVVw093ZJ78Y3hePh7y3BRQGRWcT6+fJfjd4WHH6cAieDOY2qhs kI+M1BFudiEyH9GAzFAuX0AOfLZbt4GmSjvZxyCRd/nfqb73mNd9hsFnfMC2sfDd+11u WP7yz/aLPkKv1/6cWkFGAryNLeMzpy43mxk1WULr0SVGYhKr94TwmNHo838jCsmF8/Lv 3Ih6DCkp285VeMIm5QxDNiNZsWdVJK5vc0oC2/rbcDB2EFLwpGEi0kmSmH7ywE88kWz9 UKiQ== X-Forwarded-Encrypted: i=1; 
AJvYcCWoB6SejjtbtoPXZCSHBlQ537lofXUeg6w82B+kKjpSlp1eechuyNXrt/5yJhzf7JqFxbNNZo+khKL1eSI=@vger.kernel.org X-Gm-Message-State: AOJu0Yx5nSTY+jFtSKtJRkix7lkU31BgdlTaSDk8WtyZW4rDhXOQsMho Y1iwL/qN3U2p+oc/FUUFOsh5pkHT7+QjOF1UwkjJITwcWKkmF/MzdANyu+w7V6qGLOA= X-Gm-Gg: ASbGnctaQ16oO33ikc/wT2yAzcCs7OvKNbKZlWba0kxsa5BqK3QGvLQo3v69DjKWUfI Nlw0P406OY2RFqzQhOrxA1ZocYrvqCQaJzY7Jilw0dMit9Gv/HpDyFT184wNxIzQbqBNmhW2nOw HRm5cOSkopIkKR27iIWBgAOZ4FHDi72n9ojxu4WPIKXQ+R6G0U81R3kh/z60ZmUb11BnJVIF6u1 rP02ddueL/HhJ69V4jb+Nj8bXlOa/b+Szf6H//BSiOd6gw54jT3bahFo1/k0Bh2A/72VkHGFCUp vV9ezFB2vQ6Jsqq2zrYQeuYksmitvnpeHip1fpEwug6oOKoBYWs9kE3dczlu0tcKUS1wNeRJsGy hItd2K9VuSQ== X-Google-Smtp-Source: AGHT+IGPAReH+gj/mAHjw8Fc+v5rLzKHU0DKOTk3FhYthS79DtKAel685Vp30oKSxbQ9599VbyBn4A== X-Received: by 2002:a17:90b:4d:b0:311:be51:bdec with SMTP id 98e67ed59e1d1-3125036326fmr2501710a91.11.1748597826688; Fri, 30 May 2025 02:37:06 -0700 (PDT) Received: from FQ627FTG20.bytedance.net ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-3124e29f7b8sm838724a91.2.2025.05.30.02.36.51 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 30 May 2025 02:37:06 -0700 (PDT) From: Bo Li To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li Subject: [RFC v2 34/35] RPAL: enable fast epoll wait Date: Fri, 30 May 2025 17:28:02 +0800 Message-Id: X-Mailer: git-send-email 2.39.5 (Apple Git-154) In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" When a kernel event occurs during an RPAL call and triggers a lazy switch, the kernel context switches from the sender to the receiver. When the receiver later returns from user space to the sender, a second lazy switch is required to switch the kernel context back to the sender. In the current implementation, after the second lazy switch, the receiver returns to user space via rpal_kernel_ret() and then calls epoll_wait() from user space to re-enter the kernel. This causes the receiver to be unable to process epoll events for a long period, degrading performance. This patch introduces a fast epoll wait feature. 
During the second lazy switch, the kernel configures epoll-related data structures so that the receiver can directly enter the epoll wait state without first returning to user space and then calling epoll_wait(). The patch adds a new state RPAL_RECEIVER_STATE_READY_LS, which marks that the receiver can transition to RPAL_RECEIVER_STATE_WAIT during the second lazy switch. The kernel then performs this state transition in rpal_lazy_switch_tail().
Signed-off-by: Bo Li --- arch/x86/rpal/core.c | 29 ++++++++++++- fs/eventpoll.c | 101 +++++++++++++++++++++++++++++++++++++++++++ include/linux/rpal.h | 3 ++ kernel/sched/core.c | 13 +++++- 4 files changed, 143 insertions(+), 3 deletions(-)
diff --git a/arch/x86/rpal/core.c b/arch/x86/rpal/core.c index 2ac5d932f69c..7b6efde23e48 100644 --- a/arch/x86/rpal/core.c +++ b/arch/x86/rpal/core.c @@ -51,7 +51,25 @@ void rpal_lazy_switch_tail(struct task_struct *tsk) atomic_cmpxchg(&rcc->receiver_state, rpal_build_call_state(tsk->rpal_sd), RPAL_RECEIVER_STATE_LAZY_SWITCH); } else { + /* tsk is receiver */ + int state; + + rcc =3D tsk->rpal_rd->rcc; + state =3D atomic_read(&rcc->receiver_state); + /* the receiver may be scheduled on another CPU after unlock. */ rpal_unlock_cpu(tsk); + /* + * We must not use RPAL_RECEIVER_STATE_READY instead of + * RPAL_RECEIVER_STATE_READY_LS. The receiver may be in the + * TASK_RUNNING state and call epoll_wait() again, in which + * case the state may already be RPAL_RECEIVER_STATE_READY; + * we must not change that state to RPAL_RECEIVER_STATE_WAIT, + * since it was set by another RPAL call. + */ + if (state =3D=3D RPAL_RECEIVER_STATE_READY_LS) + atomic_cmpxchg(&rcc->receiver_state, + RPAL_RECEIVER_STATE_READY_LS, + RPAL_RECEIVER_STATE_WAIT); rpal_unlock_cpu(current); } } @@ -63,8 +81,14 @@ void rpal_kernel_ret(struct pt_regs *regs) int state; =20 if (rpal_test_current_thread_flag(RPAL_RECEIVER_BIT)) { - rcc =3D current->rpal_rd->rcc; - regs->ax =3D rpal_try_send_events(current->rpal_rd->ep, rcc); + struct rpal_receiver_data *rrd =3D current->rpal_rd; + + rcc =3D rrd->rcc; + if (rcc->timeout > 0) + hrtimer_cancel(&rrd->ep_sleeper.timer); + rpal_remove_ep_wait_list(rrd); + regs->ax =3D rpal_try_send_events(rrd->ep, rcc); + fdput(rrd->f); atomic_xchg(&rcc->receiver_state, RPAL_RECEIVER_STATE_KERNEL_RET); } else { tsk =3D current->rpal_sd->receiver; @@ -173,6 +197,7 @@ rpal_do_kernel_context_switch(struct task_struct *next,= struct pt_regs *regs) * Otherwise, sender's user context will be corrupted.
*/ rebuild_receiver_stack(current->rpal_rd, regs); + rpal_fast_ep_poll(current->rpal_rd, regs); rpal_schedule(next); rpal_clear_task_thread_flag(prev, RPAL_LAZY_SWITCHED_BIT); prev->rpal_rd->sender =3D NULL; diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 791321639561..b70c1cd82335 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -2143,6 +2143,107 @@ static int ep_poll(struct eventpoll *ep, struct epo= ll_event __user *events, } =20 #ifdef CONFIG_RPAL +static void *rpal_get_eventpoll(struct rpal_receiver_data *rrd, struct pt_= regs *regs) +{ + struct rpal_receiver_call_context *rcc =3D rrd->rcc; + int epfd =3D rcc->epfd; + struct epoll_event __user *events =3D rcc->events; + int maxevents =3D rcc->maxevents; + struct file *file; + + if (maxevents <=3D 0 || maxevents > EP_MAX_EVENTS) { + regs->ax =3D -EINVAL; + return NULL; + } + + if (!access_ok(events, maxevents * sizeof(struct epoll_event))) { + regs->ax =3D -EFAULT; + return NULL; + } + + rrd->f =3D fdget(epfd); + file =3D fd_file(rrd->f); + if (!file) { + regs->ax =3D -EBADF; + return NULL; + } + + if (!is_file_epoll(file)) { + regs->ax =3D -EINVAL; + fdput(rrd->f); + return NULL; + } + + rrd->ep =3D file->private_data; + return rrd->ep; +} + +void rpal_fast_ep_poll(struct rpal_receiver_data *rrd, struct pt_regs *reg= s) +{ + struct eventpoll *ep; + struct rpal_receiver_call_context *rcc =3D rrd->rcc; + ktime_t ts =3D 0; + struct hrtimer *ht =3D &rrd->ep_sleeper.timer; + int state; + int avail; + + regs->orig_ax =3D __NR_epoll_wait; + ep =3D rpal_get_eventpoll(rrd, regs); + + if (!ep || signal_pending(current) || + unlikely(ep_events_available(ep)) || + atomic_read(&rcc->ep_pending) || unlikely(rcc->timeout =3D=3D 0)) { + INIT_LIST_HEAD(&rrd->ep_wait.entry); + } else { + /* + * Here we use RPAL_RECEIVER_STATE_READY_LS to avoid conflict with + * RPAL_RECEIVER_STATE_READY. As the RPAL_RECEIVER_STATE_READY_LS + * is convert to RPAL_RECEIVER_STATE_WAIT in rpal_lazy_switch_tail(), + * it is possible the receiver is woken at that time. Thus, + * rpal_lazy_switch_tail() should figure out whether the receiver + * state is set by lazy switch or not. See rpal_lazy_switch_tail() + * for details. 
+ */ + state =3D atomic_xchg(&rcc->receiver_state, RPAL_RECEIVER_STATE_READY_LS= ); + if (unlikely(state !=3D RPAL_RECEIVER_STATE_LAZY_SWITCH)) + rpal_err("%s: unexpected state: %d\n", __func__, state); + init_waitqueue_func_entry(&rrd->ep_wait, rpal_ep_autoremove_wake_functio= n); + rrd->ep_wait.private =3D rrd; + INIT_LIST_HEAD(&rrd->ep_wait.entry); + write_lock(&ep->lock); + set_current_state(TASK_INTERRUPTIBLE); + avail =3D ep_events_available(ep); + if (!avail) + __add_wait_queue_exclusive(&ep->wq, &rrd->ep_wait); + write_unlock(&ep->lock); + if (avail) { + /* keep state consistent when we enter rpal_kernel_ret() */ + atomic_set(&rcc->receiver_state, + RPAL_RECEIVER_STATE_LAZY_SWITCH); + set_current_state(TASK_RUNNING); + return; + } + + if (rcc->timeout > 0) { + rrd->ep_sleeper.task =3D rrd->rcd.bp_task; + ts =3D ms_to_ktime(rcc->timeout); + hrtimer_start(ht, ts, HRTIMER_MODE_REL); + } + } +} + +void rpal_remove_ep_wait_list(struct rpal_receiver_data *rrd) +{ + struct eventpoll *ep =3D (struct eventpoll *)rrd->ep; + wait_queue_entry_t *wait =3D &rrd->ep_wait; + + if (!list_empty_careful(&wait->entry)) { + write_lock_irq(&ep->lock); + __remove_wait_queue(&ep->wq, wait); + write_unlock_irq(&ep->lock); + } +} + void *rpal_get_epitemep(wait_queue_entry_t *wait) { struct epitem *epi =3D ep_item_from_wait(wait);
diff --git a/include/linux/rpal.h b/include/linux/rpal.h index f5f4da63f28c..676113f0ba1f 100644 --- a/include/linux/rpal.h +++ b/include/linux/rpal.h @@ -126,6 +126,7 @@ enum rpal_receiver_state { RPAL_RECEIVER_STATE_WAIT, RPAL_RECEIVER_STATE_CALL, RPAL_RECEIVER_STATE_LAZY_SWITCH, + RPAL_RECEIVER_STATE_READY_LS, RPAL_RECEIVER_STATE_MAX, }; =20 @@ -627,4 +628,6 @@ void rpal_resume_ep(struct task_struct *tsk); void *rpal_get_epitemep(wait_queue_entry_t *wait); int rpal_get_epitemfd(wait_queue_entry_t *wait); int rpal_try_send_events(void *ep, struct rpal_receiver_call_context *rcc); +void rpal_remove_ep_wait_list(struct rpal_receiver_data *rrd); +void rpal_fast_ep_poll(struct rpal_receiver_data *rrd, struct pt_regs *reg= s); #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d6f8e0d76fc0..1728b04d1387 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3965,6 +3965,11 @@ static bool rpal_check_state(struct task_struct *p) case RPAL_RECEIVER_STATE_LAZY_SWITCH: case RPAL_RECEIVER_STATE_RUNNING: break; + /* + * Allowing a wakeup in RPAL_RECEIVER_STATE_READY_LS would cause irqs + * to be enabled in rpal_unlock_cpu(), so handle it like + * RPAL_RECEIVER_STATE_CALL. + */ + case RPAL_RECEIVER_STATE_READY_LS: case RPAL_RECEIVER_STATE_CALL: rpal_set_task_thread_flag(p, RPAL_WAKE_BIT); ret =3D false; @@ -11403,7 +11408,13 @@ void __sched notrace rpal_schedule(struct task_str= uct *next) =20 prev_state =3D READ_ONCE(prev->__state); if (prev_state) { - try_to_block_task(rq, prev, &prev_state); + if (!try_to_block_task(rq, prev, &prev_state)) { + /* + * As the task enters the TASK_RUNNING state, we should clean up + * the RPAL_RECEIVER_STATE_READY_LS status.
+ */ + rpal_check_ready_state(prev, RPAL_RECEIVER_STATE_READY_LS); + } switch_count =3D &prev->nvcsw; } =20 --=20 2.20.1
From nobody Wed Feb 11 03:41:56 2026
From: Bo Li
To: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, luto@kernel.org, kees@kernel.org, akpm@linux-foundation.org, david@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, peterz@infradead.org
Cc: dietmar.eggemann@arm.com, hpa@zytor.com, acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, kan.liang@linux.intel.com, viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, jannh@google.com, pfalcato@suse.de, riel@surriel.com, harry.yoo@oracle.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, yinhongbo@bytedance.com, dengliang.1214@bytedance.com, xieyongji@bytedance.com, chaiwen.cc@bytedance.com, songmuchun@bytedance.com, yuanzhu@bytedance.com, chengguozhu@bytedance.com, sunjiadong.lff@bytedance.com, Bo Li
Subject: [RFC v2 35/35] samples/rpal: add RPAL samples
Date: Fri, 30 May 2025 17:28:03 +0800
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Added test samples for RPAL (with librpal included).
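As a quick sanity check on the example output shown below, each "Average latency" value is the total TSC cycle count divided by the message count. The short helper below is illustrative only (it is not part of the sample code); it reproduces the two averages, which differ by roughly 7.5x in favor of the RPAL path.

#include <stdio.h>

int main(void)
{
	unsigned long long msgs = 1000000ULL;
	unsigned long long epoll_total = 16439927066ULL; /* from the EPOLL line below */
	unsigned long long rpal_total = 2197479484ULL;   /* from the RPAL line below */

	/* Average latency = total TSC cycles / message count. */
	printf("EPOLL: %llu cycles per message\n", epoll_total / msgs); /* 16439 */
	printf("RPAL:  %llu cycles per message\n", rpal_total / msgs);  /* 2197 */
	return 0;
}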
Compile via: cd samples/rpal && make And run it using the following command: ./server & ./client Example output: EPOLL: Message length: 32 bytes, Total TSC cycles: 16439927066, Message count: 1000000, Average latency: 16439 cycles RPAL: Message length: 32 bytes, Total TSC cycles: 2197479484, Message count: 1000000, Average latency: 2197 cycles Signed-off-by: Bo Li --- samples/rpal/Makefile | 17 + samples/rpal/asm_define.c | 14 + samples/rpal/client.c | 178 ++ samples/rpal/librpal/asm_define.h | 6 + samples/rpal/librpal/asm_x86_64_rpal_call.S | 57 + samples/rpal/librpal/debug.h | 12 + samples/rpal/librpal/fiber.c | 119 + samples/rpal/librpal/fiber.h | 64 + .../rpal/librpal/jump_x86_64_sysv_elf_gas.S | 81 + .../rpal/librpal/make_x86_64_sysv_elf_gas.S | 82 + .../rpal/librpal/ontop_x86_64_sysv_elf_gas.S | 84 + samples/rpal/librpal/private.h | 341 +++ samples/rpal/librpal/rpal.c | 2351 +++++++++++++++++ samples/rpal/librpal/rpal.h | 149 ++ samples/rpal/librpal/rpal_pkru.h | 78 + samples/rpal/librpal/rpal_queue.c | 239 ++ samples/rpal/librpal/rpal_queue.h | 55 + samples/rpal/librpal/rpal_x86_64_call_ret.S | 45 + samples/rpal/offset.sh | 5 + samples/rpal/server.c | 249 ++ 20 files changed, 4226 insertions(+) create mode 100644 samples/rpal/Makefile create mode 100644 samples/rpal/asm_define.c create mode 100644 samples/rpal/client.c create mode 100644 samples/rpal/librpal/asm_define.h create mode 100644 samples/rpal/librpal/asm_x86_64_rpal_call.S create mode 100644 samples/rpal/librpal/debug.h create mode 100644 samples/rpal/librpal/fiber.c create mode 100644 samples/rpal/librpal/fiber.h create mode 100644 samples/rpal/librpal/jump_x86_64_sysv_elf_gas.S create mode 100644 samples/rpal/librpal/make_x86_64_sysv_elf_gas.S create mode 100644 samples/rpal/librpal/ontop_x86_64_sysv_elf_gas.S create mode 100644 samples/rpal/librpal/private.h create mode 100644 samples/rpal/librpal/rpal.c create mode 100644 samples/rpal/librpal/rpal.h create mode 100644 samples/rpal/librpal/rpal_pkru.h create mode 100644 samples/rpal/librpal/rpal_queue.c create mode 100644 samples/rpal/librpal/rpal_queue.h create mode 100644 samples/rpal/librpal/rpal_x86_64_call_ret.S create mode 100755 samples/rpal/offset.sh create mode 100644 samples/rpal/server.c diff --git a/samples/rpal/Makefile b/samples/rpal/Makefile new file mode 100644 index 000000000000..25627a970028 --- /dev/null +++ b/samples/rpal/Makefile @@ -0,0 +1,17 @@ +.PHONY: rpal + +all: server client offset + +offset: asm_define.c + $(shell ./offset.sh) + +server: server.c librpal/*.c librpal/*.S + $(CC) $^ -lpthread -g -o $@ + @printf "RPAL" | dd of=3D./server bs=3D1 count=3D4 conv=3Dnotrunc seek=3D= 12 + +client: client.c librpal/*.c librpal/*.S + $(CC) $^ -lpthread -g -o $@ + @printf "RPAL" | dd of=3D./client bs=3D1 count=3D4 conv=3Dnotrunc seek=3D= 12 + +clean: + rm server client diff --git a/samples/rpal/asm_define.c b/samples/rpal/asm_define.c new file mode 100644 index 000000000000..6f7731ebc870 --- /dev/null +++ b/samples/rpal/asm_define.c @@ -0,0 +1,14 @@ +#include +#include "librpal/private.h" + +#define DEFINE(sym, val) asm volatile("\n-> " #sym " %0 " #val "\n" : : "i= " (val)) + +static void common(void) +{ + DEFINE(RCI_SENDER_TLS_BASE, offsetof(rpal_call_info_t, sender_tls_base= )); + DEFINE(RCI_SENDER_FCTX, offsetof(rpal_call_info_t, sender_fctx)); + DEFINE(RCI_PKRU, offsetof(rpal_call_info_t, pkru)); + DEFINE(RC_SENDER_STATE, offsetof(receiver_context_t, sender_state)); + DEFINE(RET_BEGIN, offsetof(critical_section_t, ret_begin)); + DEFINE(RET_END, 
offsetof(critical_section_t, ret_end)); +} diff --git a/samples/rpal/client.c b/samples/rpal/client.c new file mode 100644 index 000000000000..2c4a9eb6115e --- /dev/null +++ b/samples/rpal/client.c @@ -0,0 +1,178 @@ +#include +#include +#include +#include +#include +#include +#include +#include "librpal/rpal.h" + +#define SOCKET_PATH "/tmp/rpal_socket" +#define BUFFER_SIZE 1025 +#define MSG_NUM 1000000 +#define MSG_LEN 32 + +char hello[BUFFER_SIZE]; +char buffer[BUFFER_SIZE] =3D { 0 }; + +int remote_id; +uint64_t remote_sidfd; + +#define INIT_MSG "INIT" +#define SUCC_MSG "SUCC" +#define FAIL_MSG "FAIL" + +#define handle_error(s) = \ + do { \ + perror(s); \ + exit(EXIT_FAILURE); \ + } while (0) + +int rpal_epoll_add(int epfd, int fd) +{ + struct epoll_event ev; + + ev.events =3D EPOLLRPALIN | EPOLLIN | EPOLLRDHUP | EPOLLET; + ev.data.fd =3D fd; + + return rpal_epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev); +} + +void rpal_client_init(int fd) +{ + struct epoll_event ev; + char buffer[BUFFER_SIZE]; + rpal_error_code_t err; + uint64_t remote_key, service_key; + int epoll_fd; + int proc_fd; + int ret; + + proc_fd =3D rpal_init(1, 0, &err); + if (proc_fd < 0) + handle_error("rpal init fail"); + rpal_get_service_key(&service_key); + + strcpy(buffer, INIT_MSG); + *(uint64_t *)(buffer + strlen(INIT_MSG)) =3D service_key; + ret =3D write(fd, buffer, strlen(INIT_MSG) + sizeof(uint64_t)); + if (ret < 0) + handle_error("write key"); + + ret =3D read(fd, buffer, BUFFER_SIZE); + if (ret < 0) + handle_error("read key"); + + memcpy(&remote_key, buffer, sizeof(remote_key)); + if (remote_key =3D=3D 0) + handle_error("remote down"); + + ret =3D rpal_request_service(remote_key); + if (ret) { + write(fd, FAIL_MSG, strlen(FAIL_MSG)); + handle_error("request"); + } + + ret =3D write(fd, SUCC_MSG, strlen(SUCC_MSG)); + if (ret < 0) + handle_error("handshake"); + + remote_id =3D rpal_get_request_service_id(remote_key); + rpal_sender_init(&err); + + epoll_fd =3D epoll_create(1024); + if (epoll_fd =3D=3D -1) { + perror("epoll_create"); + exit(EXIT_FAILURE); + } + rpal_epoll_add(epoll_fd, fd); + + sleep(3); //wait for epoll wait + ret =3D rpal_uds_fdmap(((unsigned long)remote_id << 32) | fd, + &remote_sidfd); + if (ret < 0) + handle_error("uds fdmap fail"); +} + +int run_rpal_client(int msg_len) +{ + ssize_t valread; + int sock =3D 0; + struct sockaddr_un serv_addr; + int count =3D MSG_NUM; + int ret; + + if ((sock =3D socket(AF_UNIX, SOCK_STREAM, 0)) < 0) { + perror("socket creation error"); + return -1; + } + + memset(&serv_addr, 0, sizeof(serv_addr)); + serv_addr.sun_family =3D AF_UNIX; + strncpy(serv_addr.sun_path, SOCKET_PATH, sizeof(SOCKET_PATH)); + + if (connect(sock, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < + 0) { + perror("Connection Failed"); + return -1; + } + rpal_client_init(sock);=09 + + while (count) { + for (int i =3D 18; i < msg_len; i++) + hello[i] =3D 'a' + i % 26; + sprintf(hello, "0x%016lx", __rdtsc()); + ret =3D rpal_write_ptrs(remote_id, remote_sidfd, (int64_t *)hello, + msg_len / sizeof(int64_t *)); + valread =3D read(sock, buffer, BUFFER_SIZE); + if (memcmp(hello, buffer, msg_len) !=3D 0) + perror("data error"); + count--; + } + + close(sock); +} + +int run_client(int msg_len) +{ + ssize_t valread; + int sock =3D 0; + struct sockaddr_un serv_addr; + int count =3D MSG_NUM; + + if ((sock =3D socket(AF_UNIX, SOCK_STREAM, 0)) < 0) { + perror("socket creation error"); + return -1; + } + + memset(&serv_addr, 0, sizeof(serv_addr)); + serv_addr.sun_family =3D AF_UNIX; + strncpy(serv_addr.sun_path, 
SOCKET_PATH, sizeof(SOCKET_PATH)); + + if (connect(sock, (struct sockaddr *)&serv_addr, sizeof(serv_addr)) < + 0) { + perror("Connection Failed"); + return -1; + } + + while (count) { + for (int i =3D 18; i < msg_len; i++) + hello[i] =3D 'a' + i % 26; + sprintf(hello, "0x%016lx", __rdtsc()); + send(sock, hello, msg_len, 0); + valread =3D read(sock, buffer, BUFFER_SIZE); + if (memcmp(hello, buffer, msg_len) !=3D 0) + perror("data error"); + count--; + } + + close(sock); +} + +int main() +{ + run_client(MSG_LEN); + run_rpal_client(MSG_LEN); + + return 0; +} diff --git a/samples/rpal/librpal/asm_define.h b/samples/rpal/librpal/asm_d= efine.h new file mode 100644 index 000000000000..bc57586cda58 --- /dev/null +++ b/samples/rpal/librpal/asm_define.h @@ -0,0 +1,6 @@ +#define RCI_SENDER_TLS_BASE 0 +#define RCI_SENDER_FCTX 16 +#define RCI_PKRU 8 +#define RC_SENDER_STATE 72 +#define RET_BEGIN 0 +#define RET_END 8 diff --git a/samples/rpal/librpal/asm_x86_64_rpal_call.S b/samples/rpal/lib= rpal/asm_x86_64_rpal_call.S new file mode 100644 index 000000000000..538e8ac5f09b --- /dev/null +++ b/samples/rpal/librpal/asm_x86_64_rpal_call.S @@ -0,0 +1,57 @@ +#ifdef __x86_64__ +#define __ASSEMBLY__ +#include "asm_define.h" + +.text +.globl rpal_access_warpper +.type rpal_access_warpper,@function +.align 16 + +rpal_access_warpper: + pushq %r12 + pushq %r13 + pushq %r14 + pushq %r15 + pushq %rbx + pushq %rbp + + leaq -0x8(%rsp), %rsp + stmxcsr (%rsp) + fnstcw 0x4(%rsp) + + pushq %rsp // Save rsp which may be unaligned. + pushq (%rsp) // Save the original value again + andq $-16, %rsp // Align stack to 16bytes - SysV AMD64 ABI. + + movq %rsp, (%rdi) + call rpal_access@plt +retip: + movq 8(%rsp), %rsp // Restore the potentially unaligned stack + ldmxcsr (%rsp) + fldcw 0x4(%rsp) + leaq 0x8(%rsp), %rsp + + popq %rbp + popq %rbx + popq %r15 + popq %r14 + popq %r13 + popq %r12 + ret + +.size rpal_access_warpper,.-rpal_access_warpper + + + +.globl rpal_get_ret_rip +.type rpal_get_ret_rip, @function +.align 16 +rpal_get_ret_rip: + leaq retip(%rip), %rax + ret + +.size rpal_get_ret_rip,.-rpal_get_ret_rip + +/* Mark that we don't need executable stack. 
*/ +.section .note.GNU-stack,"",%progbits +#endif diff --git a/samples/rpal/librpal/debug.h b/samples/rpal/librpal/debug.h new file mode 100644 index 000000000000..10d2fef8d69a --- /dev/null +++ b/samples/rpal/librpal/debug.h @@ -0,0 +1,12 @@ +#ifndef RPAL_DEBUG_H +#define RPAL_DEBUG_H + +typedef enum { + RPAL_DEBUG_MANAGEMENT =3D (1 << 0), + RPAL_DEBUG_SENDER =3D (1 << 1), + RPAL_DEBUG_RECVER =3D (1 << 2), + RPAL_DEBUG_FIBER =3D (1 << 3), + + __RPAL_DEBUG_ALL =3D ~(0ULL), +} rpal_debug_flag_t; +#endif diff --git a/samples/rpal/librpal/fiber.c b/samples/rpal/librpal/fiber.c new file mode 100644 index 000000000000..2141ad9ab770 --- /dev/null +++ b/samples/rpal/librpal/fiber.c @@ -0,0 +1,119 @@ +#ifdef __x86_64__ +#include "debug.h" +#include "fiber.h" +#include "private.h" +#include +#include +#include +#include + +#define RPAL_CHECK_FAIL -1 +#define STACK_DEBUG 1 + +static task_t *make_fiber_ctx(task_t *fc) +{ + fc->fctx =3D make_fcontext(fc->sp, 0, NULL); + return fc; +} + +static task_t *fiber_ctx_create(void (*fn)(void *ud), void *ud, void *stac= k, + size_t size) +{ + task_t *fc; + int i; + + if (stack =3D=3D NULL) + return NULL; + + fc =3D (task_t *)stack; + fc->fn =3D fn; + fc->ud =3D ud; + fc->size =3D size; + fc->sp =3D stack + size; + for (i =3D 0; i < NR_PADDING; ++i) { + fc->padding[i] =3D 0xdeadbeef; + } + + return make_fiber_ctx(fc); +} + +task_t *fiber_ctx_alloc(void (*fn)(void *ud), void *ud, size_t size) +{ + void *stack; + size_t stack_size; + size_t total_size; + void *lower_guard; + void *upper_guard; + + if (PAGE_SIZE =3D=3D 4096 || STACK_DEBUG) { + stack_size =3D (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1); + + dbprint(RPAL_DEBUG_FIBER, + "fiber_ctx_alloc: stack size adjusted from %lu to %lu\n", + size, stack_size); + + // Allocate a stack using mmap with 2 extra pages, 1 at each end + // which will be PROT_NONE to act as guard pages to catch overflow + // and underflow. This will result in a SIGSEGV but should make it + // easier to catch a stack that is too small (or underflows). + // + // Notes: + // + // 1. On ARM64 with 64K pages this would be quite wasteful of memory + // so it is behind a DEBUG flag to enable/disable on that platform. + // + // 2. If the requested stack size is not a multiple of a page size + // then stack underflow wont always be caught as there is some + // extra space up until the next page boundary with the guard page. + // + // 3. The task_t is placed at the top of the stack so can be overwritten + // just before the stack overflows and hits the guard page. 
+ // + + total_size =3D stack_size + (PAGE_SIZE * 2); + lower_guard =3D mmap(NULL, total_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANON, -1, 0); + if (lower_guard =3D=3D MAP_FAILED) { + errprint("mmap of %lu bytes failed: %s\n", total_size, + strerror(errno)); + return NULL; + } + + stack =3D lower_guard + PAGE_SIZE; + upper_guard =3D stack + stack_size; + mprotect(lower_guard, PAGE_SIZE, PROT_NONE); + mprotect(upper_guard, PAGE_SIZE, PROT_NONE); + + dbprint(RPAL_DEBUG_FIBER, + "Total stack of size %lu bytes allocated @ %p\n", + total_size, stack); + dbprint(RPAL_DEBUG_FIBER, + "Underflow guard page %p - %p overflow guard page %p - %p\n", + lower_guard, lower_guard + PAGE_SIZE - 1, upper_guard, + upper_guard + PAGE_SIZE - 1); + } else { + stack =3D malloc(size); + } + return fiber_ctx_create(fn, ud, stack, size); +} + +void fiber_ctx_free(task_t *fc) +{ + size_t stack_size; + size_t total_size; + void *addr; + + if (STACK_DEBUG) { + stack_size =3D (fc->size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1); + total_size =3D stack_size + (PAGE_SIZE * 2); + addr =3D fc; + addr -=3D PAGE_SIZE; + if (munmap(addr, total_size) !=3D 0) { + errprint("munmap of %lu bytes @ %p failed: %s\n", + total_size, addr, strerror(errno)); + } + } else { + free(fc); + } +} +#endif diff --git a/samples/rpal/librpal/fiber.h b/samples/rpal/librpal/fiber.h new file mode 100644 index 000000000000..b46485ba740f --- /dev/null +++ b/samples/rpal/librpal/fiber.h @@ -0,0 +1,64 @@ +#ifndef FIBER_H +#define FIBER_H + +#include + +typedef void *fcontext_t; +typedef struct { + fcontext_t fctx; + void *ud; +} transfer_t; + +typedef struct fiber_stack { + unsigned long padding; + unsigned long r12; + unsigned long r13; + unsigned long r14; + unsigned long r15; + unsigned long rbx; + unsigned long rbp; + unsigned long rip; +} fiber_stack_t; + +#define NR_PADDING 8 +typedef struct fiber_ctx { + void *sp; + size_t size; + void (*fn)(void *fc); + void *ud; + fcontext_t fctx; + int padding[NR_PADDING]; +} task_t; + +task_t *fiber_ctx_alloc(void (*fn)(void *ud), void *ud, size_t size); +void fiber_ctx_free(task_t *fc); + +/** + * @brief Make a context for jump_fcontext. + * + * @param sp The stack top pointer of context. + * @param size The size of stack, this argument is useless. But a second a= rgument is neccessary. + * @param fn The function pointer of the context function. + * + * @return The pointer of the newly made context. + */ +extern fcontext_t make_fcontext(void *sp, size_t size, void (*fn)(transfer= _t)); + +/** + * @brief jump to target context and execute fn with argument ud + * + * @param to The pointer of target context. + * @param ud The data part of the argument of fn. + * + * @return the pointer of the prev transfer_t struct, where RAX store + * previous context, RDX store ud passed by previous caller. + */ +extern transfer_t jump_fcontext(fcontext_t const to, void *ud); + +/** + * @brief To be written. + */ +extern transfer_t ontop_fcontext(fcontext_t const to, void *ud, + transfer_t (*fn)(transfer_t)); + +#endif diff --git a/samples/rpal/librpal/jump_x86_64_sysv_elf_gas.S b/samples/rpal= /librpal/jump_x86_64_sysv_elf_gas.S new file mode 100644 index 000000000000..43d3a8149c58 --- /dev/null +++ b/samples/rpal/librpal/jump_x86_64_sysv_elf_gas.S @@ -0,0 +1,81 @@ +/* + Copyright Oliver Kowalke 2009. + Distributed under the Boost Software License, Version 1.0. 
+ (See accompanying file LICENSE_1_0.txt or copy at + http://www.boost.org/LICENSE_1_0.txt) +*/ + +/*************************************************************************= *************** + * = * + * ----------------------------------------------------------------------= ------------ * + * | 0 | 1 | 2 | 3 | 4 | 5 | 6 = | 7 | * + * ----------------------------------------------------------------------= ------------ * + * | 0x0 | 0x4 | 0x8 | 0xc | 0x10 | 0x14 | 0x18 = | 0x1c | * + * ----------------------------------------------------------------------= ------------ * + * | fc_mxcsr|fc_x87_cw| R12 | R13 | = R14 | * + * ----------------------------------------------------------------------= ------------ * + * ----------------------------------------------------------------------= ------------ * + * | 8 | 9 | 10 | 11 | 12 | 13 | 14 = | 15 | * + * ----------------------------------------------------------------------= ------------ * + * | 0x20 | 0x24 | 0x28 | 0x2c | 0x30 | 0x34 | 0x38 = | 0x3c | * + * ----------------------------------------------------------------------= ------------ * + * | R15 | RBX | RBP | = RIP | * + * ----------------------------------------------------------------------= ------------ * + * = * + *************************************************************************= ***************/ +#ifdef __x86_64__ +.text +.globl jump_fcontext +.type jump_fcontext,@function +.align 16 +jump_fcontext: + leaq -0x38(%rsp), %rsp /* prepare stack */ + +#if !defined(BOOST_USE_TSX) + stmxcsr (%rsp) /* save MMX control- and status-word */ + fnstcw 0x4(%rsp) /* save x87 control-word */ +#endif + + movq %r12, 0x8(%rsp) /* save R12 */ + movq %r13, 0x10(%rsp) /* save R13 */ + movq %r14, 0x18(%rsp) /* save R14 */ + movq %r15, 0x20(%rsp) /* save R15 */ + movq %rbx, 0x28(%rsp) /* save RBX */ + movq %rbp, 0x30(%rsp) /* save RBP */ + + /* store RSP (pointing to context-data) in RAX */ + movq %rsp, %rax + + /* restore RSP (pointing to context-data) from RDI */ + movq %rdi, %rsp + + movq 0x38(%rsp), %r8 /* restore return-address */ + +#if !defined(BOOST_USE_TSX) + ldmxcsr (%rsp) /* restore MMX control- and status-word */ + fldcw 0x4(%rsp) /* restore x87 control-word */ +#endif + + movq 0x8(%rsp), %r12 /* restore R12 */ + movq 0x10(%rsp), %r13 /* restore R13 */ + movq 0x18(%rsp), %r14 /* restore R14 */ + movq 0x20(%rsp), %r15 /* restore R15 */ + movq 0x28(%rsp), %rbx /* restore RBX */ + movq 0x30(%rsp), %rbp /* restore RBP */ + + leaq 0x40(%rsp), %rsp /* prepare stack */ + + /* return transfer_t from jump */ + /* RAX =3D=3D fctx, RDX =3D=3D data */ + movq %rsi, %rdx + /* pass transfer_t as first arg in context function */ + /* RDI =3D=3D fctx, RSI =3D=3D data */ + movq %rax, %rdi + + /* indirect jump to context */ + jmp *%r8 +.size jump_fcontext,.-jump_fcontext + +/* Mark that we don't need executable stack. */ +.section .note.GNU-stack,"",%progbits +#endif diff --git a/samples/rpal/librpal/make_x86_64_sysv_elf_gas.S b/samples/rpal= /librpal/make_x86_64_sysv_elf_gas.S new file mode 100644 index 000000000000..4f3af9247110 --- /dev/null +++ b/samples/rpal/librpal/make_x86_64_sysv_elf_gas.S @@ -0,0 +1,82 @@ +/* + Copyright Oliver Kowalke 2009. + Distributed under the Boost Software License, Version 1.0. 
+ (See accompanying file LICENSE_1_0.txt or copy at + http://www.boost.org/LICENSE_1_0.txt) +*/ + +/*************************************************************************= *************** + * = * + * ----------------------------------------------------------------------= ------------ * + * | 0 | 1 | 2 | 3 | 4 | 5 | 6 = | 7 | * + * ----------------------------------------------------------------------= ------------ * + * | 0x0 | 0x4 | 0x8 | 0xc | 0x10 | 0x14 | 0x18 = | 0x1c | * + * ----------------------------------------------------------------------= ------------ * + * | fc_mxcsr|fc_x87_cw| R12 | R13 | = R14 | * + * ----------------------------------------------------------------------= ------------ * + * ----------------------------------------------------------------------= ------------ * + * | 8 | 9 | 10 | 11 | 12 | 13 | 14 = | 15 | * + * ----------------------------------------------------------------------= ------------ * + * | 0x20 | 0x24 | 0x28 | 0x2c | 0x30 | 0x34 | 0x38 = | 0x3c | * + * ----------------------------------------------------------------------= ------------ * + * | R15 | RBX | RBP | = RIP | * + * ----------------------------------------------------------------------= ------------ * + * = * + *************************************************************************= ***************/ +#ifdef __x86_64__ +.text +.globl make_fcontext +.type make_fcontext,@function +.align 16 +make_fcontext: + /* first arg of make_fcontext() =3D=3D top of context-stack */ + movq %rdi, %rax + + /* shift address in RAX to lower 16 byte boundary */ + andq $-16, %rax + + /* reserve space for context-data on context-stack */ + /* on context-function entry: (RSP -0x8) % 16 =3D=3D 0 */ + leaq -0x40(%rax), %rax + + /* third arg of make_fcontext() =3D=3D address of context-function */ + /* stored in RBX */ + movq %rdx, 0x28(%rax) + + /* save MMX control- and status-word */ + stmxcsr (%rax) + /* save x87 control-word */ + fnstcw 0x4(%rax) + + /* compute abs address of label trampoline */ + leaq trampoline(%rip), %rcx + /* save address of trampoline as return-address for context-function */ + /* will be entered after calling jump_fcontext() first time */ + movq %rcx, 0x38(%rax) + + /* compute abs address of label finish */ + leaq finish(%rip), %rcx + /* save address of finish as return-address for context-function */ + /* will be entered after context-function returns */ + movq %rcx, 0x30(%rax) + + ret /* return pointer to context-data */ + +trampoline: + /* store return address on stack */ + /* fix stack alignment */ + push %rbp + /* jump to context-function */ + jmp *%rbx + +finish: + /* exit code is zero */ + xorq %rdi, %rdi + /* exit application */ + call _exit@PLT + hlt +.size make_fcontext,.-make_fcontext + +/* Mark that we don't need executable stack. */ +.section .note.GNU-stack,"",%progbits +#endif \ No newline at end of file diff --git a/samples/rpal/librpal/ontop_x86_64_sysv_elf_gas.S b/samples/rpa= l/librpal/ontop_x86_64_sysv_elf_gas.S new file mode 100644 index 000000000000..9dce797c2541 --- /dev/null +++ b/samples/rpal/librpal/ontop_x86_64_sysv_elf_gas.S @@ -0,0 +1,84 @@ +/* + Copyright Oliver Kowalke 2009. + Distributed under the Boost Software License, Version 1.0. 
+ (See accompanying file LICENSE_1_0.txt or copy at + http://www.boost.org/LICENSE_1_0.txt) +*/ + +/*************************************************************************= *************** + * = * + * ----------------------------------------------------------------------= ------------ * + * | 0 | 1 | 2 | 3 | 4 | 5 | 6 = | 7 | * + * ----------------------------------------------------------------------= ------------ * + * | 0x0 | 0x4 | 0x8 | 0xc | 0x10 | 0x14 | 0x18 = | 0x1c | * + * ----------------------------------------------------------------------= ------------ * + * | fc_mxcsr|fc_x87_cw| R12 | R13 | = R14 | * + * ----------------------------------------------------------------------= ------------ * + * ----------------------------------------------------------------------= ------------ * + * | 8 | 9 | 10 | 11 | 12 | 13 | 14 = | 15 | * + * ----------------------------------------------------------------------= ------------ * + * | 0x20 | 0x24 | 0x28 | 0x2c | 0x30 | 0x34 | 0x38 = | 0x3c | * + * ----------------------------------------------------------------------= ------------ * + * | R15 | RBX | RBP | = RIP | * + * ----------------------------------------------------------------------= ------------ * + * = * + *************************************************************************= ***************/ +#ifdef __x86_64__ +.text +.globl ontop_fcontext +.type ontop_fcontext,@function +.align 16 +ontop_fcontext: + /* preserve ontop-function in R8 */ + movq %rdx, %r8 + + leaq -0x38(%rsp), %rsp /* prepare stack */ + +#if !defined(BOOST_USE_TSX) + stmxcsr (%rsp) /* save MMX control- and status-word */ + fnstcw 0x4(%rsp) /* save x87 control-word */ +#endif + + movq %r12, 0x8(%rsp) /* save R12 */ + movq %r13, 0x10(%rsp) /* save R13 */ + movq %r14, 0x18(%rsp) /* save R14 */ + movq %r15, 0x20(%rsp) /* save R15 */ + movq %rbx, 0x28(%rsp) /* save RBX */ + movq %rbp, 0x30(%rsp) /* save RBP */ + + /* store RSP (pointing to context-data) in RAX */ + movq %rsp, %rax + + /* restore RSP (pointing to context-data) from RDI */ + movq %rdi, %rsp + +#if !defined(BOOST_USE_TSX) + ldmxcsr (%rsp) /* restore MMX control- and status-word */ + fldcw 0x4(%rsp) /* restore x87 control-word */ +#endif + + movq 0x8(%rsp), %r12 /* restore R12 */ + movq 0x10(%rsp), %r13 /* restore R13 */ + movq 0x18(%rsp), %r14 /* restore R14 */ + movq 0x20(%rsp), %r15 /* restore R15 */ + movq 0x28(%rsp), %rbx /* restore RBX */ + movq 0x30(%rsp), %rbp /* restore RBP */ + + leaq 0x38(%rsp), %rsp /* prepare stack */ + + /* return transfer_t from jump */ + /* RAX =3D=3D fctx, RDX =3D=3D data */ + movq %rsi, %rdx + /* pass transfer_t as first arg in context function */ + /* RDI =3D=3D fctx, RSI =3D=3D data */ + movq %rax, %rdi + + /* keep return-address on stack */ + + /* indirect jump to context */ + jmp *%r8 +.size ontop_fcontext,.-ontop_fcontext + +/* Mark that we don't need executable stack. 
*/ +.section .note.GNU-stack,"",%progbits +#endif diff --git a/samples/rpal/librpal/private.h b/samples/rpal/librpal/private.h new file mode 100644 index 000000000000..9dc78f449f0f --- /dev/null +++ b/samples/rpal/librpal/private.h @@ -0,0 +1,341 @@ +#ifndef PRIVATE_H +#define PRIVATE_H + +#include +#include +#include +#include +#ifdef __x86_64__ +#include +#endif +#include +#include +#include +#include +#include + +#include "debug.h" +#include "rpal_queue.h" +#include "fiber.h" +#include "rpal.h" + +#ifdef __x86_64__ +static inline void write_tls_base(unsigned long tls_base) +{ + asm volatile("wrfsbase %0" ::"r"(tls_base) : "memory"); +} + +static inline unsigned long read_tls_base(void) +{ + unsigned long fsbase; + asm volatile("rdfsbase %0" : "=3Dr"(fsbase)::"memory"); + return fsbase; +} +#endif + +#define likely(x) __builtin_expect(!!(x), 1) +#define unlikely(x) __builtin_expect(!!(x), 0) + +// | fd_timestamp | pad | rthread_id | server_fd | +// | 16 | 8 | 8 | 32 | +#define LOW32_MASK ((1UL << 32) - 1) +#define MIDL8_MASK ((unsigned long)(((1UL << 8) - 1)) << 32) + +#define HIGH16_OFFSET 48 +#define HIGH32_OFFSET 32 + +#define get_high16(val) ({ (val) >> HIGH16_OFFSET; }) + +#define get_high32(val) ({ (val) >> HIGH32_OFFSET; }) + +#define get_midl8(val) ({ ((val) & MIDL8_MASK) >> HIGH32_OFFSET; }) +#define get_low32(val) ({ (val) & LOW32_MASK; }) + +#define get_fdtimestamp(rpalfd) get_high16(rpalfd) +#define get_rid(rpalfd) get_midl8(rpalfd) +#define get_sfd(rpalfd) get_low32(rpalfd) + +#define PAGE_SIZE 4096 +#define DEFUALT_STACK_SIZE (PAGE_SIZE * 4) +#define TRAMPOLINE_SIZE (PAGE_SIZE * 1) + +#define BITS_PER_LONG 64 +#define BITS_TO_LONGS(x) = \ + (((x) + 8 * sizeof(unsigned long) - 1) / (8 * sizeof(unsigned long))) + +#define KEY_SIZE 16 + +enum rpal_sender_state { + RPAL_SENDER_STATE_RUNNING, + RPAL_SENDER_STATE_CALL, + RPAL_SENDER_STATE_KERNEL_RET, +}; + +enum rpal_epoll_event { + RPAL_KERNEL_PENDING =3D 0x1, + RPAL_USER_PENDING =3D 0x2, +}; + +enum rpal_receiver_state { + RPAL_RECEIVER_STATE_RUNNING, + RPAL_RECEIVER_STATE_KERNEL_RET, + RPAL_RECEIVER_STATE_READY, + RPAL_RECEIVER_STATE_WAIT, + RPAL_RECEIVER_STATE_CALL, + RPAL_RECEIVER_STATE_LAZY_SWITCH, + RPAL_RECEIVER_STATE_MAX, +}; + +enum rpal_command_type { + RPAL_CMD_GET_API_VERSION_AND_CAP, + RPAL_CMD_GET_SERVICE_KEY, + RPAL_CMD_GET_SERVICE_ID, + RPAL_CMD_REGISTER_SENDER, + RPAL_CMD_UNREGISTER_SENDER, + RPAL_CMD_REGISTER_RECEIVER, + RPAL_CMD_UNREGISTER_RECEIVER, + RPAL_CMD_ENABLE_SERVICE, + RPAL_CMD_DISABLE_SERVICE, + RPAL_CMD_REQUEST_SERVICE, + RPAL_CMD_RELEASE_SERVICE, + RPAL_CMD_GET_SERVICE_PKEY, + RPAL_CMD_UDS_FDMAP, + RPAL_NR_CMD, +}; + +/* RPAL ioctl macro */ +#define RPAL_IOCTL_MAGIC 0x33 +#define RPAL_IOCTL_GET_API_VERSION_AND_CAP \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_API_VERSION_AND_CAP, \ + struct rpal_version_info *) +#define RPAL_IOCTL_GET_SERVICE_KEY \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_KEY, unsigned long) +#define RPAL_IOCTL_GET_SERVICE_ID \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_ID, int *) +#define RPAL_IOCTL_REGISTER_SENDER \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REGISTER_SENDER, unsigned long) +#define RPAL_IOCTL_UNREGISTER_SENDER \ + _IO(RPAL_IOCTL_MAGIC, RPAL_CMD_UNREGISTER_SENDER) +#define RPAL_IOCTL_REGISTER_RECEIVER \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REGISTER_RECEIVER, unsigned long) +#define RPAL_IOCTL_UNREGISTER_RECEIVER \ + _IO(RPAL_IOCTL_MAGIC, RPAL_CMD_UNREGISTER_RECEIVER) +#define RPAL_IOCTL_ENABLE_SERVICE \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_ENABLE_SERVICE, unsigned long) 
+#define RPAL_IOCTL_DISABLE_SERVICE \ + _IO(RPAL_IOCTL_MAGIC, RPAL_CMD_DISABLE_SERVICE) +#define RPAL_IOCTL_REQUEST_SERVICE \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_REQUEST_SERVICE, unsigned long) +#define RPAL_IOCTL_RELEASE_SERVICE \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_RELEASE_SERVICE, unsigned long) +#define RPAL_IOCTL_GET_SERVICE_PKEY \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_GET_SERVICE_PKEY, int *) +#define RPAL_IOCTL_UDS_FDMAP \ + _IOWR(RPAL_IOCTL_MAGIC, RPAL_CMD_UDS_FDMAP, void *) + +typedef enum rpal_receiver_status { + RPAL_RECEIVER_UNINITIALIZED, + RPAL_RECEIVER_INITIALIZED, + RPAL_RECEIVER_AVAILABLE, +} rpal_receiver_status_t; + +enum RPAL_CAPABILITIES { + RPAL_CAP_PKU, +}; + +#define RPAL_SID_SHIFT 24 +#define RPAL_ID_SHIFT 8 +#define RPAL_RECEIVER_STATE_MASK ((1 << RPAL_ID_SHIFT) - 1) +#define RPAL_SID_MASK (~((1 << RPAL_SID_SHIFT) - 1)) +#define RPAL_ID_MASK (~(0 | RPAL_RECEIVER_STATE_MASK | RPAL_SID_MASK)) +#define RPAL_MAX_ID ((1 << (RPAL_SID_SHIFT - RPAL_ID_SHIFT)) - 1) +#define RPAL_BUILD_CALL_STATE(id, sid) = \ + ((sid << RPAL_SID_SHIFT) | (id << RPAL_ID_SHIFT) | RPAL_RECEIVER_STATE_CA= LL) + +typedef struct rpal_capability { + int compat_version; + int api_version; + unsigned long cap; +} rpal_capability_t; + +typedef struct task_context { + unsigned long r15; + unsigned long r14; + unsigned long r13; + unsigned long r12; + unsigned long rbx; + unsigned long rbp; + unsigned long rip; + unsigned long rsp; +} task_context_t; + +typedef struct receiver_context { + task_context_t task_context; + int receiver_id; + int receiver_state; + int sender_state; + int ep_pending; + int rpal_ep_poll_magic; + int epfd; + void *ep_events; + int maxevents; + int timeout; + int64_t total_time; +} receiver_context_t; + +typedef struct rpal_call_info { + unsigned long sender_tls_base; + uint32_t pkru; + fcontext_t sender_fctx; +} rpal_call_info_t; + +enum thread_type { + RPAL_RECEIVER =3D 0x1, + RPAL_SENDER =3D 0x2, +}; +typedef struct rpal_receiver_info { + long tid; + unsigned long tls_base; + + int epfd; + rpal_receiver_status_t status; + epoll_uevent_queue_t ueventq; + volatile uint64_t uqlock; + + fcontext_t main_ctx; + task_t *ep_stack; + task_t *trampoline; + + rpal_call_info_t rci; + + volatile receiver_context_t *rc; + struct rpal_thread_pool *rtp; +} rpal_receiver_info_t; + +typedef struct fd_table fd_table_t; +/* Keep it the same as kernel */ +struct rpal_thread_pool { + rpal_receiver_info_t *rris; + fd_table_t *fdt; + uint64_t service_key; + int nr_threads; + int service_id; + int pkey; +}; + +struct rpal_request_arg { + unsigned long version; + uint64_t key; + struct rpal_thread_pool **rtp; + int *id; + int *pkey; +}; + +struct rpal_uds_fdmap_arg { + int service_id; + int cfd; + unsigned long *res; +}; + +#define RPAL_ERROR_MAGIC 0x98CC98CC + +typedef struct rpal_error_context { + unsigned long tls_base; + uint64_t erip; + uint64_t ersp; + int state; + int magic; +} rpal_error_context_t; + +typedef struct sender_context { + task_context_t task_context; + rpal_error_context_t ec; + int sender_id; + int64_t start_time; + int64_t total_time; +} sender_context_t; + +#define RPAL_EP_POLL_MAGIC 0xCC98CC98 + +typedef struct rpal_sender_info { + int idx; + int tid; + int pkey; + int inited; + sender_context_t sc; +} rpal_sender_info_t; + +typedef struct fdt_node fdt_node_t; + +typedef struct fd_event { + int epfd; + int fd; + struct epoll_event epev; + uint32_t events; + int wait; + + rpal_queue_t q; + int pkey; // unused + fdt_node_t *node; + struct fd_event *next; + uint16_t timestamp; + 
uint16_t outdated; + uint64_t service_key; +} fd_event_t; + +struct fdt_node { + fd_event_t **events; + fdt_node_t *next; + int *ref_count; + uint16_t *timestamps; +}; + +// when sender calls fd_event_get, we must check this number to avoid +// accessing outdated fdt_node definitions + +#define FDTAB_MAG1 0x4D414731UL // add fde lazyswitch +#define FDTAB_MAG2 0x14D414731UL // add fde timestamp +#define FDTAB_MAG3 0x34D414731UL // add fde outdated +#define FDTAB_MAG4 0x74D414731UL // add automatic identification rpal mode + +enum fde_ref_status { + FDE_FREEING =3D -100, + FDE_FREED =3D -1, + FDE_AVAILABLE =3D 0, +}; + +#define DEFAULT_NODE_SHIFT 14 // 2^14 elements per node +typedef struct fd_table { + fdt_node_t *head; + fdt_node_t *tail; + int max_fd; + unsigned int node_shift; + unsigned int node_mask; + pthread_mutex_t lock; + unsigned long magic; + fd_event_t *freelist; + pthread_mutex_t list_lock; +} fd_table_t; + +typedef struct critical_section { + unsigned long ret_begin; + unsigned long ret_end; +} critical_section_t; + +struct rpal_service_metadata { + unsigned long version; + struct rpal_thread_pool *rtp; + critical_section_t rcs; + int pkey; +}; + +#ifndef RPAL_DEBUG +#define dbprint(category, format, args...) ((void)0) +#else +void dbprint(rpal_debug_flag_t category, char *format, ...) + __attribute__((format(printf, 2, 3))); +#endif +void errprint(const char *format, ...) __attribute__((format(printf, 1, 2)= )); +void warnprint(const char *format, ...) __attribute__((format(printf, 1, 2= ))); + +#endif diff --git a/samples/rpal/librpal/rpal.c b/samples/rpal/librpal/rpal.c new file mode 100644 index 000000000000..64bd2b93bd67 --- /dev/null +++ b/samples/rpal/librpal/rpal.c @@ -0,0 +1,2351 @@ +#include "private.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "rpal_pkru.h" + +/* prints an error message to stderr */ +void errprint(const char *format, ...) +{ + va_list args; + + fprintf(stderr, "[RPAL_ERROR] "); + va_start(args, format); + vfprintf(stderr, format, args); + va_end(args); +} + +/* prints a warning message to stderr */ +void warnprint(const char *format, ...) +{ + va_list args; + + fprintf(stderr, "[RPAL_WARNING] "); + va_start(args, format); + vfprintf(stderr, format, args); + va_end(args); +} + +#ifdef RPAL_DEBUG +void dbprint(rpal_debug_flag_t category, char *format, ...) +{ + if (category & RPAL_DEBUG) { + va_list args; + fprintf(stderr, "[RPAL_DEBUG] "); + va_start(args, format); + vfprintf(stderr, format, args); + va_end(args); + } +} +#endif + +#define SAVE_FPU(mxcsr, fpucw) = \ + __asm__ __volatile__("stmxcsr %0;" \ + "fnstcw %1;" \ + : "=3Dm"(mxcsr), "=3Dm"(fpucw) \ + :) +#define RESTORE_FPU(mxcsr, fpucw) = \ + __asm__ __volatile__("ldmxcsr %0;" \ + "fldcw %1;" \ + : \ + : "m"(mxcsr), "m"(fpucw)) + +#define ERRREPORT(EPTR, ECODE, ...) 
= \ + if (EPTR) { \ + *EPTR =3D ECODE; \ + } \ + errprint(__VA_ARGS__); + +#define RPAL_MGT_FILE "/proc/rpal" +#define MAX_SUPPROTED_CPUS 192 + +static __always_inline unsigned long __ffs(unsigned long word) +{ + asm("rep; bsf %1,%0" : "=3Dr"(word) : "rm"(word)); + + return word; +} + +static void __set_bit(uint64_t *bitmap, int idx) +{ + int bit, i; + i =3D idx / 8; + bit =3D idx % 8; + bitmap[i] |=3D (1UL << bit); +} + +static int clear_first_set_bit(uint64_t *bitmap, int size) +{ + int idx; + int bit, i; + + for (i =3D 0; i * BITS_PER_LONG < size; i++) { + if (bitmap[i]) { + bit =3D __ffs(bitmap[i]); + idx =3D i * BITS_PER_LONG + bit; + if (idx >=3D size) { + return -1; + } + bitmap[i] &=3D ~(1UL << bit); + return idx; + } + } + return -1; +} + +extern void rpal_get_critical_addr(critical_section_t *rcs); +static critical_section_t rcs =3D { 0 }; + +#define MAX_SERVICEID 254 // Intel MPK Limit +#define MIN_RPAL_KERNEL_API_VERSION 1 +#define TARGET_RPAL_KERNEL_API_VERSION = \ + 1 // RPAL will disable when KERNEL_API < TARGET_RPAL_KERNEL_API_VERSION + +enum { + RCALL_IN =3D 0x1 << 0, + RCALL_OUT =3D 0x1 << 1, +}; + +enum { + FDE_NO_TRIGGER, + FDE_TRIGGER_OUT, +}; + +#define EPOLLRPALINOUT_BITS (EPOLLRPALIN | EPOLLRPALOUT) + +#define DEFAULT_QUEUE_SIZE 32U + +typedef struct rpal_requested_service { + struct rpal_thread_pool *service; + int pkey; + uint64_t key; +} rpal_requeseted_service_t; + +static int rpal_mgtfd =3D -1; +static int inited; +int pkru_enabled =3D 0; + +static rpal_capability_t version; +static pthread_key_t rpal_key; +static rpal_requeseted_service_t requested_services[MAX_SERVICEID]; +static pthread_mutex_t release_lock; + +typedef struct rpal_local { + unsigned int tflag; + rpal_receiver_info_t *rri; + rpal_sender_info_t *rsi; +} rpal_local_t; + +#define SENDERS_PAGE_ORDER 3 +#define RPALTHREAD_PAGE_ORDER 0 + +typedef struct rpal_thread_metadata { + int rpal_receiver_idx; + int service_id; + const int epcpage_order; + uint64_t service_key; + struct rpal_thread_pool *rtp; + receiver_context_t *rc; + pid_t pid; + int *eventfds; +} rpal_thread_metadata_t; + +static rpal_thread_metadata_t threads_md =3D { + .service_id =3D -1, + .epcpage_order =3D RPALTHREAD_PAGE_ORDER, +}; + +static inline rpal_sender_info_t *current_rpal_sender(void) +{ + rpal_local_t *local; + + local =3D pthread_getspecific(rpal_key); + if (local && (local->tflag & RPAL_SENDER)) { + return local->rsi; + } else { + return NULL; + } +} + +static inline rpal_receiver_info_t *current_rpal_thread(void) +{ + rpal_local_t *local; + + local =3D pthread_getspecific(rpal_key); + if (local && (local->tflag & RPAL_RECEIVER)) { + return local->rri; + } else { + return NULL; + } +} + +static status_t rpal_register_sender_local(rpal_sender_info_t *sender) +{ + rpal_local_t *local; + local =3D pthread_getspecific(rpal_key); + if (!local) { + local =3D malloc(sizeof(rpal_local_t)); + if (!local) + return RPAL_FAILURE; + memset(local, 0, sizeof(rpal_local_t)); + pthread_setspecific(rpal_key, local); + } + if (local->tflag & RPAL_SENDER) { + return RPAL_FAILURE; + } + local->rsi =3D sender; + local->tflag |=3D RPAL_SENDER; + return RPAL_SUCCESS; +} + +static status_t rpal_unregister_sender_local(void) +{ + rpal_local_t *local; + local =3D pthread_getspecific(rpal_key); + if (!local || !(local->tflag & RPAL_SENDER)) + return RPAL_FAILURE; + + local->rsi =3D NULL; + local->tflag &=3D ~RPAL_SENDER; + if (!local->tflag) { + pthread_setspecific(rpal_key, NULL); + free(local); + } + return RPAL_SUCCESS; +} + +static status_t 
rpal_register_receiver_local(rpal_receiver_info_t *thread) +{ + rpal_local_t *local; + local =3D pthread_getspecific(rpal_key); + if (!local) { + local =3D malloc(sizeof(rpal_local_t)); + if (!local) + return RPAL_FAILURE; + memset(local, 0, sizeof(rpal_local_t)); + pthread_setspecific(rpal_key, local); + } + if (local->tflag & RPAL_RECEIVER) { + return RPAL_FAILURE; + } + local->rri =3D thread; + local->tflag |=3D RPAL_RECEIVER; + return RPAL_SUCCESS; +} + +static status_t rpal_unregister_receiver_local(void) +{ + rpal_local_t *local; + local =3D pthread_getspecific(rpal_key); + if (!local || !(local->tflag & RPAL_RECEIVER)) + return RPAL_FAILURE; + + local->rri =3D NULL; + local->tflag &=3D ~RPAL_RECEIVER; + if (!local->tflag) { + pthread_setspecific(rpal_key, NULL); + free(local); + } + return RPAL_SUCCESS; +} + +#define MAX_SENDERS 256 +typedef struct rpal_senders_metadata { + uint64_t bitmap[BITS_TO_LONGS(MAX_SENDERS)]; + pthread_mutex_t lock; + int sdpage_order; + rpal_sender_info_t *senders; +} rpal_senders_metadata_t; + +static rpal_senders_metadata_t *senders_md; + +static long rpal_ioctl(unsigned long cmd, unsigned long arg) +{ + struct { + unsigned long *ret; + unsigned long cmd; + unsigned long arg0; + unsigned long arg1; + } args; + const int args_size =3D sizeof(args); + int ret; + + if (rpal_mgtfd =3D=3D -1) { + errprint("rpal_mgtfd is not opened\n"); + return -1; + } + + ret =3D ioctl(rpal_mgtfd, cmd, arg); + + return ret; +} + +static inline long rpal_register_sender(rpal_sender_info_t *sender) +{ + long ret; + + if (rpal_register_sender_local(sender) =3D=3D RPAL_FAILURE) + return RPAL_FAILURE; + + ret =3D rpal_ioctl(RPAL_IOCTL_REGISTER_SENDER, + (unsigned long)&sender->sc); + if (ret < 0) { + rpal_unregister_sender_local(); + } + return ret; +} + +static inline long rpal_register_receiver(rpal_receiver_info_t *rri) +{ + long ret; + + if (rpal_register_receiver_local(rri) =3D=3D RPAL_FAILURE) + return RPAL_FAILURE; + ret =3D rpal_ioctl(RPAL_IOCTL_REGISTER_RECEIVER, + (unsigned long)rri->rc); + if (ret < 0) { + rpal_unregister_receiver_local(); + } + return ret; +} + +static inline long rpal_unregister_sender(void) +{ + if (rpal_unregister_sender_local() =3D=3D RPAL_FAILURE) + return RPAL_FAILURE; + return rpal_ioctl(RPAL_IOCTL_UNREGISTER_SENDER, 0); +} + +static inline long rpal_unregister_receiver(void) +{ + if (rpal_unregister_receiver_local() =3D=3D RPAL_FAILURE) + return RPAL_FAILURE; + return rpal_ioctl(RPAL_IOCTL_UNREGISTER_RECEIVER, 0); +} + +static int rpal_get_service_pkey(void) +{ + int pkey, ret; + + ret =3D rpal_ioctl(RPAL_IOCTL_GET_SERVICE_PKEY, (unsigned long)&pkey); + if (ret < 0 || pkey =3D=3D -1) { + warnprint("MPK not supported on this host, disabling PKRU\n"); + return -1; + } + return pkey; +} + +static int __rpal_get_service_id(void) +{ + int id, ret; + + ret =3D rpal_ioctl(RPAL_IOCTL_GET_SERVICE_ID, (unsigned long)&id); + + if (ret < 0) + return ret; + else + return id; +} + +static uint64_t __rpal_get_service_key(void) +{ + int ret; + uint64_t key; + + ret =3D rpal_ioctl(RPAL_IOCTL_GET_SERVICE_KEY, (unsigned long)&key); + if (ret < 0) + return 0; + else + return key; +} + +static void *rpal_get_shared_page(int order) +{ + void *p; + int size; + int flags =3D MAP_SHARED; + + if (rpal_mgtfd =3D=3D -1) { + return NULL; + } + size =3D PAGE_SIZE * (1 << order); + + p =3D mmap(NULL, size, PROT_READ | PROT_WRITE, flags, rpal_mgtfd, 0); + + return p; +} + +static int rpal_free_shared_page(void *page, int order) +{ + int ret =3D 0; + int size; + + size =3D 
PAGE_SIZE * (1 << order); + ret =3D munmap(page, size); + if (ret) { + errprint("munmap fail: %d\n", ret); + } + return ret; +} + +static inline int rpal_inited(void) +{ + return (inited =3D=3D 1); +} + +static inline int sender_idx_is_invalid(int idx) +{ + if (idx < 0 || idx >=3D MAX_SENDERS) + return 1; + return 0; +} + +static int rpal_sender_info_alloc(rpal_sender_info_t **sender) +{ + int idx; + + if (!senders_md) + return RPAL_FAILURE; + pthread_mutex_lock(&senders_md->lock); + idx =3D clear_first_set_bit(senders_md->bitmap, MAX_SENDERS); + if (idx < 0) { + errprint("sender data alloc failed: %d, bitmap: %lx\n", idx, + senders_md->bitmap[0]); + goto unlock; + } + *sender =3D senders_md->senders + idx; + +unlock: + pthread_mutex_unlock(&senders_md->lock); + return idx; +} + +static void rpal_sender_info_free(int idx) +{ + if (sender_idx_is_invalid(idx)) { + return; + } + pthread_mutex_lock(&senders_md->lock); + __set_bit(senders_md->bitmap, idx); + pthread_mutex_unlock(&senders_md->lock); +} + +extern unsigned long rpal_get_ret_rip(void); + +static int rpal_sender_inited(rpal_sender_info_t *sender) +{ + return (sender->inited =3D=3D 1); +} + +status_t rpal_sender_init(rpal_error_code_t *error) +{ + int idx; + int ret =3D RPAL_FAILURE; + rpal_sender_info_t *sender; + + if (!rpal_inited()) { + ERRREPORT(error, RPAL_DONT_INITED, "%s: rpal do not init\n", + __FUNCTION__); + goto error_out; + } + sender =3D current_rpal_sender(); + if (sender) { + goto error_out; + } + idx =3D rpal_sender_info_alloc(&sender); + if (idx < 0) { + if (error) { + *error =3D RPAL_ERR_SENDER_INIT; + } + goto error_out; + } + sender->idx =3D idx; + sender->sc.sender_id =3D idx; + sender->tid =3D syscall(SYS_gettid); + sender->pkey =3D rpal_get_service_pkey(); + sender->sc.ec.erip =3D rpal_get_ret_rip(); + ret =3D rpal_register_sender(sender); + if (ret) { + ERRREPORT(error, RPAL_ERR_SENDER_REG, + "rpal_register_sender error: %d\n", ret); + goto sender_register_failed; + } + sender->inited =3D 1; + return RPAL_SUCCESS; + +sender_register_failed: + rpal_sender_info_free(idx); +error_out: + return RPAL_FAILURE; +} + +status_t rpal_sender_exit(void) +{ + int idx; + rpal_sender_info_t *sender; + + sender =3D current_rpal_sender(); + + if (sender) { + idx =3D sender->idx; + sender->idx =3D 0; + sender->tid =3D 0; + rpal_unregister_sender(); + rpal_sender_info_free(idx); + sender->pkey =3D 0; + } + return RPAL_SUCCESS; +} + +static status_t rpal_enable_service(rpal_error_code_t *error) +{ + struct rpal_service_metadata rsm; + long ret =3D 0; + + rsm.version =3D 0; + rsm.rtp =3D threads_md.rtp; + rsm.rcs =3D rcs; + rsm.pkey =3D -1; + ret =3D rpal_ioctl(RPAL_IOCTL_ENABLE_SERVICE, (unsigned long)&rsm); + if (ret) { + ERRREPORT(error, RPAL_ERR_ENABLE_SERVICE, + "rpal enable service failed: %ld\n", ret) + return RPAL_FAILURE; + } + threads_md.rtp->pkey =3D rpal_get_service_pkey(); + return RPAL_SUCCESS; +} + +static status_t rpal_disable_service(void) +{ + long ret =3D 0; + ret =3D rpal_ioctl(RPAL_IOCTL_DISABLE_SERVICE, 0); + if (ret) { + errprint("rpal disable service failed: %ld\n", ret); + return RPAL_FAILURE; + } + return RPAL_SUCCESS; +} + +static status_t add_requested_service(struct rpal_thread_pool *rtp, uint64= _t key, int id, int pkey) +{ + struct rpal_thread_pool *expected =3D NULL; + + if (!rtp) { + errprint("add requested service null\n"); + return RPAL_FAILURE; + } + + if (!__atomic_compare_exchange_n(&requested_services[id].service, + &expected, rtp, 1, __ATOMIC_SEQ_CST, + __ATOMIC_SEQ_CST)) { + errprint("rpal 
service %d already add, expected: %ld\n", id, + expected->service_key); + return RPAL_FAILURE; + } + requested_services[id].key =3D key; + requested_services[id].pkey =3D pkey; + return RPAL_SUCCESS; +} + +int rpal_get_request_service_id(uint64_t key) +{ + int i; + + for (i =3D 0; i < MAX_SERVICEID; i++) { + if (requested_services[i].key =3D=3D key) + return i; + } + return -1; +} + +static struct rpal_thread_pool *get_service_from_key(uint64_t key) +{ + int i; + struct rpal_thread_pool *rtp; + + for (i =3D 0; i < MAX_SERVICEID; i++) { + if (requested_services[i].key =3D=3D key) + return requested_services[i].service; + } + return NULL; +} + +static inline struct rpal_thread_pool *get_service_from_id(int id) +{ + return requested_services[id].service; +} + +static inline int get_service_pkey_from_id(int id) +{ + return requested_services[id].pkey; +} + +static struct rpal_thread_pool *del_requested_service(uint64_t key) +{ + int id; + struct rpal_thread_pool *rtp; + + id =3D rpal_get_request_service_id(key); + if (id =3D=3D -1) + return NULL; + rtp =3D __atomic_exchange_n(&requested_services[id].service, NULL, + __ATOMIC_RELAXED); + return rtp; +} + +int rpal_request_service(uint64_t key) +{ + struct rpal_request_arg rra; + long ret =3D RPAL_FAILURE; + struct rpal_thread_pool *rtp; + int id, pkey; + + if (!rpal_inited()) { + errprint("%s: rpal do not init\n", __FUNCTION__); + goto error_out; + } + + rra.version =3D 0; + rra.key =3D key; + rra.rtp =3D &rtp; + rra.id =3D &id; + rra.pkey =3D &pkey; + ret =3D rpal_ioctl(RPAL_IOCTL_REQUEST_SERVICE, (unsigned long)&rra); + if (ret) { + goto error_out; + } + + ret =3D add_requested_service(rtp, key, id, pkey); + if (ret =3D=3D RPAL_FAILURE) { + goto add_requested_failed; + } + + return RPAL_SUCCESS; + +add_requested_failed: + rpal_ioctl(RPAL_IOCTL_RELEASE_SERVICE, key); +error_out: + return (int)ret; +} + +static void fdt_freelist_forcefree(fd_table_t *fdt, uint64_t service_key); + +status_t rpal_release_service(uint64_t key) +{ + long ret; + struct rpal_thread_pool *rtp; + + if (!rpal_inited()) { + errprint("%s: rpal do not init\n", __FUNCTION__); + return RPAL_FAILURE; + } + + rtp =3D del_requested_service(key); + ret =3D rpal_ioctl(RPAL_IOCTL_RELEASE_SERVICE, key); + if (ret) { + errprint("rpal release service failed: %ld\n", ret); + return RPAL_FAILURE; + } + fdt_freelist_forcefree(threads_md.rtp->fdt, key); + return RPAL_SUCCESS; +} + +static void try_clean_lock(rpal_receiver_info_t *rri, uint64_t key) +{ + uint64_t lock_state =3D key | 1UL << 63; + + if (__atomic_load_n(&rri->uqlock, __ATOMIC_RELAXED) =3D=3D lock_state) + uevent_queue_fix(&rri->ueventq); + + if (__atomic_compare_exchange_n(&rri->uqlock, &lock_state, (uint64_t)0, + 1, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) + dbprint(RPAL_DEBUG_MANAGEMENT, + "Serivce (key: %lu) does exit with holding lock\n", + key); +} + +struct release_info { + uint64_t keys[KEY_SIZE]; + int size; +}; + +status_t rpal_clean_service_start(int64_t *ptr) +{ + rpal_receiver_info_t *rri; + struct release_info *info; + int i, j; + int size; + + if (!ptr) { + goto error_out; + } + + info =3D malloc(sizeof(struct release_info)); + if (info =3D=3D NULL) { + errprint("alloc release_info fail\n"); + goto error_out; + } + + pthread_mutex_lock(&release_lock); + size =3D read(rpal_mgtfd, info->keys, KEY_SIZE * sizeof(uint64_t)); + if (size <=3D 0) { + errprint("Read keys on rpal_mgtfd failed\n"); + goto error_unlock; + } + + size /=3D sizeof(uint64_t); + info->size =3D size; + + for (i =3D 0; i < size; i++) { + for (j =3D 
0; j < threads_md.rtp->nr_threads; j++) { + rri =3D threads_md.rtp->rris + j; + try_clean_lock(rri, info->keys[i]); + } + } + pthread_mutex_unlock(&release_lock); + *ptr =3D (int64_t)info; + return RPAL_SUCCESS; + +error_unlock: + pthread_mutex_unlock(&release_lock); + free(info); +error_out: + return RPAL_FAILURE; +} + +void rpal_clean_service_end(int64_t *ptr) +{ + int i; + struct release_info *info; + + if (ptr =3D=3D NULL) + return; + info =3D (struct release_info *)(*ptr); + if (info =3D=3D NULL) + return; + for (i =3D 0; i < info->size; i++) { + dbprint(RPAL_DEBUG_MANAGEMENT, "release service: 0x%lx\n", + info->keys[i]); + rpal_release_service(info->keys[i]); + } + free(info); +} +int rpal_get_service_id(void) +{ + if (!rpal_inited()) { + return RPAL_FAILURE; + } + return threads_md.service_id; +} + +status_t rpal_get_service_key(uint64_t *service_key) +{ + if (!rpal_inited() || !service_key) { + return RPAL_FAILURE; + } + *service_key =3D threads_md.service_key; + return RPAL_SUCCESS; +} + +static fdt_node_t *fdt_node_alloc(fd_table_t *fdt) +{ + fdt_node_t *node; + fd_event_t **ev; + int *ref_count; + uint16_t *timestamps; + int size =3D 0; + + node =3D malloc(sizeof(fdt_node_t)); + if (!node) + goto node_alloc_failed; + + size =3D sizeof(fd_event_t **) * (1 << fdt->node_shift); + ev =3D malloc(size); + if (!ev) + goto events_alloc_failed; + memset(ev, 0, size); + + size =3D sizeof(int) * (1 << fdt->node_shift); + ref_count =3D malloc(size); + if (!ref_count) + goto used_alloc_failed; + memset(ref_count, 0xff, size); + + size =3D sizeof(uint16_t) * (1 << fdt->node_shift); + timestamps =3D malloc(size); + if (!timestamps) + goto ts_alloc_failed; + memset(timestamps, 0, size); + + node->events =3D ev; + node->ref_count =3D ref_count; + node->next =3D NULL; + node->timestamps =3D timestamps; + if (!fdt->head) { + fdt->head =3D node; + fdt->tail =3D node; + } else { + fdt->tail->next =3D node; + fdt->tail =3D node; + } + fdt->max_fd +=3D (1 << fdt->node_shift); + return node; + +ts_alloc_failed: + free(ref_count); +used_alloc_failed: + free(ev); +events_alloc_failed: + free(node); +node_alloc_failed: + errprint("%s Error!!! max_fd: %d\n", __FUNCTION__, fdt->max_fd); + return NULL; +} + +static void fdt_node_free_all(fd_table_t *fdt) +{ + fdt_node_t *node, *ptr; + + node =3D fdt->head; + while (node) { + free(node->timestamps); + free(node->ref_count); + free(node->events); + ptr =3D node; + node =3D node->next; + free(ptr); + } +} + +static fdt_node_t *fdt_node_expand(fd_table_t *fdt, int fd) +{ + fdt_node_t *node =3D NULL; + while (fd >=3D fdt->max_fd) { + node =3D fdt_node_alloc(fdt); + if (!node) + break; + } + return node; +} + +static fdt_node_t *fdt_node_search(fd_table_t *fdt, int fd) +{ + fdt_node_t *node =3D NULL; + int pos =3D 0; + if (fd >=3D fdt->max_fd) + return NULL; + pos =3D fd >> fdt->node_shift; + node =3D fdt->head; + while (pos) { + if (!node) { + errprint( + "fdt node search ERROR! 
fd: %d, pos: %d, fdt->max_fd: %d\n", + fd, pos, fdt->max_fd); + return NULL; + } + node =3D node->next; + pos--; + } + return node; +} + +static fd_table_t *fd_table_alloc(unsigned int node_shift) +{ + fd_table_t *fdt; + pthread_mutexattr_t mattr; + + fdt =3D malloc(sizeof(fd_table_t)); + if (!fdt) + return NULL; + fdt->head =3D NULL; + fdt->tail =3D NULL; + fdt->max_fd =3D 0; + fdt->node_shift =3D node_shift; + fdt->node_mask =3D (1 << node_shift) - 1; + fdt->freelist =3D NULL; + pthread_mutex_init(&fdt->list_lock, NULL); + + pthread_mutexattr_init(&mattr); + pthread_mutexattr_setpshared(&mattr, PTHREAD_PROCESS_SHARED); + pthread_mutex_init(&fdt->lock, &mattr); + return fdt; +} + +static void fd_table_free(fd_table_t *fdt) +{ + if (!fdt) + return; + fdt_node_free_all(fdt); + free(fdt); + return; +} + +static inline fd_event_t *fd_event_alloc(int fd, int epfd, + struct epoll_event *event) +{ + fd_event_t *fde; + uint64_t *qdata; + + fde =3D (fd_event_t *)malloc(sizeof(fd_event_t)); + if (!fde) + return NULL; + + fde->fd =3D fd; + fde->epfd =3D epfd; + fde->epev =3D *event; + fde->events =3D 0; + fde->node =3D NULL; + fde->next =3D NULL; + fde->timestamp =3D 0; + fde->service_key =3D 0; + __atomic_store_n(&fde->outdated, (uint16_t)0, __ATOMIC_RELEASE); + + qdata =3D malloc(DEFAULT_QUEUE_SIZE * sizeof(uint64_t)); + if (!qdata) { + errprint("malloc queue data failed\n"); + goto malloc_error; + } + if (rpal_queue_init(&fde->q, qdata, DEFAULT_QUEUE_SIZE)) { + errprint("fde queue alloc failed, fd: %d\n", fd); + goto init_error; + } + return fde; + +init_error: + free(qdata); +malloc_error: + free(fde); + return NULL; +} + +static inline void fd_event_free(fd_event_t *fde) +{ + uint64_t *qdata; + + if (!fde) + return; + qdata =3D rpal_queue_destroy(&fde->q); + free(qdata); + free(fde); + return; +} + +static void fdt_freelist_insert(fd_table_t *fdt, fd_event_t *fde) +{ + if (!fde) + return; + + pthread_mutex_lock(&fdt->list_lock); + if (fdt->freelist =3D=3D NULL) { + fdt->freelist =3D fde; + } else { + fde->next =3D fdt->freelist; + fdt->freelist =3D fde; + } + pthread_mutex_unlock(&fdt->list_lock); +} + +static void fdt_freelist_forcefree(fd_table_t *fdt, uint64_t service_key) +{ + fd_event_t *prev, *pos, *f_fde; + fdt_node_t *node; + int idx; + + pthread_mutex_lock(&fdt->list_lock); + prev =3D NULL; + pos =3D fdt->freelist; + while (pos) { + idx =3D pos->fd & fdt->node_mask; + node =3D pos->node; + if (pos->service_key =3D=3D service_key) { + __atomic_exchange_n(&node->ref_count[idx], FDE_FREEING, + __ATOMIC_RELAXED); + if (!prev) { + fdt->freelist =3D pos->next; + } else { + prev->next =3D pos->next; + } + f_fde =3D pos; + pos =3D pos->next; + node->events[idx] =3D NULL; + __atomic_store_n(&node->ref_count[idx], -1, + __ATOMIC_RELEASE); + fd_event_free(f_fde); + } else { + prev =3D pos; + pos =3D pos->next; + } + } + pthread_mutex_unlock(&fdt->list_lock); + return; +} + +static void fdt_freelist_lazyfree(fd_table_t *fdt) +{ + fd_event_t *prev, *pos, *f_fde; + fdt_node_t *node; + int idx; + int expected; + + pthread_mutex_lock(&fdt->list_lock); + prev =3D NULL; + pos =3D fdt->freelist; + + while (pos) { + idx =3D pos->fd & fdt->node_mask; + // do lazyfree when ref_count less than 0 + expected =3D FDE_AVAILABLE; + node =3D pos->node; + if (__atomic_compare_exchange_n( + &node->ref_count[idx], &expected, FDE_FREEING, 1, + __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) { + if (!prev) { + fdt->freelist =3D pos->next; + } else { + prev->next =3D pos->next; + } + f_fde =3D pos; + pos =3D pos->next; + 
node->events[idx] =3D NULL; + __atomic_store_n(&node->ref_count[idx], -1, + __ATOMIC_RELEASE); + fd_event_free(f_fde); + } else { + if (expected < 0) { + errprint("error ref: %d, fd: %d\n", expected, + pos->fd); + } + prev =3D pos; + pos =3D pos->next; + } + } + pthread_mutex_unlock(&fdt->list_lock); + return; +} + +static uint16_t fde_timestamp_get(fd_table_t *fdt, int fd) +{ + fdt_node_t *node; + int idx; + + node =3D fdt_node_search(fdt, fd); + if (!node) { + return 0; + } + idx =3D fd & fdt->node_mask; + return node->timestamps[idx]; +} + +static void fd_event_put(fd_table_t *fdt, fd_event_t *fde); + +static fd_event_t *fd_event_get(fd_table_t *fdt, int fd) +{ + fd_event_t *fde =3D NULL; + fdt_node_t *node; + int idx; + int val =3D -1; + int expected; + + node =3D fdt_node_search(fdt, fd); + if (!node) { + return NULL; + } + idx =3D fd & fdt->node_mask; + +retry: + val =3D __atomic_load_n(&node->ref_count[idx], __ATOMIC_ACQUIRE); + if (val < 0) + return NULL; + expected =3D val; + val++; + if (!__atomic_compare_exchange_n(&node->ref_count[idx], &expected, val, + 1, __ATOMIC_SEQ_CST, + __ATOMIC_SEQ_CST)) { + if (expected >=3D 0) { + goto retry; + } else { + return NULL; + } + } + fde =3D node->events[idx]; + if (!fde) { + errprint("error get: %d, fd: %d\n", val, fd); + } else { + if (__atomic_load_n(&fde->outdated, __ATOMIC_ACQUIRE)) { + fd_event_put(fdt, fde); + fde =3D NULL; + } + } + return fde; +} + +static void fd_event_put(fd_table_t *fdt, fd_event_t *fde) +{ + int idx; + int val; + + if (!fde) + return; + + idx =3D fde->fd & fdt->node_mask; + val =3D __atomic_sub_fetch(&fde->node->ref_count[idx], 1, + __ATOMIC_RELEASE); + if (val < 0) { + errprint("error put: %d, fd: %d\n", val, fde->fd); + } + return; +} + +int rpal_access(void *addr, access_fn do_access, int *ret, va_list va); + +int rpal_access(void *addr, access_fn do_access, int *ret, va_list va) +{ + int func_ret; + + func_ret =3D do_access(va); + if (ret) { + *ret =3D func_ret; + } + return RPAL_SUCCESS; +} + +extern status_t rpal_access_warpper(void *addr, access_fn do_access, int *= ret, + va_list va); + +#define rpal_write_access_safety(ACCESS_FUNC, FUNC_RET, ...) = \ + ({ \ + status_t __access =3D RPAL_FAILURE; \ + uint32_t old_pkru =3D 0; \ + old_pkru =3D rdpkru(); \ + __access =3D rpal_read_access_safety(ACCESS_FUNC, FUNC_RET, \ + ##__VA_ARGS__); \ + wrpkru(old_pkru); \ + __access; \ + }) + +status_t rpal_read_access_safety(access_fn do_access, int *ret, ...) 
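/*
 * Inferred behaviour of rpal_read_access_safety() below (a reading of this
 * file, not a statement from the patch): it runs do_access(va_list) with the
 * calling thread's sender error context armed (sc->ec.magic =
 * RPAL_ERROR_MAGIC), presumably so a fault taken while touching the peer
 * service's memory can be unwound through rpal_access_warpper() instead of
 * killing the process.  Threads that have not yet registered as senders are
 * lazily set up via rpal_sender_init().  The rpal_write_access_safety()
 * macro above is the same path with the caller's PKRU saved and restored
 * around the access.
 */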
+{ + rpal_sender_info_t *sender; + sender_context_t *sc; + rpal_error_code_t error; + status_t access =3D RPAL_FAILURE; + va_list args; + + sender =3D current_rpal_sender(); + if (!sender || !rpal_sender_inited(sender)) { + dbprint(RPAL_DEBUG_SENDER, "%s: sender(%d) do not init\n", + __FUNCTION__, getpid()); + if (RPAL_FAILURE =3D=3D rpal_sender_init(&error)) { + return RPAL_FAILURE; + } + sender =3D current_rpal_sender(); + } + sc =3D &sender->sc; + sc->ec.magic =3D RPAL_ERROR_MAGIC; + va_start(args, ret); + access =3D rpal_access_warpper(&(sc->ec.ersp), do_access, ret, args); + va_end(args); + sc->ec.magic =3D 0; + + return access; +} + +static int64_t __do_rpal_uds_fdmap(int service_id, int connfd) +{ + struct rpal_uds_fdmap_arg arg; + int64_t res; + int ret; + + arg.cfd =3D connfd; + arg.service_id =3D service_id; + arg.res =3D &res; + ret =3D rpal_ioctl(RPAL_IOCTL_UDS_FDMAP, (unsigned long)&arg); + if (ret < 0) + return RPAL_FAILURE; + + return res; +} + +static status_t do_rpal_uds_fdmap(va_list va) +{ + int64_t ret; + int sfd, cfd, sid; + struct rpal_thread_pool *srtp; + uint64_t stamp =3D 0; + uint64_t sid_fd; + uint64_t *rpalfd; + fd_event_t *fde; + + sid_fd =3D va_arg(va, uint64_t); + rpalfd =3D va_arg(va, uint64_t *); + + if (!rpalfd) { + return RPAL_FAILURE; + } + sid =3D get_high32(sid_fd); + cfd =3D get_low32(sid_fd); + + ret =3D __do_rpal_uds_fdmap(sid, cfd); + if (ret < 0) { + errprint("%s failed %ld, cfd: %d\n", __FUNCTION__, ret, cfd); + return RPAL_FAILURE; + } + + srtp =3D get_service_from_id(sid); + if (!srtp) { + errprint("%s INVALID service_id: %d\n", __FUNCTION__, sid); + return RPAL_FAILURE; + } + sfd =3D get_sfd(ret); + stamp =3D fde_timestamp_get(srtp->fdt, sfd); + ret |=3D (stamp << HIGH16_OFFSET); + + fde =3D fd_event_get(threads_md.rtp->fdt, cfd); + if (!fde) { + errprint("%s get self fde error, fd: %d\n", __FUNCTION__, cfd); + goto out; + } + fde->service_key =3D srtp->service_key; + fd_event_put(threads_md.rtp->fdt, fde); +out: + *rpalfd =3D ret; + return RPAL_SUCCESS; +} + +int rpal_get_peer_rid(uint64_t sid_fd) +{ + int64_t ret; + int sid, cfd; + int rid; + + sid =3D get_high32(sid_fd); + cfd =3D get_low32(sid_fd); + + ret =3D __do_rpal_uds_fdmap(sid, cfd); + if (ret < 0) { + errprint("%s failed %ld, cfd: %d\n", __FUNCTION__, ret, cfd); + return RPAL_FAILURE; + } + rid =3D get_rid(ret); + return rid; +} + +status_t rpal_uds_fdmap(uint64_t sid_fd, uint64_t *rpalfd) +{ + status_t ret =3D RPAL_FAILURE; + status_t access; + uint32_t old_pkru; + + old_pkru =3D rdpkru(); + wrpkru(old_pkru & RPAL_PKRU_BASE_CODE_READ); + access =3D rpal_read_access_safety(do_rpal_uds_fdmap, &ret, sid_fd, + rpalfd); + wrpkru(old_pkru); + if (access =3D=3D RPAL_FAILURE) { + return RPAL_FAILURE; + } + return ret; +} + +static status_t fd_event_install(fd_table_t *fdt, int fd, int epfd, + struct epoll_event *event) +{ + fdt_node_t *node; + fd_event_t *fde; + int idx; + int expected; + + fde =3D fd_event_alloc(fd, epfd, event); + if (!fde) { + goto fde_error; + } + pthread_mutex_lock(&fdt->lock); + if (fd >=3D fdt->max_fd) { + node =3D fdt_node_expand(fdt, fd); + } else { + node =3D fdt_node_search(fdt, fd); + } + pthread_mutex_unlock(&fdt->lock); + + if (!node) { + errprint("fd node search failed, fd: %d\n", fd); + goto node_error; + } + idx =3D fd & fdt->node_mask; + fdt_freelist_lazyfree(fdt); + expected =3D __atomic_load_n(&node->ref_count[idx], __ATOMIC_ACQUIRE); + if (expected !=3D FDE_FREED) { + goto node_error; + } + fde->timestamp =3D + 
__atomic_add_fetch(&node->timestamps[idx], 1, __ATOMIC_RELEASE); + fde->node =3D node; + node->events[idx] =3D fde; + if (!__atomic_compare_exchange_n(&node->ref_count[idx], &expected, + FDE_AVAILABLE, 1, __ATOMIC_SEQ_CST, + __ATOMIC_SEQ_CST)) { + errprint("may override fd: %d, val: %d\n", fd, expected); + node->events[idx] =3D NULL; + goto node_error; + } + return RPAL_SUCCESS; + +node_error: + fd_event_free(fde); +fde_error: + return RPAL_FAILURE; +} + +static status_t fd_event_uninstall(fd_table_t *fdt, int fd) +{ + fd_event_t *fde; + fdt_node_t *node; + int idx; + int ret =3D RPAL_SUCCESS; + int expected; + + node =3D fdt_node_search(fdt, fd); + if (!node) { + ret =3D RPAL_FAILURE; + goto out; + } + idx =3D fd & fdt->node_mask; + fde =3D node->events[idx]; + if (!fde) { + ret =3D RPAL_FAILURE; + goto out; + } + expected =3D FDE_AVAILABLE; + __atomic_store_n(&fde->outdated, (uint16_t)1, __ATOMIC_RELEASE); + if (__atomic_compare_exchange_n(&node->ref_count[idx], &expected, + FDE_FREEING, 1, __ATOMIC_SEQ_CST, + __ATOMIC_SEQ_CST)) { + node->events[idx] =3D NULL; + __atomic_store_n(&node->ref_count[idx], -1, __ATOMIC_RELEASE); + fd_event_free(fde); + } else { + if (expected < FDE_AVAILABLE) { + errprint("error cnt: %d, fd: %d\n", expected, fde->fd); + } + // link this fde for free_head + fdt_freelist_insert(fdt, fde); + } + +out: + fdt_freelist_lazyfree(fdt); + return ret; +} + +static status_t fd_event_modify(fd_table_t *fdt, int fd, + struct epoll_event *event) +{ + fd_event_t *fde; + + fde =3D fd_event_get(fdt, fd); + if (!fde) { + errprint("fde MOD fd(%d) ERROR!\n", fd); + return RPAL_FAILURE; + } + fde->fd =3D fd; + fde->epev =3D *event; + fde->events =3D 0; + fd_event_put(fdt, fde); + return RPAL_SUCCESS; +} + +static int rpal_receiver_info_create(struct rpal_thread_pool *rtp, int id) +{ + rpal_receiver_info_t *rri =3D &rtp->rris[id]; + + rri->ep_stack =3D fiber_ctx_alloc(NULL, NULL, DEFUALT_STACK_SIZE); + if (!rri->ep_stack) + return -1; + + rri->trampoline =3D fiber_ctx_alloc(NULL, NULL, TRAMPOLINE_SIZE); + if (!rri->trampoline) { + fiber_ctx_free(rri->ep_stack); + return -1; + } + + rri->rc =3D threads_md.rc + id; + rri->rc->receiver_id =3D id; + rri->rtp =3D rtp; + + return 0; +} + +static void rpal_receiver_info_destroy(rpal_receiver_info_t *rri) +{ + fiber_ctx_free(rri->ep_stack); + fiber_ctx_free(rri->trampoline); + return; +} + +static struct rpal_thread_pool *rpal_thread_pool_create(int nr_threads, + rpal_thread_metadata_t *rtm) +{ + void *p; + int i, j; + struct rpal_thread_pool *rtp; + + if (rpal_inited()) + goto out; + rtp =3D malloc(sizeof(struct rpal_thread_pool)); + if (rtp =3D=3D NULL) { + goto out; + } + threads_md.eventfds =3D malloc(nr_threads * sizeof(int)); + if (threads_md.eventfds =3D=3D NULL) { + goto eventfds_alloc_fail; + } + rtp->nr_threads =3D nr_threads; + rtp->pkey =3D -1; + p =3D malloc(nr_threads * sizeof(rpal_receiver_info_t)); + if (p =3D=3D NULL) { + goto rri_alloc_fail; + } + rtp->rris =3D p; + memset(p, 0, nr_threads * sizeof(rpal_receiver_info_t)); + + rtp->fdt =3D fd_table_alloc(DEFAULT_NODE_SHIFT); + if (!rtp->fdt) { + goto fdt_alloc_fail; + } + + p =3D rpal_get_shared_page(rtm->epcpage_order); + + if (!p) + goto page_alloc_fail; + rtm->rc =3D p; + + for (i =3D 0; i < nr_threads; i++) { + if (rpal_receiver_info_create(rtp, i)) { + for (j =3D 0; j < i; j++) { + rpal_receiver_info_destroy(&rtp->rris[j]); + } + goto rri_create_fail; + } + } + return rtp; + +rri_create_fail: + rpal_free_shared_page(rtm->rc, rtm->epcpage_order); +page_alloc_fail: + 
fd_table_free(rtp->fdt); +fdt_alloc_fail: + free(rtp->rris); +rri_alloc_fail: + free(threads_md.eventfds); +eventfds_alloc_fail: + free(rtp); +out: + return NULL; +} + +static void rpal_thread_pool_destory(rpal_thread_metadata_t *rtm) +{ + int i; + struct rpal_thread_pool *rtp; + + if (!rpal_inited()) { + errprint("thread pool is not created.\n"); + return; + } + pthread_mutex_destroy(&release_lock); + rtp =3D threads_md.rtp; + fd_table_free(rtp->fdt); + for (i =3D 0; i < rtp->nr_threads; ++i) { + rpal_receiver_info_destroy(&rtp->rris[i]); + } + rpal_free_shared_page(threads_md.rc, threads_md.epcpage_order); + free(rtp->rris); + free(threads_md.eventfds); + free(rtp); +} + +static inline int rpal_receiver_inited(rpal_receiver_info_t *rri) +{ + if (!rri) + return 0; + return (rri->status !=3D RPAL_RECEIVER_UNINITIALIZED); +} + +static inline int rpal_receiver_available(rpal_receiver_info_t *rri) +{ + return (rri->status =3D=3D RPAL_RECEIVER_AVAILABLE); +} + +static int rpal_receiver_idx_get(void) +{ + return __atomic_fetch_add(&threads_md.rpal_receiver_idx, 1, + __ATOMIC_RELAXED); +} + +int rpal_receiver_init(void) +{ + int ret =3D 0; + int receiver_idx; + rpal_receiver_info_t *rri; + + if (!rpal_inited()) { + errprint("thread pool is not created.\n"); + goto error_out; + } + + receiver_idx =3D rpal_receiver_idx_get(); + if (receiver_idx >=3D threads_md.rtp->nr_threads) { + errprint( + "rpal thread pool size exceeded. thread_idx: %d, thread pool capacity: = %d\n", + receiver_idx, threads_md.rtp->nr_threads); + goto error_out; + } + + rri =3D threads_md.rtp->rris + receiver_idx; + rri->status =3D RPAL_RECEIVER_UNINITIALIZED; + rri->tid =3D syscall(SYS_gettid); + rri->tls_base =3D read_tls_base(); + + rpal_uevent_queue_init(&rri->ueventq, &rri->uqlock); + + rri->rc->rpal_ep_poll_magic =3D 0; + rri->rc->receiver_state =3D RPAL_RECEIVER_STATE_RUNNING; + rri->rc->ep_pending =3D 0; + __atomic_store_n(&rri->rc->sender_state, RPAL_SENDER_STATE_RUNNING, + __ATOMIC_RELAXED); + ret =3D rpal_register_receiver(rri); + if (ret < 0) { + errprint("rpal thread %ld register failed %d\n", rri->tid, ret); + goto error_out; + } + ret =3D eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK); + if (ret < 0) { + errprint("rpal thread %ld eventfd failed %d\n", rri->tid, + errno); + goto eventfd_failed; + } + threads_md.eventfds[receiver_idx] =3D ret; + rri->status =3D RPAL_RECEIVER_INITIALIZED; + return ret; + +eventfd_failed: + rpal_unregister_receiver(); +error_out: + return RPAL_FAILURE; +} + +void rpal_receiver_exit(void) +{ + rpal_receiver_info_t *rri =3D current_rpal_thread(); + int id, fd; + + if (!rpal_receiver_inited(rri)) + return; + rri->status =3D RPAL_RECEIVER_UNINITIALIZED; + id =3D rri->rc->receiver_id; + fd =3D threads_md.eventfds[id]; + close(fd); + threads_md.eventfds[id] =3D 0; + rpal_unregister_receiver(); + return; +} + +static inline void set_task_context(volatile task_context_t *tc, void *src) +{ + fiber_stack_t *fstack =3D src; + tc->r15 =3D fstack->r15; + tc->r14 =3D fstack->r14; + tc->r13 =3D fstack->r13; + tc->r12 =3D fstack->r12; + tc->rbx =3D fstack->rbx; + tc->rbp =3D fstack->rbp; + tc->rip =3D fstack->rip; + tc->rsp =3D (unsigned long)(src + 0x40); +} + +static transfer_t _syscall_epoll_wait(transfer_t t) +{ + rpal_receiver_info_t *rri =3D t.ud; + volatile receiver_context_t *rc =3D rri->rc; + long ret; + + rc->rpal_ep_poll_magic =3D RPAL_EP_POLL_MAGIC; + ret =3D epoll_wait(rc->epfd, rc->ep_events, rc->maxevents, + rc->timeout); + t =3D jump_fcontext(rri->main_ctx, (void *)ret); + return t; +} + 
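/*
 * Sketch of the receiver-side context switching used here, assuming
 * make_fcontext()/jump_fcontext()/ontop_fcontext() follow the usual
 * Boost.Context fcontext semantics (they come from the fiber code elsewhere
 * in this sample, not from this file):
 *
 *   rpal_epoll_wait()
 *     -> ontop_fcontext(trampoline, rri, syscall_epoll_wait)
 *        syscall_epoll_wait() (below) records the caller's context in
 *        rc->task_context and rri->main_ctx, so a sender can later jump
 *        straight onto this receiver's stack; if a sender is parked
 *        mid-call it hands control back to it via rpal_ret_critical(); then
 *     -> ontop_fcontext(rri->ep_stack->fctx, rri, _syscall_epoll_wait)
 *        _syscall_epoll_wait() (above) runs the blocking epoll_wait() on the
 *        dedicated fiber stack and jumps back to the saved main context with
 *        the syscall's return value.
 *
 * Keeping the blocking epoll_wait() on its own stack is what lets a remote
 * sender reuse the receiver's main context during an RPAL call.
 */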
+extern void rpal_ret_critical(volatile receiver_context_t *rc, + rpal_call_info_t *rci); + +static transfer_t syscall_epoll_wait(transfer_t t) +{ + rpal_receiver_info_t *rri =3D t.ud; + volatile receiver_context_t *rc =3D rri->rc; + rpal_call_info_t *rci =3D &rri->rci; + task_t *estk =3D rri->ep_stack; + + set_task_context(&rri->rc->task_context, t.fctx); + rri->main_ctx =3D t.fctx; + + rpal_ret_critical(rc, rci); + + estk->fctx =3D make_fcontext(estk->sp, 0, NULL); + t =3D ontop_fcontext(rri->ep_stack->fctx, rri, _syscall_epoll_wait); + return t; +} + +static inline int ep_kernel_events_available(volatile int *ep_pending) +{ + return (RPAL_KERNEL_PENDING & + __atomic_load_n(ep_pending, __ATOMIC_ACQUIRE)); +} + +static inline int ep_user_events_available(volatile int *ep_pending) +{ + return (RPAL_USER_PENDING & + __atomic_load_n(ep_pending, __ATOMIC_ACQUIRE)); +} + +static inline int rpal_ep_send_events(epoll_uevent_queue_t *uq, fd_table_t= *fdt, + volatile receiver_context_t *rc, + struct epoll_event *events, int maxevents) +{ + int fd =3D -1; + int ret =3D 0; + int res =3D 0; + fd_event_t *fde =3D NULL; + + __atomic_and_fetch(&rc->ep_pending, ~RPAL_USER_PENDING, + __ATOMIC_ACQUIRE); + while (uevent_queue_len(uq) && ret < maxevents) { + fd =3D uevent_queue_del(uq); + if (fd =3D=3D -1) { + errprint("uevent get failed\n"); + continue; + } + fde =3D fd_event_get(fdt, fd); + if (!fde) + continue; + res =3D __atomic_exchange_n(&fde->events, 0, __ATOMIC_RELAXED); + res &=3D fde->epev.events; + if (res) { + events[ret].data =3D fde->epev.data; + events[ret].events =3D res; + ret++; + } + fd_event_put(fdt, fde); + } + if (uevent_queue_len(uq) || ret =3D=3D maxevents) { + dbprint(RPAL_DEBUG_RECVER, + "uevent queue still have events, len: %d, ret: %d, maxevents: %d\n", + uevent_queue_len(uq), ret, maxevents); + __atomic_fetch_or(&rc->ep_pending, RPAL_USER_PENDING, + __ATOMIC_RELAXED); + } + return ret; +} + +extern void rpal_call_critical(volatile receiver_context_t *rc, + rpal_receiver_info_t *rri); + +int rpal_epoll_wait(int epfd, struct epoll_event *events, int maxevents, + int timeout) +{ + transfer_t t; + rpal_call_info_t *rci; + task_t *estk, *trampoline; + volatile receiver_context_t *rc; + epoll_uevent_queue_t *ueventq; + rpal_receiver_info_t *rri =3D current_rpal_thread(); + long ret =3D 0; + unsigned int mxcsr =3D 0, fpucw =3D 0; + + if (!rpal_receiver_inited(rri)) + return epoll_wait(epfd, events, maxevents, timeout); + + rc =3D rri->rc; + estk =3D rri->ep_stack; + trampoline =3D rri->trampoline; + rci =3D &rri->rci; + ueventq =3D &rri->ueventq; + + rc->epfd =3D epfd; + rc->ep_events =3D events; + rc->maxevents =3D maxevents; + rc->timeout =3D timeout; + + if (!rpal_receiver_available(rri)) { + rri->status =3D RPAL_RECEIVER_AVAILABLE; + estk->fctx =3D make_fcontext(estk->sp, 0, NULL); + SAVE_FPU(mxcsr, fpucw); + trampoline->fctx =3D make_fcontext(trampoline->sp, 0, NULL); + t =3D ontop_fcontext(trampoline->fctx, rri, syscall_epoll_wait); + } else { + // kernel pending events + if (ep_kernel_events_available(&rc->ep_pending)) { + rc->rpal_ep_poll_magic =3D + RPAL_EP_POLL_MAGIC; // clear KERNEL_PENDING + ret =3D epoll_wait(epfd, events, maxevents, 0); + rc->rpal_ep_poll_magic =3D 0; + goto send_user_events; + } + // user pending events + if (ep_user_events_available(&rc->ep_pending)) { + goto send_user_events; + } + SAVE_FPU(mxcsr, fpucw); + trampoline->fctx =3D make_fcontext(trampoline->sp, 0, NULL); + t =3D ontop_fcontext(trampoline->fctx, rri, syscall_epoll_wait); + } + 
rc->rpal_ep_poll_magic =3D 0; + + /* + * Here is where sender starts after user context switch. + * The TLS may still be sender's. We should not do anything + * that may use TLS, otherwise the result cannot be controlled. + */ + + switch (rc->receiver_state & RPAL_RECEIVER_STATE_MASK) { + case RPAL_RECEIVER_STATE_RUNNING: // syscall kernel ret + ret =3D (long)t.ud; + break; + case RPAL_RECEIVER_STATE_KERNEL_RET: // receiver kernel ret + RESTORE_FPU(mxcsr, fpucw); + ret =3D (long)t.fctx; + break; + case RPAL_RECEIVER_STATE_CALL: // rpalcall user jmp + rci->sender_tls_base =3D read_tls_base(); + rci->pkru =3D rdpkru(); + write_tls_base(rri->tls_base); + wrpkru(rpal_pkey_to_pkru(rri->rtp->pkey)); + rci->sender_fctx =3D t.fctx; + break; + default: + errprint("Error ep_status: %ld\n", + rc->receiver_state & RPAL_RECEIVER_STATE_MASK); + return -1; + } + +send_user_events: + if (ret < maxevents && ret >=3D 0) + ret +=3D rpal_ep_send_events(ueventq, rri->rtp->fdt, rc, + events + ret, maxevents - ret); + return ret; +} + +int rpal_epoll_wait_user(int epfd, struct epoll_event *events, int maxeven= ts, + int timeout) +{ + volatile receiver_context_t *rc; + epoll_uevent_queue_t *ueventq; + rpal_receiver_info_t *rri =3D current_rpal_thread(); + + if (!rpal_receiver_inited(rri)) + return 0; + + if (!rpal_receiver_available(rri)) + return 0; + + rc =3D rri->rc; + ueventq =3D &rri->ueventq; + if (ep_user_events_available(&rc->ep_pending)) { + return rpal_ep_send_events(ueventq, rri->rtp->fdt, rc, events, + maxevents); + } + return 0; +} + +int rpal_epoll_ctl(int epfd, int op, int fd, struct epoll_event *event) +{ + fd_table_t *fdt; + int ret; + + ret =3D epoll_ctl(epfd, op, fd, event); + if (ret || !rpal_inited()) { + return ret; + } + fdt =3D threads_md.rtp->fdt; + switch (op) { + case EPOLL_CTL_ADD: + if (event->events & EPOLLRPALINOUT_BITS) { + ret =3D fd_event_install(fdt, fd, epfd, event); + if (ret =3D=3D RPAL_FAILURE) + goto install_error; + } + break; + case EPOLL_CTL_MOD: + fd_event_modify(fdt, fd, event); + break; + case EPOLL_CTL_DEL: + fd_event_uninstall(fdt, fd); + break; + } + return ret; +install_error: + epoll_ctl(epfd, EPOLL_CTL_DEL, fd, event); + return RPAL_FAILURE; +} + +static transfer_t set_fcontext(transfer_t t) +{ + sender_context_t *sc =3D t.ud; + + set_task_context(&sc->task_context, t.fctx); + return t; +} + +static void uq_lock(volatile uint64_t *uqlock, uint64_t key) +{ + uint64_t init =3D 0; + + while (1) { + if (__atomic_compare_exchange_n( + uqlock, &init, (1UL << 63 | key), 1, + __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) + return; + asm volatile("rep; nop"); + init =3D 0; + } +} + +static void uq_unlock(volatile uint64_t *uqlock) +{ + __atomic_store_n(uqlock, (uint64_t)0, __ATOMIC_RELAXED); +} + +static status_t do_rpal_call_jump(rpal_sender_info_t *rsi, + rpal_receiver_info_t *rri, + volatile receiver_context_t *rc) +{ + int desired, expected; + int64_t diff; + +WAKE_AGAIN: + desired =3D RPAL_BUILD_CALL_STATE(rsi->sc.sender_id, + threads_md.service_id); + expected =3D RPAL_RECEIVER_STATE_WAIT; + if (__atomic_compare_exchange_n(&rc->receiver_state, &expected, desired, = 1, + __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) { + __atomic_store_n(&rc->sender_state, RPAL_SENDER_STATE_CALL, + __ATOMIC_RELAXED); + rsi->sc.start_time =3D _rdtsc(); + ontop_fcontext(rri->main_ctx, &rsi->sc, set_fcontext); + + if (__atomic_load_n(&rc->sender_state, __ATOMIC_RELAXED) =3D=3D + RPAL_SENDER_STATE_RUNNING) { + if (rc->receiver_state =3D=3D RPAL_RECEIVER_STATE_LAZY_SWITCH) + read(-1, NULL, 0); + diff =3D 
_rdtsc() - rsi->sc.start_time; + rsi->sc.total_time +=3D diff; + rri->rc->total_time +=3D diff; + expected =3D desired; + desired =3D RPAL_RECEIVER_STATE_WAIT; + __atomic_compare_exchange_n(&rc->receiver_state, &expected, + desired, 1, + __ATOMIC_SEQ_CST, + __ATOMIC_SEQ_CST); + + if (ep_user_events_available(&rc->ep_pending)) { + goto WAKE_AGAIN; + } + } + dbprint(RPAL_DEBUG_SENDER, "app return: 0x%x, %d, %d\n", + rc->receiver_state, rc->sender_state, sfd); + } + return RPAL_SUCCESS; +} + +static inline void set_fde_trigger(fd_event_t *fde) +{ + __atomic_store_n(&fde->wait, FDE_TRIGGER_OUT, __ATOMIC_RELEASE); + return; +} + +static inline int clear_fde_trigger(fd_event_t *fde) +{ + int expected =3D FDE_TRIGGER_OUT; + + return __atomic_compare_exchange_n(&fde->wait, &expected, + FDE_NO_TRIGGER, 1, __ATOMIC_SEQ_CST, + __ATOMIC_SEQ_CST); +} + +static int do_rpal_call(va_list va) +{ + rpal_sender_info_t *rsi; + rpal_receiver_info_t *rri; + fd_event_t *fde; + volatile receiver_context_t *rc; + struct rpal_thread_pool *srtp; + uint16_t stamp; + uint8_t rid; + int sfd; + int ret =3D 0; + int fall =3D 0; + int pkey; + + int service_id =3D va_arg(va, int); + uint64_t rpalfd =3D va_arg(va, uint64_t); + int64_t *ptrs =3D va_arg(va, int64_t *); + int len =3D va_arg(va, int); + int flags =3D va_arg(va, int); + + rsi =3D current_rpal_sender(); + if (!rsi) { + ret =3D RPAL_INVAL_THREAD; + goto ERROR; + } + srtp =3D get_service_from_id(service_id); + if (!srtp) { + ret =3D RPAL_INVAL_SERVICE; + goto ERROR; + } + pkey =3D get_service_pkey_from_id(service_id); + + rid =3D get_rid(rpalfd); + sfd =3D get_sfd(rpalfd); + wrpkru(rpal_pkru_union(rdpkru(), rpal_pkey_to_pkru(pkey))); + rri =3D srtp->rris + rid; + if (!rri) { + errprint("INVALID rid: %u, rri is NULL\n", rid); + ret =3D RPAL_INVALID_ARG; + goto ERROR; + } + rc =3D rri->rc; + rsi->sc.ec.tls_base =3D rri->tls_base; + + fde =3D fd_event_get(srtp->fdt, sfd); + if (!fde) { + ret =3D RPAL_INVALID_ARG; + goto ERROR; + } + stamp =3D get_fdtimestamp(rpalfd); + if (fde->timestamp !=3D stamp) { + ret =3D RPAL_FDE_OUTDATED; + goto FDE_PUT; + } + + uq_lock(&rri->uqlock, threads_md.service_key); + if (uevent_queue_len(&rri->ueventq) =3D=3D MAX_RDY) { + errprint("rdylist is full: [%u, %u]\n", rri->ueventq.l_beg, + rri->ueventq.l_end); + ret =3D RPAL_CACHE_FULL; + goto UNLOCK; + } + if (likely(flags & RCALL_IN)) { + if (unlikely(rpal_queue_unused(&fde->q) < (uint32_t)len)) { + set_fde_trigger(fde); + fall =3D 1; + /* fall through: try to put data to queue */ + } + ret =3D rpal_queue_put(&fde->q, ptrs, len); + if (ret !=3D len) { + errprint("fde queue put error: %d, data: %lx\n", ret, + (unsigned long)fde->q.data); + ret =3D RPAL_QUEUE_PUT_FAILED; + goto UNLOCK; + } + if (unlikely(fall)) { + clear_fde_trigger(fde); + } + fde->events |=3D EPOLLRPALIN; + } else if (unlikely(flags & RCALL_OUT)) { + ret =3D 0; + fde->events |=3D EPOLLRPALOUT; + } else { + errprint("rpal call failed, ptrs: %lx, len: %d", + (unsigned long)ptrs, len); + ret =3D RPAL_INVALID_ARG; + goto UNLOCK; + } + + uevent_queue_add(&rri->ueventq, sfd); + uq_unlock(&rri->uqlock); + fd_event_put(srtp->fdt, fde); + + __atomic_fetch_or(&rc->ep_pending, RPAL_USER_PENDING, + __ATOMIC_RELEASE); + do_rpal_call_jump(rsi, rri, rc); + return ret; + +UNLOCK: + uq_unlock(&rri->uqlock); +FDE_PUT: + fd_event_put(srtp->fdt, fde); +ERROR: + return -ret; +} + +static int __rpal_write_ptrs_common(int service_id, uint64_t rpalfd, + int64_t *ptrs, int len, int flags) +{ + int ret =3D RPAL_FAILURE; + status_t access =3D 
RPAL_FAILURE; + + if (unlikely(NULL =3D=3D ptrs)) { + dbprint(RPAL_DEBUG_SENDER, "%s: ptrs is NULL\n", __FUNCTION__); + return -RPAL_INVALID_ARG; + } + if (unlikely(len <=3D 0 || ((uint32_t)len) > DEFAULT_QUEUE_SIZE)) { + dbprint(RPAL_DEBUG_SENDER, + "%s: data len less than or equal to zero\n", + __FUNCTION__); + return -RPAL_INVALID_ARG; + } + + access =3D rpal_write_access_safety(do_rpal_call, &ret, service_id, + rpalfd, ptrs, len, flags); + if (access =3D=3D RPAL_FAILURE) { + return -RPAL_ERR_PEER_MEM; + } + return ret; +} + +int rpal_write_ptrs(int service_id, uint64_t rpalfd, int64_t *ptrs, int le= n) +{ + return __rpal_write_ptrs_common(service_id, rpalfd, ptrs, len, + RCALL_IN); +} + +int rpal_read_ptrs(int fd, int64_t *dptrs, int len) +{ + fd_event_t *fde; + fd_table_t *fdt =3D threads_md.rtp->fdt; + int ret; + + if (!rpal_inited()) + return -1; + + fde =3D fd_event_get(fdt, fd); + if (!fde) + return -1; + + ret =3D rpal_queue_get(&fde->q, dptrs, len); + fd_event_put(fdt, fde); + return ret; +} + +int rpal_read_ptrs_trigger_out(int fd, int64_t *dptrs, int len, int servic= e_id, + uint64_t rpalfd) +{ + fd_event_t *fde; + fd_table_t *fdt =3D threads_md.rtp->fdt; + int access, ret =3D -1; + int nread; + + if (!rpal_inited()) + return -1; + + fde =3D fd_event_get(fdt, fd); + if (!fde) + return -1; + + nread =3D rpal_queue_get(&fde->q, dptrs, len); + if (nread > 0 && clear_fde_trigger(fde)) { + access =3D + rpal_write_access_safety(do_rpal_call, &ret, service_id, + rpalfd, NULL, 0, RCALL_OUT); + if (access =3D=3D RPAL_FAILURE || ret < 0) { + set_fde_trigger(fde); + errprint( + "trigger out failed! access: %d, ret: %d, id: %d, rpalfd: %lx\n", + access, ret, service_id, rpalfd); + } + } + fd_event_put(fdt, fde); + + return nread; +} + +static inline int pkey_is_invalid(const int pkey) +{ + return (pkey < 0 || pkey > 15); +} + +static status_t rpal_thread_metadata_init(int nr_rpalthread, + rpal_error_code_t *error) +{ + uint64_t key; + struct rpal_thread_pool *rtp; + key =3D __rpal_get_service_key(); + if (key >=3D 1UL << 63) { + ERRREPORT( + error, RPAL_ERR_SERVICE_KEY, + "rpal service key error. 
Service key: 0x%lx, oeverflow, should less tha= n 2^63\n", + key); + goto error_out; + } + threads_md.service_key =3D key; + threads_md.service_id =3D __rpal_get_service_id(); + pthread_mutex_init(&release_lock, NULL); + rpal_get_critical_addr(&rcs); + rtp =3D rpal_thread_pool_create(nr_rpalthread, &threads_md); + if (rtp =3D=3D NULL) { + goto error_out; + } + rtp->service_key =3D threads_md.service_key; + rtp->service_id =3D threads_md.service_id; + threads_md.rtp =3D rtp; + if (rpal_enable_service(error) =3D=3D RPAL_FAILURE) + goto destroy_thread_pool; + threads_md.pid =3D getpid(); + return RPAL_SUCCESS; + +destroy_thread_pool: + rpal_thread_pool_destory(&threads_md); +error_out: + return RPAL_FAILURE; +} + +static void rpal_thread_metadata_exit(void) +{ + rpal_disable_service(); + rpal_thread_pool_destory(&threads_md); +} + +static status_t rpal_senders_metadata_init(rpal_error_code_t *error) +{ + if (senders_md) { + ERRREPORT(error, RPAL_ERR_SENDERS_METADATA, + "senders metadata is already initialized.\n"); + return RPAL_FAILURE; + } + + senders_md =3D malloc(sizeof(struct rpal_senders_metadata)); + if (!senders_md) { + ERRREPORT(error, RPAL_ERR_NOMEM, + "senders metadata alloc failed.\n"); + goto sendes_alloc_failed; + } + senders_md->sdpage_order =3D SENDERS_PAGE_ORDER; + memset(senders_md->bitmap, 0xFF, + sizeof(unsigned long) * BITS_TO_LONGS(MAX_SENDERS)); + pthread_mutex_init(&senders_md->lock, NULL); + senders_md->senders =3D rpal_get_shared_page(senders_md->sdpage_order); + if (!senders_md->senders) { + ERRREPORT(error, RPAL_ERR_SENDER_PAGES, + "get senders share page error.\n"); + goto pages_alloc_failed; + } + dbprint(RPAL_DEBUG_MANAGEMENT, "senders pages addr: 0x%016lx\n", + (unsigned long)senders_md->senders); + return RPAL_SUCCESS; + +pages_alloc_failed: + free(senders_md); +sendes_alloc_failed: + return RPAL_FAILURE; +} + +static void rpal_senders_metadata_exit(void) +{ + if (!senders_md) + return; + + rpal_free_shared_page((void *)senders_md->senders, + senders_md->sdpage_order); + pthread_mutex_destroy(&senders_md->lock); + free(senders_md); +} + +static int rpal_get_version_cap(rpal_capability_t *version) +{ + return rpal_ioctl(RPAL_IOCTL_GET_API_VERSION_AND_CAP, + (unsigned long)version); +} + +static status_t rpal_version_check(rpal_capability_t *ver) +{ + if (ver->compat_version !=3D MIN_RPAL_KERNEL_API_VERSION) + return RPAL_FAILURE; + if (ver->api_version < TARGET_RPAL_KERNEL_API_VERSION) + return RPAL_FAILURE; + return RPAL_SUCCESS; +} + +static status_t rpal_capability_check(rpal_capability_t *ver) +{ + unsigned long cap =3D ver->cap; + + if (!(cap & (1 << RPAL_CAP_PKU))) { + return RPAL_FAILURE; + } + return RPAL_SUCCESS; +} + +static status_t rpal_check_version_cap(rpal_error_code_t *error) +{ + int ret; + + ret =3D rpal_get_version_cap(&version); + if (ret < 0) { + ERRREPORT(error, RPAL_ERR_GET_CAP_VERSION, + "rpal get version failed: %d\n", ret); + ret =3D RPAL_FAILURE; + goto out; + } + ret =3D rpal_version_check(&version); + if (ret =3D=3D RPAL_FAILURE) { + ERRREPORT( + error, RPAL_KERNEL_API_NOTSUPPORT, + "kernel rpal(version: %d-%d) API is not compatible with librpal(version= : %d-%d)\n", + version.compat_version, version.api_version, + MIN_RPAL_KERNEL_API_VERSION, + TARGET_RPAL_KERNEL_API_VERSION); + goto out; + } + ret =3D rpal_capability_check(&version); + if (ret =3D=3D RPAL_FAILURE) { + ERRREPORT(error, RPAL_HARDWARE_NOTSUPPORT, + "hardware do not support RPAL\n"); + goto out; + } +out: + return ret; +} + +static status_t 
rpal_mgtfd_init(rpal_error_code_t *error) +{ + int err, n; + int mgtfd; + char name[1024]; + + mgtfd =3D open(RPAL_MGT_FILE, O_RDWR); + if (mgtfd =3D=3D -1) { + err =3D errno; + switch (err) { + case EPERM: + n =3D readlink("/proc/self/exe", name, sizeof(name) - 1); + if (n < 0) { + n =3D 0; + } + name[n] =3D 0; + errprint("%s is not a RPAL binary\n", name); + break; + case ENOENT: + errprint("Not in RPAL Environment\n"); + break; + default: + errprint("open %s fail, %d, %s\n", RPAL_MGT_FILE, err, + strerror(err)); + } + if (error) { + *error =3D RPAL_ERR_RPALFILE_OPS; + } + return RPAL_FAILURE; + } + rpal_mgtfd =3D mgtfd; + return RPAL_SUCCESS; +} + +static void rpal_mgtfd_destroy(void) +{ + if (rpal_mgtfd !=3D -1) { + close(rpal_mgtfd); + } + return; +} + +#define RPAL_SECTION_SIZE (512 * 1024 * 1024 * 1024UL) + +static inline status_t rpal_check_address(uint64_t start, uint64_t end, + uint64_t check) +{ + if (check >=3D start && check < end) { + return RPAL_SUCCESS; + } + return RPAL_FAILURE; +} + +static status_t rpal_managment_init(rpal_error_code_t *error) +{ + int i =3D 0; + + if (rpal_mgtfd_init(error) =3D=3D RPAL_FAILURE) { + goto mgtfd_init_failed; + } + if (pthread_key_create(&rpal_key, NULL)) + goto rpal_key_failed; + + for (i =3D 0; i < MAX_SERVICEID; i++) { + requested_services[i].key =3D 0; + requested_services[i].service =3D NULL; + requested_services[i].pkey =3D -1; + } + if (rpal_check_version_cap(error) =3D=3D RPAL_FAILURE) { + goto rpal_check_failed; + } + return RPAL_SUCCESS; + +rpal_check_failed: + pthread_key_delete(rpal_key); +rpal_key_failed: + rpal_mgtfd_destroy(); +mgtfd_init_failed: + return RPAL_FAILURE; +} + +static void rpal_managment_exit(void) +{ + pthread_key_delete(rpal_key); + rpal_mgtfd_destroy(); + return; +} + +int rpal_init(int nr_rpalthread, int flags, rpal_error_code_t *error) +{ + if (nr_rpalthread <=3D 0) { + dbprint(RPAL_DEBUG_MANAGEMENT, + "%s: nr_rpalthread(%d) less than or equal to 0\n", + __FUNCTION__, nr_rpalthread); + return RPAL_FAILURE; + } + if (rpal_managment_init(error) =3D=3D RPAL_FAILURE) { + goto error_out; + } + if (rpal_thread_metadata_init(nr_rpalthread, error) =3D=3D RPAL_FAILURE) + goto managment_exit; + + if (rpal_senders_metadata_init(error) =3D=3D RPAL_FAILURE) + goto thread_md_exit; + + inited =3D 1; + dbprint(RPAL_DEBUG_MANAGEMENT, + "rpal init success, service key: 0x%lx, service id: %d, " + "critical_start: 0x%016lx, critical_end: 0x%016lx\n", + threads_md.service_key, threads_md.service_id, rcs.ret_begin, + rcs.ret_end); + return rpal_mgtfd; + +thread_md_exit: + rpal_thread_metadata_exit(); +managment_exit: + rpal_managment_exit(); +error_out: + return RPAL_FAILURE; +} + +void rpal_exit(void) +{ + if (rpal_inited()) { + dbprint(RPAL_DEBUG_MANAGEMENT, + "rpal exit, service key: 0x%lx, service id: %d\n", + threads_md.service_key, threads_md.service_id); + rpal_senders_metadata_exit(); + rpal_thread_metadata_exit(); + rpal_managment_exit(); + } +} diff --git a/samples/rpal/librpal/rpal.h b/samples/rpal/librpal/rpal.h new file mode 100644 index 000000000000..e91a206b8370 --- /dev/null +++ b/samples/rpal/librpal/rpal.h @@ -0,0 +1,149 @@ +#ifndef RPAL_H_INCLUDED +#define RPAL_H_INCLUDED + +#ifdef __cplusplus +#if __cplusplus +extern "C" { +#endif +#endif /* __cplusplus */ + +#include +#include +#include + +typedef enum rpal_error_code { + RPAL_ERR_NONE =3D 0, + RPAL_ERR_BAD_ARG =3D 1, + RPAL_ERR_NO_SERVICE =3D 2, + RPAL_ERR_MAPPED =3D 3, + RPAL_ERR_RETRY =3D 4, + RPAL_ERR_BAD_SERVICE_STATUS =3D 5, + 
RPAL_ERR_BAD_THREAD_STATUS =3D 6, + RPAL_ERR_REACH_LIMIT =3D 7, + RPAL_ERR_NOMEM =3D 8, + RPAL_ERR_NOMAPPING =3D 9, + RPAL_ERR_INVAL =3D 10, + + RPAL_ERR_KERNEL_MAX_CODE =3D 100, + + RPAL_ERR_RPALFILE_OPS, /**< Failed to open /proc/self/rpal */ + RPAL_ERR_RPAL_DISABLED, + RPAL_ERR_GET_CAP_VERSION, + RPAL_KERNEL_API_NOTSUPPORT, + RPAL_HARDWARE_NOTSUPPORT, + RPAL_ERR_SERVICE_KEY, /**< Failed to get service key */ + RPAL_ERR_SENDERS_METADATA, + RPAL_ERR_ENABLE_SERVICE, + RPAL_ERR_SENDER_PAGES, + RPAL_DONT_INITED, + RPAL_ERR_SENDER_INIT, + RPAL_ERR_SENDER_REG, + RPAL_INVALID_ARG, + RPAL_CACHE_FULL, + RPAL_FDE_OUTDATED, + RPAL_QUEUE_PUT_FAILED, + RPAL_ERR_PEER_MEM, + RPAL_ERR_NOTIFY_RECVER, + RPAL_INVAL_THREAD, + RPAL_INVAL_SERVICE, +} rpal_error_code_t; + +#define EPOLLRPALIN 0x00020000 +#define EPOLLRPALOUT 0x00040000 + +typedef enum rpal_features { + RPAL_SENDER_RECEIVER =3D 0x1 << 0, +} rpal_features_t; + +typedef enum status { + RPAL_FAILURE =3D -1, /**< return value indicating failure */ + RPAL_SUCCESS /**< return value indicating success */ +} status_t; + +#define RPAL_PUBLIC __attribute__((visibility("default"))) + +RPAL_PUBLIC +int rpal_init(int nr_rpalthread, int flags, rpal_error_code_t *error); + +RPAL_PUBLIC +void rpal_exit(void); + +RPAL_PUBLIC +int rpal_receiver_init(void); + +RPAL_PUBLIC +void rpal_receiver_exit(void); + +RPAL_PUBLIC +int rpal_request_service(uint64_t key); + +RPAL_PUBLIC +status_t rpal_release_service(uint64_t key); + +RPAL_PUBLIC +status_t rpal_clean_service_start(int64_t *ptr); + +RPAL_PUBLIC +void rpal_clean_service_end(int64_t *ptr); + +RPAL_PUBLIC +int rpal_get_service_id(void); + +RPAL_PUBLIC +status_t rpal_get_service_key(uint64_t *service_key); + +RPAL_PUBLIC +int rpal_get_request_service_id(uint64_t key); + +RPAL_PUBLIC +status_t rpal_uds_fdmap(uint64_t sid_fd, uint64_t *rpalfd); + +RPAL_PUBLIC +int rpal_get_peer_rid(uint64_t sid_fd); + +RPAL_PUBLIC +status_t rpal_sender_init(rpal_error_code_t *error); + +RPAL_PUBLIC +status_t rpal_sender_exit(void); + +/* Hook epoll syscall */ +RPAL_PUBLIC +int rpal_epoll_wait(int epfd, struct epoll_event *events, int maxevents, + int timeout); + +RPAL_PUBLIC +int rpal_epoll_wait_user(int epfd, struct epoll_event *events, int maxeven= ts, + int timeout); + +RPAL_PUBLIC +int rpal_epoll_ctl(int epfd, int op, int fd, struct epoll_event *event); + +RPAL_PUBLIC +status_t rpal_copy_prepare(int service_id); + +RPAL_PUBLIC +status_t rpal_copy_finish(void); + +RPAL_PUBLIC +int rpal_write_ptrs(int service_id, uint64_t rpalfd, int64_t *ptrs, int le= n); + +RPAL_PUBLIC +int rpal_read_ptrs(int fd, int64_t *ptrs, int len); + +typedef int (*access_fn)(va_list args); +RPAL_PUBLIC +status_t rpal_read_access_safety(access_fn do_access, int *do_access_ret, = ...); + +RPAL_PUBLIC +void rpal_recver_count_print(void); + +RPAL_PUBLIC +void rpal_sender_count_print(void); + +#ifdef __cplusplus +#if __cplusplus +} +#endif +#endif +#endif //!_RPAL_H_INCLUDED diff --git a/samples/rpal/librpal/rpal_pkru.h b/samples/rpal/librpal/rpal_p= kru.h new file mode 100644 index 000000000000..9590aa7203bb --- /dev/null +++ b/samples/rpal/librpal/rpal_pkru.h @@ -0,0 +1,78 @@ +#include +#include "private.h" + +#define RPAL_PKRU_BASE_CODE_READ 0xAAAAAAAA +#define RPAL_PKRU_BASE_CODE 0xFFFFFFFF +#define RPAL_NO_PKEY -1 + +typedef uint32_t u32; +/* + * extern __inline unsigned int + * __attribute__((__gnu_inline__, __always_inline__, __artificial__)) + * _rdpkru_u32 (void) + * { + * return __builtin_ia32_rdpkru (); + * } + * + * extern __inline void + * 
__attribute__((__gnu_inline__, __always_inline__, __artificial__)) + * _wrpkru (unsigned int __key) + * { + * __builtin_ia32_wrpkru (__key); + * } + */ +// #define rdpkru _rdpkru_u32 +// #define wrpkru _wrpkru +static inline uint32_t rdpkru(void) +{ + uint32_t ecx =3D 0; + uint32_t edx, pkru; + + /* + * "rdpkru" instruction. Places PKRU contents in to EAX, + * clears EDX and requires that ecx=3D0. + */ + asm volatile(".byte 0x0f,0x01,0xee\n\t" + : "=3Da"(pkru), "=3Dd"(edx) + : "c"(ecx)); + return pkru; +} + +static inline void wrpkru(uint32_t pkru) +{ + uint32_t ecx =3D 0, edx =3D 0; + + /* + * "wrpkru" instruction. Loads contents in EAX to PKRU, + * requires that ecx =3D edx =3D 0. + */ + asm volatile(".byte 0x0f,0x01,0xef\n\t" + : + : "a"(pkru), "c"(ecx), "d"(edx)); +} + +static inline u32 rpal_pkey_to_pkru(int pkey) +{ + int offset =3D pkey * 2; + u32 mask =3D 0x3 << offset; + + return RPAL_PKRU_BASE_CODE & ~mask; +} + +static inline u32 rpal_pkey_to_pkru_read(int pkey) +{ + int offset =3D pkey * 2; + u32 mask =3D 0x3 << offset; + + return RPAL_PKRU_BASE_CODE_READ & ~mask; +} + +static inline u32 rpal_pkru_union(u32 pkru0, u32 pkru1) +{ + return pkru0 & pkru1; +} + +static inline u32 rpal_pkru_intersect(u32 pkru0, u32 pkru1) +{ + return pkru0 | pkru1; +} diff --git a/samples/rpal/librpal/rpal_queue.c b/samples/rpal/librpal/rpal_= queue.c new file mode 100644 index 000000000000..07a90122aa16 --- /dev/null +++ b/samples/rpal/librpal/rpal_queue.c @@ -0,0 +1,239 @@ +#include "rpal_queue.h" + +#include +#include +#include +#include + +#define min(X, Y) ({ ((X) > (Y)) ? (Y) : (X); }) + +static unsigned int roundup_pow_of_two(unsigned int data) +{ + unsigned int msb_position; + + if (data <=3D 1) + return 1; + if (!(data & (data - 1))) + return data; + + msb_position =3D 31 - __builtin_clz(data); + assert(msb_position < 31); + return 1 << (msb_position + 1); +} + +QUEUE_UINT rpal_queue_unused(rpal_queue_t *q) +{ + return (q->mask + 1) - (q->tail - q->head); +} + +QUEUE_UINT rpal_queue_len(rpal_queue_t *q) +{ + return (q->tail - q->head); +} + +int rpal_queue_init(rpal_queue_t *q, void *data, QUEUE_UINT_INC usize) +{ + QUEUE_UINT_INC size; + if (usize > QUEUE_UINT_MAX || !data) { + return -1; + } + size =3D roundup_pow_of_two(usize); + if (usize !=3D size) { + return -1; + } + q->data =3D data; + memset(q->data, 0, size * sizeof(int64_t)); + q->head =3D 0; + q->tail =3D 0; + q->mask =3D size - 1; + return 0; +} + +void *rpal_queue_destroy(rpal_queue_t *q) +{ + void *data =3D q->data; + if (q->data) { + q->data =3D NULL; + } + q->mask =3D 0; + q->head =3D 0; + q->tail =3D 0; + return data; +} + +int rpal_queue_alloc(rpal_queue_t *q, QUEUE_UINT_INC size) +{ + assert(q && size); + if (size > QUEUE_UINT_MAX) { + return -1; + } + size =3D roundup_pow_of_two(size); + q->data =3D malloc(size * sizeof(int64_t)); + if (!q->data) + return -1; + memset(q->data, 0, size * sizeof(int64_t)); + q->head =3D 0; + q->tail =3D 0; + q->mask =3D size - 1; + return 0; +} + +void rpal_queue_free(rpal_queue_t *q) +{ + if (q->data) { + free(q->data); + q->data =3D NULL; + } + q->mask =3D 0; + q->head =3D 0; + q->tail =3D 0; +} + +static void rpal_queue_copy_in(rpal_queue_t *q, const int64_t *buf, + QUEUE_UINT_INC len, QUEUE_UINT off) +{ + QUEUE_UINT_INC l; + QUEUE_UINT_INC size =3D q->mask + 1; + + off &=3D q->mask; + l =3D min(len, size - off); + + memcpy(q->data + off, buf, l << 3); + memcpy(q->data, buf + l, (len - l) << 3); + asm volatile("" : : : "memory"); +} + +QUEUE_UINT_INC rpal_queue_put(rpal_queue_t *q, 
const int64_t *buf, + QUEUE_UINT_INC len) +{ + QUEUE_UINT_INC l; + + if (!q->data) { + return 0; + } + l =3D rpal_queue_unused(q); + if (len > l) { + return 0; + } + l =3D len; + rpal_queue_copy_in(q, buf, l, q->tail); + q->tail +=3D l; + return l; +} + +static QUEUE_UINT_INC rpal_queue_copy_out(rpal_queue_t *q, int64_t *buf, + QUEUE_UINT_INC len, QUEUE_UINT head) +{ + unsigned int l; + QUEUE_UINT tail; + QUEUE_UINT off; + QUEUE_UINT_INC size =3D q->mask + 1; + + tail =3D __atomic_load_n(&q->tail, __ATOMIC_RELAXED); + len =3D min((QUEUE_UINT)(tail - head), len); + if (head =3D=3D tail) + return 0; + off =3D head & q->mask; + l =3D min(len, size - off); + + memcpy(buf, q->data + off, l << 3); + memcpy(buf + l, q->data, (len - l) << 3); + + return len; +} + +QUEUE_UINT_INC rpal_queue_peek(rpal_queue_t *q, int64_t *buf, + QUEUE_UINT_INC len, QUEUE_UINT *phead) +{ + QUEUE_UINT_INC copied; + QUEUE_UINT head; + + head =3D __atomic_load_n(&q->head, __ATOMIC_RELAXED); + copied =3D rpal_queue_copy_out(q, buf, len, head); + if (phead) { + *phead =3D head; + } + return copied; +} + +QUEUE_UINT_INC rpal_queue_skip(rpal_queue_t *q, QUEUE_UINT head, + QUEUE_UINT_INC skip) +{ + if (skip > rpal_queue_len(q)) { + return 0; + } + if (__atomic_compare_exchange_n(&q->head, &head, head + skip, 1, + __ATOMIC_RELAXED, __ATOMIC_RELAXED)) { + return skip; + } + return 0; +} + +QUEUE_UINT_INC rpal_queue_get(rpal_queue_t *q, int64_t *buf, QUEUE_UINT_IN= C len) +{ + QUEUE_UINT_INC copied; + QUEUE_UINT head; + + while (1) { + head =3D __atomic_load_n(&q->head, __ATOMIC_RELAXED); + copied =3D rpal_queue_copy_out(q, buf, len, head); + if (__atomic_compare_exchange_n(&q->head, &head, head + copied, + 1, __ATOMIC_RELAXED, + __ATOMIC_RELAXED)) { + return copied; + } + } +} + +void rpal_uevent_queue_init(epoll_uevent_queue_t *ueventq, + volatile uint64_t *uqlock) +{ + int i; + __atomic_store_n(uqlock, (uint64_t)0, __ATOMIC_RELAXED); + ueventq->l_beg =3D 0; + ueventq->l_end =3D 0; + ueventq->l_end_cache =3D 0; + for (i =3D 0; i < MAX_RDY; ++i) { + ueventq->fds[i] =3D -1; + } + return; +} + +QUEUE_UINT uevent_queue_len(epoll_uevent_queue_t *ueventq) +{ + return (ueventq->l_end - ueventq->l_beg); +} + +QUEUE_UINT uevent_queue_add(epoll_uevent_queue_t *ueventq, int fd) +{ + unsigned int pos; + if (uevent_queue_len(ueventq) =3D=3D MAX_RDY) + return MAX_RDY; + pos =3D __sync_fetch_and_add(&ueventq->l_end_cache, 1); + pos %=3D MAX_RDY; + ueventq->fds[pos] =3D fd; + asm volatile("" : : : "memory"); + __sync_fetch_and_add(&ueventq->l_end, 1); + return (pos); +} + +int uevent_queue_del(epoll_uevent_queue_t *ueventq) +{ + int fd =3D -1; + int pos; + if (uevent_queue_len(ueventq) =3D=3D 0) { + return -1; + } + pos =3D ueventq->l_beg % MAX_RDY; + fd =3D ueventq->fds[pos]; + asm volatile("" : : : "memory"); + __sync_fetch_and_add(&ueventq->l_beg, 1); + return fd; +} + +int uevent_queue_fix(epoll_uevent_queue_t *ueventq) +{ + __atomic_store_n(&ueventq->l_end_cache, ueventq->l_end, + __ATOMIC_SEQ_CST); + return 0; +} diff --git a/samples/rpal/librpal/rpal_queue.h b/samples/rpal/librpal/rpal_= queue.h new file mode 100644 index 000000000000..224e7b449d50 --- /dev/null +++ b/samples/rpal/librpal/rpal_queue.h @@ -0,0 +1,55 @@ +#ifndef RPAL_QUEUE_H +#define RPAL_QUEUE_H + +#include + +// typedef uint8_t QUEUE_UINT; +// typedef uint16_t QUEUE_UINT_INC; +// #define QUEUE_UINT_MAX UINT8_MAX + +// typedef uint16_t QUEUE_UINT; +// typedef uint32_t QUEUE_UINT_INC; +// #define QUEUE_UINT_MAX UINT16_MAX + +typedef uint32_t QUEUE_UINT; +typedef 
uint64_t QUEUE_UINT_INC; +#define QUEUE_UINT_MAX UINT32_MAX + +typedef struct rpal_queue { + QUEUE_UINT head; + QUEUE_UINT tail; + QUEUE_UINT mask; + uint64_t *data; +} rpal_queue_t; + +QUEUE_UINT rpal_queue_len(rpal_queue_t *q); +QUEUE_UINT rpal_queue_unused(rpal_queue_t *q); +int rpal_queue_init(rpal_queue_t *q, void *data, QUEUE_UINT_INC usize); +void *rpal_queue_destroy(rpal_queue_t *q); +int rpal_queue_alloc(rpal_queue_t *q, QUEUE_UINT_INC size); +void rpal_queue_free(rpal_queue_t *q); +QUEUE_UINT_INC rpal_queue_put(rpal_queue_t *q, const int64_t *buf, + QUEUE_UINT_INC len); +QUEUE_UINT_INC rpal_queue_get(rpal_queue_t *q, int64_t *buf, + QUEUE_UINT_INC len); +QUEUE_UINT_INC rpal_queue_peek(rpal_queue_t *q, int64_t *buf, + QUEUE_UINT_INC len, QUEUE_UINT *phead); +QUEUE_UINT_INC rpal_queue_skip(rpal_queue_t *q, QUEUE_UINT head, + QUEUE_UINT_INC skip); + +#define MAX_RDY 4096 +typedef struct epoll_uevent_queue { + int fds[MAX_RDY]; + volatile QUEUE_UINT l_beg; + volatile QUEUE_UINT l_end; + volatile QUEUE_UINT l_end_cache; +} epoll_uevent_queue_t; + +void rpal_uevent_queue_init(epoll_uevent_queue_t *ueventq, + volatile uint64_t *uqlock); +QUEUE_UINT uevent_queue_len(epoll_uevent_queue_t *ueventq); +QUEUE_UINT uevent_queue_add(epoll_uevent_queue_t *ueventq, int fd); +int uevent_queue_del(epoll_uevent_queue_t *ueventq); +int uevent_queue_fix(epoll_uevent_queue_t *ueventq); + +#endif diff --git a/samples/rpal/librpal/rpal_x86_64_call_ret.S b/samples/rpal/lib= rpal/rpal_x86_64_call_ret.S new file mode 100644 index 000000000000..a7c09a1b033d --- /dev/null +++ b/samples/rpal/librpal/rpal_x86_64_call_ret.S @@ -0,0 +1,45 @@ +#ifdef __x86_64__ +#define __ASSEMBLY__ +#include "asm_define.h" +#define RPAL_SENDER_STATE_RUNNING $0x0 +#define RPAL_SENDER_STATE_CALL $0x1 + +.text +.globl rpal_ret_critical +.type rpal_ret_critical,@function +.align 16 + +//void rpal_ret_criticalreceiver_context_t *rc, rpal_call_info_t *rci + +rpal_ret_critical: + mov RPAL_SENDER_STATE_CALL, %eax + mov RPAL_SENDER_STATE_RUNNING, %ecx + lock cmpxchg %ecx, RC_SENDER_STATE(%rdi) +ret_begin: + jne 2f + movq RCI_PKRU(%rsi), %rax + xor %edx, %edx + .byte 0x0f,0x01,0xef + movq RCI_SENDER_TLS_BASE(%rsi), %rax + wrfsbase %rax +ret_end: + movq RCI_SENDER_FCTX(%rsi), %rdi + call jump_fcontext@plt +2: + ret + +.globl rpal_get_critical_addr +.type rpal_get_critical_addr,@function +.align 16 +rpal_get_critical_addr: + leaq ret_begin(%rip), %rax + movq %rax, RET_BEGIN(%rdi) + leaq ret_end(%rip), %rax + movq %rax, RET_END(%rdi) + ret + +.size rpal_ret_critical,.-rpal_ret_critical + +/* Mark that we don't need executable stack. 
*/ +.section .note.GNU-stack,"",%progbits +#endif diff --git a/samples/rpal/offset.sh b/samples/rpal/offset.sh new file mode 100755 index 000000000000..f5ae77b893e8 --- /dev/null +++ b/samples/rpal/offset.sh @@ -0,0 +1,5 @@ +#!/bin/bash + +set -e +CUR_DIR=3D$(dirname $(realpath -s "$0")) +gcc -masm=3Dintel -S $CUR_DIR/asm_define.c -o - | awk '($1 =3D=3D "->") { = print "#define " $2 " " $3 }' > $CUR_DIR/librpal/asm_define.h \ No newline at end of file diff --git a/samples/rpal/server.c b/samples/rpal/server.c new file mode 100644 index 000000000000..82c5c9dec922 --- /dev/null +++ b/samples/rpal/server.c @@ -0,0 +1,249 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include "librpal/rpal.h" + +#define SOCKET_PATH "/tmp/rpal_socket" +#define MAX_EVENTS 10 +#define BUFFER_SIZE 1025 +#define MSG_LEN 32 + +#define INIT_MSG "INIT" +#define SUCC_MSG "SUCC" +#define FAIL_MSG "FAIL" + +#define handle_error(s) = \ + do { \ + perror(s); \ + exit(EXIT_FAILURE); \ + } while (0) + +uint64_t service_key; +int server_fd; +int epoll_fd; + +int rpal_epoll_add(int epfd, int fd) +{ + struct epoll_event ev; + + ev.events =3D EPOLLRPALIN | EPOLLIN | EPOLLRDHUP | EPOLLET; + ev.data.fd =3D fd; + + return rpal_epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev); +} + +void rpal_server_init(int fd, int epoll_fd) +{ + char buffer[BUFFER_SIZE]; + rpal_error_code_t err; + uint64_t remote_key, service_key; + int remote_id; + int proc_fd; + int ret; + + proc_fd =3D rpal_init(1, 0, &err); + if (proc_fd < 0) + handle_error("rpal init fail"); + rpal_get_service_key(&service_key); + + rpal_epoll_add(epoll_fd, fd); + + ret =3D read(fd, buffer, BUFFER_SIZE); + if (ret < 0) + handle_error("rpal init: read"); + + if (strncmp(buffer, INIT_MSG, strlen(INIT_MSG)) !=3D 0) { + buffer[BUFFER_SIZE - 1] =3D 0; + handle_error("Invalid msg\n"); + return; + } + + remote_key =3D *(uint64_t *)(buffer + strlen(INIT_MSG)); + ret =3D rpal_request_service(remote_key); + if (ret) { + uint64_t service_key =3D 0; + ret =3D write(fd, (char *)&service_key, sizeof(uint64_t)); + handle_error("request service fail"); + return; + } + ret =3D write(fd, (char *)&service_key, sizeof(uint64_t)); + if (ret < 0) + handle_error("write error"); + + ret =3D read(fd, buffer, BUFFER_SIZE); + if (ret < 0) + handle_error("handshake read"); + + if (strncmp(SUCC_MSG, buffer, strlen(SUCC_MSG)) !=3D 0) + handle_error("handshake"); + + remote_id =3D rpal_get_request_service_id(remote_key); + if (remote_id < 0) + handle_error("remote id get fail"); + rpal_receiver_init(); +} + +void run_rpal_server(int msg_len) +{ + struct epoll_event ev, events[MAX_EVENTS]; + int new_socket; + int nfds; + uint64_t tsc, total_tsc =3D 0; + int count =3D 0; + + while (1) { + nfds =3D rpal_epoll_wait(epoll_fd, events, MAX_EVENTS, -1); + if (nfds =3D=3D -1) { + perror("epoll_wait"); + exit(EXIT_FAILURE); + } + + for (int n =3D 0; n < nfds; ++n) { + if (events[n].data.fd =3D=3D server_fd) { + new_socket =3D accept(server_fd, NULL, NULL); + if (new_socket =3D=3D -1) { + perror("accept"); + continue; + } + + rpal_server_init(new_socket, epoll_fd); + } else if (events[n].events & EPOLLRDHUP) { + close(events[n].data.fd); + goto finish; + } else if (events[n].events & EPOLLRPALIN) { + char buffer[BUFFER_SIZE] =3D { 0 }; + + ssize_t valread =3D rpal_read_ptrs( + events[n].data.fd, (int64_t *)buffer, + MSG_LEN / sizeof(int64_t *)); + if (valread <=3D 0) { + close(events[n].data.fd); + epoll_ctl(epoll_fd, EPOLL_CTL_DEL, + events[n].data.fd, NULL); + goto finish; + } else { + 
count++; + sscanf(buffer, "0x%016lx", &tsc); + total_tsc +=3D __rdtsc() - tsc; + send(events[n].data.fd, buffer, msg_len, + 0); + } + } else { + perror("bad request\n"); + } + } + } +finish: + printf("RPAL: Message length: %d bytes, Total TSC cycles: %lu, " + "Message count: %d, Average latency: %lu cycles\n", + MSG_LEN, total_tsc, count, total_tsc / count); +} + +void run_server(int msg_len) +{ + struct epoll_event ev, events[MAX_EVENTS]; + int new_socket; + int nfds; + uint64_t tsc, total_tsc =3D 0; + int count =3D 0; + + while (1) { + nfds =3D epoll_wait(epoll_fd, events, MAX_EVENTS, -1); + if (nfds =3D=3D -1) { + perror("epoll_wait"); + exit(EXIT_FAILURE); + } + + for (int n =3D 0; n < nfds; ++n) { + if (events[n].data.fd =3D=3D server_fd) { + new_socket =3D accept(server_fd, NULL, NULL); + if (new_socket =3D=3D -1) { + perror("accept"); + continue; + } + + ev.events =3D EPOLLIN; + ev.data.fd =3D new_socket; + if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, + new_socket, &ev) =3D=3D -1) { + close(new_socket); + perror("epoll_ctl: add new socket"); + } + } else if (events[n].events & EPOLLRDHUP) { + close(events[n].data.fd); + goto finish; + } else { + char buffer[BUFFER_SIZE] =3D { 0 }; + + ssize_t valread =3D read(events[n].data.fd, + buffer, BUFFER_SIZE); + if (valread <=3D 0) { + close(events[n].data.fd); + epoll_ctl(epoll_fd, EPOLL_CTL_DEL, + events[n].data.fd, NULL); + goto finish; + } else { + count++; + sscanf(buffer, "0x%016lx", &tsc); + total_tsc +=3D __rdtsc() - tsc; + send(events[n].data.fd, buffer, msg_len, + 0); + } + } + } + } +finish: + printf("EPOLL: Message length: %d bytes, Total TSC cycles: %lu, " + "Message count: %d, Average latency: %lu cycles\n", + MSG_LEN, total_tsc, count, total_tsc / count); +} + +int main() +{ + struct sockaddr_un address; + struct epoll_event ev; + + if ((server_fd =3D socket(AF_UNIX, SOCK_STREAM, 0)) =3D=3D 0) { + perror("socket failed"); + exit(EXIT_FAILURE); + } + + memset(&address, 0, sizeof(address)); + address.sun_family =3D AF_UNIX; + strncpy(address.sun_path, SOCKET_PATH, sizeof(SOCKET_PATH)); + + if (bind(server_fd, (struct sockaddr *)&address, sizeof(address)) < 0) { + perror("bind failed"); + exit(EXIT_FAILURE); + } + + if (listen(server_fd, 3) < 0) { + perror("listen"); + exit(EXIT_FAILURE); + } + + epoll_fd =3D epoll_create(1024); + if (epoll_fd =3D=3D -1) { + perror("epoll_create"); + exit(EXIT_FAILURE); + } + + ev.events =3D EPOLLIN; + ev.data.fd =3D server_fd; + if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, server_fd, &ev) =3D=3D -1) { + perror("epoll_ctl: listen_sock"); + exit(EXIT_FAILURE); + } + + run_server(MSG_LEN); + run_rpal_server(MSG_LEN); + + close(server_fd); + unlink(SOCKET_PATH); + return 0; +} --=20 2.20.1
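To see how the sender-side API declared in samples/rpal/librpal/rpal.h hangs
together, here is a minimal, illustrative sketch. It is not part of the patch:
peer_key and connfd stand in for values exchanged during the handshake (as the
server sample above does over the unix socket), and the sid_fd packing of the
service id in the high 32 bits and the connected fd in the low 32 bits is
inferred from the get_high32()/get_low32() helpers used by the library.

	#include <stdint.h>
	#include <stdio.h>
	#include "librpal/rpal.h"

	/* Illustrative sender flow; peer_key and connfd come from the handshake. */
	int send_one_message(uint64_t peer_key, int connfd, int64_t *ptrs, int len)
	{
		rpal_error_code_t err;
		uint64_t sid_fd, rpalfd;
		int sid, ret;

		if (rpal_init(1, 0, &err) < 0)		/* open the RPAL management fd */
			return -1;
		if (rpal_request_service(peer_key))	/* map the peer service in */
			goto out_exit;
		if (rpal_sender_init(&err) == RPAL_FAILURE)
			goto out_release;

		sid = rpal_get_request_service_id(peer_key);
		if (sid < 0)
			goto out_sender;

		/* Assumed packing: service id in the high 32 bits, local
		 * connected fd in the low 32 bits. */
		sid_fd = ((uint64_t)(uint32_t)sid << 32) | (uint32_t)connfd;
		if (rpal_uds_fdmap(sid_fd, &rpalfd) != RPAL_SUCCESS)
			goto out_sender;

		/* Queue the pointers for the receiver and possibly jump into it. */
		ret = rpal_write_ptrs(sid, rpalfd, ptrs, len);
		if (ret < 0)
			fprintf(stderr, "rpal_write_ptrs failed: %d\n", ret);

		rpal_sender_exit();
		rpal_release_service(peer_key);
		rpal_exit();
		return ret;

	out_sender:
		rpal_sender_exit();
	out_release:
		rpal_release_service(peer_key);
	out_exit:
		rpal_exit();
		return -1;
	}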