From nobody Fri Nov 29 21:39:06 2024
From: Neeraj Upadhyay
Subject: [RFC 1/6] percpu-refcount: Add managed mode for RCU released objects
Date: Mon, 16 Sep 2024 10:38:06 +0530
Message-ID: <20240916050811.473556-2-Neeraj.Upadhyay@amd.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240916050811.473556-1-Neeraj.Upadhyay@amd.com>
References: <20240916050811.473556-1-Neeraj.Upadhyay@amd.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Add a new "managed mode" to percpu refcounts, to track the initial
reference drop for refs whose objects are reclaimed after an RCU grace
period. The typical usage pattern for such refs is:

// Called with elevated refcount
get()
	p = get_ptr();
	kref_get(&p->count);
	return p;

get()
	rcu_read_lock();
	p = get_ptr();
	if (p && !kref_get_unless_zero(&p->count))
		p = NULL;
	rcu_read_unlock();
	return p;

release()
	remove_ptr(p);
	call_rcu(&p->rcu, freep);

release()
	remove_ptr(p);
	kfree_rcu(p, rcu);

Currently, percpu ref requires users to call percpu_ref_kill() when the
object enters its shutdown phase. After the kill operation, ref
increments and decrements are performed on an atomic counter. For cases
where the ref is still actively acquired and released after
percpu_ref_kill(), percpu ref does not provide any performance benefit
over a plain atomic reference counter.

Managed mode offloads tracking of the ref kill to a manager thread, so
users are no longer required to call percpu_ref_kill() explicitly. This
avoids the suboptimal performance of a percpu ref that is actively
acquired and released after the percpu_ref_kill() operation.

A percpu ref can be initialized as managed either at percpu_ref_init()
time, by passing the PERCPU_REF_REL_MANAGED flag, or later, by switching
a reinitable ref to managed mode with percpu_ref_switch_to_managed().
The deferred switch to managed mode is useful for cases like module
initialization errors, where an initialized percpu ref's initial
reference is dropped before the object becomes active and is referenced
by other contexts. One such case is AppArmor labels which are not yet
associated with a namespace. These labels are freed without waiting for
an RCU grace period, so managed mode cannot be used for them until their
initialization has completed.

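For illustration only (not part of this patch), a rough sketch of what a
user of managed mode could look like. struct my_obj, my_obj_release() and
my_obj_create() are made-up names for this example; only the percpu_ref
API calls and the PERCPU_REF_REL_MANAGED flag come from this series:

#include <linux/percpu-refcount.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct my_obj {
	struct percpu_ref ref;
	struct rcu_head rcu;
};

/* Release callback, invoked once the last reference is dropped. */
static void my_obj_release(struct percpu_ref *ref)
{
	struct my_obj *obj = container_of(ref, struct my_obj, ref);

	/* Managed mode expects the object to be reclaimed via RCU. */
	kfree_rcu(obj, rcu);
}

static struct my_obj *my_obj_create(gfp_t gfp)
{
	struct my_obj *obj = kzalloc(sizeof(*obj), gfp);

	if (!obj)
		return NULL;
	/* Managed ref: no explicit percpu_ref_kill() at shutdown. */
	if (percpu_ref_init(&obj->ref, my_obj_release,
			    PERCPU_REF_REL_MANAGED, gfp)) {
		kfree(obj);
		return NULL;
	}
	return obj;
}

Readers then use percpu_ref_tryget()/percpu_ref_put() as usual; dropping
the initial reference with percpu_ref_put() is enough, and the manager
worker detects the final drop and invokes the release callback.
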
Following are the allowed initialization modes for a managed ref:

              Atomic   Percpu   Dead   Reinit   Managed

Managed-ref     Y        N       Y       Y         Y

Following are the allowed transitions for a managed ref:

To -->    A    P    P(RI)   M    D    D(RI)  D(RI/M)  KLL  REI  RES
A         y    n      y     y    n      y       y      y    y    y
P         n    n      n     n    y      n       n      y    n    n
M         n    n      n     y    n      n       y      n    y    y
P(RI)     y    n      y     y    n      y       y      y    y    y
D(RI)     y    n      y     y    n      y       y      -    y    y
D(RI/M)   n    n      n     y    n      n       y      -    y    y

Modes:
  A       - Atomic
  P       - PerCPU
  M       - Managed
  P(RI)   - PerCPU with ReInit
  D(RI)   - Dead with ReInit
  D(RI/M) - Dead with ReInit and Managed

PerCPU Ref Ops:
  KLL - Kill
  REI - Reinit
  RES - Resurrect

Once a percpu ref is switched to managed mode, it cannot be switched to
any other active mode. On reinit/resurrect, a managed ref is
reinitialized in managed mode.

Signed-off-by: Neeraj Upadhyay
---
 .../admin-guide/kernel-parameters.txt |  12 +
 include/linux/percpu-refcount.h       |  13 +
 lib/percpu-refcount.c                 | 358 +++++++++++++++++-
 3 files changed, 364 insertions(+), 19 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 09126bb8cc9f..0f02a1b04fe9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4665,6 +4665,18 @@
 			allocator.
 			This parameter is primarily for debugging and performance comparison.
 
+	percpu_refcount.max_scan_count= [KNL]
+			Specifies the maximum number of percpu ref nodes which
+			are processed in one run of percpu ref manager thread.
+
+			Default: 100
+
+	percpu_refcount.scan_interval= [KNL]
+			Specifies the duration (ms) between two runs of manager
+			thread.
+
+			Default: 5000 ms
+
 	pirq=		[SMP,APIC] Manual mp-table setup
 			See Documentation/arch/x86/i386/IO-APIC.rst.
 
diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index d73a1c08c3e3..e6aea81b3d01 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -68,6 +68,11 @@ enum {
 	__PERCPU_REF_FLAG_BITS	= 2,
 };
 
+/* Auxiliary flags */
+enum {
+	__PERCPU_REL_MANAGED	= 1LU << 0,	/* operating in managed mode */
+};
+
 /* @flags for percpu_ref_init() */
 enum {
 	/*
@@ -90,6 +95,10 @@ enum {
 	 * Allow switching from atomic mode to percpu mode.
 	 */
 	PERCPU_REF_ALLOW_REINIT	= 1 << 2,
+	/*
+	 * Manage release of the percpu ref.
+	 */
+	PERCPU_REF_REL_MANAGED	= 1 << 3,
 };
 
 struct percpu_ref_data {
@@ -100,6 +109,9 @@ struct percpu_ref_data {
 	bool			allow_reinit:1;
 	struct rcu_head		rcu;
 	struct percpu_ref	*ref;
+	unsigned int		aux_flags;
+	struct llist_node	node;
+
 };
 
 struct percpu_ref {
@@ -126,6 +138,7 @@ void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_switch);
 void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref);
 void percpu_ref_switch_to_percpu(struct percpu_ref *ref);
+int percpu_ref_switch_to_managed(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_kill);
 void percpu_ref_resurrect(struct percpu_ref *ref);
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 668f6aa6a75d..7b97f9728c5b 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -5,6 +5,9 @@
 #include
 #include
 #include
+#include
+#include
+#include
 #include
 #include
 
@@ -38,6 +41,7 @@
 
 static DEFINE_SPINLOCK(percpu_ref_switch_lock);
 static DECLARE_WAIT_QUEUE_HEAD(percpu_ref_switch_waitq);
+static LLIST_HEAD(percpu_ref_manage_head);
 
 static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
 {
@@ -45,6 +49,8 @@ static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
 		(ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
 }
 
+int percpu_ref_switch_to_managed(struct percpu_ref *ref);
+
 /**
  * percpu_ref_init - initialize a percpu refcount
  * @ref: percpu_ref to initialize
@@ -80,6 +86,9 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
 		return -ENOMEM;
 	}
 
+	if (flags & PERCPU_REF_REL_MANAGED)
+		flags |= PERCPU_REF_ALLOW_REINIT;
+
 	data->force_atomic = flags & PERCPU_REF_INIT_ATOMIC;
 	data->allow_reinit = flags & PERCPU_REF_ALLOW_REINIT;
 
@@ -101,10 +110,73 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
 	data->confirm_switch = NULL;
 	data->ref = ref;
 	ref->data = data;
+	init_llist_node(&data->node);
+
+	if (flags & PERCPU_REF_REL_MANAGED)
+		percpu_ref_switch_to_managed(ref);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(percpu_ref_init);
 
+static bool percpu_ref_is_managed(struct percpu_ref *ref)
+{
+	return (ref->data->aux_flags & __PERCPU_REL_MANAGED) != 0;
+}
+
+static void __percpu_ref_switch_mode(struct percpu_ref *ref,
+				     percpu_ref_func_t *confirm_switch);
+
+static int __percpu_ref_switch_to_managed(struct percpu_ref *ref)
+{
+	unsigned long __percpu *percpu_count;
+	struct percpu_ref_data *data;
+	int ret = -1;
+
+	data = ref->data;
+
+	if (WARN_ONCE(!percpu_ref_tryget(ref), "Percpu ref is not active"))
+		return ret;
+
+	if (WARN_ONCE(!data->allow_reinit, "Percpu ref does not allow switch"))
+		goto err_switch_managed;
+
+	if (WARN_ONCE(percpu_ref_is_managed(ref), "Percpu ref is already managed"))
+		goto err_switch_managed;
+
+	data->aux_flags |= __PERCPU_REL_MANAGED;
+	data->force_atomic = false;
+	if (!__ref_is_percpu(ref, &percpu_count))
+		__percpu_ref_switch_mode(ref, NULL);
+	/* Ensure ordering of percpu mode switch and node scan */
+	smp_mb();
+	llist_add(&data->node, &percpu_ref_manage_head);
+
+	return 0;
+
+err_switch_managed:
+	percpu_ref_put(ref);
+	return ret;
+}
+
+/**
+ * percpu_ref_switch_to_managed - Switch an unmanaged ref to managed mode.
+ *
+ * @ref: percpu_ref to switch to managed mode
+ *
+ */
+int percpu_ref_switch_to_managed(struct percpu_ref *ref)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	ret = __percpu_ref_switch_to_managed(ref);
+	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(percpu_ref_switch_to_managed);
+
 static void __percpu_ref_exit(struct percpu_ref *ref)
 {
 	unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
@@ -283,6 +355,27 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
 		__percpu_ref_switch_to_percpu(ref);
 }
 
+static bool __percpu_ref_switch_to_atomic_checked(struct percpu_ref *ref,
+						  percpu_ref_func_t *confirm_switch,
+						  bool check_managed)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	if (check_managed && WARN_ONCE(percpu_ref_is_managed(ref),
+			"Percpu ref is managed, cannot switch to atomic mode")) {
+		spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+		return false;
+	}
+
+	ref->data->force_atomic = true;
+	__percpu_ref_switch_mode(ref, confirm_switch);
+
+	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+
+	return true;
+}
+
 /**
  * percpu_ref_switch_to_atomic - switch a percpu_ref to atomic mode
  * @ref: percpu_ref to switch to atomic mode
@@ -306,17 +399,16 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
 void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_switch)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
-
-	ref->data->force_atomic = true;
-	__percpu_ref_switch_mode(ref, confirm_switch);
-
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	(void)__percpu_ref_switch_to_atomic_checked(ref, confirm_switch, true);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic);
 
+static void __percpu_ref_switch_to_atomic_sync_checked(struct percpu_ref *ref, bool check_managed)
+{
+	if (!__percpu_ref_switch_to_atomic_checked(ref, NULL, check_managed))
+		return;
+	wait_event(percpu_ref_switch_waitq, !ref->data->confirm_switch);
+}
 /**
  * percpu_ref_switch_to_atomic_sync - switch a percpu_ref to atomic mode
  * @ref: percpu_ref to switch to atomic mode
@@ -327,11 +419,28 @@ EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic);
  */
 void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref)
 {
-	percpu_ref_switch_to_atomic(ref, NULL);
-	wait_event(percpu_ref_switch_waitq, !ref->data->confirm_switch);
+	__percpu_ref_switch_to_atomic_sync_checked(ref, true);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic_sync);
 
+static void __percpu_ref_switch_to_percpu_checked(struct percpu_ref *ref, bool check_managed)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+
+	if (check_managed && WARN_ONCE(percpu_ref_is_managed(ref),
+			"Percpu ref is managed, cannot switch to percpu mode")) {
+		spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+		return;
+	}
+
+	ref->data->force_atomic = false;
+	__percpu_ref_switch_mode(ref, NULL);
+
+	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+}
+
 /**
  * percpu_ref_switch_to_percpu - switch a percpu_ref to percpu mode
  * @ref: percpu_ref to switch to percpu mode
@@ -352,14 +461,7 @@ EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic_sync);
  */
 void percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
-
-	ref->data->force_atomic = false;
-	__percpu_ref_switch_mode(ref, NULL);
-
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	__percpu_ref_switch_to_percpu_checked(ref, true);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_switch_to_percpu);
 
@@ -472,8 +574,226 @@ void percpu_ref_resurrect(struct percpu_ref *ref)
 
 	ref->percpu_count_ptr &= ~__PERCPU_REF_DEAD;
 	percpu_ref_get(ref);
-	__percpu_ref_switch_mode(ref, NULL);
+	if (percpu_ref_is_managed(ref)) {
+		ref->data->aux_flags &= ~__PERCPU_REL_MANAGED;
+		__percpu_ref_switch_to_managed(ref);
+	} else {
+		__percpu_ref_switch_mode(ref, NULL);
+	}
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_resurrect);
+
+#define DEFAULT_SCAN_INTERVAL_MS 5000
+/* Interval duration between two ref scans. */
+static ulong scan_interval = DEFAULT_SCAN_INTERVAL_MS;
+module_param(scan_interval, ulong, 0444);
+
+#define DEFAULT_MAX_SCAN_COUNT 100
+/* Number of percpu refs scanned in one iteration of worker execution. */
+static int max_scan_count = DEFAULT_MAX_SCAN_COUNT;
+module_param(max_scan_count, int, 0444);
+
+static void percpu_ref_release_work_fn(struct work_struct *work);
+
+/*
+ * Sentinel llist nodes for lockless list traversal and deletions by
+ * the pcpu ref release worker, while nodes are added from
+ * percpu_ref_init() and percpu_ref_switch_to_managed().
+ *
+ * Sentinel node marks the head of list traversal for the current
+ * iteration of kworker execution.
+ */
+struct percpu_ref_sen_node {
+	bool inuse;
+	struct llist_node node;
+};
+
+/*
+ * We need two sentinel nodes for lockless list manipulations from release
+ * worker - first node will be used in current reclaim iteration. The second
+ * node will be used in next iteration. Next iteration marks the first node
+ * as free, for use in subsequent iteration.
+ */
+#define PERCPU_REF_SEN_NODES_COUNT 2
+
+/* Track last processed percpu ref node */
+static struct llist_node *last_percpu_ref_node;
+
+static struct percpu_ref_sen_node
+	percpu_ref_sen_nodes[PERCPU_REF_SEN_NODES_COUNT];
+
+static DECLARE_DELAYED_WORK(percpu_ref_release_work, percpu_ref_release_work_fn);
+
+static bool percpu_ref_is_sen_node(struct llist_node *node)
+{
+	return &percpu_ref_sen_nodes[0].node <= node &&
+	       node <= &percpu_ref_sen_nodes[PERCPU_REF_SEN_NODES_COUNT - 1].node;
+}
+
+static struct llist_node *percpu_ref_get_sen_node(void)
+{
+	int i;
+	struct percpu_ref_sen_node *sn;
+
+	for (i = 0; i < PERCPU_REF_SEN_NODES_COUNT; i++) {
+		sn = &percpu_ref_sen_nodes[i];
+		if (!sn->inuse) {
+			sn->inuse = true;
+			return &sn->node;
+		}
+	}
+
+	return NULL;
+}
+
+static void percpu_ref_put_sen_node(struct llist_node *node)
+{
+	struct percpu_ref_sen_node *sn = container_of(node, struct percpu_ref_sen_node, node);
+
+	sn->inuse = false;
+	init_llist_node(node);
+}
+
+static void percpu_ref_put_all_sen_nodes_except(struct llist_node *node)
+{
+	int i;
+
+	for (i = 0; i < PERCPU_REF_SEN_NODES_COUNT; i++) {
+		if (&percpu_ref_sen_nodes[i].node == node)
+			continue;
+		percpu_ref_sen_nodes[i].inuse = false;
+		init_llist_node(&percpu_ref_sen_nodes[i].node);
+	}
+}
+
+static struct workqueue_struct *percpu_ref_release_wq;
+
+static void percpu_ref_release_work_fn(struct work_struct *work)
+{
+	struct llist_node *pos, *first, *head, *prev, *next;
+	struct llist_node *sen_node;
+	struct percpu_ref *ref;
+	int count = 0;
+	bool held;
+
+	first = READ_ONCE(percpu_ref_manage_head.first);
+	if (!first)
+		goto queue_release_work;
+
+	/*
+	 * Enqueue a dummy node to mark the start of scan. This dummy
+	 * node is used as the start point of the scan and ensures that
+	 * there is no additional synchronization required with new
+	 * label node additions to the llist. Any new labels will
+	 * be processed in the next run of the kworker.
+	 *
+	 *          SCAN START PTR
+	 *                |
+	 *                v
+	 *   +----------+     +------+      +------+      +------+
+	 *   |          |     |      |      |      |      |      |
+	 *   |   head   |---->| dummy|----->| label|----->| label|--->NULL
+	 *   |          |     | node |      |      |      |      |
+	 *   +----------+     +------+      +------+      +------+
+	 *
+	 *
+	 * New label addition:
+	 *
+	 *                           SCAN START PTR
+	 *                                 |
+	 *                                 v
+	 *   +----------+     +------+     +------+      +------+      +------+
+	 *   |          |     |      |     |      |      |      |      |      |
+	 *   |   head   |---->| label|---->| dummy|----->| label|----->| label|--->NULL
+	 *   |          |     |      |     | node |      |      |      |      |
+	 *   +----------+     +------+     +------+      +------+      +------+
+	 *
+	 */
+	if (last_percpu_ref_node == NULL || last_percpu_ref_node->next == NULL) {
+retry_sentinel_get:
+		sen_node = percpu_ref_get_sen_node();
+		/*
+		 * All sentinel nodes are in use? This should not happen, as we
+		 * require only one sentinel for the start of list traversal and
+		 * the other sentinel node is freed during the traversal.
+		 */
+		if (WARN_ONCE(!sen_node, "All sentinel nodes are in use")) {
+			/* Use first node as the sentinel node */
+			head = first->next;
+			if (!head) {
+				struct llist_node *ign_node = NULL;
+				/*
+				 * We exhausted sentinel nodes. However, there aren't
+				 * enough nodes in the llist. So, we have leaked
+				 * sentinel nodes. Reclaim sentinels and retry.
+				 */
+				if (percpu_ref_is_sen_node(first))
+					ign_node = first;
+				percpu_ref_put_all_sen_nodes_except(ign_node);
+				goto retry_sentinel_get;
+			}
+			prev = first;
+		} else {
+			llist_add(sen_node, &percpu_ref_manage_head);
+			prev = sen_node;
+			head = prev->next;
+		}
+	} else {
+		prev = last_percpu_ref_node;
+		head = prev->next;
+	}
+
+	last_percpu_ref_node = NULL;
+	llist_for_each_safe(pos, next, head) {
+		/* Free sentinel node which is present in the list */
+		if (percpu_ref_is_sen_node(pos)) {
+			prev->next = pos->next;
+			percpu_ref_put_sen_node(pos);
+			continue;
+		}
+
+		ref = container_of(pos, struct percpu_ref_data, node)->ref;
+		__percpu_ref_switch_to_atomic_sync_checked(ref, false);
+		/*
+		 * Drop the ref while in an RCU read critical section to
+		 * prevent the obj from being freed while we manipulate the node.
+		 */
+		rcu_read_lock();
+		percpu_ref_put(ref);
+		held = percpu_ref_tryget(ref);
+		if (!held) {
+			prev->next = pos->next;
+			init_llist_node(pos);
+			ref->percpu_count_ptr |= __PERCPU_REF_DEAD;
+		}
+		rcu_read_unlock();
+		if (!held)
+			continue;
+		__percpu_ref_switch_to_percpu_checked(ref, false);
+		count++;
+		if (count == max_scan_count) {
+			last_percpu_ref_node = pos;
+			break;
+		}
+		prev = pos;
+	}
+
+queue_release_work:
+	queue_delayed_work(percpu_ref_release_wq, &percpu_ref_release_work,
+			   scan_interval);
+}
+
+static __init int percpu_ref_setup(void)
+{
+	percpu_ref_release_wq = alloc_workqueue("percpu_ref_release_wq",
+				WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_FREEZABLE, 0);
+	if (!percpu_ref_release_wq)
+		return -ENOMEM;
+
+	queue_delayed_work(percpu_ref_release_wq, &percpu_ref_release_work,
+			   scan_interval);
+	return 0;
+}
+early_initcall(percpu_ref_setup);
-- 
2.34.1