From nobody Fri Dec 19 15:50:14 2025
From:
Chen Ridong
To: akpm@linux-foundation.org, Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, vbabka@suse.cz, jannh@google.com, pfalcato@suse.de, bigeasy@linutronix.de, paulmck@kernel.org, chenridong@huawei.com, roman.gushchin@linux.dev, brauner@kernel.org, pmladek@suse.com, geert@linux-m68k.org, mingo@kernel.org, rrangel@chromium.org, francesco@valla.it, kpsingh@kernel.org, guoweikang.kernel@gmail.com, link@vivo.com, viro@zeniv.linux.org.uk, neil@brown.name, nichen@iscas.ac.cn, tglx@linutronix.de, frederic@kernel.org, peterz@infradead.org, oleg@redhat.com, joel.granados@kernel.org, linux@weissschuh.net, avagin@google.com, legion@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, lujialin4@huawei.com
Subject: [RFC next v2 1/2] ucounts: free ucounts only when count and rlimit are zero
Date: Mon, 19 May 2025 13:11:50 +0000
Message-Id: <20250519131151.988900-2-chenridong@huaweicloud.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20250519131151.988900-1-chenridong@huaweicloud.com>
References: <20250519131151.988900-1-chenridong@huaweicloud.com>

From: Chen Ridong

Since commit fda31c50292a ("signal: avoid double atomic counter increments for user accounting") and commit 15bc01effefe ("ucounts: Fix signal ucount refcounting"), the reference counting for ucounts behaves as follows: the reference count is incremented when the first pending signal is pinned to the ucounts, and it is decremented when the last pending signal is dequeued. This implies that as long as any pending signal is pinned to the ucounts, the ucounts cannot be freed.

To address the scalability issue described in the next patch, ucounts->rlimit will be converted to a percpu_counter. However, summing up percpu counters is expensive. To avoid that cost, this patch changes the conditions for freeing ucounts: instead of the intricate checks for whether a pending signal is the first or the last one, the ucounts can now be freed only when both the refcount and the rlimit counters are zero. This change not only simplifies the logic but also reduces the number of atomic operations.
Signed-off-by: Chen Ridong
---
 include/linux/user_namespace.h |  1 +
 kernel/ucount.c                | 75 ++++++++++++++++++++++++++--------
 2 files changed, 59 insertions(+), 17 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index a0bb6d012137..6e2229ea4673 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -122,6 +122,7 @@ struct ucounts {
 	kuid_t uid;
 	struct rcu_head rcu;
 	rcuref_t count;
+	atomic_long_t freed;
 	atomic_long_t ucount[UCOUNT_COUNTS];
 	atomic_long_t rlimit[UCOUNT_RLIMIT_COUNTS];
 };
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 8686e329b8f2..125471af7d59 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -185,18 +185,61 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 	return new;
 }
 
-void put_ucounts(struct ucounts *ucounts)
+/*
+ * Whether all the rlimits are zero.
+ * For now, only UCOUNT_RLIMIT_SIGPENDING is considered.
+ * Other rlimit types can be added.
+ */
+static bool rlimits_are_zero(struct ucounts *ucounts)
+{
+	int rtypes[] = { UCOUNT_RLIMIT_SIGPENDING };
+	int rtype;
+
+	for (int i = 0; i < sizeof(rtypes)/sizeof(int); ++i) {
+		rtype = rtypes[i];
+		if (atomic_long_read(&ucounts->rlimit[rtype]) > 0)
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Ucounts can be freed only when ucounts->count has been released
+ * and the rlimits are zero.
+ * The caller should hold rcu_read_lock().
+ */
+static bool ucounts_can_be_freed(struct ucounts *ucounts)
+{
+	if (rcuref_read(&ucounts->count) > 0)
+		return false;
+	if (!rlimits_are_zero(ucounts))
+		return false;
+	/* Prevent double free */
+	return atomic_long_cmpxchg(&ucounts->freed, 0, 1) == 0;
+}
+
+static void free_ucounts(struct ucounts *ucounts)
 {
 	unsigned long flags;
 
-	if (rcuref_put(&ucounts->count)) {
-		spin_lock_irqsave(&ucounts_lock, flags);
-		hlist_nulls_del_rcu(&ucounts->node);
-		spin_unlock_irqrestore(&ucounts_lock, flags);
+	spin_lock_irqsave(&ucounts_lock, flags);
+	hlist_nulls_del_rcu(&ucounts->node);
+	spin_unlock_irqrestore(&ucounts_lock, flags);
+
+	put_user_ns(ucounts->ns);
+	kfree_rcu(ucounts, rcu);
+}
 
-		put_user_ns(ucounts->ns);
-		kfree_rcu(ucounts, rcu);
+void put_ucounts(struct ucounts *ucounts)
+{
+	rcu_read_lock();
+	if (rcuref_put(&ucounts->count) &&
+	    ucounts_can_be_freed(ucounts)) {
+		rcu_read_unlock();
+		free_ucounts(ucounts);
+		return;
 	}
+	rcu_read_unlock();
 }
 
 static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
@@ -281,11 +324,17 @@ static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
 {
 	struct ucounts *iter, *next;
 	for (iter = ucounts; iter != last; iter = next) {
+		bool to_free;
+
+		rcu_read_lock();
 		long dec = atomic_long_sub_return(1, &iter->rlimit[type]);
 		WARN_ON_ONCE(dec < 0);
 		next = iter->ns->ucounts;
-		if (dec == 0)
-			put_ucounts(iter);
+		to_free = ucounts_can_be_freed(iter);
+		rcu_read_unlock();
+		/* If ucounts->count is zero and the rlimits are zero, free ucounts */
+		if (to_free)
+			free_ucounts(iter);
 	}
 }
 
@@ -310,14 +359,6 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
 		ret = new;
 		if (!override_rlimit)
 			max = get_userns_rlimit_max(iter->ns, type);
-		/*
-		 * Grab an extra ucount reference for the caller when
-		 * the rlimit count was previously 0.
-		 */
-		if (new != 1)
-			continue;
-		if (!get_ucounts(iter))
-			goto dec_unwind;
 	}
 	return ret;
 dec_unwind:
-- 
2.34.1

From nobody Fri Dec 19 15:50:14 2025
From: Chen Ridong
To: akpm@linux-foundation.org, Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, vbabka@suse.cz, jannh@google.com, pfalcato@suse.de, bigeasy@linutronix.de, paulmck@kernel.org, chenridong@huawei.com, roman.gushchin@linux.dev, brauner@kernel.org, pmladek@suse.com, geert@linux-m68k.org, mingo@kernel.org, rrangel@chromium.org, francesco@valla.it, kpsingh@kernel.org, guoweikang.kernel@gmail.com, link@vivo.com, viro@zeniv.linux.org.uk, neil@brown.name, nichen@iscas.ac.cn, tglx@linutronix.de, frederic@kernel.org, peterz@infradead.org, oleg@redhat.com, joel.granados@kernel.org, linux@weissschuh.net, avagin@google.com, legion@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, lujialin4@huawei.com
Subject: [RFC next v2 2/2] ucounts: turn the atomic rlimit into a percpu_counter
Date: Mon, 19 May 2025 13:11:51 +0000
Message-Id: <20250519131151.988900-3-chenridong@huaweicloud.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20250519131151.988900-1-chenridong@huaweicloud.com>
References: <20250519131151.988900-1-chenridong@huaweicloud.com>

From: Chen Ridong

Observation of the will-it-scale test case signal1 [1] reveals that the signal-sending system call does not scale linearly. To investigate further, we ran a series of tests launching varying numbers of Docker containers and monitored the throughput of each individual container. The results are as follows:

| Dockers    |1      |4      |8      |16     |32     |64     |
| Throughput |380068 |353204 |308948 |306453 |180659 |129152 |

The data shows a clear trend: as the number of containers increases, the per-container throughput progressively declines.

In-depth analysis identified the root cause of this performance degradation: the ucounts code accounts rlimits with a significant number of atomic operations. When these atomic operations act on the same variable, they trigger a substantial number of cache misses or remote accesses, ultimately degrading performance.

To address this, this patch converts the atomic rlimit counters to percpu_counter.
After the optimization, the performance data below shows that throughput no longer declines as the number of Docker containers increases:

| Dockers    |1      |4      |8      |16     |32     |64     |
| Throughput |374737 |376377 |374814 |379284 |374950 |377509 |

[1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/

Signed-off-by: Chen Ridong
---
 include/linux/user_namespace.h | 16 ++++--
 init/main.c                    |  1 +
 ipc/mqueue.c                   |  6 +--
 kernel/signal.c                |  8 +--
 kernel/ucount.c                | 98 ++++++++++++++++++++++------------
 mm/mlock.c                     |  5 +-
 6 files changed, 81 insertions(+), 53 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 6e2229ea4673..0d1251e1f9ea 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -12,6 +12,7 @@
 #include
 #include
 #include
+#include
 
 #define UID_GID_MAP_MAX_BASE_EXTENTS 5
 #define UID_GID_MAP_MAX_EXTENTS 340
@@ -124,7 +125,7 @@ struct ucounts {
 	rcuref_t count;
 	atomic_long_t freed;
 	atomic_long_t ucount[UCOUNT_COUNTS];
-	atomic_long_t rlimit[UCOUNT_RLIMIT_COUNTS];
+	struct percpu_counter rlimit[UCOUNT_RLIMIT_COUNTS];
 };
 
 extern struct user_namespace init_user_ns;
@@ -136,6 +137,7 @@ struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_ty
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
 void put_ucounts(struct ucounts *ucounts);
+void __init ucounts_init(void);
 
 static inline struct ucounts * __must_check get_ucounts(struct ucounts *ucounts)
 {
@@ -146,13 +148,17 @@ static inline struct ucounts * __must_check get_ucounts(struct ucounts *ucounts)
 
 static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type type)
 {
-	return atomic_long_read(&ucounts->rlimit[type]);
+	return percpu_counter_sum(&ucounts->rlimit[type]);
 }
 
-long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
-bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
+bool inc_rlimit_ucounts_limit(struct ucounts *ucounts, enum rlimit_type type, long v, long limit);
+static inline bool inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v)
+{
+	return inc_rlimit_ucounts_limit(ucounts, type, v, LONG_MAX);
+}
+void dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
 long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
-			    bool override_rlimit);
+			    bool override_rlimit, long limit);
 void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type);
 bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max);
 
diff --git a/init/main.c b/init/main.c
index 7f0a2a3dbd29..1168c0c453ff 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1071,6 +1071,7 @@ void start_kernel(void)
 	efi_enter_virtual_mode();
 #endif
 	thread_stack_cache_init();
+	ucounts_init();
 	cred_init();
 	fork_init();
 	proc_caches_init();
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 35b4f8659904..e4bd211900ab 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -371,11 +371,9 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
 		mq_bytes += mq_treesize;
 		info->ucounts = get_ucounts(current_ucounts());
 		if (info->ucounts) {
-			long msgqueue;
-
 			spin_lock(&mq_lock);
-			msgqueue = inc_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
-			if (msgqueue == LONG_MAX || msgqueue > rlimit(RLIMIT_MSGQUEUE)) {
+			if (!inc_rlimit_ucounts_limit(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE,
+					mq_bytes, rlimit(RLIMIT_MSGQUEUE))) {
 				dec_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
 				spin_unlock(&mq_lock);
 				put_ucounts(info->ucounts);
diff --git a/kernel/signal.c b/kernel/signal.c
index f8859faa26c5..2b6ed2168db6 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -416,13 +416,9 @@ static struct ucounts *sig_get_ucounts(struct task_struct *t, int sig,
 	rcu_read_lock();
 	ucounts = task_ucounts(t);
 	sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING,
-					    override_rlimit);
+					    override_rlimit, task_rlimit(t, RLIMIT_SIGPENDING));
 	rcu_read_unlock();
-	if (!sigpending)
-		return NULL;
-
-	if (unlikely(!override_rlimit && sigpending > task_rlimit(t, RLIMIT_SIGPENDING))) {
-		dec_rlimit_put_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
+	if (!sigpending) {
 		print_dropped_signal(sig);
 		return NULL;
 	}
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 125471af7d59..a856f3d4a9a1 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -158,6 +158,7 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 {
 	struct hlist_nulls_head *hashent = ucounts_hashentry(ns, uid);
 	struct ucounts *ucounts, *new;
+	int i = 0, j = 0;
 
 	ucounts = find_ucounts(ns, uid, hashent);
 	if (ucounts)
@@ -170,11 +171,16 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 	new->ns = ns;
 	new->uid = uid;
 	rcuref_init(&new->count, 1);
-
+	for (i = 0; i < UCOUNT_RLIMIT_COUNTS; ++i) {
+		if (percpu_counter_init(&new->rlimit[i], 0, GFP_KERNEL))
+			goto failed;
+	}
 	spin_lock_irq(&ucounts_lock);
 	ucounts = find_ucounts(ns, uid, hashent);
 	if (ucounts) {
 		spin_unlock_irq(&ucounts_lock);
+		for (j = 0; j < UCOUNT_RLIMIT_COUNTS; ++j)
+			percpu_counter_destroy(&new->rlimit[j]);
 		kfree(new);
 		return ucounts;
 	}
@@ -183,6 +189,12 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 	get_user_ns(new->ns);
 	spin_unlock_irq(&ucounts_lock);
 	return new;
+
+failed:
+	/* percpu_counter_init() failed at index i; destroy counters 0..i-1 */
+	for (j = 0; j < i; ++j)
+		percpu_counter_destroy(&new->rlimit[j]);
+	kfree(new);
+	return NULL;
 }
 
 /*
@@ -197,7 +209,7 @@ static bool rlimits_are_zero(struct ucounts *ucounts)
 
 	for (int i = 0; i < sizeof(rtypes)/sizeof(int); ++i) {
 		rtype = rtypes[i];
-		if (atomic_long_read(&ucounts->rlimit[rtype]) > 0)
+		if (get_rlimit_value(ucounts, rtype) > 0)
 			return false;
 	}
 	return true;
@@ -225,7 +237,8 @@ static void free_ucounts(struct ucounts *ucounts)
 	spin_lock_irqsave(&ucounts_lock, flags);
 	hlist_nulls_del_rcu(&ucounts->node);
 	spin_unlock_irqrestore(&ucounts_lock, flags);
-
+	for (int i = 0; i < UCOUNT_RLIMIT_COUNTS; ++i)
+		percpu_counter_destroy(&ucounts->rlimit[i]);
 	put_user_ns(ucounts->ns);
 	kfree_rcu(ucounts, rcu);
 }
@@ -289,36 +302,35 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type type)
 	put_ucounts(ucounts);
 }
 
-long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v)
+bool inc_rlimit_ucounts_limit(struct ucounts *ucounts, enum rlimit_type type,
+			      long v, long limit)
 {
 	struct ucounts *iter;
 	long max = LONG_MAX;
-	long ret = 0;
+	bool good = true;
 
 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		long new = atomic_long_add_return(v, &iter->rlimit[type]);
-		if (new < 0 || new > max)
-			ret = LONG_MAX;
-		else if (iter == ucounts)
-			ret = new;
+		max = min(limit, max);
+		if (!percpu_counter_limited_add(&iter->rlimit[type], max, v))
+			good = false;
+		max = get_userns_rlimit_max(iter->ns, type);
 	}
-	return ret;
+	return good;
 }
 
-bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v)
+void dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v)
 {
 	struct ucounts *iter;
-	long new = -1; /* Silence compiler warning */
-	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		long dec = atomic_long_sub_return(v, &iter->rlimit[type]);
-		WARN_ON_ONCE(dec < 0);
-		if (iter == ucounts)
-			new = dec;
-	}
-	return (new == 0);
+
+	for (iter = ucounts; iter; iter = iter->ns->ucounts)
+		percpu_counter_sub(&iter->rlimit[type], v);
 }
 
+/*
+ * inc_rlimit_get_ucounts() does not grab the refcount; the matching
+ * dec_rlimit_put_ucounts() must be called every time the rlimit is decremented.
+ */
 static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
 				      struct ucounts *last, enum rlimit_type type)
 {
@@ -327,8 +339,7 @@ static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
 		bool to_free;
 
 		rcu_read_lock();
-		long dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-		WARN_ON_ONCE(dec < 0);
+		percpu_counter_sub(&iter->rlimit[type], 1);
 		next = iter->ns->ucounts;
 		to_free = ucounts_can_be_freed(iter);
 		rcu_read_unlock();
@@ -343,29 +354,37 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type)
 	do_dec_rlimit_put_ucounts(ucounts, NULL, type);
 }
 
+/*
+ * Though this function does not grab the refcount, it is guaranteed that
+ * the ucounts will not be freed as long as any rlimit pins them.
+ * The caller must hold a reference to ucounts or be under rcu_read_lock().
+ *
+ * Return 1 if the increment succeeds, otherwise return 0.
+ */
 long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
-			    bool override_rlimit)
+			    bool override_rlimit, long limit)
 {
-	/* Caller must hold a reference to ucounts */
 	struct ucounts *iter;
 	long max = LONG_MAX;
-	long dec, ret = 0;
+	long ret = 0;
+
+	if (override_rlimit)
+		limit = LONG_MAX;
 
 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		long new = atomic_long_add_return(1, &iter->rlimit[type]);
-		if (new < 0 || new > max)
+		/* Cannot exceed the given limit or the ns->rlimit_max */
+		max = min(limit, max);
+
+		if (!percpu_counter_limited_add(&iter->rlimit[type], max, 1))
 			goto dec_unwind;
-		if (iter == ucounts)
-			ret = new;
+
 		if (!override_rlimit)
 			max = get_userns_rlimit_max(iter->ns, type);
 	}
-	return ret;
+	return 1;
 dec_unwind:
-	dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-	WARN_ON_ONCE(dec < 0);
 	do_dec_rlimit_put_ucounts(ucounts, iter, type);
-	return 0;
+	return ret;
 }
 
 bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long rlimit)
@@ -374,15 +393,23 @@ bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigne
 	long max = rlimit;
 	if (rlimit > LONG_MAX)
 		max = LONG_MAX;
+
 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		long val = get_rlimit_value(iter, type);
-		if (val < 0 || val > max)
+		/* Returns true if iter->rlimit[type] > max */
+		if (percpu_counter_compare(&iter->rlimit[type], max) > 0)
 			return true;
+
 		max = get_userns_rlimit_max(iter->ns, type);
 	}
 	return false;
 }
 
+void __init ucounts_init(void)
+{
+	for (int i = 0; i < UCOUNT_RLIMIT_COUNTS; ++i)
+		percpu_counter_init(&init_ucounts.rlimit[i], 0, GFP_KERNEL);
+}
+
 static __init int user_namespace_sysctl_init(void)
 {
 #ifdef CONFIG_SYSCTL
@@ -398,6 +425,7 @@ static __init int user_namespace_sysctl_init(void)
 	BUG_ON(!user_header);
 	BUG_ON(!setup_userns_sysctls(&init_user_ns));
 #endif
+	hlist_add_ucounts(&init_ucounts);
 	inc_rlimit_ucounts(&init_ucounts, UCOUNT_RLIMIT_NPROC, 1);
 	return 0;
diff --git a/mm/mlock.c b/mm/mlock.c
index 3cb72b579ffd..20f3b62b3ec0 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -793,7 +793,6 @@ static DEFINE_SPINLOCK(shmlock_user_lock);
 int user_shm_lock(size_t size, struct ucounts *ucounts)
 {
 	unsigned long lock_limit, locked;
-	long memlock;
 	int allowed = 0;
 
 	locked = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
@@ -801,9 +800,9 @@ int user_shm_lock(size_t size, struct ucounts *ucounts)
 	if (lock_limit != RLIM_INFINITY)
 		lock_limit >>= PAGE_SHIFT;
 	spin_lock(&shmlock_user_lock);
-	memlock = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
 
-	if ((memlock == LONG_MAX || memlock > lock_limit) && !capable(CAP_IPC_LOCK)) {
+	if (!inc_rlimit_ucounts_limit(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked, lock_limit)
+	    && !capable(CAP_IPC_LOCK)) {
 		dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
 		goto out;
 	}
-- 
2.34.1