From nobody Wed Feb 11 06:54:09 2026
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 45D79CA0FE2
	for ; Tue, 5 Sep 2023 16:27:05 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S243877AbjIEQ1E (ORCPT ); Tue, 5 Sep 2023 12:27:04 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34652 "EHLO
	lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1354755AbjIEOHh (ORCPT );
	Tue, 5 Sep 2023 10:07:37 -0400
Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.65])
	by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5A4E01A7
	for ; Tue, 5 Sep 2023 07:07:34 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
	d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
	t=1693922854; x=1725458854;
	h=from:to:cc:subject:date:message-id:in-reply-to:
	 references:mime-version:content-transfer-encoding;
	bh=+Yeg+pnZH94x+zS9eiLxzs8MVPwnnUwq2oxNPIF7mVU=;
	b=YqWWmjAWtOZiACePQy4nK4IHYZRlpbzKKKVy81UIkFd05GA+QCZmWYbW
	 xLvyHAIYPj3hgaUbukh5In1LkXZn19zcn6JWBtQ1utTIHzFEd8uCqh4Yw
	 Me/2StB4gfCzpTJVMUPflRUQQ2Zj2vWiuBs3S1bJS7qlh6mHkyPQBGs0M
	 gGIW8Ymu3pOiZpJlsE3Ibto0LqqVdAw6LRh+I5dX/VyCG6om1mLGZnzvk
	 jUOwmG3k4WyBkcpBiy1CTzOf0icWv3jfn9rPN54ut86POkdJVM7yYCbji
	 N2P4Nnr7QRUzEQsvXP9164JBTDa5Q7hoRt1AjVHoXnsnFDcwRGVw4UZKh
	 w==;
X-IronPort-AV: E=McAfee;i="6600,9927,10824"; a="380609586"
X-IronPort-AV: E=Sophos;i="6.02,229,1688454000";
	d="scan'208";a="380609586"
Received: from fmsmga004.fm.intel.com ([10.253.24.48])
	by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
	05 Sep 2023 07:06:22 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10824"; a="811242127"
X-IronPort-AV: E=Sophos;i="6.02,229,1688454000";
	d="scan'208";a="811242127"
Received: from shbuild999.sh.intel.com ([10.239.146.107])
	by fmsmga004.fm.intel.com with
	ESMTP; 05 Sep 2023 07:06:20 -0700
From: Feng Tang
To: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin,
	Hyeonggon Yoo <42.hyeyoo@gmail.com>, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Cc: Feng Tang
Subject: [RFC Patch 1/3] mm/slub: increase the maximum slab order to 4 for big systems
Date: Tue, 5 Sep 2023 22:13:46 +0800
Message-Id: <20230905141348.32946-2-feng.tang@intel.com>
X-Mailer: git-send-email 2.27.0
In-Reply-To: <20230905141348.32946-1-feng.tang@intel.com>
References: <20230905141348.32946-1-feng.tang@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

There are reports of severe lock contention on slub's per-node
'list_lock' in the 'hackbench' test [1][2] on server systems. Similar
contention is also seen when running the 'mmap1' case of will-it-scale
on big systems. As the trend is for one processor (socket) to have more
and more CPUs (100+, 200+), the contention could become much more
severe and turn into a scalability issue.

One way to help reduce the contention is to increase the maximum slab
order from 3 to 4 for big systems.

Unconditionally increasing the order could cause trouble for client
devices with very limited memory, which may care more about memory
footprint; allocating an order-4 page could also be harder under memory
pressure. So the increase is only done for big systems like servers,
which are usually equipped with plenty of memory and are more likely to
hit lock contention issues.

Following is some performance data:

will-it-scale/mmap1
-------------------
Run the will-it-scale benchmark's 'mmap1' test case on a 2-socket
Sapphire Rapids server (112 cores / 224 threads) with 256 GB DRAM, in 3
configurations with the number of parallel test threads set to 25%, 50%
and 100% of the number of CPUs. The data is (base is the vanilla v6.5
kernel):

                     base             base+patch
wis-mmap1-25%      223670      +33.3%     298205    per_process_ops
wis-mmap1-50%      186020      +51.8%     282383    per_process_ops
wis-mmap1-100%      89200      +65.0%     147139    per_process_ops

Taking the perf-profile comparison of the 50% test case, the lock
contention is greatly reduced:

     43.80          -30.8      13.04          pp.self.native_queued_spin_lock_slowpath
      0.85           -0.2       0.65          pp.self.___slab_alloc
      0.41           -0.1       0.27          pp.self.__unfreeze_partials
      0.20 ± 2%      -0.1       0.12 ± 4%     pp.self.get_any_partial

hackbench
---------
Run the same hackbench test case mentioned in [1], using the same HW/SW
as will-it-scale:

                     base             base+patch
hackbench          759951      +10.5%     839601    hackbench.throughput

perf-profile diff:

     22.20 ± 3%     -15.2       7.05          pp.self.native_queued_spin_lock_slowpath
      0.82           -0.2       0.59          pp.self.___slab_alloc
      0.33           -0.2       0.13          pp.self.__unfreeze_partials

[1]. https://lore.kernel.org/all/202307172140.3b34825a-oliver.sang@intel.com/
[2]. https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/

Signed-off-by: Feng Tang
---
 mm/slub.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index f7940048138c..09ae1ed642b7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4081,7 +4081,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_bulk);
  */
 static unsigned int slub_min_order;
 static unsigned int slub_max_order =
-	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : PAGE_ALLOC_COSTLY_ORDER;
+	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 4;
 static unsigned int slub_min_objects;
 
 /*
@@ -4134,6 +4134,26 @@ static inline unsigned int calc_slab_order(unsigned int size,
 	return order;
 }
 
+static inline int num_cpus(void)
+{
+	int nr_cpus;
+
+	/*
+	 * Some architectures will only update present cpus when
+	 * onlining them, so don't trust the number if it's just 1. But
+	 * we also don't want to use nr_cpu_ids always, as on some other
+	 * architectures, there can be many possible cpus, but never
+	 * onlined. Here we compromise between trying to avoid too high
+	 * order on systems that appear larger than they are, and too
+	 * low order on systems that appear smaller than they are.
+	 */
+	nr_cpus = num_present_cpus();
+	if (nr_cpus <= 1)
+		nr_cpus = nr_cpu_ids;
+
+	return nr_cpus;
+}
+
 static inline int calculate_order(unsigned int size)
 {
 	unsigned int order;
@@ -4151,19 +4171,17 @@ static inline int calculate_order(unsigned int size)
 	 */
 	min_objects = slub_min_objects;
 	if (!min_objects) {
-		/*
-		 * Some architectures will only update present cpus when
-		 * onlining them, so don't trust the number if it's just 1. But
-		 * we also don't want to use nr_cpu_ids always, as on some other
-		 * architectures, there can be many possible cpus, but never
-		 * onlined. Here we compromise between trying to avoid too high
-		 * order on systems that appear larger than they are, and too
-		 * low order on systems that appear smaller than they are.
-		 */
-		nr_cpus = num_present_cpus();
-		if (nr_cpus <= 1)
-			nr_cpus = nr_cpu_ids;
+		nr_cpus = num_cpus();
 		min_objects = 4 * (fls(nr_cpus) + 1);
+
+		/*
+		 * If nr_cpus >= 32, the platform is likely to be a server
+		 * which usually has much more memory, and is easier to be
+		 * hurt by scalability issue, so enlarge it to reduce the
+		 * possible contention of the per-node 'list_lock'.
+		 */
+		if (nr_cpus >= 32)
+			min_objects *= 2;
 	}
 	max_objects = order_objects(slub_max_order, size);
 	min_objects = min(min_objects, max_objects);
@@ -4361,6 +4379,13 @@ static void set_cpu_partial(struct kmem_cache *s)
 	else
 		nr_objects = 120;
 
+	/*
+	 * Give larger system more buffer to reduce scalability issue, like
+	 * the handling in calculate_order().
+	 */
+	if (num_cpus() >= 32)
+		nr_objects *= 2;
+
 	slub_set_cpu_partial(s, nr_objects);
 #endif
 }
-- 
2.27.0

From nobody Wed Feb 11 06:54:09 2026
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 1D581CA0FFC
	for ; Tue, 5 Sep 2023 16:24:44 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1350374AbjIEQYm (ORCPT ); Tue, 5 Sep 2023 12:24:42 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34660 "EHLO
	lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1354756AbjIEOHi (ORCPT );
	Tue, 5 Sep 2023 10:07:38 -0400
Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.65])
	by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 84A291AB
	for ; Tue, 5 Sep 2023 07:07:35 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
	d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
	t=1693922855; x=1725458855;
	h=from:to:cc:subject:date:message-id:in-reply-to:
	 references:mime-version:content-transfer-encoding;
	bh=Q95MzduZqjx1H3qSsIIWM/4xzW0Di0Tyl1YsUi4/EEk=;
	b=b9O00clt7a9N6CWPs2n7+O9xNN6xeXOqYRrK1WQPdlGf59URCG9hJ3tT
	 IyK4Fu7ZQtmEYMci7EWK2h1MLU7gwUF5LvIQ1FlaYRzFJn95KNZw7VBAd
	 RpJXBkLQKSwjlmCkUbfrvg2adCDPOaW8sP/lOl4wu62f1sj6NSXvV/jhc
	 zLs4hYmHz3rpHvCiY/o5PXEyV51pOOXM84/W32+tXXQGqQpa7XwdimfyD
	 CbyMLJJLeNxiQNDRswdpoFYU7PJueMPTcZIdXgrAdUxHaRPqOLwC2p8rl
	 HKRIT0aItSb8Wh3RGf8uc0G5TSwOQ834IyjnmCjd5p8jmTA1gzeOT90Bl
	 w==;
X-IronPort-AV: E=McAfee;i="6600,9927,10824";
	a="380609601"
X-IronPort-AV: E=Sophos;i="6.02,229,1688454000";
	d="scan'208";a="380609601"
Received: from fmsmga004.fm.intel.com ([10.253.24.48])
	by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
	05 Sep 2023 07:06:25 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10824"; a="811242153"
X-IronPort-AV: E=Sophos;i="6.02,229,1688454000";
	d="scan'208";a="811242153"
Received: from shbuild999.sh.intel.com ([10.239.146.107])
	by fmsmga004.fm.intel.com with ESMTP; 05 Sep 2023 07:06:23 -0700
From: Feng Tang
To: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin,
	Hyeonggon Yoo <42.hyeyoo@gmail.com>, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Cc: Feng Tang
Subject: [RFC Patch 2/3] mm/slub: double per-cpu partial number for large systems
Date: Tue, 5 Sep 2023 22:13:47 +0800
Message-Id: <20230905141348.32946-3-feng.tang@intel.com>
X-Mailer: git-send-email 2.27.0
In-Reply-To: <20230905141348.32946-1-feng.tang@intel.com>
References: <20230905141348.32946-1-feng.tang@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

There are reports of severe lock contention on slub's per-node
'list_lock' in the 'hackbench' test [1][2] on server systems. Similar
contention is also seen when running the 'mmap1' case of will-it-scale
on big systems. As the trend is for one processor (socket) to have more
and more CPUs (100+, 200+), the contention could become much more
severe and turn into a scalability issue.

One way to help reduce the contention is to double the per-cpu partial
number for large systems.

Following is some performance data, which shows a big improvement in
the will-it-scale/mmap1 case, but no obvious change for the 'hackbench'
test.
The patch itself only makes the per-cpu partial number 2X; for better
analysis, the 4X case is also profiled.

will-it-scale/mmap1
-------------------
Run the will-it-scale benchmark's 'mmap1' test case on a 2-socket
Sapphire Rapids server (112 cores / 224 threads) with 256 GB DRAM, in 3
configurations with the number of parallel test threads set to 25%, 50%
and 100% of the number of CPUs. The data is (base is the vanilla v6.5
kernel):

                   base        base + 2X patch       base + 4X patch
wis-mmap1-25     223670     +12.7%     251999     +34.9%     301749    per_process_ops
wis-mmap1-50     186020     +28.0%     238067     +55.6%     289521    per_process_ops
wis-mmap1-100     89200     +40.7%     125478     +62.4%     144858    per_process_ops

Taking the perf-profile comparison of the 50% test case, the lock
contention is greatly reduced:

     43.80     -11.5     32.27     -27.9     15.91    pp.self.native_queued_spin_lock_slowpath

hackbench
---------
Run the same hackbench test case mentioned in [1], using the same HW/SW
as will-it-scale:

                   base        base + 2X patch       base + 4X patch
hackbench        759951      +0.2%     761506      +0.5%     763972    hackbench.throughput

[1]. https://lore.kernel.org/all/202307172140.3b34825a-oliver.sang@intel.com/
[2]. https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/

Signed-off-by: Feng Tang
---
 mm/slub.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index f7940048138c..51ca6dbaad09 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4361,6 +4361,13 @@ static void set_cpu_partial(struct kmem_cache *s)
 	else
 		nr_objects = 120;
 
+	/*
+	 * Give larger system more per-cpu partial slabs to reduce/postpone
+	 * contending per-node partial list.
+	 */
+	if (num_cpus() >= 32)
+		nr_objects *= 2;
+
 	slub_set_cpu_partial(s, nr_objects);
 #endif
 }
-- 
2.27.0

From nobody Wed Feb 11 06:54:09 2026
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 83078CA1005
	for ; Tue, 5 Sep 2023 16:57:34 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1343883AbjIEQz2 (ORCPT ); Tue, 5 Sep 2023 12:55:28 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34668 "EHLO
	lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1354758AbjIEOHj (ORCPT );
	Tue, 5 Sep 2023 10:07:39 -0400
Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.65])
	by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6A6DF1A7
	for ; Tue, 5 Sep 2023 07:07:36 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
	d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
	t=1693922856; x=1725458856;
	h=from:to:cc:subject:date:message-id:in-reply-to:
	 references:mime-version:content-transfer-encoding;
	bh=dvT4bjQOLiJP78MPuTog+CQfsmhI/8vfzdHD6sqXSeU=;
	b=n8vuWiACSzfFi0hQCYZnDr/R3KjuzXRMbEFn+f3hyZMu9ttaXnhlVEpo
	 d/1MuZJHTZMu3RCAPA09YpwISUmtYdEStx60akGgUz0QWhHLh426szOWh
	 y+QxiJg6nNy3oRgQrh4L9yH1+1WlqtgyEv+W5SyDT680I26lBCrbGnHzO
	 IqH4m9tsKFvz0Ci/r+deXv3EGHmLE2303ZbLI384UrEB3mu6WZBbbl4Te
	 2LVn94LEd7uLzd7MEbe4y95VloGiOXmxvAtwLZ6L4AYAxx9BtR2pLCkoF
	 oiUt5HBN6FJiwzdzYXqk32r1TxMjNwijw02n/b2ijwaTu2pOuqL2nVRf6
	 g==;
X-IronPort-AV: E=McAfee;i="6600,9927,10824"; a="380609616"
X-IronPort-AV: E=Sophos;i="6.02,229,1688454000";
	d="scan'208";a="380609616"
Received: from fmsmga004.fm.intel.com ([10.253.24.48])
	by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
	05 Sep 2023 07:06:28 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10824"; a="811242163"
X-IronPort-AV: E=Sophos;i="6.02,229,1688454000";
	d="scan'208";a="811242163"
Received: from shbuild999.sh.intel.com ([10.239.146.107])
	by fmsmga004.fm.intel.com with ESMTP; 05 Sep 2023 07:06:25 -0700
From: Feng Tang
To: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin,
	Hyeonggon Yoo <42.hyeyoo@gmail.com>, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Cc: Feng Tang
Subject: [RFC Patch 3/3] mm/slub: setup maxim per-node partial according to cpu numbers
Date: Tue, 5 Sep 2023 22:13:48 +0800
Message-Id: <20230905141348.32946-4-feng.tang@intel.com>
X-Mailer: git-send-email 2.27.0
In-Reply-To: <20230905141348.32946-1-feng.tang@intel.com>
References: <20230905141348.32946-1-feng.tang@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

Currently, min_partial is set to 5 for most slabs (as MIN_PARTIAL is
5). This is fine for older or small systems, but could be too small for
a large system with hundreds of CPUs, where the per-node 'list_lock' is
contended when allocating from and freeing to the per-node partial
list.

So enlarge it based on the number of CPUs per node.

Signed-off-by: Feng Tang
---
 include/linux/nodemask.h | 1 +
 mm/slub.c                | 9 +++++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 8d07116caaf1..6e22caab186d 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -530,6 +530,7 @@ static inline int node_random(const nodemask_t *maskp)
 
 #define num_online_nodes()	num_node_state(N_ONLINE)
 #define num_possible_nodes()	num_node_state(N_POSSIBLE)
+#define num_cpu_nodes()		num_node_state(N_CPU)
 #define node_online(node)	node_state((node), N_ONLINE)
 #define node_possible(node)	node_state((node), N_POSSIBLE)
 
diff --git a/mm/slub.c b/mm/slub.c
index 09ae1ed642b7..984e012d7bbc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4533,6 +4533,7 @@ static int calculate_sizes(struct kmem_cache *s)
 
 static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
 {
+	unsigned long min_partial;
 	s->flags = kmem_cache_flags(s->size, flags, s->name);
 #ifdef CONFIG_SLAB_FREELIST_HARDENED
 	s->random = get_random_long();
@@ -4564,8 +4565,12 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
 	 * The larger the object size is, the more slabs we want on the partial
 	 * list to avoid pounding the page allocator excessively.
 	 */
-	s->min_partial = min_t(unsigned long, MAX_PARTIAL, ilog2(s->size) / 2);
-	s->min_partial = max_t(unsigned long, MIN_PARTIAL, s->min_partial);
+
+	min_partial = rounddown_pow_of_two(num_cpus() / num_cpu_nodes());
+	min_partial = max_t(unsigned long, MIN_PARTIAL, min_partial);
+
+	s->min_partial = min_t(unsigned long, min_partial * 2, ilog2(s->size) / 2);
+	s->min_partial = max_t(unsigned long, min_partial, s->min_partial);
 
 	set_cpu_partial(s);
 
-- 
2.27.0
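
For readers who want to see what the heuristics in patches 1 and 3 evaluate to on a given machine, the following is a rough userspace sketch of the arithmetic only. It is not kernel code: the sketch_* names are hypothetical, sketch_fls() and sketch_ilog2() are plain stand-ins for the kernel's fls() and ilog2(), MIN_PARTIAL is taken as 5 from the patch 3 description, and the guard against a zero CPUs-per-node quotient is an assumption added here that the patch itself does not have.

```c
#include <assert.h>

#define MIN_PARTIAL 5	/* value quoted in the patch 3 description */

/* Stand-in for the kernel's fls(): 1-based index of the top set bit. */
static unsigned int sketch_fls(unsigned int x)
{
	unsigned int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

/* Stand-in for ilog2(): floor(log2(x)); x must be non-zero. */
static unsigned int sketch_ilog2(unsigned long x)
{
	unsigned int log = 0;

	assert(x != 0);
	while (x >>= 1)
		log++;
	return log;
}

/* Patch 1: min_objects, doubled once the system has >= 32 CPUs. */
static unsigned int sketch_min_objects(unsigned int nr_cpus)
{
	unsigned int min_objects = 4 * (sketch_fls(nr_cpus) + 1);

	if (nr_cpus >= 32)
		min_objects *= 2;
	return min_objects;
}

/* Patch 3: min_partial derived from the number of CPUs per node. */
static unsigned long sketch_min_partial(unsigned int nr_cpus,
					unsigned int nr_cpu_nodes,
					unsigned long size)
{
	unsigned long cpus_per_node, lower, val;

	assert(nr_cpu_nodes != 0 && size != 0);
	cpus_per_node = nr_cpus / nr_cpu_nodes;
	/* Sketch-only guard (assumption): avoid rounding down zero. */
	if (cpus_per_node == 0)
		cpus_per_node = 1;

	/* rounddown_pow_of_two(), then never go below MIN_PARTIAL */
	lower = 1UL << sketch_ilog2(cpus_per_node);
	if (lower < MIN_PARTIAL)
		lower = MIN_PARTIAL;

	/* clamp ilog2(size) / 2 into the range [lower, 2 * lower] */
	val = sketch_ilog2(size) / 2;
	if (val > lower * 2)
		val = lower * 2;
	if (val < lower)
		val = lower;
	return val;
}
```

Assuming the 224-thread test machine above exposes 2 nodes with CPUs, this gives min_objects = 72 and, for a 256-byte object, min_partial = 64. With that many CPUs per node the object-size term no longer influences min_partial, which may be worth keeping in mind when reviewing patch 3.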