From: Feng Tang <feng.tang@intel.com>
To: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
    David Rientjes, Joonsoo Kim, Roman Gushchin,
    Hyeonggon Yoo <42.hyeyoo@gmail.com>, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
Cc: Feng Tang <feng.tang@intel.com>
Subject: [RFC Patch 2/3] mm/slub: double per-cpu partial number for large systems
Date: Tue, 5 Sep 2023 22:13:47 +0800
Message-Id: <20230905141348.32946-3-feng.tang@intel.com>
In-Reply-To: <20230905141348.32946-1-feng.tang@intel.com>
References: <20230905141348.32946-1-feng.tang@intel.com>

There are reports of severe lock contention on slub's per-node
'list_lock' in the 'hackbench' test on server systems [1][2], and
similar contention is also seen when running the 'mmap1' case of
will-it-scale on big systems. As the trend is for one processor
(socket) to carry more and more CPUs (100+, 200+), the contention
could become much more severe and turn into a scalability issue.

One way to help reduce the contention is to double the per-cpu
partial number for large systems. Following is some performance
data, which shows a big improvement in the will-it-scale/mmap1
case, but no obvious change for the 'hackbench' test.
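To make the sizing concrete, below is a small user-space sketch (not
kernel code; the helper name and the hard-coded PAGE_SIZE are
illustrative assumptions) that models the object-size tiers used by
set_cpu_partial() in v6.5 together with the proposed CPU-count scaling:

#include <stdio.h>

#define PAGE_SIZE 4096	/* assumed page size, for illustration only */

/* Model of mm/slub.c:set_cpu_partial()'s v6.5 tiers plus the proposed scaling */
static unsigned int cpu_partial_objects(unsigned int size,
					unsigned int ncpus,
					unsigned int scale)
{
	unsigned int nr_objects;

	/* Same object-size tiers as v6.5 */
	if (size >= PAGE_SIZE)
		nr_objects = 6;
	else if (size >= 1024)
		nr_objects = 24;
	else if (size >= 256)
		nr_objects = 52;
	else
		nr_objects = 120;

	/* Proposed: raise the per-cpu partial cap on systems with many CPUs */
	if (ncpus >= 32)
		nr_objects *= scale;

	return nr_objects;
}

int main(void)
{
	unsigned int sizes[] = { 64, 512, 2048, 8192 };
	int i;

	for (i = 0; i < 4; i++)
		printf("size %4u: base %3u, 2X %3u, 4X %3u\n", sizes[i],
		       cpu_partial_objects(sizes[i], 16, 2),
		       cpu_partial_objects(sizes[i], 224, 2),
		       cpu_partial_objects(sizes[i], 224, 4));
	return 0;
}

On a 224-CPU machine like the test box below, a 64-byte cache's per-cpu
partial cap would go from 120 objects to 240 (2X) or 480 (4X), while a
16-CPU desktop keeps the current values.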
The patch itself only doubles the per-cpu partial number (2X); for
better analysis, the 4X case is also profiled.

will-it-scale/mmap1
-------------------
Run the will-it-scale benchmark's 'mmap1' test case on a 2-socket
Sapphire Rapids server (112 cores / 224 threads) with 256 GB DRAM,
in 3 configurations with parallel test threads at 25%, 50% and 100%
of the number of CPUs. The data is (base is the vanilla v6.5 kernel):

                   base        base + 2X patch     base + 4X patch
wis-mmap1-25      223670    +12.7%     251999    +34.9%     301749    per_process_ops
wis-mmap1-50      186020    +28.0%     238067    +55.6%     289521    per_process_ops
wis-mmap1-100      89200    +40.7%     125478    +62.4%     144858    per_process_ops

In the perf-profile comparison of the 50% test case, the lock
contention is greatly reduced:

    43.80   -11.5    32.27   -27.9    15.91   pp.self.native_queued_spin_lock_slowpath

hackbench
---------
Run the same hackbench test case mentioned in [1], using the same
HW/SW as will-it-scale:

                   base        base + 2X patch     base + 4X patch
hackbench         759951    +0.2%      761506    +0.5%      763972    hackbench.throughput

[1]. https://lore.kernel.org/all/202307172140.3b34825a-oliver.sang@intel.com/
[2]. https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/slub.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index f7940048138c..51ca6dbaad09 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4361,6 +4361,13 @@ static void set_cpu_partial(struct kmem_cache *s)
 	else
 		nr_objects = 120;
 
+	/*
+	 * Give larger systems more per-cpu partial slabs to reduce/postpone
+	 * contention on the per-node partial list.
+	 */
+	if (num_possible_cpus() >= 32)
+		nr_objects *= 2;
+
 	slub_set_cpu_partial(s, nr_objects);
 #endif
 }
-- 
2.27.0