From nobody Mon Apr 6 08:10:32 2026
From: Abel Wu
To: Peter Zijlstra, Mel Gorman, Vincent Guittot
Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R. Shenoy,
    linux-kernel@vger.kernel.org, Abel Wu
Subject: [PATCH v5 1/5] sched/fair: Ignore SIS_UTIL when has idle core
Date: Fri, 9 Sep 2022 13:53:00 +0800
Message-Id: <20220909055304.25171-2-wuyun.abel@bytedance.com>
In-Reply-To: <20220909055304.25171-1-wuyun.abel@bytedance.com>
References: <20220909055304.25171-1-wuyun.abel@bytedance.com>

When SIS_UTIL is enabled, the SIS domain scan is skipped if the LLC is
overloaded, even if the has_idle_core hint is true. Since idle load
balancing is only triggered at tick boundaries, the idle cores can stay
cold for the whole tick period while some of the other cpus might be
overloaded. Give the scan a chance to look for idle cores when the hint
implies the effort is worthwhile.
Benchmark
=========

Tests are done on a dual-socket (2 x 24C/48T) machine modeled after
Intel Xeon(R) Platinum 8260, in two SNC configurations:

	SNC on:  4 NUMA nodes, each with 12C/24T
	SNC off: 2 NUMA nodes, each with 24C/48T

All of the benchmarks are run inside a normal cpu cgroup in a clean
environment with cpu turbo disabled. Based on tip sched/core
0fba527e959d (v5.19.0).

Results
=======

hackbench-process-pipes
                          vanilla               patched
(SNC on)
Amean     1        0.4480 (  0.00%)      0.4470 (  0.22%)
Amean     4        0.6137 (  0.00%)      0.5947 (  3.10%)
Amean     7        0.7530 (  0.00%)      0.7450 (  1.06%)
Amean     12       1.1230 (  0.00%)      1.1053 (  1.57%)
Amean     21       2.0567 (  0.00%)      1.9420 (  5.58%)
Amean     30       3.0847 (  0.00%)      2.9267 *  5.12%*
Amean     48       5.9043 (  0.00%)      4.7027 * 20.35%*
Amean     79       9.3477 (  0.00%)      7.7097 * 17.52%*
Amean     110     11.0647 (  0.00%)     10.0680 *  9.01%*
Amean     141     13.3297 (  0.00%)     12.5450 *  5.89%*
Amean     172     15.2210 (  0.00%)     15.0297 (  1.26%)
Amean     203     17.8510 (  0.00%)     16.8827 *  5.42%*
Amean     234     19.9263 (  0.00%)     19.1183 (  4.05%)
Amean     265     21.9117 (  0.00%)     20.9893 *  4.21%*
Amean     296     23.7683 (  0.00%)     23.3920 (  1.58%)
(SNC off)
Amean     1        0.2963 (  0.00%)      0.2717 (  8.32%)
Amean     4        0.6093 (  0.00%)      0.6257 ( -2.68%)
Amean     7        0.7837 (  0.00%)      0.7740 (  1.23%)
Amean     12       1.2703 (  0.00%)      1.2410 (  2.31%)
Amean     21       2.6260 (  0.00%)      2.6410 ( -0.57%)
Amean     30       4.3483 (  0.00%)      3.7620 ( 13.48%)
Amean     48       7.9753 (  0.00%)      6.7757 ( 15.04%)
Amean     79       9.6540 (  0.00%)      8.8827 *  7.99%*
Amean     110     11.2597 (  0.00%)     11.0583 (  1.79%)
Amean     141     13.8077 (  0.00%)     13.3387 (  3.40%)
Amean     172     16.3513 (  0.00%)     15.9583 *  2.40%*
Amean     203     19.0880 (  0.00%)     17.8757 *  6.35%*
Amean     234     21.7660 (  0.00%)     20.0543 *  7.86%*
Amean     265     23.0447 (  0.00%)     22.6643 *  1.65%*
Amean     296     25.4660 (  0.00%)     25.6677 ( -0.79%)

The more overloaded the system is, the more benefit can be seen, since
the cpu resources are exploited by more actively putting idle cores to
work, e.g. in the 21~48 group cases.
But once more workload is applied (79+ groups), the free cpu capacity
that can be exploited becomes smaller, so the improvement comes down to
~5%. On the other hand, when the load is relatively low (<12 groups),
not much benefit can be seen: in such cases it is not hard to find an
idle cpu, so the gain is only picking an idle core rather than merely an
idle cpu, and the cost of the full scans negates much of the benefit
they bring.

The downside of the full scan is that its cost grows with LLC size, but
the test results still look positive. One possible reason is the low
SIS success rate (~3.5%), so the cost does not negate the benefit.

tbench4 Throughput
                          vanilla               patched
(SNC on)
Hmean     1        284.44 (  0.00%)      287.90 *  1.22%*
Hmean     2        564.10 (  0.00%)      575.52 *  2.02%*
Hmean     4       1120.93 (  0.00%)     1137.94 *  1.52%*
Hmean     8       2248.94 (  0.00%)     2250.42 *  0.07%*
Hmean     16      4360.10 (  0.00%)     4363.41 (  0.08%)
Hmean     32      7300.52 (  0.00%)     7338.06 *  0.51%*
Hmean     64      8912.37 (  0.00%)     8914.66 (  0.03%)
Hmean     128    19874.16 (  0.00%)    19978.59 *  0.53%*
Hmean     256    19759.42 (  0.00%)    20057.49 *  1.51%*
Hmean     384    19502.40 (  0.00%)    19846.74 *  1.77%*
(SNC off)
Hmean     1        300.70 (  0.00%)      309.43 *  2.90%*
Hmean     2        597.53 (  0.00%)      613.92 *  2.74%*
Hmean     4       1188.34 (  0.00%)     1227.84 *  3.32%*
Hmean     8       2336.22 (  0.00%)     2379.04 *  1.83%*
Hmean     16      4459.17 (  0.00%)     4634.66 *  3.94%*
Hmean     32      7606.69 (  0.00%)     7592.12 * -0.19%*
Hmean     64      9009.48 (  0.00%)     9241.11 *  2.57%*
Hmean     128    19456.88 (  0.00%)    17870.37 * -8.15%*
Hmean     256    19771.10 (  0.00%)    19370.92 * -2.02%*
Hmean     384    20118.74 (  0.00%)    19413.92 * -3.50%*

netperf-udp
                              vanilla               patched
(SNC on)
Hmean     send-64       209.06 (  0.00%)      211.69 *  1.26%*
Hmean     send-128      416.70 (  0.00%)      417.00 (  0.07%)
Hmean     send-256      819.65 (  0.00%)      827.61 *  0.97%*
Hmean     send-1024    3163.12 (  0.00%)     3191.16 *  0.89%*
Hmean     send-2048    5958.21 (  0.00%)     6045.20 *  1.46%*
Hmean     send-3312    9168.81 (  0.00%)     9282.21 *  1.24%*
Hmean     send-4096   11039.27 (  0.00%)    11130.55 (  0.83%)
Hmean     send-8192   17804.42 (  0.00%)    17816.25 (  0.07%)
Hmean     send-16384  28529.57 (  0.00%)    28812.09 (  0.99%)
Hmean     recv-64       209.06 (  0.00%)      211.69 *  1.26%*
Hmean     recv-128      416.70 (  0.00%)      417.00 (  0.07%)
Hmean     recv-256      819.65 (  0.00%)      827.61 *  0.97%*
Hmean     recv-1024    3163.12 (  0.00%)     3191.16 *  0.89%*
Hmean     recv-2048    5958.21 (  0.00%)     6045.18 *  1.46%*
Hmean     recv-3312    9168.81 (  0.00%)     9282.21 *  1.24%*
Hmean     recv-4096   11039.27 (  0.00%)    11130.55 (  0.83%)
Hmean     recv-8192   17804.32 (  0.00%)    17816.23 (  0.07%)
Hmean     recv-16384  28529.38 (  0.00%)    28812.04 (  0.99%)
(SNC off)
Hmean     send-64       211.39 (  0.00%)      213.24 (  0.87%)
Hmean     send-128      415.25 (  0.00%)      426.45 *  2.70%*
Hmean     send-256      814.75 (  0.00%)      835.33 *  2.53%*
Hmean     send-1024    3171.61 (  0.00%)     3173.84 (  0.07%)
Hmean     send-2048    6015.92 (  0.00%)     6046.41 (  0.51%)
Hmean     send-3312    9210.17 (  0.00%)     9309.65 (  1.08%)
Hmean     send-4096   11084.55 (  0.00%)    11250.86 *  1.50%*
Hmean     send-8192   17769.83 (  0.00%)    18101.50 *  1.87%*
Hmean     send-16384  27718.62 (  0.00%)    28152.58 *  1.57%*
Hmean     recv-64       211.39 (  0.00%)      213.24 (  0.87%)
Hmean     recv-128      415.25 (  0.00%)      426.45 *  2.70%*
Hmean     recv-256      814.75 (  0.00%)      835.32 *  2.53%*
Hmean     recv-1024    3171.61 (  0.00%)     3173.84 (  0.07%)
Hmean     recv-2048    6015.92 (  0.00%)     6046.41 (  0.51%)
Hmean     recv-3312    9210.17 (  0.00%)     9309.65 (  1.08%)
Hmean     recv-4096   11084.55 (  0.00%)    11250.86 *  1.50%*
Hmean     recv-8192   17769.76 (  0.00%)    18101.32 *  1.87%*
Hmean     recv-16384  27718.62 (  0.00%)    28152.46 *  1.57%*

netperf-tcp
                          vanilla               patched
(SNC on)
Hmean     64       1192.41 (  0.00%)     1253.72 *  5.14%*
Hmean     128      2354.50 (  0.00%)     2375.97 (  0.91%)
Hmean     256      4371.10 (  0.00%)     4412.90 (  0.96%)
Hmean     1024    13813.84 (  0.00%)    13987.31 (  1.26%)
Hmean     2048    21518.91 (  0.00%)    21677.74 (  0.74%)
Hmean     3312    25585.77 (  0.00%)    25943.95 *  1.40%*
Hmean     4096    27402.77 (  0.00%)    27700.88 *  1.09%*
Hmean     8192    31766.67 (  0.00%)    32187.68 *  1.33%*
Hmean     16384   36227.30 (  0.00%)    36542.97 (  0.87%)
(SNC off)
Hmean     64       1182.09 (  0.00%)     1219.15 *  3.14%*
Hmean     128      2316.35 (  0.00%)     2361.89 *  1.97%*
Hmean     256      4231.05 (  0.00%)     4314.53 *  1.97%*
Hmean     1024    13461.44 (  0.00%)    13543.85 (  0.61%)
Hmean     2048    21016.51 (  0.00%)    21270.62 *  1.21%*
Hmean     3312    24834.03 (  0.00%)    24960.05 (  0.51%)
Hmean     4096    26700.53 (  0.00%)    26959.99 (  0.97%)
Hmean     8192    31094.10 (  0.00%)    30989.89 ( -0.34%)
Hmean     16384   34953.23 (  0.00%)    35310.35 (  1.02%)

Both netperf and tbench4 have high SIS success rates, ~100% and ~50%
respectively, so the effort spent on full scans for idle cores brings
little benefit compared to its cost. This is similar to the
aforementioned <12 group case in hackbench.

Conclusion
==========

Taking a full scan for idle cores is generally good for making better
use of the cpu resources, yet there is still room for improvement under
certain circumstances.

Signed-off-by: Abel Wu
Tested-by: Chen Yu
Reviewed-by: Tim Chen
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efceb670e755..5af9bf246274 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6437,7 +6437,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		time = cpu_clock(this);
 	}
 
-	if (sched_feat(SIS_UTIL)) {
+	if (sched_feat(SIS_UTIL) && !has_idle_core) {
 		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
 		if (sd_share) {
 			/* because !--nr is the condition to stop scan */
-- 
2.37.3

From nobody Mon Apr 6 08:10:32 2026
From: Abel Wu
To: Peter Zijlstra, Mel Gorman, Vincent Guittot
Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R. Shenoy,
    linux-kernel@vger.kernel.org, Abel Wu, Mel Gorman
Subject: [PATCH v5 2/5] sched/fair: Limited scan for idle cores when overloaded
Date: Fri, 9 Sep 2022 13:53:01 +0800
Message-Id: <20220909055304.25171-3-wuyun.abel@bytedance.com>
In-Reply-To: <20220909055304.25171-1-wuyun.abel@bytedance.com>
References: <20220909055304.25171-1-wuyun.abel@bytedance.com>

The has_idle_cores hint can be misleading with rapidly idling
workloads, especially when the LLC is overloaded. In that case, full
scan costs are incurred that often fail to find an idle core. So limit
the scan depth for idle cores in such cases, keeping the speculative
inspection available at a reasonable cost.

Benchmark
=========

Tests are done on a dual-socket (2 x 24C/48T) machine modeled after
Intel Xeon(R) Platinum 8260, in two SNC configurations:

	SNC on:  4 NUMA nodes, each with 12C/24T
	SNC off: 2 NUMA nodes, each with 24C/48T

All of the benchmarks are run inside a normal cpu cgroup in a clean
environment with cpu turbo disabled. Based on tip sched/core
0fba527e959d (v5.19.0) plus the previous patches of this series.
Results
=======

hackbench-process-pipes
                        unpatched               patched
(SNC on)
Amean     1        0.4470 (  0.00%)      0.4557 ( -1.94%)
Amean     4        0.5947 (  0.00%)      0.6033 ( -1.46%)
Amean     7        0.7450 (  0.00%)      0.7627 ( -2.37%)
Amean     12       1.1053 (  0.00%)      1.0653 (  3.62%)
Amean     21       1.9420 (  0.00%)      2.0283 * -4.45%*
Amean     30       2.9267 (  0.00%)      2.9670 ( -1.38%)
Amean     48       4.7027 (  0.00%)      4.6863 (  0.35%)
Amean     79       7.7097 (  0.00%)      7.9443 * -3.04%*
Amean     110     10.0680 (  0.00%)     10.2393 ( -1.70%)
Amean     141     12.5450 (  0.00%)     12.6343 ( -0.71%)
Amean     172     15.0297 (  0.00%)     14.9957 (  0.23%)
Amean     203     16.8827 (  0.00%)     16.9133 ( -0.18%)
Amean     234     19.1183 (  0.00%)     19.2433 ( -0.65%)
Amean     265     20.9893 (  0.00%)     21.6917 ( -3.35%)
Amean     296     23.3920 (  0.00%)     23.8743 ( -2.06%)
(SNC off)
Amean     1        0.2717 (  0.00%)      0.3143 ( -15.71%)
Amean     4        0.6257 (  0.00%)      0.6070 (  2.98%)
Amean     7        0.7740 (  0.00%)      0.7960 ( -2.84%)
Amean     12       1.2410 (  0.00%)      1.1947 (  3.73%)
Amean     21       2.6410 (  0.00%)      2.4837 (  5.96%)
Amean     30       3.7620 (  0.00%)      3.4577 (  8.09%)
Amean     48       6.7757 (  0.00%)      5.5227 * 18.49%*
Amean     79       8.8827 (  0.00%)      9.2933 ( -4.62%)
Amean     110     11.0583 (  0.00%)     11.0443 (  0.13%)
Amean     141     13.3387 (  0.00%)     13.1360 (  1.52%)
Amean     172     15.9583 (  0.00%)     15.7770 (  1.14%)
Amean     203     17.8757 (  0.00%)     17.9557 ( -0.45%)
Amean     234     20.0543 (  0.00%)     20.4373 * -1.91%*
Amean     265     22.6643 (  0.00%)     23.6053 * -4.15%*
Amean     296     25.6677 (  0.00%)     25.6803 ( -0.05%)

Run-to-run variation is large in the 1-group test, so it can be safely
ignored. With limited scanning for idle cores when the LLC is
overloaded, a slight regression can be seen on the smaller-LLC machine,
because the cost of a full scan on these LLCs is much smaller than on
machines with bigger LLCs. In the SNC-off case, on the other hand, the
limited scan provides an obvious benefit, especially when the frequency
of such scans is relatively high, e.g. <48 groups.

It's not a universal win, but considering that LLCs are getting bigger
nowadays, we should be careful about the scan depth, and a limited scan
is indeed necessary under certain circumstances.

tbench4 Throughput
                        unpatched               patched
(SNC on)
Hmean     1        309.43 (  0.00%)      301.54 * -2.55%*
Hmean     2        613.92 (  0.00%)      607.81 * -0.99%*
Hmean     4       1227.84 (  0.00%)     1210.64 * -1.40%*
Hmean     8       2379.04 (  0.00%)     2381.73 *  0.11%*
Hmean     16      4634.66 (  0.00%)     4601.21 * -0.72%*
Hmean     32      7592.12 (  0.00%)     7626.84 *  0.46%*
Hmean     64      9241.11 (  0.00%)     9251.51 *  0.11%*
Hmean     128    17870.37 (  0.00%)    20620.98 * 15.39%*
Hmean     256    19370.92 (  0.00%)    20406.51 *  5.35%*
Hmean     384    19413.92 (  0.00%)    20312.97 *  4.63%*
(SNC off)
Hmean     1        287.90 (  0.00%)      292.37 *  1.55%*
Hmean     2        575.52 (  0.00%)      583.29 *  1.35%*
Hmean     4       1137.94 (  0.00%)     1155.83 *  1.57%*
Hmean     8       2250.42 (  0.00%)     2297.63 *  2.10%*
Hmean     16      4363.41 (  0.00%)     4562.44 *  4.56%*
Hmean     32      7338.06 (  0.00%)     7425.69 *  1.19%*
Hmean     64      8914.66 (  0.00%)     9021.77 *  1.20%*
Hmean     128    19978.59 (  0.00%)    20257.76 *  1.40%*
Hmean     256    20057.49 (  0.00%)    20043.54 * -0.07%*
Hmean     384    19846.74 (  0.00%)    19528.03 * -1.61%*

Conclusion
==========

Limited scanning for idle cores when the LLC is overloaded is nearly
neutral compared to a full scan on smaller LLCs, but is an obvious win
on the bigger ones, which is where hardware is heading.
Suggested-by: Mel Gorman
Signed-off-by: Abel Wu
---
 kernel/sched/fair.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5af9bf246274..7abe188a1533 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6437,26 +6437,42 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		time = cpu_clock(this);
 	}
 
-	if (sched_feat(SIS_UTIL) && !has_idle_core) {
+	if (sched_feat(SIS_UTIL)) {
 		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
 		if (sd_share) {
 			/* because !--nr is the condition to stop scan */
 			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
-			/* overloaded LLC is unlikely to have idle cpu/core */
-			if (nr == 1)
+
+			/*
+			 * An overloaded LLC is unlikely to have idle cpus.
+			 * But if the has_idle_core hint is true, a limited
+			 * speculative scan might help without incurring
+			 * much overhead.
+			 */
+			if (has_idle_core)
+				nr = nr > 1 ? INT_MAX : 3;
+			else if (nr == 1)
 				return -1;
 		}
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
+		/*
+		 * This might get the has_idle_cores hint cleared after a
+		 * partial scan for idle cores, but the hint is probably
+		 * wrong anyway. What's more important is that not clearing
+		 * the hint may result in excessive partial scans for idle
+		 * cores, introducing non-negligible overhead.
+		 */
+		if (!--nr)
+			break;
+
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
 			if ((unsigned int)i < nr_cpumask_bits)
 				return i;
 
 		} else {
-			if (!--nr)
-				return -1;
 			idle_cpu = __select_idle_cpu(cpu, p);
 			if ((unsigned int)idle_cpu < nr_cpumask_bits)
 				break;
-- 
2.37.3

From nobody Mon Apr 6 08:10:32 2026
From: Abel Wu
To: Peter Zijlstra, Mel Gorman, Vincent Guittot
Cc: Josh Don, Chen Yu, Tim Chen, K Prateek Nayak, Gautham R. Shenoy,
    linux-kernel@vger.kernel.org, Abel Wu
Subject: [PATCH v5 3/5] sched/fair: Skip core update if task pending
Date: Fri, 9 Sep 2022 13:53:02 +0800
Message-Id: <20220909055304.25171-4-wuyun.abel@bytedance.com>
In-Reply-To: <20220909055304.25171-1-wuyun.abel@bytedance.com>
References: <20220909055304.25171-1-wuyun.abel@bytedance.com>

The function __update_idle_core() assumes this cpu is idle, so it only
checks the siblings to decide whether the resident core is idle,
updating the has_idle_cores hint if necessary. The problem is that this
cpu might not be idle any more at that moment, which makes the hint
misleading. It would not be proper to make this check everywhere in the
idle path, but checking just before the core update makes the
has_idle_core hint more reliable at negligible cost.
Signed-off-by: Abel Wu --- kernel/sched/fair.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7abe188a1533..fad289530e07 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6294,6 +6294,9 @@ void __update_idle_core(struct rq *rq) int core =3D cpu_of(rq); int cpu; =20 + if (rq->ttwu_pending) + return; + rcu_read_lock(); if (test_idle_cores(core, true)) goto unlock; --=20 2.37.3 From nobody Mon Apr 6 08:10:32 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id ED27CECAAA1 for ; Fri, 9 Sep 2022 05:53:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230307AbiIIFxw (ORCPT ); Fri, 9 Sep 2022 01:53:52 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36214 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230226AbiIIFxk (ORCPT ); Fri, 9 Sep 2022 01:53:40 -0400 Received: from mail-pj1-x102b.google.com (mail-pj1-x102b.google.com [IPv6:2607:f8b0:4864:20::102b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 25B6CB6D1A for ; Thu, 8 Sep 2022 22:53:35 -0700 (PDT) Received: by mail-pj1-x102b.google.com with SMTP id s14-20020a17090a6e4e00b0020057c70943so4394662pjm.1 for ; Thu, 08 Sep 2022 22:53:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date; bh=0uEhMXPDUDFRk9g65SyN66WJYbSzpbA8SwnRw2EhGFU=; b=7B5ASd0p5RbvenZXMcOQY9nlGWeFOctkqQw3j7ozSxFDgjAXLf5JDHPCdIwavHE9DO LKIrLjzQt/XZPFgGVHsJw9zmiWb8t0O5sWCp64VV+t6tUmMGJE9YaUq4Mdc7T75KRNx9 DooLFOXx0H5+rka8y1exh7cf/9iXfTZs9GUblLRQ1upkzI329offz7hnklOXaa6fKjV2 
TXVjCA9DLntKFF0FTyrx0YELiOtZglvoJ8Sj756TsLsBuPU4K9vkVJYC4sALcpx001Gv USBEyHaj4Ff/YJtF/YT4Ddz83Gy/BzM+Jy5+jrfoM0kpUGQDGjJpp5t9dj9nDjAx7VeP jP2g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date; bh=0uEhMXPDUDFRk9g65SyN66WJYbSzpbA8SwnRw2EhGFU=; b=gISl5I5tiNzOsuG1zvVIorye8TAZRcxA1kqI2kO3uwOOwh7HjmEhrMZA203XkXx0G0 hMdlBByksX3vs4r+A/xi1tTktzFntGYoeag/+PXuF+MZJv+H19Ol2N3vfx8ROXmZEWwF jD4XP86e95ZGnwSxheQVf+7BPUPEySblmqwX7sK8Tag2wAmo0Fy28/2jFveLc7RkGuQT 47ixDTC/eqSyQNuRFBkg4prwVgFtyKGuPUd91t6ixLKg+2DltI3iUFLOfW7vaf2CvJ2L Y/iyBctwGPYPUXkAKABtUhqDaSdREexuVuoIfSAqZM0Jk7+diS6wwsFH6X8SusJUjOQe FWCA== X-Gm-Message-State: ACgBeo20pXt1BTGR1YMFeP79hIy4Cm2d1uDbHRnT5bZqIBJzuiJSIqZI FpxSxT24NEto0UGVSkjIC2XWCw== X-Google-Smtp-Source: AA6agR4/NzRF6WmODs+HNEfDatv0qjiM2g7hhwml4B0ymJVkqHpDIQZqQI6ZVy3ZdzuZXCxOYnUfKw== X-Received: by 2002:a17:903:247:b0:16c:5017:9ad4 with SMTP id j7-20020a170903024700b0016c50179ad4mr12809435plh.115.1662702814994; Thu, 08 Sep 2022 22:53:34 -0700 (PDT) Received: from localhost.localdomain ([139.177.225.249]) by smtp.gmail.com with ESMTPSA id y66-20020a636445000000b00421841943dfsm464380pgb.12.2022.09.08.22.53.31 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Thu, 08 Sep 2022 22:53:34 -0700 (PDT) From: Abel Wu To: Peter Zijlstra , Mel Gorman , Vincent Guittot Cc: Josh Don , Chen Yu , Tim Chen , K Prateek Nayak , "Gautham R . 
Shenoy" , linux-kernel@vger.kernel.org, Abel Wu Subject: [PATCH v5 4/5] sched/fair: Skip SIS domain scan if fully busy Date: Fri, 9 Sep 2022 13:53:03 +0800 Message-Id: <20220909055304.25171-5-wuyun.abel@bytedance.com> X-Mailer: git-send-email 2.37.3 In-Reply-To: <20220909055304.25171-1-wuyun.abel@bytedance.com> References: <20220909055304.25171-1-wuyun.abel@bytedance.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" If a full domain scan failed, then no unoccupied cpus available and the LLC is fully busy. In this case we'd better use cpus more wisely, rather than wasting it trying to find an idle cpu that probably not exist. The fully busy status will be cleared when any cpu of that LLC goes idle and everything goes back to normal again. Make the has_idle_cores boolean hint more rich by turning it into a state machine. Signed-off-by: Abel Wu --- include/linux/sched/topology.h | 35 +++++++++++++++++- kernel/sched/fair.c | 67 ++++++++++++++++++++++++++++------ 2 files changed, 89 insertions(+), 13 deletions(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 816df6cc444e..cc6089765b64 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -77,10 +77,43 @@ extern int sched_domain_level_max; =20 struct sched_group; =20 +/* + * States of the sched-domain + * + * - sd_has_icores + * This state is only used in LLC domains to indicate worthy + * of a full scan in SIS due to idle cores available. + * + * - sd_has_icpus + * This state indicates that unoccupied (sched-idle/idle) cpus + * might exist in this domain. For the LLC domains it is the + * default state since these cpus are the main targets of SIS + * search, and is also used as a fallback state of the other + * states. + * + * - sd_is_busy + * This state indicates there are no unoccupied cpus in this + * domain. 
So for LLC domains, it gives the hint on whether + * we should put efforts on the SIS search or not. + * + * For LLC domains, sd_has_icores is set when the last non-idle cpu of + * a core becomes idle. After a full SIS scan and if no idle cores found, + * sd_has_icores must be cleared and the state will be set to sd_has_icpus + * or sd_is_busy depending on whether there is any idle cpu. And during + * load balancing on each SMT domain inside the LLC, the state will be + * re-evaluated and switch from sd_is_busy to sd_has_icpus if idle cpus + * exist. + */ +enum sd_state { + sd_has_icores, + sd_has_icpus, + sd_is_busy +}; + struct sched_domain_shared { atomic_t ref; atomic_t nr_busy_cpus; - int has_idle_cores; + enum sd_state state; int nr_idle_scan; }; =20 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fad289530e07..25df73c7e73c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6262,26 +6262,47 @@ static inline int __select_idle_cpu(int cpu, struct= task_struct *p) DEFINE_STATIC_KEY_FALSE(sched_smt_present); EXPORT_SYMBOL_GPL(sched_smt_present); =20 -static inline void set_idle_cores(int cpu, int val) +static inline void set_llc_state(int cpu, enum sd_state state) { struct sched_domain_shared *sds; =20 sds =3D rcu_dereference(per_cpu(sd_llc_shared, cpu)); if (sds) - WRITE_ONCE(sds->has_idle_cores, val); + WRITE_ONCE(sds->state, state); } =20 -static inline bool test_idle_cores(int cpu, bool def) +static inline enum sd_state get_llc_state(int cpu, enum sd_state def) { struct sched_domain_shared *sds; =20 sds =3D rcu_dereference(per_cpu(sd_llc_shared, cpu)); if (sds) - return READ_ONCE(sds->has_idle_cores); + return READ_ONCE(sds->state); =20 return def; } =20 +static inline void clear_idle_cpus(int cpu) +{ + set_llc_state(cpu, sd_is_busy); +} + +static inline bool test_idle_cpus(int cpu) +{ + return get_llc_state(cpu, sd_has_icpus) !=3D sd_is_busy; +} + +static inline void set_idle_cores(int cpu, int core_idle) +{ + set_llc_state(cpu, 
		      core_idle ? sd_has_icores : sd_has_icpus);
+}
+
+static inline bool test_idle_cores(int cpu, bool def)
+{
+	return sd_has_icores ==
+	       get_llc_state(cpu, def ? sd_has_icores : sd_has_icpus);
+}
+
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
  * information in sd_llc_shared->has_idle_cores.
@@ -6291,25 +6312,29 @@ static inline bool test_idle_cores(int cpu, bool def)
  */
 void __update_idle_core(struct rq *rq)
 {
-	int core = cpu_of(rq);
-	int cpu;
+	enum sd_state old, new = sd_has_icores;
+	int core = cpu_of(rq), cpu;
 
 	if (rq->ttwu_pending)
 		return;
 
 	rcu_read_lock();
-	if (test_idle_cores(core, true))
+	old = get_llc_state(core, sd_has_icores);
+	if (old == sd_has_icores)
 		goto unlock;
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
 		if (cpu == core)
 			continue;
 
-		if (!available_idle_cpu(cpu))
-			goto unlock;
+		if (!available_idle_cpu(cpu)) {
+			new = sd_has_icpus;
+			break;
+		}
 	}
 
-	set_idle_cores(core, 1);
+	if (old != new)
+		set_llc_state(core, new);
 unlock:
 	rcu_read_unlock();
 }
@@ -6370,6 +6395,15 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 
 #else /* CONFIG_SCHED_SMT */
 
+static inline void clear_idle_cpus(int cpu)
+{
+}
+
+static inline bool test_idle_cpus(int cpu)
+{
+	return true;
+}
+
 static inline void set_idle_cores(int cpu, int val)
 {
 }
@@ -6406,6 +6440,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	struct sched_domain *this_sd;
 	u64 time = 0;
 
+	if (!test_idle_cpus(target))
+		return -1;
+
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
 		return -1;
@@ -6482,8 +6519,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
-	if (has_idle_core)
-		set_idle_cores(target, false);
+	/*
+	 * If no idle cpu can be found, set the LLC state to busy so
+	 * that subsequent SIS domain scans can be skipped to save a
+	 * few cycles.
+	 */
+	if (idle_cpu == -1)
+		clear_idle_cpus(target);
+	else if (has_idle_core)
+		set_idle_cores(target, 0);
 
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		time = cpu_clock(this) - time;
-- 
2.37.3

From: Abel Wu
To: Peter Zijlstra , Mel Gorman , Vincent Guittot
Cc: Josh Don , Chen Yu , Tim Chen , K Prateek Nayak , "Gautham R . Shenoy" , linux-kernel@vger.kernel.org, Abel Wu
Subject: [PATCH v5 5/5] sched/fair: Introduce SIS_FILTER
Date: Fri, 9 Sep 2022 13:53:04 +0800
Message-Id: <20220909055304.25171-6-wuyun.abel@bytedance.com>
X-Mailer: git-send-email 2.37.3
In-Reply-To: <20220909055304.25171-1-wuyun.abel@bytedance.com>
References: <20220909055304.25171-1-wuyun.abel@bytedance.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

The wakeup fastpath for fair tasks, select_idle_sibling() aka SIS, plays
an important role in maximizing the usage of cpu resources and can
greatly affect the overall performance of the system.
SIS tries to find an idle cpu inside the target LLC on which to place the
woken-up task. The cache-hot cpus are preferred, but if none of them is
idle, SIS falls back to the other cpus of that LLC by firing a domain
scan.

The domain scan works well under light pressure by simply traversing the
cpus of the LLC, because lots of idle cpus are available. But things
change. LLCs are getting bigger in modern and future machines, and cloud
service providers are consistently trying to reduce TCO by pushing more
load and squeezing all kinds of resources. So the fashion of simply
traversing doesn't fit future requirements well. Features like
SIS_{UTIL,PROP} already exist to cope with the scalability issue by
limiting the scan depth, and it would be better if we could also improve
the way the scan works. This is exactly what SIS_FILTER is born for.

The SIS filter is supposed to contain the idle and sched-idle cpus of the
target LLC, and tries to improve the efficiency of the SIS domain scan by
filtering out busy cpus, so that the limited scan depth is used more
wisely. The idle cpus are propagated to the filter lazily if their
resident cores are already present in the filter. This eases the pain of
LLC cache traffic, and can also bring benefit by spreading load out to
different cores.

There is also a sequence number indicating the generation of the filter.
The generation advances every time the filter resets. Once a cpu is
propagated/set into the filter, the filter generation is cached locally
to help identify whether the cpu is present in the filter, rather than
peeking at the LLC-shared filter.

Benchmark
=========

Tests are done on a dual socket (2 x 24C/48T) machine modeled Intel
Xeon(R) Platinum 8260, with SNC configuration:

	SNC on:  4 NUMA nodes each of which has 12C/24T
	SNC off: 2 NUMA nodes each of which has 24C/48T

All of the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled.
Baseline is tip sched/core 0fba527e959d (v5.19.0) with the first two
patches of this series applied, and "patched" contains the whole series.

Results
=======

hackbench-process-pipes
                    baseline              patched
(SNC on)
Amean     1       0.4557 (   0.00%)     0.4500 (   1.24%)
Amean     4       0.6033 (   0.00%)     0.5990 (   0.72%)
Amean     7       0.7627 (   0.00%)     0.7377 (   3.28%)
Amean    12       1.0653 (   0.00%)     1.0490 (   1.53%)
Amean    21       2.0283 (   0.00%)     1.8680 *   7.90%*
Amean    30       2.9670 (   0.00%)     2.7710 *   6.61%*
Amean    48       4.6863 (   0.00%)     4.5393 (   3.14%)
Amean    79       7.9443 (   0.00%)     7.6610 *   3.57%*
Amean   110      10.2393 (   0.00%)    10.5560 (  -3.09%)
Amean   141      12.6343 (   0.00%)    12.6137 (   0.16%)
Amean   172      14.9957 (   0.00%)    14.6547 *   2.27%*
Amean   203      16.9133 (   0.00%)    16.9000 (   0.08%)
Amean   234      19.2433 (   0.00%)    18.7687 (   2.47%)
Amean   265      21.6917 (   0.00%)    21.4060 (   1.32%)
Amean   296      23.8743 (   0.00%)    23.0990 *   3.25%*
(SNC off)
Amean     1       0.3143 (   0.00%)     0.2933 (   6.68%)
Amean     4       0.6070 (   0.00%)     0.5883 (   3.08%)
Amean     7       0.7960 (   0.00%)     0.7570 (   4.90%)
Amean    12       1.1947 (   0.00%)     1.0780 *   9.77%*
Amean    21       2.4837 (   0.00%)     1.8903 *  23.89%*
Amean    30       3.4577 (   0.00%)     2.7903 *  19.30%*
Amean    48       5.5227 (   0.00%)     4.8920 (  11.42%)
Amean    79       9.2933 (   0.00%)     8.0127 *  13.78%*
Amean   110      11.0443 (   0.00%)    10.1557 *   8.05%*
Amean   141      13.1360 (   0.00%)    12.7387 (   3.02%)
Amean   172      15.7770 (   0.00%)    14.5860 *   7.55%*
Amean   203      17.9557 (   0.00%)    17.1950 *   4.24%*
Amean   234      20.4373 (   0.00%)    19.6763 *   3.72%*
Amean   265      23.6053 (   0.00%)    22.5557 (   4.45%)
Amean   296      25.6803 (   0.00%)    24.4273 *   4.88%*

Generally the results show better improvement on larger LLCs. When the
load increases but does not saturate the cpus (<30 groups), the benefit
is obvious, and even under extreme pressure the filter still helps
squeeze out some performance (remember that the baseline already
includes the first two patches).
tbench4 Throughput
                    baseline              patched
(SNC on)
Hmean     1      301.54 (   0.00%)    302.52 *   0.32%*
Hmean     2      607.81 (   0.00%)    604.76 *  -0.50%*
Hmean     4     1210.64 (   0.00%)   1204.79 *  -0.48%*
Hmean     8     2381.73 (   0.00%)   2375.87 *  -0.25%*
Hmean    16     4601.21 (   0.00%)   4681.25 *   1.74%*
Hmean    32     7626.84 (   0.00%)   7607.93 *  -0.25%*
Hmean    64     9251.51 (   0.00%)   8956.00 *  -3.19%*
Hmean   128    20620.98 (   0.00%)  19258.30 *  -6.61%*
Hmean   256    20406.51 (   0.00%)  20783.82 *   1.85%*
Hmean   384    20312.97 (   0.00%)  20407.40 *   0.46%*
(SNC off)
Hmean     1      292.37 (   0.00%)    286.27 *  -2.09%*
Hmean     2      583.29 (   0.00%)    574.82 *  -1.45%*
Hmean     4     1155.83 (   0.00%)   1137.27 *  -1.61%*
Hmean     8     2297.63 (   0.00%)   2261.98 *  -1.55%*
Hmean    16     4562.44 (   0.00%)   4430.95 *  -2.88%*
Hmean    32     7425.69 (   0.00%)   7341.70 *  -1.13%*
Hmean    64     9021.77 (   0.00%)   8954.61 *  -0.74%*
Hmean   128    20257.76 (   0.00%)  20198.82 *  -0.29%*
Hmean   256    20043.54 (   0.00%)  19785.57 *  -1.29%*
Hmean   384    19528.03 (   0.00%)  19956.96 *   2.20%*

The slight regression indicates that if the workload already has a
relatively high SIS success rate (e.g. ~50% for tbench4), the benefit the
filter brings is reduced while its cost remains, and the benefit might
not cover the cost once the SIS success rate goes high enough. But the
real world is more complicated: different workloads can be co-located on
the same machine to share resources, and their profiles can vary quite a
lot, so the SIS success rate is not predictable.

Conclusion
==========

The SIS filter is much more efficient than a linear scan under certain
circumstances, and even when it gets unlucky, the filter can be disabled
at any time.
Signed-off-by: Abel Wu
---
 include/linux/sched/topology.h | 18 ++++++++
 kernel/sched/core.c            |  1 +
 kernel/sched/fair.c            | 83 ++++++++++++++++++++++++++++++++--
 kernel/sched/features.h        |  1 +
 kernel/sched/sched.h           |  3 ++
 kernel/sched/topology.c        |  9 +++-
 6 files changed, 109 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index cc6089765b64..f8e6154b5c37 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -114,7 +114,20 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	enum sd_state	state;
+	u64		seq;
 	int		nr_idle_scan;
+
+	/*
+	 * The SIS filter
+	 *
+	 * Records idle and sched-idle cpus to improve the efficiency of
+	 * the SIS domain scan.
+	 *
+	 * NOTE: this field is variable length. (Allocated dynamically
+	 * by attaching extra space to the end of the structure,
+	 * depending on how many CPUs the kernel has booted up with.)
+	 */
+	unsigned long	icpus[];
 };
 
 struct sched_domain {
@@ -200,6 +213,11 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+static inline struct cpumask *sched_domain_icpus(struct sched_domain *sd)
+{
+	return to_cpumask(sd->shared->icpus);
+}
+
 extern void partition_sched_domains_locked(int ndoms_new,
 					   cpumask_var_t doms_new[],
 					   struct sched_domain_attr *dattr_new);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7d289d87acf7..a0cbf6c0d540 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9719,6 +9719,7 @@ void __init sched_init(void)
 		rq->wake_stamp = jiffies;
 		rq->wake_avg_idle = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+		rq->last_idle_seq = 0;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25df73c7e73c..354e6e646a7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6262,6 +6262,57 @@ static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
 
+static inline u64 filter_seq(int cpu)
+{
+	struct sched_domain_shared *sds;
+
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds)
+		return READ_ONCE(sds->seq);
+
+	return 0;
+}
+
+static inline void filter_set_cpu(int cpu, int nr)
+{
+	struct sched_domain_shared *sds;
+
+	if (!sched_feat(SIS_FILTER))
+		return;
+
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds) {
+		cpu_rq(nr)->last_idle_seq = filter_seq(cpu);
+		set_bit(nr, sds->icpus);
+	}
+}
+
+static inline bool filter_test_cpu(int cpu, int nr)
+{
+	if (!sched_feat(SIS_FILTER))
+		return true;
+
+	return cpu_rq(nr)->last_idle_seq >= filter_seq(cpu);
+}
+
+static inline void filter_reset(int cpu)
+{
+	struct sched_domain_shared *sds;
+
+	if (!sched_feat(SIS_FILTER))
+		return;
+
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (sds) {
+		bitmap_zero(sds->icpus, nr_cpumask_bits);
+		/*
+		 * The seq field is racy, but at least we can
+		 * use WRITE_ONCE() to prevent store tearing.
+		 */
+		WRITE_ONCE(sds->seq, filter_seq(cpu) + 1);
+	}
+}
+
 static inline void set_llc_state(int cpu, enum sd_state state)
 {
 	struct sched_domain_shared *sds;
@@ -6285,6 +6336,8 @@ static inline enum sd_state get_llc_state(int cpu, enum sd_state def)
 static inline void clear_idle_cpus(int cpu)
 {
 	set_llc_state(cpu, sd_is_busy);
+	if (sched_smt_active())
+		filter_reset(cpu);
 }
 
 static inline bool test_idle_cpus(int cpu)
@@ -6314,13 +6367,15 @@ void __update_idle_core(struct rq *rq)
 {
 	enum sd_state old, new = sd_has_icores;
 	int core = cpu_of(rq), cpu;
+	int exist;
 
 	if (rq->ttwu_pending)
 		return;
 
 	rcu_read_lock();
 	old = get_llc_state(core, sd_has_icores);
-	if (old == sd_has_icores)
+	exist = filter_test_cpu(core, core);
+	if (old == sd_has_icores && exist)
 		goto unlock;
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
@@ -6329,11 +6384,26 @@ void __update_idle_core(struct rq *rq)
 
 		if (!available_idle_cpu(cpu)) {
 			new = sd_has_icpus;
-			break;
+
+			/*
+			 * If any cpu of this core has already
+			 * been set in the filter, then this
+			 * core is present and won't be missed
+			 * during the SIS domain scan.
+			 */
+			if (exist)
+				break;
+			if (!sched_idle_cpu(cpu))
+				continue;
 		}
+
+		if (!exist)
+			exist = filter_test_cpu(core, cpu);
 	}
 
-	if (old != new)
+	if (!exist)
+		filter_set_cpu(core, core);
+	if (old != sd_has_icores && old != new)
 		set_llc_state(core, new);
 unlock:
 	rcu_read_unlock();
@@ -6447,8 +6517,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	if (!this_sd)
 		return -1;
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		u64 avg_cost, avg_idle, span_avg;
 		unsigned long now = jiffies;
@@ -6496,6 +6564,11 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
+	if (sched_smt_active() && sched_feat(SIS_FILTER))
+		cpumask_and(cpus, sched_domain_icpus(sd), p->cpus_ptr);
+	else
+		cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		/*
 		 * This might get the has_idle_cores hint cleared for a
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..1bebdb87c2f4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_FILTER, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b75ac74986fb..1fe1b152bc20 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1071,6 +1071,9 @@ struct rq {
 	/* This is used to determine avg_idle's max value */
 	u64			max_idle_balance_cost;
 
+	/* Cached filter generation when setting this cpu */
+	u64			last_idle_seq;
+
 #ifdef CONFIG_HOTPLUG_CPU
 	struct rcuwait		hotplug_wait;
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..01dccaca0fa8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1641,6 +1641,13 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+		cpumask_copy(sched_domain_icpus(sd), sd_span);
+
+		/*
+		 * The cached per-cpu seq starts from 0, so initialize the
+		 * filter seq to 1 to discard all cached cpu state.
+		 */
+		WRITE_ONCE(sd->shared->seq, 1);
 	}
 
 	sd->private = sdd;
@@ -2106,7 +2113,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 		*per_cpu_ptr(sdd->sd, j) = sd;
 
-		sds = kzalloc_node(sizeof(struct sched_domain_shared),
+		sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
 				GFP_KERNEL, cpu_to_node(j));
 		if (!sds)
 			return -ENOMEM;
-- 
2.37.3