From nobody Fri Apr 17 22:36:12 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9949FC43334 for ; Thu, 21 Jul 2022 08:44:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232239AbiGUIoo (ORCPT ); Thu, 21 Jul 2022 04:44:44 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46012 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230016AbiGUIom (ORCPT ); Thu, 21 Jul 2022 04:44:42 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 91BFC7E83E; Thu, 21 Jul 2022 01:44:41 -0700 (PDT) Date: Thu, 21 Jul 2022 08:44:38 -0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1658393080; h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=6t9SHuMpqSRx23aqqJvzAgAl2wSJRu2uE4nVE1EuwaE=; b=HOpNuJdZWQdsxlM3ZAL5s3u0OHNXcV9M2pXKAEscti+nSDJ8CUGxpWyZrqHdM9K5jwMyza OPHPG9xS/9kIGwuTtxLkYPNSBeXMYi8lguCO/exyGY9phnRhlgOAv3VYP0E1x0cK2bpn1u uUw5PaXsJVLKH2Ka09ZQ+HxxIk+auW2sDuxGHOrvYs/2W6Ba2zOxbyWV4aTDbZbX4pj/Du 06NMlta4SPR1Eb8FAXLpjz1EdI0bOXTzrHEokzE1rptd2iAK0DBIEEnT+6vP7/rpV8Mig2 QwnzctD9xS1aEHDx/m8+UCLijZQcqMOlrUIZgxiflCGqUoMAepHpQvEER8b0aQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1658393080; h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=6t9SHuMpqSRx23aqqJvzAgAl2wSJRu2uE4nVE1EuwaE=; b=3Yws62MJ/gSpiqluCct2wa8nWfZZzQEFtq0mKiq+RlrNv0jZ1Ezui6A6CaWLJzZ1AKceHA tZ/PstudlX1fCUCg== From: "tip-bot2 for Libo Chen" Sender: tip-bot2@linutronix.de Reply-to: linux-kernel@vger.kernel.org To: linux-tip-commits@vger.kernel.org Subject: [tip: sched/core] sched/fair: Disallow sync wakeup from interrupt context Cc: Libo Chen , "Peter Zijlstra (Intel)" , x86@kernel.org, linux-kernel@vger.kernel.org In-Reply-To: <20220711224704.1672831-1-libo.chen@oracle.com> References: <20220711224704.1672831-1-libo.chen@oracle.com> MIME-Version: 1.0 Message-ID: <165839307880.15455.1642344321269380664.tip-bot2@tip-bot2> Robot-ID: Robot-Unsubscribe: Contact to get blacklisted from these emails Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The following commit has been merged into the sched/core branch of tip: Commit-ID: 14b3f2d9ee8df3b6040f7e21f9fcd1d848938fd9 Gitweb: https://git.kernel.org/tip/14b3f2d9ee8df3b6040f7e21f9fcd1d84= 8938fd9 Author: Libo Chen AuthorDate: Mon, 11 Jul 2022 15:47:04 -07:00 Committer: Peter Zijlstra CommitterDate: Thu, 21 Jul 2022 10:39:39 +02:00 sched/fair: Disallow sync wakeup from interrupt context Barry Song first pointed out that replacing sync wakeup with regular wakeup seems to reduce overeager wakeup pulling and shows noticeable performance improvement.[1] This patch argues that allowing sync for wakeups from interrupt context is a bug and fixing it can improve performance even when irq/softirq is evenly spread out. For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-call= ed waker can be any random task that happens to be running on that CPU when the interrupt comes in. This is completely different from other wakups where the task running on the waking CPU is the actual waker. For example, two tasks communicate through a pipe or mutiple tasks access the same critical sectio= n, etc. This difference is important because with sync we assume the waker will get off the runqueue and go to sleep immedately after the wakeup. The assumption is built into wake_affine() where it discounts the waker's prese= nce from the runqueue when sync is true. The random waker from interrupts bears= no relation to the wakee and don't usually go to sleep immediately afterwards unless wakeup granularity is reached. Plus the scheduler no longer enforces= the preepmtion of waker for sync wakeup as it used to before patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync wakeup preemption for wakeups from interrupt contexts doesn't seem to be appropriate too but at least sync wakeup will do what it's supposed to do. Add a check to make sure that sync can only be set when in_task() is true. = If a wakeup is from interrupt context, sync flag will be 0 because in_task() returns 0. Tested in two scenarios: wakeups from 1) task contexts, expected to see no performance changes; 2) interrupt contexts, expected to see better performa= nce under low/medium load and no regression under heavy load. Use hackbench for scenario 1 and pgbench for scenarios 2 both from mmtests = on a 2-socket Xeon E5-2699v3 box with 256G memory in total. Running 5.18 kernel with SELinux disabled. Scheduler/MM tunables are all default. Irqbalance daemon is active. Hackbench (config-scheduler-unbound) =3D=3D=3D=3D=3D=3D=3D=3D=3D process-pipes: Baseline Patched Amean 1 0.4300 ( 0.00%) 0.4420 ( -2.79%) Amean 4 0.8757 ( 0.00%) 0.8774 ( -0.20%) Amean 7 1.3712 ( 0.00%) 1.3789 ( -0.56%) Amean 12 2.3541 ( 0.00%) 2.3714 ( -0.73%) Amean 21 4.2229 ( 0.00%) 4.2439 ( -0.50%) Amean 30 5.9113 ( 0.00%) 5.9451 ( -0.57%) Amean 48 9.3873 ( 0.00%) 9.4898 ( -1.09%) Amean 79 15.9281 ( 0.00%) 16.1385 ( -1.32%) Amean 110 22.0961 ( 0.00%) 22.3433 ( -1.12%) Amean 141 28.2973 ( 0.00%) 28.6209 ( -1.14%) Amean 172 34.4709 ( 0.00%) 34.9347 ( -1.35%) Amean 203 40.7621 ( 0.00%) 41.2497 ( -1.20%) Amean 234 47.0416 ( 0.00%) 47.6470 ( -1.29%) Amean 265 53.3048 ( 0.00%) 54.1625 ( -1.61%) Amean 288 58.0595 ( 0.00%) 58.8096 ( -1.29%) process-sockets: Baseline Patched Amean 1 0.6776 ( 0.00%) 0.6611 ( 2.43%) Amean 4 2.6183 ( 0.00%) 2.5769 ( 1.58%) Amean 7 4.5662 ( 0.00%) 4.4801 ( 1.89%) Amean 12 7.7638 ( 0.00%) 7.6201 ( 1.85%) Amean 21 13.5335 ( 0.00%) 13.2915 ( 1.79%) Amean 30 19.3369 ( 0.00%) 18.9811 ( 1.84%) Amean 48 31.0724 ( 0.00%) 30.6015 ( 1.52%) Amean 79 51.1881 ( 0.00%) 50.4251 ( 1.49%) Amean 110 71.3399 ( 0.00%) 70.4578 ( 1.24%) Amean 141 91.4675 ( 0.00%) 90.3769 ( 1.19%) Amean 172 111.7463 ( 0.00%) 110.3947 ( 1.21%) Amean 203 131.6927 ( 0.00%) 130.3270 ( 1.04%) Amean 234 151.7459 ( 0.00%) 150.1320 ( 1.06%) Amean 265 171.9101 ( 0.00%) 169.9751 ( 1.13%) Amean 288 186.9231 ( 0.00%) 184.7706 ( 1.15%) thread-pipes: Baseline Patched Amean 1 0.4523 ( 0.00%) 0.4535 ( -0.28%) Amean 4 0.9041 ( 0.00%) 0.9085 ( -0.48%) Amean 7 1.4111 ( 0.00%) 1.4146 ( -0.25%) Amean 12 2.3532 ( 0.00%) 2.3688 ( -0.66%) Amean 21 4.1550 ( 0.00%) 4.1701 ( -0.36%) Amean 30 6.1043 ( 0.00%) 6.2391 ( -2.21%) Amean 48 10.2077 ( 0.00%) 10.3511 ( -1.40%) Amean 79 16.7922 ( 0.00%) 17.0427 ( -1.49%) Amean 110 23.3350 ( 0.00%) 23.6522 ( -1.36%) Amean 141 29.6458 ( 0.00%) 30.2617 ( -2.08%) Amean 172 35.8649 ( 0.00%) 36.4225 ( -1.55%) Amean 203 42.4477 ( 0.00%) 42.8332 ( -0.91%) Amean 234 48.7117 ( 0.00%) 49.4042 ( -1.42%) Amean 265 54.9171 ( 0.00%) 55.6551 ( -1.34%) Amean 288 59.5282 ( 0.00%) 60.2903 ( -1.28%) thread-sockets: Baseline Patched Amean 1 0.6917 ( 0.00%) 0.6892 ( 0.37%) Amean 4 2.6651 ( 0.00%) 2.6017 ( 2.38%) Amean 7 4.6734 ( 0.00%) 4.5637 ( 2.35%) Amean 12 8.0156 ( 0.00%) 7.8079 ( 2.59%) Amean 21 14.0451 ( 0.00%) 13.6679 ( 2.69%) Amean 30 20.0963 ( 0.00%) 19.5657 ( 2.64%) Amean 48 32.4115 ( 0.00%) 31.6001 ( 2.50%) Amean 79 53.1104 ( 0.00%) 51.8395 ( 2.39%) Amean 110 74.0929 ( 0.00%) 72.4391 ( 2.23%) Amean 141 95.1506 ( 0.00%) 93.0992 ( 2.16%) Amean 172 116.1969 ( 0.00%) 113.8307 ( 2.04%) Amean 203 137.4413 ( 0.00%) 134.5247 ( 2.12%) Amean 234 158.5395 ( 0.00%) 155.2793 ( 2.06%) Amean 265 179.7729 ( 0.00%) 176.1099 ( 2.04%) Amean 288 195.5573 ( 0.00%) 191.3977 ( 2.13%) Pgbench (config-db-pgbench-timed-ro-small) =3D=3D=3D=3D=3D=3D=3D Baseline Patched Hmean 1 68.54 ( 0.00%) 69.72 ( 1.71%) Hmean 6 27725.78 ( 0.00%) 34119.11 ( 23.06%) Hmean 12 55724.26 ( 0.00%) 63158.22 ( 13.34%) Hmean 22 72806.26 ( 0.00%) 73389.98 ( 0.80%) Hmean 30 79000.45 ( 0.00%) 75197.02 ( -4.81%) Hmean 48 78180.16 ( 0.00%) 75074.09 ( -3.97%) Hmean 80 75001.93 ( 0.00%) 70590.72 ( -5.88%) Hmean 110 74812.25 ( 0.00%) 74128.57 ( -0.91%) Hmean 142 74261.05 ( 0.00%) 72910.48 ( -1.82%) Hmean 144 75375.35 ( 0.00%) 71295.72 ( -5.41%) For hackbench, +-3% fluctuation is normal on this two-socket box, it's safe= to conclude that there are no performance differences for wakeups from task co= ntext. For pgbench, after many runs, 10~30% gains are very consistent at lower client counts (< #cores per socket). For higher client counts, both kernels= are very close, +-5% swings are quite common. Also NET_TX|RX softirq load does spread out across both NUMA nodes in this test so there is no need to = do any explicit RPS/RFS. [1]: https://lkml.kernel.org/r/20211105105136.12137-1-21cnbao@gmail.com Signed-off-by: Libo Chen Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20220711224704.1672831-1-libo.chen@oracle.c= om --- kernel/sched/fair.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 914096c..2fc4725 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7013,7 +7013,10 @@ unlock: static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) { - int sync =3D (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING); + /* + * Only consider WF_SYNC from task context; since only a task can schedul= e out. + */ + int sync =3D in_task() && (wake_flags & WF_SYNC) && !(current->flags & PF= _EXITING); struct sched_domain *tmp, *sd =3D NULL; int cpu =3D smp_processor_id(); int new_cpu =3D prev_cpu;