From: Chen Yu
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot
Cc: Mike Galbraith, Tim Chen, Yujie Liu, K Prateek Nayak,
    "Gautham R. Shenoy", Chen Yu, linux-kernel@vger.kernel.org, Chen Yu
Subject: [PATCH 1/2] sched/fair: Record the average duration of a task
Date: Tue, 25 Jun 2024 15:22:09 +0800
Message-Id: <338ec61022d4b5242e4af6d156beac53f20eacf2.1719295669.git.yu.c.chen@intel.com>

Record the average duration of a task, as there is a requirement to
leverage this information for better task placement.

At first glance, (p->se.sum_exec_runtime / p->nvcsw) could be used to
measure the task duration. However, that formula weighs the distant past
too heavily; ideally, old activity should decay and not dominate the
current status. Something based on PELT could be used instead, but
se.util_avg is not appropriate for describing the task duration either:
if task p1 and task p2 do frequent ping-pong scheduling on one CPU, both
have a short duration, yet the util_avg of each task can reach 50%, which
is inconsistent with the short duration.

Here is an example of what the average duration is. Suppose that on CPUx,
tasks p1 and p2 run alternately:

 --------------------> time

 | p1 runs 1ms | p2 preempts p1 | p1 switches in, runs 0.5ms and blocks |
 ^             ^                ^
 |_____________|                |_______________________________________|
                                                                        ^
                                                                        |
                                                                  p1 dequeued

p1's duration is (1 + 0.5)ms, because if p2 had not preempted p1, p1
could have run for 1.5ms. This reflects the nature of a task: how long
it wishes to run at most.
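
For reference, the decaying average is maintained by update_avg(). Below
is a minimal standalone sketch of that helper, assuming the 1/8 EWMA
weighting of the mainline definition in kernel/sched/sched.h; it is
illustrative only, not new code added by this patch:

  #include <stdint.h>

  typedef uint64_t u64;
  typedef int64_t  s64;

  /*
   * Sketch of update_avg(), assumed to match kernel/sched/sched.h:
   * each new sample moves the average by 1/8 of the difference, so old
   * history decays instead of dominating the result the way
   * sum_exec_runtime / nvcsw would.
   */
  static inline void update_avg(u64 *avg, u64 sample)
  {
          s64 diff = sample - *avg;

          *avg += diff / 8;
  }

dur_avg_update() below feeds one sample per sleep: the runtime accumulated
since the task last slept. In the example above, p1 contributes a single
1.5ms sample when it finally blocks.
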
Suggested-by: Tim Chen
Signed-off-by: Chen Yu
---
 include/linux/sched.h |  3 +++
 kernel/sched/core.c   |  2 ++
 kernel/sched/fair.c   | 12 ++++++++++++
 3 files changed, 17 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 90691d99027e..78747d3954fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1339,6 +1339,9 @@ struct task_struct {
 	struct callback_head		cid_work;
 #endif
 
+	u64				prev_sleep_sum_runtime;
+	u64				duration_avg;
+
 	struct tlbflush_unmap_batch	tlb_ubc;
 
 	/* Cache last used pipe for splice(): */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0935f9d4bb7b..7399c4143528 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->migration_pending = NULL;
 #endif
 	init_sched_mm_cid(p);
+	p->prev_sleep_sum_runtime = 0;
+	p->duration_avg = 0;
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41b58387023d..445877069fbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6833,6 +6833,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 static void set_next_buddy(struct sched_entity *se);
 
+static inline void dur_avg_update(struct task_struct *p)
+{
+	u64 dur;
+
+	dur = p->se.sum_exec_runtime - p->prev_sleep_sum_runtime;
+	p->prev_sleep_sum_runtime = p->se.sum_exec_runtime;
+	update_avg(&p->duration_avg, dur);
+}
+
 /*
  * The dequeue_task method is called before nr_running is
  * decreased. We remove the task from the rbtree and
@@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
+	if (task_sleep)
+		dur_avg_update(p);
+
 	hrtick_update(rq);
 }
 
-- 
2.25.1


From: Chen Yu
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot
Cc: Mike Galbraith, Tim Chen, Yujie Liu, K Prateek Nayak,
    "Gautham R. Shenoy", Chen Yu, linux-kernel@vger.kernel.org, Chen Yu
Subject: [PATCH 2/2] sched/fair: Enhance sync wakeup for short duration tasks
Date: Tue, 25 Jun 2024 15:22:22 +0800
Message-Id: <383932811ae0cb4df0bb131fa968e746de979417.1719295669.git.yu.c.chen@intel.com>

[Problem Statement]

On platforms with many CPUs, one bottleneck is the high Cache-to-Cache
latency. The issue is exacerbated when tasks sharing data run on
different CPUs: when the tasks access different parts of the same cache
line, false sharing happens. One example is a network client/server
workload with small packets.

A simple example on a system with 240 CPUs, 2 sockets:

  taskset -c 2 netserver

  taskset -c 1 netperf -4 -H 127.0.0.1 -t TCP_RR -c -C -l 100
  Trans Rate per sec: 83528.11

  taskset -c 2 netperf -4 -H 127.0.0.1 -t TCP_RR -c -C -l 100
  Trans Rate per sec: 134504.35

[Problem Analysis]

TL;DR: when netperf and netserver run on different cores, cache false
sharing in the TCP/IP stack hurts the performance. The issue exists as
long as netperf and netserver are on the same system and within the same
network namespace.

Detail:

With the help of perf topdown, when netperf and netserver are both on CPU2:

  28.1 %  tma_backend_bound
    13.7 %  tma_memory_bound
       3.3 %  tma_l2_bound
       9.3 %  tma_l1_bound

When netperf is on CPU1 and netserver is on CPU2:

  30.5 %  tma_backend_bound
    16.8 %  tma_memory_bound
      11.0 %  tma_l1_bound
      32.4 %  tma_l3_bound
        59.5 %  tma_contested_accesses   <----
        11.1 %  tma_data_sharing

tma_contested_accesses increases a lot when netperf and netserver are on
different CPUs. Contested accesses occur when data written by one thread
is read by another thread on a different core, which indicates cache
false sharing. perf c2c shows where the false sharing happens; the top 2
offenders:

   ----- HITM -----   ------- Store Refs ------   ------ Data address ------
  RmtHitm   LclHitm     L1 Hit   L1 Miss             Offset   Node
    0.00%    55.17%      0.00%     0.00%               0x1c         <---- read
    0.00%     0.00%     20.00%     0.00%               0x1f         <---- write

To be more specific, there are frequent reads and writes on the same
cache lines in struct tcp_sock:

  struct tcp_sock {
      new cache line
      ...
      u16  tcp_header_len;        <---- read
      u8   scaling_ratio;
      u8   chrono_type : 2,       <---- write
           repair : 1,
           tcp_usec_ts : 1,
           is_sack_reneg : 1,
           is_cwnd_limited : 1;   <---- write

      new cache line
      u32  copied_seq;            <---- write
      u32  rcv_tstamp;            <---- write
      u32  snd_wl1;               <---- write
      ...
      u32  urg_seq;               <---- read

Re-arranging the layout of struct tcp_sock could become a seesaw game, as
the variables above are frequently accessed by different paths of the
TCP/IP stack.

Instead, propose a more generic solution. If

  1. the waker and the wakee are both short duration tasks,
  2. the wakeup is WF_SYNC,
  3. there is no idle core in the system,
  4. the waker and the wakee wake up each other,

then wake up the wakee on the same CPU as the waker.

N.B. The bar to regard a task as a short duration one depends on the
number of CPUs. Normally we don't want to enable this wakeup feature on
desktop or mobile systems, because the overhead of Cache-to-Cache latency
is negligible on small systems.
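
To make that bar concrete, here is a small standalone sketch of the
cutoff implied by short_task() below. It assumes the default
sysctl_sched_migration_cost of 500000 ns; the helper name and the
user-space wrapper are purely illustrative and not part of the patch:

  #include <stdio.h>
  #include <stdint.h>

  /* Assumes the default sysctl_sched_migration_cost (500 usec). */
  #define SCHED_MIGRATION_COST_NS 500000ULL

  /*
   * Mirrors the short_task() condition in this patch:
   *   duration_avg << 16  <  migration_cost * llc_weight^2
   * i.e. the cutoff is migration_cost * llc_weight^2 / 256^2.
   */
  static uint64_t short_task_cutoff_ns(unsigned int llc_weight)
  {
          return SCHED_MIGRATION_COST_NS * llc_weight * llc_weight >> 16;
  }

  int main(void)
  {
          unsigned int llc;

          /* Prints ~488 ns for an 8-CPU LLC up to 500000 ns for 256 CPUs. */
          for (llc = 8; llc <= 256; llc <<= 1)
                  printf("llc_weight=%3u -> cutoff %llu ns\n", llc,
                         (unsigned long long)short_task_cutoff_ns(llc));

          return 0;
  }

With a 64-CPU LLC the cutoff is about 31 usec, while with an 8-CPU LLC it
is only about 0.5 usec, so the fast path is rarely taken on small systems.
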
[Benchmark results]

Tested on 4 platforms. Significant throughput improvements were observed
on tbench, netperf, stress-ng and will-it-scale, and latency reductions
on lmbench3.

Platform1, 240 CPUs, 2 sockets, Intel(R) Xeon(R)
================================================

netperf
=======
case              load          baseline(std%)   compare%( std%)
TCP_RR            60-threads     1.00 (  1.04)    -0.03 (  1.27)
TCP_RR            120-threads    1.00 (  2.31)    -0.09 (  2.46)
TCP_RR            180-threads    1.00 (  1.77)    +0.93 (  1.16)
TCP_RR            240-threads    1.00 (  9.39)   +190.13 (  3.66)
TCP_RR            300-threads    1.00 ( 45.28)   +120.07 ( 19.41)
TCP_RR            360-threads    1.00 ( 20.13)    +0.27 ( 30.57)
TCP_RR            420-threads    1.00 ( 30.85)   +13.39 ( 46.38)
UDP_RR            60-threads     1.00 ( 11.86)    -0.29 (  2.66)
UDP_RR            120-threads    1.00 ( 16.28)    +0.42 ( 13.41)
UDP_RR            180-threads    1.00 ( 15.34)    +0.31 ( 17.45)
UDP_RR            240-threads    1.00 ( 16.27)    -0.36 ( 18.78)
UDP_RR            300-threads    1.00 ( 20.42)    -2.54 ( 32.42)
UDP_RR            360-threads    1.00 ( 31.59)    +0.28 ( 35.66)
UDP_RR            420-threads    1.00 ( 30.44)    -0.27 ( 37.12)

tbench
======
case              load          baseline(std%)   compare%( std%)
loopback          60-threads     1.00 (  0.27)    +0.04 (  0.11)
loopback          120-threads    1.00 (  0.65)    -1.01 (  0.41)
loopback          180-threads    1.00 (  0.42)   +62.05 ( 26.22)
loopback          240-threads    1.00 ( 30.43)   +77.61 ( 15.27)

hackbench
=========
case              load          baseline(std%)   compare%( std%)
process-pipe      1-groups       1.00 (  6.92)    +4.70 (  5.85)
process-pipe      2-groups       1.00 (  6.45)    +7.66 (  2.39)
process-pipe      4-groups       1.00 (  2.82)    -1.82 (  1.47)

schbench
========
No noticeable difference in the 99.0th percentile wakeup/request latencies
or the 50.0th percentile RPS.

schbench -m 2 -r 100              baseline   sis_sync
Wakeup Latencies 99.0th usec            27         25
Request Latencies 99.0th usec        15376      15376
RPS percentiles 50.0th               16608      16608

Platform2, 48 CPUs, 2 sockets, Intel(R) Xeon(R) CPU E5-2697
===========================================================
lmbench3: lmbench3.PIPE.latency.us                   33.8% improvement
lmbench3: lmbench3.AF_UNIX.sock.stream.latency.us    30.6% improvement

Platform3: 224 threads, 2 sockets, Intel(R) Xeon(R) Platinum 8480
=================================================================
stress-ng:     stress-ng.vm-rw.ops_per_sec          250.8% improvement
will-it-scale: will-it-scale.per_process_ops         42.1% improvement

Suggested-by: Tim Chen
Signed-off-by: Chen Yu
---
 kernel/sched/fair.c     | 62 +++++++++++++++++++++++++++++++++++++----
 kernel/sched/features.h |  1 +
 2 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 445877069fbf..d749397249ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1003,7 +1003,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #include "pelt.h"
 #ifdef CONFIG_SMP
 
-static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
+static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu, int sync);
 static unsigned long task_h_load(struct task_struct *p);
 static unsigned long capacity_of(int cpu);
 
@@ -7410,12 +7410,55 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+/*
+ * threshold of the short duration task:
+ * sysctl_sched_migration_cost * llc_weight^2 / 256^2
+ *
+ *                   threshold
+ * LLC_WEIGHT=8      0.5 usec
+ * LLC_WEIGHT=16     2 usec
+ * LLC_WEIGHT=32     8 usec
+ * LLC_WEIGHT=64     31 usec
+ * LLC_WEIGHT=128    125 usec
+ * LLC_WEIGHT=256    500 usec
+ */
+static int short_task(struct task_struct *p, int llc)
+{
+	return ((p->duration_avg << 16) <
+		(sysctl_sched_migration_cost * llc * llc));
+}
+
+static int mutual_wakeup(struct task_struct *p, int target)
+{
+	int llc_weight;
+
+	if (!sched_feat(SIS_SYNC))
+		return 0;
+
+	if (target != smp_processor_id())
+		return 0;
+
+	if (this_rq()->nr_running > 1)
+		return 0;
+
+	llc_weight = per_cpu(sd_llc_size, target);
+
+	if (!short_task(p, llc_weight) ||
+	    !short_task(current, llc_weight))
+		return 0;
+
+	if (current->last_wakee != p || p->last_wakee != current)
+		return 0;
+
+	return 1;
+}
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
  * average idle time for this rq (as found in rq->avg_idle).
  */
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target,
+			   int sync)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
@@ -7458,6 +7501,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
+	/*
+	 * The Cache-to-Cache latency could be large on a big system.
+	 * Before trying to find a completely idle CPU other than the
+	 * current one, give the current CPU another chance if the waker
+	 * and the wakee are mutually waking up each other.
+	 */
+	if (!has_idle_core && sync && mutual_wakeup(p, target))
+		return target;
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -7550,7 +7602,7 @@ static inline bool asym_fits_cpu(unsigned long util,
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
-static int select_idle_sibling(struct task_struct *p, int prev, int target)
+static int select_idle_sibling(struct task_struct *p, int prev, int target, int sync)
 {
 	bool has_idle_core = false;
 	struct sched_domain *sd;
@@ -7659,7 +7711,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		}
 	}
 
-	i = select_idle_cpu(p, sd, has_idle_core, target);
+	i = select_idle_cpu(p, sd, has_idle_core, target, sync);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
@@ -8259,7 +8311,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 			new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
-		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu, sync);
 	}
 	rcu_read_unlock();
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 143f55df890b..7e5968d01dcb 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -50,6 +50,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_SYNC, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.25.1