From: Chen Yu
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot
Cc: Mike Galbraith, Tim Chen, Yujie Liu, K Prateek Nayak,
Shenoy" , Chen Yu , linux-kernel@vger.kernel.org, Chen Yu Subject: [PATCH 2/2] sched/fair: Enhance sync wakeup for short duration tasks Date: Tue, 25 Jun 2024 15:22:22 +0800 Message-Id: <383932811ae0cb4df0bb131fa968e746de979417.1719295669.git.yu.c.chen@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" [Problem Statement] On platforms where there are many CPUs, one bottleneck is the high Cache-to-Cache latency. This issue is exacerbated when the tasks sharing data are running on different CPUs: When the tasks access different part of the same cache, false sharing happens. One example is the network client/server workload with small packages. A simple example: On a system with 240 CPUs, 2 sockets, taskset -c 2 netserver taskset -c 1 netperf -4 -H 127.0.0.1 -t TCP_RR -c -C -l 100 Trans Rate per sec: 83528.11 taskset -c 2 netperf -4 -H 127.0.0.1 -t TCP_RR -c -C -l 100 Trans Rate per sec: 134504.35 [Problem Analysis] TL;DR When netperf and nerserver are running on difference cores, the cache false sharing on the TCP/IP stack hurts the performance. As long as the netperf and netserver are on the same system, and within the same network namespace, this issue exists. Detail With the help of perf topdown, when netperf and netserver are both on CPU2: 28.1 % tma_backend_bound 13.7 % tma_memory_bound 3.3 % tma_l2_bound 9.3 % tma_l1_bound When netperf is on CPU1, netserver is on CPU2: 30.5 % tma_backend_bound 16.8 % tma_memory_bound 11.0 % tma_l1_bound 32.4 % tma_l3_bound 59.5 % tma_contested_accesses <---- 11.1 % tma_data_sharing The contested_accesses has increased a lot when netperf and netserver are on different CPUs. Contested accesses occur when data written by one thread is read by another thread on a different core. This indicates the cache false sharing. Use perf c2c to figure out the place where cache false sharing happens. top 2 offenders: Suggested-by: Tim Chen ----- HITM ----- ------- Store Refs ------ --------- Data address ------- RmtHitm LclHitm L1 Hit L1 Miss Offset Node 0.00% 55.17% 0.00% 0.00% 0x1c <---- r= ead 0.00% 0.00% 20.00% 0.00% 0x1f <---- w= rite To be more specific, there are frequent read/write on the same cache line in the: struct tcp_sock { new cache line ... u16 tcp_header_len; <----- read u8 scaling_ratio; u8 chrono_type : 2, <---- write repair : 1, tcp_usec_ts : 1, is_sack_reneg:1, is_cwnd_limited:1; < ---- write new cache line u32 copied_seq; <----- write u32 rcv_tstamp; <---- write u32 snd_wl1; <---- write ... u32 urg_seq; <--- read Re-arranging the layout of struct tcp_sock could become a seesaw. As the va= riables mentioned above are frequently accessed by different path of TCP/IP stack. Propose a more generic solution: 1. if the waker and the wakee are both short duration tasks, 2. if the wakeup is WF_SYNC, 3. if there is no idle Core in the system, 4. if the waker and the wakee wake up each other, Wake up the wakee on the same CPU as waker. N.B. The bar to regard the task as a short duration one depends on the numb= er of CPUs. Normally we don't want to enable this wakeup feature on desktop or= mobile. Because the overhead of Cache-to-Cache latency is negligible on small syste= ms. [Benchmark results] Tested on 4 platforms, significant throughput improvement on tbench, netper= f, stress-ng, will-it-scale, and latency reduced of lmbench. 
[Benchmark results]

Tested on 4 platforms. Significant throughput improvements were observed for
tbench, netperf, stress-ng and will-it-scale, and latency reductions for
lmbench3.

Platform1: 240 CPUs, 2 sockets, Intel(R) Xeon(R)
================================================

netperf
=======
case            load            baseline(std%)  compare%( std%)
TCP_RR          60-threads       1.00 (  1.04)    -0.03 (  1.27)
TCP_RR          120-threads      1.00 (  2.31)    -0.09 (  2.46)
TCP_RR          180-threads      1.00 (  1.77)    +0.93 (  1.16)
TCP_RR          240-threads      1.00 (  9.39)  +190.13 (  3.66)
TCP_RR          300-threads      1.00 ( 45.28)  +120.07 ( 19.41)
TCP_RR          360-threads      1.00 ( 20.13)    +0.27 ( 30.57)
TCP_RR          420-threads      1.00 ( 30.85)   +13.39 ( 46.38)
UDP_RR          60-threads       1.00 ( 11.86)    -0.29 (  2.66)
UDP_RR          120-threads      1.00 ( 16.28)    +0.42 ( 13.41)
UDP_RR          180-threads      1.00 ( 15.34)    +0.31 ( 17.45)
UDP_RR          240-threads      1.00 ( 16.27)    -0.36 ( 18.78)
UDP_RR          300-threads      1.00 ( 20.42)    -2.54 ( 32.42)
UDP_RR          360-threads      1.00 ( 31.59)    +0.28 ( 35.66)
UDP_RR          420-threads      1.00 ( 30.44)    -0.27 ( 37.12)

tbench
======
case            load            baseline(std%)  compare%( std%)
loopback        60-threads       1.00 (  0.27)    +0.04 (  0.11)
loopback        120-threads      1.00 (  0.65)    -1.01 (  0.41)
loopback        180-threads      1.00 (  0.42)   +62.05 ( 26.22)
loopback        240-threads      1.00 ( 30.43)   +77.61 ( 15.27)

hackbench
=========
case            load            baseline(std%)  compare%( std%)
process-pipe    1-groups         1.00 (  6.92)    +4.70 (  5.85)
process-pipe    2-groups         1.00 (  6.45)    +7.66 (  2.39)
process-pipe    4-groups         1.00 (  2.82)    -1.82 (  1.47)

schbench
========
No noticeable difference in the 99.0th percentile wakeup/request latencies or
the 50.0th percentile RPS.

schbench -m 2 -r 100               baseline    sis_sync
Wakeup Latencies 99.0th usec             27          25
Request Latencies 99.0th usec         15376       15376
RPS percentiles 50.0th                16608       16608

Platform2: 48 CPUs, 2 sockets, Intel(R) Xeon(R) CPU E5-2697
===========================================================
lmbench3: lmbench3.PIPE.latency.us                   33.8% improvement
lmbench3: lmbench3.AF_UNIX.sock.stream.latency.us    30.6% improvement

Platform3: 224 threads, 2 sockets, Intel(R) Xeon(R) Platinum 8480
=================================================================
stress-ng:     stress-ng.vm-rw.ops_per_sec          250.8% improvement
will-it-scale: will-it-scale.per_process_ops         42.1% improvement
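For reference, the short-task bar used by this patch scales with the LLC size.
A tiny user-space sketch (illustration only, not part of the patch) replays the
shift arithmetic of the short_task() helper added in fair.c below, assuming the
default sysctl_sched_migration_cost of 500000 ns; its output matches the
threshold table in the new comment:

#include <stdio.h>

int main(void)
{
	/* default sysctl_sched_migration_cost, in nanoseconds */
	const unsigned long long migration_cost_ns = 500000ULL;

	for (int llc = 8; llc <= 256; llc *= 2) {
		/* same shift as short_task(): duration << 16 vs cost * llc^2 */
		unsigned long long thresh_ns =
			(migration_cost_ns * llc * llc) >> 16;
		printf("LLC weight %3d -> short-task threshold %7.1f usec\n",
		       llc, thresh_ns / 1000.0);
	}
	return 0;
}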
Suggested-by: Tim Chen
Signed-off-by: Chen Yu
---
 kernel/sched/fair.c     | 62 +++++++++++++++++++++++++++++++++++++----
 kernel/sched/features.h |  1 +
 2 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 445877069fbf..d749397249ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1003,7 +1003,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #include "pelt.h"
 #ifdef CONFIG_SMP
 
-static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
+static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu, int sync);
 static unsigned long task_h_load(struct task_struct *p);
 static unsigned long capacity_of(int cpu);
 
@@ -7410,12 +7410,55 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+/*
+ * threshold of the short duration task:
+ * sysctl_sched_migration_cost * llc_weight^2 / 256^2
+ *
+ *                  threshold
+ * LLC_WEIGHT=8     0.5 usec
+ * LLC_WEIGHT=16    2 usec
+ * LLC_WEIGHT=32    8 usec
+ * LLC_WEIGHT=64    31 usec
+ * LLC_WEIGHT=128   125 usec
+ * LLC_WEIGHT=256   500 usec
+ */
+static int short_task(struct task_struct *p, int llc)
+{
+	return ((p->duration_avg << 16) <
+		(sysctl_sched_migration_cost * llc * llc));
+}
+
+static int mutual_wakeup(struct task_struct *p, int target)
+{
+	int llc_weight;
+
+	if (!sched_feat(SIS_SYNC))
+		return 0;
+
+	if (target != smp_processor_id())
+		return 0;
+
+	if (this_rq()->nr_running > 1)
+		return 0;
+
+	llc_weight = per_cpu(sd_llc_size, target);
+
+	if (!short_task(p, llc_weight) ||
+	    !short_task(current, llc_weight))
+		return 0;
+
+	if (current->last_wakee != p || p->last_wakee != current)
+		return 0;
+
+	return 1;
+}
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
  * average idle time for this rq (as found in rq->avg_idle).
  */
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target,
+			   int sync)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
@@ -7458,6 +7501,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
+	/*
+	 * The cache-to-cache latency can be large on big systems. Before
+	 * trying to find a completely idle CPU other than the current one,
+	 * give the current CPU another chance if the waker and the wakee
+	 * are mutually waking up each other.
+	 */
+	if (!has_idle_core && sync && mutual_wakeup(p, target))
+		return target;
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -7550,7 +7602,7 @@ static inline bool asym_fits_cpu(unsigned long util,
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
-static int select_idle_sibling(struct task_struct *p, int prev, int target)
+static int select_idle_sibling(struct task_struct *p, int prev, int target, int sync)
 {
 	bool has_idle_core = false;
 	struct sched_domain *sd;
@@ -7659,7 +7711,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		}
 	}
 
-	i = select_idle_cpu(p, sd, has_idle_core, target);
+	i = select_idle_cpu(p, sd, has_idle_core, target, sync);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
@@ -8259,7 +8311,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
-		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu, sync);
 	}
 	rcu_read_unlock();
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 143f55df890b..7e5968d01dcb 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -50,6 +50,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_SYNC, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.25.1