From: Xie Yuanbin
To: peterz@infradead.org, tglx@kernel.org, dave.hansen@linux.intel.com,
    hpa@zytor.com, riel@surriel.com, david@kernel.org,
    segher@kernel.crashing.org, arnd@arndb.de, mingo@redhat.com,
    juri.lelli@redhat.com, vincent.guittot@linaro.org,
    dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
    mgorman@suse.de, vschneid@redhat.com, bp@alien8.de, luto@kernel.org,
    frederic@kernel.org, mingo@kernel.org, houwenlong.hwl@antgroup.com,
    will@kernel.org, akpm@linux-foundation.org, jgross@suse.com,
    baohua@kernel.org, ryan.roberts@arm.com, lorenzo.stoakes@oracle.com,
    nysal@linux.ibm.com, urezki@gmail.com, max.kellermann@ionos.com
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, Xie Yuanbin
Subject: [PATCH v8 0/3] Optimize code generation during context switching
Date: Sun, 1 Mar 2026 16:35:17 +0800
Message-ID: <20260301083520.110969-1-qq570070308@gmail.com>

This series optimizes the performance of context switching. It does not
change any code logic; it only adjusts the inline attributes of some
functions.

It was found that finish_task_switch() is not inlined even at the -O2
optimization level. Performance testing indicated that this can lead to
significant performance degradation when certain Spectre vulnerability
mitigations are enabled.
This may be due to the following reasons:
- In switch_mm_irqs_off(), some mitigations may clear the branch
  prediction history or the instruction cache, e.g.
  arm64_apply_bp_hardening() on arm64, BPIALL/ICIALLU on arm, and
  indirect_branch_prediction_barrier() on x86. finish_task_switch() runs
  right after switch_mm_irqs_off(), so its performance is strongly
  affected by the cost of function calls and branch jumps.
- __schedule() has the __sched attribute, which places it in the
  '.sched.text' section, while finish_task_switch() does not. This puts
  them far apart in vmlinux, which aggravates the performance
  degradation.

This series primarily marks some functions called during context
switching as always inline to optimize performance.

Here is the test data:

1. Performance test - time spent on calling finish_task_switch()

a) x86-64: Intel i5-8300h@4GHz, DDR4@2666MHz; unit: x86's tsc
| test scenario                     | old   | new   | delta           |
| gcc 15.2                          | 27.50 | 25.45 |  -2.05 ( -7.5%) |
| gcc 15.2 + spectre_v2_user=on     | 46.75 | 25.96 | -20.79 (-44.5%) |
| clang 21.1.7                      | 27.25 | 25.45 |  -1.80 ( -6.6%) |
| clang 21.1.7 + spectre_v2_user=on | 39.50 | 26.00 | -13.50 (-34.2%) |

b) x86-64: AMD 9600x@5.45GHz, DDR5@4800MHz; unit: x86's tsc
| test scenario                     | old    | new   | delta           |
| gcc 15.2                          |  27.51 | 27.51 |      0 (    0%) |
| gcc 15.2 + spectre_v2_user=on     | 105.21 | 67.89 | -37.32 (-35.5%) |
| clang 21.1.7                      |  27.51 | 27.51 |      0 (    0%) |
| clang 21.1.7 + spectre_v2_user=on | 104.15 | 67.52 | -36.63 (-35.2%) |

c) arm64: Raspberry Pi 3b Rev 1.2, Cortex-A53@1.2GHz, unaffected by the
   Spectre v2 vulnerability; unit: cntvct_el0
| test scenario | old   | new   | delta            |
| gcc 15.2      | 1.453 | 1.115 | -0.338 (-23.3%)  |
| clang 21.1.7  | 1.532 | 1.123 | -0.409 (-26.7%)  |

d) arm32: Raspberry Pi 3b Rev 1.2, Cortex-A53@1.2GHz, unaffected by the
   Spectre v2 vulnerability; unit: cntvct_el0
| test scenario | old   | new   | delta            |
| gcc 15.2      | 1.421 | 1.187 | -0.234 (-16.5%)  |
| clang 21.1.7  | 1.437 | 1.200 | -0.237 (-16.5%)  |

2. Size test

a) bzImage size:
| test scenario      | old      | new      | delta |
| gcc 15.2 + -Os     | 12604416 | 12604416 |    0  |
| gcc 15.2 + -O2     | 14500864 | 14500864 |    0  |
| clang 21.1.7 + -Os | 13718528 | 13718528 |    0  |
| clang 21.1.7 + -O2 | 14558208 | 14566400 | 8192  |

b) size of the .text section from vmlinux:
| test scenario      | old      | new      | delta |
| gcc 15.2 + -Os     | 16180040 | 16180616 |  576  |
| gcc 15.2 + -O2     | 19556424 | 19561352 | 4928  |
| clang 21.1.7 + -Os | 17917832 | 17918664 |  832  |
| clang 21.1.7 + -O2 | 20030856 | 20035784 | 4928  |

3. Test information

a) Linux kernel source: commit d9771d0dbe18dd643760 ("Add linux-next
   specific files for 20251212") from the linux-next branch.

b) kernel config for the performance test:
   x86-64: `make x86_64_defconfig` first, then menuconfig setting:
     CONFIG_HZ=100
     CONFIG_DEBUG_ENTRY=n
     CONFIG_X86_DEBUG_FPU=n
     CONFIG_EXPERT=y
     CONFIG_MODIFY_LDT_SYSCALL=n
     CONFIG_STACKPROTECTOR=n
     CONFIG_BLK_DEV_NVME=y (just for boot)
   arm64: `make defconfig` first, then menuconfig setting:
     CONFIG_KVM=n
     CONFIG_HZ=100
     CONFIG_SHADOW_CALL_STACK=y
   arm32: `make multi_v7_defconfig` first, then menuconfig setting:
     CONFIG_ARCH_OMAP2PLUS_TYPICAL=n
     CONFIG_HIGHMEM=n

c) kernel config for the size test:
   `make x86_64_defconfig` first, then menuconfig setting:
     CONFIG_SCHED_CORE=y
     CONFIG_NO_HZ_FULL=y
     CONFIG_CC_OPTIMIZE_FOR_SIZE=y (optional)

d) Compiler:
   llvm: Debian clang version 21.1.7 (1) + Debian LLD 21.1.7
   gcc:
     x86-64: gcc version 15.2.0 (Debian 15.2.0-11)
     arm64/arm32: gcc version 15.2.0 (Debian 15.2.0-7) + GNU ld
       (GNU Binutils for Debian) 2.45.50.20251209

e) When testing on the Raspberry Pi 3b, the CPU frequency should be
   fixed in order to obtain stable test results.
   The following content was added to config.txt:
     arm_boost=0
     core_freq_fixed=1
     arm_freq=1200
     gpu_freq=250
     sdram_freq=400
     arm_freq_min=1200
     gpu_freq_min=250
     sdram_freq_min=400

f) cmdline configuration:
   - "isolcpus=3": to obtain more stable test results (assuming the test
     runs on cpu3).
   - "spectre_v2_user=on": optional on x86-64 to enable mitigations.

g) Performance testing code:
   - kernel patch:

```patch
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index fd09afae72a2..40ce1b28cb27 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -485,3 +485,4 @@
 468	common	file_getattr	sys_file_getattr
 469	common	file_setattr	sys_file_setattr
 470	common	listns	sys_listns
+471	common	sched_test	sys_sched_test
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 8a4ac4841be6..5a42ec008620 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -395,6 +395,7 @@
 468	common	file_getattr	sys_file_getattr
 469	common	file_setattr	sys_file_setattr
 470	common	listns	sys_listns
+471	common	sched_test	sys_sched_test
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index cf84d98964b2..53f0d2e745bd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -441,6 +441,7 @@ asmlinkage long sys_listmount(const struct mnt_id_req __user *req,
 asmlinkage long sys_listns(const struct ns_id_req __user *req,
 			   u64 __user *ns_ids, size_t nr_ns_ids,
 			   unsigned int flags);
+asmlinkage long sys_sched_test(void);
 asmlinkage long sys_truncate(const char __user *path, long length);
 asmlinkage long sys_ftruncate(unsigned int fd, off_t length);
 #if BITS_PER_LONG == 32
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 942370b3f5d2..65023afc291b 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -860,8 +860,11 @@ __SYSCALL(__NR_file_setattr, sys_file_setattr)
 #define __NR_listns 470
 __SYSCALL(__NR_listns, sys_listns)
 
+#define __NR_sched_test 471
+__SYSCALL(__NR_sched_test, sys_sched_test)
+
 #undef __NR_syscalls
-#define __NR_syscalls 471
+#define __NR_syscalls 472
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41ba0be16911..f53a423c8600 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5191,6 +5191,31 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 	calculate_sigpending();
 }
 
+static DEFINE_PER_CPU(uint64_t, total_time);
+
+static __always_inline uint64_t test_gettime(void)
+{
+#ifdef CONFIG_X86_64
+	register uint64_t rax __asm__("rax");
+	register uint64_t rdx __asm__("rdx");
+
+	__asm__ __volatile__ ("rdtsc" : "=a"(rax), "=d"(rdx));
+	return rax | (rdx << 32);
+#elif defined(CONFIG_ARM64)
+	uint64_t ret;
+
+	__asm__ __volatile__ ("mrs %0, cntvct_el0" : "=r"(ret));
+	return ret;
+#elif defined(CONFIG_ARM)
+	uint64_t ret;
+
+	__asm__ __volatile__ ("mrrc p15, 1, %Q0, %R0, c14" : "=r" (ret));
+	return ret;
+#else
+#error "Not support"
+#endif
+}
+
 /*
  * context_switch - switch to the new MM and the new thread's register state.
 */
@@ -5256,7 +5281,15 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	switch_to(prev, next, prev);
 	barrier();
 
-	return finish_task_switch(prev);
+	{
+		uint64_t end_time;
+		// add volatile to let it alloc on stack
+		__volatile__ uint64_t start_time = test_gettime();
+		rq = finish_task_switch(prev);
+		end_time = test_gettime();
+		raw_cpu_add(total_time, end_time - start_time);
+	}
+	return rq;
 }
 
 /*
@@ -10827,3 +10860,32 @@ void sched_change_end(struct sched_change_ctx *ctx)
 		p->sched_class->prio_changed(rq, p, ctx->prio);
 	}
 }
+
+static struct task_struct *wait_task;
+#define PRINT_PERIOD (1U << 20)
+static DEFINE_PER_CPU(uint32_t, total_count);
+
+SYSCALL_DEFINE0(sched_test)
+{
+	preempt_disable();
+	while (1) {
+		if (likely(wait_task))
+			wake_up_process(wait_task);
+		wait_task = current;
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		__schedule(SM_NONE);
+		if (unlikely(raw_cpu_inc_return(total_count) == PRINT_PERIOD)) {
+			const uint64_t total = raw_cpu_read(total_time);
+			uint64_t tmp_h, tmp_l;
+
+			tmp_h = total * 100000;
+			do_div(tmp_h, (uint32_t)PRINT_PERIOD);
+			tmp_l = do_div(tmp_h, (uint32_t)100000);
+
+			pr_emerg("cpu[%d]: total cost time %llu in %u tests, %llu.%05llu per test\n",
+				 raw_smp_processor_id(), total, PRINT_PERIOD, tmp_h, tmp_l);
+			raw_cpu_write(total_time, 0);
+			raw_cpu_write(total_count, 0);
+		}
+	}
+	return 0;
+}
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index e74868be513c..2a2d8d44cb3f 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -411,3 +411,4 @@
 468	common	file_getattr	sys_file_getattr
 469	common	file_setattr	sys_file_setattr
 470	common	listns	sys_listns
+471	common	sched_test	sys_sched_test
```

   - User-mode testing program code:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>

int main()
{
	cpu_set_t mask;

	if (fork())
		sleep(1);
	CPU_ZERO(&mask);
	CPU_SET(3, &mask); // Assume that cpu3 exists
	assert(sched_setaffinity(0, sizeof(mask), &mask) == 0);
	syscall(471);
	// unreachable
	return 0;
}
```

h) Test steps:
   1. Apply the above kernel patch and build the kernel.
   2. Run the above user program.
   3. Wait for the kernel to print the results.

---
v7->v8: https://lore.kernel.org/20260216164950.147617-1-qq570070308@gmail.com
- Improve the cover letter description.
- Also always inline mmgrab()/mmgrab_lazy_tlb()/mmdrop_lazy_tlb()/
  mmget()/mmget_not_zero(), because it was found that they may not be
  inlined under gcc -Os compilation.
v6->v7: https://lore.kernel.org/20260209162341.2922-1-qq570070308@gmail.com
- Move enter_lazy_tlb() to asm/tlbflush.h, suggested by Dave Hansen.
v5->v6: https://lore.kernel.org/20251214190907.184793-1-qq570070308@gmail.com
- Based on tglx's suggestion, move '#define enter_....' under the inline
  function in patch [1/3].
- Based on tglx's suggestion, correct the description error in patch [1/3].
- Rebase to the latest linux-next source.
v4->v5: https://lore.kernel.org/20251123121827.1304-1-qq570070308@gmail.com
- Rebase to the latest linux-next source.
- Improve the test code and retest.
- Add tests on the AMD 9600x and Raspberry Pi 3b.
v3->v4: https://lore.kernel.org/20251113105227.57650-1-qq570070308@gmail.com
- Improve the commit message.
v2->v3: https://lore.kernel.org/20251108172346.263590-1-qq570070308@gmail.com
- Fix a build error in patch 1.
- Simply add the __always_inline attribute to the existing functions,
  instead of adding always-inline versions of the functions.
v1->v2: https://lore.kernel.org/20251024182628.68921-1-qq570070308@gmail.com
- Make raw_spin_rq_unlock() inline.
- Make __balance_callbacks() inline.
- Add comments for the always-inline functions.
- Add performance test data.

Xie Yuanbin (3):
  x86/mm/tlb: Make enter_lazy_tlb() always inline on x86
  sched: Make raw_spin_rq_unlock() inline
  sched/core: Make finish_task_switch() and its subfunctions always
    inline

 arch/arm/include/asm/mmu_context.h      |  2 +-
 arch/riscv/include/asm/sync_core.h      |  2 +-
 arch/s390/include/asm/mmu_context.h     |  2 +-
 arch/sparc/include/asm/mmu_context_64.h |  2 +-
 arch/x86/include/asm/mmu_context.h      |  3 ---
 arch/x86/include/asm/sync_core.h        |  2 +-
 arch/x86/include/asm/tlbflush.h         | 26 +++++++++++++++++++++
 arch/x86/mm/tlb.c                       | 21 -----------------
 include/linux/perf_event.h              |  2 +-
 include/linux/sched/mm.h                | 20 ++++++++--------
 include/linux/tick.h                    |  4 ++--
 include/linux/vtime.h                   |  8 +++----
 kernel/sched/core.c                     | 17 +++++---------
 kernel/sched/sched.h                    | 31 ++++++++++++++-----------
 14 files changed, 71 insertions(+), 71 deletions(-)

-- 
2.51.0