From: Xie Yuanbin <qq570070308@gmail.com>
To: peterz@infradead.org, tglx@kernel.org, riel@surriel.com,
 segher@kernel.crashing.org, david@kernel.org, hpa@zytor.com, arnd@arndb.de,
 mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
 mgorman@suse.de, vschneid@redhat.com, bp@alien8.de,
 dave.hansen@linux.intel.com, luto@kernel.org, houwenlong.hwl@antgroup.com
Cc: linux-kernel@vger.kernel.org, x86@kernel.org, Xie Yuanbin <qq570070308@gmail.com>
Subject: [PATCH v6 0/3] Optimize code generation during context switching
Date: Sun, 25 Jan 2026 01:15:43 +0800
Message-ID: <20260124171546.43398-1-qq570070308@gmail.com>

This series optimizes the performance of context switching. It does not
change any code logic; it only changes the inline attributes of some
functions.

finish_task_switch() is not inlined even at the -O2 optimization level.
Performance testing shows that this can cause a significant performance
degradation when certain Spectre mitigations are enabled. This may be due
to the following reasons:

1. In switch_mm_irqs_off(), some mitigations may clear the branch
prediction history, or even flush the instruction cache; for example,
arm64_apply_bp_hardening() on arm64, BPIALL/ICIALLU on arm, and
indirect_branch_prediction_barrier() on x86. finish_task_switch() runs
right after switch_mm_irqs_off(), so its performance is strongly affected
by function calls and branch jumps at this point.

2. __schedule() has the __sched attribute, which places it in the
'.sched.text' section, while finish_task_switch() does not. This puts the
two functions far away from each other in vmlinux, which aggravates the
performance degradation.

This series therefore marks some functions called during context switching
as always inline to optimize performance.
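The mechanism the series relies on can be reproduced in a small standalone
userspace sketch (illustrative only, not taken from the patches;
inlined_helper() and outlined_helper() are made-up names). Forcing inlining
through the always_inline attribute, which the kernel's __always_inline
marker is built on, removes the call/return from the caller's hot path even
when the compiler's heuristics would keep the callee out of line. Compiling
with `gcc -O2` and disassembling with `objdump -d` should show no remaining
call to inlined_helper(), while the call to outlined_helper() stays:

```c
#include <stdio.h>

/* Forced inline: the body is folded into every caller regardless of the
 * compiler's own inlining heuristics or the optimization level. */
static inline __attribute__((always_inline)) long inlined_helper(long x)
{
	return x * 3 + 1;
}

/* Deliberately kept out of line, so every invocation pays a call/return. */
static __attribute__((noinline)) long outlined_helper(long x)
{
	return x * 3 + 1;
}

int main(void)
{
	long a = inlined_helper(14);	/* no call instruction emitted */
	long b = outlined_helper(14);	/* an out-of-line call remains */

	printf("%ld %ld\n", a, b);
	return 0;
}
```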
Here is the test data:

Performance test data - time spent on calling finish_task_switch():

1. x86-64: Intel i5-8300H @ 4GHz, DDR4 @ 2666MHz; unit: TSC cycles

| test scenario                     | old    | new   | delta            |
| gcc 15.2                          |  27.50 | 25.45 |  -2.05 ( -7.5%)  |
| gcc 15.2 + spectre_v2_user=on     |  46.75 | 25.96 | -20.79 (-44.5%)  |
| clang 21.1.7                      |  27.25 | 25.45 |  -1.80 ( -6.6%)  |
| clang 21.1.7 + spectre_v2_user=on |  39.50 | 26.00 | -13.50 (-34.2%)  |

2. x86-64: AMD 9600X @ 5.45GHz, DDR5 @ 4800MHz; unit: TSC cycles

| test scenario                     | old    | new   | delta            |
| gcc 15.2                          |  27.51 | 27.51 |      0 (    0%)  |
| gcc 15.2 + spectre_v2_user=on     | 105.21 | 67.89 | -37.32 (-35.5%)  |
| clang 21.1.7                      |  27.51 | 27.51 |      0 (    0%)  |
| clang 21.1.7 + spectre_v2_user=on | 104.15 | 67.52 | -36.63 (-35.2%)  |

3. arm64: Raspberry Pi 3B Rev 1.2, Cortex-A53 @ 1.2GHz, unaffected by the
Spectre v2 vulnerability; unit: cntvct_el0 cycles

| test scenario | old   | new   | delta            |
| gcc 15.2      | 1.453 | 1.115 | -0.338 (-23.3%)  |
| clang 21.1.7  | 1.532 | 1.123 | -0.409 (-26.7%)  |

4. arm32: Raspberry Pi 3B Rev 1.2, Cortex-A53 @ 1.2GHz, unaffected by the
Spectre v2 vulnerability; unit: cntvct_el0 cycles

| test scenario | old   | new   | delta            |
| gcc 15.2      | 1.421 | 1.187 | -0.234 (-16.5%)  |
| clang 21.1.7  | 1.437 | 1.200 | -0.237 (-16.5%)  |

Size test data:

1. bzImage size:

| test scenario      | old      | new      | delta |
| gcc 15.2 + -Os     | 12604416 | 12604416 |     0 |
| gcc 15.2 + -O2     | 14500864 | 14500864 |     0 |
| clang 21.1.7 + -Os | 13718528 | 13718528 |     0 |
| clang 21.1.7 + -O2 | 14558208 | 14566400 |  8192 |

2. Size of the .text section of vmlinux:

| test scenario      | old      | new      | delta |
| gcc 15.2 + -Os     | 16180040 | 16180616 |   576 |
| gcc 15.2 + -O2     | 19556424 | 19561352 |  4928 |
| clang 21.1.7 + -Os | 17917832 | 17918664 |   832 |
| clang 21.1.7 + -O2 | 20030856 | 20035784 |  4928 |

Test information:

1. Linux kernel source: commit d9771d0dbe18dd643760 ("Add linux-next
specific files for 20251212") from the linux-next branch.

2. Kernel config for the performance tests:
   x86-64: `make x86_64_defconfig`, then set in menuconfig:
     CONFIG_HZ=100
     CONFIG_DEBUG_ENTRY=n
     CONFIG_X86_DEBUG_FPU=n
     CONFIG_EXPERT=y
     CONFIG_MODIFY_LDT_SYSCALL=n
     CONFIG_STACKPROTECTOR=n
     CONFIG_BLK_DEV_NVME=y (just for boot)
   arm64: `make defconfig`, then set in menuconfig:
     CONFIG_KVM=n
     CONFIG_HZ=100
     CONFIG_SHADOW_CALL_STACK=y
   arm32: `make multi_v7_defconfig`, then set in menuconfig:
     CONFIG_ARCH_OMAP2PLUS_TYPICAL=n
     CONFIG_HIGHMEM=n

3. Kernel config for the size tests: `make x86_64_defconfig`, then set in
menuconfig:
     CONFIG_SCHED_CORE=y
     CONFIG_NO_HZ_FULL=y
     CONFIG_CC_OPTIMIZE_FOR_SIZE=y (optional)

4. Compilers:
   llvm: Debian clang version 21.1.7 (1) + Debian LLD 21.1.7
   gcc:
     x86-64: gcc version 15.2.0 (Debian 15.2.0-11)
     arm64/arm32: gcc version 15.2.0 (Debian 15.2.0-7) + GNU ld
     (GNU Binutils for Debian) 2.45.50.20251209

5. When testing on the Raspberry Pi 3B, the CPU frequency should be fixed
to make the results stable. The following content was added to config.txt:

```config.txt
arm_boost=0
core_freq_fixed=1
arm_freq=1200
gpu_freq=250
sdram_freq=400
arm_freq_min=1200
gpu_freq_min=250
sdram_freq_min=400
```

6. cmdline configuration:
  6.1 Add `isolcpus=3` to obtain more stable test results (assuming the
      test is run on cpu3).
  6.2 Optional: add `spectre_v2_user=on` on x86-64 to enable mitigations.
7. Performance testing code and operations:

Kernel code:

```patch
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index fd09afae72a2..40ce1b28cb27 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -485,3 +485,4 @@
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	sched_test		sys_sched_test
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 8a4ac4841be6..5a42ec008620 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -395,6 +395,7 @@
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	sched_test		sys_sched_test
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index cf84d98964b2..53f0d2e745bd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -441,6 +441,7 @@ asmlinkage long sys_listmount(const struct mnt_id_req __user *req,
 asmlinkage long sys_listns(const struct ns_id_req __user *req,
 			   u64 __user *ns_ids, size_t nr_ns_ids,
 			   unsigned int flags);
+asmlinkage long sys_sched_test(void);
 asmlinkage long sys_truncate(const char __user *path, long length);
 asmlinkage long sys_ftruncate(unsigned int fd, off_t length);
 #if BITS_PER_LONG == 32
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 942370b3f5d2..65023afc291b 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -860,8 +860,11 @@ __SYSCALL(__NR_file_setattr, sys_file_setattr)
 #define __NR_listns 470
 __SYSCALL(__NR_listns, sys_listns)
 
+#define __NR_sched_test 471
+__SYSCALL(__NR_sched_test, sys_sched_test)
+
 #undef __NR_syscalls
-#define __NR_syscalls 471
+#define __NR_syscalls 472
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41ba0be16911..f53a423c8600 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5191,6 +5191,31 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 	calculate_sigpending();
 }
 
+static DEFINE_PER_CPU(uint64_t, total_time);
+
+static __always_inline uint64_t test_gettime(void)
+{
+#ifdef CONFIG_X86_64
+	register uint64_t rax __asm__("rax");
+	register uint64_t rdx __asm__("rdx");
+
+	__asm__ __volatile__ ("rdtsc" : "=a"(rax), "=d"(rdx));
+	return rax | (rdx << 32);
+#elif defined(CONFIG_ARM64)
+	uint64_t ret;
+
+	__asm__ __volatile__ ("mrs %0, cntvct_el0" : "=r"(ret));
+	return ret;
+#elif defined(CONFIG_ARM)
+	uint64_t ret;
+
+	__asm__ __volatile__ ("mrrc p15, 1, %Q0, %R0, c14" : "=r" (ret));
+	return ret;
+#else
+#error "Not supported"
+#endif
+}
+
 /*
  * context_switch - switch to the new MM and the new thread's register state.
  */
@@ -5256,7 +5281,15 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	switch_to(prev, next, prev);
 	barrier();
 
-	return finish_task_switch(prev);
+	{
+		uint64_t end_time;
+		// add volatile to let it alloc on stack
+		__volatile__ uint64_t start_time = test_gettime();
+		rq = finish_task_switch(prev);
+		end_time = test_gettime();
+		raw_cpu_add(total_time, end_time - start_time);
+	}
+	return rq;
 }
 
 /*
@@ -10827,3 +10860,32 @@ void sched_change_end(struct sched_change_ctx *ctx)
 		p->sched_class->prio_changed(rq, p, ctx->prio);
 	}
 }
+
+static struct task_struct *wait_task;
+#define PRINT_PERIOD (1U << 20)
+static DEFINE_PER_CPU(uint32_t, total_count);
+
+SYSCALL_DEFINE0(sched_test)
+{
+	preempt_disable();
+	while (1) {
+		if (likely(wait_task))
+			wake_up_process(wait_task);
+		wait_task = current;
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		__schedule(SM_NONE);
+		if (unlikely(raw_cpu_inc_return(total_count) == PRINT_PERIOD)) {
+			const uint64_t total = raw_cpu_read(total_time);
+			uint64_t tmp_h, tmp_l;
+
+			tmp_h = total * 100000;
+			do_div(tmp_h, (uint32_t)PRINT_PERIOD);
+			tmp_l = do_div(tmp_h, (uint32_t)100000);
+
+			pr_emerg("cpu[%d]: total cost time %llu in %u tests, %llu.%05llu per test\n", raw_smp_processor_id(), total, PRINT_PERIOD, tmp_h, tmp_l);
+			raw_cpu_write(total_time, 0);
+			raw_cpu_write(total_count, 0);
+		}
+	}
+	return 0;
+}
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index e74868be513c..2a2d8d44cb3f 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -411,3 +411,4 @@
 468	common	file_getattr		sys_file_getattr
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
+471	common	sched_test		sys_sched_test
```

User-mode test program code:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <unistd.h>

int main(void)
{
	cpu_set_t mask;

	if (fork())
		sleep(1);
	CPU_ZERO(&mask);
	CPU_SET(3, &mask); // Assume that cpu3 exists
	assert(sched_setaffinity(0, sizeof(mask), &mask) == 0);
	syscall(471);
	// unreachable
	return 0;
}
```

Test operation:
1. Apply the above kernel patch and build the kernel.
2. Add `isolcpus=3` to the kernel cmdline and boot.
3. Run the above user program.
4. Wait for the kernel print.
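For a rough sanity check of the timing methodology outside the kernel, the
sketch below mirrors the in-kernel loop in x86-64 userspace (hypothetical
helper names read_tsc(), measured_call(), and period; not part of the
series): it reads the TSC before and after a call, accumulates the
difference, and averages over the same 1U << 20 window as PRINT_PERIOD. The
numbers are only indicative, since the real harness times
finish_task_switch() across an actual context switch with the test pinned
to an isolated CPU.

```c
#include <stdint.h>
#include <stdio.h>

/* Read the x86 time stamp counter, like the CONFIG_X86_64 branch of
 * test_gettime() in the kernel patch above. */
static inline uint64_t read_tsc(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

/* Stand-in for the code being measured; noinline keeps a real call. */
static __attribute__((noinline)) void measured_call(void)
{
	__asm__ __volatile__("");
}

int main(void)
{
	const uint32_t period = 1U << 20;	/* same window as PRINT_PERIOD */
	uint64_t total = 0;

	for (uint32_t i = 0; i < period; i++) {
		uint64_t start = read_tsc();

		measured_call();
		total += read_tsc() - start;
	}
	printf("total %llu tsc in %u tests, %.5f per test\n",
	       (unsigned long long)total, (unsigned int)period,
	       (double)total / period);
	return 0;
}
```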
v5->v6: https://lore.kernel.org/20251214190907.184793-1-qq570070308@gmail.com
- Based on tglx's suggestion, move '#define enter_....' under the inline
  function in patch [1/3].
- Based on tglx's suggestion, correct the description error in patch [1/3].
- Rebase to the latest linux-next source.

v4->v5: https://lore.kernel.org/20251123121827.1304-1-qq570070308@gmail.com
- Rebase to the latest linux-next source.
- Improve the test code and retest.
- Add tests on the AMD 9600X and the Raspberry Pi 3B.

v3->v4: https://lore.kernel.org/20251113105227.57650-1-qq570070308@gmail.com
- Improve the commit message.

v2->v3: https://lore.kernel.org/20251108172346.263590-1-qq570070308@gmail.com
- Fix a build error in patch 1.
- Simply add the __always_inline attribute to the existing functions
  instead of adding always-inline versions of them.

v1->v2: https://lore.kernel.org/20251024182628.68921-1-qq570070308@gmail.com
- Make raw_spin_rq_unlock() inline.
- Make __balance_callbacks() inline.
- Add comments for the always-inline functions.
- Add performance test data.

Xie Yuanbin (3):
  x86/mm/tlb: Make enter_lazy_tlb() always inline on x86
  sched: Make raw_spin_rq_unlock() inline
  sched/core: Make finish_task_switch() and its subfunctions always inline

 arch/arm/include/asm/mmu_context.h      |  2 +-
 arch/riscv/include/asm/sync_core.h      |  2 +-
 arch/s390/include/asm/mmu_context.h     |  2 +-
 arch/sparc/include/asm/mmu_context_64.h |  2 +-
 arch/x86/include/asm/mmu_context.h      | 23 +++++++++++++++++-
 arch/x86/include/asm/sync_core.h        |  2 +-
 arch/x86/mm/tlb.c                       | 21 -----------------
 include/linux/perf_event.h              |  2 +-
 include/linux/sched/mm.h                | 10 ++++----
 include/linux/tick.h                    |  4 ++--
 include/linux/vtime.h                   |  8 +++----
 kernel/sched/core.c                     | 17 +++++---------
 kernel/sched/sched.h                    | 31 ++++++++++++++-----------
 13 files changed, 62 insertions(+), 64 deletions(-)

-- 
2.51.0