Date: Thu, 27 Nov 2025 07:55:27 +0100
From: Mateusz Guzik
To: x86@kernel.org
Cc: glx@linutronix.de, mingo@redhat.com, bp@alien8.de,
 dave.hansen@linux.intel.com, hpa@zytor.com,
 linux-kernel@vger.kernel.org, torvalds@linux-foundation.org,
 olichtne@redhat.com, atomasov@redhat.com, aokuliar@redhat.com
Subject: performance anomaly in rep movsq/movsb as seen on Sapphire Rapids executing sync_regs()
X-Mailing-List: linux-kernel@vger.kernel.org

Sapphire Rapids has both ERMS (of course) and FSRM.

sync_regs() runs into a corner case where both rep movsq and rep movsb
suffer a massive penalty when used to copy 168 bytes, which goes away
when the data is copied by a bunch of movq instead.

I verified the issue is not present on AMD EPYC 9454. I don't know about
other Intel CPUs.

Details:

When benchmarking page faults (page_fault1 from will-it-scale),
sync_regs() is very high on the profile performing a 168 byte copy with
rep movsq.

I figured movsq still sucks on the uarch, so I patched the kernel to use
movsb instead, but performance barely budged.

However, forcing the thing to do the copy with regular stores in
memcpy_orig (32 bytes per loop iteration + 8 bytes tail) unclogs it.
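A minimal user-space sketch of the two copy strategies, in case someone
wants to poke at this outside the kernel. This is not the kernel code:
copy_rep_movsb(), copy_mov_loop(), copies_match() and PT_REGS_BYTES are
made-up names, the mov loop is a simplified rendition of the "32 bytes per
iteration + 8 byte tail" shape rather than a transcription of memcpy_orig,
and the compiler is free to rewrite it, so check the disassembly before
timing anything. As written it only checks that both routines copy 168
bytes correctly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy size quoted in the report; the name is made up for this sketch. */
#define PT_REGS_BYTES 168

/* Hypothetical stand-in for a rep movsb based memcpy. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
	void *ret = dst;

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	/* rdi = destination, rsi = source, rcx = byte count */
	__asm__ volatile("rep movsb"
			 : "+D"(dst), "+S"(src), "+c"(n)
			 :
			 : "memory");
#else
	/* Assumption: fall back to libc elsewhere so the sketch still builds. */
	memcpy(dst, src, n);
#endif
	return ret;
}

/* Simplified mov loop: 32 bytes per iteration, then 8-byte and byte tails. */
static void *copy_mov_loop(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n >= 32) {
		uint64_t w0, w1, w2, w3;

		/* Four 8-byte loads followed by four 8-byte stores. */
		memcpy(&w0, s, 8);
		memcpy(&w1, s + 8, 8);
		memcpy(&w2, s + 16, 8);
		memcpy(&w3, s + 24, 8);
		memcpy(d, &w0, 8);
		memcpy(d + 8, &w1, 8);
		memcpy(d + 16, &w2, 8);
		memcpy(d + 24, &w3, 8);
		s += 32;
		d += 32;
		n -= 32;
	}
	while (n >= 8) {
		uint64_t w;

		memcpy(&w, s, 8);
		memcpy(d, &w, 8);
		s += 8;
		d += 8;
		n -= 8;
	}
	while (n > 0) {
		*d++ = *s++;
		n--;
	}
	return dst;
}

/* Returns 1 when both routines reproduce a 168-byte source buffer exactly. */
static int copies_match(void)
{
	unsigned char src[PT_REGS_BYTES], a[PT_REGS_BYTES], b[PT_REGS_BYTES];
	size_t i;

	for (i = 0; i < sizeof(src); i++)
		src[i] = (unsigned char)(i * 7 + 1);
	memset(a, 0, sizeof(a));
	memset(b, 0, sizeof(b));

	copy_rep_movsb(a, src, sizeof(src));
	copy_mov_loop(b, src, sizeof(src));

	return memcmp(a, src, sizeof(src)) == 0 &&
	       memcmp(b, src, sizeof(src)) == 0;
}
```

Calling copies_match() from a main() and then wrapping the two routines in
a timing loop would be the obvious next step. Numbers obtained that way
would not be directly comparable to the will-it-scale results, since the
kernel copy runs in the page fault entry path on a different stack.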

Check this out (ops/s):

rep movsb in __pi_memcpy:
min:1293689 max:1293689 total:1293689
min:1293969 max:1293969 total:1293969
min:1293845 max:1293845 total:1293845
min:1293436 max:1293436 total:1293436

hand-rolled mov loop in memcpy_orig:
min:1498050 max:1498050 total:1498050
min:1499041 max:1499041 total:1499041
min:1498283 max:1498283 total:1498283
min:1499701 max:1499701 total:1499701

... or just shy of 16% faster.

I patched the kernel with a tunable to select the memcpy version for
sync_regs() to use, togglable at runtime. Results reliably flip around as
I change it at runtime.

perf top says:

rep movsb in __pi_memcpy:
  25.20%  [kernel]                  [k] asm_exc_page_fault
  14.60%  [kernel]                  [k] __pi_memcpy
  11.78%  page_fault1_processes     [.] testcase
   4.71%  [kernel]                  [k] _raw_spin_lock
   2.36%  [kernel]                  [k] __handle_mm_fault
   2.00%  [kernel]                  [k] clear_page_erms

hand-rolled mov loop in memcpy_orig:
  27.99%  [kernel]                  [k] asm_exc_page_fault
  13.42%  page_fault1_processes     [.] testcase
   5.46%  [kernel]                  [k] _raw_spin_lock
   2.72%  [kernel]                  [k] __handle_mm_fault
   2.48%  [kernel]                  [k] clear_page_erms
[..]
   0.59%  [kernel]                  [k] memcpy_orig
   0.04%  [kernel]                  [k] __pi_memcpy

As you can see the difference is staggering and this has to be a
deficiency at least in this uarch.

When it comes to sync_regs() specifically, I think it makes some sense to
instead recode it in asm and perhaps issue the movs "by hand", which would
work around the immediate problem and shave off a function call per page
fault.

However, per the profile results above, there is at least one case where
rep movsb-based memcpy can grossly underperform and someone(tm) should
investigate what's going on there. Also note the kernel inlines plain rep
movsb usage for copy to/from user if FSRM is present, so that path is
possibly susceptible to whatever the problem is.

Maybe this is a matter of misalignment of the target or some other
bullshit. I have not tested and I don't have the time to dig into it.
I would expect someone better clued in on the area will figure it out in
less time than I would need, hence I'm throwing this out there.

The tunable is usable as follows:
sysctl fs.magic_tunable=0 # rep movsb
sysctl fs.magic_tunable=1 # regular movs

hack:
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6b22611e69cc..f5fd69b2dc5b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -915,6 +915,9 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 }
 
 #ifdef CONFIG_X86_64
+extern unsigned long magic_tunable;
+void *memcpy_orig(void *dest, const void *src, size_t n);
+
 /*
  * Help handler running on a per-cpu (IST or entry trampoline) stack
  * to switch to the normal thread stack if the interrupted code was in
@@ -923,8 +926,10 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs)
 {
 	struct pt_regs *regs = (struct pt_regs *)current_top_of_stack() - 1;
-	if (regs != eregs)
-		*regs = *eregs;
+	if (!magic_tunable)
+		__memcpy(regs, eregs, sizeof(struct pt_regs));
+	else
+		memcpy_orig(regs, eregs, sizeof(struct pt_regs));
 	return regs;
 }
 
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 12a23fa7c44c..0f67378625b4 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -31,8 +31,6 @@
  * which the compiler could/should do much better anyway.
  */
 SYM_TYPED_FUNC_START(__memcpy)
-	ALTERNATIVE "jmp memcpy_orig", "", X86_FEATURE_FSRM
-
 	movq %rdi, %rax
 	movq %rdx, %rcx
 	rep movsb
@@ -44,7 +42,7 @@ SYM_FUNC_ALIAS_MEMFUNC(memcpy, __memcpy)
 SYM_PIC_ALIAS(memcpy)
 EXPORT_SYMBOL(memcpy)
 
-SYM_FUNC_START_LOCAL(memcpy_orig)
+SYM_TYPED_FUNC_START(memcpy_orig)
 	movq %rdi, %rax
 
 	cmpq $0x20, %rdx
diff --git a/fs/file_table.c b/fs/file_table.c
index cd4a3db4659a..de1ef700d144 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -109,6 +109,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 
+unsigned long magic_tunable;
+
 static const struct ctl_table fs_stat_sysctls[] = {
 	{
 		.procname = "file-nr",
@@ -126,6 +128,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
 		.extra1 = SYSCTL_LONG_ZERO,
 		.extra2 = SYSCTL_LONG_MAX,
 	},
+	{
+		.procname = "magic_tunable",
+		.data = &magic_tunable,
+		.maxlen = sizeof(magic_tunable),
+		.mode = 0644,
+		.proc_handler = proc_doulongvec_minmax,
+		.extra1 = SYSCTL_LONG_ZERO,
+		.extra2 = SYSCTL_LONG_MAX,
+	},
+
 	{
 		.procname = "nr_open",
 		.data = &sysctl_nr_open,