From: Mateusz Guzik <mjguzik@gmail.com>
To: torvalds@linux-foundation.org
Cc: mingo@redhat.com, x86@kernel.org, linux-kernel@vger.kernel.org,
	Mateusz Guzik <mjguzik@gmail.com>
Subject: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
Date: Thu, 5 Jun 2025 18:47:33 +0200
Message-ID: <20250605164733.737543-1-mjguzik@gmail.com>

gcc is overeager to use rep movsq/stosq, kicking in for ops above 40
bytes, which comes with a significant penalty on CPUs without the
respective fast short ops bits (FSRM/FSRS). Note that uarchs with FSRM
don't necessarily have FSRS (Ice Lake and Sapphire Rapids don't). More
importantly, rep movsq is not fast even if FSRM is present -- the bit
only covers rep movsb.

The issue got reported to upstream gcc, but no progress was made and it
looks like nothing will happen for the foreseeable future (see links
1-3).

In the meantime perf is left on the table. Here is a sample result from
compiling a hello world program in a loop (in compilations/s):

Sapphire Rapids:
before:	979
after:	997 (+1.8%)

AMD EPYC 9R14:
before:	808
after:	815 (+0.8%)

So this is very much visible outside of a microbenchmark setting.

The compilation workload is very page fault heavy, and the fault path
lands in sync_regs():

<+0>:	endbr64
<+4>:	mov    %gs:0x22ca5d4(%rip),%rax        # 0xffffffff8450f010
<+12>:	mov    %rdi,%rsi
<+15>:	sub    $0xa8,%rax
<+21>:	cmp    %rdi,%rax
<+24>:	je     0xffffffff82244a55
<+26>:	mov    $0x15,%ecx
<+31>:	mov    %rax,%rdi
<+34>:	rep movsq %ds:(%rsi),%es:(%rdi)
<+37>:	jmp    0xffffffff82256ba0 <__x86_return_thunk>

When microbenchmarking page faults, perf top shows:

before:
  22.07%  [kernel]       [k] asm_exc_page_fault
  12.83%  pf_processes   [.] testcase
  11.81%  [kernel]       [k] sync_regs

after:
  26.06%  [kernel]       [k] asm_exc_page_fault
  13.18%  pf_processes   [.] testcase
  [..]
   0.91%  [kernel]       [k] sync_regs

That is a massive reduction in the routine's execution time.
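To make the pattern concrete, below is a minimal userspace sketch of the
kind of fixed-size copy gcc turns into rep movsq. This is not the
kernel's sync_regs() -- the names and the struct are made up for
illustration; the 168-byte size merely matches the mov $0x15,%ecx (21
quadwords, i.e. the 0xa8 subtracted above) in the disassembly:

/*
 * Hypothetical stand-in for a pt_regs-sized structure:
 * 21 * 8 = 168 bytes.
 */
struct fake_pt_regs {
	unsigned long r[21];
};

struct fake_pt_regs *copy_regs(struct fake_pt_regs *dst,
			       const struct fake_pt_regs *src)
{
	/* gcc -O2 with generic tuning: mov $0x15,%ecx; rep movsq */
	*dst = *src;
	return dst;
}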
Link 1: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
Link 2: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119703
Link 3: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119704
Link 4: https://lore.kernel.org/oe-lkp/202504181042.54ea2b8a-lkp@intel.com/
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
v2:
- only do it if not building with CONFIG_X86_NATIVE_CPU

Hi Linus,

The RFC for the patch was posted here:
https://lore.kernel.org/all/xmzxiwno5q3ordgia55wyqtjqbefxpami5wevwltcto52fehbv@ul44rsesp4kw/

You rejected it on two grounds:
- this should be handled by gcc itself -- agreed, but per the
  interaction in the bugzillas I created for them I don't believe this
  will happen any time soon (if ever, to be frank)
- messing with local optimization flags -- perhaps ifdefing on
  CONFIG_X86_NATIVE_CPU would be good enough? If not, the thing can be
  hidden behind an option (default Y) so interested parties can whack it.

See the commit message for perf numbers. It would be a shame not to get
these wins only because gcc is too stubborn. While I completely
understand not liking compiler-specific hacks, I believe I made a good
enough case for rolling with them here. That said, if you don't see any
justification to get something of this sort in, I'm dropping the matter.

cheers

 arch/x86/Makefile | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 1913d342969b..9eb75bd7c81d 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -198,6 +198,31 @@ ifeq ($(CONFIG_STACKPROTECTOR),y)
 endif
 endif
 
+ifdef CONFIG_CC_IS_GCC
+ifndef CONFIG_X86_NATIVE_CPU
+#
+# Inline memcpy and memset handling policy for gcc.
+#
+# For ops of sizes known at compilation time it quickly resorts to issuing rep
+# movsq and stosq. On most uarchs rep-prefixed ops have a significant startup
+# latency and it is faster to issue regular stores (even if in loops) to handle
+# small buffers.
+#
+# This of course comes at an expense in terms of i-cache footprint: bloat-o-meter
+# reported a 0.23% increase for enabling these.
+#
+# We inline up to 256 bytes, which in the best case issues a few movs and in the
+# worst case creates a 4 * 8 store loop.
+#
+# The upper limit was chosen semi-arbitrarily, as uarchs wildly differ in the
+# threshold past which rep-prefixed ops become faster; 256 is the lowest
+# common denominator. This should be fixed in the compiler.
+#
+KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+endif
+endif
+
 #
 # If the function graph tracer is used with mcount instead of fentry,
 # '-maccumulate-outgoing-args' is needed to prevent a GCC bug
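As a quick way to see what the overrides do outside of a kernel build,
here is a self-contained demo (the file name and sizes are illustrative;
168 mirrors the sync_regs copy). With default generic tuning both
routines below should come out as rep-prefixed ops, while with the
strategy flags they become short unrolled mov sequences, with sizes past
256 going to the library call instead:

/* memcpy_demo.c -- compare the generated assembly:
 *   gcc -O2 -S -o default.s memcpy_demo.c
 *   gcc -O2 -S -o tuned.s \
 *       -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign \
 *       -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign \
 *       memcpy_demo.c
 */
#include <string.h>

void copy168(void *dst, const void *src)
{
	/* size known at compile time, above gcc's ~40-byte rep threshold */
	memcpy(dst, src, 168);
}

void zero168(void *buf)
{
	memset(buf, 0, 168);
}

-- 
2.48.1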