From nobody Fri Jun 19 17:12:31 2026
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id ABE60C433EF
	for ; Thu, 31 Mar 2022 21:39:03 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S241900AbiCaVks (ORCPT ); Thu, 31 Mar 2022 17:40:48 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41398 "EHLO
	lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S230027AbiCaVkq (ORCPT );
	Thu, 31 Mar 2022 17:40:46 -0400
Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55])
	by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9ACD539BAF;
	Thu, 31 Mar 2022 14:38:58 -0700 (PDT)
Date: Thu, 31 Mar 2022 21:38:54 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020; t=1648762736;
	h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=xc1uXLGItTqU4LQHNlRfzQPvnTHvOKHdejsArw8o2qk=;
	b=sb0+mIgRSokQOWwf1RlXhzfNGPPRs4lo5EC/jJsvFwy9Q0MP+k2t5yDGIDndL6UbfkryEt
	 tts8mgUplTMgsxtE32moDBMTfISPgIlnqD5bazMAoD35hY3ECo7h6aSMEzIodsLWmKNrxi
	 Fnpeku6C5ocALS75uhHd2tiMX7fQPJbw/zkLxIsIkZpQt9XakuUvwvcssWWWLdylqvsYCo
	 8JwjyoNV9Cc72UoRqEaYu6Su26Th+lF4BDcPCbiwjXlvBxMlhKxdDQhDNEnee5I+I1bB1p
	 NBimD5rWGcs31G36asbyeAMoKdXOFDkt0bVakIcHoFiwe9TKrZ8dfuaXVs8QHA==
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020e; t=1648762736;
	h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=xc1uXLGItTqU4LQHNlRfzQPvnTHvOKHdejsArw8o2qk=;
	b=XsZC+V4+g/EWjrKYKMlx4ji+QE/lnJ0gSqZCM7ZMpsBGlTG5Tu86TtLvNWCBsdYNjZvd+W
	 rq8jtumD6U35QOCg==
From: "tip-bot2 for Dave Hansen"
Sender: tip-bot2@linutronix.de
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: [tip: x86/mm] x86/mm/tlb: Revert retpoline avoidance approach
Cc: kernel test robot, Dave Hansen, Nadav Amit, Ingo Molnar,
	Andy Lutomirski, Peter Zijlstra, x86@kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <164874672286.389.7021457716635788197.tip-bot2@tip-bot2>
References: <164874672286.389.7021457716635788197.tip-bot2@tip-bot2>
MIME-Version: 1.0
Message-ID: <164876273469.389.4814502480228230952.tip-bot2@tip-bot2>
Robot-ID:
Robot-Unsubscribe: Contact to get blacklisted from these emails
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

The following commit has been merged into the x86/mm branch of tip:

Commit-ID:     e1300d97cbc347d319adfa0976be723ada4b582c
Gitweb:        https://git.kernel.org/tip/e1300d97cbc347d319adfa0976be723ada4b582c
Author:        Dave Hansen
AuthorDate:    Fri, 18 Mar 2022 06:52:59 -07:00
Committer:     Dave Hansen
CommitterDate: Thu, 31 Mar 2022 14:31:43 -07:00

x86/mm/tlb: Revert retpoline avoidance approach

0day reported a regression on a microbenchmark which is intended to
stress the TLB flushing path:

	https://lore.kernel.org/all/20220317090415.GE735@xsang-OptiPlex-9020/

It pointed at a commit from Nadav which intended to remove retpoline
overhead in the TLB flushing path by taking the 'cond'-ition in
on_each_cpu_cond_mask(), pre-calculating it, and incorporating it into
'cpumask'.  That allowed the code to use a bunch of earlier direct
calls instead of later indirect calls that need a retpoline.

But, in practice, threads can go idle (and into lazy TLB mode where
they don't need to flush their TLB) between the early and late calls.
It works in this direction and not in the other because TLB-flushing
threads tend to hold mmap_lock for write.  Contention on that lock
causes threads to _go_ idle right in this early/late window.

There was not any performance data in the original commit specific
to the retpoline overhead.  I did a few tests on a system with
retpolines:

	https://lore.kernel.org/all/dd8be93c-ded6-b962-50d4-96b1c3afb2b7@intel.com/

which showed a possible small win.  But, that small win pales in
comparison with the bigger loss induced on non-retpoline systems.

Revert the patch that removed the retpolines.  This was not a
clean revert, but it was self-contained enough not to be too painful.

Fixes: 6035152d8eeb ("x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()")
Reported-by: kernel test robot
Signed-off-by: Dave Hansen
Acked-by: Nadav Amit
Cc: Ingo Molnar
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: x86@kernel.org
Link: https://lkml.kernel.org/r/164874672286.389.7021457716635788197.tip-bot2@tip-bot2
---
 arch/x86/mm/tlb.c | 37 +++++--------------------------------
 1 file changed, 5 insertions(+), 32 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1e6513f..161984b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -854,13 +854,11 @@ done:
 				nr_invalidate);
 }
 
-static bool tlb_is_not_lazy(int cpu)
+static bool tlb_is_not_lazy(int cpu, void *data)
 {
 	return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
 }
 
-static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
-
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
 EXPORT_PER_CPU_SYMBOL(cpu_tlbstate_shared);
 
@@ -889,36 +887,11 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
 	 * up on the new contents of what used to be page tables, while
 	 * doing a speculative memory access.
 	 */
-	if (info->freed_tables) {
+	if (info->freed_tables)
 		on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
-	} else {
-		/*
-		 * Although we could have used on_each_cpu_cond_mask(),
-		 * open-coding it has performance advantages, as it eliminates
-		 * the need for indirect calls or retpolines. In addition, it
-		 * allows to use a designated cpumask for evaluating the
-		 * condition, instead of allocating one.
-		 *
-		 * This code works under the assumption that there are no nested
-		 * TLB flushes, an assumption that is already made in
-		 * flush_tlb_mm_range().
-		 *
-		 * cond_cpumask is logically a stack-local variable, but it is
-		 * more efficient to have it off the stack and not to allocate
-		 * it on demand. Preemption is disabled and this code is
-		 * non-reentrant.
-		 */
-		struct cpumask *cond_cpumask = this_cpu_ptr(&flush_tlb_mask);
-		int cpu;
-
-		cpumask_clear(cond_cpumask);
-
-		for_each_cpu(cpu, cpumask) {
-			if (tlb_is_not_lazy(cpu))
-				__cpumask_set_cpu(cpu, cond_cpumask);
-		}
-		on_each_cpu_mask(cond_cpumask, flush_tlb_func, (void *)info, true);
-	}
+	else
+		on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func,
+				(void *)info, 1, cpumask);
 }
 
 void flush_tlb_multi(const struct cpumask *cpumask,
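
The early/late window described in the changelog can be illustrated with a
small userspace simulation.  This is only a sketch under stated assumptions,
not kernel code: every name below (sim_is_lazy, sim_snapshot_mask(),
sim_send_to_mask(), sim_send_cond(), simulate_idle_window()) is hypothetical,
a plain unsigned int stands in for a cpumask, and "sending an IPI" is just
counted.  It compares pre-calculating the condition into a mask against
evaluating it through a callback at send time, when some CPUs go lazy in
between.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SIM_CPUS 8

/* Hypothetical stand-in for the per-CPU is_lazy flag. */
static bool sim_is_lazy[NR_SIM_CPUS];

/* Condition callback, shaped like the two-argument tlb_is_not_lazy(). */
typedef bool (*sim_cond_fn)(int cpu, void *data);

static bool sim_tlb_is_not_lazy(int cpu, void *data)
{
	(void)data;
	return !sim_is_lazy[cpu];
}

/* Early approach: fold the condition into a mask ahead of time. */
static unsigned int sim_snapshot_mask(unsigned int cpumask)
{
	unsigned int cond_mask = 0;

	for (int cpu = 0; cpu < NR_SIM_CPUS; cpu++)
		if ((cpumask & (1u << cpu)) && sim_tlb_is_not_lazy(cpu, NULL))
			cond_mask |= 1u << cpu;
	return cond_mask;
}

/* Count the simulated IPIs when every CPU in the mask is targeted. */
static int sim_send_to_mask(unsigned int mask)
{
	int ipis = 0;

	for (int cpu = 0; cpu < NR_SIM_CPUS; cpu++)
		if (mask & (1u << cpu))
			ipis++;
	return ipis;
}

/* Late approach: evaluate the condition per CPU at send time. */
static int sim_send_cond(sim_cond_fn cond, void *data, unsigned int cpumask)
{
	int ipis = 0;

	for (int cpu = 0; cpu < NR_SIM_CPUS; cpu++)
		if ((cpumask & (1u << cpu)) && cond(cpu, data))
			ipis++;
	return ipis;
}

/* CPUs 2 and 3 go lazy after the snapshot but before the send. */
void simulate_idle_window(int *early, int *late)
{
	unsigned int cpumask = 0x0f;	/* CPUs 0-3 are flush targets */
	unsigned int snapshot;

	for (int cpu = 0; cpu < NR_SIM_CPUS; cpu++)
		sim_is_lazy[cpu] = false;

	snapshot = sim_snapshot_mask(cpumask);	/* early evaluation */

	sim_is_lazy[2] = true;	/* threads go idle in the window */
	sim_is_lazy[3] = true;

	*early = sim_send_to_mask(snapshot);
	*late = sim_send_cond(sim_tlb_is_not_lazy, NULL, cpumask);

	/* In this scenario the late check never targets more CPUs. */
	assert(*late <= *early);
}
```

In this scenario the snapshot still targets the CPUs that went lazy after it
was taken, while the send-time check skips them.  The sketch only counts
targets; it says nothing about the relative cost of indirect calls or
retpolines.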