From: Waiman Long
To: Tejun Heo, Zefan Li, Johannes Weiner
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Joe Mario,
    Sebastian Jug, Yosry Ahmed, Waiman Long
Subject: [PATCH v3 1/3] cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
Date: Fri, 3 Nov 2023 23:13:01 -0400
Message-Id: <20231104031303.592879-2-longman@redhat.com>
In-Reply-To: <20231104031303.592879-1-longman@redhat.com>
References: <20231104031303.592879-1-longman@redhat.com>

When cgroup_rstat_updated() isn't being called concurrently with
cgroup_rstat_flush_locked(), its run time is pretty short. When both are
called concurrently, the cgroup_rstat_updated() run time can spike to a
pretty high value due to the high cpu_lock hold time in
cgroup_rstat_flush_locked(). This can be problematic if the task calling
cgroup_rstat_updated() is a realtime task running on an isolated CPU with
a strict latency requirement. The cgroup_rstat_updated() call can happen
when there is a page fault, even though the task is running in user space
most of the time.
The percpu cpu_lock is used to protect the update tree - updated_next and
updated_children. This protection is only needed when
cgroup_rstat_cpu_pop_updated() is being called. The subsequent flushing
operation, which can take a much longer time, does not need that
protection as it is already protected by cgroup_rstat_lock.

To reduce the cpu_lock hold time, we need to perform all the
cgroup_rstat_cpu_pop_updated() calls up front, with the lock released
afterward, before doing any flushing. This patch adds a new
cgroup_rstat_updated_list() function to return a singly linked list of
cgroups to be flushed.

Some instrumentation code is added to measure the cpu_lock hold time from
right after lock acquisition to right after the lock is released. A
parallel kernel build on a 2-socket x86-64 server is used as the
benchmarking tool for measuring the lock hold time. The maximum cpu_lock
hold times before and after the patch are 100us and 29us respectively, so
the worst case time is reduced to about 30% of the original. However, OS
or hardware noise such as NMIs or SMIs in the test system can worsen the
worst case value. Such noise is usually tuned out in a real production
environment to get a better result. OTOH, the lock hold time frequency
distribution should give a better idea of the performance benefit of the
patch. Below are the frequency distributions before and after the patch:

      Hold time     Before patch     After patch
      ---------     ------------     -----------
       0-01 us           804,139      13,738,708
      01-05 us         9,772,767       1,177,194
      05-10 us         4,595,028           4,984
      10-15 us           303,481           3,562
      15-20 us            78,971           1,314
      20-25 us            24,583              18
      25-30 us             6,908              12
      30-40 us             8,015
      40-50 us             2,192
      50-60 us               316
      60-70 us                43
      70-80 us                 7
      80-90 us                 2
       >90 us                  3

Signed-off-by: Waiman Long
Reviewed-by: Yosry Ahmed
---
 include/linux/cgroup-defs.h |  7 ++++++
 kernel/cgroup/rstat.c       | 43 ++++++++++++++++++++++++-------------
 2 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 265da00a1a8b..ff4b4c590f32 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -491,6 +491,13 @@ struct cgroup {
         struct cgroup_rstat_cpu __percpu *rstat_cpu;
         struct list_head rstat_css_list;
 
+        /*
+         * A singly-linked list of cgroup structures to be rstat flushed.
+         * This is a scratch field to be used exclusively by
+         * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
+         */
+        struct cgroup *rstat_flush_next;
+
         /* cgroup basic resource statistics */
         struct cgroup_base_stat last_bstat;
         struct cgroup_base_stat bstat;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d80d7a608141..1f300bf4dc40 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -145,6 +145,32 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
         return pos;
 }
 
+/* Return a list of updated cgroups to be flushed */
+static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
+{
+        raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+        struct cgroup *head, *tail, *next;
+        unsigned long flags;
+
+        /*
+         * The _irqsave() is needed because cgroup_rstat_lock is
+         * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
+         * this lock with the _irq() suffix only disables interrupts on
+         * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
+         * interrupts on both configurations. The _irqsave() ensures
+         * that interrupts are always disabled and later restored.
+         */
+        raw_spin_lock_irqsave(cpu_lock, flags);
+        head = tail = cgroup_rstat_cpu_pop_updated(NULL, root, cpu);
+        while (tail) {
+                next = cgroup_rstat_cpu_pop_updated(tail, root, cpu);
+                tail->rstat_flush_next = next;
+                tail = next;
+        }
+        raw_spin_unlock_irqrestore(cpu_lock, flags);
+        return head;
+}
+
 /*
  * A hook for bpf stat collectors to attach to and flush their stats.
  * Together with providing bpf kfuncs for cgroup_rstat_updated() and
@@ -179,21 +205,9 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
         lockdep_assert_held(&cgroup_rstat_lock);
 
         for_each_possible_cpu(cpu) {
-                raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock,
-                                                       cpu);
-                struct cgroup *pos = NULL;
-                unsigned long flags;
+                struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
 
-                /*
-                 * The _irqsave() is needed because cgroup_rstat_lock is
-                 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-                 * this lock with the _irq() suffix only disables interrupts on
-                 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-                 * interrupts on both configurations. The _irqsave() ensures
-                 * that interrupts are always disabled and later restored.
-                 */
-                raw_spin_lock_irqsave(cpu_lock, flags);
-                while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
+                for (; pos; pos = pos->rstat_flush_next) {
                         struct cgroup_subsys_state *css;
 
                         cgroup_base_stat_flush(pos, cpu);
@@ -205,7 +219,6 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
                                 css->ss->css_rstat_flush(css, cpu);
                         rcu_read_unlock();
                 }
-                raw_spin_unlock_irqrestore(cpu_lock, flags);
 
                 /* play nice and yield if necessary */
                 if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
-- 
2.39.3
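As an aside, the core idea of this patch (build the per-CPU flush list
inside a short critical section, then do the potentially expensive
per-node work only after the lock has been dropped) can be modelled in
plain userspace C. The sketch below is illustrative only: struct node,
pop_updated() and flush_cpu() are hypothetical stand-ins, and a pthread
mutex stands in for the per-CPU raw spinlock. The real implementation is
the cgroup_rstat_updated_list() / cgroup_rstat_flush_locked() code in the
diff above.

#include <pthread.h>
#include <stdio.h>

#define NNODES 8

struct node {
        int updated;             /* set by the (omitted) update path */
        long stat;               /* pending per-node counter */
        struct node *flush_next; /* scratch link, like rstat_flush_next */
};

static struct node nodes[NNODES];
static long global_total;
static pthread_mutex_t cpu_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for cgroup_rstat_cpu_pop_updated(): hand out one updated node. */
static struct node *pop_updated(void)
{
        for (int i = 0; i < NNODES; i++) {
                if (nodes[i].updated) {
                        nodes[i].updated = 0;
                        return &nodes[i];
                }
        }
        return NULL;
}

static void flush_cpu(void)
{
        struct node *head, *tail, *next;

        /* Short critical section: only list construction under the lock. */
        pthread_mutex_lock(&cpu_lock);
        head = tail = pop_updated();
        while (tail) {
                next = pop_updated();
                tail->flush_next = next;
                tail = next;
        }
        pthread_mutex_unlock(&cpu_lock);

        /* The potentially slow flush runs with cpu_lock already released. */
        for (struct node *pos = head; pos; pos = pos->flush_next) {
                global_total += pos->stat;
                pos->stat = 0;
        }
}

int main(void)
{
        for (int i = 0; i < NNODES; i++) {
                nodes[i].updated = 1;
                nodes[i].stat = i;
        }
        flush_cpu();
        printf("flushed total = %ld\n", global_total); /* expect 28 */
        return 0;
}

Built with e.g. cc -std=c11 -pthread, the point of the shape is that the
mutex is held only for the pointer chasing that builds the list, mirroring
how the patch moves the flush work out from under cgroup_rstat_cpu_lock.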
From: Waiman Long
To: Tejun Heo, Zefan Li, Johannes Weiner
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Joe Mario,
    Sebastian Jug, Yosry Ahmed, Waiman Long
Subject: [PATCH v3 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()
Date: Fri, 3 Nov 2023 23:13:02 -0400
Message-Id: <20231104031303.592879-3-longman@redhat.com>
In-Reply-To: <20231104031303.592879-1-longman@redhat.com>
References: <20231104031303.592879-1-longman@redhat.com>

The current design of cgroup_rstat_cpu_pop_updated() is to traverse the
updated tree in a way that pops out the leaf nodes before their parents.
This can cause traversal of multiple nodes before a leaf node can be
found and popped out. IOW, a given node in the tree can be visited
multiple times before the whole operation is done. So it is not very
efficient and the code can be hard to read.

With the introduction of cgroup_rstat_updated_list() to build a list of
cgroups to be flushed before any flushing is done, we can optimize the
way the updated tree nodes are popped by pushing the parents to the tail
end of the list before their children. In this way, most updated tree
nodes will be visited only once, with the exception of the subtree root,
as we still need to go back to its parent and pop it out of its
updated_children list. This also makes the code easier to read.

A parallel kernel build on a 2-socket x86-64 server is used as the
benchmarking tool for measuring the lock hold time. Below are the lock
hold time frequency distributions before and after the patch:

      Hold time     Before patch     After patch
      ---------     ------------     -----------
       0-01 us        13,738,708      14,594,545
      01-05 us         1,177,194         439,926
      05-10 us             4,984           5,960
      10-15 us             3,562           3,543
      15-20 us             1,314           1,397
      20-25 us                18              25
      25-30 us                12              12

It can be seen that the patch pushes the lock hold time towards the
lower end.

Signed-off-by: Waiman Long
---
 kernel/cgroup/rstat.c | 132 ++++++++++++++++++++++--------------------
 1 file changed, 70 insertions(+), 62 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 1f300bf4dc40..d2b709cfeb2a 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -74,64 +74,90 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 }
 
 /**
- * cgroup_rstat_cpu_pop_updated - iterate and dismantle rstat_cpu updated tree
- * @pos: current position
- * @root: root of the tree to traversal
+ * cgroup_rstat_push_children - push children cgroups into the given list
+ * @head: current head of the list (= parent cgroup)
+ * @prstatc: cgroup_rstat_cpu of the parent cgroup
  * @cpu: target cpu
+ * Return: A new singly linked list of cgroups to be flushed
  *
- * Walks the updated rstat_cpu tree on @cpu from @root. %NULL @pos starts
- * the traversal and %NULL return indicates the end. During traversal,
- * each returned cgroup is unlinked from the tree. Must be called with the
- * matching cgroup_rstat_cpu_lock held.
+ * Recursively traverse down the cgroup_rstat_cpu updated tree and push
+ * parent first before its children. The parent is pushed by the caller.
+ * The recursion depth is the depth of the current updated tree.
+ */
+static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
+                                struct cgroup_rstat_cpu *prstatc, int cpu)
+{
+        struct cgroup *child, *parent;
+        struct cgroup_rstat_cpu *crstatc;
+
+        parent = head;
+        child = prstatc->updated_children;
+        prstatc->updated_children = parent;
+
+        /* updated_next is parent cgroup terminated */
+        while (child != parent) {
+                child->rstat_flush_next = head;
+                head = child;
+                crstatc = cgroup_rstat_cpu(child, cpu);
+                if (crstatc->updated_children != parent)
+                        head = cgroup_rstat_push_children(head, crstatc, cpu);
+                child = crstatc->updated_next;
+                crstatc->updated_next = NULL;
+        }
+        return head;
+}
+
+/**
+ * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
+ * @root: root of the cgroup subtree to traverse
+ * @cpu: target cpu
+ * Return: A singly linked list of cgroups to be flushed
+ *
+ * Walks the updated rstat_cpu tree on @cpu from @root. During traversal,
+ * each returned cgroup is unlinked from the updated tree. Must be called
+ * with the matching cgroup_rstat_cpu_lock held.
  *
  * The only ordering guarantee is that, for a parent and a child pair
- * covered by a given traversal, if a child is visited, its parent is
- * guaranteed to be visited afterwards.
+ * covered by a given traversal, the child is before its parent in
+ * the list.
+ *
+ * Note that updated_children is self terminated while updated_next is
+ * parent cgroup terminated except the cgroup root which can be self
+ * terminated.
  */
-static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
-                                                   struct cgroup *root, int cpu)
+static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 {
-        struct cgroup_rstat_cpu *rstatc;
-        struct cgroup *parent;
-
-        if (pos == root)
-                return NULL;
+        raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+        struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
+        struct cgroup *head = NULL, *parent;
+        unsigned long flags;
 
         /*
-         * We're gonna walk down to the first leaf and visit/remove it. We
-         * can pick whatever unvisited node as the starting point.
+         * The _irqsave() is needed because cgroup_rstat_lock is
+         * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
+         * this lock with the _irq() suffix only disables interrupts on
+         * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
+         * interrupts on both configurations. The _irqsave() ensures
+         * that interrupts are always disabled and later restored.
         */
-        if (!pos) {
-                pos = root;
-                /* return NULL if this subtree is not on-list */
-                if (!cgroup_rstat_cpu(pos, cpu)->updated_next)
-                        return NULL;
-        } else {
-                pos = cgroup_parent(pos);
-        }
+        raw_spin_lock_irqsave(cpu_lock, flags);
 
-        /* walk down to the first leaf */
-        while (true) {
-                rstatc = cgroup_rstat_cpu(pos, cpu);
-                if (rstatc->updated_children == pos)
-                        break;
-                pos = rstatc->updated_children;
-        }
+        /* Return NULL if this subtree is not on-list */
+        if (!rstatc->updated_next)
+                goto unlock_ret;
 
         /*
-         * Unlink @pos from the tree. As the updated_children list is
+         * Unlink @root from its parent. As the updated_children list is
          * singly linked, we have to walk it to find the removal point.
-         * However, due to the way we traverse, @pos will be the first
-         * child in most cases. The only exception is @root.
          */
-        parent = cgroup_parent(pos);
+        parent = cgroup_parent(root);
         if (parent) {
                 struct cgroup_rstat_cpu *prstatc;
                 struct cgroup **nextp;
 
                 prstatc = cgroup_rstat_cpu(parent, cpu);
                 nextp = &prstatc->updated_children;
-                while (*nextp != pos) {
+                while (*nextp != root) {
                         struct cgroup_rstat_cpu *nrstatc;
 
                         nrstatc = cgroup_rstat_cpu(*nextp, cpu);
@@ -142,31 +168,13 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
         }
 
         rstatc->updated_next = NULL;
-        return pos;
-}
-
-/* Return a list of updated cgroups to be flushed */
-static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
-{
-        raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
-        struct cgroup *head, *tail, *next;
-        unsigned long flags;
 
-        /*
-         * The _irqsave() is needed because cgroup_rstat_lock is
-         * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-         * this lock with the _irq() suffix only disables interrupts on
-         * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-         * interrupts on both configurations. The _irqsave() ensures
-         * that interrupts are always disabled and later restored.
-         */
-        raw_spin_lock_irqsave(cpu_lock, flags);
-        head = tail = cgroup_rstat_cpu_pop_updated(NULL, root, cpu);
-        while (tail) {
-                next = cgroup_rstat_cpu_pop_updated(tail, root, cpu);
-                tail->rstat_flush_next = next;
-                tail = next;
-        }
+        /* Push @root to the list first before pushing the children */
+        head = root;
+        root->rstat_flush_next = NULL;
+        if (rstatc->updated_children != root)
+                head = cgroup_rstat_push_children(head, rstatc, cpu);
+unlock_ret:
         raw_spin_unlock_irqrestore(cpu_lock, flags);
         return head;
 }
-- 
2.39.3
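The traversal introduced above is easier to see on a plain tree. The
following standalone sketch is only a model of the idea: push the parent,
then recursively push its children in front of it, so every node is
handled once and each child precedes its parent on the resulting flush
list. struct node, push_subtree() and the ordinary child/sibling pointers
are hypothetical simplifications of the kernel's self-/parent-terminated
updated_children/updated_next encoding.

#include <stdio.h>

/*
 * Simplified tree node: ordinary child/sibling pointers instead of the
 * kernel's self-/parent-terminated updated_children/updated_next lists.
 */
struct node {
        const char *name;
        struct node *first_child;
        struct node *next_sibling;
        struct node *flush_next;        /* singly linked flush list */
};

/*
 * Push n in front of head, then recurse into its children so that every
 * child ends up ahead of its parent on the final list. Each node is
 * visited exactly once; the recursion depth equals the tree depth.
 */
static struct node *push_subtree(struct node *n, struct node *head)
{
        n->flush_next = head;
        head = n;
        for (struct node *c = n->first_child; c; c = c->next_sibling)
                head = push_subtree(c, head);
        return head;
}

int main(void)
{
        struct node a = { .name = "root" }, b = { .name = "child1" };
        struct node c = { .name = "child2" }, d = { .name = "grandchild" };

        a.first_child = &b;
        b.next_sibling = &c;
        b.first_child = &d;

        /* Walking the list visits children before their parents. */
        for (struct node *p = push_subtree(&a, NULL); p; p = p->flush_next)
                printf("%s\n", p->name);
        return 0;
}

Running it prints child2, grandchild, child1 and then root: children
always come out ahead of their parents, which is the ordering guarantee
the patch documents for cgroup_rstat_updated_list().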
From: Waiman Long
To: Tejun Heo, Zefan Li, Johannes Weiner
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Joe Mario,
    Sebastian Jug, Yosry Ahmed, Waiman Long
Subject: [PATCH v3 3/3] cgroup: Avoid false cacheline sharing of read mostly rstat_cpu
Date: Fri, 3 Nov 2023 23:13:03 -0400
Message-Id: <20231104031303.592879-4-longman@redhat.com>
In-Reply-To: <20231104031303.592879-1-longman@redhat.com>
References: <20231104031303.592879-1-longman@redhat.com>

The rstat_cpu and rstat_css_list fields of the cgroup structure are
read-mostly variables. However, they may share the same cacheline as the
subsequent rstat_flush_next and *bstat fields, which can be updated
frequently. That will slow down the cgroup_rstat_cpu() call, which is
invoked pretty frequently in the rstat code. Add a CACHELINE_PADDING()
line in between them to avoid false cacheline sharing.

A parallel kernel build on a 2-socket x86-64 server is used as the
benchmarking tool for measuring the lock hold time. Below are the lock
hold time frequency distributions before and after the patch:

      Hold time     Before patch     After patch
      ---------     ------------     -----------
       0-01 us        14,594,545      15,484,707
      01-05 us           439,926         207,382
      05-10 us             5,960           3,174
      10-15 us             3,543           3,006
      15-20 us             1,397           1,066
      20-25 us                25              15
      25-30 us                12              10

It can be seen that the patch further pushes the lock hold time towards
the lower end.

Signed-off-by: Waiman Long
---
 include/linux/cgroup-defs.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ff4b4c590f32..a4adc0580135 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -491,6 +491,13 @@ struct cgroup {
         struct cgroup_rstat_cpu __percpu *rstat_cpu;
         struct list_head rstat_css_list;
 
+        /*
+         * Add padding to separate the read mostly rstat_cpu and
+         * rstat_css_list into a different cacheline from the following
+         * rstat_flush_next and *bstat fields which can have frequent updates.
+         */
+        CACHELINE_PADDING(_pad_);
+
         /*
          * A singly-linked list of cgroup structures to be rstat flushed.
          * This is a scratch field to be used exclusively by
-- 
2.39.3
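To make the layout idea concrete, here is a small standalone C11 sketch of
the same separation: the read-mostly pointers live on one cacheline and
the frequently written fields start on the next one. The struct stats
type, its field names and the 64-byte cacheline size are assumptions for
illustration only; the kernel uses the CACHELINE_PADDING() helper and
SMP_CACHE_BYTES rather than an explicit alignment on a member.

#include <stdio.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cacheline size; the kernel uses SMP_CACHE_BYTES */

struct stats {
        /* read-mostly: written once at setup, then only read */
        void *percpu_data;
        void *css_list;

        /*
         * Frequently written on every flush. Aligning the first hot field
         * to a cacheline boundary keeps it off the read-mostly line, which
         * is roughly what the CACHELINE_PADDING() line above achieves for
         * struct cgroup.
         */
        _Alignas(CACHE_LINE) void *flush_next;
        long bstat;
};

int main(void)
{
        printf("css_list at offset %zu, flush_next at offset %zu\n",
               offsetof(struct stats, css_list),
               offsetof(struct stats, flush_next));
        return 0;
}

On a typical LP64 target this prints offsets 8 and 64, i.e. the hot fields
no longer share a cacheline with the read-mostly ones, so flush-side
writes do not keep invalidating the line that frequent readers rely on.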