From: Waiman Long
To: Tejun Heo, Zefan Li, Johannes Weiner
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Joe Mario,
 Sebastian Jug, Yosry Ahmed, Waiman Long
Subject: [PATCH-cgroup v5 1/2] cgroup/rstat: Optimize cgroup_rstat_updated_list()
Date: Thu, 30 Nov 2023 15:43:26 -0500
Message-Id: <20231130204327.494249-2-longman@redhat.com>
In-Reply-To: <20231130204327.494249-1-longman@redhat.com>
References: <20231130204327.494249-1-longman@redhat.com>

The current design of cgroup_rstat_cpu_pop_updated() is to traverse
the updated tree in a way that pops out the leaf nodes before their
parents. Finding the next leaf can require walking through several
intermediate nodes, so a given node in the tree may be visited
multiple times before the whole operation is done. This is not very
efficient and the code is hard to read.
With the introduction of cgroup_rstat_updated_list() to build the full
list of cgroups to be flushed before any flushing is done, we can
optimize the way the updated tree nodes are popped: push each parent
to the tail end of the list before its children. This way, most
updated tree nodes are visited only once. The exception is the subtree
root, as we still have to go back to its parent to pop the root out of
the parent's updated_children list. This also makes the code easier to
read.

Signed-off-by: Waiman Long
---
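As an illustration (not part of the patch, and not the kernel code):
the sketch below models the new level-order push with an explicit
queue. The node type, the fixed-size child array, and the queue bound
are hypothetical stand-ins; the kernel version avoids the queue by
chaining each level through rstat_flush_next, but the
child-before-parent order of the resulting list is the same.

#include <stdio.h>

#define MAX_NODES 16

struct node {
        const char *name;
        struct node *children[4];       /* hypothetical fixed-size child array */
        struct node *flush_next;        /* plays the role of rstat_flush_next */
};

/*
 * Level-order walk that pushes each visited node on the head of a
 * singly linked list. Parents are visited before their children, so
 * every child ends up before its parent in the final list.
 */
static struct node *build_flush_list(struct node *root)
{
        struct node *queue[MAX_NODES];
        struct node *head = NULL;
        int qh = 0, qt = 0;

        queue[qt++] = root;
        while (qh < qt) {
                struct node *n = queue[qh++];

                n->flush_next = head;   /* push on the list head */
                head = n;
                for (int i = 0; i < 4 && n->children[i]; i++)
                        queue[qt++] = n->children[i];
        }
        return head;
}

int main(void)
{
        struct node g1 = { .name = "grandchild1" };
        struct node c1 = { .name = "child1", .children = { &g1 } };
        struct node c2 = { .name = "child2" };
        struct node root = { .name = "root", .children = { &c1, &c2 } };

        for (struct node *n = build_flush_list(&root); n; n = n->flush_next)
                printf("%s ", n->name);
        printf("\n");
        return 0;
}

Running it prints "grandchild1 child2 child1 root": every child
precedes its parent, which is exactly the ordering guarantee
documented on cgroup_rstat_updated_list() in the patch below.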
 kernel/cgroup/rstat.c | 153 +++++++++++++++++++++++++-----------------
 1 file changed, 91 insertions(+), 62 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 1f300bf4dc40..e77f134c6f03 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -74,64 +74,109 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 }
 
 /**
- * cgroup_rstat_cpu_pop_updated - iterate and dismantle rstat_cpu updated tree
- * @pos: current position
- * @root: root of the tree to traversal
+ * cgroup_rstat_push_children - push children cgroups into the given list
+ * @head: current head of the list (= subtree root)
+ * @child: first child of the root
  * @cpu: target cpu
+ * Return: A new singly linked list of cgroups to be flushed
  *
- * Walks the updated rstat_cpu tree on @cpu from @root. %NULL @pos starts
- * the traversal and %NULL return indicates the end. During traversal,
- * each returned cgroup is unlinked from the tree. Must be called with the
- * matching cgroup_rstat_cpu_lock held.
+ * Iteratively traverse down the cgroup_rstat_cpu updated tree level by
+ * level and push all the parents first before their next level children
+ * into a singly linked list built from the tail backward like "pushing"
+ * cgroups into a stack. The root is pushed by the caller.
+ */
+static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
+                                                 struct cgroup *child, int cpu)
+{
+        struct cgroup *chead = child;   /* Head of child cgroup level */
+        struct cgroup *ghead = NULL;    /* Head of grandchild cgroup level */
+        struct cgroup *parent, *grandchild;
+        struct cgroup_rstat_cpu *crstatc;
+
+        child->rstat_flush_next = NULL;
+
+next_level:
+        while (chead) {
+                child = chead;
+                chead = child->rstat_flush_next;
+                parent = cgroup_parent(child);
+
+                /* updated_next is parent cgroup terminated */
+                while (child != parent) {
+                        child->rstat_flush_next = head;
+                        head = child;
+                        crstatc = cgroup_rstat_cpu(child, cpu);
+                        grandchild = crstatc->updated_children;
+                        if (grandchild != child) {
+                                /* Push the grand child to the next level */
+                                crstatc->updated_children = child;
+                                grandchild->rstat_flush_next = ghead;
+                                ghead = grandchild;
+                        }
+                        child = crstatc->updated_next;
+                        crstatc->updated_next = NULL;
+                }
+        }
+
+        if (ghead) {
+                chead = ghead;
+                ghead = NULL;
+                goto next_level;
+        }
+        return head;
+}
+
+/**
+ * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
+ * @root: root of the cgroup subtree to traverse
+ * @cpu: target cpu
+ * Return: A singly linked list of cgroups to be flushed
+ *
+ * Walks the updated rstat_cpu tree on @cpu from @root. During traversal,
+ * each returned cgroup is unlinked from the updated tree.
  *
  * The only ordering guarantee is that, for a parent and a child pair
- * covered by a given traversal, if a child is visited, its parent is
- * guaranteed to be visited afterwards.
+ * covered by a given traversal, the child is before its parent in
+ * the list.
+ *
+ * Note that updated_children is self terminated and points to a list of
+ * child cgroups if not empty. Whereas updated_next is like a sibling link
+ * within the children list and terminated by the parent cgroup. An exception
+ * here is the cgroup root whose updated_next can be self terminated.
  */
-static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
-                                                   struct cgroup *root, int cpu)
+static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 {
-        struct cgroup_rstat_cpu *rstatc;
-        struct cgroup *parent;
-
-        if (pos == root)
-                return NULL;
+        raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+        struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
+        struct cgroup *head = NULL, *parent, *child;
+        unsigned long flags;
 
         /*
-         * We're gonna walk down to the first leaf and visit/remove it. We
-         * can pick whatever unvisited node as the starting point.
+         * The _irqsave() is needed because cgroup_rstat_lock is
+         * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
+         * this lock with the _irq() suffix only disables interrupts on
+         * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
+         * interrupts on both configurations. The _irqsave() ensures
+         * that interrupts are always disabled and later restored.
          */
-        if (!pos) {
-                pos = root;
-                /* return NULL if this subtree is not on-list */
-                if (!cgroup_rstat_cpu(pos, cpu)->updated_next)
-                        return NULL;
-        } else {
-                pos = cgroup_parent(pos);
-        }
+        raw_spin_lock_irqsave(cpu_lock, flags);
 
-        /* walk down to the first leaf */
-        while (true) {
-                rstatc = cgroup_rstat_cpu(pos, cpu);
-                if (rstatc->updated_children == pos)
-                        break;
-                pos = rstatc->updated_children;
-        }
+        /* Return NULL if this subtree is not on-list */
+        if (!rstatc->updated_next)
+                goto unlock_ret;
 
         /*
-         * Unlink @pos from the tree. As the updated_children list is
+         * Unlink @root from its parent. As the updated_children list is
          * singly linked, we have to walk it to find the removal point.
-         * However, due to the way we traverse, @pos will be the first
-         * child in most cases. The only exception is @root.
          */
-        parent = cgroup_parent(pos);
+        parent = cgroup_parent(root);
         if (parent) {
                 struct cgroup_rstat_cpu *prstatc;
                 struct cgroup **nextp;
 
                 prstatc = cgroup_rstat_cpu(parent, cpu);
                 nextp = &prstatc->updated_children;
-                while (*nextp != pos) {
+                while (*nextp != root) {
                         struct cgroup_rstat_cpu *nrstatc;
 
                         nrstatc = cgroup_rstat_cpu(*nextp, cpu);
@@ -142,31 +187,15 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
         }
 
         rstatc->updated_next = NULL;
-        return pos;
-}
 
-/* Return a list of updated cgroups to be flushed */
-static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
-{
-        raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
-        struct cgroup *head, *tail, *next;
-        unsigned long flags;
-
-        /*
-         * The _irqsave() is needed because cgroup_rstat_lock is
-         * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-         * this lock with the _irq() suffix only disables interrupts on
-         * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-         * interrupts on both configurations. The _irqsave() ensures
-         * that interrupts are always disabled and later restored.
-         */
-        raw_spin_lock_irqsave(cpu_lock, flags);
-        head = tail = cgroup_rstat_cpu_pop_updated(NULL, root, cpu);
-        while (tail) {
-                next = cgroup_rstat_cpu_pop_updated(tail, root, cpu);
-                tail->rstat_flush_next = next;
-                tail = next;
-        }
+        /* Push @root to the list first before pushing the children */
+        head = root;
+        root->rstat_flush_next = NULL;
+        child = rstatc->updated_children;
+        rstatc->updated_children = root;
+        if (child != root)
+                head = cgroup_rstat_push_children(head, child, cpu);
+unlock_ret:
         raw_spin_unlock_irqrestore(cpu_lock, flags);
         return head;
 }
-- 
2.39.3
From: Waiman Long
To: Tejun Heo, Zefan Li, Johannes Weiner
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Joe Mario,
 Sebastian Jug, Yosry Ahmed, Waiman Long
Subject: [PATCH-cgroup v5 2/2] cgroup: Avoid false cacheline sharing of read mostly rstat_cpu
Date: Thu, 30 Nov 2023 15:43:27 -0500
Message-Id: <20231130204327.494249-3-longman@redhat.com>
In-Reply-To: <20231130204327.494249-1-longman@redhat.com>
References: <20231130204327.494249-1-longman@redhat.com>

The rstat_cpu and rstat_css_list fields of the cgroup structure are
read-mostly variables. However, they may share a cacheline with the
subsequent rstat_flush_next and *bstat fields, which can be updated
frequently. That slows down the cgroup_rstat_cpu() call, which is
invoked quite often in the rstat code. Add a CACHELINE_PADDING() line
between them to avoid false cacheline sharing.

A parallel kernel build on a 2-socket x86-64 server was used as the
benchmark for measuring lock hold times. Below is the lock hold time
frequency distribution before and after the patch:

     Run time        Before patch       After patch
     --------        ------------       -----------
      0-01 us           9,928,562         9,820,428
     01-05 us             110,151            50,935
     05-10 us                 270                93
     10-15 us                 273               146
     15-20 us                 135                76
     20-25 us                   0                 2
     25-30 us                   1                 0

The patch pushes the lock hold times further toward the lower end; in
particular, the number of holds in the 01-05 us range drops by more
than half, from 110,151 to 50,935.

Signed-off-by: Waiman Long
---
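As an illustration (not part of the patch, and not the kernel code):
the sketch below shows the layout effect the padding is after. The
struct and field names are made up, the 64-byte cacheline size is an
assumption, and alignas() stands in for what CACHELINE_PADDING()
achieves in the kernel, namely starting the frequently written fields
on a fresh cacheline away from the read-mostly ones.

#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>

#define CACHELINE_SIZE 64       /* assumed cacheline size */

struct stats_unpadded {
        long ro;        /* read-mostly, like rstat_cpu / rstat_css_list */
        long hot;       /* frequently written, like rstat_flush_next / bstat */
};

struct stats_padded {
        long ro;        /* read-mostly */
        /* mimic CACHELINE_PADDING(): push "hot" onto its own cacheline */
        alignas(CACHELINE_SIZE) long hot;
};

int main(void)
{
        printf("unpadded: ro@%zu hot@%zu\n",
               offsetof(struct stats_unpadded, ro),
               offsetof(struct stats_unpadded, hot));
        printf("padded:   ro@%zu hot@%zu\n",
               offsetof(struct stats_padded, ro),
               offsetof(struct stats_padded, hot));
        return 0;
}

On a typical x86-64 build this prints hot at offset 8 in the unpadded
layout and at offset 64 in the padded one, so a writer hammering hot
no longer keeps invalidating the cacheline that readers of ro are
pulling in.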
 include/linux/cgroup-defs.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 37518436cfe7..5a97ea95b564 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -496,6 +496,13 @@ struct cgroup {
         struct cgroup_rstat_cpu __percpu *rstat_cpu;
         struct list_head rstat_css_list;
 
+        /*
+         * Add padding to separate the read mostly rstat_cpu and
+         * rstat_css_list into a different cacheline from the following
+         * rstat_flush_next and *bstat fields which can have frequent updates.
+         */
+        CACHELINE_PADDING(_pad_);
+
         /*
          * A singly-linked list of cgroup structures to be rstat flushed.
          * This is a scratch field to be used exclusively by
-- 
2.39.3