From: Waiman Long
To: Tejun Heo, Zefan Li, Johannes Weiner
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Joe Mario, Sebastian Jug, Yosry Ahmed, Waiman Long
Subject: [PATCH v2] cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
Date: Wed, 1 Nov 2023 20:53:10 -0400
Message-Id: <20231102005310.439588-1-longman@redhat.com>

When cgroup_rstat_updated() isn't being called concurrently with
cgroup_rstat_flush_locked(), its run time is pretty short. When both
are called concurrently, the cgroup_rstat_updated() run time can spike
to a pretty high value due to the long cpu_lock hold time in
cgroup_rstat_flush_locked(). This can be problematic if the task
calling cgroup_rstat_updated() is a realtime task running on an
isolated CPU with a strict latency requirement. The
cgroup_rstat_updated() call can happen on a page fault even though the
task is running in user space most of the time.

The percpu cpu_lock is used to protect the update tree - updated_next
and updated_children. This protection is only needed while
cgroup_rstat_cpu_pop_updated() is being called. The subsequent
flushing operation, which can take a much longer time, does not need
that protection. To reduce the cpu_lock hold time, we need to perform
all the cgroup_rstat_cpu_pop_updated() calls up front and release the
lock before doing any flushing.
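In outline, the flush path moves from popping and flushing one cgroup
at a time under the lock to a two-phase pattern: build the whole flush
list under the per-cpu lock, then walk the list with the lock released.
Below is a simplified, self-contained user-space sketch of that
pattern; struct node, pop_one() and the pthread mutex are illustrative
stand-ins for the kernel structures and the per-cpu raw spinlock, not
actual kernel code:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
        int id;
        struct node *next_updated;      /* stand-in for updated_next */
        struct node *flush_next;        /* stand-in for rstat_flush_next */
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* "cpu_lock" */
static struct node *updated;            /* stand-in for the update tree */

/* Pop one updated node; stands in for cgroup_rstat_cpu_pop_updated(). */
static struct node *pop_one(void)
{
        struct node *n = updated;

        if (n)
                updated = n->next_updated;
        return n;
}

static void flush_all(void)
{
        struct node *head, *tail, *next, *pos;

        /* Phase 1: build the flush list under the lock (short hold time). */
        pthread_mutex_lock(&lock);
        head = tail = pop_one();
        while (tail) {
                next = pop_one();
                tail->flush_next = next;
                tail = next;
        }
        pthread_mutex_unlock(&lock);

        /* Phase 2: do the (potentially slow) flush work without the lock. */
        for (pos = head; pos; pos = next) {
                next = pos->flush_next;
                printf("flushing node %d\n", pos->id);
                free(pos);
        }
}

int main(void)
{
        for (int i = 0; i < 3; i++) {
                struct node *n = calloc(1, sizeof(*n));

                n->id = i;
                n->next_updated = updated;
                updated = n;
        }
        flush_all();
        return 0;
}

With this split, the lock hold time scales with the number of pointer
pops rather than with the per-cgroup flush work.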
This patch adds a new cgroup_rstat_updated_list() function that
returns a singly linked list of cgroups to be flushed.

With instrumentation code measuring the maximum elapsed time of the
new cgroup_rstat_updated_list() function and of each cpu iteration of
cgroup_rstat_flush_locked() around the old cpu_lock lock/unlock pair,
on a 2-socket x86-64 server running a parallel kernel build, the
maximum elapsed times were 27us and 88us respectively. The maximum
cpu_lock hold time is thus reduced to about 30% of the original.
Below is the run time distribution of cgroup_rstat_updated_list()
during the same period:

      Run time              Count
      --------              -----
      t <= 1us         12,574,302
      1us < t <= 5us    2,127,482
      5us < t <= 10us       8,445
      10us < t <= 20us      6,425
      20us < t <= 30us         50

Signed-off-by: Waiman Long
---
 include/linux/cgroup-defs.h |  6 ++++++
 kernel/cgroup/rstat.c       | 45 ++++++++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 265da00a1a8b..daaf6d4eb8b6 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -491,6 +491,12 @@ struct cgroup {
         struct cgroup_rstat_cpu __percpu *rstat_cpu;
         struct list_head rstat_css_list;
 
+        /*
+         * A singly-linked list of cgroup structures to be rstat flushed.
+         * Protected by cgroup_rstat_lock.
+         */
+        struct cgroup *rstat_flush_next;
+
         /* cgroup basic resource statistics */
         struct cgroup_base_stat last_bstat;
         struct cgroup_base_stat bstat;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d80d7a608141..a86d40ed8bda 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -145,6 +145,34 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
         return pos;
 }
 
+/*
+ * Return a list of updated cgroups to be flushed
+ */
+static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
+{
+        raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+        struct cgroup *head, *tail, *next;
+        unsigned long flags;
+
+        /*
+         * The _irqsave() is needed because cgroup_rstat_lock is
+         * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
+         * this lock with the _irq() suffix only disables interrupts on
+         * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
+         * interrupts on both configurations. The _irqsave() ensures
+         * that interrupts are always disabled and later restored.
+         */
+        raw_spin_lock_irqsave(cpu_lock, flags);
+        head = tail = cgroup_rstat_cpu_pop_updated(NULL, root, cpu);
+        while (tail) {
+                next = cgroup_rstat_cpu_pop_updated(tail, root, cpu);
+                tail->rstat_flush_next = next;
+                tail = next;
+        }
+        raw_spin_unlock_irqrestore(cpu_lock, flags);
+        return head;
+}
+
 /*
  * A hook for bpf stat collectors to attach to and flush their stats.
  * Together with providing bpf kfuncs for cgroup_rstat_updated() and
@@ -179,21 +207,9 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
         lockdep_assert_held(&cgroup_rstat_lock);
 
         for_each_possible_cpu(cpu) {
-                raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock,
-                                                       cpu);
-                struct cgroup *pos = NULL;
-                unsigned long flags;
+                struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
 
-                /*
-                 * The _irqsave() is needed because cgroup_rstat_lock is
-                 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-                 * this lock with the _irq() suffix only disables interrupts on
-                 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-                 * interrupts on both configurations. The _irqsave() ensures
-                 * that interrupts are always disabled and later restored.
-                 */
-                raw_spin_lock_irqsave(cpu_lock, flags);
-                while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
+                for (; pos; pos = pos->rstat_flush_next) {
                         struct cgroup_subsys_state *css;
 
                         cgroup_base_stat_flush(pos, cpu);
@@ -205,7 +221,6 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
                                 css->ss->css_rstat_flush(css, cpu);
                         rcu_read_unlock();
                 }
-                raw_spin_unlock_irqrestore(cpu_lock, flags);
 
                 /* play nice and yield if necessary */
                 if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
-- 
2.39.3
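For the record, the 27us/88us numbers quoted above came from throwaway
instrumentation that is not part of this patch. A rough sketch of one
way such a measurement can be taken, assuming local_clock() timestamps
around the call inside the for_each_possible_cpu() loop; the max_delta
bookkeeping is a simplification, not the exact code that was used:

        u64 start, delta;
        static u64 max_delta;   /* rough, racy maximum - fine for eyeballing */

        start = local_clock();
        pos = cgroup_rstat_updated_list(cgrp, cpu);
        delta = local_clock() - start;
        if (delta > max_delta)
                max_delta = delta;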