From nobody Tue Apr 28 09:06:12 2026
From: Waiman Long
To: Tejun Heo, Jens Axboe
Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, Ming Lei, Waiman Long
Subject: [PATCH v5 1/3] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit
Date: Thu, 2 Jun 2022 09:35:41 -0400
Message-Id: <20220602133543.128088-2-longman@redhat.com>
In-Reply-To: <20220601211824.89626-1-longman@redhat.com>
References: <20220601211824.89626-1-longman@redhat.com>
Content-Type: text/plain; charset="utf-8"

Commit f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup
rstat") changed the block cgroup IO stats to use the rstat APIs and added
a new percpu iostat_cpu field to blkg. blkg_alloc() was modified to
allocate the new percpu iostat_cpu, but did not free it when a later
allocation step failed. Fix this by freeing the percpu iostat_cpu on the
error exit path.
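The fix relies on the ordered-error-label pattern: each goto target unwinds
exactly what has been allocated at the point of failure. A minimal userspace
sketch of that pattern, with invented names (widget_alloc, fail_step) standing
in for the kernel code, runnable outside the kernel:

```c
#include <stdlib.h>

/*
 * Hypothetical miniature of the blkg_alloc() error unwinding: labels are
 * ordered so that a jump frees exactly what has been allocated so far.
 * "stats" stands in for the percpu iostat_cpu that the patch now frees.
 * fail_step simulates an allocation failure at a given step.
 */
struct widget {
	int *refcnt;
	int *stats;
};

struct widget *widget_alloc(int fail_step)
{
	struct widget *w = calloc(1, sizeof(*w));

	if (!w)
		return NULL;

	w->refcnt = (fail_step == 1) ? NULL : malloc(sizeof(int));
	if (!w->refcnt)
		goto err_free_widget;	/* nothing else allocated yet */

	w->stats = (fail_step == 2) ? NULL : malloc(sizeof(int));
	if (!w->stats)
		goto err_free_widget;	/* stats is NULL, nothing to free */

	if (fail_step == 3)		/* a later step failed... */
		goto err_free_stats;	/* ...so stats must be freed too */

	return w;

err_free_stats:
	free(w->stats);
err_free_widget:
	free(w->refcnt);		/* free(NULL) is a safe no-op */
	free(w);
	return NULL;
}

void widget_free(struct widget *w)
{
	if (w) {
		free(w->stats);
		free(w->refcnt);
		free(w);
	}
}
```

The bug the patch fixes corresponds to jumping to err_free_widget from a
late failure: the stats allocation would leak. Splitting the label in two,
as the diff below does with err_free/err_free_blkg, closes that leak.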
Fixes: f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup rstat")
Signed-off-by: Waiman Long
Acked-by: Tejun Heo
---
 block/blk-cgroup.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 40161a3f68d0..acd9b0aa8dc8 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -219,11 +219,11 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 		return NULL;
 
 	if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, gfp_mask))
-		goto err_free;
+		goto err_free_blkg;
 
 	blkg->iostat_cpu = alloc_percpu_gfp(struct blkg_iostat_set, gfp_mask);
 	if (!blkg->iostat_cpu)
-		goto err_free;
+		goto err_free_blkg;
 
 	if (!blk_get_queue(q))
 		goto err_free;
@@ -259,6 +259,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	return blkg;
 
 err_free:
+	free_percpu(blkg->iostat_cpu);
+
+err_free_blkg:
 	blkg_free(blkg);
 	return NULL;
 }
-- 
2.31.1

From nobody Tue Apr 28 09:06:12 2026
From: Waiman Long
To: Tejun Heo, Jens Axboe
Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, Ming Lei, Waiman Long
Subject: [PATCH v3 1/2] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit
Date: Wed, 1 Jun 2022 17:18:23 -0400
Message-Id: <20220601211824.89626-2-longman@redhat.com>
In-Reply-To: <20220601211824.89626-1-longman@redhat.com>
References: <20220601211824.89626-1-longman@redhat.com>
Content-Type: text/plain; charset="utf-8"

Commit f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup
rstat") changed the block cgroup IO stats to use the rstat APIs and added
a new percpu iostat_cpu field to blkg. blkg_alloc() was modified to
allocate the new percpu iostat_cpu, but did not free it when a later
allocation step failed. Fix this by freeing the percpu iostat_cpu on the
error exit path.

Fixes: f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup rstat")
Signed-off-by: Waiman Long
Acked-by: Tejun Heo
---
 block/blk-cgroup.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 40161a3f68d0..acd9b0aa8dc8 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -219,11 +219,11 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 		return NULL;
 
 	if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, gfp_mask))
-		goto err_free;
+		goto err_free_blkg;
 
 	blkg->iostat_cpu = alloc_percpu_gfp(struct blkg_iostat_set, gfp_mask);
 	if (!blkg->iostat_cpu)
-		goto err_free;
+		goto err_free_blkg;
 
 	if (!blk_get_queue(q))
 		goto err_free;
@@ -259,6 +259,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	return blkg;
 
 err_free:
+	free_percpu(blkg->iostat_cpu);
+
+err_free_blkg:
 	blkg_free(blkg);
 	return NULL;
 }
-- 
2.31.1

From nobody Tue Apr 28 09:06:12 2026
From: Waiman Long
To: Tejun Heo, Jens Axboe
Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, Ming Lei, Waiman Long
Subject: [PATCH v5 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
Date: Thu, 2 Jun 2022 09:35:42 -0400
Message-Id: <20220602133543.128088-3-longman@redhat.com>
In-Reply-To: <20220601211824.89626-1-longman@redhat.com>
References: <20220601211824.89626-1-longman@redhat.com>
Content-Type: text/plain; charset="utf-8"

For blkcg_css_alloc(), the only error that can be returned is -ENOMEM.
Simplify the error handling code by returning this error directly instead
of setting an intermediate "ret" variable.

Signed-off-by: Waiman Long
Acked-by: Tejun Heo
---
 block/blk-cgroup.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index acd9b0aa8dc8..9021f75fc752 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1177,7 +1177,6 @@ static struct cgroup_subsys_state *
 blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct blkcg *blkcg;
-	struct cgroup_subsys_state *ret;
 	int i;
 
 	mutex_lock(&blkcg_pol_mutex);
@@ -1186,10 +1185,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		blkcg = &blkcg_root;
 	} else {
 		blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
-		if (!blkcg) {
-			ret = ERR_PTR(-ENOMEM);
+		if (!blkcg)
 			goto unlock;
-		}
 	}
 
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
@@ -1206,10 +1203,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 			continue;
 
 		cpd = pol->cpd_alloc_fn(GFP_KERNEL);
-		if (!cpd) {
-			ret = ERR_PTR(-ENOMEM);
+		if (!cpd)
 			goto free_pd_blkcg;
-		}
+
 		blkcg->cpd[i] = cpd;
 		cpd->blkcg = blkcg;
 		cpd->plid = i;
@@ -1238,7 +1234,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		kfree(blkcg);
 unlock:
 	mutex_unlock(&blkcg_pol_mutex);
-	return ret;
+	return ERR_PTR(-ENOMEM);
 }
 
 static int blkcg_css_online(struct cgroup_subsys_state *css)
-- 
2.31.1
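The simplification above works because -ENOMEM is the function's only
failure mode, so every error exit can return the encoded error pointer
directly. A minimal userspace sketch of the ERR_PTR idiom, with the
helpers re-implemented here purely for illustration (the kernel's real
definitions live in include/linux/err.h; obj_alloc and simulate_oom are
invented names):

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Userspace re-implementation of the kernel's ERR_PTR idiom: error codes
 * are encoded in the top 4095 values of the pointer range, which no valid
 * allocation can occupy. When a function has exactly one failure mode,
 * its error exits can return ERR_PTR(-ENOMEM) directly, with no "ret"
 * variable threaded through the cleanup labels.
 */
static void *ERR_PTR(long error) { return (void *)error; }
static long PTR_ERR(const void *ptr) { return (long)ptr; }
static int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-4095;
}

/* Hypothetical allocator whose only possible error is -ENOMEM. */
void *obj_alloc(int simulate_oom)
{
	void *obj = simulate_oom ? NULL : malloc(64);

	if (!obj)
		goto fail;
	return obj;

fail:	/* cleanup of partially-done work would go here */
	return ERR_PTR(-ENOMEM);
}
```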
From nobody Tue Apr 28 09:06:12 2026
From: Waiman Long
To: Tejun Heo, Jens Axboe
Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, Ming Lei, Waiman Long
Subject: [PATCH v4 2/2] blk-cgroup: Optimize blkcg_rstat_flush()
Date: Wed, 1 Jun 2022 21:54:19 -0400
Message-Id: <20220602015419.99168-1-longman@redhat.com>
In-Reply-To: <20220601211824.89626-1-longman@redhat.com>
References: <20220601211824.89626-1-longman@redhat.com>
Content-Type: text/plain; charset="utf-8"

For a system with many CPUs and block devices, the time to do
blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It can
be especially problematic as interrupts are disabled during the flush.
It was reported that it might take seconds to complete in some extreme
cases, leading to hard lockup messages.

As it is likely that not all the percpu blkg_iostat_set's have been
updated since the last flush, those stale blkg_iostat_set's don't need
to be flushed in this case.

This patch optimizes blkcg_rstat_flush() by keeping a lockless list of
recently updated blkg_iostat_set's in a newly added percpu blkcg->lhead
pointer. A blkg_iostat_set is added to the lockless list on the update
side in blk_cgroup_bio_start() and removed from the lockless list when
flushed in blkcg_rstat_flush(). Due to racing, it is possible that
blkg_iostat_set's in the lockless list may have no new IO stats to be
flushed.

To protect against destruction of the blkg, a percpu reference is taken
when a blkg_iostat_set is put onto the lockless list and released when
it is removed. A blkg_iostat_set can determine if it is in a lockless
list by checking the content of its lnode.next pointer, which will be
non-NULL when it is in a lockless list. This requires a special
llist_last sentinel node to be put at the end of the lockless list.

When booting up an instrumented test kernel with this patch on a
2-socket 96-thread system with cgroup v2, 1788 of the 2051 calls to
cgroup_rstat_flush() after bootup returned immediately because of an
empty lockless list. After an all-CPU kernel build, the ratio became
6295424/6340513, or more than 99%.

Signed-off-by: Waiman Long
Acked-by: Tejun Heo
---
 block/blk-cgroup.c | 84 ++++++++++++++++++++++++++++++++++++++++++----
 block/blk-cgroup.h |  9 +++++
 2 files changed, 87 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index acd9b0aa8dc8..1b74df5f2710 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -59,6 +59,56 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
 
 #define BLKG_DESTROY_BATCH_SIZE 64
 
+/*
+ * lnode.next of the last entry in a lockless list is NULL. To enable us to
+ * use lnode.next as a boolean flag to indicate its presence in a lockless
+ * list, we have to make it non-NULL for all. This is done by using a
+ * sentinel node at the end of the lockless list. All the percpu lhead's
+ * are initialized to point to that sentinel node as being empty.
+ */
+static struct llist_node llist_last;
+
+static bool blkcg_llist_empty(struct llist_head *lhead)
+{
+	return lhead->first == &llist_last;
+}
+
+static void init_blkcg_llists(struct blkcg *blkcg)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
+}
+
+static struct llist_node *fetch_delete_blkcg_llist(struct llist_head *lhead)
+{
+	return xchg(&lhead->first, &llist_last);
+}
+
+static struct llist_node *fetch_delete_lnode_next(struct llist_node *lnode)
+{
+	struct llist_node *next = READ_ONCE(lnode->next);
+	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
+					    lnode)->blkg;
+
+	WRITE_ONCE(lnode->next, NULL);
+	percpu_ref_put(&blkg->refcnt);
+	return next;
+}
+
+/*
+ * The retrieved blkg_iostat_set is immediately marked as not in the
+ * lockless list by clearing its lnode.next pointer. It could be put
+ * back into the list by a parallel update before the iostats are
+ * finally flushed, possibly including the new update.
+ */
+#define blkcg_llist_for_each_entry_safe(pos, node, nxt)			\
+	for (; (node != &llist_last) &&					\
+	       (pos = llist_entry(node, struct blkg_iostat_set, lnode),	\
+		nxt = fetch_delete_lnode_next(node), true);		\
+		node = nxt)
+
 /**
  * blkcg_css - find the current css
  *
@@ -236,8 +286,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	blkg->blkcg = blkcg;
 
 	u64_stats_init(&blkg->iostat.sync);
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
+		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
+	}
 
 	for (i = 0; i < BLKCG_MAX_POLS; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
@@ -852,17 +904,23 @@ static void blkg_iostat_sub(struct blkg_iostat *dst, struct blkg_iostat *src)
 static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 {
 	struct blkcg *blkcg = css_to_blkcg(css);
-	struct blkcg_gq *blkg;
+	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+	struct llist_node *lnode, *lnext;
+	struct blkg_iostat_set *bisc;
 
 	/* Root-level stats are sourced from system-wide IO stats */
 	if (!cgroup_parent(css->cgroup))
 		return;
 
+	if (blkcg_llist_empty(lhead))
+		return;
+
 	rcu_read_lock();
 
-	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+	lnode = fetch_delete_blkcg_llist(lhead);
+	blkcg_llist_for_each_entry_safe(bisc, lnode, lnext) {
+		struct blkcg_gq *blkg = bisc->blkg;
 		struct blkcg_gq *parent = blkg->parent;
-		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
 		struct blkg_iostat cur, delta;
 		unsigned long flags;
 		unsigned int seq;
@@ -1192,6 +1250,11 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		}
 	}
 
+	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
+	if (!blkcg->lhead)
+		goto free_blkcg;
+	init_blkcg_llists(blkcg);
+
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
 		struct blkcg_policy_data *cpd;
@@ -1233,7 +1296,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 	for (i--; i >= 0; i--)
 		if (blkcg->cpd[i])
 			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
-
+	free_percpu(blkcg->lhead);
+free_blkcg:
 	if (blkcg != &blkcg_root)
 		kfree(blkcg);
 unlock:
@@ -1997,6 +2061,7 @@ static int blk_cgroup_io_type(struct bio *bio)
 
 void blk_cgroup_bio_start(struct bio *bio)
 {
+	struct blkcg *blkcg = bio->bi_blkg->blkcg;
 	int rwd = blk_cgroup_io_type(bio), cpu;
 	struct blkg_iostat_set *bis;
 	unsigned long flags;
@@ -2015,9 +2080,16 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 	bis->cur.ios[rwd]++;
 
+	if (!READ_ONCE(bis->lnode.next)) {
+		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+
+		llist_add(&bis->lnode, lhead);
+		percpu_ref_get(&bis->blkg->refcnt);
+	}
+
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
-		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
+		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
 	put_cpu();
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index d4de0a35e066..2c36362a332e 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include <linux/llist.h>
 
 struct blkcg_gq;
 struct blkg_policy_data;
@@ -43,6 +44,8 @@ struct blkg_iostat {
 
 struct blkg_iostat_set {
 	struct u64_stats_sync sync;
+	struct llist_node lnode;
+	struct blkcg_gq *blkg;
 	struct blkg_iostat cur;
 	struct blkg_iostat last;
 };
@@ -97,6 +100,12 @@ struct blkcg {
 	struct blkcg_policy_data *cpd[BLKCG_MAX_POLS];
 
 	struct list_head all_blkcgs_node;
+
+	/*
+	 * List of updated percpu blkg_iostat_set's since the last flush.
+	 */
+	struct llist_head __percpu *lhead;
+
 #ifdef CONFIG_BLK_CGROUP_FC_APPID
 	char fc_app_id[FC_APPID_LEN];
 #endif
-- 
2.31.1

From nobody Tue Apr 28 09:06:12 2026
From: Waiman Long
To: Tejun Heo, Jens Axboe
Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, Ming Lei, Waiman Long
Subject: [PATCH v3 2/2] blk-cgroup: Optimize blkcg_rstat_flush()
Date: Wed, 1 Jun 2022 17:18:24 -0400
Message-Id: <20220601211824.89626-3-longman@redhat.com>
In-Reply-To: <20220601211824.89626-1-longman@redhat.com>
References: <20220601211824.89626-1-longman@redhat.com>
Content-Type: text/plain; charset="utf-8"

For a system with many CPUs and block devices, the time to do
blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It can
be especially problematic as interrupts are disabled during the flush.
It was reported that it might take seconds to complete in some extreme
cases, leading to hard lockup messages.

As it is likely that not all the percpu blkg_iostat_set's have been
updated since the last flush, those stale blkg_iostat_set's don't need
to be flushed in this case.

This patch optimizes blkcg_rstat_flush() by keeping a lockless list of
recently updated blkg_iostat_set's in a newly added percpu blkcg->lhead
pointer. A blkg_iostat_set is added to the lockless list on the update
side in blk_cgroup_bio_start() and removed from the lockless list when
flushed in blkcg_rstat_flush(). Due to racing, it is possible that
blkg_iostat_set's in the lockless list may have no new IO stats to be
flushed.

To protect against destruction of the blkg, a percpu reference is taken
when a blkg_iostat_set is put onto the lockless list and released when
it is removed. A blkg_iostat_set can determine if it is in a lockless
list by checking the content of its lnode.next pointer, which will be
non-NULL when it is in a lockless list. This requires a special
llist_last sentinel node to be put at the end of the lockless list.

When booting up an instrumented test kernel with this patch on a
2-socket 96-thread system with cgroup v2, 1788 of the 2051 calls to
cgroup_rstat_flush() after bootup returned immediately because of an
empty lockless list. After an all-CPU kernel build, the ratio became
6295424/6340513, or more than 99%.

Signed-off-by: Waiman Long
Reported-by: kernel test robot
---
 block/blk-cgroup.c | 85 ++++++++++++++++++++++++++++++++++++++++++----
 block/blk-cgroup.h |  9 +++++
 2 files changed, 88 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index acd9b0aa8dc8..0143dda589bd 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -59,6 +59,57 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
 
 #define BLKG_DESTROY_BATCH_SIZE 64
 
+/*
+ * lnode.next of the last entry in a lockless list is NULL. To make it
+ * always non-NULL for lnode's, a sentinel node has to be put at the
+ * end of the lockless list. So all the percpu lhead's are initialized
+ * to point to that sentinel node.
+ */
+static struct llist_node llist_last;
+
+static inline bool blkcg_llist_empty(struct llist_head *lhead)
+{
+	return lhead->first == &llist_last;
+}
+
+static inline void init_blkcg_llists(struct blkcg *blkcg)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
+}
+
+static inline struct llist_node *
+fetch_delete_blkcg_llist(struct llist_head *lhead)
+{
+	return xchg(&lhead->first, &llist_last);
+}
+
+static inline struct llist_node *
+fetch_delete_lnode_next(struct llist_node *lnode)
+{
+	struct llist_node *next = READ_ONCE(lnode->next);
+	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
+					    lnode)->blkg;
+
+	WRITE_ONCE(lnode->next, NULL);
+	percpu_ref_put(&blkg->refcnt);
+	return next;
+}
+
+/*
+ * The retrieved blkg_iostat_set is immediately marked as not in the
+ * lockless list by clearing its lnode.next pointer. It could be put
+ * back into the list by a parallel update before the iostats are
+ * finally flushed, possibly including the new update.
+ */
+#define blkcg_llist_for_each_entry_safe(pos, node, nxt)			\
+	for (; (node != &llist_last) &&					\
+	       (pos = llist_entry(node, struct blkg_iostat_set, lnode),	\
+		nxt = fetch_delete_lnode_next(node), true);		\
+		node = nxt)
+
 /**
  * blkcg_css - find the current css
  *
@@ -236,8 +287,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	blkg->blkcg = blkcg;
 
 	u64_stats_init(&blkg->iostat.sync);
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
+		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
+	}
 
 	for (i = 0; i < BLKCG_MAX_POLS; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
@@ -852,17 +905,23 @@ static void blkg_iostat_sub(struct blkg_iostat *dst, struct blkg_iostat *src)
 static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 {
 	struct blkcg *blkcg = css_to_blkcg(css);
-	struct blkcg_gq *blkg;
+	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+	struct llist_node *lnode, *lnext;
+	struct blkg_iostat_set *bisc;
 
 	/* Root-level stats are sourced from system-wide IO stats */
 	if (!cgroup_parent(css->cgroup))
 		return;
 
+	if (blkcg_llist_empty(lhead))
+		return;
+
 	rcu_read_lock();
 
-	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+	lnode = fetch_delete_blkcg_llist(lhead);
+	blkcg_llist_for_each_entry_safe(bisc, lnode, lnext) {
+		struct blkcg_gq *blkg = bisc->blkg;
 		struct blkcg_gq *parent = blkg->parent;
-		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
 		struct blkg_iostat cur, delta;
 		unsigned long flags;
 		unsigned int seq;
@@ -1192,6 +1251,11 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		}
 	}
 
+	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
+	if (!blkcg->lhead)
+		goto free_blkcg;
+	init_blkcg_llists(blkcg);
+
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
 		struct blkcg_policy_data *cpd;
@@ -1233,7 +1297,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 	for (i--; i >= 0; i--)
 		if (blkcg->cpd[i])
 			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
-
+	free_percpu(blkcg->lhead);
+free_blkcg:
 	if (blkcg != &blkcg_root)
 		kfree(blkcg);
 unlock:
@@ -1997,6 +2062,7 @@ static int blk_cgroup_io_type(struct bio *bio)
 
 void blk_cgroup_bio_start(struct bio *bio)
 {
+	struct blkcg *blkcg = bio->bi_blkg->blkcg;
 	int rwd = blk_cgroup_io_type(bio), cpu;
 	struct blkg_iostat_set *bis;
 	unsigned long flags;
@@ -2015,9 +2081,16 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 	bis->cur.ios[rwd]++;
 
+	if (!READ_ONCE(bis->lnode.next)) {
+		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+
+		llist_add(&bis->lnode, lhead);
+		percpu_ref_get(&bis->blkg->refcnt);
+	}
+
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
-		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
+		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
 	put_cpu();
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index d4de0a35e066..2c36362a332e 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include <linux/llist.h>
 
 struct blkcg_gq;
 struct blkg_policy_data;
@@ -43,6 +44,8 @@ struct blkg_iostat {
 
 struct blkg_iostat_set {
 	struct u64_stats_sync sync;
+	struct llist_node lnode;
+	struct blkcg_gq *blkg;
 	struct blkg_iostat cur;
 	struct blkg_iostat last;
 };
@@ -97,6 +100,12 @@ struct blkcg {
 	struct blkcg_policy_data *cpd[BLKCG_MAX_POLS];
 
 	struct list_head all_blkcgs_node;
+
+	/*
+	 * List of updated percpu blkg_iostat_set's since the last flush.
+	 */
+	struct llist_head __percpu *lhead;
+
 #ifdef CONFIG_BLK_CGROUP_FC_APPID
 	char fc_app_id[FC_APPID_LEN];
 #endif
-- 
2.31.1

From nobody Tue Apr 28 09:06:12 2026
From: Waiman Long
To: Tejun Heo, Jens Axboe
Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, Ming Lei, Waiman Long
Subject: [PATCH v5 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
Date: Thu, 2 Jun 2022 09:35:43 -0400
Message-Id: <20220602133543.128088-4-longman@redhat.com>
In-Reply-To: <20220601211824.89626-1-longman@redhat.com>
References: <20220601211824.89626-1-longman@redhat.com>

For a system with many CPUs and block devices, the time to do
blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It can
be especially problematic as interrupts are disabled during the flush. It
was reported that it might take seconds to complete in some extreme
cases, leading to hard-lockup messages.

As it is likely that not all the percpu blkg_iostat_set's have been
updated since the last flush, those stale blkg_iostat_set's don't need to
be flushed in that case. This patch optimizes blkcg_rstat_flush() by
keeping a lockless list of recently updated blkg_iostat_set's in a newly
added percpu blkcg->lhead pointer.

A blkg_iostat_set is added to the lockless list on the update side in
blk_cgroup_bio_start() and removed from the lockless list when flushed in
blkcg_rstat_flush(). Due to racing, it is possible that a blkg_iostat_set
in the lockless list has no new IO stats to be flushed.
To protect against destruction of the blkg, a percpu reference is taken
when a blkg_iostat_set is put into the lockless list and put back when it
is removed.

A blkg_iostat_set can determine if it is on a lockless list by checking
the content of its lnode.next pointer, which will be non-NULL when it is
on a list. This requires the presence of a special llist_last sentinel
node to be put at the end of the lockless list.

When booting up an instrumented test kernel with this patch on a 2-socket
96-thread system with cgroup v2, 1788 out of the 2051 calls to
cgroup_rstat_flush() after bootup exited immediately because of an empty
lockless list. After an all-CPU kernel build, the ratio became
6295424/6340513, more than 99%.

Signed-off-by: Waiman Long
Acked-by: Tejun Heo
---
 block/blk-cgroup.c | 84 ++++++++++++++++++++++++++++++++++++++++++----
 block/blk-cgroup.h |  9 +++++
 2 files changed, 87 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9021f75fc752..8af97f3b2fc9 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -59,6 +59,56 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
 
 #define BLKG_DESTROY_BATCH_SIZE	64
 
+/*
+ * lnode.next of the last entry in a lockless list is NULL. To enable us to
+ * use lnode.next as a boolean flag to indicate its presence in a lockless
+ * list, we have to make it non-NULL for all. This is done by using a
+ * sentinel node at the end of the lockless list. All the percpu lhead's
+ * are initialized to point to that sentinel node as being empty.
+ */
+static struct llist_node llist_last;
+
+static bool blkcg_llist_empty(struct llist_head *lhead)
+{
+	return lhead->first == &llist_last;
+}
+
+static void init_blkcg_llists(struct blkcg *blkcg)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
+}
+
+static struct llist_node *fetch_delete_blkcg_llist(struct llist_head *lhead)
+{
+	return xchg(&lhead->first, &llist_last);
+}
+
+static struct llist_node *fetch_delete_lnode_next(struct llist_node *lnode)
+{
+	struct llist_node *next = READ_ONCE(lnode->next);
+	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
+					    lnode)->blkg;
+
+	WRITE_ONCE(lnode->next, NULL);
+	percpu_ref_put(&blkg->refcnt);
+	return next;
+}
+
+/*
+ * The retrieved blkg_iostat_set is immediately marked as not in the
+ * lockless list by clearing its node->next pointer. It could be put
+ * back into the list by a parallel update before the iostat's are
+ * finally flushed including probably the new update.
+ */
+#define blkcg_llist_for_each_entry_safe(pos, node, nxt)			\
+	for (; (node != &llist_last) &&					\
+		(pos = llist_entry(node, struct blkg_iostat_set, lnode), \
+		 nxt = fetch_delete_lnode_next(node), true);		\
+		node = nxt)
+
 /**
  * blkcg_css - find the current css
@@ -236,8 +286,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	blkg->blkcg = blkcg;
 
 	u64_stats_init(&blkg->iostat.sync);
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
+		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
+	}
 
 	for (i = 0; i < BLKCG_MAX_POLS; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
@@ -852,17 +904,23 @@ static void blkg_iostat_sub(struct blkg_iostat *dst, struct blkg_iostat *src)
 static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 {
 	struct blkcg *blkcg = css_to_blkcg(css);
-	struct blkcg_gq *blkg;
+	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+	struct llist_node *lnode, *lnext;
+	struct blkg_iostat_set *bisc;
 
 	/* Root-level stats are sourced from system-wide IO stats */
 	if (!cgroup_parent(css->cgroup))
 		return;
 
+	if (blkcg_llist_empty(lhead))
+		return;
+
 	rcu_read_lock();
 
-	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+	lnode = fetch_delete_blkcg_llist(lhead);
+	blkcg_llist_for_each_entry_safe(bisc, lnode, lnext) {
+		struct blkcg_gq *blkg = bisc->blkg;
 		struct blkcg_gq *parent = blkg->parent;
-		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
 		struct blkg_iostat cur, delta;
 		unsigned long flags;
 		unsigned int seq;
@@ -1189,6 +1247,11 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 			goto unlock;
 	}
 
+	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
+	if (!blkcg->lhead)
+		goto free_blkcg;
+	init_blkcg_llists(blkcg);
+
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
 		struct blkcg_policy_data *cpd;
@@ -1229,7 +1292,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 	for (i--; i >= 0; i--)
 		if (blkcg->cpd[i])
 			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
-
+	free_percpu(blkcg->lhead);
+free_blkcg:
 	if (blkcg != &blkcg_root)
 		kfree(blkcg);
 unlock:
@@ -1993,6 +2057,7 @@ static int blk_cgroup_io_type(struct bio *bio)
 
 void blk_cgroup_bio_start(struct bio *bio)
 {
+	struct blkcg *blkcg = bio->bi_blkg->blkcg;
 	int rwd = blk_cgroup_io_type(bio), cpu;
 	struct blkg_iostat_set *bis;
 	unsigned long flags;
@@ -2011,9 +2076,16 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 	bis->cur.ios[rwd]++;
 
+	if (!READ_ONCE(bis->lnode.next)) {
+		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+
+		llist_add(&bis->lnode, lhead);
+		percpu_ref_get(&bis->blkg->refcnt);
+	}
+
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
-		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
+		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
 	put_cpu();
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index d4de0a35e066..2c36362a332e 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include <linux/llist.h>
 
struct blkcg_gq;
struct blkg_policy_data;
@@ -43,6 +44,8 @@ struct blkg_iostat {
 
 struct blkg_iostat_set {
 	struct u64_stats_sync		sync;
+	struct llist_node		lnode;
+	struct blkcg_gq			*blkg;
 	struct blkg_iostat		cur;
 	struct blkg_iostat		last;
 };
@@ -97,6 +100,12 @@ struct blkcg {
 	struct blkcg_policy_data	*cpd[BLKCG_MAX_POLS];
 
 	struct list_head		all_blkcgs_node;
+
+	/*
+	 * List of updated percpu blkg_iostat_set's since the last flush.
+	 */
+	struct llist_head __percpu	*lhead;
+
#ifdef CONFIG_BLK_CGROUP_FC_APPID
 	char			fc_app_id[FC_APPID_LEN];
#endif
-- 
2.31.1