From: Qi Zheng
To: akpm@linux-foundation.org, david@fromorbit.com, tkhai@ya.ru,
    vbabka@suse.cz, roman.gushchin@linux.dev, djwong@kernel.org,
    brauner@kernel.org, paulmck@kernel.org, tytso@mit.edu,
    steven.price@arm.com, cel@kernel.org, senozhatsky@chromium.org,
    yujie.liu@intel.com, gregkh@linuxfoundation.org, muchun.song@linux.dev
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, Qi Zheng
Subject: [PATCH v6 43/45] mm: shrinker: make memcg slab shrink lockless
Date: Mon, 11 Sep 2023 17:44:42 +0800
Message-Id: <20230911094444.68966-44-zhengqi.arch@bytedance.com>
In-Reply-To: <20230911094444.68966-1-zhengqi.arch@bytedance.com>
References: <20230911094444.68966-1-zhengqi.arch@bytedance.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Like the global slab shrink, this commit uses the refcount+RCU method to
make the memcg slab shrink lockless.

Use the following script to run a slab shrink stress test:

```

DIR="/root/shrinker/memcg/mnt"

# Create the parent memcg with a 4G limit, then child memcgs 0..$1 and
# one mount point directory per child.
do_create()
{
    mkdir -p /sys/fs/cgroup/memory/test
    echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir -p /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

# Mount one tmpfs instance per directory in the range $1..$2.
do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

# Move this shell into each memcg in turn and write a 1M file in the
# background.
do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

case "$1" in
  touch)
    do_touch $2 $3
    ;;
  test)
    do_create 4000
    do_mount 0 4000
    do_touch 0 3000
    ;;
  *)
    exit 1
    ;;
esac
```

Save the script above, then run its test and touch commands. Then we can
use the following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

  33.15%  [kernel]  [k] down_read_trylock
  25.38%  [kernel]  [k] shrink_slab
  21.75%  [kernel]  [k] up_read
   4.45%  [kernel]  [k] _find_next_bit
   2.27%  [kernel]  [k] do_shrink_slab
   1.80%  [kernel]  [k] intel_idle_irq
   1.79%  [kernel]  [k] shrink_lruvec
   0.67%  [kernel]  [k] xas_descend
   0.41%  [kernel]  [k] mem_cgroup_iter
   0.40%  [kernel]  [k] shrink_node
   0.38%  [kernel]  [k] list_lru_count_one

2) After applying this patchset:

  64.56%  [kernel]  [k] shrink_slab
  12.18%  [kernel]  [k] do_shrink_slab
   3.30%  [kernel]  [k] __rcu_read_unlock
   2.61%  [kernel]  [k] shrink_lruvec
   2.49%  [kernel]  [k] __rcu_read_lock
   1.93%  [kernel]  [k] intel_idle_irq
   0.89%  [kernel]  [k] shrink_node
   0.81%  [kernel]  [k] mem_cgroup_iter
   0.77%  [kernel]  [k] mem_cgroup_calculate_protection
   0.66%  [kernel]  [k] list_lru_count_one

The top perf hotspot is now shrink_slab itself rather than the
shrinker_rwsem operations (down_read_trylock and up_read), which is what
we expect.

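For readers unfamiliar with the refcount half of the refcount+RCU method,
here is a minimal userspace sketch using C11 atomics. It is only an
illustration under stated assumptions: struct obj, obj_init(),
obj_try_get() and obj_put() are hypothetical names, not the kernel's
shrinker_try_get()/shrinker_put() implementation, and the RCU read-side
protection that keeps the object's memory valid while try_get runs is not
modelled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object whose lifetime is refcounted. */
struct obj {
	atomic_int refcount;	/* 0 means teardown has begun */
	bool released;		/* set when the last reference is dropped */
};

static void obj_init(struct obj *o)
{
	atomic_init(&o->refcount, 1);	/* initial reference held by the owner */
	o->released = false;
}

/* Take a reference unless the count has already dropped to zero. */
static bool obj_try_get(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure, old is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the final put marks the object as released. */
static void obj_put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->refcount, 1);

	assert(old > 0);
	if (old == 1)
		o->released = true;	/* stand-in for waking a waiter in teardown */
}
```

In this model a reader that loses the race with teardown sees
obj_try_get() fail and skips the object, which is the same shape as the
"!shrinker_try_get(shrinker)" check in the patch below.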
Signed-off-by: Qi Zheng
---
 mm/shrinker.c | 85 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 66 insertions(+), 19 deletions(-)

diff --git a/mm/shrinker.c b/mm/shrinker.c
index 82dc61133c5b..ad64cac5248c 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -218,7 +218,6 @@ static int shrinker_memcg_alloc(struct shrinker *shrinker)
 		return -ENOSYS;
 
 	down_write(&shrinker_rwsem);
-	/* This may call shrinker, so it must use down_read_trylock() */
 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
 	if (id < 0)
 		goto unlock;
@@ -252,10 +251,15 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
 {
 	struct shrinker_info *info;
 	struct shrinker_info_unit *unit;
+	long nr_deferred;
 
-	info = shrinker_info_protected(memcg, nid);
+	rcu_read_lock();
+	info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
 	unit = info->unit[shrinker_id_to_index(shrinker->id)];
-	return atomic_long_xchg(&unit->nr_deferred[shrinker_id_to_offset(shrinker->id)], 0);
+	nr_deferred = atomic_long_xchg(&unit->nr_deferred[shrinker_id_to_offset(shrinker->id)], 0);
+	rcu_read_unlock();
+
+	return nr_deferred;
 }
 
 static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
@@ -263,10 +267,16 @@ static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
 {
 	struct shrinker_info *info;
 	struct shrinker_info_unit *unit;
+	long nr_deferred;
 
-	info = shrinker_info_protected(memcg, nid);
+	rcu_read_lock();
+	info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
 	unit = info->unit[shrinker_id_to_index(shrinker->id)];
-	return atomic_long_add_return(nr, &unit->nr_deferred[shrinker_id_to_offset(shrinker->id)]);
+	nr_deferred =
+		atomic_long_add_return(nr, &unit->nr_deferred[shrinker_id_to_offset(shrinker->id)]);
+	rcu_read_unlock();
+
+	return nr_deferred;
 }
 
 void reparent_shrinker_deferred(struct mem_cgroup *memcg)
@@ -463,18 +473,54 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 	if (!mem_cgroup_online(memcg))
 		return 0;
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 0;
-
-	info = shrinker_info_protected(memcg, nid);
+	/*
+	 * lockless algorithm of memcg shrink.
+	 *
+	 * The shrinker_info may be freed asynchronously via RCU in the
+	 * expand_one_shrinker_info(), so the rcu_read_lock() needs to be used
+	 * to ensure the existence of the shrinker_info.
+	 *
+	 * The shrinker_info_unit is never freed unless its corresponding memcg
+	 * is destroyed. Here we already hold the refcount of memcg, so the
+	 * memcg will not be destroyed, and of course shrinker_info_unit will
+	 * not be freed.
+	 *
+	 * So in the memcg shrink:
+	 *  step 1: use rcu_read_lock() to guarantee existence of the
+	 *          shrinker_info.
+	 *  step 2: after getting shrinker_info_unit we can safely release the
+	 *          RCU lock.
+	 *  step 3: traverse the bitmap and calculate shrinker_id
+	 *  step 4: use rcu_read_lock() to guarantee existence of the shrinker.
+	 *  step 5: use shrinker_id to find the shrinker, then use
+	 *          shrinker_try_get() to guarantee existence of the shrinker,
+	 *          then we can release the RCU lock to do do_shrink_slab() that
+	 *          may sleep.
+	 *  step 6: do shrinker_put() paired with step 5 to put the refcount,
+	 *          if the refcount reaches 0, then wake up the waiter in
+	 *          shrinker_free() by calling complete().
+	 *          Note: here is different from the global shrink, we don't
+	 *                need to acquire the RCU lock to guarantee existence of
+	 *                the shrinker, because we don't need to use this
+	 *                shrinker to traverse the next shrinker in the bitmap.
+	 *  step 7: we have already exited the read-side of rcu critical section
+	 *          before calling do_shrink_slab(), the shrinker_info may be
+	 *          released in expand_one_shrinker_info(), so go back to step 1
+	 *          to reacquire the shrinker_info.
+	 */
+again:
+	rcu_read_lock();
+	info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
 	if (unlikely(!info))
 		goto unlock;
 
-	for (; index < shrinker_id_to_index(info->map_nr_max); index++) {
+	if (index < shrinker_id_to_index(info->map_nr_max)) {
 		struct shrinker_info_unit *unit;
 
 		unit = info->unit[index];
 
+		rcu_read_unlock();
+
 		for_each_set_bit(offset, unit->map, SHRINKER_UNIT_BITS) {
 			struct shrink_control sc = {
 				.gfp_mask = gfp_mask,
@@ -484,12 +530,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 			struct shrinker *shrinker;
 			int shrinker_id = calc_shrinker_id(index, offset);
 
+			rcu_read_lock();
 			shrinker = idr_find(&shrinker_idr, shrinker_id);
-			if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
-				if (!shrinker)
-					clear_bit(offset, unit->map);
+			if (unlikely(!shrinker || !shrinker_try_get(shrinker))) {
+				clear_bit(offset, unit->map);
+				rcu_read_unlock();
 				continue;
 			}
+			rcu_read_unlock();
 
 			/* Call non-slab shrinkers even though kmem is disabled */
 			if (!memcg_kmem_online() &&
@@ -522,15 +570,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 					set_shrinker_bit(memcg, nid, shrinker_id);
 			}
 			freed += ret;
-
-			if (rwsem_is_contended(&shrinker_rwsem)) {
-				freed = freed ? : 1;
-				goto unlock;
-			}
+			shrinker_put(shrinker);
 		}
+
+		index++;
+		goto again;
 	}
 unlock:
-	up_read(&shrinker_rwsem);
+	rcu_read_unlock();
 	return freed;
 }
 #else /* !CONFIG_MEMCG */
-- 
2.30.2