From nobody Fri Oct 18 08:36:25 2024
From: Li Zhijian
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	Yasunori Gotou, Li Zhijian, David Hildenbrand, Vlastimil Babka,
	Yao Xingtao
Subject: [PATCH v3] mm/page_alloc: Fix pcp->count race between
	drain_pages_zone() vs __rmqueue_pcplist()
Date: Tue, 23 Jul 2024 14:44:28 +0800
Message-Id: <20240723064428.1179519-1-lizhijian@fujitsu.com>
X-Mailer: git-send-email 2.31.1
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

It's expected that no page should be left in pcp_list after calling
zone_pcp_disable() in offline_pages(). Previously, it was observed that
offline_pages() gets stuck [1] due to some pages remaining in pcp_list.

Cause:
There is a race condition between drain_pages_zone() and
__rmqueue_pcplist() involving the pcp->count variable. See the scenario
below:

        CPU0                              CPU1
  ----------------                   ---------------
                                     spin_lock(&pcp->lock);
                                     __rmqueue_pcplist() {
  zone_pcp_disable() {
                                       /* list is empty */
                                       if (list_empty(list)) {
                                         /* add pages to pcp_list */
                                         alloced = rmqueue_bulk()
    mutex_lock(&pcp_batch_high_lock)
    ...
    __drain_all_pages() {
      drain_pages_zone() {
        /* read pcp->count, it's 0 here */
        count = READ_ONCE(pcp->count)
        /* 0 means nothing to drain */
                                       /* update pcp->count */
                                       pcp->count += alloced << order;
        ...
        ...
                                     spin_unlock(&pcp->lock);

In this case, even after zone_pcp_disable() returns, some pages are
still left in pcp_list. Because these pages are neither movable nor
isolated, offline_pages() gets stuck as a result.

Solution:
Expand the scope of the pcp->lock to also protect pcp->count in
drain_pages_zone(), to ensure no pages are left in the pcp list after
zone_pcp_disable().

[1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/

Cc: David Hildenbrand
Cc: Vlastimil Babka (SUSE)
Reported-by: Yao Xingtao
Signed-off-by: Li Zhijian
Reviewed-by: Vlastimil Babka
---
V3:
  Read pcp->count in the loop with the lock held, to prevent an
  unnecessary, wasteful spin_lock().
V2:
 - Narrow down the scope of the spin_lock() to limit the draining
   latency. # Vlastimil and David
 - In the above scenario, it's sufficient to read pcp->count once with
   the lock held; this fully fixed my issue [1] across thousands of
   runs (previously it happened in more than 5% of them).
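As an aside, the unlocked check-then-act pattern at the heart of this
race can be reproduced outside the kernel. Below is a minimal userspace
sketch, not kernel code: the names pcp_lock, pcp_count, allocator() and
drainer() are made up for illustration, with a pthread mutex standing in
for pcp->lock. The drainer may read the count as 0 and skip draining
even though the allocator is refilling the list under the lock.

/*
 * Hypothetical userspace sketch of the check-then-act race described in
 * the "Cause" section above; it is NOT kernel code.  pcp_lock, pcp_count,
 * allocator() and drainer() are illustrative stand-ins for pcp->lock,
 * pcp->count, __rmqueue_pcplist() and drain_pages_zone().
 * Build with: gcc -pthread sketch.c
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t pcp_lock = PTHREAD_MUTEX_INITIALIZER;
static int pcp_count;			/* stands in for pcp->count */

/* Mimics __rmqueue_pcplist(): the list is empty, so refill it under the lock. */
static void *allocator(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&pcp_lock);
	pcp_count += 32;		/* rmqueue_bulk() added pages, update the count */
	pthread_mutex_unlock(&pcp_lock);
	return NULL;
}

/* Mimics the pre-patch drain_pages_zone(): the count is read without the lock. */
static void *drainer(void *arg)
{
	int count = pcp_count;		/* unlocked read; may still see 0 */

	(void)arg;
	while (count) {
		pthread_mutex_lock(&pcp_lock);
		pcp_count -= count;	/* stands in for free_pcppages_bulk() */
		count = 0;
		pthread_mutex_unlock(&pcp_lock);
	}
	return NULL;			/* if 0 was read, nothing is drained */
}

int main(void)
{
	pthread_t a, d;

	pthread_create(&a, NULL, allocator, NULL);
	pthread_create(&d, NULL, drainer, NULL);
	pthread_join(a, NULL);
	pthread_join(d, NULL);

	/* A non-zero result corresponds to pages left in pcp_list after draining. */
	printf("count after drain: %d\n", pcp_count);
	return 0;
}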
RFC: https://lore.kernel.org/linux-mm/20240716073929.843277-1-lizhijian@fujitsu.com/
Signed-off-by: Li Zhijian
---
 mm/page_alloc.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ecf99190ea2..a32289ec4768 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2323,16 +2323,20 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 {
 	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
-	int count = READ_ONCE(pcp->count);
-
-	while (count) {
-		int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
-		count -= to_drain;
+	int count;
 
+	do {
 		spin_lock(&pcp->lock);
-		free_pcppages_bulk(zone, to_drain, pcp, 0);
+		count = pcp->count;
+		if (count) {
+			int to_drain = min(count,
+				pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
+
+			free_pcppages_bulk(zone, to_drain, pcp, 0);
+			count -= to_drain;
+		}
 		spin_unlock(&pcp->lock);
-	}
+	} while (count);
 }
 
 /*
-- 
2.29.2
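For completeness, the toy model from the earlier sketch can also mirror
the patched loop. drainer_fixed() below is a hypothetical drop-in
replacement for drainer() in that sketch (it reuses its pcp_lock and
pcp_count), re-reading the count with the lock held on every iteration,
as the patched drain_pages_zone() does; it deliberately does not model
zone_pcp_disable() stopping further refills.

/*
 * Hypothetical drop-in replacement for drainer() in the sketch above,
 * mirroring the patched drain_pages_zone(): the count is (re)read with
 * the lock held, so a refill that completed before the drain takes the
 * lock is always observed and drained.
 */
static void *drainer_fixed(void *arg)
{
	int count;

	(void)arg;
	do {
		pthread_mutex_lock(&pcp_lock);
		count = pcp_count;		/* read under the lock */
		if (count)
			pcp_count -= count;	/* stands in for free_pcppages_bulk() */
		pthread_mutex_unlock(&pcp_lock);
	} while (count);

	return NULL;
}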