From nobody Thu Apr 2 12:00:53 2026
From: wangyong
To: gregkh@linuxfoundation.org
Cc: jaewon31.kim@samsung.com, linux-kernel@vger.kernel.org,
    mhocko@kernel.org, stable@vger.kernel.org, wang.yong12@zte.com.cn,
    yongw.pur@gmail.com, Joonsoo Kim, Andrew Morton, Johannes Weiner,
    Minchan Kim, Mel Gorman, Linus Torvalds
Subject: [PATCH v2 stable-4.19 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx
Date: Sun, 25 Sep 2022 03:35:27 -0700
Message-Id: <20220925103529.13716-2-yongw.pur@gmail.com>
In-Reply-To: <20220925103529.13716-1-yongw.pur@gmail.com>
References: <20220925103529.13716-1-yongw.pur@gmail.com>

From: Joonsoo Kim

[ backport of commit 3334a45eb9e2bb040c880ef65e1d72357a0a008b ]

Patch series "integrate classzone_idx and high_zoneidx", v5.

This patchset is a followup to the problem reported and discussed two
years ago [1, 2]. The problem it solves is related to the classzone_idx
on NUMA systems.
It causes a problem when lowmem reserve protection exists for some zones
on a node that do not exist on other nodes. The problem was reported two
years ago and, at that time, the solution got general agreement [2], but
it was never upstreamed.

[1]: http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
[2]: http://lkml.kernel.org/r/1525408246-14768-1-git-send-email-iamjoonsoo.kim@lge.com

This patch (of 2):

Currently, we use classzone_idx to calculate the lowmem reserve
protection for an allocation request. This classzone_idx causes a
problem on NUMA systems when lowmem reserve protection exists for some
zones on a node that do not exist on other nodes.

Before further explanation, I should first clarify how classzone_idx
and high_zoneidx are computed:

- ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
  represents the index of the highest zone the allocation can use
- classzone_idx was supposed to be the index of the highest zone on the
  local node that the allocation can use and that is actually available
  in the system

Consider the following example. Node 0 has 4 populated zones:
DMA/DMA32/NORMAL/MOVABLE. Node 1 has 1 populated zone: NORMAL. Some
zones, such as MOVABLE, do not exist on node 1, and this makes the
following difference.

Assume an allocation request whose gfp_zone(gfp_mask) is the MOVABLE
zone. Its high_zoneidx is then 3. If this allocation is initiated on
node 0, its classzone_idx is 3, since the highest actually
available/usable zone on the local node (node 0) is MOVABLE. If it is
initiated on node 1, its classzone_idx is 2, since the highest actually
available/usable zone on the local node (node 1) is NORMAL.

As you can see, the classzone_idx of an allocation request differs
according to its starting node, even when the high_zoneidx is the same.
Now think more about these two allocation requests. If they are
processed locally, there is no problem.
However, if an allocation initiated on node 1 is processed remotely, in
this example at the NORMAL zone on node 0 due to memory shortage, a
problem occurs. The different classzone_idx leads to a different lowmem
reserve and therefore a different min watermark. See the following
example.

root@ubuntu:/sys/devices/system/memory# cat /proc/zoneinfo
Node 0, zone      DMA
  per-node stats ...
  pages free     3965
        min      5
        low      8
        high     11
        spanned  4095
        present  3998
        managed  3977
        protection: (0, 2961, 4928, 5440)
...
Node 0, zone    DMA32
  pages free     757955
        min      1129
        low      1887
        high     2645
        spanned  1044480
        present  782303
        managed  758116
        protection: (0, 0, 1967, 2479)
...
Node 0, zone   Normal
  pages free     459806
        min      750
        low      1253
        high     1756
        spanned  524288
        present  524288
        managed  503620
        protection: (0, 0, 0, 4096)
...
Node 0, zone  Movable
  pages free     130759
        min      195
        low      326
        high     457
        spanned  1966079
        present  131072
        managed  131072
        protection: (0, 0, 0, 0)
...
Node 1, zone      DMA
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 1006, 1006)
Node 1, zone    DMA32
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 1006, 1006)
Node 1, zone   Normal
  per-node stats ...
  pages free     233277
        min      383
        low      640
        high     897
        spanned  262144
        present  262144
        managed  257744
        protection: (0, 0, 0, 0)
...
Node 1, zone  Movable
  pages free     0
        min      0
        low      0
        high     0
        spanned  262144
        present  0
        managed  0
        protection: (0, 0, 0, 0)

- the static min watermark for the NORMAL zone on node 0 is 750.
- the lowmem reserve for a request with classzone_idx 3 at the NORMAL
  zone on node 0 is 4096.
- the lowmem reserve for a request with classzone_idx 2 at the NORMAL
  zone on node 0 is 0.

So the overall min watermark is:
allocation initiated on node 0 (classzone_idx 3): 750 + 4096 = 4846
allocation initiated on node 1 (classzone_idx 2): 750 + 0 = 750

An allocation initiated on node 1 takes some precedence over one
initiated on node 0, because the min watermark of the former allocation
is lower.
So an allocation initiated on node 1 could succeed on node 0 when one
initiated on node 0 could not, and this could cause too many numa_miss
allocations and degrade performance.

Recently, there was a regression report about this problem on the CMA
patches, since those patches place CMA memory in ZONE_MOVABLE. I
checked that the problem disappears with this fix, which uses
high_zoneidx for classzone_idx.

http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop

Using high_zoneidx for classzone_idx is a more consistent approach than
the previous one, because the system's memory layout does not affect it
at all. With this patch, both classzone_idx values in the example above
will be 3, so both requests will have the same min watermark.

allocation initiated on node 0: 750 + 4096 = 4846
allocation initiated on node 1: 750 + 4096 = 4846

One could wonder whether there is a side effect: an allocation
initiated on node 1 would face a higher bar when handled locally, since
its classzone_idx could now be higher than before. This will not
happen, because a zone without managed pages does not contribute to
lowmem_reserve at all.
Reported-by: Ye Xiaolong
Signed-off-by: Joonsoo Kim
Signed-off-by: Andrew Morton
Tested-by: Ye Xiaolong
Reviewed-by: Baoquan He
Acked-by: Vlastimil Babka
Acked-by: David Rientjes
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Mel Gorman
Link: http://lkml.kernel.org/r/1587095923-7515-1-git-send-email-iamjoonsoo.kim@lge.com
Link: http://lkml.kernel.org/r/1587095923-7515-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds
---
 mm/internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3a2e973..922a173 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -123,7 +123,7 @@ struct alloc_context {
 	bool spread_dirty_pages;
 };
 
-#define ac_classzone_idx(ac) zonelist_zone_idx(ac->preferred_zoneref)
+#define ac_classzone_idx(ac) (ac->high_zoneidx)
 
 /*
  * Locate the struct page for both the matching buddy in our
-- 
2.7.4

From nobody Thu Apr 2 12:00:53 2026
From: wangyong
To: gregkh@linuxfoundation.org
Cc: jaewon31.kim@samsung.com, linux-kernel@vger.kernel.org,
    mhocko@kernel.org, stable@vger.kernel.org, wang.yong12@zte.com.cn,
    yongw.pur@gmail.com, Andrew Morton, Johannes Weiner, Yong-Taek Lee,
    Linus Torvalds
Subject: [PATCH v2 stable-4.19 2/3] page_alloc: consider highatomic reserve in watermark fast
Date: Sun, 25 Sep 2022 03:35:28 -0700
Message-Id: <20220925103529.13716-3-yongw.pur@gmail.com>
In-Reply-To: <20220925103529.13716-1-yongw.pur@gmail.com>
References: <20220925103529.13716-1-yongw.pur@gmail.com>

From: Jaewon Kim

[ backport of commit f27ce0e14088b23f8d54ae4a44f70307ec420e64 ]

zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm,
page_alloc: shortcut watermark checks for order-0 pages"). That commit
simply checks whether free pages are above the watermark, without
additional adjustments such as reducing the watermark. It considered
free CMA pages, but it did not consider the highatomic reserve. This
may exhaust all free pages except the high-order atomic free pages.

Assume the reserved_highatomic pageblock is bigger than the watermark
min, and there are only a few free pages besides the high-order atomic
free pages. Because zone_watermark_fast passes allocations without
considering the high-order atomic free pages, normal reclaimable
allocations like GFP_HIGHUSER will consume all the free pages. An
order-0 atomic allocation may then finally fail, which means the
watermark min is not protected against non-atomic allocations. The
order-0 atomic allocation with ALLOC_HARDER can unexpectedly fail, and
the __GFP_MEMALLOC allocation with ALLOC_NO_WATERMARKS can fail as
well.

To avoid the problem, zone_watermark_fast should consider the
highatomic reserve. If the actual amount of high-order atomic free
pages were counted accurately, like free CMA pages, we could use that;
this patch simply uses nr_reserved_highatomic. It additionally
introduces __zone_watermark_unusable_free to factor out the parts
common to zone_watermark_fast and __zone_watermark_ok.
This is an example of ALLOC_HARDER allocation failure using a v4.19
based kernel.

 Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
 Call trace:
 [] dump_stack+0xb8/0xf0
 [] warn_alloc+0xd8/0x12c
 [] __alloc_pages_nodemask+0x120c/0x1250
 [] new_slab+0x128/0x604
 [] ___slab_alloc+0x508/0x670
 [] __kmalloc+0x2f8/0x310
 [] context_struct_to_string+0x104/0x1cc
 [] security_sid_to_context_core+0x74/0x144
 [] security_sid_to_context+0x10/0x18
 [] selinux_secid_to_secctx+0x20/0x28
 [] security_secid_to_secctx+0x3c/0x70
 [] binder_transaction+0xe68/0x454c
 Mem-Info:
 active_anon:102061 inactive_anon:81551 isolated_anon:0
  active_file:59102 inactive_file:68924 isolated_file:64
  unevictable:611 dirty:63 writeback:0 unstable:0
  slab_reclaimable:13324 slab_unreclaimable:44354
  mapped:83015 shmem:4858 pagetables:26316 bounce:0
  free:2727 free_pcp:1035 free_cma:178
 Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
 Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
 lowmem_reserve[]: 0 0
 Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB
 138826 total pagecache pages
 5460 pages in swap cache
 Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142

This is an example of ALLOC_NO_WATERMARKS allocation failure using a
v4.14 based kernel.
 kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null)
 kswapd0 cpuset=/ mems_allowed=0
 CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1
 Call trace:
 [<0000000000000000>] dump_backtrace+0x0/0x248
 [<0000000000000000>] show_stack+0x18/0x20
 [<0000000000000000>] __dump_stack+0x20/0x28
 [<0000000000000000>] dump_stack+0x68/0x90
 [<0000000000000000>] warn_alloc+0x104/0x198
 [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
 [<0000000000000000>] zs_malloc+0x148/0x3d0
 [<0000000000000000>] zram_bvec_rw+0x410/0x798
 [<0000000000000000>] zram_rw_page+0x88/0xdc
 [<0000000000000000>] bdev_write_page+0x70/0xbc
 [<0000000000000000>] __swap_writepage+0x58/0x37c
 [<0000000000000000>] swap_writepage+0x40/0x4c
 [<0000000000000000>] shrink_page_list+0xc30/0xf48
 [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
 [<0000000000000000>] shrink_node_memcg+0x23c/0x618
 [<0000000000000000>] shrink_node+0x1c8/0x304
 [<0000000000000000>] kswapd+0x680/0x7c4
 [<0000000000000000>] kthread+0x110/0x120
 [<0000000000000000>] ret_from_fork+0x10/0x18
 Mem-Info:
 active_anon:111826 inactive_anon:65557 isolated_anon:0
  active_file:44260 inactive_file:83422 isolated_file:0
  unevictable:4158 dirty:117 writeback:0 unstable:0
  slab_reclaimable:13943 slab_unreclaimable:43315
  mapped:102511 shmem:3299 pagetables:19566 bounce:0
  free:3510 free_pcp:553 free_cma:0
 Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB dirty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
 Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768kB unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB
 [ 4738.329607] lowmem_reserve[]: 0 0
 Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB

This is a trace log showing GFP_HIGHUSER consuming free pages right
before ALLOC_NO_WATERMARKS.

 <...>-22275 [006] .... 889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
 <...>-22275 [006] .... 889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
 <...>-22275 [006] .... 889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
 <...>-22275 [006] .... 889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
 <...>-22275 [006] .... 889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
 <...>-22275 [006] .... 889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
 <...>-22275 [006] .... 889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
 <...>-22275 [006] .... 889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
 kswapd0-1207 [005] ...1 889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE

[jaewon31.kim@samsung.com: remove redundant code for high-order]
Link: http://lkml.kernel.org/r/20200623035242.27232-1-jaewon31.kim@samsung.com
Reported-by: Yong-Taek Lee
Suggested-by: Minchan Kim
Signed-off-by: Jaewon Kim
Signed-off-by: Andrew Morton
Reviewed-by: Baoquan He
Acked-by: Vlastimil Babka
Acked-by: Mel Gorman
Cc: Johannes Weiner
Cc: Yong-Taek Lee
Cc: Michal Hocko
Link: http://lkml.kernel.org/r/20200619235958.11283-1-jaewon31.kim@samsung.com
Signed-off-by: Linus Torvalds
---
 mm/page_alloc.c | 65 ++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 36 insertions(+), 29 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9c35403..237463d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3130,6 +3130,29 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 
 #endif /* CONFIG_FAIL_PAGE_ALLOC */
 
+static inline long __zone_watermark_unusable_free(struct zone *z,
+				unsigned int order, unsigned int alloc_flags)
+{
+	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
+	long unusable_free = (1 << order) - 1;
+
+	/*
+	 * If the caller does not have rights to ALLOC_HARDER then subtract
+	 * the high-atomic reserves. This will over-estimate the size of the
+	 * atomic reserve but it avoids a search.
+	 */
+	if (likely(!alloc_harder))
+		unusable_free += z->nr_reserved_highatomic;
+
+#ifdef CONFIG_CMA
+	/* If allocation can't use CMA areas don't use free CMA pages */
+	if (!(alloc_flags & ALLOC_CMA))
+		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
+#endif
+
+	return unusable_free;
+}
+
 /*
  * Return true if free base pages are above 'mark'. For high-order checks it
  * will return true of the order-0 watermark is reached and there is at least
@@ -3145,19 +3168,12 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
 
 	/* free_pages may go negative - that's OK */
-	free_pages -= (1 << order) - 1;
+	free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
 
 	if (alloc_flags & ALLOC_HIGH)
 		min -= min / 2;
 
-	/*
-	 * If the caller does not have rights to ALLOC_HARDER then subtract
-	 * the high-atomic reserves. This will over-estimate the size of the
-	 * atomic reserve but it avoids a search.
-	 */
-	if (likely(!alloc_harder)) {
-		free_pages -= z->nr_reserved_highatomic;
-	} else {
+	if (unlikely(alloc_harder)) {
 		/*
 		 * OOM victims can try even harder than normal ALLOC_HARDER
 		 * users on the grounds that it's definitely going to be in
@@ -3170,13 +3186,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 		min -= min / 4;
 	}
 
-
-#ifdef CONFIG_CMA
-	/* If allocation can't use CMA areas don't use free CMA pages */
-	if (!(alloc_flags & ALLOC_CMA))
-		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
-
 	/*
 	 * Check watermarks for an order-0 allocation request. If these
 	 * are not met, then a high-order request also cannot go ahead
@@ -3225,24 +3234,22 @@ bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 		unsigned long mark, int classzone_idx, unsigned int alloc_flags)
 {
-	long free_pages = zone_page_state(z, NR_FREE_PAGES);
-	long cma_pages = 0;
+	long free_pages;
 
-#ifdef CONFIG_CMA
-	/* If allocation can't use CMA areas don't use free CMA pages */
-	if (!(alloc_flags & ALLOC_CMA))
-		cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
+	free_pages = zone_page_state(z, NR_FREE_PAGES);
 
 	/*
 	 * Fast check for order-0 only. If this fails then the reserves
-	 * need to be calculated. There is a corner case where the check
-	 * passes but only the high-order atomic reserve are free. If
-	 * the caller is !atomic then it'll uselessly search the free
-	 * list. That corner case is then slower but it is harmless.
+	 * need to be calculated.
 	 */
-	if (!order && (free_pages - cma_pages) > mark + z->lowmem_reserve[classzone_idx])
-		return true;
+	if (!order) {
+		long fast_free;
+
+		fast_free = free_pages;
+		fast_free -= __zone_watermark_unusable_free(z, 0, alloc_flags);
+		if (fast_free > mark + z->lowmem_reserve[classzone_idx])
+			return true;
+	}
 
 	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
 				   free_pages);
-- 
2.7.4

From nobody Thu Apr 2 12:00:53 2026
From: wangyong
To: gregkh@linuxfoundation.org
Cc: jaewon31.kim@samsung.com, linux-kernel@vger.kernel.org,
    mhocko@kernel.org, stable@vger.kernel.org, wang.yong12@zte.com.cn,
    yongw.pur@gmail.com, Minchan Kim, Baoquan He, Vlastimil Babka,
    Johannes Weiner, Yong-Taek Lee, Andrew Morton
Subject: [PATCH v2 stable-4.19 3/3] page_alloc: fix invalid watermark check on a negative value
Date: Sun, 25 Sep 2022 03:35:29 -0700
Message-Id: <20220925103529.13716-4-yongw.pur@gmail.com>
In-Reply-To: <20220925103529.13716-1-yongw.pur@gmail.com>
References: <20220925103529.13716-1-yongw.pur@gmail.com>

From: Jaewon Kim

[ backport of commit 9282012fc0aa248b77a69f5eb802b67c5a16bb13 ]

There was a report that a task was stuck waiting in
throttle_direct_reclaim, and pgscan_direct_throttle in vmstat kept
increasing. This is a bug where zone_watermark_fast returns true even
when free pages are very low.

The commit f27ce0e14088 ("page_alloc: consider highatomic reserve in
watermark fast") changed the fast watermark check to consider the
highatomic reserve, but it did not handle the negative-value case that
can happen when the reserved_highatomic pageblock is bigger than the
actual free pages. If the watermark is considered OK for such a
negative value, order-0 allocating contexts will consume all free pages
without direct reclaim, and free pages may finally be depleted except
for the highatomic free pages. Allocating contexts then fall into
throttle_direct_reclaim. This symptom can easily happen on a system
where the wmark min is low and other reclaimers such as kswapd do not
make free pages quickly.

Handle the negative case by clamping with min().

Link: https://lkml.kernel.org/r/20220725095212.25388-1-jaewon31.kim@samsung.com
Fixes: f27ce0e14088 ("page_alloc: consider highatomic reserve in watermark fast")
Signed-off-by: Jaewon Kim
Reported-by: GyeongHwan Hong
Acked-by: Mel Gorman
Cc: Minchan Kim
Cc: Baoquan He
Cc: Vlastimil Babka
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Yong-Taek Lee
Cc:
Signed-off-by: Andrew Morton
---
 mm/page_alloc.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 237463d..d6d8a37 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3243,11 +3243,15 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 	 * need to be calculated.
 	 */
 	if (!order) {
-		long fast_free;
+		long usable_free;
+		long reserved;
 
-		fast_free = free_pages;
-		fast_free -= __zone_watermark_unusable_free(z, 0, alloc_flags);
-		if (fast_free > mark + z->lowmem_reserve[classzone_idx])
+		usable_free = free_pages;
+		reserved = __zone_watermark_unusable_free(z, 0, alloc_flags);
+
+		/* reserved may over estimate high-atomic reserves. */
+		usable_free -= min(usable_free, reserved);
+		if (usable_free > mark + z->lowmem_reserve[classzone_idx])
 			return true;
 	}
 
-- 
2.7.4