From nobody Mon Jun 29 16:00:55 2026
From: Nicolas Saenz Julienne
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, frederic@kernel.org, tglx@linutronix.de, mtosatti@redhat.com, mgorman@suse.de, linux-rt-users@vger.kernel.org, vbabka@suse.cz, cl@linux.com, paulmck@kernel.org, willy@infradead.org, Nicolas Saenz Julienne
Subject: [PATCH 1/2] mm/page_alloc: Access lists in 'struct per_cpu_pages' indirectly
Date: Tue, 8 Feb 2022 11:07:49 +0100
Message-Id: <20220208100750.1189808-2-nsaenzju@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20220208100750.1189808-1-nsaenzju@redhat.com>
References: <20220208100750.1189808-1-nsaenzju@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

In preparation for adding remote per-cpu page list drain support, bundle the
page lists and page count of 'struct per_cpu_pages' into a new structure,
'struct pcplists', and have all code access it indirectly through a pointer.
Upcoming patches will use this to maintain multiple instances of 'pcplists'
and switch the pointer atomically.

The 'struct pcplists' instance lives inside 'struct per_cpu_pages' and
shouldn't be accessed directly. It is set up this way because these
structures are used during early boot, when no memory allocation is possible,
and because it simplifies the memory hotplug code paths.

The function signatures of free_pcppages_bulk() and __rmqueue_pcplist()
change slightly to accommodate this without affecting performance.

No functional change intended.

Signed-off-by: Nicolas Saenz Julienne
---

Changes since RFC:
 - Add more info in commit message.
 - Removed __private attribute, in hindsight doesn't really fit what we're
   doing here.
 - Use raw_cpu_ptr() where relevant to avoid warnings.

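For readers who want to see the layout in isolation, here is a minimal
userspace sketch of the indirection described in the commit message. It is not
kernel code and not part of the patch: the singly linked 'struct node', the
NR_LISTS value, the 'storage' member (the patch calls it '__pcplists') and the
helpers pcp_init() and pcp_add() are assumptions made purely for illustration.

```c
/* Userspace model only; names mirror the patch but this is not kernel code. */
#include <assert.h>
#include <stddef.h>

#define NR_LISTS 4	/* hypothetical stand-in for NR_PCP_LISTS */

/* Simplified list node; the kernel uses struct list_head instead. */
struct node {
	struct node *next;
};

struct pcplists {
	int count;			/* number of pages across all lists */
	struct node *lists[NR_LISTS];	/* one list per index */
};

struct per_cpu_pages {
	int high;
	int batch;
	struct pcplists *lp;		/* every access goes through this pointer */
	struct pcplists storage;	/* embedded backing instance, not accessed directly */
};

/* No allocation needed: 'lp' simply points at the embedded instance. */
static void pcp_init(struct per_cpu_pages *pcp)
{
	pcp->high = 0;
	pcp->batch = 1;
	pcp->storage.count = 0;
	for (int i = 0; i < NR_LISTS; i++)
		pcp->storage.lists[i] = NULL;
	pcp->lp = &pcp->storage;
}

/* Load 'lp' once, then update both the list and the count through it. */
static void pcp_add(struct per_cpu_pages *pcp, struct node *page,
		    int pindex, int order)
{
	struct pcplists *lp = pcp->lp;

	assert(pindex >= 0 && pindex < NR_LISTS);
	page->next = lp->lists[pindex];
	lp->lists[pindex] = page;
	lp->count += 1 << order;
}
```

Because callers only ever hold the 'lp' pointer, a later change could point it
at a different 'struct pcplists' instance without touching the call sites,
which seems to be what the rest of this series relies on.
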
include/linux/mmzone.h | 10 +++-- mm/page_alloc.c | 87 +++++++++++++++++++++++++----------------- mm/vmstat.c | 6 +-- 3 files changed, 62 insertions(+), 41 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 3fff6deca2c0..b4cb85d9c6e8 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -381,7 +381,6 @@ enum zone_watermarks { =20 /* Fields and list protected by pagesets local_lock in page_alloc.c */ struct per_cpu_pages { - int count; /* number of pages in the list */ int high; /* high watermark, emptying needed */ int batch; /* chunk size for buddy add/remove */ short free_factor; /* batch scaling factor during free */ @@ -389,8 +388,13 @@ struct per_cpu_pages { short expire; /* When 0, remote pagesets are drained */ #endif =20 - /* Lists of pages, one per migrate type stored on the pcp-lists */ - struct list_head lists[NR_PCP_LISTS]; + struct pcplists *lp; + struct pcplists { + /* Number of pages in the pcplists */ + int count; + /* Lists of pages, one per migrate type stored on the pcp-lists */ + struct list_head lists[NR_PCP_LISTS]; + } __pcplists; /* Do not access directly */ }; =20 struct per_cpu_zonestat { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4f549123150c..4f37815b0e4c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1449,13 +1449,12 @@ static inline void prefetch_buddy(struct page *page) * count is the number of pages to free. */ static void free_pcppages_bulk(struct zone *zone, int count, - struct per_cpu_pages *pcp) + int batch, struct pcplists *lp) { int pindex =3D 0; int batch_free =3D 0; int nr_freed =3D 0; unsigned int order; - int prefetch_nr =3D READ_ONCE(pcp->batch); bool isolated_pageblocks; struct page *page, *tmp; LIST_HEAD(head); @@ -1464,7 +1463,7 @@ static void free_pcppages_bulk(struct zone *zone, int= count, * Ensure proper count is passed which otherwise would stuck in the * below while (list_empty(list)) loop. 
*/ - count =3D min(pcp->count, count); + count =3D min(lp->count, count); while (count > 0) { struct list_head *list; =20 @@ -1479,7 +1478,7 @@ static void free_pcppages_bulk(struct zone *zone, int= count, batch_free++; if (++pindex =3D=3D NR_PCP_LISTS) pindex =3D 0; - list =3D &pcp->lists[pindex]; + list =3D &lp->lists[pindex]; } while (list_empty(list)); =20 /* This is the only non-empty list. Free them all. */ @@ -1513,13 +1512,13 @@ static void free_pcppages_bulk(struct zone *zone, i= nt count, * avoid excessive prefetching due to large count, only * prefetch buddy for the first pcp->batch nr of pages. */ - if (prefetch_nr) { + if (batch) { prefetch_buddy(page); - prefetch_nr--; + batch--; } } while (count > 0 && --batch_free && !list_empty(list)); } - pcp->count -=3D nr_freed; + lp->count -=3D nr_freed; =20 /* * local_lock_irq held so equivalent to spin_lock_irqsave for @@ -3130,14 +3129,16 @@ static int rmqueue_bulk(struct zone *zone, unsigned= int order, */ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) { + struct pcplists *lp; unsigned long flags; int to_drain, batch; =20 local_lock_irqsave(&pagesets.lock, flags); batch =3D READ_ONCE(pcp->batch); - to_drain =3D min(pcp->count, batch); + lp =3D pcp->lp; + to_drain =3D min(lp->count, batch); if (to_drain > 0) - free_pcppages_bulk(zone, to_drain, pcp); + free_pcppages_bulk(zone, to_drain, batch, lp); local_unlock_irqrestore(&pagesets.lock, flags); } #endif @@ -3153,12 +3154,14 @@ static void drain_pages_zone(unsigned int cpu, stru= ct zone *zone) { unsigned long flags; struct per_cpu_pages *pcp; + struct pcplists *lp; =20 local_lock_irqsave(&pagesets.lock, flags); =20 pcp =3D per_cpu_ptr(zone->per_cpu_pageset, cpu); - if (pcp->count) - free_pcppages_bulk(zone, pcp->count, pcp); + lp =3D pcp->lp; + if (lp->count) + free_pcppages_bulk(zone, lp->count, READ_ONCE(pcp->batch), lp); =20 local_unlock_irqrestore(&pagesets.lock, flags); } @@ -3219,7 +3222,7 @@ static void 
drain_local_pages_wq(struct work_struct *= work) * * drain_all_pages() is optimized to only execute on cpus where pcplists a= re * not empty. The check for non-emptiness can however race with a free to - * pcplist that has not yet increased the pcp->count from 0 to 1. Callers + * pcplist that has not yet increased the lp->count from 0 to 1. Callers * that need the guarantee that every CPU has drained can disable the * optimizing racy check. */ @@ -3258,24 +3261,24 @@ static void __drain_all_pages(struct zone *zone, bo= ol force_all_cpus) * disables preemption as part of its processing */ for_each_online_cpu(cpu) { - struct per_cpu_pages *pcp; struct zone *z; bool has_pcps =3D false; + struct pcplists *lp; =20 if (force_all_cpus) { /* - * The pcp.count check is racy, some callers need a + * The lp->count check is racy, some callers need a * guarantee that no cpu is missed. */ has_pcps =3D true; } else if (zone) { - pcp =3D per_cpu_ptr(zone->per_cpu_pageset, cpu); - if (pcp->count) + lp =3D per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp; + if (lp->count) has_pcps =3D true; } else { for_each_populated_zone(z) { - pcp =3D per_cpu_ptr(z->per_cpu_pageset, cpu); - if (pcp->count) { + lp =3D per_cpu_ptr(z->per_cpu_pageset, cpu)->lp; + if (lp->count) { has_pcps =3D true; break; } @@ -3427,19 +3430,21 @@ static void free_unref_page_commit(struct page *pag= e, int migratetype, { struct zone *zone =3D page_zone(page); struct per_cpu_pages *pcp; + struct pcplists *lp; int high; int pindex; =20 __count_vm_event(PGFREE); pcp =3D this_cpu_ptr(zone->per_cpu_pageset); + lp =3D pcp->lp; pindex =3D order_to_pindex(migratetype, order); - list_add(&page->lru, &pcp->lists[pindex]); - pcp->count +=3D 1 << order; + list_add(&page->lru, &lp->lists[pindex]); + lp->count +=3D 1 << order; high =3D nr_pcp_high(pcp, zone); - if (pcp->count >=3D high) { + if (lp->count >=3D high) { int batch =3D READ_ONCE(pcp->batch); =20 - free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), pcp); + 
free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), batch, lp); } } =20 @@ -3660,7 +3665,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, uns= igned int order, int migratetype, unsigned int alloc_flags, struct per_cpu_pages *pcp, - struct list_head *list) + struct list_head *list, + int *count) { struct page *page; =20 @@ -3682,14 +3688,14 @@ struct page *__rmqueue_pcplist(struct zone *zone, u= nsigned int order, batch, list, migratetype, alloc_flags); =20 - pcp->count +=3D alloced << order; + *count +=3D alloced << order; if (unlikely(list_empty(list))) return NULL; } =20 page =3D list_first_entry(list, struct page, lru); list_del(&page->lru); - pcp->count -=3D 1 << order; + *count -=3D 1 << order; } while (check_new_pcp(page)); =20 return page; @@ -3703,8 +3709,10 @@ static struct page *rmqueue_pcplist(struct zone *pre= ferred_zone, { struct per_cpu_pages *pcp; struct list_head *list; + struct pcplists *lp; struct page *page; unsigned long flags; + int *count; =20 local_lock_irqsave(&pagesets.lock, flags); =20 @@ -3715,8 +3723,11 @@ static struct page *rmqueue_pcplist(struct zone *pre= ferred_zone, */ pcp =3D this_cpu_ptr(zone->per_cpu_pageset); pcp->free_factor >>=3D 1; - list =3D &pcp->lists[order_to_pindex(migratetype, order)]; - page =3D __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, li= st); + lp =3D pcp->lp; + list =3D &lp->lists[order_to_pindex(migratetype, order)]; + count =3D &lp->count; + page =3D __rmqueue_pcplist(zone, order, migratetype, alloc_flags, + pcp, list, count); local_unlock_irqrestore(&pagesets.lock, flags); if (page) { __count_zid_vm_events(PGALLOC, page_zonenum(page), 1); @@ -5255,9 +5266,11 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int pref= erred_nid, struct per_cpu_pages *pcp; struct list_head *pcp_list; struct alloc_context ac; + struct pcplists *lp; gfp_t alloc_gfp; unsigned int alloc_flags =3D ALLOC_WMARK_LOW; int nr_populated =3D 0, nr_account =3D 0; + int *count; =20 /* * Skip populated array elements 
to determine if any pages need @@ -5333,7 +5346,9 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int prefe= rred_nid, /* Attempt the batch allocation */ local_lock_irqsave(&pagesets.lock, flags); pcp =3D this_cpu_ptr(zone->per_cpu_pageset); - pcp_list =3D &pcp->lists[order_to_pindex(ac.migratetype, 0)]; + lp =3D pcp->lp; + pcp_list =3D &lp->lists[order_to_pindex(ac.migratetype, 0)]; + count =3D &lp->count; =20 while (nr_populated < nr_pages) { =20 @@ -5344,7 +5359,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int prefe= rred_nid, } =20 page =3D __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags, - pcp, pcp_list); + pcp, pcp_list, count); if (unlikely(!page)) { /* Try and get at least one page */ if (!nr_populated) @@ -5947,7 +5962,7 @@ void show_free_areas(unsigned int filter, nodemask_t = *nodemask) continue; =20 for_each_online_cpu(cpu) - free_pcp +=3D per_cpu_ptr(zone->per_cpu_pageset, cpu)->count; + free_pcp +=3D per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp->count; } =20 printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n" @@ -6041,7 +6056,7 @@ void show_free_areas(unsigned int filter, nodemask_t = *nodemask) =20 free_pcp =3D 0; for_each_online_cpu(cpu) - free_pcp +=3D per_cpu_ptr(zone->per_cpu_pageset, cpu)->count; + free_pcp +=3D per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp->count; =20 show_node(zone); printk(KERN_CONT @@ -6084,7 +6099,7 @@ void show_free_areas(unsigned int filter, nodemask_t = *nodemask) K(zone_page_state(zone, NR_MLOCK)), K(zone_page_state(zone, NR_BOUNCE)), K(free_pcp), - K(this_cpu_read(zone->per_cpu_pageset->count)), + K(raw_cpu_ptr(zone->per_cpu_pageset)->lp->count), K(zone_page_state(zone, NR_FREE_CMA_PAGES))); printk("lowmem_reserve[]:"); for (i =3D 0; i < MAX_NR_ZONES; i++) @@ -6971,7 +6986,7 @@ static int zone_highsize(struct zone *zone, int batch= , int cpu_online) =20 /* * pcp->high and pcp->batch values are related and generally batch is lower - * than high. 
They are also related to pcp->count such that count is lower + * than high. They are also related to pcp->lp->count such that count is l= ower * than high, and as soon as it reaches high, the pcplist is flushed. * * However, guaranteeing these relations at all times would require e.g. w= rite @@ -6979,7 +6994,7 @@ static int zone_highsize(struct zone *zone, int batch= , int cpu_online) * thus be prone to error and bad for performance. Thus the update only pr= events * store tearing. Any new users of pcp->batch and pcp->high should ensure = they * can cope with those fields changing asynchronously, and fully trust onl= y the - * pcp->count field on the local CPU with interrupts disabled. + * pcp->lp->count field on the local CPU with interrupts disabled. * * mutex_is_locked(&pcp_batch_high_lock) required when calling this functi= on * outside of boot time (or some other assurance that no concurrent update= rs @@ -6999,8 +7014,10 @@ static void per_cpu_pages_init(struct per_cpu_pages = *pcp, struct per_cpu_zonesta memset(pcp, 0, sizeof(*pcp)); memset(pzstats, 0, sizeof(*pzstats)); =20 + pcp->lp =3D &pcp->__pcplists; + for (pindex =3D 0; pindex < NR_PCP_LISTS; pindex++) - INIT_LIST_HEAD(&pcp->lists[pindex]); + INIT_LIST_HEAD(&pcp->lp->lists[pindex]); =20 /* * Set batch and high values safe for a boot pageset. A true percpu diff --git a/mm/vmstat.c b/mm/vmstat.c index d5cc8d739fac..576b2b932ccd 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -856,7 +856,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets) * if not then there is nothing to expire. 
*/ if (!__this_cpu_read(pcp->expire) || - !__this_cpu_read(pcp->count)) + !this_cpu_ptr(pcp)->lp->count) continue; =20 /* @@ -870,7 +870,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets) if (__this_cpu_dec_return(pcp->expire)) continue; =20 - if (__this_cpu_read(pcp->count)) { + if (this_cpu_ptr(pcp)->lp->count) { drain_zone_pages(zone, this_cpu_ptr(pcp)); changes++; } @@ -1728,7 +1728,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, "\n high: %i" "\n batch: %i", i, - pcp->count, + pcp->lp->count, pcp->high, pcp->batch); #ifdef CONFIG_SMP
--=20
2.34.1

From nobody Mon Jun 29 16:00:55 2026
From: Nicolas Saenz Julienne
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, frederic@kernel.org, tglx@linutronix.de, mtosatti@redhat.com, mgorman@suse.de, linux-rt-users@vger.kernel.org, vbabka@suse.cz, cl@linux.com, paulmck@kernel.org, willy@infradead.org, Nicolas Saenz Julienne
Subject: [PATCH 2/2] mm/page_alloc: Add remote draining support to per-cpu lists
Date: Tue, 8 Feb 2022 11:07:50 +0100
Message-Id: <20220208100750.1189808-3-nsaenzju@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20220208100750.1189808-1-nsaenzju@redhat.com>
References: <20220208100750.1189808-1-nsaenzju@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

The page allocator's per-cpu page lists (pcplists) are currently protected
using local_locks. While performance savvy, this doesn't allow for remote
access to these structures. CPUs requiring system-wide changes to the per-cpu
lists get around this by scheduling workers on each CPU. That said, some
setups, like NOHZ_FULL CPUs, aren't well suited to this since they can't
handle interruptions of any sort.

To mitigate this, replace the current draining mechanism with one that allows
remotely draining the lists:

 - Each CPU now has two pcplists pointers: one that points to a pcplists
   instance that is in-use, 'pcp->lp', another that points to an idle and
   empty instance, 'pcp->drain'. CPUs access their local pcplists through
   'pcp->lp' and the pointer is dereferenced atomically.

 - When a CPU decides it needs to empty some remote pcplists, it'll
   atomically exchange the remote CPU's 'pcp->lp' and 'pcp->drain' pointers.
   A remote CPU racing with this will either have:

     - An old 'pcp->lp' reference, it'll soon be emptied by the drain
       process, we just have to wait for it to finish using it.

     - The new 'pcp->lp' reference, that is, an empty pcplists instance.
       rcu_replace_pointer()'s release semantics ensure any prior changes
       will be visible to the remote CPU, for example: changes to
       'pcp->high' and 'pcp->batch' when disabling the pcplists.

 - The CPU that started the drain can now wait for an RCU grace period to
   make sure the remote CPU is done using the old pcplists.
   synchronize_rcu() counts as a full memory barrier, so any changes the
   local CPU makes to the soon to be drained pcplists will be visible to the
   draining CPU once it returns.

 - Then the CPU can safely free the old pcplists. Nobody else holds a
   reference to it. Note that concurrent access to the remote pcplists drain
   is protected by the 'pcpu_drain_mutex'.

From an RCU perspective, we're only protecting access to the pcplists
pointer, the drain operation is the writer and the local_lock critical
sections are the readers. RCU guarantees atomicity both while dereferencing
the pcplists pointer and replacing it. It also checks for RCU critical
section/locking correctness, as all readers have to hold their per-cpu
pagesets local_lock, which also counts as a critical section from RCU's
perspective.

From a performance perspective, on production configurations, the patch adds
an extra dereference to all hot paths (under such circumstances
rcu_dereference() will simplify to READ_ONCE()). Extensive measurements have
been performed on different architectures to ascertain the performance impact
is minimal. Most platforms don't see any difference and the worst-case
scenario shows a 1-3% degradation on a page allocation micro-benchmark. See
cover letter for in-depth results.

Accesses to the pcplists like the ones in mm/vmstat.c don't require RCU
supervision since they can handle outdated data, but they do use
rcu_access_pointer() to avoid compiler weirdness and to make sparse happy.

Note that special care has been taken to verify there are no races with the
memory hotplug code paths, notably with calls to zone_pcp_reset().
As Mel Gorman explains in a previous patch[1]: "The existing hotplug paths
guarantees the pcplists cannot be used after zone_pcp_enable() [the one in
offline_pages()]. That should be the case already because all the pages have
been freed and there is no page to put on the PCP lists."

All in all, this technique allows for remote draining on all setups with an
acceptable performance impact. It benefits all sorts of use cases:
low-latency, real-time, HPC, idle systems, KVM guests.

[1] 8ca559132a2d ("mm/memory_hotplug: remove broken locking of zone PCP
    structures during hot remove")

Signed-off-by: Nicolas Saenz Julienne
---

Changes since RFC:
 - Avoid unnecessary spin_lock_irqsave/restore() in free_pcppages_bulk()
 - Add more detail to commit and code comments.
 - Use synchronize_rcu() instead of synchronize_rcu_expedited(), the RCU
   documentation says to avoid it unless really justified. I don't think
   it's justified here, if we can schedule and join works, waiting for an
   RCU grace period is OK.
 - Avoid sparse warnings by using rcu_access_pointer() and
   rcu_dereference_protected().

 include/linux/mmzone.h |  22 +++++-
 mm/page_alloc.c        | 155 ++++++++++++++++++++++++++---------------
 mm/vmstat.c            |   6 +-
 3 files changed, 120 insertions(+), 63 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index b4cb85d9c6e8..b0b593fd8e48 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -388,13 +388,31 @@ struct per_cpu_pages { short expire; /* When 0, remote pagesets are drained */ #endif =20 - struct pcplists *lp; + /* + * As a rule of thumb, any access to struct per_cpu_pages's 'lp' has to + * happen with the pagesets local_lock held and using + * rcu_dereference_check(). If there is a need to modify both + * 'lp->count' and 'lp->lists' in the same critical section 'pcp->lp' + * can only be dereferenced once.
See for example: + * + * local_lock_irqsave(&pagesets.lock, flags); + * lp =3D rcu_dereference_check(pcp->lp, ...); + * list_add(&page->lru, &lp->lists[pindex]); + * lp->count +=3D 1 << order; + * local_unlock_irqrestore(&pagesets.lock, flags); + * + * vmstat code only needs to check the page count and can deal with + * outdated data. In that case rcu_access_pointer() is good enough and + * the locking is not needed. + */ + struct pcplists __rcu *lp; + struct pcplists *drain; struct pcplists { /* Number of pages in the pcplists */ int count; /* Lists of pages, one per migrate type stored on the pcp-lists */ struct list_head lists[NR_PCP_LISTS]; - } __pcplists; /* Do not access directly */ + } __pcplists[2]; /* Do not access directly */ }; =20 struct per_cpu_zonestat { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4f37815b0e4c..4680dd458184 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -150,13 +150,7 @@ DEFINE_PER_CPU(int, _numa_mem_); /* Kernel "local mem= ory" node */ EXPORT_PER_CPU_SYMBOL(_numa_mem_); #endif =20 -/* work_structs for global per-cpu drains */ -struct pcpu_drain { - struct zone *zone; - struct work_struct work; -}; static DEFINE_MUTEX(pcpu_drain_mutex); -static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain); =20 #ifdef CONFIG_GCC_PLUGIN_LATENT_ENTROPY volatile unsigned long latent_entropy __latent_entropy; @@ -3135,7 +3129,7 @@ void drain_zone_pages(struct zone *zone, struct per_c= pu_pages *pcp) =20 local_lock_irqsave(&pagesets.lock, flags); batch =3D READ_ONCE(pcp->batch); - lp =3D pcp->lp; + lp =3D rcu_dereference_check(pcp->lp, lockdep_is_held(this_cpu_ptr(&pages= ets.lock))); to_drain =3D min(lp->count, batch); if (to_drain > 0) free_pcppages_bulk(zone, to_drain, batch, lp); @@ -3159,7 +3153,7 @@ static void drain_pages_zone(unsigned int cpu, struct= zone *zone) local_lock_irqsave(&pagesets.lock, flags); =20 pcp =3D per_cpu_ptr(zone->per_cpu_pageset, cpu); - lp =3D pcp->lp; + lp =3D rcu_dereference_check(pcp->lp, 
lockdep_is_held(this_cpu_ptr(&pages= ets.lock))); if (lp->count) free_pcppages_bulk(zone, lp->count, READ_ONCE(pcp->batch), lp); =20 @@ -3198,36 +3192,48 @@ void drain_local_pages(struct zone *zone) drain_pages(cpu); } =20 -static void drain_local_pages_wq(struct work_struct *work) -{ - struct pcpu_drain *drain; - - drain =3D container_of(work, struct pcpu_drain, work); - - /* - * drain_all_pages doesn't use proper cpu hotplug protection so - * we can race with cpu offline when the WQ can move this from - * a cpu pinned worker to an unbound one. We can operate on a different - * cpu which is alright but we also have to make sure to not move to - * a different one. - */ - migrate_disable(); - drain_local_pages(drain->zone); - migrate_enable(); -} - /* * The implementation of drain_all_pages(), exposing an extra parameter to * drain on all cpus. * - * drain_all_pages() is optimized to only execute on cpus where pcplists a= re - * not empty. The check for non-emptiness can however race with a free to - * pcplist that has not yet increased the lp->count from 0 to 1. Callers - * that need the guarantee that every CPU has drained can disable the - * optimizing racy check. + * drain_all_pages() is optimized to only affect cpus where pcplists are n= ot + * empty. The check for non-emptiness can however race with a free to pcpl= ist + * that has not yet increased the lp->count from 0 to 1. Callers that need= the + * guarantee that every CPU has drained can disable the optimizing racy ch= eck. + * + * The drain mechanism does the following: + * + * - Each CPU has two pcplists pointers: one that points to a pcplists + * instance that is in-use, 'pcp->lp', another that points to an idle + * and empty instance, 'pcp->drain'. CPUs atomically dereference their = local + * pcplists through 'pcp->lp' while holding the pagesets local_lock. 
+ * + * - When a CPU decides it needs to empty some remote pcplists, it'll + * atomically exchange the remote CPU's 'pcp->lp' and 'pcp->drain' poin= ters. + * A remote CPU racing with this will either have: + * + * - An old 'pcp->lp' reference, it'll soon be emptied by the drain + * process, we just have to wait for it to finish using it. + * + * - The new 'pcp->lp' reference, that is, an empty pcplists instance. + * rcu_replace_pointer()'s release semantics ensures any prior chan= ges + * will be visible by the remote CPU, for example changes to 'pcp->= high' + * and 'pcp->batch' when disabling the pcplists. + * + * - The CPU that started the drain can now wait for an RCU grace period = to + * make sure the remote CPU is done using the old pcplists. + * synchronize_rcu() counts as a full memory barrier, so any changes the + * local CPU makes to the soon to be drained pcplists will be visible t= o the + * draining CPU once it returns. + * + * - Then the CPU can safely free the old pcplists. Nobody else holds a + * reference to it. Note that concurrent write access to remote pcplists + * pointers is protected by the 'pcpu_drain_mutex'. */ static void __drain_all_pages(struct zone *zone, bool force_all_cpus) { + struct per_cpu_pages *pcp; + struct zone *z; int cpu; =20 /* @@ -3236,13 +3242,6 @@ static void __drain_all_pages(struct zone *zone, boo= l force_all_cpus) */ static cpumask_t cpus_with_pcps; =20 - /* - * Make sure nobody triggers this path before mm_percpu_wq is fully - * initialized. - */ - if (WARN_ON_ONCE(!mm_percpu_wq)) - return; - /* * Do not drain if one is already in progress unless it's specific to * a zone. 
Such callers are primarily CMA and memory hotplug and need @@ -3261,6 +3260,7 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus) * disables preemption as part of its processing */ for_each_online_cpu(cpu) { + struct per_cpu_pages *pcp; struct zone *z; bool has_pcps =3D false; struct pcplists *lp; @@ -3272,12 +3272,16 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus) */ has_pcps =3D true; } else if (zone) { - lp =3D per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp; + pcp =3D per_cpu_ptr(zone->per_cpu_pageset, cpu); + lp =3D rcu_dereference_protected(pcp->lp, + mutex_is_locked(&pcpu_drain_mutex)); if (lp->count) has_pcps =3D true; } else { for_each_populated_zone(z) { - lp =3D per_cpu_ptr(z->per_cpu_pageset, cpu)->lp; + pcp =3D per_cpu_ptr(z->per_cpu_pageset, cpu); + lp =3D rcu_dereference_protected(pcp->lp, + mutex_is_locked(&pcpu_drain_mutex)); if (lp->count) { has_pcps =3D true; break; @@ -3291,16 +3295,40 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus) cpumask_clear_cpu(cpu, &cpus_with_pcps); } =20 + if (cpumask_empty(&cpus_with_pcps)) + goto exit; + for_each_cpu(cpu, &cpus_with_pcps) { - struct pcpu_drain *drain =3D per_cpu_ptr(&pcpu_drain, cpu); + for_each_populated_zone(z) { + if (zone && zone !=3D z) + continue; + + pcp =3D per_cpu_ptr(z->per_cpu_pageset, cpu); + WARN_ON(pcp->drain->count); + pcp->drain =3D rcu_replace_pointer(pcp->lp, pcp->drain, + mutex_is_locked(&pcpu_drain_mutex)); + } + } + + synchronize_rcu(); =20 - drain->zone =3D zone; - INIT_WORK(&drain->work, drain_local_pages_wq); - queue_work_on(cpu, mm_percpu_wq, &drain->work); + for_each_cpu(cpu, &cpus_with_pcps) { + for_each_populated_zone(z) { + unsigned long flags; + int count; + + pcp =3D per_cpu_ptr(z->per_cpu_pageset, cpu); + count =3D pcp->drain->count; + if (!count) + continue; + + local_irq_save(flags); + free_pcppages_bulk(z, count, READ_ONCE(pcp->batch), pcp->drain); + local_irq_restore(flags); + } } -
for_each_cpu(cpu, &cpus_with_pcps) - flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work); =20 +exit: mutex_unlock(&pcpu_drain_mutex); } =20 @@ -3309,7 +3337,7 @@ static void __drain_all_pages(struct zone *zone, bool= force_all_cpus) * * When zone parameter is non-NULL, spill just the single zone's pages. * - * Note that this can be extremely slow as the draining happens in a workq= ueue. + * Note that this can be extremely slow. */ void drain_all_pages(struct zone *zone) { @@ -3436,7 +3464,7 @@ static void free_unref_page_commit(struct page *page,= int migratetype, =20 __count_vm_event(PGFREE); pcp =3D this_cpu_ptr(zone->per_cpu_pageset); - lp =3D pcp->lp; + lp =3D rcu_dereference_check(pcp->lp, lockdep_is_held(this_cpu_ptr(&pages= ets.lock))); pindex =3D order_to_pindex(migratetype, order); list_add(&page->lru, &lp->lists[pindex]); lp->count +=3D 1 << order; @@ -3723,7 +3751,7 @@ static struct page *rmqueue_pcplist(struct zone *pref= erred_zone, */ pcp =3D this_cpu_ptr(zone->per_cpu_pageset); pcp->free_factor >>=3D 1; - lp =3D pcp->lp; + lp =3D rcu_dereference_check(pcp->lp, lockdep_is_held(this_cpu_ptr(&pages= ets.lock))); list =3D &lp->lists[order_to_pindex(migratetype, order)]; count =3D &lp->count; page =3D __rmqueue_pcplist(zone, order, migratetype, alloc_flags, @@ -5346,7 +5374,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int prefe= rred_nid, /* Attempt the batch allocation */ local_lock_irqsave(&pagesets.lock, flags); pcp =3D this_cpu_ptr(zone->per_cpu_pageset); - lp =3D pcp->lp; + lp =3D rcu_dereference_check(pcp->lp, lockdep_is_held(this_cpu_ptr(&pages= ets.lock))); pcp_list =3D &lp->lists[order_to_pindex(ac.migratetype, 0)]; count =3D &lp->count; =20 @@ -5961,8 +5989,12 @@ void show_free_areas(unsigned int filter, nodemask_t= *nodemask) if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask)) continue; =20 - for_each_online_cpu(cpu) - free_pcp +=3D per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp->count; + for_each_online_cpu(cpu) { + struct 
per_cpu_pages *pcp; + + pcp =3D per_cpu_ptr(zone->per_cpu_pageset, cpu); + free_pcp +=3D rcu_access_pointer(pcp->lp)->count; + } } =20 printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n" @@ -6055,8 +6087,12 @@ void show_free_areas(unsigned int filter, nodemask_t= *nodemask) continue; =20 free_pcp =3D 0; - for_each_online_cpu(cpu) - free_pcp +=3D per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp->count; + for_each_online_cpu(cpu) { + struct per_cpu_pages *pcp; + + pcp =3D per_cpu_ptr(zone->per_cpu_pageset, cpu); + free_pcp +=3D rcu_access_pointer(pcp->lp)->count; + } =20 show_node(zone); printk(KERN_CONT @@ -6099,7 +6135,7 @@ void show_free_areas(unsigned int filter, nodemask_t = *nodemask) K(zone_page_state(zone, NR_MLOCK)), K(zone_page_state(zone, NR_BOUNCE)), K(free_pcp), - K(raw_cpu_ptr(zone->per_cpu_pageset)->lp->count), + K(rcu_access_pointer(raw_cpu_ptr(zone->per_cpu_pageset)->lp)->count), K(zone_page_state(zone, NR_FREE_CMA_PAGES))); printk("lowmem_reserve[]:"); for (i =3D 0; i < MAX_NR_ZONES; i++) @@ -7014,10 +7050,13 @@ static void per_cpu_pages_init(struct per_cpu_pages= *pcp, struct per_cpu_zonesta memset(pcp, 0, sizeof(*pcp)); memset(pzstats, 0, sizeof(*pzstats)); =20 - pcp->lp =3D &pcp->__pcplists; + pcp->lp =3D RCU_INITIALIZER(&pcp->__pcplists[0]); + pcp->drain =3D &pcp->__pcplists[1]; =20 - for (pindex =3D 0; pindex < NR_PCP_LISTS; pindex++) - INIT_LIST_HEAD(&pcp->lp->lists[pindex]); + for (pindex =3D 0; pindex < NR_PCP_LISTS; pindex++) { + INIT_LIST_HEAD(&rcu_access_pointer(pcp->lp)->lists[pindex]); + INIT_LIST_HEAD(&pcp->drain->lists[pindex]); + } =20 /* * Set batch and high values safe for a boot pageset. A true percpu diff --git a/mm/vmstat.c b/mm/vmstat.c index 576b2b932ccd..9c33ff4a580a 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -856,7 +856,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets) * if not then there is nothing to expire. 
*/ if (!__this_cpu_read(pcp->expire) || - !this_cpu_ptr(pcp)->lp->count) + !rcu_access_pointer(this_cpu_ptr(pcp)->lp)->count) continue; =20 /* @@ -870,7 +870,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets) if (__this_cpu_dec_return(pcp->expire)) continue; =20 - if (this_cpu_ptr(pcp)->lp->count) { + if (rcu_access_pointer(this_cpu_ptr(pcp)->lp)->count) { drain_zone_pages(zone, this_cpu_ptr(pcp)); changes++; } @@ -1728,7 +1728,7 @@ static void zoneinfo_show_print(struct seq_file *m, p= g_data_t *pgdat, "\n high: %i" "\n batch: %i", i, - pcp->lp->count, + rcu_access_pointer(pcp->lp)->count, pcp->high, pcp->batch); #ifdef CONFIG_SMP --=20 2.34.1
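
For readers who want to poke at the pointer-swap idea outside the kernel, the
drain sequence described in the comment above __drain_all_pages() can be
sketched in userspace roughly as follows. This is only an illustration under
loud assumptions: 'struct lists', 'struct pcp', pcp_init(), pcp_add() and
pcp_drain() are hypothetical names, not kernel code; a C11 atomic exchange
stands in for rcu_replace_pointer(); and the grace-period wait
(synchronize_rcu() in the patch) is reduced to a comment, so the sketch is
only safe when the owner and the drainer do not run concurrently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for 'struct pcplists': only a page counter. */
struct lists {
	int count;
};

/* Hypothetical stand-in for 'struct per_cpu_pages'. */
struct pcp {
	_Atomic(struct lists *) lp;	/* active instance, used by the owner */
	struct lists *drain;		/* spare instance, owned by the drainer */
	struct lists instances[2];
};

static void pcp_init(struct pcp *p)
{
	p->instances[0].count = 0;
	p->instances[1].count = 0;
	atomic_init(&p->lp, &p->instances[0]);
	p->drain = &p->instances[1];
}

/* Owner side: queue 'n' pages on whichever instance is currently active. */
static void pcp_add(struct pcp *p, int n)
{
	struct lists *lp = atomic_load_explicit(&p->lp, memory_order_acquire);

	lp->count += n;
}

/*
 * Drainer side: publish the empty spare, take ownership of the old
 * instance, and return how many pages were drained from it.
 */
static int pcp_drain(struct pcp *p)
{
	int drained;

	assert(p->drain->count == 0);	/* mirrors WARN_ON(pcp->drain->count) */
	p->drain = atomic_exchange_explicit(&p->lp, p->drain,
					    memory_order_acq_rel);

	/* Assumption: a real implementation waits for a grace period here. */

	drained = p->drain->count;
	p->drain->count = 0;		/* stands in for free_pcppages_bulk() */
	return drained;
}
```

Calling pcp_add() a few times and then pcp_drain() should hand back the queued
count and leave the owner working on the other, empty instance. The point of
the design is that the owner never takes a lock shared with the drainer; it
only ever dereferences whatever 'lp' currently points to.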