From nobody Mon Jun 8 09:48:08 2026 Received: from out-170.mta0.migadu.com (out-170.mta0.migadu.com [91.218.175.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C742F41C2FB for ; Thu, 4 Jun 2026 11:31:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.170 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780572719; cv=none; b=DnkY23KhY/MPvYO0qhHSBPR69p24nkdGIGrLVht1UkZd3mSr1omIz1v1iAJmqT+0WwwUr9WmfDi2O4XuXy7VwoA/Tln7vJdI8+kMcPakEPYDcorwEUrCw9Z3yBd/RrOkEBTUQUvdmkVEc2UCe0DnXsc4O966jEpm3dFgNR2UJ4Q= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780572719; c=relaxed/simple; bh=b8et0zEOhw8e5r44bCHLDA86kipBKfYxskQSjYzcGec=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=XiVRNe0fa+Dh3lQ1FELUE8E2Z3QuqNbL3wplMmd7gnt46JZFV97hbGGipDNjiljBHrQbuJq70sIY/ck7c80zRCaCPTxBr438ZX7q9ZA+88wPw7MCV4FB7rXuWR7zp5sEl7Y4m2vghgPkXGR+HaBVwKeZBpJAPQL5oiU8mfiWDYA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=CYZ9U8Kh; arc=none smtp.client-ip=91.218.175.170 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="CYZ9U8Kh" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780572716; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=7UgAkBaY/d/7ezqgbIeADZSgUpEbmYU0w3siNBffvvc=; b=CYZ9U8Khray64pOp693cftUUp4vB3ZgNwkX1LXln+EDnb+eFbn6FJAM0Lljh/cImZvHzri wz1iNunqsUXPuoRf7OpPGiYDvWQTtObS+MYyknGp74jBvX8CokpefVlwqpQHm34CweoTWx 5mcDoV3VGQVc26v0kCEl5Ejq+l1Ta78=
From: Kaitao Cheng
To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter, Uladzislau Rezki, Pedro Falcato, Vlastimil Babka, Michal Hocko
Cc: muchun.song@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Kaitao Cheng
Subject: [PATCH v2 1/3] mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas()
Date: Thu, 4 Jun 2026 19:30:59 +0800
Message-ID: <20260604113101.89510-2-kaitao.cheng@linux.dev>
In-Reply-To: <20260604113101.89510-1-kaitao.cheng@linux.dev>
References: <20260604113101.89510-1-kaitao.cheng@linux.dev>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Migadu-Flow: FLOW_OUT
Content-Type: text/plain; charset="utf-8"

From: Kaitao Cheng

pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask
and passes it down to the backing percpu allocator. However, when the
percpu vmalloc allocator has to create a new chunk, pcpu_create_chunk()
calls pcpu_get_vm_areas() to allocate the corresponding vmalloc areas.

pcpu_get_vm_areas() currently performs its internal allocations with
GFP_KERNEL, including vmap area metadata, vm_struct metadata and KASAN
vmalloc shadow population. This means that a caller which deliberately
uses GFP_NOFS or GFP_NOIO can still enter FS or IO reclaim while
creating the vmalloc areas for a new percpu chunk.

One possible case is blk-cgroup after commit 5d726c4dbeed ("blk-cgroup:
fix possible deadlock while configuring policy"). blkg_conf_prep() now
serializes against blkcg_deactivate_policy() with q->blkcg_mutex, and
blkg_alloc() was changed to GFP_NOIO for that reason:

  CPU0:
    blkg_conf_prep()
      mutex_lock(q->blkcg_mutex)
      blkg_alloc(..., GFP_NOIO)
        alloc_percpu_gfp(..., GFP_NOIO)
          pcpu_alloc_noprof(..., GFP_NOIO)
            pcpu_create_chunk(GFP_NOIO)
              pcpu_get_vm_areas()
                -> if percpu chunks are exhausted, chunk create may do
                   internal GFP_KERNEL allocations
                -> direct reclaim / writeback can issue IO to this queue
                -> IO waits because the queue is frozen

  CPU1:
    blkcg_deactivate_policy()
      blk_mq_freeze_queue(q)
      mutex_lock(q->blkcg_mutex)
        -> waits for CPU0
      ...
      unfreeze only happens after q->blkcg_mutex is acquired/released

So the concern is that the caller deliberately uses GFP_NOIO because it
may hold a lock which can be acquired after queue freeze, but the percpu
slow path can temporarily lose that allocation context.

Pass the caller supplied GFP mask from pcpu_create_chunk() to
pcpu_get_vm_areas(), and use it for the internal vmalloc metadata and
KASAN shadow allocations.

Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
Signed-off-by: Kaitao Cheng
Reviewed-by: Uladzislau Rezki (Sony)
---
 include/linux/vmalloc.h |  4 ++--
 mm/percpu-vm.c          |  2 +-
 mm/vmalloc.c            | 23 ++++++++++++-----------
 3 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 3b02c0c6b371..9601e06624c8 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -308,14 +308,14 @@ static inline void set_vm_flush_reset_perms(void *addr) {}
 #if defined(CONFIG_MMU) && defined(CONFIG_SMP)
 struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 				     const size_t *sizes, int nr_vms,
-				     size_t align);
+				     size_t align, gfp_t gfp);
 
 void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms);
 # else
 static inline struct vm_struct **
 pcpu_get_vm_areas(const unsigned long *offsets,
 		const size_t *sizes, int nr_vms,
-		size_t align)
+		size_t align, gfp_t gfp)
 {
 	return NULL;
 }
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 4f5937090590..69b00741dc68 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -340,7 +340,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
 		return NULL;
 
 	vms = pcpu_get_vm_areas(pcpu_group_offsets, pcpu_group_sizes,
-				pcpu_nr_groups, pcpu_atom_size);
+				pcpu_nr_groups, pcpu_atom_size, gfp);
 	if (!vms) {
 		pcpu_free_chunk(chunk);
 		return NULL;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1afca3568b9b..08f468135e4d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4946,16 +4946,17 @@ pvm_determine_end_from_reverse(struct vmap_area **va, unsigned long align)
  * @sizes: array containing size of each area
  * @nr_vms: the number of areas to allocate
  * @align: alignment, all entries in @offsets and @sizes must be aligned to this
+ * @gfp: allocation flags passed to the underlying memory allocator
  *
  * Returns: kmalloc'd vm_struct pointer array pointing to allocated
  *	    vm_structs on success, %NULL on failure
  *
  * Percpu allocator wants to use congruent vm areas so that it can
  * maintain the offsets among percpu areas. This function allocates
- * congruent vmalloc areas for it with GFP_KERNEL. These areas tend to
- * be scattered pretty far, distance between two areas easily going up
- * to gigabytes. To avoid interacting with regular vmallocs, these
- * areas are allocated from top.
+ * congruent vmalloc areas for it. These areas tend to be scattered
+ * pretty far, distance between two areas easily going up to gigabytes.
+ * To avoid interacting with regular vmallocs, these areas are allocated
+ * from top.
  *
  * Despite its complicated look, this allocator is rather simple. It
  * does everything top-down and scans free blocks from the end looking
@@ -4966,7 +4967,7 @@ pvm_determine_end_from_reverse(struct vmap_area **va, unsigned long align)
  */
 struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 				     const size_t *sizes, int nr_vms,
-				     size_t align)
+				     size_t align, gfp_t gfp)
 {
 	const unsigned long vmalloc_start = ALIGN(VMALLOC_START, align);
 	const unsigned long vmalloc_end = VMALLOC_END & ~(align - 1);
@@ -5004,14 +5005,14 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		return NULL;
 	}
 
-	vms = kzalloc_objs(vms[0], nr_vms);
-	vas = kzalloc_objs(vas[0], nr_vms);
+	vms = kzalloc_objs(vms[0], nr_vms, gfp);
+	vas = kzalloc_objs(vas[0], nr_vms, gfp);
 	if (!vas || !vms)
 		goto err_free2;
 
 	for (area = 0; area < nr_vms; area++) {
-		vas[area] = kmem_cache_zalloc(vmap_area_cachep, GFP_KERNEL);
-		vms[area] = kzalloc_obj(struct vm_struct);
+		vas[area] = kmem_cache_zalloc(vmap_area_cachep, gfp);
+		vms[area] = kzalloc_obj(struct vm_struct, gfp);
 		if (!vas[area] || !vms[area])
 			goto err_free;
 	}
@@ -5101,7 +5102,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 
 	/* populate the kasan shadow space */
 	for (area = 0; area < nr_vms; area++) {
-		if (kasan_populate_vmalloc(vas[area]->va_start, sizes[area], GFP_KERNEL))
+		if (kasan_populate_vmalloc(vas[area]->va_start, sizes[area], gfp))
 			goto err_free_shadow;
 	}
 
@@ -5158,7 +5159,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 				continue;
 
 			vas[area] = kmem_cache_zalloc(
-				vmap_area_cachep, GFP_KERNEL);
+				vmap_area_cachep, gfp);
 			if (!vas[area])
 				goto err_free;
 		}
-- 
2.43.0

From nobody Mon Jun 8 09:48:08 2026
From: Kaitao Cheng
To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter, Uladzislau Rezki, Pedro Falcato, Vlastimil Babka, Michal Hocko
Cc: muchun.song@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Kaitao Cheng
Subject: [PATCH v2 2/3] mm/percpu: honor GFP constraints when populating chunks
Date: Thu, 4 Jun 2026 19:31:00 +0800
Message-ID: <20260604113101.89510-3-kaitao.cheng@linux.dev>
In-Reply-To: <20260604113101.89510-1-kaitao.cheng@linux.dev>
References: <20260604113101.89510-1-kaitao.cheng@linux.dev>

From: Kaitao Cheng

pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask
and passes it down to pcpu_populate_chunk(). pcpu_alloc_pages() already
uses that mask for backing page allocation.

However, the populate slow path still has internal allocations and page
table allocations which can lose the caller's allocation context. The
temporary pages array is allocated by pcpu_get_pages() with GFP_KERNEL,
and pcpu_map_pages() maps the backing pages through
vmap_pages_range_noflush() using GFP_KERNEL. The latter can allocate
vmalloc page tables implicitly, so a caller which deliberately uses
GFP_NOFS or GFP_NOIO can still enter FS or IO reclaim while populating a
percpu chunk.

This has the same concern as chunk creation: callers such as blk-cgroup
may use GFP_NOIO because they hold locks which can be involved in queue
freeze or IO reclaim dependencies. If an allocation reaches the percpu
slow path and needs to populate previously unbacked pages, the internal
GFP_KERNEL allocations can defeat that context.

One possible case is blk-cgroup after commit 5d726c4dbeed ("blk-cgroup:
fix possible deadlock while configuring policy"). blkg_conf_prep() now
serializes against blkcg_deactivate_policy() with q->blkcg_mutex, and
blkg_alloc() was changed to GFP_NOIO for that reason:

  CPU0:
    blkg_conf_prep()
      mutex_lock(q->blkcg_mutex)
      blkg_alloc(..., GFP_NOIO)
        alloc_percpu_gfp(..., GFP_NOIO)
          pcpu_alloc_noprof(..., GFP_NOIO)
            pcpu_populate_chunk(GFP_NOIO)
              pcpu_get_pages()
              pcpu_map_pages()
                -> if the selected percpu chunk has unpopulated pages,
                   chunk population may do internal GFP_KERNEL allocations
                -> direct reclaim / writeback can issue IO to this queue
                -> IO waits because the queue is frozen

  CPU1:
    blkcg_deactivate_policy()
      blk_mq_freeze_queue(q)
      mutex_lock(q->blkcg_mutex)
        -> waits for CPU0
      ...
      unfreeze only happens after q->blkcg_mutex is acquired/released

So the concern is that the caller deliberately uses GFP_NOIO because it
may hold a lock which can be acquired after queue freeze, but the percpu
slow path can temporarily lose that allocation context.

Pass pcpu_gfp through pcpu_get_pages(), pcpu_map_pages() and
__pcpu_map_pages(). Apply the corresponding memalloc scope around
vmap_pages_range_noflush(), because vmalloc page table allocation does
not pass the GFP mask down explicitly.

Keep the first chunk setup path using GFP_KERNEL, matching the previous
early-init behavior.
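
The scope handling described in this commit message can be illustrated outside the kernel. What follows is a minimal userspace sketch, not kernel code. The flag bit values, the `GFP_BIT_*` names (stand-ins for `__GFP_IO`, `__GFP_FS` and `__GFP_DIRECT_RECLAIM`), `task_scope`, `scope_apply()`, `scope_restore()`, `effective_gfp()` and `nested_page_table_alloc()` are all hypothetical and exist only for illustration. The sketch assumes that an applied scope strips the IO/FS bits from nested allocations that would otherwise hardcode GFP_KERNEL, which is the behavior the message relies on.

```c
typedef unsigned int gfp_t;

/* Hypothetical bit values; only the composition of the masks matters. */
#define GFP_BIT_IO             0x1u /* stands in for __GFP_IO */
#define GFP_BIT_FS             0x2u /* stands in for __GFP_FS */
#define GFP_BIT_DIRECT_RECLAIM 0x4u /* stands in for __GFP_DIRECT_RECLAIM */

#define GFP_NOIO   (GFP_BIT_DIRECT_RECLAIM)
#define GFP_NOFS   (GFP_BIT_DIRECT_RECLAIM | GFP_BIT_IO)
#define GFP_KERNEL (GFP_BIT_DIRECT_RECLAIM | GFP_BIT_IO | GFP_BIT_FS)

#define SCOPE_NOIO 0x1u
#define SCOPE_NOFS 0x2u

/* Hypothetical stand-in for the per-task allocation scope flags. */
static unsigned int task_scope;

/* Enter the scope implied by @gfp and return the previous scope. */
unsigned int scope_apply(gfp_t gfp)
{
	unsigned int old = task_scope;

	if (!(gfp & GFP_BIT_IO))
		task_scope |= SCOPE_NOIO;
	else if (!(gfp & GFP_BIT_FS))
		task_scope |= SCOPE_NOFS;
	return old;
}

/* Leave the scope by restoring the value returned by scope_apply(). */
void scope_restore(unsigned int old)
{
	task_scope = old;
}

/* Filter a requested mask through the scope that is currently active. */
gfp_t effective_gfp(gfp_t gfp)
{
	if (task_scope & SCOPE_NOIO)
		gfp &= ~(GFP_BIT_IO | GFP_BIT_FS);
	else if (task_scope & SCOPE_NOFS)
		gfp &= ~GFP_BIT_FS;
	return gfp;
}

/* A nested allocation that never sees the caller's mask. */
gfp_t nested_page_table_alloc(void)
{
	return effective_gfp(GFP_KERNEL);
}

/* Wrap the nested allocation in the scope derived from the caller's mask. */
gfp_t map_with_scope(gfp_t gfp)
{
	unsigned int old = scope_apply(gfp);
	gfp_t seen = nested_page_table_alloc();

	scope_restore(old);
	return seen;
}
```

In this model, `map_with_scope(GFP_NOIO)` makes the nested allocation behave as GFP_NOIO even though it asks for GFP_KERNEL, and the scope is back to its previous value once the call returns.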

Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
Signed-off-by: Kaitao Cheng
---
 mm/percpu-vm.c | 38 ++++++++++++++++++++++++++------------
 mm/percpu.c    |  2 +-
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 69b00741dc68..ccd03cc152d4 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -21,6 +21,7 @@ static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
 
 /**
  * pcpu_get_pages - get temp pages array
+ * @gfp: allocation flags passed to the underlying allocator
  *
  * Returns pointer to array of pointers to struct page which can be indexed
  * with pcpu_page_idx(). Note that there is only one array and accesses
@@ -29,7 +30,7 @@ static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
  * RETURNS:
  * Pointer to temp pages array on success.
  */
-static struct page **pcpu_get_pages(void)
+static struct page **pcpu_get_pages(gfp_t gfp)
 {
 	static struct page **pages;
 	size_t pages_size = pcpu_nr_units * pcpu_unit_pages * sizeof(pages[0]);
@@ -37,7 +38,7 @@ static struct page **pcpu_get_pages(void)
 	lockdep_assert_held(&pcpu_alloc_mutex);
 
 	if (!pages)
-		pages = pcpu_mem_zalloc(pages_size, GFP_KERNEL);
+		pages = pcpu_mem_zalloc(pages_size, gfp);
 	return pages;
 }
 
@@ -191,10 +192,22 @@ static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
 }
 
 static int __pcpu_map_pages(unsigned long addr, struct page **pages,
-			    int nr_pages)
+			    int nr_pages, gfp_t gfp)
 {
-	return vmap_pages_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT),
-					PAGE_KERNEL, pages, PAGE_SHIFT, GFP_KERNEL);
+	unsigned int flags;
+	int ret;
+
+	/*
+	 * The vmalloc page table allocation path does not pass @gfp down
+	 * explicitly. Apply the corresponding memalloc scope so implicit
+	 * page table allocations preserve NOFS/NOIO constraints.
+	 */
+	flags = memalloc_apply_gfp_scope(gfp);
+	ret = vmap_pages_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT),
+				       PAGE_KERNEL, pages, PAGE_SHIFT, gfp);
+	memalloc_restore_scope(flags);
+
+	return ret;
 }
 
 /**
@@ -203,6 +216,7 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
  * @pages: pages array containing pages to be mapped
  * @page_start: page index of the first page to map
  * @page_end: page index of the last page to map + 1
+ * @gfp: allocation flags passed to the underlying allocator
  *
  * For each cpu, map pages [@page_start,@page_end) into @chunk. The
  * caller is responsible for calling pcpu_post_map_flush() after all
@@ -211,8 +225,8 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
  * This function is responsible for setting up whatever is necessary for
  * reverse lookup (addr -> chunk).
  */
-static int pcpu_map_pages(struct pcpu_chunk *chunk,
-			  struct page **pages, int page_start, int page_end)
+static int pcpu_map_pages(struct pcpu_chunk *chunk, struct page **pages,
+			  int page_start, int page_end, gfp_t gfp)
 {
 	unsigned int cpu, tcpu;
 	int i, err;
@@ -220,7 +234,7 @@ static int pcpu_map_pages(struct pcpu_chunk *chunk,
 	for_each_possible_cpu(cpu) {
 		err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
 				       &pages[pcpu_page_idx(cpu, page_start)],
-				       page_end - page_start);
+				       page_end - page_start, gfp);
 		if (err < 0)
 			goto err;
 
@@ -271,21 +285,21 @@ static void pcpu_post_map_flush(struct pcpu_chunk *chunk,
  * @chunk.
  *
  * CONTEXT:
- * pcpu_alloc_mutex, does GFP_KERNEL allocation.
+ * pcpu_alloc_mutex, does @gfp allocation.
  */
 static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
 			       int page_start, int page_end, gfp_t gfp)
 {
 	struct page **pages;
 
-	pages = pcpu_get_pages();
+	pages = pcpu_get_pages(gfp);
 	if (!pages)
 		return -ENOMEM;
 
 	if (pcpu_alloc_pages(chunk, pages, page_start, page_end, gfp))
 		return -ENOMEM;
 
-	if (pcpu_map_pages(chunk, pages, page_start, page_end)) {
+	if (pcpu_map_pages(chunk, pages, page_start, page_end, gfp)) {
 		pcpu_free_pages(chunk, pages, page_start, page_end);
 		return -ENOMEM;
 	}
@@ -319,7 +333,7 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 	 * successful population attempt so the temp pages array must
 	 * be available now.
 	 */
-	pages = pcpu_get_pages();
+	pages = pcpu_get_pages(GFP_KERNEL);
 	BUG_ON(!pages);
 
 	/* unmap and free */
diff --git a/mm/percpu.c b/mm/percpu.c
index b0676b8054ed..4d89965cba16 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -3256,7 +3256,7 @@ int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t
 
 		/* pte already populated, the following shouldn't fail */
 		rc = __pcpu_map_pages(unit_addr, &pages[unit * unit_pages],
-				      unit_pages);
+				      unit_pages, GFP_KERNEL);
 		if (rc < 0)
 			panic("failed to map percpu area, err=%d\n", rc);
 
-- 
2.43.0

From nobody Mon Jun 8 09:48:08 2026
From: Kaitao Cheng
To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter, Uladzislau Rezki, Pedro Falcato, Vlastimil Babka, Michal Hocko
Cc: muchun.song@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Kaitao Cheng
Subject: [PATCH v2 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations
Date: Thu, 4 Jun 2026 19:31:01 +0800
Message-ID: <20260604113101.89510-4-kaitao.cheng@linux.dev>
In-Reply-To: <20260604113101.89510-1-kaitao.cheng@linux.dev>
References: <20260604113101.89510-1-kaitao.cheng@linux.dev>

From: Kaitao Cheng

Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations
atomic") allows sleepable GFP_NOIO and GFP_NOFS percpu allocations to
take pcpu_alloc_mutex. This avoids premature allocation failures, but it
also makes the mutex visible to callers from constrained IO/FS contexts.

Thread A calls pcpu_alloc_noprof() with GFP_KERNEL and takes
pcpu_alloc_mutex. Since the internal allocation is not constrained by
NOFS, it may enter FS reclaim while still holding pcpu_alloc_mutex,
creating a dependency like:

  pcpu_alloc_mutex -> fs_reclaim -> FS lock

At the same time, Thread B may already hold an FS lock and then call
pcpu_alloc_noprof() with GFP_NOFS. It will try to acquire
pcpu_alloc_mutex and block, creating the reverse dependency:

  FS lock -> pcpu_alloc_mutex

This can still form a potential deadlock cycle.

Avoid the dependency by restricting percpu backing allocations to
GFP_NOIO. The public allocation still uses the caller's GFP context to
decide whether it may block, but the internal memory allocations
performed while pcpu_alloc_mutex is held cannot recurse into IO or FS
reclaim.

Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
Signed-off-by: Kaitao Cheng
---
 mm/percpu.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 4d89965cba16..e6f449323064 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1726,9 +1726,8 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
  * @gfp: allocation flags
  *
  * Allocate percpu area of @size bytes aligned at @align. If @gfp doesn't
- * contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN
- * then no warning will be triggered on invalid or failed allocation
- * requests.
+ * allow blocking, the allocation is atomic. If @gfp has __GFP_NOWARN then no
+ * warning will be triggered on invalid or failed allocation requests.
  *
  * RETURNS:
  * Percpu pointer to the allocated area on success, NULL on failure.
@@ -1749,8 +1748,14 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
 	size_t bits, bit_align;
 
 	gfp = current_gfp_context(gfp);
-	/* whitelisted flags that can be passed to the backing allocators */
-	pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
+	/*
+	 * Whitelisted flags that can be passed to the backing allocators.
+	 * Backing allocations under pcpu_alloc_mutex must not recurse into
+	 * IO/FS reclaim. Otherwise a GFP_KERNEL caller holding the mutex can
+	 * block on reclaim while a GFP_NOIO/NOFS caller holding an IO/FS lock
+	 * waits for the same mutex.
+	 */
+	pcpu_gfp = gfp & (GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN);
 	is_atomic = !gfpflags_allow_blocking(gfp);
 	do_warn = !(gfp & __GFP_NOWARN);
 
-- 
2.43.0
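
The effect of the whitelist change in patch 3/3 can be checked with plain bit arithmetic. The following is a small userspace sketch under stated assumptions, not kernel code. The bit values and the `GFP_BIT_*` names (stand-ins for `__GFP_IO`, `__GFP_FS`, `__GFP_DIRECT_RECLAIM`, `__GFP_KSWAPD_RECLAIM`, `__GFP_NORETRY` and `__GFP_NOWARN`) are hypothetical, as are the helper names `pcpu_backing_gfp_before()`, `pcpu_backing_gfp_after()` and `may_block()`. The one thing taken from the kernel is how the composite masks are built: GFP_NOIO is the reclaim bits alone, GFP_NOFS adds IO, and GFP_KERNEL adds IO and FS.

```c
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Hypothetical bit values; only the composition of the masks matters. */
#define GFP_BIT_IO             0x01u /* stands in for __GFP_IO */
#define GFP_BIT_FS             0x02u /* stands in for __GFP_FS */
#define GFP_BIT_DIRECT_RECLAIM 0x04u /* stands in for __GFP_DIRECT_RECLAIM */
#define GFP_BIT_KSWAPD_RECLAIM 0x08u /* stands in for __GFP_KSWAPD_RECLAIM */
#define GFP_BIT_NORETRY        0x10u /* stands in for __GFP_NORETRY */
#define GFP_BIT_NOWARN         0x20u /* stands in for __GFP_NOWARN */

#define GFP_RECLAIM (GFP_BIT_DIRECT_RECLAIM | GFP_BIT_KSWAPD_RECLAIM)
#define GFP_NOIO    (GFP_RECLAIM)
#define GFP_NOFS    (GFP_RECLAIM | GFP_BIT_IO)
#define GFP_KERNEL  (GFP_RECLAIM | GFP_BIT_IO | GFP_BIT_FS)

/* Whitelist before the patch: IO and FS bits survive for GFP_KERNEL callers. */
gfp_t pcpu_backing_gfp_before(gfp_t gfp)
{
	return gfp & (GFP_KERNEL | GFP_BIT_NORETRY | GFP_BIT_NOWARN);
}

/* Whitelist after the patch: IO and FS bits are always dropped. */
gfp_t pcpu_backing_gfp_after(gfp_t gfp)
{
	return gfp & (GFP_NOIO | GFP_BIT_NORETRY | GFP_BIT_NOWARN);
}

/* Blocking is decided from the caller's mask, not from the backing mask. */
bool may_block(gfp_t gfp)
{
	return (gfp & GFP_BIT_DIRECT_RECLAIM) != 0;
}
```

With these definitions a GFP_KERNEL caller still counts as blocking, but its backing allocations are reduced to GFP_NOIO, so they cannot enter IO or FS reclaim while pcpu_alloc_mutex is held. Modifier bits such as the NOWARN stand-in pass through unchanged.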