From nobody Thu Dec 18 14:17:34 2025
From: Oscar Salvador
To: Andrew Morton
Cc: Muchun Song, David Hildenbrand, James Houghton, Peter Xu, Gavin Guo, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Oscar Salvador
Subject: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
Date: Mon, 2 Jun 2025 16:16:08 +0200
Message-ID: <20250602141610.173698-2-osalvador@suse.de>
In-Reply-To: <20250602141610.173698-1-osalvador@suse.de>
References: <20250602141610.173698-1-osalvador@suse.de>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Recent events surfaced the fact that it is not clear why we are taking
certain locks in the hugetlb faulting path: specifically, pagecache_folio's
lock and the folio lock in hugetlb_fault, and the folio lock in
hugetlb_no_page. Digging into the history, I found that those locks were
taken to protect us against truncation, which back then was not serialized
with the hugetlb_fault_mutex as it is today.
For example, the lock in hugetlb_no_page, looking at the comment above:

	/*
	 * Use page lock to guard against racing truncation
	 * before we get page_table_lock.
	 */
	new_folio = false;
	folio = filemap_lock_hugetlb_folio(h, mapping, vmf->pgoff);

which was added by 'commit 4c8872659772 ("[PATCH] hugetlb: demand fault handler")'.

Back when that commit was added (2005), truncate_hugepages looked something like:

 truncate_hugepages
  lock_page(page)
   truncate_huge_page(page) <- it also removed it from the pagecache
  unlock_page(page)

While today we have

 remove_inode_hugepages
  mutex_lock(&hugetlb_fault_mutex_table[hash])
   remove_inode_single_folio
    folio_lock(folio)
     hugetlb_delete_from_page_cache
    folio_unlock(folio)
  mutex_unlock(&hugetlb_fault_mutex_table[hash])

And the same happened with the lock for pagecache_folio in hugetlb_fault(),
which was introduced in 'commit 04f2cbe35699 ("hugetlb: guarantee that COW
faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will
succeed")', when we did not have serialization against truncation yet.

The only worrisome part of dropping the locks that I considered is when
cow_from_owner is true and we cannot allocate another hugetlb page, because
then we drop all the locks, try to unmap the page from the other processes,
and then we re-take the locks:

	hash = hugetlb_fault_mutex_hash(mapping, idx);
	hugetlb_vma_unlock_read(vma);
	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
	unmap_ref_private(mm, vma, &old_folio->page, vmf->address);
	mutex_lock(&hugetlb_fault_mutex_table[hash]);
	hugetlb_vma_lock_read(vma);
	spin_lock(vmf->ptl);

So, there is a small window where we are not holding the lock. In this
window, someone might have truncated the file (aka pagecache_folio) and
called hugetlb_unreserve_pages(). But I do not think it matters, for the
following reasons:

1) We only retry in case the pte has not changed, which means that
   old_folio still is old_folio.

2) And if the original file got truncated in that window and reserves were
adjusted, alloc_hugetlb_folio() will catch this under the lock when we
retry again.

Another concern was brought up by James Houghton about the UFFDIO_CONTINUE
case, and whether we could end up mapping a hugetlb page which has not been
zeroed yet. But mfill_atomic_hugetlb(), which calls
hugetlb_mfill_atomic_pte(), holds the mutex throughout the operation, so we
cannot race with truncation/instantiation either. E.g.
hugetlbfs_fallocate() will allocate the new hugetlb folio and zero it under
the mutex. So, there should be no races.

Signed-off-by: Oscar Salvador
---
 include/linux/hugetlb.h | 12 +++++
 mm/hugetlb.c            | 98 ++++++++++++++++++-----------------------
 2 files changed, 55 insertions(+), 55 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 42f374e828a2..a176724e1204 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -811,6 +811,12 @@ static inline unsigned int blocks_per_huge_page(struct hstate *h)
 	return huge_page_size(h) / 512;
 }
 
+static inline struct folio *filemap_get_hugetlb_folio(struct hstate *h,
+				struct address_space *mapping, pgoff_t idx)
+{
+	return filemap_get_folio(mapping, idx << huge_page_order(h));
+}
+
 static inline struct folio *filemap_lock_hugetlb_folio(struct hstate *h,
 				struct address_space *mapping, pgoff_t idx)
 {
@@ -1088,6 +1094,12 @@ static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio *folio)
 	return NULL;
 }
 
+static inline struct folio *filemap_get_hugetlb_folio(struct hstate *h,
+				struct address_space *mapping, pgoff_t idx)
+{
+	return NULL;
+}
+
 static inline struct folio *filemap_lock_hugetlb_folio(struct hstate *h,
 				struct address_space *mapping, pgoff_t idx)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8746ed2fec13..f7bef660ef94 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6146,25 +6146,28 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 	i_mmap_unlock_write(mapping);
 }
 
+enum cow_context {
+	HUGETLB_FAULT_CONTEXT,
+	HUGETLB_NO_PAGE_CONTEXT,
+};
+
 /*
- * hugetlb_wp() should be called with page lock of the original hugepage held.
  * Called with hugetlb_fault_mutex_table held and pte_page locked so we
  * cannot race with other handlers or page migration.
  * Keep the pte_same checks anyway to make transition from the mutex easier.
  */
-static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
-		struct vm_fault *vmf)
+static vm_fault_t hugetlb_wp(struct vm_fault *vmf, enum cow_context context)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct mm_struct *mm = vma->vm_mm;
 	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
 	pte_t pte = huge_ptep_get(mm, vmf->address, vmf->pte);
 	struct hstate *h = hstate_vma(vma);
-	struct folio *old_folio;
-	struct folio *new_folio;
 	bool cow_from_owner = 0;
 	vm_fault_t ret = 0;
 	struct mmu_notifier_range range;
+	struct folio *old_folio, *new_folio, *pagecache_folio;
+	struct address_space *mapping = vma->vm_file->f_mapping;
 
 	/*
 	 * Never handle CoW for uffd-wp protected pages. It should be only
@@ -6198,7 +6201,11 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 	 * we run out of free hugetlb folios: we would have to kill processes
 	 * in scenarios that used to work. As a side effect, there can still
 	 * be leaks between processes, for example, with FOLL_GET users.
+	 *
+	 * We need to take the lock here because the folio might be undergoing a
+	 * migration. e.g: see folio_try_share_anon_rmap_pte.
	 */
+	folio_lock(old_folio);
 	if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
 		if (!PageAnonExclusive(&old_folio->page)) {
 			folio_move_anon_rmap(old_folio, vma);
@@ -6209,22 +6216,44 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 				vmf->pte);
 
 		delayacct_wpcopy_end();
+		folio_unlock(old_folio);
 		return 0;
 	}
 	VM_BUG_ON_PAGE(folio_test_anon(old_folio) &&
 		       PageAnonExclusive(&old_folio->page), &old_folio->page);
+	folio_unlock(old_folio);
+
 
+	/*
+	 * We can be called from two different contexts: hugetlb_no_page or
+	 * hugetlb_fault.
+	 * When called from the latter, we try to find the original folio in
+	 * the pagecache and compare it against the current folio we have mapped
+	 * in the pagetables. If it differs, it means that this process already
+	 * CoWed and mapped the folio privately, so we know that a reservation
+	 * was already consumed.
+	 * This will later be used to determine whether we need to unmap the folio
+	 * from other processes in case we fail to allocate a new folio.
+	 */
+	if (context == HUGETLB_FAULT_CONTEXT) {
+		pagecache_folio = filemap_get_hugetlb_folio(h, mapping, vmf->pgoff);
+		if (IS_ERR(pagecache_folio))
+			pagecache_folio = NULL;
+		else
+			folio_put(pagecache_folio);
+	}
 	/*
 	 * If the process that created a MAP_PRIVATE mapping is about to
 	 * perform a COW due to a shared page count, attempt to satisfy
 	 * the allocation without using the existing reserves. The pagecache
-	 * page is used to determine if the reserve at this address was
+	 * folio is used to determine if the reserve at this address was already
 	 * consumed or not. If reserves were used, a partial faulted mapping
 	 * at the time of fork() could consume its reserves on COW instead
 	 * of the full address range.
	 */
-	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
-	    old_folio != pagecache_folio)
+	if (context == HUGETLB_FAULT_CONTEXT &&
+	    is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
+	    old_folio != pagecache_folio)
 		cow_from_owner = true;
 
 	folio_get(old_folio);
@@ -6245,7 +6274,6 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 	 * may get SIGKILLed if it later faults.
 	 */
 	if (cow_from_owner) {
-		struct address_space *mapping = vma->vm_file->f_mapping;
 		pgoff_t idx;
 		u32 hash;
 
@@ -6451,11 +6479,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	}
 
 	/*
-	 * Use page lock to guard against racing truncation
-	 * before we get page_table_lock.
+	 * hugetlb_fault_mutex_table guards us against truncation,
+	 * so we do not need to take locks on the folio.
 	 */
 	new_folio = false;
-	folio = filemap_lock_hugetlb_folio(h, mapping, vmf->pgoff);
+	folio = filemap_get_hugetlb_folio(h, mapping, vmf->pgoff);
 	if (IS_ERR(folio)) {
 		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		if (vmf->pgoff >= size)
@@ -6521,6 +6549,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 		if (vma->vm_flags & VM_MAYSHARE) {
 			int err = hugetlb_add_to_page_cache(folio, mapping,
 							vmf->pgoff);
+			folio_unlock(folio);
 			if (err) {
 				/*
 				 * err can't be -EEXIST which implies someone
@@ -6537,7 +6566,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 			}
 			new_pagecache_folio = true;
 		} else {
-			folio_lock(folio);
 			anon_rmap = 1;
 		}
 	} else {
@@ -6603,7 +6631,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	hugetlb_count_add(pages_per_huge_page(h), mm);
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_wp(folio, vmf);
+		ret = hugetlb_wp(vmf, HUGETLB_NO_PAGE_CONTEXT);
 	}
 
 	spin_unlock(vmf->ptl);
@@ -6615,8 +6643,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	 */
 	if (new_folio)
 		folio_set_hugetlb_migratable(folio);
-
-	folio_unlock(folio);
out:
 	hugetlb_vma_unlock_read(vma);
 
@@ -6636,7 +6662,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	if (new_folio && !new_pagecache_folio)
 		restore_reserve_on_error(h, vma, vmf->address, folio);
 
-	folio_unlock(folio);
 	folio_put(folio);
 	goto out;
 }
@@ -6671,7 +6696,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	vm_fault_t ret;
 	u32 hash;
 	struct folio *folio = NULL;
-	struct folio *pagecache_folio = NULL;
 	struct hstate *h = hstate_vma(vma);
 	struct address_space *mapping;
 	int need_wait_lock = 0;
@@ -6780,11 +6804,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		/* Just decrements count, does not deallocate */
 		vma_end_reservation(h, vma, vmf.address);
-
-		pagecache_folio = filemap_lock_hugetlb_folio(h, mapping,
-								vmf.pgoff);
-		if (IS_ERR(pagecache_folio))
-			pagecache_folio = NULL;
 	}
 
 	vmf.ptl = huge_pte_lock(h, mm, vmf.pte);
@@ -6798,10 +6817,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(vmf.orig_pte)) {
 		if (!userfaultfd_wp_async(vma)) {
 			spin_unlock(vmf.ptl);
-			if (pagecache_folio) {
-				folio_unlock(pagecache_folio);
-				folio_put(pagecache_folio);
-			}
 			hugetlb_vma_unlock_read(vma);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			return handle_userfault(&vmf, VM_UFFD_WP);
@@ -6813,23 +6828,12 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Fallthrough to CoW */
 	}
 
-	/*
-	 * hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) and
-	 * pagecache_folio, so here we need take the former one
-	 * when folio != pagecache_folio or !pagecache_folio.
-	 */
 	folio = page_folio(pte_page(vmf.orig_pte));
-	if (folio != pagecache_folio)
-		if (!folio_trylock(folio)) {
-			need_wait_lock = 1;
-			goto out_ptl;
-		}
-
 	folio_get(folio);
 
 	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
 		if (!huge_pte_write(vmf.orig_pte)) {
-			ret = hugetlb_wp(pagecache_folio, &vmf);
+			ret = hugetlb_wp(&vmf, HUGETLB_FAULT_CONTEXT);
 			goto out_put_page;
 		} else if (likely(flags & FAULT_FLAG_WRITE)) {
 			vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
@@ -6840,16 +6844,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 						flags & FAULT_FLAG_WRITE))
 		update_mmu_cache(vma, vmf.address, vmf.pte);
out_put_page:
-	if (folio != pagecache_folio)
-		folio_unlock(folio);
 	folio_put(folio);
out_ptl:
 	spin_unlock(vmf.ptl);
-
-	if (pagecache_folio) {
-		folio_unlock(pagecache_folio);
-		folio_put(pagecache_folio);
-	}
out_mutex:
 	hugetlb_vma_unlock_read(vma);
 
@@ -6861,15 +6858,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	vma_end_read(vma);
 
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-	/*
-	 * Generally it's safe to hold refcount during waiting page lock. But
-	 * here we just wait to defer the next page fault to avoid busy loop and
-	 * the page is not used after unlocked before returning from the current
-	 * page fault. So we are safe from accessing freed page, even if we wait
-	 * here without taking refcount.
-	 */
-	if (need_wait_lock)
-		folio_wait_locked(folio);
 	return ret;
 }
 
-- 
2.49.0

From nobody Thu Dec 18 14:17:34 2025
From: Oscar Salvador
To: Andrew Morton
Cc: Muchun Song, David Hildenbrand, James Houghton, Peter Xu, Gavin Guo, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Oscar Salvador
Subject: [RFC PATCH 2/3] mm, hugetlb: Update comments in hugetlb_fault
Date: Mon, 2 Jun 2025 16:16:09 +0200
Message-ID: <20250602141610.173698-3-osalvador@suse.de>
In-Reply-To: <20250602141610.173698-1-osalvador@suse.de>
References: <20250602141610.173698-1-osalvador@suse.de>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

There are some comments in the hugetlb faulting path that no longer hold
because the code has changed. The most prominent is this one:

	/*
	 * vmf.orig_pte could be a migration/hwpoison vmf.orig_pte at this
	 * point, so this check prevents the kernel from going below assuming
	 * that we have an active hugepage in pagecache. This goto expects
	 * the 2nd page fault, and is_hugetlb_entry_(migration|hwpoisoned)
	 * check will properly handle it.
	 */

This was written because back in the day we used to do:

 hugetlb_fault
  ptep = huge_pte_offset(...)
  if (ptep) {
    entry = huge_ptep_get(ptep)
    if (unlikely(is_hugetlb_entry_migration(entry))
      ...
    else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
      ...
  }
  ...
  /*
   * entry could be a migration/hwpoison entry at this point, so this
   * check prevents the kernel from going below assuming that we have
   * a active hugepage in pagecache. This goto expects the 2nd page fault,
   * and is_hugetlb_entry_(migration|hwpoisoned) check will properly
   * handle it.
   */
  if (!pte_present(entry))
    goto out_mutex;
  ...

The code was designed to check for hwpoisoned/migration entries upfront and
then, if further down the pte was no longer present, bail out and rely on
the second fault so the code at the beginning would handle it for real this
time. This has changed, so the comment no longer holds.

Also, mention in hugetlb_fault() that besides allocation and instantiation,
the mutex also serializes against truncation.

Signed-off-by: Oscar Salvador
---
 mm/hugetlb.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f7bef660ef94..6ef90958839f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6715,9 +6715,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	};
 
 	/*
-	 * Serialize hugepage allocation and instantiation, so that we don't
-	 * get spurious allocation failures if two CPUs race to instantiate
-	 * the same page in the page cache.
+	 * hugetlb_fault_mutex_hash serializes allocation, instantiation and
+	 * truncation.
	 */
 	mapping = vma->vm_file->f_mapping;
 	hash = hugetlb_fault_mutex_hash(mapping, vmf.pgoff);
@@ -6765,11 +6764,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	ret = 0;
 
 	/*
-	 * vmf.orig_pte could be a migration/hwpoison vmf.orig_pte at this
-	 * point, so this check prevents the kernel from going below assuming
-	 * that we have an active hugepage in pagecache. This goto expects
-	 * the 2nd page fault, and is_hugetlb_entry_(migration|hwpoisoned)
-	 * check will properly handle it.
+	 * If it is not present, it means we are dealing either with a migration
+	 * or hwpoisoned entry.
 	 */
 	if (!pte_present(vmf.orig_pte)) {
 		if (unlikely(is_hugetlb_entry_migration(vmf.orig_pte))) {
@@ -6793,8 +6789,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * If we are going to COW/unshare the mapping later, we examine the
 	 * pending reservations for this page now. This will ensure that any
 	 * allocations necessary to record that reservation occur outside the
-	 * spinlock. Also lookup the pagecache page now as it is used to
-	 * determine if a reservation has been consumed.
+	 * spinlock.
	 */
 	if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
 	    !(vma->vm_flags & VM_MAYSHARE) && !huge_pte_write(vmf.orig_pte)) {
-- 
2.49.0

From nobody Thu Dec 18 14:17:34 2025
From: Oscar Salvador
To: Andrew Morton
Cc: Muchun Song, David Hildenbrand, James Houghton, Peter Xu, Gavin Guo, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Oscar Salvador
Subject: [RFC PATCH 3/3] mm, hugetlb: Drop unlikelys from hugetlb_fault
Date: Mon, 2 Jun 2025 16:16:10 +0200
Message-ID: <20250602141610.173698-4-osalvador@suse.de>
In-Reply-To: <20250602141610.173698-1-osalvador@suse.de>
References: <20250602141610.173698-1-osalvador@suse.de>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

The unlikely hints date back to an era when we checked for
hwpoisoned/migration entries without knowing whether the pte was present.
The code flow has changed and we no longer do that, and the hints do not
bring any advantage at all.

Signed-off-by: Oscar Salvador
---
 mm/hugetlb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6ef90958839f..02ede63e7e5e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6768,7 +6768,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * or hwpoisoned entry.
 	 */
 	if (!pte_present(vmf.orig_pte)) {
-		if (unlikely(is_hugetlb_entry_migration(vmf.orig_pte))) {
+		if (is_hugetlb_entry_migration(vmf.orig_pte)) {
 			/*
 			 * Release the hugetlb fault lock now, but retain
 			 * the vma lock, because it is needed to guard the
@@ -6779,7 +6779,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			migration_entry_wait_huge(vma, vmf.address, vmf.pte);
 			return 0;
-		} else if (unlikely(is_hugetlb_entry_hwpoisoned(vmf.orig_pte)))
+		} else if (is_hugetlb_entry_hwpoisoned(vmf.orig_pte))
 			ret = VM_FAULT_HWPOISON_LARGE |
 			      VM_FAULT_SET_HINDEX(hstate_index(h));
 		goto out_mutex;
-- 
2.49.0