From nobody Sun May 24 23:32:08 2026
Subject: [PATCH v3 1/3] mm/hmm: move page fault handling out of walk callbacks
From: Stanislav Kinsburskii
To: Liam.Howlett@oracle.com, akpm@linux-foundation.org, david@kernel.org,
 jgg@ziepe.ca, corbet@lwn.net, leon@kernel.org, ljs@kernel.org,
 mhocko@suse.com, rppt@kernel.org, shuah@kernel.org,
 skhan@linuxfoundation.org, surenb@google.com, vbabka@kernel.org,
 skinsburskii@gmail.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-kselftest@vger.kernel.org, linux-mm@kvack.org
Date: Wed, 20 May 2026 07:09:26 -0700
Message-ID: <177928616608.589431.16936852852611203381.stgit@skinsburskii>
In-Reply-To: <177928604779.589431.14703161356676674288.stgit@skinsburskii>
References: <177928604779.589431.14703161356676674288.stgit@skinsburskii>
User-Agent: StGit/0.19

From: Stanislav Kinsburskii

hmm_range_fault() currently triggers page faults from inside the
page-table walk callbacks: hmm_vma_walk_pmd(), hmm_vma_walk_pud(),
hmm_vma_walk_hugetlb_entry() and the pte-level helper all call
hmm_vma_fault(), which in turn calls handle_mm_fault() while the walker
still holds nested locks.
The pte spinlock is dropped explicitly by each caller, and the hugetlb
path manually drops and retakes hugetlb_vma_lock_read around the fault
to dodge a deadlock against the walk framework's unconditional unlock.

This layering does not extend cleanly to fault handlers that may release
mmap_lock (VM_FAULT_RETRY, VM_FAULT_COMPLETED). If the lock is dropped
while walk_page_range() is mid-traversal, the VMA can be freed before the
walk framework's matching hugetlb_vma_unlock_read(), turning that unlock
into a use-after-free.

Split the responsibilities the way get_user_pages() does. Walk callbacks
become inspect-only: when they detect a range that needs to be faulted
in, they record it in struct hmm_vma_walk and return a private sentinel
(HMM_FAULT_PENDING). The outer loop in hmm_range_fault() then drops out
of walk_page_range(), invokes a new helper hmm_do_fault() that calls
handle_mm_fault() with only mmap_lock held, and restarts the walk so the
now-present entries are collected into hmm_pfns.

No functional change for existing callers. As a side effect the hugetlb
callback no longer needs the hugetlb_vma_{un}lock_read dance, and every
fault-path exit from the callbacks now releases the pte spinlock on a
single, common path.

This refactor is also a precursor for adding an unlockable variant of
hmm_range_fault() in a follow-up patch.

Signed-off-by: Stanislav Kinsburskii
---
 mm/hmm.c | 118 +++++++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 75 insertions(+), 43 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index c72c9ddfdb2f..446dd0c39b3a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -33,8 +33,17 @@
 struct hmm_vma_walk {
 	struct hmm_range	*range;
 	unsigned long		last;
+	unsigned long		end;
+	unsigned int		required_fault;
 };
 
+/*
+ * Internal sentinel returned by walk callbacks when they need a page fault.
+ * The callback stores end/required_fault in hmm_vma_walk; the outer loop
+ * consumes the sentinel and never propagates it to the caller.
+ */
+#define HMM_FAULT_PENDING	-EAGAIN
+
 enum {
 	HMM_NEED_FAULT = 1 << 0,
 	HMM_NEED_WRITE_FAULT = 1 << 1,
@@ -60,37 +69,25 @@ static int hmm_pfns_fill(unsigned long addr, unsigned long end,
 }
 
 /*
- * hmm_vma_fault() - fault in a range lacking valid pmd or pte(s)
- * @addr: range virtual start address (inclusive)
- * @end: range virtual end address (exclusive)
- * @required_fault: HMM_NEED_* flags
- * @walk: mm_walk structure
- * Return: -EBUSY after page fault, or page fault error
+ * hmm_record_fault() - record a range that needs to be faulted in
  *
- * This function will be called whenever pmd_none() or pte_none() returns true,
- * or whenever there is no page directory covering the virtual address range.
+ * Called by the walk callbacks when they discover that part of the range
+ * needs a page fault. The callback records what to fault and returns
+ * HMM_FAULT_PENDING; the outer loop in hmm_range_fault() drops back out of
+ * walk_page_range() and invokes handle_mm_fault() from a context where no
+ * page-table or hugetlb_vma_lock is held.
  */
-static int hmm_vma_fault(unsigned long addr, unsigned long end,
-			 unsigned int required_fault, struct mm_walk *walk)
+static int hmm_record_fault(unsigned long addr, unsigned long end,
+			    unsigned int required_fault,
+			    struct mm_walk *walk)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
-	struct vm_area_struct *vma = walk->vma;
-	unsigned int fault_flags = FAULT_FLAG_REMOTE;
 
 	WARN_ON_ONCE(!required_fault);
 	hmm_vma_walk->last = addr;
-
-	if (required_fault & HMM_NEED_WRITE_FAULT) {
-		if (!(vma->vm_flags & VM_WRITE))
-			return -EPERM;
-		fault_flags |= FAULT_FLAG_WRITE;
-	}
-
-	for (; addr < end; addr += PAGE_SIZE)
-		if (handle_mm_fault(vma, addr, fault_flags, NULL) &
-		    VM_FAULT_ERROR)
-			return -EFAULT;
-	return -EBUSY;
+	hmm_vma_walk->end = end;
+	hmm_vma_walk->required_fault = required_fault;
+	return HMM_FAULT_PENDING;
 }
 
 static unsigned int hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
@@ -174,7 +171,7 @@ static int hmm_vma_walk_hole(unsigned long addr, unsigned long end,
 		return hmm_pfns_fill(addr, end, range, HMM_PFN_ERROR);
 	}
 	if (required_fault)
-		return hmm_vma_fault(addr, end, required_fault, walk);
+		return hmm_record_fault(addr, end, required_fault, walk);
 	return hmm_pfns_fill(addr, end, range, 0);
 }
 
@@ -209,7 +206,7 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
 	required_fault =
 		hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, cpu_flags);
 	if (required_fault)
-		return hmm_vma_fault(addr, end, required_fault, walk);
+		return hmm_record_fault(addr, end, required_fault, walk);
 
 	pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
@@ -328,7 +325,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 fault:
 	pte_unmap(ptep);
 	/* Fault any virtual address we were asked to fault */
-	return hmm_vma_fault(addr, end, required_fault, walk);
+	return hmm_record_fault(addr, end, required_fault, walk);
 }
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
@@ -371,7 +368,7 @@ static int hmm_vma_handle_absent_pmd(struct mm_walk *walk, unsigned long start,
 					      npages, 0);
 	if (required_fault) {
 		if (softleaf_is_device_private(entry))
-			return hmm_vma_fault(addr, end, required_fault, walk);
+			return hmm_record_fault(addr, end, required_fault, walk);
 		else
 			return -EFAULT;
 	}
@@ -517,7 +514,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
 						      npages, cpu_flags);
 		if (required_fault) {
 			spin_unlock(ptl);
-			return hmm_vma_fault(addr, end, required_fault, walk);
+			return hmm_record_fault(addr, end, required_fault, walk);
 		}
 
 		pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
@@ -564,21 +561,8 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	required_fault =
 		hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
 	if (required_fault) {
-		int ret;
-
 		spin_unlock(ptl);
-		hugetlb_vma_unlock_read(vma);
-		/*
-		 * Avoid deadlock: drop the vma lock before calling
-		 * hmm_vma_fault(), which will itself potentially take and
-		 * drop the vma lock. This is also correct from a
-		 * protection point of view, because there is no further
-		 * use here of either pte or ptl after dropping the vma
-		 * lock.
-		 */
-		ret = hmm_vma_fault(addr, end, required_fault, walk);
-		hugetlb_vma_lock_read(vma);
-		return ret;
+		return hmm_record_fault(addr, end, required_fault, walk);
 	}
 
 	pfn = pte_pfn(entry) + ((start & ~hmask) >> PAGE_SHIFT);
@@ -637,6 +621,44 @@ static const struct mm_walk_ops hmm_walk_ops = {
 	.walk_lock	= PGWALK_RDLOCK,
 };
 
+/*
+ * hmm_do_fault - fault in a range recorded by a walk callback
+ *
+ * Called from the outer loop in hmm_range_fault() after a callback
+ * returned HMM_FAULT_PENDING. At this point we hold only mmap_lock;
+ * the page-table spinlock and any hugetlb_vma_lock acquired by the walk
+ * framework have already been released by the unwind.
+ *
+ * Returns -EBUSY on success (all pages faulted, caller should re-walk).
+ * Returns a negative errno on failure.
+ */
+static int hmm_do_fault(struct mm_struct *mm,
+			struct hmm_vma_walk *hmm_vma_walk)
+{
+	unsigned long addr = hmm_vma_walk->last;
+	unsigned long end = hmm_vma_walk->end;
+	unsigned int required_fault = hmm_vma_walk->required_fault;
+	unsigned int fault_flags = FAULT_FLAG_REMOTE;
+	struct vm_area_struct *vma;
+
+	vma = vma_lookup(mm, addr);
+	if (!vma)
+		return -EFAULT;
+
+	if (required_fault & HMM_NEED_WRITE_FAULT) {
+		if (!(vma->vm_flags & VM_WRITE))
+			return -EPERM;
+		fault_flags |= FAULT_FLAG_WRITE;
+	}
+
+	for (; addr < end; addr += PAGE_SIZE)
+		if (handle_mm_fault(vma, addr, fault_flags, NULL) &
+		    VM_FAULT_ERROR)
+			return -EFAULT;
+
+	return -EBUSY;
+}
+
 /**
  * hmm_range_fault - try to fault some address in a virtual address range
  * @range: argument structure
@@ -674,6 +696,16 @@ int hmm_range_fault(struct hmm_range *range)
 			return -EBUSY;
 		ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
 				      &hmm_walk_ops, &hmm_vma_walk);
+		/*
+		 * When HMM_FAULT_PENDING is returned a walk callback
+		 * recorded a range that needs handle_mm_fault();
+		 * hmm_do_fault() runs the fault outside walk_page_range()
+		 * (so no page-table or hugetlb_vma_lock is held) and
+		 * returns -EBUSY so the loop re-walks and picks up the
+		 * now-present entries.
+		 */
+		if (ret == HMM_FAULT_PENDING)
+			ret = hmm_do_fault(mm, &hmm_vma_walk);
 		/*
 		 * When -EBUSY is returned the loop restarts with
 		 * hmm_vma_walk.last set to an address that has not been stored

From nobody Sun May 24 23:32:08 2026
Subject: [PATCH v3 2/3] mm/hmm: add hmm_range_fault_unlockable() for mmap lock-drop support
From: Stanislav Kinsburskii
To: Liam.Howlett@oracle.com, akpm@linux-foundation.org, david@kernel.org,
 jgg@ziepe.ca, corbet@lwn.net, leon@kernel.org, ljs@kernel.org,
 mhocko@suse.com, rppt@kernel.org, shuah@kernel.org,
 skhan@linuxfoundation.org, surenb@google.com, vbabka@kernel.org,
 skinsburskii@gmail.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-kselftest@vger.kernel.org, linux-mm@kvack.org
Date: Wed, 20 May 2026 07:09:32 -0700
Message-ID: <177928617293.589431.17688544423323771407.stgit@skinsburskii>
In-Reply-To: <177928604779.589431.14703161356676674288.stgit@skinsburskii>
References: <177928604779.589431.14703161356676674288.stgit@skinsburskii>
User-Agent: StGit/0.19

From: Stanislav Kinsburskii

hmm_range_fault() holds the mmap read lock for the duration of the call.
This is incompatible with mappings whose fault handler may release the
mmap lock - notably userfaultfd-managed regions, where handle_mm_fault()
returns VM_FAULT_RETRY or VM_FAULT_COMPLETED after dropping the lock.
Drivers that need to populate device page tables for such mappings have
no way to do so today.

Add hmm_range_fault_unlockable(), modelled on the int *locked pattern
from get_user_pages_remote() in mm/gup.c. Callers set *locked = 1 and
pass &locked; the function may set *locked = 0 to report that
handle_mm_fault() dropped the mmap lock during a page fault, in which
case the caller must reacquire it and restart the walk with a fresh
mmu_interval_read_begin() sequence.

The implementation is local to hmm_do_fault() and the outer loop in
hmm_range_fault_unlockable(). hmm_do_fault() conditionally sets
FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE when locked is non-NULL and
translates VM_FAULT_RETRY / VM_FAULT_COMPLETED into *locked = 0 plus a
private return code consumed by the outer loop, which in turn returns 0
(or -EINTR on fatal signal) to the caller.

The previous refactor that moved page fault handling out of the
page-table walk callbacks is what makes this change small. Faults now
run after walk_page_range() has unwound, with only the mmap lock held,
so dropping it does not interact with the walker's pte spinlock or
hugetlb_vma_lock. Hugetlb regions therefore participate in the
unlockable path uniformly with PTE- and PMD-level mappings; no special
case is required.

hmm_range_fault() becomes a thin wrapper, preserving exact behaviour for
all existing callers. No EXPORT_SYMBOL behaviour change for
hmm_range_fault.

Documentation/mm/hmm.rst is updated with a description of the new API
and the recommended caller pattern.
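
As a rough illustration of the control flow described above, here is a
userspace toy model. It is only a sketch: every toy_* name, the page
array and the drops_lock flag are hypothetical stand-ins invented for
this example, and nothing in it is kernel code or the patch's actual
implementation. It assumes a fault that drops the lock still populates
the page (roughly the VM_FAULT_COMPLETED case). The walk only inspects
and records, the fault runs outside the walk, and a dropped lock is
reported through *locked so the caller restarts from the beginning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for this sketch; none of these exist in the kernel. */
enum { TOY_NPAGES = 4 };
enum { TOY_EBUSY = -1, TOY_FAULT_PENDING = -2, TOY_FAULT_UNLOCKED = -3 };

struct toy_walk {
	bool present[TOY_NPAGES];    /* pages that are already populated */
	bool drops_lock[TOY_NPAGES]; /* pages whose fault would drop the lock */
	size_t last;                 /* first page not yet collected */
	size_t pending;              /* page recorded by the walk for faulting */
	int *locked;                 /* caller's lock state, may be NULL */
};

/* Inspect-only walk: never faults, only records the first absent page. */
static int toy_walk_range(struct toy_walk *w)
{
	for (size_t i = w->last; i < TOY_NPAGES; i++) {
		if (!w->present[i]) {
			w->last = i;
			w->pending = i;
			return TOY_FAULT_PENDING;
		}
	}
	w->last = TOY_NPAGES;
	return 0;
}

/* Runs outside the walk; reports a dropped lock only when locked != NULL. */
static int toy_do_fault(struct toy_walk *w)
{
	size_t i = w->pending;

	w->present[i] = true;
	if (w->locked && w->drops_lock[i]) {
		*w->locked = 0;
		return TOY_FAULT_UNLOCKED;
	}
	return TOY_EBUSY; /* success: ask the outer loop to re-walk */
}

/* Outer loop: consumes both private sentinels, never returns them. */
static int toy_range_fault_unlockable(struct toy_walk *w, int *locked)
{
	int ret;

	w->locked = locked;
	w->last = 0;
	do {
		ret = toy_walk_range(w);
		if (ret == TOY_FAULT_PENDING) {
			ret = toy_do_fault(w);
			if (ret == TOY_FAULT_UNLOCKED)
				return 0; /* caller must relock and restart */
		}
	} while (ret == TOY_EBUSY);
	return ret;
}

/* Caller pattern: retry until a call finishes with the lock still held. */
static int toy_populate(struct toy_walk *w, int *restarts)
{
	int locked, ret;

	*restarts = 0;
	for (;;) {
		locked = 1; /* stands in for taking the mmap read lock */
		ret = toy_range_fault_unlockable(w, &locked);
		if (ret)
			return ret;
		if (locked)
			return 0; /* results are usable only in this case */
		(*restarts)++;
	}
}
```

In this model a return value of 0 alone is not enough for the caller:
it must also check locked, because 0 with locked == 0 means "restart",
which mirrors the caller pattern the patch documents.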

Signed-off-by: Stanislav Kinsburskii
---
 Documentation/mm/hmm.rst | 62 +++++++++++++++++++++++++++++++++++++
 include/linux/hmm.h      |  1 +
 mm/hmm.c                 | 77 +++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 135 insertions(+), 5 deletions(-)

diff --git a/Documentation/mm/hmm.rst b/Documentation/mm/hmm.rst
index 7d61b7a8b65b..a9309023ec23 100644
--- a/Documentation/mm/hmm.rst
+++ b/Documentation/mm/hmm.rst
@@ -208,6 +208,68 @@ invalidate() callback. That lock must be held before calling
 mmu_interval_read_retry() to avoid any race with a concurrent CPU page table
 update.
 
+Dropping the mmap lock during page faults
+=========================================
+
+Some VMAs have fault handlers that need to release the mmap lock while
+servicing a fault (for example, regions managed by ``userfaultfd``).
+``hmm_range_fault()`` cannot be used on such mappings because it must hold the
+mmap lock for the duration of the call. Drivers that need to support them
+should call::
+
+  int hmm_range_fault_unlockable(struct hmm_range *range, int *locked);
+
+The caller sets ``*locked = 1`` and holds ``mmap_read_lock`` before the call.
+If the mmap lock is dropped inside ``handle_mm_fault()``, the function sets
+``*locked = 0`` and returns ``0``; the caller is responsible for reacquiring
+the lock and restarting the walk from ``range->start`` with a fresh notifier
+sequence. When ``locked`` is ``NULL`` the function keeps the lock held for the
+duration of the call, identical to ``hmm_range_fault()``.
+
+A typical caller looks like this::
+
+ int driver_populate_range_unlockable(...)
+ {
+      struct hmm_range range;
+      int locked;
+      ...
+
+      range.notifier = &interval_sub;
+      range.start = ...;
+      range.end = ...;
+      range.hmm_pfns = ...;
+
+      if (!mmget_not_zero(interval_sub.mm))
+          return -EFAULT;
+
+ again:
+      range.notifier_seq = mmu_interval_read_begin(&interval_sub);
+      locked = 1;
+      mmap_read_lock(mm);
+      ret = hmm_range_fault_unlockable(&range, &locked);
+      if (locked)
+          mmap_read_unlock(mm);
+      if (ret) {
+          if (ret == -EBUSY)
+              goto again;
+          return ret;
+      }
+      if (!locked)
+          goto again;
+
+      take_lock(driver->update);
+      if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
+          release_lock(driver->update);
+          goto again;
+      }
+
+      /* Use pfns array content to update device page table,
+       * under the update lock */
+
+      release_lock(driver->update);
+      return 0;
+ }
+
 Leverage default_flags and pfn_flags_mask
 =========================================
 
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index db75ffc949a7..46e581865c48 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -123,6 +123,7 @@ struct hmm_range {
  * Please see Documentation/mm/hmm.rst for how to use the range API.
  */
 int hmm_range_fault(struct hmm_range *range);
+int hmm_range_fault_unlockable(struct hmm_range *range, int *locked);
 
 /*
  * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
diff --git a/mm/hmm.c b/mm/hmm.c
index 446dd0c39b3a..b9ced5003e16 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -32,6 +32,7 @@
 
 struct hmm_vma_walk {
 	struct hmm_range	*range;
+	int			*locked;
 	unsigned long		last;
 	unsigned long		end;
 	unsigned int		required_fault;
@@ -44,6 +45,13 @@ struct hmm_vma_walk {
  */
 #define HMM_FAULT_PENDING	-EAGAIN
 
+/*
+ * Internal sentinel returned by hmm_do_fault() when handle_mm_fault() drops
+ * the mmap lock during a page fault. hmm_do_fault() sets *locked = 0; the
+ * outer loop consumes the sentinel and never propagates it to the caller.
+ */
+#define HMM_FAULT_UNLOCKED	-ENOLCK
+
 enum {
 	HMM_NEED_FAULT = 1 << 0,
 	HMM_NEED_WRITE_FAULT = 1 << 1,
@@ -639,6 +647,7 @@ static int hmm_do_fault(struct mm_struct *mm,
 	unsigned long end = hmm_vma_walk->end;
 	unsigned int required_fault = hmm_vma_walk->required_fault;
 	unsigned int fault_flags = FAULT_FLAG_REMOTE;
+	int *locked = hmm_vma_walk->locked;
 	struct vm_area_struct *vma;
 
 	vma = vma_lookup(mm, addr);
@@ -651,10 +660,20 @@ static int hmm_do_fault(struct mm_struct *mm,
 		fault_flags |= FAULT_FLAG_WRITE;
 	}
 
-	for (; addr < end; addr += PAGE_SIZE)
-		if (handle_mm_fault(vma, addr, fault_flags, NULL) &
-		    VM_FAULT_ERROR)
+	if (locked)
+		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+
+	for (; addr < end; addr += PAGE_SIZE) {
+		vm_fault_t ret;
+
+		ret = handle_mm_fault(vma, addr, fault_flags, NULL);
+		if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
+			*locked = 0;
+			return HMM_FAULT_UNLOCKED;
+		}
+		if (ret & VM_FAULT_ERROR)
 			return -EFAULT;
+	}
 
 	return -EBUSY;
 }
@@ -677,11 +696,53 @@ static int hmm_do_fault(struct mm_struct *mm,
  *
  * This is similar to get_user_pages(), except that it can read the page tables
  * without mutating them (ie causing faults).
+ *
+ * The mmap lock must be held by the caller and will remain held on return.
+ * For a variant that allows the mmap lock to be dropped during faults (e.g.,
+ * for userfaultfd support), see hmm_range_fault_unlockable().
  */
 int hmm_range_fault(struct hmm_range *range)
+{
+	return hmm_range_fault_unlockable(range, NULL);
+}
+EXPORT_SYMBOL(hmm_range_fault);
+
+/**
+ * hmm_range_fault_unlockable - fault in a range, possibly dropping the mmap lock
+ * @range: argument structure
+ * @locked: pointer to caller's lock state, or %NULL
+ *
+ * Behaves like hmm_range_fault(), but allows handle_mm_fault() to drop the
+ * mmap read lock during a fault. This makes the function usable on mappings
+ * whose fault path may release the lock (for example, userfaultfd-managed
+ * regions).
+ *
+ * If @locked is %NULL the mmap lock is never released and the function
+ * behaves exactly like hmm_range_fault().
+ *
+ * If @locked is non-%NULL the caller must hold mmap_read_lock and set
+ * *@locked = 1 before the call. On return:
+ *
+ *   *@locked == 1: the mmap lock is still held. The return value has the
+ *                  same meaning as hmm_range_fault() (0 on success, or one
+ *                  of the error codes documented there).
+ *
+ *   *@locked == 0: the mmap lock was dropped during a page fault. No PFNs
+ *                  collected so far are guaranteed to be valid because the
+ *                  address space may have changed under us. The return
+ *                  value is either 0 (caller must reacquire the lock and
+ *                  restart with a fresh mmu_interval_read_begin()) or
+ *                  -EINTR (a fatal signal is pending; abort).
+ *
+ * The caller is responsible for reacquiring mmap_read_lock and restarting
+ * the operation from range->start. See Documentation/mm/hmm.rst for the
+ * full usage pattern.
+ */
+int hmm_range_fault_unlockable(struct hmm_range *range, int *locked)
 {
 	struct hmm_vma_walk hmm_vma_walk = {
 		.range = range,
+		.locked = locked,
 		.last = range->start,
 	};
 	struct mm_struct *mm = range->notifier->mm;
@@ -704,8 +765,14 @@ int hmm_range_fault(struct hmm_range *range)
 		 * returns -EBUSY so the loop re-walks and picks up the
 		 * now-present entries.
*/ - if (ret =3D=3D HMM_FAULT_PENDING) + if (ret =3D=3D HMM_FAULT_PENDING) { ret =3D hmm_do_fault(mm, &hmm_vma_walk); + if (ret =3D=3D HMM_FAULT_UNLOCKED) { + if (fatal_signal_pending(current)) + return -EINTR; + return 0; /* caller must restart */ + } + } /* * When -EBUSY is returned the loop restarts with * hmm_vma_walk.last set to an address that has not been stored @@ -715,7 +782,7 @@ int hmm_range_fault(struct hmm_range *range) } while (ret =3D=3D -EBUSY); return ret; } -EXPORT_SYMBOL(hmm_range_fault); +EXPORT_SYMBOL(hmm_range_fault_unlockable); =20 /** * hmm_dma_map_alloc - Allocate HMM map structure From nobody Sun May 24 23:32:08 2026 Received: from mail-pf1-f175.google.com (mail-pf1-f175.google.com [209.85.210.175]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 043B93E8358 for ; Wed, 20 May 2026 14:09:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.175 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779286183; cv=none; b=VjIz7E6LGSbnHqTGxe+amT5nxYp2LBYk2QqrL+MgbeQu+kZMf4TVYfcTq5n+D27LCzyCLwkKT8eGNHjPloHXIE93Y3LklrrvztEr0Oe/twJw0G2zoGyKgX1f05mJOuP9D9puEjcB+gRImamRNt0xvRIfIp9ehzRnjztuAvuv+cY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779286183; c=relaxed/simple; bh=F763zJ4OjSKiDVQ6c/m/Qp2IxXmgW/N6jnghpSR7+5Y=; h=Subject:From:To:Cc:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=j7xD748RZzHUYR3aHQl9AHAOuY+j3TiQrI3IuSB/59gZexPi9kcu2ZhHATToX9A4uT8EB0gNlkUezHGhB7VzB4Z7zrhgFj8Xrp0Mm95o5NVwrTaG9sSHcpxhcn9zWSc8RnCQjgugkU8v400L8V7/j3N0KKm2PZv1G/mec7Z7Bxs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=DdRAkORP; arc=none smtp.client-ip=209.85.210.175 
Subject: [PATCH v3 3/3] selftests/mm: add userfaultfd test for HMM unlockable path
From: Stanislav Kinsburskii
To: Liam.Howlett@oracle.com, akpm@linux-foundation.org, akpm@linux-foundation.org, david@kernel.org, jgg@ziepe.ca, corbet@lwn.net, leon@kernel.org, ljs@kernel.org, mhocko@suse.com, rppt@kernel.org, shuah@kernel.org, skhan@linuxfoundation.org, surenb@google.com, vbabka@kernel.org, skinsburskii@gmail.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org
Date: Wed, 20 May 2026 07:09:39 -0700
Message-ID: <177928617968.589431.774568933682869957.stgit@skinsburskii>
In-Reply-To: <177928604779.589431.14703161356676674288.stgit@skinsburskii>
References: <177928604779.589431.14703161356676674288.stgit@skinsburskii>
User-Agent: StGit/0.19
Precedence: bulk
X-Mailing-List:
 linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Stanislav Kinsburskii

Add a selftest that exercises hmm_range_fault_unlockable() with a
userfaultfd-backed mapping. The test:

1. Creates an anonymous mmap region
2. Registers it with userfaultfd (UFFDIO_REGISTER_MODE_MISSING)
3. Spawns a handler thread that responds to page faults by filling
   pages with a known pattern (0xAB) via UFFDIO_COPY
4. Issues HMM_DMIRROR_READ_UNLOCKABLE to the test_hmm driver, which
   calls hmm_range_fault_unlockable() internally
5. Verifies the device read back the data provided by the userfaultfd
   handler

This requires changes to the test_hmm kernel module:
- New dmirror_range_fault_unlockable() that uses the new HMM API
- New dmirror_fault_unlockable() and dmirror_read_unlockable() wrappers
- New HMM_DMIRROR_READ_UNLOCKABLE ioctl (0x09)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 lib/test_hmm.c                         |  122 ++++++++++++++++++++++++++
 lib/test_hmm_uapi.h                    |    1
 tools/testing/selftests/mm/hmm-tests.c |  149 ++++++++++++++++++++++++++++++++
 3 files changed, 272 insertions(+)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 38996c4baa40..7cc7320c9494 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -389,6 +389,84 @@ static int dmirror_range_fault(struct dmirror *dmirror,
 	return ret;
 }
 
+static int dmirror_range_fault_unlockable(struct dmirror *dmirror,
+					  struct hmm_range *range)
+{
+	struct mm_struct *mm = dmirror->notifier.mm;
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	int locked;
+	int ret;
+
+	while (true) {
+		if (time_after(jiffies, timeout)) {
+			ret = -EBUSY;
+			goto out;
+		}
+
+		range->notifier_seq = mmu_interval_read_begin(range->notifier);
+		locked = 1;
+		mmap_read_lock(mm);
+		ret = hmm_range_fault_unlockable(range, &locked);
+		if (locked)
+			mmap_read_unlock(mm);
+		if (ret) {
+			if (ret == -EBUSY)
+				continue;
+			goto out;
+		}
+		if (!locked)
+			continue;
+
+		mutex_lock(&dmirror->mutex);
+		if (mmu_interval_read_retry(range->notifier,
+					    range->notifier_seq)) {
+			mutex_unlock(&dmirror->mutex);
+			continue;
+		}
+		break;
+	}
+
+	ret = dmirror_do_fault(dmirror, range);
+
+	mutex_unlock(&dmirror->mutex);
+out:
+	return ret;
+}
+
+static int dmirror_fault_unlockable(struct dmirror *dmirror,
+				    unsigned long start,
+				    unsigned long end, bool write)
+{
+	struct mm_struct *mm = dmirror->notifier.mm;
+	unsigned long addr;
+	unsigned long pfns[32];
+	struct hmm_range range = {
+		.notifier = &dmirror->notifier,
+		.hmm_pfns = pfns,
+		.pfn_flags_mask = 0,
+		.default_flags =
+			HMM_PFN_REQ_FAULT | (write ? HMM_PFN_REQ_WRITE : 0),
+		.dev_private_owner = dmirror->mdevice,
+	};
+	int ret = 0;
+
+	if (!mmget_not_zero(mm))
+		return -EFAULT;
+
+	for (addr = start; addr < end; addr = range.end) {
+		range.start = addr;
+		range.end = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
+
+		ret = dmirror_range_fault_unlockable(dmirror, &range);
+		if (ret)
+			break;
+	}
+
+	mmput(mm);
+	return ret;
+}
+
 static int dmirror_fault(struct dmirror *dmirror, unsigned long start,
 			 unsigned long end, bool write)
 {
@@ -488,6 +566,47 @@ static int dmirror_read(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
 	return ret;
 }
 
+static int dmirror_read_unlockable(struct dmirror *dmirror,
+				   struct hmm_dmirror_cmd *cmd)
+{
+	struct dmirror_bounce bounce;
+	unsigned long start, end;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	int ret;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	ret = dmirror_bounce_init(&bounce, start, size);
+	if (ret)
+		return ret;
+
+	while (1) {
+		mutex_lock(&dmirror->mutex);
+		ret = dmirror_do_read(dmirror, start, end, &bounce);
+		mutex_unlock(&dmirror->mutex);
+		if (ret != -ENOENT)
+			break;
+
+		start = cmd->addr + (bounce.cpages << PAGE_SHIFT);
+		ret = dmirror_fault_unlockable(dmirror, start, end, false);
+		if (ret)
+			break;
+		cmd->faults++;
+	}
+
+	if (ret == 0) {
+		if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
+				 bounce.size))
+			ret = -EFAULT;
+	}
+	cmd->cpages = bounce.cpages;
+	dmirror_bounce_fini(&bounce);
+	return ret;
+}
+
 static int dmirror_do_write(struct dmirror *dmirror, unsigned long start,
 			    unsigned long end, struct dmirror_bounce *bounce)
 {
@@ -1549,6 +1668,9 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
 		dmirror->flags = cmd.npages;
 		ret = 0;
 		break;
+	case HMM_DMIRROR_READ_UNLOCKABLE:
+		ret = dmirror_read_unlockable(dmirror, &cmd);
+		break;
 
 	default:
 		return -EINVAL;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index f94c6d457338..076df6df9227 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -38,6 +38,7 @@ struct hmm_dmirror_cmd {
 #define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x06, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_RELEASE		_IOWR('H', 0x07, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_FLAGS		_IOWR('H', 0x08, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_READ_UNLOCKABLE	_IOWR('H', 0x09, struct hmm_dmirror_cmd)
 
 #define HMM_DMIRROR_FLAG_FAIL_ALLOC	(1ULL << 0)
 
diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index e1c8a679a4cf..cc5525be74b0 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -27,6 +27,10 @@
 #include
 #include
 #include
+#include
+#include
+#include
+#include
 
 /*
  * This is a private UAPI to the kernel test module so it isn't exported
@@ -2853,4 +2857,149 @@ TEST_F_TIMEOUT(hmm, benchmark_thp_migration, 120)
 				&thp_results, &regular_results);
 	}
 }
+/*
+ * Test that HMM can fault in pages backed by userfaultfd using the
+ * hmm_range_fault_unlockable() path. This exercises the lock-drop retry
+ * logic in the HMM framework.
+ */
+struct uffd_thread_args {
+	int uffd;
+	int stop_fd;
+	void *page_buffer;
+	unsigned long page_size;
+};
+
+static void *uffd_handler_thread(void *arg)
+{
+	struct uffd_thread_args *args = arg;
+	struct uffd_msg msg;
+	struct uffdio_copy copy;
+	struct pollfd pollfd[2];
+	int ret;
+
+	pollfd[0].fd = args->uffd;
+	pollfd[0].events = POLLIN;
+	pollfd[1].fd = args->stop_fd;
+	pollfd[1].events = POLLIN;
+
+	while (1) {
+		ret = poll(pollfd, 2, -1);
+		if (ret <= 0)
+			break;
+		if (pollfd[1].revents)
+			break;
+		if (!(pollfd[0].revents & POLLIN))
+			break;
+
+		ret = read(args->uffd, &msg, sizeof(msg));
+		if (ret != sizeof(msg))
+			break;
+
+		if (msg.event != UFFD_EVENT_PAGEFAULT)
+			break;
+
+		/* Fill the page with a known pattern */
+		memset(args->page_buffer, 0xAB, args->page_size);
+
+		copy.dst = msg.arg.pagefault.address & ~(args->page_size - 1);
+		copy.src = (unsigned long)args->page_buffer;
+		copy.len = args->page_size;
+		copy.mode = 0;
+		copy.copy = 0;
+
+		ret = ioctl(args->uffd, UFFDIO_COPY, &copy);
+		if (ret < 0)
+			break;
+	}
+
+	return NULL;
+}
+
+TEST_F(hmm, userfaultfd_read)
+{
+	struct hmm_buffer *buffer;
+	struct uffd_thread_args uffd_args;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	unsigned char *ptr;
+	pthread_t thread;
+	int uffd;
+	int stop_fd;
+	int ret;
+	struct uffdio_api api;
+	struct uffdio_register reg;
+	uint64_t stop = 1;
+	ssize_t nwrite;
+
+	npages = 4;
+	size = npages << self->page_shift;
+
+	/* Create userfaultfd */
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	if (uffd < 0)
+		SKIP(return, "userfaultfd not available");
+
+	api.api = UFFD_API;
+	api.features = 0;
+	ret = ioctl(uffd, UFFDIO_API, &api);
+	ASSERT_EQ(ret, 0);
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Create anonymous mapping */
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   -1, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Register the region with userfaultfd */
+	reg.range.start = (unsigned long)buffer->ptr;
+	reg.range.len = size;
+	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
+	ret = ioctl(uffd, UFFDIO_REGISTER, &reg);
+	ASSERT_EQ(ret, 0);
+
+	/* Set up the handler thread */
+	uffd_args.uffd = uffd;
+	stop_fd = eventfd(0, EFD_CLOEXEC);
+	ASSERT_GE(stop_fd, 0);
+	uffd_args.stop_fd = stop_fd;
+	uffd_args.page_buffer = malloc(self->page_size);
+	ASSERT_NE(uffd_args.page_buffer, NULL);
+	uffd_args.page_size = self->page_size;
+
+	ret = pthread_create(&thread, NULL, uffd_handler_thread, &uffd_args);
+	ASSERT_EQ(ret, 0);
+
+	/*
+	 * Use the unlockable read path which allows the mmap lock to be
+	 * dropped during the fault, enabling userfaultfd resolution.
+	 */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ_UNLOCKABLE,
+			      buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Verify the device read the data filled by the uffd handler */
+	ptr = buffer->mirror;
+	for (i = 0; i < size; ++i)
+		ASSERT_EQ(ptr[i], (unsigned char)0xAB);
+
+	nwrite = write(stop_fd, &stop, sizeof(stop));
+	ASSERT_EQ(nwrite, sizeof(stop));
+	pthread_join(thread, NULL);
+	close(stop_fd);
+	free(uffd_args.page_buffer);
+	close(uffd);
+	hmm_buffer_free(buffer);
+}
+
 TEST_HARNESS_MAIN