From nobody Sat Feb 7 17:49:01 2026
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, David Hildenbrand, Andrew Morton,
 "Darrick J . Wong", John Hubbard, Jason Gunthorpe, Hugh Dickins
Subject: [PATCH v1 1/2] mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly
Date: Thu, 14 Mar 2024 17:12:59 +0100
Message-ID: <20240314161300.382526-2-david@redhat.com>
In-Reply-To: <20240314161300.382526-1-david@redhat.com>
References: <20240314161300.382526-1-david@redhat.com>

Darrick reports that in some cases where pread() would fail with -EIO and
mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.

While the madvise() call can be interrupted by a signal, this is not the
desired behavior. MADV_POPULATE_READ / MADV_POPULATE_WRITE should behave
like page faults in that case: fail and not retry forever.

A reproducer can be found at [1].

The reason is that __get_user_pages(), as called by
faultin_vma_page_range(), will not handle VM_FAULT_RETRY in a proper way:
it will simply return 0 when VM_FAULT_RETRY happened, making
madvise_populate()->faultin_vma_page_range() retry again and again, never
setting FOLL_TRIED->FAULT_FLAG_TRIED for __get_user_pages().

__get_user_pages_locked() does what we want, but duplicating that logic in
faultin_vma_page_range() feels wrong.

So let's use __get_user_pages_locked() instead, that will detect
VM_FAULT_RETRY and set FOLL_TRIED when retrying, making the fault handler
return VM_FAULT_SIGBUS (VM_FAULT_ERROR) at some point, propagating -EFAULT
from faultin_page() to __get_user_pages(), all the way to
madvise_populate().

But, there is an issue: __get_user_pages_locked() will end up re-taking
the MM lock and then __get_user_pages() will do another VMA lookup.
In the meantime, the VMA layout could have changed and we'd fail with
different error codes than we'd want to.

As __get_user_pages() will currently do a new VMA lookup either way, let
it do the VMA handling in a different way, controlled by a new
FOLL_MADV_POPULATE flag, effectively moving these checks from
madvise_populate() + faultin_page_range() in there.

With this change, Darrick's reproducer properly fails with -EFAULT, as
documented for MADV_POPULATE_READ / MADV_POPULATE_WRITE.

[1] https://lore.kernel.org/all/20240313171936.GN1927156@frogsfrogsfrogs/

Reported-by: Darrick J. Wong
Closes: https://lore.kernel.org/all/20240311223815.GW1927156@frogsfrogsfrogs/
Fixes: 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
Signed-off-by: David Hildenbrand
Tested-by: Darrick J. Wong
---
 mm/gup.c      | 54 ++++++++++++++++++++++++++++++---------------------
 mm/internal.h | 10 ++++++----
 mm/madvise.c  | 17 ++--------------
 3 files changed, 40 insertions(+), 41 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d5..f6d55635742f5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1206,6 +1206,22 @@ static long __get_user_pages(struct mm_struct *mm,
 
 		/* first iteration or cross vma bound */
 		if (!vma || start >= vma->vm_end) {
+			/*
+			 * MADV_POPULATE_(READ|WRITE) wants to handle VMA
+			 * lookups+error reporting differently.
+			 */
+			if (gup_flags & FOLL_MADV_POPULATE) {
+				vma = vma_lookup(mm, start);
+				if (!vma) {
+					ret = -ENOMEM;
+					goto out;
+				}
+				if (check_vma_flags(vma, gup_flags)) {
+					ret = -EINVAL;
+					goto out;
+				}
+				goto retry;
+			}
 			vma = gup_vma_lookup(mm, start);
 			if (!vma && in_gate_area(mm, start)) {
 				ret = get_gate_page(mm, start & PAGE_MASK,
@@ -1683,35 +1699,35 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 }
 
 /*
- * faultin_vma_page_range() - populate (prefault) page tables inside the
- *			      given VMA range readable/writable
+ * faultin_page_range() - populate (prefault) page tables inside the
+ *			  given range readable/writable
  *
  * This takes care of mlocking the pages, too, if VM_LOCKED is set.
  *
- * @vma: target vma
+ * @mm: the mm to populate page tables in
  * @start: start address
  * @end: end address
  * @write: whether to prefault readable or writable
  * @locked: whether the mmap_lock is still held
  *
- * Returns either number of processed pages in the vma, or a negative error
- * code on error (see __get_user_pages()).
+ * Returns either number of processed pages in the MM, or a negative error
+ * code on error (see __get_user_pages()). Note that this function reports
+ * errors related to VMAs, such as incompatible mappings, as expected by
+ * MADV_POPULATE_(READ|WRITE).
  *
- * vma->vm_mm->mmap_lock must be held. The range must be page-aligned and
- * covered by the VMA. If it's released, *@locked will be set to 0.
+ * The range must be page-aligned.
+ *
+ * mm->mmap_lock must be held. If it's released, *@locked will be set to 0.
  */
-long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
-			    unsigned long end, bool write, int *locked)
+long faultin_page_range(struct mm_struct *mm, unsigned long start,
+			unsigned long end, bool write, int *locked)
 {
-	struct mm_struct *mm = vma->vm_mm;
 	unsigned long nr_pages = (end - start) / PAGE_SIZE;
 	int gup_flags;
 	long ret;
 
 	VM_BUG_ON(!PAGE_ALIGNED(start));
 	VM_BUG_ON(!PAGE_ALIGNED(end));
-	VM_BUG_ON_VMA(start < vma->vm_start, vma);
-	VM_BUG_ON_VMA(end > vma->vm_end, vma);
 	mmap_assert_locked(mm);
 
 	/*
@@ -1723,19 +1739,13 @@ long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
 	 *		  a poisoned page.
 	 * !FOLL_FORCE: Require proper access permissions.
 	 */
-	gup_flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_UNLOCKABLE;
+	gup_flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_UNLOCKABLE |
+		    FOLL_MADV_POPULATE;
 	if (write)
 		gup_flags |= FOLL_WRITE;
 
-	/*
-	 * We want to report -EINVAL instead of -EFAULT for any permission
-	 * problems or incompatible mappings.
-	 */
-	if (check_vma_flags(vma, gup_flags))
-		return -EINVAL;
-
-	ret = __get_user_pages(mm, start, nr_pages, gup_flags,
-			       NULL, locked);
+	ret = __get_user_pages_locked(mm, start, nr_pages, NULL, locked,
+				      gup_flags);
 	lru_add_drain();
 	return ret;
 }
diff --git a/mm/internal.h b/mm/internal.h
index d1c69119b24fb..a57dd5156cf84 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -686,9 +686,8 @@ struct anon_vma *folio_anon_vma(struct folio *folio);
 void unmap_mapping_folio(struct folio *folio);
 extern long populate_vma_page_range(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end, int *locked);
-extern long faultin_vma_page_range(struct vm_area_struct *vma,
-				   unsigned long start, unsigned long end,
-				   bool write, int *locked);
+extern long faultin_page_range(struct mm_struct *mm, unsigned long start,
+		unsigned long end, bool write, int *locked);
 extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
 			       unsigned long bytes);
 
@@ -1127,10 +1126,13 @@ enum {
 	FOLL_FAST_ONLY = 1 << 20,
 	/* allow unlocking the mmap lock */
 	FOLL_UNLOCKABLE = 1 << 21,
+	/* VMA lookup+checks compatible with MADV_POPULATE_(READ|WRITE) */
+	FOLL_MADV_POPULATE = 1 << 22,
 };
 
 #define INTERNAL_GUP_FLAGS (FOLL_TOUCH | FOLL_TRIED | FOLL_REMOTE | FOLL_PIN | \
-			    FOLL_FAST_ONLY | FOLL_UNLOCKABLE)
+			    FOLL_FAST_ONLY | FOLL_UNLOCKABLE | \
+			    FOLL_MADV_POPULATE)
 
 /*
  * Indicates for which pages that are write-protected in the page table,
diff --git a/mm/madvise.c b/mm/madvise.c
index 44a498c94158c..1a073fcc4c0c0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -908,27 +908,14 @@ static long madvise_populate(struct vm_area_struct *vma,
 {
 	const bool write = behavior == MADV_POPULATE_WRITE;
 	struct mm_struct *mm = vma->vm_mm;
-	unsigned long tmp_end;
 	int locked = 1;
 	long pages;
 
 	*prev = vma;
 
 	while (start < end) {
-		/*
-		 * We might have temporarily dropped the lock. For example,
-		 * our VMA might have been split.
-		 */
-		if (!vma || start >= vma->vm_end) {
-			vma = vma_lookup(mm, start);
-			if (!vma)
-				return -ENOMEM;
-		}
-
-		tmp_end = min_t(unsigned long, end, vma->vm_end);
 		/* Populate (prefault) page tables readable/writable. */
-		pages = faultin_vma_page_range(vma, start, tmp_end, write,
-					       &locked);
+		pages = faultin_page_range(mm, start, end, write, &locked);
 		if (!locked) {
 			mmap_read_lock(mm);
 			locked = 1;
@@ -949,7 +936,7 @@ static long madvise_populate(struct vm_area_struct *vma,
 				pr_warn_once("%s: unhandled return value: %ld\n",
 					     __func__, pages);
 				fallthrough;
-			case -ENOMEM:
+			case -ENOMEM: /* No VMA or out of memory. */
 				return -ENOMEM;
 			}
 		}
-- 
2.43.2

From nobody Sat Feb 7 17:49:01 2026
From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, David Hildenbrand, Andrew Morton,
 "Darrick J . Wong", John Hubbard, Jason Gunthorpe, Hugh Dickins
Subject: [PATCH v1 2/2] mm/madvise: don't perform madvise VMA walk for MADV_POPULATE_(READ|WRITE)
Date: Thu, 14 Mar 2024 17:13:00 +0100
Message-ID: <20240314161300.382526-3-david@redhat.com>
In-Reply-To: <20240314161300.382526-1-david@redhat.com>
References: <20240314161300.382526-1-david@redhat.com>

We changed faultin_page_range() to no longer consume a VMA, because
faultin_page_range() might internally release the mm lock to look up
the VMA again -- required to cleanly handle VM_FAULT_RETRY. But
independent of that, __get_user_pages() will always look up the VMA
itself.

Now that we let __get_user_pages() just handle VMA checks in a way that
is suitable for MADV_POPULATE_(READ|WRITE), the VMA walk in madvise()
is just overhead. So let's just call madvise_populate()
on the full range instead.

There is one change in behavior: madvise_walk_vmas() would skip any VMA
holes, and if everything succeeded, it would return -ENOMEM after
processing all VMAs.

However, for MADV_POPULATE_(READ|WRITE) it's unlikely for the caller to
notice any difference: -ENOMEM might either indicate that there were VMA
holes or that populating page tables failed because there was not enough
memory. So it's unlikely that user space will notice the difference, and
that special handling likely only makes sense for some other madvise()
actions.

Further, we'd already fail with -ENOMEM early in the past if looking up the
VMA after dropping the MM lock failed because of concurrent VMA
modifications. So let's just keep it simple and avoid the madvise VMA
walk, and consistently fail early if we find a VMA hole.

Signed-off-by: David Hildenbrand
Tested-by: Darrick J. Wong
---
 mm/madvise.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 1a073fcc4c0c0..a2dd70c4a2e6b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -901,26 +901,19 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 		return -EINVAL;
 }
 
-static long madvise_populate(struct vm_area_struct *vma,
-			     struct vm_area_struct **prev,
-			     unsigned long start, unsigned long end,
-			     int behavior)
+static long madvise_populate(struct mm_struct *mm, unsigned long start,
+		unsigned long end, int behavior)
 {
 	const bool write = behavior == MADV_POPULATE_WRITE;
-	struct mm_struct *mm = vma->vm_mm;
 	int locked = 1;
 	long pages;
 
-	*prev = vma;
-
 	while (start < end) {
 		/* Populate (prefault) page tables readable/writable. */
 		pages = faultin_page_range(mm, start, end, write, &locked);
 		if (!locked) {
 			mmap_read_lock(mm);
 			locked = 1;
-			*prev = NULL;
-			vma = NULL;
 		}
 		if (pages < 0) {
 			switch (pages) {
@@ -1021,9 +1014,6 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 	case MADV_DONTNEED:
 	case MADV_DONTNEED_LOCKED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
-	case MADV_POPULATE_READ:
-	case MADV_POPULATE_WRITE:
-		return madvise_populate(vma, prev, start, end, behavior);
 	case MADV_NORMAL:
 		new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
 		break;
@@ -1425,8 +1415,16 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	end = start + len;
 
 	blk_start_plug(&plug);
-	error = madvise_walk_vmas(mm, start, end, behavior,
-				  madvise_vma_behavior);
+	switch (behavior) {
+	case MADV_POPULATE_READ:
+	case MADV_POPULATE_WRITE:
+		error = madvise_populate(mm, start, end, behavior);
+		break;
+	default:
+		error = madvise_walk_vmas(mm, start, end, behavior,
+					  madvise_vma_behavior);
+		break;
+	}
 	blk_finish_plug(&plug);
 	if (write)
 		mmap_write_unlock(mm);
-- 
2.43.2

From nobody Sat Feb 7 17:49:01 2026
Date: Sun, 17 Mar 2024 09:51:57 -0700
From: "Darrick J. Wong"
To: David Hildenbrand, djwong@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, fstests, xfs
Subject: [RFC PATCH] xfs_io: add linux madvise advice codes
Message-ID: <20240317165157.GE1927156@frogsfrogsfrogs>
References: <20240314161300.382526-1-david@redhat.com>
In-Reply-To: <20240314161300.382526-1-david@redhat.com>

From: Darrick J. Wong

Add all the Linux-specific madvise codes. We're going to need
MADV_POPULATE_READ for a regression test.

Signed-off-by: Darrick J. Wong
Tested-by: Darrick J. Wong
---
 configure.ac          |    1
 include/builddefs.in  |    1
 io/Makefile           |    4 ++
 io/madvise.c          |  111 +++++++++++++++++++++++++++++++++++++++++++++++++
 m4/package_libcdev.m4 |   17 ++++++++
 5 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/configure.ac b/configure.ac
index 3786e44db6fd..723bdca506d1 100644
--- a/configure.ac
+++ b/configure.ac
@@ -187,6 +187,7 @@ AC_CONFIG_SYSTEMD_SYSTEM_UNIT_DIR
 AC_CONFIG_CROND_DIR
 AC_CONFIG_UDEV_DIR
 AC_HAVE_BLKID_TOPO
+AC_HAVE_KERNEL_MADVISE_FLAGS
 
 if test "$enable_ubsan" = "yes" || test "$enable_ubsan" = "probe"; then
 	AC_PACKAGE_CHECK_UBSAN
diff --git a/include/builddefs.in b/include/builddefs.in
index 07428206da45..a04f3e70f19d 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -193,6 +193,7 @@ HAVE_O_TMPFILE = @have_o_tmpfile@
 HAVE_MKOSTEMP_CLOEXEC = @have_mkostemp_cloexec@
 USE_RADIX_TREE_FOR_INUMS = @use_radix_tree_for_inums@
 HAVE_FSVERITY_DESCR = @have_fsverity_descr@
+HAVE_KERNEL_MADVISE = @have_kernel_madvise@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall -Werror -Wextra -Wno-unused-parameter
 #	   -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
diff --git a/io/Makefile b/io/Makefile
index 6f903e3df9a7..ce39fda0e82a 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -84,6 +84,10 @@ ifeq ($(HAVE_GETFSMAP),yes)
 CFILES += fsmap.c
 endif
 
+ifeq ($(HAVE_KERNEL_MADVISE),yes)
+LCFLAGS += -DHAVE_KERNEL_MADVISE
+endif
+
 default: depend $(LTCOMMAND)
 
 include $(BUILDRULES)
diff --git a/io/madvise.c b/io/madvise.c
index 6e9c5b121d72..081666f403bb 100644
--- a/io/madvise.c
+++ b/io/madvise.c
@@ -9,6 +9,9 @@
 #include <sys/mman.h>
 #include "init.h"
 #include "io.h"
+#ifdef HAVE_KERNEL_MADVISE
+# include <asm/mman.h>
+#endif
 
 static cmdinfo_t madvise_cmd;
 
@@ -26,6 +29,47 @@ madvise_help(void)
 " -r -- expect random page references (POSIX_MADV_RANDOM)\n"
 " -s -- expect sequential page references (POSIX_MADV_SEQUENTIAL)\n"
 " -w -- will need these pages (POSIX_MADV_WILLNEED) [*]\n"
+"\n"
+"The following Linux-specific advise values are available:\n"
+#ifdef MADV_COLLAPSE
+" -c -- try to collapse range into transparent hugepages (MADV_COLLAPSE)\n"
+#endif
+#ifdef MADV_COLD
+" -D -- deactivate the range (MADV_COLD)\n"
+#endif
+#ifdef MADV_FREE
+" -f -- free the range (MADV_FREE)\n"
+#endif
+#ifdef MADV_HUGEPAGE
+" -h -- enable transparent hugepages (MADV_HUGEPAGE)\n"
+#endif
+#ifdef MADV_NOHUGEPAGE
+" -H -- disable transparent hugepages (MADV_NOHUGEPAGE)\n"
+#endif
+#ifdef MADV_MERGEABLE
+" -m -- mark the range mergeable (MADV_MERGEABLE)\n"
+#endif
+#ifdef MADV_UNMERGEABLE
+" -M -- mark the range unmergeable (MADV_UNMERGEABLE)\n"
+#endif
+#ifdef MADV_SOFT_OFFLINE
+" -o -- mark the range offline (MADV_SOFT_OFFLINE)\n"
+#endif
+#ifdef MADV_REMOVE
+" -p -- punch a hole in the file (MADV_REMOVE)\n"
+#endif
+#ifdef MADV_HWPOISON
+" -P -- poison the page cache (MADV_HWPOISON)\n"
+#endif
+#ifdef MADV_POPULATE_READ
+" -R -- prefault in the range for read (MADV_POPULATE_READ)\n"
+#endif
+#ifdef MADV_POPULATE_WRITE
+" -W -- prefault in the range for write (MADV_POPULATE_WRITE)\n"
+#endif
+#ifdef MADV_PAGEOUT
+" -X -- reclaim the range (MADV_PAGEOUT)\n"
+#endif
 " Notes:\n"
 " NORMAL sets the default readahead setting on the file.\n"
 " RANDOM sets the readahead setting on the file to zero.\n"
@@ -45,20 +89,85 @@ madvise_f(
 	int		advise = MADV_NORMAL, c;
 	size_t		blocksize, sectsize;
 
-	while ((c = getopt(argc, argv, "drsw")) != EOF) {
+	while ((c = getopt(argc, argv, "cdDfhHmMopPrRswWX")) != EOF) {
 		switch (c) {
+#ifdef MADV_COLLAPSE
+		case 'c':	/* collapse to thp */
+			advise = MADV_COLLAPSE;
+			break;
+#endif
 		case 'd':	/* Don't need these pages */
 			advise = MADV_DONTNEED;
 			break;
+#ifdef MADV_COLD
+		case 'D':	/* make more likely to be reclaimed */
+			advise = MADV_COLD;
+			break;
+#endif
+#ifdef MADV_FREE
+		case 'f':	/* page range out of memory */
+			advise = MADV_FREE;
+			break;
+#endif
+#ifdef MADV_HUGEPAGE
+		case 'h':	/* enable thp memory */
+			advise = MADV_HUGEPAGE;
+			break;
+#endif
+#ifdef MADV_NOHUGEPAGE
+		case 'H':	/* disable thp memory */
+			advise = MADV_NOHUGEPAGE;
+			break;
+#endif
+#ifdef MADV_MERGEABLE
+		case 'm':	/* enable merging */
+			advise = MADV_MERGEABLE;
+			break;
+#endif
+#ifdef MADV_UNMERGEABLE
+		case 'M':	/* disable merging */
+			advise = MADV_UNMERGEABLE;
+			break;
+#endif
+#ifdef MADV_SOFT_OFFLINE
+		case 'o':	/* offline */
+			advise = MADV_SOFT_OFFLINE;
+			break;
+#endif
+#ifdef MADV_REMOVE
+		case 'p':	/* punch hole */
+			advise = MADV_REMOVE;
+			break;
+#endif
+#ifdef MADV_HWPOISON
+		case 'P':	/* poison */
+			advise = MADV_HWPOISON;
+			break;
+#endif
 		case 'r':	/* Expect random page references */
 			advise = MADV_RANDOM;
 			break;
+#ifdef MADV_POPULATE_READ
+		case 'R':	/* fault in pages for read */
+			advise = MADV_POPULATE_READ;
+			break;
+#endif
 		case 's':	/* Expect sequential page references */
 			advise = MADV_SEQUENTIAL;
 			break;
 		case 'w':	/* Will need these pages */
 			advise = MADV_WILLNEED;
 			break;
+#ifdef MADV_POPULATE_WRITE
+		case 'W':	/* fault in pages for write */
+			advise = MADV_POPULATE_WRITE;
+			break;
+#endif
+#ifdef MADV_PAGEOUT
+		case 'X':	/* reclaim memory */
+			advise = MADV_PAGEOUT;
+			break;
+#endif
 		default:
 			exitcode = 1;
 			return command_usage(&madvise_cmd);
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index 84f288dfcfdb..064d050b2b55 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -322,3 +322,20 @@ struct fsverity_descriptor m = { };
     AC_SUBST(have_fsverity_descr)
   ])
 
+#
+# Check if asm/mman.h can be included
+#
+AC_DEFUN([AC_HAVE_KERNEL_MADVISE_FLAGS],
+  [ AC_MSG_CHECKING([for kernel madvise flags in asm/mman.h ])
+    AC_COMPILE_IFELSE(
+    [	AC_LANG_PROGRAM([[
+#include <asm/mman.h>
+	]], [[
+int moo = MADV_COLLAPSE;
+	]])
+    ], have_kernel_madvise=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_kernel_madvise)
+  ])
+
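
For reference, the user-visible behavior discussed in this thread can be
exercised from user space with a few lines of C. The following is only a
minimal sketch and not part of any patch above. The names populate_status,
classify_populate_errno() and prefault_range() are hypothetical. The
fallback values 22 and 23 are the Linux uapi values of MADV_POPULATE_READ
and MADV_POPULATE_WRITE, assumed here for builds against older libc
headers. The error mapping only restates what the commit messages and code
comments above say: EINVAL for incompatible mappings, ENOMEM for VMA holes
or memory shortage, EFAULT where a plain access would have raised SIGBUS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks for old libc headers; values as in the Linux uapi headers. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/* Hypothetical classification of MADV_POPULATE_(READ|WRITE) results. */
enum populate_status {
	POPULATE_OK,		/* page tables were populated */
	POPULATE_BAD_MAPPING,	/* EINVAL: incompatible mapping, or advice unsupported */
	POPULATE_NO_MEMORY,	/* ENOMEM: VMA hole or out of memory */
	POPULATE_FAULT,		/* EFAULT: a plain access would have raised SIGBUS */
	POPULATE_OTHER,		/* anything else, e.g. EINTR or EHWPOISON */
};

/*
 * Map a user-space errno value (0 for success) to a populate_status.
 * User space sees positive errno values, unlike the negative codes the
 * kernel code above returns internally.
 */
enum populate_status classify_populate_errno(int err)
{
	assert(err >= 0);

	switch (err) {
	case 0:
		return POPULATE_OK;
	case EINVAL:
		return POPULATE_BAD_MAPPING;
	case ENOMEM:
		return POPULATE_NO_MEMORY;
	case EFAULT:
		return POPULATE_FAULT;
	default:
		return POPULATE_OTHER;
	}
}

/* Hypothetical helper: prefault [addr, addr + len) for read or write. */
enum populate_status prefault_range(void *addr, size_t len, bool write)
{
	int advice = write ? MADV_POPULATE_WRITE : MADV_POPULATE_READ;

	if (madvise(addr, len, advice) == 0)
		return POPULATE_OK;
	return classify_populate_errno(errno);
}
```

A caller would typically mmap() a file, call prefault_range(addr, len,
false), and treat POPULATE_FAULT as the equivalent of the SIGBUS that a
plain access to the mapping would have raised. Kernels that predate
MADV_POPULATE_(READ|WRITE) reject the advice with EINVAL, so a result of
POPULATE_BAD_MAPPING is ambiguous there.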