From: David Howells
To: Jens Axboe, Al Viro, Christoph Hellwig
Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Christoph Hellwig, John Hubbard
Subject: [PATCH v13 06/12] iov_iter: Add a function to extract a page list from an iterator
Date: Thu, 9 Feb 2023 10:29:48 +0000
Message-Id: <20230209102954.528942-7-dhowells@redhat.com>
In-Reply-To: <20230209102954.528942-1-dhowells@redhat.com>
References: <20230209102954.528942-1-dhowells@redhat.com>

Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator.  The pages may be returned with a pin added or nothing,
depending on the type of iterator.

Add a second function, iov_iter_extract_will_pin(), to determine how the
cleanup should be done.

There are two cases:

 (1) ITER_IOVEC or ITER_UBUF iterator.
     Extracted pages will have pins (FOLL_PIN) obtained on them so that a
     concurrent fork() will forcibly copy the page; DMA is then done
     to/from the parent's buffer and is unavailable to/unaffected by the
     child process.  iov_iter_extract_will_pin() will return true for this
     case.  The caller should use something like unpin_user_page() to
     dispose of the page.

 (2) Any other sort of iterator.

     No refs or pins are obtained on the pages; the assumption is made
     that the caller will manage page retention.
     iov_iter_extract_will_pin() will return false.  The pages don't need
     additional disposal.

Signed-off-by: David Howells
Reviewed-by: Christoph Hellwig
cc: Al Viro
cc: John Hubbard
cc: David Hildenbrand
cc: Matthew Wilcox
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

Notes:
    ver #12)
     - ITER_PIPE is gone, so drop related bits.
     - Don't specify FOLL_PIN as that's implied by pin_user_pages_fast().

    ver #11)
     - Fix iov_iter_extract_kvec_pages() to include the offset into the page
       in the returned starting offset.
     - Use __bitwise for the extraction flags.

    ver #10)
     - Fix use of i->kvec in iov_iter_extract_bvec_pages() to be i->bvec.

    ver #9)
     - Rename iov_iter_extract_mode() to iov_iter_extract_will_pin() and make
       it return true/false not FOLL_PIN/0 as FOLL_PIN is going to be made
       private to mm/.
     - Change extract_flags to extraction_flags.

    ver #8)
     - It seems that all DIO is supposed to be done under FOLL_PIN now, and
       not FOLL_GET, so switch to only using pin_user_pages() for user-backed
       iters.
     - Wrap an argument in brackets in the iov_iter_extract_mode() macro.
     - Drop the extract_flags argument to iov_iter_extract_mode() for now
       [hch].

    ver #7)
     - Switch to passing in iter-specific flags rather than FOLL_* flags.
     - Drop the direction flags for now.
     - Use ITER_ALLOW_P2PDMA to request FOLL_PCI_P2PDMA.
     - Disallow use of ITER_ALLOW_P2PDMA with non-user-backed iter.
     - Add support for extraction from KVEC-type iters.
     - Use iov_iter_advance() rather than open-coding it.
     - Make BVEC- and KVEC-type skip over initial empty vectors.

    ver #6)
     - Add back the function to indicate the cleanup mode.
     - Drop the cleanup_mode return arg to iov_iter_extract_pages().
     - Pass FOLL_SOURCE/DEST_BUF in gup_flags.  Check this against the iter
       data_source.

    ver #4)
     - Use ITER_SOURCE/DEST instead of WRITE/READ.
     - Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.

    ver #3)
     - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
       to get/pin_user_pages_fast()[1].
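A minimal usage sketch for reviewers (not part of the patch): extract a
batch of pages, do the I/O, then let iov_iter_extract_will_pin() decide
whether anything needs unpinning.  The helper name dio_extract_example,
the fixed 16-page on-stack array and the zero extraction flags are
invented for illustration only:

	#include <linux/kernel.h>
	#include <linux/mm.h>
	#include <linux/uio.h>

	static ssize_t dio_extract_example(struct iov_iter *iter)
	{
		struct page *array[16], **pages = array;
		size_t offset;
		ssize_t len;

		/* *pages is non-NULL, so the caller-supplied array is filled in. */
		len = iov_iter_extract_pages(iter, &pages, 16 * PAGE_SIZE,
					     16, 0, &offset);
		if (len <= 0)
			return len;

		/* ... do the I/O to/from pages[], starting at offset ... */

		/* Only user-backed iterators leave pins behind. */
		if (iov_iter_extract_will_pin(iter))
			unpin_user_pages(pages, DIV_ROUND_UP(offset + len,
							     PAGE_SIZE));
		return len;
	}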
 include/linux/uio.h |  27 ++++-
 lib/iov_iter.c      | 264 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 290 insertions(+), 1 deletion(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index af70e4c9ea27..cf6658066736 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -347,9 +347,34 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 		.count = count
 	};
 }
-
 /* Flags for iov_iter_get/extract_pages*() */
 /* Allow P2PDMA on the extracted pages */
 #define ITER_ALLOW_P2PDMA	((__force iov_iter_extraction_t)0x01)
 
+ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
+			       size_t maxsize, unsigned int maxpages,
+			       iov_iter_extraction_t extraction_flags,
+			       size_t *offset0);
+
+/**
+ * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
+ * @iter: The iterator
+ *
+ * Examine the iterator and indicate by returning true or false as to how, if
+ * at all, pages extracted from the iterator will be retained by the extraction
+ * function.
+ *
+ * %true indicates that the pages will have a pin placed in them that the
+ * caller must unpin.  This must be done for DMA/async DIO to force fork()
+ * to forcibly copy a page for the child (the parent must retain the original
+ * page).
+ *
+ * %false indicates that no measures are taken and that it's up to the caller
+ * to retain the pages.
+ */
+static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
+{
+	return user_backed_iter(iter);
+}
+
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 34ee3764d0fa..8d34b6552179 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1487,3 +1487,267 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 	i->iov -= state->nr_segs - i->nr_segs;
 	i->nr_segs = state->nr_segs;
 }
+
+/*
+ * Extract a list of contiguous pages from an ITER_XARRAY iterator.  This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
+					     struct page ***pages, size_t maxsize,
+					     unsigned int maxpages,
+					     iov_iter_extraction_t extraction_flags,
+					     size_t *offset0)
+{
+	struct page *page, **p;
+	unsigned int nr = 0, offset;
+	loff_t pos = i->xarray_start + i->iov_offset;
+	pgoff_t index = pos >> PAGE_SHIFT;
+	XA_STATE(xas, i->xarray, index);
+
+	offset = pos & ~PAGE_MASK;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	rcu_read_lock();
+	for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+		if (xas_retry(&xas, page))
+			continue;
+
+		/* Has the page moved or been split? */
+		if (unlikely(page != xas_reload(&xas))) {
+			xas_reset(&xas);
+			continue;
+		}
+
+		p[nr++] = find_subpage(page, xas.xa_index);
+		if (nr == maxpages)
+			break;
+	}
+	rcu_read_unlock();
+
+	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_BVEC iterator.  This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   iov_iter_extraction_t extraction_flags,
+					   size_t *offset0)
+{
+	struct page **p, *page;
+	size_t skip = i->iov_offset, offset;
+	int k;
+
+	for (;;) {
+		if (i->nr_segs == 0)
+			return 0;
+		maxsize = min(maxsize, i->bvec->bv_len - skip);
+		if (maxsize)
+			break;
+		i->iov_offset = 0;
+		i->nr_segs--;
+		i->bvec++;
+		skip = 0;
+	}
+
+	skip += i->bvec->bv_offset;
+	page = i->bvec->bv_page + skip / PAGE_SIZE;
+	offset = skip % PAGE_SIZE;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+	for (k = 0; k < maxpages; k++)
+		p[k] = page + k;
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of virtually contiguous pages from an ITER_KVEC iterator.
+ * This does not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_kvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   iov_iter_extraction_t extraction_flags,
+					   size_t *offset0)
+{
+	struct page **p, *page;
+	const void *kaddr;
+	size_t skip = i->iov_offset, offset, len;
+	int k;
+
+	for (;;) {
+		if (i->nr_segs == 0)
+			return 0;
+		maxsize = min(maxsize, i->kvec->iov_len - skip);
+		if (maxsize)
+			break;
+		i->iov_offset = 0;
+		i->nr_segs--;
+		i->kvec++;
+		skip = 0;
+	}
+
+	kaddr = i->kvec->iov_base + skip;
+	offset = (unsigned long)kaddr & ~PAGE_MASK;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	kaddr -= offset;
+	len = offset + maxsize;
+	for (k = 0; k < maxpages; k++) {
+		size_t seg = min_t(size_t, len, PAGE_SIZE);
+
+		if (is_vmalloc_or_module_addr(kaddr))
+			page = vmalloc_to_page(kaddr);
+		else
+			page = virt_to_page(kaddr);
+
+		p[k] = page;
+		len -= seg;
+		kaddr += PAGE_SIZE;
+	}
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get a pin on
+ * each of them.  This should only be used if the iterator is user-backed
+ * (IOBUF/UBUF).
+ *
+ * It does not get refs on the pages, but the pages must be unpinned by the
+ * caller once the transfer is complete.
+ *
+ * This is safe to be used where background IO/DMA *is* going to be modifying
+ * the buffer; using a pin rather than a ref forces fork() to give the
+ * child a copy of the page.
+ */
+static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
+					   struct page ***pages,
+					   size_t maxsize,
+					   unsigned int maxpages,
+					   iov_iter_extraction_t extraction_flags,
+					   size_t *offset0)
+{
+	unsigned long addr;
+	unsigned int gup_flags = 0;
+	size_t offset;
+	int res;
+
+	if (i->data_source == ITER_DEST)
+		gup_flags |= FOLL_WRITE;
+	if (extraction_flags & ITER_ALLOW_P2PDMA)
+		gup_flags |= FOLL_PCI_P2PDMA;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = first_iovec_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/**
+ * iov_iter_extract_pages - Extract a list of contiguous pages from an iterator
+ * @i: The iterator to extract from
+ * @pages: Where to return the list of pages
+ * @maxsize: The maximum amount of iterator to extract
+ * @maxpages: The maximum size of the list of pages
+ * @extraction_flags: Flags to qualify request
+ * @offset0: Where to return the starting offset into (*@pages)[0]
+ *
+ * Extract a list of contiguous pages from the current point of the iterator,
+ * advancing the iterator.  The maximum number of pages and the maximum amount
+ * of page contents can be set.
+ *
+ * If *@pages is NULL, a page list will be allocated to the required size and
+ * *@pages will be set to its base.  If *@pages is not NULL, it will be assumed
+ * that the caller allocated a page list at least @maxpages in size and this
+ * will be filled in.
+ *
+ * @extraction_flags can have ITER_ALLOW_P2PDMA set to request peer-to-peer DMA
+ * be allowed on the pages extracted.
+ *
+ * The iov_iter_extract_will_pin() function can be used to query how cleanup
+ * should be performed.
+ *
+ * Extra refs or pins on the pages may be obtained as follows:
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF), pins will be
+ *      added to the pages, but refs will not be taken.
+ *      iov_iter_extract_will_pin() will return true.
+ *
+ *  (*) If the iterator is ITER_KVEC, ITER_BVEC or ITER_XARRAY, the pages are
+ *      merely listed; no extra refs or pins are obtained.
+ *      iov_iter_extract_will_pin() will return false.
+ *
+ * Note also:
+ *
+ *  (*) Use with ITER_DISCARD is not supported as that has no content.
+ *
+ * On success, the function sets *@pages to the new pagelist, if allocated, and
+ * sets *@offset0 to the offset into the first page.
+ *
+ * It may also return -ENOMEM and -EFAULT.
+ */
+ssize_t iov_iter_extract_pages(struct iov_iter *i,
+			       struct page ***pages,
+			       size_t maxsize,
+			       unsigned int maxpages,
+			       iov_iter_extraction_t extraction_flags,
+			       size_t *offset0)
+{
+	maxsize = min_t(size_t, min_t(size_t, maxsize, i->count), MAX_RW_COUNT);
+	if (!maxsize)
+		return 0;
+
+	if (likely(user_backed_iter(i)))
+		return iov_iter_extract_user_pages(i, pages, maxsize,
+						   maxpages, extraction_flags,
+						   offset0);
+	if (iov_iter_is_kvec(i))
+		return iov_iter_extract_kvec_pages(i, pages, maxsize,
+						   maxpages, extraction_flags,
+						   offset0);
+	if (iov_iter_is_bvec(i))
+		return iov_iter_extract_bvec_pages(i, pages, maxsize,
+						   maxpages, extraction_flags,
+						   offset0);
+	if (iov_iter_is_xarray(i))
+		return iov_iter_extract_xarray_pages(i, pages, maxsize,
+						     maxpages, extraction_flags,
+						     offset0);
+	return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(iov_iter_extract_pages);
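
For comparison, a rough sketch of the no-pin path (again not part of the
patch): extraction from a kernel-backed ITER_BVEC iterator.  Here
iov_iter_extract_will_pin() returns false, so the only cleanup is freeing
the page list that the extractor allocated because *pages was passed in as
NULL (assumed to be kvmalloc-backed, hence kvfree()).  The helper name
bvec_extract_example and the single-segment setup are invented for
illustration:

	#include <linux/bvec.h>
	#include <linux/mm.h>
	#include <linux/slab.h>
	#include <linux/uio.h>

	static ssize_t bvec_extract_example(struct page *page, unsigned int len)
	{
		struct bio_vec bv = {
			.bv_page   = page,
			.bv_len    = len,
			.bv_offset = 0,
		};
		struct iov_iter iter;
		struct page **pages = NULL;	/* let the extractor allocate the list */
		size_t offset;
		ssize_t ret;

		iov_iter_bvec(&iter, ITER_SOURCE, &bv, 1, len);

		ret = iov_iter_extract_pages(&iter, &pages, len, 1, 0, &offset);
		if (ret <= 0)
			return ret;

		/* ... use pages[0] + offset for up to ret bytes ... */

		if (iov_iter_extract_will_pin(&iter))	/* false for ITER_BVEC */
			unpin_user_pages(pages, 1);

		kvfree(pages);
		return ret;
	}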