RE: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO

Sooyong Suk posted 1 patch 11 months, 1 week ago
block/bio.c | 3 +++
1 file changed, 3 insertions(+)
RE: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
Posted by Sooyong Suk 11 months, 1 week ago
> On Fri, Mar 7, 2025 at 12:26 AM Christoph Hellwig <hch@infradead.org>
> wrote:
> >
> > On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> > > There are GUP references to pages that are serving as direct IO
> buffers.
> > > Those pages can be allocated from CMA pageblocks despite they can be
> > > pinned until the DIO is completed.
> >
> > direct I/O is eactly the case that is not FOLL_LONGTERM and one of the
> > reasons to even have the flag.  So big fat no to this.
> >
> 

Understood.

> Hello, thank you for your comment.
> We, Sooyong and I, wanted to get some opinions about this FOLL_LONGTERM
> for direct I/O as CMA memory got pinned pages which had been pinned from
> direct io.
> 
> > You also completely failed to address the relevant mailinglist and
> > maintainers.
> 
> I added block maintainer Jens Axboe and the block layer maillinst here,
> and added Suren and Sandeep, too.

Then, what do you think of using PF_MEMALLOC_PIN for this context as below?
This will only remove __GFP_MOVABLE from its allocation flag.
Since __bio_iov_iter_get_pages() indicates that it will pin user or kernel pages,
there seems to be no reason not to use this process flag.

block/bio.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 65c796ecb..671e28966 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1248,6 +1248,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	unsigned len, i = 0;
 	size_t offset;
 	int ret = 0;
+	unsigned int flags;
 
 	/*
 	 * Move page array up in the allocated memory for the bio vecs as far as
@@ -1267,9 +1268,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
+	flags = memalloc_pin_save();
 	size = iov_iter_extract_pages(iter, &pages,
 				      UINT_MAX - bio->bi_iter.bi_size,
 				      nr_pages, extraction_flags, &offset);
+	memalloc_pin_restore(flags);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;

Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
Posted by Suren Baghdasaryan 11 months, 1 week ago
On Thu, Mar 6, 2025 at 6:07 PM Sooyong Suk <s.suk@samsung.com> wrote:
>
> > On Fri, Mar 7, 2025 at 12:26 AM Christoph Hellwig <hch@infradead.org>
> > wrote:
> > >
> > > On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> > > > There are GUP references to pages that are serving as direct IO
> > buffers.
> > > > Those pages can be allocated from CMA pageblocks despite they can be
> > > > pinned until the DIO is completed.
> > >
> > > direct I/O is eactly the case that is not FOLL_LONGTERM and one of the
> > > reasons to even have the flag.  So big fat no to this.
> > >
> >
>
> Understood.
>
> > Hello, thank you for your comment.
> > We, Sooyong and I, wanted to get some opinions about this FOLL_LONGTERM
> > for direct I/O as CMA memory got pinned pages which had been pinned from
> > direct io.
> >
> > > You also completely failed to address the relevant mailinglist and
> > > maintainers.
> >
> > I added block maintainer Jens Axboe and the block layer maillinst here,
> > and added Suren and Sandeep, too.

I'm very far from being a block layer expert :)

>
> Then, what do you think of using PF_MEMALLOC_PIN for this context as below?
> This will only remove __GFP_MOVABLE from its allocation flag.
> Since __bio_iov_iter_get_pages() indicates that it will pin user or kernel pages,
> there seems to be no reason not to use this process flag.

I think this will help you only when the pages are faulted in but if
__get_user_pages() finds an already mapped page which happens to be
allocated from CMA, it will not migrate it. So, you might still end up
with unmovable pages inside CMA.

>
> block/bio.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/block/bio.c b/block/bio.c
> index 65c796ecb..671e28966 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1248,6 +1248,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>         unsigned len, i = 0;
>         size_t offset;
>         int ret = 0;
> +       unsigned int flags;
>
>         /*
>          * Move page array up in the allocated memory for the bio vecs as far as
> @@ -1267,9 +1268,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>          * result to ensure the bio's total size is correct. The remainder of
>          * the iov data will be picked up in the next bio iteration.
>          */
> +       flags = memalloc_pin_save();
>         size = iov_iter_extract_pages(iter, &pages,
>                                       UINT_MAX - bio->bi_iter.bi_size,
>                                       nr_pages, extraction_flags, &offset);
> +       memalloc_pin_restore(flags);
>         if (unlikely(size <= 0))
>                 return size ? size : -EFAULT;
>
>
Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
Posted by Christoph Hellwig 11 months ago
On Thu, Mar 06, 2025 at 06:28:40PM -0800, Suren Baghdasaryan wrote:
> I think this will help you only when the pages are faulted in but if
> __get_user_pages() finds an already mapped page which happens to be
> allocated from CMA, it will not migrate it. So, you might still end up
> with unmovable pages inside CMA.

Direct I/O pages are not unmovable.  They are temporarily pinned for
the duration of the direct I/O.

I really don't understand what problem you're trying to fix here.
Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
Posted by Suren Baghdasaryan 11 months ago
On Wed, Mar 12, 2025 at 8:17 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Mar 06, 2025 at 06:28:40PM -0800, Suren Baghdasaryan wrote:
> > I think this will help you only when the pages are faulted in but if
> > __get_user_pages() finds an already mapped page which happens to be
> > allocated from CMA, it will not migrate it. So, you might still end up
> > with unmovable pages inside CMA.
>
> Direct I/O pages are not unmovable.  They are temporarily pinned for
> the duration of the direct I/O.

Yes but even temporarily pinned pages can cause CMA allocation
failure. My point is that if we know beforehand that the pages will be
pinned we could avoid using CMA and these failures would go away.

>
> I really don't understand what problem you're trying to fix here.
>
Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
Posted by Christoph Hellwig 11 months ago
On Wed, Mar 12, 2025 at 08:20:36AM -0700, Suren Baghdasaryan wrote:
> > Direct I/O pages are not unmovable.  They are temporarily pinned for
> > the duration of the direct I/O.
> 
> Yes but even temporarily pinned pages can cause CMA allocation
> failure. My point is that if we know beforehand that the pages will be
> pinned we could avoid using CMA and these failures would go away.

Direct I/O (and other users of pin_user_pages) are designed to work
on all anonymous and file backed pages, which is kinda the point.
If you CMA user can't wait for the time of an I/O something is wrong
with that caller and it really should not use CMA.
Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
Posted by Suren Baghdasaryan 11 months ago
On Wed, Mar 12, 2025 at 8:25 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Mar 12, 2025 at 08:20:36AM -0700, Suren Baghdasaryan wrote:
> > > Direct I/O pages are not unmovable.  They are temporarily pinned for
> > > the duration of the direct I/O.
> >
> > Yes but even temporarily pinned pages can cause CMA allocation
> > failure. My point is that if we know beforehand that the pages will be
> > pinned we could avoid using CMA and these failures would go away.
>
> Direct I/O (and other users of pin_user_pages) are designed to work
> on all anonymous and file backed pages, which is kinda the point.
> If you CMA user can't wait for the time of an I/O something is wrong
> with that caller and it really should not use CMA.

I might be wrong but my understanding is that we should try to
allocate from CMA when the allocation is movable (not pinned), so that
CMA can move those pages if necessary. I understand that in some cases
a movable allocation can be pinned and we don't know beforehand
whether it will be pinned or not. But in this case we know it will
happen and could avoid this situation.

Yeah, low latency usecases for CMA are problematic and I think the
only current alternative (apart from solutions involving HW change) is
to use a memory carveouts. Device vendors hate that since carved-out
memory ends up poorly utilized. I'm working on a GCMA proposal which
hopefully can address that.

>
Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
Posted by Christoph Hellwig 11 months ago
On Wed, Mar 12, 2025 at 08:38:07AM -0700, Suren Baghdasaryan wrote:
> I might be wrong but my understanding is that we should try to
> allocate from CMA when the allocation is movable (not pinned), so that
> CMA can move those pages if necessary. I understand that in some cases
> a movable allocation can be pinned and we don't know beforehand
> whether it will be pinned or not. But in this case we know it will
> happen and could avoid this situation.

Any file or anonymous folio can be temporarily pinned for I/O and only
moved once that completes.  Direct I/O is one use case for that but there
are plenty others.  I'm not sure how you define "beforehand", but the
pinning is visible in the _pincount field.

> Yeah, low latency usecases for CMA are problematic and I think the
> only current alternative (apart from solutions involving HW change) is
> to use a memory carveouts. Device vendors hate that since carved-out
> memory ends up poorly utilized. I'm working on a GCMA proposal which
> hopefully can address that.

I'd still like to understand what the use case is.  Who does CMA
allocation at a time where heavy direct I/O is in progress?
Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
Posted by Suren Baghdasaryan 11 months ago
On Wed, Mar 12, 2025 at 8:52 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Mar 12, 2025 at 08:38:07AM -0700, Suren Baghdasaryan wrote:
> > I might be wrong but my understanding is that we should try to
> > allocate from CMA when the allocation is movable (not pinned), so that
> > CMA can move those pages if necessary. I understand that in some cases
> > a movable allocation can be pinned and we don't know beforehand
> > whether it will be pinned or not. But in this case we know it will
> > happen and could avoid this situation.
>
> Any file or anonymous folio can be temporarily pinned for I/O and only
> moved once that completes.  Direct I/O is one use case for that but there
> are plenty others.  I'm not sure how you define "beforehand", but the
> pinning is visible in the _pincount field.

Well, by "beforehand" I mean that when allocating for Direct I/O
operation we know this memory will be pinned, so we could tell the
allocator to avoid CMA. However I agree that FOLL_LONGTERM is a wrong
way to accomplish that.

>
> > Yeah, low latency usecases for CMA are problematic and I think the
> > only current alternative (apart from solutions involving HW change) is
> > to use a memory carveouts. Device vendors hate that since carved-out
> > memory ends up poorly utilized. I'm working on a GCMA proposal which
> > hopefully can address that.
>
> I'd still like to understand what the use case is.  Who does CMA
> allocation at a time where heavy direct I/O is in progress?

I'll let Samsung folks clarify their usecase.

>
Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
Posted by Christoph Hellwig 11 months ago
On Wed, Mar 12, 2025 at 09:06:02AM -0700, Suren Baghdasaryan wrote:
> > Any file or anonymous folio can be temporarily pinned for I/O and only
> > moved once that completes.  Direct I/O is one use case for that but there
> > are plenty others.  I'm not sure how you define "beforehand", but the
> > pinning is visible in the _pincount field.
> 
> Well, by "beforehand" I mean that when allocating for Direct I/O
> operation we know this memory will be pinned,

Direct I/O is performed on anonymous (or more rarely) file backed pages
that are allocated from the normal allocators.  Some callers might know
that they are eventually going to perform direct I/O on them, but most
won't as that information is a few layers removed from them or totally
hidden in libraries.

The same is true for other pin_user_pages operations.  If you want memory
that is easily available for CMA allocations it better not be given out
as anonymous memory, and probably also not as file backed memory.  Which
just leaves you with easily migratable kernel allocations, i.e. not much.
Re: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
Posted by Bart Van Assche 11 months ago
On 3/12/25 8:52 AM, Christoph Hellwig wrote:
> I'd still like to understand what the use case is.  Who does CMA
> allocation at a time where heavy direct I/O is in progress?

An additional question: why is contiguous memory allocated? Is this
perhaps because the allocated memory will be used for DMA? If so,
can the SMMU be used to make it appear contiguous to DMA clients?

Thanks,

Bart.
RE: [RFC PATCH] block, fs: use FOLL_LONGTERM as gup_flags for direct IO
Posted by Sooyong Suk 11 months, 1 week ago
> On Thu, Mar 6, 2025 at 6:07 PM Sooyong Suk <s.suk@samsung.com> wrote:
> >
> > > On Fri, Mar 7, 2025 at 12:26 AM Christoph Hellwig
> > > <hch@infradead.org>
> > > wrote:
> > > >
> > > > On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> > > > > There are GUP references to pages that are serving as direct IO
> > > buffers.
> > > > > Those pages can be allocated from CMA pageblocks despite they
> > > > > can be pinned until the DIO is completed.
> > > >
> > > > direct I/O is eactly the case that is not FOLL_LONGTERM and one of
> > > > the reasons to even have the flag.  So big fat no to this.
> > > >
> > >
> >
> > Understood.
> >
> > > Hello, thank you for your comment.
> > > We, Sooyong and I, wanted to get some opinions about this
> > > FOLL_LONGTERM for direct I/O as CMA memory got pinned pages which
> > > had been pinned from direct io.
> > >
> > > > You also completely failed to address the relevant mailinglist and
> > > > maintainers.
> > >
> > > I added block maintainer Jens Axboe and the block layer maillinst
> > > here, and added Suren and Sandeep, too.
> 
> I'm very far from being a block layer expert :)
> 
> >
> > Then, what do you think of using PF_MEMALLOC_PIN for this context as
> below?
> > This will only remove __GFP_MOVABLE from its allocation flag.
> > Since __bio_iov_iter_get_pages() indicates that it will pin user or
> > kernel pages, there seems to be no reason not to use this process flag.
> 
> I think this will help you only when the pages are faulted in but if
> __get_user_pages() finds an already mapped page which happens to be
> allocated from CMA, it will not migrate it. So, you might still end up
> with unmovable pages inside CMA.
> 

Yes, you're right.
However, we can at least prevent issues from fault-in cases and mitigate
the overall probability of CMA allocation failure. And the pinned pages that
we observed from snapuserd was also allocated by fault-in.

> >
> > block/bio.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/block/bio.c b/block/bio.c index 65c796ecb..671e28966
> > 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1248,6 +1248,7 @@ static int __bio_iov_iter_get_pages(struct bio
> *bio, struct iov_iter *iter)
> >         unsigned len, i = 0;
> >         size_t offset;
> >         int ret = 0;
> > +       unsigned int flags;
> >
> >         /*
> >          * Move page array up in the allocated memory for the bio vecs
> > as far as @@ -1267,9 +1268,11 @@ static int
> __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> >          * result to ensure the bio's total size is correct. The remainder
> of
> >          * the iov data will be picked up in the next bio iteration.
> >          */
> > +       flags = memalloc_pin_save();
> >         size = iov_iter_extract_pages(iter, &pages,
> >                                       UINT_MAX - bio->bi_iter.bi_size,
> >                                       nr_pages, extraction_flags,
> > &offset);
> > +       memalloc_pin_restore(flags);
> >         if (unlikely(size <= 0))
> >                 return size ? size : -EFAULT;
> >
> >