There are GUP references to pages that are serving as direct IO buffers.
Those pages can be allocated from CMA pageblocks even though they may
remain pinned until the DIO is completed.

Generally, pinning for each DIO might be considered a transient
operation, as described in the documentation. But if a large amount of
direct IO is requested constantly, this can keep pages in CMA pageblocks
pinned and unable to migrate out of the pageblock, which can result in
CMA allocation failure.

In Android devices, on first boot after an OTA update, snapuserd requests
a huge amount of direct IO reads, which can occasionally disturb CMA
allocations.

To prevent this, use FOLL_LONGTERM in the gup_flags for direct IO
requests issued via blkdev_direct_IO or __iomap_dio_rw by default, so
that buffer pages are not allocated from CMA pageblocks.
Signed-off-by: Sooyong Suk <s.suk@samsung.com>
---
block/bio.c | 2 +-
include/linux/uio.h | 2 ++
lib/iov_iter.c | 2 ++
3 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/block/bio.c b/block/bio.c
index d5bdc31d88d3..683113b3e35a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1247,7 +1247,7 @@ static unsigned int get_contig_folio_len(unsigned int *num_pages,
*/
static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
- iov_iter_extraction_t extraction_flags = 0;
+ iov_iter_extraction_t extraction_flags = ITER_ALLOW_LONGTERM;
unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 853f9de5aa05..d1e9174ee29a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -377,6 +377,8 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
/* Flags for iov_iter_get/extract_pages*() */
/* Allow P2PDMA on the extracted pages */
#define ITER_ALLOW_P2PDMA ((__force iov_iter_extraction_t)0x01)
+/* Allow LONGTERM on the extracted pages */
+#define ITER_ALLOW_LONGTERM ((__force iov_iter_extraction_t)0x02)
ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
size_t maxsize, unsigned int maxpages,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 9ec806f989f2..4b5c7c30cd4d 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1832,6 +1832,8 @@ static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
gup_flags |= FOLL_WRITE;
if (extraction_flags & ITER_ALLOW_P2PDMA)
gup_flags |= FOLL_PCI_P2PDMA;
+ if (extraction_flags & ITER_ALLOW_LONGTERM)
+ gup_flags |= FOLL_LONGTERM;
if (i->nofault)
gup_flags |= FOLL_NOFAULT;
--
2.25.1
On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> There are GUP references to pages that are serving as direct IO buffers.
> Those pages can be allocated from CMA pageblocks despite they can be
> pinned until the DIO is completed.

direct I/O is exactly the case that is not FOLL_LONGTERM and one of
the reasons to even have the flag. So big fat no to this.

You also completely failed to address the relevant mailinglist and
maintainers.
On Thu, Mar 06, 2025 at 07:26:52AM -0800, Christoph Hellwig wrote:
> On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> > There are GUP references to pages that are serving as direct IO buffers.
> > Those pages can be allocated from CMA pageblocks despite they can be
> > pinned until the DIO is completed.
>
> direct I/O is exactly the case that is not FOLL_LONGTERM and one of
> the reasons to even have the flag. So big fat no to this.
>
> You also completely failed to address the relevant mailinglist and
> maintainers.
You're right; this patch is so bad that it's insulting.
However, the problem is real. And the alternative "solution" being
proposed is worse -- reintroducing cleancache and frontswap.
What I've been asking for and don't have the answer to yet is:
- What latency is acceptable to reclaim the pages allocated from CMA
pageblocks?
- Can we afford a TLB shootdown? An rmap walk?
- Is the problem with anonymous or pagecache memory?
I have vaguely been wondering about creating a separate (fake) NUMA node
for the CMA memory so that userspace can control "none of this memory is
in the CMA blocks". But that's not a great solution either.
On Fri, Mar 07, 2025 at 08:23:08PM +0000, Matthew Wilcox wrote:
> However, the problem is real.

What is the problem?

> What I've been asking for and don't have the answer to yet is:
>
> - What latency is acceptable to reclaim the pages allocated from CMA
> pageblocks?
> - Can we afford a TLB shootdown? An rmap walk?
> - Is the problem with anonymous or pagecache memory?
>
> I have vaguely been wondering about creating a separate (fake) NUMA node
> for the CMA memory so that userspace can control "none of this memory is
> in the CMA blocks". But that's not a great solution either.

Maybe I'm misunderstanding things, but CMA basically provides a region
that allows for large contiguous allocations from it, but otherwise is
used as bog normal kernel memory. But anyone who wants to allocate from
it needs to move all that memory. Which to me implies that:

 - latency can be expected to be horrible because a lot of individual
   allocations need to possibly be moved, and all of them could be
   temporarily pinned for I/O
 - any driver using CMA better do this during early boot time, or at
   least under the expectation that doing a CMA allocation temporarily
   causes a huge performance degradation. If a caller can't cope with
   that it better not use CMA.
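[Editorial illustration of the "allocate CMA early and keep it" advice above.
This is a hypothetical driver sketch, not code from this thread: the device
structure, buffer size, and probe function are all invented. It grabs its
contiguous buffer once at probe time, before direct I/O can have pages pinned
in the CMA area, rather than allocating on a hot path later.]

#include <linux/dma-mapping.h>
#include <linux/platform_device.h>
#include <linux/sizes.h>

#define EXAMPLE_BUF_SIZE	(16 * SZ_1M)	/* assumed size, illustration only */

struct example_dev {
	void *buf;
	dma_addr_t buf_dma;
};

static int example_probe(struct platform_device *pdev)
{
	struct example_dev *ed;

	ed = devm_kzalloc(&pdev->dev, sizeof(*ed), GFP_KERNEL);
	if (!ed)
		return -ENOMEM;

	/*
	 * May be satisfied from the device's CMA area (dma-contiguous).
	 * Doing this once at probe time avoids contending later with
	 * pages that direct I/O has temporarily pinned.
	 */
	ed->buf = dma_alloc_coherent(&pdev->dev, EXAMPLE_BUF_SIZE,
				     &ed->buf_dma, GFP_KERNEL);
	if (!ed->buf)
		return -ENOMEM;

	platform_set_drvdata(pdev, ed);
	return 0;	/* driver registration and teardown omitted */
}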
On 12.03.25 16:21, Christoph Hellwig wrote:
> On Fri, Mar 07, 2025 at 08:23:08PM +0000, Matthew Wilcox wrote:
>> However, the problem is real.
>
> What is the problem?

I think the problem is the CMA allocation failure, not the latency.

"if a large amount of direct IO is requested constantly, this can make
pages in CMA pageblocks pinned and unable to migrate outside of the
pageblock"

We'd need a more reliable way to make CMA allocation -> page migration
make progress. For example, after we isolated the pageblocks and
migration starts doing its thing, we could disallow any further GUP
pins (e.g., make GUP spin or wait for migration to end).

We could detect in GUP code that a folio is soon expected to be migrated
by checking the pageblock (isolated) and/or whether the folio is locked.

--
Cheers,

David / dhildenb
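[Editorial sketch of the detection half of this idea. This is hypothetical
code, not anything merged in the kernel; the helper name is invented, while
is_migrate_isolate() and get_pageblock_migratetype() are the existing
accessors it builds on.]

#include <linux/mm.h>
#include <linux/page-isolation.h>

/*
 * Hypothetical helper: true if the folio sits in a pageblock whose
 * migratetype is MIGRATE_ISOLATE, i.e. alloc_contig_range()/CMA is in
 * the middle of emptying that pageblock.
 */
static bool folio_in_isolated_pageblock(struct folio *folio)
{
	return is_migrate_isolate(get_pageblock_migratetype(&folio->page));
}

[A GUP slow path that sees this return true could drop its speculative
reference and wait -- on the folio lock, or on a per-pageblock bit as
discussed further down the thread -- before retrying, so that the new pin
never blocks the migration.]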
On 3/13/25 3:49 PM, David Hildenbrand wrote:
> On 12.03.25 16:21, Christoph Hellwig wrote:
>> On Fri, Mar 07, 2025 at 08:23:08PM +0000, Matthew Wilcox wrote:
>>> However, the problem is real.
>>
>> What is the problem?
>
> I think the problem is the CMA allocation failure, not the latency.
>
> "if a large amount of direct IO is requested constantly, this can make
> pages in CMA pageblocks pinned and unable to migrate outside of the
> pageblock"
>
> We'd need a more reliable way to make CMA allocation -> page migration
> make progress. For example, after we isolated the pageblocks and
> migration starts doing its thing, we could disallow any further GUP
> pins. (e.g., make GUP spin or wait for migration to end)
>
> We could detect in GUP code that a folio is soon expected to be migrated
> by checking the pageblock (isolated) and/or whether the folio is locked.
>

Jason Gunthorpe and Matthew both had some ideas about how to fix this [1],
which were very close (maybe the same) to what you're saying here: sleep
and spin in a killable loop.

It turns out to be a little difficult to do this--I had trouble making
the folio's "has waiters" bit work for this, for example. And
then...squirrel!

However, I still believe, so far, that this is the right approach. I'm
just not sure which thing to wait on, exactly.

[1] https://lore.kernel.org/20240502183408.GC3341011@nvidia.com

thanks,
--
John Hubbard
On 15.03.25 02:04, John Hubbard wrote:
> On 3/13/25 3:49 PM, David Hildenbrand wrote:
>> On 12.03.25 16:21, Christoph Hellwig wrote:
>>> On Fri, Mar 07, 2025 at 08:23:08PM +0000, Matthew Wilcox wrote:
>>>> However, the problem is real.
>>>
>>> What is the problem?
>>
>> I think the problem is the CMA allocation failure, not the latency.
>>
>> "if a large amount of direct IO is requested constantly, this can make
>> pages in CMA pageblocks pinned and unable to migrate outside of the
>> pageblock"
>>
>> We'd need a more reliable way to make CMA allocation -> page migration
>> make progress. For example, after we isolated the pageblocks and
>> migration starts doing its thing, we could disallow any further GUP
>> pins. (e.g., make GUP spin or wait for migration to end)
>>
>> We could detect in GUP code that a folio is soon expected to be migrated
>> by checking the pageblock (isolated) and/or whether the folio is locked.
>>
>
> Jason Gunthorpe and Matthew both had some ideas about how to fix this [1],
> which were very close (maybe the same) to what you're saying here: sleep
> and spin in a killable loop.
>
> It turns out to be a little difficult to do this--I had trouble making
> the folio's "has waiters" bit work for this, for example. And then...squirrel!
>
> However, I still believe, so far, this is the right approach. I'm just not
> sure which thing to wait on, exactly.

Zi Yan has a series to convert the "isolate" state of pageblocks to a
separate pageblock bit; it could be considered a lock-bit. Currently,
it's essentially the migratetype being MIGRATE_ISOLATE.

As soon as a pageblock is isolated, one must be prepared for contained
pages/folios to get migrated. The folio lock will only be grabbed once
actually trying to migrate a folio IIRC, so it might not be the best
choice: especially considering allocations that span many pageblocks.

So maybe one would need a "has waiters" bit per pageblock, so relevant
users (e.g., GUP) could wait on the isolate bit getting cleared.

--
Cheers,

David / dhildenb
On 15 Mar 2025, at 19:00, David Hildenbrand wrote:
> On 15.03.25 02:04, John Hubbard wrote:
>> On 3/13/25 3:49 PM, David Hildenbrand wrote:
>>> On 12.03.25 16:21, Christoph Hellwig wrote:
>>>> On Fri, Mar 07, 2025 at 08:23:08PM +0000, Matthew Wilcox wrote:
>>>>> However, the problem is real.
>>>>
>>>> What is the problem?
>>>
>>> I think the problem is the CMA allocation failure, not the latency.
>>>
>>> "if a large amount of direct IO is requested constantly, this can make
>>> pages in CMA pageblocks pinned and unable to migrate outside of the
>>> pageblock"
>>>
>>> We'd need a more reliable way to make CMA allocation -> page migration
>>> make progress. For example, after we isolated the pageblocks and
>>> migration starts doing its thing, we could disallow any further GUP
>>> pins. (e.g., make GUP spin or wait for migration to end)
>>>
>>> We could detect in GUP code that a folio is soon expected to be migrated
>>> by checking the pageblock (isolated) and/or whether the folio is locked.
>>>
>>
>> Jason Gunthorpe and Matthew both had some ideas about how to fix this [1],
>> which were very close (maybe the same) to what you're saying here: sleep
>> and spin in a killable loop.
>>
>> It turns out to be a little difficult to do this--I had trouble making
>> the folio's "has waiters" bit work for this, for example. And then...squirrel!
>>
>> However, I still believe, so far, this is the right approach. I'm just not
>> sure which thing to wait on, exactly.
>
> Zi Yan has a series to convert the "isolate" state of pageblocks to a
> separate pageblock bit; it could be considered a lock-bit. Currently,
> it's essentially the migratetype being MIGRATE_ISOLATE.
>
> As soon as a pageblock is isolated, one must be prepared for contained
> pages/folios to get migrated. The folio lock will only be grabbed once
> actually trying to migrate a folio IIRC, so it might not be the best
> choice: especially considering allocations that span many pageblocks.
>
> So maybe one would need a "has waiters" bit per pageblock, so relevant
> users (e.g., GUP) could wait on the isolate bit getting cleared.

The patchset is at:
https://lore.kernel.org/linux-mm/20250214154215.717537-1-ziy@nvidia.com/.
I should be able to work on it soon, as I have been busy with the
folio_split() patchset recently. My patchset extends the migratetype bits
from 4 to 8 and uses bit 7 for MIGRATE_ISOLATE.

--
Best Regards,
Yan, Zi
On Fri, Mar 7, 2025 at 12:23 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Mar 06, 2025 at 07:26:52AM -0800, Christoph Hellwig wrote:
> > On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> > > There are GUP references to pages that are serving as direct IO buffers.
> > > Those pages can be allocated from CMA pageblocks despite they can be
> > > pinned until the DIO is completed.
> >
> > direct I/O is exactly the case that is not FOLL_LONGTERM and one of
> > the reasons to even have the flag. So big fat no to this.
> >
> > You also completely failed to address the relevant mailinglist and
> > maintainers.
>
> You're right; this patch is so bad that it's insulting.
>
> However, the problem is real. And the alternative "solution" being
> proposed is worse -- reintroducing cleancache and frontswap.

Matthew, if you are referring to the GCMA proposal I'm working on, that's
not meant to address this problem. My goal with GCMA is to reuse memory
carveouts (when they are not in use) for extending pagecache.

The way I understand this particular problem is that we know direct I/O
will allocate pages and make them unmovable, and we do nothing to prevent
these allocations from using CMA.

> What I've been asking for and don't have the answer to yet is:

I'll send my findings related to GCMA usecases separately since I don't
want to mix that with the problem discussed here.

> - What latency is acceptable to reclaim the pages allocated from CMA
> pageblocks?
> - Can we afford a TLB shootdown? An rmap walk?
> - Is the problem with anonymous or pagecache memory?
>
> I have vaguely been wondering about creating a separate (fake) NUMA node
> for the CMA memory so that userspace can control "none of this memory is
> in the CMA blocks". But that's not a great solution either.
On Fri, Mar 7, 2025 at 12:26 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> > There are GUP references to pages that are serving as direct IO buffers.
> > Those pages can be allocated from CMA pageblocks despite they can be
> > pinned until the DIO is completed.
>
> direct I/O is exactly the case that is not FOLL_LONGTERM and one of
> the reasons to even have the flag. So big fat no to this.
>

Hello, thank you for your comment.
We, Sooyong and I, wanted to get some opinions about this FOLL_LONGTERM
for direct I/O, as CMA memory ended up holding pages that had been pinned
by direct I/O.

> You also completely failed to address the relevant mailinglist and
> maintainers.

I added the block maintainer Jens Axboe and the block layer mailing list
here, and added Suren and Sandeep, too.
> On Fri, Mar 7, 2025 at 12:26 AM Christoph Hellwig <hch@infradead.org>
> wrote:
> >
> > On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> > > There are GUP references to pages that are serving as direct IO
> buffers.
> > > Those pages can be allocated from CMA pageblocks despite they can be
> > > pinned until the DIO is completed.
> >
> > direct I/O is exactly the case that is not FOLL_LONGTERM and one of the
> > reasons to even have the flag. So big fat no to this.
> >
>
Understood.
> Hello, thank you for your comment.
> We, Sooyong and I, wanted to get some opinions about this FOLL_LONGTERM
> for direct I/O as CMA memory got pinned pages which had been pinned from
> direct io.
>
> > You also completely failed to address the relevant mailinglist and
> > maintainers.
>
> I added the block maintainer Jens Axboe and the block layer mailing list here,
> and added Suren and Sandeep, too.
Then, what do you think of using PF_MEMALLOC_PIN in this context, as below?
This will only remove __GFP_MOVABLE from the allocation flags.
Since __bio_iov_iter_get_pages() indicates that it will pin user or kernel pages,
there seems to be no reason not to use this process flag.
block/bio.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/block/bio.c b/block/bio.c
index 65c796ecb..671e28966 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1248,6 +1248,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
unsigned len, i = 0;
size_t offset;
int ret = 0;
+ unsigned int flags;
/*
* Move page array up in the allocated memory for the bio vecs as far as
@@ -1267,9 +1268,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
* result to ensure the bio's total size is correct. The remainder of
* the iov data will be picked up in the next bio iteration.
*/
+ flags = memalloc_pin_save();
size = iov_iter_extract_pages(iter, &pages,
UINT_MAX - bio->bi_iter.bi_size,
nr_pages, extraction_flags, &offset);
+ memalloc_pin_restore(flags);
if (unlikely(size <= 0))
return size ? size : -EFAULT;
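[Editorial note on the mechanism: while PF_MEMALLOC_PIN is set on the task,
the page allocator strips __GFP_MOVABLE from the request, so pages faulted in
inside the marked section cannot land in CMA/ZONE_MOVABLE pageblocks. Below
is a simplified sketch of that logic, paraphrased from
include/linux/sched/mm.h; the exact code varies by kernel version.]

#include <linux/gfp.h>
#include <linux/sched.h>

/* Paraphrased sketch of current_gfp_context(); not the verbatim source. */
static inline gfp_t current_gfp_context(gfp_t flags)
{
	unsigned int pflags = READ_ONCE(current->flags);

	if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS |
			       PF_MEMALLOC_PIN))) {
		if (pflags & PF_MEMALLOC_NOIO)
			flags &= ~(__GFP_IO | __GFP_FS);
		else if (pflags & PF_MEMALLOC_NOFS)
			flags &= ~__GFP_FS;

		/* PF_MEMALLOC_PIN: keep the allocation out of movable/CMA */
		if (pflags & PF_MEMALLOC_PIN)
			flags &= ~__GFP_MOVABLE;
	}
	return flags;
}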
On Thu, Mar 6, 2025 at 6:07 PM Sooyong Suk <s.suk@samsung.com> wrote:
>
> > On Fri, Mar 7, 2025 at 12:26 AM Christoph Hellwig <hch@infradead.org>
> > wrote:
> > >
> > > On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> > > > There are GUP references to pages that are serving as direct IO
> > > > buffers.
> > > > Those pages can be allocated from CMA pageblocks despite they can be
> > > > pinned until the DIO is completed.
> > >
> > > direct I/O is exactly the case that is not FOLL_LONGTERM and one of the
> > > reasons to even have the flag. So big fat no to this.
> > >
> >
> Understood.
>
> > Hello, thank you for your comment.
> > We, Sooyong and I, wanted to get some opinions about this FOLL_LONGTERM
> > for direct I/O, as CMA memory ended up holding pages that had been
> > pinned by direct I/O.
> >
> > > You also completely failed to address the relevant mailinglist and
> > > maintainers.
> >
> > I added the block maintainer Jens Axboe and the block layer mailing list
> > here, and added Suren and Sandeep, too.

I'm very far from being a block layer expert :)

>
> Then, what do you think of using PF_MEMALLOC_PIN in this context, as below?
> This will only remove __GFP_MOVABLE from the allocation flags.
> Since __bio_iov_iter_get_pages() indicates that it will pin user or kernel pages,
> there seems to be no reason not to use this process flag.

I think this will help you only when the pages are faulted in but if
__get_user_pages() finds an already mapped page which happens to be
allocated from CMA, it will not migrate it. So, you might still end up
with unmovable pages inside CMA.

>
> block/bio.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/block/bio.c b/block/bio.c
> index 65c796ecb..671e28966 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1248,6 +1248,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> unsigned len, i = 0;
> size_t offset;
> int ret = 0;
> + unsigned int flags;
>
> /*
> * Move page array up in the allocated memory for the bio vecs as far as
> @@ -1267,9 +1268,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> * result to ensure the bio's total size is correct. The remainder of
> * the iov data will be picked up in the next bio iteration.
> */
> + flags = memalloc_pin_save();
> size = iov_iter_extract_pages(iter, &pages,
> UINT_MAX - bio->bi_iter.bi_size,
> nr_pages, extraction_flags, &offset);
> + memalloc_pin_restore(flags);
> if (unlikely(size <= 0))
> return size ? size : -EFAULT;
>
On Thu, Mar 06, 2025 at 06:28:40PM -0800, Suren Baghdasaryan wrote:
> I think this will help you only when the pages are faulted in but if
> __get_user_pages() finds an already mapped page which happens to be
> allocated from CMA, it will not migrate it. So, you might still end up
> with unmovable pages inside CMA.

Direct I/O pages are not unmovable. They are temporarily pinned for
the duration of the direct I/O.

I really don't understand what problem you're trying to fix here.
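[Editorial illustration of the short-term pin lifecycle described above.
This is a generic pin_user_pages_fast() usage sketch, not code from this
patch; the function and parameter names are invented. The pin exists only
while the I/O is in flight and is dropped as soon as it completes, after
which the pages are migratable again.]

#include <linux/errno.h>
#include <linux/mm.h>

/* Hedged sketch of a transient DIO-style pin; error handling trimmed. */
static int transfer_user_buffer(unsigned long uaddr, int nr_pages,
				struct page **pages)
{
	int pinned;

	/* No FOLL_LONGTERM: a transient pin, so CMA pages are allowed */
	pinned = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
	if (pinned <= 0)
		return pinned ? pinned : -EFAULT;

	/* ... submit the I/O against @pages and wait for completion ... */

	/* Dropped right after completion, so migration can proceed again */
	unpin_user_pages(pages, pinned);
	return 0;
}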
On Wed, Mar 12, 2025 at 8:17 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Mar 06, 2025 at 06:28:40PM -0800, Suren Baghdasaryan wrote:
> > I think this will help you only when the pages are faulted in but if
> > __get_user_pages() finds an already mapped page which happens to be
> > allocated from CMA, it will not migrate it. So, you might still end up
> > with unmovable pages inside CMA.
>
> Direct I/O pages are not unmovable. They are temporarily pinned for
> the duration of the direct I/O.

Yes but even temporarily pinned pages can cause CMA allocation
failure. My point is that if we know beforehand that the pages will be
pinned we could avoid using CMA and these failures would go away.

> I really don't understand what problem you're trying to fix here.
On Wed, Mar 12, 2025 at 08:20:36AM -0700, Suren Baghdasaryan wrote:
> > Direct I/O pages are not unmovable. They are temporarily pinned for
> > the duration of the direct I/O.
>
> Yes but even temporarily pinned pages can cause CMA allocation
> failure. My point is that if we know beforehand that the pages will be
> pinned we could avoid using CMA and these failures would go away.

Direct I/O (and other users of pin_user_pages) are designed to work
on all anonymous and file backed pages, which is kinda the point.
If your CMA user can't wait for the time of an I/O, something is wrong
with that caller and it really should not use CMA.
On Wed, Mar 12, 2025 at 8:25 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Mar 12, 2025 at 08:20:36AM -0700, Suren Baghdasaryan wrote:
> > > Direct I/O pages are not unmovable. They are temporarily pinned for
> > > the duration of the direct I/O.
> >
> > Yes but even temporarily pinned pages can cause CMA allocation
> > failure. My point is that if we know beforehand that the pages will be
> > pinned we could avoid using CMA and these failures would go away.
>
> Direct I/O (and other users of pin_user_pages) are designed to work
> on all anonymous and file backed pages, which is kinda the point.
> If your CMA user can't wait for the time of an I/O, something is wrong
> with that caller and it really should not use CMA.

I might be wrong but my understanding is that we should try to
allocate from CMA when the allocation is movable (not pinned), so that
CMA can move those pages if necessary. I understand that in some cases
a movable allocation can be pinned and we don't know beforehand
whether it will be pinned or not. But in this case we know it will
happen and could avoid this situation.

Yeah, low latency usecases for CMA are problematic and I think the
only current alternative (apart from solutions involving HW change) is
to use memory carveouts. Device vendors hate that since carved-out
memory ends up poorly utilized. I'm working on a GCMA proposal which
hopefully can address that.
On Wed, Mar 12, 2025 at 08:38:07AM -0700, Suren Baghdasaryan wrote:
> I might be wrong but my understanding is that we should try to
> allocate from CMA when the allocation is movable (not pinned), so that
> CMA can move those pages if necessary. I understand that in some cases
> a movable allocation can be pinned and we don't know beforehand
> whether it will be pinned or not. But in this case we know it will
> happen and could avoid this situation.

Any file or anonymous folio can be temporarily pinned for I/O and only
moved once that completes. Direct I/O is one use case for that but there
are plenty of others. I'm not sure how you define "beforehand", but the
pinning is visible in the _pincount field.

> Yeah, low latency usecases for CMA are problematic and I think the
> only current alternative (apart from solutions involving HW change) is
> to use memory carveouts. Device vendors hate that since carved-out
> memory ends up poorly utilized. I'm working on a GCMA proposal which
> hopefully can address that.

I'd still like to understand what the use case is. Who does CMA
allocation at a time where heavy direct I/O is in progress?
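[Editorial sketch of how that pin is observable from migration-side code.
The helper name below is invented; folio_maybe_dma_pinned() is the existing
accessor that consults the elevated pin count mentioned above.]

#include <linux/mm.h>

/*
 * Illustrative only: a pinned folio (which folio_maybe_dma_pinned()
 * reports, possibly with false positives) cannot actually be migrated
 * until the DIO completes and the pin is dropped.
 */
static bool folio_worth_migrating_now(struct folio *folio)
{
	return !folio_maybe_dma_pinned(folio);
}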
On Wed, Mar 12, 2025 at 8:52 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Mar 12, 2025 at 08:38:07AM -0700, Suren Baghdasaryan wrote:
> > I might be wrong but my understanding is that we should try to
> > allocate from CMA when the allocation is movable (not pinned), so that
> > CMA can move those pages if necessary. I understand that in some cases
> > a movable allocation can be pinned and we don't know beforehand
> > whether it will be pinned or not. But in this case we know it will
> > happen and could avoid this situation.
>
> Any file or anonymous folio can be temporarily pinned for I/O and only
> moved once that completes. Direct I/O is one use case for that but there
> are plenty of others. I'm not sure how you define "beforehand", but the
> pinning is visible in the _pincount field.

Well, by "beforehand" I mean that when allocating for a Direct I/O
operation we know this memory will be pinned, so we could tell the
allocator to avoid CMA. However I agree that FOLL_LONGTERM is a wrong
way to accomplish that.

> > Yeah, low latency usecases for CMA are problematic and I think the
> > only current alternative (apart from solutions involving HW change) is
> > to use memory carveouts. Device vendors hate that since carved-out
> > memory ends up poorly utilized. I'm working on a GCMA proposal which
> > hopefully can address that.
>
> I'd still like to understand what the use case is. Who does CMA
> allocation at a time where heavy direct I/O is in progress?

I'll let the Samsung folks clarify their usecase.
On Wed, Mar 12, 2025 at 09:06:02AM -0700, Suren Baghdasaryan wrote:
> > Any file or anonymous folio can be temporarily pinned for I/O and only
> > moved once that completes. Direct I/O is one use case for that but there
> > are plenty of others. I'm not sure how you define "beforehand", but the
> > pinning is visible in the _pincount field.
>
> Well, by "beforehand" I mean that when allocating for a Direct I/O
> operation we know this memory will be pinned,

Direct I/O is performed on anonymous or (more rarely) file-backed pages
that are allocated from the normal allocators. Some callers might know
that they are eventually going to perform direct I/O on them, but most
won't, as that information is a few layers removed from them or totally
hidden in libraries. The same is true for other pin_user_pages
operations.

If you want memory that is easily available for CMA allocations, it
better not be given out as anonymous memory, and probably also not as
file-backed memory. Which just leaves you with easily migratable kernel
allocations, i.e. not much.
On 3/12/25 8:52 AM, Christoph Hellwig wrote:
> I'd still like to understand what the use case is. Who does CMA
> allocation at a time where heavy direct I/O is in progress?

An additional question: why is contiguous memory allocated? Is this
perhaps because the allocated memory will be used for DMA? If so, can
the SMMU be used to make it appear contiguous to DMA clients?

Thanks,

Bart.
> On Thu, Mar 6, 2025 at 6:07 PM Sooyong Suk <s.suk@samsung.com> wrote:
> >
> > > On Fri, Mar 7, 2025 at 12:26 AM Christoph Hellwig
> > > <hch@infradead.org> wrote:
> > > >
> > > > On Thu, Mar 06, 2025 at 04:40:56PM +0900, Sooyong Suk wrote:
> > > > > There are GUP references to pages that are serving as direct IO
> > > > > buffers.
> > > > > Those pages can be allocated from CMA pageblocks despite they
> > > > > can be pinned until the DIO is completed.
> > > >
> > > > direct I/O is exactly the case that is not FOLL_LONGTERM and one of
> > > > the reasons to even have the flag. So big fat no to this.
> > > >
> > >
> > Understood.
> >
> > > Hello, thank you for your comment.
> > > We, Sooyong and I, wanted to get some opinions about this
> > > FOLL_LONGTERM for direct I/O, as CMA memory ended up holding pages
> > > that had been pinned by direct I/O.
> > >
> > > > You also completely failed to address the relevant mailinglist and
> > > > maintainers.
> > >
> > > I added the block maintainer Jens Axboe and the block layer mailing
> > > list here, and added Suren and Sandeep, too.
>
> I'm very far from being a block layer expert :)
>
> >
> > Then, what do you think of using PF_MEMALLOC_PIN in this context, as
> > below?
> > This will only remove __GFP_MOVABLE from the allocation flags.
> > Since __bio_iov_iter_get_pages() indicates that it will pin user or
> > kernel pages, there seems to be no reason not to use this process flag.
>
> I think this will help you only when the pages are faulted in but if
> __get_user_pages() finds an already mapped page which happens to be
> allocated from CMA, it will not migrate it. So, you might still end up
> with unmovable pages inside CMA.
>

Yes, you're right. However, we can at least prevent issues from fault-in
cases and mitigate the overall probability of CMA allocation failure.
And the pinned pages that we observed from snapuserd were also allocated
by fault-in.

> >
> > block/bio.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index 65c796ecb..671e28966 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1248,6 +1248,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> > unsigned len, i = 0;
> > size_t offset;
> > int ret = 0;
> > + unsigned int flags;
> >
> > /*
> > * Move page array up in the allocated memory for the bio vecs as far as
> > @@ -1267,9 +1268,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> > * result to ensure the bio's total size is correct. The remainder of
> > * the iov data will be picked up in the next bio iteration.
> > */
> > + flags = memalloc_pin_save();
> > size = iov_iter_extract_pages(iter, &pages,
> > UINT_MAX - bio->bi_iter.bi_size,
> > nr_pages, extraction_flags, &offset);
> > + memalloc_pin_restore(flags);
> > if (unlikely(size <= 0))
> > return size ? size : -EFAULT;
> >