[PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Yafang Shao 1 week, 2 days ago
While investigating a kcompactd 100% CPU utilization issue in production, I
observed frequent costly high-order (order-6) page allocations triggered by
proc file reads from monitoring tools. This can be reproduced with a simple
test case:

  fd = open(PROC_FILE, O_RDONLY);
  size = read(fd, buff, 256KB);
  close(fd);

Although we should modify the monitoring tools to use smaller buffer sizes,
we should also enhance the kernel to prevent these expensive high-order
allocations.
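
For completeness, a standalone version of the reproducer might look like
the sketch below (the file path is a placeholder, not the one from the
report; any /proc/sys entry behaves the same because the kernel sizes its
read buffer by the requested count, not by the file contents):

  /*
   * Standalone sketch of the reproducer above. PROC_FILE is a
   * placeholder; BUF_SIZE matches the 256KB read in the description.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define PROC_FILE "/proc/sys/fs/file-nr"	/* placeholder path */
  #define BUF_SIZE  (256 * 1024)		/* 256KB, as in the changelog */

  int main(void)
  {
  	char *buff = malloc(BUF_SIZE);
  	ssize_t size;
  	int fd;

  	if (!buff)
  		return 1;

  	fd = open(PROC_FILE, O_RDONLY);
  	if (fd < 0) {
  		perror("open");
  		return 1;
  	}

  	size = read(fd, buff, BUF_SIZE);
  	printf("read %zd bytes\n", size);

  	close(fd);
  	free(buff);
  	return 0;
  }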

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Josef Bacik <josef@toxicpanda.com>
---
 fs/proc/proc_sysctl.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index cc9d74a06ff0..c53ba733bda5 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
 	error = -ENOMEM;
 	if (count >= KMALLOC_MAX_SIZE)
 		goto out;
-	kbuf = kvzalloc(count + 1, GFP_KERNEL);
+
+	/*
+	 * Use vmalloc if the count is too large to avoid costly high-order page
+	 * allocations.
+	 */
+	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
+		kbuf = kvzalloc(count + 1, GFP_KERNEL);
+	else
+		kbuf = vmalloc(count + 1);
 	if (!kbuf)
 		goto out;
 
-- 
2.43.5
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Kees Cook 1 week, 2 days ago

On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
>While investigating a kcompactd 100% CPU utilization issue in production, I
>observed frequent costly high-order (order-6) page allocations triggered by
>proc file reads from monitoring tools. This can be reproduced with a simple
>test case:
>
>  fd = open(PROC_FILE, O_RDONLY);
>  size = read(fd, buff, 256KB);
>  close(fd);
>
>Although we should modify the monitoring tools to use smaller buffer sizes,
>we should also enhance the kernel to prevent these expensive high-order
>allocations.
>
>Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>Cc: Josef Bacik <josef@toxicpanda.com>
>---
> fs/proc/proc_sysctl.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
>diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
>index cc9d74a06ff0..c53ba733bda5 100644
>--- a/fs/proc/proc_sysctl.c
>+++ b/fs/proc/proc_sysctl.c
>@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> 	error = -ENOMEM;
> 	if (count >= KMALLOC_MAX_SIZE)
> 		goto out;
>-	kbuf = kvzalloc(count + 1, GFP_KERNEL);
>+
>+	/*
>+	 * Use vmalloc if the count is too large to avoid costly high-order page
>+	 * allocations.
>+	 */
>+	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
>+		kbuf = kvzalloc(count + 1, GFP_KERNEL);

Why not move this check into kvmalloc family?

>+	else
>+		kbuf = vmalloc(count + 1);

You dropped the zeroing. This must be vzalloc.

> 	if (!kbuf)
> 		goto out;
> 

Alternatively, why not force count to be <PAGE_SIZE? What uses >PAGE_SIZE writes in proc/sys?
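
Put together, the allocation branch with the zeroing preserved would look
roughly like this (a sketch of the fix being asked for above, not a new
version of the patch):

	/*
	 * Sketch only: keep the zeroing by using vzalloc() for reads that
	 * would otherwise need a costly high-order kmalloc.
	 */
	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
		kbuf = kvzalloc(count + 1, GFP_KERNEL);
	else
		kbuf = vzalloc(count + 1);
	if (!kbuf)
		goto out;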

-Kees

-- 
Kees Cook
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Harry Yoo 1 week, 1 day ago
On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> 
> 
> On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> >While investigating a kcompactd 100% CPU utilization issue in production, I
> >observed frequent costly high-order (order-6) page allocations triggered by
> >proc file reads from monitoring tools. This can be reproduced with a simple
> >test case:
> >
> >  fd = open(PROC_FILE, O_RDONLY);
> >  size = read(fd, buff, 256KB);
> >  close(fd);
> >
> >Although we should modify the monitoring tools to use smaller buffer sizes,
> >we should also enhance the kernel to prevent these expensive high-order
> >allocations.
> >
> >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >Cc: Josef Bacik <josef@toxicpanda.com>
> >---
> > fs/proc/proc_sysctl.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> >index cc9d74a06ff0..c53ba733bda5 100644
> >--- a/fs/proc/proc_sysctl.c
> >+++ b/fs/proc/proc_sysctl.c
> >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> > 	error = -ENOMEM;
> > 	if (count >= KMALLOC_MAX_SIZE)
> > 		goto out;
> >-	kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >+
> >+	/*
> >+	 * Use vmalloc if the count is too large to avoid costly high-order page
> >+	 * allocations.
> >+	 */
> >+	if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> >+		kbuf = kvzalloc(count + 1, GFP_KERNEL);
> 
> Why not move this check into kvmalloc family?

Hmm should this check really be in kvmalloc family?

I don't think users would expect kvmalloc() to implicitly decide on using
vmalloc() without trying kmalloc() first, just because it's a high-order
allocation.

> >+	else
> >+		kbuf = vmalloc(count + 1);
> 
> You dropped the zeroing. This must be vzalloc.
> 
> > 	if (!kbuf)
> > 		goto out;
> > 
> 
> Alternatively, why not force count to be <PAGE_SIZE? What uses >PAGE_SIZE writes in proc/sys?
> 
> -Kees
> 
> -- 
> Kees Cook

-- 
Cheers,
Harry (formerly known as Hyeonggon)
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Yafang Shao 1 week, 1 day ago
On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> >
> >
> > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > >While investigating a kcompactd 100% CPU utilization issue in production, I
> > >observed frequent costly high-order (order-6) page allocations triggered by
> > >proc file reads from monitoring tools. This can be reproduced with a simple
> > >test case:
> > >
> > >  fd = open(PROC_FILE, O_RDONLY);
> > >  size = read(fd, buff, 256KB);
> > >  close(fd);
> > >
> > >Although we should modify the monitoring tools to use smaller buffer sizes,
> > >we should also enhance the kernel to prevent these expensive high-order
> > >allocations.
> > >
> > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > >Cc: Josef Bacik <josef@toxicpanda.com>
> > >---
> > > fs/proc/proc_sysctl.c | 10 +++++++++-
> > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > >
> > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> > >index cc9d74a06ff0..c53ba733bda5 100644
> > >--- a/fs/proc/proc_sysctl.c
> > >+++ b/fs/proc/proc_sysctl.c
> > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> > >     error = -ENOMEM;
> > >     if (count >= KMALLOC_MAX_SIZE)
> > >             goto out;
> > >-    kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > >+
> > >+    /*
> > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> > >+     * allocations.
> > >+     */
> > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >
> > Why not move this check into kvmalloc family?
>
> Hmm should this check really be in kvmalloc family?

Modifying the existing kvmalloc functions risks performance regressions.
Could we instead introduce a new variant like vkmalloc() (favoring
vmalloc over kmalloc) or kvmalloc_costless()?

>
> I don't think users would expect kvmalloc() to implictly decide on using
> vmalloc() without trying kmalloc() first, just because it's a high-order
> allocation.
>

-- 
Regards
Yafang
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Dave Chinner 1 week, 1 day ago
On Wed, Apr 02, 2025 at 04:42:06PM +0800, Yafang Shao wrote:
> On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> > >
> > >
> > > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >While investigating a kcompactd 100% CPU utilization issue in production, I
> > > >observed frequent costly high-order (order-6) page allocations triggered by
> > > >proc file reads from monitoring tools. This can be reproduced with a simple
> > > >test case:
> > > >
> > > >  fd = open(PROC_FILE, O_RDONLY);
> > > >  size = read(fd, buff, 256KB);
> > > >  close(fd);
> > > >
> > > >Although we should modify the monitoring tools to use smaller buffer sizes,
> > > >we should also enhance the kernel to prevent these expensive high-order
> > > >allocations.
> > > >
> > > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > >Cc: Josef Bacik <josef@toxicpanda.com>
> > > >---
> > > > fs/proc/proc_sysctl.c | 10 +++++++++-
> > > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> > > >index cc9d74a06ff0..c53ba733bda5 100644
> > > >--- a/fs/proc/proc_sysctl.c
> > > >+++ b/fs/proc/proc_sysctl.c
> > > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> > > >     error = -ENOMEM;
> > > >     if (count >= KMALLOC_MAX_SIZE)
> > > >             goto out;
> > > >-    kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > >+
> > > >+    /*
> > > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> > > >+     * allocations.
> > > >+     */
> > > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > >
> > > Why not move this check into kvmalloc family?
> >
> > Hmm should this check really be in kvmalloc family?
> 
> Modifying the existing kvmalloc functions risks performance regressions.
> Could we instead introduce a new variant like vkmalloc() (favoring
> vmalloc over kmalloc) or kvmalloc_costless()?

We should fix kvmalloc() instead of continuing to force
subsystems to work around the limitations of kvmalloc().

Have a look at xlog_kvmalloc() in XFS. It implements a basic
fast-fail, no retry high order kmalloc before it falls back to
vmalloc by turning off direct reclaim for the kmalloc() call.
Hence if there isn't a high-order page on the free lists ready
to allocate, it falls back to vmalloc() immediately.

For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
overhead by around 80% when compared to a standard kvmalloc()
call. Numbers and profiles were documented in the commit message
(reproduced in whole below)...
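
The pattern being described boils down to something like the sketch
below (an illustration of the approach, with a made-up helper name; it
is not the actual xlog_kvmalloc() code):

static void *fastfail_kvmalloc(size_t size, gfp_t flags)
{
	void *p;

	/*
	 * Try the physically contiguous allocation first, but with direct
	 * reclaim (and hence compaction) disabled so it fails fast when no
	 * suitable high-order page is already on the free lists.
	 */
	p = kmalloc(size, (flags & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN);
	if (p)
		return p;

	/* Otherwise fall back to vmalloc() immediately. */
	return __vmalloc(size, flags);
}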

> > I don't think users would expect kvmalloc() to implictly decide on using
> > vmalloc() without trying kmalloc() first, just because it's a high-order
> > allocation.

Right, but users expect kvmalloc() to use the most efficient
allocation paths available to it.

In this case, vmalloc() is faster and more reliable than
direct reclaim w/ compaction. Hence vmalloc() should really be the
primary fallback path when high-order pages are not immediately
available to kmalloc() when called from kvmalloc()...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Jan 4 17:22:18 2022 -0800

    xfs: reduce kvmalloc overhead for CIL shadow buffers

    Oh, let me count the ways that the kvmalloc API sucks dog eggs.

    The problem is when we are logging lots of large objects, we hit
    kvmalloc really damn hard with costly order allocations, and
    behaviour utterly sucks:

         - 49.73% xlog_cil_commit
             - 31.62% kvmalloc_node
                - 29.96% __kmalloc_node
                   - 29.38% kmalloc_large_node
                      - 29.33% __alloc_pages
                         - 24.33% __alloc_pages_slowpath.constprop.0
                            - 18.35% __alloc_pages_direct_compact
                               - 17.39% try_to_compact_pages
                                  - compact_zone_order
                                     - 15.26% compact_zone
                                          5.29% __pageblock_pfn_to_page
                                          3.71% PageHuge
                                        - 1.44% isolate_migratepages_block
                                             0.71% set_pfnblock_flags_mask
                                       1.11% get_pfnblock_flags_mask
                               - 0.81% get_page_from_freelist
                                  - 0.59% _raw_spin_lock_irqsave
                                     - do_raw_spin_lock
                                          __pv_queued_spin_lock_slowpath
                            - 3.24% try_to_free_pages
                               - 3.14% shrink_node
                                  - 2.94% shrink_slab.constprop.0
                                     - 0.89% super_cache_count
                                        - 0.66% xfs_fs_nr_cached_objects
                                           - 0.65% xfs_reclaim_inodes_count
                                                0.55% xfs_perag_get_tag
                                       0.58% kfree_rcu_shrink_count
                            - 2.09% get_page_from_freelist
                               - 1.03% _raw_spin_lock_irqsave
                                  - do_raw_spin_lock
                                       __pv_queued_spin_lock_slowpath
                         - 4.88% get_page_from_freelist
                            - 3.66% _raw_spin_lock_irqsave
                               - do_raw_spin_lock
                                    __pv_queued_spin_lock_slowpath
                - 1.63% __vmalloc_node
                   - __vmalloc_node_range
                      - 1.10% __alloc_pages_bulk
                         - 0.93% __alloc_pages
                            - 0.92% get_page_from_freelist
                               - 0.89% rmqueue_bulk
                                  - 0.69% _raw_spin_lock
                                     - do_raw_spin_lock
                                          __pv_queued_spin_lock_slowpath
               13.73% memcpy_erms
             - 2.22% kvfree

    On this workload, that's almost a dozen CPUs all trying to compact
    and reclaim memory inside kvmalloc_node at the same time. Yet it is
    regularly falling back to vmalloc despite all that compaction, page
    and shrinker reclaim that direct reclaim is doing. Copying all the
    metadata is taking far less CPU time than allocating the storage!

    Direct reclaim should be considered extremely harmful.

    This is a high frequency, high throughput, CPU usage and latency
    sensitive allocation. We've got memory there, and we're using
    kvmalloc to allow memory allocation to avoid doing lots of work to
    try to do contiguous allocations.

    Except it still does *lots of costly work* that is unnecessary.

    Worse: the only way to avoid the slowpath page allocation trying to
    do compaction on costly allocations is to turn off direct reclaim
    (i.e. remove __GFP_RECLAIM_DIRECT from the gfp flags).

    Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
    GFP_KERNEL allocation context, so you only get kmalloc!". This
    cuts off the vmalloc fallback, and this leads to almost instant OOM
    problems which ends up in filesystems deadlocks, shutdowns and/or
    kernel crashes.

    I want some basic kvmalloc behaviour:

    - kmalloc for a contiguous range with fail fast semantics - no
      compaction direct reclaim if the allocation enters the slow path.
    - run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails

    The really, really stupid part about this is these kvmalloc() calls
    are run under memalloc_nofs task context, so all the allocations are
    always reduced to GFP_NOFS regardless of the fact that kvmalloc
    requires GFP_KERNEL to be passed in. IOWs, we're already telling
    kvmalloc to behave differently to the gfp flags we pass in, but it
    still won't allow vmalloc to be run with anything other than
    GFP_KERNEL.

    So, this patch open codes the kvmalloc() in the commit path to have
    the above described behaviour. The result is we more than halve the
    CPU time spend doing kvmalloc() in this path and transaction commits
    with 64kB objects in them more than doubles. i.e. we get ~5x
    reduction in CPU usage per costly-sized kvmalloc() invocation and
    the profile looks like this:

          - 37.60% xlog_cil_commit
            16.01% memcpy_erms
          - 8.45% __kmalloc
             - 8.04% kmalloc_order_trace
                - 8.03% kmalloc_order
                   - 7.93% alloc_pages
                      - 7.90% __alloc_pages
                         - 4.05% __alloc_pages_slowpath.constprop.0
                            - 2.18% get_page_from_freelist
                            - 1.77% wake_all_kswapds
    ....
                                        - __wake_up_common_lock
                                           - 0.94% _raw_spin_lock_irqsave
                         - 3.72% get_page_from_freelist
                            - 2.43% _raw_spin_lock_irqsave
          - 5.72% vmalloc
             - 5.72% __vmalloc_node_range
                - 4.81% __get_vm_area_node.constprop.0
                   - 3.26% alloc_vmap_area
                      - 2.52% _raw_spin_lock
                   - 1.46% _raw_spin_lock
                  0.56% __alloc_pages_bulk
          - 4.66% kvfree
             - 3.25% vfree
                - __vfree
                   - 3.23% __vunmap
                      - 1.95% remove_vm_area
                         - 1.06% free_vmap_area_noflush
                            - 0.82% _raw_spin_lock
                         - 0.68% _raw_spin_lock
                      - 0.92% _raw_spin_lock
             - 1.40% kfree
                - 1.36% __free_pages
                   - 1.35% __free_pages_ok
                      - 1.02% _raw_spin_lock_irqsave

    It's worth noting that over 50% of the CPU time spent allocating
    these shadow buffers is now spent on spinlocks. So the shadow buffer
    allocation overhead is greatly reduced by getting rid of direct
    reclaim from kmalloc, and could probably be made even less costly if
    vmalloc() didn't use global spinlocks to protect it's structures.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Michal Hocko 1 week, 1 day ago
On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> On Wed, Apr 02, 2025 at 04:42:06PM +0800, Yafang Shao wrote:
> > On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > >
> > > On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> > > >
> > > >
> > > > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >While investigating a kcompactd 100% CPU utilization issue in production, I
> > > > >observed frequent costly high-order (order-6) page allocations triggered by
> > > > >proc file reads from monitoring tools. This can be reproduced with a simple
> > > > >test case:
> > > > >
> > > > >  fd = open(PROC_FILE, O_RDONLY);
> > > > >  size = read(fd, buff, 256KB);
> > > > >  close(fd);
> > > > >
> > > > >Although we should modify the monitoring tools to use smaller buffer sizes,
> > > > >we should also enhance the kernel to prevent these expensive high-order
> > > > >allocations.
> > > > >
> > > > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > > >Cc: Josef Bacik <josef@toxicpanda.com>
> > > > >---
> > > > > fs/proc/proc_sysctl.c | 10 +++++++++-
> > > > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > > > >
> > > > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> > > > >index cc9d74a06ff0..c53ba733bda5 100644
> > > > >--- a/fs/proc/proc_sysctl.c
> > > > >+++ b/fs/proc/proc_sysctl.c
> > > > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> > > > >     error = -ENOMEM;
> > > > >     if (count >= KMALLOC_MAX_SIZE)
> > > > >             goto out;
> > > > >-    kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > > >+
> > > > >+    /*
> > > > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> > > > >+     * allocations.
> > > > >+     */
> > > > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > > > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > >
> > > > Why not move this check into kvmalloc family?
> > >
> > > Hmm should this check really be in kvmalloc family?
> > 
> > Modifying the existing kvmalloc functions risks performance regressions.
> > Could we instead introduce a new variant like vkmalloc() (favoring
> > vmalloc over kmalloc) or kvmalloc_costless()?
> 
> We should fix kvmalloc() instead of continuing to force
> subsystems to work around the limitations of kvmalloc().

Agreed!

> Have a look at xlog_kvmalloc() in XFS. It implements a basic
> fast-fail, no retry high order kmalloc before it falls back to
> vmalloc by turning off direct reclaim for the kmalloc() call.
> Hence if the there isn't a high-order page on the free lists ready
> to allocate, it falls back to vmalloc() immediately.
> 
> For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
> overhead by around 80% when compared to a standard kvmalloc()
> call. Numbers and profiles were documented in the commit message
> (reproduced in whole below)...

Btw. it would be really great to have such concerns posted to the
linux-mm ML so that we are aware of them.

kvmalloc currently doesn't support GFP_NOWAIT semantics, but it does
allow the caller to express "I prefer the SLAB allocator over vmalloc".
I think we could make the default kvmalloc slab path weaker, as those
who really want slab already have the means to achieve that. There is a
risk of long term fragmentation, but I think this is worth trying:
diff --git a/mm/util.c b/mm/util.c
index 60aa40f612b8..8386f6976d7d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -601,14 +601,18 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
 	 * We want to attempt a large physically contiguous block first because
 	 * it is less likely to fragment multiple larger blocks and therefore
 	 * contribute to a long term fragmentation less than vmalloc fallback.
-	 * However make sure that larger requests are not too disruptive - no
-	 * OOM killer and no allocation failure warnings as we have a fallback.
+	 * However make sure that larger requests are not too disruptive - i.e.
+	 * do not direct reclaim unless physically continuous memory is preferred
+	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
+	 * working in the background but the allocation itself.
 	 */
 	if (size > PAGE_SIZE) {
 		flags |= __GFP_NOWARN;
 
 		if (!(flags & __GFP_RETRY_MAYFAIL))
 			flags |= __GFP_NORETRY;
+		else
+			flags &= ~__GFP_DIRECT_RECLAIM;
 
 		/* nofail semantic is implemented by the vmalloc fallback */
 		flags &= ~__GFP_NOFAIL;
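
If something along these lines lands, the proc_sysctl caller at the top
of this thread presumably would not need its open-coded size check at
all and could stay with the single call (sketch, assuming the adjusted
kvmalloc behaviour above):

	/*
	 * With a lighter kvmalloc fast path, the original call suffices:
	 * large reads fall back to vmalloc without direct reclaim or
	 * compaction.
	 */
	kbuf = kvzalloc(count + 1, GFP_KERNEL);
	if (!kbuf)
		goto out;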
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Shakeel Butt 1 week ago
On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> diff --git a/mm/util.c b/mm/util.c
> index 60aa40f612b8..8386f6976d7d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -601,14 +601,18 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
>  	 * We want to attempt a large physically contiguous block first because
>  	 * it is less likely to fragment multiple larger blocks and therefore
>  	 * contribute to a long term fragmentation less than vmalloc fallback.
> -	 * However make sure that larger requests are not too disruptive - no
> -	 * OOM killer and no allocation failure warnings as we have a fallback.
> +	 * However make sure that larger requests are not too disruptive - i.e.
> +	 * do not direct reclaim unless physically continuous memory is preferred
> +	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
> +	 * working in the background but the allocation itself.
>  	 */
>  	if (size > PAGE_SIZE) {
>  		flags |= __GFP_NOWARN;
>  
>  		if (!(flags & __GFP_RETRY_MAYFAIL))
>  			flags |= __GFP_NORETRY;
> +		else
> +			flags &= ~__GFP_DIRECT_RECLAIM;

I think you wanted the following instead:

		if (!(flags & __GFP_RETRY_MAYFAIL))
			flags &= ~__GFP_DIRECT_RECLAIM;

This is what Dave is asking as well for kmalloc() case of kvmalloc().

>  
>  		/* nofail semantic is implemented by the vmalloc fallback */
>  		flags &= ~__GFP_NOFAIL;
> -- 
> Michal Hocko
> SUSE Labs
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Michal Hocko 1 week ago
On Wed 02-04-25 21:37:40, Shakeel Butt wrote:
> On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> > diff --git a/mm/util.c b/mm/util.c
> > index 60aa40f612b8..8386f6976d7d 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -601,14 +601,18 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
> >  	 * We want to attempt a large physically contiguous block first because
> >  	 * it is less likely to fragment multiple larger blocks and therefore
> >  	 * contribute to a long term fragmentation less than vmalloc fallback.
> > -	 * However make sure that larger requests are not too disruptive - no
> > -	 * OOM killer and no allocation failure warnings as we have a fallback.
> > +	 * However make sure that larger requests are not too disruptive - i.e.
> > +	 * do not direct reclaim unless physically continuous memory is preferred
> > +	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
> > +	 * working in the background but the allocation itself.
> >  	 */
> >  	if (size > PAGE_SIZE) {
> >  		flags |= __GFP_NOWARN;
> >  
> >  		if (!(flags & __GFP_RETRY_MAYFAIL))
> >  			flags |= __GFP_NORETRY;
> > +		else
> > +			flags &= ~__GFP_DIRECT_RECLAIM;
> 
> I think you wanted the following instead:
> 
> 		if (!(flags & __GFP_RETRY_MAYFAIL))
> 			flags &= ~__GFP_DIRECT_RECLAIM;

You are absolutely right. Not sure what I was thinking... I will send a
full patch with a changelog to wrap the situation up.

-- 
Michal Hocko
SUSE Labs
[PATCH] mm: kvmalloc: make kmalloc fast path real fast path
Posted by Michal Hocko 1 week ago
There are users like xfs which need larger allocations with NOFAIL
semantics. They are not using kvmalloc currently because the current
implementation tries too hard to allocate through the kmalloc path,
which causes a lot of direct reclaim and compaction and hurts
performance badly (see 8dc9384b7d75 ("xfs: reduce kvmalloc overhead for
CIL shadow buffers") for more details).

kvmalloc does support __GFP_RETRY_MAYFAIL semantics to express that a
kmalloc (physically contiguous) allocation is preferred and that we
should try harder to make it happen. There is currently no way to
express that kmalloc should be very lightweight, and as has been argued
[1] this mode should be the default, to support kvmalloc(NOFAIL) with a
lightweight kmalloc path - something that is currently impossible to
express because __GFP_NOFAIL cannot be combined with any other reclaim
modifiers.

This patch makes all kmalloc allocations GFP_NOWAIT unless
__GFP_RETRY_MAYFAIL is provided to kvmalloc. This allows supporting
both fail-fast and retry-hard behaviour for physically contiguous
memory, with a vmalloc fallback.

There is a potential downside that relatively small allocations (smaller
than PAGE_ALLOC_COSTLY_ORDER) could fall back to vmalloc too easily and
cause page block fragmentation. We cannot really rule that out, but the
xlog_cil_kvmalloc use doesn't indicate this to be happening.

[1] https://lore.kernel.org/all/Z-3i1wATGh6vI8x8@dread.disaster.area/T/#u
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/slub.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index b46f87662e71..2da40c2f6478 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4972,14 +4972,16 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
 	 * We want to attempt a large physically contiguous block first because
 	 * it is less likely to fragment multiple larger blocks and therefore
 	 * contribute to a long term fragmentation less than vmalloc fallback.
-	 * However make sure that larger requests are not too disruptive - no
-	 * OOM killer and no allocation failure warnings as we have a fallback.
+	 * However make sure that larger requests are not too disruptive - i.e.
+	 * do not direct reclaim unless physically continuous memory is preferred
+	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
+	 * working in the background but the allocation itself.
 	 */
 	if (size > PAGE_SIZE) {
 		flags |= __GFP_NOWARN;
 
 		if (!(flags & __GFP_RETRY_MAYFAIL))
-			flags |= __GFP_NORETRY;
+			flags &= ~__GFP_DIRECT_RECLAIM;
 
 		/* nofail semantic is implemented by the vmalloc fallback */
 		flags &= ~__GFP_NOFAIL;
-- 
2.49.0
Re: [PATCH] mm: kvmalloc: make kmalloc fast path real fast path
Posted by Michal Hocko 6 days, 23 hours ago
Add Andrew

Also, Dave do you want me to redirect xlog_cil_kvmalloc to kvmalloc or
do you prefer to do that yourself?

On Thu 03-04-25 09:43:41, Michal Hocko wrote:
> There are users like xfs which need larger allocations with NOFAIL
> sementic. They are not using kvmalloc currently because the current
> implementation tries too hard to allocate through the kmalloc path
> which causes a lot of direct reclaim and compaction and that hurts
> performance a lot (see 8dc9384b7d75 ("xfs: reduce kvmalloc overhead for
> CIL shadow buffers") for more details).
> 
> kvmalloc does support __GFP_RETRY_MAYFAIL semantic to express that
> kmalloc (physically contiguous) allocation is preferred and we should go
> more aggressive to make it happen. There is currently no way to express
> that kmalloc should be very lightweight and as it has been argued [1]
> this mode should be default to support kvmalloc(NOFAIL) with a
> lightweight kmalloc path which is currently impossible to express as
> __GFP_NOFAIL cannot be combined by any other reclaim modifiers.
> 
> This patch makes all kmalloc allocations GFP_NOWAIT unless
> __GFP_RETRY_MAYFAIL is provided to kvmalloc. This allows to support both
> fail fast and retry hard on physically contiguous memory with vmalloc
> fallback.
> 
> There is a potential downside that relatively small allocations (smaller
> than PAGE_ALLOC_COSTLY_ORDER) could fallback to vmalloc too easily and
> cause page block fragmentation. We cannot really rule that out but it
> seems that xlog_cil_kvmalloc use doesn't indicate this to be happening.
> 
> [1] https://lore.kernel.org/all/Z-3i1wATGh6vI8x8@dread.disaster.area/T/#u
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/slub.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index b46f87662e71..2da40c2f6478 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4972,14 +4972,16 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
>  	 * We want to attempt a large physically contiguous block first because
>  	 * it is less likely to fragment multiple larger blocks and therefore
>  	 * contribute to a long term fragmentation less than vmalloc fallback.
> -	 * However make sure that larger requests are not too disruptive - no
> -	 * OOM killer and no allocation failure warnings as we have a fallback.
> +	 * However make sure that larger requests are not too disruptive - i.e.
> +	 * do not direct reclaim unless physically continuous memory is preferred
> +	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
> +	 * working in the background but the allocation itself.
>  	 */
>  	if (size > PAGE_SIZE) {
>  		flags |= __GFP_NOWARN;
>  
>  		if (!(flags & __GFP_RETRY_MAYFAIL))
> -			flags |= __GFP_NORETRY;
> +			flags &= ~__GFP_DIRECT_RECLAIM;
>  
>  		/* nofail semantic is implemented by the vmalloc fallback */
>  		flags &= ~__GFP_NOFAIL;
> -- 
> 2.49.0
> 

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm: kvmalloc: make kmalloc fast path real fast path
Posted by Michal Hocko 1 day, 11 hours ago
On Thu 03-04-25 21:51:46, Michal Hocko wrote:
> Add Andrew

Andrew, do you want me to repost the patch or can you take it from this
email thread?
 
> Also, Dave do you want me to redirect xlog_cil_kvmalloc to kvmalloc or
> do you preffer to do that yourself?
> 
> On Thu 03-04-25 09:43:41, Michal Hocko wrote:
> > There are users like xfs which need larger allocations with NOFAIL
> > sementic. They are not using kvmalloc currently because the current
> > implementation tries too hard to allocate through the kmalloc path
> > which causes a lot of direct reclaim and compaction and that hurts
> > performance a lot (see 8dc9384b7d75 ("xfs: reduce kvmalloc overhead for
> > CIL shadow buffers") for more details).
> > 
> > kvmalloc does support __GFP_RETRY_MAYFAIL semantic to express that
> > kmalloc (physically contiguous) allocation is preferred and we should go
> > more aggressive to make it happen. There is currently no way to express
> > that kmalloc should be very lightweight and as it has been argued [1]
> > this mode should be default to support kvmalloc(NOFAIL) with a
> > lightweight kmalloc path which is currently impossible to express as
> > __GFP_NOFAIL cannot be combined by any other reclaim modifiers.
> > 
> > This patch makes all kmalloc allocations GFP_NOWAIT unless
> > __GFP_RETRY_MAYFAIL is provided to kvmalloc. This allows to support both
> > fail fast and retry hard on physically contiguous memory with vmalloc
> > fallback.
> > 
> > There is a potential downside that relatively small allocations (smaller
> > than PAGE_ALLOC_COSTLY_ORDER) could fallback to vmalloc too easily and
> > cause page block fragmentation. We cannot really rule that out but it
> > seems that xlog_cil_kvmalloc use doesn't indicate this to be happening.
> > 
> > [1] https://lore.kernel.org/all/Z-3i1wATGh6vI8x8@dread.disaster.area/T/#u
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/slub.c | 8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index b46f87662e71..2da40c2f6478 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4972,14 +4972,16 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
> >  	 * We want to attempt a large physically contiguous block first because
> >  	 * it is less likely to fragment multiple larger blocks and therefore
> >  	 * contribute to a long term fragmentation less than vmalloc fallback.
> > -	 * However make sure that larger requests are not too disruptive - no
> > -	 * OOM killer and no allocation failure warnings as we have a fallback.
> > +	 * However make sure that larger requests are not too disruptive - i.e.
> > +	 * do not direct reclaim unless physically continuous memory is preferred
> > +	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
> > +	 * working in the background but the allocation itself.
> >  	 */
> >  	if (size > PAGE_SIZE) {
> >  		flags |= __GFP_NOWARN;
> >  
> >  		if (!(flags & __GFP_RETRY_MAYFAIL))
> > -			flags |= __GFP_NORETRY;
> > +			flags &= ~__GFP_DIRECT_RECLAIM;
> >  
> >  		/* nofail semantic is implemented by the vmalloc fallback */
> >  		flags &= ~__GFP_NOFAIL;
> > -- 
> > 2.49.0
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm: kvmalloc: make kmalloc fast path real fast path
Posted by Vlastimil Babka 1 day, 9 hours ago
On 4/9/25 9:35 AM, Michal Hocko wrote:
> On Thu 03-04-25 21:51:46, Michal Hocko wrote:
>> Add Andrew
> 
> Andrew, do you want me to repost the patch or can you take it from this
> email thread?

I'll take it as it's now all in mm/slub.c

>> Also, Dave do you want me to redirect xlog_cil_kvmalloc to kvmalloc or
>> do you preffer to do that yourself?
>>
>> On Thu 03-04-25 09:43:41, Michal Hocko wrote:
>>> There are users like xfs which need larger allocations with NOFAIL
>>> sementic. They are not using kvmalloc currently because the current
>>> implementation tries too hard to allocate through the kmalloc path
>>> which causes a lot of direct reclaim and compaction and that hurts
>>> performance a lot (see 8dc9384b7d75 ("xfs: reduce kvmalloc overhead for
>>> CIL shadow buffers") for more details).
>>>
>>> kvmalloc does support __GFP_RETRY_MAYFAIL semantic to express that
>>> kmalloc (physically contiguous) allocation is preferred and we should go
>>> more aggressive to make it happen. There is currently no way to express
>>> that kmalloc should be very lightweight and as it has been argued [1]
>>> this mode should be default to support kvmalloc(NOFAIL) with a
>>> lightweight kmalloc path which is currently impossible to express as
>>> __GFP_NOFAIL cannot be combined by any other reclaim modifiers.
>>>
>>> This patch makes all kmalloc allocations GFP_NOWAIT unless
>>> __GFP_RETRY_MAYFAIL is provided to kvmalloc. This allows to support both
>>> fail fast and retry hard on physically contiguous memory with vmalloc
>>> fallback.
>>>
>>> There is a potential downside that relatively small allocations (smaller
>>> than PAGE_ALLOC_COSTLY_ORDER) could fallback to vmalloc too easily and
>>> cause page block fragmentation. We cannot really rule that out but it
>>> seems that xlog_cil_kvmalloc use doesn't indicate this to be happening.
>>>
>>> [1] https://lore.kernel.org/all/Z-3i1wATGh6vI8x8@dread.disaster.area/T/#u
>>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>>> ---
>>>  mm/slub.c | 8 +++++---
>>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/slub.c b/mm/slub.c
>>> index b46f87662e71..2da40c2f6478 100644
>>> --- a/mm/slub.c
>>> +++ b/mm/slub.c
>>> @@ -4972,14 +4972,16 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
>>>  	 * We want to attempt a large physically contiguous block first because
>>>  	 * it is less likely to fragment multiple larger blocks and therefore
>>>  	 * contribute to a long term fragmentation less than vmalloc fallback.
>>> -	 * However make sure that larger requests are not too disruptive - no
>>> -	 * OOM killer and no allocation failure warnings as we have a fallback.
>>> +	 * However make sure that larger requests are not too disruptive - i.e.
>>> +	 * do not direct reclaim unless physically continuous memory is preferred
>>> +	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
>>> +	 * working in the background but the allocation itself.
>>>  	 */
>>>  	if (size > PAGE_SIZE) {
>>>  		flags |= __GFP_NOWARN;
>>>  
>>>  		if (!(flags & __GFP_RETRY_MAYFAIL))
>>> -			flags |= __GFP_NORETRY;
>>> +			flags &= ~__GFP_DIRECT_RECLAIM;
>>>  
>>>  		/* nofail semantic is implemented by the vmalloc fallback */
>>>  		flags &= ~__GFP_NOFAIL;
>>> -- 
>>> 2.49.0
>>>
>>
>> -- 
>> Michal Hocko
>> SUSE Labs
>
Re: [PATCH] mm: kvmalloc: make kmalloc fast path real fast path
Posted by Michal Hocko 1 day, 6 hours ago
On Wed 09-04-25 11:11:37, Vlastimil Babka wrote:
> On 4/9/25 9:35 AM, Michal Hocko wrote:
> > On Thu 03-04-25 21:51:46, Michal Hocko wrote:
> >> Add Andrew
> > 
> > Andrew, do you want me to repost the patch or can you take it from this
> > email thread?
> 
> I'll take it as it's now all in mm/slub.c

Thanks that will work as well.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm: kvmalloc: make kmalloc fast path real fast path
Posted by Vlastimil Babka 1 day, 6 hours ago
On 4/9/25 2:20 PM, Michal Hocko wrote:
> On Wed 09-04-25 11:11:37, Vlastimil Babka wrote:
>> On 4/9/25 9:35 AM, Michal Hocko wrote:
>>> On Thu 03-04-25 21:51:46, Michal Hocko wrote:
>>>> Add Andrew
>>>
>>> Andrew, do you want me to repost the patch or can you take it from this
>>> email thread?
>>
>> I'll take it as it's now all in mm/slub.c
> 
> Thanks that will work as well.

It's now in slab/for-next with Shakeel's ack and the updated comment you
discussed:

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/commit/?h=slab/for-6.16/fixes&id=bfedb6b93bc8d1dc02627beb43ceb466f42a4ed9
Re: [PATCH] mm: kvmalloc: make kmalloc fast path real fast path
Posted by Dave Chinner 1 day, 17 hours ago
On Thu, Apr 03, 2025 at 09:51:44PM +0200, Michal Hocko wrote:
> Add Andrew
> 
> Also, Dave do you want me to redirect xlog_cil_kvmalloc to kvmalloc or
> do you preffer to do that yourself?

I'll do it when the kvmalloc patches eventually land and I can do
back to back testing to determine if the new kvmalloc code behaves
as expected...

Please cc me on the new patches you send that modify the kvmalloc
behaviour.

-Dave.

-- 
Dave Chinner
david@fromorbit.com
Re: [PATCH] mm: kvmalloc: make kmalloc fast path real fast path
Posted by Shakeel Butt 1 week ago
On Thu, Apr 03, 2025 at 09:43:39AM +0200, Michal Hocko wrote:
> There are users like xfs which need larger allocations with NOFAIL
> sementic. They are not using kvmalloc currently because the current
> implementation tries too hard to allocate through the kmalloc path
> which causes a lot of direct reclaim and compaction and that hurts
> performance a lot (see 8dc9384b7d75 ("xfs: reduce kvmalloc overhead for
> CIL shadow buffers") for more details).
> 
> kvmalloc does support __GFP_RETRY_MAYFAIL semantic to express that
> kmalloc (physically contiguous) allocation is preferred and we should go
> more aggressive to make it happen. There is currently no way to express
> that kmalloc should be very lightweight and as it has been argued [1]
> this mode should be default to support kvmalloc(NOFAIL) with a
> lightweight kmalloc path which is currently impossible to express as
> __GFP_NOFAIL cannot be combined by any other reclaim modifiers.
> 
> This patch makes all kmalloc allocations GFP_NOWAIT unless
> __GFP_RETRY_MAYFAIL is provided to kvmalloc. This allows to support both
> fail fast and retry hard on physically contiguous memory with vmalloc
> fallback.
> 
> There is a potential downside that relatively small allocations (smaller
> than PAGE_ALLOC_COSTLY_ORDER) could fallback to vmalloc too easily and
> cause page block fragmentation. We cannot really rule that out but it
> seems that xlog_cil_kvmalloc use doesn't indicate this to be happening.
> 
> [1] https://lore.kernel.org/all/Z-3i1wATGh6vI8x8@dread.disaster.area/T/#u
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Re: [PATCH] mm: kvmalloc: make kmalloc fast path real fast path
Posted by Kees Cook 1 week ago
On Thu, Apr 03, 2025 at 09:43:39AM +0200, Michal Hocko wrote:
> There are users like xfs which need larger allocations with NOFAIL
> sementic. They are not using kvmalloc currently because the current
> implementation tries too hard to allocate through the kmalloc path
> which causes a lot of direct reclaim and compaction and that hurts
> performance a lot (see 8dc9384b7d75 ("xfs: reduce kvmalloc overhead for
> CIL shadow buffers") for more details).
> 
> kvmalloc does support __GFP_RETRY_MAYFAIL semantic to express that
> kmalloc (physically contiguous) allocation is preferred and we should go
> more aggressive to make it happen. There is currently no way to express
> that kmalloc should be very lightweight and as it has been argued [1]
> this mode should be default to support kvmalloc(NOFAIL) with a
> lightweight kmalloc path which is currently impossible to express as
> __GFP_NOFAIL cannot be combined by any other reclaim modifiers.
> 
> This patch makes all kmalloc allocations GFP_NOWAIT unless
> __GFP_RETRY_MAYFAIL is provided to kvmalloc. This allows to support both
> fail fast and retry hard on physically contiguous memory with vmalloc
> fallback.
> 
> There is a potential downside that relatively small allocations (smaller
> than PAGE_ALLOC_COSTLY_ORDER) could fallback to vmalloc too easily and
> cause page block fragmentation. We cannot really rule that out but it
> seems that xlog_cil_kvmalloc use doesn't indicate this to be happening.
> 
> [1] https://lore.kernel.org/all/Z-3i1wATGh6vI8x8@dread.disaster.area/T/#u
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Thanks for finding a solution for this! It makes way more sense to me to
kick over to vmap by default for kvmalloc users.

> ---
>  mm/slub.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index b46f87662e71..2da40c2f6478 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4972,14 +4972,16 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
>  	 * We want to attempt a large physically contiguous block first because
>  	 * it is less likely to fragment multiple larger blocks and therefore
>  	 * contribute to a long term fragmentation less than vmalloc fallback.
> -	 * However make sure that larger requests are not too disruptive - no
> -	 * OOM killer and no allocation failure warnings as we have a fallback.
> +	 * However make sure that larger requests are not too disruptive - i.e.
> +	 * do not direct reclaim unless physically continuous memory is preferred
> +	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
> +	 * working in the background but the allocation itself.

I think a word is missing here? "...but do the allocation..." or
"...allocation itself happens" ?

-- 
Kees Cook
Re: [PATCH] mm: kvmalloc: make kmalloc fast path real fast path
Posted by Darrick J. Wong 6 days, 3 hours ago
On Thu, Apr 03, 2025 at 09:21:50AM -0700, Kees Cook wrote:
> On Thu, Apr 03, 2025 at 09:43:39AM +0200, Michal Hocko wrote:
> > There are users like xfs which need larger allocations with NOFAIL
> > sementic. They are not using kvmalloc currently because the current
> > implementation tries too hard to allocate through the kmalloc path
> > which causes a lot of direct reclaim and compaction and that hurts
> > performance a lot (see 8dc9384b7d75 ("xfs: reduce kvmalloc overhead for
> > CIL shadow buffers") for more details).
> > 
> > kvmalloc does support __GFP_RETRY_MAYFAIL semantic to express that
> > kmalloc (physically contiguous) allocation is preferred and we should go
> > more aggressive to make it happen. There is currently no way to express
> > that kmalloc should be very lightweight and as it has been argued [1]
> > this mode should be default to support kvmalloc(NOFAIL) with a
> > lightweight kmalloc path which is currently impossible to express as
> > __GFP_NOFAIL cannot be combined by any other reclaim modifiers.
> > 
> > This patch makes all kmalloc allocations GFP_NOWAIT unless
> > __GFP_RETRY_MAYFAIL is provided to kvmalloc. This allows to support both
> > fail fast and retry hard on physically contiguous memory with vmalloc
> > fallback.
> > 
> > There is a potential downside that relatively small allocations (smaller
> > than PAGE_ALLOC_COSTLY_ORDER) could fallback to vmalloc too easily and
> > cause page block fragmentation. We cannot really rule that out but it
> > seems that xlog_cil_kvmalloc use doesn't indicate this to be happening.
> > 
> > [1] https://lore.kernel.org/all/Z-3i1wATGh6vI8x8@dread.disaster.area/T/#u
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Thanks for finding a solution for this! It makes way more sense to me to
> kick over to vmap by default for kvmalloc users.

Are 32-bit kernels still constrained by a small(ish) vmalloc space?
It's all fine for xlog_kvmalloc which will continue looping until
something makes progress, but tuning for those platforms isn't a
priority for most xfs developers AFAIK.

--D

> > ---
> >  mm/slub.c | 8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index b46f87662e71..2da40c2f6478 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4972,14 +4972,16 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
> >  	 * We want to attempt a large physically contiguous block first because
> >  	 * it is less likely to fragment multiple larger blocks and therefore
> >  	 * contribute to a long term fragmentation less than vmalloc fallback.
> > -	 * However make sure that larger requests are not too disruptive - no
> > -	 * OOM killer and no allocation failure warnings as we have a fallback.
> > +	 * However make sure that larger requests are not too disruptive - i.e.
> > +	 * do not direct reclaim unless physically continuous memory is preferred
> > +	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
> > +	 * working in the background but the allocation itself.
> 
> I think a word is missing here? "...but do the allocation..." or
> "...allocation itself happens" ?
> 
> -- 
> Kees Cook
>
Re: [PATCH] mm: kvmalloc: make kmalloc fast path real fast path
Posted by Michal Hocko 6 days, 23 hours ago
On Thu 03-04-25 09:21:50, Kees Cook wrote:
> On Thu, Apr 03, 2025 at 09:43:39AM +0200, Michal Hocko wrote:
[...]
> >  mm/slub.c | 8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index b46f87662e71..2da40c2f6478 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4972,14 +4972,16 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
> >  	 * We want to attempt a large physically contiguous block first because
> >  	 * it is less likely to fragment multiple larger blocks and therefore
> >  	 * contribute to a long term fragmentation less than vmalloc fallback.
> > -	 * However make sure that larger requests are not too disruptive - no
> > -	 * OOM killer and no allocation failure warnings as we have a fallback.
> > +	 * However make sure that larger requests are not too disruptive - i.e.
> > +	 * do not direct reclaim unless physically continuous memory is preferred
> > +	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
> > +	 * working in the background but the allocation itself.
> 
> I think a word is missing here? "...but do the allocation..." or
> "...allocation itself happens" ?

Thinking about this some more I would just cut this short and go with
"We still kick in kswapd/kcompactd to start working in the background"

Does that sound better?
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm: kvmalloc: make kmalloc fast path real fast path
Posted by Vlastimil Babka 1 week ago
On 4/3/25 09:43, Michal Hocko wrote:
> There are users like xfs which need larger allocations with NOFAIL
> sementic. They are not using kvmalloc currently because the current
> implementation tries too hard to allocate through the kmalloc path
> which causes a lot of direct reclaim and compaction and that hurts
> performance a lot (see 8dc9384b7d75 ("xfs: reduce kvmalloc overhead for
> CIL shadow buffers") for more details).
> 
> kvmalloc does support __GFP_RETRY_MAYFAIL semantic to express that
> kmalloc (physically contiguous) allocation is preferred and we should go
> more aggressive to make it happen. There is currently no way to express
> that kmalloc should be very lightweight and as it has been argued [1]
> this mode should be default to support kvmalloc(NOFAIL) with a
> lightweight kmalloc path which is currently impossible to express as
> __GFP_NOFAIL cannot be combined by any other reclaim modifiers.
> 
> This patch makes all kmalloc allocations GFP_NOWAIT unless
> __GFP_RETRY_MAYFAIL is provided to kvmalloc. This allows to support both
> fail fast and retry hard on physically contiguous memory with vmalloc
> fallback.
> 
> There is a potential downside that relatively small allocations (smaller
> than PAGE_ALLOC_COSTLY_ORDER) could fallback to vmalloc too easily and
> cause page block fragmentation. We cannot really rule that out but it
> seems that xlog_cil_kvmalloc use doesn't indicate this to be happening.
> 
> [1] https://lore.kernel.org/all/Z-3i1wATGh6vI8x8@dread.disaster.area/T/#u
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Looks like a step in the right direction, but is that enough?

- to replace xlog_kvmalloc(), we need to deal with kvmalloc() passing
VM_ALLOW_HUGE_VMAP, so we don't end up with GFP_KERNEL huge allocation
anyway (in practice maybe it wouldn't happen because "size >= PMD_SIZE"
required for the huge vmalloc is never true for current xlog_kvmalloc()
users but dunno if we can rely on that).

Maybe it's a bad idea to use VM_ALLOW_HUGE_VMAP in kvmalloc() anyway? Since
we're in a vmalloc fallback which means the huge allocations failed anyway
for the kmalloc() part. Maybe there's some grey area where it makes sense,
with size much larger than PMD_SIZE, e.g. exceeding MAX_PAGE_ORDER where we
can't kmalloc() anyway so at least try to assemble the allocation from huge
vmalloc. Maybe tie it to such a size check, or require __GFP_RETRY_MAYFAIL
to activate VM_ALLOW_HUGE_VMAP?

- we're still not addressing the original issue of high kcompactd activity,
but maybe the answer is that it needs to be investigated more (why deferred
compaction doesn't limit it) instead of trying to suppress it from kvmalloc()

> ---
>  mm/slub.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index b46f87662e71..2da40c2f6478 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4972,14 +4972,16 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
>  	 * We want to attempt a large physically contiguous block first because
>  	 * it is less likely to fragment multiple larger blocks and therefore
>  	 * contribute to a long term fragmentation less than vmalloc fallback.
> -	 * However make sure that larger requests are not too disruptive - no
> -	 * OOM killer and no allocation failure warnings as we have a fallback.
> +	 * However make sure that larger requests are not too disruptive - i.e.
> +	 * do not direct reclaim unless physically continuous memory is preferred
> +	 * (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to start
> +	 * working in the background but the allocation itself.
>  	 */
>  	if (size > PAGE_SIZE) {
>  		flags |= __GFP_NOWARN;
>  
>  		if (!(flags & __GFP_RETRY_MAYFAIL))
> -			flags |= __GFP_NORETRY;
> +			flags &= ~__GFP_DIRECT_RECLAIM;
>  
>  		/* nofail semantic is implemented by the vmalloc fallback */
>  		flags &= ~__GFP_NOFAIL;
Re: [PATCH] mm: kvmalloc: make kmalloc fast path real fast path
Posted by Michal Hocko 1 week ago
On Thu 03-04-25 10:24:56, Vlastimil Babka wrote:
[...]
> - to replace xlog_kvmalloc(), we need to deal with kvmalloc() passing
> VM_ALLOW_HUGE_VMAP, so we don't end up with GFP_KERNEL huge allocation
> anyway (in practice maybe it wouldn't happen because "size >= PMD_SIZE"
> required for the huge vmalloc is never true for current xlog_kvmalloc()
> users but dunno if we can rely on that).

I would just make that its own patch. Ideally with some numbers showing
there are code paths benefiting from the change.

> Maybe it's a bad idea to use VM_ALLOW_HUGE_VMAP in kvmalloc() anyway? Since
> we're in a vmalloc fallback which means the huge allocations failed anyway
> for the kmalloc() part. Maybe there's some grey area where it makes sense,
> with size much larger than PMD_SIZE, e.g. exceeding MAX_PAGE_ORDER where we
> can't kmalloc() anyway so at least try to assemble the allocation from huge
> vmalloc. Maybe tie it to such a size check, or require __GFP_RETRY_MAYFAIL
> to activate VM_ALLOW_HUGE_VMAP?

We didn't have that initially; 9becb6889130 ("kvmalloc: use vmalloc_huge
for vmalloc allocations") added it. I thought large allocations were
very optimistic (i.e. NOWAIT-like), but that doesn't seem to be the case.

As said above, I would just change that after we have any numbers to
support the removal.

> - we're still not addressing the original issue of high kcompactd activity,
> but maybe the answer is that it needs to be investigated more (why deferred
> compaction doesn't limit it) instead of trying to suppress it from kvmalloc()

yes this seems like something that should be investigated on the
compaction side.

Thanks!

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Dave Chinner 1 week ago
On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> > Have a look at xlog_kvmalloc() in XFS. It implements a basic
> > fast-fail, no retry high order kmalloc before it falls back to
> > vmalloc by turning off direct reclaim for the kmalloc() call.
> > Hence if the there isn't a high-order page on the free lists ready
> > to allocate, it falls back to vmalloc() immediately.
> > 
> > For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
> > overhead by around 80% when compared to a standard kvmalloc()
> > call. Numbers and profiles were documented in the commit message
> > (reproduced in whole below)...
> 
> Btw. it would be really great to have such concerns to be posted to the
> linux-mm ML so that we are aware of that.

I have brought it up in the past, along with all the other kvmalloc
API problems that are mentioned in that commit message.
Unfortunately, the discussion always ended up focused on calling context
and API flags (e.g. whether stuff like GFP_NOFS should be supported
or not), not the fast-fail-then-no-fail behaviour we need.

Yes, these discussions have resulted in API changes that support
some new subset of gfp flags, but the performance issues have never
been addressed...

> kvmalloc currently doesn't support GFP_NOWAIT semantic but it does allow
> to express - I prefer SLAB allocator over vmalloc.

The conditional use of __GFP_NORETRY for the kmalloc call is broken
if we try to use __GFP_NOFAIL with kvmalloc() - this causes the gfp
mask to hold __GFP_NOFAIL | __GFP_NORETRY....

We have a hard requirement for xlog_kvmalloc() to provide
__GFP_NOFAIL semantics.

IOWs, we need kvmalloc() to support kmalloc(GFP_NOWAIT) for
performance with fallback to vmalloc(__GFP_NOFAIL) for
correctness...
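
For reference, a minimal sketch of that pattern (illustrative only, not
the literal XFS code - the helper name and exact flag handling here are
assumptions):

	static void *fast_fail_kvmalloc(size_t size)
	{
		void *p;

		/*
		 * Optimistic high-order attempt: no direct reclaim, no
		 * retries. If a suitable block is not already sitting on the
		 * free lists, fail immediately instead of reclaiming or
		 * compacting.
		 */
		p = kmalloc(size, GFP_NOWAIT | __GFP_NOWARN);
		if (p)
			return p;

		/* Order-0 based fallback that provides the no-fail guarantee. */
		return __vmalloc(size, GFP_KERNEL | __GFP_NOFAIL);
	}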

> I think we could make
> the default kvmalloc slab path weaker by default as those who really
> want slab already have means to achieve that. There is a risk of long
> term fragmentation but I think this is worth trying

We've been doing this for a few years now in XFS, in a hot path that
can make on the order of a million xlog_kvmalloc() calls a second.
We've not seen any evidence that this causes or exacerbates memory
fragmentation....

-Dave.
-- 
Dave Chinner
david@fromorbit.com
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Shakeel Butt 1 week ago
On Thu, Apr 03, 2025 at 08:16:56AM +1100, Dave Chinner wrote:
> On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> > On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> > > Have a look at xlog_kvmalloc() in XFS. It implements a basic
> > > fast-fail, no retry high order kmalloc before it falls back to
> > > vmalloc by turning off direct reclaim for the kmalloc() call.
> > > Hence if the there isn't a high-order page on the free lists ready
> > > to allocate, it falls back to vmalloc() immediately.
> > > 
> > > For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
> > > overhead by around 80% when compared to a standard kvmalloc()
> > > call. Numbers and profiles were documented in the commit message
> > > (reproduced in whole below)...
> > 
> > Btw. it would be really great to have such concerns to be posted to the
> > linux-mm ML so that we are aware of that.
> 
> I have brought it up in the past, along with all the other kvmalloc
> API problems that are mentioned in that commit message.
> Unfortunately, discussion focus always ended up on calling context
> and API flags (e.g. whether stuff like GFP_NOFS should be supported
> or not) no the fast-fail-then-no-fail behaviour we need.
> 
> Yes, these discussions have resulted in API changes that support
> some new subset of gfp flags, but the performance issues have never
> been addressed...
> 
> > kvmalloc currently doesn't support GFP_NOWAIT semantic but it does allow
> > to express - I prefer SLAB allocator over vmalloc.
> 
> The conditional use of __GFP_NORETRY for the kmalloc call is broken
> if we try to use __GFP_NOFAIL with kvmalloc() - this causes the gfp
> mask to hold __GFP_NOFAIL | __GFP_NORETRY....
> 
> We have a hard requirement for xlog_kvmalloc() to provide
> __GFP_NOFAIL semantics.
> 
> IOWs, we need kvmalloc() to support kmalloc(GFP_NOWAIT) for
> performance with fallback to vmalloc(__GFP_NOFAIL) for
> correctness...
> 

Are you asking for the above kvmalloc() semantics just for XFS, or for
all users of the kvmalloc() API?

> > I think we could make
> > the default kvmalloc slab path weaker by default as those who really
> > want slab already have means to achieve that. There is a risk of long
> > term fragmentation but I think this is worth trying
> 
> We've been doing this for a few years now in XFS in a hot path that
> can make in the order of a million xlog_kvmalloc() calls a second.
> We've not seen any evidence that this causes or exacerbates memory
> fragmentation....
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Dave Chinner 1 week ago
On Wed, Apr 02, 2025 at 04:10:06PM -0700, Shakeel Butt wrote:
> On Thu, Apr 03, 2025 at 08:16:56AM +1100, Dave Chinner wrote:
> > On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> > > On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> > > > Have a look at xlog_kvmalloc() in XFS. It implements a basic
> > > > fast-fail, no retry high order kmalloc before it falls back to
> > > > vmalloc by turning off direct reclaim for the kmalloc() call.
> > > > Hence if the there isn't a high-order page on the free lists ready
> > > > to allocate, it falls back to vmalloc() immediately.
> > > > 
> > > > For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
> > > > overhead by around 80% when compared to a standard kvmalloc()
> > > > call. Numbers and profiles were documented in the commit message
> > > > (reproduced in whole below)...
> > > 
> > > Btw. it would be really great to have such concerns to be posted to the
> > > linux-mm ML so that we are aware of that.
> > 
> > I have brought it up in the past, along with all the other kvmalloc
> > API problems that are mentioned in that commit message.
> > Unfortunately, discussion focus always ended up on calling context
> > and API flags (e.g. whether stuff like GFP_NOFS should be supported
> > or not) no the fast-fail-then-no-fail behaviour we need.
> > 
> > Yes, these discussions have resulted in API changes that support
> > some new subset of gfp flags, but the performance issues have never
> > been addressed...
> > 
> > > kvmalloc currently doesn't support GFP_NOWAIT semantic but it does allow
> > > to express - I prefer SLAB allocator over vmalloc.
> > 
> > The conditional use of __GFP_NORETRY for the kmalloc call is broken
> > if we try to use __GFP_NOFAIL with kvmalloc() - this causes the gfp
> > mask to hold __GFP_NOFAIL | __GFP_NORETRY....
> > 
> > We have a hard requirement for xlog_kvmalloc() to provide
> > __GFP_NOFAIL semantics.
> > 
> > IOWs, we need kvmalloc() to support kmalloc(GFP_NOWAIT) for
> > performance with fallback to vmalloc(__GFP_NOFAIL) for
> > correctness...
> 
> Are you asking the above kvmalloc() semantics just for xfs or for all
> the users of kvmalloc() api? 

I'm suggesting that fast-fail should be the default behaviour for
everyone.

If you look at __vmalloc() internals, you'll see that it turns off
__GFP_NOFAIL for high order allocations because "reclaim is too
costly and it's far cheaper to fall back to order-0 pages".

That's pretty much exactly what we are doing with xlog_kvmalloc(),
and what I'm suggesting that kvmalloc should be doing by default.

i.e. If it's necessary for mm internal implementations to avoid
high-order reclaim when there is a faster order-0 allocation
fallback path available for performance reasons, then we should be
using that same behaviour anywhere optimistic high-order allocation
is used as an optimisation for those same performance reasons.

The overall __GFP_NOFAIL requirement is something XFS needs, but it
is most definitely not something that should be enabled by default.
However, it needs to work with kvmalloc(), and it is not possible to
do so right now.

-Dave.
-- 
Dave Chinner
david@fromorbit.com
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Shakeel Butt 1 week ago
On Thu, Apr 03, 2025 at 12:22:31PM +1100, Dave Chinner wrote:
> On Wed, Apr 02, 2025 at 04:10:06PM -0700, Shakeel Butt wrote:
> > On Thu, Apr 03, 2025 at 08:16:56AM +1100, Dave Chinner wrote:
> > > On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> > > > On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> > > > > Have a look at xlog_kvmalloc() in XFS. It implements a basic
> > > > > fast-fail, no retry high order kmalloc before it falls back to
> > > > > vmalloc by turning off direct reclaim for the kmalloc() call.
> > > > > Hence if the there isn't a high-order page on the free lists ready
> > > > > to allocate, it falls back to vmalloc() immediately.
> > > > > 
> > > > > For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
> > > > > overhead by around 80% when compared to a standard kvmalloc()
> > > > > call. Numbers and profiles were documented in the commit message
> > > > > (reproduced in whole below)...
> > > > 
> > > > Btw. it would be really great to have such concerns to be posted to the
> > > > linux-mm ML so that we are aware of that.
> > > 
> > > I have brought it up in the past, along with all the other kvmalloc
> > > API problems that are mentioned in that commit message.
> > > Unfortunately, discussion focus always ended up on calling context
> > > and API flags (e.g. whether stuff like GFP_NOFS should be supported
> > > or not) no the fast-fail-then-no-fail behaviour we need.
> > > 
> > > Yes, these discussions have resulted in API changes that support
> > > some new subset of gfp flags, but the performance issues have never
> > > been addressed...
> > > 
> > > > kvmalloc currently doesn't support GFP_NOWAIT semantic but it does allow
> > > > to express - I prefer SLAB allocator over vmalloc.
> > > 
> > > The conditional use of __GFP_NORETRY for the kmalloc call is broken
> > > if we try to use __GFP_NOFAIL with kvmalloc() - this causes the gfp
> > > mask to hold __GFP_NOFAIL | __GFP_NORETRY....
> > > 
> > > We have a hard requirement for xlog_kvmalloc() to provide
> > > __GFP_NOFAIL semantics.
> > > 
> > > IOWs, we need kvmalloc() to support kmalloc(GFP_NOWAIT) for
> > > performance with fallback to vmalloc(__GFP_NOFAIL) for
> > > correctness...
> > 
> > Are you asking the above kvmalloc() semantics just for xfs or for all
> > the users of kvmalloc() api? 
> 
> I'm suggesting that fast-fail should be the default behaviour for
> everyone.
> 
> If you look at __vmalloc() internals, you'll see that it turns off
> __GFP_NOFAIL for high order allocations because "reclaim is too
> costly and it's far cheaper to fall back to order-0 pages".
> 
> That's pretty much exactly what we are doing with xlog_kvmalloc(),
> and what I'm suggesting that kvmalloc should be doing by default.
> 
> i.e. If it's necessary for mm internal implementations to avoid
> high-order reclaim when there is a faster order-0 allocation
> fallback path available for performance reasons, then we should be
> using that same behaviour anywhere optimisitic high-order allocation
> is used as an optimisation for those same performance reasons.
> 

I am convinced, and I think Michal is on board with the above as well.
At least we should try it and see how it goes.

> The overall __GFP_NOFAIL requirement is something XFS needs, but it
> is most definitely not something that should be enabled by default.
> However, it needs to work with kvmalloc(), and it is not possible to
> do so right now.

Once kmalloc(GFP_NOWAIT) becomes the default in kvmalloc(), what remains
to be done to support kvmalloc(__GFP_NOFAIL)? (Yafang mentioned vmap_huge.)
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Michal Hocko 1 week ago
On Wed 02-04-25 22:05:57, Shakeel Butt wrote:
> On Thu, Apr 03, 2025 at 12:22:31PM +1100, Dave Chinner wrote:
> > On Wed, Apr 02, 2025 at 04:10:06PM -0700, Shakeel Butt wrote:
> > > On Thu, Apr 03, 2025 at 08:16:56AM +1100, Dave Chinner wrote:
> > > > On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> > > > > On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> > > > > > Have a look at xlog_kvmalloc() in XFS. It implements a basic
> > > > > > fast-fail, no retry high order kmalloc before it falls back to
> > > > > > vmalloc by turning off direct reclaim for the kmalloc() call.
> > > > > > Hence if the there isn't a high-order page on the free lists ready
> > > > > > to allocate, it falls back to vmalloc() immediately.
> > > > > > 
> > > > > > For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
> > > > > > overhead by around 80% when compared to a standard kvmalloc()
> > > > > > call. Numbers and profiles were documented in the commit message
> > > > > > (reproduced in whole below)...
> > > > > 
> > > > > Btw. it would be really great to have such concerns to be posted to the
> > > > > linux-mm ML so that we are aware of that.
> > > > 
> > > > I have brought it up in the past, along with all the other kvmalloc
> > > > API problems that are mentioned in that commit message.
> > > > Unfortunately, discussion focus always ended up on calling context
> > > > and API flags (e.g. whether stuff like GFP_NOFS should be supported
> > > > or not) no the fast-fail-then-no-fail behaviour we need.
> > > > 
> > > > Yes, these discussions have resulted in API changes that support
> > > > some new subset of gfp flags, but the performance issues have never
> > > > been addressed...

I, at least, was not aware of the performance aspect. We are trying to
make kvmalloc as usable as possible to prevent open-coded variants of it
from growing in subsystems.

> > > > > kvmalloc currently doesn't support GFP_NOWAIT semantic but it does allow
> > > > > to express - I prefer SLAB allocator over vmalloc.
> > > > 
> > > > The conditional use of __GFP_NORETRY for the kmalloc call is broken
> > > > if we try to use __GFP_NOFAIL with kvmalloc() - this causes the gfp
> > > > mask to hold __GFP_NOFAIL | __GFP_NORETRY....

Correct.

> > > > We have a hard requirement for xlog_kvmalloc() to provide
> > > > __GFP_NOFAIL semantics.
> > > > 
> > > > IOWs, we need kvmalloc() to support kmalloc(GFP_NOWAIT) for
> > > > performance with fallback to vmalloc(__GFP_NOFAIL) for
> > > > correctness...

Understood.

> > > Are you asking the above kvmalloc() semantics just for xfs or for all
> > > the users of kvmalloc() api? 
> > 
> > I'm suggesting that fast-fail should be the default behaviour for
> > everyone.
> > 
> > If you look at __vmalloc() internals, you'll see that it turns off
> > __GFP_NOFAIL for high order allocations because "reclaim is too
> > costly and it's far cheaper to fall back to order-0 pages".
> > 
> > That's pretty much exactly what we are doing with xlog_kvmalloc(),
> > and what I'm suggesting that kvmalloc should be doing by default.
> > 
> > i.e. If it's necessary for mm internal implementations to avoid
> > high-order reclaim when there is a faster order-0 allocation
> > fallback path available for performance reasons, then we should be
> > using that same behaviour anywhere optimisitic high-order allocation
> > is used as an optimisation for those same performance reasons.
> > 
> 
> I am convinced and I think Michal is onboard as well for the above. At
> least we should try and see how it goes.

If we find out that this doesn't really work because fragmentation of
page blocks turns out to be a real problem, then we might need to reconsider.

> > The overall __GFP_NOFAIL requirement is something XFS needs, but it
> > is most definitely not something that should be enabled by default.
> > However, it needs to work with kvmalloc(), and it is not possible to
> > do so right now.
> 
> After the kmalloc(GFP_NOWAIT) being default in kvmalloc(), what remains
> to support kvmalloc(__GFP_NOFAIL)? (Yafang mentioned vmap_huge)

We already do support kvmalloc(__GFP_NOFAIL) since 9376130c390a7 IIRC.
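
(i.e. something like kvmalloc(size, GFP_KERNEL | __GFP_NOFAIL) should
already work, with the no-fail guarantee provided by the vmalloc fallback
rather than by the kmalloc attempt.)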

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Yafang Shao 1 week ago
On Thu, Apr 3, 2025 at 9:22 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Apr 02, 2025 at 04:10:06PM -0700, Shakeel Butt wrote:
> > On Thu, Apr 03, 2025 at 08:16:56AM +1100, Dave Chinner wrote:
> > > On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> > > > On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> > > > > Have a look at xlog_kvmalloc() in XFS. It implements a basic
> > > > > fast-fail, no retry high order kmalloc before it falls back to
> > > > > vmalloc by turning off direct reclaim for the kmalloc() call.
> > > > > Hence if the there isn't a high-order page on the free lists ready
> > > > > to allocate, it falls back to vmalloc() immediately.
> > > > >
> > > > > For XFS, using xlog_kvmalloc() reduced the high-order per-allocation
> > > > > overhead by around 80% when compared to a standard kvmalloc()
> > > > > call. Numbers and profiles were documented in the commit message
> > > > > (reproduced in whole below)...
> > > >
> > > > Btw. it would be really great to have such concerns to be posted to the
> > > > linux-mm ML so that we are aware of that.
> > >
> > > I have brought it up in the past, along with all the other kvmalloc
> > > API problems that are mentioned in that commit message.
> > > Unfortunately, discussion focus always ended up on calling context
> > > and API flags (e.g. whether stuff like GFP_NOFS should be supported
> > > or not) no the fast-fail-then-no-fail behaviour we need.
> > >
> > > Yes, these discussions have resulted in API changes that support
> > > some new subset of gfp flags, but the performance issues have never
> > > been addressed...
> > >
> > > > kvmalloc currently doesn't support GFP_NOWAIT semantic but it does allow
> > > > to express - I prefer SLAB allocator over vmalloc.
> > >
> > > The conditional use of __GFP_NORETRY for the kmalloc call is broken
> > > if we try to use __GFP_NOFAIL with kvmalloc() - this causes the gfp
> > > mask to hold __GFP_NOFAIL | __GFP_NORETRY....
> > >
> > > We have a hard requirement for xlog_kvmalloc() to provide
> > > __GFP_NOFAIL semantics.
> > >
> > > IOWs, we need kvmalloc() to support kmalloc(GFP_NOWAIT) for
> > > performance with fallback to vmalloc(__GFP_NOFAIL) for
> > > correctness...
> >
> > Are you asking the above kvmalloc() semantics just for xfs or for all
> > the users of kvmalloc() api?
>
> I'm suggesting that fast-fail should be the default behaviour for
> everyone.
>
> If you look at __vmalloc() internals, you'll see that it turns off
> __GFP_NOFAIL for high order allocations because "reclaim is too
> costly and it's far cheaper to fall back to order-0 pages".

This behavior was introduced in commit 7de8728f55ff ("mm: vmalloc:
refactor vm_area_alloc_pages()") and only applies when
HAVE_ARCH_HUGE_VMALLOC is enabled (added in commit 121e6f3258fe,
"mm/vmalloc: hugepage vmalloc mappings").

Instead of disabling __GFP_NOFAIL for hugevmalloc allocations, perhaps
we could simply enforce "vmap_allow_huge= false" when __GFP_NOFAIL is
specified. Or we could ...
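
For the first idea, the check could be as small as something like this
(purely illustrative - the helper below and its parameters are made up to
show where such a test would sit; they are not existing code):

	/* Hypothetical helper: never use huge mappings for nofail requests. */
	static unsigned long vmalloc_adjust_vm_flags(unsigned long vm_flags,
						     gfp_t gfp_mask)
	{
		if (gfp_mask & __GFP_NOFAIL)
			vm_flags &= ~VM_ALLOW_HUGE_VMAP;
		return vm_flags;
	}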

>
> That's pretty much exactly what we are doing with xlog_kvmalloc(),
> and what I'm suggesting that kvmalloc should be doing by default.
>
> i.e. If it's necessary for mm internal implementations to avoid
> high-order reclaim when there is a faster order-0 allocation
> fallback path available for performance reasons, then we should be
> using that same behaviour anywhere optimisitic high-order allocation
> is used as an optimisation for those same performance reasons.
>
> The overall __GFP_NOFAIL requirement is something XFS needs, but it
> is most definitely not something that should be enabled by default.
> However, it needs to work with kvmalloc(), and it is not possible to
> do so right now.

1. Introduce a new vmalloc() flag to explicitly disable hugepage
mappings when needed (e.g., for __GFP_NOFAIL cases).
2. Extend kvmalloc() with finer control by allowing separate GFP flags
for kmalloc and vmalloc, plus an option to disable hugevmalloc:

  kvmalloc(size_t size, gfp_t kmalloc_flags, gfp_t vmalloc_flags,
           bool allow_hugevmalloc);

Then we could replace xlog_cil_kvmalloc() with:

  kvmalloc(size, GFP_NOWAIT, __GFP_NOFAIL, false);

This is just a preliminary idea...
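
To make the second idea a bit more concrete, a rough sketch of how such a
variant might be wired up (every name here is hypothetical, and callers
would pass complete gfp masks such as GFP_KERNEL | __GFP_NOFAIL for the
fallback):

	/*
	 * Hypothetical: separate gfp masks for the kmalloc attempt and the
	 * vmalloc fallback, plus an opt-out for huge vmalloc mappings.
	 */
	static void *kvmalloc_split_gfp(size_t size, gfp_t kmalloc_flags,
					gfp_t vmalloc_flags, bool allow_hugevmalloc)
	{
		void *p;

		/* Optimistic physically contiguous attempt, e.g. GFP_NOWAIT. */
		p = kmalloc(size, kmalloc_flags | __GFP_NOWARN);
		if (p)
			return p;

		/* Fallback that may carry stronger guarantees, e.g. __GFP_NOFAIL. */
		if (allow_hugevmalloc)
			return vmalloc_huge(size, vmalloc_flags);
		return __vmalloc(size, vmalloc_flags);
	}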

-- 
Regards
Yafang
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Matthew Wilcox 1 week, 1 day ago
On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> > > > > >+    /*
> > > > > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> > > > > >+     * allocations.
> > > > > >+     */
> > > > > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > > > > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > > >
> > > > > Why not move this check into kvmalloc family?
> > > >
> > > > Hmm should this check really be in kvmalloc family?
> > > 
> > > Modifying the existing kvmalloc functions risks performance regressions.
> > > Could we instead introduce a new variant like vkmalloc() (favoring
> > > vmalloc over kmalloc) or kvmalloc_costless()?
> > 
> > We should fix kvmalloc() instead of continuing to force
> > subsystems to work around the limitations of kvmalloc().
> 
> Agreed!
> 
> > Have a look at xlog_kvmalloc() in XFS. It implements a basic
> > fast-fail, no retry high order kmalloc before it falls back to
> > vmalloc by turning off direct reclaim for the kmalloc() call.
> > Hence if the there isn't a high-order page on the free lists ready
> > to allocate, it falls back to vmalloc() immediately.

... but if vmalloc fails, it goes around again!  This is exactly why
we don't want filesystems implementing workarounds for MM problems.
What a mess.

>  	if (size > PAGE_SIZE) {
>  		flags |= __GFP_NOWARN;
>  
>  		if (!(flags & __GFP_RETRY_MAYFAIL))
>  			flags |= __GFP_NORETRY;
> +		else
> +			flags &= ~__GFP_DIRECT_RECLAIM;

I think it might be better to do this:

		flags |= __GFP_NOWARN;

		if (!(flags & __GFP_RETRY_MAYFAIL))
			flags |= __GFP_NORETRY;
+		else if (size > (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
+			flags &= ~__GFP_DIRECT_RECLAIM;

I think it's entirely appropriate for a call to kvmalloc() to do
direct reclaim if it's asking for, say, 16KiB and we don't have any of
those available.  Better than exacerbating the fragmentation problem by
allocating 4x4KiB pages, each from different groupings.
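
(For reference: PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER is 32KiB with 4KiB
pages and PAGE_ALLOC_COSTLY_ORDER == 3, so the 16KiB example above sits
below that cut-off while the 256KB reads from the original report sit
well above it.)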
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Dave Chinner 1 week ago
On Wed, Apr 02, 2025 at 06:24:10PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> > On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> > > > > > >+    /*
> > > > > > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> > > > > > >+     * allocations.
> > > > > > >+     */
> > > > > > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > > > > > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > > > >
> > > > > > Why not move this check into kvmalloc family?
> > > > >
> > > > > Hmm should this check really be in kvmalloc family?
> > > > 
> > > > Modifying the existing kvmalloc functions risks performance regressions.
> > > > Could we instead introduce a new variant like vkmalloc() (favoring
> > > > vmalloc over kmalloc) or kvmalloc_costless()?
> > > 
> > > We should fix kvmalloc() instead of continuing to force
> > > subsystems to work around the limitations of kvmalloc().
> > 
> > Agreed!
> > 
> > > Have a look at xlog_kvmalloc() in XFS. It implements a basic
> > > fast-fail, no retry high order kmalloc before it falls back to
> > > vmalloc by turning off direct reclaim for the kmalloc() call.
> > > Hence if the there isn't a high-order page on the free lists ready
> > > to allocate, it falls back to vmalloc() immediately.
> 
> ... but if vmalloc fails, it goes around again!  This is exactly why
> we don't want filesystems implementing workarounds for MM problems.
> What a mess.

That's because we need __GFP_NOFAIL semantics for the overall
operation, and we can't pass that to kvmalloc() because it doesn't
support __GFP_NOFAIL. And when this code was written, vmalloc didn't
support __GFP_NOFAIL, either. We *had* to open code nofail
semantics, because the mm infrastructure did not provide it.

Yes, we can fix this now that __vmalloc(__GFP_NOFAIL) is a thing.
We still need to open code the kmalloc() side of the operation right
now because....

> >  	if (size > PAGE_SIZE) {
> >  		flags |= __GFP_NOWARN;
> >  
> >  		if (!(flags & __GFP_RETRY_MAYFAIL))
> >  			flags |= __GFP_NORETRY;

.... this is a built-in catch-22.

If we use kvmalloc(__GFP_NOFAIL), this code results in kmalloc
with __GFP_NORETRY | __GFP_NOFAIL flags set. i.e. we are telling
the allocation that it must not retry but it also must retry until
it succeeds.

To work around this, the caller then has to use __GFP_RETRY_MAYFAIL
| __GFP_NOFAIL, which is telling the allocation that it is allowed
to fail but it also must not fail. Again, this makes no sense at
all, and on top of that it doesn't give us the fast-fail semantics
we want from the kmalloc side of kvmalloc.

i.e. high order page allocation from kmalloc() is an optimisation,
not a requirement for kvmalloc(). If high order page allocation is
frequently more expensive than simply falling back to vmalloc(),
then we've made the wrong optimisation choices for the kvmalloc()
implementation...

> I think it might be better to do this:
> 
> 		flags |= __GFP_NOWARN;
> 
> 		if (!(flags & __GFP_RETRY_MAYFAIL))
> 			flags |= __GFP_NORETRY;
> +		else if (size > (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> +			flags &= ~__GFP_DIRECT_RECLAIM;
> 
> I think it's entirely appropriate for a call to kvmalloc() to do
> direct reclaim if it's asking for, say, 16KiB and we don't have any of
> those available.

I disagree - we have background compaction to address the lack of
high order folios in the allocator reserves. Let that do the work of
resolving the internal resource shortage instead of slowing down
allocations that *do not require high order pages to be allocated*.

> Better than exacerbating the fragmentation problem by
> allocating 4x4KiB pages, each from different groupings.

We have no evidence that this allocation behaviour in XFS causes or
exacerbates memory fragmentation. We have been running it in
production systems for a few years now....

-Dave.
-- 
Dave Chinner
david@fromorbit.com
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Shakeel Butt 1 week, 1 day ago
On Wed, Apr 02, 2025 at 06:24:10PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 02, 2025 at 02:24:45PM +0200, Michal Hocko wrote:
> > On Wed 02-04-25 22:32:14, Dave Chinner wrote:
> > > > > > >+    /*
> > > > > > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> > > > > > >+     * allocations.
> > > > > > >+     */
> > > > > > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > > > > > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> > > > > >
> > > > > > Why not move this check into kvmalloc family?
> > > > >
> > > > > Hmm should this check really be in kvmalloc family?
> > > > 
> > > > Modifying the existing kvmalloc functions risks performance regressions.
> > > > Could we instead introduce a new variant like vkmalloc() (favoring
> > > > vmalloc over kmalloc) or kvmalloc_costless()?
> > > 
> > > We should fix kvmalloc() instead of continuing to force
> > > subsystems to work around the limitations of kvmalloc().
> > 
> > Agreed!
> > 
> > > Have a look at xlog_kvmalloc() in XFS. It implements a basic
> > > fast-fail, no retry high order kmalloc before it falls back to
> > > vmalloc by turning off direct reclaim for the kmalloc() call.
> > > Hence if the there isn't a high-order page on the free lists ready
> > > to allocate, it falls back to vmalloc() immediately.
> 
> ... but if vmalloc fails, it goes around again!  This is exactly why
> we don't want filesystems implementing workarounds for MM problems.
> What a mess.
> 
> >  	if (size > PAGE_SIZE) {
> >  		flags |= __GFP_NOWARN;
> >  
> >  		if (!(flags & __GFP_RETRY_MAYFAIL))
> >  			flags |= __GFP_NORETRY;
> > +		else
> > +			flags &= ~__GFP_DIRECT_RECLAIM;
> 
> I think it might be better to do this:
> 
> 		flags |= __GFP_NOWARN;
> 
> 		if (!(flags & __GFP_RETRY_MAYFAIL))
> 			flags |= __GFP_NORETRY;
> +		else if (size > (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> +			flags &= ~__GFP_DIRECT_RECLAIM;

The above seems more appropriate than Michal's bigger hammer.
In addition, I think Vlastimil has a very good point about kswapd
reclaim for such cases (the patch explicitly complains about kcompactd
CPU usage).

> 
> I think it's entirely appropriate for a call to kvmalloc() to do
> direct reclaim if it's asking for, say, 16KiB and we don't have any of
> those available.  Better than exacerbating the fragmentation problem by
> allocating 4x4KiB pages, each from different groupings.
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Vlastimil Babka 1 week, 1 day ago
On 4/2/25 10:42, Yafang Shao wrote:
> On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>>
>> On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
>> >
>> >
>> > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
>> > >While investigating a kcompactd 100% CPU utilization issue in production, I
>> > >observed frequent costly high-order (order-6) page allocations triggered by
>> > >proc file reads from monitoring tools. This can be reproduced with a simple
>> > >test case:
>> > >
>> > >  fd = open(PROC_FILE, O_RDONLY);
>> > >  size = read(fd, buff, 256KB);
>> > >  close(fd);
>> > >
>> > >Although we should modify the monitoring tools to use smaller buffer sizes,
>> > >we should also enhance the kernel to prevent these expensive high-order
>> > >allocations.
>> > >
>> > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>> > >Cc: Josef Bacik <josef@toxicpanda.com>
>> > >---
>> > > fs/proc/proc_sysctl.c | 10 +++++++++-
>> > > 1 file changed, 9 insertions(+), 1 deletion(-)
>> > >
>> > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
>> > >index cc9d74a06ff0..c53ba733bda5 100644
>> > >--- a/fs/proc/proc_sysctl.c
>> > >+++ b/fs/proc/proc_sysctl.c
>> > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
>> > >     error = -ENOMEM;
>> > >     if (count >= KMALLOC_MAX_SIZE)
>> > >             goto out;
>> > >-    kbuf = kvzalloc(count + 1, GFP_KERNEL);
>> > >+
>> > >+    /*
>> > >+     * Use vmalloc if the count is too large to avoid costly high-order page
>> > >+     * allocations.
>> > >+     */
>> > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
>> > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
>> >
>> > Why not move this check into kvmalloc family?
>>
>> Hmm should this check really be in kvmalloc family?
> 
> Modifying the existing kvmalloc functions risks performance regressions.
> Could we instead introduce a new variant like vkmalloc() (favoring
> vmalloc over kmalloc) or kvmalloc_costless()?

We have gfp flags and kmalloc_gfp_adjust() to moderate how aggressive
kmalloc() is before the vmalloc() fallback. It does e.g.:

                if (!(flags & __GFP_RETRY_MAYFAIL))
                        flags |= __GFP_NORETRY;

However, if your problem is kcompactd utilization, then the kmalloc() attempt
would have to avoid ___GFP_KSWAPD_RECLAIM so that it does not wake up kswapd
and then kcompactd. Should we remove the flag for costly orders? Dunno. Ideally the
deferred compaction mechanism would limit the issue in the first place.
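
Concretely, that variant would be something along these lines in
kmalloc_gfp_adjust() (a sketch only, not a tested patch):

	if (size > PAGE_SIZE) {
		flags |= __GFP_NOWARN;

		if (!(flags & __GFP_RETRY_MAYFAIL))
			flags |= __GFP_NORETRY;

		/*
		 * Variant under discussion: for costly orders don't wake
		 * kswapd/kcompactd either, so a failed attempt falls straight
		 * through to the vmalloc() fallback without creating
		 * background compaction work.
		 */
		if (size > (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
			flags &= ~__GFP_KSWAPD_RECLAIM;

		/* nofail semantic is implemented by the vmalloc fallback */
		flags &= ~__GFP_NOFAIL;
	}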

Ad-hoc fixing of one particular place (/proc file reading), or creating
a new vkmalloc() and then spreading its use as other places are found to
trigger the issue, seems quite suboptimal to me.

>>
>> I don't think users would expect kvmalloc() to implictly decide on using
>> vmalloc() without trying kmalloc() first, just because it's a high-order
>> allocation.
>>
> 

Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Shakeel Butt 1 week, 1 day ago
On Wed, Apr 02, 2025 at 11:25:12AM +0200, Vlastimil Babka wrote:
> On 4/2/25 10:42, Yafang Shao wrote:
> > On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> >>
> >> On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> >> >
> >> >
> >> > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> >> > >While investigating a kcompactd 100% CPU utilization issue in production, I
> >> > >observed frequent costly high-order (order-6) page allocations triggered by
> >> > >proc file reads from monitoring tools. This can be reproduced with a simple
> >> > >test case:
> >> > >
> >> > >  fd = open(PROC_FILE, O_RDONLY);
> >> > >  size = read(fd, buff, 256KB);
> >> > >  close(fd);
> >> > >
> >> > >Although we should modify the monitoring tools to use smaller buffer sizes,
> >> > >we should also enhance the kernel to prevent these expensive high-order
> >> > >allocations.
> >> > >
> >> > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >> > >Cc: Josef Bacik <josef@toxicpanda.com>
> >> > >---
> >> > > fs/proc/proc_sysctl.c | 10 +++++++++-
> >> > > 1 file changed, 9 insertions(+), 1 deletion(-)
> >> > >
> >> > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> >> > >index cc9d74a06ff0..c53ba733bda5 100644
> >> > >--- a/fs/proc/proc_sysctl.c
> >> > >+++ b/fs/proc/proc_sysctl.c
> >> > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> >> > >     error = -ENOMEM;
> >> > >     if (count >= KMALLOC_MAX_SIZE)
> >> > >             goto out;
> >> > >-    kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >> > >+
> >> > >+    /*
> >> > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> >> > >+     * allocations.
> >> > >+     */
> >> > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> >> > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >> >
> >> > Why not move this check into kvmalloc family?
> >>
> >> Hmm should this check really be in kvmalloc family?
> > 
> > Modifying the existing kvmalloc functions risks performance regressions.
> > Could we instead introduce a new variant like vkmalloc() (favoring
> > vmalloc over kmalloc) or kvmalloc_costless()?
> 
> We have gfp flags and kmalloc_gfp_adjust() to moderate how aggressive
> kmalloc() is before the vmalloc() fallback. It does e.g.:
> 
>                 if (!(flags & __GFP_RETRY_MAYFAIL))
>                         flags |= __GFP_NORETRY;
> 
> However if your problem is kcompactd utilization then the kmalloc() attempt
> would have to avoid ___GFP_KSWAPD_RECLAIM to avoid waking up kswapd and then
> kcompactd. Should we remove the flag for costly orders? Dunno.

I agree with the following points (i.e. about ad-hoc fixing etc.). The above
point about removing kswapd reclaim for costly orders needs more thought.
Will we be hiding some compaction issues by doing so (i.e. with no kswapd
reclaim for costly orders)?

> Ideally the
> deferred compaction mechanism would limit the issue in the first place.
> 
> The ad-hoc fixing up of a particular place (/proc files reading) or creating
> a new vkmalloc() and then spreading its use as you see other places
> triggering the issue seems quite suboptimal to me.
> 
> >>
> >> I don't think users would expect kvmalloc() to implictly decide on using
> >> vmalloc() without trying kmalloc() first, just because it's a high-order
> >> allocation.
> >>
> > 
> 
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Michal Hocko 1 week, 1 day ago
On Wed 02-04-25 11:25:12, Vlastimil Babka wrote:
> On 4/2/25 10:42, Yafang Shao wrote:
> > On Wed, Apr 2, 2025 at 12:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> >>
> >> On Tue, Apr 01, 2025 at 07:01:04AM -0700, Kees Cook wrote:
> >> >
> >> >
> >> > On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> >> > >While investigating a kcompactd 100% CPU utilization issue in production, I
> >> > >observed frequent costly high-order (order-6) page allocations triggered by
> >> > >proc file reads from monitoring tools. This can be reproduced with a simple
> >> > >test case:
> >> > >
> >> > >  fd = open(PROC_FILE, O_RDONLY);
> >> > >  size = read(fd, buff, 256KB);
> >> > >  close(fd);
> >> > >
> >> > >Although we should modify the monitoring tools to use smaller buffer sizes,
> >> > >we should also enhance the kernel to prevent these expensive high-order
> >> > >allocations.
> >> > >
> >> > >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >> > >Cc: Josef Bacik <josef@toxicpanda.com>
> >> > >---
> >> > > fs/proc/proc_sysctl.c | 10 +++++++++-
> >> > > 1 file changed, 9 insertions(+), 1 deletion(-)
> >> > >
> >> > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> >> > >index cc9d74a06ff0..c53ba733bda5 100644
> >> > >--- a/fs/proc/proc_sysctl.c
> >> > >+++ b/fs/proc/proc_sysctl.c
> >> > >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> >> > >     error = -ENOMEM;
> >> > >     if (count >= KMALLOC_MAX_SIZE)
> >> > >             goto out;
> >> > >-    kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >> > >+
> >> > >+    /*
> >> > >+     * Use vmalloc if the count is too large to avoid costly high-order page
> >> > >+     * allocations.
> >> > >+     */
> >> > >+    if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> >> > >+            kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >> >
> >> > Why not move this check into kvmalloc family?
> >>
> >> Hmm should this check really be in kvmalloc family?
> > 
> > Modifying the existing kvmalloc functions risks performance regressions.
> > Could we instead introduce a new variant like vkmalloc() (favoring
> > vmalloc over kmalloc) or kvmalloc_costless()?
> 
> We have gfp flags and kmalloc_gfp_adjust() to moderate how aggressive
> kmalloc() is before the vmalloc() fallback. It does e.g.:
> 
>                 if (!(flags & __GFP_RETRY_MAYFAIL))
>                         flags |= __GFP_NORETRY;
> 
> However if your problem is kcompactd utilization then the kmalloc() attempt
> would have to avoid ___GFP_KSWAPD_RECLAIM to avoid waking up kswapd and then
> kcompactd. Should we remove the flag for costly orders? Dunno. Ideally the
> deferred compaction mechanism would limit the issue in the first place.

Yes, triggering heavy compaction for costly allocations seems quite
bad. We have __GFP_RETRY_MAYFAIL for that purpose, if the caller really
needs the allocation to try really hard.

> The ad-hoc fixing up of a particular place (/proc files reading) or creating
> a new vkmalloc() and then spreading its use as you see other places
> triggering the issue seems quite suboptimal to me.

Yes I absolutely agree.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] proc: Avoid costly high-order page allocations when reading proc files
Posted by Yafang Shao 1 week, 2 days ago
On Tue, Apr 1, 2025 at 10:01 PM Kees Cook <kees@kernel.org> wrote:
>
>
>
> On April 1, 2025 12:30:46 AM PDT, Yafang Shao <laoar.shao@gmail.com> wrote:
> >While investigating a kcompactd 100% CPU utilization issue in production, I
> >observed frequent costly high-order (order-6) page allocations triggered by
> >proc file reads from monitoring tools. This can be reproduced with a simple
> >test case:
> >
> >  fd = open(PROC_FILE, O_RDONLY);
> >  size = read(fd, buff, 256KB);
> >  close(fd);
> >
> >Although we should modify the monitoring tools to use smaller buffer sizes,
> >we should also enhance the kernel to prevent these expensive high-order
> >allocations.
> >
> >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >Cc: Josef Bacik <josef@toxicpanda.com>
> >---
> > fs/proc/proc_sysctl.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> >index cc9d74a06ff0..c53ba733bda5 100644
> >--- a/fs/proc/proc_sysctl.c
> >+++ b/fs/proc/proc_sysctl.c
> >@@ -581,7 +581,15 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
> >       error = -ENOMEM;
> >       if (count >= KMALLOC_MAX_SIZE)
> >               goto out;
> >-      kbuf = kvzalloc(count + 1, GFP_KERNEL);
> >+
> >+      /*
> >+       * Use vmalloc if the count is too large to avoid costly high-order page
> >+       * allocations.
> >+       */
> >+      if (count < (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> >+              kbuf = kvzalloc(count + 1, GFP_KERNEL);
>
> Why not move this check into kvmalloc family?

good suggestion.

>
> >+      else
> >+              kbuf = vmalloc(count + 1);
>
> You dropped the zeroing. This must be vzalloc.

Nice catch.

>
> >       if (!kbuf)
> >               goto out;
> >
>
> Alternatively, why not force count to be <PAGE_SIZE? What uses >PAGE_SIZE writes in proc/sys?

This would break backward compatibility with existing tools, so we
cannot enforce this restriction.

-- 
Regards
Yafang