Hello, folks.
1. This is a followup of the vmap topic that was highlighted at the LSFMMBPF-2023
conference. This small series attempts to mitigate the lock contention across the
vmap/vmalloc code. The problem is described here:
wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf
The material is tagged as a v2 version. It contains extra slides about testing
the throughput, the steps taken, and a comparison with the current approach.
2. Motivation.
- The vmap code does not scale with the number of CPUs and this should be fixed;
- XFS folks have complained several times that vmalloc can be contended on
  their workloads:
<snip>
commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
Author: Dave Chinner <dchinner@redhat.com>
Date: Tue Jan 4 17:22:18 2022 -0800
xfs: reduce kvmalloc overhead for CIL shadow buffers
Oh, let me count the ways that the kvmalloc API sucks dog eggs.
The problem is when we are logging lots of large objects, we hit
kvmalloc really damn hard with costly order allocations, and
behaviour utterly sucks:
...
<snip>
- If we can match the per-cpu allocator in terms of performance, we can
  remove it to simplify the vmap code. Currently we have three allocators.
  See:
<snip>
/*** Per cpu kva allocator ***/
/*
* vmap space is limited especially on 32 bit architectures. Ensure there is
* room for at least 16 percpu vmap blocks per CPU.
*/
/*
* If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
* to #define VMALLOC_SPACE (VMALLOC_END-VMALLOC_START). Guess
* instead (we just need a rough idea)
*/
<snip>
3. Test
On my AMD Ryzen Threadripper 3970X 32-Core Processor, I have the figures below:
jobs   1-page       1-page-this-patch
1 0.576131 vs 0.555889
2 2.68376 vs 1.07895
3 4.26502 vs 1.01739
4 6.04306 vs 1.28924
5 8.04786 vs 1.57616
6 9.38844 vs 1.78142
7 9.53481 vs 2.00172
8 10.4609 vs 2.15964
9 10.6089 vs 2.484
10 11.7372 vs 2.40443
11 11.5969 vs 2.71635
12 13.053 vs 2.6162
13 12.2973 vs 2.843
14 13.1429 vs 2.85714
15 13.7348 vs 2.90691
16 14.3687 vs 3.0285
17 14.8823 vs 3.05717
18 14.9571 vs 2.98018
19 14.9127 vs 3.0951
20 16.0786 vs 3.19521
21 15.8055 vs 3.24915
22 16.8087 vs 3.2521
23 16.7298 vs 3.25698
24 17.244 vs 3.36586
25 17.8416 vs 3.39384
26 18.8008 vs 3.40907
27 18.5591 vs 3.5254
28 19.761 vs 3.55468
29 20.06 vs 3.59869
30 20.4353 vs 3.6991
31 20.9082 vs 3.73028
32 21.0865 vs 3.82904
1..32 is the number of jobs. The results are in usec per
vmalloc()/vfree() pair.
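
For reference, below is a minimal, hypothetical sketch of this kind of
measurement: N kernel threads each hammer single-page vmalloc()/vfree()
pairs and report the average cost per pair. This is only an illustration
of the methodology, not the actual test harness behind the numbers above;
the loop count and thread naming are arbitrary.

<snip>
#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/math64.h>
#include <linux/timekeeping.h>
#include <linux/vmalloc.h>

static int vmalloc_pair_job(void *unused)
{
	unsigned int i, loops = 1000000;
	u64 start = ktime_get_ns();

	for (i = 0; i < loops; i++) {
		/* The "1-page" case from the table above. */
		void *p = vmalloc(PAGE_SIZE);

		if (!p)
			break;
		vfree(p);
	}

	pr_info("avg vmalloc()/vfree() pair: %llu ns over %u iterations\n",
		div_u64(ktime_get_ns() - start, i ? i : 1), i);
	return 0;
}

/*
 * One such kthread is started per job, e.g. for job n:
 *	kthread_run(vmalloc_pair_job, NULL, "vm-pair-job/%u", n);
 */
<snip>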
The series is based on the v6.3 tag and is considered a beta version. Please note
that it does not support the vread() functionality yet, so it is not fully
complete.
Any input/thoughts are welcome.
Uladzislau Rezki (Sony) (9):
mm: vmalloc: Add va_alloc() helper
mm: vmalloc: Rename adjust_va_to_fit_type() function
mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
mm: vmalloc: Add a per-CPU-zone infrastructure
mm: vmalloc: Insert busy-VA per-cpu zone
mm: vmalloc: Support multiple zones in vmallocinfo
mm: vmalloc: Insert lazy-VA per-cpu zone
mm: vmalloc: Offload free_vmap_area_lock global lock
mm: vmalloc: Scale and activate cvz_size
mm/vmalloc.c | 641 +++++++++++++++++++++++++++++++++++----------------
1 file changed, 448 insertions(+), 193 deletions(-)
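
As a rough structural sketch only, inferred from the patch titles above and
not taken from the series itself (the field names below are invented and the
real implementation certainly differs in detail): the idea is that each CPU
owns a zone covering a slice of vmalloc space, with its own locks for busy
and lazily-freed VAs, so most allocations and frees no longer serialize on
the global free_vmap_area_lock. The sketch assumes it lives in mm/vmalloc.c
with that file's existing includes.

<snip>
struct cpu_vmap_zone {
	/* The slice of vmalloc space this zone serves (its length is cvz_size). */
	unsigned long start;
	unsigned long end;

	/* Busy (allocated) VAs, looked up without taking a global lock. */
	spinlock_t busy_lock;
	struct rb_root busy_root;

	/* VAs queued for deferred (lazy) freeing and TLB flushing. */
	spinlock_t lazy_lock;
	struct list_head lazy_list;
};

static DEFINE_PER_CPU(struct cpu_vmap_zone, cpu_vmap_zone);
<snip>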
--
2.30.2
On Mon, May 22, 2023 at 01:08:40PM +0200, Uladzislau Rezki (Sony) wrote:
> Hello, folk.
>
> 1. This is a followup of the vmap topic that was highlighted at the LSFMMBPF-2023
> conference. This small serial attempts to mitigate the contention across the
> vmap/vmalloc code. The problem is described here:
>

Hello Uladzislau, thank you for the work!

> wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf

I ran the exactly same command but couldn't download the file, did I
miss something?

$ wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf
[...]
==> PASV ... done.    ==> RETR Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf ...
No such file `Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf'.

> The material is tagged as a v2 version. It contains extra slides about testing
> the throughput, steps and comparison with a current approach.
>
> 2. Motivation.
>
> - The vmap code is not scalled to number of CPUs and this should be fixed;
> - XFS folk has complained several times that vmalloc might be contented on
>   their workloads:
>
> <snip>
> commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> Author: Dave Chinner <dchinner@redhat.com>
> Date:   Tue Jan 4 17:22:18 2022 -0800
>
>     xfs: reduce kvmalloc overhead for CIL shadow buffers
>
>     Oh, let me count the ways that the kvmalloc API sucks dog eggs.
>
>     The problem is when we are logging lots of large objects, we hit
>     kvmalloc really damn hard with costly order allocations, and
>     behaviour utterly sucks:

based on the commit I guess xfs should use vmalloc/kvmalloc is because
it allocates large buffers, how large could it be?

> 3. Test
>
> On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
>
>      1-page       1-page-this-patch
> 1    0.576131     vs   0.555889
> 2    2.68376      vs   1.07895
> 3    4.26502      vs   1.01739
> 4    6.04306      vs   1.28924
> 5    8.04786      vs   1.57616
> 6    9.38844      vs   1.78142

<snip>

> 29   20.06        vs   3.59869
> 30   20.4353      vs   3.6991
> 31   20.9082      vs   3.73028
> 32   21.0865      vs   3.82904
>
> 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> pair throughput.

I would be more interested in real numbers than synthetic benchmarks,
Maybe XFS folks could help performing profiling similar to commit 8dc9384b7d750
with and without this patchset?

By the way looking at the commit, teaching __p?d_alloc() about gfp
context (that I'm _slowly_ working on...) could be nice for allowing
non-GFP_KERNEL kvmalloc allocations, as Matthew mentioned. [1]

Thanks!

[1] https://lore.kernel.org/linux-mm/Y%2FOHC33YLedMXTlD@casper.infradead.org

--
Hyeonggon Yoo
Doing kernel stuff as a hobby
Undergraduate | Chungnam National University
Dept. Computer Science & Engineering
On Tue, May 23, 2023 at 08:59:05PM +0900, Hyeonggon Yoo wrote:
> On Mon, May 22, 2023 at 01:08:40PM +0200, Uladzislau Rezki (Sony) wrote:
> > Hello, folk.
> >
> > 1. This is a followup of the vmap topic that was highlighted at the LSFMMBPF-2023
> > conference. This small serial attempts to mitigate the contention across the
> > vmap/vmalloc code. The problem is described here:
> >
>
> Hello Uladzislau, thank you for the work!
>
> > wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf
>
> I ran the exactly same command but couldn't download the file, did I
> miss something?
>
wget ftp://vps418301.ovh.net/incoming/Mitigate_a_vmalloc_lock_contention_in_SMP_env_v2.pdf

Sorry, i renamed the file name :)

> $ wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf
> [...]
> ==> PASV ... done.    ==> RETR Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf ...
> No such file `Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf'.
>
> > The material is tagged as a v2 version. It contains extra slides about testing
> > the throughput, steps and comparison with a current approach.
> >
> > 2. Motivation.
> >
> > - The vmap code is not scalled to number of CPUs and this should be fixed;
> > - XFS folk has complained several times that vmalloc might be contented on
> >   their workloads:
> >
> > <snip>
> > commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> > Author: Dave Chinner <dchinner@redhat.com>
> > Date:   Tue Jan 4 17:22:18 2022 -0800
> >
> >     xfs: reduce kvmalloc overhead for CIL shadow buffers
> >
> >     Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> >
> >     The problem is when we are logging lots of large objects, we hit
> >     kvmalloc really damn hard with costly order allocations, and
> >     behaviour utterly sucks:
>
> based on the commit I guess xfs should use vmalloc/kvmalloc is because
> it allocates large buffers, how large could it be?
>
They use kvmalloc(). When the page allocator is not able to serve a
request they fallback to vmalloc. At least what i see, the sizes are:

from 73728 up to 1048576, i.e. 18 pages up to 256 pages.

> > 3. Test
> >
> > On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
> >
> >      1-page       1-page-this-patch
> > 1    0.576131     vs   0.555889
> > 2    2.68376      vs   1.07895
> > 3    4.26502      vs   1.01739
> > 4    6.04306      vs   1.28924
> > 5    8.04786      vs   1.57616
> > 6    9.38844      vs   1.78142
>
> <snip>
>
> > 29   20.06        vs   3.59869
> > 30   20.4353      vs   3.6991
> > 31   20.9082      vs   3.73028
> > 32   21.0865      vs   3.82904
> >
> > 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> > pair throughput.
>
> I would be more interested in real numbers than synthetic benchmarks,
> Maybe XFS folks could help performing profiling similar to commit 8dc9384b7d750
> with and without this patchset?
>
I added Dave Chinner <david@fromorbit.com> to this thread. But. The
contention exists. Apart of that per-cpu-KVA allocator can go away if
we make it generic instead.

> By the way looking at the commit, teaching __p?d_alloc() about gfp
> context (that I'm _slowly_ working on...) could be nice for allowing
> non-GFP_KERNEL kvmalloc allocations, as Matthew mentioned. [1]
>
> Thanks!
>
> [1] https://lore.kernel.org/linux-mm/Y%2FOHC33YLedMXTlD@casper.infradead.org
>
Thanks!

--
Uladzislau Rezki
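
As a side note for readers not familiar with the fallback being discussed
above, here is a minimal sketch of the kvmalloc() pattern; the function name
is made up and the size is just a parameter taken from the observed
73728..1048576 range.

<snip>
#include <linux/slab.h>		/* kvmalloc(), kvfree() */

static int log_one_object(size_t size)
{
	/*
	 * kvmalloc() first tries kmalloc(); when a large, costly-order
	 * request cannot be satisfied by the page allocator it falls back
	 * to vmalloc() - and that fallback is where the vmap lock
	 * contention shows up.
	 */
	void *shadow = kvmalloc(size, GFP_KERNEL);

	if (!shadow)
		return -ENOMEM;

	/* ... fill the buffer, hand it to the log, etc. ... */

	kvfree(shadow);		/* correct for both the kmalloc and vmalloc cases */
	return 0;
}
<snip>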
On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > > 2. Motivation.
> > >
> > > - The vmap code is not scalled to number of CPUs and this should be fixed;
> > > - XFS folk has complained several times that vmalloc might be contented on
> > >   their workloads:
> > >
> > > [...]
> >
> > based on the commit I guess xfs should use vmalloc/kvmalloc is because
> > it allocates large buffers, how large could it be?
> >
> They use kvmalloc(). When the page allocator is not able to serve a
> request they fallback to vmalloc. At least what i see, the sizes are:
>
> from 73728 up to 1048576, i.e. 18 pages up to 256 pages.
>
> > > 3. Test
> > >
> > > On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
> > >
> > > [...]
> > >
> > > 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> > > pair throughput.
> >
> > I would be more interested in real numbers than synthetic benchmarks,
> > Maybe XFS folks could help performing profiling similar to commit 8dc9384b7d750
> > with and without this patchset?
> >
> I added Dave Chinner <david@fromorbit.com> to this thread.

Oh, I missed that, and it would be better to [+Cc linux-xfs]

> But. The contention exists.

I think "theoretically can be contended" doesn't necessarily mean it's actually
contended in the real world.

Also I find it difficult to imagine vmalloc being highly contended because it was
historically considered slow and thus discouraged when performance is important.

IOW vmalloc would not be contended when allocation size is small because we have
kmalloc/buddy API, and therefore I wonder which workloads are allocating very large
buffers and at the same time allocating very frequently, thus performance-sensitive.

I am not against this series, but wondering which workloads would benefit ;)

> Apart of that per-cpu-KVA allocator can go away if we make it generic instead.

Not sure I understand your point, can you elaborate please?

And I would like to ask some side questions:

1. Is vm_[un]map_ram() API still worth with this patchset?

2. How does this patchset deals with 32-bit machines where
vmalloc address space is limited?

Thanks!

> > By the way looking at the commit, teaching __p?d_alloc() about gfp
> > context (that I'm _slowly_ working on...) could be nice for allowing
> > non-GFP_KERNEL kvmalloc allocations, as Matthew mentioned. [1]
> >
> > Thanks!
> >
> > [1] https://lore.kernel.org/linux-mm/Y%2FOHC33YLedMXTlD@casper.infradead.org
> >

--
Hyeonggon Yoo
Doing kernel stuff as a hobby
Undergraduate | Chungnam National University
Dept. Computer Science & Engineering
On Wed, May 24, 2023 at 03:04:28AM +0900, Hyeonggon Yoo wrote:
> On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > [...]
> >
> > I added Dave Chinner <david@fromorbit.com> to this thread.
>
> Oh, I missed that, and it would be better to [+Cc linux-xfs]
>
> > But. The contention exists.
>
> I think "theoretically can be contended" doesn't necessarily mean it's actually
> contended in the real world.
>
> Also I find it difficult to imagine vmalloc being highly contended because it was
> historically considered slow and thus discouraged when performance is important.
>
> IOW vmalloc would not be contended when allocation size is small because we have
> kmalloc/buddy API, and therefore I wonder which workloads are allocating very large
> buffers and at the same time allocating very frequently, thus performance-sensitive.
>
> I am not against this series, but wondering which workloads would benefit ;)
>
> > Apart of that per-cpu-KVA allocator can go away if we make it generic instead.
>
> Not sure I understand your point, can you elaborate please?
>
> And I would like to ask some side questions:
>
> 1. Is vm_[un]map_ram() API still worth with this patchset?
>
It is up to community to decide. As i see XFS needs it also. Maybe in
the future it can be removed(who knows). If the vmalloc code itself can
deliver such performance as vm_map* APIs.

> 2. How does this patchset deals with 32-bit machines where
> vmalloc address space is limited?
>
It can deal without any problems. Though i am not sure it is needed
for 32-bit systems. The reason is, the vmalloc code was a bit slow when
it comes to lookup time, it used to be O(n). After that it was improved
to O(logn).

vm_map_ram() and friends interface was added because of vmalloc
drawbacks. I am not sure that there are 32-bit systems with 10/20/30...
CPUs on board. In that case it is worth care about contention.

--
Uladzislau Rezki
On Wed, May 24, 2023 at 11:50:12AM +0200, Uladzislau Rezki wrote:
> On Wed, May 24, 2023 at 03:04:28AM +0900, Hyeonggon Yoo wrote:
> > On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > And I would like to ask some side questions:
> >
> > 1. Is vm_[un]map_ram() API still worth with this patchset?
> >
> It is up to community to decide. As i see XFS needs it also. Maybe in
> the future it can be removed(who knows). If the vmalloc code itself can
> deliver such performance as vm_map* APIs.
vm_map* APIs cannot be replaced with vmalloc, they cover a very
different use case. i.e. vmalloc allocates mapped memory,
vm_map_ram() maps allocated memory....
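
To make that distinction concrete, a small illustrative sketch (the two
APIs are the real kernel interfaces; the size, page source and lack of
error handling here are made up for illustration):

<snip>
#include <linux/mm.h>
#include <linux/numa.h>
#include <linux/vmalloc.h>

/* vmalloc(): allocates the backing pages AND maps them contiguously. */
static void *vmalloc_style(void)
{
	void *buf = vmalloc(16 * PAGE_SIZE);

	/* ... use buf ... */
	return buf;		/* released later with vfree(buf) */
}

/*
 * vm_map_ram(): only maps pages the caller already owns (for example an
 * xfs_buf's page array); allocating and freeing the pages themselves
 * remains the caller's job.
 */
static void *vm_map_ram_style(struct page **pages, unsigned int nr_pages)
{
	void *addr = vm_map_ram(pages, nr_pages, NUMA_NO_NODE);

	/* ... use addr ... */
	return addr;		/* released later with vm_unmap_ram(addr, nr_pages) */
}
<snip>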
> vm_map_ram() and friends interface was added because of vmalloc drawbacks.
No. vm_map*() were scalability improvements added in 2009 to replace
vmap() and vunmap() in order to avoid global lock contention in the vmap
allocator that XFS had been working around for years with its own
internal vmap cache....
commit 95f8e302c04c0b0c6de35ab399a5551605eeb006
Author: Nicholas Piggin <npiggin@gmail.com>
Date: Tue Jan 6 14:43:09 2009 +1100
[XFS] use scalable vmap API
Implement XFS's large buffer support with the new vmap APIs. See the vmap
rewrite (db64fe02) for some numbers. The biggest improvement that comes from
using the new APIs is avoiding the global KVA allocation lock on every call.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
vmap/vunmap() themselves were introduced in 2.5.32 (2002) and before
that XFS was using remap_page_array() and vfree() in exactly the
same way it uses vm_map_ram() and vm_unmap_ram() today....
XFS has a long, long history of causing virtual memory allocator
scalability and contention problems. As you can see, this isn't our
first rodeo...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
On Thu, May 25, 2023 at 07:56:56AM +1000, Dave Chinner wrote:
> On Wed, May 24, 2023 at 11:50:12AM +0200, Uladzislau Rezki wrote:
> > On Wed, May 24, 2023 at 03:04:28AM +0900, Hyeonggon Yoo wrote:
> > > On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > > And I would like to ask some side questions:
> > >
> > > 1. Is vm_[un]map_ram() API still worth with this patchset?
> > >
> > It is up to community to decide. As i see XFS needs it also. Maybe in
> > the future it can be removed(who knows). If the vmalloc code itself can
> > deliver such performance as vm_map* APIs.
>
> vm_map* APIs cannot be replaced with vmalloc, they cover a very
> different use case. i.e. vmalloc allocates mapped memory,
> vm_map_ram() maps allocated memory....
>
> > vm_map_ram() and friends interface was added because of vmalloc drawbacks.
>
> No. vm_map*() were scalability improvements added in 2009 to replace
> on vmap() and vunmap() to avoid global lock contention in the vmap
> allocator that XFS had been working around for years with it's own
> internal vmap cache....
>
> commit 95f8e302c04c0b0c6de35ab399a5551605eeb006
> Author: Nicholas Piggin <npiggin@gmail.com>
> Date: Tue Jan 6 14:43:09 2009 +1100
>
> [XFS] use scalable vmap API
>
> Implement XFS's large buffer support with the new vmap APIs. See the vmap
> rewrite (db64fe02) for some numbers. The biggest improvement that comes from
> using the new APIs is avoiding the global KVA allocation lock on every call.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> Reviewed-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
>
> vmap/vunmap() themselves were introduce in 2.5.32 (2002) and before
> that XFS was using remap_page_array() and vfree() in exactly the
> same way it uses vm_map_ram() and vm_unmap_ram() today....
>
> XFS has a long, long history of causing virtual memory allocator
> scalability and contention problems. As you can see, this isn't our
> first rodeo...
>
Let me be more specific, sorry, it looks like there is a misunderstanding.
I am talking about removing the vb_alloc()/vb_free() per-cpu stuff, if
alloc_vmap_area() gives the same performance:
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d50c551592fc..a1687bbdad30 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2503,12 +2503,6 @@ void vm_unmap_ram(const void *mem, unsigned int count)
kasan_poison_vmalloc(mem, size);
- if (likely(count <= VMAP_MAX_ALLOC)) {
- debug_check_no_locks_freed(mem, size);
- vb_free(addr, size);
- return;
- }
-
va = find_unlink_vmap_area(addr);
if (WARN_ON_ONCE(!va))
return;
@@ -2539,12 +2533,6 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
unsigned long addr;
void *mem;
- if (likely(count <= VMAP_MAX_ALLOC)) {
- mem = vb_alloc(size, GFP_KERNEL);
- if (IS_ERR(mem))
- return NULL;
- addr = (unsigned long)mem;
- } else {
struct vmap_area *va;
va = alloc_vmap_area(size, PAGE_SIZE,
VMALLOC_START, VMALLOC_END,
@@ -2554,7 +2542,6 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
addr = va->va_start;
mem = (void *)addr;
- }
if (vmap_pages_range(addr, addr + size, PAGE_KERNEL,
pages, PAGE_SHIFT) < 0) {
+ other related parts.
--
Uladzislau Rezki
On Thu, May 25, 2023 at 07:56:56AM +1000, Dave Chinner wrote:
> > It is up to community to decide. As i see XFS needs it also. Maybe in
> > the future it can be removed(who knows). If the vmalloc code itself can
> > deliver such performance as vm_map* APIs.
>
> vm_map* APIs cannot be replaced with vmalloc, they cover a very
> different use case. i.e. vmalloc allocates mapped memory,
> vm_map_ram() maps allocated memory....
>
> > vm_map_ram() and friends interface was added because of vmalloc drawbacks.
>
> No. vm_map*() were scalability improvements added in 2009 to replace
> on vmap() and vunmap() to avoid global lock contention in the vmap
> allocator that XFS had been working around for years with it's own
> internal vmap cache....

All of that is true. At the same time XFS could very much switch to
vmalloc for !XBF_UNMAPPED && size > PAGE_SIZE buffers IFF that provided
an advantage.

The need for vmap and then vm_map_* initially came from the fact that
we were using the page cache to back the xfs_buf (or page_buf back
then). With your work on getting rid of the pagecache usage we could
just use vmalloc now if we wanted to and it improves performance. Or
at some point we could even look into just using large folios for that
with the infrastructure for that improving a lot (but I suspect we're
not quite there yet).

But there are other uses of vm_map_* that can't do that, and they will
benefit from the additional scalability as well even IFF just using
vmalloc was fine for XFS now (I don't know, I haven't actually looked
into it).
On Wed, May 24, 2023 at 03:04:28AM +0900, Hyeonggon Yoo wrote:
> On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > > > 2. Motivation.
> > > >
> > > > - The vmap code is not scalled to number of CPUs and this should be fixed;
> > > > - XFS folk has complained several times that vmalloc might be contented on
> > > > their workloads:
> > > >
> > > > <snip>
> > > > commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> > > > Author: Dave Chinner <dchinner@redhat.com>
> > > > Date: Tue Jan 4 17:22:18 2022 -0800
> > > >
> > > > xfs: reduce kvmalloc overhead for CIL shadow buffers
> > > >
> > > > Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> > > >
> > > > The problem is when we are logging lots of large objects, we hit
> > > > kvmalloc really damn hard with costly order allocations, and
> > > > behaviour utterly sucks:
> > >
> > > based on the commit I guess xfs should use vmalloc/kvmalloc is because
> > > it allocates large buffers, how large could it be?
> > >
> > They use kvmalloc(). When the page allocator is not able to serve a
> > request they fallback to vmalloc. At least what i see, the sizes are:
> >
> > from 73728 up to 1048576, i.e. 18 pages up to 256 pages.
> >
> > > > 3. Test
> > > >
> > > > On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
> > > >
> > > > 1-page 1-page-this-patch
> > > > 1 0.576131 vs 0.555889
> > > > 2 2.68376 vs 1.07895
> > > > 3 4.26502 vs 1.01739
> > > > 4 6.04306 vs 1.28924
> > > > 5 8.04786 vs 1.57616
> > > > 6 9.38844 vs 1.78142
> > >
> > > <snip>
> > >
> > > > 29 20.06 vs 3.59869
> > > > 30 20.4353 vs 3.6991
> > > > 31 20.9082 vs 3.73028
> > > > 32 21.0865 vs 3.82904
> > > >
> > > > 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> > > > pair throughput.
> > >
> > > I would be more interested in real numbers than synthetic benchmarks,
> > > Maybe XFS folks could help performing profiling similar to commit 8dc9384b7d750
> > > with and without this patchset?
> > >
> > I added Dave Chinner <david@fromorbit.com> to this thread.
>
> Oh, I missed that, and it would be better to [+Cc linux-xfs]
>
> > But. The contention exists.
>
> I think "theoretically can be contended" doesn't necessarily mean it's actually
> contended in the real world.
Did you not read the commit message for the XFS commit documented
above? vmalloc lock contention most certainly does exist in the
real world and the profiles in commit 8dc9384b7d75 ("xfs: reduce
kvmalloc overhead for CIL shadow buffers") document it clearly.
> Also I find it difficult to imagine vmalloc being highly contended because it was
> historically considered slow and thus discouraged when performance is important.
Read the above XFS commit.
We use vmalloc in critical high performance fast paths that cannot
tolerate high order memory allocation failure. XFS runs this
fast path millions of times a second, and will call into
vmalloc() several hundred thousands times a second with machine wide
concurrency under certain types of workloads.
> IOW vmalloc would not be contended when allocation size is small because we have
> kmalloc/buddy API, and therefore I wonder which workloads are allocating very large
> buffers and at the same time allocating very frequently, thus performance-sensitive.
>
> I am not against this series, but wondering which workloads would benefit ;)
Yup, you need to read the XFS commit message. If you understand what
is in that commit message, then you wouldn't be doubting that
vmalloc contention is real and that it is used in high performance
fast paths that are traversed millions of times a second....
> > Apart of that per-cpu-KVA allocator can go away if we make it generic instead.
>
> Not sure I understand your point, can you elaborate please?
>
> And I would like to ask some side questions:
>
> 1. Is vm_[un]map_ram() API still worth with this patchset?
XFS also uses this interface for mapping multi-page buffers in the
XFS buffer cache. These are the items that also require the high
order costly kvmalloc allocations in the transaction commit path
when they are modified.
So, yes, we need these mapping interfaces to scale just as well as
vmalloc itself....
> 2. How does this patchset deals with 32-bit machines where
> vmalloc address space is limited?
From the XFS side, we just don't care about 32 bit machines at all.
XFS is aimed at server and HPC environments which have been entirely
64 bit for a long, long time now...
-Dave.
--
Dave Chinner
david@fromorbit.com
On Wed, May 24, 2023 at 07:43:38AM +1000, Dave Chinner wrote:
> On Wed, May 24, 2023 at 03:04:28AM +0900, Hyeonggon Yoo wrote:
> > On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > > > > 2. Motivation.
> > > > >
> > > > > - The vmap code is not scalled to number of CPUs and this should be fixed;
> > > > > - XFS folk has complained several times that vmalloc might be contented on
> > > > > their workloads:
> > > > >
> > > > > <snip>
> > > > > commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> > > > > Author: Dave Chinner <dchinner@redhat.com>
> > > > > Date: Tue Jan 4 17:22:18 2022 -0800
> > > > >
> > > > > xfs: reduce kvmalloc overhead for CIL shadow buffers
> > > > >
> > > > > Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> > > > >
> > > > > The problem is when we are logging lots of large objects, we hit
> > > > > kvmalloc really damn hard with costly order allocations, and
> > > > > behaviour utterly sucks:
> > > >
> > > > based on the commit I guess xfs should use vmalloc/kvmalloc is because
> > > > it allocates large buffers, how large could it be?
> > > >
> > > They use kvmalloc(). When the page allocator is not able to serve a
> > > request they fallback to vmalloc. At least what i see, the sizes are:
> > >
> > > from 73728 up to 1048576, i.e. 18 pages up to 256 pages.
> > >
> > > > > 3. Test
> > > > >
> > > > > On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
> > > > >
> > > > > 1-page 1-page-this-patch
> > > > > 1 0.576131 vs 0.555889
> > > > > 2 2.68376 vs 1.07895
> > > > > 3 4.26502 vs 1.01739
> > > > > 4 6.04306 vs 1.28924
> > > > > 5 8.04786 vs 1.57616
> > > > > 6 9.38844 vs 1.78142
> > > >
> > > > <snip>
> > > >
> > > > > 29 20.06 vs 3.59869
> > > > > 30 20.4353 vs 3.6991
> > > > > 31 20.9082 vs 3.73028
> > > > > 32 21.0865 vs 3.82904
> > > > >
> > > > > 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> > > > > pair throughput.
> > > >
> > > > I would be more interested in real numbers than synthetic benchmarks,
> > > > Maybe XFS folks could help performing profiling similar to commit 8dc9384b7d750
> > > > with and without this patchset?
> > > >
> > > I added Dave Chinner <david@fromorbit.com> to this thread.
> >
> > Oh, I missed that, and it would be better to [+Cc linux-xfs]
> >
> > > But. The contention exists.
> >
> > I think "theoretically can be contended" doesn't necessarily mean it's actually
> > contended in the real world.
>
> Did you not read the commit message for the XFS commit documented
> above? vmalloc lock contention most c0ertainly does exist in the
> real world and the profiles in commit 8dc9384b7d75 ("xfs: reduce
> kvmalloc overhead for CIL shadow buffers") document it clearly.
>
> > Also I find it difficult to imagine vmalloc being highly contended because it was
> > historically considered slow and thus discouraged when performance is important.
>
> Read the above XFS commit.
>
> We use vmalloc in critical high performance fast paths that cannot
> tolerate high order memory allocation failure. XFS runs this
> fast path millions of times a second, and will call into
> vmalloc() several hundred thousands times a second with machine wide
> concurrency under certain types of workloads.
>
> > IOW vmalloc would not be contended when allocation size is small because we have
> > kmalloc/buddy API, and therefore I wonder which workloads are allocating very large
> > buffers and at the same time allocating very frequently, thus performance-sensitive.
> >
> > I am not against this series, but wondering which workloads would benefit ;)
>
> Yup, you need to read the XFS commit message. If you understand what
> is in that commit message, then you wouldn't be doubting that
> vmalloc contention is real and that it is used in high performance
> fast paths that are traversed millions of times a second....
Oh, I read the commit but it seems it slipped my mind while reading it - sorry
for such a dumb question; now I get it, and thank you so much. In any case I
didn't mean to offend, I should've read and thought more before asking.
>
> > > Apart of that per-cpu-KVA allocator can go away if we make it generic instead.
> >
> > Not sure I understand your point, can you elaborate please?
> >
> > And I would like to ask some side questions:
> >
> > 1. Is vm_[un]map_ram() API still worth with this patchset?
>
> XFS also uses this interface for mapping multi-page buffers in the
> XFS buffer cache. These are the items that also require the high
> order costly kvmalloc allocations in the transaction commit path
> when they are modified.
>
> So, yes, we need these mapping interfaces to scale just as well as
> vmalloc itself....
I mean, even before this series, vm_[un]map_ram() caches vmap_blocks
per CPU, but it has a limitation on the size that can be cached per CPU.
But now that vmap() itself becomes scalable after this series, I wonder
whether they are still worth it - why not replace them with v[un]map()?
>
> > 2. How does this patchset deals with 32-bit machines where
> > vmalloc address space is limited?
>
> From the XFS side, we just don't care about 32 bit machines at all.
> XFS is aimed at server and HPC environments which have been entirely
> 64 bit for a long, long time now...
But Linux still supports 32 bit machines and is not going to drop
support for them anytime soon so I think there should be at least a way to
disable this feature.
Thanks!
--
Hyeonggon Yoo
Doing kernel stuff as a hobby
Undergraduate | Chungnam National University
Dept. Computer Science & Engineering