Initial work on preallocations showed no regression in performance
during testing, but recently some users (both on [1] and off [android]
list) have reported that preallocating the worst-case number of nodes
has caused some slowdown.  This patch set addresses the number of
allocations in a few ways.

Most munmap() operations will remove a single VMA, so leverage the fact
that the maple tree can place a single pointer at range 0 - 0 without
allocating.  This is done by changing the index in the 'sidetree'.

Re-introduce the entry argument to mas_preallocate() so that a more
intelligent guess of the node count can be made.

Patches are in the following order:
0001-0002: Testing framework for benchmarking some operations
0003-0004: Reduction of maple node allocation in sidetree
0005:      Small cleanup of do_vmi_align_munmap()
0006-0013: mas_preallocate() calculation change
0014:      Change the vma iterator order

[1] https://lore.kernel.org/linux-mm/202305061457.ac15990c-yujie.liu@intel.com/

Liam R. Howlett (14):
  maple_tree: Add benchmarking for mas_for_each
  maple_tree: Add benchmarking for mas_prev()
  mm: Move unmap_vmas() declaration to internal header
  mm: Change do_vmi_align_munmap() side tree index
  mm: Remove prev check from do_vmi_align_munmap()
  maple_tree: Introduce __mas_set_range()
  mm: Remove re-walk from mmap_region()
  maple_tree: Re-introduce entry to mas_preallocate() arguments
  mm: Use vma_iter_clear_gfp() in nommu
  mm: Set up vma iterator for vma_iter_prealloc() calls
  maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
  maple_tree: Update mas_preallocate() testing
  maple_tree: Refine mas_preallocate() node calculations
  mm/mmap: Change vma iteration order in do_vmi_align_munmap()

 fs/exec.c                        |   1 +
 include/linux/maple_tree.h       |  23 ++++-
 include/linux/mm.h               |   4 -
 lib/maple_tree.c                 |  78 ++++++++++----
 lib/test_maple_tree.c            |  74 +++++++++++++
 mm/internal.h                    |  40 ++++++--
 mm/memory.c                      |  16 ++-
 mm/mmap.c                        | 171 ++++++++++++++++---------------
 mm/nommu.c                       |  45 ++++----
 tools/testing/radix-tree/maple.c |  59 ++++++-----
 10 files changed, 331 insertions(+), 180 deletions(-)

-- 
2.39.2
Hi Liam,

On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
> Initial work on preallocations showed no regression in performance
> during testing, but recently some users (both on [1] and off [android]
> list) have reported that preallocating the worst-case number of nodes
> has caused some slow down. This patch set addresses the number of
> allocations in a few ways.
>
> During munmap() most munmap() operations will remove a single VMA, so
> leverage the fact that the maple tree can place a single pointer at
> range 0 - 0 without allocating. This is done by changing the index in
> the 'sidetree'.
>
> Re-introduce the entry argument to mas_preallocate() so that a more
> intelligent guess of the node count can be made.
>
> Patches are in the following order:
> 0001-0002: Testing framework for benchmarking some operations
> 0003-0004: Reduction of maple node allocation in sidetree
> 0005: Small cleanup of do_vmi_align_munmap()
> 0006-0013: mas_preallocate() calculation change
> 0014: Change the vma iterator order

I did run the AIM:page_test on an IceLake 48C/96T + 192G RAM platform
with this patchset.

The result has a little bit of improvement:
Base (next-20230602):
  503880
Base with this patchset:
  519501

But they are far from the non-regression result (commit 7be1c1a3c7b1):
  718080

Some other information I collected:
With Base, mas_alloc_nodes() is always hit with request: 7.
With this patchset, the requests are 1 or 5.

I suppose this is the reason for the improvement from 503880 to 519501.

With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags() never
triggered a mas_alloc_nodes() call. Thanks.

Regards
Yin, Fengwei
On 2023/6/2 16:10, Yin, Fengwei wrote:
> I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
> this patchset.
>
> The result has a little bit improvement:
> Base (next-20230602):
> 503880
> Base with this patchset:
> 519501
>
> But they are far from the none-regression result (commit 7be1c1a3c7b1):
> 718080
>
> Some other information I collected:
> With Base, the mas_alloc_nodes are always hit with request: 7.
> With this patchset, the request are 1 or 5.
>
> I suppose this is the reason for improvement from 503880 to 519501.
>
> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered
> mas_alloc_nodes() call. Thanks.

Hi Fengwei,

I think it may be related to the inaccurate number of nodes allocated
in the pre-allocation. I slightly modified the pre-allocation in this
patchset, but I don't know if it works. It would be great if you could
help test it, and help pinpoint the cause. Below is the diff, which can
be applied on top of this patchset.

Thanks,
Peng

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 5ea211c3f186..e67bf2744384 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 		goto ask_now;
 	}
 
-	/* New root needs a singe node */
-	if (unlikely(mte_is_root(mas->node)))
-		goto ask_now;
+	if ((node_size == wr_mas.node_end + 1 &&
+	     mas->offset == wr_mas.node_end) ||
+	    (node_size == wr_mas.node_end &&
+	     wr_mas.offset_end - mas->offset == 1))
+		return 0;
 
 	/* Potential spanning rebalance collapsing a node, use worst-case */
 	if (node_size - 1 <= mt_min_slots[wr_mas.type])
@@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 	if (likely(!mas_is_err(mas)))
 		return 0;
 
-	mas_set_alloc_req(mas, 0);
 	ret = xa_err(mas->node);
 	mas_reset(mas);
 	mas_destroy(mas);
Hi Peng,

On 6/5/23 11:28, Peng Zhang wrote:
> Hi Fengwei,
>
> I think it may be related to the inaccurate number of nodes allocated
> in the pre-allocation. I slightly modified the pre-allocation in this
> patchset, but I don't know if it works. It would be great if you could
> help test it, and help pinpoint the cause. Below is the diff, which can
> be applied based on this pachset.

I tried the patch; it eliminates the calls to mas_alloc_nodes() during
the test, but the benchmark result only improved a little bit:
  529040

It's still much less than the non-regression result. I will also
double-check the non-regression result.

Regards
Yin, Fengwei
On 6/5/2023 12:41 PM, Yin Fengwei wrote:
> Hi Peng,
>
> On 6/5/23 11:28, Peng Zhang wrote:
>>
>>
>> 在 2023/6/2 16:10, Yin, Fengwei 写道:
>>> Hi Liam,
>>>
>>> On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
>>>> Initial work on preallocations showed no regression in performance
>>>> during testing, but recently some users (both on [1] and off [android]
>>>> list) have reported that preallocating the worst-case number of nodes
>>>> has caused some slow down. This patch set addresses the number of
>>>> allocations in a few ways.
>>>>
>>>> During munmap() most munmap() operations will remove a single VMA, so
>>>> leverage the fact that the maple tree can place a single pointer at
>>>> range 0 - 0 without allocating. This is done by changing the index in
>>>> the 'sidetree'.
>>>>
>>>> Re-introduce the entry argument to mas_preallocate() so that a more
>>>> intelligent guess of the node count can be made.
>>>>
>>>> Patches are in the following order:
>>>> 0001-0002: Testing framework for benchmarking some operations
>>>> 0003-0004: Reduction of maple node allocation in sidetree
>>>> 0005: Small cleanup of do_vmi_align_munmap()
>>>> 0006-0013: mas_preallocate() calculation change
>>>> 0014: Change the vma iterator order
>>> I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
>>> this patchset.
>>>
>>> The result has a little bit improvement:
>>> Base (next-20230602):
>>> 503880
>>> Base with this patchset:
>>> 519501
>>>
>>> But they are far from the none-regression result (commit 7be1c1a3c7b1):
>>> 718080
>>>
>>>
>>> Some other information I collected:
>>> With Base, the mas_alloc_nodes are always hit with request: 7.
>>> With this patchset, the request are 1 or 5.
>>>
>>> I suppose this is the reason for improvement from 503880 to 519501.
>>>
>>> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered
>>> mas_alloc_nodes() call. Thanks.
>> Hi Fengwei,
>>
>> I think it may be related to the inaccurate number of nodes allocated
>> in the pre-allocation. I slightly modified the pre-allocation in this
>> patchset, but I don't know if it works. It would be great if you could
>> help test it, and help pinpoint the cause. Below is the diff, which can
>> be applied based on this pachset.
> I tried the patch, it could eliminate the call of mas_alloc_nodes() during
> the test. But the result of benchmark got a little bit improvement:
> 529040
>
> But it's still much less than none-regression result. I will also double
> confirm the none-regression result.
Just noticed that commit f5715584af95 makes validate_mm() have two
implementations based on CONFIG_DEBUG_VM instead of
CONFIG_DEBUG_VM_MAPLE_TREE. I have CONFIG_DEBUG_VM but not
CONFIG_DEBUG_VM_MAPLE_TREE defined. So it's not an apples-to-apples
comparison.
I disabled CONFIG_DEBUG_VM and re-ran the test and got:
Before preallocation change (7be1c1a3c7b1):
770100
After preallocation change (28c5609fb236):
680000
With Liam's fix:
702100
plus Peng's fix:
725900
Regards
Yin, Fengwei
>
>
> Regards
> Yin, Fengwei
>
>>
>> Thanks,
>> Peng
>>
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index 5ea211c3f186..e67bf2744384 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>> goto ask_now;
>> }
>>
>> - /* New root needs a singe node */
>> - if (unlikely(mte_is_root(mas->node)))
>> - goto ask_now;
>> + if ((node_size == wr_mas.node_end + 1 &&
>> + mas->offset == wr_mas.node_end) ||
>> + (node_size == wr_mas.node_end &&
>> + wr_mas.offset_end - mas->offset == 1))
>> + return 0;
>>
>> /* Potential spanning rebalance collapsing a node, use worst-case */
>> if (node_size - 1 <= mt_min_slots[wr_mas.type])
>> @@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>> if (likely(!mas_is_err(mas)))
>> return 0;
>>
>> - mas_set_alloc_req(mas, 0);
>> ret = xa_err(mas->node);
>> mas_reset(mas);
>> mas_destroy(mas);
>>
>>
>>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/202305061457.ac15990c-yujie.liu@intel.com/
>>>>
>>>> Liam R. Howlett (14):
>>>> maple_tree: Add benchmarking for mas_for_each
>>>> maple_tree: Add benchmarking for mas_prev()
>>>> mm: Move unmap_vmas() declaration to internal header
>>>> mm: Change do_vmi_align_munmap() side tree index
>>>> mm: Remove prev check from do_vmi_align_munmap()
>>>> maple_tree: Introduce __mas_set_range()
>>>> mm: Remove re-walk from mmap_region()
>>>> maple_tree: Re-introduce entry to mas_preallocate() arguments
>>>> mm: Use vma_iter_clear_gfp() in nommu
>>>> mm: Set up vma iterator for vma_iter_prealloc() calls
>>>> maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
>>>> maple_tree: Update mas_preallocate() testing
>>>> maple_tree: Refine mas_preallocate() node calculations
>>>> mm/mmap: Change vma iteration order in do_vmi_align_munmap()
>>>>
>>>> fs/exec.c | 1 +
>>>> include/linux/maple_tree.h | 23 ++++-
>>>> include/linux/mm.h | 4 -
>>>> lib/maple_tree.c | 78 ++++++++++----
>>>> lib/test_maple_tree.c | 74 +++++++++++++
>>>> mm/internal.h | 40 ++++++--
>>>> mm/memory.c | 16 ++-
>>>> mm/mmap.c | 171 ++++++++++++++++---------------
>>>> mm/nommu.c | 45 ++++----
>>>> tools/testing/radix-tree/maple.c | 59 ++++++-----
>>>> 10 files changed, 331 insertions(+), 180 deletions(-)
>>>>
On 2023/6/5 14:18, Yin, Fengwei wrote:
> On 6/5/2023 12:41 PM, Yin Fengwei wrote:
>> I tried the patch, it could eliminate the call of mas_alloc_nodes() during
>> the test. But the result of benchmark got a little bit improvement:
>> 529040
>>
>> But it's still much less than none-regression result. I will also double
>> confirm the none-regression result.
> Just noticed that the commit f5715584af95 make validate_mm() two implementation
> based on CONFIG_DEBUG_VM instead of CONFIG_DEBUG_VM_MAPPLE_TREE. I have
> CONFIG_DEBUG_VM but not CONFIG_DEBUG_VM_MAPPLE_TREE defined. So it's not an
> apple to apple.
>
> I disable CONFIG_DEBUG_VM and re-run the test and got:
> Before preallocation change (7be1c1a3c7b1):
> 770100
> After preallocation change (28c5609fb236):
> 680000
> With liam's fix:
> 702100
> plus Peng's fix:
> 725900

Thank you for your test; now it seems that the performance regression
is not so large.
* Peng Zhang <zhangpeng.00@bytedance.com> [230605 03:59]:
> 
> 在 2023/6/5 14:18, Yin, Fengwei 写道:
> > On 6/5/2023 12:41 PM, Yin Fengwei wrote:
> > > I tried the patch, it could eliminate the call of mas_alloc_nodes() during
> > > the test. But the result of benchmark got a little bit improvement:
> > > 529040
> > >
> > > But it's still much less than none-regression result. I will also double
> > > confirm the none-regression result.
> > Just noticed that the commit f5715584af95 make validate_mm() two implementation
> > based on CONFIG_DEBUG_VM instead of CONFIG_DEBUG_VM_MAPPLE_TREE). I have
> > CONFIG_DEBUG_VM but not CONFIG_DEBUG_VM_MAPPLE_TREE defined. So it's not an
> > apple to apple.

You mean "mm: update validate_mm() to use vma iterator" here I guess.  I
have it as a different commit id in my branch.

I 'restored' some of the checking because I was able to work around not
having the mt_dump() definition with the vma iterator.  I'm now
wondering how widespread the use of CONFIG_DEBUG_VM is and if I should
not have added these extra checks.

> > I disable CONFIG_DEBUG_VM and re-run the test and got:
> > Before preallocation change (7be1c1a3c7b1):
> > 770100
> > After preallocation change (28c5609fb236):
> > 680000
> > With liam's fix:
> > 702100
> > plus Peng's fix:
> > 725900
> Thank you for your test, now it seems that the performance
> regression is not so much.

We are also too strict on the reset during mas_store_prealloc() checking
for a spanning write.  I have a fix for this for v2 of the patch set,
although I suspect it will not make a huge difference.

> > > > -	/* New root needs a singe node */
> > > > -	if (unlikely(mte_is_root(mas->node)))
> > > > -		goto ask_now;

Why did you drop this?  If we are creating a new root we will only need
one node.

> > > > +	if ((node_size == wr_mas.node_end + 1 &&
> > > > +	     mas->offset == wr_mas.node_end) ||
> > > > +	    (node_size == wr_mas.node_end &&
> > > > +	     wr_mas.offset_end - mas->offset == 1))
> > > > +		return 0;

I will add this to v2 as well, or something similar.

> > > > -	mas_set_alloc_req(mas, 0);

Why did you drop this?  It seems like a worthwhile cleanup on failure.
On Mon, 5 Jun 2023, Liam R. Howlett wrote:
>
> You mean "mm: update validate_mm() to use vma iterator" here I guess.  I
> have it as a different commit id in my branch.
>
> I 'restored' some of the checking because I was able to work around not
> having the mt_dump() definition with the vma iterator.  I'm now
> wondering how wide spread CONFIG_DEBUG_VM is used and if I should not
> have added these extra checks.

Most CONFIG_DEBUG_VM checks are quite cheap, mostly VM_BUG_ONs for
easily checked conditions.  If validate_mm() is still the kind of thing
it used to be, checking through every vma on every mmap operation, please
don't bring that into CONFIG_DEBUG_VM - it distorts performance too much,
so always used to be under a separate CONFIG_DEBUG_VM_RB instead.

Hugh
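Hugh's cost argument can be made concrete with a back-of-envelope model (plain Python, not kernel code; the VMA counts and workload size below are made-up numbers): a VM_BUG_ON-style assertion is constant cost per mmap operation, while a validate_mm()-style pass walks every VMA on every operation.

```python
# Illustrative model only: how many checks run over a workload when each
# mmap operation does one O(1) assertion vs. a full O(n) tree validation.
# The workload numbers are arbitrary assumptions, not measurements.

def checks_executed(n_vmas: int, n_mmap_ops: int, full_validation: bool) -> int:
    """Total assertion-style checks executed across n_mmap_ops operations."""
    per_op = n_vmas if full_validation else 1  # a full walk visits every VMA
    return n_mmap_ops * per_op

cheap = checks_executed(n_vmas=10_000, n_mmap_ops=1_000, full_validation=False)
full = checks_executed(n_vmas=10_000, n_mmap_ops=1_000, full_validation=True)
print(cheap, full)  # the full walk does n_vmas times more work per operation
```

This gap is why a full-tree validation pass historically lived behind its own, separate debug option rather than the general CONFIG_DEBUG_VM.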
Hi Hugh,

On 6/6/2023 10:41 AM, Hugh Dickins wrote:
> On Mon, 5 Jun 2023, Liam R. Howlett wrote:
>>
>> You mean "mm: update validate_mm() to use vma iterator" here I guess.  I
>> have it as a different commit id in my branch.
>>
>> I 'restored' some of the checking because I was able to work around not
>> having the mt_dump() definition with the vma iterator.  I'm now
>> wondering how wide spread CONFIG_DEBUG_VM is used and if I should not
>> have added these extra checks.
>
> Most CONFIG_DEBUG_VM checks are quite cheap, mostly VM_BUG_ONs for
Indeed. I had CONFIG_DEBUG_VM enabled and didn't see surprise perf report.

> easily checked conditions.  If validate_mm() is still the kind of thing
> it used to be, checking through every vma on every mmap operation, please
> don't bring that into CONFIG_DEBUG_VM - it distorts performance too much,
> so always used to be under a separate CONFIG_DEBUG_VM_RB instead.
So does this mean CONFIG_DEBUG_VM is allowed to be enabled for performance
testing? Thanks.


Regards
Yin, Fengwei

>
> Hugh
On Tue, 6 Jun 2023, Yin, Fengwei wrote:
> On 6/6/2023 10:41 AM, Hugh Dickins wrote:
> > On Mon, 5 Jun 2023, Liam R. Howlett wrote:
> >>
> >> You mean "mm: update validate_mm() to use vma iterator" here I guess.  I
> >> have it as a different commit id in my branch.
> >>
> >> I 'restored' some of the checking because I was able to work around not
> >> having the mt_dump() definition with the vma iterator.  I'm now
> >> wondering how wide spread CONFIG_DEBUG_VM is used and if I should not
> >> have added these extra checks.
> >
> > Most CONFIG_DEBUG_VM checks are quite cheap, mostly VM_BUG_ONs for
> Indeed. I had CONFIG_DEBUG_VM enabled and didn't see surprise perf report.
>
> > easily checked conditions.  If validate_mm() is still the kind of thing
> > it used to be, checking through every vma on every mmap operation, please
> > don't bring that into CONFIG_DEBUG_VM - it distorts performance too much,
> > so always used to be under a separate CONFIG_DEBUG_VM_RB instead.
> So does this mean CONFIG_DEBUG_VM is allowed to be enabled for performance
> testing? Thanks.

I was going to say:
No, I did not mean that: I just meant that even developers not doing
strict performance testing still like to keep a rough eye on performance
changes; and historically CONFIG_DEBUG_VM has not distorted very much.

But then I wonder about certain distros which (wrongly or rightly) turn
CONFIG_DEBUG_VM on: I expect they do performance testing on their kernels.

Hugh
On 6/6/2023 11:08 AM, Hugh Dickins wrote:
> On Tue, 6 Jun 2023, Yin, Fengwei wrote:
>> On 6/6/2023 10:41 AM, Hugh Dickins wrote:
>>> On Mon, 5 Jun 2023, Liam R. Howlett wrote:
>>>>
>>>> You mean "mm: update validate_mm() to use vma iterator" here I guess.  I
>>>> have it as a different commit id in my branch.
>>>>
>>>> I 'restored' some of the checking because I was able to work around not
>>>> having the mt_dump() definition with the vma iterator.  I'm now
>>>> wondering how wide spread CONFIG_DEBUG_VM is used and if I should not
>>>> have added these extra checks.
>>>
>>> Most CONFIG_DEBUG_VM checks are quite cheap, mostly VM_BUG_ONs for
>> Indeed. I had CONFIG_DEBUG_VM enabled and didn't see surprise perf report.
>>
>>
>>> easily checked conditions.  If validate_mm() is still the kind of thing
>>> it used to be, checking through every vma on every mmap operation, please
>>> don't bring that into CONFIG_DEBUG_VM - it distorts performance too much,
>>> so always used to be under a separate CONFIG_DEBUG_VM_RB instead.
>> So does this mean CONFIG_DEBUG_VM is allowed to be enabled for performance
>> testing? Thanks.
>
> I was going to say:
> No, I did not mean that: I just meant that even developers not doing
> strict performance testing still like to keep a rough eye on performance
> changes; and historically CONFIG_DEBUG_VM has not distorted very much.
>
> But then I wonder about certain distros which (wrongly or rightly) turn
> CONFIG_DEBUG_VM on: I expect they do performance testing on their kernels.
Fair enough. Thanks for explanation.


Regards
Yin, Fengwei

>
> Hugh
* Yin, Fengwei <fengwei.yin@intel.com> [230605 23:11]:
>
>
> On 6/6/2023 11:08 AM, Hugh Dickins wrote:
> > On Tue, 6 Jun 2023, Yin, Fengwei wrote:
> >> On 6/6/2023 10:41 AM, Hugh Dickins wrote:
> >>> On Mon, 5 Jun 2023, Liam R. Howlett wrote:
> >>>>
> >>>> You mean "mm: update validate_mm() to use vma iterator" here I guess.  I
> >>>> have it as a different commit id in my branch.
> >>>>
> >>>> I 'restored' some of the checking because I was able to work around not
> >>>> having the mt_dump() definition with the vma iterator.  I'm now
> >>>> wondering how wide spread CONFIG_DEBUG_VM is used and if I should not
> >>>> have added these extra checks.
> >>>
> >>> Most CONFIG_DEBUG_VM checks are quite cheap, mostly VM_BUG_ONs for
> >> Indeed. I had CONFIG_DEBUG_VM enabled and didn't see surprise perf report.
> >>
> >>
> >>> easily checked conditions.  If validate_mm() is still the kind of thing
> >>> it used to be, checking through every vma on every mmap operation, please
> >>> don't bring that into CONFIG_DEBUG_VM - it distorts performance too much,
> >>> so always used to be under a separate CONFIG_DEBUG_VM_RB instead.

Okay, I will update my patch to use CONFIG_DEBUG_VM_MAPLE_TREE for
validate_mm().

> >> So does this mean CONFIG_DEBUG_VM is allowed to be enabled for performance
> >> testing? Thanks.
> >
> > I was going to say:
> > No, I did not mean that: I just meant that even developers not doing
> > strict performance testing still like to keep a rough eye on performance
> > changes; and historically CONFIG_DEBUG_VM has not distorted very much.
> >
> > But then I wonder about certain distros which (wrongly or rightly) turn
> > CONFIG_DEBUG_VM on: I expect they do performance testing on their kernels.
> Fair enough. Thanks for explanation.

Thanks for looking at this everyone.

Regards,
Liam
On 6/5/23 22:03, Liam R. Howlett wrote: > * Peng Zhang <zhangpeng.00@bytedance.com> [230605 03:59]: >> >> >> 在 2023/6/5 14:18, Yin, Fengwei 写道: >>> >>> >>> On 6/5/2023 12:41 PM, Yin Fengwei wrote: >>>> Hi Peng, >>>> >>>> On 6/5/23 11:28, Peng Zhang wrote: >>>>> >>>>> >>>>> 在 2023/6/2 16:10, Yin, Fengwei 写道: >>>>>> Hi Liam, >>>>>> >>>>>> On 6/1/2023 10:15 AM, Liam R. Howlett wrote: >>>>>>> Initial work on preallocations showed no regression in performance >>>>>>> during testing, but recently some users (both on [1] and off [android] >>>>>>> list) have reported that preallocating the worst-case number of nodes >>>>>>> has caused some slow down. This patch set addresses the number of >>>>>>> allocations in a few ways. >>>>>>> >>>>>>> During munmap() most munmap() operations will remove a single VMA, so >>>>>>> leverage the fact that the maple tree can place a single pointer at >>>>>>> range 0 - 0 without allocating. This is done by changing the index in >>>>>>> the 'sidetree'. >>>>>>> >>>>>>> Re-introduce the entry argument to mas_preallocate() so that a more >>>>>>> intelligent guess of the node count can be made. >>>>>>> >>>>>>> Patches are in the following order: >>>>>>> 0001-0002: Testing framework for benchmarking some operations >>>>>>> 0003-0004: Reduction of maple node allocation in sidetree >>>>>>> 0005: Small cleanup of do_vmi_align_munmap() >>>>>>> 0006-0013: mas_preallocate() calculation change >>>>>>> 0014: Change the vma iterator order >>>>>> I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with >>>>>> this patchset. >>>>>> >>>>>> The result has a little bit improvement: >>>>>> Base (next-20230602): >>>>>> 503880 >>>>>> Base with this patchset: >>>>>> 519501 >>>>>> >>>>>> But they are far from the none-regression result (commit 7be1c1a3c7b1): >>>>>> 718080 >>>>>> >>>>>> >>>>>> Some other information I collected: >>>>>> With Base, the mas_alloc_nodes are always hit with request: 7. >>>>>> With this patchset, the request are 1 or 5. 
>>>>>> >>>>>> I suppose this is the reason for improvement from 503880 to 519501. >>>>>> >>>>>> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered >>>>>> mas_alloc_nodes() call. Thanks. >>>>> Hi Fengwei, >>>>> >>>>> I think it may be related to the inaccurate number of nodes allocated >>>>> in the pre-allocation. I slightly modified the pre-allocation in this >>>>> patchset, but I don't know if it works. It would be great if you could >>>>> help test it, and help pinpoint the cause. Below is the diff, which can >>>>> be applied based on this pachset. >>>> I tried the patch, it could eliminate the call of mas_alloc_nodes() during >>>> the test. But the result of benchmark got a little bit improvement: >>>> 529040 >>>> >>>> But it's still much less than none-regression result. I will also double >>>> confirm the none-regression result. >>> Just noticed that the commit f5715584af95 make validate_mm() two implementation >>> based on CONFIG_DEBUG_VM instead of CONFIG_DEBUG_VM_MAPPLE_TREE). I have >>> CONFIG_DEBUG_VM but not CONFIG_DEBUG_VM_MAPPLE_TREE defined. So it's not an >>> apple to apple. > > You mean "mm: update validate_mm() to use vma iterator" here I guess. I > have it as a different commit id in my branch. Yes. The f5715584af95 was from next-20230602. I will include the subject of the patch in the future to avoid such confusion. > > I 'restored' some of the checking because I was able to work around not > having the mt_dump() definition with the vma iterator. I'm now > wondering how wide spread CONFIG_DEBUG_VM is used and if I should not > have added these extra checks. No worries here. The problem is in my local config. I enabled the CONFIG_DEBUG_VM to capture VM asserts/warnings. It should be disabled for performance testing. LKP does disable it for performance check. 
Regards Yin, Fengwei > >>> >>> >>> I disable CONFIG_DEBUG_VM and re-run the test and got: >>> Before preallocation change (7be1c1a3c7b1): >>> 770100 >>> After preallocation change (28c5609fb236): >>> 680000 >>> With liam's fix: >>> 702100 >>> plus Peng's fix: >>> 725900 >> Thank you for your test, now it seems that the performance >> regression is not so much. > > We are also too strict on the reset during mas_store_prealloc() checking > for a spanning write. I have a fix for this for v2 of the patch set, > although I suspect it will not make a huge difference. > >>> >>> >>> Regards >>> Yin, Fengwei >>> >>>> >>>> >>>> Regards >>>> Yin, Fengwei >>>> >>>>> >>>>> Thanks, >>>>> Peng >>>>> >>>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c >>>>> index 5ea211c3f186..e67bf2744384 100644 >>>>> --- a/lib/maple_tree.c >>>>> +++ b/lib/maple_tree.c >>>>> @@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp) >>>>> goto ask_now; >>>>> } >>>>> >>>>> - /* New root needs a singe node */ >>>>> - if (unlikely(mte_is_root(mas->node))) >>>>> - goto ask_now; > > Why did you drop this? If we are creating a new root we will only need > one node. > >>>>> + if ((node_size == wr_mas.node_end + 1 && >>>>> + mas->offset == wr_mas.node_end) || >>>>> + (node_size == wr_mas.node_end && >>>>> + wr_mas.offset_end - mas->offset == 1)) >>>>> + return 0; > > I will add this to v2 as well, or something similar. > >>>>> >>>>> /* Potential spanning rebalance collapsing a node, use worst-case */ >>>>> if (node_size - 1 <= mt_min_slots[wr_mas.type]) >>>>> @@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp) >>>>> if (likely(!mas_is_err(mas))) >>>>> return 0; >>>>> >>>>> - mas_set_alloc_req(mas, 0); > > Why did you drop this? It seems like a worth while cleanup on failure. 
>
>>>>>         ret = xa_err(mas->node);
>>>>>         mas_reset(mas);
>>>>>         mas_destroy(mas);

[remainder of quoted cover letter and diffstat snipped]
在 2023/6/5 22:03, Liam R. Howlett 写道: > * Peng Zhang <zhangpeng.00@bytedance.com> [230605 03:59]: >> >> >> 在 2023/6/5 14:18, Yin, Fengwei 写道: >>> >>> >>> On 6/5/2023 12:41 PM, Yin Fengwei wrote: >>>> Hi Peng, >>>> >>>> On 6/5/23 11:28, Peng Zhang wrote: >>>>> >>>>> >>>>> 在 2023/6/2 16:10, Yin, Fengwei 写道: >>>>>> Hi Liam, >>>>>> >>>>>> On 6/1/2023 10:15 AM, Liam R. Howlett wrote: >>>>>>> Initial work on preallocations showed no regression in performance >>>>>>> during testing, but recently some users (both on [1] and off [android] >>>>>>> list) have reported that preallocating the worst-case number of nodes >>>>>>> has caused some slow down. This patch set addresses the number of >>>>>>> allocations in a few ways. >>>>>>> >>>>>>> During munmap() most munmap() operations will remove a single VMA, so >>>>>>> leverage the fact that the maple tree can place a single pointer at >>>>>>> range 0 - 0 without allocating. This is done by changing the index in >>>>>>> the 'sidetree'. >>>>>>> >>>>>>> Re-introduce the entry argument to mas_preallocate() so that a more >>>>>>> intelligent guess of the node count can be made. >>>>>>> >>>>>>> Patches are in the following order: >>>>>>> 0001-0002: Testing framework for benchmarking some operations >>>>>>> 0003-0004: Reduction of maple node allocation in sidetree >>>>>>> 0005: Small cleanup of do_vmi_align_munmap() >>>>>>> 0006-0013: mas_preallocate() calculation change >>>>>>> 0014: Change the vma iterator order >>>>>> I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with >>>>>> this patchset. >>>>>> >>>>>> The result has a little bit improvement: >>>>>> Base (next-20230602): >>>>>> 503880 >>>>>> Base with this patchset: >>>>>> 519501 >>>>>> >>>>>> But they are far from the none-regression result (commit 7be1c1a3c7b1): >>>>>> 718080 >>>>>> >>>>>> >>>>>> Some other information I collected: >>>>>> With Base, the mas_alloc_nodes are always hit with request: 7. >>>>>> With this patchset, the request are 1 or 5. 
>>>>>> >>>>>> I suppose this is the reason for improvement from 503880 to 519501. >>>>>> >>>>>> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered >>>>>> mas_alloc_nodes() call. Thanks. >>>>> Hi Fengwei, >>>>> >>>>> I think it may be related to the inaccurate number of nodes allocated >>>>> in the pre-allocation. I slightly modified the pre-allocation in this >>>>> patchset, but I don't know if it works. It would be great if you could >>>>> help test it, and help pinpoint the cause. Below is the diff, which can >>>>> be applied based on this pachset. >>>> I tried the patch, it could eliminate the call of mas_alloc_nodes() during >>>> the test. But the result of benchmark got a little bit improvement: >>>> 529040 >>>> >>>> But it's still much less than none-regression result. I will also double >>>> confirm the none-regression result. >>> Just noticed that the commit f5715584af95 make validate_mm() two implementation >>> based on CONFIG_DEBUG_VM instead of CONFIG_DEBUG_VM_MAPPLE_TREE). I have >>> CONFIG_DEBUG_VM but not CONFIG_DEBUG_VM_MAPPLE_TREE defined. So it's not an >>> apple to apple. > > You mean "mm: update validate_mm() to use vma iterator" here I guess. I > have it as a different commit id in my branch. > > I 'restored' some of the checking because I was able to work around not > having the mt_dump() definition with the vma iterator. I'm now > wondering how wide spread CONFIG_DEBUG_VM is used and if I should not > have added these extra checks. > >>> >>> >>> I disable CONFIG_DEBUG_VM and re-run the test and got: >>> Before preallocation change (7be1c1a3c7b1): >>> 770100 >>> After preallocation change (28c5609fb236): >>> 680000 >>> With liam's fix: >>> 702100 >>> plus Peng's fix: >>> 725900 >> Thank you for your test, now it seems that the performance >> regression is not so much. > > We are also too strict on the reset during mas_store_prealloc() checking > for a spanning write. 
I have a fix for this for v2 of the patch set, > although I suspect it will not make a huge difference. > >>> >>> >>> Regards >>> Yin, Fengwei >>> >>>> >>>> >>>> Regards >>>> Yin, Fengwei >>>> >>>>> >>>>> Thanks, >>>>> Peng >>>>> >>>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c >>>>> index 5ea211c3f186..e67bf2744384 100644 >>>>> --- a/lib/maple_tree.c >>>>> +++ b/lib/maple_tree.c >>>>> @@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp) >>>>> goto ask_now; >>>>> } >>>>> >>>>> - /* New root needs a singe node */ >>>>> - if (unlikely(mte_is_root(mas->node))) >>>>> - goto ask_now; > > Why did you drop this? If we are creating a new root we will only need > one node. The code below handles the root case perfectly, we don't need additional checks. if (node_size - 1 <= mt_min_slots[wr_mas.type]) request = mas_mt_height(mas) * 2 - 1; > >>>>> + if ((node_size == wr_mas.node_end + 1 && >>>>> + mas->offset == wr_mas.node_end) || >>>>> + (node_size == wr_mas.node_end && >>>>> + wr_mas.offset_end - mas->offset == 1)) >>>>> + return 0; > > I will add this to v2 as well, or something similar. > >>>>> >>>>> /* Potential spanning rebalance collapsing a node, use worst-case */ >>>>> if (node_size - 1 <= mt_min_slots[wr_mas.type]) >>>>> @@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp) >>>>> if (likely(!mas_is_err(mas))) >>>>> return 0; >>>>> >>>>> - mas_set_alloc_req(mas, 0); > > Why did you drop this? It seems like a worth while cleanup on failure. Because we will clear it in mas_node_count_gfp()->mas_alloc_nodes(). > >>>>> ret = xa_err(mas->node); >>>>> mas_reset(mas); >>>>> mas_destroy(mas); >>>>> >>>>> >>>>>> >>>>>> >>>>>> Regards >>>>>> Yin, Fengwei >>>>>> >>>>>>> >>>>>>> [1] https://lore.kernel.org/linux-mm/202305061457.ac15990c-yujie.liu@intel.com/ >>>>>>> >>>>>>> Liam R. 
Howlett (14): [quoted 14-patch list and diffstat snipped]
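Peng's argument above, that the min-slots branch already covers the new-root case, can be checked with a standalone model of the request calculation. This is a sketch in plain Python (the parameter names mirror the quoted diff, but it is an illustration, not the kernel's mas_preallocate()):

```python
# Sketch of the node-request estimate from the quoted diff (illustrative only):
# if the node would end up at or below minimum occupancy, a rebalance could
# ripple up the tree, so the worst case of (2 * height - 1) nodes is
# requested; otherwise a single replacement node is assumed to suffice.

def prealloc_request(node_size: int, min_slots: int, height: int) -> int:
    if node_size - 1 <= min_slots:
        # potential spanning rebalance collapsing a node: worst case
        return height * 2 - 1
    return 1  # default request

# A tree of height 1 (the root case) asks for a single node either way,
# which is the point being made about the explicit root check:
print(prealloc_request(node_size=3, min_slots=5, height=1))   # min-slots branch
print(prealloc_request(node_size=12, min_slots=5, height=1))  # falls through
```

For height 1 both branches yield one node (2*1 - 1 = 1 and the default 1), matching the claim that dropping the mte_is_root() check does not change the root-case request.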
* Peng Zhang <zhangpeng.00@bytedance.com> [230605 10:27]: > > > 在 2023/6/5 22:03, Liam R. Howlett 写道: > > * Peng Zhang <zhangpeng.00@bytedance.com> [230605 03:59]: > > > > > > > > > 在 2023/6/5 14:18, Yin, Fengwei 写道: > > > > > > > > > > > > On 6/5/2023 12:41 PM, Yin Fengwei wrote: > > > > > Hi Peng, > > > > > > > > > > On 6/5/23 11:28, Peng Zhang wrote: > > > > > > > > > > > > > > > > > > 在 2023/6/2 16:10, Yin, Fengwei 写道: > > > > > > > Hi Liam, > > > > > > > > > > > > > > On 6/1/2023 10:15 AM, Liam R. Howlett wrote: > > > > > > > > Initial work on preallocations showed no regression in performance > > > > > > > > during testing, but recently some users (both on [1] and off [android] > > > > > > > > list) have reported that preallocating the worst-case number of nodes > > > > > > > > has caused some slow down. This patch set addresses the number of > > > > > > > > allocations in a few ways. > > > > > > > > > > > > > > > > During munmap() most munmap() operations will remove a single VMA, so > > > > > > > > leverage the fact that the maple tree can place a single pointer at > > > > > > > > range 0 - 0 without allocating. This is done by changing the index in > > > > > > > > the 'sidetree'. > > > > > > > > > > > > > > > > Re-introduce the entry argument to mas_preallocate() so that a more > > > > > > > > intelligent guess of the node count can be made. > > > > > > > > > > > > > > > > Patches are in the following order: > > > > > > > > 0001-0002: Testing framework for benchmarking some operations > > > > > > > > 0003-0004: Reduction of maple node allocation in sidetree > > > > > > > > 0005: Small cleanup of do_vmi_align_munmap() > > > > > > > > 0006-0013: mas_preallocate() calculation change > > > > > > > > 0014: Change the vma iterator order > > > > > > > I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with > > > > > > > this patchset. 
> > > > > > > > > > > > > > The result has a little bit improvement: > > > > > > > Base (next-20230602): > > > > > > > 503880 > > > > > > > Base with this patchset: > > > > > > > 519501 > > > > > > > > > > > > > > But they are far from the none-regression result (commit 7be1c1a3c7b1): > > > > > > > 718080 > > > > > > > > > > > > > > > > > > > > > Some other information I collected: > > > > > > > With Base, the mas_alloc_nodes are always hit with request: 7. > > > > > > > With this patchset, the request are 1 or 5. > > > > > > > > > > > > > > I suppose this is the reason for improvement from 503880 to 519501. > > > > > > > > > > > > > > With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered > > > > > > > mas_alloc_nodes() call. Thanks. > > > > > > Hi Fengwei, > > > > > > > > > > > > I think it may be related to the inaccurate number of nodes allocated > > > > > > in the pre-allocation. I slightly modified the pre-allocation in this > > > > > > patchset, but I don't know if it works. It would be great if you could > > > > > > help test it, and help pinpoint the cause. Below is the diff, which can > > > > > > be applied based on this pachset. > > > > > I tried the patch, it could eliminate the call of mas_alloc_nodes() during > > > > > the test. But the result of benchmark got a little bit improvement: > > > > > 529040 > > > > > > > > > > But it's still much less than none-regression result. I will also double > > > > > confirm the none-regression result. > > > > Just noticed that the commit f5715584af95 make validate_mm() two implementation > > > > based on CONFIG_DEBUG_VM instead of CONFIG_DEBUG_VM_MAPPLE_TREE). I have > > > > CONFIG_DEBUG_VM but not CONFIG_DEBUG_VM_MAPPLE_TREE defined. So it's not an > > > > apple to apple. > > > > You mean "mm: update validate_mm() to use vma iterator" here I guess. I > > have it as a different commit id in my branch. 
> > > > I 'restored' some of the checking because I was able to work around not > > having the mt_dump() definition with the vma iterator. I'm now > > wondering how wide spread CONFIG_DEBUG_VM is used and if I should not > > have added these extra checks. > > > > > > > > > > > > > > I disable CONFIG_DEBUG_VM and re-run the test and got: > > > > Before preallocation change (7be1c1a3c7b1): > > > > 770100 > > > > After preallocation change (28c5609fb236): > > > > 680000 > > > > With liam's fix: > > > > 702100 > > > > plus Peng's fix: > > > > 725900 > > > Thank you for your test, now it seems that the performance > > > regression is not so much. > > > > We are also too strict on the reset during mas_store_prealloc() checking > > for a spanning write. I have a fix for this for v2 of the patch set, > > although I suspect it will not make a huge difference. > > > > > > > > > > > > > > Regards > > > > Yin, Fengwei > > > > > > > > > > > > > > > > > > > Regards > > > > > Yin, Fengwei > > > > > > > > > > > > > > > > > Thanks, > > > > > > Peng > > > > > > > > > > > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c > > > > > > index 5ea211c3f186..e67bf2744384 100644 > > > > > > --- a/lib/maple_tree.c > > > > > > +++ b/lib/maple_tree.c > > > > > > @@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp) > > > > > > goto ask_now; > > > > > > } > > > > > > > > > > > > - /* New root needs a singe node */ > > > > > > - if (unlikely(mte_is_root(mas->node))) > > > > > > - goto ask_now; > > > > Why did you drop this? If we are creating a new root we will only need > > one node. > The code below handles the root case perfectly, > we don't need additional checks. > if (node_size - 1 <= mt_min_slots[wr_mas.type]) > request = mas_mt_height(mas) * 2 - 1; Unless we have the minimum for a node, in which case we will fall through and ask for one node. This works and is rare enough so I'll drop it. Thanks. 
> > > > > > > > + if ((node_size == wr_mas.node_end + 1 && > > > > > > + mas->offset == wr_mas.node_end) || > > > > > > + (node_size == wr_mas.node_end && > > > > > > + wr_mas.offset_end - mas->offset == 1)) > > > > > > + return 0; > > > > I will add this to v2 as well, or something similar. > > > > > > > > > > > > > > /* Potential spanning rebalance collapsing a node, use worst-case */ > > > > > > if (node_size - 1 <= mt_min_slots[wr_mas.type]) > > > > > > @@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp) > > > > > > if (likely(!mas_is_err(mas))) > > > > > > return 0; > > > > > > > > > > > > - mas_set_alloc_req(mas, 0); > > > > Why did you drop this? It seems like a worth while cleanup on failure. > Because we will clear it in mas_node_count_gfp()->mas_alloc_nodes(). On failure we set the alloc request to the remainder of what was not allocated. > > > > > > > > ret = xa_err(mas->node); > > > > > > mas_reset(mas); > > > > > > mas_destroy(mas); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Regards > > > > > > > Yin, Fengwei > > > > > > > > > > > > > > > > > > > > > > > [1] https://lore.kernel.org/linux-mm/202305061457.ac15990c-yujie.liu@intel.com/ > > > > > > > > > > > > > > > > Liam R. 
Howlett (14): [quoted 14-patch list and diffstat snipped]
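The early-return condition from Peng's diff, which Liam says he will fold into v2, is a two-clause predicate over the write state. Below is a standalone Python encoding of its boolean structure; the field names mirror the quoted diff, but what node_end, node_size, and the offsets mean precisely is kernel-internal, so no semantic labels are claimed here:

```python
# Standalone model of the proposed "preallocation can return 0 nodes" check.
# Illustrative only: it encodes the boolean shape of the quoted condition,
# not the maple tree's actual write state.

def store_needs_no_nodes(node_size: int, node_end: int,
                         offset: int, offset_end: int) -> bool:
    # First clause of the quoted condition.
    c1 = node_size == node_end + 1 and offset == node_end
    # Second clause of the quoted condition.
    c2 = node_size == node_end and offset_end - offset == 1
    return c1 or c2

print(store_needs_no_nodes(node_size=10, node_end=9, offset=9, offset_end=10))  # True
print(store_needs_no_nodes(node_size=9, node_end=9, offset=3, offset_end=4))    # True
print(store_needs_no_nodes(node_size=9, node_end=9, offset=3, offset_end=5))    # False
```

Either clause returning true means the store fits in the existing node, so mas_preallocate() can return without allocating anything.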
在 2023/6/5 22:44, Liam R. Howlett 写道: > * Peng Zhang <zhangpeng.00@bytedance.com> [230605 10:27]: >> >> >> 在 2023/6/5 22:03, Liam R. Howlett 写道: >>> * Peng Zhang <zhangpeng.00@bytedance.com> [230605 03:59]: >>>> >>>> >>>> 在 2023/6/5 14:18, Yin, Fengwei 写道: >>>>> >>>>> >>>>> On 6/5/2023 12:41 PM, Yin Fengwei wrote: >>>>>> Hi Peng, >>>>>> >>>>>> On 6/5/23 11:28, Peng Zhang wrote: >>>>>>> >>>>>>> >>>>>>> 在 2023/6/2 16:10, Yin, Fengwei 写道: >>>>>>>> Hi Liam, >>>>>>>> >>>>>>>> On 6/1/2023 10:15 AM, Liam R. Howlett wrote: >>>>>>>>> Initial work on preallocations showed no regression in performance >>>>>>>>> during testing, but recently some users (both on [1] and off [android] >>>>>>>>> list) have reported that preallocating the worst-case number of nodes >>>>>>>>> has caused some slow down. This patch set addresses the number of >>>>>>>>> allocations in a few ways. >>>>>>>>> >>>>>>>>> During munmap() most munmap() operations will remove a single VMA, so >>>>>>>>> leverage the fact that the maple tree can place a single pointer at >>>>>>>>> range 0 - 0 without allocating. This is done by changing the index in >>>>>>>>> the 'sidetree'. >>>>>>>>> >>>>>>>>> Re-introduce the entry argument to mas_preallocate() so that a more >>>>>>>>> intelligent guess of the node count can be made. >>>>>>>>> >>>>>>>>> Patches are in the following order: >>>>>>>>> 0001-0002: Testing framework for benchmarking some operations >>>>>>>>> 0003-0004: Reduction of maple node allocation in sidetree >>>>>>>>> 0005: Small cleanup of do_vmi_align_munmap() >>>>>>>>> 0006-0013: mas_preallocate() calculation change >>>>>>>>> 0014: Change the vma iterator order >>>>>>>> I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with >>>>>>>> this patchset. 
>>>>>>>> >>>>>>>> The result has a little bit improvement: >>>>>>>> Base (next-20230602): >>>>>>>> 503880 >>>>>>>> Base with this patchset: >>>>>>>> 519501 >>>>>>>> >>>>>>>> But they are far from the none-regression result (commit 7be1c1a3c7b1): >>>>>>>> 718080 >>>>>>>> >>>>>>>> >>>>>>>> Some other information I collected: >>>>>>>> With Base, the mas_alloc_nodes are always hit with request: 7. >>>>>>>> With this patchset, the request are 1 or 5. >>>>>>>> >>>>>>>> I suppose this is the reason for improvement from 503880 to 519501. >>>>>>>> >>>>>>>> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered >>>>>>>> mas_alloc_nodes() call. Thanks. >>>>>>> Hi Fengwei, >>>>>>> >>>>>>> I think it may be related to the inaccurate number of nodes allocated >>>>>>> in the pre-allocation. I slightly modified the pre-allocation in this >>>>>>> patchset, but I don't know if it works. It would be great if you could >>>>>>> help test it, and help pinpoint the cause. Below is the diff, which can >>>>>>> be applied based on this pachset. >>>>>> I tried the patch, it could eliminate the call of mas_alloc_nodes() during >>>>>> the test. But the result of benchmark got a little bit improvement: >>>>>> 529040 >>>>>> >>>>>> But it's still much less than none-regression result. I will also double >>>>>> confirm the none-regression result. >>>>> Just noticed that the commit f5715584af95 make validate_mm() two implementation >>>>> based on CONFIG_DEBUG_VM instead of CONFIG_DEBUG_VM_MAPPLE_TREE). I have >>>>> CONFIG_DEBUG_VM but not CONFIG_DEBUG_VM_MAPPLE_TREE defined. So it's not an >>>>> apple to apple. >>> >>> You mean "mm: update validate_mm() to use vma iterator" here I guess. I >>> have it as a different commit id in my branch. >>> >>> I 'restored' some of the checking because I was able to work around not >>> having the mt_dump() definition with the vma iterator. 
>>> I'm now wondering how widespread the use of CONFIG_DEBUG_VM is and
>>> whether I should not have added these extra checks.
>>>
>>>>>
>>>>> I disabled CONFIG_DEBUG_VM and re-ran the test and got:
>>>>> Before the preallocation change (7be1c1a3c7b1):
>>>>> 770100
>>>>> After the preallocation change (28c5609fb236):
>>>>> 680000
>>>>> With Liam's fix:
>>>>> 702100
>>>>> Plus Peng's fix:
>>>>> 725900
>>>> Thank you for your test. Now it seems the performance
>>>> regression is not so large.
>>>
>>> We are also too strict on the reset during mas_store_prealloc() checking
>>> for a spanning write. I have a fix for this for v2 of the patch set,
>>> although I suspect it will not make a huge difference.
>>>
>>>>>
>>>>> Regards
>>>>> Yin, Fengwei
>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Yin, Fengwei
>>>>>>
>>>>>>> Thanks,
>>>>>>> Peng
>>>>>>>
>>>>>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>>>>>>> index 5ea211c3f186..e67bf2744384 100644
>>>>>>> --- a/lib/maple_tree.c
>>>>>>> +++ b/lib/maple_tree.c
>>>>>>> @@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>>>>>>  		goto ask_now;
>>>>>>>  	}
>>>>>>>
>>>>>>> -	/* New root needs a singe node */
>>>>>>> -	if (unlikely(mte_is_root(mas->node)))
>>>>>>> -		goto ask_now;
>>>
>>> Why did you drop this? If we are creating a new root we will only need
>>> one node.
>> The code below handles the root case perfectly,
>> we don't need additional checks.
>> 	if (node_size - 1 <= mt_min_slots[wr_mas.type])
>> 		request = mas_mt_height(mas) * 2 - 1;
>
> Unless we have the minimum for a node, in which case we will fall
> through and ask for one node. This works and is rare enough so I'll
> drop it. Thanks.
>
>>>
>>>>>>> +	if ((node_size == wr_mas.node_end + 1 &&
>>>>>>> +	     mas->offset == wr_mas.node_end) ||
>>>>>>> +	    (node_size == wr_mas.node_end &&
>>>>>>> +	     wr_mas.offset_end - mas->offset == 1))
>>>>>>> +		return 0;
>>>
>>> I will add this to v2 as well, or something similar.
>>>
>>>>>>>
>>>>>>>  	/* Potential spanning rebalance collapsing a node, use worst-case */
>>>>>>>  	if (node_size - 1 <= mt_min_slots[wr_mas.type])
>>>>>>> @@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>>>>>>  	if (likely(!mas_is_err(mas)))
>>>>>>>  		return 0;
>>>>>>>
>>>>>>> -	mas_set_alloc_req(mas, 0);
>>>
>>> Why did you drop this? It seems like a worthwhile cleanup on failure.
>> Because we will clear it in mas_node_count_gfp()->mas_alloc_nodes().
>
> On failure we set the alloc request to the remainder of what was not
> allocated.
Yes, you are right.

>
>>>
>>>>>>>  	ret = xa_err(mas->node);
>>>>>>>  	mas_reset(mas);
>>>>>>>  	mas_destroy(mas);
>>>>>>>
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Yin, Fengwei
>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1] https://lore.kernel.org/linux-mm/202305061457.ac15990c-yujie.liu@intel.com/
* Yin, Fengwei <fengwei.yin@intel.com> [230602 04:11]:
> Hi Liam,
>
> On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
> > Initial work on preallocations showed no regression in performance
> > during testing, but recently some users (both on [1] and off [android]
> > list) have reported that preallocating the worst-case number of nodes
> > has caused some slow down. This patch set addresses the number of
> > allocations in a few ways.
> >
> > During munmap() most munmap() operations will remove a single VMA, so
> > leverage the fact that the maple tree can place a single pointer at
> > range 0 - 0 without allocating. This is done by changing the index in
> > the 'sidetree'.
> >
> > Re-introduce the entry argument to mas_preallocate() so that a more
> > intelligent guess of the node count can be made.
> >
> > Patches are in the following order:
> > 0001-0002: Testing framework for benchmarking some operations
> > 0003-0004: Reduction of maple node allocation in sidetree
> > 0005: Small cleanup of do_vmi_align_munmap()
> > 0006-0013: mas_preallocate() calculation change
> > 0014: Change the vma iterator order
> I ran the AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
> this patchset.
>
> The result shows a small improvement:
> Base (next-20230602):
> 503880
> Base with this patchset:
> 519501
>
> But they are far from the non-regression result (commit 7be1c1a3c7b1):
> 718080
>
> Some other information I collected:
> With Base, mas_alloc_nodes() is always hit with request: 7.
> With this patchset, the requests are 1 or 5.
>
> I suppose this is the reason for the improvement from 503880 to 519501.
>
> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags() never
> triggered a mas_alloc_nodes() call. Thanks.

Thanks for retesting. I've not been able to see the regression myself.
Are you running in a VM of sorts? Android and some cloud VMs seem to
see this, but I do not in kvm or the server I test on.
I am still looking to reduce/reverse the regression, and a reproducer on
my end would help.

>
> Regards
> Yin, Fengwei
>
> >
> > [1] https://lore.kernel.org/linux-mm/202305061457.ac15990c-yujie.liu@intel.com/
Hi Liam,

On 6/3/2023 2:55 AM, Liam R. Howlett wrote:
> * Yin, Fengwei <fengwei.yin@intel.com> [230602 04:11]:
>> Hi Liam,
>>
>> On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
>>> Initial work on preallocations showed no regression in performance
>>> during testing, but recently some users (both on [1] and off [android]
>>> list) have reported that preallocating the worst-case number of nodes
>>> has caused some slow down. This patch set addresses the number of
>>> allocations in a few ways.
>>>
>>> During munmap() most munmap() operations will remove a single VMA, so
>>> leverage the fact that the maple tree can place a single pointer at
>>> range 0 - 0 without allocating. This is done by changing the index in
>>> the 'sidetree'.
>>>
>>> Re-introduce the entry argument to mas_preallocate() so that a more
>>> intelligent guess of the node count can be made.
>>>
>>> Patches are in the following order:
>>> 0001-0002: Testing framework for benchmarking some operations
>>> 0003-0004: Reduction of maple node allocation in sidetree
>>> 0005: Small cleanup of do_vmi_align_munmap()
>>> 0006-0013: mas_preallocate() calculation change
>>> 0014: Change the vma iterator order
>> I ran the AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
>> this patchset.
>>
>> The result shows a small improvement:
>> Base (next-20230602):
>> 503880
>> Base with this patchset:
>> 519501
>>
>> But they are far from the non-regression result (commit 7be1c1a3c7b1):
>> 718080
>>
>> Some other information I collected:
>> With Base, mas_alloc_nodes() is always hit with request: 7.
>> With this patchset, the requests are 1 or 5.
>>
>> I suppose this is the reason for the improvement from 503880 to 519501.
>>
>> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags() never
>> triggered a mas_alloc_nodes() call. Thanks.
>
> Thanks for retesting. I've not been able to see the regression myself.
> Are you running in a VM of sorts? Android and some cloud VMs seem to
I didn't run it in a VM.
I ran it on a native environment.
> see this, but I do not in kvm or the server I test on.
>
> I am still looking to reduce/reverse the regression, and a reproducer on
> my end would help.
The test is page_test of AIM9. You can get the AIM9 test suite from:
http://nchc.dl.sourceforge.net/project/aimbench/aim-suite9

After building it, you get the app "singleuser". It needs a text file
named s9workfile to define the test case. The s9workfile I am using has
the following content:

# @(#) s9workfile:1.2 1/22/96 00:00:00
# AIM Independent Resource Benchmark - Suite IX Workfile
FILESIZE: 5M
page_test

Then you can run the test with the command:
./singleuser -nl

It will ask some configuration questions and then run the real test.

One thing to take care of: create-clo.c has the line:
newbrk = sbrk(-4096 * 16);
It should be updated to:
intptr_t inc = -4096 * 16;
newbrk = sbrk(inc);
Otherwise, the -4096 * 16 will be treated as 32 bit, and the call ends up
extending brk to around 4G. If we don't have enough RAM, the set_brk
syscall will fail.

If you meet any issue running the test, just ping me. Thanks.

Regards
Yin, Fengwei

>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/202305061457.ac15990c-yujie.liu@intel.com/