[PATCH v9 00/17] reimplement per-vma lock as a refcount

Suren Baghdasaryan posted 17 patches 1 year ago
There is a newer version of this series
[PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Suren Baghdasaryan 1 year ago
Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
Splitting a single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code. When that split-away part is a lock, it complicates
things even further. With no performance benefits, there are no reasons
for this split. Merging the vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
This patchset:
1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
boundary and changing the cache to be cacheline-aligned to minimize
cacheline sharing;
2. changes vm_area_struct initialization to mark new vma as detached until
it is inserted into vma tree;
3. replaces vm_lock and vma->detached flag with a reference counter;
4. regroups vm_area_struct members to fit them into 3 cachelines;
5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
reuse and to minimize call_rcu() calls.
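
To make the refcount model in points 2-4 above concrete, here is a rough
illustrative sketch (not the code from this series) of how vm_refcnt can
encode both the detached state and the read lock. It assumes vm_refcnt is
a refcount_t as in v9, with 0 meaning detached, 1 meaning attached and
unlocked, and each lockless reader holding one extra reference; the
_sketch helpers and the exact encoding are assumptions, and writer
exclusion (vma_start_write() waiting for readers via rcuwait) is omitted:

#include <linux/refcount.h>
#include <linux/mm_types.h>

static inline void vma_mark_attached_sketch(struct vm_area_struct *vma)
{
        /* 0 -> 1 when the vma is inserted into the vma tree. */
        refcount_set(&vma->vm_refcnt, 1);
}

static inline bool vma_start_read_sketch(struct vm_area_struct *vma)
{
        /* Fails for a detached vma (count 0); otherwise takes a reader reference. */
        return refcount_inc_not_zero(&vma->vm_refcnt);
}

static inline void vma_end_read_sketch(struct vm_area_struct *vma)
{
        refcount_dec(&vma->vm_refcnt);
}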

Pagefault microbenchmarks show performance improvement:
Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*

Changes since v8 [4]:
- Change subject for the cover letter, per Vlastimil Babka
- Added Reviewed-by and Acked-by, per Vlastimil Babka
- Added static check for no-limit case in __refcount_add_not_zero_limited,
per David Laight
- Fixed vma_refcount_put() to call rwsem_release() unconditionally,
per Hillf Danton and Vlastimil Babka
- Use a copy of vma->vm_mm in vma_refcount_put() in case vma is freed from
under us, per Vlastimil Babka
- Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
per Vlastimil Babka
- Changed __vma_enter_locked() parameter to centralize refcount logic,
per Vlastimil Babka
- Amended description in vm_lock replacement patch explaining the effects
of the patch on vm_area_struct size, per Vlastimil Babka
- Added vm_area_struct member regrouping patch [5] into the series
- Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
- Added a comment for vm_area_struct to update vm_area_init_from() when
adding new members, per Vlastimil Babka
- Updated a comment about unstable src->shared.rb when copying a vma in
vm_area_init_from(), per Vlastimil Babka

[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
[4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@google.com/
[5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/

Patchset applies over mm-unstable after reverting v8
(current SHA range: 235b5129cb7b - 9e6b24c58985)

Suren Baghdasaryan (17):
  mm: introduce vma_start_read_locked{_nested} helpers
  mm: move per-vma lock into vm_area_struct
  mm: mark vma as detached until it's added into vma tree
  mm: introduce vma_iter_store_attached() to use with attached vmas
  mm: mark vmas detached upon exit
  types: move struct rcuwait into types.h
  mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
  mm: move mmap_init_lock() out of the header file
  mm: uninline the main body of vma_start_write()
  refcount: introduce __refcount_{add|inc}_not_zero_limited
  mm: replace vm_lock and detached flag with a reference count
  mm: move lesser used vma_area_struct members into the last cacheline
  mm/debug: print vm_refcnt state when dumping the vma
  mm: remove extra vma_numab_state_init() call
  mm: prepare lock_vma_under_rcu() for vma reuse possibility
  mm: make vma cache SLAB_TYPESAFE_BY_RCU
  docs/mm: document latest changes to vm_lock

 Documentation/mm/process_addrs.rst |  44 ++++----
 include/linux/mm.h                 | 156 ++++++++++++++++++++++-------
 include/linux/mm_types.h           |  75 +++++++-------
 include/linux/mmap_lock.h          |   6 --
 include/linux/rcuwait.h            |  13 +--
 include/linux/refcount.h           |  24 ++++-
 include/linux/slab.h               |   6 --
 include/linux/types.h              |  12 +++
 kernel/fork.c                      | 129 +++++++++++-------------
 mm/debug.c                         |  12 +++
 mm/init-mm.c                       |   1 +
 mm/memory.c                        |  97 ++++++++++++++++--
 mm/mmap.c                          |   3 +-
 mm/userfaultfd.c                   |  32 +++---
 mm/vma.c                           |  23 ++---
 mm/vma.h                           |  15 ++-
 tools/testing/vma/linux/atomic.h   |   5 +
 tools/testing/vma/vma_internal.h   |  93 ++++++++---------
 18 files changed, 465 insertions(+), 281 deletions(-)

-- 
2.47.1.613.gc27f4b7a9f-goog
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Shivank Garg 1 year ago
Hi Suren,

I tested the patch series on an AMD EPYC Zen 5 system
(2-socket, 64 cores per socket with SMT enabled, 4 NUMA nodes)
using mmtests's PFT (config-workload-pft-threads) benchmark on
mm-unstable.

On 1/11/2025 9:55 AM, Suren Baghdasaryan wrote:
> Back when per-vma locks were introduced, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regression is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> Splitting a single logical structure into multiple ones leads to more
> complicated management, extra pointer dereferences and overall less
> maintainable code. When that split-away part is a lock, it complicates
> things even further. With no performance benefits, there are no reasons
> for this split. Merging the vm_lock back into vm_area_struct also allows
> vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> This patchset:
> 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> boundary and changing the cache to be cacheline-aligned to minimize
> cacheline sharing;
> 2. changes vm_area_struct initialization to mark new vma as detached until
> it is inserted into vma tree;
> 3. replaces vm_lock and vma->detached flag with a reference counter;
> 4. regroups vm_area_struct members to fit them into 3 cachelines;
> 5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> reuse and to minimize call_rcu() calls.
> 
> Pagefault microbenchmarks show performance improvement:
> Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*

Results:
                             mm-unstable-vanilla   mm-unstable-v9-pervma-lock
Hmean     faults/cpu-1    2018530.3023 (   0.00%)  2051456.8039 (   1.63%)
Hmean     faults/cpu-4     718869.0234 (   0.00%)   720985.6986 (   0.29%)
Hmean     faults/cpu-7     377965.1187 (   0.00%)   368305.2802 (  -2.56%)
Hmean     faults/cpu-12    215502.5334 (   0.00%)   212764.0061 (  -1.27%)
Hmean     faults/cpu-21    149946.1873 (   0.00%)   150688.2173 (   0.49%)
Hmean     faults/cpu-30    142379.7863 (   0.00%)   143677.5012 (   0.91%)
Hmean     faults/cpu-48    139625.5003 (   0.00%)   156217.1383 *  11.88%*
Hmean     faults/cpu-79    119093.6132 (   0.00%)   140501.1787 *  17.98%*
Hmean     faults/cpu-110   102446.6879 (   0.00%)   114128.7155 (  11.40%)
Hmean     faults/cpu-128    96640.4022 (   0.00%)   109474.8875 (  13.28%)
Hmean     faults/sec-1    2018197.4666 (   0.00%)  2051119.1375 (   1.63%)
Hmean     faults/sec-4    2853494.9619 (   0.00%)  2865639.8350 (   0.43%)
Hmean     faults/sec-7    2631049.4283 (   0.00%)  2564037.1696 (  -2.55%)
Hmean     faults/sec-12   2570378.4204 (   0.00%)  2540353.2525 (  -1.17%)
Hmean     faults/sec-21   3018640.3933 (   0.00%)  3014396.9773 (  -0.14%)
Hmean     faults/sec-30   4150723.9209 (   0.00%)  4167550.4070 (   0.41%)
Hmean     faults/sec-48   6459327.6559 (   0.00%)  7217660.4385 *  11.74%*
Hmean     faults/sec-79   8977397.1421 (   0.00%) 10695351.6214 *  19.14%*
Hmean     faults/sec-110 10590055.2262 (   0.00%) 12309035.9250 (  16.23%)
Hmean     faults/sec-128 11448246.6485 (   0.00%) 13554648.3823 (  18.40%)

Please add my:
Tested-by: Shivank Garg <shivankg@amd.com>

I would be happy to test future versions if needed.

Thanks,
Shivank



> 
> Changes since v8 [4]:
> - Change subject for the cover letter, per Vlastimil Babka
> - Added Reviewed-by and Acked-by, per Vlastimil Babka
> - Added static check for no-limit case in __refcount_add_not_zero_limited,
> per David Laight
> - Fixed vma_refcount_put() to call rwsem_release() unconditionally,
> per Hillf Danton and Vlastimil Babka
> - Use a copy of vma->vm_mm in vma_refcount_put() in case vma is freed from
> under us, per Vlastimil Babka
> - Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
> per Vlastimil Babka
> - Changed __vma_enter_locked() parameter to centralize refcount logic,
> per Vlastimil Babka
> - Amended description in vm_lock replacement patch explaining the effects
> of the patch on vm_area_struct size, per Vlastimil Babka
> - Added vm_area_struct member regrouping patch [5] into the series
> - Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
> - Added a comment for vm_area_struct to update vm_area_init_from() when
> adding new members, per Vlastimil Babka
> - Updated a comment about unstable src->shared.rb when copying a vma in
> vm_area_init_from(), per Vlastimil Babka
> 
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> [4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@google.com/
> [5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
> 
> Patchset applies over mm-unstable after reverting v8
> (current SHA range: 235b5129cb7b - 9e6b24c58985)
> 
> Suren Baghdasaryan (17):
>   mm: introduce vma_start_read_locked{_nested} helpers
>   mm: move per-vma lock into vm_area_struct
>   mm: mark vma as detached until it's added into vma tree
>   mm: introduce vma_iter_store_attached() to use with attached vmas
>   mm: mark vmas detached upon exit
>   types: move struct rcuwait into types.h
>   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
>   mm: move mmap_init_lock() out of the header file
>   mm: uninline the main body of vma_start_write()
>   refcount: introduce __refcount_{add|inc}_not_zero_limited
>   mm: replace vm_lock and detached flag with a reference count
>   mm: move lesser used vma_area_struct members into the last cacheline
>   mm/debug: print vm_refcnt state when dumping the vma
>   mm: remove extra vma_numab_state_init() call
>   mm: prepare lock_vma_under_rcu() for vma reuse possibility
>   mm: make vma cache SLAB_TYPESAFE_BY_RCU
>   docs/mm: document latest changes to vm_lock
> 
>  Documentation/mm/process_addrs.rst |  44 ++++----
>  include/linux/mm.h                 | 156 ++++++++++++++++++++++-------
>  include/linux/mm_types.h           |  75 +++++++-------
>  include/linux/mmap_lock.h          |   6 --
>  include/linux/rcuwait.h            |  13 +--
>  include/linux/refcount.h           |  24 ++++-
>  include/linux/slab.h               |   6 --
>  include/linux/types.h              |  12 +++
>  kernel/fork.c                      | 129 +++++++++++-------------
>  mm/debug.c                         |  12 +++
>  mm/init-mm.c                       |   1 +
>  mm/memory.c                        |  97 ++++++++++++++++--
>  mm/mmap.c                          |   3 +-
>  mm/userfaultfd.c                   |  32 +++---
>  mm/vma.c                           |  23 ++---
>  mm/vma.h                           |  15 ++-
>  tools/testing/vma/linux/atomic.h   |   5 +
>  tools/testing/vma/vma_internal.h   |  93 ++++++++---------
>  18 files changed, 465 insertions(+), 281 deletions(-)
>
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Suren Baghdasaryan 1 year ago
On Mon, Jan 27, 2025 at 9:27 PM Shivank Garg <shivankg@amd.com> wrote:
>
> Hi Suren,
>
> I tested the patch series on an AMD EPYC Zen 5 system
> (2-socket, 64 cores per socket with SMT enabled, 4 NUMA nodes)
> using mmtests's PFT (config-workload-pft-threads) benchmark on
> mm-unstable.
>
> On 1/11/2025 9:55 AM, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
> > Splitting a single logical structure into multiple ones leads to more
> > complicated management, extra pointer dereferences and overall less
> > maintainable code. When that split-away part is a lock, it complicates
> > things even further. With no performance benefits, there are no reasons
> > for this split. Merging the vm_lock back into vm_area_struct also allows
> > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > This patchset:
> > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > boundary and changing the cache to be cacheline-aligned to minimize
> > cacheline sharing;
> > 2. changes vm_area_struct initialization to mark new vma as detached until
> > it is inserted into vma tree;
> > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > 4. regroups vm_area_struct members to fit them into 3 cachelines;
> > 5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > reuse and to minimize call_rcu() calls.
> >
> > Pagefault microbenchmarks show performance improvement:
> > Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> > Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> > Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> > Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> > Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> > Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> > Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> > Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> > Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> > Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> > Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> > Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> > Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> > Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> > Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> > Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
>
> Results:
>                              mm-unstable-vanilla   mm-unstable-v9-pervma-lock
> Hmean     faults/cpu-1    2018530.3023 (   0.00%)  2051456.8039 (   1.63%)
> Hmean     faults/cpu-4     718869.0234 (   0.00%)   720985.6986 (   0.29%)
> Hmean     faults/cpu-7     377965.1187 (   0.00%)   368305.2802 (  -2.56%)
> Hmean     faults/cpu-12    215502.5334 (   0.00%)   212764.0061 (  -1.27%)
> Hmean     faults/cpu-21    149946.1873 (   0.00%)   150688.2173 (   0.49%)
> Hmean     faults/cpu-30    142379.7863 (   0.00%)   143677.5012 (   0.91%)
> Hmean     faults/cpu-48    139625.5003 (   0.00%)   156217.1383 *  11.88%*
> Hmean     faults/cpu-79    119093.6132 (   0.00%)   140501.1787 *  17.98%*
> Hmean     faults/cpu-110   102446.6879 (   0.00%)   114128.7155 (  11.40%)
> Hmean     faults/cpu-128    96640.4022 (   0.00%)   109474.8875 (  13.28%)
> Hmean     faults/sec-1    2018197.4666 (   0.00%)  2051119.1375 (   1.63%)
> Hmean     faults/sec-4    2853494.9619 (   0.00%)  2865639.8350 (   0.43%)
> Hmean     faults/sec-7    2631049.4283 (   0.00%)  2564037.1696 (  -2.55%)
> Hmean     faults/sec-12   2570378.4204 (   0.00%)  2540353.2525 (  -1.17%)
> Hmean     faults/sec-21   3018640.3933 (   0.00%)  3014396.9773 (  -0.14%)
> Hmean     faults/sec-30   4150723.9209 (   0.00%)  4167550.4070 (   0.41%)
> Hmean     faults/sec-48   6459327.6559 (   0.00%)  7217660.4385 *  11.74%*
> Hmean     faults/sec-79   8977397.1421 (   0.00%) 10695351.6214 *  19.14%*
> Hmean     faults/sec-110 10590055.2262 (   0.00%) 12309035.9250 (  16.23%)
> Hmean     faults/sec-128 11448246.6485 (   0.00%) 13554648.3823 (  18.40%)

Hi Shivank,
Thank you for providing more test results! This looks quite good!

>
> Please add my:
> Tested-by: Shivank Garg <shivankg@amd.com>
>
> I would be happy to test future versions if needed.

That would be very much appreciated! I'll CC you on the next version.
Thanks,
Suren.

>
> Thanks,
> Shivank
>
>
>
> >
> > Changes since v8 [4]:
> > - Change subject for the cover letter, per Vlastimil Babka
> > - Added Reviewed-by and Acked-by, per Vlastimil Babka
> > - Added static check for no-limit case in __refcount_add_not_zero_limited,
> > per David Laight
> > - Fixed vma_refcount_put() to call rwsem_release() unconditionally,
> > per Hillf Danton and Vlastimil Babka
> > - Use a copy of vma->vm_mm in vma_refcount_put() in case vma is freed from
> > under us, per Vlastimil Babka
> > - Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
> > per Vlastimil Babka
> > - Changed __vma_enter_locked() parameter to centralize refcount logic,
> > per Vlastimil Babka
> > - Amended description in vm_lock replacement patch explaining the effects
> > of the patch on vm_area_struct size, per Vlastimil Babka
> > - Added vm_area_struct member regrouping patch [5] into the series
> > - Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
> > - Added a comment for vm_area_struct to update vm_area_init_from() when
> > adding new members, per Vlastimil Babka
> > - Updated a comment about unstable src->shared.rb when copying a vma in
> > vm_area_init_from(), per Vlastimil Babka
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> > [4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@google.com/
> > [5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
> >
> > Patchset applies over mm-unstable after reverting v8
> > (current SHA range: 235b5129cb7b - 9e6b24c58985)
> >
> > Suren Baghdasaryan (17):
> >   mm: introduce vma_start_read_locked{_nested} helpers
> >   mm: move per-vma lock into vm_area_struct
> >   mm: mark vma as detached until it's added into vma tree
> >   mm: introduce vma_iter_store_attached() to use with attached vmas
> >   mm: mark vmas detached upon exit
> >   types: move struct rcuwait into types.h
> >   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
> >   mm: move mmap_init_lock() out of the header file
> >   mm: uninline the main body of vma_start_write()
> >   refcount: introduce __refcount_{add|inc}_not_zero_limited
> >   mm: replace vm_lock and detached flag with a reference count
> >   mm: move lesser used vma_area_struct members into the last cacheline
> >   mm/debug: print vm_refcnt state when dumping the vma
> >   mm: remove extra vma_numab_state_init() call
> >   mm: prepare lock_vma_under_rcu() for vma reuse possibility
> >   mm: make vma cache SLAB_TYPESAFE_BY_RCU
> >   docs/mm: document latest changes to vm_lock
> >
> >  Documentation/mm/process_addrs.rst |  44 ++++----
> >  include/linux/mm.h                 | 156 ++++++++++++++++++++++-------
> >  include/linux/mm_types.h           |  75 +++++++-------
> >  include/linux/mmap_lock.h          |   6 --
> >  include/linux/rcuwait.h            |  13 +--
> >  include/linux/refcount.h           |  24 ++++-
> >  include/linux/slab.h               |   6 --
> >  include/linux/types.h              |  12 +++
> >  kernel/fork.c                      | 129 +++++++++++-------------
> >  mm/debug.c                         |  12 +++
> >  mm/init-mm.c                       |   1 +
> >  mm/memory.c                        |  97 ++++++++++++++++--
> >  mm/mmap.c                          |   3 +-
> >  mm/userfaultfd.c                   |  32 +++---
> >  mm/vma.c                           |  23 ++---
> >  mm/vma.h                           |  15 ++-
> >  tools/testing/vma/linux/atomic.h   |   5 +
> >  tools/testing/vma/vma_internal.h   |  93 ++++++++---------
> >  18 files changed, 465 insertions(+), 281 deletions(-)
> >
>
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Matthew Wilcox 1 year ago
On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> - Added static check for no-limit case in __refcount_add_not_zero_limited,
> per David Laight

Ugh, no, don't listen to David.
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Suren Baghdasaryan 1 year ago
On Fri, Jan 10, 2025 at 8:52 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> > - Added static check for no-limit case in __refcount_add_not_zero_limited,
> > per David Laight
>
> Ugh, no, don't listen to David.

I thought his suggestion to add a check which can be verified at
compile time made sense. Could you please explain why that's a bad
idea? I'm really curious.
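
For readers following along: the existing __refcount_add_not_zero() in
include/linux/refcount.h is roughly the shape being extended here. Below
is a sketch of what a "limited" variant with a statically verifiable
no-limit short-circuit could look like; the signature, the limit
convention and the check itself are assumptions for illustration, not the
code from the series:

/* Sketch only: modeled on __refcount_add_not_zero(), not taken from v9. */
static inline __must_check bool
__refcount_add_not_zero_limited_sketch(int i, refcount_t *r, int *oldp, int limit)
{
        int old = refcount_read(r);

        do {
                if (!old)
                        break;
                /*
                 * Hypothetical "no limit" convention: limit == 0 means
                 * unlimited. When limit is a compile-time constant zero,
                 * the comparison vanishes entirely - presumably the kind
                 * of statically checkable case being debated above.
                 */
                if (limit && old + i > limit)
                        return false;
        } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
        /*
         * Note: the relaxed ordering here is the memory-ordering concern
         * raised later in this thread for __refcount_inc_not_zero_limited().
         */

        if (oldp)
                *oldp = old;

        return old;
}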
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Lorenzo Stoakes 1 year ago
A nit on subject, I mean this is part of what this series does, and hey -
we have only so much text to put in here - but isn't this both
reimplementing per-VMA lock as a refcount _and_ importantly allocating VMAs
using the RCU typesafe mechanism?

Do we have to do both in one series? Can we split this out? I mean maybe
that's just churny and unnecessary, but to me this series is 'allocate VMAs
RCU safe and refcount VMA lock' or something like this. Maybe this is
nitty... but still :)

One general comment here - this is a really major change in how this stuff
works, and yet I don't see any tests anywhere in the series.

I know it's tricky to write tests for this, but the new VMA testing
environment should make it possible to test a _lot_ more than we previously
could.

However due to some (*ahem*) interesting distribution of where functions
are, most notably stuff in kernel/fork.c, I guess we can't test
_everything_ there effectively.

But I do feel like we should be able to do better than having absolutely no
testing added for this?

I think there's definitely quite a bit you could test now, at least in
asserting fundamentals in tools/testing/vma/vma.c.

This can cover at least detached state asserts in various scenarios.

But that won't cover off the really gnarly stuff here around RCU slab
allocation, and determining precisely how to test that in a sensible way is
maybe less clear.

But I'd like to see _something_ here please, this is more or less
fundamentally changing how all VMAs are allocated and to just have nothing
feels unfortunate.

I'm already nervous because we've hit issues coming up to v9 and we're not
100% sure if a recent syzkaller is related to these changes or not, I'm not
sure how much we can get assurances with tests but I'd like something.

Thanks!

On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> Back when per-vma locks were introduced, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regression is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> Splitting a single logical structure into multiple ones leads to more
> complicated management, extra pointer dereferences and overall less
> maintainable code. When that split-away part is a lock, it complicates
> things even further. With no performance benefits, there are no reasons
> for this split. Merging the vm_lock back into vm_area_struct also allows
> vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> This patchset:
> 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> boundary and changing the cache to be cacheline-aligned to minimize
> cacheline sharing;
> 2. changes vm_area_struct initialization to mark new vma as detached until
> it is inserted into vma tree;
> 3. replaces vm_lock and vma->detached flag with a reference counter;
> 4. regroups vm_area_struct members to fit them into 3 cachelines;
> 5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> reuse and to minimize call_rcu() calls.
>
> Pagefault microbenchmarks show performance improvement:
> Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
>
> Changes since v8 [4]:
> - Change subject for the cover letter, per Vlastimil Babka
> - Added Reviewed-by and Acked-by, per Vlastimil Babka
> - Added static check for no-limit case in __refcount_add_not_zero_limited,
> per David Laight
> - Fixed vma_refcount_put() to call rwsem_release() unconditionally,
> per Hillf Danton and Vlastimil Babka
> - Use a copy of vma->vm_mm in vma_refcount_put() in case vma is freed from
> under us, per Vlastimil Babka
> - Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
> per Vlastimil Babka
> - Changed __vma_enter_locked() parameter to centralize refcount logic,
> per Vlastimil Babka
> - Amended description in vm_lock replacement patch explaining the effects
> of the patch on vm_area_struct size, per Vlastimil Babka
> - Added vm_area_struct member regrouping patch [5] into the series
> - Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
> - Added a comment for vm_area_struct to update vm_area_init_from() when
> adding new members, per Vlastimil Babka
> - Updated a comment about unstable src->shared.rb when copying a vma in
> vm_area_init_from(), per Vlastimil Babka
>
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> [4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@google.com/
> [5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
>
> Patchset applies over mm-unstable after reverting v8
> (current SHA range: 235b5129cb7b - 9e6b24c58985)
>
> Suren Baghdasaryan (17):
>   mm: introduce vma_start_read_locked{_nested} helpers
>   mm: move per-vma lock into vm_area_struct
>   mm: mark vma as detached until it's added into vma tree
>   mm: introduce vma_iter_store_attached() to use with attached vmas
>   mm: mark vmas detached upon exit
>   types: move struct rcuwait into types.h
>   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
>   mm: move mmap_init_lock() out of the header file
>   mm: uninline the main body of vma_start_write()
>   refcount: introduce __refcount_{add|inc}_not_zero_limited
>   mm: replace vm_lock and detached flag with a reference count
>   mm: move lesser used vma_area_struct members into the last cacheline
>   mm/debug: print vm_refcnt state when dumping the vma
>   mm: remove extra vma_numab_state_init() call
>   mm: prepare lock_vma_under_rcu() for vma reuse possibility
>   mm: make vma cache SLAB_TYPESAFE_BY_RCU
>   docs/mm: document latest changes to vm_lock
>
>  Documentation/mm/process_addrs.rst |  44 ++++----
>  include/linux/mm.h                 | 156 ++++++++++++++++++++++-------
>  include/linux/mm_types.h           |  75 +++++++-------
>  include/linux/mmap_lock.h          |   6 --
>  include/linux/rcuwait.h            |  13 +--
>  include/linux/refcount.h           |  24 ++++-
>  include/linux/slab.h               |   6 --
>  include/linux/types.h              |  12 +++
>  kernel/fork.c                      | 129 +++++++++++-------------
>  mm/debug.c                         |  12 +++
>  mm/init-mm.c                       |   1 +
>  mm/memory.c                        |  97 ++++++++++++++++--
>  mm/mmap.c                          |   3 +-
>  mm/userfaultfd.c                   |  32 +++---
>  mm/vma.c                           |  23 ++---
>  mm/vma.h                           |  15 ++-
>  tools/testing/vma/linux/atomic.h   |   5 +
>  tools/testing/vma/vma_internal.h   |  93 ++++++++---------
>  18 files changed, 465 insertions(+), 281 deletions(-)
>
> --
> 2.47.1.613.gc27f4b7a9f-goog
>
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Andrew Morton 1 year ago
On Mon, 13 Jan 2025 12:14:19 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> A nit on subject, I mean this is part of what this series does, and hey -
> we have only so much text to put in here - but isn't this both
> reimplementing per-VMA lock as a refcount _and_ importantly allocating VMAs
> using the RCU typesafe mechanism?
> 
> Do we have to do both in one series? Can we split this out? I mean maybe
> that's just churny and unnecessary, but to me this series is 'allocate VMAs
> RCU safe and refcount VMA lock' or something like this. Maybe this is
> nitty... but still :)
> 
> One general comment here - this is a really major change in how this stuff
> works, and yet I don't see any tests anywhere in the series.
> 
> I know it's tricky to write tests for this, but the new VMA testing
> environment should make it possible to test a _lot_ more than we previously
> could.
> 
> However due to some (*ahem*) interesting distribution of where functions
> are, most notably stuff in kernel/fork.c, I guess we can't test
> _everything_ there effectively.
> 
> But I do feel like we should be able to do better than having absolutely no
> testing added for this?
> 
> I think there's definitely quite a bit you could test now, at least in
> asserting fundamentals in tools/testing/vma/vma.c.
> 
> This can cover at least detached state asserts in various scenarios.
> 
> But that won't cover off the really gnarly stuff here around RCU slab
> allocation, and determining precisely how to test that in a sensible way is
> maybe less clear.
> 
> But I'd like to see _something_ here please, this is more or less
> fundamentally changing how all VMAs are allocated and to just have nothing
> feels unfortunate.
> 
> I'm already nervous because we've hit issues coming up to v9 and we're not
> 100% sure if a recent syzkaller is related to these changes or not, I'm not
> sure how much we can get assurances with tests but I'd like something.

Thanks.

Yes, we're at -rc7 and this series is rather in panic mode and it seems
unnecessarily risky so I'm inclined to set it aside for this cycle.

If the series is considered super desirable and if people are confident
that we can address any remaining glitches during two months of -rc
then sure, we could push the envelope a bit.  But I don't believe this
is the case so I'm thinking let's give ourselves another cycle to get
this all sorted out?
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Suren Baghdasaryan 1 year ago
On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 13 Jan 2025 12:14:19 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > A nit on subject, I mean this is part of what this series does, and hey -
> > we have only so much text to put in here - but isn't this both
> > reimplementing per-VMA lock as a refcount _and_ importantly allocating VMAs
> > using the RCU typesafe mechanism?
> >
> > Do we have to do both in one series? Can we split this out? I mean maybe
> > that's just churny and unnecessary, but to me this series is 'allocate VMAs
> > RCU safe and refcount VMA lock' or something like this. Maybe this is
> > nitty... but still :)
> >
> > One general comment here - this is a really major change in how this stuff
> > works, and yet I don't see any tests anywhere in the series.
> >
> > I know it's tricky to write tests for this, but the new VMA testing
> > environment should make it possible to test a _lot_ more than we previously
> > could.
> >
> > However due to some (*ahem*) interesting distribution of where functions
> > are, most notably stuff in kernel/fork.c, I guess we can't test
> > _everything_ there effectively.
> >
> > But I do feel like we should be able to do better than having absolutely no
> > testing added for this?
> >
> > I think there's definitely quite a bit you could test now, at least in
> > asserting fundamentals in tools/testing/vma/vma.c.
> >
> > This can cover at least detached state asserts in various scenarios.
> >
> > But that won't cover off the really gnarly stuff here around RCU slab
> > allocation, and determining precisely how to test that in a sensible way is
> > maybe less clear.
> >
> > But I'd like to see _something_ here please, this is more or less
> > fundamentally changing how all VMAs are allocated and to just have nothing
> > feels unfortunate.
> >
> > I'm already nervous because we've hit issues coming up to v9 and we're not
> > 100% sure if a recent syzkaller is related to these changes or not, I'm not
> > sure how much we can get assurances with tests but I'd like something.
>
> Thanks.
>
> Yes, we're at -rc7 and this series is rather in panic mode and it seems
> unnecessarily risky so I'm inclined to set it aside for this cycle.
>
> If the series is considered super desirable and if people are confident
> that we can address any remaining glitches during two months of -rc
> then sure, we could push the envelope a bit.  But I don't believe this
> is the case so I'm thinking let's give ourselves another cycle to get
> this all sorted out?

I didn't think this series was in panic mode with one real issue that
is not hard to address (memory ordering in
__refcount_inc_not_zero_limited()) but I'm obviously biased and might
be missing the big picture. In any case, if it makes people nervous I
have no objections to your plan.
Thanks,
Suren.
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Andrew Morton 1 year ago
On Mon, 13 Jan 2025 18:53:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:

> On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> >
> > Yes, we're at -rc7 and this series is rather in panic mode and it seems
> > unnecessarily risky so I'm inclined to set it aside for this cycle.
> >
> > If the series is considered super desirable and if people are confident
> > that we can address any remaining glitches during two months of -rc
> > then sure, we could push the envelope a bit.  But I don't believe this
> > is the case so I'm thinking let's give ourselves another cycle to get
> > this all sorted out?
> 
> I didn't think this series was in panic mode with one real issue that
> is not hard to address (memory ordering in
> __refcount_inc_not_zero_limited()) but I'm obviously biased and might
> be missing the big picture. In any case, if it makes people nervous I
> have no objections to your plan.

Well, I'm soliciting opinions here.  What do others think?

And do you see much urgency with these changes?
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Liam R. Howlett 1 year ago
* Andrew Morton <akpm@linux-foundation.org> [250113 23:09]:
> On Mon, 13 Jan 2025 18:53:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> 
> > On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > >
> > > Yes, we're at -rc7 and this series is rather in panic mode and it seems
> > > unnecessarily risky so I'm inclined to set it aside for this cycle.
> > >
> > > If the series is considered super desirable and if people are confident
> > > that we can address any remaining glitches during two months of -rc
> > > then sure, we could push the envelope a bit.  But I don't believe this
> > > is the case so I'm thinking let's give ourselves another cycle to get
> > > this all sorted out?
> > 
> > I didn't think this series was in panic mode with one real issue that
> > is not hard to address (memory ordering in
> > __refcount_inc_not_zero_limited()) but I'm obviously biased and might
> > be missing the big picture. In any case, if it makes people nervous I
> > have no objections to your plan.
> 
> Well, I'm soliciting opinions here.  What do others think?
> 
> And do you see much urgency with these changes?
> 

I think it's in good shape, but more time for this change is probably
the right approach.

I don't think it's had enough testing time with the changes since v7.
The series has had significant changes, with the side effect of
invalidating some of the test time.

I really like what it does, but if Suren doesn't need it upstream for
some reason then I'd say we leave it to soak longer.

If he does need it upstream, we can deal with any fallout and fixes - it
will have minimum long term effects as it's not an LTS.

Thanks,
Liam
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Suren Baghdasaryan 1 year ago
On Tue, Jan 14, 2025 at 6:59 AM 'Liam R. Howlett' via kernel-team
<kernel-team@android.com> wrote:
>
> * Andrew Morton <akpm@linux-foundation.org> [250113 23:09]:
> > On Mon, 13 Jan 2025 18:53:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > > On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > >
> > > > Yes, we're at -rc7 and this series is rather in panic mode and it seems
> > > > unnecessarily risky so I'm inclined to set it aside for this cycle.
> > > >
> > > > If the series is considered super desirable and if people are confident
> > > > that we can address any remaining glitches during two months of -rc
> > > > then sure, we could push the envelope a bit.  But I don't believe this
> > > > is the case so I'm thinking let's give ourselves another cycle to get
> > > > this all sorted out?
> > >
> > > I didn't think this series was in panic mode with one real issue that
> > > is not hard to address (memory ordering in
> > > __refcount_inc_not_zero_limited()) but I'm obviously biased and might
> > > be missing the big picture. In any case, if it makes people nervous I
> > > have no objections to your plan.
> >
> > Well, I'm soliciting opinions here.  What do others think?
> >
> > And do you see much urgency with these changes?
> >
>
> I think it's in good shape, but more time for this change is probably
> the right approach.
>
> I don't think it's had enough testing time with the changes since v7.
> The series has had significant changes, with the side effect of
> invalidating some of the test time.
>
> I really like what it does, but if Suren doesn't need it upstream for
> some reason then I'd say we leave it to soak longer.
>
> If he does need it upstream, we can deal with any fallout and fixes - it
> will have minimum long term effects as it's not an LTS.

Thanks for voicing your opinion, folks! There is no real urgency and
no objections from me to wait until the next cycle.
I'll be posting v10 shortly purely for reviews while this is fresh on
people's mind, and with the understanding that it won't be picked up
by Andrew.
Thanks,
Suren.

>
> Thanks,
> Liam
>
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Lorenzo Stoakes 1 year ago
On Tue, Jan 14, 2025 at 07:54:48AM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 14, 2025 at 6:59 AM 'Liam R. Howlett' via kernel-team
> <kernel-team@android.com> wrote:
> >
> > * Andrew Morton <akpm@linux-foundation.org> [250113 23:09]:
> > > On Mon, 13 Jan 2025 18:53:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > > On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > >
> > > > >
> > > > > Yes, we're at -rc7 and this series is rather in panic mode and it seems
> > > > > unnecessarily risky so I'm inclined to set it aside for this cycle.
> > > > >
> > > > > If the series is considered super desirable and if people are confident
> > > > > that we can address any remaining glitches during two months of -rc
> > > > > then sure, we could push the envelope a bit.  But I don't believe this
> > > > > is the case so I'm thinking let's give ourselves another cycle to get
> > > > > this all sorted out?
> > > >
> > > > I didn't think this series was in panic mode with one real issue that
> > > > is not hard to address (memory ordering in
> > > > __refcount_inc_not_zero_limited()) but I'm obviously biased and might
> > > > be missing the big picture. In any case, if it makes people nervous I
> > > > have no objections to your plan.
> > >
> > > Well, I'm soliciting opinions here.  What do others think?
> > >
> > > And do you see much urgency with these changes?
> > >
> >
> > I think it's in good shape, but more time for this change is probably
> > the right approach.
> >
> > I don't think it's had enough testing time with the changes since v7.
> > The series has had significant changes, with the side effect of
> > invalidating some of the test time.
> >
> > I really like what it does, but if Suren doesn't need it upstream for
> > some reason then I'd say we leave it to soak longer.
> >
> > If he does need it upstream, we can deal with any fallout and fixes - it
> > will have minimum long term effects as it's not an LTS.
>
> Thanks for voicing your opinion, folks! There is no real urgency and
> no objections from me to wait until the next cycle.
> I'll be posting v10 shortly purely for reviews while this is fresh on
> people's mind, and with the understanding that it won't be picked up
> by Andrew.
> Thanks,
> Suren.

(From my side :) Thanks, and definitely no reflection on quality and your
responsiveness has been amazing, just a reflection of the complexity of
this change.

>
> >
> > Thanks,
> > Liam
> >
> > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> >
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Suren Baghdasaryan 1 year ago
On Wed, Jan 15, 2025 at 3:34 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Jan 14, 2025 at 07:54:48AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 14, 2025 at 6:59 AM 'Liam R. Howlett' via kernel-team
> > <kernel-team@android.com> wrote:
> > >
> > > * Andrew Morton <akpm@linux-foundation.org> [250113 23:09]:
> > > > On Mon, 13 Jan 2025 18:53:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > > On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > > >
> > > > > >
> > > > > > Yes, we're at -rc7 and this series is rather in panic mode and it seems
> > > > > > unnecessarily risky so I'm inclined to set it aside for this cycle.
> > > > > >
> > > > > > If the series is considered super desirable and if people are confident
> > > > > > that we can address any remaining glitches during two months of -rc
> > > > > > then sure, we could push the envelope a bit.  But I don't believe this
> > > > > > is the case so I'm thinking let's give ourselves another cycle to get
> > > > > > this all sorted out?
> > > > >
> > > > > I didn't think this series was in panic mode with one real issue that
> > > > > is not hard to address (memory ordering in
> > > > > __refcount_inc_not_zero_limited()) but I'm obviously biased and might
> > > > > be missing the big picture. In any case, if it makes people nervous I
> > > > > have no objections to your plan.
> > > >
> > > > Well, I'm soliciting opinions here.  What do others think?
> > > >
> > > > And do you see much urgency with these changes?
> > > >
> > >
> > > I think it's in good shape, but more time for this change is probably
> > > the right approach.
> > >
> > > I don't think it's had enough testing time with the changes since v7.
> > > The series has had significant changes, with the side effect of
> > > invalidating some of the test time.
> > >
> > > I really like what it does, but if Suren doesn't need it upstream for
> > > some reason then I'd say we leave it to soak longer.
> > >
> > > If he does need it upstream, we can deal with any fallout and fixes - it
> > > will have minimum long term effects as it's not an LTS.
> >
> > Thanks for voicing your opinion, folks! There is no real urgency and
> > no objections from me to wait until the next cycle.
> > I'll be posting v10 shortly purely for reviews while this is fresh on
> > people's mind, and with the understanding that it won't be picked up
> > by Andrew.
> > Thanks,
> > Suren.
>
> (From my side :) Thanks, and definitely no reflection on quality and your
> responsiveness has been amazing, just a reflection of the complexity of
> this change.

No worries, I understand and accept the reasoning.
And thanks for sugar coating the pill, it made it easier to swallow :)

>
> >
> > >
> > > Thanks,
> > > Liam
> > >
> > > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> > >
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Lorenzo Stoakes 1 year ago
On Mon, Jan 13, 2025 at 08:09:08PM -0800, Andrew Morton wrote:
> On Mon, 13 Jan 2025 18:53:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > >
> > > Yes, we're at -rc7 and this series is rather in panic mode and it seems
> > > unnecessarily risky so I'm inclined to set it aside for this cycle.
> > >
> > > If the series is considered super desirable and if people are confident
> > > that we can address any remaining glitches during two months of -rc
> > > then sure, we could push the envelope a bit.  But I don't believe this
> > > is the case so I'm thinking let's give ourselves another cycle to get
> > > this all sorted out?
> >
> > I didn't think this series was in panic mode with one real issue that
> > is not hard to address (memory ordering in
> > __refcount_inc_not_zero_limited()) but I'm obviously biased and might
> > be missing the big picture. In any case, if it makes people nervous I
> > have no objections to your plan.
>
> Well, I'm soliciting opinions here.  What do others think?
>
> And do you see much urgency with these changes?

With apologies to Suren (genuinely!) who is doing great work and is
super-responsive here, this really needs another cycle in my opinion.

As Vlastimil points out there's some non-trivial bits to go, but I am also
firmly of the opinion we need to have as much testing as is practical here.

I don't think this is urgent on any timeline so I'd like to join Vlastimil
to firmly but politely push for this to land in 6.15 rather than 6.14.

Just to reiterate - this is absolutely no reflection on Suren who has been
really great here - it is purely a product of the complexity and scope of
this change.
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Vlastimil Babka 1 year ago
On 1/14/25 05:09, Andrew Morton wrote:
> On Mon, 13 Jan 2025 18:53:11 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> 
>> On Mon, Jan 13, 2025 at 5:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>> >
>> >
>> > Yes, we're at -rc7 and this series is rather in panic mode and it seems
>> > unnecessarily risky so I'm inclined to set it aside for this cycle.
>> >
>> > If the series is considered super desirable and if people are confident
>> > that we can address any remaining glitches during two months of -rc
>> > then sure, we could push the envelope a bit.  But I don't believe this
>> > is the case so I'm thinking let's give ourselves another cycle to get
>> > this all sorted out?
>> 
>> I didn't think this series was in panic mode with one real issue that
>> is not hard to address (memory ordering in
>> __refcount_inc_not_zero_limited()) but I'm obviously biased and might
>> be missing the big picture. In any case, if it makes people nervous I
>> have no objections to your plan.
> 
> Well, I'm soliciting opinions here.  What do others think?
> 
> And do you see much urgency with these changes?

I don't see the urgency and at this point giving it more time seems wise.
Seems like v10 won't be exactly trivial as we'll change from refcount_t to
atomic_t? And I'd like to see PeterZ review the lockdep parts too.
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Hillf Danton 1 year ago
On Tue, 14 Jan 2025 10:09:42 +0100 Vlastimil Babka <vbabka@suse.cz>
> On 1/14/25 05:09, Andrew Morton wrote:
> > 
> > Well, I'm soliciting opinions here.  What do others think?
> > 
> > And do you see much urgency with these changes?
> 
> I don't see the urgency and at this point giving it more time seems wise.
> Seems like v10 won't be exactly trivial as we'll change from refcount_t to
> atomic_t? And I'd like to see PeterZ review the lockdep parts too.

+1
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Suren Baghdasaryan 1 year ago
On Mon, Jan 13, 2025 at 4:14 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> A nit on subject, I mean this is part of what this series does, and hey -
> we have only so much text to put in here - but isn't this both
> reimplementing per-VMA lock as a refcount _and_ importantly allocating VMAs
> using the RCU typesafe mechanism?
>
> Do we have to do both in one series? Can we split this out? I mean maybe
> that's just churny and unnecessary, but to me this series is 'allocate VMAs
> RCU safe and refcount VMA lock' or something like this. Maybe this is
> nitty... but still :)

There is "motivational dependency" because one of the main reasons I'm
converting the vm_lock into vm_refcnt is to make it easier to add
SLAB_TYPESAFE_BY_RCU (see my last reply to Hillf). But technically we
can leave the SLAB_TYPESAFE_BY_RCU out of this series if that makes
things easier. That would be the 2 patches at the end:

mm: prepare lock_vma_under_rcu() for vma reuse possibility
mm: make vma cache SLAB_TYPESAFE_BY_RCU

I made sure that each patch is bisectable, so there should not be a
problem with tracking issues.
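
For context on why the refcount makes SLAB_TYPESAFE_BY_RCU easier: with
that flag a vma can be freed and immediately reused for a different
mapping while an RCU reader still holds a pointer to it, so the lockless
path must take a reference first and then revalidate what it found. A
rough sketch of that pattern (not the series' lock_vma_under_rcu(); the
vma_start_read() semantics shown - failing on a detached vma - are an
assumption based on the cover letter description):

#include <linux/mm.h>

static struct vm_area_struct *lock_vma_under_rcu_sketch(struct mm_struct *mm,
                                                         unsigned long addr)
{
        MA_STATE(mas, &mm->mm_mt, addr, addr);
        struct vm_area_struct *vma;

        rcu_read_lock();
        vma = mas_walk(&mas);
        if (!vma)
                goto fail;

        /* Fails if the vma is detached (its refcount dropped to zero). */
        if (!vma_start_read(vma))
                goto fail;

        /*
         * With SLAB_TYPESAFE_BY_RCU the object may have been freed and
         * reused for another mapping between mas_walk() and taking the
         * reference, so recheck that it is still the vma we looked up.
         */
        if (vma->vm_mm != mm || addr < vma->vm_start || addr >= vma->vm_end) {
                vma_end_read(vma);
                goto fail;
        }

        rcu_read_unlock();
        return vma;
fail:
        rcu_read_unlock();
        return NULL;
}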

>
> One general comment here - this is a really major change in how this stuff
> works, and yet I don't see any tests anywhere in the series.

Hmm. I was diligently updating the tests to reflect the replacement of
vm_lock with vm_refcnt and adding assertions for detach/attach cases.
This actually reminds me that I missed updating vm_area_struct in
vma_internal.h for the member regrouping patch; will add that. I think
the only part that did not affect tests is SLAB_TYPESAFE_BY_RCU but I
was not sure what kind of testing I can add for that. Any suggestions
would be welcomed.

>
> I know it's tricky to write tests for this, but the new VMA testing
> environment should make it possible to test a _lot_ more than we previously
> could.
>
> However due to some (*ahem*) interesting distribution of where functions
> are, most notably stuff in kernel/fork.c, I guess we can't test
> _everything_ there effectively.
>
> But I do feel like we should be able to do better than having absolutely no
> testing added for this?

Again, I'm open to suggestions for SLAB_TYPESAFE_BY_RCU testing but
for the rest I thought the tests were modified accordingly.

>
> I think there's definitely quite a bit you could test now, at least in
> asserting fundamentals in tools/testing/vma/vma.c.
>
> This can cover at least detached state asserts in various scenarios.

Ok, you mean to check that VMA re-attachment/re-detachment would
trigger assertions? I'll look into adding tests for that.
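
Here is a rough sketch of the kind of test this could be, in the style of
tools/testing/vma/vma.c. The helper names used - alloc_vma(),
vma_mark_attached()/vma_mark_detached(), vma_is_detached() and the
ASSERT_* macros - are assumptions about the harness rather than its
actual API; the vma_start_read()-fails-when-detached behaviour follows
the semantics described in this series:

/* Sketch for tools/testing/vma/vma.c; assumes the existing test harness. */
static bool test_vma_attach_detach(void)
{
        struct mm_struct mm = {};
        struct vm_area_struct *vma;

        vma = alloc_vma(&mm, 0x1000, 0x2000, 1, VM_READ | VM_WRITE);
        ASSERT_NE(vma, NULL);

        /* A newly allocated vma should start out detached. */
        ASSERT_TRUE(vma_is_detached(vma));

        vma_mark_attached(vma);
        ASSERT_FALSE(vma_is_detached(vma));

        /* Read-locking an attached vma must succeed... */
        ASSERT_TRUE(vma_start_read(vma));
        vma_end_read(vma);

        /* ...and fail once the vma is detached again. */
        vma_mark_detached(vma);
        ASSERT_TRUE(vma_is_detached(vma));
        ASSERT_FALSE(vma_start_read(vma));

        vm_area_free(vma);
        return true;
}

(Asserting that re-attaching an already attached vma trips the new
warnings is harder to express as a pass/fail test and would likely need a
separate approach.)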

>
> But that won't cover off the really gnarly stuff here around RCU slab
> allocation, and determining precisely how to test that in a sensible way is
> maybe less clear.
>
> But I'd like to see _something_ here please, this is more or less
> fundamentally changing how all VMAs are allocated and to just have nothing
> feels unfortunate.

Again, I'm open to suggestions on what kind of testing I can add for
SLAB_TYPESAFE_BY_RCU change.

>
> I'm already nervous because we've hit issues coming up to v9 and we're not
> 100% sure if a recent syzkaller is related to these changes or not, I'm not
> sure how much we can get assurances with tests but I'd like something.

If you are referring to the issue at [1], I think David ran syzkaller
against mm-stable, which does not contain this patchset, and the issue
still triggered (see [2]). This of course does not guarantee that this
patchset has no other issues :) I'll try adding tests for re-attaching
and re-detaching, and I welcome ideas on how to test the
SLAB_TYPESAFE_BY_RCU transition.
Thanks,
Suren.

[1] https://lore.kernel.org/all/6758f0cc.050a0220.17f54a.0001.GAE@google.com/
[2] https://lore.kernel.org/all/67823fba.050a0220.216c54.001c.GAE@google.com/

>
> Thanks!
>
> On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduces, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regressions is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
> > Splitting single logical structure into multiple ones leads to more
> > complicated management, extra pointer dereferences and overall less
> > maintainable code. When that split-away part is a lock, it complicates
> > things even further. With no performance benefits, there are no reasons
> > for this split. Merging the vm_lock back into vm_area_struct also allows
> > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > This patchset:
> > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > boundary and changing the cache to be cacheline-aligned to minimize
> > cacheline sharing;
> > 2. changes vm_area_struct initialization to mark new vma as detached until
> > it is inserted into vma tree;
> > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > 4. regroups vm_area_struct members to fit them into 3 cachelines;
> > 5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > reuse and to minimize call_rcu() calls.
> >
> > Pagefault microbenchmarks show performance improvement:
> > Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> > Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> > Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> > Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> > Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> > Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> > Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> > Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> > Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> > Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> > Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> > Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> > Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> > Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> > Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> > Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
> >
> > Changes since v8 [4]:
> > - Change subject for the cover letter, per Vlastimil Babka
> > - Added Reviewed-by and Acked-by, per Vlastimil Babka
> > - Added static check for no-limit case in __refcount_add_not_zero_limited,
> > per David Laight
> > - Fixed vma_refcount_put() to call rwsem_release() unconditionally,
> > per Hillf Danton and Vlastimil Babka
> > - Use a copy of vma->vm_mm in vma_refcount_put() in case vma is freed from
> > under us, per Vlastimil Babka
> > - Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
> > per Vlastimil Babka
> > - Changed __vma_enter_locked() parameter to centralize refcount logic,
> > per Vlastimil Babka
> > - Amended description in vm_lock replacement patch explaining the effects
> > of the patch on vm_area_struct size, per Vlastimil Babka
> > - Added vm_area_struct member regrouping patch [5] into the series
> > - Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
> > - Added a comment for vm_area_struct to update vm_area_init_from() when
> > adding new members, per Vlastimil Babka
> > - Updated a comment about unstable src->shared.rb when copying a vma in
> > vm_area_init_from(), per Vlastimil Babka
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> > [4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@google.com/
> > [5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
> >
> > Patchset applies over mm-unstable after reverting v8
> > (current SHA range: 235b5129cb7b - 9e6b24c58985)
> >
> > Suren Baghdasaryan (17):
> >   mm: introduce vma_start_read_locked{_nested} helpers
> >   mm: move per-vma lock into vm_area_struct
> >   mm: mark vma as detached until it's added into vma tree
> >   mm: introduce vma_iter_store_attached() to use with attached vmas
> >   mm: mark vmas detached upon exit
> >   types: move struct rcuwait into types.h
> >   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
> >   mm: move mmap_init_lock() out of the header file
> >   mm: uninline the main body of vma_start_write()
> >   refcount: introduce __refcount_{add|inc}_not_zero_limited
> >   mm: replace vm_lock and detached flag with a reference count
> >   mm: move lesser used vma_area_struct members into the last cacheline
> >   mm/debug: print vm_refcnt state when dumping the vma
> >   mm: remove extra vma_numab_state_init() call
> >   mm: prepare lock_vma_under_rcu() for vma reuse possibility
> >   mm: make vma cache SLAB_TYPESAFE_BY_RCU
> >   docs/mm: document latest changes to vm_lock
> >
> >  Documentation/mm/process_addrs.rst |  44 ++++----
> >  include/linux/mm.h                 | 156 ++++++++++++++++++++++-------
> >  include/linux/mm_types.h           |  75 +++++++-------
> >  include/linux/mmap_lock.h          |   6 --
> >  include/linux/rcuwait.h            |  13 +--
> >  include/linux/refcount.h           |  24 ++++-
> >  include/linux/slab.h               |   6 --
> >  include/linux/types.h              |  12 +++
> >  kernel/fork.c                      | 129 +++++++++++-------------
> >  mm/debug.c                         |  12 +++
> >  mm/init-mm.c                       |   1 +
> >  mm/memory.c                        |  97 ++++++++++++++++--
> >  mm/mmap.c                          |   3 +-
> >  mm/userfaultfd.c                   |  32 +++---
> >  mm/vma.c                           |  23 ++---
> >  mm/vma.h                           |  15 ++-
> >  tools/testing/vma/linux/atomic.h   |   5 +
> >  tools/testing/vma/vma_internal.h   |  93 ++++++++---------
> >  18 files changed, 465 insertions(+), 281 deletions(-)
> >
> > --
> > 2.47.1.613.gc27f4b7a9f-goog
> >
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Lorenzo Stoakes 1 year ago
On Mon, Jan 13, 2025 at 08:58:37AM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 13, 2025 at 4:14 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > A nit on subject, I mean this is part of what this series does, and hey -
> > we have only so much text to put in here - but isn't this both
> > reimplementing per-VMA lock as a refcount _and_ importantly allocating VMAs
> > using the RCU typesafe mechanism?
> >
> > Do we have to do both in one series? Can we split this out? I mean maybe
> > that's just churny and unnecessary, but to me this series is 'allocate VMAs
> > RCU safe and refcount VMA lock' or something like this. Maybe this is
> > nitty... but still :)
>
> There is "motivational dependency" because one of the main reasons I'm
> converting the vm_lock into vm_refcnt is to make it easier to add
> SLAB_TYPESAFE_BY_RCU (see my last reply to Hillf). But technically we
> can leave the SLAB_TYPESAFE_BY_RCU out of this series if that makes
> thighs easier. That would be the 2 patches at the end:

Right yeah... maybe it's better to do it in one hit.

>
> mm: prepare lock_vma_under_rcu() for vma reuse possibility
> mm: make vma cache SLAB_TYPESAFE_BY_RCU
>
> I made sure that each patch is bisectable, so there should not be a
> problem with tracking issues.
>
> >
> > One general comment here - this is a really major change in how this stuff
> > works, and yet I don't see any tests anywhere in the series.
>
> Hmm. I was diligently updating the tests to reflect the replacement of
> vm_lock with vm_refcnt and adding assertions for detach/attach cases.
> This actually reminds me that I missed updading vm_area_struct in
> vma_internal.h for the member regrouping patch; will add that. I think
> the only part that did not affect tests is SLAB_TYPESAFE_BY_RCU but I
> was not sure what kind of testing I can add for that. Any suggestions
> would be welcomed.

And to be clear, I'm super grateful you did that :) thanks, and it'd be
good to change the member regrouping thing also.

But that doesn't change the fact that this series adds exactly zero tests
of its own. And for something so broad, that feels like a big issue - we
really want to be careful with something this big.

You've also noticed that I've cleverly failed to _actually_ suggest
SLAB_TYPESAFE_BY_RCU tests, and mea culpa - it's super hard to think of how
to test that.

Liam has experience doing RCU testing for the maple tree stuff, but it
wasn't pretty and wasn't easy, and it would probably require massive
rework to expose this stuff to some viable testing environment - in other
words, it's unworkable.

HOWEVER, I feel like maybe we could try to create scenarios where we might
trigger reuse bugs?

Perhaps some userland code, perhaps even constrained by cgroup, that maps a
ton of stuff and unmaps in a loop in parallel?

Perhaps create scenarios with shared memory where we up refcounts a lot too?
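
Roughly the kind of userland churn I have in mind - purely an illustrative
sketch (nothing from the series), with arbitrary thread/iteration counts
and no error handling:

#include <pthread.h>
#include <string.h>
#include <sys/mman.h>

#define NR_THREADS 8
#define NR_MAPS    64
#define MAP_SIZE   (64 * 4096UL)

/*
 * Map a batch of anonymous regions, fault one page in each, unmap them
 * all, and repeat - maximising vm_area_struct free/reuse in parallel.
 */
static void *churn(void *arg)
{
    void *maps[NR_MAPS];

    (void)arg;
    for (int iter = 0; iter < 100000; iter++) {
        for (int i = 0; i < NR_MAPS; i++) {
            maps[i] = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (maps[i] != MAP_FAILED)
                memset(maps[i], 0, 4096);    /* fault in one page */
        }
        for (int i = 0; i < NR_MAPS; i++) {
            if (maps[i] != MAP_FAILED)
                munmap(maps[i], MAP_SIZE);
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NR_THREADS];

    for (int i = 0; i < NR_THREADS; i++)
        pthread_create(&threads[i], NULL, churn, NULL);
    for (int i = 0; i < NR_THREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}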

Anyway, this is necessarily nebulous without further investigation; what
I was thinking more concretely is:

Using the VMA userland testing:

1. Assert reference count correctness across locking scenarios and various
   VMA operations.
2. Assert correct detached/not detached state across different scenarios.

This won't quite be complete, as not everything is separated out quite
enough to allow things like process teardown/forking etc. to be explicitly
tested, but you can unit test the VMA bits at least.

One note on this: I intend to split the vma.c file into multiple files in
tools/testing/vma/, so if you add tests here it'd probably be worth
putting them into a new file.

I'm happy to help with this if you need any assistance, feel free to ping!

Sorry to put this on you so late in the series, I realise it's annoying,
but things have changed a lot - and with what is in effect two series
aggregated into one - these are genuine concerns that I feel we need to
at least make some headway on at this stage.

>
> >
> > I know it's tricky to write tests for this, but the new VMA testing
> > environment should make it possible to test a _lot_ more than we previously
> > could.
> >
> > However due to some (*ahem*) interesting distribution of where functions
> > are, most notably stuff in kernel/fork.c, I guess we can't test
> > _everything_ there effectively.
> >
> > But I do feel like we should be able to do better than having absolutely no
> > testing added for this?
>
> Again, I'm open to suggestions for SLAB_TYPESAFE_BY_RCU testing but
> for the rest I thought the tests were modified accordingly.

See above ^

>
> >
> > I think there's definitely quite a bit you could test now, at least in
> > asserting fundamentals in tools/testing/vma/vma.c.
> >
> > This can cover at least detached state asserts in various scenarios.
>
> Ok, you mean to check that VMA re-attachment/re-detachment would
> trigger assertions? I'll look into adding tests for that.

Yeah this is one, see above :)

>
> >
> > But that won't cover off the really gnarly stuff here around RCU slab
> > allocation, and determining precisely how to test that in a sensible way is
> > maybe less clear.
> >
> > But I'd like to see _something_ here please, this is more or less
> > fundamentally changing how all VMAs are allocated and to just have nothing
> > feels unfortunate.
>
> Again, I'm open to suggestions on what kind of testing I can add for
> SLAB_TYPESAFE_BY_RCU change.

See above

>
> >
> > I'm already nervous because we've hit issues coming up to v9 and we're not
> > 100% sure if a recent syzkaller is related to these changes or not, I'm not
> > sure how much we can get assurances with tests but I'd like something.
>
> If you are referring to the issue at [1], I think David ran the
> syzcaller against mm-stable that does not contain this patchset and
> the issue still triggered (see [2]). This of course does not guarantee
> that this patchset has no other issues :) I'll try adding tests for
> re-attaching, re-detaching and welcome ideas on how to test
> SLAB_TYPESAFE_BY_RCU transition.
> Thanks,
> Suren.

OK that's reassuring!

>
> [1] https://lore.kernel.org/all/6758f0cc.050a0220.17f54a.0001.GAE@google.com/
> [2] https://lore.kernel.org/all/67823fba.050a0220.216c54.001c.GAE@google.com/
>
> >
> > Thanks!
> >
> > On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> > > Back when per-vma locks were introduces, vm_lock was moved out of
> > > vm_area_struct in [1] because of the performance regression caused by
> > > false cacheline sharing. Recent investigation [2] revealed that the
> > > regressions is limited to a rather old Broadwell microarchitecture and
> > > even there it can be mitigated by disabling adjacent cacheline
> > > prefetching, see [3].
> > > Splitting single logical structure into multiple ones leads to more
> > > complicated management, extra pointer dereferences and overall less
> > > maintainable code. When that split-away part is a lock, it complicates
> > > things even further. With no performance benefits, there are no reasons
> > > for this split. Merging the vm_lock back into vm_area_struct also allows
> > > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > > This patchset:
> > > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > > boundary and changing the cache to be cacheline-aligned to minimize
> > > cacheline sharing;
> > > 2. changes vm_area_struct initialization to mark new vma as detached until
> > > it is inserted into vma tree;
> > > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > > 4. regroups vm_area_struct members to fit them into 3 cachelines;
> > > 5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > > reuse and to minimize call_rcu() calls.
> > >
> > > Pagefault microbenchmarks show performance improvement:
> > > Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> > > Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> > > Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> > > Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> > > Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> > > Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> > > Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> > > Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> > > Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> > > Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> > > Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> > > Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> > > Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> > > Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> > > Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> > > Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
> > >
> > > Changes since v8 [4]:
> > > - Change subject for the cover letter, per Vlastimil Babka
> > > - Added Reviewed-by and Acked-by, per Vlastimil Babka
> > > - Added static check for no-limit case in __refcount_add_not_zero_limited,
> > > per David Laight
> > > - Fixed vma_refcount_put() to call rwsem_release() unconditionally,
> > > per Hillf Danton and Vlastimil Babka
> > > - Use a copy of vma->vm_mm in vma_refcount_put() in case vma is freed from
> > > under us, per Vlastimil Babka
> > > - Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
> > > per Vlastimil Babka
> > > - Changed __vma_enter_locked() parameter to centralize refcount logic,
> > > per Vlastimil Babka
> > > - Amended description in vm_lock replacement patch explaining the effects
> > > of the patch on vm_area_struct size, per Vlastimil Babka
> > > - Added vm_area_struct member regrouping patch [5] into the series
> > > - Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
> > > - Added a comment for vm_area_struct to update vm_area_init_from() when
> > > adding new members, per Vlastimil Babka
> > > - Updated a comment about unstable src->shared.rb when copying a vma in
> > > vm_area_init_from(), per Vlastimil Babka
> > >
> > > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> > > [4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@google.com/
> > > [5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
> > >
> > > Patchset applies over mm-unstable after reverting v8
> > > (current SHA range: 235b5129cb7b - 9e6b24c58985)
> > >
> > > Suren Baghdasaryan (17):
> > >   mm: introduce vma_start_read_locked{_nested} helpers
> > >   mm: move per-vma lock into vm_area_struct
> > >   mm: mark vma as detached until it's added into vma tree
> > >   mm: introduce vma_iter_store_attached() to use with attached vmas
> > >   mm: mark vmas detached upon exit
> > >   types: move struct rcuwait into types.h
> > >   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
> > >   mm: move mmap_init_lock() out of the header file
> > >   mm: uninline the main body of vma_start_write()
> > >   refcount: introduce __refcount_{add|inc}_not_zero_limited
> > >   mm: replace vm_lock and detached flag with a reference count
> > >   mm: move lesser used vma_area_struct members into the last cacheline
> > >   mm/debug: print vm_refcnt state when dumping the vma
> > >   mm: remove extra vma_numab_state_init() call
> > >   mm: prepare lock_vma_under_rcu() for vma reuse possibility
> > >   mm: make vma cache SLAB_TYPESAFE_BY_RCU
> > >   docs/mm: document latest changes to vm_lock
> > >
> > >  Documentation/mm/process_addrs.rst |  44 ++++----
> > >  include/linux/mm.h                 | 156 ++++++++++++++++++++++-------
> > >  include/linux/mm_types.h           |  75 +++++++-------
> > >  include/linux/mmap_lock.h          |   6 --
> > >  include/linux/rcuwait.h            |  13 +--
> > >  include/linux/refcount.h           |  24 ++++-
> > >  include/linux/slab.h               |   6 --
> > >  include/linux/types.h              |  12 +++
> > >  kernel/fork.c                      | 129 +++++++++++-------------
> > >  mm/debug.c                         |  12 +++
> > >  mm/init-mm.c                       |   1 +
> > >  mm/memory.c                        |  97 ++++++++++++++++--
> > >  mm/mmap.c                          |   3 +-
> > >  mm/userfaultfd.c                   |  32 +++---
> > >  mm/vma.c                           |  23 ++---
> > >  mm/vma.h                           |  15 ++-
> > >  tools/testing/vma/linux/atomic.h   |   5 +
> > >  tools/testing/vma/vma_internal.h   |  93 ++++++++---------
> > >  18 files changed, 465 insertions(+), 281 deletions(-)
> > >
> > > --
> > > 2.47.1.613.gc27f4b7a9f-goog
> > >
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Suren Baghdasaryan 1 year ago
On Mon, Jan 13, 2025 at 9:11 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Jan 13, 2025 at 08:58:37AM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 13, 2025 at 4:14 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > A nit on subject, I mean this is part of what this series does, and hey -
> > > we have only so much text to put in here - but isn't this both
> > > reimplementing per-VMA lock as a refcount _and_ importantly allocating VMAs
> > > using the RCU typesafe mechanism?
> > >
> > > Do we have to do both in one series? Can we split this out? I mean maybe
> > > that's just churny and unnecessary, but to me this series is 'allocate VMAs
> > > RCU safe and refcount VMA lock' or something like this. Maybe this is
> > > nitty... but still :)
> >
> > There is "motivational dependency" because one of the main reasons I'm
> > converting the vm_lock into vm_refcnt is to make it easier to add
> > SLAB_TYPESAFE_BY_RCU (see my last reply to Hillf). But technically we
> > can leave the SLAB_TYPESAFE_BY_RCU out of this series if that makes
> > thighs easier. That would be the 2 patches at the end:
>
> Right yeah... maybe it's better to do it in one hit.
>
> >
> > mm: prepare lock_vma_under_rcu() for vma reuse possibility
> > mm: make vma cache SLAB_TYPESAFE_BY_RCU
> >
> > I made sure that each patch is bisectable, so there should not be a
> > problem with tracking issues.
> >
> > >
> > > One general comment here - this is a really major change in how this stuff
> > > works, and yet I don't see any tests anywhere in the series.
> >
> > Hmm. I was diligently updating the tests to reflect the replacement of
> > vm_lock with vm_refcnt and adding assertions for detach/attach cases.
> > This actually reminds me that I missed updading vm_area_struct in
> > vma_internal.h for the member regrouping patch; will add that. I think
> > the only part that did not affect tests is SLAB_TYPESAFE_BY_RCU but I
> > was not sure what kind of testing I can add for that. Any suggestions
> > would be welcomed.
>
> And to be clear I'm super grateful you did that :) thanks, be good to
> change the member regrouping thing also.
>
> But that doesn't change the fact that this series has exactly zero tests
> for it. And for something so broad, it feels like a big issue, we really
> want to be careful with something so big here.
>
> You've also noticed that I've cleverly failed to _actually_ suggest
> SLAB_TYPESAFE_BY_RCU tests, and mea culpa - it's super hard to think of how
> to test that.
>
> Liam has experience doing RCU testing this for the maple tree stuff, but it
> wasn't pretty and wasn't easy and would probably require massive rework to
> expose this stuff to some viable testing environment, or in other words -
> is unworkable.
>
> HOWEVER, I feel like maybe we could try to create scenarios where we might
> trigger reuse bugs?
>
> Perhaps some userland code, perhaps even constrained by cgroup, that maps a
> ton of stuff and unmaps in a loop in parallel?
>
> Perhaps create scenarios with shared memory where we up refcounts a lot too?

I have this old spf_test
(https://github.com/surenbaghdasaryan/spf_test/blob/main/spf_test.c)
which I often use to weed out vma locking issues because it starts
multiple threads doing mmap + page faults. Perhaps we can repackage it
into a test/benchmark for testing contention on mmap/vma locks?

>
> Anyway, this is necessarily nebulous without further investigation, what I
> was thinking more concretely is:
>
> Using the VMA userland testing:
>
> 1. Assert reference count correctness across locking scenarios and various
>    VMA operations.
> 2. Assert correct detached/not detached state across different scenarios.
>
> This won't quite be complete as not everything is separated out quite
> enough to allow things like process tear down/forking etc. to be explicitly
> tested but you can unit tests the VMA bits at least.
>
> One note on this, I intend to split the vma.c file into multiple files in
> tools/testing/vma/ so if you add tests here it'd be worth probably putting
> them into a new file.
>
> I'm happy to help with this if you need any assistance, feel free to ping!

As a starting point I was thinking of changing
vma_assert_attached()/vma_assert_detached() and
vma_mark_attached()/vma_mark_detached() to return a bool and use
WARN_ON_ONCE() (to address your concern about asserts being dependent
on CONFIG_DEBUG_VM) like this:

static inline bool vma_assert_detached(struct vm_area_struct *vma)
{
    return !WARN_ON_ONCE(atomic_read(&vma->vm_refcnt));
}

static inline bool vma_mark_attached(struct vm_area_struct *vma)
{
    vma_assert_write_locked(vma);
    if (!vma_assert_detached(vma))
        return false;

    atomic_set(&vma->vm_refcnt, 1);
    return true;
}

With that we can add correctness checks in tools/testing/vma/vma.c for
different states; for example, in alloc_and_link_vma() we can check that
after vma_link() the vma is indeed attached:

ASSERT_TRUE(vma_assert_attached(vma));
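
Fleshed out a little, a first test could look roughly like this (a sketch
only - it assumes the bool-returning helpers above; alloc_and_link_vma(),
cleanup_mm() and ASSERT_TRUE() are the existing vma.c test helpers as I
remember them, so exact signatures may need adjusting):

static bool test_attach_detach_asserts(void)
{
    vm_flags_t flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
    struct mm_struct mm = {};
    VMA_ITERATOR(vmi, &mm, 0);
    struct vm_area_struct *vma;

    /* A vma that vma_link() inserted into the tree must report attached. */
    vma = alloc_and_link_vma(&mm, 0x1000, 0x2000, 1, flags);
    ASSERT_TRUE(vma_assert_attached(vma));

    /* Clean up the test mm (helper from the existing tests). */
    cleanup_mm(&mm, &vmi);

    return true;
}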

This might not cover all states but is probably a good starting point. WDYT?

>
> Sorry to put this on you so late in the series, I realise it's annoying,
> but I feel like things have changed a lot and obviously aggregated with two
> series in one in effect and these are genuine concerns that at this stage I
> feel like we need to try to at least make some headway on.
>
> >
> > >
> > > I know it's tricky to write tests for this, but the new VMA testing
> > > environment should make it possible to test a _lot_ more than we previously
> > > could.
> > >
> > > However due to some (*ahem*) interesting distribution of where functions
> > > are, most notably stuff in kernel/fork.c, I guess we can't test
> > > _everything_ there effectively.
> > >
> > > But I do feel like we should be able to do better than having absolutely no
> > > testing added for this?
> >
> > Again, I'm open to suggestions for SLAB_TYPESAFE_BY_RCU testing but
> > for the rest I thought the tests were modified accordingly.
>
> See above ^
>
> >
> > >
> > > I think there's definitely quite a bit you could test now, at least in
> > > asserting fundamentals in tools/testing/vma/vma.c.
> > >
> > > This can cover at least detached state asserts in various scenarios.
> >
> > Ok, you mean to check that VMA re-attachment/re-detachment would
> > trigger assertions? I'll look into adding tests for that.
>
> Yeah this is one, see above :)
>
> >
> > >
> > > But that won't cover off the really gnarly stuff here around RCU slab
> > > allocation, and determining precisely how to test that in a sensible way is
> > > maybe less clear.
> > >
> > > But I'd like to see _something_ here please, this is more or less
> > > fundamentally changing how all VMAs are allocated and to just have nothing
> > > feels unfortunate.
> >
> > Again, I'm open to suggestions on what kind of testing I can add for
> > SLAB_TYPESAFE_BY_RCU change.
>
> See above
>
> >
> > >
> > > I'm already nervous because we've hit issues coming up to v9 and we're not
> > > 100% sure if a recent syzkaller is related to these changes or not, I'm not
> > > sure how much we can get assurances with tests but I'd like something.
> >
> > If you are referring to the issue at [1], I think David ran the
> > syzcaller against mm-stable that does not contain this patchset and
> > the issue still triggered (see [2]). This of course does not guarantee
> > that this patchset has no other issues :) I'll try adding tests for
> > re-attaching, re-detaching and welcome ideas on how to test
> > SLAB_TYPESAFE_BY_RCU transition.
> > Thanks,
> > Suren.
>
> OK that's reassuring!
>
> >
> > [1] https://lore.kernel.org/all/6758f0cc.050a0220.17f54a.0001.GAE@google.com/
> > [2] https://lore.kernel.org/all/67823fba.050a0220.216c54.001c.GAE@google.com/
> >
> > >
> > > Thanks!
> > >
> > > On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> > > > Back when per-vma locks were introduces, vm_lock was moved out of
> > > > vm_area_struct in [1] because of the performance regression caused by
> > > > false cacheline sharing. Recent investigation [2] revealed that the
> > > > regressions is limited to a rather old Broadwell microarchitecture and
> > > > even there it can be mitigated by disabling adjacent cacheline
> > > > prefetching, see [3].
> > > > Splitting single logical structure into multiple ones leads to more
> > > > complicated management, extra pointer dereferences and overall less
> > > > maintainable code. When that split-away part is a lock, it complicates
> > > > things even further. With no performance benefits, there are no reasons
> > > > for this split. Merging the vm_lock back into vm_area_struct also allows
> > > > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > > > This patchset:
> > > > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > > > boundary and changing the cache to be cacheline-aligned to minimize
> > > > cacheline sharing;
> > > > 2. changes vm_area_struct initialization to mark new vma as detached until
> > > > it is inserted into vma tree;
> > > > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > > > 4. regroups vm_area_struct members to fit them into 3 cachelines;
> > > > 5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > > > reuse and to minimize call_rcu() calls.
> > > >
> > > > Pagefault microbenchmarks show performance improvement:
> > > > Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> > > > Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> > > > Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> > > > Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> > > > Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> > > > Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> > > > Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> > > > Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> > > > Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> > > > Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> > > > Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> > > > Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> > > > Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> > > > Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> > > > Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> > > > Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
> > > >
> > > > Changes since v8 [4]:
> > > > - Change subject for the cover letter, per Vlastimil Babka
> > > > - Added Reviewed-by and Acked-by, per Vlastimil Babka
> > > > - Added static check for no-limit case in __refcount_add_not_zero_limited,
> > > > per David Laight
> > > > - Fixed vma_refcount_put() to call rwsem_release() unconditionally,
> > > > per Hillf Danton and Vlastimil Babka
> > > > - Use a copy of vma->vm_mm in vma_refcount_put() in case vma is freed from
> > > > under us, per Vlastimil Babka
> > > > - Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
> > > > per Vlastimil Babka
> > > > - Changed __vma_enter_locked() parameter to centralize refcount logic,
> > > > per Vlastimil Babka
> > > > - Amended description in vm_lock replacement patch explaining the effects
> > > > of the patch on vm_area_struct size, per Vlastimil Babka
> > > > - Added vm_area_struct member regrouping patch [5] into the series
> > > > - Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
> > > > - Added a comment for vm_area_struct to update vm_area_init_from() when
> > > > adding new members, per Vlastimil Babka
> > > > - Updated a comment about unstable src->shared.rb when copying a vma in
> > > > vm_area_init_from(), per Vlastimil Babka
> > > >
> > > > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > > > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > > > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> > > > [4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@google.com/
> > > > [5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
> > > >
> > > > Patchset applies over mm-unstable after reverting v8
> > > > (current SHA range: 235b5129cb7b - 9e6b24c58985)
> > > >
> > > > Suren Baghdasaryan (17):
> > > >   mm: introduce vma_start_read_locked{_nested} helpers
> > > >   mm: move per-vma lock into vm_area_struct
> > > >   mm: mark vma as detached until it's added into vma tree
> > > >   mm: introduce vma_iter_store_attached() to use with attached vmas
> > > >   mm: mark vmas detached upon exit
> > > >   types: move struct rcuwait into types.h
> > > >   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
> > > >   mm: move mmap_init_lock() out of the header file
> > > >   mm: uninline the main body of vma_start_write()
> > > >   refcount: introduce __refcount_{add|inc}_not_zero_limited
> > > >   mm: replace vm_lock and detached flag with a reference count
> > > >   mm: move lesser used vma_area_struct members into the last cacheline
> > > >   mm/debug: print vm_refcnt state when dumping the vma
> > > >   mm: remove extra vma_numab_state_init() call
> > > >   mm: prepare lock_vma_under_rcu() for vma reuse possibility
> > > >   mm: make vma cache SLAB_TYPESAFE_BY_RCU
> > > >   docs/mm: document latest changes to vm_lock
> > > >
> > > >  Documentation/mm/process_addrs.rst |  44 ++++----
> > > >  include/linux/mm.h                 | 156 ++++++++++++++++++++++-------
> > > >  include/linux/mm_types.h           |  75 +++++++-------
> > > >  include/linux/mmap_lock.h          |   6 --
> > > >  include/linux/rcuwait.h            |  13 +--
> > > >  include/linux/refcount.h           |  24 ++++-
> > > >  include/linux/slab.h               |   6 --
> > > >  include/linux/types.h              |  12 +++
> > > >  kernel/fork.c                      | 129 +++++++++++-------------
> > > >  mm/debug.c                         |  12 +++
> > > >  mm/init-mm.c                       |   1 +
> > > >  mm/memory.c                        |  97 ++++++++++++++++--
> > > >  mm/mmap.c                          |   3 +-
> > > >  mm/userfaultfd.c                   |  32 +++---
> > > >  mm/vma.c                           |  23 ++---
> > > >  mm/vma.h                           |  15 ++-
> > > >  tools/testing/vma/linux/atomic.h   |   5 +
> > > >  tools/testing/vma/vma_internal.h   |  93 ++++++++---------
> > > >  18 files changed, 465 insertions(+), 281 deletions(-)
> > > >
> > > > --
> > > > 2.47.1.613.gc27f4b7a9f-goog
> > > >
Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
Posted by Lorenzo Stoakes 1 year ago
On Mon, Jan 13, 2025 at 11:00:07AM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 13, 2025 at 9:11 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Mon, Jan 13, 2025 at 08:58:37AM -0800, Suren Baghdasaryan wrote:
> > > On Mon, Jan 13, 2025 at 4:14 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > A nit on subject, I mean this is part of what this series does, and hey -
> > > > we have only so much text to put in here - but isn't this both
> > > > reimplementing per-VMA lock as a refcount _and_ importantly allocating VMAs
> > > > using the RCU typesafe mechanism?
> > > >
> > > > Do we have to do both in one series? Can we split this out? I mean maybe
> > > > that's just churny and unnecessary, but to me this series is 'allocate VMAs
> > > > RCU safe and refcount VMA lock' or something like this. Maybe this is
> > > > nitty... but still :)
> > >
> > > There is "motivational dependency" because one of the main reasons I'm
> > > converting the vm_lock into vm_refcnt is to make it easier to add
> > > SLAB_TYPESAFE_BY_RCU (see my last reply to Hillf). But technically we
> > > can leave the SLAB_TYPESAFE_BY_RCU out of this series if that makes
> > > thighs easier. That would be the 2 patches at the end:
> >
> > Right yeah... maybe it's better to do it in one hit.
> >
> > >
> > > mm: prepare lock_vma_under_rcu() for vma reuse possibility
> > > mm: make vma cache SLAB_TYPESAFE_BY_RCU
> > >
> > > I made sure that each patch is bisectable, so there should not be a
> > > problem with tracking issues.
> > >
> > > >
> > > > One general comment here - this is a really major change in how this stuff
> > > > works, and yet I don't see any tests anywhere in the series.
> > >
> > > Hmm. I was diligently updating the tests to reflect the replacement of
> > > vm_lock with vm_refcnt and adding assertions for detach/attach cases.
> > > This actually reminds me that I missed updading vm_area_struct in
> > > vma_internal.h for the member regrouping patch; will add that. I think
> > > the only part that did not affect tests is SLAB_TYPESAFE_BY_RCU but I
> > > was not sure what kind of testing I can add for that. Any suggestions
> > > would be welcomed.
> >
> > And to be clear I'm super grateful you did that :) thanks, be good to
> > change the member regrouping thing also.
> >
> > But that doesn't change the fact that this series has exactly zero tests
> > for it. And for something so broad, it feels like a big issue, we really
> > want to be careful with something so big here.
> >
> > You've also noticed that I've cleverly failed to _actually_ suggest
> > SLAB_TYPESAFE_BY_RCU tests, and mea culpa - it's super hard to think of how
> > to test that.
> >
> > Liam has experience doing RCU testing this for the maple tree stuff, but it
> > wasn't pretty and wasn't easy and would probably require massive rework to
> > expose this stuff to some viable testing environment, or in other words -
> > is unworkable.
> >
> > HOWEVER, I feel like maybe we could try to create scenarios where we might
> > trigger reuse bugs?
> >
> > Perhaps some userland code, perhaps even constrained by cgroup, that maps a
> > ton of stuff and unmaps in a loop in parallel?
> >
> > Perhaps create scenarios with shared memory where we up refcounts a lot too?
>
> I have this old spf_test
> (https://github.com/surenbaghdasaryan/spf_test/blob/main/spf_test.c)
> which I often use to weed out vma locking issues because it starts
> multiple threads doing mmap + page faults. Perhaps we can repackage it
> into a test/benchmark for testing contention on mmap/vma locks?

Ah nice yeah that sounds good!

>
> >
> > Anyway, this is necessarily nebulous without further investigation, what I
> > was thinking more concretely is:
> >
> > Using the VMA userland testing:
> >
> > 1. Assert reference count correctness across locking scenarios and various
> >    VMA operations.
> > 2. Assert correct detached/not detached state across different scenarios.
> >
> > This won't quite be complete as not everything is separated out quite
> > enough to allow things like process tear down/forking etc. to be explicitly
> > tested but you can unit tests the VMA bits at least.
> >
> > One note on this, I intend to split the vma.c file into multiple files in
> > tools/testing/vma/ so if you add tests here it'd be worth probably putting
> > them into a new file.
> >
> > I'm happy to help with this if you need any assistance, feel free to ping!
>
> As a starting point I was thinking of changing
> vma_assert_attached()/vma_assert_detached() and
> vma_mark_attached()/vma_mark_detached() to return a bool and use
> WARN_ON_ONCE() (to address your concern about asserts being dependent
> on CONFIG_DEBUG_VM) like this:
>
> static inline bool vma_assert_detached()
> {
>     return !WARN_ON_ONCE(atomic_read(&vma->vm_refcnt));
> }
>
> static inline bool vma_mark_attached(struct vm_area_struct *vma)
> {
>     vma_assert_write_locked(vma);
>     if (!vma_assert_detached(vma))
>         return false;
>
>     atomic_set(&vma->vm_refcnt, 1);
>     return true;
> }
>

Sounds good!

> With that we can add correctness checks in the tools/testing/vma/vma.c
> for different states, for example in the alloc_and_link_vma() we can
> check that after vma_link() the vma is indeed attached:
>
> ASSERT_TRUE(vma_assert_attached(vma));
>
> This might not cover all states but is probably a good starting point. WDYT?

Yeah, this is a good starting point.

I think we should also add explicit asserts in the merge tests to ensure
attachment.
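
E.g. something roughly like this after the merge calls (sketch only,
assuming the bool-returning helper above and the existing
merge_existing()/ASSERT_* test helpers):

vma = merge_existing(&vmg);
ASSERT_NE(vma, NULL);
ASSERT_TRUE(vma_assert_attached(vma));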

I mean part of this is adding more tests in general for standard
operations, but I don't want to be silly and suggest you need to do that.

I think this forms a decent basis.

>
> >
> > Sorry to put this on you so late in the series, I realise it's annoying,
> > but I feel like things have changed a lot and obviously aggregated with two
> > series in one in effect and these are genuine concerns that at this stage I
> > feel like we need to try to at least make some headway on.
> >
> > >
> > > >
> > > > I know it's tricky to write tests for this, but the new VMA testing
> > > > environment should make it possible to test a _lot_ more than we previously
> > > > could.
> > > >
> > > > However due to some (*ahem*) interesting distribution of where functions
> > > > are, most notably stuff in kernel/fork.c, I guess we can't test
> > > > _everything_ there effectively.
> > > >
> > > > But I do feel like we should be able to do better than having absolutely no
> > > > testing added for this?
> > >
> > > Again, I'm open to suggestions for SLAB_TYPESAFE_BY_RCU testing but
> > > for the rest I thought the tests were modified accordingly.
> >
> > See above ^
> >
> > >
> > > >
> > > > I think there's definitely quite a bit you could test now, at least in
> > > > asserting fundamentals in tools/testing/vma/vma.c.
> > > >
> > > > This can cover at least detached state asserts in various scenarios.
> > >
> > > Ok, you mean to check that VMA re-attachment/re-detachment would
> > > trigger assertions? I'll look into adding tests for that.
> >
> > Yeah this is one, see above :)
> >
> > >
> > > >
> > > > But that won't cover off the really gnarly stuff here around RCU slab
> > > > allocation, and determining precisely how to test that in a sensible way is
> > > > maybe less clear.
> > > >
> > > > But I'd like to see _something_ here please, this is more or less
> > > > fundamentally changing how all VMAs are allocated and to just have nothing
> > > > feels unfortunate.
> > >
> > > Again, I'm open to suggestions on what kind of testing I can add for
> > > SLAB_TYPESAFE_BY_RCU change.
> >
> > See above
> >
> > >
> > > >
> > > > I'm already nervous because we've hit issues coming up to v9 and we're not
> > > > 100% sure if a recent syzkaller is related to these changes or not, I'm not
> > > > sure how much we can get assurances with tests but I'd like something.
> > >
> > > If you are referring to the issue at [1], I think David ran the
> > > syzcaller against mm-stable that does not contain this patchset and
> > > the issue still triggered (see [2]). This of course does not guarantee
> > > that this patchset has no other issues :) I'll try adding tests for
> > > re-attaching, re-detaching and welcome ideas on how to test
> > > SLAB_TYPESAFE_BY_RCU transition.
> > > Thanks,
> > > Suren.
> >
> > OK that's reassuring!
> >
> > >
> > > [1] https://lore.kernel.org/all/6758f0cc.050a0220.17f54a.0001.GAE@google.com/
> > > [2] https://lore.kernel.org/all/67823fba.050a0220.216c54.001c.GAE@google.com/
> > >
> > > >
> > > > Thanks!
> > > >
> > > > On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> > > > > Back when per-vma locks were introduces, vm_lock was moved out of
> > > > > vm_area_struct in [1] because of the performance regression caused by
> > > > > false cacheline sharing. Recent investigation [2] revealed that the
> > > > > regressions is limited to a rather old Broadwell microarchitecture and
> > > > > even there it can be mitigated by disabling adjacent cacheline
> > > > > prefetching, see [3].
> > > > > Splitting single logical structure into multiple ones leads to more
> > > > > complicated management, extra pointer dereferences and overall less
> > > > > maintainable code. When that split-away part is a lock, it complicates
> > > > > things even further. With no performance benefits, there are no reasons
> > > > > for this split. Merging the vm_lock back into vm_area_struct also allows
> > > > > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > > > > This patchset:
> > > > > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > > > > boundary and changing the cache to be cacheline-aligned to minimize
> > > > > cacheline sharing;
> > > > > 2. changes vm_area_struct initialization to mark new vma as detached until
> > > > > it is inserted into vma tree;
> > > > > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > > > > 4. regroups vm_area_struct members to fit them into 3 cachelines;
> > > > > 5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > > > > reuse and to minimize call_rcu() calls.
> > > > >
> > > > > Pagefault microbenchmarks show performance improvement:
> > > > > Hmean     faults/cpu-1    507926.5547 (   0.00%)   506519.3692 *  -0.28%*
> > > > > Hmean     faults/cpu-4    479119.7051 (   0.00%)   481333.6802 *   0.46%*
> > > > > Hmean     faults/cpu-7    452880.2961 (   0.00%)   455845.6211 *   0.65%*
> > > > > Hmean     faults/cpu-12   347639.1021 (   0.00%)   352004.2254 *   1.26%*
> > > > > Hmean     faults/cpu-21   200061.2238 (   0.00%)   229597.0317 *  14.76%*
> > > > > Hmean     faults/cpu-30   145251.2001 (   0.00%)   164202.5067 *  13.05%*
> > > > > Hmean     faults/cpu-48   106848.4434 (   0.00%)   120641.5504 *  12.91%*
> > > > > Hmean     faults/cpu-56    92472.3835 (   0.00%)   103464.7916 *  11.89%*
> > > > > Hmean     faults/sec-1    507566.1468 (   0.00%)   506139.0811 *  -0.28%*
> > > > > Hmean     faults/sec-4   1880478.2402 (   0.00%)  1886795.6329 *   0.34%*
> > > > > Hmean     faults/sec-7   3106394.3438 (   0.00%)  3140550.7485 *   1.10%*
> > > > > Hmean     faults/sec-12  4061358.4795 (   0.00%)  4112477.0206 *   1.26%*
> > > > > Hmean     faults/sec-21  3988619.1169 (   0.00%)  4577747.1436 *  14.77%*
> > > > > Hmean     faults/sec-30  3909839.5449 (   0.00%)  4311052.2787 *  10.26%*
> > > > > Hmean     faults/sec-48  4761108.4691 (   0.00%)  5283790.5026 *  10.98%*
> > > > > Hmean     faults/sec-56  4885561.4590 (   0.00%)  5415839.4045 *  10.85%*
> > > > >
> > > > > Changes since v8 [4]:
> > > > > - Change subject for the cover letter, per Vlastimil Babka
> > > > > - Added Reviewed-by and Acked-by, per Vlastimil Babka
> > > > > - Added static check for no-limit case in __refcount_add_not_zero_limited,
> > > > > per David Laight
> > > > > - Fixed vma_refcount_put() to call rwsem_release() unconditionally,
> > > > > per Hillf Danton and Vlastimil Babka
> > > > > - Use a copy of vma->vm_mm in vma_refcount_put() in case vma is freed from
> > > > > under us, per Vlastimil Babka
> > > > > - Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
> > > > > per Vlastimil Babka
> > > > > - Changed __vma_enter_locked() parameter to centralize refcount logic,
> > > > > per Vlastimil Babka
> > > > > - Amended description in vm_lock replacement patch explaining the effects
> > > > > of the patch on vm_area_struct size, per Vlastimil Babka
> > > > > - Added vm_area_struct member regrouping patch [5] into the series
> > > > > - Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
> > > > > - Added a comment for vm_area_struct to update vm_area_init_from() when
> > > > > adding new members, per Vlastimil Babka
> > > > > - Updated a comment about unstable src->shared.rb when copying a vma in
> > > > > vm_area_init_from(), per Vlastimil Babka
> > > > >
> > > > > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > > > > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > > > > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> > > > > [4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@google.com/
> > > > > [5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
> > > > >
> > > > > Patchset applies over mm-unstable after reverting v8
> > > > > (current SHA range: 235b5129cb7b - 9e6b24c58985)
> > > > >
> > > > > Suren Baghdasaryan (17):
> > > > >   mm: introduce vma_start_read_locked{_nested} helpers
> > > > >   mm: move per-vma lock into vm_area_struct
> > > > >   mm: mark vma as detached until it's added into vma tree
> > > > >   mm: introduce vma_iter_store_attached() to use with attached vmas
> > > > >   mm: mark vmas detached upon exit
> > > > >   types: move struct rcuwait into types.h
> > > > >   mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
> > > > >   mm: move mmap_init_lock() out of the header file
> > > > >   mm: uninline the main body of vma_start_write()
> > > > >   refcount: introduce __refcount_{add|inc}_not_zero_limited
> > > > >   mm: replace vm_lock and detached flag with a reference count
> > > > >   mm: move lesser used vma_area_struct members into the last cacheline
> > > > >   mm/debug: print vm_refcnt state when dumping the vma
> > > > >   mm: remove extra vma_numab_state_init() call
> > > > >   mm: prepare lock_vma_under_rcu() for vma reuse possibility
> > > > >   mm: make vma cache SLAB_TYPESAFE_BY_RCU
> > > > >   docs/mm: document latest changes to vm_lock
> > > > >
> > > > >  Documentation/mm/process_addrs.rst |  44 ++++----
> > > > >  include/linux/mm.h                 | 156 ++++++++++++++++++++++-------
> > > > >  include/linux/mm_types.h           |  75 +++++++-------
> > > > >  include/linux/mmap_lock.h          |   6 --
> > > > >  include/linux/rcuwait.h            |  13 +--
> > > > >  include/linux/refcount.h           |  24 ++++-
> > > > >  include/linux/slab.h               |   6 --
> > > > >  include/linux/types.h              |  12 +++
> > > > >  kernel/fork.c                      | 129 +++++++++++-------------
> > > > >  mm/debug.c                         |  12 +++
> > > > >  mm/init-mm.c                       |   1 +
> > > > >  mm/memory.c                        |  97 ++++++++++++++++--
> > > > >  mm/mmap.c                          |   3 +-
> > > > >  mm/userfaultfd.c                   |  32 +++---
> > > > >  mm/vma.c                           |  23 ++---
> > > > >  mm/vma.h                           |  15 ++-
> > > > >  tools/testing/vma/linux/atomic.h   |   5 +
> > > > >  tools/testing/vma/vma_internal.h   |  93 ++++++++---------
> > > > >  18 files changed, 465 insertions(+), 281 deletions(-)
> > > > >
> > > > > --
> > > > > 2.47.1.613.gc27f4b7a9f-goog
> > > > >