Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
Splitting a single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code. When that split-away part is a lock, it complicates
things even further. With no performance benefits, there is no reason
for this split. Merging the vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
This patchset:
1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
boundary and changing the cache to be cacheline-aligned to minimize
cacheline sharing;
2. changes vm_area_struct initialization to mark a new vma as detached until
it is inserted into the vma tree;
3. replaces vm_lock and the vma->detached flag with a reference counter
(a rough model of the resulting scheme is sketched right after this list);
4. changes the vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for
vma reuse and to minimize call_rcu() calls.
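As a rough illustration of item 3, here is a minimal userspace model of the
counting rules summarized later in this letter (0 == detached, 1 == attached
with no readers, each reader adds one count, writers do not take a count and
instead wait for readers to drain). This is not the kernel code; the names,
the use of C11 atomics and the simplified fallback handling are assumptions
made purely for illustration:

/*
 * Userspace model only -- NOT the kernel implementation. It illustrates
 * the assumed counting rules from this letter: vm_refcnt == 0 means the
 * vma is detached, 1 means attached with no readers, and each reader
 * holds one extra count. Writers do not take a count; they wait for
 * readers to drain instead.
 */
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

struct vma_model {
        atomic_int vm_refcnt;
};

/* Reader fast path: succeeds only if the vma is currently attached. */
static bool vma_start_read_model(struct vma_model *vma)
{
        int old = atomic_load_explicit(&vma->vm_refcnt, memory_order_relaxed);

        do {
                /* 0 == detached; INT_MAX guards against saturation. */
                if (old == 0 || old == INT_MAX)
                        return false;   /* caller falls back to mmap_lock */
        } while (!atomic_compare_exchange_weak_explicit(&vma->vm_refcnt,
                                                        &old, old + 1,
                                                        memory_order_acquire,
                                                        memory_order_relaxed));
        return true;
}

/* Reader unlock: the last reader leaving (count 2 -> 1) is the point where
 * the real code would wake a writer waiting for the vma to be reader-free. */
static void vma_end_read_model(struct vma_model *vma)
{
        int old = atomic_fetch_sub_explicit(&vma->vm_refcnt, 1,
                                            memory_order_release);

        assert(old >= 2);
        if (old == 2) {
                /* writer wakeup would go here (rcuwait in the real series) */
        }
}

The real helpers also handle lockdep annotations and other checks omitted
here; the model only captures the counting behaviour.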
Pagefault microbenchmarks show performance improvement:
Hmean faults/cpu-1 507926.5547 ( 0.00%) 506519.3692 * -0.28%*
Hmean faults/cpu-4 479119.7051 ( 0.00%) 481333.6802 * 0.46%*
Hmean faults/cpu-7 452880.2961 ( 0.00%) 455845.6211 * 0.65%*
Hmean faults/cpu-12 347639.1021 ( 0.00%) 352004.2254 * 1.26%*
Hmean faults/cpu-21 200061.2238 ( 0.00%) 229597.0317 * 14.76%*
Hmean faults/cpu-30 145251.2001 ( 0.00%) 164202.5067 * 13.05%*
Hmean faults/cpu-48 106848.4434 ( 0.00%) 120641.5504 * 12.91%*
Hmean faults/cpu-56 92472.3835 ( 0.00%) 103464.7916 * 11.89%*
Hmean faults/sec-1 507566.1468 ( 0.00%) 506139.0811 * -0.28%*
Hmean faults/sec-4 1880478.2402 ( 0.00%) 1886795.6329 * 0.34%*
Hmean faults/sec-7 3106394.3438 ( 0.00%) 3140550.7485 * 1.10%*
Hmean faults/sec-12 4061358.4795 ( 0.00%) 4112477.0206 * 1.26%*
Hmean faults/sec-21 3988619.1169 ( 0.00%) 4577747.1436 * 14.77%*
Hmean faults/sec-30 3909839.5449 ( 0.00%) 4311052.2787 * 10.26%*
Hmean faults/sec-48 4761108.4691 ( 0.00%) 5283790.5026 * 10.98%*
Hmean faults/sec-56 4885561.4590 ( 0.00%) 5415839.4045 * 10.85%*
Changes since v7 [4]:
- Removed additional parameter for vma_iter_store() and introduced
vma_iter_store_attached() instead, per Vlastimil Babka and
Liam R. Howlett
- Fixed coding style nits, per Vlastimil Babka
- Added Reviewed-bys and Acked-bys, per Vlastimil Babka
- Added Reviewed-bys and Acked-bys, per Liam R. Howlett
- Added Acked-by, per Davidlohr Bueso
- Removed unnecessary patch changing nommu.c
- Folded a fixup patch [5] into the patch it was fixing
- Changed the calculation in __refcount_add_not_zero_limited() to avoid
overflow, make the limit inclusive and use INT_MAX to indicate "no limit",
per Vlastimil Babka and Matthew Wilcox (see the small model after this list)
- Folded a fixup patch [6] into the patch it was fixing
- Added vm_refcnt rules summary in the changelog, per Liam R. Howlett
- Changed writers to not increment vm_refcnt and adjusted VMA_REF_LIMIT
to not reserve one count for a writer, per Liam R. Howlett
- Changed vma_refcount_put() to wake up writers only when the last reader
is leaving, per Liam R. Howlett
- Fixed rwsem_acquire_read() parameters when read-locking a vma to match
the way down_read_trylock() does lockdep, per Vlastimil Babka
- Folded vma_lockdep_init() into vma_lock_init() for simplicity
- Brought back vma_copy() to keep vm_refcnt at 0 during reuse,
per Vlastimil Babka
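Regarding the __refcount_add_not_zero_limited() bullet above, the
overflow-safe, inclusive-limit check is easiest to see in a small standalone
model (again an illustration only, not the kernel's refcount.h code):

/*
 * Illustration only -- not the kernel's refcount.h implementation.
 * Succeed when the counter is non-zero and adding 'i' keeps it within an
 * inclusive 'limit'; limit == INT_MAX means "no limit". Testing
 * 'old > limit - i' instead of 'old + i > limit' avoids signed overflow
 * (assuming i >= 1 and a non-negative limit).
 */
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

static bool add_not_zero_limited_model(atomic_int *r, int i, int limit)
{
        int old = atomic_load_explicit(r, memory_order_relaxed);

        do {
                if (old == 0)
                        return false;           /* object already dead */
                if (limit != INT_MAX && old > limit - i)
                        return false;           /* would exceed the limit */
        } while (!atomic_compare_exchange_weak_explicit(r, &old, old + i,
                                                        memory_order_acquire,
                                                        memory_order_relaxed));
        return true;
}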
What I did not include in this patchset:
- Liam's suggestion to change the dump_vma() output, since it's unclear to me
what it should look like. The patch is for debug only and not critical for
the rest of the series; we can change the output later or even drop it if
necessary.
[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
[4] https://lore.kernel.org/all/20241226170710.1159679-1-surenb@google.com/
[5] https://lore.kernel.org/all/20250107030415.721474-1-surenb@google.com/
[6] https://lore.kernel.org/all/20241226200335.1250078-1-surenb@google.com/
Patchset applies over mm-unstable after reverting v7
(current SHA range: 588f0086398e - fb2270654630)
Suren Baghdasaryan (16):
mm: introduce vma_start_read_locked{_nested} helpers
mm: move per-vma lock into vm_area_struct
mm: mark vma as detached until it's added into vma tree
mm: introduce vma_iter_store_attached() to use with attached vmas
mm: mark vmas detached upon exit
types: move struct rcuwait into types.h
mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
mm: move mmap_init_lock() out of the header file
mm: uninline the main body of vma_start_write()
refcount: introduce __refcount_{add|inc}_not_zero_limited
mm: replace vm_lock and detached flag with a reference count
mm/debug: print vm_refcnt state when dumping the vma
mm: remove extra vma_numab_state_init() call
mm: prepare lock_vma_under_rcu() for vma reuse possibility
mm: make vma cache SLAB_TYPESAFE_BY_RCU
docs/mm: document latest changes to vm_lock
Documentation/mm/process_addrs.rst | 44 +++++----
include/linux/mm.h | 152 ++++++++++++++++++++++-------
include/linux/mm_types.h | 36 ++++---
include/linux/mmap_lock.h | 6 --
include/linux/rcuwait.h | 13 +--
include/linux/refcount.h | 20 +++-
include/linux/slab.h | 6 --
include/linux/types.h | 12 +++
kernel/fork.c | 128 +++++++++++-------------
mm/debug.c | 12 +++
mm/init-mm.c | 1 +
mm/memory.c | 94 +++++++++++++++---
mm/mmap.c | 3 +-
mm/userfaultfd.c | 32 +++---
mm/vma.c | 23 ++---
mm/vma.h | 15 ++-
tools/testing/vma/linux/atomic.h | 5 +
tools/testing/vma/vma_internal.h | 93 ++++++++----------
18 files changed, 435 insertions(+), 260 deletions(-)
--
2.47.1.613.gc27f4b7a9f-goog
On Wed, Jan 8, 2025 at 6:30 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> [...]
> Patchset applies over mm-unstable after reverting v7
> (current SHA range: 588f0086398e - fb2270654630)
^^^ Please note that to apply this patchset over mm-unstable you
should revert the previous version. Thanks!
Btw the subject became rather incomplete given all the series does :)
On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> [...]
Given how patch 2 discusses memory growth due to moving the lock, should
patch 11 also discuss how the replacement with refcount reduces the
memory footprint? And/or the cover letter could summarize the impact of
the whole series in that aspect? Perhaps the refcount doesn't reduce
anything as it's smaller but sits alone in the cacheline? Could it be
grouped with some non-hot fields instead as a followup, so we could get
to <=192 (non-debug) size without impacting performance?
On Thu, Jan 9, 2025 at 5:40 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Btw the subject became rather incomplete given all the series does :)
Missed this one. What do you think is worth mentioning here? It's
true, the patchset does many small things but I wanted to outline the
main conceptual changes. Please LMK if you think there are more
changes big enough to be mentioned here.
On Thu, Jan 9, 2025 at 7:59 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Jan 9, 2025 at 5:40 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > Btw the subject became rather incomplete given all the series does :)
>
> Missed this one. What do you think is worth mentioning here? It's
> true, the patchset does many small things but I wanted to outline the
> main conceptual changes. Please LMK if you think there are more
> changes big enough to be mentioned here.
I just realized that your comment was only about the subject of this
cover letter. Maybe something like this:
per-vma lock and vm_area_struct cache optimizations
Would that be better?
On 1/10/25 1:16 AM, Suren Baghdasaryan wrote:
> [...]
> I just realized that your comment was only about the subject of this
> cover letter. Maybe something like this:
>
> per-vma lock and vm_area_struct cache optimizations

arguably the biggest change here is:

reimplement per-vma lock as a refcount

but yours is ok too, don't want to bikeshed
On Fri, Jan 10, 2025 at 7:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> [...]
> arguably the biggest change here is:
>
> reimplement per-vma lock as a refcount

Ok, I'll use that. Thanks!
On Thu, Jan 9, 2025 at 5:40 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> [...]
> Given how patch 2 discusses memory growth due to moving the lock, should
> also patch 11 discuss how the replacement with refcount reduces the
> memory footprint? And/or the cover letter could summarize the impact of
> the whole series in that aspect?
That's a good idea. I can amend the cover letter and the description
of patch 11 to include size information.
> Perhaps the refcount doesn't reduce
> anything as it's smaller but sits alone in the cacheline? Could it be
> grouped with some non-hot fields instead as a followup, so could we get
> to <=192 (non-debug) size without impacting performance?
Yes, absolutely. Before this series, vm_area_struct was roughly 168
bytes and vm_lock was 40 bytes. After the changes vm_area_struct
becomes 256 bytes. I was planning to pack the fields as a follow-up
patch similar to an earlier one [1] and bring the size of
vm_area_struct to < 192. I felt this patchset already does many things
and did not include it here but I can add it at the end of this
patchset if you think it's essential.
[1] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
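(For context, assuming the usual 64-byte cachelines: the old layout was
168 + 40 = 208 bytes split across two allocations, the merged and
cacheline-aligned struct rounds up to four cachelines, i.e. 256 bytes,
and packing it below 192 bytes would fit it into three cachelines.)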
On Thu, Jan 9, 2025 at 7:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> [...]
> Yes, absolutely. Before this series, vm_area_struct was roughly 168
> bytes and vm_lock was 40 bytes. After the changes vm_area_struct
> becomes 256 bytes. I was planning to pack the fields as a follow-up
> patch similar to an earlier one [1] and bring the size of
> vm_area_struct to < 192. I felt this patchset already does many things
> and did not include it here but I can add it at the end of this
> patchset if you think it's essential.
>
> [1] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
Actually I tried to rewrite the above patch [1] over the latest
patchset and it's pretty much the same. I think I should include it at
the end of this patchset as it's pretty simple. Planning to post v9
tomorrow morning, so if you don't want it in this patchset, please let
me know.
Thanks!
On Wed, Jan 08, 2025 at 06:30:09PM -0800, Suren Baghdasaryan wrote:
> [...]

Does not clean up that reattach nonsense :-(
On Thu, Jan 9, 2025 at 3:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Jan 08, 2025 at 06:30:09PM -0800, Suren Baghdasaryan wrote:
> > [...]
>
> Does not clean up that reattach nonsense :-(

Oh, no. I think it does. That's why in [1] I introduce
vma_iter_store_attached() to be used on already attached vmas and to
avoid marking them attached again. Also I added assertions in
vma_mark_attached()/vma_mark_detached() to avoid re-attaching or
re-detaching. Unless I misunderstood your comment?

[1] https://lore.kernel.org/all/20250109023025.2242447-5-surenb@google.com/
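A rough sketch of the invariant described above (illustration only, with
made-up *_model names rather than the actual kernel helpers): attach is
legal only on a detached vma, detach only on an attached, reader-free vma,
so an accidental re-attach or re-detach trips an assertion.

#include <assert.h>
#include <stdatomic.h>

struct vma_model {
        atomic_int vm_refcnt;   /* 0 = detached, 1 = attached, >1 = readers */
};

static void vma_mark_attached_model(struct vma_model *vma)
{
        /* must not already be attached */
        assert(atomic_load(&vma->vm_refcnt) == 0);
        atomic_store(&vma->vm_refcnt, 1);
}

static void vma_mark_detached_model(struct vma_model *vma)
{
        /* must be attached and have no readers holding it */
        assert(atomic_load(&vma->vm_refcnt) == 1);
        atomic_store(&vma->vm_refcnt, 0);
}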
On Thu, Jan 09, 2025 at 07:48:32AM -0800, Suren Baghdasaryan wrote:
> On Thu, Jan 9, 2025 at 3:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > [...]
> > Does not clean up that reattach nonsense :-(
>
> Oh, no. I think it does. That's why in [1] I introduce
> vma_iter_store_attached() to be used on already attached vmas and to
> avoid marking them attached again. Also I added assertions in
> vma_mark_attached()/vma_mark_detached() to avoid re-attaching or
> re-detaching. Unless I misunderstood your comment?

Hmm, I'll go read the thing again, maybe I missed it.
On Fri, Jan 10, 2025 at 06:01:05PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 09, 2025 at 07:48:32AM -0800, Suren Baghdasaryan wrote:
> > [...]
> > Unless I misunderstood your comment?
>
> Hmm, I'll go read the thing again, maybe I missed it.

You're right. I was looking for the approach that removed the need to
reattach by moving the point of no return. This should do for now.

Let me see if I can find time today to finally do a proper reading.

Thanks!