[PATCH v3 00/23] RSEQ node id and virtual cpu id extensions

Mathieu Desnoyers posted 23 patches 3 years, 8 months ago
Extend the rseq ABI to expose a NUMA node ID and a vm_vcpu_id field.

The NUMA node ID field allows implementing a faster getcpu(2) in libc.

The virtual cpu id allows ideal scaling (down or up) of user-space
per-cpu data structures. The virtual cpu ids allocated within a memory
space are tracked by the scheduler, which takes into account the number
of concurrently running threads, thus implicitly considering the cpu
affinity, the cpusets applying to those threads, and the number of
logical cores on the system.

This series is based on the v5.18.13 tag.

Thanks,

Mathieu

Mathieu Desnoyers (23):
  rseq: Introduce feature size and alignment ELF auxiliary vector
    entries
  rseq: Introduce extensible rseq ABI
  rseq: extend struct rseq with numa node id
  selftests/rseq: Use ELF auxiliary vector for extensible rseq
  selftests/rseq: Implement rseq numa node id field selftest
  lib: invert _find_next_bit source arguments
  lib: implement find_{first,next}_{zero,one}_and_zero_bit
  cpumask: implement cpumask_{first,next}_{zero,one}_and_zero
  sched: Introduce per memory space current virtual cpu id
  rseq: extend struct rseq with per memory space vcpu id
  selftests/rseq: Remove RSEQ_SKIP_FASTPATH code
  selftests/rseq: Implement rseq vm_vcpu_id field support
  selftests/rseq: x86: Template memory ordering and percpu access mode
  selftests/rseq: arm: Template memory ordering and percpu access mode
  selftests/rseq: arm64: Template memory ordering and percpu access mode
  selftests/rseq: mips: Template memory ordering and percpu access mode
  selftests/rseq: ppc: Template memory ordering and percpu access mode
  selftests/rseq: s390: Template memory ordering and percpu access mode
  selftests/rseq: riscv: Template memory ordering and percpu access mode
  selftests/rseq: basic percpu ops vm_vcpu_id test
  selftests/rseq: parametrized vm_vcpu_id test
  selftests/rseq: x86: Implement rseq_load_u32_u32
  selftests/rseq: Implement numa node id vs vm_vcpu_id invariant test

 fs/binfmt_elf.c                               |    5 +
 fs/exec.c                                     |    4 +
 include/linux/cpumask.h                       |   86 ++
 include/linux/find.h                          |  123 +-
 include/linux/mm.h                            |   25 +
 include/linux/mm_types.h                      |  111 ++
 include/linux/sched.h                         |    9 +
 include/trace/events/rseq.h                   |    4 +-
 include/uapi/linux/auxvec.h                   |    2 +
 include/uapi/linux/rseq.h                     |   22 +
 init/Kconfig                                  |    4 +
 kernel/fork.c                                 |   15 +-
 kernel/ptrace.c                               |    2 +-
 kernel/rseq.c                                 |   60 +-
 kernel/sched/core.c                           |   82 ++
 kernel/sched/deadline.c                       |    3 +
 kernel/sched/debug.c                          |   13 +
 kernel/sched/fair.c                           |    1 +
 kernel/sched/rt.c                             |    2 +
 kernel/sched/sched.h                          |  357 +++++
 kernel/sched/stats.c                          |   16 +-
 lib/find_bit.c                                |   17 +-
 tools/include/linux/find.h                    |    9 +-
 tools/lib/find_bit.c                          |   17 +-
 tools/testing/selftests/rseq/.gitignore       |    5 +
 tools/testing/selftests/rseq/Makefile         |   20 +-
 .../testing/selftests/rseq/basic_numa_test.c  |  117 ++
 .../selftests/rseq/basic_percpu_ops_test.c    |   46 +-
 tools/testing/selftests/rseq/basic_test.c     |    4 +
 tools/testing/selftests/rseq/compiler.h       |    6 +
 tools/testing/selftests/rseq/param_test.c     |  152 ++-
 tools/testing/selftests/rseq/rseq-abi.h       |   22 +
 tools/testing/selftests/rseq/rseq-arm-bits.h  |  505 +++++++
 tools/testing/selftests/rseq/rseq-arm.h       |  701 +---------
 .../testing/selftests/rseq/rseq-arm64-bits.h  |  392 ++++++
 tools/testing/selftests/rseq/rseq-arm64.h     |  520 +------
 .../testing/selftests/rseq/rseq-bits-reset.h  |   10 +
 .../selftests/rseq/rseq-bits-template.h       |   39 +
 tools/testing/selftests/rseq/rseq-mips-bits.h |  462 +++++++
 tools/testing/selftests/rseq/rseq-mips.h      |  646 +--------
 tools/testing/selftests/rseq/rseq-ppc-bits.h  |  454 +++++++
 tools/testing/selftests/rseq/rseq-ppc.h       |  617 +--------
 .../testing/selftests/rseq/rseq-riscv-bits.h  |  410 ++++++
 tools/testing/selftests/rseq/rseq-riscv.h     |  529 +-------
 tools/testing/selftests/rseq/rseq-s390-bits.h |  474 +++++++
 tools/testing/selftests/rseq/rseq-s390.h      |  495 +------
 tools/testing/selftests/rseq/rseq-skip.h      |   65 -
 tools/testing/selftests/rseq/rseq-x86-bits.h  | 1036 ++++++++++++++
 tools/testing/selftests/rseq/rseq-x86.h       | 1193 +----------------
 tools/testing/selftests/rseq/rseq.c           |   86 +-
 tools/testing/selftests/rseq/rseq.h           |  229 +++-
 .../testing/selftests/rseq/run_param_test.sh  |    5 +
 52 files changed, 5536 insertions(+), 4693 deletions(-)
 create mode 100644 tools/testing/selftests/rseq/basic_numa_test.c
 create mode 100644 tools/testing/selftests/rseq/rseq-arm-bits.h
 create mode 100644 tools/testing/selftests/rseq/rseq-arm64-bits.h
 create mode 100644 tools/testing/selftests/rseq/rseq-bits-reset.h
 create mode 100644 tools/testing/selftests/rseq/rseq-bits-template.h
 create mode 100644 tools/testing/selftests/rseq/rseq-mips-bits.h
 create mode 100644 tools/testing/selftests/rseq/rseq-ppc-bits.h
 create mode 100644 tools/testing/selftests/rseq/rseq-riscv-bits.h
 create mode 100644 tools/testing/selftests/rseq/rseq-s390-bits.h
 delete mode 100644 tools/testing/selftests/rseq/rseq-skip.h
 create mode 100644 tools/testing/selftests/rseq/rseq-x86-bits.h

-- 
2.17.1
Re: [PATCH v3 00/23] RSEQ node id and virtual cpu id extensions
Posted by Peter Oskolkov 3 years, 8 months ago
On Fri, Jul 29, 2022 at 12:02 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Extend the rseq ABI to expose a NUMA node ID and a vm_vcpu_id field.

Thanks a lot, Mathieu - it is really exciting to see this happening!

I'll share our experiences here, with the hope that it may be useful.
I've also cc-ed Chris Kennelly, who worked on the userspace/tcmalloc
side, as he can provide more context/details if I miss or misrepresent
something.

The problem:

tcmalloc maintains per-cpu freelists in userspace to make userspace
memory allocations fast and efficient; it relies on rseq to do so, as
any manipulation of the freelists has to be protected against thread
migrations.

However, a typical userspace process at a Google datacenter is confined
to a relatively small number of CPUs (8-16) via cgroups, while the
servers typically have a much larger number of physical CPUs, so the
per-cpu freelist model is somewhat wasteful: if a process has at most
10 threads running, for example, but these threads can "wander" across
100 CPUs over the lifetime of the process, keeping 100 freelists
instead of 10 noticeably wastes memory.

Note that although a typical process at Google has a limited CPU quota,
thus using only a small number of CPUs at any given time, the process
may often have many hundreds or thousands of threads, so per-thread
freelists are not a viable solution to the problem just described.

Our current solution:

As you outlined in patch 9, tracking the number of currently running
threads per address space and exposing this information via a vcpu_id
abstraction helps tcmalloc to noticeably reduce its freelist overhead
in the "narrow process running on a wide server" situation, which is
typical at Google.

We have experimented with several approaches here. The one that we are
currently using is the "flat" model: we allocate vcpu IDs ignoring numa nodes.

We did try per-numa-node vcpus, but it did not show any material improvement
over the "flat" model, perhaps because on our most "wide" servers the CPU
topology is multi-level. Chris Kennelly may provide more details here.

On a more technical note, we do use atomic operations extensively in
the kernel to make sure vcpu IDs are "tightly packed", i.e. if only N
threads of a process are currently running on physical CPUs, vcpu IDs
will be in the range [0, N-1]: no gaps, no going to N and above. This
does consume some extra CPU cycles, but the RAM savings we gain far
outweigh the extra CPU cost; it will be interesting to see what you can
do with the optimizations you propose in this patchset.

Again, thanks a lot for this effort!

Peter

[...]
Re: [PATCH v3 00/23] RSEQ node id and virtual cpu id extensions
Posted by Mathieu Desnoyers 3 years, 8 months ago
----- On Aug 1, 2022, at 1:07 PM, Peter Oskolkov posk@posk.io wrote:

> On Fri, Jul 29, 2022 at 12:02 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> Extend the rseq ABI to expose a NUMA node ID and a vm_vcpu_id field.
> 
> Thanks a lot, Mathieu - it is really exciting to see this happening!
> 
> I'll share our experiences here, with the hope that it may be useful.
> I've also cc-ed
> Chris Kennelly, who worked on the userspace/tcmalloc side, as he can provide
> more context/details if I miss or misrepresent something.

Thanks for sharing your experiences at Google. This helps put things in
perspective.

> 
> The problem:
> 
> tcmalloc maintains per-cpu freelists in the userspace to make userspace
> memory allocations fast and efficient; it relies on rseq to do so, as
> any manipulation
> of the freelists has to be protected vs thread migrations.
> 
> However, as a typical userspace process at a Google datacenter is confined to
> a relatively small number of CPUs (8-16) via cgroups, while the
> servers typically
> have a much larger number of physical CPUs, the per-cpu freelist model
> is somewhat
> wasteful: if a process has only at most 10 threads running, for
> example, but these threads
> can "wander" across 100 CPUs over the lifetime of the process, keeping 100
> freelists instead of 10 noticeably wastes memory.
> 
> Note that although a typical process at Google has a limited CPU
> quota, thus using
> only a small number of CPUs at any given time, the process may often have many
> hundreds or thousands of threads, so per-thread freelists are not a viable
> solution to the problem just described.
> 
> Our current solution:
> 
> As you outlined in patch 9, tracking the number of currently running threads per
> address space and exposing this information via a vcpu_id abstraction helps
> tcmalloc to noticeably reduce its freelist overhead in the "narrow
> process running
> on a wide server" situation, which is typical at Google.
> 
> We have experimented with several approaches here. The one that we are
> currently using is the "flat" model: we allocate vcpu IDs ignoring numa nodes.
> 
> We did try per-numa-node vcpus, but it did not show any material improvement
> over the "flat" model, perhaps because on our most "wide" servers the CPU
> topology is multi-level. Chris Kennelly may provide more details here.

I would really like to know more about Google's per-numa-node vcpus implementation.
I suspect you guys may have taken a different turn somewhere in the design which
led to these results. But not having seen that implementation, I can only guess.

I notice the following Google-specific prototype extension in tcmalloc:

  // This is a prototype extension to the rseq() syscall.  Since a process may
  // run on only a few cores at a time, we can use a dense set of "v(irtual)
  // cpus."  This can reduce cache requirements, as we only need N caches for
  // the cores we actually run on simultaneously, rather than a cache for every
  // physical core.
  union {
    struct {
      short numa_node_id;
      short vcpu_id;
    };
    int vcpu_flat;
  };

Can you tell me more about the way the numa_node_id and vcpu_id are allocated
internally, and how they are expected to be used by userspace ?

> 
> On a more technical note, we do use atomic operations extensively in
> the kernel to make sure
> vcpu IDs are "tightly packed", i.e. if only N threads of a process are currently
> running on physical CPUs, vcpu IDs will be in the range [0, N-1], i.e. no gaps,
> no going to N and above; this does consume some extra CPU cycles, but the
> RAM savings we gain far outweigh the extra CPU cost; it will be interesting to
> see what you can do with the optimizations you propose in this patchset.

The optimizations I propose keep those "tightly packed" characteristics, but skip
the atomic operations in common scenarios. I'll welcome benchmarks of the added
overhead in representative workloads.

> Again, thanks a lot for this effort!

Thanks for your input. It really helps steering the effort in the right direction.

Mathieu

> 
> Peter
> 
> [...]

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [PATCH v3 00/23] RSEQ node id and virtual cpu id extensions
Posted by Peter Oskolkov 3 years, 8 months ago
On Tue, Aug 2, 2022 at 8:01 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>

[...]

> >
> > We have experimented with several approaches here. The one that we are
> > currently using is the "flat" model: we allocate vcpu IDs ignoring numa nodes.
> >
> > We did try per-numa-node vcpus, but it did not show any material improvement
> > over the "flat" model, perhaps because on our most "wide" servers the CPU
> > topology is multi-level. Chris Kennelly may provide more details here.
>
> I would really like to know more about Google's per-numa-node vcpus implementation.
> I suspect you guys may have taken a different turn somewhere in the design which
> led to these results. But having not seen that implementation, I can only guess.
>
> I notice the following Google-specific prototype extension in tcmalloc:
>
>   // This is a prototype extension to the rseq() syscall.  Since a process may
>   // run on only a few cores at a time, we can use a dense set of "v(irtual)
>   // cpus."  This can reduce cache requirements, as we only need N caches for
>   // the cores we actually run on simultaneously, rather than a cache for every
>   // physical core.
>   union {
>     struct {
>       short numa_node_id;
>       short vcpu_id;
>     };
>     int vcpu_flat;
>   };
>
> Can you tell me more about the way the numa_node_id and vcpu_id are allocated
> internally, and how they are expected to be used by userspace ?

Based on a "VCPU policy" flag passed by the userspace during rseq registration
request, our kernel would:
- do nothing re: vcpus, i.e. behave like it currently does upstream;
- allocate VCPUs in a "flat" manner, ignoring NUMA;
- populate numa_node_id with the value from the function with the same name in
  https://elixir.bootlin.com/linux/latest/source/include/linux/topology.h
  and allocate vcpu_id within the numa node in a tight manner.

Basically, if there are M threads running on node 0 and N threads
running on node 1 at time T, there will be [0,M-1] vcpu IDs associated with
node 0 and [0,N-1] vcpu IDs associated with node 1 at this moment
in time. If a thread migrates across nodes, the balance would change
accordingly.

I'm not sure how exactly tcmalloc tried to use VCPUs under this policy,
or what benefits were expected. The simplest way would be to keep a
freelist per node_id/vcpu_id pair (basically, per vcpu_flat in the
union), but this would tend to increase the number of freelists due to
thread migrations, so the benefits should be related to memory locality,
and so somewhat difficult to measure precisely.

Chris Kennelly may offer more details here.

Thanks,
Peter
Re: [PATCH v3 00/23] RSEQ node id and virtual cpu id extensions
Posted by Mathieu Desnoyers 3 years, 8 months ago
----- On Aug 2, 2022, at 1:06 PM, Peter Oskolkov posk@google.com wrote:

> On Tue, Aug 2, 2022 at 8:01 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
> 
> [...]
> 
>> >
>> > We have experimented with several approaches here. The one that we are
>> > currently using is the "flat" model: we allocate vcpu IDs ignoring numa nodes.
>> >
>> > We did try per-numa-node vcpus, but it did not show any material improvement
>> > over the "flat" model, perhaps because on our most "wide" servers the CPU
>> > topology is multi-level. Chris Kennelly may provide more details here.
>>
>> I would really like to know more about Google's per-numa-node vcpus
>> implementation.
>> I suspect you guys may have taken a different turn somewhere in the design which
>> led to these results. But having not seen that implementation, I can only guess.
>>
>> I notice the following Google-specific prototype extension in tcmalloc:
>>
>>   // This is a prototype extension to the rseq() syscall.  Since a process may
>>   // run on only a few cores at a time, we can use a dense set of "v(irtual)
>>   // cpus."  This can reduce cache requirements, as we only need N caches for
>>   // the cores we actually run on simultaneously, rather than a cache for every
>>   // physical core.
>>   union {
>>     struct {
>>       short numa_node_id;
>>       short vcpu_id;
>>     };
>>     int vcpu_flat;
>>   };
>>
>> Can you tell me more about the way the numa_node_id and vcpu_id are allocated
>> internally, and how they are expected to be used by userspace ?
> 
> Based on a "VCPU policy" flag passed by the userspace during rseq registration
> request, our kernel would:
> - do nothing re: vcpus, i.e. behave like it currently does upstream;
> - allocate VCPUs in a "flat" manner, ignoring NUMA;
> - populate numa_node_id with the value from the function with the same name in
>  https://elixir.bootlin.com/linux/latest/source/include/linux/topology.h
>  and allocate vcpu_id within the numa node in a tight manner.
> 
> Basically, if there are M threads running on node 0 and N threads
> running on node 1 at time T, there will be [0,M-1] vcpu IDs associated with
> node 0 and [0,N-1] vcpu IDs associated with node 1 at this moment
> in time. If a thread migrates across nodes, the balance would change
> accordingly.
> 
> I'm not sure how exactly tcmalloc tried to use VCPUs under this policy, and
> what were the benefits expected. The simplest way would be to keep
> a freelist per node_id/vcpu_id pair (basically, per vcpu_flat in the union),
> but this would tend to increase the number of freelists due to thread
> migrations,
> so benefits should be related to memory locality, and so somewhat difficult to
> measure precisely.

So, based on your explanation of the Google implementation, for each memory space,
the kernel keeps per-numa-node vcpu-id allocation domains.

This leaves two choices to userspace AFAIU:

1) Userspace takes the vcpu_flat (int, combining the short node_id with the short vcpu_id)
   as index in a sparse array. The sparseness of the array may be unwelcome in terms of
   memory and cache footprint.

2) Userspace could take a 2-level approach: using the short node_id to index an array of
   "numa node" objects, which would then point to a 2nd level indexed by short vcpu_id.
   This adds an extra pointer dereference on the fast-path, and touches additional cache
   lines on the fast path as well, which is probably unwelcome. In addition, keeping track
   of this 2-level table adds extra complexity in userspace, and requires that user-space
   designs its data structure specifically for NUMA, which is unusual considering that NUMA
   is typically just an optimization hint to improve locality of memory accesses.

Hopefully I did not miss anything there.

So here is how I did things differently.

I realized that when userspace uses a rseq_abi()->cpu_id as index into per-cpu data structures,
it expects that once any of the process' threads observes a given numa_node_id while running on
behalf of a given cpu_id in the process lifetime, this topology is invariant [1]. IOW, the same
cpu_id should always run on the same NUMA node in the future.

This characteristic is what allows indexing by cpu_id to index data structures that have a
good NUMA locality: on first use of a given cpu_id, memory allocation can be done on behalf of
the right NUMA node, and then all per-cpu accesses are guaranteed to be local.

So I applied this concept to vcpu_ids.

The trick here is mostly to add a per-numa-node bitmap to the mm_struct in addition to the bitmap
tracking the current vcpu_id allocation. The per-numa-node bitmap keeps track of which vcpu_ids
have been allocated on behalf of each numa node in the past.

So when a thread running on a given NUMA node needs to find the lowest vcpu_id which is available,
it uses cpumask_first_one_and_zero(node_cpumask, cpumask) to find the first bit which has both
been allocated already for this NUMA node, and is currently not in use by another thread.

There is also a node_alloc_vcpumask bitmap which keeps track of which vcpus have already been
allocated in the past, across all NUMA nodes. This allows scanning efficiently for the first
vcpu which was _not_ yet allocated and is currently unused, with
cpumask_first_zero_and_zero(node_alloc_vcpumask, cpumask).

With this, user-space can simply use the vm_vcpu_id field as index into the per-vcpu array,
and the NUMA locality is implicit. Upon initial allocation of the numa-local memory, it just
needs to read both vm_vcpu_id and numa_node_id fields within a rseq critical section to
ensure there is no migration between the two field loads.

So as long as the scheduler does a good job at keeping the number of threads per NUMA node
relatively constant by pushing back against thread migration across NUMA nodes over the process
lifetime, there should not be too many extra vcpu_ids needed. In the worst-case scenario, the
number of vcpu_ids needed is equal to the number of online cpus in the system.

The performance overhead of keeping track of those additional bitmaps for NUMA locality should
be minimal considering that the only update needed in the per-numa-node bitmap and the
node_alloc_vcpumask bitmap is the first time a vcpu_id is assigned within a process. The rest
is lookup only. And even there, the optimizations I have put in place skip those lookups in
the common scenarios entirely.

Thanks,

Mathieu

[1] There is the exception of Power CPU hotplug which can reconfigure the NUMA topology,
    but this seems like a rather odd and rare corner-case. It is supported by my implementation,
    but userspace would have to deal with this kind of reconfiguration on its own to
    preserve NUMA locality.

> 
> Chris Kennelly may offer more details here.
> 
> Thanks,
> Peter

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [PATCH v3 00/23] RSEQ node id and virtual cpu id extensions
Posted by Chris Kennelly 3 years, 8 months ago
On Tue, Aug 2, 2022 at 4:53 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Aug 2, 2022, at 1:06 PM, Peter Oskolkov posk@google.com wrote:
>
> > On Tue, Aug 2, 2022 at 8:01 AM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >>
> >
> > [...]
> >
> >> >
> >> > We have experimented with several approaches here. The one that we are
> >> > currently using is the "flat" model: we allocate vcpu IDs ignoring numa nodes.
> >> >
> >> > We did try per-numa-node vcpus, but it did not show any material improvement
> >> > over the "flat" model, perhaps because on our most "wide" servers the CPU
> >> > topology is multi-level. Chris Kennelly may provide more details here.

As Peter mentioned, though, we solely use the flat vcpu ID implementation.

> >>
> >> I would really like to know more about Google's per-numa-node vcpus
> >> implementation.
> >> I suspect you guys may have taken a different turn somewhere in the design which
> >> led to these results. But having not seen that implementation, I can only guess.
> >>
> >> I notice the following Google-specific prototype extension in tcmalloc:
> >>
> >>   // This is a prototype extension to the rseq() syscall.  Since a process may
> >>   // run on only a few cores at a time, we can use a dense set of "v(irtual)
> >>   // cpus."  This can reduce cache requirements, as we only need N caches for
> >>   // the cores we actually run on simultaneously, rather than a cache for every
> >>   // physical core.
> >>   union {
> >>     struct {
> >>       short numa_node_id;
> >>       short vcpu_id;
> >>     };
> >>     int vcpu_flat;
> >>   };
> >>
> >> Can you tell me more about the way the numa_node_id and vcpu_id are allocated
> >> internally, and how they are expected to be used by userspace ?
> >
> > Based on a "VCPU policy" flag passed by the userspace during rseq registration
> > request, our kernel would:
> > - do nothing re: vcpus, i.e. behave like it currently does upstream;
> > - allocate VCPUs in a "flat" manner, ignoring NUMA;
> > - populate numa_node_id with the value from the function with the same name in
> >  https://elixir.bootlin.com/linux/latest/source/include/linux/topology.h
> >  and allocate vcpu_id within the numa node in a tight manner.
> >
> > Basically, if there are M threads running on node 0 and N threads
> > running on node 1 at time T, there will be [0,M-1] vcpu IDs associated with
> > node 0 and [0,N-1] vcpu IDs associated with node 1 at this moment
> > in time. If a thread migrates across nodes, the balance would change
> > accordingly.
> >
> > I'm not sure how exactly tcmalloc tried to use VCPUs under this policy, and
> > what were the benefits expected. The simplest way would be to keep
> > a freelist per node_id/vcpu_id pair (basically, per vcpu_flat in the union),
> > but this would tend to increase the number of freelists due to thread
> > migrations,
> > so benefits should be related to memory locality, and so somewhat difficult to
> > measure precisely.
>
> So, based on your explanation of the Google implementation, for each memory space,
> the kernel keeps per-numa-node vcpu-id allocation domains.
>
> This leaves two choices to userspace AFAIU:
>
> 1) Userspace takes the vcpu_flat (int, combining the short node_id with the short vcpu_id)
>    as index in a sparse array. The sparseness of the array may be unwelcome in terms of
>    memory and cache footprint.

In general, TCMalloc addresses sparseness by lazily allocating caches.
Virtual address space is relatively inexpensive, so we can allocate
enough space to have a cache for every feasible CPU ID, and structure
things so that on a fault (of a zero page) we trigger the slow path
and handle initialization
(https://github.com/google/tcmalloc/blob/master/tcmalloc/cpu_cache.h#L824).

Even without virtual CPU IDs, TCMalloc still preferred to allocate a
cache per possible core, so no additional checks would be needed on
the fast path.

>
>
> 2) Userspace could take a 2-level approach: using the short node_id to index an array of
>    "numa node" objects, which would then point to a 2nd level indexed by short vcpu_id.
>    This adds an extra pointer dereference on the fast-path, and touches additional cache
>    lines on the fast path as well, which is probably unwelcome. In addition, keeping track
>    of this 2-level table adds extra complexity in userspace, and requires that user-space
>    designs its data structure specifically for NUMA, which is unusual considering that NUMA
>    is typically just an optimization hint to improve locality of memory accesses.

This is the strategy I looked at using previously.

> Hopefully I did not miss anything there.
>
> So here is how I did things differently.
>
> I realized that when userspace uses a rseq_abi()->cpu_id as index into per-cpu data structures,
> it expects that when any of the process' threads observe a numa_node_id when running on behalf
> of a given cpu_id once in the process lifetime, this topology is invariant [1]. IOW, the same
> cpu_id should always run on the same NUMA node in the future.
>
> This characteristic is what allows indexing by cpu_id to index data structures that have a
> good NUMA locality: on first use of a given cpu_id, memory allocation can be done on behalf of
> the right NUMA node, and then all per-cpu accesses are guaranteed to be local.
>
> So I applied this concept to vcpu_ids.
>
> The trick here is mostly to add a per-numa-node bitmap to the mm_struct in addition to the bitmap
> tracking the current vcpu_id allocation. The per-numa-node bitmap keeps track of which vcpu_ids
> have been allocated on behalf of each numa node in the past.
>
> So when a thread running on a given NUMA node needs to find the lowest vcpu_id which is available,
> it uses cpumask_first_one_and_zero(node_cpumask, cpumask) to find the first bit which has both
> been allocated already for this NUMA node, and is currently not in use by another thread.
>
> There is also a node_alloc_vcpumask bitmap which keeps track of which vcpu have already been
> allocated in the past, across all NUMA nodes. This allows scanning efficiently for the first
> vcpu which was _not_ yet allocated, and is currently unused with
> cpumask_first_zero_and_zero(node_alloc_cpumask, cpumask).
>
> With this, user-space can simply use the vm_vcpu_id field as index into the per-vcpu array,
> and the NUMA locality is implicit. Upon initial allocation of the numa-local memory, it just
> needs to read both vm_vcpu_id and numa_node_id fields within a rseq critical section to
> ensure there is no migration between the two field loads.
>
>
> So as long as the scheduler does a good job at keeping the number of threads per NUMA node
> relatively constant by pushing back against thread migration across NUMA nodes over the process
> lifetime, there should not be too many extra vcpu_ids needed. In the worst-case scenario, the
> number of vcpu_ids needed is equal to the number of online cpus in the system.

Interesting.  It seems like this simplifies the fast path for using
the NUMA ID quite a bit.

As Peter mentions, our implementation prefers lower indexed vCPU IDs
over higher ones, but TCMalloc is agnostic to the precise IDs in use
(it looks at individual cache stats, not IDs, to make decisions).
Definitively skipping a vCPU ID on one node because it is used on
another seems entirely workable.

Chris



>
> The performance overhead of keeping track of those additional bitmaps for NUMA locality should
> be minimal considering that the only update needed in the per-numa-node bitmap and the
> node_alloc_vcpumask bitmap is the first time a vcpu_id is assigned within a process. The rest
> is lookup only. And even there, the optimizations I have put in place skip those lookups in
> the common scenarios entirely.
>
>
> Thanks,
>
> Mathieu
>
> [1] There is the exception of Power CPU hotplug which can reconfigure the NUMA topology,
>     but this seems like a rather odd and rare corner-case. It is supported by my implementation,
>     but userspace would have to deal with this kind of reconfiguration on its own to
>     preserve NUMA locality.
>
> >
> > Chris Kennelly may offer more details here.
> >
> > Thanks,
> > Peter
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com