[PATCH v2 00/16] Address some perf memory/data size issues

Ian Rogers posted 16 patches 2 years, 8 months ago
tools/lib/api/fs/cgroup.c                     |  17 ++-
tools/lib/api/fs/fs.c                         |  25 +++-
tools/lib/api/fs/tracing_path.c               |  17 +--
tools/lib/subcmd/exec-cmd.c                   |  35 +++--
tools/perf/arch/x86/tests/insn-x86.c          |  10 +-
tools/perf/arch/x86/tests/intel-pt-test.c     |  14 +-
tools/perf/builtin-config.c                   |   4 +-
tools/perf/builtin-daemon.c                   |  44 +++---
tools/perf/builtin-help.c                     |   4 +-
tools/perf/builtin-lock.c                     |  20 ++-
tools/perf/builtin-probe.c                    | 133 ++++++++++--------
tools/perf/builtin-timechart.c                |  48 +++++--
tools/perf/builtin-trace.c                    |  33 +++--
tools/perf/tests/pmu.c                        |  17 +--
tools/perf/trace/beauty/beauty.h              |   2 +-
.../perf/trace/beauty/tracepoints/x86_msr.sh  |   6 +-
tools/perf/util/cache.h                       |   2 +-
tools/perf/util/config.c                      |   3 +-
tools/perf/util/header.c                      |  41 +++---
tools/perf/util/path.c                        |  35 +----
.../util/scripting-engines/trace-event-perl.c |   4 +-
.../scripting-engines/trace-event-python.c    |   5 +-
22 files changed, 297 insertions(+), 222 deletions(-)
[PATCH v2 00/16] Address some perf memory/data size issues
Posted by Ian Rogers 2 years, 8 months ago
Try to reduce the data size of the perf command. Before these patches
a stripped non-debug binary was:

$ size -A perf
perf  :
section                  size       addr
.interp                    28        848
.note.gnu.property         32        880
.note.gnu.build-id         36        912
.note.ABI-tag              32        948
.gnu.hash               24628        984
.dynsym                 88920      25616
.dynstr                 70193     114536
.gnu.version             7410     184730
.gnu.version_r            800     192144
.rela.dyn              460824     192944
.rela.plt               14784     653768
.init                      23     671744
.plt                     9872     671776
.plt.got                   24     681648
.text                 2279182     681680
.noinstr.text             476    2960864
.fini                       9    2961340
.rodata               7042922    2961408
.eh_frame_hdr           42844   10004332
.eh_frame              226496   10047176
.tbss                      48   10279720
.init_array                16   10279720
.fini_array                 8   10279736
.data.rel.ro            53376   10279744
.dynamic                  736   10333120
.got                      328   10333856
.got.plt                 4952   10334184
.data                  391088   10339136
.bss                   285776   10730240
.comment                   31          0
Total                11005894

And after:
perf  :
section                  size       addr
.interp                    28        848
.note.gnu.property         32        880
.note.gnu.build-id         36        912
.note.ABI-tag              32        948
.gnu.hash               24628        984
.dynsym                 88944      25616
.dynstr                 70217     114560
.gnu.version             7412     184778
.gnu.version_r            816     192192
.rela.dyn              460824     193008
.rela.plt               14808     653832
.init                      23     671744
.plt                     9888     671776
.plt.got                   24     681664
.text                 2280446     681696
.noinstr.text             476    2962144
.fini                       9    2962620
.rodata               7048746    2965504
.eh_frame_hdr           42852   10014252
.eh_frame              226568   10057104
.tbss                      48   10285640
.init_array                16   10285640
.fini_array                 8   10285656
.data.rel.ro           301408   10285664
.dynamic                  736   10587072
.got                      328   10587808
.got.plt                 4960   10588136
.data                  100464   10593152
.bss                    22512   10693632
.comment                   31          0
Total                10707320

The binary has reduced in size by 298,574 bytes. The .bss, that
doesn't count toward file size, is reduced by 263,254 bytes. At
runtime this could reduce the footprint up to 561,828 bytes. This is
still just a fraction of the .rodata section's size of 7,048,746
bytes, that mainly contains the converted json events. The .rodata
section needn't all be mapped at the same time.

The changes are largely removing static variables and replacing them
with local or dynamically allocated memory. A common issue was having
paths in statics for the sake of returning a non-stack pointer to a
buffer, so the APIs were changed to pass buffers in.

v2. Address review comments from Namhyung, thanks!

Ian Rogers (16):
  perf header: Make nodes dynamic in write_mem_topology
  perf test x86: insn-x86 test data is immutable so mark it const
  perf test x86: intel-pt-test data is immutable so mark it const
  perf trace: Make some large static arrays const
  perf trace beauty: Make MSR arrays const
  tools api fs: Avoid large static PATH_MAX arrays
  tools lib api fs tracing_path: Remove two unused MAX_PATH paths
  perf daemon: Dynamically allocate path to perf
  perf lock: Dynamically allocate lockhash_table
  perf timechart: Make large arrays dynamic
  perf probe: Dynamically allocate params memory
  perf path: Make mkpath thread safe
  perf scripting-engines: Move static to local variable
  tools api fs: Dynamically allocate cgroupfs mount point cache
  perf test pmu: Avoid 2 static path arrays
  libsubcmd: Avoid two path statics

 tools/lib/api/fs/cgroup.c                     |  17 ++-
 tools/lib/api/fs/fs.c                         |  25 +++-
 tools/lib/api/fs/tracing_path.c               |  17 +--
 tools/lib/subcmd/exec-cmd.c                   |  35 +++--
 tools/perf/arch/x86/tests/insn-x86.c          |  10 +-
 tools/perf/arch/x86/tests/intel-pt-test.c     |  14 +-
 tools/perf/builtin-config.c                   |   4 +-
 tools/perf/builtin-daemon.c                   |  44 +++---
 tools/perf/builtin-help.c                     |   4 +-
 tools/perf/builtin-lock.c                     |  20 ++-
 tools/perf/builtin-probe.c                    | 133 ++++++++++--------
 tools/perf/builtin-timechart.c                |  48 +++++--
 tools/perf/builtin-trace.c                    |  33 +++--
 tools/perf/tests/pmu.c                        |  17 +--
 tools/perf/trace/beauty/beauty.h              |   2 +-
 .../perf/trace/beauty/tracepoints/x86_msr.sh  |   6 +-
 tools/perf/util/cache.h                       |   2 +-
 tools/perf/util/config.c                      |   3 +-
 tools/perf/util/header.c                      |  41 +++---
 tools/perf/util/path.c                        |  35 +----
 .../util/scripting-engines/trace-event-perl.c |   4 +-
 .../scripting-engines/trace-event-python.c    |   5 +-
 22 files changed, 297 insertions(+), 222 deletions(-)

-- 
2.41.0.rc0.172.g3f132b7071-goog
Re: [PATCH v2 00/16] Address some perf memory/data size issues
Posted by Andi Kleen 2 years, 8 months ago
FWIW I think the whole patchkit could be replaced with a one liner
that disables THP for the BSS segment. I suspect that would be roughly
equivalent for memory consumption because 4K pages that are never
touched would never be allocated.

-Andi
Re: [PATCH v2 00/16] Address some perf memory/data size issues
Posted by Ian Rogers 2 years, 8 months ago
On Mon, May 29, 2023 at 9:57 PM Andi Kleen <ak@linux.intel.com> wrote:
>
>
> FWIW I think the whole patchkit could be replaced with a one liner
> that disables THP for the BSS segment. I suspect that would be roughly
> equivalent for memory consumption because 4K pages that are never
> touched would never be allocated.
>
> -Andi

So, it is worth reading some of the comments on the code to see why a
wider clean up is worth it:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/util/path.c?h=perf-tools#n7
"It's obviously not thread-safe. Sue me."
(a comment stemming back to the git origins of the code base)

BSS won't count toward file size, which the patches were primarily
going after - but checking the size numbers I have miscalculated from
reading size's output that I'm not familiar with. The numbers are
still improved, but I just see a 37kb saving, with 5kb more in
.rodata. Something but not much. .data.rel.ro is larger, which imo is
good, but those pages will still be dirtied so a mute point wrt file
size and memory overhead.

For huge pages I thought it was correct that things are aligned by max
page size which I thought on x86-64 was 2MB, so I tried:
EXTRA_LDFLAGS="-z max-page-size=4096"
but it made no difference to anything, and with:
EXTRA_CFLAGS="-Wl,-z,max-page-size=4096"
EXTRA_CXXFLAGS="-Wl,-z,max-page-size=4096"
file size just got worse.

Thanks,
Ian
Re: [PATCH v2 00/16] Address some perf memory/data size issues
Posted by Andi Kleen 2 years, 8 months ago
> BSS won't count toward file size, which the patches were primarily
> going after - but checking the size numbers I have miscalculated from
> reading size's output that I'm not familiar with. The numbers are
> still improved, but I just see a 37kb saving, with 5kb more in
> .rodata. Something but not much. .data.rel.ro is larger, which imo is
> good, but those pages will still be dirtied so a mute point wrt file
> size and memory overhead.

The way perf is written (lots of separate code depending on a single high level
switch) most pages probably won't be dirtied.

> 
> For huge pages I thought it was correct that things are aligned by max
> page size which I thought on x86-64 was 2MB, so I tried:
> EXTRA_LDFLAGS="-z max-page-size=4096"
> but it made no difference to anything, and with:
> EXTRA_CFLAGS="-Wl,-z,max-page-size=4096"
> EXTRA_CXXFLAGS="-Wl,-z,max-page-size=4096"
> file size just got worse.

The default alignment to 2MB was dropped in the GNU toolchain in 2018 or
so.

-Andi
Re: [PATCH v2 00/16] Address some perf memory/data size issues
Posted by Ian Rogers 2 years, 8 months ago
On Tue, May 30, 2023 at 12:59 AM Andi Kleen <ak@linux.intel.com> wrote:
>
> > BSS won't count toward file size, which the patches were primarily
> > going after - but checking the size numbers I have miscalculated from
> > reading size's output that I'm not familiar with. The numbers are
> > still improved, but I just see a 37kb saving, with 5kb more in
> > .rodata. Something but not much. .data.rel.ro is larger, which imo is
> > good, but those pages will still be dirtied so a mute point wrt file
> > size and memory overhead.
>
> The way perf is written (lots of separate code depending on a single high level
> switch) most pages probably won't be dirtied.

For data everything is relocated when perf is loaded. Setting a
breakpoint on main and then dumping smaps (edited for brevity) I see:
```
555555554000-5555555f8000 r--p 00000000 fe:01 32936368
  /tmp/perf/perf
Size:                656 kB
Pss:                 656 kB
Pss_Dirty:             0 kB
5555555f8000-555555828000 r-xp 000a4000 fe:01 32936368
  /tmp/perf/perf
Size:               2240 kB
Pss:                  32 kB
Pss_Dirty:             8 kB
555555828000-555555f23000 r--p 002d4000 fe:01 32936368
  /tmp/perf/perf
Size:               7148 kB
Pss:                  64 kB
Pss_Dirty:             0 kB
555555f23000-555555f6d000 r--p 009cf000 fe:01 32936368
  /tmp/perf/perf
Size:                296 kB
Pss:                 288 kB
Pss_Dirty:           288 kB
555555f6d000-555555f87000 rw-p 00a19000 fe:01 32936368
  /tmp/perf/perf
Size:                104 kB
Pss:                 104 kB
Pss_Dirty:           104 kB
```
These are roughly header, text, .rodata, .data.rel.ro, .data. So at
the point we enter main we have 392kB of dirty pages in .data.rel.ro
and .data.

For x86 a large contributor to the relocations comes from the insn-x86.c test:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/arch/x86/tests/insn-x86.c?h=perf-tools-next#n21
The test_data_32 and test_data_64 arrays are 75,024 bytes and 93,600
bytes respectively and are in .data.rel.ro, they account for nearly
40% of it.

In gdb at main entry:
```
(gdb) p test_data_32[0]
$1 = {data = "\017\061", '\000' <repeats 12 times>, expected_length =
2, expected_rel = 0,
 expected_op_str = 0x555555866adc "", expected_branch_str = 0x555555866adc "",
 asm_rep = 0x55555586fa2a "0f 31", ' ' <repeats 16 times>, "\trdtsc  "}
```
you can see that all the strings in test_data_32 have been relocated
(even though we haven't run any part of perf yet) and are pointing to
data in .rodata. To avoid these relocations for the output of
jevents.py (pmu-events.c) all the strings are merged into a big string
and then the offsets within the string are stored - no relocations
means everything goes in the nice non-dirty .rodata. As the data in
the insn-x86.c test is also generated then a similar trick could be
performed. There is also the possibility to separate all the perf
builtins into libraries...

Thanks,
Ian

> >
> > For huge pages I thought it was correct that things are aligned by max
> > page size which I thought on x86-64 was 2MB, so I tried:
> > EXTRA_LDFLAGS="-z max-page-size=4096"
> > but it made no difference to anything, and with:
> > EXTRA_CFLAGS="-Wl,-z,max-page-size=4096"
> > EXTRA_CXXFLAGS="-Wl,-z,max-page-size=4096"
> > file size just got worse.
>
> The default alignment to 2MB was dropped in the GNU toolchain in 2018 or
> so.
>
> -Andi