[v1] perf annotate: Improve memory usage for symbol histogram

[PATCH 0/4] perf annotate: Improve memory usage for symbol histogram

Posted by Namhyung Kim 1 year, 11 months ago

Hello,

This is another series of memory optimization in perf annotate.

When perf annotate (or perf report/top with TUI) processes samples, it
needs to save the sample period (overhead) at instruction level.  For
now, it allocates an array to do that for the whole symbol when it
hits any new symbol.  This comes with a lot of waste since samples can
be very few and instructions span to multiple bytes.

For example, when a sample hits symbol 'foo' that has size of 100 and
that's the only sample falls into the symbol.  Then it needs to
allocate a symbol histogram (sym_hist) and the its size would be

  16 (header) + 16 (sym_hist_entry) * 100 (symbol_size) = 1616

But actually it just needs 32 (header + sym_hist_entry) bytes.  Things
get worse if the symbol size is bigger (and it doesn't have many
samples in different places).  Also note that it needs separate
histogram for each event.

Let's split the sym_hist_entry and have it in a hash table so that it
can allocate only necessary entries.

No functional change intended.

Thanks,
Namhyung


Namhyung Kim (4):
  perf annotate: Add a hashmap for symbol histogram
  perf annotate: Calculate instruction overhead using hashmap
  perf annotate: Remove sym_hist.addr[] array
  perf annotate: Add comments in the data structures

 tools/perf/ui/gtk/annotate.c |  14 ++++-
 tools/perf/util/annotate.c   | 114 ++++++++++++++++++++++-------------
 tools/perf/util/annotate.h   |  86 +++++++++++++++++++++++---
 3 files changed, 158 insertions(+), 56 deletions(-)

-- 
2.44.0.rc1.240.g4c46232300-goog

Re: [PATCH 0/4] perf annotate: Improve memory usage for symbol histogram

Posted by Arnaldo Carvalho de Melo 1 year, 11 months ago

On Tue, Feb 27, 2024 at 04:52:26PM -0800, Namhyung Kim wrote:
> This is another series of memory optimization in perf annotate.
 
> When perf annotate (or perf report/top with TUI) processes samples, it
> needs to save the sample period (overhead) at instruction level.  For
> now, it allocates an array to do that for the whole symbol when it
> hits any new symbol.  This comes with a lot of waste since samples can
> be very few and instructions span to multiple bytes.
 
> For example, when a sample hits symbol 'foo' that has size of 100 and
> that's the only sample falls into the symbol.  Then it needs to
> allocate a symbol histogram (sym_hist) and the its size would be
 
>   16 (header) + 16 (sym_hist_entry) * 100 (symbol_size) = 1616
 
> But actually it just needs 32 (header + sym_hist_entry) bytes.  Things
> get worse if the symbol size is bigger (and it doesn't have many
> samples in different places).  Also note that it needs separate
> histogram for each event.
 
> Let's split the sym_hist_entry and have it in a hash table so that it
> can allocate only necessary entries.
 
> No functional change intended.

I tried this before/after this series:

  $ time perf annotate --stdio2 -i perf.data.annotate

For:

  perf record -e '{cycles,instructions,cache-misses}' make -k CORESIGHT=1 O=/tmp/build/$(basename $PWD)/ -C tools/perf install-bin

And found these odd cases:

  $ diff -u before after
  --- before	2024-02-28 15:38:25.086062812 -0300
  +++ after	2024-02-29 14:12:05.606652725 -0300
  @@ -2450826,7 +2450826,7 @@
                                ↓ je       1c62                  
                                → call     operator delete(void*)@plt
                                { return _M_dataplus._M_p; }
  -                       1c62:   mov      0x13c0(%rsp),%rdi     
  +  0.00   0.00 100.00   1c62:   mov      0x13c0(%rsp),%rdi     
                                if (_M_data() == _M_local_data())
                                  lea      0x13d0(%rsp),%rax     
                                  cmp      %rax,%rdi             
  @@ -2470648,7 +2470648,7 @@
                                  mov      %rbx,%rdi             
                                → call     operator delete(void*)@plt
                                using reference = T &;     
  -  0.00   0.00 100.00  11c65:   mov      0x8(%r12),%rax        
  +                      11c65:   mov      0x8(%r12),%rax        
                                size_t size() const { return Size; }
                                  mov      0x10(%r12),%ecx       
                                  mov      %rax,%rbp             
  $


This is a large function:

Samples: 574K of events 'anon group { cpu_core/cycles/u, cpu_core/instructions/u, cpu_core/cache-misses/u }', 4000 Hz, Event count (approx.): 614695751751, [percent: local period]$
clang::CompilerInvocation::ParseCodeGenArgs(clang::CodeGenOptions&, llvm::opt::ArgList&, clang::InputKind, clang::DiagnosticsEngine&, llvm::Triple const&, std::__cxx11::basic_string<char, std
Percent 

Probably when building the BPF skels in tools/perf/

- Arnaldo

Re: [PATCH 0/4] perf annotate: Improve memory usage for symbol histogram

Posted by Namhyung Kim 1 year, 11 months ago

Hi Arnaldo,

On Mon, Mar 4, 2024 at 5:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>
> On Tue, Feb 27, 2024 at 04:52:26PM -0800, Namhyung Kim wrote:
> > This is another series of memory optimization in perf annotate.
>
> > When perf annotate (or perf report/top with TUI) processes samples, it
> > needs to save the sample period (overhead) at instruction level.  For
> > now, it allocates an array to do that for the whole symbol when it
> > hits any new symbol.  This comes with a lot of waste since samples can
> > be very few and instructions span to multiple bytes.
>
> > For example, when a sample hits symbol 'foo' that has size of 100 and
> > that's the only sample falls into the symbol.  Then it needs to
> > allocate a symbol histogram (sym_hist) and the its size would be
>
> >   16 (header) + 16 (sym_hist_entry) * 100 (symbol_size) = 1616
>
> > But actually it just needs 32 (header + sym_hist_entry) bytes.  Things
> > get worse if the symbol size is bigger (and it doesn't have many
> > samples in different places).  Also note that it needs separate
> > histogram for each event.
>
> > Let's split the sym_hist_entry and have it in a hash table so that it
> > can allocate only necessary entries.
>
> > No functional change intended.
>
> I tried this before/after this series:
>
>   $ time perf annotate --stdio2 -i perf.data.annotate
>
> For:
>
>   perf record -e '{cycles,instructions,cache-misses}' make -k CORESIGHT=1 O=/tmp/build/$(basename $PWD)/ -C tools/perf install-bin
>
> And found these odd cases:
>
>   $ diff -u before after
>   --- before    2024-02-28 15:38:25.086062812 -0300
>   +++ after     2024-02-29 14:12:05.606652725 -0300
>   @@ -2450826,7 +2450826,7 @@
>                                 ↓ je       1c62
>                                 → call     operator delete(void*)@plt
>                                 { return _M_dataplus._M_p; }
>   -                       1c62:   mov      0x13c0(%rsp),%rdi
>   +  0.00   0.00 100.00   1c62:   mov      0x13c0(%rsp),%rdi
>                                 if (_M_data() == _M_local_data())
>                                   lea      0x13d0(%rsp),%rax
>                                   cmp      %rax,%rdi
>   @@ -2470648,7 +2470648,7 @@
>                                   mov      %rbx,%rdi
>                                 → call     operator delete(void*)@plt
>                                 using reference = T &;
>   -  0.00   0.00 100.00  11c65:   mov      0x8(%r12),%rax
>   +                      11c65:   mov      0x8(%r12),%rax
>                                 size_t size() const { return Size; }
>                                   mov      0x10(%r12),%ecx
>                                   mov      %rax,%rbp
>   $
>
>
> This is a large function:

Thanks for the test!  I think it missed the cast to 64-bit somewhere.
I'll check and send v2 soon.

Thanks,
Namhyung


>
> Samples: 574K of events 'anon group { cpu_core/cycles/u, cpu_core/instructions/u, cpu_core/cache-misses/u }', 4000 Hz, Event count (approx.): 614695751751, [percent: local period]$
> clang::CompilerInvocation::ParseCodeGenArgs(clang::CodeGenOptions&, llvm::opt::ArgList&, clang::InputKind, clang::DiagnosticsEngine&, llvm::Triple const&, std::__cxx11::basic_string<char, std
> Percent
>
> Probably when building the BPF skels in tools/perf/
>
> - Arnaldo

Re: [PATCH 0/4] perf annotate: Improve memory usage for symbol histogram

Posted by Namhyung Kim 1 year, 11 months ago

On Mon, Mar 4, 2024 at 8:36 AM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hi Arnaldo,
>
> On Mon, Mar 4, 2024 at 5:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> >
> > On Tue, Feb 27, 2024 at 04:52:26PM -0800, Namhyung Kim wrote:
> > > This is another series of memory optimization in perf annotate.
> >
> > > When perf annotate (or perf report/top with TUI) processes samples, it
> > > needs to save the sample period (overhead) at instruction level.  For
> > > now, it allocates an array to do that for the whole symbol when it
> > > hits any new symbol.  This comes with a lot of waste since samples can
> > > be very few and instructions span to multiple bytes.
> >
> > > For example, when a sample hits symbol 'foo' that has size of 100 and
> > > that's the only sample falls into the symbol.  Then it needs to
> > > allocate a symbol histogram (sym_hist) and the its size would be
> >
> > >   16 (header) + 16 (sym_hist_entry) * 100 (symbol_size) = 1616
> >
> > > But actually it just needs 32 (header + sym_hist_entry) bytes.  Things
> > > get worse if the symbol size is bigger (and it doesn't have many
> > > samples in different places).  Also note that it needs separate
> > > histogram for each event.
> >
> > > Let's split the sym_hist_entry and have it in a hash table so that it
> > > can allocate only necessary entries.
> >
> > > No functional change intended.
> >
> > I tried this before/after this series:
> >
> >   $ time perf annotate --stdio2 -i perf.data.annotate
> >
> > For:
> >
> >   perf record -e '{cycles,instructions,cache-misses}' make -k CORESIGHT=1 O=/tmp/build/$(basename $PWD)/ -C tools/perf install-bin
> >
> > And found these odd cases:
> >
> >   $ diff -u before after
> >   --- before    2024-02-28 15:38:25.086062812 -0300
> >   +++ after     2024-02-29 14:12:05.606652725 -0300
> >   @@ -2450826,7 +2450826,7 @@
> >                                 ↓ je       1c62
> >                                 → call     operator delete(void*)@plt
> >                                 { return _M_dataplus._M_p; }
> >   -                       1c62:   mov      0x13c0(%rsp),%rdi
> >   +  0.00   0.00 100.00   1c62:   mov      0x13c0(%rsp),%rdi
> >                                 if (_M_data() == _M_local_data())
> >                                   lea      0x13d0(%rsp),%rax
> >                                   cmp      %rax,%rdi
> >   @@ -2470648,7 +2470648,7 @@
> >                                   mov      %rbx,%rdi
> >                                 → call     operator delete(void*)@plt
> >                                 using reference = T &;
> >   -  0.00   0.00 100.00  11c65:   mov      0x8(%r12),%rax
> >   +                      11c65:   mov      0x8(%r12),%rax
> >                                 size_t size() const { return Size; }
> >                                   mov      0x10(%r12),%ecx
> >                                   mov      %rax,%rbp
> >   $
> >
> >
> > This is a large function:
>
> Thanks for the test!  I think it missed the cast to 64-bit somewhere.
> I'll check and send v2 soon.

Yep, the offset variable in __symbol__inc_addr_samples()
should be u64.

Thanks,
Namhyung

Re: [PATCH 0/4] perf annotate: Improve memory usage for symbol histogram

Posted by Ian Rogers 1 year, 11 months ago

On Tue, Feb 27, 2024 at 4:52 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello,
>
> This is another series of memory optimization in perf annotate.
>
> When perf annotate (or perf report/top with TUI) processes samples, it
> needs to save the sample period (overhead) at instruction level.  For
> now, it allocates an array to do that for the whole symbol when it
> hits any new symbol.  This comes with a lot of waste since samples can
> be very few and instructions span to multiple bytes.
>
> For example, when a sample hits symbol 'foo' that has size of 100 and
> that's the only sample falls into the symbol.  Then it needs to
> allocate a symbol histogram (sym_hist) and the its size would be
>
>   16 (header) + 16 (sym_hist_entry) * 100 (symbol_size) = 1616
>
> But actually it just needs 32 (header + sym_hist_entry) bytes.  Things
> get worse if the symbol size is bigger (and it doesn't have many
> samples in different places).  Also note that it needs separate
> histogram for each event.
>
> Let's split the sym_hist_entry and have it in a hash table so that it
> can allocate only necessary entries.
>
> No functional change intended.
>
> Thanks,
> Namhyung
>
>
> Namhyung Kim (4):
>   perf annotate: Add a hashmap for symbol histogram
>   perf annotate: Calculate instruction overhead using hashmap
>   perf annotate: Remove sym_hist.addr[] array
>   perf annotate: Add comments in the data structures

Reviewed-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

>  tools/perf/ui/gtk/annotate.c |  14 ++++-
>  tools/perf/util/annotate.c   | 114 ++++++++++++++++++++++-------------
>  tools/perf/util/annotate.h   |  86 +++++++++++++++++++++++---
>  3 files changed, 158 insertions(+), 56 deletions(-)
>
> --
> 2.44.0.rc1.240.g4c46232300-goog
>