tools/perf/ui/gtk/annotate.c | 14 ++++- tools/perf/util/annotate.c | 114 ++++++++++++++++++++++------------- tools/perf/util/annotate.h | 86 +++++++++++++++++++++++--- 3 files changed, 158 insertions(+), 56 deletions(-)
Hello, This is another series of memory optimization in perf annotate. When perf annotate (or perf report/top with TUI) processes samples, it needs to save the sample period (overhead) at instruction level. For now, it allocates an array to do that for the whole symbol when it hits any new symbol. This comes with a lot of waste since samples can be very few and instructions span to multiple bytes. For example, when a sample hits symbol 'foo' that has size of 100 and that's the only sample falls into the symbol. Then it needs to allocate a symbol histogram (sym_hist) and the its size would be 16 (header) + 16 (sym_hist_entry) * 100 (symbol_size) = 1616 But actually it just needs 32 (header + sym_hist_entry) bytes. Things get worse if the symbol size is bigger (and it doesn't have many samples in different places). Also note that it needs separate histogram for each event. Let's split the sym_hist_entry and have it in a hash table so that it can allocate only necessary entries. No functional change intended. Thanks, Namhyung Namhyung Kim (4): perf annotate: Add a hashmap for symbol histogram perf annotate: Calculate instruction overhead using hashmap perf annotate: Remove sym_hist.addr[] array perf annotate: Add comments in the data structures tools/perf/ui/gtk/annotate.c | 14 ++++- tools/perf/util/annotate.c | 114 ++++++++++++++++++++++------------- tools/perf/util/annotate.h | 86 +++++++++++++++++++++++--- 3 files changed, 158 insertions(+), 56 deletions(-) -- 2.44.0.rc1.240.g4c46232300-goog
On Tue, Feb 27, 2024 at 04:52:26PM -0800, Namhyung Kim wrote:
> This is another series of memory optimization in perf annotate.
> When perf annotate (or perf report/top with TUI) processes samples, it
> needs to save the sample period (overhead) at instruction level. For
> now, it allocates an array to do that for the whole symbol when it
> hits any new symbol. This comes with a lot of waste since samples can
> be very few and instructions span to multiple bytes.
> For example, when a sample hits symbol 'foo' that has size of 100 and
> that's the only sample falls into the symbol. Then it needs to
> allocate a symbol histogram (sym_hist) and the its size would be
> 16 (header) + 16 (sym_hist_entry) * 100 (symbol_size) = 1616
> But actually it just needs 32 (header + sym_hist_entry) bytes. Things
> get worse if the symbol size is bigger (and it doesn't have many
> samples in different places). Also note that it needs separate
> histogram for each event.
> Let's split the sym_hist_entry and have it in a hash table so that it
> can allocate only necessary entries.
> No functional change intended.
I tried this before/after this series:
$ time perf annotate --stdio2 -i perf.data.annotate
For:
perf record -e '{cycles,instructions,cache-misses}' make -k CORESIGHT=1 O=/tmp/build/$(basename $PWD)/ -C tools/perf install-bin
And found these odd cases:
$ diff -u before after
--- before 2024-02-28 15:38:25.086062812 -0300
+++ after 2024-02-29 14:12:05.606652725 -0300
@@ -2450826,7 +2450826,7 @@
↓ je 1c62
→ call operator delete(void*)@plt
{ return _M_dataplus._M_p; }
- 1c62: mov 0x13c0(%rsp),%rdi
+ 0.00 0.00 100.00 1c62: mov 0x13c0(%rsp),%rdi
if (_M_data() == _M_local_data())
lea 0x13d0(%rsp),%rax
cmp %rax,%rdi
@@ -2470648,7 +2470648,7 @@
mov %rbx,%rdi
→ call operator delete(void*)@plt
using reference = T &;
- 0.00 0.00 100.00 11c65: mov 0x8(%r12),%rax
+ 11c65: mov 0x8(%r12),%rax
size_t size() const { return Size; }
mov 0x10(%r12),%ecx
mov %rax,%rbp
$
This is a large function:
Samples: 574K of events 'anon group { cpu_core/cycles/u, cpu_core/instructions/u, cpu_core/cache-misses/u }', 4000 Hz, Event count (approx.): 614695751751, [percent: local period]$
clang::CompilerInvocation::ParseCodeGenArgs(clang::CodeGenOptions&, llvm::opt::ArgList&, clang::InputKind, clang::DiagnosticsEngine&, llvm::Triple const&, std::__cxx11::basic_string<char, std
Percent
Probably when building the BPF skels in tools/perf/
- Arnaldo
Hi Arnaldo,
On Mon, Mar 4, 2024 at 5:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>
> On Tue, Feb 27, 2024 at 04:52:26PM -0800, Namhyung Kim wrote:
> > This is another series of memory optimization in perf annotate.
>
> > When perf annotate (or perf report/top with TUI) processes samples, it
> > needs to save the sample period (overhead) at instruction level. For
> > now, it allocates an array to do that for the whole symbol when it
> > hits any new symbol. This comes with a lot of waste since samples can
> > be very few and instructions span to multiple bytes.
>
> > For example, when a sample hits symbol 'foo' that has size of 100 and
> > that's the only sample falls into the symbol. Then it needs to
> > allocate a symbol histogram (sym_hist) and the its size would be
>
> > 16 (header) + 16 (sym_hist_entry) * 100 (symbol_size) = 1616
>
> > But actually it just needs 32 (header + sym_hist_entry) bytes. Things
> > get worse if the symbol size is bigger (and it doesn't have many
> > samples in different places). Also note that it needs separate
> > histogram for each event.
>
> > Let's split the sym_hist_entry and have it in a hash table so that it
> > can allocate only necessary entries.
>
> > No functional change intended.
>
> I tried this before/after this series:
>
> $ time perf annotate --stdio2 -i perf.data.annotate
>
> For:
>
> perf record -e '{cycles,instructions,cache-misses}' make -k CORESIGHT=1 O=/tmp/build/$(basename $PWD)/ -C tools/perf install-bin
>
> And found these odd cases:
>
> $ diff -u before after
> --- before 2024-02-28 15:38:25.086062812 -0300
> +++ after 2024-02-29 14:12:05.606652725 -0300
> @@ -2450826,7 +2450826,7 @@
> ↓ je 1c62
> → call operator delete(void*)@plt
> { return _M_dataplus._M_p; }
> - 1c62: mov 0x13c0(%rsp),%rdi
> + 0.00 0.00 100.00 1c62: mov 0x13c0(%rsp),%rdi
> if (_M_data() == _M_local_data())
> lea 0x13d0(%rsp),%rax
> cmp %rax,%rdi
> @@ -2470648,7 +2470648,7 @@
> mov %rbx,%rdi
> → call operator delete(void*)@plt
> using reference = T &;
> - 0.00 0.00 100.00 11c65: mov 0x8(%r12),%rax
> + 11c65: mov 0x8(%r12),%rax
> size_t size() const { return Size; }
> mov 0x10(%r12),%ecx
> mov %rax,%rbp
> $
>
>
> This is a large function:
Thanks for the test! I think it missed the cast to 64-bit somewhere.
I'll check and send v2 soon.
Thanks,
Namhyung
>
> Samples: 574K of events 'anon group { cpu_core/cycles/u, cpu_core/instructions/u, cpu_core/cache-misses/u }', 4000 Hz, Event count (approx.): 614695751751, [percent: local period]$
> clang::CompilerInvocation::ParseCodeGenArgs(clang::CodeGenOptions&, llvm::opt::ArgList&, clang::InputKind, clang::DiagnosticsEngine&, llvm::Triple const&, std::__cxx11::basic_string<char, std
> Percent
>
> Probably when building the BPF skels in tools/perf/
>
> - Arnaldo
On Mon, Mar 4, 2024 at 8:36 AM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hi Arnaldo,
>
> On Mon, Mar 4, 2024 at 5:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> >
> > On Tue, Feb 27, 2024 at 04:52:26PM -0800, Namhyung Kim wrote:
> > > This is another series of memory optimization in perf annotate.
> >
> > > When perf annotate (or perf report/top with TUI) processes samples, it
> > > needs to save the sample period (overhead) at instruction level. For
> > > now, it allocates an array to do that for the whole symbol when it
> > > hits any new symbol. This comes with a lot of waste since samples can
> > > be very few and instructions span to multiple bytes.
> >
> > > For example, when a sample hits symbol 'foo' that has size of 100 and
> > > that's the only sample falls into the symbol. Then it needs to
> > > allocate a symbol histogram (sym_hist) and the its size would be
> >
> > > 16 (header) + 16 (sym_hist_entry) * 100 (symbol_size) = 1616
> >
> > > But actually it just needs 32 (header + sym_hist_entry) bytes. Things
> > > get worse if the symbol size is bigger (and it doesn't have many
> > > samples in different places). Also note that it needs separate
> > > histogram for each event.
> >
> > > Let's split the sym_hist_entry and have it in a hash table so that it
> > > can allocate only necessary entries.
> >
> > > No functional change intended.
> >
> > I tried this before/after this series:
> >
> > $ time perf annotate --stdio2 -i perf.data.annotate
> >
> > For:
> >
> > perf record -e '{cycles,instructions,cache-misses}' make -k CORESIGHT=1 O=/tmp/build/$(basename $PWD)/ -C tools/perf install-bin
> >
> > And found these odd cases:
> >
> > $ diff -u before after
> > --- before 2024-02-28 15:38:25.086062812 -0300
> > +++ after 2024-02-29 14:12:05.606652725 -0300
> > @@ -2450826,7 +2450826,7 @@
> > ↓ je 1c62
> > → call operator delete(void*)@plt
> > { return _M_dataplus._M_p; }
> > - 1c62: mov 0x13c0(%rsp),%rdi
> > + 0.00 0.00 100.00 1c62: mov 0x13c0(%rsp),%rdi
> > if (_M_data() == _M_local_data())
> > lea 0x13d0(%rsp),%rax
> > cmp %rax,%rdi
> > @@ -2470648,7 +2470648,7 @@
> > mov %rbx,%rdi
> > → call operator delete(void*)@plt
> > using reference = T &;
> > - 0.00 0.00 100.00 11c65: mov 0x8(%r12),%rax
> > + 11c65: mov 0x8(%r12),%rax
> > size_t size() const { return Size; }
> > mov 0x10(%r12),%ecx
> > mov %rax,%rbp
> > $
> >
> >
> > This is a large function:
>
> Thanks for the test! I think it missed the cast to 64-bit somewhere.
> I'll check and send v2 soon.
Yep, the offset variable in __symbol__inc_addr_samples()
should be u64.
Thanks,
Namhyung
On Tue, Feb 27, 2024 at 4:52 PM Namhyung Kim <namhyung@kernel.org> wrote: > > Hello, > > This is another series of memory optimization in perf annotate. > > When perf annotate (or perf report/top with TUI) processes samples, it > needs to save the sample period (overhead) at instruction level. For > now, it allocates an array to do that for the whole symbol when it > hits any new symbol. This comes with a lot of waste since samples can > be very few and instructions span to multiple bytes. > > For example, when a sample hits symbol 'foo' that has size of 100 and > that's the only sample falls into the symbol. Then it needs to > allocate a symbol histogram (sym_hist) and the its size would be > > 16 (header) + 16 (sym_hist_entry) * 100 (symbol_size) = 1616 > > But actually it just needs 32 (header + sym_hist_entry) bytes. Things > get worse if the symbol size is bigger (and it doesn't have many > samples in different places). Also note that it needs separate > histogram for each event. > > Let's split the sym_hist_entry and have it in a hash table so that it > can allocate only necessary entries. > > No functional change intended. > > Thanks, > Namhyung > > > Namhyung Kim (4): > perf annotate: Add a hashmap for symbol histogram > perf annotate: Calculate instruction overhead using hashmap > perf annotate: Remove sym_hist.addr[] array > perf annotate: Add comments in the data structures Reviewed-by: Ian Rogers <irogers@google.com> Thanks, Ian > tools/perf/ui/gtk/annotate.c | 14 ++++- > tools/perf/util/annotate.c | 114 ++++++++++++++++++++++------------- > tools/perf/util/annotate.h | 86 +++++++++++++++++++++++--- > 3 files changed, 158 insertions(+), 56 deletions(-) > > -- > 2.44.0.rc1.240.g4c46232300-goog >
© 2016 - 2026 Red Hat, Inc.