Arm64 Neoverse CPUs support the data source field in Arm SPE trace; this
allows us to detect cache line contention and transfers.
This patch set is based on Ali Saidi's patch set v9 "perf: arm-spe: Decode SPE
source and use for perf c2c" [1]; Ali's patch set doesn't need any change in
this new round.
To clearly show peer loads, and to distinguish local peer loads from remote
peer loads, this patch set introduces three new metrics: 'lcl_peer',
'rmt_peer' and 'tot_peer'. The 'peer' display mode uses the 'tot_peer'
metric for sorting cache lines.
Patches 01-05 add statistics for memory samples and add dimensions for the
peer metrics.
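As a rough, self-contained illustration of how the three counters relate
(the struct and helper below are illustrative only, not the code in these
patches):

/* Illustrative only: lcl_peer counts loads satisfied from a peer cache
 * on the local node, rmt_peer those from a peer cache on a remote node,
 * and tot_peer is their sum, which the 'peer' display sorts on.
 */
#include <stdio.h>

struct peer_stats {
        unsigned int lcl_peer;
        unsigned int rmt_peer;
        unsigned int tot_peer;
};

static void account_peer(struct peer_stats *s, int remote)
{
        if (remote)
                s->rmt_peer++;
        else
                s->lcl_peer++;
        s->tot_peer++;
}

int main(void)
{
        struct peer_stats s = { 0 };

        account_peer(&s, 0);    /* a local peer load */
        account_peer(&s, 1);    /* a remote peer load */
        printf("lcl=%u rmt=%u tot=%u\n", s.lcl_peer, s.rmt_peer, s.tot_peer);
        return 0;
}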
Patches 06-09 are refactoring: they rename things with more general terms,
which makes it easier to extend the display modes without being strictly
bound to HITM tags.
Patches 10-11 add the extended 'peer' display mode and change the default
display mode to 'peer' on Arm64.
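(With the series applied, the peer display can also be requested explicitly,
e.g. 'perf c2c report -d peer --stdio'; the existing 'tot', 'rmt' and 'lcl'
displays keep their HITM-based behaviour.)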
Patch 12 updates the documentation to describe the new dimensions for the
peer metrics.
This patch set has been verified for both x86 and Arm64 memory samples.
Known issue: Joe pointed out that in patch set v3 the cache line metric
shows 'N/A' for the node. This is because Arm SPE trace data doesn't contain
the physical address, so the perf c2c tool fails to find a matching node
range when the physical address is zero. This issue is addressed in a
separate patch [2]. Since I am still using the old perf data file (I have
no Neoverse platform), the output below still shows 'N/A' in the Node field.
Another thing: we need to enhance the data source setting for older Arm
platforms. As discussed, German will follow up on this task later.
The latest patch set has been uploaded to the git server [3].
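For reference, output like the results below can be produced with a workflow
roughly like the following (illustrative only; the exact events and options
depend on the platform, and the binary names are just the test programs used
for the outputs below):

    # x86: sample loads/stores with perf c2c, then report
    perf c2c record -- ./false_sharing.exe
    perf c2c report --stdio

    # Arm64: sample with Arm SPE, then report
    perf record -e arm_spe_0// -- ./memstress
    perf c2c report --stdio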
The display result with x86 memory samples:
=================================================
Shared Data Cache Line Table
=================================================
#
# ----------- Cacheline ---------- Tot ------- Load Hitm ------- Total Total Total --------- Stores -------- ----- Core Load Hit ----- - LLC Load Hit -- - RMT Load Hit -- --- Load Dram ----
# Index Address Node PA cnt Hitm Total LclHitm RmtHitm records Loads Stores L1Hit L1Miss N/A FB L1 L2 LclHit LclHitm RmtHit RmtHitm Lcl Rmt
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ........ ....... ........ ........
#
0 0x55c8971f0080 0 1967 66.14% 252 252 0 6044 3550 2494 2024 470 0 528 2672 78 20 252 0 0 0 0
1 0x55c8971f00c0 0 1 33.86% 129 129 0 914 914 0 0 0 0 272 374 52 87 129 0 0 0 0
=================================================
Shared Cache Line Distribution Pareto
=================================================
#
# ----- HITM ----- ------- Store Refs ------ --------- Data address --------- ---------- cycles ---------- Total cpu Shared
# Num RmtHitm LclHitm L1 Hit L1 Miss N/A Offset Node PA cnt Code address rmt hitm lcl hitm load records cnt Symbol Object Source:Line Node
# ..... ....... ....... ....... ....... ....... .................. .... ...... .................. ........ ........ ........ ....... ........ ...................... ................. ....................... ....
#
----------------------------------------------------------------------
0 0 252 2024 470 0 0x55c8971f0080
----------------------------------------------------------------------
0.00% 12.30% 0.00% 0.00% 0.00% 0x0 0 1 0x55c8971ed3e9 0 1313 863 1222 3 [.] 0x00000000000013e9 false_sharing.exe false_sharing.exe[13e9] 0
0.00% 0.79% 90.51% 0.00% 0.00% 0x0 0 1 0x55c8971ed3e2 0 1800 878 3029 3 [.] 0x00000000000013e2 false_sharing.exe false_sharing.exe[13e2] 0
0.00% 0.00% 9.49% 100.00% 0.00% 0x0 0 1 0x55c8971ed3f4 0 0 0 662 3 [.] 0x00000000000013f4 false_sharing.exe false_sharing.exe[13f4] 0
0.00% 86.90% 0.00% 0.00% 0.00% 0x20 0 1 0x55c8971ed447 0 141 103 1131 2 [.] 0x0000000000001447 false_sharing.exe false_sharing.exe[1447] 0
----------------------------------------------------------------------
1 0 129 0 0 0 0x55c8971f00c0
----------------------------------------------------------------------
0.00% 100.00% 0.00% 0.00% 0.00% 0x20 0 1 0x55c8971ed455 0 88 94 914 2 [.] 0x0000000000001455 false_sharing.exe false_sharing.exe[1455] 0
The display result with Arm SPE:
=================================================
Shared Data Cache Line Table
=================================================
#
# ----------- Cacheline ---------- Peer ------- Load Peer ------- Total Total Total --------- Stores -------- ----- Core Load Hit ----- - LLC Load Hit -- - RMT Load Hit -- --- Load Dram ----
# Index Address Node PA cnt Snoop Total Local Remote records Loads Stores L1Hit L1Miss N/A FB L1 L2 LclHit LclHitm RmtHit RmtHitm Lcl Rmt
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ........ ....... ........ ........
#
0 0xaaaac17d6000 N/A 0 100.00% 99 99 0 18851 18851 0 0 0 0 0 18752 0 99 0 0 0 0 0
=================================================
Shared Cache Line Distribution Pareto
=================================================
#
# -- Peer Snoop -- ------- Store Refs ------ --------- Data address --------- ---------- cycles ---------- Total cpu Shared
# Num Rmt Lcl L1 Hit L1 Miss N/A Offset Node PA cnt Code address rmt peer lcl peer load records cnt Symbol Object Source:Line Node{cpus %peers %stores}
# ..... ....... ....... ....... ....... ....... .................. .... ...... .................. ........ ........ ........ ....... ........ ...................... ................ ............... ....
#
----------------------------------------------------------------------
0 0 99 0 0 0 0xaaaac17d6000
----------------------------------------------------------------------
0.00% 6.06% 0.00% 0.00% 0.00% 0x20 N/A 0 0xaaaac17c25ac 0 375 43 18469 2 [.] 0x00000000000025ac memstress memstress[25ac] 0{ 2 100.0% n/a}
0.00% 93.94% 0.00% 0.00% 0.00% 0x29 N/A 0 0xaaaac17c3e88 0 180 173 135 2 [.] 0x0000000000003e88 memstress memstress[3e88] 0{ 2 100.0% n/a}
Changes from v3:
* Changed to display remote and local peer accesses (Joe);
* Fixed the usage info for display types (Joe);
* Do not display HITM dimensions when using the 'peer' display, and the
  HITM displays don't show any 'peer' dimensions (James);
* Split into smaller patches for adding the dimensions of peer operations;
* Updated documentation to reflect the latest GUI and stdio.
Changes from v2:
* Updated patch 04 to account the metrics for both the cache level and
  ld_peer for the PEER flag;
* Updated the documentation for the metric 'rmt_hit', which is accounted
  for all remote accesses (including remote DRAM and any upward caches).
Changes from v1:
* Updated patches 01, 02 and 03 to support 'N/A' metrics for store
  operations, so they align with the patch set [1] for store samples.
[1] https://lore.kernel.org/lkml/20220517020326.18580-1-alisaidi@amazon.com/
[2] https://lore.kernel.org/lkml/20220530083645.253432-1-leo.yan@linaro.org/
[3] https://git.linaro.org/people/leo.yan/linux-spe.git/ branch: perf_c2c_arm_spe_peer_v4
Leo Yan (12):
perf mem: Add statistics for peer snooping
perf c2c: Output statistics for peer snooping
perf c2c: Add dimensions for peer load operations
perf c2c: Add dimensions of peer metrics for cache line view
perf c2c: Add mean dimensions for peer operations
perf c2c: Use explicit names for display macros
perf c2c: Rename dimension from 'percent_hitm' to
'percent_costly_snoop'
perf c2c: Refactor node header
perf c2c: Refactor display string
perf c2c: Sort on peer snooping for load operations
perf c2c: Use 'peer' as default display for Arm64
perf c2c: Update documentation for new display option 'peer'
tools/perf/Documentation/perf-c2c.txt | 30 +-
tools/perf/builtin-c2c.c | 454 ++++++++++++++++++++------
tools/perf/util/mem-events.c | 28 +-
tools/perf/util/mem-events.h | 3 +
4 files changed, 403 insertions(+), 112 deletions(-)
--
2.25.1
On 5/30/22 7:40 AM, Leo Yan wrote:
> Arm64 Neoverse CPUs support the data source field in Arm SPE trace; this
> allows us to detect cache line contention and transfers.
>
> This patch set is based on Ali Saidi's patch set v9 "perf: arm-spe: Decode SPE
> source and use for perf c2c" [1]; Ali's patch set doesn't need any change in
> this new round.
>
> To clearly show peer loads, and to distinguish local peer loads from remote
> peer loads, this patch set introduces three new metrics: 'lcl_peer',
> 'rmt_peer' and 'tot_peer'. The 'peer' display mode uses the 'tot_peer'
> metric for sorting cache lines.
>
> Patches 01-05 add statistics for memory samples and add dimensions for the
> peer metrics.
>
> Patches 06-09 are refactoring: they rename things with more general terms,
> which makes it easier to extend the display modes without being strictly
> bound to HITM tags.
>
> Patches 10-11 add the extended 'peer' display mode and change the default
> display mode to 'peer' on Arm64.
>
> Patch 12 updates the documentation to describe the new dimensions for the
> peer metrics.
>
> This patch set has been verified for both x86 and Arm64 memory samples.
>
> Known issue: Joe pointed out that in patch set v3 the cache line metric
> shows 'N/A' for the node. This is because Arm SPE trace data doesn't contain
> the physical address, so the perf c2c tool fails to find a matching node
> range when the physical address is zero. This issue is addressed in a
> separate patch [2]. Since I am still using the old perf data file (I have
> no Neoverse platform), the output below still shows 'N/A' in the Node field.
>
Hi Leo:
I built a new perf with your patches and ran it on a 2-numa node Neoverse platform.
I then ran my simple test that creates reader and writer threads to tug on the same cacheline.
The c2c output is appended below.
The output looks good, especially where you've broken out the (average) cycles for local and remote peer loads.
And I'm glad to see you fixed the "Node" column. I use that a lot to help detect remote node accesses.
And the "PA cnt" field is working as well, which is important to see if numa_balance is moving the data around.
=================================================
Shared Data Cache Line Table
=================================================
#
# ----------- Cacheline ---------- Peer ------- Load Peer ------- Total Total Total --------- Stores -------- ----- Core Load Hit ----- - LLC Load Hit -- - RMT Load Hit -- --- Load Dram ----
# Index Address Node PA cnt Snoop Total Local Remote records Loads Stores L1Hit L1Miss N/A FB L1 L2 LclHit LclHitm RmtHit RmtHitm Lcl Rmt
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ........ ....... ........ ........
#
0 0x422140 0 6904 74.86% 137 131 6 148008 144970 3038 0 0 3038 0 144833 120 11 0 6 0 0 0
1 0xffffd976e63ae5c0 1 6 3.83% 7 7 0 15 15 0 0 0 0 0 8 4 3 0 0 0 0 0
2 0xffff07ffbf290980 0 5 2.19% 4 2 2 14 14 0 0 0 0 0 10 1 1 0 2 0 0 0
3 0xffffd976e57275c0 1 1 0.55% 1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0
4 0xffffd976e6071c00 1 3 0.55% 1 0 1 4 4 0 0 0 0 0 3 0 0 0 1 0 0 0
[snip]
=================================================
Shared Cache Line Distribution Pareto
=================================================
#
# -- Peer Snoop -- ------- Store Refs ------ --------- Data address --------- ---------- cycles ---------- Total cpu Shared
# Num Rmt Lcl L1 Hit L1 Miss N/A Offset Node PA cnt Code address rmt peer lcl peer load records cnt Symbol Object Source:Line Node
# ..... ....... ....... ....... ....... ....... .................. .... ...... .................. ........ ........ ........ ....... ........ .......................... ....... ......................... ....
#
----------------------------------------------------------------------
0 6 131 0 0 3038 0x422140
----------------------------------------------------------------------
0.00% 0.00% 0.00% 0.00% 52.60% 0x8 0 1 0x400e6c 0 0 0 1598 4 [.] writer tugtest tugtest.c:152 0 1
0.00% 0.00% 0.00% 0.00% 47.40% 0x10 0 1 0x400e7c 0 0 0 1440 4 [.] writer tugtest tugtest.c:153 0 1
33.33% 75.57% 0.00% 0.00% 0.00% 0x20 0 1 0x401018 4095 3803 3419 409 4 [.] reader tugtest tugtest.c:187 0 1
66.67% 24.43% 0.00% 0.00% 0.00% 0x28 0 1 0x401034 4095 3470 3643 413 4 [.] reader tugtest tugtest.c:187 0 1
----------------------------------------------------------------------
1 0 7 0 0 0 0xffffd976e63ae5c0
----------------------------------------------------------------------
0.00% 57.14% 0.00% 0.00% 0.00% 0x0 1 1 0xffffd976e4815fbc 0 1333 0 4 2 [k] ktime_get [kernel.kallsyms] seqlock.h:276 1
0.00% 14.29% 0.00% 0.00% 0.00% 0x0 1 1 0xffffd976e4816d10 0 266 794 4 3 [k] ktime_get_update_offsets_n [kernel.kallsyms] seqlock.h:276 0 1
0.00% 28.57% 0.00% 0.00% 0.00% 0x30 1 1 0xffffd976e4816d20 0 87 150 4 3 [k] ktime_get_update_offsets_n [kernel.kallsyms] timekeeping.c:2298 0 1
----------------------------------------------------------------------
2 2 2 0 0 0 0xffff07ffbf290980
----------------------------------------------------------------------
50.00% 100.00% 0.00% 0.00% 0.00% 0x4 0 1 0xffffd976e47d2bdc 1217 1600 1147 4 3 [k] queued_spin_lock_slowpath [kernel.kallsyms] qspinlock.c:511 0 1
50.00% 0.00% 0.00% 0.00% 0.00% 0x4 0 1 0xffffd976e47d2a2c 4033 0 0 1 1 [k] queued_spin_lock_slowpath [kernel.kallsyms] qspinlock.c:382 0 1
----------------------------------------------------------------------
Thanks for doing this. It looks good.
I'll assume someone else is reviewing your code changes.
Joe
> Another thing: we need to enhance the data source setting for older Arm
> platforms. As discussed, German will follow up on this task later.
>
> The latest patch set has been uploaded to the git server [3].
>
> The display result with x86 memory samples:
>
> =================================================
> Shared Data Cache Line Table
> =================================================
> #
> # ----------- Cacheline ---------- Tot ------- Load Hitm ------- Total Total Total --------- Stores -------- ----- Core Load Hit ----- - LLC Load Hit -- - RMT Load Hit -- --- Load Dram ----
> # Index Address Node PA cnt Hitm Total LclHitm RmtHitm records Loads Stores L1Hit L1Miss N/A FB L1 L2 LclHit LclHitm RmtHit RmtHitm Lcl Rmt
> # ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ........ ....... ........ ........
> #
> 0 0x55c8971f0080 0 1967 66.14% 252 252 0 6044 3550 2494 2024 470 0 528 2672 78 20 252 0 0 0 0
> 1 0x55c8971f00c0 0 1 33.86% 129 129 0 914 914 0 0 0 0 272 374 52 87 129 0 0 0 0
>
> =================================================
> Shared Cache Line Distribution Pareto
> =================================================
> #
> # ----- HITM ----- ------- Store Refs ------ --------- Data address --------- ---------- cycles ---------- Total cpu Shared
> # Num RmtHitm LclHitm L1 Hit L1 Miss N/A Offset Node PA cnt Code address rmt hitm lcl hitm load records cnt Symbol Object Source:Line Node
> # ..... ....... ....... ....... ....... ....... .................. .... ...... .................. ........ ........ ........ ....... ........ ...................... ................. ....................... ....
> #
> ----------------------------------------------------------------------
> 0 0 252 2024 470 0 0x55c8971f0080
> ----------------------------------------------------------------------
> 0.00% 12.30% 0.00% 0.00% 0.00% 0x0 0 1 0x55c8971ed3e9 0 1313 863 1222 3 [.] 0x00000000000013e9 false_sharing.exe false_sharing.exe[13e9] 0
> 0.00% 0.79% 90.51% 0.00% 0.00% 0x0 0 1 0x55c8971ed3e2 0 1800 878 3029 3 [.] 0x00000000000013e2 false_sharing.exe false_sharing.exe[13e2] 0
> 0.00% 0.00% 9.49% 100.00% 0.00% 0x0 0 1 0x55c8971ed3f4 0 0 0 662 3 [.] 0x00000000000013f4 false_sharing.exe false_sharing.exe[13f4] 0
> 0.00% 86.90% 0.00% 0.00% 0.00% 0x20 0 1 0x55c8971ed447 0 141 103 1131 2 [.] 0x0000000000001447 false_sharing.exe false_sharing.exe[1447] 0
>
> ----------------------------------------------------------------------
> 1 0 129 0 0 0 0x55c8971f00c0
> ----------------------------------------------------------------------
> 0.00% 100.00% 0.00% 0.00% 0.00% 0x20 0 1 0x55c8971ed455 0 88 94 914 2 [.] 0x0000000000001455 false_sharing.exe false_sharing.exe[1455] 0
>
>
> The display result with Arm SPE:
>
> =================================================
> Shared Data Cache Line Table
> =================================================
> #
> # ----------- Cacheline ---------- Peer ------- Load Peer ------- Total Total Total --------- Stores -------- ----- Core Load Hit ----- - LLC Load Hit -- - RMT Load Hit -- --- Load Dram ----
> # Index Address Node PA cnt Snoop Total Local Remote records Loads Stores L1Hit L1Miss N/A FB L1 L2 LclHit LclHitm RmtHit RmtHitm Lcl Rmt
> # ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ........ ....... ........ ........
> #
> 0 0xaaaac17d6000 N/A 0 100.00% 99 99 0 18851 18851 0 0 0 0 0 18752 0 99 0 0 0 0 0
>
> =================================================
> Shared Cache Line Distribution Pareto
> =================================================
> #
> # -- Peer Snoop -- ------- Store Refs ------ --------- Data address --------- ---------- cycles ---------- Total cpu Shared
> # Num Rmt Lcl L1 Hit L1 Miss N/A Offset Node PA cnt Code address rmt peer lcl peer load records cnt Symbol Object Source:Line Node{cpus %peers %stores}
> # ..... ....... ....... ....... ....... ....... .................. .... ...... .................. ........ ........ ........ ....... ........ ...................... ................ ............... ....
> #
> ----------------------------------------------------------------------
> 0 0 99 0 0 0 0xaaaac17d6000
> ----------------------------------------------------------------------
> 0.00% 6.06% 0.00% 0.00% 0.00% 0x20 N/A 0 0xaaaac17c25ac 0 375 43 18469 2 [.] 0x00000000000025ac memstress memstress[25ac] 0{ 2 100.0% n/a}
> 0.00% 93.94% 0.00% 0.00% 0.00% 0x29 N/A 0 0xaaaac17c3e88 0 180 173 135 2 [.] 0x0000000000003e88 memstress memstress[3e88] 0{ 2 100.0% n/a}
>
>
> Changes from v3:
> * Changed to display remote and local peer accesses (Joe);
> * Fixed the usage info for display types (Joe);
> * Do not display HITM dimensions when using the 'peer' display, and the
>   HITM displays don't show any 'peer' dimensions (James);
> * Split into smaller patches for adding the dimensions of peer operations;
> * Updated documentation to reflect the latest GUI and stdio.
>
> Changes from v2:
> * Updated patch 04 to account the metrics for both the cache level and
>   ld_peer for the PEER flag;
> * Updated the documentation for the metric 'rmt_hit', which is accounted
>   for all remote accesses (including remote DRAM and any upward caches).
>
> Changes from v1:
> * Updated patches 01, 02 and 03 to support 'N/A' metrics for store
>   operations, so they align with the patch set [1] for store samples.
>
>
> [1] https://lore.kernel.org/lkml/20220517020326.18580-1-alisaidi@amazon.com/
> [2] https://lore.kernel.org/lkml/20220530083645.253432-1-leo.yan@linaro.org/
> [3] https://git.linaro.org/people/leo.yan/linux-spe.git/ branch: perf_c2c_arm_spe_peer_v4
>
>
> Leo Yan (12):
> perf mem: Add statistics for peer snooping
> perf c2c: Output statistics for peer snooping
> perf c2c: Add dimensions for peer load operations
> perf c2c: Add dimensions of peer metrics for cache line view
> perf c2c: Add mean dimensions for peer operations
> perf c2c: Use explicit names for display macros
> perf c2c: Rename dimension from 'percent_hitm' to
> 'percent_costly_snoop'
> perf c2c: Refactor node header
> perf c2c: Refactor display string
> perf c2c: Sort on peer snooping for load operations
> perf c2c: Use 'peer' as default display for Arm64
> perf c2c: Update documentation for new display option 'peer'
>
> tools/perf/Documentation/perf-c2c.txt | 30 +-
> tools/perf/builtin-c2c.c | 454 ++++++++++++++++++++------
> tools/perf/util/mem-events.c | 28 +-
> tools/perf/util/mem-events.h | 3 +
> 4 files changed, 403 insertions(+), 112 deletions(-)
>
Hi Joe,

On Tue, May 31, 2022 at 02:44:07PM -0400, Joe Mario wrote:

[...]

> Hi Leo:
> I built a new perf with your patches and ran it on a 2-numa node Neoverse platform.
> I then ran my simple test that creates reader and writer threads to tug on the same cacheline.
> The c2c output is appended below.
>
> The output looks good, especially where you've broken out the (average) cycles for local and remote peer loads.
> And I'm glad to see you fixed the "Node" column. I use that a lot to help detect remote node accesses.

Thanks a lot for your testing and suggestions, which are really helpful!

> And the "PA cnt" field is working as well, which is important to see if numa_balance is moving the data around.

Good to know. To be honest, I hadn't paid attention to the "PA cnt" metric
before; after checking the code a bit, this metric is very useful for
understanding how heavily a cache line is accessed from different addresses,
so we can get a sense of how hard a cache line is being hammered.

[...]

> Thanks for doing this. It looks good.

You are welcome! And I very much appreciate your help in maturing the code.

> I'll assume someone else is reviewing your code changes.

Yeah, let's give a bit more time for reviewing.

Thanks,
Leo
Hi Leo,

On Wed, 1 Jun 2022 18:25:05 +0800 Leo Yan wrote:

[...]

> > Thanks for doing this. It looks good.
>
> You are welcome! And I very much appreciate your help in maturing the code.

Seconding that, thanks for progressing this so much Leo.

> > I'll assume someone else is reviewing your code changes.
>
> Yeah, let's give a bit more time for reviewing.

I've tested and given each patch a close look. I haven't found anything that
looks to change other architectures, and the output on my Graviton systems
looks great. I pulled in your patch to add physical addresses to the SPE
records and, as expected, I saw the Node column properly populated and PA cnt
is no longer zero. One nit: the documentation still says "Total HITMs (tot)
as default," while the code now defaults to "peer" on arm64. Other than that:

Tested-by: Ali Saidi <alisaidi@amazon.com>
Reviewed-by: Ali Saidi <alisaidi@amazon.com>

Thanks,
Ali
On Thu, Jun 02, 2022 at 05:11:20PM +0000, Ali Saidi wrote:

[...]

> > You are welcome! And I very much appreciate your help in maturing the code.
>
> Seconding that, thanks for progressing this so much Leo.

You are very welcome, Ali!

> > > I'll assume someone else is reviewing your code changes.
> >
> > Yeah, let's give a bit more time for reviewing.
>
> I've tested and given each patch a close look. I haven't found anything that
> looks to change other architectures, and the output on my Graviton systems
> looks great. I pulled in your patch to add physical addresses to the SPE
> records and, as expected, I saw the Node column properly populated and PA cnt
> is no longer zero. One nit: the documentation still says "Total HITMs (tot)
> as default," while the code now defaults to "peer" on arm64. Other than that:
>
> Tested-by: Ali Saidi <alisaidi@amazon.com>
> Reviewed-by: Ali Saidi <alisaidi@amazon.com>

Thanks a lot for the testing. I will respin a new patch set to correct the
documentation to read "Total HITMs (tot) as default, except Arm64 uses peer
mode as default", and will add your test and review tags.

Thanks,
Leo
On Wed, Jun 1, 2022 at 3:25 AM Leo Yan <leo.yan@linaro.org> wrote:

[...]

> > I'll assume someone else is reviewing your code changes.
>
> Yeah, let's give a bit more time for reviewing.

This is great Leo! I've not been able to test the changes but I didn't have
any coding comments (happy to give an Acked-by). Do you think we can add a
test for this? The test can skip when c2c isn't supported.

Thanks,
Ian
Hi Ian,

On Thu, Jun 02, 2022 at 09:59:29AM -0700, Ian Rogers wrote:

[...]

> This is great Leo! I've not been able to test the changes but I didn't have
> any coding comments (happy to give an Acked-by). Do you think we can add a
> test for this? The test can skip when c2c isn't supported.

It would be great to receive your Acked-by tags :)

Yeah, I agree it would be good to add a test for perf c2c. After some rough
thinking, it seems to me the right approach is to add a testing shell script
under tests/shell and use a tiny multi-threaded program that accesses a
shared cache line from different nodes (or falls back to different CPUs).

I would like to add the test case in a separate patch; this should make it
easier to upstream the current patch set. Please let me know if this works
for you.

Thanks for the review and suggestions!
Leo
On Mon, May 30, 2022 at 07:40:24PM +0800, Leo Yan wrote:
> Arm64 Neoverse CPUs support the data source field in Arm SPE trace; this
> allows us to detect cache line contention and transfers.
>
> This patch set is based on Ali Saidi's patch set v9 "perf: arm-spe: Decode SPE
> source and use for perf c2c" [1]; Ali's patch set doesn't need any change in
> this new round.
IIRC there is a kernel part to this; please let me know when that part
gets merged so that I can process this 12-patch set.
- Arnaldo
> To clearly show peer loads, and to distinguish local peer loads from remote
> peer loads, this patch set introduces three new metrics: 'lcl_peer',
> 'rmt_peer' and 'tot_peer'. The 'peer' display mode uses the 'tot_peer'
> metric for sorting cache lines.
>
> Patches 01-05 add statistics for memory samples and add dimensions for the
> peer metrics.
>
> Patches 06-09 are refactoring: they rename things with more general terms,
> which makes it easier to extend the display modes without being strictly
> bound to HITM tags.
>
> Patches 10-11 add the extended 'peer' display mode and change the default
> display mode to 'peer' on Arm64.
>
> Patch 12 updates the documentation to describe the new dimensions for the
> peer metrics.
>
> This patch set has been verified for both x86 and Arm64 memory samples.
>
> Known issue: Joe pointed out that in patch set v3 the cache line metric
> shows 'N/A' for the node. This is because Arm SPE trace data doesn't contain
> the physical address, so the perf c2c tool fails to find a matching node
> range when the physical address is zero. This issue is addressed in a
> separate patch [2]. Since I am still using the old perf data file (I have
> no Neoverse platform), the output below still shows 'N/A' in the Node field.
>
> Another thing: we need to enhance the data source setting for older Arm
> platforms. As discussed, German will follow up on this task later.
>
> The latest patch set has been uploaded to the git server [3].
>
> The display result with x86 memory samples:
>
> =================================================
> Shared Data Cache Line Table
> =================================================
> #
> # ----------- Cacheline ---------- Tot ------- Load Hitm ------- Total Total Total --------- Stores -------- ----- Core Load Hit ----- - LLC Load Hit -- - RMT Load Hit -- --- Load Dram ----
> # Index Address Node PA cnt Hitm Total LclHitm RmtHitm records Loads Stores L1Hit L1Miss N/A FB L1 L2 LclHit LclHitm RmtHit RmtHitm Lcl Rmt
> # ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ........ ....... ........ ........
> #
> 0 0x55c8971f0080 0 1967 66.14% 252 252 0 6044 3550 2494 2024 470 0 528 2672 78 20 252 0 0 0 0
> 1 0x55c8971f00c0 0 1 33.86% 129 129 0 914 914 0 0 0 0 272 374 52 87 129 0 0 0 0
>
> =================================================
> Shared Cache Line Distribution Pareto
> =================================================
> #
> # ----- HITM ----- ------- Store Refs ------ --------- Data address --------- ---------- cycles ---------- Total cpu Shared
> # Num RmtHitm LclHitm L1 Hit L1 Miss N/A Offset Node PA cnt Code address rmt hitm lcl hitm load records cnt Symbol Object Source:Line Node
> # ..... ....... ....... ....... ....... ....... .................. .... ...... .................. ........ ........ ........ ....... ........ ...................... ................. ....................... ....
> #
> ----------------------------------------------------------------------
> 0 0 252 2024 470 0 0x55c8971f0080
> ----------------------------------------------------------------------
> 0.00% 12.30% 0.00% 0.00% 0.00% 0x0 0 1 0x55c8971ed3e9 0 1313 863 1222 3 [.] 0x00000000000013e9 false_sharing.exe false_sharing.exe[13e9] 0
> 0.00% 0.79% 90.51% 0.00% 0.00% 0x0 0 1 0x55c8971ed3e2 0 1800 878 3029 3 [.] 0x00000000000013e2 false_sharing.exe false_sharing.exe[13e2] 0
> 0.00% 0.00% 9.49% 100.00% 0.00% 0x0 0 1 0x55c8971ed3f4 0 0 0 662 3 [.] 0x00000000000013f4 false_sharing.exe false_sharing.exe[13f4] 0
> 0.00% 86.90% 0.00% 0.00% 0.00% 0x20 0 1 0x55c8971ed447 0 141 103 1131 2 [.] 0x0000000000001447 false_sharing.exe false_sharing.exe[1447] 0
>
> ----------------------------------------------------------------------
> 1 0 129 0 0 0 0x55c8971f00c0
> ----------------------------------------------------------------------
> 0.00% 100.00% 0.00% 0.00% 0.00% 0x20 0 1 0x55c8971ed455 0 88 94 914 2 [.] 0x0000000000001455 false_sharing.exe false_sharing.exe[1455] 0
>
>
> The display result with Arm SPE:
>
> =================================================
> Shared Data Cache Line Table
> =================================================
> #
> # ----------- Cacheline ---------- Peer ------- Load Peer ------- Total Total Total --------- Stores -------- ----- Core Load Hit ----- - LLC Load Hit -- - RMT Load Hit -- --- Load Dram ----
> # Index Address Node PA cnt Snoop Total Local Remote records Loads Stores L1Hit L1Miss N/A FB L1 L2 LclHit LclHitm RmtHit RmtHitm Lcl Rmt
> # ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ........ ....... ........ ........
> #
> 0 0xaaaac17d6000 N/A 0 100.00% 99 99 0 18851 18851 0 0 0 0 0 18752 0 99 0 0 0 0 0
>
> =================================================
> Shared Cache Line Distribution Pareto
> =================================================
> #
> # -- Peer Snoop -- ------- Store Refs ------ --------- Data address --------- ---------- cycles ---------- Total cpu Shared
> # Num Rmt Lcl L1 Hit L1 Miss N/A Offset Node PA cnt Code address rmt peer lcl peer load records cnt Symbol Object Source:Line Node{cpus %peers %stores}
> # ..... ....... ....... ....... ....... ....... .................. .... ...... .................. ........ ........ ........ ....... ........ ...................... ................ ............... ....
> #
> ----------------------------------------------------------------------
> 0 0 99 0 0 0 0xaaaac17d6000
> ----------------------------------------------------------------------
> 0.00% 6.06% 0.00% 0.00% 0.00% 0x20 N/A 0 0xaaaac17c25ac 0 375 43 18469 2 [.] 0x00000000000025ac memstress memstress[25ac] 0{ 2 100.0% n/a}
> 0.00% 93.94% 0.00% 0.00% 0.00% 0x29 N/A 0 0xaaaac17c3e88 0 180 173 135 2 [.] 0x0000000000003e88 memstress memstress[3e88] 0{ 2 100.0% n/a}
>
>
> Changes from v3:
> * Changed to display remote and local peer accesses (Joe);
> * Fixed the usage info for display types (Joe);
> * Do not display HITM dimensions when using the 'peer' display, and the
>   HITM displays don't show any 'peer' dimensions (James);
> * Split into smaller patches for adding the dimensions of peer operations;
> * Updated documentation to reflect the latest GUI and stdio.
>
> Changes from v2:
> * Updated patch 04 to account the metrics for both the cache level and
>   ld_peer for the PEER flag;
> * Updated the documentation for the metric 'rmt_hit', which is accounted
>   for all remote accesses (including remote DRAM and any upward caches).
>
> Changes from v1:
> * Updated patches 01, 02 and 03 to support 'N/A' metrics for store
>   operations, so they align with the patch set [1] for store samples.
>
>
> [1] https://lore.kernel.org/lkml/20220517020326.18580-1-alisaidi@amazon.com/
> [2] https://lore.kernel.org/lkml/20220530083645.253432-1-leo.yan@linaro.org/
> [3] https://git.linaro.org/people/leo.yan/linux-spe.git/ branch: perf_c2c_arm_spe_peer_v4
>
>
> Leo Yan (12):
> perf mem: Add statistics for peer snooping
> perf c2c: Output statistics for peer snooping
> perf c2c: Add dimensions for peer load operations
> perf c2c: Add dimensions of peer metrics for cache line view
> perf c2c: Add mean dimensions for peer operations
> perf c2c: Use explicit names for display macros
> perf c2c: Rename dimension from 'percent_hitm' to
> 'percent_costly_snoop'
> perf c2c: Refactor node header
> perf c2c: Refactor display string
> perf c2c: Sort on peer snooping for load operations
> perf c2c: Use 'peer' as default display for Arm64
> perf c2c: Update documentation for new display option 'peer'
>
> tools/perf/Documentation/perf-c2c.txt | 30 +-
> tools/perf/builtin-c2c.c | 454 ++++++++++++++++++++------
> tools/perf/util/mem-events.c | 28 +-
> tools/perf/util/mem-events.h | 3 +
> 4 files changed, 403 insertions(+), 112 deletions(-)
>
> --
> 2.25.1
--
- Arnaldo
Hi Arnaldo,

On Fri, Jun 03, 2022 at 09:33:10PM +0200, Arnaldo Carvalho de Melo wrote:

> On Mon, May 30, 2022 at 07:40:24PM +0800, Leo Yan wrote:
> > Arm64 Neoverse CPUs support the data source field in Arm SPE trace; this
> > allows us to detect cache line contention and transfers.
> >
> > This patch set is based on Ali Saidi's patch set v9 "perf: arm-spe: Decode SPE
> > source and use for perf c2c" [1]; Ali's patch set doesn't need any change in
> > this new round.
>
> IIRC there is a kernel part to this; please let me know when that part
> gets merged so that I can process this 12-patch set.

This patch set does not depend on any kernel patch; it is based on Ali's
patch set "perf: arm-spe: Decode SPE source and use for perf c2c". Let me
send a new patch set that includes all the relevant patches (Ali's patches
for setting the data source plus this patch set); I will send it out soon.
Hopefully this will make it easier to pick up.

Thanks,
Leo

P.S. James' patch set for enabling SVE registers depends on kernel changes.
James is on vacation, so I will monitor the latest status and follow up in
the corresponding patch set thread.