[PATCH v2 0/7] Implement inline static calls on PPC32 - v2

Christophe Leroy posted 7 patches 3 years, 9 months ago
arch/powerpc/Kconfig                          |  3 +-
arch/powerpc/include/asm/asm.h                |  7 +++
arch/powerpc/include/asm/static_call.h        |  2 +
arch/powerpc/kernel/cpu_setup_6xx.S           | 26 ++++++---
arch/powerpc/kernel/cpu_setup_fsl_booke.S     |  8 ++-
arch/powerpc/kernel/entry_32.S                |  8 ++-
arch/powerpc/kernel/head_40x.S                |  5 +-
arch/powerpc/kernel/head_8xx.S                |  5 +-
arch/powerpc/kernel/head_book3s_32.S          | 29 +++++++---
arch/powerpc/kernel/head_fsl_booke.S          |  5 +-
arch/powerpc/kernel/static_call.c             | 56 ++++++++++++++-----
arch/powerpc/kernel/swsusp_32.S               |  5 +-
arch/powerpc/kvm/fpu.S                        | 17 ++++--
arch/powerpc/platforms/52xx/lite5200_sleep.S  | 15 +++--
arch/x86/kernel/static_call.c                 |  2 +-
init/main.c                                   |  1 +
kernel/static_call_inline.c                   |  2 +-
tools/objtool/arch/powerpc/decode.c           | 16 ++++--
tools/objtool/arch/powerpc/include/arch/elf.h |  1 +
tools/objtool/arch/x86/include/arch/elf.h     |  1 +
tools/objtool/check.c                         | 10 ++--
tools/objtool/orc_gen.c                       |  2 +-
22 files changed, 162 insertions(+), 64 deletions(-)
create mode 100644 arch/powerpc/include/asm/asm.h
[PATCH v2 0/7] Implement inline static calls on PPC32 - v2
Posted by Christophe Leroy 3 years, 9 months ago
This series applies on top of the series v3 "objtool: Enable and
implement --mcount option on powerpc" [1] rebased on powerpc-next branch

A few modifications are done to core parts to enable powerpc
implementation:
- R_X86_64_PC32 is abstracted to R_REL32 so that it can then be
redefined as R_PPC_REL32.
- A call to static_call_init() is added to start_kernel() to avoid
every architecture to have to call it
- Trampoline address is provided to arch_static_call_transform() even
when setting a site to fallback on a call to the trampoline when the
target is too far.

[1] https://lore.kernel.org/lkml/70b6d08d-aced-7f4e-b958-a3c7ae1a9319@csgroup.eu/T/#rb3a073c54aba563a135fba891e0c34c46e47beef

Christophe Leroy (7):
  powerpc: Add missing asm/asm.h for objtool
  objtool/powerpc: Activate objtool on PPC32
  objtool: Add architecture specific R_REL32 macro
  objtool/powerpc: Add necessary support for inline static calls
  init: Call static_call_init() from start_kernel()
  static_call_inline: Provide trampoline address when updating sites
  powerpc/static_call: Implement inline static calls

 arch/powerpc/Kconfig                          |  3 +-
 arch/powerpc/include/asm/asm.h                |  7 +++
 arch/powerpc/include/asm/static_call.h        |  2 +
 arch/powerpc/kernel/cpu_setup_6xx.S           | 26 ++++++---
 arch/powerpc/kernel/cpu_setup_fsl_booke.S     |  8 ++-
 arch/powerpc/kernel/entry_32.S                |  8 ++-
 arch/powerpc/kernel/head_40x.S                |  5 +-
 arch/powerpc/kernel/head_8xx.S                |  5 +-
 arch/powerpc/kernel/head_book3s_32.S          | 29 +++++++---
 arch/powerpc/kernel/head_fsl_booke.S          |  5 +-
 arch/powerpc/kernel/static_call.c             | 56 ++++++++++++++-----
 arch/powerpc/kernel/swsusp_32.S               |  5 +-
 arch/powerpc/kvm/fpu.S                        | 17 ++++--
 arch/powerpc/platforms/52xx/lite5200_sleep.S  | 15 +++--
 arch/x86/kernel/static_call.c                 |  2 +-
 init/main.c                                   |  1 +
 kernel/static_call_inline.c                   |  2 +-
 tools/objtool/arch/powerpc/decode.c           | 16 ++++--
 tools/objtool/arch/powerpc/include/arch/elf.h |  1 +
 tools/objtool/arch/x86/include/arch/elf.h     |  1 +
 tools/objtool/check.c                         | 10 ++--
 tools/objtool/orc_gen.c                       |  2 +-
 22 files changed, 162 insertions(+), 64 deletions(-)
 create mode 100644 arch/powerpc/include/asm/asm.h

-- 
2.36.1
Re: [PATCH v2 0/7] Implement inline static calls on PPC32 - v2
Posted by Ard Biesheuvel 3 years, 9 months ago
Hello Christophe,

On Fri, 8 Jul 2022 at 19:32, Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
> This series applies on top of the series v3 "objtool: Enable and
> implement --mcount option on powerpc" [1] rebased on powerpc-next branch
>
> A few modifications are done to core parts to enable powerpc
> implementation:
> - R_X86_64_PC32 is abstracted to R_REL32 so that it can then be
> redefined as R_PPC_REL32.
> - A call to static_call_init() is added to start_kernel() to avoid
> every architecture to have to call it
> - Trampoline address is provided to arch_static_call_transform() even
> when setting a site to fallback on a call to the trampoline when the
> target is too far.
>
> [1] https://lore.kernel.org/lkml/70b6d08d-aced-7f4e-b958-a3c7ae1a9319@csgroup.eu/T/#rb3a073c54aba563a135fba891e0c34c46e47beef
>
> Christophe Leroy (7):
>   powerpc: Add missing asm/asm.h for objtool
>   objtool/powerpc: Activate objtool on PPC32
>   objtool: Add architecture specific R_REL32 macro
>   objtool/powerpc: Add necessary support for inline static calls
>   init: Call static_call_init() from start_kernel()
>   static_call_inline: Provide trampoline address when updating sites
>   powerpc/static_call: Implement inline static calls
>

Could you quantify the performance gains of moving from out-of-line,
patched tail-call branch instructions to full-fledged inline static
calls? On x86, the retpoline problem makes this glaringly obvious, but
on other architectures, the complexity of supporting this model may
outweigh the performance advantages.
Re: [PATCH v2 0/7] Implement inline static calls on PPC32 - v2
Posted by Christophe Leroy 3 years, 7 months ago

Le 09/07/2022 à 08:52, Ard Biesheuvel a écrit :
> Hello Christophe,
> 
> On Fri, 8 Jul 2022 at 19:32, Christophe Leroy
> <christophe.leroy@csgroup.eu> wrote:
>>
>> This series applies on top of the series v3 "objtool: Enable and
>> implement --mcount option on powerpc" [1] rebased on powerpc-next branch
>>
>> A few modifications are done to core parts to enable powerpc
>> implementation:
>> - R_X86_64_PC32 is abstracted to R_REL32 so that it can then be
>> redefined as R_PPC_REL32.
>> - A call to static_call_init() is added to start_kernel() to avoid
>> every architecture to have to call it
>> - Trampoline address is provided to arch_static_call_transform() even
>> when setting a site to fallback on a call to the trampoline when the
>> target is too far.
>>
>> [1] https://lore.kernel.org/lkml/70b6d08d-aced-7f4e-b958-a3c7ae1a9319@csgroup.eu/T/#rb3a073c54aba563a135fba891e0c34c46e47beef
>>
>> Christophe Leroy (7):
>>    powerpc: Add missing asm/asm.h for objtool
>>    objtool/powerpc: Activate objtool on PPC32
>>    objtool: Add architecture specific R_REL32 macro
>>    objtool/powerpc: Add necessary support for inline static calls
>>    init: Call static_call_init() from start_kernel()
>>    static_call_inline: Provide trampoline address when updating sites
>>    powerpc/static_call: Implement inline static calls
>>
> 
> Could you quantify the performance gains of moving from out-of-line,
> patched tail-call branch instructions to full-fledged inline static
> calls? On x86, the retpoline problem makes this glaringly obvious, but
> on other architectures, the complexity of supporting this model may
> outweigh the performance advantages.

Surprisingly, I get worst performance with inline static call than with 
out of line static call:

No static call:

root@vgoip:~# perf stat -r 10 ./hackbench 1
Running with 1*40 (== 40) tasks.
Time: 17.186
Running with 1*40 (== 40) tasks.
Time: 16.738
Running with 1*40 (== 40) tasks.
Time: 16.579
Running with 1*40 (== 40) tasks.
Time: 16.838
Running with 1*40 (== 40) tasks.
Time: 16.652
Running with 1*40 (== 40) tasks.
Time: 17.380
Running with 1*40 (== 40) tasks.
Time: 16.630
Running with 1*40 (== 40) tasks.
Time: 16.850
Running with 1*40 (== 40) tasks.
Time: 17.161
Running with 1*40 (== 40) tasks.
Time: 16.722

  Performance counter stats for './hackbench 1' (10 runs):

           17019.55 msec task-clock                #    0.980 CPUs 
utilized            ( +-  0.51% )
               4847      context-switches          #  282.280 /sec 
               ( +-  6.32% )
                  0      cpu-migrations            #    0.000 /sec
               1249      page-faults               #   72.739 /sec 
               ( +-  0.49% )
         2245344976      cycles                    #    0.131 GHz 
               ( +-  0.51% )
          727437072      instructions              #    0.32  insn per 
cycle           ( +-  0.40% )
    <not supported>      branches
    <not supported>      branch-misses

            17.3585 +- 0.0909 seconds time elapsed  ( +-  0.52% )


Outline static call:

root@vgoip:~# perf stat -r 10 ./hackbench 1
Running with 1*40 (== 40) tasks.
Time: 15.892
Running with 1*40 (== 40) tasks.
Time: 15.731
Running with 1*40 (== 40) tasks.
Time: 15.507
Running with 1*40 (== 40) tasks.
Time: 16.269
Running with 1*40 (== 40) tasks.
Time: 15.934
Running with 1*40 (== 40) tasks.
Time: 16.048
Running with 1*40 (== 40) tasks.
Time: 15.700
Running with 1*40 (== 40) tasks.
Time: 16.063
Running with 1*40 (== 40) tasks.
Time: 15.852
Running with 1*40 (== 40) tasks.
Time: 15.941

  Performance counter stats for './hackbench 1' (10 runs):

           16227.32 msec task-clock                #    0.992 CPUs 
utilized            ( +-  0.42% )
               3732      context-switches          #  230.525 /sec 
               ( +-  6.42% )
                  0      cpu-migrations            #    0.000 /sec
               1244      page-faults               #   76.842 /sec 
               ( +-  0.11% )
         2141094288      cycles                    #    0.132 GHz 
               ( +-  0.42% )
          712598441      instructions              #    0.33  insn per 
cycle           ( +-  0.29% )
    <not supported>      branches
    <not supported>      branch-misses

            16.3539 +- 0.0675 seconds time elapsed  ( +-  0.41% )


Inline static call:

root@vgoip:~# perf stat -r 10 ./hackbench 1
Running with 1*40 (== 40) tasks.
Time: 17.512
Running with 1*40 (== 40) tasks.
Time: 17.240
Running with 1*40 (== 40) tasks.
Time: 16.901
Running with 1*40 (== 40) tasks.
Time: 17.125
Running with 1*40 (== 40) tasks.
Time: 17.262
Running with 1*40 (== 40) tasks.
Time: 17.298
Running with 1*40 (== 40) tasks.
Time: 17.182
Running with 1*40 (== 40) tasks.
Time: 16.988
Running with 1*40 (== 40) tasks.
Time: 17.102
Running with 1*40 (== 40) tasks.
Time: 16.669

  Performance counter stats for './hackbench 1' (10 runs):

           16976.76 msec task-clock                #    0.964 CPUs 
utilized            ( +-  0.44% )
               4760      context-switches          #  273.007 /sec 
               ( +-  4.93% )
                  0      cpu-migrations            #    0.000 /sec
               1252      page-faults               #   71.808 /sec 
               ( +-  0.35% )
         2239986112      cycles                    #    0.128 GHz 
               ( +-  0.44% )
          721540184      instructions              #    0.31  insn per 
cycle           ( +-  0.31% )
    <not supported>      branches
    <not supported>      branch-misses

            17.6126 +- 0.0762 seconds time elapsed  ( +-  0.43% )


Summary:

No static calls:
            17.3585 +- 0.0909 seconds time elapsed  ( +-  0.52% )
Out-of-line static calls:
            16.3539 +- 0.0675 seconds time elapsed  ( +-  0.41% )
Inline static calls:
            17.6126 +- 0.0762 seconds time elapsed  ( +-  0.43% )

Is there anything wrong with inline statica calls ?

Christophe
Re: [PATCH v2 0/7] Implement inline static calls on PPC32 - v2
Posted by Benjamin Gray 3 years, 7 months ago
On Thu, 2022-09-01 at 16:46 +0000, Christophe Leroy wrote:
> Surprisingly, I get worst performance with inline static call than
> with 
> out of line static call:

I'm not sure what hackbench is doing, but when microbenchmarking 64 bit
out-of-line calls in a loop I saw a similar thing where adding more
indirection improved the performance despite doing more work. The cause
seemed to be a combination of using older hardware and the target being
too short (just an integer increment). Moving to a newer machine and
adding a lot of NOPs to the target made the performance make sense.
Re: [PATCH v2 0/7] Implement inline static calls on PPC32 - v2
Posted by Christophe Leroy 3 years, 7 months ago

Le 08/09/2022 à 02:13, Benjamin Gray a écrit :
> On Thu, 2022-09-01 at 16:46 +0000, Christophe Leroy wrote:
>> Surprisingly, I get worst performance with inline static call than
>> with
>> out of line static call:
> 
> I'm not sure what hackbench is doing, but when microbenchmarking 64 bit
> out-of-line calls in a loop I saw a similar thing where adding more
> indirection improved the performance despite doing more work. The cause
> seemed to be a combination of using older hardware and the target being
> too short (just an integer increment). Moving to a newer machine and
> adding a lot of NOPs to the target made the performance make sense.

Yes might be.

I think I'll first do new tests with CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B

Christophe