[PATCH] target/ppc/mem_helper.c: Remove a conditional from dcbz_common()

BALATON Zoltan posted 1 patch 5 months, 2 weeks ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20240622204833.5F7C74E6000@zero.eik.bme.hu
Maintainers: Nicholas Piggin <npiggin@gmail.com>, Daniel Henrique Barboza <danielhb413@gmail.com>
There is a newer version of this series
[PATCH] target/ppc/mem_helper.c: Remove a conditional from dcbz_common()
Posted by BALATON Zoltan 5 months, 2 weeks ago
Instead of passing a bool and selecting a value within dcbz_common(),
let the callers pass in the right value and avoid this conditional.
On PPC, dcbz is often used to zero memory and some code uses it a lot.
This change improves the run time of a test case that copies memory
with a dcbz call in every iteration from 6.23 to 5.83 seconds.

Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
---
This is just a small optimisation removing some of the overhead, but
dcbz still seems to be the biggest issue with this test. With the dcbz
call removed, it runs in 2 seconds. In a profile I see:
  Children      Self  Command   Shared Object            Symbol
-   55.01%    11.44%  qemu-ppc  qemu-ppc                 [.] dcbz_common.constprop.0
               - 43.57% dcbz_common.constprop.0
                  - probe_access
                     - page_get_flags
                          interval_tree_iter_first
               - 11.44% helper_raise_exception_err
                    cpu_loop_exit_restore
                    cpu_loop
                    cpu_exec
                    cpu_exec_setjmp.isra.0
                    cpu_exec_loop.constprop.0
                    cpu_tb_exec
                    0x7f262403636e
                    helper_raise_exception_err
                    cpu_loop_exit_restore
                    cpu_loop
                    cpu_exec
                    cpu_exec_setjmp.isra.0
                    cpu_exec_loop.constprop.0
                    cpu_tb_exec
                  - 0x7f26240386a4
                       11.20% helper_dcbz
+   43.81%    12.28%  qemu-ppc  qemu-ppc                 [.] probe_access
+   39.31%     0.00%  qemu-ppc  [JIT] tid 9969           [.] 0x00007f2624000000
+   32.45%     4.51%  qemu-ppc  qemu-ppc                 [.] page_get_flags
+   25.50%     2.10%  qemu-ppc  qemu-ppc                 [.] interval_tree_iter_first
+   24.67%    24.67%  qemu-ppc  qemu-ppc                 [.] interval_tree_subtree_search
+   16.75%     1.19%  qemu-ppc  qemu-ppc                 [.] helper_dcbz
+    4.78%     4.78%  qemu-ppc  [JIT] tid 9969           [.] 0x00007f26240386be
+    3.46%     3.46%  qemu-ppc  libc-2.32.so             [.] __memset_avx2_unaligned_erms
Any idea how this could be optimised further? (This is running with
qemu-ppc user mode emulation, but I think with system emulation it
might be even worse.) Could an inline implementation with TCG vector
ops work to avoid the helper and let it compile to efficient host
code? Even if that could work, I don't know how to do it, so I'd need
some further advice on this.
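To illustrate the kind of inline approach I mean, a very rough and
untested sketch (made-up function name, plain 64-bit stores rather than
vector ops, and ignoring the 970 dcbz/dcbzl case) could look like this
in the translator:

  static void gen_dcbz_inline(DisasContext *ctx, TCGv ea, int line_size)
  {
      TCGv addr = tcg_temp_new();
      TCGv_i64 zero = tcg_constant_i64(0);

      /* Align the effective address down to the start of the cache line. */
      tcg_gen_andi_tl(addr, ea, ~(target_ulong)(line_size - 1));

      /* Emit one 64-bit zero store per 8 bytes of the line. */
      for (int i = 0; i < line_size; i += 8) {
          tcg_gen_qemu_st_i64(zero, addr, ctx->mem_idx, MO_UQ);
          tcg_gen_addi_tl(addr, addr, 8);
      }
  }

but I don't know whether this would handle all the corner cases
(exceptions, watchpoints, different line sizes) the same way as the
helper does.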

 target/ppc/mem_helper.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/target/ppc/mem_helper.c b/target/ppc/mem_helper.c
index f88155ad45..361fd72226 100644
--- a/target/ppc/mem_helper.c
+++ b/target/ppc/mem_helper.c
@@ -271,12 +271,11 @@ void helper_stsw(CPUPPCState *env, target_ulong addr, uint32_t nb,
 }
 
 static void dcbz_common(CPUPPCState *env, target_ulong addr,
-                        uint32_t opcode, bool epid, uintptr_t retaddr)
+                        uint32_t opcode, int mmu_idx, uintptr_t retaddr)
 {
     target_ulong mask, dcbz_size = env->dcache_line_size;
     uint32_t i;
     void *haddr;
-    int mmu_idx = epid ? PPC_TLB_EPID_STORE : ppc_env_mmu_index(env, false);
 
 #if defined(TARGET_PPC64)
     /* Check for dcbz vs dcbzl on 970 */
@@ -309,12 +308,12 @@ static void dcbz_common(CPUPPCState *env, target_ulong addr,
 
 void helper_dcbz(CPUPPCState *env, target_ulong addr, uint32_t opcode)
 {
-    dcbz_common(env, addr, opcode, false, GETPC());
+    dcbz_common(env, addr, opcode, ppc_env_mmu_index(env, false), GETPC());
 }
 
 void helper_dcbzep(CPUPPCState *env, target_ulong addr, uint32_t opcode)
 {
-    dcbz_common(env, addr, opcode, true, GETPC());
+    dcbz_common(env, addr, opcode, PPC_TLB_EPID_STORE, GETPC());
 }
 
 void helper_icbi(CPUPPCState *env, target_ulong addr)
-- 
2.30.9
Re: [PATCH] target/ppc/mem_helper.c: Remove a conditional from dcbz_common()
Posted by Richard Henderson 5 months, 2 weeks ago
On 6/22/24 13:48, BALATON Zoltan wrote:
> [...]
>   void helper_dcbz(CPUPPCState *env, target_ulong addr, uint32_t opcode)
>   {
> -    dcbz_common(env, addr, opcode, false, GETPC());
> +    dcbz_common(env, addr, opcode, ppc_env_mmu_index(env, false), GETPC());

This is already computed in the translator: DisasContext.mem_idx.
If you pass the mmu_idx as an argument, you can unify these two helpers.
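
Roughly something like this (an untested sketch; helper.h and the
translate.c caller would need matching changes):

  void helper_dcbz(CPUPPCState *env, target_ulong addr, uint32_t opcode,
                   uint32_t mmu_idx)
  {
      dcbz_common(env, addr, opcode, mmu_idx, GETPC());
  }

with gen_dcbz() passing tcg_constant_i32(ctx->mem_idx) (or
PPC_TLB_EPID_STORE for the dcbzep case) as the extra argument.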


r~

Re: [PATCH] target/ppc/mem_helper.c: Remove a conditional from dcbz_common()
Posted by BALATON Zoltan 5 months, 2 weeks ago
On Sun, 23 Jun 2024, Richard Henderson wrote:
> On 6/22/24 13:48, BALATON Zoltan wrote:
>> [...]
>>   void helper_dcbz(CPUPPCState *env, target_ulong addr, uint32_t opcode)
>>   {
>> -    dcbz_common(env, addr, opcode, false, GETPC());
>> +    dcbz_common(env, addr, opcode, ppc_env_mmu_index(env, false), GETPC());
>
> This is already computed in the translator: DisasContext.mem_idx.
> If you pass the mmu_idx as an argument, you can unify these two helpers.

I've tried that. It works but it's slower: I get 5.9 seconds vs 5.83 with
this patch, so I think I'd stay with this one. Maybe it's because the helper
takes 4 parameters instead of 3? I can submit that patch for reference if it
would be useful, but I'd keep this one for merging for now.

Regards,
BALATON Zoltan

Re: [PATCH] target/ppc/mem_helper.c: Remove a conditional from dcbz_common()
Posted by Richard Henderson 5 months, 2 weeks ago
On 6/23/24 15:24, BALATON Zoltan wrote:
> On Sun, 23 Jun 2024, Richard Henderson wrote:
>> On 6/22/24 13:48, BALATON Zoltan wrote:
>>> [...]
>>
>> This is already computed in the translator: DisasContext.mem_idx.
>> If you pass the mmu_idx as an argument, you can unify these two helpers.
> 
> I've tried that. It works but it's slower: I get 5.9 seconds vs 5.83 with this patch, so I
> think I'd stay with this one. Maybe it's because the helper takes 4 parameters instead of
> 3? I can submit that patch for reference if it would be useful, but I'd keep this one for
> merging for now.

Interesting.  Can you share your test case?

The other thing I see is that we can split out the 970 test into a separate helper.  We can
test for POWERPC_EXCP_970 and !(opcode & 0x00200000) during translation.  I think the
current placement, where dcbzep tests that as well, is wrong, since dcbzep is an E500 insn.
Anyway, that would eliminate opcode as an argument, bringing you back to 3.
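
Something along these lines (an untested sketch, with a made-up name for
the 970 variant, assuming the exception model is available to the
translator):

  static void gen_dcbz(DisasContext *ctx)
  {
      TCGv ea = tcg_temp_new();
      TCGv_i32 midx = tcg_constant_i32(ctx->mem_idx);

      gen_addr_reg_index(ctx, ea);
      if (ctx->excp_model == POWERPC_EXCP_970 &&
          !(ctx->opcode & 0x00200000)) {
          /* dcbz (not dcbzl) on the 970 may clear a 32-byte block instead */
          gen_helper_dcbz_size32(tcg_env, ea, midx);
      } else {
          gen_helper_dcbz(tcg_env, ea, midx);
      }
  }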


r~

Re: [PATCH] target/ppc/mem_helper.c: Remove a conditional from dcbz_common()
Posted by BALATON Zoltan 5 months, 2 weeks ago
On Sun, 23 Jun 2024, Richard Henderson wrote:
> On 6/23/24 15:24, BALATON Zoltan wrote:
>> On Sun, 23 Jun 2024, Richard Henderson wrote:
>>> On 6/22/24 13:48, BALATON Zoltan wrote:
>>>> [...]
>>> 
>>> This is already computed in the translator: DisasContext.mem_idx.
>>> If you pass the mmu_idx as an argument, you can unify these two helpers.
>> 
>> I've tried that. It works but it's slower: I get 5.9 seconds vs 5.83 with
>> this patch, so I think I'd stay with this one. Maybe it's because the helper
>> takes 4 parameters instead of 3? I can submit that patch for reference if it
>> would be useful, but I'd keep this one for merging for now.
>
> Interesting.  Can you share your test case?
>
> The other thing I see is that we can split out the 970 test into a separate
> helper.  We can test for POWERPC_EXCP_970 and !(opcode & 0x00200000) during
> translation.  I think the current placement, where dcbzep tests that as well,
> is wrong, since dcbzep is an E500 insn.
> Anyway, that would eliminate opcode as an argument, bringing you back to 3.

It'd eliminate opcode, but I think I'd need to pass the cache line size
instead somehow, so it might still need 4 parameters. I'll give that a try
and see if it's better or not. But the biggest overhead is still the call
to probe_write. Is there a way to eliminate that and maybe do the zeroing
with vector or 128-bit ops in compiled code and not in the helper? (I've
tried removing the test and memset, but the slow path is really slow, taking
more than 9 seconds, so it's still faster to test when memset can be used.)

Regards,
BALATON Zoltan