This patchset is an attempt to improve the VMX (Altivec) instruction performance by making use of the new TCG vector operations where possible.

In order to use TCG vector operations, the registers must be accessible from cpu_env, whereas currently they are accessed via arrays of static TCG globals. Patches 1-3 are therefore mechanical patches which introduce access helpers for the FPR, AVR and VSR registers using the supplied TCGv_i64 parameter.

Once this is done, patch 4 enables us to remove the static TCG global arrays and updates the access helpers to read/write the relevant fields in cpu_env directly.

The final patches 5 and 6 convert the VMX logical instructions and the addition/subtraction instructions respectively over to the TCG vector operations.

NOTE: there are a lot of instructions that cannot (yet) be optimised to use TCG vector operations; however, it struck me that there may be some potential for converting the saturating add/sub and cmp instructions if there were a mechanism to return a set of flags indicating the result of the saturation/comparison.

Finally, thanks to Richard for taking the time to answer some of my (mostly beginner) questions related to TCG.
Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>

Mark Cave-Ayland (6):
  target/ppc: introduce get_fpr() and set_fpr() helpers for FP register
    access
  target/ppc: introduce get_avr64() and set_avr64() helpers for VMX
    register access
  target/ppc: introduce get_cpu_vsr{l,h}() and set_cpu_vsr{l,h}()
    helpers for VSR register access
  target/ppc: switch FPR, VMX and VSX helpers to access data directly
    from cpu_env
  target/ppc: convert VMX logical instructions to use vector operations
  target/ppc: convert vaddu[b,h,w,d] and vsubu[b,h,w,d] over to use
    vector operations

 target/ppc/helper.h                 |   8 -
 target/ppc/int_helper.c             |   7 -
 target/ppc/translate.c              |  72 ++--
 target/ppc/translate/fp-impl.inc.c  | 492 ++++++++++++++++++-----
 target/ppc/translate/vmx-impl.inc.c | 182 ++++++---
 target/ppc/translate/vsx-impl.inc.c | 782 ++++++++++++++++++++++++++----------
 6 files changed, 1110 insertions(+), 433 deletions(-)

-- 
2.11.0
On Fri, 7 Dec 2018, Mark Cave-Ayland wrote: > This patchset is an attempt at trying to improve the VMX (Altivec) instruction > performance by making use of the new TCG vector operations where possible. This is very welcome, thanks for doing this. > In order to use TCG vector operations, the registers must be accessible from cpu_env > whilst currently they are accessed via arrays of static TCG globals. Patches 1-3 > are therefore mechanical patches which introduce access helpers for FPR, AVR and VSR > registers using the supplied TCGv_i64 parameter. Have you tried some benchmarks or tests to measure the impact of these changes? I've tried the (very unscientific) benchmarks I've written about before here: http://lists.nongnu.org/archive/html/qemu-ppc/2018-07/msg00261.html (which seem to use AltiVec/VMX instructions but not sure which) on mac99 with MorphOS and I could not see any performance increase. I haven't run enough tests but results with or without this series on master were mostly the same within a few percents, and sometimes even seen lower performance with these patches than without. I haven't tried to find out why (no time for that now) so can't really draw any conclusions from this. I'm also not sure if I've actually tested what you've changed or these use instructions that your patches don't optimise yet, or the changes I've seen were just normal changes between runs; but I wonder if the increased number of temporaries could result in lower performance in some cases? Regards, BALATON Zoltan
On Mon, Dec 10, 2018 at 01:33:53AM +0100, BALATON Zoltan wrote: > On Fri, 7 Dec 2018, Mark Cave-Ayland wrote: > > This patchset is an attempt at trying to improve the VMX (Altivec) instruction > > performance by making use of the new TCG vector operations where possible. > > This is very welcome, thanks for doing this. > > > In order to use TCG vector operations, the registers must be accessible from cpu_env > > whilst currently they are accessed via arrays of static TCG globals. Patches 1-3 > > are therefore mechanical patches which introduce access helpers for FPR, AVR and VSR > > registers using the supplied TCGv_i64 parameter. > > Have you tried some benchmarks or tests to measure the impact of these > changes? I've tried the (very unscientific) benchmarks I've written about > before here: > > http://lists.nongnu.org/archive/html/qemu-ppc/2018-07/msg00261.html > > (which seem to use AltiVec/VMX instructions but not sure which) on mac99 > with MorphOS and I could not see any performance increase. I haven't run > enough tests but results with or without this series on master were mostly > the same within a few percents, and sometimes even seen lower performance > with these patches than without. I haven't tried to find out why (no time > for that now) so can't really draw any conclusions from this. I'm also not > sure if I've actually tested what you've changed or these use instructions > that your patches don't optimise yet, or the changes I've seen were just > normal changes between runs; but I wonder if the increased number of > temporaries could result in lower performance in some cases? What was your host machine. IIUC this change will only improve performance if the host tcg backend is able to implement TCG vector ops in terms of vector ops on the host. In addition, this series only converts a subset of the integer and logical vector instructions. 
If your testcase is mostly floating point (vectored or otherwise), it will still be softfloat and so not see any speedup. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Mon, 10 Dec 2018, David Gibson wrote: > On Mon, Dec 10, 2018 at 01:33:53AM +0100, BALATON Zoltan wrote: >> On Fri, 7 Dec 2018, Mark Cave-Ayland wrote: >>> This patchset is an attempt at trying to improve the VMX (Altivec) instruction >>> performance by making use of the new TCG vector operations where possible. >> >> This is very welcome, thanks for doing this. >> >>> In order to use TCG vector operations, the registers must be accessible from cpu_env >>> whilst currently they are accessed via arrays of static TCG globals. Patches 1-3 >>> are therefore mechanical patches which introduce access helpers for FPR, AVR and VSR >>> registers using the supplied TCGv_i64 parameter. >> >> Have you tried some benchmarks or tests to measure the impact of these >> changes? I've tried the (very unscientific) benchmarks I've written about >> before here: >> >> http://lists.nongnu.org/archive/html/qemu-ppc/2018-07/msg00261.html >> >> (which seem to use AltiVec/VMX instructions but not sure which) on mac99 >> with MorphOS and I could not see any performance increase. I haven't run >> enough tests but results with or without this series on master were mostly >> the same within a few percents, and sometimes even seen lower performance >> with these patches than without. I haven't tried to find out why (no time >> for that now) so can't really draw any conclusions from this. I'm also not >> sure if I've actually tested what you've changed or these use instructions >> that your patches don't optimise yet, or the changes I've seen were just >> normal changes between runs; but I wonder if the increased number of >> temporaries could result in lower performance in some cases? > > What was your host machine. IIUC this change will only improve > performance if the host tcg backend is able to implement TCG vector > ops in terms of vector ops on the host. Tried it on i5 650 which has: sse sse2 ssse3 sse4_1 sse4_2. 
I assume x86_64 should be supported but not sure what are the CPU requirements. > In addition, this series only converts a subset of the integer and > logical vector instructions. If your testcase is mostly floating > point (vectored or otherwise), it will still be softfloat and so not > see any speedup. Yes, I don't really know what these tests use but I think "lame" test is mostly floating point but tried with "lame_vmx" which should at least use some vector ops and "mplayer -benchmark" test is more vmx dependent based on my previous profiling and testing with hardfloat but I'm not sure. (When testing these with hardfloat I've found that lame was benefiting from hardfloat but mplayer wasn't and more VMX related functions showed up with mplayer so I assumed it's more VMX bound.) I've tried to do some profiling again to find out what's used but I can't get good results with the tools I have (oprofile stopped working since I've updated my machine and Linux perf provides results that are hard to interpret for me, haven't tried if gprof would work now it didn't before) but I've seen some vector related helpers in the profile so at least some vector ops are used. The "helper_vperm" came up top at about 11th (not sure where is it called from), other vector helpers were lower. I don't remember details now but previously when testing hardfloat I've written this: "I've looked at vperm which came out top in one of the profiles I've taken and on little endian hosts it has the loop backwards and also accesses vector elements from end to front which I wonder may be enough for the compiler to not be able to optimise it? But I haven't checked assembly. The altivec dependent mplayer video decoding test did not change much with hardfloat, it took 98% compared to master so likely altivec is dominating here." (Although this was with the PPC specific vector helpers before VMX patch so not sure if this is still relevant.) 
The top 10 in profile were still related to low level memory access and MMU management stuff as I've found before: http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03609.html http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03704.html I think implementing i2c for mac99 may help this and some other optimisations may also be possible but I don't know enough about these to try that. It also looks like with --enable-debug something is always flushing the tlb and blowing away tb caches so these will be top in profile and likely dominate runtime, so I can't really use a profile to measure the impact of the VMX patch. Without --enable-debug I can't get call graphs so can't get a useful profile. I think I've looked at this before as well but can't remember now which check enabled by --enable-debug is responsible for the constant tb cache flush and if that could be avoided. I just don't use --enable-debug unless I need to debug something. Maybe the PPC softmmu should be reviewed and optimised by someone who knows it... Regards, BALATON Zoltan
On 12/10/18 2:54 PM, BALATON Zoltan wrote: >> What was your host machine. IIUC this change will only improve >> performance if the host tcg backend is able to implement TCG vector >> ops in terms of vector ops on the host. > > Tried it on i5 650 which has: sse sse2 ssse3 sse4_1 sse4_2. I assume x86_64 > should be supported but not sure what are the CPU requirements. Not quite. I only support avx1 and later. I thought about supporting sse4 and later (that's the minimum with all of the instructions that do what we need), but there is only one cpu generation with sse4 and without avx1, and avx1 is already 8 years old. r~
On Mon, 10 Dec 2018, Richard Henderson wrote: > On 12/10/18 2:54 PM, BALATON Zoltan wrote: >> Tried it on i5 650 which has: sse sse2 ssse3 sse4_1 sse4_2. I assume x86_64 >> should be supported but not sure what are the CPU requirements. > > Not quite. I only support avx1 and later. > > I thought about supporting sse4 and later (that's the minimum with all of the > instructions that do what we need), but there is only one cpu generation with > sse4 and without avx1, and avx1 is already 8 years old. OK that explains why I haven't seen any improvements. My CPU just predates AVX and happens to be in the generation you mention. But I agree this is probably not worth the effort. Maybe I should test on something newer instead. Thank you, BALATON Zoltan
On Mon, Dec 10, 2018 at 09:54:51PM +0100, BALATON Zoltan wrote: > On Mon, 10 Dec 2018, David Gibson wrote: > > On Mon, Dec 10, 2018 at 01:33:53AM +0100, BALATON Zoltan wrote: > > > On Fri, 7 Dec 2018, Mark Cave-Ayland wrote: > > > > This patchset is an attempt at trying to improve the VMX (Altivec) instruction > > > > performance by making use of the new TCG vector operations where possible. > > > > > > This is very welcome, thanks for doing this. > > > > > > > In order to use TCG vector operations, the registers must be accessible from cpu_env > > > > whilst currently they are accessed via arrays of static TCG globals. Patches 1-3 > > > > are therefore mechanical patches which introduce access helpers for FPR, AVR and VSR > > > > registers using the supplied TCGv_i64 parameter. > > > > > > Have you tried some benchmarks or tests to measure the impact of these > > > changes? I've tried the (very unscientific) benchmarks I've written about > > > before here: > > > > > > http://lists.nongnu.org/archive/html/qemu-ppc/2018-07/msg00261.html > > > > > > (which seem to use AltiVec/VMX instructions but not sure which) on mac99 > > > with MorphOS and I could not see any performance increase. I haven't run > > > enough tests but results with or without this series on master were mostly > > > the same within a few percents, and sometimes even seen lower performance > > > with these patches than without. I haven't tried to find out why (no time > > > for that now) so can't really draw any conclusions from this. I'm also not > > > sure if I've actually tested what you've changed or these use instructions > > > that your patches don't optimise yet, or the changes I've seen were just > > > normal changes between runs; but I wonder if the increased number of > > > temporaries could result in lower performance in some cases? > > > > What was your host machine. 
IIUC this change will only improve > > performance if the host tcg backend is able to implement TCG vector > > ops in terms of vector ops on the host. > > Tried it on i5 650 which has: sse sse2 ssse3 sse4_1 sse4_2. I assume x86_64 > should be supported but not sure what are the CPU requirements. > > > In addition, this series only converts a subset of the integer and > > logical vector instructions. If your testcase is mostly floating > > point (vectored or otherwise), it will still be softfloat and so not > > see any speedup. > > Yes, I don't really know what these tests use but I think "lame" test is > mostly floating point but tried with "lame_vmx" which should at least use > some vector ops and "mplayer -benchmark" test is more vmx dependent based on > my previous profiling and testing with hardfloat but I'm not sure. (When > testing these with hardfloat I've found that lame was benefiting from > hardfloat but mplayer wasn't and more VMX related functions showed up with > mplayer so I assumed it's more VMX bound.) I should clarify here. When I say "floating point" above, I'm not meaning things using the regular FPU instead of the vector unit. I'm saying *anything* involving floating point calculations whether they're done in the FPU or the vector unit. The patches here don't convert all VMX instructions to use vector TCG ops - they only convert a few, and those few are about using the vector unit for integer (and logical) operations. VMX instructions involving floating point calculations are unaffected and will still use soft-float. > I've tried to do some profiling again to find out what's used but I can't > get good results with the tools I have (oprofile stopped working since I've > updated my machine and Linux perf provides results that are hard to > interpret for me, haven't tried if gprof would work now it didn't before) > but I've seen some vector related helpers in the profile so at least some > vector ops are used. 
The "helper_vperm" came up top at about 11th (not sure > where is it called from), other vector helpers were lower. > > I don't remember details now but previously when testing hardfloat I've > written this: "I've looked at vperm which came out top in one of the > profiles I've taken and on little endian hosts it has the loop backwards and > also accesses vector elements from end to front which I wonder may be enough > for the compiler to not be able to optimise it? But I haven't checked > assembly. The altivec dependent mplayer video decoding test did not change > much with hardfloat, it took 98% compared to master so likely altivec is > dominating here." (Although this was with the PPC specific vector helpers > before VMX patch so not sure if this is still relevant.) > > The top 10 in profile were still related to low level memory access and MMU > management stuff as I've found before: > > http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03609.html > http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03704.html > > I think implementing i2c for mac99 may help this and some other > optimisations may also be possible but I don't know enough about these to > try that. > > It also looks like with --enable-debug something is always flusing tlb and > blowing away tb caches so these will be top in profile and likely dominate > runtime so can't really use profile to measure impact of VMX patch. Without > --enable-debug I can't get call graphs so can't get useful profile. I think > I've looked at this before as well but can't remember now which check > enabled by --enable-debug is responsible for constant tb cache flush and if > that could be avoided. I just don't use --enable-debug since unless need to > debug somthing. > > Maybe the PPC softmmu should be reviewed and optimised by someone who knows > it... I'm not sure there is anyone who knows it at this point. I probably know it as well as anybody, and the ppc32 code scares me. 
It's a crufty mess and it would be nice to clean up, but that requires someone with enough time and interest. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On 11/12/2018 01:20, David Gibson wrote: > On Mon, Dec 10, 2018 at 09:54:51PM +0100, BALATON Zoltan wrote: >> On Mon, 10 Dec 2018, David Gibson wrote: >>> On Mon, Dec 10, 2018 at 01:33:53AM +0100, BALATON Zoltan wrote: >>>> On Fri, 7 Dec 2018, Mark Cave-Ayland wrote: >>>>> This patchset is an attempt at trying to improve the VMX (Altivec) instruction >>>>> performance by making use of the new TCG vector operations where possible. >>>> >>>> This is very welcome, thanks for doing this. >>>> >>>>> In order to use TCG vector operations, the registers must be accessible from cpu_env >>>>> whilst currently they are accessed via arrays of static TCG globals. Patches 1-3 >>>>> are therefore mechanical patches which introduce access helpers for FPR, AVR and VSR >>>>> registers using the supplied TCGv_i64 parameter. >>>> >>>> Have you tried some benchmarks or tests to measure the impact of these >>>> changes? I've tried the (very unscientific) benchmarks I've written about >>>> before here: >>>> >>>> http://lists.nongnu.org/archive/html/qemu-ppc/2018-07/msg00261.html >>>> >>>> (which seem to use AltiVec/VMX instructions but not sure which) on mac99 >>>> with MorphOS and I could not see any performance increase. I haven't run >>>> enough tests but results with or without this series on master were mostly >>>> the same within a few percents, and sometimes even seen lower performance >>>> with these patches than without. I haven't tried to find out why (no time >>>> for that now) so can't really draw any conclusions from this. I'm also not >>>> sure if I've actually tested what you've changed or these use instructions >>>> that your patches don't optimise yet, or the changes I've seen were just >>>> normal changes between runs; but I wonder if the increased number of >>>> temporaries could result in lower performance in some cases? >>> >>> What was your host machine. 
IIUC this change will only improve >>> performance if the host tcg backend is able to implement TCG vector >>> ops in terms of vector ops on the host. >> >> Tried it on i5 650 which has: sse sse2 ssse3 sse4_1 sse4_2. I assume x86_64 >> should be supported but not sure what are the CPU requirements. >> >>> In addition, this series only converts a subset of the integer and >>> logical vector instructions. If your testcase is mostly floating >>> point (vectored or otherwise), it will still be softfloat and so not >>> see any speedup. >> >> Yes, I don't really know what these tests use but I think "lame" test is >> mostly floating point but tried with "lame_vmx" which should at least use >> some vector ops and "mplayer -benchmark" test is more vmx dependent based on >> my previous profiling and testing with hardfloat but I'm not sure. (When >> testing these with hardfloat I've found that lame was benefiting from >> hardfloat but mplayer wasn't and more VMX related functions showed up with >> mplayer so I assumed it's more VMX bound.) > > I should clarify here. When I say "floating point" above, I'm not > meaning things using the regular FPU instead of the vector unit. I'm > saying *anything* involving floating point calculations whether > they're done in the FPU or the vector unit. > > The patches here don't convert all VMX instructions to use vector TCG > ops - they only convert a few, and those few are about using the > vector unit for integer (and logical) operations. VMX instructions > involving floating point calculations are unaffected and will still > use soft-float. Right. As I mentioned in an earlier email, this is hopefully laying the groundwork for future evolution of the TCG vector operations and making use of the existing functions first. Certainly I'd be interested at looking at the hardfloat patches after this, since FP performance is something that can offer enormous benefit to MacOS emulation. 
>> I've tried to do some profiling again to find out what's used but I can't >> get good results with the tools I have (oprofile stopped working since I've >> updated my machine and Linux perf provides results that are hard to >> interpret for me, haven't tried if gprof would work now it didn't before) >> but I've seen some vector related helpers in the profile so at least some >> vector ops are used. The "helper_vperm" came up top at about 11th (not sure >> where is it called from), other vector helpers were lower. >> >> I don't remember details now but previously when testing hardfloat I've >> written this: "I've looked at vperm which came out top in one of the >> profiles I've taken and on little endian hosts it has the loop backwards and >> also accesses vector elements from end to front which I wonder may be enough >> for the compiler to not be able to optimise it? But I haven't checked >> assembly. The altivec dependent mplayer video decoding test did not change >> much with hardfloat, it took 98% compared to master so likely altivec is >> dominating here." (Although this was with the PPC specific vector helpers >> before VMX patch so not sure if this is still relevant.) >> >> The top 10 in profile were still related to low level memory access and MMU >> management stuff as I've found before: >> >> http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03609.html >> http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03704.html >> >> I think implementing i2c for mac99 may help this and some other >> optimisations may also be possible but I don't know enough about these to >> try that. >> >> It also looks like with --enable-debug something is always flusing tlb and >> blowing away tb caches so these will be top in profile and likely dominate >> runtime so can't really use profile to measure impact of VMX patch. Without >> --enable-debug I can't get call graphs so can't get useful profile. 
I think >> I've looked at this before as well but can't remember now which check >> enabled by --enable-debug is responsible for constant tb cache flush and if >> that could be avoided. I just don't use --enable-debug since unless need to >> debug somthing. >> >> Maybe the PPC softmmu should be reviewed and optimised by someone who knows >> it... > > I'm not sure there is anyone who knows it at this point. I probably > know it as well as anybody, and the ppc32 code scares me. It's a > crufty mess and it would be nice to clean up, but that requires > someone with enough time and interest. As a newcomer to this code, I don't find it particularly easy to read. It strikes me that one of the main improvements would be to switch over to generated helper templates rather than the nested macros currently in use. Looking at your profiles above, the primary hotspot appears to be helper_lookup_tb_ptr(). However as someone quite new to the TCG parts of QEMU, I couldn't tell you whether or not this is to be expected. Perhaps a question for Richard: what could we consider to be a "normal" backend when looking at profiles in terms of recommended features to implement, and to get an idea of what a typical profile should look like? It would be interesting to compare with a similar workload profile on e.g. ARM to see if there is anything obvious that stands out for PPC. ATB, Mark.
On 12/11/18 1:35 PM, Mark Cave-Ayland wrote: > Looking at your profiles above, the primary hotspot appears to be > helper_lookup_tb_ptr(). However as someone quite new to the TCG parts of QEMU, I > couldn't tell you whether or not this is to be expected. > > Perhaps a question for Richard: what could we consider to be a "normal" backend when > looking at profiles in terms of recommended features to implement, and to get an idea > of what a typical profile should look like? > > It would be interesting to compare with a similar workload profile on e.g. ARM to see > if there is anything obvious that stands out for PPC. For Alpha, which has relatively sane tlb management, the top entry in the profile is helper_lookup_tb_ptr at about 8%. Otherwise, somewhere in the top 2 or 3 functions will be helper_le_ldq_mmu at about 2%. That's probably a best-case scenario. r~
On Tue, 11 Dec 2018, David Gibson wrote: > On Mon, Dec 10, 2018 at 09:54:51PM +0100, BALATON Zoltan wrote: >> Yes, I don't really know what these tests use but I think "lame" test is >> mostly floating point but tried with "lame_vmx" which should at least use >> some vector ops and "mplayer -benchmark" test is more vmx dependent based on >> my previous profiling and testing with hardfloat but I'm not sure. (When >> testing these with hardfloat I've found that lame was benefiting from >> hardfloat but mplayer wasn't and more VMX related functions showed up with >> mplayer so I assumed it's more VMX bound.) > > I should clarify here. When I say "floating point" above, I'm not > meaning things using the regular FPU instead of the vector unit. I'm > saying *anything* involving floating point calculations whether > they're done in the FPU or the vector unit. OK that clarifies it. I admit I was only testing these but didn't have time to look at what changed exactly. > The patches here don't convert all VMX instructions to use vector TCG > ops - they only convert a few, and those few are about using the > vector unit for integer (and logical) operations. VMX instructions > involving floating point calculations are unaffected and will still > use soft-float. What I've said above about the lame test being more FPU intensive and mplayer more VMX intensive probably still holds, as I've retried now on a Haswell i5 and got a 1-2% difference with lame_vmx and ~6% with mplayer. That's very little improvement, but if only some VMX instructions should be faster then this may make sense. These tests are not the best; maybe there are better ways to measure this but I don't know of any. >> Maybe the PPC softmmu should be reviewed and optimised by someone who knows >> it... > I'm not sure there is anyone who knows it at this point. I probably know it as well as anybody, and the ppc32 code scares me.
It's a > crufty mess and it would be nice to clean up, but that requires > someone with enough time and interest. At least this seems to be a big bottleneck in PPC emulation and one that's not being worked on (others like hardfloat and VMX are not finished and there is still a lot to do, but they already show some results, while no one is looking at softmmu). I was just trying to direct some attention to the fact that softmmu may also need some optimisation, and hope someone would notice this. I have some interest but not much time these days, and if it scares you, what should I say? I don't even understand most of it, so it would take a lot of time to even work out how it works and what would need to be done. So I hope someone with more time or knowledge shows up and maybe at least provides some hints on what may need to be done. Regards, BALATON Zoltan
On Dec 7, 2018 9:59 AM, "Mark Cave-Ayland" <mark.cave-ayland@ilande.co.uk> wrote: > > This patchset is an attempt at trying to improve the VMX (Altivec) instruction > performance by making use of the new TCG vector operations where possible. > Hello, Mark. I just want to say that I support these efforts. Very interesting, it can bring significant improvements for multimedia intensive emulations. But, even more important, we can see new tcg vector interface in action, possibly suggest improvements, extensions. I would just like to ask you to add some performance comparisons, if possible. Very simple tests would be sufficient. Thanks again for this series! Aleksandar > In order to use TCG vector operations, the registers must be accessible from cpu_env > whilst currently they are accessed via arrays of static TCG globals. Patches 1-3 > are therefore mechanical patches which introduce access helpers for FPR, AVR and VSR > registers using the supplied TCGv_i64 parameter. > > Once this is done, patch 4 enables us to remove the static TCG global arrays and updates > the access helpers to read/write to the relevant fields in cpu_env directly. > > The final patches 5 and 6 convert the VMX logical instructions and addition/subtraction > instructions respectively over to the TCG vector operations. > > NOTE: there are a lot of instructions that cannot (yet) be optimised to use TCG vector > operations, however it struck me that there may be some potential for converting > saturating add/sub and cmp instructions if there were a mechanism to return a set of > flags indicating the result of the saturation/comparison. > > Finally thanks to Richard for taking the time to answer some of my (mostly beginner) > questions related to TCG. 
> > Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> > > > Mark Cave-Ayland (6): > target/ppc: introduce get_fpr() and set_fpr() helpers for FP register > access > target/ppc: introduce get_avr64() and set_avr64() helpers for VMX > register access > target/ppc: introduce get_cpu_vsr{l,h}() and set_cpu_vsr{l,h}() > helpers for VSR register access > target/ppc: switch FPR, VMX and VSX helpers to access data directly > from cpu_env > target/ppc: convert VMX logical instructions to use vector operations > target/ppc: convert vaddu[b,h,w,d] and vsubu[b,h,w,d] over to use > vector operations > > target/ppc/helper.h | 8 - > target/ppc/int_helper.c | 7 - > target/ppc/translate.c | 72 ++-- > target/ppc/translate/fp-impl.inc.c | 492 ++++++++++++++++++----- > target/ppc/translate/vmx-impl.inc.c | 182 ++++++--- > target/ppc/translate/vsx-impl.inc.c | 782 ++++++++++++++++++++++++++---------- > 6 files changed, 1110 insertions(+), 433 deletions(-) > > -- > 2.11.0 > >
On 10/12/2018 13:04, Aleksandar Markovic wrote: > On Dec 7, 2018 9:59 AM, "Mark Cave-Ayland" <mark.cave-ayland@ilande.co.uk> > wrote: >> >> This patchset is an attempt at trying to improve the VMX (Altivec) > instruction >> performance by making use of the new TCG vector operations where possible. >> > > Hello, Mark. > > I just want to say that I support these efforts. Very interesting, it can > bring significant improvements for multimedia intensive emulations. But, > even more important, we can see new tcg vector interface in action, > possibly suggest improvements, extensions. > > I would just like to ask you to add some performance comparisons, if > possible. Very simple tests would be sufficient. > > Thanks again for this series! Thanks Aleksandar! In my local tests I haven't done much in the way of benchmarking, other than to check the output of the disassembler to confirm that the x86 vector opcodes are being used. Given that only a small number of TCG vector operations are currently suitable for use with the VMX code, I would expect certain artificial benchmarks to show improvements, e.g. with the logical ops, but there was certainly nothing noticeable in normal usage with my OpenBIOS test images. I agree with you that having a second implementation other than ARM will help provide ideas as to how the vector interfaces can be evolved to support more operations in future. ATB, Mark.