fpu/softfloat.c | 14 +++++++------- target/ppc/cpu.h | 2 ++ target/ppc/fpu_helper.c | 7 ++++++- target/ppc/translate_init.inc.c | 2 ++ 4 files changed, 17 insertions(+), 8 deletions(-)
Hello, This is an RFC series to start exploring the possibility of enabling hardfloat for PPC target that haven't progressed in the last two years. Hopefully we can work out something now. Previously I've explored this here: https://lists.nongnu.org/archive/html/qemu-ppc/2018-07/msg00261.html where some ad-hoc benchmarks using lame mp3 encoder is also explained that has two versions: one using VMX and another only using FP. Both are mostly floating point bounded. I've run this test on mac99 under MorphOS before and after my patches, also verifying that md5sum of resulting mp3 matches (this is no proof for correctness but maybe shows it did not break too much at least those ops used by this program). I've got these measurements on an Intel i7-9700K CPU @ 3.60GHz (did not bother to take multiple samples so these are just approximate): 1) before patch series using softfloat: lame: 4:01, lame_vmx: 3:14 2) only enabling hardfloat in fpu/softfloat.c without other changes: lame: 4:06, lame_vmx: 2:06 (this shows why hardfloat was disabled but VMX can benefit from this) 3) with this series, hardfloat=true: lame: 3:15, lame_vmx: 1:59 (so the patch does something even if there should be more places to preserve inexact flag to fully use hardfloat) 4) with this series but forcing softfloat with hardfloat=false: lame: 4:11, lame_vmx: 2:08 (unfortunately it's slower than before, likely due to adding if () to helper_reset_fpstatus() that should be avoided to at least get back previous hardfloat enabled case that's still slower than softfloat so this series only makes sense if the default can be hardfloat=true at the moment but even that would need more testing) I hope others can contribute to this by doing more testing to find out what else this would break or give some ideas how this could be improved. Regards, BALATON Zoltan BALATON Zoltan (2): target/ppc/cpu: Add hardfloat property target/ppc: Enable hardfloat for PPC fpu/softfloat.c | 14 +++++++------- target/ppc/cpu.h | 2 ++ target/ppc/fpu_helper.c | 7 ++++++- target/ppc/translate_init.inc.c | 2 ++ 4 files changed, 17 insertions(+), 8 deletions(-) -- 2.21.1
On Mon, 17 Feb 2020 at 02:43, BALATON Zoltan <balaton@eik.bme.hu> wrote: > > Hello, > > This is an RFC series to start exploring the possibility of enabling > hardfloat for PPC target that haven't progressed in the last two years. > Hopefully we can work out something now. Previously I've explored this > here: > > https://lists.nongnu.org/archive/html/qemu-ppc/2018-07/msg00261.html > > where some ad-hoc benchmarks using lame mp3 encoder is also explained > that has two versions: one using VMX and another only using FP. Both > are mostly floating point bounded. I've run this test on mac99 under > MorphOS before and after my patches, also verifying that md5sum of > resulting mp3 matches (this is no proof for correctness but maybe > shows it did not break too much at least those ops used by this > program). > I hope others can contribute to this by doing more testing to find out > what else this would break or give some ideas how this could be > improved. I think the ideal would be to test against a reference using risu to see whether this changes behaviour (FP results should be bit-for-bit identical; usually application level testing is often not sufficient to detect this). You could test either against real hardware or against the non-hardfloat QEMU. I'm not sure how comprehensive the coverage for ppc insns is but there are a fair number of fp insns covered already: https://git.linaro.org/people/peter.maydell/risu.git/tree/ It's also worth testing any alternate/non-standard config modes the FPU might have (eg different default rounding modes, any flush-to-zero or alternate denormal handling, that kind of thing), and not just the default how-the-CPU-boots-up mode. thanks -- PMM
On Mon, 17 Feb 2020, Peter Maydell wrote:
> On Mon, 17 Feb 2020 at 02:43, BALATON Zoltan <balaton@eik.bme.hu> wrote:
>> Hello,
>>
>> This is an RFC series to start exploring the possibility of enabling
>> hardfloat for PPC target that haven't progressed in the last two years.
>> Hopefully we can work out something now. Previously I've explored this
>> here:
>>
>> https://lists.nongnu.org/archive/html/qemu-ppc/2018-07/msg00261.html
>>
>> where some ad-hoc benchmarks using lame mp3 encoder is also explained
>> that has two versions: one using VMX and another only using FP. Both
>> are mostly floating point bounded. I've run this test on mac99 under
>> MorphOS before and after my patches, also verifying that md5sum of
>> resulting mp3 matches (this is no proof for correctness but maybe
>> shows it did not break too much at least those ops used by this
>> program).
>
>> I hope others can contribute to this by doing more testing to find out
>> what else this would break or give some ideas how this could be
>> improved.
>
> I think the ideal would be to test against a reference using
> risu to see whether this changes behaviour (FP results should
> be bit-for-bit identical; usually application level testing is
> often not sufficient to detect this). You could test either
Sure, thanks. I did not mean to claim the simple test I've done was
sufficient but I expect others who have interest in this and more
experienced in such testing (or even being payed to work on QEMU which I'm
not) contribute to this so I did not try testing it more throughly than
just showing it could be considerably faster and still work fot at least
some workloads so it's worth working on. I'm surprised that in the two
years since hardfloat was merged nobody even tried this (or those who did
dropped the idea before any results without letting us know). So I tried
to make a start with it to explore what would it take to fix this
eventually but I don't want to do that alone. I hope this inspires others
to help e.g. in thesting and we can reach a solution together.
> against real hardware or against the non-hardfloat QEMU.
> I'm not sure how comprehensive the coverage for ppc insns
> is but there are a fair number of fp insns covered already:
> https://git.linaro.org/people/peter.maydell/risu.git/tree/
I don't have real hardware and testing against QEMU may take longer and
not sure how useful. There could also be preexisting bugs, although some
fixes were made to PPC FP implementation recently. Maybe I'll have a look
if have no better things to do but I have other ongoing QEMU related
projects as well that I might try to make some progress as well.
> It's also worth testing any alternate/non-standard config
> modes the FPU might have (eg different default rounding modes,
> any flush-to-zero or alternate denormal handling, that kind
> of thing), and not just the default how-the-CPU-boots-up mode.
It is expected to break inexact exceptions currently until a better way
can be found to handle those but I think hardfloat is already disabled for
other than default rounding modes or FPU settings so maybe those should
not break. According to:
https://git.qemu.org/?p=qemu.git;a=blob;f=fpu/softfloat.c;h=301ce3b537b6c0eee5dbbc358587b66a3a341d2a;hb=HEAD#l235
235 static inline bool can_use_fpu(const float_status *s)
236 {
237 if (QEMU_NO_HARDFLOAT) {
238 return false;
239 }
240 return likely(s->float_exception_flags & float_flag_inexact &&
241 s->float_rounding_mode == float_round_nearest_even);
242 }
243
and
https://git.qemu.org/?p=qemu.git;a=blob;f=fpu/softfloat.c;h=301ce3b537b6c0eee5dbbc358587b66a3a341d2a;hb=HEAD#l99
99 /*
100 * Hardfloat
101 *
102 * Fast emulation of guest FP instructions is challenging for two reasons.
103 * First, FP instruction semantics are similar but not identical, particularly
104 * when handling NaNs. Second, emulating at reasonable speed the guest FP
105 * exception flags is not trivial: reading the host's flags register with a
106 * feclearexcept & fetestexcept pair is slow [slightly slower than soft-fp],
107 * and trapping on every FP exception is not fast nor pleasant to work with.
108 *
109 * We address these challenges by leveraging the host FPU for a subset of the
110 * operations. To do this we expand on the idea presented in this paper:
111 *
112 * Guo, Yu-Chuan, et al. "Translating the ARM Neon and VFP instructions in a
113 * binary translator." Software: Practice and Experience 46.12 (2016):1591-1615.
114 *
115 * The idea is thus to leverage the host FPU to (1) compute FP operations
116 * and (2) identify whether FP exceptions occurred while avoiding
117 * expensive exception flag register accesses.
118 *
119 * An important optimization shown in the paper is that given that exception
120 * flags are rarely cleared by the guest, we can avoid recomputing some flags.
121 * This is particularly useful for the inexact flag, which is very frequently
122 * raised in floating-point workloads.
123 *
124 * We optimize the code further by deferring to soft-fp whenever FP exception
125 * detection might get hairy. Two examples: (1) when at least one operand is
126 * denormal/inf/NaN; (2) when operands are not guaranteed to lead to a 0 result
127 * and the result is < the minimum normal.
128 */
129 #define GEN_INPUT_FLUSH__NOCHECK(name, soft_t) \
130 static inline void name(soft_t *a, float_status *s) \
131 { \
132 if (unlikely(soft_t ## _is_denormal(*a))) { \
133 *a = soft_t ## _set_sign(soft_t ## _zero, \
134 soft_t ## _is_neg(*a)); \
135 s->float_exception_flags |= float_flag_input_denormal; \
136 } \
137 }
Regards,
BALATON Zoltan
© 2016 - 2025 Red Hat, Inc.