[v2] Idea for using hardfloat in PPC

[RFC PATCH v2 0/5] Idea for using hardfloat in PPC
Posted by Víctor Colombo 3 years, 3 months ago
As can be seen in the mailing list thread that added hardfloat support
to QEMU [1], a requirement for it to work is to have float_flag_inexact
set when entering the API in softfloat.c. However, in the same thread,
it was explained that the PPC target would not work by default with
this implementation.
The problem is that PPC has a non-sticky inexact bit (there is a
discussion about it in [2]), meaning that we can't just set the flag
and call the API in softfloat.c, as it would return the same flag set
to 1, and we wouldn't know if it is supposed to be updated on FPSCR or
not.
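
To make the problem concrete, here is a minimal toy model in C. It is
not QEMU code: `ToyStatus`, `TOY_FLAG_INEXACT` and the `toy_*` functions
are hypothetical names that only mimic the role of a sticky flags field
such as the one softfloat accumulates into. Only an add is modeled, and
its inexactness is detected with the classic TwoSum error term (assumes
round-to-nearest and no overflow).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a sticky inexact flag. */
enum { TOY_FLAG_INEXACT = 1u << 0 };

typedef struct {
    uint8_t flags; /* sticky: operations only ever OR bits in */
} ToyStatus;

/* TwoSum: err is the exact rounding error of a + b. */
static bool toy_add_is_inexact(double a, double b)
{
    volatile double s = a + b;
    volatile double bb = s - a;
    double err = (a - (s - bb)) + (b - bb);
    return err != 0.0;
}

/* Toy "softfloat-like" add: raises the flag, never clears it. */
static double toy_add(double a, double b, ToyStatus *st)
{
    assert(st != NULL);
    if (toy_add_is_inexact(a, b)) {
        st->flags |= TOY_FLAG_INEXACT;
    }
    return a + b;
}

/* Per-instruction FI: start from a clean flag, then look at it. */
static bool toy_fi_clean_flag(double a, double b)
{
    ToyStatus st = { 0 };
    toy_add(a, b, &st);
    return (st.flags & TOY_FLAG_INEXACT) != 0;
}

/* Flag pre-set to satisfy the hardfloat requirement: always reads 1. */
static bool toy_fi_preset_flag(double a, double b)
{
    ToyStatus st = { TOY_FLAG_INEXACT };
    toy_add(a, b, &st);
    return (st.flags & TOY_FLAG_INEXACT) != 0;
}
```

With a clean flag, `1.0 + 2.0` reports exact and `1.0 + 1e-30` reports
inexact. With the flag pre-set, both report inexact, so the result can
no longer be used to update a non-sticky bit.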
Over the last couple of years, there were attempts to enable hardfpu
for Power, like [3], but nothing got to master.
[5] shows a suggestion by Yonggang Luo and comments by Richard and
Zoltan about caching the last FP instruction and reexecuting it when
necessary.

This patch set is a proposal for the idea of caching the last FP insn,
to be reexecuted later when the value of FPSCR is to be read by a
program. When executed in hardfloat, the instruction "context" is saved
inside `env`, and is expected to be reexecuted later, in softfloat,
to calculate the correct value of the inexact flag in FPSCR.
The instruction to be cached is the last instruction that changes FI.
If an instruction does not change FI, it leaves the cache intact.
If it changes FI, it caches itself and tries to execute in hardfpu.
It might or might not use hardfloat, but as the inexact flag was
artificially set, it will need to be reexecuted later. 'Later'
means when FPSCR is to be read, like during a call to MFFS, or when
a signal occurs. There are probably other places, e.g. other mffs-like
instructions, but this RFC only addresses these two scenarios.
This is supposed to be more efficient because programs seldom
read FPSCR, meaning the number of reexecutions will be low.
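
The caching scheme described above can be sketched as a small toy model
in C. This is only an illustration under stated assumptions, not the
code from the patches: `ToyEnv`, `toy_fadd` and `toy_read_fi` are
hypothetical names, only an add is modeled, and the "softfloat"
reexecution is replaced by a TwoSum check of whether the add was exact
(round-to-nearest, no overflow).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOY_OP_NONE, TOY_OP_ADD } ToyOp;

/* Hypothetical per-CPU state: the FI bit plus the cached insn context. */
typedef struct {
    bool fi;               /* architected, non-sticky inexact bit */
    bool fi_stale;         /* true while FI still has to be recomputed */
    ToyOp op;              /* last instruction that changes FI */
    double a, b;           /* its operands */
    unsigned reexec_count; /* how many reexecutions were needed */
} ToyEnv;

/* Stand-in for the slow path: exact rounding error of a + b (TwoSum). */
static bool toy_add_is_inexact(double a, double b)
{
    volatile double s = a + b;
    volatile double bb = s - a;
    double err = (a - (s - bb)) + (b - bb);
    return err != 0.0;
}

/* Fast path: compute on the host FPU and only remember the context. */
static double toy_fadd(ToyEnv *env, double a, double b)
{
    assert(env != NULL);
    env->op = TOY_OP_ADD;
    env->a = a;
    env->b = b;
    env->fi_stale = true; /* FI is unknown until someone reads it */
    return a + b;
}

/* Reading FI (think mffs): reexecute the cached insn only if needed. */
static bool toy_read_fi(ToyEnv *env)
{
    assert(env != NULL);
    if (env->fi_stale) {
        if (env->op == TOY_OP_ADD) {
            env->fi = toy_add_is_inexact(env->a, env->b);
        }
        env->fi_stale = false;
        env->reexec_count++;
    }
    return env->fi;
}
```

In this model, any number of `toy_fadd` calls between two reads costs a
single reexecution, of the last one only, which is the property the
approach relies on.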

For now, this was implemented and tested for linux-user only; no
softmmu work or analysis was done.
I implemented the base code to keep all instructions working with
this new behavior (patch 1), and also implemented some instructions
as an example of what would need to be done for every instruction
to use hardfpu (patches 2, 3 and 4).

My tests with risu and other manual tests showed the behavior seems to
be correct. I mainly tested whether FPSCR is the same after using
softfloat or hardfloat.

In v1 of this RFC I reported a performance regression with the
implementation. However, the test I crafted [4] was supposed to be a
mix of many hardfloats and some softfloat fallbacks (instructions
fall back to softfloat in special cases, e.g. a negative argument
for sqrt). What was actually happening was that there was a huge number
of fallbacks and not many hardfloats. The expected
'normal scenario' is to have a lot of valid, 'happy path' instructions
that can use hardfloat.
So, what I did for v2 is to create two tests, one that would hit 100%
hardfloat, and one that would fall back 100% to softfloat. I present
the results below. The tests are not comparable with each other,
neither the two new ones nor the previous one from v1, so each is
supposed to be analyzed on its own.

100% hardfloat (1:1 mix of fsqrt and fmadd) [6]
|                 | min [s] | max [s] | avg [s] |
| before (master) | 30.731  | 31.420  | 31.186  |
| after changes   | 20.860  | 21.100  | 20.989  |
(approx. 1.5x speedup)

100% softfloat (1:1 mix of fsqrt and fmadd) [7]
|                 | min [s] | max [s] | avg [s] |
| before (master) | 22.684  | 23.152  | 22.868  |
| after changes   | 25.098  | 25.397  | 25.281  |
(approx. 0.9x of old performance)

This is way better than what I previously reported, and is a result
that might justify going forward with this idea. The only problem
is the performance impact when hardfloat cannot be used. I expect
that most real-life use cases will hit hardfloat almost 100% of the
time, so this might not be a big issue. Opinions on this?

You can see that I actually added a new commit to this RFC,
implementing the idea also for add, sub, mul, and div. I tested the old
test with this new commit, and the result was not better. So the new
patch was not responsible for the performance gain; the old test itself
was bad.

As I did not test the code in softmmu or bsd-user (does bsd-user work
for PPC?), I added some build-time checks to only enable this RFC for
linux-user. I'm pretty confident that making this work for softmmu will
need changes in other places in the code. But I'm focusing on
linux-user for now.
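
As an illustration of what such a build-time check can look like, here
is a minimal sketch. It assumes the build defines `CONFIG_LINUX_USER`
for linux-user targets; `TOY_PPC_USE_HARDFLOAT` and
`toy_can_try_hardfloat` are hypothetical names, and the actual checks
in the patches may be written differently.

```c
#include <stdbool.h>

/* Assumption: CONFIG_LINUX_USER is defined only in linux-user builds. */
#if defined(CONFIG_LINUX_USER)
#define TOY_PPC_USE_HARDFLOAT 1
#else
#define TOY_PPC_USE_HARDFLOAT 0
#endif

/*
 * Hypothetical gate: take the fast path only when the feature was
 * compiled in and the inexact flag is already set on entry.
 */
static bool toy_can_try_hardfloat(bool inexact_flag_set)
{
    return TOY_PPC_USE_HARDFLOAT && inexact_flag_set;
}
```

In any other build the macro is 0, the gate is always false, and every
instruction keeps using the existing softfloat path.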

Thank you very much!

[1] https://patchwork.kernel.org/project/qemu-devel/patch/20181124235553.17371-8-cota@braap.org/
[2] https://lists.nongnu.org/archive/html/qemu-ppc/2022-05/msg00246.html
[3] https://patchwork.kernel.org/project/qemu-devel/patch/20200218171702.979F074637D@zero.eik.bme.hu/
[4] https://gist.github.com/vcoracolombo/6ad884a402f1bba531e2e3da7e196656
[5] https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg00064.html
[6] https://gist.github.com/vcoracolombo/f0d8b7c9f1cb63dac6ff0221209ec4ff
[7] https://gist.github.com/vcoracolombo/4b592644517c0efb3854872a4b30f6cc

Víctor Colombo (5):
  target/ppc: prepare instructions to work with caching last FP insn
  target/ppc: Implement instruction caching for fsqrt
  target/ppc: Implement instruction caching for muladd
  target/ppc: Implement instruction caching for add/sub/mul/div
  target/ppc: Enable hardfpu for Power

 fpu/softfloat.c                    |  10 +-
 target/ppc/cpu.h                   |  37 ++++++
 target/ppc/excp_helper.c           |   2 +
 target/ppc/fpu_helper.c            | 186 +++++++++++++++++++++++++++++
 target/ppc/helper.h                |   1 +
 target/ppc/translate/fp-impl.c.inc |   1 +
 6 files changed, 233 insertions(+), 4 deletions(-)

-- 
2.25.1