[v1] Idea for using hardfloat in PPC

[RFC PATCH 0/4] Idea for using hardfloat in PPC

Posted by Víctor Colombo 3 years, 4 months ago

As can be seem in the mailing thread that added hardfloat support in
QEMU [1], a requirement for it to work is to have float_flag_inexact
set when entering the API in softfloat.c. However, in the same thread,
it was explained that PPC target would not work by default with this
implementation.
The problem is that PPC has a non-sticky inexact bit (there is a
discussion about it in [2]), meaning that we can't just set the flag
and call the API in softfloat.c, as it would return the same flag set
to 1, and we wouldn't know if it is supposed to be updated on FPSCR or
not.
Over the last couple years, there were attempts to enable hardfpu
for Power, like [3]. But nothing got to master.
[5] shows a suggestion by Yonggang Luo and commentaries by Richard and
Zoltan, about caching the last FP instruction and reexecuting it when
necessary.

This patch set is a proposition on the idea to cache the last FP insn,
to be reexecuted later when the value of FPSCR is to be read by a
program. When executed in hardfloat, the instruction "context" is saved
inside `env`, and is expected to be reexecuted later, in softfloat,
to calculate the correct value of the inexact flag in FPSCR.
The instruction to be cached is the last instruction that changes FI.
If the instructions does not change FI, it keeps the cache intact.
If it changes FI, it caches itself and tries to execute in hardfpu.
It might or might not use hardfloat, but as the inexact flag was
artificially set, it will require to be reexecuted later. 'Later'
means when FPSCR is to be read, like during a call to MFFS, or when
a signal occurs. There are probably other places, e.g. other mffs-like
instructions, but this RFC only addresses these two scenarios.
This is supposed to be more efficient because programs very seldomly
read FPSCR, meaning the amount of reexecutions will be low.

For now, this was implemented and tested for linux-user, no softmmu
work or analysis was done.
I implemented the base code to keep all instructions working with
this new behavior (patch 1), and also implemented two instructions
as an example on what it would be necessary to do for every instruction
to use hardfpu (patches 1 and 2).

My tests with risu and other manual tests showed the behavior seems to
be correct. I tested mainly if FPSCR is the same after using softfloat
or hardfloat.

However, the impact in performance was not the expected. In x86_64 I
had a small 3% improvement, while in a Power9 machine there was a small
performance loss, as can be seem below (100 executions).

|        | min [s] | max [s] | avg [s] |
| before | 122.309 | 123.459 | 122.747 |
| after  | 123.906 | 125.016 | 124.373 |

The test code can be found in [4].

The issue is most likely all the overhead with the caching, which is
negating the improvement from hardfpu execution.

With all that said, could you kindly take a look at my implementation
and see if it can be improved to result in better performance? Is there
any chance to save this idea?

Thank you very much!

[1] https://patchwork.kernel.org/project/qemu-devel/patch/20181124235553.17371-8-cota@braap.org/
[2] https://lists.nongnu.org/archive/html/qemu-ppc/2022-05/msg00246.html
[3] https://patchwork.kernel.org/project/qemu-devel/patch/20200218171702.979F074637D@zero.eik.bme.hu/
[4] https://gist.github.com/vcoracolombo/6ad884a402f1bba531e2e3da7e196656
[5] https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg00064.html

Víctor Colombo (4):
  target/ppc: prepare instructions to work with caching last FP insn
  target/ppc: Implement instruction caching for fsqrt
  target/ppc: Implement instruction caching for muladd
  fpu/softfloat: Enable hardfpu for ppc target

 fpu/softfloat.c                    |   6 +-
 target/ppc/cpu.h                   |  28 ++++++
 target/ppc/excp_helper.c           |   2 +
 target/ppc/fpu_helper.c            | 132 +++++++++++++++++++++++++++++
 target/ppc/helper.h                |   1 +
 target/ppc/translate/fp-impl.c.inc |   1 +
 6 files changed, 166 insertions(+), 4 deletions(-)

-- 
2.25.1

Re: [RFC PATCH 0/4] Idea for using hardfloat in PPC

Posted by Richard Henderson 3 years, 4 months ago

On 10/5/22 07:37, Víctor Colombo wrote:
> However, the impact in performance was not the expected. In x86_64 I
> had a small 3% improvement, while in a Power9 machine there was a small
> performance loss, as can be seem below (100 executions).
> 
> |        | min [s] | max [s] | avg [s] |
> | before | 122.309 | 123.459 | 122.747 |
> | after  | 123.906 | 125.016 | 124.373 |

I hope this is because you didn't handle the most common cases: add, sub, mul, div.

The logic seems plausible, as far as it goes, and would work for the FR bit as well which 
afair isn't handled at all at the moment.  I'll review properly in a little while.

r~

Re: [RFC PATCH 0/4] Idea for using hardfloat in PPC

Posted by Alex Bennée 3 years, 4 months ago

Richard Henderson <richard.henderson@linaro.org> writes:

> On 10/5/22 07:37, Víctor Colombo wrote:
>> However, the impact in performance was not the expected. In x86_64 I
>> had a small 3% improvement, while in a Power9 machine there was a small
>> performance loss, as can be seem below (100 executions).
>> |        | min [s] | max [s] | avg [s] |
>> | before | 122.309 | 123.459 | 122.747 |
>> | after  | 123.906 | 125.016 | 124.373 |
>
> I hope this is because you didn't handle the most common cases: add, sub, mul, div.
>
> The logic seems plausible, as far as it goes, and would work for the
> FR bit as well which afair isn't handled at all at the moment.  I'll
> review properly in a little while.

I wonder if this is something that could be generalised and pushed up
into the fpu stuff itself. We could after all cache the op and
decomposed parameters here in a generic way. The trick would be working
out how to do that without slowing down the current common case.

Is ppc unique in not persisting the inexact flag from previous
operations?

>
>
> r~

-- 
Alex Bennée

Re: [RFC PATCH 0/4] Idea for using hardfloat in PPC

Posted by Richard Henderson 3 years, 4 months ago

On 10/7/22 06:42, Alex Bennée wrote:
> Is ppc unique in not persisting the inexact flag from previous
> operations?

Better phrased as "having an additional per-operation flags for inexact and 'rounded'", 
because ppc also has the standard ieee sticky inexact flag.  But yes, as far as I know ppc 
is unique with this.

r~