The appended patch paves the way for leveraging the host FPU for a subset
of guest FP operations. For most guest workloads (e.g. FP flags
aren't ever cleared, inexact occurs often and rounding is set to the
default [to nearest]) this will yield sizable performance speedups.

The approach followed here avoids checking the FP exception flags register.
See the comment at the top of hostfloat.c for details.

This assumes that QEMU is running on an IEEE754-compliant FPU and
that the rounding is set to the default (to nearest). The
implementation-dependent specifics of the FPU should not matter; things
like tininess detection and snan representation are still dealt with in
soft-fp. However, this approach will break on most hosts if we compile
QEMU with flags such as -ffast-math. We control the flags so this should
be easy to enforce though.

The licensing in softfloat.h is complicated at best, so to keep things
simple I'm adding this as a separate, GPL'ed file.

This patch just adds some boilerplate code; subsequent patches add
operations, one per commit to ease bisection.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
Makefile.target | 2 +-
include/fpu/hostfloat.h | 14 +++++++
include/fpu/softfloat.h | 1 +
fpu/hostfloat.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++
target/m68k/Makefile.objs | 2 +-
tests/fp-test/Makefile | 2 +-
6 files changed, 114 insertions(+), 3 deletions(-)
create mode 100644 include/fpu/hostfloat.h
create mode 100644 fpu/hostfloat.c
diff --git a/Makefile.target b/Makefile.target
index 6549481..efcdfb9 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -97,7 +97,7 @@ obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-op.o tcg/tcg-op-vec.o tcg/tcg-op-gvec.o
obj-$(CONFIG_TCG) += tcg/tcg-common.o tcg/optimize.o
obj-$(CONFIG_TCG_INTERPRETER) += tcg/tci.o
obj-$(CONFIG_TCG_INTERPRETER) += disas/tci.o
-obj-y += fpu/softfloat.o
+obj-y += fpu/softfloat.o fpu/hostfloat.o
obj-y += target/$(TARGET_BASE_ARCH)/
obj-y += disas.o
obj-$(call notempty,$(TARGET_XML_FILES)) += gdbstub-xml.o
diff --git a/include/fpu/hostfloat.h b/include/fpu/hostfloat.h
new file mode 100644
index 0000000..b01291b
--- /dev/null
+++ b/include/fpu/hostfloat.h
@@ -0,0 +1,14 @@
+/*
+ * Copyright (C) 2018, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#ifndef HOSTFLOAT_H
+#define HOSTFLOAT_H
+
+#ifndef SOFTFLOAT_H
+#error fpu/hostfloat.h must only be included from softfloat.h
+#endif
+
+#endif /* HOSTFLOAT_H */
diff --git a/include/fpu/softfloat.h b/include/fpu/softfloat.h
index 8fb44a8..8963b68 100644
--- a/include/fpu/softfloat.h
+++ b/include/fpu/softfloat.h
@@ -95,6 +95,7 @@ enum {
};

#include "fpu/softfloat-types.h"
+#include "fpu/hostfloat.h"

static inline void set_float_detect_tininess(int val, float_status *status)
{
diff --git a/fpu/hostfloat.c b/fpu/hostfloat.c
new file mode 100644
index 0000000..cab0341
--- /dev/null
+++ b/fpu/hostfloat.c
@@ -0,0 +1,96 @@
+/*
+ * hostfloat.c - FP primitives that use the host's FPU whenever possible.
+ *
+ * Copyright (C) 2018, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ * Fast emulation of guest FP instructions is challenging for two reasons.
+ * First, FP instruction semantics are similar but not identical, particularly
+ * when handling NaNs. Second, emulating at reasonable speed the guest FP
+ * exception flags is not trivial: reading the host's flags register with a
+ * feclearexcept & fetestexcept pair is slow [slightly slower than soft-fp],
+ * and trapping on every FP exception is not fast nor pleasant to work with.
+ *
+ * This module leverages the host FPU for a subset of the operations. To
+ * do this it follows the main idea presented in this paper:
+ *
+ * Guo, Yu-Chuan, et al. "Translating the ARM Neon and VFP instructions in a
+ * binary translator." Software: Practice and Experience 46.12 (2016):1591-1615.
+ *
+ * The idea is thus to leverage the host FPU to (1) compute FP operations
+ * and (2) identify whether FP exceptions occurred while avoiding
+ * expensive exception flag register accesses.
+ *
+ * An important optimization shown in the paper is that given that exception
+ * flags are rarely cleared by the guest, we can avoid recomputing some flags.
+ * This is particularly useful for the inexact flag, which is very frequently
+ * raised in floating-point workloads.
+ *
+ * We optimize the code further by deferring to soft-fp whenever FP
+ * exception detection might get hairy. Fortunately this is not common.
+ */
+#include <math.h>
+
+#include "qemu/osdep.h"
+#include "fpu/softfloat.h"
+
+#define GEN_TYPE_CONV(name, to_t, from_t) \
+ static inline to_t name(from_t a) \
+ { \
+ to_t r = *(to_t *)&a; \
+ return r; \
+ }
+
+GEN_TYPE_CONV(float32_to_float, float, float32)
+GEN_TYPE_CONV(float64_to_double, double, float64)
+GEN_TYPE_CONV(float_to_float32, float32, float)
+GEN_TYPE_CONV(double_to_float64, float64, double)
+#undef GEN_TYPE_CONV
+
+#define GEN_INPUT_FLUSH(soft_t) \
+ static inline __attribute__((always_inline)) void \
+ soft_t ## _input_flush__nocheck(soft_t *a, float_status *s) \
+ { \
+ if (unlikely(soft_t ## _is_denormal(*a))) { \
+ *a = soft_t ## _set_sign(soft_t ## _zero, \
+ soft_t ## _is_neg(*a)); \
+ s->float_exception_flags |= float_flag_input_denormal; \
+ } \
+ } \
+ \
+ static inline __attribute__((always_inline)) void \
+ soft_t ## _input_flush1(soft_t *a, float_status *s) \
+ { \
+ if (likely(!s->flush_inputs_to_zero)) { \
+ return; \
+ } \
+ soft_t ## _input_flush__nocheck(a, s); \
+ } \
+ \
+ static inline __attribute__((always_inline)) void \
+ soft_t ## _input_flush2(soft_t *a, soft_t *b, float_status *s) \
+ { \
+ if (likely(!s->flush_inputs_to_zero)) { \
+ return; \
+ } \
+ soft_t ## _input_flush__nocheck(a, s); \
+ soft_t ## _input_flush__nocheck(b, s); \
+ } \
+ \
+ static inline __attribute__((always_inline)) void \
+ soft_t ## _input_flush3(soft_t *a, soft_t *b, soft_t *c, \
+ float_status *s) \
+ { \
+ if (likely(!s->flush_inputs_to_zero)) { \
+ return; \
+ } \
+ soft_t ## _input_flush__nocheck(a, s); \
+ soft_t ## _input_flush__nocheck(b, s); \
+ soft_t ## _input_flush__nocheck(c, s); \
+ }
+
+GEN_INPUT_FLUSH(float32)
+GEN_INPUT_FLUSH(float64)
+#undef GEN_INPUT_FLUSH
diff --git a/target/m68k/Makefile.objs b/target/m68k/Makefile.objs
index ac61948..2868b11 100644
--- a/target/m68k/Makefile.objs
+++ b/target/m68k/Makefile.objs
@@ -1,5 +1,5 @@
obj-y += m68k-semi.o
obj-y += translate.o op_helper.o helper.o cpu.o
-obj-y += fpu_helper.o softfloat.o
+obj-y += fpu_helper.o softfloat.o hostfloat.o
obj-y += gdbstub.o
obj-$(CONFIG_SOFTMMU) += monitor.o
diff --git a/tests/fp-test/Makefile b/tests/fp-test/Makefile
index 703434f..187cfcc 100644
--- a/tests/fp-test/Makefile
+++ b/tests/fp-test/Makefile
@@ -28,7 +28,7 @@ ibm:
$(WHITELIST_FILES):
wget -nv -O $@ http://www.cs.columbia.edu/~cota/qemu/fpbench-$@

-fp-test$(EXESUF): fp-test.o softfloat.o
+fp-test$(EXESUF): fp-test.o softfloat.o hostfloat.o

clean:
rm -f *.o *.d $(OBJS)
--
2.7.4
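
The header comment in hostfloat.c describes the idea only in prose, and this patch adds no operations yet. The following is a minimal sketch of how such a fast path could look for single-precision addition, assuming round-to-nearest on the host. Every name in it (demo_status, demo_soft_add, demo_host_add, DEMO_FLAG_INEXACT) is hypothetical and is not taken from this series; the "soft" routine is only a placeholder that records that the slow path was taken.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical stand-ins for QEMU's float_status and its flag bits. */
enum { DEMO_FLAG_INEXACT = 1 };

typedef struct {
    unsigned flags;      /* accumulated guest exception flags */
    unsigned slow_calls; /* demo-only counter of slow-path fallbacks */
} demo_status;

/*
 * Placeholder for the soft-fp routine, which would compute both the
 * result and the exception flags in software. Here it only records
 * that the slow path was taken.
 */
static float demo_soft_add(float a, float b, demo_status *s)
{
    s->slow_calls++;
    return a + b;
}

static bool demo_is_zero_or_normal(float x)
{
    int c = fpclassify(x);
    return c == FP_ZERO || c == FP_NORMAL;
}

/*
 * Use the host FPU only when no flag other than inexact can be raised:
 * inexact is already set in the guest status, and both inputs are zero
 * or normal (so no NaN, infinity or denormal is involved).
 */
static float demo_host_add(float a, float b, demo_status *s)
{
    if (!(s->flags & DEMO_FLAG_INEXACT) ||
        !demo_is_zero_or_normal(a) || !demo_is_zero_or_normal(b)) {
        return demo_soft_add(a, b, s);
    }

    float r = a + b;

    /*
     * An infinite result means overflow and a denormal result might
     * involve underflow. Conservatively redo both cases in software.
     */
    if (!demo_is_zero_or_normal(r)) {
        return demo_soft_add(a, b, s);
    }
    return r;
}
```

The point of the sketch is that the host's exception flags register is never read: the checks are done on the operands and on the result, and anything doubtful is handed back to the software implementation.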
On 21/03/2018 at 21:11, Emilio G. Cota wrote:
> The appended patch paves the way for leveraging the host FPU for a subset
> of guest FP operations. For most guest workloads (e.g. FP flags
> aren't ever cleared, inexact occurs often and rounding is set to the
> default [to nearest]) this will yield sizable performance speedups.
>
> The approach followed here avoids checking the FP exception flags register.
> See the comment at the top of hostfloat.c for details.
>
> This assumes that QEMU is running on an IEEE754-compliant FPU and
> that the rounding is set to the default (to nearest). The
> implementation-dependent specifics of the FPU should not matter; things
> like tininess detection and snan representation are still dealt with in
> soft-fp. However, this approach will break on most hosts if we compile
> QEMU with flags such as -ffast-math. We control the flags so this should
> be easy to enforce though.
>
> The licensing in softfloat.h is complicated at best, so to keep things
> simple I'm adding this as a separate, GPL'ed file.
>
> This patch just adds some boilerplate code; subsequent patches add
> operations, one per commit to ease bisection.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> Makefile.target | 2 +-
> include/fpu/hostfloat.h | 14 +++++++
> include/fpu/softfloat.h | 1 +
> fpu/hostfloat.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++
> target/m68k/Makefile.objs | 2 +-
> tests/fp-test/Makefile | 2 +-
> 6 files changed, 114 insertions(+), 3 deletions(-)
> create mode 100644 include/fpu/hostfloat.h
> create mode 100644 fpu/hostfloat.c
> ...
> diff --git a/target/m68k/Makefile.objs b/target/m68k/Makefile.objs
> index ac61948..2868b11 100644
> --- a/target/m68k/Makefile.objs
> +++ b/target/m68k/Makefile.objs
> @@ -1,5 +1,5 @@
> obj-y += m68k-semi.o
> obj-y += translate.o op_helper.o helper.o cpu.o
> -obj-y += fpu_helper.o softfloat.o
> +obj-y += fpu_helper.o softfloat.o hostfloat.o

I don't think you need to add hostfloat.o here. The softfloat.o in this
list contains functions specific to m68k emulation; it's not the one
from fpu/.

Thanks,
Laurent
On Wed, Mar 21, 2018 at 21:41:19 +0100, Laurent Vivier wrote:
> On 21/03/2018 at 21:11, Emilio G. Cota wrote:
> > diff --git a/target/m68k/Makefile.objs b/target/m68k/Makefile.objs
> > index ac61948..2868b11 100644
> > --- a/target/m68k/Makefile.objs
> > +++ b/target/m68k/Makefile.objs
> > @@ -1,5 +1,5 @@
> > obj-y += m68k-semi.o
> > obj-y += translate.o op_helper.o helper.o cpu.o
> > -obj-y += fpu_helper.o softfloat.o
> > +obj-y += fpu_helper.o softfloat.o hostfloat.o
>
> I don't think you need to add hostfloat.o here,
> the softfloat.o in this list contains function specific to m68k
> emulation, it's not the one from fpu/

Aah yes indeed. Didn't consider there might be another softfloat.c =)

Thanks,

E.
Emilio G. Cota <cota@braap.org> writes:
> The appended patch paves the way for leveraging the host FPU for a subset
> of guest FP operations. For most guest workloads (e.g. FP flags
> aren't ever cleared, inexact occurs often and rounding is set to the
> default [to nearest]) this will yield sizable performance speedups.
>
> The approach followed here avoids checking the FP exception flags register.
> See the comment at the top of hostfloat.c for details.
>
> This assumes that QEMU is running on an IEEE754-compliant FPU and
> that the rounding is set to the default (to nearest). The
> implementation-dependent specifics of the FPU should not matter; things
> like tininess detection and snan representation are still dealt with in
> soft-fp. However, this approach will break on most hosts if we compile
> QEMU with flags such as -ffast-math. We control the flags so this should
> be easy to enforce though.
The thing I would avoid is generating any x87 instructions, as we can
get weird effects if the compiler ever decides to stash a signalling NaN
in an x87 register.
Anyway perhaps -fno-fast-math should be explicit when building fpu/* code?
>
> The licensing in softfloat.h is complicated at best, so to keep things
> simple I'm adding this as a separate, GPL'ed file.
I don't think we need to worry about this. It's fine to add GPL only
stuff to softfloat.c and since the re-factoring (or before really) we
"own" this code and are unlikely to upstream anything.
My preference would be to include this all in softfloat.c unless there
is a very good reason not to.
>
> This patch just adds some boilerplate code; subsequent patches add
> operations, one per commit to ease bisection.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> Makefile.target | 2 +-
> include/fpu/hostfloat.h | 14 +++++++
> include/fpu/softfloat.h | 1 +
> fpu/hostfloat.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++
> target/m68k/Makefile.objs | 2 +-
> tests/fp-test/Makefile | 2 +-
> 6 files changed, 114 insertions(+), 3 deletions(-)
> create mode 100644 include/fpu/hostfloat.h
> create mode 100644 fpu/hostfloat.c
>
> diff --git a/Makefile.target b/Makefile.target
> index 6549481..efcdfb9 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -97,7 +97,7 @@ obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-op.o tcg/tcg-op-vec.o tcg/tcg-op-gvec.o
> obj-$(CONFIG_TCG) += tcg/tcg-common.o tcg/optimize.o
> obj-$(CONFIG_TCG_INTERPRETER) += tcg/tci.o
> obj-$(CONFIG_TCG_INTERPRETER) += disas/tci.o
> -obj-y += fpu/softfloat.o
> +obj-y += fpu/softfloat.o fpu/hostfloat.o
> obj-y += target/$(TARGET_BASE_ARCH)/
> obj-y += disas.o
> obj-$(call notempty,$(TARGET_XML_FILES)) += gdbstub-xml.o
> diff --git a/include/fpu/hostfloat.h b/include/fpu/hostfloat.h
> new file mode 100644
> index 0000000..b01291b
> --- /dev/null
> +++ b/include/fpu/hostfloat.h
> @@ -0,0 +1,14 @@
> +/*
> + * Copyright (C) 2018, Emilio G. Cota <cota@braap.org>
> + *
> + * License: GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +#ifndef HOSTFLOAT_H
> +#define HOSTFLOAT_H
> +
> +#ifndef SOFTFLOAT_H
> +#error fpu/hostfloat.h must only be included from softfloat.h
> +#endif
> +
> +#endif /* HOSTFLOAT_H */
> diff --git a/include/fpu/softfloat.h b/include/fpu/softfloat.h
> index 8fb44a8..8963b68 100644
> --- a/include/fpu/softfloat.h
> +++ b/include/fpu/softfloat.h
> @@ -95,6 +95,7 @@ enum {
> };
>
> #include "fpu/softfloat-types.h"
> +#include "fpu/hostfloat.h"
>
> static inline void set_float_detect_tininess(int val, float_status *status)
> {
> diff --git a/fpu/hostfloat.c b/fpu/hostfloat.c
> new file mode 100644
> index 0000000..cab0341
> --- /dev/null
> +++ b/fpu/hostfloat.c
> @@ -0,0 +1,96 @@
> +/*
> + * hostfloat.c - FP primitives that use the host's FPU whenever possible.
> + *
> + * Copyright (C) 2018, Emilio G. Cota <cota@braap.org>
> + *
> + * License: GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + * Fast emulation of guest FP instructions is challenging for two reasons.
> + * First, FP instruction semantics are similar but not identical, particularly
> + * when handling NaNs. Second, emulating at reasonable speed the guest FP
> + * exception flags is not trivial: reading the host's flags register with a
> + * feclearexcept & fetestexcept pair is slow [slightly slower than soft-fp],
> + * and trapping on every FP exception is not fast nor pleasant to work with.
> + *
> + * This module leverages the host FPU for a subset of the operations. To
> + * do this it follows the main idea presented in this paper:
> + *
> + * Guo, Yu-Chuan, et al. "Translating the ARM Neon and VFP instructions in a
> + * binary translator." Software: Practice and Experience 46.12 (2016):1591-1615.
> + *
> + * The idea is thus to leverage the host FPU to (1) compute FP operations
> + * and (2) identify whether FP exceptions occurred while avoiding
> + * expensive exception flag register accesses.
> + *
> + * An important optimization shown in the paper is that given that exception
> + * flags are rarely cleared by the guest, we can avoid recomputing some flags.
> + * This is particularly useful for the inexact flag, which is very frequently
> + * raised in floating-point workloads.
> + *
> + * We optimize the code further by deferring to soft-fp whenever FP
> + * exception detection might get hairy. Fortunately this is not common.
> + */
> +#include <math.h>
> +
> +#include "qemu/osdep.h"
> +#include "fpu/softfloat.h"
> +
> +#define GEN_TYPE_CONV(name, to_t, from_t) \
> + static inline to_t name(from_t a) \
> + { \
> + to_t r = *(to_t *)&a; \
> + return r; \
> + }
> +
> +GEN_TYPE_CONV(float32_to_float, float, float32)
> +GEN_TYPE_CONV(float64_to_double, double, float64)
> +GEN_TYPE_CONV(float_to_float32, float32, float)
> +GEN_TYPE_CONV(double_to_float64, float64, double)
> +#undef GEN_TYPE_CONV
> +
> +#define GEN_INPUT_FLUSH(soft_t) \
> + static inline __attribute__((always_inline)) void \
> + soft_t ## _input_flush__nocheck(soft_t *a, float_status *s) \
> + { \
> + if (unlikely(soft_t ## _is_denormal(*a))) { \
> + *a = soft_t ## _set_sign(soft_t ## _zero, \
> + soft_t ## _is_neg(*a)); \
> + s->float_exception_flags |= float_flag_input_denormal; \
> + } \
> + } \
> + \
> + static inline __attribute__((always_inline)) void \
> + soft_t ## _input_flush1(soft_t *a, float_status *s) \
> + { \
> + if (likely(!s->flush_inputs_to_zero)) { \
> + return; \
> + } \
> + soft_t ## _input_flush__nocheck(a, s); \
> + } \
> + \
> + static inline __attribute__((always_inline)) void \
> + soft_t ## _input_flush2(soft_t *a, soft_t *b, float_status *s) \
> + { \
> + if (likely(!s->flush_inputs_to_zero)) { \
> + return; \
> + } \
> + soft_t ## _input_flush__nocheck(a, s); \
> + soft_t ## _input_flush__nocheck(b, s); \
> + } \
> + \
> + static inline __attribute__((always_inline)) void \
> + soft_t ## _input_flush3(soft_t *a, soft_t *b, soft_t *c, \
> + float_status *s) \
> + { \
> + if (likely(!s->flush_inputs_to_zero)) { \
> + return; \
> + } \
> + soft_t ## _input_flush__nocheck(a, s); \
> + soft_t ## _input_flush__nocheck(b, s); \
> + soft_t ## _input_flush__nocheck(c, s); \
> + }
> +
> +GEN_INPUT_FLUSH(float32)
> +GEN_INPUT_FLUSH(float64)
Having spent time getting rid of a bunch of macro expansions I'm wary of
adding more in. However for these I guess it's kind of marginal.
> +#undef GEN_INPUT_FLUSH
> diff --git a/target/m68k/Makefile.objs b/target/m68k/Makefile.objs
> index ac61948..2868b11 100644
> --- a/target/m68k/Makefile.objs
> +++ b/target/m68k/Makefile.objs
> @@ -1,5 +1,5 @@
> obj-y += m68k-semi.o
> obj-y += translate.o op_helper.o helper.o cpu.o
> -obj-y += fpu_helper.o softfloat.o
> +obj-y += fpu_helper.o softfloat.o hostfloat.o
> obj-y += gdbstub.o
> obj-$(CONFIG_SOFTMMU) += monitor.o
> diff --git a/tests/fp-test/Makefile b/tests/fp-test/Makefile
> index 703434f..187cfcc 100644
> --- a/tests/fp-test/Makefile
> +++ b/tests/fp-test/Makefile
> @@ -28,7 +28,7 @@ ibm:
> $(WHITELIST_FILES):
> wget -nv -O $@ http://www.cs.columbia.edu/~cota/qemu/fpbench-$@
>
> -fp-test$(EXESUF): fp-test.o softfloat.o
> +fp-test$(EXESUF): fp-test.o softfloat.o hostfloat.o
>
> clean:
> rm -f *.o *.d $(OBJS)
--
Alex Bennée
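
As an aid to the macro discussion above, this is roughly what GEN_INPUT_FLUSH(float32) produces for the one-operand case when written out by hand. It is a sketch only: the demo_ types and helpers are simplified stand-ins for QEMU's float32, float_status and float32_is_denormal, and the flag value is illustrative rather than QEMU's definition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the softfloat types used by the patch. */
typedef uint32_t demo_float32;              /* raw IEEE754 binary32 bits */
enum { DEMO_FLAG_INPUT_DENORMAL = 64 };     /* illustrative value */

typedef struct {
    bool flush_inputs_to_zero;
    unsigned float_exception_flags;
} demo_float_status;

/* Denormal: exponent field all zeros, fraction field nonzero. */
static inline bool demo_float32_is_denormal(demo_float32 a)
{
    return (a & 0x7f800000u) == 0 && (a & 0x007fffffu) != 0;
}

/* Hand-written equivalent of the generated float32_input_flush1(). */
static inline void demo_float32_input_flush1(demo_float32 *a,
                                             demo_float_status *s)
{
    if (!s->flush_inputs_to_zero) {
        return; /* common case: the guest did not ask for input flushing */
    }
    if (demo_float32_is_denormal(*a)) {
        *a &= 0x80000000u; /* keep only the sign: +0 or -0 */
        s->float_exception_flags |= DEMO_FLAG_INPUT_DENORMAL;
    }
}
```

The two- and three-operand variants in the patch repeat the same check once per operand, and the macro repeats the whole set once per type.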
On Tue, Mar 27, 2018 at 12:49:48 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
>
> > The appended patch paves the way for leveraging the host FPU for a subset
> > of guest FP operations. For most guest workloads (e.g. FP flags
> > aren't ever cleared, inexact occurs often and rounding is set to the
> > default [to nearest]) this will yield sizable performance speedups.
> >
> > The approach followed here avoids checking the FP exception flags register.
> > See the comment at the top of hostfloat.c for details.
> >
> > This assumes that QEMU is running on an IEEE754-compliant FPU and
> > that the rounding is set to the default (to nearest). The
> > implementation-dependent specifics of the FPU should not matter; things
> > like tininess detection and snan representation are still dealt with in
> > soft-fp. However, this approach will break on most hosts if we compile
> > QEMU with flags such as -ffast-math. We control the flags so this should
> > be easy to enforce though.
>
> The thing I would avoid is generating any x87 instructions, as we can
> get weird effects if the compiler ever decides to stash a signalling NaN
> in an x87 register.

We take care not to do hardfloat on operands that might result in NaNs.
So this should not be a concern.

> Anyway perhaps -fno-fast-math should be explicit when building fpu/* code?

That's a fair suggestion. There are plenty of other flags though that
could ruin this approach, so I'm not sure how effective this would be.
Also, we should be careful not to sneak in things like
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON) in the QEMU binary. Not sure
we can guarantee this is avoided unless we had a runtime check =)

> > The licensing in softfloat.h is complicated at best, so to keep things
> > simple I'm adding this as a separate, GPL'ed file.
>
> I don't think we need to worry about this. It's fine to add GPL only
> stuff to softfloat.c and since the re-factoring (or before really) we
> "own" this code and are unlikely to upstream anything.
>
> My preference would be to include this all in softfloat.c unless there
> is a very good reason not to.

Yes I did this in v2 after reading the license etc.

(snip)
> > +++ b/fpu/hostfloat.c
(snip)
> > +#define GEN_INPUT_FLUSH(soft_t) \
> > + static inline __attribute__((always_inline)) void \
> > + soft_t ## _input_flush__nocheck(soft_t *a, float_status *s) \
(snip)
> > + soft_t ## _input_flush__nocheck(c, s); \
> > + }
> > +
> > +GEN_INPUT_FLUSH(float32)
> > +GEN_INPUT_FLUSH(float64)
>
> Having spent time getting rid of a bunch of macro expansions I'm wary of
> adding more in. However for these I guess it's kind of marginal.

Then you won't like v2 :-( I don't like macros either but in this
case they might be a necessary evil. I left a lot of macros in there
because it'll let us retain performance and also easily support things
like half/quad precision, if we ever want to.

Thanks,

Emilio
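
The runtime check mentioned in the reply above is not spelled out in the thread. One possible shape for it, shown purely as an illustration, is a probe that divides the smallest normal float by two and verifies that the host returns a denormal rather than zero. The helper name host_keeps_denormals is hypothetical and is not part of this series.

```c
#include <float.h>
#include <stdbool.h>

/*
 * Hypothetical probe: returns true if the host currently produces and
 * consumes denormals, i.e. neither flush-to-zero nor denormals-are-zero
 * appears to be enabled for the calling thread.
 */
static bool host_keeps_denormals(void)
{
    /* volatile keeps the compiler from folding the arithmetic at build time */
    volatile float smallest_normal = FLT_MIN;
    volatile float half = smallest_normal / 2.0f;

    /*
     * FLT_MIN / 2 is exactly representable as a denormal on an IEEE754
     * host. With flush-to-zero the division yields 0; with
     * denormals-are-zero the comparison and the multiplication see 0.
     */
    return half != 0.0f && half * 2.0f == FLT_MIN;
}
```

A check like this only covers the flush-to-zero concern; the rounding-mode assumption could be verified separately, for example with fegetround() from <fenv.h>.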