[RFC PATCH v1 36/43] target/hexagon: Add temporary vector storage

Anton Johansson via posted 43 patches 1 year, 2 months ago
[RFC PATCH v1 36/43] target/hexagon: Add temporary vector storage
Posted by Anton Johansson via 1 year, 2 months ago
Temporary vectors in helper-to-tcg generated code are allocated from an
array of bytes in CPUArchState, specified with --temp-vector-block.

This commits adds such a block of memory to CPUArchState, if
CONFIG_HELPER_TO_TCG is set.

Signed-off-by: Anton Johansson <anjo@rev.ng>
---
 target/hexagon/cpu.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/target/hexagon/cpu.h b/target/hexagon/cpu.h
index 7be4b5769e..fa6ac83e01 100644
--- a/target/hexagon/cpu.h
+++ b/target/hexagon/cpu.h
@@ -97,6 +97,10 @@ typedef struct CPUArchState {
     MMVector future_VRegs[VECTOR_TEMPS_MAX] QEMU_ALIGNED(16);
     MMVector tmp_VRegs[VECTOR_TEMPS_MAX] QEMU_ALIGNED(16);
 
+#ifdef CONFIG_HELPER_TO_TCG
+    uint8_t tmp_vmem[4096] QEMU_ALIGNED(16);
+#endif
+
     MMQReg QRegs[NUM_QREGS] QEMU_ALIGNED(16);
     MMQReg future_QRegs[NUM_QREGS] QEMU_ALIGNED(16);
 
-- 
2.45.2
Re: [RFC PATCH v1 36/43] target/hexagon: Add temporary vector storage
Posted by Richard Henderson 1 year, 2 months ago
On 11/20/24 19:49, Anton Johansson wrote:
> Temporary vectors in helper-to-tcg generated code are allocated from an
> array of bytes in CPUArchState, specified with --temp-vector-block.
> 
> This commits adds such a block of memory to CPUArchState, if
> CONFIG_HELPER_TO_TCG is set.
> 
> Signed-off-by: Anton Johansson <anjo@rev.ng>
> ---
>   target/hexagon/cpu.h | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/target/hexagon/cpu.h b/target/hexagon/cpu.h
> index 7be4b5769e..fa6ac83e01 100644
> --- a/target/hexagon/cpu.h
> +++ b/target/hexagon/cpu.h
> @@ -97,6 +97,10 @@ typedef struct CPUArchState {
>       MMVector future_VRegs[VECTOR_TEMPS_MAX] QEMU_ALIGNED(16);
>       MMVector tmp_VRegs[VECTOR_TEMPS_MAX] QEMU_ALIGNED(16);
>   
> +#ifdef CONFIG_HELPER_TO_TCG
> +    uint8_t tmp_vmem[4096] QEMU_ALIGNED(16);
> +#endif
> +
>       MMQReg QRegs[NUM_QREGS] QEMU_ALIGNED(16);
>       MMQReg future_QRegs[NUM_QREGS] QEMU_ALIGNED(16);
>   

Wow.  Do you really require 4k in temp storage?


r~
Re: [RFC PATCH v1 36/43] target/hexagon: Add temporary vector storage
Posted by Anton Johansson via 1 year, 2 months ago
On 22/11/24, Richard Henderson wrote:
> On 11/20/24 19:49, Anton Johansson wrote:
> > Temporary vectors in helper-to-tcg generated code are allocated from an
> > array of bytes in CPUArchState, specified with --temp-vector-block.
> > 
> > This commits adds such a block of memory to CPUArchState, if
> > CONFIG_HELPER_TO_TCG is set.
> > 
> > Signed-off-by: Anton Johansson <anjo@rev.ng>
> > ---
> >   target/hexagon/cpu.h | 4 ++++
> >   1 file changed, 4 insertions(+)
> > 
> > diff --git a/target/hexagon/cpu.h b/target/hexagon/cpu.h
> > index 7be4b5769e..fa6ac83e01 100644
> > --- a/target/hexagon/cpu.h
> > +++ b/target/hexagon/cpu.h
> > @@ -97,6 +97,10 @@ typedef struct CPUArchState {
> >       MMVector future_VRegs[VECTOR_TEMPS_MAX] QEMU_ALIGNED(16);
> >       MMVector tmp_VRegs[VECTOR_TEMPS_MAX] QEMU_ALIGNED(16);
> > +#ifdef CONFIG_HELPER_TO_TCG
> > +    uint8_t tmp_vmem[4096] QEMU_ALIGNED(16);
> > +#endif
> > +
> >       MMQReg QRegs[NUM_QREGS] QEMU_ALIGNED(16);
> >       MMQReg future_QRegs[NUM_QREGS] QEMU_ALIGNED(16);
> 
> Wow.  Do you really require 4k in temp storage?

No, 4k is overkill used during testing.  But consider that Hexagon uses
128- and 256-byte vectors in some cases so if the emitted code uses say
5 temporaries in its computation we end up at 1280 bytes as an upper
bound.

Two ideas here, we can:

  1. Allow users to specify an upper bound on vector memory, and abort
     translation of helpers that surpass this, and;

  2. Emit maximum number of bytes used for vector temporaries to a
     macro.

//Anton
Re: [RFC PATCH v1 36/43] target/hexagon: Add temporary vector storage
Posted by Brian Cain 1 year, 2 months ago
On 12/3/2024 12:56 PM, Anton Johansson via wrote:
> On 22/11/24, Richard Henderson wrote:
>> On 11/20/24 19:49, Anton Johansson wrote:
>>> Temporary vectors in helper-to-tcg generated code are allocated from an
>>> array of bytes in CPUArchState, specified with --temp-vector-block.
>>>
>>> This commits adds such a block of memory to CPUArchState, if
>>> CONFIG_HELPER_TO_TCG is set.
>>>
>>> Signed-off-by: Anton Johansson <anjo@rev.ng>
>>> ---
>>>    target/hexagon/cpu.h | 4 ++++
>>>    1 file changed, 4 insertions(+)
>>>
>>> diff --git a/target/hexagon/cpu.h b/target/hexagon/cpu.h
>>> index 7be4b5769e..fa6ac83e01 100644
>>> --- a/target/hexagon/cpu.h
>>> +++ b/target/hexagon/cpu.h
>>> @@ -97,6 +97,10 @@ typedef struct CPUArchState {
>>>        MMVector future_VRegs[VECTOR_TEMPS_MAX] QEMU_ALIGNED(16);
>>>        MMVector tmp_VRegs[VECTOR_TEMPS_MAX] QEMU_ALIGNED(16);
>>> +#ifdef CONFIG_HELPER_TO_TCG
>>> +    uint8_t tmp_vmem[4096] QEMU_ALIGNED(16);
>>> +#endif
>>> +
>>>        MMQReg QRegs[NUM_QREGS] QEMU_ALIGNED(16);
>>>        MMQReg future_QRegs[NUM_QREGS] QEMU_ALIGNED(16);
>> Wow.  Do you really require 4k in temp storage?
> No, 4k is overkill used during testing.  But consider that Hexagon uses
> 128- and 256-byte vectors in some cases so if the emitted code uses say
> 5 temporaries in its computation we end up at 1280 bytes as an upper
> bound.

Per-packet there should be a maximum of one temporary.  But per-TB it's 
unbound.  Could we/should we have some guidance to put the brakes on 
translation early if we encounter ~N temp references?

But maybe that's not needed since the temp space can be reused within a 
TB among packets.

>
> Two ideas here, we can:
>
>    1. Allow users to specify an upper bound on vector memory, and abort
>       translation of helpers that surpass this, and;
>
>    2. Emit maximum number of bytes used for vector temporaries to a
>       macro.
>
> //Anton
>

RE: [RFC PATCH v1 36/43] target/hexagon: Add temporary vector storage
Posted by ltaylorsimpson@gmail.com 1 year, 2 months ago

> -----Original Message-----
> From: Brian Cain <brian.cain@oss.qualcomm.com>
> Sent: Tuesday, December 3, 2024 1:28 PM
> To: Anton Johansson <anjo@rev.ng>; Richard Henderson
> <richard.henderson@linaro.org>
> Cc: qemu-devel@nongnu.org; ale@rev.ng; ltaylorsimpson@gmail.com;
> bcain@quicinc.com; philmd@linaro.org; alex.bennee@linaro.org
> Subject: Re: [RFC PATCH v1 36/43] target/hexagon: Add temporary vector
> storage
> 
> 
> On 12/3/2024 12:56 PM, Anton Johansson via wrote:
> > On 22/11/24, Richard Henderson wrote:
> >> On 11/20/24 19:49, Anton Johansson wrote:
> >>> Temporary vectors in helper-to-tcg generated code are allocated from
> >>> an array of bytes in CPUArchState, specified with --temp-vector-block.
> >>>
> >>> This commits adds such a block of memory to CPUArchState, if
> >>> CONFIG_HELPER_TO_TCG is set.
> >>>
> >>> Signed-off-by: Anton Johansson <anjo@rev.ng>
> >>> ---
> >>>    target/hexagon/cpu.h | 4 ++++
> >>>    1 file changed, 4 insertions(+)
> >>>
> >>> diff --git a/target/hexagon/cpu.h b/target/hexagon/cpu.h index
> >>> 7be4b5769e..fa6ac83e01 100644
> >>> --- a/target/hexagon/cpu.h
> >>> +++ b/target/hexagon/cpu.h
> >>> @@ -97,6 +97,10 @@ typedef struct CPUArchState {
> >>>        MMVector future_VRegs[VECTOR_TEMPS_MAX]
> QEMU_ALIGNED(16);
> >>>        MMVector tmp_VRegs[VECTOR_TEMPS_MAX] QEMU_ALIGNED(16);
> >>> +#ifdef CONFIG_HELPER_TO_TCG
> >>> +    uint8_t tmp_vmem[4096] QEMU_ALIGNED(16); #endif
> >>> +
> >>>        MMQReg QRegs[NUM_QREGS] QEMU_ALIGNED(16);
> >>>        MMQReg future_QRegs[NUM_QREGS] QEMU_ALIGNED(16);
> >> Wow.  Do you really require 4k in temp storage?
> > No, 4k is overkill used during testing.  But consider that Hexagon
> > uses
> > 128- and 256-byte vectors in some cases so if the emitted code uses
> > say
> > 5 temporaries in its computation we end up at 1280 bytes as an upper
> > bound.
> 
> Per-packet there should be a maximum of one temporary.  But per-TB it's
> unbound.  Could we/should we have some guidance to put the brakes on
> translation early if we encounter ~N temp references?
> 
> But maybe that's not needed since the temp space can be reused within a TB
> among packets.

You should only need enough temporaries for one instruction.  There are already temporaries (future_VRegs, tmp_VRegs, future_QRegs) in CPUHexagonState to handle the needs within a packet.  There shouldn't be any temps needed between the packets in a TB.

The number of temps needed for a given instruction is determined by the compiler - version, level of optimization.  So, you can determine this by compiling all the instructions (i.e., build qemu).  I'd recommend having a few extra to future proof against changes to LLVM.

Taylor