Remaining MMU clean up patches

[PATCH 00/43] Remaining MMU clean up patches

Posted by BALATON Zoltan 1 year, 8 months ago

This is the rest of the MMU clean up series the first part of which
was merged. Here are the remaining patches rebased and some more added.

Regards,
BALATON Zoltan

BALATON Zoltan (43):
  target/ppc: Reorganise and rename ppc_hash32_pp_prot()
  target/ppc/mmu_common.c: Remove local name for a constant
  target/ppc/mmu_common.c: Remove single use local variable
  target/ppc/mmu_common.c: Remove single use local variable
  target/ppc/mmu_common.c: Remove another single use local variable
  target/ppc/mmu_common.c: Remove yet another single use local variable
  target/ppc/mmu_common.c: Return directly in ppc6xx_tlb_pte_check()
  target/ppc/mmu_common.c: Simplify ppc6xx_tlb_pte_check()
  target/ppc/mmu_common.c: Remove unused field from mmu_ctx_t
  target/ppc/mmu_common.c: Remove hash field from mmu_ctx_t
  target/ppc/mmu_common.c: Remove pte_update_flags()
  target/ppc/mmu_common.c: Remove nx field from mmu_ctx_t
  target/ppc/mmu_common.c: Convert local variable to bool
  target/ppc/mmu_common.c: Remove single use local variable
  target/ppc/mmu_common.c: Simplify a switch statement
  target/ppc/mmu_common.c: Inline and remove ppc6xx_tlb_pte_check()
  target/ppc/mmu_common.c: Remove ptem field from mmu_ctx_t
  target/ppc: Add function to get protection key for hash32 MMU
  target/ppc/mmu-hash32.c: Inline and remove ppc_hash32_pte_prot()
  target/ppc/mmu_common.c: Init variable in function that relies on it
  target/ppc/mmu_common.c: Remove key field from mmu_ctx_t
  target/ppc/mmu_common.c: Stop using ctx in ppc6xx_tlb_check()
  target/ppc/mmu_common.c: Rename function parameter
  target/ppc/mmu_common.c: Use defines instead of numeric constants
  target/ppc: Remove bat_size_prot()
  target/ppc/mmu_common.c: Stop using ctx in get_bat_6xx_tlb()
  target/ppc/mmu_common.c: Remove mmu_ctx_t
  target/ppc/mmu-hash32.c: Inline and remove ppc_hash32_pte_raddr()
  target/ppc/mmu-hash32.c: Move get_pteg_offset32() to the header
  target/ppc: Unexport some functions from mmu-book3s-v3.h
  target/ppc/mmu-radix64: Remove externally unused parts from header
  target/ppc: Remove includes from mmu-book3s-v3.h
  target/ppc: Remove single use static inline function
  target/ppc/internal.h: Consolidate ifndef CONFIG_USER_ONLY blocks
  target/ppc/mmu-hash32.c: Change parameter type of
    ppc_hash32_bat_lookup()
  target/ppc/mmu-hash32: Remove some static inlines from header
  target/ppc/mmu-hash32.c: Return and use pte address instead of base +
    offset
  target/ppc/mmu-hash32.c: Use pte address as parameter instead of
    offset
  target/ppc: Change parameter type of some inline functions
  target/ppc: Change parameter type of ppc64_v3_radix()
  target/ppc: Change MMU xlate functions to take CPUState
  target/ppc/mmu-hash32.c: Change parameter type of ppc_hash32_set_[rc]
  target/ppc/mmu-hash32.c: Change parameter type of
    ppc_hash32_direct_store

 hw/ppc/spapr_rtas.c                 |   2 +-
 hw/ppc/spapr_vhyp_mmu.c             |  21 +-
 target/ppc/internal.h               |  34 +--
 target/ppc/mmu-book3s-v3.c          |   1 -
 target/ppc/mmu-book3s-v3.h          |  47 +---
 target/ppc/mmu-booke.c              |   5 +-
 target/ppc/mmu-booke.h              |   2 +-
 target/ppc/mmu-hash32.c             | 165 ++++--------
 target/ppc/mmu-hash32.h             |  86 +++---
 target/ppc/mmu-hash64.c             |  54 +++-
 target/ppc/mmu-hash64.h             |   3 +-
 target/ppc/mmu-radix64.c            |  57 +++-
 target/ppc/mmu-radix64.h            |  55 +---
 target/ppc/mmu_common.c             | 405 ++++++++++------------------
 target/ppc/mmu_helper.c             |   9 +-
 target/ppc/translate/vsx-impl.c.inc |   6 +-
 16 files changed, 376 insertions(+), 576 deletions(-)

-- 
2.30.9

Re: [PATCH 00/43] Remaining MMU clean up patches

Posted by BALATON Zoltan 1 year, 7 months ago

On Mon, 27 May 2024, BALATON Zoltan wrote:
> This is the rest of the MMU clean up series the first part of which
> was merged. Here are the remaining patches rebased and some more added.

Ping?

> Regards,
> BALATON Zoltan
>
> BALATON Zoltan (43):
>  target/ppc: Reorganise and rename ppc_hash32_pp_prot()
>  target/ppc/mmu_common.c: Remove local name for a constant
>  target/ppc/mmu_common.c: Remove single use local variable
>  target/ppc/mmu_common.c: Remove single use local variable
>  target/ppc/mmu_common.c: Remove another single use local variable
>  target/ppc/mmu_common.c: Remove yet another single use local variable
>  target/ppc/mmu_common.c: Return directly in ppc6xx_tlb_pte_check()
>  target/ppc/mmu_common.c: Simplify ppc6xx_tlb_pte_check()
>  target/ppc/mmu_common.c: Remove unused field from mmu_ctx_t
>  target/ppc/mmu_common.c: Remove hash field from mmu_ctx_t
>  target/ppc/mmu_common.c: Remove pte_update_flags()
>  target/ppc/mmu_common.c: Remove nx field from mmu_ctx_t
>  target/ppc/mmu_common.c: Convert local variable to bool
>  target/ppc/mmu_common.c: Remove single use local variable
>  target/ppc/mmu_common.c: Simplify a switch statement
>  target/ppc/mmu_common.c: Inline and remove ppc6xx_tlb_pte_check()
>  target/ppc/mmu_common.c: Remove ptem field from mmu_ctx_t
>  target/ppc: Add function to get protection key for hash32 MMU
>  target/ppc/mmu-hash32.c: Inline and remove ppc_hash32_pte_prot()
>  target/ppc/mmu_common.c: Init variable in function that relies on it
>  target/ppc/mmu_common.c: Remove key field from mmu_ctx_t
>  target/ppc/mmu_common.c: Stop using ctx in ppc6xx_tlb_check()
>  target/ppc/mmu_common.c: Rename function parameter
>  target/ppc/mmu_common.c: Use defines instead of numeric constants
>  target/ppc: Remove bat_size_prot()
>  target/ppc/mmu_common.c: Stop using ctx in get_bat_6xx_tlb()
>  target/ppc/mmu_common.c: Remove mmu_ctx_t
>  target/ppc/mmu-hash32.c: Inline and remove ppc_hash32_pte_raddr()
>  target/ppc/mmu-hash32.c: Move get_pteg_offset32() to the header
>  target/ppc: Unexport some functions from mmu-book3s-v3.h
>  target/ppc/mmu-radix64: Remove externally unused parts from header
>  target/ppc: Remove includes from mmu-book3s-v3.h
>  target/ppc: Remove single use static inline function
>  target/ppc/internal.h: Consolidate ifndef CONFIG_USER_ONLY blocks
>  target/ppc/mmu-hash32.c: Change parameter type of
>    ppc_hash32_bat_lookup()
>  target/ppc/mmu-hash32: Remove some static inlines from header
>  target/ppc/mmu-hash32.c: Return and use pte address instead of base +
>    offset
>  target/ppc/mmu-hash32.c: Use pte address as parameter instead of
>    offset
>  target/ppc: Change parameter type of some inline functions
>  target/ppc: Change parameter type of ppc64_v3_radix()
>  target/ppc: Change MMU xlate functions to take CPUState
>  target/ppc/mmu-hash32.c: Change parameter type of ppc_hash32_set_[rc]
>  target/ppc/mmu-hash32.c: Change parameter type of
>    ppc_hash32_direct_store
>
> hw/ppc/spapr_rtas.c                 |   2 +-
> hw/ppc/spapr_vhyp_mmu.c             |  21 +-
> target/ppc/internal.h               |  34 +--
> target/ppc/mmu-book3s-v3.c          |   1 -
> target/ppc/mmu-book3s-v3.h          |  47 +---
> target/ppc/mmu-booke.c              |   5 +-
> target/ppc/mmu-booke.h              |   2 +-
> target/ppc/mmu-hash32.c             | 165 ++++--------
> target/ppc/mmu-hash32.h             |  86 +++---
> target/ppc/mmu-hash64.c             |  54 +++-
> target/ppc/mmu-hash64.h             |   3 +-
> target/ppc/mmu-radix64.c            |  57 +++-
> target/ppc/mmu-radix64.h            |  55 +---
> target/ppc/mmu_common.c             | 405 ++++++++++------------------
> target/ppc/mmu_helper.c             |   9 +-
> target/ppc/translate/vsx-impl.c.inc |   6 +-
> 16 files changed, 376 insertions(+), 576 deletions(-)
>
>

Re: [PATCH 00/43] Remaining MMU clean up patches

Posted by BALATON Zoltan 1 year, 8 months ago

On Mon, 27 May 2024, BALATON Zoltan wrote:
> This is the rest of the MMU clean up series the first part of which
> was merged. Here are the remaining patches rebased and some more added.

Besides cleaning up this code my other goal with this and previous already 
merged series was trying to optimise this a bit. Here are some numbers 
I've got with 9.0 compared to after this series. I've got these running 
old benchmarks that do a lot of memory accesses under AmigaOS on sam460ex 
and amigaone machines. The speed up is not much but measurable.

***
sam460ex:
QEMU v9.0.0
===========
    Sieve of Eratosthenes (Scaled to 10 Iterations)
    Version 1.2, 03 April 1992

    Array Size   Number   Last Prime    Linear     RunTime    MIPS
     (Bytes)   of Primes               Time(sec)    (Sec)
        8191       1899        16381   0.001068   0.001068  1552.2
       10000       2261        19997   0.001304   0.001373  1479.5
       20000       4202        39989   0.002608   0.002747  1498.7
       40000       7836        79999   0.005216   0.005493  1517.0
       80000      14683       160001   0.010432   0.010681  1578.4
      160000      27607       319993   0.020864   0.241089   141.4
      320000      52073       639997   0.041728   0.799561    86.2
      640000      98609      1279997   0.083457   2.019043    68.9
     1280000     187133      2559989   0.166913   4.824219    58.2
     2560000     356243      5119997   0.333827  11.142578    50.9
     5120000     679460     10239989   0.667654  25.488281    44.9
    10240000    1299068     20479999   1.335307  55.898438    41.2
    20480000    2488465     40960001   2.670614  125.625000    37.0

    Average RunTime = 0.020496 (sec)
    High  MIPS      =   1578.4
    Low   MIPS      =     37.0

STREAM version $Revision: 5.10 $
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 228749 microseconds.
    (= 76249 clock ticks)
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            1703.0     0.097710     0.093951     0.107887
Scale:            706.9     0.233270     0.226353     0.244723
Add:              862.8     0.288949     0.278165     0.302902
Triad:            763.4     0.318530     0.314394     0.324167

QEMU master + this series:
==========================
    Sieve of Eratosthenes (Scaled to 10 Iterations)
    Version 1.2, 03 April 1992

    Array Size   Number   Last Prime    Linear     RunTime    MIPS
     (Bytes)   of Primes               Time(sec)    (Sec)
        8191       1899        16381   0.001068   0.001068  1552.2
       10000       2261        19997   0.001304   0.001259  1614.0
       20000       4202        39989   0.002608   0.002518  1635.0
       40000       7836        79999   0.005216   0.005493  1517.0
       80000      14683       160001   0.010432   0.011597  1453.8
      160000      27607       319993   0.020864   0.218506   156.0
      320000      52073       639997   0.041728   0.747070    92.2
      640000      98609      1279997   0.083457   1.887207    73.7
     1280000     187133      2559989   0.166913   4.633789    60.6
     2560000     356243      5119997   0.333827  10.449219    54.2
     5120000     679460     10239989   0.667654  23.750000    48.1
    10240000    1299068     20479999   1.335307  52.109375    44.2
    20480000    2488465     40960001   2.670614  117.031250    39.7

    Relative to 10 Iterations and the 8191 Array Size:
    Average RunTime = 0.019190 (sec)
    High  MIPS      =   1635.0
    Low   MIPS      =     39.7

STREAM version $Revision: 5.10 $
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 189202 microseconds.
    (= 63067 clock ticks)
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            1730.8     0.096015     0.092444     0.107926
Scale:            784.4     0.208925     0.203969     0.217610
Add:             1002.6     0.243805     0.239389     0.254176
Triad:            768.9     0.322526     0.312154     0.345777

***
amigaone:
QEMU v9.0.0
===========
    Sieve of Eratosthenes (Scaled to 10 Iterations)
    Version 1.2, 03 April 1992

    Array Size   Number   Last Prime    Linear     RunTime    MIPS
     (Bytes)   of Primes               Time(sec)    (Sec)
        8191       1899        16381   0.001068   0.001068  1552.2
       10000       2261        19997   0.001304   0.001259  1614.0
       20000       4202        39989   0.002608   0.002747  1498.7
       40000       7836        79999   0.005216   0.005493  1517.0
       80000      14683       160001   0.010432   0.010986  1534.6
      160000      27607       319993   0.020864   0.023193  1469.7
      320000      52073       639997   0.041728   0.047607  1446.9
      640000      98609      1279997   0.083457   0.100098  1389.9
     1280000     187133      2559989   0.166913   0.200195  1403.0
     2560000     356243      5119997   0.333827   0.400391  1415.6
     5120000     679460     10239989   0.667654   1.484375   770.2
    10240000    1299068     20479999   1.335307   1.679688  1372.5
    20480000    2488465     40960001   2.670614   6.796875   683.7

    Relative to 10 Iterations and the 8191 Array Size:
    Average RunTime = 0.001397 (sec)
    High  MIPS      =   1614.0
    Low   MIPS      =    683.7

STREAM version $Revision: 5.10 $
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 203076 microseconds.
    (= 101538 clock ticks)
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            2529.4     0.067538     0.063255     0.079943
Scale:            885.4     0.187032     0.180708     0.194940
Add:             1119.5     0.226545     0.214384     0.246212
Triad:            959.5     0.260417     0.250131     0.281227

QEMU master + this series:
==========================
    Sieve of Eratosthenes (Scaled to 10 Iterations)
    Version 1.2, 03 April 1992

    Array Size   Number   Last Prime    Linear     RunTime    MIPS
     (Bytes)   of Primes               Time(sec)    (Sec)
        8191       1899        16381   0.001068   0.001068  1552.2
       10000       2261        19997   0.001304   0.001373  1479.5
       20000       4202        39989   0.002608   0.002518  1635.0
       40000       7836        79999   0.005216   0.005798  1437.2
       80000      14683       160001   0.010432   0.010986  1534.6
      160000      27607       319993   0.020864   0.021973  1551.3
      320000      52073       639997   0.041728   0.046387  1485.0
      640000      98609      1279997   0.083457   0.100098  1389.9
     1280000     187133      2559989   0.166913   0.200195  1403.0
     2560000     356243      5119997   0.333827   0.400391  1415.6
     5120000     679460     10239989   0.667654   0.859375  1330.4
    10240000    1299068     20479999   1.335307   3.085938   747.1
    20480000    2488465     40960001   2.670614   6.562500   708.1

    Relative to 10 Iterations and the 8191 Array Size:
    Average RunTime = 0.001397 (sec)
    High  MIPS      =   1635.0
    Low   MIPS      =    708.1

STREAM version $Revision: 5.10 $
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 168572 microseconds.
    (= 84286 clock ticks)
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            2410.2     0.076613     0.066384     0.127486
Scale:           1007.6     0.164015     0.158791     0.175446
Add:             1236.3     0.203815     0.194123     0.216319
Triad:            967.6     0.262833     0.248042     0.281844

There is some variation between results between multiple runs but the 
optimised version seems to run a bit faster and it should be more readable 
code than it was before. It could be possible to improve it further but I 
stop here for now.

The sam460ex seems to be slower due to TLB misses generating an exception 
on embedded PPC and exceptions are slow on QEMU (not only because of 
needing to go through guest code but generally, I've also seen this with 
workloads that do a lot of syscalls but I don't have measurements of that 
now). The amigaone with G4 CPU uses hash MMU which can access the needed 
data from guest memory without an exception so it can keep running faster 
with TLB misses.

Regards,
BALATON Zoltan