[Qemu-devel] [PATCH v3 0/3] softmmu demacro

Alex Bennée posted 3 patches 5 years, 2 months ago
Test asan passed
Test docker-mingw@fedora passed
Test docker-clang@ubuntu failed
Test checkpatch failed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20190215143115.28777-1-alex.bennee@linaro.org
[Qemu-devel] [PATCH v3 0/3] softmmu demacro
Posted by Alex Bennée 5 years, 2 months ago
Hi,

This is hopefully the final version of the softmmu demacro series. I
tracked down the remaining failures to:

  - not always using the correct victim tlb
  - not masking reads when we unrolled the unaligned helpers

Other than that I've rolled in the changes that were made to support
dynamic TLB code and the follow-up fixes. All patches need some review
before we can merge them.

Alex Bennée (3):
  accel/tcg: demacro cputlb
  accel/tcg: remove softmmu_template.h
  accel/tcg: move unaligned helpers out of core helpers

 accel/tcg/cputlb.c           | 501 +++++++++++++++++++++++++++++++++--
 accel/tcg/softmmu_template.h | 454 -------------------------------
 2 files changed, 475 insertions(+), 480 deletions(-)
 delete mode 100644 accel/tcg/softmmu_template.h

-- 
2.20.1


Re: [Qemu-devel] [PATCH v3 0/3] softmmu demacro
Posted by no-reply@patchew.org 5 years, 2 months ago
Patchew URL: https://patchew.org/QEMU/20190215143115.28777-1-alex.bennee@linaro.org/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Message-id: 20190215143115.28777-1-alex.bennee@linaro.org
Subject: [Qemu-devel] [PATCH  v3 0/3] softmmu demacro
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 t [tag update]            patchew/20190215133005.15955-1-david@redhat.com -> patchew/20190215133005.15955-1-david@redhat.com
 * [new tag]               patchew/20190215143115.28777-1-alex.bennee@linaro.org -> patchew/20190215143115.28777-1-alex.bennee@linaro.org
Switched to a new branch 'test'
e26497c4c1 accel/tcg: move unaligned helpers out of core helpers
bb7a57bffb accel/tcg: remove softmmu_template.h
3a3cfc0d92 accel/tcg: demacro cputlb

=== OUTPUT BEGIN ===
1/3 Checking commit 3a3cfc0d9223 (accel/tcg: demacro cputlb)
2/3 Checking commit bb7a57bffbe0 (accel/tcg: remove softmmu_template.h)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#11: 
deleted file mode 100644

total: 0 errors, 1 warnings, 0 lines checked

Patch 2/3 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
3/3 Checking commit e26497c4c1e5 (accel/tcg: move unaligned helpers out of core helpers)
ERROR: externs should be avoided in .c files
#29: FILE: accel/tcg/cputlb.c:1210:
+tcg_target_ulong load_helper(CPUArchState *env, target_ulong addr,

WARNING: line over 80 characters
#34: FILE: accel/tcg/cputlb.c:1215:
+static tcg_target_ulong unaligned_load_helper(CPUArchState *env, target_ulong addr,

ERROR: externs should be avoided in .c files
#199: FILE: accel/tcg/cputlb.c:1443:
+void store_helper(CPUArchState *env, target_ulong addr, uint64_t val,

WARNING: line over 80 characters
#246: FILE: accel/tcg/cputlb.c:1509:
+            unaligned_store_helper(env, addr, val, oi, retaddr, size, big_endian);

total: 2 errors, 2 warnings, 381 lines checked

Patch 3/3 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20190215143115.28777-1-alex.bennee@linaro.org/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Re: [Qemu-devel] [PATCH v3 0/3] softmmu demacro
Posted by Alex Bennée 5 years, 1 month ago
Alex Bennée <alex.bennee@linaro.org> writes:

> Hi,
>
> This is hopefully the final version of the softmmu demacro series. I
> tracked down the remaining failures to:
>
>   - not always using the correct victim tlb
>   - not masking reads when we unrolled the unaligned helpers
>
> Other than that I've rolled in the changes that were made to support
> dynamic TLB code and the follow-up fixes. All patches need some review
> before we can merge them.

Measurements:

Before

hyperfine -w 2 './aarch64-softmmu/qemu-system-aarch64 -machine virt,graphics=on,gic-version=3,virtualization=on -cpu cortex-a53 -serial mon:stdio -nic user,model=virtio-net-pci,hostfwd=tcp::2222-:22 -device virtio-scsi-pci -device scsi-hd,drive=hd0 -blockdev driver=raw,node-name=hd0,discard=unmap,file.driver=host_device,file.filename=/dev/zvol/hackpool-0/debian-buster-arm64 -kernel ../linux.git/builds/arm64.nopreempt/arch/arm64/boot/Image -append "console=ttyAMA0 root=/dev/sda2 systemd.unit=benchmark-pigz.service" -display none -m 8192 -smp 4 --snapshot'
Benchmark #1: ./aarch64-softmmu/qemu-system-aarch64 -machine virt,graphics=on,gic-version=3,virtualization=on -cpu cortex-a53 -serial mon:stdio -nic user,model=virtio-net-pci,hostfwd=tcp::2222-:22 -device virtio-scsi-pci -device scsi-hd,drive=hd0 -blockdev driver=raw,node-name=hd0,discard=unmap,file.driver=host_device,file.filename=/dev/zvol/hackpool-0/debian-buster-arm64 -kernel ../linux.git/builds/arm64.nopreempt/arch/arm64/boot/Image -append "console=ttyAMA0 root=/dev/sda2 systemd.unit=benchmark-pigz.service" -display none -m 8192 -smp 4 --snapshot
  Time (mean ± σ):     66.035 s ±  1.425 s    [User: 241.495 s, System: 2.327 s]
  Range (min … max):   64.718 s … 68.944 s    10 runs

After

hyperfine -w 2 './aarch64-softmmu/qemu-system-aarch64 -machine virt,graphics=on,gic-version=3,virtualization=on -cpu cortex-a53 -serial mon:stdio -nic user,model=virtio-net-pci,hostfwd=tcp::2222-:22 -device virtio-scsi-pci -device scsi-hd,drive=hd0 -blockdev driver=raw,node-name=hd0,discard=unmap,file.driver=host_device,file.filename=/dev/zvol/hackpool-0/debian-buster-arm64 -kernel ../linux.git/builds/arm64.nopreempt/arch/arm64/boot/Image -append "console=ttyAMA0 root=/dev/sda2 systemd.unit=benchmark-pigz.service" -display none -m 8192 -smp 4 --snapshot'
Benchmark #1: ./aarch64-softmmu/qemu-system-aarch64 -machine virt,graphics=on,gic-version=3,virtualization=on -cpu cortex-a53 -serial mon:stdio -nic user,model=virtio-net-pci,hostfwd=tcp::2222-:22 -device virtio-scsi-pci -device scsi-hd,drive=hd0 -blockdev driver=raw,node-name=hd0,discard=unmap,file.driver=host_device,file.filename=/dev/zvol/hackpool-0/debian-buster-arm64 -kernel ../linux.git/builds/arm64.nopreempt/arch/arm64/boot/Image -append "console=ttyAMA0 root=/dev/sda2 systemd.unit=benchmark-pigz.service" -display none -m 8192 -smp 4 --snapshot
  Time (mean ± σ):     65.331 s ±  0.684 s    [User: 240.528 s, System: 2.306 s]
  Range (min … max):   64.624 s … 66.914 s    10 runs

I'm not going to claim a performance boost, just that any difference is
within the noise.

Emilio,

Any chance you could run it through your benchmark suite?

>
> Alex Bennée (3):
>   accel/tcg: demacro cputlb
>   accel/tcg: remove softmmu_template.h
>   accel/tcg: move unaligned helpers out of core helpers
>
>  accel/tcg/cputlb.c           | 501 +++++++++++++++++++++++++++++++++--
>  accel/tcg/softmmu_template.h | 454 -------------------------------
>  2 files changed, 475 insertions(+), 480 deletions(-)
>  delete mode 100644 accel/tcg/softmmu_template.h


--
Alex Bennée

Re: [Qemu-devel] [PATCH v3 0/3] softmmu demacro
Posted by Emilio G. Cota 5 years, 1 month ago
On Tue, Feb 19, 2019 at 14:22:37 +0000, Alex Bennée wrote:
> Emilio,
> 
> Any chance you could run it through your benchmark suite?

Something isn't quite right. For instance, gcc in SPEC doesn't
complete; it fails after 2s with some errors about the syntax of
the source being compiled. Before the patches the error isn't
there (the test completes in ~7s, as usual), so I suspect
some guest memory accesses aren't going to the right place.

I am too busy right now to try to debug this, but if you have
patches to test, I can run them. The patches I tested are in
your v3 branch on github, i.e. with 430f28154b5 at the head.

Thanks,

		Emilio

Re: [Qemu-devel] [PATCH v3 0/3] softmmu demacro
Posted by Mark Cave-Ayland 5 years, 1 month ago
On 19/02/2019 18:25, Emilio G. Cota wrote:

> On Tue, Feb 19, 2019 at 14:22:37 +0000, Alex Bennée wrote:
>> Emilio,
>>
>> Any chance you could run it through your benchmark suite?
> 
> Something isn't quite right. For instance, gcc in SPEC doesn't
> complete; it fails after 2s with some errors about the syntax of
> the source being compiled. Before the patches the error isn't
> there (the test completes in ~7s, as usual), so I suspect
> some guest memory accesses aren't going to the right place.
> 
> I am too busy right now to try to debug this, but if you have
> patches to test, I can run them. The patches I tested are in
> your v3 branch on github, i.e. with 430f28154b5 at the head.

I spent a few hours yesterday and today testing this patchset against my OpenBIOS
test images and found that all of them were able to boot, except for one MacOS image
which started exhibiting flashing blocks on a progress bar during boot. Tracing this
back, I found the problem was still present when just the first patch "accel/tcg:
demacro cputlb" was applied.

First of all there were a couple of typos I found in load_helper() and store_helper()
which can be fixed up with the following diff:


diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 351f579fed..f3cc2dd0d9 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1251,7 +1251,7 @@ static tcg_target_ulong load_helper(CPUArchState *env,
target_ulong addr,
         }

         tmp = io_readx(env, iotlbentry, mmu_idx, addr, retaddr,
-                       addr & tlb_addr & TLB_RECHECK,
+                       tlb_addr & TLB_RECHECK,
                        code_read ? MMU_INST_FETCH : MMU_DATA_LOAD, size);
         return handle_bswap(tmp, size, big_endian);
     }
@@ -1405,14 +1405,14 @@ tcg_target_ulong
 helper_le_ldsl_mmu(CPUArchState *env, target_ulong addr, TCGMemOpIdx oi,
                    uintptr_t retaddr)
 {
-    return (int32_t)helper_le_lduw_mmu(env, addr, oi, retaddr);
+    return (int32_t)helper_le_ldul_mmu(env, addr, oi, retaddr);
 }

 tcg_target_ulong
 helper_be_ldsl_mmu(CPUArchState *env, target_ulong addr, TCGMemOpIdx oi,
                    uintptr_t retaddr)
 {
-    return (int32_t)helper_be_lduw_mmu(env, addr, oi, retaddr);
+    return (int32_t)helper_be_ldul_mmu(env, addr, oi, retaddr);
 }

 /*
@@ -1427,6 +1427,7 @@ static void store_helper(CPUArchState *env, target_ulong addr,
uint64_t val,
     uintptr_t index = tlb_index(env, mmu_idx, addr);
     CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
     target_ulong tlb_addr = tlb_addr_write(entry);
+    const size_t tlb_off = offsetof(CPUTLBEntry, addr_write);
     unsigned a_bits = get_alignment_bits(get_memop(oi));
     uintptr_t haddr;

@@ -1438,7 +1439,8 @@ static void store_helper(CPUArchState *env, target_ulong addr,
uint64_t val,

     /* If the TLB entry is for a different page, reload and try again.  */
     if (!tlb_hit(tlb_addr, addr)) {
-        if (!VICTIM_TLB_HIT(addr_write, addr)) {
+        if (!victim_tlb_hit(env, mmu_idx, index, tlb_off,
+            addr & TARGET_PAGE_MASK)) {
             tlb_fill(ENV_GET_CPU(env), addr, size, MMU_DATA_STORE,
                      mmu_idx, retaddr);
             index = tlb_index(env, mmu_idx, addr);
@@ -1466,7 +1468,6 @@ static void store_helper(CPUArchState *env, target_ulong addr,
uint64_t val,
         && unlikely((addr & ~TARGET_PAGE_MASK) + size - 1
                      >= TARGET_PAGE_SIZE)) {
         int i;
-        uintptr_t index2;
         CPUTLBEntry *entry2;
         target_ulong page2, tlb_addr2;
     do_unaligned_access:
@@ -1476,15 +1477,13 @@ static void store_helper(CPUArchState *env, target_ulong
addr, uint64_t val,
          * cannot evict the first.
          */
         page2 = (addr + size) & TARGET_PAGE_MASK;
-        index2 = tlb_index(env, mmu_idx, page2);
-        entry2 = tlb_entry(env, mmu_idx, index2);
+        entry2 = tlb_entry(env, mmu_idx, page2);
         tlb_addr2 = tlb_addr_write(entry2);
         if (!tlb_hit_page(tlb_addr2, page2)
-            && !VICTIM_TLB_HIT(addr_write, page2)) {
+            && !victim_tlb_hit(env, mmu_idx, index, tlb_off,
+                               page2 & TARGET_PAGE_MASK)) {
             tlb_fill(ENV_GET_CPU(env), page2, size, MMU_DATA_STORE,
                      mmu_idx, retaddr);
-            index2 = tlb_index(env, mmu_idx, page2);
-            entry2 = tlb_entry(env, mmu_idx, index2);
         }

         /*


Even with this patch applied I was still seeing issues with the progress bar, so I
ended up manually unrolling softmmu_template.h myself until I could see at what point
things were breaking.

The culprit turned out to be the change in type of res in load_helper() from the size
-related type of uint8_t/uint16_t/uint32_t/uint64_t to tcg_target_ulong in the slow
path as seen from my locally unrolled version:


WORKING:

tcg_target_ulong helper_le_ldul_mmu(CPUArchState *env, target_ulong addr,
                            TCGMemOpIdx oi, uintptr_t retaddr)
{
    ....

    /* Handle slow unaligned access (it spans two pages or IO).  */
    if (4 > 1
        && unlikely((addr & ~TARGET_PAGE_MASK) + 4 - 1
                    >= TARGET_PAGE_SIZE)) {
        target_ulong addr1, addr2;
        uint32_t res1, res2;
        ^^^^^^^^
        unsigned shift;
    do_unaligned_access:
        addr1 = addr & ~(4 - 1);
        addr2 = addr1 + 4;
        res1 = helper_le_ldul_mmu(env, addr1, oi, retaddr);
        res2 = helper_le_ldul_mmu(env, addr2, oi, retaddr);
        shift = (addr & (4 - 1)) * 8;

        /* Little-endian combine.  */
        res = (res1 >> shift) | (res2 << ((4 * 8) - shift));
        return res;
    }

}

tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
                            TCGMemOpIdx oi, uintptr_t retaddr)
{
    ....

    /* Handle slow unaligned access (it spans two pages or IO).  */
    if (4 > 1
        && unlikely((addr & ~TARGET_PAGE_MASK) + 4 - 1
                    >= TARGET_PAGE_SIZE)) {
        target_ulong addr1, addr2;
        uint32_t res1, res2;
        ^^^^^^^^
        unsigned shift;
    do_unaligned_access:
        addr1 = addr & ~(4 - 1);
        addr2 = addr1 + 4;
        res1 = helper_be_ldul_mmu(env, addr1, oi, retaddr);
        res2 = helper_be_ldul_mmu(env, addr2, oi, retaddr);
        shift = (addr & (4 - 1)) * 8;

        /* Big-endian combine.  */
        res = (res1 << shift) | (res2 >> ((4 * 8) - shift));
        return res;
    }
}


CORRUPTED:

tcg_target_ulong helper_le_ldul_mmu(CPUArchState *env, target_ulong addr,
                            TCGMemOpIdx oi, uintptr_t retaddr)
{
    ....

    /* Handle slow unaligned access (it spans two pages or IO).  */
    if (4 > 1
        && unlikely((addr & ~TARGET_PAGE_MASK) + 4 - 1
                    >= TARGET_PAGE_SIZE)) {
        target_ulong addr1, addr2;
        tcg_target_ulong res1, res2;
        ^^^^^^^^^^^^^^^^
        unsigned shift;
    do_unaligned_access:
        addr1 = addr & ~(4 - 1);
        addr2 = addr1 + 4;
        res1 = helper_le_ldul_mmu(env, addr1, oi, retaddr);
        res2 = helper_le_ldul_mmu(env, addr2, oi, retaddr);
        shift = (addr & (4 - 1)) * 8;

        /* Little-endian combine.  */
        res = (res1 >> shift) | (res2 << ((4 * 8) - shift));
        return res;
    }

}

tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
                            TCGMemOpIdx oi, uintptr_t retaddr)
{
    ....

    /* Handle slow unaligned access (it spans two pages or IO).  */
    if (4 > 1
        && unlikely((addr & ~TARGET_PAGE_MASK) + 4 - 1
                    >= TARGET_PAGE_SIZE)) {
        target_ulong addr1, addr2;
        tcg_target_ulong res1, res2;
        ^^^^^^^^^^^^^^^^
        unsigned shift;
    do_unaligned_access:
        addr1 = addr & ~(4 - 1);
        addr2 = addr1 + 4;
        res1 = helper_be_ldul_mmu(env, addr1, oi, retaddr);
        res2 = helper_be_ldul_mmu(env, addr2, oi, retaddr);
        shift = (addr & (4 - 1)) * 8;

        /* Big-endian combine.  */
        res = (res1 << shift) | (res2 >> ((4 * 8) - shift));
        return res;
    }
}


Presumably the issue here is somehow related to the compiler incorrectly
extending/reducing the shift when the larger type is involved? Also during my tests
the visual corruption was only present for 32-bit accesses, but presumably all the
access sizes will need a similar fix.


ATB,

Mark.

Re: [Qemu-devel] [PATCH v3 0/3] softmmu demacro
Posted by Alex Bennée 5 years, 1 month ago
Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> writes:

> On 19/02/2019 18:25, Emilio G. Cota wrote:
>
>> On Tue, Feb 19, 2019 at 14:22:37 +0000, Alex Bennée wrote:
>>> Emilio,
>>>
>>> Any chance you could run it through your benchmark suite?
>>
>> Something isn't quite right. For instance, gcc in SPEC doesn't
>> complete; it fails after 2s with some errors about the syntax of
>> the source being compiled. Before the patches the error isn't
>> there (the test completes in ~7s, as usual), so I suspect
>> some guest memory accesses aren't going to the right place.
>>
>> I am too busy right now to try to debug this, but if you have
>> patches to test, I can run them. The patches I tested are in
>> your v3 branch on github, i.e. with 430f28154b5 at the head.
>
> I spent a few hours yesterday and today testing this patchset against my OpenBIOS
> test images and found that all of them were able to boot, except for one MacOS image
> which started exhibiting flashing blocks on a progress bar during boot. Tracing this
> back, I found the problem was still present when just the first patch "accel/tcg:
> demacro cputlb" was applied.

Yeah, in the original version of the patch I de-macroed each element one
by one, which made bisecting errors easier. This is more of a big-bang
approach, which has made it harder to find which bit of the conversion
broke.

That said, I did test it fairly hard on ARM, but I guess we see fewer
unaligned and page-crossing accesses there.

>
> First of all there were a couple of typos I found in load_helper() and store_helper()
> which can be fixed up with the following diff:
>
>
> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> index 351f579fed..f3cc2dd0d9 100644
> --- a/accel/tcg/cputlb.c
> +++ b/accel/tcg/cputlb.c
> @@ -1251,7 +1251,7 @@ static tcg_target_ulong load_helper(CPUArchState *env,
> target_ulong addr,
>          }
>
>          tmp = io_readx(env, iotlbentry, mmu_idx, addr, retaddr,
> -                       addr & tlb_addr & TLB_RECHECK,
> +                       tlb_addr & TLB_RECHECK,
>                         code_read ? MMU_INST_FETCH : MMU_DATA_LOAD, size);
>          return handle_bswap(tmp, size, big_endian);
>      }
> @@ -1405,14 +1405,14 @@ tcg_target_ulong
>  helper_le_ldsl_mmu(CPUArchState *env, target_ulong addr, TCGMemOpIdx oi,
>                     uintptr_t retaddr)
>  {
> -    return (int32_t)helper_le_lduw_mmu(env, addr, oi, retaddr);
> +    return (int32_t)helper_le_ldul_mmu(env, addr, oi, retaddr);
>  }
>
>  tcg_target_ulong
>  helper_be_ldsl_mmu(CPUArchState *env, target_ulong addr, TCGMemOpIdx oi,
>                     uintptr_t retaddr)
>  {
> -    return (int32_t)helper_be_lduw_mmu(env, addr, oi, retaddr);
> +    return (int32_t)helper_be_ldul_mmu(env, addr, oi, retaddr);
>  }

Awesome, I shall apply those.

>
>  /*
> @@ -1427,6 +1427,7 @@ static void store_helper(CPUArchState *env, target_ulong addr,
> uint64_t val,
>      uintptr_t index = tlb_index(env, mmu_idx, addr);
>      CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
>      target_ulong tlb_addr = tlb_addr_write(entry);
> +    const size_t tlb_off = offsetof(CPUTLBEntry, addr_write);
>      unsigned a_bits = get_alignment_bits(get_memop(oi));
>      uintptr_t haddr;
>
> @@ -1438,7 +1439,8 @@ static void store_helper(CPUArchState *env, target_ulong addr,
> uint64_t val,
>
>      /* If the TLB entry is for a different page, reload and try again.  */
>      if (!tlb_hit(tlb_addr, addr)) {
> -        if (!VICTIM_TLB_HIT(addr_write, addr)) {
> +        if (!victim_tlb_hit(env, mmu_idx, index, tlb_off,
> +            addr & TARGET_PAGE_MASK)) {
>              tlb_fill(ENV_GET_CPU(env), addr, size, MMU_DATA_STORE,
>                       mmu_idx, retaddr);
>              index = tlb_index(env, mmu_idx, addr);
> @@ -1466,7 +1468,6 @@ static void store_helper(CPUArchState *env, target_ulong addr,
> uint64_t val,
>          && unlikely((addr & ~TARGET_PAGE_MASK) + size - 1
>                       >= TARGET_PAGE_SIZE)) {
>          int i;
> -        uintptr_t index2;
>          CPUTLBEntry *entry2;
>          target_ulong page2, tlb_addr2;
>      do_unaligned_access:
> @@ -1476,15 +1477,13 @@ static void store_helper(CPUArchState *env, target_ulong
> addr, uint64_t val,
>           * cannot evict the first.
>           */
>          page2 = (addr + size) & TARGET_PAGE_MASK;
> -        index2 = tlb_index(env, mmu_idx, page2);
> -        entry2 = tlb_entry(env, mmu_idx, index2);
> +        entry2 = tlb_entry(env, mmu_idx, page2);
>          tlb_addr2 = tlb_addr_write(entry2);
>          if (!tlb_hit_page(tlb_addr2, page2)
> -            && !VICTIM_TLB_HIT(addr_write, page2)) {
> +            && !victim_tlb_hit(env, mmu_idx, index, tlb_off,
> +                               page2 & TARGET_PAGE_MASK)) {
>              tlb_fill(ENV_GET_CPU(env), page2, size, MMU_DATA_STORE,
>                       mmu_idx, retaddr);
> -            index2 = tlb_index(env, mmu_idx, page2);
> -            entry2 = tlb_entry(env, mmu_idx, index2);
>          }

Oops, I thought I'd applied the victim fix to both loads and stores.

>
>          /*
>
>
> Even with this patch applied I was still seeing issues with the progress bar, so I
> ended up manually unrolling softmmu_template.h myself until I could see at what point
> things were breaking.
>
> The culprit turned out to be the change in type of res in load_helper() from the size
> -related type of uint8_t/uint16_t/uint32_t/uint64_t to tcg_target_ulong in the slow
> path as seen from my locally unrolled version:
>
>
> WORKING:
>
> tcg_target_ulong helper_le_ldul_mmu(CPUArchState *env, target_ulong addr,
>                             TCGMemOpIdx oi, uintptr_t retaddr)
> {
>     ....
>
>     /* Handle slow unaligned access (it spans two pages or IO).  */
>     if (4 > 1
>         && unlikely((addr & ~TARGET_PAGE_MASK) + 4 - 1
>                     >= TARGET_PAGE_SIZE)) {
>         target_ulong addr1, addr2;
>         uint32_t res1, res2;
>         ^^^^^^^^
>         unsigned shift;
>     do_unaligned_access:
>         addr1 = addr & ~(4 - 1);
>         addr2 = addr1 + 4;
>         res1 = helper_le_ldul_mmu(env, addr1, oi, retaddr);
>         res2 = helper_le_ldul_mmu(env, addr2, oi, retaddr);
>         shift = (addr & (4 - 1)) * 8;
>
>         /* Little-endian combine.  */
>         res = (res1 >> shift) | (res2 << ((4 * 8) - shift));
>         return res;
>     }
>
> }
>
> tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
>                             TCGMemOpIdx oi, uintptr_t retaddr)
> {
>     ....
>
>     /* Handle slow unaligned access (it spans two pages or IO).  */
>     if (4 > 1
>         && unlikely((addr & ~TARGET_PAGE_MASK) + 4 - 1
>                     >= TARGET_PAGE_SIZE)) {
>         target_ulong addr1, addr2;
>         uint32_t res1, res2;
>         ^^^^^^^^
>         unsigned shift;
>     do_unaligned_access:
>         addr1 = addr & ~(4 - 1);
>         addr2 = addr1 + 4;
>         res1 = helper_be_ldul_mmu(env, addr1, oi, retaddr);
>         res2 = helper_be_ldul_mmu(env, addr2, oi, retaddr);
>         shift = (addr & (4 - 1)) * 8;
>
>         /* Big-endian combine.  */
>         res = (res1 << shift) | (res2 >> ((4 * 8) - shift));
>         return res;
>     }
> }
>
>
> CORRUPTED:
>
> tcg_target_ulong helper_le_ldul_mmu(CPUArchState *env, target_ulong addr,
>                             TCGMemOpIdx oi, uintptr_t retaddr)
> {
>     ....
>
>     /* Handle slow unaligned access (it spans two pages or IO).  */
>     if (4 > 1
>         && unlikely((addr & ~TARGET_PAGE_MASK) + 4 - 1
>                     >= TARGET_PAGE_SIZE)) {
>         target_ulong addr1, addr2;
>         tcg_target_ulong res1, res2;
>         ^^^^^^^^^^^^^^^^
>         unsigned shift;
>     do_unaligned_access:
>         addr1 = addr & ~(4 - 1);
>         addr2 = addr1 + 4;
>         res1 = helper_le_ldul_mmu(env, addr1, oi, retaddr);
>         res2 = helper_le_ldul_mmu(env, addr2, oi, retaddr);
>         shift = (addr & (4 - 1)) * 8;
>
>         /* Little-endian combine.  */
>         res = (res1 >> shift) | (res2 << ((4 * 8) - shift));
>         return res;
>     }
>
> }
>
> tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
>                             TCGMemOpIdx oi, uintptr_t retaddr)
> {
>     ....
>
>     /* Handle slow unaligned access (it spans two pages or IO).  */
>     if (4 > 1
>         && unlikely((addr & ~TARGET_PAGE_MASK) + 4 - 1
>                     >= TARGET_PAGE_SIZE)) {
>         target_ulong addr1, addr2;
>         tcg_target_ulong res1, res2;
>         ^^^^^^^^^^^^^^^^
>         unsigned shift;
>     do_unaligned_access:
>         addr1 = addr & ~(4 - 1);
>         addr2 = addr1 + 4;
>         res1 = helper_be_ldul_mmu(env, addr1, oi, retaddr);
>         res2 = helper_be_ldul_mmu(env, addr2, oi, retaddr);
>         shift = (addr & (4 - 1)) * 8;
>
>         /* Big-endian combine.  */
>         res = (res1 << shift) | (res2 >> ((4 * 8) - shift));
>         return res;
>     }
> }
>
>
> Presumably the issue here is somehow related to the compiler incorrectly
> extending/reducing the shift when the larger type is involved? Also during my tests
> the visual corruption was only present for 32-bit accesses, but presumably all the
> access sizes will need a similar fix.

So I did fix this in the third patch, when I moved the unaligned
helpers out of line:

     const tcg_target_ulong size_mask = MAKE_64BIT_MASK(0, size * 8);

and

     /* Big-endian combine.  */
     r2 &= size_mask;

or

     /* Little-endian combine.  */
     r1 &= size_mask;

I guess I should also apply that to patch 1 for bisectability.

Thanks for digging into the failures. It does make me think that we
really could do with some system mode tests to properly exercise all
softmmu code paths:

  * each size access, load and store
  * unaligned accesses, load and store (+ potential arch specific handling)
  * page striding accesses faulting and non-faulting

I'm not sure how much work it would be to get a minimal bare metal
kernel that would be able to do these and report success/fail. When I
did the MTTCG work I used kvm-unit-tests as a fairly complete mini-os
but maybe that's overkill for what we need here. After all we already
have migration kernels that just do a simple tick-tock for testing
migration. It would be nice to have the tests in C with maybe a minimal
per-arch assembly so we can share the tests across various platforms.

--
Alex Bennée

Re: [Qemu-devel] [PATCH v3 0/3] softmmu demacro
Posted by Mark Cave-Ayland 5 years, 1 month ago
On 26/02/2019 09:24, Alex Bennée wrote:

>> Presumably the issue here is somehow related to the compiler incorrectly
>> extending/reducing the shift when the larger type is involved? Also during my tests
>> the visual corruption was only present for 32-bit accesses, but presumably all the
>> access sizes will need a similar fix.
> 
> So I did fix this in the third patch, when I moved the unaligned
> helpers out of line:
> 
>      const tcg_target_ulong size_mask = MAKE_64BIT_MASK(0, size * 8);
> 
> and
> 
>      /* Big-endian combine.  */
>      r2 &= size_mask;
> 
> or
> 
>      /* Little-endian combine.  */
>      r1 &= size_mask;
> 
> I guess I should also apply that to patch 1 for bisectability.

I've done that locally, and while things have improved with progress bars I'm still
seeing some strange blank fills appearing on the display in MacOS. I'll have another
dig further and see what's going on...

> Thanks for digging into the failures. It does make me think that we
> really could do with some system mode tests to properly exercise all
> softmmu code paths:
> 
>   * each size access, load and store
>   * unaligned accesses, load and store (+ potential arch specific handling)
>   * page striding accesses faulting and non-faulting
> 
> I'm not sure how much work it would be to get a minimal bare metal
> kernel that would be able to do these and report success/fail. When I
> did the MTTCG work I used kvm-unit-tests as a fairly complete mini-os
> but maybe that's overkill for what we need here. After all we already
> have migration kernels that just do a simple tick-tock for testing
> migration. It would be nice to have the tests in C with maybe a minimal
> per-arch assembly so we can share the tests across various platforms.

Right. Having unrolled the macros myself, I can see that whilst
softmmu_template.h appears deceptively simple on the surface, it hides
quite a few subtle implementation details, which is why I think your work
is so valuable here.


ATB,

Mark.

Re: [Qemu-devel] [PATCH v3 0/3] softmmu demacro
Posted by Mark Cave-Ayland 5 years, 1 month ago
On 26/02/2019 19:03, Mark Cave-Ayland wrote:

> On 26/02/2019 09:24, Alex Bennée wrote:
> 
>>> Presumably the issue here is somehow related to the compiler incorrectly
>>> extending/reducing the shift when the larger type is involved? Also during my tests
>>> the visual corruption was only present for 32-bit accesses, but presumably all the
>>> access sizes will need a similar fix.
>>
>> So I did fix this in the third patch, when I moved the unaligned
>> helpers out of line:
>>
>>      const tcg_target_ulong size_mask = MAKE_64BIT_MASK(0, size * 8);
>>
>> and
>>
>>      /* Big-endian combine.  */
>>      r2 &= size_mask;
>>
>> or
>>
>>      /* Little-endian combine.  */
>>      r1 &= size_mask;
>>
>> I guess I should also apply that to patch 1 for bisectability.
> 
> I've done that locally, and while things have improved with progress bars I'm still
> seeing some strange blank fills appearing on the display in MacOS. I'll have another
> dig further and see what's going on...

Okay, I've found it: it looks like you also need to apply size_mask to the
final result to keep it within the number of bits represented by size:


diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 7729254424..73710f9b9c 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1274,11 +1274,11 @@ static tcg_target_ulong load_helper(CPUArchState *env,
target_ulong addr,
         if (big_endian) {
             /* Big-endian combine.  */
             r2 &= size_mask;
-            res = (r1 << shift) | (r2 >> ((size * 8) - shift));
+            res = ((r1 << shift) | (r2 >> ((size * 8) - shift))) & size_mask;
         } else {
             /* Little-endian combine.  */
             r1 &= size_mask;
-            res = (r1 >> shift) | (r2 << ((size * 8) - shift));
+            res = ((r1 >> shift) | (r2 << ((size * 8) - shift))) & size_mask;
         }
         return res;
     }


I've now incorporated this into your original patchset, rebased and pushed the result
to https://github.com/mcayland/qemu/tree/alex-softmmu for you to grab and test.
Hopefully this version will now pass Emilio's tests too: I don't have a benchmark
setup in place, so it's worth testing to make sure that my fixes haven't introduced
any performance regressions.

Something else I noticed is that patch 3 removes the extra victim TLB fill from the
unaligned access path in store_helper() but doesn't mention it in the commit message
- is this deliberate?


ATB,

Mark.