The current setting for the hwprobe bit indicating misaligned access
speed is controlled by a vendor-specific feature probe function. This is
essentially a per-SoC table we have to maintain on behalf of each vendor
going forward. Let's convert that instead to something we detect at
runtime.

We have two assembly routines at the heart of our probe: one that
does a bunch of word-sized accesses (without aligning its input buffer),
and the other that does byte accesses. If we can move a larger number of
bytes using misaligned word accesses than we can with the same amount of
time doing byte accesses, then we can declare misaligned accesses as
"fast".

The tradeoff of reducing this maintenance burden is boot time. We spend
4-6 jiffies per core doing this measurement (0-2 on jiffie edge
alignment, and 4 on measurement). The timing loop was based on
raid6_choose_gen(), which uses (16+1)*N jiffies (where N is the number
of algorithms). On my THead C906, I found measurements to be stable
across several reboots, and looked like this:

[ 0.047582] cpu0: Unaligned word copy 1728 MB/s, byte copy 402 MB/s, misaligned accesses are fast

I don't have a machine where misaligned accesses are slow, but I'd be
interested to see the results of booting this series if someone did.

Evan Green (2):
  RISC-V: Probe for unaligned access speed
  RISC-V: alternative: Remove feature_probe_func

 Documentation/riscv/hwprobe.rst      |  8 +--
 arch/riscv/errata/thead/errata.c     |  8 ---
 arch/riscv/include/asm/alternative.h |  5 --
 arch/riscv/include/asm/cpufeature.h  |  2 +
 arch/riscv/kernel/Makefile           |  1 +
 arch/riscv/kernel/alternative.c      | 19 -------
 arch/riscv/kernel/copy-noalign.S     | 71 +++++++++++++++++++++++++
 arch/riscv/kernel/copy-noalign.h     | 13 +++++
 arch/riscv/kernel/cpufeature.c       | 78 ++++++++++++++++++++++++++++
 arch/riscv/kernel/smpboot.c          |  3 +-
 10 files changed, 171 insertions(+), 37 deletions(-)
 create mode 100644 arch/riscv/kernel/copy-noalign.S
 create mode 100644 arch/riscv/kernel/copy-noalign.h

--
2.34.1
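A minimal C sketch of the probe described above, under stated assumptions:
probe_copy_mbs(), copy_words_unaligned() and copy_bytes_unaligned() are
hypothetical stand-ins for the series' routines in copy-noalign.S and
cpufeature.c, and the window/buffer constants are illustrative. This shows
the timing idea, not the actual patch code.

/*
 * Hypothetical sketch of the runtime misaligned-access probe.
 * copy_words_unaligned()/copy_bytes_unaligned() are placeholders for
 * the assembly copy loops the series adds in copy-noalign.S.
 */
#include <linux/jiffies.h>
#include <linux/printk.h>
#include <linux/smp.h>
#include <linux/types.h>
#include <asm/processor.h>

#define PROBE_JIFFIES	4	/* fixed measurement window per routine */
#define PROBE_COPY_SIZE	0x4000	/* bytes copied per iteration (illustrative) */

/* Stand-ins for the real assembly copy routines. */
void copy_words_unaligned(void *dst, const void *src, size_t size);
void copy_bytes_unaligned(void *dst, const void *src, size_t size);

static unsigned long probe_copy_mbs(void (*copy)(void *, const void *, size_t),
				    void *dst, const void *src)
{
	unsigned long start, iters = 0;

	/* Burn 0-2 jiffies waiting for a tick edge so each run gets full ticks. */
	start = jiffies;
	while (jiffies == start)
		cpu_relax();

	/* Do as many copies as possible inside the fixed window. */
	start = jiffies;
	while (time_before(jiffies, start + PROBE_JIFFIES)) {
		copy(dst, src, PROBE_COPY_SIZE);
		iters++;
	}

	/* Convert bytes moved over the window into an approximate MB/s. */
	return (iters * PROBE_COPY_SIZE * HZ) / (PROBE_JIFFIES * 1024 * 1024);
}

static void check_misaligned_access_speed(void *dst, const void *src)
{
	unsigned long word_mbs, byte_mbs;

	/* Offset the destination by one byte so every word access is misaligned. */
	word_mbs = probe_copy_mbs(copy_words_unaligned, dst + 1, src);
	byte_mbs = probe_copy_mbs(copy_bytes_unaligned, dst, src);

	pr_info("cpu%d: Unaligned word copy %lu MB/s, byte copy %lu MB/s, misaligned accesses are %s\n",
		smp_processor_id(), word_mbs, byte_mbs,
		word_mbs > byte_mbs ? "fast" : "slow");
}

The byte copy acts as the baseline: if misaligned word copies move more
data in the same window, the hardware is handling them natively and the
hart is marked fast, whereas a trapped-and-emulated implementation (as in
the rocket-chip report below) collapses to roughly 0 MB/s.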
Hi,

Thanks for doing this.

On 6/24/23 6:20 AM, Evan Green wrote:
> I don't have a machine where misaligned accesses are slow, but I'd be
> interested to see the results of booting this series if someone did.

I have tested your patches on a 100MHz BigCore rocket-chip with opensbi
running on FPGA with 72bit(64bit+ECC) DDR3 1600MHz memory. As the
rocket-chip does not support misaligned memory access, every misaligned
memory access will trap and be emulated by SBI.

Here is the result:

~ # cat /proc/cpuinfo
processor : 0
hart      : 0
isa       : rv64imafdc
mmu       : sv39
uarch     : sifive,rocket0
mvendorid : 0x0
marchid   : 0x1
mimpid    : 0x20181004

processor : 1
hart      : 1
isa       : rv64imafdc
mmu       : sv39
uarch     : sifive,rocket0
mvendorid : 0x0
marchid   : 0x1
mimpid    : 0x20181004

~ # dmesg | grep Unaligned
[ 0.210140] cpu1: Unaligned word copy 0 MB/s, byte copy 38 MB/s, misaligned accesses are slow
[ 0.410715] cpu0: Unaligned word copy 0 MB/s, byte copy 35 MB/s, misaligned accesses are slow

Thanks,
Yangyu Chen
On Sat, Jun 24, 2023 at 3:22 AM Yangyu Chen <cyy@cyyself.name> wrote:
>
> Hi,
>
> Thanks for doing this.
>
> On 6/24/23 6:20 AM, Evan Green wrote:
> > I don't have a machine where misaligned accesses are slow, but I'd be
> > interested to see the results of booting this series if someone did.
>
> I have tested your patches on a 100MHz BigCore rocket-chip with opensbi
> running on FPGA with 72bit(64bit+ECC) DDR3 1600MHz memory. As the
> rocket-chip does not support misaligned memory access, every misaligned
> memory access will trap and be emulated by SBI.
>
> Here is the result:
>
> ~ # cat /proc/cpuinfo
> processor : 0
> hart      : 0
> isa       : rv64imafdc
> mmu       : sv39
> uarch     : sifive,rocket0
> mvendorid : 0x0
> marchid   : 0x1
> mimpid    : 0x20181004
>
> processor : 1
> hart      : 1
> isa       : rv64imafdc
> mmu       : sv39
> uarch     : sifive,rocket0
> mvendorid : 0x0
> marchid   : 0x1
> mimpid    : 0x20181004
>
> ~ # dmesg | grep Unaligned
> [ 0.210140] cpu1: Unaligned word copy 0 MB/s, byte copy 38 MB/s, misaligned accesses are slow
> [ 0.410715] cpu0: Unaligned word copy 0 MB/s, byte copy 35 MB/s, misaligned accesses are slow

Thank you, Yangyu! Oof, the firmware traps are quite slow!
-Evan
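For reference, userspace reads the per-CPU result of this probing through
the hwprobe interface in Documentation/riscv/hwprobe.rst. A rough sketch
follows; it assumes kernel headers new enough to ship <asm/hwprobe.h> and
__NR_riscv_hwprobe, and the key/value names (RISCV_HWPROBE_KEY_CPUPERF_0,
RISCV_HWPROBE_MISALIGNED_*) follow my reading of the hwprobe UAPI, so
check them against the installed headers.

/* Rough userspace sketch: query the misaligned-access performance
 * reported via hwprobe.  Constant and syscall names are from the
 * hwprobe UAPI as I understand it; verify against <asm/hwprobe.h>.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/hwprobe.h>

int main(void)
{
	struct riscv_hwprobe pair = {
		.key = RISCV_HWPROBE_KEY_CPUPERF_0,
	};

	/* No cpumask: ask about the set of all online CPUs. */
	if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) != 0) {
		perror("riscv_hwprobe");
		return 1;
	}

	switch (pair.value & RISCV_HWPROBE_MISALIGNED_MASK) {
	case RISCV_HWPROBE_MISALIGNED_FAST:
		puts("misaligned accesses are fast");
		break;
	case RISCV_HWPROBE_MISALIGNED_SLOW:
		puts("misaligned accesses are slow");
		break;
	case RISCV_HWPROBE_MISALIGNED_EMULATED:
		puts("misaligned accesses are emulated");
		break;
	default:
		puts("misaligned access speed unknown/unsupported");
		break;
	}
	return 0;
}

On a system like the rocket-chip above, where SBI silently emulates every
misaligned access, this probing would be expected to report "slow", since
the kernel cannot currently tell emulation apart from merely slow hardware.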
From: Yangyu Chen
> Sent: 24 June 2023 11:22
>
> Hi,
>
> Thanks for doing this.
>
> On 6/24/23 6:20 AM, Evan Green wrote:
> > I don't have a machine where misaligned accesses are slow, but I'd be
> > interested to see the results of booting this series if someone did.
>
> I have tested your patches on a 100MHz BigCore rocket-chip with opensbi
> running on FPGA with 72bit(64bit+ECC) DDR3 1600MHz memory. As the
> rocket-chip does not support misaligned memory access, every misaligned
> memory access will trap and be emulated by SBI.
>
> Here is the result:
...
> ~ # dmesg | grep Unaligned
> [ 0.210140] cpu1: Unaligned word copy 0 MB/s, byte copy 38 MB/s, misaligned accesses are slow
> [ 0.410715] cpu0: Unaligned word copy 0 MB/s, byte copy 35 MB/s, misaligned accesses are slow

How many misaligned cycles are in the test loop?
If emulated ones are that slow you pretty much only need to test one.

Also it is pretty clear that you really don't want to be emulating them.
If the emulation is hidden from the kernel that really doesn't help at all.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Mon, Jun 26, 2023 at 2:24 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Yangyu Chen
> > Sent: 24 June 2023 11:22
> >
> > Hi,
> >
> > Thanks for doing this.
> >
> > On 6/24/23 6:20 AM, Evan Green wrote:
> > > I don't have a machine where misaligned accesses are slow, but I'd be
> > > interested to see the results of booting this series if someone did.
> >
> > I have tested your patches on a 100MHz BigCore rocket-chip with opensbi
> > running on FPGA with 72bit(64bit+ECC) DDR3 1600MHz memory. As the
> > rocket-chip does not support misaligned memory access, every misaligned
> > memory access will trap and be emulated by SBI.
> >
> > Here is the result:
> ...
> > ~ # dmesg | grep Unaligned
> > [ 0.210140] cpu1: Unaligned word copy 0 MB/s, byte copy 38 MB/s, misaligned accesses are slow
> > [ 0.410715] cpu0: Unaligned word copy 0 MB/s, byte copy 35 MB/s, misaligned accesses are slow
>
> How many misaligned cycles are in the test loop?
> If emulated ones are that slow you pretty much only need to test one.

The code does as many cycles as it can in a fixed number of jiffies.

> Also it is pretty clear that you really don't want to be emulating them.
> If the emulation is hidden from the kernel that really doesn't help at all.

From what I understand, there's work being done to give the kernel some
awareness and even control over the misaligned access trapping/emulation.
It won't help today's systems though, and either way you're right:
emulating is very, very slow.
-Evan
On Fri, Jun 23, 2023 at 03:20:14PM -0700, Evan Green wrote:
>
> The current setting for the hwprobe bit indicating misaligned access
> speed is controlled by a vendor-specific feature probe function. This is
> essentially a per-SoC table we have to maintain on behalf of each vendor
> going forward. Let's convert that instead to something we detect at
> runtime.
>
> We have two assembly routines at the heart of our probe: one that
> does a bunch of word-sized accesses (without aligning its input buffer),
> and the other that does byte accesses. If we can move a larger number of
> bytes using misaligned word accesses than we can with the same amount of
> time doing byte accesses, then we can declare misaligned accesses as
> "fast".
>
> The tradeoff of reducing this maintenance burden is boot time. We spend
> 4-6 jiffies per core doing this measurement (0-2 on jiffie edge
> alignment, and 4 on measurement). The timing loop was based on
> raid6_choose_gen(), which uses (16+1)*N jiffies (where N is the number
> of algorithms). On my THead C906, I found measurements to be stable
> across several reboots, and looked like this:
>
> [ 0.047582] cpu0: Unaligned word copy 1728 MB/s, byte copy 402 MB/s, misaligned accesses are fast
>
> I don't have a machine where misaligned accesses are slow, but I'd be
> interested to see the results of booting this series if someone did.

Can you elaborate on "results" please? Otherwise,

[ 0.333110] smp: Bringing up secondary CPUs ...
[ 0.370794] cpu1: Unaligned word copy 2 MB/s, byte copy 231 MB/s, misaligned accesses are slow
[ 0.411368] cpu2: Unaligned word copy 2 MB/s, byte copy 231 MB/s, misaligned accesses are slow
[ 0.451947] cpu3: Unaligned word copy 2 MB/s, byte copy 231 MB/s, misaligned accesses are slow
[ 0.462628] smp: Brought up 1 node, 4 CPUs

[ 0.631464] cpu0: Unaligned word copy 2 MB/s, byte copy 229 MB/s, misaligned accesses are slow

btw, why the mixed usage of "unaligned" and "misaligned"?

Cheers,
Conor.
On Sat, Jun 24, 2023 at 3:08 AM Conor Dooley <conor@kernel.org> wrote:
>
> On Fri, Jun 23, 2023 at 03:20:14PM -0700, Evan Green wrote:
> >
> > The current setting for the hwprobe bit indicating misaligned access
> > speed is controlled by a vendor-specific feature probe function. This is
> > essentially a per-SoC table we have to maintain on behalf of each vendor
> > going forward. Let's convert that instead to something we detect at
> > runtime.
> >
> > We have two assembly routines at the heart of our probe: one that
> > does a bunch of word-sized accesses (without aligning its input buffer),
> > and the other that does byte accesses. If we can move a larger number of
> > bytes using misaligned word accesses than we can with the same amount of
> > time doing byte accesses, then we can declare misaligned accesses as
> > "fast".
> >
> > The tradeoff of reducing this maintenance burden is boot time. We spend
> > 4-6 jiffies per core doing this measurement (0-2 on jiffie edge
> > alignment, and 4 on measurement). The timing loop was based on
> > raid6_choose_gen(), which uses (16+1)*N jiffies (where N is the number
> > of algorithms). On my THead C906, I found measurements to be stable
> > across several reboots, and looked like this:
> >
> > [ 0.047582] cpu0: Unaligned word copy 1728 MB/s, byte copy 402 MB/s, misaligned accesses are fast
> >
> > I don't have a machine where misaligned accesses are slow, but I'd be
> > interested to see the results of booting this series if someone did.
>
> Can you elaborate on "results" please? Otherwise,
>
> [ 0.333110] smp: Bringing up secondary CPUs ...
> [ 0.370794] cpu1: Unaligned word copy 2 MB/s, byte copy 231 MB/s, misaligned accesses are slow
> [ 0.411368] cpu2: Unaligned word copy 2 MB/s, byte copy 231 MB/s, misaligned accesses are slow
> [ 0.451947] cpu3: Unaligned word copy 2 MB/s, byte copy 231 MB/s, misaligned accesses are slow
> [ 0.462628] smp: Brought up 1 node, 4 CPUs
>
> [ 0.631464] cpu0: Unaligned word copy 2 MB/s, byte copy 229 MB/s, misaligned accesses are slow
>
> btw, why the mixed usage of "unaligned" and "misaligned"?

Yes, this is exactly what I was hoping for in terms of results, thank
you. I'll clean up the diction and choose one word. I think my brain
attributed subtle differences between unaligned (as in, without regard
for alignment, the behavior of the copies) and misaligned (i.e.,
deliberately out of alignment, the type of access we're testing), but
I'm not sure I'm even fully consistent to those, so I'll fix it.
-Evan

> Cheers,
> Conor.