Hi Thomas and all,
This patch set aims to improve IRQ throughput on Intel Xeon by making use of
posted interrupts.

I presented this topic at the LPC2023 IOMMU/VFIO/PCI MC:

https://lpc.events/event/17/sessions/172/#20231115
Background
==========
On modern x86 server SoCs, interrupt remapping (IR) is required and turned
on by default to support X2APIC. Two interrupt remapping modes can be supported
by IOMMU/VT-d:
- Remappable (host)
- Posted (guest only so far)
With remappable mode, the device-MSI-to-CPU delivery is a hardware flow with no
system software touch points; it roughly goes as follows:
1. Devices issue interrupt requests with writes to 0xFEEx_xxxx
2. The system agent accepts and remaps/translates the IRQ
3. Upon receiving the translation response, the system agent notifies the
destination CPU with the translated MSI
4. CPU's local APIC accepts interrupts into its IRR/ISR registers
5. Interrupt delivered through IDT (MSI vector)
The above process can be inefficient under high IRQ rates. The notifications in
step #3 are often unnecessary when the destination CPU is already overwhelmed
with handling bursts of IRQs. On some architectures, such as Intel Xeon, step #3
is also expensive and requires strong ordering w.r.t. DMA. As a result, slower
IRQ rates can become a limiting factor for DMA I/O performance.
For example, on Intel Xeon Sapphire Rapids SoC, as more NVMe disks are attached
to the same socket, FIO (libaio engine) 4K block random read performance
per-disk drops quickly.
   # of disks         2        4        8
   ---------------------------------------
   IOPS (million)   1.991    1.136    0.834

   (NVMe Gen 5 Samsung PM174x)
With posted mode enabled in interrupt remapping, the interrupt flow is divided
into two parts: posting (storing pending IRQ vector information in memory) and
CPU notification.
The remappable IRQ flow above becomes the following (steps 1 and 2 unchanged):

3. Notifies the destination CPU with a notification vector
   - IOMMU suppresses CPU notification
   - IOMMU atomically swaps/stores the IRQ status into the memory-resident
     posted interrupt descriptor (PID)
4. CPU's local APIC accepts the notification interrupt into its IRR/ISR
   registers
5. Interrupt delivered through IDT (notification vector handler)
   System SW allows new notifications by clearing the outstanding notification
   (ON) bit in the PID.

(The above flow is not in Linux today since we only use posted mode for VMs)
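For reference, here is a minimal C sketch of the posted interrupt descriptor
that the flow above relies on. It follows the pre-existing KVM definition and
the VT-d specification; this series reworks the structure (a 64-bit PIR union
and bitfield accessors, per the patch titles below), so the actual layout in
arch/x86/include/asm/posted_intr.h differs in detail.

#include <linux/types.h>

/* Simplified posted interrupt descriptor (PID); one per host CPU here. */
struct pi_desc {
        u64 pir[4];             /* 256 posted interrupt requests, one bit per vector */
        union {
                struct {
                        u16     on      : 1,    /* outstanding notification pending */
                                sn      : 1,    /* suppress notifications while set */
                                rsvd_1  : 14;
                        u8      nv;             /* notification vector sent to the CPU */
                        u8      rsvd_2;
                        u32     ndst;           /* notification destination (physical APIC ID) */
                };
                u64 control;
        };
        u32 rsvd[6];            /* pad the descriptor to 64 bytes */
} __attribute__((aligned(64)));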
Note that the system software can now suppress CPU notifications at runtime as
needed. This allows the system software to coalesce the expensive CPU
notifications and, in turn, improve IRQ throughput and DMA performance.
Consider the following scenario when MSIs arrive at a CPU in high-frequency
bursts:
Time ------------------------------------------------------------------------>
        ^  ^   ^   ^  ^   ^   ^   ^  ^
MSIs    A  B   C   D  E   F   G   H  I

RI      N  N'  N'  N  N'  N'  N'  N  N

PI      N          N              N  N
RI: remappable interrupt; PI: posted interrupt;
N: interrupt notification, N': superfluous interrupt notification
With remappable interrupts (row RI), every MSI generates a notification event
to the CPU.

With posted interrupts enabled by this patch set (row PI), CPU notifications
are coalesced during IRQ bursts: the superfluous notifications (N') in the flow
above are eliminated. We refer to this mechanism as Coalesced Interrupt
Delivery (CID).
Posted interrupts have existed for a long time; so far they have been used for
virtualization, where MSIs from directly assigned devices can be delivered to
the guest kernel without VMM intervention. On x86 Intel platforms, posted
interrupts can be used on the host as well; in that case only the host physical
address of the posted interrupt descriptor (PID) is used.

This patch set enables a new usage of posted interrupts on existing and new
hardware for host kernel device MSIs, referred to as posted MSIs throughout
this patch set.
Performance (with this patch set):
==================================
Test #1. NVMe FIO
FIO with the libaio engine (million IOPS per disk), Gen 5 NVMe Samsung PM174x
disks on a single socket Intel Xeon Sapphire Rapids. Random read with 4K block
size. NVMe IRQ affinity is managed by the kernel with one vector per CPU.
   #disks    Before    After    %Gain
   ---------------------------------------
   8         0.834     1.943    132%
   4         1.136     2.023     78%
Other observations:
- Increasing the block size shows diminishing benefits; e.g. with 4 NVMe disks
  on one x16 PCIe slot, the combined IOPS looks like:

    Block Size   Baseline   PostedMSI
    -------------------------------------
    4K           6475       8778
    8K           5727       5896
    16K          2864       2900
    32K          1546       1520
    128K          397        398
- Submission/completion latency (usec) also improved at 4K block size only

  FIO reported SLAT
  ---------------------------------------
  Block Size   Baseline   PostedMSI
  4K           2177       2282
  8K           4416       3967
  16K          2950       3053
  32K          3453       3505
  128K         5911       5801

  FIO reported CLAT
  ---------------------------------------
  Block Size   Baseline   PostedMSI
  4K            313        230
  8K            352        343
  16K           711        702
  32K          1320       1343
  128K         5146       5137
Test #2. Intel Data Streaming Accelerator (DSA)

Two dedicated workqueues from two PCI root complex integrated endpoint (RCIEP)
devices, with the IRQ affinity of the two interrupts pinned to a single CPU.

                                 Before    After    %Gain
   ------------------------------------------------------
   DSA memfill (mil IRQs/sec)    5.157     8.987    74%

DMA throughput shows similar improvements.
At lower IRQ rates (< 1 million/sec), no performance benefit or regression has
been observed so far.

No-harm tests were also performed to ensure there is no performance regression
on workloads that do not have high interrupt rates. These tests include:
- kernel compile time
- file copy
- FIO NVMe random writes
Implementation choices:
=======================
- Transparent to device drivers
- System-wide option instead of per-device or per-IRQ opt-in, i.e. once enabled
  all device MSIs are posted (the opt-in is a kernel command-line option; see
  the example after this list). The benefit is that only the IR irq_chip and
  domain layer need to change; there is no change to PCI MSI.
  Exceptions are: IOAPIC, HPET, and VT-d's own IRQs
- Limit the number of polling/demuxing loops per CPU notification event
- Only the Intel-IR domain in the IRQ domain hierarchy
  VECTOR->INTEL-IR->PCI-MSI is changed
- x86 Intel only so far; can be extended to other architectures with posted
  interrupt support (ARM and AMD), RFC
- Bare metal only; there is no posted-interrupt-capable virtual IOMMU yet
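As an example of the opt-in mentioned in the list above (assuming the option
keeps the posted_msi name used in the test reports later in this thread), a
single extra kernel command-line token enables it on a platform that already
has interrupt remapping on:

        posted_msi

Without the option, the Intel-IR domain keeps operating in remappable mode as
before.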
Changes and implications (moving from remappable to posted mode)
================================================================
1. All MSI vectors are multiplexed into a single notification vector for each
   CPU. MSI vectors are then de-multiplexed by SW; there is no IDT delivery for
   individual MSIs.
2. The following features of remappable mode are lost (AFAIK, none of them
   matters for device MSIs):
   - Control of the delivery mode, e.g. NMI, for MSIs
   - Logical destinations; the posted interrupt destination is the x2APIC
     physical APIC ID
   - Per-vector stacks, since all MSI vectors are multiplexed into one
Runtime changes
===============
The IRQ runtime behavior has changed with this patch set. Below is a
pseudo-trace comparison for 3 MSIs of different vectors arriving in a burst on
the same CPU, while a system vector interrupt (e.g. timer) arrives randomly.
BEFORE:

interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
interrupt(timer)
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
AFTER:

interrupt /* Posted MSI notification vector */
    irq_enter()
        atomic_xchg(PIR)
        handler()
        handler()
        handler()
        pi_clear_on()
    apic_eoi()
    irq_exit()
interrupt(timer)
        process_softirq()
With posted MSIs (as pointed out by Thomas Gleixner), both high-priority
interrupts (system interrupt vectors) and softIRQs are blocked during the MSI
vector demux loop; some of them can be timing sensitive.

Here are the options I have attempted or am still working on:

1. Use a self-IPI to invoke the MSI vector handler, but that took away the
   majority of the performance benefits.
2. Limit the number of demuxing loops; this is what is implemented in this
   patch set (a sketch of the demux loop follows this list). Note that today we
   already allow one low-priority MSI to block system interrupts: a system
   vector can preempt MSI vectors without waiting for EOI, but we have IRQs
   disabled in the ISR.

   Performance data (on DSA with memfill) also shows that coalescing more than
   3 loops yields diminishing benefits, therefore the maximum number of
   coalescing loops is set to 3 in this patch.
   MaxLoop    IRQ/sec    bandwidth (Mbps)
   -------------------------------------------
   2          6157107    25219
   3          6226611    25504
   4          6557081    26857
   5          6629683    27155
   6          6662425    27289
3. Limit the time that system interrupts can be blocked (WIP).
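For option 2, here is a rough C sketch of the notification handler's demux
loop. It assumes the handle_pending_pir(), pi_clear_on() and apic_eoi() helpers
from this series; the loop-limit macro name and other details are illustrative
rather than the exact code in arch/x86/kernel/irq.c.

/* Illustrative name; the actual limit of 3 lives in arch/x86/kernel/irq.c. */
#define MAX_POSTED_MSI_COALESCING_LOOP  3

static void posted_msi_demux_sketch(struct pi_desc *pid, struct pt_regs *regs)
{
        int i;

        for (i = 0; i < MAX_POSTED_MSI_COALESCING_LOOP; i++) {
                /* Scan a snapshot of the PIR and call the ISR of every set vector. */
                if (!handle_pending_pir(pid->pir, regs))
                        break;          /* nothing new was posted, stop polling */
        }

        /* Clear ON so the IOMMU sends a fresh notification for later MSIs. */
        pi_clear_on(pid);

        /* One more scan to catch vectors posted between the loop and clearing ON. */
        handle_pending_pir(pid->pir, regs);

        apic_eoi();
}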
In addition, posted MSI uses atomic xchg on the PID from both the CPU and the
IOMMU. Compared to remappable mode, there may be additional cache line
ownership contention over the PID. However, we have not observed performance
regressions at lower IRQ rates, and at high interrupt rates posted mode always
wins. The sketch below shows how the CPU-side scan limits that contention.
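Below is a sketch of that CPU-side scan, following the handle_pending_pir()
helper this series adds to arch/x86/kernel/irq.c: the PIR is read with plain
loads first and only the non-zero 64-bit chunks are claimed with an atomic
xchg(), so untouched chunks never bounce the PID cache line. The dispatch
helpers (call_irq_handler() and friends) are assumed from the series; the
exact code differs in detail.

static __always_inline bool handle_pending_pir_sketch(u64 *pir, struct pt_regs *regs)
{
        unsigned long pir_copy[4];
        int i, vec = FIRST_EXTERNAL_VECTOR;
        bool handled = false;

        /* Plain reads first: chunks with nothing pending need no atomic op. */
        for (i = 0; i < 4; i++)
                pir_copy[i] = pir[i];

        /* Atomically claim only the chunks that actually have bits set. */
        for (i = 0; i < 4; i++) {
                if (!pir_copy[i])
                        continue;
                pir_copy[i] = arch_xchg(&pir[i], 0);
                handled = true;
        }

        /* Call the ISR of every vector found pending in the snapshot. */
        if (handled) {
                for_each_set_bit_from(vec, pir_copy, FIRST_SYSTEM_VECTOR)
                        call_irq_handler(vec, regs);
        }

        return handled;
}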
Testing:
========
The following tests have been performed and continue to be evaluated:
- IRQ affinity change, migration
- CPU offlining
- Multi-vector coalescing
- Low IRQ rate, general no-harm test
- VM device assignment via VFIO
- General no-harm test; performance regressions have not been observed for low
  IRQ rate workloads
With the patch, a new entry in /proc/interrupts is added.
cat /proc/interrupts | grep PMN
PMN: 13868907 Posted MSI notification event
No change to the device MSI accounting.
A new INTEL-IR-POST irq_chip is visible in the IRQ debugfs, e.g.:

domain:  IR-PCI-MSIX-0000:6f:01.0-12
 hwirq:   0x8
 chip:    IR-PCI-MSIX-0000:6f:01.0
  flags:   0x430
           IRQCHIP_SKIP_SET_WAKE
           IRQCHIP_ONESHOT_SAFE
 parent:
    domain:  INTEL-IR-12-13
     hwirq:   0x90000
     chip:    INTEL-IR-POST   /* For posted MSIs */
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x65
         chip:    APIC
Acknowledgment
==============
- Rajesh Sankaran and Ashok Raj for the original idea
- Thomas Gleixner for reviewing and guiding the upstream direction of the PoC
  patches, and for correcting my many misunderstandings of the IRQ subsystem
- Jie J Yan (Jeff), Sebastien Lemarie, and Dan Liang for performance evaluation
  with NVMe and network workloads
- Bernice Zhang and Scott Morris for functional validation
- Michael Prinke for helping me understand how the VT-d HW works
- Sanjay Kumar for providing the DSA IRQ test suite
Thanks,
Jacob
Change log:
V1 (since RFC)
- Removed mention of wishful features: IRQ preemption, a separate and full MSI
  vector space
- Refined the MSI handler de-multiplexing loop based on suggestions from Peter
  and Thomas; reduced xchg() usage and code duplication
- Assign the new posted IR irq_chip only to device MSI/MSI-X, avoiding changes
  to the IO-APIC code
- Extract and use common code for preventing lost interrupts during affinity
  changes
- Added more test results to the cover letter
Jacob Pan (12):
x86/irq: Move posted interrupt descriptor out of vmx code
x86/irq: Unionize PID.PIR for 64bit access w/o casting
x86/irq: Add a Kconfig option for posted MSI
x86/irq: Reserve a per CPU IDT vector for posted MSIs
x86/irq: Add accessors for posted interrupt descriptors
x86/irq: Factor out calling ISR from common_interrupt
x86/irq: Install posted MSI notification handler
x86/irq: Factor out common code for checking pending interrupts
x86/irq: Extend checks for pending vectors to posted interrupts
iommu/vt-d: Make posted MSI an opt-in cmdline option
iommu/vt-d: Add an irq_chip for posted MSIs
iommu/vt-d: Enable posted mode for device MSIs
Thomas Gleixner (3):
x86/irq: Use bitfields exclusively in posted interrupt descriptor
x86/irq: Set up per host CPU posted interrupt descriptors
iommu/vt-d: Add a helper to retrieve PID address
.../admin-guide/kernel-parameters.txt | 1 +
arch/x86/Kconfig | 11 ++
arch/x86/include/asm/apic.h | 12 ++
arch/x86/include/asm/hardirq.h | 6 +
arch/x86/include/asm/idtentry.h | 3 +
arch/x86/include/asm/irq_remapping.h | 11 ++
arch/x86/include/asm/irq_vectors.h | 9 +-
arch/x86/include/asm/posted_intr.h | 116 +++++++++++++
arch/x86/kernel/apic/vector.c | 5 +-
arch/x86/kernel/cpu/common.c | 3 +
arch/x86/kernel/idt.c | 3 +
arch/x86/kernel/irq.c | 156 ++++++++++++++++--
arch/x86/kvm/vmx/posted_intr.h | 93 +----------
arch/x86/kvm/vmx/vmx.c | 1 +
arch/x86/kvm/vmx/vmx.h | 2 +-
drivers/iommu/intel/irq_remapping.c | 115 ++++++++++++-
drivers/iommu/irq_remapping.c | 13 +-
17 files changed, 446 insertions(+), 114 deletions(-)
create mode 100644 arch/x86/include/asm/posted_intr.h
--
2.25.1
On 1/27/2024 7:42 AM, Jacob Pan wrote:
[...]
> The above process can be inefficient under high IRQ rates. The notifications
> in step #3 are often unnecessary when the destination CPU is already
> overwhelmed with handling bursts of IRQs. On some architectures, such as Intel
> Xeon, step #3 is also expensive and requires strong ordering w.r.t DMA.

Can you tell more on this "step #3 requires strong ordering w.r.t. DMA"?

> As a result, slower
> IRQ rates can become a limiting factor for DMA I/O performance.
Hi Robert,
On Thu, 4 Apr 2024 21:45:05 +0800, Robert Hoo <robert.hoo.linux@gmail.com>
wrote:
> On 1/27/2024 7:42 AM, Jacob Pan wrote:
> > Hi Thomas and all,
> >
> > This patch set is aimed to improve IRQ throughput on Intel Xeon by
> > making use of posted interrupts.
> >
> > There is a session at LPC2023 IOMMU/VFIO/PCI MC where I have presented
> > this topic.
> >
> > https://lpc.events/event/17/sessions/172/#20231115
> >
> > Background
> > ==========
> > On modern x86 server SoCs, interrupt remapping (IR) is required and
> > turned on by default to support X2APIC. Two interrupt remapping modes
> > can be supported by IOMMU/VT-d:
> >
> > - Remappable (host)
> > - Posted (guest only so far)
> >
> > With remappable mode, the device MSI to CPU process is a HW flow
> > without system software touch points, it roughly goes as follows:
> >
> > 1. Devices issue interrupt requests with writes to 0xFEEx_xxxx
> > 2. The system agent accepts and remaps/translates the IRQ
> > 3. Upon receiving the translation response, the system agent
> > notifies the destination CPU with the translated MSI
> > 4. CPU's local APIC accepts interrupts into its IRR/ISR registers
> > 5. Interrupt delivered through IDT (MSI vector)
> >
> > The above process can be inefficient under high IRQ rates. The
> > notifications in step #3 are often unnecessary when the destination CPU
> > is already overwhelmed with handling bursts of IRQs. On some
> > architectures, such as Intel Xeon, step #3 is also expensive and
> > requires strong ordering w.r.t DMA.
>
> Can you tell more on this "step #3 requires strong ordering w.r.t. DMA"?
>
I am not sure how much microarchitectural detail I can disclose, but the point
is that there are ordering rules between DMA reads/writes and posted MSI
writes. I am not a hardware expert.

From the PCIe point of view, my understanding is that the upstream writes
tested here on the NVMe drives, resulting from 4K random reads, are relaxed
ordered. I can see lspci showing RlxdOrd+ on my Samsung drives:

	DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
		RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
		MaxPayload 512 bytes, MaxReadReq 4096 bytes

But MSIs are strictly ordered AFAIK.
> > As a result, slower
> > IRQ rates can become a limiting factor for DMA I/O performance.
> >
>
>
Thanks,
Jacob
Hi Jacob,
I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
IOPS bound on the drive, and using 1 thread per drive for IO. Random
reads, using io_uring.
For reference, using polled IO:
IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
which is about 5.1M/drive, which is what they can deliver.
Before your patches, I see:
IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
at 2.82M ints/sec. With the patches, I see:
IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
quite to the extent I expected. Booted with 'posted_msi' and I do see
posted interrupts increasing in the PMN row of /proc/interrupts.
Probably want to fold this one in:
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 8e09d40ea928..a289282f1cf9 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -393,7 +393,7 @@ void intel_posted_msi_init(void)
* instead of:
* read, xchg, read, xchg, read, xchg, read, xchg
*/
-static __always_inline inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
+static __always_inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
{
int i, vec = FIRST_EXTERNAL_VECTOR;
unsigned long pir_copy[4];
--
Jens Axboe
Hi Jens,
On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> Hi Jacob,
>
> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
> IOPS bound on the drive, and using 1 thread per drive for IO. Random
> reads, using io_uring.
>
> For reference, using polled IO:
>
> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
>
> which is abount 5.1M/drive, which is what they can deliver.
>
> Before your patches, I see:
>
> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
>
> at 2.82M ints/sec. With the patches, I see:
>
> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
>
> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> quite at the extent I expected. Booted with 'posted_msi' and I do see
> posted interrupts increasing in the PMN in /proc/interrupts,
>
The ints/sec reduction is not as high as I expected either, especially
at this high rate. Which means not enough coalescing going on to get the
performance benefits.

The opportunity for IRQ coalescing is also dependent on how long the
driver's hardirq handler executes. In the posted MSI demux loop, it does
not wait for more MSIs to come before exiting the pending IRQ polling
loop. So if the hardirq handler finishes very quickly, it may not coalesce
as much. Perhaps, we need to find more "useful" work to do to maximize the
window for coalescing.

I am not familiar with the Optane driver, need to look into how its hardirq
handler works. I have only tested NVMe Gen 5 in terms of storage IO, I saw
30-50% ints/sec reduction at an even lower IRQ rate (200k/sec).
> Probably want to fold this one in:
>
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> index 8e09d40ea928..a289282f1cf9 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -393,7 +393,7 @@ void intel_posted_msi_init(void)
> * instead of:
> * read, xchg, read, xchg, read, xchg, read, xchg
> */
> -static __always_inline inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
> +static __always_inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
> {
>  	int i, vec = FIRST_EXTERNAL_VECTOR;
>  	unsigned long pir_copy[4];
>
Good catch! will do.
Thanks,
Jacob
On 2/9/24 10:43 AM, Jacob Pan wrote:
> The ints/sec reduction is not as high as I expected either, especially
> at this high rate. Which means not enough coalescing going on to get the
> performance benefits.

Right, it means that we're getting pretty decent commands-per-int
coalescing already. I added another drive and repeated, here's that one:

IOPS w/polled: 25.7M IOPS

Stock kernel:

IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32

at ~3.7M ints/sec, or about 5.8 IOPS / int on average.

Patched kernel:

IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32

at the same interrupt rate. So not a reduction, but slightly higher
perf. Maybe we're reaping more commands on average per interrupt.

Anyway, not a lot of interesting data there, just figured I'd re-run it
with the added drive.

> The opportunity of IRQ coalescing is also dependent on how long the
> driver's hardirq handler executes. In the posted MSI demux loop, it does
> not wait for more MSIs to come before existing the pending IRQ polling
> loop. So if the hardirq handler finishes very quickly, it may not coalesce
> as much. Perhaps, we need to find more "useful" work to do to maximize the
> window for coalescing.
>
> I am not familiar with optane driver, need to look into how its hardirq
> handler work. I have only tested NVMe gen5 in terms of storage IO, i saw
> 30-50% ints/sec reduction at even lower IRQ rate (200k/sec).

It's just an nvme device, so it's the nvme driver. The IRQ side is very
cheap - for as long as there are CQEs in the completion ring, it'll reap
them and complete them. That does mean that if we get an IRQ and there's
more than one entry to complete, we will do all of them. No IRQ
coalescing is configured (nvme kind of sucks for that...), but optane
media is much faster than flash, so that may be a difference.

-- 
Jens Axboe
Hi Jens,
On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> On 2/9/24 10:43 AM, Jacob Pan wrote:
> > Hi Jens,
> >
> > On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> >
> >> Hi Jacob,
> >>
> >> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
> >> IOPS bound on the drive, and using 1 thread per drive for IO. Random
> >> reads, using io_uring.
> >>
> >> For reference, using polled IO:
> >>
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
> >>
> >> which is abount 5.1M/drive, which is what they can deliver.
> >>
> >> Before your patches, I see:
> >>
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>
> >> at 2.82M ints/sec. With the patches, I see:
> >>
> >> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
> >>
> >> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> >> quite at the extent I expected. Booted with 'posted_msi' and I do see
> >> posted interrupts increasing in the PMN in /proc/interrupts,
> >>
> > The ints/sec reduction is not as high as I expected either, especially
> > at this high rate. Which means not enough coalescing going on to get the
> > performance benefits.
>
> Right, it means that we're getting pretty decent commands-per-int
> coalescing already. I added another drive and repeated, here's that one:
>
> IOPS w/polled: 25.7M IOPS
>
> Stock kernel:
>
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
>
> at ~3.7M ints/sec, or about 5.8 IOPS / int on average.
>
> Patched kernel:
>
> IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32
>
> at the same interrupt rate. So not a reduction, but slighter higher
> perf. Maybe we're reaping more commands on average per interrupt.
>
> Anyway, not a lot of interesting data there, just figured I'd re-run it
> with the added drive.
>
> > The opportunity of IRQ coalescing is also dependent on how long the
> > driver's hardirq handler executes. In the posted MSI demux loop, it does
> > not wait for more MSIs to come before existing the pending IRQ polling
> > loop. So if the hardirq handler finishes very quickly, it may not
> > coalesce as much. Perhaps, we need to find more "useful" work to do to
> > maximize the window for coalescing.
> >
> > I am not familiar with optane driver, need to look into how its hardirq
> > handler work. I have only tested NVMe gen5 in terms of storage IO, i saw
> > 30-50% ints/sec reduction at even lower IRQ rate (200k/sec).
>
> It's just an nvme device, so it's the nvme driver. The IRQ side is very
> cheap - for as long as there are CQEs in the completion ring, it'll reap
> them and complete them. That does mean that if we get an IRQ and there's
> more than one entry to complete, we will do all of them. No IRQ
> coalescing is configured (nvme kind of sucks for that...), but optane
> media is much faster than flash, so that may be a difference.
>
Yeah, I also checked the driver code; it seems to just wake up the threaded
handler.

For the record, here is my setup and performance data for 4 Samsung disks.
IOPS increased from 1.6M per disk to 2.1M. One difference I noticed is that
IRQ throughput improves rather than drops with this patch on my setup, e.g.:
BEFORE: 185545/sec/vector
AFTER:  220128/sec/vector
CPU: (highest non-turbo freq, maybe different on yours).
echo "Set CPU frequency P1 2.7GHz"
for i in `seq 0 1 127`; do echo 2700000 > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_max_freq ;done
for i in `seq 0 1 127`; do echo 2700000 > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_min_freq ;done
PCI:
[root@emr-bkc posted_msi_tests]# lspci -vv -nn -s 0000:64:00.0|grep -e Lnk -e Sam -e nvme
64:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM174X [144d:a826] (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd Device [144d:aa0a]
LnkCap: Port #0, Speed 32GT/s, Width x4, ASPM not supported
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
LnkSta: Speed 32GT/s (ok), Width x4 (ok)
LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS
LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis
NVME setup:
nvme5n1 SAMSUNG MZWLO1T9HCJR-00A07
nvme6n1 SAMSUNG MZWLO1T9HCJR-00A07
nvme3n1 SAMSUNG MZWLO1T9HCJR-00A07
nvme4n1 SAMSUNG MZWLO1T9HCJR-00A07
FIO:
[global]
bs=4k
direct=1
norandommap
ioengine=libaio
randrepeat=0
readwrite=randread
group_reporting
time_based
iodepth=64
exitall
random_generator=tausworthe64
runtime=30
ramp_time=3
numjobs=8
group_reporting=1
#cpus_allowed_policy=shared
cpus_allowed_policy=split
[disk_nvme6n1_thread_1]
filename=/dev/nvme6n1
cpus_allowed=0-7
[disk_nvme6n1_thread_1]
filename=/dev/nvme5n1
cpus_allowed=8-15
[disk_nvme5n1_thread_2]
filename=/dev/nvme4n1
cpus_allowed=16-23
[disk_nvme5n1_thread_3]
filename=/dev/nvme3n1
cpus_allowed=24-31
iostat w/o posted MSI patch, v6.8-rc1:
nvme3c3n1 1615525.00 6462100.00 0.00 0.00 6462100
nvme4c4n1 1615471.00 6461884.00 0.00 0.00 6461884
nvme5c5n1 1615602.00 6462408.00 0.00 0.00 6462408
nvme6c6n1 1614637.00 6458544.00 0.00 0.00 6458544
irqtop (delta 1 sec.)
IRQ TOTAL DELTA NAME
800 6290026 185545 IR-PCI-MSIX-0000:65:00.0 76-edge nvme5q76
797 6279554 185295 IR-PCI-MSIX-0000:65:00.0 73-edge nvme5q73
799 6281627 185200 IR-PCI-MSIX-0000:65:00.0 75-edge nvme5q75
802 6285742 185185 IR-PCI-MSIX-0000:65:00.0 78-edge nvme5q78
... ... similar irq rate for all 32 vectors
iostat w/ posted MSI patch:
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
nvme3c3n1 2184313.00 8737256.00 0.00 0.00 8737256 0 0
nvme4c4n1 2184241.00 8736972.00 0.00 0.00 8736972 0 0
nvme5c5n1 2184269.00 8737080.00 0.00 0.00 8737080 0 0
nvme6c6n1 2184003.00 8736012.00 0.00 0.00 8736012 0 0
irqtop w/ posted MSI patch:
IRQ TOTAL DELTA NAME
PMN 5230078416 5502657 Posted MSI notification event
423 138068935 220128 IR-PCI-MSIX-0000:64:00.0 80-edge nvme4q80
425 138057654 219963 IR-PCI-MSIX-0000:64:00.0 82-edge nvme4q82
426 138101745 219890 IR-PCI-MSIX-0000:64:00.0 83-edge nvme4q83
... ... similar irq rate for all 32 vectors
IRQ coalescing ratio: posted interrupt notifications (PMN) / total MSIs = 78%,
i.e. 5502657 / (220128 * 32) ≈ 0.78
Thanks,
Jacob
On 2/12/24 11:27 AM, Jacob Pan wrote:
> Yeah, I also check the the driver code it seems just wake up the threaded
> handler.

That only happens if you're using threaded interrupts, which is not the
default as it's much slower. What happens for the normal case is that we
init a batch, and then poll the CQ ring for completions, and then add
them to the completion batch. Once no more are found, we complete the
batch.

You're not using threaded interrupts, are you?

> For the record, here is my set up and performance data for 4 Samsung disks.
> IOPS increased from 1.6M per disk to 2.1M. One difference I noticed is that
> IRQ throughput is improved instead of reduction with this patch on my setup.
> e.g. BEFORE: 185545/sec/vector
>      AFTER: 220128

I'm surprised at the rates being that low, and if so, why the posted MSI
makes a difference? Usually what I've seen for IRQ being slower than
poll is if interrupt delivery is unreasonably slow on that architecture
of machine. But ~200k/sec isn't that high at all.

> [global]
> bs=4k
> direct=1
> norandommap
> ioengine=libaio
> [...]
> [disk_nvme5n1_thread_3]
> filename=/dev/nvme3n1
> cpus_allowed=24-31

For better performance, I'd change that engine=libaio to:

ioengine=io_uring
fixedbufs=1
registerfiles=1

Particularly fixedbufs makes a big difference, as a big cycle consumer
is mapping/unmapping pages from the application space into the kernel
for O_DIRECT. With fixedbufs=1, this is done once and we just reuse the
buffers. At least for my runs, this is ~15% of the systime for doing IO.
It also removes the page referencing, which isn't as big a consumer, but
still noticeable.

Anyway, side quest, but I think you'll find this considerably reduces
overhead / improves performance. Also makes it so that you can compare
with polled IO on nvme, which aio can't do. You'd just add hipri=1 as an
option for that (with a side note that you need to configure nvme poll
queues, see the poll_queues parameter).

On my box, all the NVMe devices seem to be on node1, not node0 which
looks like it's the CPUs you are using. Might be worth checking and
adjusting your CPU domains for each drive? I also tend to get better
performance by removing the CPU scheduler, eg just pin each job to a
single CPU rather than many. It's just one process/thread anyway, so
really no point in giving it options here. It'll help reduce variability
too, which can be a pain in the butt to deal with.

-- 
Jens Axboe
Hi Jens,

On Mon, 12 Feb 2024 11:36:42 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> > For the record, here is my set up and performance data for 4 Samsung
> > disks. IOPS increased from 1.6M per disk to 2.1M. One difference I
> > noticed is that IRQ throughput is improved instead of reduction with
> > this patch on my setup. e.g. BEFORE: 185545/sec/vector
> > AFTER: 220128
>
> I'm surprised at the rates being that low, and if so, why the posted MSI
> makes a difference? Usually what I've seen for IRQ being slower than
> poll is if interrupt delivery is unreasonably slow on that architecture
> of machine. But ~200k/sec isn't that high at all.

Even at ~200k/sec, I am seeing around a 75% ratio between posted interrupt
notifications and MSIs, i.e. for every 4 MSIs we save one CPU notification.
That might be where the savings come from.

I was expecting an even or reduced number of CPU notifications but more MSI
throughput. Instead, Optane gets fewer MSIs/sec, as your data shows. Is it
possible to get the interrupt coalescing ratio on your setup? i.e. the PMN
count in /proc/interrupts divided by the total NVMe MSIs.

Here is a summary of my testing on 4 Samsung Gen 5 drives:

test case                 IOPS*1000    ints/sec (MSI)
=====================================================
aio                        6348        182218
io_uring                   6895        207932
aio w/ posted MSI          8295        185545
io_uring w/ posted MSI     8811        220128
io_uring poll_queue       13000             0
=====================================================

Thanks,
Jacob
Hi Jens,
On Mon, 12 Feb 2024 11:36:42 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> On 2/12/24 11:27 AM, Jacob Pan wrote:
> > Hi Jens,
> >
> > On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> >
> >> On 2/9/24 10:43 AM, Jacob Pan wrote:
> >>> Hi Jens,
> >>>
> >>> On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> >>>
> >>>> Hi Jacob,
> >>>>
> >>>> I gave this a quick spin, using 4 gen2 optane drives. Basic test,
> >>>> just IOPS bound on the drive, and using 1 thread per drive for IO.
> >>>> Random reads, using io_uring.
> >>>>
> >>>> For reference, using polled IO:
> >>>>
> >>>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >>>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >>>> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
> >>>>
> >>>> which is abount 5.1M/drive, which is what they can deliver.
> >>>>
> >>>> Before your patches, I see:
> >>>>
> >>>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >>>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >>>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>>>
> >>>> at 2.82M ints/sec. With the patches, I see:
> >>>>
> >>>> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> >>>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> >>>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
> >>>>
> >>>> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> >>>> quite at the extent I expected. Booted with 'posted_msi' and I do see
> >>>> posted interrupts increasing in the PMN in /proc/interrupts,
> >>>>
> >>> The ints/sec reduction is not as high as I expected either, especially
> >>> at this high rate. Which means not enough coalescing going on to get
> >>> the performance benefits.
> >>
> >> Right, it means that we're getting pretty decent commands-per-int
> >> coalescing already. I added another drive and repeated, here's that
> >> one:
> >>
> >> IOPS w/polled: 25.7M IOPS
> >>
> >> Stock kernel:
> >>
> >> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> >> IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
> >> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> >>
> >> at ~3.7M ints/sec, or about 5.8 IOPS / int on average.
> >>
> >> Patched kernel:
> >>
> >> IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
> >> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
> >> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32
> >>
> >> at the same interrupt rate. So not a reduction, but slighter higher
> >> perf. Maybe we're reaping more commands on average per interrupt.
> >>
> >> Anyway, not a lot of interesting data there, just figured I'd re-run it
> >> with the added drive.
> >>
> >>> The opportunity of IRQ coalescing is also dependent on how long the
> >>> driver's hardirq handler executes. In the posted MSI demux loop, it
> >>> does not wait for more MSIs to come before existing the pending IRQ
> >>> polling loop. So if the hardirq handler finishes very quickly, it may
> >>> not coalesce as much. Perhaps, we need to find more "useful" work to
> >>> do to maximize the window for coalescing.
> >>>
> >>> I am not familiar with optane driver, need to look into how its
> >>> hardirq handler work. I have only tested NVMe gen5 in terms of
> >>> storage IO, i saw 30-50% ints/sec reduction at even lower IRQ rate
> >>> (200k/sec).
> >>
> >> It's just an nvme device, so it's the nvme driver. The IRQ side is very
> >> cheap - for as long as there are CQEs in the completion ring, it'll
> >> reap them and complete them. That does mean that if we get an IRQ and
> >> there's more than one entry to complete, we will do all of them. No IRQ
> >> coalescing is configured (nvme kind of sucks for that...), but optane
> >> media is much faster than flash, so that may be a difference.
> >>
> > Yeah, I also check the the driver code it seems just wake up the
> > threaded handler.
>
> That only happens if you're using threaded interrupts, which is not the
> default as it's much slower. What happens for the normal case is that we
> init a batch, and then poll the CQ ring for completions, and then add
> them to the completion batch. Once no more are found, we complete the
> batch.
>
thanks for the explanation.
> You're not using threaded interrupts, are you?
No. I didn't add module parameter "use_threaded_interrupts"
>
> > For the record, here is my set up and performance data for 4 Samsung
> > disks. IOPS increased from 1.6M per disk to 2.1M. One difference I
> > noticed is that IRQ throughput is improved instead of reduction with
> > this patch on my setup. e.g. BEFORE: 185545/sec/vector
> > AFTER: 220128
>
> I'm surprised at the rates being that low, and if so, why the posted MSI
> makes a difference? Usually what I've seen for IRQ being slower than
> poll is if interrupt delivery is unreasonably slow on that architecture
> of machine. But ~200k/sec isn't that high at all.
>
> > [global]
> > bs=4k
> > direct=1
> > norandommap
> > ioengine=libaio
> > randrepeat=0
> > readwrite=randread
> > group_reporting
> > time_based
> > iodepth=64
> > exitall
> > random_generator=tausworthe64
> > runtime=30
> > ramp_time=3
> > numjobs=8
> > group_reporting=1
> >
> > #cpus_allowed_policy=shared
> > cpus_allowed_policy=split
> > [disk_nvme6n1_thread_1]
> > filename=/dev/nvme6n1
> > cpus_allowed=0-7
> > [disk_nvme6n1_thread_1]
> > filename=/dev/nvme5n1
> > cpus_allowed=8-15
> > [disk_nvme5n1_thread_2]
> > filename=/dev/nvme4n1
> > cpus_allowed=16-23
> > [disk_nvme5n1_thread_3]
> > filename=/dev/nvme3n1
> > cpus_allowed=24-31
>
> For better performance, I'd change that engine=libaio to:
>
> ioengine=io_uring
> fixedbufs=1
> registerfiles=1
>
> Particularly fixedbufs makes a big difference, as a big cycle consumer
> is mapping/unmapping pages from the application space into the kernel
> for O_DIRECT. With fixedbufs=1, this is done once and we just reuse the
> buffers. At least for my runs, this is ~15% of the systime for doing IO.
> It also removes the page referencing, which isn't as big a consumer, but
> still noticeable.
>
Indeed, the CPU utilization (system time) goes down significantly. I got the
following with the posted MSI patch applied:
Before (aio):
read: IOPS=8925k, BW=34.0GiB/s (36.6GB/s)(1021GiB/30001msec)
user 3m25.156s
sys 11m16.785s
After (fixedbufs, iouring engine):
read: IOPS=8811k, BW=33.6GiB/s (36.1GB/s)(1008GiB/30002msec)
user 2m56.255s
sys 8m56.378s
It seems there is no gain in IOPS, just a reduction in CPU utilization.
Both show an improvement over libaio without the posted MSI patch.
> Anyway, side quest, but I think you'll find this considerably reduces
> overhead / improves performance. Also makes it so that you can compare
> with polled IO on nvme, which aio can't do. You'd just add hipri=1 as an
> option for that (with a side note that you need to configure nvme poll
> queues, see the poll_queues parameter).
>
> On my box, all the NVMe devices seem to be on node1, not node0 which
> looks like it's the CPUs you are using. Might be worth checking and
> adjusting your CPU domains for each drive? I also tend to get better
> performance by removing the CPU scheduler, eg just pin each job to a
> single CPU rather than many. It's just one process/thread anyway, so
> really no point in giving it options here. It'll help reduce variability
> too, which can be a pain in the butt to deal with.
>
Much faster with poll_queues=32 (32 jobs):
read: IOPS=13.0M, BW=49.6GiB/s (53.3GB/s)(1489GiB/30001msec)
user 2m29.177s
sys 15m7.022s
Observed no IRQ counts from NVME.
Thanks,
Jacob