[PATCH RFC v1 0/3] aio-poll: improve aio-polling efficiency
Posted by Jaehoon Kim 3 weeks, 3 days ago
Dear all,

I'm submitting this patch series for review under the RFC tag.

This patch series refines the aio_poll adaptive polling logic to reduce
unnecessary busy-waiting and improve CPU efficiency.

The first patch prevents redundant polling time calculation when polling
is disabled. The second patch enhances the adaptive polling mechanism by
dynamically adjusting the iothread's polling duration based on event
intervals measured by individual AioHandlers. The third patch introduces
a new 'poll-weight' parameter to adjust how much the current interval
influences the next polling duration.
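
To make the weighted-interval idea concrete, here is a minimal sketch
in C (PollState, poll_update, and the EWMA form are illustrative names
for this mail, not the actual identifiers in the patches):

#include <stdint.h>

typedef struct {
    int64_t poll_ns;   /* current busy-poll window */
    int64_t max_ns;    /* upper bound, i.e. poll-max-ns */
} PollState;

/* weight in [0.0, 1.0] controls how strongly the latest measured
 * event interval pulls the polling window toward itself */
static void poll_update(PollState *s, int64_t interval_ns, double weight)
{
    int64_t target = interval_ns < s->max_ns ? interval_ns : s->max_ns;

    /* exponentially weighted update: new = old + weight * (target - old) */
    s->poll_ns += (int64_t)(weight * (target - s->poll_ns));
}

With weight = 1 the window jumps straight to the last observed interval;
smaller weights smooth out jitter at the cost of adapting more slowly.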

We evaluated the patches on an s390x host with a single guest using 16
virtio block devices backed by FCP multipath devices in a separate-disk
setup, with the I/O scheduler set to 'none' in both host and guest.

The fio workload included sequential and random read/write with varying
numbers of jobs (1, 4, 8, 16) and an iodepth of 8. The tests were
conducted with one and two iothreads, using the newly introduced
poll-weight parameter, to measure the impact on CPU cost and throughput.

Compared to the baseline, across four FIO workload patterns (sequential
R/W, random R/W), and averaged over FIO job counts of 1, 4, 8, and 16,
throughput decreased slightly (-3% to -8% for one iothread, -2% to -5%
for two iothreads), while CPU usage on the s390x host dropped
significantly (-10% to -25% and -7% to -12%, respectively).

Best regards,
Jaehoon Kim

Jaehoon Kim (3):
  aio-poll: avoid unnecessary polling time computation
  aio-poll: refine iothread polling using weighted handler intervals
  qapi/iothread: introduce poll-weight parameter for aio-poll

 include/qemu/aio.h                |   8 +-
 include/system/iothread.h         |   1 +
 iothread.c                        |  10 ++
 monitor/hmp-cmds.c                |   1 +
 qapi/misc.json                    |   6 ++
 qapi/qom.json                     |   8 +-
 qemu-options.hx                   |   7 +-
 tests/unit/test-nested-aio-poll.c |   2 +-
 util/aio-posix.c                  | 151 +++++++++++++++++++++---------
 util/aio-win32.c                  |   3 +-
 util/async.c                      |   2 +
 11 files changed, 147 insertions(+), 52 deletions(-)

-- 
2.50.1
Re: [PATCH RFC v1 0/3] aio-poll: improve aio-polling efficiency
Posted by Stefan Hajnoczi 2 weeks, 4 days ago
On Tue, Jan 13, 2026 at 11:48:21AM -0600, Jaehoon Kim wrote:
> [...]

Hi Jaehoon,
I would like to run the same fio benchmarks on a local NVMe drive (<10us
request latency) to see how that type of hardware configuration is
affected. Are the scripts and fio job files available somewhere?

Thanks,
Stefan
Re: [PATCH RFC v1 0/3] aio-poll: improve aio-polling efficiency
Posted by JAEHOON KIM 2 weeks ago
On 1/19/2026 12:16 PM, Stefan Hajnoczi wrote:
> On Tue, Jan 13, 2026 at 11:48:21AM -0600, Jaehoon Kim wrote:
>> [...]
> Hi Jaehoon,
> I would like to run the same fio benchmarks on a local NVMe drive (<10us
> request latency) to see how that type of hardware configuration is
> affected. Are the scripts and fio job files available somewhere?
>
> Thanks,
> Stefan

Thank you for your reply.
The fio scripts are not available in a location you can access, but there is nothing particularly special about the settings.
I’m sharing below the methodology and test setup used by our performance team.

Guest Setup
----------------------
- 12 vCPUs, 4 GiB memory
- 16 virtio disks backed by the FCP multipath devices on the host

FIO test parameters
-----------------------
- FIO Version: fio-3.33
- Filesize: 2G
- Blocksize: 8K / 128K
- Direct I/O: 1
- FIO I/O Engine: libaio
- NUMJOB List: 1, 4, 8, 16
- IODEPTH: 8
- Runtime (s): 150

Two FIO samples for random read
--------------------------------
fio --direct=1 --name=test --numjobs=16 --filename=base.0.0:base.1.0:base.2.0:base.3.0:base.4.0:base.5.0:base.6.0:base.7.0:base.8.0:base.9.0:base.10.0:base.11.0:base.12.0:base.13.0:base.14.0:base.15.0 --size=32G  --time_based --runtime=4m --readwrite=randread --ioengine=libaio --iodepth=8 --bs=8k
fio --direct=1 --name=test --numjobs=4  --filename=subw1/base.0.0:subw4/base.3.0:subw8/base.7.0:subw12/base.11.0:subw16/base.15.0                                                                        --size=8G   --time_based --runtime=4m --readwrite=randread --ioengine=libaio --iodepth=8 --bs=8k


Additional notes
----------------
- Each file is placed on a separate disk device mounted under subw<n>, as specified in --filename=...
- We execute one warmup run, then two measurement runs, and calculate the average.
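
For reference, the first sample command above corresponds to a job file
along these lines (an equivalent restatement for readability; our runs
used the command-line form):

[global]
direct=1
ioengine=libaio
iodepth=8
bs=8k
rw=randread
size=32G
time_based
runtime=4m

[test]
numjobs=16
filename=base.0.0:base.1.0:base.2.0:base.3.0:base.4.0:base.5.0:base.6.0:base.7.0:base.8.0:base.9.0:base.10.0:base.11.0:base.12.0:base.13.0:base.14.0:base.15.0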

Please let me know if you need any additional information.

Regards,
Jaehoon Kim


Re: [PATCH RFC v1 0/3] aio-poll: improve aio-polling efficiency
Posted by Stefan Hajnoczi 3 days, 11 hours ago
On Fri, Jan 23, 2026 at 01:15:04PM -0600, JAEHOON KIM wrote:
> [...]

Hi Jaehoon,
I ran fio benchmarks on an Intel Optane SSD DC P4800X Series drive (<10
microsecond latency). This is with just 1 drive.

The 8 KiB block size results show something similar to what you
reported: there are IOPS (or throughput) regressions and CPU utilization
improvements.

Although the CPU improvements are welcome, I think the default behavior
should only be changed if the IOPS regressions can be brought below 5%.

The regressions seem to happen regardless of whether 1 or 2 IOThreads
are configured. CPU utilization is different (98% vs 78%) depending on
the number of IOThreads, so the regressions happen across a range of CPU
utilizations.

The 128 KiB block size results are not interesting because the drive
already saturates at numjobs=1. This is expected since the drive cannot
go much above ~2 GiB/s throughput.
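(Sanity check from the table below: 17,616 IOPS x 128 KiB = 2,202 MiB/s,
about 2.15 GiB/s, right at that ceiling.)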

You can find the Ansible playbook, libvirt domain XML, fio
command-lines, and the fio/sar data here:

https://gitlab.com/stefanha/virt-playbooks/-/tree/aio-polling-efficiency

Please let me know if you'd like me to rerun the benchmark with new
patches or a configuration change.

Do you want to have a video call to discuss your work and how to get the
patches merged?

Host
----
CPU: Intel Xeon Silver 4214 CPU @ 2.20GHz
RAM: 32 GiB

Guest
-----
vCPUs: 8
RAM: 4 GiB
Disk: 1 virtio-blk aio=native cache=none

IOPS
----
rw        bs   numjobs iothreads iops   diff
randread  8k   1       1         163417 -7.8%
randread  8k   1       2         165041 -2.4%
randread  8k   4       1         221508 -0.64%
randread  8k   4       2         251298 0.008%
randread  8k   8       1         222128 -0.51%
randread  8k   8       2         249489 -2.6%
randread  8k   16      1         230535 -0.18%
randread  8k   16      2         246732 -0.22%
randread  128k 1       1          17616 -0.11%
randread  128k 1       2          17678 0.027%
randread  128k 4       1          17536 -0.27%
randread  128k 4       2          17610 -0.031%
randread  128k 8       1          17369 -0.42%
randread  128k 8       2          17433 -0.071%
randread  128k 16      1          17215 -0.61%
randread  128k 16      2          17269 -0.22%
randwrite 8k   1       1         156597 -3.1%
randwrite 8k   1       2         157720 -3.8%
randwrite 8k   4       1         218448 -0.5%
randwrite 8k   4       2         247075 -5.1%
randwrite 8k   8       1         220866 -0.75%
randwrite 8k   8       2         260935 -0.011%
randwrite 8k   16      1         230913 0.23%
randwrite 8k   16      2         261125 -0.01%
randwrite 128k 1       1          16009 0.094%
randwrite 128k 1       2          16070 0.035%
randwrite 128k 4       1          16073 -0.62%
randwrite 128k 4       2          16131 0.059%
randwrite 128k 8       1          16106 0.092%
randwrite 128k 8       2          16153 0.048%
randwrite 128k 16      1          16102 -0.0091%
randwrite 128k 16      2          16160 0.048%

IOThread CPU usage
------------------
iothreads before (%) after (%)
1         98.7       95.81
2         78.43      66.13

Stefan
Re: [PATCH RFC v1 0/3] aio-poll: improve aio-polling efficiency
Posted by JAEHOON KIM 1 day, 1 hour ago
On 2/3/2026 3:12 PM, Stefan Hajnoczi wrote:
> [...]
>
> Although the CPU improvements are welcome, I think the default behavior
> should only be changed if the IOPS regressions can be brought below 5%.
>
> [...]

Hello Stefan,

Thank you very much for your effort in running these benchmarks.
The results show a pattern very similar to what our performance team
observed.

I fully agree with the 5% threshold for the default behavior.
However, we still need an approach that balances the current
performance-oriented polling scheme against CPU efficiency.

I found that relying on the existing grow/shrink parameters alone was
too limited to achieve this balance. That is why I adjusted the
algorithm to a weight-based grow/shrink approach that keeps the polling
window robust against jitter: even when device latency exceeds the
threshold, the window shrinks gradually instead of being reset to zero
abruptly.
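
In pseudo-C, the shrink side of the idea looks roughly like this
(illustrative only; not the exact code in the patches):

/* The baseline algorithm, with poll-shrink at its default of 0,
 * resets poll_ns to 0 on a missed poll; the weighted scheme decays
 * it instead, so one outlier interval does not discard the learned
 * polling window. */
static int64_t shrink_poll_ns(int64_t poll_ns, double weight)
{
    return poll_ns - (int64_t)(weight * poll_ns);
}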

As seen in both your results and our team's measurements, this may lead
to a bit of a performance trade-off, but it provides a reasonable
balance for CPU-sensitive environments.

Thank you for suggesting the video call and I am also looking forward to
hearing your thoughts. I'm on US Central Time. Except for Tuesday, I can
adjust my schedule to a time that works for you.

Please let me know your preferred time.

Regards,
Jaehoon Kim


Re: [PATCH RFC v1 0/3] aio-poll: improve aio-polling efficiency
Posted by Stefan Hajnoczi 1 week, 3 days ago
On Fri, Jan 23, 2026 at 01:15:04PM -0600, JAEHOON KIM wrote:
> [...]

Thanks, I will share x86_64 results with a fast local NVMe drive once I have collected them.

Stefan