Dear all,

I'm submitting this patch series for review under the RFC tag.

This patch series refines the aio_poll adaptive polling logic to reduce
unnecessary busy-waiting and improve CPU efficiency.

The first patch prevents redundant polling time calculation when polling
is disabled. The second patch enhances the adaptive polling mechanism by
dynamically adjusting the iothread's polling duration based on event
intervals measured by individual AioHandlers. The third patch introduces
a new 'poll-weight' parameter to adjust how much the current interval
influences the next polling duration.

We evaluated the patches on an s390x host with a single guest using 16
virtio block devices backed by FCP multipath devices in a separate-disk
setup, with the I/O scheduler set to 'none' in both host and guest.

The fio workload included sequential and random read/write with varying
numbers of jobs (1,4,8,16) and io_depth of 8. The tests were conducted
with single and dual iothreads, using the newly introduced poll-weight
parameter to measure their impact on CPU cost and throughput.

Compared to the baseline, across four FIO workload patterns (sequential
R/W, random R/W), and averaged over FIO job counts of 1, 4, 8, and 16,
throughput decreased slightly (-3% to -8% for one iothread, -2% to -5%
for two iothreads), while CPU usage on the s390x host dropped
significantly (-10% to -25% and -7% to -12%, respectively).

Best regards,
Jaehoon Kim

Jaehoon Kim (3):
  aio-poll: avoid unnecessary polling time computation
  aio-poll: refine iothread polling using weighted handler intervals
  qapi/iothread: introduce poll-weight parameter for aio-poll

 include/qemu/aio.h                |   8 +-
 include/system/iothread.h         |   1 +
 iothread.c                        |  10 ++
 monitor/hmp-cmds.c                |   1 +
 qapi/misc.json                    |   6 ++
 qapi/qom.json                     |   8 +-
 qemu-options.hx                   |   7 +-
 tests/unit/test-nested-aio-poll.c |   2 +-
 util/aio-posix.c                  | 151 +++++++++++++++++++++---------
 util/aio-win32.c                  |   3 +-
 util/async.c                      |   2 +
 11 files changed, 147 insertions(+), 52 deletions(-)

--
2.50.1
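As background for the discussion below, the idea behind patches 2 and 3 can
be sketched roughly as follows. This is an illustrative sketch only, not the
series' actual code: the struct, field, and function names are invented for
the example, and the 0-100 percentage range for poll-weight is an assumption.

/*
 * Sketch: each AioHandler measures the interval between the events it
 * services, and the polling duration is steered toward a weighted blend
 * of its previous value and that interval, instead of growing and
 * shrinking in fixed steps.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    int64_t poll_ns;        /* current busy-wait budget */
    int64_t last_event_ns;  /* when the previous event was serviced */
} HandlerPollState;

static void handler_event_serviced(HandlerPollState *s, int64_t now_ns,
                                   unsigned weight, int64_t poll_max_ns)
{
    int64_t interval = now_ns - s->last_event_ns;

    s->last_event_ns = now_ns;

    /* poll-weight controls how much the newest interval counts */
    s->poll_ns = ((100 - weight) * s->poll_ns + weight * interval) / 100;

    if (s->poll_ns > poll_max_ns) {
        s->poll_ns = poll_max_ns;   /* never budget more than poll-max-ns */
    }
}

int main(void)
{
    HandlerPollState s = { .poll_ns = 0, .last_event_ns = 0 };
    int64_t times[] = { 4000, 9000, 13000, 90000, 95000 };  /* event times, ns */

    for (unsigned i = 0; i < sizeof(times) / sizeof(times[0]); i++) {
        handler_event_serviced(&s, times[i], 50, 32768);
        printf("poll_ns = %lld\n", (long long)s.poll_ns);
    }
    return 0;
}

With a weight of 100 this degenerates to tracking the latest interval
directly; with smaller weights a single slow event only moves the window
part of the way, which is the CPU-versus-throughput trade-off discussed in
the replies below.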
On Tue, Jan 13, 2026 at 11:48:21AM -0600, Jaehoon Kim wrote:
> We evaluated the patches on an s390x host with a single guest using 16
> virtio block devices backed by FCP multipath devices in a separate-disk
> setup, with the I/O scheduler set to 'none' in both host and guest.
>
> The fio workload included sequential and random read/write with varying
> numbers of jobs (1,4,8,16) and io_depth of 8. The tests were conducted
> with single and dual iothreads, using the newly introduced poll-weight
> parameter to measure their impact on CPU cost and throughput.
>
> Compared to the baseline, across four FIO workload patterns (sequential
> R/W, random R/W), and averaged over FIO job counts of 1, 4, 8, and 16,
> throughput decreased slightly (-3% to -8% for one iothread, -2% to -5%
> for two iothreads), while CPU usage on the s390x host dropped
> significantly (-10% to -25% and -7% to -12%, respectively).

Hi Jaehoon,
I would like to run the same fio benchmarks on a local NVMe drive (<10us
request latency) to see how that type of hardware configuration is
affected. Are the scripts and fio job files available somewhere?

Thanks,
Stefan
On 1/19/2026 12:16 PM, Stefan Hajnoczi wrote:
> On Tue, Jan 13, 2026 at 11:48:21AM -0600, Jaehoon Kim wrote:
>> We evaluated the patches on an s390x host with a single guest using 16
>> virtio block devices backed by FCP multipath devices in a separate-disk
>> setup, with the I/O scheduler set to 'none' in both host and guest.
>>
>> The fio workload included sequential and random read/write with varying
>> numbers of jobs (1,4,8,16) and io_depth of 8. The tests were conducted
>> with single and dual iothreads, using the newly introduced poll-weight
>> parameter to measure their impact on CPU cost and throughput.
>>
>> Compared to the baseline, across four FIO workload patterns (sequential
>> R/W, random R/W), and averaged over FIO job counts of 1, 4, 8, and 16,
>> throughput decreased slightly (-3% to -8% for one iothread, -2% to -5%
>> for two iothreads), while CPU usage on the s390x host dropped
>> significantly (-10% to -25% and -7% to -12%, respectively).
> Hi Jaehoon,
> I would like to run the same fio benchmarks on a local NVMe drive (<10us
> request latency) to see how that type of hardware configuration is
> affected. Are the scripts and fio job files available somewhere?
>
> Thanks,
> Stefan

Thank you for your reply.
The fio scripts are not available in a location you can access, but there is nothing particularly special in the settings.
I’m sharing below the methodology and test setup used by our performance team.

Guest Setup
----------------------
- 12 vCPUs, 4 GiB memory
- 16 virtio disks based on the FCP multipath devices in the host

FIO test parameters
-----------------------
- FIO Version: fio-3.33
- Filesize: 2G
- Blocksize: 8K / 128K
- Direct I/O: 1
- FIO I/O Engine: libaio
- NUMJOB List: 1, 4, 8, 16
- IODEPTH: 8
- Runtime (s): 150

Two FIO samples for random read
--------------------------------
fio --direct=1 --name=test --numjobs=16 --filename=base.0.0:base.1.0:base.2.0:base.3.0:base.4.0:base.5.0:base.6.0:base.7.0:base.8.0:base.9.0:base.10.0:base.11.0:base.12.0:base.13.0:base.14.0:base.15.0 --size=32G --time_based --runtime=4m --readwrite=randread --ioengine=libaio --iodepth=8 --bs=8k
fio --direct=1 --name=test --numjobs=4 --filename=subw1/base.0.0:subw4/base.3.0:subw8/base.7.0:subw12/base.11.0:subw16/base.15.0 --size=8G --time_based --runtime=4m --readwrite=randread --ioengine=libaio --iodepth=8 --bs=8k

additional notes
----------------
- Each file is placed on a separate disk device mounted under subw<n> as specified in --filename=....
- We execute one warmup run, then two measurement runs and calculate the average

Please let me know if you need any additional information.

Regards,
Jaehoon Kim
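As a side note for anyone reproducing a setup like the one above: iothread
polling is configured through -object iothread properties. The sketch below
is not taken from the actual test scripts; poll-max-ns, poll-grow and
poll-shrink are existing iothread properties, while poll-weight is the
property proposed by this series and the value shown is only a placeholder
assumption, not a recommendation.

# existing adaptive polling knobs on the iothread object
qemu-system-s390x ... \
    -object iothread,id=iothread0,poll-max-ns=32768,poll-grow=2,poll-shrink=2 \
    -device virtio-blk-ccw,drive=disk0,iothread=iothread0 ...

# with this series applied, the new knob would be set the same way
# (placeholder value):
    -object iothread,id=iothread0,poll-max-ns=32768,poll-weight=50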
On Fri, Jan 23, 2026 at 01:15:04PM -0600, JAEHOON KIM wrote:
> On 1/19/2026 12:16 PM, Stefan Hajnoczi wrote:
> > On Tue, Jan 13, 2026 at 11:48:21AM -0600, Jaehoon Kim wrote:
> > > We evaluated the patches on an s390x host with a single guest using 16
> > > virtio block devices backed by FCP multipath devices in a separate-disk
> > > setup, with the I/O scheduler set to 'none' in both host and guest.
> > >
> > > The fio workload included sequential and random read/write with varying
> > > numbers of jobs (1,4,8,16) and io_depth of 8. The tests were conducted
> > > with single and dual iothreads, using the newly introduced poll-weight
> > > parameter to measure their impact on CPU cost and throughput.
> > >
> > > Compared to the baseline, across four FIO workload patterns (sequential
> > > R/W, random R/W), and averaged over FIO job counts of 1, 4, 8, and 16,
> > > throughput decreased slightly (-3% to -8% for one iothread, -2% to -5%
> > > for two iothreads), while CPU usage on the s390x host dropped
> > > significantly (-10% to -25% and -7% to -12%, respectively).
> > Hi Jaehoon,
> > I would like to run the same fio benchmarks on a local NVMe drive (<10us
> > request latency) to see how that type of hardware configuration is
> > affected. Are the scripts and fio job files available somewhere?
> >
> > Thanks,
> > Stefan
>
> Thank you for your reply.
> The fio scripts are not available in a location you can access, but there is nothing particularly special in the settings.
> I’m sharing below the methodology and test setup used by our performance team.
>
> Guest Setup
> ----------------------
> - 12 vCPUs, 4 GiB memory
> - 16 virtio disks based on the FCP multipath devices in the host
>
> FIO test parameters
> -----------------------
> - FIO Version: fio-3.33
> - Filesize: 2G
> - Blocksize: 8K / 128K
> - Direct I/O: 1
> - FIO I/O Engine: libaio
> - NUMJOB List: 1, 4, 8, 16
> - IODEPTH: 8
> - Runtime (s): 150
>
> Two FIO samples for random read
> --------------------------------
> fio --direct=1 --name=test --numjobs=16 --filename=base.0.0:base.1.0:base.2.0:base.3.0:base.4.0:base.5.0:base.6.0:base.7.0:base.8.0:base.9.0:base.10.0:base.11.0:base.12.0:base.13.0:base.14.0:base.15.0 --size=32G --time_based --runtime=4m --readwrite=randread --ioengine=libaio --iodepth=8 --bs=8k
> fio --direct=1 --name=test --numjobs=4 --filename=subw1/base.0.0:subw4/base.3.0:subw8/base.7.0:subw12/base.11.0:subw16/base.15.0 --size=8G --time_based --runtime=4m --readwrite=randread --ioengine=libaio --iodepth=8 --bs=8k
>
> additional notes
> ----------------
> - Each file is placed on a separate disk device mounted under subw<n> as specified in --filename=....
> - We execute one warmup run, then two measurement runs and calculate the average

Hi Jaehoon,
I ran fio benchmarks on an Intel Optane SSD DC P4800X Series drive (<10
microsecond latency). This is with just 1 drive.

The 8 KiB block size results show something similar to what you
reported: there are IOPS (or throughput) regressions and CPU utilization
improvements.

Although the CPU improvements are welcome, I think the default behavior
should only be changed if the IOPS regressions can be brought below 5%.

The regressions seem to happen regardless of whether 1 or 2 IOThreads
are configured. CPU utilization is different (98% vs 78%) depending on
the number of IOThreads, so the regressions happen across a range of CPU
utilizations.

The 128 KiB block size results are not interesting because the drive
already saturates at numjobs=1. This is expected since the drive cannot
go much above ~2 GiB/s throughput.

You can find the Ansible playbook, libvirt domain XML, fio
command-lines, and the fio/sar data here:

https://gitlab.com/stefanha/virt-playbooks/-/tree/aio-polling-efficiency

Please let me know if you'd like me to rerun the benchmark with new
patches or a configuration change.

Do you want to have a video call to discuss your work and how to get the
patches merged?

Host
----
CPU: Intel Xeon Silver 4214 CPU @ 2.20GHz
RAM: 32 GiB

Guest
-----
vCPUs: 8
RAM: 4 GiB
Disk: 1 virtio-blk aio=native cache=none

IOPS
----
rw        bs    numjobs iothreads iops   diff
randread  8k    1       1         163417 -7.8%
randread  8k    1       2         165041 -2.4%
randread  8k    4       1         221508 -0.64%
randread  8k    4       2         251298 0.008%
randread  8k    8       1         222128 -0.51%
randread  8k    8       2         249489 -2.6%
randread  8k    16      1         230535 -0.18%
randread  8k    16      2         246732 -0.22%
randread  128k  1       1         17616  -0.11%
randread  128k  1       2         17678  0.027%
randread  128k  4       1         17536  -0.27%
randread  128k  4       2         17610  -0.031%
randread  128k  8       1         17369  -0.42%
randread  128k  8       2         17433  -0.071%
randread  128k  16      1         17215  -0.61%
randread  128k  16      2         17269  -0.22%
randwrite 8k    1       1         156597 -3.1%
randwrite 8k    1       2         157720 -3.8%
randwrite 8k    4       1         218448 -0.5%
randwrite 8k    4       2         247075 -5.1%
randwrite 8k    8       1         220866 -0.75%
randwrite 8k    8       2         260935 -0.011%
randwrite 8k    16      1         230913 0.23%
randwrite 8k    16      2         261125 -0.01%
randwrite 128k  1       1         16009  0.094%
randwrite 128k  1       2         16070  0.035%
randwrite 128k  4       1         16073  -0.62%
randwrite 128k  4       2         16131  0.059%
randwrite 128k  8       1         16106  0.092%
randwrite 128k  8       2         16153  0.048%
randwrite 128k  16      1         16102  -0.0091%
randwrite 128k  16      2         16160  0.048%

IOThread CPU usage
------------------
iothreads before after
1         98.7   95.81
2         78.43  66.13

Stefan
On 2/3/2026 3:12 PM, Stefan Hajnoczi wrote:
> On Fri, Jan 23, 2026 at 01:15:04PM -0600, JAEHOON KIM wrote:
>> On 1/19/2026 12:16 PM, Stefan Hajnoczi wrote:
>>> On Tue, Jan 13, 2026 at 11:48:21AM -0600, Jaehoon Kim wrote:
>>>> We evaluated the patches on an s390x host with a single guest using 16
>>>> virtio block devices backed by FCP multipath devices in a separate-disk
>>>> setup, with the I/O scheduler set to 'none' in both host and guest.
>>>>
>>>> The fio workload included sequential and random read/write with varying
>>>> numbers of jobs (1,4,8,16) and io_depth of 8. The tests were conducted
>>>> with single and dual iothreads, using the newly introduced poll-weight
>>>> parameter to measure their impact on CPU cost and throughput.
>>>>
>>>> Compared to the baseline, across four FIO workload patterns (sequential
>>>> R/W, random R/W), and averaged over FIO job counts of 1, 4, 8, and 16,
>>>> throughput decreased slightly (-3% to -8% for one iothread, -2% to -5%
>>>> for two iothreads), while CPU usage on the s390x host dropped
>>>> significantly (-10% to -25% and -7% to -12%, respectively).
>>> Hi Jaehoon,
>>> I would like to run the same fio benchmarks on a local NVMe drive (<10us
>>> request latency) to see how that type of hardware configuration is
>>> affected. Are the scripts and fio job files available somewhere?
>>>
>>> Thanks,
>>> Stefan
>> Thank you for your reply.
>> The fio scripts are not available in a location you can access, but there is nothing particularly special in the settings.
>> I’m sharing below the methodology and test setup used by our performance team.
>>
>> Guest Setup
>> ----------------------
>> - 12 vCPUs, 4 GiB memory
>> - 16 virtio disks based on the FCP multipath devices in the host
>>
>> FIO test parameters
>> -----------------------
>> - FIO Version: fio-3.33
>> - Filesize: 2G
>> - Blocksize: 8K / 128K
>> - Direct I/O: 1
>> - FIO I/O Engine: libaio
>> - NUMJOB List: 1, 4, 8, 16
>> - IODEPTH: 8
>> - Runtime (s): 150
>>
>> Two FIO samples for random read
>> --------------------------------
>> fio --direct=1 --name=test --numjobs=16 --filename=base.0.0:base.1.0:base.2.0:base.3.0:base.4.0:base.5.0:base.6.0:base.7.0:base.8.0:base.9.0:base.10.0:base.11.0:base.12.0:base.13.0:base.14.0:base.15.0 --size=32G --time_based --runtime=4m --readwrite=randread --ioengine=libaio --iodepth=8 --bs=8k
>> fio --direct=1 --name=test --numjobs=4 --filename=subw1/base.0.0:subw4/base.3.0:subw8/base.7.0:subw12/base.11.0:subw16/base.15.0 --size=8G --time_based --runtime=4m --readwrite=randread --ioengine=libaio --iodepth=8 --bs=8k
>>
>> additional notes
>> ----------------
>> - Each file is placed on a separate disk device mounted under subw<n> as specified in --filename=....
>> - We execute one warmup run, then two measurement runs and calculate the average
> Hi Jaehoon,
> I ran fio benchmarks on an Intel Optane SSD DC P4800X Series drive (<10
> microsecond latency). This is with just 1 drive.
>
> The 8 KiB block size results show something similar to what you
> reported: there are IOPS (or throughput) regressions and CPU utilization
> improvements.
>
> Although the CPU improvements are welcome, I think the default behavior
> should only be changed if the IOPS regressions can be brought below 5%.
>
> The regressions seem to happen regardless of whether 1 or 2 IOThreads
> are configured. CPU utilization is different (98% vs 78%) depending on
> the number of IOThreads, so the regressions happen across a range of CPU
> utilizations.
>
> The 128 KiB block size results are not interesting because the drive
> already saturates at numjobs=1. This is expected since the drive cannot
> go much above ~2 GiB/s throughput.
>
> You can find the Ansible playbook, libvirt domain XML, fio
> command-lines, and the fio/sar data here:
>
> https://gitlab.com/stefanha/virt-playbooks/-/tree/aio-polling-efficiency
>
> Please let me know if you'd like me to rerun the benchmark with new
> patches or a configuration change.
>
> Do you want to have a video call to discuss your work and how to get the
> patches merged?
>
> Host
> ----
> CPU: Intel Xeon Silver 4214 CPU @ 2.20GHz
> RAM: 32 GiB
>
> Guest
> -----
> vCPUs: 8
> RAM: 4 GiB
> Disk: 1 virtio-blk aio=native cache=none
>
> IOPS
> ----
> rw        bs    numjobs iothreads iops   diff
> randread  8k    1       1         163417 -7.8%
> randread  8k    1       2         165041 -2.4%
> randread  8k    4       1         221508 -0.64%
> randread  8k    4       2         251298 0.008%
> randread  8k    8       1         222128 -0.51%
> randread  8k    8       2         249489 -2.6%
> randread  8k    16      1         230535 -0.18%
> randread  8k    16      2         246732 -0.22%
> randread  128k  1       1         17616  -0.11%
> randread  128k  1       2         17678  0.027%
> randread  128k  4       1         17536  -0.27%
> randread  128k  4       2         17610  -0.031%
> randread  128k  8       1         17369  -0.42%
> randread  128k  8       2         17433  -0.071%
> randread  128k  16      1         17215  -0.61%
> randread  128k  16      2         17269  -0.22%
> randwrite 8k    1       1         156597 -3.1%
> randwrite 8k    1       2         157720 -3.8%
> randwrite 8k    4       1         218448 -0.5%
> randwrite 8k    4       2         247075 -5.1%
> randwrite 8k    8       1         220866 -0.75%
> randwrite 8k    8       2         260935 -0.011%
> randwrite 8k    16      1         230913 0.23%
> randwrite 8k    16      2         261125 -0.01%
> randwrite 128k  1       1         16009  0.094%
> randwrite 128k  1       2         16070  0.035%
> randwrite 128k  4       1         16073  -0.62%
> randwrite 128k  4       2         16131  0.059%
> randwrite 128k  8       1         16106  0.092%
> randwrite 128k  8       2         16153  0.048%
> randwrite 128k  16      1         16102  -0.0091%
> randwrite 128k  16      2         16160  0.048%
>
> IOThread CPU usage
> ------------------
> iothreads before after
> 1         98.7   95.81
> 2         78.43  66.13
>
> Stefan

Hello Stefan,

Thank you very much for your effort in running these benchmarks.
The results show a pattern very similar to what our performance team
observed.

I fully agree with the 5% threshold for the default behavior. However, we
need an approach that balances the current performance-oriented polling
scheme with CPU efficiency, and I found that relying on the existing
grow/shrink parameters alone was too limited to achieve this balance.

This is why I've moved to a weight-based grow/shrink approach that keeps
the polling window robust against jitter. Specifically, it avoids abrupt
resets to zero by shrinking the window gradually rather than resetting it
immediately, even when device latency exceeds the threshold.

As seen in both your results and our team's measurements, this may lead to
a bit of a performance trade-off, but it provides a reasonable balance for
CPU-sensitive environments.

Thank you for suggesting a video call; I am looking forward to hearing
your thoughts. I'm on US Central Time. Except for Tuesday, I can adjust my
schedule to a time that works for you. Please let me know your preferred
time.

Regards,
Jaehoon Kim
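To make the contrast described above concrete, here is a rough sketch of
the two shrink behaviours being compared. This is illustrative only: it is
not the code from util/aio-posix.c or from this series, and the 0-100
percentage range for the weight is an assumption.

#include <stdint.h>

/*
 * Baseline-style adjustment: when an event arrives later than poll-max-ns
 * and no poll-shrink divisor is configured, the polling window collapses
 * straight back to zero, discarding everything learned so far.
 */
int64_t shrink_hard(int64_t poll_ns, int64_t block_ns, int64_t poll_max_ns)
{
    if (block_ns > poll_max_ns) {
        return 0;
    }
    return poll_ns;
}

/*
 * Weighted-style adjustment: a slow event only decays the window by
 * weight percent, so a single outlier does not erase the accumulated
 * estimate and short bursts of jitter are tolerated.
 */
int64_t shrink_gradual(int64_t poll_ns, int64_t block_ns,
                       int64_t poll_max_ns, unsigned weight)
{
    if (block_ns > poll_max_ns) {
        return poll_ns * (100 - weight) / 100;
    }
    return poll_ns;
}

For example, with poll_ns = 16000 and weight = 25, one slow request leaves
12000 ns of polling budget instead of zero, which keeps latency low for the
next request at the cost of some extra busy-waiting.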
On Fri, Jan 23, 2026 at 01:15:04PM -0600, JAEHOON KIM wrote:
> On 1/19/2026 12:16 PM, Stefan Hajnoczi wrote:
> > On Tue, Jan 13, 2026 at 11:48:21AM -0600, Jaehoon Kim wrote:
> > > We evaluated the patches on an s390x host with a single guest using 16
> > > virtio block devices backed by FCP multipath devices in a separate-disk
> > > setup, with the I/O scheduler set to 'none' in both host and guest.
> > >
> > > The fio workload included sequential and random read/write with varying
> > > numbers of jobs (1,4,8,16) and io_depth of 8. The tests were conducted
> > > with single and dual iothreads, using the newly introduced poll-weight
> > > parameter to measure their impact on CPU cost and throughput.
> > >
> > > Compared to the baseline, across four FIO workload patterns (sequential
> > > R/W, random R/W), and averaged over FIO job counts of 1, 4, 8, and 16,
> > > throughput decreased slightly (-3% to -8% for one iothread, -2% to -5%
> > > for two iothreads), while CPU usage on the s390x host dropped
> > > significantly (-10% to -25% and -7% to -12%, respectively).
> > Hi Jaehoon,
> > I would like to run the same fio benchmarks on a local NVMe drive (<10us
> > request latency) to see how that type of hardware configuration is
> > affected. Are the scripts and fio job files available somewhere?
> >
> > Thanks,
> > Stefan
>
> Thank you for your reply.
> The fio scripts are not available in a location you can access, but there is nothing particularly special in the settings.
> I’m sharing below the methodology and test setup used by our performance team.
>
> Guest Setup
> ----------------------
> - 12 vCPUs, 4 GiB memory
> - 16 virtio disks based on the FCP multipath devices in the host
>
> FIO test parameters
> -----------------------
> - FIO Version: fio-3.33
> - Filesize: 2G
> - Blocksize: 8K / 128K
> - Direct I/O: 1
> - FIO I/O Engine: libaio
> - NUMJOB List: 1, 4, 8, 16
> - IODEPTH: 8
> - Runtime (s): 150
>
> Two FIO samples for random read
> --------------------------------
> fio --direct=1 --name=test --numjobs=16 --filename=base.0.0:base.1.0:base.2.0:base.3.0:base.4.0:base.5.0:base.6.0:base.7.0:base.8.0:base.9.0:base.10.0:base.11.0:base.12.0:base.13.0:base.14.0:base.15.0 --size=32G --time_based --runtime=4m --readwrite=randread --ioengine=libaio --iodepth=8 --bs=8k
> fio --direct=1 --name=test --numjobs=4 --filename=subw1/base.0.0:subw4/base.3.0:subw8/base.7.0:subw12/base.11.0:subw16/base.15.0 --size=8G --time_based --runtime=4m --readwrite=randread --ioengine=libaio --iodepth=8 --bs=8k
>
> additional notes
> ----------------
> - Each file is placed on a separate disk device mounted under subw<n> as specified in --filename=....
> - We execute one warmup run, then two measurement runs and calculate the average

Thanks, I will share x86_64 with fast local NVMe results when I have
collected them.

Stefan