This patch series refactors QEMU's FUSE export module to leverage coroutines for read/write operations,
addressing concurrency limitations and aligning with QEMU's asynchronous I/O model. The changes
demonstrate measurable performance improvements while simplifying resource management.
1. Technical Implementation
Following Stefan's suggestion, I moved the processing logic of read_from_fuse_export into a coroutine for buffer management,
and changed fuse_getattr to call bdrv_co_get_allocated_file_size().
2. Performance Summary
For the coroutine_integration_fuse test, the average results for iodepth=1 and iodepth=64 are as follows:
Average results for iodepth=1:
Metric      coroutine_integration_fuse  origin          Improvement
Read_IOPS   4492.88                     4309.39         4.25%
Write_IOPS  4500.68                     4318.68         4.21%
Read_BW     17971.00 KB/s               17237.30 KB/s   4.26%
Write_BW    18002.50 KB/s               17274.30 KB/s   4.23%

Average results for iodepth=64:
Metric      coroutine_integration_fuse  origin          Improvement
Read_IOPS   5576.93                     5347.13         4.29%
Write_IOPS  5569.55                     5337.33         4.33%
Read_BW     22311.40 KB/s               21392.20 KB/s   4.31%
Write_BW    22282.20 KB/s               21353.20 KB/s   4.34%

Although all metrics show improvements, the gains are concentrated in the 4.2%–4.3% range, which is lower than expected. Further investigation using gprof reveals the reasons for this limited improvement.
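As a cross-check on the improvement column, the percentages can be recomputed from the averaged values. This is a minimal sketch, not part of the patch: it assumes "improvement" means (new - origin) / origin, and the helper name improvement_pct is made up for illustration. Figures recomputed this way from the averages may differ from the reported ones in the last digit, for example if the reported percentages were averaged per run (an assumption; the report does not say).

```python
# Averaged fio results copied from the summary in section 2:
# (metric, iodepth) -> (coroutine_integration_fuse, origin, reported improvement in %)
results = {
    ("Read_IOPS", 1): (4492.88, 4309.39, 4.25),
    ("Write_IOPS", 1): (4500.68, 4318.68, 4.21),
    ("Read_BW", 1): (17971.00, 17237.30, 4.26),
    ("Write_BW", 1): (18002.50, 17274.30, 4.23),
    ("Read_IOPS", 64): (5576.93, 5347.13, 4.29),
    ("Write_IOPS", 64): (5569.55, 5337.33, 4.33),
    ("Read_BW", 64): (22311.40, 21392.20, 4.31),
    ("Write_BW", 64): (22282.20, 21353.20, 4.34),
}


def improvement_pct(new, old):
    """Relative gain of `new` over `old`, in percent (hypothetical helper)."""
    return (new - old) / old * 100.0


# Recompute every improvement figure from the averaged values.
recomputed = {
    key: improvement_pct(new, old) for key, (new, old, _reported) in results.items()
}

for (metric, depth), pct in recomputed.items():
    reported = results[(metric, depth)][2]
    print(f"iodepth={depth:<2} {metric:<10} recomputed {pct:.2f}%  reported {reported:.2f}%")
```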
3. Performance Bottlenecks Identified via gprof
After running a fio test with the following command:
fio --ioengine=io_uring --numjobs=1 --runtime=30 --ramp_time=5 \
--rw=randrw --bs=4k --time_based=1 --name=job1 \
--filename=/mnt/qemu-fuse --iodepth=64
and analyzing the execution profile using gprof, the following issues were identified:
3.1 Increased Overall Execution Time
In the original implementation, fuse_write + blk_pwrite accounted for 8.7% of total execution time (6.0% + 2.7%).
After refactoring, fuse_write_coroutine + blk_co_pwrite now accounts for 43.1% (22.9% + 20.2%).
This suggests that coroutine overhead is contributing significantly to execution time.
3.2 Increased Read and Write Calls
fuse_write calls increased from 173,400 → 333,232.
fuse_read calls increased from 173,526 → 332,931.
This indicates that the coroutine-based approach is introducing redundant I/O calls, likely due to unnecessary coroutine switches.
3.3 Significant Coroutine Overhead
qemu_coroutine_enter is now called 1,572,803 times, compared to ~476,057 previously.
This frequent coroutine switching introduces unnecessary overhead, limiting the expected performance improvements.
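
To put the call counts from 3.2 and 3.3 on a common scale, one rough option is to divide the number of qemu_coroutine_enter calls by the number of FUSE read/write calls in the same profile. This is only a sketch: treating fuse_read + fuse_write as the request count is an assumption, the "before" coroutine count is approximate in the report, and other callers of qemu_coroutine_enter are not separated out.

```python
# Call counts quoted in sections 3.2 and 3.3 (gprof): (before, after) the refactor.
calls = {
    "fuse_write": (173_400, 333_232),
    "fuse_read": (173_526, 332_931),
    "qemu_coroutine_enter": (476_057, 1_572_803),  # "before" is approximate in the report
}

# Growth factor of each call count after the refactor.
growth = {name: after / before for name, (before, after) in calls.items()}


def enters_per_request(index):
    """Coroutine entries per FUSE read/write call; index 0 = before, 1 = after."""
    requests = calls["fuse_read"][index] + calls["fuse_write"][index]
    return calls["qemu_coroutine_enter"][index] / requests


before_ratio = enters_per_request(0)
after_ratio = enters_per_request(1)

for name, factor in growth.items():
    print(f"{name}: {factor:.2f}x")
print(f"coroutine entries per request: {before_ratio:.2f} -> {after_ratio:.2f}")
```

Normalizing this way keeps the comparison meaningful even if the two profiles did not process the same number of requests.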
saz97 (1):
Integration coroutines into fuse export
block/export/fuse.c | 190 +++++++++++++++++++++++++++++---------------
1 file changed, 126 insertions(+), 64 deletions(-)
--
2.34.1
On Mon, Mar 24, 2025 at 04:05:09PM +0800, saz97 wrote:
> This patch series refactors QEMU's FUSE export module to leverage coroutines for read/write operations,
> addressing concurrency limitations and aligning with QEMU's asynchronous I/O model. The changes
> demonstrate measurable performance improvements while simplifying resource management.
>
> 1. technology implementation
>
> according to Stefan suggerstion, i move the processing logic of read_from_fuse_export into a coroutine for buffer management.
> and change the fuse_getattr to call: bdrv_co_get_allocated_file_size().
>
> 2. performance summary
>
> For the coroutine_integration_fuse test, the average results for iodepth=1 and iodepth=64 are as follows:
> -------------------------------
> Average results for iodepth=1:
> Read_IOPS: coroutine_integration_fuse: 4492.88 | origin: 4309.39 | 4.25% improvement
> Write_IOPS: coroutine_integration_fuse: 4500.68 | origin: 4318.68 | 4.21% improvement
> Read_BW: coroutine_integration_fuse: 17971.00 KB/s | origin: 17237.30 KB/s | 4.26% improvement
> Write_BW: coroutine_integration_fuse: 18002.50 KB/s | origin: 17274.30 KB/s | 4.23% improvement
> --------------------------------
> -------------------------------
> Average results for iodepth=64:
> Read_IOPS: coroutine_integration_fuse: 5576.93 | origin: 5347.13 | 4.29% improvement
> Write_IOPS: coroutine_integration_fuse: 5569.55 | origin: 5337.33 | 4.33% improvement
> Read_BW: coroutine_integration_fuse: 22311.40 KB/s | origin: 21392.20 KB/s | 4.31% improvement
> Write_BW: coroutine_integration_fuse: 22282.20 KB/s | origin: 21353.20 KB/s | 4.34% improvement
> --------------------------------
> Although all metrics show improvements, the gains are concentrated in the 4.2%–4.3% range, which is lower than expected. Further investigation using gprof reveals the reasons for this limited improvement.
>
> 3. Performance Bottlenecks Identified via gprof
> After running a fio test with the following command:
> fio --ioengine=io_uring --numjobs=1 --runtime=30 --ramp_time=5 \
> --rw=randrw --bs=4k --time_based=1 --name=job1 \
> --filename=/mnt/qemu-fuse --iopath=64
> and analyzing the execution profile using gprof, the following issues were identified:
>
> 3.1 Increased Overall Execution Time
> In the original implementation, fuse_write + blk_pwrite accounted for 8.7% of total execution time (6.0% + 2.7%).
> After refactoring, fuse_write_coroutine + blk_co_pwrite now accounts for 43.1% (22.9% + 20.2%).
> This suggests that coroutine overhead is contributing significantly to execution time.
>
> 3.2 Increased Read and Write Calls
> fuse_write calls increased from 173,400 → 333,232.
> fuse_read calls increased from 173,526 → 332,931.
> This indicates that the coroutine-based approach is introducing redundant I/O calls, likely due to unnecessary coroutine switches.
>
> 3.3 Significant Coroutine Overhead
> qemu_coroutine_enter is now called 1,572,803 times, compared to ~476,057 previously.
> This frequent coroutine switching introduces unnecessary overhead, limiting the expected performance improvements.

Due to the remaining performance issues, let's leave this contribution
task here. Please focus on submitting your Google Summer of Code
application at https://summerofcode.withgoogle.com/ by April 8th.

Thanks,
Stefan

>
> saz97 (1):
> Integration coroutines into fuse export
>
> block/export/fuse.c | 190 +++++++++++++++++++++++++++++---------------
> 1 file changed, 126 insertions(+), 64 deletions(-)
>
> --
> 2.34.1
>