block/export/fuse.c | 190 +++++++++++++++++++++++++++++--------------- 1 file changed, 126 insertions(+), 64 deletions(-)
This patch series refactors QEMU's FUSE export module to use coroutines for read/write operations,
addressing concurrency limitations and aligning with QEMU's asynchronous I/O model. The changes
show measurable performance improvements while simplifying resource management.

1. Technical implementation

Following Stefan's suggestion, I moved the processing logic of read_from_fuse_export into a coroutine for buffer management,
and changed fuse_getattr to call bdrv_co_get_allocated_file_size().

2. Performance summary

For the coroutine_integration_fuse test, the average results for iodepth=1 and iodepth=64 are as follows:
-------------------------------
Average results for iodepth=1:
Read_IOPS: coroutine_integration_fuse: 4492.88 | origin: 4309.39 | 4.25% improvement
Write_IOPS: coroutine_integration_fuse: 4500.68 | origin: 4318.68 | 4.21% improvement
Read_BW: coroutine_integration_fuse: 17971.00 KB/s | origin: 17237.30 KB/s | 4.26% improvement
Write_BW: coroutine_integration_fuse: 18002.50 KB/s | origin: 17274.30 KB/s | 4.23% improvement
--------------------------------
-------------------------------
Average results for iodepth=64:
Read_IOPS: coroutine_integration_fuse: 5576.93 | origin: 5347.13 | 4.29% improvement
Write_IOPS: coroutine_integration_fuse: 5569.55 | origin: 5337.33 | 4.33% improvement
Read_BW: coroutine_integration_fuse: 22311.40 KB/s | origin: 21392.20 KB/s | 4.31% improvement
Write_BW: coroutine_integration_fuse: 22282.20 KB/s | origin: 21353.20 KB/s | 4.34% improvement
--------------------------------
Although all metrics show improvements, the gains are concentrated in the 4.2%–4.3% range, which is lower than expected. Further investigation using gprof reveals the reasons for this limited improvement.

3. Performance Bottlenecks Identified via gprof

After running a fio test with the following command:
    fio --ioengine=io_uring --numjobs=1 --runtime=30 --ramp_time=5 \
        --rw=randrw --bs=4k --time_based=1 --name=job1 \
        --filename=/mnt/qemu-fuse --iodepth=64
and analyzing the execution profile using gprof, the following issues were identified:

3.1 Increased Overall Execution Time
In the original implementation, fuse_write + blk_pwrite accounted for 8.7% of total execution time (6.0% + 2.7%).
After refactoring, fuse_write_coroutine + blk_co_pwrite account for 43.1% (22.9% + 20.2%).
This suggests that coroutine overhead contributes significantly to execution time.

3.2 Increased Read and Write Calls
fuse_write calls increased from 173,400 to 333,232.
fuse_read calls increased from 173,526 to 332,931.
This indicates that the coroutine-based approach introduces redundant I/O calls, likely due to unnecessary coroutine switches.

3.3 Significant Coroutine Overhead
qemu_coroutine_enter is now called 1,572,803 times, compared to ~476,057 previously.
This frequent coroutine switching introduces unnecessary overhead, limiting the expected performance improvements.

saz97 (1):
  Integration coroutines into fuse export

 block/export/fuse.c | 190 +++++++++++++++++++++++++++++---------------
 1 file changed, 126 insertions(+), 64 deletions(-)

--
2.34.1
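As a rough cross-check of the figures in the cover letter above, the sketch below recomputes one of the reported percentage gains and relates the qemu_coroutine_enter counts to the number of fuse_read/fuse_write calls. The helper names are hypothetical and introduced only for illustration; the inputs are copied from the cover letter (the "before" coroutine count is approximate there). The per-request ratio is only an indicator, since gprof counts every coroutine entry in the process, not just those made on behalf of FUSE requests.

```python
# Illustrative helpers (hypothetical names); all inputs come from the cover letter.

def pct_improvement(new: float, old: float) -> float:
    """Relative change of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def enters_per_request(coroutine_enters: int, reads: int, writes: int) -> float:
    """qemu_coroutine_enter calls per fuse_read/fuse_write call."""
    return coroutine_enters / (reads + writes)


# Read IOPS at iodepth=1: coroutine version vs. original.
read_gain = pct_improvement(4492.88, 4309.39)

# gprof call counts before and after the refactoring.
before = enters_per_request(476_057, reads=173_526, writes=173_400)
after = enters_per_request(1_572_803, reads=332_931, writes=333_232)

print(f"Read IOPS gain at iodepth=1: {read_gain:.2f}%")
print(f"coroutine enters per request: before {before:.2f}, after {after:.2f}")
```

Under these numbers the coroutine entries per request grow noticeably after the refactoring, which is consistent with the switching overhead described in section 3.3.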
On Mon, Mar 24, 2025 at 04:05:09PM +0800, saz97 wrote:
> This patch series refactors QEMU's FUSE export module to use coroutines for read/write operations,
> addressing concurrency limitations and aligning with QEMU's asynchronous I/O model. The changes
> show measurable performance improvements while simplifying resource management.
>
> 1. Technical implementation
>
> Following Stefan's suggestion, I moved the processing logic of read_from_fuse_export into a coroutine for buffer management,
> and changed fuse_getattr to call bdrv_co_get_allocated_file_size().
>
> 2. Performance summary
>
> For the coroutine_integration_fuse test, the average results for iodepth=1 and iodepth=64 are as follows:
> -------------------------------
> Average results for iodepth=1:
> Read_IOPS: coroutine_integration_fuse: 4492.88 | origin: 4309.39 | 4.25% improvement
> Write_IOPS: coroutine_integration_fuse: 4500.68 | origin: 4318.68 | 4.21% improvement
> Read_BW: coroutine_integration_fuse: 17971.00 KB/s | origin: 17237.30 KB/s | 4.26% improvement
> Write_BW: coroutine_integration_fuse: 18002.50 KB/s | origin: 17274.30 KB/s | 4.23% improvement
> --------------------------------
> -------------------------------
> Average results for iodepth=64:
> Read_IOPS: coroutine_integration_fuse: 5576.93 | origin: 5347.13 | 4.29% improvement
> Write_IOPS: coroutine_integration_fuse: 5569.55 | origin: 5337.33 | 4.33% improvement
> Read_BW: coroutine_integration_fuse: 22311.40 KB/s | origin: 21392.20 KB/s | 4.31% improvement
> Write_BW: coroutine_integration_fuse: 22282.20 KB/s | origin: 21353.20 KB/s | 4.34% improvement
> --------------------------------
> Although all metrics show improvements, the gains are concentrated in the 4.2%–4.3% range, which is lower than expected. Further investigation using gprof reveals the reasons for this limited improvement.
>
> 3. Performance Bottlenecks Identified via gprof
>
> After running a fio test with the following command:
>     fio --ioengine=io_uring --numjobs=1 --runtime=30 --ramp_time=5 \
>         --rw=randrw --bs=4k --time_based=1 --name=job1 \
>         --filename=/mnt/qemu-fuse --iodepth=64
> and analyzing the execution profile using gprof, the following issues were identified:
>
> 3.1 Increased Overall Execution Time
> In the original implementation, fuse_write + blk_pwrite accounted for 8.7% of total execution time (6.0% + 2.7%).
> After refactoring, fuse_write_coroutine + blk_co_pwrite account for 43.1% (22.9% + 20.2%).
> This suggests that coroutine overhead contributes significantly to execution time.
>
> 3.2 Increased Read and Write Calls
> fuse_write calls increased from 173,400 to 333,232.
> fuse_read calls increased from 173,526 to 332,931.
> This indicates that the coroutine-based approach introduces redundant I/O calls, likely due to unnecessary coroutine switches.
>
> 3.3 Significant Coroutine Overhead
> qemu_coroutine_enter is now called 1,572,803 times, compared to ~476,057 previously.
> This frequent coroutine switching introduces unnecessary overhead, limiting the expected performance improvements.

Due to the remaining performance issues, let's leave this contribution
task here.

Please focus on submitting your Google Summer of Code application at
https://summerofcode.withgoogle.com/ by April 8th.

Thanks,
Stefan

>
> saz97 (1):
>   Integration coroutines into fuse export
>
>  block/export/fuse.c | 190 +++++++++++++++++++++++++++++---------------
>  1 file changed, 126 insertions(+), 64 deletions(-)
>
> --
> 2.34.1
>