[v1] FUSE Export Coroutine Integration Cover Letter

[PATCH 0/1 RFC] FUSE Export Coroutine Integration Cover Letter

Posted by saz97 6 days, 8 hours ago

This patch series refactors QEMU's FUSE export module to leverage coroutines for read/write operations,
addressing concurrency limitations and aligning with QEMU's asynchronous I/O model. The changes
demonstrate measurable performance improvements while simplifying resource management.

1. Technical Implementation

   Following Stefan's suggestion, I moved the processing logic of read_from_fuse_export into a coroutine for buffer management,
   and changed fuse_getattr to call bdrv_co_get_allocated_file_size().

2. Performance Summary

   For the coroutine_integration_fuse test, the average results for iodepth=1 and iodepth=64 are as follows:
    -------------------------------  
    Average results for iodepth=1:
    Read_IOPS: coroutine_integration_fuse: 4492.88 | origin: 4309.39 | 4.25% improvement
    Write_IOPS: coroutine_integration_fuse: 4500.68 | origin: 4318.68 | 4.21% improvement
    Read_BW: coroutine_integration_fuse: 17971.00 KB/s | origin: 17237.30 KB/s | 4.26% improvement
    Write_BW: coroutine_integration_fuse: 18002.50 KB/s | origin: 17274.30 KB/s | 4.23% improvement
    --------------------------------
    -------------------------------
    Average results for iodepth=64:
    Read_IOPS: coroutine_integration_fuse: 5576.93 | origin: 5347.13 | 4.29% improvement
    Write_IOPS: coroutine_integration_fuse: 5569.55 | origin: 5337.33 | 4.33% improvement
    Read_BW: coroutine_integration_fuse: 22311.40 KB/s | origin: 21392.20 KB/s | 4.31% improvement
    Write_BW: coroutine_integration_fuse: 22282.20 KB/s | origin: 21353.20 KB/s | 4.34% improvement
    --------------------------------
   Although all metrics show improvements, the gains are concentrated in the 4.2%–4.3% range, which is lower than expected. Further investigation using gprof reveals the reasons for this limited improvement.
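
   As a quick sanity check on the percentages above, here is a minimal Python sketch (not part of the patch; the helper name improvement_pct is hypothetical) that recomputes the relative gain from the reported averages. The recomputed values may differ from the reported ones in the last digit, presumably because the reported percentages were derived from unrounded averages.

```python
# Relative improvement of the coroutine build over the baseline ("origin").
# Inputs are the rounded IOPS averages reported in the summary above.

def improvement_pct(new: float, old: float) -> float:
    """Return the relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

# (coroutine_integration_fuse, origin) pairs taken from the tables above
results = {
    "iodepth=1  Read_IOPS": (4492.88, 4309.39),
    "iodepth=1  Write_IOPS": (4500.68, 4318.68),
    "iodepth=64 Read_IOPS": (5576.93, 5347.13),
    "iodepth=64 Write_IOPS": (5569.55, 5337.33),
}

for name, (new, old) in results.items():
    print(f"{name}: {improvement_pct(new, old):.2f}% improvement")
```

   With bs=4k, the bandwidth figures follow directly from IOPS (about 4 KB per I/O), so they show the same relative gains.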

3. Performance Bottlenecks Identified via gprof
   After running a fio test with the following command:
   fio --ioengine=io_uring --numjobs=1 --runtime=30 --ramp_time=5 \
    --rw=randrw --bs=4k --time_based=1 --name=job1 \
    --filename=/mnt/qemu-fuse --iodepth=64
   and analyzing the execution profile using gprof, the following issues were identified:

   3.1 Increased Overall Execution Time
   In the original implementation, fuse_write + blk_pwrite accounted for 8.7% of total execution time (6.0% + 2.7%).
   After refactoring, fuse_write_coroutine + blk_co_pwrite now accounts for 43.1% (22.9% + 20.2%).
   This suggests that coroutine overhead is contributing significantly to execution time.

   3.2 Increased Read and Write Calls
   fuse_write calls increased from 173,400 → 333,232.
   fuse_read calls increased from 173,526 → 332,931.
   This indicates that the coroutine-based approach is introducing redundant I/O calls, likely due to unnecessary coroutine switches.

   3.3 Significant Coroutine Overhead
   qemu_coroutine_enter is now called 1,572,803 times, compared to ~476,057 previously.
   This frequent coroutine switching introduces unnecessary overhead, limiting the expected performance improvements.
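
   To put the counts from 3.2 and 3.3 side by side, the following Python sketch (illustrative only; the helper names are hypothetical) computes the growth factor of each counter and the number of qemu_coroutine_enter calls per FUSE request. It assumes that every fuse_read/fuse_write call corresponds to exactly one request, which the profile itself does not confirm, and the "before" coroutine count is only approximate.

```python
# gprof call counts quoted in sections 3.2 and 3.3: (before, after) refactoring.
calls = {
    "fuse_write": (173_400, 333_232),
    "fuse_read": (173_526, 332_931),
    "qemu_coroutine_enter": (476_057, 1_572_803),  # "before" value is approximate
}

def growth(before: int, after: int) -> float:
    """Factor by which a call count grew after the refactoring."""
    return after / before

def entries_per_request(which: int) -> float:
    """Coroutine entries per FUSE request; which=0 is before, which=1 is after.

    Assumption: each fuse_read/fuse_write call is exactly one request.
    """
    requests = calls["fuse_read"][which] + calls["fuse_write"][which]
    return calls["qemu_coroutine_enter"][which] / requests

for name, (before, after) in calls.items():
    print(f"{name}: x{growth(before, after):.2f}")

print(f"coroutine entries per request, before: {entries_per_request(0):.2f}")
print(f"coroutine entries per request, after:  {entries_per_request(1):.2f}")
```

   These ratios only restate the gprof counts. They do not show where the additional coroutine entries originate, so they should be read as a starting point for further profiling rather than as a diagnosis.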

saz97 (1):
  Integration coroutines into fuse export

 block/export/fuse.c | 190 +++++++++++++++++++++++++++++---------------
 1 file changed, 126 insertions(+), 64 deletions(-)

-- 
2.34.1

Re: [PATCH 0/1 RFC] FUSE Export Coroutine Integration Cover Letter

Posted by Stefan Hajnoczi 6 days, 1 hour ago

On Mon, Mar 24, 2025 at 04:05:09PM +0800, saz97 wrote:
> This patch series refactors QEMU's FUSE export module to leverage coroutines for read/write operations,
> addressing concurrency limitations and aligning with QEMU's asynchronous I/O model. The changes
> demonstrate measurable performance improvements while simplifying resource management.
> 
> 1. Technical Implementation
> 
>    Following Stefan's suggestion, I moved the processing logic of read_from_fuse_export into a coroutine for buffer management,
>    and changed fuse_getattr to call bdrv_co_get_allocated_file_size().
> 
> 2. Performance Summary
> 
>    For the coroutine_integration_fuse test, the average results for iodepth=1 and iodepth=64 are as follows:
>     -------------------------------  
>     Average results for iodepth=1:
>     Read_IOPS: coroutine_integration_fuse: 4492.88 | origin: 4309.39 | 4.25% improvement
>     Write_IOPS: coroutine_integration_fuse: 4500.68 | origin: 4318.68 | 4.21% improvement
>     Read_BW: coroutine_integration_fuse: 17971.00 KB/s | origin: 17237.30 KB/s | 4.26% improvement
>     Write_BW: coroutine_integration_fuse: 18002.50 KB/s | origin: 17274.30 KB/s | 4.23% improvement
>     --------------------------------
>     -------------------------------
>     Average results for iodepth=64:
>     Read_IOPS: coroutine_integration_fuse: 5576.93 | origin: 5347.13 | 4.29% improvement
>     Write_IOPS: coroutine_integration_fuse: 5569.55 | origin: 5337.33 | 4.33% improvement
>     Read_BW: coroutine_integration_fuse: 22311.40 KB/s | origin: 21392.20 KB/s | 4.31% improvement
>     Write_BW: coroutine_integration_fuse: 22282.20 KB/s | origin: 21353.20 KB/s | 4.34% improvement
>     --------------------------------
>    Although all metrics show improvements, the gains are concentrated in the 4.2%–4.3% range, which is lower than expected. Further investigation using gprof reveals the reasons for this limited improvement.
> 
> 3. Performance Bottlenecks Identified via gprof
>    After running a fio test with the following command:
>    fio --ioengine=io_uring --numjobs=1 --runtime=30 --ramp_time=5 \
>     --rw=randrw --bs=4k --time_based=1 --name=job1 \
>     --filename=/mnt/qemu-fuse --iodepth=64
>    and analyzing the execution profile using gprof, the following issues were identified:
> 
>    3.1 Increased Overall Execution Time
>    In the original implementation, fuse_write + blk_pwrite accounted for 8.7% of total execution time (6.0% + 2.7%).
>    After refactoring, fuse_write_coroutine + blk_co_pwrite now accounts for 43.1% (22.9% + 20.2%).
>    This suggests that coroutine overhead is contributing significantly to execution time.
> 
>    3.2 Increased Read and Write Calls
>    fuse_write calls increased from 173,400 → 333,232.
>    fuse_read calls increased from 173,526 → 332,931.
>    This indicates that the coroutine-based approach is introducing redundant I/O calls, likely due to unnecessary coroutine switches.
> 
>    3.3 Significant Coroutine Overhead
>    qemu_coroutine_enter is now called 1,572,803 times, compared to ~476,057 previously.
>    This frequent coroutine switching introduces unnecessary overhead, limiting the expected performance improvements.

Due to the remaining performance issues, let's leave this contribution
task here.

Please focus on submitting your Google Summer of Code application at
https://summerofcode.withgoogle.com/ by April 8th.

Thanks,
Stefan

> 
> saz97 (1):
>   Integration coroutines into fuse export
> 
>  block/export/fuse.c | 190 +++++++++++++++++++++++++++++---------------
>  1 file changed, 126 insertions(+), 64 deletions(-)
> 
> -- 
> 2.34.1
>