[PATCH 0/1 RFC] FUSE Export Coroutine Integration Cover Letter

Posted by saz97 2 weeks, 3 days ago
Signed-off-by: Changzhi Xie <sa.z@qq.com>

FUSE Export Coroutine Integration Cover Letter

This patch series refactors QEMU's FUSE export module to leverage coroutines for read/write operations, 
addressing concurrency limitations and aligning with QEMU's asynchronous I/O model. The changes 
demonstrate measurable performance improvements while simplifying resource management.

1. Technical Implementation
Key modifications address prior review feedback from Stefan Hajnoczi and streamline the execution flow:

1.1 Coroutine Integration
Converted fuse_read()/fuse_write() to launch coroutines (fuse_*_coroutine); a read-path sketch follows below
Switched to the non-blocking blk_co_pread()/blk_co_pwrite() for block layer access
Eliminated main loop blocking during heavy I/O workloads
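
As an illustration of 1.1 (and of the slimmed-down request structure described in 1.4), here is a minimal read-path sketch. It is not the literal patch code: FuseCoReadRequest, fuse_co_read() and the exp->common.blk access path are illustrative assumptions, and the in_flight/reference bookkeeping from 1.3 is left out.

/* Illustrative sketch only; names are assumptions, not the patch itself. */
typedef struct FuseCoReadRequest {
    FuseExport *exp;    /* export that owns the BlockBackend */
    fuse_req_t req;     /* libfuse request handle used for the reply */
    size_t size;        /* number of bytes to read */
    off_t offset;       /* offset into the exported image */
} FuseCoReadRequest;

static void coroutine_fn fuse_co_read(void *opaque)
{
    FuseCoReadRequest *r = opaque;
    g_autofree char *buf = g_malloc(r->size);
    int ret;

    /* Yields instead of blocking the main loop while the block layer works */
    ret = blk_co_pread(r->exp->common.blk, r->offset, r->size, buf, 0);
    if (ret < 0) {
        fuse_reply_err(r->req, -ret);
    } else {
        fuse_reply_buf(r->req, buf, r->size);
    }

    /* in_flight decrement and blk_exp_unref() (section 1.3) omitted here */
    g_free(r);
}

static void fuse_read(fuse_req_t req, fuse_ino_t inode, size_t size,
                      off_t offset, struct fuse_file_info *fi)
{
    FuseExport *exp = fuse_req_userdata(req);
    FuseCoReadRequest *r = g_new(FuseCoReadRequest, 1);

    *r = (FuseCoReadRequest) {
        .exp = exp, .req = req, .size = size, .offset = offset,
    };
    /* Hand the request to a coroutine; control returns at its first yield */
    qemu_coroutine_enter(qemu_coroutine_create(fuse_co_read, r));
}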

1.2 Buffer Management
Removed explicit buffer pre-allocation in read_from_fuse_export()
Replaced fuse_buf_free() with g_free() due to libfuse3 API constraints (see the write-path sketch below)
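
A corresponding write-path sketch under the same assumptions (FuseCoWriteRequest is hypothetical; its buf field is taken to own a copy of the request data made by the submitting handler):

static void coroutine_fn fuse_co_write(void *opaque)
{
    FuseCoWriteRequest *r = opaque;
    int ret;

    ret = blk_co_pwrite(r->exp->common.blk, r->offset, r->size, r->buf, 0);
    if (ret < 0) {
        fuse_reply_err(r->req, -ret);
    } else {
        fuse_reply_write(r->req, r->size);
    }

    g_free(r->buf);   /* plain g_free(); libfuse3 does not export fuse_buf_free() */
    g_free(r);
}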

1.3 Resource Lifecycle
Moved the in_flight decrement and blk_exp_unref() into the coroutines
Added FUSE opcode checks (FUSE_READ/FUSE_WRITE) so the dispatcher does not clean up requests that a coroutine still owns (dispatcher sketch below)
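
The dispatcher side might then look roughly like the following. This is a sketch under stated assumptions, not the patch itself: the exact shape of read_from_fuse_export() may differ, and peeking at the opcode through struct fuse_in_header is an assumption about how the check is implemented. The READ/WRITE coroutines are expected to run "exp->in_flight--; blk_exp_unref(&exp->common);" themselves after replying.

static void read_from_fuse_export(void *opaque)
{
    FuseExport *exp = opaque;
    struct fuse_buf fbuf = { 0 };
    uint32_t opcode = 0;
    int ret;

    blk_exp_ref(&exp->common);   /* keep the export alive for this request */
    exp->in_flight++;

    ret = fuse_session_receive_buf(exp->fuse_session, &fbuf);
    if (ret > 0) {
        /* The FUSE request header sits at the start of the buffer */
        opcode = ((const struct fuse_in_header *)fbuf.mem)->opcode;
        fuse_session_process_buf(exp->fuse_session, &fbuf);
    }

    g_free(fbuf.mem);   /* plain g_free(), as described in section 1.2 */

    if (opcode == FUSE_READ || opcode == FUSE_WRITE) {
        /* The coroutine spawned by the read/write handler decrements
         * in_flight and drops the export reference once its I/O completes,
         * so skip the cleanup here to avoid doing it prematurely. */
        return;
    }

    exp->in_flight--;
    blk_exp_unref(&exp->common);
}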

1.4 Structural Improvements
Simplified the FuseIORequest structure:
Removed the redundant fuse_ino_t and fuse_file_info fields
Kept only the fields needed to pass request parameters to the coroutines

2. Performance Validation
Tested with fio using a 4K random read/write pattern; each result is the average of 5 runs:
fio --ioengine=io_uring --numjobs=1 --runtime=30 --ramp_time=5 --rw=randrw --bs=4k --time_based=1

Key Results

Metric          iodepth=1                iodepth=64
Read Latency    ▼ 2.7% (3.8k → 3k ns)    ▼ 1.3% (4.7M → 4.6M ns)
Write Latency   ▼ 3.6% (112k → 108k ns)  ▼ 2.8% (5.2M → 5.0M ns)
Read IOPS       4740 → 4729 (±0.2%)      ▲ 2.1% (6391 → 6529)
Write IOPS      4738 → 4727 (±0.2%)      ▲ 2.2% (6390 → 6529)
Throughput      ~18.9 GB/s (stable)      ▲ 2.1% (25.6 → 26.1 GB/s)

Analysis

High Concurrency (iodepth=64):
Sustained throughput gains (+2.1-2.2%) indicate improved scalability
Latency reductions suggest reduced contention between concurrent operations

saz97 (1):
  Integrate coroutines into fuse export

 block/export/fuse.c | 189 +++++++++++++++++++++++++++++++-------------
 1 file changed, 132 insertions(+), 57 deletions(-)

-- 
2.34.1


Re: [PATCH 0/1 RFC] FUSE Export Coroutine Integration Cover Letter
Posted by Stefan Hajnoczi 2 weeks, 1 day ago
On Sun, Mar 16, 2025 at 01:30:06AM +0800, saz97 wrote:
> Signed-off-by: Changzhi Xie <sa.z@qq.com>
> 
> FUSE Export Coroutine Integration Cover Letter
> 
> This patch series refactors QEMU's FUSE export module to leverage coroutines for read/write operations, 
> addressing concurrency limitations and aligning with QEMU's asynchronous I/O model. The changes 
> demonstrate measurable performance improvements while simplifying resource management.
> 
> 1. Technical Implementation
> Key modifications address prior review feedback (Stefan Hajnoczi) and optimize execution flow:
> 
> 1.1 Coroutine Integration
> Convert fuse_read()/fuse_write() to launch coroutines (fuse_*_coroutine)
> Utilize non-blocking blk_co_pread()/blk_co_pwrite() for block layer access
> Eliminate main loop blocking during heavy I/O workloads
> 
> 1.2 Buffer Management
> Removed explicit buffer pre-allocation in read_from_fuse_export()
> Replaced fuse_buf_free() with g_free() due to libfuse3 API constraints
> 
> 1.3 Resource Lifecycle
> Moved in_flight decrement and blk_exp_unref() into coroutines
> Added FUSE opcode checks (FUSE_READ/FUSE_WRITE) to prevent premature cleanup
> 
> 1.4 Structural Improvements
> Simplified FuseIORequest structure:
> Removed redundant fuse_ino_t and fuse_file_info fields
> Retained minimal parameter passing requirements
> 
> 2. Performance Validation
> Tested using fio with 4K random RW pattern, and the result is the average of 5 runs:
> fio --ioengine=io_uring --numjobs=1 --runtime=30 --ramp_time=5 --rw=randrw --bs=4k --time_based=1
> 
> Key Results
> 
> Metric	       iodepth=1	           iodepth=64
> Read Latency	  ▼ 2.7% (3.8k→3kns)	  ▼ 1.3% (4.7M→4.6M ns)
> Write Latency	▼ 3.6% (112k→108kns)	▼ 2.8% (5.2M→5.0M ns)
> Read IOPS	    4740 → 4729 (±0.2%)	  ▲ 2.1% (6391→6529)
> Write IOPS	    4738 → 4727 (±0.2%)	  ▲ 2.2% (6390→6529)
> Throughput	    ~18.9 GB/s (stable)	  ▲ 2.1% (25.6→26.1 GB/s)

Are you sure throughput is GB/s instead of MB/s?

iodepth=1 read 4729 IOPS * bs=4k = 18 MB/s

Also, fio was configured with --rw=randrw, so the total throughput
should be read throughput + write throughput. Based on the read and
write IOPS numbers, the total throughput should be ~36 MB/s. Which
throughput number are you showing?

> 
> Analysis
> 
> High Concurrency (iodepth=64):
> Sustained throughput gains (+2.1-2.2%) demonstrate improved scalability
> Latency reductions confirm reduced contention in concurrent operations

This is surprising. Before this patch series the FUSE export code only
submits 1 request at a time, so the iodepth=64 results should be only a
little better than the iodepth=1 results. After this patch series the
FUSE export code should be submitting all 64 requests concurrently and
improving performance by more than 2%.

Why was the improvement only 2%?

> 
> saz97 (1):
>   Integration coroutines into fuse export
> 
>  block/export/fuse.c | 189 +++++++++++++++++++++++++++++++-------------
>  1 file changed, 132 insertions(+), 57 deletions(-)
> 
> -- 
> 2.34.1
>