* [PATCH 0/1 RFC] FUSE Export Coroutine Integration Cover Letter
From: saz97 @ 2025-03-15 17:30 UTC
  To: qemu-devel; +Cc: hreitz, kwolf, stefanha, qemu-block, saz97

Signed-off-by: Changzhi Xie <sa.z@qq.com>

FUSE Export Coroutine Integration Cover Letter

This patch series refactors QEMU's FUSE export module to leverage coroutines for read/write operations, 
addressing concurrency limitations and aligning with QEMU's asynchronous I/O model. The changes 
demonstrate measurable performance improvements while simplifying resource management.

1. Technical Implementation
Key modifications address prior review feedback (Stefan Hajnoczi) and optimize execution flow:

1.1 Coroutine Integration
Convert fuse_read()/fuse_write() to launch coroutines (fuse_*_coroutine)
Utilize non-blocking blk_co_pread()/blk_co_pwrite() for block layer access
Eliminate main loop blocking during heavy I/O workloads

1.2 Buffer Management
Removed explicit buffer pre-allocation in read_from_fuse_export()
Replaced fuse_buf_free() with g_free() due to libfuse3 API constraints

1.3 Resource Lifecycle
Moved in_flight decrement and blk_exp_unref() into coroutines
Added FUSE opcode checks (FUSE_READ/FUSE_WRITE) to prevent premature cleanup

1.4 Structural Improvements
Simplified FuseIORequest structure:
Removed redundant fuse_ino_t and fuse_file_info fields
Retained minimal parameter passing requirements

2. Performance Validation
Tested using fio with a 4K random read/write pattern; each result is the average of 5 runs:
fio --ioengine=io_uring --numjobs=1 --runtime=30 --ramp_time=5 --rw=randrw --bs=4k --time_based=1

Key Results

Metric          iodepth=1                  iodepth=64
Read Latency    ▼ 2.7% (3.8k → 3k ns)      ▼ 1.3% (4.7M → 4.6M ns)
Write Latency   ▼ 3.6% (112k → 108k ns)    ▼ 2.8% (5.2M → 5.0M ns)
Read IOPS       4740 → 4729 (±0.2%)        ▲ 2.1% (6391 → 6529)
Write IOPS      4738 → 4727 (±0.2%)        ▲ 2.2% (6390 → 6529)
Throughput      ~18.9 GB/s (stable)        ▲ 2.1% (25.6 → 26.1 GB/s)

Analysis

High Concurrency (iodepth=64):
Sustained throughput gains (+2.1-2.2%) demonstrate improved scalability
Latency reductions confirm reduced contention in concurrent operations

saz97 (1):
  Integration coroutines into fuse export

 block/export/fuse.c | 189 +++++++++++++++++++++++++++++++-------------
 1 file changed, 132 insertions(+), 57 deletions(-)

-- 
2.34.1




* Re: [PATCH 0/1 RFC] FUSE Export Coroutine Integration Cover Letter
From: Stefan Hajnoczi @ 2025-03-17 20:56 UTC
  To: saz97; +Cc: qemu-devel, hreitz, kwolf, qemu-block

On Sun, Mar 16, 2025 at 01:30:06AM +0800, saz97 wrote:
> Signed-off-by: Changzhi Xie <sa.z@qq.com>
> 
> FUSE Export Coroutine Integration Cover Letter
> 
> This patch series refactors QEMU's FUSE export module to leverage coroutines for read/write operations, 
> addressing concurrency limitations and aligning with QEMU's asynchronous I/O model. The changes 
> demonstrate measurable performance improvements while simplifying resource management.
> 
> 1. Technical Implementation
> Key modifications address prior review feedback (Stefan Hajnoczi) and optimize execution flow:
> 
> 1.1 Coroutine Integration
> Convert fuse_read()/fuse_write() to launch coroutines (fuse_*_coroutine)
> Utilize non-blocking blk_co_pread()/blk_co_pwrite() for block layer access
> Eliminate main loop blocking during heavy I/O workloads
> 
> 1.2 Buffer Management
> Removed explicit buffer pre-allocation in read_from_fuse_export()
> Replaced fuse_buf_free() with g_free() due to libfuse3 API constraints
> 
> 1.3 Resource Lifecycle
> Moved in_flight decrement and blk_exp_unref() into coroutines
> Added FUSE opcode checks (FUSE_READ/FUSE_WRITE) to prevent premature cleanup
> 
> 1.4 Structural Improvements
> Simplified FuseIORequest structure:
> Removed redundant fuse_ino_t and fuse_file_info fields
> Retained minimal parameter passing requirements
> 
> 2. Performance Validation
> Tested using fio with a 4K random read/write pattern; each result is the average of 5 runs:
> fio --ioengine=io_uring --numjobs=1 --runtime=30 --ramp_time=5 --rw=randrw --bs=4k --time_based=1
> 
> Key Results
> 
> Metric          iodepth=1                  iodepth=64
> Read Latency    ▼ 2.7% (3.8k → 3k ns)      ▼ 1.3% (4.7M → 4.6M ns)
> Write Latency   ▼ 3.6% (112k → 108k ns)    ▼ 2.8% (5.2M → 5.0M ns)
> Read IOPS       4740 → 4729 (±0.2%)        ▲ 2.1% (6391 → 6529)
> Write IOPS      4738 → 4727 (±0.2%)        ▲ 2.2% (6390 → 6529)
> Throughput      ~18.9 GB/s (stable)        ▲ 2.1% (25.6 → 26.1 GB/s)

Are you sure throughput is GB/s instead of MB/s?

iodepth=1 read 4729 IOPS * bs=4k = 18 MB/s

Also, fio was configured with --rw=randrw, so the total throughput
should be read throughput + write throughput. Based on the read and
write IOPS numbers, the total throughput should be ~36 MB/s. Which
throughput number are you showing?
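
The arithmetic above can be reproduced with a short script. This is a
minimal sketch, assuming --bs=4k means 4096-byte blocks and using the
iodepth=1 post-patch IOPS figures from the quoted table; the helper name
throughput_mib_s is illustrative and not part of fio or QEMU.

```python
KIB = 1024


def throughput_mib_s(iops, block_size_bytes=4 * KIB):
    """Convert an IOPS figure into MiB/s for a fixed block size."""
    return iops * block_size_bytes / (KIB * KIB)


# iodepth=1 figures after the patch, taken from the quoted table.
read_iops = 4729
write_iops = 4727

read_mib_s = throughput_mib_s(read_iops)
# --rw=randrw issues reads and writes, so total throughput is the sum.
total_mib_s = throughput_mib_s(read_iops + write_iops)

print(f"read  ~ {read_mib_s:.1f} MiB/s")
print(f"total ~ {total_mib_s:.1f} MiB/s")
```

Under these assumptions the read stream alone comes to roughly 18 MiB/s
and read plus write to roughly 37 MiB/s, which fits MB/s rather than GB/s.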

> 
> Analysis
> 
> High Concurrency (iodepth=64):
> Sustained throughput gains (+2.1-2.2%) demonstrate improved scalability
> Latency reductions confirm reduced contention in concurrent operations

This is surprising. Before this patch series the FUSE export code only
submits 1 request at a time, so the iodepth=64 results should be only a
little better than the iodepth=1 results. After this patch series the
FUSE export code should be submitting all 64 requests concurrently and
improving performance by more than 2%.

Why was the improvement only 2%?
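
The difference between the two submission models can be illustrated
outside QEMU. This is a toy sketch using Python's asyncio, not the FUSE
export code: asyncio.sleep() stands in for a block-layer call such as
blk_co_pread(), and the names handle_request, serial_export and
concurrent_export are hypothetical. It only tracks how many requests are
in flight at the same time.

```python
import asyncio


async def handle_request(state, delay=0.001):
    """Pretend to serve one I/O request and record the in-flight count."""
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(delay)  # placeholder for the actual block I/O
    state["in_flight"] -= 1


async def serial_export(n):
    """Serve requests one at a time: the next starts after the previous ends."""
    state = {"in_flight": 0, "peak": 0}
    for _ in range(n):
        await handle_request(state)
    return state["peak"]


async def concurrent_export(n):
    """Start every request before waiting for any of them to finish."""
    state = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(handle_request(state) for _ in range(n)))
    return state["peak"]


serial_peak = asyncio.run(serial_export(64))
concurrent_peak = asyncio.run(concurrent_export(64))
print("peak in-flight, serial:    ", serial_peak)
print("peak in-flight, concurrent:", concurrent_peak)
```

With 64 queued requests the serial loop never has more than one request
in flight, while the concurrent version has all 64 outstanding at once.
That gap is why a gain well above 2% would be expected at iodepth=64.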

> 
> saz97 (1):
>   Integration coroutines into fuse export
> 
>  block/export/fuse.c | 189 +++++++++++++++++++++++++++++++-------------
>  1 file changed, 132 insertions(+), 57 deletions(-)
> 
> -- 
> 2.34.1
> 


* [PATCH 0/1 RFC] FUSE Export Coroutine Integration Cover Letter
From: saz97 @ 2025-03-24 8:05 UTC
  To: qemu-devel; +Cc: hreitz, kwolf, stefanha, qemu-block, saz97

This patch series refactors QEMU's FUSE export module to leverage coroutines for read/write operations,
addressing concurrency limitations and aligning with QEMU's asynchronous I/O model. The changes
demonstrate measurable performance improvements while simplifying resource management.

1. Technical implementation

   Following Stefan's suggestion, I moved the processing logic of read_from_fuse_export() into a coroutine for buffer management,
   and changed fuse_getattr() to call bdrv_co_get_allocated_file_size().

2. Performance summary

   For the coroutine_integration_fuse test, the average results for iodepth=1 and iodepth=64 are as follows:
    -------------------------------  
    Average results for iodepth=1:
    Read_IOPS: coroutine_integration_fuse: 4492.88 | origin: 4309.39 | 4.25% improvement
    Write_IOPS: coroutine_integration_fuse: 4500.68 | origin: 4318.68 | 4.21% improvement
    Read_BW: coroutine_integration_fuse: 17971.00 KB/s | origin: 17237.30 KB/s | 4.26% improvement
    Write_BW: coroutine_integration_fuse: 18002.50 KB/s | origin: 17274.30 KB/s | 4.23% improvement
    --------------------------------
    -------------------------------
    Average results for iodepth=64:
    Read_IOPS: coroutine_integration_fuse: 5576.93 | origin: 5347.13 | 4.29% improvement
    Write_IOPS: coroutine_integration_fuse: 5569.55 | origin: 5337.33 | 4.33% improvement
    Read_BW: coroutine_integration_fuse: 22311.40 KB/s | origin: 21392.20 KB/s | 4.31% improvement
    Write_BW: coroutine_integration_fuse: 22282.20 KB/s | origin: 21353.20 KB/s | 4.34% improvement
    --------------------------------
   Although all metrics show improvements, the gains are concentrated in the 4.2%–4.3% range, which is lower than expected. Further investigation using gprof reveals the reasons for this limited improvement.
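
   The percentages above, and how much iodepth=64 helps relative to iodepth=1, can be recomputed from the averaged figures.
   This is a minimal sketch using only the Read_IOPS averages listed above; the helper improvement_pct is illustrative,
   and values recomputed from rounded averages may differ from the reported percentages in the last digit.

```python
def improvement_pct(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Read_IOPS averages from the summary above, keyed by iodepth.
read_iops = {
    1: {"coroutine": 4492.88, "origin": 4309.39},
    64: {"coroutine": 5576.93, "origin": 5347.13},
}

gain_qd1 = improvement_pct(read_iops[1]["coroutine"], read_iops[1]["origin"])
gain_qd64 = improvement_pct(read_iops[64]["coroutine"], read_iops[64]["origin"])

# How much faster iodepth=64 is than iodepth=1 within each implementation.
scaling_origin = read_iops[64]["origin"] / read_iops[1]["origin"]
scaling_coroutine = read_iops[64]["coroutine"] / read_iops[1]["coroutine"]

print(f"gain at iodepth=1:  {gain_qd1:.2f}%")
print(f"gain at iodepth=64: {gain_qd64:.2f}%")
print(f"iodepth 64 vs 1, origin:    {scaling_origin:.2f}x")
print(f"iodepth 64 vs 1, coroutine: {scaling_coroutine:.2f}x")
```

   On these numbers both implementations scale by about the same factor (roughly 1.24x) when going from iodepth=1 to iodepth=64.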

3. Performance Bottlenecks Identified via gprof
   After running a fio test with the following command:
   fio --ioengine=io_uring --numjobs=1 --runtime=30 --ramp_time=5 \
    --rw=randrw --bs=4k --time_based=1 --name=job1 \
    --filename=/mnt/qemu-fuse --iodepth=64
   and analyzing the execution profile using gprof, the following issues were identified:

   3.1 Increased Overall Execution Time
   In the original implementation, fuse_write + blk_pwrite accounted for 8.7% of total execution time (6.0% + 2.7%).
   After refactoring, fuse_write_coroutine + blk_co_pwrite now accounts for 43.1% (22.9% + 20.2%).
   This suggests that coroutine overhead is contributing significantly to execution time.

   3.2 Increased Read and Write Calls
   fuse_write calls increased from 173,400 → 333,232.
   fuse_read calls increased from 173,526 → 332,931.
   This indicates that the coroutine-based approach is introducing redundant I/O calls, likely due to unnecessary coroutine switches.

   3.3 Significant Coroutine Overhead
   qemu_coroutine_enter is now called 1,572,803 times, compared to ~476,057 previously.
   This frequent coroutine switching introduces unnecessary overhead, limiting the expected performance improvements.
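
   The call counts in 3.2 and 3.3 can be normalized per request to make the comparison easier to read.
   This is a rough sketch, assuming both gprof runs used the same workload and that the fuse_read/fuse_write call counts
   approximate the number of FUSE requests served; the dictionary layout and helper name are illustrative only.

```python
# Call counts reported by gprof in sections 3.2 and 3.3.
profile = {
    "origin": {
        "fuse_read": 173_526,
        "fuse_write": 173_400,
        "qemu_coroutine_enter": 476_057,
    },
    "coroutine": {
        "fuse_read": 332_931,
        "fuse_write": 333_232,
        "qemu_coroutine_enter": 1_572_803,
    },
}


def enters_per_request(counts):
    """qemu_coroutine_enter calls divided by read + write handler calls."""
    requests = counts["fuse_read"] + counts["fuse_write"]
    return counts["qemu_coroutine_enter"] / requests


before = enters_per_request(profile["origin"])
after = enters_per_request(profile["coroutine"])
write_call_growth = (
    profile["coroutine"]["fuse_write"] / profile["origin"]["fuse_write"]
)

print(f"coroutine enters per request, origin:    {before:.2f}")
print(f"coroutine enters per request, coroutine: {after:.2f}")
print(f"fuse_write call growth: {write_call_growth:.2f}x")
```

   Under these assumptions the number of coroutine entries per request rises from about 1.37 to about 2.36.
   The ratio alone does not establish the cause of the extra entries.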

saz97 (1):
  Integration coroutines into fuse export

 block/export/fuse.c | 190 +++++++++++++++++++++++++++++---------------
 1 file changed, 126 insertions(+), 64 deletions(-)

-- 
2.34.1




* Re: [PATCH 0/1 RFC] FUSE Export Coroutine Integration Cover Letter
From: Stefan Hajnoczi @ 2025-03-24 14:41 UTC
  To: saz97; +Cc: qemu-devel, hreitz, kwolf, qemu-block

On Mon, Mar 24, 2025 at 04:05:09PM +0800, saz97 wrote:
> This patch series refactors QEMU's FUSE export module to leverage coroutines for read/write operations,
> addressing concurrency limitations and aligning with QEMU's asynchronous I/O model. The changes
> demonstrate measurable performance improvements while simplifying resource management.
> 
> 1. Technical implementation
> 
>    Following Stefan's suggestion, I moved the processing logic of read_from_fuse_export() into a coroutine for buffer management,
>    and changed fuse_getattr() to call bdrv_co_get_allocated_file_size().
> 
> 2. Performance summary
> 
>    For the coroutine_integration_fuse test, the average results for iodepth=1 and iodepth=64 are as follows:
>     -------------------------------  
>     Average results for iodepth=1:
>     Read_IOPS: coroutine_integration_fuse: 4492.88 | origin: 4309.39 | 4.25% improvement
>     Write_IOPS: coroutine_integration_fuse: 4500.68 | origin: 4318.68 | 4.21% improvement
>     Read_BW: coroutine_integration_fuse: 17971.00 KB/s | origin: 17237.30 KB/s | 4.26% improvement
>     Write_BW: coroutine_integration_fuse: 18002.50 KB/s | origin: 17274.30 KB/s | 4.23% improvement
>     --------------------------------
>     -------------------------------
>     Average results for iodepth=64:
>     Read_IOPS: coroutine_integration_fuse: 5576.93 | origin: 5347.13 | 4.29% improvement
>     Write_IOPS: coroutine_integration_fuse: 5569.55 | origin: 5337.33 | 4.33% improvement
>     Read_BW: coroutine_integration_fuse: 22311.40 KB/s | origin: 21392.20 KB/s | 4.31% improvement
>     Write_BW: coroutine_integration_fuse: 22282.20 KB/s | origin: 21353.20 KB/s | 4.34% improvement
>     --------------------------------
>    Although all metrics show improvements, the gains are concentrated in the 4.2%–4.3% range, which is lower than expected. Further investigation using gprof reveals the reasons for this limited improvement.
> 
> 3. Performance Bottlenecks Identified via gprof
>    After running a fio test with the following command:
>    fio --ioengine=io_uring --numjobs=1 --runtime=30 --ramp_time=5 \
>     --rw=randrw --bs=4k --time_based=1 --name=job1 \
>     --filename=/mnt/qemu-fuse --iodepth=64
>    and analyzing the execution profile using gprof, the following issues were identified:
> 
>    3.1 Increased Overall Execution Time
>    In the original implementation, fuse_write + blk_pwrite accounted for 8.7% of total execution time (6.0% + 2.7%).
>    After refactoring, fuse_write_coroutine + blk_co_pwrite now accounts for 43.1% (22.9% + 20.2%).
>    This suggests that coroutine overhead is contributing significantly to execution time.
> 
>    3.2 Increased Read and Write Calls
>    fuse_write calls increased from 173,400 → 333,232.
>    fuse_read calls increased from 173,526 → 332,931.
>    This indicates that the coroutine-based approach is introducing redundant I/O calls, likely due to unnecessary coroutine switches.
> 
>    3.3 Significant Coroutine Overhead
>    qemu_coroutine_enter is now called 1,572,803 times, compared to ~476,057 previously.
>    This frequent coroutine switching introduces unnecessary overhead, limiting the expected performance improvements.

Due to the remaining performance issues, let's leave this contribution
task here.

Please focus on submitting your Google Summer of Code application at
https://summerofcode.withgoogle.com/ by April 8th.

Thanks,
Stefan

> 
> saz97 (1):
>   Integration coroutines into fuse export
> 
>  block/export/fuse.c | 190 +++++++++++++++++++++++++++++---------------
>  1 file changed, 126 insertions(+), 64 deletions(-)
> 
> -- 
> 2.34.1
> 
