From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Maxim V. Patlasov" <mpatlasov@parallels.com>
Subject: Re: [fuse-devel] [PATCH 0/6] fuse: process direct IO asynchronously
Date: Thu, 13 Dec 2012 16:52:37 +0400
Message-ID: <50C9CF95.7090704@parallels.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "miklos@szeredi.hu" <miklos@szeredi.hu>, Alexander Viro
	<viro@zeniv.linux.org.uk>, "fuse-devel@lists.sourceforge.net"
	<fuse-devel@lists.sourceforge.net>, "linux-fsdevel@vger.kernel.org"
	<linux-fsdevel@vger.kernel.org>, <linux-aio@kvack.org>, <bcrl@kvack.org>
To: Brian Foster <bfoster@redhat.com>
Return-path: <owner-linux-aio@kvack.org>
Sender: owner-linux-aio@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Hi Brian,

>> Existing fuse implementation always processes direct IO 
>> synchronously: it
>> submits next request to userspace fuse only when previous is 
>> completed. This
>> is suboptimal because: 1) libaio DIO works in blocking way; 2) 
>> userspace fuse
>> can't achieve parallelism  processing several requests simultaneously 
>> (e.g.
>> in case of distributed network storage); 3) userspace fuse can't merge
>> requests before passing it to actual storage.
>>
>> The idea of the patch-set is to submit fuse requests in non-blocking way
>> (where it's possible) and either return -EIOCBQUEUED or wait for their
>> completion synchronously. The patch-set to be applied on top of 
>> for-next of
>> Miklos' git repo.
>>
>> To estimate performance improvement I used slightly modified fusexmp 
>> over
>> tmpfs (clearing O_DIRECT bit from fi->flags in xmp_open). For 
>> synchronous
>> operations I used 'dd' like this:
>>
>> dd of=/dev/null if=/fuse/mnt/file bs=2M count=256 iflag=direct
>> dd if=/dev/zero of=/fuse/mnt/file bs=2M count=256 oflag=direct 
>> conv=notrunc
>>
>> For AIO I used 'aio-stress' like this:
>>
>> aio-stress -s 512 -a 4 -b 1 -c 1 -O -o 1 /fuse/mnt/file
>> aio-stress -s 512 -a 4 -b 1 -c 1 -O -o 0 /fuse/mnt/file
>>
>> The throughput on some commodity (rather feeble) server was (in MB/sec):
>>
>>               original / patched
>>
>> dd reads:    ~322     / ~382
>> dd writes:   ~277     / ~288
>>
>> aio reads:   ~380     / ~459
>> aio writes:  ~319     / ~353
>>

> Thanks for posting this, first of all. I was reading through the code
> and found some of the logic a bit hard to follow, particularly due to
> controlling the behavior of underlying read/write functions based on the
> 'async' kiocb. For example: conditionally updating inode size in a
> couple different places; submitting sync requests async and waiting on
> them (causing the need to duplicate the update size call); and the
> existence of a kiocb called 'async' (when a kiocb can be either sync or
> async).
>
> Anyways, I couldn't quite put my finger on how to potentially clean that
> up without messing around with the code, so I've inlined the diff that I
> came up with that (IMO) cleans things up a bit. This should apply on top
> of your set and makes the following tweaks:
>
> - Always pass a kiocb down through __fuse_direct_write()/read() (instead
> of struct file).
> - Trigger the "async" behavior in fuse_direct_io() and
> fuse_send_read/write() based on the sync/async nature of the kiocb. This
> means that sync requests and async requests are sent as such based on
> the nature of the kiocb (as opposed to whether a kiocb exists).
> - Update the various other callers of __fuse_direct_write()/read and
> fuse_direct_io() to create and pass a sync kiocb where necessary.
> - Use the same approach in fuse_direct_IO() to turn an extending, async
> write into a sync write.
>
> The tradeoff with this approach is that we slightly uglify the callers
> that have to now create a sync kiocb. That said, I think some of those
> codepaths (i.e., fuse_file_aio_write()->fuse_perform_write()->...) could
> now just pass the kiocb from the vfs straight down. Thoughts?
>

Thanks a lot for review and further efforts. Your clean-up patch looks 
fine, but it constrains the scope of patch-set too much: ordinary 
synchronous dio-s (generated by read(2) and write(2)) cease to be 
optimized. I think such a loss doesn't justify the clean-up. To be sure 
that we are on the same page, I'll try to explain this optimization in 
more details:

Imagine direct read or write of 1MB size. Having 
FUSE_MAX_PAGES_PER_REQ==32, this can be fulfilled by 8 fuse requests. 
Generic linux-kernel (e.g. generic_file_aio_read) passes synchronous 
iocb to fuse_direct_IO(). From fuse perspective, it only means that the 
caller expects to get control back when that 1MB request is completed. 
It's up to fuse how to process that 1MB request.

Existing fuse implementation (w/o my patch-set) processes it like this:
  - allocate, fill and send 1st fuse request, wait for its completion 
(ACK-ing by userspace);
  - allocate, fill and send 2nd fuse request, wait for its completion 
(ACK-ing by userspace);
  - ...
  - allocate, fill and send 8th fuse request, wait for its completion 
(ACK-ing by userspace);

With my patch-set applied, it becomes like this:
  - allocate, fill and send 1st fuse request;
  - allocate, fill and send 2nd fuse request;
  - ...
  - allocate, fill and send 8th fuse request;
  - wait till all of them are completed (ACK-ed by userspace)

This is significant optimization for non-trivial fuse userspace 
implementations because: 1) fuse requests might be processed 
concurrently; 2) fuse requests might be merged before processing.

Now, let's proceed to technical details... Your idea to use iocb 
everywhere instead of file and rely on its type (sync vs. async) is 
cool. I like it! Unfortunately, to make it usable in both libaio and 
sync dio cases we'll need to put our hands on generic linux-kernel code 
as well. Let me explain why:

Currently, iocb has only one binary attribute - it's either sync or 
async. aio_complete() derives its ancestry (libaio vs. read(2)/write(2)) 
from its sync/async type:

>     if (is_sync_kiocb(iocb)) {
>         BUG_ON(iocb->ki_users != 1);
>         iocb->ki_user_data = res;
>         iocb->ki_users = 0;
>         wake_up_process(iocb->ki_obj.tsk);
>         return 1;
>     }

In the other words, if it's sync, it's enough to wake up someone in 
kernel (who called wait_on_sync_kiocb()) and return. Otherwise, 
aio_complete() performs some steps to wake up user (who called 
io_getevents()). What we need for fuse is another binary attribute: iocb 
was generated by libaio vs. read(2)/write(2). Then we could use 
aio_complete() and wait_on_sync_kiocb() for any iocb which was not 
generated by libaio. E.g. to process sync dio in async way, we'd 
allocate iocb in fuse_direct_IO on stack (as you suggested), but 
initialize it as async iocb and pass it to __fuse_direct_read/write, 
then call wait_on_sync_kiocb(), not to return -EIOCBQUEUED.

The problem is that FUSE tends to be the only user of this 'feature'. 
I'm considering one-line change of aio_complete():

-    if (is_sync_kiocb(iocb)) {
+    if (!iocb->ki_ctx) {

Though, not sure whether Miklos, Al Viro, Benjamin and others will 
accept it... Let's try if you're OK about this approach. At least this 
will cover all fuse dio use-cases (including libaio extending file) and 
makes fuse code significantly cleaner.

Thanks,
Maxim

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>