From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Maxim V. Patlasov" Subject: Re: [fuse-devel] [PATCH 0/6] fuse: process direct IO asynchronously Date: Thu, 13 Dec 2012 16:52:37 +0400 Message-ID: <50C9CF95.7090704@parallels.com> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 7bit Cc: "miklos@szeredi.hu" , Alexander Viro , "fuse-devel@lists.sourceforge.net" , "linux-fsdevel@vger.kernel.org" , , To: Brian Foster Return-path: Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Brian, >> Existing fuse implementation always processes direct IO >> synchronously: it >> submits next request to userspace fuse only when previous is >> completed. This >> is suboptimal because: 1) libaio DIO works in blocking way; 2) >> userspace fuse >> can't achieve parallelism processing several requests simultaneously >> (e.g. >> in case of distributed network storage); 3) userspace fuse can't merge >> requests before passing it to actual storage. >> >> The idea of the patch-set is to submit fuse requests in non-blocking way >> (where it's possible) and either return -EIOCBQUEUED or wait for their >> completion synchronously. The patch-set to be applied on top of >> for-next of >> Miklos' git repo. >> >> To estimate performance improvement I used slightly modified fusexmp >> over >> tmpfs (clearing O_DIRECT bit from fi->flags in xmp_open). For >> synchronous >> operations I used 'dd' like this: >> >> dd of=/dev/null if=/fuse/mnt/file bs=2M count=256 iflag=direct >> dd if=/dev/zero of=/fuse/mnt/file bs=2M count=256 oflag=direct >> conv=notrunc >> >> For AIO I used 'aio-stress' like this: >> >> aio-stress -s 512 -a 4 -b 1 -c 1 -O -o 1 /fuse/mnt/file >> aio-stress -s 512 -a 4 -b 1 -c 1 -O -o 0 /fuse/mnt/file >> >> The throughput on some commodity (rather feeble) server was (in MB/sec): >> >> original / patched >> >> dd reads: ~322 / ~382 >> dd writes: ~277 / ~288 >> >> aio reads: ~380 / ~459 >> aio writes: ~319 / ~353 >> > Thanks for posting this, first of all. I was reading through the code > and found some of the logic a bit hard to follow, particularly due to > controlling the behavior of underlying read/write functions based on the > 'async' kiocb. For example: conditionally updating inode size in a > couple different places; submitting sync requests async and waiting on > them (causing the need to duplicate the update size call); and the > existence of a kiocb called 'async' (when a kiocb can be either sync or > async). > > Anyways, I couldn't quite put my finger on how to potentially clean that > up without messing around with the code, so I've inlined the diff that I > came up with that (IMO) cleans things up a bit. This should apply on top > of your set and makes the following tweaks: > > - Always pass a kiocb down through __fuse_direct_write()/read() (instead > of struct file). > - Trigger the "async" behavior in fuse_direct_io() and > fuse_send_read/write() based on the sync/async nature of the kiocb. This > means that sync requests and async requests are sent as such based on > the nature of the kiocb (as opposed to whether a kiocb exists). > - Update the various other callers of __fuse_direct_write()/read and > fuse_direct_io() to create and pass a sync kiocb where necessary. > - Use the same approach in fuse_direct_IO() to turn an extending, async > write into a sync write. > > The tradeoff with this approach is that we slightly uglify the callers > that have to now create a sync kiocb. That said, I think some of those > codepaths (i.e., fuse_file_aio_write()->fuse_perform_write()->...) could > now just pass the kiocb from the vfs straight down. Thoughts? > Thanks a lot for review and further efforts. Your clean-up patch looks fine, but it constrains the scope of patch-set too much: ordinary synchronous dio-s (generated by read(2) and write(2)) cease to be optimized. I think such a loss doesn't justify the clean-up. To be sure that we are on the same page, I'll try to explain this optimization in more details: Imagine direct read or write of 1MB size. Having FUSE_MAX_PAGES_PER_REQ==32, this can be fulfilled by 8 fuse requests. Generic linux-kernel (e.g. generic_file_aio_read) passes synchronous iocb to fuse_direct_IO(). From fuse perspective, it only means that the caller expects to get control back when that 1MB request is completed. It's up to fuse how to process that 1MB request. Existing fuse implementation (w/o my patch-set) processes it like this: - allocate, fill and send 1st fuse request, wait for its completion (ACK-ing by userspace); - allocate, fill and send 2nd fuse request, wait for its completion (ACK-ing by userspace); - ... - allocate, fill and send 8th fuse request, wait for its completion (ACK-ing by userspace); With my patch-set applied, it becomes like this: - allocate, fill and send 1st fuse request; - allocate, fill and send 2nd fuse request; - ... - allocate, fill and send 8th fuse request; - wait till all of them are completed (ACK-ed by userspace) This is significant optimization for non-trivial fuse userspace implementations because: 1) fuse requests might be processed concurrently; 2) fuse requests might be merged before processing. Now, let's proceed to technical details... Your idea to use iocb everywhere instead of file and rely on its type (sync vs. async) is cool. I like it! Unfortunately, to make it usable in both libaio and sync dio cases we'll need to put our hands on generic linux-kernel code as well. Let me explain why: Currently, iocb has only one binary attribute - it's either sync or async. aio_complete() derives its ancestry (libaio vs. read(2)/write(2)) from its sync/async type: > if (is_sync_kiocb(iocb)) { > BUG_ON(iocb->ki_users != 1); > iocb->ki_user_data = res; > iocb->ki_users = 0; > wake_up_process(iocb->ki_obj.tsk); > return 1; > } In the other words, if it's sync, it's enough to wake up someone in kernel (who called wait_on_sync_kiocb()) and return. Otherwise, aio_complete() performs some steps to wake up user (who called io_getevents()). What we need for fuse is another binary attribute: iocb was generated by libaio vs. read(2)/write(2). Then we could use aio_complete() and wait_on_sync_kiocb() for any iocb which was not generated by libaio. E.g. to process sync dio in async way, we'd allocate iocb in fuse_direct_IO on stack (as you suggested), but initialize it as async iocb and pass it to __fuse_direct_read/write, then call wait_on_sync_kiocb(), not to return -EIOCBQUEUED. The problem is that FUSE tends to be the only user of this 'feature'. I'm considering one-line change of aio_complete(): - if (is_sync_kiocb(iocb)) { + if (!iocb->ki_ctx) { Though, not sure whether Miklos, Al Viro, Benjamin and others will accept it... Let's try if you're OK about this approach. At least this will cover all fuse dio use-cases (including libaio extending file) and makes fuse code significantly cleaner. Thanks, Maxim -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org