From: Milosz Tanski
Subject: [PATCH v6 0/7] vfs: Non-blocking buffered fs read (page cache only)
Date: Mon, 10 Nov 2014 11:40:23 -0500
To: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig, linux-fsdevel@vger.kernel.org, linux-aio@kvack.org,
    Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
    Al Viro, linux-api@vger.kernel.org, Michael Kerrisk, linux-arch@vger.kernel.org

This patchset introduces the ability to perform a non-blocking read from
regular files in buffered IO mode. This works only for those filesystems
that have the data in the page cache.

It does this by introducing new syscalls preadv2/pwritev2. These new
syscalls behave like the network sendmsg/recvmsg syscalls in that they
accept an extra flag argument (RWF_NONBLOCK).

It's a very common pattern today (samba, libuv, etc.) to use a large
threadpool to perform buffered IO operations. The work is submitted from
another thread that performs network IO and epoll, or from other threads
that perform CPU work. This leads to increased latency for processing,
especially in the case of data that's already cached in the page cache.

With the new interface, applications can fetch the data in their
network / cpu bound thread(s) and only defer to a threadpool if it's not
there. In our own application (VLDB) we've observed a decrease in latency
for "fast" requests by avoiding unnecessary queuing and having to swap out
current tasks in IO bound work threads.

Version 6 highlights:
 - Compat syscall flag checks, per Jeff.
 - Minor stylistic suggestions.

Version 5 highlights:
 - XFS support for RWF_NONBLOCK, from Christoph.
 - RWF_DSYNC flag and support for pwritev2, from Christoph.
 - Implemented compat syscalls, per Jeff.
 - Missing nfs, ceph changes from older patchset.

Version 4 highlights:
 - Updated for 3.18-rc1.
 - Performance data from our application.
 - First stab at man page with Jeff's help. Patch is in-reply-to.

RFC Version 3 highlights:
 - Down to 2 syscalls from 4; can use fp or argument position.
 - RWF_NONBLOCK value flag is not the same as O_NONBLOCK, per Jeff.

RFC Version 2 highlights:
 - Put the flags argument into kiocb (less noise), per Al Viro.
 - O_DIRECT checking early in the process, per Jeff Moyer.
 - Resolved duplicate (c&p) code in syscall code, per Jeff.
 - Included perf data in thread cover letter, per Jeff.
 - Created a new flag (not O_NONBLOCK) for readv2, per Jeff.

Some perf data, generated using fio, comparing the posix aio engine to a
version of the posix AIO engine that attempts to perform "fast" reads
before submitting the operations to the queue. This workload is on an ext4
partition on raid0 (test / build-rig), simulating our database access
pattern using 16kb read accesses. Our database uses a home-spun posix aio
like queue (samba does the same thing).
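For illustration, a minimal sketch of the "fast read" fallback that the
modified engine performs. This is an assumption of how a caller would use
the new interface, not code from the series: there is no glibc wrapper yet,
so preadv2() below stands in for a raw syscall(2) invocation of the number
wired up by this series, and queue_to_threadpool() is a hypothetical
stand-in for the application's existing posix-aio-style submission queue.

  #include <errno.h>
  #include <sys/uio.h>

  /* Hypothetical handoff to the application's existing IO threadpool. */
  extern ssize_t queue_to_threadpool(int fd, void *buf, size_t len, off_t off);

  static ssize_t fast_read(int fd, void *buf, size_t len, off_t off)
  {
          struct iovec iov = { .iov_base = buf, .iov_len = len };
          ssize_t ret;

          /* Only copy what is already in the page cache; never block on IO. */
          ret = preadv2(fd, &iov, 1, off, RWF_NONBLOCK);
          if (ret >= 0)
                  return ret;     /* full (or partial) cache hit, serve it now */
          if (errno == EAGAIN)
                  return queue_to_threadpool(fd, buf, len, off);  /* cache miss */
          return -1;              /* real error */
  }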
f1: ~73% rand read over mostly cached data (zipf med-size dataset)
f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
f3: ~9% seq-read over large dataset

before:

f1:
    bw (KB /s): min=   11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
    lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
    lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
f2:
    bw (KB /s): min=    2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
    lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
    lat (msec) : >=2000=4.33%
f3:
    bw (KB /s): min=    0, max=265568, per=99.95%, avg=174575.10, stdev=34526.89
    lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
    lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
    lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
total:
   READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
         mint=600001msec, maxt=600113msec

after (with fast read using preadv2 before submit):

f1:
    bw (KB /s): min=    3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
    lat (usec) : 2=70.63%, 4=0.01%
    lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
f2:
    bw (KB /s): min=    2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
    lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
    lat (msec) : >=2000=9.99%
f3:
    bw (KB /s): min=    1, max=245448, per=100.00%, avg=177366.50, stdev=35995.60
    lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
    lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
    lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%
total:
   READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
         mint=600020msec, maxt=600178msec

Interpreting the results, you can see total bandwidth stays the same but
overall request latency is decreased in the f1 (random, mostly cached) and
f3 (sequential) workloads. There is a slight bump in latency for f2 since
it's random data that's unlikely to be cached, but we always attempt the
"fast read" first.

In our application we have started keeping track of "fast read" hits/misses,
and for files / requests that have a low hit ratio we don't attempt "fast
reads", mostly getting rid of the extra latency in the uncached cases. In
our real world workload we were able to reduce average response time by
20 to 30% (depending on the amount of IO done by the request).

I've performed other benchmarks and I have not observed any perf regressions
in any of the normal (old) code paths.

I have co-developed these changes with Christoph Hellwig.

Christoph Hellwig (3):
  xfs: add RWF_NONBLOCK support
  fs: pass iocb to generic_write_sync
  fs: add a flag for per-operation O_DSYNC semantics

Milosz Tanski (4):
  vfs: Prepare for adding a new preadv/pwritev with user flags.
  vfs: Define new syscalls preadv2,pwritev2
  x86: wire up preadv2 and pwritev2
  vfs: RWF_NONBLOCK flag for preadv2

 arch/x86/syscalls/syscall_32.tbl  |   2 +
 arch/x86/syscalls/syscall_64.tbl  |   2 +
 drivers/target/target_core_file.c |   6 +-
 fs/block_dev.c                    |   8 +-
 fs/btrfs/file.c                   |   7 +-
 fs/ceph/file.c                    |   6 +-
 fs/cifs/file.c                    |  14 +--
 fs/direct-io.c                    |   8 +-
 fs/ext4/file.c                    |   8 +-
 fs/fuse/file.c                    |   2 +
 fs/gfs2/file.c                    |   9 +-
 fs/nfs/file.c                     |  15 ++-
 fs/nfsd/vfs.c                     |   4 +-
 fs/ntfs/file.c                    |   8 +-
 fs/ocfs2/file.c                   |  12 +-
 fs/pipe.c                         |   3 +-
 fs/read_write.c                   | 233 +++++++++++++++++++++++++++++---------
 fs/splice.c                       |   2 +-
 fs/udf/file.c                     |  11 +-
 fs/xfs/xfs_file.c                 |  36 ++++--
 include/linux/aio.h               |   2 +
 include/linux/compat.h            |   6 +
 include/linux/fs.h                |  16 ++-
 include/linux/syscalls.h          |   6 +
 include/uapi/asm-generic/unistd.h |   6 +-
 mm/filemap.c                      |  56 +++++++-
 mm/shmem.c                        |   4 +
 27 files changed, 343 insertions(+), 149 deletions(-)

--
1.9.1