* selective block polling and preadv2/pwritev2 revisited @ 2015-12-24 14:14 Christoph Hellwig [not found] ` <1450966464-6847-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org> ` (3 more replies) 0 siblings, 4 replies; 15+ messages in thread From: Christoph Hellwig @ 2015-12-24 14:14 UTC (permalink / raw) To: viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, axboe-b10kYP2dOMg Cc: milosz-B5zB6C1i6pkAvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-block-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA

This series allows selectively enabling or disabling polling for completions in the block layer [1] on a per-I/O basis. For this it resurrects the preadv2/pwritev2 syscalls that Milosz prepared a while ago (and which are much simpler now due to VFS changes that happened in the meantime). That approach also had a man page update prepared, which I will resubmit with the current flags once this series makes it in.

Polling for block I/O is important to reduce the latency on flash and post-flash storage technologies. On the fastest NVMe controller I have access to it almost halves latencies from over 7 microseconds to about 4 microseconds. But it is only useful if we actually care about the latency of this particular I/O, and generally is a waste if enabled for all I/O to a given device. This series uses the per-I/O flags in preadv2/pwritev2 to control this behavior. The alternative would be a new O_* flag set at open time or using fcntl, but this is still too coarse-grained for some applications and we're starting to run out of open flags.

Note that there are plenty of other use cases for preadv2/pwritev2 as well, but I'd like to concentrate on this one for now. Examples are: non-blocking reads (the original purpose), per-I/O O_SYNC, user space support for T10 DIF/DIX application tags and probably some more.

[1] only supported for NVMe at the moment.

^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <1450966464-6847-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>]
* [PATCH 1/6] vfs: pass a flags argument to vfs_readv/vfs_writev [not found] ` <1450966464-6847-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org> @ 2015-12-24 14:14 ` Christoph Hellwig 2015-12-24 14:14 ` [PATCH 2/6] vfs: vfs: Define new syscalls preadv2,pwritev2 Christoph Hellwig ` (2 subsequent siblings) 3 siblings, 0 replies; 15+ messages in thread From: Christoph Hellwig @ 2015-12-24 14:14 UTC (permalink / raw) To: viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, axboe-b10kYP2dOMg Cc: milosz-B5zB6C1i6pkAvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-block-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA From: Milosz Tanski <milosz-B5zB6C1i6pkAvxtiuMwx3w@public.gmane.org> This way we can set kiocb flags also from the sync read/write path. Signed-off-by: Milosz Tanski <milosz-B5zB6C1i6pkAvxtiuMwx3w@public.gmane.org> [hch: rebased on top of my kiocb changes] Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> --- fs/nfsd/vfs.c | 4 ++-- fs/read_write.c | 44 ++++++++++++++++++++++++++------------------ fs/splice.c | 2 +- include/linux/fs.h | 4 ++-- 4 files changed, 31 insertions(+), 23 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 994d66f..3a9f7bf 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -855,7 +855,7 @@ __be32 nfsd_readv(struct file *file, loff_t offset, struct kvec *vec, int vlen, oldfs = get_fs(); set_fs(KERNEL_DS); - host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset); + host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset, 0); set_fs(oldfs); return nfsd_finish_read(file, count, host_err); } @@ -942,7 +942,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file, /* Write the data. 
*/ oldfs = get_fs(); set_fs(KERNEL_DS); - host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos); + host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos, 0); set_fs(oldfs); if (host_err < 0) goto out_nfserr; diff --git a/fs/read_write.c b/fs/read_write.c index 819ef3f..34a2920 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -653,11 +653,14 @@ unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to) EXPORT_SYMBOL(iov_shorten); static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter, - loff_t *ppos, iter_fn_t fn) + loff_t *ppos, iter_fn_t fn, int flags) { struct kiocb kiocb; ssize_t ret; + if (flags) + return -EOPNOTSUPP; + init_sync_kiocb(&kiocb, filp); kiocb.ki_pos = *ppos; @@ -669,10 +672,13 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter, /* Do it by hand, with file-ops */ static ssize_t do_loop_readv_writev(struct file *filp, struct iov_iter *iter, - loff_t *ppos, io_fn_t fn) + loff_t *ppos, io_fn_t fn, int flags) { ssize_t ret = 0; + if (flags) + return -EOPNOTSUPP; + while (iov_iter_count(iter)) { struct iovec iovec = iov_iter_iovec(iter); ssize_t nr; @@ -773,7 +779,8 @@ out: static ssize_t do_readv_writev(int type, struct file *file, const struct iovec __user * uvector, - unsigned long nr_segs, loff_t *pos) + unsigned long nr_segs, loff_t *pos, + int flags) { size_t tot_len; struct iovec iovstack[UIO_FASTIOV]; @@ -805,9 +812,9 @@ static ssize_t do_readv_writev(int type, struct file *file, } if (iter_fn) - ret = do_iter_readv_writev(file, &iter, pos, iter_fn); + ret = do_iter_readv_writev(file, &iter, pos, iter_fn, flags); else - ret = do_loop_readv_writev(file, &iter, pos, fn); + ret = do_loop_readv_writev(file, &iter, pos, fn, flags); if (type != READ) file_end_write(file); @@ -824,27 +831,27 @@ out: } ssize_t vfs_readv(struct file *file, const struct iovec __user *vec, - unsigned long vlen, loff_t *pos) + unsigned long vlen, loff_t *pos, int flags) { 
if (!(file->f_mode & FMODE_READ)) return -EBADF; if (!(file->f_mode & FMODE_CAN_READ)) return -EINVAL; - return do_readv_writev(READ, file, vec, vlen, pos); + return do_readv_writev(READ, file, vec, vlen, pos, flags); } EXPORT_SYMBOL(vfs_readv); ssize_t vfs_writev(struct file *file, const struct iovec __user *vec, - unsigned long vlen, loff_t *pos) + unsigned long vlen, loff_t *pos, int flags) { if (!(file->f_mode & FMODE_WRITE)) return -EBADF; if (!(file->f_mode & FMODE_CAN_WRITE)) return -EINVAL; - return do_readv_writev(WRITE, file, vec, vlen, pos); + return do_readv_writev(WRITE, file, vec, vlen, pos, flags); } EXPORT_SYMBOL(vfs_writev); @@ -857,7 +864,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec, if (f.file) { loff_t pos = file_pos_read(f.file); - ret = vfs_readv(f.file, vec, vlen, &pos); + ret = vfs_readv(f.file, vec, vlen, &pos, 0); if (ret >= 0) file_pos_write(f.file, pos); fdput_pos(f); @@ -877,7 +884,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec, if (f.file) { loff_t pos = file_pos_read(f.file); - ret = vfs_writev(f.file, vec, vlen, &pos); + ret = vfs_writev(f.file, vec, vlen, &pos, 0); if (ret >= 0) file_pos_write(f.file, pos); fdput_pos(f); @@ -909,7 +916,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec, if (f.file) { ret = -ESPIPE; if (f.file->f_mode & FMODE_PREAD) - ret = vfs_readv(f.file, vec, vlen, &pos); + ret = vfs_readv(f.file, vec, vlen, &pos, 0); fdput(f); } @@ -933,7 +940,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec, if (f.file) { ret = -ESPIPE; if (f.file->f_mode & FMODE_PWRITE) - ret = vfs_writev(f.file, vec, vlen, &pos); + ret = vfs_writev(f.file, vec, vlen, &pos, 0); fdput(f); } @@ -947,7 +954,8 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec, static ssize_t compat_do_readv_writev(int type, struct file *file, const struct compat_iovec __user *uvector, - unsigned long 
nr_segs, loff_t *pos) + unsigned long nr_segs, loff_t *pos, + int flags) { compat_ssize_t tot_len; struct iovec iovstack[UIO_FASTIOV]; @@ -979,9 +987,9 @@ static ssize_t compat_do_readv_writev(int type, struct file *file, } if (iter_fn) - ret = do_iter_readv_writev(file, &iter, pos, iter_fn); + ret = do_iter_readv_writev(file, &iter, pos, iter_fn, flags); else - ret = do_loop_readv_writev(file, &iter, pos, fn); + ret = do_loop_readv_writev(file, &iter, pos, fn, flags); if (type != READ) file_end_write(file); @@ -1010,7 +1018,7 @@ static size_t compat_readv(struct file *file, if (!(file->f_mode & FMODE_CAN_READ)) goto out; - ret = compat_do_readv_writev(READ, file, vec, vlen, pos); + ret = compat_do_readv_writev(READ, file, vec, vlen, pos, 0); out: if (ret > 0) @@ -1087,7 +1095,7 @@ static size_t compat_writev(struct file *file, if (!(file->f_mode & FMODE_CAN_WRITE)) goto out; - ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos); + ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos, 0); out: if (ret > 0) diff --git a/fs/splice.c b/fs/splice.c index 801c21c..f357bc0 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -579,7 +579,7 @@ static ssize_t kernel_readv(struct file *file, const struct iovec *vec, old_fs = get_fs(); set_fs(get_ds()); /* The cast to a user pointer is valid due to the set_fs() */ - res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos); + res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0); set_fs(old_fs); return res; diff --git a/include/linux/fs.h b/include/linux/fs.h index 3aa5142..2b0e078 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1677,9 +1677,9 @@ extern ssize_t __vfs_write(struct file *, const char __user *, size_t, loff_t *) extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *); extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *); extern ssize_t vfs_readv(struct file *, const struct iovec __user *, - unsigned long, loff_t *); + unsigned 
long, loff_t *, int); extern ssize_t vfs_writev(struct file *, const struct iovec __user *, - unsigned long, loff_t *); + unsigned long, loff_t *, int); struct super_operations { struct inode *(*alloc_inode)(struct super_block *sb); -- 1.9.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 2/6] vfs: vfs: Define new syscalls preadv2,pwritev2 [not found] ` <1450966464-6847-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org> 2015-12-24 14:14 ` [PATCH 1/6] vfs: pass a flags argument to vfs_readv/vfs_writev Christoph Hellwig @ 2015-12-24 14:14 ` Christoph Hellwig 2015-12-24 14:14 ` [PATCH 3/6] x86: wire up preadv2 and pwritev2 Christoph Hellwig 2016-01-04 14:58 ` selective block polling and preadv2/pwritev2 revisited Sagi Grimberg 3 siblings, 0 replies; 15+ messages in thread From: Christoph Hellwig @ 2015-12-24 14:14 UTC (permalink / raw) To: viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, axboe-b10kYP2dOMg Cc: milosz-B5zB6C1i6pkAvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-block-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA From: Milosz Tanski <milosz-B5zB6C1i6pkAvxtiuMwx3w@public.gmane.org> New syscalls that take an flag argument. This change does not add any specific flags. Signed-off-by: Milosz Tanski <milosz-B5zB6C1i6pkAvxtiuMwx3w@public.gmane.org> [hch: rebased on top of my kiocb changes] Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> --- fs/read_write.c | 162 +++++++++++++++++++++++++++++++++++++---------- include/linux/compat.h | 6 ++ include/linux/syscalls.h | 6 ++ 3 files changed, 139 insertions(+), 35 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index 34a2920..caa30ac 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -856,15 +856,15 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec, EXPORT_SYMBOL(vfs_writev); -SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec, - unsigned long, vlen) +static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec, + unsigned long vlen, int flags) { struct fd f = fdget_pos(fd); ssize_t ret = -EBADF; if (f.file) { loff_t pos = file_pos_read(f.file); - ret = vfs_readv(f.file, vec, vlen, &pos, 0); + ret = vfs_readv(f.file, vec, vlen, &pos, flags); if (ret >= 0) file_pos_write(f.file, pos); 
fdput_pos(f); @@ -876,15 +876,15 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec, return ret; } -SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec, - unsigned long, vlen) +static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec, + unsigned long vlen, int flags) { struct fd f = fdget_pos(fd); ssize_t ret = -EBADF; if (f.file) { loff_t pos = file_pos_read(f.file); - ret = vfs_writev(f.file, vec, vlen, &pos, 0); + ret = vfs_writev(f.file, vec, vlen, &pos, flags); if (ret >= 0) file_pos_write(f.file, pos); fdput_pos(f); @@ -902,10 +902,9 @@ static inline loff_t pos_from_hilo(unsigned long high, unsigned long low) return (((loff_t)high << HALF_LONG_BITS) << HALF_LONG_BITS) | low; } -SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec, - unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h) +static ssize_t do_preadv(unsigned long fd, const struct iovec __user *vec, + unsigned long vlen, loff_t pos, int flags) { - loff_t pos = pos_from_hilo(pos_h, pos_l); struct fd f; ssize_t ret = -EBADF; @@ -916,7 +915,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec, if (f.file) { ret = -ESPIPE; if (f.file->f_mode & FMODE_PREAD) - ret = vfs_readv(f.file, vec, vlen, &pos, 0); + ret = vfs_readv(f.file, vec, vlen, &pos, flags); fdput(f); } @@ -926,10 +925,9 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec, return ret; } -SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec, - unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h) +static ssize_t do_pwritev(unsigned long fd, const struct iovec __user *vec, + unsigned long vlen, loff_t pos, int flags) { - loff_t pos = pos_from_hilo(pos_h, pos_l); struct fd f; ssize_t ret = -EBADF; @@ -940,7 +938,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec, if (f.file) { ret = -ESPIPE; if (f.file->f_mode & FMODE_PWRITE) 
- ret = vfs_writev(f.file, vec, vlen, &pos, 0); + ret = vfs_writev(f.file, vec, vlen, &pos, flags); fdput(f); } @@ -950,6 +948,58 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec, return ret; } +SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec, + unsigned long, vlen) +{ + return do_readv(fd, vec, vlen, 0); +} + +SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec, + unsigned long, vlen) +{ + return do_writev(fd, vec, vlen, 0); +} + +SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec, + unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h) +{ + loff_t pos = pos_from_hilo(pos_h, pos_l); + + return do_preadv(fd, vec, vlen, pos, 0); +} + +SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec, + unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h, + int, flags) +{ + loff_t pos = pos_from_hilo(pos_h, pos_l); + + if (pos == -1) + return do_readv(fd, vec, vlen, flags); + + return do_preadv(fd, vec, vlen, pos, flags); +} + +SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec, + unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h) +{ + loff_t pos = pos_from_hilo(pos_h, pos_l); + + return do_pwritev(fd, vec, vlen, pos, 0); +} + +SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec, + unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h, + int, flags) +{ + loff_t pos = pos_from_hilo(pos_h, pos_l); + + if (pos == -1) + return do_writev(fd, vec, vlen, flags); + + return do_pwritev(fd, vec, vlen, pos, flags); +} + #ifdef CONFIG_COMPAT static ssize_t compat_do_readv_writev(int type, struct file *file, @@ -1007,7 +1057,7 @@ out: static size_t compat_readv(struct file *file, const struct compat_iovec __user *vec, - unsigned long vlen, loff_t *pos) + unsigned long vlen, loff_t *pos, int flags) { ssize_t ret = -EBADF; @@ -1018,7 +1068,7 @@ static size_t 
compat_readv(struct file *file, if (!(file->f_mode & FMODE_CAN_READ)) goto out; - ret = compat_do_readv_writev(READ, file, vec, vlen, pos, 0); + ret = compat_do_readv_writev(READ, file, vec, vlen, pos, flags); out: if (ret > 0) @@ -1027,9 +1077,9 @@ out: return ret; } -COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd, - const struct compat_iovec __user *,vec, - compat_ulong_t, vlen) +static size_t do_compat_readv(compat_ulong_t fd, + const struct compat_iovec __user *vec, + compat_ulong_t vlen, int flags) { struct fd f = fdget_pos(fd); ssize_t ret; @@ -1038,16 +1088,24 @@ COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd, if (!f.file) return -EBADF; pos = f.file->f_pos; - ret = compat_readv(f.file, vec, vlen, &pos); + ret = compat_readv(f.file, vec, vlen, &pos, flags); if (ret >= 0) f.file->f_pos = pos; fdput_pos(f); return ret; + +} + +COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd, + const struct compat_iovec __user *,vec, + compat_ulong_t, vlen) +{ + return do_compat_readv(fd, vec, vlen, 0); } -static long __compat_sys_preadv64(unsigned long fd, +static long do_compat_preadv64(unsigned long fd, const struct compat_iovec __user *vec, - unsigned long vlen, loff_t pos) + unsigned long vlen, loff_t pos, int flags) { struct fd f; ssize_t ret; @@ -1059,7 +1117,7 @@ static long __compat_sys_preadv64(unsigned long fd, return -EBADF; ret = -ESPIPE; if (f.file->f_mode & FMODE_PREAD) - ret = compat_readv(f.file, vec, vlen, &pos); + ret = compat_readv(f.file, vec, vlen, &pos, flags); fdput(f); return ret; } @@ -1069,7 +1127,7 @@ COMPAT_SYSCALL_DEFINE4(preadv64, unsigned long, fd, const struct compat_iovec __user *,vec, unsigned long, vlen, loff_t, pos) { - return __compat_sys_preadv64(fd, vec, vlen, pos); + return do_compat_preadv64(fd, vec, vlen, pos, 0); } #endif @@ -1079,12 +1137,25 @@ COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd, { loff_t pos = ((loff_t)pos_high << 32) | pos_low; - return __compat_sys_preadv64(fd, vec, vlen, pos); + return do_compat_preadv64(fd, 
vec, vlen, pos, 0); +} + +COMPAT_SYSCALL_DEFINE6(preadv2, compat_ulong_t, fd, + const struct compat_iovec __user *,vec, + compat_ulong_t, vlen, u32, pos_low, u32, pos_high, + int, flags) +{ + loff_t pos = ((loff_t)pos_high << 32) | pos_low; + + if (pos == -1) + return do_compat_readv(fd, vec, vlen, flags); + + return do_compat_preadv64(fd, vec, vlen, pos, flags); } static size_t compat_writev(struct file *file, const struct compat_iovec __user *vec, - unsigned long vlen, loff_t *pos) + unsigned long vlen, loff_t *pos, int flags) { ssize_t ret = -EBADF; @@ -1104,9 +1175,9 @@ out: return ret; } -COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd, - const struct compat_iovec __user *, vec, - compat_ulong_t, vlen) +static size_t do_compat_writev(compat_ulong_t fd, + const struct compat_iovec __user* vec, + compat_ulong_t vlen, int flags) { struct fd f = fdget_pos(fd); ssize_t ret; @@ -1115,28 +1186,36 @@ COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd, if (!f.file) return -EBADF; pos = f.file->f_pos; - ret = compat_writev(f.file, vec, vlen, &pos); + ret = compat_writev(f.file, vec, vlen, &pos, flags); if (ret >= 0) f.file->f_pos = pos; fdput_pos(f); return ret; } -static long __compat_sys_pwritev64(unsigned long fd, +COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd, + const struct compat_iovec __user *, vec, + compat_ulong_t, vlen) +{ + return do_compat_writev(fd, vec, vlen, 0); +} + +static long do_compat_pwritev64(unsigned long fd, const struct compat_iovec __user *vec, - unsigned long vlen, loff_t pos) + unsigned long vlen, loff_t pos, int flags) { struct fd f; ssize_t ret; if (pos < 0) return -EINVAL; + f = fdget(fd); if (!f.file) return -EBADF; ret = -ESPIPE; if (f.file->f_mode & FMODE_PWRITE) - ret = compat_writev(f.file, vec, vlen, &pos); + ret = compat_writev(f.file, vec, vlen, &pos, flags); fdput(f); return ret; } @@ -1146,7 +1225,7 @@ COMPAT_SYSCALL_DEFINE4(pwritev64, unsigned long, fd, const struct compat_iovec __user *,vec, unsigned long, vlen, loff_t, 
pos) { - return __compat_sys_pwritev64(fd, vec, vlen, pos); + return do_compat_pwritev64(fd, vec, vlen, pos, 0); } #endif @@ -1156,8 +1235,21 @@ COMPAT_SYSCALL_DEFINE5(pwritev, compat_ulong_t, fd, { loff_t pos = ((loff_t)pos_high << 32) | pos_low; - return __compat_sys_pwritev64(fd, vec, vlen, pos); + return do_compat_pwritev64(fd, vec, vlen, pos, 0); +} + +COMPAT_SYSCALL_DEFINE6(pwritev2, compat_ulong_t, fd, + const struct compat_iovec __user *,vec, + compat_ulong_t, vlen, u32, pos_low, u32, pos_high, int, flags) +{ + loff_t pos = ((loff_t)pos_high << 32) | pos_low; + + if (pos == -1) + return do_compat_writev(fd, vec, vlen, flags); + + return do_compat_pwritev64(fd, vec, vlen, pos, flags); } + #endif static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, diff --git a/include/linux/compat.h b/include/linux/compat.h index a76c917..fe4ccd0 100644 --- a/include/linux/compat.h +++ b/include/linux/compat.h @@ -340,6 +340,12 @@ asmlinkage ssize_t compat_sys_preadv(compat_ulong_t fd, asmlinkage ssize_t compat_sys_pwritev(compat_ulong_t fd, const struct compat_iovec __user *vec, compat_ulong_t vlen, u32 pos_low, u32 pos_high); +asmlinkage ssize_t compat_sys_preadv2(compat_ulong_t fd, + const struct compat_iovec __user *vec, + compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags); +asmlinkage ssize_t compat_sys_pwritev2(compat_ulong_t fd, + const struct compat_iovec __user *vec, + compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags); #ifdef __ARCH_WANT_COMPAT_SYS_PREADV64 asmlinkage long compat_sys_preadv64(unsigned long fd, diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index a156b82..c4fac0d 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -575,8 +575,14 @@ asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf, size_t count, loff_t pos); asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec, unsigned long vlen, unsigned long pos_l, unsigned long pos_h); +asmlinkage long 
sys_preadv2(unsigned long fd, const struct iovec __user *vec, + unsigned long vlen, unsigned long pos_l, unsigned long pos_h, + int flags); asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec, unsigned long vlen, unsigned long pos_l, unsigned long pos_h); +asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec, + unsigned long vlen, unsigned long pos_l, unsigned long pos_h, + int flags); asmlinkage long sys_getcwd(char __user *buf, unsigned long size); asmlinkage long sys_mkdir(const char __user *pathname, umode_t mode); asmlinkage long sys_chdir(const char __user *filename); -- 1.9.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
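The offset handling in patch 2/6 can be modelled in user space. The following is a minimal sketch, not the kernel code: `long long` stands in for `loff_t`, the shifts are done on an unsigned type to stay within defined C behaviour, and `uses_file_position` is a hypothetical name for the `pos == -1` check that makes preadv2/pwritev2 fall back to the readv/writev path.

```c
#include <limits.h>
#include <stdbool.h>

/* Half the width of an unsigned long, mirroring HALF_LONG_BITS in fs/read_write.c. */
#define HALF_LONG_BITS ((int)(sizeof(unsigned long) * CHAR_BIT / 2))

/*
 * Rebuild the file offset from the pos_h/pos_l syscall arguments.
 * The shift is split in two so that each step stays below the type width.
 * On a 64-bit build the two shifts discard `high` entirely, so `low`
 * alone carries the offset; on a 32-bit build `high` holds the upper half.
 */
static long long pos_from_hilo(unsigned long high, unsigned long low)
{
    unsigned long long pos;

    pos = (((unsigned long long)high << HALF_LONG_BITS) << HALF_LONG_BITS) | low;
    return (long long)pos;
}

/* Hypothetical helper: an offset of -1 means "use and update the file position". */
static bool uses_file_position(long long pos)
{
    return pos == -1;
}
```

With this model, passing all-ones in both `pos_h` and `pos_l` yields -1 on 32-bit and 64-bit builds alike, which is the value the new syscalls treat as a request for readv/writev semantics.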
* [PATCH 3/6] x86: wire up preadv2 and pwritev2 [not found] ` <1450966464-6847-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org> 2015-12-24 14:14 ` [PATCH 1/6] vfs: pass a flags argument to vfs_readv/vfs_writev Christoph Hellwig 2015-12-24 14:14 ` [PATCH 2/6] vfs: vfs: Define new syscalls preadv2,pwritev2 Christoph Hellwig @ 2015-12-24 14:14 ` Christoph Hellwig 2016-01-04 14:58 ` selective block polling and preadv2/pwritev2 revisited Sagi Grimberg 3 siblings, 0 replies; 15+ messages in thread From: Christoph Hellwig @ 2015-12-24 14:14 UTC (permalink / raw) To: viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, axboe-b10kYP2dOMg Cc: milosz-B5zB6C1i6pkAvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-block-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA From: Milosz Tanski <milosz-B5zB6C1i6pkAvxtiuMwx3w@public.gmane.org> Signed-off-by: Milosz Tanski <milosz-B5zB6C1i6pkAvxtiuMwx3w@public.gmane.org> [hch: rebased due to newly added syscalls] Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> --- arch/x86/entry/syscalls/syscall_32.tbl | 2 ++ arch/x86/entry/syscalls/syscall_64.tbl | 2 ++ 2 files changed, 4 insertions(+) diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index f17705e..13e33ced 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -383,3 +383,5 @@ 374 i386 userfaultfd sys_userfaultfd 375 i386 membarrier sys_membarrier 376 i386 mlock2 sys_mlock2 +377 i386 preadv2 sys_preadv2 +378 i386 pwritev2 sys_pwritev2 diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 314a90b..2108dae 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -332,6 +332,8 @@ 323 common userfaultfd sys_userfaultfd 324 common membarrier sys_membarrier 325 common mlock2 sys_mlock2 +326 64 preadv2 sys_preadv2 +327 64 pwritev2 sys_pwritev2 # # x32-specific system call numbers start at 512 to 
avoid cache impact -- 1.9.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: selective block polling and preadv2/pwritev2 revisited [not found] ` <1450966464-6847-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org> ` (2 preceding siblings ...) 2015-12-24 14:14 ` [PATCH 3/6] x86: wire up preadv2 and pwritev2 Christoph Hellwig @ 2016-01-04 14:58 ` Sagi Grimberg [not found] ` <568A889E.4020204-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> 3 siblings, 1 reply; 15+ messages in thread From: Sagi Grimberg @ 2016-01-04 14:58 UTC (permalink / raw) To: Christoph Hellwig Cc: viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, axboe-b10kYP2dOMg, milosz-B5zB6C1i6pkAvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-block-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA Hi Christoph, > Note that there are plenty of other use cases for preadv2/pwritev2 as well, > but I'd like to concentrate on this one for now. Example are: non-blocking > reads (the original purpose), per-I/O O_SYNC, user space support for T10 > DIF/DIX applications tags and probably some more. So I'm trying to understand how can integrity metadata be used here. Will the user-app append the meta-data to the data iovec (given there is no metadata iovec)? If so, how will we separate data from metadata? Cheers, Sagi. ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <568A889E.4020204-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>]
* Re: selective block polling and preadv2/pwritev2 revisited [not found] ` <568A889E.4020204-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> @ 2016-01-04 16:39 ` Christoph Hellwig [not found] ` <20160104163949.GA17409-jcswGhMUV9g@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Christoph Hellwig @ 2016-01-04 16:39 UTC (permalink / raw) To: Sagi Grimberg Cc: Christoph Hellwig, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, axboe-b10kYP2dOMg, milosz-B5zB6C1i6pkAvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-block-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA On Mon, Jan 04, 2016 at 04:58:38PM +0200, Sagi Grimberg wrote: > Hi Christoph, > >> Note that there are plenty of other use cases for preadv2/pwritev2 as well, >> but I'd like to concentrate on this one for now. Example are: non-blocking >> reads (the original purpose), per-I/O O_SYNC, user space support for T10 >> DIF/DIX applications tags and probably some more. > > So I'm trying to understand how can integrity metadata be used here. > Will the user-app append the meta-data to the data iovec (given there > is no metadata iovec)? If so, how will we separate data from metadata? The idea that was floated around a few times was to have a flag where the first half of the vectors would be the data, and the second half the metadata. ^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <20160104163949.GA17409-jcswGhMUV9g@public.gmane.org>]
* Re: selective block polling and preadv2/pwritev2 revisited [not found] ` <20160104163949.GA17409-jcswGhMUV9g@public.gmane.org> @ 2016-01-06 17:01 ` Sagi Grimberg 2016-01-06 22:49 ` Martin K. Petersen 0 siblings, 1 reply; 15+ messages in thread From: Sagi Grimberg @ 2016-01-06 17:01 UTC (permalink / raw) To: Christoph Hellwig Cc: viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, axboe-b10kYP2dOMg, milosz-B5zB6C1i6pkAvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-block-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA >> Hi Christoph, >> >>> Note that there are plenty of other use cases for preadv2/pwritev2 as well, >>> but I'd like to concentrate on this one for now. Example are: non-blocking >>> reads (the original purpose), per-I/O O_SYNC, user space support for T10 >>> DIF/DIX applications tags and probably some more. >> >> So I'm trying to understand how can integrity metadata be used here. >> Will the user-app append the meta-data to the data iovec (given there >> is no metadata iovec)? If so, how will we separate data from metadata? > > The idea that was floated around a few times was to have a flag where > the first half of the vectors would be the data, and the second half > the metadata. This means that the user would need to pass iovec entries of 8 bytes, correct? Seems like a waste for large IOs (sorry for diverging from the subject) ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: selective block polling and preadv2/pwritev2 revisited 2016-01-06 17:01 ` Sagi Grimberg @ 2016-01-06 22:49 ` Martin K. Petersen [not found] ` <yq1h9iqiary.fsf-+q57XtR/GgMb6DWv4sQWN6xOck334EZe@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Martin K. Petersen @ 2016-01-06 22:49 UTC (permalink / raw) To: Sagi Grimberg Cc: Christoph Hellwig, viro, axboe, milosz, linux-fsdevel, linux-block, linux-api >>>>> "Sagi" == Sagi Grimberg <sagig@dev.mellanox.co.il> writes: >> The idea that was floated around a few times was to have a flag where >> the first half of the vectors would be the data, and the second half >> the metadata. Sagi> This means that the user would need to pass iovec entries of 8 Sagi> bytes correct? Seems like a waste for large IOs (sorry for Sagi> diverging from the subject) The assumption was that there would be a 1:1 mapping between the number of data buffers and the metadata ditto. But nothing says that a data iovec entry is limited in size to a single sector. The other option is to have a single iovec at the end representing the metadata for all data buffers. I think there are valid use cases for either approach and we may end up having to support both via a flag. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 15+ messages in thread
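The two iovec layouts discussed above can be sketched as follows. This is purely illustrative: no such interface exists in this series, the function names `layout_paired` and `layout_trailing` are hypothetical, and the sizes assume 512-byte sectors with 8 bytes of T10 protection information per sector. It shows that with a 1:1 mapping a metadata entry scales with its data entry rather than being fixed at 8 bytes.

```c
#include <stddef.h>
#include <sys/uio.h>

#define SECTOR_SIZE   512u /* assumed logical block size */
#define PI_TUPLE_SIZE 8u   /* assumed protection information bytes per sector */

/* Metadata bytes needed for data_len bytes (data_len a multiple of SECTOR_SIZE). */
static size_t pi_bytes(size_t data_len)
{
    return data_len / SECTOR_SIZE * PI_TUPLE_SIZE;
}

/*
 * Layout A: n data entries followed by n metadata entries (1:1 mapping).
 * `meta` is one buffer carved up in order. Returns the entry count, 2 * n.
 */
static size_t layout_paired(struct iovec *out, const struct iovec *data,
                            char *meta, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        out[i] = data[i];
        out[n + i].iov_base = meta;
        out[n + i].iov_len = pi_bytes(data[i].iov_len);
        meta += out[n + i].iov_len;
    }
    return 2 * n;
}

/*
 * Layout B: n data entries followed by a single metadata entry that
 * covers all of them. Returns the entry count, n + 1.
 */
static size_t layout_trailing(struct iovec *out, const struct iovec *data,
                              char *meta, size_t n)
{
    size_t i, total = 0;

    for (i = 0; i < n; i++) {
        out[i] = data[i];
        total += data[i].iov_len;
    }
    out[n].iov_base = meta;
    out[n].iov_len = pi_bytes(total);
    return n + 1;
}
```

For a 4096-byte and an 8192-byte data buffer, layout A would carry metadata entries of 64 and 128 bytes, while layout B would carry one 192-byte entry. Either way the kernel would need a flag to know which convention applies.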
[parent not found: <yq1h9iqiary.fsf-+q57XtR/GgMb6DWv4sQWN6xOck334EZe@public.gmane.org>]
* Re: selective block polling and preadv2/pwritev2 revisited [not found] ` <yq1h9iqiary.fsf-+q57XtR/GgMb6DWv4sQWN6xOck334EZe@public.gmane.org> @ 2016-01-07 14:41 ` Sagi Grimberg 0 siblings, 0 replies; 15+ messages in thread From: Sagi Grimberg @ 2016-01-07 14:41 UTC (permalink / raw) To: Martin K. Petersen Cc: Christoph Hellwig, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, axboe-b10kYP2dOMg, milosz-B5zB6C1i6pkAvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-block-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA Hi Martin, > Sagi> This means that the user would need to pass iovec entries of 8 > Sagi> bytes correct? Seems like a waste for large IOs (sorry for > Sagi> diverging from the subject) > > The assumption was that there would be a 1:1 mapping between the number > of data buffers and the metadata ditto. But nothing says that a data > iovec entry is limited in size to a single sector. Yea... I meant 1:1, I got confused on the 8 bytes comment... > The other option to have a single iovec at the end representing the > metadata for all data buffers. I think there are valid use cases for > either approach and we may end up having to support both via a flag. Either approach has its limitations, but I guess user space can deal with them... ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 4/6] vfs: add the RWF_HIPRI flag for preadv2/pwritev2
  2015-12-24 14:14 selective block polling and preadv2/pwritev2 revisited Christoph Hellwig
       [not found] ` <1450966464-6847-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
@ 2015-12-24 14:14 ` Christoph Hellwig
  2015-12-24 14:14 ` [PATCH 5/6] direct-io: only use block polling if explicitly requested Christoph Hellwig
  2015-12-24 14:14 ` [PATCH 6/6] blk-mq: enable polling support by default Christoph Hellwig
  3 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2015-12-24 14:14 UTC
  To: viro, axboe; +Cc: milosz, linux-fsdevel, linux-block, linux-api

This adds a flag that tells the file system that this is a high priority
request for which it's worth to poll the hardware.  The flag is purely
advisory and can be ignored if not supported.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/read_write.c         | 6 ++++--
 include/linux/fs.h      | 1 +
 include/uapi/linux/fs.h | 3 +++
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index caa30ac..4dc377e 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -658,10 +658,12 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 	struct kiocb kiocb;
 	ssize_t ret;
 
-	if (flags)
+	if (flags & ~RWF_HIPRI)
 		return -EOPNOTSUPP;
 
 	init_sync_kiocb(&kiocb, filp);
+	if (flags & RWF_HIPRI)
+		kiocb.ki_flags |= IOCB_HIPRI;
 	kiocb.ki_pos = *ppos;
 
 	ret = fn(&kiocb, iter);
@@ -676,7 +678,7 @@ static ssize_t do_loop_readv_writev(struct file *filp, struct iov_iter *iter,
 {
 	ssize_t ret = 0;
 
-	if (flags)
+	if (flags & ~RWF_HIPRI)
 		return -EOPNOTSUPP;
 
 	while (iov_iter_count(iter)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2b0e078..0247620 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -319,6 +319,7 @@ struct writeback_control;
 #define IOCB_EVENTFD		(1 << 0)
 #define IOCB_APPEND		(1 << 1)
 #define IOCB_DIRECT		(1 << 2)
+#define IOCB_HIPRI		(1 << 3)
 
 struct kiocb {
 	struct file		*ki_filp;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index f15d980..42f7627 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -208,4 +208,7 @@ struct inodes_stat_t {
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/* flags for preadv2/pwritev2: */
+#define RWF_HIPRI			0x00000001 /* high priority request, poll if possible */
+
 #endif /* _UAPI_LINUX_FS_H */
-- 
1.9.1
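The flag handling in this patch can be modelled in user space. The following is only a sketch, not kernel code: `rwf_to_iocb_flags` is a hypothetical helper name, and the two constants are copied from the patch so the example builds on its own. It shows the two steps the patch performs: reject any flag bit other than RWF_HIPRI, then translate the user-visible flag into the kiocb flag.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Values copied from the patch so the sketch is self-contained. */
#define RWF_HIPRI   0x00000001  /* per-call flag passed to preadv2/pwritev2 */
#define IOCB_HIPRI  (1 << 3)    /* in-kernel kiocb flag */

/*
 * Hypothetical helper mirroring the check in do_iter_readv_writev():
 * any bit other than RWF_HIPRI is rejected, and RWF_HIPRI is translated
 * into IOCB_HIPRI.  Returns 0 on success, -EOPNOTSUPP for unknown bits.
 */
static int rwf_to_iocb_flags(int flags, int *ki_flags)
{
    assert(ki_flags != NULL);

    if (flags & ~RWF_HIPRI)
        return -EOPNOTSUPP;
    if (flags & RWF_HIPRI)
        *ki_flags |= IOCB_HIPRI;
    return 0;
}
```

Masking with `~RWF_HIPRI` rather than testing `flags` for zero is what lets later patches add further RWF_* bits one at a time while unknown bits keep failing with EOPNOTSUPP.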
* [PATCH 5/6] direct-io: only use block polling if explicitly requested
  2015-12-24 14:14 selective block polling and preadv2/pwritev2 revisited Christoph Hellwig
       [not found] ` <1450966464-6847-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
  2015-12-24 14:14 ` [PATCH 4/6] vfs: add the RWF_HIPRI flag for preadv2/pwritev2 Christoph Hellwig
@ 2015-12-24 14:14 ` Christoph Hellwig
  2015-12-24 14:14 ` [PATCH 6/6] blk-mq: enable polling support by default Christoph Hellwig
  3 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2015-12-24 14:14 UTC
  To: viro, axboe; +Cc: milosz, linux-fsdevel, linux-block, linux-api

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/direct-io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index cb5337d..904ff7f 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -445,7 +445,8 @@ static struct bio *dio_await_one(struct dio *dio)
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		dio->waiter = current;
 		spin_unlock_irqrestore(&dio->bio_lock, flags);
-		if (!blk_poll(bdev_get_queue(dio->bio_bdev), dio->bio_cookie))
+		if (!(dio->iocb->ki_flags & IOCB_HIPRI) ||
+		    !blk_poll(bdev_get_queue(dio->bio_bdev), dio->bio_cookie))
 			io_schedule();
 		/* wake up sets us TASK_RUNNING */
 		spin_lock_irqsave(&dio->bio_lock, flags);
-- 
1.9.1
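The changed condition in dio_await_one() relies on short-circuit evaluation: without IOCB_HIPRI the poll routine is never called and the waiter goes straight to io_schedule(). A user-space model makes this visible. Everything below is a sketch under assumptions: `fake_blk_poll`, `must_sleep`, `check_cases` and the `poll_calls` counter are hypothetical stand-ins, and `fake_blk_poll` simply returns what it is told rather than touching hardware.

```c
#include <assert.h>
#include <stdbool.h>

#define IOCB_HIPRI (1 << 3)  /* value copied from patch 4/6 */

/* Counts poll attempts so the short-circuit is observable. */
static int poll_calls;

/* Stand-in for blk_poll(): reports whether polling found the completion. */
static bool fake_blk_poll(bool completes)
{
    poll_calls++;
    return completes;
}

/*
 * Mirrors the condition in dio_await_one(): returns true when the waiter
 * has to fall back to io_schedule() and sleep until the I/O completes.
 */
static bool must_sleep(int ki_flags, bool poll_completes)
{
    return !(ki_flags & IOCB_HIPRI) || !fake_blk_poll(poll_completes);
}

/* Walks through the three cases the condition distinguishes. */
static void check_cases(void)
{
    poll_calls = 0;

    /* No HIPRI: sleep immediately, poll routine never invoked. */
    assert(must_sleep(0, true));
    assert(poll_calls == 0);

    /* HIPRI and polling finds the completion: no sleep. */
    assert(!must_sleep(IOCB_HIPRI, true));
    assert(poll_calls == 1);

    /* HIPRI but polling gives up: fall back to sleeping. */
    assert(must_sleep(IOCB_HIPRI, false));
    assert(poll_calls == 2);
}
```

The third case is why the flag can stay advisory: a failed or unsupported poll degrades to the ordinary interrupt-driven wait rather than to an error.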
* [PATCH 6/6] blk-mq: enable polling support by default
  2015-12-24 14:14 selective block polling and preadv2/pwritev2 revisited Christoph Hellwig
                   ` (2 preceding siblings ...)
  2015-12-24 14:14 ` [PATCH 5/6] direct-io: only use block polling if explicitly requested Christoph Hellwig
@ 2015-12-24 14:14 ` Christoph Hellwig
  3 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2015-12-24 14:14 UTC
  To: viro, axboe; +Cc: milosz, linux-fsdevel, linux-block, linux-api

Now that applications need to explicitly ask for polling we can enable it
by default in blk-mq drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/blkdev.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e711f29..1b73222 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -496,7 +496,8 @@ struct request_queue {
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
-				 (1 << QUEUE_FLAG_SAME_COMP))
+				 (1 << QUEUE_FLAG_SAME_COMP)	|	\
+				 (1 << QUEUE_FLAG_POLL))
 
 static inline void queue_lockdep_assert_held(struct request_queue *q)
 {
-- 
1.9.1
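Taken together with patch 5/6, this change gives polling two gates: a queue-level default and a per-I/O opt-in. The sketch below models that in user space. The bit positions in the enum are illustrative assumptions rather than values quoted from the thread, `queue_flag_is_set` and `may_poll` are hypothetical helpers, and the premise that the poll path consults QUEUE_FLAG_POLL is inferred from the commit message rather than shown in these patches.

```c
#include <assert.h>
#include <stdbool.h>

#define IOCB_HIPRI (1 << 3)  /* value copied from patch 4/6 */

/* Bit positions within the queue flags word; numbers are illustrative only. */
enum {
    QUEUE_FLAG_STACKABLE = 4,
    QUEUE_FLAG_SAME_COMP = 9,
    QUEUE_FLAG_IO_STAT   = 15,
    QUEUE_FLAG_POLL      = 22,
};

/* Default mask for blk-mq queues after this patch: POLL is now included. */
#define QUEUE_FLAG_MQ_DEFAULT  ((1UL << QUEUE_FLAG_IO_STAT)   | \
                                (1UL << QUEUE_FLAG_STACKABLE) | \
                                (1UL << QUEUE_FLAG_SAME_COMP) | \
                                (1UL << QUEUE_FLAG_POLL))

/* Tests one bit position in a queue flags word. */
static bool queue_flag_is_set(unsigned long queue_flags, int bit)
{
    assert(bit >= 0 && bit < 32);
    return (queue_flags & (1UL << bit)) != 0;
}

/* Polling is attempted only if the queue allows it AND the I/O asked for it. */
static bool may_poll(unsigned long queue_flags, int ki_flags)
{
    return queue_flag_is_set(queue_flags, QUEUE_FLAG_POLL) &&
           (ki_flags & IOCB_HIPRI) != 0;
}
```

With the default mask, an I/O without IOCB_HIPRI still never polls, which is why turning the queue flag on by default does not change behaviour for existing applications.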
* generic RDMA READ/WRITE API V2
@ 2016-03-03 15:03 Christoph Hellwig
       [not found] ` <1457017443-17662-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2016-03-03 15:03 UTC
  To: viro, axboe; +Cc: milosz, linux-fsdevel, linux-block, linux-api

This series contains patches that implement a first version of a generic
API to handle RDMA READ/WRITE operations as commonly used on the target
(or server) side for storage protocols.

This has been developed for the upcoming NVMe over Fabrics target, and
extensively tested as part of that, although this upstream version has
additional updates over the one we're currently using.

It hides details such as the use of MRs for iWarp devices, and will allow
looking at other HCA specifics easily in the future.

This series contains a conversion of the SRP target, and the git tree
below also has an RFC conversion of the iSER target (a little hacky due to
the signature MR support which I can't test).

I also have a git tree available at:

	git://git.infradead.org/users/hch/rdma.git rdma-rw-api

Gitweb:

	http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/rdma-rw-api

These two also include the RFC iSER target conversion.

Changes since V2:
 - fold the list_del in mr_pool_get into the right patch
 - clamp the max FR page size length
 - minor srpt style fix
 - spelling fixes

Changes since V1:
 - fixed offset handling in ib_sg_to_pages
 - uses proper SG iterators to handle larger than PAGE_SIZE segments
 - adjusted parameters for some functions to reduce size of the context
 - SRP target support
[parent not found: <1457017443-17662-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>]
* [PATCH 4/6] vfs: add the RWF_HIPRI flag for preadv2/pwritev2
       [not found] ` <1457017443-17662-1-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
@ 2016-03-03 15:04   ` Christoph Hellwig
       [not found]     ` <1457017443-17662-5-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2016-03-03 15:04 UTC
  To: viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, axboe-b10kYP2dOMg
  Cc: milosz-B5zB6C1i6pkAvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-block-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA

This adds a flag that tells the file system that this is a high priority
request for which it's worth to poll the hardware.  The flag is purely
advisory and can be ignored if not supported.

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Reviewed-by: Stephen Bates <stephen.bates-PwyqCcigF0Q@public.gmane.org>
Tested-by: Stephen Bates <stephen.bates-PwyqCcigF0Q@public.gmane.org>
Acked-by: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 fs/read_write.c         | 6 ++++--
 include/linux/fs.h      | 1 +
 include/uapi/linux/fs.h | 3 +++
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 799d25f..cf377cf 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -698,10 +698,12 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 	struct kiocb kiocb;
 	ssize_t ret;
 
-	if (flags)
+	if (flags & ~RWF_HIPRI)
 		return -EOPNOTSUPP;
 
 	init_sync_kiocb(&kiocb, filp);
+	if (flags & RWF_HIPRI)
+		kiocb.ki_flags |= IOCB_HIPRI;
 	kiocb.ki_pos = *ppos;
 
 	ret = fn(&kiocb, iter);
@@ -716,7 +718,7 @@ static ssize_t do_loop_readv_writev(struct file *filp, struct iov_iter *iter,
 {
 	ssize_t ret = 0;
 
-	if (flags)
+	if (flags & ~RWF_HIPRI)
 		return -EOPNOTSUPP;
 
 	while (iov_iter_count(iter)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 875277a..a1f731c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -320,6 +320,7 @@ struct writeback_control;
 #define IOCB_EVENTFD		(1 << 0)
 #define IOCB_APPEND		(1 << 1)
 #define IOCB_DIRECT		(1 << 2)
+#define IOCB_HIPRI		(1 << 3)
 
 struct kiocb {
 	struct file		*ki_filp;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 149bec8..d246339 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -304,4 +304,7 @@ struct fsxattr {
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/* flags for preadv2/pwritev2: */
+#define RWF_HIPRI			0x00000001 /* high priority request, poll if possible */
+
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.1.4
[parent not found: <1457017443-17662-5-git-send-email-hch-jcswGhMUV9g@public.gmane.org>]
* Re: [PATCH 4/6] vfs: add the RWF_HIPRI flag for preadv2/pwritev2
       [not found]     ` <1457017443-17662-5-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
@ 2016-05-08 21:47       ` NeilBrown
       [not found]         ` <874ma8usrr.fsf-wvvUuzkyo1HefUI2i7LXDhCRmIWqnp/j@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2016-05-08 21:47 UTC
  To: Christoph Hellwig, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, axboe-b10kYP2dOMg
  Cc: milosz-B5zB6C1i6pkAvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-block-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA

On Fri, Mar 04 2016, Christoph Hellwig wrote:

> This adds a flag that tells the file system that this is a high priority
> request for which it's worth to poll the hardware.  The flag is purely
> advisory and can be ignored if not supported.

Here you say the flag is "advice".

>
> +/* flags for preadv2/pwritev2: */
> +#define RWF_HIPRI			0x00000001 /* high priority request, poll if possible */

This text makes it sound like a firm "request" ("if possible").

In the man page posted separately it says:

+.BR RWF_HIPRI " (since Linux 4.6)"
+High priority read/write.  Allows block based filesystems to use polling of the
+device, which provides lower latency, but may use additional ressources.  (Currently
+only usable on a file descriptor opened using the
+.BR O_DIRECT " flag)."

So now it "allows", which is different again.

The differences may be subtle, but consistency is nice.

Also in that man page fragment:

> provides lower latency, but may use additional ressources

Is this a "latency vs throughput" trade-off, or something more subtle?
It would be nice to make the decision process as obvious as possible for
the developer considering the use of this flag.

(and s/ressources/resources/)

NeilBrown
[parent not found: <874ma8usrr.fsf-wvvUuzkyo1HefUI2i7LXDhCRmIWqnp/j@public.gmane.org>]
* Re: [PATCH 4/6] vfs: add the RWF_HIPRI flag for preadv2/pwritev2
       [not found]         ` <874ma8usrr.fsf-wvvUuzkyo1HefUI2i7LXDhCRmIWqnp/j@public.gmane.org>
@ 2016-05-11  8:55           ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2016-05-11  8:55 UTC
  To: NeilBrown
  Cc: Christoph Hellwig, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, axboe-b10kYP2dOMg,
	milosz-B5zB6C1i6pkAvxtiuMwx3w, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-block-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA

On Mon, May 09, 2016 at 07:47:04AM +1000, NeilBrown wrote:
> On Fri, Mar 04 2016, Christoph Hellwig wrote:
>
> > This adds a flag that tells the file system that this is a high priority
> > request for which it's worth to poll the hardware.  The flag is purely
> > advisory and can be ignored if not supported.
>
> Here you say the flag is "advice".
>
> >
> > +/* flags for preadv2/pwritev2: */
> > +#define RWF_HIPRI			0x00000001 /* high priority request, poll if possible */
>
> This text makes it sound like a firm "request" ("if possible").

"request" here is in the sense of an I/O request.  Better wording
highly welcome.

> > provides lower latency, but may use additional ressources
>
> Is this a "latency vs throughput" trade-off, or something more subtle?
> It would be nice to make the decision process as obvious as possible for
> the developer considering the use of this flag.

If you poll you can't do anything else, so you end up using CPU cycles
to wait which otherwise could do something productive.
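Because polling burns CPU for the duration of the wait, an application would typically set RWF_HIPRI only on the reads whose latency it actually cares about. The following user-space sketch illustrates that choice. It assumes Linux with a C library that provides the preadv2() wrapper (glibc 2.26 or later); `read_at` and `read_magic` are hypothetical helper names, and the fallback to preadv() on ENOSYS, EOPNOTSUPP or EINVAL is a design choice of this example, not something the thread prescribes.

```c
#define _GNU_SOURCE          /* preadv2() is declared as a GNU extension */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_HIPRI
#define RWF_HIPRI 0x00000001 /* value from include/uapi/linux/fs.h in the patch */
#endif

/*
 * Hypothetical helper: read len bytes at offset off.  Only reads the caller
 * marks as latency-critical carry RWF_HIPRI; all others wait as usual.
 */
static ssize_t read_at(int fd, void *buf, size_t len, off_t off,
                       int latency_critical)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    assert(buf != NULL || len == 0);

    if (latency_critical) {
        ssize_t ret = preadv2(fd, &iov, 1, off, RWF_HIPRI);

        /*
         * Success or an unrelated error is final.  If the syscall or the
         * flag is rejected, retry without the hint; a genuine EINVAL will
         * simply be reported again by the plain read below.
         */
        if (ret >= 0 ||
            (errno != ENOSYS && errno != EOPNOTSUPP && errno != EINVAL))
            return ret;
    }
    /* Ordinary positional read: no polling requested. */
    return preadv(fd, &iov, 1, off);
}

/*
 * Usage sketch: read the first four bytes of a file with the hint set.
 * The file is opened without O_DIRECT to avoid alignment requirements;
 * per the man page fragment quoted in this thread, polling is then not
 * expected to happen, so this only exercises the call path.
 */
static ssize_t read_magic(const char *path, unsigned char magic[4])
{
    int fd = open(path, O_RDONLY);
    ssize_t ret;

    if (fd < 0)
        return -1;
    ret = read_at(fd, magic, 4, 0, 1);
    close(fd);
    return ret;
}
```

For polling to have a chance of taking effect, the descriptor would need to be opened with O_DIRECT and the buffer, length and offset aligned as the device requires; those details are left out here to keep the sketch short.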