* generic RDMA READ/WRITE API V2
@ 2016-03-03 15:03 Christoph Hellwig
2016-03-03 15:03 ` [PATCH 1/6] vfs: pass a flags argument to vfs_readv/vfs_writev Christoph Hellwig
` (6 more replies)
0 siblings, 7 replies; 20+ messages in thread
From: Christoph Hellwig @ 2016-03-03 15:03 UTC (permalink / raw)
To: viro, axboe; +Cc: milosz, linux-fsdevel, linux-block, linux-api
This series contains patches that implement a first version of a generic
API to handle RDMA READ/WRITE operations as commonly used on the target
(or server) side for storage protocols.
This has been developed for the upcoming NVMe over Fabrics target, and
extensively tested as part of that, although this upstream version has
additional updates over the one we're currently using.
It hides details such as the use of MRs for iWarp devices, and will allow
looking at other HCA specifics easily in the future.
This series contains a conversion of the SRP target, and the git tree
below also has a RFC conversion of the iSER target (a little hacky
due to the signature MR support which I can't test)
I also have a git tree available at:
git://git.infradead.org/users/hch/rdma.git rdma-rw-api
Gitweb:
http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/rdma-rw-api
These two also include the RFC iSER target conversion.
Changes since V2:
- fold the list_del in mr_pool_get into the right patch
- clamp the max FR page size length
- minor srpt style fix
- spelling fixes
Changes since V1:
- fixed offset handling in ib_sg_to_pages
- uses proper SG iterators to handle larger than PAGE_SIZE segments
- adjusted parameters for some functions to reduce size of the context
- SRP target support
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 1/6] vfs: pass a flags argument to vfs_readv/vfs_writev
2016-03-03 15:03 generic RDMA READ/WRITE API V2 Christoph Hellwig
@ 2016-03-03 15:03 ` Christoph Hellwig
2016-03-03 15:03 ` [PATCH 2/6] vfs: vfs: Define new syscalls preadv2,pwritev2 Christoph Hellwig
` (5 subsequent siblings)
6 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2016-03-03 15:03 UTC (permalink / raw)
To: viro, axboe; +Cc: milosz, linux-fsdevel, linux-block, linux-api
This way we can set kiocb flags also from the sync read/write path for
the read_iter/write_iter operations. For now there is no way to pass
flags to plain read/write operations as there is no real need for that,
and all flags passed are explicitly rejected for these files.
Signed-off-by: Milosz Tanski <milosz@adfin.com>
[hch: rebased on top of my kiocb changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
---
fs/nfsd/vfs.c | 4 ++--
fs/read_write.c | 44 ++++++++++++++++++++++++++------------------
fs/splice.c | 2 +-
include/linux/fs.h | 4 ++--
4 files changed, 31 insertions(+), 23 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 5d2a57e..d40010e 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -870,7 +870,7 @@ __be32 nfsd_readv(struct file *file, loff_t offset, struct kvec *vec, int vlen,
oldfs = get_fs();
set_fs(KERNEL_DS);
- host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset);
+ host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset, 0);
set_fs(oldfs);
return nfsd_finish_read(file, count, host_err);
}
@@ -957,7 +957,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
/* Write the data. */
oldfs = get_fs(); set_fs(KERNEL_DS);
- host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos);
+ host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos, 0);
set_fs(oldfs);
if (host_err < 0)
goto out_nfserr;
diff --git a/fs/read_write.c b/fs/read_write.c
index dadf24e..3b7577d 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -693,11 +693,14 @@ unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to)
EXPORT_SYMBOL(iov_shorten);
static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
- loff_t *ppos, iter_fn_t fn)
+ loff_t *ppos, iter_fn_t fn, int flags)
{
struct kiocb kiocb;
ssize_t ret;
+ if (flags)
+ return -EOPNOTSUPP;
+
init_sync_kiocb(&kiocb, filp);
kiocb.ki_pos = *ppos;
@@ -709,10 +712,13 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
/* Do it by hand, with file-ops */
static ssize_t do_loop_readv_writev(struct file *filp, struct iov_iter *iter,
- loff_t *ppos, io_fn_t fn)
+ loff_t *ppos, io_fn_t fn, int flags)
{
ssize_t ret = 0;
+ if (flags)
+ return -EOPNOTSUPP;
+
while (iov_iter_count(iter)) {
struct iovec iovec = iov_iter_iovec(iter);
ssize_t nr;
@@ -813,7 +819,8 @@ out:
static ssize_t do_readv_writev(int type, struct file *file,
const struct iovec __user * uvector,
- unsigned long nr_segs, loff_t *pos)
+ unsigned long nr_segs, loff_t *pos,
+ int flags)
{
size_t tot_len;
struct iovec iovstack[UIO_FASTIOV];
@@ -845,9 +852,9 @@ static ssize_t do_readv_writev(int type, struct file *file,
}
if (iter_fn)
- ret = do_iter_readv_writev(file, &iter, pos, iter_fn);
+ ret = do_iter_readv_writev(file, &iter, pos, iter_fn, flags);
else
- ret = do_loop_readv_writev(file, &iter, pos, fn);
+ ret = do_loop_readv_writev(file, &iter, pos, fn, flags);
if (type != READ)
file_end_write(file);
@@ -864,27 +871,27 @@ out:
}
ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
- unsigned long vlen, loff_t *pos)
+ unsigned long vlen, loff_t *pos, int flags)
{
if (!(file->f_mode & FMODE_READ))
return -EBADF;
if (!(file->f_mode & FMODE_CAN_READ))
return -EINVAL;
- return do_readv_writev(READ, file, vec, vlen, pos);
+ return do_readv_writev(READ, file, vec, vlen, pos, flags);
}
EXPORT_SYMBOL(vfs_readv);
ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
- unsigned long vlen, loff_t *pos)
+ unsigned long vlen, loff_t *pos, int flags)
{
if (!(file->f_mode & FMODE_WRITE))
return -EBADF;
if (!(file->f_mode & FMODE_CAN_WRITE))
return -EINVAL;
- return do_readv_writev(WRITE, file, vec, vlen, pos);
+ return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
}
EXPORT_SYMBOL(vfs_writev);
@@ -897,7 +904,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
if (f.file) {
loff_t pos = file_pos_read(f.file);
- ret = vfs_readv(f.file, vec, vlen, &pos);
+ ret = vfs_readv(f.file, vec, vlen, &pos, 0);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
@@ -917,7 +924,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
if (f.file) {
loff_t pos = file_pos_read(f.file);
- ret = vfs_writev(f.file, vec, vlen, &pos);
+ ret = vfs_writev(f.file, vec, vlen, &pos, 0);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
@@ -949,7 +956,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
if (f.file) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PREAD)
- ret = vfs_readv(f.file, vec, vlen, &pos);
+ ret = vfs_readv(f.file, vec, vlen, &pos, 0);
fdput(f);
}
@@ -973,7 +980,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
if (f.file) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PWRITE)
- ret = vfs_writev(f.file, vec, vlen, &pos);
+ ret = vfs_writev(f.file, vec, vlen, &pos, 0);
fdput(f);
}
@@ -987,7 +994,8 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
static ssize_t compat_do_readv_writev(int type, struct file *file,
const struct compat_iovec __user *uvector,
- unsigned long nr_segs, loff_t *pos)
+ unsigned long nr_segs, loff_t *pos,
+ int flags)
{
compat_ssize_t tot_len;
struct iovec iovstack[UIO_FASTIOV];
@@ -1019,9 +1027,9 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
}
if (iter_fn)
- ret = do_iter_readv_writev(file, &iter, pos, iter_fn);
+ ret = do_iter_readv_writev(file, &iter, pos, iter_fn, flags);
else
- ret = do_loop_readv_writev(file, &iter, pos, fn);
+ ret = do_loop_readv_writev(file, &iter, pos, fn, flags);
if (type != READ)
file_end_write(file);
@@ -1050,7 +1058,7 @@ static size_t compat_readv(struct file *file,
if (!(file->f_mode & FMODE_CAN_READ))
goto out;
- ret = compat_do_readv_writev(READ, file, vec, vlen, pos);
+ ret = compat_do_readv_writev(READ, file, vec, vlen, pos, 0);
out:
if (ret > 0)
@@ -1127,7 +1135,7 @@ static size_t compat_writev(struct file *file,
if (!(file->f_mode & FMODE_CAN_WRITE))
goto out;
- ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos);
+ ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos, 0);
out:
if (ret > 0)
diff --git a/fs/splice.c b/fs/splice.c
index 82bc0d6..3dc1426 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -577,7 +577,7 @@ static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
old_fs = get_fs();
set_fs(get_ds());
/* The cast to a user pointer is valid due to the set_fs() */
- res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos);
+ res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
set_fs(old_fs);
return res;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ae68100..875277a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1709,9 +1709,9 @@ extern ssize_t __vfs_write(struct file *, const char __user *, size_t, loff_t *)
extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
- unsigned long, loff_t *);
+ unsigned long, loff_t *, int);
extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
- unsigned long, loff_t *);
+ unsigned long, loff_t *, int);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
loff_t, size_t, unsigned int);
extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
--
2.1.4
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 2/6] vfs: vfs: Define new syscalls preadv2,pwritev2
2016-03-03 15:03 generic RDMA READ/WRITE API V2 Christoph Hellwig
2016-03-03 15:03 ` [PATCH 1/6] vfs: pass a flags argument to vfs_readv/vfs_writev Christoph Hellwig
@ 2016-03-03 15:03 ` Christoph Hellwig
2016-03-10 18:15 ` Michael Kerrisk (man-pages)
2016-03-03 15:04 ` [PATCH 3/6] x86: wire up preadv2 and pwritev2 Christoph Hellwig
` (4 subsequent siblings)
6 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2016-03-03 15:03 UTC (permalink / raw)
To: viro, axboe; +Cc: milosz, linux-fsdevel, linux-block, linux-api
From: Milosz Tanski <milosz@adfin.com>
New syscalls that take a flags argument. No flags are added yet in this
patch.
Signed-off-by: Milosz Tanski <milosz@adfin.com>
[hch: rebased on top of my kiocb changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
---
fs/read_write.c | 161 ++++++++++++++++++++++++++++++++++++-----------
include/linux/compat.h | 6 ++
include/linux/syscalls.h | 6 ++
3 files changed, 138 insertions(+), 35 deletions(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index 3b7577d..799d25f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -896,15 +896,15 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
EXPORT_SYMBOL(vfs_writev);
-SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
- unsigned long, vlen)
+static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, int flags)
{
struct fd f = fdget_pos(fd);
ssize_t ret = -EBADF;
if (f.file) {
loff_t pos = file_pos_read(f.file);
- ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+ ret = vfs_readv(f.file, vec, vlen, &pos, flags);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
@@ -916,15 +916,15 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
return ret;
}
-SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
- unsigned long, vlen)
+static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, int flags)
{
struct fd f = fdget_pos(fd);
ssize_t ret = -EBADF;
if (f.file) {
loff_t pos = file_pos_read(f.file);
- ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+ ret = vfs_writev(f.file, vec, vlen, &pos, flags);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
@@ -942,10 +942,9 @@ static inline loff_t pos_from_hilo(unsigned long high, unsigned long low)
return (((loff_t)high << HALF_LONG_BITS) << HALF_LONG_BITS) | low;
}
-SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
- unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_preadv(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, loff_t pos, int flags)
{
- loff_t pos = pos_from_hilo(pos_h, pos_l);
struct fd f;
ssize_t ret = -EBADF;
@@ -956,7 +955,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
if (f.file) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PREAD)
- ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+ ret = vfs_readv(f.file, vec, vlen, &pos, flags);
fdput(f);
}
@@ -966,10 +965,9 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
return ret;
}
-SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
- unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_pwritev(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, loff_t pos, int flags)
{
- loff_t pos = pos_from_hilo(pos_h, pos_l);
struct fd f;
ssize_t ret = -EBADF;
@@ -980,7 +978,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
if (f.file) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PWRITE)
- ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+ ret = vfs_writev(f.file, vec, vlen, &pos, flags);
fdput(f);
}
@@ -990,6 +988,58 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
return ret;
}
+SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen)
+{
+ return do_readv(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen)
+{
+ return do_writev(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+ loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+ return do_preadv(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+ int, flags)
+{
+ loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+ if (pos == -1)
+ return do_readv(fd, vec, vlen, flags);
+
+ return do_preadv(fd, vec, vlen, pos, flags);
+}
+
+SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+ loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+ return do_pwritev(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
+ unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+ int, flags)
+{
+ loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+ if (pos == -1)
+ return do_writev(fd, vec, vlen, flags);
+
+ return do_pwritev(fd, vec, vlen, pos, flags);
+}
+
#ifdef CONFIG_COMPAT
static ssize_t compat_do_readv_writev(int type, struct file *file,
@@ -1047,7 +1097,7 @@ out:
static size_t compat_readv(struct file *file,
const struct compat_iovec __user *vec,
- unsigned long vlen, loff_t *pos)
+ unsigned long vlen, loff_t *pos, int flags)
{
ssize_t ret = -EBADF;
@@ -1058,7 +1108,7 @@ static size_t compat_readv(struct file *file,
if (!(file->f_mode & FMODE_CAN_READ))
goto out;
- ret = compat_do_readv_writev(READ, file, vec, vlen, pos, 0);
+ ret = compat_do_readv_writev(READ, file, vec, vlen, pos, flags);
out:
if (ret > 0)
@@ -1067,9 +1117,9 @@ out:
return ret;
}
-COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
- const struct compat_iovec __user *,vec,
- compat_ulong_t, vlen)
+static size_t do_compat_readv(compat_ulong_t fd,
+ const struct compat_iovec __user *vec,
+ compat_ulong_t vlen, int flags)
{
struct fd f = fdget_pos(fd);
ssize_t ret;
@@ -1078,16 +1128,24 @@ COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
if (!f.file)
return -EBADF;
pos = f.file->f_pos;
- ret = compat_readv(f.file, vec, vlen, &pos);
+ ret = compat_readv(f.file, vec, vlen, &pos, flags);
if (ret >= 0)
f.file->f_pos = pos;
fdput_pos(f);
return ret;
+
}
-static long __compat_sys_preadv64(unsigned long fd,
+COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
+ const struct compat_iovec __user *,vec,
+ compat_ulong_t, vlen)
+{
+ return do_compat_readv(fd, vec, vlen, 0);
+}
+
+static long do_compat_preadv64(unsigned long fd,
const struct compat_iovec __user *vec,
- unsigned long vlen, loff_t pos)
+ unsigned long vlen, loff_t pos, int flags)
{
struct fd f;
ssize_t ret;
@@ -1099,7 +1157,7 @@ static long __compat_sys_preadv64(unsigned long fd,
return -EBADF;
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PREAD)
- ret = compat_readv(f.file, vec, vlen, &pos);
+ ret = compat_readv(f.file, vec, vlen, &pos, flags);
fdput(f);
return ret;
}
@@ -1109,7 +1167,7 @@ COMPAT_SYSCALL_DEFINE4(preadv64, unsigned long, fd,
const struct compat_iovec __user *,vec,
unsigned long, vlen, loff_t, pos)
{
- return __compat_sys_preadv64(fd, vec, vlen, pos);
+ return do_compat_preadv64(fd, vec, vlen, pos, 0);
}
#endif
@@ -1119,12 +1177,25 @@ COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
{
loff_t pos = ((loff_t)pos_high << 32) | pos_low;
- return __compat_sys_preadv64(fd, vec, vlen, pos);
+ return do_compat_preadv64(fd, vec, vlen, pos, 0);
+}
+
+COMPAT_SYSCALL_DEFINE6(preadv2, compat_ulong_t, fd,
+ const struct compat_iovec __user *,vec,
+ compat_ulong_t, vlen, u32, pos_low, u32, pos_high,
+ int, flags)
+{
+ loff_t pos = ((loff_t)pos_high << 32) | pos_low;
+
+ if (pos == -1)
+ return do_compat_readv(fd, vec, vlen, flags);
+
+ return do_compat_preadv64(fd, vec, vlen, pos, flags);
}
static size_t compat_writev(struct file *file,
const struct compat_iovec __user *vec,
- unsigned long vlen, loff_t *pos)
+ unsigned long vlen, loff_t *pos, int flags)
{
ssize_t ret = -EBADF;
@@ -1144,9 +1215,9 @@ out:
return ret;
}
-COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
- const struct compat_iovec __user *, vec,
- compat_ulong_t, vlen)
+static size_t do_compat_writev(compat_ulong_t fd,
+ const struct compat_iovec __user* vec,
+ compat_ulong_t vlen, int flags)
{
struct fd f = fdget_pos(fd);
ssize_t ret;
@@ -1155,16 +1226,23 @@ COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
if (!f.file)
return -EBADF;
pos = f.file->f_pos;
- ret = compat_writev(f.file, vec, vlen, &pos);
+ ret = compat_writev(f.file, vec, vlen, &pos, flags);
if (ret >= 0)
f.file->f_pos = pos;
fdput_pos(f);
return ret;
}
-static long __compat_sys_pwritev64(unsigned long fd,
+COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
+ const struct compat_iovec __user *, vec,
+ compat_ulong_t, vlen)
+{
+ return do_compat_writev(fd, vec, vlen, 0);
+}
+
+static long do_compat_pwritev64(unsigned long fd,
const struct compat_iovec __user *vec,
- unsigned long vlen, loff_t pos)
+ unsigned long vlen, loff_t pos, int flags)
{
struct fd f;
ssize_t ret;
@@ -1176,7 +1254,7 @@ static long __compat_sys_pwritev64(unsigned long fd,
return -EBADF;
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PWRITE)
- ret = compat_writev(f.file, vec, vlen, &pos);
+ ret = compat_writev(f.file, vec, vlen, &pos, flags);
fdput(f);
return ret;
}
@@ -1186,7 +1264,7 @@ COMPAT_SYSCALL_DEFINE4(pwritev64, unsigned long, fd,
const struct compat_iovec __user *,vec,
unsigned long, vlen, loff_t, pos)
{
- return __compat_sys_pwritev64(fd, vec, vlen, pos);
+ return do_compat_pwritev64(fd, vec, vlen, pos, 0);
}
#endif
@@ -1196,8 +1274,21 @@ COMPAT_SYSCALL_DEFINE5(pwritev, compat_ulong_t, fd,
{
loff_t pos = ((loff_t)pos_high << 32) | pos_low;
- return __compat_sys_pwritev64(fd, vec, vlen, pos);
+ return do_compat_pwritev64(fd, vec, vlen, pos, 0);
+}
+
+COMPAT_SYSCALL_DEFINE6(pwritev2, compat_ulong_t, fd,
+ const struct compat_iovec __user *,vec,
+ compat_ulong_t, vlen, u32, pos_low, u32, pos_high, int, flags)
+{
+ loff_t pos = ((loff_t)pos_high << 32) | pos_low;
+
+ if (pos == -1)
+ return do_compat_writev(fd, vec, vlen, flags);
+
+ return do_compat_pwritev64(fd, vec, vlen, pos, flags);
}
+
#endif
static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
diff --git a/include/linux/compat.h b/include/linux/compat.h
index a76c917..fe4ccd0 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -340,6 +340,12 @@ asmlinkage ssize_t compat_sys_preadv(compat_ulong_t fd,
asmlinkage ssize_t compat_sys_pwritev(compat_ulong_t fd,
const struct compat_iovec __user *vec,
compat_ulong_t vlen, u32 pos_low, u32 pos_high);
+asmlinkage ssize_t compat_sys_preadv2(compat_ulong_t fd,
+ const struct compat_iovec __user *vec,
+ compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags);
+asmlinkage ssize_t compat_sys_pwritev2(compat_ulong_t fd,
+ const struct compat_iovec __user *vec,
+ compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags);
#ifdef __ARCH_WANT_COMPAT_SYS_PREADV64
asmlinkage long compat_sys_preadv64(unsigned long fd,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 185815c..d795472 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -575,8 +575,14 @@ asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
size_t count, loff_t pos);
asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+ int flags);
asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
+ unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+ int flags);
asmlinkage long sys_getcwd(char __user *buf, unsigned long size);
asmlinkage long sys_mkdir(const char __user *pathname, umode_t mode);
asmlinkage long sys_chdir(const char __user *filename);
--
2.1.4
^ permalink raw reply related [flat|nested] 20+ messages in thread
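The preadv2/pwritev2 entry points in the patch above rebuild a 64-bit offset from two unsigned long halves and treat an offset of -1 as "use the current file position", falling through to the plain readv/writev path. The following is a minimal userspace sketch of that dispatch, not kernel code: pos_from_hilo and HALF_LONG_BITS follow the patch, while enum rw_path and choose_path() are hypothetical names introduced only for illustration.

```c
#include <limits.h>
#include <stdint.h>

/* Half the width of an unsigned long, as in the kernel helper. */
#define HALF_LONG_BITS (sizeof(unsigned long) * CHAR_BIT / 2)

/*
 * Combine the two syscall arguments into one 64-bit offset.  Shifting in
 * two half-width steps keeps each shift count below the type width: with a
 * 64-bit long the high half is shifted out entirely and only 'low' counts.
 */
static int64_t pos_from_hilo(unsigned long high, unsigned long low)
{
	uint64_t pos = (uint64_t)high;

	pos = (pos << HALF_LONG_BITS) << HALF_LONG_BITS;
	return (int64_t)(pos | low);
}

/* Hypothetical labels for the two paths taken by preadv2/pwritev2. */
enum rw_path {
	PATH_FILE_POSITION,	/* do_readv()/do_writev(): uses and updates f_pos */
	PATH_EXPLICIT_OFFSET	/* do_preadv()/do_pwritev(): uses the given offset */
};

/* An offset of -1 selects the file-position path, anything else is explicit. */
static enum rw_path choose_path(unsigned long pos_l, unsigned long pos_h)
{
	return pos_from_hilo(pos_h, pos_l) == -1 ? PATH_FILE_POSITION
						  : PATH_EXPLICIT_OFFSET;
}
```

Passing all-ones in both halves yields -1 on both 32-bit and 64-bit builds of this sketch, so a caller that wants readv-like behaviour with flags does not need a separate syscall.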
* [PATCH 3/6] x86: wire up preadv2 and pwritev2
2016-03-03 15:03 generic RDMA READ/WRITE API V2 Christoph Hellwig
2016-03-03 15:03 ` [PATCH 1/6] vfs: pass a flags argument to vfs_readv/vfs_writev Christoph Hellwig
2016-03-03 15:03 ` [PATCH 2/6] vfs: vfs: Define new syscalls preadv2,pwritev2 Christoph Hellwig
@ 2016-03-03 15:04 ` Christoph Hellwig
2016-03-03 15:04 ` [PATCH 4/6] vfs: add the RWF_HIPRI flag for preadv2/pwritev2 Christoph Hellwig
` (3 subsequent siblings)
6 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2016-03-03 15:04 UTC (permalink / raw)
To: viro, axboe; +Cc: milosz, linux-fsdevel, linux-block, linux-api
Signed-off-by: Milosz Tanski <milosz@adfin.com>
[hch: rebased due to newly added syscalls]
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
arch/x86/entry/syscalls/syscall_32.tbl | 2 ++
arch/x86/entry/syscalls/syscall_64.tbl | 2 ++
2 files changed, 4 insertions(+)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index cb713df..b30dd81 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -384,3 +384,5 @@
375 i386 membarrier sys_membarrier
376 i386 mlock2 sys_mlock2
377 i386 copy_file_range sys_copy_file_range
+378 i386 preadv2 sys_preadv2
+379 i386 pwritev2 sys_pwritev2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index dc1040a..31cec92 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -333,6 +333,8 @@
324 common membarrier sys_membarrier
325 common mlock2 sys_mlock2
326 common copy_file_range sys_copy_file_range
+327 64 preadv2 sys_preadv2
+328 64 pwritev2 sys_pwritev2
#
# x32-specific system call numbers start at 512 to avoid cache impact
--
2.1.4
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 4/6] vfs: add the RWF_HIPRI flag for preadv2/pwritev2
2016-03-03 15:03 generic RDMA READ/WRITE API V2 Christoph Hellwig
` (2 preceding siblings ...)
2016-03-03 15:04 ` [PATCH 3/6] x86: wire up preadv2 and pwritev2 Christoph Hellwig
@ 2016-03-03 15:04 ` Christoph Hellwig
2016-05-08 21:47 ` NeilBrown
2016-03-03 15:04 ` [PATCH 5/6] direct-io: only use block polling if explicitly requested Christoph Hellwig
` (2 subsequent siblings)
6 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2016-03-03 15:04 UTC (permalink / raw)
To: viro, axboe; +Cc: milosz, linux-fsdevel, linux-block, linux-api
This adds a flag that tells the file system that this is a high-priority
request for which it is worth polling the hardware. The flag is purely
advisory and can be ignored if not supported.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
---
fs/read_write.c | 6 ++++--
include/linux/fs.h | 1 +
include/uapi/linux/fs.h | 3 +++
3 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index 799d25f..cf377cf 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -698,10 +698,12 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
struct kiocb kiocb;
ssize_t ret;
- if (flags)
+ if (flags & ~RWF_HIPRI)
return -EOPNOTSUPP;
init_sync_kiocb(&kiocb, filp);
+ if (flags & RWF_HIPRI)
+ kiocb.ki_flags |= IOCB_HIPRI;
kiocb.ki_pos = *ppos;
ret = fn(&kiocb, iter);
@@ -716,7 +718,7 @@ static ssize_t do_loop_readv_writev(struct file *filp, struct iov_iter *iter,
{
ssize_t ret = 0;
- if (flags)
+ if (flags & ~RWF_HIPRI)
return -EOPNOTSUPP;
while (iov_iter_count(iter)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 875277a..a1f731c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -320,6 +320,7 @@ struct writeback_control;
#define IOCB_EVENTFD (1 << 0)
#define IOCB_APPEND (1 << 1)
#define IOCB_DIRECT (1 << 2)
+#define IOCB_HIPRI (1 << 3)
struct kiocb {
struct file *ki_filp;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 149bec8..d246339 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -304,4 +304,7 @@ struct fsxattr {
#define SYNC_FILE_RANGE_WRITE 2
#define SYNC_FILE_RANGE_WAIT_AFTER 4
+/* flags for preadv2/pwritev2: */
+#define RWF_HIPRI 0x00000001 /* high priority request, poll if possible */
+
#endif /* _UAPI_LINUX_FS_H */
--
2.1.4
^ permalink raw reply related [flat|nested] 20+ messages in thread
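The flag handling in the patch above comes down to two steps: reject any bit other than RWF_HIPRI with -EOPNOTSUPP, then translate RWF_HIPRI into the in-kernel IOCB_HIPRI bit. A standalone sketch of that logic follows. The two constants are copied from the patch; rwf_to_iocb_flags() is a hypothetical helper that does not exist in the kernel under that name.

```c
#include <errno.h>

#define RWF_HIPRI	0x00000001	/* user-visible flag, include/uapi/linux/fs.h */
#define IOCB_HIPRI	(1 << 3)	/* kiocb flag, include/linux/fs.h */

/*
 * Hypothetical helper modelling the check in do_iter_readv_writev():
 * returns 0 and stores the kiocb flags in *ki_flags, or returns
 * -EOPNOTSUPP if the caller passed a flag this series does not know.
 */
static int rwf_to_iocb_flags(int flags, int *ki_flags)
{
	if (flags & ~RWF_HIPRI)
		return -EOPNOTSUPP;

	*ki_flags = 0;
	if (flags & RWF_HIPRI)
		*ki_flags |= IOCB_HIPRI;
	return 0;
}
```

Note that the loop-based fallback path in the patch performs the same rejection but has no kiocb to mark, so there RWF_HIPRI is accepted and silently ignored, in line with the flag being advisory.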
* [PATCH 5/6] direct-io: only use block polling if explicitly requested
2016-03-03 15:03 generic RDMA READ/WRITE API V2 Christoph Hellwig
` (3 preceding siblings ...)
2016-03-03 15:04 ` [PATCH 4/6] vfs: add the RWF_HIPRI flag for preadv2/pwritev2 Christoph Hellwig
@ 2016-03-03 15:04 ` Christoph Hellwig
2016-03-03 15:04 ` [PATCH 6/6] blk-mq: enable polling support by default Christoph Hellwig
2016-03-03 15:09 ` generic RDMA READ/WRITE API V2 Sagi Grimberg
6 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2016-03-03 15:04 UTC (permalink / raw)
To: viro, axboe; +Cc: milosz, linux-fsdevel, linux-block, linux-api
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
---
fs/direct-io.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index d6a9012..0a8d937 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -445,7 +445,8 @@ static struct bio *dio_await_one(struct dio *dio)
__set_current_state(TASK_UNINTERRUPTIBLE);
dio->waiter = current;
spin_unlock_irqrestore(&dio->bio_lock, flags);
- if (!blk_poll(bdev_get_queue(dio->bio_bdev), dio->bio_cookie))
+ if (!(dio->iocb->ki_flags & IOCB_HIPRI) ||
+ !blk_poll(bdev_get_queue(dio->bio_bdev), dio->bio_cookie))
io_schedule();
/* wake up sets us TASK_RUNNING */
spin_lock_irqsave(&dio->bio_lock, flags);
--
2.1.4
^ permalink raw reply related [flat|nested] 20+ messages in thread
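The one-line change in dio_await_one() above relies on short-circuit evaluation: without IOCB_HIPRI the waiter goes straight to sleep and the poll routine is never called. The sketch below models only that decision in userspace. should_sleep(), poll_fn_t and counting_poll() are hypothetical names; the stub stands in for blk_poll() and always reports that it found a completion.

```c
#include <stdbool.h>

#define IOCB_HIPRI	(1 << 3)	/* value from patch 4/6 */

/* Stand-in for blk_poll(): returns true if a completion was found. */
typedef bool (*poll_fn_t)(void *cookie);

/*
 * Mirrors the condition guarding io_schedule() in the patch: sleep when
 * the request is not high priority, or when polling found nothing.
 */
static bool should_sleep(int ki_flags, poll_fn_t poll, void *cookie)
{
	return !(ki_flags & IOCB_HIPRI) || !poll(cookie);
}

/* Illustrative stub: counts how often it is invoked, always "completes". */
static bool counting_poll(void *cookie)
{
	(*(int *)cookie)++;
	return true;
}
```

Calling should_sleep(0, counting_poll, &calls) leaves the counter untouched, which is the point of the patch: I/O that did not ask for polling no longer spends CPU time in the poll loop.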
* [PATCH 6/6] blk-mq: enable polling support by default
2016-03-03 15:03 generic RDMA READ/WRITE API V2 Christoph Hellwig
` (4 preceding siblings ...)
2016-03-03 15:04 ` [PATCH 5/6] direct-io: only use block polling if explicitly requested Christoph Hellwig
@ 2016-03-03 15:04 ` Christoph Hellwig
2016-03-03 15:09 ` generic RDMA READ/WRITE API V2 Sagi Grimberg
6 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2016-03-03 15:04 UTC (permalink / raw)
To: viro, axboe; +Cc: milosz, linux-fsdevel, linux-block, linux-api
Now that applications need to explicitly ask for polling we can enable it
by default in blk-mq drivers. Note that this will only have an effect
on drivers that supply a poll function, which currently only includes nvme.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
---
include/linux/blkdev.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4571ef1..458f6ef 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -499,7 +499,8 @@ struct request_queue {
#define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_STACKABLE) | \
- (1 << QUEUE_FLAG_SAME_COMP))
+ (1 << QUEUE_FLAG_SAME_COMP) | \
+ (1 << QUEUE_FLAG_POLL))
static inline void queue_lockdep_assert_held(struct request_queue *q)
{
--
2.1.4
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: generic RDMA READ/WRITE API V2
2016-03-03 15:03 generic RDMA READ/WRITE API V2 Christoph Hellwig
` (5 preceding siblings ...)
2016-03-03 15:04 ` [PATCH 6/6] blk-mq: enable polling support by default Christoph Hellwig
@ 2016-03-03 15:09 ` Sagi Grimberg
2016-03-03 15:11 ` selective block polling and preadv2/pwritev2 revisited V3 Christoph Hellwig
6 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2016-03-03 15:09 UTC (permalink / raw)
To: Christoph Hellwig, viro, axboe
Cc: milosz, linux-fsdevel, linux-block, linux-api
> This series contains patches that implement a first version of a generic
> API to handle RDMA READ/WRITE operations as commonly used on the target
> (or server) side for storage protocols.
>
> This has been developed for the upcoming NVMe over Fabrics target, and
> extensively tested as part of that, although this upstream version has
> additional updates over the one we're currently using.
>
> It hides details such as the use of MRs for iWarp devices, and will allow
> looking at other HCA specifics easily in the future.
>
> This series contains a conversion of the SRP target, and the git tree
> below also has a RFC conversion of the iSER target (a little hacky
> due to the signature MR support which I can't test)
>
> I also have a git tree available at:
>
> git://git.infradead.org/users/hch/rdma.git rdma-rw-api
>
> Gitweb:
>
> http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/rdma-rw-api
>
> These two also include the RFC iSER target conversion.
Heh... Looks like you got your cover-letters mixed up :)
^ permalink raw reply [flat|nested] 20+ messages in thread
* selective block polling and preadv2/pwritev2 revisited V3
2016-03-03 15:09 ` generic RDMA READ/WRITE API V2 Sagi Grimberg
@ 2016-03-03 15:11 ` Christoph Hellwig
2016-03-03 15:16 ` Jens Axboe
2016-03-03 15:52 ` Arnd Bergmann
0 siblings, 2 replies; 20+ messages in thread
From: Christoph Hellwig @ 2016-03-03 15:11 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Christoph Hellwig, viro, axboe, milosz, linux-fsdevel,
linux-block, linux-api
On Thu, Mar 03, 2016 at 05:09:41PM +0200, Sagi Grimberg wrote:
> Heh... Looks like you got your cover-letters mixed up :)
Looks like it indeed..
Here is the right one:
This series allows selectively enabling/disabling polling for completions
in the block layer on a per-I/O basis. For this it resurrects the
preadv2/pwritev2 syscalls that Milosz prepared a while ago (and which
are much simpler now due to VFS changes that happened in the meantime).
That approach also had a man page update prepared, which I will resubmit
with the current flags once this series makes it in.
Polling for block I/O is important to reduce the latency on flash and
post-flash storage technologies. On the fastest NVMe controller I have
access to it almost halves latencies from over 7 microseconds to about 4
microseconds. But it is only useful if we actually care about the latency
of this particular I/O, and generally is a waste if enabled for all I/O
to a given device. This series uses the per-I/O flags in preadv2/pwritev2
to control this behavior. The alternative would be a new O_* flag set
at open time or using fcntl, but this is still too coarse-grained for some
applications and we're starting to run out of open flags.
Note that there are plenty of other use cases for preadv2/pwritev2 as well,
but I'd like to concentrate on this one for now. Examples are: non-blocking
reads (the original purpose), per-I/O O_SYNC, user space support for T10
DIF/DIX application tags and probably some more.
Changes since V2:
- minor style fixes
- various changelog updates
- dropped the unused REQ_POLL flag
Changes since V1:
- rebased on top of Linux 4.5-rc5
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: selective block polling and preadv2/pwritev2 revisited V3
2016-03-03 15:11 ` selective block polling and preadv2/pwritev2 revisited V3 Christoph Hellwig
@ 2016-03-03 15:16 ` Jens Axboe
2016-03-03 15:52 ` Arnd Bergmann
1 sibling, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2016-03-03 15:16 UTC (permalink / raw)
To: Christoph Hellwig, Sagi Grimberg
Cc: viro, milosz, linux-fsdevel, linux-block, linux-api
On 03/03/2016 08:11 AM, Christoph Hellwig wrote:
> On Thu, Mar 03, 2016 at 05:09:41PM +0200, Sagi Grimberg wrote:
>> Heh... Looks like you got your cover-letters mixed up :)
>
> Looks like it indeed..
>
> Here is the right one:
>
>
> This series allows selectively enabling/disabling polling for completions
> in the block layer on a per-I/O basis. For this it resurrects the
> preadv2/pwritev2 syscalls that Milosz prepared a while ago (and which
> are much simpler now due to VFS changes that happened in the meantime).
> That approach also had a man page update prepared, which I will resubmit
> with the current flags once this series makes it in.
>
> Polling for block I/O is important to reduce the latency on flash and
> post-flash storage technologies. On the fastest NVMe controller I have
> access to it almost halves latencies from over 7 microseconds to about 4
> microseconds. But it is only useful if we actually care about the latency
> of this particular I/O, and generally is a waste if enabled for all I/O
> to a given device. This series uses the per-I/O flags in preadv2/pwritev2
> to control this behavior. The alternative would be a new O_* flag set
> at open time or using fcntl, but this is still too coarse-grained for some
> applications and we're starting to run out of open flags.
>
> Note that there are plenty of other use cases for preadv2/pwritev2 as well,
> but I'd like to concentrate on this one for now. Examples are: non-blocking
> reads (the original purpose), per-I/O O_SYNC, user space support for T10
> DIF/DIX application tags and probably some more.
>
> Changes since V2:
> - minor style fixes
> - various changelog updates
> - dropped the unused REQ_POLL flag
>
> Changes since V1:
> - rebased on top of Linux 4.5-rc5
You can add my reviewed-by to the series, assuming that Al pulls it in.
--
Jens Axboe
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: selective block polling and preadv2/pwritev2 revisited V3
2016-03-03 15:11 ` selective block polling and preadv2/pwritev2 revisited V3 Christoph Hellwig
2016-03-03 15:16 ` Jens Axboe
@ 2016-03-03 15:52 ` Arnd Bergmann
2016-03-03 16:11 ` Christoph Hellwig
1 sibling, 1 reply; 20+ messages in thread
From: Arnd Bergmann @ 2016-03-03 15:52 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Sagi Grimberg, viro, axboe, milosz, linux-fsdevel, linux-block,
linux-api
On Thursday 03 March 2016 16:11:16 Christoph Hellwig wrote:
>
> This series allows selectively enabling/disabling polling for completions
> in the block layer on a per-I/O basis. For this it resurrects the
> preadv2/pwritev2 syscalls that Milosz prepared a while ago (and which
> are much simpler now due to VFS changes that happened in the meantime).
> That approach also had a man page update prepared, which I will resubmit
> with the current flags once this series makes it in.
>
> Polling for block I/O is important to reduce the latency on flash and
> post-flash storage technologies. On the fastest NVMe controller I have
> access to it almost halves latencies from over 7 microseconds to about 4
> microseconds. But it is only useful if we actually care about the latency
> of this particular I/O, and generally is a waste if enabled for all I/O
> to a given device. This series uses the per-I/O flags in preadv2/pwritev2
> to control this behavior. The alternative would be a new O_* flag set
> at open time or using fcntl, but this is still too coarse-grained for some
> applications and we're starting to run out of open flags.
>
> Note that there are plenty of other use cases for preadv2/pwritev2 as well,
> but I'd like to concentrate on this one for now. Examples are: non-blocking
> reads (the original purpose), per-I/O O_SYNC, user space support for T10
> DIF/DIX application tags and probably some more.
If we decide to revise the asm-generic/unistd.h system call list
for future architecture ports, can the syscalls replace all of
read/write/readv/writev/pread64/pwrite64/preadv/pwritev, or would
it be better to keep all of them around indefinitely?
When we introduced the generic syscall table, I tried to limit
it to the syscalls that are actually needed and avoid all duplications,
but since then we have added a couple of calls that can replace old
ones, and we might want to do that when risc-v gets merged.
Arnd
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: selective block polling and preadv2/pwritev2 revisited V3
2016-03-03 15:52 ` Arnd Bergmann
@ 2016-03-03 16:11 ` Christoph Hellwig
0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2016-03-03 16:11 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Christoph Hellwig, Sagi Grimberg, viro, axboe, milosz,
linux-fsdevel, linux-block, linux-api
On Thu, Mar 03, 2016 at 04:52:55PM +0100, Arnd Bergmann wrote:
> If we decide to revise the asm-generic/unistd.h system call list
> for future architecture ports, can the syscalls replace all of
> read/write/readv/writev/pread64/pwrite64/preadv/pwritev, or would
> it be better to keep all of them around indefinitely?
It does replace all of them fully. I never quite understood
why having the wrappers is better in libc than the kernel, though.
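To make the "replaces all of them" point concrete, here is a minimal sketch of how read() and pread() style helpers could be layered on preadv2() alone. It assumes a C library with a preadv2() wrapper; read_via_preadv2() and pread_via_preadv2() are hypothetical names, and a real libc would need more care (for example, pread() rejects negative offsets, whereas -1 has a special meaning here).

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* read(2)-like: offset -1 means "use and advance the current file offset". */
static ssize_t read_via_preadv2(int fd, void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	return preadv2(fd, &iov, 1, -1, 0);
}

/* pread(2)-like: explicit offset, the file offset is left untouched. */
static ssize_t pread_via_preadv2(int fd, void *buf, size_t len, off_t offset)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	return preadv2(fd, &iov, 1, offset, 0);
}
```

The vectored variants follow the same pattern with a caller-supplied iovec array, and the write side mirrors this with pwritev2().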
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/6] vfs: vfs: Define new syscalls preadv2,pwritev2
2016-03-03 15:03 ` [PATCH 2/6] vfs: vfs: Define new syscalls preadv2,pwritev2 Christoph Hellwig
@ 2016-03-10 18:15 ` Michael Kerrisk (man-pages)
2016-03-11 9:53 ` Christoph Hellwig
0 siblings, 1 reply; 20+ messages in thread
From: Michael Kerrisk (man-pages) @ 2016-03-10 18:15 UTC (permalink / raw)
To: Christoph Hellwig, viro, axboe
Cc: mtk.manpages, milosz, linux-fsdevel, linux-block, linux-api
Hi Christoph,
On 03/03/2016 04:03 PM, Christoph Hellwig wrote:
> From: Milosz Tanski <milosz@adfin.com>
>
> New syscalls that take a flag argument. No flags are added yet in this
> patch.
Are there some man pages patches for these proposed system calls?
Thanks,
Michael
> Signed-off-by: Milosz Tanski <milosz@adfin.com>
> [hch: rebased on top of my kiocb changes]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
> Tested-by: Stephen Bates <stephen.bates@pmcs.com>
> Acked-by: Jeff Moyer <jmoyer@redhat.com>
> ---
> fs/read_write.c | 161 ++++++++++++++++++++++++++++++++++++-----------
> include/linux/compat.h | 6 ++
> include/linux/syscalls.h | 6 ++
> 3 files changed, 138 insertions(+), 35 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 3b7577d..799d25f 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -896,15 +896,15 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
>
> EXPORT_SYMBOL(vfs_writev);
>
> -SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
> - unsigned long, vlen)
> +static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
> + unsigned long vlen, int flags)
> {
> struct fd f = fdget_pos(fd);
> ssize_t ret = -EBADF;
>
> if (f.file) {
> loff_t pos = file_pos_read(f.file);
> - ret = vfs_readv(f.file, vec, vlen, &pos, 0);
> + ret = vfs_readv(f.file, vec, vlen, &pos, flags);
> if (ret >= 0)
> file_pos_write(f.file, pos);
> fdput_pos(f);
> @@ -916,15 +916,15 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
> return ret;
> }
>
> -SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
> - unsigned long, vlen)
> +static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
> + unsigned long vlen, int flags)
> {
> struct fd f = fdget_pos(fd);
> ssize_t ret = -EBADF;
>
> if (f.file) {
> loff_t pos = file_pos_read(f.file);
> - ret = vfs_writev(f.file, vec, vlen, &pos, 0);
> + ret = vfs_writev(f.file, vec, vlen, &pos, flags);
> if (ret >= 0)
> file_pos_write(f.file, pos);
> fdput_pos(f);
> @@ -942,10 +942,9 @@ static inline loff_t pos_from_hilo(unsigned long high, unsigned long low)
> return (((loff_t)high << HALF_LONG_BITS) << HALF_LONG_BITS) | low;
> }
>
> -SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
> - unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
> +static ssize_t do_preadv(unsigned long fd, const struct iovec __user *vec,
> + unsigned long vlen, loff_t pos, int flags)
> {
> - loff_t pos = pos_from_hilo(pos_h, pos_l);
> struct fd f;
> ssize_t ret = -EBADF;
>
> @@ -956,7 +955,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
> if (f.file) {
> ret = -ESPIPE;
> if (f.file->f_mode & FMODE_PREAD)
> - ret = vfs_readv(f.file, vec, vlen, &pos, 0);
> + ret = vfs_readv(f.file, vec, vlen, &pos, flags);
> fdput(f);
> }
>
> @@ -966,10 +965,9 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
> return ret;
> }
>
> -SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
> - unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
> +static ssize_t do_pwritev(unsigned long fd, const struct iovec __user *vec,
> + unsigned long vlen, loff_t pos, int flags)
> {
> - loff_t pos = pos_from_hilo(pos_h, pos_l);
> struct fd f;
> ssize_t ret = -EBADF;
>
> @@ -980,7 +978,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
> if (f.file) {
> ret = -ESPIPE;
> if (f.file->f_mode & FMODE_PWRITE)
> - ret = vfs_writev(f.file, vec, vlen, &pos, 0);
> + ret = vfs_writev(f.file, vec, vlen, &pos, flags);
> fdput(f);
> }
>
> @@ -990,6 +988,58 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
> return ret;
> }
>
> +SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
> + unsigned long, vlen)
> +{
> + return do_readv(fd, vec, vlen, 0);
> +}
> +
> +SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
> + unsigned long, vlen)
> +{
> + return do_writev(fd, vec, vlen, 0);
> +}
> +
> +SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
> + unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
> +{
> + loff_t pos = pos_from_hilo(pos_h, pos_l);
> +
> + return do_preadv(fd, vec, vlen, pos, 0);
> +}
> +
> +SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
> + unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
> + int, flags)
> +{
> + loff_t pos = pos_from_hilo(pos_h, pos_l);
> +
> + if (pos == -1)
> + return do_readv(fd, vec, vlen, flags);
> +
> + return do_preadv(fd, vec, vlen, pos, flags);
> +}
> +
> +SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
> + unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
> +{
> + loff_t pos = pos_from_hilo(pos_h, pos_l);
> +
> + return do_pwritev(fd, vec, vlen, pos, 0);
> +}
> +
> +SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
> + unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
> + int, flags)
> +{
> + loff_t pos = pos_from_hilo(pos_h, pos_l);
> +
> + if (pos == -1)
> + return do_writev(fd, vec, vlen, flags);
> +
> + return do_pwritev(fd, vec, vlen, pos, flags);
> +}
> +
> #ifdef CONFIG_COMPAT
>
> static ssize_t compat_do_readv_writev(int type, struct file *file,
> @@ -1047,7 +1097,7 @@ out:
>
> static size_t compat_readv(struct file *file,
> const struct compat_iovec __user *vec,
> - unsigned long vlen, loff_t *pos)
> + unsigned long vlen, loff_t *pos, int flags)
> {
> ssize_t ret = -EBADF;
>
> @@ -1058,7 +1108,7 @@ static size_t compat_readv(struct file *file,
> if (!(file->f_mode & FMODE_CAN_READ))
> goto out;
>
> - ret = compat_do_readv_writev(READ, file, vec, vlen, pos, 0);
> + ret = compat_do_readv_writev(READ, file, vec, vlen, pos, flags);
>
> out:
> if (ret > 0)
> @@ -1067,9 +1117,9 @@ out:
> return ret;
> }
>
> -COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
> - const struct compat_iovec __user *,vec,
> - compat_ulong_t, vlen)
> +static size_t do_compat_readv(compat_ulong_t fd,
> + const struct compat_iovec __user *vec,
> + compat_ulong_t vlen, int flags)
> {
> struct fd f = fdget_pos(fd);
> ssize_t ret;
> @@ -1078,16 +1128,24 @@ COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
> if (!f.file)
> return -EBADF;
> pos = f.file->f_pos;
> - ret = compat_readv(f.file, vec, vlen, &pos);
> + ret = compat_readv(f.file, vec, vlen, &pos, flags);
> if (ret >= 0)
> f.file->f_pos = pos;
> fdput_pos(f);
> return ret;
> +
> }
>
> -static long __compat_sys_preadv64(unsigned long fd,
> +COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
> + const struct compat_iovec __user *,vec,
> + compat_ulong_t, vlen)
> +{
> + return do_compat_readv(fd, vec, vlen, 0);
> +}
> +
> +static long do_compat_preadv64(unsigned long fd,
> const struct compat_iovec __user *vec,
> - unsigned long vlen, loff_t pos)
> + unsigned long vlen, loff_t pos, int flags)
> {
> struct fd f;
> ssize_t ret;
> @@ -1099,7 +1157,7 @@ static long __compat_sys_preadv64(unsigned long fd,
> return -EBADF;
> ret = -ESPIPE;
> if (f.file->f_mode & FMODE_PREAD)
> - ret = compat_readv(f.file, vec, vlen, &pos);
> + ret = compat_readv(f.file, vec, vlen, &pos, flags);
> fdput(f);
> return ret;
> }
> @@ -1109,7 +1167,7 @@ COMPAT_SYSCALL_DEFINE4(preadv64, unsigned long, fd,
> const struct compat_iovec __user *,vec,
> unsigned long, vlen, loff_t, pos)
> {
> - return __compat_sys_preadv64(fd, vec, vlen, pos);
> + return do_compat_preadv64(fd, vec, vlen, pos, 0);
> }
> #endif
>
> @@ -1119,12 +1177,25 @@ COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
> {
> loff_t pos = ((loff_t)pos_high << 32) | pos_low;
>
> - return __compat_sys_preadv64(fd, vec, vlen, pos);
> + return do_compat_preadv64(fd, vec, vlen, pos, 0);
> +}
> +
> +COMPAT_SYSCALL_DEFINE6(preadv2, compat_ulong_t, fd,
> + const struct compat_iovec __user *,vec,
> + compat_ulong_t, vlen, u32, pos_low, u32, pos_high,
> + int, flags)
> +{
> + loff_t pos = ((loff_t)pos_high << 32) | pos_low;
> +
> + if (pos == -1)
> + return do_compat_readv(fd, vec, vlen, flags);
> +
> + return do_compat_preadv64(fd, vec, vlen, pos, flags);
> }
>
> static size_t compat_writev(struct file *file,
> const struct compat_iovec __user *vec,
> - unsigned long vlen, loff_t *pos)
> + unsigned long vlen, loff_t *pos, int flags)
> {
> ssize_t ret = -EBADF;
>
> @@ -1144,9 +1215,9 @@ out:
> return ret;
> }
>
> -COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
> - const struct compat_iovec __user *, vec,
> - compat_ulong_t, vlen)
> +static size_t do_compat_writev(compat_ulong_t fd,
> + const struct compat_iovec __user* vec,
> + compat_ulong_t vlen, int flags)
> {
> struct fd f = fdget_pos(fd);
> ssize_t ret;
> @@ -1155,16 +1226,23 @@ COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
> if (!f.file)
> return -EBADF;
> pos = f.file->f_pos;
> - ret = compat_writev(f.file, vec, vlen, &pos);
> + ret = compat_writev(f.file, vec, vlen, &pos, flags);
> if (ret >= 0)
> f.file->f_pos = pos;
> fdput_pos(f);
> return ret;
> }
>
> -static long __compat_sys_pwritev64(unsigned long fd,
> +COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
> + const struct compat_iovec __user *, vec,
> + compat_ulong_t, vlen)
> +{
> + return do_compat_writev(fd, vec, vlen, 0);
> +}
> +
> +static long do_compat_pwritev64(unsigned long fd,
> const struct compat_iovec __user *vec,
> - unsigned long vlen, loff_t pos)
> + unsigned long vlen, loff_t pos, int flags)
> {
> struct fd f;
> ssize_t ret;
> @@ -1176,7 +1254,7 @@ static long __compat_sys_pwritev64(unsigned long fd,
> return -EBADF;
> ret = -ESPIPE;
> if (f.file->f_mode & FMODE_PWRITE)
> - ret = compat_writev(f.file, vec, vlen, &pos);
> + ret = compat_writev(f.file, vec, vlen, &pos, flags);
> fdput(f);
> return ret;
> }
> @@ -1186,7 +1264,7 @@ COMPAT_SYSCALL_DEFINE4(pwritev64, unsigned long, fd,
> const struct compat_iovec __user *,vec,
> unsigned long, vlen, loff_t, pos)
> {
> - return __compat_sys_pwritev64(fd, vec, vlen, pos);
> + return do_compat_pwritev64(fd, vec, vlen, pos, 0);
> }
> #endif
>
> @@ -1196,8 +1274,21 @@ COMPAT_SYSCALL_DEFINE5(pwritev, compat_ulong_t, fd,
> {
> loff_t pos = ((loff_t)pos_high << 32) | pos_low;
>
> - return __compat_sys_pwritev64(fd, vec, vlen, pos);
> + return do_compat_pwritev64(fd, vec, vlen, pos, 0);
> +}
> +
> +COMPAT_SYSCALL_DEFINE6(pwritev2, compat_ulong_t, fd,
> + const struct compat_iovec __user *,vec,
> + compat_ulong_t, vlen, u32, pos_low, u32, pos_high, int, flags)
> +{
> + loff_t pos = ((loff_t)pos_high << 32) | pos_low;
> +
> + if (pos == -1)
> + return do_compat_writev(fd, vec, vlen, flags);
> +
> + return do_compat_pwritev64(fd, vec, vlen, pos, flags);
> }
> +
> #endif
>
> static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
> diff --git a/include/linux/compat.h b/include/linux/compat.h
> index a76c917..fe4ccd0 100644
> --- a/include/linux/compat.h
> +++ b/include/linux/compat.h
> @@ -340,6 +340,12 @@ asmlinkage ssize_t compat_sys_preadv(compat_ulong_t fd,
> asmlinkage ssize_t compat_sys_pwritev(compat_ulong_t fd,
> const struct compat_iovec __user *vec,
> compat_ulong_t vlen, u32 pos_low, u32 pos_high);
> +asmlinkage ssize_t compat_sys_preadv2(compat_ulong_t fd,
> + const struct compat_iovec __user *vec,
> + compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags);
> +asmlinkage ssize_t compat_sys_pwritev2(compat_ulong_t fd,
> + const struct compat_iovec __user *vec,
> + compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags);
>
> #ifdef __ARCH_WANT_COMPAT_SYS_PREADV64
> asmlinkage long compat_sys_preadv64(unsigned long fd,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 185815c..d795472 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -575,8 +575,14 @@ asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
> size_t count, loff_t pos);
> asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
> unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
> +asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
> + unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
> + int flags);
> asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
> unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
> +asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
> + unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
> + int flags);
> asmlinkage long sys_getcwd(char __user *buf, unsigned long size);
> asmlinkage long sys_mkdir(const char __user *pathname, umode_t mode);
> asmlinkage long sys_chdir(const char __user *filename);
>
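The offset handling in the quoted patch can be tried out in user space. The sketch below mirrors pos_from_hilo() and the "pos == -1" test from the new preadv2/pwritev2 entry points; it uses plain C types instead of loff_t, and uses_current_offset() is a hypothetical helper added only for illustration.

```c
#include <limits.h>

#define HALF_LONG_BITS (sizeof(unsigned long) * CHAR_BIT / 2)

/*
 * Rebuild a 64-bit offset from two unsigned long halves. The shift is done
 * in two steps so that it stays well defined when unsigned long is already
 * 64 bits wide (the high half is then shifted out entirely).
 */
static long long pos_from_hilo(unsigned long high, unsigned long low)
{
	unsigned long long pos;

	pos = ((unsigned long long)high << HALF_LONG_BITS) << HALF_LONG_BITS;
	return (long long)(pos | low);
}

/* Hypothetical helper: does this hi/lo pair select the current file offset? */
static int uses_current_offset(unsigned long high, unsigned long low)
{
	return pos_from_hilo(high, low) == -1;
}
```

Passing all-ones in both halves yields -1 regardless of the width of unsigned long, which is the value the new syscalls treat as "fall back to readv/writev behavior".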
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/6] vfs: vfs: Define new syscalls preadv2,pwritev2
2016-03-10 18:15 ` Michael Kerrisk (man-pages)
@ 2016-03-11 9:53 ` Christoph Hellwig
2016-04-18 13:51 ` Michael Kerrisk (man-pages)
0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2016-03-11 9:53 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Christoph Hellwig, viro, axboe, milosz, linux-fsdevel,
linux-block, linux-api
On Thu, Mar 10, 2016 at 07:15:04PM +0100, Michael Kerrisk (man-pages) wrote:
> Hi Christoph,
>
> On 03/03/2016 04:03 PM, Christoph Hellwig wrote:
> > From: Milosz Tanski <milosz@adfin.com>
> >
> > New syscalls that take a flag argument. No flags are added yet in this
> > patch.
>
> Are there some man pages patches for these proposed system calls?
This is what I have:
---
From d33a02d56f447a6cb223b3964e1dd894f2921d5c Mon Sep 17 00:00:00 2001
From: Milosz Tanski <milosz@adfin.com>
Date: Fri, 11 Mar 2016 10:52:31 +0100
Subject: add preadv2/pwritev2 documentation
New syscalls that are a variation on the preadv/pwritev but support an extra
flag argument.
Signed-off-by: Milosz Tanski <milosz@adfin.com>
[hch: added RWF_HIPRI documentation]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
man2/readv.2 | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 54 insertions(+), 9 deletions(-)
diff --git a/man2/readv.2 b/man2/readv.2
index 93f2b6f..5cba5e2 100644
--- a/man2/readv.2
+++ b/man2/readv.2
@@ -45,6 +45,12 @@ readv, writev, preadv, pwritev \- read or write data into multiple buffers
.sp
.BI "ssize_t pwritev(int " fd ", const struct iovec *" iov ", int " iovcnt ,
.BI " off_t " offset );
+.sp
+.BI "ssize_t preadv2(int " fd ", const struct iovec *" iov ", int " iovcnt ,
+.BI " off_t " offset ", int " flags );
+.sp
+.BI "ssize_t pwritev2(int " fd ", const struct iovec *" iov ", int " iovcnt ,
+.BI " off_t " offset ", int " flags );
.fi
.sp
.in -4n
@@ -166,9 +172,9 @@ The
system call combines the functionality of
.BR writev ()
and
-.BR pwrite (2).
+.BR pwrite (2) "."
It performs the same task as
-.BR writev (),
+.BR writev () ","
but adds a fourth argument,
.IR offset ,
which specifies the file offset at which the output operation
@@ -178,15 +184,43 @@ The file offset is not changed by these system calls.
The file referred to by
.I fd
must be capable of seeking.
+.SS preadv2() and pwritev2()
+
+This pair of system calls has similar functionality to the
+.BR preadv ()
+and
+.BR pwritev ()
+calls, but adds a fifth argument, \fIflags\fP, which modifies the behavior on a per call basis.
+
+Like the
+.BR preadv ()
+and
+.BR pwritev ()
+calls, they accept an \fIoffset\fP argument. Unlike those calls, if the \fIoffset\fP argument is set to -1 then the current file offset is used and updated.
+
+The \fIflags\fP arguments to
+.BR preadv2 ()
+and
+.BR pwritev2 ()
+contains a bitwise OR of one or more of the following flags:
+.TP
+.BR RWF_HIPRI " (since Linux 4.6)"
+High priority read/write. Allows block based filesystems to use polling of the
+device, which provides lower latency, but may use additional resources. (Currently
+only usable on a file descriptor opened using the
+.BR O_DIRECT " flag)."
+
.SH RETURN VALUE
On success,
-.BR readv ()
-and
+.BR readv () ","
.BR preadv ()
-return the number of bytes read;
-.BR writev ()
and
+.BR preadv2 ()
+return the number of bytes read;
+.BR writev () ","
.BR pwritev ()
+and
+.BR pwritev2 ()
return the number of bytes written.
Note that is not an error for a successful call to transfer fewer bytes
@@ -202,9 +236,11 @@ The errors are as given for
and
.BR write (2).
Furthermore,
-.BR preadv ()
-and
+.BR preadv () ","
+.BR preadv2 () ","
.BR pwritev ()
+and
+.BR pwritev2 ()
can also fail for the same reasons as
.BR lseek (2).
Additionally, the following error is defined:
@@ -218,12 +254,17 @@ value.
.TP
.B EINVAL
The vector count \fIiovcnt\fP is less than zero or greater than the
-permitted maximum.
+permitted maximum. Or, an unknown flag is specified in \fIflags\fP.
.SH VERSIONS
.BR preadv ()
and
.BR pwritev ()
first appeared in Linux 2.6.30; library support was added in glibc 2.10.
+.sp
+.BR preadv2 ()
+and
+.BR pwritev2 ()
+first appeared in Linux 4.6
.SH CONFORMING TO
.BR readv (),
.BR writev ():
@@ -237,6 +278,10 @@ POSIX.1-2001, POSIX.1-2008,
.BR preadv (),
.BR pwritev ():
nonstandard, but present also on the modern BSDs.
+.sp
+.BR preadv2 (),
+.BR pwritev2 ():
+nonstandard, Linux extension.
.SH NOTES
POSIX.1 allows an implementation to place a limit on
the number of items that can be passed in
--
2.1.4
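The offset semantics documented in the patch above can be illustrated on the write side. This is a minimal sketch assuming a C library that exposes a pwritev2() wrapper; write_two() is a hypothetical helper that gathers two strings into one call.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: write strings a and b with a single gather write.
 * offset == -1: write at the current file offset and advance it.
 * offset >= 0:  write at that position and leave the file offset alone.
 */
static ssize_t write_two(int fd, const char *a, const char *b, off_t offset)
{
	struct iovec iov[2] = {
		{ .iov_base = (void *)a, .iov_len = strlen(a) },
		{ .iov_base = (void *)b, .iov_len = strlen(b) },
	};

	return pwritev2(fd, iov, 2, offset, 0);
}
```

Calling write_two(fd, "hello ", "world", -1) on a fresh file should behave like writev() and move the file offset by 11 bytes, while a later write_two(fd, "HE", "LL", 0) overwrites the first four bytes without moving it.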
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH 2/6] vfs: vfs: Define new syscalls preadv2,pwritev2
2016-03-11 9:53 ` Christoph Hellwig
@ 2016-04-18 13:51 ` Michael Kerrisk (man-pages)
2016-04-25 8:47 ` Christoph Hellwig
0 siblings, 1 reply; 20+ messages in thread
From: Michael Kerrisk (man-pages) @ 2016-04-18 13:51 UTC (permalink / raw)
To: Christoph Hellwig
Cc: mtk.manpages, viro, axboe, milosz, linux-fsdevel, linux-block,
linux-api
Hello Christoph,
On 03/11/2016 09:53 AM, Christoph Hellwig wrote:
> On Thu, Mar 10, 2016 at 07:15:04PM +0100, Michael Kerrisk (man-pages) wrote:
>> Hi Christoph,
>>
>> On 03/03/2016 04:03 PM, Christoph Hellwig wrote:
>>> From: Milosz Tanski <milosz@adfin.com>
>>>
>>> New syscalls that take a flag argument. No flags are added yet in this
>>> patch.
>>
>> Are there some man pages patches for these proposed system calls?
>
> This is what I have:
Thanks. I applied the patch, but I see one point where the doc
and code differ, and I suspect that the code needs to be fixed.
See below.
> ---
> From d33a02d56f447a6cb223b3964e1dd894f2921d5c Mon Sep 17 00:00:00 2001
> From: Milosz Tanski <milosz@adfin.com>
> Date: Fri, 11 Mar 2016 10:52:31 +0100
> Subject: add preadv2/pwritev2 documentation
>
> New syscalls that are a variation on the preadv/pwritev but support an extra
> flag argument.
>
> Signed-off-by: Milosz Tanski <milosz@adfin.com>
> [hch: added RWF_HIPRI documentation]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> man2/readv.2 | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 54 insertions(+), 9 deletions(-)
>
> diff --git a/man2/readv.2 b/man2/readv.2
> index 93f2b6f..5cba5e2 100644
> --- a/man2/readv.2
> +++ b/man2/readv.2
> @@ -45,6 +45,12 @@ readv, writev, preadv, pwritev \- read or write data into multiple buffers
> .sp
> .BI "ssize_t pwritev(int " fd ", const struct iovec *" iov ", int " iovcnt ,
> .BI " off_t " offset );
> +.sp
> +.BI "ssize_t preadv2(int " fd ", const struct iovec *" iov ", int " iovcnt ,
> +.BI " off_t " offset ", int " flags );
> +.sp
> +.BI "ssize_t pwritev2(int " fd ", const struct iovec *" iov ", int " iovcnt ,
> +.BI " off_t " offset ", int " flags );
> .fi
> .sp
> .in -4n
> @@ -166,9 +172,9 @@ The
> system call combines the functionality of
> .BR writev ()
> and
> -.BR pwrite (2).
> +.BR pwrite (2) "."
> It performs the same task as
> -.BR writev (),
> +.BR writev () ","
> but adds a fourth argument,
> .IR offset ,
> which specifies the file offset at which the output operation
> @@ -178,15 +184,43 @@ The file offset is not changed by these system calls.
> The file referred to by
> .I fd
> must be capable of seeking.
> +.SS preadv2() and pwritev2()
> +
> +This pair of system calls has similar functionality to the
> +.BR preadv ()
> +and
> +.BR pwritev ()
> +calls, but adds a fifth argument, \fIflags\fP, which modifies the behavior on a per call basis.
> +
> +Like the
> +.BR preadv ()
> +and
> +.BR pwritev ()
> +calls, they accept an \fIoffset\fP argument. Unlike those calls, if the \fIoffset\fP argument is set to -1 then the current file offset is used and updated.
> +
> +The \fIflags\fP arguments to
> +.BR preadv2 ()
> +and
> +.BR pwritev2 ()
> +contains a bitwise OR of one or more of the following flags:
> +.TP
> +.BR RWF_HIPRI " (since Linux 4.6)"
> +High priority read/write. Allows block based filesystems to use polling of the
> +device, which provides lower latency, but may use additional resources. (Currently
> +only usable on a file descriptor opened using the
> +.BR O_DIRECT " flag)."
> +
> .SH RETURN VALUE
> On success,
> -.BR readv ()
> -and
> +.BR readv () ","
> .BR preadv ()
> -return the number of bytes read;
> -.BR writev ()
> and
> +.BR preadv2 ()
> +return the number of bytes read;
> +.BR writev () ","
> .BR pwritev ()
> +and
> +.BR pwritev2 ()
> return the number of bytes written.
>
> Note that is not an error for a successful call to transfer fewer bytes
> @@ -202,9 +236,11 @@ The errors are as given for
> and
> .BR write (2).
> Furthermore,
> -.BR preadv ()
> -and
> +.BR preadv () ","
> +.BR preadv2 () ","
> .BR pwritev ()
> +and
> +.BR pwritev2 ()
> can also fail for the same reasons as
> .BR lseek (2).
> Additionally, the following error is defined:
> @@ -218,12 +254,17 @@ value.
> .TP
> .B EINVAL
> The vector count \fIiovcnt\fP is less than zero or greater than the
> -permitted maximum.
> +permitted maximum. Or, an unknown flag is specified in \fIflags\fP.
In the case described in the last sentence, the code currently appears
to be returning EOPNOTSUPP. EINVAL is more usual here, so I think the
code needs adjusting. Your thoughts?
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/6] vfs: vfs: Define new syscalls preadv2,pwritev2
2016-04-18 13:51 ` Michael Kerrisk (man-pages)
@ 2016-04-25 8:47 ` Christoph Hellwig
2016-04-25 17:35 ` Michael Kerrisk (man-pages)
0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2016-04-25 8:47 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Christoph Hellwig, viro, axboe, milosz, linux-fsdevel,
linux-block, linux-api
On Mon, Apr 18, 2016 at 02:51:50PM +0100, Michael Kerrisk (man-pages) wrote:
> Thanks. I applied the patch, but I see one point where the doc
> and code differ, and I suspect that the code needs to be fixed.
> See below.
> > .TP
> > .B EINVAL
> > The vector count \fIiovcnt\fP is less than zero or greater than the
> > -permitted maximum.
> > +permitted maximum. Or, an unknown flag is specified in \fIflags\fP.
>
> In the case described in the last sentence, the code currently appears
> to be returning EOPNOTSUPP. EINVAL is more usual here, so I think the
> code needs adjusting. Your thoughts?
I'd rather update the man page - EOPNOTSUPP is a much more descriptive
error code for this case. I'll send you a patch.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/6] vfs: vfs: Define new syscalls preadv2,pwritev2
2016-04-25 8:47 ` Christoph Hellwig
@ 2016-04-25 17:35 ` Michael Kerrisk (man-pages)
2016-05-08 9:29 ` Christoph Hellwig
0 siblings, 1 reply; 20+ messages in thread
From: Michael Kerrisk (man-pages) @ 2016-04-25 17:35 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Christoph Hellwig, Alexander Viro, Jens Axboe, Milosz Tanski,
linux-fsdevel@vger.kernel.org, linux-block, Linux API
Hi Christoph,
On 25 April 2016 at 10:47, Christoph Hellwig <hch@infradead.org> wrote:
> On Mon, Apr 18, 2016 at 02:51:50PM +0100, Michael Kerrisk (man-pages) wrote:
>> Thanks. I applied the patch, but I see one point where the doc
>> and code differ, and I suspect that the code needs to be fixed.
>> See below.
>
>> > .TP
>> > .B EINVAL
>> > The vector count \fIiovcnt\fP is less than zero or greater than the
>> > -permitted maximum.
>> > +permitted maximum. Or, an unknown flag is specified in \fIflags\fP.
>>
>> In the case described in the last sentence, the code currently appears
>> to be returning EOPNOTSUPP. EINVAL is more usual here, so I think the
>> code needs adjusting. Your thoughts?
>
> I'd rather update the man page - EOPNOTSUPP is a much more descriptive
> error code for this case. I'll send you a patch.
Unless I'm misunderstanding something here, you're proposing something
very inconsistent. The standard error for unknown flag bits is EINVAL.
This is so for dozens of system calls (check the man pages; you might
find a rare exception, but that's the point, they are exceptions). It
seems to me here that it's really the implementation that needs
fixing, not the man page!
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
* Re: [PATCH 2/6] vfs: vfs: Define new syscalls preadv2,pwritev2
2016-04-25 17:35 ` Michael Kerrisk (man-pages)
@ 2016-05-08 9:29 ` Christoph Hellwig
0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2016-05-08 9:29 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Christoph Hellwig, Christoph Hellwig, Alexander Viro, Jens Axboe,
Milosz Tanski, linux-fsdevel@vger.kernel.org, linux-block,
Linux API
On Mon, Apr 25, 2016 at 07:35:36PM +0200, Michael Kerrisk (man-pages) wrote:
> > I'd rather update the man page - EOPNOTSUPP is a much more descriptive
> > error code for this case. I'll send you a patch.
>
> Unless I'm misunderstanding something here, you're proposing something
> very inconsistent. The standard error for unknown flag bits is EINVAL.
> This is so for dozens of system calls (check the man pages; you might
> find a rare exception, but that's the point, they are exceptions). It
> seems to me here that it's really the implementation that needs
> fixing, not the man page!
For new filesystem calls we try to use EOPNOTSUPP as much as possible,
e.g. fallocate.
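
Whichever error code ends up documented, a caller that wants to work across kernels can treat both EOPNOTSUPP and EINVAL as "flags not accepted". A minimal sketch, assuming a C library that exposes preadv2() under _GNU_SOURCE; the helper name preadv2_compat is hypothetical and not part of this series:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper (not from the patches): try preadv2() with the
 * requested flags; if the flags are rejected with either of the two
 * error codes discussed in this thread, retry once with no flags.
 */
static ssize_t preadv2_compat(int fd, const struct iovec *iov, int iovcnt,
                              off_t offset, int flags)
{
    ssize_t ret = preadv2(fd, iov, iovcnt, offset, flags);

    /* EINVAL can also mean a bad iovcnt; the retry then fails the same way. */
    if (ret < 0 && flags != 0 && (errno == EOPNOTSUPP || errno == EINVAL))
        ret = preadv2(fd, iov, iovcnt, offset, 0);

    return ret;
}
```

Dropping the flags silently is only reasonable for advisory flags; a caller that depends on a flag's semantics would want to report the error instead.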
* Re: [PATCH 4/6] vfs: add the RWF_HIPRI flag for preadv2/pwritev2
2016-03-03 15:04 ` [PATCH 4/6] vfs: add the RWF_HIPRI flag for preadv2/pwritev2 Christoph Hellwig
@ 2016-05-08 21:47 ` NeilBrown
2016-05-11 8:55 ` Christoph Hellwig
0 siblings, 1 reply; 20+ messages in thread
From: NeilBrown @ 2016-05-08 21:47 UTC (permalink / raw)
To: Christoph Hellwig, viro, axboe
Cc: milosz, linux-fsdevel, linux-block, linux-api
On Fri, Mar 04 2016, Christoph Hellwig wrote:
> This adds a flag that tells the file system that this is a high priority
> request for which it's worth polling the hardware. The flag is purely
> advisory and can be ignored if not supported.
Here you say the flag is "advice".
>
> +/* flags for preadv2/pwritev2: */
> +#define RWF_HIPRI 0x00000001 /* high priority request, poll if possible */
This text makes it sound like a firm "request" ("if possible").
In the man page posted separately it says:
+.BR RWF_HIPRI " (since Linux 4.6)"
+High priority read/write. Allows block based filesystems to use polling of the
+device, which provides lower latency, but may use additional ressources. (Currently
+only usable on a file descriptor opened using the
+.BR O_DIRECT " flag)."
So now it "allows", which is different again.
The differences may be subtle, but consistency is nice.
Also in that man page fragment:
> provides lower latency, but may use additional ressources
Is this a "latency vs throughput" trade-off, or something more subtle?
It would be nice to make the decision process as obvious as possible for
the developer considering the use of this flag.
(and s/ressources/resources/)
NeilBrown
* Re: [PATCH 4/6] vfs: add the RWF_HIPRI flag for preadv2/pwritev2
2016-05-08 21:47 ` NeilBrown
@ 2016-05-11 8:55 ` Christoph Hellwig
0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2016-05-11 8:55 UTC (permalink / raw)
To: NeilBrown
Cc: Christoph Hellwig, viro, axboe, milosz, linux-fsdevel,
linux-block, linux-api
On Mon, May 09, 2016 at 07:47:04AM +1000, NeilBrown wrote:
> On Fri, Mar 04 2016, Christoph Hellwig wrote:
>
> > This adds a flag that tells the file system that this is a high priority
> > request for which it's worth polling the hardware. The flag is purely
> > advisory and can be ignored if not supported.
>
> Here you say the flag is "advice".
>
> >
> > +/* flags for preadv2/pwritev2: */
> > +#define RWF_HIPRI 0x00000001 /* high priority request, poll if possible */
>
> This text makes it sound like a firm "request" ("if possible").
"request" here is in the sense of an I/O request. Better wording
highly welcome.
>
> > provides lower latency, but may use additional ressources
>
> Is this a "latency vs throughput" trade-off, or something more subtle?
> It would be nice to make the decision process as obvious as possible for
> the developer considering the use of this flag.
If you poll you can't do anything else, so you end up using CPU
cycles to wait which otherwise could do something productive.
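
To make that trade-off concrete for a caller, here is a rough sketch of passing RWF_HIPRI only for reads where latency matters more than the CPU time spent polling. The helper name read_at and its latency_sensitive parameter are hypothetical; the fallback #define reuses the value from the patch quoted above for C libraries that do not define the flag yet:

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_HIPRI
#define RWF_HIPRI 0x00000001 /* value from the patch quoted above */
#endif

/*
 * Hypothetical helper: latency_sensitive is the caller's own policy knob.
 * Per the man page fragment, polling currently only applies to O_DIRECT
 * file descriptors; elsewhere the flag is advisory and may be ignored.
 */
static ssize_t read_at(int fd, void *buf, size_t len, off_t off,
                       int latency_sensitive)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    int flags = latency_sensitive ? RWF_HIPRI : 0;

    return preadv2(fd, &iov, 1, off, flags);
}
```

A thread that has other useful work to do while the I/O is in flight would pass 0 here, since polling burns the cycles it could have spent on that work.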
Thread overview: 20+ messages
2016-03-03 15:03 generic RDMA READ/WRITE API V2 Christoph Hellwig
2016-03-03 15:03 ` [PATCH 1/6] vfs: pass a flags argument to vfs_readv/vfs_writev Christoph Hellwig
2016-03-03 15:03 ` [PATCH 2/6] vfs: vfs: Define new syscalls preadv2,pwritev2 Christoph Hellwig
2016-03-10 18:15 ` Michael Kerrisk (man-pages)
2016-03-11 9:53 ` Christoph Hellwig
2016-04-18 13:51 ` Michael Kerrisk (man-pages)
2016-04-25 8:47 ` Christoph Hellwig
2016-04-25 17:35 ` Michael Kerrisk (man-pages)
2016-05-08 9:29 ` Christoph Hellwig
2016-03-03 15:04 ` [PATCH 3/6] x86: wire up preadv2 and pwritev2 Christoph Hellwig
2016-03-03 15:04 ` [PATCH 4/6] vfs: add the RWF_HIPRI flag for preadv2/pwritev2 Christoph Hellwig
2016-05-08 21:47 ` NeilBrown
2016-05-11 8:55 ` Christoph Hellwig
2016-03-03 15:04 ` [PATCH 5/6] direct-io: only use block polling if explicitly requested Christoph Hellwig
2016-03-03 15:04 ` [PATCH 6/6] blk-mq: enable polling support by default Christoph Hellwig
2016-03-03 15:09 ` generic RDMA READ/WRITE API V2 Sagi Grimberg
2016-03-03 15:11 ` selective block polling and preadv2/pwritev2 revisited V3 Christoph Hellwig
2016-03-03 15:16 ` Jens Axboe
2016-03-03 15:52 ` Arnd Bergmann
2016-03-03 16:11 ` Christoph Hellwig