* [PATCH 1/5] splice: move balance_dirty_pages_ratelimited into pipe_to_file
2013-12-12 18:14 [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig
@ 2013-12-12 18:15 ` Christoph Hellwig
2013-12-12 18:15 ` [PATCH 2/5] splice: nest i_mutex outside pipe_lock Christoph Hellwig
` (4 subsequent siblings)
5 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2013-12-12 18:15 UTC (permalink / raw)
To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs
[-- Attachment #1: 0001-splice-move-balance_dirty_pages_ratelimited-into-pip.patch --]
[-- Type: text/plain, Size: 1205 bytes --]
Try to balance the dirty pages for every written page instead of once
per system call, mirroring the regular write path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/ocfs2/file.c | 2 --
fs/splice.c | 5 ++++-
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 6fff128..a77ef6e 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2494,8 +2494,6 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe,
ret = err;
else
*ppos += ret;
-
- balance_dirty_pages_ratelimited(mapping);
}
return ret;
diff --git a/fs/splice.c b/fs/splice.c
index 46a08f7..fcb459d 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -759,6 +759,10 @@ int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
}
ret = pagecache_write_end(file, mapping, sd->pos, this_len, this_len,
page, fsdata);
+ if (ret < 0)
+ goto out;
+
+ balance_dirty_pages_ratelimited(mapping);
out:
return ret;
}
@@ -1034,7 +1038,6 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
ret = err;
else
*ppos += ret;
- balance_dirty_pages_ratelimited(mapping);
}
return ret;
--
1.7.10.4
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH 2/5] splice: nest i_mutex outside pipe_lock
2013-12-12 18:14 [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig
2013-12-12 18:15 ` [PATCH 1/5] splice: move balance_dirty_pages_ratelimited into pipe_to_file Christoph Hellwig
@ 2013-12-12 18:15 ` Christoph Hellwig
2013-12-12 18:15 ` [PATCH 3/5] splice: use splice_from_pipe in generic_file_splice_write Christoph Hellwig
` (3 subsequent siblings)
5 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2013-12-12 18:15 UTC (permalink / raw)
To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs
[-- Attachment #1: 0002-splice-nest-i_mutex-outside-pipe_lock.patch --]
[-- Type: text/plain, Size: 2961 bytes --]
I can't find any obvious reason why we would want to nest i_mutex
inside the pipe lock, but two good reasons speak against it:
- with i_mutex inside the pipe lock we have to unlock it every time we
iterate to the next buffer, which prevents i_mutex from guaranteeing
write atomicity similar to write(2), and is also inefficient for
filesystems like ocfs2 that have additional cluster locking alongside
i_mutex.
- the ordering of pipe_lock outside i_mutex makes it very hard to
share code with filesystems that have additional inode-wide locks
that need to be taken in the right order with i_mutex.
In addition to changing the lock order this patch also removes the
useless I_MUTEX_CHILD annotations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/ocfs2/file.c | 20 +++++++++++---------
fs/splice.c | 5 +++--
2 files changed, 14 insertions(+), 11 deletions(-)
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index a77ef6e..c68e111 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2461,6 +2461,12 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe,
out->f_path.dentry->d_name.len,
out->f_path.dentry->d_name.name, len);
+ mutex_lock(&inode->i_mutex);
+ ret = ocfs2_rw_lock(inode, 1);
+ if (ret < 0) {
+ mlog_errno(ret);
+ goto out_unlock_inode;
+ }
pipe_lock(pipe);
splice_from_pipe_begin(&sd);
@@ -2469,20 +2475,16 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe,
if (ret <= 0)
break;
- mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD);
- ret = ocfs2_rw_lock(inode, 1);
- if (ret < 0)
- mlog_errno(ret);
- else {
- ret = ocfs2_splice_to_file(pipe, out, &sd);
- ocfs2_rw_unlock(inode, 1);
- }
- mutex_unlock(&inode->i_mutex);
+ ret = ocfs2_splice_to_file(pipe, out, &sd);
} while (ret > 0);
splice_from_pipe_end(pipe, &sd);
pipe_unlock(pipe);
+ ocfs2_rw_unlock(inode, 1);
+out_unlock_inode:
+ mutex_unlock(&inode->i_mutex);
+
if (sd.num_spliced)
ret = sd.num_spliced;
diff --git a/fs/splice.c b/fs/splice.c
index fcb459d..4fb6c1f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1005,6 +1005,8 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
};
ssize_t ret;
+ mutex_lock(&inode->i_mutex);
+
pipe_lock(pipe);
splice_from_pipe_begin(&sd);
@@ -1013,7 +1015,6 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
if (ret <= 0)
break;
- mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD);
ret = file_remove_suid(out);
if (!ret) {
ret = file_update_time(out);
@@ -1021,11 +1022,11 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
ret = splice_from_pipe_feed(pipe, &sd,
pipe_to_file);
}
- mutex_unlock(&inode->i_mutex);
} while (ret > 0);
splice_from_pipe_end(pipe, &sd);
pipe_unlock(pipe);
+ mutex_unlock(&inode->i_mutex);
if (sd.num_spliced)
ret = sd.num_spliced;
--
1.7.10.4
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH 3/5] splice: use splice_from_pipe in generic_file_splice_write
2013-12-12 18:14 [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig
2013-12-12 18:15 ` [PATCH 1/5] splice: move balance_dirty_pages_ratelimited into pipe_to_file Christoph Hellwig
2013-12-12 18:15 ` [PATCH 2/5] splice: nest i_mutex outside pipe_lock Christoph Hellwig
@ 2013-12-12 18:15 ` Christoph Hellwig
2013-12-12 18:15 ` [PATCH 4/5] xfs: fix splice_write locking Christoph Hellwig
` (2 subsequent siblings)
5 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2013-12-12 18:15 UTC (permalink / raw)
To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs
[-- Attachment #1: 0003-splice-use-splice_from_pipe-in-generic_file_splice_w.patch --]
[-- Type: text/plain, Size: 5600 bytes --]
Reuse splice_from_pipe instead of reimplementing it in the splice
write path, both in the generic code and in ocfs2. This only needs a
little bit of refactoring of the actors to provide the desired functionality.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/ocfs2/file.c | 58 ++++++++++-----------------------------------
fs/splice.c | 61 +++++++++++++++++-------------------------------
include/linux/splice.h | 2 ++
3 files changed, 36 insertions(+), 85 deletions(-)
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index c68e111..5b1b5f9 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2423,38 +2423,14 @@ out_sems:
return ret;
}
-static int ocfs2_splice_to_file(struct pipe_inode_info *pipe,
- struct file *out,
- struct splice_desc *sd)
-{
- int ret;
-
- ret = ocfs2_prepare_inode_for_write(out, &sd->pos,
- sd->total_len, 0, NULL, NULL);
- if (ret < 0) {
- mlog_errno(ret);
- return ret;
- }
-
- return splice_from_pipe_feed(pipe, sd, pipe_to_file);
-}
-
static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe,
struct file *out,
loff_t *ppos,
size_t len,
unsigned int flags)
{
- int ret;
- struct address_space *mapping = out->f_mapping;
- struct inode *inode = mapping->host;
- struct splice_desc sd = {
- .total_len = len,
- .flags = flags,
- .pos = *ppos,
- .u.file = out,
- };
-
+ struct inode *inode = out->f_mapping->host;
+ ssize_t ret;
trace_ocfs2_file_splice_write(inode, out, out->f_path.dentry,
(unsigned long long)OCFS2_I(inode)->ip_blkno,
@@ -2467,35 +2443,25 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe,
mlog_errno(ret);
goto out_unlock_inode;
}
- pipe_lock(pipe);
- splice_from_pipe_begin(&sd);
- do {
- ret = splice_from_pipe_next(pipe, &sd);
- if (ret <= 0)
- break;
-
- ret = ocfs2_splice_to_file(pipe, out, &sd);
- } while (ret > 0);
- splice_from_pipe_end(pipe, &sd);
+ ret = ocfs2_prepare_inode_for_write(out, ppos, len, 0, NULL, NULL);
+ if (ret < 0) {
+ mlog_errno(ret);
+ goto out_unlock_rw;
+ }
- pipe_unlock(pipe);
+ ret = splice_from_pipe(pipe, out, ppos, len, flags, __pipe_to_file);
+out_unlock_rw:
ocfs2_rw_unlock(inode, 1);
out_unlock_inode:
mutex_unlock(&inode->i_mutex);
- if (sd.num_spliced)
- ret = sd.num_spliced;
-
if (ret > 0) {
- int err;
-
- err = generic_write_sync(out, *ppos, ret);
+ int err = generic_write_sync(out, *ppos, ret);
if (err)
- ret = err;
- else
- *ppos += ret;
+ return err;
+ *ppos += ret;
}
return ret;
diff --git a/fs/splice.c b/fs/splice.c
index 4fb6c1f..108e527 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -727,7 +727,7 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
* SPLICE_F_MOVE isn't set, or we cannot move the page, we simply create
* a new page in the output file page cache and fill/dirty that.
*/
-int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
+int __pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
struct splice_desc *sd)
{
struct file *file = sd->u.file;
@@ -766,6 +766,22 @@ int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
out:
return ret;
}
+EXPORT_SYMBOL(__pipe_to_file);
+
+int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
+ struct splice_desc *sd)
+{
+ struct file *file = sd->u.file;
+ int ret;
+
+ ret = file_remove_suid(file);
+ if (ret)
+ return ret;
+ ret = file_update_time(file);
+ if (ret)
+ return ret;
+ return __pipe_to_file(pipe, buf, sd);
+}
EXPORT_SYMBOL(pipe_to_file);
static void wakeup_pipe_writers(struct pipe_inode_info *pipe)
@@ -995,55 +1011,22 @@ ssize_t
generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
loff_t *ppos, size_t len, unsigned int flags)
{
- struct address_space *mapping = out->f_mapping;
- struct inode *inode = mapping->host;
- struct splice_desc sd = {
- .total_len = len,
- .flags = flags,
- .pos = *ppos,
- .u.file = out,
- };
+ struct inode *inode = out->f_mapping->host;
ssize_t ret;
mutex_lock(&inode->i_mutex);
-
- pipe_lock(pipe);
-
- splice_from_pipe_begin(&sd);
- do {
- ret = splice_from_pipe_next(pipe, &sd);
- if (ret <= 0)
- break;
-
- ret = file_remove_suid(out);
- if (!ret) {
- ret = file_update_time(out);
- if (!ret)
- ret = splice_from_pipe_feed(pipe, &sd,
- pipe_to_file);
- }
- } while (ret > 0);
- splice_from_pipe_end(pipe, &sd);
-
- pipe_unlock(pipe);
+ ret = splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_file);
mutex_unlock(&inode->i_mutex);
- if (sd.num_spliced)
- ret = sd.num_spliced;
-
if (ret > 0) {
- int err;
-
- err = generic_write_sync(out, *ppos, ret);
+ int err = generic_write_sync(out, *ppos, ret);
if (err)
- ret = err;
- else
- *ppos += ret;
+ return err;
+ *ppos += ret;
}
return ret;
}
-
EXPORT_SYMBOL(generic_file_splice_write);
static int write_pipe_buf(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 74575cb..c5aca88 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -78,6 +78,8 @@ extern void splice_from_pipe_end(struct pipe_inode_info *,
struct splice_desc *);
extern int pipe_to_file(struct pipe_inode_info *, struct pipe_buffer *,
struct splice_desc *);
+extern int __pipe_to_file(struct pipe_inode_info *, struct pipe_buffer *,
+ struct splice_desc *);
extern ssize_t splice_to_pipe(struct pipe_inode_info *,
struct splice_pipe_desc *);
--
1.7.10.4
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH 4/5] xfs: fix splice_write locking
2013-12-12 18:14 [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig
` (2 preceding siblings ...)
2013-12-12 18:15 ` [PATCH 3/5] splice: use splice_from_pipe in generic_file_splice_write Christoph Hellwig
@ 2013-12-12 18:15 ` Christoph Hellwig
2013-12-12 18:15 ` [PATCH 5/5] splice: stop exporting splice_from_pipe implementation details Christoph Hellwig
2014-01-13 14:14 ` [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig
5 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2013-12-12 18:15 UTC (permalink / raw)
To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs
[-- Attachment #1: 0004-xfs-fix-splice_write-locking.patch --]
[-- Type: text/plain, Size: 1905 bytes --]
Call splice_from_pipe directly from XFS so that we can do our normal
I/O path locking. This fixes a rarely hit deadlock against direct I/O
reported by Dave Chinner a long time ago.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/xfs_file.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 52c91e1..9d0da98 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -43,6 +43,7 @@
#include <linux/dcache.h>
#include <linux/falloc.h>
#include <linux/pagevec.h>
+#include <linux/splice.h>
static const struct vm_operations_struct xfs_file_vm_ops;
@@ -348,14 +349,6 @@ xfs_file_splice_read(
return ret;
}
-/*
- * xfs_file_splice_write() does not use xfs_rw_ilock() because
- * generic_file_splice_write() takes the i_mutex itself. This, in theory,
- * couuld cause lock inversions between the aio_write path and the splice path
- * if someone is doing concurrent splice(2) based writes and write(2) based
- * writes to the same inode. The only real way to fix this is to re-implement
- * the generic code here with correct locking orders.
- */
STATIC ssize_t
xfs_file_splice_write(
struct pipe_inode_info *pipe,
@@ -377,15 +370,22 @@ xfs_file_splice_write(
if (XFS_FORCED_SHUTDOWN(ip->i_mount))
return -EIO;
- xfs_ilock(ip, XFS_IOLOCK_EXCL);
-
+ xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
trace_xfs_file_splice_write(ip, count, *ppos, ioflags);
+ ret = splice_from_pipe(pipe, outfilp, ppos, count, flags, pipe_to_file);
+ xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
+
+ if (ret > 0) {
+ int err;
- ret = generic_file_splice_write(pipe, outfilp, ppos, count, flags);
- if (ret > 0)
XFS_STATS_ADD(xs_write_bytes, ret);
- xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+ err = generic_write_sync(outfilp, *ppos, ret);
+ if (err)
+ return err;
+ *ppos += ret;
+ }
+
return ret;
}
--
1.7.10.4
^ permalink raw reply related [flat|nested] 49+ messages in thread
* [PATCH 5/5] splice: stop exporting splice_from_pipe implementation details
2013-12-12 18:14 [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig
` (3 preceding siblings ...)
2013-12-12 18:15 ` [PATCH 4/5] xfs: fix splice_write locking Christoph Hellwig
@ 2013-12-12 18:15 ` Christoph Hellwig
2014-01-13 14:14 ` [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig
5 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2013-12-12 18:15 UTC (permalink / raw)
To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs
[-- Attachment #1: 0005-splice-stop-exporting-splice_from_pipe-implementatio.patch --]
[-- Type: text/plain, Size: 4280 bytes --]
The splice_from_pipe and __splice_from_pipe helpers with the various
actors are everything consumers need to implement splice properly,
so stop exporting the low-level helpers that are used to implement
these functions, or, in two cases where they were trivial, remove
the helpers entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/splice.c | 49 +++++++++---------------------------------------
include/linux/splice.h | 7 -------
2 files changed, 9 insertions(+), 47 deletions(-)
diff --git a/fs/splice.c b/fs/splice.c
index 108e527..8627a85 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -812,8 +812,8 @@ static void wakeup_pipe_writers(struct pipe_inode_info *pipe)
* locking is required around copying the pipe buffers to the
* destination.
*/
-int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_desc *sd,
- splice_actor *actor)
+static int splice_from_pipe_feed(struct pipe_inode_info *pipe,
+ struct splice_desc *sd, splice_actor *actor)
{
int ret;
@@ -859,7 +859,6 @@ int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_desc *sd,
return 1;
}
-EXPORT_SYMBOL(splice_from_pipe_feed);
/**
* splice_from_pipe_next - wait for some data to splice from
@@ -871,7 +870,8 @@ EXPORT_SYMBOL(splice_from_pipe_feed);
* value (one) if pipe buffers are available. It will return zero
* or -errno if no more data needs to be spliced.
*/
-int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_desc *sd)
+static int splice_from_pipe_next(struct pipe_inode_info *pipe,
+ struct splice_desc *sd)
{
while (!pipe->nrbufs) {
if (!pipe->writers)
@@ -896,40 +896,6 @@ int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_desc *sd)
return 1;
}
-EXPORT_SYMBOL(splice_from_pipe_next);
-
-/**
- * splice_from_pipe_begin - start splicing from pipe
- * @sd: information about the splice operation
- *
- * Description:
- * This function should be called before a loop containing
- * splice_from_pipe_next() and splice_from_pipe_feed() to
- * initialize the necessary fields of @sd.
- */
-void splice_from_pipe_begin(struct splice_desc *sd)
-{
- sd->num_spliced = 0;
- sd->need_wakeup = false;
-}
-EXPORT_SYMBOL(splice_from_pipe_begin);
-
-/**
- * splice_from_pipe_end - finish splicing from pipe
- * @pipe: pipe to splice from
- * @sd: information about the splice operation
- *
- * Description:
- * This function will wake up pipe writers if necessary. It should
- * be called after a loop containing splice_from_pipe_next() and
- * splice_from_pipe_feed().
- */
-void splice_from_pipe_end(struct pipe_inode_info *pipe, struct splice_desc *sd)
-{
- if (sd->need_wakeup)
- wakeup_pipe_writers(pipe);
-}
-EXPORT_SYMBOL(splice_from_pipe_end);
/**
* __splice_from_pipe - splice data from a pipe to given actor
@@ -949,14 +915,17 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
{
int ret;
- splice_from_pipe_begin(sd);
+ sd->num_spliced = 0;
+ sd->need_wakeup = false;
+
do {
ret = splice_from_pipe_next(pipe, sd);
if (ret > 0)
ret = splice_from_pipe_feed(pipe, sd, actor);
} while (ret > 0);
- splice_from_pipe_end(pipe, sd);
+ if (sd->need_wakeup)
+ wakeup_pipe_writers(pipe);
return sd->num_spliced ? sd->num_spliced : ret;
}
EXPORT_SYMBOL(__splice_from_pipe);
diff --git a/include/linux/splice.h b/include/linux/splice.h
index c5aca88..0dabcc7 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -69,13 +69,6 @@ extern ssize_t splice_from_pipe(struct pipe_inode_info *, struct file *,
splice_actor *);
extern ssize_t __splice_from_pipe(struct pipe_inode_info *,
struct splice_desc *, splice_actor *);
-extern int splice_from_pipe_feed(struct pipe_inode_info *, struct splice_desc *,
- splice_actor *);
-extern int splice_from_pipe_next(struct pipe_inode_info *,
- struct splice_desc *);
-extern void splice_from_pipe_begin(struct splice_desc *);
-extern void splice_from_pipe_end(struct pipe_inode_info *,
- struct splice_desc *);
extern int pipe_to_file(struct pipe_inode_info *, struct pipe_buffer *,
struct splice_desc *);
extern int __pipe_to_file(struct pipe_inode_info *, struct pipe_buffer *,
--
1.7.10.4
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2013-12-12 18:14 [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig
` (4 preceding siblings ...)
2013-12-12 18:15 ` [PATCH 5/5] splice: stop exporting splice_from_pipe implementation details Christoph Hellwig
@ 2014-01-13 14:14 ` Christoph Hellwig
2014-01-13 23:56 ` Al Viro
5 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2014-01-13 14:14 UTC (permalink / raw)
To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs
ping? Would be nice to get this into 3.14
On Thu, Dec 12, 2013 at 10:14:59AM -0800, Christoph Hellwig wrote:
> I've been trying to fix the old splice iolock lock inversion issue in XFS
> and started looking over the splice code a little more for it. It seems
> like the root of all evil is that we try to nest i_mutex inside the
> pipe_lock instead of outside of it, and I can't find any good reason for
> that. Does anyone remember why it went this way initially?
>
> By fixing that and a few minor issues we can not only fix this issue nicely
> in XFS, but also get rid of various bits of code duplication, and poking into
> splice internals by the ocfs2 splice_write path.
>
> Btw, does anyone have a good test suite for splice functionality? xfstests
> coverage exists but is not very extensive.
>
> b/fs/ocfs2/file.c | 2
> b/fs/splice.c | 5 +-
> b/fs/xfs/xfs_file.c | 26 +++++-----
> b/include/linux/splice.h | 2
> fs/ocfs2/file.c | 78 +++++++++----------------------
> fs/splice.c | 115 +++++++++++++----------------------------------
> include/linux/splice.h | 7 --
> 7 files changed, 76 insertions(+), 159 deletions(-)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
---end quoted text---
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2014-01-13 14:14 ` [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig
@ 2014-01-13 23:56 ` Al Viro
2014-01-14 13:22 ` Christoph Hellwig
0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2014-01-13 23:56 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs
On Mon, Jan 13, 2014 at 06:14:16AM -0800, Christoph Hellwig wrote:
> ping? Would be nice to get this into 3.14
Umm... The reason for pipe_lock outside of ->i_mutex is this:
default_file_splice_write() calls splice_from_pipe() with
write_pipe_buf for callback. splice_from_pipe() calls that
callback under pipe_lock(pipe). And write_pipe_buf() calls
__kernel_write(), which certainly might want to take ->i_mutex.
Now, this codepath isn't taken for files that have non-NULL
->splice_write(), so that's not an issue for XFS and OCFS2,
but having pipe_lock nest between the ->i_mutex for filesystems
that do and do not have ->splice_write()... Ouch...
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2014-01-13 23:56 ` Al Viro
@ 2014-01-14 13:22 ` Christoph Hellwig
2014-01-14 17:20 ` Al Viro
0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2014-01-14 13:22 UTC (permalink / raw)
To: Al Viro
Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker,
linux-fsdevel, xfs
On Mon, Jan 13, 2014 at 11:56:46PM +0000, Al Viro wrote:
> On Mon, Jan 13, 2014 at 06:14:16AM -0800, Christoph Hellwig wrote:
> > ping? Would be nice to get this into 3.14
>
> Umm... The reason for pipe_lock outside of ->i_mutex is this:
> default_file_splice_write() calls splice_from_pipe() with
> write_pipe_buf for callback. splice_from_pipe() calls that
> callback under pipe_lock(pipe). And write_pipe_buf() calls
> __kernel_write(), which certainly might want to take ->i_mutex.
>
> Now, this codepath isn't taken for files that have non-NULL
> ->splice_write(), so that's not an issue for XFS and OCFS2,
> but having pipe_lock nest between the ->i_mutex for filesystems
> that do and do not have ->splice_write()... Ouch...
What would be the alternative? Duplicating the code in even more
filesystems to enforce a non-natural locking order for filesystems
actually implementing splice? There don't actually seem to be a whole
lot of real filesystems not implementing splice_write; the prime use
would be for device drivers or synthetic ones. I'm not even sure
how much that fallback gets used in practice.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2014-01-14 13:22 ` Christoph Hellwig
@ 2014-01-14 17:20 ` Al Viro
2014-01-15 18:10 ` Al Viro
2014-01-18 6:40 ` Al Viro
0 siblings, 2 replies; 49+ messages in thread
From: Al Viro @ 2014-01-14 17:20 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs,
Linus Torvalds, Sage Weil, Steve French
On Tue, Jan 14, 2014 at 05:22:07AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 13, 2014 at 11:56:46PM +0000, Al Viro wrote:
> > On Mon, Jan 13, 2014 at 06:14:16AM -0800, Christoph Hellwig wrote:
> > > ping? Would be nice to get this into 3.14
> >
> > Umm... The reason for pipe_lock outside of ->i_mutex is this:
> > default_file_splice_write() calls splice_from_pipe() with
> > write_pipe_buf for callback. splice_from_pipe() calls that
> > callback under pipe_lock(pipe). And write_pipe_buf() calls
> > __kernel_write(), which certainly might want to take ->i_mutex.
> >
> > Now, this codepath isn't taken for files that have non-NULL
> > ->splice_write(), so that's not an issue for XFS and OCFS2,
> > but having pipe_lock nest between the ->i_mutex for filesystems
> > that do and do not have ->splice_write()... Ouch...
>
> What would be the alternative? Duplicating the code in even more
> filesystems to enforce a non-natural locking order for filesystems
> actually implementing splice? There don't actually seem to be a whole
> lot of real filesystems not implementing splice_write; the prime use
> would be for device drivers or synthetic ones. I'm not even sure
> how much that fallback gets used in practice.
Actually, I wonder if splice should be allowed just because destination
has ->write() - for a lot of synthetic files the effect of writes really
depends on boundaries in output stream and I'm not sure if splice to
such a beast makes any sense.
Going through fs/*:
* 9p: messy when we have O_DIRECT on target file. Other than that,
probably could use generic_file_splice_write().
* adfs, affs, afs, bfs, fat, hfs, hfs+, minix, sysv: can use
generic_file_splice_write()
* /proc/fs/afs/*: probably shouldn't allow splice at all
* binfmt_misc: shouldn't allow splice
* btrfs: not sure, O_DIRECT complicates the picture again.
* cachefiles_daemon_write(): probably shouldn't allow splice at all
* ceph: uses generic_file_splice_write(), but I'm not sure whether
it is correct - there are interesting things done by ceph_aio_write() that
do not have any counterparts on that path.
* cifs: trouble; might be switchable to generic, but
there's an interesting bit in the end of its ->aio_write() to consider...
In any case, should be doable as generic_file_splice_write() +
filemap_fdatawrite().
* cifs mounted with strict_io - ouch. Would need ->splice_write() of
its own with really interesting locking order.
* cifs with direct_io - about the same story. Er... Just where
is it doing suid removal in that case, BTW? cifs_user_writev() doesn't
seem to?
* /proc/fs/cifs/*: shouldn't allow splice to it
* coda: needs ->splice_write() with that patch, might or might not
be tricky
* coda misc device: shouldn't allow splice to it
* configfs: shouldn't allow splice to it
* debugfs default fops: blackhole ->write(), might as well offer
blackhole ->splice_write(). Or refuse to allow splice to it, since I'd
bet that most of non-default ones are of the "shouldn't allow splice to it"
kind...
* debugfs bool: shouldn't allow splice to it
* dlm misc devices: shouldn't allow splice to it
* ecryptfs: probably should use generic_file_splice_write()
(incidentally, who the hell has thought it would be a good idea to
put ->iterate (aka ->readdir) into file_operations of a non-directory?)
* ecryptfs misc device: probably shouldn't allow splice to it at all
* eventfd: shouldn't allow splice to it at all
* XIP ext2: needs ->splice_write() with that patch
And that's less than half of fs/*... I'm not saying that the current
situation on the write side is good; hell, just the mess with write/aio_write
alone is ugly enough - we have
* a bunch of file_operations without ->aio_write(); simple enough.
* a bunch with ->write == do_sync_write. Also simple.
* several with NULL ->write and non-NULL ->aio_write(); same as
do_sync_write() for ->write (socket, android/logger, kmsg, macvtap)
* several with ->aio_write being an optimized equivalent of
do_sync_write() (blackhole for /dev/null and /dev/zero, error for bad_inode)
* 9p cached with its "oh, but if we have O_DIRECT we want ->write()
to be different" (why not use a separate file_operations, then? It's not
as if ->open() couldn't switch to it if it sees O_DIRECT...)
* two infinibad things (ipath and qib), with completely unrelated
semantics on write(2) and writev(2) (the latter shared with aio). As in
"writev() of a single-element iovec array does things that do not even
resemble what write() of the same data would've done". Yes, really - check
for yourself.
* snd_pcm - hell knows; it might be that it tries to collect the
data from iovec and push it in one go, as if it was a single write, but
then it might be something as bogus as what ipath is doing...
* gadgetfs - hell knows; ep_write() seems to be doing something
beyond what ep_aio_write() does, but I haven't traced them down the call
chain... That one, BTW, won't be fun for splice - looks like it cares
about datagram boundaries a lot, so it's not obvious what the semantics
should be.
* lustre. I _think_ do_sync_write() would work there, but I might
be easily missing something in all those layers of obfusca^Wgood software
development practices.
And ->splice_write() thrown into that mess, defaulting to "just do
->write() or ->aio_write(), everything writable should be able to cope
with splice to it" hasn't made it any prettier. Unfortunately, what
you are proposing will make it worse - we'll have to grow a bunch of
->splice_write() instances, with non-trivial correspondence between
them and ->aio_write() ones...
It needs to be cleaned up, but it's nowhere near as simple as "just flip
the order of i_mutex and pipe_lock" ;-/
BTW, speaking of ->aio_write() - why the devil do we pass the pos
argument (long long, at that) separately, when all call sites provably
have it equal to iocb->ki_pos? If nothing else, on a bunch of architectures
it makes the difference between passing arguments in registers and spilling
them on stack; moreover, if we do something and only then call
generic_file_aio_write(), we get to propagate it all way down. And
generic_file_aio_write() has had explicit BUG_ON(iocb->ki_pos != pos)
since 2.5.55, for crying out loud...
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2014-01-14 17:20 ` Al Viro
@ 2014-01-15 18:10 ` Al Viro
2014-01-18 6:40 ` Al Viro
1 sibling, 0 replies; 49+ messages in thread
From: Al Viro @ 2014-01-15 18:10 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs,
Linus Torvalds, Sage Weil, Steve French
On Tue, Jan 14, 2014 at 05:20:33PM +0000, Al Viro wrote:
> And that's less than half of fs/*... I'm not saying that the current
> situation on the write side is good; hell, just the mess with write/aio_write
> alone is ugly enough - we have
> * a bunch of file_operations without ->aio_write(); simple enough.
> * a bunch with ->write == do_sync_write. Also simple.
> * several with NULL ->write and non-NULL ->aio_write(); same as
> do_sync_write() for ->write (socket, android/logger, kmsg, macvtap)
> * several with ->aio_write being an optimized equivalent of
> do_sync_write() (blackhole for /dev/null and /dev/zero, error for bad_inode)
> * 9p cached with its "oh, but if we have O_DIRECT we want ->write()
> to be different" (why not use a separate file_operations, then? It's not
> as if ->open() couldn't switch to it if it sees O_DIRECT...)
> * two infinibad things (ipath and qib), with completely unrelated
> semantics on write(2) and writev(2) (the latter shared with aio). As in
> "writev() of a single-element iovec array does things that do not even
> resemble what write() of the same data would've done". Yes, really - check
> for yourself.
> * snd_pcm - hell knows; it might be that it tries to collect the
> data from iovec and push it in one go, as if it was a single write, but
> then it might be something as bogus as what ipath is doing...
> * gadgetfs - hell knows; ep_write() seems to be doing something
> beyond what ep_aio_write() does, but I haven't traced them down the call
> chain... That one, BTW, won't be fun for splice - looks like it cares
> about datagram boundaries a lot, so it's not obvious what the semantics
> should be.
> * lustre. I _think_ do_sync_write() would work there, but I might
> easily be missing something in all those layers of obfusca^Wgood software
> development practices.
BTW, ->read/->aio_read situation is only slightly better - of file_operations
instances that have ->aio_read, most have do_sync_read() for ->read() (as
they ought to). Exceptions:
* 9p O_DIRECT (again)
* NULL ->read where do_sync_read ought to be (socket, macvtap)
* optimized ->read (/dev/zero, /dev/null, bad_inode)
* snd_pcm - magic. It (and its ->aio_write counterpart) wants exactly
one iovec per channel. IOW, it's not a general-purpose ->aio_{read,write}
at all - it's a magic API shoehorned into readv(2)/writev(2) (and aio
IOCB_CMD_P{READ,WRITE}V as well).
* lustre - probably could live with do_sync_read(), but there might
be stack footprint considerations or some really weird magic going on
(the difference is that instead of iocb on stack they appear to be using
per-thread one allocated on heap and hashed by pid, of all things).
It's really weird - they end up doing repeated hash lookups for that
per-thread wastebasket of a structure on different levels of call chain.
Looks like they have swept a lot of local variables of a lot of functions
into that thing; worse, it appears to be one of several dynamically allocated
bits of that thing, hidden behind a bunch of wrappers... Overall feel is
Lovecraftian, complete with lurking horrors of the deep... BTW, its ->aio_read
would better never return -EIOCBQUEUED - its ->read does *not* wait for
completion of iocb it has submitted.
* gadgetfs - it appears to be seriously datagram-oriented; basically,
they want to reduce readv/writev to read/write, not the other way round.
> BTW, speaking of ->aio_write() - why the devil do we pass the pos
> argument (long long, at that) separately, when all call sites provably
> have it equal to iocb->ki_pos? If nothing else, on a bunch of architectures
> it makes the difference between passing arguments in registers and spilling
> them on stack; moreover, if we do something and only then call
> generic_file_aio_write(), we get to propagate it all way down. And
> generic_file_aio_write() has had explicit BUG_ON(iocb->ki_pos != pos)
> since 2.5.55, for crying out loud...
The same goes for ->aio_read() (except for s/2.5.55/2.5.39/)...
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2014-01-14 17:20 ` Al Viro
2014-01-15 18:10 ` Al Viro
@ 2014-01-18 6:40 ` Al Viro
2014-01-18 7:22 ` Linus Torvalds
1 sibling, 1 reply; 49+ messages in thread
From: Al Viro @ 2014-01-18 6:40 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Sage Weil, Mark Fasheh, xfs, Steve French,
Joel Becker, linux-fsdevel, Linus Torvalds
On Tue, Jan 14, 2014 at 05:20:33PM +0000, Al Viro wrote:
> On Tue, Jan 14, 2014 at 05:22:07AM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 13, 2014 at 11:56:46PM +0000, Al Viro wrote:
> > > On Mon, Jan 13, 2014 at 06:14:16AM -0800, Christoph Hellwig wrote:
> > > > ping? Would be nice to get this into 3.14
> > >
> > > Umm... The reason for pipe_lock outside of ->i_mutex is this:
> > > default_file_splice_write() calls splice_from_pipe() with
> > > write_pipe_buf for callback. splice_from_pipe() calls that
> > > callback under pipe_lock(pipe). And write_pipe_buf() calls
> > > __kernel_write(), which certainly might want to take ->i_mutex.
> > >
> > > Now, this codepath isn't taken for files that have non-NULL
> > > ->splice_write(), so that's not an issue for XFS and OCFS2,
> > > but having pipe_lock nest between the ->i_mutex for filesystems
> > > that do and do not have ->splice_write()... Ouch...
> >
> > What would be the alternative? Duplicating the code in even more
> filesystems to enforce a non-natural locking order for filesystems
> actually implementing splice? There don't actually seem to be a whole
> lot of real filesystems not implementing splice_write, the prime use
> > would be for device drivers or synthetic ones. I'm not even sure
> > how much that fallback gets used in practice.
Hmm... In principle, the following would be no worse than what
generic_file_splice_write() is doing: confirm and map the pages, build
an iovec and use ->aio_write() to write it out, then unmap the suckers,
release ones entirely written to file and adjust the partially
written one. All under pipe_lock(). Hell, if we introduce
kernel_writev() (either by calling vfs_writev() or taking do_readv_writev()
sans copying iovec and using that under set_fs()), we could switch
default_file_splice_write() to that and get rid of ->splice_write() for
the majority of filesystems, if not all of them.
Sure, it means copying from pipe buffers to pagecache, but we have
generic_file_splice_write() do that copy anyway - conditional memcpy()
in pipe_to_file() is actually unconditional; that if (page != buf->page) in
there had just been forgotten by Nick back in 2007 ("1/2 splice: dont steal").
Objections, comments?
The problem Christoph was talking about is that generic_file_splice_write()
plays with ->i_mutex and both gets/drops it for each page of IO *and*
causes PITA for any fs that wants some locks of its own taken in addition
to ->i_mutex on the write paths. What ->splice_write() without page
stealing is doing is pretty much a writev() from array of pages in kernel
space; so it looks like we might as well just reuse writev() guts for that...
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2014-01-18 6:40 ` Al Viro
@ 2014-01-18 7:22 ` Linus Torvalds
2014-01-18 7:46 ` Al Viro
0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-01-18 7:22 UTC (permalink / raw)
To: Al Viro
Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker,
linux-fsdevel, xfs, Sage Weil, Steve French
On Fri, Jan 17, 2014 at 10:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Objections, comments?
I certainly object to the "map, then unmap" approach. No VM games.
But if it can be done more naturally as a writev, then that may well
be ok. As long as we're talking about just the
default_file_splice_write() case, and people who want to do special
things with page movement can continue to do so..
Linus
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2014-01-18 7:22 ` Linus Torvalds
@ 2014-01-18 7:46 ` Al Viro
2014-01-18 7:56 ` Al Viro
` (2 more replies)
0 siblings, 3 replies; 49+ messages in thread
From: Al Viro @ 2014-01-18 7:46 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs,
Christoph Hellwig, Joel Becker, linux-fsdevel
On Fri, Jan 17, 2014 at 11:22:04PM -0800, Linus Torvalds wrote:
> On Fri, Jan 17, 2014 at 10:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Objections, comments?
>
> I certainly object to the "map, then unmap" approach. No VM games.
Um...
int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
struct splice_desc *sd)
...
if (buf->page != page) {
char *src = buf->ops->map(pipe, buf, 1);
char *dst = kmap_atomic(page);
memcpy(dst + offset, src + buf->offset, this_len);
flush_dcache_page(page);
kunmap_atomic(dst);
buf->ops->unmap(pipe, buf, src);
}
...
->map() and ->unmap() (BTW, why are those methods, anyway? They are
identical for all instances) are
void *generic_pipe_buf_map(struct pipe_inode_info *pipe,
struct pipe_buffer *buf, int atomic)
{
if (atomic) {
buf->flags |= PIPE_BUF_FLAG_ATOMIC;
return kmap_atomic(buf->page);
}
return kmap(buf->page);
}
and
void generic_pipe_buf_unmap(struct pipe_inode_info *pipe,
struct pipe_buffer *buf, void *map_data)
{
if (buf->flags & PIPE_BUF_FLAG_ATOMIC) {
buf->flags &= ~PIPE_BUF_FLAG_ATOMIC;
kunmap_atomic(map_data);
} else
kunmap(buf->page);
}
resp.
If we are going to copy that data (and all users of generic_file_splice_write()
do that memcpy() to page cache), we have to kmap the source ;-/
> But if it can be done more naturally as a writev, then that may well
> be ok. As long as we're talking about just the
> default_file_splice_write() case, and people who want to do special
> things with page movement can continue to do so..
The thing is, after such change default_file_splice_write() is no worse than
generic_file_splice_write(). The only instances that really want something
else are the ones that try to steal pages (e.g. virtio_console, fuse miscdev)
or sockets, with their "do DMA from the sodding page, don't copy it
anywhere" ->sendpage() method. IOW, the ones doing those special things
you are talking about. Normal filesystems do not - not on pipe-to-file splice.
file-to-pipe - sure, that one plays with pagecache and tries hard to
do zero-copy, but that's ->splice_read(), not ->splice_write()...
_If_ somebody figures out how to deal with zero-copy on pipe-to-file - fine,
we'll be able to revisit that. But there hadn't been one since 2007 and
there was zero activity in that area, so...
What I'm doing right now is taking do_readv_writev() apart and making the
stuff after rw_copy_check_uvector() non-static (visible in fs/internal.h).
As long as we do not go through rw_copy_check_uvector() (we'd just built
that iovec ourselves and it's already in kernel space), we should be fine -
single copy done straight to pagecache, with whatever locks fs wants to
take, etc.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2014-01-18 7:46 ` Al Viro
@ 2014-01-18 7:56 ` Al Viro
2014-01-18 8:27 ` Al Viro
2014-01-18 19:59 ` Linus Torvalds
2 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2014-01-18 7:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker,
linux-fsdevel, xfs, Sage Weil, Steve French
On Sat, Jan 18, 2014 at 07:46:49AM +0000, Al Viro wrote:
> Um...
> int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> struct splice_desc *sd)
> ...
> if (buf->page != page) {
> char *src = buf->ops->map(pipe, buf, 1);
> char *dst = kmap_atomic(page);
>
> memcpy(dst + offset, src + buf->offset, this_len);
> flush_dcache_page(page);
> kunmap_atomic(dst);
> buf->ops->unmap(pipe, buf, src);
> }
BTW, that if (buf->page != page) is always true - it's a leftover from
before Nick's removal of ->steal() uses in pipe_to_file() (as well as
the big fat comment in front of that function that had lost any relation
to what it's doing 7 years ago)...
Is there anybody maintaining fs/splice.c these days? I'd been doing
massive RTFS in that area lately, but it would certainly be nice to
have some braindump on the design and issues in that thing, preferably
still matching the code...
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2014-01-18 7:46 ` Al Viro
2014-01-18 7:56 ` Al Viro
@ 2014-01-18 8:27 ` Al Viro
2014-01-18 8:44 ` David Miller
2014-01-18 19:59 ` Linus Torvalds
2 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2014-01-18 8:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs,
Christoph Hellwig, Joel Becker, linux-fsdevel, David Miller
On Sat, Jan 18, 2014 at 07:46:49AM +0000, Al Viro wrote:
> On Fri, Jan 17, 2014 at 11:22:04PM -0800, Linus Torvalds wrote:
> > On Fri, Jan 17, 2014 at 10:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > Objections, comments?
> >
> > I certainly object to the "map, then unmap" approach. No VM games.
[snip]
> > But if it can be done more naturally as a writev, then that may well
> > be ok. As long as we're talking about just the
> > default_file_splice_write() case, and people who want to do special
> > things with page movement can continue to do so..
>
> The thing is, after such change default_file_splice_write() is no worse than
> generic_file_splice_write(). The only instances that really want something
> else are the ones that try to steal pages (e.g. virtio_console, fuse miscdev)
> or sockets, with their "do DMA from the sodding page, don't copy it
> anywhere" ->sendpage() method. IOW, the ones doing those special things
> you are talking about. Normal filesystems do not - not on pipe-to-file splice.
> file-to-pipe - sure, that one plays with pagecache and tries hard to
> do zero-copy, but that's ->splice_read(), not ->splice_write()...
BTW, would sockets benefit from having ->sendpages() that would take an
array of (page, offset, len) triples? It would be trivial to do and
some of the helpers that are falling out of writing that writev-based
default_file_splice_write() look like they could be reused for
calling that one... Dave?
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2014-01-18 8:27 ` Al Viro
@ 2014-01-18 8:44 ` David Miller
2014-02-07 17:10 ` Al Viro
0 siblings, 1 reply; 49+ messages in thread
From: David Miller @ 2014-01-18 8:44 UTC (permalink / raw)
To: viro
Cc: torvalds, hch, axboe, mfasheh, jlbec, linux-fsdevel, xfs, sage,
sfrench
From: Al Viro <viro@ZenIV.linux.org.uk>
Date: Sat, 18 Jan 2014 08:27:30 +0000
> BTW, would sockets benefit from having ->sendpages() that would take an
> array of (page, offset, len) triples? It would be trivial to do and
> some of the helpers that are falling out of writing that writev-based
> default_file_splice_write() look like they could be reused for
> calling that one... Dave?
That's originally how the sendpage method was implemented, but back then
Linus asked us to only pass one page at a time.
I don't remember the details beyond that.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2014-01-18 8:44 ` David Miller
@ 2014-02-07 17:10 ` Al Viro
0 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2014-02-07 17:10 UTC (permalink / raw)
To: David Miller
Cc: torvalds, hch, axboe, mfasheh, jlbec, linux-fsdevel, xfs, sage,
sfrench
On Sat, Jan 18, 2014 at 12:44:53AM -0800, David Miller wrote:
> From: Al Viro <viro@ZenIV.linux.org.uk>
> Date: Sat, 18 Jan 2014 08:27:30 +0000
>
> > BTW, would sockets benefit from having ->sendpages() that would take an
> > array of (page, offset, len) triples? It would be trivial to do and
> > some of the helpers that are falling out of writing that writev-based
> > default_file_splice_write() look like they could be reused for
> > calling that one... Dave?
>
> That's originally how the sendpage method was implemented, but back then
> Linus asked us to only pass one page at a time.
>
> I don't remember the details beyond that.
FWIW, I wonder if what we are doing with ->msg_iov is the right thing.
We modify the iovecs in array as we drain it. And that's inconvenient
for at least some callers (see e.g. complaints in fs/ncpfs about the
need to copy the array, etc.).
What if we embed iov_iter into the sucker and replace memcpy_{to,from}iovec*
with variants taking iov_iter *? If nothing else, it'll be marginally more
efficient (no more skipping the already-emptied iovecs) and it seems to be
more convenient for callers. If we are lucky, that might even eliminate
the need of ->sendpage() - just set the iov_iter over <page,offset,size>
array instead of iovec one and let ->sendmsg() do the smart thing if it
knows how. I hadn't done comparison of {tcp,udp}_send{page,msg}, though -
there might be dragons... Even if that will turn out to be infeasible,
it will at least drive the kmap/kunmap done by sock_no_sendpage() down
into memcpy_from_iter(), turning them into kmap_atomic/kunmap_atomic.
The obvious price is that kernel-side msghdr diverges from the userland
one, so copy_msghdr_from_user() needs to deal with that, but I really
doubt that you'll find a load where the price of copying it in two
chunks instead of one would be measurable.
What else am I missing?
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2014-01-18 7:46 ` Al Viro
2014-01-18 7:56 ` Al Viro
2014-01-18 8:27 ` Al Viro
@ 2014-01-18 19:59 ` Linus Torvalds
2014-01-18 20:10 ` Al Viro
2 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-01-18 19:59 UTC (permalink / raw)
To: Al Viro
Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs,
Christoph Hellwig, Joel Becker, linux-fsdevel
On Fri, Jan 17, 2014 at 11:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 17, 2014 at 11:22:04PM -0800, Linus Torvalds wrote:
>> On Fri, Jan 17, 2014 at 10:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> >
>> > Objections, comments?
>>
>> I certainly object to the "map, then unmap" approach. No VM games.
>
> Um...
>
> If we are going to copy that data (and all users of generic_file_splice_write()
> do that memcpy() to page cache), we have to kmap the source ;-/
Yeah, the kmap/kunmap we have to do. But that's a no-op on 64-bit, and
has to be done one page at a time (well, I guess you could do a
couple).
But you can't do that *around* the default_file_splice_write(), so I
thought you meant some kind of "map into user space". And I absolutely
*detest* that kind of approach.
Linus
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2014-01-18 19:59 ` Linus Torvalds
@ 2014-01-18 20:10 ` Al Viro
2014-01-18 20:27 ` Al Viro
2014-01-19 5:13 ` [RFC] unifying write variants for filesystems Al Viro
0 siblings, 2 replies; 49+ messages in thread
From: Al Viro @ 2014-01-18 20:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs,
Christoph Hellwig, Joel Becker, linux-fsdevel
On Sat, Jan 18, 2014 at 11:59:56AM -0800, Linus Torvalds wrote:
> On Fri, Jan 17, 2014 at 11:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Fri, Jan 17, 2014 at 11:22:04PM -0800, Linus Torvalds wrote:
> >> On Fri, Jan 17, 2014 at 10:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >> >
> >> > Objections, comments?
> >>
> >> I certainly object to the "map, then unmap" approach. No VM games.
> >
> > Um...
> >
> > If we are going to copy that data (and all users of generic_file_splice_write()
> > do that memcpy() to page cache), we have to kmap the source ;-/
>
> Yeah, the kmap/kunmap we have to do. But that's a no-op on 64-bit, and
> has to be done one page at a time (well, I guess you could do a
> couple).
>
> But you can't do that *around* the default_file_splice_write(), so I
> thought you meant some kind of "map into user space". And I absolutely
> *detest* that kind of approach.
Ouch... No, I hadn't meant that kind of insanity, but I'd missed the
problem with scarcity of mappings completely...
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2014-01-18 20:10 ` Al Viro
@ 2014-01-18 20:27 ` Al Viro
2014-01-18 20:30 ` Al Viro
2014-01-19 5:13 ` [RFC] unifying write variants for filesystems Al Viro
1 sibling, 1 reply; 49+ messages in thread
From: Al Viro @ 2014-01-18 20:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker,
linux-fsdevel, xfs, Sage Weil, Steve French
On Sat, Jan 18, 2014 at 08:10:31PM +0000, Al Viro wrote:
> On Sat, Jan 18, 2014 at 11:59:56AM -0800, Linus Torvalds wrote:
> > On Fri, Jan 17, 2014 at 11:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > On Fri, Jan 17, 2014 at 11:22:04PM -0800, Linus Torvalds wrote:
> > >> On Fri, Jan 17, 2014 at 10:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >> >
> > >> > Objections, comments?
> > >>
> > >> I certainly object to the "map, then unmap" approach. No VM games.
> > >
> > > Um...
> > >
> > > If we are going to copy that data (and all users of generic_file_splice_write()
> > > do that memcpy() to page cache), we have to kmap the source ;-/
> >
> > Yeah, the kmap/kunmap we have to do. But that's a no-op on 64-bit, and
> > has to be done one page at a time (well, I guess you could do a
> > couple).
> >
> > But you can't do that *around* the default_file_splice_write(), so I
> > thought you meant some kind of "map into user space". And I absolutely
> > *detest* that kind of approach.
>
> Ouch... No, I hadn't meant that kind of insanity, but I'd missed the
> problem with scarcity of mappings completely...
Ouch^2: default_file_splice_write() keeps calling write_pipe_buf(),
which does this:
data = buf->ops->map(pipe, buf, 0);
ret = __kernel_write(sd->u.file, data + buf->offset, sd->len, &tmp);
buf->ops->unmap(pipe, buf, data);
IOW, ->write() (with whatever locks there might be) wrapped into
kmap_atomic()/kunmap_atomic(). And anybody can do that - just a splice to
file on procfs will hit that codepath... Or on 9p, for that matter, or
fat, or afs, or cifs, etc.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring
2014-01-18 20:27 ` Al Viro
@ 2014-01-18 20:30 ` Al Viro
0 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2014-01-18 20:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker,
linux-fsdevel, xfs, Sage Weil, Steve French
On Sat, Jan 18, 2014 at 08:27:17PM +0000, Al Viro wrote:
> Ouch^2: default_file_splice_write() keeps calling write_pipe_buf(),
> which does this:
> data = buf->ops->map(pipe, buf, 0);
> ret = __kernel_write(sd->u.file, data + buf->offset, sd->len, &tmp);
> buf->ops->unmap(pipe, buf, data);
> IOW, ->write() (with whatever locks there might be) wrapped into
> kmap_atomic()/kunmap_atomic(). And anybody can do that - just a splice to
> file on procfs will hit that codepath... Or on 9p, for that matter, or
> fat, or afs, or cifs, etc.
s/kmap_atomic/kmap/, so it's not that disastrously bad (kmap_atomic might
lose the mapping as soon as we block), but scarcity problem remains...
^ permalink raw reply [flat|nested] 49+ messages in thread
* [RFC] unifying write variants for filesystems
2014-01-18 20:10 ` Al Viro
2014-01-18 20:27 ` Al Viro
@ 2014-01-19 5:13 ` Al Viro
2014-01-20 13:55 ` Christoph Hellwig
1 sibling, 1 reply; 49+ messages in thread
From: Al Viro @ 2014-01-19 5:13 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker,
linux-fsdevel, xfs, Sage Weil, Steve French
On Sat, Jan 18, 2014 at 08:10:31PM +0000, Al Viro wrote:
> Ouch... No, I hadn't meant that kind of insanity, but I'd missed the
> problem with scarcity of mappings completely...
OK, that pretty much kills this approach. Pity...
Folks, what do you think about the following:
* a new data structure:
struct io_source {
enum {IO_IOVEC, IO_PVEC} type;
union {
struct iovec *iov;
struct pvec {
struct page *page;
unsigned offset;
unsigned size;
} *pvec;
};
}
* a new method that would look like aio_write, but take
struct io_source instead of iovec.
* store the type in iov_iter (normally IO_IOVEC) and teach the
code dealing with it to do the right thing depending on type. I.e. instead
of __copy_from_user_inatomic() do kmap_atomic()/memcpy()/kunmap_atomic() if
it's an IO_PVEC.
* generic_file_aio_write() analog for new method, converging with
generic_file_aio_write() almost immediately (basically, as soon as iov_iter
has been initialized).
* new_aio_write() consisting of
{
struct io_source source = {.type = IO_IOVEC, .iov = iov};
return file->f_op-><new_method>(iocb, &source, nr_segs, pos);
}
* new_sync_write(), doing what do_sync_write() does for files
that have new_aio_write() as ->aio_write().
* new_splice_write() usable for files that provide that method -
it would collect pipe_buffers, put together struct pvec array and pass
it to that method. All mapping the pages would happen one-by-one
and only around actual copying the data. And, of course, the locking
would be identical to what we do for write()/writev()/aio write
Then filesystems can switch to that new method, flipping their
aio_write() instances to the new type and replacing ->aio_write
with default_aio_write, ->write with new_write and ->splice_write with
new_splice_write.
Actually, there's a possibility that it would be possible to use
it for *all* instances of ->splice_write() - we'd need to store something
a pointer to "call this to try and steal this page" function in pvec
and allow the method do actual stealing. Note that pipe_buffer ->steal()
only uses the page argument - they all ignore which pipe it's in (and
there's nothing they could usefully do if they knew which pipe it had
been in, in the first place).
This is very preliminary, of course, and I might easily miss
something - the previous idea was unworkable, after all. Comments
would be very welcome...
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] unifying write variants for filesystems
2014-01-19 5:13 ` [RFC] unifying write variants for filesystems Al Viro
@ 2014-01-20 13:55 ` Christoph Hellwig
2014-01-20 20:32 ` Linus Torvalds
0 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2014-01-20 13:55 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, Christoph Hellwig, Jens Axboe, Mark Fasheh,
Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French,
Dave Kleikamp
On Sun, Jan 19, 2014 at 05:13:35AM +0000, Al Viro wrote:
> Folks, what do you think about the following:
That's very much what Shaggy did in the aio-direct tree, except that
it kept using a single set of methods. Linus really didn't like it
unfortunately.
https://github.com/kleikamp/linux-shaggy/commits/aio_loop
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] unifying write variants for filesystems
2014-01-20 13:55 ` Christoph Hellwig
@ 2014-01-20 20:32 ` Linus Torvalds
2014-02-01 22:43 ` Al Viro
0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-01-20 20:32 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs,
Sage Weil, Steve French, Dave Kleikamp
On Mon, Jan 20, 2014 at 5:55 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Sun, Jan 19, 2014 at 05:13:35AM +0000, Al Viro wrote:
>> Folks, what do you think about the following:
>
> That's very much what Shaggy did in the aio-direct tree, except that
> it kept using a single set of methods. Linus really didn't like it
> unfortunately.
Umm. That wasn't what I objected to.
I objected to the incredibly ugly implementation, the crazy new flags
arguments for core helper functions, ugly naming etc etc. I even
outlined what might fix it.
In other words, I thought the code was shit and ugly. Not that
iterators per se would be wrong. Just doing them badly is wrong.
So don't go putting words in my mouth. It isn't "unfortunate" at all
that I have higher quality standards.
Linus
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] unifying write variants for filesystems
2014-01-20 20:32 ` Linus Torvalds
@ 2014-02-01 22:43 ` Al Viro
2014-02-02 0:13 ` Linus Torvalds
` (4 more replies)
0 siblings, 5 replies; 49+ messages in thread
From: Al Viro @ 2014-02-01 22:43 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker,
linux-fsdevel, xfs, Sage Weil, Steve French, Dave Kleikamp,
Anton Altaparmakov
On Mon, Jan 20, 2014 at 12:32:01PM -0800, Linus Torvalds wrote:
> On Mon, Jan 20, 2014 at 5:55 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > On Sun, Jan 19, 2014 at 05:13:35AM +0000, Al Viro wrote:
> >> Folks, what do you think about the following:
> >
> > That's very much what Shaggy did in the aio-direct tree, except that
> > it kept using a single set of methods. Linus really didn't like it
> > unfortunately.
>
> Umm. That wasn't what I objected to.
>
> I objected to the incredibly ugly implementation, the crazy new flags
> arguments for core helper functions, ugly naming etc etc. I even
> outlined what might fix it.
>
> In other words, I thought the code was shit and ugly. Not that
> iterators per se would be wrong. Just doing them badly is wrong.
Gyahhh... OK, I should've known better than to go looking into that thing
after such warning. Some relatively printable notes (i.e. about 10%
of the comments I really had about that) follow:
* WTF bother passing 'pos' separately? It's the same mistake that was
made with ->aio_read/->aio_write and just as with those, *all* callers
provably have pos == iocb->ki_pos.
* I'm not sure that iov_iter is a good fit here. OTOH, it probably could
be reused (it has damn few users right now and they are on the codepaths
involved in that thing).
* We *definitely* want a variant structure with tag - unsigned long thing
was just plain insane. I see at least two variants - array of iovecs
and array of (at least) triples <page, offset, length>. Quite possibly -
quadruples, with "here's how to try to steal this page" thrown in, if
we want that as replacement for ->splice_write() as well (it looks like
the few instances that do steal on pipe-to-file splices could be dealt
with the same way as the dumb ones, provided that ->write_iter or whatever
we end up calling it is allowed to try and steal pages). Possibly more
variants on the read side of things... FWIW, I'm not sure that bio_vec
makes a lot of sense here.
* direction (i.e. source vs. destination) also should be a part of that
tag. Which, BTW, turns ->direct_IO() into iocb x iov_iter -> int;
the situation with pos is identical to aio_read/aio_write. While we
are at it, KERNEL_WRITE thing (used only for __swap_writepage()) looks
very much like a special case of "array of <page,offset,size>" we want
for splice_write...
* having looked through the ->read and ->write instances in drivers,
I'd say that surprisingly few helpers would go a _long_ way towards
converting those guys to the same methods. And we need such helpers
anyway - there's a whole lot of (badly) open-coded "copy the whole
user buffer into kmalloc'ed array and NUL-terminate the sucker" logics
in ->write() instances, for example. Even more "copy up to that much
into given array and NUL-terminate it", with rather amusing bugs in
there - e.g. NUL going into the end of array, regardless of the actual
amount of data to copy; junk is left in the middle, complete with
printk of the entire thing if it doesn't make sense. IOW, spewing a
random piece of kernel stack into dmesg. Off-by-ones galore, too...
BTW, speaking of bogosities - grep for generic_write_checks(). Exactly
one caller (in __generic_file_aio_write()) has any business looking
at S_ISBLK(inode->i_mode) - it can be called by blkdev_aio_write().
All other callers have copied that, even though it makes absolutely
no sense for them...
* file_readable/file_writable looks really wrong; if nothing else, I would
rather check that after open() and set a couple of FMODE bits, then check
those. Possibly even "knock FMODE_READ/FMODE_WRITE out if there's no
method"...
* pipe_buffer_operations ->map()/->unmap() should die; let the caller do
k{,un}map{,_atomic}(). All instances have the same method there and
there's no point to make it different. PIPE_BUF_FLAG_ATOMIC should also
go.
* WTF do iov_iter_copy_from_user_atomic() callers keep doing pagefault_disable
and pagefault_enable around it? The sucker starts with kmap_atomic() and
ends with kunmap_atomic(); all instances of those guys have pagefault
disabling/enabling (and I suspect that it might make sense to lift it
out of arch-specific variants - rename them to e.g. __kmap_atomic(),
rip pagefault_disable() out of those and put kmap_atomic() into highmem.h
outside of ifdef, with !HIGHMEM side of ifdef having #define __kmap_atomic
page_address; move pagefault_enable() from __kunmap_atomic() to
kunmap_atomic() while we are at it).
Note that e.g. pipe.c relies on kmap_atomic() disabling pagefaults - we
have __copy_from_user_inatomic() done there under kmap_atomic(), and we
really don't want to block in such conditions.
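The proposed layering might look like the following userspace sketch; pf_depth is a made-up counter standing in for the preemption/pagefault count so the pairing is visible, and the __kmap_atomic()/kmap_atomic() split is the hypothetical arch/generic division described above:

```c
/* Arch half does only the mapping; the generic wrapper owns the
 * pagefault_disable()/pagefault_enable() pairing. */
static int pf_depth;

static void pagefault_disable(void) { pf_depth++; }
static void pagefault_enable(void)  { pf_depth--; }

static void *__kmap_atomic(void *page)   { return page; }  /* arch mapping only */
static void  __kunmap_atomic(void *addr) { (void)addr; }

static void *kmap_atomic(void *page)
{
        pagefault_disable();          /* generic, lifted out of the arch code */
        return __kmap_atomic(page);
}

static void kunmap_atomic(void *addr)
{
        __kunmap_atomic(addr);
        pagefault_enable();           /* moved out of __kunmap_atomic()       */
}
```

With !HIGHMEM, __kmap_atomic would just be page_address and callers like pipe.c would still get the pagefault disabling they rely on.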
* pipe_iov_copy_from_user() ought to return how much it has managed to
bring in, not 0 or -EFAULT as it does now. Then it won't need the
non-atomic side, AFAICS. Moreover, we'll just be able to use
iov_iter_copy_from_user_atomic() (which badly needs a more palatable
name, BTW).
* sure, we want to call do_generic_file_read() once, passing the entire
iovec to file_read_actor(). But what the hell does that have to do with
introduction of new methods? It's a change that makes sense on its
own. Moreover, it's a damn good preparation to adding those - we
get generic_file_aio_read() into "set iov_iter up, then do <this>",
with <this> using iov_iter instead of iovec. Once we get to introducing
those methods, it's just a matter of taking the rest of generic_file_aio_read()
into a separate function and making that function an instance of new
method...
* Unrelated to this patchset, but... may I politely inquire about the reasons
why ->is_partially_uptodate() gets read_descriptor_t? The whole point
of read_descriptor_t (if it has any) is that its interpretation is up to
whoever's doing the reading, _not_ what they are reading from. So desc->arg
is off-limits for any instance of ->is_partially_uptodate(). desc->written is
obviously pointless for them; the need (or lack thereof) to do something
to the page doesn't depend on how much we have already read from the
file. Moreover, reporting an error is rather dubious in such method;
if there's something fishy with the page, we'll just return "no" and let
->readpage() complain. Which leaves only desc->count, which, unsurprisingly,
is the only thing looked at by (the only) instance of that method. And
"tell me if the part of this page that long starting from that offset is
up to date" is much more natural that "is what this read_descriptor_t would
have had us read from this offset in this page up to date?"...
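The saner calling convention, sketched with a userspace stand-in (struct fake_page and its single up-to-date-prefix model are illustration only, not the real page flags machinery):

```c
#include <stdbool.h>
#include <stddef.h>

/* A "page" that knows how much of it is up to date. */
struct fake_page { size_t uptodate_bytes; };

/* Proposed shape: "is the range [from, from + count) of this page up to
 * date?" -- no read_descriptor_t anywhere in sight. */
static bool is_partially_uptodate(const struct fake_page *page,
                                  size_t from, size_t count)
{
        return from + count <= page->uptodate_bytes;
}
```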
* do we really need separate non-atomic variant of iov_iter_copy_from_user()?
We have only two users right now (cifs and ceph) and both can use the
fault-in / atomic copy loop without much pain...
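The fault-in / atomic-copy loop in question, as a userspace sketch. The "atomic" copy here fails (returns the number of bytes not copied, like __copy_from_user_inatomic()) until the source has been faulted in; faulted_in and the stub bodies are stand-ins for illustration:

```c
#include <stddef.h>
#include <string.h>

static int faulted_in;

static void fault_in_pages_readable(const char *src, size_t len)
{
        (void)src; (void)len;
        faulted_in = 1;                 /* touch the source pages */
}

/* returns bytes left uncopied, like __copy_from_user_inatomic() */
static size_t copy_atomic(char *dst, const char *src, size_t len)
{
        if (!faulted_in)
                return len;             /* would have faulted: nothing copied */
        memcpy(dst, src, len);
        return 0;
}

static size_t copy_with_retry(char *dst, const char *src, size_t len)
{
        size_t left;

        while ((left = copy_atomic(dst, src, len)) != 0) {
                fault_in_pages_readable(src, len);   /* then retry atomically */
                /* a real implementation would bail out if faulting in fails */
        }
        return len - left;
}
```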
* in addition to having too many methods, I'm not convinced that we want
them to be methods. Let's start with explicit checks in the primitives and
see where it goes from there; if we start to grow too many variants,
we can always introduce some methods, but then we'll be in better position
to decide what is and what is not a good method...
* on the read side, I really don't believe that exposing atomic and
non-atomic variants is a good idea. Look at the existing users of
__copy_to_user_inatomic(); leaving aside the i915_gem weirdness,
all of them are used to implement the exact same thing: given a page,
offset and length, feed its contents to iovec/buffer/whatnot. Unlike
the write side of things, there's nothing between prefaulting pages
and actual attempts to copy. So let's make _that_ an exposed primitive
and let it deal with kmap/kmap_atomic/etc. Variants that don't have
to worry about blocking (vector of <page,offset,length>, etc.) would
simply not bother with non-atomic kmap, etc. Sure, it should take
iov_iter as destination. And deal with mapping the damn thing internally...
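A minimal sketch of what exposing that primitive might look like, with a flat buffer standing in for the iov_iter destination (struct sink_iter is hypothetical; the real thing would take an iov_iter and do the kmap/kmap_atomic dance internally):

```c
#include <stddef.h>
#include <string.h>

struct sink_iter { char *buf; size_t pos, len; };

/* Given <page, offset, length>, feed the page's contents to the iterator;
 * returns how much was actually copied.  A char array stands in for a
 * kmapped page here -- the real primitive would kmap_atomic() and fall
 * back to a sleeping kmap only for destinations that can fault. */
static size_t copy_page_to_iter(const char *page, size_t offset, size_t bytes,
                                struct sink_iter *i)
{
        size_t n = bytes;

        if (n > i->len - i->pos)
                n = i->len - i->pos;
        memcpy(i->buf + i->pos, page + offset, n);
        i->pos += n;
        return n;
}
```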
* ntfs_file_buffered_write() should switch to iov_iter as well. It's
open-coding a lot of iov_iter stuff. It's not entirely trivial and
I'd really like to hear from ntfs folks on that, though, and the current
code looks deadlock-prone. We prefault everything, then lock the pages
to which we'll be copying, then attempt to do __copy_from_user_inatomic(),
falling back to __copy_from_user() if that fails. Fine, but what happens
if the source of write() is mmaped from the same file and we lose CPU while
prefaulting the last page, have memory pressure evict the first one, then
have __ntfs_grab_cache_pages() allocate and lock (nonuptodate) pages
we'll be copying to and have __copy_from_user() try to copy *from* those
same pages? We are doing it while holding these pages locked, so pagefault
will have really fun time with them... Anton?
BTW, Linus, when did you comment on that patchset? I've found an iteration
of that patchset circa last October (v9, apparently the latest posted),
but it looks like your comments had either got lost or had been on the
earlier iteration of that thing...
* Re: [RFC] unifying write variants for filesystems
2014-02-01 22:43 ` Al Viro
@ 2014-02-02 0:13 ` Linus Torvalds
2014-02-02 2:02 ` Al Viro
2014-02-02 19:21 ` Al Viro
` (3 subsequent siblings)
4 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-02-02 0:13 UTC (permalink / raw)
To: Al Viro
Cc: Jens Axboe, Steve French, Sage Weil, Dave Kleikamp, Mark Fasheh,
xfs, Christoph Hellwig, Joel Becker, linux-fsdevel,
Anton Altaparmakov
On Sat, Feb 1, 2014 at 2:43 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> BTW, Linus, when did you comment on that patchset? I've found an iteration
> of that patchset circa last October (v9, apparently the latest posted),
> but it looks like your comments had either got lost or had been on the
> earlier iteration of that thing...
I commented on the pull request in November. Something like this:
http://lkml.indiana.edu/hypermail/linux/kernel/1311.2/02278.html
and then I have a follow-up reply that at least outlines
a few things that could be done to make it less barfy.
But judging from your email, you actually went through that
patch-series with a finer comb; I don't think there is anything in
that commentary of mine that adds anything to yours, apart from a
comment on the naming that I hated.
Linus
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: [RFC] unifying write variants for filesystems
2014-02-02 0:13 ` Linus Torvalds
@ 2014-02-02 2:02 ` Al Viro
0 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2014-02-02 2:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker,
linux-fsdevel, xfs, Sage Weil, Steve French, Dave Kleikamp,
Anton Altaparmakov
On Sat, Feb 01, 2014 at 04:13:36PM -0800, Linus Torvalds wrote:
> On Sat, Feb 1, 2014 at 2:43 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > BTW, Linus, when did you comment on that patchset? I've found an iteration
> > of that patchset circa last October (v9, apparently the latest posted),
> > but it looks like your comments had either got lost or had been on the
> > earlier iteration of that thing...
>
> I commented on the pull request in November. Something like this:
>
> http://lkml.indiana.edu/hypermail/linux/kernel/1311.2/02278.html
>
> and then I have a follow-up in a reply to that that at least outlines
> a few things that could be done to make it less barfy.
>
> But judging from your email, you actually went through that
> patch-series with a finer comb, I don't think there is anything in
> that commentary of mine that adds anything to yours, apart on a
> comment on the naming that I hated.
Heh... You've actually been a lot more polite than what I started to
write on that particular topic. Mine had started with "folks, identifiers
should be possible to read aloud, without the people around getting nervous.
I mean, ai,ai-ai,owwie-...?" and then got so nasty that I decided to leave
the whole thing alone.
But yes, references to the importance of remembering the safeword aside,
the naming is really atrocious.
* Re: [RFC] unifying write variants for filesystems
2014-02-01 22:43 ` Al Viro
2014-02-02 0:13 ` Linus Torvalds
@ 2014-02-02 19:21 ` Al Viro
2014-02-02 19:23 ` Al Viro
2014-02-03 14:41 ` Miklos Szeredi
2014-02-02 23:16 ` Anton Altaparmakov
` (2 subsequent siblings)
4 siblings, 2 replies; 49+ messages in thread
From: Al Viro @ 2014-02-02 19:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker,
linux-fsdevel, xfs, Sage Weil, Steve French, Dave Kleikamp,
Anton Altaparmakov, Miklos Szeredi
On Sat, Feb 01, 2014 at 10:43:01PM +0000, Al Viro wrote:
> * pipe_buffer_operations ->map()/->unmap() should die; let the caller do
> k{,un}map{,_atomic}(). All instances have the same method there and
> there's no point to make it different. PIPE_BUF_FLAG_ATOMIC should also
> go.
BTW, another pile of code interesting in that respect (i.e. getting that
interface right) is fs/fuse/dev.c; I don't like the way it's playing
with get_user_pages_fast() there, and I doubt that sharing the code for
read and write side as it's done there makes much sense, but it's
definitely going to be a test for any API of that kind. It *does*
try to unify write-from-iovec with write-from-array-of-pages and
similar for reads; the interesting issue is that unlike the usual
write-to-pagecache we can have many chunks picked from one page and
we'd rather avoid doing kmap_atomic/kunmap_atomic for each of those.
I suspect that the right answer is, in addition to a primitive that
does copying from iov_iter to have "copy from iov_iter and be ready
to copy more from soon after" + "done copying"; for the "array of
pages" the former would be allowed to leave the current page mapped,
skipping kmap_atomic() on the next call. And the latter would unmap,
of course. The caller is responsible for not blocking or doing
unbalanced map/unmap until it has said "done copying".
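A sketch of that two-call shape, with userspace stand-ins (the function names are made up; maps/unmaps count kmap_atomic/kunmap_atomic operations so the saving for many-chunks-per-page callers like fuse is visible):

```c
#include <stddef.h>
#include <string.h>

static int maps, unmaps;
static const char *cur_mapped;

/* "copy and be ready to copy more soon after": keeps the current page
 * mapped across calls, remapping only when the source page changes. */
static void copy_from_page(char *dst, const char *page, size_t off, size_t len)
{
        if (cur_mapped != page) {
                if (cur_mapped)
                        unmaps++;       /* kunmap_atomic(old page) */
                maps++;                 /* kmap_atomic(new page)   */
                cur_mapped = page;
        }
        memcpy(dst, page + off, len);
}

/* "done copying": drop whatever is still mapped. */
static void copy_done(void)
{
        if (cur_mapped) {
                unmaps++;
                cur_mapped = NULL;
        }
}
```

Two chunks from the same page cost one map instead of two; the caller must not block between the first copy and copy_done().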
BTW, is there any reason why fuse/dev.c doesn't use atomic kmaps for
everything? After all, as soon as we'd done kmap() in there, we
grab a spinlock and don't drop it until just before kunmap(). With
nothing by memcpy() done in between... Miklos? AFAICS, we only win
from switching to kmap_atomic there - we can't block anyway, we don't
need it to be visible on other CPUs and nesting isn't a problem.
Looks like it'll be cheaper in highmem cases and do exactly the same
thing as now for non-highmem... Comments?
* Re: [RFC] unifying write variants for filesystems
2014-02-02 19:21 ` Al Viro
@ 2014-02-02 19:23 ` Al Viro
2014-02-03 14:41 ` Miklos Szeredi
1 sibling, 0 replies; 49+ messages in thread
From: Al Viro @ 2014-02-02 19:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker,
linux-fsdevel, xfs, Sage Weil, Steve French, Dave Kleikamp,
Anton Altaparmakov, Miklos Szeredi
On Sun, Feb 02, 2014 at 07:21:04PM +0000, Al Viro wrote:
> BTW, is there any reason why fuse/dev.c doesn't use atomic kmaps for
> everything? After all, as soon as we'd done kmap() in there, we
> grab a spinlock and don't drop it until just before kunmap(). With
> nothing by memcpy() done in between... Miklos? AFAICS, we only win
s/by/but/ - sorry...
> from switching to kmap_atomic there - we can't block anyway, we don't
> need it to be visible on other CPUs and nesting isn't a problem.
> Looks like it'll be cheaper in highmem cases and do exactly the same
> thing as now for non-highmem... Comments?
* Re: [RFC] unifying write variants for filesystems
2014-02-02 19:21 ` Al Viro
2014-02-02 19:23 ` Al Viro
@ 2014-02-03 14:41 ` Miklos Szeredi
2014-02-03 15:33 ` Al Viro
1 sibling, 1 reply; 49+ messages in thread
From: Miklos Szeredi @ 2014-02-03 14:41 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, Christoph Hellwig, Jens Axboe, Mark Fasheh,
Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French,
Dave Kleikamp, Anton Altaparmakov
On Sun, Feb 02, 2014 at 07:21:04PM +0000, Al Viro wrote:
> On Sat, Feb 01, 2014 at 10:43:01PM +0000, Al Viro wrote:
>
> > * pipe_buffer_operations ->map()/->unmap() should die; let the caller do
> > k{,un}map{,_atomic}(). All instances have the same method there and
> > there's no point to make it different. PIPE_BUF_FLAG_ATOMIC should also
> > go.
>
> BTW, another pile of code interesting in that respect (i.e. getting that
> interface right) is fs/fuse/dev.c; I don't like the way it's playing
> with get_user_pages_fast() there,
It's trying to work around a deadlock caused by lock_page() recursion, i.e. the fuse
daemon reading a request into a page whose readpage is being serviced
by that very request (no, that's not something that happens accidentally).
Do all archs set FAULT_FLAG_KILLABLE? If so, that flag along with code
depending on the lack of it can go away and we can simply depend on page faults
being interruptible by fatal signals.
And that would simplify the fuse device I/O substantially.
> and I doubt that sharing the code for
> read and write side as it's done there makes much sense, but it's
> definitely going to be a test for any API of that kind. It *does*
> try to unify write-from-iovec with write-from-array-of-pages and
> similar for reads; the interesting issue is that unlike the usual
> write-to-pagecache we can have many chunks picked from one page and
> we'd rather avoid doing kmap_atomic/kunmap_atomic for each of those.
>
> I suspect that the right answer is, in addition to a primitive that
> does copying from iov_iter to have "copy from iov_iter and be ready
> to copy more from soon after" + "done copying"; for the "array of
> pages" the former would be allowed to leave the current page mapped,
> skipping kmap_atomic() on the next call. And the latter would unmap.
> of course. The caller is responsible for not blocking or doing
> unbalanced map/unmap until it's said "done copying".
>
> BTW, is there any reason why fuse/dev.c doesn't use atomic kmaps for
> everything? After all, as soon as we'd done kmap() in there, we
> grab a spinlock and don't drop it until just before kunmap(). With
> nothing by memcpy() done in between... Miklos? AFAICS, we only win
> from switching to kmap_atomic there - we can't block anyway, we don't
> need it to be visible on other CPUs and nesting isn't a problem.
> Looks like it'll be cheaper in highmem cases and do exactly the same
> thing as now for non-highmem... Comments?
We don't hold the spinlock. But regardless, I don't see any reason why it
couldn't be atomic kmap.
Thanks,
Miklos
* Re: [RFC] unifying write variants for filesystems
2014-02-03 14:41 ` Miklos Szeredi
@ 2014-02-03 15:33 ` Al Viro
0 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2014-02-03 15:33 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Linus Torvalds, Christoph Hellwig, Jens Axboe, Mark Fasheh,
Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French,
Dave Kleikamp, Anton Altaparmakov
On Mon, Feb 03, 2014 at 03:41:55PM +0100, Miklos Szeredi wrote:
> > BTW, is there any reason why fuse/dev.c doesn't use atomic kmaps for
> > everything? After all, as soon as we'd done kmap() in there, we
> > grab a spinlock and don't drop it until just before kunmap(). With
> > nothing by memcpy() done in between... Miklos? AFAICS, we only win
> > from switching to kmap_atomic there - we can't block anyway, we don't
> > need it to be visible on other CPUs and nesting isn't a problem.
> > Looks like it'll be cheaper in highmem cases and do exactly the same
> > thing as now for non-highmem... Comments?
>
> We don't hold the spinlock. But regardless, I don't see any reason why it
> couldn't be atomic kmap.
Oh, right - lock_request() drops it. Still, I don't see anything other
than copying between map and unmap, and not a lot of it either...
As for get_user_pages_fast()... Why not do what mm/filemap.c does to
deal with the same issue? Prefault, then lock the destination page,
kmap_atomic() and do __copy_from_user_inatomic(). If that fails (i.e.
if something has raced with us and evicted the source from page table),
shrug, unlock and repeat.
I do realize that you want to share code between the read and write sides
of the whole thing, but I'm not sure it's worth doing. Almost everything
in that pile knows the direction - splitting a few low-level functions
into ..._in() and ..._out() variants (mostly along the checks already
in them) allows these paths to be separated completely, at which point it
becomes possible to use copy-page-to-iov_iter, etc. to take care of
mapping, dealing with iovec components, etc.
What I want to do is to get a sane set of iov_iter primitives that could
be used for everything, without their users having to care about the
nature of iov_iter - iovec, array of <page,offset,size,how_to_steal>
quadruples, biovec, etc. The interesting part of it is how to make
that set expressive enough, while keeping it reasonably sane. And
fs/fuse/dev.c is one of the more interesting potential users out there...
I've a growing queue with the beginning of that stuff; so far it's mostly
preparatory bits and pieces. Currently being tested: copy_page_to_iter()
(more or less similar to iov_iter_copy_to_..., but with saner interface
and dealing with the kmap, atomics, etc. without forcing the callers to
do that) with conversion of generic_file_aio_read() and friends to it.
If it survives the local beating, I'll start pushing it out (as
vfs.git#iov_iter); that pile is getting to potentially interesting bits...
* Re: [RFC] unifying write variants for filesystems
2014-02-01 22:43 ` Al Viro
2014-02-02 0:13 ` Linus Torvalds
2014-02-02 19:21 ` Al Viro
@ 2014-02-02 23:16 ` Anton Altaparmakov
2014-02-03 15:12 ` Christoph Hellwig
2014-02-03 16:23 ` Dave Kleikamp
4 siblings, 0 replies; 49+ messages in thread
From: Anton Altaparmakov @ 2014-02-02 23:16 UTC (permalink / raw)
To: Al Viro
Cc: Jens Axboe, Steve French, Sage Weil, Dave Kleikamp, Mark Fasheh,
xfs, Christoph Hellwig, Joel Becker, linux-fsdevel,
Linus Torvalds
Hi Al,
On 1 Feb 2014, at 22:43, Al Viro <viro@zeniv.linux.org.uk> wrote:
> * ntfs_file_buffered_write() should switch to iov_iter as well. It's
> open-coding a lot of iov_iter stuff.
The NTFS code predates iov_iter by more than 10 years so it simply wasn't there to use when I wrote NTFS... But yes, NTFS should be updated to use it, I agree completely.
> It's not entirely trivial and
> I'd really like to hear from ntfs folks on that, though, and the current
> code looks deadlock-prone. We prefault everything, then lock the pages
> to which we'll be copying, then attempt to do __copy_from_user_inatomic(),
> falling back to __copy_from_user() if that fails. Fine, but what happens
> if the source of write() is mmaped from the same file and we lose CPU while
> prefaulting the last page, have memory pressure evict the first one, then
> have __ntfs_grab_cache_pages() allocate and lock (nonuptodate) pages
> we'll be copying to and have __copy_from_user() try to copy *from* those
> same pages? We are doing it while holding these pages locked, so pagefault
> will have really fun time with them... Anton?
That would deadlock. You are (of course) quite right. But the generic file write code used to suffer from the same deadlock problem and no-one ever complained much about it IIRC...
In the current kernel the problem no longer exists (the atomic copy is done with page faults disabled and then retried with just the first segment until it succeeds), so updating NTFS to use iov_iter would indeed be a good opportunity to get that fixed in NTFS as well.
It does not matter if this turns out to be quite inefficient for NTFS (due to locking/unlocking several pages at once), as it is such a crazy corner case that it will hardly ever be triggered in real life. The only time NTFS operates on more than one page at once in ntfs_file_buffered_write() is when a write happens into a sparse logical block (an NTFS "cluster") on a volume with a logical block size > PAGE_CACHE_SIZE; given that PAGE_CACHE_SIZE is at least 4096 bytes on Linux, and that Windows only ever creates volumes with a logical block size of 4096 (unless the user specifically forces a different block size when creating the volume), that basically never happens. Also, the NTFS kernel driver never creates sparse files, so the sparse file would have had to come from Windows or another NTFS implementation...
I will have a look at moving NTFS to iov_iter and fixing the potential deadlock at the same time.
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge
J.J. Thomson Avenue, Cambridge, CB3 0RB, UK
* Re: [RFC] unifying write variants for filesystems
2014-02-01 22:43 ` Al Viro
` (2 preceding siblings ...)
2014-02-02 23:16 ` Anton Altaparmakov
@ 2014-02-03 15:12 ` Christoph Hellwig
2014-02-03 16:24 ` Al Viro
2014-02-03 16:50 ` Dave Kleikamp
2014-02-03 16:23 ` Dave Kleikamp
4 siblings, 2 replies; 49+ messages in thread
From: Christoph Hellwig @ 2014-02-03 15:12 UTC (permalink / raw)
To: Al Viro
Cc: Jens Axboe, Dave Kleikamp, Sage Weil, Christoph Hellwig,
Mark Fasheh, xfs, Steve French, Joel Becker, linux-fsdevel,
Linus Torvalds, Anton Altaparmakov
On Sat, Feb 01, 2014 at 10:43:01PM +0000, Al Viro wrote:
> * WTF bother passing 'pos' separately? It's the same mistake that was
> made with ->aio_read/->aio_write and just as with those, *all* callers
> provably have pos == iocb->ki_pos.
I think this landed with the initial aio support, which planned for
allowing AIO retries from a workqueue with a partially incremented
pos. None of this ever got merged, probably because it was too ugly
to live.
> * We *definitely* want a variant structure with tag - unsigned long thing
> was just plain insane. I see at least two variants - array of iovecs
> and array of (at least) triples <page, offset, length>. Quite possibly -
> quadruples, with "here's how to try to steal this page" thrown in, if
> we want that as replacement for ->splice_write() as well (it looks like
> the few instances that do steal on pipe-to-file splices could be dealt
> with the same way as the dumb ones, provided that ->write_iter or whatever
> we end up calling it is allowed to try and steal pages). Possibly more
> variants on the read side of things... FWIW, I'm not sure that bio_vec
> makes a lot of sense here.
bio_vec just is one of the many page+offset+len containers we have, I
guess Dave took it because loop uses it. We could either invent a new
one here or finally have a common one for the different uses all over
the kernel.
* Re: [RFC] unifying write variants for filesystems
2014-02-03 15:12 ` Christoph Hellwig
@ 2014-02-03 16:24 ` Al Viro
2014-02-03 16:50 ` Dave Kleikamp
1 sibling, 0 replies; 49+ messages in thread
From: Al Viro @ 2014-02-03 16:24 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Dave Kleikamp, Sage Weil, Mark Fasheh, xfs,
Steve French, Joel Becker, linux-fsdevel, Linus Torvalds,
Anton Altaparmakov
On Mon, Feb 03, 2014 at 07:12:42AM -0800, Christoph Hellwig wrote:
> > we end up calling it is allowed to try and steal pages). Possibly more
> > variants on the read side of things... FWIW, I'm not sure that bio_vec
> > makes a lot of sense here.
>
> bio_vec just is one of the many page+offset+len containers we have, I
> guess Dave took it because loop uses it. We could either invent a new
> one here or finally have a common one for the different uses all over
> the kernel.
*blink*
Good luck unifying all such uses. That would have to include the things
needed by pipe_buffer, DMA-related bits for scatterlist, etc. I don't
believe that it's feasible, or would've been a good idea in the first
place...
* Re: [RFC] unifying write variants for filesystems
2014-02-03 15:12 ` Christoph Hellwig
2014-02-03 16:24 ` Al Viro
@ 2014-02-03 16:50 ` Dave Kleikamp
1 sibling, 0 replies; 49+ messages in thread
From: Dave Kleikamp @ 2014-02-03 16:50 UTC (permalink / raw)
To: Christoph Hellwig, Al Viro
Cc: Jens Axboe, Sage Weil, Mark Fasheh, xfs, Steve French,
Kent Overstreet, Joel Becker, linux-fsdevel, Linus Torvalds,
Anton Altaparmakov
On 02/03/2014 09:12 AM, Christoph Hellwig wrote:
> On Sat, Feb 01, 2014 at 10:43:01PM +0000, Al Viro wrote:
>> * WTF bother passing 'pos' separately? It's the same mistake that was
>> made with ->aio_read/->aio_write and just as with those, *all* callers
>> provably have pos == iocb->ki_pos.
>
> I think this landed with the initial aio support, which planned for
> allowing AIO retries for a workqueue with a partially incremented
> pos. None of this ever got merged, probably because it was too ugly
> to live.
Yeah, when these patches were first written, AIO looked a lot different.
>> * We *definitely* want a variant structure with tag - unsigned long thing
>> was just plain insane. I see at least two variants - array of iovecs
>> and array of (at least) triples <page, offset, length>. Quite possibly -
>> quadruples, with "here's how to try to steal this page" thrown in, if
>> we want that as replacement for ->splice_write() as well (it looks like
>> the few instances that do steal on pipe-to-file splices could be dealt
>> with the same way as the dumb ones, provided that ->write_iter or whatever
>> we end up calling it is allowed to try and steal pages). Possibly more
>> variants on the read side of things... FWIW, I'm not sure that bio_vec
>> makes a lot of sense here.
>
> bio_vec just is one of the many page+offset+len containers we have, I
> guess Dave took it because loop uses it. We could either invent a new
> one here or finally have a common one for the different uses all over
> the kernel.
With Kent's immutable bio_vec changes, peeking inside the bio to get to
the bio_vec is uglier than it was before, so there's no need to stick
with that.
Shaggy
* Re: [RFC] unifying write variants for filesystems
2014-02-01 22:43 ` Al Viro
` (3 preceding siblings ...)
2014-02-03 15:12 ` Christoph Hellwig
@ 2014-02-03 16:23 ` Dave Kleikamp
2014-02-04 12:44 ` Al Viro
4 siblings, 1 reply; 49+ messages in thread
From: Dave Kleikamp @ 2014-02-03 16:23 UTC (permalink / raw)
To: Al Viro, Linus Torvalds
Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs,
Christoph Hellwig, Joel Becker, linux-fsdevel, Zach Brown,
Anton Altaparmakov
On 02/01/2014 04:43 PM, Al Viro wrote:
> On Mon, Jan 20, 2014 at 12:32:01PM -0800, Linus Torvalds wrote:
>> On Mon, Jan 20, 2014 at 5:55 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>> On Sun, Jan 19, 2014 at 05:13:35AM +0000, Al Viro wrote:
>>>> Folks, what do you think about the following:
>>>
>>> That's very much what Shaggy did in the aio-direct tree, except that
>>> it kept using a single set of methods. Linus really didn't like it
>>> unfortunately.
>>
>> Umm. That wasn't what I objected to.
>>
>> I objected to the incredibly ugly implementation, the crazy new flags
>> arguments for core helper functions, ugly naming etc etc. I even
>> outlined what might fix it.
>>
>> In other words, I thought the code was shit and ugly. Not that
>> iterators per se would be wrong. Just doing them badly is wrong.
>
> Gyahhh... OK, I should've known better than go looking into that thing
> after such warning. Some relatively printable notes (i.e. about 10%
> of the comments I really had about that) follow:
Thanks for the feedback. I'd been asking for feedback on this patchset
for some time now, and have not received very much.
This is all based on some years-old work by Zach Brown that he probably
wishes would have disappeared by now. I pretty much left what I could
alone since 1) it was working, and 2) I didn't hear any objections
(until now).
It's clear now that the patchset isn't close to mergable, so treat it
like a proof-of-concept and we can come up with a better container and
read/write interface. I won't respond individually to your comments, but
will take them all into consideration going forward.
Thanks,
Shaggy
> * WTF bother passing 'pos' separately? It's the same mistake that was
> made with ->aio_read/->aio_write and just as with those, *all* callers
> provably have pos == iocb->ki_pos.
>
> * I'm not sure that iov_iter is a good fit here. OTOH, it probably could
> be reused (it has damn few users right now and they are on the codepaths
> involved into that thing).
>
> * We *definitely* want a variant structure with tag - unsigned long thing
> was just plain insane. I see at least two variants - array of iovecs
> and array of (at least) triples <page, offset, length>. Quite possibly -
> quadruples, with "here's how to try to steal this page" thrown in, if
> we want that as replacement for ->splice_write() as well (it looks like
> the few instances that do steal on pipe-to-file splices could be dealt
> with the same way as the dumb ones, provided that ->write_iter or whatever
> we end up calling it is allowed to try and steal pages). Possibly more
> variants on the read side of things... FWIW, I'm not sure that bio_vec
> makes a lot of sense here.
>
> * direction (i.e. source vs. destination) also should be a part of that
> tag. Which, BTW, turns ->direct_IO() into iocb x iov_iter -> int;
> the situation with pos is identical to aio_read/aio_write. While we
> are at it, KERNEL_WRITE thing (used only for __swap_writepage()) looks
> very much like a special case of "array of <page,offset,size>" we want
> for splice_write...
>
> * having looked through the ->read and ->write instances in drivers,
> I'd say that surprisingly few helpers would go a _long_ way towards
> converting those guys to the same methods. And we need such helpers
> anyway - there's a whole lot of (badly) open-coded "copy the whole
> user buffer into kmalloc'ed array and NUL-terminate the sucker" logics
> in ->write() instances, for example. Even more "copy up to that much
> into given array and NUL-terminate it", with rather amusing bugs in
> there - e.g. NUL going into the end of array, regardless of the actual
> amount of data to copy; junk is left in the middle, complete with
> printk of the entire thing if it doesn't make sense. IOW, spewing
> random piece of kernel stack into dmesg. Off-by-ones galore, too...
>
> BTW, speaking of bogosities - grep for generic_write_checks(). Exactly
> one caller (in __generic_file_aio_write()) has any business looking
> at S_ISBLK(inode->i_mode) - it can be called by blkdev_aio_write().
> All other callers have copied that, even though it makes absolutely
> no sense for them...
>
> * file_readable/file_writable looks really wrong; if nothing else, I would
> rather check that after open() and set a couple of FMODE bits, then check
> those. Possibly even "knock FMODE_READ/FMODE_WRITE out if there's no
> method"...
>
> * pipe_buffer_operations ->map()/->unmap() should die; let the caller do
> k{,un}map{,_atomic}(). All instances have the same method there and
> there's no point to make it different. PIPE_BUF_FLAG_ATOMIC should also
> go.
>
> * WTF do iov_iter_copy_from_user_atomic() callers keep doing pagefault_disable
> and pagefault_enable around it? The sucker starts with kmap_atomic() and
> ends with kunmap_atomic(); all instances of those guys have pagefault
> disabling/enabling (and I suspect that it might make sense to lift it
> out of arch-specific variants - rename them to e.g. __kmap_atomic(),
> rip pagefault_disable() out of those and put kmap_atomic() into highmem.h
> outside of ifdef, with !HIGHMEM side of ifdef having #define __kmap_atomic
> page_address; move pagefault_enable() from __kunmap_atomic() to
> kunmap_atomic() while we are at it).
>
> Note that e.g. pipe.c relies on kmap_atomic() disabling pagefaults - we
> have __copy_from_user_inatomic() done there under kmap_atomic(), and we
> really don't want to block in such conditions.
>
> * pipe_iov_copy_from_user() ought to return how much has it managed to
> bring in, not 0 or -EFAULT as it does now. Then it won't need the
> non-atomic side, AFAICS. Moreover, we'll just be able to use
> iov_iter_copy_from_user_atomic() (which badly needs a more palatable
> name, BTW).
>
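[The return-convention change proposed above, sketched in userspace: report how many bytes made it across rather than collapsing to 0-or--EFAULT. Here fault_at simulates the first inaccessible source byte; names are invented for illustration.]

```c
#include <stddef.h>
#include <string.h>

static size_t copy_partial(char *dst, const char *src, size_t len,
                           size_t fault_at)
{
    size_t n = len < fault_at ? len : fault_at;

    memcpy(dst, src, n);
    return n;   /* caller can advance its iterator by exactly this much */
}
```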
> * sure, we want to call do_generic_file_read() once, passing the entire
> iovec to file_read_actor(). But what the hell does that have to do with
> introduction of new methods? It's a change that makes sense on its
> own. Moreover, it's a damn good preparation to adding those - we
> get generic_file_aio_read() into "set iov_iter up, then do <this>",
> with <this> using iov_iter instead of iovec. Once we get to introducing
> those methods, it's just a matter of taking the rest of generic_file_aio_read()
> into a separate function and making that function an instance of new
> method...
>
> * Unrelated to this patchset, but... may I politely inquire about the reasons
> why ->is_partially_uptodate() gets read_descriptor_t? The whole point
> of read_descriptor_t (if it has any) is that its interpretation is up to
> whoever's doing the reading, _not_ what they are reading from. So desc->arg
> is off-limits for any instance of ->is_partially_uptodate(). desc->written is
> obviously pointless for them; the need (or lack thereof) to do something
> to the page doesn't depend on how much have we already read from the
> file. Moreover, reporting an error is rather dubious in such method;
> if there's something fishy with the page, we'll just return "no" and let
> ->readpage() complain. Which leaves only desc->count, which, unsurprisingly,
> is the only thing looked at by (the only) instance of that method. And
> "tell me if the part of this page that long starting from that offset is
> up to date" is much more natural than "is what this read_descriptor_t would
> have had us read from this offset in this page up to date?"...
>
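[The preferred signature described above - "is the part of this page that long, starting from that offset, up to date?" - can be modelled like so. The per-block uptodate bitmap, block size and names are invented for illustration; the real instance tracks buffer_head state.]

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SZ   4096u
#define BLOCK_SZ   512u

struct toy_page {
    unsigned char block_uptodate[PAGE_SZ / BLOCK_SZ];   /* 1 = uptodate */
};

static bool range_is_uptodate(const struct toy_page *pg,
                              size_t from, size_t count)
{
    size_t first = from / BLOCK_SZ;
    size_t last = (from + count - 1) / BLOCK_SZ;

    for (size_t i = first; i <= last; i++)
        if (!pg->block_uptodate[i])
            return false;   /* just say "no"; ->readpage() can complain */
    return true;
}
```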
> * do we really need separate non-atomic variant of iov_iter_copy_from_user()?
> We have only two users right now (cifs and ceph) and both can use the
> fault-in / atomic copy loop without much pain...
>
> * in addition to having too many methods, I'm not convinced that we want
> them to be methods. Let's start with explicit checks in the primitives and
> see where it goes from there; if we start to grow too many variants,
> we can always introduce some methods, but then we'll be in better position
> to decide what is and what is not a good method...
>
> * on the read side, I really don't believe that exposing atomic and
> non-atomic variants is a good idea. Look at the existing users of
> __copy_to_user_inatomic(); leaving aside the i915_gem weirdness,
> all of them are used to implement the exact same thing: given a page,
> offset and length, feed its contents to iovec/buffer/whatnot. Unlike
> the write side of things, there's nothing between prefaulting pages
> and actual attempts to copy. So let's make _that_ an exposed primitive
> and let it deal with kmap/kmap_atomic/etc. Variants that don't have
> to worry about blocking (vector of <page,offset,length>, etc.) would
> simply not bother with non-atomic kmap, etc. Sure, it should take
> iov_iter as destination. And deal with mapping the damn thing internally...
>
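[The exposed read-side primitive suggested above - page, offset, length in; iovec fed, mapping dealt with internally - reduces to something like this in userspace, with kmap/__copy_to_user_inatomic stood in by plain pointers and memcpy(). Names are invented for illustration.]

```c
#include <stddef.h>
#include <string.h>

struct iovec_ux { char *base; size_t len; };

static size_t copy_range_to_iov(const char *page, size_t offset, size_t len,
                                struct iovec_ux *iov, size_t nr_segs)
{
    size_t copied = 0, seg = 0, seg_off = 0;

    while (copied < len && seg < nr_segs) {
        size_t space = iov[seg].len - seg_off;
        size_t n = len - copied < space ? len - copied : space;

        memcpy(iov[seg].base + seg_off, page + offset + copied, n);
        copied += n;
        seg_off += n;
        if (seg_off == iov[seg].len) {  /* segment full, move on */
            seg++;
            seg_off = 0;
        }
    }
    return copied;
}
```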
> * ntfs_file_buffered_write() should switch to iov_iter as well. It's
> open-coding a lot of iov_iter stuff. It's not entirely trivial and
> I'd really like to hear from ntfs folks on that, though, and the current
> code looks deadlock-prone. We prefault everything, then lock the pages
> to which we'll be copying, then attempt to do __copy_from_user_inatomic(),
> falling back to __copy_from_user() if that fails. Fine, but what happens
> if the source of write() is mmaped from the same file and we lose CPU while
> prefaulting the last page, have memory pressure evict the first one, then
> have __ntfs_grab_cache_pages() allocate and lock (nonuptodate) pages
> we'll be copying to and have __copy_from_user() try to copy *from* those
> same pages? We are doing it while holding these pages locked, so pagefault
> will have a really fun time with them... Anton?
>
> BTW, Linus, when did you comment on that patchset? I've found an iteration
> of that patchset circa last October (v9, apparently the latest posted),
> but it looks like your comments had either got lost or had been on the
> earlier iteration of that thing...
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] unifying write variants for filesystems
2014-02-03 16:23 ` Dave Kleikamp
@ 2014-02-04 12:44 ` Al Viro
2014-02-04 12:52 ` Kent Overstreet
0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2014-02-04 12:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker,
linux-fsdevel, xfs, Sage Weil, Steve French, Anton Altaparmakov,
Zach Brown, Kent Overstreet, Dave Kleikamp
On Mon, Feb 03, 2014 at 10:23:13AM -0600, Dave Kleikamp wrote:
> Thanks for the feedback. I'd been asking for feedback on this patchset
> for some time now, and have not received very much.
>
> This is all based on some years-old work by Zach Brown that he probably
> wishes would have disappeared by now. I pretty much left what I could
> alone since 1) it was working, and 2) I didn't hear any objections
> (until now).
>
> It's clear now that the patchset isn't close to mergeable, so treat it
> like a proof-of-concept and we can come up with a better container and
> read/write interface. I won't respond individually to your comments, but
> will take them all into consideration going forward.
FWIW, I suspect that the right way to deal with dio side of things would
be a primitive along the lines of "get first N <page,start,len> for the
iov_iter". With get_user_pages_fast() for iovec-backed ones and "just
grab references" for array-of-page-subranges ones.
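The address arithmetic behind such a primitive for the iovec-backed case is sketched below; the actual pinning (get_user_pages_fast) is elided, and the struct/function names are invented for illustration - this only shows how a user range decomposes into page subranges.

```c
#include <stddef.h>

#define PG_SIZE 4096ul

struct pg_range {
    unsigned long pgidx;   /* stands in for the pinned struct page */
    size_t start;          /* offset within that page */
    size_t len;
};

/* Split [addr, addr + len) at page boundaries into at most max triples;
 * returns how many were produced. */
static size_t split_to_pages(unsigned long addr, size_t len,
                             struct pg_range *out, size_t max)
{
    size_t n = 0;

    while (len && n < max) {
        size_t start = addr & (PG_SIZE - 1);
        size_t chunk = PG_SIZE - start < len ? PG_SIZE - start : len;

        out[n].pgidx = addr / PG_SIZE;
        out[n].start = start;
        out[n].len = chunk;
        addr += chunk;
        len -= chunk;
        n++;
    }
    return n;
}
```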
_IF_ direct-io.c can be massaged to use that (and it looks like it should
be able to - AFAICS, we don't really care if pages are mapped in userland or
kernel space there), we get something really neat out of that: not only can
we get rid of generic_file_splice_write(), but we get full zero-copy
sendfile() - just have the target opened with O_DIRECT and everything will
work; ->splice_read() will trigger reads to source pagecache and with that
massage done, ->splice_write() will issue writes directly from those
pages, with no memory-to-memory copying in sight... We can also get rid of
that kmap() in __swap_writepage(), while we are at it.
I'm going through direct-io.c guts right now and so far that looks feasible,
but I'd really appreciate comments from the folks more familiar with the
damn thing.
The queue so far is in vfs.git#iov_iter; I've gone after the low-hanging
fruits in the review I've posted upthread and I more or less like the
results so far...
Comments?
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] unifying write variants for filesystems
2014-02-04 12:44 ` Al Viro
@ 2014-02-04 12:52 ` Kent Overstreet
2014-02-04 15:17 ` Al Viro
0 siblings, 1 reply; 49+ messages in thread
From: Kent Overstreet @ 2014-02-04 12:52 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, Christoph Hellwig, Jens Axboe, Mark Fasheh,
Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French,
Anton Altaparmakov, Zach Brown, Dave Kleikamp
On Tue, Feb 04, 2014 at 12:44:09PM +0000, Al Viro wrote:
> On Mon, Feb 03, 2014 at 10:23:13AM -0600, Dave Kleikamp wrote:
>
> > Thanks for the feedback. I'd been asking for feedback on this patchset
> > for some time now, and have not received very much.
> >
> > This is all based on some years-old work by Zach Brown that he probably
> > wishes would have disappeared by now. I pretty much left what I could
> > alone since 1) it was working, and 2) I didn't hear any objections
> > (until now).
> >
> > It's clear now that the patchset isn't close to mergeable, so treat it
> > like a proof-of-concept and we can come up with a better container and
> > read/write interface. I won't respond individually to your comments, but
> > will take them all into consideration going forward.
>
> FWIW, I suspect that the right way to deal with dio side of things would
> be a primitive along the lines of "get first N <page,start,len> for the
> iov_iter". With get_user_pages_fast() for iovec-backed ones and "just
> grab references" for array-of-page-subranges ones.
I'm on vacation in Switzerland, didn't bring my adderall, and
direct-io.c makes my head hurt at the best of times, but - have a look
at my in-progress dio rewrite:
http://evilpiepirate.org/git/linux-bcache.git/commit/?h=block_stuff&id=ca09c20f08efd640f255fabd778de0dbf43ed1da
Where I'm headed with things is to just start out by allocating bios and
pinning pages into them, and _then_ doing all the fun "ask the
filesystem where it goes and what to do with it" dance. The goal is to
push the bios as far up the stack as possible.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] unifying write variants for filesystems
2014-02-04 12:52 ` Kent Overstreet
@ 2014-02-04 15:17 ` Al Viro
2014-02-04 17:27 ` Zach Brown
0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2014-02-04 15:17 UTC (permalink / raw)
To: Kent Overstreet
Cc: Linus Torvalds, Christoph Hellwig, Jens Axboe, Mark Fasheh,
Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French,
Anton Altaparmakov, Zach Brown, Dave Kleikamp
On Tue, Feb 04, 2014 at 04:52:20AM -0800, Kent Overstreet wrote:
> I'm on vacation in Switzerland, didn't bring my adderall, and
> direct-io.c makes my head hurt at the best of times, but - have a look
> at my in-progress dio rewrite:
>
> http://evilpiepirate.org/git/linux-bcache.git/commit/?h=block_stuff&id=ca09c20f08efd640f255fabd778de0dbf43ed1da
>
> Where I'm headed with things is to just start out by allocating bios and
> pinning pages into them, and _then_ doing all the fun "ask the
> filesystem where it goes and what to do with it" dance. The goal is to
> push the bios as far up the stack as possible.
How far would that be? E.g. for something like NFS it would be
completely wrong. AFAICS, what you are doing there isn't incompatible
with what I described; bio_get_user_pages() would just use that
primitive (in fact, the loop in it is a damn good starting point for
implementation of that primitive for iovec-based instances of iov_iter).
I'm not too fond of the names, TBH - it might make sense to rename
iov_iter to something like mem_stream; maybe even leave iov_iter as-is
for whatever users might remain, but for now I don't see any that
> would fundamentally depend on the thing being iovec-backed...
And bio_vec is a bad misnomer - it's not related to block subsystem at
all. Sure, it had originated there, but... Hell knows; by now it's
probably too much PITA to rename (we have about half a thousand instances
in the tree). Pity, that...
I definitely don't buy "bio is a natural object for carrying an array
of pieces of pages"; not sure if that's what you implied in earlier
thread, but it has too much baggage from block subsystem *and* it lacks
the things we may want to associate with individual elements of such
array (starting with "how can I steal that page?" method).
I'm not sure if you'd been reading that thread back when it started;
my interest in that thing is mostly because I want to get rid of
duplication (and inconsistencies) between ->aio_write() and ->splice_write().
I hadn't been watching the threads around iov_iter last year; hch has
pointed to those when I proposed to use an object that could carry
both iovec and (possibly extended) analog of bio_vec and make
generic_file_aio_write() et al. agnostic wrt what's behind that object.
Then we could use the same method to implement both ->aio_write() and
->splice_write() in a lot of cases. iov_iter is a good starting point
for such object, and for now I'm mostly doing stuff that encapsulates
the knowledge of its guts (including "there's an iovec behind it").
Those cleanups aside (and they make sense on their own, regardless of
where the rest goes), it might make sense to add a copy of struct iov_iter
that would have a tagged union in it (originally just for iovec, with
IOVEC_READ/IOVEC_WRITE as possible tags) and switch a bunch of places that
do not look into the guts of iov_iter to that thing. I'm not sure if
there will be any other places left (so far it looks like we'll be able
to get away with a reasonable set of primitives), but... we'll see.
For now the whole thing is fairly experimental and it will almost certainly
be reordered, etc. quite a few times. I'm trying to keep the part of
queue in vfs.git#iov_iter more or less stable (with a lot of stuff in
flux sitting in the local one), but it's not at the state where I'd
recommend merges from it; there will be rebases, etc.
BTW, folks, any suggestions about the name of that "memory stream" thing?
struct iov_iter really implies iterator for iovec and more generic name
would probably be better... struct mem_stream would probably do if nobody
comes up with better variant, but it's long and somewhat clumsy...
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] unifying write variants for filesystems
2014-02-04 15:17 ` Al Viro
@ 2014-02-04 17:27 ` Zach Brown
2014-02-04 17:35 ` Kent Overstreet
2014-02-04 18:00 ` Al Viro
0 siblings, 2 replies; 49+ messages in thread
From: Zach Brown @ 2014-02-04 17:27 UTC (permalink / raw)
To: Al Viro
Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs,
Christoph Hellwig, Kent Overstreet, Dave Kleikamp, Joel Becker,
linux-fsdevel, Linus Torvalds, Anton Altaparmakov
> I definitely don't buy "bio is a natural object for carrying an array
> of pieces of pages"; not sure if that's what you implied in earlier
> thread, but it has too much baggage from block subsystem *and* it lacks
> the things we may want to associate with individual elements of such
> array (starting with "how can I steal that page?" method).
I think Kent is talking about what happens after the user addresses are
consumed. Turning dio into more of a bio mapping and redirection engine
would use more of the bio machinery instead of the bits that dio has
implemented itself with state in struct dio that hangs off the bios. I
imagine it'd still make sense to clean up the addresses/pages arguments
that feed that engine. (And give another entry point that already has
bios for callers like loop, etc.)
> BTW, folks, any suggestions about the name of that "memory stream" thing?
> struct iov_iter really implies iterator for iovec and more generic name
> would probably be better... struct mem_stream would probably do if nobody
> comes up with better variant, but it's long and somewhat clumsy...
I don't like 'stream'. To me that sounds more strictly advancing than I
think this'd be capable of. Maybe something dirt simple like 'mem_vec'?
With 'mvec_' call prefixes?
Or kiobuf! *runs*
- z
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] unifying write variants for filesystems
2014-02-04 17:27 ` Zach Brown
@ 2014-02-04 17:35 ` Kent Overstreet
2014-02-04 18:08 ` Al Viro
2014-02-04 18:00 ` Al Viro
1 sibling, 1 reply; 49+ messages in thread
From: Kent Overstreet @ 2014-02-04 17:35 UTC (permalink / raw)
To: Zach Brown
Cc: Jens Axboe, Christoph Hellwig, Sage Weil, Mark Fasheh, xfs,
Steve French, Dave Kleikamp, Al Viro, linux-fsdevel,
Linus Torvalds, Anton Altaparmakov, Joel Becker
On Feb 4, 2014 6:27 PM, "Zach Brown" <zab@redhat.com> wrote:
>
> > I definitely don't buy "bio is a natural object for carrying an array
> > of pieces of pages"; not sure if that's what you implied in earlier
> > thread, but it has too much baggage from block subsystem *and* it lacks
> > the things we may want to associate with individual elements of such
> > array (starting with "how can I steal that page?" method).
>
> I think Kent is talking about what happens after the user addresses are
> consumed. Turning dio into more of a bio mapping and redirection engine
> would use more of the bio machinery instead of the bits that dio has
> implemented itself with state in struct dio that hangs off the bios. I
> imagine it'd still make sense to clean up the addresses/pages arguments
> that feed that engine. (And give another entry point that already has
> bios for callers like loop, etc.)
Yeah, precisely. I think we can push the point at which pages are pinned at
least a fair bit higher than it is now, and if we can do that we definitely
should be working with a generic "bag of pinned pages" struct - and much
better to use struct bio than add another one. Bios may not be perfect but
at least some of the block layer specific cruft can be gotten rid of (and
is on my todo list).
>
> > BTW, folks, any suggestions about the name of that "memory stream"
thing?
> > struct iov_iter really implies iterator for iovec and more generic name
> > would probably be better... struct mem_stream would probably do if
nobody
> > comes up with better variant, but it's long and somewhat clumsy...
>
> I don't like 'stream'. To me that sounds more strictly advancing than I
> think this'd be capable of. Maybe something dirt simple like 'mem_vec'?
> With 'mvec_' call prefixes?
>
> Or kiobuf! *runs*
*cuts you*
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] unifying write variants for filesystems
2014-02-04 17:35 ` Kent Overstreet
@ 2014-02-04 18:08 ` Al Viro
0 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2014-02-04 18:08 UTC (permalink / raw)
To: Kent Overstreet
Cc: Zach Brown, Dave Kleikamp, Steve French, Sage Weil,
Linus Torvalds, Jens Axboe, Anton Altaparmakov, Mark Fasheh,
linux-fsdevel, xfs, Joel Becker, Christoph Hellwig
On Tue, Feb 04, 2014 at 09:35:06AM -0800, Kent Overstreet wrote:
> > I think Kent is talking about what happens after the user addresses are
> > consumed. Turning dio into more of a bio mapping and redirection engine
> > would use more of the bio machinery instead of the bits that dio has
> > implemented itself with state in struct dio that hangs off the bios. I
> > imagine it'd still make sense to clean up the addresses/pages arguments
> > that feed that engine. (And give another entry point that already has
> > bios for callers like loop, etc.)
>
> Yeah, precisely. I think we can push the point at which pages are pinned at
> least a fair bit higher than it is now, and if we can do that we definitely
> should be working with a generic "bag of pinned pages" struct - and much
> better to use struct bio than add another one. Bios may not be perfect but
> at least some of the block layer specific cruft can be gotten rid of (and
> is on my todo list).
How far up? I've no problem with having that done in ->direct_IO()
(especially if it would take this mem_vec/mem_stream/whatever and
keep the code doing actual pinning and building an array of pages
outside of direct-io.c, allowing it to do different things for iovec-backed
and page-array-backed variants), but I really don't like the idea of
lifting that to callers of ->direct_IO(), let alone past them.
If nothing else, we do *not* want to pin the entire write request, so
lifting that to e.g. generic_file_aio_write() (or, worse, its callers)
wouldn't make sense.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] unifying write variants for filesystems
2014-02-04 17:27 ` Zach Brown
2014-02-04 17:35 ` Kent Overstreet
@ 2014-02-04 18:00 ` Al Viro
2014-02-04 18:33 ` Zach Brown
1 sibling, 1 reply; 49+ messages in thread
From: Al Viro @ 2014-02-04 18:00 UTC (permalink / raw)
To: Zach Brown
Cc: Kent Overstreet, Linus Torvalds, Christoph Hellwig, Jens Axboe,
Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil,
Steve French, Anton Altaparmakov, Dave Kleikamp
On Tue, Feb 04, 2014 at 09:27:23AM -0800, Zach Brown wrote:
> I think Kent is talking about what happens after the user addresses are
> consumed. Turning dio into more of a bio mapping and redirection engine
> would use more of the bio machinery instead of the bits that dio has
> implemented itself with state in struct dio that hangs off the bios. I
> imagine it'd still make sense to clean up the addresses/pages arguments
> that feed that engine. (And give another entry point that already has
> bios for callers like loop, etc.)
>
> > BTW, folks, any suggestions about the name of that "memory stream" thing?
> > struct iov_iter really implies iterator for iovec and more generic name
> > would probably be better... struct mem_stream would probably do if nobody
> > comes up with better variant, but it's long and somewhat clumsy...
>
> I don't like 'stream'. To me that sounds more strictly advancing than I
> think this'd be capable of. Maybe something dirt simple like 'mem_vec'?
> With 'mvec_' call prefixes?
Umm... Frankly, I would rather discourage attempts to read the same data
twice, if only on the naming level...
Case in point: commit 1c1c87 (btrfs: sanitize BTRFS_IOC_FILE_EXTENT_SAME).
I really wonder how many places have similar holes. What used to happen
was this: we have a userland structure, with a variable-sized array hanging
off its arse. The size of array is determined by the field in fixed-sized
header. We copy the header in, decide what size the whole thing should have,
and do memdup_user() to bring everything in. Very convenient, since at that
point we have a pointer to that struct-with-array in the kernel space.
Attacker manages to increase the 'desc_count' field between two
copy_from_user()... and the sucker proceeds to loop over the array in
kernel-side copy, using the ->desc_count of that copy as the upper limit
of the loop. Oops - in the best case, that is.
Double reads really ought to raise red flags on review. I'm not saying that
they should be hard to do (after all, the fix in that commit *does* read the
same thing twice), but it's better if they are not used without thinking.
And no, I'm not suggesting to make ioctls use iov_iter/whatnot - it's just
an example of the class of bugs. I wouldn't be surprised to find ->write()
instances in drivers suffering the same problem...
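The double-fetch pattern described above, and its fix, reduce to the following userspace sketch (copy_from_user() stood in by memcpy(), names invented). The header is read once to size the allocation and then re-read as part of the full copy; a racing writer may have grown 'count' in between, so only the first, validated fetch may be trusted:

```c
#include <stdlib.h>
#include <string.h>

struct req {                /* fixed header + variable-sized array */
    unsigned count;
    int items[];
};

static struct req *fetch_req(const void *ubuf, unsigned max_items)
{
    unsigned count;
    struct req *r;

    memcpy(&count, ubuf, sizeof(count));                /* first fetch */
    if (count > max_items)
        return NULL;
    r = malloc(sizeof(*r) + count * sizeof(int));
    if (!r)
        return NULL;
    memcpy(r, ubuf, sizeof(*r) + count * sizeof(int));  /* second fetch */
    r->count = count;   /* re-clamp: ignore whatever the refetch saw */
    return r;
}
```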
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] unifying write variants for filesystems
2014-02-04 18:00 ` Al Viro
@ 2014-02-04 18:33 ` Zach Brown
2014-02-04 18:36 ` Al Viro
0 siblings, 1 reply; 49+ messages in thread
From: Zach Brown @ 2014-02-04 18:33 UTC (permalink / raw)
To: Al Viro
Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs,
Christoph Hellwig, Kent Overstreet, Dave Kleikamp, Joel Becker,
linux-fsdevel, Linus Torvalds, Anton Altaparmakov
> > > BTW, folks, any suggestions about the name of that "memory stream" thing?
> > > struct iov_iter really implies iterator for iovec and more generic name
> > > would probably be better... struct mem_stream would probably do if nobody
> > > comes up with better variant, but it's long and somewhat clumsy...
> >
> > I don't like 'stream'. To me that sounds more strictly advancing than I
> > think this'd be capable of. Maybe something dirt simple like 'mem_vec'?
> > With 'mvec_' call prefixes?
>
> Umm... Frankly, I would rather discourage attempts to read the same data
> twice, if only on the naming level...
Ahh, OK, sure. mem_iter?
- z
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] unifying write variants for filesystems
2014-02-04 18:33 ` Zach Brown
@ 2014-02-04 18:36 ` Al Viro
2014-02-05 19:58 ` Al Viro
0 siblings, 1 reply; 49+ messages in thread
From: Al Viro @ 2014-02-04 18:36 UTC (permalink / raw)
To: Zach Brown
Cc: Kent Overstreet, Linus Torvalds, Christoph Hellwig, Jens Axboe,
Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil,
Steve French, Anton Altaparmakov, Dave Kleikamp
On Tue, Feb 04, 2014 at 10:33:56AM -0800, Zach Brown wrote:
> > > > BTW, folks, any suggestions about the name of that "memory stream" thing?
> > > > struct iov_iter really implies iterator for iovec and more generic name
> > > > would probably be better... struct mem_stream would probably do if nobody
> > > > comes up with better variant, but it's long and somewhat clumsy...
> > >
> > > I don't like 'stream'. To me that sounds more strictly advancing than I
> > > think this'd be capable of. Maybe something dirt simple like 'mem_vec'?
> > > With 'mvec_' call prefixes?
> >
> > Umm... Frankly, I would rather discourage attempts to read the same data
> > twice, if only on the naming level...
>
> Ahh, OK, sure. mem_iter?
Works for me... Any other suggestions/objections/etc.?
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] unifying write variants for filesystems
2014-02-04 18:36 ` Al Viro
@ 2014-02-05 19:58 ` Al Viro
2014-02-05 20:42 ` Zach Brown
2014-02-06 9:08 ` Kent Overstreet
0 siblings, 2 replies; 49+ messages in thread
From: Al Viro @ 2014-02-05 19:58 UTC (permalink / raw)
To: Zach Brown
Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs,
Christoph Hellwig, Kent Overstreet, Dave Kleikamp, Joel Becker,
linux-fsdevel, Linus Torvalds, Anton Altaparmakov
BTW, why do we still have generic_segment_checks()?
AFAICS, *all* paths leading to any ->aio_read/->aio_write
instances are either
1) with KERNEL_DS (and base/len are verifiably sane in those
cases), or
2) have iovec come from successful {compat,}rw_copy_check_uvector()
and through rw_verify_area(), or
3) have single-element iovec with access_ok()/rw_verify_area()
checked directly, or
4) have single-element iovec with base/len unchanged from
what had been passed to some ->read() or ->write() instance, in which
case the caller of that ->read() or ->write() has done access_ok/rw_verify_area
And yes, I can prove that for the current tree, modulo a couple of dumb
bugs with unchecked values coming via read_code(). Which is called
a couple of times per a.out execve() and should be using vfs_read() instead
of blindly calling ->read() - it's *not* a hot path and never had been one.
With that fixed, we have the following: any call of any instance of
->read()/->write()/->aio_read()/->aio_write() (be it direct or via method)
is guaranteed that
* all segments it's asked to read/write will satisfy access_ok().
* all segments it's asked to read/write will have non-negative
lengths.
* total size of all segments will be at most MAX_RW_COUNT.
* file offset won't go from negative to zero in the combined area;
unless the file has FMODE_UNSIGNED_OFFSET in ->f_mode, it won't go from
positive to negative either.
So what exactly does generic_segment_checks() give us? Is it just that
everybody went "well, maybe there's some weird path where we don't do
validation; let's leave it there"? Linus?
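The invariants listed above amount to a check along these lines (userspace sketch of what rw_copy_check_uvector() already guarantees before any ->aio_read/->aio_write instance runs; the MAX_RW_COUNT definition mirrors the kernel's INT_MAX page mask, other names are invented):

```c
#include <stddef.h>
#include <limits.h>

struct iovec_ux { void *base; size_t len; };

#define MAX_RW_COUNT_UX ((size_t)(INT_MAX & ~4095))

/* Sum segment lengths, rejecting overflow and totals over the cap;
 * the kernel returns -EINVAL where we return -1. */
static long checked_total(const struct iovec_ux *iov, size_t nr_segs)
{
    size_t total = 0;

    for (size_t i = 0; i < nr_segs; i++) {
        if (iov[i].len > MAX_RW_COUNT_UX - total)
            return -1;
        total += iov[i].len;
    }
    return (long)total;
}
```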
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] unifying write variants for filesystems
2014-02-05 19:58 ` Al Viro
@ 2014-02-05 20:42 ` Zach Brown
2014-02-06 9:08 ` Kent Overstreet
1 sibling, 0 replies; 49+ messages in thread
From: Zach Brown @ 2014-02-05 20:42 UTC (permalink / raw)
To: Al Viro
Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs,
Christoph Hellwig, Kent Overstreet, Dave Kleikamp, Joel Becker,
linux-fsdevel, Linus Torvalds, Anton Altaparmakov
> So what exactly does generic_segment_checks() give us? Is it just that
> everybody went "well, maybe there's some weird path where we don't do
> validation; let's leave it there"? Linus?
Sounds likely. I'd bet that the btrfs call was just copied from
__generic_file_aio_write() when btrfs grew its own .aio_write method:
11c65dccf70be9ace5dbd3906778e1a099b1fee1
Author: Josef Bacik <josef@redhat.com>
Date: Sun May 23 11:07:21 2010 -0400
Btrfs: do aio_write instead of write
It probably amounts to no more than ocount = iov_length().
- z
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] unifying write variants for filesystems
2014-02-05 19:58 ` Al Viro
2014-02-05 20:42 ` Zach Brown
@ 2014-02-06 9:08 ` Kent Overstreet
1 sibling, 0 replies; 49+ messages in thread
From: Kent Overstreet @ 2014-02-06 9:08 UTC (permalink / raw)
To: Al Viro
Cc: Jens Axboe, Steve French, Sage Weil, Zach Brown, Mark Fasheh, xfs,
Christoph Hellwig, Dave Kleikamp, Joel Becker, linux-fsdevel,
Linus Torvalds, Anton Altaparmakov
On Wed, Feb 05, 2014 at 07:58:38PM +0000, Al Viro wrote:
> BTW, why do we still have generic_segment_checks()?
> AFAICS, *all* paths leading to any ->aio_read/->aio_write
> instances are either
> 1) with KERNEL_DS (and base/len are verifiably sane in those
> cases), or
> 2) have iovec come from successful {compat,}rw_copy_check_uvector()
> and through rw_verify_area(), or
> 3) have single-element iovec with access_ok()/rw_verify_area()
> checked directly, or
> 4) have single-element iovec with base/len unchanged from
> what had been passed to some ->read() or ->write() instance, in which
> case the caller of that ->read() or ->write() has done access_ok/rw_verify_area
>
> And yes, I can prove that for the current tree, modulo a couple of dumb
> bugs with unchecked values coming via read_code(). Which is called
> a couple of times per a.out execve() and should be using vfs_read() instead
> of blindly calling ->read() - it's *not* a hot path and never had been one.
> With that fixed, we have the following: any call of any instance of
> ->read()/->write()/->aio_read()/->aio_write() (be it direct or via method)
> is guaranteed that
> * all segments it's asked to read/write will satisfy access_ok().
> * all segments it's asked to read/write will have non-negative
> lengths.
> * total size of all segments will be at most MAX_RW_COUNT.
> * file offset won't go from negative to zero in the combined area;
> unless the file has FMODE_UNSIGNED_OFFSET in ->f_mode, it won't go from
> positive to negative either.
>
> So what exactly does generic_segment_checks() give us? Is it just that
> everybody went "well, maybe there's some weird path where we don't do
> validation; let's leave it there"? Linus?
I came to the same conclusion awhile ago - I'm pretty sure it can be
safely dropped (I think I even have such a patch in one of my
branches...)
Anyways, copy_check_uvector() is the correct place for all that stuff
- it's taking a __user type and producing a type without the
__user attribute, so if there was any validation missing there that's
where it should go.
I vaguely recall converting some SCSI-related code to use
copy_check_uvector() instead of its own (open coded?) thing, if that
patch made it upstream that could've been a place that at one point in
time did need the generic_segment_checks() call.
^ permalink raw reply [flat|nested] 49+ messages in thread