* [PATCH 0/5] splice: locking changes and code refactoring @ 2013-12-12 18:14 ` Christoph Hellwig 0 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2013-12-12 18:14 UTC (permalink / raw) To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs I've been trying to fix the old splice iolock lock inversion issue in XFS and started looking over the splice code a little more for it. It seems like the root of all evil is that we try to nest i_mutex inside the pipe_lock instead of outside of it, and I can't find any good reason for that. Does anyone remember why it went this way initially? By fixing that and a few minor issues we can not only fix this issue nicely in XFS, but also get rid of various bits of code duplication, and poking into splice internals by the ocfs2 splice_write path. Btw, does anyone have a good test suite for splice functionality? xfstests coverage exits but is not very extensive. b/fs/ocfs2/file.c | 2 b/fs/splice.c | 5 +- b/fs/xfs/xfs_file.c | 26 +++++----- b/include/linux/splice.h | 2 fs/ocfs2/file.c | 78 +++++++++---------------------- fs/splice.c | 115 +++++++++++++---------------------------------- include/linux/splice.h | 7 -- 7 files changed, 76 insertions(+), 159 deletions(-) _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH 0/5] splice: locking changes and code refactoring @ 2013-12-12 18:14 ` Christoph Hellwig 0 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2013-12-12 18:14 UTC (permalink / raw) To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs I've been trying to fix the old splice iolock lock inversion issue in XFS and started looking over the splice code a little more for it. It seems like the root of all evil is that we try to nest i_mutex inside the pipe_lock instead of outside of it, and I can't find any good reason for that. Does anyone remember why it went this way initially? By fixing that and a few minor issues we can not only fix this issue nicely in XFS, but also get rid of various bits of code duplication, and poking into splice internals by the ocfs2 splice_write path. Btw, does anyone have a good test suite for splice functionality? xfstests coverage exits but is not very extensive. b/fs/ocfs2/file.c | 2 b/fs/splice.c | 5 +- b/fs/xfs/xfs_file.c | 26 +++++----- b/include/linux/splice.h | 2 fs/ocfs2/file.c | 78 +++++++++---------------------- fs/splice.c | 115 +++++++++++++---------------------------------- include/linux/splice.h | 7 -- 7 files changed, 76 insertions(+), 159 deletions(-) ^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH 1/5] splice: move balance_dirty_pages_ratelimited into pipe_to_file 2013-12-12 18:14 ` Christoph Hellwig @ 2013-12-12 18:15 ` Christoph Hellwig -1 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2013-12-12 18:15 UTC (permalink / raw) To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs [-- Attachment #1: 0001-splice-move-balance_dirty_pages_ratelimited-into-pip.patch --] [-- Type: text/plain, Size: 1326 bytes --] Try to balance the dirty pages for every written pages instead of once per system call, mirroring the regular write path. Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/ocfs2/file.c | 2 -- fs/splice.c | 5 ++++- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index 6fff128..a77ef6e 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2494,8 +2494,6 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe, ret = err; else *ppos += ret; - - balance_dirty_pages_ratelimited(mapping); } return ret; diff --git a/fs/splice.c b/fs/splice.c index 46a08f7..fcb459d 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -759,6 +759,10 @@ int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf, } ret = pagecache_write_end(file, mapping, sd->pos, this_len, this_len, page, fsdata); + if (ret) + goto out; + + balance_dirty_pages_ratelimited(mapping); out: return ret; } @@ -1034,7 +1038,6 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out, ret = err; else *ppos += ret; - balance_dirty_pages_ratelimited(mapping); } return ret; -- 1.7.10.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH 1/5] splice: move balance_dirty_pages_ratelimited into pipe_to_file @ 2013-12-12 18:15 ` Christoph Hellwig 0 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2013-12-12 18:15 UTC (permalink / raw) To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs [-- Attachment #1: 0001-splice-move-balance_dirty_pages_ratelimited-into-pip.patch --] [-- Type: text/plain, Size: 1205 bytes --] Try to balance the dirty pages for every written pages instead of once per system call, mirroring the regular write path. Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/ocfs2/file.c | 2 -- fs/splice.c | 5 ++++- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index 6fff128..a77ef6e 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2494,8 +2494,6 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe, ret = err; else *ppos += ret; - - balance_dirty_pages_ratelimited(mapping); } return ret; diff --git a/fs/splice.c b/fs/splice.c index 46a08f7..fcb459d 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -759,6 +759,10 @@ int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf, } ret = pagecache_write_end(file, mapping, sd->pos, this_len, this_len, page, fsdata); + if (ret) + goto out; + + balance_dirty_pages_ratelimited(mapping); out: return ret; } @@ -1034,7 +1038,6 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out, ret = err; else *ppos += ret; - balance_dirty_pages_ratelimited(mapping); } return ret; -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH 2/5] splice: nest i_mutex outside pipe_lock 2013-12-12 18:14 ` Christoph Hellwig @ 2013-12-12 18:15 ` Christoph Hellwig -1 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2013-12-12 18:15 UTC (permalink / raw) To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs [-- Attachment #1: 0002-splice-nest-i_mutex-outside-pipe_lock.patch --] [-- Type: text/plain, Size: 3082 bytes --] I can't find any obvious reason why we would want to nest i_mutex inside the pipe lock, but two good reasons speak against it: - with i_mutex inside the pipe lock we have to unlock it every time we iterate to the next buffer, which prevents i_mutex from guaranteeing write atomicy similar to write(2), and also is ineffiecient for filesystems like ocfs2 that have additional cluster locking along i_mutex. - the ordering of pipe_lock outside i_mutex makes it very hard to share code with filesystems that have additional inode-wide locks that need to be taken in the right order with i_mutex. In addition to changing the lock order this patch also removes the useless I_MUTEX_CHILD annotations. Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/ocfs2/file.c | 20 +++++++++++--------- fs/splice.c | 5 +++-- 2 files changed, 14 insertions(+), 11 deletions(-) diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index a77ef6e..c68e111 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2461,6 +2461,12 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe, out->f_path.dentry->d_name.len, out->f_path.dentry->d_name.name, len); + mutex_lock(&inode->i_mutex); + ret = ocfs2_rw_lock(inode, 1); + if (ret < 0) { + mlog_errno(ret); + goto out_unlock_inode; + } pipe_lock(pipe); splice_from_pipe_begin(&sd); @@ -2469,20 +2475,16 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe, if (ret <= 0) break; - mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD); - ret = ocfs2_rw_lock(inode, 1); - if (ret < 0) - mlog_errno(ret); - else { - ret = ocfs2_splice_to_file(pipe, out, &sd); - ocfs2_rw_unlock(inode, 1); - } - mutex_unlock(&inode->i_mutex); + ret = ocfs2_splice_to_file(pipe, out, &sd); } while (ret > 0); splice_from_pipe_end(pipe, &sd); pipe_unlock(pipe); + ocfs2_rw_unlock(inode, 1); +out_unlock_inode: + mutex_unlock(&inode->i_mutex); + if (sd.num_spliced) ret = sd.num_spliced; diff --git a/fs/splice.c b/fs/splice.c index fcb459d..4fb6c1f 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1005,6 +1005,8 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out, }; ssize_t ret; + mutex_lock(&inode->i_mutex); + pipe_lock(pipe); splice_from_pipe_begin(&sd); @@ -1013,7 +1015,6 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out, if (ret <= 0) break; - mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD); ret = file_remove_suid(out); if (!ret) { ret = file_update_time(out); @@ -1021,11 +1022,11 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out, ret = splice_from_pipe_feed(pipe, &sd, pipe_to_file); } - mutex_unlock(&inode->i_mutex); } while (ret > 0); splice_from_pipe_end(pipe, &sd); pipe_unlock(pipe); + mutex_unlock(&inode->i_mutex); if (sd.num_spliced) ret = sd.num_spliced; -- 1.7.10.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH 2/5] splice: nest i_mutex outside pipe_lock @ 2013-12-12 18:15 ` Christoph Hellwig 0 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2013-12-12 18:15 UTC (permalink / raw) To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs [-- Attachment #1: 0002-splice-nest-i_mutex-outside-pipe_lock.patch --] [-- Type: text/plain, Size: 2961 bytes --] I can't find any obvious reason why we would want to nest i_mutex inside the pipe lock, but two good reasons speak against it: - with i_mutex inside the pipe lock we have to unlock it every time we iterate to the next buffer, which prevents i_mutex from guaranteeing write atomicy similar to write(2), and also is ineffiecient for filesystems like ocfs2 that have additional cluster locking along i_mutex. - the ordering of pipe_lock outside i_mutex makes it very hard to share code with filesystems that have additional inode-wide locks that need to be taken in the right order with i_mutex. In addition to changing the lock order this patch also removes the useless I_MUTEX_CHILD annotations. Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/ocfs2/file.c | 20 +++++++++++--------- fs/splice.c | 5 +++-- 2 files changed, 14 insertions(+), 11 deletions(-) diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index a77ef6e..c68e111 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2461,6 +2461,12 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe, out->f_path.dentry->d_name.len, out->f_path.dentry->d_name.name, len); + mutex_lock(&inode->i_mutex); + ret = ocfs2_rw_lock(inode, 1); + if (ret < 0) { + mlog_errno(ret); + goto out_unlock_inode; + } pipe_lock(pipe); splice_from_pipe_begin(&sd); @@ -2469,20 +2475,16 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe, if (ret <= 0) break; - mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD); - ret = ocfs2_rw_lock(inode, 1); - if (ret < 0) - mlog_errno(ret); - else { - ret = ocfs2_splice_to_file(pipe, out, &sd); - ocfs2_rw_unlock(inode, 1); - } - mutex_unlock(&inode->i_mutex); + ret = ocfs2_splice_to_file(pipe, out, &sd); } while (ret > 0); splice_from_pipe_end(pipe, &sd); pipe_unlock(pipe); + ocfs2_rw_unlock(inode, 1); +out_unlock_inode: + mutex_unlock(&inode->i_mutex); + if (sd.num_spliced) ret = sd.num_spliced; diff --git a/fs/splice.c b/fs/splice.c index fcb459d..4fb6c1f 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1005,6 +1005,8 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out, }; ssize_t ret; + mutex_lock(&inode->i_mutex); + pipe_lock(pipe); splice_from_pipe_begin(&sd); @@ -1013,7 +1015,6 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out, if (ret <= 0) break; - mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD); ret = file_remove_suid(out); if (!ret) { ret = file_update_time(out); @@ -1021,11 +1022,11 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out, ret = splice_from_pipe_feed(pipe, &sd, pipe_to_file); } - mutex_unlock(&inode->i_mutex); } while (ret > 0); splice_from_pipe_end(pipe, &sd); pipe_unlock(pipe); + mutex_unlock(&inode->i_mutex); if (sd.num_spliced) ret = sd.num_spliced; -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH 3/5] splice: use splice_from_pipe in generic_file_splice_write 2013-12-12 18:14 ` Christoph Hellwig @ 2013-12-12 18:15 ` Christoph Hellwig -1 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2013-12-12 18:15 UTC (permalink / raw) To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs [-- Attachment #1: 0003-splice-use-splice_from_pipe-in-generic_file_splice_w.patch --] [-- Type: text/plain, Size: 5721 bytes --] Reuse generic_write_sync instead of reimplementing it in the splice write path both in the generic code and ocfs2. Only needs a little bit of refactoring for the actors to provide the desired functionality. Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/ocfs2/file.c | 58 ++++++++++----------------------------------- fs/splice.c | 61 +++++++++++++++++------------------------------- include/linux/splice.h | 2 ++ 3 files changed, 36 insertions(+), 85 deletions(-) diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index c68e111..5b1b5f9 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2423,38 +2423,14 @@ out_sems: return ret; } -static int ocfs2_splice_to_file(struct pipe_inode_info *pipe, - struct file *out, - struct splice_desc *sd) -{ - int ret; - - ret = ocfs2_prepare_inode_for_write(out, &sd->pos, - sd->total_len, 0, NULL, NULL); - if (ret < 0) { - mlog_errno(ret); - return ret; - } - - return splice_from_pipe_feed(pipe, sd, pipe_to_file); -} - static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe, struct file *out, loff_t *ppos, size_t len, unsigned int flags) { - int ret; - struct address_space *mapping = out->f_mapping; - struct inode *inode = mapping->host; - struct splice_desc sd = { - .total_len = len, - .flags = flags, - .pos = *ppos, - .u.file = out, - }; - + struct inode *inode = out->f_mapping->host; + ssize_t ret; trace_ocfs2_file_splice_write(inode, out, out->f_path.dentry, (unsigned long long)OCFS2_I(inode)->ip_blkno, @@ -2467,35 +2443,25 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe, mlog_errno(ret); goto out_unlock_inode; } - pipe_lock(pipe); - splice_from_pipe_begin(&sd); - do { - ret = splice_from_pipe_next(pipe, &sd); - if (ret <= 0) - break; - - ret = ocfs2_splice_to_file(pipe, out, &sd); - } while (ret > 0); - splice_from_pipe_end(pipe, &sd); + ret = ocfs2_prepare_inode_for_write(out, ppos, len, 0, NULL, NULL); + if (ret < 0) { + mlog_errno(ret); + goto out_unlock_rw; + } - pipe_unlock(pipe); + ret = splice_from_pipe(pipe, out, ppos, len, flags, __pipe_to_file); +out_unlock_rw: ocfs2_rw_unlock(inode, 1); out_unlock_inode: mutex_unlock(&inode->i_mutex); - if (sd.num_spliced) - ret = sd.num_spliced; - if (ret > 0) { - int err; - - err = generic_write_sync(out, *ppos, ret); + int err = generic_write_sync(out, *ppos, ret); if (err) - ret = err; - else - *ppos += ret; + return ret; + *ppos += ret; } return ret; diff --git a/fs/splice.c b/fs/splice.c index 4fb6c1f..108e527 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -727,7 +727,7 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe, * SPLICE_F_MOVE isn't set, or we cannot move the page, we simply create * a new page in the output file page cache and fill/dirty that. */ -int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf, +int __pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf, struct splice_desc *sd) { struct file *file = sd->u.file; @@ -766,6 +766,22 @@ int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf, out: return ret; } +EXPORT_SYMBOL(__pipe_to_file); + +int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf, + struct splice_desc *sd) +{ + struct file *file = sd->u.file; + int ret; + + ret = file_remove_suid(file); + if (ret) + return ret; + ret = file_update_time(file); + if (ret) + return ret; + return __pipe_to_file(pipe, buf, sd); +} EXPORT_SYMBOL(pipe_to_file); static void wakeup_pipe_writers(struct pipe_inode_info *pipe) @@ -995,55 +1011,22 @@ ssize_t generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out, loff_t *ppos, size_t len, unsigned int flags) { - struct address_space *mapping = out->f_mapping; - struct inode *inode = mapping->host; - struct splice_desc sd = { - .total_len = len, - .flags = flags, - .pos = *ppos, - .u.file = out, - }; + struct inode *inode = out->f_mapping->host; ssize_t ret; mutex_lock(&inode->i_mutex); - - pipe_lock(pipe); - - splice_from_pipe_begin(&sd); - do { - ret = splice_from_pipe_next(pipe, &sd); - if (ret <= 0) - break; - - ret = file_remove_suid(out); - if (!ret) { - ret = file_update_time(out); - if (!ret) - ret = splice_from_pipe_feed(pipe, &sd, - pipe_to_file); - } - } while (ret > 0); - splice_from_pipe_end(pipe, &sd); - - pipe_unlock(pipe); + ret = splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_file); mutex_unlock(&inode->i_mutex); - if (sd.num_spliced) - ret = sd.num_spliced; - if (ret > 0) { - int err; - - err = generic_write_sync(out, *ppos, ret); + int err = generic_write_sync(out, *ppos, ret); if (err) - ret = err; - else - *ppos += ret; + return err; + *ppos += ret; } return ret; } - EXPORT_SYMBOL(generic_file_splice_write); static int write_pipe_buf(struct pipe_inode_info *pipe, struct pipe_buffer *buf, diff --git a/include/linux/splice.h b/include/linux/splice.h index 74575cb..c5aca88 100644 --- a/include/linux/splice.h +++ b/include/linux/splice.h @@ -78,6 +78,8 @@ extern void splice_from_pipe_end(struct pipe_inode_info *, struct splice_desc *); extern int pipe_to_file(struct pipe_inode_info *, struct pipe_buffer *, struct splice_desc *); +extern int __pipe_to_file(struct pipe_inode_info *, struct pipe_buffer *, + struct splice_desc *); extern ssize_t splice_to_pipe(struct pipe_inode_info *, struct splice_pipe_desc *); -- 1.7.10.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH 3/5] splice: use splice_from_pipe in generic_file_splice_write @ 2013-12-12 18:15 ` Christoph Hellwig 0 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2013-12-12 18:15 UTC (permalink / raw) To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs [-- Attachment #1: 0003-splice-use-splice_from_pipe-in-generic_file_splice_w.patch --] [-- Type: text/plain, Size: 5600 bytes --] Reuse generic_write_sync instead of reimplementing it in the splice write path both in the generic code and ocfs2. Only needs a little bit of refactoring for the actors to provide the desired functionality. Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/ocfs2/file.c | 58 ++++++++++----------------------------------- fs/splice.c | 61 +++++++++++++++++------------------------------- include/linux/splice.h | 2 ++ 3 files changed, 36 insertions(+), 85 deletions(-) diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index c68e111..5b1b5f9 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2423,38 +2423,14 @@ out_sems: return ret; } -static int ocfs2_splice_to_file(struct pipe_inode_info *pipe, - struct file *out, - struct splice_desc *sd) -{ - int ret; - - ret = ocfs2_prepare_inode_for_write(out, &sd->pos, - sd->total_len, 0, NULL, NULL); - if (ret < 0) { - mlog_errno(ret); - return ret; - } - - return splice_from_pipe_feed(pipe, sd, pipe_to_file); -} - static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe, struct file *out, loff_t *ppos, size_t len, unsigned int flags) { - int ret; - struct address_space *mapping = out->f_mapping; - struct inode *inode = mapping->host; - struct splice_desc sd = { - .total_len = len, - .flags = flags, - .pos = *ppos, - .u.file = out, - }; - + struct inode *inode = out->f_mapping->host; + ssize_t ret; trace_ocfs2_file_splice_write(inode, out, out->f_path.dentry, (unsigned long long)OCFS2_I(inode)->ip_blkno, @@ -2467,35 +2443,25 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe, mlog_errno(ret); goto out_unlock_inode; } - pipe_lock(pipe); - splice_from_pipe_begin(&sd); - do { - ret = splice_from_pipe_next(pipe, &sd); - if (ret <= 0) - break; - - ret = ocfs2_splice_to_file(pipe, out, &sd); - } while (ret > 0); - splice_from_pipe_end(pipe, &sd); + ret = ocfs2_prepare_inode_for_write(out, ppos, len, 0, NULL, NULL); + if (ret < 0) { + mlog_errno(ret); + goto out_unlock_rw; + } - pipe_unlock(pipe); + ret = splice_from_pipe(pipe, out, ppos, len, flags, __pipe_to_file); +out_unlock_rw: ocfs2_rw_unlock(inode, 1); out_unlock_inode: mutex_unlock(&inode->i_mutex); - if (sd.num_spliced) - ret = sd.num_spliced; - if (ret > 0) { - int err; - - err = generic_write_sync(out, *ppos, ret); + int err = generic_write_sync(out, *ppos, ret); if (err) - ret = err; - else - *ppos += ret; + return ret; + *ppos += ret; } return ret; diff --git a/fs/splice.c b/fs/splice.c index 4fb6c1f..108e527 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -727,7 +727,7 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe, * SPLICE_F_MOVE isn't set, or we cannot move the page, we simply create * a new page in the output file page cache and fill/dirty that. */ -int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf, +int __pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf, struct splice_desc *sd) { struct file *file = sd->u.file; @@ -766,6 +766,22 @@ int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf, out: return ret; } +EXPORT_SYMBOL(__pipe_to_file); + +int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf, + struct splice_desc *sd) +{ + struct file *file = sd->u.file; + int ret; + + ret = file_remove_suid(file); + if (ret) + return ret; + ret = file_update_time(file); + if (ret) + return ret; + return __pipe_to_file(pipe, buf, sd); +} EXPORT_SYMBOL(pipe_to_file); static void wakeup_pipe_writers(struct pipe_inode_info *pipe) @@ -995,55 +1011,22 @@ ssize_t generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out, loff_t *ppos, size_t len, unsigned int flags) { - struct address_space *mapping = out->f_mapping; - struct inode *inode = mapping->host; - struct splice_desc sd = { - .total_len = len, - .flags = flags, - .pos = *ppos, - .u.file = out, - }; + struct inode *inode = out->f_mapping->host; ssize_t ret; mutex_lock(&inode->i_mutex); - - pipe_lock(pipe); - - splice_from_pipe_begin(&sd); - do { - ret = splice_from_pipe_next(pipe, &sd); - if (ret <= 0) - break; - - ret = file_remove_suid(out); - if (!ret) { - ret = file_update_time(out); - if (!ret) - ret = splice_from_pipe_feed(pipe, &sd, - pipe_to_file); - } - } while (ret > 0); - splice_from_pipe_end(pipe, &sd); - - pipe_unlock(pipe); + ret = splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_file); mutex_unlock(&inode->i_mutex); - if (sd.num_spliced) - ret = sd.num_spliced; - if (ret > 0) { - int err; - - err = generic_write_sync(out, *ppos, ret); + int err = generic_write_sync(out, *ppos, ret); if (err) - ret = err; - else - *ppos += ret; + return err; + *ppos += ret; } return ret; } - EXPORT_SYMBOL(generic_file_splice_write); static int write_pipe_buf(struct pipe_inode_info *pipe, struct pipe_buffer *buf, diff --git a/include/linux/splice.h b/include/linux/splice.h index 74575cb..c5aca88 100644 --- a/include/linux/splice.h +++ b/include/linux/splice.h @@ -78,6 +78,8 @@ extern void splice_from_pipe_end(struct pipe_inode_info *, struct splice_desc *); extern int pipe_to_file(struct pipe_inode_info *, struct pipe_buffer *, struct splice_desc *); +extern int __pipe_to_file(struct pipe_inode_info *, struct pipe_buffer *, + struct splice_desc *); extern ssize_t splice_to_pipe(struct pipe_inode_info *, struct splice_pipe_desc *); -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH 4/5] xfs: fix splice_write locking 2013-12-12 18:14 ` Christoph Hellwig @ 2013-12-12 18:15 ` Christoph Hellwig -1 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2013-12-12 18:15 UTC (permalink / raw) To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs [-- Attachment #1: 0004-xfs-fix-splice_write-locking.patch --] [-- Type: text/plain, Size: 2026 bytes --] Call splice_from_pipe directly from XFS so that we can do our normal I/O path locking. This fixes a rare to hit deadlock vs direct I/O as reported by Dave Chinner long time ago. Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_file.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 52c91e1..9d0da98 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -43,6 +43,7 @@ #include <linux/dcache.h> #include <linux/falloc.h> #include <linux/pagevec.h> +#include <linux/splice.h> static const struct vm_operations_struct xfs_file_vm_ops; @@ -348,14 +349,6 @@ xfs_file_splice_read( return ret; } -/* - * xfs_file_splice_write() does not use xfs_rw_ilock() because - * generic_file_splice_write() takes the i_mutex itself. This, in theory, - * couuld cause lock inversions between the aio_write path and the splice path - * if someone is doing concurrent splice(2) based writes and write(2) based - * writes to the same inode. The only real way to fix this is to re-implement - * the generic code here with correct locking orders. - */ STATIC ssize_t xfs_file_splice_write( struct pipe_inode_info *pipe, @@ -377,15 +370,22 @@ xfs_file_splice_write( if (XFS_FORCED_SHUTDOWN(ip->i_mount)) return -EIO; - xfs_ilock(ip, XFS_IOLOCK_EXCL); - + xfs_rw_ilock(ip, XFS_IOLOCK_EXCL); trace_xfs_file_splice_write(ip, count, *ppos, ioflags); + ret = splice_from_pipe(pipe, outfilp, ppos, count, flags, pipe_to_file); + xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL); + + if (ret > 0) { + int err; - ret = generic_file_splice_write(pipe, outfilp, ppos, count, flags); - if (ret > 0) XFS_STATS_ADD(xs_write_bytes, ret); - xfs_iunlock(ip, XFS_IOLOCK_EXCL); + err = generic_write_sync(outfilp, *ppos, ret); + if (err) + return err; + *ppos += ret; + } + return ret; } -- 1.7.10.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH 4/5] xfs: fix splice_write locking @ 2013-12-12 18:15 ` Christoph Hellwig 0 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2013-12-12 18:15 UTC (permalink / raw) To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs [-- Attachment #1: 0004-xfs-fix-splice_write-locking.patch --] [-- Type: text/plain, Size: 1905 bytes --] Call splice_from_pipe directly from XFS so that we can do our normal I/O path locking. This fixes a rare to hit deadlock vs direct I/O as reported by Dave Chinner long time ago. Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_file.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 52c91e1..9d0da98 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -43,6 +43,7 @@ #include <linux/dcache.h> #include <linux/falloc.h> #include <linux/pagevec.h> +#include <linux/splice.h> static const struct vm_operations_struct xfs_file_vm_ops; @@ -348,14 +349,6 @@ xfs_file_splice_read( return ret; } -/* - * xfs_file_splice_write() does not use xfs_rw_ilock() because - * generic_file_splice_write() takes the i_mutex itself. This, in theory, - * couuld cause lock inversions between the aio_write path and the splice path - * if someone is doing concurrent splice(2) based writes and write(2) based - * writes to the same inode. The only real way to fix this is to re-implement - * the generic code here with correct locking orders. - */ STATIC ssize_t xfs_file_splice_write( struct pipe_inode_info *pipe, @@ -377,15 +370,22 @@ xfs_file_splice_write( if (XFS_FORCED_SHUTDOWN(ip->i_mount)) return -EIO; - xfs_ilock(ip, XFS_IOLOCK_EXCL); - + xfs_rw_ilock(ip, XFS_IOLOCK_EXCL); trace_xfs_file_splice_write(ip, count, *ppos, ioflags); + ret = splice_from_pipe(pipe, outfilp, ppos, count, flags, pipe_to_file); + xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL); + + if (ret > 0) { + int err; - ret = generic_file_splice_write(pipe, outfilp, ppos, count, flags); - if (ret > 0) XFS_STATS_ADD(xs_write_bytes, ret); - xfs_iunlock(ip, XFS_IOLOCK_EXCL); + err = generic_write_sync(outfilp, *ppos, ret); + if (err) + return err; + *ppos += ret; + } + return ret; } -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH 5/5] splice: stop exporting splice_from_pipe implementation details 2013-12-12 18:14 ` Christoph Hellwig @ 2013-12-12 18:15 ` Christoph Hellwig -1 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2013-12-12 18:15 UTC (permalink / raw) To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs [-- Attachment #1: 0005-splice-stop-exporting-splice_from_pipe-implementatio.patch --] [-- Type: text/plain, Size: 4401 bytes --] The splice_to_file and __splice_to_file helpers with the various actors are everyting consumers need to implement splice properly, so stop exporting the low-level helpers that are used to implement these functions, or in two cases were they were so trivially remove the helpers entirely. Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/splice.c | 49 +++++++++--------------------------------------- include/linux/splice.h | 7 ------- 2 files changed, 9 insertions(+), 47 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index 108e527..8627a85 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -812,8 +812,8 @@ static void wakeup_pipe_writers(struct pipe_inode_info *pipe) * locking is required around copying the pipe buffers to the * destination. */ -int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_desc *sd, - splice_actor *actor) +static int splice_from_pipe_feed(struct pipe_inode_info *pipe, + struct splice_desc *sd, splice_actor *actor) { int ret; @@ -859,7 +859,6 @@ int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_desc *sd, return 1; } -EXPORT_SYMBOL(splice_from_pipe_feed); /** * splice_from_pipe_next - wait for some data to splice from @@ -871,7 +870,8 @@ EXPORT_SYMBOL(splice_from_pipe_feed); * value (one) if pipe buffers are available. It will return zero * or -errno if no more data needs to be spliced. */ -int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_desc *sd) +static int splice_from_pipe_next(struct pipe_inode_info *pipe, + struct splice_desc *sd) { while (!pipe->nrbufs) { if (!pipe->writers) @@ -896,40 +896,6 @@ int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_desc *sd) return 1; } -EXPORT_SYMBOL(splice_from_pipe_next); - -/** - * splice_from_pipe_begin - start splicing from pipe - * @sd: information about the splice operation - * - * Description: - * This function should be called before a loop containing - * splice_from_pipe_next() and splice_from_pipe_feed() to - * initialize the necessary fields of @sd. - */ -void splice_from_pipe_begin(struct splice_desc *sd) -{ - sd->num_spliced = 0; - sd->need_wakeup = false; -} -EXPORT_SYMBOL(splice_from_pipe_begin); - -/** - * splice_from_pipe_end - finish splicing from pipe - * @pipe: pipe to splice from - * @sd: information about the splice operation - * - * Description: - * This function will wake up pipe writers if necessary. It should - * be called after a loop containing splice_from_pipe_next() and - * splice_from_pipe_feed(). - */ -void splice_from_pipe_end(struct pipe_inode_info *pipe, struct splice_desc *sd) -{ - if (sd->need_wakeup) - wakeup_pipe_writers(pipe); -} -EXPORT_SYMBOL(splice_from_pipe_end); /** * __splice_from_pipe - splice data from a pipe to given actor @@ -949,14 +915,17 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd, { int ret; - splice_from_pipe_begin(sd); + sd->num_spliced = 0; + sd->need_wakeup = false; + do { ret = splice_from_pipe_next(pipe, sd); if (ret > 0) ret = splice_from_pipe_feed(pipe, sd, actor); } while (ret > 0); - splice_from_pipe_end(pipe, sd); + if (sd->need_wakeup) + wakeup_pipe_writers(pipe); return sd->num_spliced ? sd->num_spliced : ret; } EXPORT_SYMBOL(__splice_from_pipe); diff --git a/include/linux/splice.h b/include/linux/splice.h index c5aca88..0dabcc7 100644 --- a/include/linux/splice.h +++ b/include/linux/splice.h @@ -69,13 +69,6 @@ extern ssize_t splice_from_pipe(struct pipe_inode_info *, struct file *, splice_actor *); extern ssize_t __splice_from_pipe(struct pipe_inode_info *, struct splice_desc *, splice_actor *); -extern int splice_from_pipe_feed(struct pipe_inode_info *, struct splice_desc *, - splice_actor *); -extern int splice_from_pipe_next(struct pipe_inode_info *, - struct splice_desc *); -extern void splice_from_pipe_begin(struct splice_desc *); -extern void splice_from_pipe_end(struct pipe_inode_info *, - struct splice_desc *); extern int pipe_to_file(struct pipe_inode_info *, struct pipe_buffer *, struct splice_desc *); extern int __pipe_to_file(struct pipe_inode_info *, struct pipe_buffer *, -- 1.7.10.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH 5/5] splice: stop exporting splice_from_pipe implementation details @ 2013-12-12 18:15 ` Christoph Hellwig 0 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2013-12-12 18:15 UTC (permalink / raw) To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs [-- Attachment #1: 0005-splice-stop-exporting-splice_from_pipe-implementatio.patch --] [-- Type: text/plain, Size: 4280 bytes --] The splice_to_file and __splice_to_file helpers with the various actors are everyting consumers need to implement splice properly, so stop exporting the low-level helpers that are used to implement these functions, or in two cases were they were so trivially remove the helpers entirely. Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/splice.c | 49 +++++++++--------------------------------------- include/linux/splice.h | 7 ------- 2 files changed, 9 insertions(+), 47 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index 108e527..8627a85 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -812,8 +812,8 @@ static void wakeup_pipe_writers(struct pipe_inode_info *pipe) * locking is required around copying the pipe buffers to the * destination. */ -int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_desc *sd, - splice_actor *actor) +static int splice_from_pipe_feed(struct pipe_inode_info *pipe, + struct splice_desc *sd, splice_actor *actor) { int ret; @@ -859,7 +859,6 @@ int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_desc *sd, return 1; } -EXPORT_SYMBOL(splice_from_pipe_feed); /** * splice_from_pipe_next - wait for some data to splice from @@ -871,7 +870,8 @@ EXPORT_SYMBOL(splice_from_pipe_feed); * value (one) if pipe buffers are available. It will return zero * or -errno if no more data needs to be spliced. */ -int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_desc *sd) +static int splice_from_pipe_next(struct pipe_inode_info *pipe, + struct splice_desc *sd) { while (!pipe->nrbufs) { if (!pipe->writers) @@ -896,40 +896,6 @@ int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_desc *sd) return 1; } -EXPORT_SYMBOL(splice_from_pipe_next); - -/** - * splice_from_pipe_begin - start splicing from pipe - * @sd: information about the splice operation - * - * Description: - * This function should be called before a loop containing - * splice_from_pipe_next() and splice_from_pipe_feed() to - * initialize the necessary fields of @sd. - */ -void splice_from_pipe_begin(struct splice_desc *sd) -{ - sd->num_spliced = 0; - sd->need_wakeup = false; -} -EXPORT_SYMBOL(splice_from_pipe_begin); - -/** - * splice_from_pipe_end - finish splicing from pipe - * @pipe: pipe to splice from - * @sd: information about the splice operation - * - * Description: - * This function will wake up pipe writers if necessary. It should - * be called after a loop containing splice_from_pipe_next() and - * splice_from_pipe_feed(). - */ -void splice_from_pipe_end(struct pipe_inode_info *pipe, struct splice_desc *sd) -{ - if (sd->need_wakeup) - wakeup_pipe_writers(pipe); -} -EXPORT_SYMBOL(splice_from_pipe_end); /** * __splice_from_pipe - splice data from a pipe to given actor @@ -949,14 +915,17 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd, { int ret; - splice_from_pipe_begin(sd); + sd->num_spliced = 0; + sd->need_wakeup = false; + do { ret = splice_from_pipe_next(pipe, sd); if (ret > 0) ret = splice_from_pipe_feed(pipe, sd, actor); } while (ret > 0); - splice_from_pipe_end(pipe, sd); + if (sd->need_wakeup) + wakeup_pipe_writers(pipe); return sd->num_spliced ? sd->num_spliced : ret; } EXPORT_SYMBOL(__splice_from_pipe); diff --git a/include/linux/splice.h b/include/linux/splice.h index c5aca88..0dabcc7 100644 --- a/include/linux/splice.h +++ b/include/linux/splice.h @@ -69,13 +69,6 @@ extern ssize_t splice_from_pipe(struct pipe_inode_info *, struct file *, splice_actor *); extern ssize_t __splice_from_pipe(struct pipe_inode_info *, struct splice_desc *, splice_actor *); -extern int splice_from_pipe_feed(struct pipe_inode_info *, struct splice_desc *, - splice_actor *); -extern int splice_from_pipe_next(struct pipe_inode_info *, - struct splice_desc *); -extern void splice_from_pipe_begin(struct splice_desc *); -extern void splice_from_pipe_end(struct pipe_inode_info *, - struct splice_desc *); extern int pipe_to_file(struct pipe_inode_info *, struct pipe_buffer *, struct splice_desc *); extern int __pipe_to_file(struct pipe_inode_info *, struct pipe_buffer *, -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2013-12-12 18:14 ` Christoph Hellwig @ 2014-01-13 14:14 ` Christoph Hellwig -1 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2014-01-13 14:14 UTC (permalink / raw) To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs ping? Would be nice to get this into 3.14 On Thu, Dec 12, 2013 at 10:14:59AM -0800, Christoph Hellwig wrote: > I've been trying to fix the old splice iolock lock inversion issue in XFS > and started looking over the splice code a little more for it. It seems > like the root of all evil is that we try to nest i_mutex inside the > pipe_lock instead of outside of it, and I can't find any good reason for > that. Does anyone remember why it went this way initially? > > By fixing that and a few minor issues we can not only fix this issue nicely > in XFS, but also get rid of various bits of code duplication, and poking into > splice internals by the ocfs2 splice_write path. > > Btw, does anyone have a good test suite for splice functionality? xfstests > coverage exits but is not very extensive. > > b/fs/ocfs2/file.c | 2 > b/fs/splice.c | 5 +- > b/fs/xfs/xfs_file.c | 26 +++++----- > b/include/linux/splice.h | 2 > fs/ocfs2/file.c | 78 +++++++++---------------------- > fs/splice.c | 115 +++++++++++++---------------------------------- > include/linux/splice.h | 7 -- > 7 files changed, 76 insertions(+), 159 deletions(-) > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ---end quoted text--- _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring @ 2014-01-13 14:14 ` Christoph Hellwig 0 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2014-01-13 14:14 UTC (permalink / raw) To: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker; +Cc: linux-fsdevel, xfs ping? Would be nice to get this into 3.14 On Thu, Dec 12, 2013 at 10:14:59AM -0800, Christoph Hellwig wrote: > I've been trying to fix the old splice iolock lock inversion issue in XFS > and started looking over the splice code a little more for it. It seems > like the root of all evil is that we try to nest i_mutex inside the > pipe_lock instead of outside of it, and I can't find any good reason for > that. Does anyone remember why it went this way initially? > > By fixing that and a few minor issues we can not only fix this issue nicely > in XFS, but also get rid of various bits of code duplication, and poking into > splice internals by the ocfs2 splice_write path. > > Btw, does anyone have a good test suite for splice functionality? xfstests > coverage exits but is not very extensive. > > b/fs/ocfs2/file.c | 2 > b/fs/splice.c | 5 +- > b/fs/xfs/xfs_file.c | 26 +++++----- > b/include/linux/splice.h | 2 > fs/ocfs2/file.c | 78 +++++++++---------------------- > fs/splice.c | 115 +++++++++++++---------------------------------- > include/linux/splice.h | 7 -- > 7 files changed, 76 insertions(+), 159 deletions(-) > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ---end quoted text--- ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2014-01-13 14:14 ` Christoph Hellwig @ 2014-01-13 23:56 ` Al Viro -1 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-01-13 23:56 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Mark Fasheh, linux-fsdevel, Joel Becker, xfs On Mon, Jan 13, 2014 at 06:14:16AM -0800, Christoph Hellwig wrote: > ping? Would be nice to get this into 3.14 Umm... The reason for pipe_lock outside of ->i_mutex is this: default_file_splice_write() calls splice_from_pipe() with write_pipe_buf for callback. splice_from_pipe() calls that callback under pipe_lock(pipe). And write_pipe_buf() calls __kernel_write(), which certainly might want to take ->i_mutex. Now, this codepath isn't taken for files that have non-NULL ->splice_write(), so that's not an issue for XFS and OCFS2, but having pipe_lock nest between the ->i_mutex for filesystems that do and do not have ->splice_write()... Ouch... _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring @ 2014-01-13 23:56 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-01-13 23:56 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs On Mon, Jan 13, 2014 at 06:14:16AM -0800, Christoph Hellwig wrote: > ping? Would be nice to get this into 3.14 Umm... The reason for pipe_lock outside of ->i_mutex is this: default_file_splice_write() calls splice_from_pipe() with write_pipe_buf for callback. splice_from_pipe() calls that callback under pipe_lock(pipe). And write_pipe_buf() calls __kernel_write(), which certainly might want to take ->i_mutex. Now, this codepath isn't taken for files that have non-NULL ->splice_write(), so that's not an issue for XFS and OCFS2, but having pipe_lock nest between the ->i_mutex for filesystems that do and do not have ->splice_write()... Ouch... ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2014-01-13 23:56 ` Al Viro @ 2014-01-14 13:22 ` Christoph Hellwig -1 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2014-01-14 13:22 UTC (permalink / raw) To: Al Viro Cc: Jens Axboe, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel On Mon, Jan 13, 2014 at 11:56:46PM +0000, Al Viro wrote: > On Mon, Jan 13, 2014 at 06:14:16AM -0800, Christoph Hellwig wrote: > > ping? Would be nice to get this into 3.14 > > Umm... The reason for pipe_lock outside of ->i_mutex is this: > default_file_splice_write() calls splice_from_pipe() with > write_pipe_buf for callback. splice_from_pipe() calls that > callback under pipe_lock(pipe). And write_pipe_buf() calls > __kernel_write(), which certainly might want to take ->i_mutex. > > Now, this codepath isn't taken for files that have non-NULL > ->splice_write(), so that's not an issue for XFS and OCFS2, > but having pipe_lock nest between the ->i_mutex for filesystems > that do and do not have ->splice_write()... Ouch... What would be the alternative? Duplicating the code in even more filesystems to enforce an non-natural locking order for filesystems actually implementing splice? There don't actually seem to be a whole lot of real filesystems not implemting splice_write, the prime use would be for device drivers or synthetic ones. I'm not even sure how much that fallback gets used in practice. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring @ 2014-01-14 13:22 ` Christoph Hellwig 0 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2014-01-14 13:22 UTC (permalink / raw) To: Al Viro Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs On Mon, Jan 13, 2014 at 11:56:46PM +0000, Al Viro wrote: > On Mon, Jan 13, 2014 at 06:14:16AM -0800, Christoph Hellwig wrote: > > ping? Would be nice to get this into 3.14 > > Umm... The reason for pipe_lock outside of ->i_mutex is this: > default_file_splice_write() calls splice_from_pipe() with > write_pipe_buf for callback. splice_from_pipe() calls that > callback under pipe_lock(pipe). And write_pipe_buf() calls > __kernel_write(), which certainly might want to take ->i_mutex. > > Now, this codepath isn't taken for files that have non-NULL > ->splice_write(), so that's not an issue for XFS and OCFS2, > but having pipe_lock nest between the ->i_mutex for filesystems > that do and do not have ->splice_write()... Ouch... What would be the alternative? Duplicating the code in even more filesystems to enforce an non-natural locking order for filesystems actually implementing splice? There don't actually seem to be a whole lot of real filesystems not implemting splice_write, the prime use would be for device drivers or synthetic ones. I'm not even sure how much that fallback gets used in practice. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2014-01-14 13:22 ` Christoph Hellwig @ 2014-01-14 17:20 ` Al Viro -1 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-01-14 17:20 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Sage Weil, Mark Fasheh, xfs, Steve French, Joel Becker, linux-fsdevel, Linus Torvalds On Tue, Jan 14, 2014 at 05:22:07AM -0800, Christoph Hellwig wrote: > On Mon, Jan 13, 2014 at 11:56:46PM +0000, Al Viro wrote: > > On Mon, Jan 13, 2014 at 06:14:16AM -0800, Christoph Hellwig wrote: > > > ping? Would be nice to get this into 3.14 > > > > Umm... The reason for pipe_lock outside of ->i_mutex is this: > > default_file_splice_write() calls splice_from_pipe() with > > write_pipe_buf for callback. splice_from_pipe() calls that > > callback under pipe_lock(pipe). And write_pipe_buf() calls > > __kernel_write(), which certainly might want to take ->i_mutex. > > > > Now, this codepath isn't taken for files that have non-NULL > > ->splice_write(), so that's not an issue for XFS and OCFS2, > > but having pipe_lock nest between the ->i_mutex for filesystems > > that do and do not have ->splice_write()... Ouch... > > What would be the alternative? Duplicating the code in even more > filesystems to enforce an non-natural locking order for filesystems > actually implementing splice? There don't actually seem to be a whole > lot of real filesystems not implemting splice_write, the prime use > would be for device drivers or synthetic ones. I'm not even sure > how much that fallback gets used in practice. Actually, I wonder if splice should be allowed just because destination has ->write() - for a lot of synthetic files the effect of writes really depends on boundaries in output stream and I'm not sure if splice to such a beast makes any sense. Going through fs/*: * 9p: messy when we have O_DIRECT on target file. Other than that, probably could use generic_file_splice_write(). * adfs, affs, afs, bfs, fat, hfs, hfs+, minix, sysv: can use generic_file_splice_write() * /proc/fs/afs/*: probably shouldn't allow splice at all * binfmt_misc: shouldn't allow splice * btrfs: not sure, O_DIRECT complicates the picture again. * cachefiles_daemon_write(): probably shouldn't allow splice at all * ceph: uses generic_file_splice_write(), but I'm not sure whether it is correct - there are interesting things done by ceph_aio_write() that do not have any counterparts on that path. * cifs: trouble; might be switchable to generic, but there's an interesting bit in the end of its ->aio_write() to consider... In any case, should be doable as generic_file_splice_write() + filemap_fdatawrite(). * cifs mounted with strict_io - ouch. Would need ->splice_write() of its own with really interesting locking order. * cifs with direct_io - about the same story. Er... Just where is it doing suid removal in that case, BTW? cifs_user_writev() doesn't seem to? * /proc/fs/cifs/*: shouldn't allow splice to it * coda: needs ->splice_write() with that patch, might or might not be tricky * coda misc device: shouldn't allow splice to it * configfs: shouldn't allow splice to it * debugfs default fops: blackhole ->write(), might as well offer blackhole ->splice_write(). Or refuse to allow splice to it, since I'd bet that most of non-default ones are of the "shouldn't allow splice to it" kind... * debugfs bool: shouldn't allow splice to it * dlm misc devices: shouldn't allow splice to it * ecryptfs: probably should use generic_file_splice_write() (incidentally, who the hell has thought it would be a good idea to put ->iterate (aka ->readdir) into file_operations of a non-directory?) * ecryptfs misc device: probably shouldn't allow splice to it at all * eventfd: shouldn't allow splice to it at all * XIP ext2: needs ->splice_write() with that patch And that's less than half of fs/*... I'm not saying that the current situation on the write side is good; hell, just the mess with write/aio_write alone is ugly enough - we have * a bunch of file_operations without ->aio_write(); simple enough. * a bunch with ->write == do_sync_write. Also simple. * several with NULL ->write and non-NULL ->aio_write(); same as do_sync_write() for ->write (socket, android/logger, kmsg, macvtap) * several with ->aio_write being an optimized equivalent of do_sync_write() (blackhole for /dev/null and /dev/zero, error for bad_inode) * 9p cached with its "oh, but if we have O_DIRECT we want ->write() to be different" (why not use a separate file_operations, then? It's not as if ->open() couldn't switch to it if it sees O_DIRECT...) * two infinibad things (ipath and qib), with completely unrelated semantics on write(2) and writev(2) (the latter shared with aio). As in "writev() of a single-element iovec array does things that do not even resemble what write() of the same data would've done". Yes, really - check for yourself. * snd_pcm - hell knows; it might be that it tries to collect the data from iovec and push it in one go, as if it was a single write, but then it might be something as bogus as what ipath is doing... * gadgetfs - hell knows; ep_write() seems to be doing something beyond what ep_aio_write() does, but I haven't traced them down the call chain... That one, BTW, won't be fun for splice - looks like it cares about datagram boundaries a lot, so it's not obvious what the semantics should be. * lustre. I _think_ do_sync_write() would work there, but I'm might be easily missing something in all those layers of obfusca^Wgood software development practices. And ->splice_write() thrown into that mess, defaulting to "just do ->write() or ->aio_write(), everything writable should be able to cope with splice to it" hasn't made it any prettier. Unfortunately, what you are proposing will make it worse - we'll have to grow a bunch of ->splice_write() instances, with non-trivial correspondence between them and ->aio_write() ones... It needs to be cleaned up, but it's nowhere near as simple as "just flip the order of i_mutex and pipe_lock" ;-/ BTW, speaking of ->aio_write() - why the devil do we pass the pos argument (long long, at that) separately, when all call sites provably have it equal to iocb->ki_pos? If nothing else, on a bunch of architectures it makes the difference between passing arguments in registers and spilling them on stack; moreover, if we do something and only then call generic_file_aio_write(), we get to propagate it all way down. And generic_file_aio_write() has had explicit BUG_ON(iocb->ki_pos != pos) since 2.5.55, for crying out loud... _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring @ 2014-01-14 17:20 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-01-14 17:20 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Linus Torvalds, Sage Weil, Steve French On Tue, Jan 14, 2014 at 05:22:07AM -0800, Christoph Hellwig wrote: > On Mon, Jan 13, 2014 at 11:56:46PM +0000, Al Viro wrote: > > On Mon, Jan 13, 2014 at 06:14:16AM -0800, Christoph Hellwig wrote: > > > ping? Would be nice to get this into 3.14 > > > > Umm... The reason for pipe_lock outside of ->i_mutex is this: > > default_file_splice_write() calls splice_from_pipe() with > > write_pipe_buf for callback. splice_from_pipe() calls that > > callback under pipe_lock(pipe). And write_pipe_buf() calls > > __kernel_write(), which certainly might want to take ->i_mutex. > > > > Now, this codepath isn't taken for files that have non-NULL > > ->splice_write(), so that's not an issue for XFS and OCFS2, > > but having pipe_lock nest between the ->i_mutex for filesystems > > that do and do not have ->splice_write()... Ouch... > > What would be the alternative? Duplicating the code in even more > filesystems to enforce an non-natural locking order for filesystems > actually implementing splice? There don't actually seem to be a whole > lot of real filesystems not implemting splice_write, the prime use > would be for device drivers or synthetic ones. I'm not even sure > how much that fallback gets used in practice. Actually, I wonder if splice should be allowed just because destination has ->write() - for a lot of synthetic files the effect of writes really depends on boundaries in output stream and I'm not sure if splice to such a beast makes any sense. Going through fs/*: * 9p: messy when we have O_DIRECT on target file. Other than that, probably could use generic_file_splice_write(). * adfs, affs, afs, bfs, fat, hfs, hfs+, minix, sysv: can use generic_file_splice_write() * /proc/fs/afs/*: probably shouldn't allow splice at all * binfmt_misc: shouldn't allow splice * btrfs: not sure, O_DIRECT complicates the picture again. * cachefiles_daemon_write(): probably shouldn't allow splice at all * ceph: uses generic_file_splice_write(), but I'm not sure whether it is correct - there are interesting things done by ceph_aio_write() that do not have any counterparts on that path. * cifs: trouble; might be switchable to generic, but there's an interesting bit in the end of its ->aio_write() to consider... In any case, should be doable as generic_file_splice_write() + filemap_fdatawrite(). * cifs mounted with strict_io - ouch. Would need ->splice_write() of its own with really interesting locking order. * cifs with direct_io - about the same story. Er... Just where is it doing suid removal in that case, BTW? cifs_user_writev() doesn't seem to? * /proc/fs/cifs/*: shouldn't allow splice to it * coda: needs ->splice_write() with that patch, might or might not be tricky * coda misc device: shouldn't allow splice to it * configfs: shouldn't allow splice to it * debugfs default fops: blackhole ->write(), might as well offer blackhole ->splice_write(). Or refuse to allow splice to it, since I'd bet that most of non-default ones are of the "shouldn't allow splice to it" kind... * debugfs bool: shouldn't allow splice to it * dlm misc devices: shouldn't allow splice to it * ecryptfs: probably should use generic_file_splice_write() (incidentally, who the hell has thought it would be a good idea to put ->iterate (aka ->readdir) into file_operations of a non-directory?) * ecryptfs misc device: probably shouldn't allow splice to it at all * eventfd: shouldn't allow splice to it at all * XIP ext2: needs ->splice_write() with that patch And that's less than half of fs/*... I'm not saying that the current situation on the write side is good; hell, just the mess with write/aio_write alone is ugly enough - we have * a bunch of file_operations without ->aio_write(); simple enough. * a bunch with ->write == do_sync_write. Also simple. * several with NULL ->write and non-NULL ->aio_write(); same as do_sync_write() for ->write (socket, android/logger, kmsg, macvtap) * several with ->aio_write being an optimized equivalent of do_sync_write() (blackhole for /dev/null and /dev/zero, error for bad_inode) * 9p cached with its "oh, but if we have O_DIRECT we want ->write() to be different" (why not use a separate file_operations, then? It's not as if ->open() couldn't switch to it if it sees O_DIRECT...) * two infinibad things (ipath and qib), with completely unrelated semantics on write(2) and writev(2) (the latter shared with aio). As in "writev() of a single-element iovec array does things that do not even resemble what write() of the same data would've done". Yes, really - check for yourself. * snd_pcm - hell knows; it might be that it tries to collect the data from iovec and push it in one go, as if it was a single write, but then it might be something as bogus as what ipath is doing... * gadgetfs - hell knows; ep_write() seems to be doing something beyond what ep_aio_write() does, but I haven't traced them down the call chain... That one, BTW, won't be fun for splice - looks like it cares about datagram boundaries a lot, so it's not obvious what the semantics should be. * lustre. I _think_ do_sync_write() would work there, but I'm might be easily missing something in all those layers of obfusca^Wgood software development practices. And ->splice_write() thrown into that mess, defaulting to "just do ->write() or ->aio_write(), everything writable should be able to cope with splice to it" hasn't made it any prettier. Unfortunately, what you are proposing will make it worse - we'll have to grow a bunch of ->splice_write() instances, with non-trivial correspondence between them and ->aio_write() ones... It needs to be cleaned up, but it's nowhere near as simple as "just flip the order of i_mutex and pipe_lock" ;-/ BTW, speaking of ->aio_write() - why the devil do we pass the pos argument (long long, at that) separately, when all call sites provably have it equal to iocb->ki_pos? If nothing else, on a bunch of architectures it makes the difference between passing arguments in registers and spilling them on stack; moreover, if we do something and only then call generic_file_aio_write(), we get to propagate it all way down. And generic_file_aio_write() has had explicit BUG_ON(iocb->ki_pos != pos) since 2.5.55, for crying out loud... ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2014-01-14 17:20 ` Al Viro @ 2014-01-15 18:10 ` Al Viro -1 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-01-15 18:10 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Sage Weil, Mark Fasheh, xfs, Steve French, Joel Becker, linux-fsdevel, Linus Torvalds On Tue, Jan 14, 2014 at 05:20:33PM +0000, Al Viro wrote: > And that's less than half of fs/*... I'm not saying that the current > situation on the write side is good; hell, just the mess with write/aio_write > alone is ugly enough - we have > * a bunch of file_operations without ->aio_write(); simple enough. > * a bunch with ->write == do_sync_write. Also simple. > * several with NULL ->write and non-NULL ->aio_write(); same as > do_sync_write() for ->write (socket, android/logger, kmsg, macvtap) > * several with ->aio_write being an optimized equivalent of > do_sync_write() (blackhole for /dev/null and /dev/zero, error for bad_inode) > * 9p cached with its "oh, but if we have O_DIRECT we want ->write() > to be different" (why not use a separate file_operations, then? It's not > as if ->open() couldn't switch to it if it sees O_DIRECT...) > * two infinibad things (ipath and qib), with completely unrelated > semantics on write(2) and writev(2) (the latter shared with aio). As in > "writev() of a single-element iovec array does things that do not even > resemble what write() of the same data would've done". Yes, really - check > for yourself. > * snd_pcm - hell knows; it might be that it tries to collect the > data from iovec and push it in one go, as if it was a single write, but > then it might be something as bogus as what ipath is doing... > * gadgetfs - hell knows; ep_write() seems to be doing something > beyond what ep_aio_write() does, but I haven't traced them down the call > chain... That one, BTW, won't be fun for splice - looks like it cares > about datagram boundaries a lot, so it's not obvious what the semantics > should be. > * lustre. I _think_ do_sync_write() would work there, but I'm might > be easily missing something in all those layers of obfusca^Wgood software > development practices. BTW, ->read/->aio_read situation is only slighlty better - of file_operations instances that have ->aio_read, most have do_sync_read() for ->read() (as they ought to). Exceptions: * 9p O_DIRECT (again) * NULL ->read where do_sync_read ought to be (socket, macvtap) * optimized ->read (/dev/zero, /dev/null, bad_inode) * snd_pcm - magic. It (and its ->aio_write counterpart) wants exactly one iovec per channel. IOW, it's not a general-purpose ->aio_{read,write} at all - it's a magic API shoehorned into readv(2)/writev(2) (and aio IOCB_CMD_P{READ,WRITE}V as well). * lustre - probably could live with do_sync_read(), but there might be stack footprint considerations or some really weird magic going on (the difference is that instead of iocb on stack they appear to be using per-thread one allocated on heap and hashed by pid, of all things). It's really weird - they end up doing repeated hash lookups for that per-thread wastebasket of a structure on different levels of call chain. Looks like they have swept a lot of local variables of a lot of functions into that thing; worse, it appears to be one of several dynamically allocated bits of that thing, hidden behind a bunch of wrappers... Overall feel is Lovecraftian, complete with lurking horrors of the deep... BTW, its ->aio_read would better never return -EIOCBQUEUED - its ->read does *not* wait for completion of iocb it has submitted. * gadgetfs - it appears to be seriously datagram-oriented; basically, they want to reduce readv/writev to read/write, not the other way round. > BTW, speaking of ->aio_write() - why the devil do we pass the pos > argument (long long, at that) separately, when all call sites provably > have it equal to iocb->ki_pos? If nothing else, on a bunch of architectures > it makes the difference between passing arguments in registers and spilling > them on stack; moreover, if we do something and only then call > generic_file_aio_write(), we get to propagate it all way down. And > generic_file_aio_write() has had explicit BUG_ON(iocb->ki_pos != pos) > since 2.5.55, for crying out loud... The same goes for ->aio_read() (except for s/2.5.55/2.5.39/)... _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring @ 2014-01-15 18:10 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-01-15 18:10 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Linus Torvalds, Sage Weil, Steve French On Tue, Jan 14, 2014 at 05:20:33PM +0000, Al Viro wrote: > And that's less than half of fs/*... I'm not saying that the current > situation on the write side is good; hell, just the mess with write/aio_write > alone is ugly enough - we have > * a bunch of file_operations without ->aio_write(); simple enough. > * a bunch with ->write == do_sync_write. Also simple. > * several with NULL ->write and non-NULL ->aio_write(); same as > do_sync_write() for ->write (socket, android/logger, kmsg, macvtap) > * several with ->aio_write being an optimized equivalent of > do_sync_write() (blackhole for /dev/null and /dev/zero, error for bad_inode) > * 9p cached with its "oh, but if we have O_DIRECT we want ->write() > to be different" (why not use a separate file_operations, then? It's not > as if ->open() couldn't switch to it if it sees O_DIRECT...) > * two infinibad things (ipath and qib), with completely unrelated > semantics on write(2) and writev(2) (the latter shared with aio). As in > "writev() of a single-element iovec array does things that do not even > resemble what write() of the same data would've done". Yes, really - check > for yourself. > * snd_pcm - hell knows; it might be that it tries to collect the > data from iovec and push it in one go, as if it was a single write, but > then it might be something as bogus as what ipath is doing... > * gadgetfs - hell knows; ep_write() seems to be doing something > beyond what ep_aio_write() does, but I haven't traced them down the call > chain... That one, BTW, won't be fun for splice - looks like it cares > about datagram boundaries a lot, so it's not obvious what the semantics > should be. > * lustre. I _think_ do_sync_write() would work there, but I'm might > be easily missing something in all those layers of obfusca^Wgood software > development practices. BTW, ->read/->aio_read situation is only slighlty better - of file_operations instances that have ->aio_read, most have do_sync_read() for ->read() (as they ought to). Exceptions: * 9p O_DIRECT (again) * NULL ->read where do_sync_read ought to be (socket, macvtap) * optimized ->read (/dev/zero, /dev/null, bad_inode) * snd_pcm - magic. It (and its ->aio_write counterpart) wants exactly one iovec per channel. IOW, it's not a general-purpose ->aio_{read,write} at all - it's a magic API shoehorned into readv(2)/writev(2) (and aio IOCB_CMD_P{READ,WRITE}V as well). * lustre - probably could live with do_sync_read(), but there might be stack footprint considerations or some really weird magic going on (the difference is that instead of iocb on stack they appear to be using per-thread one allocated on heap and hashed by pid, of all things). It's really weird - they end up doing repeated hash lookups for that per-thread wastebasket of a structure on different levels of call chain. Looks like they have swept a lot of local variables of a lot of functions into that thing; worse, it appears to be one of several dynamically allocated bits of that thing, hidden behind a bunch of wrappers... Overall feel is Lovecraftian, complete with lurking horrors of the deep... BTW, its ->aio_read would better never return -EIOCBQUEUED - its ->read does *not* wait for completion of iocb it has submitted. * gadgetfs - it appears to be seriously datagram-oriented; basically, they want to reduce readv/writev to read/write, not the other way round. > BTW, speaking of ->aio_write() - why the devil do we pass the pos > argument (long long, at that) separately, when all call sites provably > have it equal to iocb->ki_pos? If nothing else, on a bunch of architectures > it makes the difference between passing arguments in registers and spilling > them on stack; moreover, if we do something and only then call > generic_file_aio_write(), we get to propagate it all way down. And > generic_file_aio_write() has had explicit BUG_ON(iocb->ki_pos != pos) > since 2.5.55, for crying out loud... The same goes for ->aio_read() (except for s/2.5.55/2.5.39/)... ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2014-01-14 17:20 ` Al Viro (?) (?) @ 2014-01-18 6:40 ` Al Viro 2014-01-18 7:22 ` Linus Torvalds -1 siblings, 1 reply; 82+ messages in thread From: Al Viro @ 2014-01-18 6:40 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Sage Weil, Mark Fasheh, xfs, Steve French, Joel Becker, linux-fsdevel, Linus Torvalds On Tue, Jan 14, 2014 at 05:20:33PM +0000, Al Viro wrote: > On Tue, Jan 14, 2014 at 05:22:07AM -0800, Christoph Hellwig wrote: > > On Mon, Jan 13, 2014 at 11:56:46PM +0000, Al Viro wrote: > > > On Mon, Jan 13, 2014 at 06:14:16AM -0800, Christoph Hellwig wrote: > > > > ping? Would be nice to get this into 3.14 > > > > > > Umm... The reason for pipe_lock outside of ->i_mutex is this: > > > default_file_splice_write() calls splice_from_pipe() with > > > write_pipe_buf for callback. splice_from_pipe() calls that > > > callback under pipe_lock(pipe). And write_pipe_buf() calls > > > __kernel_write(), which certainly might want to take ->i_mutex. > > > > > > Now, this codepath isn't taken for files that have non-NULL > > > ->splice_write(), so that's not an issue for XFS and OCFS2, > > > but having pipe_lock nest between the ->i_mutex for filesystems > > > that do and do not have ->splice_write()... Ouch... > > > > What would be the alternative? Duplicating the code in even more > > filesystems to enforce an non-natural locking order for filesystems > > actually implementing splice? There don't actually seem to be a whole > > lot of real filesystems not implemting splice_write, the prime use > > would be for device drivers or synthetic ones. I'm not even sure > > how much that fallback gets used in practice. Hmm... In principle, the following would be no worse than what generic_file_splice_write() is doing: confirm and map the pages, build an iovec and use ->aio_write() to write it out, then unmap the suckers, release ones entirely written to file and adjust the partially written one. All under pipe_lock(). Hell, if we introduce kernel_writev() (either by calling vfs_writev() or taking do_readv_writev() sans copying iovec and using that under set_fs()), we could switch default_file_splice_write() to that and get rid of ->splice_write() for the majority of filesystems, if not all of them. Sure, it means copying from pipe buffers to pagecache, but we have generic_file_splice_write() do that copy anyway - conditional memcpy() in pipe_to_file() is actually unconditional; that if (page != buf->page) in there had just been forgotten by Nick back in 2007 ("1/2 splice: dont steal"). Objections, comments? The problem Christoph was talking about is that generic_file_splice_write() plays with ->i_mutex and both gets/drops it for each page of IO *and* causes PITA for any fs that wants some locks of its own taken in addition to ->i_mutex on the write paths. What ->splice_write() without page stealing is doing is pretty much a writev() from array of pages in kernel space; so it looks like we might as well just reuse writev() guts for that... _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2014-01-18 6:40 ` Al Viro @ 2014-01-18 7:22 ` Linus Torvalds 0 siblings, 0 replies; 82+ messages in thread From: Linus Torvalds @ 2014-01-18 7:22 UTC (permalink / raw) To: Al Viro Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel On Fri, Jan 17, 2014 at 10:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Objections, comments? I certainly object to the "map, then unmap" approach. No VM games. But if it can be done more naturally as a writev, then that may well be ok. As long as we're talking about just the default_file_splice_write() case, and people who want to do special things with page movement can continue to do so.. Linus _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring @ 2014-01-18 7:22 ` Linus Torvalds 0 siblings, 0 replies; 82+ messages in thread From: Linus Torvalds @ 2014-01-18 7:22 UTC (permalink / raw) To: Al Viro Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French On Fri, Jan 17, 2014 at 10:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Objections, comments? I certainly object to the "map, then unmap" approach. No VM games. But if it can be done more naturally as a writev, then that may well be ok. As long as we're talking about just the default_file_splice_write() case, and people who want to do special things with page movement can continue to do so.. Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2014-01-18 7:22 ` Linus Torvalds (?) @ 2014-01-18 7:46 ` Al Viro 2014-01-18 7:56 ` Al Viro ` (2 more replies) -1 siblings, 3 replies; 82+ messages in thread From: Al Viro @ 2014-01-18 7:46 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel On Fri, Jan 17, 2014 at 11:22:04PM -0800, Linus Torvalds wrote: > On Fri, Jan 17, 2014 at 10:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > Objections, comments? > > I certainly object to the "map, then unmap" approach. No VM games. Um... int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf, struct splice_desc *sd) ... if (buf->page != page) { char *src = buf->ops->map(pipe, buf, 1); char *dst = kmap_atomic(page); memcpy(dst + offset, src + buf->offset, this_len); flush_dcache_page(page); kunmap_atomic(dst); buf->ops->unmap(pipe, buf, src); } ... ->map() and ->unmap() (BTW, why are those methods, anyway? They are identical for all instances) are void *generic_pipe_buf_map(struct pipe_inode_info *pipe, struct pipe_buffer *buf, int atomic) { if (atomic) { buf->flags |= PIPE_BUF_FLAG_ATOMIC; return kmap_atomic(buf->page); } return kmap(buf->page); } and void generic_pipe_buf_unmap(struct pipe_inode_info *pipe, struct pipe_buffer *buf, void *map_data) { if (buf->flags & PIPE_BUF_FLAG_ATOMIC) { buf->flags &= ~PIPE_BUF_FLAG_ATOMIC; kunmap_atomic(map_data); } else kunmap(buf->page); } resp. If we are going to copy that data (and all users of generic_file_splice_write() do that memcpy() to page cache), we have to kmap the source ;-/ > But if it can be done more naturally as a writev, then that may well > be ok. As long as we're talking about just the > default_file_splice_write() case, and people who want to do special > things with page movement can continue to do so.. The thing is, after such change default_file_splice_write() is no worse than generic_file_splice_write(). The only instances that really want something else are the ones that try to steal pages (e.g. virtio_console, fuse miscdev) or sockets, with their "do DMA from the sodding page, don't copy it at anywhere" ->sendpage() method. IOW, ones those special things you are talking about. Normal filesystems do not - not on pipe-to-file splice. file-to-pipe - sure, that one plays with pagecache and tries hard to do zero-copy, but that's ->splice_read(), not ->splice_write()... _If_ somebody figures out how to deal with zero-copy on pipe-to-file - fine, we'll be able to revisit that. But there hadn't been one since 2007 and there was zero activity in that area, so... What I'm doing right now is taking do_readv_writev() apart and making the stuff after rw_copy_check_uvector() non-static (visible in fs/internal.h). As long as we do not go through rw_copy_check_uvector() (we'd just built that iovec ourselves and it's already in kernel space), we should be fine - single copy done straight to pagecache, with whatever locks fs wants to take, etc. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2014-01-18 7:46 ` Al Viro @ 2014-01-18 7:56 ` Al Viro 2014-01-18 8:27 ` Al Viro 2014-01-18 19:59 ` Linus Torvalds 2 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-01-18 7:56 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel On Sat, Jan 18, 2014 at 07:46:49AM +0000, Al Viro wrote: > Um... > int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf, > struct splice_desc *sd) > ... > if (buf->page != page) { > char *src = buf->ops->map(pipe, buf, 1); > char *dst = kmap_atomic(page); > > memcpy(dst + offset, src + buf->offset, this_len); > flush_dcache_page(page); > kunmap_atomic(dst); > buf->ops->unmap(pipe, buf, src); > } BTW, that if (buf->page != page) is always true - it's a leftover from before Nick's removal of ->steal() uses in pipe_to_file() (as well as the big fat comment in front of that function that had lost any relation to what it's doing 7 years ago)... Is there anybody maintaining fs/splice.c these days? I'd been doing massive RTFS in that area lately, but it would certainly be nice to have some braindump on the design and issues in that thing, preferably still matching the code... _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring @ 2014-01-18 7:56 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-01-18 7:56 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French On Sat, Jan 18, 2014 at 07:46:49AM +0000, Al Viro wrote: > Um... > int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf, > struct splice_desc *sd) > ... > if (buf->page != page) { > char *src = buf->ops->map(pipe, buf, 1); > char *dst = kmap_atomic(page); > > memcpy(dst + offset, src + buf->offset, this_len); > flush_dcache_page(page); > kunmap_atomic(dst); > buf->ops->unmap(pipe, buf, src); > } BTW, that if (buf->page != page) is always true - it's a leftover from before Nick's removal of ->steal() uses in pipe_to_file() (as well as the big fat comment in front of that function that had lost any relation to what it's doing 7 years ago)... Is there anybody maintaining fs/splice.c these days? I'd been doing massive RTFS in that area lately, but it would certainly be nice to have some braindump on the design and issues in that thing, preferably still matching the code... ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2014-01-18 7:46 ` Al Viro 2014-01-18 7:56 ` Al Viro @ 2014-01-18 8:27 ` Al Viro 2014-01-18 8:44 ` David Miller 2014-01-18 19:59 ` Linus Torvalds 2 siblings, 1 reply; 82+ messages in thread From: Al Viro @ 2014-01-18 8:27 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel, David Miller On Sat, Jan 18, 2014 at 07:46:49AM +0000, Al Viro wrote: > On Fri, Jan 17, 2014 at 11:22:04PM -0800, Linus Torvalds wrote: > > On Fri, Jan 17, 2014 at 10:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > > Objections, comments? > > > > I certainly object to the "map, then unmap" approach. No VM games. [snip] > > But if it can be done more naturally as a writev, then that may well > > be ok. As long as we're talking about just the > > default_file_splice_write() case, and people who want to do special > > things with page movement can continue to do so.. > > The thing is, after such change default_file_splice_write() is no worse than > generic_file_splice_write(). The only instances that really want something > else are the ones that try to steal pages (e.g. virtio_console, fuse miscdev) > or sockets, with their "do DMA from the sodding page, don't copy it at > anywhere" ->sendpage() method. IOW, ones those special things you are > talking about. Normal filesystems do not - not on pipe-to-file splice. > file-to-pipe - sure, that one plays with pagecache and tries hard to > do zero-copy, but that's ->splice_read(), not ->splice_write()... BTW, would sockets benefit from having ->sendpages() that would take an array of (page, offset, len) triples? It would be trivial to do and some of the helpers that are falling out of writing that writev-based default_file_splice_write() look like they could be reused for calling that one... Dave? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2014-01-18 8:27 ` Al Viro @ 2014-01-18 8:44 ` David Miller 0 siblings, 0 replies; 82+ messages in thread From: David Miller @ 2014-01-18 8:44 UTC (permalink / raw) To: viro Cc: axboe, sfrench, sage, mfasheh, xfs, hch, jlbec, linux-fsdevel, torvalds From: Al Viro <viro@ZenIV.linux.org.uk> Date: Sat, 18 Jan 2014 08:27:30 +0000 > BTW, would sockets benefit from having ->sendpages() that would take an > array of (page, offset, len) triples? It would be trivial to do and > some of the helpers that are falling out of writing that writev-based > default_file_splice_write() look like they could be reused for > calling that one... Dave? That's originally how the sendpage method was implemented, but back then Linus asked us to only pass one page at a time. I don't remember the details beyond that. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring @ 2014-01-18 8:44 ` David Miller 0 siblings, 0 replies; 82+ messages in thread From: David Miller @ 2014-01-18 8:44 UTC (permalink / raw) To: viro Cc: torvalds, hch, axboe, mfasheh, jlbec, linux-fsdevel, xfs, sage, sfrench From: Al Viro <viro@ZenIV.linux.org.uk> Date: Sat, 18 Jan 2014 08:27:30 +0000 > BTW, would sockets benefit from having ->sendpages() that would take an > array of (page, offset, len) triples? It would be trivial to do and > some of the helpers that are falling out of writing that writev-based > default_file_splice_write() look like they could be reused for > calling that one... Dave? That's originally how the sendpage method was implemented, but back then Linus asked us to only pass one page at a time. I don't remember the details beyond that. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2014-01-18 8:44 ` David Miller @ 2014-02-07 17:10 ` Al Viro -1 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-07 17:10 UTC (permalink / raw) To: David Miller Cc: axboe, sfrench, sage, mfasheh, xfs, hch, jlbec, linux-fsdevel, torvalds On Sat, Jan 18, 2014 at 12:44:53AM -0800, David Miller wrote: > From: Al Viro <viro@ZenIV.linux.org.uk> > Date: Sat, 18 Jan 2014 08:27:30 +0000 > > > BTW, would sockets benefit from having ->sendpages() that would take an > > array of (page, offset, len) triples? It would be trivial to do and > > some of the helpers that are falling out of writing that writev-based > > default_file_splice_write() look like they could be reused for > > calling that one... Dave? > > That's originally how the sendpage method was implemented, but back then > Linus asked us to only pass one page at a time. > > I don't remember the details beyond that. FWIW, I wonder if what we are doing with ->msg_iov is the right thing. We modify the iovecs in array as we drain it. And that's inconvenient for at least some callers (see e.g. complaints in fs/ncpfs about the need to copy the array, etc.). What if we embed iov_iter into the sucker and replace memcpy_{to,from}iovec* with variants taking iov_iter *? If nothing else, it'll be marginally more efficient (no more skipping the already-emptied iovecs) and it seems to be more convenient for callers. If we are lucky, that might even eliminate the need of ->sendpage() - just set the iov_iter over <page,offset,size> array instead of iovec one and let ->sendmsg() do the smart thing if it knows how. I hadn't done comparison of {tcp,udp}_send{page,msg}, though - there might be dragons... Even if that will turn out to be infeasible, it will at least drive the kmap/kunmap done by sock_no_sendpage() down into memcpy_from_iter(), turning them into kmap_atomic/kunmap_atomic. The obvious price is that kernel-side msghdr diverges from the userland one, so copy_msghdr_from_user() needs to deal with that, but I really doubt that you'll find a load where the price of copying it in two chunks instead of one would be measurable. What else am I missing? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring @ 2014-02-07 17:10 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-07 17:10 UTC (permalink / raw) To: David Miller Cc: torvalds, hch, axboe, mfasheh, jlbec, linux-fsdevel, xfs, sage, sfrench On Sat, Jan 18, 2014 at 12:44:53AM -0800, David Miller wrote: > From: Al Viro <viro@ZenIV.linux.org.uk> > Date: Sat, 18 Jan 2014 08:27:30 +0000 > > > BTW, would sockets benefit from having ->sendpages() that would take an > > array of (page, offset, len) triples? It would be trivial to do and > > some of the helpers that are falling out of writing that writev-based > > default_file_splice_write() look like they could be reused for > > calling that one... Dave? > > That's originally how the sendpage method was implemented, but back then > Linus asked us to only pass one page at a time. > > I don't remember the details beyond that. FWIW, I wonder if what we are doing with ->msg_iov is the right thing. We modify the iovecs in array as we drain it. And that's inconvenient for at least some callers (see e.g. complaints in fs/ncpfs about the need to copy the array, etc.). What if we embed iov_iter into the sucker and replace memcpy_{to,from}iovec* with variants taking iov_iter *? If nothing else, it'll be marginally more efficient (no more skipping the already-emptied iovecs) and it seems to be more convenient for callers. If we are lucky, that might even eliminate the need of ->sendpage() - just set the iov_iter over <page,offset,size> array instead of iovec one and let ->sendmsg() do the smart thing if it knows how. I hadn't done comparison of {tcp,udp}_send{page,msg}, though - there might be dragons... Even if that will turn out to be infeasible, it will at least drive the kmap/kunmap done by sock_no_sendpage() down into memcpy_from_iter(), turning them into kmap_atomic/kunmap_atomic. The obvious price is that kernel-side msghdr diverges from the userland one, so copy_msghdr_from_user() needs to deal with that, but I really doubt that you'll find a load where the price of copying it in two chunks instead of one would be measurable. What else am I missing? ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2014-01-18 7:46 ` Al Viro 2014-01-18 7:56 ` Al Viro 2014-01-18 8:27 ` Al Viro @ 2014-01-18 19:59 ` Linus Torvalds 2014-01-18 20:10 ` Al Viro 2 siblings, 1 reply; 82+ messages in thread From: Linus Torvalds @ 2014-01-18 19:59 UTC (permalink / raw) To: Al Viro Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel On Fri, Jan 17, 2014 at 11:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Jan 17, 2014 at 11:22:04PM -0800, Linus Torvalds wrote: >> On Fri, Jan 17, 2014 at 10:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: >> > >> > Objections, comments? >> >> I certainly object to the "map, then unmap" approach. No VM games. > > Um... > > If we are going to copy that data (and all users of generic_file_splice_write() > do that memcpy() to page cache), we have to kmap the source ;-/ Yeah, the kmap/kunmap we have to do. But that's a no-op on 64-bit, and has to be done one page at a time (well, I guess you could do a couple). But you can't do that *around* the default_file_splice_write(), so I thought you meant some kind of "map into user space". And I absolutely *detest* that kind of approach. Linus _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2014-01-18 19:59 ` Linus Torvalds @ 2014-01-18 20:10 ` Al Viro 2014-01-18 20:27 ` Al Viro 2014-01-19 5:13 ` Al Viro 0 siblings, 2 replies; 82+ messages in thread From: Al Viro @ 2014-01-18 20:10 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel On Sat, Jan 18, 2014 at 11:59:56AM -0800, Linus Torvalds wrote: > On Fri, Jan 17, 2014 at 11:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Fri, Jan 17, 2014 at 11:22:04PM -0800, Linus Torvalds wrote: > >> On Fri, Jan 17, 2014 at 10:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > >> > > >> > Objections, comments? > >> > >> I certainly object to the "map, then unmap" approach. No VM games. > > > > Um... > > > > If we are going to copy that data (and all users of generic_file_splice_write() > > do that memcpy() to page cache), we have to kmap the source ;-/ > > Yeah, the kmap/kunmap we have to do. But that's a no-op on 64-bit, and > has to be done one page at a time (well, I guess you could do a > couple). > > But you can't do that *around* the default_file_splice_write(), so I > thought you meant some kind of "map into user space". And I absolutely > *detest* that kind of approach. Ouch... No, I hadn't meant that kind of insanity, but I'd missed the problem with scarcity of mappings completely... _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2014-01-18 20:10 ` Al Viro @ 2014-01-18 20:27 ` Al Viro 2014-01-19 5:13 ` Al Viro 1 sibling, 0 replies; 82+ messages in thread From: Al Viro @ 2014-01-18 20:27 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel On Sat, Jan 18, 2014 at 08:10:31PM +0000, Al Viro wrote: > On Sat, Jan 18, 2014 at 11:59:56AM -0800, Linus Torvalds wrote: > > On Fri, Jan 17, 2014 at 11:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > On Fri, Jan 17, 2014 at 11:22:04PM -0800, Linus Torvalds wrote: > > >> On Fri, Jan 17, 2014 at 10:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > >> > > > >> > Objections, comments? > > >> > > >> I certainly object to the "map, then unmap" approach. No VM games. > > > > > > Um... > > > > > > If we are going to copy that data (and all users of generic_file_splice_write() > > > do that memcpy() to page cache), we have to kmap the source ;-/ > > > > Yeah, the kmap/kunmap we have to do. But that's a no-op on 64-bit, and > > has to be done one page at a time (well, I guess you could do a > > couple). > > > > But you can't do that *around* the default_file_splice_write(), so I > > thought you meant some kind of "map into user space". And I absolutely > > *detest* that kind of approach. > > Ouch... No, I hadn't meant that kind of insanity, but I'd missed the > problem with scarcity of mappings completely... Ouch^2: default_file_write_splice_write() keeps calling write_pipe_buf(), which does this: data = buf->ops->map(pipe, buf, 0); ret = __kernel_write(sd->u.file, data + buf->offset, sd->len, &tmp); buf->ops->unmap(pipe, buf, data); IOW, ->write() (with whatever locks there might be) wrapped into kmap_atomic()/kunmap_atomic(). And anybody can do that - just a splice to file on procfs will hit that codepath... Or on 9p, for that matter, or fat, or afs, or cifs, etc. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring @ 2014-01-18 20:27 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-01-18 20:27 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French On Sat, Jan 18, 2014 at 08:10:31PM +0000, Al Viro wrote: > On Sat, Jan 18, 2014 at 11:59:56AM -0800, Linus Torvalds wrote: > > On Fri, Jan 17, 2014 at 11:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > On Fri, Jan 17, 2014 at 11:22:04PM -0800, Linus Torvalds wrote: > > >> On Fri, Jan 17, 2014 at 10:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > >> > > > >> > Objections, comments? > > >> > > >> I certainly object to the "map, then unmap" approach. No VM games. > > > > > > Um... > > > > > > If we are going to copy that data (and all users of generic_file_splice_write() > > > do that memcpy() to page cache), we have to kmap the source ;-/ > > > > Yeah, the kmap/kunmap we have to do. But that's a no-op on 64-bit, and > > has to be done one page at a time (well, I guess you could do a > > couple). > > > > But you can't do that *around* the default_file_splice_write(), so I > > thought you meant some kind of "map into user space". And I absolutely > > *detest* that kind of approach. > > Ouch... No, I hadn't meant that kind of insanity, but I'd missed the > problem with scarcity of mappings completely... Ouch^2: default_file_write_splice_write() keeps calling write_pipe_buf(), which does this: data = buf->ops->map(pipe, buf, 0); ret = __kernel_write(sd->u.file, data + buf->offset, sd->len, &tmp); buf->ops->unmap(pipe, buf, data); IOW, ->write() (with whatever locks there might be) wrapped into kmap_atomic()/kunmap_atomic(). And anybody can do that - just a splice to file on procfs will hit that codepath... Or on 9p, for that matter, or fat, or afs, or cifs, etc. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring 2014-01-18 20:27 ` Al Viro @ 2014-01-18 20:30 ` Al Viro -1 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-01-18 20:30 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel On Sat, Jan 18, 2014 at 08:27:17PM +0000, Al Viro wrote: > Ouch^2: default_file_write_splice_write() keeps calling write_pipe_buf(), > which does this: > data = buf->ops->map(pipe, buf, 0); > ret = __kernel_write(sd->u.file, data + buf->offset, sd->len, &tmp); > buf->ops->unmap(pipe, buf, data); > IOW, ->write() (with whatever locks there might be) wrapped into > kmap_atomic()/kunmap_atomic(). And anybody can do that - just a splice to > file on procfs will hit that codepath... Or on 9p, for that matter, or > fat, or afs, or cifs, etc. s/kmap_atomic/kmap/, so it's not that disastrously bad (kmap_atomic might lose the mapping as soon as we block), but scarcity problem remains... _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH 0/5] splice: locking changes and code refactoring @ 2014-01-18 20:30 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-01-18 20:30 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French On Sat, Jan 18, 2014 at 08:27:17PM +0000, Al Viro wrote: > Ouch^2: default_file_write_splice_write() keeps calling write_pipe_buf(), > which does this: > data = buf->ops->map(pipe, buf, 0); > ret = __kernel_write(sd->u.file, data + buf->offset, sd->len, &tmp); > buf->ops->unmap(pipe, buf, data); > IOW, ->write() (with whatever locks there might be) wrapped into > kmap_atomic()/kunmap_atomic(). And anybody can do that - just a splice to > file on procfs will hit that codepath... Or on 9p, for that matter, or > fat, or afs, or cifs, etc. s/kmap_atomic/kmap/, so it's not that disastrously bad (kmap_atomic might lose the mapping as soon as we block), but scarcity problem remains... ^ permalink raw reply [flat|nested] 82+ messages in thread
* [RFC] unifying write variants for filesystems 2014-01-18 20:10 ` Al Viro @ 2014-01-19 5:13 ` Al Viro 2014-01-19 5:13 ` Al Viro 1 sibling, 0 replies; 82+ messages in thread From: Al Viro @ 2014-01-19 5:13 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel On Sat, Jan 18, 2014 at 08:10:31PM +0000, Al Viro wrote: > Ouch... No, I hadn't meant that kind of insanity, but I'd missed the > problem with scarcity of mappings completely... OK, that pretty much kills this approach. Pity... Folks, what do you think about the following: * a new data structure: struct io_source { enum {IO_IOVEC, IO_PVEC} type; union { struct iovec *iov; struct pvec { struct page *page; unsigned offset; unsigned size; } *pvec; }; } * a new method that would look like aio_write, but take struct io_source instead of iovec. * store the type in iov_iter (normally - IO_UIOVEC) and teach the code dealing with it to do the right thing depending on type. I.e. instead of __copy_from_user_inatomic() do kmap_atomic()/memcpy()/kunmap_atomic() if it's a IO_PAGEVEC. * generic_file_aio_write() analog for new method, converging with generic_file_aio_write() almost immediately (basically, as soon as iov_iter has been initialized). * new_aio_write() consisting of { struct io_source source = {.type = IO_UIOVEC, .user = iov}; return file->f_op-><new_method>(iocb, &source, nr_segs, pos); } * new_sync_write(), doing what do_sync_write() does for files that have new_aio_write() as ->aio_write(). * new_splice_write() usable for files that provide that method - it would collect pipe_buffers, put together struct pvec array and pass it to that method. All mapping the pages would happen one-by-one and only around actual copying the data. And, of course, the locking would be identical to what we do for write()/writev()/aio write Then filesystems can switch to that new method, turning their flipping their aio_write() instances to new type and replacing ->aio_write with default_aio_write, ->write with new_write and ->splice_write with new_splice_write. Actually, there's a possibility that it would be possible to use it for *all* instances of ->splice_write() - we'd need to store something a pointer to "call this to try and steal this page" function in pvec and allow the method do actual stealing. Note that pipe_buffer ->steal() only uses the page argument - they all ignore which pipe it's in (and there's nothing they could usefully do if they knew which pipe had it been in the first place). This is very preliminary, of course, and I might easily miss something - the previous idea was unworkable, after all. Comments would be very welcome... _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* [RFC] unifying write variants for filesystems @ 2014-01-19 5:13 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-01-19 5:13 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French On Sat, Jan 18, 2014 at 08:10:31PM +0000, Al Viro wrote: > Ouch... No, I hadn't meant that kind of insanity, but I'd missed the > problem with scarcity of mappings completely... OK, that pretty much kills this approach. Pity... Folks, what do you think about the following: * a new data structure: struct io_source { enum {IO_IOVEC, IO_PVEC} type; union { struct iovec *iov; struct pvec { struct page *page; unsigned offset; unsigned size; } *pvec; }; } * a new method that would look like aio_write, but take struct io_source instead of iovec. * store the type in iov_iter (normally - IO_UIOVEC) and teach the code dealing with it to do the right thing depending on type. I.e. instead of __copy_from_user_inatomic() do kmap_atomic()/memcpy()/kunmap_atomic() if it's a IO_PAGEVEC. * generic_file_aio_write() analog for new method, converging with generic_file_aio_write() almost immediately (basically, as soon as iov_iter has been initialized). * new_aio_write() consisting of { struct io_source source = {.type = IO_UIOVEC, .user = iov}; return file->f_op-><new_method>(iocb, &source, nr_segs, pos); } * new_sync_write(), doing what do_sync_write() does for files that have new_aio_write() as ->aio_write(). * new_splice_write() usable for files that provide that method - it would collect pipe_buffers, put together struct pvec array and pass it to that method. All mapping the pages would happen one-by-one and only around actual copying the data. And, of course, the locking would be identical to what we do for write()/writev()/aio write Then filesystems can switch to that new method, turning their flipping their aio_write() instances to new type and replacing ->aio_write with default_aio_write, ->write with new_write and ->splice_write with new_splice_write. Actually, there's a possibility that it would be possible to use it for *all* instances of ->splice_write() - we'd need to store something a pointer to "call this to try and steal this page" function in pvec and allow the method do actual stealing. Note that pipe_buffer ->steal() only uses the page argument - they all ignore which pipe it's in (and there's nothing they could usefully do if they knew which pipe had it been in the first place). This is very preliminary, of course, and I might easily miss something - the previous idea was unworkable, after all. Comments would be very welcome... ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-01-19 5:13 ` Al Viro @ 2014-01-20 13:55 ` Christoph Hellwig -1 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2014-01-20 13:55 UTC (permalink / raw) To: Al Viro Cc: Jens Axboe, Steve French, Sage Weil, Dave Kleikamp, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel, Linus Torvalds On Sun, Jan 19, 2014 at 05:13:35AM +0000, Al Viro wrote: > Folks, what do you think about the following: That's very much what Shaggy did in the aio-direct tree, except that it kept using a single set of methods. Linus really didn't like it unfortunately. https://github.com/kleikamp/linux-shaggy/commits/aio_loop _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems @ 2014-01-20 13:55 ` Christoph Hellwig 0 siblings, 0 replies; 82+ messages in thread From: Christoph Hellwig @ 2014-01-20 13:55 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French, Dave Kleikamp On Sun, Jan 19, 2014 at 05:13:35AM +0000, Al Viro wrote: > Folks, what do you think about the following: That's very much what Shaggy did in the aio-direct tree, except that it kept using a single set of methods. Linus really didn't like it unfortunately. https://github.com/kleikamp/linux-shaggy/commits/aio_loop ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-01-20 13:55 ` Christoph Hellwig @ 2014-01-20 20:32 ` Linus Torvalds -1 siblings, 0 replies; 82+ messages in thread From: Linus Torvalds @ 2014-01-20 20:32 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Dave Kleikamp, Sage Weil, Mark Fasheh, xfs, Steve French, Al Viro, linux-fsdevel, Joel Becker On Mon, Jan 20, 2014 at 5:55 AM, Christoph Hellwig <hch@infradead.org> wrote: > On Sun, Jan 19, 2014 at 05:13:35AM +0000, Al Viro wrote: >> Folks, what do you think about the following: > > That's very much what Shaggy did in the aio-direct tree, except that > it kept using a single set of methods. Linus really didn't like it > unfortunately. Umm. That wasn't what I objected to. I objected to the incredibly ugly implementation, the crazy new flags arguments for core helper functions, ugly naming etc etc. I even outlined what might fix it. In other words, I thought the code was shit and ugly. Not that iterators per se would be wrong. Just doing them badly is wrong. So don't go putting words in my mouth. It isn't "unfortunate" at all that I have higher quality standards. Linus _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems @ 2014-01-20 20:32 ` Linus Torvalds 0 siblings, 0 replies; 82+ messages in thread From: Linus Torvalds @ 2014-01-20 20:32 UTC (permalink / raw) To: Christoph Hellwig Cc: Al Viro, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French, Dave Kleikamp On Mon, Jan 20, 2014 at 5:55 AM, Christoph Hellwig <hch@infradead.org> wrote: > On Sun, Jan 19, 2014 at 05:13:35AM +0000, Al Viro wrote: >> Folks, what do you think about the following: > > That's very much what Shaggy did in the aio-direct tree, except that > it kept using a single set of methods. Linus really didn't like it > unfortunately. Umm. That wasn't what I objected to. I objected to the incredibly ugly implementation, the crazy new flags arguments for core helper functions, ugly naming etc etc. I even outlined what might fix it. In other words, I thought the code was shit and ugly. Not that iterators per se would be wrong. Just doing them badly is wrong. So don't go putting words in my mouth. It isn't "unfortunate" at all that I have higher quality standards. Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-01-20 20:32 ` Linus Torvalds @ 2014-02-01 22:43 ` Al Viro -1 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-01 22:43 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Steve French, Sage Weil, Dave Kleikamp, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel, Anton Altaparmakov On Mon, Jan 20, 2014 at 12:32:01PM -0800, Linus Torvalds wrote: > On Mon, Jan 20, 2014 at 5:55 AM, Christoph Hellwig <hch@infradead.org> wrote: > > On Sun, Jan 19, 2014 at 05:13:35AM +0000, Al Viro wrote: > >> Folks, what do you think about the following: > > > > That's very much what Shaggy did in the aio-direct tree, except that > > it kept using a single set of methods. Linus really didn't like it > > unfortunately. > > Umm. That wasn't what I objected to. > > I objected to the incredibly ugly implementation, the crazy new flags > arguments for core helper functions, ugly naming etc etc. I even > outlined what might fix it. > > In other words, I thought the code was shit and ugly. Not that > iterators per se would be wrong. Just doing them badly is wrong. Gyahhh... OK, I should've known better than go looking into that thing after such warning. Some relatively printable notes (i.e. about 10% of the comments I really had about that) follow: * WTF bother passing 'pos' separately? It's the same mistake that was made with ->aio_read/->aio_write and just as with those, *all* callers provably have pos == iocb->ki_pos. * I'm not sure that iov_iter is a good fit here. OTOH, it probably could be reused (it has damn few users right now and they are on the codepaths involved into that thing). * We *definitely* want a variant structure with tag - unsigned long thing was just plain insane. I see at least two variants - array of iovecs and array of (at least) triples <page, offset, length>. Quite possibly - quadruples, with "here's how to try to steal this page" thrown in, if we want that as replacement for ->splice_write() as well (it looks like the few instances that do steal on pipe-to-file splices could be dealt with the same way as the dumb ones, provided that ->write_iter or whatever we end up calling it is allowed to try and steal pages). Possibly more variants on the read side of things... FWIW, I'm not sure that bio_vec makes a lot of sense here. * direction (i.e. source vs. destination) also should be a part of that tag. Which, BTW, turns ->direct_IO() into iocb x iov_iter -> int; the situation with pos is identical to aio_read/aio_write. While we are at it, KERNEL_WRITE thing (used only for __swap_writepage()) looks very much like a special case of "array of <page,offset,size>" we want for splice_write... * having looked through the ->read and ->write instances in drivers, I'd say that surprisingly few helpers would go a _long_ way towards converting those guys to the same methods. And we need such helpers anyway - there's a whole lot of (badly) open-coded "copy the whole user buffer into kmalloc'ed array and NUL-terminate the sucker" logics in ->write() instances, for example. Even more "copy up to that much into given array and NUL-terminate it", with rather amusing bugs in there - e.g. NUL going into the end of array, regardless of the actual amount of data to copy; junk is left in the middle, complete with printk of the entire thing if it doesn't make sense. IOW, spewing random piece of kernel stack into dmesg. Off-by-ones galore, too... BTW, speaking of bogosities - grep for generic_write_checks(). Exactly one caller (in __generic_file_aio_write()) has any business looking at S_ISBLK(inode->i_mode) - it can be called by blkdev_aio_write(). All other callers have copied that, even though it makes absolutely no sense for them... * file_readable/file_writable looks really wrong; if nothing else, I would rather check that after open() and set a couple of FMODE bits, then check those. Possibly even "knock FMODE_READ/FMODE_WRITE out if there's no method"... * pipe_buffer_operations ->map()/->unmap() should die; let the caller do k{,un}map{,_atomic}(). All instances have the same method there and there's no point to make it different. PIPE_BUF_FLAG_ATOMIC should also go. * WTF do iov_iter_copy_from_user_atomic() callers keep doing pagefault_disable and pagefault_enable around it? The sucker starts with kmap_atomic() and ends with kunmap_atomic(); all instances of those guys have pagefaul disabling/enabling (and I suspect that it might make sense to lift it out of arch-specific variants - rename them to e.g. __kmap_atomic(), rip pagefault_disable() out of those and put kmap_atomic() into highmem.h outside of ifdef, with !HIGHMEM side of ifdef having #define __kmap_atomic page_address; move pagefault_enable() from __kunmap_atomic() to kunmap_atomic() while we are at it). Note that e.g. pipe.c relies on kmap_atomic() disabling pagefaults - we have __copy_from_user_inatomic() done there under kmap_atomic(), and we really don't want to block in such conditions. * pipe_iov_copy_from_user() ought to return how much has it managed to bring in, not 0 or -EFAULT as it does now. Then it won't need the non-atomic side, AFAICS. Moreover, we'll just be able to use iov_iter_copy_from_user_atomic() (which badly needs a more palatable name, BTW). * sure, we want to call do_generic_file_read() once, passing the entire iovec to file_read_actor(). But what the hell does that have to do with introduction of new methods? It's a change that makes sense on its own. Moreover, it's a damn good preparation to adding those - we get generic_file_aio_read() into "set iov_iter up, then do <this>", with <this> using iov_iter instead of iovec. Once we get to introducing those methods, it's just a matter of taking the rest of generic_file_aio_read() into a separate function and making that function an instance of new method... * Unrelated to this patchset, but... may I politely inquire about the reasons why ->is_partially_uptodate() gets read_descriptor_t? The whole point of read_descriptor_t (if it has any) is that its interpretation is up to whoever's doing the reading, _not_ what they are reading from. So desc->arg is off-limits for any instance of ->is_partially_uptodate(). desc->written is obviously pointless for them; the need (or lack thereof) to do something to the page doesn't depend on how much have we already read from the file. Moreover, reporting an error is rather dubious in such method; if there's something fishy with the page, we'll just return "no" and let ->readpage() complain. Which leaves only desc->count, which, unsurprisingly, is the only thing looked at by (the only) instance of that method. And "tell me if the part of this page that long starting from that offset is up to date" is much more natural that "is what this read_descriptor_t would have had us read from this offset in this page up to date?"... * do we really need separate non-atomic variant of iov_iter_copy_from_user()? We have only two users right now (cifs and ceph) and both can use the fault-in / atomic copy loop without much pain... * in addition to having too many methods, I'm not convinced that we want them to be methods. Let's start with explicit checks in the primitives and see where it goes from there; if we start to grow too many variants, we can always introduce some methods, but then we'll be in better position to decide what is and what is not a good method... * on the read side, I really don't believe that exposing atomic and non-atomic variants is a good idea. Look at the existing users of __copy_to_user_inatomic(); leaving aside the i915_gem weirdness, all of them are used to implement the exact same thing: given a page, offset and length, feed its contents to iovec/buffer/whatnot. Unlike the write side of things, there's nothing between prefaulting pages and actual attempts to copy. So let's make _that_ an exposed primitive and let it deal with kmap/kmap_atomic/etc. Variants that don't have to worry about blocking (vector of <page,offset,length>, etc.) would simply not bother with non-atomic kmap, etc. Sure, it should take iov_iter as destination. And deal with mapping the damn thing internally... * ntfs_file_buffered_write() should switch to iov_iter as well. It's open-coding a lot of iov_iter stuff. It's not entirely trivial and I'd really like to hear from ntfs folks on that, though, and the current code looks deadlock-prone. We prefault everything, then lock the pages to which we'll be copying, then attempt to do __copy_from_user_inatomic(), falling back to __copy_from_user() if that fails. Fine, but what happens if the source of write() is mmaped from the same file and we lose CPU while prefaulting the last page, have memory pressure evict the first one, then have __ntfs_grab_cache_pages() allocate and lock (nonuptodate) pages we'll be copying to and have __copy_from_user() try to copy *from* those same pages? We are doing it while holding these pages locked, so pagefault will have really fun time with them... Anton? BTW, Linus, when did you comment on that patchset? I've found an iteration of that patchset circa last October (v9, apparently the latest posted), but it looks like your comments had either got lost or had been on the earlier iteration of that thing... _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems @ 2014-02-01 22:43 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-01 22:43 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French, Dave Kleikamp, Anton Altaparmakov On Mon, Jan 20, 2014 at 12:32:01PM -0800, Linus Torvalds wrote: > On Mon, Jan 20, 2014 at 5:55 AM, Christoph Hellwig <hch@infradead.org> wrote: > > On Sun, Jan 19, 2014 at 05:13:35AM +0000, Al Viro wrote: > >> Folks, what do you think about the following: > > > > That's very much what Shaggy did in the aio-direct tree, except that > > it kept using a single set of methods. Linus really didn't like it > > unfortunately. > > Umm. That wasn't what I objected to. > > I objected to the incredibly ugly implementation, the crazy new flags > arguments for core helper functions, ugly naming etc etc. I even > outlined what might fix it. > > In other words, I thought the code was shit and ugly. Not that > iterators per se would be wrong. Just doing them badly is wrong. Gyahhh... OK, I should've known better than go looking into that thing after such warning. Some relatively printable notes (i.e. about 10% of the comments I really had about that) follow: * WTF bother passing 'pos' separately? It's the same mistake that was made with ->aio_read/->aio_write and just as with those, *all* callers provably have pos == iocb->ki_pos. * I'm not sure that iov_iter is a good fit here. OTOH, it probably could be reused (it has damn few users right now and they are on the codepaths involved into that thing). * We *definitely* want a variant structure with tag - unsigned long thing was just plain insane. I see at least two variants - array of iovecs and array of (at least) triples <page, offset, length>. Quite possibly - quadruples, with "here's how to try to steal this page" thrown in, if we want that as replacement for ->splice_write() as well (it looks like the few instances that do steal on pipe-to-file splices could be dealt with the same way as the dumb ones, provided that ->write_iter or whatever we end up calling it is allowed to try and steal pages). Possibly more variants on the read side of things... FWIW, I'm not sure that bio_vec makes a lot of sense here. * direction (i.e. source vs. destination) also should be a part of that tag. Which, BTW, turns ->direct_IO() into iocb x iov_iter -> int; the situation with pos is identical to aio_read/aio_write. While we are at it, KERNEL_WRITE thing (used only for __swap_writepage()) looks very much like a special case of "array of <page,offset,size>" we want for splice_write... * having looked through the ->read and ->write instances in drivers, I'd say that surprisingly few helpers would go a _long_ way towards converting those guys to the same methods. And we need such helpers anyway - there's a whole lot of (badly) open-coded "copy the whole user buffer into kmalloc'ed array and NUL-terminate the sucker" logics in ->write() instances, for example. Even more "copy up to that much into given array and NUL-terminate it", with rather amusing bugs in there - e.g. NUL going into the end of array, regardless of the actual amount of data to copy; junk is left in the middle, complete with printk of the entire thing if it doesn't make sense. IOW, spewing random piece of kernel stack into dmesg. Off-by-ones galore, too... BTW, speaking of bogosities - grep for generic_write_checks(). Exactly one caller (in __generic_file_aio_write()) has any business looking at S_ISBLK(inode->i_mode) - it can be called by blkdev_aio_write(). All other callers have copied that, even though it makes absolutely no sense for them... * file_readable/file_writable looks really wrong; if nothing else, I would rather check that after open() and set a couple of FMODE bits, then check those. Possibly even "knock FMODE_READ/FMODE_WRITE out if there's no method"... * pipe_buffer_operations ->map()/->unmap() should die; let the caller do k{,un}map{,_atomic}(). All instances have the same method there and there's no point to make it different. PIPE_BUF_FLAG_ATOMIC should also go. * WTF do iov_iter_copy_from_user_atomic() callers keep doing pagefault_disable and pagefault_enable around it? The sucker starts with kmap_atomic() and ends with kunmap_atomic(); all instances of those guys have pagefaul disabling/enabling (and I suspect that it might make sense to lift it out of arch-specific variants - rename them to e.g. __kmap_atomic(), rip pagefault_disable() out of those and put kmap_atomic() into highmem.h outside of ifdef, with !HIGHMEM side of ifdef having #define __kmap_atomic page_address; move pagefault_enable() from __kunmap_atomic() to kunmap_atomic() while we are at it). Note that e.g. pipe.c relies on kmap_atomic() disabling pagefaults - we have __copy_from_user_inatomic() done there under kmap_atomic(), and we really don't want to block in such conditions. * pipe_iov_copy_from_user() ought to return how much has it managed to bring in, not 0 or -EFAULT as it does now. Then it won't need the non-atomic side, AFAICS. Moreover, we'll just be able to use iov_iter_copy_from_user_atomic() (which badly needs a more palatable name, BTW). * sure, we want to call do_generic_file_read() once, passing the entire iovec to file_read_actor(). But what the hell does that have to do with introduction of new methods? It's a change that makes sense on its own. Moreover, it's a damn good preparation to adding those - we get generic_file_aio_read() into "set iov_iter up, then do <this>", with <this> using iov_iter instead of iovec. Once we get to introducing those methods, it's just a matter of taking the rest of generic_file_aio_read() into a separate function and making that function an instance of new method... * Unrelated to this patchset, but... may I politely inquire about the reasons why ->is_partially_uptodate() gets read_descriptor_t? The whole point of read_descriptor_t (if it has any) is that its interpretation is up to whoever's doing the reading, _not_ what they are reading from. So desc->arg is off-limits for any instance of ->is_partially_uptodate(). desc->written is obviously pointless for them; the need (or lack thereof) to do something to the page doesn't depend on how much have we already read from the file. Moreover, reporting an error is rather dubious in such method; if there's something fishy with the page, we'll just return "no" and let ->readpage() complain. Which leaves only desc->count, which, unsurprisingly, is the only thing looked at by (the only) instance of that method. And "tell me if the part of this page that long starting from that offset is up to date" is much more natural that "is what this read_descriptor_t would have had us read from this offset in this page up to date?"... * do we really need separate non-atomic variant of iov_iter_copy_from_user()? We have only two users right now (cifs and ceph) and both can use the fault-in / atomic copy loop without much pain... * in addition to having too many methods, I'm not convinced that we want them to be methods. Let's start with explicit checks in the primitives and see where it goes from there; if we start to grow too many variants, we can always introduce some methods, but then we'll be in better position to decide what is and what is not a good method... * on the read side, I really don't believe that exposing atomic and non-atomic variants is a good idea. Look at the existing users of __copy_to_user_inatomic(); leaving aside the i915_gem weirdness, all of them are used to implement the exact same thing: given a page, offset and length, feed its contents to iovec/buffer/whatnot. Unlike the write side of things, there's nothing between prefaulting pages and actual attempts to copy. So let's make _that_ an exposed primitive and let it deal with kmap/kmap_atomic/etc. Variants that don't have to worry about blocking (vector of <page,offset,length>, etc.) would simply not bother with non-atomic kmap, etc. Sure, it should take iov_iter as destination. And deal with mapping the damn thing internally... * ntfs_file_buffered_write() should switch to iov_iter as well. It's open-coding a lot of iov_iter stuff. It's not entirely trivial and I'd really like to hear from ntfs folks on that, though, and the current code looks deadlock-prone. We prefault everything, then lock the pages to which we'll be copying, then attempt to do __copy_from_user_inatomic(), falling back to __copy_from_user() if that fails. Fine, but what happens if the source of write() is mmaped from the same file and we lose CPU while prefaulting the last page, have memory pressure evict the first one, then have __ntfs_grab_cache_pages() allocate and lock (nonuptodate) pages we'll be copying to and have __copy_from_user() try to copy *from* those same pages? We are doing it while holding these pages locked, so pagefault will have really fun time with them... Anton? BTW, Linus, when did you comment on that patchset? I've found an iteration of that patchset circa last October (v9, apparently the latest posted), but it looks like your comments had either got lost or had been on the earlier iteration of that thing... ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-01 22:43 ` Al Viro (?) @ 2014-02-02 0:13 ` Linus Torvalds 2014-02-02 2:02 ` Al Viro -1 siblings, 1 reply; 82+ messages in thread From: Linus Torvalds @ 2014-02-02 0:13 UTC (permalink / raw) To: Al Viro Cc: Jens Axboe, Steve French, Sage Weil, Dave Kleikamp, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel, Anton Altaparmakov On Sat, Feb 1, 2014 at 2:43 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > BTW, Linus, when did you comment on that patchset? I've found an iteration > of that patchset circa last October (v9, apparently the latest posted), > but it looks like your comments had either got lost or had been on the > earlier iteration of that thing... I commented on the pull request in November. Something like this: http://lkml.indiana.edu/hypermail/linux/kernel/1311.2/02278.html and then I have a follow-up in a reply to that that at least outlines a few things that could be done to make it less barfy. But judging from your email, you actually went through that patch-series with a finer comb, I don't think there is anything in that commentary of mine that adds anything to yours, apart on a comment on the naming that I hated. Linus _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-02 0:13 ` Linus Torvalds @ 2014-02-02 2:02 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-02 2:02 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Steve French, Sage Weil, Dave Kleikamp, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel, Anton Altaparmakov On Sat, Feb 01, 2014 at 04:13:36PM -0800, Linus Torvalds wrote: > On Sat, Feb 1, 2014 at 2:43 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > BTW, Linus, when did you comment on that patchset? I've found an iteration > > of that patchset circa last October (v9, apparently the latest posted), > > but it looks like your comments had either got lost or had been on the > > earlier iteration of that thing... > > I commented on the pull request in November. Something like this: > > http://lkml.indiana.edu/hypermail/linux/kernel/1311.2/02278.html > > and then I have a follow-up in a reply to that that at least outlines > a few things that could be done to make it less barfy. > > But judging from your email, you actually went through that > patch-series with a finer comb, I don't think there is anything in > that commentary of mine that adds anything to yours, apart on a > comment on the naming that I hated. Heh... You've actually been a lot more polite than what I started to write on that particular topic. Mine had started with "folks, identifiers should be possible to read aloud, without the people around getting nervous. I mean, ai,ai-ai,owwie-...?" and then got so nasty that I decided to leave the whole thing alone. But yes, references to the importance of remembering the safeword aside, the naming is really atrocious. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems @ 2014-02-02 2:02 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-02 2:02 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French, Dave Kleikamp, Anton Altaparmakov On Sat, Feb 01, 2014 at 04:13:36PM -0800, Linus Torvalds wrote: > On Sat, Feb 1, 2014 at 2:43 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > BTW, Linus, when did you comment on that patchset? I've found an iteration > > of that patchset circa last October (v9, apparently the latest posted), > > but it looks like your comments had either got lost or had been on the > > earlier iteration of that thing... > > I commented on the pull request in November. Something like this: > > http://lkml.indiana.edu/hypermail/linux/kernel/1311.2/02278.html > > and then I have a follow-up in a reply to that that at least outlines > a few things that could be done to make it less barfy. > > But judging from your email, you actually went through that > patch-series with a finer comb, I don't think there is anything in > that commentary of mine that adds anything to yours, apart on a > comment on the naming that I hated. Heh... You've actually been a lot more polite than what I started to write on that particular topic. Mine had started with "folks, identifiers should be possible to read aloud, without the people around getting nervous. I mean, ai,ai-ai,owwie-...?" and then got so nasty that I decided to leave the whole thing alone. But yes, references to the importance of remembering the safeword aside, the naming is really atrocious. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-01 22:43 ` Al Viro @ 2014-02-02 19:21 ` Al Viro -1 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-02 19:21 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Steve French, Sage Weil, Dave Kleikamp, Mark Fasheh, Miklos Szeredi, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel, Anton Altaparmakov On Sat, Feb 01, 2014 at 10:43:01PM +0000, Al Viro wrote: > * pipe_buffer_operations ->map()/->unmap() should die; let the caller do > k{,un}map{,_atomic}(). All instances have the same method there and > there's no point to make it different. PIPE_BUF_FLAG_ATOMIC should also > go. BTW, another pile of code interesting in that respect (i.e. getting that interface right) is fs/fuse/dev.c; I don't like the way it's playing with get_user_pages_fast() there, and I doubt that sharing the code for read and write side as it's done there makes much sense, but it's definitely going to be a test for any API of that kind. It *does* try to unify write-from-iovec with write-from-array-of-pages and similar for reads; the interesting issue is that unlike the usual write-to-pagecache we can have many chunks picked from one page and we'd rather avoid doing kmap_atomic/kunmap_atomic for each of those. I suspect that the right answer is, in addition to a primitive that does copying from iov_iter to have "copy from iov_iter and be ready to copy more from soon after" + "done copying"; for the "array of pages" the former would be allowed to leave the current page mapped, skipping kmap_atomic() on the next call. And the latter would unmap. of course. The caller is responsible for not blocking or doing unbalanced map/unmap until it's said "done copying". BTW, is there any reason why fuse/dev.c doesn't use atomic kmaps for everything? After all, as soon as we'd done kmap() in there, we grab a spinlock and don't drop it until just before kunmap(). With nothing by memcpy() done in between... Miklos? AFAICS, we only win from switching to kmap_atomic there - we can't block anyway, we don't need it to be visible on other CPUs and nesting isn't a problem. Looks like it'll be cheaper in highmem cases and do exactly the same thing as now for non-highmem... Comments? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems @ 2014-02-02 19:21 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-02 19:21 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French, Dave Kleikamp, Anton Altaparmakov, Miklos Szeredi On Sat, Feb 01, 2014 at 10:43:01PM +0000, Al Viro wrote: > * pipe_buffer_operations ->map()/->unmap() should die; let the caller do > k{,un}map{,_atomic}(). All instances have the same method there and > there's no point to make it different. PIPE_BUF_FLAG_ATOMIC should also > go. BTW, another pile of code interesting in that respect (i.e. getting that interface right) is fs/fuse/dev.c; I don't like the way it's playing with get_user_pages_fast() there, and I doubt that sharing the code for read and write side as it's done there makes much sense, but it's definitely going to be a test for any API of that kind. It *does* try to unify write-from-iovec with write-from-array-of-pages and similar for reads; the interesting issue is that unlike the usual write-to-pagecache we can have many chunks picked from one page and we'd rather avoid doing kmap_atomic/kunmap_atomic for each of those. I suspect that the right answer is, in addition to a primitive that does copying from iov_iter to have "copy from iov_iter and be ready to copy more from soon after" + "done copying"; for the "array of pages" the former would be allowed to leave the current page mapped, skipping kmap_atomic() on the next call. And the latter would unmap. of course. The caller is responsible for not blocking or doing unbalanced map/unmap until it's said "done copying". BTW, is there any reason why fuse/dev.c doesn't use atomic kmaps for everything? After all, as soon as we'd done kmap() in there, we grab a spinlock and don't drop it until just before kunmap(). With nothing by memcpy() done in between... Miklos? AFAICS, we only win from switching to kmap_atomic there - we can't block anyway, we don't need it to be visible on other CPUs and nesting isn't a problem. Looks like it'll be cheaper in highmem cases and do exactly the same thing as now for non-highmem... Comments? ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-02 19:21 ` Al Viro @ 2014-02-02 19:23 ` Al Viro -1 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-02 19:23 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Steve French, Sage Weil, Dave Kleikamp, Mark Fasheh, Miklos Szeredi, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel, Anton Altaparmakov On Sun, Feb 02, 2014 at 07:21:04PM +0000, Al Viro wrote: > BTW, is there any reason why fuse/dev.c doesn't use atomic kmaps for > everything? After all, as soon as we'd done kmap() in there, we > grab a spinlock and don't drop it until just before kunmap(). With > nothing by memcpy() done in between... Miklos? AFAICS, we only win s/by/but/ - sorry... > from switching to kmap_atomic there - we can't block anyway, we don't > need it to be visible on other CPUs and nesting isn't a problem. > Looks like it'll be cheaper in highmem cases and do exactly the same > thing as now for non-highmem... Comments? > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems @ 2014-02-02 19:23 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-02 19:23 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French, Dave Kleikamp, Anton Altaparmakov, Miklos Szeredi On Sun, Feb 02, 2014 at 07:21:04PM +0000, Al Viro wrote: > BTW, is there any reason why fuse/dev.c doesn't use atomic kmaps for > everything? After all, as soon as we'd done kmap() in there, we > grab a spinlock and don't drop it until just before kunmap(). With > nothing by memcpy() done in between... Miklos? AFAICS, we only win s/by/but/ - sorry... > from switching to kmap_atomic there - we can't block anyway, we don't > need it to be visible on other CPUs and nesting isn't a problem. > Looks like it'll be cheaper in highmem cases and do exactly the same > thing as now for non-highmem... Comments? > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-02 19:21 ` Al Viro @ 2014-02-03 14:41 ` Miklos Szeredi -1 siblings, 0 replies; 82+ messages in thread From: Miklos Szeredi @ 2014-02-03 14:41 UTC (permalink / raw) To: Al Viro Cc: Jens Axboe, Steve French, Sage Weil, Dave Kleikamp, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel, Linus Torvalds, Anton Altaparmakov On Sun, Feb 02, 2014 at 07:21:04PM +0000, Al Viro wrote: > On Sat, Feb 01, 2014 at 10:43:01PM +0000, Al Viro wrote: > > > * pipe_buffer_operations ->map()/->unmap() should die; let the caller do > > k{,un}map{,_atomic}(). All instances have the same method there and > > there's no point to make it different. PIPE_BUF_FLAG_ATOMIC should also > > go. > > BTW, another pile of code interesting in that respect (i.e. getting that > interface right) is fs/fuse/dev.c; I don't like the way it's playing > with get_user_pages_fast() there, It's trying to work around a deadlock by lock_page() recursion. I.e. fuse daemon is reading a request into a page for which the readpage is being serviced by said request (no, that's not something that happens accidentally). Do all archs set FAULT_FLAG_KILLABLE? If so, that flag along with code depending on the lack of it can go away and we can simply depend on page faults being interruptible by fatal signals. And that would simplify the fuse device I/O substantially. > and I doubt that sharing the code for > read and write side as it's done there makes much sense, but it's > definitely going to be a test for any API of that kind. It *does* > try to unify write-from-iovec with write-from-array-of-pages and > similar for reads; the interesting issue is that unlike the usual > write-to-pagecache we can have many chunks picked from one page and > we'd rather avoid doing kmap_atomic/kunmap_atomic for each of those. > > I suspect that the right answer is, in addition to a primitive that > does copying from iov_iter to have "copy from iov_iter and be ready > to copy more from soon after" + "done copying"; for the "array of > pages" the former would be allowed to leave the current page mapped, > skipping kmap_atomic() on the next call. And the latter would unmap. > of course. The caller is responsible for not blocking or doing > unbalanced map/unmap until it's said "done copying". > > BTW, is there any reason why fuse/dev.c doesn't use atomic kmaps for > everything? After all, as soon as we'd done kmap() in there, we > grab a spinlock and don't drop it until just before kunmap(). With > nothing by memcpy() done in between... Miklos? AFAICS, we only win > from switching to kmap_atomic there - we can't block anyway, we don't > need it to be visible on other CPUs and nesting isn't a problem. > Looks like it'll be cheaper in highmem cases and do exactly the same > thing as now for non-highmem... Comments? We don't hold the spinlock. But regardless, I don't see any reason why it couldn't be atomic kmap. Thanks, Miklos _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems @ 2014-02-03 14:41 ` Miklos Szeredi 0 siblings, 0 replies; 82+ messages in thread From: Miklos Szeredi @ 2014-02-03 14:41 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French, Dave Kleikamp, Anton Altaparmakov On Sun, Feb 02, 2014 at 07:21:04PM +0000, Al Viro wrote: > On Sat, Feb 01, 2014 at 10:43:01PM +0000, Al Viro wrote: > > > * pipe_buffer_operations ->map()/->unmap() should die; let the caller do > > k{,un}map{,_atomic}(). All instances have the same method there and > > there's no point to make it different. PIPE_BUF_FLAG_ATOMIC should also > > go. > > BTW, another pile of code interesting in that respect (i.e. getting that > interface right) is fs/fuse/dev.c; I don't like the way it's playing > with get_user_pages_fast() there, It's trying to work around a deadlock by lock_page() recursion. I.e. fuse daemon is reading a request into a page for which the readpage is being serviced by said request (no, that's not something that happens accidentally). Do all archs set FAULT_FLAG_KILLABLE? If so, that flag along with code depending on the lack of it can go away and we can simply depend on page faults being interruptible by fatal signals. And that would simplify the fuse device I/O substantially. > and I doubt that sharing the code for > read and write side as it's done there makes much sense, but it's > definitely going to be a test for any API of that kind. It *does* > try to unify write-from-iovec with write-from-array-of-pages and > similar for reads; the interesting issue is that unlike the usual > write-to-pagecache we can have many chunks picked from one page and > we'd rather avoid doing kmap_atomic/kunmap_atomic for each of those. > > I suspect that the right answer is, in addition to a primitive that > does copying from iov_iter to have "copy from iov_iter and be ready > to copy more from soon after" + "done copying"; for the "array of > pages" the former would be allowed to leave the current page mapped, > skipping kmap_atomic() on the next call. And the latter would unmap. > of course. The caller is responsible for not blocking or doing > unbalanced map/unmap until it's said "done copying". > > BTW, is there any reason why fuse/dev.c doesn't use atomic kmaps for > everything? After all, as soon as we'd done kmap() in there, we > grab a spinlock and don't drop it until just before kunmap(). With > nothing by memcpy() done in between... Miklos? AFAICS, we only win > from switching to kmap_atomic there - we can't block anyway, we don't > need it to be visible on other CPUs and nesting isn't a problem. > Looks like it'll be cheaper in highmem cases and do exactly the same > thing as now for non-highmem... Comments? We don't hold the spinlock. But regardless, I don't see any reason why it couldn't be atomic kmap. Thanks, Miklos ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-03 14:41 ` Miklos Szeredi @ 2014-02-03 15:33 ` Al Viro -1 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-03 15:33 UTC (permalink / raw) To: Miklos Szeredi Cc: Jens Axboe, Steve French, Sage Weil, Dave Kleikamp, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel, Linus Torvalds, Anton Altaparmakov On Mon, Feb 03, 2014 at 03:41:55PM +0100, Miklos Szeredi wrote: > > BTW, is there any reason why fuse/dev.c doesn't use atomic kmaps for > > everything? After all, as soon as we'd done kmap() in there, we > > grab a spinlock and don't drop it until just before kunmap(). With > > nothing by memcpy() done in between... Miklos? AFAICS, we only win > > from switching to kmap_atomic there - we can't block anyway, we don't > > need it to be visible on other CPUs and nesting isn't a problem. > > Looks like it'll be cheaper in highmem cases and do exactly the same > > thing as now for non-highmem... Comments? > > We don't hold the spinlock. But regardless, I don't see any reason why it > couldn't be atomic kmap. Oh, right - lock_request() drops it. Still, I don't see anything other than copying between map and unmap, and not a lot of it either... As for get_user_pages_fast()... Why not do what mm/filemap.c does to deal with the same issue? Prefault, then lock the destination page, kmap_atomic() and do __copy_from_user_inatomic(). If that fails (i.e. if something has raced with us and evicted the source from page table), shrug, unlock and repeat. I do realize that you want to share code between the read and write sides of the whole thing, but I'm not sure it's worth doing. Almost everything in that pile knows the direction - splitting a few low-level functions into ..._in() and ..._out() variants (mostly along the checks already in them) allows to separate these paths completely, at which point it becomes possible to use copy-page-to-iov_iter, etc. to take care of mapping, dealing with iovec components, etc. What I want to do is to get a sane set of iov_iter primitives that could be used for everything, without their users having to care about the nature of iov_iter - iovec, array of <page,offset,size,how_to_steal> quadruples, biovec, etc. The interesting part of it is how to make that set expressive enough, while keeping it reasonably sane. And fs/fuse/dev.c is one of the more interesting potential users out there... I've a growing queue with the beginning of that stuff; so far it's mostly preparatory bits and pieces. Currently being tested: copy_page_to_iter() (more or less similar to iov_iter_copy_to_..., but with saner interface and dealing with the kmap, atomics, etc. without forcing the callers do do that) with conversion of generic_file_aio_read() and friends to it. If it survives the local beating, I'll start pushing it out (as vfs.git#iov_iter); that pile is getting to potentially interesting bits... _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems @ 2014-02-03 15:33 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-03 15:33 UTC (permalink / raw) To: Miklos Szeredi Cc: Linus Torvalds, Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French, Dave Kleikamp, Anton Altaparmakov On Mon, Feb 03, 2014 at 03:41:55PM +0100, Miklos Szeredi wrote: > > BTW, is there any reason why fuse/dev.c doesn't use atomic kmaps for > > everything? After all, as soon as we'd done kmap() in there, we > > grab a spinlock and don't drop it until just before kunmap(). With > > nothing by memcpy() done in between... Miklos? AFAICS, we only win > > from switching to kmap_atomic there - we can't block anyway, we don't > > need it to be visible on other CPUs and nesting isn't a problem. > > Looks like it'll be cheaper in highmem cases and do exactly the same > > thing as now for non-highmem... Comments? > > We don't hold the spinlock. But regardless, I don't see any reason why it > couldn't be atomic kmap. Oh, right - lock_request() drops it. Still, I don't see anything other than copying between map and unmap, and not a lot of it either... As for get_user_pages_fast()... Why not do what mm/filemap.c does to deal with the same issue? Prefault, then lock the destination page, kmap_atomic() and do __copy_from_user_inatomic(). If that fails (i.e. if something has raced with us and evicted the source from page table), shrug, unlock and repeat. I do realize that you want to share code between the read and write sides of the whole thing, but I'm not sure it's worth doing. Almost everything in that pile knows the direction - splitting a few low-level functions into ..._in() and ..._out() variants (mostly along the checks already in them) allows to separate these paths completely, at which point it becomes possible to use copy-page-to-iov_iter, etc. to take care of mapping, dealing with iovec components, etc. What I want to do is to get a sane set of iov_iter primitives that could be used for everything, without their users having to care about the nature of iov_iter - iovec, array of <page,offset,size,how_to_steal> quadruples, biovec, etc. The interesting part of it is how to make that set expressive enough, while keeping it reasonably sane. And fs/fuse/dev.c is one of the more interesting potential users out there... I've a growing queue with the beginning of that stuff; so far it's mostly preparatory bits and pieces. Currently being tested: copy_page_to_iter() (more or less similar to iov_iter_copy_to_..., but with saner interface and dealing with the kmap, atomics, etc. without forcing the callers do do that) with conversion of generic_file_aio_read() and friends to it. If it survives the local beating, I'll start pushing it out (as vfs.git#iov_iter); that pile is getting to potentially interesting bits... ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-01 22:43 ` Al Viro @ 2014-02-02 23:16 ` Anton Altaparmakov -1 siblings, 0 replies; 82+ messages in thread From: Anton Altaparmakov @ 2014-02-02 23:16 UTC (permalink / raw) To: Al Viro Cc: Jens Axboe, Steve French, Sage Weil, Dave Kleikamp, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel, Linus Torvalds Hi Al, On 1 Feb 2014, at 22:43, Al Viro <viro@zeniv.linux.org.uk> wrote: > * ntfs_file_buffered_write() should switch to iov_iter as well. It's > open-coding a lot of iov_iter stuff. The NTFS code predates iov_iter by more than 10 years so it simply wasn't there to use when I wrote NTFS... But yes, NTFS should be updated to use it, I agree completely. > It's not entirely trivial and > I'd really like to hear from ntfs folks on that, though, and the current > code looks deadlock-prone. We prefault everything, then lock the pages > to which we'll be copying, then attempt to do __copy_from_user_inatomic(), > falling back to __copy_from_user() if that fails. Fine, but what happens > if the source of write() is mmaped from the same file and we lose CPU while > prefaulting the last page, have memory pressure evict the first one, then > have __ntfs_grab_cache_pages() allocate and lock (nonuptodate) pages > we'll be copying to and have __copy_from_user() try to copy *from* those > same pages? We are doing it while holding these pages locked, so pagefault > will have really fun time with them... Anton? That would deadlock. You are (of course) quite right. But the generic file write code used to suffer from the same deadlock problem and no-one ever complained much about it IIRC... In the current kernel the problem does not exist any more (due to the do the atomic copy with pagefaults disabled and then retrying with just the first segment and keeping retrying till it works) so indeed updating NTFS to use iov_iter would be a good opportunity to get that fixed in NTFS as well. It does not matter if for NTFS it turns out to be quite inefficient (due to locking/unlocking several pages at once) as it is such a crazy corner case that will hardly ever be triggered in real life especially so as the only time NTFS operates on more than one page at once in ntfs_file_buffered_write() is when a write happens into a sparse logical block (NTFS "cluster") and only on volumes with a logical block size > PAGE_CACHE_SIZE which with a minimum PAGE_CACHE_SIZE of 4096 bytes on Linux and the fact that Windows only ever creates volumes with a logical block size of 4096 (unless the user specifically forces a different block size when creating the volume) it means that it basically hardly ever happens. Also, the NTFS kernel driver never creates sparse files, i.e. the sparse file would have had to come from Windows or another NTFS implementation... I will have a look at moving NTFS to iov_iter and fixing the potential deadlock at the same time. Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge J.J. Thomson Avenue, Cambridge, CB3 0RB, UK _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems @ 2014-02-02 23:16 ` Anton Altaparmakov 0 siblings, 0 replies; 82+ messages in thread From: Anton Altaparmakov @ 2014-02-02 23:16 UTC (permalink / raw) To: Al Viro Cc: Jens Axboe, Steve French, Sage Weil, Dave Kleikamp, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel, Linus Torvalds Hi Al, On 1 Feb 2014, at 22:43, Al Viro <viro@zeniv.linux.org.uk> wrote: > * ntfs_file_buffered_write() should switch to iov_iter as well. It's > open-coding a lot of iov_iter stuff. The NTFS code predates iov_iter by more than 10 years so it simply wasn't there to use when I wrote NTFS... But yes, NTFS should be updated to use it, I agree completely. > It's not entirely trivial and > I'd really like to hear from ntfs folks on that, though, and the current > code looks deadlock-prone. We prefault everything, then lock the pages > to which we'll be copying, then attempt to do __copy_from_user_inatomic(), > falling back to __copy_from_user() if that fails. Fine, but what happens > if the source of write() is mmaped from the same file and we lose CPU while > prefaulting the last page, have memory pressure evict the first one, then > have __ntfs_grab_cache_pages() allocate and lock (nonuptodate) pages > we'll be copying to and have __copy_from_user() try to copy *from* those > same pages? We are doing it while holding these pages locked, so pagefault > will have really fun time with them... Anton? That would deadlock. You are (of course) quite right. But the generic file write code used to suffer from the same deadlock problem and no-one ever complained much about it IIRC... In the current kernel the problem does not exist any more (due to the do the atomic copy with pagefaults disabled and then retrying with just the first segment and keeping retrying till it works) so indeed updating NTFS to use iov_iter would be a good opportunity to get that fixed in NTFS as well. It does not matter if for NTFS it turns out to be quite inefficient (due to locking/unlocking several pages at once) as it is such a crazy corner case that will hardly ever be triggered in real life especially so as the only time NTFS operates on more than one page at once in ntfs_file_buffered_write() is when a write happens into a sparse logical block (NTFS "cluster") and only on volumes with a logical block size > PAGE_CACHE_SIZE which with a minimum PAGE_CACHE_SIZE of 4096 bytes on Linux and the fact that Windows only ever creates volumes with a logical block size of 4096 (unless the user specifically forces a different block size when creating the volume) it means that it basically hardly ever happens. Also, the NTFS kernel driver never creates sparse files, i.e. the sparse file woul d have had to come from Windows or another NTFS implementation... I will have a look at moving NTFS to iov_iter and fixing the potential deadlock at the same time. Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge J.J. Thomson Avenue, Cambridge, CB3 0RB, UK _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-01 22:43 ` Al Viro ` (3 preceding siblings ...) (?) @ 2014-02-03 15:12 ` Christoph Hellwig 2014-02-03 16:24 ` Al Viro 2014-02-03 16:50 ` Dave Kleikamp -1 siblings, 2 replies; 82+ messages in thread From: Christoph Hellwig @ 2014-02-03 15:12 UTC (permalink / raw) To: Al Viro Cc: Jens Axboe, Dave Kleikamp, Sage Weil, Christoph Hellwig, Mark Fasheh, xfs, Steve French, Joel Becker, linux-fsdevel, Linus Torvalds, Anton Altaparmakov On Sat, Feb 01, 2014 at 10:43:01PM +0000, Al Viro wrote: > * WTF bother passing 'pos' separately? It's the same mistake that was > made with ->aio_read/->aio_write and just as with those, *all* callers > provably have pos == iocb->ki_pos. I think this landed with the initial aio support, which planned for allowing AIO retries for a workqueue with a partially incremented pos. None of this ever got merged, probably because it was too ugly to live. > * We *definitely* want a variant structure with tag - unsigned long thing > was just plain insane. I see at least two variants - array of iovecs > and array of (at least) triples <page, offset, length>. Quite possibly - > quadruples, with "here's how to try to steal this page" thrown in, if > we want that as replacement for ->splice_write() as well (it looks like > the few instances that do steal on pipe-to-file splices could be dealt > with the same way as the dumb ones, provided that ->write_iter or whatever > we end up calling it is allowed to try and steal pages). Possibly more > variants on the read side of things... FWIW, I'm not sure that bio_vec > makes a lot of sense here. bio_vec just is one of the many page+offset+len containers we have, I guess Dave took it because loop uses it. We could either invent a new one here or finally have a common one for the different uses all over the kernel. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-03 15:12 ` Christoph Hellwig @ 2014-02-03 16:24 ` Al Viro 2014-02-03 16:50 ` Dave Kleikamp 1 sibling, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-03 16:24 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Dave Kleikamp, Sage Weil, Mark Fasheh, xfs, Steve French, Joel Becker, linux-fsdevel, Linus Torvalds, Anton Altaparmakov On Mon, Feb 03, 2014 at 07:12:42AM -0800, Christoph Hellwig wrote: > > we end up calling it is allowed to try and steal pages). Possibly more > > variants on the read side of things... FWIW, I'm not sure that bio_vec > > makes a lot of sense here. > > bio_vec just is one of the many page+offset+len containers we have, I > guess Dave took it because loop uses it. We could either invent a new > one here or finally have a common one for the different uses all over > the kernel. *blink* Good luck unifying all such uses. That would have to include the things needed by pipe_buffer, DMA-related bits for scatterlist, etc. I don't believe that it's feasible, or would've been a good idea in the first place... _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-03 15:12 ` Christoph Hellwig 2014-02-03 16:24 ` Al Viro @ 2014-02-03 16:50 ` Dave Kleikamp 1 sibling, 0 replies; 82+ messages in thread From: Dave Kleikamp @ 2014-02-03 16:50 UTC (permalink / raw) To: Christoph Hellwig, Al Viro Cc: Jens Axboe, Sage Weil, Mark Fasheh, xfs, Steve French, Kent Overstreet, Joel Becker, linux-fsdevel, Linus Torvalds, Anton Altaparmakov On 02/03/2014 09:12 AM, Christoph Hellwig wrote: > On Sat, Feb 01, 2014 at 10:43:01PM +0000, Al Viro wrote: >> * WTF bother passing 'pos' separately? It's the same mistake that was >> made with ->aio_read/->aio_write and just as with those, *all* callers >> provably have pos == iocb->ki_pos. > > I think this landed with the initial aio support, which planned for > allowing AIO retries for a workqueue with a partially incremented > pos. None of this ever got merged, probably because it was too ugly > to live. Yeah, when these patches were first written, AIO looked a lot different. >> * We *definitely* want a variant structure with tag - unsigned long thing >> was just plain insane. I see at least two variants - array of iovecs >> and array of (at least) triples <page, offset, length>. Quite possibly - >> quadruples, with "here's how to try to steal this page" thrown in, if >> we want that as replacement for ->splice_write() as well (it looks like >> the few instances that do steal on pipe-to-file splices could be dealt >> with the same way as the dumb ones, provided that ->write_iter or whatever >> we end up calling it is allowed to try and steal pages). Possibly more >> variants on the read side of things... FWIW, I'm not sure that bio_vec >> makes a lot of sense here. > > bio_vec just is one of the many page+offset+len containers we have, I > guess Dave took it because loop uses it. We could either invent a new > one here or finally have a common one for the different uses all over > the kernel. With Kent's immutable bio_vec changes, peeking inside the bio to get to the bio_vec is uglier than it was before, so there's no need to stick with that. Shaggy _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-01 22:43 ` Al Viro ` (4 preceding siblings ...) (?) @ 2014-02-03 16:23 ` Dave Kleikamp 2014-02-04 12:44 ` Al Viro -1 siblings, 1 reply; 82+ messages in thread From: Dave Kleikamp @ 2014-02-03 16:23 UTC (permalink / raw) To: Al Viro, Linus Torvalds Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Joel Becker, linux-fsdevel, Zach Brown, Anton Altaparmakov On 02/01/2014 04:43 PM, Al Viro wrote: > On Mon, Jan 20, 2014 at 12:32:01PM -0800, Linus Torvalds wrote: >> On Mon, Jan 20, 2014 at 5:55 AM, Christoph Hellwig <hch@infradead.org> wrote: >>> On Sun, Jan 19, 2014 at 05:13:35AM +0000, Al Viro wrote: >>>> Folks, what do you think about the following: >>> >>> That's very much what Shaggy did in the aio-direct tree, except that >>> it kept using a single set of methods. Linus really didn't like it >>> unfortunately. >> >> Umm. That wasn't what I objected to. >> >> I objected to the incredibly ugly implementation, the crazy new flags >> arguments for core helper functions, ugly naming etc etc. I even >> outlined what might fix it. >> >> In other words, I thought the code was shit and ugly. Not that >> iterators per se would be wrong. Just doing them badly is wrong. > > Gyahhh... OK, I should've known better than go looking into that thing > after such warning. Some relatively printable notes (i.e. about 10% > of the comments I really had about that) follow: Thanks for the feedback. I'd been asking for feedback on this patchset for some time now, and have not received very much. This is all based on some years-old work by Zach Brown that he probably wishes would have disappeared by now. I pretty much left what I could alone since 1) it was working, and 2) I didn't hear any objections (until now). It's clear now that the patchset isn't close to mergable, so treat it like a proof-of-concept and we can come up with a better container and read/write interface. I won't respond individually to your comments, but will take them all into consideration going forward. Thanks, Shaggy > * WTF bother passing 'pos' separately? It's the same mistake that was > made with ->aio_read/->aio_write and just as with those, *all* callers > provably have pos == iocb->ki_pos. > > * I'm not sure that iov_iter is a good fit here. OTOH, it probably could > be reused (it has damn few users right now and they are on the codepaths > involved into that thing). > > * We *definitely* want a variant structure with tag - unsigned long thing > was just plain insane. I see at least two variants - array of iovecs > and array of (at least) triples <page, offset, length>. Quite possibly - > quadruples, with "here's how to try to steal this page" thrown in, if > we want that as replacement for ->splice_write() as well (it looks like > the few instances that do steal on pipe-to-file splices could be dealt > with the same way as the dumb ones, provided that ->write_iter or whatever > we end up calling it is allowed to try and steal pages). Possibly more > variants on the read side of things... FWIW, I'm not sure that bio_vec > makes a lot of sense here. > > * direction (i.e. source vs. destination) also should be a part of that > tag. Which, BTW, turns ->direct_IO() into iocb x iov_iter -> int; > the situation with pos is identical to aio_read/aio_write. While we > are at it, KERNEL_WRITE thing (used only for __swap_writepage()) looks > very much like a special case of "array of <page,offset,size>" we want > for splice_write... > > * having looked through the ->read and ->write instances in drivers, > I'd say that surprisingly few helpers would go a _long_ way towards > converting those guys to the same methods. And we need such helpers > anyway - there's a whole lot of (badly) open-coded "copy the whole > user buffer into kmalloc'ed array and NUL-terminate the sucker" logics > in ->write() instances, for example. Even more "copy up to that much > into given array and NUL-terminate it", with rather amusing bugs in > there - e.g. NUL going into the end of array, regardless of the actual > amount of data to copy; junk is left in the middle, complete with > printk of the entire thing if it doesn't make sense. IOW, spewing > random piece of kernel stack into dmesg. Off-by-ones galore, too... > > BTW, speaking of bogosities - grep for generic_write_checks(). Exactly > one caller (in __generic_file_aio_write()) has any business looking > at S_ISBLK(inode->i_mode) - it can be called by blkdev_aio_write(). > All other callers have copied that, even though it makes absolutely > no sense for them... > > * file_readable/file_writable looks really wrong; if nothing else, I would > rather check that after open() and set a couple of FMODE bits, then check > those. Possibly even "knock FMODE_READ/FMODE_WRITE out if there's no > method"... > > * pipe_buffer_operations ->map()/->unmap() should die; let the caller do > k{,un}map{,_atomic}(). All instances have the same method there and > there's no point to make it different. PIPE_BUF_FLAG_ATOMIC should also > go. > > * WTF do iov_iter_copy_from_user_atomic() callers keep doing pagefault_disable > and pagefault_enable around it? The sucker starts with kmap_atomic() and > ends with kunmap_atomic(); all instances of those guys have pagefaul > disabling/enabling (and I suspect that it might make sense to lift it > out of arch-specific variants - rename them to e.g. __kmap_atomic(), > rip pagefault_disable() out of those and put kmap_atomic() into highmem.h > outside of ifdef, with !HIGHMEM side of ifdef having #define __kmap_atomic > page_address; move pagefault_enable() from __kunmap_atomic() to > kunmap_atomic() while we are at it). > > Note that e.g. pipe.c relies on kmap_atomic() disabling pagefaults - we > have __copy_from_user_inatomic() done there under kmap_atomic(), and we > really don't want to block in such conditions. > > * pipe_iov_copy_from_user() ought to return how much has it managed to > bring in, not 0 or -EFAULT as it does now. Then it won't need the > non-atomic side, AFAICS. Moreover, we'll just be able to use > iov_iter_copy_from_user_atomic() (which badly needs a more palatable > name, BTW). > > * sure, we want to call do_generic_file_read() once, passing the entire > iovec to file_read_actor(). But what the hell does that have to do with > introduction of new methods? It's a change that makes sense on its > own. Moreover, it's a damn good preparation to adding those - we > get generic_file_aio_read() into "set iov_iter up, then do <this>", > with <this> using iov_iter instead of iovec. Once we get to introducing > those methods, it's just a matter of taking the rest of generic_file_aio_read() > into a separate function and making that function an instance of new > method... > > * Unrelated to this patchset, but... may I politely inquire about the reasons > why ->is_partially_uptodate() gets read_descriptor_t? The whole point > of read_descriptor_t (if it has any) is that its interpretation is up to > whoever's doing the reading, _not_ what they are reading from. So desc->arg > is off-limits for any instance of ->is_partially_uptodate(). desc->written is > obviously pointless for them; the need (or lack thereof) to do something > to the page doesn't depend on how much have we already read from the > file. Moreover, reporting an error is rather dubious in such method; > if there's something fishy with the page, we'll just return "no" and let > ->readpage() complain. Which leaves only desc->count, which, unsurprisingly, > is the only thing looked at by (the only) instance of that method. And > "tell me if the part of this page that long starting from that offset is > up to date" is much more natural that "is what this read_descriptor_t would > have had us read from this offset in this page up to date?"... > > * do we really need separate non-atomic variant of iov_iter_copy_from_user()? > We have only two users right now (cifs and ceph) and both can use the > fault-in / atomic copy loop without much pain... > > * in addition to having too many methods, I'm not convinced that we want > them to be methods. Let's start with explicit checks in the primitives and > see where it goes from there; if we start to grow too many variants, > we can always introduce some methods, but then we'll be in better position > to decide what is and what is not a good method... > > * on the read side, I really don't believe that exposing atomic and > non-atomic variants is a good idea. Look at the existing users of > __copy_to_user_inatomic(); leaving aside the i915_gem weirdness, > all of them are used to implement the exact same thing: given a page, > offset and length, feed its contents to iovec/buffer/whatnot. Unlike > the write side of things, there's nothing between prefaulting pages > and actual attempts to copy. So let's make _that_ an exposed primitive > and let it deal with kmap/kmap_atomic/etc. Variants that don't have > to worry about blocking (vector of <page,offset,length>, etc.) would > simply not bother with non-atomic kmap, etc. Sure, it should take > iov_iter as destination. And deal with mapping the damn thing internally... > > * ntfs_file_buffered_write() should switch to iov_iter as well. It's > open-coding a lot of iov_iter stuff. It's not entirely trivial and > I'd really like to hear from ntfs folks on that, though, and the current > code looks deadlock-prone. We prefault everything, then lock the pages > to which we'll be copying, then attempt to do __copy_from_user_inatomic(), > falling back to __copy_from_user() if that fails. Fine, but what happens > if the source of write() is mmaped from the same file and we lose CPU while > prefaulting the last page, have memory pressure evict the first one, then > have __ntfs_grab_cache_pages() allocate and lock (nonuptodate) pages > we'll be copying to and have __copy_from_user() try to copy *from* those > same pages? We are doing it while holding these pages locked, so pagefault > will have really fun time with them... Anton? > > BTW, Linus, when did you comment on that patchset? I've found an iteration > of that patchset circa last October (v9, apparently the latest posted), > but it looks like your comments had either got lost or had been on the > earlier iteration of that thing... > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-03 16:23 ` Dave Kleikamp @ 2014-02-04 12:44 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-04 12:44 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Kent Overstreet, Dave Kleikamp, Joel Becker, linux-fsdevel, Zach Brown, Anton Altaparmakov On Mon, Feb 03, 2014 at 10:23:13AM -0600, Dave Kleikamp wrote: > Thanks for the feedback. I'd been asking for feedback on this patchset > for some time now, and have not received very much. > > This is all based on some years-old work by Zach Brown that he probably > wishes would have disappeared by now. I pretty much left what I could > alone since 1) it was working, and 2) I didn't hear any objections > (until now). > > It's clear now that the patchset isn't close to mergable, so treat it > like a proof-of-concept and we can come up with a better container and > read/write interface. I won't respond individually to your comments, but > will take them all into consideration going forward. FWIW, I suspect that the right way to deal with dio side of things would be a primitive along the lines of "get first N <page,start,len> for the iov_iter". With get_user_pages_fast() for iovec-backed ones and "just grab references" for array-of-page-subranges ones. _IF_ direct-io.c can be massaged to use that (and it looks like it should be able to - AFAICS, we don't really care if pages are mapped in userland or kernel space there), we get something really neat out of that: not only can we get rid of generic_file_splice_write(), but we get full zero-copy sendfile() - just have the target opened with O_DIRECT and everything will work; ->splice_read() will trigger reads to source pagecache and with that massage done, ->splice_write() will issue writes directly from those pages, with no memory-to-memory copying in sight... We can also get rid of that kmap() in __swap_writepage(), while we are at it. I'm going through direct-io.c guts right now and so far that looks feasible, but I'd really appreciate comments from the folks more familiar with the damn thing. The queue so far is in vfs.git#iov_iter; I've gone after the low-hanging fruits in the review I've posted upthread and I more or less like the results so far... Comments? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems @ 2014-02-04 12:44 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-04 12:44 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French, Anton Altaparmakov, Zach Brown, Kent Overstreet, Dave Kleikamp On Mon, Feb 03, 2014 at 10:23:13AM -0600, Dave Kleikamp wrote: > Thanks for the feedback. I'd been asking for feedback on this patchset > for some time now, and have not received very much. > > This is all based on some years-old work by Zach Brown that he probably > wishes would have disappeared by now. I pretty much left what I could > alone since 1) it was working, and 2) I didn't hear any objections > (until now). > > It's clear now that the patchset isn't close to mergable, so treat it > like a proof-of-concept and we can come up with a better container and > read/write interface. I won't respond individually to your comments, but > will take them all into consideration going forward. FWIW, I suspect that the right way to deal with dio side of things would be a primitive along the lines of "get first N <page,start,len> for the iov_iter". With get_user_pages_fast() for iovec-backed ones and "just grab references" for array-of-page-subranges ones. _IF_ direct-io.c can be massaged to use that (and it looks like it should be able to - AFAICS, we don't really care if pages are mapped in userland or kernel space there), we get something really neat out of that: not only can we get rid of generic_file_splice_write(), but we get full zero-copy sendfile() - just have the target opened with O_DIRECT and everything will work; ->splice_read() will trigger reads to source pagecache and with that massage done, ->splice_write() will issue writes directly from those pages, with no memory-to-memory copying in sight... We can also get rid of that kmap() in __swap_writepage(), while we are at it. I'm going through direct-io.c guts right now and so far that looks feasible, but I'd really appreciate comments from the folks more familiar with the damn thing. The queue so far is in vfs.git#iov_iter; I've gone after the low-hanging fruits in the review I've posted upthread and I more or less like the results so far... Comments? ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-04 12:44 ` Al Viro @ 2014-02-04 12:52 ` Kent Overstreet -1 siblings, 0 replies; 82+ messages in thread From: Kent Overstreet @ 2014-02-04 12:52 UTC (permalink / raw) To: Al Viro Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Dave Kleikamp, Joel Becker, linux-fsdevel, Zach Brown, Linus Torvalds, Anton Altaparmakov On Tue, Feb 04, 2014 at 12:44:09PM +0000, Al Viro wrote: > On Mon, Feb 03, 2014 at 10:23:13AM -0600, Dave Kleikamp wrote: > > > Thanks for the feedback. I'd been asking for feedback on this patchset > > for some time now, and have not received very much. > > > > This is all based on some years-old work by Zach Brown that he probably > > wishes would have disappeared by now. I pretty much left what I could > > alone since 1) it was working, and 2) I didn't hear any objections > > (until now). > > > > It's clear now that the patchset isn't close to mergable, so treat it > > like a proof-of-concept and we can come up with a better container and > > read/write interface. I won't respond individually to your comments, but > > will take them all into consideration going forward. > > FWIW, I suspect that the right way to deal with dio side of things would > be a primitive along the lines of "get first N <page,start,len> for the > iov_iter". With get_user_pages_fast() for iovec-backed ones and "just > grab references" for array-of-page-subranges ones. I'm on vacation in Switzerland, didn't bring my adderall, and direct-io.c makes my head hurt at the best of times, but - have a look at my in-progress dio rewrite: http://evilpiepirate.org/git/linux-bcache.git/commit/?h=block_stuff&id=ca09c20f08efd640f255fabd778de0dbf43ed1da Where I'm headed with things is to just start out by allocating bios and pinning pages into them, and _then_ doing all the fun "ask the filesystem where it goes and what to do with it" dance. The goal is to push the bios as far up the stack as possible. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems @ 2014-02-04 12:52 ` Kent Overstreet 0 siblings, 0 replies; 82+ messages in thread From: Kent Overstreet @ 2014-02-04 12:52 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French, Anton Altaparmakov, Zach Brown, Dave Kleikamp On Tue, Feb 04, 2014 at 12:44:09PM +0000, Al Viro wrote: > On Mon, Feb 03, 2014 at 10:23:13AM -0600, Dave Kleikamp wrote: > > > Thanks for the feedback. I'd been asking for feedback on this patchset > > for some time now, and have not received very much. > > > > This is all based on some years-old work by Zach Brown that he probably > > wishes would have disappeared by now. I pretty much left what I could > > alone since 1) it was working, and 2) I didn't hear any objections > > (until now). > > > > It's clear now that the patchset isn't close to mergable, so treat it > > like a proof-of-concept and we can come up with a better container and > > read/write interface. I won't respond individually to your comments, but > > will take them all into consideration going forward. > > FWIW, I suspect that the right way to deal with dio side of things would > be a primitive along the lines of "get first N <page,start,len> for the > iov_iter". With get_user_pages_fast() for iovec-backed ones and "just > grab references" for array-of-page-subranges ones. I'm on vacation in Switzerland, didn't bring my adderall, and direct-io.c makes my head hurt at the best of times, but - have a look at my in-progress dio rewrite: http://evilpiepirate.org/git/linux-bcache.git/commit/?h=block_stuff&id=ca09c20f08efd640f255fabd778de0dbf43ed1da Where I'm headed with things is to just start out by allocating bios and pinning pages into them, and _then_ doing all the fun "ask the filesystem where it goes and what to do with it" dance. The goal is to push the bios as far up the stack as possible. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-04 12:52 ` Kent Overstreet @ 2014-02-04 15:17 ` Al Viro -1 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-04 15:17 UTC (permalink / raw) To: Kent Overstreet Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Dave Kleikamp, Joel Becker, linux-fsdevel, Zach Brown, Linus Torvalds, Anton Altaparmakov On Tue, Feb 04, 2014 at 04:52:20AM -0800, Kent Overstreet wrote: > I'm on vacation in Switzerland, didn't bring my adderall, and > direct-io.c makes my head hurt at the best of times, but - have a look > at my in-progress dio rewrite: > > http://evilpiepirate.org/git/linux-bcache.git/commit/?h=block_stuff&id=ca09c20f08efd640f255fabd778de0dbf43ed1da > > Where I'm headed with things is to just start out by allocating bios and > pinning pages into them, and _then_ doing all the fun "ask the > filesystem where it goes and what to do with it" dance. The goal is to > push the bios as far up the stack as possible. How far would that be? E.g. for something like NFS it would be completely wrong. AFAICS, what you are doing there isn't incompatible with what I described; bio_get_user_pages() would just use that primitive (in fact, the loop in it is a damn good starting point for implementation of that primitive for iovec-based instances of iov_iter). I'm not too fond of the names, TBH - it might make sense to rename iov_iter to something like mem_stream; maybe even leave iov_iter as-is for whatever users might remain, but for now I don't see any that would fundamentally depend on the thing being iovev-backed... And bio_vec is a bad misnomer - it's not related to block subsystem at all. Sure, it had originated there, but... Hell knows; by now it's probably too much PITA to rename (we have about half a thousand instances in the tree). Pity, that... I definitely don't buy "bio is a natural object for carrying an array of pieces of pages"; not sure if that's what you implied in earlier thread, but it has too much baggage from block subsystem *and* it lacks the things we may want to associate with individual elements of such array (starting with "how can I steal that page?" method). I'm not sure if you'd been reading that thread back when it started; my interest in that thing is mostly because I want to get rid of duplication (and inconsistencies) between ->aio_write() and ->splice_write(). I hadn't been watching the threads around iov_iter last year; hch has pointed to those when I proposed to use an object that could carry both iovec and (possibly extended) analog of bio_vec and make generic_file_aio_write() et.al. agnostic wrt what's behind that object. Then we could use the same method to implement both ->aio_write() and ->splice_write() in a lot of cases. iov_iter is a good starting point for such object, and for now I'm mostly doing stuff that encapsulates the knowledge of its guts (including "there's an iovec behind it"). Those cleanups aside (and they make sense on their own, regardless of where the rest goes), it might make sense to add a copy of struct iov_iter that would have a tagged union in it (originally just for iovec, with IOVEC_READ/IOVEC_WRITE as possible tags) and switch a bunch of places that do not look into the guts of iov_iter to that thing. I'm not sure if there will be any other places left (so far it looks like we'll be able to get away with a reasonable set of primitives), but... we'll see. For now the whole thing is fairly experimental and it will almost certainly be reordered, etc. quite a few times. I'm trying to keep the part of queue in vfs.git#iov_iter more or less stable (with a lot of stuff in flux sitting in the local one), but it's not at the state where I'd recommend merges from it; there will be rebases, etc. BTW, folks, any suggestions about the name of that "memory stream" thing? struct iov_iter really implies iterator for iovec and more generic name would probably be better... struct mem_stream would probably do if nobody comes up with better variant, but it's long and somewhat clumsy... _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems @ 2014-02-04 15:17 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-04 15:17 UTC (permalink / raw) To: Kent Overstreet Cc: Linus Torvalds, Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French, Anton Altaparmakov, Zach Brown, Dave Kleikamp On Tue, Feb 04, 2014 at 04:52:20AM -0800, Kent Overstreet wrote: > I'm on vacation in Switzerland, didn't bring my adderall, and > direct-io.c makes my head hurt at the best of times, but - have a look > at my in-progress dio rewrite: > > http://evilpiepirate.org/git/linux-bcache.git/commit/?h=block_stuff&id=ca09c20f08efd640f255fabd778de0dbf43ed1da > > Where I'm headed with things is to just start out by allocating bios and > pinning pages into them, and _then_ doing all the fun "ask the > filesystem where it goes and what to do with it" dance. The goal is to > push the bios as far up the stack as possible. How far would that be? E.g. for something like NFS it would be completely wrong. AFAICS, what you are doing there isn't incompatible with what I described; bio_get_user_pages() would just use that primitive (in fact, the loop in it is a damn good starting point for implementation of that primitive for iovec-based instances of iov_iter). I'm not too fond of the names, TBH - it might make sense to rename iov_iter to something like mem_stream; maybe even leave iov_iter as-is for whatever users might remain, but for now I don't see any that would fundamentally depend on the thing being iovev-backed... And bio_vec is a bad misnomer - it's not related to block subsystem at all. Sure, it had originated there, but... Hell knows; by now it's probably too much PITA to rename (we have about half a thousand instances in the tree). Pity, that... I definitely don't buy "bio is a natural object for carrying an array of pieces of pages"; not sure if that's what you implied in earlier thread, but it has too much baggage from block subsystem *and* it lacks the things we may want to associate with individual elements of such array (starting with "how can I steal that page?" method). I'm not sure if you'd been reading that thread back when it started; my interest in that thing is mostly because I want to get rid of duplication (and inconsistencies) between ->aio_write() and ->splice_write(). I hadn't been watching the threads around iov_iter last year; hch has pointed to those when I proposed to use an object that could carry both iovec and (possibly extended) analog of bio_vec and make generic_file_aio_write() et.al. agnostic wrt what's behind that object. Then we could use the same method to implement both ->aio_write() and ->splice_write() in a lot of cases. iov_iter is a good starting point for such object, and for now I'm mostly doing stuff that encapsulates the knowledge of its guts (including "there's an iovec behind it"). Those cleanups aside (and they make sense on their own, regardless of where the rest goes), it might make sense to add a copy of struct iov_iter that would have a tagged union in it (originally just for iovec, with IOVEC_READ/IOVEC_WRITE as possible tags) and switch a bunch of places that do not look into the guts of iov_iter to that thing. I'm not sure if there will be any other places left (so far it looks like we'll be able to get away with a reasonable set of primitives), but... we'll see. For now the whole thing is fairly experimental and it will almost certainly be reordered, etc. quite a few times. I'm trying to keep the part of queue in vfs.git#iov_iter more or less stable (with a lot of stuff in flux sitting in the local one), but it's not at the state where I'd recommend merges from it; there will be rebases, etc. BTW, folks, any suggestions about the name of that "memory stream" thing? struct iov_iter really implies iterator for iovec and more generic name would probably be better... struct mem_stream would probably do if nobody comes up with better variant, but it's long and somewhat clumsy... ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-04 15:17 ` Al Viro (?) @ 2014-02-04 17:27 ` Zach Brown 2014-02-04 17:35 ` Kent Overstreet 2014-02-04 18:00 ` Al Viro -1 siblings, 2 replies; 82+ messages in thread From: Zach Brown @ 2014-02-04 17:27 UTC (permalink / raw) To: Al Viro Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Kent Overstreet, Dave Kleikamp, Joel Becker, linux-fsdevel, Linus Torvalds, Anton Altaparmakov > I definitely don't buy "bio is a natural object for carrying an array > of pieces of pages"; not sure if that's what you implied in earlier > thread, but it has too much baggage from block subsystem *and* it lacks > the things we may want to associate with individual elements of such > array (starting with "how can I steal that page?" method). I think Kent is talking about what happens after the user addresses are consumed. Turning dio into more of a bio mapping and redirection engine would use more of the bio machinery instead of the bits that dio has implemented itself with state in struct dio that hangs off the bios. I imagine it'd still make sense to clean up the addresses/pages arguments that feed that engine. (And give another entry point that already has bios for callers like loop, etc.) > BTW, folks, any suggestions about the name of that "memory stream" thing? > struct iov_iter really implies iterator for iovec and more generic name > would probably be better... struct mem_stream would probably do if nobody > comes up with better variant, but it's long and somewhat clumsy... I don't like 'stream'. To me that sounds more strictly advancing than I think this'd be capable of. Maybe something dirt simple like 'mem_vec'? With 'mvec_' call prefixes? Or kiobuf! *runs* - z _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-04 17:27 ` Zach Brown @ 2014-02-04 17:35 ` Kent Overstreet 2014-02-04 18:08 ` Al Viro 2014-02-04 18:00 ` Al Viro 1 sibling, 1 reply; 82+ messages in thread From: Kent Overstreet @ 2014-02-04 17:35 UTC (permalink / raw) To: Zach Brown Cc: Jens Axboe, Christoph Hellwig, Sage Weil, Mark Fasheh, xfs, Steve French, Dave Kleikamp, Al Viro, linux-fsdevel, Linus Torvalds, Anton Altaparmakov, Joel Becker [-- Attachment #1.1: Type: text/plain, Size: 1819 bytes --] On Feb 4, 2014 6:27 PM, "Zach Brown" <zab@redhat.com> wrote: > > > I definitely don't buy "bio is a natural object for carrying an array > > of pieces of pages"; not sure if that's what you implied in earlier > > thread, but it has too much baggage from block subsystem *and* it lacks > > the things we may want to associate with individual elements of such > > array (starting with "how can I steal that page?" method). > > I think Kent is talking about what happens after the user addresses are > consumed. Turning dio into more of a bio mapping and redirection engine > would use more of the bio machinery instead of the bits that dio has > implemented itself with state in struct dio that hangs off the bios. I > imagine it'd still make sense to clean up the addresses/pages arguments > that feed that engine. (And give another entry point that already has > bios for callers like loop, etc.) Yeah, precisely. I think we can push the point at which pages are pinned at least a fair but higher than it is now, and if we can do that we definitely should be working with a generic "bag of pinned pages" struct - and much better to use struct bio than add another one. Bios may not be perfect but at least some of the block layer specific cruft can be gotten rid of (and is on my todo list) > > > BTW, folks, any suggestions about the name of that "memory stream" thing? > > struct iov_iter really implies iterator for iovec and more generic name > > would probably be better... struct mem_stream would probably do if nobody > > comes up with better variant, but it's long and somewhat clumsy... > > I don't like 'stream'. To me that sounds more strictly advancing than I > think this'd be capable of. Maybe something dirt simple like 'mem_vec'? > With 'mvec_' call prefixes? > > Or kiobuf! *runs* *cuts you* [-- Attachment #1.2: Type: text/html, Size: 2258 bytes --] [-- Attachment #2: Type: text/plain, Size: 121 bytes --] _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-04 17:35 ` Kent Overstreet @ 2014-02-04 18:08 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-04 18:08 UTC (permalink / raw) To: Kent Overstreet Cc: Jens Axboe, Christoph Hellwig, Sage Weil, Zach Brown, Mark Fasheh, xfs, Steve French, Dave Kleikamp, Joel Becker, linux-fsdevel, Linus Torvalds, Anton Altaparmakov On Tue, Feb 04, 2014 at 09:35:06AM -0800, Kent Overstreet wrote: > > I think Kent is talking about what happens after the user addresses are > > consumed. Turning dio into more of a bio mapping and redirection engine > > would use more of the bio machinery instead of the bits that dio has > > implemented itself with state in struct dio that hangs off the bios. I > > imagine it'd still make sense to clean up the addresses/pages arguments > > that feed that engine. (And give another entry point that already has > > bios for callers like loop, etc.) > > Yeah, precisely. I think we can push the point at which pages are pinned at > least a fair but higher than it is now, and if we can do that we definitely > should be working with a generic "bag of pinned pages" struct - and much > better to use struct bio than add another one. Bios may not be perfect but > at least some of the block layer specific cruft can be gotten rid of (and > is on my todo list) How far up? I've no problem with having that done in ->direct_IO() (especially if it would take this mem_vec/mem_stream/whatever and keep the code doing actual pinning and building an array of pages outside of direct-io.c, allowing it to do different things for iovec-backed and page-array-backed variants), but I really don't like the idea of lifting that to callers of ->direct_IO(), let alone past them. If nothing else, we do *not* want to pin the entire write request, so lifting that to e.g. generic_file_aio_write() (or, worse, its callers) wouldn't make sense. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems @ 2014-02-04 18:08 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-04 18:08 UTC (permalink / raw) To: Kent Overstreet Cc: Zach Brown, Dave Kleikamp, Steve French, Sage Weil, Linus Torvalds, Jens Axboe, Anton Altaparmakov, Mark Fasheh, linux-fsdevel, xfs, Joel Becker, Christoph Hellwig On Tue, Feb 04, 2014 at 09:35:06AM -0800, Kent Overstreet wrote: > > I think Kent is talking about what happens after the user addresses are > > consumed. Turning dio into more of a bio mapping and redirection engine > > would use more of the bio machinery instead of the bits that dio has > > implemented itself with state in struct dio that hangs off the bios. I > > imagine it'd still make sense to clean up the addresses/pages arguments > > that feed that engine. (And give another entry point that already has > > bios for callers like loop, etc.) > > Yeah, precisely. I think we can push the point at which pages are pinned at > least a fair but higher than it is now, and if we can do that we definitely > should be working with a generic "bag of pinned pages" struct - and much > better to use struct bio than add another one. Bios may not be perfect but > at least some of the block layer specific cruft can be gotten rid of (and > is on my todo list) How far up? I've no problem with having that done in ->direct_IO() (especially if it would take this mem_vec/mem_stream/whatever and keep the code doing actual pinning and building an array of pages outside of direct-io.c, allowing it to do different things for iovec-backed and page-array-backed variants), but I really don't like the idea of lifting that to callers of ->direct_IO(), let alone past them. If nothing else, we do *not* want to pin the entire write request, so lifting that to e.g. generic_file_aio_write() (or, worse, its callers) wouldn't make sense. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-04 17:27 ` Zach Brown @ 2014-02-04 18:00 ` Al Viro 2014-02-04 18:00 ` Al Viro 1 sibling, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-04 18:00 UTC (permalink / raw) To: Zach Brown Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Kent Overstreet, Dave Kleikamp, Joel Becker, linux-fsdevel, Linus Torvalds, Anton Altaparmakov On Tue, Feb 04, 2014 at 09:27:23AM -0800, Zach Brown wrote: > I think Kent is talking about what happens after the user addresses are > consumed. Turning dio into more of a bio mapping and redirection engine > would use more of the bio machinery instead of the bits that dio has > implemented itself with state in struct dio that hangs off the bios. I > imagine it'd still make sense to clean up the addresses/pages arguments > that feed that engine. (And give another entry point that already has > bios for callers like loop, etc.) > > > BTW, folks, any suggestions about the name of that "memory stream" thing? > > struct iov_iter really implies iterator for iovec and more generic name > > would probably be better... struct mem_stream would probably do if nobody > > comes up with better variant, but it's long and somewhat clumsy... > > I don't like 'stream'. To me that sounds more strictly advancing than I > think this'd be capable of. Maybe something dirt simple like 'mem_vec'? > With 'mvec_' call prefixes? Umm... Frankly, I would rather discourage attempts to read the same data twice, if only on the naming level... Case in point: commit 1c1c87 (btrfs: sanitize BTRFS_IOC_FILE_EXTENT_SAME). I really wonder how many places have similar holes. What used to happen was this: we have a userland structure, with a variable-sized array hanging off its arse. The size of array is determined by the field in fixed-sized header. We copy the header in, decide what size the whole thing should have, and do memdup_user() to bring everything in. Very convenient, since at that point we have a pointer to that struct-with-array in the kernel space. Attacker manages to increase the 'desc_count' field between two copy_from_user()... and the sucker proceeds to loop over the array in kernel-side copy, using the ->desc_count of that copy as the upper limit of the loop. Oops - in the best case, that is. Double reads really ought to raise red flags on review. I'm not saying that they should be hard to do (after all, the fix in that commit *does* read the same thing twice), but it's better if they are not used without thinking. And no, I'm not suggesting to make ioctls use iov_iter/whatnot - it's just an example of the class of bugs. I wouldn't be surprised to find ->write() instances in drivers suffering the same problem... _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems @ 2014-02-04 18:00 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-04 18:00 UTC (permalink / raw) To: Zach Brown Cc: Kent Overstreet, Linus Torvalds, Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French, Anton Altaparmakov, Dave Kleikamp On Tue, Feb 04, 2014 at 09:27:23AM -0800, Zach Brown wrote: > I think Kent is talking about what happens after the user addresses are > consumed. Turning dio into more of a bio mapping and redirection engine > would use more of the bio machinery instead of the bits that dio has > implemented itself with state in struct dio that hangs off the bios. I > imagine it'd still make sense to clean up the addresses/pages arguments > that feed that engine. (And give another entry point that already has > bios for callers like loop, etc.) > > > BTW, folks, any suggestions about the name of that "memory stream" thing? > > struct iov_iter really implies iterator for iovec and more generic name > > would probably be better... struct mem_stream would probably do if nobody > > comes up with better variant, but it's long and somewhat clumsy... > > I don't like 'stream'. To me that sounds more strictly advancing than I > think this'd be capable of. Maybe something dirt simple like 'mem_vec'? > With 'mvec_' call prefixes? Umm... Frankly, I would rather discourage attempts to read the same data twice, if only on the naming level... Case in point: commit 1c1c87 (btrfs: sanitize BTRFS_IOC_FILE_EXTENT_SAME). I really wonder how many places have similar holes. What used to happen was this: we have a userland structure, with a variable-sized array hanging off its arse. The size of array is determined by the field in fixed-sized header. We copy the header in, decide what size the whole thing should have, and do memdup_user() to bring everything in. Very convenient, since at that point we have a pointer to that struct-with-array in the kernel space. Attacker manages to increase the 'desc_count' field between two copy_from_user()... and the sucker proceeds to loop over the array in kernel-side copy, using the ->desc_count of that copy as the upper limit of the loop. Oops - in the best case, that is. Double reads really ought to raise red flags on review. I'm not saying that they should be hard to do (after all, the fix in that commit *does* read the same thing twice), but it's better if they are not used without thinking. And no, I'm not suggesting to make ioctls use iov_iter/whatnot - it's just an example of the class of bugs. I wouldn't be surprised to find ->write() instances in drivers suffering the same problem... ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-04 18:00 ` Al Viro (?) @ 2014-02-04 18:33 ` Zach Brown 2014-02-04 18:36 ` Al Viro -1 siblings, 1 reply; 82+ messages in thread From: Zach Brown @ 2014-02-04 18:33 UTC (permalink / raw) To: Al Viro Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Kent Overstreet, Dave Kleikamp, Joel Becker, linux-fsdevel, Linus Torvalds, Anton Altaparmakov > > > BTW, folks, any suggestions about the name of that "memory stream" thing? > > > struct iov_iter really implies iterator for iovec and more generic name > > > would probably be better... struct mem_stream would probably do if nobody > > > comes up with better variant, but it's long and somewhat clumsy... > > > > I don't like 'stream'. To me that sounds more strictly advancing than I > > think this'd be capable of. Maybe something dirt simple like 'mem_vec'? > > With 'mvec_' call prefixes? > > Umm... Frankly, I would rather discourage attempts to read the same data > twice, if only on the naming level... Ahh, OK, sure. mem_iter? - z _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-04 18:33 ` Zach Brown @ 2014-02-04 18:36 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-04 18:36 UTC (permalink / raw) To: Zach Brown Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Kent Overstreet, Dave Kleikamp, Joel Becker, linux-fsdevel, Linus Torvalds, Anton Altaparmakov On Tue, Feb 04, 2014 at 10:33:56AM -0800, Zach Brown wrote: > > > > BTW, folks, any suggestions about the name of that "memory stream" thing? > > > > struct iov_iter really implies iterator for iovec and more generic name > > > > would probably be better... struct mem_stream would probably do if nobody > > > > comes up with better variant, but it's long and somewhat clumsy... > > > > > > I don't like 'stream'. To me that sounds more strictly advancing than I > > > think this'd be capable of. Maybe something dirt simple like 'mem_vec'? > > > With 'mvec_' call prefixes? > > > > Umm... Frankly, I would rather discourage attempts to read the same data > > twice, if only on the naming level... > > Ahh, OK, sure. mem_iter? Works for me... Any other suggestions/objections/etc.? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems @ 2014-02-04 18:36 ` Al Viro 0 siblings, 0 replies; 82+ messages in thread From: Al Viro @ 2014-02-04 18:36 UTC (permalink / raw) To: Zach Brown Cc: Kent Overstreet, Linus Torvalds, Christoph Hellwig, Jens Axboe, Mark Fasheh, Joel Becker, linux-fsdevel, xfs, Sage Weil, Steve French, Anton Altaparmakov, Dave Kleikamp On Tue, Feb 04, 2014 at 10:33:56AM -0800, Zach Brown wrote: > > > > BTW, folks, any suggestions about the name of that "memory stream" thing? > > > > struct iov_iter really implies iterator for iovec and more generic name > > > > would probably be better... struct mem_stream would probably do if nobody > > > > comes up with better variant, but it's long and somewhat clumsy... > > > > > > I don't like 'stream'. To me that sounds more strictly advancing than I > > > think this'd be capable of. Maybe something dirt simple like 'mem_vec'? > > > With 'mvec_' call prefixes? > > > > Umm... Frankly, I would rather discourage attempts to read the same data > > twice, if only on the naming level... > > Ahh, OK, sure. mem_iter? Works for me... Any other suggestions/objections/etc.? ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-04 18:36 ` Al Viro (?) @ 2014-02-05 19:58 ` Al Viro 2014-02-05 20:42 ` Zach Brown 2014-02-06 9:08 ` Kent Overstreet -1 siblings, 2 replies; 82+ messages in thread From: Al Viro @ 2014-02-05 19:58 UTC (permalink / raw) To: Zach Brown Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Kent Overstreet, Dave Kleikamp, Joel Becker, linux-fsdevel, Linus Torvalds, Anton Altaparmakov BTW, why do we still have generic_segment_checks()? AFAICS, *all* paths leading to any ->aio_read/->aio_write instances are either 1) with KERNEL_DS (and base/len are verifiably sane in those cases), or 2) have iovec come from successful {compat,}rw_copy_check_uvector() and through rw_verify_area(), or 3) have single-element iovec with access_ok()/rw_verify_area() checked directly, or 4) have single-element iovec with base/len unchanged from what had been passed to some ->read() or ->write() instance, in which case the caller of that ->read() or ->write() has done access_ok/rw_verify_area And yes, I can prove that for the current tree, modulo a couple of dumb bugs with unchecked values coming via read_code(). Which is called a couple of times per a.out execve() and should be using vfs_read() instead of blindly calling ->read() - it's *not* a hot path and never had been one. With that fixed, we have the following: and call of any instance of ->read()/->write()/->aio_read()/->aio_write() (be it direct or via method) is guaranteed that * all segments it's asked to read/write will satisfy access_ok(). * all segments it's asked to read/write will have non-negative lengths. * total size of all segments will be at most MAX_RW_COUNT. * file offset won't go from negative to zero in the combined area; unless the file has FMODE_UNSIGNED_OFFSET in ->f_mode, it won't go from positive to negative either. So what exactly does generic_segments_check() give us? Is it just that everybody went "well, maybe there's some weird path where we don't do validation; let's leave it there"? Linus? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-05 19:58 ` Al Viro @ 2014-02-05 20:42 ` Zach Brown 2014-02-06 9:08 ` Kent Overstreet 1 sibling, 0 replies; 82+ messages in thread From: Zach Brown @ 2014-02-05 20:42 UTC (permalink / raw) To: Al Viro Cc: Jens Axboe, Steve French, Sage Weil, Mark Fasheh, xfs, Christoph Hellwig, Kent Overstreet, Dave Kleikamp, Joel Becker, linux-fsdevel, Linus Torvalds, Anton Altaparmakov > So what exactly does generic_segments_check() give us? Is it just that > everybody went "well, maybe there's some weird path where we don't do > validation; let's leave it there"? Linus? Sounds likely. I'd bet that the btrfs call was just copied from __generic_file_aio_write() when btrfs grew its own .aio_write method: 11c65dccf70be9ace5dbd3906778e1a099b1fee1 Author: Josef Bacik <josef@redhat.com> Date: Sun May 23 11:07:21 2010 -0400 Btrfs: do aio_write instead of write It probably amounts to no more than ocount = iov_length(). - z _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [RFC] unifying write variants for filesystems 2014-02-05 19:58 ` Al Viro 2014-02-05 20:42 ` Zach Brown @ 2014-02-06 9:08 ` Kent Overstreet 1 sibling, 0 replies; 82+ messages in thread From: Kent Overstreet @ 2014-02-06 9:08 UTC (permalink / raw) To: Al Viro Cc: Jens Axboe, Steve French, Sage Weil, Zach Brown, Mark Fasheh, xfs, Christoph Hellwig, Dave Kleikamp, Joel Becker, linux-fsdevel, Linus Torvalds, Anton Altaparmakov On Wed, Feb 05, 2014 at 07:58:38PM +0000, Al Viro wrote: > BTW, why do we still have generic_segment_checks()? > AFAICS, *all* paths leading to any ->aio_read/->aio_write > instances are either > 1) with KERNEL_DS (and base/len are verifiably sane in those > cases), or > 2) have iovec come from successful {compat,}rw_copy_check_uvector() > and through rw_verify_area(), or > 3) have single-element iovec with access_ok()/rw_verify_area() > checked directly, or > 4) have single-element iovec with base/len unchanged from > what had been passed to some ->read() or ->write() instance, in which > case the caller of that ->read() or ->write() has done access_ok/rw_verify_area > > And yes, I can prove that for the current tree, modulo a couple of dumb > bugs with unchecked values coming via read_code(). Which is called > a couple of times per a.out execve() and should be using vfs_read() instead > of blindly calling ->read() - it's *not* a hot path and never had been one. > With that fixed, we have the following: and call of any instance of > ->read()/->write()/->aio_read()/->aio_write() (be it direct or via method) > is guaranteed that > * all segments it's asked to read/write will satisfy access_ok(). > * all segments it's asked to read/write will have non-negative > lengths. > * total size of all segments will be at most MAX_RW_COUNT. > * file offset won't go from negative to zero in the combined area; > unless the file has FMODE_UNSIGNED_OFFSET in ->f_mode, it won't go from > positive to negative either. > > So what exactly does generic_segments_check() give us? Is it just that > everybody went "well, maybe there's some weird path where we don't do > validation; let's leave it there"? Linus? I came to the same conclusion awhile ago - I'm pretty sure it can be safely dropped (I think I even have such a patch in one of my branches...) Anyways, copy_check_uvector() is the correct place for all that stuff anyways - it's taking a __user type and producing a type without the __user attribute, so if there was any validation missing there that's where it should go. I vaguelly recall converting some SCSI related code to use copy_check_uvector() instead of its own (open coded?) thing, if that patch made it upstream that could've been a place that at one point in time did need the generic_segment_checks() call. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 82+ messages in thread
end of thread, other threads:[~2014-02-07 17:10 UTC | newest] Thread overview: 82+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-12-12 18:14 [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig 2013-12-12 18:14 ` Christoph Hellwig 2013-12-12 18:15 ` [PATCH 1/5] splice: move balance_dirty_pages_ratelimited into pipe_to_file Christoph Hellwig 2013-12-12 18:15 ` Christoph Hellwig 2013-12-12 18:15 ` [PATCH 2/5] splice: nest i_mutex outside pipe_lock Christoph Hellwig 2013-12-12 18:15 ` Christoph Hellwig 2013-12-12 18:15 ` [PATCH 3/5] splice: use splice_from_pipe in generic_file_splice_write Christoph Hellwig 2013-12-12 18:15 ` Christoph Hellwig 2013-12-12 18:15 ` [PATCH 4/5] xfs: fix splice_write locking Christoph Hellwig 2013-12-12 18:15 ` Christoph Hellwig 2013-12-12 18:15 ` [PATCH 5/5] splice: stop exporting splice_from_pipe implementation details Christoph Hellwig 2013-12-12 18:15 ` Christoph Hellwig 2014-01-13 14:14 ` [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig 2014-01-13 14:14 ` Christoph Hellwig 2014-01-13 23:56 ` Al Viro 2014-01-13 23:56 ` Al Viro 2014-01-14 13:22 ` Christoph Hellwig 2014-01-14 13:22 ` Christoph Hellwig 2014-01-14 17:20 ` Al Viro 2014-01-14 17:20 ` Al Viro 2014-01-15 18:10 ` Al Viro 2014-01-15 18:10 ` Al Viro 2014-01-18 6:40 ` Al Viro 2014-01-18 7:22 ` Linus Torvalds 2014-01-18 7:22 ` Linus Torvalds 2014-01-18 7:46 ` Al Viro 2014-01-18 7:56 ` Al Viro 2014-01-18 7:56 ` Al Viro 2014-01-18 8:27 ` Al Viro 2014-01-18 8:44 ` David Miller 2014-01-18 8:44 ` David Miller 2014-02-07 17:10 ` Al Viro 2014-02-07 17:10 ` Al Viro 2014-01-18 19:59 ` Linus Torvalds 2014-01-18 20:10 ` Al Viro 2014-01-18 20:27 ` Al Viro 2014-01-18 20:27 ` Al Viro 2014-01-18 20:30 ` Al Viro 2014-01-18 20:30 ` Al Viro 2014-01-19 5:13 ` [RFC] unifying write variants for filesystems Al Viro 2014-01-19 5:13 ` Al Viro 2014-01-20 13:55 ` Christoph Hellwig 2014-01-20 13:55 ` Christoph Hellwig 2014-01-20 20:32 ` Linus Torvalds 2014-01-20 20:32 ` Linus Torvalds 2014-02-01 22:43 ` Al Viro 2014-02-01 22:43 ` Al Viro 2014-02-02 0:13 ` Linus Torvalds 2014-02-02 2:02 ` Al Viro 2014-02-02 2:02 ` Al Viro 2014-02-02 19:21 ` Al Viro 2014-02-02 19:21 ` Al Viro 2014-02-02 19:23 ` Al Viro 2014-02-02 19:23 ` Al Viro 2014-02-03 14:41 ` Miklos Szeredi 2014-02-03 14:41 ` Miklos Szeredi 2014-02-03 15:33 ` Al Viro 2014-02-03 15:33 ` Al Viro 2014-02-02 23:16 ` Anton Altaparmakov 2014-02-02 23:16 ` Anton Altaparmakov 2014-02-03 15:12 ` Christoph Hellwig 2014-02-03 16:24 ` Al Viro 2014-02-03 16:50 ` Dave Kleikamp 2014-02-03 16:23 ` Dave Kleikamp 2014-02-04 12:44 ` Al Viro 2014-02-04 12:44 ` Al Viro 2014-02-04 12:52 ` Kent Overstreet 2014-02-04 12:52 ` Kent Overstreet 2014-02-04 15:17 ` Al Viro 2014-02-04 15:17 ` Al Viro 2014-02-04 17:27 ` Zach Brown 2014-02-04 17:35 ` Kent Overstreet 2014-02-04 18:08 ` Al Viro 2014-02-04 18:08 ` Al Viro 2014-02-04 18:00 ` Al Viro 2014-02-04 18:00 ` Al Viro 2014-02-04 18:33 ` Zach Brown 2014-02-04 18:36 ` Al Viro 2014-02-04 18:36 ` Al Viro 2014-02-05 19:58 ` Al Viro 2014-02-05 20:42 ` Zach Brown 2014-02-06 9:08 ` Kent Overstreet
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.