From: Dave Chinner <david@fromorbit.com>
To: Jeremy Allison <jra@samba.org>
Cc: Steve French <smfrench@gmail.com>,
	Jeff Layton <jlayton@samba.org>,
	linux-cifs@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: Recvfile patch used for Samba.
Date: Tue, 23 Jul 2013 17:10:27 +1000
Message-ID: <20130723071027.GJ19986@dastard>
In-Reply-To: <20130722215738.GB20647@samba2>

On Mon, Jul 22, 2013 at 02:57:38PM -0700, Jeremy Allison wrote:
> Hi Steve and Jeff (and others).
> 
> Here is a patch that Samba vendors have been using
> to implement recvfile (copy directly from socket
> to file). It can improve write performance on boxes
> by a significant amount (10% or more).

Is that all? It's just moving the memcpy() into the kernel?

> I'm not qualified to evaluate this code, can someone
> who is (hi there Steve and Jeff :-) take a look at
> this and see if it's work shepherding into the kernel ?

It's pretty nasty.

> 
> Cheers,
> 
> 	Jeremy.

> diff -urp linux-2.6.37-rc5.orig/fs/splice.c linux-2.6.37-rc5/fs/splice.c
> --- linux-2.6.37-rc5.orig/fs/splice.c	2010-12-06 20:09:04.000000000 -0800
> +++ linux-2.6.37-rc5/fs/splice.c	2010-12-07 16:16:48.000000000 -0800
> @@ -31,6 +31,7 @@
>  #include <linux/uio.h>
>  #include <linux/security.h>
>  #include <linux/gfp.h>
> +#include <net/sock.h>
>  
>  /*
>   * Attempt to steal a page from a pipe buffer. This should perhaps go into
> @@ -1387,6 +1388,141 @@ static long do_splice(struct file *in, l
>  	return -EINVAL;
>  }
>  
> +static ssize_t do_splice_from_socket(struct file *file, struct socket *sock,
> +				     loff_t __user *ppos, size_t count)
> +{		
> +	struct address_space *mapping = file->f_mapping;
> +	struct inode	*inode = mapping->host;
> +	loff_t pos;
> +	int count_tmp;
> +	int err = 0;
> +	int cPagePtr = 0;		
> +	int cPagesAllocated = 0;

camel case.

> +	struct recvfile_ctl_blk rv_cb[MAX_PAGES_PER_RECVFILE];

MAX_PAGES_PER_RECVFILE = 32, and the structure is 32 bytes on
64-bit, so that's 1k of stack usage right there. Bad.

> +	struct kvec iov[MAX_PAGES_PER_RECVFILE];

And that's another 512 bytes of stack. Way too much.

> +	struct msghdr msg;

Another ~52 bytes of stack.
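
(An untested sketch of keeping those buffers off the kernel stack,
reusing the patch's own names - not a complete fix, just the shape:)

	struct recvfile_ctl_blk *rv_cb;
	struct kvec *iov;

	/* ~1.5k in total, so allocate it rather than burning stack */
	rv_cb = kcalloc(MAX_PAGES_PER_RECVFILE, sizeof(*rv_cb), GFP_KERNEL);
	iov = kcalloc(MAX_PAGES_PER_RECVFILE, sizeof(*iov), GFP_KERNEL);
	if (!rv_cb || !iov) {
		kfree(rv_cb);
		kfree(iov);
		return -ENOMEM;
	}

	/* ... do the receive ... */

	kfree(iov);
	kfree(rv_cb);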

> +	long rcvtimeo;
> +	int ret;
> +
> +	if(copy_from_user(&pos, ppos, sizeof(loff_t)))
> +		return -EFAULT;
> +
> +	if(count > MAX_PAGES_PER_RECVFILE * PAGE_SIZE) {
> +		printk("%s: count(%u) exceeds maxinum\n", __func__, count);
> +		return -EINVAL;
> +	}    
> +	mutex_lock(&inode->i_mutex);

Locking on write needs to be passed to filesystems, not done by
generic code. i.e. this isn't sufficient for XFS, and may also be
problematic for btrfs.
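
i.e. the generic helper should be handing this off to the filesystem
so it can take its own locks. Purely a hypothetical shape - there is
no splice_from_socket file operation, it's a made-up name for
illustration only:

	/*
	 * Hypothetical hook: let the filesystem take whatever locks it
	 * needs (XFS wants more than i_mutex here).
	 */
	if (file->f_op->splice_from_socket)
		return file->f_op->splice_from_socket(file, sock, ppos, count);
	return -EINVAL;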

> +
> +	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);

That doesn't exist anymore - it was removed over a year ago....
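
The freeze protection that replaced it is sb_start_write()/
sb_end_write() around the whole write; roughly:

	sb_start_write(inode->i_sb);
	/* ... perform the buffered write ... */
	sb_end_write(inode->i_sb);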

> +
> +	/* We can write back this queue in page reclaim */
> +	current->backing_dev_info = mapping->backing_dev_info;
> +
> +	err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
> +	if (err != 0 || count == 0)
> +		goto done;
> +
> +	file_remove_suid(file);
> +	file_update_time(file);	

error checking?
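
Both of those can fail, and the normal buffered write path checks
them, i.e. something along the lines of:

	err = file_remove_suid(file);
	if (err)
		goto done;
	err = file_update_time(file);
	if (err)
		goto done;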

> +
> +	count_tmp = count;
> +	do {
> +		unsigned long bytes;	/* Bytes to write to page */
> +		unsigned long offset;	/* Offset into pagecache page */
> +		struct page *pageP;

camel case

> +		void *fsdata;
> +
> +		offset = (pos & (PAGE_CACHE_SIZE - 1));
> +		bytes = PAGE_CACHE_SIZE - offset;
> +		if (bytes > count_tmp)
> +			bytes = count_tmp;
> +		ret = mapping->a_ops->write_begin(file, mapping, pos, bytes,
> +						  AOP_FLAG_UNINTERRUPTIBLE,
> +						  &pageP, &fsdata);
> +
> +		if (unlikely(ret)) {
> +			err = ret;
> +			for(cPagePtr = 0; cPagePtr < cPagesAllocated; cPagePtr++) {
> +				kunmap(rv_cb[cPagePtr].rv_page);
> +				ret = mapping->a_ops->write_end(file, mapping,
> +								rv_cb[cPagePtr].rv_pos,
> +								rv_cb[cPagePtr].rv_count,
> +								rv_cb[cPagePtr].rv_count,
> +								rv_cb[cPagePtr].rv_page,
> +								rv_cb[cPagePtr].rv_fsdata);
> +			}
> +			goto done;
> +		}
> +		rv_cb[cPagesAllocated].rv_page = pageP;
> +		rv_cb[cPagesAllocated].rv_pos = pos;
> +		rv_cb[cPagesAllocated].rv_count = bytes;
> +		rv_cb[cPagesAllocated].rv_fsdata = fsdata;
> +		iov[cPagesAllocated].iov_base = kmap(pageP) + offset;
> +		iov[cPagesAllocated].iov_len = bytes;
> +		cPagesAllocated++;
> +		count_tmp -= bytes;
> +		pos += bytes;
> +	} while (count_tmp);

So, we are nesting up to 32 page locks here. That's bad. And we are
nesting kmap() calls for all the pages individually - is that even
safe to do?

So, what happens when we've got 16 pages in, and the filesystem has
allocated space for those 16 blocks, and we get ENOSPC on the 17th?
Sure, you undo the state here, but what about the 16 blocks that the
filesystem has allocated to this file? There's no notification to
the filesystem that they need to be truncated away because the write
failed....
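
Compare with generic_perform_write(): it copies into and completes one
page at a time, so at most one page is ever locked and a failure only
leaves the current page to deal with. Very roughly (untested sketch;
recv_into_page() is a made-up stand-in for the socket receive):

	while (count) {
		offset = pos & (PAGE_CACHE_SIZE - 1);
		bytes = min_t(size_t, PAGE_CACHE_SIZE - offset, count);

		err = mapping->a_ops->write_begin(file, mapping, pos, bytes,
						  0, &page, &fsdata);
		if (err)
			break;

		/* fill just this one page while it is locked */
		copied = recv_into_page(sock, page, offset, bytes);

		/* write_end() unlocks and releases the page */
		err = mapping->a_ops->write_end(file, mapping, pos, bytes,
						copied, page, fsdata);
		if (err < 0)
			break;

		pos += copied;
		count -= copied;
	}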

> +
> +	/* IOV is ready, receive the date from socket now */
> +	msg.msg_name = NULL;
> +	msg.msg_namelen = 0;
> +	msg.msg_iov = (struct iovec *)&iov[0];
> +	msg.msg_iovlen = cPagesAllocated ;
> +	msg.msg_control = NULL;
> +	msg.msg_controllen = 0;
> +	msg.msg_flags = MSG_KERNSPACE;
> +	rcvtimeo = sock->sk->sk_rcvtimeo;    
> +	sock->sk->sk_rcvtimeo = 8 * HZ;

We can hold the inode and the pages locked for 8 seconds?

I'll stop there. This is fundamentally broken. It's an attempt to do
a multi-page write operation without any of the supporting
structures needed to handle the failure cases properly.  The nested
page locking has "deadlock" written all over it, and the lack of
partial failure handling shouts "data corruption" and "stale data
exposure" to me. The fact it can block for up to 8 seconds waiting
for network shenanigans to be completed while holding lots of locks
is going to cause all sorts of problems under memory pressure.

Not to mention it means that all memory allocations in the socket
receive path need to be done with GFP_NOFS, because GFP_KERNEL
allocations are almost guaranteed to deadlock on the locked pages this
path already holds....

Need I say more?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
