From: Jeremy Allison <jra-eUNUBHrolfbYtjvyW6yDsg@public.gmane.org>
To: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
Cc: Jeremy Allison <jra-eUNUBHrolfbYtjvyW6yDsg@public.gmane.org>,
Steve French <smfrench-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
Jeff Layton <jlayton-eUNUBHrolfbYtjvyW6yDsg@public.gmane.org>,
linux-cifs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
LKML <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
linux-fsdevel
<linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: Recvfile patch used for Samba.
Date: Tue, 23 Jul 2013 14:58:58 -0700 [thread overview]
Message-ID: <20130723215858.GB16356@samba2> (raw)
In-Reply-To: <20130723071027.GJ19986@dastard>
On Tue, Jul 23, 2013 at 05:10:27PM +1000, Dave Chinner wrote:
> So, we are nesting up to 32 page locks here. That's bad. And we are
> nesting kmap() calls for all the pages individually - is that even
> safe to do?
>
> So, what happens when we've got 16 pages in, and the filesystem has
> allocated space for those 16 blocks, and we get ENOSPC on the 17th?
> Sure, you undo the state here, but what about the 16 blocks that the
> filesystem has allocated to this file? There's no notification to
> the filesystem that they need to be truncated away because the write
> failed....
>
> > +
> > + /* IOV is ready, receive the date from socket now */
> > + msg.msg_name = NULL;
> > + msg.msg_namelen = 0;
> > + msg.msg_iov = (struct iovec *)&iov[0];
> > + msg.msg_iovlen = cPagesAllocated ;
> > + msg.msg_control = NULL;
> > + msg.msg_controllen = 0;
> > + msg.msg_flags = MSG_KERNSPACE;
> > + rcvtimeo = sock->sk->sk_rcvtimeo;
> > + sock->sk->sk_rcvtimeo = 8 * HZ;
>
> We can hold the inode and the pages locked for 8 seconds?
>
> I'll stop there. This is fundamentally broken. It's an attempt to do
> a multi-page write operation without any of the supporting
> structures needed to handle the failure cases properly. The nested
> page locking has "deadlock" written all over it, and the lack of
> partial failure handling shouts "data corruption" and "stale data
> exposure" to me. The fact it can block for up to 8 seconds waiting
> for network shenanigans to be completed while holding lots of locks
> is going to cause all sorts of problems under memory pressure.
>
> Not to mention it means that all memory allocations in the msgrcv
> path need to be done with GFP_NOFS, because GFP_KERNEL allocations
> are almost guaranteed to deadlock on the locked pages this path
> already holds....
>
> Need I say more?
No, that's great ! :-).
Thanks for the analysis. I'd heard it wasn't
near production quality, but not being a kernel
engineer myself I wasn't able to make that assessment.
Having said that the OEMs that are using it does
find it improves write speeds by a large amount (10%
or more), so it's showing there is room for improvement
here if the correct code can be created for recvfile.
Cheers,
Jeremy.
WARNING: multiple messages have this Message-ID (diff)
From: Jeremy Allison <jra@samba.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Jeremy Allison <jra@samba.org>, Steve French <smfrench@gmail.com>,
Jeff Layton <jlayton@samba.org>,
linux-cifs@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: Recvfile patch used for Samba.
Date: Tue, 23 Jul 2013 14:58:58 -0700 [thread overview]
Message-ID: <20130723215858.GB16356@samba2> (raw)
In-Reply-To: <20130723071027.GJ19986@dastard>
On Tue, Jul 23, 2013 at 05:10:27PM +1000, Dave Chinner wrote:
> So, we are nesting up to 32 page locks here. That's bad. And we are
> nesting kmap() calls for all the pages individually - is that even
> safe to do?
>
> So, what happens when we've got 16 pages in, and the filesystem has
> allocated space for those 16 blocks, and we get ENOSPC on the 17th?
> Sure, you undo the state here, but what about the 16 blocks that the
> filesystem has allocated to this file? There's no notification to
> the filesystem that they need to be truncated away because the write
> failed....
>
> > +
> > + /* IOV is ready, receive the date from socket now */
> > + msg.msg_name = NULL;
> > + msg.msg_namelen = 0;
> > + msg.msg_iov = (struct iovec *)&iov[0];
> > + msg.msg_iovlen = cPagesAllocated ;
> > + msg.msg_control = NULL;
> > + msg.msg_controllen = 0;
> > + msg.msg_flags = MSG_KERNSPACE;
> > + rcvtimeo = sock->sk->sk_rcvtimeo;
> > + sock->sk->sk_rcvtimeo = 8 * HZ;
>
> We can hold the inode and the pages locked for 8 seconds?
>
> I'll stop there. This is fundamentally broken. It's an attempt to do
> a multi-page write operation without any of the supporting
> structures needed to handle the failure cases properly. The nested
> page locking has "deadlock" written all over it, and the lack of
> partial failure handling shouts "data corruption" and "stale data
> exposure" to me. The fact it can block for up to 8 seconds waiting
> for network shenanigans to be completed while holding lots of locks
> is going to cause all sorts of problems under memory pressure.
>
> Not to mention it means that all memory allocations in the msgrcv
> path need to be done with GFP_NOFS, because GFP_KERNEL allocations
> are almost guaranteed to deadlock on the locked pages this path
> already holds....
>
> Need I say more?
No, that's great ! :-).
Thanks for the analysis. I'd heard it wasn't
near production quality, but not being a kernel
engineer myself I wasn't able to make that assessment.
Having said that the OEMs that are using it does
find it improves write speeds by a large amount (10%
or more), so it's showing there is room for improvement
here if the correct code can be created for recvfile.
Cheers,
Jeremy.
next prev parent reply other threads:[~2013-07-23 21:58 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-07-22 21:57 Recvfile patch used for Samba Jeremy Allison
2013-07-22 21:57 ` Jeremy Allison
2013-07-22 23:26 ` Joe Perches
2013-07-23 7:10 ` Dave Chinner
2013-07-23 13:31 ` Jeff Layton
2013-07-23 13:31 ` Jeff Layton
2013-07-23 21:58 ` Jeremy Allison [this message]
2013-07-23 21:58 ` Jeremy Allison
2013-07-24 2:47 ` Dave Chinner
2013-07-25 8:17 ` Steven Whitehouse
2013-07-25 8:17 ` Steven Whitehouse
2013-07-26 4:42 ` Dave Chinner
2013-07-26 4:42 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130723215858.GB16356@samba2 \
--to=jra-eunubhrolfbytjvyw6ydsg@public.gmane.org \
--cc=david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org \
--cc=jlayton-eUNUBHrolfbYtjvyW6yDsg@public.gmane.org \
--cc=linux-cifs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
--cc=linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
--cc=linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
--cc=smfrench-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.