From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750954AbdANBYe (ORCPT ); Fri, 13 Jan 2017 20:24:34 -0500 Received: from zeniv.linux.org.uk ([195.92.253.2]:32884 "EHLO ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750894AbdANBYd (ORCPT ); Fri, 13 Jan 2017 20:24:33 -0500 Date: Sat, 14 Jan 2017 01:24:28 +0000 From: Al Viro To: Linus Torvalds Cc: "Alan J. Wylie" , Thorsten Leemhuis , linux-kernel Subject: Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn Message-ID: <20170114012428.GX1555@ZenIV.linux.org.uk> References: <20170113200826.GP1555@ZenIV.linux.org.uk> <20170113201121.GQ1555@ZenIV.linux.org.uk> <20170113204731.GS1555@ZenIV.linux.org.uk> <20170113215504.GT1555@ZenIV.linux.org.uk> <20170113215919.GU1555@ZenIV.linux.org.uk> <20170113221308.GV1555@ZenIV.linux.org.uk> <20170113225058.GW1555@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.7.1 (2016-10-04) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 13, 2017 at 04:59:37PM -0800, Linus Torvalds wrote: > EXCEPT. > > I don't think "i->iov_offset" is actually correct. If you truncated > the whole thing, you should have cleared iov_offset too, and that > never happened. So truncating everything will not empty the buffers > for us, we'll still get to that "if (off)" case and have nrbufs=1. > > So I'd expect something like > > if (!size) > i->iov_offset = 0; > > in pipe_advance(), in order to really free all buffers for that case. No? Why would advance by 0 change ->iov_offset here? > Or is there some guarantee that iov_offset was already zero there? I > don't see it, but I'm batting zero on this code... It was zero initially (see iov_iter_pipe()). It was not affected by iov_iter_get_pages() and friends. If there was copy_to_iter/zero_iter/ copy_page_to_iter for any non-zero amount of data, then it's _not_ zero and should bloody well stay such. Note that the normal case is something like O_DIRECT read grabbing pages, doing a bunch of IO into such and, once it's done, iov_iter_advance() for the actual amount read being done by retval = mapping->a_ops->direct_IO(iocb, &data); if (retval >= 0) { iocb->ki_pos += retval; iov_iter_advance(iter, retval); } which advances iter past the actually read data. If we proceed to fall back to the do_generic_file_read(), it will do a bunch of copy_page_to_iter() to the points past that. default_file_splice_read() pretty much imitates O_DIRECT read in that respect, with kernel_readv() being a counterpart of direct_IO. generic_file_splice_read() is another PIPE_ITER user; that one really calls ->read_iter(). On any non-error (including short reads) it will not bother with iov_iter_advance() at all - ->read_iter() should've called that already, if at all needed. On error it does use iov_iter_advance(), pretty much as a way to trigger pipe_truncate(). There we directly reset ->iov_offset to 0 and ->idx to its original value. Strictly speaking, we could live without calling it there at all - normally ->read_iter() instances follow the usual "if you'd managed to read something, report a short read, not an error" approach. None of the exceptions are good candidates for use of generic_file_splice_read() anyway. However, theoretically it is possible that ->read_iter() instance does successful copy_to_iter() and then decides to return an error. This } else if (ret < 0) { to.idx = idx; to.iov_offset = 0; iov_iter_advance(&to, 0); /* to free what was emitted */ in generic_file_splice_read() catches any such cases. Here the call of iov_iter_advance(&to, 0) is guaranteed to be equivalent to pipe_truncate(&to); we might make the latter non-static and call it directly, but I'm not sure it's worth doing. > Can you please explain when you are once more awake..