From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1750954AbdANBYe (ORCPT <rfc822;w@1wt.eu>);
        Fri, 13 Jan 2017 20:24:34 -0500
Received: from zeniv.linux.org.uk ([195.92.253.2]:32884 "EHLO
        ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1750894AbdANBYd (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Fri, 13 Jan 2017 20:24:33 -0500
Date: Sat, 14 Jan 2017 01:24:28 +0000
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Alan J. Wylie" <alan@wylie.me.uk>,
        Thorsten Leemhuis <regressions@leemhuis.info>,
        linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
Message-ID: <20170114012428.GX1555@ZenIV.linux.org.uk>
References: <CA+55aFw0T1wARr8kLvcuHkA4bDE+YJ18wKtPaxc1vFHWPgfQQg@mail.gmail.com>
 <20170113200826.GP1555@ZenIV.linux.org.uk>
 <20170113201121.GQ1555@ZenIV.linux.org.uk>
 <CA+55aFwVgk8QwZKeU2xUGxS0rjRUsSKRYVVA-p+vHnaS5ixoAA@mail.gmail.com>
 <20170113204731.GS1555@ZenIV.linux.org.uk>
 <20170113215504.GT1555@ZenIV.linux.org.uk>
 <20170113215919.GU1555@ZenIV.linux.org.uk>
 <20170113221308.GV1555@ZenIV.linux.org.uk>
 <20170113225058.GW1555@ZenIV.linux.org.uk>
 <CA+55aFx-Lc2KooGGt3ohHZUM6cUpcHWxs2a_YUdL=4bdRyxNfA@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CA+55aFx-Lc2KooGGt3ohHZUM6cUpcHWxs2a_YUdL=4bdRyxNfA@mail.gmail.com>
User-Agent: Mutt/1.7.1 (2016-10-04)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jan 13, 2017 at 04:59:37PM -0800, Linus Torvalds wrote:

> EXCEPT.
> 
> I don't think "i->iov_offset" is actually correct. If you truncated
> the whole thing, you should have cleared iov_offset too, and that
> never happened. So truncating everything will not empty the buffers
> for us, we'll still get to that "if (off)" case and have nrbufs=1.
> 
> So I'd expect something like
> 
>     if (!size)
>         i->iov_offset = 0;
> 
> in pipe_advance(), in order to really free all buffers for that case. No?

Why would advance by 0 change ->iov_offset here?

> Or is there some guarantee that iov_offset was already zero there? I
> don't see it, but I'm batting zero on this code...

It was zero initially (see iov_iter_pipe()).  It was not affected by
iov_iter_get_pages() and friends.  If there was copy_to_iter/zero_iter/
copy_page_to_iter for any non-zero amount of data, then it's _not_ zero and
should bloody well stay such.

Note that the normal case is something like O_DIRECT read grabbing pages,
doing a bunch of IO into such and, once it's done, iov_iter_advance()
for the actual amount read being done by
                retval = mapping->a_ops->direct_IO(iocb, &data);
                if (retval >= 0) {
                        iocb->ki_pos += retval;
                        iov_iter_advance(iter, retval);
                }
which advances iter past the actually read data.  If we proceed to
fall back to the do_generic_file_read(), it will do a bunch of
copy_page_to_iter() to the points past that.

default_file_splice_read() pretty much imitates O_DIRECT read in that
respect, with kernel_readv() being a counterpart of direct_IO.

generic_file_splice_read() is another PIPE_ITER user; that one really
calls ->read_iter().  On any non-error (including short reads) it will
not bother with iov_iter_advance() at all - ->read_iter() should've
called that already, if at all needed.

On error it does use iov_iter_advance(), pretty much as a way to
trigger pipe_truncate().  There we directly reset ->iov_offset to 0
and ->idx to its original value.  Strictly speaking, we could live
without calling it there at all - normally ->read_iter() instances follow
the usual "if you'd managed to read something, report a short read,
not an error" approach.  None of the exceptions are good candidates for
use of generic_file_splice_read() anyway.

However, theoretically it is possible that ->read_iter() instance does
successful copy_to_iter() and then decides to return an error.  This
        } else if (ret < 0) {
                to.idx = idx;
                to.iov_offset = 0;
                iov_iter_advance(&to, 0); /* to free what was emitted */
in generic_file_splice_read() catches any such cases.  Here the call
of iov_iter_advance(&to, 0) is guaranteed to be equivalent to
pipe_truncate(&to); we might make the latter non-static and call it
directly, but I'm not sure it's worth doing.
 
> Can you please explain when you are once more awake..