Weird writev() behaviour on EFAULT - also successfully modifying the file

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Weird writev() behaviour on EFAULT - also successfully modifying the file
@ 2016-08-11 18:46 Tuomas Tynkkynen
  2016-08-11 22:10 ` Al Viro
  0 siblings, 1 reply; 6+ messages in thread
From: Tuomas Tynkkynen @ 2016-08-11 18:46 UTC (permalink / raw)
  To: linux-fsdevel

Greetings,

I've noticed a corner case with writev() both modifying the file and
returning -EFAULT at the same time. This happens on filesystems using
generic_perform_write() (i.e. ext4, vfat) on 4.6.3 kernel and below,
down to 3.16. Here's the reproducer:

// 8<---- cut here ------------------------ >8
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>

int main(int argc, char** argv) {
    int fd;
    ssize_t ret;
    struct iovec iov[] = {
        { .iov_base = NULL, .iov_len = 0, },
        { .iov_base = NULL, .iov_len = 4096, },
    };

    system("dd if=/dev/zero bs=8k count=1 | tr '\\0' 'A' > foo");

    fd = open("foo", O_RDWR);
    if (fd < 0) {
        perror("open()");
        return 1;
    }

    ret = writev(fd, iov, 2);
    if (ret < 0) {
        perror("writev()");
        return 1;
    }

    return 0;
}
// 8<---- cut here ------------------------ >8

Running that prints "writev(): Bad address" but also some NUL bytes have
appeared at the beginning file, in addition to the 'A's by the dd.

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41  |AAAAAAAAAAAAAAAA|
*
00002000

Is that intented behaviour?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Weird writev() behaviour on EFAULT - also successfully modifying the file
  2016-08-11 18:46 Weird writev() behaviour on EFAULT - also successfully modifying the file Tuomas Tynkkynen
@ 2016-08-11 22:10 ` Al Viro
  2016-08-11 22:39   ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: Al Viro @ 2016-08-11 22:10 UTC (permalink / raw)
  To: Tuomas Tynkkynen; +Cc: linux-fsdevel, Linus Torvalds

On Thu, Aug 11, 2016 at 09:46:51PM +0300, Tuomas Tynkkynen wrote:
> Greetings,
> 
> I've noticed a corner case with writev() both modifying the file and
> returning -EFAULT at the same time. This happens on filesystems using
> generic_perform_write() (i.e. ext4, vfat) on 4.6.3 kernel and below,
> down to 3.16. Here's the reproducer:

[fill 8Kb with 'A', then reopen the file and do writev() on two-element
iovec array consisting of {NULL,0},{NULL,4096}]

> Running that prints "writev(): Bad address" but also some NUL bytes have
> appeared at the beginning file, in addition to the 'A's by the dd.
> 
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00001000  41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41  |AAAAAAAAAAAAAAAA|
> *
> 00002000
> 
> Is that intented behaviour?

There are two issues mixed here: one is that iovec array (NULL/0, NULL/4K)
we get
int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
{
        if (!(i->type & (ITER_BVEC|ITER_KVEC))) {
                char __user *buf = i->iov->iov_base + i->iov_offset;
                bytes = min(bytes, i->iov->iov_len - i->iov_offset);
                return fault_in_pages_readable(buf, bytes);
        }
        return 0;
}
succeed.  It shouldn't.  However, even if it correctly failed, we still have
another one - __copy_from_user_inatomic() zeroes what it hadn't copied over,
so if we e.g. gave write(2) a buffer with unmapped page in the middle, we
would get a short write *and* a bunch of bytes past the amount reported
written replaced with zeroes.  And _that_ goes very far back.  2.1.late
definitely had already been that way.

Note, BTW, that fixing iov_iter_fault_in_readable() won't exclude a possibility
of write() returning -EFAULT and overwriting some bytes with zeroes; all you
need is munmap() racing with write() and hitting after the successful fault-in.

We could, in principle, provide a non-zeroing variants of copy_from_user and
copy_from_user_inatomic and switch ->write()/->write_iter() instances to
use of those; however, I'm not sure if it's really worth bothering with.

I do see one potential argument in favour of doing that; suppose we have
a buffer filled with 'A'.  It starts one byte before a page boundary and all
its pages are currently in swap.  There's a file also filled with 'A' and
mmapped by another process.  We say write() from our buffer to that file.
Having that other process observe transient zeroes appearing in its mmapped
area is Not Nice(tm); having them stick around is even worse, and that can
happen in case if the first process gets killed while it's faulting the next
page of buffer in.

Linus?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Weird writev() behaviour on EFAULT - also successfully modifying the file
  2016-08-11 22:10 ` Al Viro
@ 2016-08-11 22:39   ` Linus Torvalds
  2016-08-11 23:03     ` Al Viro
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2016-08-11 22:39 UTC (permalink / raw)
  To: Al Viro; +Cc: Tuomas Tynkkynen, linux-fsdevel

On Thu, Aug 11, 2016 at 3:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Aug 11, 2016 at 09:46:51PM +0300, Tuomas Tynkkynen wrote:
>> Greetings,
>>
>> I've noticed a corner case with writev() both modifying the file and
>> returning -EFAULT at the same time. This happens on filesystems using
>> generic_perform_write() (i.e. ext4, vfat) on 4.6.3 kernel and below,
>> down to 3.16. Here's the reproducer:

Yeah, if you have things that cause EFAULT, all bets are off.

Bad things may have happened in the middle, and POSIX allows it.

That said, from a quality-of-implementation standpoint, we've
generally tried to do the robust thing, and return partial write
results etc. So in general, EFAULT does mean "nothing happened", but
there are exceptions:

> We could, in principle, provide a non-zeroing variants of copy_from_user and
> copy_from_user_inatomic and switch ->write()/->write_iter() instances to
> use of those; however, I'm not sure if it's really worth bothering with.

That would be nice, but it doesn't really work.

When we write a whole page, we may have actively thrown the old values
away. In other words, we saw that the user was going to overwrite the
page, we didn't even read it in at all, knowing that it's all going to
be overwritten.

We now get a page fault in the middle, so we've overwritten part of
the page, but the rest of the page is just garbage (and _not_ the old
contents). End result: overwrite the garbage with zeroes, and take
advantage of the fact that POSIX says that EFAULT is "undefined".

We *can* handle the case of immediate page fault slightly better - the
whole prefault thing notices that it gets no data at all, and it could
just say "don't even look up the page in the first place". But it
doesn't really help the more complicated "page fault in the middle"
case, so on the whole I really think that we're better off just saying
that page faults in the middle of a write will leave the file in an
inconsistent state.

If somebody wants to make it behave slightly better, and has a
reasonable patch for it, I certainly won't argue against better
semantics.

But at the same time, the basic rule really is:

 "If you give bad virtual memory regions to system calls, you get to
keep the resulting broken piece and blame yourself".

anything the kernel does better is purely about us being polite, not
about correctness or caring deeply.

                Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Weird writev() behaviour on EFAULT - also successfully modifying the file
  2016-08-11 22:39   ` Linus Torvalds
@ 2016-08-11 23:03     ` Al Viro
  2016-08-11 23:10       ` Linus Torvalds
  2016-08-11 23:13       ` Linus Torvalds
  0 siblings, 2 replies; 6+ messages in thread
From: Al Viro @ 2016-08-11 23:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Tuomas Tynkkynen, linux-fsdevel

On Thu, Aug 11, 2016 at 03:39:22PM -0700, Linus Torvalds wrote:

> But at the same time, the basic rule really is:
> 
>  "If you give bad virtual memory regions to system calls, you get to
> keep the resulting broken piece and blame yourself".
> 
> anything the kernel does better is purely about us being polite, not
> about correctness or caring deeply.

Yes, but... it doesn't need to be a bad region at all.  Look: we have a 20Kb
array of char starting at 0x....3ff.  We feed it to write().  Everything
is mapped, etc. - no EFAULT in sight.  However, it is all swapped out at
the moment.  And somebody else has that file mmapped; again, no pathological
cases, not even in the same address space as writer, etc.  File contains no
zero bytes.  Neither does the buffer we are writing.  File position is 0 and
file is considerably longer than 20Kb.

We do a fault-in; fine, the first page (with one byte of useful data) is
swapped in.  We call ->write_begin(), then __copy_from_user_inatomic()
(with pagefaults disabled) the first 4Kb into the page with index 0 in
the file's page cache.  Copy fails after 1 byte, since the next page
is currently still swapped out.  We advance by 1 byte and fault that page
in; fine, now we'll copy 4095 bytes successfully, advance by 4095 and
proceed to writing into the page with index 1 in file's page cache, etc.
In the end everything works fine - no short writes, no EFAULT, all the
data copied into file.

However, _during_ the write the other process had seen something very odd -
it had mmapped a zero-free file, it knows that nobody had been writing any
zero-containing data into it, but it had seen zeroes come and go in the
mmapped area.  That "copied 1 byte" is actually "copied 1 byte, zeroed the
next 4095 bytes".  Sure, it's followed by "copy the next 4095 bytes over
those zeroes", but only after the next chunk of buffer got swapped in.  And
if the writer got killed, the things are even nastier - these zeroes are
*not* overwritten by subsequent data.

Again, it's about as tame case as they come - no NULLs, no EFAULT, no
buffers mmapped from the file we are writing to; just a normal write()
replacing the data in the very beginning of file from a buffer that isn't
page-aligned...

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Weird writev() behaviour on EFAULT - also successfully modifying the file
  2016-08-11 23:03     ` Al Viro
@ 2016-08-11 23:10       ` Linus Torvalds
  2016-08-11 23:13       ` Linus Torvalds
  1 sibling, 0 replies; 6+ messages in thread
From: Linus Torvalds @ 2016-08-11 23:10 UTC (permalink / raw)
  To: Al Viro; +Cc: Tuomas Tynkkynen, linux-fsdevel

On Thu, Aug 11, 2016 at 4:03 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Yes, but... it doesn't need to be a bad region at all.  Look: we have a 20Kb
> array of char starting at 0x....3ff.  We feed it to write(). [..]
>
> However, _during_ the write the other process had seen something very odd -
> it had mmapped a zero-free file, it knows that nobody had been writing any
> zero-containing data into it, but it had seen zeroes come and go in the
> mmapped area.  That "copied 1 byte" is actually "copied 1 byte, zeroed the
> next 4095 bytes".

Yeah, we should probably be better about that.  That's not the EFAULT
case, and I agree that what we do there is not pretty.

And yes, I'd be ok with the atomic from-user functions not clearing
the end of the buffer, but we'd have to be very very careful about it.
In particular, the "write_end()" functions would need to distinguish
between the case of "we didn't actually write everything, but the page
was up-to-date" and the "we didn't write everything, and the remainder
_wasn't_ uptodate and now we really need to zero it before unlocking
the page".

There may not be all that many write_end() functions out there, but
still.. The zeroing behavior does guarantee that we never have
uninitialized memory from when a user copy fails.

             Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Weird writev() behaviour on EFAULT - also successfully modifying the file
  2016-08-11 23:03     ` Al Viro
  2016-08-11 23:10       ` Linus Torvalds
@ 2016-08-11 23:13       ` Linus Torvalds
  1 sibling, 0 replies; 6+ messages in thread
From: Linus Torvalds @ 2016-08-11 23:13 UTC (permalink / raw)
  To: Al Viro; +Cc: Tuomas Tynkkynen, linux-fsdevel

On Thu, Aug 11, 2016 at 4:03 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Again, it's about as tame case as they come - no NULLs, no EFAULT, no
> buffers mmapped from the file we are writing to; just a normal write()
> replacing the data in the very beginning of file from a buffer that isn't
> page-aligned...

Side note: this whole "partial copy from user space" is a case that we
have definitely gotten wrong in the past.

That whole iov_iter_fault_in_readable() is an ugly hack, and basically
avoids the whole situation of a partial user copy by just faulting
things in early. Exactly because we have gotten things wrong in the
past.

And yes, it's not a proper fix - it's racy, but it means you'll never
see the partial user access in practice.

                Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-08-11 23:13 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-08-11 18:46 Weird writev() behaviour on EFAULT - also successfully modifying the file Tuomas Tynkkynen
2016-08-11 22:10 ` Al Viro
2016-08-11 22:39   ` Linus Torvalds
2016-08-11 23:03     ` Al Viro
2016-08-11 23:10       ` Linus Torvalds
2016-08-11 23:13       ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).