* Weird writev() behaviour on EFAULT - also successfully modifying the file
@ 2016-08-11 18:46 Tuomas Tynkkynen
2016-08-11 22:10 ` Al Viro
0 siblings, 1 reply; 6+ messages in thread
From: Tuomas Tynkkynen @ 2016-08-11 18:46 UTC (permalink / raw)
To: linux-fsdevel
Greetings,
I've noticed a corner case with writev() both modifying the file and
returning -EFAULT at the same time. This happens on filesystems using
generic_perform_write() (i.e. ext4, vfat) on 4.6.3 kernel and below,
down to 3.16. Here's the reproducer:
// 8<---- cut here ------------------------ >8
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>
int main(int argc, char** argv) {
int fd;
ssize_t ret;
struct iovec iov[] = {
{ .iov_base = NULL, .iov_len = 0, },
{ .iov_base = NULL, .iov_len = 4096, },
};
system("dd if=/dev/zero bs=8k count=1 | tr '\\0' 'A' > foo");
fd = open("foo", O_RDWR);
if (fd < 0) {
perror("open()");
return 1;
}
ret = writev(fd, iov, 2);
if (ret < 0) {
perror("writev()");
return 1;
}
return 0;
}
// 8<---- cut here ------------------------ >8
Running that prints "writev(): Bad address" but also some NUL bytes have
appeared at the beginning file, in addition to the 'A's by the dd.
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA|
*
00002000
Is that intented behaviour?
^ permalink raw reply [flat|nested] 6+ messages in thread* Re: Weird writev() behaviour on EFAULT - also successfully modifying the file 2016-08-11 18:46 Weird writev() behaviour on EFAULT - also successfully modifying the file Tuomas Tynkkynen @ 2016-08-11 22:10 ` Al Viro 2016-08-11 22:39 ` Linus Torvalds 0 siblings, 1 reply; 6+ messages in thread From: Al Viro @ 2016-08-11 22:10 UTC (permalink / raw) To: Tuomas Tynkkynen; +Cc: linux-fsdevel, Linus Torvalds On Thu, Aug 11, 2016 at 09:46:51PM +0300, Tuomas Tynkkynen wrote: > Greetings, > > I've noticed a corner case with writev() both modifying the file and > returning -EFAULT at the same time. This happens on filesystems using > generic_perform_write() (i.e. ext4, vfat) on 4.6.3 kernel and below, > down to 3.16. Here's the reproducer: [fill 8Kb with 'A', then reopen the file and do writev() on two-element iovec array consisting of {NULL,0},{NULL,4096}] > Running that prints "writev(): Bad address" but also some NUL bytes have > appeared at the beginning file, in addition to the 'A's by the dd. > > 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| > * > 00001000 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA| > * > 00002000 > > Is that intented behaviour? There are two issues mixed here: one is that iovec array (NULL/0, NULL/4K) we get int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes) { if (!(i->type & (ITER_BVEC|ITER_KVEC))) { char __user *buf = i->iov->iov_base + i->iov_offset; bytes = min(bytes, i->iov->iov_len - i->iov_offset); return fault_in_pages_readable(buf, bytes); } return 0; } succeed. It shouldn't. However, even if it correctly failed, we still have another one - __copy_from_user_inatomic() zeroes what it hadn't copied over, so if we e.g. gave write(2) a buffer with unmapped page in the middle, we would get a short write *and* a bunch of bytes past the amount reported written replaced with zeroes. And _that_ goes very far back. 2.1.late definitely had already been that way. Note, BTW, that fixing iov_iter_fault_in_readable() won't exclude a possibility of write() returning -EFAULT and overwriting some bytes with zeroes; all you need is munmap() racing with write() and hitting after the successful fault-in. We could, in principle, provide a non-zeroing variants of copy_from_user and copy_from_user_inatomic and switch ->write()/->write_iter() instances to use of those; however, I'm not sure if it's really worth bothering with. I do see one potential argument in favour of doing that; suppose we have a buffer filled with 'A'. It starts one byte before a page boundary and all its pages are currently in swap. There's a file also filled with 'A' and mmapped by another process. We say write() from our buffer to that file. Having that other process observe transient zeroes appearing in its mmapped area is Not Nice(tm); having them stick around is even worse, and that can happen in case if the first process gets killed while it's faulting the next page of buffer in. Linus? ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Weird writev() behaviour on EFAULT - also successfully modifying the file 2016-08-11 22:10 ` Al Viro @ 2016-08-11 22:39 ` Linus Torvalds 2016-08-11 23:03 ` Al Viro 0 siblings, 1 reply; 6+ messages in thread From: Linus Torvalds @ 2016-08-11 22:39 UTC (permalink / raw) To: Al Viro; +Cc: Tuomas Tynkkynen, linux-fsdevel On Thu, Aug 11, 2016 at 3:10 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Thu, Aug 11, 2016 at 09:46:51PM +0300, Tuomas Tynkkynen wrote: >> Greetings, >> >> I've noticed a corner case with writev() both modifying the file and >> returning -EFAULT at the same time. This happens on filesystems using >> generic_perform_write() (i.e. ext4, vfat) on 4.6.3 kernel and below, >> down to 3.16. Here's the reproducer: Yeah, if you have things that cause EFAULT, all bets are off. Bad things may have happened in the middle, and POSIX allows it. That said, from a quality-of-implementation standpoint, we've generally tried to do the robust thing, and return partial write results etc. So in general, EFAULT does mean "nothing happened", but there are exceptions: > We could, in principle, provide a non-zeroing variants of copy_from_user and > copy_from_user_inatomic and switch ->write()/->write_iter() instances to > use of those; however, I'm not sure if it's really worth bothering with. That would be nice, but it doesn't really work. When we write a whole page, we may have actively thrown the old values away. In other words, we saw that the user was going to overwrite the page, we didn't even read it in at all, knowing that it's all going to be overwritten. We now get a page fault in the middle, so we've overwritten part of the page, but the rest of the page is just garbage (and _not_ the old contents). End result: overwrite the garbage with zeroes, and take advantage of the fact that POSIX says that EFAULT is "undefined". We *can* handle the case of immediate page fault slightly better - the whole prefault thing notices that it gets no data at all, and it could just say "don't even look up the page in the first place". But it doesn't really help the more complicated "page fault in the middle" case, so on the whole I really think that we're better off just saying that page faults in the middle of a write will leave the file in an inconsistent state. If somebody wants to make it behave slightly better, and has a reasonable patch for it, I certainly won't argue against better semantics. But at the same time, the basic rule really is: "If you give bad virtual memory regions to system calls, you get to keep the resulting broken piece and blame yourself". anything the kernel does better is purely about us being polite, not about correctness or caring deeply. Linus ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Weird writev() behaviour on EFAULT - also successfully modifying the file 2016-08-11 22:39 ` Linus Torvalds @ 2016-08-11 23:03 ` Al Viro 2016-08-11 23:10 ` Linus Torvalds 2016-08-11 23:13 ` Linus Torvalds 0 siblings, 2 replies; 6+ messages in thread From: Al Viro @ 2016-08-11 23:03 UTC (permalink / raw) To: Linus Torvalds; +Cc: Tuomas Tynkkynen, linux-fsdevel On Thu, Aug 11, 2016 at 03:39:22PM -0700, Linus Torvalds wrote: > But at the same time, the basic rule really is: > > "If you give bad virtual memory regions to system calls, you get to > keep the resulting broken piece and blame yourself". > > anything the kernel does better is purely about us being polite, not > about correctness or caring deeply. Yes, but... it doesn't need to be a bad region at all. Look: we have a 20Kb array of char starting at 0x....3ff. We feed it to write(). Everything is mapped, etc. - no EFAULT in sight. However, it is all swapped out at the moment. And somebody else has that file mmapped; again, no pathological cases, not even in the same address space as writer, etc. File contains no zero bytes. Neither does the buffer we are writing. File position is 0 and file is considerably longer than 20Kb. We do a fault-in; fine, the first page (with one byte of useful data) is swapped in. We call ->write_begin(), then __copy_from_user_inatomic() (with pagefaults disabled) the first 4Kb into the page with index 0 in the file's page cache. Copy fails after 1 byte, since the next page is currently still swapped out. We advance by 1 byte and fault that page in; fine, now we'll copy 4095 bytes successfully, advance by 4095 and proceed to writing into the page with index 1 in file's page cache, etc. In the end everything works fine - no short writes, no EFAULT, all the data copied into file. However, _during_ the write the other process had seen something very odd - it had mmapped a zero-free file, it knows that nobody had been writing any zero-containing data into it, but it had seen zeroes come and go in the mmapped area. That "copied 1 byte" is actually "copied 1 byte, zeroed the next 4095 bytes". Sure, it's followed by "copy the next 4095 bytes over those zeroes", but only after the next chunk of buffer got swapped in. And if the writer got killed, the things are even nastier - these zeroes are *not* overwritten by subsequent data. Again, it's about as tame case as they come - no NULLs, no EFAULT, no buffers mmapped from the file we are writing to; just a normal write() replacing the data in the very beginning of file from a buffer that isn't page-aligned... ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Weird writev() behaviour on EFAULT - also successfully modifying the file 2016-08-11 23:03 ` Al Viro @ 2016-08-11 23:10 ` Linus Torvalds 2016-08-11 23:13 ` Linus Torvalds 1 sibling, 0 replies; 6+ messages in thread From: Linus Torvalds @ 2016-08-11 23:10 UTC (permalink / raw) To: Al Viro; +Cc: Tuomas Tynkkynen, linux-fsdevel On Thu, Aug 11, 2016 at 4:03 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Yes, but... it doesn't need to be a bad region at all. Look: we have a 20Kb > array of char starting at 0x....3ff. We feed it to write(). [..] > > However, _during_ the write the other process had seen something very odd - > it had mmapped a zero-free file, it knows that nobody had been writing any > zero-containing data into it, but it had seen zeroes come and go in the > mmapped area. That "copied 1 byte" is actually "copied 1 byte, zeroed the > next 4095 bytes". Yeah, we should probably be better about that. That's not the EFAULT case, and I agree that what we do there is not pretty. And yes, I'd be ok with the atomic from-user functions not clearing the end of the buffer, but we'd have to be very very careful about it. In particular, the "write_end()" functions would need to distinguish between the case of "we didn't actually write everything, but the page was up-to-date" and the "we didn't write everything, and the remainder _wasn't_ uptodate and now we really need to zero it before unlocking the page". There may not be all that many write_end() functions out there, but still.. The zeroing behavior does guarantee that we never have uninitialized memory from when a user copy fails. Linus ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Weird writev() behaviour on EFAULT - also successfully modifying the file 2016-08-11 23:03 ` Al Viro 2016-08-11 23:10 ` Linus Torvalds @ 2016-08-11 23:13 ` Linus Torvalds 1 sibling, 0 replies; 6+ messages in thread From: Linus Torvalds @ 2016-08-11 23:13 UTC (permalink / raw) To: Al Viro; +Cc: Tuomas Tynkkynen, linux-fsdevel On Thu, Aug 11, 2016 at 4:03 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Again, it's about as tame case as they come - no NULLs, no EFAULT, no > buffers mmapped from the file we are writing to; just a normal write() > replacing the data in the very beginning of file from a buffer that isn't > page-aligned... Side note: this whole "partial copy from user space" is a case that we have definitely gotten wrong in the past. That whole iov_iter_fault_in_readable() is an ugly hack, and basically avoids the whole situation of a partial user copy by just faulting things in early. Exactly because we have gotten things wrong in the past. And yes, it's not a proper fix - it's racy, but it means you'll never see the partial user access in practice. Linus ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2016-08-11 23:13 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-08-11 18:46 Weird writev() behaviour on EFAULT - also successfully modifying the file Tuomas Tynkkynen 2016-08-11 22:10 ` Al Viro 2016-08-11 22:39 ` Linus Torvalds 2016-08-11 23:03 ` Al Viro 2016-08-11 23:10 ` Linus Torvalds 2016-08-11 23:13 ` Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).