* [RFC] writev() semantics with invalid iovec in the middle
@ 2016-09-14 21:34 Al Viro
2016-09-15 10:23 ` Mike Marshall
0 siblings, 1 reply; 7+ messages in thread
From: Al Viro @ 2016-09-14 21:34 UTC
To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel
Right now writev() with a 3-iovec array that has an unmapped address in
the second element and a total length less than PAGE_SIZE will write the
first segment and stop there. Among other things, that guarantees a
short copy, and I would rather have it yield a 0-byte write (with -EFAULT
as the return value).
All POSIX has to say about that is this (in 2.3 Error Numbers):
[EFAULT]
Bad address. The system detected an invalid address in attempting to use
an argument of a call. The reliable detection of this error cannot be
guaranteed, and when not detected may result in the generation of a signal,
indicating an address violation, which is sent to the process.
Note that an unmapped page in the middle of a single buffer can already lead
to the same kind of short write, e.g. if we have
p = mmap(0, 3*4096, PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
munmap(p + 4096, 4096);
fd = open("/tmp/foo", O_CREAT|O_TRUNC|O_RDWR, 0777);
write(fd, p + 2048, 8192);
write() will yield -EFAULT rather than storing 2Kb. The same will happen with
writev(fd, &(struct iovec){p + 2048, 8192}, 1);
BTW, adding lseek(fd, 2049, SEEK_SET); before that write (or writev) will
result in 2047 bytes being written.
IOW, we do not try to squeeze every byte that can be squeezed out of the
buffer; generally, an unmapped address anywhere in PAGE_SIZE worth of data
that would go into the same page-aligned chunk of destination can result in
short write cut at the beginning of that chunk. iovec boundaries act
as barriers to short writes, mostly by accident.
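The scenario above can be reproduced with a small stand-alone helper. This is
a minimal sketch, not code from this thread: the name write_across_hole(), the
temporary file path, and the use of sysconf() in place of a hard-coded 4096
are assumptions made for illustration. It hands back whatever write()
returned, so the caller can see whether the kernel reported -EFAULT or a
short write.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS and mkstemp() */
#endif
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map three pages, unmap the middle one, then write
 * two pages' worth of data starting half a page into the first page.
 * Only the first page/2 bytes of the buffer are readable.
 *
 * Returns the result of write() with errno preserved, or -2 if the
 * setup (mmap, temp file, lseek) failed.
 */
ssize_t write_across_hole(off_t file_pos, size_t *valid_bytes)
{
    char path[] = "/tmp/write-hole-XXXXXX";   /* assumed scratch location */
    long page = sysconf(_SC_PAGESIZE);
    char *p;
    ssize_t n;
    int fd, saved;

    assert(page > 0);

    p = mmap(NULL, 3 * page, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (p == MAP_FAILED)
        return -2;
    if (munmap(p + page, page) != 0)          /* punch the hole */
        return -2;

    fd = mkstemp(path);
    if (fd < 0)
        return -2;
    unlink(path);                             /* file vanishes on close */
    if (lseek(fd, file_pos, SEEK_SET) < 0) {
        close(fd);
        return -2;
    }

    *valid_bytes = page / 2;
    n = write(fd, p + page / 2, 2 * page);    /* runs into the hole */
    saved = errno;

    close(fd);
    munmap(p, page);
    munmap(p + 2 * page, page);
    errno = saved;
    return n;
}
```

Calling write_across_hole(0, &valid) corresponds to the first case above;
passing page/2 + 1 as file_pos corresponds to the lseek(fd, 2049, SEEK_SET)
variant on a 4K-page machine. The exact result depends on the kernel version
and the filesystem behind /tmp; the only thing a caller can rely on is that
no more than valid bytes get written.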
Do we need to preserve that special treatment of iovec boundaries? I would
really like to get rid of that - the current behaviour is an easy and reliable
way to trigger a short copy case in ->write_end() and those are fairly
brittle. Sure, we still need to cope with them, and I think I've got all
instances in the current mainline fixed, but they are often suboptimal.
Objections?
* Re: [RFC] writev() semantics with invalid iovec in the middle
2016-09-14 21:34 [RFC] writev() semantics with invalid iovec in the middle Al Viro
@ 2016-09-15 10:23 ` Mike Marshall
2016-09-15 22:29 ` Al Viro
0 siblings, 1 reply; 7+ messages in thread
From: Mike Marshall @ 2016-09-15 10:23 UTC
To: Al Viro; +Cc: Linus Torvalds, LKML, linux-fsdevel
If you squeeze out every byte won't you still have a short
write? And the written data wouldn't be cut at the bad
place, but it would have a weird hole or discontinuity there.
-Mike
* Re: [RFC] writev() semantics with invalid iovec in the middle
2016-09-15 10:23 ` Mike Marshall
@ 2016-09-15 22:29 ` Al Viro
2016-09-15 22:32 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Al Viro @ 2016-09-15 22:29 UTC
To: Mike Marshall; +Cc: Linus Torvalds, LKML, linux-fsdevel
On Thu, Sep 15, 2016 at 06:23:24AM -0400, Mike Marshall wrote:
> If you squeeze out every byte won't you still have a short
> write? And the written data wouldn't be cut at the bad
> place, but it would have a weird hole or discontinuity there.
???
What I mean is that if we have an invalid address in the middle of a buffer
(unmapped, for example), we do not attempt to write every byte prior to that
invalid address. Of course what we write is going to be contiguous.
Suppose we have a buffer spanning 10 pages (amd64, so these are 4K ones) -
7 valid, 3 invalid:
VVVVIIIVVV
and it starts 100 bytes into the first page. And the write goes into a regular
file on e.g. tmpfs, starting at offset 31. We _can't_ write more than
4*4096-100 bytes, no matter what. It will be a short write. As a matter
of fact, it will be even shorter than that - it will be 3*4096-31 bytes,
up to the last pagecache boundary we can cover completely. That obviously
depends upon the filesystem - not everything uses pagecache, for starters.
However, the caller is *not* guaranteed that write() with an invalid page
in the middle of a buffer would write everything up to the very beginning
of the invalid page. A short write will happen, but the amount written
might be up to page size less than the actual length of valid part in the
beginning of the buffer.
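The arithmetic above can be checked with a small model. This is only a sketch
of the rule as described in this message for a pagecache-backed file (data is
committed one destination page-aligned chunk at a time, and a chunk that
cannot be filled completely is dropped); it is not the kernel code, and the
function name and parameters are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the shortening rule described above.
 *
 * buf_off     - offset of the buffer start within its first page
 * valid_pages - number of readable pages before the first invalid one
 * file_pos    - file offset at which the write starts
 * page_size   - page size in bytes (4096 in the example)
 *
 * The request is assumed to extend past the readable part of the buffer.
 * Returns the number of bytes the modeled write would store; 0 stands
 * for the case where nothing is written and -EFAULT is returned.
 */
size_t modeled_short_write(size_t buf_off, size_t valid_pages,
                           size_t file_pos, size_t page_size)
{
    size_t valid, written = 0;

    assert(page_size > 0 && buf_off < page_size && valid_pages >= 1);

    /* Readable bytes from the buffer start up to the first invalid page. */
    valid = valid_pages * page_size - buf_off;

    for (;;) {
        /* Bytes needed to reach the next page boundary in the file. */
        size_t chunk = page_size - (file_pos + written) % page_size;

        if (chunk > valid - written)
            break;              /* cannot fill this chunk: stop before it */
        written += chunk;
    }
    return written;
}
```

With the numbers from this message, modeled_short_write(100, 4, 31, 4096)
gives 3*4096-31 = 12257, short of the 4*4096-100 = 16284 readable bytes. The
same model matches the example from the first message:
modeled_short_write(2048, 1, 0, 4096) gives 0 (the -EFAULT case) and
modeled_short_write(2048, 1, 2049, 4096) gives 2047.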
Now, for writev() we could have invalid pages in any iovec; again, we
obviously can't write anything past the first invalid page - we'll get
either a short write or -EFAULT (if nothing got written). That's fine;
the question is what the caller can count upon wrt shortening.
Again, we are *not* guaranteed to write up to the exact boundary. However, the
current implementation will end up shortening no more than to the iovec
boundary. I.e. if the first iovec contains only valid pages and there's
an invalid one in the second iovec, the current implementation will write
at least everything in the first iovec. That's _not_ promised by POSIX
or our manpages; moreover, I'm not sure if it's even true for each filesystem.
And keeping that property is actually inconvenient - if we could discard it,
we could make partial-copy ->write_end() calls a lot more infrequent.
Unfortunately, some of LTP writev tests end up checking that writev() does
behave that way - they feed it a three-element iovec with shorter-than-page
segments, the second of which is all invalid. And they check that the
entire first segment had been written.
I would really like to drop that property, making it "if some addresses
in the buffer(s) we are asked to write are invalid, the write will be
shortened by up to a PAGE_SIZE from the first such invalid address", making
writev() rules exactly the same as write() ones. Does anybody have objections
to it?
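For reference, the kind of call being discussed looks roughly like the
following. This is a minimal sketch and not the LTP test itself: the helper
name, the 64-byte segment length, and the temporary file path are
assumptions, and the bad segment is produced with a PROT_NONE mapping rather
than an unmapped address so that nothing else can end up mapped there.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS and mkstemp() */
#endif
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

#define SEG_LEN 64               /* assumed: three segments, well under a page */

/*
 * Hypothetical helper: writev() three short segments, the second of
 * which points at inaccessible memory.
 *
 * Returns the result of writev() with errno preserved, or -2 if the
 * setup failed.
 */
ssize_t writev_bad_middle(void)
{
    static char first[SEG_LEN], third[SEG_LEN];
    char path[] = "/tmp/writev-mid-XXXXXX";   /* assumed scratch location */
    long page = sysconf(_SC_PAGESIZE);
    struct iovec iov[3];
    char *bad;
    ssize_t n;
    int fd, saved;

    assert(page > 3 * SEG_LEN);  /* total length stays below the page size */
    memset(first, 'a', sizeof(first));
    memset(third, 'c', sizeof(third));

    /* A page that exists in the address space but cannot be read. */
    bad = mmap(NULL, page, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (bad == MAP_FAILED)
        return -2;

    fd = mkstemp(path);
    if (fd < 0) {
        munmap(bad, page);
        return -2;
    }
    unlink(path);

    iov[0].iov_base = first; iov[0].iov_len = SEG_LEN;
    iov[1].iov_base = bad;   iov[1].iov_len = SEG_LEN;   /* invalid */
    iov[2].iov_base = third; iov[2].iov_len = SEG_LEN;

    n = writev(fd, iov, 3);
    saved = errno;

    close(fd);
    munmap(bad, page);
    errno = saved;
    return n;
}
```

Under the behaviour described as current in this thread, the call would
return SEG_LEN (the whole first segment); under the proposed rule it would
fail with -EFAULT. A portable caller should be prepared for either outcome.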
* Re: [RFC] writev() semantics with invalid iovec in the middle
2016-09-15 22:29 ` Al Viro
@ 2016-09-15 22:32 ` Linus Torvalds
2016-09-15 22:32 ` Cedric Blancher
2016-09-16 13:25 ` One Thousand Gnomes
2 siblings, 0 replies; 7+ messages in thread
From: Linus Torvalds @ 2016-09-15 22:32 UTC
To: Al Viro; +Cc: Mike Marshall, LKML, linux-fsdevel
On Thu, Sep 15, 2016 at 3:29 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Unfortunately, some of LTP writev tests end up checking that writev() does
> behave that way - they feed it a three-element iovec with shorter-than-page
> segments, the second of which is all invalid. And they check that the
> entire first segment had been written.
>
> I would really like to drop that property,
I'm pretty sure you can and should do that.
The LTP test people have actually been pretty good about just fixing
their tests when they cause problems and there is no reason for the
particular behavior.
Linus
* Re: [RFC] writev() semantics with invalid iovec in the middle
2016-09-15 22:29 ` Al Viro
2016-09-15 22:32 ` Linus Torvalds
@ 2016-09-15 22:32 ` Cedric Blancher
2016-09-16 13:25 ` One Thousand Gnomes
2 siblings, 0 replies; 7+ messages in thread
From: Cedric Blancher @ 2016-09-15 22:32 UTC
To: Al Viro; +Cc: Mike Marshall, Linus Torvalds, LKML, linux-fsdevel
PAGE_SIZE isn't accurate on architectures that support multiple page
sizes, like 8k, 64k, 512k, 4M, 32M, 256M on SPARC64, and the same on
PPC64/Power.
Ced
--
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur
* Re: [RFC] writev() semantics with invalid iovec in the middle
2016-09-15 22:29 ` Al Viro
2016-09-15 22:32 ` Linus Torvalds
2016-09-15 22:32 ` Cedric Blancher
@ 2016-09-16 13:25 ` One Thousand Gnomes
2016-09-16 18:36 ` Linus Torvalds
2 siblings, 1 reply; 7+ messages in thread
From: One Thousand Gnomes @ 2016-09-16 13:25 UTC
To: Al Viro; +Cc: Mike Marshall, Linus Torvalds, LKML, linux-fsdevel
> Unfortunately, some of LTP writev tests end up checking that writev() does
> behave that way - they feed it a three-element iovec with shorter-than-page
> segments, the second of which is all invalid. And they check that the
> entire first segment had been written.
1003.1 says
"Each iovec entry specifies the base address and length of an area in
memory from which data should be written. The writev() function shall
always write a complete area before proceeding to the next."
and I imagine that is what LTP is attempting to test.
The moment you pass an invalid address you are in the land of undefined
behaviour, so I would read the standard as actually trying to deal with
the behaviour in defined situations (eg out of disk space mid writev()).
Alan
* Re: [RFC] writev() semantics with invalid iovec in the middle
2016-09-16 13:25 ` One Thousand Gnomes
@ 2016-09-16 18:36 ` Linus Torvalds
0 siblings, 0 replies; 7+ messages in thread
From: Linus Torvalds @ 2016-09-16 18:36 UTC
To: One Thousand Gnomes; +Cc: Al Viro, Mike Marshall, LKML, linux-fsdevel
On Fri, Sep 16, 2016 at 6:25 AM, One Thousand Gnomes
<gnomes@lxorguk.ukuu.org.uk> wrote:
>
> 1003.1 says
>
> "Each iovec entry specifies the base address and length of an area in
> memory from which data should be written. The writev() function shall
> always write a complete area before proceeding to the next."
>
> and I imagine that is what LTP is attempting to test.
Ahh. Yes. But as you note, the EFAULT case is undefined behavior, so
what that POSIX language is *really* about is presumably making sure
that readers of a file cannot see the "later" writes without seeing
the earlier ones.
So you cannot do some fancy threaded thing where you do different
iovec parts concurrently, because that could be seen by a reader (or
more likely mmap) as doing the writes out of order.
Or, as you mention, the disk-full case.
Linus