[heads-up][RFC] ext4_file_write() breakage

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [heads-up][RFC] ext4_file_write() breakage
@ 2014-04-03 16:37 Al Viro
  2014-04-04  2:55 ` Theodore Ts'o
  0 siblings, 1 reply; 7+ messages in thread
From: Al Viro @ 2014-04-03 16:37 UTC (permalink / raw)
  To: linux-ext4; +Cc: Theodore Ts'o, Eric Sandeen, linux-fsdevel

	There's a bunch of fun problems with ext4_file_write():

1) check for non-extent file trying to grow past the old size limit is
looking at position we say we'll be writing to.  generic_file_aio_write()
might have a different idea, though - just open with O_APPEND and watch
the sucker grow the file across the limit.

2) simply looking at file size in O_APPEND case instead of pos would not
close that one - file size is unstable at that point (we don't have any
locks held here).

3) ext4_unaligned_aio() suffers the same problem, but that's *not* the
only issue with it.  It checks that (O_DIRECT) aio write tries to hit
something aligned only to hw sector and not to block size.  Fine, but...
think what rlimit will do to us.  generic_write_checks() contains this:

	unsigned long limit = rlimit(RLIMIT_FSIZE);
	....
		if (limit != RLIM_INFINITY) {
			if (*pos >= limit) {
				send_sig(SIGXFSZ, current, 0);
				return -EFBIG;
			}
			if (*count > limit - (typeof(limit))*pos) {
				*count = limit - (typeof(limit))*pos;
			}
		}

and it's done only after we'd called ext4_unaligned_aio().  So it doesn't
predict whether the iovec seen by ->direct_IO() will be unaligned - there
are false negatives.  Even worse, consider an iovec that consists of
8 segments, 512 bytes each.  Starting offset in file is a multiple of block
size.  Everything's fine from ext4_unaligned_aio() POV, right?  And from
fs/direct-io.c one it's only sector-aligned sucker.  For a good reason,
since a segment in the middle of that thing might very well point to unmapped
memory, which will mean short write, with all zeroing issues ext4 is trying
to avoid here.

TBH, I'm seriously tempted to have ext4 ->direct_IO() to bail out on any
iovec that isn't block-aligned, and scratch the whole ext4_unaligned_aio()
logics.  It still leaves (1) and (2), but then we'll be able to lambda-expand
the call of generic_file_aio_write() and shift taking i_mutex up to the point
prior to checking for non-extent files trying to grow too much.

There might be kinder ways to deal with that mess, but I don't know the code
in ext4 dealing with unwritten extents anywhere near well enough to tell what's
feasible in that area.

Comments?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [heads-up][RFC] ext4_file_write() breakage
  2014-04-03 16:37 [heads-up][RFC] ext4_file_write() breakage Al Viro
@ 2014-04-04  2:55 ` Theodore Ts'o
  2014-04-04  6:11   ` Al Viro
  0 siblings, 1 reply; 7+ messages in thread
From: Theodore Ts'o @ 2014-04-04  2:55 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-ext4, Eric Sandeen, linux-fsdevel

On Thu, Apr 03, 2014 at 05:37:39PM +0100, Al Viro wrote:
> 2) simply looking at file size in O_APPEND case instead of pos would not
> close that one - file size is unstable at that point (we don't have any
> locks held here).
> 
> 3) ext4_unaligned_aio() suffers the same problem, but that's *not* the
> only issue with it.

So basically, we'll have to take i_mutex in order to check the file
size, which means there's no point with the ext4_unaligned_aio()
logics.  We can just take the i_mutex and then do the tests based on
i_size in ext4_file_dio_write()

>  It checks that (O_DIRECT) aio write tries to hit
> something aligned only to hw sector and not to block size.  Fine, but...
> think what rlimit will do to us.  generic_write_checks() contains this:
> 
> 	unsigned long limit = rlimit(RLIMIT_FSIZE);
> 	....
> 		if (limit != RLIM_INFINITY) {
> 			if (*pos >= limit) {
> 				send_sig(SIGXFSZ, current, 0);
> 				return -EFBIG;
> 			}
> 			if (*count > limit - (typeof(limit))*pos) {
> 				*count = limit - (typeof(limit))*pos;
> 			}
> 		}
> 
> and it's done only after we'd called ext4_unaligned_aio().  

Can we solve these problem by simply doing these tests in
ext4_file_dio_write(), so we modify pos/couint before we do the
ext4_unaligned_aio() checks?  We don't need i_mutex to do these
particular tests, right?

> So it doesn't
> predict whether the iovec seen by ->direct_IO() will be unaligned - there
> are false negatives.  Even worse, consider an iovec that consists of
> 8 segments, 512 bytes each.  Starting offset in file is a multiple of block
> size.  Everything's fine from ext4_unaligned_aio() POV, right?  And from
> fs/direct-io.c one it's only sector-aligned sucker.  For a good reason,
> since a segment in the middle of that thing might very well point to unmapped
> memory, which will mean short write, with all zeroing issues ext4 is trying
> to avoid here.

I'm not sure I understand the concern here.  The zeroing issues we're
concerned about is when two threads need to work on the same unwritten
block.  So if the pos and size are block aligned, this can't heppen.
What am I missing?

					- Ted

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [heads-up][RFC] ext4_file_write() breakage
  2014-04-04  2:55 ` Theodore Ts'o
@ 2014-04-04  6:11   ` Al Viro
  2014-04-05  3:15     ` Theodore Ts'o
  0 siblings, 1 reply; 7+ messages in thread
From: Al Viro @ 2014-04-04  6:11 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Eric Sandeen, linux-fsdevel

On Thu, Apr 03, 2014 at 10:55:59PM -0400, Theodore Ts'o wrote:
> On Thu, Apr 03, 2014 at 05:37:39PM +0100, Al Viro wrote:
> > 2) simply looking at file size in O_APPEND case instead of pos would not
> > close that one - file size is unstable at that point (we don't have any
> > locks held here).
> > 
> > 3) ext4_unaligned_aio() suffers the same problem, but that's *not* the
> > only issue with it.
> 
> So basically, we'll have to take i_mutex in order to check the file
> size, which means there's no point with the ext4_unaligned_aio()
> logics.  We can just take the i_mutex and then do the tests based on
> i_size in ext4_file_dio_write()

Can you hold it across ext4_unwritten_wait(), though?

> >  It checks that (O_DIRECT) aio write tries to hit
> > something aligned only to hw sector and not to block size.  Fine, but...
> > think what rlimit will do to us.  generic_write_checks() contains this:
> > 
> > 	unsigned long limit = rlimit(RLIMIT_FSIZE);
> > 	....
> > 		if (limit != RLIM_INFINITY) {
> > 			if (*pos >= limit) {
> > 				send_sig(SIGXFSZ, current, 0);
> > 				return -EFBIG;
> > 			}
> > 			if (*count > limit - (typeof(limit))*pos) {
> > 				*count = limit - (typeof(limit))*pos;
> > 			}
> > 		}
> > 
> > and it's done only after we'd called ext4_unaligned_aio().  
> 
> Can we solve these problem by simply doing these tests in
> ext4_file_dio_write(), so we modify pos/couint before we do the
> ext4_unaligned_aio() checks?  We don't need i_mutex to do these
> particular tests, right?

Yes, we do - O_APPEND, again ;-/

> > So it doesn't
> > predict whether the iovec seen by ->direct_IO() will be unaligned - there
> > are false negatives.  Even worse, consider an iovec that consists of
> > 8 segments, 512 bytes each.  Starting offset in file is a multiple of block
> > size.  Everything's fine from ext4_unaligned_aio() POV, right?  And from
> > fs/direct-io.c one it's only sector-aligned sucker.  For a good reason,
> > since a segment in the middle of that thing might very well point to unmapped
> > memory, which will mean short write, with all zeroing issues ext4 is trying
> > to avoid here.
> 
> I'm not sure I understand the concern here.  The zeroing issues we're
> concerned about is when two threads need to work on the same unwritten
> block.  So if the pos and size are block aligned, this can't heppen.
> What am I missing?

Thread A: write at offset 40M+512.  Unaligned as far as ext4_unaligned_aio()
is concerned, so it takes that mutex.

Thread B: write at offset 40M, with 8 512-byte segments in iovec.  The second
segment points to munmapped memory.  Same as 512-byte write at the same offset,
but not from the ext4_unaligned_aio() point of view.  It does *not* wait
for unwritten blocks resulting from A to be dealt with.

Area around 40M is still unwritten.  Apply Eric's scenario from the commit
that has introduced the whole "we need exclusion on unaligned aio" thing...

That, BTW, is one of the areas where we rely on blocks being less than
page-sized.  Aligned iovec will *not* have page boundaries inside the
pieces that will go into one block, so there we are guaranteed that we
won't end up with sub-block writes when we hit a VMA boundary in the
memory area we are trying to write from.

If iovec elements are not block-aligned, we can run into a short write due
to that effect.  And short write ending in the middle of a new block would
bloody better make sure to zero the rest of that block out, for obvious
reasons...

The mess happens if we have zero-the-rest-of-new-block logics trigger when
that block is, in reality, not new anymore.  I.e. when we have an earlier
write that has already returned from ->aio_write(), but still hasn't reached
the IO completion.  That's what this ext4_unwritten_wait(inode) is about,
as far as I understand the whole thing.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [heads-up][RFC] ext4_file_write() breakage
  2014-04-04  6:11   ` Al Viro
@ 2014-04-05  3:15     ` Theodore Ts'o
  2014-04-05  4:32       ` Al Viro
  2014-04-05  6:53       ` Al Viro
  0 siblings, 2 replies; 7+ messages in thread
From: Theodore Ts'o @ 2014-04-05  3:15 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-ext4, Eric Sandeen, linux-fsdevel

On Fri, Apr 04, 2014 at 07:11:07AM +0100, Al Viro wrote:
> > So basically, we'll have to take i_mutex in order to check the file
> > size, which means there's no point with the ext4_unaligned_aio()
> > logics.  We can just take the i_mutex and then do the tests based on
> > i_size in ext4_file_dio_write()
> 
> Can you hold it across ext4_unwritten_wait(), though?

Gah....  I'm not sure.  But ultimately, all of this is really about
O_APPEND being something horrible, right?

If we have to create a new mutex which is just used to serialized
O_APPEND writes, I'm OK with that.  The main thing that I really want
to preserve is to able to do parallel O_DIRECT writes in the the
non-O_APPEND case.

> Thread A: write at offset 40M+512.  Unaligned as far as ext4_unaligned_aio()
> is concerned, so it takes that mutex.
> 
> Thread B: write at offset 40M, with 8 512-byte segments in iovec.  The second
> segment points to munmapped memory.  Same as 512-byte write at the same offset,
> but not from the ext4_unaligned_aio() point of view.  It does *not* wait
> for unwritten blocks resulting from A to be dealt with.

Hang on a second.  What are you assuming the block size to be in this
example?  If the block size is 4k, then this doesn't make any sense,
because unmapped memory will be in units of the block size, so we
couldn't have the second 512 byte segment be unmapped.  Blocks are
unmaped, not individual 512 byte sectors.

If you are assuming the block size is 512 bytes, then it's not a
problem, since the entire block is unmapped, and dio_zero_block() can
operatate on the whole block.

(actually, ext4 doesn't support 512 byte blocks, but it does support
1024 block sizes.  but the same argument applies)

> Area around 40M is still unwritten.  Apply Eric's scenario from the commit
> that has introduced the whole "we need exclusion on unaligned aio" thing...

Right, but Eric's scneario was talking about unaligned *blocks* not
*pages*.

So his scenario was one where the block size was 4k, and the write was
unaligned with respect to the 4k block size.  For example, if with a
4k block size, we had one write starting at offset 0 with a size 512,
and at the same time another write starting at offset 2048 with a size
1024 bytes.  The problem is that we were doing two writes inside the
same *block*, and so if dio_zero_block() tried to operate on the same
block at the same time, bad things would happen.

Does that make sense?

							- Ted

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [heads-up][RFC] ext4_file_write() breakage
  2014-04-05  3:15     ` Theodore Ts'o
@ 2014-04-05  4:32       ` Al Viro
  2014-04-08  2:01         ` Theodore Ts'o
  2014-04-05  6:53       ` Al Viro
  1 sibling, 1 reply; 7+ messages in thread
From: Al Viro @ 2014-04-05  4:32 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Eric Sandeen, linux-fsdevel

On Fri, Apr 04, 2014 at 11:15:07PM -0400, Theodore Ts'o wrote:

> Hang on a second.  What are you assuming the block size to be in this
> example?  If the block size is 4k, then this doesn't make any sense,
> because unmapped memory will be in units of the block size, so we
> couldn't have the second 512 byte segment be unmapped.  Blocks are
> unmaped, not individual 512 byte sectors.

char *p = (char *)mmap(NULL, 8192, PROT_READ | PROT_WRITE, MAP_ANON, -1, 0);
struct iovec v[8];
memset(p, 'a', 4096);
munmap(p + 4096, 4096);
for (int i = 0; i < 8; i++)
	v[i] = (struct iovec){p + i * 512, 512};
v[1].iov_base = p + 4096;	/* unmapped */

The rest of feeding v to aio (with AIO_PWRITEV) is left as an exercise.

v[0] points to 512 bytes of RAM, all present (and filled with 'a').
v[1] points to the memory we'd just munmapped; trying to dereference it
would segfault, passing it to write() would give -EFAULT and passing the
entire array to writev(2) will result in short write - 512 bytes (all 'a')
written to file, return value is 512.

Unmapped memory is in 4K units, all right - and iovec elements are free to
point whereever they bloody please.  Sure, v[0].iov_base is 4K-aligned
and v[0].iov_len is 512, but that doesn't mean that v[1].iov_base can't point
into completely different page.  It does *not* have to be v[0].iov_base + 512.
That's the whole point of iovec, after all - ability to take an arbitrary
bunch of memory objects and write them all in one syscall, without having to
copy them into adjacent addresses...

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [heads-up][RFC] ext4_file_write() breakage
  2014-04-05  3:15     ` Theodore Ts'o
  2014-04-05  4:32       ` Al Viro
@ 2014-04-05  6:53       ` Al Viro
  1 sibling, 0 replies; 7+ messages in thread
From: Al Viro @ 2014-04-05  6:53 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Eric Sandeen, linux-fsdevel

On Fri, Apr 04, 2014 at 11:15:07PM -0400, Theodore Ts'o wrote:

> Right, but Eric's scneario was talking about unaligned *blocks* not
> *pages*.
> 
> So his scenario was one where the block size was 4k, and the write was
> unaligned with respect to the 4k block size.  For example, if with a
> 4k block size, we had one write starting at offset 0 with a size 512,
> and at the same time another write starting at offset 2048 with a size
> 1024 bytes.  The problem is that we were doing two writes inside the
> same *block*, and so if dio_zero_block() tried to operate on the same
> block at the same time, bad things would happen.
> 
> Does that make sense?

Make the first write take the following iovec array:
	{{good_pointer, 512}, {unmapped_pointer, 4096 - 512}}
and you'll get exact same scenario.   writev() on that iovec is the
same as write(fd, good_pointer, 512).  It certainly should not overwrite
the data at offsets greater than 512.

That's the whole point - it's possible to sneak an equivalent of what
ext4 considers an unaligned write (unaligned wrt fs blocks) past the
check in ext4_unaligned_aio().  You can pad a 512-byte write with
additional iovec segment that will *not* be written (->iov_base points
to something we'd just munmapped), so that the total iovec length looks
good, but passing that to writev()/pwritev()/AIO_PWRITEV will end up
with a short write.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [heads-up][RFC] ext4_file_write() breakage
  2014-04-05  4:32       ` Al Viro
@ 2014-04-08  2:01         ` Theodore Ts'o
  0 siblings, 0 replies; 7+ messages in thread
From: Theodore Ts'o @ 2014-04-08  2:01 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-ext4, Eric Sandeen, linux-fsdevel

On Sat, Apr 05, 2014 at 05:32:43AM +0100, Al Viro wrote:
> 
> char *p = (char *)mmap(NULL, 8192, PROT_READ | PROT_WRITE, MAP_ANON, -1, 0);
> struct iovec v[8];
> memset(p, 'a', 4096);
> munmap(p + 4096, 4096);
> for (int i = 0; i < 8; i++)
> 	v[i] = (struct iovec){p + i * 512, 512};
> v[1].iov_base = p + 4096;	/* unmapped */
> 
> The rest of feeding v to aio (with AIO_PWRITEV) is left as an exercise.
> 
> v[0] points to 512 bytes of RAM, all present (and filled with 'a').
> v[1] points to the memory we'd just munmapped; trying to dereference it
> would segfault, passing it to write() would give -EFAULT and passing the
> entire array to writev(2) will result in short write - 512 bytes (all 'a')
> written to file, return value is 512.

It's hard for me to stare the direct I/O code without my eyes
bleeding, but I'm not sure that we're not causing corruption for
another reason, without even needing to race.

If we have an error leading to a short write, after we bail out, we
don't zero the the rest of the block, but it looks like we also won't
clear the uninitialized bit, which means even though we would have
written out the short write, an attempt to read back of the file
should lead to all zeros.  I'll need to write some test cases to make
sure whether we're doing the right thing or not, but I'm worried we're
screwing up there.

Fortunately most DIO users tend to page align their buffers out of
paranoia, and they're not deliberately trying to induce failures and
check to make sure the failures are correct.

I do think we should force the use of the ext4_aio_mutex() if the file
is opened O_APPEND.  That way we force serialization if the DIO is
unaligned or in the O_APPEND case, which should fix a number of
problems.

						- Ted

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-04-08  2:01 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-04-03 16:37 [heads-up][RFC] ext4_file_write() breakage Al Viro
2014-04-04  2:55 ` Theodore Ts'o
2014-04-04  6:11   ` Al Viro
2014-04-05  3:15     ` Theodore Ts'o
2014-04-05  4:32       ` Al Viro
2014-04-08  2:01         ` Theodore Ts'o
2014-04-05  6:53       ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).