* fallocate mode flag for "unshare blocks"?
@ 2016-03-30 18:27 ` Darrick J. Wong  (22+ messages in thread)
From: Darrick J. Wong
To: Christoph Hellwig
Cc: xfs, linux-fsdevel, linux-btrfs, linux-api

Hi all,

Christoph and I have been working on adding reflink and CoW support to
XFS recently.  Since the purpose of (mode 0) fallocate is to make sure
that future file writes cannot ENOSPC, I extended the XFS fallocate
handler to unshare any shared blocks via the copy-on-write mechanism I
built for it.  However, Christoph shared the following concerns with me
about that interpretation:

> I know that I suggested unsharing blocks on fallocate, but it turns out
> this is causing problems.  Applications expect falloc to be a fast
> metadata operation, and copying a potentially large number of blocks
> works against that expectation.  This is especially bad for the NFS
> server, which should not be blocked for a long time in a synchronous
> operation.
>
> I think we'll have to remove the unshare and just fail the fallocate
> for a reflinked region for now.  I still think it makes sense to expose
> an unshare operation, and we probably should make that another
> fallocate mode.

With that in mind, how do you all think we ought to resolve this?
Should we add a new fallocate mode flag that means "unshare the shared
blocks"?  Obviously, such an unshare flag could not be used in
conjunction with hole punching, zero range, insert range, or collapse
range.  Failing the fallocate for reflinked regions, though, breaks the
expectation that writing to a file after fallocate won't ENOSPC.

Or is it ok that fallocate could block, potentially for a long time, as
we stream CoWs through the page cache (or however unshare works
internally)?  Those same programs might not be expecting fallocate to
take a long time.

Can we do better than either solution?
It occurs to me that XFS does unshare by reading the file data into the
page cache, marking the pages dirty, and flushing the dirty pages;
performance could be improved by skipping the flush at the end.  We
won't ENOSPC, because the XFS delalloc system is careful enough to check
that there are enough free blocks to handle both the allocation and the
metadata updates.  The only gap in this scheme that I can see is if we
fallocate, crash, and upon restart the program then tries to write
without retrying the fallocate.  Can we trade some performance for the
added requirement that we must fallocate -> write -> fsync, and retry
the trio if we crash before the fsync returns?  I think that's already
an implicit requirement, so we might be ok here.

Opinions?  I rather like the last option, though I've only just thought
of it and have not had time to examine it thoroughly, and it's specific
to XFS. :)

--D
* Re: fallocate mode flag for "unshare blocks"?
From: Austin S. Hemmelgarn @ 2016-03-30 18:58 UTC
To: Darrick J. Wong, Christoph Hellwig
Cc: xfs, linux-fsdevel, linux-btrfs, linux-api

On 2016-03-30 14:27, Darrick J. Wong wrote:
> [...]
> Or is it ok that fallocate could block, potentially for a long time, as
> we stream CoWs through the page cache (or however unshare works
> internally)?  Those same programs might not be expecting fallocate to
> take a long time.

Nothing that I can find in the man pages or API documentation for
Linux's fallocate explicitly says that it will be fast.  There are bits
that say it should be efficient, but that is not itself well defined
(given the context, I would assume it to mean that it doesn't use as
much I/O as writing out that many bytes of zero data, not necessarily
that it will return quickly).  We may have done a lot to make it fast,
but that doesn't mean by any measure that we guarantee it anywhere (at
least, not anywhere I can find).

> Can we do better than either solution?  It occurs to me that XFS does
> unshare by reading the file data into the page cache, marking the
> pages dirty, and flushing the dirty pages [...]  Can we trade some
> performance for the added requirement that we must fallocate -> write
> -> fsync, and retry the trio if we crash before the fsync returns?  I
> think that's already an implicit requirement, so we might be ok here.

Most of the software I've seen that doesn't use fallocate like this is
either doing odd things otherwise, or is just making sure it has space
for temporary files, so I think it is probably safe to require this.

> Opinions?  I rather like the last option, though I've only just
> thought of it and have not had time to examine it thoroughly, and it's
> specific to XFS. :)

Personally, I'm indifferent about how we handle it, as long as it still
maintains the normal semantics and works for reflinked ranges
(seemingly arbitrary failures for a range in a file should be handled
properly by an application, but that doesn't mean we shouldn't try to
reduce their occurrence).  I would like an fallocate option to force a
range to become unshared, but I feel we should have that alongside the
regular functionality, not in place of it.

It's probably also worth noting that reflinks technically break
expectations WRT FALLOC_FL_PUNCH_HOLE already.  Most apps I see that
use PUNCH_HOLE seem to expect it to free space, which won't happen if
the range is reflinked elsewhere.  There is of course nothing that says
it will free space, but that doesn't change user expectations.
* Re: fallocate mode flag for "unshare blocks"?
From: Christoph Hellwig @ 2016-03-31  7:58 UTC
To: Austin S. Hemmelgarn
Cc: Darrick J. Wong, Christoph Hellwig, xfs, linux-fsdevel, linux-btrfs, linux-api

On Wed, Mar 30, 2016 at 02:58:38PM -0400, Austin S. Hemmelgarn wrote:
> Nothing that I can find in the man pages or API documentation for
> Linux's fallocate explicitly says that it will be fast.  There are
> bits that say it should be efficient, but that is not itself well
> defined [...]

And that's pretty much as narrow a definition as we get.  But
apparently gfs2 already breaks that expectation :(

>> [...] Can we trade some performance for the added requirement that
>> we must fallocate -> write -> fsync, and retry the trio if we crash
>> before the fsync returns?  I think that's already an implicit
>> requirement, so we might be ok here.
> Most of the software I've seen that doesn't use fallocate like this is
> either doing odd things otherwise, or is just making sure it has space
> for temporary files, so I think it is probably safe to require this.

posix_fallocate guarantees that you don't get ENOSPC from the write,
and there is plenty of software relying on that, or crashing / causing
data integrity problems that way.
* Re: fallocate mode flag for "unshare blocks"?
From: Austin S. Hemmelgarn @ 2016-03-31 11:13 UTC
To: Christoph Hellwig
Cc: Darrick J. Wong, xfs, linux-fsdevel, linux-btrfs, linux-api

On 2016-03-31 03:58, Christoph Hellwig wrote:
> And that's pretty much as narrow a definition as we get.  But
> apparently gfs2 already breaks that expectation :(
GFS2 breaks other expectations as well (mostly stuff with locking) in
arguably more significant ways, so I would not personally consider it
to be precedent for breaking this on other filesystems.

> posix_fallocate guarantees that you don't get ENOSPC from the write,
> and there is plenty of software relying on that, or crashing / causing
> data integrity problems that way.
posix_fallocate is not the same thing as the fallocate syscall.  It's
there for compatibility, it has less functionality, and most
importantly, it _can_ be slow (because at least glibc will emulate it
if the underlying FS doesn't support fallocate, which means it's no
faster than just using dd).
* Re: fallocate mode flag for "unshare blocks"?
From: Liu Bo @ 2016-03-31  0:32 UTC
To: Darrick J. Wong
Cc: Christoph Hellwig, xfs, linux-fsdevel, linux-btrfs, linux-api

On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> [...]
>> I think we'll have to remove the unshare and just fail the fallocate
>> for a reflinked region for now.  I still think it makes sense to
>> expose an unshare operation, and we probably should make that
>> another fallocate mode.

I'm expecting fallocate to be fast, too.

Well, btrfs fallocate doesn't allocate space if it's a shared one,
because it thinks the space is already allocated.  So a later overwrite
over this shared extent may hit ENOSPC errors.

> With that in mind, how do you all think we ought to resolve this?
> Should we add a new fallocate mode flag that means "unshare the shared
> blocks"? [...]
>
> Opinions?  I rather like the last option, though I've only just
> thought of it and have not had time to examine it thoroughly, and it's
> specific to XFS. :)

I'd vote for another mode for 'unshare the shared blocks'.

Thanks,

-liubo
* Re: fallocate mode flag for "unshare blocks"?
From: Christoph Hellwig @ 2016-03-31  7:55 UTC
To: Liu Bo
Cc: Darrick J. Wong, Christoph Hellwig, xfs, linux-fsdevel, linux-btrfs, linux-api

On Wed, Mar 30, 2016 at 05:32:42PM -0700, Liu Bo wrote:
> Well, btrfs fallocate doesn't allocate space if it's a shared one,
> because it thinks the space is already allocated.  So a later
> overwrite over this shared extent may hit ENOSPC errors.

And this makes it an incorrect implementation of posix_fallocate,
which glibc implements using fallocate if available.
* Re: fallocate mode flag for "unshare blocks"?
From: Andreas Dilger @ 2016-03-31 15:31 UTC
To: Christoph Hellwig
Cc: Liu Bo, Darrick J. Wong, xfs, linux-fsdevel, linux-btrfs, linux-api

On Mar 31, 2016, at 1:55 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Mar 30, 2016 at 05:32:42PM -0700, Liu Bo wrote:
>> Well, btrfs fallocate doesn't allocate space if it's a shared one,
>> because it thinks the space is already allocated.  So a later
>> overwrite over this shared extent may hit ENOSPC errors.
>
> And this makes it an incorrect implementation of posix_fallocate,
> which glibc implements using fallocate if available.

It isn't really useful for a CoW filesystem to implement fallocate()
to reserve blocks.  Even if it did allocate all of the blocks on the
initial fallocate() call, when it comes time to overwrite these blocks,
new blocks need to be allocated, as the old ones will not be
overwritten.

Because of snapshots that could hold references to the old blocks,
there isn't even a guarantee that the previously fallocated blocks will
be released in a reasonable time to free up an equal amount of space.

Cheers, Andreas
* Re: fallocate mode flag for "unshare blocks"?
From: Austin S. Hemmelgarn @ 2016-03-31 15:43 UTC
To: Andreas Dilger, Christoph Hellwig
Cc: Liu Bo, Darrick J. Wong, xfs, linux-fsdevel, linux-btrfs, linux-api

On 2016-03-31 11:31, Andreas Dilger wrote:
> It isn't really useful for a CoW filesystem to implement fallocate()
> to reserve blocks.  Even if it did allocate all of the blocks on the
> initial fallocate() call, when it comes time to overwrite these
> blocks, new blocks need to be allocated, as the old ones will not be
> overwritten.
>
> Because of snapshots that could hold references to the old blocks,
> there isn't even a guarantee that the previously fallocated blocks
> will be released in a reasonable time to free up an equal amount of
> space.

That really depends on how it's done.  AFAIK, unwritten extents on
BTRFS are block reservations which make sure that you can write there
(IOW, the unwritten extent gets converted to a regular extent in place,
not via CoW).  This means that it is possible to guarantee that the
first write to that area will work, which is technically all that POSIX
requires.  This in turn means that stuff like systemd and RDBMS
software don't exactly see things working as they expect them to, but
that's because they make assumptions based on existing technology.
* Re: fallocate mode flag for "unshare blocks"?
From: Henk Slager @ 2016-03-31 16:47 UTC
To: Andreas Dilger
Cc: Christoph Hellwig, Liu Bo, Darrick J. Wong, xfs, linux-fsdevel, linux-btrfs, linux-api

On Thu, Mar 31, 2016 at 5:31 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> It isn't really useful for a CoW filesystem to implement fallocate()
> to reserve blocks.  Even if it did allocate all of the blocks on the
> initial fallocate() call, when it comes time to overwrite these
> blocks, new blocks need to be allocated, as the old ones will not be
> overwritten.

There are also use cases on BTRFS with CoW disabled, like operations on
virtual machine images that aren't snapshotted.  Those files tend to be
big, and having fallocate() implemented and working as it does on e.g.
XFS, in order to achieve space and speed efficiency, makes sense IMHO.
* Re: fallocate mode flag for "unshare blocks"?
From: Austin S. Hemmelgarn @ 2016-03-31 11:18 UTC
To: bo.li.liu, Darrick J. Wong
Cc: Christoph Hellwig, xfs, linux-fsdevel, linux-btrfs, linux-api

On 2016-03-30 20:32, Liu Bo wrote:
> I'm expecting fallocate to be fast, too.
>
> Well, btrfs fallocate doesn't allocate space if it's a shared one,
> because it thinks the space is already allocated.  So a later
> overwrite over this shared extent may hit ENOSPC errors.

And this _really_ should get fixed, otherwise glibc will add a check
for running posix_fallocate against BTRFS and force emulation, and
people _will_ complain about performance.
* Re: fallocate mode flag for "unshare blocks"?
From: Austin S. Hemmelgarn @ 2016-03-31 11:38 UTC
To: bo.li.liu
Cc: Darrick J. Wong, Christoph Hellwig, xfs, linux-fsdevel, linux-btrfs, linux-api

On 2016-03-31 07:18, Austin S. Hemmelgarn wrote:
> On 2016-03-30 20:32, Liu Bo wrote:
>> Well, btrfs fallocate doesn't allocate space if it's a shared one,
>> because it thinks the space is already allocated.  So a later
>> overwrite over this shared extent may hit ENOSPC errors.
> And this _really_ should get fixed, otherwise glibc will add a check
> for running posix_fallocate against BTRFS and force emulation, and
> people _will_ complain about performance.

Thinking a bit further about this: how hard would it be to add the
ability to have unwritten extents point somewhere else for reads?  Then
when we get an fallocate call, we create the unwritten extents and add
the metadata to make them read from the shared region.  When a write
gets issued to such an extent, the parts of the block that aren't being
written get copied, the write happens, and the link for that block gets
removed.  This way, fallocate would still provide the correct
semantics, it would be relatively fast (still not quite as fast as it
is now, but nowhere near as slow as copying the data), and the cost of
copying gets amortized across writes (we may not need to copy
everything, and we'll still copy less than we would by just unsharing
the whole extent up front).

This would of course need to be an incompat feature, but I would
personally say that's not much of an issue, as things are subtly broken
in the common use case right now (at this point I'm just thinking of
BTRFS, as what Darrick suggested for XFS seems to be a better solution
there, at least short term).
* Re: fallocate mode flag for "unshare blocks"?
From: Liu Bo @ 2016-03-31 19:52 UTC
To: Austin S. Hemmelgarn
Cc: Darrick J. Wong, Christoph Hellwig, xfs, linux-fsdevel, linux-btrfs, linux-api

On Thu, Mar 31, 2016 at 07:18:55AM -0400, Austin S. Hemmelgarn wrote:
> On 2016-03-30 20:32, Liu Bo wrote:
>> [...]
>> Well, btrfs fallocate doesn't allocate space if it's a shared one,
>> because it thinks the space is already allocated.  So a later
>> overwrite over this shared extent may hit ENOSPC errors.
> And this _really_ should get fixed, otherwise glibc will add a check
> for running posix_fallocate against BTRFS and force emulation, and
> people _will_ complain about performance.

Even if glibc adds a check like that and emulates fallocate by writing
zeroes to real blocks, btrfs still does CoW and needs to allocate space
for new writes, so it's not only a performance problem; we can still
get ENOSPC in extreme cases.

Thanks,

-liubo
* Re: fallocate mode flag for "unshare blocks"?
From: Dave Chinner @ 2016-03-31  1:18 UTC
To: Darrick J. Wong
Cc: Christoph Hellwig, xfs, linux-fsdevel, linux-btrfs, linux-api

On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> Or is it ok that fallocate could block, potentially for a long time,
> as we stream CoWs through the page cache (or however unshare works
> internally)?  Those same programs might not be expecting fallocate to
> take a long time.

Yes, it's perfectly fine for fallocate to block for long periods of
time.  See what gfs2 does during preallocation of blocks - it ends up
calling sb_issue_zeroout() because it doesn't have unwritten extents,
and hence can block for long periods of time....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: fallocate mode flag for "unshare blocks"?
From: Christoph Hellwig @ 2016-03-31  7:54 UTC
To: Dave Chinner
Cc: Darrick J. Wong, Christoph Hellwig, xfs, linux-fsdevel, linux-btrfs, linux-api

On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> Yes, it's perfectly fine for fallocate to block for long periods of
> time.  See what gfs2 does during preallocation of blocks - it ends up
> calling sb_issue_zeroout() because it doesn't have unwritten extents,
> and hence can block for long periods of time....

gfs2 fallocate is an implementation that will cause all but the most
trivial users real pain.  Even the initial XFS implementation, which
just marked the transactions synchronous, was unusable for all kinds
of applications, and this is much worse.  E.g. an NFS ALLOCATE
operation to gfs2 will probably hang your connection for extended
periods of time.

If we need to support something like what gfs2 does, we should have a
separate flag for it.
* Re: fallocate mode flag for "unshare blocks"? 2016-03-31 7:54 ` Christoph Hellwig @ 2016-03-31 11:18 ` Dave Chinner 2016-03-31 18:08 ` J. Bruce Fields 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2016-03-31 11:18 UTC (permalink / raw) To: Christoph Hellwig Cc: Darrick J. Wong, xfs, linux-fsdevel, linux-btrfs, linux-api On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote: > On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote: > > On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote: > > > Or is it ok that fallocate could block, potentially for a long time as > > > we stream cows through the page cache (or however unshare works > > > internally)? Those same programs might not be expecting fallocate to > > > take a long time. > > > > Yes, it's perfectly fine for fallocate to block for long periods of > > time. See what gfs2 does during preallocation of blocks - it ends up > > calling sb_issue_zerout() because it doesn't have unwritten > > extents, and hence can block for long periods of time.... > > gfs2 fallocate is an implementation that will cause all but the most > trivial users real pain. Even the initial XFS implementation just > marking the transactions synchronous made it unusable for all kinds > of applications, and this is much worse. E.g. a NFS ALLOCATE operation > to gfs2 will probab;ly hand your connection for extended periods of > time. > > If we need to support something like what gfs2 does we should have a > separate flag for it. Using fallocate() for preallocation was always intended to be a faster, more efficient method allocating zeroed space than having userspace write blocks of data. Faster, more efficient does not mean instantaneous, and gfs2 using sb_issue_zerout() means that if the hardware has zeroing offloads (deterministic trim, write same, etc) it will use them, and that will be much faster than writing zeros from userspace. 
IMO, what gfs2 does is definitely within the intended usage of fallocate() for accelerating the preallocation of blocks. Yes, it may not be optimal for things like NFS servers which haven't considered that a fallocate based offload operation might take some time to execute, but that's not a problem with fallocate. i.e. that's a problem with the nfs server ALLOCATE implementation not being prepared to return NFSERR_JUKEBOX to prevent client side hangs and timeouts while the operation is run.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: fallocate mode flag for "unshare blocks"? 2016-03-31 11:18 ` Dave Chinner @ 2016-03-31 18:08 ` J. Bruce Fields 2016-03-31 18:19 ` Darrick J. Wong 2016-03-31 19:47 ` Andreas Dilger 0 siblings, 2 replies; 22+ messages in thread From: J. Bruce Fields @ 2016-03-31 18:08 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Darrick J. Wong, xfs, linux-fsdevel, linux-btrfs, linux-api On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote: > On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote: > > On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote: > > > On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote: > > > > Or is it ok that fallocate could block, potentially for a long time as > > > > we stream cows through the page cache (or however unshare works > > > > internally)? Those same programs might not be expecting fallocate to > > > > take a long time. > > > > > > Yes, it's perfectly fine for fallocate to block for long periods of > > > time. See what gfs2 does during preallocation of blocks - it ends up > > > calling sb_issue_zerout() because it doesn't have unwritten > > > extents, and hence can block for long periods of time.... > > > > gfs2 fallocate is an implementation that will cause all but the most > > trivial users real pain. Even the initial XFS implementation just > > marking the transactions synchronous made it unusable for all kinds > > of applications, and this is much worse. E.g. a NFS ALLOCATE operation > > to gfs2 will probab;ly hand your connection for extended periods of > > time. > > > > If we need to support something like what gfs2 does we should have a > > separate flag for it. > > Using fallocate() for preallocation was always intended to > be a faster, more efficient method allocating zeroed space > than having userspace write blocks of data. 
Faster, more efficient > does not mean instantaneous, and gfs2 using sb_issue_zerout() means > that if the hardware has zeroing offloads (deterministic trim, write > same, etc) it will use them, and that will be much faster than > writing zeros from userspace. > > IMO, what gfs2 is definitely within the intended usage of > fallocate() for accelerating the preallocation of blocks. > > Yes, it may not be optimal for things like NFS servers which haven't > considered that a fallocate based offload operation might take some > time to execute, but that's not a problem with fallocate. i.e. > that's a problem with the nfs server ALLOCATE implementation not > being prepared to return NFSERR_JUKEBOX to prevent client side hangs > and timeouts while the operation is run.... That's an interesting idea, but I don't think it's really legal. I take JUKEBOX to mean "sorry, I'm failing this operation for now, try again later and it might succeed", not "OK, I'm working on it, try again and you may find out I've done it". So if the client gets a JUKEBOX error but the server goes ahead and does the operation anyway, that'd be unexpected. I suppose it's comparable to the case where a slow fallocate is interrupted--would it be legal to return EINTR in that case and leave the application to sort out whether some part of the allocation had already happened? Would it be legal to continue the fallocate under the covers even after returning EINTR? But anyway my first inclination is to say that the NFS FALLOCATE protocol just wasn't designed to handle long-running fallocates, and if we really need that then we need to give it a way to either report partial results or to report results asynchronously. --b. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: fallocate mode flag for "unshare blocks"? 2016-03-31 18:08 ` J. Bruce Fields @ 2016-03-31 18:19 ` Darrick J. Wong 2016-03-31 19:47 ` Andreas Dilger 1 sibling, 0 replies; 22+ messages in thread From: Darrick J. Wong @ 2016-03-31 18:19 UTC (permalink / raw) To: J. Bruce Fields Cc: Dave Chinner, Christoph Hellwig, xfs, linux-fsdevel, linux-btrfs, linux-api On Thu, Mar 31, 2016 at 02:08:21PM -0400, J. Bruce Fields wrote: > On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote: > > On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote: > > > On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote: > > > > On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote: > > > > > Or is it ok that fallocate could block, potentially for a long time as > > > > > we stream cows through the page cache (or however unshare works > > > > > internally)? Those same programs might not be expecting fallocate to > > > > > take a long time. > > > > > > > > Yes, it's perfectly fine for fallocate to block for long periods of > > > > time. See what gfs2 does during preallocation of blocks - it ends up > > > > calling sb_issue_zerout() because it doesn't have unwritten > > > > extents, and hence can block for long periods of time.... > > > > > > gfs2 fallocate is an implementation that will cause all but the most > > > trivial users real pain. Even the initial XFS implementation just > > > marking the transactions synchronous made it unusable for all kinds > > > of applications, and this is much worse. E.g. a NFS ALLOCATE operation > > > to gfs2 will probab;ly hand your connection for extended periods of > > > time. > > > > > > If we need to support something like what gfs2 does we should have a > > > separate flag for it. > > > > Using fallocate() for preallocation was always intended to > > be a faster, more efficient method allocating zeroed space > > than having userspace write blocks of data. 
Faster, more efficient > > does not mean instantaneous, and gfs2 using sb_issue_zerout() means > > that if the hardware has zeroing offloads (deterministic trim, write > > same, etc) it will use them, and that will be much faster than > > writing zeros from userspace. > > > > IMO, what gfs2 is definitely within the intended usage of > > fallocate() for accelerating the preallocation of blocks. > > > > Yes, it may not be optimal for things like NFS servers which haven't > > considered that a fallocate based offload operation might take some > > time to execute, but that's not a problem with fallocate. i.e. > > that's a problem with the nfs server ALLOCATE implementation not > > being prepared to return NFSERR_JUKEBOX to prevent client side hangs > > and timeouts while the operation is run.... > > That's an interesting idea, but I don't think it's really legal. I take > JUKEBOX to mean "sorry, I'm failing this operation for now, try again > later and it might succeed", not "OK, I'm working on it, try again and > you may find out I've done it". > > So if the client gets a JUKEBOX error but the server goes ahead and does > the operation anyway, that'd be unexpected. > > I suppose it's comparable to the case where a slow fallocate is > interrupted--would it be legal to return EINTR in that case and leave > the application to sort out whether some part of the allocation had > already happened? <shrug> The unshare component to XFS fallocate does this if something sends a fatal signal to the process. There's a difference between shooting down a process in the middle of fallocate and fallocate returning EINTR out of the blue, though... ...the manpage for fallocate says that "EINTR == a signal was caught". > Would it be legal to continue the fallocate under the covers even > after returning EINTR? It doesn't do that, however. 
--D > But anyway my first inclination is to say that the NFS FALLOCATE > protocol just wasn't designed to handle long-running fallocates, and if > we really need that then we need to give it a way to either report > partial results or to report results asynchronously. > > --b. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: fallocate mode flag for "unshare blocks"? 2016-03-31 18:08 ` J. Bruce Fields 2016-03-31 18:19 ` Darrick J. Wong @ 2016-03-31 19:47 ` Andreas Dilger 2016-03-31 22:20 ` Dave Chinner 1 sibling, 1 reply; 22+ messages in thread From: Andreas Dilger @ 2016-03-31 19:47 UTC (permalink / raw) To: J. Bruce Fields Cc: Dave Chinner, Christoph Hellwig, Darrick J. Wong, xfs, linux-fsdevel, linux-btrfs, linux-api [-- Attachment #1: Type: text/plain, Size: 4231 bytes --] On Mar 31, 2016, at 12:08 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > > On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote: >> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote: >>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote: >>>> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote: >>>>> Or is it ok that fallocate could block, potentially for a long time as >>>>> we stream cows through the page cache (or however unshare works >>>>> internally)? Those same programs might not be expecting fallocate to >>>>> take a long time. >>>> >>>> Yes, it's perfectly fine for fallocate to block for long periods of >>>> time. See what gfs2 does during preallocation of blocks - it ends up >>>> calling sb_issue_zerout() because it doesn't have unwritten >>>> extents, and hence can block for long periods of time.... >>> >>> gfs2 fallocate is an implementation that will cause all but the most >>> trivial users real pain. Even the initial XFS implementation just >>> marking the transactions synchronous made it unusable for all kinds >>> of applications, and this is much worse. E.g. a NFS ALLOCATE operation >>> to gfs2 will probab;ly hand your connection for extended periods of >>> time. >>> >>> If we need to support something like what gfs2 does we should have a >>> separate flag for it. >> >> Using fallocate() for preallocation was always intended to >> be a faster, more efficient method allocating zeroed space >> than having userspace write blocks of data. 
Faster, more efficient >> does not mean instantaneous, and gfs2 using sb_issue_zerout() means >> that if the hardware has zeroing offloads (deterministic trim, write >> same, etc) it will use them, and that will be much faster than >> writing zeros from userspace. >> >> IMO, what gfs2 is definitely within the intended usage of >> fallocate() for accelerating the preallocation of blocks. >> >> Yes, it may not be optimal for things like NFS servers which haven't >> considered that a fallocate based offload operation might take some >> time to execute, but that's not a problem with fallocate. i.e. >> that's a problem with the nfs server ALLOCATE implementation not >> being prepared to return NFSERR_JUKEBOX to prevent client side hangs >> and timeouts while the operation is run.... > > That's an interesting idea, but I don't think it's really legal. I take > JUKEBOX to mean "sorry, I'm failing this operation for now, try again > later and it might succeed", not "OK, I'm working on it, try again and > you may find out I've done it". > > So if the client gets a JUKEBOX error but the server goes ahead and does > the operation anyway, that'd be unexpected. Well, the tape continued to be mounted in the background and/or the file restored from the tape into the filesystem... > I suppose it's comparable to the case where a slow fallocate is > interrupted--would it be legal to return EINTR in that case and leave > the application to sort out whether some part of the allocation had > already happened? If the later fallocate() was not re-doing the same work as the first one, it should be fine for the client to re-send the fallocate() request. The fallocate() to reserve blocks does not touch the blocks that are already allocated, so this is safe to do even if another process is writing to the file. 
If you have multiple processes writing and calling fallocate() with PUNCH/ZERO/COLLAPSE/INSERT to overlapping regions at the same time then the application is in for a world of hurt already. > Would it be legal to continue the fallocate under the covers even after > returning EINTR? That might produce unexpected results in some cases, but it depends on the options used. Probably the safest is to not continue, and depend on userspace to retry the operation on EINTR. For fallocate() doing prealloc or punch or zero this should eventually complete even if it is slow. Cheers, Andreas > But anyway my first inclination is to say that the NFS FALLOCATE > protocol just wasn't designed to handle long-running fallocates, and if > we really need that then we need to give it a way to either report > partial results or to report results asynchronously. > > --b. [-- Attachment #2: Message signed with OpenPGP using GPGMail --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: fallocate mode flag for "unshare blocks"? 2016-03-31 19:47 ` Andreas Dilger @ 2016-03-31 22:20 ` Dave Chinner 2016-03-31 22:34 ` J. Bruce Fields 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2016-03-31 22:20 UTC (permalink / raw) To: Andreas Dilger Cc: J. Bruce Fields, Christoph Hellwig, Darrick J. Wong, xfs, linux-fsdevel, linux-btrfs, linux-api On Thu, Mar 31, 2016 at 01:47:50PM -0600, Andreas Dilger wrote: > On Mar 31, 2016, at 12:08 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > > > > On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote: > >> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote: > >>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote: > >>>> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote: > >>>>> Or is it ok that fallocate could block, potentially for a long time as > >>>>> we stream cows through the page cache (or however unshare works > >>>>> internally)? Those same programs might not be expecting fallocate to > >>>>> take a long time. > >>>> > >>>> Yes, it's perfectly fine for fallocate to block for long periods of > >>>> time. See what gfs2 does during preallocation of blocks - it ends up > >>>> calling sb_issue_zerout() because it doesn't have unwritten > >>>> extents, and hence can block for long periods of time.... > >>> > >>> gfs2 fallocate is an implementation that will cause all but the most > >>> trivial users real pain. Even the initial XFS implementation just > >>> marking the transactions synchronous made it unusable for all kinds > >>> of applications, and this is much worse. E.g. a NFS ALLOCATE operation > >>> to gfs2 will probab;ly hand your connection for extended periods of > >>> time. > >>> > >>> If we need to support something like what gfs2 does we should have a > >>> separate flag for it. 
> >> > >> Using fallocate() for preallocation was always intended to > >> be a faster, more efficient method allocating zeroed space > >> than having userspace write blocks of data. Faster, more efficient > >> does not mean instantaneous, and gfs2 using sb_issue_zerout() means > >> that if the hardware has zeroing offloads (deterministic trim, write > >> same, etc) it will use them, and that will be much faster than > >> writing zeros from userspace. > >> > >> IMO, what gfs2 is definitely within the intended usage of > >> fallocate() for accelerating the preallocation of blocks. > >> > >> Yes, it may not be optimal for things like NFS servers which haven't > >> considered that a fallocate based offload operation might take some > >> time to execute, but that's not a problem with fallocate. i.e. > >> that's a problem with the nfs server ALLOCATE implementation not > >> being prepared to return NFSERR_JUKEBOX to prevent client side hangs > >> and timeouts while the operation is run.... > > > > That's an interesting idea, but I don't think it's really legal. I take > > JUKEBOX to mean "sorry, I'm failing this operation for now, try again > > later and it might succeed", not "OK, I'm working on it, try again and > > you may find out I've done it". > > > > So if the client gets a JUKEBOX error but the server goes ahead and does > > the operation anyway, that'd be unexpected. > > Well, the tape continued to be mounted in the background and/or the file > restored from the tape into the filesystem... Right, and SGI have been shipping a DMAPI-aware Linux NFS server for many years, using the above NFSERR_JUKEBOX behaviour for operations that may block for a long time due to the need to pull stuff into the filesystem from the slow backing store. Best explanation is in the relevant commit in the last published XFS+DMAPI branch from SGI, for example: http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=28b171cf2b64167826474efbb82ad9d471a05f75 Cheers, Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: fallocate mode flag for "unshare blocks"? 2016-03-31 22:20 ` Dave Chinner @ 2016-03-31 22:34 ` J. Bruce Fields 2016-04-01 0:33 ` Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: J. Bruce Fields @ 2016-03-31 22:34 UTC (permalink / raw) To: Dave Chinner Cc: Andreas Dilger, Christoph Hellwig, Darrick J. Wong, xfs, linux-fsdevel, linux-btrfs, linux-api On Fri, Apr 01, 2016 at 09:20:23AM +1100, Dave Chinner wrote: > On Thu, Mar 31, 2016 at 01:47:50PM -0600, Andreas Dilger wrote: > > On Mar 31, 2016, at 12:08 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > > > > > > On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote: > > >> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote: > > >>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote: > > >>>> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote: > > >>>>> Or is it ok that fallocate could block, potentially for a long time as > > >>>>> we stream cows through the page cache (or however unshare works > > >>>>> internally)? Those same programs might not be expecting fallocate to > > >>>>> take a long time. > > >>>> > > >>>> Yes, it's perfectly fine for fallocate to block for long periods of > > >>>> time. See what gfs2 does during preallocation of blocks - it ends up > > >>>> calling sb_issue_zerout() because it doesn't have unwritten > > >>>> extents, and hence can block for long periods of time.... > > >>> > > >>> gfs2 fallocate is an implementation that will cause all but the most > > >>> trivial users real pain. Even the initial XFS implementation just > > >>> marking the transactions synchronous made it unusable for all kinds > > >>> of applications, and this is much worse. E.g. a NFS ALLOCATE operation > > >>> to gfs2 will probab;ly hand your connection for extended periods of > > >>> time. > > >>> > > >>> If we need to support something like what gfs2 does we should have a > > >>> separate flag for it. 
> > >> > > >> Using fallocate() for preallocation was always intended to > > >> be a faster, more efficient method allocating zeroed space > > >> than having userspace write blocks of data. Faster, more efficient > > >> does not mean instantaneous, and gfs2 using sb_issue_zerout() means > > >> that if the hardware has zeroing offloads (deterministic trim, write > > >> same, etc) it will use them, and that will be much faster than > > >> writing zeros from userspace. > > >> > > >> IMO, what gfs2 is definitely within the intended usage of > > >> fallocate() for accelerating the preallocation of blocks. > > >> > > >> Yes, it may not be optimal for things like NFS servers which haven't > > >> considered that a fallocate based offload operation might take some > > >> time to execute, but that's not a problem with fallocate. i.e. > > >> that's a problem with the nfs server ALLOCATE implementation not > > >> being prepared to return NFSERR_JUKEBOX to prevent client side hangs > > >> and timeouts while the operation is run.... > > > > > > That's an interesting idea, but I don't think it's really legal. I take > > > JUKEBOX to mean "sorry, I'm failing this operation for now, try again > > > later and it might succeed", not "OK, I'm working on it, try again and > > > you may find out I've done it". > > > > > > So if the client gets a JUKEBOX error but the server goes ahead and does > > > the operation anyway, that'd be unexpected. > > > > Well, the tape continued to be mounted in the background and/or the file > > restored from the tape into the filesystem... > > Right, and SGI have been shipping a DMAPI-aware Linux NFS server for > many years, using the above NFSERR_JUKEBOX behaviour for operations > that may block for a long time due to the need to pull stuff into > the filesytsem from the slow backing store. 
Best explanation is in > the relevant commit in the last published XFS+DMAPI branch from SGI, > for example: > > http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=28b171cf2b64167826474efbb82ad9d471a05f75 I haven't looked at the code, but I assume a JUKEBOX-returning write to an absent file brings into cache the bits necessary to perform the write, but stops short of actually doing the write. That allows handling the retried write quickly without doing the wrong thing in the case the retry never comes. Implementing fallocate by returning JUKEBOX while still continuing the allocation in the background is a bit different. I guess it doesn't matter as much in practice, since the only way you're likely to notice that fallocate unexpectedly succeeded would be if it caused you to hit ENOSPC elsewhere. Is that right? Still, it seems a little weird. --b. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: fallocate mode flag for "unshare blocks"? 2016-03-31 22:34 ` J. Bruce Fields @ 2016-04-01 0:33 ` Dave Chinner 2016-04-01 2:00 ` J. Bruce Fields 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2016-04-01 0:33 UTC (permalink / raw) To: J. Bruce Fields Cc: Andreas Dilger, Christoph Hellwig, Darrick J. Wong, xfs, linux-fsdevel, linux-btrfs, linux-api On Thu, Mar 31, 2016 at 06:34:17PM -0400, J. Bruce Fields wrote: > On Fri, Apr 01, 2016 at 09:20:23AM +1100, Dave Chinner wrote: > > On Thu, Mar 31, 2016 at 01:47:50PM -0600, Andreas Dilger wrote: > > > On Mar 31, 2016, at 12:08 PM, J. Bruce Fields <bfields@fieldses.org> wrote: > > > > > > > > On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote: > > > >> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote: > > > >>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote: > > > >>>> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote: > > > >>>>> Or is it ok that fallocate could block, potentially for a long time as > > > >>>>> we stream cows through the page cache (or however unshare works > > > >>>>> internally)? Those same programs might not be expecting fallocate to > > > >>>>> take a long time. > > > >>>> > > > >>>> Yes, it's perfectly fine for fallocate to block for long periods of > > > >>>> time. See what gfs2 does during preallocation of blocks - it ends up > > > >>>> calling sb_issue_zerout() because it doesn't have unwritten > > > >>>> extents, and hence can block for long periods of time.... > > > >>> > > > >>> gfs2 fallocate is an implementation that will cause all but the most > > > >>> trivial users real pain. Even the initial XFS implementation just > > > >>> marking the transactions synchronous made it unusable for all kinds > > > >>> of applications, and this is much worse. E.g. a NFS ALLOCATE operation > > > >>> to gfs2 will probab;ly hand your connection for extended periods of > > > >>> time. 
> > > >>> > > > >>> If we need to support something like what gfs2 does we should have a > > > >>> separate flag for it. > > > >> > > > >> Using fallocate() for preallocation was always intended to > > > >> be a faster, more efficient method allocating zeroed space > > > >> than having userspace write blocks of data. Faster, more efficient > > > >> does not mean instantaneous, and gfs2 using sb_issue_zerout() means > > > >> that if the hardware has zeroing offloads (deterministic trim, write > > > >> same, etc) it will use them, and that will be much faster than > > > >> writing zeros from userspace. > > > >> > > > >> IMO, what gfs2 is definitely within the intended usage of > > > >> fallocate() for accelerating the preallocation of blocks. > > > >> > > > >> Yes, it may not be optimal for things like NFS servers which haven't > > > >> considered that a fallocate based offload operation might take some > > > >> time to execute, but that's not a problem with fallocate. i.e. > > > >> that's a problem with the nfs server ALLOCATE implementation not > > > >> being prepared to return NFSERR_JUKEBOX to prevent client side hangs > > > >> and timeouts while the operation is run.... > > > > > > > > That's an interesting idea, but I don't think it's really legal. I take > > > > JUKEBOX to mean "sorry, I'm failing this operation for now, try again > > > > later and it might succeed", not "OK, I'm working on it, try again and > > > > you may find out I've done it". > > > > > > > > So if the client gets a JUKEBOX error but the server goes ahead and does > > > > the operation anyway, that'd be unexpected. > > > > > > Well, the tape continued to be mounted in the background and/or the file > > > restored from the tape into the filesystem... 
> > > > Right, and SGI have been shipping a DMAPI-aware Linux NFS server for > > many years, using the above NFSERR_JUKEBOX behaviour for operations > > that may block for a long time due to the need to pull stuff into > > the filesytsem from the slow backing store. Best explanation is in > > the relevant commit in the last published XFS+DMAPI branch from SGI, > > for example: > > > > http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=28b171cf2b64167826474efbb82ad9d471a05f75 > > I haven't looked at the code, but I assume a JUKEBOX-returning write to > an absent file brings into cache the bits necessary to perform the > write, but stops short of actually doing the write. Not exactly, as all subsequent read/write/truncate requests will EJUKEBOX until the absent file has been brought back onto disk. Once that is done, the next operation attempt will proceed. > That allows > handling the retried write quickly without doing the wrong thing in the > case the retry never comes. Essentially. But if a retry never comes it means there's either a bug in the client NFS implementation or the client crashed, in which case we don't really care. > Implementing fallocate by returning JUKEBOX while still continuing the > allocation in the background is a bit different. Not really. Like the HSM case we don't really care if a retry occurs or not - the server simply needs to reply NFSERR_JUKEBOX for all subsequent read/write/fallocate/truncate operations on that inode until the fallocate completes... i.e. it requires O_NONBLOCK style operation for filesystem IO operations to really work correctly, and for the above patchset that is added by the DMAPI layer through the hooks added into the IO paths here: http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commitdiff;h=87e98fb84c235a45fc5dea6fced8c6bd9e534234 i.e. recall status was tracked externally to the filesystem and obeyed non-blocking flags on the filp. 
Hence when the NFSD called into the fs with O_NONBLOCK set, the dmapi hook would return EAGAIN if there was a recall in progress on the range the IO was going to be issued on... > I guess it doesn't matter as much in practice, since the only way you're > likely to notice that fallocate unexpectedly succeeded would be if it > caused you to hit ENOSPC elsewhere. Is that right? Still, it seems a > little weird. s/succeeded/failed/ and that statement is right. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: fallocate mode flag for "unshare blocks"? 2016-04-01 0:33 ` Dave Chinner @ 2016-04-01 2:00 ` J. Bruce Fields 0 siblings, 0 replies; 22+ messages in thread From: J. Bruce Fields @ 2016-04-01 2:00 UTC (permalink / raw) To: Dave Chinner Cc: Andreas Dilger, Christoph Hellwig, Darrick J. Wong, xfs, linux-fsdevel, linux-btrfs, linux-api On Fri, Apr 01, 2016 at 11:33:00AM +1100, Dave Chinner wrote: > On Thu, Mar 31, 2016 at 06:34:17PM -0400, J. Bruce Fields wrote: > > I haven't looked at the code, but I assume a JUKEBOX-returning write to > > an absent file brings into cache the bits necessary to perform the > > write, but stops short of actually doing the write. > > Not exactly, as all subsequent read/write/truncate requests will > EJUKEBOX until the absent file has been brought back onto disk. Once > that is done, the next operation attempt will proceed. > > > That allows > > handling the retried write quickly without doing the wrong thing in the > > case the retry never comes. > > Essentially. But if a retry never comes it means there's either a > bug in the client NFS implementation or the client crashed, NFS clients are under no obligation to retry operations after JUKEBOX. And I'd expect them not to in the case the calling process was interrupted, for example. > > I guess it doesn't matter as much in practice, since the only way you're > > likely to notice that fallocate unexpectedly succeeded would be if it > > caused you to hit ENOSPC elsewhere. Is that right? Still, it seems a > > little weird. > > s/succeeded/failed/ and that statement is right. Sorry, I didn't explain clearly. The case I was worrying about was the case where the on-the-wire ALLOCATE call returns JUKEBOX, but the server allocates anyway. That behavior violates the spec as I understand it. The client therefore assumes there was no allocation, when in fact there was. So, technically a bug, but I wondered if it's likely to bite anyone. 
One of the only ways it seems someone would notice would be if it caused the filesystem to run out of space earlier than I expected. But perhaps that's unlikely. --b. ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2016-04-01 2:00 UTC | newest] Thread overview: 22+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <20160302155007.GB7125@infradead.org> 2016-03-30 18:27 ` fallocate mode flag for "unshare blocks"? Darrick J. Wong 2016-03-30 18:58 ` Austin S. Hemmelgarn 2016-03-31 7:58 ` Christoph Hellwig 2016-03-31 11:13 ` Austin S. Hemmelgarn 2016-03-31 0:32 ` Liu Bo 2016-03-31 7:55 ` Christoph Hellwig 2016-03-31 15:31 ` Andreas Dilger 2016-03-31 15:43 ` Austin S. Hemmelgarn 2016-03-31 16:47 ` Henk Slager 2016-03-31 11:18 ` Austin S. Hemmelgarn 2016-03-31 11:38 ` Austin S. Hemmelgarn 2016-03-31 19:52 ` Liu Bo 2016-03-31 1:18 ` Dave Chinner 2016-03-31 7:54 ` Christoph Hellwig 2016-03-31 11:18 ` Dave Chinner 2016-03-31 18:08 ` J. Bruce Fields 2016-03-31 18:19 ` Darrick J. Wong 2016-03-31 19:47 ` Andreas Dilger 2016-03-31 22:20 ` Dave Chinner 2016-03-31 22:34 ` J. Bruce Fields 2016-04-01 0:33 ` Dave Chinner 2016-04-01 2:00 ` J. Bruce Fields