* Splitting dirty fs folios
@ 2023-05-29 19:59 Matthew Wilcox
From: Matthew Wilcox @ 2023-05-29 19:59 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-xfs

At the moment, when we truncate (also holepunch, etc) a file,
the VFS attempts to split any large folios which overlap the newly
created boundary in the file.  See mm/truncate.c for the callers of
->invalidate_folio.

We need the filesystem and the MM to cooperate on splitting a folio
because there's FS metadata attached to folio->private.  We have
per-folio state (uptodate, dirty) for which the filesystem keeps
per-block state, using the folio state as a summary (if every block is
uptodate, the folio is uptodate; if any block is dirty, the folio is
dirty).  If we just throw away that extra per-block state we risk not
writing back blocks which are dirty, or losing buffered writes as we
re-read blocks which were more uptodate in memory than on disk.
There's no safe state to set the folio to.
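
For illustration, here is a minimal sketch of that summary relationship.
The fs_block_state structure and helper names are made up for the
example; they stand in for whatever the filesystem actually hangs off
folio->private (an iomap_page, buffer heads, ...):

#include <linux/bitmap.h>
#include <linux/pagemap.h>
#include <linux/spinlock.h>

/* Illustrative per-block state hung off folio->private. */
struct fs_block_state {
	spinlock_t	lock;
	unsigned int	nr_blocks;
	unsigned long	*uptodate;	/* one bit per block */
	unsigned long	*dirty;		/* one bit per block */
};

/* Folio summary: uptodate only once every block is uptodate. */
static void fs_set_block_uptodate(struct folio *folio,
		struct fs_block_state *state, unsigned int block)
{
	spin_lock(&state->lock);
	set_bit(block, state->uptodate);
	if (bitmap_full(state->uptodate, state->nr_blocks))
		folio_mark_uptodate(folio);
	spin_unlock(&state->lock);
}

/* Folio summary: dirty as soon as any block is dirty. */
static void fs_set_block_dirty(struct folio *folio,
		struct fs_block_state *state, unsigned int block)
{
	set_bit(block, state->dirty);
	folio_mark_dirty(folio);
}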

This is fine if the entire folio is uptodate, and it generally is today
because large folios are only created through readahead, which will
bring the entire folio uptodate unless there is a read error.  But when
creating a large folio in the write path, we can end up with large folios
which are not uptodate under various circumstances.  For example, I've
captured one where we write to pos:0x2a0e5f len:0xf1a1.  Because this is
on a 1kB block size filesystem, we leave the first three blocks in the folio
unread, and the uptodate bits are fffffffffffffff8.  That means that
the folio as a whole is not uptodate.
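
(To spell out the arithmetic, on the assumption that the 64-bit mask
corresponds to a 64kB folio of 64 1kB blocks starting at 0x2a0000: the
write spans [0x2a0e5f, 0x2a0e5f + 0xf1a1) = [0x2a0e5f, 0x2b0000), so it
begins in block (0x2a0e5f - 0x2a0000) / 0x400 = 3 and runs to the end
of the folio.  Blocks 0-2 are never read or written, so bits 0-2 of the
uptodate mask stay clear.)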

Option 1: Read the start of the folio so we can set the whole folio
uptodate.  In this case, we're already submitting a read for bytes
0x2a0c00-0x2a0fff (so we can overwrite the end of that block).  We could
expand that to read 0x2a0000-0x2a0fff instead.  This could get tricky;
at the moment we're guaranteed to have the iomap that covers the start
of the block, but we might have to do a lookup to find the iomap(s)
that covers the start of the folio.
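
To make the range expansion concrete, a rough sketch with a hypothetical
helper (expand_rmw_read is not a real function, and the tricky part,
looking up the mapping(s) for the leading blocks, is not shown):

#include <linux/pagemap.h>

/*
 * Option 1 sketch: since we have to read the block containing the start
 * of the write anyway, begin that read at the folio instead, so the
 * leading blocks become uptodate too.  In the example above this turns
 * the 0x2a0c00-0x2a0fff read into 0x2a0000-0x2a0fff.
 */
static void expand_rmw_read(struct folio *folio, loff_t *read_pos,
		size_t *read_len)
{
	loff_t folio_start = folio_pos(folio);

	if (*read_pos > folio_start) {
		*read_len += *read_pos - folio_start;
		*read_pos = folio_start;
	}
}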

Option 2: In the invalidate_folio implementation, writeback the folio
so it is no longer dirty.  I'm not sure we have all the information we
need to start writeback, and it'll annoy the filesystem as it has to
allocate space if it wasn't already allocated.

Option 3: Figure out a more complicated dance between the FS and the MM
that allows the FS to attach state to the newly created folios before
finally freeing the original folio.

Option 4: Stop splitting folios on holepunch / truncate.  Folio splits
can fail, so we all have to cope with folios that substantially overhang
a hole/data/EOF boundary.  We don't attempt to split folios on readahead
when we discover we're trying to read from a hole, we just zero the
appropriate chunks of the folio.  We do attempt to not allocate folios
which extend more than one page past EOF, but that's subject to change
anyway.

Option 5: If the folio is both dirty and !uptodate, just refuse to split
it, like if somebody else had a reference on it.  A less extreme version
of #4.
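
A minimal sketch of what the option 5 check might look like, assuming a
hypothetical predicate that the split path could consult
(fs_folio_splittable is not an existing function):

#include <linux/pagemap.h>

/*
 * Option 5 sketch: a dirty folio that is not fully uptodate has
 * per-block state in folio->private that we cannot summarise safely,
 * so treat it like a folio with an elevated refcount and refuse to
 * split it.
 */
static bool fs_folio_splittable(struct folio *folio)
{
	if (folio_test_dirty(folio) && !folio_test_uptodate(folio))
		return false;
	return true;
}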

I may have missed some other option.  Option 5 seems like the least
amount of work.


* Re: Splitting dirty fs folios
From: Dave Chinner @ 2023-05-29 22:39 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, linux-xfs

On Mon, May 29, 2023 at 08:59:27PM +0100, Matthew Wilcox wrote:
> At the moment, when we truncate (also holepunch, etc) a file,
> the VFS attempts to split any large folios which overlap the newly
> created boundary in the file.  See mm/truncate.c for the callers of
> ->invalidate_folio.
> 
> We need the filesystem and the MM to cooperate on splitting a folio
> because there's FS metadata attached to folio->private.  We have
> per-folio state (uptodate, dirty) for which the filesystem keeps
> per-block state, using the folio state as a summary (if every block is
> uptodate, the folio is uptodate; if any block is dirty, the folio is
> dirty).  If we just throw away that extra per-block state we risk not
> writing back blocks which are dirty, or losing buffered writes as we
> re-read blocks which were more uptodate in memory than on disk.
> There's no safe state to set the folio to.
> 
> This is fine if the entire folio is uptodate, and it generally is today
> because large folios are only created through readahead, which will
> bring the entire folio uptodate unless there is a read error.  But when
> creating a large folio in the write path, we can end up with large folios
> which are not uptodate under various circumstances.  For example, I've
> captured one where we write to pos:0x2a0e5f len:0xf1a1.  Because this is
> on a 1kB block size filesystem, we leave the first three blocks in the folio
> unread, and the uptodate bits are fffffffffffffff8.  That means that
> the folio as a whole is not uptodate.
> 
> Option 1: Read the start of the folio so we can set the whole folio
> uptodate.  In this case, we're already submitting a read for bytes
> 0x2a0c00-0x2a0fff (so we can overwrite the end of that block).  We could
> expand that to read 0x2a0000-0x2a0fff instead.  This could get tricky;
> at the moment we're guaranteed to have the iomap that covers the start
> of the block, but we might have to do a lookup to find the iomap(s)
> that covers the start of the folio.

Yeah, that seems like a problem - having to go back and get a
different iomap for read while we currently hold an iomap for write
is a potential deadlock for some filesystems. I think at this point,
we would be better off backing out of the write, getting a write
mapping for the entire large folio range and running the existing
"sub folio" zero-or-RMW code...

> Option 2: In the invalidate_folio implementation, writeback the folio
> so it is no longer dirty.  I'm not sure we have all the information we
> need to start writeback, and it'll annoy the filesystem as it has to
> allocate space if it wasn't already allocated.

I don't really like this - we want to throw away dirty data in the
range being invalidated, not take a latency/performance hit to write it
back first.

FWIW, if the folio is dirty and the filesystem doesn't have
allocated space for it to be written back, then that is a bug that
needs fixing. Nothing should be dirtying a folio without giving the
filesystem the opportunity to allocate backing space for it
first....

> Option 3: Figure out a more complicated dance between the FS and the MM
> that allows the FS to attach state to the newly created folios before
> finally freeing the original folio.

Complex, but seems possible. Also seems tricky with respect to
making the entire swap appear atomic from an outside observer's
perspective. 

> Option 4: Stop splitting folios on holepunch / truncate.  Folio splits
> can fail, so we all have to cope with folios that substantially overhang
> a hole/data/EOF boundary.  We don't attempt to split folios on readahead
> when we discover we're trying to read from a hole, we just zero the
> appropriate chunks of the folio.

That sounds like the best idea to me - it matches what we already do
in terms of sub-page block size behaviour - zero the part of the page
that was invalidated.

> We do attempt to not allocate folios
> which extend more than one page past EOF, but that's subject to change
> anyway.

Yeah, I think we'll want to change that because if extending
sequential writes don't exactly match large folio alignment we'll
still end up with lots of small folios around the edges of the user
writes....

> Option 5: If the folio is both dirty and !uptodate, just refuse to split
> it, like if somebody else had a reference on it.  A less extreme version
> of #4.

Also seems like a reasonable first step.

> I may have missed some other option.  Option 5 seems like the least
> amount of work.

*nod*

Overall, I think the best way to approach it is to do the simplest,
most obviously correct thing first. If/when we observe performance
problems from the simple approach, then we can decide how to solve
that via one of the more complex approaches...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: Splitting dirty fs folios
From: Matthew Wilcox @ 2023-05-29 22:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-fsdevel, linux-xfs

On Tue, May 30, 2023 at 08:39:32AM +1000, Dave Chinner wrote:
> > Option 5: If the folio is both dirty and !uptodate, just refuse to split
> > it, like if somebody else had a reference on it.  A less extreme version
> > of #4.
> 
> Also seems like a reasonable first step.
> 
> > I may have missed some other option.  Option 5 seems like the least
> > amount of work.
> 
> *nod*
> 
> Overall, I think the best way to approach it is to do the simplest,
> most obviously correct thing first. If/when we observe performance
> problems from the simple approach, then we can decide how to solve
> that via one of the more complex approaches...

So the good news is that Option 5 seems to have no regressions and
the functional part of the patch looks like ...

+++ b/fs/iomap/buffered-io.c
@@ -510,11 +510,6 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
                WARN_ON_ONCE(folio_test_writeback(folio));
                folio_cancel_dirty(folio);
                iomap_page_release(folio);
-       } else if (folio_test_large(folio)) {
-               /* Must release the iop so the page can be split */
-               VM_WARN_ON_ONCE_FOLIO(!folio_test_uptodate(folio) &&
-                               folio_test_dirty(folio), folio);
-               iomap_page_release(folio);
        }
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);

I need to modify a few comments and document exactly why this works,
but it seems like a good next step.
