public inbox for linux-fsdevel@vger.kernel.org
* Why doesn't XFS need ->launder_folio?
@ 2023-09-08 14:48 Matthew Wilcox
  2023-09-11  0:02 ` Dave Chinner
  0 siblings, 1 reply; 2+ messages in thread
From: Matthew Wilcox @ 2023-09-08 14:48 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

I want to remove ->launder_folio.  So I'm looking at commit e3db7691e9f3
which introduced ->launder_page.  The race described there is pretty
clear:

     invalidate_inode_pages2() may find the dirty bit has been set on a page
     owing to the fact that the page may still be mapped after it was locked.
     Only after the call to unmap_mapping_range() are we sure that the page
     can no longer be dirtied.

ie this happens:

Task A				Task B
mmaps a file, writes to page A
				open(O_DIRECT)
				read()
				kiocb_invalidate_pages()
				filemap_write_and_wait_range()
				__filemap_fdatawrite_range()
				filemap_fdatawrite_wbc()
				do_writepages()
				iomap_writepages()
				write_cache_pages()
				page A gets cleaned
writes to page A again
				invalidate_inode_pages2_range()
				folio_mapped() is true, so we unmap it
				folio_launder() returns 0
				invalidate_complete_folio2() returns 0
				ret = -EBUSY
				kiocb_invalidate_pages() returns -EBUSY

and the DIO read fails, despite it being totally reasonable to return
the now-stale data on storage.  A DIO write would be a different matter;
we really do need to get page A out of cache.

So would it be reasonable to unmap the pages earlier and rely on
invalidate_lock to prevent page faults making the page writable
between the call to filemap_write_and_wait_range() and the call to
invalidate_complete_folio2() ?  Then we could get rid of ->launder_folio()
as well as making DIO a little more reliable when racing with page faults.


* Re: Why doesn't XFS need ->launder_folio?
  2023-09-08 14:48 Why doesn't XFS need ->launder_folio? Matthew Wilcox
@ 2023-09-11  0:02 ` Dave Chinner
  0 siblings, 0 replies; 2+ messages in thread
From: Dave Chinner @ 2023-09-11  0:02 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-xfs, linux-fsdevel

On Fri, Sep 08, 2023 at 03:48:03PM +0100, Matthew Wilcox wrote:
> I want to remove ->launder_folio.  So I'm looking at commit e3db7691e9f3
> which introduced ->launder_page.  The race described there is pretty
> clear:
> 
>      invalidate_inode_pages2() may find the dirty bit has been set on a page
>      owing to the fact that the page may still be mapped after it was locked.
>      Only after the call to unmap_mapping_range() are we sure that the page
>      can no longer be dirtied.
> 
> ie this happens:
> 
> Task A				Task B
> mmaps a file, writes to page A
> 				open(O_DIRECT)
> 				read()
> 				kiocb_invalidate_pages()
> 				filemap_write_and_wait_range()
> 				__filemap_fdatawrite_range()
> 				filemap_fdatawrite_wbc()
> 				do_writepages()
> 				iomap_writepages()
> 				write_cache_pages()
> 				page A gets cleaned
> writes to page A again
> 				invalidate_inode_pages2_range()
> 				folio_mapped() is true, so we unmap it
> 				folio_launder() returns 0
> 				invalidate_complete_folio2() returns 0
> 				ret = -EBUSY
> 				kiocb_invalidate_pages() returns -EBUSY
> 
> and the DIO read fails, despite it being totally reasonable to return
> the now-stale data on storage.

I think you've read the __iomap_dio_rw() call path incorrectly -
kiocb_invalidate_pages() is only called from the DIO write
submission call path, not the DIO read call path.

For a DIO read, we only call kiocb_write_and_wait() to write back
dirty cached pages and we don't invalidate anything in the page
cache. Hence IO submission never calls
invalidate_inode_pages2_range(), so the data returned by the DIO
read will only contain the first write. On DIO read completion, we
still don't do any page cache invalidation, so AFAICT it returns
stale data that doesn't include the second mmap write Task A made to
the file while the DIO read was in progress.

This looks like the iomap DIO read path is working as intended to
me...

> A DIO write would be a different matter;
> we really do need to get page A out of cache.

No, I don't think we need to.

On a dio write, if invalidate_inode_pages2_range() fails with -EBUSY
as per above it will result in iomap_dio_rw() returning -ENOTBLK to
the filesystem. That gets caught at xfs_file_write_iter() and the
DIO write is then executed as buffered IO instead. This means the
write is then directed through the page cache, so the second write
that Task A made is captured and does not get lost.

IOWs, this looks like it is all working as it is supposed to,
without the need to implement ->launder_folio()...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

