Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance


From: Matthew Wilcox <willy@infradead.org>
To: Ritesh Harjani <ritesh.list@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	Brian Foster <bfoster@redhat.com>,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Dave Chinner <david@fromorbit.com>,
	Ojaswin Mujoo <ojaswin@linux.ibm.com>,
	Disha Goel <disgoel@linux.ibm.com>,
	Aravinda Herle <araherle@in.ibm.com>
Subject: Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
Date: Mon, 22 May 2023 05:48:14 +0100
Message-ID: <ZGr0DhOFH6AwoBB0@casper.infradead.org>
In-Reply-To: <87ttw5ugse.fsf@doe.com>

On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> >> > overwrites, isn't that also true of mapped writes to folios that don't
> >> > already have an iop?
> >>
> >> Yes.
> >
> > Hm, well, maybe?  If somebody stores to a page, we obviously set the
> > dirty flag on the folio, but depending on the architecture, we may
> > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> > we have one dirty bit for the entire folio; similarly if ARM uses the
> > contiguous PTE bit).  If we do have independent dirty bits, we could
> > dirty only the blocks corresponding to a single page at a time.
> >
> > This has potential for causing some nasty bugs, so I'm inclined to
> > rule that if a folio is mmaped, then it's all dirty from any writable
> > page fault.  The fact is that applications generally do not perform
> > writes through mmap because the error handling story is so poor.
> >
> > There may be a different answer for anonymous memory, but that doesn't
> > feel like my problem and shouldn't feel like any FS developer's problem.
> 
> I am also skeptical of the change Brian is suggesting here, i.e. not
> marking all the blocks of the folio dirty when we call
> ->dirty_folio -> filemap_dirty_folio() (mmaped writes).
> 
> However, I am sorry but I couldn't completely follow your reasoning
> above. I think what Brian is suggesting here is that
> filemap_dirty_folio() should be similar to complete buffered overwrite
> case where we do not allocate the iop at the ->write_begin() time.
> Then at the writeback time we allocate an iop and mark all blocks dirty.
> 
> In a way, mmaped writes are a similar case, but my only worry is the
> way mmaped writes work; it makes more sense to keep the dirty state of
> the folio and the per-block dirty state within the iop in sync.
> For that matter, we can even just make sure we always allocate an iop in
> the complete overwrites case as well. I didn't change that code because
> it was kept that way for uptodate state as well and based on one of your
> inputs for complete overwrite case.
> 
> Though I agree that we should ideally be allocating & marking all
> blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
> understand your reasoning better.

Think about it at a higher level than the implementation ("do we allocate
an iop or not").  If userspace dirties one page in a folio, should it
dirty all pages in the folio, or just the page that was actually dirtied?
I appreciate you're thinking about this from the point of view of 64kB
pages on PPC and using single-page folios, but pretend we've allocated a
1MB folio, mmaped it (or a part of it?)  and now userspace stores to it.
How much of it do we want to write back?

My argument is that this is RARE.  Userspace generally does not
mmap(MAP_SHARED), store to it and call msync() to do writes.  Writes are
almost always done using the write() syscall.  With write(), userspace
gets a lot more control over when the writeback happens, and it actually
gets errors back from the syscall.

If we attempt to track which pages have actually been dirtied, I worry
the fs and the mm will lose track of what the other needs to know.
eg the mm will make every page in the folio writable and then not notify
the fs when subsequent pages in the folio are stored to.
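
To make the two policies concrete, here is a minimal userspace sketch,
not kernel code.  Everything named demo_* is hypothetical and is not the
iomap API; the 1MB folio and 4kB block size are assumptions picked to
match the example above.  The write() path knows the exact byte range
and dirties only the blocks it touches, while a writable page fault
dirties every block, since later stores may never be reported to the fs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical upper bound: a 1MB folio with 4kB blocks has 256 blocks. */
#define DEMO_MAX_BLOCKS 256

/* Hypothetical stand-in for per-folio state; not the kernel's structure. */
struct demo_folio {
	size_t size;                        /* folio size in bytes */
	size_t block_size;                  /* filesystem block size in bytes */
	uint8_t dirty[DEMO_MAX_BLOCKS / 8]; /* one dirty bit per block */
};

static void demo_folio_init(struct demo_folio *f, size_t size,
			    size_t block_size)
{
	assert(block_size > 0 && size / block_size <= DEMO_MAX_BLOCKS);
	f->size = size;
	f->block_size = block_size;
	memset(f->dirty, 0, sizeof(f->dirty));
}

/* write()-style path: the byte range is known, so dirty only those blocks. */
static void demo_dirty_range(struct demo_folio *f, size_t off, size_t len)
{
	assert(len > 0 && off + len <= f->size);
	size_t first = off / f->block_size;
	size_t last = (off + len - 1) / f->block_size;

	for (size_t i = first; i <= last; i++)
		f->dirty[i / 8] |= (uint8_t)(1u << (i % 8));
}

/*
 * mmap-style path: on any writable fault, treat the whole folio as dirty,
 * because subsequent stores to other pages may not notify the filesystem.
 */
static void demo_dirty_all(struct demo_folio *f)
{
	demo_dirty_range(f, 0, f->size);
}

/* Count how many blocks would have to be written back. */
static size_t demo_count_dirty(const struct demo_folio *f)
{
	size_t n = 0;

	for (size_t i = 0; i < f->size / f->block_size; i++)
		if (f->dirty[i / 8] & (1u << (i % 8)))
			n++;
	return n;
}
```

With these helpers, a 100-byte write at offset 4096 leaves one block to
write back, whereas demo_dirty_all() leaves all 256.  That is the cost of
the conservative rule, and it only matters if mmaped writes are common.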
