Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Ritesh Harjani <ritesh.list@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Dave Chinner <david@fromorbit.com>,
	Ojaswin Mujoo <ojaswin@linux.ibm.com>,
	Disha Goel <disgoel@linux.ibm.com>,
	Aravinda Herle <araherle@in.ibm.com>
Subject: Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
Date: Tue, 23 May 2023 11:22:33 -0400	[thread overview]
Message-ID: <ZGzaOSvpCAUsluhr@bfoster> (raw)
In-Reply-To: <87fs7nozu5.fsf@doe.com>

On Tue, May 23, 2023 at 08:32:42PM +0530, Ritesh Harjani wrote:
> Brian Foster <bfoster@redhat.com> writes:
> 
> > On Tue, May 23, 2023 at 07:13:04PM +0530, Ritesh Harjani wrote:
> >> Brian Foster <bfoster@redhat.com> writes:
> >> 
> >> > On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
> >> >> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
> >> >> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> >> >> > > Matthew Wilcox <willy@infradead.org> writes:
> >> >> > >
> >> >> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> >> >> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> >> >> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> >> >> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
> >> >> > > >> > already have an iop?
> >> >> > > >>
> >> >> > > >> Yes.
> >> >> > > >
> >> >> > > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
> >> >> > > > dirty flag on the folio, but depending on the architecture, we may
> >> >> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> >> >> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
> >> >> > > > contiguous PTE bit).  If we do have independent dirty bits, we could
> >> >> > > > dirty only the blocks corresponding to a single page at a time.
> >> >> > > >
> >> >> > > > This has potential for causing some nasty bugs, so I'm inclined to
> >> >> > > > rule that if a folio is mmaped, then it's all dirty from any writable
> >> >> > > > page fault.  The fact is that applications generally do not perform
> >> >> > > > writes through mmap because the error handling story is so poor.
> >> >> > > >
> >> >> > > > There may be a different answer for anonymous memory, but that doesn't
> >> >> > > > feel like my problem and shouldn't feel like any FS developer's problem.
> >> >> > >
> >> >> > > Although I am skeptical too to do the changes which Brian is suggesting
> >> >> > > here. i.e. not making all the blocks of the folio dirty when we are
> >> >> > > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes).
> >> >> > >
> >> >> > > However, I am sorry but I coudn't completely follow your reasoning
> >> >> > > above. I think what Brian is suggesting here is that
> >> >> > > filemap_dirty_folio() should be similar to complete buffered overwrite
> >> >> > > case where we do not allocate the iop at the ->write_begin() time.
> >> >> > > Then at the writeback time we allocate an iop and mark all blocks dirty.
> >> >> > >
> >> >> >
> >> >> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
> >> >> > granularity of intra-folio faults) makes sense, but I'm also not sure
> >> >> > what it has to do with the idea of being consistent with how full folio
> >> >> > overwrites are implemented (between buffered or mapped writes). We're
> >> >> > not changing historical dirtying granularity either way. I think this is
> >> >> > just a bigger picture thought for future consideration as opposed to
> >> >> > direct feedback on this patch..
> >> >>
> >> >> <nod>
> >> >>
> >> >> > > In a way it is also the similar case as for mmapped writes too but my
> >> >> > > only worry is the way mmaped writes work and it makes more
> >> >> > > sense to keep the dirty state of folio and per-block within iop in sync.
> >> >> > > For that matter, we can even just make sure we always allocate an iop in
> >> >> > > the complete overwrites case as well. I didn't change that code because
> >> >> > > it was kept that way for uptodate state as well and based on one of your
> >> >> > > inputs for complete overwrite case.
> >> >> > >
> >> >> >
> >> >> > Can you elaborate on your concerns, out of curiosity?
> >> >> >
> >> >> > Either way, IMO it also seems reasonable to drop this behavior for the
> >> >> > basic implementation of dirty tracking (so always allocate the iop for
> >> >> > sub-folio tracking as you suggest above) and then potentially restore it
> >> >> > as a separate optimization patch at the end of the series.
> >> >>
> >> >> Agree.
> >> >>
> >> >> > That said, I'm not totally clear why it exists in the first place, so
> >> >> > that might warrant some investigation. Is it primarily to defer
> >> >> > allocations out of task write/fault contexts?
> >> >>
> >> >> (Assuming by 'it' you mean the behavior where we don't unconditionally
> >> >> allocate iops for blocksize < foliosize...)
> >> >>
> >> >> IIRC the reason is to reduce memory usage by eliding iop allocations
> >> >> unless it's absolutely necessary for correctness was /my/ understanding
> >> >> of why we don't always allocate the iop...
> >> >>
> >> >> > To optimize the case where pagecache is dirtied but truncated or
> >> >> > something and thus never written back?
> >> >>
> >> >> ...because this might very well happen.  Write a temporary .o file to
> >> >> the filesystem, then delete the whole thing before writeback ever gets
> >> >> its hands on the file.
> >> >>
> >> >
> >> > I don't think a simple temp write will trigger this scenario currently
> >> > because the folios would have to be uptodate at the time of the write to
> >> > bypass the iop alloc. I guess you'd have to read folios (even if backed
> >> > by holes) first to start seeing the !iop case at writeback time (for bs
> >> > != ps).
> >> >
> >> > That could change with these patches, but I was trying to reason about
> >> > the intent of the existing code and whether there was some known reason
> >> > to continue to try and defer the iop allocation as the need/complexity
> >> > for deferring it grows with the addition of more (i.e. dirty) tracking.
> >> >
> >> 
> >> Here is the 1st discussion/reasoning where the deferred iop allocation
> >> in the readpage path got discussed [1].
> >> And here is the discussion when I first pointed out the deferred
> >> allocation in writepage path. IMO, it got slipped in with the
> >> discussions maybe only on mailing list but nothing in the commit
> >> messages or comments.[2]
> >> 
> >> [1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/
> >> [2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/
> >> 
> >> >> > Is there any room for further improvement where the alloc could be
> >> >> > avoided completely for folio overwrites instead of just deferred?
> >> >>
> >> >> Once writeback starts, though, we need the iop so that we can know when
> >> >> all the writeback for that folio is actually complete, no matter how
> >> >> many IOs might be in flight for that folio.  I don't know how you'd get
> >> >> around this problem.
> >> >>
> >> >
> >> > Ok. I noticed some kind of counter or something being updated last time
> >> > I looked through that code, so it sounds like that's the reason the iop
> >> > eventually needs to exist. Thanks.
> >> >
> >> >> > Was that actually the case at some point and then something later
> >> >> > decided the iop was needed at writeback time, leading to current
> >> >> > behavior?
> >> >>
> >> >> It's been in iomap since the beginning when we lifted it from xfs.
> >> >>
> >> >
> >> > Not sure exactly what you're referring to here. iomap_writepage_map()
> >> > would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
> >> > ("iomap: Permit pages without an iop to enter writeback"), so I don't
> >> > see how iop allocs were deferred (other than when bs == ps, obviously)
> >> > prior to that.
> >> >
> >> > Heh, but I'm starting to get my wires crossed just trying to piece
> >> > things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
> >> > case is primarily a subtle side effect of the current writeback behavior
> >> > being driven by uptodate status. I think it's probably wise to drop it
> >> > at least initially, always alloc and dirty the appropriate iop ranges
> >> > for sub-folio blocks, and then if you or others think there is value in
> >> > the overwrite optimization to defer iop allocs, tack that on as a
> >> > separate patch and try to be consistent between buffered and mapped
> >> > writes.
> >> 
> >> Based on the discussion so far, I would like to think of this as follow:
> >> We already have some sort of lazy iop allocation in the buffered I/O
> >> path (discussed above). This patch series does not changes that
> >> behavior. For now I would like to keep the page mkwrite page as is
> >> without any lazy iop allocation optimization.
> >> I am ok to pick this optimization work as a seperate series
> >> because, IIUC, Christoph has some ideas on deferring iop allocations
> >> even further [2] (from link shared above).
> >> 
> >> Does that sound good?
> >> 
> >
> > Could you do that in two steps where the buffered I/O path variant is
> > replaced by explicit dirty tracking in the initial patch, and then is
> > restored by a subsequent patch in the same series? That would allow
> 
> Sorry, I couldn't follow it. Can you please elaborate.
> 

Sorry for the confusion...

> So, what I was suggesting was - for buffered I/O path we should continue
> to have the lazy iop allocation scheme whereever possible.
> Rest of the optimizations of further deferring the iop allocation
> including in the ->mkwrite path should be dealt seperately in a later
> patch series.
> 

Yup, agree.

> Also we already have a seperate patch in this series which defers the
> iop allocation if the write completely overwrites the folio [1].
> Earlier the behavior was that it skips it entirely if the folio was
> uptodate, but since we require it for per-block dirty tracking, we
> defer iop allocation only if it was a complete overwrite of the folio. 
> 

That is a prepatory patch before iop dirty tracking is enabled in patch
5, right? I was mainly just suggesting to make the overwrite checking
part of this patch come after dirty tracking is enabled (as a small
optimization patch) rather than before.

I don't want to harp on it if that's difficult or still doesn't make
sense for some reason. I'll take a closer look the next go around when I
have a bit more time and just send a diff if it seems it can be done
cleanly..

Brian

> [1]: https://lore.kernel.org/linux-xfs/ZGzRX9YVkAYJGLqV@bfoster/T/#m048d0a097f7abb09da1c12c9a02afbcc3b9d39ee
> 
> 
> -ritesh
> 
> > keeping it around and documenting it explicitly in the commit log for
> > the separate patch, but IMO makes this a bit easier to review (and
> > potentially debug/bisect if needed down the road).
> >
> > But I don't insist if that's too troublesome for some reason...
> >
> > Brian
> >
> >> >
> >> > Darrick noted above that he also agrees with that separate patch
> >> > approach. For me, I think it would also be useful to show that there is
> >> > some measurable performance benefit on at least one reasonable workload
> >> > to help justify it.
> >> 
> >> Agree that when we work on such optimizations as a seperate series, it
> >> will be worth measuring the performance benefits of that.
> >> 
> >> 
> >> -ritesh
> >> 
> >> >
> >> > Brian
> >> >
> >> >> --D (who is now weeks behind on reviewing things and stressed out)
> >> >>
> >> >> > Brian
> >> >> >
> >> >> > > Though I agree that we should ideally be allocatting & marking all
> >> >> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
> >> >> > > understand your reasoning better.
> >> >> > >
> >> >> > > Thanks!
> >> >> > > -ritesh
> >> >> > >
> >> >> >
> >> >>
> >> 
>

next prev parent reply	other threads:[~2023-05-23 15:21 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-07 19:27 [RFCv5 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
2023-05-07 19:27 ` [RFCv5 1/5] iomap: Rename iomap_page_create/release() to iop_alloc/free() Ritesh Harjani (IBM)
2023-05-18  6:13   ` Christoph Hellwig
2023-05-19 15:01     ` Ritesh Harjani
2023-05-07 19:27 ` [RFCv5 2/5] iomap: Refactor iop_set_range_uptodate() function Ritesh Harjani (IBM)
2023-05-07 21:54   ` kernel test robot
2023-05-08  2:42   ` kernel test robot
2023-05-15 15:09   ` Brian Foster
2023-05-16 10:12     ` Ritesh Harjani
2023-05-18  6:16   ` Christoph Hellwig
2023-05-19 15:03     ` Ritesh Harjani
2023-05-07 19:27 ` [RFCv5 3/5] iomap: Add iop's uptodate state handling functions Ritesh Harjani (IBM)
2023-05-15 15:10   ` Brian Foster
2023-05-16 10:14     ` Ritesh Harjani
2023-05-18  6:18   ` Christoph Hellwig
2023-05-19 15:07     ` Ritesh Harjani
2023-05-23  6:00       ` Christoph Hellwig
2023-05-07 19:27 ` [RFCv5 4/5] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
2023-05-18  6:21   ` Christoph Hellwig
2023-05-19 15:18     ` Ritesh Harjani
2023-05-19 15:53       ` Matthew Wilcox
2023-05-22  4:05         ` Ritesh Harjani
2023-05-07 19:28 ` [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
2023-05-07 22:14   ` kernel test robot
2023-05-08  5:00   ` kernel test robot
2023-05-15  8:16   ` Pankaj Raghav
2023-05-15  8:31     ` Ritesh Harjani
2023-05-15 13:23       ` Pankaj Raghav
2023-05-15 15:15   ` Brian Foster
2023-05-16 14:49     ` Ritesh Harjani
2023-05-16 19:29       ` Brian Foster
2023-05-17 15:20         ` Ritesh Harjani
2023-05-17 18:48           ` Brian Foster
2023-05-18 13:23             ` Christoph Hellwig
2023-05-18 16:15               ` Matthew Wilcox
2023-05-22  4:33                 ` Ritesh Harjani
2023-05-22  4:48                   ` Matthew Wilcox
2023-05-22 11:18                   ` Brian Foster
2023-05-23  0:56                     ` Darrick J. Wong
2023-05-23 12:15                       ` Brian Foster
2023-05-23 13:43                         ` Ritesh Harjani
2023-05-23 14:44                           ` Brian Foster
2023-05-23 15:02                             ` Ritesh Harjani
2023-05-23 15:22                               ` Brian Foster [this message]
2023-05-23 15:38                                 ` Ritesh Harjani
2023-05-23 15:59                                 ` Matthew Wilcox
2023-05-18 13:27   ` Christoph Hellwig
2023-05-19 16:08     ` Ritesh Harjani

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZGzaOSvpCAUsluhr@bfoster \
    --to=bfoster@redhat.com \
    --cc=araherle@in.ibm.com \
    --cc=david@fromorbit.com \
    --cc=disgoel@linux.ibm.com \
    --cc=djwong@kernel.org \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=ritesh.list@gmail.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.