linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Ritesh Harjani <ritesh.list@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Dave Chinner <david@fromorbit.com>,
	Ojaswin Mujoo <ojaswin@linux.ibm.com>,
	Disha Goel <disgoel@linux.ibm.com>,
	Aravinda Herle <araherle@in.ibm.com>
Subject: Re: [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance
Date: Tue, 23 May 2023 08:15:26 -0400	[thread overview]
Message-ID: <ZGyuXgGzuFWmHnsd@bfoster> (raw)
In-Reply-To: <20230523005625.GE11620@frogsfrogsfrogs>

On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> > > Matthew Wilcox <willy@infradead.org> writes:
> > > 
> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
> > > >> > already have an iop?
> > > >>
> > > >> Yes.
> > > >
> > > > Hm, well, maybe?  If somebody stores to a page, we obviously set the
> > > > dirty flag on the folio, but depending on the architecture, we may
> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
> > > > contiguous PTE bit).  If we do have independent dirty bits, we could
> > > > dirty only the blocks corresponding to a single page at a time.
> > > >
> > > > This has potential for causing some nasty bugs, so I'm inclined to
> > > > rule that if a folio is mmaped, then it's all dirty from any writable
> > > > page fault.  The fact is that applications generally do not perform
> > > > writes through mmap because the error handling story is so poor.
> > > >
> > > > There may be a different answer for anonymous memory, but that doesn't
> > > > feel like my problem and shouldn't feel like any FS developer's problem.
> > > 
> > > Although I am skeptical too to do the changes which Brian is suggesting
> > > here. i.e. not making all the blocks of the folio dirty when we are
> > > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes).
> > > 
> > > However, I am sorry but I coudn't completely follow your reasoning
> > > above. I think what Brian is suggesting here is that
> > > filemap_dirty_folio() should be similar to complete buffered overwrite
> > > case where we do not allocate the iop at the ->write_begin() time.
> > > Then at the writeback time we allocate an iop and mark all blocks dirty.
> > > 
> > 
> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
> > granularity of intra-folio faults) makes sense, but I'm also not sure
> > what it has to do with the idea of being consistent with how full folio
> > overwrites are implemented (between buffered or mapped writes). We're
> > not changing historical dirtying granularity either way. I think this is
> > just a bigger picture thought for future consideration as opposed to
> > direct feedback on this patch..
> 
> <nod>
> 
> > > In a way it is also the similar case as for mmapped writes too but my
> > > only worry is the way mmaped writes work and it makes more
> > > sense to keep the dirty state of folio and per-block within iop in sync.
> > > For that matter, we can even just make sure we always allocate an iop in
> > > the complete overwrites case as well. I didn't change that code because
> > > it was kept that way for uptodate state as well and based on one of your
> > > inputs for complete overwrite case.
> > > 
> > 
> > Can you elaborate on your concerns, out of curiosity?
> > 
> > Either way, IMO it also seems reasonable to drop this behavior for the
> > basic implementation of dirty tracking (so always allocate the iop for
> > sub-folio tracking as you suggest above) and then potentially restore it
> > as a separate optimization patch at the end of the series.
> 
> Agree.
> 
> > That said, I'm not totally clear why it exists in the first place, so
> > that might warrant some investigation. Is it primarily to defer
> > allocations out of task write/fault contexts?
> 
> (Assuming by 'it' you mean the behavior where we don't unconditionally
> allocate iops for blocksize < foliosize...)
> 
> IIRC the reason is to reduce memory usage by eliding iop allocations
> unless it's absolutely necessary for correctness was /my/ understanding
> of why we don't always allocate the iop...
> 
> > To optimize the case where pagecache is dirtied but truncated or
> > something and thus never written back?
> 
> ...because this might very well happen.  Write a temporary .o file to
> the filesystem, then delete the whole thing before writeback ever gets
> its hands on the file.
> 

I don't think a simple temp write will trigger this scenario currently
because the folios would have to be uptodate at the time of the write to
bypass the iop alloc. I guess you'd have to read folios (even if backed
by holes) first to start seeing the !iop case at writeback time (for bs
!= ps).

That could change with these patches, but I was trying to reason about
the intent of the existing code and whether there was some known reason
to continue to try and defer the iop allocation as the need/complexity
for deferring it grows with the addition of more (i.e. dirty) tracking.

> > Is there any room for further improvement where the alloc could be
> > avoided completely for folio overwrites instead of just deferred?
> 
> Once writeback starts, though, we need the iop so that we can know when
> all the writeback for that folio is actually complete, no matter how
> many IOs might be in flight for that folio.  I don't know how you'd get
> around this problem.
> 

Ok. I noticed some kind of counter or something being updated last time
I looked through that code, so it sounds like that's the reason the iop
eventually needs to exist. Thanks.

> > Was that actually the case at some point and then something later
> > decided the iop was needed at writeback time, leading to current
> > behavior?
> 
> It's been in iomap since the beginning when we lifted it from xfs.
> 

Not sure exactly what you're referring to here. iomap_writepage_map()
would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
("iomap: Permit pages without an iop to enter writeback"), so I don't
see how iop allocs were deferred (other than when bs == ps, obviously)
prior to that.

Heh, but I'm starting to get my wires crossed just trying to piece
things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
case is primarily a subtle side effect of the current writeback behavior
being driven by uptodate status. I think it's probably wise to drop it
at least initially, always alloc and dirty the appropriate iop ranges
for sub-folio blocks, and then if you or others think there is value in
the overwrite optimization to defer iop allocs, tack that on as a
separate patch and try to be consistent between buffered and mapped
writes.

Darrick noted above that he also agrees with that separate patch
approach. For me, I think it would also be useful to show that there is
some measurable performance benefit on at least one reasonable workload
to help justify it.

Brian

> --D (who is now weeks behind on reviewing things and stressed out)
> 
> > Brian
> > 
> > > Though I agree that we should ideally be allocatting & marking all
> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to
> > > understand your reasoning better.
> > > 
> > > Thanks!
> > > -ritesh
> > > 
> > 
> 


  reply	other threads:[~2023-05-23 12:13 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-07 19:27 [RFCv5 0/5] iomap: Add support for per-block dirty state to improve write performance Ritesh Harjani (IBM)
2023-05-07 19:27 ` [RFCv5 1/5] iomap: Rename iomap_page_create/release() to iop_alloc/free() Ritesh Harjani (IBM)
2023-05-18  6:13   ` Christoph Hellwig
2023-05-19 15:01     ` Ritesh Harjani
2023-05-07 19:27 ` [RFCv5 2/5] iomap: Refactor iop_set_range_uptodate() function Ritesh Harjani (IBM)
2023-05-15 15:09   ` Brian Foster
2023-05-16 10:12     ` Ritesh Harjani
2023-05-18  6:16   ` Christoph Hellwig
2023-05-19 15:03     ` Ritesh Harjani
2023-05-07 19:27 ` [RFCv5 3/5] iomap: Add iop's uptodate state handling functions Ritesh Harjani (IBM)
2023-05-15 15:10   ` Brian Foster
2023-05-16 10:14     ` Ritesh Harjani
2023-05-18  6:18   ` Christoph Hellwig
2023-05-19 15:07     ` Ritesh Harjani
2023-05-23  6:00       ` Christoph Hellwig
2023-05-07 19:27 ` [RFCv5 4/5] iomap: Allocate iop in ->write_begin() early Ritesh Harjani (IBM)
2023-05-18  6:21   ` Christoph Hellwig
2023-05-19 15:18     ` Ritesh Harjani
2023-05-19 15:53       ` Matthew Wilcox
2023-05-22  4:05         ` Ritesh Harjani
2023-05-07 19:28 ` [RFCv5 5/5] iomap: Add per-block dirty state tracking to improve performance Ritesh Harjani (IBM)
     [not found]   ` <CGME20230515081618eucas1p1c852fec3ba7a42ee7094248c30ff5978@eucas1p1.samsung.com>
2023-05-15  8:16     ` Pankaj Raghav
2023-05-15  8:31       ` Ritesh Harjani
2023-05-15 13:23         ` Pankaj Raghav
2023-05-15 15:15   ` Brian Foster
2023-05-16 14:49     ` Ritesh Harjani
2023-05-16 19:29       ` Brian Foster
2023-05-17 15:20         ` Ritesh Harjani
2023-05-17 18:48           ` Brian Foster
2023-05-18 13:23             ` Christoph Hellwig
2023-05-18 16:15               ` Matthew Wilcox
2023-05-22  4:33                 ` Ritesh Harjani
2023-05-22  4:48                   ` Matthew Wilcox
2023-05-22 11:18                   ` Brian Foster
2023-05-23  0:56                     ` Darrick J. Wong
2023-05-23 12:15                       ` Brian Foster [this message]
2023-05-23 13:43                         ` Ritesh Harjani
2023-05-23 14:44                           ` Brian Foster
2023-05-23 15:02                             ` Ritesh Harjani
2023-05-23 15:22                               ` Brian Foster
2023-05-23 15:38                                 ` Ritesh Harjani
2023-05-23 15:59                                 ` Matthew Wilcox
2023-05-18 13:27   ` Christoph Hellwig
2023-05-19 16:08     ` Ritesh Harjani

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZGyuXgGzuFWmHnsd@bfoster \
    --to=bfoster@redhat.com \
    --cc=araherle@in.ibm.com \
    --cc=david@fromorbit.com \
    --cc=disgoel@linux.ibm.com \
    --cc=djwong@kernel.org \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=ritesh.list@gmail.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).