Re: Splitting an iomap_page

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>,
	david@fromorbit.com, yukuai3@huawei.com,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: Splitting an iomap_page
Date: Mon, 24 Aug 2020 09:06:14 -0700	[thread overview]
Message-ID: <20200824160614.GO6107@magnolia> (raw)
In-Reply-To: <20200822060618.GE17129@infradead.org>

On Sat, Aug 22, 2020 at 07:06:18AM +0100, Christoph Hellwig wrote:
> On Fri, Aug 21, 2020 at 03:40:21PM +0100, Matthew Wilcox wrote:
> > I have only bad ideas for solving this problem, so I thought I'd share.
> > 
> > Operations like holepunch or truncate call into
> > truncate_inode_pages_range() which just remove THPs which are
> > entirely within the punched range, but pages which contain one or both
> > ends of the range need to be split.
> > 
> > What I have right now, and works, calls do_invalidatepage() from
> > truncate_inode_pages_range().  The iomap implementation just calls
> > iomap_page_release().  We have the page locked, and we've waited for
> > writeback to finish, so there is no I/O outstanding against the page.
> > We may then get into one of the writepage methods without an iomap being
> > allocated, so we have to allocate it.  Patches here:
> > 
> > http://git.infradead.org/users/willy/pagecache.git/commitdiff/167f81e880ef00d83ab7db50d56ed85bfbae2601
> > http://git.infradead.org/users/willy/pagecache.git/commitdiff/82fe90cde95420c3cf155b54ed66c74d5fb6ffc5
> > 
> > If the page is Uptodate, this works great.  But if it's not, then we're
> > going to unnecessarily re-read data from storage -- indeed, we may as
> > well just dump the entire page if it's !Uptodate.  Of course, right now
> > the only way to get THPs is through readahead and that's going to always
> > read the entire page, so we're never going to see a !Uptodate THP.  But
> > in the future, maybe we shall?  I wouldn't like to rely on that without
> > pasting some big warnings for anyone who tries to change it.
> > 
> > Alternative 1: Allocate a new iop for each base page if needed.  This is
> > the obvious approach.  If the block size is >= PAGE_SIZE, we don't even
> > need to allocate a new iop; we can just mark the page as being Uptodate
> > if that range is.  The problem is that we might need to allocate a lot of
> > them -- 512 if we're splitting a 2MB page into 4kB chunks (which would
> > be 12kB -- plus a 2kB array to hold 512 pointers).  And at the point
> > where we know we need to allocate them, we're under a spin_lock_irq().
> > We could allocate them in advance, but then we might find out we aren't
> > going to split this page after all.
> > 
> > Alternative 2: Always allocate an iop for each base page in a THP.  We pay
> > the allocation price up front for every THP, even though the majority
> > will never be split.  It does save us from allocating any iop at all for
> > block size >= page size, but it's a lot of extra overhead to gather all
> > the Uptodate bits.  As above, it'd be 12kB of iops vs 80 bytes that we
> > currently allocate for a 2MB THP.  144 once we also track dirty bits.
> > 
> > Alternative 3: Allow pages to share an iop.  Do something like add a
> > pgoff_t base and a refcount_t to the iop and have each base page point
> > to the same iop, using the part of the bit array indexed by (page->index
> > - base) * blocks_per_page.  The read/write count are harder to share,
> > and I'm not sure I see a way to do it.
> > 
> > Alternative 4: Don't support partially-uptodate THPs.  We declare (and
> > enforce with strategic assertions) that all THPs must always be Uptodate
> > (or Error) once unlocked.  If your workload benefits from using THPs,
> > you want to do big I/Os anyway, so these "write 512 bytes at a time
> > using O_SYNC" workloads aren't going to use THPs.
> > 
> > Funnily, buffer_heads are easier here.  They just need to be reparented
> > to their new page.  Of course, they're allocated up front, so they're
> > essentially alternative 2.
> 
> At least initially I'd go for 4.  And then see if someone screams loudly
> enough to reconsider.  And if we really have to I wonder if we can do
> a variation of variant 1 where we avoid allocating under the irqs
> disabled spinlock by a clever retry trick.

/me doesn't have any objection to #4, and bets #1 and #3 will probably
lead to weird problems /somewhere/ ;)

--D

next prev parent reply	other threads:[~2020-08-24 16:06 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-21 14:40 Splitting an iomap_page Matthew Wilcox
2020-08-22  6:06 ` Christoph Hellwig
2020-08-24 16:06   ` Darrick J. Wong [this message]
2020-09-04  3:37 ` Matthew Wilcox
2020-09-07 11:33   ` Kirill A. Shutemov
2020-09-08 11:43     ` Matthew Wilcox
2020-09-09 11:57       ` Kirill A. Shutemov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200824160614.GO6107@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=willy@infradead.org \
    --cc=yukuai3@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).