From: Dave Chinner <david@fromorbit.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Amir Goldstein <amir73il@gmail.com>,
Hugh Dickins <hughd@google.com>,
Michael Larabel <Michael@michaellarabel.com>,
Ted Ts'o <tytso@google.com>,
Andreas Dilger <adilger.kernel@dilger.ca>,
Ext4 Developers List <linux-ext4@vger.kernel.org>,
Jan Kara <jack@suse.cz>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: Kernel Benchmarking
Date: Sun, 13 Sep 2020 10:40:57 +1000 [thread overview]
Message-ID: <20200913004057.GR12096@dread.disaster.area> (raw)
In-Reply-To: <CAHk-=whjhYa3ig0U_mtpoU5Zok_2Y5zTCw8f-THkf1vHRBDNuA@mail.gmail.com>
On Sat, Sep 12, 2020 at 10:59:40AM -0700, Linus Torvalds wrote:
[...]
> In particular, the page locking is often used for just verifying
> simple things, with the most common example being "lock page, check
> that the mapping is still valid, insert page into page tables, unlock
> page".
>
> The reason the apache benchmark regresses is that it basically does a
> web server test with a single file ("test.html") that gets served by
> just mmap'ing it, and sending it out that way. Using lots of threads,
> and using lots of different mappings. So they *all* fault on the read
> of that page, and they *all* do that "lock page, check that the
> mapping is valid, insert page" dance.
Hmmmm. So this is a typically a truncate race check, but this isn't
sufficient to protect the fault against all page invalidation races
as the page can be re-inserted into the same mapping at a different
page->index now within EOF.
Hence filesystems that support hole punching have to serialise the
->fault path against the page invalidations done in ->fallocate
operations because they otherwise we get data corruption from the
mm/ truncate checks failing to detect invalidated pages within EOF
correctly.
i.e. truncate/hole punch is a multi-object modification operation,
with the typical atomicity boundary of the operation defined by the
inode_lock() and/or the filesystem transaction that makes the
modification. IOWs, page_lock() based truncation/invalidation checks
aren't atomic w.r.t. the other objects being modified in the same
operation. Truncate avoids this by the ordering the file size update
vs the page cache invalidation, but no such ordering protection can
be provided for ->fallocate() operations that directly manipulate
the metadata of user data in the file.
> Anyway, I don't have a great solution. I have a few options (roughly
> ordered by "simplest to most complex"):
>
> (a) just revert
> (b) add some busy-spinning
> (c) reader-writer page lock
> (d) try to de-emphasize the page lock
....
> Option (d) is "we already have a locking in many filesystems that give
> us exclusion between faulting in a page, and the truncate/hole punch,
> so we shouldn't use the page lock at all".
>
> I do think that the locking that filesystems do is in many ways
> inferior - it's done on a per-inode basis rather than on a per-page
> basis. But if the filesystems end up doing that *anyway*, what's the
> advantage of the finer granularity one? And *because* the common case
> is all about the reading case, the bigger granularity tends to work
> very well in practice, and basically never sees contention.
*nod*
Given that:
1) we have been doing (d) for 5 years (see commit 653c60b633a ("xfs:
introduce mmap/truncate lock")),
2) ext4 also has this same functionality,
3) DAX requires the ability for filesystems to exclude page faults
4) it is a widely deployed and tested solution
5) filesystems will still need to be able to exclude page faults
over a file range while they directly manipulate file metadata to
change the user data in the file
> So I think option (c) is potentially technically better because it has
> smaller locking granularity, but in practice (d) might be easier and
> we already effectively do it for several filesystems.
Right. Even if we go for (c), AFAICT we still need (d) because we
still (d) largely because of reason (5) above. There are a whole
class of "page fault vs direct storage manipulation offload"
serialisation issues that filesystems have to consider (especially
if they want to support DAX), so if we can use that same mechanism
to knock a whole bunch of page locking out of the fault paths then
that seems like a win to me....
> Any other suggestions than those (a)-(d) ones above?
Not really - I've been advocating for (d) as the general mechanism
for truncate/holepunch exclusion for quite a few years now because
it largely seems to work with no obvious/apparent issues.
Just as a FWIW: I agree that the per-inode rwsem could be an issue
here, jsut as it is for the IO path. As a side project I'm working
on shared/exclusive range locks for the XFS inode to replace the
rwsems for the XFS_IOLOCK_{SHARED,EXCL} and the
XFS_MMAPLOCK_{SHARED,EXCL}.
That will largely alleviate any problems that "per-inode rwsem"
serialisation migh cause us here - I've got the DIO fastpath down to
2 atomic operations per lock/unlock - it's with 10% of rwsems up to
approx. half a million concurrent DIO read/writes to the same inode.
Concurrent buffered read/write are not far behind direct IO until I
run out of CPU to copy data. None of this requires changes to
anything outside fs/xfs because everythign is already correctly serialised
to "atomic" filesystem operations and range locking preserves the
atomicity of those operations including all the page cache
operations done within them.
Hence I'd much prefer to be moving the page cache in a direction
that results in the page cache not having to care at all about
serialising against racing truncates, hole punches or anythign else
that runs page invalidation. That will make the page cache code
simpler, require less locking, and likely have less invalidation
related issues over time...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2020-09-13 0:41 UTC|newest]
Thread overview: 65+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <CAHk-=wiZnE409WkTOG6fbF_eV1LgrHBvMtyKkpTqM9zT5hpf9A@mail.gmail.com>
[not found] ` <4ced9401-de3d-b7c9-9976-2739e837fafc@MichaelLarabel.com>
[not found] ` <CAHk-=wj+Qj=wXByMrAx3T8jmw=soUetioRrbz6dQaECx+zjMtg@mail.gmail.com>
[not found] ` <CAHk-=wgOPjbJsj-LeLc-JMx9Sz9DjGF66Q+jQFJROt9X9utdBg@mail.gmail.com>
[not found] ` <CAHk-=wjjK7PTnDZNi039yBxSHtAqusFoRrZzgMNTiYkJYdNopw@mail.gmail.com>
[not found] ` <aa90f272-1186-f9e1-8fdb-eefd332fdae8@MichaelLarabel.com>
[not found] ` <CAHk-=wh_31_XBNHbdF7EUJceLpEpwRxVF+_1TONzyBUym6Pw4w@mail.gmail.com>
[not found] ` <e24ef34d-7b1d-dd99-082d-28ca285a79ff@MichaelLarabel.com>
[not found] ` <CAHk-=wgEE4GuNjcRaaAvaS97tW+239-+tjcPjTq2FGhEuM8HYg@mail.gmail.com>
[not found] ` <6e1d8740-2594-c58b-ff02-a04df453d53c@MichaelLarabel.com>
[not found] ` <CAHk-=wgJ3-cEkU-5zXFPvRCHKkCCuKxVauYWGphjePEhJJgtgQ@mail.gmail.com>
[not found] ` <d2023f4c-ef14-b877-b5bb-e4f8af332abc@MichaelLarabel.com>
[not found] ` <CAHk-=wiz=J=8mJ=zRG93nuJ9GtQAm5bSRAbWJbWZuN4Br38+EQ@mail.gmail.com>
2020-09-11 0:05 ` Kernel Benchmarking Linus Torvalds
2020-09-11 0:49 ` Michael Larabel
2020-09-11 2:20 ` Linus Torvalds
[not found] ` <0cbc959e-1b8d-8d7e-1dc6-672cf5b3899a@MichaelLarabel.com>
2020-09-11 16:19 ` Linus Torvalds
2020-09-11 22:07 ` Linus Torvalds
2020-09-11 22:37 ` Michael Larabel
2020-09-12 7:28 ` Amir Goldstein
2020-09-12 10:32 ` Michael Larabel
2020-09-12 14:37 ` Matthew Wilcox
2020-09-12 14:44 ` Michael Larabel
2020-09-15 3:32 ` Matthew Wilcox
2020-09-15 10:39 ` Jan Kara
2020-09-15 13:52 ` Matthew Wilcox
[not found] ` <658ae026-32d9-0a25-5a59-9c510d6898d5@MichaelLarabel.com>
2020-09-14 17:47 ` Linus Torvalds
2020-09-14 20:21 ` Matthieu Baerts
2020-09-14 20:53 ` Linus Torvalds
2020-09-15 0:42 ` Linus Torvalds
2020-09-15 15:34 ` Matthieu Baerts
2020-09-15 18:27 ` Linus Torvalds
2020-09-15 18:47 ` Linus Torvalds
2020-09-15 19:26 ` Matthieu Baerts
2020-09-15 19:32 ` Linus Torvalds
2020-09-15 19:56 ` Matthieu Baerts
2020-09-15 23:35 ` Linus Torvalds
2020-09-16 10:34 ` Jan Kara
2020-09-16 18:47 ` Linus Torvalds
[not found] ` <9a92bf16-02c5-ba38-33c7-f350588ac874@tessares.net>
2020-09-15 19:24 ` Linus Torvalds
2020-09-15 19:38 ` Matthieu Baerts
2020-09-15 18:31 ` Linus Torvalds
2020-09-15 14:21 ` Michael Larabel
2020-09-15 17:52 ` Linus Torvalds
2020-09-17 17:51 ` Linus Torvalds
2020-09-17 18:23 ` Matthew Wilcox
2020-09-17 18:30 ` Linus Torvalds
2020-09-17 18:50 ` Matthew Wilcox
2020-09-17 19:00 ` Linus Torvalds
2020-09-17 19:27 ` Matthew Wilcox
2020-09-17 19:47 ` Linus Torvalds
2020-09-18 0:39 ` Sedat Dilek
2020-09-18 0:40 ` Sedat Dilek
2020-09-18 20:25 ` Sedat Dilek
2020-09-20 17:06 ` Linus Torvalds
2020-09-20 17:14 ` Sedat Dilek
2020-09-20 17:40 ` Linus Torvalds
2020-09-20 18:00 ` Sedat Dilek
2020-09-20 23:23 ` Dave Chinner
2020-09-20 23:31 ` Linus Torvalds
2020-09-20 23:40 ` Linus Torvalds
2020-09-21 1:20 ` Dave Chinner
2020-09-12 15:53 ` Matthew Wilcox
2020-09-12 17:59 ` Linus Torvalds
2020-09-12 20:32 ` Rogério Brito
2020-09-14 9:33 ` Jan Kara
2020-09-12 20:58 ` Josh Triplett
2020-09-12 20:59 ` James Bottomley
2020-09-12 21:15 ` Linus Torvalds
2020-09-12 22:32 ` Matthew Wilcox
2020-09-13 0:40 ` Dave Chinner [this message]
2020-09-13 2:39 ` Linus Torvalds
2020-09-13 3:40 ` Matthew Wilcox
2020-09-13 23:45 ` Dave Chinner
2020-09-14 3:31 ` Matthew Wilcox
2020-09-15 14:28 ` Chris Mason
2020-09-15 9:27 ` Jan Kara
2020-09-13 3:18 ` Matthew Wilcox
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200913004057.GR12096@dread.disaster.area \
--to=david@fromorbit.com \
--cc=Michael@michaellarabel.com \
--cc=adilger.kernel@dilger.ca \
--cc=amir73il@gmail.com \
--cc=hughd@google.com \
--cc=jack@suse.cz \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=torvalds@linux-foundation.org \
--cc=tytso@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.