From: Kent Overstreet <kent.overstreet@linux.dev>
To: "Andreas Grünbacher" <andreas.gruenbacher@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>, Jan Kara <jack@suse.cz>,
cluster-devel@redhat.com, "Darrick J . Wong" <djwong@kernel.org>,
linux-kernel@vger.kernel.org, dhowells@redhat.com,
linux-bcachefs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Kent Overstreet <kent.overstreet@gmail.com>
Subject: Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
Date: Thu, 25 May 2023 20:39:32 -0400 [thread overview]
Message-ID: <ZG//xK/QSp1Wm72M@moria.home.lan> (raw)
In-Reply-To: <CAHpGcMKQke0f5-y6fg3O5dBwcTYX69dEbxZgDiFABgOLCc+zGw@mail.gmail.com>
On Fri, May 26, 2023 at 02:05:26AM +0200, Andreas Grünbacher wrote:
> Oh, it's just that gfs2 uses one dlm lock per inode to control access
> to that inode. In the code, this is called the "inode glock" ---
> glocks being an abstraction above dlm locks --- but it boils down to
> dlm locks in the end. An additional layer of locking will only work
> correctly if all cluster nodes use the new locks consistently, so old
> cluster nodes will become incompatible. Those kinds of changes are
> hard.
>
> But the additional lock taking would also hurt performance, forever,
> and I'd really like to avoid taking that hit.
>
> It may not be obvious to everyone, but allowing page faults during
> reads and writes (i.e., while holding dlm locks) can lead to
> distributed deadlocks. We cannot just pretend to be a local
> filesystem.
Ah, gotcha. Same basic issue then, dio -> page fault means you're taking
two DLM locks simulateously, so without lock ordering you get deadlock.
So that means promoting this to the VFS won't be be useful to you
(because the VFS lock will be a local lock, not your DLM lock).
Do you have trylock() and a defined lock ordering you can check for or
glocks (inode number)? Here's the new and expanded commit message, in
case my approach is of interest to you:
commit 32858d0fb90b503c8c39e899ad5155e44639d3b5
Author: Kent Overstreet <kent.overstreet@gmail.com>
Date: Wed Oct 16 15:03:50 2019 -0400
sched: Add task_struct->faults_disabled_mapping
There has been a long standing page cache coherence bug with direct IO.
This provides part of a mechanism to fix it, currently just used by
bcachefs but potentially worth promoting to the VFS.
Direct IO evicts the range of the pagecache being read or written to.
For reads, we need dirty pages to be written to disk, so that the read
doesn't return stale data. For writes, we need to evict that range of
the pagecache so that it's not stale after the write completes.
However, without a locking mechanism to prevent those pages from being
re-added to the pagecache - by a buffered read or page fault - page
cache inconsistency is still possible.
This isn't necessarily just an issue for userspace when they're playing
games; filesystems may hang arbitrary state off the pagecache, and so
page cache inconsistency may cause real filesystem bugs, depending on
the filesystem. This is less of an issue for iomap based filesystems,
but e.g. buffer heads caches disk block mappings (!) and attaches them
to the pagecache, and bcachefs attaches disk reservations to pagecache
pages.
This issue has been hard to fix, because
- we need to add a lock (henceforth calld pagecache_add_lock), which
would be held for the duration of the direct IO
- page faults add pages to the page cache, thus need to take the same
lock
- dio -> gup -> page fault thus can deadlock
And we cannot enforce a lock ordering with this lock, since userspace
will be controlling the lock ordering (via the fd and buffer arguments
to direct IOs), so we need a different method of deadlock avoidance.
We need to tell the page fault handler that we're already holding a
pagecache_add_lock, and since plumbing it through the entire gup() path
would be highly impractical this adds a field to task_struct.
Then the full method is:
- in the dio path, when we take first pagecache_add_lock, note the
mapping in task_struct
- in the page fault handler, if faults_disabled_mapping is set, we
check if it's the same mapping as the one taking a page fault for,
and if so return an error.
Then we check lock ordering: if there's a lock ordering violation and
trylock fails, we'll have to cycle the locks and return an error that
tells the DIO path to retry: faults_disabled_mapping is also used for
signalling "locks were dropped, please retry".
Also relevant to this patch: mapping->invalidate_lock.
mapping->invalidate_lock provides most of the required semantics - it's
used by truncate/fallocate to block pages being added to the pagecache.
However, since it's a rwsem, direct IOs would need to take the write
side in order to block page cache adds, and would then be exclusive with
each other - we'll need a new type of lock to pair with this approach.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jan Kara <jack@suse.cz>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Andreas Grünbacher <andreas.gruenbacher@gmail.com>
next prev parent reply other threads:[~2023-05-26 0:39 UTC|newest]
Thread overview: 186+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
2023-05-09 16:56 ` [PATCH 01/32] Compiler Attributes: add __flatten Kent Overstreet
2023-05-09 17:04 ` Miguel Ojeda
2023-05-09 17:24 ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 02/32] locking/lockdep: lock_class_is_held() Kent Overstreet
2023-05-09 19:30 ` Peter Zijlstra
2023-05-09 20:11 ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion() Kent Overstreet
2023-05-09 19:31 ` Peter Zijlstra
2023-05-09 19:57 ` Kent Overstreet
2023-05-09 20:18 ` Kent Overstreet
2023-05-09 20:27 ` Waiman Long
2023-05-09 20:35 ` Kent Overstreet
2023-05-09 21:37 ` Waiman Long
2023-05-10 8:59 ` Peter Zijlstra
2023-05-10 20:38 ` Kent Overstreet
2023-05-11 8:25 ` Peter Zijlstra
2023-05-11 9:32 ` Kent Overstreet
2023-05-12 20:49 ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 04/32] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
2023-05-11 12:14 ` Jan Engelhardt
2023-05-12 20:58 ` Kent Overstreet
2023-05-12 22:39 ` Jan Engelhardt
2023-05-12 23:26 ` Kent Overstreet
2023-05-12 23:49 ` Randy Dunlap
2023-05-13 0:17 ` Kent Overstreet
2023-05-13 0:45 ` Eric Biggers
2023-05-13 0:51 ` Kent Overstreet
2023-05-14 12:15 ` Jeff Layton
2023-05-15 2:39 ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 05/32] MAINTAINERS: Add entry for six locks Kent Overstreet
2023-05-09 16:56 ` [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping Kent Overstreet
2023-05-10 1:07 ` Jan Kara
2023-05-10 6:18 ` Kent Overstreet
2023-05-23 13:34 ` Jan Kara
2023-05-23 16:21 ` [Cluster-devel] " Christoph Hellwig
2023-05-23 16:35 ` Kent Overstreet
2023-05-24 6:43 ` Christoph Hellwig
2023-05-24 8:09 ` Kent Overstreet
2023-05-25 8:58 ` Christoph Hellwig
2023-05-25 20:50 ` Kent Overstreet
2023-05-26 8:06 ` Christoph Hellwig
2023-05-26 8:34 ` Kent Overstreet
2023-05-25 21:40 ` Kent Overstreet
2023-05-25 22:25 ` Andreas Grünbacher
2023-05-25 23:20 ` Kent Overstreet
2023-05-26 0:05 ` Andreas Grünbacher
2023-05-26 0:39 ` Kent Overstreet [this message]
2023-05-26 8:10 ` Christoph Hellwig
2023-05-26 8:38 ` Kent Overstreet
2023-05-23 16:49 ` Kent Overstreet
2023-05-25 8:47 ` Jan Kara
2023-05-25 21:36 ` Kent Overstreet
2023-05-25 22:45 ` Andreas Grünbacher
2023-05-25 22:04 ` Andreas Grünbacher
2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
2023-05-09 18:19 ` Lorenzo Stoakes
2023-05-09 20:15 ` Kent Overstreet
2023-05-09 20:46 ` Christoph Hellwig
2023-05-09 21:12 ` Lorenzo Stoakes
2023-05-09 21:29 ` Kent Overstreet
2023-05-10 6:48 ` Eric Biggers
2023-05-12 18:36 ` Kent Overstreet
2023-05-13 1:57 ` Eric Biggers
2023-05-13 19:28 ` Kent Overstreet
2023-05-14 5:45 ` Kent Overstreet
2023-05-14 18:43 ` Eric Biggers
2023-05-15 5:38 ` Kent Overstreet
2023-05-15 6:13 ` Eric Biggers
2023-05-15 6:18 ` Kent Overstreet
2023-05-15 7:13 ` Eric Biggers
2023-05-15 7:26 ` Kent Overstreet
2023-05-21 21:33 ` Eric Biggers
2023-05-21 22:04 ` Kent Overstreet
2023-05-15 10:29 ` David Laight
2023-05-10 11:56 ` David Laight
2023-05-09 21:43 ` Darrick J. Wong
2023-05-09 21:54 ` Kent Overstreet
2023-05-11 5:33 ` Theodore Ts'o
2023-05-11 5:44 ` Kent Overstreet
2023-05-13 13:25 ` Lorenzo Stoakes
2023-05-14 18:39 ` Christophe Leroy
2023-05-14 23:43 ` Kent Overstreet
2023-05-15 4:45 ` Christophe Leroy
2023-05-15 5:02 ` Kent Overstreet
2023-05-10 14:18 ` Christophe Leroy
2023-05-10 15:05 ` Johannes Thumshirn
2023-05-11 22:28 ` Kees Cook
2023-05-12 18:41 ` Kent Overstreet
2023-05-16 21:02 ` Kees Cook
2023-05-16 21:20 ` Kent Overstreet
2023-05-16 21:47 ` Matthew Wilcox
2023-05-16 21:57 ` Kent Overstreet
2023-05-17 5:28 ` Kent Overstreet
2023-05-17 14:04 ` Mike Rapoport
2023-05-17 14:18 ` Kent Overstreet
2023-05-17 15:44 ` Mike Rapoport
2023-05-17 15:59 ` Kent Overstreet
2023-06-17 4:13 ` Andy Lutomirski
2023-06-17 15:34 ` Kent Overstreet
2023-06-17 19:19 ` Andy Lutomirski
2023-06-17 20:08 ` Kent Overstreet
2023-06-17 20:35 ` Andy Lutomirski
2023-06-19 19:45 ` Kees Cook
2023-06-20 0:39 ` Kent Overstreet
2023-06-19 9:19 ` Mark Rutland
2023-06-19 10:47 ` Kent Overstreet
2023-06-19 12:47 ` Mark Rutland
2023-06-19 19:17 ` Kent Overstreet
2023-06-20 17:42 ` Andy Lutomirski
2023-06-20 18:08 ` Kent Overstreet
2023-06-20 18:15 ` Andy Lutomirski
2023-06-20 18:48 ` Dave Hansen
2023-06-20 20:18 ` Kent Overstreet
2023-06-20 20:42 ` Andy Lutomirski
2023-06-20 22:32 ` Andy Lutomirski
2023-06-20 22:43 ` Nadav Amit
2023-06-21 1:27 ` Andy Lutomirski
2023-05-09 16:56 ` [PATCH 08/32] fs: factor out d_mark_tmpfile() Kent Overstreet
2023-05-09 16:56 ` [PATCH 09/32] block: Add some exports for bcachefs Kent Overstreet
2023-05-09 16:56 ` [PATCH 10/32] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset Kent Overstreet
2023-05-09 16:56 ` [PATCH 11/32] block: Bring back zero_fill_bio_iter Kent Overstreet
2023-05-09 16:56 ` [PATCH 12/32] block: Rework bio_for_each_segment_all() Kent Overstreet
2023-05-09 16:56 ` [PATCH 13/32] block: Rework bio_for_each_folio_all() Kent Overstreet
2023-05-09 16:56 ` [PATCH 14/32] block: Don't block on s_umount from __invalidate_super() Kent Overstreet
2023-05-09 16:56 ` [PATCH 15/32] bcache: move closures to lib/ Kent Overstreet
2023-05-10 1:10 ` Randy Dunlap
2023-05-09 16:56 ` [PATCH 16/32] MAINTAINERS: Add entry for closures Kent Overstreet
2023-05-09 17:05 ` Coly Li
2023-05-09 21:03 ` Randy Dunlap
2023-05-09 16:56 ` [PATCH 17/32] closures: closure_wait_event() Kent Overstreet
2023-05-09 16:56 ` [PATCH 18/32] closures: closure_nr_remaining() Kent Overstreet
2023-05-09 16:56 ` [PATCH 19/32] closures: Add a missing include Kent Overstreet
2023-05-09 16:56 ` [PATCH 20/32] vfs: factor out inode hash head calculation Kent Overstreet
2023-05-23 9:27 ` (subset) " Christian Brauner
2023-05-23 22:53 ` Dave Chinner
2023-05-24 6:44 ` Christoph Hellwig
2023-05-24 7:35 ` Dave Chinner
2023-05-24 8:31 ` Christian Brauner
2023-05-24 8:41 ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 21/32] hlist-bl: add hlist_bl_fake() Kent Overstreet
2023-05-10 4:48 ` Dave Chinner
2023-05-23 9:27 ` (subset) " Christian Brauner
2023-05-09 16:56 ` [PATCH 22/32] vfs: inode cache conversion to hash-bl Kent Overstreet
2023-05-10 4:45 ` Dave Chinner
2023-05-16 15:45 ` Christian Brauner
2023-05-16 16:17 ` Kent Overstreet
2023-05-16 23:15 ` Dave Chinner
2023-05-22 13:04 ` Christian Brauner
2023-05-23 9:28 ` (subset) " Christian Brauner
2023-10-19 15:30 ` Mateusz Guzik
2023-10-19 15:59 ` Mateusz Guzik
2023-10-20 11:38 ` Dave Chinner
2023-10-20 17:49 ` Mateusz Guzik
2023-10-21 12:13 ` Mateusz Guzik
2023-10-23 5:10 ` Dave Chinner
2023-10-27 17:13 ` Mateusz Guzik
2023-10-27 18:36 ` Darrick J. Wong
2023-10-31 11:02 ` Christian Brauner
2023-10-31 11:31 ` Mateusz Guzik
2023-11-02 2:36 ` Kent Overstreet
2023-11-04 20:51 ` Dave Chinner
2023-05-09 16:56 ` [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic() Kent Overstreet
2023-05-10 2:20 ` kernel test robot
2023-05-11 2:08 ` kernel test robot
2023-05-09 16:56 ` [PATCH 24/32] MAINTAINERS: Add entry for generic-radix-tree Kent Overstreet
2023-05-09 21:03 ` Randy Dunlap
2023-05-09 16:56 ` [PATCH 25/32] lib/generic-radix-tree.c: Don't overflow in peek() Kent Overstreet
2023-05-09 16:56 ` [PATCH 26/32] lib/generic-radix-tree.c: Add a missing include Kent Overstreet
2023-05-09 16:56 ` [PATCH 27/32] lib/generic-radix-tree.c: Add peek_prev() Kent Overstreet
2023-05-09 16:56 ` [PATCH 28/32] stacktrace: Export stack_trace_save_tsk Kent Overstreet
2023-06-19 9:10 ` Mark Rutland
2023-06-19 11:16 ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote Kent Overstreet
2023-07-12 19:58 ` Kees Cook
2023-07-12 20:19 ` Kent Overstreet
2023-07-12 22:38 ` Kees Cook
2023-07-12 23:53 ` Kent Overstreet
2023-07-12 20:23 ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 30/32] lib: Export errname Kent Overstreet
2023-05-09 16:56 ` [PATCH 31/32] lib: add mean and variance module Kent Overstreet
2023-05-09 16:56 ` [PATCH 32/32] MAINTAINERS: Add entry for bcachefs Kent Overstreet
2023-05-09 21:04 ` Randy Dunlap
2023-05-09 21:07 ` Kent Overstreet
2023-06-15 20:41 ` [PATCH 00/32] bcachefs - a new COW filesystem Pavel Machek
2023-06-15 21:26 ` Kent Overstreet
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZG//xK/QSp1Wm72M@moria.home.lan \
--to=kent.overstreet@linux.dev \
--cc=andreas.gruenbacher@gmail.com \
--cc=cluster-devel@redhat.com \
--cc=dhowells@redhat.com \
--cc=djwong@kernel.org \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=kent.overstreet@gmail.com \
--cc=linux-bcachefs@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).