linux-btrfs.vger.kernel.org archive mirror
* Re: [PATCH 01/10] mm: pagecache add lock
       [not found]   ` <20180518131305.GA6361@bombadil.infradead.org>
@ 2018-05-18 15:53     ` Christoph Hellwig
  2018-05-20 22:45       ` Kent Overstreet
       [not found]       ` <20180518174549.GD31737@kmo-pixel>
  0 siblings, 2 replies; 6+ messages in thread
From: Christoph Hellwig @ 2018-05-18 15:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kent Overstreet, linux-kernel, linux-fsdevel, Andrew Morton,
	Dave Chinner, darrick.wong, tytso, linux-btrfs, clm, jbacik, viro,
	peterz

On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > Historically, the only problematic case has been direct IO, and people
> > have been willing to say "well, if you mix buffered and direct IO you
> > get what you deserve", and that's probably not unreasonable. But now we
> > have fallocate insert range and collapse range, and those are broken in
> > ways I frankly don't want to think about if they can't ensure consistency
> > with the page cache.
> 
> ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> You may get pushback on the grounds that this ought to be a
> filesystem-specific lock rather than one embedded in the generic inode.

Honestly I think this probably should be in the core.  But IFF we move
it to the core the existing users of per-fs locks need to be moved
over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
that copied the approach, and probably more if you audit deep enough.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 10/10] Dynamic fault injection
       [not found]     ` <20180518191040.GG31737@kmo-pixel>
@ 2018-05-18 20:54       ` Andreas Dilger
  0 siblings, 0 replies; 6+ messages in thread
From: Andreas Dilger @ 2018-05-18 20:54 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: LKML, fsdevel, Andrew Morton, Dave Chinner, darrick.wong, tytso,
	linux-btrfs, clm, jbacik, viro, willy, peterz

[-- Attachment #1: Type: text/plain, Size: 2908 bytes --]

On May 18, 2018, at 1:10 PM, Kent Overstreet <kent.overstreet@gmail.com> wrote:
> 
> On Fri, May 18, 2018 at 01:05:20PM -0600, Andreas Dilger wrote:
>> On May 18, 2018, at 1:49 AM, Kent Overstreet <kent.overstreet@gmail.com> wrote:
>>> 
>>> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
>> 
>> I agree with Christoph that even if there was some explanation in the cover
>> letter, there should be something at least as good in the patch itself.  The
>> cover letter is not saved, but the commit stays around forever, and should
>> explain how this should be added to code, and how to use it from userspace.
>> 
>> 
>> That said, I think this is a useful functionality.  We have something similar
>> in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
>> test a distributed filesystem, which is just a CPP macro with an unlikely()
>> branch, while this looks more sophisticated.  This looks like it has some
>> added functionality like having more than one fault enabled at a time.
>> If this lands we could likely switch our code over to using this.
> 
> This is pretty much what I was looking for, I just wanted to know if this
> patch was interesting enough to anyone that I should spend more time on it
> or just drop it :) Agreed on documentation. I think it's also worth
> factoring out the functionality for the elf section trick that dynamic
> debug uses too.
> 
>> Some things that are missing from this patch that is in our code:
>> 
>> - in addition to the basic "enabled" and "oneshot" mechanisms, we have:
>>  - timeout: sleep for N msec to simulate network/disk/locking delays
>>  - race: wait with one thread until a second thread hits matching check
>> 
>> We also have a "fail_val" that allows making the check conditional (e.g.
>> only operation on server "N" should fail, only RPC opcode "N", etc).
> 
> Those all sound like good ideas... fail_val especially, I think with that
> we'd have all the functionality the existing fault injection framework has
> (which is way too heavyweight to actually get used, imo)

The other thing we have, slightly orthogonal to your modes and possible
because the fault location is just a __u32, is that "oneshot" is simply a
mask OR'd into the fault location.  Together with "fail_val" that lets us
add other masks: "fail N times", "fail randomly 1/N times", or "pass N
times before failure".  One more mask is set in the kernel when the fault
is actually hit, so that test scripts can poll until that happens, and
then continue running.
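That mask scheme can be sketched in plain C.  Everything here (the mask
names, the bit layout, the fault_check() helper) is hypothetical, loosely
modeled on the description above rather than on Lustre's actual
OBD_FAIL_CHECK() code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding: low bits identify the fault site, high bits are
 * mode masks OR'd in by the test script, and one mask is set by the
 * kernel once the fault fires so userspace can poll for it. */
#define FAIL_LOC_MASK   0x0000ffffu  /* fault site id */
#define FAIL_ONESHOT    0x01000000u  /* fire once, then disarm */
#define FAIL_RAND       0x02000000u  /* fail ~1/fail_val of the time */
#define FAIL_SKIP       0x04000000u  /* pass fail_val times before failing */
#define FAIL_HIT        0x80000000u  /* set when the fault actually fired */

static uint32_t fail_loc;   /* armed by the test harness */
static uint32_t fail_val;
static uint32_t fail_count;

static int fault_check(uint32_t site)
{
	if ((fail_loc & FAIL_LOC_MASK) != site)
		return 0;
	if (fail_loc & FAIL_SKIP) {
		if (++fail_count <= fail_val)
			return 0;            /* pass N times before failing */
	}
	fail_loc |= FAIL_HIT;                /* let pollers see it fired */
	if (fail_loc & FAIL_ONESHOT)
		fail_loc &= ~FAIL_LOC_MASK;  /* disarm after one hit */
	return 1;
}
```

A test script would then arm a site by writing the id plus mode masks into
fail_loc, and poll for FAIL_HIT before continuing.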

The "fail randomly 1/N times" was useful for detecting memory allocation
failure handling under load, but that has been superseded by the same
functionality in kmalloc(), and it sounds like your fault injection can
do this deterministically for every allocation?

Cheers, Andreas

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 01/10] mm: pagecache add lock
  2018-05-18 15:53     ` [PATCH 01/10] mm: pagecache add lock Christoph Hellwig
@ 2018-05-20 22:45       ` Kent Overstreet
  2018-05-23 15:22         ` Christoph Hellwig
       [not found]       ` <20180518174549.GD31737@kmo-pixel>
  1 sibling, 1 reply; 6+ messages in thread
From: Kent Overstreet @ 2018-05-20 22:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, linux-kernel, linux-fsdevel, Andrew Morton,
	Dave Chinner, darrick.wong, tytso, linux-btrfs, clm, jbacik, viro,
	peterz

On Fri, May 18, 2018 at 08:53:30AM -0700, Christoph Hellwig wrote:
> On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > > Historically, the only problematic case has been direct IO, and people
> > > have been willing to say "well, if you mix buffered and direct IO you
> > > get what you deserve", and that's probably not unreasonable. But now we
> > > have fallocate insert range and collapse range, and those are broken in
> > > ways I frankly don't want to think about if they can't ensure consistency
> > > with the page cache.
> > 
> > ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> > You may get pushback on the grounds that this ought to be a
> > filesystem-specific lock rather than one embedded in the generic inode.
> 
> Honestly I think this probably should be in the core.  But IFF we move
> it to the core the existing users of per-fs locks need to be moved
> over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
> that copied the approach, and probably more if you audit deep enough.

I'm not going to go and redo locking in XFS and ext4 as a prerequisite to
merging bcachefs. Sorry, but that's a bit crazy.

I am more than happy to work on the locking itself if we can agree on what
semantics we want out of it. We have two possible approaches, and we're going to
have to pick one first: the locking can be done at the top of the IO stack (like
ext4 and I'm guessing xfs), but then we're adding locking overhead to buffered
reads and writes that don't need it because they're only touching pages that are
already in cache.

Or we can go with my approach, pushing the locking down to only when we
need to add pages to the page cache.  I think if we started out by merging
my approach, it would be pretty easy to have it make use of Matthew's
fancy xarray-based range locking when that goes in; the semantics should
be similar enough.

If people are ok with and willing to use my approach, I can polish it up - add
lockdep support and whatever else I can think of, and attempt to get rid of the
stupid recursive part.

But that's got to be decided first, where in the call stack the locking should
be done.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 01/10] mm: pagecache add lock
  2018-05-20 22:45       ` Kent Overstreet
@ 2018-05-23 15:22         ` Christoph Hellwig
  2018-05-23 17:12           ` Kent Overstreet
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2018-05-23 15:22 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, Matthew Wilcox, linux-kernel, linux-fsdevel,
	Andrew Morton, Dave Chinner, darrick.wong, tytso, linux-btrfs,
	clm, jbacik, viro, peterz

On Sun, May 20, 2018 at 06:45:24PM -0400, Kent Overstreet wrote:
> > 
> > Honestly I think this probably should be in the core.  But IFF we move
> > it to the core the existing users of per-fs locks need to be moved
> > over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
> > that copied the approach, and probably more if you audit deep enough.
> 
> I'm not going to go and redo locking in XFS and ext4 as a prerequisite to
> merging bcachefs. Sorry, but that's a bit crazy.

It isn't crazy at all.  In general we expect people to do their fair
share of core work to get their pet feature in.  How much is acceptable
is a difficult question and not black and white.

But if you want to grow a critical core structure you better take a stab
at converting existing users.  Without that, the tradeoff of growing
that core structure simply isn't justified.

Or to put it in words for this exact feature:  unless your new field
is also used by mainstream file systems it is not going to be added.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 01/10] mm: pagecache add lock
  2018-05-23 15:22         ` Christoph Hellwig
@ 2018-05-23 17:12           ` Kent Overstreet
  0 siblings, 0 replies; 6+ messages in thread
From: Kent Overstreet @ 2018-05-23 17:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, linux-kernel, linux-fsdevel, Andrew Morton,
	Dave Chinner, darrick.wong, tytso, linux-btrfs, clm, jbacik, viro,
	peterz

On Wed, May 23, 2018 at 08:22:39AM -0700, Christoph Hellwig wrote:
> On Sun, May 20, 2018 at 06:45:24PM -0400, Kent Overstreet wrote:
> > > 
> > > Honestly I think this probably should be in the core.  But IFF we move
> > > it to the core the existing users of per-fs locks need to be moved
> > > over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
> > > that copied the approach, and probably more if you audit deep enough.
> > 
> > I'm not going to go and redo locking in XFS and ext4 as a prerequisite to
> > merging bcachefs. Sorry, but that's a bit crazy.
> 
> It isn't crazy at all.  In general we expect people to do their fair
> share of core work to get their pet feature in.  How much is acceptable
> is a difficult question and not black and white.
> 
> But if you want to grow a critical core structure you better take a stab
> at converting existing users.  Without that, the tradeoff of growing
> that core structure simply isn't justified.
> 
> Or to put it in words for this exact feature:  unless your new field
> is also used by mainstream file systems it is not going to be added.

Christoph, I'm really not someone you can accuse of avoiding my share of core
work and refactoring and you know it.

When are you going to get around to converting existing users of fs/direct-io.c
to iomap so it can be deleted? The kernel is carrying around two dio
implementations right now thanks to you. Not a good situation, is it?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Notes on locking for pagecache consistency (was: [PATCH 01/10] mm: pagecache add lock)
       [not found]       ` <20180518174549.GD31737@kmo-pixel>
@ 2018-05-23 23:55         ` Kent Overstreet
  0 siblings, 0 replies; 6+ messages in thread
From: Kent Overstreet @ 2018-05-23 23:55 UTC (permalink / raw)
  To: Matthew Wilcox, linux-kernel, linux-fsdevel, Andrew Morton,
	Dave Chinner, darrick.wong, tytso, linux-btrfs, clm, jbacik, viro,
	peterz, bcrl, Christoph Hellwig, zach.brown

On Fri, May 18, 2018 at 01:45:49PM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 08:53:30AM -0700, Christoph Hellwig wrote:
> > On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > > > Historically, the only problematic case has been direct IO, and people
> > > > have been willing to say "well, if you mix buffered and direct IO you
> > > > get what you deserve", and that's probably not unreasonable. But now we
> > > > have fallocate insert range and collapse range, and those are broken in
> > > > ways I frankly don't want to think about if they can't ensure consistency
> > > > with the page cache.
> > > 
> > > ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> > > You may get pushback on the grounds that this ought to be a
> > > filesystem-specific lock rather than one embedded in the generic inode.
> > 
> > Honestly I think this probably should be in the core.  But IFF we move
> > it to the core the existing users of per-fs locks need to be moved
> > over first.  E.g. XFS as the very first one, and at least ext4 and f2fs
> > that copied the approach, and probably more if you audit deep enough.
> 
> I didn't know about i_mmap_sem, thanks
> 
> But, using a rw semaphore wouldn't work for dio writes, and I do need dio writes
> to block pagecache adds in bcachefs since the dio write could overwrite
> uncompressed data or a reservation with compressed data, which means new writes
> need a disk reservation.
> 
> Also I'm guessing ext4 takes the lock at the top of the read path? That sucks
> for reads that are already cached, the generic buffered read path can do cached
> reads without taking any per inode locks - that's why I pushed the lock down to
> only be taken when we have to add a page to the pagecache.
> 
> Definitely less ugly doing it that way though...

More notes on locking design: Matthew, this is very relevant to any sort of range
locking, too.

There's some really tricky issues related to dio writes + page faults. If your
particular filesystem doesn't care about minor page cache consistency issues
caused by dio writes most of this may not be relevant to you, but I honestly
would find it a little hard to believe this isn't an issue for _any_ other
filesystems.

Current situation today, for most filesystems: top of the dio write path shoots
down the region of the pagecache for the file it's writing to, with
filemap_write_and_wait_range() then invalidate_inode_pages2_range().

This is racy though and does _not_ guarantee page cache consistency, i.e. we
can end up in a situation where the write completes, but we have stale data -
and worse, potentially stale metadata, in the buffer heads or whatever your
filesystem uses.

Ok, solving that is easy enough, we need a lock dio writes can take that
prevents pages from being added to a mapping for their duration (or alternately,
range locks, but range locks can be viewed as just an optimization).

This explains my locking design - if you have a lock that can be taken for "add"
or "block", where multiple threads can take it for add or block, but it can't be
in both states simultaneously then it does what we want, you can have multiple
outstanding dio writes or multiple buffered IO operations bringing pages in and
it doesn't add much overhead.
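A minimal userspace model of that add/block semantic: any number of
holders on one side, with add and block mutually exclusive.  This sketch
uses a mutex and condvar with a signed counter; the names and the counter
convention are illustrative, not bcachefs's actual implementation:

```c
#include <assert.h>
#include <pthread.h>

struct pagecache_lock {
	pthread_mutex_t lock;
	pthread_cond_t  wait;
	long v;  /* > 0: held for add; < 0: held for block; 0: free */
};

static void pagecache_add_get(struct pagecache_lock *l)
{
	pthread_mutex_lock(&l->lock);
	while (l->v < 0)                 /* wait out dio writers */
		pthread_cond_wait(&l->wait, &l->lock);
	l->v++;
	pthread_mutex_unlock(&l->lock);
}

static void pagecache_add_put(struct pagecache_lock *l)
{
	pthread_mutex_lock(&l->lock);
	if (!--l->v)
		pthread_cond_broadcast(&l->wait);
	pthread_mutex_unlock(&l->lock);
}

static void pagecache_block_get(struct pagecache_lock *l)
{
	pthread_mutex_lock(&l->lock);
	while (l->v > 0)                 /* wait out page cache adds */
		pthread_cond_wait(&l->wait, &l->lock);
	l->v--;
	pthread_mutex_unlock(&l->lock);
}

static void pagecache_block_put(struct pagecache_lock *l)
{
	pthread_mutex_lock(&l->lock);
	if (!++l->v)
		pthread_cond_broadcast(&l->wait);
	pthread_mutex_unlock(&l->lock);
}
```

The buffered IO and fault paths take the "add" side only when actually
inserting pages; dio writes take the "block" side for their duration.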

This isn't enough, though.

 - calling filemap_write_and_wait_range then invalidate_inode_pages2_range can
   race - the call to invalidate_inode_pages2_range will fail if any of the
   pages have been redirtied, and we'll have stale pages in the page cache.

   The ideal solution for this would be a function to do both, that removes
   pages from the page cache atomically with clearing PageDirty and kicking off
   writeback. Alternately, you can have .page_mkwrite and the buffered write
   path take the pagecache add lock when they have to dirty a page, but that
   kind of sucks.

 - pagefaults, via gup()
   
   this is the really annoying one, if userspace does a dio write where the buf
   they're writing is mmapped and overlaps with the part of the file they're
   writing to, yay fun

   we call gup() after shooting down the part of the pagecache we're writing
   to, so gup() just faults it back in again and we're left with stale data in
   the page cache again.

   the generic dio code tries to deal with this these days by calling
   invalidate_inode_pages2_range() again after the dio write completes. Again
   though, invalidate_inode_pages2_range() will fail if the pages are dirty, and
   we're left with stale data in the page cache.

   I _think_ (haven't tried this yet) it ought to be possible to move the second
   call to invalidate_inode_pages2_range() to immediately after the call to
   gup() - this change wouldn't make sense in the current code without locking,
   but it would make sense if the dio write path is holding a pagecache add lock
   to prevent anything but its own faults via gup() from bringing pages back in.

   Also need to prevent the pages gup() faulted in from being dirtied until they
   can be removed again... 

   Also, one of the things my patch did was change filemap_fault() to not kick
   off readahead and only read in single pages if we're being called with
   pagecache block lock held (i.e. from dio write to the same file). I'm pretty
   sure this is unnecessary though, if I add the second
   invalidate_inode_pages2_range() call to my own dio code after gup() (which I
   believe is the correct solution anyways).

 - lock recursion

   my locking approach pushes down the locking to only when we're adding pages
   to the radix tree, unlike, I believe, the ext4/xfs approach.

   this means that dio write takes page_cache_block lock, and page faults take
   page_cache_add lock, so dio write -> gup -> fault is a recursive locking
   deadlock unless we do something about it.

   my solution was to add a pointer to a pagecache_lock to task_struct, so we
   can detect the recursion when we go to take pagecache_add lock. It's ugly,
   but I haven't thought of a better way to do it.
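   That recursion check can be sketched with a thread-local standing in
   for the pointer added to task_struct; the names and the return
   convention here are illustrative, not the actual patch:

```c
#include <assert.h>
#include <stddef.h>

struct pagecache_lock { int unused; };  /* placeholder for the real lock */

/* Thread-local stand-in for the pointer the patch adds to task_struct. */
static __thread struct pagecache_lock *pagecache_lock_held;

/* Try to take the lock for "add"; returns 0 if this task already holds
 * the same pagecache lock (the dio write -> gup() -> page fault
 * recursion), in which case the caller must not unlock it either. */
static int pagecache_add_get_maybe(struct pagecache_lock *l)
{
	if (pagecache_lock_held == l)
		return 0;		/* recursion detected: skip locking */
	/* ... acquire the real lock for "add" here ... */
	pagecache_lock_held = l;
	return 1;
}

static void pagecache_add_put(struct pagecache_lock *l)
{
	if (pagecache_lock_held == l) {
		/* ... release the real lock here ... */
		pagecache_lock_held = NULL;
	}
}
```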

   I'm pretty sure that the xarray based range locking Matthew wants to do will
   hit the same issue, it's just inherent to pushing the locking down to where
   we manipulate the radix tree - which IMO we want to do, so that buffered
   reads/writes to cached data aren't taking these locks unnecessarily; those
   are the fast paths.

If anyone has any better ideas than what I've come up with, or sees any gaping
holes in my design, please speak up...

Even if you don't care about dio write consistency, having this locking in the
core VFS could mean converting truncate to use it, which I think would be a
major benefit too.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2018-05-23 23:55 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20180518074918.13816-1-kent.overstreet@gmail.com>
     [not found] ` <20180518074918.13816-3-kent.overstreet@gmail.com>
     [not found]   ` <20180518131305.GA6361@bombadil.infradead.org>
2018-05-18 15:53     ` [PATCH 01/10] mm: pagecache add lock Christoph Hellwig
2018-05-20 22:45       ` Kent Overstreet
2018-05-23 15:22         ` Christoph Hellwig
2018-05-23 17:12           ` Kent Overstreet
     [not found]       ` <20180518174549.GD31737@kmo-pixel>
2018-05-23 23:55         ` Notes on locking for pagecache consistency (was: [PATCH 01/10] mm: pagecache add lock) Kent Overstreet
     [not found] ` <20180518074918.13816-21-kent.overstreet@gmail.com>
     [not found]   ` <905DA1CC-63F8-4020-A1D7-1F59ABDF3448@dilger.ca>
     [not found]     ` <20180518191040.GG31737@kmo-pixel>
2018-05-18 20:54       ` [PATCH 10/10] Dynamic fault injection Andreas Dilger
