* pagecache locking (was: bcachefs status update) merged) [not found] ` <20190612230224.GJ14308@dread.disaster.area> @ 2019-06-13 18:36 ` Kent Overstreet 2019-06-13 21:13 ` Andreas Dilger 2019-06-13 23:55 ` Dave Chinner 0 siblings, 2 replies; 24+ messages in thread From: Kent Overstreet @ 2019-06-13 18:36 UTC (permalink / raw) To: Dave Chinner Cc: Linus Torvalds, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Thu, Jun 13, 2019 at 09:02:24AM +1000, Dave Chinner wrote: > On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote: > > Ok, I'm totally on board with returning EDEADLOCK. > > > > Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is > > in the same address space as the file being read/written to, even if the buffer > > and the IO don't technically overlap? > > I'd say that depends on the lock granularity. For a range lock, > we'd be able to do the IO for non-overlapping ranges. For a normal > mutex or rwsem, then we risk deadlock if the page fault triggers on > the same address space host as we already have locked for IO. That's > the case we currently handle with the second IO lock in XFS, ext4, > btrfs, etc (XFS_MMAPLOCK_* in XFS). > > One of the reasons I'm looking at range locks for XFS is to get rid > of the need for this second mmap lock, as there is no reason for it > existing if we can lock ranges and EDEADLOCK inside page faults and > return errors. My concern is that range locks are going to turn out to be both more complicated and heavier weight, performance wise, than the approach I've taken of just a single lock per address space. Reason being range locks only help when you've got multiple operations going on simultaneously that don't conflict - i.e. 
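[Editorial note: the EDEADLOCK-on-recursion idea discussed above can be sketched as a small userspace model. This is not the kernel API - every name here (`io_lock`, `direct_io`, the per-task held set) is invented for illustration. Each task tracks which address spaces it already holds locked for IO; a page fault that would re-take one of those locks fails with -EDEADLK instead of self-deadlocking on a non-recursive rwsem.]

```python
import errno
import threading

_held = threading.local()  # per-task set of address spaces locked for IO

def _held_set():
    if not hasattr(_held, "mappings"):
        _held.mappings = set()
    return _held.mappings

def io_lock(mapping):
    """Take the per-address-space IO lock; detect recursive acquisition."""
    if mapping in _held_set():
        return -errno.EDEADLK   # would deadlock on a real rwsem: bail out
    _held_set().add(mapping)
    return 0

def io_unlock(mapping):
    _held_set().discard(mapping)

def direct_io(file_mapping, buf_mapping):
    """DIO locks the file, then gup() faults in the user buffer; if the
    buffer lives in the same address space, fail instead of deadlocking."""
    io_lock(file_mapping)
    try:
        ret = io_lock(buf_mapping)  # models the page fault from gup()
        if ret:
            return ret              # -EDEADLK back to userspace
        io_unlock(buf_mapping)
        return 0
    finally:
        io_unlock(file_mapping)
```

Note this rejects any IO whose buffer shares the address space with the file, whether or not the ranges overlap - exactly the simplification Kent is asking about.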
it's really only going to be useful for applications that are doing buffered IO
and direct IO simultaneously to the same file.

Personally, I think that would be a pretty gross thing to do and I'm not
particularly interested in optimizing for that case myself... but, if you know
of applications that do depend on that I might change my opinion. If not, I
want to try and get the simpler, one-lock-per-address-space approach to work.

That said though - range locks on the page cache can be viewed as just a
performance optimization over my approach; they've got the same semantics
(locking a subset of the page cache vs. the entire thing). So that's a bit of
a digression.

> > This would simplify things a lot and eliminate a really nasty corner
> > case - page faults trigger readahead. Even if the buffer and the direct
> > IO don't overlap, readahead can pull in pages that do overlap with the
> > dio.
>
> Page cache readahead needs to be moved under the filesystem IO locks. There
> was a recent thread about how readahead can race with hole punching and
> other fallocate() operations because page cache readahead bypasses the
> filesystem IO locks used to serialise page cache invalidation.
>
> e.g. Readahead can be directed by userspace via fadvise, so we now have
> file->f_op->fadvise() so that filesystems can lock the inode before calling
> generic_fadvise() such that page cache instantiation and readahead dispatch
> can be serialised against page cache invalidation. I have a patch for XFS
> sitting around somewhere that implements the ->fadvise method.

I just puked a little in my mouth.

> I think there are some other patches floating around to address the other
> readahead mechanisms to only be done under filesystem IO locks, but I
> haven't had time to dig into it any further. Readahead from page faults
> most definitely needs to be under the MMAPLOCK at least so it serialises
> against fallocate()...
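[Editorial note: the ->fadvise serialisation Dave describes can be sketched as a toy model - readahead dispatched via fadvise takes the same lock that page cache invalidation (hole punch) takes, so pages can't be instantiated mid-punch. The class and lock names here are illustrative, not the kernel's.]

```python
import threading

class Inode:
    """Model: readahead and invalidation share one IO lock."""
    def __init__(self):
        self.iolock = threading.Lock()  # stands in for the fs IO/MMAP lock
        self.cache = set()              # cached page indices

    def fadvise_willneed(self, start, end):
        with self.iolock:               # lock *before* readahead dispatch
            self.cache.update(range(start, end))

    def hole_punch(self, start, end):
        with self.iolock:               # invalidation holds the same lock
            self.cache.difference_update(range(start, end))
```

With readahead outside the lock, a punch could complete and then find stale pages re-instantiated behind it; here the two operations strictly order themselves.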
So I think there's two different approaches we should distinguish between. We
can either add the locking to all the top level IO paths - what you just
described - or the locking can be pushed down to protect _only_ adding pages
to the page cache, which is the approach I've been taking.

I think both approaches are workable, but I do think that pushing the locking
down to __add_to_page_cache_locked is fundamentally the better, more correct
approach.

- It better matches the semantics of what we're trying to do. All these
  operations we're trying to protect - dio, fallocate, truncate - they all
  have in common that they just want to shoot down a range of the page cache
  and keep it from being readded. And in general, it's better to have locks
  that protect specific data structures ("adding to this radix tree") vs.
  large critical sections ("the IO path").

  In bcachefs, at least for buffered IO, I don't currently need any per-inode
  IO locks; page granularity locks suffice, so I'd like to keep that - under
  the theory that buffered IO to pages already in cache is more of a fast
  path than faulting pages in.

- If we go with the approach of using the filesystem IO locks, we need to be
  really careful about auditing and adding assertions to make sure we've
  found and fixed all the code paths that can potentially add pages to the
  page cache. I didn't even know about the fadvise case, eesh.

- We still need to do something about the case where we're recursively
  faulting pages back in, which means we need _something_ in place to even
  detect that that's happening. Just trying to cover everything with the
  filesystem IO locks isn't useful here.

So to summarize - if we have locking specifically for adding pages to the
page cache, we don't need to extend the filesystem IO locks to all these
places, and we need something at that level anyways to handle recursive
faults from gup().
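[Editorial note: Kent's add-side locking can be sketched as a reader-writer construct guarding only page cache *insertion*: faults/readahead take it shared around each add, while dio/truncate/fallocate take it exclusive for the duration of a shootdown. This is a hypothetical userspace model; all names are invented.]

```python
import threading

class PagecacheAddLock:
    """Shared for adds, exclusive while a shootdown is in progress."""
    def __init__(self):
        self._cond = threading.Condition()
        self._adders = 0
        self._blocked = False

    def add_begin(self):                # shared: page cache insertion
        with self._cond:
            while self._blocked:
                self._cond.wait()
            self._adders += 1

    def add_end(self):
        with self._cond:
            self._adders -= 1
            self._cond.notify_all()

    def block_begin(self):              # exclusive: shootdown in progress
        with self._cond:
            while self._blocked or self._adders:
                self._cond.wait()
            self._blocked = True

    def block_end(self):
        with self._cond:
            self._blocked = False
            self._cond.notify_all()

def add_to_page_cache(cache, lock, idx):
    """The one choke point all insertion paths would go through."""
    lock.add_begin()
    try:
        cache.add(idx)
    finally:
        lock.add_end()
```

The point of the design is that fadvise, readahead, gup, etc. all funnel through the single add path, so there is nothing to audit at the top of each IO path.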
The tricky part is that there's multiple places that want to call add_to_page_cache() while holding this pagecache_add_lock. - dio -> gup(): but, you had the idea of just returning -EDEADLOCK here which I like way better than my recursive locking approach. - the other case is truncate/fpunch/etc - they make use of buffered IO to handle operations that aren't page/block aligned. But those look a lot more tractable than dio, since they're calling find_get_page()/readpage() directly instead of via gup(), and possibly they could be structured to not have to truncate/punch the partial page while holding the pagecache_add_lock at all (but that's going to be heavily filesystem dependent). The more I think about it, the more convinced I am that this is fundamentally the correct approach. So, I'm going to work on an improved version of this patch. One other tricky thing we need is a way to write out and then evict a page without allowing it to be redirtied - i.e. something that combines filemap_write_and_wait_range() with invalidate_inode_pages2_range(). Otherwise, a process continuously redirtying a page is going to make truncate/dio operations spin trying to shoot down the page cache - in bcachefs I'm currently taking pagecache_add_lock in write_begin and mkwrite to prevent this, but I really want to get rid of that. If we can get that combined write_and_invalidate() operation, then I think the locking will turn out fairly clean. ^ permalink raw reply [flat|nested] 24+ messages in thread
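[Editorial note: the combined write_and_invalidate() Kent asks for can be modeled in a few lines - flush and evict under one lock, so a concurrent writer can't redirty the page in the window between filemap_write_and_wait_range() and invalidate_inode_pages2_range(). Purely a userspace sketch; the names are hypothetical.]

```python
import threading

class Mapping:
    """Pages are index -> dirty flag; `flushed` records what writeback wrote."""
    def __init__(self):
        self._lock = threading.Lock()
        self.pages = {}
        self.flushed = []

    def buffered_write(self, idx):
        with self._lock:                 # write_begin/mkwrite blocks here
            self.pages[idx] = True

    def write_and_invalidate(self, idx):
        with self._lock:                 # no redirty possible inside
            if self.pages.get(idx):
                self.flushed.append(idx)  # "writeback" of the dirty data
            self.pages.pop(idx, None)     # then evict, atomically
```

Without the combined primitive, the shootdown loop (flush, drop lock, invalidate, find page dirty again, retry) can spin forever against a continuous redirtier - which is exactly the livelock described above.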
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-13 18:36 ` pagecache locking (was: bcachefs status update) merged) Kent Overstreet @ 2019-06-13 21:13 ` Andreas Dilger 2019-06-13 21:21 ` Kent Overstreet 2019-06-13 23:55 ` Dave Chinner 1 sibling, 1 reply; 24+ messages in thread From: Andreas Dilger @ 2019-06-13 21:13 UTC (permalink / raw) To: Kent Overstreet Cc: Dave Chinner, Linus Torvalds, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton [-- Attachment #1.1: Type: text/plain, Size: 2948 bytes --] On Jun 13, 2019, at 12:36 PM, Kent Overstreet <kent.overstreet@gmail.com> wrote: > > On Thu, Jun 13, 2019 at 09:02:24AM +1000, Dave Chinner wrote: >> On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote: >>> Ok, I'm totally on board with returning EDEADLOCK. >>> >>> Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is >>> in the same address space as the file being read/written to, even if the buffer >>> and the IO don't technically overlap? >> >> I'd say that depends on the lock granularity. For a range lock, >> we'd be able to do the IO for non-overlapping ranges. For a normal >> mutex or rwsem, then we risk deadlock if the page fault triggers on >> the same address space host as we already have locked for IO. That's >> the case we currently handle with the second IO lock in XFS, ext4, >> btrfs, etc (XFS_MMAPLOCK_* in XFS). >> >> One of the reasons I'm looking at range locks for XFS is to get rid >> of the need for this second mmap lock, as there is no reason for it >> existing if we can lock ranges and EDEADLOCK inside page faults and >> return errors. > > My concern is that range locks are going to turn out to be both more complicated > and heavier weight, performance wise, than the approach I've taken of just a > single lock per address space. 
>
> Reason being range locks only help when you've got multiple operations
> going on simultaneously that don't conflict - i.e. it's really only going
> to be useful for applications that are doing buffered IO and direct IO
> simultaneously to the same file. Personally, I think that would be a pretty
> gross thing to do and I'm not particularly interested in optimizing for
> that case myself... but, if you know of applications that do depend on that
> I might change my opinion. If not, I want to try and get the simpler,
> one-lock-per-address space approach to work.

There are definitely workloads that require multiple threads doing
non-overlapping writes to a single file in HPC. This is becoming an
increasingly common problem as the number of cores on a single client
increases, since there is typically one thread per core trying to write to a
shared file. Using multiple files (one per core) is possible, but that has
file management issues for users when there are a million cores running on
the same job/file (obviously not on the same client node) dumping data every
hour.

We were just looking at this exact problem last week, and most of the threads
are spinning in grab_cache_page_nowait->add_to_page_cache_lru() and
set_page_dirty() when writing at 1.9GB/s when they could be writing at
5.8GB/s (when threads are writing O_DIRECT instead of buffered). Flame graph
is attached for the 16-thread case, but high-end systems today easily have
2-4x that many cores.

Any approach for range locks can't be worse than spending 80% of time
spinning.

Cheers, Andreas

[-- Attachment #1.2: shared_file_write.svg --]
[-- Type: image/svg+xml, Size: 161674 bytes --]

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-13 21:13 ` Andreas Dilger @ 2019-06-13 21:21 ` Kent Overstreet 2019-06-14 0:35 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Kent Overstreet @ 2019-06-13 21:21 UTC (permalink / raw) To: Andreas Dilger Cc: Dave Chinner, Linus Torvalds, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Thu, Jun 13, 2019 at 03:13:40PM -0600, Andreas Dilger wrote: > There are definitely workloads that require multiple threads doing non-overlapping > writes to a single file in HPC. This is becoming an increasingly common problem > as the number of cores on a single client increase, since there is typically one > thread per core trying to write to a shared file. Using multiple files (one per > core) is possible, but that has file management issues for users when there are a > million cores running on the same job/file (obviously not on the same client node) > dumping data every hour. Mixed buffered and O_DIRECT though? That profile looks like just buffered IO to me. > We were just looking at this exact problem last week, and most of the threads are > spinning in grab_cache_page_nowait->add_to_page_cache_lru() and set_page_dirty() > when writing at 1.9GB/s when they could be writing at 5.8GB/s (when threads are > writing O_DIRECT instead of buffered). Flame graph is attached for 16-thread case, > but high-end systems today easily have 2-4x that many cores. Yeah I've been spending some time on buffered IO performance too - 4k page overhead is a killer. bcachefs has a buffered write path that looks up multiple pages at a time and locks them, and then copies the data to all the pages at once (I stole the idea from btrfs). It was a very significant performance increase. 
https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n1498
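[Editorial note: the batched write path Kent describes (stolen from btrfs) can be sketched as follows - look up and "lock" every page the write touches in one pass, then do a single copy pass, instead of paying lookup/lock/unlock overhead per 4k page. The page cache is modeled as a dict and the locking elided; the function name is invented.]

```python
PAGE_SIZE = 4096

def buffered_write_batched(pagecache, pos, data):
    """Copy `data` into the page cache starting at byte offset `pos`."""
    first = pos // PAGE_SIZE
    last = (pos + len(data) - 1) // PAGE_SIZE
    # One lookup pass: instantiate and "lock" every page up front.
    pages = [pagecache.setdefault(i, bytearray(PAGE_SIZE))
             for i in range(first, last + 1)]
    off = pos % PAGE_SIZE
    written = 0
    for page in pages:                   # one copy pass over all pages
        n = min(PAGE_SIZE - off, len(data) - written)
        page[off:off + n] = data[written:written + n]
        written += n
        off = 0
    return written
```

The win is amortization: per-IO setup cost replaces per-page setup cost, which is where the 4k-page overhead Kent mentions gets paid.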
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-13 21:21 ` Kent Overstreet @ 2019-06-14 0:35 ` Dave Chinner 0 siblings, 0 replies; 24+ messages in thread From: Dave Chinner @ 2019-06-14 0:35 UTC (permalink / raw) To: Kent Overstreet Cc: Andreas Dilger, Linus Torvalds, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Thu, Jun 13, 2019 at 05:21:12PM -0400, Kent Overstreet wrote: > On Thu, Jun 13, 2019 at 03:13:40PM -0600, Andreas Dilger wrote: > > There are definitely workloads that require multiple threads doing non-overlapping > > writes to a single file in HPC. This is becoming an increasingly common problem > > as the number of cores on a single client increase, since there is typically one > > thread per core trying to write to a shared file. Using multiple files (one per > > core) is possible, but that has file management issues for users when there are a > > million cores running on the same job/file (obviously not on the same client node) > > dumping data every hour. > > Mixed buffered and O_DIRECT though? That profile looks like just buffered IO to > me. > > > We were just looking at this exact problem last week, and most of the threads are > > spinning in grab_cache_page_nowait->add_to_page_cache_lru() and set_page_dirty() > > when writing at 1.9GB/s when they could be writing at 5.8GB/s (when threads are > > writing O_DIRECT instead of buffered). Flame graph is attached for 16-thread case, > > but high-end systems today easily have 2-4x that many cores. > > Yeah I've been spending some time on buffered IO performance too - 4k page > overhead is a killer. > > bcachefs has a buffered write path that looks up multiple pages at a time and > locks them, and then copies the data to all the pages at once (I stole the idea > from btrfs). It was a very significant performance increase. 
Careful with that - locking multiple pages is also a deadlock vector that
triggers unexpectedly when something conspires to lock pages in
non-ascending order. e.g.

64081362e8ff mm/page-writeback.c: fix range_cyclic writeback vs writepages
deadlock

The fs/iomap.c code avoids this problem by mapping the IO first, then
iterating pages one at a time until the mapping is consumed, then it gets
another mapping. It also avoids needing to put a page array on the stack....

Cheers, Dave.

--
Dave Chinner
david@fromorbit.com
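[Editorial note: Dave's warning in sketch form - if every path that locks multiple pages acquires them in ascending index order, two threads locking overlapping sets cannot deadlock, because there is a single global ordering. The helper names are invented for this userspace model.]

```python
import threading

def lock_pages(locks, indices):
    """Acquire page locks in ascending index order; return the order taken."""
    order = sorted(set(indices))
    for i in order:
        locks[i].acquire()
    return order

def unlock_pages(locks, order):
    for i in reversed(order):
        locks[i].release()
```

If one caller locked [3, 0, 2] in that order while another locked [2, 1, 3], they could deadlock; sorting both to ascending order makes the acquisitions compatible.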
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-13 18:36 ` pagecache locking (was: bcachefs status update) merged) Kent Overstreet 2019-06-13 21:13 ` Andreas Dilger @ 2019-06-13 23:55 ` Dave Chinner 2019-06-14 2:30 ` Linus Torvalds ` (2 more replies) 1 sibling, 3 replies; 24+ messages in thread From: Dave Chinner @ 2019-06-13 23:55 UTC (permalink / raw) To: Kent Overstreet Cc: Linus Torvalds, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Thu, Jun 13, 2019 at 02:36:25PM -0400, Kent Overstreet wrote: > On Thu, Jun 13, 2019 at 09:02:24AM +1000, Dave Chinner wrote: > > On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote: > > > Ok, I'm totally on board with returning EDEADLOCK. > > > > > > Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is > > > in the same address space as the file being read/written to, even if the buffer > > > and the IO don't technically overlap? > > > > I'd say that depends on the lock granularity. For a range lock, > > we'd be able to do the IO for non-overlapping ranges. For a normal > > mutex or rwsem, then we risk deadlock if the page fault triggers on > > the same address space host as we already have locked for IO. That's > > the case we currently handle with the second IO lock in XFS, ext4, > > btrfs, etc (XFS_MMAPLOCK_* in XFS). > > > > One of the reasons I'm looking at range locks for XFS is to get rid > > of the need for this second mmap lock, as there is no reason for it > > existing if we can lock ranges and EDEADLOCK inside page faults and > > return errors. > > My concern is that range locks are going to turn out to be both more complicated > and heavier weight, performance wise, than the approach I've taken of just a > single lock per address space. 
That's the battle I'm fighting at the moment with them for direct IO(*), but range locks are something I'm doing for XFS and I don't really care if anyone else wants to use them or not. (*)Direct IO on XFS is a pure shared lock workload, so the rwsem scales until single atomic update cache line bouncing limits throughput. That means I can max out my hardware at 1.6 million random 4k read/write IOPS (a bit over 6GB/s)(**) to a single file with a rwsem at 32 AIO+DIO dispatch threads. I've only got range locks to about 1.1M IOPS on the same workload, though it's within a couple of percent of a rwsem up to 16 threads... (**) A small handful of nvme SSDs fed by AIO+DIO are /way faster/ than pmem that is emulated with RAM, let alone real pmem which is much slower at random writes than RAM. > Reason being range locks only help when you've got multiple operations going on > simultaneously that don't conflict - i.e. it's really only going to be useful > for applications that are doing buffered IO and direct IO simultaneously to the > same file. Yes, they do that, but that's not why I'm looking at this. Range locks are primarily for applications that mix multiple different types of operations to the same file concurrently. e.g: - fallocate and read/write() can be run concurrently if they don't overlap, but right now we serialise them because we have no visibility into what other operations require. - buffered read and buffered write can run concurrently if they don't overlap, but right now they are serialised because that's the only way to provide POSIX atomic write vs read semantics (only XFS provides userspace with that guarantee). - Sub-block direct IO is serialised against all other direct IO because we can't tell if it overlaps with some other direct IO and so we have to take the slow but safe option - range locks solve that problem, too. 
- there's inode_dio_wait() for DIO truncate serialisation because AIO
  doesn't hold inode locks across IO - range locks can be held all the way
  to AIO completion, so we can get rid of inode_dio_wait() in XFS, and that
  allows truncate/fallocate to run concurrently with non-overlapping direct
  IO.

- holding non-overlapping range locks on either side of page faults, which
  then gets rid of the need for the special mmap locking path to serialise
  it against invalidation operations.

IOWs, range locks for IO solve a bunch of long term problems we have in XFS
and largely simplify the lock algorithms within the filesystem. And it puts
us on the path to introduce range locks for extent mapping serialisation,
allowing concurrent mapping lookups and allocation within a single file. It
also has the potential to allow us to do concurrent directory
modifications....

> Personally, I think that would be a pretty gross thing to do and I'm not
> particularly interested in optimizing for that case myself... but, if you
> know of applications that do depend on that I might change my opinion. If
> not, I want to try and get the simpler, one-lock-per-address space
> approach to work.
>
> That said though - range locks on the page cache can be viewed as just a
> performance optimization over my approach, they've got the same semantics
> (locking a subset of the page cache vs. the entire thing). So that's a bit
> of a digression.

IO range locks are not "locking the page cache". IO range locks are purely
for managing concurrent IO state in a fine grained manner. The page cache
already has its own locking - that just needs to nest inside IO range locks,
as the IO locks are what provide the high level exclusion from overlapping
page cache operations...

> > > This would simplify things a lot and eliminate a really nasty corner
> > > case - page faults trigger readahead. Even if the buffer and the
> > > direct IO don't overlap, readahead can pull in pages that do overlap
> > > with the dio.
> >
> > Page cache readahead needs to be moved under the filesystem IO locks.
> > There was a recent thread about how readahead can race with hole
> > punching and other fallocate() operations because page cache readahead
> > bypasses the filesystem IO locks used to serialise page cache
> > invalidation.
> >
> > e.g. Readahead can be directed by userspace via fadvise, so we now have
> > file->f_op->fadvise() so that filesystems can lock the inode before
> > calling generic_fadvise() such that page cache instantiation and
> > readahead dispatch can be serialised against page cache invalidation. I
> > have a patch for XFS sitting around somewhere that implements the
> > ->fadvise method.
>
> I just puked a little in my mouth.

Yeah, it's pretty gross. But the page cache simply isn't designed to allow
atomic range operations to be performed. We haven't been able to drag it out
of the 1980s - we wrote the fs/iomap.c code so we could do range based
extent mapping for IOs rather than the horrible, inefficient page-by-page
block mapping the generic page cache code does - that gave us a 30+%
increase in buffered IO throughput because we only do a single mapping
lookup per IO rather than one per page...

That said, the page cache is still far, far slower than direct IO, and the
gap is just getting wider and wider as nvme SSDs get faster and faster.
PCIe 4 SSDs are just going to make this even more obvious - it's getting to
the point where the only reason for having a page cache is to support mmap()
and cheap systems with spinning rust storage.

> > I think there are some other patches floating around to address the
> > other readahead mechanisms to only be done under filesystem IO locks,
> > but I haven't had time to dig into it any further. Readahead from page
> > faults most definitely needs to be under the MMAPLOCK at least so it
> > serialises against fallocate()...
>
> So I think there's two different approaches we should distinguish between.
> We can either add the locking to all the top level IO paths - what you
> just described - or the locking can be pushed down to protect _only_
> adding pages to the page cache, which is the approach I've been taking.

I don't think just serialising adding pages is sufficient for filesystems
like XFS because, e.g., DAX.

> I think both approaches are workable, but I do think that pushing the
> locking down to __add_to_page_cache_locked is fundamentally the better,
> more correct approach.
>
> - It better matches the semantics of what we're trying to do. All these
>   operations we're trying to protect - dio, fallocate, truncate - they
>   all have in common that they just want to shoot down a range of the
>   page cache and keep it from being readded. And in general, it's better
>   to have locks that protect specific data structures ("adding to this
>   radix tree"), vs. large critical sections ("the io path").

I disagree :)

The high level IO locks provide the IO concurrency policy for the
filesystem. The page cache is an internal structure for caching pages - it
is not a structure for efficiently and cleanly implementing IO concurrency
policy.

That's the mistake the current page cache architecture makes - it tries to
be the central control for all the filesystem IO (because filesystems are
dumb and the page cache knows best!) but, unfortunately, this does not
provide the semantics or functionality that all filesystems want and/or
need.

Just look at the truncate detection mess we have every time we lookup and
lock a page anywhere in the mm/ code - do you see any code in all that
which detects a hole punch race? Nope, you don't, because the filesystems
take responsibility for serialising that functionality. Unfortunately, we
have so much legacy filesystem cruft we'll never get rid of those truncate
hacks.

That's my beef with relying on the page cache - the page cache is rapidly
becoming a legacy structure that only serves to slow modern IO subsystems
down.
More and more we are going to bypass the page cache and push/pull data via DMA directly into user buffers because that's the only way we have enough CPU power in the systems to keep the storage we have fully utilised. That means, IMO, making the page cache central to solving an IO concurrency problem is going the wrong way.... > In bcachefs, at least for buffered IO I don't currently need any per-inode IO > locks, page granularity locks suffice, so I'd like to keep that - under the > theory that buffered IO to pages already in cache is more of a fast path than > faulting pages in. > > - If we go with the approach of using the filesystem IO locks, we need to be > really careful about auditing and adding assertions to make sure we've found > and fixed all the code paths that can potentially add pages to the page > cache. I didn't even know about the fadvise case, eesh. Sure, but we've largely done that already. There aren't a lot of places that add pages to the page cache. > - We still need to do something about the case where we're recursively faulting > pages back in, which means we need _something_ in place to even detect that > that's happening. Just trying to cover everything with the filesystem IO > locks isn't useful here. Haven't we just addressed that with setting a current task lock context and returning -EDEADLOCK? > So to summarize - if we have locking specifically for adding pages to the page > cache, we don't need to extend the filesystem IO locks to all these places, and > we need something at that level anyways to handle recursive faults from gup() > anyways. > > The tricky part is that there's multiple places that want to call > add_to_page_cache() while holding this pagecache_add_lock. > > - dio -> gup(): but, you had the idea of just returning -EDEADLOCK here which I > like way better than my recursive locking approach. > > - the other case is truncate/fpunch/etc - they make use of buffered IO to > handle operations that aren't page/block aligned. 
But those look a lot more > tractable than dio, since they're calling find_get_page()/readpage() directly > instead of via gup(), and possibly they could be structured to not have to > truncate/punch the partial page while holding the pagecache_add_lock at all > (but that's going to be heavily filesystem dependent). That sounds like it is going to be messy. :( > One other tricky thing we need is a way to write out and then evict a page > without allowing it to be redirtied - i.e. something that combines > filemap_write_and_wait_range() with invalidate_inode_pages2_range(). Otherwise, > a process continuously redirtying a page is going to make truncate/dio > operations spin trying to shoot down the page cache - in bcachefs I'm currently > taking pagecache_add_lock in write_begin and mkwrite to prevent this, but I > really want to get rid of that. If we can get that combined > write_and_invalidate() operation, then I think the locking will turn out fairly > clean. IMO, this isn't a page cache problem - this is a filesystem operation vs page fault serialisation issue. Why? Because the problem exists for DAX, and it doesn't add pages to the page cache for mappings... i.e. We've already solved these problems with the same high level IO locks that solve all the truncate, hole punch, etc issues for both the page cache and DAX operation. i.e. The MMAPLOCK prevents the PTE being re-dirtied across the entire filesystem operation, not just the writeback and invalidation. 
The XFS flush+inval code looks like this:

	xfs_ilock(ip, XFS_IOLOCK_EXCL);
	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
	xfs_flush_unmap_range(ip, offset, len);
	<do operation that requires invalidated page cache>
	<flush modifications if necessary>
	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
	xfs_iunlock(ip, XFS_IOLOCK_EXCL);

With range locks it looks like this:

	range_lock_init(&rl, offset, len);
	range_lock_write(&ip->i_iolock, &rl);
	xfs_flush_unmap_range(ip, offset, len);
	<do operation that requires invalidated page cache>
	<flush modified range if necessary>
	range_unlock_write(&ip->i_iolock, &rl);

---

In summary, I can see why the page cache add lock works well for bcachefs,
but on the other hand I can say that it really doesn't match well for
filesystems like XFS or for filesystems that implement DAX and so may not
be using the page cache at all...

Don't get me wrong - I'm not opposed to including page cache add locking -
I'm just saying that the problems it tries to address (and ones it cannot
address) are already largely solved in existing filesystems. I suspect that
if we do merge this code, whatever locking is added would have to be
optional....

Cheers, Dave.

--
Dave Chinner
david@fromorbit.com
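[Editorial note: the range lock Dave describes can be modeled minimally - two holders conflict only when their byte ranges overlap, so non-overlapping dio/fallocate/truncate can run concurrently where a per-inode rwsem would serialise them. This sketch is write-exclusive only (no shared/read mode) and all names are invented; it is not the XFS implementation.]

```python
import threading

class RangeLock:
    """Byte-range lock: waiters block only on overlapping held ranges."""
    def __init__(self):
        self._cond = threading.Condition()
        self._held = []                   # list of (start, end) held ranges

    def _overlaps(self, start, end):
        return any(s < end and start < e for s, e in self._held)

    def lock(self, start, end):
        with self._cond:
            while self._overlaps(start, end):
                self._cond.wait()
            self._held.append((start, end))

    def trylock(self, start, end):
        with self._cond:
            if self._overlaps(start, end):
                return False
            self._held.append((start, end))
            return True

    def unlock(self, start, end):
        with self._cond:
            self._held.remove((start, end))
            self._cond.notify_all()
```

Because a lock can be held across AIO completion, the same structure stands in for what inode_dio_wait() provides today - a pending DIO simply keeps its range held until the IO completes.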
* Re: pagecache locking (was: bcachefs status update) merged)
  2019-06-13 23:55 ` Dave Chinner
@ 2019-06-14  2:30 ` Linus Torvalds
  2019-06-14  7:30 ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread

From: Linus Torvalds @ 2019-06-14 2:30 UTC (permalink / raw)
To: Dave Chinner
Cc: Kent Overstreet, Dave Chinner, Darrick J. Wong, Christoph Hellwig,
    Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing,
    linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton

On Thu, Jun 13, 2019 at 1:56 PM Dave Chinner <david@fromorbit.com> wrote:
>
> That said, the page cache is still far, far slower than direct IO,

Bullshit, Dave.

You've made that claim before, and it's been complete bullshit before too,
and I've called you out on it then too.

Why do you continue to make this obviously garbage argument?

The key word in the "page cache" name is "cache".

Caches work, Dave. Anybody who thinks caches don't work is incompetent.

99% of all filesystem accesses are cached, and they never do any IO at all,
and the page cache handles them beautifully.

When you say that the page cache is slower than direct IO, it's because you
don't even see or care about the *fast* case. You only get involved once
there is actual IO to be done. So you're making that statement without
taking into account all the cases that you don't see, and that you don't
care about, because the page cache has already handled them for you, and
done so much better than DIO can do or ever _will_ do.

Is direct IO faster when you *know* it's not cached, and shouldn't be
cached? Sure. But that's actually quite rare. How often do you use
non-temporal stores when you do non-IO programming? Approximately never,
perhaps? Because caches work.

And no, SSDs haven't made caches irrelevant. Not doing IO at all is still
orders of magnitude faster than doing IO.
And it's not clear nvdimms will either. So stop with the stupid and dishonest argument already, where you ignore the effects of caching. Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-14 2:30 ` Linus Torvalds @ 2019-06-14 7:30 ` Dave Chinner 2019-06-15 1:15 ` Linus Torvalds 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2019-06-14 7:30 UTC (permalink / raw) To: Linus Torvalds Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton

On Thu, Jun 13, 2019 at 04:30:36PM -1000, Linus Torvalds wrote:
> On Thu, Jun 13, 2019 at 1:56 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > That said, the page cache is still far, far slower than direct IO,
>
> Bullshit, Dave.
>
> You've made that claim before, and it's been complete bullshit before
> too, and I've called you out on it then too.

Yes, your last run of insulting rants on this topic resulted in me pointing out your CoC violations because you were unable to listen or discuss the subject matter in a civil manner. And you've started right where you left off last time....

> Why do you continue to make this obviously garbage argument?
>
> The key word in the "page cache" name is "cache".
>
> Caches work, Dave.

Yes, they do, I see plenty of cases where the page cache works just fine because it is still faster than most storage. But that's _not what I said_. Indeed, you haven't even bothered to ask me to clarify what I was referring to in the statement you quoted.

IOWs, you've taken _one single statement_ I made from a huge email about complexities in dealing with IO concurrency, the page cache and architectural flaws in the existing code, quoted it out of context, fabricated a completely new context and started ranting about how I know nothing about how caches or the page cache work. Not very professional but, unfortunately, an entirely predictable and _expected_ response.
Linus, nobody can talk about direct IO without you screaming and tossing all your toys out of the crib. If you can't be civil, or you find yourself writing some condescending "caching 101" explanation to someone who has spent the last 15+ years working with filesystems and caches, then you're far better off not saying anything.

---

So, in the interests of further _civil_ discussion, let me clarify my statement for you: for a highly concurrent application that is crunching through bulk data on large files on high throughput storage, the page cache is still far, far slower than direct IO.

Which comes back to this statement you made:

> Is direct IO faster when you *know* it's not cached, and shouldn't
> be cached? Sure. But that's actually quite rare.

This is where I think you get the wrong end of the stick, Linus. The world I work in has a significant proportion of applications where the data set is too large to be cached effectively, or is better cached by the application than the kernel. IOWs, data being cached efficiently by the page cache is the exception rather than the rule. Hence, they use direct IO because it is faster than the page cache. This is common in applications like major enterprise databases, HPC apps, data mining/analysis applications, etc. and there's an awful lot of the world that runs on these apps....

-Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-14 7:30 ` Dave Chinner @ 2019-06-15 1:15 ` Linus Torvalds 0 siblings, 0 replies; 24+ messages in thread From: Linus Torvalds @ 2019-06-15 1:15 UTC (permalink / raw) To: Dave Chinner Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton

On Thu, Jun 13, 2019 at 9:31 PM Dave Chinner <david@fromorbit.com> wrote:
>
> Yes, they do, I see plenty of cases where the page cache works just
> fine because it is still faster than most storage. But that's _not
> what I said_.

I only quoted one small part of your email, because I wanted to point out how you again dismissed caches. And yes, that literally _is_ what you said. In other parts of that same email you said "..it's getting to the point where the only reason for having a page cache is to support mmap() and cheap systems with spinning rust storage" and "That's my beef with relying on the page cache - the page cache is rapidly becoming a legacy structure that only serves to slow modern IO subsystems down" and your whole email was basically a rant against the page cache.

So I only quoted the bare minimum, and pointed out that caching is still damn important. Because most loads cache well.

Now you are back-tracking a bit from your statements, but don't go saying I was misreading you. How else would the above be read? You really were saying that caching was "legacy". I called you out on it. Now you're trying to back-track.

Yes, you have loads that don't cache well. But that does not mean that caching has somehow become irrelevant in the big picture or a "legacy" thing at all.

The thing is, I don't even hate DIO. But we always end up clashing because you seem to have this mindset where nothing else matters (which really came through in that email I replied to).
Do you really wonder why I point out that caching is important? Because you seem to actively claim caching doesn't matter. Are you happier now that I quoted more of your emails back to you? > IOWs, you've taken _one > single statement_ I made from a huge email about complexities in > dealing with IO concurency, the page cache and architectural flaws n > the existing code, quoted it out of context, fabricated a completely > new context and started ranting about how I know nothing about how > caches or the page cache work. See above. I cut things down a lot, but it wasn't a single statement at all. I just boiled it down to the basics. > Linus, nobody can talk about direct IO without you screaming and > tossing all your toys out of the crib. Dave, look in the mirror some day. You might be surprised. > So, in the interests of further _civil_ discussion, let me clarify > my statement for you: for a highly concurrent application that is > crunching through bulk data on large files on high throughput > storage, the page cache is still far, far slower than direct IO. .. and Christ, Dave, we even _agree_ on this. But when DIO becomes an issue is when you try to claim it makes the page cache irrelevant, or a problem. I also take issue with you then making statements that seem to be explicitly designed to be misleading. For DIO, you talk about how XFS has no serialization and gets great performance. Then in the very next email, you talk about how you think buffered IO has to be excessively serialized, and how XFS is the only one who does it properly, and how that is a problem for performance. But as far as I can tell, the serialization rule you quote is simply not true. But for you it is, and only for buffered IO. It's really as if you were actively trying to make the non-DIO case look bad by picking and choosing your rules. And the thing is, I suspect that the overlap between DIO and cached IO shouldn't even need to be there. 
We've generally tried to just not have them interact at all, by just having DIO invalidate the caches (which is really really cheap if they don't exist - which should be the common case by far!). People almost never mix the two at all, and we might be better off aiming to separate them out even more than we do now.

That's actually the part I like best about the page cache add lock - I may not be a great fan of yet another ad-hoc lock - but I do like how it adds minimal overhead to the cached case (because by definition, the good cached case is when you don't need to add new pages), while hopefully working well together with the whole "invalidate existing caches" case for DIO.

I know you don't like the cache flush and invalidation stuff for some reason, but I don't even understand why you care. Again, if you're actually just doing all DIO, the caches will be empty and not be in your way. So normally all that should be really really cheap. Flushing and invalidating caches that don't exist isn't really complicated, is it?

And if cached state *does* exist, and if it can't be invalidated (for example, existing busy mmap or whatever), maybe the solution there is "always fall back to buffered/cached IO". For the cases you care about, that should never happen, after all.

IOW, if anything, I think we should strive for a situation where the whole DIO vs cached becomes even _more_ independent. If there are busy caches, just fall back to cached IO. It will have lower IO throughput, but that's one of the _points_ of caches - they should decrease the need for IO, and less IO is what it's all about.

So I don't understand why you hate the page cache so much. For the cases you care about, the page cache should be a total non-issue. And if the page cache does exist, then it almost by definition means that it's not a case you care about.
And yes, yes, maybe some day people won't have SSD's at all, and it's all nvdimm's and all filesystem data accesses are DAX, and caching is all done by hardware and the page cache will never exist at all. At that point a page cache will be legacy. But honestly, that day is not today. It's decades away, and might never happen at all. So in the meantime, don't pooh-pooh the page cache. It works very well indeed, and I say that as somebody who has refused to touch spinning media (or indeed bad SSD's) for a decade. Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
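Linus's point above, that DIO invalidating the page cache on its way through is nearly free when nothing is cached, can be sketched with a toy model. All names and structures here are illustrative only; the real kernel path walks the address_space's cached pages and has to cope with dirty pages, mmap'd pages and errors:

```python
# Toy model of "DIO just invalidates the page cache on its way through":
# invalidation over an empty cache is practically free (the pure-DIO
# common case), and when pages *are* cached it keeps later buffered
# reads coherent with what the DIO wrote.

PAGE = 4096
page_cache = {}   # page index -> bytes ("the page cache")
disk = {}         # offset -> bytes ("the storage")

def pages_spanned(off, length):
    return range(off // PAGE, (off + length + PAGE - 1) // PAGE)

def dio_write(off, data):
    # Invalidate any cached copies of the written range first.
    # Iterating over an empty cache costs essentially nothing.
    for pg in pages_spanned(off, len(data)):
        page_cache.pop(pg, None)
    disk[off] = data

def buffered_read(off, length):
    pg = off // PAGE
    if pg not in page_cache:                  # miss: fill from "disk"
        page_cache[pg] = disk.get(off, b"")
    return page_cache[pg][:length]            # hit: no IO at all

dio_write(0, b"v1")
assert buffered_read(0, 2) == b"v1"   # fills the cache
dio_write(0, b"v2")                   # invalidates the cached page
assert buffered_read(0, 2) == b"v2"   # re-filled, coherent
```

In a pure-DIO workload the `page_cache.pop()` loop never finds anything to drop, which is the "really really cheap" case the email describes.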
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-13 23:55 ` Dave Chinner 2019-06-14 2:30 ` Linus Torvalds @ 2019-06-14 3:08 ` Linus Torvalds 2019-06-15 4:01 ` Linus Torvalds 2019-06-14 17:08 ` pagecache locking (was: bcachefs status update) merged) Kent Overstreet 2 siblings, 1 reply; 24+ messages in thread From: Linus Torvalds @ 2019-06-14 3:08 UTC (permalink / raw) To: Dave Chinner Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Thu, Jun 13, 2019 at 1:56 PM Dave Chinner <david@fromorbit.com> wrote: > > - buffered read and buffered write can run concurrently if they > don't overlap, but right now they are serialised because that's the > only way to provide POSIX atomic write vs read semantics (only XFS > provides userspace with that guarantee). I do not believe that posix itself actually requires that at all, although extended standards may. That said, from a quality of implementation standpoint, it's obviously a good thing to do, so it might be worth looking at if something reasonable can be done. The XFS atomicity guarantees are better than what other filesystems give, but they might also not be exactly required. But POSIX actually ends up being pretty lax, and says "Writes can be serialized with respect to other reads and writes. If a read() of file data can be proven (by any means) to occur after a write() of the data, it must reflect that write(), even if the calls are made by different processes. A similar requirement applies to multiple write operations to the same file position. This is needed to guarantee the propagation of data from write() calls to subsequent read() calls. This requirement is particularly significant for networked file systems, where some caching schemes violate these semantics." Note the "can" in "can be serialized", not "must". 
Also note that whole language about how the read file data must match the written data only if the read can be proven to have occurred after a write of that data. Concurrency is very much left in the air, only provably serial operations matter. (There is also language that talks about "after the write has successfully returned" etc - again, it's about reads that occur _after_ the write, not concurrently with the write). The only atomicity guarantees are about the usual pipe writes and PIPE_BUF. Those are very explicit. Of course, there are lots of standards outside of just the POSIX read/write thing, so you may be thinking of some other stricter standard. POSIX itself has always been pretty permissive. And as mentioned, I do agree from a QoI standpoint that atomicity is nice, and that the XFS behavior is better. However, it does seem that nobody really cares, because I'm not sure we've ever done it in general (although we do have that i_rwsem, but I think it's mainly used to give the proper lseek behavior). And so the XFS behavior may not necessarily be *worth* it, although I presume you have some test for this as part of xfstests. Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
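To make the quoted POSIX language concrete: the one ordering POSIX does guarantee is that a read which provably occurs after a completed write must reflect that write, even through a different file description. A small demonstration of that guaranteed case on a POSIX system (os.pread/os.pwrite are Unix-only); what it deliberately does not exercise is a read concurrent with a write, which is exactly the case POSIX leaves open and XFS chooses to serialise:

```python
import os
import tempfile

# A read() that provably occurs after a successful write() must reflect
# that write, even through an independent file description. That is the
# guaranteed ordering in the POSIX text quoted above; concurrent
# overlapping read/write is the part left unspecified.
fd, path = tempfile.mkstemp()
os.close(fd)

wfd = os.open(path, os.O_WRONLY)
rfd = os.open(path, os.O_RDONLY)   # an independent file description

os.pwrite(wfd, b"hello, page cache", 0)   # the write completes first...
data = os.pread(rfd, 17, 0)               # ...so this read must see it
assert data == b"hello, page cache"

os.close(wfd)
os.close(rfd)
os.unlink(path)
```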
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-14 3:08 ` Linus Torvalds @ 2019-06-15 4:01 ` Linus Torvalds 2019-06-17 22:47 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Linus Torvalds @ 2019-06-15 4:01 UTC (permalink / raw) To: Dave Chinner Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Thu, Jun 13, 2019 at 5:08 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I do not believe that posix itself actually requires that at all, > although extended standards may. So I tried to see if I could find what this perhaps alludes to. And I suspect it's not in the read/write thing, but the pthreads side talks about atomicity. Interesting, but I doubt if that's actually really intentional, since the non-thread read/write behavior specifically seems to avoid the whole concurrency issue. The pthreads atomicity thing seems to be about not splitting up IO and doing it in chunks when you have m:n threading models, but can be (mis-)construed to have threads given higher atomicity guarantees than processes. Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-15 4:01 ` Linus Torvalds @ 2019-06-17 22:47 ` Dave Chinner 2019-06-17 23:38 ` Linus Torvalds 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2019-06-17 22:47 UTC (permalink / raw) To: Linus Torvalds Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton

On Fri, Jun 14, 2019 at 06:01:07PM -1000, Linus Torvalds wrote:
> On Thu, Jun 13, 2019 at 5:08 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I do not believe that posix itself actually requires that at all,
> > although extended standards may.
>
> So I tried to see if I could find what this perhaps alludes to.
>
> And I suspect it's not in the read/write thing, but the pthreads side
> talks about atomicity.
>
> Interesting, but I doubt if that's actually really intentional, since
> the non-thread read/write behavior specifically seems to avoid the
> whole concurrency issue.

The wording of posix changes every time they release a new version of the standard, and it's _never_ obvious what behaviour the standard is actually meant to define. They are always written with sufficient ambiguity and wiggle room that they could mean _anything_. The POSIX 2017.1 standard you quoted is quite different to older versions, but it's no less ambiguous...

> The pthreads atomicity thing seems to be about not splitting up IO and
> doing it in chunks when you have m:n threading models, but can be
> (mis-)construed to have threads given higher atomicity guarantees than
> processes.

Right, but regardless of the spec we have to consider that the behaviour of XFS comes from its Irix heritage (actually from EFS, the predecessor of XFS from the late 1980s). i.e.
the IO exclusion model dates to long before POSIX had anything to say about pthreads, and its wording about atomicity could only refer to multi-process interactions.

These days, however, there is the unfortunate reality of a long tail of applications developed on other Unix systems under older POSIX specifications that are still being ported to and deployed on Linux. Hence the completely ambiguous behaviours defined in the older specs are still just as important these days as the completely ambiguous behaviours defined in the new specifications. :/

Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-17 22:47 ` Dave Chinner @ 2019-06-17 23:38 ` Linus Torvalds 2019-06-18 4:21 ` Amir Goldstein 0 siblings, 1 reply; 24+ messages in thread From: Linus Torvalds @ 2019-06-17 23:38 UTC (permalink / raw) To: Dave Chinner Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton

On Mon, Jun 17, 2019 at 3:48 PM Dave Chinner <david@fromorbit.com> wrote:
>
> The wording of posix changes every time they release a new version
> of the standard, and it's _never_ obvious what behaviour the
> standard is actually meant to define. They are always written with
> sufficient ambiguity and wiggle room that they could mean
> _anything_. The POSIX 2017.1 standard you quoted is quite different
> to older versions, but it's no less ambiguous...

POSIX has always been pretty lax, partly because all the Unixes did things differently, but partly because it then also ended up trying to work for the VMS and Windows posix subsystems.. So yes, the language tends to be intentionally not all that strict.

> > The pthreads atomicity thing seems to be about not splitting up IO and
> > doing it in chunks when you have m:n threading models, but can be
> > (mis-)construed to have threads given higher atomicity guarantees than
> > processes.
>
> Right, but regardless of the spec we have to consider that the
> behaviour of XFS comes from it's Irix heritage (actually from EFS,
> the predecessor of XFS from the late 1980s)

Sure. And as I mentioned, I think it's technically the nicer guarantee.

That said, it's a pretty *expensive* guarantee. It's one that you yourself are not willing to give for O_DIRECT IO.

And it's not a guarantee that Linux has ever had. In fact, it's not even something I've ever seen anybody ever depend on.
I agree that it's possible that some app out there might depend on that kind of guarantee, but I also suspect it's much much more likely that it's the other way around: XFS is being unnecessarily strict, because everybody is testing against filesystems that don't actually give the total atomicity guarantees. Nobody develops for other unixes any more (and nobody really ever did it by reading standards papers - even if they had been very explicit). And honestly, the only people who really do threaded accesses to the same file (a) don't want that guarantee in the first place (b) are likely to use direct-io that apparently doesn't give that atomicity guarantee even on xfs so I do think it's moot. End result: if we had a really cheap range lock, I think it would be a good idea to use it (for the whole QoI implementation), but for practical reasons it's likely better to just stick to the current lack of serialization because it performs better and nobody really seems to want anything else anyway. Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-17 23:38 ` Linus Torvalds @ 2019-06-18 4:21 ` Amir Goldstein 2019-06-19 10:38 ` Jan Kara 0 siblings, 1 reply; 24+ messages in thread From: Amir Goldstein @ 2019-06-18 4:21 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Chinner, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton > > Right, but regardless of the spec we have to consider that the > > behaviour of XFS comes from it's Irix heritage (actually from EFS, > > the predecessor of XFS from the late 1980s) > > Sure. And as I mentioned, I think it's technically the nicer guarantee. > > That said, it's a pretty *expensive* guarantee. It's one that you > yourself are not willing to give for O_DIRECT IO. > > And it's not a guarantee that Linux has ever had. In fact, it's not > even something I've ever seen anybody ever depend on. > > I agree that it's possible that some app out there might depend on > that kind of guarantee, but I also suspect it's much much more likely > that it's the other way around: XFS is being unnecessarily strict, > because everybody is testing against filesystems that don't actually > give the total atomicity guarantees. > > Nobody develops for other unixes any more (and nobody really ever did > it by reading standards papers - even if they had been very explicit). > > And honestly, the only people who really do threaded accesses to the same file > > (a) don't want that guarantee in the first place > > (b) are likely to use direct-io that apparently doesn't give that > atomicity guarantee even on xfs > > so I do think it's moot. 
>
> End result: if we had a really cheap range lock, I think it would be a
> good idea to use it (for the whole QoI implementation), but for
> practical reasons it's likely better to just stick to the current lack
> of serialization because it performs better and nobody really seems to
> want anything else anyway.
>

This is the point in the conversation where somebody usually steps in and says "let the user/distro decide". Distro maintainers are in a much better position to take the risk of breaking hypothetical applications.

I should point out that even if "strict atomic rw" behavior is desired, then page cache warmup [1] significantly improves performance. Having mentioned that, the discussion can now return to what is the preferred way to solve the punch hole vs. page cache add race.

XFS may end up with special tailored range locks, which brings some other benefits to XFS, but all filesystems need the solution for the punch hole vs. page cache add race. Jan recently took a stab at it for ext4 [2], but that didn't work out. So I wonder what everyone thinks about Kent's page add lock as the solution to the problem. Allegedly, all filesystems (XFS included) are potentially exposed to stale data exposure/data corruption.

Thanks, Amir.

[1] https://lore.kernel.org/linux-fsdevel/20190404165737.30889-1-amir73il@gmail.com/
[2] https://lore.kernel.org/linux-fsdevel/20190603132155.20600-3-jack@suse.cz/

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-18 4:21 ` Amir Goldstein @ 2019-06-19 10:38 ` Jan Kara 2019-06-19 22:37 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Jan Kara @ 2019-06-19 10:38 UTC (permalink / raw) To: Amir Goldstein Cc: Linus Torvalds, Dave Chinner, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Tue 18-06-19 07:21:56, Amir Goldstein wrote: > > > Right, but regardless of the spec we have to consider that the > > > behaviour of XFS comes from it's Irix heritage (actually from EFS, > > > the predecessor of XFS from the late 1980s) > > > > Sure. And as I mentioned, I think it's technically the nicer guarantee. > > > > That said, it's a pretty *expensive* guarantee. It's one that you > > yourself are not willing to give for O_DIRECT IO. > > > > And it's not a guarantee that Linux has ever had. In fact, it's not > > even something I've ever seen anybody ever depend on. > > > > I agree that it's possible that some app out there might depend on > > that kind of guarantee, but I also suspect it's much much more likely > > that it's the other way around: XFS is being unnecessarily strict, > > because everybody is testing against filesystems that don't actually > > give the total atomicity guarantees. > > > > Nobody develops for other unixes any more (and nobody really ever did > > it by reading standards papers - even if they had been very explicit). > > > > And honestly, the only people who really do threaded accesses to the same file > > > > (a) don't want that guarantee in the first place > > > > (b) are likely to use direct-io that apparently doesn't give that > > atomicity guarantee even on xfs > > > > so I do think it's moot. 
> >
> > End result: if we had a really cheap range lock, I think it would be a
> > good idea to use it (for the whole QoI implementation), but for
> > practical reasons it's likely better to just stick to the current lack
> > of serialization because it performs better and nobody really seems to
> > want anything else anyway.
> >
>
> This is the point in the conversation where somebody usually steps in
> and says "let the user/distro decide". Distro maintainers are in a much
> better position to take the risk of breaking hypothetical applications.
>
> I should point out that even if "strict atomic rw" behavior is desired, then
> page cache warmup [1] significantly improves performance.
> Having mentioned that, the discussion can now return to what is the
> preferred way to solve the punch hole vs. page cache add race.
>
> XFS may end up with special tailored range locks, which beings some
> other benefits to XFS, but all filesystems need the solution for the punch
> hole vs. page cache add race.
> Jan recently took a stab at it for ext4 [2], but that didn't work out.

Yes, but I have an idea how to fix it. I just need to push acquiring ext4's i_mmap_sem down a bit further so that only page cache filling is protected by it but not copying of data out to userspace. But I didn't get to coding it last week due to other stuff.

> So I wonder what everyone thinks about Kent's page add lock as the
> solution to the problem.
> Allegedly, all filesystems (XFS included) are potentially exposed to
> stale data exposure/data corruption.

When we first realized that hole-punching vs page faults races are an issue I was looking into a generic solution but at that time there was a sentiment against adding another rwsem to struct address_space (or inode) so we ended up with a private lock in ext4 (i_mmap_rwsem), XFS (I_MMAPLOCK), and other filesystems these days.
If people think that growing struct inode for everybody is OK, we can think about lifting private filesystem solutions into a generic one. I'm fine with that. That being said as Dave said we use those fs-private locks also for serializing against equivalent issues arising for DAX. So the problem is not only about page cache but generally about doing IO and caching block mapping information for a file range. So the solution should not be too tied to page cache. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 24+ messages in thread
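The locking discipline Jan describes, page faults filling the page cache under a shared lock while hole punch invalidates under an exclusive one, can be modelled in user space. This is only a toy model of the idea: a dict stands in for the page cache, and the real i_mmap_rwsem/XFS_MMAPLOCK are kernel rwsems with very different internals:

```python
import threading

class RWSem:
    """Minimal reader/writer semaphore standing in for ext4's
    i_mmap_sem (or XFS_MMAPLOCK). Simplified to reader-preference;
    enough to show the serialisation, not the kernel's fairness."""

    def __init__(self):
        self._cv = threading.Condition()
        self._readers = 0
        self._writer = False

    def down_read(self):
        with self._cv:
            while self._writer:
                self._cv.wait()
            self._readers += 1

    def up_read(self):
        with self._cv:
            self._readers -= 1
            self._cv.notify_all()

    def down_write(self):
        with self._cv:
            while self._writer or self._readers:
                self._cv.wait()
            self._writer = True

    def up_write(self):
        with self._cv:
            self._writer = False
            self._cv.notify_all()

# Toy "inode": page cache is a dict of page -> data, backed by "disk".
mmap_sem = RWSem()
disk = {0: b"live", 1: b"live"}
page_cache = {}

def fault_in(page):
    # Page fault / readahead path: shared lock while instantiating
    # page cache, so a racing punch cannot complete mid-fill.
    mmap_sem.down_read()
    try:
        if page in disk:
            page_cache[page] = disk[page]
    finally:
        mmap_sem.up_read()

def punch_hole(page):
    # fallocate(PUNCH_HOLE) path: exclusive lock across invalidate and
    # free, so no fault can instantiate a stale copy of the range.
    mmap_sem.down_write()
    try:
        page_cache.pop(page, None)
        disk.pop(page, None)
    finally:
        mmap_sem.up_write()

fault_in(0)
punch_hole(0)
fault_in(0)                    # after the punch nothing can be cached
assert 0 not in page_cache
```

Whatever ordering the threads run in, the exclusive section guarantees the invariant that no page survives in the cache after its backing store is punched out, which is exactly the stale-data exposure the per-filesystem locks exist to prevent.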
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-19 10:38 ` Jan Kara @ 2019-06-19 22:37 ` Dave Chinner 2019-07-03 0:04 ` pagecache locking Boaz Harrosh 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2019-06-19 22:37 UTC (permalink / raw) To: Jan Kara Cc: Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Wed, Jun 19, 2019 at 12:38:38PM +0200, Jan Kara wrote: > On Tue 18-06-19 07:21:56, Amir Goldstein wrote: > > > > Right, but regardless of the spec we have to consider that the > > > > behaviour of XFS comes from it's Irix heritage (actually from EFS, > > > > the predecessor of XFS from the late 1980s) > > > > > > Sure. And as I mentioned, I think it's technically the nicer guarantee. > > > > > > That said, it's a pretty *expensive* guarantee. It's one that you > > > yourself are not willing to give for O_DIRECT IO. > > > > > > And it's not a guarantee that Linux has ever had. In fact, it's not > > > even something I've ever seen anybody ever depend on. > > > > > > I agree that it's possible that some app out there might depend on > > > that kind of guarantee, but I also suspect it's much much more likely > > > that it's the other way around: XFS is being unnecessarily strict, > > > because everybody is testing against filesystems that don't actually > > > give the total atomicity guarantees. > > > > > > Nobody develops for other unixes any more (and nobody really ever did > > > it by reading standards papers - even if they had been very explicit). > > > > > > And honestly, the only people who really do threaded accesses to the same file > > > > > > (a) don't want that guarantee in the first place > > > > > > (b) are likely to use direct-io that apparently doesn't give that > > > atomicity guarantee even on xfs > > > > > > so I do think it's moot. 
> > > > > > End result: if we had a really cheap range lock, I think it would be a > > > good idea to use it (for the whole QoI implementation), but for > > > practical reasons it's likely better to just stick to the current lack > > > of serialization because it performs better and nobody really seems to > > > want anything else anyway. > > > > > > > This is the point in the conversation where somebody usually steps in > > and says "let the user/distro decide". Distro maintainers are in a much > > better position to take the risk of breaking hypothetical applications. > > > > I should point out that even if "strict atomic rw" behavior is desired, then > > page cache warmup [1] significantly improves performance. > > Having mentioned that, the discussion can now return to what is the > > preferred way to solve the punch hole vs. page cache add race. > > > > XFS may end up with special tailored range locks, which beings some > > other benefits to XFS, but all filesystems need the solution for the punch > > hole vs. page cache add race. > > Jan recently took a stab at it for ext4 [2], but that didn't work out. > > Yes, but I have idea how to fix it. I just need to push acquiring ext4's > i_mmap_sem down a bit further so that only page cache filling is protected > by it but not copying of data out to userspace. But I didn't get to coding > it last week due to other stuff. > > > So I wonder what everyone thinks about Kent's page add lock as the > > solution to the problem. > > Allegedly, all filesystems (XFS included) are potentially exposed to > > stale data exposure/data corruption. > > When we first realized that hole-punching vs page faults races are an issue I > was looking into a generic solution but at that time there was a sentiment > against adding another rwsem to struct address_space (or inode) so we ended > up with a private lock in ext4 (i_mmap_rwsem), XFS (I_MMAPLOCK), and other > filesystems these days. 
If people think that growing struct inode for > everybody is OK, we can think about lifting private filesystem solutions > into a generic one. I'm fine with that. I'd prefer it doesn't get lifted to the VFS because I'm planning on getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is likely to go away in the near term because a range lock can be taken on either side of the mmap_sem in the page fault path. > That being said as Dave said we use those fs-private locks also for > serializing against equivalent issues arising for DAX. So the problem is > not only about page cache but generally about doing IO and caching > block mapping information for a file range. So the solution should not be > too tied to page cache. Yup, that was the point I was trying to make when Linus started shouting at me about how caches work and how essential they are. I guess the fact that DAX doesn't use the page cache isn't as widely known as I assumed it was... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking 2019-06-19 22:37 ` Dave Chinner @ 2019-07-03 0:04 ` Boaz Harrosh [not found] ` <DM6PR19MB250857CB8A3A1C8279D6F2F3C5FB0@DM6PR19MB2508.namprd19.prod.outlook.com> 2019-07-05 23:31 ` Dave Chinner 0 siblings, 2 replies; 24+ messages in thread From: Boaz Harrosh @ 2019-07-03 0:04 UTC (permalink / raw) To: Dave Chinner, Jan Kara Cc: Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On 20/06/2019 01:37, Dave Chinner wrote: <> > > I'd prefer it doesn't get lifted to the VFS because I'm planning on > getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is > likely to go away in the near term because a range lock can be > taken on either side of the mmap_sem in the page fault path. > <> Sir Dave, Sorry if this was answered before, but I am very curious. In the zufs project I have an equivalent rw_MMAPLOCK that I _read_lock on page_faults. (Reads & writes all take read-locks ...) The only reason I have it is because of lockdep actually. Specifically for those xfstests that mmap a buffer then direct_IO in/out of that buffer from/to another file in the same FS or the same file. (For lockdep it's the same case). I would be perfectly happy to recursively _read_lock both from the top of the page_fault at the DIO path, and under in the page_fault. I'm _read_locking after all. But lockdep is hard to convince. So I stole the xfs idea of having an rw_MMAPLOCK. And grab yet another _write_lock at truncate/punch/clone time when all mapping traversal needs to stop for the destructive change to take place. (Allocations are done another way and are race safe with traversal) How do you intend to address this problem with range-locks? i.e. recursively taking the same "lock"? Because if not for the recursiveness and lockdep I would not need the extra lock object per inode. 
Thanks Boaz ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <DM6PR19MB250857CB8A3A1C8279D6F2F3C5FB0@DM6PR19MB2508.namprd19.prod.outlook.com>]
* Re: pagecache locking [not found] ` <DM6PR19MB250857CB8A3A1C8279D6F2F3C5FB0@DM6PR19MB2508.namprd19.prod.outlook.com> @ 2019-07-03 1:25 ` Boaz Harrosh 0 siblings, 0 replies; 24+ messages in thread From: Boaz Harrosh @ 2019-07-03 1:25 UTC (permalink / raw) To: Patrick Farrell, Dave Chinner, Jan Kara Cc: Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On 03/07/2019 04:07, Patrick Farrell wrote: > Recursive read locking is generally unsafe; that's why lockdep > complains about it. The common RW lock primitives are queued in > their implementation, meaning this recursive read lock sequence: > P1 - read (gets lock) > P2 - write > P1 - read > > Results not in a successful read lock, but P1 blocking behind P2, > which is blocked behind P1. > Readers are not allowed to jump past waiting writers. OK thanks, that makes sense. I did not know about that last part. It's a kind of lock fairness I did not know we have. So I guess I'll keep my two locks then. The write_locker is the SLOW path for me anyway, right? [if we are already at the subject: Do mutexes have the same lock fairness as above? Does the write_lock side of rw_sem have the same fairness? Something I never figured out] Thanks Boaz > > - Patrick ^ permalink raw reply [flat|nested] 24+ messages in thread
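Patrick's fairness rule can be made concrete with a toy model of a queued rwlock's admission policy. This is purely illustrative — kernel rwsems are not implemented like this — but it shows why the recursive read attempt parks behind the queued writer instead of being granted:

```python
# Toy model of a *queued* reader-writer lock's grant policy:
# FIFO order, so a reader may not jump past a waiting writer.
from collections import deque

class QueuedRWLock:
    def __init__(self):
        self.readers = set()    # threads holding the lock for read
        self.writer = None      # thread holding the lock for write
        self.waiters = deque()  # FIFO queue of (thread, mode)

    def request(self, thread, mode):
        """Try to take the lock; True if granted now, False if queued."""
        if mode == "read":
            # A reader is admitted only if no writer holds the lock AND
            # no writer is already queued ahead of it (fairness rule).
            writer_waiting = any(m == "write" for _, m in self.waiters)
            if self.writer is None and not writer_waiting:
                self.readers.add(thread)
                return True
        else:  # write
            if self.writer is None and not self.readers:
                self.writer = thread
                return True
        self.waiters.append((thread, mode))
        return False

lock = QueuedRWLock()
granted1 = lock.request("P1", "read")   # P1 gets the read lock
granted2 = lock.request("P2", "write")  # P2 queues behind P1's readers
granted3 = lock.request("P1", "read")   # recursive read queues behind P2!

# P1 now waits on P2, P2 waits on P1 -> deadlock, exactly what lockdep warns about.
print(granted1, granted2, granted3)
```

With a reader-preferring (unfair) lock the third request would be granted, which is why Boaz's recursive read "feels" safe; the queued implementation is what turns it into a deadlock.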
* Re: pagecache locking 2019-07-03 0:04 ` pagecache locking Boaz Harrosh [not found] ` <DM6PR19MB250857CB8A3A1C8279D6F2F3C5FB0@DM6PR19MB2508.namprd19.prod.outlook.com> @ 2019-07-05 23:31 ` Dave Chinner 2019-07-07 15:05 ` Boaz Harrosh 2019-07-08 13:31 ` Jan Kara 1 sibling, 2 replies; 24+ messages in thread From: Dave Chinner @ 2019-07-05 23:31 UTC (permalink / raw) To: Boaz Harrosh Cc: Jan Kara, Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Wed, Jul 03, 2019 at 03:04:45AM +0300, Boaz Harrosh wrote: > On 20/06/2019 01:37, Dave Chinner wrote: > <> > > > > I'd prefer it doesn't get lifted to the VFS because I'm planning on > > getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is > > likely to go away in the near term because a range lock can be > > taken on either side of the mmap_sem in the page fault path. > > > <> > Sir Dave > > Sorry if this was answered before. I am please very curious. In the zufs > project I have an equivalent rw_MMAPLOCK that I _read_lock on page_faults. > (Read & writes all take read-locks ...) > The only reason I have it is because of lockdep actually. > > Specifically for those xfstests that mmap a buffer then direct_IO in/out > of that buffer from/to another file in the same FS or the same file. > (For lockdep its the same case). Which can deadlock if the same inode rwsem is taken on both sides of the mmap_sem, as lockdep tells you... > I would be perfectly happy to recursively _read_lock both from the top > of the page_fault at the DIO path, and under in the page_fault. I'm > _read_locking after all. But lockdep is hard to convince. So I stole the > xfs idea of having an rw_MMAPLOCK. And grab yet another _write_lock at > truncate/punch/clone time when all mapping traversal needs to stop for > the destructive change to take place. 
(Allocations are done another way > and are race safe with traversal) > > How do you intend to address this problem with range-locks? ie recursively > taking the same "lock"? because if not for the recursive-ity and lockdep I would > not need the extra lock-object per inode. As long as the IO ranges to the same file *don't overlap*, it should be perfectly safe to take separate range locks (in read or write mode) on either side of the mmap_sem as non-overlapping range locks can be nested and will not self-deadlock. The "recursive lock problem" still arises with DIO and page faults inside gup, but it only occurs when the user buffer range overlaps the DIO range to the same file. IOWs, the application is trying to do something that has an undefined result and is likely to result in data corruption. So, in that case I plan to have the gup page faults fail and the DIO return -EDEADLOCK to userspace.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
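The policy Dave describes — nested range locks are fine when the ranges are disjoint, and a gup fault overlapping an in-flight DIO fails so the DIO can return -EDEADLOCK — reduces to an interval-overlap test. A minimal sketch (the function names are invented for illustration; this is not the XFS code):

```python
import errno

def ranges_overlap(a_start, a_end, b_start, b_end):
    # Inclusive byte ranges overlap iff each starts before the other ends.
    return a_start <= b_end and b_start <= a_end

def nested_range_lock(held, want):
    """held/want are (start, end) byte ranges on the same file.
    Return 0 if the nested lock may be taken, -EDEADLK on conflict."""
    if ranges_overlap(*held, *want):
        return -errno.EDEADLK  # user buffer overlaps the DIO range
    return 0                   # disjoint ranges: safe to nest

# DIO to [0, 4095] while faulting a user buffer mapped over [8192, 12287]
assert nested_range_lock((0, 4095), (8192, 12287)) == 0
# DIO whose user buffer is mmap()ed over the overlapping file range:
# undefined result anyway, so fail the fault and return -EDEADLK.
assert nested_range_lock((0, 4095), (2048, 6143)) == -errno.EDEADLK
print("ok")
```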
* Re: pagecache locking 2019-07-05 23:31 ` Dave Chinner @ 2019-07-07 15:05 ` Boaz Harrosh 2019-07-07 23:55 ` Dave Chinner 1 sibling, 1 reply; 24+ messages in thread From: Boaz Harrosh @ 2019-07-07 15:05 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On 06/07/2019 02:31, Dave Chinner wrote: > > As long as the IO ranges to the same file *don't overlap*, it should > be perfectly safe to take separate range locks (in read or write > mode) on either side of the mmap_sem as non-overlapping range locks > can be nested and will not self-deadlock. > > The "recursive lock problem" still arises with DIO and page faults > inside gup, but it only occurs when the user buffer range overlaps > the DIO range to the same file. IOWs, the application is trying to > do something that has an undefined result and is likely to result in > data corruption. So, in that case I plan to have the gup page faults > fail and the DIO return -EDEADLOCK to userspace.... > This sounds very cool. I now understand. I hope you put all the tools for this in generic places so it will be easier to salvage. One thing I will be very curious to see is how you teach lockdep about the "range locks can be nested" thing. I know it's possible, other places do it, but it's something I never understood. > Cheers, > Dave. [ Ha, one more question if you have time: In one of the mails, and you also mentioned it before, you said that the rw_read_lock is not able to scale well on mammoth machines over 10s of cores (maybe you said over 20). I wonder why that happens. Is it because of the atomic operations, or something in the lock algorithm? In my theoretical understanding, as long as there are no write-lock-grabbers, why would the readers interfere with each other? 
] Thanks Boaz ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking 2019-07-07 15:05 ` Boaz Harrosh @ 2019-07-07 23:55 ` Dave Chinner 0 siblings, 0 replies; 24+ messages in thread From: Dave Chinner @ 2019-07-07 23:55 UTC (permalink / raw) To: Boaz Harrosh Cc: Jan Kara, Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Sun, Jul 07, 2019 at 06:05:16PM +0300, Boaz Harrosh wrote: > On 06/07/2019 02:31, Dave Chinner wrote: > > > > > As long as the IO ranges to the same file *don't overlap*, it should > > be perfectly safe to take separate range locks (in read or write > > mode) on either side of the mmap_sem as non-overlapping range locks > > can be nested and will not self-deadlock. > > > > The "recursive lock problem" still arises with DIO and page faults > > inside gup, but it only occurs when the user buffer range overlaps > > the DIO range to the same file. IOWs, the application is trying to > > do something that has an undefined result and is likely to result in > > data corruption. So, in that case I plan to have the gup page faults > > fail and the DIO return -EDEADLOCK to userspace.... > > > > This sounds very cool. I now understand. I hope you put all the tools > for this in generic places so it will be easier to salvage. That's the plan, though I'm not really caring about anything outside XFS for the moment. > One thing I will be very curious to see is how you teach lockdep > about the "range locks can be nested" thing. I know its possible, > other places do it, but its something I never understood. The issue with lockdep is not nested locks, it's that there is no concept of ranges. e.g. This is fine: P0 P1 read_lock(A, 0, 1000) read_lock(B, 0, 1000) write_lock(B, 1001, 2000) write_lock(A, 1001, 2000) Because the read/write lock ranges on file A don't overlap and so can be held concurrently, similarly the ranges on file B. i.e. 
This lock pattern does not result in deadlock. However, this very similar lock pattern is not fine: P0 P1 read_lock(A, 0, 1000) read_lock(B, 0, 1000) write_lock(B, 500, 1500) write_lock(A, 900, 1900) i.e. it's an ABBA deadlock because the lock ranges partially overlap. IOWs, the problem with lockdep is not nesting read lock or nesting write locks (because that's valid, too), the problem is that it needs to be taught about ranges. Once it knows about ranges, nested read/write locking contexts don't require any special support... As it is, tracking overlapping lock ranges in lockdep will be interesting, given that I've been taking several thousand non-overlapping range locks concurrently on a single file in my testing. Tracking this sort of usage without completely killing the machine looking for conflicts and order violations likely makes lockdep validation of range locks a non-starter.... > [ Ha one more question if you have time: > > In one of the mails, and you also mentioned it before, you said about > the rw_read_lock not being able to scale well on mammoth machines > over 10ns of cores (maybe you said over 20). > I wonder why that happens. Is it because of the atomic operations, > or something in the lock algorithm. In my theoretical understanding, > as long as there are no write-lock-grabbers, why would the readers > interfere with each other? Concurrent shared read lock/unlock are still atomic counting operations. Hence they bounce exclusive cachelines from CPU to CPU... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
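The distinction Dave draws — the same A→B/B→A acquisition order is deadlock-free with disjoint ranges but an ABBA deadlock with overlapping ones — is exactly what a range-aware lockdep would have to check. A small model of the two scenarios above (illustrative only):

```python
def overlap(r1, r2):
    return r1[0] <= r2[1] and r2[0] <= r1[1]

def abba_deadlock(p0_locks, p1_locks):
    """Each argument is [(file, range), (file, range)] in acquisition
    order. The classic ABBA cycle exists only if each thread's second
    lock overlaps a lock the other thread already holds."""
    (f0a, r0a), (f0b, r0b) = p0_locks
    (f1a, r1a), (f1b, r1b) = p1_locks
    p0_blocks_on_p1 = f0b == f1a and overlap(r0b, r1a)
    p1_blocks_on_p0 = f1b == f0a and overlap(r1b, r0a)
    return p0_blocks_on_p1 and p1_blocks_on_p0

# First example: ranges on each file are disjoint -> no deadlock
safe = abba_deadlock(
    [("A", (0, 1000)), ("B", (1001, 2000))],
    [("B", (0, 1000)), ("A", (1001, 2000))])

# Second example: ranges partially overlap -> ABBA deadlock
bad = abba_deadlock(
    [("A", (0, 1000)), ("B", (500, 1500))],
    [("B", (0, 1000)), ("A", (900, 1900))])
print(safe, bad)
```

Plain lockdep collapses each file to a single lock class, so it would flag both patterns; tracking the ranges is what lets the first one pass.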
* Re: pagecache locking 2019-07-05 23:31 ` Dave Chinner 2019-07-07 15:05 ` Boaz Harrosh @ 2019-07-08 13:31 ` Jan Kara 2019-07-09 23:47 ` Dave Chinner 1 sibling, 1 reply; 24+ messages in thread From: Jan Kara @ 2019-07-08 13:31 UTC (permalink / raw) To: Dave Chinner Cc: Boaz Harrosh, Jan Kara, Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Sat 06-07-19 09:31:57, Dave Chinner wrote: > On Wed, Jul 03, 2019 at 03:04:45AM +0300, Boaz Harrosh wrote: > > On 20/06/2019 01:37, Dave Chinner wrote: > > <> > > > > > > I'd prefer it doesn't get lifted to the VFS because I'm planning on > > > getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is > > > likely to go away in the near term because a range lock can be > > > taken on either side of the mmap_sem in the page fault path. > > > > > <> > > Sir Dave > > > > Sorry if this was answered before. I am please very curious. In the zufs > > project I have an equivalent rw_MMAPLOCK that I _read_lock on page_faults. > > (Read & writes all take read-locks ...) > > The only reason I have it is because of lockdep actually. > > > > Specifically for those xfstests that mmap a buffer then direct_IO in/out > > of that buffer from/to another file in the same FS or the same file. > > (For lockdep its the same case). > > Which can deadlock if the same inode rwsem is taken on both sides of > the mmap_sem, as lockdep tells you... > > > I would be perfectly happy to recursively _read_lock both from the top > > of the page_fault at the DIO path, and under in the page_fault. I'm > > _read_locking after all. But lockdep is hard to convince. So I stole the > > xfs idea of having an rw_MMAPLOCK. And grab yet another _write_lock at > > truncate/punch/clone time when all mapping traversal needs to stop for > > the destructive change to take place. 
(Allocations are done another way > > and are race safe with traversal) > > > > How do you intend to address this problem with range-locks? ie recursively > > taking the same "lock"? because if not for the recursive-ity and lockdep I would > > not need the extra lock-object per inode. > > As long as the IO ranges to the same file *don't overlap*, it should > be perfectly safe to take separate range locks (in read or write > mode) on either side of the mmap_sem as non-overlapping range locks > can be nested and will not self-deadlock. I'd be really careful with nesting range locks. You can have nasty situations like: Thread 1 Thread 2 read_lock(0,1000) write_lock(500,1500) -> blocks due to read lock read_lock(1001,1500) -> blocks due to write lock (or you have to break fairness and then deal with starvation issues). So once you allow nesting of range locks, you have to very carefully define what is and what is not allowed. That's why in my range lock implementation ages back I've decided to treat range lock as a rwsem for deadlock verification purposes. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking 2019-07-08 13:31 ` Jan Kara @ 2019-07-09 23:47 ` Dave Chinner 2019-07-10 8:41 ` Jan Kara 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2019-07-09 23:47 UTC (permalink / raw) To: Jan Kara Cc: Boaz Harrosh, Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Mon, Jul 08, 2019 at 03:31:14PM +0200, Jan Kara wrote: > On Sat 06-07-19 09:31:57, Dave Chinner wrote: > > On Wed, Jul 03, 2019 at 03:04:45AM +0300, Boaz Harrosh wrote: > > > On 20/06/2019 01:37, Dave Chinner wrote: > > > <> > > > > > > > > I'd prefer it doesn't get lifted to the VFS because I'm planning on > > > > getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is > > > > likely to go away in the near term because a range lock can be > > > > taken on either side of the mmap_sem in the page fault path. > > > > > > > <> > > > Sir Dave > > > > > > Sorry if this was answered before. I am please very curious. In the zufs > > > project I have an equivalent rw_MMAPLOCK that I _read_lock on page_faults. > > > (Read & writes all take read-locks ...) > > > The only reason I have it is because of lockdep actually. > > > > > > Specifically for those xfstests that mmap a buffer then direct_IO in/out > > > of that buffer from/to another file in the same FS or the same file. > > > (For lockdep its the same case). > > > > Which can deadlock if the same inode rwsem is taken on both sides of > > the mmap_sem, as lockdep tells you... > > > > > I would be perfectly happy to recursively _read_lock both from the top > > > of the page_fault at the DIO path, and under in the page_fault. I'm > > > _read_locking after all. But lockdep is hard to convince. So I stole the > > > xfs idea of having an rw_MMAPLOCK. 
And grab yet another _write_lock at > > > truncate/punch/clone time when all mapping traversal needs to stop for > > > the destructive change to take place. (Allocations are done another way > > > and are race safe with traversal) > > > > > > How do you intend to address this problem with range-locks? ie recursively > > > taking the same "lock"? because if not for the recursive-ity and lockdep I would > > > not need the extra lock-object per inode. > > > > As long as the IO ranges to the same file *don't overlap*, it should > > be perfectly safe to take separate range locks (in read or write > > mode) on either side of the mmap_sem as non-overlapping range locks > > can be nested and will not self-deadlock. > > I'd be really careful with nesting range locks. You can have nasty > situations like: > > Thread 1 Thread 2 > read_lock(0,1000) > write_lock(500,1500) -> blocks due to read lock > read_lock(1001,1500) -> blocks due to write lock (or you have to break > fairness and then deal with starvation issues). > > So once you allow nesting of range locks, you have to very carefully define > what is and what is not allowed. Yes. I do understand the problem with rwsem read nesting and how that can translate to range locks. That's why my range locks don't even try to block on other pending waiters. The case where read nesting vs write might occur like above is something like copy_file_range() vs truncate, but otherwise for IO locks we simply don't have arbitrarily deep nesting of range locks. i.e. 
for your example my range lock would result in: Thread 1 Thread 2 read_lock(0,1000) write_lock(500,1500) <finds conflicting read lock> <marks read lock as having a write waiter> <parks on range lock wait list> <...> read_lock_nested(1001,1500) <no overlapping range in tree> <gets nested range lock> <....> read_unlock(1001,1500) <stays blocked because nothing is waiting on (1001,1500) so no wakeup> <....> read_unlock(0,1000) <sees write waiter flag, runs wakeup> <gets woken> <retries write lock> <write lock gained> IOWs, I'm not trying to complicate the range lock implementation with complex stuff like waiter fairness or anti-starvation semantics at this point in time. Waiters simply don't impact whether a new lock can be gained or not, and hence the limitations of rwsem semantics don't apply. If such functionality is necessary (and I suspect it will be to prevent AIO from delaying truncate and fallocate-like operations indefinitely) then I'll add a "barrier" lock type (e.g. range_lock_write_barrier()) that will block new range lock attempts across it's span. However, because this can cause deadlocks like the above, a barrier lock will not block new *_lock_nested() or *_trylock() calls, hence allowing runs of nested range locking to complete and drain without deadlocking on a conflicting barrier range. And that still can't be modelled by existing rwsem semantics.... > That's why in my range lock implementation > ages back I've decided to treat range lock as a rwsem for deadlock > verification purposes. IMO, treating a range lock as a rwsem for deadlock purposes defeats the purpose of adding range locks in the first place. The concurrency models are completely different, and some of the limitations on rwsems are a result of implementation choices rather than limitations of a rwsem construct. In reality I couldn't care less about what lockdep can or can't verify. 
I've always said lockdep is a crutch for people who don't understand locks and the concurrency model of the code they maintain. That obviously extends to the fact that lockdep verification limitations should not limit what we allow new locking primitives to do. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
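Dave's semantics — new lock attempts conflict only with *held* locks, never with parked waiters, and a write barrier stops plain lock attempts but not `*_lock_nested()`/`*_trylock()` — can be modelled in a few lines. This is a sketch of the described behaviour, not the actual XFS range lock implementation:

```python
class RangeLockModel:
    """Toy range lock: requests conflict only with *held* locks."""

    def __init__(self):
        self.held = []      # (start, end, mode) currently granted
        self.waiters = []   # requests that would park
        self.barriers = []  # (start, end) write barrier spans

    @staticmethod
    def _overlap(s1, e1, s2, e2):
        return s1 <= e2 and s2 <= e1

    def _conflicts_held(self, s, e, mode):
        return any(self._overlap(s, e, hs, he) and "write" in (mode, hm)
                   for hs, he, hm in self.held)

    def lock(self, s, e, mode):
        """Plain attempt: blocked by held conflicts and by barriers,
        but never by parked waiters."""
        blocked = (self._conflicts_held(s, e, mode) or
                   any(self._overlap(s, e, bs, be)
                       for bs, be in self.barriers))
        if blocked:
            self.waiters.append((s, e, mode))
            return False
        self.held.append((s, e, mode))
        return True

    def lock_nested(self, s, e, mode):
        """Nested attempt: exempt from barriers so nested runs can drain."""
        if self._conflicts_held(s, e, mode):
            return False
        self.held.append((s, e, mode))
        return True

    def write_barrier(self, s, e):
        self.barriers.append((s, e))

m = RangeLockModel()
assert m.lock(0, 1000, "read")            # T1: read lock granted
assert not m.lock(500, 1500, "write")     # T2: conflicts with held read, parks
assert m.lock_nested(1001, 1500, "read")  # T1: nested read still granted --
                                          # the parked writer is invisible
b = RangeLockModel()
b.write_barrier(0, 4096)
assert not b.lock(100, 200, "read")       # barrier blocks new plain locks...
assert b.lock_nested(100, 200, "read")    # ...but not nested attempts
print("model ok")
```

Note how Jan's scenario plays out here: Thread 1's nested `read_lock(1001,1500)` succeeds precisely because the parked writer never factors into the grant decision, at the cost of giving up writer fairness.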
* Re: pagecache locking 2019-07-09 23:47 ` Dave Chinner @ 2019-07-10 8:41 ` Jan Kara 0 siblings, 0 replies; 24+ messages in thread From: Jan Kara @ 2019-07-10 8:41 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Boaz Harrosh, Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Wed 10-07-19 09:47:12, Dave Chinner wrote: > On Mon, Jul 08, 2019 at 03:31:14PM +0200, Jan Kara wrote: > > I'd be really careful with nesting range locks. You can have nasty > > situations like: > > > > Thread 1 Thread 2 > > read_lock(0,1000) > > write_lock(500,1500) -> blocks due to read lock > > read_lock(1001,1500) -> blocks due to write lock (or you have to break > > fairness and then deal with starvation issues). > > > > So once you allow nesting of range locks, you have to very carefully define > > what is and what is not allowed. > > Yes. I do understand the problem with rwsem read nesting and how > that can translate to reange locks. > > That's why my range locks don't even try to block on other pending > waiters. The case where read nesting vs write might occur like above > is something like copy_file_range() vs truncate, but otherwise for > IO locks we simply don't have arbitrarily deep nesting of range > locks. > > i.e. 
for your example my range lock would result in: > > Thread 1 Thread 2 > read_lock(0,1000) > write_lock(500,1500) > <finds conflicting read lock> > <marks read lock as having a write waiter> > <parks on range lock wait list> > <...> > read_lock_nested(1001,1500) > <no overlapping range in tree> > <gets nested range lock> > > <....> > read_unlock(1001,1500) <stays blocked because nothing is waiting > on (1001,1500) so no wakeup> > <....> > read_unlock(0,1000) > <sees write waiter flag, runs wakeup> > <gets woken> > <retries write lock> > <write lock gained> > > IOWs, I'm not trying to complicate the range lock implementation > with complex stuff like waiter fairness or anti-starvation semantics > at this point in time. Waiters simply don't impact whether a new lock > can be gained or not, and hence the limitations of rwsem semantics > don't apply. > > If such functionality is necessary (and I suspect it will be to > prevent AIO from delaying truncate and fallocate-like operations > indefinitely) then I'll add a "barrier" lock type (e.g. > range_lock_write_barrier()) that will block new range lock attempts > across it's span. > > However, because this can cause deadlocks like the above, a barrier > lock will not block new *_lock_nested() or *_trylock() calls, hence > allowing runs of nested range locking to complete and drain without > deadlocking on a conflicting barrier range. And that still can't be > modelled by existing rwsem semantics.... Clever :). Thanks for explanation. > > That's why in my range lock implementation > > ages back I've decided to treat range lock as a rwsem for deadlock > > verification purposes. > > IMO, treating a range lock as a rwsem for deadlock purposes defeats > the purpose of adding range locks in the first place. The > concurrency models are completely different, and some of the > limitations on rwsems are a result of implementation choices rather > than limitations of a rwsem construct. 
Well, even a range lock that cannot nest allows concurrent non-overlapping read and write to the same file, which a rwsem doesn't allow. But I agree that your nesting variant offers more (but also at the cost of higher complexity of the lock semantics). > In reality I couldn't care less about what lockdep can or can't > verify. I've always said lockdep is a crutch for people who don't > understand locks and the concurrency model of the code they > maintain. That obviously extends to the fact that lockdep > verification limitations should not limit what we allow new locking > primitives to do. I didn't say we have to constrain new locking primitives to what lockdep can support. It is just a locking correctness verification tool so naturally lockdep should be taught what it needs to know about any locking scheme we come up with. And sometimes it is just too much effort, which is why e.g. the page lock still doesn't have lockdep coverage. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-13 23:55 ` Dave Chinner 2019-06-14 2:30 ` Linus Torvalds 2019-06-14 3:08 ` Linus Torvalds @ 2019-06-14 17:08 ` Kent Overstreet 2 siblings, 0 replies; 24+ messages in thread From: Kent Overstreet @ 2019-06-14 17:08 UTC (permalink / raw) To: Dave Chinner Cc: Linus Torvalds, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Fri, Jun 14, 2019 at 09:55:24AM +1000, Dave Chinner wrote: > On Thu, Jun 13, 2019 at 02:36:25PM -0400, Kent Overstreet wrote: > > On Thu, Jun 13, 2019 at 09:02:24AM +1000, Dave Chinner wrote: > > > On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote: > > > > Ok, I'm totally on board with returning EDEADLOCK. > > > > > > > > Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is > > > > in the same address space as the file being read/written to, even if the buffer > > > > and the IO don't technically overlap? > > > > > > I'd say that depends on the lock granularity. For a range lock, > > > we'd be able to do the IO for non-overlapping ranges. For a normal > > > mutex or rwsem, then we risk deadlock if the page fault triggers on > > > the same address space host as we already have locked for IO. That's > > > the case we currently handle with the second IO lock in XFS, ext4, > > > btrfs, etc (XFS_MMAPLOCK_* in XFS). > > > > > > One of the reasons I'm looking at range locks for XFS is to get rid > > > of the need for this second mmap lock, as there is no reason for it > > > existing if we can lock ranges and EDEADLOCK inside page faults and > > > return errors. > > > > My concern is that range locks are going to turn out to be both more complicated > > and heavier weight, performance wise, than the approach I've taken of just a > > single lock per address space. 
> > That's the battle I'm fighting at the moment with them for direct > IO(*), but range locks are something I'm doing for XFS and I don't > really care if anyone else wants to use them or not. I'm not saying I won't use them :) I just want to do the simple thing first, and then if range locks turn out well I don't think it'll be hard to switch bcachefs to them. Or if the simple thing turns out to be good enough, even better :) > (*)Direct IO on XFS is a pure shared lock workload, so the rwsem > scales until single atomic update cache line bouncing limits > throughput. That means I can max out my hardware at 1.6 million > random 4k read/write IOPS (a bit over 6GB/s)(**) to a single file > with a rwsem at 32 AIO+DIO dispatch threads. I've only got range > locks to about 1.1M IOPS on the same workload, though it's within a > couple of percent of a rwsem up to 16 threads... > > (**) A small handful of nvme SSDs fed by AIO+DIO are /way faster/ > than pmem that is emulated with RAM, let alone real pmem which is > much slower at random writes than RAM. So something I should mention is that I've been > > > Reason being range locks only help when you've got multiple operations going on > > simultaneously that don't conflict - i.e. it's really only going to be useful > > for applications that are doing buffered IO and direct IO simultaneously to the > > same file. > > Yes, they do that, but that's not why I'm looking at this. Range > locks are primarily for applications that mix multiple different > types of operations to the same file concurrently. e.g: > > - fallocate and read/write() can be run concurrently if they > don't overlap, but right now we serialise them because we have no > visibility into what other operations require. True true. Have you ever seen this be an issue for real applications? 
> - buffered read and buffered write can run concurrently if they > don't overlap, but right now they are serialised because that's the > only way to provide POSIX atomic write vs read semantics (only XFS > provides userspace with that guarantee). We already talked about this on IRC, but it's not the _only_ way - page locks suffice if we lock all the pages for the read/write at once, and that's actually a really good thing to do for performance. bcachefs doesn't currently provide any guarantees here, but since we already are batching up the page operations for buffered writes (and I have patches to do so for the buffered read path in filemap.c) I will tweak that slightly to provide some sort of guarantee. This is something I care about in bcachefs because I'm pretty sure I can make both the buffered read and write paths work without taking any per inode locks - so unrelated IOs to the same file won't be modifying shared cachelines at all. Haven't gotten around to it yet for the buffered write path, but it's on the todo list. > - Sub-block direct IO is serialised against all other direct IO > because we can't tell if it overlaps with some other direct IO and > so we have to take the slow but safe option - range locks solve that > problem, too. This feels like an internal filesystem implementation detail (not objecting, just doesn't sound like something applicable outside xfs to me). > - there's inode_dio_wait() for DIO truncate serialisation > because AIO doesn't hold inode locks across IO - range locks can be > held all the way to AIO completion so we can get rid of > inode_dio_wait() in XFS and that allows truncate/fallocate to run > concurrently with non-overlapping direct IO. getting rid of inode_dio_wait() and unifying that with other locking would be great, for sure. > - holding non-overlapping range locks on either side of page > faults which then gets rid of the need for the special mmap locking > path to serialise it against invalidation operations. 
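Kent's "lock all the pages for the read/write at once" idea can be sketched in userspace. The sketch assumes the standard deadlock-avoidance rule of acquiring page locks in ascending index order, which is one way (not necessarily bcachefs's way) to let two overlapping batched operations run safely; all helpers here are invented:

```python
import threading

PAGE_SIZE = 4096
page_locks = {}                 # page index -> lock, created lazily
table_lock = threading.Lock()   # protects the page_locks table itself

def get_page_lock(idx):
    with table_lock:
        return page_locks.setdefault(idx, threading.Lock())

def lock_range(pos, count):
    """Lock every page covering bytes [pos, pos+count) in ascending
    index order, so two overlapping batched operations cannot
    ABBA-deadlock on each other's pages."""
    first = pos // PAGE_SIZE
    last = (pos + count - 1) // PAGE_SIZE
    locks = [get_page_lock(i) for i in range(first, last + 1)]
    for lk in locks:  # ascending order is the deadlock-avoidance rule
        lk.acquire()
    return locks

def unlock_range(locks):
    for lk in reversed(locks):
        lk.release()

# A 9000-byte write at offset 4000 spans pages 0..3; holding all four
# page locks for the whole copy gives the read-vs-write atomicity.
locks = lock_range(4000, 9000)
assert len(locks) == 4
unlock_range(locks)
print("locked pages 0..3 as one batch")
```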
> > IOWs, range locks for IO solve a bunch of long term problems we have > in XFS and largely simplify the lock algorithms within the > filesystem. And it puts us on the path to introduce range locks for > extent mapping serialisation, allowing concurrent mapping lookups > and allocation within a single file. It also has the potential to > allow us to do concurrent directory modifications.... *nod* Don't take any of what I'm saying as arguing against range locks. They don't seem _as_ useful to me for bcachefs because a lot of what you want them for I can already do concurrently - but if they turn out to work better than my pagecache add lock I'll happily switch to them later. I'm just reserving judgement until I see them and get to play with them. > IO range locks are not "locking the page cache". IO range locks are > purely for managing concurrent IO state in a fine grained manner. > The page cache already has its own locking - that just needs to > nest inside IO range locks as the IO locks are what provide the high > level exclusion from overlapping page cache operations... I suppose I worded it badly, my point was just that range locks can be viewed as a superset of my pagecache add lock so switching to them would be an easy conversion. > > > > This would simplify things a lot and eliminate a really nasty corner case - page > > > > faults trigger readahead. Even if the buffer and the direct IO don't overlap, > > > > readahead can pull in pages that do overlap with the dio. > > > > > > Page cache readahead needs to be moved under the filesystem IO > > > locks. There was a recent thread about how readahead can race with > > > hole punching and other fallocate() operations because page cache > > > readahead bypasses the filesystem IO locks used to serialise page > > > cache invalidation. > > > > > > e.g. 
> > > Readahead can be directed by userspace via fadvise, so we now
> > > have file->f_op->fadvise() so that filesystems can lock the inode
> > > before calling generic_fadvise() such that page cache instantiation
> > > and readahead dispatch can be serialised against page cache
> > > invalidation. I have a patch for XFS sitting around somewhere that
> > > implements the ->fadvise method.
> >
> > I just puked a little in my mouth.
>
> Yeah, it's pretty gross. But the page cache simply isn't designed to
> allow atomic range operations to be performed. We haven't been able to
> drag it out of the 1980s - we wrote the fs/iomap.c code so we could
> do range based extent mapping for IOs rather than the horrible,
> inefficient page-by-page block mapping the generic page cache code
> does - that gave us a 30+% increase in buffered IO throughput
> because we only do a single mapping lookup per IO rather than one
> per page...

I don't think there's anything _fundamentally_ wrong with the pagecache
design - i.e. a radix tree of pages. We do _badly_ need support for arbitrary
power of 2 sized pages in the pagecache (and it _really_ needs to be compound
pages, not just hugepages - a lot of the work that's been done that just adds
if (hugepage) checks to page cache code is deeply misguided) - the overhead of
working with 4k pages has gotten to be completely absurd.

But AFAIK the radix tree/xarray side of that is done; what needs to be fixed
is all the stuff in e.g. filemap.c that assumes it's iterating over fixed size
pages - that code needs to be reworked to be more like iterating over extents.

> That said, the page cache is still far, far slower than direct IO,
> and the gap is just getting wider and wider as nvme SSDs get faster
> and faster. PCIe 4 SSDs are just going to make this even more
> obvious - it's getting to the point where the only reason for having
> a page cache is to support mmap() and cheap systems with spinning
> rust storage.
It is true that buffered IO performance is _way_ behind direct IO performance
today in a lot of situations, but this is mostly due to the fact that we've
completely dropped the ball and it's 2019 and we're STILL DOING EVERYTHING 4k
AT A TIME.

> > I think both approaches are workable, but I do think that pushing the locking
> > down to __add_to_page_cache_locked is fundamentally the better, more correct
> > approach.
> >
> > - It better matches the semantics of what we're trying to do. All these
> > operations we're trying to protect - dio, fallocate, truncate - they all have
> > in common that they just want to shoot down a range of the page cache and
> > keep it from being readded. And in general, it's better to have locks that
> > protect specific data structures ("adding to this radix tree"), vs. large
> > critical sections ("the io path").
>
> I disagree :)
>
> The high level IO locks provide the IO concurrency policy for the
> filesystem. The page cache is an internal structure for caching
> pages - it is not a structure for efficiently and cleanly
> implementing IO concurrency policy. That's the mistake the current
> page cache architecture makes - it tries to be the central control
> for all the filesystem IO (because filesystems are dumb and the page
> cache knows best!) but, unfortunately, this does not provide the
> semantics or functionality that all filesystems want and/or need.

We might be talking past each other a bit. I think we both agree that there
should be a separation of concerns w.r.t. locking, and that hanging too much
stuff off the page cache has been a mistake in the past (e.g. buffer heads).
But that doesn't mean the page cache shouldn't have locking, just that that
locking should only protect the page cache, not other filesystem state. I am
_not_ arguing that this should replace filesystem IO path locks for other,
non-page-cache purposes.
> Just look at the truncate detection mess we have every time we
> lookup and lock a page anywhere in the mm/ code - do you see any
> code in all that which detects a hole punch race? Nope, you don't,
> because the filesystems take responsibility for serialising that
> functionality.

Yeah, I know.

> Unfortunately, we have so much legacy filesystem cruft we'll never get rid
> of those truncate hacks.

Actually - I honestly think that this page cache add lock is simple and
straightforward enough to use that converting legacy filesystems to it
wouldn't be crazypants, and getting rid of all those hacks would be
_awesome_. The legacy filesystems tend not to implement the really crazy
fallocate stuff; it's just truncate that really has to be dealt with.

> Don't get me wrong - I'm not opposed to including page cache add
> locking - I'm just saying that the problems it tries to address (and
> ones it cannot address) are already largely solved in existing
> filesystems. I suspect that if we do merge this code, whatever
> locking is added would have to be optional....

That's certainly a fair point. There is more than one way to skin a cat. I'm
not sure "largely solved" is true, though - solved in XFS, yes, but these
issues weren't taken seriously in other filesystems until pretty recently...

That said, I do think this will prove useful for non-bcachefs filesystems at
some point. Like you alluded to, truncate synchronization is hot garbage in
most filesystems, and this would give us an easy way of improving that and
possibly getting rid of those truncate hacks in filemap.c. And because (on
the add side at least) this lock only needs to be taken when adding pages to
the page cache, not when the pages needed are already present - in bcachefs
at least this is a key enabler for making buffered IO not modify any
per-inode cachelines in the fast path, and I think other filesystems could do
the same thing.
Thread overview: 24+ messages
[not found] <20190610191420.27007-1-kent.overstreet@gmail.com>
[not found] ` <CAHk-=wi0iMHcO5nsYug06fV3-8s8fz7GDQWCuanefEGq6mHH1Q@mail.gmail.com>
[not found] ` <20190611011737.GA28701@kmo-pixel>
[not found] ` <20190611043336.GB14363@dread.disaster.area>
[not found] ` <20190612162144.GA7619@kmo-pixel>
[not found] ` <20190612230224.GJ14308@dread.disaster.area>
2019-06-13 18:36 ` pagecache locking (was: bcachefs status update) merged) Kent Overstreet
2019-06-13 21:13 ` Andreas Dilger
2019-06-13 21:21 ` Kent Overstreet
2019-06-14 0:35 ` Dave Chinner
2019-06-13 23:55 ` Dave Chinner
2019-06-14 2:30 ` Linus Torvalds
2019-06-14 7:30 ` Dave Chinner
2019-06-15 1:15 ` Linus Torvalds
2019-06-14 3:08 ` Linus Torvalds
2019-06-15 4:01 ` Linus Torvalds
2019-06-17 22:47 ` Dave Chinner
2019-06-17 23:38 ` Linus Torvalds
2019-06-18 4:21 ` Amir Goldstein
2019-06-19 10:38 ` Jan Kara
2019-06-19 22:37 ` Dave Chinner
2019-07-03 0:04 ` pagecache locking Boaz Harrosh
[not found] ` <DM6PR19MB250857CB8A3A1C8279D6F2F3C5FB0@DM6PR19MB2508.namprd19.prod.outlook.com>
2019-07-03 1:25 ` Boaz Harrosh
2019-07-05 23:31 ` Dave Chinner
2019-07-07 15:05 ` Boaz Harrosh
2019-07-07 23:55 ` Dave Chinner
2019-07-08 13:31 ` Jan Kara
2019-07-09 23:47 ` Dave Chinner
2019-07-10 8:41 ` Jan Kara
2019-06-14 17:08 ` pagecache locking (was: bcachefs status update) merged) Kent Overstreet