* pagecache locking (was: bcachefs status update) merged) [not found] ` <20190612230224.GJ14308@dread.disaster.area> @ 2019-06-13 18:36 ` Kent Overstreet 2019-06-13 21:13 ` Andreas Dilger 2019-06-13 23:55 ` Dave Chinner 0 siblings, 2 replies; 24+ messages in thread From: Kent Overstreet @ 2019-06-13 18:36 UTC (permalink / raw) To: Dave Chinner Cc: Linus Torvalds, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Thu, Jun 13, 2019 at 09:02:24AM +1000, Dave Chinner wrote: > On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote: > > Ok, I'm totally on board with returning EDEADLOCK. > > > > Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is > > in the same address space as the file being read/written to, even if the buffer > > and the IO don't technically overlap? > > I'd say that depends on the lock granularity. For a range lock, > we'd be able to do the IO for non-overlapping ranges. For a normal > mutex or rwsem, then we risk deadlock if the page fault triggers on > the same address space host as we already have locked for IO. That's > the case we currently handle with the second IO lock in XFS, ext4, > btrfs, etc (XFS_MMAPLOCK_* in XFS). > > One of the reasons I'm looking at range locks for XFS is to get rid > of the need for this second mmap lock, as there is no reason for it > existing if we can lock ranges and EDEADLOCK inside page faults and > return errors. My concern is that range locks are going to turn out to be both more complicated and heavier weight, performance wise, than the approach I've taken of just a single lock per address space. Reason being range locks only help when you've got multiple operations going on simultaneously that don't conflict - i.e. 
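[Editorial note: the EDEADLOCK-on-recursion idea discussed above can be sketched as a small userspace model. This is not the kernel API - every name here (`io_lock`, `direct_io`, the per-task held set) is invented for illustration. Each task tracks which address spaces it already holds locked for IO; a page fault that would re-take one of those locks fails with -EDEADLK instead of self-deadlocking on a non-recursive rwsem.]

```python
import errno
import threading

_held = threading.local()  # per-task set of address spaces locked for IO

def _held_set():
    if not hasattr(_held, "mappings"):
        _held.mappings = set()
    return _held.mappings

def io_lock(mapping):
    """Take the per-address-space IO lock; detect recursive acquisition."""
    if mapping in _held_set():
        return -errno.EDEADLK   # would deadlock on a real rwsem: bail out
    _held_set().add(mapping)
    return 0

def io_unlock(mapping):
    _held_set().discard(mapping)

def direct_io(file_mapping, buf_mapping):
    """DIO locks the file, then gup() faults in the user buffer; if the
    buffer lives in the same address space, fail instead of deadlocking."""
    io_lock(file_mapping)
    try:
        ret = io_lock(buf_mapping)  # models the page fault from gup()
        if ret:
            return ret              # -EDEADLK back to userspace
        io_unlock(buf_mapping)
        return 0
    finally:
        io_unlock(file_mapping)
```

Note this rejects any IO whose buffer shares the address space with the file, whether or not the ranges overlap - exactly the simplification Kent is asking about.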
it's really only going to be useful for applications that are doing buffered IO
and direct IO simultaneously to the same file.

Personally, I think that would be a pretty gross thing to do and I'm not
particularly interested in optimizing for that case myself... but, if you know
of applications that do depend on that I might change my opinion. If not, I
want to try and get the simpler, one-lock-per-address-space approach to work.

That said though - range locks on the page cache can be viewed as just a
performance optimization over my approach; they've got the same semantics
(locking a subset of the page cache vs. the entire thing). So that's a bit of
a digression.

> > This would simplify things a lot and eliminate a really nasty corner
> > case - page faults trigger readahead. Even if the buffer and the direct
> > IO don't overlap, readahead can pull in pages that do overlap with the
> > dio.
>
> Page cache readahead needs to be moved under the filesystem IO locks. There
> was a recent thread about how readahead can race with hole punching and
> other fallocate() operations because page cache readahead bypasses the
> filesystem IO locks used to serialise page cache invalidation.
>
> e.g. Readahead can be directed by userspace via fadvise, so we now have
> file->f_op->fadvise() so that filesystems can lock the inode before calling
> generic_fadvise() such that page cache instantiation and readahead dispatch
> can be serialised against page cache invalidation. I have a patch for XFS
> sitting around somewhere that implements the ->fadvise method.

I just puked a little in my mouth.

> I think there are some other patches floating around to address the other
> readahead mechanisms to only be done under filesystem IO locks, but I
> haven't had time to dig into it any further. Readahead from page faults
> most definitely needs to be under the MMAPLOCK at least so it serialises
> against fallocate()...
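[Editorial note: the ->fadvise serialisation Dave describes can be sketched as a toy model - readahead dispatched via fadvise takes the same lock that page cache invalidation (hole punch) takes, so pages can't be instantiated mid-punch. The class and lock names here are illustrative, not the kernel's.]

```python
import threading

class Inode:
    """Model: readahead and invalidation share one IO lock."""
    def __init__(self):
        self.iolock = threading.Lock()  # stands in for the fs IO/MMAP lock
        self.cache = set()              # cached page indices

    def fadvise_willneed(self, start, end):
        with self.iolock:               # lock *before* readahead dispatch
            self.cache.update(range(start, end))

    def hole_punch(self, start, end):
        with self.iolock:               # invalidation holds the same lock
            self.cache.difference_update(range(start, end))
```

With readahead outside the lock, a punch could complete and then find stale pages re-instantiated behind it; here the two operations strictly order themselves.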
So I think there's two different approaches we should distinguish between. We
can either add the locking to all the top level IO paths - what you just
described - or the locking can be pushed down to protect _only_ adding pages
to the page cache, which is the approach I've been taking.

I think both approaches are workable, but I do think that pushing the locking
down to __add_to_page_cache_locked is fundamentally the better, more correct
approach.

- It better matches the semantics of what we're trying to do. All these
  operations we're trying to protect - dio, fallocate, truncate - they all
  have in common that they just want to shoot down a range of the page cache
  and keep it from being readded. And in general, it's better to have locks
  that protect specific data structures ("adding to this radix tree") vs.
  large critical sections ("the IO path").

  In bcachefs, at least for buffered IO, I don't currently need any per-inode
  IO locks; page granularity locks suffice, so I'd like to keep that - under
  the theory that buffered IO to pages already in cache is more of a fast
  path than faulting pages in.

- If we go with the approach of using the filesystem IO locks, we need to be
  really careful about auditing and adding assertions to make sure we've
  found and fixed all the code paths that can potentially add pages to the
  page cache. I didn't even know about the fadvise case, eesh.

- We still need to do something about the case where we're recursively
  faulting pages back in, which means we need _something_ in place to even
  detect that that's happening. Just trying to cover everything with the
  filesystem IO locks isn't useful here.

So to summarize - if we have locking specifically for adding pages to the
page cache, we don't need to extend the filesystem IO locks to all these
places, and we need something at that level anyways to handle recursive
faults from gup().
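[Editorial note: Kent's add-side locking can be sketched as a reader-writer construct guarding only page cache *insertion*: faults/readahead take it shared around each add, while dio/truncate/fallocate take it exclusive for the duration of a shootdown. This is a hypothetical userspace model; all names are invented.]

```python
import threading

class PagecacheAddLock:
    """Shared for adds, exclusive while a shootdown is in progress."""
    def __init__(self):
        self._cond = threading.Condition()
        self._adders = 0
        self._blocked = False

    def add_begin(self):                # shared: page cache insertion
        with self._cond:
            while self._blocked:
                self._cond.wait()
            self._adders += 1

    def add_end(self):
        with self._cond:
            self._adders -= 1
            self._cond.notify_all()

    def block_begin(self):              # exclusive: shootdown in progress
        with self._cond:
            while self._blocked or self._adders:
                self._cond.wait()
            self._blocked = True

    def block_end(self):
        with self._cond:
            self._blocked = False
            self._cond.notify_all()

def add_to_page_cache(cache, lock, idx):
    """The one choke point all insertion paths would go through."""
    lock.add_begin()
    try:
        cache.add(idx)
    finally:
        lock.add_end()
```

The point of the design is that fadvise, readahead, gup, etc. all funnel through the single add path, so there is nothing to audit at the top of each IO path.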
The tricky part is that there's multiple places that want to call add_to_page_cache() while holding this pagecache_add_lock. - dio -> gup(): but, you had the idea of just returning -EDEADLOCK here which I like way better than my recursive locking approach. - the other case is truncate/fpunch/etc - they make use of buffered IO to handle operations that aren't page/block aligned. But those look a lot more tractable than dio, since they're calling find_get_page()/readpage() directly instead of via gup(), and possibly they could be structured to not have to truncate/punch the partial page while holding the pagecache_add_lock at all (but that's going to be heavily filesystem dependent). The more I think about it, the more convinced I am that this is fundamentally the correct approach. So, I'm going to work on an improved version of this patch. One other tricky thing we need is a way to write out and then evict a page without allowing it to be redirtied - i.e. something that combines filemap_write_and_wait_range() with invalidate_inode_pages2_range(). Otherwise, a process continuously redirtying a page is going to make truncate/dio operations spin trying to shoot down the page cache - in bcachefs I'm currently taking pagecache_add_lock in write_begin and mkwrite to prevent this, but I really want to get rid of that. If we can get that combined write_and_invalidate() operation, then I think the locking will turn out fairly clean. ^ permalink raw reply [flat|nested] 24+ messages in thread
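[Editorial note: the combined write_and_invalidate() Kent asks for can be modeled in a few lines - flush and evict under one lock, so a concurrent writer can't redirty the page in the window between filemap_write_and_wait_range() and invalidate_inode_pages2_range(). Purely a userspace sketch; the names are hypothetical.]

```python
import threading

class Mapping:
    """Pages are index -> dirty flag; `flushed` records what writeback wrote."""
    def __init__(self):
        self._lock = threading.Lock()
        self.pages = {}
        self.flushed = []

    def buffered_write(self, idx):
        with self._lock:                 # write_begin/mkwrite blocks here
            self.pages[idx] = True

    def write_and_invalidate(self, idx):
        with self._lock:                 # no redirty possible inside
            if self.pages.get(idx):
                self.flushed.append(idx)  # "writeback" of the dirty data
            self.pages.pop(idx, None)     # then evict, atomically
```

Without the combined primitive, the shootdown loop (flush, drop lock, invalidate, find page dirty again, retry) can spin forever against a continuous redirtier - which is exactly the livelock described above.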
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-13 18:36 ` pagecache locking (was: bcachefs status update) merged) Kent Overstreet @ 2019-06-13 21:13 ` Andreas Dilger 2019-06-13 21:21 ` Kent Overstreet 2019-06-13 23:55 ` Dave Chinner 1 sibling, 1 reply; 24+ messages in thread From: Andreas Dilger @ 2019-06-13 21:13 UTC (permalink / raw) To: Kent Overstreet Cc: Dave Chinner, Linus Torvalds, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton [-- Attachment #1.1: Type: text/plain, Size: 2948 bytes --] On Jun 13, 2019, at 12:36 PM, Kent Overstreet <kent.overstreet@gmail.com> wrote: > > On Thu, Jun 13, 2019 at 09:02:24AM +1000, Dave Chinner wrote: >> On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote: >>> Ok, I'm totally on board with returning EDEADLOCK. >>> >>> Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is >>> in the same address space as the file being read/written to, even if the buffer >>> and the IO don't technically overlap? >> >> I'd say that depends on the lock granularity. For a range lock, >> we'd be able to do the IO for non-overlapping ranges. For a normal >> mutex or rwsem, then we risk deadlock if the page fault triggers on >> the same address space host as we already have locked for IO. That's >> the case we currently handle with the second IO lock in XFS, ext4, >> btrfs, etc (XFS_MMAPLOCK_* in XFS). >> >> One of the reasons I'm looking at range locks for XFS is to get rid >> of the need for this second mmap lock, as there is no reason for it >> existing if we can lock ranges and EDEADLOCK inside page faults and >> return errors. > > My concern is that range locks are going to turn out to be both more complicated > and heavier weight, performance wise, than the approach I've taken of just a > single lock per address space. 
>
> Reason being range locks only help when you've got multiple operations
> going on simultaneously that don't conflict - i.e. it's really only going
> to be useful for applications that are doing buffered IO and direct IO
> simultaneously to the same file. Personally, I think that would be a pretty
> gross thing to do and I'm not particularly interested in optimizing for
> that case myself... but, if you know of applications that do depend on that
> I might change my opinion. If not, I want to try and get the simpler,
> one-lock-per-address space approach to work.

There are definitely workloads that require multiple threads doing
non-overlapping writes to a single file in HPC. This is becoming an
increasingly common problem as the number of cores on a single client
increases, since there is typically one thread per core trying to write to a
shared file. Using multiple files (one per core) is possible, but that has
file management issues for users when there are a million cores running on
the same job/file (obviously not on the same client node) dumping data every
hour.

We were just looking at this exact problem last week, and most of the threads
are spinning in grab_cache_page_nowait->add_to_page_cache_lru() and
set_page_dirty() when writing at 1.9GB/s when they could be writing at
5.8GB/s (when threads are writing O_DIRECT instead of buffered). Flame graph
is attached for the 16-thread case, but high-end systems today easily have
2-4x that many cores.

Any approach for range locks can't be worse than spending 80% of time
spinning.

Cheers, Andreas

[-- Attachment #1.2: shared_file_write.svg --]
[-- Type: image/svg+xml, Size: 161674 bytes --]

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-13 21:13 ` Andreas Dilger @ 2019-06-13 21:21 ` Kent Overstreet 2019-06-14 0:35 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Kent Overstreet @ 2019-06-13 21:21 UTC (permalink / raw) To: Andreas Dilger Cc: Dave Chinner, Linus Torvalds, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Thu, Jun 13, 2019 at 03:13:40PM -0600, Andreas Dilger wrote: > There are definitely workloads that require multiple threads doing non-overlapping > writes to a single file in HPC. This is becoming an increasingly common problem > as the number of cores on a single client increase, since there is typically one > thread per core trying to write to a shared file. Using multiple files (one per > core) is possible, but that has file management issues for users when there are a > million cores running on the same job/file (obviously not on the same client node) > dumping data every hour. Mixed buffered and O_DIRECT though? That profile looks like just buffered IO to me. > We were just looking at this exact problem last week, and most of the threads are > spinning in grab_cache_page_nowait->add_to_page_cache_lru() and set_page_dirty() > when writing at 1.9GB/s when they could be writing at 5.8GB/s (when threads are > writing O_DIRECT instead of buffered). Flame graph is attached for 16-thread case, > but high-end systems today easily have 2-4x that many cores. Yeah I've been spending some time on buffered IO performance too - 4k page overhead is a killer. bcachefs has a buffered write path that looks up multiple pages at a time and locks them, and then copies the data to all the pages at once (I stole the idea from btrfs). It was a very significant performance increase. 
https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n1498
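[Editorial note: the batched write path Kent describes (stolen from btrfs) can be sketched as follows - look up and "lock" every page the write touches in one pass, then do a single copy pass, instead of paying lookup/lock/unlock overhead per 4k page. The page cache is modeled as a dict and the locking elided; the function name is invented.]

```python
PAGE_SIZE = 4096

def buffered_write_batched(pagecache, pos, data):
    """Copy `data` into the page cache starting at byte offset `pos`."""
    first = pos // PAGE_SIZE
    last = (pos + len(data) - 1) // PAGE_SIZE
    # One lookup pass: instantiate and "lock" every page up front.
    pages = [pagecache.setdefault(i, bytearray(PAGE_SIZE))
             for i in range(first, last + 1)]
    off = pos % PAGE_SIZE
    written = 0
    for page in pages:                   # one copy pass over all pages
        n = min(PAGE_SIZE - off, len(data) - written)
        page[off:off + n] = data[written:written + n]
        written += n
        off = 0
    return written
```

The win is amortization: per-IO setup cost replaces per-page setup cost, which is where the 4k-page overhead Kent mentions gets paid.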
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-13 21:21 ` Kent Overstreet @ 2019-06-14 0:35 ` Dave Chinner 0 siblings, 0 replies; 24+ messages in thread From: Dave Chinner @ 2019-06-14 0:35 UTC (permalink / raw) To: Kent Overstreet Cc: Andreas Dilger, Linus Torvalds, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Thu, Jun 13, 2019 at 05:21:12PM -0400, Kent Overstreet wrote: > On Thu, Jun 13, 2019 at 03:13:40PM -0600, Andreas Dilger wrote: > > There are definitely workloads that require multiple threads doing non-overlapping > > writes to a single file in HPC. This is becoming an increasingly common problem > > as the number of cores on a single client increase, since there is typically one > > thread per core trying to write to a shared file. Using multiple files (one per > > core) is possible, but that has file management issues for users when there are a > > million cores running on the same job/file (obviously not on the same client node) > > dumping data every hour. > > Mixed buffered and O_DIRECT though? That profile looks like just buffered IO to > me. > > > We were just looking at this exact problem last week, and most of the threads are > > spinning in grab_cache_page_nowait->add_to_page_cache_lru() and set_page_dirty() > > when writing at 1.9GB/s when they could be writing at 5.8GB/s (when threads are > > writing O_DIRECT instead of buffered). Flame graph is attached for 16-thread case, > > but high-end systems today easily have 2-4x that many cores. > > Yeah I've been spending some time on buffered IO performance too - 4k page > overhead is a killer. > > bcachefs has a buffered write path that looks up multiple pages at a time and > locks them, and then copies the data to all the pages at once (I stole the idea > from btrfs). It was a very significant performance increase. 
Careful with that - locking multiple pages is also a deadlock vector that
triggers unexpectedly when something conspires to lock pages in
non-ascending order. e.g.

64081362e8ff mm/page-writeback.c: fix range_cyclic writeback vs writepages
deadlock

The fs/iomap.c code avoids this problem by mapping the IO first, then
iterating pages one at a time until the mapping is consumed, then it gets
another mapping. It also avoids needing to put a page array on the stack....

Cheers, Dave.

--
Dave Chinner
david@fromorbit.com
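[Editorial note: Dave's warning in sketch form - if every path that locks multiple pages acquires them in ascending index order, two threads locking overlapping sets cannot deadlock, because there is a single global ordering. The helper names are invented for this userspace model.]

```python
import threading

def lock_pages(locks, indices):
    """Acquire page locks in ascending index order; return the order taken."""
    order = sorted(set(indices))
    for i in order:
        locks[i].acquire()
    return order

def unlock_pages(locks, order):
    for i in reversed(order):
        locks[i].release()
```

If one caller locked [3, 0, 2] in that order while another locked [2, 1, 3], they could deadlock; sorting both to ascending order makes the acquisitions compatible.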
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-13 18:36 ` pagecache locking (was: bcachefs status update) merged) Kent Overstreet 2019-06-13 21:13 ` Andreas Dilger @ 2019-06-13 23:55 ` Dave Chinner 2019-06-14 2:30 ` Linus Torvalds ` (2 more replies) 1 sibling, 3 replies; 24+ messages in thread From: Dave Chinner @ 2019-06-13 23:55 UTC (permalink / raw) To: Kent Overstreet Cc: Linus Torvalds, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Thu, Jun 13, 2019 at 02:36:25PM -0400, Kent Overstreet wrote: > On Thu, Jun 13, 2019 at 09:02:24AM +1000, Dave Chinner wrote: > > On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote: > > > Ok, I'm totally on board with returning EDEADLOCK. > > > > > > Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is > > > in the same address space as the file being read/written to, even if the buffer > > > and the IO don't technically overlap? > > > > I'd say that depends on the lock granularity. For a range lock, > > we'd be able to do the IO for non-overlapping ranges. For a normal > > mutex or rwsem, then we risk deadlock if the page fault triggers on > > the same address space host as we already have locked for IO. That's > > the case we currently handle with the second IO lock in XFS, ext4, > > btrfs, etc (XFS_MMAPLOCK_* in XFS). > > > > One of the reasons I'm looking at range locks for XFS is to get rid > > of the need for this second mmap lock, as there is no reason for it > > existing if we can lock ranges and EDEADLOCK inside page faults and > > return errors. > > My concern is that range locks are going to turn out to be both more complicated > and heavier weight, performance wise, than the approach I've taken of just a > single lock per address space. 
That's the battle I'm fighting at the moment with them for direct IO(*), but range locks are something I'm doing for XFS and I don't really care if anyone else wants to use them or not. (*)Direct IO on XFS is a pure shared lock workload, so the rwsem scales until single atomic update cache line bouncing limits throughput. That means I can max out my hardware at 1.6 million random 4k read/write IOPS (a bit over 6GB/s)(**) to a single file with a rwsem at 32 AIO+DIO dispatch threads. I've only got range locks to about 1.1M IOPS on the same workload, though it's within a couple of percent of a rwsem up to 16 threads... (**) A small handful of nvme SSDs fed by AIO+DIO are /way faster/ than pmem that is emulated with RAM, let alone real pmem which is much slower at random writes than RAM. > Reason being range locks only help when you've got multiple operations going on > simultaneously that don't conflict - i.e. it's really only going to be useful > for applications that are doing buffered IO and direct IO simultaneously to the > same file. Yes, they do that, but that's not why I'm looking at this. Range locks are primarily for applications that mix multiple different types of operations to the same file concurrently. e.g: - fallocate and read/write() can be run concurrently if they don't overlap, but right now we serialise them because we have no visibility into what other operations require. - buffered read and buffered write can run concurrently if they don't overlap, but right now they are serialised because that's the only way to provide POSIX atomic write vs read semantics (only XFS provides userspace with that guarantee). - Sub-block direct IO is serialised against all other direct IO because we can't tell if it overlaps with some other direct IO and so we have to take the slow but safe option - range locks solve that problem, too. 
- there's inode_dio_wait() for DIO truncate serialisation because AIO
  doesn't hold inode locks across IO - range locks can be held all the way
  to AIO completion, so we can get rid of inode_dio_wait() in XFS, and that
  allows truncate/fallocate to run concurrently with non-overlapping direct
  IO.

- holding non-overlapping range locks on either side of page faults, which
  then gets rid of the need for the special mmap locking path to serialise
  it against invalidation operations.

IOWs, range locks for IO solve a bunch of long term problems we have in XFS
and largely simplify the lock algorithms within the filesystem. And it puts
us on the path to introduce range locks for extent mapping serialisation,
allowing concurrent mapping lookups and allocation within a single file. It
also has the potential to allow us to do concurrent directory
modifications....

> Personally, I think that would be a pretty gross thing to do and I'm not
> particularly interested in optimizing for that case myself... but, if you
> know of applications that do depend on that I might change my opinion. If
> not, I want to try and get the simpler, one-lock-per-address space
> approach to work.
>
> That said though - range locks on the page cache can be viewed as just a
> performance optimization over my approach, they've got the same semantics
> (locking a subset of the page cache vs. the entire thing). So that's a bit
> of a digression.

IO range locks are not "locking the page cache". IO range locks are purely
for managing concurrent IO state in a fine grained manner. The page cache
already has its own locking - that just needs to nest inside IO range locks,
as the IO locks are what provide the high level exclusion from overlapping
page cache operations...

> > > This would simplify things a lot and eliminate a really nasty corner
> > > case - page faults trigger readahead. Even if the buffer and the
> > > direct IO don't overlap, readahead can pull in pages that do overlap
> > > with the dio.
> >
> > Page cache readahead needs to be moved under the filesystem IO locks.
> > There was a recent thread about how readahead can race with hole
> > punching and other fallocate() operations because page cache readahead
> > bypasses the filesystem IO locks used to serialise page cache
> > invalidation.
> >
> > e.g. Readahead can be directed by userspace via fadvise, so we now have
> > file->f_op->fadvise() so that filesystems can lock the inode before
> > calling generic_fadvise() such that page cache instantiation and
> > readahead dispatch can be serialised against page cache invalidation. I
> > have a patch for XFS sitting around somewhere that implements the
> > ->fadvise method.
>
> I just puked a little in my mouth.

Yeah, it's pretty gross. But the page cache simply isn't designed to allow
atomic range operations to be performed. We haven't been able to drag it out
of the 1980s - we wrote the fs/iomap.c code so we could do range based
extent mapping for IOs rather than the horrible, inefficient page-by-page
block mapping the generic page cache code does - that gave us a 30+%
increase in buffered IO throughput because we only do a single mapping
lookup per IO rather than one per page...

That said, the page cache is still far, far slower than direct IO, and the
gap is just getting wider and wider as nvme SSDs get faster and faster.
PCIe 4 SSDs are just going to make this even more obvious - it's getting to
the point where the only reason for having a page cache is to support mmap()
and cheap systems with spinning rust storage.

> > I think there are some other patches floating around to address the
> > other readahead mechanisms to only be done under filesystem IO locks,
> > but I haven't had time to dig into it any further. Readahead from page
> > faults most definitely needs to be under the MMAPLOCK at least so it
> > serialises against fallocate()...
>
> So I think there's two different approaches we should distinguish between.
> We can either add the locking to all the top level IO paths - what you
> just described - or the locking can be pushed down to protect _only_
> adding pages to the page cache, which is the approach I've been taking.

I don't think just serialising adding pages is sufficient for filesystems
like XFS because, e.g., DAX.

> I think both approaches are workable, but I do think that pushing the
> locking down to __add_to_page_cache_locked is fundamentally the better,
> more correct approach.
>
> - It better matches the semantics of what we're trying to do. All these
>   operations we're trying to protect - dio, fallocate, truncate - they
>   all have in common that they just want to shoot down a range of the
>   page cache and keep it from being readded. And in general, it's better
>   to have locks that protect specific data structures ("adding to this
>   radix tree"), vs. large critical sections ("the io path").

I disagree :)

The high level IO locks provide the IO concurrency policy for the
filesystem. The page cache is an internal structure for caching pages - it
is not a structure for efficiently and cleanly implementing IO concurrency
policy.

That's the mistake the current page cache architecture makes - it tries to
be the central control for all the filesystem IO (because filesystems are
dumb and the page cache knows best!) but, unfortunately, this does not
provide the semantics or functionality that all filesystems want and/or
need.

Just look at the truncate detection mess we have every time we lookup and
lock a page anywhere in the mm/ code - do you see any code in all that
which detects a hole punch race? Nope, you don't, because the filesystems
take responsibility for serialising that functionality. Unfortunately, we
have so much legacy filesystem cruft we'll never get rid of those truncate
hacks.

That's my beef with relying on the page cache - the page cache is rapidly
becoming a legacy structure that only serves to slow modern IO subsystems
down.
More and more we are going to bypass the page cache and push/pull data via DMA directly into user buffers because that's the only way we have enough CPU power in the systems to keep the storage we have fully utilised. That means, IMO, making the page cache central to solving an IO concurrency problem is going the wrong way.... > In bcachefs, at least for buffered IO I don't currently need any per-inode IO > locks, page granularity locks suffice, so I'd like to keep that - under the > theory that buffered IO to pages already in cache is more of a fast path than > faulting pages in. > > - If we go with the approach of using the filesystem IO locks, we need to be > really careful about auditing and adding assertions to make sure we've found > and fixed all the code paths that can potentially add pages to the page > cache. I didn't even know about the fadvise case, eesh. Sure, but we've largely done that already. There aren't a lot of places that add pages to the page cache. > - We still need to do something about the case where we're recursively faulting > pages back in, which means we need _something_ in place to even detect that > that's happening. Just trying to cover everything with the filesystem IO > locks isn't useful here. Haven't we just addressed that with setting a current task lock context and returning -EDEADLOCK? > So to summarize - if we have locking specifically for adding pages to the page > cache, we don't need to extend the filesystem IO locks to all these places, and > we need something at that level anyways to handle recursive faults from gup() > anyways. > > The tricky part is that there's multiple places that want to call > add_to_page_cache() while holding this pagecache_add_lock. > > - dio -> gup(): but, you had the idea of just returning -EDEADLOCK here which I > like way better than my recursive locking approach. > > - the other case is truncate/fpunch/etc - they make use of buffered IO to > handle operations that aren't page/block aligned. 
But those look a lot more > tractable than dio, since they're calling find_get_page()/readpage() directly > instead of via gup(), and possibly they could be structured to not have to > truncate/punch the partial page while holding the pagecache_add_lock at all > (but that's going to be heavily filesystem dependent). That sounds like it is going to be messy. :( > One other tricky thing we need is a way to write out and then evict a page > without allowing it to be redirtied - i.e. something that combines > filemap_write_and_wait_range() with invalidate_inode_pages2_range(). Otherwise, > a process continuously redirtying a page is going to make truncate/dio > operations spin trying to shoot down the page cache - in bcachefs I'm currently > taking pagecache_add_lock in write_begin and mkwrite to prevent this, but I > really want to get rid of that. If we can get that combined > write_and_invalidate() operation, then I think the locking will turn out fairly > clean. IMO, this isn't a page cache problem - this is a filesystem operation vs page fault serialisation issue. Why? Because the problem exists for DAX, and it doesn't add pages to the page cache for mappings... i.e. We've already solved these problems with the same high level IO locks that solve all the truncate, hole punch, etc issues for both the page cache and DAX operation. i.e. The MMAPLOCK prevents the PTE being re-dirtied across the entire filesystem operation, not just the writeback and invalidation. 
The XFS flush+inval code looks like this:

	xfs_ilock(ip, XFS_IOLOCK_EXCL);
	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
	xfs_flush_unmap_range(ip, offset, len);
	<do operation that requires invalidated page cache>
	<flush modifications if necessary>
	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
	xfs_iunlock(ip, XFS_IOLOCK_EXCL);

With range locks it looks like this:

	range_lock_init(&rl, offset, len);
	range_lock_write(&ip->i_iolock, &rl);
	xfs_flush_unmap_range(ip, offset, len);
	<do operation that requires invalidated page cache>
	<flush modified range if necessary>
	range_unlock_write(&ip->i_iolock, &rl);

---

In summary, I can see why the page cache add lock works well for bcachefs,
but on the other hand I can say that it really doesn't match well for
filesystems like XFS or for filesystems that implement DAX and so may not
be using the page cache at all...

Don't get me wrong - I'm not opposed to including page cache add locking -
I'm just saying that the problems it tries to address (and ones it cannot
address) are already largely solved in existing filesystems. I suspect that
if we do merge this code, whatever locking is added would have to be
optional....

Cheers, Dave.

--
Dave Chinner
david@fromorbit.com
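[Editorial note: the range lock Dave describes can be modeled minimally - two holders conflict only when their byte ranges overlap, so non-overlapping dio/fallocate/truncate can run concurrently where a per-inode rwsem would serialise them. This sketch is write-exclusive only (no shared/read mode) and all names are invented; it is not the XFS implementation.]

```python
import threading

class RangeLock:
    """Byte-range lock: waiters block only on overlapping held ranges."""
    def __init__(self):
        self._cond = threading.Condition()
        self._held = []                   # list of (start, end) held ranges

    def _overlaps(self, start, end):
        return any(s < end and start < e for s, e in self._held)

    def lock(self, start, end):
        with self._cond:
            while self._overlaps(start, end):
                self._cond.wait()
            self._held.append((start, end))

    def trylock(self, start, end):
        with self._cond:
            if self._overlaps(start, end):
                return False
            self._held.append((start, end))
            return True

    def unlock(self, start, end):
        with self._cond:
            self._held.remove((start, end))
            self._cond.notify_all()
```

Because a lock can be held across AIO completion, the same structure stands in for what inode_dio_wait() provides today - a pending DIO simply keeps its range held until the IO completes.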
* Re: pagecache locking (was: bcachefs status update) merged)
  2019-06-13 23:55 ` Dave Chinner
@ 2019-06-14  2:30 ` Linus Torvalds
  2019-06-14  7:30 ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread

From: Linus Torvalds @ 2019-06-14 2:30 UTC (permalink / raw)
To: Dave Chinner
Cc: Kent Overstreet, Dave Chinner, Darrick J. Wong, Christoph Hellwig,
    Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing,
    linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton

On Thu, Jun 13, 2019 at 1:56 PM Dave Chinner <david@fromorbit.com> wrote:
>
> That said, the page cache is still far, far slower than direct IO,

Bullshit, Dave.

You've made that claim before, and it's been complete bullshit before too,
and I've called you out on it then too.

Why do you continue to make this obviously garbage argument?

The key word in the "page cache" name is "cache".

Caches work, Dave. Anybody who thinks caches don't work is incompetent.

99% of all filesystem accesses are cached, and they never do any IO at all,
and the page cache handles them beautifully.

When you say that the page cache is slower than direct IO, it's because you
don't even see or care about the *fast* case. You only get involved once
there is actual IO to be done. So you're making that statement without
taking into account all the cases that you don't see, and that you don't
care about, because the page cache has already handled them for you, and
done so much better than DIO can do or ever _will_ do.

Is direct IO faster when you *know* it's not cached, and shouldn't be
cached? Sure. But that's actually quite rare. How often do you use
non-temporal stores when you do non-IO programming? Approximately never,
perhaps? Because caches work.

And no, SSDs haven't made caches irrelevant. Not doing IO at all is still
orders of magnitude faster than doing IO.
And it's not clear nvdimms will either. So stop with the stupid and dishonest argument already, where you ignore the effects of caching. Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-14 2:30 ` Linus Torvalds @ 2019-06-14 7:30 ` Dave Chinner 2019-06-15 1:15 ` Linus Torvalds 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2019-06-14 7:30 UTC (permalink / raw) To: Linus Torvalds Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton

On Thu, Jun 13, 2019 at 04:30:36PM -1000, Linus Torvalds wrote:
> On Thu, Jun 13, 2019 at 1:56 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > That said, the page cache is still far, far slower than direct IO,
>
> Bullshit, Dave.
>
> You've made that claim before, and it's been complete bullshit before
> too, and I've called you out on it then too.

Yes, your last run of insulting rants on this topic resulted in me pointing out your CoC violations because you were unable to listen or discuss the subject matter in a civil manner. And you've started right where you left off last time....

> Why do you continue to make this obviously garbage argument?
>
> The key word in the "page cache" name is "cache".
>
> Caches work, Dave.

Yes, they do, I see plenty of cases where the page cache works just fine because it is still faster than most storage. But that's _not what I said_. Indeed, you haven't even bothered to ask me to clarify what I was referring to in the statement you quoted.

IOWs, you've taken _one single statement_ I made from a huge email about complexities in dealing with IO concurrency, the page cache and architectural flaws in the existing code, quoted it out of context, fabricated a completely new context and started ranting about how I know nothing about how caches or the page cache work. Not very professional but, unfortunately, an entirely predictable and _expected_ response.
Linus, nobody can talk about direct IO without you screaming and tossing all your toys out of the crib. If you can't be civil, or you find yourself writing some condescending "caching 101" explanation to someone who has spent the last 15+ years working with filesystems and caches, then you're far better off not saying anything.

---

So, in the interests of further _civil_ discussion, let me clarify my statement for you: for a highly concurrent application that is crunching through bulk data on large files on high throughput storage, the page cache is still far, far slower than direct IO.

Which comes back to this statement you made:

> Is direct IO faster when you *know* it's not cached, and shouldn't
> be cached? Sure. But that's actually quite rare.

This is where I think you get the wrong end of the stick, Linus. The world I work in has a significant proportion of applications where the data set is too large to be cached effectively, or is better cached by the application than the kernel. IOWs, data being cached efficiently by the page cache is the exception rather than the rule. Hence, they use direct IO because it is faster than the page cache. This is common in applications like major enterprise databases, HPC apps, data mining/analysis applications, etc. and there's an awful lot of the world that runs on these apps....

-Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-14 7:30 ` Dave Chinner @ 2019-06-15 1:15 ` Linus Torvalds 0 siblings, 0 replies; 24+ messages in thread From: Linus Torvalds @ 2019-06-15 1:15 UTC (permalink / raw) To: Dave Chinner Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton

On Thu, Jun 13, 2019 at 9:31 PM Dave Chinner <david@fromorbit.com> wrote:
>
> Yes, they do, I see plenty of cases where the page cache works just
> fine because it is still faster than most storage. But that's _not
> what I said_.

I only quoted one small part of your email, because I wanted to point out how you again dismissed caches. And yes, that literally _is_ what you said. In other parts of that same email you said "..it's getting to the point where the only reason for having a page cache is to support mmap() and cheap systems with spinning rust storage" and "That's my beef with relying on the page cache - the page cache is rapidly becoming a legacy structure that only serves to slow modern IO subsystems down" and your whole email was basically a rant against the page cache.

So I only quoted the bare minimum, and pointed out that caching is still damn important. Because most loads cache well.

Now you are back-tracking a bit from your statements, but don't go saying I was misreading you. How else would the above be read? You really were saying that caching was "legacy". I called you out on it. Now you're trying to back-track.

Yes, you have loads that don't cache well. But that does not mean that caching has somehow become irrelevant in the big picture or a "legacy" thing at all.

The thing is, I don't even hate DIO. But we always end up clashing because you seem to have this mindset where nothing else matters (which really came through in that email I replied to).
Do you really wonder why I point out that caching is important? Because you seem to actively claim caching doesn't matter. Are you happier now that I quoted more of your emails back to you? > IOWs, you've taken _one > single statement_ I made from a huge email about complexities in > dealing with IO concurency, the page cache and architectural flaws n > the existing code, quoted it out of context, fabricated a completely > new context and started ranting about how I know nothing about how > caches or the page cache work. See above. I cut things down a lot, but it wasn't a single statement at all. I just boiled it down to the basics. > Linus, nobody can talk about direct IO without you screaming and > tossing all your toys out of the crib. Dave, look in the mirror some day. You might be surprised. > So, in the interests of further _civil_ discussion, let me clarify > my statement for you: for a highly concurrent application that is > crunching through bulk data on large files on high throughput > storage, the page cache is still far, far slower than direct IO. .. and Christ, Dave, we even _agree_ on this. But when DIO becomes an issue is when you try to claim it makes the page cache irrelevant, or a problem. I also take issue with you then making statements that seem to be explicitly designed to be misleading. For DIO, you talk about how XFS has no serialization and gets great performance. Then in the very next email, you talk about how you think buffered IO has to be excessively serialized, and how XFS is the only one who does it properly, and how that is a problem for performance. But as far as I can tell, the serialization rule you quote is simply not true. But for you it is, and only for buffered IO. It's really as if you were actively trying to make the non-DIO case look bad by picking and choosing your rules. And the thing is, I suspect that the overlap between DIO and cached IO shouldn't even need to be there. 
We've generally tried to just not have them interact at all, by just having DIO invalidate the caches (which is really really cheap if they don't exist - which should be the common case by far!). People almost never mix the two at all, and we might be better off aiming to separate them out even more than we do now.

That's actually the part I like best about the page cache add lock - I may not be a great fan of yet another ad-hoc lock - but I do like how it adds minimal overhead to the cached case (because by definition, the good cached case is when you don't need to add new pages), while hopefully working well together with the whole "invalidate existing caches" case for DIO.

I know you don't like the cache flush and invalidation stuff for some reason, but I don't even understand why you care. Again, if you're actually just doing all DIO, the caches will be empty and not be in your way. So normally all that should be really really cheap. Flushing and invalidating caches that don't exist isn't really complicated, is it?

And if cached state *does* exist, and if it can't be invalidated (for example, existing busy mmap or whatever), maybe the solution there is "always fall back to buffered/cached IO". For the cases you care about, that should never happen, after all.

IOW, if anything, I think we should strive for a situation where the whole DIO vs cached becomes even _more_ independent. If there are busy caches, just fall back to cached IO. It will have lower IO throughput, but that's one of the _points_ of caches - they should decrease the need for IO, and less IO is what it's all about.

So I don't understand why you hate the page cache so much. For the cases you care about, the page cache should be a total non-issue. And if the page cache does exist, then it almost by definition means that it's not a case you care about.
And yes, yes, maybe some day people won't have SSD's at all, and it's all nvdimm's and all filesystem data accesses are DAX, and caching is all done by hardware and the page cache will never exist at all. At that point a page cache will be legacy. But honestly, that day is not today. It's decades away, and might never happen at all. So in the meantime, don't pooh-pooh the page cache. It works very well indeed, and I say that as somebody who has refused to touch spinning media (or indeed bad SSD's) for a decade. Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
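Linus's point above, that DIO invalidating the page cache on its way through is nearly free when nothing is cached, can be sketched with a toy model. All names and structures here are illustrative only; the real kernel path walks the address_space's cached pages and has to cope with dirty pages, mmap'd pages and errors:

```python
# Toy model of "DIO just invalidates the page cache on its way through":
# invalidation over an empty cache is practically free (the pure-DIO
# common case), and when pages *are* cached it keeps later buffered
# reads coherent with what the DIO wrote.

PAGE = 4096
page_cache = {}   # page index -> bytes ("the page cache")
disk = {}         # offset -> bytes ("the storage")

def pages_spanned(off, length):
    return range(off // PAGE, (off + length + PAGE - 1) // PAGE)

def dio_write(off, data):
    # Invalidate any cached copies of the written range first.
    # Iterating over an empty cache costs essentially nothing.
    for pg in pages_spanned(off, len(data)):
        page_cache.pop(pg, None)
    disk[off] = data

def buffered_read(off, length):
    pg = off // PAGE
    if pg not in page_cache:                  # miss: fill from "disk"
        page_cache[pg] = disk.get(off, b"")
    return page_cache[pg][:length]            # hit: no IO at all

dio_write(0, b"v1")
assert buffered_read(0, 2) == b"v1"   # fills the cache
dio_write(0, b"v2")                   # invalidates the cached page
assert buffered_read(0, 2) == b"v2"   # re-filled, coherent
```

In a pure-DIO workload the `page_cache.pop()` loop never finds anything to drop, which is the "really really cheap" case the email describes.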
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-13 23:55 ` Dave Chinner 2019-06-14 2:30 ` Linus Torvalds @ 2019-06-14 3:08 ` Linus Torvalds 2019-06-15 4:01 ` Linus Torvalds 2019-06-14 17:08 ` pagecache locking (was: bcachefs status update) merged) Kent Overstreet 2 siblings, 1 reply; 24+ messages in thread From: Linus Torvalds @ 2019-06-14 3:08 UTC (permalink / raw) To: Dave Chinner Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Thu, Jun 13, 2019 at 1:56 PM Dave Chinner <david@fromorbit.com> wrote: > > - buffered read and buffered write can run concurrently if they > don't overlap, but right now they are serialised because that's the > only way to provide POSIX atomic write vs read semantics (only XFS > provides userspace with that guarantee). I do not believe that posix itself actually requires that at all, although extended standards may. That said, from a quality of implementation standpoint, it's obviously a good thing to do, so it might be worth looking at if something reasonable can be done. The XFS atomicity guarantees are better than what other filesystems give, but they might also not be exactly required. But POSIX actually ends up being pretty lax, and says "Writes can be serialized with respect to other reads and writes. If a read() of file data can be proven (by any means) to occur after a write() of the data, it must reflect that write(), even if the calls are made by different processes. A similar requirement applies to multiple write operations to the same file position. This is needed to guarantee the propagation of data from write() calls to subsequent read() calls. This requirement is particularly significant for networked file systems, where some caching schemes violate these semantics." Note the "can" in "can be serialized", not "must". 
Also note that whole language about how the read file data must match the written data only if the read can be proven to have occurred after a write of that data. Concurrency is very much left in the air, only provably serial operations matter. (There is also language that talks about "after the write has successfully returned" etc - again, it's about reads that occur _after_ the write, not concurrently with the write). The only atomicity guarantees are about the usual pipe writes and PIPE_BUF. Those are very explicit. Of course, there are lots of standards outside of just the POSIX read/write thing, so you may be thinking of some other stricter standard. POSIX itself has always been pretty permissive. And as mentioned, I do agree from a QoI standpoint that atomicity is nice, and that the XFS behavior is better. However, it does seem that nobody really cares, because I'm not sure we've ever done it in general (although we do have that i_rwsem, but I think it's mainly used to give the proper lseek behavior). And so the XFS behavior may not necessarily be *worth* it, although I presume you have some test for this as part of xfstests. Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
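To make the quoted POSIX language concrete: the one ordering POSIX does guarantee is that a read which provably occurs after a completed write must reflect that write, even through a different file description. A small demonstration of that guaranteed case on a POSIX system (os.pread/os.pwrite are Unix-only); what it deliberately does not exercise is a read concurrent with a write, which is exactly the case POSIX leaves open and XFS chooses to serialise:

```python
import os
import tempfile

# A read() that provably occurs after a successful write() must reflect
# that write, even through an independent file description. That is the
# guaranteed ordering in the POSIX text quoted above; concurrent
# overlapping read/write is the part left unspecified.
fd, path = tempfile.mkstemp()
os.close(fd)

wfd = os.open(path, os.O_WRONLY)
rfd = os.open(path, os.O_RDONLY)   # an independent file description

os.pwrite(wfd, b"hello, page cache", 0)   # the write completes first...
data = os.pread(rfd, 17, 0)               # ...so this read must see it
assert data == b"hello, page cache"

os.close(wfd)
os.close(rfd)
os.unlink(path)
```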
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-14 3:08 ` Linus Torvalds @ 2019-06-15 4:01 ` Linus Torvalds 2019-06-17 22:47 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Linus Torvalds @ 2019-06-15 4:01 UTC (permalink / raw) To: Dave Chinner Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Thu, Jun 13, 2019 at 5:08 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I do not believe that posix itself actually requires that at all, > although extended standards may. So I tried to see if I could find what this perhaps alludes to. And I suspect it's not in the read/write thing, but the pthreads side talks about atomicity. Interesting, but I doubt if that's actually really intentional, since the non-thread read/write behavior specifically seems to avoid the whole concurrency issue. The pthreads atomicity thing seems to be about not splitting up IO and doing it in chunks when you have m:n threading models, but can be (mis-)construed to have threads given higher atomicity guarantees than processes. Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-15 4:01 ` Linus Torvalds @ 2019-06-17 22:47 ` Dave Chinner 2019-06-17 23:38 ` Linus Torvalds 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2019-06-17 22:47 UTC (permalink / raw) To: Linus Torvalds Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton

On Fri, Jun 14, 2019 at 06:01:07PM -1000, Linus Torvalds wrote:
> On Thu, Jun 13, 2019 at 5:08 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I do not believe that posix itself actually requires that at all,
> > although extended standards may.
>
> So I tried to see if I could find what this perhaps alludes to.
>
> And I suspect it's not in the read/write thing, but the pthreads side
> talks about atomicity.
>
> Interesting, but I doubt if that's actually really intentional, since
> the non-thread read/write behavior specifically seems to avoid the
> whole concurrency issue.

The wording of posix changes every time they release a new version of the standard, and it's _never_ obvious what behaviour the standard is actually meant to define. They are always written with sufficient ambiguity and wiggle room that they could mean _anything_. The POSIX 2017.1 standard you quoted is quite different to older versions, but it's no less ambiguous...

> The pthreads atomicity thing seems to be about not splitting up IO and
> doing it in chunks when you have m:n threading models, but can be
> (mis-)construed to have threads given higher atomicity guarantees than
> processes.

Right, but regardless of the spec we have to consider that the behaviour of XFS comes from its Irix heritage (actually from EFS, the predecessor of XFS from the late 1980s). i.e.
the IO exclusion model dates to long before POSIX had anything to say about pthreads, and its wording about atomicity could only refer to multi-process interactions.

These days, however, there is the unfortunate reality of a long tail of applications developed on other Unix systems under older POSIX specifications that are still being ported to and deployed on Linux. Hence the completely ambiguous behaviours defined in the older specs are still just as important these days as the completely ambiguous behaviours defined in the new specifications. :/

Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-17 22:47 ` Dave Chinner @ 2019-06-17 23:38 ` Linus Torvalds 2019-06-18 4:21 ` Amir Goldstein 0 siblings, 1 reply; 24+ messages in thread From: Linus Torvalds @ 2019-06-17 23:38 UTC (permalink / raw) To: Dave Chinner Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton

On Mon, Jun 17, 2019 at 3:48 PM Dave Chinner <david@fromorbit.com> wrote:
>
> The wording of posix changes every time they release a new version
> of the standard, and it's _never_ obvious what behaviour the
> standard is actually meant to define. They are always written with
> sufficient ambiguity and wiggle room that they could mean
> _anything_. The POSIX 2017.1 standard you quoted is quite different
> to older versions, but it's no less ambiguous...

POSIX has always been pretty lax, partly because all the Unixes did things differently, but partly because it then also ended up trying to work for the VMS and Windows posix subsystems.. So yes, the language tends to be intentionally not all that strict.

> > The pthreads atomicity thing seems to be about not splitting up IO and
> > doing it in chunks when you have m:n threading models, but can be
> > (mis-)construed to have threads given higher atomicity guarantees than
> > processes.
>
> Right, but regardless of the spec we have to consider that the
> behaviour of XFS comes from it's Irix heritage (actually from EFS,
> the predecessor of XFS from the late 1980s)

Sure. And as I mentioned, I think it's technically the nicer guarantee.

That said, it's a pretty *expensive* guarantee. It's one that you yourself are not willing to give for O_DIRECT IO.

And it's not a guarantee that Linux has ever had. In fact, it's not even something I've ever seen anybody ever depend on.
I agree that it's possible that some app out there might depend on that kind of guarantee, but I also suspect it's much much more likely that it's the other way around: XFS is being unnecessarily strict, because everybody is testing against filesystems that don't actually give the total atomicity guarantees. Nobody develops for other unixes any more (and nobody really ever did it by reading standards papers - even if they had been very explicit). And honestly, the only people who really do threaded accesses to the same file (a) don't want that guarantee in the first place (b) are likely to use direct-io that apparently doesn't give that atomicity guarantee even on xfs so I do think it's moot. End result: if we had a really cheap range lock, I think it would be a good idea to use it (for the whole QoI implementation), but for practical reasons it's likely better to just stick to the current lack of serialization because it performs better and nobody really seems to want anything else anyway. Linus ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-17 23:38 ` Linus Torvalds @ 2019-06-18 4:21 ` Amir Goldstein 2019-06-19 10:38 ` Jan Kara 0 siblings, 1 reply; 24+ messages in thread From: Amir Goldstein @ 2019-06-18 4:21 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Chinner, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton > > Right, but regardless of the spec we have to consider that the > > behaviour of XFS comes from it's Irix heritage (actually from EFS, > > the predecessor of XFS from the late 1980s) > > Sure. And as I mentioned, I think it's technically the nicer guarantee. > > That said, it's a pretty *expensive* guarantee. It's one that you > yourself are not willing to give for O_DIRECT IO. > > And it's not a guarantee that Linux has ever had. In fact, it's not > even something I've ever seen anybody ever depend on. > > I agree that it's possible that some app out there might depend on > that kind of guarantee, but I also suspect it's much much more likely > that it's the other way around: XFS is being unnecessarily strict, > because everybody is testing against filesystems that don't actually > give the total atomicity guarantees. > > Nobody develops for other unixes any more (and nobody really ever did > it by reading standards papers - even if they had been very explicit). > > And honestly, the only people who really do threaded accesses to the same file > > (a) don't want that guarantee in the first place > > (b) are likely to use direct-io that apparently doesn't give that > atomicity guarantee even on xfs > > so I do think it's moot. 
>
> End result: if we had a really cheap range lock, I think it would be a
> good idea to use it (for the whole QoI implementation), but for
> practical reasons it's likely better to just stick to the current lack
> of serialization because it performs better and nobody really seems to
> want anything else anyway.
>

This is the point in the conversation where somebody usually steps in and says "let the user/distro decide". Distro maintainers are in a much better position to take the risk of breaking hypothetical applications.

I should point out that even if "strict atomic rw" behavior is desired, then page cache warmup [1] significantly improves performance. Having mentioned that, the discussion can now return to what is the preferred way to solve the punch hole vs. page cache add race.

XFS may end up with special tailored range locks, which brings some other benefits to XFS, but all filesystems need the solution for the punch hole vs. page cache add race. Jan recently took a stab at it for ext4 [2], but that didn't work out. So I wonder what everyone thinks about Kent's page add lock as the solution to the problem. Allegedly, all filesystems (XFS included) are potentially exposed to stale data exposure/data corruption.

Thanks, Amir.

[1] https://lore.kernel.org/linux-fsdevel/20190404165737.30889-1-amir73il@gmail.com/
[2] https://lore.kernel.org/linux-fsdevel/20190603132155.20600-3-jack@suse.cz/

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-18 4:21 ` Amir Goldstein @ 2019-06-19 10:38 ` Jan Kara 2019-06-19 22:37 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Jan Kara @ 2019-06-19 10:38 UTC (permalink / raw) To: Amir Goldstein Cc: Linus Torvalds, Dave Chinner, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Tue 18-06-19 07:21:56, Amir Goldstein wrote: > > > Right, but regardless of the spec we have to consider that the > > > behaviour of XFS comes from it's Irix heritage (actually from EFS, > > > the predecessor of XFS from the late 1980s) > > > > Sure. And as I mentioned, I think it's technically the nicer guarantee. > > > > That said, it's a pretty *expensive* guarantee. It's one that you > > yourself are not willing to give for O_DIRECT IO. > > > > And it's not a guarantee that Linux has ever had. In fact, it's not > > even something I've ever seen anybody ever depend on. > > > > I agree that it's possible that some app out there might depend on > > that kind of guarantee, but I also suspect it's much much more likely > > that it's the other way around: XFS is being unnecessarily strict, > > because everybody is testing against filesystems that don't actually > > give the total atomicity guarantees. > > > > Nobody develops for other unixes any more (and nobody really ever did > > it by reading standards papers - even if they had been very explicit). > > > > And honestly, the only people who really do threaded accesses to the same file > > > > (a) don't want that guarantee in the first place > > > > (b) are likely to use direct-io that apparently doesn't give that > > atomicity guarantee even on xfs > > > > so I do think it's moot. 
> >
> > End result: if we had a really cheap range lock, I think it would be a
> > good idea to use it (for the whole QoI implementation), but for
> > practical reasons it's likely better to just stick to the current lack
> > of serialization because it performs better and nobody really seems to
> > want anything else anyway.
> >
>
> This is the point in the conversation where somebody usually steps in
> and says "let the user/distro decide". Distro maintainers are in a much
> better position to take the risk of breaking hypothetical applications.
>
> I should point out that even if "strict atomic rw" behavior is desired, then
> page cache warmup [1] significantly improves performance.
> Having mentioned that, the discussion can now return to what is the
> preferred way to solve the punch hole vs. page cache add race.
>
> XFS may end up with special tailored range locks, which beings some
> other benefits to XFS, but all filesystems need the solution for the punch
> hole vs. page cache add race.
> Jan recently took a stab at it for ext4 [2], but that didn't work out.

Yes, but I have an idea how to fix it. I just need to push acquiring ext4's i_mmap_sem down a bit further so that only page cache filling is protected by it but not copying of data out to userspace. But I didn't get to coding it last week due to other stuff.

> So I wonder what everyone thinks about Kent's page add lock as the
> solution to the problem.
> Allegedly, all filesystems (XFS included) are potentially exposed to
> stale data exposure/data corruption.

When we first realized that hole-punching vs page faults races are an issue I was looking into a generic solution but at that time there was a sentiment against adding another rwsem to struct address_space (or inode) so we ended up with a private lock in ext4 (i_mmap_rwsem), XFS (I_MMAPLOCK), and other filesystems these days.
If people think that growing struct inode for everybody is OK, we can think about lifting private filesystem solutions into a generic one. I'm fine with that. That being said as Dave said we use those fs-private locks also for serializing against equivalent issues arising for DAX. So the problem is not only about page cache but generally about doing IO and caching block mapping information for a file range. So the solution should not be too tied to page cache. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 24+ messages in thread
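The locking discipline Jan describes, page faults filling the page cache under a shared lock while hole punch invalidates under an exclusive one, can be modelled in user space. This is only a toy model of the idea: a dict stands in for the page cache, and the real i_mmap_rwsem/XFS_MMAPLOCK are kernel rwsems with very different internals:

```python
import threading

class RWSem:
    """Minimal reader/writer semaphore standing in for ext4's
    i_mmap_sem (or XFS_MMAPLOCK). Simplified to reader-preference;
    enough to show the serialisation, not the kernel's fairness."""

    def __init__(self):
        self._cv = threading.Condition()
        self._readers = 0
        self._writer = False

    def down_read(self):
        with self._cv:
            while self._writer:
                self._cv.wait()
            self._readers += 1

    def up_read(self):
        with self._cv:
            self._readers -= 1
            self._cv.notify_all()

    def down_write(self):
        with self._cv:
            while self._writer or self._readers:
                self._cv.wait()
            self._writer = True

    def up_write(self):
        with self._cv:
            self._writer = False
            self._cv.notify_all()

# Toy "inode": page cache is a dict of page -> data, backed by "disk".
mmap_sem = RWSem()
disk = {0: b"live", 1: b"live"}
page_cache = {}

def fault_in(page):
    # Page fault / readahead path: shared lock while instantiating
    # page cache, so a racing punch cannot complete mid-fill.
    mmap_sem.down_read()
    try:
        if page in disk:
            page_cache[page] = disk[page]
    finally:
        mmap_sem.up_read()

def punch_hole(page):
    # fallocate(PUNCH_HOLE) path: exclusive lock across invalidate and
    # free, so no fault can instantiate a stale copy of the range.
    mmap_sem.down_write()
    try:
        page_cache.pop(page, None)
        disk.pop(page, None)
    finally:
        mmap_sem.up_write()

fault_in(0)
punch_hole(0)
fault_in(0)                    # after the punch nothing can be cached
assert 0 not in page_cache
```

Whatever ordering the threads run in, the exclusive section guarantees the invariant that no page survives in the cache after its backing store is punched out, which is exactly the stale-data exposure the per-filesystem locks exist to prevent.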
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-19 10:38 ` Jan Kara @ 2019-06-19 22:37 ` Dave Chinner 2019-07-03 0:04 ` pagecache locking Boaz Harrosh 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2019-06-19 22:37 UTC (permalink / raw) To: Jan Kara Cc: Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Wed, Jun 19, 2019 at 12:38:38PM +0200, Jan Kara wrote: > On Tue 18-06-19 07:21:56, Amir Goldstein wrote: > > > > Right, but regardless of the spec we have to consider that the > > > > behaviour of XFS comes from it's Irix heritage (actually from EFS, > > > > the predecessor of XFS from the late 1980s) > > > > > > Sure. And as I mentioned, I think it's technically the nicer guarantee. > > > > > > That said, it's a pretty *expensive* guarantee. It's one that you > > > yourself are not willing to give for O_DIRECT IO. > > > > > > And it's not a guarantee that Linux has ever had. In fact, it's not > > > even something I've ever seen anybody ever depend on. > > > > > > I agree that it's possible that some app out there might depend on > > > that kind of guarantee, but I also suspect it's much much more likely > > > that it's the other way around: XFS is being unnecessarily strict, > > > because everybody is testing against filesystems that don't actually > > > give the total atomicity guarantees. > > > > > > Nobody develops for other unixes any more (and nobody really ever did > > > it by reading standards papers - even if they had been very explicit). > > > > > > And honestly, the only people who really do threaded accesses to the same file > > > > > > (a) don't want that guarantee in the first place > > > > > > (b) are likely to use direct-io that apparently doesn't give that > > > atomicity guarantee even on xfs > > > > > > so I do think it's moot. 
> > > > > > End result: if we had a really cheap range lock, I think it would be a > > > good idea to use it (for the whole QoI implementation), but for > > > practical reasons it's likely better to just stick to the current lack > > > of serialization because it performs better and nobody really seems to > > > want anything else anyway. > > > > > > > This is the point in the conversation where somebody usually steps in > > and says "let the user/distro decide". Distro maintainers are in a much > > better position to take the risk of breaking hypothetical applications. > > > > I should point out that even if "strict atomic rw" behavior is desired, then > > page cache warmup [1] significantly improves performance. > > Having mentioned that, the discussion can now return to what is the > > preferred way to solve the punch hole vs. page cache add race. > > > > XFS may end up with special tailored range locks, which beings some > > other benefits to XFS, but all filesystems need the solution for the punch > > hole vs. page cache add race. > > Jan recently took a stab at it for ext4 [2], but that didn't work out. > > Yes, but I have idea how to fix it. I just need to push acquiring ext4's > i_mmap_sem down a bit further so that only page cache filling is protected > by it but not copying of data out to userspace. But I didn't get to coding > it last week due to other stuff. > > > So I wonder what everyone thinks about Kent's page add lock as the > > solution to the problem. > > Allegedly, all filesystems (XFS included) are potentially exposed to > > stale data exposure/data corruption. > > When we first realized that hole-punching vs page faults races are an issue I > was looking into a generic solution but at that time there was a sentiment > against adding another rwsem to struct address_space (or inode) so we ended > up with a private lock in ext4 (i_mmap_rwsem), XFS (I_MMAPLOCK), and other > filesystems these days. 
If people think that growing struct inode for > everybody is OK, we can think about lifting private filesystem solutions > into a generic one. I'm fine with that. I'd prefer it doesn't get lifted to the VFS because I'm planning on getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is likely to go away in the near term because a range lock can be taken on either side of the mmap_sem in the page fault path. > That being said as Dave said we use those fs-private locks also for > serializing against equivalent issues arising for DAX. So the problem is > not only about page cache but generally about doing IO and caching > block mapping information for a file range. So the solution should not be > too tied to page cache. Yup, that was the point I was trying to make when Linus started shouting at me about how caches work and how essential they are. I guess the fact that DAX doesn't use the page cache isn't as widely known as I assumed it was... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking 2019-06-19 22:37 ` Dave Chinner @ 2019-07-03 0:04 ` Boaz Harrosh [not found] ` <DM6PR19MB250857CB8A3A1C8279D6F2F3C5FB0@DM6PR19MB2508.namprd19.prod.outlook.com> 2019-07-05 23:31 ` Dave Chinner 0 siblings, 2 replies; 24+ messages in thread From: Boaz Harrosh @ 2019-07-03 0:04 UTC (permalink / raw) To: Dave Chinner, Jan Kara Cc: Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On 20/06/2019 01:37, Dave Chinner wrote: <> > > I'd prefer it doesn't get lifted to the VFS because I'm planning on > getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is > likely to go away in the near term because a range lock can be > taken on either side of the mmap_sem in the page fault path. > <> Sir Dave, Sorry if this was answered before, but I am very curious. In the zufs project I have an equivalent rw_MMAPLOCK that I _read_lock on page_faults. (Reads & writes all take read-locks ...) The only reason I have it is because of lockdep actually. Specifically for those xfstests that mmap a buffer then direct_IO in/out of that buffer from/to another file in the same FS or the same file. (For lockdep it's the same case). I would be perfectly happy to recursively _read_lock both from the top of the page_fault at the DIO path, and under in the page_fault. I'm _read_locking after all. But lockdep is hard to convince. So I stole the xfs idea of having an rw_MMAPLOCK. And grab yet another _write_lock at truncate/punch/clone time when all mapping traversal needs to stop for the destructive change to take place. (Allocations are done another way and are race safe with traversal) How do you intend to address this problem with range-locks? i.e. recursively taking the same "lock"? Because if not for the recursiveness and lockdep I would not need the extra lock object per inode. 
Thanks Boaz ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <DM6PR19MB250857CB8A3A1C8279D6F2F3C5FB0@DM6PR19MB2508.namprd19.prod.outlook.com>]
* Re: pagecache locking [not found] ` <DM6PR19MB250857CB8A3A1C8279D6F2F3C5FB0@DM6PR19MB2508.namprd19.prod.outlook.com> @ 2019-07-03 1:25 ` Boaz Harrosh 0 siblings, 0 replies; 24+ messages in thread From: Boaz Harrosh @ 2019-07-03 1:25 UTC (permalink / raw) To: Patrick Farrell, Dave Chinner, Jan Kara Cc: Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On 03/07/2019 04:07, Patrick Farrell wrote: > Recursive read locking is generally unsafe; that's why lockdep > complains about it. The common RW lock primitives are queued in > their implementation, meaning this recursive read lock sequence: > P1 - read (gets lock) > P2 - write > P1 - read > > Results not in a successful read lock, but P1 blocking behind P2, > which is blocked behind P1. > Readers are not allowed to jump past waiting writers. OK thanks, that makes sense. I did not know about that last part. It's a kind of lock fairness I did not know we have. So I guess I'll keep my two locks then. The write_locker is the SLOW path for me anyway, right? [if we are already at the subject: Do mutexes have the same lock fairness as above? Does the write_lock side of rw_sem have the same fairness? Something I never figured out] Thanks Boaz > > - Patrick ^ permalink raw reply [flat|nested] 24+ messages in thread
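Patrick's fairness rule can be made concrete with a toy model of a queued rwlock's admission policy. This is purely illustrative — kernel rwsems are not implemented like this — but it shows why the recursive read attempt parks behind the queued writer instead of being granted:

```python
# Toy model of a *queued* reader-writer lock's grant policy:
# FIFO order, so a reader may not jump past a waiting writer.
from collections import deque

class QueuedRWLock:
    def __init__(self):
        self.readers = set()    # threads holding the lock for read
        self.writer = None      # thread holding the lock for write
        self.waiters = deque()  # FIFO queue of (thread, mode)

    def request(self, thread, mode):
        """Try to take the lock; True if granted now, False if queued."""
        if mode == "read":
            # A reader is admitted only if no writer holds the lock AND
            # no writer is already queued ahead of it (fairness rule).
            writer_waiting = any(m == "write" for _, m in self.waiters)
            if self.writer is None and not writer_waiting:
                self.readers.add(thread)
                return True
        else:  # write
            if self.writer is None and not self.readers:
                self.writer = thread
                return True
        self.waiters.append((thread, mode))
        return False

lock = QueuedRWLock()
granted1 = lock.request("P1", "read")   # P1 gets the read lock
granted2 = lock.request("P2", "write")  # P2 queues behind P1's readers
granted3 = lock.request("P1", "read")   # recursive read queues behind P2!

# P1 now waits on P2, P2 waits on P1 -> deadlock, exactly what lockdep warns about.
print(granted1, granted2, granted3)
```

With a reader-preferring (unfair) lock the third request would be granted, which is why Boaz's recursive read "feels" safe; the queued implementation is what turns it into a deadlock.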
* Re: pagecache locking 2019-07-03 0:04 ` pagecache locking Boaz Harrosh [not found] ` <DM6PR19MB250857CB8A3A1C8279D6F2F3C5FB0@DM6PR19MB2508.namprd19.prod.outlook.com> @ 2019-07-05 23:31 ` Dave Chinner 2019-07-07 15:05 ` Boaz Harrosh 2019-07-08 13:31 ` Jan Kara 1 sibling, 2 replies; 24+ messages in thread From: Dave Chinner @ 2019-07-05 23:31 UTC (permalink / raw) To: Boaz Harrosh Cc: Jan Kara, Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Wed, Jul 03, 2019 at 03:04:45AM +0300, Boaz Harrosh wrote: > On 20/06/2019 01:37, Dave Chinner wrote: > <> > > > > I'd prefer it doesn't get lifted to the VFS because I'm planning on > > getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is > > likely to go away in the near term because a range lock can be > > taken on either side of the mmap_sem in the page fault path. > > > <> > Sir Dave > > Sorry if this was answered before. I am please very curious. In the zufs > project I have an equivalent rw_MMAPLOCK that I _read_lock on page_faults. > (Read & writes all take read-locks ...) > The only reason I have it is because of lockdep actually. > > Specifically for those xfstests that mmap a buffer then direct_IO in/out > of that buffer from/to another file in the same FS or the same file. > (For lockdep its the same case). Which can deadlock if the same inode rwsem is taken on both sides of the mmap_sem, as lockdep tells you... > I would be perfectly happy to recursively _read_lock both from the top > of the page_fault at the DIO path, and under in the page_fault. I'm > _read_locking after all. But lockdep is hard to convince. So I stole the > xfs idea of having an rw_MMAPLOCK. And grab yet another _write_lock at > truncate/punch/clone time when all mapping traversal needs to stop for > the destructive change to take place. 
(Allocations are done another way > and are race safe with traversal) > > How do you intend to address this problem with range-locks? ie recursively > taking the same "lock"? because if not for the recursive-ity and lockdep I would > not need the extra lock-object per inode. As long as the IO ranges to the same file *don't overlap*, it should be perfectly safe to take separate range locks (in read or write mode) on either side of the mmap_sem as non-overlapping range locks can be nested and will not self-deadlock. The "recursive lock problem" still arises with DIO and page faults inside gup, but it only occurs when the user buffer range overlaps the DIO range to the same file. IOWs, the application is trying to do something that has an undefined result and is likely to result in data corruption. So, in that case I plan to have the gup page faults fail and the DIO return -EDEADLOCK to userspace.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
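The policy Dave describes — nested range locks are fine when the ranges are disjoint, and a gup fault overlapping an in-flight DIO fails so the DIO can return -EDEADLOCK — reduces to an interval-overlap test. A minimal sketch (the function names are invented for illustration; this is not the XFS code):

```python
import errno

def ranges_overlap(a_start, a_end, b_start, b_end):
    # Inclusive byte ranges overlap iff each starts before the other ends.
    return a_start <= b_end and b_start <= a_end

def nested_range_lock(held, want):
    """held/want are (start, end) byte ranges on the same file.
    Return 0 if the nested lock may be taken, -EDEADLK on conflict."""
    if ranges_overlap(*held, *want):
        return -errno.EDEADLK  # user buffer overlaps the DIO range
    return 0                   # disjoint ranges: safe to nest

# DIO to [0, 4095] while faulting a user buffer mapped over [8192, 12287]
assert nested_range_lock((0, 4095), (8192, 12287)) == 0
# DIO whose user buffer is mmap()ed over the overlapping file range:
# undefined result anyway, so fail the fault and return -EDEADLK.
assert nested_range_lock((0, 4095), (2048, 6143)) == -errno.EDEADLK
print("ok")
```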
* Re: pagecache locking 2019-07-05 23:31 ` Dave Chinner @ 2019-07-07 15:05 ` Boaz Harrosh 2019-07-07 23:55 ` Dave Chinner 1 sibling, 1 reply; 24+ messages in thread From: Boaz Harrosh @ 2019-07-07 15:05 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On 06/07/2019 02:31, Dave Chinner wrote: > > As long as the IO ranges to the same file *don't overlap*, it should > be perfectly safe to take separate range locks (in read or write > mode) on either side of the mmap_sem as non-overlapping range locks > can be nested and will not self-deadlock. > > The "recursive lock problem" still arises with DIO and page faults > inside gup, but it only occurs when the user buffer range overlaps > the DIO range to the same file. IOWs, the application is trying to > do something that has an undefined result and is likely to result in > data corruption. So, in that case I plan to have the gup page faults > fail and the DIO return -EDEADLOCK to userspace.... > This sounds very cool. I now understand. I hope you put all the tools for this in generic places so it will be easier to salvage. One thing I will be very curious to see is how you teach lockdep about the "range locks can be nested" thing. I know it's possible, other places do it, but it's something I never understood. > Cheers, > Dave. [ Ha, one more question if you have time: In one of the mails, and you also mentioned it before, you said that the rw_read_lock is not able to scale well on mammoth machines over 10s of cores (maybe you said over 20). I wonder why that happens. Is it because of the atomic operations, or something in the lock algorithm? In my theoretical understanding, as long as there are no write-lock-grabbers, why would the readers interfere with each other? 
] Thanks Boaz ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking 2019-07-07 15:05 ` Boaz Harrosh @ 2019-07-07 23:55 ` Dave Chinner 0 siblings, 0 replies; 24+ messages in thread From: Dave Chinner @ 2019-07-07 23:55 UTC (permalink / raw) To: Boaz Harrosh Cc: Jan Kara, Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Sun, Jul 07, 2019 at 06:05:16PM +0300, Boaz Harrosh wrote: > On 06/07/2019 02:31, Dave Chinner wrote: > > > > > As long as the IO ranges to the same file *don't overlap*, it should > > be perfectly safe to take separate range locks (in read or write > > mode) on either side of the mmap_sem as non-overlapping range locks > > can be nested and will not self-deadlock. > > > > The "recursive lock problem" still arises with DIO and page faults > > inside gup, but it only occurs when the user buffer range overlaps > > the DIO range to the same file. IOWs, the application is trying to > > do something that has an undefined result and is likely to result in > > data corruption. So, in that case I plan to have the gup page faults > > fail and the DIO return -EDEADLOCK to userspace.... > > > > This sounds very cool. I now understand. I hope you put all the tools > for this in generic places so it will be easier to salvage. That's the plan, though I'm not really caring about anything outside XFS for the moment. > One thing I will be very curious to see is how you teach lockdep > about the "range locks can be nested" thing. I know its possible, > other places do it, but its something I never understood. The issue with lockdep is not nested locks, it's that there is no concept of ranges. e.g. This is fine: P0 P1 read_lock(A, 0, 1000) read_lock(B, 0, 1000) write_lock(B, 1001, 2000) write_lock(A, 1001, 2000) Because the read/write lock ranges on file A don't overlap and so can be held concurrently, similarly the ranges on file B. i.e. 
This lock pattern does not result in deadlock. However, this very similar lock pattern is not fine: P0 P1 read_lock(A, 0, 1000) read_lock(B, 0, 1000) write_lock(B, 500, 1500) write_lock(A, 900, 1900) i.e. it's an ABBA deadlock because the lock ranges partially overlap. IOWs, the problem with lockdep is not nesting read lock or nesting write locks (because that's valid, too), the problem is that it needs to be taught about ranges. Once it knows about ranges, nested read/write locking contexts don't require any special support... As it is, tracking overlapping lock ranges in lockdep will be interesting, given that I've been taking several thousand non-overlapping range locks concurrently on a single file in my testing. Tracking this sort of usage without completely killing the machine looking for conflicts and order violations likely makes lockdep validation of range locks a non-starter.... > [ Ha one more question if you have time: > > In one of the mails, and you also mentioned it before, you said about > the rw_read_lock not being able to scale well on mammoth machines > over 10ns of cores (maybe you said over 20). > I wonder why that happens. Is it because of the atomic operations, > or something in the lock algorithm. In my theoretical understanding, > as long as there are no write-lock-grabbers, why would the readers > interfere with each other? Concurrent shared read lock/unlock are still atomic counting operations. Hence they bounce exclusive cachelines from CPU to CPU... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
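The distinction Dave draws — the same A→B/B→A acquisition order is deadlock-free with disjoint ranges but an ABBA deadlock with overlapping ones — is exactly what a range-aware lockdep would have to check. A small model of the two scenarios above (illustrative only):

```python
def overlap(r1, r2):
    return r1[0] <= r2[1] and r2[0] <= r1[1]

def abba_deadlock(p0_locks, p1_locks):
    """Each argument is [(file, range), (file, range)] in acquisition
    order. The classic ABBA cycle exists only if each thread's second
    lock overlaps a lock the other thread already holds."""
    (f0a, r0a), (f0b, r0b) = p0_locks
    (f1a, r1a), (f1b, r1b) = p1_locks
    p0_blocks_on_p1 = f0b == f1a and overlap(r0b, r1a)
    p1_blocks_on_p0 = f1b == f0a and overlap(r1b, r0a)
    return p0_blocks_on_p1 and p1_blocks_on_p0

# First example: ranges on each file are disjoint -> no deadlock
safe = abba_deadlock(
    [("A", (0, 1000)), ("B", (1001, 2000))],
    [("B", (0, 1000)), ("A", (1001, 2000))])

# Second example: ranges partially overlap -> ABBA deadlock
bad = abba_deadlock(
    [("A", (0, 1000)), ("B", (500, 1500))],
    [("B", (0, 1000)), ("A", (900, 1900))])
print(safe, bad)
```

Plain lockdep collapses each file to a single lock class, so it would flag both patterns; tracking the ranges is what lets the first one pass.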
* Re: pagecache locking 2019-07-05 23:31 ` Dave Chinner 2019-07-07 15:05 ` Boaz Harrosh @ 2019-07-08 13:31 ` Jan Kara 2019-07-09 23:47 ` Dave Chinner 1 sibling, 1 reply; 24+ messages in thread From: Jan Kara @ 2019-07-08 13:31 UTC (permalink / raw) To: Dave Chinner Cc: Boaz Harrosh, Jan Kara, Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Sat 06-07-19 09:31:57, Dave Chinner wrote: > On Wed, Jul 03, 2019 at 03:04:45AM +0300, Boaz Harrosh wrote: > > On 20/06/2019 01:37, Dave Chinner wrote: > > <> > > > > > > I'd prefer it doesn't get lifted to the VFS because I'm planning on > > > getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is > > > likely to go away in the near term because a range lock can be > > > taken on either side of the mmap_sem in the page fault path. > > > > > <> > > Sir Dave > > > > Sorry if this was answered before. I am please very curious. In the zufs > > project I have an equivalent rw_MMAPLOCK that I _read_lock on page_faults. > > (Read & writes all take read-locks ...) > > The only reason I have it is because of lockdep actually. > > > > Specifically for those xfstests that mmap a buffer then direct_IO in/out > > of that buffer from/to another file in the same FS or the same file. > > (For lockdep its the same case). > > Which can deadlock if the same inode rwsem is taken on both sides of > the mmap_sem, as lockdep tells you... > > > I would be perfectly happy to recursively _read_lock both from the top > > of the page_fault at the DIO path, and under in the page_fault. I'm > > _read_locking after all. But lockdep is hard to convince. So I stole the > > xfs idea of having an rw_MMAPLOCK. And grab yet another _write_lock at > > truncate/punch/clone time when all mapping traversal needs to stop for > > the destructive change to take place. 
(Allocations are done another way > > and are race safe with traversal) > > > > How do you intend to address this problem with range-locks? ie recursively > > taking the same "lock"? because if not for the recursive-ity and lockdep I would > > not need the extra lock-object per inode. > > As long as the IO ranges to the same file *don't overlap*, it should > be perfectly safe to take separate range locks (in read or write > mode) on either side of the mmap_sem as non-overlapping range locks > can be nested and will not self-deadlock. I'd be really careful with nesting range locks. You can have nasty situations like: Thread 1 Thread 2 read_lock(0,1000) write_lock(500,1500) -> blocks due to read lock read_lock(1001,1500) -> blocks due to write lock (or you have to break fairness and then deal with starvation issues). So once you allow nesting of range locks, you have to very carefully define what is and what is not allowed. That's why in my range lock implementation ages back I've decided to treat range lock as a rwsem for deadlock verification purposes. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking 2019-07-08 13:31 ` Jan Kara @ 2019-07-09 23:47 ` Dave Chinner 2019-07-10 8:41 ` Jan Kara 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2019-07-09 23:47 UTC (permalink / raw) To: Jan Kara Cc: Boaz Harrosh, Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Mon, Jul 08, 2019 at 03:31:14PM +0200, Jan Kara wrote: > On Sat 06-07-19 09:31:57, Dave Chinner wrote: > > On Wed, Jul 03, 2019 at 03:04:45AM +0300, Boaz Harrosh wrote: > > > On 20/06/2019 01:37, Dave Chinner wrote: > > > <> > > > > > > > > I'd prefer it doesn't get lifted to the VFS because I'm planning on > > > > getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is > > > > likely to go away in the near term because a range lock can be > > > > taken on either side of the mmap_sem in the page fault path. > > > > > > > <> > > > Sir Dave > > > > > > Sorry if this was answered before. I am please very curious. In the zufs > > > project I have an equivalent rw_MMAPLOCK that I _read_lock on page_faults. > > > (Read & writes all take read-locks ...) > > > The only reason I have it is because of lockdep actually. > > > > > > Specifically for those xfstests that mmap a buffer then direct_IO in/out > > > of that buffer from/to another file in the same FS or the same file. > > > (For lockdep its the same case). > > > > Which can deadlock if the same inode rwsem is taken on both sides of > > the mmap_sem, as lockdep tells you... > > > > > I would be perfectly happy to recursively _read_lock both from the top > > > of the page_fault at the DIO path, and under in the page_fault. I'm > > > _read_locking after all. But lockdep is hard to convince. So I stole the > > > xfs idea of having an rw_MMAPLOCK. 
And grab yet another _write_lock at > > > truncate/punch/clone time when all mapping traversal needs to stop for > > > the destructive change to take place. (Allocations are done another way > > > and are race safe with traversal) > > > > > > How do you intend to address this problem with range-locks? ie recursively > > > taking the same "lock"? because if not for the recursive-ity and lockdep I would > > > not need the extra lock-object per inode. > > > > As long as the IO ranges to the same file *don't overlap*, it should > > be perfectly safe to take separate range locks (in read or write > > mode) on either side of the mmap_sem as non-overlapping range locks > > can be nested and will not self-deadlock. > > I'd be really careful with nesting range locks. You can have nasty > situations like: > > Thread 1 Thread 2 > read_lock(0,1000) > write_lock(500,1500) -> blocks due to read lock > read_lock(1001,1500) -> blocks due to write lock (or you have to break > fairness and then deal with starvation issues). > > So once you allow nesting of range locks, you have to very carefully define > what is and what is not allowed. Yes. I do understand the problem with rwsem read nesting and how that can translate to range locks. That's why my range locks don't even try to block on other pending waiters. The case where read nesting vs write might occur like above is something like copy_file_range() vs truncate, but otherwise for IO locks we simply don't have arbitrarily deep nesting of range locks. i.e. 
for your example my range lock would result in: Thread 1 Thread 2 read_lock(0,1000) write_lock(500,1500) <finds conflicting read lock> <marks read lock as having a write waiter> <parks on range lock wait list> <...> read_lock_nested(1001,1500) <no overlapping range in tree> <gets nested range lock> <....> read_unlock(1001,1500) <stays blocked because nothing is waiting on (1001,1500) so no wakeup> <....> read_unlock(0,1000) <sees write waiter flag, runs wakeup> <gets woken> <retries write lock> <write lock gained> IOWs, I'm not trying to complicate the range lock implementation with complex stuff like waiter fairness or anti-starvation semantics at this point in time. Waiters simply don't impact whether a new lock can be gained or not, and hence the limitations of rwsem semantics don't apply. If such functionality is necessary (and I suspect it will be to prevent AIO from delaying truncate and fallocate-like operations indefinitely) then I'll add a "barrier" lock type (e.g. range_lock_write_barrier()) that will block new range lock attempts across it's span. However, because this can cause deadlocks like the above, a barrier lock will not block new *_lock_nested() or *_trylock() calls, hence allowing runs of nested range locking to complete and drain without deadlocking on a conflicting barrier range. And that still can't be modelled by existing rwsem semantics.... > That's why in my range lock implementation > ages back I've decided to treat range lock as a rwsem for deadlock > verification purposes. IMO, treating a range lock as a rwsem for deadlock purposes defeats the purpose of adding range locks in the first place. The concurrency models are completely different, and some of the limitations on rwsems are a result of implementation choices rather than limitations of a rwsem construct. In reality I couldn't care less about what lockdep can or can't verify. 
I've always said lockdep is a crutch for people who don't understand locks and the concurrency model of the code they maintain. That obviously extends to the fact that lockdep verification limitations should not limit what we allow new locking primitives to do. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
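Dave's semantics — new lock attempts conflict only with *held* locks, never with parked waiters, and a write barrier stops plain lock attempts but not `*_lock_nested()`/`*_trylock()` — can be modelled in a few lines. This is a sketch of the described behaviour, not the actual XFS range lock implementation:

```python
class RangeLockModel:
    """Toy range lock: requests conflict only with *held* locks."""

    def __init__(self):
        self.held = []      # (start, end, mode) currently granted
        self.waiters = []   # requests that would park
        self.barriers = []  # (start, end) write barrier spans

    @staticmethod
    def _overlap(s1, e1, s2, e2):
        return s1 <= e2 and s2 <= e1

    def _conflicts_held(self, s, e, mode):
        return any(self._overlap(s, e, hs, he) and "write" in (mode, hm)
                   for hs, he, hm in self.held)

    def lock(self, s, e, mode):
        """Plain attempt: blocked by held conflicts and by barriers,
        but never by parked waiters."""
        blocked = (self._conflicts_held(s, e, mode) or
                   any(self._overlap(s, e, bs, be)
                       for bs, be in self.barriers))
        if blocked:
            self.waiters.append((s, e, mode))
            return False
        self.held.append((s, e, mode))
        return True

    def lock_nested(self, s, e, mode):
        """Nested attempt: exempt from barriers so nested runs can drain."""
        if self._conflicts_held(s, e, mode):
            return False
        self.held.append((s, e, mode))
        return True

    def write_barrier(self, s, e):
        self.barriers.append((s, e))

m = RangeLockModel()
assert m.lock(0, 1000, "read")            # T1: read lock granted
assert not m.lock(500, 1500, "write")     # T2: conflicts with held read, parks
assert m.lock_nested(1001, 1500, "read")  # T1: nested read still granted --
                                          # the parked writer is invisible
b = RangeLockModel()
b.write_barrier(0, 4096)
assert not b.lock(100, 200, "read")       # barrier blocks new plain locks...
assert b.lock_nested(100, 200, "read")    # ...but not nested attempts
print("model ok")
```

Note how Jan's scenario plays out here: Thread 1's nested `read_lock(1001,1500)` succeeds precisely because the parked writer never factors into the grant decision, at the cost of giving up writer fairness.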
* Re: pagecache locking 2019-07-09 23:47 ` Dave Chinner @ 2019-07-10 8:41 ` Jan Kara 0 siblings, 0 replies; 24+ messages in thread From: Jan Kara @ 2019-07-10 8:41 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Boaz Harrosh, Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Wed 10-07-19 09:47:12, Dave Chinner wrote: > On Mon, Jul 08, 2019 at 03:31:14PM +0200, Jan Kara wrote: > > I'd be really careful with nesting range locks. You can have nasty > > situations like: > > > > Thread 1 Thread 2 > > read_lock(0,1000) > > write_lock(500,1500) -> blocks due to read lock > > read_lock(1001,1500) -> blocks due to write lock (or you have to break > > fairness and then deal with starvation issues). > > > > So once you allow nesting of range locks, you have to very carefully define > > what is and what is not allowed. > > Yes. I do understand the problem with rwsem read nesting and how > that can translate to reange locks. > > That's why my range locks don't even try to block on other pending > waiters. The case where read nesting vs write might occur like above > is something like copy_file_range() vs truncate, but otherwise for > IO locks we simply don't have arbitrarily deep nesting of range > locks. > > i.e. 
for your example my range lock would result in: > > Thread 1 Thread 2 > read_lock(0,1000) > write_lock(500,1500) > <finds conflicting read lock> > <marks read lock as having a write waiter> > <parks on range lock wait list> > <...> > read_lock_nested(1001,1500) > <no overlapping range in tree> > <gets nested range lock> > > <....> > read_unlock(1001,1500) <stays blocked because nothing is waiting > on (1001,1500) so no wakeup> > <....> > read_unlock(0,1000) > <sees write waiter flag, runs wakeup> > <gets woken> > <retries write lock> > <write lock gained> > > IOWs, I'm not trying to complicate the range lock implementation > with complex stuff like waiter fairness or anti-starvation semantics > at this point in time. Waiters simply don't impact whether a new lock > can be gained or not, and hence the limitations of rwsem semantics > don't apply. > > If such functionality is necessary (and I suspect it will be to > prevent AIO from delaying truncate and fallocate-like operations > indefinitely) then I'll add a "barrier" lock type (e.g. > range_lock_write_barrier()) that will block new range lock attempts > across it's span. > > However, because this can cause deadlocks like the above, a barrier > lock will not block new *_lock_nested() or *_trylock() calls, hence > allowing runs of nested range locking to complete and drain without > deadlocking on a conflicting barrier range. And that still can't be > modelled by existing rwsem semantics.... Clever :). Thanks for explanation. > > That's why in my range lock implementation > > ages back I've decided to treat range lock as a rwsem for deadlock > > verification purposes. > > IMO, treating a range lock as a rwsem for deadlock purposes defeats > the purpose of adding range locks in the first place. The > concurrency models are completely different, and some of the > limitations on rwsems are a result of implementation choices rather > than limitations of a rwsem construct. 
Well, even a range lock that cannot nest allows concurrent non-overlapping read and write to the same file, which a rwsem doesn't allow. But I agree that your nesting variant offers more (but also at the cost of higher complexity of the lock semantics). > In reality I couldn't care less about what lockdep can or can't > verify. I've always said lockdep is a crutch for people who don't > understand locks and the concurrency model of the code they > maintain. That obviously extends to the fact that lockdep > verification limitations should not limit what we allow new locking > primitives to do. I didn't say we have to constrain new locking primitives to what lockdep can support. It is just a locking correctness verification tool so naturally lockdep should be taught what it needs to know about any locking scheme we come up with. And sometimes it is just too much effort, which is why e.g. the page lock still doesn't have lockdep coverage. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: pagecache locking (was: bcachefs status update) merged) 2019-06-13 23:55 ` Dave Chinner 2019-06-14 2:30 ` Linus Torvalds 2019-06-14 3:08 ` Linus Torvalds @ 2019-06-14 17:08 ` Kent Overstreet 2 siblings, 0 replies; 24+ messages in thread From: Kent Overstreet @ 2019-06-14 17:08 UTC (permalink / raw) To: Dave Chinner Cc: Linus Torvalds, Dave Chinner, Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara, Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton On Fri, Jun 14, 2019 at 09:55:24AM +1000, Dave Chinner wrote: > On Thu, Jun 13, 2019 at 02:36:25PM -0400, Kent Overstreet wrote: > > On Thu, Jun 13, 2019 at 09:02:24AM +1000, Dave Chinner wrote: > > > On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote: > > > > Ok, I'm totally on board with returning EDEADLOCK. > > > > > > > > Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is > > > > in the same address space as the file being read/written to, even if the buffer > > > > and the IO don't technically overlap? > > > > > > I'd say that depends on the lock granularity. For a range lock, > > > we'd be able to do the IO for non-overlapping ranges. For a normal > > > mutex or rwsem, then we risk deadlock if the page fault triggers on > > > the same address space host as we already have locked for IO. That's > > > the case we currently handle with the second IO lock in XFS, ext4, > > > btrfs, etc (XFS_MMAPLOCK_* in XFS). > > > > > > One of the reasons I'm looking at range locks for XFS is to get rid > > > of the need for this second mmap lock, as there is no reason for it > > > existing if we can lock ranges and EDEADLOCK inside page faults and > > > return errors. > > > > My concern is that range locks are going to turn out to be both more complicated > > and heavier weight, performance wise, than the approach I've taken of just a > > single lock per address space. 
> > That's the battle I'm fighting at the moment with them for direct > IO(*), but range locks are something I'm doing for XFS and I don't > really care if anyone else wants to use them or not. I'm not saying I won't use them :) I just want to do the simple thing first, and then if range locks turn out well I don't think it'll be hard to switch bcachefs to them. Or if the simple thing turns out to be good enough, even better :) > (*)Direct IO on XFS is a pure shared lock workload, so the rwsem > scales until single atomic update cache line bouncing limits > throughput. That means I can max out my hardware at 1.6 million > random 4k read/write IOPS (a bit over 6GB/s)(**) to a single file > with a rwsem at 32 AIO+DIO dispatch threads. I've only got range > locks to about 1.1M IOPS on the same workload, though it's within a > couple of percent of a rwsem up to 16 threads... > > (**) A small handful of nvme SSDs fed by AIO+DIO are /way faster/ > than pmem that is emulated with RAM, let alone real pmem which is > much slower at random writes than RAM. So something I should mention is that I've been > > > Reason being range locks only help when you've got multiple operations going on > > simultaneously that don't conflict - i.e. it's really only going to be useful > > for applications that are doing buffered IO and direct IO simultaneously to the > > same file. > > Yes, they do that, but that's not why I'm looking at this. Range > locks are primarily for applications that mix multiple different > types of operations to the same file concurrently. e.g: > > - fallocate and read/write() can be run concurrently if they > don't overlap, but right now we serialise them because we have no > visibility into what other operations require. True true. Have you ever seen this be an issue for real applications? 
> - buffered read and buffered write can run concurrently if they > don't overlap, but right now they are serialised because that's the > only way to provide POSIX atomic write vs read semantics (only XFS > provides userspace with that guarantee). We already talked about this on IRC, but it's not the _only_ way - page locks suffice if we lock all the pages for the read/write at once, and that's actually a really good thing to do for performance. bcachefs doesn't currently provide any guarantees here, but since we already are batching up the page operations for buffered writes (and I have patches to do so for the buffered read path in filemap.c) I will tweak that slightly to provide some sort of guarantee. This is something I care about in bcachefs because I'm pretty sure I can make both the buffered read and write paths work without taking any per inode locks - so unrelated IOs to the same file won't be modifying shared cachelines at all. Haven't gotten around to it yet for the buffered write path, but it's on the todo list. > - Sub-block direct IO is serialised against all other direct IO > because we can't tell if it overlaps with some other direct IO and > so we have to take the slow but safe option - range locks solve that > problem, too. This feels like an internal filesystem implementation detail (not objecting, just doesn't sound like something applicable outside xfs to me). > - there's inode_dio_wait() for DIO truncate serialisation > because AIO doesn't hold inode locks across IO - range locks can be > held all the way to AIO completion so we can get rid of > inode_dio_wait() in XFS and that allows truncate/fallocate to run > concurrently with non-overlapping direct IO. getting rid of inode_dio_wait() and unifying that with other locking would be great, for sure. > - holding non-overlapping range locks on either side of page > faults which then gets rid of the need for the special mmap locking > path to serialise it against invalidation operations. 
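Kent's "lock all the pages for the read/write at once" idea can be sketched in userspace. The sketch assumes the standard deadlock-avoidance rule of acquiring page locks in ascending index order, which is one way (not necessarily bcachefs's way) to let two overlapping batched operations run safely; all helpers here are invented:

```python
import threading

PAGE_SIZE = 4096
page_locks = {}                 # page index -> lock, created lazily
table_lock = threading.Lock()   # protects the page_locks table itself

def get_page_lock(idx):
    with table_lock:
        return page_locks.setdefault(idx, threading.Lock())

def lock_range(pos, count):
    """Lock every page covering bytes [pos, pos+count) in ascending
    index order, so two overlapping batched operations cannot
    ABBA-deadlock on each other's pages."""
    first = pos // PAGE_SIZE
    last = (pos + count - 1) // PAGE_SIZE
    locks = [get_page_lock(i) for i in range(first, last + 1)]
    for lk in locks:  # ascending order is the deadlock-avoidance rule
        lk.acquire()
    return locks

def unlock_range(locks):
    for lk in reversed(locks):
        lk.release()

# A 9000-byte write at offset 4000 spans pages 0..3; holding all four
# page locks for the whole copy gives the read-vs-write atomicity.
locks = lock_range(4000, 9000)
assert len(locks) == 4
unlock_range(locks)
print("locked pages 0..3 as one batch")
```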
> > IOWs, range locks for IO solve a bunch of long term problems we have > in XFS and largely simplify the lock algorithms within the > filesystem. And it puts us on the path to introduce range locks for > extent mapping serialisation, allowing concurrent mapping lookups > and allocation within a single file. It also has the potential to > allow us to do concurrent directory modifications.... *nod* Don't take any of what I'm saying as arguing against range locks. They don't seem _as_ useful to me for bcachefs because a lot of what you want them for I can already do concurrently - but if they turn out to work better than my pagecache add lock I'll happily switch to them later. I'm just reserving judgement until I see them and get to play with them. > IO range locks are not "locking the page cache". IO range locks are > purely for managing concurrent IO state in a fine grained manner. > The page cache already has its own locking - that just needs to > nest inside IO range locks as the IO locks are what provide the high > level exclusion from overlapping page cache operations... I suppose I worded it badly, my point was just that range locks can be viewed as a superset of my pagecache add lock so switching to them would be an easy conversion. > > > > This would simplify things a lot and eliminate a really nasty corner case - page > > > > faults trigger readahead. Even if the buffer and the direct IO don't overlap, > > > > readahead can pull in pages that do overlap with the dio. > > > > > > Page cache readahead needs to be moved under the filesystem IO > > > locks. There was a recent thread about how readahead can race with > > > hole punching and other fallocate() operations because page cache > > > readahead bypasses the filesystem IO locks used to serialise page > > > cache invalidation. > > > > > > e.g. 
> > > Readahead can be directed by userspace via fadvise, so we now
> > > have file->f_op->fadvise() so that filesystems can lock the inode
> > > before calling generic_fadvise() such that page cache instantiation
> > > and readahead dispatch can be serialised against page cache
> > > invalidation. I have a patch for XFS sitting around somewhere that
> > > implements the ->fadvise method.
> >
> > I just puked a little in my mouth.
>
> Yeah, it's pretty gross. But the page cache simply isn't designed to
> allow atomic range operations to be performed. We haven't been able to
> drag it out of the 1980s - we wrote the fs/iomap.c code so we could
> do range based extent mapping for IOs rather than the horrible,
> inefficient page-by-page block mapping the generic page cache code
> does - that gave us a 30+% increase in buffered IO throughput
> because we only do a single mapping lookup per IO rather than one
> per page...

I don't think there's anything _fundamentally_ wrong with the pagecache
design - i.e. a radix tree of pages. We do _badly_ need support for arbitrary
power of 2 sized pages in the pagecache (and it _really_ needs to be compound
pages, not just hugepages - a lot of the work that's been done that just adds
if (hugepage) checks to page cache code is deeply misguided) - the overhead of
working with 4k pages has gotten to be completely absurd.

But AFAIK the radix tree/xarray side of that is done; what needs to be fixed
is all the stuff in e.g. filemap.c that assumes it's iterating over fixed size
pages - that code needs to be reworked to be more like iterating over extents.

> That said, the page cache is still far, far slower than direct IO,
> and the gap is just getting wider and wider as nvme SSDs get faster
> and faster. PCIe 4 SSDs are just going to make this even more
> obvious - it's getting to the point where the only reason for having
> a page cache is to support mmap() and cheap systems with spinning
> rust storage.
It is true that buffered IO performance is _way_ behind direct IO performance
today in a lot of situations, but this is mostly due to the fact that we've
completely dropped the ball and it's 2019 and we're STILL DOING EVERYTHING 4k
AT A TIME.

> > I think both approaches are workable, but I do think that pushing the locking
> > down to __add_to_page_cache_locked is fundamentally the better, more correct
> > approach.
> >
> > - It better matches the semantics of what we're trying to do. All these
> > operations we're trying to protect - dio, fallocate, truncate - they all have
> > in common that they just want to shoot down a range of the page cache and
> > keep it from being readded. And in general, it's better to have locks that
> > protect specific data structures ("adding to this radix tree"), vs. large
> > critical sections ("the io path").
>
> I disagree :)
>
> The high level IO locks provide the IO concurrency policy for the
> filesystem. The page cache is an internal structure for caching
> pages - it is not a structure for efficiently and cleanly
> implementing IO concurrency policy. That's the mistake the current
> page cache architecture makes - it tries to be the central control
> for all the filesystem IO (because filesystems are dumb and the page
> cache knows best!) but, unfortunately, this does not provide the
> semantics or functionality that all filesystems want and/or need.

We might be talking past each other a bit. I think we both agree that there
should be a separation of concerns w.r.t. locking, and that hanging too much
stuff off the page cache has been a mistake in the past (e.g. buffer heads).
But that doesn't mean the page cache shouldn't have locking, just that that
locking should only protect the page cache, not other filesystem state. I am
_not_ arguing that this should replace filesystem IO path locks for other,
non-page-cache purposes.
> Just look at the truncate detection mess we have every time we
> lookup and lock a page anywhere in the mm/ code - do you see any
> code in all that which detects a hole punch race? Nope, you don't,
> because the filesystems take responsibility for serialising that
> functionality.

Yeah, I know.

> Unfortunately, we have so much legacy filesystem cruft we'll never get rid
> of those truncate hacks.

Actually - I honestly think that this page cache add lock is simple and
straightforward enough to use that converting legacy filesystems to it
wouldn't be crazypants, and getting rid of all those hacks would be
_awesome_. The legacy filesystems tend not to implement the really crazy
fallocate stuff; it's just truncate that really has to be dealt with.

> Don't get me wrong - I'm not opposed to including page cache add
> locking - I'm just saying that the problems it tries to address (and
> ones it cannot address) are already largely solved in existing
> filesystems. I suspect that if we do merge this code, whatever
> locking is added would have to be optional....

That's certainly a fair point. There is more than one way to skin a cat. I'm
not sure "largely solved" is true, though - solved in XFS, yes, but these
issues weren't taken seriously in other filesystems until pretty recently...

That said, I do think this will prove useful for non-bcachefs filesystems at
some point. Like you alluded to, truncate synchronization is hot garbage in
most filesystems, and this would give us an easy way of improving that and
possibly getting rid of those truncate hacks in filemap.c. And because (on
the add side at least) this lock only needs to be taken when adding pages to
the page cache, not when the pages needed are already present - in bcachefs
at least this is a key enabler for making buffered IO not modify any
per-inode cachelines in the fast path, and I think other filesystems could do
the same thing.
Thread overview: 24+ messages
[not found] <20190610191420.27007-1-kent.overstreet@gmail.com>
[not found] ` <CAHk-=wi0iMHcO5nsYug06fV3-8s8fz7GDQWCuanefEGq6mHH1Q@mail.gmail.com>
[not found] ` <20190611011737.GA28701@kmo-pixel>
[not found] ` <20190611043336.GB14363@dread.disaster.area>
[not found] ` <20190612162144.GA7619@kmo-pixel>
[not found] ` <20190612230224.GJ14308@dread.disaster.area>
2019-06-13 18:36 ` pagecache locking (was: bcachefs status update) merged) Kent Overstreet
2019-06-13 21:13 ` Andreas Dilger
2019-06-13 21:21 ` Kent Overstreet
2019-06-14 0:35 ` Dave Chinner
2019-06-13 23:55 ` Dave Chinner
2019-06-14 2:30 ` Linus Torvalds
2019-06-14 7:30 ` Dave Chinner
2019-06-15 1:15 ` Linus Torvalds
2019-06-14 3:08 ` Linus Torvalds
2019-06-15 4:01 ` Linus Torvalds
2019-06-17 22:47 ` Dave Chinner
2019-06-17 23:38 ` Linus Torvalds
2019-06-18 4:21 ` Amir Goldstein
2019-06-19 10:38 ` Jan Kara
2019-06-19 22:37 ` Dave Chinner
2019-07-03 0:04 ` pagecache locking Boaz Harrosh
[not found] ` <DM6PR19MB250857CB8A3A1C8279D6F2F3C5FB0@DM6PR19MB2508.namprd19.prod.outlook.com>
2019-07-03 1:25 ` Boaz Harrosh
2019-07-05 23:31 ` Dave Chinner
2019-07-07 15:05 ` Boaz Harrosh
2019-07-07 23:55 ` Dave Chinner
2019-07-08 13:31 ` Jan Kara
2019-07-09 23:47 ` Dave Chinner
2019-07-10 8:41 ` Jan Kara
2019-06-14 17:08 ` pagecache locking (was: bcachefs status update) merged) Kent Overstreet