* Hole punching and mmap races
@ 2012-05-15 22:48 Jan Kara
  2012-05-16  2:14 ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread

From: Jan Kara @ 2012-05-15 22:48 UTC (permalink / raw)
To: linux-fsdevel; +Cc: xfs, linux-ext4, Hugh Dickins, linux-mm

  Hello,

  Hugh pointed me to ext4 hole punching code which is clearly missing some
locking. But looking at the code more deeply I realized I don't see
anything preventing the following race in XFS or ext4:

	TASK1				TASK2
	punch_hole(file, 0, 4096)
	  filemap_write_and_wait()
	  truncate_pagecache_range()
					addr = mmap(file);
					addr[0] = 1
					  ^^ writeably fault a page
	  remove file blocks

	FLUSHER
	  write out file
	    ^^ interesting things can happen because we expect blocks under
	       the first page to be allocated / reserved but they are not...

  I'm pretty sure ext4 has this problem. I'm not completely sure whether
XFS has something to protect against such a race, but I don't see anything.

  It's not easy to protect against these races. For truncate, i_size
protects us against similar races, but for hole punching we don't have any
such mechanism. One way to avoid the race would be to hold mmap_sem while
we are invalidating the page cache and punching the hole, but that sounds a
bit ugly. Alternatively we could just have some special lock (rwsem?) held
during page_mkwrite() (for reading) and during the whole hole punch (for
writing) to serialize these two operations. Another alternative, which
doesn't really look more appealing, is to go page-by-page and always free
the corresponding blocks under the page lock. Any other ideas or thoughts?

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Hole punching and mmap races
  2012-05-15 22:48 Hole punching and mmap races Jan Kara
@ 2012-05-16  2:14 ` Dave Chinner
  2012-05-16 13:04   ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread

From: Dave Chinner @ 2012-05-16 2:14 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm

On Wed, May 16, 2012 at 12:48:05AM +0200, Jan Kara wrote:
> Hello,
>
> Hugh pointed me to ext4 hole punching code which is clearly missing some
> locking. But looking at the code more deeply I realized I don't see
> anything preventing the following race in XFS or ext4:
>
> 	TASK1				TASK2
> 	punch_hole(file, 0, 4096)
> 	  filemap_write_and_wait()
> 	  truncate_pagecache_range()
> 					addr = mmap(file);
> 					addr[0] = 1
> 					  ^^ writeably fault a page
> 	  remove file blocks
>
> 	FLUSHER
> 	  write out file
> 	    ^^ interesting things can happen because we expect blocks under
> 	       the first page to be allocated / reserved but they are not...
>
> I'm pretty sure ext4 has this problem. I'm not completely sure whether
> XFS has something to protect against such a race, but I don't see anything.

No, it doesn't. It's a known problem due to not being able to take a
lock in .page_mkwrite() to serialise mmap() IO against truncation or
other IO such as direct IO. This has been known for, well, long
before we came up with page_mkwrite(). At the time page_mkwrite()
was introduced, locking was discussed to solve this problem but was
considered difficult on the VM side, so it was ignored.

> It's not easy to protect against these races. For truncate, i_size
> protects us against similar races, but for hole punching we don't have
> any such mechanism. One way to avoid the race would be to hold mmap_sem
> while we are invalidating the page cache and punching the hole, but that
> sounds a bit ugly. Alternatively we could just have some special lock
> (rwsem?) held during page_mkwrite() (for reading) and during the whole
> hole punch (for writing) to serialize these two operations.

What really needs to happen is that .page_mkwrite() can be made to
fail with -EAGAIN and retry the entire page fault from the start an
arbitrary number of times, instead of just once as the current code
does with VM_FAULT_RETRY. That would allow us to try to take the
filesystem lock that provides IO exclusion for all other types of IO
and fail with EAGAIN if we can't get it without blocking. For XFS,
that is the i_iolock rwsem; for others it is the i_mutex, and some
other filesystems might take other locks.

FWIW, I've been running a "use the IO lock in page_mkwrite" patch
for XFS for several months now, but I haven't posted it because
without the VM side being able to handle such locking failures
gracefully there's not much point in making the change. I did this
patch to reduce the incidence of mmap vs direct IO races that are
essentially identical in nature, to rule them out as the cause of
stray delalloc blocks in files that fsstress has been producing on
XFS. FYI, this race condition hasn't been responsible for any of the
problems I've found recently....

> Another alternative, which doesn't really look more appealing, is to go
> page-by-page and always free the corresponding blocks under the page lock.

Doesn't work for regions with no pages in memory over them.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Hole punching and mmap races
  2012-05-16  2:14 ` Dave Chinner
@ 2012-05-16 13:04   ` Jan Kara
  2012-05-17  7:43     ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread

From: Jan Kara @ 2012-05-16 13:04 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm

On Wed 16-05-12 12:14:23, Dave Chinner wrote:
> On Wed, May 16, 2012 at 12:48:05AM +0200, Jan Kara wrote:
> > Hello,
> >
> > Hugh pointed me to ext4 hole punching code which is clearly missing some
> > locking. But looking at the code more deeply I realized I don't see
> > anything preventing the following race in XFS or ext4:
> >
> > 	TASK1				TASK2
> > 	punch_hole(file, 0, 4096)
> > 	  filemap_write_and_wait()
> > 	  truncate_pagecache_range()
> > 					addr = mmap(file);
> > 					addr[0] = 1
> > 					  ^^ writeably fault a page
> > 	  remove file blocks
> >
> > 	FLUSHER
> > 	  write out file
> > 	    ^^ interesting things can happen because we expect blocks under
> > 	       the first page to be allocated / reserved but they are not...
> >
> > I'm pretty sure ext4 has this problem. I'm not completely sure whether
> > XFS has something to protect against such a race, but I don't see anything.
>
> No, it doesn't. It's a known problem due to not being able to take a
> lock in .page_mkwrite() to serialise mmap() IO against truncation or
> other IO such as direct IO. This has been known for, well, long
> before we came up with page_mkwrite(). At the time page_mkwrite()
> was introduced, locking was discussed to solve this problem but was
> considered difficult on the VM side, so it was ignored.
  I thought someone must have noticed before, since XFS has had hole
punching for a long time...

> > It's not easy to protect against these races. For truncate, i_size
> > protects us against similar races, but for hole punching we don't have
> > any such mechanism. One way to avoid the race would be to hold mmap_sem
> > while we are invalidating the page cache and punching the hole, but
> > that sounds a bit ugly. Alternatively we could just have some special
> > lock (rwsem?) held during page_mkwrite() (for reading) and during the
> > whole hole punch (for writing) to serialize these two operations.
>
> What really needs to happen is that .page_mkwrite() can be made to
> fail with -EAGAIN and retry the entire page fault from the start an
> arbitrary number of times, instead of just once as the current code
> does with VM_FAULT_RETRY. That would allow us to try to take the
> filesystem lock that provides IO exclusion for all other types of IO
> and fail with EAGAIN if we can't get it without blocking. For XFS,
> that is the i_iolock rwsem; for others it is the i_mutex, and some
> other filesystems might take other locks.
  Actually, I've been playing with VM_FAULT_RETRY recently (for the
freezing patches) and it's completely unhandled for .page_mkwrite()
callbacks. Also, only x86 really tries to handle it at all; other
architectures just don't allow it. And there's a ton of callers of things
like get_user_pages() which would need to handle VM_FAULT_RETRY, and for
some of them it would be actually non-trivial.

  But in this particular case, I don't think VM_FAULT_RETRY is strictly
necessary. We can have a lock which ranks below mmap_sem (and thus
i_mutex / i_iolock) and above i_mmap_mutex (thus page lock), transaction
start, etc. Such a lock could be taken in page_mkwrite() before taking the
page lock, in truncate() and punch_hole() just after i_mutex, and direct IO
paths could be tweaked to use it as well, I think.

> FWIW, I've been running a "use the IO lock in page_mkwrite" patch
> for XFS for several months now, but I haven't posted it because
> without the VM side being able to handle such locking failures
> gracefully there's not much point in making the change. I did this
> patch to reduce the incidence of mmap vs direct IO races that are
> essentially identical in nature, to rule them out as the cause of
> stray delalloc blocks in files that fsstress has been producing on
> XFS. FYI, this race condition hasn't been responsible for any of the
> problems I've found recently....
  Yeah, I've been trying to hit the race window for a while and I failed as
well...

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: Hole punching and mmap races
  2012-05-16 13:04   ` Jan Kara
@ 2012-05-17  7:43     ` Dave Chinner
  2012-05-17 23:28       ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread

From: Dave Chinner @ 2012-05-17 7:43 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm

On Wed, May 16, 2012 at 03:04:45PM +0200, Jan Kara wrote:
> On Wed 16-05-12 12:14:23, Dave Chinner wrote:
> > On Wed, May 16, 2012 at 12:48:05AM +0200, Jan Kara wrote:
> > > It's not easy to protect against these races. For truncate, i_size
> > > protects us against similar races, but for hole punching we don't
> > > have any such mechanism. One way to avoid the race would be to hold
> > > mmap_sem while we are invalidating the page cache and punching the
> > > hole, but that sounds a bit ugly. Alternatively we could just have
> > > some special lock (rwsem?) held during page_mkwrite() (for reading)
> > > and during the whole hole punch (for writing) to serialize these two
> > > operations.
> >
> > What really needs to happen is that .page_mkwrite() can be made to
> > fail with -EAGAIN and retry the entire page fault from the start an
> > arbitrary number of times, instead of just once as the current code
> > does with VM_FAULT_RETRY. That would allow us to try to take the
> > filesystem lock that provides IO exclusion for all other types of IO
> > and fail with EAGAIN if we can't get it without blocking. For XFS,
> > that is the i_iolock rwsem; for others it is the i_mutex, and some
> > other filesystems might take other locks.
> Actually, I've been playing with VM_FAULT_RETRY recently (for the
> freezing patches) and it's completely unhandled for .page_mkwrite()
> callbacks.

Yeah, it's a mess.

> Also, only x86 really tries to handle it at all; other architectures
> just don't allow it. And there's a ton of callers of things like
> get_user_pages() which would need to handle VM_FAULT_RETRY, and for
> some of them it would be actually non-trivial.

Seems kind of silly to me to have a generic retry capability in the
page fault handler and then not implement it in a useful manner for
*anyone*.

> But in this particular case, I don't think VM_FAULT_RETRY is strictly
> necessary. We can have a lock which ranks below mmap_sem (and thus
> i_mutex / i_iolock) and above i_mmap_mutex (thus page lock), transaction
> start, etc. Such a lock could be taken in page_mkwrite() before taking
> the page lock, in truncate() and punch_hole() just after i_mutex, and
> direct IO paths could be tweaked to use it as well, I think.

Which means we'd be adding another layer of mostly redundant locking
just to avoid i_mutex/mmap_sem inversion. But I don't see how it
solves the direct IO problem, because we still need to grab the
mmap_sem inside the IO path during get_user_pages_fast() while
holding i_mutex/i_iolock....

> > FWIW, I've been running a "use the IO lock in page_mkwrite" patch
> > for XFS for several months now, but I haven't posted it because
> > without the VM side being able to handle such locking failures
> > gracefully there's not much point in making the change. I did this
> > patch to reduce the incidence of mmap vs direct IO races that are
> > essentially identical in nature, to rule them out as the cause of
> > stray delalloc blocks in files that fsstress has been producing on
> > XFS. FYI, this race condition hasn't been responsible for any of the
> > problems I've found recently....
> Yeah, I've been trying to hit the race window for a while and I failed
> as well...

IIRC, it's a rare case (that I consider insane, BTW): read from a
file into a buffer that is a mmap()d region of the same file that
has not been faulted in yet.....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Hole punching and mmap races
  2012-05-17  7:43     ` Dave Chinner
@ 2012-05-17 23:28       ` Jan Kara
  2012-05-18 10:12         ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread

From: Jan Kara @ 2012-05-17 23:28 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm

On Thu 17-05-12 17:43:08, Dave Chinner wrote:
> On Wed, May 16, 2012 at 03:04:45PM +0200, Jan Kara wrote:
> > On Wed 16-05-12 12:14:23, Dave Chinner wrote:
> > > On Wed, May 16, 2012 at 12:48:05AM +0200, Jan Kara wrote:
> > > > It's not easy to protect against these races. For truncate, i_size
> > > > protects us against similar races, but for hole punching we don't
> > > > have any such mechanism. One way to avoid the race would be to hold
> > > > mmap_sem while we are invalidating the page cache and punching the
> > > > hole, but that sounds a bit ugly. Alternatively we could just have
> > > > some special lock (rwsem?) held during page_mkwrite() (for reading)
> > > > and during the whole hole punch (for writing) to serialize these
> > > > two operations.
> > >
> > > What really needs to happen is that .page_mkwrite() can be made to
> > > fail with -EAGAIN and retry the entire page fault from the start an
> > > arbitrary number of times, instead of just once as the current code
> > > does with VM_FAULT_RETRY. That would allow us to try to take the
> > > filesystem lock that provides IO exclusion for all other types of IO
> > > and fail with EAGAIN if we can't get it without blocking. For XFS,
> > > that is the i_iolock rwsem; for others it is the i_mutex, and some
> > > other filesystems might take other locks.
> > Actually, I've been playing with VM_FAULT_RETRY recently (for the
> > freezing patches) and it's completely unhandled for .page_mkwrite()
> > callbacks.
>
> Yeah, it's a mess.
>
> > Also, only x86 really tries to handle it at all; other architectures
> > just don't allow it. And there's a ton of callers of things like
> > get_user_pages() which would need to handle VM_FAULT_RETRY, and for
> > some of them it would be actually non-trivial.
>
> Seems kind of silly to me to have a generic retry capability in the
> page fault handler and then not implement it in a useful manner for
> *anyone*.
  Yeah. It's only tailored for one specific use in filemap_fault() on
x86...

> > But in this particular case, I don't think VM_FAULT_RETRY is strictly
> > necessary. We can have a lock which ranks below mmap_sem (and thus
> > i_mutex / i_iolock) and above i_mmap_mutex (thus page lock),
> > transaction start, etc. Such a lock could be taken in page_mkwrite()
> > before taking the page lock, in truncate() and punch_hole() just after
> > i_mutex, and direct IO paths could be tweaked to use it as well, I
> > think.
>
> Which means we'd be adding another layer of mostly redundant locking
> just to avoid i_mutex/mmap_sem inversion. But I don't see how it
> solves the direct IO problem, because we still need to grab the
> mmap_sem inside the IO path during get_user_pages_fast() while
> holding i_mutex/i_iolock....
  Yeah, the direct IO case won't be trivial. We would have to take the
lock after dio_get_page() and release it before going for the next page.
While the lock was released, someone could fault in pages into the area
where direct IO is happening, so we would have to invalidate them while
holding the lock again. It seems it would work, but I agree it's probably
too ugly to live given how abstract the problem is...

> > > FWIW, I've been running a "use the IO lock in page_mkwrite" patch
> > > for XFS for several months now, but I haven't posted it because
> > > without the VM side being able to handle such locking failures
> > > gracefully there's not much point in making the change. I did this
> > > patch to reduce the incidence of mmap vs direct IO races that are
> > > essentially identical in nature, to rule them out as the cause of
> > > stray delalloc blocks in files that fsstress has been producing on
> > > XFS. FYI, this race condition hasn't been responsible for any of the
> > > problems I've found recently....
> > Yeah, I've been trying to hit the race window for a while and I failed
> > as well...
>
> IIRC, it's a rare case (that I consider insane, BTW): read from a
> file into a buffer that is a mmap()d region of the same file that
> has not been faulted in yet.....
  With punch hole, the race is less insane - just punching a hole in the
area which is accessed via mmap could race in a bad way AFAICS.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: Hole punching and mmap races
  2012-05-17 23:28       ` Jan Kara
@ 2012-05-18 10:12         ` Dave Chinner
  2012-05-18 13:32           ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread

From: Dave Chinner @ 2012-05-18 10:12 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm

On Fri, May 18, 2012 at 01:28:29AM +0200, Jan Kara wrote:
> On Thu 17-05-12 17:43:08, Dave Chinner wrote:
> > On Wed, May 16, 2012 at 03:04:45PM +0200, Jan Kara wrote:
> > > On Wed 16-05-12 12:14:23, Dave Chinner wrote:
> > IIRC, it's a rare case (that I consider insane, BTW): read from a
> > file into a buffer that is a mmap()d region of the same file that
> > has not been faulted in yet.....
> With punch hole, the race is less insane - just punching a hole in the
> area which is accessed via mmap could race in a bad way AFAICS.

Seems the simple answer to me is to prevent page faults while hole
punching, then....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Hole punching and mmap races
  2012-05-18 10:12         ` Dave Chinner
@ 2012-05-18 13:32           ` Jan Kara
  2012-05-19  1:40             ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread

From: Jan Kara @ 2012-05-18 13:32 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm

On Fri 18-05-12 20:12:10, Dave Chinner wrote:
> On Fri, May 18, 2012 at 01:28:29AM +0200, Jan Kara wrote:
> > On Thu 17-05-12 17:43:08, Dave Chinner wrote:
> > > On Wed, May 16, 2012 at 03:04:45PM +0200, Jan Kara wrote:
> > > > On Wed 16-05-12 12:14:23, Dave Chinner wrote:
> > > IIRC, it's a rare case (that I consider insane, BTW): read from a
> > > file into a buffer that is a mmap()d region of the same file that
> > > has not been faulted in yet.....
> > With punch hole, the race is less insane - just punching a hole in the
> > area which is accessed via mmap could race in a bad way AFAICS.
>
> Seems the simple answer to me is to prevent page faults while hole
> punching, then....
  Yes, that's what I was suggesting in the beginning :) And I was asking
whether people are OK with another lock in the page fault path (in
particular in ->page_mkwrite) or whether someone has a better idea (e.g.
taking mmap_sem in the hole punching path seems possible but I'm not sure
whether that would be considered acceptable abuse).

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: Hole punching and mmap races
  2012-05-18 13:32           ` Jan Kara
@ 2012-05-19  1:40             ` Dave Chinner
  2012-05-24 12:35               ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread

From: Dave Chinner @ 2012-05-19 1:40 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm

On Fri, May 18, 2012 at 03:32:50PM +0200, Jan Kara wrote:
> On Fri 18-05-12 20:12:10, Dave Chinner wrote:
> > On Fri, May 18, 2012 at 01:28:29AM +0200, Jan Kara wrote:
> > > On Thu 17-05-12 17:43:08, Dave Chinner wrote:
> > > > On Wed, May 16, 2012 at 03:04:45PM +0200, Jan Kara wrote:
> > > > > On Wed 16-05-12 12:14:23, Dave Chinner wrote:
> > > > IIRC, it's a rare case (that I consider insane, BTW): read from a
> > > > file into a buffer that is a mmap()d region of the same file that
> > > > has not been faulted in yet.....
> > > With punch hole, the race is less insane - just punching a hole in
> > > the area which is accessed via mmap could race in a bad way AFAICS.
> >
> > Seems the simple answer to me is to prevent page faults while hole
> > punching, then....
> Yes, that's what I was suggesting in the beginning :) And I was asking
> whether people are OK with another lock in the page fault path (in
> particular in ->page_mkwrite)

Right. I probably should have been clearer in what I said. We got
back here from considering another IO level lock and all the
complexity it adds to just solve the hole punch problem....

> or whether someone has a better idea (e.g.
> taking mmap_sem in the hole punching path seems possible but I'm not
> sure whether that would be considered acceptable abuse).

That's for the VM guys to answer, but it seems wrong to me to have
to treat hole punching differently to truncation....

The thing is, mmap IO is completely unlocked from an IO perspective,
and that means we cannot guarantee exclusion from IO without using
the IO exclusion lock. That's the simplest way we can make mmap
serialise sanely against direct IO and hole punching. Hole punching
is inherently a filesystem operation (just like truncation), and
mmap operations must stall while it is in progress. It's just that
we have the problem that we allow the mmap_sem to be taken inside
the IO exclusion locks...

So let's step back a moment and have a look at how we've got here.
The problem is that we've optimised ourselves into a corner with the
way we handle page cache truncation - we don't need mmap
serialisation because the combination of i_size and page locks
means we can detect truncated pages safely at page fault time. With
hole punching, we don't have that i_size safety blanket, and so we
need some other serialisation mechanism to safely detect whether a
page is valid or not at any given point in time.

Because it needs to serialise against IO operations, we need a
sleeping lock of some kind, and it can't be the existing IO lock.
And now that we are looking at needing a new lock for hole punching,
I'm really wondering if the i_size/page lock truncation optimisation
should even continue to exist. i.e. replace it with a single
mechanism that works for hole punching, truncation and other
functions that require exclusive access to, or exclusion against
modifications of, the mapping tree.

But this is only one of the problems in this area. The way I see it
is that we have many kludges in the area of page invalidation w.r.t.
different types of IO, the page cache and mmap, especially when we
take into account direct IO. What we are seeing here is that we need
some level of _mapping tree exclusion_ between:

	1. mmap vs hole punch (broken)
	2. mmap vs truncate (i_size/page lock)
	3. mmap vs direct IO (non-existent)
	4. mmap vs buffered IO (page lock)
	5. writeback vs truncate (i_size/page lock)
	6. writeback vs hole punch (page lock, possibly broken)
	7. direct IO vs buffered IO (racy - flush cache before/after DIO)

#1, #2, #5 and #6 could be solved by a rw-lock for the operations -
read for mmap/writeback, exclusive for hole-punch and truncation.
That, however, doesn't work for #3 and #4, as the exclusion is
inverted - direct/buffered IO would require a shared mode lock and
mmap requires the exclusive lock. Similarly, #7 requires a shared
lock for direct IO and a shared lock for buffered IO, but exclusion
between the two for overlapping ranges. No single locking primitive
that currently exists can give us this set of semantics....

Right now we are talking about hacking in some solution to #1 while
ignoring the wider range of related but ignored/papered-over
problems we also have. I don't have a magic bullet that solves all
of these problems, but I think it is worth recognising and
considering that this problem is much larger than just hole punching
and that these problems have been there for a *long time*.

To me the issue at hand is that we have no method of serialising
multi-page operations on the mapping tree between the filesystem and
the VM, and that seems to be the fundamental problem we face in this
whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
Hence it might be better to try to work out how to fix this entire
class of problems rather than just adding a complex kludge that just
papers over the current "hot" symptom....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Hole punching and mmap races
  2012-05-19  1:40             ` Dave Chinner
@ 2012-05-24 12:35               ` Jan Kara
  2012-06-05  5:51                 ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread

From: Jan Kara @ 2012-05-24 12:35 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm

On Sat 19-05-12 11:40:24, Dave Chinner wrote:
> On Fri, May 18, 2012 at 03:32:50PM +0200, Jan Kara wrote:
> > On Fri 18-05-12 20:12:10, Dave Chinner wrote:
> > > On Fri, May 18, 2012 at 01:28:29AM +0200, Jan Kara wrote:
> > > > On Thu 17-05-12 17:43:08, Dave Chinner wrote:
> > > > > On Wed, May 16, 2012 at 03:04:45PM +0200, Jan Kara wrote:
> > > > > > On Wed 16-05-12 12:14:23, Dave Chinner wrote:
> > > > > IIRC, it's a rare case (that I consider insane, BTW): read from a
> > > > > file into a buffer that is a mmap()d region of the same file that
> > > > > has not been faulted in yet.....
> > > > With punch hole, the race is less insane - just punching a hole in
> > > > the area which is accessed via mmap could race in a bad way AFAICS.
> > >
> > > Seems the simple answer to me is to prevent page faults while hole
> > > punching, then....
> > Yes, that's what I was suggesting in the beginning :) And I was asking
> > whether people are OK with another lock in the page fault path (in
> > particular in ->page_mkwrite)
>
> Right. I probably should have been clearer in what I said. We got
> back here from considering another IO level lock and all the
> complexity it adds to just solve the hole punch problem....
>
> > or whether someone has a better idea (e.g.
> > taking mmap_sem in the hole punching path seems possible but I'm not
> > sure whether that would be considered acceptable abuse).
>
> That's for the VM guys to answer, but it seems wrong to me to have
> to treat hole punching differently to truncation....
>
> The thing is, mmap IO is completely unlocked from an IO perspective,
> and that means we cannot guarantee exclusion from IO without using
> the IO exclusion lock. That's the simplest way we can make mmap
> serialise sanely against direct IO and hole punching. Hole punching
> is inherently a filesystem operation (just like truncation), and
> mmap operations must stall while it is in progress. It's just that
> we have the problem that we allow the mmap_sem to be taken inside
> the IO exclusion locks...
>
> So let's step back a moment and have a look at how we've got here.
> The problem is that we've optimised ourselves into a corner with the
> way we handle page cache truncation - we don't need mmap
> serialisation because the combination of i_size and page locks
> means we can detect truncated pages safely at page fault time. With
> hole punching, we don't have that i_size safety blanket, and so we
> need some other serialisation mechanism to safely detect whether a
> page is valid or not at any given point in time.
>
> Because it needs to serialise against IO operations, we need a
> sleeping lock of some kind, and it can't be the existing IO lock.
> And now that we are looking at needing a new lock for hole punching,
> I'm really wondering if the i_size/page lock truncation optimisation
> should even continue to exist. i.e. replace it with a single
> mechanism that works for hole punching, truncation and other
> functions that require exclusive access to, or exclusion against
> modifications of, the mapping tree.
>
> But this is only one of the problems in this area. The way I see it
> is that we have many kludges in the area of page invalidation w.r.t.
> different types of IO, the page cache and mmap, especially when we
> take into account direct IO. What we are seeing here is that we need
> some level of _mapping tree exclusion_ between:
>
> 	1. mmap vs hole punch (broken)
> 	2. mmap vs truncate (i_size/page lock)
> 	3. mmap vs direct IO (non-existent)
> 	4. mmap vs buffered IO (page lock)
> 	5. writeback vs truncate (i_size/page lock)
> 	6. writeback vs hole punch (page lock, possibly broken)
> 	7. direct IO vs buffered IO (racy - flush cache before/after DIO)
  Yes, this is a nice summary of the most interesting cases. For
completeness, here are the remaining cases:

	8. mmap vs writeback (page lock)
	9. writeback vs direct IO (as direct IO vs buffered IO)
	10. writeback vs buffered IO (page lock)
	11. direct IO vs truncate (dio_wait)
	12. direct IO vs hole punch (dio_wait)
	13. buffered IO vs truncate (i_mutex for writes, i_size/page lock
	    for reads)
	14. buffered IO vs hole punch (fs dependent, broken for ext4)
	15. truncate vs hole punch (fs dependent)
	16. mmap vs mmap (page lock)
	17. writeback vs writeback (page lock)
	18. direct IO vs direct IO (i_mutex or fs dependent)
	19. buffered IO vs buffered IO (i_mutex for writes, page lock for
	    reads)
	20. truncate vs truncate (i_mutex)
	21. punch hole vs punch hole (fs dependent)

> #1, #2, #5 and #6 could be solved by a rw-lock for the operations -
> read for mmap/writeback, exclusive for hole-punch and truncation.
> That, however, doesn't work for #3 and #4, as the exclusion is
> inverted - direct/buffered IO would require a shared mode lock and
> mmap requires the exclusive lock. Similarly, #7 requires a shared
> lock for direct IO and a shared lock for buffered IO, but exclusion
> between the two for overlapping ranges. No single locking primitive
> that currently exists can give us this set of semantics....
>
> Right now we are talking about hacking in some solution to #1 while
> ignoring the wider range of related but ignored/papered-over
> problems we also have. I don't have a magic bullet that solves all
> of these problems, but I think it is worth recognising and
> considering that this problem is much larger than just hole punching
> and that these problems have been there for a *long time*.
>
> To me the issue at hand is that we have no method of serialising
> multi-page operations on the mapping tree between the filesystem and
> the VM, and that seems to be the fundamental problem we face in this
> whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
> Hence it might be better to try to work out how to fix this entire
> class of problems rather than just adding a complex kludge that just
> papers over the current "hot" symptom....
  Yes, looking at the above table, the amount of different synchronization
mechanisms is really striking. So probably we should look at some
possibility of unifying at least some cases.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: Hole punching and mmap races 2012-05-24 12:35 ` Jan Kara @ 2012-06-05 5:51 ` Dave Chinner 2012-06-05 6:22 ` Marco Stornelli 2012-06-05 23:15 ` Jan Kara 0 siblings, 2 replies; 20+ messages in thread From: Dave Chinner @ 2012-06-05 5:51 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote: > On Sat 19-05-12 11:40:24, Dave Chinner wrote: > > So let's step back a moment and have a look at how we've got here. > > The problem is that we've optimised ourselves into a corner with the > > way we handle page cache truncation - we don't need mmap > > serialisation because of the combination of i_size and page locks > > mean we can detect truncated pages safely at page fault time. With > > hole punching, we don't have that i_size safety blanket, and so we > > need some other serialisation mechanism to safely detect whether a > > page is valid or not at any given point in time. > > > > Because it needs to serialise against IO operations, we need a > > sleeping lock of some kind, and it can't be the existing IO lock. > > And now we are looking at needing a new lock for hole punching, I'm > > really wondering if the i_size/page lock truncation optimisation > > should even continue to exist. i.e. replace it with a single > > mechanism that works for both hole punching, truncation and other > > functions that require exclusive access or exclusion against > > modifications to the mapping tree. > > > > But this is only one of the problems in this area.The way I see it > > is that we have many kludges in the area of page invalidation w.r.t. > > different types of IO, the page cache and mmap, especially when we > > take into account direct IO. What we are seeing here is we need > > some level of _mapping tree exclusion_ between: > > > > 1. mmap vs hole punch (broken) > > 2. mmap vs truncate (i_size/page lock) > > 3. mmap vs direct IO (non-existent) > > 4. 
mmap vs buffered IO (page lock) > > 5. writeback vs truncate (i_size/page lock) > > 6. writeback vs hole punch (page lock, possibly broken) > > 7. direct IO vs buffered IO (racy - flush cache before/after DIO) > Yes, this is a nice summary of the most interesting cases. For completeness, > here are the remaining cases: > 8. mmap vs writeback (page lock) > 9. writeback vs direct IO (as direct IO vs buffered IO) > 10. writeback vs buffered IO (page lock) > 11. direct IO vs truncate (dio_wait) > 12. direct IO vs hole punch (dio_wait) > 13. buffered IO vs truncate (i_mutex for writes, i_size/page lock for reads) > 14. buffered IO vs hole punch (fs dependent, broken for ext4) > 15. truncate vs hole punch (fs dependent) > 16. mmap vs mmap (page lock) > 17. writeback vs writeback (page lock) > 18. direct IO vs direct IO (i_mutex or fs dependent) > 19. buffered IO vs buffered IO (i_mutex for writes, page lock for reads) > 20. truncate vs truncate (i_mutex) > 21. punch hole vs punch hole (fs dependent) A lot of them are the IO exclusion side of the problem - I ignored them just to make my discussion short and to the point. So thanks for documenting them for everyone. :) .... > > To me the issue at hand is that we have no method of serialising > > multi-page operations on the mapping tree between the filesystem and > > the VM, and that seems to be the fundamental problem we face in this > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency. > > Hence it might be better to try to work out how to fix this entire > > class of problems rather than just adding a complex kuldge that just > > papers over the current "hot" symptom.... > Yes, looking at the above table, the amount of different synchronization > mechanisms is really striking. So probably we should look at some > possibility of unifying at least some cases. It seems to me that we need some thing in between the fine grained page lock and the entire-file IO exclusion lock. 
We need to maintain fine grained locking for mmap scalability, but we also need to be able to atomically lock ranges of pages. I guess if we were to nest a fine grained multi-state lock inside both the IO exclusion lock and the mmap_sem, we might be able to kill all problems in one go. Exclusive access on a range needs to be granted to: - direct IO - truncate - hole punch so they can be serialised against mmap based page faults, writeback and concurrent buffered IO. Serialisation against themselves is an IO/fs exclusion problem. Shared access for traversal or modification needs to be granted to: - buffered IO - mmap page faults - writeback Each of these cases can rely on the existing page locks or IO exclusion locks to provide safety for concurrent access to the same ranges. This means that once we have access granted to a range we can check truncate races once and ignore the problem until we drop the access. And the case of taking a page fault within a buffered IO won't deadlock because both take a shared lock.... We'd need some kind of efficient shared/exclusive range lock for this sort of exclusion, and it's entirely possible that it would have too much overhead to be acceptable in the page fault path. It's the best I can think of right now..... As it is, a range lock of this kind would be very handy for other things, too (like the IO exclusion locks so we can do concurrent buffered writes in XFS ;). Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
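A toy userspace model of such a shared/exclusive range lock (illustrative Python, not kernel code; the class and method names are invented for this sketch) could look like:

```python
import threading

class RangeLock:
    """Toy shared/exclusive lock over half-open integer ranges [start, end).

    Shared holders (buffered IO, page faults, writeback in the scheme
    above) may overlap each other; an exclusive holder (direct IO,
    truncate, hole punch) excludes everything that overlaps its range.
    Non-overlapping ranges never block each other.
    """
    def __init__(self):
        self._cond = threading.Condition()
        self._held = []  # list of (start, end, exclusive) tuples

    @staticmethod
    def _overlap(a0, a1, b0, b1):
        return a0 < b1 and b0 < a1

    def _conflicts(self, start, end, exclusive):
        # A request conflicts with a held range if they overlap and at
        # least one of the two is exclusive.
        for s, e, x in self._held:
            if self._overlap(start, end, s, e) and (exclusive or x):
                return True
        return False

    def lock(self, start, end, exclusive=False):
        with self._cond:
            while self._conflicts(start, end, exclusive):
                self._cond.wait()
            self._held.append((start, end, exclusive))

    def unlock(self, start, end, exclusive=False):
        with self._cond:
            self._held.remove((start, end, exclusive))
            self._cond.notify_all()
```

Here overlapping shared holders coexist, while an exclusive holder waits for any overlap to drain. Note that this naive version has no queuing fairness, so an exclusive waiter can be starved by a stream of shared lockers - exactly the starvation concern raised later in the thread.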
* Re: Hole punching and mmap races 2012-06-05 5:51 ` Dave Chinner @ 2012-06-05 6:22 ` Marco Stornelli 2012-06-05 23:15 ` Jan Kara 1 sibling, 0 replies; 20+ messages in thread From: Marco Stornelli @ 2012-06-05 6:22 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm 2012/6/5 Dave Chinner <david@fromorbit.com>: > On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote: >> On Sat 19-05-12 11:40:24, Dave Chinner wrote: >> > So let's step back a moment and have a look at how we've got here. >> > The problem is that we've optimised ourselves into a corner with the >> > way we handle page cache truncation - we don't need mmap >> > serialisation because of the combination of i_size and page locks >> > mean we can detect truncated pages safely at page fault time. With >> > hole punching, we don't have that i_size safety blanket, and so we >> > need some other serialisation mechanism to safely detect whether a >> > page is valid or not at any given point in time. >> > >> > Because it needs to serialise against IO operations, we need a >> > sleeping lock of some kind, and it can't be the existing IO lock. >> > And now we are looking at needing a new lock for hole punching, I'm >> > really wondering if the i_size/page lock truncation optimisation >> > should even continue to exist. i.e. replace it with a single >> > mechanism that works for both hole punching, truncation and other >> > functions that require exclusive access or exclusion against >> > modifications to the mapping tree. >> > >> > But this is only one of the problems in this area.The way I see it >> > is that we have many kludges in the area of page invalidation w.r.t. >> > different types of IO, the page cache and mmap, especially when we >> > take into account direct IO. What we are seeing here is we need >> > some level of _mapping tree exclusion_ between: >> > >> > 1. mmap vs hole punch (broken) >> > 2. mmap vs truncate (i_size/page lock) >> > 3. 
mmap vs direct IO (non-existent) >> > 4. mmap vs buffered IO (page lock) >> > 5. writeback vs truncate (i_size/page lock) >> > 6. writeback vs hole punch (page lock, possibly broken) >> > 7. direct IO vs buffered IO (racy - flush cache before/after DIO) >> Yes, this is a nice summary of the most interesting cases. For completeness, >> here are the remaining cases: >> 8. mmap vs writeback (page lock) >> 9. writeback vs direct IO (as direct IO vs buffered IO) >> 10. writeback vs buffered IO (page lock) >> 11. direct IO vs truncate (dio_wait) >> 12. direct IO vs hole punch (dio_wait) >> 13. buffered IO vs truncate (i_mutex for writes, i_size/page lock for reads) >> 14. buffered IO vs hole punch (fs dependent, broken for ext4) >> 15. truncate vs hole punch (fs dependent) >> 16. mmap vs mmap (page lock) >> 17. writeback vs writeback (page lock) >> 18. direct IO vs direct IO (i_mutex or fs dependent) >> 19. buffered IO vs buffered IO (i_mutex for writes, page lock for reads) >> 20. truncate vs truncate (i_mutex) >> 21. punch hole vs punch hole (fs dependent) > I think we have even the xip cases here. Marco -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Hole punching and mmap races 2012-06-05 5:51 ` Dave Chinner 2012-06-05 6:22 ` Marco Stornelli @ 2012-06-05 23:15 ` Jan Kara 2012-06-06 0:06 ` Dave Chinner 1 sibling, 1 reply; 20+ messages in thread From: Jan Kara @ 2012-06-05 23:15 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm On Tue 05-06-12 15:51:50, Dave Chinner wrote: > On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote: > > > To me the issue at hand is that we have no method of serialising > > > multi-page operations on the mapping tree between the filesystem and > > > the VM, and that seems to be the fundamental problem we face in this > > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency. > > > Hence it might be better to try to work out how to fix this entire > > > class of problems rather than just adding a complex kuldge that just > > > papers over the current "hot" symptom.... > > Yes, looking at the above table, the amount of different synchronization > > mechanisms is really striking. So probably we should look at some > > possibility of unifying at least some cases. > > It seems to me that we need some thing in between the fine grained > page lock and the entire-file IO exclusion lock. We need to maintain > fine grained locking for mmap scalability, but we also need to be > able to atomically lock ranges of pages. Yes, we also need to keep things fine grained to keep scalability of direct IO and buffered reads... > I guess if we were to nest a fine grained multi-state lock > inside both the IO exclusion lock and the mmap_sem, we might be able > to kill all problems in one go. > > Exclusive access on a range needs to be granted to: > > - direct IO > - truncate > - hole punch > > so they can be serialised against mmap based page faults, writeback > and concurrent buffered IO. Serialisation against themselves is an > IO/fs exclusion problem. 
> > Shared access for traversal or modification needs to be granted to: > > - buffered IO > - mmap page faults > - writeback > > Each of these cases can rely on the existing page locks or IO > exclusion locks to provide safety for concurrent access to the same > ranges. This means that once we have access granted to a range we > can check truncate races once and ignore the problem until we drop > the access. And the case of taking a page fault within a buffered > IO won't deadlock because both take a shared lock.... You cannot just use a lock (not even a shared one) both above and under mmap_sem. That is deadlockable in the presence of other requests for exclusive locking... Luckily, with buffered writes the situation isn't that bad. You need mmap_sem only before each page is processed (in iov_iter_fault_in_readable()). Later on in the loop we use iov_iter_copy_from_user_atomic() which doesn't need mmap_sem. So we can just get our shared lock after iov_iter_fault_in_readable() (or simply leave it for ->write_begin() if we want to give control over the locking to filesystems). > We'd need some kind of efficient shared/exclusive range lock for > this sort of exclusion, and it's entirely possible that it would > have too much overhead to be acceptible in the page fault path. It's > the best I can think of right now..... > > As it is, a range lock of this kind would be very handy for other > things, too (like the IO exclusion locks so we can do concurrent > buffered writes in XFS ;). Yes, that's what I thought as well. In particular it should be pretty efficient in locking a single page range because that's going to be the majority of calls. I'll try to write something and see how fast it can be... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
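Jan's point about the buffered write path - mmap_sem is needed only to fault the user buffer in, so the range lock can be taken afterwards for the copy step - can be sketched as a toy loop (illustrative Python; the function names are stand-ins for the kernel helpers, not real APIs):

```python
# Toy model of the buffered-write loop with the proposed lock ordering:
# mmap_sem is only needed to fault the user buffer in, so the
# mapping-range lock can be taken *after* that step, avoiding a
# mmap_sem -> range-lock inversion in the copy path.

log = []

def fault_in_readable(buf):
    # stand-in for iov_iter_fault_in_readable(): needs mmap_sem
    log.append("mmap_sem: fault in user buffer")

def copy_atomic(buf, page):
    # stand-in for iov_iter_copy_from_user_atomic(): no mmap_sem needed
    log.append("copy to page %d" % page)

def range_lock(page):
    log.append("range lock page %d (shared)" % page)

def range_unlock(page):
    log.append("range unlock page %d" % page)

def buffered_write(pages, buf):
    for page in pages:
        fault_in_readable(buf)   # under mmap_sem
        range_lock(page)         # safe: mmap_sem no longer required here
        copy_atomic(buf, page)
        range_unlock(page)

buffered_write([0, 1], "data")
```

The point of the sketch is the per-page ordering: fault the user page in first (the only step that needs mmap_sem), then take the shared range lock, then do the atomic copy.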
* Re: Hole punching and mmap races 2012-06-05 23:15 ` Jan Kara @ 2012-06-06 0:06 ` Dave Chinner 2012-06-06 9:58 ` Jan Kara 0 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2012-06-06 0:06 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm On Wed, Jun 06, 2012 at 01:15:30AM +0200, Jan Kara wrote: > On Tue 05-06-12 15:51:50, Dave Chinner wrote: > > On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote: > > > > To me the issue at hand is that we have no method of serialising > > > > multi-page operations on the mapping tree between the filesystem and > > > > the VM, and that seems to be the fundamental problem we face in this > > > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency. > > > > Hence it might be better to try to work out how to fix this entire > > > > class of problems rather than just adding a complex kuldge that just > > > > papers over the current "hot" symptom.... > > > Yes, looking at the above table, the amount of different synchronization > > > mechanisms is really striking. So probably we should look at some > > > possibility of unifying at least some cases. > > > > It seems to me that we need some thing in between the fine grained > > page lock and the entire-file IO exclusion lock. We need to maintain > > fine grained locking for mmap scalability, but we also need to be > > able to atomically lock ranges of pages. > Yes, we also need to keep things fine grained to keep scalability of > direct IO and buffered reads... > > > I guess if we were to nest a fine grained multi-state lock > > inside both the IO exclusion lock and the mmap_sem, we might be able > > to kill all problems in one go. > > > > Exclusive access on a range needs to be granted to: > > > > - direct IO > > - truncate > > - hole punch > > > > so they can be serialised against mmap based page faults, writeback > > and concurrent buffered IO. 
Serialisation against themselves is an > > IO/fs exclusion problem. > > > > Shared access for traversal or modification needs to be granted to: > > > > - buffered IO > > - mmap page faults > > - writeback > > > > Each of these cases can rely on the existing page locks or IO > > exclusion locks to provide safety for concurrent access to the same > > ranges. This means that once we have access granted to a range we > > can check truncate races once and ignore the problem until we drop > > the access. And the case of taking a page fault within a buffered > > IO won't deadlock because both take a shared lock.... > You cannot just use a lock (not even a shared one) both above and under > mmap_sem. That is deadlockable in presence of other requests for exclusive > locking... Well, that's assuming that exclusive lock requests form a barrier to new shared requests. Remember that I'm talking about a range lock here, which we can make up whatever semantics we'd need, including having "shared lock if already locked shared" nested locking semantics which avoids this page-fault-in-buffered-IO-copy-in/out problem.... It also allows writeback to work the same way it does right now when we take a page fault on a page that is under writeback > Luckily, with buffered writes the situation isn't that bad. You > need mmap_sem only before each page is processed (in > iov_iter_fault_in_readable()). Later on in the loop we use > iov_iter_copy_from_user_atomic() which doesn't need mmap_sem. So we can > just get our shared lock after iov_iter_fault_in_readable() (or simply > leave it for ->write_begin() if we want to give control over the locking to > filesystems). That would probably work as well, but it is much more likely that people would get it wrong as opposed to special casing the nested lock semantic in the page fault code...
> > We'd need some kind of efficient shared/exclusive range lock for > > this sort of exclusion, and it's entirely possible that it would > > have too much overhead to be acceptible in the page fault path. It's > > the best I can think of right now..... > > > > As it is, a range lock of this kind would be very handy for other > > things, too (like the IO exclusion locks so we can do concurrent > > buffered writes in XFS ;). > Yes, that's what I thought as well. In particular it should be pretty > efficient in locking a single page range because that's going to be > majority of calls. I'll try to write something and see how fast it can > be... Cool. :) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
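Dave's "shared lock if already locked shared" nesting semantics can be modelled in userspace: exclusive waiters normally form a barrier to *new* shared lockers (writer preference), but a shared request from a thread that already holds the lock shared is granted anyway, so a page fault nested inside a buffered-IO copy cannot deadlock. A toy sketch (illustrative, not kernel code; range bookkeeping elided for brevity):

```python
import threading

class NestableSharedLock:
    """Toy model: queued exclusive requests bar new shared lockers,
    except for a shared request nested inside an existing shared hold
    by the same thread, which is granted immediately."""
    def __init__(self):
        self._cond = threading.Condition()
        self._shared = {}        # thread ident -> shared nesting count
        self._excl_waiting = 0   # queued exclusive requests
        self._excl_held = False

    def lock_shared(self):
        me = threading.get_ident()
        with self._cond:
            if self._shared.get(me):           # nested: grant immediately
                self._shared[me] += 1
                return
            while self._excl_held or self._excl_waiting:
                self._cond.wait()
            self._shared[me] = 1

    def unlock_shared(self):
        me = threading.get_ident()
        with self._cond:
            self._shared[me] -= 1
            if not self._shared[me]:
                del self._shared[me]
            self._cond.notify_all()

    def lock_exclusive(self):
        with self._cond:
            self._excl_waiting += 1
            while self._excl_held or self._shared:
                self._cond.wait()
            self._excl_waiting -= 1
            self._excl_held = True

    def unlock_exclusive(self):
        with self._cond:
            self._excl_held = False
            self._cond.notify_all()
```

With plain rwsem semantics the nested `lock_shared()` would queue behind the exclusive waiter and deadlock; here it goes through, at the cost of the ordering/starvation questions debated below.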
* Re: Hole punching and mmap races 2012-06-06 0:06 ` Dave Chinner @ 2012-06-06 9:58 ` Jan Kara 2012-06-06 13:36 ` Dave Chinner 0 siblings, 1 reply; 20+ messages in thread From: Jan Kara @ 2012-06-06 9:58 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm On Wed 06-06-12 10:06:36, Dave Chinner wrote: > On Wed, Jun 06, 2012 at 01:15:30AM +0200, Jan Kara wrote: > > On Tue 05-06-12 15:51:50, Dave Chinner wrote: > > > On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote: > > > > > To me the issue at hand is that we have no method of serialising > > > > > multi-page operations on the mapping tree between the filesystem and > > > > > the VM, and that seems to be the fundamental problem we face in this > > > > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency. > > > > > Hence it might be better to try to work out how to fix this entire > > > > > class of problems rather than just adding a complex kuldge that just > > > > > papers over the current "hot" symptom.... > > > > Yes, looking at the above table, the amount of different synchronization > > > > mechanisms is really striking. So probably we should look at some > > > > possibility of unifying at least some cases. > > > > > > It seems to me that we need some thing in between the fine grained > > > page lock and the entire-file IO exclusion lock. We need to maintain > > > fine grained locking for mmap scalability, but we also need to be > > > able to atomically lock ranges of pages. > > Yes, we also need to keep things fine grained to keep scalability of > > direct IO and buffered reads... > > > > > I guess if we were to nest a fine grained multi-state lock > > > inside both the IO exclusion lock and the mmap_sem, we might be able > > > to kill all problems in one go. 
> > > > > > Exclusive access on a range needs to be granted to: > > > > > > - direct IO > > > - truncate > > > - hole punch > > > > > > so they can be serialised against mmap based page faults, writeback > > > and concurrent buffered IO. Serialisation against themselves is an > > > IO/fs exclusion problem. > > > > > > Shared access for traversal or modification needs to be granted to: > > > > > > - buffered IO > > > - mmap page faults > > > - writeback > > > > > > Each of these cases can rely on the existing page locks or IO > > > exclusion locks to provide safety for concurrent access to the same > > > ranges. This means that once we have access granted to a range we > > > can check truncate races once and ignore the problem until we drop > > > the access. And the case of taking a page fault within a buffered > > > IO won't deadlock because both take a shared lock.... > > You cannot just use a lock (not even a shared one) both above and under > > mmap_sem. That is deadlockable in presence of other requests for exclusive > > locking... > > Well, that's assuming that exclusive lock requests form a barrier to > new shared requests. Remember that I'm talking about a range lock > here, which we can make up whatever semantics we'd need, including > having "shared lock if already locked shared" nested locking > semantics which avoids this page-fault-in-buffered-IO-copy-in/out > problem.... That's true. But if you have semantics like this, constant writing to or reading from a file could starve e.g. truncate. So I'd prefer not to open this can of worms and keep semantics of rw semaphores if possible. Furthermore, with direct IO you have to set in stone the ordering of mmap_sem and range lock anyway because there we need an exclusive lock. > It also allows writeback to work the same way it does write now when > we take a page fault on a page that is under writeback I'm not sure what would be the difference regardless of which semantics we choose... 
> > Luckily, with buffered writes the situation isn't that bad. You > > need mmap_sem only before each page is processed (in > > iov_iter_fault_in_readable()). Later on in the loop we use > > iov_iter_copy_from_user_atomic() which doesn't need mmap_sem. So we can > > just get our shared lock after iov_iter_fault_in_readable() (or simply > > leave it for ->write_begin() if we want to give control over the locking to > > filesystems). > > That would probably work as well, but it much more likely that > people would get it wrong as opposed to special casing the nested > lock semantic in the page fault code... I suppose some helper functions could make this easier... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Hole punching and mmap races 2012-06-06 9:58 ` Jan Kara @ 2012-06-06 13:36 ` Dave Chinner 2012-06-07 21:58 ` Jan Kara 0 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2012-06-06 13:36 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm On Wed, Jun 06, 2012 at 11:58:27AM +0200, Jan Kara wrote: > On Wed 06-06-12 10:06:36, Dave Chinner wrote: > > On Wed, Jun 06, 2012 at 01:15:30AM +0200, Jan Kara wrote: > > > On Tue 05-06-12 15:51:50, Dave Chinner wrote: > > > > On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote: > > > > > > To me the issue at hand is that we have no method of serialising > > > > > > multi-page operations on the mapping tree between the filesystem and > > > > > > the VM, and that seems to be the fundamental problem we face in this > > > > > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency. > > > > > > Hence it might be better to try to work out how to fix this entire > > > > > > class of problems rather than just adding a complex kuldge that just > > > > > > papers over the current "hot" symptom.... > > > > > Yes, looking at the above table, the amount of different synchronization > > > > > mechanisms is really striking. So probably we should look at some > > > > > possibility of unifying at least some cases. > > > > > > > > It seems to me that we need some thing in between the fine grained > > > > page lock and the entire-file IO exclusion lock. We need to maintain > > > > fine grained locking for mmap scalability, but we also need to be > > > > able to atomically lock ranges of pages. > > > Yes, we also need to keep things fine grained to keep scalability of > > > direct IO and buffered reads... > > > > > > > I guess if we were to nest a fine grained multi-state lock > > > > inside both the IO exclusion lock and the mmap_sem, we might be able > > > > to kill all problems in one go. 
> > > > > > > > Exclusive access on a range needs to be granted to: > > > > > > > > - direct IO > > > > - truncate > > > > - hole punch > > > > > > > > so they can be serialised against mmap based page faults, writeback > > > > and concurrent buffered IO. Serialisation against themselves is an > > > > IO/fs exclusion problem. > > > > > > > > Shared access for traversal or modification needs to be granted to: > > > > > > > > - buffered IO > > > > - mmap page faults > > > > - writeback > > > > > > > > Each of these cases can rely on the existing page locks or IO > > > > exclusion locks to provide safety for concurrent access to the same > > > > ranges. This means that once we have access granted to a range we > > > > can check truncate races once and ignore the problem until we drop > > > > the access. And the case of taking a page fault within a buffered > > > > IO won't deadlock because both take a shared lock.... > > > You cannot just use a lock (not even a shared one) both above and under > > > mmap_sem. That is deadlockable in presence of other requests for exclusive > > > locking... > > > > Well, that's assuming that exclusive lock requests form a barrier to > > new shared requests. Remember that I'm talking about a range lock > > here, which we can make up whatever semantics we'd need, including > > having "shared lock if already locked shared" nested locking > > semantics which avoids this page-fault-in-buffered-IO-copy-in/out > > problem.... > That's true. But if you have semantics like this, constant writing to > or reading from a file could starve e.g. truncate. So I'd prefer not to > open this can of worms and keep semantics of rw semaphores if possible. Except truncate uses the i_mutex/i_iolock for exclusion, so it would never get held off any more than it already does by buffered IO in this case. i.e. 
the mapping tree range lock is inside the locks used for truncate serialisation, so we don't have a situation where other operations would be held off by such an IO pattern... > Furthermore, with direct IO you have to set in stone the ordering of > mmap_sem and range lock anyway because there we need an exclusive lock. Yes, mmap basically requires exclusive mmap_sem -> shared range lock ordering. For direct IO, we only need the mmap_sem for the get_user_pages() call IIRC, so that requires exclusive range lock -> shared mmap_sem ordering. Unless we can lift the range lock in the mmap path outside the mmap_sem, we're still in the same boat.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
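The ordering conflict Dave describes - page faults take mmap_sem then the range lock, while direct IO wants the range lock then mmap_sem (for get_user_pages()) - is the classic ABBA pattern. A small userspace demonstration (plain mutexes standing in for the two locks; timeouts are used only so the demo reports the deadlock instead of hanging):

```python
import threading

# Stand-ins for the two locks; the ABBA ordering between them is the issue:
#   page fault:  mmap_sem -> range lock
#   direct IO:   range lock -> mmap_sem (for get_user_pages())
mmap_sem = threading.Lock()
range_lock = threading.Lock()

outcome = {}
fault_holds_mmap_sem = threading.Event()
dio_holds_range_lock = threading.Event()
fault_done = threading.Event()
dio_done = threading.Event()

def page_fault():
    mmap_sem.acquire()
    fault_holds_mmap_sem.set()
    dio_holds_range_lock.wait()          # DIO side now owns the range lock
    outcome["fault"] = range_lock.acquire(timeout=0.5)
    fault_done.set()
    dio_done.wait()                      # hold mmap_sem until DIO's attempt ends
    mmap_sem.release()

def direct_io():
    range_lock.acquire()
    dio_holds_range_lock.set()
    fault_holds_mmap_sem.wait()          # fault side now owns mmap_sem
    outcome["dio"] = mmap_sem.acquire(timeout=0.5)
    dio_done.set()
    fault_done.wait()                    # hold range lock until fault's attempt ends
    range_lock.release()

t1 = threading.Thread(target=page_fault)
t2 = threading.Thread(target=direct_io)
t1.start(); t2.start()
t1.join(); t2.join()

# Neither side can take its second lock; with untimed locks this is a deadlock.
assert outcome == {"fault": False, "dio": False}
```

This is why a single ordering has to be fixed, or the range lock lifted outside mmap_sem in the mmap path, as the message above discusses.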
* Re: Hole punching and mmap races 2012-06-06 13:36 ` Dave Chinner @ 2012-06-07 21:58 ` Jan Kara 2012-06-08 0:57 ` Dave Chinner 0 siblings, 1 reply; 20+ messages in thread From: Jan Kara @ 2012-06-07 21:58 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm On Wed 06-06-12 23:36:16, Dave Chinner wrote: > On Wed, Jun 06, 2012 at 11:58:27AM +0200, Jan Kara wrote: > > On Wed 06-06-12 10:06:36, Dave Chinner wrote: > > > On Wed, Jun 06, 2012 at 01:15:30AM +0200, Jan Kara wrote: > > > > On Tue 05-06-12 15:51:50, Dave Chinner wrote: > > > > > On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote: > > > > > > > To me the issue at hand is that we have no method of serialising > > > > > > > multi-page operations on the mapping tree between the filesystem and > > > > > > > the VM, and that seems to be the fundamental problem we face in this > > > > > > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency. > > > > > > > Hence it might be better to try to work out how to fix this entire > > > > > > > class of problems rather than just adding a complex kuldge that just > > > > > > > papers over the current "hot" symptom.... > > > > > > Yes, looking at the above table, the amount of different synchronization > > > > > > mechanisms is really striking. So probably we should look at some > > > > > > possibility of unifying at least some cases. > > > > > > > > > > It seems to me that we need some thing in between the fine grained > > > > > page lock and the entire-file IO exclusion lock. We need to maintain > > > > > fine grained locking for mmap scalability, but we also need to be > > > > > able to atomically lock ranges of pages. > > > > Yes, we also need to keep things fine grained to keep scalability of > > > > direct IO and buffered reads... 
> > > > > > > > > I guess if we were to nest a fine grained multi-state lock > > > > > inside both the IO exclusion lock and the mmap_sem, we might be able > > > > > to kill all problems in one go. > > > > > > > > > > Exclusive access on a range needs to be granted to: > > > > > > > > > > - direct IO > > > > > - truncate > > > > > - hole punch > > > > > > > > > > so they can be serialised against mmap based page faults, writeback > > > > > and concurrent buffered IO. Serialisation against themselves is an > > > > > IO/fs exclusion problem. > > > > > > > > > > Shared access for traversal or modification needs to be granted to: > > > > > > > > > > - buffered IO > > > > > - mmap page faults > > > > > - writeback > > > > > > > > > > Each of these cases can rely on the existing page locks or IO > > > > > exclusion locks to provide safety for concurrent access to the same > > > > > ranges. This means that once we have access granted to a range we > > > > > can check truncate races once and ignore the problem until we drop > > > > > the access. And the case of taking a page fault within a buffered > > > > > IO won't deadlock because both take a shared lock.... > > > > You cannot just use a lock (not even a shared one) both above and under > > > > mmap_sem. That is deadlockable in presence of other requests for exclusive > > > > locking... > > > > > > Well, that's assuming that exclusive lock requests form a barrier to > > > new shared requests. Remember that I'm talking about a range lock > > > here, which we can make up whatever semantics we'd need, including > > > having "shared lock if already locked shared" nested locking > > > semantics which avoids this page-fault-in-buffered-IO-copy-in/out > > > problem.... > > That's true. But if you have semantics like this, constant writing to > > or reading from a file could starve e.g. truncate. So I'd prefer not to > > open this can of worms and keep semantics of rw semaphores if possible. 
> > Except truncate uses the i_mutex/i_iolock for exclusion, so it would > never get held off any more than it already does by buffered IO in > this case. i.e. the mapping tree range lock is inside the locks used > for truncate serialisation, so we don't have a situation where other > operations woul dbe held off by such an IO pattern... True. I was just hoping that i_mutex won't be needed if we get our new lock right. > > Furthermore, with direct IO you have to set in stone the ordering of > > mmap_sem and range lock anyway because there we need an exclusive lock. > > Yes, mmap basically requires exclusive mmap_sem->shared range lock ordering. For > direct IO, we only need the mmap_sem for the get_user_pages() call > IIRC, so that requires exclusive range lock-> shared mmap_sem > ordering. Unless we can lift the range lock in the mmap path outside > the mmap_sem, we're still in the same boat.... Yes. Just as I said before, it is not an unsolvable situation, since for direct IO we can grab our lock after the get_user_pages() call. I was thinking about it for a while and I realized that if we have a range lock that we take during page fault, buffered IO, direct IO, truncate, and punch hole, we actually don't need a shared version of it. Just the fact that it would be a range lock is enough to avoid serialization. Also I was thinking that since lock ordering forces our new lock to be relatively close to the page lock, and the page lock is serializing quite a few IO operations anyway, it might be workable (and actually reasonably easy to grasp for developers) if the range lock is coupled with the page lock - i.e. locking a range will be equivalent to locking each page in a range, except that this way you can "lock" pages which are not present and thus forbid their creation.
Also we could implement the common case of locking a range containing a single page by just taking the page lock, so we save a modification of the interval tree in the common case and generally keep the tree smaller (of course, at the cost of somewhat slowing down cases where we want to lock larger ranges). Using something like this in ext4 should be reasonably easy; we'd just need to change some code to move the page lock up in the locking order to make locking of ranges useful. Would it also be usable for XFS? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Hole punching and mmap races 2012-06-07 21:58 ` Jan Kara @ 2012-06-08 0:57 ` Dave Chinner 2012-06-08 21:36 ` Jan Kara 0 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2012-06-08 0:57 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm On Thu, Jun 07, 2012 at 11:58:35PM +0200, Jan Kara wrote: > On Wed 06-06-12 23:36:16, Dave Chinner wrote: > > On Wed, Jun 06, 2012 at 11:58:27AM +0200, Jan Kara wrote: > > > On Wed 06-06-12 10:06:36, Dave Chinner wrote: > > > > On Wed, Jun 06, 2012 at 01:15:30AM +0200, Jan Kara wrote: > > > > > On Tue 05-06-12 15:51:50, Dave Chinner wrote: > > > > > > On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote: > > > > > > > > To me the issue at hand is that we have no method of serialising > > > > > > > > multi-page operations on the mapping tree between the filesystem and > > > > > > > > the VM, and that seems to be the fundamental problem we face in this > > > > > > > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency. > > > > > > > > Hence it might be better to try to work out how to fix this entire > > > > > > > > class of problems rather than just adding a complex kludge that just > > > > > > > > papers over the current "hot" symptom.... > > > > > > > Yes, looking at the above table, the number of different synchronization > > > > > > > mechanisms is really striking. So probably we should look at some > > > > > > > possibility of unifying at least some cases. > > > > > > > > > > > > It seems to me that we need something in between the fine grained > > > > > > page lock and the entire-file IO exclusion lock. We need to maintain > > > > > > fine grained locking for mmap scalability, but we also need to be > > > > > > able to atomically lock ranges of pages. > > > > > Yes, we also need to keep things fine grained to keep scalability of > > > > > direct IO and buffered reads...
> > > > > > > > > > > I guess if we were to nest a fine grained multi-state lock > > > > > > inside both the IO exclusion lock and the mmap_sem, we might be able > > > > > > to kill all problems in one go. > > > > > > > > > > > > Exclusive access on a range needs to be granted to: > > > > > > > > > > > > - direct IO > > > > > > - truncate > > > > > > - hole punch > > > > > > > > > > > > so they can be serialised against mmap based page faults, writeback > > > > > > and concurrent buffered IO. Serialisation against themselves is an > > > > > > IO/fs exclusion problem. > > > > > > > > > > > > Shared access for traversal or modification needs to be granted to: > > > > > > > > > > > > - buffered IO > > > > > > - mmap page faults > > > > > > - writeback > > > > > > > > > > > > Each of these cases can rely on the existing page locks or IO > > > > > > exclusion locks to provide safety for concurrent access to the same > > > > > > ranges. This means that once we have access granted to a range we > > > > > > can check truncate races once and ignore the problem until we drop > > > > > > the access. And the case of taking a page fault within a buffered > > > > > > IO won't deadlock because both take a shared lock.... > > > > > You cannot just use a lock (not even a shared one) both above and under > > > > > mmap_sem. That is deadlockable in presence of other requests for exclusive > > > > > locking... > > > > > > > > Well, that's assuming that exclusive lock requests form a barrier to > > > > new shared requests. Remember that I'm talking about a range lock > > > > here, which we can make up whatever semantics we'd need, including > > > > having "shared lock if already locked shared" nested locking > > > > semantics which avoids this page-fault-in-buffered-IO-copy-in/out > > > > problem.... > > > That's true. But if you have semantics like this, constant writing to > > > or reading from a file could starve e.g. truncate. 
So I'd prefer not to > > > open this can of worms and keep semantics of rw semaphores if possible. > > Except truncate uses the i_mutex/i_iolock for exclusion, so it would > never get held off any more than it already does by buffered IO in > this case. i.e. the mapping tree range lock is inside the locks used > for truncate serialisation, so we don't have a situation where other > operations would be held off by such an IO pattern... > True. I was just hoping that i_mutex won't be needed if we get our new > lock right. > > > > Furthermore, with direct IO you have to set in stone the ordering of > > > mmap_sem and range lock anyway because there we need an exclusive lock. > > > > Yes, mmap basically requires exclusive mmap_sem -> shared range lock ordering. For > > direct IO, we only need the mmap_sem for the get_user_pages() call > > IIRC, so that requires exclusive range lock -> shared mmap_sem > > ordering. Unless we can lift the range lock in the mmap path outside > > the mmap_sem, we're still in the same boat.... > Yes. Just as I said before, it is not an unsolvable situation since for > direct IO we can grab our lock after the get_user_pages() call. I was thinking > about it for a while and I realized that if we have a range lock that we > take during page fault, buffered IO, direct IO, truncate, and punch hole, we > actually don't need a shared version of it. Just the fact that it is a range > lock is enough to avoid unnecessary serialization. Well, we currently allow overlapping direct IO to the same range in XFS. The order of completion is undefined, just like concurrent IOs to the same sector of a disk, but it basically results in direct IO on an XFS file behaving exactly the same way as concurrent IO to a raw device would....
Now, that does cause some problems for naive users of direct IO (just like those same naive users mix mmap, buffered and direct IO to the same file), but if we are going to make everything else coherent then I have no problems with dropping this functionality and only allowing concurrency for non-overlapping requests. The other thing is that concurrent overlapping buffered reads to the same range will only serialise on page locks if the range lock can be taken shared, so there would be no change in behaviour. Making the range lock exclusive would serialise the overlapping reads at a much higher level and will result in a change of cached read behaviour and potentially a significant reduction in performance. This may be a corner case no-one cares about, but exclusive range locking will definitely impact such a workload.... > Also I was thinking that since lock ordering forces our new lock to be > relatively close to the page lock, and the page lock is serializing quite a few > IO operations anyway, That's the observation that has led me to call it a "mapping tree" lock. > it might be workable (and actually reasonably easy > to grasp for developers) if the range lock is coupled with the page lock - i.e. > locking a range will be equivalent to locking each page in a range, except > that this way you can "lock" pages which are not present and thus forbid > their creation. Nice idea, but I think that is introducing requirements and potential complexity way beyond what we need right now. It's already a complex problem, so let's not make it any harder to validate than it already will be... :/ As it is, we can't "forbid" the creation of pages - we control access to the mapping tree, which allows us to prevent insertion or removal of pages from a given range. That's exactly how it will serialise DIO against buffered/mmap IO, hole punch/truncate against mmap, and so on. It's a mapping tree access control mechanism, not a page serialisation mechanism...
> Also we could implement the common case of locking a range > containing a single page by just taking the page lock, so we save modifying the > interval tree in the common case and generally make the tree smaller (of > course, at the cost of somewhat slowing down cases where we want to lock > larger ranges). That seems like premature optimisation to me, and all the cases I think we need to care about are locking large ranges of the tree. Let's measure what the overhead of tracking everything in a single tree is first so we can then see what needs optimising... Also, I don't think it can replace the page lock entirely, because the page lock also serialises against other parts of the VM (e.g. memory reclaim). I'd prefer not to complicate the issue by trying to be fancy - it's going to be hard enough to validate that it works correctly without having to validate that it does not introduce races that were previously prevented by the page lock. > Using something like this in ext4 should be reasonably easy, we'd just need > to change some code to move the page lock up in the locking order to make locking > of ranges useful. Would it also be usable for XFS? I don't think we need to change the page locking at all - just inserting the range locks in the right place should be enough. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: dont@kvack.org ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Hole punching and mmap races 2012-06-08 0:57 ` Dave Chinner @ 2012-06-08 21:36 ` Jan Kara 2012-06-08 23:06 ` Dave Chinner 0 siblings, 1 reply; 20+ messages in thread From: Jan Kara @ 2012-06-08 21:36 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm On Fri 08-06-12 10:57:00, Dave Chinner wrote: > On Thu, Jun 07, 2012 at 11:58:35PM +0200, Jan Kara wrote: > > On Wed 06-06-12 23:36:16, Dave Chinner wrote: > > > On Wed, Jun 06, 2012 at 11:58:27AM +0200, Jan Kara wrote: > > > > On Wed 06-06-12 10:06:36, Dave Chinner wrote: > > > > > On Wed, Jun 06, 2012 at 01:15:30AM +0200, Jan Kara wrote: > > > > > > On Tue 05-06-12 15:51:50, Dave Chinner wrote: > > > > > > > On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote: > > > > > > > > > To me the issue at hand is that we have no method of serialising > > > > > > > > > multi-page operations on the mapping tree between the filesystem and > > > > > > > > > the VM, and that seems to be the fundamental problem we face in this > > > > > > > > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency. > > > > > > > > > Hence it might be better to try to work out how to fix this entire > > > > > > > > > class of problems rather than just adding a complex kludge that just > > > > > > > > > papers over the current "hot" symptom.... > > > > > > > > Yes, looking at the above table, the number of different synchronization > > > > > > > > mechanisms is really striking. So probably we should look at some > > > > > > > > possibility of unifying at least some cases. > > > > > > > > > > > > > > It seems to me that we need something in between the fine grained > > > > > > > page lock and the entire-file IO exclusion lock. We need to maintain > > > > > > > fine grained locking for mmap scalability, but we also need to be > > > > > > > able to atomically lock ranges of pages.
> > > > > > Yes, we also need to keep things fine grained to keep scalability of > > > > > > direct IO and buffered reads... > > > > > > > > > > > > > I guess if we were to nest a fine grained multi-state lock > > > > > > > inside both the IO exclusion lock and the mmap_sem, we might be able > > > > > > > to kill all problems in one go. > > > > > > > > > > > > > > Exclusive access on a range needs to be granted to: > > > > > > > > > > > > > > - direct IO > > > > > > > - truncate > > > > > > > - hole punch > > > > > > > > > > > > > > so they can be serialised against mmap based page faults, writeback > > > > > > > and concurrent buffered IO. Serialisation against themselves is an > > > > > > > IO/fs exclusion problem. > > > > > > > > > > > > > > Shared access for traversal or modification needs to be granted to: > > > > > > > > > > > > > > - buffered IO > > > > > > > - mmap page faults > > > > > > > - writeback > > > > > > > > > > > > > > Each of these cases can rely on the existing page locks or IO > > > > > > > exclusion locks to provide safety for concurrent access to the same > > > > > > > ranges. This means that once we have access granted to a range we > > > > > > > can check truncate races once and ignore the problem until we drop > > > > > > > the access. And the case of taking a page fault within a buffered > > > > > > > IO won't deadlock because both take a shared lock.... > > > > > > You cannot just use a lock (not even a shared one) both above and under > > > > > > mmap_sem. That is deadlockable in presence of other requests for exclusive > > > > > > locking... > > > > > > > > > > Well, that's assuming that exclusive lock requests form a barrier to > > > > > new shared requests. 
Remember that I'm talking about a range lock > > > > > here, which we can make up whatever semantics we'd need, including > > > > > having "shared lock if already locked shared" nested locking > > > > > semantics which avoids this page-fault-in-buffered-IO-copy-in/out > > > > > problem.... > > > > That's true. But if you have semantics like this, constant writing to > > > > or reading from a file could starve e.g. truncate. So I'd prefer not to > > > > open this can of worms and keep semantics of rw semaphores if possible. > > > > > > Except truncate uses the i_mutex/i_iolock for exclusion, so it would > > > never get held off any more than it already does by buffered IO in > > > this case. i.e. the mapping tree range lock is inside the locks used > > > for truncate serialisation, so we don't have a situation where other > > > operations would be held off by such an IO pattern... > > True. I was just hoping that i_mutex won't be needed if we get our new > > lock right. > > > > > > Furthermore, with direct IO you have to set in stone the ordering of > > > > mmap_sem and range lock anyway because there we need an exclusive lock. > > > > > > Yes, mmap basically requires exclusive mmap_sem -> shared range lock ordering. For > > > direct IO, we only need the mmap_sem for the get_user_pages() call > > > IIRC, so that requires exclusive range lock -> shared mmap_sem > > > ordering. Unless we can lift the range lock in the mmap path outside > > > the mmap_sem, we're still in the same boat.... > > Yes. Just as I said before, it is not an unsolvable situation since for > > direct IO we can grab our lock after the get_user_pages() call. I was thinking > > about it for a while and I realized that if we have a range lock that we > > take during page fault, buffered IO, direct IO, truncate, and punch hole, we > > actually don't need a shared version of it. Just the fact that it is a range > > lock is enough to avoid unnecessary serialization.
> > Well, we currently allow overlapping direct IO to the same range in > XFS. The order of completion is undefined, just like concurrent IOs > to the same sector of a disk, but it basically results in direct IO > on an XFS file behaving exactly the same way as concurrent IO to a raw > device would.... > > Now, that does cause some problems for naive users of direct IO > (just like those same naive users mix mmap, buffered and direct IO to > the same file), but if we are going to make everything else coherent > then I have no problems with dropping this functionality and only > allowing concurrency for non-overlapping requests. Yeah, overlapping direct IO is asking for trouble (except possibly two direct IO reads). > The other thing is that concurrent overlapping buffered reads to the > same range will only serialise on page locks if the range lock can be > taken shared, so there would be no change in behaviour. Making the > range lock exclusive would serialise the overlapping reads at a much > higher level and will result in a change of cached read behaviour > and potentially a significant reduction in performance. This may be > a corner case no-one cares about, but exclusive range locking will > definitely impact such a workload.... For buffered reads we would lock page-by-page anyway (again due to locking constraints with mmap_sem when copying what we've read) so there shouldn't be any difference to the current level of concurrency. > > Also I was thinking that since lock ordering forces our new lock to be > > relatively close to the page lock, and the page lock is serializing quite a few > > IO operations anyway, > > That's the observation that has led me to call it a "mapping > tree" lock. I've internally called the locking function lock_mapping_range() ;). > > it might be workable (and actually reasonably easy > > to grasp for developers) if the range lock is coupled with the page lock - i.e.
> > locking a range will be equivalent to locking each page in a range, except > > that this way you can "lock" pages which are not present and thus forbid > > their creation. > > Nice idea, but I think that is introducing requirements and > potential complexity way beyond what we need right now. It's already > a complex problem, so let's not make it any harder to validate than > it already will be... :/ > > As it is, we can't "forbid" the creation of pages - we control > access to the mapping tree, which allows us to prevent insertion or > removal of pages from a given range. That's exactly how it will > serialise DIO against buffered/mmap IO, hole punch/truncate against > mmap, and so on. It's a mapping tree access control mechanism, not a > page serialisation mechanism... I agree, I was imprecise here and it's good to realize exactly what we guard. > > Also we could implement the common case of locking a range > > containing a single page by just taking the page lock, so we save modifying the > > interval tree in the common case and generally make the tree smaller (of > > course, at the cost of somewhat slowing down cases where we want to lock > > larger ranges). > > That seems like premature optimisation to me, and all the cases I > think we need to care about are locking large ranges of the tree. > Let's measure what the overhead of tracking everything in a single > tree is first so we can then see what needs optimising... Umm, I agree that initially we probably want just to have the mapping range lock ability, stick it somewhere in the IO path and make things work. Then we can look into making it faster / merging it with the page lock. However, I disagree that we care most about locking large ranges. For all buffered IO and all page faults we need to lock a range containing just a single page. We cannot lock more due to locking constraints with mmap_sem. So the places that will lock larger ranges are: direct IO, truncate, punch hole.
Writeback actually doesn't seem to need any additional protection, at least as I've sketched out things so far. So single-page ranges matter at least as much as longer ranges. That's why I came up with that page lock optimisation and merging... > Also, I don't think it can replace the page lock entirely, because > the page lock also serialises against other parts of the VM > (e.g. memory reclaim). I'd prefer not to complicate the issue by > trying to be fancy - it's going to be hard enough to validate that > it works correctly without having to validate that it does not > introduce races that were previously prevented by the page lock. So I never meant to completely replace the page lock. I agree that would be almost impossible. I rather meant to add a capability of locking a range of pages. But let's leave that for now. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Hole punching and mmap races 2012-06-08 21:36 ` Jan Kara @ 2012-06-08 23:06 ` Dave Chinner 2012-06-12 8:56 ` Jan Kara 0 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2012-06-08 23:06 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm On Fri, Jun 08, 2012 at 11:36:29PM +0200, Jan Kara wrote: > On Fri 08-06-12 10:57:00, Dave Chinner wrote: > > On Thu, Jun 07, 2012 at 11:58:35PM +0200, Jan Kara wrote: > > > On Wed 06-06-12 23:36:16, Dave Chinner wrote: > > > Also we could implement the common case of locking a range > > > containing a single page by just taking the page lock, so we save modifying the > > > interval tree in the common case and generally make the tree smaller (of > > > course, at the cost of somewhat slowing down cases where we want to lock > > > larger ranges). > > > > That seems like premature optimisation to me, and all the cases I > > think we need to care about are locking large ranges of the tree. > > Let's measure what the overhead of tracking everything in a single > > tree is first so we can then see what needs optimising... > Umm, I agree that initially we probably want just to have the mapping > range lock ability, stick it somewhere in the IO path and make things work. > Then we can look into making it faster / merging it with the page lock. > > However, I disagree that we care most about locking large ranges. For all > buffered IO and all page faults we need to lock a range containing just a > single page. We cannot lock more due to locking constraints with mmap_sem. Not sure I understand what that constraint is - I have been thinking that the buffered IO range lock would be across the entire IO, not individual pages. As it is, if we want to do multipage writes (and we do), we have to be able to lock a range of the mapping in the buffered IO path at a time... > > So the places that will lock larger ranges are: direct IO, truncate, punch > hole.
Writeback actually doesn't seem to need any additional protection at > least as I've sketched out things so far. AFAICT, writeback needs protection against punching holes, just like mmap does, because they use the same "avoid truncated pages" mechanism. > So single-page ranges matter at least as much as longer ranges. That's why > I came up with that page lock optimisation and merging... I agree they are common, but let's measure the overhead first before trying to optimise/special-case certain behaviours.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Hole punching and mmap races 2012-06-08 23:06 ` Dave Chinner @ 2012-06-12 8:56 ` Jan Kara 0 siblings, 0 replies; 20+ messages in thread From: Jan Kara @ 2012-06-12 8:56 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, linux-fsdevel, xfs, linux-ext4, Hugh Dickins, linux-mm On Sat 09-06-12 09:06:16, Dave Chinner wrote: > On Fri, Jun 08, 2012 at 11:36:29PM +0200, Jan Kara wrote: > > On Fri 08-06-12 10:57:00, Dave Chinner wrote: > > > On Thu, Jun 07, 2012 at 11:58:35PM +0200, Jan Kara wrote: > > > > On Wed 06-06-12 23:36:16, Dave Chinner wrote: > > > > Also we could implement the common case of locking a range > > > > containing a single page by just taking the page lock, so we save modifying the > > > > interval tree in the common case and generally make the tree smaller (of > > > > course, at the cost of somewhat slowing down cases where we want to lock > > > > larger ranges). > > > > > > That seems like premature optimisation to me, and all the cases I > > > think we need to care about are locking large ranges of the tree. > > > Let's measure what the overhead of tracking everything in a single > > > tree is first so we can then see what needs optimising... > > Umm, I agree that initially we probably want just to have the mapping > > range lock ability, stick it somewhere in the IO path and make things work. > > Then we can look into making it faster / merging it with the page lock. > > > > However, I disagree that we care most about locking large ranges. For all > > buffered IO and all page faults we need to lock a range containing just a > > single page. We cannot lock more due to locking constraints with mmap_sem. > > Not sure I understand what that constraint is - I have been thinking > that the buffered IO range lock would be across the entire IO, not > individual pages. > > As it is, if we want to do multipage writes (and we do), we have to > be able to lock a range of the mapping in the buffered IO path at a > time... The problem is that the buffered IO path does (e.g.
in generic_perform_write()):

  iov_iter_fault_in_readable() - faults in one page worth of buffers, takes mmap_sem
  ->write_begin()
  copy data - iov_iter_copy_from_user_atomic()
  ->write_end()

So we take mmap_sem before writing every page. We could fault in more, but that increases the risk of iov_iter_copy_from_user_atomic() failing because the page got reclaimed before we got to it. So the amount we fault in would have to adapt to the current memory pressure. That's certainly possible but not related to the problem we are trying to solve now, so I'd prefer to handle it separately. > > So the places that will lock larger ranges are: direct IO, truncate, punch > > hole. Writeback actually doesn't seem to need any additional protection at > > least as I've sketched out things so far. > > AFAICT, writeback needs protection against punching holes, just like > mmap does, because they use the same "avoid truncated pages" > mechanism. If punching a hole does:

  lock_mapping_range()
  evict all pages in the range
  punch blocks
  unlock_mapping_range()

then we shouldn't race against writeback, because there are no pages in the mapping range we punch and they cannot be created there while we hold the lock. I agree this might be an unnecessary optimization, but the nice result is that we can clean dirty pages regardless of what others do with the mapping. So in case there were problems with taking the mapping lock from writeback, we could avoid that. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2012-06-12 8:56 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed -- links below jump to the message on this page --)
2012-05-15 22:48 Hole punching and mmap races Jan Kara
2012-05-16  2:14 ` Dave Chinner
2012-05-16 13:04 ` Jan Kara
2012-05-17  7:43 ` Dave Chinner
2012-05-17 23:28 ` Jan Kara
2012-05-18 10:12 ` Dave Chinner
2012-05-18 13:32 ` Jan Kara
2012-05-19  1:40 ` Dave Chinner
2012-05-24 12:35 ` Jan Kara
2012-06-05  5:51 ` Dave Chinner
2012-06-05  6:22 ` Marco Stornelli
2012-06-05 23:15 ` Jan Kara
2012-06-06  0:06 ` Dave Chinner
2012-06-06  9:58 ` Jan Kara
2012-06-06 13:36 ` Dave Chinner
2012-06-07 21:58 ` Jan Kara
2012-06-08  0:57 ` Dave Chinner
2012-06-08 21:36 ` Jan Kara
2012-06-08 23:06 ` Dave Chinner
2012-06-12  8:56 ` Jan Kara