* Re: [PATCH] mm, fs: avoid page allocation beyond i_size on read [not found] ` <1377163725.2720.18.camel@menhir> @ 2013-08-22 13:05 ` Kirill A. Shutemov 2013-08-22 13:33 ` Steven Whitehouse 0 siblings, 1 reply; 6+ messages in thread From: Kirill A. Shutemov @ 2013-08-22 13:05 UTC (permalink / raw) To: Steven Whitehouse Cc: Andrew Morton, Kirill A. Shutemov, Dave Hansen, Jan Kara, Al Viro, NeilBrown, linux-mm, linux-kernel, linux-fsdevel Steven Whitehouse wrote: > Hi, > > On Wed, 2013-08-21 at 13:58 -0700, Andrew Morton wrote: > > On Wed, 21 Aug 2013 17:42:12 +0100 Steven Whitehouse <swhiteho@redhat.com> wrote: > > > > > > I don't think the change is harmful. The worst case scenario is race with > > > > write or truncate, but it's valid to return EOF in this case. > > > > > > > > What scenario do you have in mind? > > > > > > > > > > 1. File open on node A > > > 2. Someone updates it on node B by extending the file > > > 3. Someone reads the file on node A beyond end of original file size, > > > but within end of new file size as updated by node B. Without the patch > > > this works, with it, it will fail. The reason being the i_size would not > > > be up to date until after readpage(s) has been called. CC: +linux-fsdevel@ So in this case node A will see the file like it was never touched by node B. It's okay, if new i_size will eventually reach node A. Is ->readpage() the only way to get i_size updated on node A or it will be eventually updated without it? If it's the only way, we need add a explicit way to initiate i_size sync between nodes on read. Probably, distributed filesystems should provide own ->aio_read() which deal i_size as the filesystem need. > > > I think this is likely to be an issue for any distributed fs using > > > do_generic_file_read(), although it would certainly affect GFS2, since > > > the locking is done at page cache level, > > > > Boy, that's rather subtle. I'm surprised that the generic filemap.c > > stuff works at all in that sort of scenario. > > > > Can we put the i_size check down in the no_cached_page block? afaict > > that will solve the problem without breaking GFS2 and is more > > efficient? > > > > Well I think is even more subtle, since it relies on ->readpages > updating the file size, even if it has failed to actually read the > required pages :-) Having said that, we do rely on ->readpages updating > the inode size elsewhere in this function, as per the block comment > immediately following the page_ok label. That i_size recheck was invented to cover different use case: read vs. truncate race. Userspace should not see truncate-caused zeros in buffer. It's not to prevent file extending vs. read() race. This usually harmless: data is consistent. -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] mm, fs: avoid page allocation beyond i_size on read 2013-08-22 13:05 ` [PATCH] mm, fs: avoid page allocation beyond i_size on read Kirill A. Shutemov @ 2013-08-22 13:33 ` Steven Whitehouse 2013-08-22 14:30 ` Kirill A. Shutemov 0 siblings, 1 reply; 6+ messages in thread From: Steven Whitehouse @ 2013-08-22 13:33 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Andrew Morton, Dave Hansen, Jan Kara, Al Viro, NeilBrown, linux-mm, linux-kernel, linux-fsdevel Hi, On Thu, 2013-08-22 at 16:05 +0300, Kirill A. Shutemov wrote: > Steven Whitehouse wrote: > > Hi, > > > > On Wed, 2013-08-21 at 13:58 -0700, Andrew Morton wrote: > > > On Wed, 21 Aug 2013 17:42:12 +0100 Steven Whitehouse <swhiteho@redhat.com> wrote: > > > > > > > > I don't think the change is harmful. The worst case scenario is race with > > > > > write or truncate, but it's valid to return EOF in this case. > > > > > > > > > > What scenario do you have in mind? > > > > > > > > > > > > > 1. File open on node A > > > > 2. Someone updates it on node B by extending the file > > > > 3. Someone reads the file on node A beyond end of original file size, > > > > but within end of new file size as updated by node B. Without the patch > > > > this works, with it, it will fail. The reason being the i_size would not > > > > be up to date until after readpage(s) has been called. > > CC: +linux-fsdevel@ > > So in this case node A will see the file like it was never touched by > node B. It's okay, if new i_size will eventually reach node A. > > Is ->readpage() the only way to get i_size updated on node A or it will be > eventually updated without it? > It will be updated by anything which takes a glock in the gfs2 case. Note that ->readpage() is not a very frequently used aop. For any filesystem with ->readpages() this should cover almost all calls to the fs for reading, with ->readpage() only used for some corner cases. So from a performance point of view, it is ->readpages() which matters most. There is no time based updating of the inode information - it relies entirely upon the locking/cache control provided by the glock layer. > If it's the only way, we need add a explicit way to initiate i_size sync > between nodes on read. Probably, distributed filesystems should provide own > ->aio_read() which deal i_size as the filesystem need. > I'd rather not do that, if we can avoid it. The current system has been carefully designed so that all the cluster fs knowledge can be hidden in the layer below the page cache. That means for a read which can be satisfied from just the page cache, cluster filesystems are the same speed as local file systems, since it is the same code path. Only when a page doesn't exist in the cache do we need to take cluster locks in order to check the file size, etc. Taking cluster locks can be expensive, since in the worst case it can involve both the local glock and dlm state machines, and remote dlm and glock state machines, network communication, and waiting for disk i/o and log flushes on (a) remote node(s). Andrew's proposed solution makes sense to me, and is probably the easiest way to solve this. > > > > I think this is likely to be an issue for any distributed fs using > > > > do_generic_file_read(), although it would certainly affect GFS2, since > > > > the locking is done at page cache level, > > > > > > Boy, that's rather subtle. I'm surprised that the generic filemap.c > > > stuff works at all in that sort of scenario. > > > > > > Can we put the i_size check down in the no_cached_page block? afaict > > > that will solve the problem without breaking GFS2 and is more > > > efficient? > > > > > > > Well I think is even more subtle, since it relies on ->readpages > > updating the file size, even if it has failed to actually read the > > required pages :-) Having said that, we do rely on ->readpages updating > > the inode size elsewhere in this function, as per the block comment > > immediately following the page_ok label. > > That i_size recheck was invented to cover different use case: read vs. > truncate race. Userspace should not see truncate-caused zeros in buffer. > It's not to prevent file extending vs. read() race. This usually harmless: > data is consistent. > Yes, it is a different use case, but the same issue applies that without a prior call to ->readpage() or ->readpages() then i_size may not be correct. The comment mentions that there needs to be an uptodate page in existence in order for i_size to be valid, and that is, at least in the GFS2 case, an equivalent condition since it implies that ->readpage() or ->readpages() must have been called since the address space was last invalidated (or alternatively that the page was populated from a local write, but the effect is the same) This is only true though, since GFS2 does its locking/caching on a per-inode basis - it someone wanted to implement a filesystem which worked on a per-page (for example) basis then the uptodate page criteria would not necessarily mean that the i_size was also uptodate. I hope that helps clarify things a bit, Steve. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] mm, fs: avoid page allocation beyond i_size on read 2013-08-22 13:33 ` Steven Whitehouse @ 2013-08-22 14:30 ` Kirill A. Shutemov 2013-08-22 14:59 ` Steven Whitehouse 0 siblings, 1 reply; 6+ messages in thread From: Kirill A. Shutemov @ 2013-08-22 14:30 UTC (permalink / raw) To: Steven Whitehouse Cc: Kirill A. Shutemov, Andrew Morton, Dave Hansen, Jan Kara, Al Viro, NeilBrown, linux-mm, linux-kernel, linux-fsdevel Steven Whitehouse wrote: > Hi, > > On Thu, 2013-08-22 at 16:05 +0300, Kirill A. Shutemov wrote: > > Steven Whitehouse wrote: > > > Hi, > > > > > > On Wed, 2013-08-21 at 13:58 -0700, Andrew Morton wrote: > > > > On Wed, 21 Aug 2013 17:42:12 +0100 Steven Whitehouse <swhiteho@redhat.com> wrote: > > > > > > > > > > I don't think the change is harmful. The worst case scenario is race with > > > > > > write or truncate, but it's valid to return EOF in this case. > > > > > > > > > > > > What scenario do you have in mind? > > > > > > > > > > > > > > > > 1. File open on node A > > > > > 2. Someone updates it on node B by extending the file > > > > > 3. Someone reads the file on node A beyond end of original file size, > > > > > but within end of new file size as updated by node B. Without the patch > > > > > this works, with it, it will fail. The reason being the i_size would not > > > > > be up to date until after readpage(s) has been called. > > > > CC: +linux-fsdevel@ > > > > So in this case node A will see the file like it was never touched by > > node B. It's okay, if new i_size will eventually reach node A. > > > > Is ->readpage() the only way to get i_size updated on node A or it will be > > eventually updated without it? > > > It will be updated by anything which takes a glock in the gfs2 case. > Note that ->readpage() is not a very frequently used aop. For any > filesystem with ->readpages() this should cover almost all calls to the > fs for reading, with ->readpage() only used for some corner cases. So > from a performance point of view, it is ->readpages() which matters > most. > > There is no time based updating of the inode information - it relies > entirely upon the locking/cache control provided by the glock layer. > > > If it's the only way, we need add a explicit way to initiate i_size sync > > between nodes on read. Probably, distributed filesystems should provide own > > ->aio_read() which deal i_size as the filesystem need. > > > I'd rather not do that, if we can avoid it. The current system has been > carefully designed so that all the cluster fs knowledge can be hidden in > the layer below the page cache. That means for a read which can be > satisfied from just the page cache, cluster filesystems are the same > speed as local file systems, since it is the same code path. Only when a > page doesn't exist in the cache do we need to take cluster locks in > order to check the file size, etc. > > Taking cluster locks can be expensive, since in the worst case it can > involve both the local glock and dlm state machines, and remote dlm and > glock state machines, network communication, and waiting for disk i/o > and log flushes on (a) remote node(s). > > Andrew's proposed solution makes sense to me, and is probably the > easiest way to solve this. Move check to no_cached_page? I don't see how it makes any difference for page cache miss case: we anyway exclude ->readpage() if it's beyond local i_size. And for cache hit local i_size will be most likely cover locally cached pages. Should we introduce an aop which can be called before i_size check in no_cached_page path to refresh local i_size? > > > > > I think this is likely to be an issue for any distributed fs using > > > > > do_generic_file_read(), although it would certainly affect GFS2, since > > > > > the locking is done at page cache level, > > > > > > > > Boy, that's rather subtle. I'm surprised that the generic filemap.c > > > > stuff works at all in that sort of scenario. > > > > > > > > Can we put the i_size check down in the no_cached_page block? afaict > > > > that will solve the problem without breaking GFS2 and is more > > > > efficient? > > > > > > > > > > Well I think is even more subtle, since it relies on ->readpages > > > updating the file size, even if it has failed to actually read the > > > required pages :-) Having said that, we do rely on ->readpages updating > > > the inode size elsewhere in this function, as per the block comment > > > immediately following the page_ok label. > > > > That i_size recheck was invented to cover different use case: read vs. > > truncate race. Userspace should not see truncate-caused zeros in buffer. > > It's not to prevent file extending vs. read() race. This usually harmless: > > data is consistent. > > > > Yes, it is a different use case, but the same issue applies that without > a prior call to ->readpage() or ->readpages() then i_size may not be > correct. I believe it's correct, but stale. I think it makes difference in the context. Use stale value for read vs. write race is okay, but not for read vs. truncate. Will i_size be set to correct value on the first file open (struct inode creation)? > The comment mentions that there needs to be an uptodate page in > existence in order for i_size to be valid, and that is, at least in the > GFS2 case, an equivalent condition since it implies that ->readpage() or > ->readpages() must have been called since the address space was last > invalidated (or alternatively that the page was populated from a local > write, but the effect is the same) > > This is only true though, since GFS2 does its locking/caching on a > per-inode basis - it someone wanted to implement a filesystem which > worked on a per-page (for example) basis then the uptodate page criteria > would not necessarily mean that the i_size was also uptodate. > > I hope that helps clarify things a bit, > > Steve. -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] mm, fs: avoid page allocation beyond i_size on read 2013-08-22 14:30 ` Kirill A. Shutemov @ 2013-08-22 14:59 ` Steven Whitehouse 2013-08-22 15:16 ` Kirill A. Shutemov 0 siblings, 1 reply; 6+ messages in thread From: Steven Whitehouse @ 2013-08-22 14:59 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Andrew Morton, Dave Hansen, Jan Kara, Al Viro, NeilBrown, linux-mm, linux-kernel, linux-fsdevel Hi, On Thu, 2013-08-22 at 17:30 +0300, Kirill A. Shutemov wrote: [snip] > > Andrew's proposed solution makes sense to me, and is probably the > > easiest way to solve this. > > Move check to no_cached_page? Yes > I don't see how it makes any difference for > page cache miss case: we anyway exclude ->readpage() if it's beyond local > i_size. > And for cache hit local i_size will be most likely cover locally cached > pages. The difference is that as the function is currently written, you cannot get to no_cached_page without first calling page_cache_sync_readahead(), i.e. ->readpages() so that i_size will have been updated, even if ->readpages() doesn't return any read-ahead pages. I guess that it is not very obvious that a call to ->readpages is hidden in page_cache_sync_readahead() but that is the path that should in the common case provide the pages from the fs, rather than the ->readpage call thats further down do_generic_file_read() > > Should we introduce an aop which can be called before i_size check in > no_cached_page path to refresh local i_size? > No, due to the call to ->readpages via page_cache_sync_readahead() it should not be necessary. [snip] > > Yes, it is a different use case, but the same issue applies that without > > a prior call to ->readpage() or ->readpages() then i_size may not be > > correct. > > I believe it's correct, but stale. I think it makes difference in the > context. Use stale value for read vs. write race is okay, but not for read > vs. truncate. > Well stale means possibly incorrect. We just don't know until we've got the cluster lock whether the data is correct or not, so it might as well be incorrect for all the difference it makes. As soon as we drop the cluster lock, the i_size may change. We always invalidate the page cache for the inode if we drop the cluster lock though, so that we know that ->readpage(s) will be called again on the next read to refresh the information. Obviously if someone called fstat, for example, on the inode in the mean time, that would also result in a refresh, as would a number of other fs operations. In both the page_ok: and proposed no_cached_page: situations the i_size can change as soon as the ->readpages call has completed, but the time difference is very small, so that returning an EOF indication in that case is not a problem. The issue with not doing a ->readpages() call prior to the i_size check is that the size may have been sitting around stale for hours, days or weeks, having been updated on other nodes, and thus it would give the wrong result. > Will i_size be set to correct value on the first file open (struct inode > creation)? > Yes, open has to read the inode in order to know what the permissions are. So the i_size will be set then, along with all the other inode information. Steve. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] mm, fs: avoid page allocation beyond i_size on read 2013-08-22 14:59 ` Steven Whitehouse @ 2013-08-22 15:16 ` Kirill A. Shutemov 2013-08-22 15:34 ` Steven Whitehouse 0 siblings, 1 reply; 6+ messages in thread From: Kirill A. Shutemov @ 2013-08-22 15:16 UTC (permalink / raw) To: Steven Whitehouse Cc: Kirill A. Shutemov, Andrew Morton, Dave Hansen, Jan Kara, Al Viro, NeilBrown, linux-mm, linux-kernel, linux-fsdevel Steven Whitehouse wrote: > Hi, > > On Thu, 2013-08-22 at 17:30 +0300, Kirill A. Shutemov wrote: > [snip] > > > Andrew's proposed solution makes sense to me, and is probably the > > > easiest way to solve this. > > > > Move check to no_cached_page? > Yes > > > I don't see how it makes any difference for > > page cache miss case: we anyway exclude ->readpage() if it's beyond local > > i_size. > > And for cache hit local i_size will be most likely cover locally cached > > pages. > The difference is that as the function is currently written, you cannot > get to no_cached_page without first calling page_cache_sync_readahead(), > i.e. ->readpages() so that i_size will have been updated, even if > ->readpages() doesn't return any read-ahead pages. > > I guess that it is not very obvious that a call to ->readpages is hidden > in page_cache_sync_readahead() but that is the path that should in the > common case provide the pages from the fs, rather than the ->readpage > call thats further down do_generic_file_read() I've checked the codepath before and to me it looks like ->readpages() will not be called beyond i_size. Codepath is: page_cache_sync_readahead() ondemand_readahead() __do_page_cache_readahead() read_pages() mapping->a_ops->readpages() But if you check __do_page_cache_readahead(): 152 static int 153 __do_page_cache_readahead(struct address_space *mapping, struct file *filp, 154 pgoff_t offset, unsigned long nr_to_read, 155 unsigned long lookahead_size) 156 { ... 163 loff_t isize = i_size_read(inode); 164 165 if (isize == 0) 166 goto out; 167 168 end_index = ((isize - 1) >> PAGE_CACHE_SHIFT); ... 173 for (page_idx = 0; page_idx < nr_to_read; page_idx++) { 174 pgoff_t page_offset = offset + page_idx; 175 176 if (page_offset > end_index) 177 break; ... 193 } ... 200 if (ret) 201 read_pages(mapping, filp, &page_pool, ret); 202 BUG_ON(!list_empty(&page_pool)); 203 out: 204 return ret; 205 } Do I miss something? -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] mm, fs: avoid page allocation beyond i_size on read 2013-08-22 15:16 ` Kirill A. Shutemov @ 2013-08-22 15:34 ` Steven Whitehouse 0 siblings, 0 replies; 6+ messages in thread From: Steven Whitehouse @ 2013-08-22 15:34 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Andrew Morton, Dave Hansen, Jan Kara, Al Viro, NeilBrown, linux-mm, linux-kernel, linux-fsdevel Hi, On Thu, 2013-08-22 at 18:16 +0300, Kirill A. Shutemov wrote: > Steven Whitehouse wrote: > > Hi, > > > > On Thu, 2013-08-22 at 17:30 +0300, Kirill A. Shutemov wrote: > > [snip] > > > > Andrew's proposed solution makes sense to me, and is probably the > > > > easiest way to solve this. > > > > > > Move check to no_cached_page? > > Yes > > > > > I don't see how it makes any difference for > > > page cache miss case: we anyway exclude ->readpage() if it's beyond local > > > i_size. > > > And for cache hit local i_size will be most likely cover locally cached > > > pages. > > The difference is that as the function is currently written, you cannot > > get to no_cached_page without first calling page_cache_sync_readahead(), > > i.e. ->readpages() so that i_size will have been updated, even if > > ->readpages() doesn't return any read-ahead pages. > > > > I guess that it is not very obvious that a call to ->readpages is hidden > > in page_cache_sync_readahead() but that is the path that should in the > > common case provide the pages from the fs, rather than the ->readpage > > call thats further down do_generic_file_read() > > I've checked the codepath before and to me it looks like ->readpages() > will not be called beyond i_size. Codepath is: > > page_cache_sync_readahead() > ondemand_readahead() > __do_page_cache_readahead() > read_pages() > mapping->a_ops->readpages() > > But if you check __do_page_cache_readahead(): > > 152 static int > 153 __do_page_cache_readahead(struct address_space *mapping, struct file *filp, > 154 pgoff_t offset, unsigned long nr_to_read, > 155 unsigned long lookahead_size) > 156 { > ... > 163 loff_t isize = i_size_read(inode); > 164 > 165 if (isize == 0) > 166 goto out; > 167 > 168 end_index = ((isize - 1) >> PAGE_CACHE_SHIFT); > ... > 173 for (page_idx = 0; page_idx < nr_to_read; page_idx++) { > 174 pgoff_t page_offset = offset + page_idx; > 175 > 176 if (page_offset > end_index) > 177 break; > ... > 193 } > ... > 200 if (ret) > 201 read_pages(mapping, filp, &page_pool, ret); > 202 BUG_ON(!list_empty(&page_pool)); > 203 out: > 204 return ret; > 205 } > > Do I miss something? > Hrm. It looks like that Andrew's proposal is not going to work then :( The trouble is that readahead has broken silently in this one case, since the net result of the readahead not going ahead at this point, is that ->readpage will be called later on. So in the cases where it matters (i.e. i_size updated remotely), we'd probably not have noticed it before as the read will still return successfully. That is real problem though... so maybe this isn't as easy as I'd first thought, Steve. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2013-08-22 15:34 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <1377099441-2224-1-git-send-email-kirill.shutemov@linux.intel.com> [not found] ` <1377100012.2738.28.camel@menhir> [not found] ` <20130821160817.940D3E0090@blue.fi.intel.com> [not found] ` <1377103332.2738.37.camel@menhir> [not found] ` <20130821135821.fc8f5a2551a28c9ce9c4b049@linux-foundation.org> [not found] ` <1377163725.2720.18.camel@menhir> 2013-08-22 13:05 ` [PATCH] mm, fs: avoid page allocation beyond i_size on read Kirill A. Shutemov 2013-08-22 13:33 ` Steven Whitehouse 2013-08-22 14:30 ` Kirill A. Shutemov 2013-08-22 14:59 ` Steven Whitehouse 2013-08-22 15:16 ` Kirill A. Shutemov 2013-08-22 15:34 ` Steven Whitehouse
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).