* [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read @ 2015-03-18 14:09 Michal Hocko 2015-03-18 14:32 ` Rik van Riel ` (3 more replies) 0 siblings, 4 replies; 24+ messages in thread From: Michal Hocko @ 2015-03-18 14:09 UTC (permalink / raw) To: Andrew Morton Cc: Al Viro, Johannes Weiner, Mel Gorman, Rik van Riel, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML page_cache_read has been historically using page_cache_alloc_cold to allocate a new page. This means that mapping_gfp_mask is used as the base for the gfp_mask. Many filesystems are setting this mask to GFP_NOFS to prevent from fs recursion issues. page_cache_read is, however, not called from the fs layer so it doesn't need this protection. Even ceph and ocfs2 which call filemap_fault from their fault handlers seem to be OK because they are not taking any fs lock before invoking generic implementation. The protection might be even harmful. There is a strong push to fail GFP_NOFS allocations rather than loop within allocator indefinitely with a very limited reclaim ability. Once we start failing those requests the OOM killer might be triggered prematurely because the page cache allocation failure is propagated up the page fault path and end up in pagefault_out_of_memory. Use GFP_KERNEL mask instead because it is safe from the reclaim recursion POV. We are already doing GFP_KERNEL allocations down add_to_page_cache_lru path. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/filemap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/filemap.c b/mm/filemap.c index 968cd8e03d2e..26f62ba79f50 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1752,7 +1752,7 @@ static int page_cache_read(struct file *file, pgoff_t offset) int ret; do { - page = page_cache_alloc_cold(mapping); + page = __page_cache_alloc(GFP_KERNEL|__GFP_COLD); if (!page) return -ENOMEM; -- 2.1.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-18 14:09 [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read Michal Hocko @ 2015-03-18 14:32 ` Rik van Riel 2015-03-18 14:37 ` Michal Hocko 2015-03-18 14:38 ` Mel Gorman ` (2 subsequent siblings) 3 siblings, 1 reply; 24+ messages in thread From: Rik van Riel @ 2015-03-18 14:32 UTC (permalink / raw) To: Michal Hocko, Andrew Morton Cc: Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On 03/18/2015 10:09 AM, Michal Hocko wrote: > diff --git a/mm/filemap.c b/mm/filemap.c > index 968cd8e03d2e..26f62ba79f50 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1752,7 +1752,7 @@ static int page_cache_read(struct file *file, pgoff_t offset) > int ret; > > do { > - page = page_cache_alloc_cold(mapping); > + page = __page_cache_alloc(GFP_KERNEL|__GFP_COLD); > if (!page) > return -ENOMEM; Won't this break on highmem systems, by failing to allocate the page cache from highmem, where previously it would? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-18 14:32 ` Rik van Riel @ 2015-03-18 14:37 ` Michal Hocko 0 siblings, 0 replies; 24+ messages in thread From: Michal Hocko @ 2015-03-18 14:37 UTC (permalink / raw) To: Rik van Riel Cc: Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Wed 18-03-15 10:32:57, Rik van Riel wrote: > On 03/18/2015 10:09 AM, Michal Hocko wrote: > > > diff --git a/mm/filemap.c b/mm/filemap.c > > index 968cd8e03d2e..26f62ba79f50 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -1752,7 +1752,7 @@ static int page_cache_read(struct file *file, pgoff_t offset) > > int ret; > > > > do { > > - page = page_cache_alloc_cold(mapping); > > + page = __page_cache_alloc(GFP_KERNEL|__GFP_COLD); > > if (!page) > > return -ENOMEM; > > Won't this break on highmem systems, by failing to > allocate the page cache from highmem, where previously > it would? It will! This is broken. I can see inode_init_always now. We need to add GFP_HIGHUSER_MOVABLE here. Thanks for pointing this out! -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-18 14:09 [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read Michal Hocko 2015-03-18 14:32 ` Rik van Riel @ 2015-03-18 14:38 ` Mel Gorman 2015-03-18 14:43 ` Michal Hocko 2015-03-18 14:44 ` Rik van Riel 2015-03-18 15:45 ` Michal Hocko 3 siblings, 1 reply; 24+ messages in thread From: Mel Gorman @ 2015-03-18 14:38 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Al Viro, Johannes Weiner, Rik van Riel, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Wed, Mar 18, 2015 at 03:09:26PM +0100, Michal Hocko wrote: > page_cache_read has been historically using page_cache_alloc_cold to > allocate a new page. This means that mapping_gfp_mask is used as the > base for the gfp_mask. Many filesystems are setting this mask to > GFP_NOFS to prevent from fs recursion issues. page_cache_read is, > however, not called from the fs layer so it doesn't need this > protection. Even ceph and ocfs2 which call filemap_fault from their > fault handlers seem to be OK because they are not taking any fs lock > before invoking generic implementation. > > The protection might be even harmful. There is a strong push to fail > GFP_NOFS allocations rather than loop within allocator indefinitely with > a very limited reclaim ability. Once we start failing those requests > the OOM killer might be triggered prematurely because the page cache > allocation failure is propagated up the page fault path and end up in > pagefault_out_of_memory. > > Use GFP_KERNEL mask instead because it is safe from the reclaim > recursion POV. We are already doing GFP_KERNEL allocations down > add_to_page_cache_lru path. > > Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Signed-off-by: Michal Hocko <mhocko@suse.cz> I'm very far behind after LSF/MM so do not know where this came out of but it loses addressing restriction hints from the driver such as drivers/gpu/drm/gma500/gem.c: mapping_set_gfp_mask(r->gem.filp->f_mapping, GFP_KERNEL | __GFP_DMA32); It also loses mobility hints for fragmentation avoidance. fs/inode.c: mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE); If users of mapping_set_gfp_mask are now being ignored then it should at least trigger a once-off warning that the flags are being ignored so it's obvious if a recursion does occur and cause problems. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-18 14:38 ` Mel Gorman @ 2015-03-18 14:43 ` Michal Hocko 0 siblings, 0 replies; 24+ messages in thread From: Michal Hocko @ 2015-03-18 14:43 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Al Viro, Johannes Weiner, Rik van Riel, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Wed 18-03-15 14:38:47, Mel Gorman wrote: > On Wed, Mar 18, 2015 at 03:09:26PM +0100, Michal Hocko wrote: > > page_cache_read has been historically using page_cache_alloc_cold to > > allocate a new page. This means that mapping_gfp_mask is used as the > > base for the gfp_mask. Many filesystems are setting this mask to > > GFP_NOFS to prevent from fs recursion issues. page_cache_read is, > > however, not called from the fs layer so it doesn't need this > > protection. Even ceph and ocfs2 which call filemap_fault from their > > fault handlers seem to be OK because they are not taking any fs lock > > before invoking generic implementation. > > > > The protection might be even harmful. There is a strong push to fail > > GFP_NOFS allocations rather than loop within allocator indefinitely with > > a very limited reclaim ability. Once we start failing those requests > > the OOM killer might be triggered prematurely because the page cache > > allocation failure is propagated up the page fault path and end up in > > pagefault_out_of_memory. > > > > Use GFP_KERNEL mask instead because it is safe from the reclaim > > recursion POV. We are already doing GFP_KERNEL allocations down > > add_to_page_cache_lru path. > > > > Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > Signed-off-by: Michal Hocko <mhocko@suse.cz> > > I'm very far behind after LSF/MM so do not know where this came out of > but it loses addressing restriction hints from the driver such as > > drivers/gpu/drm/gma500/gem.c: mapping_set_gfp_mask(r->gem.filp->f_mapping, GFP_KERNEL | __GFP_DMA32); OK, I have missed these drivers. Scratch the patch for now I will think about it more. Thanks! -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-18 14:09 [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read Michal Hocko 2015-03-18 14:32 ` Rik van Riel 2015-03-18 14:38 ` Mel Gorman @ 2015-03-18 14:44 ` Rik van Riel 2015-03-18 14:55 ` Michal Hocko 2015-03-18 15:45 ` Michal Hocko 3 siblings, 1 reply; 24+ messages in thread From: Rik van Riel @ 2015-03-18 14:44 UTC (permalink / raw) To: Michal Hocko, Andrew Morton Cc: Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On 03/18/2015 10:09 AM, Michal Hocko wrote: > page_cache_read has been historically using page_cache_alloc_cold to > allocate a new page. This means that mapping_gfp_mask is used as the > base for the gfp_mask. Many filesystems are setting this mask to > GFP_NOFS to prevent from fs recursion issues. page_cache_read is, > however, not called from the fs layer Is that true for filesystems that have directories in the page cache? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-18 14:44 ` Rik van Riel @ 2015-03-18 14:55 ` Michal Hocko 2015-03-19 7:14 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Michal Hocko @ 2015-03-18 14:55 UTC (permalink / raw) To: Rik van Riel Cc: Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Wed 18-03-15 10:44:11, Rik van Riel wrote: > On 03/18/2015 10:09 AM, Michal Hocko wrote: > > page_cache_read has been historically using page_cache_alloc_cold to > > allocate a new page. This means that mapping_gfp_mask is used as the > > base for the gfp_mask. Many filesystems are setting this mask to > > GFP_NOFS to prevent from fs recursion issues. page_cache_read is, > > however, not called from the fs layer > > Is that true for filesystems that have directories in > the page cache? I haven't found any explicit callers of filemap_fault except for ocfs2 and ceph and those seem OK to me. Which filesystems you have in mind? Btw. how would that work as we already have GFP_KERNEL allocation few lines below? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-18 14:55 ` Michal Hocko @ 2015-03-19 7:14 ` Dave Chinner 2015-03-19 11:11 ` [PATCH] mm: Use GFP_KERNEL allocation for the page cache inpage_cache_read Tetsuo Handa 2015-03-19 12:44 ` [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read Michal Hocko 0 siblings, 2 replies; 24+ messages in thread From: Dave Chinner @ 2015-03-19 7:14 UTC (permalink / raw) To: Michal Hocko Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Wed, Mar 18, 2015 at 03:55:28PM +0100, Michal Hocko wrote: > On Wed 18-03-15 10:44:11, Rik van Riel wrote: > > On 03/18/2015 10:09 AM, Michal Hocko wrote: > > > page_cache_read has been historically using page_cache_alloc_cold to > > > allocate a new page. This means that mapping_gfp_mask is used as the > > > base for the gfp_mask. Many filesystems are setting this mask to > > > GFP_NOFS to prevent from fs recursion issues. page_cache_read is, > > > however, not called from the fs layer > > > > Is that true for filesystems that have directories in > > the page cache? > > I haven't found any explicit callers of filemap_fault except for ocfs2 > and ceph and those seem OK to me. Which filesystems you have in mind? Just about every major filesystem calls filemap_fault through the .fault callout. C symbol: filemap_fault File Function Line 0 9p/vfs_file.c <global> 831 .fault = filemap_fault, 1 9p/vfs_file.c <global> 838 .fault = filemap_fault, 2 btrfs/file.c <global> 2081 .fault = filemap_fault, 3 cifs/file.c <global> 3242 .fault = filemap_fault, 4 ext4/file.c <global> 215 .fault = filemap_fault, 5 f2fs/file.c <global> 93 .fault = filemap_fault, 6 fuse/file.c <global> 2062 .fault = filemap_fault, 7 gfs2/file.c <global> 498 .fault = filemap_fault, 8 nfs/file.c <global> 653 .fault = filemap_fault, 9 nilfs2/file.c <global> 128 .fault = filemap_fault, a ubifs/file.c <global> 1536 .fault = filemap_fault, b xfs/xfs_file.c <global> 1420 .fault = filemap_fault, > Btw. how would that work as we already have GFP_KERNEL allocation few > lines below? GFP_KERNEL allocation for mappings is simply wrong. All mapping allocations where the caller cannot pass a gfp_mask need to obey the mapping_gfp_mask that is set by the mapping owner.... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache inpage_cache_read 2015-03-19 7:14 ` Dave Chinner @ 2015-03-19 11:11 ` Tetsuo Handa 2015-03-19 12:44 ` [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read Michal Hocko 1 sibling, 0 replies; 24+ messages in thread From: Tetsuo Handa @ 2015-03-19 11:11 UTC (permalink / raw) To: david, mhocko Cc: riel, akpm, viro, hannes, mgorman, neilb, sage, mfasheh, linux-mm, linux-kernel Dave Chinner wrote: > On Wed, Mar 18, 2015 at 03:55:28PM +0100, Michal Hocko wrote: > > On Wed 18-03-15 10:44:11, Rik van Riel wrote: > > > On 03/18/2015 10:09 AM, Michal Hocko wrote: > > > > page_cache_read has been historically using page_cache_alloc_cold to > > > > allocate a new page. This means that mapping_gfp_mask is used as the > > > > base for the gfp_mask. Many filesystems are setting this mask to > > > > GFP_NOFS to prevent from fs recursion issues. page_cache_read is, > > > > however, not called from the fs layer > > > > > > Is that true for filesystems that have directories in > > > the page cache? > > > > I haven't found any explicit callers of filemap_fault except for ocfs2 > > and ceph and those seem OK to me. Which filesystems you have in mind? > > Just about every major filesystem calls filemap_fault through the > .fault callout. > > C symbol: filemap_fault > > File Function Line > 0 9p/vfs_file.c <global> 831 .fault = filemap_fault, > 1 9p/vfs_file.c <global> 838 .fault = filemap_fault, > 2 btrfs/file.c <global> 2081 .fault = filemap_fault, > 3 cifs/file.c <global> 3242 .fault = filemap_fault, > 4 ext4/file.c <global> 215 .fault = filemap_fault, > 5 f2fs/file.c <global> 93 .fault = filemap_fault, > 6 fuse/file.c <global> 2062 .fault = filemap_fault, > 7 gfs2/file.c <global> 498 .fault = filemap_fault, > 8 nfs/file.c <global> 653 .fault = filemap_fault, > 9 nilfs2/file.c <global> 128 .fault = filemap_fault, > a ubifs/file.c <global> 1536 .fault = filemap_fault, > b xfs/xfs_file.c <global> 1420 .fault = filemap_fault, > > > > Btw. how would that work as we already have GFP_KERNEL allocation few > > lines below? > > GFP_KERNEL allocation for mappings is simply wrong. All mapping > allocations where the caller cannot pass a gfp_mask need to obey > the mapping_gfp_mask that is set by the mapping owner.... > Is there any chance to annotate which GFP flag needs to be used like https://lkml.org/lkml/2015/3/17/507 ? > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-19 7:14 ` Dave Chinner 2015-03-19 11:11 ` [PATCH] mm: Use GFP_KERNEL allocation for the page cache inpage_cache_read Tetsuo Handa @ 2015-03-19 12:44 ` Michal Hocko 2015-03-20 3:48 ` Dave Chinner 1 sibling, 1 reply; 24+ messages in thread From: Michal Hocko @ 2015-03-19 12:44 UTC (permalink / raw) To: Dave Chinner Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Thu 19-03-15 18:14:39, Dave Chinner wrote: > On Wed, Mar 18, 2015 at 03:55:28PM +0100, Michal Hocko wrote: > > On Wed 18-03-15 10:44:11, Rik van Riel wrote: > > > On 03/18/2015 10:09 AM, Michal Hocko wrote: > > > > page_cache_read has been historically using page_cache_alloc_cold to > > > > allocate a new page. This means that mapping_gfp_mask is used as the > > > > base for the gfp_mask. Many filesystems are setting this mask to > > > > GFP_NOFS to prevent from fs recursion issues. page_cache_read is, > > > > however, not called from the fs layer > > > > > > Is that true for filesystems that have directories in > > > the page cache? > > > > I haven't found any explicit callers of filemap_fault except for ocfs2 > > and ceph and those seem OK to me. Which filesystems you have in mind? > > Just about every major filesystem calls filemap_fault through the > .fault callout. That is right but the callback is called from the VM layer where we obviously do not take any fs locks (we are holding only mmap_sem for reading). Those who call filemap_fault directly (ocfs2 and ceph) and those who call the callback directly: qxl_ttm_fault, radeon_ttm_fault, kernfs_vma_fault, shm_fault seem to be safe from the reclaim recursion POV. radeon_ttm_fault takes a lock for reading but that one doesn't seem to be used from the reclaim context. Or did I miss your point? Are you concerned about some fs overloading filemap_fault and do some locking before delegating to filemap_fault? > C symbol: filemap_fault > > File Function Line > 0 9p/vfs_file.c <global> 831 .fault = filemap_fault, > 1 9p/vfs_file.c <global> 838 .fault = filemap_fault, > 2 btrfs/file.c <global> 2081 .fault = filemap_fault, > 3 cifs/file.c <global> 3242 .fault = filemap_fault, > 4 ext4/file.c <global> 215 .fault = filemap_fault, > 5 f2fs/file.c <global> 93 .fault = filemap_fault, > 6 fuse/file.c <global> 2062 .fault = filemap_fault, > 7 gfs2/file.c <global> 498 .fault = filemap_fault, > 8 nfs/file.c <global> 653 .fault = filemap_fault, > 9 nilfs2/file.c <global> 128 .fault = filemap_fault, > a ubifs/file.c <global> 1536 .fault = filemap_fault, > b xfs/xfs_file.c <global> 1420 .fault = filemap_fault, > > > > Btw. how would that work as we already have GFP_KERNEL allocation few > > lines below? > > GFP_KERNEL allocation for mappings is simply wrong. All mapping > allocations where the caller cannot pass a gfp_mask need to obey > the mapping_gfp_mask that is set by the mapping owner.... Hmm, I thought this is true only when the function might be called from the fs path. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-19 12:44 ` [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read Michal Hocko @ 2015-03-20 3:48 ` Dave Chinner 2015-03-20 13:14 ` Michal Hocko 2015-03-26 9:53 ` Michal Hocko 0 siblings, 2 replies; 24+ messages in thread From: Dave Chinner @ 2015-03-20 3:48 UTC (permalink / raw) To: Michal Hocko Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote: > On Thu 19-03-15 18:14:39, Dave Chinner wrote: > > On Wed, Mar 18, 2015 at 03:55:28PM +0100, Michal Hocko wrote: > > > On Wed 18-03-15 10:44:11, Rik van Riel wrote: > > > > On 03/18/2015 10:09 AM, Michal Hocko wrote: > > > > > page_cache_read has been historically using page_cache_alloc_cold to > > > > > allocate a new page. This means that mapping_gfp_mask is used as the > > > > > base for the gfp_mask. Many filesystems are setting this mask to > > > > > GFP_NOFS to prevent from fs recursion issues. page_cache_read is, > > > > > however, not called from the fs layer > > > > > > > > Is that true for filesystems that have directories in > > > > the page cache? > > > > > > I haven't found any explicit callers of filemap_fault except for ocfs2 > > > and ceph and those seem OK to me. Which filesystems you have in mind? > > > > Just about every major filesystem calls filemap_fault through the > > .fault callout. > > That is right but the callback is called from the VM layer where we > obviously do not take any fs locks (we are holding only mmap_sem > for reading). > Those who call filemap_fault directly (ocfs2 and ceph) and those > who call the callback directly: qxl_ttm_fault, radeon_ttm_fault, > kernfs_vma_fault, shm_fault seem to be safe from the reclaim recursion > POV. radeon_ttm_fault takes a lock for reading but that one doesn't seem > to be used from the reclaim context. > > Or did I miss your point? Are you concerned about some fs overloading > filemap_fault and do some locking before delegating to filemap_fault? The latter: https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-mmap-lock&id=de0e8c20ba3a65b0f15040aabbefdc1999876e6b > > GFP_KERNEL allocation for mappings is simply wrong. All mapping > > allocations where the caller cannot pass a gfp_mask need to obey > > the mapping_gfp_mask that is set by the mapping owner.... > > Hmm, I thought this is true only when the function might be called from > the fs path. How do you know in, say, mpage_readpages, you aren't being called from a fs path that holds locks? e.g. we can get there from ext4 doing readdir, so it is holding an i_mutex lock at that point. Many other paths into mpages_readpages don't hold locks, but there are some that do, and those that do need functionals like this to obey the mapping_gfp_mask because it is set appropriately for the allocation context of the inode that owns the mapping.... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-20 3:48 ` Dave Chinner @ 2015-03-20 13:14 ` Michal Hocko 2015-03-20 22:51 ` Dave Chinner 2015-03-26 9:53 ` Michal Hocko 1 sibling, 1 reply; 24+ messages in thread From: Michal Hocko @ 2015-03-20 13:14 UTC (permalink / raw) To: Dave Chinner Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Fri 20-03-15 14:48:20, Dave Chinner wrote: > On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote: > > On Thu 19-03-15 18:14:39, Dave Chinner wrote: > > > On Wed, Mar 18, 2015 at 03:55:28PM +0100, Michal Hocko wrote: > > > > On Wed 18-03-15 10:44:11, Rik van Riel wrote: > > > > > On 03/18/2015 10:09 AM, Michal Hocko wrote: > > > > > > page_cache_read has been historically using page_cache_alloc_cold to > > > > > > allocate a new page. This means that mapping_gfp_mask is used as the > > > > > > base for the gfp_mask. Many filesystems are setting this mask to > > > > > > GFP_NOFS to prevent from fs recursion issues. page_cache_read is, > > > > > > however, not called from the fs layer > > > > > > > > > > Is that true for filesystems that have directories in > > > > > the page cache? > > > > > > > > I haven't found any explicit callers of filemap_fault except for ocfs2 > > > > and ceph and those seem OK to me. Which filesystems you have in mind? > > > > > > Just about every major filesystem calls filemap_fault through the > > > .fault callout. > > > > That is right but the callback is called from the VM layer where we > > obviously do not take any fs locks (we are holding only mmap_sem > > for reading). > > Those who call filemap_fault directly (ocfs2 and ceph) and those > > who call the callback directly: qxl_ttm_fault, radeon_ttm_fault, > > kernfs_vma_fault, shm_fault seem to be safe from the reclaim recursion > > POV. radeon_ttm_fault takes a lock for reading but that one doesn't seem > > to be used from the reclaim context. > > > > Or did I miss your point? Are you concerned about some fs overloading > > filemap_fault and do some locking before delegating to filemap_fault? > > The latter: > > https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-mmap-lock&id=de0e8c20ba3a65b0f15040aabbefdc1999876e6b I will have a look at the code to see what we can do about it. > > > GFP_KERNEL allocation for mappings is simply wrong. All mapping > > > allocations where the caller cannot pass a gfp_mask need to obey > > > the mapping_gfp_mask that is set by the mapping owner.... > > > > Hmm, I thought this is true only when the function might be called from > > the fs path. > > How do you know in, say, mpage_readpages, you aren't being called > from a fs path that holds locks? e.g. we can get there from ext4 > doing readdir, so it is holding an i_mutex lock at that point. > > Many other paths into mpages_readpages don't hold locks, but there > are some that do, and those that do need functionals like this to > obey the mapping_gfp_mask because it is set appropriately for the > allocation context of the inode that owns the mapping.... What about the following? --- ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-20 13:14 ` Michal Hocko @ 2015-03-20 22:51 ` Dave Chinner 2015-03-23 13:02 ` Michal Hocko 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2015-03-20 22:51 UTC (permalink / raw) To: Michal Hocko Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Fri, Mar 20, 2015 at 02:14:53PM +0100, Michal Hocko wrote: > On Fri 20-03-15 14:48:20, Dave Chinner wrote: > > On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote: > > > > allocations where the caller cannot pass a gfp_mask need to obey > > > > the mapping_gfp_mask that is set by the mapping owner.... > > > > > > Hmm, I thought this is true only when the function might be called from > > > the fs path. > > > > How do you know in, say, mpage_readpages, you aren't being called > > from a fs path that holds locks? e.g. we can get there from ext4 > > doing readdir, so it is holding an i_mutex lock at that point. > > > > Many other paths into mpages_readpages don't hold locks, but there > > are some that do, and those that do need functionals like this to > > obey the mapping_gfp_mask because it is set appropriately for the > > allocation context of the inode that owns the mapping.... > > What about the following? > --- > From 5d905cb291138d61bbab056845d6e53bc4451ec8 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.cz> > Date: Thu, 19 Mar 2015 14:56:56 +0100 > Subject: [PATCH 1/2] mm: do not ignore mapping_gfp_mask in page cache > allocation paths Looks reasonable, though I though there were more places that that in the mapping paths that need to be careful... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-20 22:51 ` Dave Chinner @ 2015-03-23 13:02 ` Michal Hocko 0 siblings, 0 replies; 24+ messages in thread From: Michal Hocko @ 2015-03-23 13:02 UTC (permalink / raw) To: Dave Chinner Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Sat 21-03-15 09:51:39, Dave Chinner wrote: > On Fri, Mar 20, 2015 at 02:14:53PM +0100, Michal Hocko wrote: > > On Fri 20-03-15 14:48:20, Dave Chinner wrote: > > > On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote: > > > > > allocations where the caller cannot pass a gfp_mask need to obey > > > > > the mapping_gfp_mask that is set by the mapping owner.... > > > > > > > > Hmm, I thought this is true only when the function might be called from > > > > the fs path. > > > > > > How do you know in, say, mpage_readpages, you aren't being called > > > from a fs path that holds locks? e.g. we can get there from ext4 > > > doing readdir, so it is holding an i_mutex lock at that point. > > > > > > Many other paths into mpages_readpages don't hold locks, but there > > > are some that do, and those that do need functionals like this to > > > obey the mapping_gfp_mask because it is set appropriately for the > > > allocation context of the inode that owns the mapping.... > > > > What about the following? > > --- > > From 5d905cb291138d61bbab056845d6e53bc4451ec8 Mon Sep 17 00:00:00 2001 > > From: Michal Hocko <mhocko@suse.cz> > > Date: Thu, 19 Mar 2015 14:56:56 +0100 > > Subject: [PATCH 1/2] mm: do not ignore mapping_gfp_mask in page cache > > allocation paths > > Looks reasonable, though I though there were more places that that > in the mapping paths that need to be careful... I have focused on those which involve page cache allocation because those are obvious. We might need others but I do not see them right now. I will include this patch for the next submit after I manage to wrap my head around up-coming xfs changes and come up with something for page_cache_read vs OOM killer issue. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-20 3:48 ` Dave Chinner 2015-03-20 13:14 ` Michal Hocko @ 2015-03-26 9:53 ` Michal Hocko 2015-03-26 21:43 ` Dave Chinner 1 sibling, 1 reply; 24+ messages in thread From: Michal Hocko @ 2015-03-26 9:53 UTC (permalink / raw) To: Dave Chinner Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Fri 20-03-15 14:48:20, Dave Chinner wrote: > On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote: [...] > > Or did I miss your point? Are you concerned about some fs overloading > > filemap_fault and do some locking before delegating to filemap_fault? > > The latter: > > https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-mmap-lock&id=de0e8c20ba3a65b0f15040aabbefdc1999876e6b Hmm. I am completely unfamiliar with the xfs code but my reading of 964aa8d9e4d3..723cac484733 is that the newly introduced lock should be OK from the reclaim recursion POV. It protects against truncate and punch hole, right? Or are there any internal paths which I am missing and would cause problems if we do GFP_FS with XFS_MMAPLOCK_SHARED held? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-26 9:53 ` Michal Hocko @ 2015-03-26 21:43 ` Dave Chinner 2015-03-30 8:22 ` Michal Hocko 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2015-03-26 21:43 UTC (permalink / raw) To: Michal Hocko Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Thu, Mar 26, 2015 at 10:53:02AM +0100, Michal Hocko wrote: > On Fri 20-03-15 14:48:20, Dave Chinner wrote: > > On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote: > [...] > > > Or did I miss your point? Are you concerned about some fs overloading > > > filemap_fault and do some locking before delegating to filemap_fault? > > > > The latter: > > > > https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-mmap-lock&id=de0e8c20ba3a65b0f15040aabbefdc1999876e6b > > Hmm. I am completely unfamiliar with the xfs code but my reading of > 964aa8d9e4d3..723cac484733 is that the newly introduced lock should be > OK from the reclaim recursion POV. It protects against truncate and > punch hole, right? Or are there any internal paths which I am missing > and would cause problems if we do GFP_FS with XFS_MMAPLOCK_SHARED held? It might be OK, but you're only looking at the example I gave you, not the fundamental issue it demonstrates. That is: filesystems may have *internal dependencies that are unknown to the page cache or mm subsystem*. Hence the page cache or mm allocations cannot arbitrarily ignore allocation constraints the filesystem assigns to mapping operations.... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-26 21:43 ` Dave Chinner @ 2015-03-30 8:22 ` Michal Hocko 2015-03-31 21:46 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Michal Hocko @ 2015-03-30 8:22 UTC (permalink / raw) To: Dave Chinner Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Fri 27-03-15 08:43:54, Dave Chinner wrote: > On Thu, Mar 26, 2015 at 10:53:02AM +0100, Michal Hocko wrote: > > On Fri 20-03-15 14:48:20, Dave Chinner wrote: > > > On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote: > > [...] > > > > Or did I miss your point? Are you concerned about some fs overloading > > > > filemap_fault and do some locking before delegating to filemap_fault? > > > > > > The latter: > > > > > > https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-mmap-lock&id=de0e8c20ba3a65b0f15040aabbefdc1999876e6b > > > > Hmm. I am completely unfamiliar with the xfs code but my reading of > > 964aa8d9e4d3..723cac484733 is that the newly introduced lock should be > > OK from the reclaim recursion POV. It protects against truncate and > > punch hole, right? Or are there any internal paths which I am missing > > and would cause problems if we do GFP_FS with XFS_MMAPLOCK_SHARED held? > > It might be OK, but you're only looking at the example I gave you, > not the fundamental issue it demonstrates. That is: filesystems may > have *internal dependencies that are unknown to the page cache or mm > subsystem*. Hence the page cache or mm allocations cannot > arbitrarily ignore allocation constraints the filesystem assigns to > mapping operations.... I fully understand that. I am just trying to understand what are the real requirements from filesystems wrt filemap_fault. mapping gfp mask is not usable much for that because e.g. xfs has GFP_NOFS set for the whole inode life AFAICS. And it seems that this context is not really required even after the recent code changes. We can add gfp_mask into struct vm_fault and initialize it to mapping_gfp_mask | GFP_IOFS and .fault() callback might overwrite it. This would be cleaner than unconditional gfp manipulation (the patch below). But we are in a really hard position if the GFP_NOFS context is really required here. We shouldn't really trigger OOM killer because that could be premature and way too disruptive. We can retry the page fault or the allocation but that both sound suboptimal to me. Do you have any other suggestions? This hasn't been tested yet it just shows the idea mentioned above. --- ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-30 8:22 ` Michal Hocko @ 2015-03-31 21:46 ` Dave Chinner 2015-04-07 12:16 ` Michal Hocko 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2015-03-31 21:46 UTC (permalink / raw) To: Michal Hocko Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Mon, Mar 30, 2015 at 10:22:18AM +0200, Michal Hocko wrote: > On Fri 27-03-15 08:43:54, Dave Chinner wrote: > > On Thu, Mar 26, 2015 at 10:53:02AM +0100, Michal Hocko wrote: > > > On Fri 20-03-15 14:48:20, Dave Chinner wrote: > > > > On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote: > > > [...] > > > > > Or did I miss your point? Are you concerned about some fs overloading > > > > > filemap_fault and do some locking before delegating to filemap_fault? > > > > > > > > The latter: > > > > > > > > https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-mmap-lock&id=de0e8c20ba3a65b0f15040aabbefdc1999876e6b > > > > > > Hmm. I am completely unfamiliar with the xfs code but my reading of > > > 964aa8d9e4d3..723cac484733 is that the newly introduced lock should be > > > OK from the reclaim recursion POV. It protects against truncate and > > > punch hole, right? Or are there any internal paths which I am missing > > > and would cause problems if we do GFP_FS with XFS_MMAPLOCK_SHARED held? > > > > It might be OK, but you're only looking at the example I gave you, > > not the fundamental issue it demonstrates. That is: filesystems may > > have *internal dependencies that are unknown to the page cache or mm > > subsystem*. Hence the page cache or mm allocations cannot > > arbitrarily ignore allocation constraints the filesystem assigns to > > mapping operations.... > > I fully understand that. I am just trying to understand what are the > real requirements from filesystems wrt filemap_fault. mapping gfp mask is > not usable much for that because e.g. xfs has GFP_NOFS set for the whole > inode life AFAICS. And it seems that this context is not really required > even after the recent code changes. > We can add gfp_mask into struct vm_fault and initialize it to > mapping_gfp_mask | GFP_IOFS and .fault() callback might overwrite it. This > would be cleaner than unconditional gfp manipulation (the patch below). > > But we are in a really hard position if the GFP_NOFS context is really > required here. We shouldn't really trigger OOM killer because that could > be premature and way too disruptive. We can retry the page fault or the > allocation but that both sound suboptimal to me. GFP_NOFS has also been required in the mapping mask in the past because reclaim from page cache allocation points had a nasty habit of blowing the stack. In XFS, we replaced AOP_FLAG_NOFS calls with the GFP_NOFS mapping mask because the AOP_FLAG_NOFS didn't give us the coverage needed on mapping allocation calls to prevent the problems being reported. Yes, we now have a larger stack, but as I've said in the past, GFP_NOFS is used for several different reasons by subsystems - recursion can be bad in more ways than one, and GFP_NOFS is the only hammer we have... > This hasn't been tested yet it just shows the idea mentioned above. > --- > From 292cfcbbe18b2afc8d2bc0cf568ca4c5842d4c8f Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.cz> > Date: Fri, 27 Mar 2015 13:33:51 +0100 > Subject: [PATCH] mm: Allow GFP_IOFS for page_cache_read page cache allocation > > page_cache_read has been historically using page_cache_alloc_cold to > allocate a new page. This means that mapping_gfp_mask is used as the > base for the gfp_mask. Many filesystems are setting this mask to > GFP_NOFS to prevent from fs recursion issues. page_cache_read is, > however, not called from the fs layera directly so it doesn't need this > protection normally. It can be called from a page fault while copying into or out of a user buffer from a read()/write() system call. Hence the page fault can be nested inside filesystem locks. Indeed, the canonical reason for why we can't take the i_mutex in the page fault path is exactly this. i.e. the user buffer might be a mmap()d region of the same file and so we have mmap_sem/i_mutex inversion issues. This is the same case - we can be taking page faults with filesystem locks held, and that means we've got problems if the page fault then recurses back into the filesystem and trips over those locks... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-31 21:46 ` Dave Chinner @ 2015-04-07 12:16 ` Michal Hocko 0 siblings, 0 replies; 24+ messages in thread From: Michal Hocko @ 2015-04-07 12:16 UTC (permalink / raw) To: Dave Chinner Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Wed 01-04-15 08:46:51, Dave Chinner wrote: [...] > GFP_NOFS has also been required in the mapping mask in the past > because reclaim from page cache allocation points had a nasty habit > of blowing the stack. Yeah I remember some scary traces but we are talking about the page fault path and we definitely have to handle GFP_IOFS allocations there. We cannot use GFP_NOFS as a workaround e.g. for anonymous pages. [...] > > From 292cfcbbe18b2afc8d2bc0cf568ca4c5842d4c8f Mon Sep 17 00:00:00 2001 > > From: Michal Hocko <mhocko@suse.cz> > > Date: Fri, 27 Mar 2015 13:33:51 +0100 > > Subject: [PATCH] mm: Allow GFP_IOFS for page_cache_read page cache allocation > > > > page_cache_read has been historically using page_cache_alloc_cold to > > allocate a new page. This means that mapping_gfp_mask is used as the > > base for the gfp_mask. Many filesystems are setting this mask to > > GFP_NOFS to prevent from fs recursion issues. page_cache_read is, > > however, not called from the fs layera directly so it doesn't need this > > protection normally. > > It can be called from a page fault while copying into or out of a > user buffer from a read()/write() system call. Hence the page fault > can be nested inside filesystem locks. As pointed above, the user buffer might be an anonymous memory as well and so we have to be able to handle GFP_IOFS allocations from the page fault without recalaim deadlocks. Besides that we are allocating page tables which are GFP_KERNEL and probably some more. So either we are broken by definition or GFP_IOFS is safe from under i_mutex lock. My code inspection suggests the later but the code is really hard to follow and dependencies might be not direct. I remember that nfs_release_page would be prone to i_mutex deadlock when server and client are on the same machine. But this shouldn't be a problem anymore because the amount of time client waits for the server is limited (9590544694bec). I might be missing other places of course but to me it sounds that GFP_IOFS must be safe under _some_ FS locks and i_mutex is one of them. > Indeed, the canonical reason > for why we can't take the i_mutex in the page fault path is exactly > this. i.e. the user buffer might be a mmap()d region of the same > file and so we have mmap_sem/i_mutex inversion issues. > > This is the same case - we can be taking page faults with filesystem > locks held, and that means we've got problems if the page fault then > recurses back into the filesystem and trips over those locks... Yeah, I am familiar with the generic meaning of GFP_NOFS flags. I just think that it is used as a too big of a hammer here (all FS locks is just too broad). The page fault is not GFP_NOFS safe now and it never has been (anonymous pages are not GFP_NOFS, page tables etc...). And I am afraid we cannot simply change it to use GFP_NOFS all over. Are there any other fs locks (except for i_mutex) which might be held while doing {get,put}_user or get_user_pages? I haven't found many instances in the fs/ but there is a lot of done via indirection. That being said I think the patch should be safe and an improvement over the current state. Unless I am missing something obvious or there are other objections I will repost it along with the other clean up patch later this week. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-18 14:09 [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read Michal Hocko ` (2 preceding siblings ...) 2015-03-18 14:44 ` Rik van Riel @ 2015-03-18 15:45 ` Michal Hocko 2015-03-18 21:38 ` NeilBrown 3 siblings, 1 reply; 24+ messages in thread From: Michal Hocko @ 2015-03-18 15:45 UTC (permalink / raw) To: Andrew Morton Cc: Al Viro, Johannes Weiner, Mel Gorman, Rik van Riel, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML What do you think about this v2? I cannot say I would like it but I really dislike the whole mapping_gfp_mask API to be honest. --- ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-18 15:45 ` Michal Hocko @ 2015-03-18 21:38 ` NeilBrown 2015-03-19 13:55 ` Michal Hocko 0 siblings, 1 reply; 24+ messages in thread From: NeilBrown @ 2015-03-18 21:38 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Rik van Riel, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML [-- Attachment #1: Type: text/plain, Size: 2983 bytes --] On Wed, 18 Mar 2015 16:45:40 +0100 Michal Hocko <mhocko@suse.cz> wrote: > What do you think about this v2? I cannot say I would like it but I > really dislike the whole mapping_gfp_mask API to be honest. > --- > >From d88010d6f5f59d7eb87b691e27e201d12cab9141 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.cz> > Date: Wed, 18 Mar 2015 16:06:40 +0100 > Subject: [PATCH] mm: Allow __GFP_FS for page_cache_read page cache allocation > > page_cache_read has been historically using page_cache_alloc_cold to > allocate a new page. This means that mapping_gfp_mask is used as the > base for the gfp_mask. Many filesystems are setting this mask to > GFP_NOFS to prevent from fs recursion issues. page_cache_read is, > however, not called from the fs layer so it doesn't need this > protection. Even ceph and ocfs2 which call filemap_fault from their > fault handlers seem to be OK because they are not taking any fs lock > before invoking generic implementation. > > The protection might be even harmful. There is a strong push to fail > GFP_NOFS allocations rather than loop within allocator indefinitely with > a very limited reclaim ability. Once we start failing those requests > the OOM killer might be triggered prematurely because the page cache > allocation failure is propagated up the page fault path and end up in > pagefault_out_of_memory. > > Add __GFP_FS and __GFPIO to the gfp mask which is coming from the > mapping to fix this issue. > > Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Signed-off-by: Michal Hocko <mhocko@suse.cz> > --- > mm/filemap.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/mm/filemap.c b/mm/filemap.c > index 968cd8e03d2e..8b50d5eb52b2 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1752,7 +1752,15 @@ static int page_cache_read(struct file *file, pgoff_t offset) > int ret; > > do { > - page = page_cache_alloc_cold(mapping); > + gfp_t page_cache_gfp = mapping_gfp_mask(mapping)|__GFP_COLD; > + > + /* > + * This code is not called from the fs layer so we do not need > + * reclaim recursion protection. !GFP_FS might fail too easy > + * and trigger OOM killer prematuraly. > + */ > + page_cache_gfp |= __GFP_FS | __GFP_IO; > + page = __page_cache_alloc(page_cache_gfp); > if (!page) > return -ENOMEM; > Nearly half the places in the kernel which call mapping_gfp_mask() remove the __GFP_FS bit. That suggests to me that it might make sense to have mapping_gfp_mask_fs() and mapping_gfp_mask_nofs() and let the presence of __GFP_FS (and __GFP_IO) be determined by the call-site rather than the filesystem. However I am a bit concerned about drivers/block/loop.c. Might a filesystem read on the loop block device wait for a page_cache_read() on the loop-mounted file? In that case you really don't want __GFP_FS set when allocating that page. NeilBrown [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 811 bytes --] ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-18 21:38 ` NeilBrown @ 2015-03-19 13:55 ` Michal Hocko 2015-03-19 14:27 ` Michal Hocko 2015-03-20 3:57 ` Dave Chinner 0 siblings, 2 replies; 24+ messages in thread From: Michal Hocko @ 2015-03-19 13:55 UTC (permalink / raw) To: NeilBrown Cc: Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Rik van Riel, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Thu 19-03-15 08:38:35, Neil Brown wrote: [...] > Nearly half the places in the kernel which call mapping_gfp_mask() remove the > __GFP_FS bit. > > That suggests to me that it might make sense to have > mapping_gfp_mask_fs() > and > mapping_gfp_mask_nofs() > > and let the presence of __GFP_FS (and __GFP_IO) be determined by the > call-site rather than the filesystem. Sounds reasonable to me but filesystems tend to use this in a very different ways. - xfs drops GFP_FS in xfs_setup_inode so all page cache allocations are NOFS. - reiserfs drops GFP_FS only before calling read_mapping_page in reiserfs_get_page and never restores the original mask. - btrfs doesn't seem to rely on mapping_gfp_mask for anything other than btree_inode (unless it gets inherrited in a way I haven't noticed). - ext* doesn't seem to rely on the mapping gfp mask at all. So it is not clear to me how we should change that into callsites. But I guess we can change at least the page fault path like the following. I like it much more than the previous way which is too hackish. --- ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-19 13:55 ` Michal Hocko @ 2015-03-19 14:27 ` Michal Hocko 2015-03-20 3:57 ` Dave Chinner 1 sibling, 0 replies; 24+ messages in thread From: Michal Hocko @ 2015-03-19 14:27 UTC (permalink / raw) To: NeilBrown Cc: Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Rik van Riel, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Thu 19-03-15 14:55:58, Michal Hocko wrote: > On Thu 19-03-15 08:38:35, Neil Brown wrote: > [...] > > Nearly half the places in the kernel which call mapping_gfp_mask() remove the > > __GFP_FS bit. > > > > That suggests to me that it might make sense to have > > mapping_gfp_mask_fs() > > and > > mapping_gfp_mask_nofs() > > > > and let the presence of __GFP_FS (and __GFP_IO) be determined by the > > call-site rather than the filesystem. > > Sounds reasonable to me but filesystems tend to use this in a very > different ways. > - xfs drops GFP_FS in xfs_setup_inode so all page cache allocations are > NOFS. > - reiserfs drops GFP_FS only before calling read_mapping_page in > reiserfs_get_page and never restores the original mask. > - btrfs doesn't seem to rely on mapping_gfp_mask for anything other than > btree_inode (unless it gets inherrited in a way I haven't noticed). > - ext* doesn't seem to rely on the mapping gfp mask at all. > > So it is not clear to me how we should change that into callsites. But I > guess we can change at least the page fault path like the following. I > like it much more than the previous way which is too hackish. But this is racy instead... And I do not think we can make it raceless so scratch this and get back to the original approach. [...] > + /* > + * Some filesystems always drop __GFP_FS to prevent from reclaim > + * recursion back to FS code. This is not the case here because > + * we are at the top of the call chain. Add GFP_FS flags to prevent > + * from premature OOM killer. > + */ > + mapping_gfp = mapping_gfp_mask(mapping); > + mapping_set_gfp_mask(mapping, mapping_gfp | __GFP_FS | __GFP_IO); > ret = vma->vm_ops->fault(vma, &vmf); > + mapping_set_gfp_mask(mapping, mapping_gfp); > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) > return ret; -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read 2015-03-19 13:55 ` Michal Hocko 2015-03-19 14:27 ` Michal Hocko @ 2015-03-20 3:57 ` Dave Chinner 1 sibling, 0 replies; 24+ messages in thread From: Dave Chinner @ 2015-03-20 3:57 UTC (permalink / raw) To: Michal Hocko Cc: NeilBrown, Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Rik van Riel, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML On Thu, Mar 19, 2015 at 02:55:58PM +0100, Michal Hocko wrote: > @@ -2701,13 +2701,24 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address, > { > struct vm_fault vmf; > int ret; > + struct address_space *mapping = vma->vm_file->f_mapping; > + gfp_t mapping_gfp; > > vmf.virtual_address = (void __user *)(address & PAGE_MASK); > vmf.pgoff = pgoff; > vmf.flags = flags; > vmf.page = NULL; > > + /* > + * Some filesystems always drop __GFP_FS to prevent from reclaim > + * recursion back to FS code. This is not the case here because > + * we are at the top of the call chain. Add GFP_FS flags to prevent > + * from premature OOM killer. > + */ > + mapping_gfp = mapping_gfp_mask(mapping); > + mapping_set_gfp_mask(mapping, mapping_gfp | __GFP_FS | __GFP_IO); > ret = vma->vm_ops->fault(vma, &vmf); > + mapping_set_gfp_mask(mapping, mapping_gfp); Urk! The inode owns the mapping and makes these decisions, not the page fault path. These mapping flags may be set for reasons you don't expect or know about (e.g. a subsystem specific shrinker constraint) so paths like this have no business clearing flags they don't own. cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2015-04-07 12:16 UTC | newest] Thread overview: 24+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-03-18 14:09 [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read Michal Hocko 2015-03-18 14:32 ` Rik van Riel 2015-03-18 14:37 ` Michal Hocko 2015-03-18 14:38 ` Mel Gorman 2015-03-18 14:43 ` Michal Hocko 2015-03-18 14:44 ` Rik van Riel 2015-03-18 14:55 ` Michal Hocko 2015-03-19 7:14 ` Dave Chinner 2015-03-19 11:11 ` [PATCH] mm: Use GFP_KERNEL allocation for the page cache inpage_cache_read Tetsuo Handa 2015-03-19 12:44 ` [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read Michal Hocko 2015-03-20 3:48 ` Dave Chinner 2015-03-20 13:14 ` Michal Hocko 2015-03-20 22:51 ` Dave Chinner 2015-03-23 13:02 ` Michal Hocko 2015-03-26 9:53 ` Michal Hocko 2015-03-26 21:43 ` Dave Chinner 2015-03-30 8:22 ` Michal Hocko 2015-03-31 21:46 ` Dave Chinner 2015-04-07 12:16 ` Michal Hocko 2015-03-18 15:45 ` Michal Hocko 2015-03-18 21:38 ` NeilBrown 2015-03-19 13:55 ` Michal Hocko 2015-03-19 14:27 ` Michal Hocko 2015-03-20 3:57 ` Dave Chinner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).