* Re: linux-next: kernel BUG at mm/slub.c:1447!
  [not found] <560D59F7.4070002@roeck-us.net>
@ 2015-10-01 20:49 ` Andrew Morton
  2015-10-02  7:25   ` Michal Hocko
  2015-10-02 13:36   ` Guenter Roeck
  0 siblings, 2 replies; 10+ messages in thread

From: Andrew Morton @ 2015-10-01 20:49 UTC (permalink / raw)
To: Guenter Roeck
Cc: linux-kernel@vger.kernel.org, Christoph Lameter, Kirill A. Shutemov, linux-mm, Michal Hocko, Dave Chinner

On Thu, 1 Oct 2015 09:06:15 -0700 Guenter Roeck <linux@roeck-us.net> wrote:

> Seen with next-20151001, running qemu, simulating Opteron_G1 with a non-SMP configuration.
> On a re-run, I have seen it with the same image, but this time when simulating IvyBridge,
> so it is not CPU dependent. I did not previously see the problem.
>
> Log is at
> http://server.roeck-us.net:8010/builders/qemu-x86-next/builds/259/steps/qemubuildcommand/logs/stdio
>
> I'll try to bisect. The problem is not seen with every boot, so that may take a while.

Caused by mhocko's "mm, fs: obey gfp_mapping for add_to_page_cache()",
I expect.
> ---
> gfp: 2

That's __GFP_HIGHMEM

> ------------[ cut here ]------------
> invalid opcode: 0000 [#1] PREEMPT
> Modules linked in:
> CPU: 0 PID: 121 Comm: udevd Not tainted 4.3.0-rc3-next-20151001-yocto-standard #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> task: ced90000 ti: ced8c000 task.ti: ced8c000
> EIP: 0060:[<c1128873>] EFLAGS: 00000092 CPU: 0
> EIP is at new_slab+0x353/0x360
> EAX: 00000006 EBX: 00000000 ECX: 00000001 EDX: 80000001
> ESI: cf8019c0 EDI: 00000000 EBP: ced8daa4 ESP: ced8da7c
> DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
> CR0: 8005003b CR2: 080791c0 CR3: 0ed6c000 CR4: 000006d0
> DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> DR6: 00000000 DR7: 00000000
> Stack:
>  c19a42cf 00000002 c137542e ced8da90 c137544c ffffffff c144c8a8 00000000
>  cf8019c0 00000000 ced8db28 c1129ca8 0203128a c144f346 00004e20 cec2e740
>  c10ee933 0203128a cf8019c0 c181d6c0 c181d460 000001a1 00150015 c0011c00
> Call Trace:
>  [<c137542e>] ? __delay+0xe/0x10
>  [<c137544c>] ? __const_udelay+0x1c/0x20
>  [<c144c8a8>] ? ide_execute_command+0x68/0xa0
>  [<c1129ca8>] ___slab_alloc.constprop.75+0x248/0x310
>  [<c144f346>] ? do_rw_taskfile+0x286/0x320
>  [<c10ee933>] ? mempool_alloc_slab+0x13/0x20
>  [<c1457d12>] ? ide_do_rw_disk+0x222/0x320
>  [<c1136219>] __slab_alloc.isra.72.constprop.74+0x18/0x1f
>  [<c112a2f2>] kmem_cache_alloc+0x122/0x1c0
>  [<c10ee933>] ? mempool_alloc_slab+0x13/0x20
>  [<c10ee933>] mempool_alloc_slab+0x13/0x20
>  [<c10eebe5>] mempool_alloc+0x45/0x170
>  [<c1345202>] bio_alloc_bioset+0xd2/0x1b0
>  [<c1172e9f>] mpage_alloc+0x2f/0xa0
>  [<c1037979>] ? kmap_atomic_prot+0x59/0xf0
>  [<c1173523>] do_mpage_readpage+0x4d3/0x7e0
>  [<c10f31b8>] ? __alloc_pages_nodemask+0xf8/0x8c0
>  [<c134ed67>] ? blk_queue_bio+0x267/0x2d0
>  [<c112a24a>] ? kmem_cache_alloc+0x7a/0x1c0
>  [<c138357f>] ? __this_cpu_preempt_check+0xf/0x20
>  [<c1173894>] mpage_readpage+0x64/0x80

mpage_readpage() is getting the __GFP_HIGHMEM from mapping_gfp_mask()
and that got passed all the way into kmem_cache_alloc() to allocate a
bio.  slab goes BUG if asked for highmem.

A fix would be to mask off __GFP_HIGHMEM right there in
mpage_readpage().

But I think the patch needs a bit of a rethink.  mapping_gfp_mask() is
the mask for allocating a file's pagecache.  It isn't designed for
allocation of memory for IO structures, file metadata, etc.

Now, we could redefine mapping_gfp_mask()'s purpose (or formalize stuff
which has been sneaking in anyway).  Treat mapping_gfp_mask() as a
constraint mask - instead of it being "use this gfp for this mapping",
it becomes "don't use these gfp flags for this mapping".

Hence something like:

gfp_t mapping_gfp_constraint(struct address_space *mapping, gfp_t gfp_in)
{
	return mapping_gfp_mask(mapping) & gfp_in;
}

So instead of doing this:

@@ -370,12 +371,13 @@ mpage_readpages(struct address_space *ma
 		prefetchw(&page->flags);
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
-					page->index, GFP_KERNEL)) {
+					page->index,
+					gfp)) {

Michal's patch will do:

@@ -370,12 +371,13 @@ mpage_readpages(struct address_space *ma
 		prefetchw(&page->flags);
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
-					page->index, GFP_KERNEL)) {
+					page->index,
+					mapping_gfp_constraint(mapping, GFP_KERNEL))) {

ie: use mapping_gfp_mask() to strip out any GFP flags which the
filesystem doesn't want used.  If the filesystem has *added* flags to
mapping_gfp_mask() then obviously this won't work and we'll need two
fields in the address_space or something.

Meanwhile I'll drop "mm, fs: obey gfp_mapping for add_to_page_cache()",
thanks for the report.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-01 20:49 ` linux-next: kernel BUG at mm/slub.c:1447! Andrew Morton
@ 2015-10-02  7:25 ` Michal Hocko
  2015-10-02  8:53   ` Michal Hocko
  2015-10-02 20:49   ` Andrew Morton
  2015-10-02 13:36 ` Guenter Roeck
  1 sibling, 2 replies; 10+ messages in thread

From: Michal Hocko @ 2015-10-02 7:25 UTC (permalink / raw)
To: Andrew Morton
Cc: Guenter Roeck, linux-kernel@vger.kernel.org, Christoph Lameter, Kirill A. Shutemov, linux-mm, Dave Chinner

On Thu 01-10-15 13:49:04, Andrew Morton wrote:
[...]
> mpage_readpage() is getting the __GFP_HIGHMEM from mapping_gfp_mask()
> and that got passed all the way into kmem_cache_alloc() to allocate a
> bio.  slab goes BUG if asked for highmem.
>
> A fix would be to mask off __GFP_HIGHMEM right there in
> mpage_readpage().

Yes, this is an obvious bug in the patch. It should only make the gfp
mask more restrictive.

> But I think the patch needs a bit of a rethink.  mapping_gfp_mask() is
> the mask for allocating a file's pagecache.  It isn't designed for
> allocation of memory for IO structures, file metadata, etc.
>
> Now, we could redefine mapping_gfp_mask()'s purpose (or formalize
> stuff which has been sneaking in anyway).  Treat mapping_gfp_mask() as
> a constraint mask - instead of it being "use this gfp for this
> mapping", it becomes "don't use these gfp flags for this mapping".
>
> Hence something like:
>
> gfp_t mapping_gfp_constraint(struct address_space *mapping, gfp_t gfp_in)
> {
> 	return mapping_gfp_mask(mapping) & gfp_in;
> }
>
> So instead of doing this:
>
> @@ -370,12 +371,13 @@ mpage_readpages(struct address_space *ma
>  		prefetchw(&page->flags);
>  		list_del(&page->lru);
>  		if (!add_to_page_cache_lru(page, mapping,
> -					page->index, GFP_KERNEL)) {
> +					page->index,
> +					gfp)) {
>
> Michal's patch will do:
>
> @@ -370,12 +371,13 @@ mpage_readpages(struct address_space *ma
>  		prefetchw(&page->flags);
>  		list_del(&page->lru);
>  		if (!add_to_page_cache_lru(page, mapping,
> -					page->index, GFP_KERNEL)) {
> +					page->index,
> +					mapping_gfp_constraint(mapping, GFP_KERNEL))) {
>
> ie: use mapping_gfp_mask() to strip out any GFP flags which the
> filesystem doesn't want used.  If the filesystem has *added* flags to
> mapping_gfp_mask() then obviously this won't work and we'll need two
> fields in the address_space or something.
>
> Meanwhile I'll drop "mm, fs: obey gfp_mapping for add_to_page_cache()",
> thanks for the report.

mapping_gfp_mask is used at many places so I think it would be better
to fix this particular place (others seem to be correct). It would make
the stable backport much easier. We can build a more sane API on top.
What do you think?

Here is the respin of the original patch. I will post another one which
will add mapping_gfp_constraint on top. It will surely be less error
prone.
---
* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-02  7:25 ` Michal Hocko
@ 2015-10-02  8:53   ` Michal Hocko
  2015-10-02 20:49   ` Andrew Morton
  1 sibling, 0 replies; 10+ messages in thread

From: Michal Hocko @ 2015-10-02 8:53 UTC (permalink / raw)
To: Andrew Morton
Cc: Guenter Roeck, linux-kernel@vger.kernel.org, Christoph Lameter, Kirill A. Shutemov, linux-mm, Dave Chinner

On Fri 02-10-15 09:25:22, Michal Hocko wrote:
> On Thu 01-10-15 13:49:04, Andrew Morton wrote:
[...]
> > Now, we could redefine mapping_gfp_mask()'s purpose (or formalize
> > stuff which has been sneaking in anyway).  Treat mapping_gfp_mask() as
> > a constraint mask - instead of it being "use this gfp for this
> > mapping", it becomes "don't use these gfp flags for this mapping".
> >
> > Hence something like:
> >
> > gfp_t mapping_gfp_constraint(struct address_space *mapping, gfp_t gfp_in)
> > {
> > 	return mapping_gfp_mask(mapping) & gfp_in;
> > }
> >
> > So instead of doing this:
> >
> > @@ -370,12 +371,13 @@ mpage_readpages(struct address_space *ma
> >  		prefetchw(&page->flags);
> >  		list_del(&page->lru);
> >  		if (!add_to_page_cache_lru(page, mapping,
> > -					page->index, GFP_KERNEL)) {
> > +					page->index,
> > +					gfp)) {
>
[...]
> I will post another one which
> will add mapping_gfp_constraint on top. It will surely be less error
> prone.

OK, so here we go. There are still a few direct users of
mapping_gfp_mask but most of them use it unrestricted. The only
exceptions seem to be the loop driver and lustre's lloop, which save
and restrict mapping_gfp and didn't seem worthwhile converting.

This is on top of the current linux-next + the updated fix for the loop
hang.
---
* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-02  7:25 ` Michal Hocko
  2015-10-02  8:53 ` Michal Hocko
@ 2015-10-02 20:49   ` Andrew Morton
  2015-10-05 13:47   ` Michal Hocko
  1 sibling, 1 reply; 10+ messages in thread

From: Andrew Morton @ 2015-10-02 20:49 UTC (permalink / raw)
To: Michal Hocko
Cc: Guenter Roeck, linux-kernel@vger.kernel.org, Christoph Lameter, Kirill A. Shutemov, linux-mm, Dave Chinner

On Fri, 2 Oct 2015 09:25:23 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> 6afdb859b710 ("mm: do not ignore mapping_gfp_mask in page cache allocation
> paths") has caught some users of hardcoded GFP_KERNEL used in the page
> cache allocation paths. This, however, wasn't complete and there were
> others which went unnoticed.
>
> Dave Chinner has reported the following deadlock for xfs on loop device:
> : With the recent merge of the loop device changes, I'm now seeing
> : XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
> :
> : The deadlock is as follows:
> :
> : kloopd1: loop_queue_read_work
> :	xfs_file_iter_read
> :	lock XFS inode XFS_IOLOCK_SHARED (on image file)
> :	page cache read (GFP_KERNEL)
> :	radix tree alloc
> :	memory reclaim
> :	reclaim XFS inodes
> :	log force to unpin inodes
> :	<wait for log IO completion>
> :
> : xfs-cil/loop1: <does log force IO work>
> :	xlog_cil_push
> :	xlog_write
> :	<loop issuing log writes>
> :	xlog_state_get_iclog_space()
> :	<blocks due to all log buffers under write io>
> :	<waits for IO completion>
> :
> : kloopd1: loop_queue_write_work
> :	xfs_file_write_iter
> :	lock XFS inode XFS_IOLOCK_EXCL (on image file)
> :	<wait for inode to be unlocked>
> :
> : i.e. the kloopd, with its split read and write work queues, has
> : introduced a dependency through memory reclaim, i.e. that writes
> : need to be able to progress for reads to make progress.
> :
> : The problem, fundamentally, is that mpage_readpages() does a
> : GFP_KERNEL allocation, rather than paying attention to the inode's
> : mapping gfp mask, which is set to GFP_NOFS.
> :
> : This didn't use to happen, because the loop device used to issue
> : reads through the splice path and that does:
> :
> :	error = add_to_page_cache_lru(page, mapping, index,
> :			GFP_KERNEL & mapping_gfp_mask(mapping));
>
> This has changed by aa4d86163e4 ("block: loop: switch to VFS ITER_BVEC").
>
> This patch changes mpage_readpage{s} to follow the gfp mask set for the
> mapping. There are, however, other places which are doing basically the
> same.
>
> lustre:ll_dir_filler is doing GFP_KERNEL from a function which
> apparently uses GFP_NOFS for other allocations, so let's make this
> consistent.
>
> cifs:readpages_get_pages is called from cifs_readpages and
> __cifs_readpages_from_fscache called from the same path obeys mapping
> gfp.
>
> ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well, even
> though it uses mapping_gfp_mask for the page allocation.
>
> ext4_mpage_readpages is called from the page cache allocation path,
> same as read_pages and read_cache_pages.
>
> As I've noticed in my previous post I cannot say I would be happy about
> sprinkling mapping_gfp_mask all over the place and it sounds like we
> should drop the gfp_mask argument altogether and use it internally in
> __add_to_page_cache_locked. That would require all the filesystems to
> use the mapping gfp consistently, which I am not sure is the case here.
> From a quick glance it seems that some filesystems use it all the time
> while others are selective.

There's a lot of confusion here, so let's try to do some clear
thinking.

mapping_gfp_mask() describes the mask to be used for allocating this
mapping's pagecache pages.  Nothing else.  I introduced it to provide a
clean way to ensure that blockdev inode pagecache is not allocated from
highmem.
This is totally different from "I'm holding a lock so I can't do
__GFP_FS"!  __GFP_HIGHMEM and __GFP_FS are utterly unrelated concepts -
the only commonality is that they happen to be passed to callees in the
same memory word.

At present the VFS provides no way for the filesystem to specify the
allocation mode for pagecache operations.  The VFS assumes
__GFP_FS|__GFP_IO|__GFP_WAIT|etc.  mapping_gfp_mask() may *look* like
it provides a way, but it doesn't.

The way for a caller to communicate "I can't do __GFP_FS from this
particular callsite" is to pass that info in a gfp_t, in a function
call argument.  It is very wrong to put this info into
mapping_gfp_mask(), as XFS has done.  For obvious reasons: different
callsites may want different values.

One can easily envision a filesystem whose read() can allocate
pagecache with GFP_HIGHUSER but whose write() needs
GFP_HIGHUSER&~__GFP_FS.  This obviously won't fly if we're (ab)using
mapping_gfp_mask() in this fashion.  Also callsites which *can* use the
stronger __GFP_FS are artificially prevented from doing so.

So.

By far the best way of addressing this bug is to fix XFS so that it can
use __GFP_FS for allocating pagecache.  It's really quite lame that the
filesystem cannot use the strongest memory allocation mode for the
largest volume of allocation.  Obviously this fix is too difficult,
otherwise it would already have been made.

The second best way of solving this bug is to pass a gfp_t into the
relevant callees, in the time-honoured manner.  That would involve
alteration of probably several address_space_operations function
pointers and lots of mind-numbing mechanical edits and bloat.

The third best way is to pass the gfp_t via the task_struct.  See
memalloc_noio_save() and memalloc_noio_restore().
This is a pretty grubby way of passing function arguments, but I'm OK
with it in this special case, because

a) Adding a gfp_t arg to 11 billion functions has costs of several
   forms

b) Adding all these costs just because one filesystem is being weird
   doesn't make sense

c) The mpage functions already have too many arguments.  Adding yet
   another is getting kinda ridiculous, and will cost more stack on
   some of our deepest paths.

The fourth best way of fixing this is a nasty short-term bodge, such as
the one you just sent ;) But if we're going to do this, it should be
the minimal bodge which fixes this deadlock.  Is it possible to come up
with a one-liner (plus suitable comment) to get us out of this mess?

Longer-term I suggest we look at generalising the memalloc_noio_foo()
stuff so as to permit callers to mask off (ie: zero) __GFP_ flags in
callees.  I have a suspicion we should have done this 15 years ago
(which is about when I started wanting to do it).
* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-02 20:49 ` Andrew Morton
@ 2015-10-05 13:47   ` Michal Hocko
  2015-10-05 19:29   ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread

From: Michal Hocko @ 2015-10-05 13:47 UTC (permalink / raw)
To: Andrew Morton
Cc: Guenter Roeck, linux-kernel@vger.kernel.org, Christoph Lameter, Kirill A. Shutemov, linux-mm, Dave Chinner

On Fri 02-10-15 13:49:53, Andrew Morton wrote:
[...]
> There's a lot of confusion here, so let's try to do some clear
> thinking.
>
> mapping_gfp_mask() describes the mask to be used for allocating this
> mapping's pagecache pages.  Nothing else.  I introduced it to provide a
> clean way to ensure that blockdev inode pagecache is not allocated from
> highmem.

I am afraid this is not how it ended up being used.

> This is totally different from "I'm holding a lock so I can't do
> __GFP_FS"!  __GFP_HIGHMEM and __GFP_FS are utterly unrelated concepts -
> the only commonality is that they happen to be passed to callees in the
> same memory word.

Which is exactly how those filesystems overcome the lack of a
consistent API to tell the restrictions all the way down to the
allocator. AFAIR this is the case beyond xfs.

> At present the VFS provides no way for the filesystem to specify the
> allocation mode for pagecache operations.  The VFS assumes
> __GFP_FS|__GFP_IO|__GFP_WAIT|etc.  mapping_gfp_mask() may *look* like
> it provides a way, but it doesn't.
>
> The way for a caller to communicate "I can't do __GFP_FS from this
> particular callsite" is to pass that info in a gfp_t, in a function
> call argument.  It is very wrong to put this info into
> mapping_gfp_mask(), as XFS has done.  For obvious reasons: different
> callsites may want different values.
>
> One can easily envision a filesystem whose read() can allocate
> pagecache with GFP_HIGHUSER but whose write() needs
> GFP_HIGHUSER&~__GFP_FS.  This obviously won't fly if we're (ab)using
> mapping_gfp_mask() in this fashion.  Also callsites which *can* use the
> stronger __GFP_FS are artificially prevented from doing so.

Yes I am not happy about the state we have grown into as well.

> So.
>
> By far the best way of addressing this bug is to fix XFS so that it can
> use __GFP_FS for allocating pagecache.  It's really quite lame that the
> filesystem cannot use the strongest memory allocation mode for the
> largest volume of allocation.  Obviously this fix is too difficult,
> otherwise it would already have been made.
>
> The second best way of solving this bug is to pass a gfp_t into the
> relevant callees, in the time-honoured manner.  That would involve
> alteration of probably several address_space_operations function
> pointers and lots of mind-numbing mechanical edits and bloat.
>
> The third best way is to pass the gfp_t via the task_struct.  See
> memalloc_noio_save() and memalloc_noio_restore().  This is a pretty
> grubby way of passing function arguments, but I'm OK with it in this
> special case, because
>
> a) Adding a gfp_t arg to 11 billion functions has costs of
>    several forms
>
> b) Adding all these costs just because one filesystem is being
>    weird doesn't make sense
>
> c) The mpage functions already have too many arguments.  Adding
>    yet another is getting kinda ridiculous, and will cost more stack
>    on some of our deepest paths.
>
> The fourth best way of fixing this is a nasty short-term bodge, such as
> the one you just sent ;) But if we're going to do this, it should be
> the minimal bodge which fixes this deadlock.  Is it possible to come up
> with a one-liner (plus suitable comment) to get us out of this mess?

Yes I do agree that the fix I am proposing is short-term but this seems
like the easiest way to go for stable and older kernels that might be
affected. I thought your proposal for mapping_gfp_constraint was exactly
to have all such places annotated for an easier future transition to
something more reasonable.
I can reduce the patch to the fs/mpage.c chunk, which would be a few
lines and fix the issue at the same time, but what about the other
calls which are inconsistent and probably broken?

> Longer-term I suggest we look at generalising the memalloc_noio_foo()
> stuff so as to permit callers to mask off (ie: zero) __GFP_ flags in
> callees.  I have a suspicion we should have done this 15 years ago
> (which is about when I started wanting to do it).

I am not sure memalloc_noio_foo is a huge win. It is an easy hack where
the whole allocation transaction is clear - like in the PM code. I am
not sure this is true also for the FS.
--
Michal Hocko
SUSE Labs
* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-05 13:47 ` Michal Hocko
@ 2015-10-05 19:29   ` Andrew Morton
  2015-10-06  2:12   ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread

From: Andrew Morton @ 2015-10-05 19:29 UTC (permalink / raw)
To: Michal Hocko
Cc: Guenter Roeck, linux-kernel@vger.kernel.org, Christoph Lameter, Kirill A. Shutemov, linux-mm, Dave Chinner

On Mon, 5 Oct 2015 15:47:13 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > The fourth best way of fixing this is a nasty short-term bodge, such as
> > the one you just sent ;) But if we're going to do this, it should be
> > the minimal bodge which fixes this deadlock.  Is it possible to come up
> > with a one-liner (plus suitable comment) to get us out of this mess?
>
> Yes I do agree that the fix I am proposing is short-term but this seems
> like the easiest way to go for stable and older kernels that might be
> affected. I thought your proposal for mapping_gfp_constraint was exactly
> to have all such places annotated for an easier future transition to
> something more reasonable.

hm, OK, let's go that way.  But I expect this mess will continue to
float around for a long time - fixing it nicely will be somewhat
intrusive.

> > Longer-term I suggest we look at generalising the memalloc_noio_foo()
> > stuff so as to permit callers to mask off (ie: zero) __GFP_ flags in
> > callees.  I have a suspicion we should have done this 15 years ago
> > (which is about when I started wanting to do it).
>
> I am not sure memalloc_noio_foo is a huge win. It is an easy hack where
> the whole allocation transaction is clear - like in the PM code. I am
> not sure this is true also for the FS.

mm..  I think it'll work out OK - a set/restore around particular
callsites.

It might get messy in core MM though.  Do we apply current->mask at the
very low levels of the page allocator?  If so, that might muck up
intermediate callers who are peeking into specific gfp_t flags.
Perhaps it would be better to apply the mask at the highest possible
level: wherever a function which was not passed a gfp_t decides to
create one.  Basically a grep for "GFP_".  But then we need to decide
*which* gfp_t-creators need the treatment.  All of them (yikes) or is
this mechanism only for called-via-address_space_operations code?  That
might work.

Maybe it would be better to add the gfp_t argument to the
address_space_operations.  At a minimum, writepage(), readpage(),
writepages(), readpages().  What a pickle.
* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-05 19:29 ` Andrew Morton
@ 2015-10-06  2:12   ` Andrew Morton
  2015-10-06  5:12   ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread

From: Andrew Morton @ 2015-10-06 2:12 UTC (permalink / raw)
To: Michal Hocko, Guenter Roeck, linux-kernel@vger.kernel.org, Christoph Lameter, Kirill A. Shutemov, linux-mm, Dave Chinner

On Mon, 5 Oct 2015 12:29:36 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> Maybe it would be better to add the gfp_t argument to the
> address_space_operations.  At a minimum, writepage(), readpage(),
> writepages(), readpages().  What a pickle.

I'm being dumb.  All we need to do is to add a new

	address_space_operations.readpage_gfp(..., gfp_t gfp)

etc.  That should be trivial.  Each fs type only has 2-4 instances of
address_space_operations so the overhead is miniscule.

As a background janitorial thing we can migrate filesystems over to the
new interface then remove address_space_operations.readpage() etc.  But
this *would* add overhead: more stack and more text for no gain.  So
perhaps we just leave things as they are.

That's so simple that I expect we can short-term fix this bug with that
approach.
umm, something like

--- a/fs/mpage.c~a
+++ a/fs/mpage.c
@@ -139,7 +139,8 @@ map_buffer_to_page(struct page *page, st
 static struct bio *
 do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 		sector_t *last_block_in_bio, struct buffer_head *map_bh,
-		unsigned long *first_logical_block, get_block_t get_block)
+		unsigned long *first_logical_block, get_block_t get_block,
+		gfp_t gfp)
 {
 	struct inode *inode = page->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
@@ -278,7 +279,7 @@ alloc_new:
 		}
 		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
 				min_t(int, nr_pages, BIO_MAX_PAGES),
-				GFP_KERNEL);
+				gfp);
 		if (bio == NULL)
 			goto confused;
 	}
@@ -310,7 +311,7 @@ confused:
 }
 
 /**
- * mpage_readpages - populate an address space with some pages & start reads against them
+ * mpage_readpages_gfp - populate an address space with some pages & start reads against them
  * @mapping: the address_space
  * @pages: The address of a list_head which contains the target pages.  These
  *   pages have their ->index populated and are otherwise uninitialised.
@@ -318,6 +319,7 @@ confused:
  *   issued in @pages->prev to @pages->next order.
  * @nr_pages: The number of pages at *@pages
  * @get_block: The filesystem's block mapper function.
+ * @gfp: memory allocation constraints
  *
  * This function walks the pages and the blocks within each page, building and
  * emitting large BIOs.
@@ -352,9 +354,8 @@ confused:
  *
  * This all causes the disk requests to be issued in the correct order.
  */
-int
-mpage_readpages(struct address_space *mapping, struct list_head *pages,
-				unsigned nr_pages, get_block_t get_block)
+int mpage_readpages_gfp(struct address_space *mapping, struct list_head *pages,
+			unsigned nr_pages, get_block_t get_block, gfp_t gfp)
 {
 	struct bio *bio = NULL;
 	unsigned page_idx;
@@ -370,12 +371,12 @@ mpage_readpages(struct address_space *ma
 		prefetchw(&page->flags);
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
-					page->index, GFP_KERNEL)) {
+					page->index, gfp)) {
 			bio = do_mpage_readpage(bio, page,
 					nr_pages - page_idx,
 					&last_block_in_bio, &map_bh,
 					&first_logical_block,
-					get_block);
+					get_block, gfp);
 		}
 		page_cache_release(page);
 	}
@@ -384,6 +385,14 @@ mpage_readpages(struct address_space *ma
 		mpage_bio_submit(READ, bio);
 	return 0;
 }
+EXPORT_SYMBOL(mpage_readpages_gfp);
+
+int mpage_readpages(struct address_space *mapping, struct list_head *pages,
+				unsigned nr_pages, get_block_t get_block)
+{
+	return mpage_readpages_gfp(mapping, pages, nr_pages, get_block,
+				   GFP_KERNEL);
+}
 EXPORT_SYMBOL(mpage_readpages);
 
 /*
@@ -399,7 +408,7 @@ int mpage_readpage(struct page *page, ge
 	map_bh.b_state = 0;
 	map_bh.b_size = 0;
 	bio = do_mpage_readpage(bio, page, 1, &last_block_in_bio,
-			&map_bh, &first_logical_block, get_block);
+			&map_bh, &first_logical_block, get_block, GFP_KERNEL);
 	if (bio)
 		mpage_bio_submit(READ, bio);
 	return 0;
diff -puN include/linux/fs.h~a include/linux/fs.h
diff -puN include/linux/mpage.h~a include/linux/mpage.h
--- a/include/linux/mpage.h~a
+++ a/include/linux/mpage.h
@@ -15,6 +15,8 @@ struct writeback_control;
 
 int mpage_readpages(struct address_space *mapping, struct list_head *pages,
 				unsigned nr_pages, get_block_t get_block);
+int mpage_readpages_gfp(struct address_space *mapping, struct list_head *pages,
+			unsigned nr_pages, get_block_t get_block, gfp_t gfp);
 int mpage_readpage(struct page *page, get_block_t get_block);
 int mpage_writepages(struct address_space *mapping,
 		struct writeback_control *wbc, get_block_t get_block);
diff -puN fs/xfs/xfs_aops.c~a fs/xfs/xfs_aops.c
--- a/fs/xfs/xfs_aops.c~a
+++ a/fs/xfs/xfs_aops.c
@@ -1908,13 +1908,14 @@ xfs_vm_readpage(
 }
 
 STATIC int
-xfs_vm_readpages(
+xfs_vm_readpages_gfp(
 	struct file		*unused,
 	struct address_space	*mapping,
 	struct list_head	*pages,
 	unsigned		nr_pages)
 {
-	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
+	return mpage_readpages_gfp(mapping, pages, nr_pages, xfs_get_blocks,
+				   GFP_NOFS);
 }
 
 /*
@@ -1987,7 +1988,7 @@ xfs_vm_set_page_dirty(
 
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
-	.readpages		= xfs_vm_readpages,
+	.readpages		= xfs_vm_readpages_gfp,
 	.writepage		= xfs_vm_writepage,
 	.writepages		= xfs_vm_writepages,
 	.set_page_dirty		= xfs_vm_set_page_dirty,
_
* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-06  2:12 ` Andrew Morton
@ 2015-10-06  5:12   ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread

From: Dave Chinner @ 2015-10-06 5:12 UTC (permalink / raw)
To: Andrew Morton
Cc: Michal Hocko, Guenter Roeck, linux-kernel@vger.kernel.org, Christoph Lameter, Kirill A. Shutemov, linux-mm

On Mon, Oct 05, 2015 at 07:12:17PM -0700, Andrew Morton wrote:
> On Mon, 5 Oct 2015 12:29:36 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > Maybe it would be better to add the gfp_t argument to the
> > address_space_operations.  At a minimum, writepage(), readpage(),
> > writepages(), readpages().  What a pickle.
>
> I'm being dumb.  All we need to do is to add a new
>
> 	address_space_operations.readpage_gfp(..., gfp_t gfp)
>
> etc.  That should be trivial.  Each fs type only has 2-4 instances of
> address_space_operations so the overhead is miniscule.
>
> As a background janitorial thing we can migrate filesystems over to the
> new interface then remove address_space_operations.readpage() etc.  But
> this *would* add overhead: more stack and more text for no gain.  So
> perhaps we just leave things as they are.
>
> That's so simple that I expect we can short-term fix this bug with that
> approach.  umm, something like
>
> --- a/fs/mpage.c~a
> +++ a/fs/mpage.c
> @@ -139,7 +139,8 @@ map_buffer_to_page(struct page *page, st
>  static struct bio *
>  do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
>  		sector_t *last_block_in_bio, struct buffer_head *map_bh,
> -		unsigned long *first_logical_block, get_block_t get_block)
> +		unsigned long *first_logical_block, get_block_t get_block,
> +		gfp_t gfp)

Simple enough, but not sufficient to avoid deadlocks because this
doesn't address the common vector for deadlock that was reported,
i.e. deadlocks due to the layering dependency the loop device creates
between two otherwise independent filesystems.

If read IO through the loop device does GFP_KERNEL allocations, then it
is susceptible to deadlock as that can force writeback and transactions
from the filesystem on top of the loop device, which does more IO to
the loop device, which then backs up on the backing file inode mutex.
Forwards progress cannot be made until the GFP_KERNEL allocation
succeeds, but that depends on the loop device making forwards
progress...

The loop device indicates this is a known problem and tries to avoid
these deadlocks by doing this on its backing file:

	mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS))

to try to ensure that mapping-related allocations do not cause
inappropriate reclaim contexts to be triggered.

This is a problem independent of any specific filesystem - let's not
hack a workaround into a specific filesystem just because the problem
was reported on that filesystem....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-01 20:49 ` linux-next: kernel BUG at mm/slub.c:1447! Andrew Morton
  2015-10-02  7:25   ` Michal Hocko
@ 2015-10-02 13:36   ` Guenter Roeck
  2015-10-02 13:42     ` Michal Hocko
  1 sibling, 1 reply; 10+ messages in thread
From: Guenter Roeck @ 2015-10-02 13:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel@vger.kernel.org, Christoph Lameter,
      Kirill A. Shutemov, linux-mm, Michal Hocko, Dave Chinner

On 10/01/2015 01:49 PM, Andrew Morton wrote:
> On Thu, 1 Oct 2015 09:06:15 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
>
>> Seen with next-20151001, running qemu, simulating Opteron_G1 with a non-SMP configuration.
>> On a re-run, I have seen it with the same image, but this time when simulating IvyBridge,
>> so it is not CPU dependent. I did not previously see the problem.
>>
>> Log is at
>> http://server.roeck-us.net:8010/builders/qemu-x86-next/builds/259/steps/qemubuildcommand/logs/stdio
>>
>> I'll try to bisect. The problem is not seen with every boot, so that may take a while.
>
> Caused by mhocko's "mm, fs: obey gfp_mapping for add_to_page_cache()",
> I expect.
>
I tried to bisect to be sure, but the problem doesn't happen often enough,
and I got some false negatives. I assume a bisect is no longer necessary.
If I need to try again, please let me know.

Thanks,
Guenter
* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-02 13:36 ` Guenter Roeck
@ 2015-10-02 13:42   ` Michal Hocko
  0 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2015-10-02 13:42 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Andrew Morton, linux-kernel@vger.kernel.org, Christoph Lameter,
      Kirill A. Shutemov, linux-mm, Dave Chinner

On Fri 02-10-15 06:36:57, Guenter Roeck wrote:
> On 10/01/2015 01:49 PM, Andrew Morton wrote:
> > On Thu, 1 Oct 2015 09:06:15 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
> >
> >> Seen with next-20151001, running qemu, simulating Opteron_G1 with a non-SMP configuration.
> >> On a re-run, I have seen it with the same image, but this time when simulating IvyBridge,
> >> so it is not CPU dependent. I did not previously see the problem.
> >>
> >> Log is at
> >> http://server.roeck-us.net:8010/builders/qemu-x86-next/builds/259/steps/qemubuildcommand/logs/stdio
> >>
> >> I'll try to bisect. The problem is not seen with every boot, so that may take a while.
> >
> > Caused by mhocko's "mm, fs: obey gfp_mapping for add_to_page_cache()",
> > I expect.
>
> I tried to bisect to be sure, but the problem doesn't happen often enough,
> and I got some false negatives. I assume a bisect is no longer necessary.
> If I need to try again, please let me know.

The updated patch has been posted here:
http://lkml.kernel.org/r/20151002085324.GA2927%40dhcp22.suse.cz

Andrew's analysis seems decent, so I am pretty sure it should fix the issue.
-- 
Michal Hocko
SUSE Labs
end of thread, other threads:[~2015-10-06  5:12 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
     [not found] <560D59F7.4070002@roeck-us.net>
2015-10-01 20:49 ` linux-next: kernel BUG at mm/slub.c:1447! Andrew Morton
2015-10-02  7:25   ` Michal Hocko
2015-10-02  8:53     ` Michal Hocko
2015-10-02 20:49       ` Andrew Morton
2015-10-05 13:47         ` Michal Hocko
2015-10-05 19:29           ` Andrew Morton
2015-10-06  2:12             ` Andrew Morton
2015-10-06  5:12               ` Dave Chinner
2015-10-02 13:36   ` Guenter Roeck
2015-10-02 13:42     ` Michal Hocko