linux-mm.kvack.org archive mirror
* Re: linux-next: kernel BUG at mm/slub.c:1447!
       [not found] <560D59F7.4070002@roeck-us.net>
@ 2015-10-01 20:49 ` Andrew Morton
  2015-10-02  7:25   ` Michal Hocko
  2015-10-02 13:36   ` Guenter Roeck
  0 siblings, 2 replies; 10+ messages in thread
From: Andrew Morton @ 2015-10-01 20:49 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: linux-kernel@vger.kernel.org, Christoph Lameter,
	Kirill A. Shutemov, linux-mm, Michal Hocko, Dave Chinner

On Thu, 1 Oct 2015 09:06:15 -0700 Guenter Roeck <linux@roeck-us.net> wrote:

> Seen with next-20151001, running qemu, simulating Opteron_G1 with a non-SMP configuration.
> On a re-run, I have seen it with the same image, but this time when simulating IvyBridge,
> so it is not CPU dependent. I did not previously see the problem.
> 
> Log is at
> http://server.roeck-us.net:8010/builders/qemu-x86-next/builds/259/steps/qemubuildcommand/logs/stdio
> 
> I'll try to bisect. The problem is not seen with every boot, so that may take a while.

Caused by mhocko's "mm, fs: obey gfp_mapping for add_to_page_cache()",
I expect.

> ---
> gfp: 2

That's __GFP_HIGHMEM
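
For reference, "gfp: 2" decodes like this against the GFP bit layout of that era (a userspace sketch; the constant values are assumptions taken from the 4.3-era include/linux/gfp.h):

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Assumed 4.3-era bit values from include/linux/gfp.h */
#define ___GFP_DMA      0x01u
#define ___GFP_HIGHMEM  0x02u   /* "gfp: 2" is exactly this bit */
#define ___GFP_FS       0x80u

/* Nonzero if the mask asks slab for highmem, which is what trips the BUG. */
static int gfp_is_highmem(gfp_t gfp)
{
	return (gfp & ___GFP_HIGHMEM) != 0;
}
```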

> ------------[ cut here ]------------
> invalid opcode: 0000 [#1] PREEMPT
> Modules linked in:
> CPU: 0 PID: 121 Comm: udevd Not tainted 4.3.0-rc3-next-20151001-yocto-standard #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> task: ced90000 ti: ced8c000 task.ti: ced8c000
> EIP: 0060:[<c1128873>] EFLAGS: 00000092 CPU: 0
> EIP is at new_slab+0x353/0x360
> EAX: 00000006 EBX: 00000000 ECX: 00000001 EDX: 80000001
> ESI: cf8019c0 EDI: 00000000 EBP: ced8daa4 ESP: ced8da7c
>   DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
> CR0: 8005003b CR2: 080791c0 CR3: 0ed6c000 CR4: 000006d0
> DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> DR6: 00000000 DR7: 00000000
> Stack:
>   c19a42cf 00000002 c137542e ced8da90 c137544c ffffffff c144c8a8 00000000
>   cf8019c0 00000000 ced8db28 c1129ca8 0203128a c144f346 00004e20 cec2e740
>   c10ee933 0203128a cf8019c0 c181d6c0 c181d460 000001a1 00150015 c0011c00
> Call Trace:
>   [<c137542e>] ? __delay+0xe/0x10
>   [<c137544c>] ? __const_udelay+0x1c/0x20
>   [<c144c8a8>] ? ide_execute_command+0x68/0xa0
>   [<c1129ca8>] ___slab_alloc.constprop.75+0x248/0x310
>   [<c144f346>] ? do_rw_taskfile+0x286/0x320
>   [<c10ee933>] ? mempool_alloc_slab+0x13/0x20
>   [<c1457d12>] ? ide_do_rw_disk+0x222/0x320
>   [<c1136219>] __slab_alloc.isra.72.constprop.74+0x18/0x1f
>   [<c112a2f2>] kmem_cache_alloc+0x122/0x1c0
>   [<c10ee933>] ? mempool_alloc_slab+0x13/0x20
>   [<c10ee933>] mempool_alloc_slab+0x13/0x20
>   [<c10eebe5>] mempool_alloc+0x45/0x170
>   [<c1345202>] bio_alloc_bioset+0xd2/0x1b0
>   [<c1172e9f>] mpage_alloc+0x2f/0xa0
>   [<c1037979>] ? kmap_atomic_prot+0x59/0xf0
>   [<c1173523>] do_mpage_readpage+0x4d3/0x7e0
>   [<c10f31b8>] ? __alloc_pages_nodemask+0xf8/0x8c0
>   [<c134ed67>] ? blk_queue_bio+0x267/0x2d0
>   [<c112a24a>] ? kmem_cache_alloc+0x7a/0x1c0
>   [<c138357f>] ? __this_cpu_preempt_check+0xf/0x20
>   [<c1173894>] mpage_readpage+0x64/0x80

mpage_readpage() is getting the __GFP_HIGHMEM from mapping_gfp_mask()
and that got passed all the way into kmem_cache_alloc() to allocate a
bio.  slab goes BUG if asked for highmem.

A fix would be to mask off __GFP_HIGHMEM right there in
mpage_readpage().
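
A sketch of what that masking would look like (a userspace mock-up, not the actual patch; the flag values are assumptions from the 4.3-era headers):

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Assumed 4.3-era values */
#define __GFP_HIGHMEM  0x02u
#define __GFP_WAIT     0x10u
#define __GFP_IO       0x40u
#define __GFP_FS       0x80u
#define GFP_KERNEL     (__GFP_WAIT | __GFP_IO | __GFP_FS)

/*
 * The one-line fix: strip the highmem bit from the mapping's mask
 * before it can reach a slab allocation (slab cannot return highmem).
 */
static gfp_t strip_highmem(gfp_t mapping_gfp)
{
	return mapping_gfp & ~__GFP_HIGHMEM;
}
```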

But I think the patch needs a bit of a rethink.  mapping_gfp_mask() is
the mask for allocating a file's pagecache.  It isn't designed for
allocation of memory for IO structures, file metadata, etc.

Now, we could redefine mapping_gfp_mask()'s purpose (or formalize
stuff which has been sneaking in anyway).  Treat mapping_gfp_mask() as
a constraint mask - instead of it being "use this gfp for this
mapping", it becomes "don't use these gfp flags for this mapping".

Hence something like:

gfp_t mapping_gfp_constraint(struct address_space *mapping, gfp_t gfp_in)
{
	return mapping_gfp_mask(mapping) & gfp_in;
}

So instead of doing this:

@@ -370,12 +371,13 @@ mpage_readpages(struct address_space *ma
 		prefetchw(&page->flags);
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
-					page->index, GFP_KERNEL)) {
+					page->index,
+					gfp)) {

Michal's patch will do:

@@ -370,12 +371,13 @@ mpage_readpages(struct address_space *ma
 		prefetchw(&page->flags);
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
-				page->index, GFP_KERNEL)) {
+				page->index,
+				mapping_gfp_constraint(mapping, GFP_KERNEL))) {

ie: use mapping_gfp_mask() to strip out any GFP flags which the
filesystem doesn't want used.  If the filesystem has *added* flags to
mapping_gfp_mask() then obviously this won't work and we'll need two
fields in the address_space or something.
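
The AND semantics can be sanity-checked in isolation. In this userspace sketch (flag values assumed from the 4.3-era headers), a GFP_NOFS mapping mask strips __GFP_FS from a GFP_KERNEL request, and the mask can only ever remove flags, never add them:

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Assumed 4.3-era values */
#define __GFP_WAIT  0x10u
#define __GFP_IO    0x40u
#define __GFP_FS    0x80u
#define GFP_KERNEL  (__GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_NOFS    (__GFP_WAIT | __GFP_IO)

/* The proposed helper, minus the struct address_space plumbing. */
static gfp_t mapping_gfp_constraint(gfp_t mapping_mask, gfp_t gfp_in)
{
	return mapping_mask & gfp_in;
}
```

So a mapping whose mask is GFP_NOFS turns a GFP_KERNEL request into GFP_NOFS, which is the "don't use these flags" interpretation described above.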

Meanwhile I'll drop "mm, fs: obey gfp_mapping for add_to_page_cache()",
thanks for the report.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-01 20:49 ` linux-next: kernel BUG at mm/slub.c:1447! Andrew Morton
@ 2015-10-02  7:25   ` Michal Hocko
  2015-10-02  8:53     ` Michal Hocko
  2015-10-02 20:49     ` Andrew Morton
  2015-10-02 13:36   ` Guenter Roeck
  1 sibling, 2 replies; 10+ messages in thread
From: Michal Hocko @ 2015-10-02  7:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Guenter Roeck, linux-kernel@vger.kernel.org, Christoph Lameter,
	Kirill A. Shutemov, linux-mm, Dave Chinner

On Thu 01-10-15 13:49:04, Andrew Morton wrote:
[...]
> mpage_readpage() is getting the __GFP_HIGHMEM from mapping_gfp_mask()
> and that got passed all the way into kmem_cache_alloc() to allocate a
> bio.  slab goes BUG if asked for highmem.
> 
> A fix would be to mask off __GFP_HIGHMEM right there in
> mpage_readpage().

Yes, this is an obvious bug in the patch. It should only make the gfp
mask more restrictive.

> But I think the patch needs a bit of a rethink.  mapping_gfp_mask() is
> the mask for allocating a file's pagecache.  It isn't designed for
> allocation of memory for IO structures, file metadata, etc.
>
> Now, we could redefine mapping_gfp_mask()'s purpose (or formalize
> stuff which has been sneaking in anyway).  Treat mapping_gfp_mask() as
> a constraint mask - instead of it being "use this gfp for this
> mapping", it becomes "don't use these gfp flags for this mapping".
> 
> Hence something like:
> 
> gfp_t mapping_gfp_constraint(struct address_space *mapping, gfp_t gfp_in)
> {
> 	return mapping_gfp_mask(mapping) & gfp_in;
> }
> 
> So instead of doing this:
> 
> @@ -370,12 +371,13 @@ mpage_readpages(struct address_space *ma
>  		prefetchw(&page->flags);
>  		list_del(&page->lru);
>  		if (!add_to_page_cache_lru(page, mapping,
> -					page->index, GFP_KERNEL)) {
> +					page->index,
> +					gfp)) {
> 
> Michal's patch will do:
> 
> @@ -370,12 +371,13 @@ mpage_readpages(struct address_space *ma
>  		prefetchw(&page->flags);
>  		list_del(&page->lru);
>  		if (!add_to_page_cache_lru(page, mapping,
> -				page->index, GFP_KERNEL)) {
> +				page->index,
> +				mapping_gfp_constraint(mapping, GFP_KERNEL))) {
> 
> ie: use mapping_gfp_mask() to strip out any GFP flags which the
> filesystem doesn't want used.  If the filesystem has *added* flags to
> mapping_gfp_mask() then obviously this won't work and we'll need two
> fields in the address_space or something.
> 
> Meanwhile I'll drop "mm, fs: obey gfp_mapping for add_to_page_cache()",
> thanks for the report.

mapping_gfp_mask is used in many places, so I think it would be better
to fix this particular place (the others seem to be correct). That
would make the stable backport much easier. We can build a saner API on
top. What do you think?

Here is the respin of the original patch. I will post another one which
will add mapping_gfp_constraint on top. It will surely be less error
prone.
---

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-02  7:25   ` Michal Hocko
@ 2015-10-02  8:53     ` Michal Hocko
  2015-10-02 20:49     ` Andrew Morton
  1 sibling, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2015-10-02  8:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Guenter Roeck, linux-kernel@vger.kernel.org, Christoph Lameter,
	Kirill A. Shutemov, linux-mm, Dave Chinner

On Fri 02-10-15 09:25:22, Michal Hocko wrote:
> On Thu 01-10-15 13:49:04, Andrew Morton wrote:
[...]
> > Now, we could redefine mapping_gfp_mask()'s purpose (or formalize
> > stuff which has been sneaking in anyway).  Treat mapping_gfp_mask() as
> > a constraint mask - instead of it being "use this gfp for this
> > mapping", it becomes "don't use these gfp flags for this mapping".
> > 
> > Hence something like:
> > 
> > gfp_t mapping_gfp_constraint(struct address_space *mapping, gfp_t gfp_in)
> > {
> > 	return mapping_gfp_mask(mapping) & gfp_in;
> > }
> > 
> > So instead of doing this:
> > 
> > @@ -370,12 +371,13 @@ mpage_readpages(struct address_space *ma
> >  		prefetchw(&page->flags);
> >  		list_del(&page->lru);
> >  		if (!add_to_page_cache_lru(page, mapping,
> > -					page->index, GFP_KERNEL)) {
> > +					page->index,
> > +					gfp)) {
> > 
[...]
> I will post another one which
> will add mapping_gfp_constraint on top. It will surely be less error
> prone.

OK, so here we go. There are still a few direct users of mapping_gfp_mask
but most of them use it unrestricted. The only exceptions seem to be the
loop driver and the lustre lloop driver, which save and restrict
mapping_gfp; those didn't seem worthwhile converting.
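
For reference, the loop-driver pattern mentioned above is roughly the following save/restrict/restore dance (a simplified userspace sketch of what drivers/block/loop.c does around loop_set_fd()/loop_clr_fd(); names and flag values here are illustrative assumptions):

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Assumed 4.3-era values */
#define __GFP_WAIT  0x10u
#define __GFP_IO    0x40u
#define __GFP_FS    0x80u

/* Minimal stand-in for the mapping's gfp field. */
struct mapping_sketch { gfp_t gfp_mask; };

/*
 * What loop_set_fd() does in spirit: remember the old mask, then forbid
 * __GFP_IO/__GFP_FS while the backing file is bound to the device.
 */
static gfp_t loop_restrict_gfp(struct mapping_sketch *m)
{
	gfp_t old = m->gfp_mask;

	m->gfp_mask = old & ~(__GFP_IO | __GFP_FS);
	return old;	/* caller stashes this for the restore */
}

/* What loop_clr_fd() does in spirit: put the saved mask back. */
static void loop_restore_gfp(struct mapping_sketch *m, gfp_t old)
{
	m->gfp_mask = old;
}
```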

This is on top of the current linux-next + the updated fix for the loop
hang.
---

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-01 20:49 ` linux-next: kernel BUG at mm/slub.c:1447! Andrew Morton
  2015-10-02  7:25   ` Michal Hocko
@ 2015-10-02 13:36   ` Guenter Roeck
  2015-10-02 13:42     ` Michal Hocko
  1 sibling, 1 reply; 10+ messages in thread
From: Guenter Roeck @ 2015-10-02 13:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel@vger.kernel.org, Christoph Lameter,
	Kirill A. Shutemov, linux-mm, Michal Hocko, Dave Chinner

On 10/01/2015 01:49 PM, Andrew Morton wrote:
> On Thu, 1 Oct 2015 09:06:15 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
>
>> Seen with next-20151001, running qemu, simulating Opteron_G1 with a non-SMP configuration.
>> On a re-run, I have seen it with the same image, but this time when simulating IvyBridge,
>> so it is not CPU dependent. I did not previously see the problem.
>>
>> Log is at
>> http://server.roeck-us.net:8010/builders/qemu-x86-next/builds/259/steps/qemubuildcommand/logs/stdio
>>
>> I'll try to bisect. The problem is not seen with every boot, so that may take a while.
>
> Caused by mhocko's "mm, fs: obey gfp_mapping for add_to_page_cache()",
> I expect.
>
I tried to bisect to be sure, but the problem doesn't happen often enough, and I got some
false negatives. I assume bisect is no longer necessary. If I need to try again, please
let me know.

Thanks,
Guenter


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-02 13:36   ` Guenter Roeck
@ 2015-10-02 13:42     ` Michal Hocko
  0 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2015-10-02 13:42 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Andrew Morton, linux-kernel@vger.kernel.org, Christoph Lameter,
	Kirill A. Shutemov, linux-mm, Dave Chinner

On Fri 02-10-15 06:36:57, Guenter Roeck wrote:
> On 10/01/2015 01:49 PM, Andrew Morton wrote:
> >On Thu, 1 Oct 2015 09:06:15 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
> >
> >>Seen with next-20151001, running qemu, simulating Opteron_G1 with a non-SMP configuration.
> >>On a re-run, I have seen it with the same image, but this time when simulating IvyBridge,
> >>so it is not CPU dependent. I did not previously see the problem.
> >>
> >>Log is at
> >>http://server.roeck-us.net:8010/builders/qemu-x86-next/builds/259/steps/qemubuildcommand/logs/stdio
> >>
> >>I'll try to bisect. The problem is not seen with every boot, so that may take a while.
> >
> >Caused by mhocko's "mm, fs: obey gfp_mapping for add_to_page_cache()",
> >I expect.
> >
> I tried to bisect to be sure, but the problem doesn't happen often enough, and I got some
> false negatives. I assume bisect is no longer necessary. If I need to try again, please
> let me know.

The updated patch has been posted here:
http://lkml.kernel.org/r/20151002085324.GA2927%40dhcp22.suse.cz

Andrew's analysis seems decent so I am pretty sure it should fix the
issue.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-02  7:25   ` Michal Hocko
  2015-10-02  8:53     ` Michal Hocko
@ 2015-10-02 20:49     ` Andrew Morton
  2015-10-05 13:47       ` Michal Hocko
  1 sibling, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2015-10-02 20:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Guenter Roeck, linux-kernel@vger.kernel.org, Christoph Lameter,
	Kirill A. Shutemov, linux-mm, Dave Chinner

On Fri, 2 Oct 2015 09:25:23 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> 6afdb859b710 ("mm: do not ignore mapping_gfp_mask in page cache allocation
> paths") has caught some users of hardcoded GFP_KERNEL used in the page
> cache allocation paths.  This, however, wasn't complete and there were
> others which went unnoticed.
> 
> Dave Chinner has reported the following deadlock for xfs on loop device:
> : With the recent merge of the loop device changes, I'm now seeing
> : XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
> :
> : The deadlocked is as follows:
> :
> : kloopd1: loop_queue_read_work
> :       xfs_file_iter_read
> :       lock XFS inode XFS_IOLOCK_SHARED (on image file)
> :       page cache read (GFP_KERNEL)
> :       radix tree alloc
> :       memory reclaim
> :       reclaim XFS inodes
> :       log force to unpin inodes
> :       <wait for log IO completion>
> :
> : xfs-cil/loop1: <does log force IO work>
> :       xlog_cil_push
> :       xlog_write
> :       <loop issuing log writes>
> :               xlog_state_get_iclog_space()
> :               <blocks due to all log buffers under write io>
> :               <waits for IO completion>
> :
> : kloopd1: loop_queue_write_work
> :       xfs_file_write_iter
> :       lock XFS inode XFS_IOLOCK_EXCL (on image file)
> :       <wait for inode to be unlocked>
> :
> : i.e. the kloopd, with its split read and write work queues, has
> : introduced a dependency through memory reclaim, i.e. writes
> : need to be able to progress for reads to make progress.
> :
> : The problem, fundamentally, is that mpage_readpages() does a
> : GFP_KERNEL allocation, rather than paying attention to the inode's
> : mapping gfp mask, which is set to GFP_NOFS.
> :
> : This didn't use to happen, because the loop device used to issue
> : reads through the splice path and that does:
> :
> :       error = add_to_page_cache_lru(page, mapping, index,
> :                       GFP_KERNEL & mapping_gfp_mask(mapping));
> 
> This has changed by aa4d86163e4 ("block: loop: switch to VFS ITER_BVEC").
> 
> This patch changes mpage_readpage{s} to follow gfp mask set for the
> mapping.  There are, however, other places which are doing basically the
> same.
> 
> lustre:ll_dir_filler is doing GFP_KERNEL from a function which
> apparently uses GFP_NOFS for its other allocations, so let's make this
> consistent.
> 
> cifs:readpages_get_pages is called from cifs_readpages, and
> __cifs_readpages_from_fscache, called from the same path, already
> obeys the mapping gfp.
> 
> ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well,
> even though it uses mapping_gfp_mask for the page allocation.
> 
> ext4_mpage_readpages is called from the page cache allocation path,
> the same as read_pages and read_cache_pages.
> 
> As I've noted in my previous post, I cannot say I would be happy about
> sprinkling mapping_gfp_mask all over the place, and it sounds like we
> should drop the gfp_mask argument altogether and use it internally in
> __add_to_page_cache_locked.  That would require all the filesystems to
> use the mapping gfp consistently, which I am not sure is the case
> here.  From a quick glance it seems that some filesystems use it all
> the time while others are selective.

There's a lot of confusion here, so let's try to do some clear
thinking.

mapping_gfp_mask() describes the mask to be used for allocating this
mapping's pagecache pages.  Nothing else.  I introduced it to provide a
clean way to ensure that blockdev inode pagecache is not allocated from
highmem.

This is totally different from "I'm holding a lock so I can't do
__GFP_FS"!  __GFP_HIGHMEM and __GFP_FS are utterly unrelated concepts -
the only commonality is that they happen to be passed to callees in the
same memory word.

At present the VFS provides no way for the filesystem to specify the
allocation mode for pagecache operations.  The VFS assumes
__GFP_FS|__GFP_IO|__GFP_WAIT|etc.  mapping_gfp_mask() may *look* like
it provides a way, but it doesn't.

The way for a caller to communicate "I can't do __GFP_FS from this
particular callsite" is to pass that info in a gfp_t, in a function
call argument.  It is very wrong to put this info into
mapping_gfp_mask(), as XFS has done.  For obvious reasons: different
callsites may want different values.

One can easily envision a filesystem whose read() can allocate
pagecache with GFP_HIGHUSER but the write() needs
GFP_HIGHUSER&~__GFP_FS.  This obviously won't fly if we're (ab)using
mapping_gfp_mask() in this fashion.  Also callsites that *can* use the
stronger __GFP_FS are artificially prevented from doing so.


So.

By far the best way of addressing this bug is to fix XFS so that it can
use __GFP_FS for allocating pagecache.  It's really quite lame that the
filesystem cannot use the strongest memory allocation mode for the
largest volume of allocation.  Obviously this fix is too difficult,
otherwise it would already have been made.


The second best way of solving this bug is to pass a gfp_t into the
relevant callees, in the time-honoured manner.  That would involve
alteration of probably several address_space_operations function
pointers and lots of mind-numbing mechanical edits and bloat.


The third best way is to pass the gfp_t via the task_struct.  See
memalloc_noio_save() and memalloc_noio_restore().  This is a pretty
grubby way of passing function arguments, but I'm OK with it in this
special case, because

  a) Adding a gfp_t arg to 11 billion functions has costs of
     several forms

  b) Adding all these costs just because one filesystem is being
     weird doesn't make sense

  c) The mpage functions already have too many arguments.  Adding
     yet another is getting kinda ridiculous, and will cost more stack
     on some of our deepest paths.


The fourth best way of fixing this is a nasty short-term bodge, such as
the one you just sent ;) But if we're going to do this, it should be
the minimal bodge which fixes this deadlock.  Is it possible to come up
with a one-liner (plus suitable comment) to get us out of this mess?


Longer-term I suggest we look at generalising the memalloc_noio_foo()
stuff so as to permit callers to mask off (ie: zero) __GFP_ flags in
callees.  I have a suspicion we should have done this 15 years ago
(which is about when I started wanting to do it).
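
A sketch of that generalisation (hypothetical: at the time only the NOIO-specific memalloc_noio_save()/memalloc_noio_restore() pair existed; this just widens the same save/mask/restore pattern to an arbitrary set of flags, applied once inside the allocator rather than at every callsite):

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Assumed 4.3-era value */
#define __GFP_FS 0x80u

/* Stand-in for the per-task field a real implementation would add. */
static gfp_t current_gfp_allowed = ~0u;

/* Forbid some flags for everything the current task allocates. */
static gfp_t memalloc_flags_save(gfp_t forbidden)
{
	gfp_t old = current_gfp_allowed;

	current_gfp_allowed &= ~forbidden;
	return old;
}

static void memalloc_flags_restore(gfp_t old)
{
	current_gfp_allowed = old;
}

/* Applied once, deep in the allocator, instead of in every caller. */
static gfp_t current_gfp_context(gfp_t gfp)
{
	return gfp & current_gfp_allowed;
}
```

A caller would bracket a critical region with memalloc_flags_save(__GFP_FS)/memalloc_flags_restore(), and the allocator would apply current_gfp_context() once, so no intermediate function has to forward a gfp_t.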


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-02 20:49     ` Andrew Morton
@ 2015-10-05 13:47       ` Michal Hocko
  2015-10-05 19:29         ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2015-10-05 13:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Guenter Roeck, linux-kernel@vger.kernel.org, Christoph Lameter,
	Kirill A. Shutemov, linux-mm, Dave Chinner

On Fri 02-10-15 13:49:53, Andrew Morton wrote:
[...]
> There's a lot of confusion here, so let's try to do some clear
> thinking.
> 
> mapping_gfp_mask() describes the mask to be used for allocating this
> mapping's pagecache pages.  Nothing else.  I introduced it to provide a
> clean way to ensure that blockdev inode pagecache is not allocated from
> highmem.

I am afraid this is not how it ended up being used.

> This is totally different from "I'm holding a lock so I can't do
> __GFP_FS"!  __GFP_HIGHMEM and __GFP_FS are utterly unrelated concepts -
> the only commonality is that they happen to be passed to callees in the
> same memory word.

Which is exactly how those filesystems overcome the lack of a
consistent API to pass the restrictions all the way down to the
allocator. AFAIR this is the case beyond xfs.

> At present the VFS provides no way for the filesystem to specify the
> allocation mode for pagecache operations.  The VFS assumes
> __GFP_FS|__GFP_IO|__GFP_WAIT|etc.  mapping_gfp_mask() may *look* like
> it provides a way, but it doesn't.
> 
> The way for a caller to communicate "I can't do __GFP_FS from this
> particular callsite" is to pass that info in a gfp_t, in a function
> call argument.  It is very wrong to put this info into
> mapping_gfp_mask(), as XFS has done.  For obvious reasons: different
> callsites may want different values.
> 
> One can easily envision a filesystem whose read() can allocate
> pagecache with GFP_HIGHUSER but the write() needs
> GFP_HIGHUSER&~__GFP_FS.  This obviously won't fly if we're (ab)using
> mapping_gfp_mask() in this fashion.  Also callsites that *can* use the
> stronger __GFP_FS are artificially prevented from doing so.

Yes I am not happy about the state we have grown into as well.

> So.
> 
> By far the best way of addressing this bug is to fix XFS so that it can
> use __GFP_FS for allocating pagecache.  It's really quite lame that the
> filesystem cannot use the strongest memory allocation mode for the
> largest volume of allocation.  Obviously this fix is too difficult,
> otherwise it would already have been made.
> 
> 
> The second best way of solving this bug is to pass a gfp_t into the
> relevant callees, in the time-honoured manner.  That would involve
> alteration of probably several address_space_operations function
> pointers and lots of mind-numbing mechanical edits and bloat.
> 
> 
> The third best way is to pass the gfp_t via the task_struct.  See
> memalloc_noio_save() and memalloc_noio_restore().  This is a pretty
> grubby way of passing function arguments, but I'm OK with it in this
> special case, because
> 
>   a) Adding a gfp_t arg to 11 billion functions has costs of
>      several forms
> 
>   b) Adding all these costs just because one filesystem is being
>      weird doesn't make sense
> 
>   c) The mpage functions already have too many arguments.  Adding
>      yet another is getting kinda ridiculous, and will cost more stack
>      on some of our deepest paths.
> 
> 
> The fourth best way of fixing this is a nasty short-term bodge, such a
> the one you just sent ;) But if we're going to do this, it should be
> the minimal bodge which fixes this deadlock.  Is it possible to come up
> with a one-liner (plus suitable comment) to get us out of this mess?

Yes I do agree that the fix I am proposing is short-term but this seems
like the easiest way to go for stable and older kernels that might be
affected. I thought your proposal for mapping_gfp_constraint was exactly
to have all such places annotated for an easier future transition to
something more reasonable.
I can reduce the patch to the fs/mpage.c chunk, which would be a few
lines and would fix the issue at the same time, but what about the
other calls, which are probably inconsistent and broken as well?
 
> Longer-term I suggest we look at generalising the memalloc_noio_foo()
> stuff so as to permit callers to mask off (ie: zero) __GFP_ flags in
> callees.  I have a suspicion we should have done this 15 years ago
> (which is about when I started wanting to do it).

I am not sure memalloc_noio_foo is a huge win. It is an easy hack when
the whole allocation transaction is clearly delimited, like in the PM
code. I am not sure the same is true for the FS.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-05 13:47       ` Michal Hocko
@ 2015-10-05 19:29         ` Andrew Morton
  2015-10-06  2:12           ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2015-10-05 19:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Guenter Roeck, linux-kernel@vger.kernel.org, Christoph Lameter,
	Kirill A. Shutemov, linux-mm, Dave Chinner

On Mon, 5 Oct 2015 15:47:13 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > The fourth best way of fixing this is a nasty short-term bodge, such as
> > the one you just sent ;) But if we're going to do this, it should be
> > the minimal bodge which fixes this deadlock.  Is it possible to come up
> > with a one-liner (plus suitable comment) to get us out of this mess?
> 
> Yes I do agree that the fix I am proposing is short-term but this seems
> like the easiest way to go for stable and older kernels that might be
> affected. I thought your proposal for mapping_gfp_constraint was exactly
> to have all such places annotated for an easier future transition to
> something more reasonable.

hm, OK, let's go that way.  But I expect this mess will continue to
float around for a long time - fixing it nicely will be somewhat
intrusive.

> > Longer-term I suggest we look at generalising the memalloc_noio_foo()
> > stuff so as to permit callers to mask off (ie: zero) __GFP_ flags in
> > callees.  I have a suspicion we should have done this 15 years ago
> > (which is about when I started wanting to do it).
> 
> I am not sure memalloc_noio_foo is a huge win. It is an easy hack where
> the whole allocation transaction is clear - like in the PM code. I am
> not sure this is true also for the FS.

mm..  I think it'll work out OK - a set/restore around particular
callsites.

It might get messy in core MM though.  Do we apply current->mask at the
very low levels of the page allocator?  If so, that might muck up
intermediate callers who are peeking into specific gfp_t flags.

Perhaps it would be better to apply the mask at the highest possible
level: wherever a function which was not passed a gfp_t decides to
create one.  Basically a grep for "GFP_".  But then we need to decide
*which* gfp_t-creators need the treatment.  All of them (yikes) or is
this mechanism only for called-via-address_space_operations code?  That
might work.

Maybe it would be better to add the gfp_t argument to the
address_space_operations.  At a minimum, writepage(), readpage(),
writepages(), readpages().  What a pickle.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-05 19:29         ` Andrew Morton
@ 2015-10-06  2:12           ` Andrew Morton
  2015-10-06  5:12             ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2015-10-06  2:12 UTC (permalink / raw)
  To: Michal Hocko, Guenter Roeck, linux-kernel@vger.kernel.org,
	Christoph Lameter, Kirill A. Shutemov, linux-mm, Dave Chinner

On Mon, 5 Oct 2015 12:29:36 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> Maybe it would be better to add the gfp_t argument to the
> address_space_operations.  At a minimum, writepage(), readpage(),
> writepages(), readpages().  What a pickle.

I'm being dumb.  All we need to do is to add a new

	address_space_operations.readpage_gfp(..., gfp_t gfp)

etc.  That should be trivial.  Each fs type only has 2-4 instances of
address_space_operations, so the overhead is minuscule.

As a background janitorial thing we can migrate filesystems over to the
new interface then remove address_space_operations.readpage() etc.  But
this *would* add overhead: more stack and more text for no gain.  So
perhaps we just leave things as they are.

That's so simple that I expect we can short-term fix this bug with that
approach.  umm, something like

--- a/fs/mpage.c~a
+++ a/fs/mpage.c
@@ -139,7 +139,8 @@ map_buffer_to_page(struct page *page, st
 static struct bio *
 do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 		sector_t *last_block_in_bio, struct buffer_head *map_bh,
-		unsigned long *first_logical_block, get_block_t get_block)
+		unsigned long *first_logical_block, get_block_t get_block,
+		gfp_t gfp)
 {
 	struct inode *inode = page->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
@@ -278,7 +279,7 @@ alloc_new:
 		}
 		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
 				min_t(int, nr_pages, BIO_MAX_PAGES),
-				GFP_KERNEL);
+				gfp);
 		if (bio == NULL)
 			goto confused;
 	}
@@ -310,7 +311,7 @@ confused:
 }
 
 /**
- * mpage_readpages - populate an address space with some pages & start reads against them
+ * mpage_readpages_gfp - populate an address space with some pages & start reads against them
  * @mapping: the address_space
  * @pages: The address of a list_head which contains the target pages.  These
  *   pages have their ->index populated and are otherwise uninitialised.
@@ -318,6 +319,7 @@ confused:
  *   issued in @pages->prev to @pages->next order.
  * @nr_pages: The number of pages at *@pages
  * @get_block: The filesystem's block mapper function.
+ * @gfp: memory allocation constraints
  *
  * This function walks the pages and the blocks within each page, building and
  * emitting large BIOs.
@@ -352,9 +354,8 @@ confused:
  *
  * This all causes the disk requests to be issued in the correct order.
  */
-int
-mpage_readpages(struct address_space *mapping, struct list_head *pages,
-				unsigned nr_pages, get_block_t get_block)
+int mpage_readpages_gfp(struct address_space *mapping, struct list_head *pages,
+			unsigned nr_pages, get_block_t get_block, gfp_t gfp)
 {
 	struct bio *bio = NULL;
 	unsigned page_idx;
@@ -370,12 +371,12 @@ mpage_readpages(struct address_space *ma
 		prefetchw(&page->flags);
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
-					page->index, GFP_KERNEL)) {
+					page->index, gfp)) {
 			bio = do_mpage_readpage(bio, page,
 					nr_pages - page_idx,
 					&last_block_in_bio, &map_bh,
 					&first_logical_block,
-					get_block);
+					get_block, gfp);
 		}
 		page_cache_release(page);
 	}
@@ -384,6 +385,14 @@ mpage_readpages(struct address_space *ma
 		mpage_bio_submit(READ, bio);
 	return 0;
 }
+EXPORT_SYMBOL(mpage_readpages_gfp);
+
+int mpage_readpages(struct address_space *mapping, struct list_head *pages,
+			unsigned nr_pages, get_block_t get_block)
+{
+	return mpage_readpages_gfp(mapping, pages, nr_pages, get_block,
+				   GFP_KERNEL);
+}
 EXPORT_SYMBOL(mpage_readpages);
 
 /*
@@ -399,7 +408,7 @@ int mpage_readpage(struct page *page, ge
 	map_bh.b_state = 0;
 	map_bh.b_size = 0;
 	bio = do_mpage_readpage(bio, page, 1, &last_block_in_bio,
-			&map_bh, &first_logical_block, get_block);
+			&map_bh, &first_logical_block, get_block, GFP_KERNEL);
 	if (bio)
 		mpage_bio_submit(READ, bio);
 	return 0;
diff -puN include/linux/fs.h~a include/linux/fs.h
diff -puN include/linux/mpage.h~a include/linux/mpage.h
--- a/include/linux/mpage.h~a
+++ a/include/linux/mpage.h
@@ -15,6 +15,8 @@ struct writeback_control;
 
 int mpage_readpages(struct address_space *mapping, struct list_head *pages,
 				unsigned nr_pages, get_block_t get_block);
+int mpage_readpages_gfp(struct address_space *mapping, struct list_head *pages,
+			unsigned nr_pages, get_block_t get_block, gfp_t gfp);
 int mpage_readpage(struct page *page, get_block_t get_block);
 int mpage_writepages(struct address_space *mapping,
 		struct writeback_control *wbc, get_block_t get_block);
diff -puN fs/xfs/xfs_aops.c~a fs/xfs/xfs_aops.c
--- a/fs/xfs/xfs_aops.c~a
+++ a/fs/xfs/xfs_aops.c
@@ -1908,13 +1908,14 @@ xfs_vm_readpage(
 }
 
 STATIC int
-xfs_vm_readpages(
+xfs_vm_readpages_gfp(
 	struct file		*unused,
 	struct address_space	*mapping,
 	struct list_head	*pages,
 	unsigned		nr_pages)
 {
-	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
+	return mpage_readpages_gfp(mapping, pages, nr_pages, xfs_get_blocks,
+				   GFP_NOFS);
 }
 
 /*
@@ -1987,7 +1988,7 @@ xfs_vm_set_page_dirty(
 
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
-	.readpages		= xfs_vm_readpages,
+	.readpages		= xfs_vm_readpages_gfp,
 	.writepage		= xfs_vm_writepage,
 	.writepages		= xfs_vm_writepages,
 	.set_page_dirty		= xfs_vm_set_page_dirty,
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-06  2:12           ` Andrew Morton
@ 2015-10-06  5:12             ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2015-10-06  5:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Guenter Roeck, linux-kernel@vger.kernel.org,
	Christoph Lameter, Kirill A. Shutemov, linux-mm

On Mon, Oct 05, 2015 at 07:12:17PM -0700, Andrew Morton wrote:
> On Mon, 5 Oct 2015 12:29:36 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > Maybe it would be better to add the gfp_t argument to the
> > address_space_operations.  At a minimum, writepage(), readpage(),
> > writepages(), readpages().  What a pickle.
> 
> I'm being dumb.  All we need to do is to add a new
> 
> 	address_space_operations.readpage_gfp(..., gfp_t gfp)
> 
> etc.  That should be trivial.  Each fs type only has 2-4 instances of
> address_space_operations so the overhead is miniscule.
> 
> As a background janitorial thing we can migrate filesystems over to the
> new interface then remove address_space_operations.readpage() etc.  But
> this *would* add overhead: more stack and more text for no gain.  So
> perhaps we just leave things as they are.
> 
> That's so simple that I expect we can short-term fix this bug with that
> approach.  umm, something like
> 
> --- a/fs/mpage.c~a
> +++ a/fs/mpage.c
> @@ -139,7 +139,8 @@ map_buffer_to_page(struct page *page, st
>  static struct bio *
>  do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
>  		sector_t *last_block_in_bio, struct buffer_head *map_bh,
> -		unsigned long *first_logical_block, get_block_t get_block)
> +		unsigned long *first_logical_block, get_block_t get_block,
> +		gfp_t gfp)

Simple enough, but not sufficient to avoid deadlocks, because it
doesn't address the common deadlock vector that was reported:
the layering dependency the loop device creates between two
otherwise independent filesystems.

If read IO through the loop device does GFP_KERNEL allocations, then
it is susceptible to deadlock as that can force writeback and
transactions from the filesystem on top of the loop device, which
does more IO to the loop device, which then backs up on the backing
file inode mutex. Forwards progress cannot be made until the
GFP_KERNEL allocation succeeds, but that depends on the loop device
making forwards progress...

The loop device indicates this is a known problem and tries to avoid
these deadlocks by doing this on its backing file:

	mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS))

to try to ensure that mapping related allocations do not cause
inappropriate reclaim contexts to be triggered.

This is a problem independent of any specific filesystem - let's not
hack a workaround into a specific filesystem just because the
problem was reported on that filesystem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, newest: 2015-10-06  5:12 UTC

Thread overview: 10+ messages
     [not found] <560D59F7.4070002@roeck-us.net>
2015-10-01 20:49 ` linux-next: kernel BUG at mm/slub.c:1447! Andrew Morton
2015-10-02  7:25   ` Michal Hocko
2015-10-02  8:53     ` Michal Hocko
2015-10-02 20:49     ` Andrew Morton
2015-10-05 13:47       ` Michal Hocko
2015-10-05 19:29         ` Andrew Morton
2015-10-06  2:12           ` Andrew Morton
2015-10-06  5:12             ` Dave Chinner
2015-10-02 13:36   ` Guenter Roeck
2015-10-02 13:42     ` Michal Hocko
