From: Michal Hocko <mhocko@suse.com>
To: Dave Chinner <david@fromorbit.com>
Cc: akpm@linux-foundation.org, linux-nvdimm@lists.01.org,
	linux-mm@kvack.org, tytso@mit.edu, Jan Kara <jack@suse.cz>,
	hch@lst.de
Subject: Re: [PATCH v4 1/3] dax: masking off __GFP_FS in fs DAX handlers
Date: Tue, 20 Dec 2016 11:13:52 +0100	[thread overview]
Message-ID: <20161220101352.GE3769@dhcp22.suse.cz> (raw)
In-Reply-To: <20161219211711.GD4219@dastard>

On Tue 20-12-16 08:17:11, Dave Chinner wrote:
> On Mon, Dec 19, 2016 at 08:53:02PM +0100, Jan Kara wrote:
> > On Sat 17-12-16 09:04:50, Dave Chinner wrote:
> > > On Fri, Dec 16, 2016 at 09:19:16AM -0700, Ross Zwisler wrote:
> > > > On Fri, Dec 16, 2016 at 12:07:30PM +1100, Dave Chinner wrote:
> > > > > On Thu, Dec 15, 2016 at 04:40:41PM -0700, Dave Jiang wrote:
> > > > > > The caller into dax needs to clear __GFP_FS mask bit since it's
> > > > > > responsible for acquiring locks / transactions that blocks __GFP_FS
> > > > > > allocation.  The caller will restore the original mask when dax function
> > > > > > returns.
> > > > > 
> > > > > What's the allocation problem you're working around here? Can you
> > > > > please describe the call chain that is the problem?
> > > > > 
> > > > > >  	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> > > > > >  
> > > > > >  	if (IS_DAX(inode)) {
> > > > > > +		gfp_t old_gfp = vmf->gfp_mask;
> > > > > > +
> > > > > > +		vmf->gfp_mask &= ~__GFP_FS;
> > > > > >  		ret = dax_iomap_fault(vma, vmf, &xfs_iomap_ops);
> > > > > > +		vmf->gfp_mask = old_gfp;
> > > > > 
> > > > > I really have to say that I hate code that clears and restores flags
> > > > > without any explanation of why the code needs to play flag tricks. I
> > > > > take one look at the XFS fault handling code and ask myself now "why
> > > > > the hell do we need to clear those flags?" Especially as the other
> > > > > paths into generic fault handlers /don't/ require us to do this.
> > > > > What does DAX do that require us to treat memory allocation contexts
> > > > > differently to the filemap_fault() path?
> > > > 
> > > > This was done in response to Jan Kara's concern:
> > > > 
> > > >   The gfp_mask that propagates from __do_fault() or do_page_mkwrite() is fine
> > > >   because at that point it is correct. But once we grab filesystem locks which
> > > >   are not reclaim safe, we should update vmf->gfp_mask we pass further down
> > > >   into DAX code to not contain __GFP_FS (that's a bug we apparently have
> > > >   there). And inside DAX code, we definitely are not generally safe to add
> > > >   __GFP_FS to mapping_gfp_mask(). Maybe we'd be better off propagating struct
> > > >   vm_fault into this function, using passed gfp_mask there and make sure
> > > >   callers update gfp_mask as appropriate.
> > > > 
> > > > https://lkml.org/lkml/2016/10/4/37
> > > > 
> > > > IIUC I think the concern is that, for example, in xfs_filemap_page_mkwrite()
> > > > we take a read lock on the struct inode.i_rwsem before we call
> > > > dax_iomap_fault().
> > > 
> > > That, my friends, is exactly the problem that mapping_gfp_mask() is
> > > meant to solve. This:
> > > 
> > > > > > +	vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_FS |  __GFP_IO;
> > > 
> > > Is just so wrong it's not funny.
> > 
> > You mean like in mm/memory.c: __get_fault_gfp_mask()?
> > 
> > Which was introduced by commit c20cd45eb017 "mm: allow GFP_{FS,IO} for
> > page_cache_read page cache allocation" by Michal (added to CC) and you were
> > even on CC ;).
> 
> Sure, I was on the cc list, but that doesn't mean I /liked/ the
> patch. It also doesn't mean I had the time or patience to argue
> whether it was the right way to address whatever whacky OOM/reclaim
> deficiency was being reported....
> 
> Oh, and this is a write fault, not a read fault. There's a big
> difference in filesystem behaviour between those two types of
> faults, so what might be fine for a page cache read (i.e. no
> transactions) isn't necessarily correct for a write operation...
> 
> > The code here was replicating __get_fault_gfp_mask() and in fact the idea
> > of the cleanup is to get rid of this code and take whatever is in
> > vmf.gfp_mask and mask off __GFP_FS in the filesystem if it deems it is
> > needed (e.g. ext4 really needs this as inode reclaim is depending on being
> > able to force a transaction commit).
> 
> And so now we add a flag to the fault that the filesystem says not
> to add to mapping masks, and now the filesystem has to mask off
> that flag /again/ because its mapping gfp mask guidelines are
> essentially being ignored.
> 
> Remind me again why we even have the mapping gfp_mask if we just
> ignore it like this?

The mapping mask still serves its _main_ purpose: the allocation
placement/movability properties. This is something only the owner of
the mapping knows. The (ab)use of the mapping gfp_mask to drop GFP_FS
was imho a bad decision. As the changelog of the above-mentioned commit
explains, we were doing a lot of GFP_NOFS allocations from paths which
are inherently GFP_KERNEL, so the restriction did not protect against
any recursion problem there while it still affected the direct reclaim
behavior. On the other hand, I do understand why the mapping's mask was
used at the time. We simply lacked a better API back then. But I believe
that with the scope NOFS API [1] we can do much better and finally get
rid of ~__GFP_FS in the mapping's mask. c20cd45eb017 was an
intermediate step until we get there.
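
To make the two roles of the mask concrete, here is a minimal userspace
sketch (not kernel code). It assumes invented bit values and hypothetical
model_* names; only the composition "mapping mask | __GFP_FS | __GFP_IO"
and the "&= ~__GFP_FS" masking are taken from the thread above.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model only: bit values are invented for illustration and
 * do not match the kernel's real gfp flag encoding. */
typedef uint32_t gfp_t;
#define MODEL_GFP_IO      ((gfp_t)1u << 0)
#define MODEL_GFP_FS      ((gfp_t)1u << 1)
#define MODEL_GFP_MOVABLE ((gfp_t)1u << 2)  /* placement/movability bit */
#define MODEL_GFP_HIGHMEM ((gfp_t)1u << 3)  /* placement bit */

/* Hypothetical stand-in for the mask a mapping owner configures. */
struct model_mapping {
	gfp_t gfp_mask;
};

/* Models the composition discussed in the thread: keep the mapping's
 * placement bits and add FS and IO for the fault path. */
static gfp_t model_fault_gfp_mask(const struct model_mapping *m)
{
	return m->gfp_mask | MODEL_GFP_FS | MODEL_GFP_IO;
}

/* Models what the patch does in the filesystem handler once locks that
 * are not reclaim safe are held: drop only the FS bit. */
static gfp_t model_fs_restrict(gfp_t fault_mask)
{
	return fault_mask & ~MODEL_GFP_FS;
}

/* Walk-through: returns the mask the lower layer would be handed. */
static gfp_t model_example(void)
{
	struct model_mapping m = { MODEL_GFP_MOVABLE | MODEL_GFP_HIGHMEM };
	gfp_t fault = model_fault_gfp_mask(&m);
	gfp_t restricted = model_fs_restrict(fault);

	/* Placement bits chosen by the mapping owner survive untouched. */
	assert((restricted & m.gfp_mask) == m.gfp_mask);
	return restricted;
}
```

In this model the placement bits and the reclaim-context bit are
independent: the mapping owner decides the former, and the code holding
the locks decides the latter.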

I am not fully familiar with the DAX changes which started this
discussion but if there is a reclaim recursion problem from within the
fault path then the scope api sounds like a good fit here.
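
For readers unfamiliar with the idea, the following is a rough userspace
model of a scope-based NOFS context, not the kernel implementation. The
model_* names, the flag values and the task structure are all
hypothetical; the only thing taken from the discussion is the concept
that a section of code marks itself as NOFS and every allocation inside
that scope loses the FS bit, without callers editing gfp masks by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model only: invented bit values, hypothetical names. */
typedef uint32_t gfp_t;
#define MODEL_GFP_IO  ((gfp_t)1u << 0)
#define MODEL_GFP_FS  ((gfp_t)1u << 1)
#define MODEL_PF_NOFS 1u  /* per-task "inside a NOFS scope" flag */

/* Hypothetical stand-in for per-task state. */
struct model_task {
	unsigned int flags;
};

/* Enter a NOFS scope; returns the previous state so scopes can nest. */
static unsigned int model_nofs_save(struct model_task *t)
{
	unsigned int old = t->flags & MODEL_PF_NOFS;

	t->flags |= MODEL_PF_NOFS;
	return old;
}

/* Leave a NOFS scope by reinstating the state returned from save. */
static void model_nofs_restore(struct model_task *t, unsigned int old)
{
	t->flags = (t->flags & ~MODEL_PF_NOFS) | old;
}

/* What an allocator would apply: strip FS while inside a scope. */
static gfp_t model_effective_gfp(const struct model_task *t, gfp_t requested)
{
	if (t->flags & MODEL_PF_NOFS)
		requested &= ~MODEL_GFP_FS;
	return requested;
}

/* Models a fault handler that takes locks, enters the scope, and calls
 * into lower layers that still ask for FS|IO. Returns the mask those
 * lower layers would effectively allocate with. */
static gfp_t model_fault_under_lock(struct model_task *t)
{
	unsigned int old = model_nofs_save(t);
	gfp_t seen = model_effective_gfp(t, MODEL_GFP_FS | MODEL_GFP_IO);

	model_nofs_restore(t, old);
	assert(!(seen & MODEL_GFP_FS));
	return seen;
}
```

The point of the design is that the restriction follows the locking
context rather than the mask argument, so the lower layers do not need
to know which locks the caller holds.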

[1] http://lkml.kernel.org/r/20161215140715.12732-1-mhocko@kernel.org

-- 
Michal Hocko
SUSE Labs