* [LSF/MM TOPIC] few MM topics
@ 2018-01-24  9:26 Michal Hocko
  2018-01-24 18:23 ` Mike Kravetz
  ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread

From: Michal Hocko @ 2018-01-24 9:26 UTC (permalink / raw)
To: lsf-pc, linux-mm, linux-nvme, linux-fsdevel
Cc: Johannes Weiner, Rik van Riel

Hi,
I would like to propose the following few topics for further discussion
at LSF/MM this year. The MM track would be the most appropriate one, but
there is some overlap with FS and NVDIMM.

- memcg OOM behavior changed around 3.12 as a result of OOM deadlocks
  when the memcg OOM killer was triggered from the charge path. We now
  simply fail the charge and unroll to a safe place to trigger the OOM
  killer. This is only done from the #PF path, and any g-u-p or
  kmem-accounted allocation can just fail in that case, leading to an
  unexpected ENOMEM in userspace. I believe we can return to the
  original OOM handling now that we have the OOM reaper and guaranteed
  forward progress of the OOM path.
  Discussion: http://lkml.kernel.org/r/20171010142434.bpiqmsbb7gttrlcb@dhcp22.suse.cz

- It seems there is some demand for large (> MAX_ORDER) allocations. We
  have alloc_contig_range, which was originally used for CMA and later
  (ab)used for gigantic hugetlb pages. The API is less than optimal and
  we should probably think about how to make it more generic.

- We have grown a new get_user_pages_longterm. It is an ugly API, and I
  think we really need a decent page pinning API with accounting and
  limiting.

- Memory hotplug has seen quite some surgery last year, and it seems
  that DAX/NVDIMM and HMM have some interest in using it as well. I am
  mostly interested in struct page self-hosting, which is already done
  for NVDIMM AFAIU. It would be great if we could unify that with
  regular memory hotplug as well.
- I would be very interested in talking about memory soft-offlining
  (HWPoison) with somebody familiar with this area, because I find the
  development there more or less random, without any design in mind.
  The resulting code is chaotic and stuffed into "random" places.

- I would also love to talk to some FS people and convince them to move
  away from GFP_NOFS in favor of the new scope API. I know this just
  means sending patches, but the existing code is quite complex and it
  really requires somebody familiar with the specific FS to do that
  work.
--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body
to majordomo@kvack.org. For more info on Linux MM, see:
http://www.linux-mm.org/. Don't email: dont@kvack.org
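[Editorial aside: the scope API referred to above can be illustrated with a small userspace mock of its semantics. This is a sketch only; `task_flags` and `effective_gfp()` are stand-ins invented here. In the kernel, the flag is PF_MEMALLOC_NOFS in current->flags and the page allocator applies current_gfp_context() to mask out __GFP_FS.]

```c
#include <assert.h>
#include <stdio.h>

/* Simplified gfp bits, for illustration only. */
#define __GFP_FS	0x2
#define GFP_KERNEL	0x3	/* __GFP_FS | __GFP_IO */
#define GFP_NOFS	0x1	/* __GFP_IO only */

#define PF_MEMALLOC_NOFS 0x1
static unsigned int task_flags;	/* stands in for current->flags */

static unsigned int memalloc_nofs_save(void)
{
	unsigned int old = task_flags & PF_MEMALLOC_NOFS;
	task_flags |= PF_MEMALLOC_NOFS;
	return old;
}

static void memalloc_nofs_restore(unsigned int old)
{
	task_flags = (task_flags & ~PF_MEMALLOC_NOFS) | old;
}

/* What the allocator honors once the scope is applied. */
static unsigned int effective_gfp(unsigned int gfp)
{
	if (task_flags & PF_MEMALLOC_NOFS)
		gfp &= ~__GFP_FS;
	return gfp;
}

int main(void)
{
	unsigned int cookie;

	/* Outside any scope, GFP_KERNEL may recurse into the FS. */
	assert(effective_gfp(GFP_KERNEL) == GFP_KERNEL);

	/* Inside the scope, even GFP_KERNEL callers behave as NOFS, so
	 * individual call sites no longer need to pass GFP_NOFS. */
	cookie = memalloc_nofs_save();
	assert(effective_gfp(GFP_KERNEL) == GFP_NOFS);
	memalloc_nofs_restore(cookie);

	assert(effective_gfp(GFP_KERNEL) == GFP_KERNEL);
	printf("scope masking ok\n");
	return 0;
}
```

This is why the scope API documents intent better than per-callsite GFP_NOFS: the critical section is marked once, and every allocation inside it inherits the restriction.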
* Re: [LSF/MM TOPIC] few MM topics

From: Mike Kravetz @ 2018-01-24 18:23 UTC (permalink / raw)
To: Michal Hocko, lsf-pc, linux-mm, linux-nvme, linux-fsdevel
Cc: Johannes Weiner, Rik van Riel

On 01/24/2018 01:26 AM, Michal Hocko wrote:
[...]
> - It seems there is some demand for large (> MAX_ORDER) allocations.
>   We have alloc_contig_range, which was originally used for CMA and
>   later (ab)used for gigantic hugetlb pages. The API is less than
>   optimal and we should probably think about how to make it more
>   generic.

This is also of interest to me. I actually started some efforts in this
area. The idea (as you mention above) would be to provide a more usable
API for the allocation of contiguous pages/ranges, with gigantic huge
pages as the first consumer.

alloc_contig_range currently has some issues with being used in a 'more
generic' way.
A comment describing the routine says "it's the caller's responsibility
to guarantee that we are the only thread that changes migrate type of
pageblocks the pages fall in." This is true, and I think it also applies
to users of the underlying routines such as start_isolate_page_range.
The CMA code has a mechanism that prevents two threads from operating on
the same range concurrently. The other users (gigantic page allocation
and memory offline) happen infrequently enough that we are unlikely to
have a conflict. But opening this up to more generic use will require at
least a more generic synchronization mechanism.

> - We have grown a new get_user_pages_longterm. It is an ugly API, and
>   I think we really need a decent page pinning API with accounting
>   and limiting.
> - Memory hotplug has seen quite some surgery last year, and it seems
>   that DAX/NVDIMM and HMM have some interest in using it as well. I
>   am mostly interested in struct page self-hosting, which is already
>   done for NVDIMM AFAIU. It would be great if we could unify that
>   with regular memory hotplug as well.
> - I would be very interested in talking about memory soft-offlining
>   (HWPoison) with somebody familiar with this area, because I find
>   the development in that area more or less random, without any
>   design in mind. The resulting code is chaotic and stuffed into
>   "random" places.

Me too. I have looked at some code in this area for huge pages. At
least for huge pages there is more work to do, as indicated by this
comment:

	/*
	 * Huge pages. Needs work.
	 * Issues:
	 * - Error on hugepage is contained in hugepage unit (not in raw page unit.)
	 *   To narrow down kill region to one page, we need to break up pmd.
	 */

--
Mike Kravetz

> - I would also love to talk to some FS people and convince them to
>   move away from GFP_NOFS in favor of the new scope API.
>   I know this just means sending patches, but the existing code is
>   quite complex and it really requires somebody familiar with the
>   specific FS to do that work.
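[Editorial aside: the "more generic synchronization mechanism" Mike asks for above was not specified in the thread; one possible shape is a tracker of busy pfn ranges that rejects overlapping claims. The sketch below is purely hypothetical userspace code; `claim_range`/`release_range` and `struct busy_range` are invented names, not from any kernel patch.]

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical busy-range tracker: ensure two callers never isolate
 * overlapping pfn ranges concurrently. */
struct busy_range {
	unsigned long start, end;	/* [start, end) in pfns */
	struct busy_range *next;
};

static pthread_mutex_t busy_lock = PTHREAD_MUTEX_INITIALIZER;
static struct busy_range *busy_list;

/* Returns 0 on success, -1 if the range overlaps one already claimed. */
static int claim_range(unsigned long start, unsigned long end)
{
	struct busy_range *r;

	pthread_mutex_lock(&busy_lock);
	for (r = busy_list; r; r = r->next) {
		if (start < r->end && r->start < end) {
			pthread_mutex_unlock(&busy_lock);
			return -1;	/* caller must retry or fail */
		}
	}
	r = malloc(sizeof(*r));
	r->start = start;
	r->end = end;
	r->next = busy_list;
	busy_list = r;
	pthread_mutex_unlock(&busy_lock);
	return 0;
}

static void release_range(unsigned long start)
{
	struct busy_range **pp, *r;

	pthread_mutex_lock(&busy_lock);
	for (pp = &busy_list; (r = *pp); pp = &r->next) {
		if (r->start == start) {
			*pp = r->next;
			free(r);
			break;
		}
	}
	pthread_mutex_unlock(&busy_lock);
}

int main(void)
{
	assert(claim_range(0, 512) == 0);
	assert(claim_range(256, 768) == -1);	/* overlap rejected */
	assert(claim_range(512, 1024) == 0);	/* adjacent is fine */
	release_range(0);
	assert(claim_range(0, 512) == 0);	/* reusable after release */
	printf("range exclusion ok\n");
	return 0;
}
```

CMA's per-area lock gives this guarantee only within one CMA area; a global structure like the above would be one way to extend it to arbitrary callers of alloc_contig_range.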
* Re: [LSF/MM TOPIC] few MM topics

From: Michal Hocko @ 2018-01-25 10:02 UTC (permalink / raw)
To: Mike Kravetz
Cc: lsf-pc, linux-mm, linux-nvme, linux-fsdevel, Johannes Weiner,
    Rik van Riel

On Wed 24-01-18 10:23:20, Mike Kravetz wrote:
> On 01/24/2018 01:26 AM, Michal Hocko wrote:
[...]
> alloc_contig_range currently has some issues with being used in a
> 'more generic' way. A comment describing the routine says "it's the
> caller's responsibility to guarantee that we are the only thread that
> changes migrate type of pageblocks the pages fall in." [...] But
> opening this up to more generic use will require at least a more
> generic synchronization mechanism.

Yes, that is exactly my concern, and it is the current state of the art
that has to change. I am not yet sure how, so any discussion seems
interesting.
--
Michal Hocko
SUSE Labs
* Re: [LSF/MM TOPIC] few MM topics

From: Jan Kara @ 2018-01-25 9:37 UTC (permalink / raw)
To: Michal Hocko
Cc: lsf-pc, linux-mm, linux-nvme, linux-fsdevel, Johannes Weiner,
    Rik van Riel

Hi,

On Wed 24-01-18 10:26:49, Michal Hocko wrote:
> - We have grown a new get_user_pages_longterm. It is an ugly API, and
>   I think we really need a decent page pinning API with accounting
>   and limiting.

I'm interested in this topic from the NVDIMM/DAX POV, as well as due to
other issues filesystems currently have with GUP (more on that in a
topic proposal I'll send in a moment).

Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [LSF/MM TOPIC] few MM topics

From: Darrick J. Wong @ 2018-01-31 19:21 UTC (permalink / raw)
To: Michal Hocko
Cc: lsf-pc, linux-mm, linux-nvme, linux-fsdevel, Johannes Weiner,
    Rik van Riel

On Wed, Jan 24, 2018 at 10:26:49AM +0100, Michal Hocko wrote:
> Hi,
> I would like to propose the following few topics for further
> discussion at LSF/MM this year. The MM track would be the most
> appropriate one, but there is some overlap with FS and NVDIMM.
[...]
> - I would also love to talk to some FS people and convince them to
>   move away from GFP_NOFS in favor of the new scope API. I know this
>   just means sending patches, but the existing code is quite complex
>   and it really requires somebody familiar with the specific FS to do
>   that work.

Hm, are you talking about setting PF_MEMALLOC_NOFS instead of passing
*_NOFS to allocation functions and whatnot? Right now XFS will set it
on any thread which has a transaction open, but that doesn't help for
fs operations that don't have transactions (e.g. reading metadata,
opening files). I suppose we could just set the flag any time someone
stumbles into the fs code from userspace, though you're right that
seems daunting.

--D

> --
> Michal Hocko
> SUSE Labs
* Re: [LSF/MM TOPIC] few MM topics

From: Michal Hocko @ 2018-01-31 20:24 UTC (permalink / raw)
To: Darrick J. Wong
Cc: lsf-pc, linux-mm, linux-nvme, linux-fsdevel, Johannes Weiner,
    Rik van Riel

On Wed 31-01-18 11:21:04, Darrick J. Wong wrote:
> On Wed, Jan 24, 2018 at 10:26:49AM +0100, Michal Hocko wrote:
[...]
> Hm, are you talking about setting PF_MEMALLOC_NOFS instead of passing
> *_NOFS to allocation functions and whatnot?

Yes, memalloc_nofs_{save,restore}.

> Right now XFS will set it on any thread which has a transaction open,
> but that doesn't help for fs operations that don't have transactions
> (e.g. reading metadata, opening files). I suppose we could just set
> the flag any time someone stumbles into the fs code from userspace,
> though you're right that seems daunting.

I would really love to see the code take the NOFS scope
(memalloc_nofs_save) at the point where the FS "critical" section
starts (from the reclaim-recursion POV). This would both document the
context and also limit NOFS allocations to a bare minimum.
--
Michal Hocko
SUSE Labs
* Re: [LSF/MM TOPIC] few MM topics

From: Dave Chinner @ 2018-01-31 23:41 UTC (permalink / raw)
To: Michal Hocko
Cc: Darrick J. Wong, lsf-pc, linux-mm, linux-nvme, linux-fsdevel,
    Johannes Weiner, Rik van Riel

On Wed, Jan 31, 2018 at 09:24:38PM +0100, Michal Hocko wrote:
[...]
> I would really love to see the code take the NOFS scope
> (memalloc_nofs_save) at the point where the FS "critical" section
> starts (from the reclaim-recursion POV).

We already do that: the transaction context in XFS is the critical
context, and we set PF_MEMALLOC_NOFS when we allocate a transaction
handle and remove it when we commit the transaction.

> This would both document the context and also limit NOFS allocations
> to a bare minimum.

Yup, most of XFS already uses implicit GFP_NOFS allocation calls via
the transaction context process flag manipulation.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
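[Editorial aside: the transaction-scoped pattern Dave describes can be sketched in userspace. `struct mock_trans` and `trans_alloc`/`trans_commit` are illustrative names invented here; in XFS the saved scope state travels with the real transaction handle and current->flags. The point of the sketch is that save returns the previous state, so nested transactions restore correctly.]

```c
#include <assert.h>
#include <stdio.h>

#define PF_MEMALLOC_NOFS 0x1

static unsigned int task_flags;		/* stands in for current->flags */

struct mock_trans {
	unsigned int saved_nofs;	/* previous scope state */
};

static void trans_alloc(struct mock_trans *tp)
{
	/* memalloc_nofs_save(): set the flag, remember the old state. */
	tp->saved_nofs = task_flags & PF_MEMALLOC_NOFS;
	task_flags |= PF_MEMALLOC_NOFS;
}

static void trans_commit(struct mock_trans *tp)
{
	/* memalloc_nofs_restore(): put the old state back, so committing
	 * a nested transaction does not prematurely clear the scope. */
	task_flags = (task_flags & ~PF_MEMALLOC_NOFS) | tp->saved_nofs;
}

int main(void)
{
	struct mock_trans outer, inner;

	assert(!(task_flags & PF_MEMALLOC_NOFS));
	trans_alloc(&outer);
	assert(task_flags & PF_MEMALLOC_NOFS);

	trans_alloc(&inner);		/* nested transaction */
	trans_commit(&inner);
	/* Still NOFS: the outer transaction is open. */
	assert(task_flags & PF_MEMALLOC_NOFS);

	trans_commit(&outer);
	assert(!(task_flags & PF_MEMALLOC_NOFS));
	printf("transaction scoping ok\n");
	return 0;
}
```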
* Re: [Lsf-pc] [LSF/MM TOPIC] few MM topics

From: Michal Hocko @ 2018-02-01 15:46 UTC (permalink / raw)
To: Dave Chinner
Cc: Darrick J. Wong, Rik van Riel, linux-nvme, linux-mm,
    Johannes Weiner, linux-fsdevel, lsf-pc

On Thu 01-02-18 10:41:26, Dave Chinner wrote:
> On Wed, Jan 31, 2018 at 09:24:38PM +0100, Michal Hocko wrote:
[...]
> Yup, most of XFS already uses implicit GFP_NOFS allocation calls via
> the transaction context process flag manipulation.

Yeah, xfs is in quite good shape. There are still around 40+ KM_NOFS
users. Are there any major obstacles to removing those, or is this just
a "send patches" thing? Compare that to:

$ git grep GFP_NOFS -- fs/btrfs/ | wc -l
272
--
Michal Hocko
SUSE Labs
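[Editorial aside: the kind of comparison above can be reproduced with a quick audit script. This is a sketch assuming the current directory is the root of a Linux kernel source tree; the counts it prints depend on the tree's version, and KM_NOFS is matched because it is XFS's allocation-flag equivalent of GFP_NOFS.]

```shell
#!/bin/sh
# Rough count of explicit NOFS annotations per filesystem.
for fs in xfs btrfs ext4; do
	printf '%s: ' "$fs"
	grep -rE 'GFP_NOFS|KM_NOFS' "fs/$fs" 2>/dev/null | wc -l
done
```

Plain grep over-counts slightly (it also matches comments), but it is good enough to show where the conversion effort is largest.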
* Re: [Lsf-pc] [LSF/MM TOPIC] few MM topics

From: Dave Chinner @ 2018-02-01 22:47 UTC (permalink / raw)
To: Michal Hocko
Cc: Darrick J. Wong, Rik van Riel, linux-nvme, linux-mm,
    Johannes Weiner, linux-fsdevel, lsf-pc

On Thu, Feb 01, 2018 at 04:46:55PM +0100, Michal Hocko wrote:
> On Thu 01-02-18 10:41:26, Dave Chinner wrote:
[...]
> Yeah, xfs is in quite good shape. There are still around 40+ KM_NOFS
> users. Are there any major obstacles to removing those, or is this
> just a "send patches" thing?

They need to be looked at on a case-by-case basis; many of them are
"shut up lockdep false positives" workarounds because the code is
called from multiple memory-reclaim contexts. In other cases they might
actually be needed. If you send patches, it'll kinda force us to look
at them and say yay/nay :P

> Compare that to:
> $ git grep GFP_NOFS -- fs/btrfs/ | wc -l
> 272

Fair point. :P

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
end of thread, other threads: [~2018-02-01 22:47 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed):
2018-01-24  9:26 [LSF/MM TOPIC] few MM topics Michal Hocko
2018-01-24 18:23 ` Mike Kravetz
2018-01-25 10:02   ` Michal Hocko
2018-01-25  9:37 ` Jan Kara
2018-01-31 19:21 ` Darrick J. Wong
2018-01-31 20:24   ` Michal Hocko
2018-01-31 23:41     ` Dave Chinner
2018-02-01 15:46       ` [Lsf-pc] Michal Hocko
2018-02-01 22:47         ` Dave Chinner