From: Dennis Zhou <dennis@kernel.org>
To: Wang Yugui <wangyugui@e16-tech.com>
Cc: Vlastimil Babka <vbabka@suse.cz>,
	linux-mm@kvack.org, linux-btrfs@vger.kernel.org
Subject: Re: unexpected -ENOMEM from percpu_counter_init()
Date: Mon, 12 Apr 2021 04:03:01 +0000
Message-ID: <YHPGdVDc6/4jxggg@google.com>
In-Reply-To: <20210411232000.BF15.409509F4@e16-tech.com>

On Sun, Apr 11, 2021 at 11:20:00PM +0800, Wang Yugui wrote:
> Hi, Dennis Zhou
> 
> > Hi,
> > 
> > > On Sat, Apr 10, 2021 at 11:29:17PM +0800, Wang Yugui wrote:
> > > > Hi, Dennis Zhou 
> > > > 
> > > > Thanks for your nice answer,
> > > > but I still have a few questions.
> > > > 
> > > > > Percpu is not really cheap memory to allocate because it has a
> > > > > amplification factor of NR_CPUS. As a result, percpu on the critical
> > > > > path is really not something that is expected to be high throughput.
> > > > 
> > > > > Ideally things like btrfs snapshots should preallocate a number of these
> > > > > and not try to do atomic allocations because that in theory could fail
> > > > > because even after we go to the page allocator in the future we can't
> > > > > get enough pages due to needing to go into reclaim.
> > > > 
> > > > Preallocation inside a module, such as mempool_t, is used in only a
> > > > few places in linux/fs. So do most people prefer system-wide
> > > > preallocation because it is easier to use?
> > > > 
> > > > Can we add more opportunities to manage the system-wide preallocation,
> > > > like this?
> > > > 
> > > > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > > > index dc1f4dc..eb3f592 100644
> > > > --- a/include/linux/sched/mm.h
> > > > +++ b/include/linux/sched/mm.h
> > > > @@ -226,6 +226,11 @@ static inline void memalloc_noio_restore(unsigned int flags)
> > > >  static inline unsigned int memalloc_nofs_save(void)
> > > >  {
> > > >  	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
> > > > +
> > > > +	// just like slab_pre_alloc_hook
> > > > +	fs_reclaim_acquire(current->flags & gfp_allowed_mask);
> > > > +	fs_reclaim_release(current->flags & gfp_allowed_mask);
> > > > +
> > > >  	current->flags |= PF_MEMALLOC_NOFS;
> > > >  	return flags;
> > > >  }
> > > > 
> > > > 
> > > > > The workqueue approach has been good enough so far. Technically there is
> > > > > a higher priority workqueue that this work could be scheduled on, but
> > > > > save for this miss on my part, the system workqueue has worked out fine.
> > > > 
> > > > > In the future as I mentioned above. It would be good to support actually
> > > > > getting pages, but it's work that needs to be tackled with a bit of
> > > > > care. I might target the work for v5.14.
> > > > > 
> > > > > > this is our application pipeline.
> > > > > > 	file_pre_process |
> > > > > > 	bwa.nipt xx |
> > > > > > 	samtools.nipt sort xx |
> > > > > > 	file_post_process
> > > > > > 
> > > > > > > file_pre_process/file_post_process are fast, so they are often blocked
> > > > > > > on pipe input/output.
> > > > > > > 
> > > > > > > 'bwa.nipt xx' is a high-CPU load; it uses almost all of the CPU cores.
> > > > > > > 
> > > > > > > 'samtools.nipt sort xx' is a high-memory load; it keeps the input in memory.
> > > > > > > If there is not enough memory, it saves the whole buffer to a temp file,
> > > > > > > so it is sometimes a high-IO load too (writing 60G or more to a file).
> > > > > > 
> > > > > > 
> > > > > > > xfstests (generic/476) is just a high-IO load; its CPU/memory load is NOT high.
> > > > > > > So xfstests (generic/476) may be easier than our application pipeline.
> > > > > > > 
> > > > > > > Although there is not yet a simple reproducer for the other problem
> > > > > > > that happened here, there is a fairly high chance that something is wrong
> > > > > > > in btrfs/mm/fs-buffer.
> > > > > > > > but another problem (OS froze without a call trace, PANIC without OOPS?,
> > > > > > > > the reason is still unknown) still happens.
> > > > > 
> > > > > I do not have an answer for this. I would recommend looking into kdump.
> > > > 
> > > > The percpu ENOMEM problem blocked many heavy-load tests for quite a long time?
> > > > I still guess that this system freeze is an mm/btrfs problem.
> > > > OOM does not work, and OOPS does not work either.
> > > > 
> > > 
> > > I don't follow. Is this still a problem after the patch?
> > 
> > 
> > After the patch for percpu ENOMEM, the system freeze is triggered with
> > high frequency (>75%) by our user-space application.
> > 
> > The system freeze may not be caused by the percpu ENOMEM patch.
> > 
> > The percpu ENOMEM problem may simply happen more easily than the
> > system freeze.
> 
> After raising WMARK_MIN/WMARK_LOW/WMARK_HIGH by 80% for the highmem zone
> and by 40% for the other zones, we worked around, or at least reduced the
> reproduction frequency of, the system freeze.
> 
> So this is a linux-mm problem.
> 
> The use case of our user-space application:
> 1) Write files with a total size > 3 * memory size.
>    The memory size is > 128G.
> 2) btrfs on SSD/SAS, SSD/SATA, or btrfs RAID6 on HDDs.
>    SSD/NVMe may be too fast, so it is difficult to reproduce there.
> 3) Some CPU load, and some memory load.
> 

To me it just sounds like writeback is slow. It's also hard to debug a
system without actually observing it. You might want to limit the memory
allotted to the workload's cgroup, possibly via memory.high. This may
help kick reclaim in earlier.
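
As a rough userspace sketch of the memory.high suggestion, assuming
cgroup v2 and a cgroup directory you are allowed to write to (the helper
name set_memory_high() and any group path such as
/sys/fs/cgroup/workload are hypothetical, not something from this
thread):

```c
#include <stdio.h>

/*
 * Hypothetical helper: write a byte limit to <cgroup_dir>/memory.high.
 * Assumes cgroup v2, where memory.high is the throttling limit above
 * which the group's tasks are pushed into reclaim.
 * Returns 0 on success, -1 on any error.
 */
int set_memory_high(const char *cgroup_dir, unsigned long long bytes)
{
	char path[4096];
	FILE *f;
	int ok;
	int n;

	/* Build the path and reject anything that would be truncated. */
	n = snprintf(path, sizeof(path), "%s/memory.high", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	/* The limit is written as a decimal byte count. */
	ok = fprintf(f, "%llu\n", bytes) > 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Something like set_memory_high("/sys/fs/cgroup/workload", 96ULL << 30)
would then cap the group at 96G before the workload starts; the right
value depends on how much headroom the rest of the system needs.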

> btrfs and other filesystems seem not to like mempool_t with preallocation, so
> the difficult job is left to the system-wide reclaim/preallocation of linux-mm.
> 
> Maybe memalloc_nofs_save() or memalloc_nofs_restore() is a good place to
> add some sync/async memory reclaim/preallocation operations for WMARK_MIN/
> WMARK_LOW/WMARK_HIGH and percpu PCPU_EMPTY_POP_PAGES_LOW.
> 

It's not that simple. Memory reclaim is a balancing act, and these places
mark where reclaim cannot trigger writeback, so the oom-killer is the
only way out. I'm sorry, but beyond the above, I don't really have any
additional advice besides retuning your workload to use less memory and
give the system more headroom.

I appreciate the bug report though, and if it's anything percpu-related I
will always be available.

> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/04/11
> 

Thanks,
Dennis
