From: Michal Hocko <mhocko@suse.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Christoph Hellwig <hch@lst.de>,
Yafang Shao <laoar.shao@gmail.com>,
jack@suse.cz, Vlastimil Babka <vbabka@suse.cz>,
Dave Chinner <dchinner@redhat.com>,
Christian Brauner <brauner@kernel.org>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Paul Moore <paul@paul-moore.com>,
James Morris <jmorris@namei.org>,
"Serge E. Hallyn" <serge@hallyn.com>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-bcachefs@vger.kernel.org,
linux-security-module@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/2 v2] remove PF_MEMALLOC_NORECLAIM
Date: Thu, 5 Sep 2024 13:26:50 +0200 [thread overview]
Message-ID: <ZtmVej0fbVxrGPVz@tiehlicka> (raw)
In-Reply-To: <pmvxqqj5e6a2hdlyscmi36rcuf4kn37ry4ofdsp4aahpw223nk@lskmdcwkjeob>
On Wed 04-09-24 14:03:13, Kent Overstreet wrote:
> On Wed, Sep 04, 2024 at 06:46:00PM GMT, Michal Hocko wrote:
> > On Wed 04-09-24 12:05:56, Kent Overstreet wrote:
> > > On Wed, Sep 04, 2024 at 09:14:29AM GMT, Michal Hocko wrote:
> > > > On Tue 03-09-24 19:53:41, Kent Overstreet wrote:
> > > > [...]
> > > > > However, if we agreed that GFP_NOFAIL meant "only fail if it is not
> > > > > possible to satisfy this allocation" (and I have been arguing that that
> > > > > is the only sane meaning) - then that could lead to a lot of error paths
> > > > > getting simpler.
> > > > >
> > > > > Because there are a lot of places where there's essentially no good
> > > > > reason to bubble up an -ENOMEM to userspace; if we're actually out of
> > > > > memory the current allocation is just one out of many and not
> > > > > particularly special, better to let the oom killer handle it...
> > > >
> > > > This is exactly GFP_KERNEL semantic for low order allocations or
> > > > kvmalloc for that matter. They simply never fail unless couple of corner
> > > > cases - e.g. the allocating task is an oom victim and all of the oom
> > > > memory reserves have been consumed. This is where we call "not possible
> > > > to allocate".
> > >
> > > *nod*
> > >
> > > Which does beg the question of why GFP_NOFAIL exists.
> >
> > Exactly for the reason that even rare failure is not acceptable and
> > there is no way to handle it other than keep retrying. Typical code was
> > while (!(ptr = kmalloc()))
> > ;
>
> But is it _rare_ failure, or _no_ failure?
>
> You seem to be saying (and I just reviewed the code, it looks like
> you're right) that there is essentially no difference in behaviour
> between GFP_KERNEL and GFP_NOFAIL.
The fundamental difference is that (appart from unsupported allocation
mode/size) the latter never returns NULL and you can rely on that fact.
Our docummentation says:
* %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
* cannot handle allocation failures. The allocation could block
* indefinitely but will never return with failure. Testing for
* failure is pointless.
> So given that - why the wart?
>
> I think we might be able to chalk it up to history; I'd have to go
> spunking through the history (or ask Dave or Ted, maybe they'll chime
> in), but I suspect GFP_KERNEL didn't provide such strong guarantees when
> the allocation loops & GFP_NOFAIL were introduced.
Sure, go ahead. If you manage to remove all existing users of
__GFP_NOFAIL (without replacing them with retry loops in the caller)
then I would be more than happy to remove __GFP_NOFAIL in the allocator.
[...]
> > But the point is there are some which _do_ need this. We have discussed
> > that in other email thread where you have heard why XFS and EXT4 does
> > that and why they are not going to change that model.
>
> No, I agree that they need the strong guarantees.
>
> But if there's an actual bug, returning an error is better than killing
> the task. Killing the task is really bad; these allocations are deep in
> contexts where locks and refcounts are held, and the system will just
> grind to a halt.
Not sure what you mean by these allocations but I am not aware that any
of the existing user would be really buggy. Also as I've said elsewhere,
there is simply no good way to handle a buggy caller. Killing the buggy
context has some downsides, returning NULL has others. I have argued
that the former has better predictable behavior than potentially silent
failure. We can clearly disagree on this but I also do not see why that
is relevant to the original discussion because my argument against
PF_MEMALLOC_NORECLAIM was focused on correct GPF_NOFAIL nested context
that would get an unexpected failure mode. No matter what kind of
failure mode that is it would be unexpected for those users.
> > > But as a matter of policy going forward, yes we should be saying that
> > > even GFP_NOFAIL allocations should be checking for -ENOMEM.
> >
> > I argue that such NOFAIL semantic has no well defined semantic and legit
> > users are forced to do
> > while (!(ptr = kmalloc(GFP_NOFAIL))) ;
> > or
> > BUG_ON(!(ptr = kmalloc(GFP_NOFAIL)));
> >
> > So it has no real reason to exist.
>
> I'm arguing that it does, provided when it returns NULL is defined to
> be:
> - invalid allocation context
> - a size that is so big that it will never be possible to satisfy.
Those are not really important situations because you are arguing about
a buggy code that needs fixing. As said above we can argue how to deal
with those users to get a predictable behavior but as the matter of
fact, correct users can expect never seeing the failure so handling
failure might be a) impossible and b) unfeasible (i.e. you are adding a
dead code that is never tested).
[...]
> For large allocations in bcachefs: in journal replay we read all the
> keys in the journal, and then we create a big flat array with references
> to all of those keys to sort and dedup them.
>
> We haven't hit the INT_MAX size limit there yet, but filesystem sizes
> being what they are, we will soon. I've heard of users with 150 TB
> filesystems, and once the fsck scalability issues are sorted we'll be
> aiming for petabytes. Dirty keys in the journal scales more with system
> memory, but I'm leasing machines right now with a quarter terabyte of
> ram.
I thought you were arguing about bcachefs handling failure mode so
presumably you do not need to use __GFP_NOFAIL for those.
I am sorry but I am getting lost in these arguments.
--
Michal Hocko
SUSE Labs
next prev parent reply other threads:[~2024-09-05 11:26 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-02 9:51 [PATCH 0/2 v2] remove PF_MEMALLOC_NORECLAIM Michal Hocko
2024-09-02 9:51 ` [PATCH 1/2] bcachefs: do not use PF_MEMALLOC_NORECLAIM Michal Hocko
2024-09-05 9:28 ` kernel test robot
2024-09-02 9:51 ` [PATCH 2/2] Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN" Michal Hocko
2024-09-02 9:53 ` [PATCH 0/2 v2] remove PF_MEMALLOC_NORECLAIM Kent Overstreet
2024-09-02 21:52 ` Andrew Morton
2024-09-02 22:32 ` Kent Overstreet
2024-09-03 7:06 ` Michal Hocko
2024-09-04 16:15 ` Kent Overstreet
2024-09-04 16:50 ` Michal Hocko
2024-09-03 23:53 ` Kent Overstreet
2024-09-04 7:14 ` Michal Hocko
2024-09-04 16:05 ` Kent Overstreet
2024-09-04 16:46 ` Michal Hocko
2024-09-04 18:03 ` Kent Overstreet
2024-09-04 22:34 ` Dave Chinner
2024-09-04 23:05 ` Kent Overstreet
2024-09-05 11:26 ` Michal Hocko [this message]
2024-09-05 13:53 ` Theodore Ts'o
2024-09-05 14:05 ` Kent Overstreet
2024-09-05 15:24 ` Theodore Ts'o
2024-09-05 14:12 ` Michal Hocko
2024-09-03 5:13 ` Christoph Hellwig
2024-09-04 16:27 ` Kent Overstreet
2024-09-04 17:01 ` Michal Hocko
2024-09-10 19:29 ` Andrew Morton
2024-09-10 19:37 ` Kent Overstreet
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZtmVej0fbVxrGPVz@tiehlicka \
--to=mhocko@suse.com \
--cc=akpm@linux-foundation.org \
--cc=brauner@kernel.org \
--cc=dchinner@redhat.com \
--cc=hch@lst.de \
--cc=jack@suse.cz \
--cc=jmorris@namei.org \
--cc=kent.overstreet@linux.dev \
--cc=laoar.shao@gmail.com \
--cc=linux-bcachefs@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-security-module@vger.kernel.org \
--cc=paul@paul-moore.com \
--cc=serge@hallyn.com \
--cc=vbabka@suse.cz \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.