All of lore.kernel.org
 help / color / mirror / Atom feed
From: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
To: <dsterba@suse.cz>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH] btrfs: Avoid BUG_ON()s because of ENOMEM caused by kmalloc() failure
Date: Thu, 18 Feb 2016 10:59:22 +0900	[thread overview]
Message-ID: <56C5257A.20603@jp.fujitsu.com> (raw)
In-Reply-To: <20160217151113.GT4374@twin.jikos.cz>

On 2016/02/18 0:11, David Sterba wrote:
> On Wed, Feb 17, 2016 at 02:54:23PM +0900, Satoru Takeuchi wrote:
>> On 2016/02/16 2:53, David Sterba wrote:
>>> On Mon, Feb 15, 2016 at 02:38:09PM +0900, Satoru Takeuchi wrote:
>>>> There are some BUG_ON()'s after kmalloc() as follows.
>>>>
>>>> =====
>>>> foo = kmalloc();
>>>> BUG_ON(!foo);	/* -ENOMEM case */
>>>> =====
>>>>
>>>> A Docker + memory cgroup user hit these BUG_ON()s.
>>>>
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=112101
>>>>
>>>> Since it's very hard to handle these ENOMEMs properly,
>>>> preventing these kmalloc() failures to avoid these
>>>> BUG_ON()s for now, are a bit better than the current
>>>> implementation anyway.
>>>
>>> Beware that the NOFAIL semantics is can cause deadlocks if it's on the
>>> critical writeback path or and can be reentered from itself through the
>>> reclaim. Unless you're sure that this is not the case, please do not add
>>> them just because it would seemingly fix the allocation failures.
>>
>> About the all cases I changed, kmalloc()s can block
>> since gfp_flags_allow_blocking() are true. Then no locks
>> are acquired here and deadlocks don't happen.
>>
>> Am I missing something?
>
> Waiting as in GFP_WAIT is not the same as GFP_NOFAIL that can wait
> indefinetelly. The locking of NOFAIL is implied. The kmalloc callsite
> will block until there's memory available. If another thread of btrfs
> waits for this code to progress, it will block as well.
>
>>> In the docker example, the memory is limited by cgroups so the NOFAIL
>>> mode can exhaust all reserves and just loop endlessly waiting for the
>>> OOM killer to get some memory or just waiting without any chance to
>>> progress.
>>
>> I consider triggering OOM killer and killing processes
>> in a cgroup are better than killing whole system.
>
> The action of OOM killer is not a problem. The cgroup memory limit can
> be low or all the memory is unreclaimable. At this point btrfs code will
> block.
>
>
>> About the possibility of endless loop, there are many
>> such problems in the whole kernel. Of course it can be
>> said to Btrfs.
>
> Many? Examples? In this context we're talking about endless loops caused
> by non-failing allocations.
>
>> ==========================================
>> $ grep -rnH __GFP_NOFAIL fs/btrfs/
>> fs/btrfs/extent-tree.c:5970: GFP_NOFS | __GFP_NOFAIL);
>> fs/btrfs/extent-tree.c:6043: bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
>> fs/btrfs/extent_io.c:4643: eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
>> fs/btrfs/extent_io.c:4909: p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
>> ==========================================
>
> I'm aware of the existing NOFAIL allocations. There are two at
> extent_buffer allocation time, added recently and provoked by the
> intended changes to memory allocator that would fail the GFP_NOFS
> allocations.
>
> The other two are from year 2010, set_extent_bit, IMHO added so the
> error handling does not get complicated and expressing "we don't want to
> fail here". But there are many other calls to set_extent_bit that could
> fail. This is inconsistent and should be unified. In a way that we're
> sure that we're not introducing potential hangs.
>
>> I understand fixing these problems cooperate with
>> memory cgroup guys is the best in the long run.
>
> It's not about cgroups, btrfs needs to ideally handle all memory
> allocation failures in a way that uses some sort of fallbacks but still
> can lead to a transaction abort if there's simply no memory left.
>
>> However, I consider bypassing this problem for now
>> is better than the current implementation.
>
> It's more like replacing one problem with another. With every new
> NOFAIL, one has to think about the runtime interactions with the others.
> I'd rather see this fixed with a particular call path in mind or class,
> eg. the extent_map bit settings, than throwing NOFAIL at places that
> somebody accidentally hit.
>
> As a temporary fix we can add __GFP_HIGH to the interesting sites so we
> can get access to the emergency reserves, and this is on my list of
> things to do after the NOFS -> KERNEL updates are done.

OK, got it. I'll reconsider how to fix there problem.

Thanks,
Satoru

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

      reply	other threads:[~2016-02-18  1:59 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-02-15  5:38 [PATCH] btrfs: Avoid BUG_ON()s because of ENOMEM caused by kmalloc() failure Satoru Takeuchi
2016-02-15 17:53 ` David Sterba
2016-02-17  5:54   ` Satoru Takeuchi
2016-02-17 15:11     ` David Sterba
2016-02-18  1:59       ` Satoru Takeuchi [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=56C5257A.20603@jp.fujitsu.com \
    --to=takeuchi_satoru@jp.fujitsu.com \
    --cc=dsterba@suse.cz \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.