Re: [PULL] Btrfs, updates for 4.12

From: Chris Mason <clm@fb.com>
To: <fdmanana@gmail.com>, David Sterba <dsterba@suse.com>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PULL] Btrfs, updates for 4.12
Date: Wed, 26 Apr 2017 13:26:49 -0400
Message-ID: <90cd8455-c226-799e-bab3-3d55b59d3db3@fb.com>
In-Reply-To: <CAL3q7H6dNZ1SShz6iaHYeqMQ+PCAKPzetMuRm+8MVA-8h2E0MA@mail.gmail.com>

On 04/26/2017 11:06 AM, Filipe Manana wrote:
 
> Hi,
> 
> Did you actually run xfstests with those readahead patches to
> preallocate radix tree nodes?
> 
> With those 2 patches applied (Chris' for-linus.4,12 branch) this
> breaks things and many btrfs specific tests (at least, since I can't
> get past them) result in tons of traces like the following in a debug
> kernel:
> 
> [ 8180.696804] BUG: sleeping function called from invalid context at
> mm/slab.h:432
> [ 8180.703584] in_atomic(): 1, irqs_disabled(): 0, pid: 28583, name: btrfs
> [ 8180.724146] 2 locks held by btrfs/28583:
> [ 8180.726427]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff811c1e33>]
> mnt_want_write_file+0x25/0x4d
> [ 8180.736742]  #1:  (&(&fs_info->reada_lock)->rlock){+.+.+.}, at:
> [<ffffffffa02306eb>] reada_add_block+0x2fe/0x6cd [btrfs]
> [ 8180.766321] Preemption disabled at:
> [ 8180.766326] [<ffffffff8107ac54>] preempt_count_add+0x65/0x68
> [ 8180.794837] CPU: 5 PID: 28583 Comm: btrfs Tainted: G        W
> 4.11.0-rc8-btrfs-next-39+ #1
> [ 8180.798818] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> [ 8180.798818] Call Trace:
> [ 8180.798818]  dump_stack+0x68/0x92
> [ 8180.798818]  ? preempt_count_add+0x65/0x68
> [ 8180.798818]  ___might_sleep+0x20f/0x226
> [ 8180.798818]  __might_sleep+0x77/0x7e
> [ 8180.798818]  slab_pre_alloc_hook+0x32/0x4f
> [ 8180.798818]  kmem_cache_alloc+0x39/0x233
> [ 8180.798818]  ? radix_tree_node_alloc.constprop.12+0x9d/0xdf
> [ 8180.798818]  radix_tree_node_alloc.constprop.12+0x9d/0xdf
> [ 8180.798818]  __radix_tree_create+0xc3/0x143
> [ 8180.798818]  __radix_tree_insert+0x32/0xc0
> [ 8180.798818]  reada_add_block+0x318/0x6cd [btrfs]

So radix_tree_preload doesn't work the way I thought it did.  It populates a 
per-cpu pool of radix tree nodes so the allocation is sure not to fail.

But, when we go to actually allocate the node during radix_tree_insert:


static struct radix_tree_node *                                                 
radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,           
                        struct radix_tree_root *root,                           
                        unsigned int shift, unsigned int offset,                
                        unsigned int count, unsigned int exceptional)           
{                                                                               
        struct radix_tree_node *ret = NULL;                                     
                                                                                
        /*                                                                      
         * Preload code isn't irq safe and it doesn't make sense to use         
         * preloading during an interrupt anyway as all the allocations have    
         * to be atomic. So just do normal allocation when in interrupt.        
         */                                                                     
        if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {            
                struct radix_tree_preload *rtp;                                 
                                                                                
                /*                                                              
                 * Even if the caller has preloaded, try to allocate from the   
                 * cache first for the new node to get accounted to the memory  
                 * cgroup.                                                      
                 */                                                             
                ret = kmem_cache_alloc(radix_tree_node_cachep,                  
                                       gfp_mask | __GFP_NOWARN);                
                if (ret)                                                        
                        goto out;                                               
                                                                                
                /*                                                              
                 * Provided the caller has preloaded here, we will always       
                 * succeed in getting a node here (and never reach              
                 * kmem_cache_alloc)                                            
                 */                                                             
                rtp = this_cpu_ptr(&radix_tree_preloads);                       
                if (rtp->nr) {                                                  
                        ret = rtp->nodes;                                       
                        rtp->nodes = ret->parent;                               
                        rtp->nr--;                                              
                }                                                               
                /*                                                              
                 * Update the allocation stack trace as this is more useful     
                 * for debugging.                                               
                 */                                                             
                kmemleak_update_trace(ret);                                     
                goto out;                                                       
        }                                                                       
        ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);

We only jump into the preload pool if the root's gfp_mask doesn't allow
blocking.  And even when it doesn't, we try kmem_cache_alloc first and
only hit the pool as a last resort.
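
To make that branch easier to reason about, here is a small userspace
model of it.  This is only a sketch: pick_node_source(), enum node_source
and the boolean parameters are hypothetical stand-ins for kernel state,
and "slab_has_memory" glosses over reclaim entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Where a new node would come from in this simplified model. */
enum node_source {
	SRC_SLAB_NONBLOCKING,	/* kmem_cache_alloc(gfp_mask | __GFP_NOWARN) worked */
	SRC_PRELOAD_POOL,	/* per-cpu pool filled earlier by the preload */
	SRC_SLAB_NORMAL,	/* plain kmem_cache_alloc(gfp_mask) */
	SRC_NONE		/* nothing available, the insert fails */
};

/*
 * Hypothetical model of the branch in radix_tree_node_alloc() quoted above.
 * All four parameters are assumptions standing in for kernel state.
 */
enum node_source pick_node_source(bool gfp_allows_blocking, bool in_interrupt,
				  bool slab_has_memory,
				  unsigned int preloaded_nodes)
{
	if (!gfp_allows_blocking && !in_interrupt) {
		/* The slab cache is tried first, even after a preload. */
		if (slab_has_memory)
			return SRC_SLAB_NONBLOCKING;
		/* The pool is only the fallback. */
		if (preloaded_nodes > 0)
			return SRC_PRELOAD_POOL;
		return SRC_NONE;
	}
	/* Blocking root mask or interrupt context: the pool is never used. */
	return slab_has_memory ? SRC_SLAB_NORMAL : SRC_NONE;
}

/* Example checks mirroring the paragraph above. */
void run_examples(void)
{
	/* Root mask allows blocking: preloaded nodes are ignored. */
	assert(pick_node_source(true, false, true, 4) == SRC_SLAB_NORMAL);
	/* Non-blocking root mask, slab empty: fall back to the pool. */
	assert(pick_node_source(false, false, false, 4) == SRC_PRELOAD_POOL);
}
```

The first check is the case from the trace: with a root mask that allows
blocking, the preload does nothing for the insert under reada_lock.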

So I think the right answer is to keep the sleeping flag off the root's
gfp_mask, so inserts under the lock never block, and to keep GFP_KERNEL
for the preload.
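
A rough userspace sketch of why that combination should avoid the
might_sleep splat.  Everything here is a hypothetical model (struct model,
model_alloc() and friends are not kernel APIs), and it assumes the preload
is done before reada_lock is taken and that a successful preload returns
with preemption disabled until the matching preload_end.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins for kernel state; all names are hypothetical. */
struct model {
	int atomic_depth;	/* > 0 while a spinlock is held or preemption is off */
	int pool_nodes;		/* nodes sitting in the per-cpu preload pool */
	int sleep_violations;	/* blocking allocations made in atomic context */
};

/* Stand-in for kmem_cache_alloc(): a blocking mask in atomic context is the bug. */
void model_alloc(struct model *m, bool gfp_allows_blocking)
{
	if (gfp_allows_blocking && m->atomic_depth > 0)
		m->sleep_violations++;
}

/* Stand-in for radix_tree_preload(): fill the pool, then disable preemption. */
void model_preload(struct model *m, bool gfp_allows_blocking)
{
	model_alloc(m, gfp_allows_blocking);
	m->pool_nodes++;
	m->atomic_depth++;
}

/* Stand-in for radix_tree_preload_end(): re-enable preemption. */
void model_preload_end(struct model *m)
{
	m->atomic_depth--;
}

/* Stand-in for radix_tree_insert(): the allocation uses the root's mask. */
void model_insert(struct model *m, bool root_allows_blocking)
{
	model_alloc(m, root_allows_blocking);
}

/* Returns the number of violations for one preload plus locked insert. */
int run_scenario(bool root_allows_blocking)
{
	struct model m = {0, 0, 0};

	model_preload(&m, true);	/* GFP_KERNEL-style preload, before the lock */
	m.atomic_depth++;		/* models spin_lock(&fs_info->reada_lock) */
	model_insert(&m, root_allows_blocking);
	m.atomic_depth--;		/* models spin_unlock() */
	model_preload_end(&m);
	assert(m.atomic_depth == 0);	/* lock and preemption counts balance */
	return m.sleep_violations;
}
```

In this model run_scenario(true) reports one violation (the case in the
trace) and run_scenario(false) reports none, since the only blocking
allocation happens in the preload, outside the lock.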

-chris

Thread overview: 5+ messages
2017-04-19 11:35 [PULL] Btrfs, updates for 4.12 David Sterba
2017-04-26 15:06 ` Filipe Manana
2017-04-26 15:12   ` Chris Mason
2017-04-26 16:08   ` David Sterba
2017-04-26 17:26   ` Chris Mason [this message]
