devicetree.vger.kernel.org archive mirror
From: Frank Rowand <frowand.list@gmail.com>
To: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	devicetree@vger.kernel.org
Cc: Rob Herring <robh+dt@kernel.org>, Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH] of: allocate / free phandle cache outside of the devtree_lock
Date: Tue, 12 Nov 2019 16:48:12 -0600	[thread overview]
Message-ID: <e62e6a76-c614-80c4-2c9b-cd4326105fd9@gmail.com> (raw)
In-Reply-To: <20191111172142.ozczh7j2gmzi7o5k@linutronix.de>

Hi Sebastian,

On 11/11/19 11:21 AM, Sebastian Andrzej Siewior wrote:
> The phandle cache code allocates memory while holding devtree_lock which
> is a raw_spinlock_t. Memory allocation (and free()) is not possible on
> RT while a raw_spinlock_t is held.
> Invoke the kfree() and kcalloc() while the lock is dropped.

I thought the GFP_ATOMIC passed to kcalloc() in of_populate_phandle_cache()
was sufficient.  And I didn't realize (or remember) that kfree() is
not allowed while a raw_spinlock_t is held.  Do you have a
pointer to the preempt RT documentation that explains that?
I'd like to add that pointer to my personal notes about locking so
that I won't misremember this too often.


> Cc: Rob Herring <robh+dt@kernel.org>
> Cc: Frank Rowand <frowand.list@gmail.com>
> Cc: devicetree@vger.kernel.org
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> 
> This is a repost of:
> 	https://lore.kernel.org/linux-devicetree/20180910154227.xsbbqvw3cayro4gg@linutronix.de/
> 
> I mentioned this patch (briefly) to Frank, let me summarize:
> 
> of_populate_phandle_cache() triggers a warning during boot on arm64 with
> RT enabled. By moving memory allocation/free outside of the locked
> section (which really disables interrupts on -RT) everything is fine
> again.
> 
> The lock has been made a raw_spinlock_t in RT as part of the pSeries bring-up.
> It then made its way upstream as:
>    28d0e36bf9686 ("OF: Fixup resursive locking code paths")
>    d6d3c4e656513 ("OF: convert devtree lock from rw_lock to raw spinlock")
> 
> I've been looking into making devtree_lock a spinlock_t which would
> avoid this patch. I haven't seen an issue during boot on arm64 even
> with hotplug. However Power64/pSeries complained during boot:
> 
> | smp: Bringing up secondary CPUs ...
> | BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:973
> | in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
> | 1 lock held by swapper/1/0:
> |  #0: c000000000def6e0 (devtree_lock){+.+.}, at: of_find_node_opts_by_path+0x1f4/0x230
> | Preemption disabled at:
> | [<c0000000000557a0>] start_secondary+0xd0/0x6a0
> |
> | Call Trace:
> | [c0000001f9667d10] [c000000000158e30] ___might_sleep+0x250/0x270
> | [c0000001f9667da0] [c000000000984f40] rt_spin_lock+0x70/0x90
> | [c0000001f9667de0] [c0000000007e3634] of_find_node_opts_by_path+0x1f4/0x230
> | [c0000001f9667e30] [c0000000007e3844] of_get_next_cpu_node+0x144/0x180
> | [c0000001f9667e70] [c0000000007e38d8] of_get_cpu_node+0x58/0x90
> | [c0000001f9667eb0] [c00000000002eb00] cpu_to_chip_id+0x20/0x70
> | [c0000001f9667ee0] [c000000000055858] start_secondary+0x188/0x6a0
> | [c0000001f9667f90] [c00000000000b554] start_secondary_prolog+0x10/0x14
> 
> because cpu_to_chip_id() acquires devtree_lock() early in the CPU-bring
> up path.

I read too much into that sentence, and ran off on a tangent re-educating
myself on preempt RT lock stuff.

The issue in this path is that start_secondary() disables preemption before
going down the code path that ends up with an attempt by of_find_node_opts_by_path()
to lock devtree_lock.  It is ok to acquire a raw spinlock with preemption
disabled, but not ok to acquire a normal spinlock with preemption disabled.

The calling path to cpu_to_chip_id() has an intervening call that does not
show up in the above trace, add_cpu_to_masks().  The first call of cpu_to_chip_id()
is "int chipid = cpu_to_chip_id(cpu)", which could be moved out to start_secondary(),
before preemption is disabled.  But at the end of add_cpu_to_masks() is:

        for_each_cpu(i, cpu_online_mask)
                if (cpu_to_chip_id(i) == chipid)
                        set_cpus_related(cpu, i, cpu_core_mask);

This use of cpu_to_chip_id() is a little harder to move to before preemption
is disabled, but it is possible.  A table of the chip ids for all possible
cpus could be created before disabling preemption, and the table could be
passed into add_cpu_to_masks().  This would allow devtree_lock to be changed
to a spinlock_t.

I like this approach because it removes the one known place that constrains
what type of lock devtree_lock is.

My second choice (and I am willing to accept this) is:

> 
>  drivers/of/base.c |   19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -138,31 +138,34 @@ static u32 phandle_cache_mask;
>  /*
>   * Caller must hold devtree_lock.
>   */

Add a one line comment to the effect that kfree() cannot be called
while a raw_spinlock_t is held, so the caller must do the kfree().


> -static void __of_free_phandle_cache(void)
> +static struct device_node** __of_free_phandle_cache(void)
>  {
>  	u32 cache_entries = phandle_cache_mask + 1;
>  	u32 k;
> +	struct device_node **shadow;
>  
>  	if (!phandle_cache)
> -		return;
> +		return NULL;
>  
>  	for (k = 0; k < cache_entries; k++)
>  		of_node_put(phandle_cache[k]);
>  
> -	kfree(phandle_cache);
> +	shadow = phandle_cache;
>  	phandle_cache = NULL;
> +	return shadow;
>  }
>  
>  int of_free_phandle_cache(void)
>  {
>  	unsigned long flags;
> +	struct device_node **shadow;
>  
>  	raw_spin_lock_irqsave(&devtree_lock, flags);
>  
> -	__of_free_phandle_cache();
> +	shadow = __of_free_phandle_cache();
>  
>  	raw_spin_unlock_irqrestore(&devtree_lock, flags);
> -
> +	kfree(shadow);
>  	return 0;
>  }
>  #if !defined(CONFIG_MODULES)
> @@ -197,10 +200,11 @@ void of_populate_phandle_cache(void)
>  	u32 cache_entries;
>  	struct device_node *np;
>  	u32 phandles = 0;
> +	struct device_node **shadow;
>  
>  	raw_spin_lock_irqsave(&devtree_lock, flags);
>  
> -	__of_free_phandle_cache();
> +	shadow = __of_free_phandle_cache();
>  
>  	for_each_of_allnodes(np)
>  		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
> @@ -208,12 +212,14 @@ void of_populate_phandle_cache(void)
>  
>  	if (!phandles)
>  		goto out;

Add a one line comment to the effect that a raw_spinlock_t cannot be held
when calling kcalloc().


> +	raw_spin_unlock_irqrestore(&devtree_lock, flags);
>  
>  	cache_entries = roundup_pow_of_two(phandles);
>  	phandle_cache_mask = cache_entries - 1;
> 

Need to avoid a race with of_find_node_by_phandle().  So change the following
to tmp_phandle_cache = kcalloc(...
 
>  	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
>  				GFP_ATOMIC);
> +	raw_spin_lock_irqsave(&devtree_lock, flags);

Then here:

        phandle_cache = tmp_phandle_cache;

>  	if (!phandle_cache)
>  		goto out;
>  
> @@ -225,6 +231,7 @@ void of_populate_phandle_cache(void)
>  
>  out:
>  	raw_spin_unlock_irqrestore(&devtree_lock, flags);
> +	kfree(shadow);
>  }
>  
>  void __init of_core_init(void)
> 

The subtle race with of_find_node_by_phandle() is that if
of_find_node_by_phandle() added an entry to the cache it
also did an of_node_get().  It is ok for of_populate_phandle_cache()
to overwrite the cache entry, but doing so would also do an additional
of_node_get().

-Frank


Thread overview: 12+ messages
2019-11-11 17:21 [PATCH] of: allocate / free phandle cache outside of the devtree_lock Sebastian Andrzej Siewior
2019-11-12  3:35 ` Rob Herring
2019-11-12  9:10   ` Sebastian Andrzej Siewior
2019-11-12 15:55     ` Rob Herring
2019-11-12 23:46       ` Frank Rowand
2019-11-13  0:48         ` Rob Herring
2019-11-13 16:52           ` Frank Rowand
2019-11-12 22:48 ` Frank Rowand [this message]
2019-11-29 13:57   ` Sebastian Andrzej Siewior
2019-11-30  2:48     ` Frank Rowand
2019-11-29 14:04   ` [PATCH v2] " Sebastian Andrzej Siewior
2019-11-30  2:46     ` Frank Rowand
