Subject: Re: [PATCH] of: allocate / free phandle cache outside of the devtree_lock
To: Sebastian Andrzej Siewior
Cc: devicetree@vger.kernel.org, Rob Herring, Thomas Gleixner
References: <20191111172142.ozczh7j2gmzi7o5k@linutronix.de> <20191129135756.tfmuygw4be6yftz5@linutronix.de>
From: Frank Rowand
Message-ID: <1ce03fc5-e527-8bb7-e1af-4d05a9c000e5@gmail.com>
Date: Fri, 29 Nov 2019 20:48:05 -0600
In-Reply-To: <20191129135756.tfmuygw4be6yftz5@linutronix.de>
X-Mailing-List: devicetree@vger.kernel.org

Hi Sebastian,

On 11/29/19 7:57 AM, Sebastian Andrzej Siewior wrote:
> On 2019-11-12 16:48:12 [-0600], Frank Rowand wrote:
>> Hi Sebastian,
> Hi Frank,
>
>> On 11/11/19 11:21 AM, Sebastian Andrzej Siewior wrote:
>>> The phandle cache code allocates memory while holding devtree_lock, which
>>> is a raw_spinlock_t. Memory allocation (and freeing) is not possible on
>>> RT while a raw_spinlock_t is held.
>>> Invoke the kfree() and kcalloc() while the lock is dropped.
>>
>> I thought the GFP_ATOMIC passed to kcalloc() in of_populate_phandle_cache()
>> was sufficient. And I didn't realize (or remember) that kfree() is
>> not allowed while a raw_spinlock_t is held. Do you have a
>> pointer to the preempt RT documentation that explains that?
>> I'd like to add that pointer to my personal notes about locking so
>> that I won't misremember this too often.
>
> Unfortunately not yet. I have a mini document that needs polishing…
> However, as far as my RT memory goes:
> GFP_ATOMIC can be used in non-blocking context. On !RT that includes
> disabled preemption and/or interrupts, usually as part of locking (like
> spin_lock_irqsave()) or of an inherited context such as an IRQ handler
> or softirq.
> On RT the former example is not blocking anymore (sleeping spinlocks,
> threaded IRQs). Locking with a raw_spinlock_t still disables
> preemption and interrupts as part of the procedure. The GFP_ATOMIC in
> this case does not matter on RT in terms of blocking (the part with
> emergency pools and so on remains unchanged).
> The kfree() part within a raw_spinlock_t section has the same
> limitations on RT as kmalloc(), for the same reason. The SLUB part of the
> function is fine. The problem starts once SLUB has to call into the page
> allocator in order to return or ask for memory; we can't do that from a
> context that must not block.

Thanks for writing this up. And thanks for digging even deeper into
the relevant context than I did below.
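For my locking notes, I think the rule you describe boils down to the
pattern below: allocate before taking the raw_spinlock_t, swap the
pointer inside the critical section, and free the old buffer only
after the lock is dropped. (Illustrative sketch only; the example_*
names are made up and are not from your patch.)

	#include <linux/slab.h>
	#include <linux/spinlock.h>

	static DEFINE_RAW_SPINLOCK(example_lock);
	static u32 *example_buf;

	static int example_resize(size_t entries)
	{
		u32 *new_buf, *old_buf;
		unsigned long flags;

		/* may call into the page allocator, so do it unlocked */
		new_buf = kcalloc(entries, sizeof(*new_buf), GFP_KERNEL);
		if (!new_buf)
			return -ENOMEM;

		raw_spin_lock_irqsave(&example_lock, flags);
		old_buf = example_buf;
		example_buf = new_buf;
		raw_spin_unlock_irqrestore(&example_lock, flags);

		/* kfree() may also reach the page allocator: unlocked as well */
		kfree(old_buf);
		return 0;
	}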
-Frank

>
> …
>>> | smp: Bringing up secondary CPUs ...
>>> | BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:973
>>> | in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
>>> | 1 lock held by swapper/1/0:
>>> | #0: c000000000def6e0 (devtree_lock){+.+.}, at: of_find_node_opts_by_path+0x1f4/0x230
>>> | Preemption disabled at:
>>> | [] start_secondary+0xd0/0x6a0
>>> |
>>> | Call Trace:
>>> | [c0000001f9667d10] [c000000000158e30] ___might_sleep+0x250/0x270
>>> | [c0000001f9667da0] [c000000000984f40] rt_spin_lock+0x70/0x90
>>> | [c0000001f9667de0] [c0000000007e3634] of_find_node_opts_by_path+0x1f4/0x230
>>> | [c0000001f9667e30] [c0000000007e3844] of_get_next_cpu_node+0x144/0x180
>>> | [c0000001f9667e70] [c0000000007e38d8] of_get_cpu_node+0x58/0x90
>>> | [c0000001f9667eb0] [c00000000002eb00] cpu_to_chip_id+0x20/0x70
>>> | [c0000001f9667ee0] [c000000000055858] start_secondary+0x188/0x6a0
>>> | [c0000001f9667f90] [c00000000000b554] start_secondary_prolog+0x10/0x14
>>>
>>> because cpu_to_chip_id() acquires devtree_lock early in the CPU bring-up
>>> path.
>>
>> I read too much into that sentence, and ran off on a tangent re-educating
>> myself on preempt RT lock stuff.
>>
>> The issue in this path is that start_secondary() disables preemption before
>> going down the code path that ends up with an attempt by of_find_node_opts_by_path()
>> to lock devtree_lock. It is ok to acquire a raw spinlock with preemption
>> disabled, but not ok to acquire a normal spinlock with preemption disabled.
> That is correct. So
>
>	spin_lock();
>	  spin_lock();
>
> is okay. However
>
>	preempt_disable();
>	  spin_lock();
>
> is not okay. The same is true for s/preempt_disable()/local_irq_disable()/.
>
> Please note that even before preempt_disable() in start_secondary() the
> whole code runs with disabled interrupts as part of CPU bring-up.
>
>> The calling path to cpu_to_chip_id() has an intervening call that does not
>> show up in the above trace, add_cpu_to_masks(). The first call of cpu_to_chip_id()
>> is "int chipid = cpu_to_chip_id(cpu)", which could be moved out to start_secondary(),
>> before preemption is disabled. But at the end of add_cpu_to_masks() is:
>>
>>	for_each_cpu(i, cpu_online_mask)
>>		if (cpu_to_chip_id(i) == chipid)
>>			set_cpus_related(cpu, i, cpu_core_mask);
>>
>> This use of cpu_to_chip_id() is a little harder to move to before preemption
>> is disabled, but it is possible. A table of the chip ids for all possible
>> cpus could be created before disabling preemption, and the table could be
>> passed into add_cpu_to_masks(). This would allow devtree_lock to be changed
>> to a spinlock_t.
>>
>> I like this approach because it removes the one known place that constrains
>> what type of lock devtree_lock is.
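To be a bit more concrete about what I meant there, something like the
sketch below (hypothetical names: chip_ids[] and fill_chip_ids() do not
exist, and the add_cpu_to_masks() change is only sketched):

	/* hypothetical: one entry per possible cpu; could also be
	 * allocated and passed into add_cpu_to_masks() instead of
	 * being file scope
	 */
	static int chip_ids[NR_CPUS];

	/* called from start_secondary() before preempt_disable();
	 * cpu_to_chip_id() may take devtree_lock here
	 */
	static void fill_chip_ids(void)
	{
		int i;

		for_each_possible_cpu(i)
			chip_ids[i] = cpu_to_chip_id(i);
	}

	/* add_cpu_to_masks() would then consult the table instead of
	 * calling cpu_to_chip_id() with preemption disabled:
	 */
		for_each_cpu(i, cpu_online_mask)
			if (chip_ids[i] == chipid)
				set_cpus_related(cpu, i, cpu_core_mask);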
>
> So even if we move that whole section before preempt_disable(), we still
> run with disabled interrupts and can't use spinlock_t.
>
>> My second choice (and I am willing to accept this) is:
>>
>>>
>>>  drivers/of/base.c | 19 +++++++++++++------
>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> --- a/drivers/of/base.c
>>> +++ b/drivers/of/base.c
>>> @@ -138,31 +138,34 @@ static u32 phandle_cache_mask;
>>>  /*
>>>   * Caller must hold devtree_lock.
>>>   */
>>
>> Add a one line comment to the effect that kfree()
>> can not occur while the raw_spinlock_t is held, so the caller must
>> do the kfree().
>
> okay.
>
> …
>>> @@ -208,12 +212,14 @@ void of_populate_phandle_cache(void)
>>>
>>>  	if (!phandles)
>>>  		goto out;
>>
>> Add a one line comment to the effect that the raw_spinlock_t can not be
>> held when calling kcalloc().
> okay.
>
>>> +	raw_spin_unlock_irqrestore(&devtree_lock, flags);
>>>
>>>  	cache_entries = roundup_pow_of_two(phandles);
>>>  	phandle_cache_mask = cache_entries - 1;
>>>
>>
>> Need to avoid a race with of_find_node_by_phandle(). So change the following
>> to tmp_phandle_cache = kcalloc(...
> okay.
>
>>>  	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
>>>  				GFP_ATOMIC);
>>> +	raw_spin_lock_irqsave(&devtree_lock, flags);
>>
>> Then here:
>>
>>	phandle_cache = tmp_phandle_cache;
>>
>>>  	if (!phandle_cache)
>>>  		goto out;
>>>
>>> @@ -225,6 +231,7 @@ void of_populate_phandle_cache(void)
>>>
>>>  out:
>>>  	raw_spin_unlock_irqrestore(&devtree_lock, flags);
>>> +	kfree(shadow);
>>>  }
>>>
>>>  void __init of_core_init(void)
>>>
>>
>> The subtle race with of_find_node_by_phandle() is that if
>> of_find_node_by_phandle() added an entry to the cache, it
>> also did an of_node_get(). It is ok that of_populate_phandle_cache()
>> overwrites the cache entry, but it would also do an additional
>> of_node_get().
> okay.
>
>> -Frank
>
> Sebastian
>
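P.S. Putting those review comments together, I would expect the middle of
of_populate_phandle_cache() to end up looking roughly like the sketch
below (just the intended shape, not a tested patch; tmp_phandle_cache is
the temporary suggested above, and shadow is the old cache pointer your
patch hands back to the caller to free):

	if (!phandles)
		goto out;
	/* the raw_spinlock_t can not be held when calling kcalloc() */
	raw_spin_unlock_irqrestore(&devtree_lock, flags);

	cache_entries = roundup_pow_of_two(phandles);
	phandle_cache_mask = cache_entries - 1;

	tmp_phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
				    GFP_ATOMIC);

	raw_spin_lock_irqsave(&devtree_lock, flags);
	/* publish only after re-acquiring the lock so that a concurrent
	 * of_find_node_by_phandle() never sees, and populates, the new
	 * cache while devtree_lock is dropped
	 */
	phandle_cache = tmp_phandle_cache;
	if (!phandle_cache)
		goto out;

	/* ... populate the cache as before ... */

out:
	raw_spin_unlock_irqrestore(&devtree_lock, flags);
	/* kfree() can not occur while the raw_spinlock_t is held */
	kfree(shadow);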