From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH] soc/tegra: pmc: Fix "scheduling while atomic"
To: Dmitry Osipenko , Stephen Warren , Thierry Reding ,
 "Alexandre Courbot" , Peter De Schrijver , Prashant Gaikwad
References: <1460900051-3065-1-git-send-email-digetx@gmail.com>
 <572B47DE.1090804@nvidia.com>
CC: , ,
From: Jon Hunter
Message-ID: <5745C02A.20308@nvidia.com>
Date: Wed, 25 May 2016 16:09:30 +0100
In-Reply-To:
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 05/05/16 15:24, Dmitry Osipenko wrote:
> Hello, Jon!
>
> On 05.05.2016 16:17, Jon Hunter wrote:
>>
>> Thanks for the report. I have been unable to reproduce this, but then I
>> don't see my tegra20 entering LP2 during cpuidle. I did force my tegra20
>> into LP2 during suspend which will exercise the same code but I did not
>> trigger this either. However, from looking at the code it does appear
>> that we could hit this.
>>
>
> As I wrote before, it's quite difficult to reproduce.

So far I have been unable to reproduce this. I did notice that in the
upstream kernel we disable LP2 on Tegra20 if PCIe is enabled (see
arch/arm/mach-tegra/cpuidle-tegra20.c) ...
/*
 * Tegra20 HW appears to have a bug such that PCIe device interrupts, whether
 * they are legacy IRQs or MSI, are lost when LP2 is enabled. To work around
 * this, simply disable LP2 if the PCI driver and DT node are both enabled.
 */
void tegra20_cpuidle_pcie_irqs_in_use(void)
{
	pr_info_once(
		"Disabling cpuidle LP2 state, since PCIe IRQs are in use\n");
	tegra_idle_driver.states[1].disabled = true;
}

Even if I remove this and verify that I can enter LP2, I have been unable
to reproduce this. I know that you said that it is difficult to reproduce
and that a specific workload is needed; however, from looking at the code
I am trying to understand the situation that would trigger this. Your
backtrace shows ...

[ 3.430853] [] (dump_stack) from [] (__schedule_bug+0x50/0x64)
[ 3.431079] [] (__schedule_bug) from [] (__schedule+0x5c8/0x688)
[ 3.431453] [] (__schedule) from [] (schedule_preempt_disabled+0x24/0x34)
[ 3.431835] [] (schedule_preempt_disabled) from [] (__mutex_lock_slowpath+0xbc/0x170)
[ 3.432204] [] (__mutex_lock_slowpath) from [] (mutex_lock+0x4c/0x50)
[ 3.432427] [] (mutex_lock) from [] (clk_prepare_lock+0x88/0xfc)
[ 3.432800] [] (clk_prepare_lock) from [] (clk_get_rate+0xc/0x60)
[ 3.433177] [] (clk_get_rate) from [] (tegra_pmc_enter_suspend_mode+0x188/0x20c)
[ 3.433580] [] (tegra_pmc_enter_suspend_mode) from [] (tegra_idle_lp2_last+0xc/0x40)
[ 3.433795] [] (tegra_idle_lp2_last) from [] (tegra20_idle_lp2_coupled+0x118/0x1fc)
[ 3.434171] [] (tegra20_idle_lp2_coupled) from [] (cpuidle_enter_state+0x3c/0x160)
[ 3.434551] [] (cpuidle_enter_state) from [] (cpuidle_enter_state_coupled+0x3dc/0x3f4)
[ 3.434959] [] (cpuidle_enter_state_coupled) from [] (cpu_startup_entry+0x240/0x288)
[ 3.435340] [] (cpu_startup_entry) from [] (start_kernel+0x3b4/0x3c0)
[ 3.435557] [] (start_kernel) from [<00008074>] (0x8074)

...
however, when we call tegra_idle_lp2_last(), CPU1 should already be down,
and so I would not expect the mutex_trylock() in clk_prepare_lock() (called
via clk_get_rate()) to fail (i.e. return 0 for contention) at this point
and cause us to call mutex_lock() and sleep. Therefore, I am wondering if
there could be another bug in the v3.18 kernel that you are using that is
triggering this. If you are able to reproduce this on v3.18, it would be
good if you could trace the CCF calls around this WARNING to see what is
causing the contention.

Cheers
Jon

--
nvpublic