From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752897Ab3KXUyC (ORCPT ); Sun, 24 Nov 2013 15:54:02 -0500
Received: from v094114.home.net.pl ([79.96.170.134]:60176 "HELO v094114.home.net.pl" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1752554Ab3KXUyA (ORCPT ); Sun, 24 Nov 2013 15:54:00 -0500
From: "Rafael J. Wysocki"
To: Francis Moreau
Cc: Thomas Gleixner, Jingoo Han, "'Borislav Petkov'", "'Wei WANG'", "'LKML'", "'Samuel Ortiz'", "'Chris Ball'"
Subject: Re: 3.12: kernel panic when resuming from suspend to RAM (x86_64)
Date: Sun, 24 Nov 2013 22:06:45 +0100
Message-ID: <4523614.I5MBhorHFt@vostro.rjw.lan>
User-Agent: KMail/4.10.5 (Linux/3.12.0-rc6+; KDE/4.10.5; x86_64; ; )
In-Reply-To: <5291C948.1080305@gmail.com>
References: <20131117195358.GO27323@pd.tnic> <5291C948.1080305@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 7Bit
Content-Type: text/plain; charset="utf-8"
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Sunday, November 24, 2013 10:39:20 AM Francis Moreau wrote:
> Hello Thomas
>
> On 11/22/2013 11:27 PM, Thomas Gleixner wrote:
> > On Fri, 22 Nov 2013, Rafael J. Wysocki wrote:
> >> On Friday, November 22, 2013 10:36:23 PM Francis Moreau wrote:
> >>> Ok, I've finally managed to find out the bad commit:
> >>> ad07277e82dedabacc52c82746633680a3187d25: ACPI / PM: Hold acpi_scan_lock
> >>> over system PM transitions
> >>>
> >>> I verified that the parent commit doesn't have the problem.
> >>
> >> Interesting.
> >>
> >>> Rafael, you're the man now ;)
> >>
> >> I kind of don't see how that commit may result in behavior that you
> >> described earlier in the thread.
> >>
> >> You get a memory corruption that seems to have started to happen because
> >> we're holding an additional lock over suspend resume now. Something's fishy
> >> on that machine and we need to figure out what it is.
> >
> > The hickup happens in the timer softirq.
> >
> > @Francis: Did you try to enable DEBUG_OBJECTS.*. If not please give it
> > a try.
>
> This looks like it was a good idea.
>
> The kernel now outputs the following traces after resuming.
>
> [ 26.973928] WARNING: CPU: 0 PID: 4 at lib/debugobjects.c:260
> debug_print_object+0x83/0xa0()
> [ 26.973932] ODEBUG: free active (active state 0) object type:
> timer_list hint: delayed_work_timer_fn+0x0/0x20
> [ 26.973972] Modules linked in: x86_pkg_temp_thermal intel_powerclamp
> rtsx_pci_ms coretemp memstick kvm_intel i2c_i801 iTCO_wdt
> iTCO_vendor_support i915 i2c_algo_bit intel_agp intel_gtt drm_kms_helper
> r8169 drm kvm mii agpgart i2c_core lpc_ich ac shpchp crc32c_intel
> battery thermal wmi evdev mei_me video mei button mperf processor
> serio_raw microcode ext4 crc16 mbcache jbd2 sr_mod cdrom sd_mod
> usb_storage rtsx_pci_sdmmc mmc_core ahci libahci libata ehci_pci
> ehci_hcd xhci_hcd scsi_mod rtsx_pci usbcore usb_common
> [ 26.974013] CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted
> 3.11.0-rc2-ARCH #64
> [ 26.974014] Hardware name: CLEVO CO. W55xEU
> /W55xEU , BIOS 4.6.5
> 03/05/2013
> [ 26.974019] Workqueue: kacpi_hotplug hotplug_event_work
> [ 26.974020] 0000000000000009 ffff880407d0da18 ffffffff81459fe9
> ffff880407d0da60
> [ 26.974023] ffff880407d0da50 ffffffff8104dc7d ffff880407fad488
> ffffffff81836fc0
> [ 26.974025] ffffffff81701358 ffffffff81afef70 0000000000000003
> ffff880407d0dab0
> [ 26.974027] Call Trace:
> [ 26.974031] [] dump_stack+0x54/0x8d
> [ 26.974043] [] warn_slowpath_common+0x7d/0xa0
> [ 26.974044] [] warn_slowpath_fmt+0x4c/0x50
> [ 26.974047] [] debug_print_object+0x83/0xa0
> [ 26.974050] [] ? queue_work_on+0x50/0x50
> [ 26.974053] [] __debug_check_no_obj_freed+0x1fb/0x240
> [ 26.974059] [] ? rtsx_pci_remove+0x119/0x1d0
> [rtsx_pci]

So a device driven by rtsx_pcr.c is removed after resume.
Without the commit you've bisected, it is removed as well, but that happens
during resume, so rtsx_pci_resume() is likely not called in that case.

I bet that there's a bug either in rtsx_pci_remove() or in rtsx_pci_resume().
The latter should definitely check whether the device is actually still
present before scheduling the delayed work, but then Boris' patch should
take care of that anyway.

Thanks!

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
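
To illustrate the two checks discussed above, here is a minimal userspace sketch in C. It is not the rtsx_pci driver code and not Boris' patch. Every name in it (mock_dev, mock_delayed_work, mock_schedule, mock_cancel, mock_resume, mock_remove) is hypothetical and only stands in for the corresponding kernel concepts. It assumes the resume path can tell whether the device is still present, and it shows the remove path cancelling the delayed work before the object is released, which is the condition the ODEBUG "free active" warning complains about.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for a delayed work item (NOT the kernel's struct delayed_work). */
struct mock_delayed_work {
    bool pending;               /* true while the timer is armed */
};

/* Hypothetical device state, for illustration only. */
struct mock_dev {
    bool present;               /* false once the hardware has gone away */
    bool freed;                 /* set when mock_remove() releases the object */
    struct mock_delayed_work work;
};

/* Arm the timer; plays the role of scheduling delayed work. */
void mock_schedule(struct mock_delayed_work *w)
{
    w->pending = true;
}

/* Disarm the timer and report whether it had been armed. */
bool mock_cancel(struct mock_delayed_work *w)
{
    bool was_pending = w->pending;

    w->pending = false;
    return was_pending;
}

/*
 * Resume path: only re-arm the delayed work if the device is still there.
 * Returns true if work was scheduled.
 */
bool mock_resume(struct mock_dev *dev)
{
    if (!dev->present)
        return false;           /* device vanished during suspend: do nothing */

    mock_schedule(&dev->work);
    return true;
}

/*
 * Remove path: cancel any pending delayed work before releasing the object,
 * so that no active timer is left inside memory that is about to be freed.
 */
void mock_remove(struct mock_dev *dev)
{
    mock_cancel(&dev->work);
    assert(!dev->work.pending); /* nothing may still be armed at this point */
    dev->freed = true;          /* stands in for the actual free */
}
```

With this shape, calling mock_resume() on a device whose present flag is false leaves the work unarmed, and mock_remove() never releases an object whose work is still pending, regardless of the order in which resume and removal happen.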