[2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174

* [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
@ 2006-06-02 22:51 Jeremy Fitzhardinge
  2006-06-04 11:47 ` Rafael J. Wysocki
  2006-06-05  7:37 ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 47+ messages in thread
From: Jeremy Fitzhardinge @ 2006-06-02 22:51 UTC
  To: Linux Kernel Mailing List, mingo

[-- Attachment #1: Type: text/plain, Size: 1255 bytes --]

I'm trying to get suspend/resume working properly on my Thinkpad X60.
This is a dual-core machine, so it's running in SMP mode.

Now that I have a set of patches to make AHCI resume properly, I'm
getting a crash on the second suspend.  I can't get an actual listing of
the oops, but I have a set of screenshots if anyone needs more details.

The gist is that there's a BUG_ON failing at arch/i386/kernel/nmi.c:174
(BUG_ON(counter > NMI_MAX_COUNTER_BITS)), in release_evntsel_nmi.  The
backtrace is:

     release_evntsel_nmi
     stop_apic_nmi_watchdog
     on_each_cpu
     disable_lapic_nmi_watchdog
     lapic_nmi_suspend
     sysdev_suspend
     device_power_down
     suspend_enter
     enter_state
     state_store
     subsys_attr_store
     sysfs_write_file
     vfs_write
     sys_write
     sysenter_past_esp

This happens after all the devices have suspended themselves; then
there's a longish pause (several seconds), and the oops appears.  The
first suspend is very quick.

Everything works as expected when I disable the NMI watchdog with
nmi_watchdog=0 on the kernel command line.
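
For readers unfamiliar with a check of the form
BUG_ON(counter > NMI_MAX_COUNTER_BITS), the following user-space sketch
shows one way a bounds check of that shape can fire. It is not the
kernel's code and not a diagnosis of this crash. It assumes the bit
index is derived by subtracting a base MSR address from the MSR passed
in. All names (model_*), the base address and the limit are hypothetical
values chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constants for illustration; not taken from the kernel tree. */
#define MODEL_EVNTSEL_BASE     0x186u /* assumed base event-select MSR address */
#define MODEL_MAX_COUNTER_BITS 66u    /* assumed upper bound on the bit index */

/*
 * Map an event-select MSR address to a bit index by subtracting the base.
 * Unsigned arithmetic wraps, so an MSR value below the base (for example
 * 0 from state that was never set up) yields a very large index.
 */
unsigned int model_evntsel_msr_to_bit(unsigned int msr)
{
    return msr - MODEL_EVNTSEL_BASE;
}

/* Return true when a release-side check of this shape would trip. */
bool model_release_would_bug(unsigned int msr)
{
    unsigned int counter = model_evntsel_msr_to_bit(msr);

    return counter > MODEL_MAX_COUNTER_BITS;
}

/* Small self-check exercising the in-range and out-of-range cases. */
void model_self_check(void)
{
    assert(!model_release_would_bug(MODEL_EVNTSEL_BASE));      /* index 0 */
    assert(!model_release_would_bug(MODEL_EVNTSEL_BASE + 1u)); /* index 1 */
    assert(model_release_would_bug(0u));                       /* wrapped index */
}
```

In this model, a release call made with an MSR value that was never
reserved or initialised produces a wrapped index and trips the check.
Whether anything like that happens on the second suspend here is an
open question that the oops screenshots would help answer.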

dmesg after a single successful suspend/resume cycle attached.
(resent without .config to get under linux-kernel's size limit; mail if 
you want a copy)

     J



[-- Attachment #2: dmesg.txt --]
[-- Type: text/plain, Size: 48959 bytes --]

Linux version 2.6.17-rc5-mm2 (jeremy@ezr) (gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)) #10 SMP Fri Jun 2 01:23:24 PDT 2006
BIOS-provided physical RAM map:
sanitize start
sanitize end
copy_e820_map() start: 0000000000000000 size: 000000000009f000 end: 000000000009f000 type: 1
copy_e820_map() type is E820_RAM
add_memory_region(0000000000000000, 000000000009f000, 1)
copy_e820_map() start: 000000000009f000 size: 0000000000001000 end: 00000000000a0000 type: 2
add_memory_region(000000000009f000, 0000000000001000, 2)
copy_e820_map() start: 00000000000d2000 size: 0000000000002000 end: 00000000000d4000 type: 2
add_memory_region(00000000000d2000, 0000000000002000, 2)
copy_e820_map() start: 00000000000dc000 size: 0000000000024000 end: 0000000000100000 type: 2
add_memory_region(00000000000dc000, 0000000000024000, 2)
copy_e820_map() start: 0000000000100000 size: 000000007f5d0000 end: 000000007f6d0000 type: 1
copy_e820_map() type is E820_RAM
add_memory_region(0000000000100000, 000000007f5d0000, 1)
copy_e820_map() start: 000000007f6d0000 size: 0000000000013000 end: 000000007f6e3000 type: 3
add_memory_region(000000007f6d0000, 0000000000013000, 3)
copy_e820_map() start: 000000007f6e3000 size: 000000000001d000 end: 000000007f700000 type: 4
add_memory_region(000000007f6e3000, 000000000001d000, 4)
copy_e820_map() start: 000000007f700000 size: 0000000000900000 end: 0000000080000000 type: 2
add_memory_region(000000007f700000, 0000000000900000, 2)
copy_e820_map() start: 00000000f0000000 size: 0000000004000000 end: 00000000f4000000 type: 2
add_memory_region(00000000f0000000, 0000000004000000, 2)
copy_e820_map() start: 00000000fec00000 size: 0000000000010000 end: 00000000fec10000 type: 2
add_memory_region(00000000fec00000, 0000000000010000, 2)
copy_e820_map() start: 00000000fed00000 size: 0000000000000400 end: 00000000fed00400 type: 2
add_memory_region(00000000fed00000, 0000000000000400, 2)
copy_e820_map() start: 00000000fed14000 size: 0000000000006000 end: 00000000fed1a000 type: 2
add_memory_region(00000000fed14000, 0000000000006000, 2)
copy_e820_map() start: 00000000fed1c000 size: 0000000000074000 end: 00000000fed90000 type: 2
add_memory_region(00000000fed1c000, 0000000000074000, 2)
copy_e820_map() start: 00000000fee00000 size: 0000000000001000 end: 00000000fee01000 type: 2
add_memory_region(00000000fee00000, 0000000000001000, 2)
copy_e820_map() start: 00000000ff800000 size: 0000000000800000 end: 0000000100000000 type: 2
add_memory_region(00000000ff800000, 0000000000800000, 2)
 BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
 BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000d2000 - 00000000000d4000 (reserved)
 BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007f6d0000 (usable)
 BIOS-e820: 000000007f6d0000 - 000000007f6e3000 (ACPI data)
 BIOS-e820: 000000007f6e3000 - 000000007f700000 (ACPI NVS)
 BIOS-e820: 000000007f700000 - 0000000080000000 (reserved)
 BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved)
 BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved)
 BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved)
1142MB HIGHMEM available.
895MB LOWMEM available.
found SMP MP-table at 000f68c0
NX (Execute Disable) protection: active
On node 0 totalpages: 521936
  DMA zone: 4096 pages, LIFO batch:0
  Normal zone: 225279 pages, LIFO batch:31
node 0 zone HighMem misaligned start pfn, enable UNALIGNED_ZONE_BOUNDARIES
  HighMem zone: 292561 pages, LIFO batch:31
DMI present.
Using APIC driver default
ACPI: RSDP (v002 LENOVO                                ) @ 0x000f6880
ACPI: XSDT (v001 LENOVO TP-7B    0x00001060  LTP 0x00000000) @ 0x7f6d6621
ACPI: FADT (v003 LENOVO TP-7B    0x00001060 LNVO 0x00000001) @ 0x7f6d6700
ACPI: SSDT (v001 LENOVO TP-7B    0x00001060 MSFT 0x0100000e) @ 0x7f6d68b4
ACPI: ECDT (v001 LENOVO TP-7B    0x00001060 LNVO 0x00000001) @ 0x7f6e2d4a
ACPI: TCPA (v002 LENOVO TP-7B    0x00001060 LNVO 0x00000001) @ 0x7f6e2d9c
ACPI: MADT (v001 LENOVO TP-7B    0x00001060 LNVO 0x00000001) @ 0x7f6e2dce
ACPI: MCFG (v001 LENOVO TP-7B    0x00001060 LNVO 0x00000001) @ 0x7f6e2e36
ACPI: HPET (v001 LENOVO TP-7B    0x00001060 LNVO 0x00000001) @ 0x7f6e2e74
ACPI: BOOT (v001 LENOVO TP-7B    0x00001060  LTP 0x00000001) @ 0x7f6e2fd8
ACPI: SSDT (v001 LENOVO TP-7B    0x00001060 INTL 0x20050513) @ 0x7f6d5bdc
ACPI: SSDT (v001 LENOVO TP-7B    0x00001060 INTL 0x20050513) @ 0x7f6d5a04
ACPI: DSDT (v001 LENOVO TP-7B    0x00001060 MSFT 0x0100000e) @ 0x00000000
ACPI: PM-Timer IO Port: 0x1008
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 6:14 APIC version 20
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 6:14 APIC version 20
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Enabling APIC mode:  Flat.  Using 1 I/O APICs
ACPI: HPET id: 0x8086a201 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 88000000 (gap: 80000000:70000000)
Detected 1828.944 MHz processor.
Built 1 zonelists
Kernel command line: ro root=LABEL=/ 
mapped APIC to ffffd000 (fee00000)
mapped IOAPIC to ffffc000 (fec00000)
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
CPU 0 irqstacks, hard=c04ae000 soft=c048e000
PID hash table entries: 4096 (order: 12, 16384 bytes)
Console: colour VGA+ 80x25
------------------------
| Locking API testsuite:
----------------------------------------------------------------------------
                                 | spin |wlock |rlock |mutex | wsem | rsem |
  --------------------------------------------------------------------------
                     A-A deadlock:failed|failed|failed|failed|failed|failed|
                 A-B-B-A deadlock:failed|failed|  ok  |failed|failed|failed|
             A-B-B-C-C-A deadlock:failed|failed|  ok  |failed|failed|failed|
             A-B-C-A-B-C deadlock:failed|failed|  ok  |failed|failed|failed|
         A-B-B-C-C-D-D-A deadlock:failed|failed|  ok  |failed|failed|failed|
         A-B-C-D-B-D-D-A deadlock:failed|failed|  ok  |failed|failed|failed|
         A-B-C-D-B-C-D-A deadlock:failed|failed|  ok  |failed|failed|failed|
                    double unlock:  ok  |  ok  |failed|failed|failed|failed|
                 bad unlock order:failed|failed|failed|failed|failed|failed|
  --------------------------------------------------------------------------
              recursive read-lock:             |  ok  |             |failed|
  --------------------------------------------------------------------------
     hard-irqs-on + irq-safe-A/12:failed|failed|  ok  |
     soft-irqs-on + irq-safe-A/12:failed|failed|  ok  |
     hard-irqs-on + irq-safe-A/21:failed|failed|  ok  |
     soft-irqs-on + irq-safe-A/21:failed|failed|  ok  |
       sirq-safe-A => hirqs-on/12:failed|failed|  ok  |
       sirq-safe-A => hirqs-on/21:failed|failed|  ok  |
         hard-safe-A + irqs-on/12:failed|failed|  ok  |
         soft-safe-A + irqs-on/12:failed|failed|  ok  |
         hard-safe-A + irqs-on/21:failed|failed|  ok  |
         soft-safe-A + irqs-on/21:failed|failed|  ok  |
    hard-safe-A + unsafe-B #1/123:failed|failed|  ok  |
    soft-safe-A + unsafe-B #1/123:failed|failed|  ok  |
    hard-safe-A + unsafe-B #1/132:failed|failed|  ok  |
    soft-safe-A + unsafe-B #1/132:failed|failed|  ok  |
    hard-safe-A + unsafe-B #1/213:failed|failed|  ok  |
    soft-safe-A + unsafe-B #1/213:failed|failed|  ok  |
    hard-safe-A + unsafe-B #1/231:failed|failed|  ok  |
    soft-safe-A + unsafe-B #1/231:failed|failed|  ok  |
    hard-safe-A + unsafe-B #1/312:failed|failed|  ok  |
    soft-safe-A + unsafe-B #1/312:failed|failed|  ok  |
    hard-safe-A + unsafe-B #1/321:failed|failed|  ok  |
    soft-safe-A + unsafe-B #1/321:failed|failed|  ok  |
    hard-safe-A + unsafe-B #2/123:failed|failed|  ok  |
    soft-safe-A + unsafe-B #2/123:failed|failed|  ok  |
    hard-safe-A + unsafe-B #2/132:failed|failed|  ok  |
    soft-safe-A + unsafe-B #2/132:failed|failed|  ok  |
    hard-safe-A + unsafe-B #2/213:failed|failed|  ok  |
    soft-safe-A + unsafe-B #2/213:failed|failed|  ok  |
    hard-safe-A + unsafe-B #2/231:failed|failed|  ok  |
    soft-safe-A + unsafe-B #2/231:failed|failed|  ok  |
    hard-safe-A + unsafe-B #2/312:failed|failed|  ok  |
    soft-safe-A + unsafe-B #2/312:failed|failed|  ok  |
    hard-safe-A + unsafe-B #2/321:failed|failed|  ok  |
    soft-safe-A + unsafe-B #2/321:failed|failed|  ok  |
      hard-irq lock-inversion/123:failed|failed|  ok  |
      soft-irq lock-inversion/123:failed|failed|  ok  |
      hard-irq lock-inversion/132:failed|failed|  ok  |
      soft-irq lock-inversion/132:failed|failed|  ok  |
      hard-irq lock-inversion/213:failed|failed|  ok  |
      soft-irq lock-inversion/213:failed|failed|  ok  |
      hard-irq lock-inversion/231:failed|failed|  ok  |
      soft-irq lock-inversion/231:failed|failed|  ok  |
      hard-irq lock-inversion/312:failed|failed|  ok  |
      soft-irq lock-inversion/312:failed|failed|  ok  |
      hard-irq lock-inversion/321:failed|failed|  ok  |
      soft-irq lock-inversion/321:failed|failed|  ok  |
      hard-irq read-recursion/123:  ok  |
      soft-irq read-recursion/123:  ok  |
      hard-irq read-recursion/132:  ok  |
      soft-irq read-recursion/132:  ok  |
      hard-irq read-recursion/213:  ok  |
      soft-irq read-recursion/213:  ok  |
      hard-irq read-recursion/231:  ok  |
      soft-irq read-recursion/231:  ok  |
      hard-irq read-recursion/312:  ok  |
      soft-irq read-recursion/312:  ok  |
      hard-irq read-recursion/321:  ok  |
      soft-irq read-recursion/321:  ok  |
--------------------------------------------------------
139 out of 206 testcases failed, as expected. |
----------------------------------------------------
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 2059716k/2087744k available (2151k kernel code, 26752k reserved, 865k data, 232k init, 1170244k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
hpet0: at MMIO 0xfed00000 (virtual 0xf8800000), IRQs 2, 8, 0
hpet0: 3 64-bit timers, 14318180 Hz
Using HPET for base-timer
Calibrating delay using timer specific routine.. 3662.28 BogoMIPS (lpj=7324578)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
SELinux:  Starting in permissive mode
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512
CPU: After generic identify, caps: bfe9fbff 00100000 00000000 00000000 0000c1a9 00000000 00000000
CPU: After vendor identify, caps: bfe9fbff 00100000 00000000 00000000 0000c1a9 00000000 00000000
monitor/mwait feature present.
using mwait in idle threads.
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 2048K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
CPU: After all inits, caps: bfe9fbff 00100000 00000000 00000940 0000c1a9 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Compat vDSO mapped to ffffe000.
Checking 'hlt' instruction... OK.
SMP alternatives: switching to UP code
CPU0: Intel Genuine Intel(R) CPU           T2400  @ 1.83GHz stepping 08
SMP alternatives: switching to SMP code
Booting processor 1/1 eip 3000
CPU 1 irqstacks, hard=c04af000 soft=c048f000
Initializing CPU#1
Calibrating delay using timer specific routine.. 3657.66 BogoMIPS (lpj=7315332)
CPU: After generic identify, caps: bfe9fbff 00100000 00000000 00000000 0000c1a9 00000000 00000000
CPU: After vendor identify, caps: bfe9fbff 00100000 00000000 00000000 0000c1a9 00000000 00000000
monitor/mwait feature present.
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 2048K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
CPU: After all inits, caps: bfe9fbff 00100000 00000000 00000940 0000c1a9 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel Genuine Intel(R) CPU           T2400  @ 1.83GHz stepping 08
Total of 2 processors activated (7319.95 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
checking TSC synchronization across 2 CPUs: 
Brought up 2 CPUs
migration_cost=73
checking if image is initramfs... it is
Freeing initrd memory: 1036k freed
PM: Adding info for No Bus:platform
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: BIOS Bug: MCFG area is not E820-reserved
PCI: Not using MMCONFIG.
PCI: PCI BIOS revision 2.10 entry at 0xfd82b, last bus=24
Setting up standard PCI resources
ACPI: Subsystem revision 20060310
ACPI: Found ECDT
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
PM: Adding info for acpi:acpi
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
PM: Adding info for No Bus:pci0000:00
Boot video device is 0000:00:02.0
PCI: Ignoring BAR0-3 of IDE controller 0000:00:1f.1
PCI: Transparent bridge - 0000:00:1e.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
PM: Adding info for pci:0000:00:00.0
PM: Adding info for pci:0000:00:02.0
PM: Adding info for pci:0000:00:02.1
PM: Adding info for pci:0000:00:1b.0
PM: Adding info for pci:0000:00:1c.0
PM: Adding info for pci:0000:00:1c.1
PM: Adding info for pci:0000:00:1c.2
PM: Adding info for pci:0000:00:1c.3
PM: Adding info for pci:0000:00:1d.0
PM: Adding info for pci:0000:00:1d.1
PM: Adding info for pci:0000:00:1d.2
PM: Adding info for pci:0000:00:1d.3
PM: Adding info for pci:0000:00:1d.7
PM: Adding info for pci:0000:00:1e.0
PM: Adding info for pci:0000:00:1f.0
PM: Adding info for pci:0000:00:1f.1
PM: Adding info for pci:0000:00:1f.2
PM: Adding info for pci:0000:00:1f.3
PM: Adding info for pci:0000:02:00.0
PM: Adding info for pci:0000:03:00.0
PM: Adding info for pci:0000:15:00.0
PM: Adding info for pci:0000:15:00.1
PM: Adding info for pci:0000:15:00.2
ACPI: Embedded Controller [EC] (gpe 28) interrupt mode.
ACPI: Power Resource [PUBS] (on)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXP0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXP1._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXP2._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXP3._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCI1._PRT]
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
PM: Adding info for No Bus:pnp0
PM: Adding info for pnp:00:00
PM: Adding info for pnp:00:01
PM: Adding info for pnp:00:02
PM: Adding info for pnp:00:03
PM: Adding info for pnp:00:04
PM: Adding info for pnp:00:05
PM: Adding info for pnp:00:06
PM: Adding info for pnp:00:07
PM: Adding info for pnp:00:08
PM: Adding info for pnp:00:09
PM: Adding info for pnp:00:0a
PM: Adding info for pnp:00:0b
pnp: PnP ACPI: found 12 devices
Intel 82802 RNG detected
intel_rng: cannot enable RNG, aborting
intel_rng: RNG registering failed (-5)
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
PCI: Ignore bogus resource 6 [0:0] of 0000:00:02.0
PCI: Bridge: 0000:00:1c.0
  IO window: 2000-2fff
  MEM window: ee000000-ee0fffff
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:1c.1
  IO window: 3000-4fff
  MEM window: ec000000-edffffff
  PREFETCH window: e4000000-e40fffff
PCI: Bridge: 0000:00:1c.2
  IO window: 5000-6fff
  MEM window: e8000000-e9ffffff
  PREFETCH window: e4100000-e41fffff
PCI: Bridge: 0000:00:1c.3
  IO window: 7000-8fff
  MEM window: ea000000-ebffffff
  PREFETCH window: e4200000-e42fffff
PCI: Bus 22, cardbus bridge: 0000:15:00.0
  IO window: 00009000-000090ff
  IO window: 00009400-000094ff
  PREFETCH window: e0000000-e1ffffff
  MEM window: e6000000-e7ffffff
PCI: Bridge: 0000:00:1e.0
  IO window: 9000-cfff
  MEM window: e4300000-e7ffffff
  PREFETCH window: e0000000-e3ffffff
ACPI (acpi_bus-0192): Device `EXP0]is not power manageable [20060310]
ACPI: PCI Interrupt 0000:00:1c.0[A] -> GSI 20 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:00:1c.0 to 64
ACPI (acpi_bus-0192): Device `EXP1]is not power manageable [20060310]
ACPI: PCI Interrupt 0000:00:1c.1[B] -> GSI 21 (level, low) -> IRQ 17
PCI: Setting latency timer of device 0000:00:1c.1 to 64
ACPI (acpi_bus-0192): Device `EXP2]is not power manageable [20060310]
ACPI: PCI Interrupt 0000:00:1c.2[C] -> GSI 22 (level, low) -> IRQ 18
PCI: Setting latency timer of device 0000:00:1c.2 to 64
ACPI (acpi_bus-0192): Device `EXP3]is not power manageable [20060310]
ACPI: PCI Interrupt 0000:00:1c.3[D] -> GSI 23 (level, low) -> IRQ 19
PCI: Setting latency timer of device 0000:00:1c.3 to 64
PCI: Enabling device 0000:00:1e.0 (0005 -> 0007)
PCI: Setting latency timer of device 0000:00:1e.0 to 64
ACPI (acpi_bus-0192): Device `CDBS]is not power manageable [20060310]
ACPI: PCI Interrupt 0000:15:00.0[A] -> GSI 16 (level, low) -> IRQ 20
PCI: Setting latency timer of device 0000:15:00.0 to 64
NET: Registered protocol family 2
IP route cache hash table entries: 65536 (order: 6, 262144 bytes)
TCP established hash table entries: 131072 (order: 9, 2621440 bytes)
TCP bind hash table entries: 65536 (order: 8, 1310720 bytes)
TCP: Hash tables configured (established 131072 bind 65536)
TCP reno registered
PM: Adding info for platform:pcspkr
Simple Boot Flag at 0x35 set to 0x1
apm: BIOS not found.
audit: initializing netlink socket (disabled)
audit(1149246804.984:1): initialized
highmem bounce pool size: 64 pages
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
SELinux:  Registering netfilter hooks
Initializing Cryptographic API
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
ACPI: PCI Interrupt 0000:00:1c.0[A] -> GSI 20 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:00:1c.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:1c.0:pcie00]
PM: Adding info for pci_express:0000:00:1c.0:pcie00
Allocate Port Service[0000:00:1c.0:pcie02]
PM: Adding info for pci_express:0000:00:1c.0:pcie02
ACPI: PCI Interrupt 0000:00:1c.1[B] -> GSI 21 (level, low) -> IRQ 17
PCI: Setting latency timer of device 0000:00:1c.1 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:1c.1:pcie00]
PM: Adding info for pci_express:0000:00:1c.1:pcie00
Allocate Port Service[0000:00:1c.1:pcie02]
PM: Adding info for pci_express:0000:00:1c.1:pcie02
ACPI: PCI Interrupt 0000:00:1c.2[C] -> GSI 22 (level, low) -> IRQ 18
PCI: Setting latency timer of device 0000:00:1c.2 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:1c.2:pcie00]
PM: Adding info for pci_express:0000:00:1c.2:pcie00
Allocate Port Service[0000:00:1c.2:pcie02]
PM: Adding info for pci_express:0000:00:1c.2:pcie02
ACPI: PCI Interrupt 0000:00:1c.3[D] -> GSI 23 (level, low) -> IRQ 19
PCI: Setting latency timer of device 0000:00:1c.3 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:1c.3:pcie00]
PM: Adding info for pci_express:0000:00:1c.3:pcie00
Allocate Port Service[0000:00:1c.3:pcie02]
PM: Adding info for pci_express:0000:00:1c.3:pcie02
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
PM: Adding info for platform:vesafb.0
ACPI: ACPI Dock Station Driver 
ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3])
ACPI: Processor [CPU0] (supports 8 throttling states)
ACPI: CPU1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: Processor [CPU1] (supports 8 throttling states)
ACPI: Thermal Zone [THM0] (49 C)
ACPI: Thermal Zone [THM1] (46 C)
PM: Adding info for No Bus:pnp1
isapnp: Scanning for PnP cards...
isapnp: No Plug & Play device found
Real Time Clock Driver v1.12ac
hpet_resources: 0xfed00000 is busy
Linux agpgart interface v0.101 (c) Dave Jones
agpgart: Detected an Intel 945GM Chipset.
agpgart: Detected 7932K stolen memory.
agpgart: AGP aperture is 256M @ 0xd0000000
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
PM: Adding info for platform:serial8250
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ICH7: IDE controller at PCI slot 0000:00:1f.1
ACPI: PCI Interrupt 0000:00:1f.1[C] -> GSI 16 (level, low) -> IRQ 20
ICH7: chipset revision 2
ICH7: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0x1810-0x1817, BIOS settings: hda:pio, hdb:pio
Probing IDE interface ide0...
Probing IDE interface ide0...
Probing IDE interface ide1...
ide-floppy driver 0.99.newide
ACPI: PCI Interrupt 0000:15:00.0[A] -> GSI 16 (level, low) -> IRQ 20
Yenta: CardBus bridge found at 0000:15:00.0 [17aa:201c]
Yenta: ISA IRQ mask 0x0cb8, PCI irq 20
Socket status: 30000006
pcmcia: parent PCI bridge I/O window: 0x9000 - 0xcfff
cs: IO port probe 0x9000-0xcfff: clean.
pcmcia: parent PCI bridge Memory window: 0xe4300000 - 0xe7ffffff
pcmcia: parent PCI bridge Memory window: 0xe0000000 - 0xe3ffffff
usbcore: registered new driver libusual
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
PM: Adding info for platform:i8042
serio: i8042 AUX port at 0x60,0x64 irq 12
PM: Adding info for serio:serio0
serio: i8042 KBD port at 0x60,0x64 irq 1
PM: Adding info for serio:serio1
mice: PS/2 mouse device common for all mice
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
TCP bic registered
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
Using IPI No-Shortcut mode
ACPI: (supports S0 S3 S4 S5)
Time: tsc clocksource has been installed.
Freeing unused kernel memory: 232k freed
Write protecting the kernel read-only data: 387k
Time: hpet clocksource has been installed.
input: AT Translated Set 2 keyboard as /class/input/input0
SCSI subsystem initialized
libata version 1.30 loaded.
ahci 0000:00:1f.2: version 1.3
ACPI (acpi_bus-0192): Device `SATA]is not power manageable [20060310]
ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 16 (level, low) -> IRQ 20
IBM TrackPoint firmware: 0x0e, buttons: 3/3
input: TPPS/2 IBM TrackPoint as /class/input/input1
PCI: Setting latency timer of device 0000:00:1f.2 to 64
ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 4 ports 1.5 Gbps 0x1 impl SATA mode
ahci 0000:00:1f.2: flags: 64bit ncq pm led clo pio slum part 
ata1: SATA max UDMA/133 cmd 0xF883E500 ctl 0x0 bmdma 0x0 irq 20
ata2: Could not start DMA engineof port (-1)
ata2: SATA max UDMA/133 cmd 0xF883E580 ctl 0x0 bmdma 0x0 irq 20
ata3: Could not start DMA engineof port (-1)
ata3: SATA max UDMA/133 cmd 0xF883E600 ctl 0x0 bmdma 0x0 irq 20
ata4: Could not start DMA engineof port (-1)
ata4: SATA max UDMA/133 cmd 0xF883E680 ctl 0x0 bmdma 0x0 irq 20
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: cfg 49:0f00 82:746b 83:7f69 84:6063 85:f469 86:3c49 87:6063 88:043f
ata1.00: ATA-7, max UDMA/100, 195371568 sectors: LBA48 
ata1.00: configured for UDMA/100
scsi0 : ahci
PM: Adding info for No Bus:host0
ata2: SATA link down (SStatus 0 SControl 0)
scsi1 : ahci
PM: Adding info for No Bus:host1
ata3: SATA link down (SStatus 0 SControl 0)
scsi2 : ahci
PM: Adding info for No Bus:host2
ata4: SATA link down (SStatus 0 SControl 0)
scsi3 : ahci
PM: Adding info for No Bus:host3
PM: Adding info for No Bus:target0:0:0
  Vendor: ATA       Model: HTS541010G9SA00   Rev: MBZI
  Type:   Direct-Access                      ANSI SCSI revision: 05
PM: Adding info for scsi:0:0:0:0
SCSI device sda: 195371568 512-byte hdwr sectors (100030 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
SCSI device sda: 195371568 512-byte hdwr sectors (100030 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 >
sd 0:0:0:0: Attached scsi disk sda
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: sda7: orphan cleanup on readonly fs
ext3_orphan_cleanup: deleting unreferenced inode 13258130
ext3_orphan_cleanup: deleting unreferenced inode 13258129
ext3_orphan_cleanup: deleting unreferenced inode 13258128
ext3_orphan_cleanup: deleting unreferenced inode 13258108
ext3_orphan_cleanup: deleting unreferenced inode 13258107
EXT3-fs: sda7: 5 orphan inodes deleted
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
SELinux:  Disabled at runtime.
SELinux:  Unregistering netfilter hooks
audit(1149246816.604:2): selinux=0 auid=4294967295
input: PC Speaker as /class/input/input2
USB Universal Host Controller Interface driver v3.0
ACPI: PCI Interrupt 0000:00:1d.0[A] -> GSI 16 (level, low) -> IRQ 20
PCI: Setting latency timer of device 0000:00:1d.0 to 64
uhci_hcd 0000:00:1d.0: UHCI Host Controller
uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 1
uhci_hcd 0000:00:1d.0: irq 20, io base 0x00001820
usb usb1: new device found, idVendor=0000, idProduct=0000
usb usb1: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb1: Product: UHCI Host Controller
usb usb1: Manufacturer: Linux 2.6.17-rc5-mm2 uhci_hcd
usb usb1: SerialNumber: 0000:00:1d.0
PM: Adding info for usb:usb1
usb usb1: configuration #1 chosen from 1 choice
PM: Adding info for usb:1-0:1.0
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.1[B] -> GSI 17 (level, low) -> IRQ 21
PCI: Setting latency timer of device 0000:00:1d.1 to 64
uhci_hcd 0000:00:1d.1: UHCI Host Controller
uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 2
uhci_hcd 0000:00:1d.1: irq 21, io base 0x00001840
usb usb2: new device found, idVendor=0000, idProduct=0000
usb usb2: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb2: Product: UHCI Host Controller
usb usb2: Manufacturer: Linux 2.6.17-rc5-mm2 uhci_hcd
usb usb2: SerialNumber: 0000:00:1d.1
PM: Adding info for usb:usb2
usb usb2: configuration #1 chosen from 1 choice
PM: Adding info for usb:2-0:1.0
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.2[C] -> GSI 18 (level, low) -> IRQ 22
PCI: Setting latency timer of device 0000:00:1d.2 to 64
uhci_hcd 0000:00:1d.2: UHCI Host Controller
uhci_hcd 0000:00:1d.2: new USB bus registered, assigned bus number 3
uhci_hcd 0000:00:1d.2: irq 22, io base 0x00001860
usb usb3: new device found, idVendor=0000, idProduct=0000
usb usb3: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb3: Product: UHCI Host Controller
usb usb3: Manufacturer: Linux 2.6.17-rc5-mm2 uhci_hcd
usb usb3: SerialNumber: 0000:00:1d.2
PM: Adding info for usb:usb3
usb usb3: configuration #1 chosen from 1 choice
PM: Adding info for usb:3-0:1.0
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.3[D] -> GSI 19 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:00:1d.3 to 64
uhci_hcd 0000:00:1d.3: UHCI Host Controller
uhci_hcd 0000:00:1d.3: new USB bus registered, assigned bus number 4
uhci_hcd 0000:00:1d.3: irq 23, io base 0x00001880
usb usb4: new device found, idVendor=0000, idProduct=0000
usb usb4: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb4: Product: UHCI Host Controller
usb usb4: Manufacturer: Linux 2.6.17-rc5-mm2 uhci_hcd
usb usb4: SerialNumber: 0000:00:1d.3
PM: Adding info for usb:usb4
usb usb4: configuration #1 chosen from 1 choice
PM: Adding info for usb:4-0:1.0
hub 4-0:1.0: USB hub found
hub 4-0:1.0: 2 ports detected
PM: Adding info for No Bus:i2c-0
ACPI: PCI Interrupt 0000:00:1d.7[D] -> GSI 19 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:00:1d.7 to 64
ehci_hcd 0000:00:1d.7: EHCI Host Controller
ehci_hcd 0000:00:1d.7: new USB bus registered, assigned bus number 5
ehci_hcd 0000:00:1d.7: debug port 1
PCI: cache line size of 32 is not supported by device 0000:00:1d.7
ehci_hcd 0000:00:1d.7: irq 23, io mem 0xee444000
ehci_hcd 0000:00:1d.7: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb5: new device found, idVendor=0000, idProduct=0000
usb usb5: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb5: Product: EHCI Host Controller
usb usb5: Manufacturer: Linux 2.6.17-rc5-mm2 ehci_hcd
usb usb5: SerialNumber: 0000:00:1d.7
PM: Adding info for usb:usb5
usb usb5: configuration #1 chosen from 1 choice
PM: Adding info for usb:5-0:1.0
hub 5-0:1.0: USB hub found
hub 5-0:1.0: 8 ports detected
ieee1394: Initialized config rom entry `ip1394'
ath_hal: module license 'Proprietary' taints kernel.
ath_hal: 0.9.17.0 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413)
sd 0:0:0:0: Attached scsi generic sg0 type 0
wlan: 0.8.4.2 (svn r1615)
sdhci: Secure Digital Host Controller Interface driver, 0.11
sdhci: Copyright(c) Pierre Ossman
ACPI: PCI Interrupt 0000:15:00.2[C] -> GSI 18 (level, low) -> IRQ 22
mmc0: SDHCI at 0xe4301800 irq 22 PIO
ath_rate_sample: 1.2 (svn r1615)
Intel(R) PRO/1000 Network Driver - version 7.0.38-k4-NAPI
Copyright (c) 1999-2006 Intel Corporation.
ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 20
PCI: Setting latency timer of device 0000:02:00.0 to 64
cs: IO port probe 0x100-0x3af: excluding 0x1f0-0x1f7
cs: IO port probe 0x3e0-0x4ff: excluding 0x3f0-0x3f7 0x4d0-0x4d7
cs: IO port probe 0x820-0x8ff: clean.
cs: IO port probe 0xc00-0xcf7: clean.
cs: IO port probe 0xa00-0xaff: clean.
e1000: 0000:02:00.0: e1000_probe: (PCI Express:2.5Gb/s:Width x1) 00:16:d3:20:d2:0b
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
ACPI: PCI Interrupt 0000:15:00.1[B] -> GSI 17 (level, low) -> IRQ 21
PCI: Setting latency timer of device 0000:15:00.1 to 64
PM: Adding info for ieee1394:fw-host0
usb 4-2: new full speed USB device using uhci_hcd and address 2
ohci1394: fw-host0: OHCI-1394 1.0 (PCI): IRQ=[21]  MMIO=[e4301000-e43017ff]  Max Packet=[2048]  IR/IT contexts=[4/4]
ath_pci: 0.9.4.5 (svn r1615)
ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 17 (level, low) -> IRQ 21
PCI: Setting latency timer of device 0000:03:00.0 to 64
wifi0: 11a rates: 6Mbps 9Mbps 12Mbps 18Mbps 24Mbps 36Mbps 48Mbps 54Mbps
wifi0: 11b rates: 1Mbps 2Mbps 5.5Mbps 11Mbps
wifi0: 11g rates: 1Mbps 2Mbps 5.5Mbps 11Mbps 6Mbps 9Mbps 12Mbps 18Mbps 24Mbps 36Mbps 48Mbps 54Mbps
wifi0: H/W encryption support: WEP AES AES_CCM TKIP
wifi0: mac 10.3 phy 6.1 radio 10.2
wifi0: Use hw queue 1 for WME_AC_BE traffic
wifi0: Use hw queue 0 for WME_AC_BK traffic
wifi0: Use hw queue 2 for WME_AC_VI traffic
wifi0: Use hw queue 3 for WME_AC_VO traffic
wifi0: Use hw queue 8 for CAB traffic
wifi0: Use hw queue 9 for beacons
usb 4-2: new device found, idVendor=0483, idProduct=2016
usb 4-2: new device strings: Mfr=1, Product=2, SerialNumber=0
usb 4-2: Product: Biometric Coprocessor
usb 4-2: Manufacturer: STMicroelectronics
PM: Adding info for usb:4-2
usb 4-2: configuration #1 chosen from 1 choice
PM: Adding info for usb:4-2:1.0
wifi0: Atheros 5212: mem=0xedf00000, irq=21
PM: Adding info for ieee1394:000ae40600142001
ieee1394: Host added: ID:BUS[0-00:1023]  GUID[000ae40600142001]
PM: Adding info for ieee1394:000ae40600142001-0
ACPI (acpi_bus-0192): Device `HDEF]is not power manageable [20060310]
ACPI: PCI Interrupt 0000:00:1b.0[B] -> GSI 17 (level, low) -> IRQ 21
PCI: Setting latency timer of device 0000:00:1b.0 to 64
Non-volatile memory driver v1.2
floppy0: no floppy controllers found
lp: driver loaded but no devices found
ACPI: AC Adapter [AC] (off-line)
ACPI: Battery Slot [BAT0] (battery present)
ACPI: Power Button (FF) [PWRF]
ACPI: Lid Switch [LID]
ACPI: Sleep Button (CM) [SLPB]
ibm_acpi: IBM ThinkPad ACPI Extras v0.12a
ibm_acpi: http://ibm-acpi.sf.net/
ACPI: Video Device [VID] (multi-head: yes  rom: no  post: no)
ACPI: Video Device [VID] (multi-head: yes  rom: no  post: no)
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
device-mapper: 4.6.0-ioctl (2006-02-17) initialised: dm-devel@redhat.com
EXT3 FS on sda7, internal journal
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda3, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Adding 4192924k swap on /dev/sda6.  Priority:-1 extents:1 across:4192924k
ip_tables: (C) 2000-2006 Netfilter Core Team
Netfilter messages via NETLINK v0.30.
ip_conntrack version 2.4 (8192 buckets, 65536 max) - 224 bytes per conntrack
Bluetooth: Core ver 2.8
NET: Registered protocol family 31
Bluetooth: HCI device and connection manager initialized
Bluetooth: HCI socket layer initialized
Bluetooth: L2CAP ver 2.8
Bluetooth: L2CAP socket layer initialized
Bluetooth: RFCOMM socket layer initialized
Bluetooth: RFCOMM TTY layer initialized
Bluetooth: RFCOMM ver 1.7
Bluetooth: HIDP (Human Interface Emulation) ver 1.1
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
IPv6 over IPv4 tunneling driver
ADDRCONF(NETDEV_UP): eth0: link is not ready
[drm] Initialized drm 1.0.1 20051102
ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 20
[drm] Initialized i915 1.4.0 20060119 on minor 0
sdhci: Secure Digital Host Controller Interface driver, 0.11
sdhci: Copyright(c) Pierre Ossman
ACPI: PCI Interrupt 0000:15:00.2[C] -> GSI 18 (level, low) -> IRQ 22
mmc0: SDHCI at 0xe4301800 irq 22 PIO
PM: Adding info for mmc:mmc0:a95c
mmcblk0: mmc0:a95c SD512 495488KiB 
 mmcblk0: p1
uhci_hcd 0000:00:1d.3: remove, state 1
usb usb4: USB disconnect, address 1
usb 4-2: USB disconnect, address 2
PM: Removing info for usb:4-2:1.0
PM: Removing info for usb:4-2
PM: Removing info for usb:4-0:1.0
PM: Removing info for usb:usb4
uhci_hcd 0000:00:1d.3: USB bus 4 deregistered
uhci_hcd 0000:00:1d.2: remove, state 1
usb usb3: USB disconnect, address 1
PM: Removing info for usb:3-0:1.0
PM: Removing info for usb:usb3
uhci_hcd 0000:00:1d.2: USB bus 3 deregistered
uhci_hcd 0000:00:1d.1: remove, state 1
usb usb2: USB disconnect, address 1
PM: Removing info for usb:2-0:1.0
PM: Removing info for usb:usb2
uhci_hcd 0000:00:1d.1: USB bus 2 deregistered
uhci_hcd 0000:00:1d.0: remove, state 1
usb usb1: USB disconnect, address 1
PM: Removing info for usb:1-0:1.0
PM: Removing info for usb:usb1
uhci_hcd 0000:00:1d.0: USB bus 1 deregistered
PM: Preparing system for mem sleep
Freezing cpus ...
Breaking affinity for irq 0
Breaking affinity for irq 12
Breaking affinity for irq 20
CPU 1 is now offline
SMP alternatives: switching to UP code
CPU1 is down
Stopping tasks: ==========================================================================================================================================================|
Suspending device mmc0:a95c
mmcblk mmc0:a95c: suspend
Suspending device 000ae40600142001-0
Suspending device 000ae40600142001
Suspending device fw-host0
Suspending device 5-0:1.0
hub 5-0:1.0: suspend
Suspending device usb5
usb usb5: suspend, may wakeup
Suspending device i2c-0
Suspending device 0:0:0:0
sd 0:0:0:0: suspend
Suspending device target0:0:0
Suspending device host3
Suspending device host2
Suspending device host1
Suspending device host0
Suspending device serio1
Suspending device serio0
Suspending device i8042
i8042 i8042: suspend
Suspending device serial8250
serial8250 serial8250: suspend
Suspending device pnp1
Suspending device vesafb.0
 vesafb.0: suspend
Suspending device 0000:00:1c.3:pcie02
 0000:00:1c.3:pcie02: suspend
Suspending device 0000:00:1c.3:pcie00
 0000:00:1c.3:pcie00: suspend
Suspending device 0000:00:1c.2:pcie02
 0000:00:1c.2:pcie02: suspend
Suspending device 0000:00:1c.2:pcie00
 0000:00:1c.2:pcie00: suspend
Suspending device 0000:00:1c.1:pcie02
 0000:00:1c.1:pcie02: suspend
Suspending device 0000:00:1c.1:pcie00
 0000:00:1c.1:pcie00: suspend
Suspending device 0000:00:1c.0:pcie02
 0000:00:1c.0:pcie02: suspend
Suspending device 0000:00:1c.0:pcie00
 0000:00:1c.0:pcie00: suspend
Suspending device pcspkr
pcspkr pcspkr: suspend
Suspending device 00:0b
 00:0b: suspend
Suspending device 00:0a
 00:0a: suspend
Suspending device 00:09
i8042 aux 00:09: suspend
Suspending device 00:08
i8042 kbd 00:08: suspend
Suspending device 00:07
 00:07: suspend
Suspending device 00:06
 00:06: suspend
Suspending device 00:05
 00:05: suspend
Suspending device 00:04
 00:04: suspend
Suspending device 00:03
 00:03: suspend
Suspending device 00:02
system 00:02: suspend
Suspending device 00:01
 00:01: suspend
Suspending device 00:00
system 00:00: suspend
Suspending device pnp0
Suspending device 0000:15:00.2
sdhci 0000:15:00.2: suspend
Suspending device 0000:15:00.1
ohci1394 0000:15:00.1: suspend
Suspending device 0000:15:00.0
yenta_cardbus 0000:15:00.0: suspend
Suspending device 0000:03:00.0
ath_pci 0000:03:00.0: suspend
Suspending device 0000:02:00.0
e1000 0000:02:00.0: suspend
Suspending device 0000:00:1f.3
i801_smbus 0000:00:1f.3: suspend
Suspending device 0000:00:1f.2
ahci 0000:00:1f.2: suspend
ACPI (acpi_bus-0192): Device `SATA]is not power manageable [20060310]
Suspending device 0000:00:1f.1
PIIX_IDE 0000:00:1f.1: suspend
Suspending device 0000:00:1f.0
 0000:00:1f.0: suspend
Suspending device 0000:00:1e.0
 0000:00:1e.0: suspend
Suspending device 0000:00:1d.7
ehci_hcd 0000:00:1d.7: suspend, may wakeup
Suspending device 0000:00:1d.3
 0000:00:1d.3: suspend
Suspending device 0000:00:1d.2
 0000:00:1d.2: suspend
Suspending device 0000:00:1d.1
 0000:00:1d.1: suspend
Suspending device 0000:00:1d.0
 0000:00:1d.0: suspend
Suspending device 0000:00:1c.3
pcieport-driver 0000:00:1c.3: suspend
Suspending device 0000:00:1c.2
pcieport-driver 0000:00:1c.2: suspend
Suspending device 0000:00:1c.1
pcieport-driver 0000:00:1c.1: suspend
Suspending device 0000:00:1c.0
pcieport-driver 0000:00:1c.0: suspend
Suspending device 0000:00:1b.0
HDA Intel 0000:00:1b.0: suspend
Suspending device 0000:00:02.1
 0000:00:02.1: suspend
Suspending device 0000:00:02.0
 0000:00:02.0: suspend
Suspending device 0000:00:00.0
agpgart-intel 0000:00:00.0: suspend
Suspending device pci0000:00
Suspending device acpi
 acpi: suspend
Suspending device platform
PM: Entering mem sleep
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
BUG: sleeping function called from invalid context at include/asm/semaphore.h:99
in_atomic():0, irqs_disabled():1
 <c010493a> show_trace_log_lvl+0x58/0x163  <c0104fd8> show_trace+0xf/0x11
 <c01050d1> dump_stack+0x15/0x17  <c02088a4> acpi_os_wait_semaphore+0x62/0xbd
 <c021f044> acpi_ut_acquire_mutex+0x2a/0x7a  <c021534a> acpi_set_register+0x5f/0x17d
 <c021f678> acpi_pm_enter+0x8f/0xb3  <c0140526> suspend_enter+0x34/0x44
 <c014069e> enter_state+0x168/0x1c6  <c0140781> state_store+0x85/0x99
 <c01a4f7e> subsys_attr_store+0x1e/0x22  <c01a5071> sysfs_write_file+0xa7/0xce
 <c016cf8b> vfs_write+0xa8/0x159  <c016d5bf> sys_write+0x41/0x67
 <c03172ad> sysenter_past_esp+0x56/0x79 
Back to C!
PM: Finishing wakeup.
 acpi: resuming
agpgart-intel 0000:00:00.0: resuming
 0000:00:02.0: resuming
ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 20
 0000:00:02.1: resuming
PM: Writing back config space on device 0000:00:02.1 at offset 1 (was 900000, writing 900003)
HDA Intel 0000:00:1b.0: resuming
PM: Writing back config space on device 0000:00:1b.0 at offset 1 (was 100106, writing 100102)
ACPI: PCI Interrupt 0000:00:1b.0[B] -> GSI 17 (level, low) -> IRQ 21
PCI: Setting latency timer of device 0000:00:1b.0 to 64
pcieport-driver 0000:00:1c.0: resuming
ACPI: PCI Interrupt 0000:00:1c.0[A] -> GSI 20 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:00:1c.0 to 64
pcieport-driver 0000:00:1c.1: resuming
ACPI: PCI Interrupt 0000:00:1c.1[B] -> GSI 21 (level, low) -> IRQ 17
PCI: Setting latency timer of device 0000:00:1c.1 to 64
pcieport-driver 0000:00:1c.2: resuming
PM: Writing back config space on device 0000:00:1c.2 at offset 7 (was 20006050, writing 6050)
ACPI: PCI Interrupt 0000:00:1c.2[C] -> GSI 22 (level, low) -> IRQ 18
PCI: Setting latency timer of device 0000:00:1c.2 to 64
pcieport-driver 0000:00:1c.3: resuming
PM: Writing back config space on device 0000:00:1c.3 at offset 1 (was 100000, writing 100107)
PM: Writing back config space on device 0000:00:1c.3 at offset 3 (was 810000, writing 810010)
PM: Writing back config space on device 0000:00:1c.3 at offset 7 (was 20000000, writing 8070)
PM: Writing back config space on device 0000:00:1c.3 at offset 8 (was 0, writing ebf0ea00)
PM: Writing back config space on device 0000:00:1c.3 at offset 9 (was 10001, writing e421e421)
PM: Writing back config space on device 0000:00:1c.3 at offset f (was 40400, writing 4040b)
ACPI: PCI Interrupt 0000:00:1c.3[D] -> GSI 23 (level, low) -> IRQ 19
PCI: Setting latency timer of device 0000:00:1c.3 to 64
 0000:00:1d.0: resuming
PM: Writing back config space on device 0000:00:1d.0 at offset 1 (was 2800005, writing 2800001)
 0000:00:1d.1: resuming
PM: Writing back config space on device 0000:00:1d.1 at offset 1 (was 2800005, writing 2800001)
 0000:00:1d.2: resuming
PM: Writing back config space on device 0000:00:1d.2 at offset 1 (was 2800005, writing 2800001)
 0000:00:1d.3: resuming
PM: Writing back config space on device 0000:00:1d.3 at offset 1 (was 2800005, writing 2800001)
ehci_hcd 0000:00:1d.7: resuming
ACPI: PCI Interrupt 0000:00:1d.7[D] -> GSI 19 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:00:1d.7 to 64
 0000:00:1e.0: resuming
PM: Writing back config space on device 0000:00:1e.0 at offset 1 (was 100005, writing 100007)
PCI: Setting latency timer of device 0000:00:1e.0 to 64
 0000:00:1f.0: resuming
PIIX_IDE 0000:00:1f.1: resuming
ACPI: PCI Interrupt 0000:00:1f.1[C] -> GSI 16 (level, low) -> IRQ 20
ahci 0000:00:1f.2: resuming
ACPI (acpi_bus-0192): Device `SATA]is not power manageable [20060310]
ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 16 (level, low) -> IRQ 20
PCI: Setting latency timer of device 0000:00:1f.2 to 64
ata1: Enable interrupts
ata2: Can't start DMA engine (-1)
i801_smbus 0000:00:1f.3: resuming
e1000 0000:02:00.0: resuming
ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 20
PCI: Setting latency timer of device 0000:02:00.0 to 64
ath_pci 0000:03:00.0: resuming
ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 17 (level, low) -> IRQ 21
PCI: Setting latency timer of device 0000:03:00.0 to 64
yenta_cardbus 0000:15:00.0: resuming
PM: Writing back config space on device 0000:15:00.0 at offset 1 (was 2100003, writing 2100007)
ACPI: PCI Interrupt 0000:15:00.0[A] -> GSI 16 (level, low) -> IRQ 20
ohci1394 0000:15:00.1: resuming
ACPI: PCI Interrupt 0000:15:00.1[B] -> GSI 17 (level, low) -> IRQ 21
sdhci 0000:15:00.2: resuming
PM: Writing back config space on device 0000:15:00.2 at offset 1 (was 2100000, writing 2100002)
PM: Writing back config space on device 0000:15:00.2 at offset 4 (was 0, writing e4301800)
ACPI: PCI Interrupt 0000:15:00.2[C] -> GSI 18 (level, low) -> IRQ 22
system 00:00: resuming
 00:01: resuming
system 00:02: resuming
 00:03: resuming
 00:04: resuming
 00:05: resuming
 00:06: resuming
 00:07: resuming
i8042 kbd 00:08: resuming
pnp: Device 00:08 does not support activation.
i8042 aux 00:09: resuming
pnp: Device 00:09 does not support activation.
 00:0a: resuming
 00:0b: resuming
pcspkr pcspkr: resuming
 0000:00:1c.0:pcie00: resuming
 0000:00:1c.0:pcie02: resuming
 0000:00:1c.1:pcie00: resuming
 0000:00:1c.1:pcie02: resuming
 0000:00:1c.2:pcie00: resuming
 0000:00:1c.2:pcie02: resuming
 0000:00:1c.3:pcie00: resuming
 0000:00:1c.3:pcie02: resuming
 vesafb.0: resuming
serial8250 serial8250: resuming
i8042 i8042: resuming
psmouse serio0: resuming
atkbd serio1: resuming
sd 0:0:0:0: resuming
ata1.00: configured for UDMA/100
usb usb5: resuming
hub 5-0:1.0: resuming
mmcblk mmc0:a95c: resuming
Restarting tasks... done
Thawing cpus ...
SMP alternatives: switching to SMP code
Booting processor 1/1 eip 3000
CPU 1 irqstacks, hard=c04af000 soft=c048f000
Initializing CPU#1
Calibrating delay using timer specific routine.. 3657.71 BogoMIPS (lpj=7315435)
CPU: After generic identify, caps: bfe9fbff 00100000 00000000 00000000 0000c1a9 00000000 00000000
CPU: After vendor identify, caps: bfe9fbff 00100000 00000000 00000000 0000c1a9 00000000 00000000
monitor/mwait feature present.
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 2048K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
CPU: After all inits, caps: bfe9fbff 00100000 00000000 00000940 0000c1a9 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel Genuine Intel(R) CPU           T2400  @ 1.83GHz stepping 08
APIC error on CPU1: 00(40)
CPU1 is up
USB Universal Host Controller Interface driver v3.0
ACPI: PCI Interrupt 0000:00:1d.0[A] -> GSI 16 (level, low) -> IRQ 20
PCI: Setting latency timer of device 0000:00:1d.0 to 64
uhci_hcd 0000:00:1d.0: UHCI Host Controller
uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 1
uhci_hcd 0000:00:1d.0: irq 20, io base 0x00001820
usb usb1: new device found, idVendor=0000, idProduct=0000
usb usb1: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb1: Product: UHCI Host Controller
usb usb1: Manufacturer: Linux 2.6.17-rc5-mm2 uhci_hcd
usb usb1: SerialNumber: 0000:00:1d.0
PM: Adding info for usb:usb1
usb usb1: configuration #1 chosen from 1 choice
PM: Adding info for usb:1-0:1.0
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.1[B] -> GSI 17 (level, low) -> IRQ 21
PCI: Setting latency timer of device 0000:00:1d.1 to 64
uhci_hcd 0000:00:1d.1: UHCI Host Controller
uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 2
uhci_hcd 0000:00:1d.1: irq 21, io base 0x00001840
usb usb2: new device found, idVendor=0000, idProduct=0000
usb usb2: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb2: Product: UHCI Host Controller
usb usb2: Manufacturer: Linux 2.6.17-rc5-mm2 uhci_hcd
usb usb2: SerialNumber: 0000:00:1d.1
PM: Adding info for usb:usb2
usb usb2: configuration #1 chosen from 1 choice
PM: Adding info for usb:2-0:1.0
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.2[C] -> GSI 18 (level, low) -> IRQ 22
PCI: Setting latency timer of device 0000:00:1d.2 to 64
uhci_hcd 0000:00:1d.2: UHCI Host Controller
uhci_hcd 0000:00:1d.2: new USB bus registered, assigned bus number 3
uhci_hcd 0000:00:1d.2: irq 22, io base 0x00001860
usb usb3: new device found, idVendor=0000, idProduct=0000
usb usb3: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb3: Product: UHCI Host Controller
usb usb3: Manufacturer: Linux 2.6.17-rc5-mm2 uhci_hcd
usb usb3: SerialNumber: 0000:00:1d.2
PM: Adding info for usb:usb3
usb usb3: configuration #1 chosen from 1 choice
PM: Adding info for usb:3-0:1.0
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.3[D] -> GSI 19 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:00:1d.3 to 64
uhci_hcd 0000:00:1d.3: UHCI Host Controller
uhci_hcd 0000:00:1d.3: new USB bus registered, assigned bus number 4
uhci_hcd 0000:00:1d.3: irq 23, io base 0x00001880
usb usb4: new device found, idVendor=0000, idProduct=0000
usb usb4: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb4: Product: UHCI Host Controller
usb usb4: Manufacturer: Linux 2.6.17-rc5-mm2 uhci_hcd
usb usb4: SerialNumber: 0000:00:1d.3
PM: Adding info for usb:usb4
usb usb4: configuration #1 chosen from 1 choice
PM: Adding info for usb:4-0:1.0
hub 4-0:1.0: USB hub found
hub 4-0:1.0: 2 ports detected
usb 4-2: new full speed USB device using uhci_hcd and address 2
usb 4-2: new device found, idVendor=0483, idProduct=2016
usb 4-2: new device strings: Mfr=1, Product=2, SerialNumber=0
usb 4-2: Product: Biometric Coprocessor
usb 4-2: Manufacturer: STMicroelectronics
PM: Adding info for usb:4-2
usb 4-2: configuration #1 chosen from 1 choice
PM: Adding info for usb:4-2:1.0


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-02 22:51 [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174 Jeremy Fitzhardinge
@ 2006-06-04 11:47 ` Rafael J. Wysocki
  2006-06-05  7:21   ` Jeremy Fitzhardinge
  2006-06-05  7:37 ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2006-06-04 11:47 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Linux Kernel Mailing List, mingo

Hi,

On Saturday 03 June 2006 00:51, Jeremy Fitzhardinge wrote:
> I'm trying to get suspend/resume working properly on my Thinkpad X60.
> This is a dual-core machine, so it's running in SMP mode.
> 
> Now that I have a set of patches to make AHCI resume properly, I'm
> getting a crash on the second suspend.  I can't get an actual listing of
> the oops, but I have a set of screenshots if anyone needs more details.
> 
> The gist is that there's a BUG_ON failing at arch/i386/kernel/nmi.c:174
> (BUG_ON(counter > NMI_MAX_COUNTER_BITS)), in release_evntsel_nmi.  The
> backtrace is:
> 
>      release_evntsel_nmi
>      stop_apic_nmi_watchdog
>      on_each_cpu
>      disable_lapic_nmi_watchdog
>      lapic_nmi_suspend
>      sysdev_suspend
>      device_power_down
>      suspend_enter
>      enter_state
>      state_store
>      subsys_attr_store
>      sysfs_write_file
>      vfs_write
>      sys_write
>      sysenter_past_esp
> 
> This happens after all the devices have suspended themselves; then
> there's a longish pause (several seconds), and the oops appears.  The
> first suspend is very quick.
> 
> Everything works as expected when I disable nmi watchdog with
> nmi_watchdog=0 on the kernel command line.
> 
> dmesg after a single successful suspend/resume cycle attached.
> (resent without .config to get under linux-kernel's size limit; mail if 
> you want a copy)

Well, this looks like a tough one.  Could you please create a bugzilla entry
with all of the relevant information?

Greetings,
Rafael


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-04 11:47 ` Rafael J. Wysocki
@ 2006-06-05  7:21   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 47+ messages in thread
From: Jeremy Fitzhardinge @ 2006-06-05  7:21 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linux Kernel Mailing List, mingo

Rafael J. Wysocki wrote:
> Well, this looks like a tough one.  Could you please create a bugzilla entry
> with all of the relevant information?
>   
OK, http://bugzilla.kernel.org/show_bug.cgi?id=6647

    J


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-02 22:51 [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174 Jeremy Fitzhardinge
  2006-06-04 11:47 ` Rafael J. Wysocki
@ 2006-06-05  7:37 ` Jeremy Fitzhardinge
  2006-06-05  7:48   ` Andrew Morton
  1 sibling, 1 reply; 47+ messages in thread
From: Jeremy Fitzhardinge @ 2006-06-05  7:37 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Linux Kernel Mailing List, Andrew Morton, dzickus, Andi Kleen

Jeremy Fitzhardinge wrote:
> I'm trying to get suspend/resume working properly on my Thinkpad X60.
> This is a dual-core machine, so it's running in SMP mode.
>
> Now that I have a set of patches to make AHCI resume properly, I'm
> getting a crash on the second suspend.  I can't get an actual listing of
> the oops, but I have a set of screenshots if anyone needs more details.
>
> The gist is that there's a BUG_ON failing at arch/i386/kernel/nmi.c:174
> (BUG_ON(counter > NMI_MAX_COUNTER_BITS)), in release_evntsel_nmi.  The
> backtrace is:
>
>     release_evntsel_nmi
>     stop_apic_nmi_watchdog
>     on_each_cpu
>     disable_lapic_nmi_watchdog
>     lapic_nmi_suspend
>     sysdev_suspend
>     device_power_down
>     suspend_enter
>     enter_state
>     state_store
>     subsys_attr_store
>     sysfs_write_file
>     vfs_write
>     sys_write
>     sysenter_past_esp

This BUG_ON was introduced by the patch 
x86_64-mm-add-performance-counter-reservation-framework-for-up-kernels.patch.

    J


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-05  7:37 ` Jeremy Fitzhardinge
@ 2006-06-05  7:48   ` Andrew Morton
  2006-06-05  7:59     ` Jeremy Fitzhardinge
  2006-06-05  8:35     ` Miles Lane
  0 siblings, 2 replies; 47+ messages in thread
From: Andrew Morton @ 2006-06-05  7:48 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: jeremy, linux-kernel, dzickus, ak, Miles Lane

On Mon, 05 Jun 2006 00:37:22 -0700
Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Jeremy Fitzhardinge wrote:
> > I'm trying to get suspend/resume working properly on my Thinkpad X60.
> > This is a dual-core machine, so it's running in SMP mode.
> >
> > Now that I have a set of patches to make AHCI resume properly, I'm
> > getting a crash on the second suspend.  I can't get an actual listing of
> > the oops, but I have a set of screenshots if anyone needs more details.
> >
> > The gist is that there's a BUG_ON failing at arch/i386/kernel/nmi.c:174
> > (BUG_ON(counter > NMI_MAX_COUNTER_BITS)), in release_evntsel_nmi.  The
> > backtrace is:
> >
> >     release_evntsel_nmi
> >     stop_apic_nmi_watchdog
> >     on_each_cpu
> >     disable_lapic_nmi_watchdog
> >     lapic_nmi_suspend
> >     sysdev_suspend
> >     device_power_down
> >     suspend_enter
> >     enter_state
> >     state_store
> >     subsys_attr_store
> >     sysfs_write_file
> >     vfs_write
> >     sys_write
> >     sysenter_past_esp
> 
> This BUG_ON was introduced by the patch 
> x86_64-mm-add-performance-counter-reservation-framework-for-up-kernels.patch.
> 

http://bugzilla.kernel.org/show_bug.cgi?id=6647 has details.

Do you think the suspend breakage is related to that patch?

Miles also reports that every second suspend fails for him.  Miles, does
'nmi_watchdog=0' make it better?



* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-05  7:48   ` Andrew Morton
@ 2006-06-05  7:59     ` Jeremy Fitzhardinge
  2006-06-05  8:35     ` Miles Lane
  1 sibling, 0 replies; 47+ messages in thread
From: Jeremy Fitzhardinge @ 2006-06-05  7:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, dzickus, ak, Miles Lane

Andrew Morton wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=6647 has details.
>
> Do you think the suspend breakage is related to that patch?
>   

Yes.  I haven't really worked out what's going on in there, but it looks
like it's losing track of what it has allocated and running out of timer
MSRs.  Possibly because the CPUs are reinitialized on resume: it "thaws
the CPUs", which prints the same CPU information as at boot time - caps,
bogomips, etc - so I presume it actually redoes those things.  I wonder
if this makes the performance counter reservation lose track of things?
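
To make the guess above concrete, here is a minimal user-space sketch of
how per-CPU reservation bookkeeping could get out of step with a CPU that
is reinitialized on resume.  It is not the code in nmi.c: every name
(cpu_state, msr_to_bit, watchdog_start, and so on), the MSR base and the
counter count are made up for illustration, and the assumption that the
second CPU comes back without its watchdog re-armed is only a hypothesis.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS     2
#define NR_COUNTERS 4       /* hypothetical number of counters per CPU */
#define MSR_BASE    0x186u  /* hypothetical address of the first event-select MSR */

/* Hypothetical per-CPU bookkeeping: which counters are reserved, and which
 * MSR the watchdog on this CPU believes it owns (0 means "none"). */
struct cpu_state {
	unsigned long owner;
	unsigned int wd_msr;
};

static struct cpu_state cpus[NR_CPUS];

static_assert(NR_COUNTERS <= 8 * sizeof(unsigned long), "bitmap too small");

/* Unsigned arithmetic: an MSR below MSR_BASE (such as 0) wraps to a huge index. */
static unsigned int msr_to_bit(unsigned int msr)
{
	return msr - MSR_BASE;
}

static bool reserve_counter(int cpu, unsigned int msr)
{
	unsigned int bit = msr_to_bit(msr);

	if (bit >= NR_COUNTERS || (cpus[cpu].owner & (1UL << bit)))
		return false;
	cpus[cpu].owner |= 1UL << bit;
	return true;
}

/* Returns false at the point where a range check in the style of BUG_ON would fire. */
static bool release_counter(int cpu, unsigned int msr)
{
	unsigned int bit = msr_to_bit(msr);

	if (bit >= NR_COUNTERS)
		return false;
	cpus[cpu].owner &= ~(1UL << bit);
	return true;
}

static bool watchdog_start(int cpu)
{
	if (!reserve_counter(cpu, MSR_BASE))
		return false;
	cpus[cpu].wd_msr = MSR_BASE;
	return true;
}

static bool watchdog_stop(int cpu)
{
	bool ok = release_counter(cpu, cpus[cpu].wd_msr);

	cpus[cpu].wd_msr = 0;
	return ok;
}

/* Models a CPU being brought back up with its bookkeeping wiped. */
static void cpu_reinit(int cpu)
{
	cpus[cpu] = (struct cpu_state){ 0 };
}

/* Boot, first suspend, resume, second suspend.  Returns true when the
 * second stop on CPU 1 is rejected because it releases MSR 0. */
static bool second_suspend_trips_check(void)
{
	bool ok = watchdog_start(0) && watchdog_start(1);

	ok = ok && watchdog_stop(0) && watchdog_stop(1);  /* first suspend */
	cpu_reinit(1);                                    /* resume: CPU 1 not re-armed */
	ok = ok && watchdog_start(0);
	return ok && watchdog_stop(0) && !watchdog_stop(1);
}
```

In this model the first stop on every CPU is fine, while a second stop on a
CPU that was never re-armed passes an MSR of 0 to the release function and
fails the range check.  Whether that is what happens in the real code is
exactly the open question.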

    J


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-05  7:48   ` Andrew Morton
  2006-06-05  7:59     ` Jeremy Fitzhardinge
@ 2006-06-05  8:35     ` Miles Lane
  2006-06-06  6:44       ` Shaohua Li
  1 sibling, 1 reply; 47+ messages in thread
From: Miles Lane @ 2006-06-05  8:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jeremy Fitzhardinge, linux-kernel, dzickus, ak

On 6/5/06, Andrew Morton <akpm@osdl.org> wrote:

> Do you think the suspend breakage is related to that patch?
>
> Miles also reports that every second suspend fails for him.  Miles, does
> 'nmi_watchdog=0' make it better?

I tried using that as an appended boot option, but it didn't change the
behavior.

    Miles


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-05  8:35     ` Miles Lane
@ 2006-06-06  6:44       ` Shaohua Li
  2006-06-06 14:17         ` Don Zickus
  2006-06-06 16:23         ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 47+ messages in thread
From: Shaohua Li @ 2006-06-06  6:44 UTC (permalink / raw)
  To: Miles Lane; +Cc: Andrew Morton, Jeremy Fitzhardinge, linux-kernel, dzickus, ak

On Mon, 2006-06-05 at 16:35 +0800, Miles Lane wrote:
> On 6/5/06, Andrew Morton <akpm@osdl.org> wrote:
> 
> > Do you think the suspend breakage is related to that patch?
> >
> > Miles also reports that every second suspend fails for him.  Miles, does
> > 'nmi_watchdog=0' make it better?
>
> I tried using that as an appended boot option, but it didn't change the
> behavior.
Does the patch below help? The NMI suspend/resume code doesn't look right
to me. Only CPU0 goes through the suspend/resume code path; the other CPUs
go through the CPU hotplug code path.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---

 linux-2.6.17-rc5-mm3-root/arch/i386/kernel/nmi.c       |   14 +++++++++-----
 linux-2.6.17-rc5-mm3-root/arch/i386/kernel/smpboot.c   |    3 ++-
 linux-2.6.17-rc5-mm3-root/arch/x86_64/kernel/nmi.c     |   14 +++++++++-----
 linux-2.6.17-rc5-mm3-root/arch/x86_64/kernel/smpboot.c |    2 ++
 linux-2.6.17-rc5-mm3-root/include/asm-i386/nmi.h       |    1 +
 linux-2.6.17-rc5-mm3-root/include/asm-x86_64/nmi.h     |    1 +
 6 files changed, 24 insertions(+), 11 deletions(-)

diff -puN arch/i386/kernel/nmi.c~nmi arch/i386/kernel/nmi.c
--- linux-2.6.17-rc5-mm3/arch/i386/kernel/nmi.c~nmi	2006-06-04 15:22:06.000000000 +0800
+++ linux-2.6.17-rc5-mm3-root/arch/i386/kernel/nmi.c	2006-06-05 12:28:17.000000000 +0800
@@ -65,7 +65,6 @@ struct nmi_watchdog_ctlblk {
 static DEFINE_PER_CPU(struct nmi_watchdog_ctlblk, nmi_watchdog_ctlblk);
 
 /* local prototypes */
-static void stop_apic_nmi_watchdog(void *unused);
 static int unknown_nmi_panic_callback(struct pt_regs *regs, int cpu);
 
 extern void show_registers(struct pt_regs *regs);
@@ -377,15 +376,20 @@ static int nmi_pm_active; /* nmi_active 
 
 static int lapic_nmi_suspend(struct sys_device *dev, pm_message_t state)
 {
+	/* only CPU0 goes here, other CPUs should be offline */
 	nmi_pm_active = atomic_read(&nmi_active);
-	disable_lapic_nmi_watchdog();
+	stop_apic_nmi_watchdog(NULL);
+	BUG_ON(atomic_read(&nmi_active) != 0);
 	return 0;
 }
 
 static int lapic_nmi_resume(struct sys_device *dev)
 {
-	if (nmi_pm_active > 0)
-		enable_lapic_nmi_watchdog();
+	/* only CPU0 goes here, other CPUs should be offline */
+	if (nmi_pm_active > 0) {
+		setup_apic_nmi_watchdog(NULL);
+		touch_nmi_watchdog();
+	}
 	return 0;
 }
 
@@ -798,7 +802,7 @@ void setup_apic_nmi_watchdog (void *unus
 	atomic_inc(&nmi_active);
 }
 
-static void stop_apic_nmi_watchdog(void *unused)
+void stop_apic_nmi_watchdog(void *unused)
 {
 	/* only support LOCAL and IO APICs for now */
 	if ((nmi_watchdog != NMI_LOCAL_APIC) &&
diff -puN arch/i386/kernel/smpboot.c~nmi arch/i386/kernel/smpboot.c
--- linux-2.6.17-rc5-mm3/arch/i386/kernel/smpboot.c~nmi	2006-06-04 15:26:47.000000000 +0800
+++ linux-2.6.17-rc5-mm3-root/arch/i386/kernel/smpboot.c	2006-06-05 08:51:16.000000000 +0800
@@ -1368,7 +1368,8 @@ int __cpu_disable(void)
 	 */
 	if (cpu == 0)
 		return -EBUSY;
-
+	if (nmi_watchdog == NMI_LOCAL_APIC)
+		stop_apic_nmi_watchdog(NULL);
 	clear_local_APIC();
 	/* Allow any queued timer interrupts to get serviced */
 	local_irq_enable();
diff -puN include/asm-i386/nmi.h~nmi include/asm-i386/nmi.h
--- linux-2.6.17-rc5-mm3/include/asm-i386/nmi.h~nmi	2006-06-05 08:50:23.000000000 +0800
+++ linux-2.6.17-rc5-mm3-root/include/asm-i386/nmi.h	2006-06-05 08:50:33.000000000 +0800
@@ -23,6 +23,7 @@ extern int reserve_evntsel_nmi(unsigned 
 extern void release_evntsel_nmi(unsigned int);
 
 extern void setup_apic_nmi_watchdog (void *);
+extern void stop_apic_nmi_watchdog (void *);
 extern void disable_timer_nmi_watchdog(void);
 extern void enable_timer_nmi_watchdog(void);
 extern int nmi_watchdog_tick (struct pt_regs * regs, unsigned reason);
diff -puN arch/x86_64/kernel/nmi.c~nmi arch/x86_64/kernel/nmi.c
--- linux-2.6.17-rc5-mm3/arch/x86_64/kernel/nmi.c~nmi	2006-06-05 12:24:03.000000000 +0800
+++ linux-2.6.17-rc5-mm3-root/arch/x86_64/kernel/nmi.c	2006-06-05 12:26:00.000000000 +0800
@@ -65,7 +65,6 @@ struct nmi_watchdog_ctlblk {
 static DEFINE_PER_CPU(struct nmi_watchdog_ctlblk, nmi_watchdog_ctlblk);
 
 /* local prototypes */
-static void stop_apic_nmi_watchdog(void *unused);
 static int unknown_nmi_panic_callback(struct pt_regs *regs, int cpu);
 
 /* converts an msr to an appropriate reservation bit */
@@ -363,15 +362,20 @@ static int nmi_pm_active; /* nmi_active 
 
 static int lapic_nmi_suspend(struct sys_device *dev, pm_message_t state)
 {
+	/* only CPU0 goes here, other CPUs should be offline */
 	nmi_pm_active = atomic_read(&nmi_active);
-	disable_lapic_nmi_watchdog();
+	stop_apic_nmi_watchdog(NULL);
+	BUG_ON(atomic_read(&nmi_active) != 0);
 	return 0;
 }
 
 static int lapic_nmi_resume(struct sys_device *dev)
 {
-	if (nmi_pm_active > 0)
-		enable_lapic_nmi_watchdog();
+	/* only CPU0 goes here, other CPUs should be offline */
+	if (nmi_pm_active > 0) {
+		setup_apic_nmi_watchdog(NULL);
+		touch_nmi_watchdog();
+	}
 	return 0;
 }
 
@@ -709,7 +713,7 @@ void setup_apic_nmi_watchdog(void *unuse
 	atomic_inc(&nmi_active);
 }
 
-static void stop_apic_nmi_watchdog(void *unused)
+void stop_apic_nmi_watchdog(void *unused)
 {
 	/* only support LOCAL and IO APICs for now */
 	if ((nmi_watchdog != NMI_LOCAL_APIC) &&
diff -puN include/asm-x86_64/nmi.h~nmi include/asm-x86_64/nmi.h
--- linux-2.6.17-rc5-mm3/include/asm-x86_64/nmi.h~nmi	2006-06-05 12:34:27.000000000 +0800
+++ linux-2.6.17-rc5-mm3-root/include/asm-x86_64/nmi.h	2006-06-05 12:34:41.000000000 +0800
@@ -54,6 +54,7 @@ extern int reserve_evntsel_nmi(unsigned 
 extern void release_evntsel_nmi(unsigned int);
 
 extern void setup_apic_nmi_watchdog (void *);
+extern void stop_apic_nmi_watchdog (void *);
 extern void disable_timer_nmi_watchdog(void);
 extern void enable_timer_nmi_watchdog(void);
 extern int nmi_watchdog_tick (struct pt_regs * regs, unsigned reason);
diff -puN arch/x86_64/kernel/smpboot.c~nmi arch/x86_64/kernel/smpboot.c
--- linux-2.6.17-rc5-mm3/arch/x86_64/kernel/smpboot.c~nmi	2006-06-05 12:34:56.000000000 +0800
+++ linux-2.6.17-rc5-mm3-root/arch/x86_64/kernel/smpboot.c	2006-06-05 12:45:58.000000000 +0800
@@ -1232,6 +1232,8 @@ int __cpu_disable(void)
 	if (cpu == 0)
 		return -EBUSY;
 
+	if (nmi_watchdog == NMI_LOCAL_APIC)
+		stop_apic_nmi_watchdog(NULL);
 	clear_local_APIC();
 
 	/*
_


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06  6:44       ` Shaohua Li
@ 2006-06-06 14:17         ` Don Zickus
  2006-06-06 14:18           ` Andi Kleen
  2006-06-06 16:23         ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 47+ messages in thread
From: Don Zickus @ 2006-06-06 14:17 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Miles Lane, Andrew Morton, Jeremy Fitzhardinge, linux-kernel, ak

On Tue, Jun 06, 2006 at 02:44:06PM +0800, Shaohua Li wrote:
> On Mon, 2006-06-05 at 16:35 +0800, Miles Lane wrote:
> > On 6/5/06, Andrew Morton <akpm@osdl.org> wrote:
> > 
> > > Do you think the suspend breakage is related to that patch? 
> > > 
> > > Miles also reports that every second suspend fails for him.  Miles,
> > does 
> > > 'nmi_watchdog=0' make it better?
> > 
> > I tried using that as an appended boot option, but it didn't change
> > the 
> > behavior.
> Does below patch help? The nmi suspend/resume doesn't look good to me.
> Only CPU0 uses the suspend/resume code path. Other CPUs run the CPU
> hotplug code path.
>

This patch makes sense.  I was unaware that the suspend/resume case was
only for CPU0.  

However, I am still concerned about one thing.  After looking
through the bugzilla attachments, it appears that Jeremy's machine
crashes on the second suspend because one of the watchdog timers is
turned on after the resume.  

Because he is using an i386 machine, the nmi watchdog is disabled by
default.  This is evident from his attachment (the NMI count from
'cat /proc/interrupts | grep NMI' was zero).  The machine suspends, then
resumes.  Running the same 'cat' command again, we see NMI output.  My
guess is that during the second suspend, the NMI watchdog code sees that
one watchdog is enabled but tries to disable all of them (this is a bug
and I will provide a patch for it).  Upon disabling the 'disabled'
watchdog, the BUG_ON is hit and the machine goes down.  
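
As a rough illustration of the failure mode guessed at above, here is a
user-space toy model, not the kernel code.  All names (wd_start, wd_stop,
wd_stop_all_unguarded, wd_stop_all_guarded) are hypothetical, and a return
value of -1 stands in for the point where the real code would hit its
BUG_ON.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2		/* assumption: dual-core machine, as in the report */

/* Toy per-CPU record; "active" stands in for a programmed watchdog counter. */
struct wd_state {
	bool active;
};

static struct wd_state wd[NR_CPUS];
static int nmi_active;		/* number of CPUs with a running watchdog */

/* Start the watchdog on one CPU; a no-op if it is already running. */
static void wd_start(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (wd[cpu].active)
		return;
	wd[cpu].active = true;
	nmi_active++;
}

/*
 * Stop the watchdog on one CPU.  Returns 0 on success, or -1 where the
 * real code would BUG: tearing down a watchdog that was never set up.
 */
static int wd_stop(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (!wd[cpu].active)
		return -1;
	wd[cpu].active = false;
	nmi_active--;
	return 0;
}

/* Unguarded: one active watchdog triggers a stop on every CPU. */
static int wd_stop_all_unguarded(void)
{
	int cpu, bad = 0;

	if (nmi_active == 0)
		return 0;
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (wd_stop(cpu) < 0)
			bad++;	/* a CPU that was never started */
	return bad;
}

/* Guarded: only stop CPUs whose watchdog is actually running. */
static int wd_stop_all_guarded(void)
{
	int cpu, bad = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (wd[cpu].active && wd_stop(cpu) < 0)
			bad++;
	return bad;
}
```

With only CPU1 started, the unguarded variant reports one bad stop (CPU0),
while the guarded variant reports none.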

My concern is _why_ one of the watchdog timers was re-enabled during
resume.  It makes sense that CPU0 was still disabled, but apparently the
hotplug case for CPU1 didn't realize that CPU1 was previously disabled or
that the default is to remain off.  I need to look at the code for this.
Any input from those who know the hotplug code better than I do?

Cheers,
Don


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 14:17         ` Don Zickus
@ 2006-06-06 14:18           ` Andi Kleen
  2006-06-06 21:45             ` Don Zickus
  0 siblings, 1 reply; 47+ messages in thread
From: Andi Kleen @ 2006-06-06 14:18 UTC (permalink / raw)
  To: Don Zickus
  Cc: Shaohua Li, Miles Lane, Andrew Morton, Jeremy Fitzhardinge,
	linux-kernel


> Because he is using an i386 machine, the nmi watchdog is disabled by
> default. 

I changed that - it's now on by default on i386 too.

-Andi



* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 14:18           ` Andi Kleen
@ 2006-06-06 21:45             ` Don Zickus
  2006-06-06 22:15               ` Andrew Morton
  0 siblings, 1 reply; 47+ messages in thread
From: Don Zickus @ 2006-06-06 21:45 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Shaohua Li, Miles Lane, Andrew Morton, Jeremy Fitzhardinge,
	linux-kernel

On Tue, Jun 06, 2006 at 04:18:15PM +0200, Andi Kleen wrote:
> 
> > Because he is using an i386 machine, the nmi watchdog is disabled by
> > default. 
> 
> I changed that - it's now on by default on i386 too.
> 
> -Andi

I am trying to create a patch for this problem and it just dawned on me,
how does one store the previous state in a suspend/resume path if the code
hotplugs all the cpus first?  CPU0 is easy because an explicit
suspend/resume path is called, but it seems to be called last after all
the other cpus have been removed.  How do I save the state?

Is there a recommended way of doing this?  Or can I assume that
__cpu_disable/enable is only called by the suspend/resume subsystem?

Thanks.

Cheers,
Don


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 21:45             ` Don Zickus
@ 2006-06-06 22:15               ` Andrew Morton
  2006-06-06 23:05                 ` Don Zickus
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2006-06-06 22:15 UTC (permalink / raw)
  To: Don Zickus; +Cc: ak, shaohua.li, miles.lane, jeremy, linux-kernel

On Tue, 6 Jun 2006 17:45:53 -0400
Don Zickus <dzickus@redhat.com> wrote:

> On Tue, Jun 06, 2006 at 04:18:15PM +0200, Andi Kleen wrote:
> > 
> > > Because he is using an i386 machine, the nmi watchdog is disabled by
> > > default. 
> > 
> > I changed that - it's now on by default on i386 too.
> > 
> > -Andi
> 
> I am trying to create a patch for this problem and it just dawned on me,
> how does one store the previous state in a suspend/resume path if the code
> hotplugs all the cpus first?  CPU0 is easy because an explicit
> suspend/resume path is called, but it seems to be called last after all
> the other cpus have been removed.  How do I save the state?

I'm really struggling to understand this question.  If you're referring to
some per-cpu state then a CPU hotplug handler would be appropriate?


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 22:15               ` Andrew Morton
@ 2006-06-06 23:05                 ` Don Zickus
  2006-06-06 23:22                   ` Andrew Morton
  2006-06-06 23:34                   ` Andi Kleen
  0 siblings, 2 replies; 47+ messages in thread
From: Don Zickus @ 2006-06-06 23:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ak, shaohua.li, miles.lane, jeremy, linux-kernel

On Tue, Jun 06, 2006 at 03:15:07PM -0700, Andrew Morton wrote:
> On Tue, 6 Jun 2006 17:45:53 -0400
> Don Zickus <dzickus@redhat.com> wrote:
> 
> > On Tue, Jun 06, 2006 at 04:18:15PM +0200, Andi Kleen wrote:
> > > 
> > > > Because he is using an i386 machine, the nmi watchdog is disabled by
> > > > default. 
> > > 
> > > I changed that - it's now on by default on i386 too.
> > > 
> > > -Andi
> > 
> > I am trying to create a patch for this problem and it just dawned on me,
> > how does one store the previous state in a suspend/resume path if the code
> > hotplugs all the cpus first?  CPU0 is easy because an explicit
> > suspend/resume path is called, but it seems to be called last after all
> > the other cpus have been removed.  How do I save the state?
> 
> I'm really struggling to understand this question.  If you're referring to
> some per-cpu state then a CPU hotplug handler would be appropriate?

Sorry.  I got ahead of myself.  My concern is how the suspend/resume code
works with device drivers on an SMP system.  My initial impression was
that the subsystem registers with the suspend/resume layer and upon such
actions those registered functions are called.  

Inside those functions I saved the previous state of the watchdog timer.
However, I learned today that my understanding was incorrect.  Instead
first the _hotplug_ code is called for every cpu _except_ cpu0.  The
_suspend/resume_ functions are only called in the context of _cpu0_.  

This breaks the design I have because upon resuming the watchdog timers
automatically start on all cpus (except cpu0 because I saved the previous
state through the handlers), regardless of what the previous state was.  

So my question is/was: what is the proper way to handle processor-level
subsystems during the suspend/resume path on an SMP system?  I don't
understand either the hotplug path or the suspend/resume path very well.  

I didn't want to register a hotplug handler because a hotplug event is
really different than a suspend event (I want to _save_ info during a
suspend event).  The documentation I was reading seemed to suggest that
hotplug/suspend/smp was a work-in-progress. 

Is the typical approach to just hack in an extra parameter to the
start/stop functions of the nmi_watchdog letting the function know it is
coming through the suspend/resume path? 

Any tips, code, other docs would be helpful.

Cheers,
Don


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 23:05                 ` Don Zickus
@ 2006-06-06 23:22                   ` Andrew Morton
  2006-06-06 23:27                     ` Jeremy Fitzhardinge
                                       ` (2 more replies)
  2006-06-06 23:34                   ` Andi Kleen
  1 sibling, 3 replies; 47+ messages in thread
From: Andrew Morton @ 2006-06-06 23:22 UTC (permalink / raw)
  To: Don Zickus; +Cc: ak, shaohua.li, miles.lane, jeremy, linux-kernel

On Tue, 6 Jun 2006 19:05:04 -0400
Don Zickus <dzickus@redhat.com> wrote:

> On Tue, Jun 06, 2006 at 03:15:07PM -0700, Andrew Morton wrote:
> > On Tue, 6 Jun 2006 17:45:53 -0400
> > Don Zickus <dzickus@redhat.com> wrote:
> > 
> > > On Tue, Jun 06, 2006 at 04:18:15PM +0200, Andi Kleen wrote:
> > > > 
> > > > > Because he is using an i386 machine, the nmi watchdog is disabled by
> > > > > default. 
> > > > 
> > > > I changed that - it's now on by default on i386 too.
> > > > 
> > > > -Andi
> > > 
> > > I am trying to create a patch for this problem and it just dawned on me,
> > > how does one store the previous state in a suspend/resume path if the code
> > > hotplugs all the cpus first?  CPU0 is easy because an explicit
> > > suspend/resume path is called, but it seems to be called last after all
> > > the other cpus have been removed.  How do I save the state?
> > 
> > I'm really struggling to understand this question.  If you're referring to
> > some per-cpu state then a CPU hotplug handler would be appropriate?
> 
> Sorry.  I got ahead of myself.  My concern is how the suspend/resume code
> works with device drivers on an SMP system.  My initial impression was
> that the subsystem registers with the suspend/resume layer and upon such
> actions those registered functions are called.  
> 
> Inside those functions I saved the previous state of the watchdog timer.
> However, I learned today that my understanding was incorrect.  Instead
> first the _hotplug_ code is called for every cpu _except_ cpu0.  The
> _suspend/resume_ functions are only called in the context of _cpu0_.  
> 
> This breaks the design I have because upon resuming the watchdog timers
> automatically start on all cpus (except cpu0 because I saved the previous
> state through the handlers), regardless of what the previous state was.  
> 
> So my question is/was what is the proper way to handle processor level
> subsystems during the suspend/resume path on an SMP system.  I really
> don't understand the hotplug path nor the suspend/resume path very well.  
> 
> I didn't want to register a hotplug handler because a hotplug event is
> really different than a suspend event (I want to _save_ info during a
> suspend event).  The documentation I was reading seemed to suggest that
> hotplug/suspend/smp was a work-in-progress. 
> 
> Is the typical approach to just hack in an extra parameter to the
> start/stop functions of the nmi_watchdog letting the function know it is
> coming through the suspend/resume path? 
> 
> Any tips, code, other docs would be helpful.
> 

OK...  My understanding of how it works is that the cpu hotplug handlers
are called early in the suspend process to take the CPUs down.  Once all
the APs are shut down, CPU0 will then proceed to handle the devices.

So if you want to save and restore per-cpu NMI state then doing it in the
CPU hot-add and hot-remove handlers is appropriate.  It will affect the
behaviour of _real_ CPU hot-add and hot-remove as well.  But in what
appears to be a correct fashion.
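
A minimal user-space sketch of that idea, assuming the per-cpu state is
nothing more than a "watchdog wanted" flag.  The names wd_set, wd_cpu_down
and wd_cpu_up are hypothetical and do not correspond to real kernel
hotplug callbacks.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4			/* hypothetical CPU count */

static bool wd_running[NR_CPUS];	/* watchdog currently armed on this CPU */
static bool wd_wanted[NR_CPUS];		/* intent remembered while the CPU is offline */

/* Stand-in for the user-visible switch, modelled per CPU. */
static void wd_set(int cpu, bool on)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	wd_running[cpu] = on;
	wd_wanted[cpu] = on;
}

/* Hot-remove handler: save what the CPU was doing, then stop the watchdog. */
static void wd_cpu_down(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	wd_wanted[cpu] = wd_running[cpu];
	wd_running[cpu] = false;
}

/* Hot-add handler: restart the watchdog only if it was running before. */
static void wd_cpu_up(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	wd_running[cpu] = wd_wanted[cpu];
}
```

Because the same two handlers run for a real hot-remove/hot-add and for
the synthesised one during suspend, a CPU that had its watchdog off stays
off after resume in this model.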

All the above applies to suspend-to-disk.  I don't know if suspend-to-RAM
shuts down the APs.



* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 23:22                   ` Andrew Morton
@ 2006-06-06 23:27                     ` Jeremy Fitzhardinge
  2006-06-06 23:32                       ` Andi Kleen
  2006-06-06 23:42                       ` Don Zickus
  2006-06-06 23:38                     ` Nigel Cunningham
  2006-06-08 12:45                     ` Pavel Machek
  2 siblings, 2 replies; 47+ messages in thread
From: Jeremy Fitzhardinge @ 2006-06-06 23:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Don Zickus, ak, shaohua.li, miles.lane, linux-kernel

Andrew Morton wrote:
> All the above applies to suspend-to-disk.  I don't know if suspend-to-RAM
> shuts down the APs.
>   

I'm using suspend-to-mem and it looks like it's unplugging/replugging all 
the CPUs.

The part of the question I don't quite understand is why this is 
considered per-CPU state?  Surely NMI-watchdog is a system-wide thing?  
Or does this also tie into other uses of the performance registers which 
may be set per-CPU?

    J


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 23:27                     ` Jeremy Fitzhardinge
@ 2006-06-06 23:32                       ` Andi Kleen
  2006-06-06 23:42                       ` Don Zickus
  1 sibling, 0 replies; 47+ messages in thread
From: Andi Kleen @ 2006-06-06 23:32 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andrew Morton, Don Zickus, shaohua.li, miles.lane, linux-kernel


> Or does this also tie into other uses of the performance registers which
> may be set per-CPU?

That's it. The registration is to properly share performance registers
between oprofile, nmi watchdog and other users.

-Andi



* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 23:27                     ` Jeremy Fitzhardinge
  2006-06-06 23:32                       ` Andi Kleen
@ 2006-06-06 23:42                       ` Don Zickus
  2006-06-08 20:11                         ` Pavel Machek
  1 sibling, 1 reply; 47+ messages in thread
From: Don Zickus @ 2006-06-06 23:42 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andrew Morton, ak, shaohua.li, miles.lane, linux-kernel

On Tue, Jun 06, 2006 at 04:27:55PM -0700, Jeremy Fitzhardinge wrote:
> Andrew Morton wrote:
> >All the above applies to suspend-to-disk.  I don't know if suspend-to-RAM
> >shuts down the APs.
> >  
> 
> I'm using suspend-to-mem and it looks like it's unplugging/replugging all 
> the CPUs.
> 
> The part of the question I don't quite understand is why this is 
> considered per-CPU state?  Surely NMI-watchdog is a system-wide thing?  
> Or does this also tie into other uses of the performance registers which 
> may be set per-CPU?
> 
>    J

The nmi watchdog is enabled/disabled on a per-cpu basis.  The fact that a
single switch turns all of them on/off is just a convenience.  Adding
code to turn them on/off on a per-cpu basis just requires a simple user
interface.  It has been discussed before as a way to deal with NUMA systems. 

Cheers,
Don



* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 23:42                       ` Don Zickus
@ 2006-06-08 20:11                         ` Pavel Machek
  0 siblings, 0 replies; 47+ messages in thread
From: Pavel Machek @ 2006-06-08 20:11 UTC (permalink / raw)
  To: Don Zickus
  Cc: Jeremy Fitzhardinge, Andrew Morton, ak, shaohua.li, miles.lane,
	linux-kernel

Hi!

> > >All the above applies to suspend-to-disk.  I don't know if suspend-to-RAM
> > >shuts down the APs.
> > >  
> > 
> > I'm using suspend-to-mem and it looks like it's unplugging/replugging all 
> > the CPUs.
> > 
> > The part of the question I don't quite understand is why this is 
> > considered per-CPU state?  Surely NMI-watchdog is a system-wide thing?  
> > Or does this also tie into other uses of the performance registers which 
> > may be set per-CPU?
> 
> The nmi watchdog is enabled/disabled on a per-cpu basis.  The fact that a
> single switch turns all of them on/off is just a convenience.  Adding
> code to turn them on/off on a per-cpu basis just requires a simple user
> interface.  It has been discussed before as a way to deal with NUMA systems. 

Does it make sense to run watchdog on cpu 1 but not on cpu 0? If user
plugs cpu 2, should it get watchdog or not? If I unplug cpu 1 and plug
it back, should it run watchdog or not?

I believe it should be a per-system thing.
							Pavel
-- 
Thanks for all the (sleeping) penguins.


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 23:22                   ` Andrew Morton
  2006-06-06 23:27                     ` Jeremy Fitzhardinge
@ 2006-06-06 23:38                     ` Nigel Cunningham
  2006-06-07  0:06                       ` Jeremy Fitzhardinge
  2006-06-08 12:45                     ` Pavel Machek
  2 siblings, 1 reply; 47+ messages in thread
From: Nigel Cunningham @ 2006-06-06 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Don Zickus, ak, shaohua.li, miles.lane, jeremy, linux-kernel

Hi guys.

Back on board after the big shift :)

On Wednesday 07 June 2006 09:22, Andrew Morton wrote:
> On Tue, 6 Jun 2006 19:05:04 -0400
>
> Don Zickus <dzickus@redhat.com> wrote:
> > On Tue, Jun 06, 2006 at 03:15:07PM -0700, Andrew Morton wrote:
> > > On Tue, 6 Jun 2006 17:45:53 -0400
> > >
> > > Don Zickus <dzickus@redhat.com> wrote:
> > > > On Tue, Jun 06, 2006 at 04:18:15PM +0200, Andi Kleen wrote:
> > > > > > Because he is using an i386 machine, the nmi watchdog is disabled
> > > > > > by default.
> > > > >
> > > > > I changed that - it's now on by default on i386 too.
> > > > >
> > > > > -Andi
> > > >
> > > > I am trying to create a patch for this problem and it just dawned on
> > > > me, how does one store the previous state in a suspend/resume path if
> > > > the code hotplugs all the cpus first?  CPU0 is easy because an
> > > > explicit suspend/resume path is called, but it seems to be called
> > > > last after all the other cpus have been removed.  How do I save the
> > > > state?
> > >
> > > I'm really struggling to understand this question.  If you're referring
> > > to some per-cpu state then a CPU hotplug handler would be appropriate?
> >
> > Sorry.  I got ahead of myself.  My concern is how the suspend/resume code
> > works with device drivers on an SMP system.  My initial impression was
> > that the subsystem registers with the suspend/resume layer and upon such
> > actions those registered functions are called.
> >
> > Inside those functions I saved the previous state of the watchdog timer.
> > However, I learned today that my understanding was incorrect.  Instead
> > first the _hotplug_ code is called for every cpu _except_ cpu0.  The
> > _suspend/resume_ functions are only called in the context of _cpu0_.
> >
> > This breaks the design I have because upon resuming the watchdog timers
> > automatically start on all cpus (except cpu0 because I saved the previous
> > state through the handlers), regardless of what the previous state was.
> >
> > So my question is/was what is the proper way to handle processor level
> > subsystems during the suspend/resume path on an SMP system.  I really
> > don't understand the hotplug path nor the suspend/resume path very well.
> >
> > I didn't want to register a hotplug handler because a hotplug event is
> > really different than a suspend event (I want to _save_ info during a
> > suspend event).  The documentation I was reading seemed to suggest that
> > hotplug/suspend/smp was a work-in-progress.
> >
> > Is the typical approach to just hack in an extra parameter to the
> > start/stop functions of the nmi_watchdog letting the function know it is
> > coming through the suspend/resume path?
> >
> > Any tips, code, other docs would be helpful.
>
> OK...  My understanding of how it works is that the cpu hotplug handlers
> are called early in the suspend process to take the CPUs down.  Once all
> the APs are shut down, CPU0 will then proceed to handle the devices.
>
> So if you want to save and restore per-cpu NMI state then doing it in the
> CPU hot-add and hot-remove handlers is appropriate.  It will affect the
> behaviour of _real_ CPU hot-add and hot-remove as well.  But in what
> appears to be a correct fashion.
>
> All the above applies to suspend-to-disk.  I don't know if suspend-to-RAM
> shuts down the APs.

I'm not sure about suspend to ram either, but I can confirm the rest:

* Hotplugging handlers should manage the state of secondary cpus, really 
disabling NMIs on unplug and only enabling NMIs if the boot processor has 
them enabled (assuming consistent behaviour across cpus is desired). At a 
hotplug event, both the hardware state and values of variables may not match 
the state at the end of a previous hotplug event (we may have powered off, or 
may have switched from a boot kernel to a suspended-to-disk one), so even if 
you know the hotplug event is synthesised, you may still need to treat it as 
uninitialised.
* Driver suspend and resume calls should only handle cpu0, and should not 
touch other processors. The same semantics regarding hardware state and 
values of variables apply here.
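
The two rules above can be sketched as a user-space toy model, assuming a
single boolean per cpu is all the state involved.  The names wd_cpu_unplug,
wd_cpu_plug, wd_suspend and wd_resume are hypothetical, not existing kernel
functions.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS  4	/* hypothetical CPU count */
#define BOOT_CPU 0

static bool wd_running[NR_CPUS];

/* Hotplug, unplug side: really disable the watchdog on a secondary cpu. */
static void wd_cpu_unplug(int cpu)
{
	assert(cpu > BOOT_CPU && cpu < NR_CPUS);
	wd_running[cpu] = false;
}

/*
 * Hotplug, plug side: treat whatever the slot held before as
 * uninitialised and simply follow the boot processor.
 */
static void wd_cpu_plug(int cpu)
{
	assert(cpu > BOOT_CPU && cpu < NR_CPUS);
	wd_running[cpu] = wd_running[BOOT_CPU];
}

/* Driver suspend: touches cpu0 only and reports its previous state. */
static bool wd_suspend(void)
{
	bool was_running = wd_running[BOOT_CPU];

	wd_running[BOOT_CPU] = false;
	return was_running;
}

/* Driver resume: touches cpu0 only. */
static void wd_resume(bool was_running)
{
	wd_running[BOOT_CPU] = was_running;
}
```

The design choice here is that secondary cpus never carry saved state of
their own; consistency across cpus comes from copying the boot processor's
setting at plug time.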

If you _really_ need to store in a variable what the hardware state was at the 
last suspend call, it is possible to use _nosave variables, but this isn't 
done or recommended at the moment. (I think it's a good idea, but that's just 
my opinion). The value of a _nosave variable will persist across the atomic 
restore of a suspend to disk kernel context, and since a drivers_suspend call 
is made before doing the atomic restore, you'll never get uninitialised 
values. I'm not sure what it would be helpful for, but there may be a case.

Hope this helps.

Regards,

Nigel
-- 
Nigel, Michelle and Alisdair Cunningham
5 Mitchell Street
Cobden 3266
Victoria, Australia


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 23:38                     ` Nigel Cunningham
@ 2006-06-07  0:06                       ` Jeremy Fitzhardinge
  2006-06-07  0:13                         ` Nigel Cunningham
  2006-06-08 20:13                         ` Pavel Machek
  0 siblings, 2 replies; 47+ messages in thread
From: Jeremy Fitzhardinge @ 2006-06-07  0:06 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Andrew Morton, Don Zickus, ak, shaohua.li, miles.lane,
	linux-kernel

Nigel Cunningham wrote:
> * Driver suspend and resume calls should only handle cpu0, and should not 
> touch other processors. The same semantics regarding hardware state and 
> values of variables apply here.
>   
Isn't the trouble that in this case, the devices themselves are the 
CPUs, and so the CPUs themselves need to operate on their own state?

Or perhaps, to look at it another way, suspend/resume is just a special 
case of:

   1. unplug cpus 1-N
   2. [something]
   3. re-plug cpus 1-N

where [something] in this case is "suspend cpu0". 
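
The ordering in that list can be written down as a small user-space
sketch.  It only records the sequence of events and assumes a two-cpu
machine; log_event and suspend_resume_cycle are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NR_CPUS 2		/* assumption: dual-core machine */

static char log_buf[256];	/* space-separated event log */

/* Append "<what><cpu> " to the event log. */
static void log_event(const char *what, int cpu)
{
	char line[32];

	assert(cpu >= 0 && cpu < NR_CPUS);
	snprintf(line, sizeof(line), "%s%d ", what, cpu);
	strncat(log_buf, line, sizeof(log_buf) - strlen(log_buf) - 1);
}

/* One suspend/resume cycle viewed as unplug, [something], re-plug. */
static void suspend_resume_cycle(void)
{
	int cpu;

	log_buf[0] = '\0';
	for (cpu = 1; cpu < NR_CPUS; cpu++)
		log_event("down", cpu);		/* 1. unplug cpus 1-N */
	log_event("suspend", 0);		/* 2. [something] = suspend cpu0 */
	log_event("resume", 0);
	for (cpu = 1; cpu < NR_CPUS; cpu++)
		log_event("up", cpu);		/* 3. re-plug cpus 1-N */
}
```

Nothing in this sequence tells the "up" step whether the cpu coming back
is the one that went down, which is the gap described in the next
paragraph.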

But the problem is that there's nothing which keeps track of whether the 
re-plugged cpus 1-N are the "same" as the unplugged 1-N, and so nothing 
can apply the same per-cpu settings to them.  In the suspend/resume case 
they clearly are, but in the general remove/add case, do you really want 
the new CPU to get the same state as the old one just because it ends up 
with the same logical CPU number?  Perhaps, but what if it doesn't even 
have the same capabilities?  (Do we support heterogeneous CPUs anyway?)

    J


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  0:06                       ` Jeremy Fitzhardinge
@ 2006-06-07  0:13                         ` Nigel Cunningham
  2006-06-07  0:24                           ` Andrew Morton
  2006-06-07  0:26                           ` Jeremy Fitzhardinge
  2006-06-08 20:13                         ` Pavel Machek
  1 sibling, 2 replies; 47+ messages in thread
From: Nigel Cunningham @ 2006-06-07  0:13 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andrew Morton, Don Zickus, ak, shaohua.li, miles.lane,
	linux-kernel

Hi.

On Wednesday 07 June 2006 10:06, Jeremy Fitzhardinge wrote:
> Nigel Cunningham wrote:
> > * Driver suspend and resume calls should only handle cpu0, and should not
> > touch other processors. The same semantics regarding hardware state and
> > values of variables apply here.
>
> Isn't the trouble that in this case, the devices themselves are the
> CPUs, and so the CPUs themselves need to operate on their own state?
>
> Or perhaps, to look at it another way, suspend/resume is just a special
> case of:
>
>    1. unplug cpus 1-N
>    2. [something]
>    3. re-plug cpus 1-N
>
> where [something] in this case is "suspend cpu0".
>
> But the problem is that there's nothing which keeps track of whether the
> re-plugged cpus 1-N are the "same" as the unplugged 1-N, and so nothing
> can apply the same per-cpu settings to them.  In the suspend/resume case
> they clearly are, but in the general remove/add case, do you really want

It's probably safer to say "In the suspend/resume case, they may well be." 
It's not inconceivable that a system could be suspended, a faulty cpu 
replaced with another, and the system resumed. Hotplugging ought to handle 
that nicely.

> the new CPU to get the same state as the old one just because it ends up
> with the same logical CPU number?  Perhaps, but what if it doesn't even
> have the same capabilities?  (Do we support heterogeneous CPUs anyway?)

Indeed. I'm also not sure that there's necessarily a guarantee that cpus will 
be hotplugged in the same order. Perhaps those with more knowledge can 
clarify there.

Regards,

Nigel
-- 
Nigel, Michelle and Alisdair Cunningham
5 Mitchell Street
Cobden 3266
Victoria, Australia


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  0:13                         ` Nigel Cunningham
@ 2006-06-07  0:24                           ` Andrew Morton
  2006-06-07  0:29                             ` Jeremy Fitzhardinge
                                               ` (2 more replies)
  2006-06-07  0:26                           ` Jeremy Fitzhardinge
  1 sibling, 3 replies; 47+ messages in thread
From: Andrew Morton @ 2006-06-07  0:24 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: jeremy, dzickus, ak, shaohua.li, miles.lane, linux-kernel

On Wed, 7 Jun 2006 10:13:49 +1000
Nigel Cunningham <ncunningham@linuxmail.org> wrote:

> > the new CPU to get the same state as the old one just because it ends up
> > with the same logical CPU number?  Perhaps, but what if it doesn't even
> > have the same capabilities?  (Do we support heterogeneous CPUs anyway?)
> 
> Indeed. I'm also not sure that there's necessarily a guarantee that cpus will 
> be hotplugged in the same order. Perhaps those with more knowledge can 
> clarify there.

It all depends on what we mean by "per-cpu state".  If we were to remember
that "CPU 7 needs 0x1234 in register 44" then that would be wrong.  But
remembering some high-level functional thing like "CPU 7 needs to run the
NMI watchdog" is fine.  The CPU bringup code can work out whether that is
possible, and how to do it.
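
A minimal sketch of that distinction, under the assumption that a cpu's
capabilities can be summarised in two flags.  The enum wd_method, struct
cpu_caps and wd_bringup are hypothetical names; only the high-level intent
("this cpu should run the watchdog") is remembered, and the bringup step
decides whether and how to honour it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ways a cpu could drive the watchdog. */
enum wd_method { WD_NONE, WD_PERFCTR, WD_IO_APIC };

/* Hypothetical capability summary for the cpu being brought up. */
struct cpu_caps {
	bool has_perfctr;	/* usable performance counter */
	bool has_io_apic_nmi;	/* NMI can be routed via the IO-APIC */
};

/*
 * Bringup: take the remembered intent and work out, for this particular
 * cpu, whether the watchdog is possible and which method to use.  No raw
 * register values are carried over from the cpu that went away.
 */
static enum wd_method wd_bringup(bool want_watchdog, const struct cpu_caps *caps)
{
	assert(caps);
	if (!want_watchdog)
		return WD_NONE;
	if (caps->has_perfctr)
		return WD_PERFCTR;
	if (caps->has_io_apic_nmi)
		return WD_IO_APIC;
	return WD_NONE;		/* wanted, but not possible on this cpu */
}
```

If a replacement cpu has different capabilities, the same remembered
intent still produces a sensible result, which would not be true of a
saved register value.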




* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  0:24                           ` Andrew Morton
@ 2006-06-07  0:29                             ` Jeremy Fitzhardinge
  2006-06-07  0:31                             ` Nigel Cunningham
  2006-06-07  0:33                             ` Andi Kleen
  2 siblings, 0 replies; 47+ messages in thread
From: Jeremy Fitzhardinge @ 2006-06-07  0:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nigel Cunningham, dzickus, ak, shaohua.li, miles.lane,
	linux-kernel

Andrew Morton wrote:
> It all depends on what we mean by "per-cpu state".  If we were to remember
> that "CPU 7 needs 0x1234 in register 44" then that would be wrong.  But
> remembering some high-level functional thing like "CPU 7 needs to run the
> NMI watchdog" is fine.  The CPU bringup code can work out whether that is
> possible, and how to do it.
>
>   

But all the performance counter stuff is very model-specific. I don't 
think there's any abstraction which would allow us to say "CPU 7 is 
measuring branch misprediction stalls in pipeline 2" in any way other 
than "needs 0x1234 in register 44".

    J


* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  0:24                           ` Andrew Morton
  2006-06-07  0:29                             ` Jeremy Fitzhardinge
@ 2006-06-07  0:31                             ` Nigel Cunningham
  2006-06-07  0:33                             ` Andi Kleen
  2 siblings, 0 replies; 47+ messages in thread
From: Nigel Cunningham @ 2006-06-07  0:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: jeremy, dzickus, ak, shaohua.li, miles.lane, linux-kernel

Hi.

On Wednesday 07 June 2006 10:24, you wrote:
> On Wed, 7 Jun 2006 10:13:49 +1000
>
> Nigel Cunningham <ncunningham@linuxmail.org> wrote:
> > > the new CPU to get the same state as the old one just because it ends
> > > up with the same logical CPU number?  Perhaps, but what if it doesn't
> > > even have the same capabilities?  (Do we support heterogeneous CPUs
> > > anyway?)
> >
> > Indeed. I'm also not sure that there's necessarily a guarantee that cpus
> > will be hotplugged in the same order. Perhaps those with more knowledge
> > can clarify there.
>
> It all depends on what we mean by "per-cpu state".  If we were to remember
> that "CPU 7 needs 0x1234 in register 44" then that would be wrong.  But
> remembering some high-level functional thing like "CPU 7 needs to run the
> NMI watchdog" is fine.  The CPU bringup code can work out whether that is
> possible, and how to do it.

Does that imply that there's no danger of cpus being hotplugged in a different 
order (so that cpu7 becomes cpu5, for example)?

I guess I'm missing an understanding of why one cpu would need a different 
configuration to the rest. If it's related to the cpu number, then it 
shouldn't matter if a different cpu gets the number, should it? If it's 
related to the node that the cpu is on, perhaps the hotplugging code for the 
driver should be checking for the reason ("Am I on the node with the... ?") 
rather than the cpu number?

Regards,

Nigel

-- 
Nigel, Michelle and Alisdair Cunningham
5 Mitchell Street
Cobden 3266
Victoria, Australia

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  0:24                           ` Andrew Morton
  2006-06-07  0:29                             ` Jeremy Fitzhardinge
  2006-06-07  0:31                             ` Nigel Cunningham
@ 2006-06-07  0:33                             ` Andi Kleen
  2006-06-07  0:40                               ` Nigel Cunningham
  2 siblings, 1 reply; 47+ messages in thread
From: Andi Kleen @ 2006-06-07  0:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nigel Cunningham, jeremy, dzickus, shaohua.li, miles.lane,
	linux-kernel

On Wednesday 07 June 2006 02:24, Andrew Morton wrote:
> On Wed, 7 Jun 2006 10:13:49 +1000
>
> Nigel Cunningham <ncunningham@linuxmail.org> wrote:
> > > the new CPU to get the same state as the old one just because it ends
> > > up with the same logical CPU number?  Perhaps, but what if it doesn't
> > > even have the same capabilities?  (Do we support heterogeneous CPUs
> > > anyway?)
> >
> > Indeed. I'm also not sure that there's necessarily a guarantee that cpus
> > will be hotplugged in the same order. Perhaps those with more knowledge
> > can clarify there.
>
> It all depends on what we mean by "per-cpu state".  If we were to remember
> that "CPU 7 needs 0x1234 in register 44" then that would be wrong.  But
> remembering some high-level functional thing like "CPU 7 needs to run the
> NMI watchdog" is fine.  The CPU bringup code can work out whether that is
> possible, and how to do it.

Actually the nmi watchdog state should be global, not per CPU.  We
want it to either work for the whole system or be completely disabled.

What is per CPU are the performance counter allocations, but these
can be forgotten over CPU unplug/replug.

(ok this means oprofile  might need to be restarted after suspend/resume,
but I guess that's reasonable) 
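
The split described above (one global watchdog setting, per-CPU counter
allocations that are simply dropped on unplug) can be sketched in plain C.
This is a user-space illustration, not the kernel code: NR_CPUS, NO_COUNTER,
watchdog_enabled, counter_of, cpu_up() and cpu_down() are all hypothetical
names chosen for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS    4     /* hypothetical: small fixed CPU count */
#define NO_COUNTER (-1)  /* hypothetical: "no counter reserved" marker */

/* Global state: the watchdog is on for the whole system or off entirely. */
static bool watchdog_enabled;

/* Per-CPU state: which performance counter (if any) a CPU has reserved. */
static int counter_of[NR_CPUS] = { NO_COUNTER, NO_COUNTER, NO_COUNTER, NO_COUNTER };

/* Bring-up derives per-CPU state from the global setting; nothing is restored. */
static void cpu_up(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    counter_of[cpu] = watchdog_enabled ? 0 : NO_COUNTER;
}

/* Unplug forgets the allocation; the CPU may never come back. */
static void cpu_down(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    counter_of[cpu] = NO_COUNTER;
}
```

Because cpu_up() consults only the global flag, a replugged CPU ends up in
the right state whether or not it is the "same" CPU as before.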

-Andi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  0:33                             ` Andi Kleen
@ 2006-06-07  0:40                               ` Nigel Cunningham
  0 siblings, 0 replies; 47+ messages in thread
From: Nigel Cunningham @ 2006-06-07  0:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, jeremy, dzickus, shaohua.li, miles.lane,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1760 bytes --]

Hi.

On Wednesday 07 June 2006 10:33, Andi Kleen wrote:
> On Wednesday 07 June 2006 02:24, Andrew Morton wrote:
> > On Wed, 7 Jun 2006 10:13:49 +1000
> >
> > Nigel Cunningham <ncunningham@linuxmail.org> wrote:
> > > > the new CPU to get the same state as the old one just because it ends
> > > > up with the same logical CPU number?  Perhaps, but what if it doesn't
> > > > even have the same capabilities?  (Do we support heterogeneous CPUs
> > > > anyway?)
> > >
> > > Indeed. I'm also not sure that there's necessarily a guarantee that
> > > cpus will be hotplugged in the same order. Perhaps those with more
> > > knowledge can clarify there.
> >
> > It all depends on what we mean by "per-cpu state".  If we were to
> > remember that "CPU 7 needs 0x1234 in register 44" then that would be
> > wrong.  But remembering some high-level functional thing like "CPU 7
> > needs to run the NMI watchdog" is fine.  The CPU bringup code can work
> > out whether that is possible, and how to do it.
>
> Actually the nmi watchdog state should be global, not per CPU.  We
> want it to either work for the whole system or be completely disabled.

Ok. Now I get and fully agree with what you said earlier ("Make it work 
properly for CPU hotplug for individual CPU and then in suspend
you take care of "global" state and the last CPU.").

> What is per CPU are the performance counter allocations, but these
> can be forgotten over CPU unplug/replug.
>
> (ok this means oprofile  might need to be restarted after suspend/resume,
> but I guess that's reasonable)

Don't know enough in that area to say anything :>

Regards,

Nigel
-- 
Nigel, Michelle and Alisdair Cunningham
5 Mitchell Street
Cobden 3266
Victoria, Australia

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  0:13                         ` Nigel Cunningham
  2006-06-07  0:24                           ` Andrew Morton
@ 2006-06-07  0:26                           ` Jeremy Fitzhardinge
  2006-06-07  0:33                             ` Nigel Cunningham
  1 sibling, 1 reply; 47+ messages in thread
From: Jeremy Fitzhardinge @ 2006-06-07  0:26 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Andrew Morton, Don Zickus, ak, shaohua.li, miles.lane,
	linux-kernel

Nigel Cunningham wrote:
> It's probably safer to say "In the suspend/resume case, they may well be." 
> It's not inconceivable that a system could be suspended, a faulty cpu 
> replaced with another, and the system resumed. Hotplugging ought to handle 
> that nicely.
>   
I think, in general, changing the hardware configuration of the system 
while it is suspended is not supported.  But perhaps someone who actually 
knows about this PM stuff has a more authoritative view...

    J

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  0:26                           ` Jeremy Fitzhardinge
@ 2006-06-07  0:33                             ` Nigel Cunningham
  2006-06-07  0:56                               ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 47+ messages in thread
From: Nigel Cunningham @ 2006-06-07  0:33 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andrew Morton, Don Zickus, ak, shaohua.li, miles.lane,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1006 bytes --]

Hi.

On Wednesday 07 June 2006 10:26, Jeremy Fitzhardinge wrote:
> Nigel Cunningham wrote:
> > It's probably safer to say "In the suspend/resume case, they may well
> > be." It's not inconceivable that a system could be suspended, a faulty
> > cpu replaced with another, and the system resumed. Hotplugging ought to
> > handle that nicely.
>
> I think, in general, changing the hardware configuration of the system
> while it is suspended is not supported.  But perhaps someone who actually
> knows about this PM stuff has a more authoritative view...

In general, you're right because we don't have perfect hardware hotplugging 
yet. But cpu hotplugging is one area we do have, so it should work. (I ought 
to be one of those people, because I'm the author of the Suspend2 
patches :) ... not that I'm claiming complete knowledge of all things related 
to suspending! )

Regards,

Nigel
-- 
Nigel, Michelle and Alisdair Cunningham
5 Mitchell Street
Cobden 3266
Victoria, Australia

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  0:33                             ` Nigel Cunningham
@ 2006-06-07  0:56                               ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 47+ messages in thread
From: Jeremy Fitzhardinge @ 2006-06-07  0:56 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Andrew Morton, Don Zickus, ak, shaohua.li, miles.lane,
	linux-kernel

Nigel Cunningham wrote:
> In general, you're right because we don't have perfect hardware hotplugging 
> yet. But cpu hotplugging is one area we do have, so it should work.
Well, it seems to me the general problem is generating the proper 
hotplug events.  If you actually pull, say, a USB device, the usb 
subsystem will tell you about it as it happens.  But if you can suspend 
the machine and then arbitrarily rearrange the hardware, then on resume 
you'd have to go over the current hardware state and compare it to the 
pre-suspend state and generate all those events.  Or I guess you could 
just generate unplug events for everything at suspend and re-plug 
anything you find on resume.  Sounds pretty heavyweight though.
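
The two options above can be sketched as follows, assuming device presence
is modelled as a bitmask with one bit per slot. This is a toy illustration
only; struct hotplug_events, diff_presence() and replug_everything() are
hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical event set: one bit per device slot. */
struct hotplug_events {
    uint32_t removed;  /* present before suspend, absent on resume */
    uint32_t added;    /* absent before suspend, present on resume */
};

/* Option 1: compare pre-suspend and post-resume state, emit only the changes. */
static struct hotplug_events diff_presence(uint32_t before, uint32_t after)
{
    struct hotplug_events ev;

    ev.removed = before & ~after;
    ev.added   = after & ~before;
    /* A slot cannot be both removed and added in a single comparison. */
    assert((ev.removed & ev.added) == 0);
    return ev;
}

/* Option 2: unplug everything at suspend, replug whatever is found on resume. */
static struct hotplug_events replug_everything(uint32_t before, uint32_t after)
{
    struct hotplug_events ev = { .removed = before, .added = after };

    return ev;
}
```

Option 2 needs no comparison step but generates events for every device,
including those that never changed, which is the heavyweight part.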

    J


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  0:06                       ` Jeremy Fitzhardinge
  2006-06-07  0:13                         ` Nigel Cunningham
@ 2006-06-08 20:13                         ` Pavel Machek
  1 sibling, 0 replies; 47+ messages in thread
From: Pavel Machek @ 2006-06-08 20:13 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Nigel Cunningham, Andrew Morton, Don Zickus, ak, shaohua.li,
	miles.lane, linux-kernel

Hi!

> But the problem is that there's nothing which keeps 
> track of whether the re-plugged cpus 1-N are the "same" 
> as the unplugged 1-N, and so nothing can apply the same 
> per-cpu settings to them.  In the suspend/resume case 
> they clearly are, but in the general remove/add case, do 
> you really want the new CPU to get the same state as the 
> old one just because it ends up with the same logical 
> CPU number?  Perhaps, but what if it doesn't even have 
> the same capabilities? 

> (Do we support heterogeneous 
> CPUs anyway?)

It works for some people, but it certainly falls into unsupported
category.

-- 
Thanks for all the (sleeping) penguins.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 23:22                   ` Andrew Morton
  2006-06-06 23:27                     ` Jeremy Fitzhardinge
  2006-06-06 23:38                     ` Nigel Cunningham
@ 2006-06-08 12:45                     ` Pavel Machek
  2 siblings, 0 replies; 47+ messages in thread
From: Pavel Machek @ 2006-06-08 12:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Don Zickus, ak, shaohua.li, miles.lane, jeremy, linux-kernel

Hi!

> > Is the typical approach to just hack in an extra parameter to the
> > start/stop functions of the nmi_watchdog letting the function know it is
> > coming through the suspend/resume path? 
> > 
> > Any tips, code, other docs would be helpful.
> > 
> 
> OK...  My understanding of how it works is that the cpu hotplug handlers
> are called early in the suspend process to take the CPUs down.  Once all
> the APs are shut down, CPU0 will then proceed to handle the devices.

Yep.

> All the above applies to suspend-to-disk.  I don't know if suspend-to-RAM
> shuts down the APs.

It applies to suspend-to-ram, too.
							Pavel
-- 
Thanks for all the (sleeping) penguins.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 23:05                 ` Don Zickus
  2006-06-06 23:22                   ` Andrew Morton
@ 2006-06-06 23:34                   ` Andi Kleen
  2006-06-06 23:55                     ` Don Zickus
  1 sibling, 1 reply; 47+ messages in thread
From: Andi Kleen @ 2006-06-06 23:34 UTC (permalink / raw)
  To: Don Zickus; +Cc: Andrew Morton, shaohua.li, miles.lane, jeremy, linux-kernel

On Wednesday 07 June 2006 01:05, Don Zickus wrote:

> Inside those functions I saved the previous state of the watchdog timer.
> However, I learned today that my understanding was incorrect.  Instead
> first the _hotplug_ code is called for every cpu _except_ cpu0.  The
> _suspend/resume_ functions are only called in the context of _cpu0_.
>
> This breaks the design I have because upon resuming the watchdog timers
> automatically start on all cpus (except cpu0 because I saved the previous
> state through the handlers), regardless of what the previous state was.

This means the design was incorrect for CPU hotplug (which needs
to work anyways). suspend is just the most popular user of CPU
hotplug.

> So my question is/was what is the proper way to handle processor level
> subsystems during the suspend/resume path on an SMP system.  I really
> don't understand the hotplug path nor the suspend/resume path very well.

Make it work properly for CPU hotplug for individual CPUs and then in suspend
you take care of "global" state and the last CPU.

> I didn't want to register a hotplug handler because a hotplug event is
> really different than a suspend event (I want to _save_ info during a
> suspend event).  The documentation I was reading seemed to suggest that
> hotplug/suspend/smp was a work-in-progress.

You need to disable the nmi watchdog on CPU hotunplug too,
it's no good to keep the NMI running.


-Andi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 23:34                   ` Andi Kleen
@ 2006-06-06 23:55                     ` Don Zickus
  2006-06-07  0:04                       ` Andi Kleen
  2006-06-07  0:05                       ` Nigel Cunningham
  0 siblings, 2 replies; 47+ messages in thread
From: Don Zickus @ 2006-06-06 23:55 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, shaohua.li, miles.lane, jeremy, linux-kernel

> > So my question is/was what is the proper way to handle processor level
> > subsystems during the suspend/resume path on an SMP system.  I really
> > don't understand the hotplug path nor the suspend/resume path very well.
> 
> Make it work properly for CPU hotplug for individual CPU and then in suspend
> you take care of "global" state and the last CPU.

So the assumption is to treat all the cpus the same, either all on or all off,
with no mixed mode (some cpus on, some cpus off).  I guess I was trying too
hard to work on the per-cpu level.
> 
> > I didn't want to register a hotplug handler because a hotplug event is
> > really different than a suspend event (I want to _save_ info during a
> > suspend event).  The documentation I was reading seemed to suggest that
> > hotplug/suspend/smp was a work-in-progress.
> 
> You need to disable the nmi watchdog on CPU hotunplug too,
> it's no good to keep the NMI running.

Don't you want to make sure those CPUs are actually sleeping?  :^D

-Don
> 
> 
> -Andi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 23:55                     ` Don Zickus
@ 2006-06-07  0:04                       ` Andi Kleen
  2006-06-07  0:05                       ` Nigel Cunningham
  1 sibling, 0 replies; 47+ messages in thread
From: Andi Kleen @ 2006-06-07  0:04 UTC (permalink / raw)
  To: Don Zickus; +Cc: Andrew Morton, shaohua.li, miles.lane, jeremy, linux-kernel

On Wednesday 07 June 2006 01:55, Don Zickus wrote:
> > > So my question is/was what is the proper way to handle processor level
> > > subsystems during the suspend/resume path on an SMP system.  I really
> > > don't understand the hotplug path nor the suspend/resume path very
> > > well.
> >
> > Make it work properly for CPU hotplug for individual CPU and then in
> > suspend you take care of "global" state and the last CPU.
>
> So the assumption is treat all the cpus the same either all on or all off,
> no mixed mode (some cpus on, some cpus off).  I guess I was trying too hard
> to work on the per-cpu level.

No, mixed should work of course.

>
> > > I didn't want to register a hotplug handler because a hotplug event is
> > > really different than a suspend event (I want to _save_ info during a
> > > suspend event).  The documentation I was reading seemed to suggest that
> > > hotplug/suspend/smp was a work-in-progress.
> >
> > You need to disable the nmi watchdog on CPU hotunplug too,
> > it's no good to keep the NMI running.
>
> Don't you want to make sure those CPUs are actually sleeping.  :^D

That is what I meant: they can't sleep if the NMI keeps running.
OK, it would run only for a short time for the local APIC, but longer
for the IO-APIC.

-Andi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 23:55                     ` Don Zickus
  2006-06-07  0:04                       ` Andi Kleen
@ 2006-06-07  0:05                       ` Nigel Cunningham
  2006-06-07  0:42                         ` Don Zickus
  1 sibling, 1 reply; 47+ messages in thread
From: Nigel Cunningham @ 2006-06-07  0:05 UTC (permalink / raw)
  To: Don Zickus
  Cc: Andi Kleen, Andrew Morton, shaohua.li, miles.lane, jeremy,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1015 bytes --]

Hi.

On Wednesday 07 June 2006 09:55, Don Zickus wrote:
> > > So my question is/was what is the proper way to handle processor level
> > > subsystems during the suspend/resume path on an SMP system.  I really
> > > don't understand the hotplug path nor the suspend/resume path very
> > > well.
> >
> > Make it work properly for CPU hotplug for individual CPU and then in
> > suspend you take care of "global" state and the last CPU.
>
> So the assumption is treat all the cpus the same either all on or all off,
> no mixed mode (some cpus on, some cpus off).  I guess I was trying too hard
> to work on the per-cpu level.

This sounds wrong to me. Shouldn't the effect of hotunplugging a cpu be to 
put the driver in a state equivalent to if that cpu simply didn't exist? 
Unplugging shouldn't assume we're going to subsequently have either a driver 
suspend, or a replug.

Regards,

Nigel
-- 
Nigel, Michelle and Alisdair Cunningham
5 Mitchell Street
Cobden 3266
Victoria, Australia

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  0:05                       ` Nigel Cunningham
@ 2006-06-07  0:42                         ` Don Zickus
  2006-06-07  0:50                           ` Nigel Cunningham
                                             ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Don Zickus @ 2006-06-07  0:42 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Andi Kleen, Andrew Morton, shaohua.li, miles.lane, jeremy,
	linux-kernel

On Wed, Jun 07, 2006 at 10:05:07AM +1000, Nigel Cunningham wrote:
> Hi.
> 
> On Wednesday 07 June 2006 09:55, Don Zickus wrote:
> > > > So my question is/was what is the proper way to handle processor level
> > > > subsystems during the suspend/resume path on an SMP system.  I really
> > > > don't understand the hotplug path nor the suspend/resume path very
> > > > well.
> > >
> > > Make it work properly for CPU hotplug for individual CPU and then in
> > > suspend you take care of "global" state and the last CPU.
> >
> > So the assumption is treat all the cpus the same either all on or all off,
> > no mixed mode (some cpus on, some cpus off).  I guess I was trying too hard
> > to work on the per-cpu level.
> 
> This sounds wrong to me. Shouldn't the effect of hotunplugging a cpu be to 
> put the driver in a state equivalent to if that cpu simply didn't exist? 
> Unplugging shouldn't assume we're going to subsequently have either a driver 
> suspend, or a replug.

My biggest problem, or maybe my complete lack of understanding, is
that I don't know how to determine what state I am in during a hotplug
event: either a cpu removal or a suspend.  Therefore I feel like I have to
keep some persistent data around _just_ in case this is a suspend event.
At the opposite end, I also don't know how to separate a cpu insert from a
cpu resume.  The difference is initializing to a global state vs.
initializing to the last known state.

I thought it would make more sense if a few more states were added to the
hotplug event list.  For example, in addition to CPU_ONLINE and CPU_DEAD,
there could also be something like CPU_SUSPEND, CPU_FREEZE, CPU_RESUME,
and CPU_THAW.  

Anyway, I am probably complicating the matter.  I'll whip something up and
post it for review.

Cheers,
Don

> 
> Regards,
> 
> Nigel
> -- 
> Nigel, Michelle and Alisdair Cunningham
> 5 Mitchell Street
> Cobden 3266
> Victoria, Australia

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  0:42                         ` Don Zickus
@ 2006-06-07  0:50                           ` Nigel Cunningham
  2006-06-07  3:29                             ` [linux-pm] " David Brownell
  2006-06-07  9:55                           ` Rafael J. Wysocki
  2006-06-08 20:27                           ` Pavel Machek
  2 siblings, 1 reply; 47+ messages in thread
From: Nigel Cunningham @ 2006-06-07  0:50 UTC (permalink / raw)
  To: Don Zickus, Linux-pm list
  Cc: Andi Kleen, Andrew Morton, shaohua.li, miles.lane, jeremy,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3141 bytes --]

Hi.

On Wednesday 07 June 2006 10:42, Don Zickus wrote:
> On Wed, Jun 07, 2006 at 10:05:07AM +1000, Nigel Cunningham wrote:
> > Hi.
> >
> > On Wednesday 07 June 2006 09:55, Don Zickus wrote:
> > > > > So my question is/was what is the proper way to handle processor
> > > > > level subsystems during the suspend/resume path on an SMP system. 
> > > > > I really don't understand the hotplug path nor the suspend/resume
> > > > > path very well.
> > > >
> > > > Make it work properly for CPU hotplug for individual CPU and then in
> > > > suspend you take care of "global" state and the last CPU.
> > >
> > > So the assumption is treat all the cpus the same either all on or all
> > > off, no mixed mode (some cpus on, some cpus off).  I guess I was trying
> > > too hard to work on the per-cpu level.
> >
> > This sounds wrong to me. Shouldn't the effect of hotunplugging a cpu
> > be to put the driver in a state equivalent to if that cpu simply didn't
> > exist? Unplugging shouldn't assume we're going to subsequently have
> > either a driver suspend, or a replug.
>
> This is my biggest problem or maybe my complete lack of understanding, is
> that I don't know how to determine what state I am in during a hotplug
> event, either a cpu removal or a suspend.  Therefore I feel like I have to
> store some persistent data around _just_ in case this is a suspend event.
> Also at the opposite end, how to separate a cpu insert vs. a cpu resume.
> The difference being initialize to a global state vs. initialize to a last
> known state.
>
> I thought it would make more sense if a few more states were added to the
> hotplug event list.  For example, in addition to CPU_ONLINE and CPU_DEAD,
> there could also be something like CPU_SUSPEND, CPU_FREEZE, CPU_RESUME,
> and CPU_THAW.
>
> Anyway, I am probably complicating the matter.  I'll whip something up and
> post it for review.

Act like...

Unplug: It's going away, never to return.
Plug: It's just appeared from nowhere, is completely uninitialised and may be 
a different item to anything that happened to look the same before.
Suspend: It's going to be put into a low (possibly no-) power state. It's 
going to come back, and when it does, you want to be able to put it back in 
the state it's in prior to this call.
Resume: You want to restore the state you saved in memory when given the 
suspend call earlier.
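
The four cases above can be sketched as a toy state machine in plain C.
This is only an illustration of the intended semantics; struct dev_state,
DEFAULT_CONFIG and the dev_*() functions are hypothetical names, not driver
model interfaces.

```c
#include <assert.h>
#include <stdbool.h>

#define DEFAULT_CONFIG 0  /* hypothetical power-on setting */

struct dev_state {
    bool present;  /* known to the driver at all */
    bool powered;
    int  config;   /* live hardware setting */
    int  saved;    /* copy kept in memory across suspend */
};

/* Plug: appeared from nowhere, so start from defaults and trust nothing old. */
static void dev_plug(struct dev_state *d)
{
    d->present = true;
    d->powered = true;
    d->config  = DEFAULT_CONFIG;
    d->saved   = DEFAULT_CONFIG;
}

/* Unplug: going away for good, so keep no state for a possible return. */
static void dev_unplug(struct dev_state *d)
{
    d->present = false;
    d->powered = false;
}

/* Suspend: it will come back, so remember the current setting. */
static void dev_suspend(struct dev_state *d)
{
    assert(d->present);
    d->saved   = d->config;
    d->powered = false;
}

/* Resume: restore what the matching suspend call saved. */
static void dev_resume(struct dev_state *d)
{
    assert(d->present);
    d->powered = true;
    d->config  = d->saved;
}
```

Only the suspend/resume pair carries state across; an unplug followed by a
plug always lands back on DEFAULT_CONFIG.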

Regarding _FREEZE, there is work in progress to add this. I haven't been 
following the conversation really closely recently, but my understanding is 
that you should expect it to be similar to suspend, except that you can 
guarantee that power will not be lost. All activity should be stopped so that 
you get a consistent state which you can restore in the resume call. Every 
suspend or freeze must be followed by a resume.

I'll add the linux-pm list to the cc, just in case I've gotten something wrong 
or the other guys want to comment and have missed this thread.

Hope this helps.

Regards,

Nigel
-- 
Nigel, Michelle and Alisdair Cunningham
5 Mitchell Street
Cobden 3266
Victoria, Australia

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [linux-pm] [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  0:50                           ` Nigel Cunningham
@ 2006-06-07  3:29                             ` David Brownell
  0 siblings, 0 replies; 47+ messages in thread
From: David Brownell @ 2006-06-07  3:29 UTC (permalink / raw)
  To: linux-pm
  Cc: Nigel Cunningham, Don Zickus, Andrew Morton, jeremy, miles.lane,
	Andi Kleen, linux-kernel

On Tuesday 06 June 2006 5:50 pm, Nigel Cunningham wrote:

> Suspend: It's going to be put into a low (possibly no-) power state. It's 
> going to come back, and when it does, you want to be able to put it back in 
> the state it's in prior to this call.

Not exactly.  Suspended devices can in general resume() into a RESET
state in which case software reinit is appropriate ... or they can come back
in the state that the suspend() left them in, modulo changes that may come
from hot-unplugging hardware connected to that device.  (Which may be a
wakeup event, depending on system configuration.)

CPU suspend might have additional rules (just like for any pm-smart class
of drivers), but those are the generic rules.  Not that I think many
platforms treat CPUs quite the same as other hardware!  :)

I don't think the PM events -- suspend()/resume() -- should ever be
entangled with hotplug events.  The former apply to devices which are
known; the latter are how they become known (or get forgotten).

> Every suspend or freeze must be followed by a resume.

Freeze is an optional nuance; it's basically OK to treat every suspend() as
an "enter low power mode" suspend request, regardless of the event signified
by its parameter.  The canonical/main example of when it might _not_ do that
is avoiding disk drive spindown on freeze during swsusp.

It's currently a bit problematic to handle hot-unplug during suspend(), so
the best advice for now is to make sure that if that's physically possible
(like ejecting a PCMCIA/Cardbus adapter) then driver resume() checks whether
the device is present, just like it checks for power-lost/reset.
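
A minimal sketch of that resume-time check, assuming the driver can query
both "is the device still there?" and "did it lose power or get reset?".
struct hw_status, enum resume_action and driver_resume() are hypothetical
names for illustration, not an existing kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of what the driver can probe at resume time. */
struct hw_status {
    bool present;    /* device still physically there */
    bool was_reset;  /* device lost power or came back in RESET state */
};

enum resume_action {
    RESUME_GONE,     /* unplugged while suspended: tear down */
    RESUME_REINIT,   /* came back reset: reinitialize in software */
    RESUME_CONTINUE  /* came back as suspend() left it */
};

/* Check presence first, then power-lost/reset, then fall through. */
static enum resume_action driver_resume(const struct hw_status *hw)
{
    assert(hw);
    if (!hw->present)
        return RESUME_GONE;
    if (hw->was_reset)
        return RESUME_REINIT;
    return RESUME_CONTINUE;
}
```

Presence is tested before the reset flag because a missing device has no
meaningful reset state to inspect.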

- Dave

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  0:42                         ` Don Zickus
  2006-06-07  0:50                           ` Nigel Cunningham
@ 2006-06-07  9:55                           ` Rafael J. Wysocki
  2006-06-08 20:27                           ` Pavel Machek
  2 siblings, 0 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2006-06-07  9:55 UTC (permalink / raw)
  To: Don Zickus
  Cc: Nigel Cunningham, Andi Kleen, Andrew Morton, shaohua.li,
	miles.lane, jeremy, linux-kernel, Pavel Machek

Hi,

On Wednesday 07 June 2006 02:42, Don Zickus wrote:
> On Wed, Jun 07, 2006 at 10:05:07AM +1000, Nigel Cunningham wrote:
> > On Wednesday 07 June 2006 09:55, Don Zickus wrote:
> > > > > So my question is/was what is the proper way to handle processor level
> > > > > subsystems during the suspend/resume path on an SMP system.  I really
> > > > > don't understand the hotplug path nor the suspend/resume path very
> > > > > well.
> > > >
> > > > Make it work properly for CPU hotplug for individual CPU and then in
> > > > suspend you take care of "global" state and the last CPU.
> > >
> > > So the assumption is treat all the cpus the same either all on or all off,
> > > no mixed mode (some cpus on, some cpus off).  I guess I was trying too hard
> > > to work on the per-cpu level.
> > 
> > This sounds wrong to me. Shouldn't the effect of hotunplugging a cpu be to 
> > put the driver in a state equivalent to if that cpu simply didn't exist? 
> > Unplugging shouldn't assume we're going to subsequently have either a driver 
> > suspend, or a replug.
> 
> This is my biggest problem or maybe my complete lack of understanding, is
> that I don't know how to determine what state I am in during a hotplug
> event, either a cpu removal or a suspend.  Therefore I feel like I have to
> store some persistent data around _just_ in case this is a suspend event.
> Also at the opposite end, how to separate a cpu insert vs. a cpu resume.
> The difference being initialize to a global state vs. initialize to a last
> known state.

The original idea was to treat the nonboot CPUs as though they were removed.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  0:42                         ` Don Zickus
  2006-06-07  0:50                           ` Nigel Cunningham
  2006-06-07  9:55                           ` Rafael J. Wysocki
@ 2006-06-08 20:27                           ` Pavel Machek
  2 siblings, 0 replies; 47+ messages in thread
From: Pavel Machek @ 2006-06-08 20:27 UTC (permalink / raw)
  To: Don Zickus
  Cc: Nigel Cunningham, Andi Kleen, Andrew Morton, shaohua.li,
	miles.lane, jeremy, linux-kernel

Hi!

> > This sounds wrong to me. Shouldn't the effect of hotunplugging a cpu be to 
> > put the driver in a state equivalent to if that cpu simply didn't exist? 
> > Unplugging shouldn't assume we're going to subsequently have either a driver 
> > suspend, or a replug.
> 
> This is my biggest problem or maybe my complete lack of understanding, is
> that I don't know how to determine what state I am in during a hotplug

Basically you can't/shouldn't determine that. 

> I thought it would make more sense if a few more states were added to the
> hotplug event list.  For example, in addition to CPU_ONLINE and CPU_DEAD,
> there could also be something like CPU_SUSPEND, CPU_FREEZE, CPU_RESUME,
> and CPU_THAW.  
> 
> Anyway, I am probably complicating the matter.  I'll whip something up and
> post it for review.

I think you are overcomplicating this. Just forget about
suspend/resume, and reinit cpus from scratch each time. It may
lead to some 'interesting' behaviour if someone tries to suspend
while profiling, but I believe we can live with that.
							Pavel
-- 
Thanks for all the (sleeping) penguins.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06  6:44       ` Shaohua Li
  2006-06-06 14:17         ` Don Zickus
@ 2006-06-06 16:23         ` Jeremy Fitzhardinge
  2006-06-06 16:51           ` Don Zickus
  2006-06-07  2:49           ` Don Zickus
  1 sibling, 2 replies; 47+ messages in thread
From: Jeremy Fitzhardinge @ 2006-06-06 16:23 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Miles Lane, Andrew Morton, linux-kernel, dzickus, ak

Shaohua Li wrote:
> Does below patch help? The nmi suspend/resume doesn't look good to me.
> Only CPU0 uses the suspend/resume code path. Other CPUs run the CPU
> hotplug code path.
>   
Unfortunately this just oopses immediately on the first suspend.  The 
stack trace is unclear (and I'm just going from memory at the moment), 
but it looked like it got an invalid op.  I'll try to get a clearer idea 
of the crash later today.

    J

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 16:23         ` Jeremy Fitzhardinge
@ 2006-06-06 16:51           ` Don Zickus
  2006-06-07  2:49           ` Don Zickus
  1 sibling, 0 replies; 47+ messages in thread
From: Don Zickus @ 2006-06-06 16:51 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Shaohua Li, Miles Lane, Andrew Morton, linux-kernel, ak

On Tue, Jun 06, 2006 at 09:23:59AM -0700, Jeremy Fitzhardinge wrote:
> Shaohua Li wrote:
> >Does below patch help? The nmi suspend/resume doesn't look good to me.
> >Only CPU0 uses the suspend/resume code path. Other CPUs run the CPU
> >hotplug code path.
> >  
> Unfortunately this just oopses immediately on the first suspend.  The 
> stack trace is unclear (and I'm just going from memory at the moment), 
> but it looked like it got an invalid op.  I'll try to get a clearer idea 
> of the crash later today.
> 
>    J

No, this makes sense.  The code right now just blindly tries to disable the
watchdog without checking whether it is already disabled.  The oops
you are seeing is a result of that.  I'll have a patch to fix all that a
little later.
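
To make the failure mode concrete, here is a minimal userspace sketch of an
idempotent start/stop pair, assuming a per-CPU "enabled" flag and a global
active counter.  Every name in it (wd_start, wd_stop, wd_active,
SKETCH_NR_CPUS) is made up for illustration; it is not the kernel code and
only models the bookkeeping, not the MSR programming.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 2		/* assumption: a dual-core box */

/* Hypothetical stand-in for the per-cpu watchdog control block. */
struct wd_state {
	bool enabled;
};

static struct wd_state wd[SKETCH_NR_CPUS];
static int wd_active;			/* models a global active count */

static void wd_start(int cpu)
{
	/* Starting an already running watchdog is a no-op. */
	if (wd[cpu].enabled)
		return;

	wd[cpu].enabled = true;
	wd_active++;
}

static void wd_stop(int cpu)
{
	/* Without this check a second stop would release twice. */
	if (!wd[cpu].enabled)
		return;

	wd[cpu].enabled = false;
	wd_active--;

	/* Plays the role of a BUG_ON: the count must never go negative. */
	assert(wd_active >= 0);
}
```

With the early returns in place, calling wd_stop() twice in a row for the
same cpu leaves the counter untouched the second time instead of tripping
the assertion.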

Cheers,
Don


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-06 16:23         ` Jeremy Fitzhardinge
  2006-06-06 16:51           ` Don Zickus
@ 2006-06-07  2:49           ` Don Zickus
  2006-06-07 16:33             ` Andi Kleen
  2006-06-07 17:07             ` Jeremy Fitzhardinge
  1 sibling, 2 replies; 47+ messages in thread
From: Don Zickus @ 2006-06-07  2:49 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Shaohua Li, Miles Lane, Andrew Morton, linux-kernel, ak

Makes the start/stop paths of nmi watchdog more robust to handle the
suspend/resume cases more gracefully.

Signed-off-by:  Don Zickus <dzickus@redhat.com>

---

On Tue, Jun 06, 2006 at 09:23:59AM -0700, Jeremy Fitzhardinge wrote:
> Shaohua Li wrote:
> >Does below patch help? The nmi suspend/resume doesn't look good to me.
> >Only CPU0 uses the suspend/resume code path. Other CPUs run the CPU
> >hotplug code path.
> >  
> Unfortunately this just oopses immediately on the first suspend.  The 
> stack trace is unclear (and I'm just going from memory at the moment), 
> but it looked like it got an invalid op.  I'll try to get a clearer idea 
> of the crash later today.
> 
>    J

Can you apply this patch on top of Shaohua's?  This should fix all your
suspend problems.  

Inside the patch is a little hack to handle the scenario where, coming out
of resume, we do _not_ want the nmi watchdog enabled (to match the
state on entering suspend).  
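
As a rough illustration of that hack, the decision can be written as a pure
function.  This is only a sketch under stated assumptions: the function name
and its parameters are hypothetical, and "active_count" stands in for the
value the patch reads with atomic_read(&nmi_active).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: should the watchdog be set up on this cpu?
 *
 * cpu             - id of the cpu being brought up
 * already_enabled - this cpu's per-cpu "enabled" flag
 * active_count    - number of cpus with the watchdog currently running
 */
static bool sketch_should_start(int cpu, bool already_enabled,
				int active_count)
{
	assert(cpu >= 0);

	/* Never set up a watchdog that is already running. */
	if (already_enabled)
		return false;

	/* If cpu0 is not active, the other cpus should not be either. */
	if (cpu != 0 && active_count <= 0)
		return false;

	return true;
}
```

The second check is what keeps a secondary cpu from turning the watchdog on
during resume when it was off going into suspend.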

Compiled but not tested, as I don't have easy access to my test machines
right now.  Mainly posted for Andrew to pick up for rc6-mm1.

Cheers,
Don

Index: linux-don/arch/i386/kernel/nmi.c
===================================================================
--- linux-don.orig/arch/i386/kernel/nmi.c
+++ linux-don/arch/i386/kernel/nmi.c
@@ -745,6 +745,7 @@ static void stop_intel_arch_watchdog(voi
 
 void setup_apic_nmi_watchdog (void *unused)
 {
+	struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
 #ifdef CONFIG_LOCKDEP
 	/*
 	 * The NMI watchdog uses spinlocks (notifier chains, etc.),
@@ -761,6 +762,14 @@ void setup_apic_nmi_watchdog (void *unus
 	    (nmi_watchdog != NMI_IO_APIC))
 	    	return;
 
+	if (wd->enabled == 1)
+		return;
+
+	/* cheap hack to support suspend/resume */
+	/* if cpu0 is not active neither should the other cpus */
+	if ((smp_processor_id() != 0) && (atomic_read(&nmi_active) <= 0))
+		return;
+
 	if (nmi_watchdog == NMI_LOCAL_APIC) {
 		switch (boot_cpu_data.x86_vendor) {
 		case X86_VENDOR_AMD:
@@ -798,17 +807,22 @@ void setup_apic_nmi_watchdog (void *unus
 			return;
 		}
 	}
-	__get_cpu_var(nmi_watchdog_ctlblk.enabled) = 1;
+	wd->enabled = 1;
 	atomic_inc(&nmi_active);
 }
 
 void stop_apic_nmi_watchdog(void *unused)
 {
+	struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
+
 	/* only support LOCAL and IO APICs for now */
 	if ((nmi_watchdog != NMI_LOCAL_APIC) &&
 	    (nmi_watchdog != NMI_IO_APIC))
 	    	return;
 
+	if (wd->enabled == 0)
+		return;
+
 	if (nmi_watchdog == NMI_LOCAL_APIC) {
 		switch (boot_cpu_data.x86_vendor) {
 		case X86_VENDOR_AMD:
@@ -836,7 +850,7 @@ void stop_apic_nmi_watchdog(void *unused
 			return;
 		}
 	}
-	__get_cpu_var(nmi_watchdog_ctlblk.enabled) = 0;
+	wd->enabled = 0;
 	atomic_dec(&nmi_active);
 }
 
Index: linux-don/arch/x86_64/kernel/nmi.c
===================================================================
--- linux-don.orig/arch/x86_64/kernel/nmi.c
+++ linux-don/arch/x86_64/kernel/nmi.c
@@ -672,6 +672,7 @@ static void stop_intel_arch_watchdog(voi
 
 void setup_apic_nmi_watchdog(void *unused)
 {
+	struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
 #ifdef CONFIG_LOCKDEP
 	/*
 	 * The NMI watchdog uses spinlocks (notifier chains, etc.),
@@ -688,6 +689,14 @@ void setup_apic_nmi_watchdog(void *unuse
 	    (nmi_watchdog != NMI_IO_APIC))
 	    	return;
 
+	if (wd->enabled == 1)
+		return;
+
+	/* cheap hack to support suspend/resume */
+	/* if cpu0 is not active neither should the other cpus */
+	if ((smp_processor_id() != 0) && (atomic_read(&nmi_active) <= 0))
+		return;
+
 	if (nmi_watchdog == NMI_LOCAL_APIC) {
 		switch (boot_cpu_data.x86_vendor) {
 		case X86_VENDOR_AMD:
@@ -709,17 +718,22 @@ void setup_apic_nmi_watchdog(void *unuse
 			return;
 		}
 	}
-	__get_cpu_var(nmi_watchdog_ctlblk.enabled) = 1;
+	wd->enabled = 1;
 	atomic_inc(&nmi_active);
 }
 
 void stop_apic_nmi_watchdog(void *unused)
 {
+	struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
+
 	/* only support LOCAL and IO APICs for now */
 	if ((nmi_watchdog != NMI_LOCAL_APIC) &&
 	    (nmi_watchdog != NMI_IO_APIC))
 	    	return;
 
+	if (wd->enabled == 0)
+		return;
+
 	if (nmi_watchdog == NMI_LOCAL_APIC) {
 		switch (boot_cpu_data.x86_vendor) {
 		case X86_VENDOR_AMD:
@@ -738,7 +752,7 @@ void stop_apic_nmi_watchdog(void *unused
 			return;
 		}
 	}
-	__get_cpu_var(nmi_watchdog_ctlblk.enabled) = 0;
+	wd->enabled = 0;
 	atomic_dec(&nmi_active);
 }
 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  2:49           ` Don Zickus
@ 2006-06-07 16:33             ` Andi Kleen
  2006-06-07 17:07             ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 47+ messages in thread
From: Andi Kleen @ 2006-06-07 16:33 UTC (permalink / raw)
  To: Don Zickus
  Cc: Jeremy Fitzhardinge, Shaohua Li, Miles Lane, Andrew Morton,
	linux-kernel

On Wednesday 07 June 2006 04:49, Don Zickus wrote:
> Makes the start/stop paths of nmi watchdog more robust to handle the
> suspend/resume cases more gracefully.

Can someone with the problem please confirm the patch actually helps?

Thanks,

-Andi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07  2:49           ` Don Zickus
  2006-06-07 16:33             ` Andi Kleen
@ 2006-06-07 17:07             ` Jeremy Fitzhardinge
  2006-06-07 17:50               ` Don Zickus
  1 sibling, 1 reply; 47+ messages in thread
From: Jeremy Fitzhardinge @ 2006-06-07 17:07 UTC (permalink / raw)
  To: Don Zickus; +Cc: Shaohua Li, Miles Lane, Andrew Morton, linux-kernel, ak

Don Zickus wrote:
> Makes the start/stop paths of nmi watchdog more robust to handle the
> suspend/resume cases more gracefully.
>   
This solves the original symptom, but I'm seeing something else now.  
After the second resume, there's a noticeable pause after it brings cpu 1 
online.  After the third resume it's a longer pause, and after the 4th 
it just hangs there.  The system is up enough to respond to sysreq, but 
nothing in usermode seems to be actually running.  I'll try and get a 
better understanding of what I'm seeing later today.

    J

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07 17:07             ` Jeremy Fitzhardinge
@ 2006-06-07 17:50               ` Don Zickus
  2006-06-07 18:53                 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 47+ messages in thread
From: Don Zickus @ 2006-06-07 17:50 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Shaohua Li, Miles Lane, Andrew Morton, linux-kernel, ak

On Wed, Jun 07, 2006 at 10:07:38AM -0700, Jeremy Fitzhardinge wrote:
> Don Zickus wrote:
> >Makes the start/stop paths of nmi watchdog more robust to handle the
> >suspend/resume cases more gracefully.
> >  
> This solves the original symptom, but I'm seeing something else now.  
> After the second resume, there's a noticeable pause after it brings cpu 1 
> online.  After the third resume it's a longer pause, and after the 4th 
> it just hangs there.  The system is up enough to respond to sysreq, but 
> nothing in usermode seems to be actually running.  I'll try and get a 
> better understanding of what I'm seeing later today.
> 
>    J

Can you do me a quick favor and 'cat /proc/interrupts |grep NMI' before
each of your suspends?  I want to double check a piece of code.  Your
bugzilla postings showed your system starting with no nmi watchdog running
but after the resume the watchdog started running on cpu1.  I am hoping I
fixed that issue too.
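
If it is easier to script the check than to eyeball it, something along
these lines could pull the per-cpu counts out of the NMI line.  This is a
sketch only: the helper name is made up, and it assumes the line has the
form "NMI: <count0> <count1> ..." with any trailing text ignored.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse one line of /proc/interrupts.
 * Returns the number of per-cpu counters stored in counts[],
 * or -1 if the line is not the NMI line.
 */
static int sketch_parse_nmi_line(const char *line, unsigned long *counts,
				 int max_cpus)
{
	const char *p;
	int n = 0;

	assert(line != NULL && counts != NULL);

	/* Skip the leading padding in front of the label. */
	while (*line == ' ' || *line == '\t')
		line++;

	if (strncmp(line, "NMI:", 4) != 0)
		return -1;

	p = line + 4;
	while (n < max_cpus) {
		char *end;
		unsigned long v = strtoul(p, &end, 10);

		/* No digits consumed: end of the counter columns. */
		if (end == p)
			break;

		counts[n++] = v;
		p = end;
	}

	return n;
}
```

Feeding it the NMI line before and after each suspend and comparing the
counters per cpu would show whether the watchdog came back on any of them.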

Thanks.

Cheers,
Don


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174
  2006-06-07 17:50               ` Don Zickus
@ 2006-06-07 18:53                 ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 47+ messages in thread
From: Jeremy Fitzhardinge @ 2006-06-07 18:53 UTC (permalink / raw)
  To: Don Zickus; +Cc: Shaohua Li, Miles Lane, Andrew Morton, linux-kernel, ak

Don Zickus wrote:
> Can you do me a quick favor and 'cat /proc/interrupts |grep NMI' before
> each of your suspends.  I want to double check a piece of code.  Your
> bugzilla postings showed your system starting with no nmi watchdog running
> but after the resume the watchdog started running on cpu1.  I am hoping I
> fixed that issue too.
>   
Yes, there are no NMIs on either CPU after resume.

    J


^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2006-06-08 20:28 UTC | newest]

Thread overview: 47+ messages
2006-06-02 22:51 [2.6.17-rc5-mm2] crash when doing second suspend: BUG in arch/i386/kernel/nmi.c:174 Jeremy Fitzhardinge
2006-06-04 11:47 ` Rafael J. Wysocki
2006-06-05  7:21   ` Jeremy Fitzhardinge
2006-06-05  7:37 ` Jeremy Fitzhardinge
2006-06-05  7:48   ` Andrew Morton
2006-06-05  7:59     ` Jeremy Fitzhardinge
2006-06-05  8:35     ` Miles Lane
2006-06-06  6:44       ` Shaohua Li
2006-06-06 14:17         ` Don Zickus
2006-06-06 14:18           ` Andi Kleen
2006-06-06 21:45             ` Don Zickus
2006-06-06 22:15               ` Andrew Morton
2006-06-06 23:05                 ` Don Zickus
2006-06-06 23:22                   ` Andrew Morton
2006-06-06 23:27                     ` Jeremy Fitzhardinge
2006-06-06 23:32                       ` Andi Kleen
2006-06-06 23:42                       ` Don Zickus
2006-06-08 20:11                         ` Pavel Machek
2006-06-06 23:38                     ` Nigel Cunningham
2006-06-07  0:06                       ` Jeremy Fitzhardinge
2006-06-07  0:13                         ` Nigel Cunningham
2006-06-07  0:24                           ` Andrew Morton
2006-06-07  0:29                             ` Jeremy Fitzhardinge
2006-06-07  0:31                             ` Nigel Cunningham
2006-06-07  0:33                             ` Andi Kleen
2006-06-07  0:40                               ` Nigel Cunningham
2006-06-07  0:26                           ` Jeremy Fitzhardinge
2006-06-07  0:33                             ` Nigel Cunningham
2006-06-07  0:56                               ` Jeremy Fitzhardinge
2006-06-08 20:13                         ` Pavel Machek
2006-06-08 12:45                     ` Pavel Machek
2006-06-06 23:34                   ` Andi Kleen
2006-06-06 23:55                     ` Don Zickus
2006-06-07  0:04                       ` Andi Kleen
2006-06-07  0:05                       ` Nigel Cunningham
2006-06-07  0:42                         ` Don Zickus
2006-06-07  0:50                           ` Nigel Cunningham
2006-06-07  3:29                             ` [linux-pm] " David Brownell
2006-06-07  9:55                           ` Rafael J. Wysocki
2006-06-08 20:27                           ` Pavel Machek
2006-06-06 16:23         ` Jeremy Fitzhardinge
2006-06-06 16:51           ` Don Zickus
2006-06-07  2:49           ` Don Zickus
2006-06-07 16:33             ` Andi Kleen
2006-06-07 17:07             ` Jeremy Fitzhardinge
2006-06-07 17:50               ` Don Zickus
2006-06-07 18:53                 ` Jeremy Fitzhardinge
