From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1758247AbYGFOut (ORCPT ); Sun, 6 Jul 2008 10:50:49 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
        id S1756921AbYGFOul (ORCPT ); Sun, 6 Jul 2008 10:50:41 -0400
Received: from gw.goop.org ([64.81.55.164]:52085 "EHLO mail.goop.org"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1756900AbYGFOuk (ORCPT ); Sun, 6 Jul 2008 10:50:40 -0400
Message-ID: <4870DBB6.30107@goop.org>
Date: Sun, 06 Jul 2008 07:50:30 -0700
From: Jeremy Fitzhardinge
User-Agent: Thunderbird 2.0.0.14 (X11/20080501)
MIME-Version: 1.0
To: Jens Axboe
CC: Ingo Molnar , Linux Kernel Mailing List
Subject: Generic callfunction IPI problems
X-Enigmail-Version: 0.95.6
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Jens,

I'm seeing these oopses when running under Xen:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [] generic_smp_call_function_interrupt+0xfb/0x118
PGD 0
Oops: 0000 [1] SMP
CPU 15
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.26-rc8-tip #306
RIP: e030:[] [] generic_smp_call_function_interrupt+0xfb/0x118
RSP: e02b:ffff88007f653e98 EFLAGS: 00010046
RAX: ffffffff815fe6e0 RBX: ffff88007e523cc8 RCX: 0000000000000001
RDX: ffffc10000200200 RSI: 0000000000000001 RDI: ffffffff81693240
RBP: ffff88007f653eb8 R08: ffff88007f653ec8 R09: 0002db11ddd83820
R10: ffff880000000001 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 000000000000000f R15: 0000000000000040
FS: 00007f1dadd907a0(0000) GS:ffff88007ff30080(0000) knlGS:0000000000000000
CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001001000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
Process swapper (pid: 0, threadinfo ffff88007ff90000, task ffff88007ff57100)
Stack: ffff88007ff2f4c0 0000000000000000 0000000000000000 000000000000004d
 ffff88007f653ec8 ffffffff8100dea7 ffff88007f653ef8 ffffffff810747c5
 ffffffff816959c0 000000000000004d ffff88007ff2f4c0 ffffffff81695a10
Call Trace:
 [] xen_call_function_interrupt+0xe/0x167
 [] handle_IRQ_event+0x2e/0x65
 [] handle_level_irq+0xb5/0x116
 [] do_IRQ+0xf7/0x177
 [] xen_evtchn_do_upcall+0xb3/0x136
 [] xen_do_hypervisor_callback+0x1e/0x30
 [] ? _stext+0x3aa/0x1000
 [] ? _stext+0x3aa/0x1000
 [] ? xen_safe_halt+0x10/0x1a
 [] ? xen_idle+0x46/0x5c
 [] ? cpu_idle+0xca/0x101
 [] ? cpu_bringup_and_idle+0x8a/0x8f
Code: e8 fc 96 fc ff 90 41 f6 44 24 20 01 74 08 41 83 64 24 20 fe eb 11 49 8d 7c 24 38 48 c7 c6 90 dd 05 81 e8 6f 92 01 00 4d 8b 24 24 <49> 8b 04 24 49 81 fc e0 e6 5f 81 0f 18 08 0f 85 11 ff ff ff 5b
RIP [] generic_smp_call_function_interrupt+0xfb/0x118
 RSP
CR2: 0000000000000000
Kernel panic - not syncing: Fatal exception in interrupt

They're pretty rare - this system did a kernbench run on a 16 vcpu system with no problems, then oopsed this way when I left it idle overnight.

One interesting data point is that I've been experimenting with more virtualization-friendly spinlock algorithms. If I replace ticket locks with the old lock-byte algorithm, I see this much more frequently (and a spin-and-block algorithm generally doesn't get through boot). I wonder if there's a race which is masked by ticket locks' strict FIFO algorithm? (But this particular oops was with completely standard ticketlocks in place.)

I've been running your old generic IPI patches for a while with no problems; this seems to be specific to the version in tip.git. I haven't looked to see what differences there are yet. I've also only observed problems under Xen, but I haven't done much testing on real hardware.

Thanks,
    J