From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1760243Ab2CNUTr (ORCPT ); Wed, 14 Mar 2012 16:19:47 -0400
Received: from rinux.net ([85.214.141.182]:50154 "EHLO rinux.net"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S932403Ab2CNUTc (ORCPT ); Wed, 14 Mar 2012 16:19:32 -0400
Message-ID: <1331756331.5315.19.camel@erde.fritz.box>
Subject: RE: Kernel Panic with Rawtherapee (mce related)
From: Adalbert Dawid
To: "Luck, Tony"
Cc: Borislav Petkov, "Srivatsa S. Bhat",
	"linux-kernel@vger.kernel.org", "mingo@elte.hu", "x86@kernel.org"
Date: Wed, 14 Mar 2012 21:18:51 +0100
In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F15B64BF0@ORSMSX103.amr.corp.intel.com>
References: <1331678930.11814.16.camel@erde.fritz.box>
	<4F60B0E3.6020503@linux.vnet.ibm.com>
	<20120314155906.GA25286@aftab>
	<3908561D78D1C84285E8C5FCA982C28F15B64BF0@ORSMSX103.amr.corp.intel.com>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.2.2-1
Content-Transfer-Encoding: 8bit
Mime-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Thank you for the quick reply.

On Wed, 2012-03-14 at 17:51 +0000, Luck, Tony wrote:
> > You're getting a bunch of machine checks, the last one of them being
> > fatal (Process Context Corrupt bit is set) causing the machine to panic.
>
> PCC is set in all of them
>
> > Tony will probably be able to help you further in decoding what exactly
> > those MC0_STATUS and MC5_STATUS values mean
>
> Bank 5 ends in 0400 - which means "Internal timer error". Bank 0 has 0800
> which is a bus/interconnect error where this processor was the source of
> a memory transaction.
>
> That's where the facts end - speculation begins here ...
>
> Since this is repeatable under load - it's possible that a page table got
> corrupted and you are trying to access some non-existent memory location?
> Do all traces for this panic involve *_tlb_* functions?

I don't know, since the screenshot I posted is the only one I have been
able to capture so far. I will try to provoke the crash by putting the
machine under load with rawtherapee and will post the results if I succeed.
Cpuburn did not manage to crash the machine in a (shortish) test I ran a
few days ago.

It would be very helpful to disable the "reboot in 30 seconds" timeout. Is
that possible somehow?

> Or perhaps you have a cooling problem - and when stressed your cpu or
> memory is getting too hot?

I do not believe this is the case, as the cpu fan and the two case fans are
running fine and the sensors show cpu temperatures below 60°C, even under
load.

Up to now, it has always been rawtherapee that crashed the machine. This is
why I suspect that some special cpu feature (an SSE instruction or something
similar) is broken in my cpu and is triggered only by rawtherapee and not by
any other software. What is your opinion on this theory?

> -Tony
>
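
For reference, the decoding Tony describes for the low 16 bits of the
MCi_STATUS values can be sketched roughly as below. This is only a minimal
sketch, assuming the bit layout documented in the Intel SDM (MCA error code
in bits 15:0, PCC in bit 57). The function and table names are made up for
illustration and are not taken from the kernel or from mcelog. Only the two
code patterns mentioned in this thread are handled, and the full status
values from the screenshot are not reproduced here.

```python
# Sketch: decode the MCA error code (bits 15:0) of an IA32_MCi_STATUS value.
# Bit layout assumed from the Intel SDM; all names here are hypothetical.

PCC_BIT = 1 << 57  # Processor Context Corrupt flag in MCi_STATUS

# PP field (bits 10:9) of a bus/interconnect error code
PARTICIPATION = {
    0: "SRC (this processor originated the request)",
    1: "RES (this processor responded to the request)",
    2: "OBS (this processor observed the error as a third party)",
    3: "generic",
}

# II field (bits 3:2) of a bus/interconnect error code
MEM_OR_IO = {0: "memory access", 1: "reserved", 2: "I/O", 3: "other transaction"}


def is_pcc(status):
    """Return True if the Processor Context Corrupt bit is set."""
    return bool(status & PCC_BIT)


def decode_mca_code(status):
    """Describe the low 16 bits of a status value (two patterns only)."""
    code = status & 0xFFFF
    if code == 0x0400:
        return "internal timer error"
    # Bus/interconnect errors match the pattern 0000 1PPT RRRR IILL
    if code & 0xF800 == 0x0800:
        pp = (code >> 9) & 0x3
        timed_out = bool(code & 0x0100)
        ii = (code >> 2) & 0x3
        text = "bus/interconnect error: %s, %s" % (PARTICIPATION[pp], MEM_OR_IO[ii])
        return text + (", request timed out" if timed_out else "")
    return "not handled by this sketch (0x%04x)" % code


if __name__ == "__main__":
    # The two low-word values quoted above for bank 5 and bank 0
    print(decode_mca_code(0x0400))
    print(decode_mca_code(0x0800))
```

With the low words quoted above, 0x0400 maps to the internal timer error and
0x0800 to a bus/interconnect error with this processor as the source of a
memory access, which matches Tony's reading.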