From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751524AbdG0H7g (ORCPT );
	Thu, 27 Jul 2017 03:59:36 -0400
Received: from mail.skyhub.de ([5.9.137.197]:46686 "EHLO mail.skyhub.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1750765AbdG0H7e (ORCPT );
	Thu, 27 Jul 2017 03:59:34 -0400
Date: Thu, 27 Jul 2017 09:58:45 +0200
From: Borislav Petkov
To: Ingo Molnar
Cc: linux-edac, Steven Rostedt, Tony Luck, Yazen Ghannam, X86 ML, LKML
Subject: Re: [RFC PATCH 0/8] EDAC, mce_amd: Add a tracepoint for the decoded error
Message-ID: <20170727075845.GD4690@nazgul.tnic>
References: <20170725154601.27427-1-bp@alien8.de>
 <20170727071034.epbwwmgnbj6dv4sf@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20170727071034.epbwwmgnbj6dv4sf@gmail.com>
User-Agent: Mutt/1.6.0 (2016-04-01)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jul 27, 2017 at 09:10:34AM +0200, Ingo Molnar wrote:
> Looks pretty nice to me conceptually. Do you have a couple of examples of
> real-life events that get logged? It's hard to decode it from the new tracepoint
> alone.

Here's what comes out in dmesg:

[ 932.370319] mce: [Hardware Error]: Machine check events logged
[ 932.374474] [Hardware Error]: Corrected error, no action required.
[ 932.381684] [Hardware Error]: CPU:1 (0:0:0) MC5_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc00410000020f0f
[ 932.384256] [Hardware Error]: Error Addr: 0x0000000056071033 [Hardware Error]: TSC: 2703436211649
[ 932.386608] [Hardware Error]: MC5 Error: AG payload array parity error.
[ 932.388425] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)

(whoops, that TSC thing should be on a new line).

and the TP dumps only the last two lines:

[ 932.386608] [Hardware Error]: MC5 Error: AG payload array parity error.
[ 932.388425] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)

but come to think of it, it should dump only the MC? Error line because
the last line can be easily deduced from the error code. I'll change
that.

Btw, the reason why I'm dumping only MC? line is to keep the string
going into the TP relatively small. It is 128 bytes now. I tried dumping
the whole decoded string but that easily overflowed 256 bytes and 256
bytes is already a bit too much to log into the trace buffers.

So I'm concentrating only on the not-very-trivial stuff to decode. The
rest is being deduced directly from the MCi_STATUS value anyway which
we can easily do in userspace and that is straightforward. And that u64
value we already dump with trace_mce_record().

So the idea is, userspace opens trace_mce_record() to get the raw MCE
data and then this second TP to get the decoded string of what that
error is.

Later, we could extend that same behavior to Intel for the common
errors, at least, so that we can dump at least *some* string explaining
what the error is.

Anyway, something like that is swirling in my head right now...

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
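
To illustrate the "easily do in userspace" part: below is a minimal sketch of how a userspace tool might rebuild the MC5_STATUS[...] flag list from the raw u64 status value. It is not the kernel's code. The function name decode_status_flags() is hypothetical, and the bit positions are assumptions based on the usual x86 MCA status layout plus the AMD-specific Deferred and ECC bits; check them against the vendor documentation before relying on them.

```python
# Sketch only: rebuild the "Over|CE|MiscV|-|AddrV|CECC" style flag string
# from a raw MCi_STATUS value. Bit positions are assumptions (common x86
# MCA layout, AMD-specific Deferred bit and ECC bits 46:45).
MCI_STATUS_OVER = 1 << 62      # an earlier error was overwritten
MCI_STATUS_UC = 1 << 61        # uncorrected error
MCI_STATUS_MISCV = 1 << 59     # MCi_MISC holds valid data
MCI_STATUS_ADDRV = 1 << 58     # MCi_ADDR holds a valid address
MCI_STATUS_PCC = 1 << 57       # processor context corrupt
MCI_STATUS_DEFERRED = 1 << 44  # deferred error (AMD, assumed)


def decode_status_flags(status: int) -> str:
    """Return the flag list for a raw status value (hypothetical helper)."""
    if status & MCI_STATUS_UC:
        severity = "UE"
    elif status & MCI_STATUS_DEFERRED:
        severity = "-"
    else:
        severity = "CE"

    fields = [
        "Over" if status & MCI_STATUS_OVER else "-",
        severity,
        "MiscV" if status & MCI_STATUS_MISCV else "-",
        "PCC" if status & MCI_STATUS_PCC else "-",
        "AddrV" if status & MCI_STATUS_ADDRV else "-",
    ]

    # Bits 46:45 are read together: 2 means corrected ECC, any other
    # non-zero value is reported as uncorrected ECC.
    ecc = (status >> 45) & 0x3
    if ecc:
        fields.append("CECC" if ecc == 2 else "UECC")

    return "|".join(fields)


if __name__ == "__main__":
    # Status value taken from the dmesg output quoted in the mail.
    print(decode_status_flags(0xDC00410000020F0F))
```

Fed the status value from the dmesg output above, this yields the same Over|CE|MiscV|-|AddrV|CECC list, which is why only the MC? Error string needs to travel through the second tracepoint.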