From mboxrd@z Thu Jan 1 00:00:00 1970 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: base64 Subject: EDAC, ghes: use CPER module handles to locate DIMMs From: wufan Message-Id: <000f01d4406f$6ce8f160$46bad420$@codeaurora.org> Date: Thu, 30 Aug 2018 08:40:40 -0600 To: 'James Morse' Cc: mchehab@kernel.org, bp@alien8.de, baicar.tyler@gmail.com, linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org List-ID: SGkgSmFtZXMsCgo+ID4gVGhlIGN1cnJlbnQgZ2hlc19lZGFjIGRyaXZlciBkb2VzIG5vdCB1cGRh dGUgcGVyLWRpbW0gZXJyb3IgY291bnRlcnMKPiA+IHdoZW4gcmVwb3J0aW5nIG1lbW9yeSBlcnJv cnMsIGJlY2F1c2UgdGhlcmUgaXMgbm8gcGxhdGZvcm0taW5kZXBlbmRlbnQKPiA+IHdheSB0byBm aW5kIERJTU1zIGJhc2VkIG9uIHRoZSBlcnJvciBpbmZvcm1hdGlvbiBwcm92aWRlZCBieSBmaXJt d2FyZS4KPiAKPiBJJ2QgYXJndWUgdGhlcmUgaXM6IGl0cyBpbiB0aGUgQ1BFUiByZWNvcmRzLCB3 ZSBqdXN0IGRpZG4ndCBkbyBhbnl0aGluZyB1c2VmdWwKPiB3aXRoIHRoZSBpbmZvcm1hdGlvbiBp biB0aGUgcGFzdCEKCkFncmVlZC4gV2lsbCB1cGRhdGUgdGhlIHdvcmRpbmcuIAogCj4gPiArc3Rh dGljIGludCBnaGVzX2VkYWNfZGltbV9pbmRleCh1MTYgaGFuZGxlKSB7Cj4gPiArCXN0cnVjdCBt ZW1fY3RsX2luZm8gKm1jaTsKPiA+ICsJaW50IGk7Cj4gPiArCj4gPiArCWlmICghZ2hlc19wdnQp Cj4gPiArCQlyZXR1cm4gLTE7Cj4gCj4gZ2hlc19lZGFjX3JlcG9ydF9tZW1fZXJyb3IoKSBhbHJl YWR5IGNoZWNrZWQgdGhpcywgYXMgaXRzIHRoZSBvbmx5IGNhbGxlcgo+IHRoZXJlIGlzIG5vIG5l ZWQgdG8gY2hlY2sgaXQgYWdhaW4uCgpXaWxsIHJlbW92ZS4KIAo+IAo+ID4gKwltY2kgPSBnaGVz X3B2dC0+bWNpOwo+ID4gKwo+ID4gKwlpZiAoIW1jaSkKPiA+ICsJCXJldHVybiAtMTsKPiAKPiBD YW4gdGhpcyBoYXBwZW4/IGdoZXNfZWRhY19yZXBvcnRfbWVtX2Vycm9yKCkgd291bGQgaGF2ZQo+ IGRlcmVmZXJlbmNlZCB0aGlzIGFscmVhZHkhCj4gCj4gSWYgeW91IG5lZWQgdGhlIHN0cnVjdCBt ZW1fY3RsX2luZm8sIHlvdSBtYXkgYXMgd2VsbCBwYXNzIGl0IGluIGFzIHRoZSBvbmx5Cj4gY2Fs bGVyIGhhcyBpdCB0byBoYW5kLgoKV2lsbCByZW1vdmUuCj4gCj4gPiArCj4gPiArCWZvciAoaSA9 IDA7IGkgPCBtY2ktPnRvdF9kaW1tczsgaSsrKSB7Cj4gPiArCQlpZiAobWNpLT5kaW1tc1tpXS0+ c21iaW9zX2hhbmRsZSA9PSBoYW5kbGUpCj4gPiArCQkJcmV0dXJuIGk7Cj4gPiArCX0KPiA+ICsJ cmV0dXJuIC0xOwo+ID4gK30KPiA+ICsKPiA+ICBzdGF0aWMgdm9pZCBnaGVzX2VkYWNfZG1pZGVj b2RlKGNvbnN0IHN0cnVjdCBkbWlfaGVhZGVyICpkaCwgdm9pZAo+ID4gKmFyZykgIHsKPiA+ICAJ c3RydWN0IGdoZXNfZWRhY19kaW1tX2ZpbGwgKmRpbW1fZmlsbCA9IGFyZzsgQEAgLTE3Nyw2ICsx OTcsOCBAQAo+ID4gc3RhdGljIHZvaWQgZ2hlc19lZGFjX2RtaWRlY29kZShjb25zdCBzdHJ1Y3Qg ZG1pX2hlYWRlciAqZGgsIHZvaWQgKmFyZykKPiA+ICAJCQkJZW50cnktPnRvdGFsX3dpZHRoLCBl bnRyeS0+ZGF0YV93aWR0aCk7Cj4gPiAgCQl9Cj4gPgo+ID4gKwkJZGltbS0+c21iaW9zX2hhbmRs ZSA9IGVudHJ5LT5oYW5kbGU7Cj4gCj4gV2UgYXJlbid0IGNoZWNraW5nIGZvciBkdXBsaWNhdGUg aGFuZGxlcywgKGUuZy4gdGhleSdyZSBhbGwgemVybykuIEkgdGhpbmsgdGhpcyBpcwo+IGZpbmUg YXMgY2hhbmNlcyBhcmUgZmlybXdhcmUgb24gdGhvc2Ugc3lzdGVtcyB3b24ndCBzZXQKPiBDUEVS X01FTV9WQUxJRF9NT0RVTEVfSEFORExFLiBJZiBpdCBkb2VzLCB0aGUgaGFuZGxlIGl0IGdpdmVz IHVzIGlzCj4gYW1iaWd1b3VzLCBhbmQgd2UgcGljayBhIGRpbW0sIGluc3RlYWQgb2Ygd2hpbmUt aW5nIGFib3V0IGJyb2tlbgo+IGZpcm13YXJlIHRhYmxlcy4KPiAKPiAoSSdtIGp1c3QgZHJhd2lu ZyBhdHRlbnRpb24gdG8gaXQgaW4gY2FzZSBzb21lb25lIGRpc2FncmVlcykKClNCTUlPUyB0YWJs ZXMgYXJlIHR5cGljYWxseSBhdXRvbWF0aWNhbGx5IGdlbmVyYXRlZCBzbyBjaGFuY2VzIGZvciBk dXBsaWNhdGUKaGFuZGxlcyBhcmUgc21hbGwuIAoKPiAKPiA+ICAJCWRpbW1fZmlsbC0+Y291bnQr KzsKPiA+ICAJfQo+ID4gIH0KPiA+IEBAIC0zMjcsMTIgKzM0OSwyMCBAQCB2b2lkIGdoZXNfZWRh Y19yZXBvcnRfbWVtX2Vycm9yKGludCBzZXYsCj4gc3RydWN0IGNwZXJfc2VjX21lbV9lcnIgKm1l bV9lcnIpCj4gPiAgCQlwICs9IHNwcmludGYocCwgImJpdF9wb3M6JWQgIiwgbWVtX2Vyci0+Yml0 X3Bvcyk7Cj4gPiAgCWlmIChtZW1fZXJyLT52YWxpZGF0aW9uX2JpdHMgJgo+IENQRVJfTUVNX1ZB TElEX01PRFVMRV9IQU5ETEUpIHsKPiA+ICAJCWNvbnN0IGNoYXIgKmJhbmsgPSBOVUxMLCAqZGV2 aWNlID0gTlVMTDsKPiA+ICsJCWludCBpbmRleCA9IC0xOwo+ID4gKwo+ID4gIAkJZG1pX21lbWRl dl9uYW1lKG1lbV9lcnItPm1lbV9kZXZfaGFuZGxlLCAmYmFuaywKPiAmZGV2aWNlKTsKPiAKPiA+ ICsJCXAgKz0gc3ByaW50ZihwLCAiRElNTSBETUkgaGFuZGxlOiAweCUuNHggIiwKPiA+ICsJCQkg ICAgIG1lbV9lcnItPm1lbV9kZXZfaGFuZGxlKTsKPiA+ICAJCWlmIChiYW5rICE9IE5VTEwgJiYg ZGV2aWNlICE9IE5VTEwpCj4gPiAgCQkJcCArPSBzcHJpbnRmKHAsICJESU1NIGxvY2F0aW9uOiVz ICVzICIsIGJhbmssIGRldmljZSk7Cj4gPiAtCQllbHNlCj4gPiAtCQkJcCArPSBzcHJpbnRmKHAs ICJESU1NIERNSSBoYW5kbGU6IDB4JS40eCAiLAo+ID4gLQkJCQkgICAgIG1lbV9lcnItPm1lbV9k ZXZfaGFuZGxlKTsKPiAKPiBXaHkgZG8gd2Ugbm93IHByaW50IHRoZSBoYW5kbGUgZXZlcnkgdGlt ZT8gVGhlIGhhbmRsZSBpcyBwcmV0dHkKPiBtZWFuaW5nbGVzcywgaXQgY2FuIG9ubHkgYmUgdXNl ZCB0byBmaW5kIHRoZSBsb2NhdGlvbi1zdHJpbmdzLCBpZiB3ZSBnZXQgdGhvc2UKPiB3ZSBwcmlu dCB0aGVtIGluc3RlYWQuCgpGb3IgZ2hlc19lZGFjIHRoZSBiYW5rL2RldmljZSBpcyBpbmZvcm1h dGlvbmFsLCBhbmQgbm90aGluZyB3b3VsZCBnbyB3cm9uZwppZiB0aGUgYmFuay9kZXZpY2UgbnVt YmVycyBhcmUgdGhlIHNhbWUgYXMgYW5vdGhlciBlbnRyeS4gQnV0IHRoZSBoYW5kbGUKaXMgbm93 IGNyaXRpY2FsIGZvciBESU1NIGxvb2t1cCwgdGh1cyBwdWxsIGl0IG91dC4KClRoYW5rcyEKRmFu Cg== From mboxrd@z Thu Jan 1 00:00:00 1970 From: wufan@codeaurora.org (wufan) Date: Thu, 30 Aug 2018 08:40:40 -0600 Subject: [PATCH] EDAC, ghes: use CPER module handles to locate DIMMs In-Reply-To: References: <1535567632-18089-1-git-send-email-wufan@codeaurora.org> Message-ID: <000f01d4406f$6ce8f160$46bad420$@codeaurora.org> To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org Hi James, > > The current ghes_edac driver does not update per-dimm error counters > > when reporting memory errors, because there is no platform-independent > > way to find DIMMs based on the error information provided by firmware. > > I'd argue there is: its in the CPER records, we just didn't do anything useful > with the information in the past! Agreed. Will update the wording. > > +static int ghes_edac_dimm_index(u16 handle) { > > + struct mem_ctl_info *mci; > > + int i; > > + > > + if (!ghes_pvt) > > + return -1; > > ghes_edac_report_mem_error() already checked this, as its the only caller > there is no need to check it again. Will remove. > > > + mci = ghes_pvt->mci; > > + > > + if (!mci) > > + return -1; > > Can this happen? ghes_edac_report_mem_error() would have > dereferenced this already! > > If you need the struct mem_ctl_info, you may as well pass it in as the only > caller has it to hand. Will remove. > > > + > > + for (i = 0; i < mci->tot_dimms; i++) { > > + if (mci->dimms[i]->smbios_handle == handle) > > + return i; > > + } > > + return -1; > > +} > > + > > static void ghes_edac_dmidecode(const struct dmi_header *dh, void > > *arg) { > > struct ghes_edac_dimm_fill *dimm_fill = arg; @@ -177,6 +197,8 @@ > > static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg) > > entry->total_width, entry->data_width); > > } > > > > + dimm->smbios_handle = entry->handle; > > We aren't checking for duplicate handles, (e.g. they're all zero). I think this is > fine as chances are firmware on those systems won't set > CPER_MEM_VALID_MODULE_HANDLE. If it does, the handle it gives us is > ambiguous, and we pick a dimm, instead of whine-ing about broken > firmware tables. > > (I'm just drawing attention to it in case someone disagrees) SBMIOS tables are typically automatically generated so chances for duplicate handles are small. > > > dimm_fill->count++; > > } > > } > > @@ -327,12 +349,20 @@ void ghes_edac_report_mem_error(int sev, > struct cper_sec_mem_err *mem_err) > > p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos); > > if (mem_err->validation_bits & > CPER_MEM_VALID_MODULE_HANDLE) { > > const char *bank = NULL, *device = NULL; > > + int index = -1; > > + > > dmi_memdev_name(mem_err->mem_dev_handle, &bank, > &device); > > > + p += sprintf(p, "DIMM DMI handle: 0x%.4x ", > > + mem_err->mem_dev_handle); > > if (bank != NULL && device != NULL) > > p += sprintf(p, "DIMM location:%s %s ", bank, device); > > - else > > - p += sprintf(p, "DIMM DMI handle: 0x%.4x ", > > - mem_err->mem_dev_handle); > > Why do we now print the handle every time? The handle is pretty > meaningless, it can only be used to find the location-strings, if we get those > we print them instead. For ghes_edac the bank/device is informational, and nothing would go wrong if the bank/device numbers are the same as another entry. But the handle is now critical for DIMM lookup, thus pull it out. Thanks! Fan From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.6 required=3.0 tests=DKIM_SIGNED, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_PASS,T_DKIM_INVALID autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D5945C433F4 for ; Thu, 30 Aug 2018 14:40:45 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 8C9172082C for ; Thu, 30 Aug 2018 14:40:45 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="key not found in DNS" (0-bit key) header.d=codeaurora.org header.i=@codeaurora.org header.b="TK9woC2w"; dkim=fail reason="key not found in DNS" (0-bit key) header.d=codeaurora.org header.i=@codeaurora.org header.b="LzM1rES+" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 8C9172082C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=codeaurora.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729408AbeH3SnL (ORCPT ); Thu, 30 Aug 2018 14:43:11 -0400 Received: from smtp.codeaurora.org ([198.145.29.96]:51080 "EHLO smtp.codeaurora.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728972AbeH3SnL (ORCPT ); Thu, 30 Aug 2018 14:43:11 -0400 Received: by smtp.codeaurora.org (Postfix, from userid 1000) id 6A2D960388; Thu, 30 Aug 2018 14:40:42 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=codeaurora.org; s=default; t=1535640042; bh=gu5qe3RnPUWCsVYyYOgUi09FFa136wIbvQedHfU1rMY=; h=From:To:Cc:References:In-Reply-To:Subject:Date:From; b=TK9woC2w0gn9BNqua/XaNjzW508FPTOb2nflJDAjIGHTYc6m3iJv1TRLQW7oUukPv 88gSPLQj8KZqNH6wc3htaw8asPo+i5kVz8RvGo97TQ2sf/s15cB/qmAK7JyswdYN2I j3ycWN0/gkeVTKxlwjCgd1EuHcaflcsriXxxM87Y= Received: from WUFANW10 (c-71-205-14-210.hsd1.co.comcast.net [71.205.14.210]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) (Authenticated sender: wufan@smtp.codeaurora.org) by smtp.codeaurora.org (Postfix) with ESMTPSA id EF024606DB; Thu, 30 Aug 2018 14:40:40 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=codeaurora.org; s=default; t=1535640041; bh=gu5qe3RnPUWCsVYyYOgUi09FFa136wIbvQedHfU1rMY=; h=From:To:Cc:References:In-Reply-To:Subject:Date:From; b=LzM1rES+j1qDGCcZHAotIseDRNPZ4jBC6F2Lyvti42xEWvF6E4eKIw1kHMoWcEmIM Go9RsoizKlL7mub9gA/ngn/gVNSyNmW8SiNKIHXKSsEBLg2Cwn5d5V17PpUN7r3Bte huqla3QX8VMGUXa8VelkDYdnTt4VJpHdHtF41qgI= DMARC-Filter: OpenDMARC Filter v1.3.2 smtp.codeaurora.org EF024606DB Authentication-Results: pdx-caf-mail.web.codeaurora.org; dmarc=none (p=none dis=none) header.from=codeaurora.org Authentication-Results: pdx-caf-mail.web.codeaurora.org; spf=none smtp.mailfrom=wufan@codeaurora.org From: "wufan" To: "'James Morse'" Cc: , , , , , References: <1535567632-18089-1-git-send-email-wufan@codeaurora.org> In-Reply-To: Subject: RE: [PATCH] EDAC, ghes: use CPER module handles to locate DIMMs Date: Thu, 30 Aug 2018 08:40:40 -0600 Message-ID: <000f01d4406f$6ce8f160$46bad420$@codeaurora.org> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-Mailer: Microsoft Outlook 16.0 Thread-Index: AQHcLsFbwflxxBzVeG17JegIRw3gmAF/6qNxpLx8bJA= Content-Language: en-us Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi James, > > The current ghes_edac driver does not update per-dimm error counters > > when reporting memory errors, because there is no = platform-independent > > way to find DIMMs based on the error information provided by = firmware. >=20 > I'd argue there is: its in the CPER records, we just didn't do = anything useful > with the information in the past! Agreed. Will update the wording.=20 =20 > > +static int ghes_edac_dimm_index(u16 handle) { > > + struct mem_ctl_info *mci; > > + int i; > > + > > + if (!ghes_pvt) > > + return -1; >=20 > ghes_edac_report_mem_error() already checked this, as its the only = caller > there is no need to check it again. Will remove. =20 >=20 > > + mci =3D ghes_pvt->mci; > > + > > + if (!mci) > > + return -1; >=20 > Can this happen? ghes_edac_report_mem_error() would have > dereferenced this already! >=20 > If you need the struct mem_ctl_info, you may as well pass it in as the = only > caller has it to hand. Will remove. >=20 > > + > > + for (i =3D 0; i < mci->tot_dimms; i++) { > > + if (mci->dimms[i]->smbios_handle =3D=3D handle) > > + return i; > > + } > > + return -1; > > +} > > + > > static void ghes_edac_dmidecode(const struct dmi_header *dh, void > > *arg) { > > struct ghes_edac_dimm_fill *dimm_fill =3D arg; @@ -177,6 +197,8 @@ > > static void ghes_edac_dmidecode(const struct dmi_header *dh, void = *arg) > > entry->total_width, entry->data_width); > > } > > > > + dimm->smbios_handle =3D entry->handle; >=20 > We aren't checking for duplicate handles, (e.g. they're all zero). I = think this is > fine as chances are firmware on those systems won't set > CPER_MEM_VALID_MODULE_HANDLE. If it does, the handle it gives us is > ambiguous, and we pick a dimm, instead of whine-ing about broken > firmware tables. >=20 > (I'm just drawing attention to it in case someone disagrees) SBMIOS tables are typically automatically generated so chances for = duplicate handles are small.=20 >=20 > > dimm_fill->count++; > > } > > } > > @@ -327,12 +349,20 @@ void ghes_edac_report_mem_error(int sev, > struct cper_sec_mem_err *mem_err) > > p +=3D sprintf(p, "bit_pos:%d ", mem_err->bit_pos); > > if (mem_err->validation_bits & > CPER_MEM_VALID_MODULE_HANDLE) { > > const char *bank =3D NULL, *device =3D NULL; > > + int index =3D -1; > > + > > dmi_memdev_name(mem_err->mem_dev_handle, &bank, > &device); >=20 > > + p +=3D sprintf(p, "DIMM DMI handle: 0x%.4x ", > > + mem_err->mem_dev_handle); > > if (bank !=3D NULL && device !=3D NULL) > > p +=3D sprintf(p, "DIMM location:%s %s ", bank, device); > > - else > > - p +=3D sprintf(p, "DIMM DMI handle: 0x%.4x ", > > - mem_err->mem_dev_handle); >=20 > Why do we now print the handle every time? The handle is pretty > meaningless, it can only be used to find the location-strings, if we = get those > we print them instead. For ghes_edac the bank/device is informational, and nothing would go = wrong if the bank/device numbers are the same as another entry. But the handle is now critical for DIMM lookup, thus pull it out. Thanks! Fan