All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Qiuxu Zhuo <qiuxu.zhuo@intel.com>,
	Tony Luck <tony.luck@intel.com>,
	linux-edac <linux-edac@vger.kernel.org>,
	Borislav Petkov <bp@suse.de>
Subject: [4.13,43/44] EDAC, sb_edac: Dont create a second memory controller if HA1 is not present
Date: Thu, 16 Nov 2017 18:43:07 +0100	[thread overview]
Message-ID: <20171116172825.483949271@linuxfoundation.org> (raw)

4.13-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Qiuxu Zhuo <qiuxu.zhuo@intel.com>

commit 15cc3ae001873845b5d842e212478a6570c7d938 upstream.

Yi Zhang reported the following failure on a 2-socket Haswell (E5-2603v3)
server (DELL PowerEdge 730xd):

  EDAC sbridge: Some needed devices are missing
  EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
  EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
  EDAC sbridge: Couldn't find mci handler
  EDAC sbridge: Couldn't find mci handler
  EDAC sbridge: Failed to register device with error -19.

The refactored sb_edac driver creates the IMC1 (the 2nd memory
controller) if any IMC1 device is present. In this case only
HA1_TA of IMC1 was present, but the driver expected to find
HA1/HA1_TM/HA1_TAD[0-3] devices too, leading to the above failure.

The document [1] says the 'E5-2603 v3' CPU has 4 memory channels max. Yi
Zhang inserted one DIMM per channel for each CPU, and did random error
address injection test with this patch:

      4024  addresses fell in TOLM hole area
     12715  addresses fell in CPU_SrcID#0_Ha#0_Chan#0_DIMM#0
     12774  addresses fell in CPU_SrcID#0_Ha#0_Chan#1_DIMM#0
     12798  addresses fell in CPU_SrcID#0_Ha#0_Chan#2_DIMM#0
     12913  addresses fell in CPU_SrcID#0_Ha#0_Chan#3_DIMM#0
     12674  addresses fell in CPU_SrcID#1_Ha#0_Chan#0_DIMM#0
     12686  addresses fell in CPU_SrcID#1_Ha#0_Chan#1_DIMM#0
     12882  addresses fell in CPU_SrcID#1_Ha#0_Chan#2_DIMM#0
     12934  addresses fell in CPU_SrcID#1_Ha#0_Chan#3_DIMM#0
    106400  addresses were injected totally.

The test result shows that all the 4 channels belong to IMC0 per CPU, so
the server really only has one IMC per CPU.

In the 1st page of chapter 2 in datasheet [2], it also says 'E5-2600 v3'
implements either one or two IMCs. For CPUs with one IMC, IMC1 is not
used and should be ignored.

Thus, do not create a second memory controller if the key HA1 is absent.

[1] http://ark.intel.com/products/83349/Intel-Xeon-Processor-E5-2603-v3-15M-Cache-1_60-GHz
[2] https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdf

Reported-and-tested-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170913104214.7325-1-qiuxu.zhuo@intel.com
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/edac/sb_edac.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)



--
To unsubscribe from this list: send the line "unsubscribe linux-edac" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -455,6 +455,7 @@ static const struct pci_id_table pci_dev
 static const struct pci_id_descr pci_dev_descr_ibridge[] = {
 		/* Processor Home Agent */
 	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA0,        0, IMC0) },
+	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA1,        1, IMC1) },
 
 		/* Memory controller */
 	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA0_TA,     0, IMC0) },
@@ -465,7 +466,6 @@ static const struct pci_id_descr pci_dev
 	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA0_TAD3,   0, IMC0) },
 
 		/* Optional, mode 2HA */
-	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA1,        1, IMC1) },
 	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA1_TA,     1, IMC1) },
 	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA1_RAS,    1, IMC1) },
 	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA1_TAD0,   1, IMC1) },
@@ -2260,6 +2260,13 @@ static int sbridge_get_onedevice(struct
 next_imc:
 	sbridge_dev = get_sbridge_dev(bus, dev_descr->dom, multi_bus, sbridge_dev);
 	if (!sbridge_dev) {
+		/* If the HA1 wasn't found, don't create EDAC second memory controller */
+		if (dev_descr->dom == IMC1 && devno != 1) {
+			edac_dbg(0, "Skip IMC1: %04x:%04x (since HA1 was absent)\n",
+				 PCI_VENDOR_ID_INTEL, dev_descr->dev_id);
+			pci_dev_put(pdev);
+			return 0;
+		}
 
 		if (dev_descr->dom == SOCK)
 			goto out_imc;

WARNING: multiple messages have this Message-ID (diff)
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Qiuxu Zhuo <qiuxu.zhuo@intel.com>,
	Tony Luck <tony.luck@intel.com>,
	linux-edac <linux-edac@vger.kernel.org>,
	Borislav Petkov <bp@suse.de>
Subject: [PATCH 4.13 43/44] EDAC, sb_edac: Dont create a second memory controller if HA1 is not present
Date: Thu, 16 Nov 2017 18:43:07 +0100	[thread overview]
Message-ID: <20171116172825.483949271@linuxfoundation.org> (raw)
In-Reply-To: <20171116172823.336649076@linuxfoundation.org>

4.13-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Qiuxu Zhuo <qiuxu.zhuo@intel.com>

commit 15cc3ae001873845b5d842e212478a6570c7d938 upstream.

Yi Zhang reported the following failure on a 2-socket Haswell (E5-2603v3)
server (DELL PowerEdge 730xd):

  EDAC sbridge: Some needed devices are missing
  EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
  EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
  EDAC sbridge: Couldn't find mci handler
  EDAC sbridge: Couldn't find mci handler
  EDAC sbridge: Failed to register device with error -19.

The refactored sb_edac driver creates the IMC1 (the 2nd memory
controller) if any IMC1 device is present. In this case only
HA1_TA of IMC1 was present, but the driver expected to find
HA1/HA1_TM/HA1_TAD[0-3] devices too, leading to the above failure.

The document [1] says the 'E5-2603 v3' CPU has 4 memory channels max. Yi
Zhang inserted one DIMM per channel for each CPU, and did random error
address injection test with this patch:

      4024  addresses fell in TOLM hole area
     12715  addresses fell in CPU_SrcID#0_Ha#0_Chan#0_DIMM#0
     12774  addresses fell in CPU_SrcID#0_Ha#0_Chan#1_DIMM#0
     12798  addresses fell in CPU_SrcID#0_Ha#0_Chan#2_DIMM#0
     12913  addresses fell in CPU_SrcID#0_Ha#0_Chan#3_DIMM#0
     12674  addresses fell in CPU_SrcID#1_Ha#0_Chan#0_DIMM#0
     12686  addresses fell in CPU_SrcID#1_Ha#0_Chan#1_DIMM#0
     12882  addresses fell in CPU_SrcID#1_Ha#0_Chan#2_DIMM#0
     12934  addresses fell in CPU_SrcID#1_Ha#0_Chan#3_DIMM#0
    106400  addresses were injected totally.

The test result shows that all the 4 channels belong to IMC0 per CPU, so
the server really only has one IMC per CPU.

In the 1st page of chapter 2 in datasheet [2], it also says 'E5-2600 v3'
implements either one or two IMCs. For CPUs with one IMC, IMC1 is not
used and should be ignored.

Thus, do not create a second memory controller if the key HA1 is absent.

[1] http://ark.intel.com/products/83349/Intel-Xeon-Processor-E5-2603-v3-15M-Cache-1_60-GHz
[2] https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdf

Reported-and-tested-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170913104214.7325-1-qiuxu.zhuo@intel.com
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 drivers/edac/sb_edac.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -455,6 +455,7 @@ static const struct pci_id_table pci_dev
 static const struct pci_id_descr pci_dev_descr_ibridge[] = {
 		/* Processor Home Agent */
 	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA0,        0, IMC0) },
+	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA1,        1, IMC1) },
 
 		/* Memory controller */
 	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA0_TA,     0, IMC0) },
@@ -465,7 +466,6 @@ static const struct pci_id_descr pci_dev
 	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA0_TAD3,   0, IMC0) },
 
 		/* Optional, mode 2HA */
-	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA1,        1, IMC1) },
 	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA1_TA,     1, IMC1) },
 	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA1_RAS,    1, IMC1) },
 	{ PCI_DESCR(PCI_DEVICE_ID_INTEL_IBRIDGE_IMC_HA1_TAD0,   1, IMC1) },
@@ -2260,6 +2260,13 @@ static int sbridge_get_onedevice(struct
 next_imc:
 	sbridge_dev = get_sbridge_dev(bus, dev_descr->dom, multi_bus, sbridge_dev);
 	if (!sbridge_dev) {
+		/* If the HA1 wasn't found, don't create EDAC second memory controller */
+		if (dev_descr->dom == IMC1 && devno != 1) {
+			edac_dbg(0, "Skip IMC1: %04x:%04x (since HA1 was absent)\n",
+				 PCI_VENDOR_ID_INTEL, dev_descr->dev_id);
+			pci_dev_put(pdev);
+			return 0;
+		}
 
 		if (dev_descr->dom == SOCK)
 			goto out_imc;

             reply	other threads:[~2017-11-16 17:43 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-11-16 17:43 Greg Kroah-Hartman [this message]
2017-11-16 17:43 ` [PATCH 4.13 43/44] EDAC, sb_edac: Dont create a second memory controller if HA1 is not present Greg Kroah-Hartman
  -- strict thread matches above, loose matches on Subject: below --
2017-11-16 17:42 [PATCH 4.13 00/44] 4.13.14-stable review Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 01/44] ppp: fix race in ppp device destruction Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 02/44] gso: fix payload length when gso_size is zero Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 03/44] ipv4: Fix traffic triggered IPsec connections Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 04/44] ipv6: " Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 05/44] netlink: do not set cb_running if dumps start() errs Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 06/44] net: call cgroup_sk_alloc() earlier in sk_clone_lock() Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 07/44] macsec: fix memory leaks when skb_to_sgvec fails Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 08/44] l2tp: check ps->sock before running pppol2tp_session_ioctl() Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 09/44] tun: call dev_get_valid_name() before register_netdevice() Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 10/44] netlink: fix netlink_ack() extack race Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 11/44] sctp: add the missing sock_owned_by_user check in sctp_icmp_redirect Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 12/44] tcp/dccp: fix ireq->opt races Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 13/44] packet: avoid panic in packet_getsockopt() Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 14/44] geneve: Fix function matching VNI and tunnel ID on big-endian Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 15/44] net: bridge: fix returning of vlan range op errors Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 16/44] soreuseport: fix initialization race Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 17/44] ipv6: flowlabel: do not leave opt->tot_len with garbage Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 18/44] sctp: full support for ipv6 ip_nonlocal_bind & IP_FREEBIND Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 19/44] tcp/dccp: fix lockdep splat in inet_csk_route_req() Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 21/44] net: dsa: check master device before put Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 22/44] net/unix: dont show information about sockets from other namespaces Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 23/44] tap: double-free in error path in tap_open() Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 24/44] net/mlx5: Fix health work queue spin lock to IRQ safe Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 25/44] net/mlx5e: Properly deal with encap flows add/del under neigh update Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 26/44] ipip: only increase err_count for some certain type icmp in ipip_err Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 27/44] ip6_gre: only increase err_count for some certain type icmpv6 in ip6gre_err Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 28/44] ip6_gre: update dst pmtu if dev mtu has been updated by toobig in __gre6_xmit Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 29/44] tcp: refresh tp timestamp before tcp_mtu_probe() Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 30/44] tap: reference to KVA of an unloaded module causes kernel panic Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 31/44] tun: allow positive return values on dev_get_valid_name() call Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 32/44] sctp: reset owner sk for data chunks on out queues when migrating a sock Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 33/44] net_sched: avoid matching qdisc with zero handle Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 34/44] l2tp: hold tunnel in pppol2tp_connect() Greg Kroah-Hartman
2017-11-16 17:42 ` [PATCH 4.13 35/44] tun/tap: sanitize TUNSETSNDBUF input Greg Kroah-Hartman
2017-11-16 17:43 ` [PATCH 4.13 36/44] ipv6: addrconf: increment ifp refcount before ipv6_del_addr() Greg Kroah-Hartman
2017-11-16 17:43 ` [PATCH 4.13 37/44] tcp: fix tcp_mtu_probe() vs highest_sack Greg Kroah-Hartman
2017-11-16 17:43 ` [PATCH 4.13 38/44] mac80211: accept key reinstall without changing anything Greg Kroah-Hartman
2017-11-16 17:43 ` [PATCH 4.13 39/44] mac80211: use constant time comparison with keys Greg Kroah-Hartman
2017-11-16 17:43 ` [PATCH 4.13 40/44] mac80211: dont compare TKIP TX MIC key in reinstall prevention Greg Kroah-Hartman
2017-11-16 17:43 ` [PATCH 4.13 41/44] usb: usbtest: fix NULL pointer dereference Greg Kroah-Hartman
2017-11-16 17:43 ` [PATCH 4.13 42/44] Input: ims-psu - check if CDC union descriptor is sane Greg Kroah-Hartman
2017-11-16 17:43 ` [PATCH 4.13 44/44] dmaengine: dmatest: warn user when dma test times out Greg Kroah-Hartman
2017-11-16 22:45 ` [PATCH 4.13 00/44] 4.13.14-stable review Shuah Khan
2017-11-17  8:10   ` Greg Kroah-Hartman
2017-11-17  2:03 ` Guenter Roeck
2017-11-17  8:11   ` Greg Kroah-Hartman
     [not found] ` <5a0e2714.c5badf0a.3983c.7eaa@mx.google.com>
2017-11-17  8:14   ` Greg Kroah-Hartman
2017-11-18  0:41     ` Kevin Hilman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171116172825.483949271@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=bp@suse.de \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=qiuxu.zhuo@intel.com \
    --cc=stable@vger.kernel.org \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.