From: Naveen Krishna Chatradhi <nchatrad@amd.com>
To: <linux-edac@vger.kernel.org>, <x86@kernel.org>
Cc: <linux-kernel@vger.kernel.org>, <bp@alien8.de>,
	<mingo@redhat.com>, <mchehab@kernel.org>, <yazen.ghannam@amd.com>,
	Muralidhara M K <muralimk@amd.com>,
	Naveen Krishna Chatradhi <nchatrad@amd.com>
Subject: [PATCH v7 01/12] EDAC/amd64: Document heterogeneous enumeration
Date: Thu, 3 Feb 2022 11:49:31 -0600
Message-ID: <20220203174942.31630-2-nchatrad@amd.com>
In-Reply-To: <20220203174942.31630-1-nchatrad@amd.com>

From: Muralidhara M K <muralimk@amd.com>

Add documentation notes to amd64_edac.h and refer to the driver-api
documentation wherever needed.

Explain how the physical topology is enumerated in software and how the
EDAC module populates the sysfs ABIs.

Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
---
v6->v7:
* New in v7
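
For reviewers, the enumeration order documented in this patch can be
sketched as a small standalone C helper. This is only an illustration
under the assumptions of the example in the amd64_edac.h comment (GPU
nodes numbered directly after the CPU nodes, 2 nodes per Aldebaran GPU).
gpu_node_to_mc() and GPU_NODES_PER_CARD are hypothetical names and are
not part of the driver.

```c
#include <assert.h>

/* Assumption from the documented example: each Aldebaran GPU has 2 nodes. */
#define GPU_NODES_PER_CARD	2

/*
 * Hypothetical helper, not driver code: return the EDAC mc index of a GPU
 * node, assuming GPU nodes are enumerated sequentially after the CPU nodes.
 * @card and @node are zero-based.
 */
static unsigned int gpu_node_to_mc(unsigned int num_cpu_nodes,
				   unsigned int card, unsigned int node)
{
	assert(node < GPU_NODES_PER_CARD);

	return num_cpu_nodes + card * GPU_NODES_PER_CARD + node;
}
```

With one CPU node, gpu_node_to_mc(1, 0, 0) gives 1 (mc1) and
gpu_node_to_mc(1, 3, 1) gives 8 (mc8), which matches the mc listing in
the comment.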

 Documentation/driver-api/edac.rst |   9 +++
 drivers/edac/amd64_edac.h         | 101 ++++++++++++++++++++++++++++++
 2 files changed, 110 insertions(+)

diff --git a/Documentation/driver-api/edac.rst b/Documentation/driver-api/edac.rst
index b8c742aa0a71..0dd07d0d0e47 100644
--- a/Documentation/driver-api/edac.rst
+++ b/Documentation/driver-api/edac.rst
@@ -106,6 +106,15 @@ will occupy those chip-select rows.
 This term is avoided because it is unclear when needing to distinguish
 between chip-select rows and socket sets.
 
+* High Bandwidth Memory (HBM)
+
+HBM is a new type of memory chip with low power consumption and ultra-wide
+communication lanes. It uses vertically stacked memory chips (DRAM dies)
+interconnected by microscopic wires called "through-silicon vias," or TSVs.
+
+Several stacks of HBM chips connect to the CPU or GPU through an ultra-fast
+interconnect called the "interposer". As a result, HBM's characteristics
+are nearly indistinguishable from on-chip integrated RAM.
 
 Memory Controllers
 ------------------
diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
index 6f8147abfa71..6a112270a84b 100644
--- a/drivers/edac/amd64_edac.h
+++ b/drivers/edac/amd64_edac.h
@@ -559,3 +559,104 @@ static inline u32 dct_sel_baseaddr(struct amd64_pvt *pvt)
 	}
 	return (pvt)->dct_sel_lo & 0xFFFFF800;
 }
+
+/*
+ * AMD Heterogeneous system support on EDAC subsystem
+ * --------------------------------------------------
+ *
+ * An AMD heterogeneous system is built by connecting the data fabrics of both
+ * CPUs and GPUs via custom xGMI links, so the Data Fabric on the GPU nodes can
+ * be accessed the same way as the Data Fabric on CPU nodes.
+ *
+ * An Aldebaran GPU has 2 Data Fabrics (DF). Each GPU DF contains four Unified
+ * Memory Controllers (UMC), and each UMC contains eight channels. Each UMC
+ * channel controls one 128-bit HBM2e (2GB) channel (equivalent to 8 X 2GB
+ * ranks per UMC), creating a total of 4096 bits of DRAM data bus per GPU DF.
+ *
+ * While each UMC interfaces a 16GB (8H X 2GB DRAM) HBM stack, each UMC channel
+ * interfaces 2GB of DRAM (represented as a rank).
+ *
+ * Memory controllers on AMD GPU nodes can be represented in EDAC as below:
+ *       GPU DF / GPU Node -> EDAC MC
+ *       GPU UMC           -> EDAC CSROW
+ *       GPU UMC channel   -> EDAC CHANNEL
+ *
+ * E.g., a heterogeneous system with 1 AMD CPU connected to 4 Aldebaran GPUs using xGMI.
+ *
+ * AMD GPU nodes are enumerated in sequential order based on the PCI hierarchy, and the
+ * first GPU node is assumed to take the next "Node ID" value after all CPU nodes are
+ * populated.
+ *
+ * $ ls /sys/devices/system/edac/mc/
+ *	mc0   - CPU MC node 0
+ *	mc1  |
+ *	mc2  |- GPU card[0] => node 0(mc1), node 1(mc2)
+ *	mc3  |
+ *	mc4  |- GPU card[1] => node 0(mc3), node 1(mc4)
+ *	mc5  |
+ *	mc6  |- GPU card[2] => node 0(mc5), node 1(mc6)
+ *	mc7  |
+ *	mc8  |- GPU card[3] => node 0(mc7), node 1(mc8)
+ *
+ * sysfs entries will be populated as below:
+ *
+ *	CPU			# CPU node
+ *	├── mc 0
+ *
+ *	GPU Nodes are enumerated sequentially after CPU nodes are populated
+ *	GPU card 1		# Each Aldebaran GPU has 2 nodes/mcs
+ *	├── mc 1		# GPU node 0 == mc1, Each MC node has 4 UMCs/CSROWs
+ *	│   ├── csrow 0		# UMC 0
+ *	│   │   ├── channel 0	# Each UMC has 8 channels
+ *	│   │   ├── channel 1   # size of each channel is 2 GB, so each UMC has 16 GB
+ *	│   │   ├── channel 2
+ *	│   │   ├── channel 3
+ *	│   │   ├── channel 4
+ *	│   │   ├── channel 5
+ *	│   │   ├── channel 6
+ *	│   │   ├── channel 7
+ *	│   ├── csrow 1		# UMC 1
+ *	│   │   ├── channel 0
+ *	│   │   ├── ..
+ *	│   │   ├── channel 7
+ *	│   ├── ..		..
+ *	│   ├── csrow 3		# UMC 3
+ *	│   │   ├── channel 0
+ *	│   │   ├── ..
+ *	│   │   ├── channel 7
+ *	│   ├── rank 0
+ *	│   ├── ..		..
+ *	│   ├── rank 31		# total 32 ranks/dimms from 4 UMCs
+ *	├
+ *	├── mc 2		# GPU node 1 == mc2
+ *	│   ├── ..		# each GPU node has 64 GB in total
+ *
+ *	GPU card 2
+ *	├── mc 3
+ *	│   ├── ..
+ *	├── mc 4
+ *	│   ├── ..
+ *
+ *	GPU card 3
+ *	├── mc 5
+ *	│   ├── ..
+ *	├── mc 6
+ *	│   ├── ..
+ *
+ *	GPU card 4
+ *	├── mc 7
+ *	│   ├── ..
+ *	├── mc 8
+ *	│   ├── ..
+ *
+ *
+ * Heterogeneous hardware details for the above context are as below:
+ * - The CPU UMC (Unified Memory Controller) is mostly the same as the GPU UMC.
+ *   Both have chip selects (csrows) and channels. However, the layouts differ
+ *   for performance, physical layout, or other reasons.
+ * - CPU UMCs use 1 channel. So we say UMC = EDAC Channel. This follows the
+ *   marketing speak, for example, "CPU has X memory channels".
+ * - CPU UMCs use up to 4 chip selects. So we say UMC chip select = EDAC CSROW.
+ * - GPU UMCs use 1 chip select. So we say UMC = EDAC CSROW.
+ * - GPU UMCs use 8 channels. So we say UMC Channel = EDAC Channel.
+ */
-- 
2.25.1
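
The per-node figures quoted in the amd64_edac.h comment above (ranks, data
bus width, capacity) follow from a few multiplications. The sketch below
reproduces that arithmetic only; the constant and function names are
hypothetical and do not exist in the driver, and the input values are the
ones stated in the comment.

```c
#include <assert.h>

/* Values taken from the amd64_edac.h comment; the names are hypothetical. */
enum {
	GPU_UMCS_PER_NODE	= 4,	/* UMCs (EDAC csrows) per GPU node */
	GPU_CHANNELS_PER_UMC	= 8,	/* channels per UMC */
	GPU_CHANNEL_WIDTH_BITS	= 128,	/* HBM2e channel width */
	GPU_CHANNEL_SIZE_GB	= 2,	/* DRAM behind each UMC channel */
};

/* One rank is reported per UMC channel. */
static unsigned int gpu_node_ranks(void)
{
	return GPU_UMCS_PER_NODE * GPU_CHANNELS_PER_UMC;
}

/* Total DRAM data bus width of one GPU node (one GPU DF). */
static unsigned int gpu_node_bus_bits(void)
{
	return gpu_node_ranks() * GPU_CHANNEL_WIDTH_BITS;
}

/* Capacity behind one UMC, i.e. one HBM stack. */
static unsigned int gpu_umc_size_gb(void)
{
	return GPU_CHANNELS_PER_UMC * GPU_CHANNEL_SIZE_GB;
}

/* Capacity of one GPU node (one EDAC MC). */
static unsigned int gpu_node_size_gb(void)
{
	return GPU_UMCS_PER_NODE * gpu_umc_size_gb();
}

/* Compile-time check of the rank count shown in the sysfs tree. */
static_assert(GPU_UMCS_PER_NODE * GPU_CHANNELS_PER_UMC == 32,
	      "32 ranks per GPU node");
```

Under these values a GPU node has 32 ranks, a 4096-bit data bus, 16 GB per
UMC and 64 GB per node.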

