Re: [PATCH] irqchip/gicv3: Workaround for NVIDIA erratum T241-FABRIC-4

From: Marc Zyngier <maz@kernel.org>
To: Shanker Donthineni <sdonthineni@nvidia.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>, Jonathan Corbet <corbet@lwn.net>,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, Vikram Sethi <vsethi@nvidia.com>,
	Thierry Reding <treding@nvidia.com>
Subject: Re: [PATCH] irqchip/gicv3: Workaround for NVIDIA erratum T241-FABRIC-4
Date: Sun, 12 Mar 2023 13:43:01 +0000	[thread overview]
Message-ID: <871qlu5bmi.wl-maz@kernel.org> (raw)
In-Reply-To: <ff5306c3-c72e-87a3-f21f-13cdf11e1df5@nvidia.com>

On Fri, 10 Mar 2023 14:16:34 +0000,
Shanker Donthineni <sdonthineni@nvidia.com> wrote:
> 
> Hi Marc,
> 
> On 3/7/23 02:32, Marc Zyngier wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Mon, 06 Mar 2023 01:31:48 +0000,
> > Shanker Donthineni <sdonthineni@nvidia.com> wrote:
> >> 
> >> The purpose of this patch is to address the T241 erratum T241-FABRIC-4,
> >> which causes unexpected behavior in the GIC when multiple transactions
> > 
> > nit: "The purpose of this patch" is superfluous. Instead, write
> > something like:
> > 
> > "The T241 platform suffers from the T241-FABRIC-4 erratum which
> > causes..."
> > 
> I'll fix in v2 patch.
> 
> >> are received simultaneously from different sources. This hardware issue
> >> impacts NVIDIA server platforms that use more than two T241 chips
> >> interconnected. Each chip has support for 320 {E}SPIs.
> >> 
> >> This issue occurs when multiple packets from different GICs are
> >> incorrectly interleaved at the target chip. The erratum text below
> >> specifies exactly what can cause multiple transfer packets susceptible
> >> to interleaving and GIC state corruption. GIC state corruption can
> >> lead to a range of problems, including kernel panics, and unexpected
> >> behavior.
> >> 
> >>  From the erratum text:
> >>    "In some cases, inter-socket AXI4 Stream packets with multiple
> >>    transfers, may be interleaved by the fabric when presented to ARM
> >>    Generic Interrupt Controller. GIC expects all transfers of a packet
> >>    to be delivered without any interleaving.
> >> 
> >>    The following GICv3 commands may result in multiple transfer packets
> >>    over inter-socket AXI4 Stream interface:
> >>     - Register reads from GICD_I* and GICD_N*
> >>     - Register writes to 64-bit GICD registers other than GICD_IROUTERn*
> >>     - ITS command MOVALL
> > 
> > Does is also affect cross-chip traffic such as SPI deactivation?
> > 
> No, it is not impacted.
> 
> >> 
> >>    Multiple commands in GICv4+ utilize multiple transfer packets,
> >>    including VMOVP, VMOVI and VMAPP.
> >> 
> >>    This issue impacts system configurations with more than 2 sockets,
> >>    that require multi-transfer packets to be sent over inter-socket
> >>    AXI4 Stream interface between GIC instances on different sockets.
> >>    GICv4 cannot be supported. GICv3 SW model can only be supported
> >>    with the workaround. Single and Dual socket configurations are not
> >>    impacted by this issue and support GICv3 and GICv4."
> > 
> > Do you have a public link to this erratum? This is really something
> > that we should be go back to when changing things in the GIC code
> > (should we ever use MOVALL, for example).
> > 
> https://developer.nvidia.com/docs/t241-fabric-4/nvidia-t241-fabric-4-errata.pdf

Great. Please add this to the commit message and a comment next to the
workaround code.

[...]

> >> +static inline void __iomem *gic_dist_base_read_alias(irq_hw_number_t intid)
> >> +{
> >> +     struct dist_base_alias *base_alias;
> >> +     int i;
> >> +
> >> +     if (static_branch_unlikely(&gic_nvidia_t241_erratum)) {
> >> +             base_alias = gic_data.base_read_aliases;
> >> +             for (i = 0; i < gic_data.nr_dist_base_aliases; i++) {
> >> +                     if (base_alias->base &&
> >> +                        (intid >= base_alias->intid_start) &&
> >> +                        (intid <= base_alias->intid_end)) {
> >> +                             return base_alias->base;
> >> +                     }
> >> +                     base_alias++;
> >> +             }
> >> +     }
> > 
> > Each distributor has the exact same number of SPIs. So why isn't this
> > just a division that gives you a distributor number?
> > 
> 
> I considered creating a generic function that could potentially be
> utilized in the future for other Workarounds (WARs).
> 
> I'll change to this in v2.
> 
> static inline void __iomem *gic_dist_base_alias(irq_hw_number_t intid)
> {
>         u32 chip;
> 
>         if (static_branch_unlikely(&gic_nvidia_t241_erratum)) {
>                 /**
>                  *  {E}SPI mappings for all 4 chips
>                  *    Chip0 = 32-351
>                  *    Chip1 = 52-671

s/52/352/, right?

>                  *    Chip2 = 672-991
>                  *    Chip3 = 4096-4415
>                  */
>                 switch (__get_intid_range(intid)) {
>                 case SPI_RANGE:
>                         chip = (intid - 32) / 320;
>                         break;
>                 case ESPI_RANGE:
>                         chip = 3;
>                         break;
>                 default:
>                         unreachable();
>                 }
>                 BUG_ON(!t241_dist_base_alias[chip]);

You can drop this BUG_ON(), and replace it with on at probe time.

>                 return t241_dist_base_alias[chip];
>         }
> 
>         return gic_data.dist_base;
> }

Yup, that's much better.

> 
> >> +
> >> +     return gic_data.dist_base;
> >> +}
> >> +
> >>   static inline void __iomem *gic_dist_base(struct irq_data *d)
> >>   {
> >>        switch (get_intid_range(d)) {
> >> @@ -346,7 +377,7 @@ static int gic_peek_irq(struct irq_data *d, u32 offset)
> >>        if (gic_irq_in_rdist(d))
> >>                base = gic_data_rdist_sgi_base();
> >>        else
> >> -             base = gic_data.dist_base;
> >> +             base = gic_dist_base_read_alias(irqd_to_hwirq(d));
> >> 
> >>        return !!(readl_relaxed(base + offset + (index / 32) * 4) & mask);
> >>   }
> >> @@ -580,6 +611,7 @@ static int gic_set_type(struct irq_data *d, unsigned int type)
> >>        enum gic_intid_range range;
> >>        unsigned int irq = gic_irq(d);
> >>        void __iomem *base;
> >> +     void __iomem *base_read_alias;
> >>        u32 offset, index;
> >>        int ret;
> >> 
> >> @@ -594,14 +626,17 @@ static int gic_set_type(struct irq_data *d, unsigned int type)
> >>            type != IRQ_TYPE_LEVEL_HIGH && type != IRQ_TYPE_EDGE_RISING)
> >>                return -EINVAL;
> >> 
> >> -     if (gic_irq_in_rdist(d))
> >> +     if (gic_irq_in_rdist(d)) {
> >>                base = gic_data_rdist_sgi_base();
> >> -     else
> >> +             base_read_alias = base;
> >> +     } else {
> >>                base = gic_data.dist_base;
> >> +             base_read_alias = gic_dist_base_read_alias(irqd_to_hwirq(d));
> >> +     }
> >> 
> >>        offset = convert_offset_index(d, GICD_ICFGR, &index);
> >> -
> >> -     ret = gic_configure_irq(index, type, base + offset, NULL);
> >> +     ret = gic_configure_irq(index, type, base + offset, NULL,
> >> +                             base_read_alias + offset);
> >>        if (ret && (range == PPI_RANGE || range == EPPI_RANGE)) {
> >>                /* Misconfigured PPIs are usually not fatal */
> >>                pr_warn("GIC: PPI INTID%d is secure or misconfigured\n", irq);
> >> @@ -1719,6 +1754,70 @@ static bool gic_enable_quirk_hip06_07(void *data)
> >>        return false;
> >>   }
> >> 
> >> +static bool gic_enable_quirk_nvidia_t241(void *data)
> >> +{
> >> +#ifdef CONFIG_ACPI
> >> +     struct dist_base_alias *base_alias;
> >> +     struct acpi_table_header *madt;
> >> +     int i, intid, nchips = 0;
> >> +     acpi_status status;
> >> +     phys_addr_t phys;
> >> +
> >> +     status = acpi_get_table(ACPI_SIG_MADT, 0, &madt);
> >> +     if (ACPI_FAILURE(status))
> >> +             return false;
> >> +
> >> +     /* Check NVIDIA OEM ID */
> >> +     if (memcmp(madt->oem_id, "NVIDIA", 6)) {
> > 
> > What guarantees do we have that this string will always be present?
> > "oem_id" is usually updated to reflect the integrator, not the
> > silicon vendor.
> > 
> 
> Our company provides UEFI firmware porting guidelines to OEMs that
> ensure the MADT table generation, along with the ACPI header, remains
> unaltered. Thanks to your input, we are now looking into alternative
> approaches for identifying platform types and removing our dependence
> on ACPI. Specifically, we are interested in utilizing the SMCCC API
> to detect the CHIP. Determine whether the individual chips are present
> by referring to the GICR regions described in MADT.

Seems like a reasonable alternative.

> 
> 
> >> +             acpi_put_table(madt);
> >> +             return false;
> >> +     }
> >> +
> >> +     /* Find the number of chips based on OEM_TABLE_ID */
> >> +     if ((!memcmp(madt->oem_table_id, "T241x3", 6)) ||
> >> +         (!memcmp(madt->oem_table_id, "T241c3", 6))) {
> >> +             nchips = 3;
> >> +     } else if ((!memcmp(madt->oem_table_id, "T241x4", 6)) ||
> >> +                (!memcmp(madt->oem_table_id, "T241c4", 6))) {
> >> +             nchips = 4;
> >> +     }
> > 
> > Same question for these. This seems pretty fragile.
> > 
> This can be avoid for the SMCCC based platform detection.
> 
> >> +
> >> +     acpi_put_table(madt);
> >> +     if (nchips < 3)
> >> +             return false;
> >> +
> >> +     base_alias = kmalloc_array(nchips, sizeof(*base_alias),
> >> +                                GFP_KERNEL | __GFP_ZERO);
> > 
> > You are fully initialising the structures, right? So why the
> > __GFP_ZERO?
> Yes, not needed. will use the staic array since size is small after
> removing INTID_start/end feilds.
> 
> > 
> >> +     if (!base_alias)
> >> +             return false;
> >> +
> >> +     gic_data.base_read_aliases = base_alias;
> >> +     gic_data.nr_dist_base_aliases = nchips;
> >> +
> >> +     /**
> >> +      * Setup GICD alias and {E}SPIs range for each chip
> >> +      * {E}SPI blocks mappings:
> >> +      *    Chip0 = 00-09
> >> +      *    Chip1 = 10-19
> >> +      *    Chip2 = 20-29
> >> +      *    Chip3 = 30-39
> > 
> > What are these ranges? From the code below, I can (sort of) guess that
> > each chip has 10 registers in the SPI/ESPI range, with chips 0-1
> > dealing with SPIs, and chips 2-3 dealing with ESPIs.
> > 
> > It would be a lot clearer if you indicated the actual INTID ranges.
> Agree.
> 
> > 
> >> +      */
> >> +     for (i = 0; i < nchips; i++, base_alias++) {
> >> +             phys = ((1ULL << 44) * i) | 0x23580000;
> > 
> > Where is this address coming from? Can it be inferred from the MADT?
> > It would also be a lot more readable if written as:
> > 
> > #define CHIP_MASK       GENMASK_ULL(45, 44)
> > #define CHIP_ALIAS_BASE 0x23580000
> > 
> I'll define macros for constants. Use the offset, global GICD-PHYS,
> and CHIP number to get the alias addressses.
> 
> #define T241_CHIPN_MASK                 GENMASK_ULL(45, 44)
> #define T241_CHIP_GICDA_OFFSET          0x1580000
> 
>      phys = gic_data.dist_phys_base + T241_CHIP_GICDA_OFFSET;
>      phys |= FIELD_PREP(T241_CHIPN_MASK, i);
> 
> 
> >                  phys = CHIP_ALIAS_BASE;
> >                  phys |= FIELD_PREP(CHIP_MASK, i);
> > 
> >> +             base_alias->base = ioremap(phys, SZ_64K);
> >> +             WARN_ON(!base_alias->base);
> >> +
> >> +             intid = i < 3 ? 32 + i * 10 * 32 : ESPI_BASE_INTID;
> >> +             base_alias->intid_start = intid;
> >> +             base_alias->intid_end = intid + 10 * 32 - 1;
> > 
> > This really is obfuscated. And it also shows that we really don't need
> > the INTID ranges in the data structure. You can easily get to the chip
> > number with something like:
> ACK
> 
> > 
> >          switch (__get_intid_range(intid)) {
> >          case SPI_RANGE:
> >                  chip = (intid - 32) / 320;
> >                  break;
> >          case ESPI_RANGE:
> >                  chip = (intid - ESPI_BASE_INTID) / 320;
> >                  break;
> >          }
> > 
> >          alias = base_alias[chip];
> > 
> > Bonus point if you add a #define for the magic numbers.
> > 
> ACK
> 
> >> +     }
> >> +     static_branch_enable(&gic_nvidia_t241_erratum);
> >> +     return true;
> >> +#else
> >> +     return false;
> >> +#endif
> >> +}
> > 
> > How about moving the whole function under #ifdef CONFIG_ACPI?
> > 
>  
> If you're not satisfied with SMCCC-based platform detection, I'll
> make the necessary changes. We value your input and would appreciate
> your opinion on whether we should use SMCCC or ACPI-OEM-ID based
> platform detection. Our preference is to go with SMC if that's
> agreeable to you.

If you can guarantee that this FW-based discovery will always be
available, then this is a more robust way of doing it.

> 
> 
> #define SMCCC_JEP106_BANK_ID(v)         FIELD_GET(GENMASK(30, 24), (v))
> #define SMCCC_JEP106_ID_CODE(v)         FIELD_GET(GENMASK(22, 16), (v))
> #define SMCCC_JEP106_SOC_ID(v)          FIELD_GET(GENMASK(15, 0), (v))
> 
> #define JEP106_NVIDIA_BANK_ID           0x3
> #define JEP106_NVIDIA_ID_CODE           0x6b
> #define T241_CHIPN_MASK                 GENMASK_ULL(45, 44)
> #define T241_CHIP_GICDA_OFFSET          0x1580000
> #define T241_CHIP_ID                    0x241
> 
> static bool gic_enable_quirk_nvidia_t241(void *data)
> {
>         unsigned long chip_bmask = 0;
>         struct arm_smccc_res res;
>         phys_addr_t phys;
>         u32 i;
> 
>         if ((arm_smccc_get_version() < ARM_SMCCC_VERSION_1_2) ||
>             (arm_smccc_1_1_get_conduit() == SMCCC_CONDUIT_NONE)) {
>                 return false;
>         }
> 
>         arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_FEATURES_FUNC_ID,
>                              ARM_SMCCC_ARCH_SOC_ID, &res);
>         if ((s32)res.a0 < 0)
>                 return false;
> 
>         arm_smccc_1_1_invoke(ARM_SMCCC_ARCH_SOC_ID, 0, &res);
>         if ((s32)res.a0 < 0)
>                 return false;

Most of this should probably directly come from the soc_id
infrastructure.  It would need to probe early and expose the low-level
data.

> 
>         /* Check JEP106 code for NVIDIA T241 chip (036b:0241) */
>         if ((SMCCC_JEP106_BANK_ID(res.a0) != JEP106_NVIDIA_BANK_ID) ||
>             (SMCCC_JEP106_ID_CODE(res.a0) != JEP106_NVIDIA_ID_CODE) ||
>             (SMCCC_JEP106_SOC_ID(res.a0) != T241_CHIP_ID)) {
>                 return false;
>         }
> 
>         /* Find the chips based on GICR regions PHYS addr */
>         for (i = 0; i < gic_data.nr_redist_regions; i++) {
>                 chip_bmask |= BIT(FIELD_GET(T241_CHIPN_MASK,
>                                   gic_data.redist_regions[i].phys_base));
>         }
> 
>         if (hweight32(chip_bmask) < 3)
>                 return false;
> 
>         /* Setup GICD alias regions */
>         for (i = 0; i < ARRAY_SIZE(t241_dist_base_alias); i++) {
>                 if (chip_bmask & BIT(i)) {
>                         phys = gic_data.dist_phys_base + T241_CHIP_GICDA_OFFSET;
>                         phys |= FIELD_PREP(T241_CHIPN_MASK, i);
>                         t241_dist_base_alias[i] = ioremap(phys, SZ_64K);
>                         WARN_ON(!t241_dist_base_alias[i]);
>                 }
>         }
>         static_branch_enable(&gic_nvidia_t241_erratum);
>         return true;
> }

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.