public inbox for linux-kernel@vger.kernel.org
From: Daniel J Blueman <daniel@1degreenorth.com>
To: Bjorn Helgaas <bhelgaas@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	"x86@kernel.org" <x86@kernel.org>, Borislav Petkov <bp@suse.de>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Steffen Persvold <sp@numascale.com>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>,
	kim.naru@amd.com,
	Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>,
	Myron Stowe <myron.stowe@redhat.com>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>
Subject: Re: [PATCH] Fix northbridge quirk to assign correct NUMA node
Date: Mon, 24 Mar 2014 14:03:07 +0800	[thread overview]
Message-ID: <532FCA9B.7080707@1degreenorth.com> (raw)
In-Reply-To: <CAErSpo4x+=EQ5h8DuzoRznZ_d8Roy4LHer7ELEVeHxDsKwDhzA@mail.gmail.com>

On 03/22/2014 12:11 AM, Bjorn Helgaas wrote:
> [+cc Rafael, linux-acpi for _PXM questions]
>
> On Thu, Mar 20, 2014 at 9:38 PM, Daniel J Blueman <daniel@numascale.com> wrote:
>> On 21/03/2014 06:07, Bjorn Helgaas wrote:
>>> On Thu, Mar 13, 2014 at 5:43 AM, Daniel J Blueman <daniel@numascale.com>
>>> wrote:
>>>>
>>>> For systems with multiple servers and routed fabric, all northbridges get
>>>> assigned to the first server. Fix this by also using the node reported
>>>> from the PCI bus. For single-fabric systems, the northbridges are on PCI
>>>> bus 0 by definition, which is on NUMA node 0 by definition, so this is
>>>> invariant on most systems.
>>>>
>>>> Tested on fam10h and fam15h single and multi-fabric systems and candidate
>>>> for stable.
>
>>> So I suspect the problem is more complicated, and maybe _PXM is
>>> insufficient to describe the topology?  Are there subtrees that should
>>> have nodes different from the host bridge?
>>
>> Yes; see below.
>> ...
>> The _PXM method associates each northbridge with the first NUMA node, 0 in
>> single-fabric systems, and eg 4 for the second server in a multi-fabric
>> system with 2 dual-module Opterons (with 2 NUMA nodes internally) etc, since
>> the northbridges appear in the PCI tree, under the host bridge, not above it
>> [1].
>>
>> With _PXM, the rest of the PCI bus hierarchy has the right NUMA node
>> associated, but the northbridge PCI devices should be associated with their
>> actual NUMA node, 0, 1, 2, 3 for the first server in this example. The quirk
>> fixes this up; irqbalance at least uses this NUMA data exposed in /sys.
>
> I'm confused about which devices we're talking about.  We currently
> look at _PXM for PNP0A08 (and PNP0A03) ACPI devices.  The resulting
> node is associated with every PCI device we enumerate below the
> PNP0A08 bridge.  This association is made in pci_device_add().
>
> When you say "northbridge PCI devices should be associated with their
> actual NUMA node," I assume you mean the 00:18.x and 00:19.x devices
> ("AMD Family 10h Processor ..."), since those seem to be what the
> quirk applies to.  You are *not* talking about 00:00.0 ("ATI RD890
> Northbridge"), right?

Yes, on bus 0, devices 0x18 to 0x1f decode to the (up to) eight 
HyperTransport devices in the processor fabric, normally all processor 
northbridges.
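That device-number mapping can be sketched in plain C (ht_node_to_pci_dev is a hypothetical user-space helper, not a kernel function):

```c
#include <assert.h>

/* Illustrative sketch of the mapping described above: on bus 0, the
 * HyperTransport device for fabric node N responds at PCI device
 * 0x18 + N, so the eight possible nodes occupy devices 0x18-0x1f. */
static unsigned int ht_node_to_pci_dev(unsigned int node_id)
{
    assert(node_id < 8);    /* the fabric supports up to 8 nodes */
    return 0x18 + node_id;
}
```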

> You mention irqbalance; is the NUMA node information for the 00:18.x
> and 00:19.x devices important because you get a lot of interrupts from
> those devices?  Or is the issue with actual I/O devices (NICs, SCSI
> adapters, etc.)?  If so, I don't see how this quirk would affect
> those, because the node information for them comes from the PNP0A08
> bridge (in pci_device_add()), not from the 00:00.0, 00:18.x, or
> 00:19.x devices.

I need to investigate the lockups irqbalance was causing on a customer 
system; I'm not yet sure which interrupt source it rewrote that caused 
the hangs, but disabling the daemon prevented them.

>> The alternative to the quirk may be to explicitly express the northbridge
>> PCI devices in the AML with their own _PXM methods. If it's valid, it may be
>> the honest approach, though the quirk may be needed for most BIOSes; I can
>> check the AML on a few servers to confirm, if helpful.
>
> ACPI allows _PXM for any device, so this might be a possible approach.
> However, it looks like Linux only pays attention to _PXM for
> PNP0A08/03, CPUs, memory and IOAPICs (which seems like a Linux defect
> to me).

> I'm really worried about the approach here:
>
>          pci_read_config_dword(nb_ht, 0x60, &val);
>          node = pcibus_to_node(dev->bus) | (val & 7);
>
> because the pcibus_to_node() information comes indirectly from _PXM,
> and the "val" part comes from the hardware, and I don't think these
> are the same node number space.  If I understand correctly, the BIOS
> can synthesize whatever numbers it wants for _PXM, which returns a
> "proximity domain," and then Linux can make up its own mapping of
> "proximity domain" to "logical Linux node."  So I don't see why we can
> assume that it's valid to OR in the bits from a PCI config register to
> this logical Linux node number.

pcibus_to_node() uses the proximity-domain values in the ACPI SRAT 
table, which are thus correctly mapped to Linux NUMA node IDs, so my 
one-liner is still progress.
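To make the arithmetic concrete, the quirk's computation can be sketched in user-space C (quirk_node is a hypothetical stand-in for the quoted kernel code; bus_node plays the role of pcibus_to_node(dev->bus), and reg60 the value read from the northbridge's config register 0x60):

```c
#include <stdint.h>

/* Illustrative sketch of the quirk's node computation: the low three
 * bits of function 0 register 0x60 hold the HyperTransport NodeId
 * (0-7), which is OR'd into the host bridge's node number. */
static int quirk_node(int bus_node, uint32_t reg60)
{
    return bus_node | (reg60 & 7);
}
```

Note the OR only behaves like addition when the host bridge's node number has its low three bits clear (e.g. 0 or 4 in the dual-Opteron example), which is exactly the assumption in question here.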

Linux allocates NUMA node IDs using the ordering of PXM values seen in 
the SRAT table, i.e. first_unset_node(nodes_found_map). The APIC IDs are 
initialised using the HyperTransport NodeId [1, p263 and p465], but the 
NodeId can be reprogrammed after the APIC IDs are set (which also 
changes the device number, starting at 0x18 on bus 0, that the node 
responds to in PCI configuration space), and the SRAT table needn't be 
emitted in PXM order, except perhaps for the bootstrap core.
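A minimal sketch of that allocation order (simplified from the kernel's PXM-to-node mapping; the array-based map and function names here are illustrative):

```c
#include <string.h>

#define MAX_PXM      256
#define NUMA_NO_NODE (-1)

static int pxm_to_node_map[MAX_PXM];
static int nodes_found;

static void srat_map_init(void)
{
    /* 0xff bytes give -1 for each int entry, i.e. "unset" */
    memset(pxm_to_node_map, 0xff, sizeof(pxm_to_node_map));
    nodes_found = 0;
}

/* The first time a proximity domain appears in the SRAT it is given
 * the next free logical node ID -- the first_unset_node() behaviour
 * described above -- so logical IDs follow SRAT order, not PXM value. */
static int map_pxm_to_node(int pxm)
{
    if (pxm_to_node_map[pxm] == NUMA_NO_NODE)
        pxm_to_node_map[pxm] = nodes_found++;
    return pxm_to_node_map[pxm];
}
```

So if a BIOS emits SRAT entries for PXMs 4, 5, 0, 1 in that order, they become logical nodes 0, 1, 2, 3; the one-liner only works when SRAT order happens to match the hardware NodeId numbering.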

I guess fixing the original quirk depends on how important these cases 
really are.

Thanks,
   Daniel

[1] http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
-- 
Daniel J Blueman
Principal Software Engineer, Numascale


Thread overview: 12+ messages
2014-03-13 11:43 [PATCH] Fix northbridge quirk to assign correct NUMA node Daniel J Blueman
2014-03-14  9:06 ` Borislav Petkov
2014-03-14  9:57   ` Daniel J Blueman
2014-03-14 10:09 ` [tip:x86/urgent] x86/amd/numa: " tip-bot for Daniel J Blueman
2014-03-20 22:07 ` [PATCH] " Bjorn Helgaas
2014-03-21  3:38   ` Daniel J Blueman
2014-03-21 16:11     ` Bjorn Helgaas
2014-03-24  6:03       ` Daniel J Blueman [this message]
2014-03-21 17:16     ` Suravee Suthikulpanit
2014-03-23 14:30       ` Daniel J Blueman
2014-03-21  3:51   ` Suravee Suthikulpanit
2014-03-21  4:14     ` Daniel J Blueman
