From: Jeff Moyer <jmoyer@redhat.com>
To: "Elliott, Robert (Server Storage)" <Elliott@hp.com>
Cc: Bart Van Assche <bvanassche@acm.org>,
"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>
Subject: Re: [patch,v2 00/10] make I/O path allocations more numa-friendly
Date: Tue, 13 Nov 2012 10:44:50 -0500
Message-ID: <x497gppecct.fsf@segfault.boston.devel.redhat.com>
In-Reply-To: <94D0CD8314A33A4D9D801C0FE68B40294CCF7347@G9W0745.americas.hpqcorp.net> (Robert Elliott's message of "Tue, 13 Nov 2012 01:26:04 +0000")
"Elliott, Robert (Server Storage)" <Elliott@hp.com> writes:
> What do these commands report about the NUMA and non-uniform IO topology on the test system?
This is a Dell PowerEdge R715. See chapter 7 of this document for
details on how the I/O bridges are connected:
http://www.dell.com/downloads/global/products/pedge/en/Poweredge-r715-technicalguide.pdf
> numactl --hardware
# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 2 4 6
node 0 size: 8182 MB
node 0 free: 7856 MB
node 1 cpus: 8 10 12 14
node 1 size: 8192 MB
node 1 free: 8008 MB
node 2 cpus: 9 11 13 15
node 2 size: 8192 MB
node 2 free: 7994 MB
node 3 cpus: 1 3 5 7
node 3 size: 8192 MB
node 3 free: 7982 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10
> lspci -t
# lspci -vt
-+-[0000:20]-+-00.0 ATI Technologies Inc RD890 Northbridge only dual slot (2x8) PCI-e GFX Hydra part
| +-02.0-[21]--
| +-03.0-[22]----00.0 LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator]
| \-0b.0-[23]--+-00.0 Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection
| \-00.1 Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection
\-[0000:00]-+-00.0 ATI Technologies Inc RD890 PCI to PCI bridge (external gfx0 port A)
+-02.0-[01]--+-00.0 Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
| \-00.1 Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
+-03.0-[02]--+-00.0 Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
| \-00.1 Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
+-04.0-[03-08]----00.0-[04-08]--+-00.0-[05]----00.0 LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
+-09.0-[09]--
+-12.0 ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
+-12.1 ATI Technologies Inc SB7x0 USB OHCI1 Controller
+-12.2 ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller
+-13.0 ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
+-13.1 ATI Technologies Inc SB7x0 USB OHCI1 Controller
+-13.2 ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller
+-14.0 ATI Technologies Inc SBx00 SMBus Controller
+-14.3 ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
+-14.4-[0a]----03.0 Matrox Graphics, Inc. MGA G200eW WPCM450
+-18.0 Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
+-18.1 Advanced Micro Devices [AMD] Family 10h Processor Address Map
+-18.2 Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
+-18.3 Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
+-18.4 Advanced Micro Devices [AMD] Family 10h Processor Link Control
+-19.0 Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
+-19.1 Advanced Micro Devices [AMD] Family 10h Processor Address Map
+-19.2 Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
+-19.3 Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
+-19.4 Advanced Micro Devices [AMD] Family 10h Processor Link Control
+-1a.0 Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
+-1a.1 Advanced Micro Devices [AMD] Family 10h Processor Address Map
+-1a.2 Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
+-1a.3 Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
+-1a.4 Advanced Micro Devices [AMD] Family 10h Processor Link Control
+-1b.0 Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
+-1b.1 Advanced Micro Devices [AMD] Family 10h Processor Address Map
+-1b.2 Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
+-1b.3 Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
\-1b.4 Advanced Micro Devices [AMD] Family 10h Processor Link Control
# cat /sys/bus/pci/devices/0000\:20\:03.0/0000\:22\:00.0/numa_node
3
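
That numa_node attribute is what the kernel reports as the device's
home node (dev_to_node() on the struct device).  As a rough,
hypothetical sketch of the kind of node-directed allocation this
series is about (it is not code from the patch set), a module
parameter can steer an allocation to a given node via kmalloc_node():

/*
 * Hypothetical example, not from the patch set: allocate a buffer on
 * a caller-specified NUMA node, defaulting to "no preference" (-1),
 * loosely mirroring the megaraid_sas module parameter described in
 * the quoted message below.
 */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/numa.h>
#include <linux/slab.h>

static int numa_node = NUMA_NO_NODE;    /* -1 means any node */
module_param(numa_node, int, 0444);
MODULE_PARM_DESC(numa_node, "node for the example allocation (-1 = any)");

static void *buf;

static int __init numa_alloc_example_init(void)
{
        /* kmalloc_node() falls back to other nodes if this one is full */
        buf = kmalloc_node(4096, GFP_KERNEL, numa_node);
        return buf ? 0 : -ENOMEM;
}

static void __exit numa_alloc_example_exit(void)
{
        kfree(buf);
}

module_init(numa_alloc_example_init);
module_exit(numa_alloc_example_exit);
MODULE_LICENSE("GPL");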
-Jeff
>
>
>> -----Original Message-----
>> From: Jeff Moyer [mailto:jmoyer@redhat.com]
>> Sent: Monday, 12 November, 2012 3:27 PM
>> To: Bart Van Assche
>> Cc: Elliott, Robert (Server Storage); linux-scsi@vger.kernel.org
>> Subject: Re: [patch,v2 00/10] make I/O path allocations more numa-friendly
>>
>> Bart Van Assche <bvanassche@acm.org> writes:
>>
>> > On 11/09/12 21:46, Jeff Moyer wrote:
>> >>> On 11/06/12 16:41, Elliott, Robert (Server Storage) wrote:
>> >>>> It's certainly better to tie them all to one node than let them be
>> >>>> randomly scattered across nodes; your 6% observation may simply be
>> >>>> from that.
>> >>>>
>> >>>> How do you think these compare, though (for structures that are per-IO)?
>> >>>> - tying the structures to the node hosting the storage device
>> >>>> - tying the structures to the node running the application
>> >>
>> >> This is a great question, thanks for asking it! I went ahead and
>> >> modified the megaraid_sas driver to take a module parameter that
>> >> specifies on which node to allocate the scsi_host data structure (and
>> >> all other structures on top that are tied to that). I then booted the
>> >> system 4 times, specifying a different node each time. Here are the
>> >> results as compared to a vanilla kernel:
>> >>
>> [snip]
>> > Which NUMA node was processing the megaraid_sas interrupts in these
>> > tests ? Was irqbalance running during these tests or were interrupts
>> > manually pinned to a specific CPU core ?
>>
>> irqbalanced was indeed running, so I can't say for sure what node the
>> irq was pinned to during my tests (I didn't record that information).
>>
>> I re-ran the tests, this time turning off irqbalance (well, I set it to
>> one-shot), and pinning the irq to the node running the benchmark.
>> In this configuration, I saw no regressions in performance.
>>
>> As a reminder:
>>
>> >> The first number is the percent gain (or loss) w.r.t. the vanilla
>> >> kernel. The second number is the standard deviation as a percent of the
>> >> bandwidth. So, when data structures are tied to node 0, we see an
>> >> increase in performance for nodes 0-3. However, on node 3, which is the
>> >> node the megaraid_sas controller is attached to, we see no gain in
>> >> performance, and we see an increase in the run to run variation. The
>> >> standard deviation for the vanilla kernel was 1% across all nodes.
>>
>> Here are the updated numbers:
>>
>> data structures tied to node 0
>>
>> application tied to:
>> node 0: 0 +/-4%
>> node 1: 9 +/-1%
>> node 2: 10 +/-2%
>> node 3: 0 +/-2%
>>
>> data structures tied to node 1
>>
>> application tied to:
>> node 0: 5 +/-2%
>> node 1: 6 +/-8%
>> node 2: 10 +/-1%
>> node 3: 0 +/-3%
>>
>> data structures tied to node 2
>>
>> application tied to:
>> node 0: 6 +/-2%
>> node 1: 9 +/-2%
>> node 2: 7 +/-6%
>> node 3: 0 +/-3%
>>
>> data structures tied to node 3
>>
>> application tied to:
>> node 0: 0 +/-4%
>> node 1: 10 +/-2%
>> node 2: 11 +/-1%
>> node 3: 0 +/-5%
>>
>> Now, the above is apples to oranges, since the vanilla kernel was run
>> w/o any tuning of irqs. So, I went ahead and booted with
>> numa_node_parm=-1, which is the same as vanilla, and re-ran the tests.
>>
>> When we compare a vanilla kernel with and without irq binding, we get
>> this:
>>
>> node 0: 0 +/-3%
>> node 1: 9 +/-1%
>> node 2: 8 +/-3%
>> node 3: 0 +/-1%
>>
>> As you can see, binding irqs helps nodes 1 and 2 quite substantially.
>> What this boils down to, when you compare a patched kernel with the
>> vanilla kernel, where they are both tying irqs to the node hosting the
>> application, is a net gain of zero, but an increase in standard
>> deviation.
>>
>> Let me try to make that more readable. The patch set does not appear
>> to help at all with my benchmark configuration. ;-) One other
>> conclusion I can draw from this data is that irqbalance could do a
>> better job.
>>
>> An interesting (to me) tidbit about this hardware is that, while it has
>> 4 numa nodes, it only has 2 sockets. Based on the numbers above, I'd
>> guess nodes 0 and 3 are in the same socket, likewise for 1 and 2.
>>
>> Cheers,
>> Jeff
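
The irq pinning mentioned above comes down to writing a hex CPU mask
to /proc/irq/<irq>/smp_affinity.  A minimal, hypothetical helper (the
irq numbers and masks actually used in these runs are not shown here):

/*
 * Hypothetical helper, not used in the runs above: restrict an irq to
 * the CPUs in a hex mask by writing /proc/irq/<irq>/smp_affinity.
 * Example: "./pin_irq 30 55" limits irq 30 to CPUs 0,2,4,6, which is
 * node 0 in the numactl output earlier in this message.
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        char path[64];
        FILE *f;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <irq> <hex cpumask>\n", argv[0]);
                return 1;
        }
        snprintf(path, sizeof(path), "/proc/irq/%s/smp_affinity", argv[1]);
        f = fopen(path, "w");
        if (!f) {
                perror(path);
                return 1;
        }
        /* the kernel expects a hex mask, e.g. 55 for CPUs 0,2,4,6 */
        fprintf(f, "%lx\n", strtoul(argv[2], NULL, 16));
        return fclose(f) ? 1 : 0;
}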