From: Shaohua Li <shli@kernel.org>
To: Stan Hoeppner <stan@hardwarefreak.com>
Cc: NeilBrown <neilb@suse.de>,
	linux-raid@vger.kernel.org, dan.j.williams@gmail.com,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>,
	Joe Landman <joe.landman@gmail.com>
Subject: Re: [patch 2/2 v3]raid5: create multiple threads to handle stripes
Date: Tue, 2 Apr 2013 08:39:37 +0800	[thread overview]
Message-ID: <20130402003937.GA7393@kernel.org> (raw)
In-Reply-To: <5159E08A.2040203@hardwarefreak.com>

On Mon, Apr 01, 2013 at 02:31:22PM -0500, Stan Hoeppner wrote:
> On 3/31/2013 8:57 PM, Shaohua Li wrote:
> > On Fri, Mar 29, 2013 at 04:36:14AM -0500, Stan Hoeppner wrote:
> >> I'm CC'ing Joe Landman as he's already building systems of the caliber
> >> that would benefit from this write threading and may need configurable
> >> CPU scheduling.  Joe I've not seen a post from you on linux-raid in a
> >> while so I don't know if you've been following this topic.  Shaohua has
> >> created patch sets to eliminate, or dramatically mitigate, the horrible
> >> single-threaded write performance of md/RAID 1, 10, 5, 6 on SSD.
> >> Throughput no longer hits a wall from maxing out a single core, as with the
> >> currently shipping kernel code.  Your thoughts?
> >>
> >> On 3/28/2013 9:34 PM, Shaohua Li wrote:
> >> ...
> >>> Frankly I don't like the cpuset way. It might just work, but it's just another
> >>> API to control process affinity and has no essential difference from my
> >>> approach (which sets process affinity directly). Generally we use cpusets
> >>> instead of plain process affinity because of features like inherited affinity,
> >>> and the raid5 threads don't need those.
> >>
> >> First I should again state I'm not a developer, but a sysadmin, and this
> >> is the viewpoint from which I speak.
> >>
> >> The essential difference I see is the user interface the sysadmin will
> >> employ to tweak thread placement/behavior.  Hypothetically, say I have a
> >> 64 socket Altix UV machine w/8 core CPUs, 512 cores.  Each node board
> >> has two sockets, two distinct NUMA nodes, 64 total, but these share a
> >> NUMALink hub interface chip connection to the rest of the machine, and
> >> share a PCIe mezzanine interface.
> >>
> >> We obviously want to keep md/RAID housekeeping bandwidth (stripe cache,
> >> RMW reads, etc) isolated to the node where it is attached so it doesn't
> >> needlessly traverse NUMALink eating precious, limited, 'high' latency
> >> NUMAlink system interconnect bandwidth.  We need to keep that free for
> >> our parallel application which is eating 100% of the other 504 cores and
> >> saturating NUMAlink with MPI and file IO traffic.
> >>
> >> So lets say I have one NUMA node out of 64 dedicated to block device IO.
> >>  It has a PCIe x8 v2 IB 4x QDR HBA (4GB/s) connection to a SAN box with
> >> 18 SSDs (and 128 SAS rust).  The SAN RAID ASIC can't keep up with SSD
> >> RAID5 IO rates while also doing RAID for the rust.  So we export the
> >> SSDs individually and we make 2x 9 drive md/RAID5 arrays.  I've already
> >> created a cpuset with this NUMA node for strictly storage related
> >> processes including but not limited to XFS utils, backup processes,
> >> snapshots, etc, so that the only block IO traversing NUMAlink is user
> >> application data.  Now I add another 18 SSDs to the SAN chassis, and
> >> another IB HBA to this node board.
> >>
> >> Ideally, my md/RAID write threads should already be bound to this
> >> cpuset.  So all I should need to do is add this 2nd node to the cpuset
> >> and I'm done.  No need to monkey with additional md/RAID specific
> >> interfaces.
> >>
> >> Now, that's the simple scenario.  On this particular machine's
> >> architecture, you have two NUMA nodes per physical node, so expanding
> >> storage hardware on the same node board should be straightforward above.
> >>  However, most Altix UV machines will have storage HBAs plugged into
> >> many node boards.  If we create one cpuset and put all the md/RAID write
> >> theads in it, then we get housekeeping RAID IO traversing the NUMAlink
> >> interconnect.  So in this case we'd want to pin the threads to the
> >> physical node board where the PCIe cards, and thus disks, are attached.
> >>
> >> The 'easy' way to do this is simply create multiple cpusets, one for
> >> each storage node.  But then you have the downside of administration
> >> headaches, because you may need to pin your FS utils, backup, etc to a
> >> different storage cpuset depending on which HBAs the filesystem resides,
> >> and do this each and every time, which is a nightmare with scheduled
> >> jobs.  Thus in this case it's probably best to retain the single storage
> >> cpuset and simply make sure the node boards share the same upstream
> >> switch hop, keeping the traffic as local as possible.  The kernel
> >> scheduler might already have some NUMA scheduling intelligence here that
> >> works automagically even within a cpuset, to minimize this.  I simply
> >> lack knowledge in this area.
> >>
> >>>> I still like the idea of an 'ioctl' which a process can call and will cause
> >>>> it to start handling requests.
> >>>> The process could bind itself to whatever cpu or cpuset it wanted to, then
> >>>> could call the ioctl on the relevant md array, and pass in a bitmap of cpus
> >>>> which indicate which requests it wants to be responsible for.  The current
> >>>> kernel thread will then only handle requests that no-one else has put their
> >>>> hand up for.  This leave all the details of configuration in user-space
> >>>> (where I think it belongs).
> >>>
> >>> The 'ioctl' way is interesting, but there are some things we need to answer:
> >>>
> >>> 1. How does the kernel know whether there will be a process to handle a given
> >>> CPU's requests before the 'ioctl' is called? I suppose you want two ioctls: one
> >>> tells the kernel which CPUs' requests the process will handle (a cpumask), and
> >>> the other does the request handling. The process must sleep in the ioctl to
> >>> wait for requests.
> >>>
> >>> 2. If the process is killed in the middle, how does the kernel know? Do you
> >>> want to hook something into the task management code? For a normal process
> >>> exit, we need another ioctl to tell the kernel the process is exiting.
> >>>
> >>> The only real difference between this way and mine is whether the request
> >>> handling task lives in userspace or kernel space. Either way, you need to set
> >>> affinity and use an ioctl/sysfs to control which requests the process handles.
> >>
> >> Being a non-dev I lack the requisite knowledge to comment on ioctls.  I'll
> >> simply reiterate that whatever you go with should make use of an
> >> existing familiar user interface where this same scheduling is already
> >> handled, which is cpusets.  The only difference being kernel vs user
> >> space.  Which may turn out to be a problem, I dunno.
> > 
> > Hmm, there might be a misunderstanding here. With my approach:
> 
> Very likely.
> 
> > #echo 3 > /sys/block/md0/md/auxthread_number. This creates several kernel
> > threads to handle requests. You can use any approach to set SMP affinity for
> > the threads; you can use cpusets to bind them too.
> 
> So you have verified that these kernel threads can be placed by the
> cpuset calls and shell commands?  Cool, then we're over one hurdle, so
> to speak.  So say I create 8 threads with a boot script.  I want to
> place 4 each in 2 different cpusets.  Will this work be left for every
> sysadmin to figure out and create him/herself, or will you include
> scripts/docs/etc to facilitate this integration?

Sure, I verified that cpusets can be applied to kernel threads. No, I don't
have scripts.
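
For reference, a rough sketch of what such a boot script could do, assuming
the cpuset filesystem is mounted at /dev/cpuset and that the auxiliary
threads show up with names like md0_aux* -- the mount point, the file names
(cpus vs. cpuset.cpus) and the thread naming are only illustrative here, not
something the patch guarantees:

  # create a cpuset on the CPUs/memory node local to the HBA (example values)
  mkdir /dev/cpuset/md_io
  echo 8-15 > /dev/cpuset/md_io/cpus
  echo 1 > /dev/cpuset/md_io/mems
  # ask md for the auxiliary threads, then move them into the cpuset
  echo 3 > /sys/block/md0/md/auxthread_number
  for pid in $(pgrep md0_aux); do echo $pid > /dev/cpuset/md_io/tasks; done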
 
> > #echo 1-3 > /sys/block/md0/md/auxth0/cpulist. This doesn't set the above
> > threads' affinity; it sets which CPUs' requests the thread should handle.
> > Whether we use my way, cpusets or an ioctl, we need a similar way to tell a
> > worker thread which CPUs' requests it should handle (unless we hook the
> > scheduler so we get a notification whenever a thread's affinity changes).
> 
> I don't even know if this is necessary.  From a NUMA perspective, and
> all systems are now NUMA, it's far more critical to make sure a RAID
> thread is executing on a core/socket to which the HBA is attached via
> the PCIe bridge.  You should make it a priority to write code to
> identify this path and automatically set RAID thread affinity to that
> set of cores.  This keeps the extra mirror and parity write data, RMW
> read data, and stripe cache accesses off the NUMA interconnect, as I
> stated in a previous email.  This is critical to system performance, no
> matter how large or small the system.
> 
> Once this is accomplished, I see zero downside, from a NUMA standpoint,
> to having every RAID thread be able to service every core.  Obviously
> this would require some kind of hashing so we don't generate hot spots.
>  Does your code already prevent this?  Anyway, I think you can simply
> eliminate this tunable parm altogether.
> 
> On that note, it would make sense to modify every md/RAID driver to
> participate in this hashing.  Users run multiple RAID levels on a given
> box, and we want the bandwidth and CPU load spread as evenly as possible,
> I would think.
> 
> > In summary, my approach doesn't prevent you from using cpusets. Did I miss something?
> 
> IMO, it's not enough to simply make it work with cpusets, but to get
> some seamless integration.  Now that I think more about this, it should
> be possible to get optimal affinity automatically by identifying the
> attachment point of the HBA(s), and sticking all RAID threads to cores
> on that socket.  If the optimal number of threads to create could be
> calculated for any system, you could eliminate all of these tunables,
> and everything would be fully automatic.  No need for user-defined parms,
> and no need for cpusets.

I understand. Ideally everything would be set automatically for the best
performance. But the last time I checked, the optimal thread number differs
across setups and workloads. After some discussion, we decided to add the
tunables. That isn't convenient from the user's point of view, but it's hard
to determine the optimal value automatically.
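
On the HBA locality point above: the attachment node is already visible in
sysfs, so a script could look it up and pin the worker threads accordingly.
A rough sketch, assuming /dev/sdb is one member of the array and that the
worker threads can be found by a name like md0_aux (both are just examples,
not something the current patch provides):

  # walk from the disk to its PCI function, then read the HBA's NUMA locality
  pcidev=$(readlink -f /sys/block/sdb/device | \
           grep -o '[0-9a-f]\{4\}:[0-9a-f]\{2\}:[0-9a-f]\{2\}\.[0-9]' | tail -1)
  cat /sys/bus/pci/devices/$pcidev/numa_node        # NUMA node of the HBA
  cat /sys/bus/pci/devices/$pcidev/local_cpulist    # CPUs local to that node
  # bind one worker thread to those CPUs (taskset, or a cpuset as above)
  taskset -pc $(cat /sys/bus/pci/devices/$pcidev/local_cpulist) $(pgrep md0_aux | head -1)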

Thread overview: 29+ messages
2012-08-09  8:58 [patch 2/2 v3]raid5: create multiple threads to handle stripes Shaohua Li
2012-08-11  8:45 ` Jianpeng Ma
2012-08-13  0:21   ` Shaohua Li
2012-08-13  1:06     ` Jianpeng Ma
2012-08-13  2:13       ` Shaohua Li
2012-08-13  2:20         ` Shaohua Li
2012-08-13  2:25           ` Jianpeng Ma
2012-08-13  4:21           ` NeilBrown
2012-08-14 10:39           ` Jianpeng Ma
2012-08-15  3:51             ` Shaohua Li
2012-08-15  6:21               ` Jianpeng Ma
2012-08-15  8:04                 ` Shaohua Li
2012-08-15  8:19                   ` Jianpeng Ma
2012-09-24 11:15                   ` Jianpeng Ma
2012-09-26  1:26                     ` NeilBrown
2012-08-13  9:11     ` Jianpeng Ma
2012-08-13  4:29 ` NeilBrown
2012-08-13  6:22   ` Shaohua Li
2013-03-07  7:31 ` Shaohua Li
2013-03-12  1:39   ` NeilBrown
2013-03-13  0:44     ` Stan Hoeppner
2013-03-28  6:47       ` NeilBrown
2013-03-28 16:53         ` Stan Hoeppner
2013-03-29  2:34         ` Shaohua Li
2013-03-29  9:36           ` Stan Hoeppner
2013-04-01  1:57             ` Shaohua Li
2013-04-01 19:31               ` Stan Hoeppner
2013-04-02  0:39                 ` Shaohua Li [this message]
2013-04-02  3:12                   ` Stan Hoeppner
