From: Shaohua Li <shli@kernel.org>
To: Stan Hoeppner <stan@hardwarefreak.com>
Cc: NeilBrown <neilb@suse.de>,
	linux-raid@vger.kernel.org, dan.j.williams@gmail.com,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>,
	Joe Landman <joe.landman@gmail.com>
Subject: Re: [patch 2/2 v3]raid5: create multiple threads to handle stripes
Date: Tue, 2 Apr 2013 08:39:37 +0800	[thread overview]
Message-ID: <20130402003937.GA7393@kernel.org> (raw)
In-Reply-To: <5159E08A.2040203@hardwarefreak.com>

On Mon, Apr 01, 2013 at 02:31:22PM -0500, Stan Hoeppner wrote:
> On 3/31/2013 8:57 PM, Shaohua Li wrote:
> > On Fri, Mar 29, 2013 at 04:36:14AM -0500, Stan Hoeppner wrote:
> >> I'm CC'ing Joe Landman as he's already building systems of the caliber
> >> that would benefit from this write threading and may need configurable
> >> CPU scheduling.  Joe I've not seen a post from you on linux-raid in a
> >> while so I don't know if you've been following this topic.  Shaohua has
> >> created patch sets to eliminate, or dramatically mitigate, the horrible
> >> single-threaded write performance of md/RAID 1, 10, 5, 6 on SSD.
> >> Throughput no longer hits a wall from maxing out a single core, as with the
> >> currently shipping kernel code.  Your thoughts?
> >>
> >> On 3/28/2013 9:34 PM, Shaohua Li wrote:
> >> ...
> >>> Frankly I don't like the cpuset way. It might just work, but it's just another
> >>> API to control process affinity and has no essential difference from my
> >>> approach (which sets process affinity directly). Generally we use cpusets
> >>> instead of plain process affinity because of features like inherited affinity,
> >>> and the raid5 threads don't need those.
> >>
> >> First I should again state I'm not a developer, but a sysadmin, and this
> >> is the viewpoint from which I speak.
> >>
> >> The essential difference I see is the user interface the sysadmin will
> >> employ to tweak thread placement/behavior.  Hypothetically, say I have a
> >> 64 socket Altix UV machine w/8 core CPUs, 512 cores.  Each node board
> >> has two sockets, two distinct NUMA nodes, 64 total, but these share a
> >> NUMALink hub interface chip connection to the rest of the machine, and
> >> share a PCIe mezzanine interface.
> >>
> >> We obviously want to keep md/RAID housekeeping bandwidth (stripe cache,
> >> RMW reads, etc) isolated to the node where it is attached so it doesn't
> >> needlessly traverse NUMALink eating precious, limited, 'high' latency
> >> NUMAlink system interconnect bandwidth.  We need to keep that free for
> >> our parallel application which is eating 100% of the other 504 cores and
> >> saturating NUMAlink with MPI and file IO traffic.
> >>
> >> So lets say I have one NUMA node out of 64 dedicated to block device IO.
> >>  It has a PCIe x8 v2 IB 4x QDR HBA (4GB/s) connection to a SAN box with
> >> 18 SSDs (and 128 SAS rust).  The SAN RAID ASIC can't keep up with SSD
> >> RAID5 IO rates while also doing RAID for the rust.  So we export the
> >> SSDs individually and we make 2x 9 drive md/RAID5 arrays.  I've already
> >> created a cpuset with this NUMA node for strictly storage related
> >> processes including but not limited to XFS utils, backup processes,
> >> snapshots, etc, so that the only block IO traversing NUMAlink is user
> >> application data.  Now I add another 18 SSDs to the SAN chassis, and
> >> another IB HBA to this node board.
> >>
> >> Ideally, my md/RAID write threads should already be bound to this
> >> cpuset.  So all I should need to do is add this 2nd node to the cpuset
> >> and I'm done.  No need to monkey with additional md/RAID specific
> >> interfaces.
> >>
> >> Now, that's the simple scenario.  On this particular machine's
> >> architecture, you have two NUMA nodes per physical node, so expanding
> >> storage hardware on the same node board should be straightforward above.
> >>  However, most Altix UV machines will have storage HBAs plugged into
> >> many node boards.  If we create one cpuset and put all the md/RAID write
> >> theads in it, then we get housekeeping RAID IO traversing the NUMAlink
> >> interconnect.  So in this case we'd want to pin the threads to the
> >> physical node board where the PCIe cards, and thus disks, are attached.
> >>
> >> The 'easy' way to do this is simply create multiple cpusets, one for
> >> each storage node.  But then you have the downside of administration
> >> headaches, because you may need to pin your FS utils, backup, etc to a
> >> different storage cpuset depending on which HBAs the filesystem resides,
> >> and do this each and every time, which is a nightmare with scheduled
> >> jobs.  Thus in this case it's probably best to retain the single storage
> >> cpuset and simply make sure the node boards share the same upstream
> >> switch hop, keeping the traffic as local as possible.  The kernel
> >> scheduler might already have some NUMA scheduling intelligence here that
> >> works automagically even within a cpuset, to minimize this.  I simply
> >> lack knowledge in this area.
> >>
> >>>> I still like the idea of an 'ioctl' which a process can call and will cause
> >>>> it to start handling requests.
> >>>> The process could bind itself to whatever cpu or cpuset it wanted to, then
> >>>> could call the ioctl on the relevant md array, and pass in a bitmap of cpus
> >>>> which indicate which requests it wants to be responsible for.  The current
> >>>> kernel thread will then only handle requests that no-one else has put their
> >>>> hand up for.  This leave all the details of configuration in user-space
> >>>> (where I think it belongs).
> >>>
> >>> The 'ioctl' way is interesting, but there are some things we need to answer:
> >>>
> >>> 1. How does the kernel know whether there will be a process to handle a given
> >>> CPU's requests before the 'ioctl' is called? I suppose you want two ioctls: one
> >>> tells the kernel which CPUs' requests the process will handle (a cpumask), and
> >>> the other does the request handling. The process must sleep in the ioctl to
> >>> wait for requests.
> >>>
> >>> 2. If the process is killed in the middle, how does the kernel know? Do you
> >>> want to hook something into the task management code? For a normal process
> >>> exit, we need another ioctl to tell the kernel the process is exiting.
> >>>
> >>> The only real difference between this way and mine is whether the request
> >>> handling task lives in userspace or kernel space. Either way, you need to set
> >>> affinity and use an ioctl/sysfs to control which requests the process handles.
> >>
> >> Being a non-dev I lack the requisite knowledge to comment on ioctls.  I'll
> >> simply reiterate that whatever you go with should make use of an
> >> existing familiar user interface where this same scheduling is already
> >> handled, which is cpusets.  The only difference being kernel vs user
> >> space.  Which may turn out to be a problem, I dunno.
> > 
> > Hmm, there might be a misunderstanding here. With my approach:
> 
> Very likely.
> 
> > #echo 3 > /sys/block/md0/md/auxthread_number. This creates several kernel
> > threads to handle requests. You can use any approach to set SMP affinity for
> > the threads; you can use cpusets to bind them too.
> 
> So you have verified that these kernel threads can be placed by the
> cpuset calls and shell commands?  Cool, then we're over one hurdle, so
> to speak.  So say I create 8 threads with a boot script.  I want to
> place 4 each in 2 different cpusets.  Will this work be left for every
> sysadmin to figure out and create him/herself, or will you include
> scripts/docs/etc to facilitate this integration?

Sure, I verified that cpusets can be applied to kernel threads. No, I don't
have scripts.
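
For reference, a rough sketch of what such a boot script could do, assuming
the cpuset filesystem is mounted at /dev/cpuset and that the auxiliary
threads show up with names like md0_aux* -- the mount point, the file names
(cpus vs. cpuset.cpus) and the thread naming are only illustrative here, not
something the patch guarantees:

  # create a cpuset on the CPUs/memory node local to the HBA (example values)
  mkdir /dev/cpuset/md_io
  echo 8-15 > /dev/cpuset/md_io/cpus
  echo 1 > /dev/cpuset/md_io/mems
  # ask md for the auxiliary threads, then move them into the cpuset
  echo 3 > /sys/block/md0/md/auxthread_number
  for pid in $(pgrep md0_aux); do echo $pid > /dev/cpuset/md_io/tasks; done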
 
> > #echo 1-3 > /sys/block/md0/md/auxth0/cpulist. This doesn't set the above
> > threads' affinity; it sets which CPUs' requests the thread should handle.
> > Whether we use my way, cpusets or an ioctl, we need a similar way to tell a
> > worker thread which CPUs' requests it should handle (unless we hook the
> > scheduler so we get a notification whenever a thread's affinity changes).
> 
> I don't even know if this is necessary.  From a NUMA perspective, and
> all systems are now NUMA, it's far more critical to make sure a RAID
> thread is executing on a core/socket to which the HBA is attached via
> the PCIe bridge.  You should make it a priority to write code to
> identify this path and automatically set RAID thread affinity to that
> set of cores.  This keeps the extra mirror and parity write data, RMW
> read data, and stripe cache accesses off the NUMA interconnect, as I
> stated in a previous email.  This is critical to system performance, no
> matter how large or small the system.
> 
> Once this is accomplished, I see zero downside, from a NUMA standpoint,
> to having every RAID thread be able to service every core.  Obviously
> this would require some kind of hashing so we don't generate hot spots.
>  Does your code already prevent this?  Anyway, I think you can simply
> eliminate this tunable parm altogether.
> 
> On that note, it would make sense to modify every md/RAID driver to
> participate in this hashing.  Users run multiple RAID levels on a given
> box, and we want the bandwidth and CPU load spread as evenly as possible,
> I would think.
> 
> > In summary, my approach doesn't prevent you from using cpusets. Did I miss something?
> 
> IMO, it's not enough to simply make it work with cpusets, but to get
> some seamless integration.  Now that I think more about this, it should
> be possible to get optimal affinity automatically by identifying the
> attachment point of the HBA(s), and sticking all RAID threads to cores
> on that socket.  If the optimal number of threads to create could be
> calculated for any system, you could eliminate all of these tunables,
> and everything would be fully automatic.  No need for user-defined parms,
> and no need for cpusets.

I understand. Ideally everything would be set automatically for the best
performance. But the last time I checked, the optimal thread number differs
across setups and workloads. After some discussion, we decided to add the
tunables. That isn't convenient from the user's point of view, but it's hard
to determine the optimal value automatically.
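
On the HBA locality point above: the attachment node is already visible in
sysfs, so a script could look it up and pin the worker threads accordingly.
A rough sketch, assuming /dev/sdb is one member of the array and that the
worker threads can be found by a name like md0_aux (both are just examples,
not something the current patch provides):

  # walk from the disk to its PCI function, then read the HBA's NUMA locality
  pcidev=$(readlink -f /sys/block/sdb/device | \
           grep -o '[0-9a-f]\{4\}:[0-9a-f]\{2\}:[0-9a-f]\{2\}\.[0-9]' | tail -1)
  cat /sys/bus/pci/devices/$pcidev/numa_node        # NUMA node of the HBA
  cat /sys/bus/pci/devices/$pcidev/local_cpulist    # CPUs local to that node
  # bind one worker thread to those CPUs (taskset, or a cpuset as above)
  taskset -pc $(cat /sys/bus/pci/devices/$pcidev/local_cpulist) $(pgrep md0_aux | head -1)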

Thread overview: 29+ messages
2012-08-09  8:58 [patch 2/2 v3]raid5: create multiple threads to handle stripes Shaohua Li
2012-08-11  8:45 ` Jianpeng Ma
2012-08-13  0:21   ` Shaohua Li
2012-08-13  1:06     ` Jianpeng Ma
2012-08-13  2:13       ` Shaohua Li
2012-08-13  2:20         ` Shaohua Li
2012-08-13  2:25           ` Jianpeng Ma
2012-08-13  4:21           ` NeilBrown
2012-08-14 10:39           ` Jianpeng Ma
2012-08-15  3:51             ` Shaohua Li
2012-08-15  6:21               ` Jianpeng Ma
2012-08-15  8:04                 ` Shaohua Li
2012-08-15  8:19                   ` Jianpeng Ma
2012-09-24 11:15                   ` Jianpeng Ma
2012-09-26  1:26                     ` NeilBrown
2012-08-13  9:11     ` Jianpeng Ma
2012-08-13  4:29 ` NeilBrown
2012-08-13  6:22   ` Shaohua Li
2013-03-07  7:31 ` Shaohua Li
2013-03-12  1:39   ` NeilBrown
2013-03-13  0:44     ` Stan Hoeppner
2013-03-28  6:47       ` NeilBrown
2013-03-28 16:53         ` Stan Hoeppner
2013-03-29  2:34         ` Shaohua Li
2013-03-29  9:36           ` Stan Hoeppner
2013-04-01  1:57             ` Shaohua Li
2013-04-01 19:31               ` Stan Hoeppner
2013-04-02  0:39                 ` Shaohua Li [this message]
2013-04-02  3:12                   ` Stan Hoeppner
