* rq_affinity doesn't seem to work?
From: Jiang, Dave @ 2011-07-12 19:03 UTC
To: axboe@kernel.dk
Cc: Williams, Dan J; Foong, Annie; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; Nadolski, Edmund; Skirvin, Jeffrey D

Jens,
I'm doing some performance tuning for the Intel isci SAS controller
driver, and I noticed some interesting numbers with mpstat. Looking at
the numbers, it seems that rq_affinity is not moving the request
completion to the request-submission CPU. Using fio to saturate the
system with 512B I/Os, I noticed that all I/Os are bound to the CPUs
(CPUs 6 and 7) that service the hard irqs. I have put in a quick hack
in the driver so that it records the CPU during request construction,
and then I try to steer the scsi->done() calls to the request CPUs.
With this simple hack, mpstat shows that the soft irq contexts are now
distributed. I observed a significant performance increase: the iowait%
went from the 30s and 40s to low single digits, approaching 0. Any
ideas what could be happening with the rq_affinity logic? I'm assuming
rq_affinity should behave the way my hacked solution is behaving. This
is running on an 8-core, single-socket Sandy Bridge-based system with
hyper-threading turned off. The two MSI-X interrupts on the controller
are tied to CPUs 6 and 7, respectively, via /proc/irq/X/smp_affinity.
I'm running fio with 8 SAS disks and 8 threads.
no rq_affinity:

09:23:31 AM  CPU   %usr  %nice   %sys  %iowait   %irq   %soft  %steal  %guest   %idle
09:23:36 AM  all   9.65   0.00  41.75    23.60   0.00   24.98    0.00    0.00    0.03
09:23:36 AM    0  13.40   0.00  59.60    27.00   0.00    0.00    0.00    0.00    0.00
09:23:36 AM    1  14.00   0.00  58.80    27.20   0.00    0.00    0.00    0.00    0.00
09:23:36 AM    2  13.20   0.00  57.40    29.40   0.00    0.00    0.00    0.00    0.00
09:23:36 AM    3  12.40   0.00  57.00    30.60   0.00    0.00    0.00    0.00    0.00
09:23:36 AM    4  12.60   0.00  52.80    34.60   0.00    0.00    0.00    0.00    0.00
09:23:36 AM    5  11.62   0.00  48.30    40.08   0.00    0.00    0.00    0.00    0.00
09:23:36 AM    6   0.00   0.00   0.20     0.00   0.00   99.80    0.00    0.00    0.00
09:23:36 AM    7   0.00   0.00   0.00     0.00   0.00   99.80    0.00    0.00    0.20

with rq_affinity:

09:25:04 AM  CPU   %usr  %nice   %sys  %iowait   %irq   %soft  %steal  %guest   %idle
09:25:09 AM  all   9.50   0.00  42.32    23.19   0.00   24.99    0.00    0.00    0.00
09:25:09 AM    0  13.80   0.00  61.60    24.60   0.00    0.00    0.00    0.00    0.00
09:25:09 AM    1  13.03   0.00  60.32    26.65   0.00    0.00    0.00    0.00    0.00
09:25:09 AM    2  12.83   0.00  58.52    28.66   0.00    0.00    0.00    0.00    0.00
09:25:09 AM    3  12.20   0.00  56.60    31.20   0.00    0.00    0.00    0.00    0.00
09:25:09 AM    4  12.20   0.00  52.40    35.40   0.00    0.00    0.00    0.00    0.00
09:25:09 AM    5  11.78   0.00  49.30    38.92   0.00    0.00    0.00    0.00    0.00
09:25:09 AM    6   0.00   0.00   0.00     0.00   0.00  100.00    0.00    0.00    0.00
09:25:09 AM    7   0.00   0.00   0.00     0.00   0.00  100.00    0.00    0.00    0.00

with soft irq steering:

09:31:57 AM  CPU   %usr  %nice   %sys  %iowait   %irq   %soft  %steal  %guest   %idle
09:32:02 AM  all  12.73   0.00  46.82     1.63   8.03   28.59    0.00    0.00    2.20
09:32:02 AM    0  16.20   0.00  55.00     3.20  10.20   15.40    0.00    0.00    0.00
09:32:02 AM    1  15.60   0.00  57.60     0.00  10.00   16.80    0.00    0.00    0.00
09:32:02 AM    2  16.03   0.00  56.91     0.20  10.62   16.23    0.00    0.00    0.00
09:32:02 AM    3  15.77   0.00  58.48     0.20  10.18   15.17    0.00    0.00    0.20
09:32:02 AM    4  16.17   0.00  56.09     0.00  10.18   17.56    0.00    0.00    0.00
09:32:02 AM    5  16.00   0.00  56.60     0.20  10.60   16.60    0.00    0.00    0.00
09:32:02 AM    6   3.41   0.00  18.64     3.81   0.80   60.52    0.00    0.00   12.83
09:32:02 AM    7   2.79   0.00  14.97     5.79   1.40   70.26    0.00    0.00    4.79
* Re: rq_affinity doesn't seem to work?
From: Jens Axboe @ 2011-07-12 20:30 UTC
To: Jiang, Dave
Cc: Williams, Dan J; Foong, Annie; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; Nadolski, Edmund; Skirvin, Jeffrey D

On 2011-07-12 21:03, Jiang, Dave wrote:
> Jens,
> I'm doing some performance tuning for the Intel isci SAS controller
> driver, and I noticed some interesting numbers with mpstat. Looking at
> the numbers, it seems that rq_affinity is not moving the request
> completion to the request-submission CPU. Using fio to saturate the
> system with 512B I/Os, I noticed that all I/Os are bound to the CPUs
> (CPUs 6 and 7) that service the hard irqs. I have put in a quick hack
> in the driver so that it records the CPU during request construction,
> and then I try to steer the scsi->done() calls to the request CPUs.
> With this simple hack, mpstat shows that the soft irq contexts are now
> distributed. I observed a significant performance increase: the iowait%
> went from the 30s and 40s to low single digits, approaching 0. Any
> ideas what could be happening with the rq_affinity logic? I'm assuming
> rq_affinity should behave the way my hacked solution is behaving. This
> is running on an 8-core, single-socket Sandy Bridge-based system with
> hyper-threading turned off. The two MSI-X interrupts on the controller
> are tied to CPUs 6 and 7, respectively, via /proc/irq/X/smp_affinity.
> I'm running fio with 8 SAS disks and 8 threads.

It's probably the grouping, we need to do something about that. Does
the below patch make it behave as you expect?
diff --git a/block/blk.h b/block/blk.h
index d658628..17d53d8 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -157,6 +157,7 @@ static inline int queue_congestion_off_threshold(struct request_queue *q)
 
 static inline int blk_cpu_to_group(int cpu)
 {
+#if 0
 	int group = NR_CPUS;
 #ifdef CONFIG_SCHED_MC
 	const struct cpumask *mask = cpu_coregroup_mask(cpu);
@@ -168,6 +169,7 @@ static inline int blk_cpu_to_group(int cpu)
 #endif
 	if (likely(group < NR_CPUS))
 		return group;
+#endif
 	return cpu;
 }

-- 
Jens Axboe
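[Editorial note] To make the failure mode concrete: on a single-socket part where every core shares the last-level cache, `blk_cpu_to_group()` maps every CPU to the same group leader, so the `ccpu == group_cpu` test in `__blk_complete_request()` always passes and the completion stays on the interrupting CPU. Below is a minimal Python model of that check — the function names mirror the kernel's, but the topology and harness are illustrative, not kernel code:

```python
# Model of blk_cpu_to_group() and the completion-CPU check in
# __blk_complete_request(). Topology and names are illustrative.

def blk_cpu_to_group(cpu, llc_groups):
    """Map a CPU to the first CPU of its last-level-cache group."""
    for group in llc_groups:
        if cpu in group:
            return min(group)
    return cpu

def completes_locally(submit_cpu, irq_cpu, llc_groups, grouped=True):
    """True if the block softirq runs on the irq CPU instead of being
    steered (via IPI) back to the submitter's CPU/group."""
    to_group = (lambda c: blk_cpu_to_group(c, llc_groups)) if grouped \
               else (lambda c: c)
    ccpu = to_group(submit_cpu)    # req->cpu, recorded at submission
    group_cpu = to_group(irq_cpu)  # group of the CPU taking the irq
    return ccpu == irq_cpu or ccpu == group_cpu

# Dave's box: 8 cores, one socket, one shared LLC; irqs on CPUs 6 and 7.
llc = [set(range(8))]
# With grouping, a request submitted on CPU 2 completes right on CPU 6:
assert completes_locally(2, 6, llc, grouped=True)
# With the #if 0 patch above (grouping disabled), it is steered back:
assert not completes_locally(2, 6, llc, grouped=False)
```

With the patch applied, `blk_cpu_to_group()` collapses to the identity function, which is the `grouped=False` case above.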
* RE: rq_affinity doesn't seem to work?
From: Jiang, Dave @ 2011-07-12 21:17 UTC
To: Jens Axboe
Cc: Williams, Dan J; Foong, Annie; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; Nadolski, Edmund; Skirvin, Jeffrey D

> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Sent: Tuesday, July 12, 2011 1:31 PM
> To: Jiang, Dave
> Cc: Williams, Dan J; Foong, Annie; linux-scsi@vger.kernel.org;
> linux-kernel@vger.kernel.org; Nadolski, Edmund; Skirvin, Jeffrey D
> Subject: Re: rq_affinity doesn't seem to work?
>
> On 2011-07-12 21:03, Jiang, Dave wrote:
> > [...]
> > The two MSI-X interrupts on the controller are tied to CPUs 6 and 7,
> > respectively, via /proc/irq/X/smp_affinity. I'm running fio with
> > 8 SAS disks and 8 threads.
> It's probably the grouping, we need to do something about that. Does
> the below patch make it behave as you expect?

Yep that is it.

02:14:12 PM  CPU   %usr  %nice   %sys  %iowait   %irq   %soft  %steal  %guest   %idle
02:14:17 PM  all  11.98   0.00  46.62     1.18   0.00   37.79    0.00    0.00    2.43
02:14:17 PM    0  15.43   0.00  55.31     0.00   0.00   29.26    0.00    0.00    0.00
02:14:17 PM    1  14.83   0.00  56.71     0.00   0.00   28.46    0.00    0.00    0.00
02:14:17 PM    2  14.80   0.00  56.00     0.00   0.00   29.20    0.00    0.00    0.00
02:14:17 PM    3  14.63   0.00  57.11     0.00   0.00   28.26    0.00    0.00    0.00
02:14:17 PM    4  14.80   0.00  57.60     0.00   0.00   27.60    0.00    0.00    0.00
02:14:17 PM    5  15.03   0.00  56.11     0.00   0.00   28.86    0.00    0.00    0.00
02:14:17 PM    6   3.79   0.00  20.16     5.99   0.00   59.68    0.00    0.00   10.38
02:14:17 PM    7   2.80   0.00  14.20     3.20   0.00   70.80    0.00    0.00    9.00
* Re: rq_affinity doesn't seem to work?
From: Matthew Wilcox @ 2011-07-13 17:10 UTC
To: Jens Axboe
Cc: Jiang, Dave; Williams, Dan J; Foong, Annie; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; Nadolski, Edmund; Skirvin, Jeffrey D

On Tue, Jul 12, 2011 at 10:30:35PM +0200, Jens Axboe wrote:
> It's probably the grouping, we need to do something about that. Does
> the below patch make it behave as you expect?

"Something", absolutely. But there is benefit from doing some
aggregation (we tried disabling it entirely with the "well-known OLTP
benchmark" and performance went down).

Ideally we'd do something like "if the softirq is taking up more than
10% of a core, split the grouping". Do we have enough stats to do that
kind of monitoring?

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: rq_affinity doesn't seem to work?
From: Jens Axboe @ 2011-07-13 18:00 UTC
To: Matthew Wilcox
Cc: Jiang, Dave; Williams, Dan J; Foong, Annie; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; Nadolski, Edmund; Skirvin, Jeffrey D

On 2011-07-13 19:10, Matthew Wilcox wrote:
> On Tue, Jul 12, 2011 at 10:30:35PM +0200, Jens Axboe wrote:
>> It's probably the grouping, we need to do something about that. Does
>> the below patch make it behave as you expect?
>
> "Something", absolutely. But there is benefit from doing some
> aggregation (we tried disabling it entirely with the "well-known OLTP
> benchmark" and performance went down).

Yep, that's why the current solution is somewhat middle of the road...

> Ideally we'd do something like "if the softirq is taking up more than
> 10% of a core, split the grouping". Do we have enough stats to do that
> kind of monitoring?

I don't think we have those stats, though they could/should be pulled
from the ksoftirqX threads. We could have some metric, a la:

	dest_cpu = get_group_completion_cpu(rq->cpu);
	if (ksoftirqd_of(dest_cpu) >= 90% busy)
		dest_cpu = rq->cpu;

to send things completely local to the submitter of the IO, IFF the
current CPU is close to running at full tilt.

-- 
Jens Axboe
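[Editorial note] Jens's fragment is pseudocode: `get_group_completion_cpu()` and `ksoftirqd_of()` are not real kernel interfaces. A small Python sketch of the same policy, purely to pin down the intended behaviour (the 90% threshold and busy-fraction bookkeeping are assumptions from the message, not measured values):

```python
# Sketch of the proposed fallback: complete within the group as usual,
# unless the group completion CPU's ksoftirqd is nearly saturated, in
# which case fall back to the original submitting CPU.

BUSY_THRESHOLD = 0.90

def pick_completion_cpu(rq_cpu, group_completion_cpu, ksoftirqd_busy):
    """ksoftirqd_busy maps cpu -> fraction of time its ksoftirqd runs."""
    dest = group_completion_cpu
    if ksoftirqd_busy.get(dest, 0.0) >= BUSY_THRESHOLD:
        dest = rq_cpu  # group CPU is overloaded; go fully local
    return dest

busy = {0: 0.95, 6: 0.20}
assert pick_completion_cpu(3, 0, busy) == 3  # CPU 0 saturated -> submitter
assert pick_completion_cpu(3, 6, busy) == 6  # CPU 6 has headroom -> group CPU
```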
* Re: rq_affinity doesn't seem to work?
From: Roland Dreier @ 2011-07-14 17:02 UTC
To: Matthew Wilcox
Cc: Jens Axboe; Jiang, Dave; Williams, Dan J; Foong, Annie; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; Nadolski, Edmund; Skirvin, Jeffrey D

On Wed, Jul 13, 2011 at 10:10 AM, Matthew Wilcox <matthew@wil.cx> wrote:
> On Tue, Jul 12, 2011 at 10:30:35PM +0200, Jens Axboe wrote:
>> It's probably the grouping, we need to do something about that. Does
>> the below patch make it behave as you expect?
>
> "Something", absolutely. But there is benefit from doing some
> aggregation (we tried disabling it entirely with the "well-known OLTP
> benchmark" and performance went down).
>
> Ideally we'd do something like "if the softirq is taking up more than
> 10% of a core, split the grouping". Do we have enough stats to do that
> kind of monitoring?

What platform was your "OLTP benchmark" on? It seems that as the number
of cores per package goes up, this grouping becomes too coarse, since
almost everyone will have SCHED_MC set in the code:

	static inline int blk_cpu_to_group(int cpu)
	{
		int group = NR_CPUS;
	#ifdef CONFIG_SCHED_MC
		const struct cpumask *mask = cpu_coregroup_mask(cpu);
		group = cpumask_first(mask);
	#elif defined(CONFIG_SCHED_SMT)
		group = cpumask_first(topology_thread_cpumask(cpu));
	#else
		return cpu;
	#endif
		if (likely(group < NR_CPUS))
			return group;
		return cpu;
	}

and so we use cpumask_first(cpu_coregroup_mask(cpu)). And from

	const struct cpumask *cpu_coregroup_mask(int cpu)
	{
		struct cpuinfo_x86 *c = &cpu_data(cpu);
		/*
		 * For perf, we return last level cache shared map.
		 * And for power savings, we return cpu_core_map
		 */
		if ((sched_mc_power_savings || sched_smt_power_savings) &&
		    !(cpu_has(c, X86_FEATURE_AMD_DCM)))
			return cpu_core_mask(cpu);
		else
			return cpu_llc_shared_mask(cpu);
	}

in the "max performance" case, we use cpu_llc_shared_mask().

The problem as we've seen it is that on a dual-socket Westmere (Xeon
56xx) system, we have two sockets with 6 cores (12 threads) each, all
sharing L3 cache, and so we end up with all block softirqs on only 2
out of 24 threads, which is not enough to handle all the IOPS that fast
storage can provide.

It's not clear to me what the right answer or tradeoffs are here. It
might make sense to use only one hyperthread per core for block
softirqs. As I understand the Westmere cache topology, there's not
really an obvious intermediate step -- all the cores in a package share
the L3, and then each core has its own L2.

Limiting softirqs to 10% of a core seems a bit low, since we seem to be
able to use more than 100% of a core handling block softirqs, and
anyway magic numbers like that seem to always be wrong sometimes.
Perhaps we could use the queue length on the destination CPU as a proxy
for how busy ksoftirq is?

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
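[Editorial note] Roland's point can be checked numerically. The Python model below (hypothetical CPU numbering — actual enumeration varies by BIOS) shows how an LLC-wide `blk_cpu_to_group()` funnels all 24 hardware threads of a dual-socket Westmere onto just 2 completion CPUs:

```python
# Model of blk_cpu_to_group() when cpu_coregroup_mask() returns the
# LLC-shared mask: every CPU collapses to the first CPU of its socket.

def blk_cpu_to_group(cpu, llc_groups):
    for group in llc_groups:
        if cpu in group:
            return min(group)
    return cpu

# Socket 0: threads 0-5 and 12-17; socket 1: threads 6-11 and 18-23
# (one common Westmere enumeration; assumed here for illustration).
llc_groups = [set(range(0, 6)) | set(range(12, 18)),
              set(range(6, 12)) | set(range(18, 24))]

completion_cpus = {blk_cpu_to_group(c, llc_groups) for c in range(24)}
assert completion_cpus == {0, 6}  # 24 threads -> only 2 softirq targets
```

Notably, in the mpstat output later in the thread it is CPUs 0 and 6 that absorb the extra softirq load, consistent with this collapse (assuming a similar enumeration on that box).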
* Re: rq_affinity doesn't seem to work?
From: Dan Williams @ 2011-07-15 20:20 UTC
To: Roland Dreier
Cc: Matthew Wilcox; Jens Axboe; Jiang, Dave; Foong, Annie; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; Nadolski, Edmund; Skirvin, Jeffrey D

On Thu, 2011-07-14 at 10:02 -0700, Roland Dreier wrote:
> On Wed, Jul 13, 2011 at 10:10 AM, Matthew Wilcox <matthew@wil.cx> wrote:
> Limiting softirqs to 10% of a core seems a bit low, since we seem to
> be able to use more than 100% of a core handling block softirqs, and
> anyway magic numbers like that seem to always be wrong sometimes.
> Perhaps we could use the queue length on the destination CPU as a
> proxy for how busy ksoftirq is?

This is likely too aggressive (untested / need to confirm it resolves
the isci issue), but it's at least straightforward to determine, and I
wonder if it prevents the regression Matthew is seeing. It assumes that
once we have naturally spilled from the irq return path to ksoftirqd,
this cpu is having trouble keeping up with the load.

??
diff --git a/block/blk-core.c b/block/blk-core.c
index d2f8f40..9c7ba87 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1279,10 +1279,8 @@ get_rq:
 	init_request_from_bio(req, bio);
 
 	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
-	    bio_flagged(bio, BIO_CPU_AFFINE)) {
-		req->cpu = blk_cpu_to_group(get_cpu());
-		put_cpu();
-	}
+	    bio_flagged(bio, BIO_CPU_AFFINE))
+		req->cpu = smp_processor_id();
 
 	plug = current->plug;
 	if (plug) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index ee9c216..720918f 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -101,17 +101,21 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = {
 	.notifier_call	= blk_cpu_notify,
 };
 
+DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
+
 void __blk_complete_request(struct request *req)
 {
+	int ccpu, cpu, group_ccpu, group_cpu;
 	struct request_queue *q = req->q;
+	struct task_struct *tsk;
 	unsigned long flags;
-	int ccpu, cpu, group_cpu;
 
 	BUG_ON(!q->softirq_done_fn);
 
 	local_irq_save(flags);
 	cpu = smp_processor_id();
 	group_cpu = blk_cpu_to_group(cpu);
+	tsk = per_cpu(ksoftirqd, cpu);
 
 	/*
 	 * Select completion CPU
@@ -120,8 +124,15 @@ void __blk_complete_request(struct request *req)
 		ccpu = req->cpu;
 	else
 		ccpu = cpu;
+	group_ccpu = blk_cpu_to_group(ccpu);
 
-	if (ccpu == cpu || ccpu == group_cpu) {
+	/*
+	 * try to skip a remote softirq-trigger if the completion is
+	 * within the same group, but not if local softirqs have already
+	 * spilled to ksoftirqd
+	 */
+	if (ccpu == cpu ||
+	    (group_ccpu == group_cpu && tsk->state != TASK_RUNNING)) {
 		struct list_head *list;
 do_local:
 		list = &__get_cpu_var(blk_cpu_done);
* Re: rq_affinity doesn't seem to work?
From: ersatz splatt @ 2011-07-15 23:43 UTC
To: Roland Dreier
Cc: Matthew Wilcox; Jens Axboe; Jiang, Dave; Williams, Dan J; Foong, Annie; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; Nadolski, Edmund; Skirvin, Jeffrey D

On Thu, Jul 14, 2011 at 10:02 AM, Roland Dreier <roland@purestorage.com> wrote:
> The problem as we've seen it is that on a dual-socket Westmere (Xeon
> 56xx) system, we have two sockets with 6 cores (12 threads) each, all
> sharing L3 cache, and so we end up with all block softirqs on only 2
> out of 24 threads, which is not enough to handle all the IOPS that
> fast storage can provide.

I have a dual-socket system with the Tylersburg chipset (approximately
Westmere, I gather). With two Xeon X5660 packages I get this when
running with more IOPS potential than the system can handle:

02:15:00 PM  CPU   %usr  %nice   %sys  %iowait   %irq   %soft  %steal  %guest   %idle
02:15:02 PM  all   2.76   0.00  30.40    28.28   0.00   13.74    0.00    0.00   24.81
02:15:02 PM    0   0.00   0.00   0.00     0.00   0.00  100.00    0.00    0.00    0.00
02:15:02 PM    1   0.00   0.00   0.50     0.00   0.00    0.00    0.00    0.00   99.50
02:15:02 PM    2   3.02   0.00  36.68    52.26   0.00    8.04    0.00    0.00    0.00
02:15:02 PM    3   2.50   0.00  36.00    54.50   0.00    7.00    0.00    0.00    0.00
02:15:02 PM    4   5.47   0.00  64.18    18.91   0.00   11.44    0.00    0.00    0.00
02:15:02 PM    5   3.02   0.00  37.69    53.27   0.00    6.03    0.00    0.00    0.00
02:15:02 PM    6   0.00   0.00   0.50     0.00   0.00   91.54    0.00    0.00    7.96
02:15:02 PM    7   0.00   0.00   0.00     0.00   0.00    0.00    0.00    0.00  100.00
02:15:02 PM    8   3.00   0.00  35.50    55.00   0.00    6.50    0.00    0.00    0.00
02:15:02 PM    9   3.02   0.00  39.70    50.25   0.00    7.04    0.00    0.00    0.00
02:15:02 PM   10   3.50   0.00  36.50    53.00   0.00    7.00    0.00    0.00    0.00
02:15:02 PM   11   6.53   0.00  70.85     9.05   0.00   13.57    0.00    0.00    0.00
02:15:02 PM   12   0.00   0.00   0.57     0.00   0.00    0.00    0.00    0.00   99.43
02:15:02 PM   13   3.00   0.00   0.00     0.00   0.00    0.00    0.00    0.00   97.00
02:15:02 PM   14   2.50   0.00  36.50    54.00   0.00    7.00    0.00    0.00    0.00
02:15:02 PM   15   3.52   0.00  36.18    53.77   0.00    6.53    0.00    0.00    0.00
02:15:02 PM   16   5.00   0.00  64.00    21.00   0.00   10.00    0.00    0.00    0.00
02:15:02 PM   17   3.02   0.00  37.19    52.76   0.00    7.04    0.00    0.00    0.00
02:15:02 PM   18   0.00   0.00   0.00     0.00   0.00    0.00    0.00    0.00  100.00
02:15:02 PM   19   0.00   0.00   1.01     0.00   0.00    0.00    0.00    0.00   98.99
02:15:02 PM   20   3.48   0.00  38.31    52.24   0.00    5.97    0.00    0.00    0.00
02:15:02 PM   21   5.50   0.00  63.00    18.50   0.00   13.00    0.00    0.00    0.00
02:15:02 PM   22   2.50   0.00  35.00    54.50   0.00    8.00    0.00    0.00    0.00
02:15:02 PM   23   5.03   0.00  58.79    23.62   0.00   12.56    0.00    0.00    0.00

By "more IOPS potential than the system can handle", I mean that I get
the same figure with about a quarter of the targets. The HBA is known
to handle more than twice the IOPS I'm seeing. I'm using 16 targets,
with fio driving one target with each core you see sys activity on.

You can see that two additional cores are getting weighed down -- 0 and
6. Is that indicative of the bottleneck?

These results are without any of the patches suggested in this e-mail
thread. I'll have to try them and see if they help.

What is the top number of IOPS I should hope for with this system and
the Linux kernel? Dave Jiang (or anyone else) -- can you share the max
IOPS that you are seeing?

> It's not clear to me what the right answer or tradeoffs are here. It
> might make sense to use only one hyperthread per core for block
> softirqs. As I understand the Westmere cache topology, there's not
> really an obvious intermediate step -- all the cores in a package
> share the L3, and then each core has its own L2.
>
> Limiting softirqs to 10% of a core seems a bit low, since we seem to
> be able to use more than 100% of a core handling block softirqs, and
> anyway magic numbers like that seem to always be wrong sometimes.
> Perhaps we could use the queue length on the destination CPU as a
> proxy for how busy ksoftirq is?
>
>  - R.
* Re: rq_affinity doesn't seem to work?
From: ersatz splatt @ 2011-07-16 2:12 UTC
To: Roland Dreier
Cc: Matthew Wilcox; Jens Axboe; Jiang, Dave; Williams, Dan J; Foong, Annie; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; Nadolski, Edmund; Skirvin, Jeffrey D

With the quickest and easiest fix (the first suggestion from Jens
Axboe), I was able to get another 20%+ in IOPS. Thank you.

(Pardon the previous ugly wrap on the data. I'm not sure how to stop
that with my e-mail vendor.)

Driving more IOPS on the same system looks like this for me:

CPU   %usr  %nice   %sys  %iowait   %irq   %soft  %steal  %guest   %idle
all   2.85   0.00  31.37    12.05   0.00   14.84    0.00    0.00   38.90
  0   2.44   0.00   0.00     0.00   0.00    4.39    0.00    0.00   93.17
  1   0.00   0.00   0.00     0.00   0.00    0.00    0.00    0.00  100.00
  2   1.51   0.00  23.12    70.85   0.00    4.52    0.00    0.00    0.00
  3   5.05   0.00  51.01    19.70   0.00   24.24    0.00    0.00    0.00
  4   5.47   0.00  62.19     1.00   0.00   31.34    0.00    0.00    0.00
  5   4.00   0.00  50.00    22.50   0.00   23.50    0.00    0.00    0.00
  6   0.00   0.00   0.00     0.00   0.00    0.47    0.00    0.00   99.53
  7   0.00   0.00   0.22     0.00   0.00    0.00    0.00    0.00   99.78
  8   4.48   0.00  53.23    16.92   0.00   25.37    0.00    0.00    0.00
  9   4.48   0.00  50.25    19.40   0.00   25.87    0.00    0.00    0.00
 10   5.53   0.00  63.82     0.50   0.00   30.15    0.00    0.00    0.00
 11   3.50   0.00  52.00    20.50   0.00   24.00    0.00    0.00    0.00
 12   0.50   0.00   1.00     1.49   0.00    0.00    0.00    0.00   97.01
 13   0.00   0.00   0.00     0.00   0.00    0.00    0.00    0.00  100.00
 14   3.50   0.00  43.50    35.50   0.00   17.50    0.00    0.00    0.00
 15   4.02   0.00  51.26    20.60   0.00   24.12    0.00    0.00    0.00
 16   6.03   0.00  57.29     8.54   0.00   28.14    0.00    0.00    0.00
 17   4.50   0.00  49.00    25.00   0.00   21.50    0.00    0.00    0.00
 18   0.00   0.00   0.00     0.00   0.00    0.00    0.00    0.00  100.00
 19   0.00   0.00   0.00     0.00   0.00    0.00    0.00    0.00  100.00
 20   4.98   0.00  57.21    11.44   0.00   26.37    0.00    0.00    0.00
 21   4.50   0.00  54.00    16.00   0.00   25.50    0.00    0.00    0.00
 22   5.50   0.00  58.00     7.00   0.00   29.50    0.00    0.00    0.00
 23   4.00   0.00  49.50    22.50   0.00   24.00    0.00    0.00    0.00

I'm happy to have the performance improvement, but I would like to know
how I could do much better. The storage hardware is all capable of
about twice the IOPS I'm getting now. I see that "sys" is eating most
of the CPU time at this point. What do I need to fix? Is fio too heavy
in implementation? ... or is this a scsi midlayer bottleneck? I would
be happy to get advice on what I should do to better illuminate the
bottleneck.
* Re: rq_affinity doesn't seem to work?
From: Christoph Hellwig @ 2011-07-16 2:40 UTC
To: ersatz splatt
Cc: Roland Dreier; Matthew Wilcox; Jens Axboe; Jiang, Dave; Williams, Dan J; Foong, Annie; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; Nadolski, Edmund; Skirvin, Jeffrey D

On Fri, Jul 15, 2011 at 04:43:44PM -0700, ersatz splatt wrote:
> I have a dual-socket system with the Tylersburg chipset (approximately
> Westmere, I gather). With two Xeon X5660 packages I get this when
> running with more IOPS potential than the system can handle:

What HBA do you use? Does it already have a lockless ->queuecommand?
End of thread [newest: 2011-07-16 2:40 UTC]

Thread overview: 10+ messages
2011-07-12 19:03 rq_affinity doesn't seem to work? Jiang, Dave
2011-07-12 20:30 ` Jens Axboe
2011-07-12 21:17 ` Jiang, Dave
2011-07-13 17:10 ` Matthew Wilcox
2011-07-13 18:00 ` Jens Axboe
2011-07-14 17:02 ` Roland Dreier
2011-07-15 20:20 ` Dan Williams
2011-07-15 23:43 ` ersatz splatt
2011-07-16  2:12 ` ersatz splatt
2011-07-16  2:40 ` Christoph Hellwig