Re: [PATCH 1/2] block: default to rq_affinity=2 for blk-mq

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jens Axboe <axboe@kernel.dk>
To: "Elliott, Robert (Server Storage)" <Elliott@hp.com>,
	Robert Elliott <relliott@beardog.cce.hp.com>,
	"hch@lst.de" <hch@lst.de>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] block: default to rq_affinity=2 for blk-mq
Date: Wed, 10 Sep 2014 13:51:49 -0600	[thread overview]
Message-ID: <5410ABD5.9060106@kernel.dk> (raw)
In-Reply-To: <94D0CD8314A33A4D9D801C0FE68B402958C71B00@G9W0745.americas.hpqcorp.net>

On 09/10/2014 01:35 PM, Elliott, Robert (Server Storage) wrote:
> 
> 
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@kernel.dk]
>> Sent: Wednesday, 10 September, 2014 1:15 PM
>> To: Robert Elliott; Elliott, Robert (Server Storage); hch@lst.de;
>> linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH 1/2] block: default to rq_affinity=2 for blk-mq
>>
>> On 09/09/2014 06:18 PM, Robert Elliott wrote:
>>> From: Robert Elliott <elliott@hp.com>
>>>
>>> One change introduced by blk-mq is that it does all
>>> the completion work in hard irq context rather than
>>> soft irq context.
>>>
>>> On a 6 core system, if all interrupts are routed to
>>> one CPU, then you can easily run into this:
>>> * 5 CPUs submitting IOs
>>> * 1 CPU spending 100% of its time in hard irq context
>>> processing IO completions, not able to submit anything
>>> itself
>>>
>>> Example with CPU5 receiving all interrupts:
>>>    CPU usage:   CPU0   CPU1   CPU2   CPU3   CPU4   CPU5
>>>         %usr:   0.00   3.03   1.01   2.02   2.00   0.00
>>>         %sys:  14.58  75.76  14.14   4.04  78.00   0.00
>>>         %irq:   0.00   0.00   0.00   1.01   0.00 100.00
>>>        %soft:   0.00   0.00   0.00   0.00   0.00   0.00
>>> %iowait idle:  85.42  21.21  84.85  92.93  20.00   0.00
>>>        %idle:   0.00   0.00   0.00   0.00   0.00   0.00
>>>
>>> When the submitting CPUs are forced to process their own
>>> completion interrupts, this steals time from new
>>> submissions and self-throttles them.
>>>
>>> Without that, there is no direct feedback to the
>>> submitters to slow down.  The only feedback is:
>>> * reaching max queue depth
>>> * lots of timeouts, resulting in aborts, resets, soft
>>>   lockups and self-detected stalls on CPU5, bogus
>>>   clocksource tsc unstable reports, network
>>>   drop-offs, etc.
>>>
>>> The SCSI LLD can set affinity_hint for each of its
>>> interrupts to request that a program like irqbalance
>>> route the interrupts back to the submitting CPU.
>>> The latest version of irqbalance ignores those hints,
>>> though, instead offering an option to run a policy
>>> script that could honor them. Otherwise, it balances
>>> them based on its own algorithms. So, we cannot rely
>>> on this.
>>>
>>> Hardware might perform interrupt coalescing to help,
>>> but it cannot help 1 CPU keep up with the work
>>> generated by many other CPUs.
>>>
>>> rq_affinity=2 helps by pushing most of the block layer
>>> and SCSI midlayer completion work back to the submitting
>>> CPU (via an IPI).
>>>
>>> Change the default rq_affinity=2 under blk-mq
>>> so there's at least some feedback to slow down the
>>> submitters.
>>
>> I don't think we should do this generically. For "sane" devices with
>> multiple completion queues, and with proper affinity setting in the
>> driver, this is going to be a loss.
>>
>> So lets not add it to QUEUE_FLAG_MQ_DEFAULT, but we can make it
>> default
>> for nr_hw_queues == 1. I think that would be way saner.
>>
>> --
>> Jens Axboe
> 
> If the interrupt does arrive on the submitting CPU, then it 
> meets the criteria for all the cases:
> * 1: complete on any CPU
> * 2: complete on submitting CPU's node (QUEUE_FLAG_SAME_COMP)
> * 3: complete on submitting CPU (QUEUE_FLAG_SAME_FORCE)
> 
> and _blk_complete_request handles it locally rather
> than sending an IPI.
> 
>         if (req->cpu != -1) {
>                 ccpu = req->cpu;
>                 if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
>                         shared = cpus_share_cache(cpu, ccpu);
>         } else
>                 ccpu = cpu;
>         ...
>         if (ccpu == cpu || shared) {
>                 struct list_head *list;
> do_local:
>         ...
>         } else if (raise_blk_irq(ccpu, req))
>                 goto do_local;

I forgot about the shared case being handled appropriately, so that
should probably be fine to do. My primary concern here is a performance
penalty on sync IO, I'll run a few test on a single IRQ case (like the
mtip32xx) and see how that performs. But you are right, it might not be
a bad thing to do by default.

> Are you saying you want the blk_queue_bio submission to 
> not even set the req->cpu field (which defaulted to -1):
>         if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))
>                 req->cpu = raw_smp_processor_id();
> 
> when you expect the interrupt routing is good so that
> _blk_complete_request can avoid the test_bit and 
> cpus_share_cache calls?

No, and since those are non-serializing tests, I suspect if we start
adding a branch to avoid that we will negate any potential win on that.
The flags should basically never get dirtied. Well I guess they could
for heavy uses of start/stop queue, but that might be something that's
worthwhile to tackle separately.

> With irqbalance no longer honoring affinity_hint
> by default, I'm worried that most LLDs will not find
> their interrupts routed that way anymore.  That's
> how we ran into this; scsi-mq + kernel-3.17 on an
> up-to-date RHEL 6.5 distro (which now carries the
> new irqbalance).
> 
> We plan to create a policyscript for the new irqbalance
> for hpsa devices, but other high-IOPS drivers will hit
> the same problem.

irqbalance has _always_ been a pain in the butt... Suboptimal or
changing behaviour from release to release, it's been one of the most
annoying parts of performance tuning.

-- 
Jens Axboe

next prev parent reply	other threads:[~2014-09-10 19:51 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-09-10  0:17 [PATCH 0/2] block: rq_affinity default and reserved tag limits Robert Elliott
2014-09-10  0:18 ` [PATCH 1/2] block: default to rq_affinity=2 for blk-mq Robert Elliott
2014-09-10 18:14   ` Jens Axboe
2014-09-10 19:35     ` Elliott, Robert (Server Storage)
2014-09-10 19:51       ` Jens Axboe [this message]
2014-09-10  0:18 ` [PATCH 2/2] block: return error if too many reserved tags are requested Robert Elliott
2014-09-10 18:17   ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5410ABD5.9060106@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=Elliott@hp.com \
    --cc=hch@lst.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=relliott@beardog.cce.hp.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.