From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752865AbaIJATN (ORCPT );
	Tue, 9 Sep 2014 20:19:13 -0400
Received: from g2t1383g.austin.hp.com ([15.217.136.92]:4625
	"EHLO g2t1383g.austin.hp.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1752208AbaIJATF (ORCPT );
	Tue, 9 Sep 2014 20:19:05 -0400
Subject: [PATCH 1/2] block: default to rq_affinity=2 for blk-mq
To: axboe@kernel.dk, elliott@hp.com, hch@lst.de,
	linux-kernel@vger.kernel.org
From: Robert Elliott
Date: Tue, 09 Sep 2014 19:18:01 -0500
Message-ID: <20140910001801.9294.79720.stgit@beardog.cce.hp.com>
In-Reply-To: <20140910001417.9294.40414.stgit@beardog.cce.hp.com>
References: <20140910001417.9294.40414.stgit@beardog.cce.hp.com>
User-Agent: StGit/0.15
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Robert Elliott

One change introduced by blk-mq is that it does all the completion
work in hard irq context rather than soft irq context.

On a 6 core system, if all interrupts are routed to one CPU, then
you can easily run into this:
* 5 CPUs submitting IOs
* 1 CPU spending 100% of its time in hard irq context processing
  IO completions, not able to submit anything itself

Example with CPU5 receiving all interrupts:

CPU usage:
                  CPU0    CPU1    CPU2    CPU3    CPU4    CPU5
  %usr:           0.00    3.03    1.01    2.02    2.00    0.00
  %sys:          14.58   75.76   14.14    4.04   78.00    0.00
  %irq:           0.00    0.00    0.00    1.01    0.00  100.00
  %soft:          0.00    0.00    0.00    0.00    0.00    0.00
  %iowait idle:  85.42   21.21   84.85   92.93   20.00    0.00
  %idle:          0.00    0.00    0.00    0.00    0.00    0.00

When the submitting CPUs are forced to process their own completion
interrupts, this steals time from new submissions and self-throttles
them.  Without that, there is no direct feedback to the submitters
to slow down.
The only feedback is:
* reaching max queue depth
* lots of timeouts, resulting in aborts, resets, soft lockups and
  self-detected stalls on CPU5, bogus clocksource tsc unstable
  reports, network drop-offs, etc.

The SCSI LLD can set affinity_hint for each of its interrupts to
request that a program like irqbalance route the interrupts back to
the submitting CPU.  The latest version of irqbalance ignores those
hints, though, instead offering an option to run a policy script
that could honor them.  Otherwise, it balances them based on its own
algorithms.  So, we cannot rely on this.

Hardware might perform interrupt coalescing to help, but it cannot
help 1 CPU keep up with the work generated by many other CPUs.

rq_affinity=2 helps by pushing most of the block layer and SCSI
midlayer completion work back to the submitting CPU (via an IPI).

Change the default to rq_affinity=2 under blk-mq so there's at
least some feedback to slow down the submitters.

Signed-off-by: Robert Elliott
---
 include/linux/blkdev.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 518b465..9f41a02 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -522,7 +522,8 @@ struct request_queue {
 				 (1 << QUEUE_FLAG_ADD_RANDOM))
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
-				 (1 << QUEUE_FLAG_SAME_COMP))
+				 (1 << QUEUE_FLAG_SAME_COMP) |		\
+				 (1 << QUEUE_FLAG_SAME_FORCE))
 
 static inline void queue_lockdep_assert_held(struct request_queue *q)
 {
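[Not part of the original mail.] The rq_affinity behavior this patch
changes the default of is also tunable per-queue at runtime through
sysfs, which is a convenient way to test the effect before/after.
A minimal sketch, assuming a hypothetical device name "sda"
(substitute your own block device; writing the knob requires root):

```shell
#!/bin/sh
# Inspect (and optionally force) rq_affinity for one block device.
# "sda" is an assumed example device, not taken from the patch.
dev=sda
knob="/sys/block/${dev}/queue/rq_affinity"

if [ -r "$knob" ]; then
    # Values: 0 = complete on any CPU,
    #         1 = complete in the submitting CPU's group (QUEUE_FLAG_SAME_COMP),
    #         2 = force completion on the exact submitting CPU
    #             (QUEUE_FLAG_SAME_COMP + QUEUE_FLAG_SAME_FORCE)
    echo "current rq_affinity for ${dev}: $(cat "$knob")"
    # To force completions back to the submitting CPU (as root):
    # echo 2 > "$knob"
else
    echo "device ${dev} not present on this system; nothing to show"
fi
```

With the patch applied, blk-mq queues would already report 2 here by
default; on older kernels this lets you opt in per device.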