From: Jan Kara <jack@suse.cz>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [PATCH] block: strict rq_affinity
Date: Fri, 10 Oct 2014 16:23:10 +0200 [thread overview]
Message-ID: <1412951028-4085-6-git-send-email-jack@suse.cz> (raw)
In-Reply-To: <1412951028-4085-1-git-send-email-jack@suse.cz>
From: Dan Williams <dan.j.williams@intel.com>
Some systems benefit from completions always being steered to the strict
requester cpu rather than the looser "per-socket" steering that
blk_cpu_to_group() attempts by default. This is because the first
CPU in the group mask ends up being completely overloaded with work,
while the others (including the original submitter) has power left
to spare.
Allow the strict mode to be set by writing '2' to the sysfs control
file. This is identical to the scheme used for the nomerges file,
where '2' is a more aggressive setting than just being turned on.
echo 2 > /sys/block/<bdev>/queue/rq_affinity
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
Documentation/block/queue-sysfs.txt | 10 +++++++---
block/blk-core.c | 6 ++----
block/blk-softirq.c | 11 +++++++----
block/blk-sysfs.c | 13 +++++++++----
include/linux/blkdev.h | 3 ++-
5 files changed, 27 insertions(+), 16 deletions(-)
diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index f65274081c8d..d8147b336c35 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -45,9 +45,13 @@ device.
rq_affinity (RW)
----------------
-If this option is enabled, the block layer will migrate request completions
-to the CPU that originally submitted the request. For some workloads
-this provides a significant reduction in CPU cycles due to caching effects.
+If this option is '1', the block layer will migrate request completions to the
+cpu "group" that originally submitted the request. For some workloads this
+provides a significant reduction in CPU cycles due to caching effects.
+
+For storage configurations that need to maximize distribution of completion
+processing setting this option to '2' forces the completion to run on the
+requesting cpu (bypassing the "group" aggregation logic).
scheduler (RW)
--------------
diff --git a/block/blk-core.c b/block/blk-core.c
index a56485292062..b3228255304d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1279,10 +1279,8 @@ get_rq:
init_request_from_bio(req, bio);
if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
- bio_flagged(bio, BIO_CPU_AFFINE)) {
- req->cpu = blk_cpu_to_group(get_cpu());
- put_cpu();
- }
+ bio_flagged(bio, BIO_CPU_AFFINE))
+ req->cpu = smp_processor_id();
plug = current->plug;
if (plug) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index ee9c21602228..475fab809a80 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -103,22 +103,25 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = {
void __blk_complete_request(struct request *req)
{
+ int ccpu, cpu, group_cpu = NR_CPUS;
struct request_queue *q = req->q;
unsigned long flags;
- int ccpu, cpu, group_cpu;
BUG_ON(!q->softirq_done_fn);
local_irq_save(flags);
cpu = smp_processor_id();
- group_cpu = blk_cpu_to_group(cpu);
/*
* Select completion CPU
*/
- if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1)
+ if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1) {
ccpu = req->cpu;
- else
+ if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags)) {
+ ccpu = blk_cpu_to_group(ccpu);
+ group_cpu = blk_cpu_to_group(cpu);
+ }
+ } else
ccpu = cpu;
if (ccpu == cpu || ccpu == group_cpu) {
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d935bd859c87..0ee17b5e7fb6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -244,8 +244,9 @@ static ssize_t queue_nomerges_store(struct request_queue *q, const char *page,
static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
{
bool set = test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags);
+ bool force = test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags);
- return queue_var_show(set, page);
+ return queue_var_show(set << force, page);
}
static ssize_t
@@ -257,10 +258,14 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
ret = queue_var_store(&val, page, count);
spin_lock_irq(q->queue_lock);
- if (val)
+ if (val) {
queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
- else
- queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
+ if (val == 2)
+ queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
+ } else {
+ queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
+ queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
+ }
spin_unlock_irq(q->queue_lock);
#endif
return ret;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c0cd9a2f22ef..0e67c45b3bc9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -392,7 +392,7 @@ struct request_queue {
#define QUEUE_FLAG_ELVSWITCH 6 /* don't use elevator, just do FIFO */
#define QUEUE_FLAG_BIDI 7 /* queue supports bidi requests */
#define QUEUE_FLAG_NOMERGES 8 /* disable merge attempts */
-#define QUEUE_FLAG_SAME_COMP 9 /* force complete on same CPU */
+#define QUEUE_FLAG_SAME_COMP 9 /* complete on same CPU-group */
#define QUEUE_FLAG_FAIL_IO 10 /* fake timeout */
#define QUEUE_FLAG_STACKABLE 11 /* supports request stacking */
#define QUEUE_FLAG_NONROT 12 /* non-rotational device (SSD) */
@@ -402,6 +402,7 @@ struct request_queue {
#define QUEUE_FLAG_NOXMERGES 15 /* No extended merges */
#define QUEUE_FLAG_ADD_RANDOM 16 /* Contributes to random pool */
#define QUEUE_FLAG_SECDISCARD 17 /* supports SECDISCARD */
+#define QUEUE_FLAG_SAME_FORCE 18 /* force complete on same CPU */
#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_STACKABLE) | \
--
1.8.1.4
next prev parent reply other threads:[~2014-10-10 14:23 UTC|newest]
Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-10-10 14:23 [Cluster-devel] [PATCH 0/2 v2] Fix data corruption when blocksize < pagesize for mmapped data Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH 1/2 RESEND] bdi: Fix hung task on sync Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] block: free q->flush_rq in blk_init_allocated_queue error paths Jan Kara
2014-10-10 15:19 ` Dave Jones
2014-10-10 15:32 ` Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] block: improve rq_affinity placement Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] block: Make rq_affinity = 1 work as expected Jan Kara
2014-10-10 14:23 ` Jan Kara [this message]
2014-10-10 14:23 ` [Cluster-devel] [PATCH] ext3: Fix deadlock in data=journal mode when fs is frozen Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] ext3: Speedup WB_SYNC_ALL pass Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] ext4: Avoid lock inversion between i_mmap_mutex and transaction start Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH 1/2] ext4: Don't check quota format when there are no quota files Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH 1/2] ext4: Fix block zeroing when punching holes in indirect block files Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] ext4: Fix buffer double free in ext4_alloc_branch() Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] ext4: Fix jbd2 warning under heavy xattr load Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] ext4: Fix zeroing of page during writeback Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] ext4: Remove orphan list handling Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] ext4: Speedup WB_SYNC_ALL pass Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH for 3.14-stable] fanotify: fix double free of pending permission events Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] fs: Avoid userspace mounting anon_inodefs filesystem Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH 1/2] jbd2: Avoid pointless scanning of checkpoint lists Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] jbd2: Optimize jbd2_log_do_checkpoint() a bit Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] lockdep: Dump info via tracing Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] mm: Fixup pagecache_isize_extended() definitions for !CONFIG_MMU Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] ncpfs: fix rmdir returns Device or resource busy Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] ocfs2: Fix quota file corruption Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH 1/2] printk: Debug patch1 Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] printk: debug: Slow down printing to 9600 bauds Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] printk: enable interrupts before calling console_trylock_for_printk() Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] quota: Fix race between dqput() and dquot_scan_active() Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] scsi: Keep interrupts disabled while submitting requests Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] sync: don't block the flusher thread waiting on IO Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] udf: Avoid infinite loop when processing indirect ICBs Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] udf: Print error when inode is loaded Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] vfs: Allocate anon_inode_inode in anon_inode_init() Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH RESEND] vfs: Return EINVAL for default SEEK_HOLE, SEEK_DATA implementation Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] writeback: plug writeback at a high level Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH] x86: Fixup lockdep complaint caused by io apic code Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH 2/2 RESEND] bdi: Avoid oops on device removal Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH 2/2] ext3: Don't check quota format when there are no quota files Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH 2/2] ext4: Fix hole punching for files with indirect blocks Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH 2/2] ext4: Fix mmap data corruption when blocksize < pagesize Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH 2/2] jbd2: Simplify calling convention around __jbd2_journal_clean_checkpoint_list Jan Kara
2014-10-10 14:23 ` [Cluster-devel] [PATCH 2/2] printk: Debug patch 2 Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1412951028-4085-6-git-send-email-jack@suse.cz \
--to=jack@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).