* [PATCHSET RFC 0/2] mq-deadline scalability improvements
@ 2024-01-18 18:04 Jens Axboe
From: Jens Axboe @ 2024-01-18 18:04 UTC (permalink / raw)
To: linux-block; +Cc: bvanassche
Hi,
It's no secret that mq-deadline doesn't scale very well - it was
originally done as a proof-of-concept conversion from deadline, when the
blk-mq multiqueue layer was written. In the single queue world, the
queue lock protected the IO scheduler as well, and mq-deadline simply
adopted an internal dd->lock to fill the place of that.
While mq-deadline works under blk-mq and doesn't suffer any scaling
issues on that side, as soon as request insertion or dispatch is done,
we're hitting the per-queue dd->lock quite intensely. On a basic test box
with 16 cores / 32 threads, running a number of IO intensive threads
on either null_blk (single hw queue) or nvme0n1 (many hw queues) shows
this quite easily:
Device QD Jobs IOPS Lock contention
=======================================================
null_blk 4 32 1090K 92%
nvme0n1 4 32 1070K 94%
which looks pretty miserable; most of the time is spent contending on
the queue lock.
This RFC patchset attempts to address that by:
1) Serializing dispatch of requests. If we fail dispatching, rely on
the next completion to dispatch the next one. This could potentially
reduce the overall depth achieved on the device side, however even
for the heavily contended test I'm running here, no observable
change is seen. This is patch 1.
2) Serialize request insertion, using internal per-cpu lists to
temporarily store requests until insertion can proceed. This is
patch 2.
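As an aside for readers following along, the single-dispatcher gate from patch 1 can be modeled in userspace with C11 atomics. This is a sketch, not the kernel code: the kernel uses test_bit()/test_and_set_bit() on dd->run_state, and the names and primitives below are stand-ins.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the DD_DISPATCHING gate: only one caller at a time
 * "wins" and goes on to dispatch; everyone else returns immediately and
 * relies on a later IO completion to trigger another dispatch run. The
 * plain load before the atomic exchange mirrors the
 * test_bit-before-test_and_set_bit pattern: already-contended callers
 * bail out without issuing a cacheline-dirtying atomic RMW. */
static atomic_bool dispatching;

static bool try_enter_dispatch(void)
{
	/* cheap read-only check first: no RMW if someone is dispatching */
	if (atomic_load_explicit(&dispatching, memory_order_relaxed))
		return false;
	/* exchange returns the previous value; false means we won the gate */
	return !atomic_exchange_explicit(&dispatching, true,
					 memory_order_acquire);
}

static void exit_dispatch(void)
{
	atomic_store_explicit(&dispatching, false, memory_order_release);
}
```

A losing caller simply does nothing; correctness then hinges on there being in-flight IO whose completion re-runs dispatch, which is exactly the trade-off described above.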
With that in place, the same test case now does:
Device QD Jobs IOPS Contention Diff
=============================================================
null_blk 4 32 2250K 28% +106%
nvme0n1 4 32 2560K 23% +112%
and while that doesn't completely eliminate the lock contention, it's
oodles better than what it was before. The throughput increase shows
that nicely, with more than 100% improvement for both cases.
block/mq-deadline.c | 146 ++++++++++++++++++++++++++++++++++++++++----
1 file changed, 133 insertions(+), 13 deletions(-)
--
Jens Axboe
^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] block/mq-deadline: serialize request dispatching
From: Jens Axboe @ 2024-01-18 18:04 UTC (permalink / raw)
To: linux-block; +Cc: bvanassche, Jens Axboe

If we're entering request dispatch but someone else is already
dispatching, then just skip this dispatch. We know IO is inflight and
this will trigger another dispatch event for any completion. This will
potentially cause slightly lower queue depth for contended cases, but
those are slowed down anyway and this should not cause an issue.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/mq-deadline.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f958e79277b8..9e0ab3ea728a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -79,10 +79,20 @@ struct dd_per_prio {
 	struct io_stats_per_prio stats;
 };
 
+enum {
+	DD_DISPATCHING	= 0,
+};
+
 struct deadline_data {
 	/*
 	 * run time data
 	 */
+	struct {
+		spinlock_t lock;
+		spinlock_t zone_lock;
+	} ____cacheline_aligned_in_smp;
+
+	unsigned long run_state;
 
 	struct dd_per_prio per_prio[DD_PRIO_COUNT];
 
@@ -100,9 +110,6 @@ struct deadline_data {
 	int front_merges;
 	u32 async_depth;
 	int prio_aging_expire;
-
-	spinlock_t lock;
-	spinlock_t zone_lock;
 };
 
 /* Maps an I/O priority class to a deadline scheduler priority. */
@@ -600,6 +607,15 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	struct request *rq;
 	enum dd_prio prio;
 
+	/*
+	 * If someone else is already dispatching, skip this one. This will
+	 * defer the next dispatch event to when something completes, and could
+	 * potentially lower the queue depth for contended cases.
+	 */
+	if (test_bit(DD_DISPATCHING, &dd->run_state) ||
+	    test_and_set_bit(DD_DISPATCHING, &dd->run_state))
+		return NULL;
+
 	spin_lock(&dd->lock);
 	rq = dd_dispatch_prio_aged_requests(dd, now);
 	if (rq)
@@ -616,6 +632,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	}
 
 unlock:
+	clear_bit(DD_DISPATCHING, &dd->run_state);
 	spin_unlock(&dd->lock);
 
 	return rq;
@@ -706,6 +723,9 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 
 	eq->elevator_data = dd;
 
+	spin_lock_init(&dd->lock);
+	spin_lock_init(&dd->zone_lock);
+
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
@@ -722,8 +742,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->last_dir = DD_WRITE;
 	dd->fifo_batch = fifo_batch;
 	dd->prio_aging_expire = prio_aging_expire;
-	spin_lock_init(&dd->lock);
-	spin_lock_init(&dd->zone_lock);
 
 	/* We dispatch from request queue wide instead of hw queue */
 	blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] block/mq-deadline: serialize request dispatching 2024-01-18 18:04 ` [PATCH 1/2] block/mq-deadline: serialize request dispatching Jens Axboe @ 2024-01-18 18:24 ` Bart Van Assche 2024-01-18 18:45 ` Jens Axboe 2024-01-19 2:40 ` Ming Lei 1 sibling, 1 reply; 20+ messages in thread From: Bart Van Assche @ 2024-01-18 18:24 UTC (permalink / raw) To: Jens Axboe, linux-block On 1/18/24 10:04, Jens Axboe wrote: > If we're entering request dispatch but someone else is already > dispatching, then just skip this dispatch. We know IO is inflight and > this will trigger another dispatch event for any completion. This will > potentially cause slightly lower queue depth for contended cases, but > those are slowed down anyway and this should not cause an issue. Shouldn't a performance optimization patch include numbers that show by how much that patch improves the performance of different workloads? > struct deadline_data { > /* > * run time data > */ > + struct { > + spinlock_t lock; > + spinlock_t zone_lock; > + } ____cacheline_aligned_in_smp; Is ____cacheline_aligned_in_smp useful here? struct deadline_data is not embedded in any other data structure but is allocated with kzalloc_node(). > + /* > + * If someone else is already dispatching, skip this one. This will > + * defer the next dispatch event to when something completes, and could > + * potentially lower the queue depth for contended cases. > + */ > + if (test_bit(DD_DISPATCHING, &dd->run_state) || > + test_and_set_bit(DD_DISPATCHING, &dd->run_state)) > + return NULL; > + The above code behaves similar to spin_trylock(). Has it been considered to use spin_trylock() instead? Thanks, Bart. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] block/mq-deadline: serialize request dispatching 2024-01-18 18:24 ` Bart Van Assche @ 2024-01-18 18:45 ` Jens Axboe 2024-01-18 18:51 ` Bart Van Assche 0 siblings, 1 reply; 20+ messages in thread From: Jens Axboe @ 2024-01-18 18:45 UTC (permalink / raw) To: Bart Van Assche, linux-block On 1/18/24 11:24 AM, Bart Van Assche wrote: > On 1/18/24 10:04, Jens Axboe wrote: >> If we're entering request dispatch but someone else is already >> dispatching, then just skip this dispatch. We know IO is inflight and >> this will trigger another dispatch event for any completion. This will >> potentially cause slightly lower queue depth for contended cases, but >> those are slowed down anyway and this should not cause an issue. > > Shouldn't a performance optimization patch include numbers that show by > how much that patch improves the performance of different workloads? The commit messages may want some editing, but all of that is in the cover letter that I'm sure you saw. This is just a POC/RFC posting. >> struct deadline_data { >> /* >> * run time data >> */ >> + struct { >> + spinlock_t lock; >> + spinlock_t zone_lock; >> + } ____cacheline_aligned_in_smp; > > Is ____cacheline_aligned_in_smp useful here? struct deadline_data > is not embedded in any other data structure but is allocated with kzalloc_node(). It's not for alignment of deadline_data, it's for alignment within deadline_data so that grabbing/releasing these locks don't share a cacheline with other bits. We could ensure that deadline_data itself is aligned as well, that might be a good idea. >> + /* >> + * If someone else is already dispatching, skip this one. This will >> + * defer the next dispatch event to when something completes, and could >> + * potentially lower the queue depth for contended cases. >> + */ >> + if (test_bit(DD_DISPATCHING, &dd->run_state) || >> + test_and_set_bit(DD_DISPATCHING, &dd->run_state)) >> + return NULL; >> + > > The above code behaves similar to spin_trylock(). 
Has it been > considered to use spin_trylock() instead? Do you read the replies to the emails, from the other thread? First of all, you'd need another lock for this. And secondly, this avoids a RMW if it's already set. So no, I think this is cleaner than a separate lock, and you'd still have cacheline bouncing on that for ALL calls if you did it that way. -- Jens Axboe ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] block/mq-deadline: serialize request dispatching
From: Bart Van Assche @ 2024-01-18 18:51 UTC (permalink / raw)
To: Jens Axboe, linux-block

On 1/18/24 10:45, Jens Axboe wrote:
> Do you read the replies to the emails, from the other thread?

Yes.

> And secondly, this avoids a RMW if it's already set.

From the spinlock implementation (something I looked up before I wrote
my previous reply):

static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
	int val = atomic_read(&lock->val);

	if (unlikely(val))
		return 0;

	return likely(atomic_try_cmpxchg_acquire(&lock->val, &val,
						 _Q_LOCKED_VAL));
}

I think this implementation does not perform a RMW if the spinlock is
already locked.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] block/mq-deadline: serialize request dispatching 2024-01-18 18:51 ` Bart Van Assche @ 2024-01-18 18:55 ` Jens Axboe 0 siblings, 0 replies; 20+ messages in thread From: Jens Axboe @ 2024-01-18 18:55 UTC (permalink / raw) To: Bart Van Assche, linux-block On 1/18/24 11:51 AM, Bart Van Assche wrote: > On 1/18/24 10:45, Jens Axboe wrote: >> Do you read the replies to the emails, from the other thread? > > Yes. Well it's been a bit frustrating to say the least, as you seem to either not read them or just ignore what's in them. > > And secondly, this avoids a RMW if it's already set. > From the spinlock implementation (something I looked up before I wrote > my previous reply): > > static __always_inline int queued_spin_trylock(struct qspinlock *lock) > { > int val = atomic_read(&lock->val); > > if (unlikely(val)) > return 0; > > return likely(atomic_try_cmpxchg_acquire(&lock->val, &val, > _Q_LOCKED_VAL)); > } > > I think this implementation does not perform a RMW if the spinlock is > already locked. This looks like a generic trylock, is the same true for the arch variants? Because your attempt either failed because of RMW or because you are hammering it repeatedly, or both. In any case, even if any trylock would avoid the initial atomic, it doesn't give you anything that a bitop would not just do, and we use the latter trick elsewhere in blk-mq. -- Jens Axboe ^ permalink raw reply [flat|nested] 20+ messages in thread
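To make the comparison in the exchange above concrete, here is a userspace sketch of the read-first trylock shape Bart quotes, written with C11 atomics rather than the kernel's qspinlock types. The type and function names are stand-ins, and arch-specific trylock implementations may differ, which is exactly the open question raised above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Read the lock word first and bail without any RMW when it is already
 * held; only an unlocked observation proceeds to the cmpxchg. This is
 * the same "cheap read before atomic" idea as the DD_DISPATCHING bit
 * test, just expressed as a trylock. */
typedef atomic_int model_spinlock_t;

static bool model_spin_trylock(model_spinlock_t *lock)
{
	int val = atomic_load_explicit(lock, memory_order_relaxed);

	if (val)		/* already locked: no cmpxchg issued */
		return false;
	return atomic_compare_exchange_strong_explicit(lock, &val, 1,
			memory_order_acquire, memory_order_relaxed);
}

static void model_spin_unlock(model_spinlock_t *lock)
{
	atomic_store_explicit(lock, 0, memory_order_release);
}
```

Whichever primitive is used, the contended fast path is a plain load; the difference debated above is only whether that load comes from a dedicated flag word or from the lock word itself.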
* Re: [PATCH 1/2] block/mq-deadline: serialize request dispatching 2024-01-18 18:04 ` [PATCH 1/2] block/mq-deadline: serialize request dispatching Jens Axboe 2024-01-18 18:24 ` Bart Van Assche @ 2024-01-19 2:40 ` Ming Lei 2024-01-19 15:49 ` Jens Axboe 1 sibling, 1 reply; 20+ messages in thread From: Ming Lei @ 2024-01-19 2:40 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-block, bvanassche, ming.lei On Thu, Jan 18, 2024 at 11:04:56AM -0700, Jens Axboe wrote: > If we're entering request dispatch but someone else is already > dispatching, then just skip this dispatch. We know IO is inflight and > this will trigger another dispatch event for any completion. This will > potentially cause slightly lower queue depth for contended cases, but > those are slowed down anyway and this should not cause an issue. > > Signed-off-by: Jens Axboe <axboe@kernel.dk> > --- > block/mq-deadline.c | 28 +++++++++++++++++++++++----- > 1 file changed, 23 insertions(+), 5 deletions(-) > > diff --git a/block/mq-deadline.c b/block/mq-deadline.c > index f958e79277b8..9e0ab3ea728a 100644 > --- a/block/mq-deadline.c > +++ b/block/mq-deadline.c > @@ -79,10 +79,20 @@ struct dd_per_prio { > struct io_stats_per_prio stats; > }; > > +enum { > + DD_DISPATCHING = 0, > +}; > + > struct deadline_data { > /* > * run time data > */ > + struct { > + spinlock_t lock; > + spinlock_t zone_lock; > + } ____cacheline_aligned_in_smp; > + > + unsigned long run_state; > > struct dd_per_prio per_prio[DD_PRIO_COUNT]; > > @@ -100,9 +110,6 @@ struct deadline_data { > int front_merges; > u32 async_depth; > int prio_aging_expire; > - > - spinlock_t lock; > - spinlock_t zone_lock; > }; > > /* Maps an I/O priority class to a deadline scheduler priority. */ > @@ -600,6 +607,15 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx) > struct request *rq; > enum dd_prio prio; > > + /* > + * If someone else is already dispatching, skip this one. 
This will > + * defer the next dispatch event to when something completes, and could > + * potentially lower the queue depth for contended cases. > + */ > + if (test_bit(DD_DISPATCHING, &dd->run_state) || > + test_and_set_bit(DD_DISPATCHING, &dd->run_state)) > + return NULL; > + This patch looks fine. BTW, the current dispatch is actually piggyback in the in-progress dispatch, see blk_mq_do_dispatch_sched(). And the correctness should depend on the looping dispatch & retry for nothing to dispatch in blk_mq_do_dispatch_sched(), maybe we need to document it here. Thanks, Ming ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 1/2] block/mq-deadline: serialize request dispatching 2024-01-19 2:40 ` Ming Lei @ 2024-01-19 15:49 ` Jens Axboe 0 siblings, 0 replies; 20+ messages in thread From: Jens Axboe @ 2024-01-19 15:49 UTC (permalink / raw) To: Ming Lei; +Cc: linux-block, bvanassche On 1/18/24 7:40 PM, Ming Lei wrote: > On Thu, Jan 18, 2024 at 11:04:56AM -0700, Jens Axboe wrote: >> If we're entering request dispatch but someone else is already >> dispatching, then just skip this dispatch. We know IO is inflight and >> this will trigger another dispatch event for any completion. This will >> potentially cause slightly lower queue depth for contended cases, but >> those are slowed down anyway and this should not cause an issue. >> >> Signed-off-by: Jens Axboe <axboe@kernel.dk> >> --- >> block/mq-deadline.c | 28 +++++++++++++++++++++++----- >> 1 file changed, 23 insertions(+), 5 deletions(-) >> >> diff --git a/block/mq-deadline.c b/block/mq-deadline.c >> index f958e79277b8..9e0ab3ea728a 100644 >> --- a/block/mq-deadline.c >> +++ b/block/mq-deadline.c >> @@ -79,10 +79,20 @@ struct dd_per_prio { >> struct io_stats_per_prio stats; >> }; >> >> +enum { >> + DD_DISPATCHING = 0, >> +}; >> + >> struct deadline_data { >> /* >> * run time data >> */ >> + struct { >> + spinlock_t lock; >> + spinlock_t zone_lock; >> + } ____cacheline_aligned_in_smp; >> + >> + unsigned long run_state; >> >> struct dd_per_prio per_prio[DD_PRIO_COUNT]; >> >> @@ -100,9 +110,6 @@ struct deadline_data { >> int front_merges; >> u32 async_depth; >> int prio_aging_expire; >> - >> - spinlock_t lock; >> - spinlock_t zone_lock; >> }; >> >> /* Maps an I/O priority class to a deadline scheduler priority. */ >> @@ -600,6 +607,15 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx) >> struct request *rq; >> enum dd_prio prio; >> >> + /* >> + * If someone else is already dispatching, skip this one. 
This will >> + * defer the next dispatch event to when something completes, and could >> + * potentially lower the queue depth for contended cases. >> + */ >> + if (test_bit(DD_DISPATCHING, &dd->run_state) || >> + test_and_set_bit(DD_DISPATCHING, &dd->run_state)) >> + return NULL; >> + > > This patch looks fine. > > BTW, the current dispatch is actually piggyback in the in-progress dispatch, > see blk_mq_do_dispatch_sched(). And the correctness should depend on > the looping dispatch & retry for nothing to dispatch in > blk_mq_do_dispatch_sched(), maybe we need to document it here. Thanks for taking a look, I'll add a comment. -- Jens Axboe ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention
From: Jens Axboe @ 2024-01-18 18:04 UTC (permalink / raw)
To: linux-block; +Cc: bvanassche, Jens Axboe

If we attempt to insert a list of requests but someone else is already
running an insertion, then fall back to queueing it internally and let
the existing inserter finish the operation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/mq-deadline.c | 118 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 110 insertions(+), 8 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 9e0ab3ea728a..eeeaaff189e1 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -81,8 +81,17 @@ struct dd_per_prio {
 
 enum {
 	DD_DISPATCHING	= 0,
+	DD_INSERTING	= 1,
 };
 
+#define DD_CPU_BUCKETS		32
+#define DD_CPU_BUCKETS_MASK	(DD_CPU_BUCKETS - 1)
+
+struct dd_bucket_list {
+	struct list_head list;
+	spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
 struct deadline_data {
 	/*
 	 * run time data
@@ -94,6 +103,9 @@ struct deadline_data {
 
 	unsigned long run_state;
 
+	atomic_t insert_seq;
+	struct dd_bucket_list bucket_lists[DD_CPU_BUCKETS];
+
 	struct dd_per_prio per_prio[DD_PRIO_COUNT];
 
 	/* Data direction of latest dispatched request. */
@@ -711,7 +723,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	struct deadline_data *dd;
 	struct elevator_queue *eq;
 	enum dd_prio prio;
-	int ret = -ENOMEM;
+	int i, ret = -ENOMEM;
 
 	eq = elevator_alloc(q, e);
 	if (!eq)
@@ -725,6 +737,12 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);
+	atomic_set(&dd->insert_seq, 0);
+
+	for (i = 0; i < DD_CPU_BUCKETS; i++) {
+		INIT_LIST_HEAD(&dd->bucket_lists[i].list);
+		spin_lock_init(&dd->bucket_lists[i].lock);
+	}
 
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
@@ -876,6 +894,67 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	}
 }
 
+static void dd_dispatch_from_buckets(struct deadline_data *dd,
+				     struct list_head *list)
+{
+	int i;
+
+	for (i = 0; i < DD_CPU_BUCKETS; i++) {
+		struct dd_bucket_list *bucket = &dd->bucket_lists[i];
+
+		if (list_empty_careful(&bucket->list))
+			continue;
+		spin_lock(&bucket->lock);
+		list_splice_init(&bucket->list, list);
+		spin_unlock(&bucket->lock);
+	}
+}
+
+/*
+ * If we can grab the dd->lock, then just return and do the insertion as per
+ * usual. If not, add to one of our internal buckets, and afterwards recheck
+ * if we should retry.
+ */
+static bool dd_insert_to_bucket(struct deadline_data *dd,
+				struct list_head *list, int *seq)
+	__acquires(&dd->lock)
+{
+	struct dd_bucket_list *bucket;
+	int next_seq;
+
+	*seq = atomic_read(&dd->insert_seq);
+
+	if (spin_trylock(&dd->lock))
+		return false;
+	if (!test_bit(DD_INSERTING, &dd->run_state)) {
+		spin_lock(&dd->lock);
+		return false;
+	}
+
+	*seq = atomic_inc_return(&dd->insert_seq);
+
+	bucket = &dd->bucket_lists[get_cpu() & DD_CPU_BUCKETS_MASK];
+	spin_lock(&bucket->lock);
+	list_splice_init(list, &bucket->list);
+	spin_unlock(&bucket->lock);
+	put_cpu();
+
+	/*
+	 * If seq still matches, we should be safe to just exit with the
+	 * pending requests queued in a bucket.
+	 */
+	next_seq = atomic_inc_return(&dd->insert_seq);
+	if (next_seq == *seq + 1)
+		return true;
+
+	/*
+	 * Seq changed, be safe and grab the lock and insert. Don't update
+	 * sequence, so that we flush the buckets too.
+	 */
+	spin_lock(&dd->lock);
+	return false;
+}
+
 /*
  * Called from blk_mq_insert_request() or blk_mq_dispatch_plug_list().
  */
@@ -886,16 +965,39 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 	struct request_queue *q = hctx->queue;
 	struct deadline_data *dd = q->elevator->elevator_data;
 	LIST_HEAD(free);
+	int seq;
 
-	spin_lock(&dd->lock);
-	while (!list_empty(list)) {
-		struct request *rq;
+	/*
+	 * If insertion is busy and we ended up adding to our internal bucket,
+	 * then we're done for now.
+	 */
+	if (dd_insert_to_bucket(dd, list, &seq))
+		return;
+
+	set_bit(DD_INSERTING, &dd->run_state);
+	do {
+		int next_seq;
+
+		while (!list_empty(list)) {
+			struct request *rq;
+
+			rq = list_first_entry(list, struct request, queuelist);
+			list_del_init(&rq->queuelist);
+			dd_insert_request(hctx, rq, flags, &free);
+		}
+
+		/*
+		 * If sequence changed, flush internal buckets
+		 */
+		next_seq = atomic_inc_return(&dd->insert_seq);
+		if (next_seq == seq + 1)
+			break;
+		seq = next_seq;
+		dd_dispatch_from_buckets(dd, list);
+	} while (1);
 
-		rq = list_first_entry(list, struct request, queuelist);
-		list_del_init(&rq->queuelist);
-		dd_insert_request(hctx, rq, flags, &free);
-	}
 	spin_unlock(&dd->lock);
+	clear_bit(DD_INSERTING, &dd->run_state);
 
 	blk_mq_free_requests(&free);
 }
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread
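For readers who want to poke at the insertion-fallback handshake outside the kernel, it can be approximated in portable C with a pthread mutex standing in for dd->lock and a C11 atomic counter for insert_seq. Everything here is illustrative: the names are made up, and a plain counter stands in for the spliced request lists.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NBUCKETS 4	/* power of two, so "cpu & (NBUCKETS - 1)" works */

struct bucket {
	pthread_mutex_t lock;
	int pending;	/* stand-in for the kernel's parked request list */
};

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool inserting;	/* models the DD_INSERTING bit */
static atomic_int insert_seq;	/* models dd->insert_seq */
static struct bucket buckets[NBUCKETS] = {
	{ PTHREAD_MUTEX_INITIALIZER, 0 }, { PTHREAD_MUTEX_INITIALIZER, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0 }, { PTHREAD_MUTEX_INITIALIZER, 0 },
};

/*
 * Returns true if the work was parked in a bucket for the active
 * inserter to drain; returns false with big_lock held, meaning the
 * caller must do the insertion (and flush the buckets) itself.
 */
static bool park_or_lock(int cpu, int *seq)
{
	struct bucket *b;

	*seq = atomic_load(&insert_seq);

	if (pthread_mutex_trylock(&big_lock) == 0)
		return false;		/* uncontended: we hold the lock */
	if (!atomic_load(&inserting)) {
		pthread_mutex_lock(&big_lock);
		return false;		/* lock holder isn't inserting; wait */
	}

	/* bump the sequence so the active inserter knows work arrived */
	*seq = atomic_fetch_add(&insert_seq, 1) + 1;

	b = &buckets[cpu & (NBUCKETS - 1)];
	pthread_mutex_lock(&b->lock);
	b->pending++;
	pthread_mutex_unlock(&b->lock);

	/*
	 * If the sequence only advanced by our own second increment, the
	 * active inserter has not finished a pass in between and will see
	 * our work when it rechecks; we can leave. Otherwise play it safe:
	 * take the lock and insert ourselves.
	 */
	if (atomic_fetch_add(&insert_seq, 1) + 1 == *seq + 1)
		return true;

	pthread_mutex_lock(&big_lock);
	return false;
}
```

The double increment is the subtle part: the active inserter also bumps the sequence at the end of each pass, so a mismatch on the parker's second increment means a pass completed in between and the parked work might otherwise be stranded.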
* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention
From: Keith Busch @ 2024-01-18 18:25 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block, bvanassche

On Thu, Jan 18, 2024 at 11:04:57AM -0700, Jens Axboe wrote:
> +#define DD_CPU_BUCKETS		32
> +#define DD_CPU_BUCKETS_MASK	(DD_CPU_BUCKETS - 1)

A bit of wasted space on machines with < 32 CPUs, no? Sure, it's not
much and the fixed size makes the implementation simpler, but these add
up.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention
From: Jens Axboe @ 2024-01-18 18:28 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-block, bvanassche

On 1/18/24 11:25 AM, Keith Busch wrote:
> On Thu, Jan 18, 2024 at 11:04:57AM -0700, Jens Axboe wrote:
>> +#define DD_CPU_BUCKETS		32
>> +#define DD_CPU_BUCKETS_MASK	(DD_CPU_BUCKETS - 1)
>
> A bit of wasted space on machines with < 32 CPUs, no? Sure, it's not
> much and the fixed size makes the implementation simpler, but these add
> up.

True, could make it allocated instead, 32 was just pulled out of thin
air. Might make sense to make the magic 8 or something instead, that's
probably Good Enough in general. Or we can make it dependent on the
number of online CPUs when initialized. Honestly just didn't want to
overcomplicate this initial RFC.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 20+ messages in thread
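A hedged sketch of the "dependent on the number of online CPUs" option floated above: round the CPU count up to a power of two, capped at some maximum, so the cheap mask-based bucket selection still works. This helper is hypothetical; nothing like it exists in the posted patches.

```c
/* Pick a power-of-two bucket count from the CPU count, capped at "max"
 * (itself assumed to be a power of two), so bucket selection can stay
 * "cpu & (n - 1)" instead of a modulo. */
static unsigned int bucket_count(unsigned int ncpus, unsigned int max)
{
	unsigned int n = 1;

	while (n < ncpus && n < max)
		n <<= 1;
	return n;
}
```

With this, a 6-CPU box would get 8 buckets while anything with 32 or more CPUs would cap at 32, matching the posted define.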
* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention 2024-01-18 18:04 ` [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention Jens Axboe 2024-01-18 18:25 ` Keith Busch @ 2024-01-18 18:31 ` Bart Van Assche 2024-01-18 18:33 ` Jens Axboe 1 sibling, 1 reply; 20+ messages in thread From: Bart Van Assche @ 2024-01-18 18:31 UTC (permalink / raw) To: Jens Axboe, linux-block On 1/18/24 10:04, Jens Axboe wrote: > If we attempt to insert a list of requests but someone else is already > running an insertion, then fallback to queueing it internally and let > the existing inserter finish the operation. Because this patch adds significant complexity: what are the use cases that benefit from this kind of optimization? Are these perhaps workloads on systems with many CPU cores and fast storage? If the storage is fast, why to use mq-deadline instead of "none" as I/O-scheduler? Thanks, Bart. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention 2024-01-18 18:31 ` Bart Van Assche @ 2024-01-18 18:33 ` Jens Axboe 2024-01-18 18:53 ` Bart Van Assche 0 siblings, 1 reply; 20+ messages in thread From: Jens Axboe @ 2024-01-18 18:33 UTC (permalink / raw) To: Bart Van Assche, linux-block On 1/18/24 11:31 AM, Bart Van Assche wrote: > On 1/18/24 10:04, Jens Axboe wrote: >> If we attempt to insert a list of requests but someone else is already >> running an insertion, then fallback to queueing it internally and let >> the existing inserter finish the operation. > > Because this patch adds significant complexity: what are the use cases > that benefit from this kind of optimization? Are these perhaps workloads > on systems with many CPU cores and fast storage? If the storage is fast, > why to use mq-deadline instead of "none" as I/O-scheduler? You and others complain that mq-deadline is slow and doesn't scale, these two patches help improve that situation. Not sure why this is even a question? -- Jens Axboe ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention 2024-01-18 18:33 ` Jens Axboe @ 2024-01-18 18:53 ` Bart Van Assche 2024-01-18 18:56 ` Jens Axboe 0 siblings, 1 reply; 20+ messages in thread From: Bart Van Assche @ 2024-01-18 18:53 UTC (permalink / raw) To: Jens Axboe, linux-block On 1/18/24 10:33, Jens Axboe wrote: > On 1/18/24 11:31 AM, Bart Van Assche wrote: >> On 1/18/24 10:04, Jens Axboe wrote: >>> If we attempt to insert a list of requests but someone else is already >>> running an insertion, then fallback to queueing it internally and let >>> the existing inserter finish the operation. >> >> Because this patch adds significant complexity: what are the use cases >> that benefit from this kind of optimization? Are these perhaps workloads >> on systems with many CPU cores and fast storage? If the storage is fast, >> why to use mq-deadline instead of "none" as I/O-scheduler? > > You and others complain that mq-deadline is slow and doesn't scale, > these two patches help improve that situation. Not sure why this is even > a question? How much does this patch improve performance? Thanks, Bart. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention 2024-01-18 18:53 ` Bart Van Assche @ 2024-01-18 18:56 ` Jens Axboe 2024-01-18 20:46 ` Bart Van Assche 0 siblings, 1 reply; 20+ messages in thread From: Jens Axboe @ 2024-01-18 18:56 UTC (permalink / raw) To: Bart Van Assche, linux-block On 1/18/24 11:53 AM, Bart Van Assche wrote: > On 1/18/24 10:33, Jens Axboe wrote: >> On 1/18/24 11:31 AM, Bart Van Assche wrote: >>> On 1/18/24 10:04, Jens Axboe wrote: >>>> If we attempt to insert a list of requests but someone else is already >>>> running an insertion, then fallback to queueing it internally and let >>>> the existing inserter finish the operation. >>> >>> Because this patch adds significant complexity: what are the use cases >>> that benefit from this kind of optimization? Are these perhaps workloads >>> on systems with many CPU cores and fast storage? If the storage is fast, >>> why to use mq-deadline instead of "none" as I/O-scheduler? >> >> You and others complain that mq-deadline is slow and doesn't scale, >> these two patches help improve that situation. Not sure why this is even >> a question? > > How much does this patch improve performance? Do you need me to link the cover letter that you were CC'ed on? -- Jens Axboe ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention 2024-01-18 18:56 ` Jens Axboe @ 2024-01-18 20:46 ` Bart Van Assche 2024-01-18 20:52 ` Jens Axboe 0 siblings, 1 reply; 20+ messages in thread From: Bart Van Assche @ 2024-01-18 20:46 UTC (permalink / raw) To: Jens Axboe, linux-block On 1/18/24 10:56, Jens Axboe wrote: > Do you need me to link the cover letter that you were CC'ed on? There is no reason to use such an aggressive tone in your emails. In the cover letter I see performance numbers for the patch series in its entirety but not for the individual patches. I'd like to know by how much this patch by itself improves performance because whether or not I will review this patch will depend on that data. Thanks, Bart. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention
From: Jens Axboe @ 2024-01-18 20:52 UTC (permalink / raw)
To: Bart Van Assche, linux-block

On 1/18/24 1:46 PM, Bart Van Assche wrote:
> On 1/18/24 10:56, Jens Axboe wrote:
>> Do you need me to link the cover letter that you were CC'ed on?
>
> There is no reason to use such an aggressive tone in your emails.

I'm getting frustrated with you because I need to say the same things
multiple times, and it doesn't seem like it's getting received at the
other end. And you had clearly seen the results, which means that rather
than the passive aggressive question, you could have said "It'd probably
be useful to include some performance numbers in the commit messages
themselves as well." which would be received way differently than asking
a question that you already have the answer to.

> In the cover letter I see performance numbers for the patch series in
> its entirety but not for the individual patches. I'd like to know by
> how much this patch by itself improves performance because whether or
> not I will review this patch will depend on that data.

You can't really split them up, as you need both to see the real
benefit. The only reason they are split is because it makes sense to do
so in terms of the code, as they are two different paths and points of
contention. This obviously makes patch 2 look better than it is, and
that would be the case even if I swapped the order of them as well.

As mentioned in a previous reply, I'll be editing the commit messages
and probably just include the various performance results in patch 2
and reference patch 1 from that too.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention 2024-01-18 20:52 ` Jens Axboe @ 2024-01-19 23:11 ` Bart Van Assche 0 siblings, 0 replies; 20+ messages in thread From: Bart Van Assche @ 2024-01-19 23:11 UTC (permalink / raw) To: Jens Axboe, linux-block On 1/18/24 12:52, Jens Axboe wrote: > And you had clearly seen the results, which means that rather > than the passive aggressive question, you could have said [ ... ] I'm never passive aggressive. If that is how my message was received I want to apologize. Bart. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCHSET RFC 0/2] mq-deadline scalability improvements
  2024-01-18 18:04 [PATCHSET RFC 0/2] mq-deadline scalability improvements Jens Axboe
  2024-01-18 18:04 ` [PATCH 1/2] block/mq-deadline: serialize request dispatching Jens Axboe
  2024-01-18 18:04 ` [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention Jens Axboe
@ 2024-01-18 19:29 ` Jens Axboe
  2024-01-18 20:22   ` Jens Axboe
  2 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2024-01-18 19:29 UTC (permalink / raw)
To: linux-block; +Cc: bvanassche

On 1/18/24 11:04 AM, Jens Axboe wrote:
> With that in place, the same test case now does:
>
> Device		QD	Jobs	IOPS	Contention	Diff
> =============================================================
> null_blk	4	32	2250K	28%		+106%
> nvme0n1		4	32	2560K	23%		+112%

nvme0n1		4	32	2560K	23%		+139%

Apparently I can't math, this is a +139% improvement for the nvme
case... Just wanted to make it clear that the IOPS number was correct,
it's just the diff math that was wrong.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCHSET RFC 0/2] mq-deadline scalability improvements
  2024-01-18 19:29 ` [PATCHSET RFC 0/2] mq-deadline scalability improvements Jens Axboe
@ 2024-01-18 20:22   ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2024-01-18 20:22 UTC (permalink / raw)
To: linux-block; +Cc: bvanassche

On 1/18/24 12:29 PM, Jens Axboe wrote:
> On 1/18/24 11:04 AM, Jens Axboe wrote:
>> With that in place, the same test case now does:
>>
>> Device		QD	Jobs	IOPS	Contention	Diff
>> =============================================================
>> null_blk	4	32	2250K	28%		+106%
>> nvme0n1	4	32	2560K	23%		+112%
>
> nvme0n1	4	32	2560K	23%		+139%
>
> Apparently I can't math, this is a +139% improvement for the nvme
> case... Just wanted to make it clear that the IOPS number was correct,
> it's just the diff math that was wrong.

And further followup, since I ran some quick testing on another box
that has a raid1 of more normal drives (SATA, 32 tags). Both pre and
post the patches, the performance is roughly the same. The bigger
difference is that the pre result is using 8% systime to do ~73K IOPS,
and with the patches we're using 1% systime to do the same work.

This should help answer the question "does this matter at all?". The
answer is definitely yes. It's not just about scalability, as is
usually the case with improving things like this, it's about
efficiency as well. 8x the sys time is ridiculous.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2024-01-19 23:11 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-18 18:04 [PATCHSET RFC 0/2] mq-deadline scalability improvements Jens Axboe
2024-01-18 18:04 ` [PATCH 1/2] block/mq-deadline: serialize request dispatching Jens Axboe
2024-01-18 18:24   ` Bart Van Assche
2024-01-18 18:45     ` Jens Axboe
2024-01-18 18:51       ` Bart Van Assche
2024-01-18 18:55         ` Jens Axboe
2024-01-19  2:40       ` Ming Lei
2024-01-19 15:49         ` Jens Axboe
2024-01-18 18:04 ` [PATCH 2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention Jens Axboe
2024-01-18 18:25   ` Keith Busch
2024-01-18 18:28     ` Jens Axboe
2024-01-18 18:31   ` Bart Van Assche
2024-01-18 18:33     ` Jens Axboe
2024-01-18 18:53       ` Bart Van Assche
2024-01-18 18:56         ` Jens Axboe
2024-01-18 20:46           ` Bart Van Assche
2024-01-18 20:52             ` Jens Axboe
2024-01-19 23:11               ` Bart Van Assche
2024-01-18 19:29 ` [PATCHSET RFC 0/2] mq-deadline scalability improvements Jens Axboe
2024-01-18 20:22   ` Jens Axboe