* [PATCH 0/4] blk-mq: optimize the size of struct request
@ 2023-06-27 12:08 chengming.zhou
2023-06-27 12:08 ` [PATCH 1/4] blk-mq: use percpu csd to remote complete instead of per-rq csd chengming.zhou
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: chengming.zhou @ 2023-06-27 12:08 UTC (permalink / raw)
To: axboe, tj, hch, ming.lei; +Cc: linux-block, linux-kernel, zhouchengming
From: Chengming Zhou <zhouchengming@bytedance.com>
Hello,
After commit be4c427809b0 ("blk-mq: use the I/O scheduler for
writes from the flush state machine"), rq->flush can no longer share
space with rq->elv, since flush_data requests can now go through the
I/O scheduler. That increased the size of struct request by 24 bytes;
this patchset decreases it by 40 bytes.
Patch 1 uses a percpu csd for remote completion instead of a per-rq csd,
decreasing the size by 24 bytes.
Patches 2-3 reuse rq->queuelist for the flush state machine pending list
and maintain a counter of inflight flush_data requests, decreasing the
size by another 16 bytes.
Patch 4 is a small cleanup on top.
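For reference, below is a rough sketch of the struct request members this
series touches (simplified from include/linux/blk-mq.h; the exact layout
and savings depend on config and architecture):

    struct request {
            ...
            union {
                    struct rb_node rb_node;         /* sort/lookup */
                    struct bio_vec special_vec;
                    void *completion_data;          /* unused, removed in patch 4 */
            };
            ...
            struct {
                    unsigned int seq;
                    struct list_head list;          /* removed in patch 3, rq->queuelist is reused */
                    rq_end_io_fn *saved_end_io;
            } flush;
            union {
                    struct __call_single_data csd;  /* removed in patch 1, a percpu csd is used */
                    u64 fifo_time;
            };
            ...
    };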
Thanks for comments!
Chengming Zhou (4):
blk-mq: use percpu csd to remote complete instead of per-rq csd
blk-flush: count inflight flush_data requests
blk-flush: reuse rq queuelist in flush state machine
blk-mq: delete unused completion_data in struct request
block/blk-flush.c | 19 +++++++++----------
block/blk-mq.c | 12 ++++++++----
block/blk.h | 5 ++---
include/linux/blk-mq.h | 10 ++--------
4 files changed, 21 insertions(+), 25 deletions(-)
--
2.39.2
* [PATCH 1/4] blk-mq: use percpu csd to remote complete instead of per-rq csd
2023-06-27 12:08 [PATCH 0/4] blk-mq: optimize the size of struct request chengming.zhou
@ 2023-06-27 12:08 ` chengming.zhou
2023-06-28 2:20 ` Ming Lei
2023-06-27 12:08 ` [PATCH 2/4] blk-flush: count inflight flush_data requests chengming.zhou
` (2 subsequent siblings)
3 siblings, 1 reply; 13+ messages in thread
From: chengming.zhou @ 2023-06-27 12:08 UTC (permalink / raw)
To: axboe, tj, hch, ming.lei; +Cc: linux-block, linux-kernel, zhouchengming
From: Chengming Zhou <zhouchengming@bytedance.com>
If a request needs to be completed remotely, we insert it into a percpu
llist and call smp_call_function_single_async() only if the llist was
empty before the insertion.
So we don't need a per-rq csd: a percpu csd is enough, and dropping the
per-rq csd decreases the size of struct request by 24 bytes.
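For context, here is a simplified sketch of the consumer side as it looks in
the current tree: the csd callback ignores its argument and only raises the
block softirq, and the softirq handler drains the percpu llist, which is why
one csd per CPU is sufficient.

    /* Simplified sketch of the existing consumers in block/blk-mq.c. */
    static void blk_done_softirq(struct softirq_action *h)
    {
            struct llist_head *list = this_cpu_ptr(&blk_cpu_done);
            struct request *rq, *next;

            /* Requests travel through the percpu llist, not through the csd. */
            llist_for_each_entry_safe(rq, next, llist_del_all(list), ipi_list)
                    rq->q->mq_ops->complete(rq);
    }

    static void __blk_mq_complete_request_remote(void *data)
    {
            /* The rq pointer passed via INIT_CSD() was never used here. */
            __raise_softirq_irqoff(BLOCK_SOFTIRQ);
    }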
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
block/blk-mq.c | 12 ++++++++----
include/linux/blk-mq.h | 5 +----
2 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index decb6ab2d508..a36822479b94 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -43,6 +43,7 @@
#include "blk-ioprio.h"
static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
+static DEFINE_PER_CPU(struct __call_single_data, blk_cpu_csd);
static void blk_mq_insert_request(struct request *rq, blk_insert_t flags);
static void blk_mq_request_bypass_insert(struct request *rq,
@@ -1156,13 +1157,13 @@ static void blk_mq_complete_send_ipi(struct request *rq)
{
struct llist_head *list;
unsigned int cpu;
+ struct __call_single_data *csd;
cpu = rq->mq_ctx->cpu;
list = &per_cpu(blk_cpu_done, cpu);
- if (llist_add(&rq->ipi_list, list)) {
- INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
- smp_call_function_single_async(cpu, &rq->csd);
- }
+ csd = &per_cpu(blk_cpu_csd, cpu);
+ if (llist_add(&rq->ipi_list, list))
+ smp_call_function_single_async(cpu, csd);
}
static void blk_mq_raise_softirq(struct request *rq)
@@ -4796,6 +4797,9 @@ static int __init blk_mq_init(void)
for_each_possible_cpu(i)
init_llist_head(&per_cpu(blk_cpu_done, i));
+ for_each_possible_cpu(i)
+ INIT_CSD(&per_cpu(blk_cpu_csd, i),
+ __blk_mq_complete_request_remote, NULL);
open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);
cpuhp_setup_state_nocalls(CPUHP_BLOCK_SOFTIRQ_DEAD,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f401067ac03a..070551197c0e 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -182,10 +182,7 @@ struct request {
rq_end_io_fn *saved_end_io;
} flush;
- union {
- struct __call_single_data csd;
- u64 fifo_time;
- };
+ u64 fifo_time;
/*
* completion callback.
--
2.39.2
* [PATCH 2/4] blk-flush: count inflight flush_data requests
2023-06-27 12:08 [PATCH 0/4] blk-mq: optimize the size of struct request chengming.zhou
2023-06-27 12:08 ` [PATCH 1/4] blk-mq: use percpu csd to remote complete instead of per-rq csd chengming.zhou
@ 2023-06-27 12:08 ` chengming.zhou
2023-06-28 4:13 ` Ming Lei
2023-06-27 12:08 ` [PATCH 3/4] blk-flush: reuse rq queuelist in flush state machine chengming.zhou
2023-06-27 12:08 ` [PATCH 4/4] blk-mq: delete unused completion_data in struct request chengming.zhou
3 siblings, 1 reply; 13+ messages in thread
From: chengming.zhou @ 2023-06-27 12:08 UTC (permalink / raw)
To: axboe, tj, hch, ming.lei; +Cc: linux-block, linux-kernel, zhouchengming
From: Chengming Zhou <zhouchengming@bytedance.com>
The flush state machine uses a doubly linked list to track all inflight
flush_data requests, so that it can avoid issuing separate post-flushes
for flush_data requests that shared a PREFLUSH.
That is why rq->queuelist can't be reused there and rq->flush.list is
needed.
In preparation for the next patch, which reuses rq->queuelist in the
flush state machine, change the linked list into a u64 counter of
inflight flush_data requests.
This is fine because we only need to know whether any flush_data request
is still inflight. The only problem I can think of is that the counter
may overflow, which should be very unlikely to happen.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
block/blk-flush.c | 9 +++++----
block/blk.h | 5 ++---
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index dba392cf22be..bb7adfc2a5da 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -187,7 +187,8 @@ static void blk_flush_complete_seq(struct request *rq,
break;
case REQ_FSEQ_DATA:
- list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
+ list_del_init(&rq->flush.list);
+ fq->flush_data_in_flight++;
spin_lock(&q->requeue_lock);
list_add_tail(&rq->queuelist, &q->flush_list);
spin_unlock(&q->requeue_lock);
@@ -299,7 +300,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
return;
/* C2 and C3 */
- if (!list_empty(&fq->flush_data_in_flight) &&
+ if (fq->flush_data_in_flight &&
time_before(jiffies,
fq->flush_pending_since + FLUSH_PENDING_TIMEOUT))
return;
@@ -374,6 +375,7 @@ static enum rq_end_io_ret mq_flush_data_end_io(struct request *rq,
* the comment in flush_end_io().
*/
spin_lock_irqsave(&fq->mq_flush_lock, flags);
+ fq->flush_data_in_flight--;
blk_flush_complete_seq(rq, fq, REQ_FSEQ_DATA, error);
spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
@@ -445,7 +447,7 @@ bool blk_insert_flush(struct request *rq)
blk_rq_init_flush(rq);
rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
spin_lock_irq(&fq->mq_flush_lock);
- list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
+ fq->flush_data_in_flight++;
spin_unlock_irq(&fq->mq_flush_lock);
return false;
default:
@@ -496,7 +498,6 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
INIT_LIST_HEAD(&fq->flush_queue[0]);
INIT_LIST_HEAD(&fq->flush_queue[1]);
- INIT_LIST_HEAD(&fq->flush_data_in_flight);
return fq;
diff --git a/block/blk.h b/block/blk.h
index 608c5dcc516b..686712e13835 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -15,15 +15,14 @@ struct elevator_type;
extern struct dentry *blk_debugfs_root;
struct blk_flush_queue {
+ spinlock_t mq_flush_lock;
unsigned int flush_pending_idx:1;
unsigned int flush_running_idx:1;
blk_status_t rq_status;
unsigned long flush_pending_since;
struct list_head flush_queue[2];
- struct list_head flush_data_in_flight;
+ unsigned long flush_data_in_flight;
struct request *flush_rq;
-
- spinlock_t mq_flush_lock;
};
bool is_flush_rq(struct request *req);
--
2.39.2
* [PATCH 3/4] blk-flush: reuse rq queuelist in flush state machine
2023-06-27 12:08 [PATCH 0/4] blk-mq: optimize the size of struct request chengming.zhou
2023-06-27 12:08 ` [PATCH 1/4] blk-mq: use percpu csd to remote complete instead of per-rq csd chengming.zhou
2023-06-27 12:08 ` [PATCH 2/4] blk-flush: count inflight flush_data requests chengming.zhou
@ 2023-06-27 12:08 ` chengming.zhou
2023-06-27 12:08 ` [PATCH 4/4] blk-mq: delete unused completion_data in struct request chengming.zhou
3 siblings, 0 replies; 13+ messages in thread
From: chengming.zhou @ 2023-06-27 12:08 UTC (permalink / raw)
To: axboe, tj, hch, ming.lei; +Cc: linux-block, linux-kernel, zhouchengming
From: Chengming Zhou <zhouchengming@bytedance.com>
Since we no longer maintain a list of inflight flush_data requests, we
can reuse rq->queuelist for the flush pending list.
This decreases the size of struct request by 16 bytes.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
block/blk-flush.c | 12 +++++-------
include/linux/blk-mq.h | 1 -
2 files changed, 5 insertions(+), 8 deletions(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index bb7adfc2a5da..81588edbe8b0 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -183,14 +183,13 @@ static void blk_flush_complete_seq(struct request *rq,
/* queue for flush */
if (list_empty(pending))
fq->flush_pending_since = jiffies;
- list_move_tail(&rq->flush.list, pending);
+ list_move_tail(&rq->queuelist, pending);
break;
case REQ_FSEQ_DATA:
- list_del_init(&rq->flush.list);
fq->flush_data_in_flight++;
spin_lock(&q->requeue_lock);
- list_add_tail(&rq->queuelist, &q->flush_list);
+ list_move_tail(&rq->queuelist, &q->flush_list);
spin_unlock(&q->requeue_lock);
blk_mq_kick_requeue_list(q);
break;
@@ -202,7 +201,7 @@ static void blk_flush_complete_seq(struct request *rq,
* flush data request completion path. Restore @rq for
* normal completion and end it.
*/
- list_del_init(&rq->flush.list);
+ list_del_init(&rq->queuelist);
blk_flush_restore_request(rq);
blk_mq_end_request(rq, error);
break;
@@ -258,7 +257,7 @@ static enum rq_end_io_ret flush_end_io(struct request *flush_rq,
fq->flush_running_idx ^= 1;
/* and push the waiting requests to the next stage */
- list_for_each_entry_safe(rq, n, running, flush.list) {
+ list_for_each_entry_safe(rq, n, running, queuelist) {
unsigned int seq = blk_flush_cur_seq(rq);
BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
@@ -292,7 +291,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
{
struct list_head *pending = &fq->flush_queue[fq->flush_pending_idx];
struct request *first_rq =
- list_first_entry(pending, struct request, flush.list);
+ list_first_entry(pending, struct request, queuelist);
struct request *flush_rq = fq->flush_rq;
/* C1 described at the top of this file */
@@ -386,7 +385,6 @@ static enum rq_end_io_ret mq_flush_data_end_io(struct request *rq,
static void blk_rq_init_flush(struct request *rq)
{
rq->flush.seq = 0;
- INIT_LIST_HEAD(&rq->flush.list);
rq->rq_flags |= RQF_FLUSH_SEQ;
rq->flush.saved_end_io = rq->end_io; /* Usually NULL */
rq->end_io = mq_flush_data_end_io;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 070551197c0e..96644d6f8d18 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -178,7 +178,6 @@ struct request {
struct {
unsigned int seq;
- struct list_head list;
rq_end_io_fn *saved_end_io;
} flush;
--
2.39.2
* [PATCH 4/4] blk-mq: delete unused completion_data in struct request
2023-06-27 12:08 [PATCH 0/4] blk-mq: optimize the size of struct request chengming.zhou
` (2 preceding siblings ...)
2023-06-27 12:08 ` [PATCH 3/4] blk-flush: reuse rq queuelist in flush state machine chengming.zhou
@ 2023-06-27 12:08 ` chengming.zhou
3 siblings, 0 replies; 13+ messages in thread
From: chengming.zhou @ 2023-06-27 12:08 UTC (permalink / raw)
To: axboe, tj, hch, ming.lei; +Cc: linux-block, linux-kernel, zhouchengming
From: Chengming Zhou <zhouchengming@bytedance.com>
A global search shows that "completion_data" in struct request is not
used anywhere, so clean it up while we are at it.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
include/linux/blk-mq.h | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 96644d6f8d18..ab790eba5fcf 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -158,13 +158,11 @@ struct request {
/*
* The rb_node is only used inside the io scheduler, requests
- * are pruned when moved to the dispatch queue. So let the
- * completion_data share space with the rb_node.
+ * are pruned when moved to the dispatch queue.
*/
union {
struct rb_node rb_node; /* sort/lookup */
struct bio_vec special_vec;
- void *completion_data;
};
/*
--
2.39.2
* Re: [PATCH 1/4] blk-mq: use percpu csd to remote complete instead of per-rq csd
2023-06-27 12:08 ` [PATCH 1/4] blk-mq: use percpu csd to remote complete instead of per-rq csd chengming.zhou
@ 2023-06-28 2:20 ` Ming Lei
2023-06-28 3:28 ` Chengming Zhou
0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2023-06-28 2:20 UTC (permalink / raw)
To: chengming.zhou
Cc: axboe, tj, hch, linux-block, linux-kernel, zhouchengming,
ming.lei
On Tue, Jun 27, 2023 at 08:08:51PM +0800, chengming.zhou@linux.dev wrote:
> From: Chengming Zhou <zhouchengming@bytedance.com>
>
> If request need to be completed remotely, we insert it into percpu llist,
> and smp_call_function_single_async() if llist is empty previously.
>
> We don't need to use per-rq csd, percpu csd is enough. And the size of
> struct request is decreased by 24 bytes.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
> block/blk-mq.c | 12 ++++++++----
> include/linux/blk-mq.h | 5 +----
> 2 files changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index decb6ab2d508..a36822479b94 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -43,6 +43,7 @@
> #include "blk-ioprio.h"
>
> static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
> +static DEFINE_PER_CPU(struct __call_single_data, blk_cpu_csd);
It might be better to use call_single_data, given:
/* Use __aligned() to avoid to use 2 cache lines for 1 csd */
typedef struct __call_single_data call_single_data_t
__aligned(sizeof(struct __call_single_data));
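That is, something like the following sketch of the suggested definition:
    static DEFINE_PER_CPU(call_single_data_t, blk_cpu_csd);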
>
> static void blk_mq_insert_request(struct request *rq, blk_insert_t flags);
> static void blk_mq_request_bypass_insert(struct request *rq,
> @@ -1156,13 +1157,13 @@ static void blk_mq_complete_send_ipi(struct request *rq)
> {
> struct llist_head *list;
> unsigned int cpu;
> + struct __call_single_data *csd;
>
> cpu = rq->mq_ctx->cpu;
> list = &per_cpu(blk_cpu_done, cpu);
> - if (llist_add(&rq->ipi_list, list)) {
> - INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
> - smp_call_function_single_async(cpu, &rq->csd);
> - }
> + csd = &per_cpu(blk_cpu_csd, cpu);
> + if (llist_add(&rq->ipi_list, list))
> + smp_call_function_single_async(cpu, csd);
> }
This way is cleaner, and looks correct: the block softirq is guaranteed to be
scheduled to consume the list whenever a new request is added to this percpu
list, no matter whether smp_call_function_single_async() returns -EBUSY or 0.
thanks
Ming
* Re: [PATCH 1/4] blk-mq: use percpu csd to remote complete instead of per-rq csd
2023-06-28 2:20 ` Ming Lei
@ 2023-06-28 3:28 ` Chengming Zhou
2023-06-28 4:50 ` Ming Lei
0 siblings, 1 reply; 13+ messages in thread
From: Chengming Zhou @ 2023-06-28 3:28 UTC (permalink / raw)
To: Ming Lei; +Cc: axboe, tj, hch, linux-block, linux-kernel, zhouchengming
On 2023/6/28 10:20, Ming Lei wrote:
> On Tue, Jun 27, 2023 at 08:08:51PM +0800, chengming.zhou@linux.dev wrote:
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>
>> If request need to be completed remotely, we insert it into percpu llist,
>> and smp_call_function_single_async() if llist is empty previously.
>>
>> We don't need to use per-rq csd, percpu csd is enough. And the size of
>> struct request is decreased by 24 bytes.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>> block/blk-mq.c | 12 ++++++++----
>> include/linux/blk-mq.h | 5 +----
>> 2 files changed, 9 insertions(+), 8 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index decb6ab2d508..a36822479b94 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -43,6 +43,7 @@
>> #include "blk-ioprio.h"
>>
>> static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
>> +static DEFINE_PER_CPU(struct __call_single_data, blk_cpu_csd);
>
> It might be better to use call_single_data, given:
>
> /* Use __aligned() to avoid to use 2 cache lines for 1 csd */
> typedef struct __call_single_data call_single_data_t
> __aligned(sizeof(struct __call_single_data));
>
Good, I will change to use this.
>>
>> static void blk_mq_insert_request(struct request *rq, blk_insert_t flags);
>> static void blk_mq_request_bypass_insert(struct request *rq,
>> @@ -1156,13 +1157,13 @@ static void blk_mq_complete_send_ipi(struct request *rq)
>> {
>> struct llist_head *list;
>> unsigned int cpu;
>> + struct __call_single_data *csd;
>>
>> cpu = rq->mq_ctx->cpu;
>> list = &per_cpu(blk_cpu_done, cpu);
>> - if (llist_add(&rq->ipi_list, list)) {
>> - INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
>> - smp_call_function_single_async(cpu, &rq->csd);
>> - }
>> + csd = &per_cpu(blk_cpu_csd, cpu);
>> + if (llist_add(&rq->ipi_list, list))
>> + smp_call_function_single_async(cpu, csd);
>> }
>
> This way is cleaner, and looks correct, given block softirq is guaranteed to be
> scheduled to consume the list if one new request is added to this percpu list,
> either smp_call_function_single_async() returns -EBUSY or 0.
>
If this llist_add() sees that the llist was empty, the consumer function in the softirq
on the remote CPU must already have consumed the llist, so smp_call_function_single_async()
shouldn't return -EBUSY?
Thanks.
* Re: [PATCH 2/4] blk-flush: count inflight flush_data requests
2023-06-27 12:08 ` [PATCH 2/4] blk-flush: count inflight flush_data requests chengming.zhou
@ 2023-06-28 4:13 ` Ming Lei
2023-06-28 4:55 ` Chengming Zhou
0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2023-06-28 4:13 UTC (permalink / raw)
To: chengming.zhou; +Cc: axboe, tj, hch, linux-block, linux-kernel, zhouchengming
On Tue, Jun 27, 2023 at 08:08:52PM +0800, chengming.zhou@linux.dev wrote:
> From: Chengming Zhou <zhouchengming@bytedance.com>
>
> The flush state machine use a double list to link all inflight
> flush_data requests, to avoid issuing separate post-flushes for
> these flush_data requests which shared PREFLUSH.
>
> So we can't reuse rq->queuelist, this is why we need rq->flush.list
>
> In preparation of the next patch that reuse rq->queuelist for flush
> state machine, we change the double linked list to a u64 counter,
> which count all inflight flush_data requests.
>
> This is ok since we only need to know if there is any inflight
> flush_data request, so a u64 counter is good. The only problem I can
> think of is that u64 counter may overflow, which should be unlikely happen.
It won't overflow. q->nr_requests is 'unsigned long', which really should be
limited to a more reasonable value, such as 2 * BLK_MQ_MAX_DEPTH, so a
u16 should be big enough in theory.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
> block/blk-flush.c | 9 +++++----
> block/blk.h | 5 ++---
> 2 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index dba392cf22be..bb7adfc2a5da 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -187,7 +187,8 @@ static void blk_flush_complete_seq(struct request *rq,
> break;
>
> case REQ_FSEQ_DATA:
> - list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
> + list_del_init(&rq->flush.list);
> + fq->flush_data_in_flight++;
> spin_lock(&q->requeue_lock);
> list_add_tail(&rq->queuelist, &q->flush_list);
> spin_unlock(&q->requeue_lock);
> @@ -299,7 +300,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
> return;
>
> /* C2 and C3 */
> - if (!list_empty(&fq->flush_data_in_flight) &&
> + if (fq->flush_data_in_flight &&
> time_before(jiffies,
> fq->flush_pending_since + FLUSH_PENDING_TIMEOUT))
> return;
> @@ -374,6 +375,7 @@ static enum rq_end_io_ret mq_flush_data_end_io(struct request *rq,
> * the comment in flush_end_io().
> */
> spin_lock_irqsave(&fq->mq_flush_lock, flags);
> + fq->flush_data_in_flight--;
> blk_flush_complete_seq(rq, fq, REQ_FSEQ_DATA, error);
> spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
>
> @@ -445,7 +447,7 @@ bool blk_insert_flush(struct request *rq)
> blk_rq_init_flush(rq);
> rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
> spin_lock_irq(&fq->mq_flush_lock);
> - list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
> + fq->flush_data_in_flight++;
> spin_unlock_irq(&fq->mq_flush_lock);
> return false;
> default:
> @@ -496,7 +498,6 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
>
> INIT_LIST_HEAD(&fq->flush_queue[0]);
> INIT_LIST_HEAD(&fq->flush_queue[1]);
> - INIT_LIST_HEAD(&fq->flush_data_in_flight);
>
> return fq;
>
> diff --git a/block/blk.h b/block/blk.h
> index 608c5dcc516b..686712e13835 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -15,15 +15,14 @@ struct elevator_type;
> extern struct dentry *blk_debugfs_root;
>
> struct blk_flush_queue {
> + spinlock_t mq_flush_lock;
> unsigned int flush_pending_idx:1;
> unsigned int flush_running_idx:1;
> blk_status_t rq_status;
> unsigned long flush_pending_since;
> struct list_head flush_queue[2];
> - struct list_head flush_data_in_flight;
> + unsigned long flush_data_in_flight;
> struct request *flush_rq;
> -
> - spinlock_t mq_flush_lock;
> };
The part replacing the inflight data rq list with a counter looks fine.
Thanks,
Ming
* Re: [PATCH 1/4] blk-mq: use percpu csd to remote complete instead of per-rq csd
2023-06-28 3:28 ` Chengming Zhou
@ 2023-06-28 4:50 ` Ming Lei
2023-06-28 6:43 ` Chengming Zhou
0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2023-06-28 4:50 UTC (permalink / raw)
To: Chengming Zhou; +Cc: axboe, tj, hch, linux-block, linux-kernel, zhouchengming
On Wed, Jun 28, 2023 at 11:28:20AM +0800, Chengming Zhou wrote:
> On 2023/6/28 10:20, Ming Lei wrote:
> > On Tue, Jun 27, 2023 at 08:08:51PM +0800, chengming.zhou@linux.dev wrote:
> >> From: Chengming Zhou <zhouchengming@bytedance.com>
> >>
> >> If request need to be completed remotely, we insert it into percpu llist,
> >> and smp_call_function_single_async() if llist is empty previously.
> >>
> >> We don't need to use per-rq csd, percpu csd is enough. And the size of
> >> struct request is decreased by 24 bytes.
> >>
> >> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> >> ---
> >> block/blk-mq.c | 12 ++++++++----
> >> include/linux/blk-mq.h | 5 +----
> >> 2 files changed, 9 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >> index decb6ab2d508..a36822479b94 100644
> >> --- a/block/blk-mq.c
> >> +++ b/block/blk-mq.c
> >> @@ -43,6 +43,7 @@
> >> #include "blk-ioprio.h"
> >>
> >> static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
> >> +static DEFINE_PER_CPU(struct __call_single_data, blk_cpu_csd);
> >
> > It might be better to use call_single_data, given:
> >
> > /* Use __aligned() to avoid to use 2 cache lines for 1 csd */
> > typedef struct __call_single_data call_single_data_t
> > __aligned(sizeof(struct __call_single_data));
> >
>
> Good, I will change to use this.
>
> >>
> >> static void blk_mq_insert_request(struct request *rq, blk_insert_t flags);
> >> static void blk_mq_request_bypass_insert(struct request *rq,
> >> @@ -1156,13 +1157,13 @@ static void blk_mq_complete_send_ipi(struct request *rq)
> >> {
> >> struct llist_head *list;
> >> unsigned int cpu;
> >> + struct __call_single_data *csd;
> >>
> >> cpu = rq->mq_ctx->cpu;
> >> list = &per_cpu(blk_cpu_done, cpu);
> >> - if (llist_add(&rq->ipi_list, list)) {
> >> - INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
> >> - smp_call_function_single_async(cpu, &rq->csd);
> >> - }
> >> + csd = &per_cpu(blk_cpu_csd, cpu);
> >> + if (llist_add(&rq->ipi_list, list))
> >> + smp_call_function_single_async(cpu, csd);
> >> }
> >
> > This way is cleaner, and looks correct, given block softirq is guaranteed to be
> > scheduled to consume the list if one new request is added to this percpu list,
> > either smp_call_function_single_async() returns -EBUSY or 0.
> >
>
> If this llist_add() see the llist is empty, the consumer function in the softirq
> on the remote CPU must have consumed the llist, so smp_call_function_single_async()
> won't return -EBUSY ?
The block softirq can be scheduled from other code paths as well, such as
blk_mq_raise_softirq() for a single queue's remote completion, where no percpu
csd is needed. So two smp_call_function_single_async() calls could happen, and
the 2nd one may return -EBUSY.
Not to mention that csd_unlock() may be called only after the callback returns,
see __flush_smp_call_function_queue().
But that is fine: if a block softirq is already pending, the llist is
guaranteed to be consumed, because the csd callback just raises the block
softirq and the requests on the llist are consumed in the softirq handler.
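For reference, a simplified sketch of that single-queue path (from
block/blk-mq.c), which adds to the same percpu llist but raises the block
softirq directly, without going through the percpu csd:

    static void blk_mq_raise_softirq(struct request *rq)
    {
            struct llist_head *list;

            preempt_disable();
            list = this_cpu_ptr(&blk_cpu_done);
            /* The llist is drained by the softirq handler, no csd involved. */
            if (llist_add(&rq->ipi_list, list))
                    raise_softirq(BLOCK_SOFTIRQ);
            preempt_enable();
    }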
Thanks,
Ming
* Re: [PATCH 2/4] blk-flush: count inflight flush_data requests
2023-06-28 4:13 ` Ming Lei
@ 2023-06-28 4:55 ` Chengming Zhou
2023-06-28 7:22 ` Ming Lei
0 siblings, 1 reply; 13+ messages in thread
From: Chengming Zhou @ 2023-06-28 4:55 UTC (permalink / raw)
To: Ming Lei; +Cc: axboe, tj, hch, linux-block, linux-kernel, zhouchengming
On 2023/6/28 12:13, Ming Lei wrote:
> On Tue, Jun 27, 2023 at 08:08:52PM +0800, chengming.zhou@linux.dev wrote:
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>
>> The flush state machine use a double list to link all inflight
>> flush_data requests, to avoid issuing separate post-flushes for
>> these flush_data requests which shared PREFLUSH.
>>
>> So we can't reuse rq->queuelist, this is why we need rq->flush.list
>>
>> In preparation of the next patch that reuse rq->queuelist for flush
>> state machine, we change the double linked list to a u64 counter,
>> which count all inflight flush_data requests.
>>
>> This is ok since we only need to know if there is any inflight
>> flush_data request, so a u64 counter is good. The only problem I can
>> think of is that u64 counter may overflow, which should be unlikely happen.
>
> It won't overflow, q->nr_requests is 'unsigned long', which should have
> been limited to one more reasonable value, such as 2 * BLK_MQ_MAX_DEPTH, so
> u16 should be big enough in theory.
Ah, right. q->nr_requests is 'unsigned long' and q->queue_depth is 'unsigned int',
so an 'unsigned long' counter here won't overflow.
Should I change it to a smaller 'unsigned short', or just leave it as 'unsigned long'?
(The size of struct blk_flush_queue is now exactly 64 bytes.)
Thanks.
>
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>> block/blk-flush.c | 9 +++++----
>> block/blk.h | 5 ++---
>> 2 files changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/blk-flush.c b/block/blk-flush.c
>> index dba392cf22be..bb7adfc2a5da 100644
>> --- a/block/blk-flush.c
>> +++ b/block/blk-flush.c
>> @@ -187,7 +187,8 @@ static void blk_flush_complete_seq(struct request *rq,
>> break;
>>
>> case REQ_FSEQ_DATA:
>> - list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
>> + list_del_init(&rq->flush.list);
>> + fq->flush_data_in_flight++;
>> spin_lock(&q->requeue_lock);
>> list_add_tail(&rq->queuelist, &q->flush_list);
>> spin_unlock(&q->requeue_lock);
>> @@ -299,7 +300,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
>> return;
>>
>> /* C2 and C3 */
>> - if (!list_empty(&fq->flush_data_in_flight) &&
>> + if (fq->flush_data_in_flight &&
>> time_before(jiffies,
>> fq->flush_pending_since + FLUSH_PENDING_TIMEOUT))
>> return;
>> @@ -374,6 +375,7 @@ static enum rq_end_io_ret mq_flush_data_end_io(struct request *rq,
>> * the comment in flush_end_io().
>> */
>> spin_lock_irqsave(&fq->mq_flush_lock, flags);
>> + fq->flush_data_in_flight--;
>> blk_flush_complete_seq(rq, fq, REQ_FSEQ_DATA, error);
>> spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
>>
>> @@ -445,7 +447,7 @@ bool blk_insert_flush(struct request *rq)
>> blk_rq_init_flush(rq);
>> rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
>> spin_lock_irq(&fq->mq_flush_lock);
>> - list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
>> + fq->flush_data_in_flight++;
>> spin_unlock_irq(&fq->mq_flush_lock);
>> return false;
>> default:
>> @@ -496,7 +498,6 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
>>
>> INIT_LIST_HEAD(&fq->flush_queue[0]);
>> INIT_LIST_HEAD(&fq->flush_queue[1]);
>> - INIT_LIST_HEAD(&fq->flush_data_in_flight);
>>
>> return fq;
>>
>> diff --git a/block/blk.h b/block/blk.h
>> index 608c5dcc516b..686712e13835 100644
>> --- a/block/blk.h
>> +++ b/block/blk.h
>> @@ -15,15 +15,14 @@ struct elevator_type;
>> extern struct dentry *blk_debugfs_root;
>>
>> struct blk_flush_queue {
>> + spinlock_t mq_flush_lock;
>> unsigned int flush_pending_idx:1;
>> unsigned int flush_running_idx:1;
>> blk_status_t rq_status;
>> unsigned long flush_pending_since;
>> struct list_head flush_queue[2];
>> - struct list_head flush_data_in_flight;
>> + unsigned long flush_data_in_flight;
>> struct request *flush_rq;
>> -
>> - spinlock_t mq_flush_lock;
>> };
>
> The part of replacing inflight data rq list with counter looks fine.
>
> Thanks,
> Ming
>
* Re: [PATCH 1/4] blk-mq: use percpu csd to remote complete instead of per-rq csd
2023-06-28 4:50 ` Ming Lei
@ 2023-06-28 6:43 ` Chengming Zhou
0 siblings, 0 replies; 13+ messages in thread
From: Chengming Zhou @ 2023-06-28 6:43 UTC (permalink / raw)
To: Ming Lei; +Cc: axboe, tj, hch, linux-block, linux-kernel, zhouchengming
On 2023/6/28 12:50, Ming Lei wrote:
> On Wed, Jun 28, 2023 at 11:28:20AM +0800, Chengming Zhou wrote:
>> On 2023/6/28 10:20, Ming Lei wrote:
>>> On Tue, Jun 27, 2023 at 08:08:51PM +0800, chengming.zhou@linux.dev wrote:
>>>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>>>
>>>> If request need to be completed remotely, we insert it into percpu llist,
>>>> and smp_call_function_single_async() if llist is empty previously.
>>>>
>>>> We don't need to use per-rq csd, percpu csd is enough. And the size of
>>>> struct request is decreased by 24 bytes.
>>>>
>>>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>>>> ---
>>>> block/blk-mq.c | 12 ++++++++----
>>>> include/linux/blk-mq.h | 5 +----
>>>> 2 files changed, 9 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index decb6ab2d508..a36822479b94 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -43,6 +43,7 @@
>>>> #include "blk-ioprio.h"
>>>>
>>>> static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
>>>> +static DEFINE_PER_CPU(struct __call_single_data, blk_cpu_csd);
>>>
>>> It might be better to use call_single_data, given:
>>>
>>> /* Use __aligned() to avoid to use 2 cache lines for 1 csd */
>>> typedef struct __call_single_data call_single_data_t
>>> __aligned(sizeof(struct __call_single_data));
>>>
>>
>> Good, I will change to use this.
>>
>>>>
>>>> static void blk_mq_insert_request(struct request *rq, blk_insert_t flags);
>>>> static void blk_mq_request_bypass_insert(struct request *rq,
>>>> @@ -1156,13 +1157,13 @@ static void blk_mq_complete_send_ipi(struct request *rq)
>>>> {
>>>> struct llist_head *list;
>>>> unsigned int cpu;
>>>> + struct __call_single_data *csd;
>>>>
>>>> cpu = rq->mq_ctx->cpu;
>>>> list = &per_cpu(blk_cpu_done, cpu);
>>>> - if (llist_add(&rq->ipi_list, list)) {
>>>> - INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
>>>> - smp_call_function_single_async(cpu, &rq->csd);
>>>> - }
>>>> + csd = &per_cpu(blk_cpu_csd, cpu);
>>>> + if (llist_add(&rq->ipi_list, list))
>>>> + smp_call_function_single_async(cpu, csd);
>>>> }
>>>
>>> This way is cleaner, and looks correct, given block softirq is guaranteed to be
>>> scheduled to consume the list if one new request is added to this percpu list,
>>> either smp_call_function_single_async() returns -EBUSY or 0.
>>>
>>
>> If this llist_add() see the llist is empty, the consumer function in the softirq
>> on the remote CPU must have consumed the llist, so smp_call_function_single_async()
>> won't return -EBUSY ?
>
> block softirq can be scheduled from other code path, such as blk_mq_raise_softirq()
> for single queue's remote completion, where no percpu csd schedule is needed, so
> two smp_call_function_single_async() could be called, and the 2nd one
> may return -EBUSY.
Thanks for your very clear explanation! I understand what you mean.
Yes, the 2nd smp_call_function_single_async() will return -EBUSY, but it's ok since
the 1st will do the right thing.
>
> Not mention csd_unlock() could be called after the callback returns, see
> __flush_smp_call_function_queue().
Ok, CSD_TYPE_SYNC does csd_unlock() after csd_do_func() returns, while our
CSD_TYPE_ASYNC does csd_unlock() before csd_do_func().
>
> But that is fine, if there is pending block softirq, the llist is
> guaranteed to be consumed because the csd callback just raises block
> softirq, and request/llist is consumed in softirq handler.
>
Agreed, it's fine even if the 2nd call returns -EBUSY while the 1st is raising the
block softirq; our llist will be consumed in the softirq handler.
Thanks!
* Re: [PATCH 2/4] blk-flush: count inflight flush_data requests
2023-06-28 4:55 ` Chengming Zhou
@ 2023-06-28 7:22 ` Ming Lei
2023-06-28 12:55 ` Chengming Zhou
0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2023-06-28 7:22 UTC (permalink / raw)
To: Chengming Zhou; +Cc: axboe, tj, hch, linux-block, linux-kernel, zhouchengming
On Wed, Jun 28, 2023 at 12:55:49PM +0800, Chengming Zhou wrote:
> On 2023/6/28 12:13, Ming Lei wrote:
> > On Tue, Jun 27, 2023 at 08:08:52PM +0800, chengming.zhou@linux.dev wrote:
> >> From: Chengming Zhou <zhouchengming@bytedance.com>
> >>
> >> The flush state machine use a double list to link all inflight
> >> flush_data requests, to avoid issuing separate post-flushes for
> >> these flush_data requests which shared PREFLUSH.
> >>
> >> So we can't reuse rq->queuelist, this is why we need rq->flush.list
> >>
> >> In preparation of the next patch that reuse rq->queuelist for flush
> >> state machine, we change the double linked list to a u64 counter,
> >> which count all inflight flush_data requests.
> >>
> >> This is ok since we only need to know if there is any inflight
> >> flush_data request, so a u64 counter is good. The only problem I can
> >> think of is that u64 counter may overflow, which should be unlikely happen.
> >
> > It won't overflow, q->nr_requests is 'unsigned long', which should have
> > been limited to one more reasonable value, such as 2 * BLK_MQ_MAX_DEPTH, so
> > u16 should be big enough in theory.
>
> Ah, right. q->nr_requests is 'unsigned long' and q->queue_depth is 'unsigned int',
> so 'unsigned long' counter here won't overflow.
Unlike q->nr_requests, q->queue_depth usually means the whole queue's depth,
which may cover all hw queues' depth. It is only used by scsi, but it should
fit in an "unsigned short" too.
>
> Should I change it to smaller 'unsigned short' or just leave it as 'unsigned long' ?
> (Now the size of struct blk_flush_queue is exactly 64 bytes)
You have to limit q->nr_requests first, which may need a bit more work to
avoid compile warnings and the like. And 64k is big enough for holding
per-queue scheduler requests.
Once that is done, it is fine to define this counter as 'unsigned short'.
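As a very rough sketch of that limit (hypothetical, only to illustrate the
idea; the exact clamp site, the chosen constant and any other places that set
q->nr_requests would need checking):

    /* Hypothetical sketch, roughly following the existing
     * queue_requests_store() in block/blk-sysfs.c, with an added upper
     * bound so that nr_requests always fits in an unsigned short. */
    static ssize_t
    queue_requests_store(struct request_queue *q, const char *page, size_t count)
    {
            unsigned long nr;
            int ret, err;

            if (!queue_is_mq(q))
                    return -EINVAL;

            ret = queue_var_store(&nr, page, count);
            if (ret < 0)
                    return ret;

            if (nr < BLKDEV_MIN_RQ)
                    nr = BLKDEV_MIN_RQ;
            if (nr > USHRT_MAX)             /* proposed upper bound */
                    nr = USHRT_MAX;

            err = blk_mq_update_nr_requests(q, nr);
            if (err)
                    return err;

            return ret;
    }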
Thanks,
Ming
* Re: [PATCH 2/4] blk-flush: count inflight flush_data requests
2023-06-28 7:22 ` Ming Lei
@ 2023-06-28 12:55 ` Chengming Zhou
0 siblings, 0 replies; 13+ messages in thread
From: Chengming Zhou @ 2023-06-28 12:55 UTC (permalink / raw)
To: Ming Lei; +Cc: axboe, tj, hch, linux-block, linux-kernel, zhouchengming
On 2023/6/28 15:22, Ming Lei wrote:
> On Wed, Jun 28, 2023 at 12:55:49PM +0800, Chengming Zhou wrote:
>> On 2023/6/28 12:13, Ming Lei wrote:
>>> On Tue, Jun 27, 2023 at 08:08:52PM +0800, chengming.zhou@linux.dev wrote:
>>>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>>>
>>>> The flush state machine use a double list to link all inflight
>>>> flush_data requests, to avoid issuing separate post-flushes for
>>>> these flush_data requests which shared PREFLUSH.
>>>>
>>>> So we can't reuse rq->queuelist, this is why we need rq->flush.list
>>>>
>>>> In preparation of the next patch that reuse rq->queuelist for flush
>>>> state machine, we change the double linked list to a u64 counter,
>>>> which count all inflight flush_data requests.
>>>>
>>>> This is ok since we only need to know if there is any inflight
>>>> flush_data request, so a u64 counter is good. The only problem I can
>>>> think of is that u64 counter may overflow, which should be unlikely happen.
>>>
>>> It won't overflow, q->nr_requests is 'unsigned long', which should have
>>> been limited to one more reasonable value, such as 2 * BLK_MQ_MAX_DEPTH, so
>>> u16 should be big enough in theory.
>>
>> Ah, right. q->nr_requests is 'unsigned long' and q->queue_depth is 'unsigned int',
>> so 'unsigned long' counter here won't overflow.
>
> Not like q->nr_requests, q->queue_depth usually means the whole queue's depth,
> which may cover all hw queue's depth. And it is only used by scsi, but it
> should be held in "unsigned short" too.
>
>>
>> Should I change it to smaller 'unsigned short' or just leave it as 'unsigned long' ?
>> (Now the size of struct blk_flush_queue is exactly 64 bytes)
>
> You have to limit q->nr_requests first, which may need a bit more work for avoiding
> compiling warning or sort of thing. And 64k is big enough for holding per-queue
> scheduler request.
>
> Once it is done, it is fine to define this counter as 'unsigned short'.
>
Ok, I looked around the related code and found it a bit subtle for now,
so I'd better just leave it as 'unsigned long' here. :)
Thanks.
Thread overview: 13+ messages
2023-06-27 12:08 [PATCH 0/4] blk-mq: optimize the size of struct request chengming.zhou
2023-06-27 12:08 ` [PATCH 1/4] blk-mq: use percpu csd to remote complete instead of per-rq csd chengming.zhou
2023-06-28 2:20 ` Ming Lei
2023-06-28 3:28 ` Chengming Zhou
2023-06-28 4:50 ` Ming Lei
2023-06-28 6:43 ` Chengming Zhou
2023-06-27 12:08 ` [PATCH 2/4] blk-flush: count inflight flush_data requests chengming.zhou
2023-06-28 4:13 ` Ming Lei
2023-06-28 4:55 ` Chengming Zhou
2023-06-28 7:22 ` Ming Lei
2023-06-28 12:55 ` Chengming Zhou
2023-06-27 12:08 ` [PATCH 3/4] blk-flush: reuse rq queuelist in flush state machine chengming.zhou
2023-06-27 12:08 ` [PATCH 4/4] blk-mq: delete unused completion_data in struct request chengming.zhou