* [PATCHSET RFC v2 0/5] Cache issue side time querying
@ 2024-01-16 16:54 Jens Axboe
2024-01-16 16:54 ` [PATCH 1/5] block: add blk_time_get_ns() helper Jens Axboe
` (5 more replies)
0 siblings, 6 replies; 11+ messages in thread
From: Jens Axboe @ 2024-01-16 16:54 UTC (permalink / raw)
To: linux-block; +Cc: kbusch, joshi.k
Hi,
When I run my peak testing to see if we've regressed, my test script
always does:
echo 0 > /sys/block/$DEV/queue/iostats
echo 2 > /sys/block/$DEV/queue/nomerges
for each device being used. It's unfortunate that we need to disable
iostats, but without doing that, I lose about 12% performance. The main
reason for that is the time querying we need to do, when iostats are
enabled. As it turns out, lots of other block code is quite trigger-happy
with querying time as well. We do have some nice batching in place
which helps amortize that, but it's not perfect.
This trivial patchset simply caches the current time in struct blk_plug,
on the premise that any issue side time querying can get adequate
granularity through that. Nobody really needs nsec granularity on the
timestamp.
Results in patch 2, but tldr is a more than 9% improvement (108M -> 118M
IOPS) for my test case, which doesn't even enable most of the costly
block layer items that you'd typically find in a distro and which would
further increase the number of issue side time calls. This brings iostats
enabled _almost_ to the level of turning it off.
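As a quick sanity check on the headline number, the relative gain can be computed from the rounded IOPS figures quoted above. This is a minimal userspace sketch and not part of the patchset; improvement_pct() is a hypothetical helper name.

```c
#include <assert.h>

/*
 * Relative improvement, in percent, when throughput goes from
 * `before` to `after`. Hypothetical helper, for illustration only.
 */
static double improvement_pct(double before, double after)
{
	/* A zero or negative baseline would make the ratio meaningless. */
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

Calling improvement_pct(108.0, 118.0) with the rounded figures gives roughly 9.26, which is consistent with the "more than 9%" quoted above.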
v2:
- Fix typo in cover letter, the prep script obviously turns
_off_ iostats normally
- Cover rest of block/* cases that use ktime_get_ns()
- Fix build error in block/blk-wbt.c
- Don't use the LSB to detect if the timestamp is valid or not,
just accept we'll do double ktime_get_ns() if we happen to
get 0 as a valid time.
- Invalidate timestamp on any schedule out condition
- Add two patches reclaiming the added space in blk_plug
- Update to current perf results
--
Jens Axboe
* [PATCH 1/5] block: add blk_time_get_ns() helper
2024-01-16 16:54 [PATCHSET RFC v2 0/5] Cache issue side time querying Jens Axboe
@ 2024-01-16 16:54 ` Jens Axboe
2024-01-17 8:01 ` Johannes Thumshirn
2024-01-16 16:54 ` [PATCH 2/5] block: cache current nsec time in struct blk_plug Jens Axboe
` (4 subsequent siblings)
5 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2024-01-16 16:54 UTC (permalink / raw)
To: linux-block; +Cc: kbusch, joshi.k, Jens Axboe
Convert any user of ktime_get_ns() to use blk_time_get_ns(), so we have
a unified API for querying the current time in nanoseconds.
No functional changes intended, this patch just wraps ktime_get_ns()
with a block helper.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
block/bfq-cgroup.c | 14 +++++++-------
block/bfq-iosched.c | 22 +++++++++++-----------
block/blk-cgroup.c | 2 +-
block/blk-flush.c | 2 +-
block/blk-iocost.c | 6 +++---
block/blk-iolatency.c | 6 +++---
block/blk-mq.c | 16 ++++++++--------
block/blk-throttle.c | 6 +++---
block/blk-wbt.c | 5 ++---
include/linux/blkdev.h | 5 +++++
10 files changed, 44 insertions(+), 40 deletions(-)
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 2c90e5de0acd..d442ee358fc2 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -127,7 +127,7 @@ static void bfqg_stats_update_group_wait_time(struct bfqg_stats *stats)
if (!bfqg_stats_waiting(stats))
return;
- now = ktime_get_ns();
+ now = blk_time_get_ns();
if (now > stats->start_group_wait_time)
bfq_stat_add(&stats->group_wait_time,
now - stats->start_group_wait_time);
@@ -144,7 +144,7 @@ static void bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
return;
if (bfqg == curr_bfqg)
return;
- stats->start_group_wait_time = ktime_get_ns();
+ stats->start_group_wait_time = blk_time_get_ns();
bfqg_stats_mark_waiting(stats);
}
@@ -156,7 +156,7 @@ static void bfqg_stats_end_empty_time(struct bfqg_stats *stats)
if (!bfqg_stats_empty(stats))
return;
- now = ktime_get_ns();
+ now = blk_time_get_ns();
if (now > stats->start_empty_time)
bfq_stat_add(&stats->empty_time,
now - stats->start_empty_time);
@@ -183,7 +183,7 @@ void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg)
if (bfqg_stats_empty(stats))
return;
- stats->start_empty_time = ktime_get_ns();
+ stats->start_empty_time = blk_time_get_ns();
bfqg_stats_mark_empty(stats);
}
@@ -192,7 +192,7 @@ void bfqg_stats_update_idle_time(struct bfq_group *bfqg)
struct bfqg_stats *stats = &bfqg->stats;
if (bfqg_stats_idling(stats)) {
- u64 now = ktime_get_ns();
+ u64 now = blk_time_get_ns();
if (now > stats->start_idle_time)
bfq_stat_add(&stats->idle_time,
@@ -205,7 +205,7 @@ void bfqg_stats_set_start_idle_time(struct bfq_group *bfqg)
{
struct bfqg_stats *stats = &bfqg->stats;
- stats->start_idle_time = ktime_get_ns();
+ stats->start_idle_time = blk_time_get_ns();
bfqg_stats_mark_idling(stats);
}
@@ -242,7 +242,7 @@ void bfqg_stats_update_completion(struct bfq_group *bfqg, u64 start_time_ns,
u64 io_start_time_ns, blk_opf_t opf)
{
struct bfqg_stats *stats = &bfqg->stats;
- u64 now = ktime_get_ns();
+ u64 now = blk_time_get_ns();
if (now > io_start_time_ns)
blkg_rwstat_add(&stats->service_time, opf,
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 3cce6de464a7..1922574e1c0d 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1005,7 +1005,7 @@ static struct request *bfq_check_fifo(struct bfq_queue *bfqq,
rq = rq_entry_fifo(bfqq->fifo.next);
- if (rq == last || ktime_get_ns() < rq->fifo_time)
+ if (rq == last || blk_time_get_ns() < rq->fifo_time)
return NULL;
bfq_log_bfqq(bfqq->bfqd, bfqq, "check_fifo: returned %p", rq);
@@ -1829,7 +1829,7 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
* bfq_bfqq_update_budg_for_activation for
* details on the usage of the next variable.
*/
- arrived_in_time = ktime_get_ns() <=
+ arrived_in_time = blk_time_get_ns() <=
bfqq->ttime.last_end_request +
bfqd->bfq_slice_idle * 3;
unsigned int act_idx = bfq_actuator_index(bfqd, rq->bio);
@@ -2208,7 +2208,7 @@ static void bfq_add_request(struct request *rq)
struct request *next_rq, *prev;
unsigned int old_wr_coeff = bfqq->wr_coeff;
bool interactive = false;
- u64 now_ns = ktime_get_ns();
+ u64 now_ns = blk_time_get_ns();
bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
bfqq->queued[rq_is_sync(rq)]++;
@@ -2262,7 +2262,7 @@ static void bfq_add_request(struct request *rq)
bfqd->rqs_injected && bfqd->tot_rq_in_driver > 0)) &&
time_is_before_eq_jiffies(bfqq->decrease_time_jif +
msecs_to_jiffies(10))) {
- bfqd->last_empty_occupied_ns = ktime_get_ns();
+ bfqd->last_empty_occupied_ns = blk_time_get_ns();
/*
* Start the state machine for measuring the
* total service time of rq: setting
@@ -3433,7 +3433,7 @@ static void bfq_reset_rate_computation(struct bfq_data *bfqd,
struct request *rq)
{
if (rq != NULL) { /* new rq dispatch now, reset accordingly */
- bfqd->last_dispatch = bfqd->first_dispatch = ktime_get_ns();
+ bfqd->last_dispatch = bfqd->first_dispatch = blk_time_get_ns();
bfqd->peak_rate_samples = 1;
bfqd->sequential_samples = 0;
bfqd->tot_sectors_dispatched = bfqd->last_rq_max_size =
@@ -3590,7 +3590,7 @@ static void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq)
*/
static void bfq_update_peak_rate(struct bfq_data *bfqd, struct request *rq)
{
- u64 now_ns = ktime_get_ns();
+ u64 now_ns = blk_time_get_ns();
if (bfqd->peak_rate_samples == 0) { /* first dispatch */
bfq_log(bfqd, "update_peak_rate: goto reset, samples %d",
@@ -5591,7 +5591,7 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
struct bfq_io_cq *bic, pid_t pid, int is_sync,
unsigned int act_idx)
{
- u64 now_ns = ktime_get_ns();
+ u64 now_ns = blk_time_get_ns();
bfqq->actuator_idx = act_idx;
RB_CLEAR_NODE(&bfqq->entity.rb_node);
@@ -5903,7 +5903,7 @@ static void bfq_update_io_thinktime(struct bfq_data *bfqd,
*/
if (bfqq->dispatched || bfq_bfqq_busy(bfqq))
return;
- elapsed = ktime_get_ns() - bfqq->ttime.last_end_request;
+ elapsed = blk_time_get_ns() - bfqq->ttime.last_end_request;
elapsed = min_t(u64, elapsed, 2ULL * bfqd->bfq_slice_idle);
ttime->ttime_samples = (7*ttime->ttime_samples + 256) / 8;
@@ -6194,7 +6194,7 @@ static bool __bfq_insert_request(struct bfq_data *bfqd, struct request *rq)
bfq_add_request(rq);
idle_timer_disabled = waiting && !bfq_bfqq_wait_request(bfqq);
- rq->fifo_time = ktime_get_ns() + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
+ rq->fifo_time = blk_time_get_ns() + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
list_add_tail(&rq->queuelist, &bfqq->fifo);
bfq_rq_enqueued(bfqd, bfqq, rq);
@@ -6370,7 +6370,7 @@ static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
bfq_weights_tree_remove(bfqq);
}
- now_ns = ktime_get_ns();
+ now_ns = blk_time_get_ns();
bfqq->ttime.last_end_request = now_ns;
@@ -6585,7 +6585,7 @@ static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
static void bfq_update_inject_limit(struct bfq_data *bfqd,
struct bfq_queue *bfqq)
{
- u64 tot_time_ns = ktime_get_ns() - bfqd->last_empty_occupied_ns;
+ u64 tot_time_ns = blk_time_get_ns() - bfqd->last_empty_occupied_ns;
unsigned int old_limit = bfqq->inject_limit;
if (bfqq->last_serv_time_ns > 0 && bfqd->rqs_injected) {
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ff93c385ba5a..bdbb557feb5a 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1846,7 +1846,7 @@ static void blkcg_maybe_throttle_blkg(struct blkcg_gq *blkg, bool use_memdelay)
{
unsigned long pflags;
bool clamp;
- u64 now = ktime_to_ns(ktime_get());
+ u64 now = blk_time_get_ns();
u64 exp;
u64 delay_nsec = 0;
int tok;
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 3f4d41952ef2..b0f314f4bc14 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -143,7 +143,7 @@ static void blk_account_io_flush(struct request *rq)
part_stat_lock();
part_stat_inc(part, ios[STAT_FLUSH]);
part_stat_add(part, nsecs[STAT_FLUSH],
- ktime_get_ns() - rq->start_time_ns);
+ blk_time_get_ns() - rq->start_time_ns);
part_stat_unlock();
}
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index c8beec6d7df0..e54b17261d96 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -829,7 +829,7 @@ static int ioc_autop_idx(struct ioc *ioc, struct gendisk *disk)
/* step up/down based on the vrate */
vrate_pct = div64_u64(ioc->vtime_base_rate * 100, VTIME_PER_USEC);
- now_ns = ktime_get_ns();
+ now_ns = blk_time_get_ns();
if (p->too_fast_vrate_pct && p->too_fast_vrate_pct <= vrate_pct) {
if (!ioc->autop_too_fast_at)
@@ -1044,7 +1044,7 @@ static void ioc_now(struct ioc *ioc, struct ioc_now *now)
unsigned seq;
u64 vrate;
- now->now_ns = ktime_get();
+ now->now_ns = blk_time_get_ns();
now->now = ktime_to_us(now->now_ns);
vrate = atomic64_read(&ioc->vtime_rate);
@@ -2810,7 +2810,7 @@ static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq)
return;
}
- on_q_ns = ktime_get_ns() - rq->alloc_time_ns;
+ on_q_ns = blk_time_get_ns() - rq->alloc_time_ns;
rq_wait_ns = rq->start_time_ns - rq->alloc_time_ns;
size_nsec = div64_u64(calc_size_vtime_cost(rq, ioc), VTIME_PER_NSEC);
diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index c1a6aba1d59e..ebb522788d97 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -609,7 +609,7 @@ static void blkcg_iolatency_done_bio(struct rq_qos *rqos, struct bio *bio)
if (!iolat->blkiolat->enabled)
return;
- now = ktime_to_ns(ktime_get());
+ now = blk_time_get_ns();
while (blkg && blkg->parent) {
iolat = blkg_to_lat(blkg);
if (!iolat) {
@@ -661,7 +661,7 @@ static void blkiolatency_timer_fn(struct timer_list *t)
struct blk_iolatency *blkiolat = from_timer(blkiolat, t, timer);
struct blkcg_gq *blkg;
struct cgroup_subsys_state *pos_css;
- u64 now = ktime_to_ns(ktime_get());
+ u64 now = blk_time_get_ns();
rcu_read_lock();
blkg_for_each_descendant_pre(blkg, pos_css,
@@ -985,7 +985,7 @@ static void iolatency_pd_init(struct blkg_policy_data *pd)
struct blkcg_gq *blkg = lat_to_blkg(iolat);
struct rq_qos *rqos = iolat_rq_qos(blkg->q);
struct blk_iolatency *blkiolat = BLKIOLATENCY(rqos);
- u64 now = ktime_to_ns(ktime_get());
+ u64 now = blk_time_get_ns();
int cpu;
if (blk_queue_nonrot(blkg->q))
diff --git a/block/blk-mq.c b/block/blk-mq.c
index aa87fcfda1ec..aff9e9492f59 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -323,7 +323,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
RB_CLEAR_NODE(&rq->rb_node);
rq->tag = BLK_MQ_NO_TAG;
rq->internal_tag = BLK_MQ_NO_TAG;
- rq->start_time_ns = ktime_get_ns();
+ rq->start_time_ns = blk_time_get_ns();
rq->part = NULL;
blk_crypto_rq_set_defaults(rq);
}
@@ -333,7 +333,7 @@ EXPORT_SYMBOL(blk_rq_init);
static inline void blk_mq_rq_time_init(struct request *rq, u64 alloc_time_ns)
{
if (blk_mq_need_time_stamp(rq))
- rq->start_time_ns = ktime_get_ns();
+ rq->start_time_ns = blk_time_get_ns();
else
rq->start_time_ns = 0;
@@ -444,7 +444,7 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
/* alloc_time includes depth and tag waits */
if (blk_queue_rq_alloc_time(q))
- alloc_time_ns = ktime_get_ns();
+ alloc_time_ns = blk_time_get_ns();
if (data->cmd_flags & REQ_NOWAIT)
data->flags |= BLK_MQ_REQ_NOWAIT;
@@ -629,7 +629,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
/* alloc_time includes depth and tag waits */
if (blk_queue_rq_alloc_time(q))
- alloc_time_ns = ktime_get_ns();
+ alloc_time_ns = blk_time_get_ns();
/*
* If the tag allocator sleeps we could get an allocation for a
@@ -1042,7 +1042,7 @@ static inline void __blk_mq_end_request_acct(struct request *rq, u64 now)
inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
{
if (blk_mq_need_time_stamp(rq))
- __blk_mq_end_request_acct(rq, ktime_get_ns());
+ __blk_mq_end_request_acct(rq, blk_time_get_ns());
blk_mq_finish_request(rq);
@@ -1085,7 +1085,7 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
u64 now = 0;
if (iob->need_ts)
- now = ktime_get_ns();
+ now = blk_time_get_ns();
while ((rq = rq_list_pop(&iob->req_list)) != NULL) {
prefetch(rq->bio);
@@ -1255,7 +1255,7 @@ void blk_mq_start_request(struct request *rq)
if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags) &&
!blk_rq_is_passthrough(rq)) {
- rq->io_start_time_ns = ktime_get_ns();
+ rq->io_start_time_ns = blk_time_get_ns();
rq->stats_sectors = blk_rq_sectors(rq);
rq->rq_flags |= RQF_STATS;
rq_qos_issue(q, rq);
@@ -3107,7 +3107,7 @@ blk_status_t blk_insert_cloned_request(struct request *rq)
blk_mq_run_dispatch_ops(q,
ret = blk_mq_request_issue_directly(rq, true));
if (ret)
- blk_account_io_done(rq, ktime_get_ns());
+ blk_account_io_done(rq, blk_time_get_ns());
return ret;
}
EXPORT_SYMBOL_GPL(blk_insert_cloned_request);
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 16f5766620a4..da9dc1f793c3 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1815,7 +1815,7 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
time = min_t(unsigned long, MAX_IDLE_TIME, 4 * tg->idletime_threshold);
ret = tg->latency_target == DFL_LATENCY_TARGET ||
tg->idletime_threshold == DFL_IDLE_THRESHOLD ||
- (ktime_get_ns() >> 10) - tg->last_finish_time > time ||
+ (blk_time_get_ns() >> 10) - tg->last_finish_time > time ||
tg->avg_idletime > tg->idletime_threshold ||
(tg->latency_target && tg->bio_cnt &&
tg->bad_bio_cnt * 5 < tg->bio_cnt);
@@ -2060,7 +2060,7 @@ static void blk_throtl_update_idletime(struct throtl_grp *tg)
if (last_finish_time == 0)
return;
- now = ktime_get_ns() >> 10;
+ now = blk_time_get_ns() >> 10;
if (now <= last_finish_time ||
last_finish_time == tg->checked_last_finish_time)
return;
@@ -2327,7 +2327,7 @@ void blk_throtl_bio_endio(struct bio *bio)
if (!tg->td->limit_valid[LIMIT_LOW])
return;
- finish_time_ns = ktime_get_ns();
+ finish_time_ns = blk_time_get_ns();
tg->last_finish_time = finish_time_ns >> 10;
start_time = bio_issue_time(&bio->bi_issue) >> 10;
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 5ba3cd574eac..4c1c04345040 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -274,13 +274,12 @@ static inline bool stat_sample_valid(struct blk_rq_stat *stat)
static u64 rwb_sync_issue_lat(struct rq_wb *rwb)
{
- u64 now, issue = READ_ONCE(rwb->sync_issue);
+ u64 issue = READ_ONCE(rwb->sync_issue);
if (!issue || !rwb->sync_cookie)
return 0;
- now = ktime_to_ns(ktime_get());
- return now - issue;
+ return blk_time_get_ns() - issue;
}
static inline unsigned int wbt_inflight(struct rq_wb *rwb)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 99e4f5e72213..2f9ceea0e23b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -974,6 +974,11 @@ static inline void blk_flush_plug(struct blk_plug *plug, bool async)
int blkdev_issue_flush(struct block_device *bdev);
long nr_blockdev_pages(void);
+
+static inline u64 blk_time_get_ns(void)
+{
+ return ktime_get_ns();
+}
#else /* CONFIG_BLOCK */
struct blk_plug {
};
--
2.43.0
* [PATCH 2/5] block: cache current nsec time in struct blk_plug
2024-01-16 16:54 [PATCHSET RFC v2 0/5] Cache issue side time querying Jens Axboe
2024-01-16 16:54 ` [PATCH 1/5] block: add blk_time_get_ns() helper Jens Axboe
@ 2024-01-16 16:54 ` Jens Axboe
2024-01-17 8:02 ` Johannes Thumshirn
2024-01-16 16:54 ` [PATCH 3/5] block: update cached timestamp post schedule/preemption Jens Axboe
` (3 subsequent siblings)
5 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2024-01-16 16:54 UTC (permalink / raw)
To: linux-block; +Cc: kbusch, joshi.k, Jens Axboe
Querying the current time is the most costly thing we do in the block
layer per IO, and depending on kernel config settings, we may do it
many times per IO.
None of the callers actually need nsec granularity. Take advantage of
that by caching the current time in the plug, with the assumption here
being that any time checking will be temporally close enough that the
slight loss of precision doesn't matter.
If the block plug gets flushed, e.g. on preempt or schedule out, then
we invalidate the cached clock.
On a basic peak IOPS test case with iostats enabled, this changes
the performance from:
IOPS=108.41M, BW=52.93GiB/s, IOS/call=31/31
IOPS=108.43M, BW=52.94GiB/s, IOS/call=32/32
IOPS=108.29M, BW=52.88GiB/s, IOS/call=31/32
IOPS=108.35M, BW=52.91GiB/s, IOS/call=32/32
IOPS=108.42M, BW=52.94GiB/s, IOS/call=31/31
IOPS=108.40M, BW=52.93GiB/s, IOS/call=32/32
IOPS=108.31M, BW=52.89GiB/s, IOS/call=32/31
to
IOPS=118.79M, BW=58.00GiB/s, IOS/call=31/32
IOPS=118.62M, BW=57.92GiB/s, IOS/call=31/31
IOPS=118.80M, BW=58.01GiB/s, IOS/call=32/31
IOPS=118.78M, BW=58.00GiB/s, IOS/call=32/32
IOPS=118.69M, BW=57.95GiB/s, IOS/call=32/31
IOPS=118.62M, BW=57.92GiB/s, IOS/call=32/31
IOPS=118.63M, BW=57.92GiB/s, IOS/call=31/32
which is more than a 9% improvement in performance. Looking at perf diff,
we can see a huge reduction in time overhead:
10.55% -9.88% [kernel.vmlinux] [k] read_tsc
1.31% -1.22% [kernel.vmlinux] [k] ktime_get
Note that since this relies on blk_plug for the caching, it's only
applicable to the issue side. But this is where most of the time calls
happen anyway. It's also worth noting that the above testing doesn't
enable any of the higher cost CPU items on the block layer side, like
wbt, cgroups, iocost, etc, which all would add additional time querying.
IOW, results would likely look even better in comparison with those
enabled, as distros would do.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
block/blk-core.c | 1 +
include/linux/blkdev.h | 15 ++++++++++++++-
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 11342af420d0..cc4db4d92c75 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1073,6 +1073,7 @@ void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios)
if (tsk->plug)
return;
+ plug->cur_ktime = 0;
plug->mq_list = NULL;
plug->cached_rq = NULL;
plug->nr_ios = min_t(unsigned short, nr_ios, BLK_MAX_REQUEST_COUNT);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2f9ceea0e23b..2d5c94e99792 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -942,6 +942,7 @@ struct blk_plug {
/* if ios_left is > 1, we can batch tag/rq allocations */
struct request *cached_rq;
+ u64 cur_ktime;
unsigned short nr_ios;
unsigned short rq_count;
@@ -977,7 +978,19 @@ long nr_blockdev_pages(void);
static inline u64 blk_time_get_ns(void)
{
- return ktime_get_ns();
+ struct blk_plug *plug = current->plug;
+
+ if (!plug)
+ return ktime_get_ns();
+
+ /*
+ * 0 could very well be a valid time, but rather than flag "this is
+ * a valid timestamp" separately, just accept that we'll do an extra
+ * ktime_get_ns() if we just happen to get 0 as the current time.
+ */
+ if (!plug->cur_ktime)
+ plug->cur_ktime = ktime_get_ns();
+ return plug->cur_ktime;
}
#else /* CONFIG_BLOCK */
struct blk_plug {
--
2.43.0
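The caching idea in this patch can be tried outside the kernel. The following is a minimal userspace sketch, assuming a POSIX system with clock_gettime(); struct plug_sketch, clock_read_ns(), plug_time_get_ns() and the clock_reads counter are hypothetical names that only mirror the shape of blk_time_get_ns() above, they are not kernel APIs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for struct blk_plug: only the cached timestamp. */
struct plug_sketch {
	uint64_t cur_time_ns;	/* 0 means "nothing cached yet" */
};

/* Counts how often the real clock was read, to make the saving visible. */
static unsigned long clock_reads;

/* Stand-in for ktime_get_ns(): read the monotonic clock in nanoseconds. */
static uint64_t clock_read_ns(void)
{
	struct timespec ts;
	int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

	/* CLOCK_MONOTONIC is expected to be available; treat failure as a bug. */
	assert(rc == 0);
	(void)rc;
	clock_reads++;
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Same shape as blk_time_get_ns() in the patch: without a plug, read the
 * clock every time; with a plug, read it once and reuse the cached value.
 */
static uint64_t plug_time_get_ns(struct plug_sketch *plug)
{
	if (!plug)
		return clock_read_ns();
	if (!plug->cur_time_ns)
		plug->cur_time_ns = clock_read_ns();
	return plug->cur_time_ns;
}
```

Two back-to-back calls with the same plug return the same value and bump clock_reads only once, while calls with a NULL plug read the clock every time. That is the trade the patch makes: slightly stale timestamps within a plug in exchange for fewer clock reads.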
* [PATCH 3/5] block: update cached timestamp post schedule/preemption
2024-01-16 16:54 [PATCHSET RFC v2 0/5] Cache issue side time querying Jens Axboe
2024-01-16 16:54 ` [PATCH 1/5] block: add blk_time_get_ns() helper Jens Axboe
2024-01-16 16:54 ` [PATCH 2/5] block: cache current nsec time in struct blk_plug Jens Axboe
@ 2024-01-16 16:54 ` Jens Axboe
2024-01-17 8:06 ` Johannes Thumshirn
2024-01-16 16:54 ` [PATCH 4/5] block: shrink plug->{nr_ios, rq_count} to unsigned char Jens Axboe
` (2 subsequent siblings)
5 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2024-01-16 16:54 UTC (permalink / raw)
To: linux-block; +Cc: kbusch, joshi.k, Jens Axboe
Mark the task as having a cached timestamp when we assign it, so we
can efficiently check if it needs updating after being scheduled back in.
This covers both the actual schedule out case, which would've flushed
the plug, and the preemption case which doesn't touch the plugged
requests (for many reasons, one of them being then we'd need to have
preemption disabled around plug state manipulation).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/blkdev.h | 19 ++++++++++++++++++-
include/linux/sched.h | 2 +-
kernel/sched/core.c | 4 +++-
3 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2d5c94e99792..81a7fca1b4f7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -976,6 +976,17 @@ static inline void blk_flush_plug(struct blk_plug *plug, bool async)
int blkdev_issue_flush(struct block_device *bdev);
long nr_blockdev_pages(void);
+/*
+ * tsk == current here
+ */
+static inline void blk_plug_invalidate_ts(struct task_struct *tsk)
+{
+ struct blk_plug *plug = tsk->plug;
+
+ if (plug)
+ plug->cur_ktime = 0;
+}
+
static inline u64 blk_time_get_ns(void)
{
struct blk_plug *plug = current->plug;
@@ -988,8 +999,10 @@ static inline u64 blk_time_get_ns(void)
* a valid timestamp" separately, just accept that we'll do an extra
* ktime_get_ns() if we just happen to get 0 as the current time.
*/
- if (!plug->cur_ktime)
+ if (!plug->cur_ktime) {
plug->cur_ktime = ktime_get_ns();
+ current->flags |= PF_BLOCK_TS;
+ }
return plug->cur_ktime;
}
#else /* CONFIG_BLOCK */
@@ -1013,6 +1026,10 @@ static inline void blk_flush_plug(struct blk_plug *plug, bool async)
{
}
+static inline void blk_plug_invalidate_ts(struct task_struct *tsk)
+{
+}
+
static inline int blkdev_issue_flush(struct block_device *bdev)
{
return 0;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9a66147915b2..d8a073b06495 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1642,7 +1642,7 @@ extern struct pid *cad_pid;
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */
-#define PF__HOLE__20000000 0x20000000
+#define PF_BLOCK_TS 0x20000000 /* plug has ts that needs updating */
#define PF__HOLE__40000000 0x40000000
#define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9116bcc90346..4675d59313ba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6787,7 +6787,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
static void sched_update_worker(struct task_struct *tsk)
{
- if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+ if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_BLOCK_TS)) {
+ if (tsk->flags & PF_BLOCK_TS)
+ blk_plug_invalidate_ts(tsk);
if (tsk->flags & PF_WQ_WORKER)
wq_worker_running(tsk);
else
--
2.43.0
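The flag-plus-invalidate scheme can be modelled in userspace with a fake clock, which keeps the behaviour deterministic. This is a sketch under stated assumptions: struct ts_plug, struct task_sketch, TASK_HAS_CACHED_TS, fake_clock_ns and both functions are hypothetical names. Clearing the flag after invalidation is a choice made in this sketch; the hunk above only zeroes the cached time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical analogue of PF_BLOCK_TS. */
#define TASK_HAS_CACHED_TS 0x1u

/* Hypothetical stand-in for struct blk_plug. */
struct ts_plug {
	uint64_t cur_time_ns;	/* 0 means "nothing cached" */
};

/* Hypothetical stand-in for struct task_struct. */
struct task_sketch {
	unsigned int flags;
	struct ts_plug *plug;
};

/* Fake clock so the example is deterministic; tests advance it by hand. */
static uint64_t fake_clock_ns = 1000;

/* Cache the time in the plug and mark the task as holding a cached value. */
static uint64_t task_time_get_ns(struct task_sketch *tsk)
{
	struct ts_plug *plug;

	assert(tsk != NULL);
	plug = tsk->plug;
	if (!plug)
		return fake_clock_ns;
	if (!plug->cur_time_ns) {
		plug->cur_time_ns = fake_clock_ns;
		tsk->flags |= TASK_HAS_CACHED_TS;
	}
	return plug->cur_time_ns;
}

/* Models the "scheduled back in" hook: only flagged tasks pay any cost. */
static void task_scheduled_in(struct task_sketch *tsk)
{
	if (!(tsk->flags & TASK_HAS_CACHED_TS))
		return;
	if (tsk->plug)
		tsk->plug->cur_time_ns = 0;
	/* Sketch-only choice: drop the flag once the cache is invalidated. */
	tsk->flags &= ~TASK_HAS_CACHED_TS;
}
```

The point of the flag is that the scheduler-side check stays a single test on task flags, so tasks that never cached a timestamp do not touch the plug at all.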
* [PATCH 4/5] block: shrink plug->{nr_ios, rq_count} to unsigned char
2024-01-16 16:54 [PATCHSET RFC v2 0/5] Cache issue side time querying Jens Axboe
` (2 preceding siblings ...)
2024-01-16 16:54 ` [PATCH 3/5] block: update cached timestamp post schedule/preemption Jens Axboe
@ 2024-01-16 16:54 ` Jens Axboe
2024-01-16 16:54 ` [PATCH 5/5] block: convert struct blk_plug callback list to hlists Jens Axboe
2024-01-16 21:08 ` [PATCHSET RFC v2 0/5] Cache issue side time querying Jens Axboe
5 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2024-01-16 16:54 UTC (permalink / raw)
To: linux-block; +Cc: kbusch, joshi.k, Jens Axboe
We never use more than 64 here, so we can change them from unsigned
short to just a byte. Add a BUILD_BUG_ON() check, in case the max plug
count changes in the future.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
block/blk-core.c | 4 ++--
block/blk-mq.c | 2 ++
include/linux/blkdev.h | 8 ++++----
3 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index cc4db4d92c75..902799f71a59 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1063,7 +1063,7 @@ int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork,
}
EXPORT_SYMBOL(kblockd_mod_delayed_work_on);
-void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios)
+void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned char nr_ios)
{
struct task_struct *tsk = current;
@@ -1076,7 +1076,7 @@ void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios)
plug->cur_ktime = 0;
plug->mq_list = NULL;
plug->cached_rq = NULL;
- plug->nr_ios = min_t(unsigned short, nr_ios, BLK_MAX_REQUEST_COUNT);
+ plug->nr_ios = min_t(unsigned char, nr_ios, BLK_MAX_REQUEST_COUNT);
plug->rq_count = 0;
plug->multiple_queues = false;
plug->has_elevator = false;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index aff9e9492f59..a9b4a66e1e13 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1283,6 +1283,8 @@ EXPORT_SYMBOL(blk_mq_start_request);
*/
static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
{
+ BUILD_BUG_ON(2 * BLK_MAX_REQUEST_COUNT > U8_MAX);
+
if (plug->multiple_queues)
return BLK_MAX_REQUEST_COUNT * 2;
return BLK_MAX_REQUEST_COUNT;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 81a7fca1b4f7..5b17d0e460e4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -943,9 +943,9 @@ struct blk_plug {
/* if ios_left is > 1, we can batch tag/rq allocations */
struct request *cached_rq;
u64 cur_ktime;
- unsigned short nr_ios;
+ unsigned char nr_ios;
- unsigned short rq_count;
+ unsigned char rq_count;
bool multiple_queues;
bool has_elevator;
@@ -963,7 +963,7 @@ struct blk_plug_cb {
extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
void *data, int size);
extern void blk_start_plug(struct blk_plug *);
-extern void blk_start_plug_nr_ios(struct blk_plug *, unsigned short);
+extern void blk_start_plug_nr_ios(struct blk_plug *, unsigned char);
extern void blk_finish_plug(struct blk_plug *);
void __blk_flush_plug(struct blk_plug *plug, bool from_schedule);
@@ -1010,7 +1010,7 @@ struct blk_plug {
};
static inline void blk_start_plug_nr_ios(struct blk_plug *plug,
- unsigned short nr_ios)
+ unsigned char nr_ios)
{
}
--
2.43.0
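The compile-time guard used here has a direct C11 userspace analogue in static_assert. The sketch below is illustrative only: SKETCH_MAX_REQUEST_COUNT is an assumed value standing in for BLK_MAX_REQUEST_COUNT (chosen so that twice the value is 64, matching the commit message), and the struct and helper names are hypothetical. Clamping in the wider type before narrowing is a choice made in this sketch.

```c
#include <assert.h>
#include <limits.h>

/* Assumption for this sketch: stands in for BLK_MAX_REQUEST_COUNT. */
#define SKETCH_MAX_REQUEST_COUNT 32

/*
 * Compile-time guard, analogous in spirit to the BUILD_BUG_ON() in the
 * patch: the build fails if the doubled maximum stops fitting in a byte.
 */
static_assert(2 * SKETCH_MAX_REQUEST_COUNT <= UCHAR_MAX,
	      "plug counters no longer fit in one byte");

/* Hypothetical stand-in for the two shrunken blk_plug counters. */
struct sketch_plug_counters {
	unsigned char nr_ios;
	unsigned char rq_count;
};

/* Clamp in the wider type first, then narrow, so large inputs cannot wrap. */
static unsigned char sketch_clamp_nr_ios(unsigned int nr_ios)
{
	if (nr_ios > SKETCH_MAX_REQUEST_COUNT)
		return SKETCH_MAX_REQUEST_COUNT;
	return (unsigned char)nr_ios;
}
```

With this layout the two counters occupy two bytes instead of four, which is where the space reclaimed for the cached timestamp comes from.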
* [PATCH 5/5] block: convert struct blk_plug callback list to hlists
2024-01-16 16:54 [PATCHSET RFC v2 0/5] Cache issue side time querying Jens Axboe
` (3 preceding siblings ...)
2024-01-16 16:54 ` [PATCH 4/5] block: shrink plug->{nr_ios, rq_count} to unsigned char Jens Axboe
@ 2024-01-16 16:54 ` Jens Axboe
2024-01-16 17:03 ` Jens Axboe
2024-01-16 21:08 ` [PATCHSET RFC v2 0/5] Cache issue side time querying Jens Axboe
5 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2024-01-16 16:54 UTC (permalink / raw)
To: linux-block; +Cc: kbusch, joshi.k, Jens Axboe
We currently use a doubly linked list, which means the head takes up
16 bytes. As any iteration goes over the full list by first splicing it
to an on-stack copy, we never need to remove members from the middle of
the list.
Convert it to an hlist instead, saving 8 bytes in the blk_plug structure.
This also helps save 40 bytes of text in the core block code, tested on
arm64.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
block/blk-core.c | 26 ++++++++++++++------------
include/linux/blkdev.h | 4 ++--
2 files changed, 16 insertions(+), 14 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 902799f71a59..a487881fe2a6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1080,7 +1080,7 @@ void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned char nr_ios)
plug->rq_count = 0;
plug->multiple_queues = false;
plug->has_elevator = false;
- INIT_LIST_HEAD(&plug->cb_list);
+ INIT_HLIST_HEAD(&plug->cb_list);
/*
* Store ordering should not be needed here, since a potential
@@ -1120,16 +1120,18 @@ EXPORT_SYMBOL(blk_start_plug);
static void flush_plug_callbacks(struct blk_plug *plug, bool from_schedule)
{
- LIST_HEAD(callbacks);
+ HLIST_HEAD(callbacks);
- while (!list_empty(&plug->cb_list)) {
- list_splice_init(&plug->cb_list, &callbacks);
+ while (!hlist_empty(&plug->cb_list)) {
+ struct hlist_node *entry, *tmp;
- while (!list_empty(&callbacks)) {
- struct blk_plug_cb *cb = list_first_entry(&callbacks,
- struct blk_plug_cb,
- list);
- list_del(&cb->list);
+ hlist_move_list(&plug->cb_list, &callbacks);
+
+ hlist_for_each_safe(entry, tmp, &callbacks) {
+ struct blk_plug_cb *cb;
+
+ cb = hlist_entry(entry, struct blk_plug_cb, list);
+ hlist_del(&cb->list);
cb->callback(cb, from_schedule);
}
}
@@ -1144,7 +1146,7 @@ struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug, void *data,
if (!plug)
return NULL;
- list_for_each_entry(cb, &plug->cb_list, list)
+ hlist_for_each_entry(cb, &plug->cb_list, list)
if (cb->callback == unplug && cb->data == data)
return cb;
@@ -1154,7 +1156,7 @@ struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug, void *data,
if (cb) {
cb->data = data;
cb->callback = unplug;
- list_add(&cb->list, &plug->cb_list);
+ hlist_add_head(&cb->list, &plug->cb_list);
}
return cb;
}
@@ -1162,7 +1164,7 @@ EXPORT_SYMBOL(blk_check_plugged);
void __blk_flush_plug(struct blk_plug *plug, bool from_schedule)
{
- if (!list_empty(&plug->cb_list))
+ if (!hlist_empty(&plug->cb_list))
flush_plug_callbacks(plug, from_schedule);
blk_mq_flush_plug_list(plug, from_schedule);
/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5b17d0e460e4..f339f856e44f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -950,13 +950,13 @@ struct blk_plug {
bool multiple_queues;
bool has_elevator;
- struct list_head cb_list; /* md requires an unplug callback */
+ struct hlist_head cb_list; /* md requires an unplug callback */
};
struct blk_plug_cb;
typedef void (*blk_plug_cb_fn)(struct blk_plug_cb *, bool);
struct blk_plug_cb {
- struct list_head list;
+ struct hlist_node list;
blk_plug_cb_fn callback;
void *data;
};
--
2.43.0
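The size saving comes from the head of an hlist holding one pointer where a doubly linked list head holds two. The following userspace sketch re-implements just enough of both shapes to show that, plus the move/delete pattern used when flushing callbacks. All *_sketch names are hypothetical re-implementations written for this example; they are not the kernel's list.h.

```c
#include <assert.h>
#include <stddef.h>

/* Doubly linked list head: two pointers. */
struct dlist_head_sketch {
	struct dlist_head_sketch *next, *prev;
};

/* hlist: the node has two pointers, but the head has only one. */
struct hlist_node_sketch {
	struct hlist_node_sketch *next, **pprev;
};

struct hlist_head_sketch {
	struct hlist_node_sketch *first;
};

/* The head shrinks by exactly one pointer (8 bytes on 64-bit targets). */
static_assert(sizeof(struct dlist_head_sketch) == 2 * sizeof(void *),
	      "dlist head is two pointers");
static_assert(sizeof(struct hlist_head_sketch) == sizeof(void *),
	      "hlist head is one pointer");

/* Insert a node at the front of the list. */
static void hlist_add_head_sketch(struct hlist_node_sketch *n,
				  struct hlist_head_sketch *h)
{
	struct hlist_node_sketch *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Unlink a node; pprev lets this work without knowing the head. */
static void hlist_del_sketch(struct hlist_node_sketch *n)
{
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	n->next = NULL;
	n->pprev = NULL;
}

/* Move the whole list to a new head, leaving the old head empty. */
static void hlist_move_list_sketch(struct hlist_head_sketch *old,
				   struct hlist_head_sketch *new_head)
{
	new_head->first = old->first;
	if (new_head->first)
		new_head->first->pprev = &new_head->first;
	old->first = NULL;
}
```

Moving the list to an on-stack head and then deleting entries one by one is the same pattern flush_plug_callbacks() follows in the diff above; the only capability given up is cheap access to the tail, which this code path does not need.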
* Re: [PATCH 5/5] block: convert struct blk_plug callback list to hlists
2024-01-16 16:54 ` [PATCH 5/5] block: convert struct blk_plug callback list to hlists Jens Axboe
@ 2024-01-16 17:03 ` Jens Axboe
0 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2024-01-16 17:03 UTC (permalink / raw)
To: linux-block; +Cc: kbusch, joshi.k
On 1/16/24 9:54 AM, Jens Axboe wrote:
> We currently use a doubly linked list, which means the head takes up
> 16 bytes. As any iteration goes over the full list by first splicing it
> to an on-stack copy, we never need to remove members from the middle of
> the list.
>
> Convert it to an hlist instead, saving 8 bytes in the blk_plug structure.
> This also helps save 40 bytes of text in the core block code, tested on
> arm64.
Gah, looks like I forgot to refresh before committing this one. It
just needs a one-liner for raid:
diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index 512746551f36..4a1b6f17067f 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -152,7 +152,7 @@ static inline bool raid1_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
plug = container_of(cb, struct raid1_plug_cb, cb);
bio_list_add(&plug->pending, bio);
if (++plug->count / MAX_PLUG_BIO >= copies) {
- list_del(&cb->list);
+ hlist_del(&cb->list);
cb->callback(cb, false);
}
--
Jens Axboe
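
The size argument in the quoted commit message, and the reason the raid one-liner can delete a callback without touching the list head, can be illustrated with a small user-space sketch. The struct layouts below are redefined locally to mirror the kernel's list types so the file builds on its own; the `sketch_*` helpers and `plug_head_bytes_saved()` are hypothetical names modelled on the semantics of hlist_add_head() and hlist_del(), not the kernel implementations.

```c
#include <assert.h>
#include <stddef.h>

/* Local mirrors of the kernel list types (assumption: same layout as
 * include/linux/types.h; redefined so this builds in user space). */
struct list_head { struct list_head *next, *prev; };

struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* Hypothetical helper: bytes saved per list head by using an hlist. */
static size_t plug_head_bytes_saved(void)
{
	return sizeof(struct list_head) - sizeof(struct hlist_head);
}

/* Insert n at the front of h. */
static void sketch_hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Unlink n. No head pointer is needed: pprev points at whatever
 * pointer currently refers to n (the head's first or a node's next). */
static void sketch_hlist_del(struct hlist_node *n)
{
	struct hlist_node *next = n->next;

	*n->pprev = next;
	if (next)
		next->pprev = n->pprev;
}
```

The head shrinks from two pointers to one (8 bytes on a 64-bit build), while each node still carries two pointers, so only the structure embedding the head benefits.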
* Re: [PATCHSET RFC v2 0/5] Cache issue side time querying
2024-01-16 16:54 [PATCHSET RFC v2 0/5] Cache issue side time querying Jens Axboe
` (4 preceding siblings ...)
2024-01-16 16:54 ` [PATCH 5/5] block: convert struct blk_plug callback list to hlists Jens Axboe
@ 2024-01-16 21:08 ` Jens Axboe
5 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2024-01-16 21:08 UTC (permalink / raw)
To: linux-block; +Cc: kbusch, joshi.k
On 1/16/24 9:54 AM, Jens Axboe wrote:
> Results in patch 2, but tldr is a more than 9% improvement (108M -> 118M
> IOPS) for my test case, which doesn't even enable most of the costly
> block layer items that you'd typically find in a distro and which would
> further increase the number of issue side time calls. This brings iostats
> enabled _almost_ to the level of turning it off.
I enabled the typical distro things (block cgroups, blk-wbt, iocost,
iolatency), which all add considerable cost (and are an optimization
project in themselves). This is the performance of the stock kernel
with iostats enabled:
IOPS=91.01M, BW=44.44GiB/s, IOS/call=32/32
IOPS=91.29M, BW=44.58GiB/s, IOS/call=31/32
IOPS=91.27M, BW=44.57GiB/s, IOS/call=32/31
IOPS=91.26M, BW=44.56GiB/s, IOS/call=32/31
IOPS=91.38M, BW=44.62GiB/s, IOS/call=32/31
IOPS=91.28M, BW=44.57GiB/s, IOS/call=32/32
which is down from 122M for an optimized config and with iostats off.
With this patchset applied (and one extra patch, missed a spot...), we
now get:
IOPS=101.38M, BW=49.50GiB/s, IOS/call=32/32
IOPS=101.31M, BW=49.47GiB/s, IOS/call=32/32
IOPS=101.35M, BW=49.49GiB/s, IOS/call=31/31
IOPS=101.44M, BW=49.53GiB/s, IOS/call=32/31
IOPS=101.32M, BW=49.47GiB/s, IOS/call=32/32
IOPS=101.14M, BW=49.38GiB/s, IOS/call=32/31
which is about a 10% improvement. I mostly ran this because I was
curious. While the above config changes do add more time stamping, they
also add additional overhead. In any case, a 10% win for the distro
config case is not bad at all.
--
Jens Axboe
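
As a rough check of the figures in the message above, the relative gain can be computed from the posted samples. This is only a sketch: the helper names `mean_miops()` and `gain_pct()` are hypothetical, and the arrays hold the IOPS values (in millions) copied from the two runs.

```c
#include <assert.h>
#include <stddef.h>

/* IOPS samples in millions, copied from the two runs in the message. */
static const double stock_miops[] = {
	91.01, 91.29, 91.27, 91.26, 91.38, 91.28
};
static const double patched_miops[] = {
	101.38, 101.31, 101.35, 101.44, 101.32, 101.14
};

/* Arithmetic mean of n samples. */
static double mean_miops(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Relative gain of 'after' over 'before', in percent. */
static double gain_pct(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

Calling gain_pct(mean_miops(stock_miops, 6), mean_miops(patched_miops, 6)) gives a value close to 11%, in line with the "about 10%" quoted in the message.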
* Re: [PATCH 1/5] block: add blk_time_get_ns() helper
2024-01-16 16:54 ` [PATCH 1/5] block: add blk_time_get_ns() helper Jens Axboe
@ 2024-01-17 8:01 ` Johannes Thumshirn
0 siblings, 0 replies; 11+ messages in thread
From: Johannes Thumshirn @ 2024-01-17 8:01 UTC (permalink / raw)
To: Jens Axboe, linux-block@vger.kernel.org
Cc: kbusch@kernel.org, joshi.k@samsung.com
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
* Re: [PATCH 2/5] block: cache current nsec time in struct blk_plug
2024-01-16 16:54 ` [PATCH 2/5] block: cache current nsec time in struct blk_plug Jens Axboe
@ 2024-01-17 8:02 ` Johannes Thumshirn
0 siblings, 0 replies; 11+ messages in thread
From: Johannes Thumshirn @ 2024-01-17 8:02 UTC (permalink / raw)
To: Jens Axboe, linux-block@vger.kernel.org
Cc: kbusch@kernel.org, joshi.k@samsung.com
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
* Re: [PATCH 3/5] block: update cached timestamp post schedule/preemption
2024-01-16 16:54 ` [PATCH 3/5] block: update cached timestamp post schedule/preemption Jens Axboe
@ 2024-01-17 8:06 ` Johannes Thumshirn
0 siblings, 0 replies; 11+ messages in thread
From: Johannes Thumshirn @ 2024-01-17 8:06 UTC (permalink / raw)
To: Jens Axboe, linux-block@vger.kernel.org
Cc: kbusch@kernel.org, joshi.k@samsung.com
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>