Linux block layer
* [RFC 0/1] block: export I/O latency histograms
From: Diangang Li @ 2026-07-02 13:27 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, linux-block, Diangang Li

From: Diangang Li <lidiangang@bytedance.com>

Hi,

The existing block I/O statistics count completed I/Os and accumulate the
time spent in each operation group. That works for average latency, but
not for the tail. Once the time is folded into a single total, userspace
cannot tell whether a device saw a steady stream of moderate I/Os or a
small number of very slow ones.
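
To make the limitation concrete, here is a small Python illustration.
The latencies are made up for the example, not measurements. Both
workloads have the same I/O count and the same total time, so a
count-plus-total interface reports the same average for each, yet their
99th percentiles differ by a factor of about 45.

```python
# Hypothetical per-I/O latencies in microseconds; illustrative only.
steady = [1000] * 100              # 100 I/Os at 1 ms each
spiky = [100] * 98 + [45100] * 2   # 98 fast I/Os plus two 45.1 ms outliers


def mean(samples):
    """Average latency, i.e. what total time / completed I/Os gives."""
    return sum(samples) / len(samples)


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    ordered = sorted(samples)
    rank = max(1, -((-len(ordered) * pct) // 100))  # integer ceiling
    return ordered[rank - 1]


print(mean(steady), mean(spiky))
print(percentile(steady, 99), percentile(spiky, 99))
```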

This RFC adds cumulative latency histograms for block devices and
partitions. The new accounting hooks into the same completion paths as
the existing I/O statistics and uses the same operation groups: read,
write, discard, and flush.

Two proc files are added:

  /proc/disk_lat_buckets
        bucket upper bounds, in microseconds

  /proc/disk_lat_hists
        cumulative histogram counters

/proc/disk_lat_hists follows the shape of /proc/diskstats. Each
reported device or partition has four consecutive lines, in read,
write, discard, flush order. Each line starts with the major number,
minor number, and device name, followed by the bucket counters.
Userspace can sample the file twice and compute interval histograms and
percentiles from the deltas.
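
A minimal userspace sketch of that sampling flow, in Python. It rests on
assumptions that the text above does not spell out: fields are
whitespace-separated, each counter holds the completions that landed in
its own bucket (not a running sum across buckets), and the last counter
collects everything above the largest bound. The helper names
(parse_hists, interval, percentile_bound) are hypothetical.

```python
GROUPS = ("read", "write", "discard", "flush")


def parse_hists(text):
    """Map (device name, group) -> list of bucket counters.

    Assumes four consecutive lines per device, in GROUPS order, each
    starting with major, minor, and device name.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    hists = {}
    for i in range(0, len(rows) - len(rows) % 4, 4):
        for group, fields in zip(GROUPS, rows[i:i + 4]):
            hists[(fields[2], group)] = [int(v) for v in fields[3:]]
    return hists


def interval(before, after):
    """Per-bucket completions between two samples of the same line."""
    return [b - a for a, b in zip(before, after)]


def percentile_bound(deltas, bounds_us, pct):
    """Upper bound (us) of the bucket holding the pct-th percentile.

    Returns None when the interval saw no I/O, or when the percentile
    falls in the last bucket, which has no upper bound.
    """
    total = sum(deltas)
    if total == 0:
        return None
    target = total * pct / 100
    seen = 0
    for i, count in enumerate(deltas):
        seen += count
        if seen >= target:
            return bounds_us[i] if i < len(bounds_us) else None
    return None
```

In use, a collector would read /proc/disk_lat_buckets once for the
bounds, read /proc/disk_lat_hists twice with a sleep in between, and pass
the two counter lists for a given device and group to interval(). The
result is only bucket-accurate: it reports the bucket's upper bound, not
an interpolated latency.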

eBPF is useful for targeted debugging, but it is not a good fit for
always-on accounting. These counters are block accounting data: they are
updated at the same accounting points as diskstats and can be read
without a resident userspace collector.

The histogram storage is allocated per block_device and is optional. If
allocation fails, bd_lat_hist remains NULL and the regular I/O
statistics keep working. The recording path updates per-cpu counters.

The current bucket table has 24 upper bounds, from 10 us to 8 seconds,
which gives 25 counters. That covers both fast NVMe devices and slow
disks without making the per-device state too large.
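
The relationship between N upper bounds and N + 1 counters can be
sketched as follows, in Python. The three-entry table is a shortened,
hypothetical stand-in, not the 24 bounds used by the patch, and treating
each bound as inclusive is also an assumption.

```python
from bisect import bisect_left

# Hypothetical, shortened bucket table in microseconds (not the real one).
BOUNDS_US = [10, 100, 1000]


def bucket_index(latency_us, bounds=BOUNDS_US):
    """Index of the first bucket whose upper bound is >= latency_us.

    A latency above every bound maps to len(bounds), the extra
    overflow counter; this is why N bounds need N + 1 counters.
    """
    return bisect_left(bounds, latency_us)


def record(counters, latency_us, bounds=BOUNDS_US):
    """Account one completed I/O in its bucket."""
    counters[bucket_index(latency_us, bounds)] += 1


counters = [0] * (len(BOUNDS_US) + 1)
for latency in (5, 10, 11, 1000, 1001):
    record(counters, latency)
print(counters)
```

With 24 bounds and four operation groups, this layout comes to 25 x 4 =
100 counters per device for each per-cpu copy.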

Fio tests on NVMe and HDD devices did not show a consistent performance
regression, and confirmed that histogram deltas match the corresponding
diskstats completion counters.

Diangang Li (1):
  block: export I/O latency histograms

 Documentation/ABI/testing/procfs-diskstats |  25 ++++
 block/Makefile                             |   2 +-
 block/bdev.c                               |   2 +
 block/blk-core.c                           |   4 +-
 block/blk-flush.c                          |   5 +-
 block/blk-mq.c                             |   4 +-
 block/blk.h                                |   7 +
 block/disk-lat-hist.c                      | 158 +++++++++++++++++++++
 block/genhd.c                              |  10 ++
 include/linux/blk_types.h                  |   1 +
 10 files changed, 213 insertions(+), 5 deletions(-)
 create mode 100644 block/disk-lat-hist.c

-- 
2.39.5

