[LSF/MM/BPF TOPIC] bpf iterator for file-system

From: Hou Tao @ 2023-02-28  3:30 UTC
  To: lsf-pc
  Cc: bpf, linux-fsdevel, Miklos Szeredi, Nhat Pham, Alexei Starovoitov,
	Yonghong Song

From time to time, new syscalls have been proposed to improve file-system
observability:

(1) getvalues() [0]. It uses a hierarchical namespace API to gather and return
multiple values in a single syscall.
(2) cachestat() [1]. It returns the cache status (e.g., number of dirty pages)
of a given file in a scalable way.

All of these proposals require adding a new syscall. Here I would like to
propose another solution for file-system observability: a bpf iterator for
file-system objects. The initial idea came when I was trying to implement a
filefrag-like page cache tool with support for multi-order folios, so that we
can know the number of multi-order folios and their orders in the page cache.
After developing a demo for it, I realized that we could use it to provide more
observability for file-system objects, e.g., dumping the per-cpu iostat for a
super-block [2], iterating over all inodes in a super-block to dump info for
specific inodes (e.g., unlinked but pinned inodes), or displaying the flags of
a specific mount.

The BPF iterator was introduced in v5.8 [3] to support flexible content dumping
for kernel objects. It works by creating a bpf iterator file [4], which is a
seq-like read-only file whose content is determined by a previously loaded bpf
program, so userspace can read the bpf iterator file to get the information it
needs. However, there are some unresolved issues:
(1) The privilege.
Loading the bpf program requires CAP_SYS_ADMIN or CAP_BPF. This means that the
observability will be available only to privileged processes. Maybe we can load
the bpf program through a privileged process and make the bpf iterator file
readable by normal users.
(2) Avoiding pinning of the super-block.
In the current naive implementation, the bpf iterator simply pins the
super-block of the passed fd, which prevents the super-block from being
destroyed. Perhaps fs-pin is a better choice, so that the bpf iterator can be
deactivated after the filesystem is unmounted.
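
To illustrate the userspace side described above (reading a seq-like bpf
iterator file), here is a minimal sketch. It assumes the iterator has already
been created and pinned somewhere by a privileged process. The helper name
dump_iter_file() and any path passed to it are hypothetical and not part of
the demo or of any existing API. The helper only uses plain open()/read(), so
it works on any readable file.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy everything readable from `path` to `out`.
 * Returns the number of bytes copied, or -1 on error (errno is set).
 */
long dump_iter_file(const char *path, FILE *out)
{
	char buf[4096];
	long total = 0;
	ssize_t n;
	int saved;
	int fd;

	assert(path != NULL && out != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* A seq-like file is consumed with plain read() until it returns 0. */
	for (;;) {
		n = read(fd, buf, sizeof(buf));
		if (n == 0)
			break;
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry the read */
			saved = errno;
			close(fd);
			errno = saved;
			return -1;
		}
		if (fwrite(buf, 1, (size_t)n, out) != (size_t)n) {
			close(fd);
			errno = EIO;
			return -1;
		}
		total += n;
	}

	close(fd);
	return total;
}
```

A caller could then do something like dump_iter_file("/sys/fs/bpf/fs_iter",
stdout), where the pinned path is again only an assumed example. If the
privileged loader makes the pinned file readable by normal users, as suggested
in (1), the reader itself would need no capabilities.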

I hope to send out an RFC soon before LSF/MM/BPF for further discussion.

[0]:
https://lore.kernel.org/linux-fsdevel/YnEeuw6fd1A8usjj@miu.piliscsaba.redhat.com/
[1]: https://lore.kernel.org/linux-mm/20230219073318.366189-1-nphamcs@gmail.com/
[2]:
https://lore.kernel.org/linux-fsdevel/CAJfpegsCKEx41KA1S2QJ9gX9BEBG4_d8igA0DT66GFH2ZanspA@mail.gmail.com/
[3]: https://lore.kernel.org/bpf/20200509175859.2474608-1-yhs@fb.com/
[4]: https://docs.kernel.org/bpf/bpf_iterators.html
