* [PATCH 0/3] ceph: add subvolume metrics reporting support
@ 2025-11-27 13:46 Alex Markuze
2025-11-27 13:46 ` [PATCH 1/3] ceph: handle InodeStat v8 versioned field in reply parsing Alex Markuze
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Alex Markuze @ 2025-11-27 13:46 UTC (permalink / raw)
To: ceph-devel; +Cc: idryomov, linux-fsdevel, amarkuze, vdubeyko
This patch series adds support for per-subvolume I/O metrics collection
and reporting to the MDS. This enables administrators to monitor I/O
patterns at the subvolume granularity, which is useful for multi-tenant
CephFS deployments where different subvolumes may be allocated to
different users or applications.
The implementation requires protocol changes to receive the subvolume_id
from the MDS (InodeStat v9), and introduces a new metrics type
(CLIENT_METRIC_TYPE_SUBVOLUME_METRICS) for reporting aggregated I/O
statistics back to the MDS.
Patch 1 adds forward-compatible handling for InodeStat v8. The MDS v8
encoding added a versioned optmetadata field containing optional inode
metadata such as charmap (for case-insensitive/case-preserving file
systems). The kernel client does not currently support case-insensitive
lookups, so this field is skipped rather than parsed. This ensures
forward compatibility with newer MDS servers without requiring the
full case-insensitivity feature implementation.
Patch 2 adds support for parsing the subvolume_id field from InodeStat
v9 and storing it in the inode structure for later use.
Patch 3 adds the complete subvolume metrics infrastructure:
- CEPHFS_FEATURE_SUBVOLUME_METRICS feature flag for MDS negotiation
- Red-black tree based metrics tracker for efficient per-subvolume
aggregation
- Wire format encoding matching the MDS C++ AggregatedIOMetrics struct
- Integration with the existing CLIENT_METRICS message
- Recording of I/O operations from file read/write and writeback paths
- Debugfs interfaces for monitoring
Metrics tracked per subvolume include:
- Read/write operation counts
- Read/write byte counts
- Read/write latency sums (for average calculation)
The metrics are periodically sent to the MDS as part of the existing
metrics reporting infrastructure when the MDS advertises support for
the SUBVOLUME_METRICS feature.
Debugfs additions in Patch 3:
- metrics/subvolumes: displays last sent and pending subvolume metrics
- metrics/metric_features: displays MDS session feature negotiation
status, showing which metric-related features are enabled (including
METRIC_COLLECT and SUBVOLUME_METRICS)
Alex Markuze (3):
ceph: handle InodeStat v8 versioned field in reply parsing
ceph: parse subvolume_id from InodeStat v9 and store in inode
ceph: add subvolume metrics collection and reporting
fs/ceph/Makefile | 2 +-
fs/ceph/addr.c | 10 +
fs/ceph/debugfs.c | 153 ++++++++++++++
fs/ceph/file.c | 58 ++++-
fs/ceph/inode.c | 19 ++
fs/ceph/mds_client.c | 89 ++++++--
fs/ceph/mds_client.h | 14 +-
fs/ceph/metric.c | 172 ++++++++++++++-
fs/ceph/metric.h | 27 ++-
fs/ceph/subvolume_metrics.c | 407 ++++++++++++++++++++++++++++++++++++
fs/ceph/subvolume_metrics.h | 68 ++++++
fs/ceph/super.c | 1 +
fs/ceph/super.h | 3 +
13 files changed, 997 insertions(+), 26 deletions(-)
create mode 100644 fs/ceph/subvolume_metrics.c
create mode 100644 fs/ceph/subvolume_metrics.h
--
2.34.1
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/3] ceph: handle InodeStat v8 versioned field in reply parsing
2025-11-27 13:46 [PATCH 0/3] ceph: add subvolume metrics reporting support Alex Markuze
@ 2025-11-27 13:46 ` Alex Markuze
2025-11-28 1:48 ` kernel test robot
2025-12-01 20:20 ` Viacheslav Dubeyko
2025-11-27 13:46 ` [PATCH 2/3] ceph: parse subvolume_id from InodeStat v9 and store in inode Alex Markuze
` (2 subsequent siblings)
3 siblings, 2 replies; 10+ messages in thread
From: Alex Markuze @ 2025-11-27 13:46 UTC (permalink / raw)
To: ceph-devel; +Cc: idryomov, linux-fsdevel, amarkuze, vdubeyko
Add forward-compatible handling for the new versioned field introduced
in InodeStat v8. This patch only skips the field without using it,
preparing for future protocol extensions.
The v8 encoding adds a versioned sub-structure that needs to be properly
decoded and skipped to maintain compatibility with newer MDS versions.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/mds_client.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 1740047aef0f..32561fc701e5 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -231,6 +231,18 @@ static int parse_reply_info_in(void **p, void *end,
info->fscrypt_file_len, bad);
}
}
+
+ /* struct_v 8 added a versioned field - skip it */
+ if (struct_v >= 8) {
+ u8 v8_struct_v, v8_struct_compat;
+ u32 v8_struct_len;
+
+ ceph_decode_8_safe(p, end, v8_struct_v, bad);
+ ceph_decode_8_safe(p, end, v8_struct_compat, bad);
+ ceph_decode_32_safe(p, end, v8_struct_len, bad);
+ ceph_decode_skip_n(p, end, v8_struct_len, bad);
+ }
+
*p = end;
} else {
/* legacy (unversioned) struct */
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/3] ceph: parse subvolume_id from InodeStat v9 and store in inode
2025-11-27 13:46 [PATCH 0/3] ceph: add subvolume metrics reporting support Alex Markuze
2025-11-27 13:46 ` [PATCH 1/3] ceph: handle InodeStat v8 versioned field in reply parsing Alex Markuze
@ 2025-11-27 13:46 ` Alex Markuze
2025-12-01 23:27 ` Viacheslav Dubeyko
2025-11-27 13:46 ` [PATCH 3/3] ceph: add subvolume metrics collection and reporting Alex Markuze
2025-12-01 20:14 ` [PATCH 0/3] ceph: add subvolume metrics reporting support Viacheslav Dubeyko
3 siblings, 1 reply; 10+ messages in thread
From: Alex Markuze @ 2025-11-27 13:46 UTC (permalink / raw)
To: ceph-devel; +Cc: idryomov, linux-fsdevel, amarkuze, vdubeyko
Add support for parsing the subvolume_id field from InodeStat v9 and
storing it in the inode for later use by subvolume metrics tracking.
The subvolume_id identifies which CephFS subvolume an inode belongs to,
enabling per-subvolume I/O metrics collection and reporting.
This patch:
- Adds subvolume_id field to struct ceph_mds_reply_info_in
- Adds i_subvolume_id field to struct ceph_inode_info
- Parses subvolume_id from v9 InodeStat in parse_reply_info_in()
- Adds ceph_inode_set_subvolume() helper to propagate the ID to inodes
- Initializes i_subvolume_id in inode allocation and clears on destroy
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/inode.c | 19 +++++++++++++++++++
fs/ceph/mds_client.c | 7 +++++++
fs/ceph/mds_client.h | 1 +
fs/ceph/super.h | 2 ++
4 files changed, 29 insertions(+)
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index a6e260d9e420..c3fb4dac4692 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -638,6 +638,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
ci->i_max_bytes = 0;
ci->i_max_files = 0;
+ ci->i_subvolume_id = 0;
memset(&ci->i_dir_layout, 0, sizeof(ci->i_dir_layout));
memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
@@ -742,6 +743,8 @@ void ceph_evict_inode(struct inode *inode)
percpu_counter_dec(&mdsc->metric.total_inodes);
+ ci->i_subvolume_id = 0;
+
netfs_wait_for_outstanding_io(inode);
truncate_inode_pages_final(&inode->i_data);
if (inode->i_state & I_PINNING_NETFS_WB)
@@ -873,6 +876,18 @@ int ceph_fill_file_size(struct inode *inode, int issued,
return queue_trunc;
}
+void ceph_inode_set_subvolume(struct inode *inode, u64 subvolume_id)
+{
+ struct ceph_inode_info *ci;
+
+ if (!inode || !subvolume_id)
+ return;
+
+ ci = ceph_inode(inode);
+ if (READ_ONCE(ci->i_subvolume_id) != subvolume_id)
+ WRITE_ONCE(ci->i_subvolume_id, subvolume_id);
+}
+
void ceph_fill_file_time(struct inode *inode, int issued,
u64 time_warp_seq, struct timespec64 *ctime,
struct timespec64 *mtime, struct timespec64 *atime)
@@ -1087,6 +1102,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
new_issued = ~issued & info_caps;
__ceph_update_quota(ci, iinfo->max_bytes, iinfo->max_files);
+ ceph_inode_set_subvolume(inode, iinfo->subvolume_id);
#ifdef CONFIG_FS_ENCRYPTION
if (iinfo->fscrypt_auth_len &&
@@ -1594,6 +1610,8 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
goto done;
}
if (parent_dir) {
+ ceph_inode_set_subvolume(parent_dir,
+ rinfo->diri.subvolume_id);
err = ceph_fill_inode(parent_dir, NULL, &rinfo->diri,
rinfo->dirfrag, session, -1,
&req->r_caps_reservation);
@@ -1682,6 +1700,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
BUG_ON(!req->r_target_inode);
in = req->r_target_inode;
+ ceph_inode_set_subvolume(in, rinfo->targeti.subvolume_id);
err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
NULL, session,
(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 32561fc701e5..6f66097f740b 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -105,6 +105,8 @@ static int parse_reply_info_in(void **p, void *end,
int err = 0;
u8 struct_v = 0;
+ info->subvolume_id = 0;
+
if (features == (u64)-1) {
u32 struct_len;
u8 struct_compat;
@@ -243,6 +245,10 @@ static int parse_reply_info_in(void **p, void *end,
ceph_decode_skip_n(p, end, v8_struct_len, bad);
}
+ /* struct_v 9 added subvolume_id */
+ if (struct_v >= 9)
+ ceph_decode_64_safe(p, end, info->subvolume_id, bad);
+
*p = end;
} else {
/* legacy (unversioned) struct */
@@ -3962,6 +3968,7 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
goto out_err;
}
req->r_target_inode = in;
+ ceph_inode_set_subvolume(in, rinfo->targeti.subvolume_id);
}
mutex_lock(&session->s_mutex);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 0428a5eaf28c..bd3690baa65c 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -118,6 +118,7 @@ struct ceph_mds_reply_info_in {
u32 fscrypt_file_len;
u64 rsnaps;
u64 change_attr;
+ u64 subvolume_id;
};
struct ceph_mds_reply_dir_entry {
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index a1f781c46b41..69069c920683 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -385,6 +385,7 @@ struct ceph_inode_info {
/* quotas */
u64 i_max_bytes, i_max_files;
+ u64 i_subvolume_id;
s32 i_dir_pin;
@@ -1057,6 +1058,7 @@ extern struct inode *ceph_get_inode(struct super_block *sb,
extern struct inode *ceph_get_snapdir(struct inode *parent);
extern int ceph_fill_file_size(struct inode *inode, int issued,
u32 truncate_seq, u64 truncate_size, u64 size);
+extern void ceph_inode_set_subvolume(struct inode *inode, u64 subvolume_id);
extern void ceph_fill_file_time(struct inode *inode, int issued,
u64 time_warp_seq, struct timespec64 *ctime,
struct timespec64 *mtime,
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 3/3] ceph: add subvolume metrics collection and reporting
2025-11-27 13:46 [PATCH 0/3] ceph: add subvolume metrics reporting support Alex Markuze
2025-11-27 13:46 ` [PATCH 1/3] ceph: handle InodeStat v8 versioned field in reply parsing Alex Markuze
2025-11-27 13:46 ` [PATCH 2/3] ceph: parse subvolume_id from InodeStat v9 and store in inode Alex Markuze
@ 2025-11-27 13:46 ` Alex Markuze
2025-11-29 16:51 ` Dan Carpenter
2025-12-01 20:14 ` [PATCH 0/3] ceph: add subvolume metrics reporting support Viacheslav Dubeyko
3 siblings, 1 reply; 10+ messages in thread
From: Alex Markuze @ 2025-11-27 13:46 UTC (permalink / raw)
To: ceph-devel; +Cc: idryomov, linux-fsdevel, amarkuze, vdubeyko
Add complete subvolume metrics infrastructure for tracking and reporting
per-subvolume I/O metrics to the MDS. This enables administrators to
monitor I/O patterns at the subvolume granularity.
The implementation includes:
- New CEPHFS_FEATURE_SUBVOLUME_METRICS feature flag for MDS negotiation
- Red-black tree based metrics tracker (subvolume_metrics.c/h)
- Wire format encoding matching the MDS C++ AggregatedIOMetrics struct
- Integration with the existing metrics reporting infrastructure
- Recording of I/O operations from file read/write paths
- Debugfs interface for monitoring collected metrics
Metrics tracked per subvolume:
- Read/write operation counts
- Read/write byte counts
- Read/write latency sums (for average calculation)
The metrics are periodically sent to the MDS as part of the existing
CLIENT_METRICS message when the MDS advertises support for the
SUBVOLUME_METRICS feature.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/Makefile | 2 +-
fs/ceph/addr.c | 10 +
fs/ceph/debugfs.c | 153 ++++++++++++++
fs/ceph/file.c | 58 ++++-
fs/ceph/mds_client.c | 70 +++++--
fs/ceph/mds_client.h | 13 +-
fs/ceph/metric.c | 172 ++++++++++++++-
fs/ceph/metric.h | 27 ++-
fs/ceph/subvolume_metrics.c | 407 ++++++++++++++++++++++++++++++++++++
fs/ceph/subvolume_metrics.h | 68 ++++++
fs/ceph/super.c | 1 +
fs/ceph/super.h | 1 +
12 files changed, 956 insertions(+), 26 deletions(-)
create mode 100644 fs/ceph/subvolume_metrics.c
create mode 100644 fs/ceph/subvolume_metrics.h
diff --git a/fs/ceph/Makefile b/fs/ceph/Makefile
index 1f77ca04c426..ebb29d11ac22 100644
--- a/fs/ceph/Makefile
+++ b/fs/ceph/Makefile
@@ -8,7 +8,7 @@ obj-$(CONFIG_CEPH_FS) += ceph.o
ceph-y := super.o inode.o dir.o file.o locks.o addr.o ioctl.o \
export.o caps.o snap.o xattr.o quota.o io.o \
mds_client.o mdsmap.o strings.o ceph_frag.o \
- debugfs.o util.o metric.o
+ debugfs.o util.o metric.o subvolume_metrics.o
ceph-$(CONFIG_CEPH_FSCACHE) += cache.o
ceph-$(CONFIG_CEPH_FS_POSIX_ACL) += acl.o
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 322ed268f14a..feae80dc2816 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -19,6 +19,7 @@
#include "mds_client.h"
#include "cache.h"
#include "metric.h"
+#include "subvolume_metrics.h"
#include "crypto.h"
#include <linux/ceph/osd_client.h>
#include <linux/ceph/striper.h>
@@ -823,6 +824,10 @@ static int write_folio_nounlock(struct folio *folio,
ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
req->r_end_latency, len, err);
+ if (err >= 0 && len > 0)
+ ceph_subvolume_metrics_record_io(fsc->mdsc, ci, true, len,
+ req->r_start_latency,
+ req->r_end_latency);
fscrypt_free_bounce_page(bounce_page);
ceph_osdc_put_request(req);
if (err == 0)
@@ -963,6 +968,11 @@ static void writepages_finish(struct ceph_osd_request *req)
ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
req->r_end_latency, len, rc);
+ if (rc >= 0 && len > 0)
+ ceph_subvolume_metrics_record_io(mdsc, ci, true, len,
+ req->r_start_latency,
+ req->r_end_latency);
+
ceph_put_wrbuffer_cap_refs(ci, total_pages, snapc);
osd_data = osd_req_op_extent_osd_data(req, 0);
diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index f3fe786b4143..d15af485378d 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -9,11 +9,13 @@
#include <linux/seq_file.h>
#include <linux/math64.h>
#include <linux/ktime.h>
+#include <linux/atomic.h>
#include <linux/ceph/libceph.h>
#include <linux/ceph/mon_client.h>
#include <linux/ceph/auth.h>
#include <linux/ceph/debugfs.h>
+#include <linux/ceph/decode.h>
#include "super.h"
@@ -21,6 +23,31 @@
#include "mds_client.h"
#include "metric.h"
+#include "subvolume_metrics.h"
+
+extern bool disable_send_metrics;
+
+struct ceph_session_feature_desc {
+ unsigned int bit;
+ const char *name;
+};
+
+static const struct ceph_session_feature_desc ceph_session_feature_table[] = {
+ { CEPHFS_FEATURE_METRIC_COLLECT, "METRIC_COLLECT" },
+ { CEPHFS_FEATURE_REPLY_ENCODING, "REPLY_ENCODING" },
+ { CEPHFS_FEATURE_RECLAIM_CLIENT, "RECLAIM_CLIENT" },
+ { CEPHFS_FEATURE_LAZY_CAP_WANTED, "LAZY_CAP_WANTED" },
+ { CEPHFS_FEATURE_MULTI_RECONNECT, "MULTI_RECONNECT" },
+ { CEPHFS_FEATURE_DELEG_INO, "DELEG_INO" },
+ { CEPHFS_FEATURE_ALTERNATE_NAME, "ALTERNATE_NAME" },
+ { CEPHFS_FEATURE_NOTIFY_SESSION_STATE, "NOTIFY_SESSION_STATE" },
+ { CEPHFS_FEATURE_OP_GETVXATTR, "OP_GETVXATTR" },
+ { CEPHFS_FEATURE_32BITS_RETRY_FWD, "32BITS_RETRY_FWD" },
+ { CEPHFS_FEATURE_NEW_SNAPREALM_INFO, "NEW_SNAPREALM_INFO" },
+ { CEPHFS_FEATURE_HAS_OWNER_UIDGID, "HAS_OWNER_UIDGID" },
+ { CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK, "MDS_AUTH_CAPS_CHECK" },
+ { CEPHFS_FEATURE_SUBVOLUME_METRICS, "SUBVOLUME_METRICS" },
+};
static int mdsmap_show(struct seq_file *s, void *p)
{
@@ -360,6 +387,60 @@ static int status_show(struct seq_file *s, void *p)
return 0;
}
+static int subvolume_metrics_show(struct seq_file *s, void *p)
+{
+ struct ceph_fs_client *fsc = s->private;
+ struct ceph_mds_client *mdsc = fsc->mdsc;
+ struct ceph_subvol_metric_snapshot *snapshot = NULL;
+ u32 nr = 0;
+ u64 total_sent = 0;
+ u32 i;
+
+ if (mdsc) {
+ mutex_lock(&mdsc->subvol_metrics_last_mutex);
+ if (mdsc->subvol_metrics_last &&
+ mdsc->subvol_metrics_last_nr) {
+ nr = mdsc->subvol_metrics_last_nr;
+ snapshot = kmemdup(mdsc->subvol_metrics_last,
+ nr * sizeof(*snapshot),
+ GFP_KERNEL);
+ if (!snapshot)
+ nr = 0;
+ }
+ total_sent = mdsc->subvol_metrics_sent;
+ mutex_unlock(&mdsc->subvol_metrics_last_mutex);
+ }
+
+ seq_puts(s, "Last sent subvolume metrics:\n");
+ if (!nr) {
+ seq_puts(s, " (none)\n");
+ } else {
+ seq_puts(s, " subvol_id rd_ops wr_ops rd_bytes wr_bytes rd_lat_us wr_lat_us\n");
+ for (i = 0; i < nr; i++) {
+ const struct ceph_subvol_metric_snapshot *e = &snapshot[i];
+
+ seq_printf(s, " %-18llu %-9llu %-9llu %-14llu %-14llu %-14llu %-14llu\n",
+ e->subvolume_id,
+ e->read_ops, e->write_ops,
+ e->read_bytes, e->write_bytes,
+ e->read_latency_us, e->write_latency_us);
+ }
+ }
+ kfree(snapshot);
+
+ seq_puts(s, "\nStatistics:\n");
+ seq_printf(s, " entries_sent: %llu\n", total_sent);
+
+ mutex_lock(&mdsc->subvol_metrics_last_mutex);
+ seq_printf(s, " non_zero_sends: %llu\n",
+ mdsc->subvol_metrics_nonzero_sends);
+ mutex_unlock(&mdsc->subvol_metrics_last_mutex);
+
+ seq_puts(s, "\nPending (unsent) subvolume metrics:\n");
+ ceph_subvolume_metrics_dump(&fsc->mdsc->subvol_metrics, s);
+ return 0;
+}
+
DEFINE_SHOW_ATTRIBUTE(mdsmap);
DEFINE_SHOW_ATTRIBUTE(mdsc);
DEFINE_SHOW_ATTRIBUTE(caps);
@@ -369,7 +450,72 @@ DEFINE_SHOW_ATTRIBUTE(metrics_file);
DEFINE_SHOW_ATTRIBUTE(metrics_latency);
DEFINE_SHOW_ATTRIBUTE(metrics_size);
DEFINE_SHOW_ATTRIBUTE(metrics_caps);
+DEFINE_SHOW_ATTRIBUTE(subvolume_metrics);
+
+static int metric_features_show(struct seq_file *s, void *p)
+{
+ struct ceph_fs_client *fsc = s->private;
+ struct ceph_mds_client *mdsc = fsc->mdsc;
+ unsigned long session_features = 0;
+ bool have_session = false;
+ bool metric_collect = false;
+ bool subvol_support = false;
+ bool metrics_enabled = false;
+ bool subvol_enabled = false;
+ int i;
+
+ if (!mdsc) {
+ seq_puts(s, "mds client unavailable\n");
+ return 0;
+ }
+
+ mutex_lock(&mdsc->mutex);
+ if (mdsc->metric.session) {
+ have_session = true;
+ session_features = mdsc->metric.session->s_features;
+ }
+ mutex_unlock(&mdsc->mutex);
+
+ if (have_session) {
+ metric_collect =
+ test_bit(CEPHFS_FEATURE_METRIC_COLLECT,
+ &session_features);
+ subvol_support =
+ test_bit(CEPHFS_FEATURE_SUBVOLUME_METRICS,
+ &session_features);
+ }
+
+ metrics_enabled = !disable_send_metrics && have_session && metric_collect;
+ subvol_enabled = metrics_enabled && subvol_support;
+
+ seq_printf(s,
+ "metrics_enabled: %s (disable_send_metrics=%d, session=%s, metric_collect=%s)\n",
+ metrics_enabled ? "yes" : "no",
+ disable_send_metrics ? 1 : 0,
+ have_session ? "yes" : "no",
+ metric_collect ? "yes" : "no");
+ seq_printf(s, "subvolume_metrics_enabled: %s\n",
+ subvol_enabled ? "yes" : "no");
+ seq_printf(s, "session_feature_bits: 0x%lx\n", session_features);
+
+ if (!have_session) {
+ seq_puts(s, "(no active MDS session for metrics)\n");
+ return 0;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(ceph_session_feature_table); i++) {
+ const struct ceph_session_feature_desc *desc =
+ &ceph_session_feature_table[i];
+ bool set = test_bit(desc->bit, &session_features);
+
+ seq_printf(s, " %-24s : %s\n", desc->name,
+ set ? "yes" : "no");
+ }
+
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(metric_features);
/*
* debugfs
@@ -404,6 +550,7 @@ void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
debugfs_remove(fsc->debugfs_caps);
debugfs_remove(fsc->debugfs_status);
debugfs_remove(fsc->debugfs_mdsc);
+ debugfs_remove(fsc->debugfs_subvolume_metrics);
debugfs_remove_recursive(fsc->debugfs_metrics_dir);
doutc(fsc->client, "done\n");
}
@@ -468,6 +615,12 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
&metrics_size_fops);
debugfs_create_file("caps", 0400, fsc->debugfs_metrics_dir, fsc,
&metrics_caps_fops);
+ debugfs_create_file("metric_features", 0400, fsc->debugfs_metrics_dir,
+ fsc, &metric_features_fops);
+ fsc->debugfs_subvolume_metrics =
+ debugfs_create_file("subvolumes", 0400,
+ fsc->debugfs_metrics_dir, fsc,
+ &subvolume_metrics_fops);
doutc(fsc->client, "done\n");
}
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 99b30f784ee2..8c0e29c464b7 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -19,6 +19,19 @@
#include "cache.h"
#include "io.h"
#include "metric.h"
+#include "subvolume_metrics.h"
+
+static inline void ceph_record_subvolume_io(struct inode *inode, bool is_write,
+ ktime_t start, ktime_t end,
+ size_t bytes)
+{
+ if (!bytes)
+ return;
+
+ ceph_subvolume_metrics_record_io(ceph_sb_to_mdsc(inode->i_sb),
+ ceph_inode(inode),
+ is_write, bytes, start, end);
+}
static __le32 ceph_flags_sys2wire(struct ceph_mds_client *mdsc, u32 flags)
{
@@ -1140,6 +1153,11 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
req->r_start_latency,
req->r_end_latency,
read_len, ret);
+ if (ret > 0)
+ ceph_record_subvolume_io(inode, false,
+ req->r_start_latency,
+ req->r_end_latency,
+ ret);
if (ret > 0)
objver = req->r_version;
@@ -1385,12 +1403,23 @@ static void ceph_aio_complete_req(struct ceph_osd_request *req)
/* r_start_latency == 0 means the request was not submitted */
if (req->r_start_latency) {
- if (aio_req->write)
+ if (aio_req->write) {
ceph_update_write_metrics(metric, req->r_start_latency,
req->r_end_latency, len, rc);
- else
+ if (rc >= 0 && len)
+ ceph_record_subvolume_io(inode, true,
+ req->r_start_latency,
+ req->r_end_latency,
+ len);
+ } else {
ceph_update_read_metrics(metric, req->r_start_latency,
req->r_end_latency, len, rc);
+ if (rc > 0)
+ ceph_record_subvolume_io(inode, false,
+ req->r_start_latency,
+ req->r_end_latency,
+ rc);
+ }
}
put_bvecs(osd_data->bvec_pos.bvecs, osd_data->num_bvecs,
@@ -1614,12 +1643,23 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
ceph_osdc_start_request(req->r_osdc, req);
ret = ceph_osdc_wait_request(&fsc->client->osdc, req);
- if (write)
+ if (write) {
ceph_update_write_metrics(metric, req->r_start_latency,
req->r_end_latency, len, ret);
- else
+ if (ret >= 0 && len)
+ ceph_record_subvolume_io(inode, true,
+ req->r_start_latency,
+ req->r_end_latency,
+ len);
+ } else {
ceph_update_read_metrics(metric, req->r_start_latency,
req->r_end_latency, len, ret);
+ if (ret > 0)
+ ceph_record_subvolume_io(inode, false,
+ req->r_start_latency,
+ req->r_end_latency,
+ ret);
+ }
size = i_size_read(inode);
if (!write) {
@@ -1872,6 +1912,11 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
req->r_start_latency,
req->r_end_latency,
read_len, ret);
+ if (ret > 0)
+ ceph_record_subvolume_io(inode, false,
+ req->r_start_latency,
+ req->r_end_latency,
+ ret);
/* Ok if object is not already present */
if (ret == -ENOENT) {
@@ -2036,6 +2081,11 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
req->r_end_latency, len, ret);
+ if (ret >= 0 && write_len)
+ ceph_record_subvolume_io(inode, true,
+ req->r_start_latency,
+ req->r_end_latency,
+ write_len);
ceph_osdc_put_request(req);
if (ret != 0) {
doutc(cl, "osd write returned %d\n", ret);
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 6f66097f740b..ecfeb40417b2 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -67,6 +67,22 @@ static void ceph_cap_reclaim_work(struct work_struct *work);
static const struct ceph_connection_operations mds_con_ops;
+static void ceph_metric_bind_session(struct ceph_mds_client *mdsc,
+ struct ceph_mds_session *session)
+{
+ struct ceph_mds_session *old;
+
+ if (!mdsc || !session || disable_send_metrics)
+ return;
+
+ old = mdsc->metric.session;
+ mdsc->metric.session = ceph_get_mds_session(session);
+ if (old)
+ ceph_put_mds_session(old);
+
+ metric_schedule_delayed(&mdsc->metric);
+}
+
/*
* mds reply parsing
@@ -95,21 +111,23 @@ static int parse_reply_info_quota(void **p, void *end,
return -EIO;
}
-/*
- * parse individual inode info
- */
static int parse_reply_info_in(void **p, void *end,
struct ceph_mds_reply_info_in *info,
- u64 features)
+ u64 features,
+ struct ceph_mds_client *mdsc)
{
int err = 0;
u8 struct_v = 0;
+ u8 struct_compat = 0;
+ u32 struct_len = 0;
+ struct ceph_client *cl = mdsc ? mdsc->fsc->client : NULL;
+
+ info->subvolume_id = 0;
+ doutc(cl, "subv_metric parse start features=0x%llx\n", features);
info->subvolume_id = 0;
if (features == (u64)-1) {
- u32 struct_len;
- u8 struct_compat;
ceph_decode_8_safe(p, end, struct_v, bad);
ceph_decode_8_safe(p, end, struct_compat, bad);
/* struct_v is expected to be >= 1. we only understand
@@ -381,12 +399,13 @@ static int parse_reply_info_lease(void **p, void *end,
*/
static int parse_reply_info_trace(void **p, void *end,
struct ceph_mds_reply_info_parsed *info,
- u64 features)
+ u64 features,
+ struct ceph_mds_client *mdsc)
{
int err;
if (info->head->is_dentry) {
- err = parse_reply_info_in(p, end, &info->diri, features);
+ err = parse_reply_info_in(p, end, &info->diri, features, mdsc);
if (err < 0)
goto out_bad;
@@ -406,7 +425,8 @@ static int parse_reply_info_trace(void **p, void *end,
}
if (info->head->is_target) {
- err = parse_reply_info_in(p, end, &info->targeti, features);
+ err = parse_reply_info_in(p, end, &info->targeti, features,
+ mdsc);
if (err < 0)
goto out_bad;
}
@@ -427,7 +447,8 @@ static int parse_reply_info_trace(void **p, void *end,
*/
static int parse_reply_info_readdir(void **p, void *end,
struct ceph_mds_request *req,
- u64 features)
+ u64 features,
+ struct ceph_mds_client *mdsc)
{
struct ceph_mds_reply_info_parsed *info = &req->r_reply_info;
struct ceph_client *cl = req->r_mdsc->fsc->client;
@@ -542,7 +563,7 @@ static int parse_reply_info_readdir(void **p, void *end,
rde->name_len = oname.len;
/* inode */
- err = parse_reply_info_in(p, end, &rde->inode, features);
+ err = parse_reply_info_in(p, end, &rde->inode, features, mdsc);
if (err < 0)
goto out_bad;
/* ceph_readdir_prepopulate() will update it */
@@ -750,7 +771,8 @@ static int parse_reply_info_extra(void **p, void *end,
if (op == CEPH_MDS_OP_GETFILELOCK)
return parse_reply_info_filelock(p, end, info, features);
else if (op == CEPH_MDS_OP_READDIR || op == CEPH_MDS_OP_LSSNAP)
- return parse_reply_info_readdir(p, end, req, features);
+ return parse_reply_info_readdir(p, end, req, features,
+ req->r_mdsc);
else if (op == CEPH_MDS_OP_CREATE)
return parse_reply_info_create(p, end, info, features, s);
else if (op == CEPH_MDS_OP_GETVXATTR)
@@ -779,7 +801,8 @@ static int parse_reply_info(struct ceph_mds_session *s, struct ceph_msg *msg,
ceph_decode_32_safe(&p, end, len, bad);
if (len > 0) {
ceph_decode_need(&p, end, len, bad);
- err = parse_reply_info_trace(&p, p+len, info, features);
+ err = parse_reply_info_trace(&p, p + len, info, features,
+ s->s_mdsc);
if (err < 0)
goto out_bad;
}
@@ -788,7 +811,7 @@ static int parse_reply_info(struct ceph_mds_session *s, struct ceph_msg *msg,
ceph_decode_32_safe(&p, end, len, bad);
if (len > 0) {
ceph_decode_need(&p, end, len, bad);
- err = parse_reply_info_extra(&p, p+len, req, features, s);
+ err = parse_reply_info_extra(&p, p + len, req, features, s);
if (err < 0)
goto out_bad;
}
@@ -4318,6 +4341,11 @@ static void handle_session(struct ceph_mds_session *session,
}
mdsc->s_cap_auths_num = cap_auths_num;
mdsc->s_cap_auths = cap_auths;
+
+ session->s_features = features;
+ if (test_bit(CEPHFS_FEATURE_METRIC_COLLECT,
+ &session->s_features))
+ ceph_metric_bind_session(mdsc, session);
}
if (op == CEPH_SESSION_CLOSE) {
ceph_get_mds_session(session);
@@ -4344,7 +4372,11 @@ static void handle_session(struct ceph_mds_session *session,
pr_info_client(cl, "mds%d reconnect success\n",
session->s_mds);
- session->s_features = features;
+ if (test_bit(CEPHFS_FEATURE_SUBVOLUME_METRICS,
+ &session->s_features))
+ ceph_subvolume_metrics_enable(&mdsc->subvol_metrics, true);
+ else
+ ceph_subvolume_metrics_enable(&mdsc->subvol_metrics, false);
if (session->s_state == CEPH_MDS_SESSION_OPEN) {
pr_notice_client(cl, "mds%d is already opened\n",
session->s_mds);
@@ -5583,6 +5615,12 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
err = ceph_metric_init(&mdsc->metric);
if (err)
goto err_mdsmap;
+ ceph_subvolume_metrics_init(&mdsc->subvol_metrics);
+ mutex_init(&mdsc->subvol_metrics_last_mutex);
+ mdsc->subvol_metrics_last = NULL;
+ mdsc->subvol_metrics_last_nr = 0;
+ mdsc->subvol_metrics_sent = 0;
+ mdsc->subvol_metrics_nonzero_sends = 0;
spin_lock_init(&mdsc->dentry_list_lock);
INIT_LIST_HEAD(&mdsc->dentry_leases);
@@ -6115,6 +6153,8 @@ void ceph_mdsc_destroy(struct ceph_fs_client *fsc)
ceph_mdsc_stop(mdsc);
ceph_metric_destroy(&mdsc->metric);
+ ceph_subvolume_metrics_destroy(&mdsc->subvol_metrics);
+ kfree(mdsc->subvol_metrics_last);
fsc->mdsc = NULL;
kfree(mdsc);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index bd3690baa65c..4e6c87f8414c 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -18,6 +18,7 @@
#include "mdsmap.h"
#include "metric.h"
+#include "subvolume_metrics.h"
#include "super.h"
/* The first 8 bits are reserved for old ceph releases */
@@ -36,8 +37,9 @@ enum ceph_feature_type {
CEPHFS_FEATURE_NEW_SNAPREALM_INFO,
CEPHFS_FEATURE_HAS_OWNER_UIDGID,
CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK,
+ CEPHFS_FEATURE_SUBVOLUME_METRICS,
- CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK,
+ CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_SUBVOLUME_METRICS,
};
#define CEPHFS_FEATURES_CLIENT_SUPPORTED { \
@@ -54,6 +56,7 @@ enum ceph_feature_type {
CEPHFS_FEATURE_32BITS_RETRY_FWD, \
CEPHFS_FEATURE_HAS_OWNER_UIDGID, \
CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK, \
+ CEPHFS_FEATURE_SUBVOLUME_METRICS, \
}
/*
@@ -537,6 +540,14 @@ struct ceph_mds_client {
struct list_head dentry_dir_leases; /* lru list */
struct ceph_client_metric metric;
+ struct ceph_subvolume_metrics_tracker subvol_metrics;
+
+ /* Subvolume metrics send tracking */
+ struct mutex subvol_metrics_last_mutex;
+ struct ceph_subvol_metric_snapshot *subvol_metrics_last;
+ u32 subvol_metrics_last_nr;
+ u64 subvol_metrics_sent;
+ u64 subvol_metrics_nonzero_sends;
spinlock_t snapid_map_lock;
struct rb_root snapid_map_tree;
diff --git a/fs/ceph/metric.c b/fs/ceph/metric.c
index 871c1090e520..8ff6bcb50bc4 100644
--- a/fs/ceph/metric.c
+++ b/fs/ceph/metric.c
@@ -4,10 +4,85 @@
#include <linux/types.h>
#include <linux/percpu_counter.h>
#include <linux/math64.h>
+#include <linux/ratelimit.h>
+
+#include <linux/ceph/decode.h>
#include "metric.h"
#include "mds_client.h"
+static DEFINE_RATELIMIT_STATE(metrics_no_session_rs, HZ, 1);
+static bool metrics_disable_warned;
+
+static inline u32 ceph_subvolume_entry_payload_len(void)
+{
+ return sizeof(struct ceph_subvolume_metric_entry_wire);
+}
+
+static inline u32 ceph_subvolume_entry_encoded_len(void)
+{
+ return CEPH_ENCODING_START_BLK_LEN +
+ ceph_subvolume_entry_payload_len();
+}
+
+static inline u32 ceph_subvolume_outer_payload_len(u32 nr_subvols)
+{
+ /* count is encoded as le64 (size_t on wire) to match FUSE client */
+ return sizeof(__le64) +
+ nr_subvols * ceph_subvolume_entry_encoded_len();
+}
+
+static inline u32 ceph_subvolume_metric_data_len(u32 nr_subvols)
+{
+ return CEPH_ENCODING_START_BLK_LEN +
+ ceph_subvolume_outer_payload_len(nr_subvols);
+}
+
+static inline u32 ceph_subvolume_clamp_u32(u64 val)
+{
+ return val > U32_MAX ? U32_MAX : (u32)val;
+}
+
+static void ceph_init_subvolume_wire_entry(
+ struct ceph_subvolume_metric_entry_wire *dst,
+ const struct ceph_subvol_metric_snapshot *src)
+{
+ dst->subvolume_id = cpu_to_le64(src->subvolume_id);
+ dst->read_ops = cpu_to_le32(ceph_subvolume_clamp_u32(src->read_ops));
+ dst->write_ops = cpu_to_le32(ceph_subvolume_clamp_u32(src->write_ops));
+ dst->read_bytes = cpu_to_le64(src->read_bytes);
+ dst->write_bytes = cpu_to_le64(src->write_bytes);
+ dst->read_latency_us = cpu_to_le64(src->read_latency_us);
+ dst->write_latency_us = cpu_to_le64(src->write_latency_us);
+ dst->time_stamp = 0;
+}
+
+static int ceph_encode_subvolume_metrics(void **p, void *end,
+ struct ceph_subvol_metric_snapshot *subvols,
+ u32 nr_subvols)
+{
+ u32 i;
+
+ ceph_start_encoding(p, 1, 1,
+ ceph_subvolume_outer_payload_len(nr_subvols));
+ /* count is encoded as le64 (size_t on wire) to match FUSE client */
+ ceph_encode_64_safe(p, end, (u64)nr_subvols, enc_err);
+
+ for (i = 0; i < nr_subvols; i++) {
+ struct ceph_subvolume_metric_entry_wire wire_entry;
+
+ ceph_init_subvolume_wire_entry(&wire_entry, &subvols[i]);
+ ceph_start_encoding(p, 1, 1,
+ ceph_subvolume_entry_payload_len());
+ ceph_encode_copy_safe(p, end, &wire_entry,
+ sizeof(wire_entry), enc_err);
+ }
+
+ return 0;
+enc_err:
+ return -ERANGE;
+}
+
static void ktime_to_ceph_timespec(struct ceph_timespec *ts, ktime_t val)
{
struct timespec64 t = ktime_to_timespec64(val);
@@ -29,10 +104,14 @@ static bool ceph_mdsc_send_metrics(struct ceph_mds_client *mdsc,
struct ceph_read_io_size *rsize;
struct ceph_write_io_size *wsize;
struct ceph_client_metric *m = &mdsc->metric;
+ struct ceph_subvol_metric_snapshot *subvols = NULL;
u64 nr_caps = atomic64_read(&m->total_caps);
u32 header_len = sizeof(struct ceph_metric_header);
struct ceph_client *cl = mdsc->fsc->client;
struct ceph_msg *msg;
+ u32 nr_subvols = 0;
+ size_t subvol_len = 0;
+ void *cursor;
s64 sum;
s32 items = 0;
s32 len;
@@ -45,15 +124,37 @@ static bool ceph_mdsc_send_metrics(struct ceph_mds_client *mdsc,
}
mutex_unlock(&mdsc->mutex);
+ if (ceph_subvolume_metrics_enabled(&mdsc->subvol_metrics) &&
+ test_bit(CEPHFS_FEATURE_SUBVOLUME_METRICS, &s->s_features)) {
+ int ret;
+
+ ret = ceph_subvolume_metrics_snapshot(&mdsc->subvol_metrics,
+ &subvols, &nr_subvols,
+ true);
+ if (ret) {
+ pr_warn_client(cl, "failed to snapshot subvolume metrics: %d\n",
+ ret);
+ nr_subvols = 0;
+ subvols = NULL;
+ }
+ }
+
+ if (nr_subvols) {
+ /* type (le32) + ENCODE_START payload - no metric header */
+ subvol_len = sizeof(__le32) +
+ ceph_subvolume_metric_data_len(nr_subvols);
+ }
+
len = sizeof(*head) + sizeof(*cap) + sizeof(*read) + sizeof(*write)
+ sizeof(*meta) + sizeof(*dlease) + sizeof(*files)
+ sizeof(*icaps) + sizeof(*inodes) + sizeof(*rsize)
- + sizeof(*wsize);
+ + sizeof(*wsize) + subvol_len;
msg = ceph_msg_new(CEPH_MSG_CLIENT_METRICS, len, GFP_NOFS, true);
if (!msg) {
pr_err_client(cl, "to mds%d, failed to allocate message\n",
s->s_mds);
+ kfree(subvols);
return false;
}
@@ -172,13 +273,56 @@ static bool ceph_mdsc_send_metrics(struct ceph_mds_client *mdsc,
wsize->total_size = cpu_to_le64(m->metric[METRIC_WRITE].size_sum);
items++;
+ cursor = wsize + 1;
+
+ if (nr_subvols) {
+ void *payload;
+ void *payload_end;
+ int ret;
+
+ /* Emit only the type (le32), no ver/compat/data_len */
+ ceph_encode_32(&cursor, CLIENT_METRIC_TYPE_SUBVOLUME_METRICS);
+ items++;
+
+ payload = cursor;
+ payload_end = (char *)payload +
+ ceph_subvolume_metric_data_len(nr_subvols);
+
+ ret = ceph_encode_subvolume_metrics(&payload, payload_end,
+ subvols, nr_subvols);
+ if (ret) {
+ pr_warn_client(cl,
+ "failed to encode subvolume metrics\n");
+ kfree(subvols);
+ ceph_msg_put(msg);
+ return false;
+ }
+
+ WARN_ON(payload != payload_end);
+ cursor = payload;
+ }
+
put_unaligned_le32(items, &head->num);
- msg->front.iov_len = len;
+ msg->front.iov_len = (char *)cursor - (char *)head;
msg->hdr.version = cpu_to_le16(1);
msg->hdr.compat_version = cpu_to_le16(1);
msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);
+
ceph_con_send(&s->s_con, msg);
+ if (nr_subvols) {
+ mutex_lock(&mdsc->subvol_metrics_last_mutex);
+ kfree(mdsc->subvol_metrics_last);
+ mdsc->subvol_metrics_last = subvols;
+ mdsc->subvol_metrics_last_nr = nr_subvols;
+ mdsc->subvol_metrics_sent += nr_subvols;
+ mdsc->subvol_metrics_nonzero_sends++;
+ mutex_unlock(&mdsc->subvol_metrics_last_mutex);
+
+ subvols = NULL;
+ }
+ kfree(subvols);
+
return true;
}
@@ -201,6 +345,12 @@ static void metric_get_session(struct ceph_mds_client *mdsc)
*/
if (check_session_state(s) &&
test_bit(CEPHFS_FEATURE_METRIC_COLLECT, &s->s_features)) {
+ if (ceph_subvolume_metrics_enabled(&mdsc->subvol_metrics) &&
+ !test_bit(CEPHFS_FEATURE_SUBVOLUME_METRICS,
+ &s->s_features)) {
+ ceph_put_mds_session(s);
+ continue;
+ }
mdsc->metric.session = s;
break;
}
@@ -217,8 +367,17 @@ static void metric_delayed_work(struct work_struct *work)
struct ceph_mds_client *mdsc =
container_of(m, struct ceph_mds_client, metric);
- if (mdsc->stopping || disable_send_metrics)
+ if (mdsc->stopping)
+ return;
+
+ if (disable_send_metrics) {
+ if (!metrics_disable_warned) {
+ pr_err("ceph: metrics worker disabled via module parameter\n");
+ metrics_disable_warned = true;
+ }
return;
+ }
+ metrics_disable_warned = false;
if (!m->session || !check_session_state(m->session)) {
if (m->session) {
@@ -227,10 +386,15 @@ static void metric_delayed_work(struct work_struct *work)
}
metric_get_session(mdsc);
}
+
if (m->session) {
ceph_mdsc_send_metrics(mdsc, m->session);
- metric_schedule_delayed(m);
+ } else {
+ if (__ratelimit(&metrics_no_session_rs))
+ pr_warn("ceph: metrics worker missing MDS session\n");
}
+
+ metric_schedule_delayed(m);
}
int ceph_metric_init(struct ceph_client_metric *m)
diff --git a/fs/ceph/metric.h b/fs/ceph/metric.h
index 0d0c44bd3332..7e4aac63f6a6 100644
--- a/fs/ceph/metric.h
+++ b/fs/ceph/metric.h
@@ -25,8 +25,9 @@ enum ceph_metric_type {
CLIENT_METRIC_TYPE_STDEV_WRITE_LATENCY,
CLIENT_METRIC_TYPE_AVG_METADATA_LATENCY,
CLIENT_METRIC_TYPE_STDEV_METADATA_LATENCY,
+ CLIENT_METRIC_TYPE_SUBVOLUME_METRICS,
- CLIENT_METRIC_TYPE_MAX = CLIENT_METRIC_TYPE_STDEV_METADATA_LATENCY,
+ CLIENT_METRIC_TYPE_MAX = CLIENT_METRIC_TYPE_SUBVOLUME_METRICS,
};
/*
@@ -50,6 +51,7 @@ enum ceph_metric_type {
CLIENT_METRIC_TYPE_STDEV_WRITE_LATENCY, \
CLIENT_METRIC_TYPE_AVG_METADATA_LATENCY, \
CLIENT_METRIC_TYPE_STDEV_METADATA_LATENCY, \
+ CLIENT_METRIC_TYPE_SUBVOLUME_METRICS, \
\
CLIENT_METRIC_TYPE_MAX, \
}
@@ -139,6 +141,29 @@ struct ceph_write_io_size {
__le64 total_size;
} __packed;
+/* Wire format for subvolume metrics - matches C++ AggregatedIOMetrics */
+struct ceph_subvolume_metric_entry_wire {
+ __le64 subvolume_id;
+ __le32 read_ops;
+ __le32 write_ops;
+ __le64 read_bytes;
+ __le64 write_bytes;
+ __le64 read_latency_us;
+ __le64 write_latency_us;
+ __le64 time_stamp;
+} __packed;
+
+/* Old struct kept for internal tracking, not used on wire */
+struct ceph_subvolume_metric_entry {
+ __le64 subvolume_id;
+ __le64 read_ops;
+ __le64 write_ops;
+ __le64 read_bytes;
+ __le64 write_bytes;
+ __le64 read_latency_us;
+ __le64 write_latency_us;
+} __packed;
+
struct ceph_metric_head {
__le32 num; /* the number of metrics that will be sent */
} __packed;
diff --git a/fs/ceph/subvolume_metrics.c b/fs/ceph/subvolume_metrics.c
new file mode 100644
index 000000000000..3e178bf6a740
--- /dev/null
+++ b/fs/ceph/subvolume_metrics.c
@@ -0,0 +1,407 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/ceph/ceph_debug.h>
+
+#include <linux/math64.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+
+#include "subvolume_metrics.h"
+#include "mds_client.h"
+#include "super.h"
+
+struct ceph_subvol_metric_rb_entry {
+ struct rb_node node;
+ u64 subvolume_id;
+ u64 read_ops;
+ u64 write_ops;
+ u64 read_bytes;
+ u64 write_bytes;
+ u64 read_latency_us;
+ u64 write_latency_us;
+};
+
+void ceph_subvolume_metrics_init(struct ceph_subvolume_metrics_tracker *tracker)
+{
+ spin_lock_init(&tracker->lock);
+ tracker->tree = RB_ROOT_CACHED;
+ tracker->nr_entries = 0;
+ tracker->enabled = false;
+ atomic64_set(&tracker->snapshot_attempts, 0);
+ atomic64_set(&tracker->snapshot_empty, 0);
+ atomic64_set(&tracker->snapshot_failures, 0);
+ atomic64_set(&tracker->record_calls, 0);
+ atomic64_set(&tracker->record_disabled, 0);
+ atomic64_set(&tracker->record_no_subvol, 0);
+ atomic64_set(&tracker->total_read_ops, 0);
+ atomic64_set(&tracker->total_read_bytes, 0);
+ atomic64_set(&tracker->total_write_ops, 0);
+ atomic64_set(&tracker->total_write_bytes, 0);
+}
+
+static struct ceph_subvol_metric_rb_entry *
+__lookup_entry(struct ceph_subvolume_metrics_tracker *tracker, u64 subvol_id)
+{
+ struct rb_node *node;
+
+ node = tracker->tree.rb_root.rb_node;
+ while (node) {
+ struct ceph_subvol_metric_rb_entry *entry =
+ rb_entry(node, struct ceph_subvol_metric_rb_entry, node);
+
+ if (subvol_id < entry->subvolume_id)
+ node = node->rb_left;
+ else if (subvol_id > entry->subvolume_id)
+ node = node->rb_right;
+ else
+ return entry;
+ }
+
+ return NULL;
+}
+
+static struct ceph_subvol_metric_rb_entry *
+__insert_entry(struct ceph_subvolume_metrics_tracker *tracker,
+ struct ceph_subvol_metric_rb_entry *entry)
+{
+ struct rb_node **link = &tracker->tree.rb_root.rb_node;
+ struct rb_node *parent = NULL;
+ bool leftmost = true;
+
+ while (*link) {
+ struct ceph_subvol_metric_rb_entry *cur =
+ rb_entry(*link, struct ceph_subvol_metric_rb_entry, node);
+
+ parent = *link;
+ if (entry->subvolume_id < cur->subvolume_id)
+ link = &(*link)->rb_left;
+ else if (entry->subvolume_id > cur->subvolume_id) {
+ link = &(*link)->rb_right;
+ leftmost = false;
+ } else
+ return cur;
+ }
+
+ rb_link_node(&entry->node, parent, link);
+ rb_insert_color_cached(&entry->node, &tracker->tree, leftmost);
+ tracker->nr_entries++;
+ return entry;
+}
+
+static void ceph_subvolume_metrics_clear_locked(
+ struct ceph_subvolume_metrics_tracker *tracker)
+{
+ struct rb_node *node = rb_first_cached(&tracker->tree);
+
+ while (node) {
+ struct ceph_subvol_metric_rb_entry *entry =
+ rb_entry(node, struct ceph_subvol_metric_rb_entry, node);
+ struct rb_node *next = rb_next(node);
+
+ rb_erase_cached(&entry->node, &tracker->tree);
+ tracker->nr_entries--;
+ kfree(entry);
+ node = next;
+ }
+
+ tracker->tree = RB_ROOT_CACHED;
+}
+
+void ceph_subvolume_metrics_destroy(struct ceph_subvolume_metrics_tracker *tracker)
+{
+ spin_lock(&tracker->lock);
+ ceph_subvolume_metrics_clear_locked(tracker);
+ tracker->enabled = false;
+ spin_unlock(&tracker->lock);
+}
+
+void ceph_subvolume_metrics_enable(struct ceph_subvolume_metrics_tracker *tracker,
+ bool enable)
+{
+ spin_lock(&tracker->lock);
+ if (enable) {
+ tracker->enabled = true;
+ } else {
+ tracker->enabled = false;
+ ceph_subvolume_metrics_clear_locked(tracker);
+ }
+ spin_unlock(&tracker->lock);
+}
+
+void ceph_subvolume_metrics_record(struct ceph_subvolume_metrics_tracker *tracker,
+ u64 subvol_id, bool is_write,
+ size_t size, u64 latency_us)
+{
+ struct ceph_subvol_metric_rb_entry *entry, *new_entry = NULL;
+ bool retry = false;
+
+ if (!READ_ONCE(tracker->enabled) || !subvol_id || !size || !latency_us)
+ return;
+
+ do {
+ spin_lock(&tracker->lock);
+ if (!tracker->enabled) {
+ spin_unlock(&tracker->lock);
+ kfree(new_entry);
+ return;
+ }
+
+ entry = __lookup_entry(tracker, subvol_id);
+ if (!entry) {
+ if (!new_entry) {
+ spin_unlock(&tracker->lock);
+ new_entry = kzalloc(sizeof(*new_entry), GFP_NOFS);
+ if (!new_entry)
+ return;
+ new_entry->subvolume_id = subvol_id;
+ retry = true;
+ continue;
+ }
+ entry = __insert_entry(tracker, new_entry);
+ if (entry != new_entry) {
+ /* raced with another insert */
+ spin_unlock(&tracker->lock);
+ kfree(new_entry);
+ new_entry = NULL;
+ retry = true;
+ continue;
+ }
+ new_entry = NULL;
+ }
+
+ if (is_write) {
+ entry->write_ops++;
+ entry->write_bytes += size;
+ entry->write_latency_us += latency_us;
+ atomic64_inc(&tracker->total_write_ops);
+ atomic64_add(size, &tracker->total_write_bytes);
+ } else {
+ entry->read_ops++;
+ entry->read_bytes += size;
+ entry->read_latency_us += latency_us;
+ atomic64_inc(&tracker->total_read_ops);
+ atomic64_add(size, &tracker->total_read_bytes);
+ }
+ spin_unlock(&tracker->lock);
+ kfree(new_entry);
+ return;
+ } while (retry);
+}
+
+int ceph_subvolume_metrics_snapshot(struct ceph_subvolume_metrics_tracker *tracker,
+ struct ceph_subvol_metric_snapshot **out,
+ u32 *nr, bool consume)
+{
+ struct ceph_subvol_metric_snapshot *snap = NULL;
+ struct rb_node *node;
+ u32 count = 0, idx = 0;
+ int ret = 0;
+
+ *out = NULL;
+ *nr = 0;
+
+ if (!READ_ONCE(tracker->enabled))
+ return 0;
+
+ atomic64_inc(&tracker->snapshot_attempts);
+
+ spin_lock(&tracker->lock);
+ for (node = rb_first_cached(&tracker->tree); node; node = rb_next(node)) {
+ struct ceph_subvol_metric_rb_entry *entry =
+ rb_entry(node, struct ceph_subvol_metric_rb_entry, node);
+
+ /* Include entries with ANY I/O activity (read OR write) */
+ if (entry->read_ops || entry->write_ops)
+ count++;
+ }
+ spin_unlock(&tracker->lock);
+
+ if (!count) {
+ atomic64_inc(&tracker->snapshot_empty);
+ return 0;
+ }
+
+ snap = kcalloc(count, sizeof(*snap), GFP_NOFS);
+ if (!snap) {
+ atomic64_inc(&tracker->snapshot_failures);
+ return -ENOMEM;
+ }
+
+ spin_lock(&tracker->lock);
+ node = rb_first_cached(&tracker->tree);
+ while (node) {
+ struct ceph_subvol_metric_rb_entry *entry =
+ rb_entry(node, struct ceph_subvol_metric_rb_entry, node);
+ struct rb_node *next = rb_next(node);
+
+ /* Skip entries with NO I/O activity at all */
+ if (!entry->read_ops && !entry->write_ops) {
+ rb_erase_cached(&entry->node, &tracker->tree);
+ tracker->nr_entries--;
+ kfree(entry);
+ node = next;
+ continue;
+ }
+
+ if (idx >= count) {
+ pr_warn("ceph: subvol metrics snapshot race (idx=%u count=%u)\n",
+ idx, count);
+ break;
+ }
+
+ snap[idx].subvolume_id = entry->subvolume_id;
+ snap[idx].read_ops = entry->read_ops;
+ snap[idx].write_ops = entry->write_ops;
+ snap[idx].read_bytes = entry->read_bytes;
+ snap[idx].write_bytes = entry->write_bytes;
+ snap[idx].read_latency_us = entry->read_latency_us;
+ snap[idx].write_latency_us = entry->write_latency_us;
+ idx++;
+
+ if (consume) {
+ entry->read_ops = 0;
+ entry->write_ops = 0;
+ entry->read_bytes = 0;
+ entry->write_bytes = 0;
+ entry->read_latency_us = 0;
+ entry->write_latency_us = 0;
+ rb_erase_cached(&entry->node, &tracker->tree);
+ tracker->nr_entries--;
+ kfree(entry);
+ }
+ node = next;
+ }
+ spin_unlock(&tracker->lock);
+
+ if (!idx) {
+ kfree(snap);
+ snap = NULL;
+ ret = 0;
+ } else {
+ *nr = idx;
+ *out = snap;
+ }
+
+ return ret;
+}
+
+void ceph_subvolume_metrics_free_snapshot(struct ceph_subvol_metric_snapshot *snapshot)
+{
+ kfree(snapshot);
+}
+
+static u64 div_rem(u64 dividend, u64 divisor)
+{
+ return divisor ? div64_u64(dividend, divisor) : 0;
+}
+
+void ceph_subvolume_metrics_dump(struct ceph_subvolume_metrics_tracker *tracker,
+ struct seq_file *s)
+{
+ struct rb_node *node;
+ struct ceph_subvol_metric_snapshot *snapshot = NULL;
+ u32 count = 0, idx = 0;
+
+ spin_lock(&tracker->lock);
+ if (!tracker->enabled) {
+ spin_unlock(&tracker->lock);
+ seq_puts(s, "subvolume metrics disabled\n");
+ return;
+ }
+
+ for (node = rb_first_cached(&tracker->tree); node; node = rb_next(node)) {
+ struct ceph_subvol_metric_rb_entry *entry =
+ rb_entry(node, struct ceph_subvol_metric_rb_entry, node);
+
+ if (entry->read_ops || entry->write_ops)
+ count++;
+ }
+ spin_unlock(&tracker->lock);
+
+ if (!count) {
+ seq_puts(s, "(no subvolume metrics collected)\n");
+ return;
+ }
+
+ snapshot = kcalloc(count, sizeof(*snapshot), GFP_KERNEL);
+ if (!snapshot) {
+ seq_puts(s, "(unable to allocate memory for snapshot)\n");
+ return;
+ }
+
+ spin_lock(&tracker->lock);
+ for (node = rb_first_cached(&tracker->tree); node; node = rb_next(node)) {
+ struct ceph_subvol_metric_rb_entry *entry =
+ rb_entry(node, struct ceph_subvol_metric_rb_entry, node);
+
+ if (!entry->read_ops && !entry->write_ops)
+ continue;
+
+ if (idx >= count)
+ break;
+
+ snapshot[idx].subvolume_id = entry->subvolume_id;
+ snapshot[idx].read_ops = entry->read_ops;
+ snapshot[idx].write_ops = entry->write_ops;
+ snapshot[idx].read_bytes = entry->read_bytes;
+ snapshot[idx].write_bytes = entry->write_bytes;
+ snapshot[idx].read_latency_us = entry->read_latency_us;
+ snapshot[idx].write_latency_us = entry->write_latency_us;
+ idx++;
+ }
+ spin_unlock(&tracker->lock);
+
+ seq_puts(s, "subvol_id rd_ops rd_bytes rd_avg_lat_us wr_ops wr_bytes wr_avg_lat_us\n");
+ seq_puts(s, "------------------------------------------------------------------------------------------------\n");
+
+ for (idx = 0; idx < count; idx++) {
+ u64 avg_rd_lat = div_rem(snapshot[idx].read_latency_us,
+ snapshot[idx].read_ops);
+ u64 avg_wr_lat = div_rem(snapshot[idx].write_latency_us,
+ snapshot[idx].write_ops);
+
+ seq_printf(s, "%-15llu%-10llu%-12llu%-16llu%-10llu%-12llu%-16llu\n",
+ snapshot[idx].subvolume_id,
+ snapshot[idx].read_ops,
+ snapshot[idx].read_bytes,
+ avg_rd_lat,
+ snapshot[idx].write_ops,
+ snapshot[idx].write_bytes,
+ avg_wr_lat);
+ }
+
+ kfree(snapshot);
+}
+
+void ceph_subvolume_metrics_record_io(struct ceph_mds_client *mdsc,
+ struct ceph_inode_info *ci,
+ bool is_write, size_t bytes,
+ ktime_t start, ktime_t end)
+{
+ struct ceph_subvolume_metrics_tracker *tracker;
+ u64 subvol_id;
+ s64 delta_us;
+
+ if (!mdsc || !ci || !bytes)
+ return;
+
+ tracker = &mdsc->subvol_metrics;
+ atomic64_inc(&tracker->record_calls);
+
+ if (!ceph_subvolume_metrics_enabled(tracker)) {
+ atomic64_inc(&tracker->record_disabled);
+ return;
+ }
+
+ subvol_id = READ_ONCE(ci->i_subvolume_id);
+ if (!subvol_id) {
+ atomic64_inc(&tracker->record_no_subvol);
+ return;
+ }
+
+ delta_us = ktime_to_us(ktime_sub(end, start));
+ if (delta_us <= 0)
+ delta_us = 1;
+
+ ceph_subvolume_metrics_record(tracker, subvol_id, is_write,
+ bytes, (u64)delta_us);
+}
diff --git a/fs/ceph/subvolume_metrics.h b/fs/ceph/subvolume_metrics.h
new file mode 100644
index 000000000000..872867c75c41
--- /dev/null
+++ b/fs/ceph/subvolume_metrics.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _FS_CEPH_SUBVOLUME_METRICS_H
+#define _FS_CEPH_SUBVOLUME_METRICS_H
+
+#include <linux/types.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/ktime.h>
+#include <linux/atomic.h>
+
+struct seq_file;
+struct ceph_mds_client;
+struct ceph_inode_info;
+
+struct ceph_subvol_metric_snapshot {
+ u64 subvolume_id;
+ u64 read_ops;
+ u64 write_ops;
+ u64 read_bytes;
+ u64 write_bytes;
+ u64 read_latency_us;
+ u64 write_latency_us;
+};
+
+struct ceph_subvolume_metrics_tracker {
+ spinlock_t lock;
+ struct rb_root_cached tree;
+ u32 nr_entries;
+ bool enabled;
+ atomic64_t snapshot_attempts;
+ atomic64_t snapshot_empty;
+ atomic64_t snapshot_failures;
+ atomic64_t record_calls;
+ atomic64_t record_disabled;
+ atomic64_t record_no_subvol;
+ /* Cumulative counters (survive snapshots) */
+ atomic64_t total_read_ops;
+ atomic64_t total_read_bytes;
+ atomic64_t total_write_ops;
+ atomic64_t total_write_bytes;
+};
+
+void ceph_subvolume_metrics_init(struct ceph_subvolume_metrics_tracker *tracker);
+void ceph_subvolume_metrics_destroy(struct ceph_subvolume_metrics_tracker *tracker);
+void ceph_subvolume_metrics_enable(struct ceph_subvolume_metrics_tracker *tracker,
+ bool enable);
+void ceph_subvolume_metrics_record(struct ceph_subvolume_metrics_tracker *tracker,
+ u64 subvol_id, bool is_write,
+ size_t size, u64 latency_us);
+int ceph_subvolume_metrics_snapshot(struct ceph_subvolume_metrics_tracker *tracker,
+ struct ceph_subvol_metric_snapshot **out,
+ u32 *nr, bool consume);
+void ceph_subvolume_metrics_free_snapshot(struct ceph_subvol_metric_snapshot *snapshot);
+void ceph_subvolume_metrics_dump(struct ceph_subvolume_metrics_tracker *tracker,
+ struct seq_file *s);
+
+void ceph_subvolume_metrics_record_io(struct ceph_mds_client *mdsc,
+ struct ceph_inode_info *ci,
+ bool is_write, size_t bytes,
+ ktime_t start, ktime_t end);
+
+static inline bool ceph_subvolume_metrics_enabled(
+ const struct ceph_subvolume_metrics_tracker *tracker)
+{
+ return READ_ONCE(tracker->enabled);
+}
+
+#endif /* _FS_CEPH_SUBVOLUME_METRICS_H */
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index f6bf24b5c683..528452aa8beb 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -21,6 +21,7 @@
#include "mds_client.h"
#include "cache.h"
#include "crypto.h"
+#include "subvolume_metrics.h"
#include <linux/ceph/ceph_features.h>
#include <linux/ceph/decode.h>
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 69069c920683..17a44a3e1017 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -167,6 +167,7 @@ struct ceph_fs_client {
struct dentry *debugfs_status;
struct dentry *debugfs_mds_sessions;
struct dentry *debugfs_metrics_dir;
+ struct dentry *debugfs_subvolume_metrics;
#endif
#ifdef CONFIG_CEPH_FSCACHE
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 1/3] ceph: handle InodeStat v8 versioned field in reply parsing
2025-11-27 13:46 ` [PATCH 1/3] ceph: handle InodeStat v8 versioned field in reply parsing Alex Markuze
@ 2025-11-28 1:48 ` kernel test robot
2025-12-01 20:20 ` Viacheslav Dubeyko
1 sibling, 0 replies; 10+ messages in thread
From: kernel test robot @ 2025-11-28 1:48 UTC (permalink / raw)
To: Alex Markuze, ceph-devel
Cc: llvm, oe-kbuild-all, idryomov, linux-fsdevel, amarkuze, vdubeyko
Hi Alex,
kernel test robot noticed the following build warnings:
[auto build test WARNING on ceph-client/for-linus]
[also build test WARNING on linus/master v6.18-rc7 next-20251127]
[cannot apply to ceph-client/testing]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Alex-Markuze/ceph-handle-InodeStat-v8-versioned-field-in-reply-parsing/20251127-214928
base: https://github.com/ceph/ceph-client.git for-linus
patch link: https://lore.kernel.org/r/20251127134620.2035796-2-amarkuze%40redhat.com
patch subject: [PATCH 1/3] ceph: handle InodeStat v8 versioned field in reply parsing
config: loongarch-defconfig (https://download.01.org/0day-ci/archive/20251128/202511280915.HZ1rLMsq-lkp@intel.com/config)
compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251128/202511280915.HZ1rLMsq-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511280915.HZ1rLMsq-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> fs/ceph/mds_client.c:237:7: warning: variable 'v8_struct_v' set but not used [-Wunused-but-set-variable]
237 | u8 v8_struct_v, v8_struct_compat;
| ^
>> fs/ceph/mds_client.c:237:20: warning: variable 'v8_struct_compat' set but not used [-Wunused-but-set-variable]
237 | u8 v8_struct_v, v8_struct_compat;
| ^
2 warnings generated.
vim +/v8_struct_v +237 fs/ceph/mds_client.c
97
98 /*
99 * parse individual inode info
100 */
101 static int parse_reply_info_in(void **p, void *end,
102 struct ceph_mds_reply_info_in *info,
103 u64 features)
104 {
105 int err = 0;
106 u8 struct_v = 0;
107
108 if (features == (u64)-1) {
109 u32 struct_len;
110 u8 struct_compat;
111 ceph_decode_8_safe(p, end, struct_v, bad);
112 ceph_decode_8_safe(p, end, struct_compat, bad);
113 /* struct_v is expected to be >= 1. we only understand
114 * encoding with struct_compat == 1. */
115 if (!struct_v || struct_compat != 1)
116 goto bad;
117 ceph_decode_32_safe(p, end, struct_len, bad);
118 ceph_decode_need(p, end, struct_len, bad);
119 end = *p + struct_len;
120 }
121
122 ceph_decode_need(p, end, sizeof(struct ceph_mds_reply_inode), bad);
123 info->in = *p;
124 *p += sizeof(struct ceph_mds_reply_inode) +
125 sizeof(*info->in->fragtree.splits) *
126 le32_to_cpu(info->in->fragtree.nsplits);
127
128 ceph_decode_32_safe(p, end, info->symlink_len, bad);
129 ceph_decode_need(p, end, info->symlink_len, bad);
130 info->symlink = *p;
131 *p += info->symlink_len;
132
133 ceph_decode_copy_safe(p, end, &info->dir_layout,
134 sizeof(info->dir_layout), bad);
135 ceph_decode_32_safe(p, end, info->xattr_len, bad);
136 ceph_decode_need(p, end, info->xattr_len, bad);
137 info->xattr_data = *p;
138 *p += info->xattr_len;
139
140 if (features == (u64)-1) {
141 /* inline data */
142 ceph_decode_64_safe(p, end, info->inline_version, bad);
143 ceph_decode_32_safe(p, end, info->inline_len, bad);
144 ceph_decode_need(p, end, info->inline_len, bad);
145 info->inline_data = *p;
146 *p += info->inline_len;
147 /* quota */
148 err = parse_reply_info_quota(p, end, info);
149 if (err < 0)
150 goto out_bad;
151 /* pool namespace */
152 ceph_decode_32_safe(p, end, info->pool_ns_len, bad);
153 if (info->pool_ns_len > 0) {
154 ceph_decode_need(p, end, info->pool_ns_len, bad);
155 info->pool_ns_data = *p;
156 *p += info->pool_ns_len;
157 }
158
159 /* btime */
160 ceph_decode_need(p, end, sizeof(info->btime), bad);
161 ceph_decode_copy(p, &info->btime, sizeof(info->btime));
162
163 /* change attribute */
164 ceph_decode_64_safe(p, end, info->change_attr, bad);
165
166 /* dir pin */
167 if (struct_v >= 2) {
168 ceph_decode_32_safe(p, end, info->dir_pin, bad);
169 } else {
170 info->dir_pin = -ENODATA;
171 }
172
173 /* snapshot birth time, remains zero for v<=2 */
174 if (struct_v >= 3) {
175 ceph_decode_need(p, end, sizeof(info->snap_btime), bad);
176 ceph_decode_copy(p, &info->snap_btime,
177 sizeof(info->snap_btime));
178 } else {
179 memset(&info->snap_btime, 0, sizeof(info->snap_btime));
180 }
181
182 /* snapshot count, remains zero for v<=3 */
183 if (struct_v >= 4) {
184 ceph_decode_64_safe(p, end, info->rsnaps, bad);
185 } else {
186 info->rsnaps = 0;
187 }
188
189 if (struct_v >= 5) {
190 u32 alen;
191
192 ceph_decode_32_safe(p, end, alen, bad);
193
194 while (alen--) {
195 u32 len;
196
197 /* key */
198 ceph_decode_32_safe(p, end, len, bad);
199 ceph_decode_skip_n(p, end, len, bad);
200 /* value */
201 ceph_decode_32_safe(p, end, len, bad);
202 ceph_decode_skip_n(p, end, len, bad);
203 }
204 }
205
206 /* fscrypt flag -- ignore */
207 if (struct_v >= 6)
208 ceph_decode_skip_8(p, end, bad);
209
210 info->fscrypt_auth = NULL;
211 info->fscrypt_auth_len = 0;
212 info->fscrypt_file = NULL;
213 info->fscrypt_file_len = 0;
214 if (struct_v >= 7) {
215 ceph_decode_32_safe(p, end, info->fscrypt_auth_len, bad);
216 if (info->fscrypt_auth_len) {
217 info->fscrypt_auth = kmalloc(info->fscrypt_auth_len,
218 GFP_KERNEL);
219 if (!info->fscrypt_auth)
220 return -ENOMEM;
221 ceph_decode_copy_safe(p, end, info->fscrypt_auth,
222 info->fscrypt_auth_len, bad);
223 }
224 ceph_decode_32_safe(p, end, info->fscrypt_file_len, bad);
225 if (info->fscrypt_file_len) {
226 info->fscrypt_file = kmalloc(info->fscrypt_file_len,
227 GFP_KERNEL);
228 if (!info->fscrypt_file)
229 return -ENOMEM;
230 ceph_decode_copy_safe(p, end, info->fscrypt_file,
231 info->fscrypt_file_len, bad);
232 }
233 }
234
235 /* struct_v 8 added a versioned field - skip it */
236 if (struct_v >= 8) {
> 237 u8 v8_struct_v, v8_struct_compat;
238 u32 v8_struct_len;
239
240 ceph_decode_8_safe(p, end, v8_struct_v, bad);
241 ceph_decode_8_safe(p, end, v8_struct_compat, bad);
242 ceph_decode_32_safe(p, end, v8_struct_len, bad);
243 ceph_decode_skip_n(p, end, v8_struct_len, bad);
244 }
245
246 *p = end;
247 } else {
248 /* legacy (unversioned) struct */
249 if (features & CEPH_FEATURE_MDS_INLINE_DATA) {
250 ceph_decode_64_safe(p, end, info->inline_version, bad);
251 ceph_decode_32_safe(p, end, info->inline_len, bad);
252 ceph_decode_need(p, end, info->inline_len, bad);
253 info->inline_data = *p;
254 *p += info->inline_len;
255 } else
256 info->inline_version = CEPH_INLINE_NONE;
257
258 if (features & CEPH_FEATURE_MDS_QUOTA) {
259 err = parse_reply_info_quota(p, end, info);
260 if (err < 0)
261 goto out_bad;
262 } else {
263 info->max_bytes = 0;
264 info->max_files = 0;
265 }
266
267 info->pool_ns_len = 0;
268 info->pool_ns_data = NULL;
269 if (features & CEPH_FEATURE_FS_FILE_LAYOUT_V2) {
270 ceph_decode_32_safe(p, end, info->pool_ns_len, bad);
271 if (info->pool_ns_len > 0) {
272 ceph_decode_need(p, end, info->pool_ns_len, bad);
273 info->pool_ns_data = *p;
274 *p += info->pool_ns_len;
275 }
276 }
277
278 if (features & CEPH_FEATURE_FS_BTIME) {
279 ceph_decode_need(p, end, sizeof(info->btime), bad);
280 ceph_decode_copy(p, &info->btime, sizeof(info->btime));
281 ceph_decode_64_safe(p, end, info->change_attr, bad);
282 }
283
284 info->dir_pin = -ENODATA;
285 /* info->snap_btime and info->rsnaps remain zero */
286 }
287 return 0;
288 bad:
289 err = -EIO;
290 out_bad:
291 return err;
292 }
293
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/3] ceph: add subvolume metrics collection and reporting
2025-11-27 13:46 ` [PATCH 3/3] ceph: add subvolume metrics collection and reporting Alex Markuze
@ 2025-11-29 16:51 ` Dan Carpenter
0 siblings, 0 replies; 10+ messages in thread
From: Dan Carpenter @ 2025-11-29 16:51 UTC (permalink / raw)
To: oe-kbuild, Alex Markuze, ceph-devel
Cc: lkp, oe-kbuild-all, idryomov, linux-fsdevel, amarkuze, vdubeyko
Hi Alex,
kernel test robot noticed the following build warnings:
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Alex-Markuze/ceph-handle-InodeStat-v8-versioned-field-in-reply-parsing/20251127-214928
base: https://github.com/ceph/ceph-client.git for-linus
patch link: https://lore.kernel.org/r/20251127134620.2035796-4-amarkuze%40redhat.com
patch subject: [PATCH 3/3] ceph: add subvolume metrics collection and reporting
config: x86_64-randconfig-r071-20251128 (https://download.01.org/0day-ci/archive/20251129/202511290541.wFDJlRdO-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202511290541.wFDJlRdO-lkp@intel.com/
smatch warnings:
fs/ceph/debugfs.c:436 subvolume_metrics_show() error: we previously assumed 'mdsc' could be null (see line 399)
vim +/mdsc +436 fs/ceph/debugfs.c
d469e1da62544e7 Alex Markuze 2025-11-27 390 static int subvolume_metrics_show(struct seq_file *s, void *p)
d469e1da62544e7 Alex Markuze 2025-11-27 391 {
d469e1da62544e7 Alex Markuze 2025-11-27 392 struct ceph_fs_client *fsc = s->private;
d469e1da62544e7 Alex Markuze 2025-11-27 393 struct ceph_mds_client *mdsc = fsc->mdsc;
d469e1da62544e7 Alex Markuze 2025-11-27 394 struct ceph_subvol_metric_snapshot *snapshot = NULL;
d469e1da62544e7 Alex Markuze 2025-11-27 395 u32 nr = 0;
d469e1da62544e7 Alex Markuze 2025-11-27 396 u64 total_sent = 0;
d469e1da62544e7 Alex Markuze 2025-11-27 397 u32 i;
d469e1da62544e7 Alex Markuze 2025-11-27 398
d469e1da62544e7 Alex Markuze 2025-11-27 @399 if (mdsc) {
This assumes mdsc can be NULL
d469e1da62544e7 Alex Markuze 2025-11-27 400 mutex_lock(&mdsc->subvol_metrics_last_mutex);
d469e1da62544e7 Alex Markuze 2025-11-27 401 if (mdsc->subvol_metrics_last &&
d469e1da62544e7 Alex Markuze 2025-11-27 402 mdsc->subvol_metrics_last_nr) {
d469e1da62544e7 Alex Markuze 2025-11-27 403 nr = mdsc->subvol_metrics_last_nr;
d469e1da62544e7 Alex Markuze 2025-11-27 404 snapshot = kmemdup(mdsc->subvol_metrics_last,
d469e1da62544e7 Alex Markuze 2025-11-27 405 nr * sizeof(*snapshot),
d469e1da62544e7 Alex Markuze 2025-11-27 406 GFP_KERNEL);
d469e1da62544e7 Alex Markuze 2025-11-27 407 if (!snapshot)
d469e1da62544e7 Alex Markuze 2025-11-27 408 nr = 0;
d469e1da62544e7 Alex Markuze 2025-11-27 409 }
d469e1da62544e7 Alex Markuze 2025-11-27 410 total_sent = mdsc->subvol_metrics_sent;
d469e1da62544e7 Alex Markuze 2025-11-27 411 mutex_unlock(&mdsc->subvol_metrics_last_mutex);
d469e1da62544e7 Alex Markuze 2025-11-27 412 }
d469e1da62544e7 Alex Markuze 2025-11-27 413
d469e1da62544e7 Alex Markuze 2025-11-27 414 seq_puts(s, "Last sent subvolume metrics:\n");
d469e1da62544e7 Alex Markuze 2025-11-27 415 if (!nr) {
d469e1da62544e7 Alex Markuze 2025-11-27 416 seq_puts(s, " (none)\n");
d469e1da62544e7 Alex Markuze 2025-11-27 417 } else {
d469e1da62544e7 Alex Markuze 2025-11-27 418 seq_puts(s, " subvol_id rd_ops wr_ops rd_bytes wr_bytes rd_lat_us wr_lat_us\n");
d469e1da62544e7 Alex Markuze 2025-11-27 419 for (i = 0; i < nr; i++) {
d469e1da62544e7 Alex Markuze 2025-11-27 420 const struct ceph_subvol_metric_snapshot *e = &snapshot[i];
d469e1da62544e7 Alex Markuze 2025-11-27 421
d469e1da62544e7 Alex Markuze 2025-11-27 422 seq_printf(s, " %-18llu %-9llu %-9llu %-14llu %-14llu %-14llu %-14llu\n",
d469e1da62544e7 Alex Markuze 2025-11-27 423 e->subvolume_id,
d469e1da62544e7 Alex Markuze 2025-11-27 424 e->read_ops, e->write_ops,
d469e1da62544e7 Alex Markuze 2025-11-27 425 e->read_bytes, e->write_bytes,
d469e1da62544e7 Alex Markuze 2025-11-27 426 e->read_latency_us, e->write_latency_us);
d469e1da62544e7 Alex Markuze 2025-11-27 427 }
d469e1da62544e7 Alex Markuze 2025-11-27 428 }
d469e1da62544e7 Alex Markuze 2025-11-27 429 kfree(snapshot);
d469e1da62544e7 Alex Markuze 2025-11-27 430
d469e1da62544e7 Alex Markuze 2025-11-27 431 seq_puts(s, "\nStatistics:\n");
d469e1da62544e7 Alex Markuze 2025-11-27 432 seq_printf(s, " entries_sent: %llu\n", total_sent);
d469e1da62544e7 Alex Markuze 2025-11-27 433
d469e1da62544e7 Alex Markuze 2025-11-27 434 mutex_lock(&mdsc->subvol_metrics_last_mutex);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Unchecked dereference
d469e1da62544e7 Alex Markuze 2025-11-27 435 seq_printf(s, " non_zero_sends: %llu\n",
d469e1da62544e7 Alex Markuze 2025-11-27 @436 mdsc->subvol_metrics_nonzero_sends);
d469e1da62544e7 Alex Markuze 2025-11-27 437 mutex_unlock(&mdsc->subvol_metrics_last_mutex);
d469e1da62544e7 Alex Markuze 2025-11-27 438
d469e1da62544e7 Alex Markuze 2025-11-27 439 seq_puts(s, "\nPending (unsent) subvolume metrics:\n");
d469e1da62544e7 Alex Markuze 2025-11-27 440 ceph_subvolume_metrics_dump(&fsc->mdsc->subvol_metrics, s);
d469e1da62544e7 Alex Markuze 2025-11-27 441 return 0;
d469e1da62544e7 Alex Markuze 2025-11-27 442 }
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/3] ceph: add subvolume metrics reporting support
2025-11-27 13:46 [PATCH 0/3] ceph: add subvolume metrics reporting support Alex Markuze
` (2 preceding siblings ...)
2025-11-27 13:46 ` [PATCH 3/3] ceph: add subvolume metrics collection and reporting Alex Markuze
@ 2025-12-01 20:14 ` Viacheslav Dubeyko
3 siblings, 0 replies; 10+ messages in thread
From: Viacheslav Dubeyko @ 2025-12-01 20:14 UTC (permalink / raw)
To: Alex Markuze, ceph-devel@vger.kernel.org
Cc: Viacheslav Dubeyko, idryomov@gmail.com,
linux-fsdevel@vger.kernel.org
On Thu, 2025-11-27 at 13:46 +0000, Alex Markuze wrote:
> This patch series adds support for per-subvolume I/O metrics collection
> and reporting to the MDS. This enables administrators to monitor I/O
> patterns at the subvolume granularity, which is useful for multi-tenant
> CephFS deployments where different subvolumes may be allocated to
> different users or applications.
>
> The implementation requires protocol changes to receive the subvolume_id
> from the MDS (InodeStat v9), and introduces a new metrics type
> (CLIENT_METRIC_TYPE_SUBVOLUME_METRICS) for reporting aggregated I/O
> statistics back to the MDS.
>
> Patch 1 adds forward-compatible handling for InodeStat v8. The MDS v8
> encoding added a versioned optmetadata field containing optional inode
What is "optmetadata"? Do you mean "opt metadata" here? Does it exact name of
the field?
Thanks,
Slava.
> metadata such as charmap (for case-insensitive/case-preserving file
> systems). The kernel client does not currently support case-insensitive
> lookups, so this field is skipped rather than parsed. This ensures
> forward compatibility with newer MDS servers without requiring the
> full case-insensitivity feature implementation.
>
> Patch 2 adds support for parsing the subvolume_id field from InodeStat
> v9 and storing it in the inode structure for later use.
>
> Patch 3 adds the complete subvolume metrics infrastructure:
> - CEPHFS_FEATURE_SUBVOLUME_METRICS feature flag for MDS negotiation
> - Red-black tree based metrics tracker for efficient per-subvolume
> aggregation
> - Wire format encoding matching the MDS C++ AggregatedIOMetrics struct
> - Integration with the existing CLIENT_METRICS message
> - Recording of I/O operations from file read/write and writeback paths
> - Debugfs interfaces for monitoring
>
> Metrics tracked per subvolume include:
> - Read/write operation counts
> - Read/write byte counts
> - Read/write latency sums (for average calculation)
>
> The metrics are periodically sent to the MDS as part of the existing
> metrics reporting infrastructure when the MDS advertises support for
> the SUBVOLUME_METRICS feature.
>
> Debugfs additions in Patch 3:
> - metrics/subvolumes: displays last sent and pending subvolume metrics
> - metrics/metric_features: displays MDS session feature negotiation
> status, showing which metric-related features are enabled (including
> METRIC_COLLECT and SUBVOLUME_METRICS)
>
> Alex Markuze (3):
> ceph: handle InodeStat v8 versioned field in reply parsing
> ceph: parse subvolume_id from InodeStat v9 and store in inode
> ceph: add subvolume metrics collection and reporting
>
> fs/ceph/Makefile | 2 +-
> fs/ceph/addr.c | 10 +
> fs/ceph/debugfs.c | 153 ++++++++++++++
> fs/ceph/file.c | 58 ++++-
> fs/ceph/inode.c | 19 ++
> fs/ceph/mds_client.c | 89 ++++++--
> fs/ceph/mds_client.h | 14 +-
> fs/ceph/metric.c | 172 ++++++++++++++-
> fs/ceph/metric.h | 27 ++-
> fs/ceph/subvolume_metrics.c | 407 ++++++++++++++++++++++++++++++++++++
> fs/ceph/subvolume_metrics.h | 68 ++++++
> fs/ceph/super.c | 1 +
> fs/ceph/super.h | 3 +
> 13 files changed, 997 insertions(+), 26 deletions(-)
> create mode 100644 fs/ceph/subvolume_metrics.c
> create mode 100644 fs/ceph/subvolume_metrics.h
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/3] ceph: handle InodeStat v8 versioned field in reply parsing
2025-11-27 13:46 ` [PATCH 1/3] ceph: handle InodeStat v8 versioned field in reply parsing Alex Markuze
2025-11-28 1:48 ` kernel test robot
@ 2025-12-01 20:20 ` Viacheslav Dubeyko
2025-12-02 10:39 ` Alex Markuze
1 sibling, 1 reply; 10+ messages in thread
From: Viacheslav Dubeyko @ 2025-12-01 20:20 UTC (permalink / raw)
To: Alex Markuze, ceph-devel@vger.kernel.org
Cc: Viacheslav Dubeyko, idryomov@gmail.com,
linux-fsdevel@vger.kernel.org
On Thu, 2025-11-27 at 13:46 +0000, Alex Markuze wrote:
> Add forward-compatible handling for the new versioned field introduced
> in InodeStat v8. This patch only skips the field without using it,
> preparing for future protocol extensions.
>
> The v8 encoding adds a versioned sub-structure that needs to be properly
> decoded and skipped to maintain compatibility with newer MDS versions.
>
> Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> ---
> fs/ceph/mds_client.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 1740047aef0f..32561fc701e5 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -231,6 +231,18 @@ static int parse_reply_info_in(void **p, void *end,
> info->fscrypt_file_len, bad);
> }
> }
> +
> + /* struct_v 8 added a versioned field - skip it */
> + if (struct_v >= 8) {
> + u8 v8_struct_v, v8_struct_compat;
> + u32 v8_struct_len;
> +
Probably, we need to have warning here that, currently, this protocol is not
supported yet.
Thanks,
Slava.
> + ceph_decode_8_safe(p, end, v8_struct_v, bad);
> + ceph_decode_8_safe(p, end, v8_struct_compat, bad);
> + ceph_decode_32_safe(p, end, v8_struct_len, bad);
> + ceph_decode_skip_n(p, end, v8_struct_len, bad);
> + }
> +
> *p = end;
> } else {
> /* legacy (unversioned) struct */
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/3] ceph: parse subvolume_id from InodeStat v9 and store in inode
2025-11-27 13:46 ` [PATCH 2/3] ceph: parse subvolume_id from InodeStat v9 and store in inode Alex Markuze
@ 2025-12-01 23:27 ` Viacheslav Dubeyko
0 siblings, 0 replies; 10+ messages in thread
From: Viacheslav Dubeyko @ 2025-12-01 23:27 UTC (permalink / raw)
To: Alex Markuze, ceph-devel@vger.kernel.org
Cc: Viacheslav Dubeyko, idryomov@gmail.com,
linux-fsdevel@vger.kernel.org
On Thu, 2025-11-27 at 13:46 +0000, Alex Markuze wrote:
> Add support for parsing the subvolume_id field from InodeStat v9 and
> storing it in the inode for later use by subvolume metrics tracking.
>
> The subvolume_id identifies which CephFS subvolume an inode belongs to,
> enabling per-subvolume I/O metrics collection and reporting.
>
> This patch:
> - Adds subvolume_id field to struct ceph_mds_reply_info_in
> - Adds i_subvolume_id field to struct ceph_inode_info
> - Parses subvolume_id from v9 InodeStat in parse_reply_info_in()
> - Adds ceph_inode_set_subvolume() helper to propagate the ID to inodes
> - Initializes i_subvolume_id in inode allocation and clears on destroy
>
> Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> ---
> fs/ceph/inode.c | 19 +++++++++++++++++++
> fs/ceph/mds_client.c | 7 +++++++
> fs/ceph/mds_client.h | 1 +
> fs/ceph/super.h | 2 ++
> 4 files changed, 29 insertions(+)
>
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index a6e260d9e420..c3fb4dac4692 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -638,6 +638,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
>
> ci->i_max_bytes = 0;
> ci->i_max_files = 0;
> + ci->i_subvolume_id = 0;
So, which number starts to identify subvolume ID? Is 0 valid identification
number?
If 0 is valid identification number, then I assume we need to assign some other
number because we don't know the subvolume ID yet. What's about U64_MAX, for
example, or some other constant representing the invalid value?
>
> memset(&ci->i_dir_layout, 0, sizeof(ci->i_dir_layout));
> memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
> @@ -742,6 +743,8 @@ void ceph_evict_inode(struct inode *inode)
>
> percpu_counter_dec(&mdsc->metric.total_inodes);
>
> + ci->i_subvolume_id = 0;
Ditto.
> +
> netfs_wait_for_outstanding_io(inode);
> truncate_inode_pages_final(&inode->i_data);
> if (inode->i_state & I_PINNING_NETFS_WB)
> @@ -873,6 +876,18 @@ int ceph_fill_file_size(struct inode *inode, int issued,
> return queue_trunc;
> }
>
> +void ceph_inode_set_subvolume(struct inode *inode, u64 subvolume_id)
> +{
> + struct ceph_inode_info *ci;
> +
> + if (!inode || !subvolume_id)
Are you sure that 0 is invalid ID for subvolume_id? What about to introduce some
named constant and to compare it with subvolume_id here? For example, we can
introduce CEPH_INVALID_SUBVOLUME_ID.
if (!inode || subvolume_id == CEPH_INVALID_SUBVOLUME_ID)
> + return;
> +
> + ci = ceph_inode(inode);
> + if (READ_ONCE(ci->i_subvolume_id) != subvolume_id)
> + WRITE_ONCE(ci->i_subvolume_id, subvolume_id);
Should the ceph_inode_set_subvolume() operation be protected by i_ceph_lock?
If ci->i_subvolume_id is already correct value, then how correct is reset of
this value? I am afraid that we could have potential bugs here. Maybe, we should
have methods of set_subvolume_id()/delete_subvolume_id()?
How looks like the interface of setting and resetting the subvolume_id?
> +}
> +
> void ceph_fill_file_time(struct inode *inode, int issued,
> u64 time_warp_seq, struct timespec64 *ctime,
> struct timespec64 *mtime, struct timespec64 *atime)
> @@ -1087,6 +1102,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
> new_issued = ~issued & info_caps;
>
> __ceph_update_quota(ci, iinfo->max_bytes, iinfo->max_files);
> + ceph_inode_set_subvolume(inode, iinfo->subvolume_id);
>
> #ifdef CONFIG_FS_ENCRYPTION
> if (iinfo->fscrypt_auth_len &&
> @@ -1594,6 +1610,8 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
> goto done;
> }
> if (parent_dir) {
> + ceph_inode_set_subvolume(parent_dir,
> + rinfo->diri.subvolume_id);
> err = ceph_fill_inode(parent_dir, NULL, &rinfo->diri,
> rinfo->dirfrag, session, -1,
> &req->r_caps_reservation);
> @@ -1682,6 +1700,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
> BUG_ON(!req->r_target_inode);
>
> in = req->r_target_inode;
> + ceph_inode_set_subvolume(in, rinfo->targeti.subvolume_id);
> err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
> NULL, session,
> (!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 32561fc701e5..6f66097f740b 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -105,6 +105,8 @@ static int parse_reply_info_in(void **p, void *end,
> int err = 0;
> u8 struct_v = 0;
>
> + info->subvolume_id = 0;
Should we have method for struct ceph_mds_reply_info_in likewise for struct
ceph_inode_info?
> +
> if (features == (u64)-1) {
> u32 struct_len;
> u8 struct_compat;
> @@ -243,6 +245,10 @@ static int parse_reply_info_in(void **p, void *end,
> ceph_decode_skip_n(p, end, v8_struct_len, bad);
> }
>
> + /* struct_v 9 added subvolume_id */
> + if (struct_v >= 9)
> + ceph_decode_64_safe(p, end, info->subvolume_id, bad);
As far as I remember from previous patch we had version 8 and we don't support
it. Do we mean the version 9 of another protocol here?
Thanks,
Slava.
> +
> *p = end;
> } else {
> /* legacy (unversioned) struct */
> @@ -3962,6 +3968,7 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
> goto out_err;
> }
> req->r_target_inode = in;
> + ceph_inode_set_subvolume(in, rinfo->targeti.subvolume_id);
> }
>
> mutex_lock(&session->s_mutex);
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 0428a5eaf28c..bd3690baa65c 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -118,6 +118,7 @@ struct ceph_mds_reply_info_in {
> u32 fscrypt_file_len;
> u64 rsnaps;
> u64 change_attr;
> + u64 subvolume_id;
> };
>
> struct ceph_mds_reply_dir_entry {
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index a1f781c46b41..69069c920683 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -385,6 +385,7 @@ struct ceph_inode_info {
>
> /* quotas */
> u64 i_max_bytes, i_max_files;
> + u64 i_subvolume_id;
>
> s32 i_dir_pin;
>
> @@ -1057,6 +1058,7 @@ extern struct inode *ceph_get_inode(struct super_block *sb,
> extern struct inode *ceph_get_snapdir(struct inode *parent);
> extern int ceph_fill_file_size(struct inode *inode, int issued,
> u32 truncate_seq, u64 truncate_size, u64 size);
> +extern void ceph_inode_set_subvolume(struct inode *inode, u64 subvolume_id);
> extern void ceph_fill_file_time(struct inode *inode, int issued,
> u64 time_warp_seq, struct timespec64 *ctime,
> struct timespec64 *mtime,
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/3] ceph: handle InodeStat v8 versioned field in reply parsing
2025-12-01 20:20 ` Viacheslav Dubeyko
@ 2025-12-02 10:39 ` Alex Markuze
0 siblings, 0 replies; 10+ messages in thread
From: Alex Markuze @ 2025-12-02 10:39 UTC (permalink / raw)
To: Viacheslav Dubeyko
Cc: ceph-devel@vger.kernel.org, Viacheslav Dubeyko,
idryomov@gmail.com, linux-fsdevel@vger.kernel.org
v8 was added for case insensitive file systems, not relevant to Linux,
I can add a comment saying that.
On Mon, Dec 1, 2025 at 10:20 PM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Thu, 2025-11-27 at 13:46 +0000, Alex Markuze wrote:
> > Add forward-compatible handling for the new versioned field introduced
> > in InodeStat v8. This patch only skips the field without using it,
> > preparing for future protocol extensions.
> >
> > The v8 encoding adds a versioned sub-structure that needs to be properly
> > decoded and skipped to maintain compatibility with newer MDS versions.
> >
> > Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> > ---
> > fs/ceph/mds_client.c | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 1740047aef0f..32561fc701e5 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -231,6 +231,18 @@ static int parse_reply_info_in(void **p, void *end,
> > info->fscrypt_file_len, bad);
> > }
> > }
> > +
> > + /* struct_v 8 added a versioned field - skip it */
> > + if (struct_v >= 8) {
> > + u8 v8_struct_v, v8_struct_compat;
> > + u32 v8_struct_len;
> > +
>
> Probably, we need to have warning here that, currently, this protocol is not
> supported yet.
>
> Thanks,
> Slava.
>
> > + ceph_decode_8_safe(p, end, v8_struct_v, bad);
> > + ceph_decode_8_safe(p, end, v8_struct_compat, bad);
> > + ceph_decode_32_safe(p, end, v8_struct_len, bad);
> > + ceph_decode_skip_n(p, end, v8_struct_len, bad);
> > + }
> > +
> > *p = end;
> > } else {
> > /* legacy (unversioned) struct */
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2025-12-02 10:39 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-27 13:46 [PATCH 0/3] ceph: add subvolume metrics reporting support Alex Markuze
2025-11-27 13:46 ` [PATCH 1/3] ceph: handle InodeStat v8 versioned field in reply parsing Alex Markuze
2025-11-28 1:48 ` kernel test robot
2025-12-01 20:20 ` Viacheslav Dubeyko
2025-12-02 10:39 ` Alex Markuze
2025-11-27 13:46 ` [PATCH 2/3] ceph: parse subvolume_id from InodeStat v9 and store in inode Alex Markuze
2025-12-01 23:27 ` Viacheslav Dubeyko
2025-11-27 13:46 ` [PATCH 3/3] ceph: add subvolume metrics collection and reporting Alex Markuze
2025-11-29 16:51 ` Dan Carpenter
2025-12-01 20:14 ` [PATCH 0/3] ceph: add subvolume metrics reporting support Viacheslav Dubeyko
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).