linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v3 0/4] ceph: add subvolume metrics reporting support
@ 2025-12-03 15:46 Alex Markuze
  2025-12-03 15:46 ` [PATCH v3 1/4] ceph: handle InodeStat v8 versioned field in reply parsing Alex Markuze
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Alex Markuze @ 2025-12-03 15:46 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, linux-fsdevel, amarkuze, vdubeyko

This patch series adds support for per-subvolume I/O metrics collection
and reporting to the MDS. This enables administrators to monitor I/O
patterns at subvolume granularity, which is useful for multi-tenant
CephFS deployments where different subvolumes may be allocated to
different users or applications.

The implementation requires protocol changes to receive the subvolume_id
from the MDS (InodeStat v9), and introduces a new metrics type
(CLIENT_METRIC_TYPE_SUBVOLUME_METRICS) for reporting aggregated I/O
statistics back to the MDS.

Patch 1 adds forward-compatible handling for InodeStat v8. The MDS v8
encoding added a versioned optmetadata field containing optional inode
metadata such as charmap (for case-insensitive/case-preserving file
systems). The kernel client does not currently support case-insensitive
lookups, so this field is skipped rather than parsed. This ensures
forward compatibility with newer MDS servers without requiring the
full case-insensitivity feature implementation.

Patch 2 adds support for parsing the subvolume_id field from InodeStat
v9 and storing it in the inode structure for later use.

Patch 3 adds the complete subvolume metrics infrastructure:
- CEPHFS_FEATURE_SUBVOLUME_METRICS feature flag for MDS negotiation
- Red-black tree based metrics tracker for efficient per-subvolume
  aggregation with kmem_cache for entry allocations
- Wire format encoding matching the MDS C++ AggregatedIOMetrics struct
- Integration with the existing CLIENT_METRICS message
- Recording of I/O operations from file read/write and writeback paths
- Debugfs interfaces for monitoring

Metrics tracked per subvolume include:
- Read/write operation counts
- Read/write byte counts
- Read/write latency sums (for average calculation)

The metrics are periodically sent to the MDS as part of the existing
metrics reporting infrastructure when the MDS advertises support for
the SUBVOLUME_METRICS feature.

Debugfs additions in Patch 3:
- metrics/subvolumes: displays last sent and pending subvolume metrics
- metrics/metric_features: displays MDS session feature negotiation
  status, showing which metric-related features are enabled (including
  METRIC_COLLECT and SUBVOLUME_METRICS)

Patch 4 introduces the CEPH_SUBVOLUME_ID_NONE constant and enforces
subvolume_id immutability. Following the FUSE client convention,
0 means unknown/unset. Once an inode has a valid (non-zero) subvolume_id,
it must not change for the inode's lifetime.

Changes since v2:
- Add CEPH_SUBVOLUME_ID_NONE constant (value 0) for unknown/unset state
- Add WARN_ON_ONCE if attempting to change already-set subvolume_id
- Add documentation for struct ceph_session_feature_desc ('bit' field)
- Change pr_err() to pr_info() for "metrics disabled" message
- Use pr_warn_ratelimited() instead of manual __ratelimit()
- Add documentation comments to ceph_subvol_metric_snapshot and
  ceph_subvolume_metrics_tracker structs
- Use kmemdup_array() instead of kmemdup() for overflow checking
- Add comments explaining ret > 0 checks for read metrics (EOF handling)
- Use kmem_cache for struct ceph_subvol_metric_rb_entry allocations
- Add comment explaining seq_file error handling in dump function

Changes since v1:
- Fixed unused variable warnings (v8_struct_v, v8_struct_compat) by
  using ceph_decode_skip_8() instead of ceph_decode_8_safe()
- Added detailed comment explaining InodeStat encoding versions v1-v9
- Clarified that "optmetadata" is the actual field name in MDS C++ code
- Aligned subvolume_id handling with FUSE client convention (0 = unknown)

Alex Markuze (4):
  ceph: handle InodeStat v8 versioned field in reply parsing
  ceph: parse subvolume_id from InodeStat v9 and store in inode
  ceph: add subvolume metrics collection and reporting
  ceph: add CEPH_SUBVOLUME_ID_NONE

 fs/ceph/Makefile            |   2 +-
 fs/ceph/addr.c              |  10 +
 fs/ceph/debugfs.c           | 159 +++++++++++++
 fs/ceph/file.c              |  68 +++++-
 fs/ceph/inode.c             |  41 ++++
 fs/ceph/mds_client.c        |  94 ++++++--
 fs/ceph/mds_client.h        |  14 +-
 fs/ceph/metric.c            | 173 ++++++++++++++-
 fs/ceph/metric.h            |  27 ++-
 fs/ceph/subvolume_metrics.c | 432 ++++++++++++++++++++++++++++++++++++
 fs/ceph/subvolume_metrics.h |  97 ++++++++
 fs/ceph/super.c             |   8 +
 fs/ceph/super.h             |  11 +
 13 files changed, 1108 insertions(+), 28 deletions(-)
 create mode 100644 fs/ceph/subvolume_metrics.c
 create mode 100644 fs/ceph/subvolume_metrics.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v3 1/4] ceph: handle InodeStat v8 versioned field in reply parsing
  2025-12-03 15:46 [PATCH v3 0/4] ceph: add subvolume metrics reporting support Alex Markuze
@ 2025-12-03 15:46 ` Alex Markuze
  2025-12-03 15:46 ` [PATCH v3 2/4] ceph: parse subvolume_id from InodeStat v9 and store in inode Alex Markuze
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Alex Markuze @ 2025-12-03 15:46 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, linux-fsdevel, amarkuze, vdubeyko

Add forward-compatible handling for the new versioned field introduced
in InodeStat v8. This patch only skips the field without using it,
preparing for future protocol extensions.

The v8 encoding adds a versioned sub-structure that needs to be properly
decoded and skipped to maintain compatibility with newer MDS versions.

Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
 fs/ceph/mds_client.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 1740047aef0f..d7d8178e1f9a 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -231,6 +231,26 @@ static int parse_reply_info_in(void **p, void *end,
 						      info->fscrypt_file_len, bad);
 			}
 		}
+
+		/*
+		 * InodeStat encoding versions:
+		 *   v1-v7: various fields added over time
+		 *   v8: added optmetadata (versioned sub-structure containing
+		 *       optional inode metadata like charmap for case-insensitive
+		 *       filesystems). The kernel client doesn't support
+		 *       case-insensitive lookups, so we skip this field.
+		 *   v9: added subvolume_id (parsed below)
+		 */
+		if (struct_v >= 8) {
+			u32 v8_struct_len;
+
+			/* skip optmetadata versioned sub-structure */
+			ceph_decode_skip_8(p, end, bad);  /* struct_v */
+			ceph_decode_skip_8(p, end, bad);  /* struct_compat */
+			ceph_decode_32_safe(p, end, v8_struct_len, bad);
+			ceph_decode_skip_n(p, end, v8_struct_len, bad);
+		}
+
 		*p = end;
 	} else {
 		/* legacy (unversioned) struct */
-- 
2.34.1



* [PATCH v3 2/4] ceph: parse subvolume_id from InodeStat v9 and store in inode
  2025-12-03 15:46 [PATCH v3 0/4] ceph: add subvolume metrics reporting support Alex Markuze
  2025-12-03 15:46 ` [PATCH v3 1/4] ceph: handle InodeStat v8 versioned field in reply parsing Alex Markuze
@ 2025-12-03 15:46 ` Alex Markuze
  2025-12-03 15:46 ` [PATCH v3 3/4] ceph: add subvolume metrics collection and reporting Alex Markuze
  2025-12-03 15:46 ` [PATCH v3 4/4] ceph: add CEPH_SUBVOLUME_ID_NONE Alex Markuze
  3 siblings, 0 replies; 12+ messages in thread
From: Alex Markuze @ 2025-12-03 15:46 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, linux-fsdevel, amarkuze, vdubeyko

Add support for parsing the subvolume_id field from InodeStat v9 and
storing it in the inode for later use by subvolume metrics tracking.

The subvolume_id identifies which CephFS subvolume an inode belongs to,
enabling per-subvolume I/O metrics collection and reporting.

This patch:
- Adds subvolume_id field to struct ceph_mds_reply_info_in
- Adds i_subvolume_id field to struct ceph_inode_info
- Parses subvolume_id from v9 InodeStat in parse_reply_info_in()
- Adds ceph_inode_set_subvolume() helper to propagate the ID to inodes
- Initializes i_subvolume_id in inode allocation and clears on destroy

Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
 fs/ceph/inode.c      | 23 +++++++++++++++++++++++
 fs/ceph/mds_client.c |  7 +++++++
 fs/ceph/mds_client.h |  1 +
 fs/ceph/super.h      |  2 ++
 4 files changed, 33 insertions(+)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index a6e260d9e420..835049004047 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -638,6 +638,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
 
 	ci->i_max_bytes = 0;
 	ci->i_max_files = 0;
+	ci->i_subvolume_id = 0;
 
 	memset(&ci->i_dir_layout, 0, sizeof(ci->i_dir_layout));
 	memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
@@ -742,6 +743,8 @@ void ceph_evict_inode(struct inode *inode)
 
 	percpu_counter_dec(&mdsc->metric.total_inodes);
 
+	ci->i_subvolume_id = 0;
+
 	netfs_wait_for_outstanding_io(inode);
 	truncate_inode_pages_final(&inode->i_data);
 	if (inode->i_state & I_PINNING_NETFS_WB)
@@ -873,6 +876,22 @@ int ceph_fill_file_size(struct inode *inode, int issued,
 	return queue_trunc;
 }
 
+/*
+ * Set the subvolume ID for an inode. Following the FUSE client convention,
+ * 0 means unknown/unset (MDS only sends non-zero IDs for subvolume inodes).
+ */
+void ceph_inode_set_subvolume(struct inode *inode, u64 subvolume_id)
+{
+	struct ceph_inode_info *ci;
+
+	if (!inode || !subvolume_id)
+		return;
+
+	ci = ceph_inode(inode);
+	if (READ_ONCE(ci->i_subvolume_id) != subvolume_id)
+		WRITE_ONCE(ci->i_subvolume_id, subvolume_id);
+}
+
 void ceph_fill_file_time(struct inode *inode, int issued,
 			 u64 time_warp_seq, struct timespec64 *ctime,
 			 struct timespec64 *mtime, struct timespec64 *atime)
@@ -1087,6 +1106,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
 	new_issued = ~issued & info_caps;
 
 	__ceph_update_quota(ci, iinfo->max_bytes, iinfo->max_files);
+	ceph_inode_set_subvolume(inode, iinfo->subvolume_id);
 
 #ifdef CONFIG_FS_ENCRYPTION
 	if (iinfo->fscrypt_auth_len &&
@@ -1594,6 +1614,8 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 			goto done;
 		}
 		if (parent_dir) {
+			ceph_inode_set_subvolume(parent_dir,
+						 rinfo->diri.subvolume_id);
 			err = ceph_fill_inode(parent_dir, NULL, &rinfo->diri,
 					      rinfo->dirfrag, session, -1,
 					      &req->r_caps_reservation);
@@ -1682,6 +1704,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 		BUG_ON(!req->r_target_inode);
 
 		in = req->r_target_inode;
+		ceph_inode_set_subvolume(in, rinfo->targeti.subvolume_id);
 		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
 				NULL, session,
 				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index d7d8178e1f9a..099b8f22683b 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -105,6 +105,8 @@ static int parse_reply_info_in(void **p, void *end,
 	int err = 0;
 	u8 struct_v = 0;
 
+	info->subvolume_id = 0;
+
 	if (features == (u64)-1) {
 		u32 struct_len;
 		u8 struct_compat;
@@ -251,6 +253,10 @@ static int parse_reply_info_in(void **p, void *end,
 			ceph_decode_skip_n(p, end, v8_struct_len, bad);
 		}
 
+		/* struct_v 9 added subvolume_id */
+		if (struct_v >= 9)
+			ceph_decode_64_safe(p, end, info->subvolume_id, bad);
+
 		*p = end;
 	} else {
 		/* legacy (unversioned) struct */
@@ -3970,6 +3976,7 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
 			goto out_err;
 		}
 		req->r_target_inode = in;
+		ceph_inode_set_subvolume(in, rinfo->targeti.subvolume_id);
 	}
 
 	mutex_lock(&session->s_mutex);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 0428a5eaf28c..bd3690baa65c 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -118,6 +118,7 @@ struct ceph_mds_reply_info_in {
 	u32 fscrypt_file_len;
 	u64 rsnaps;
 	u64 change_attr;
+	u64 subvolume_id;
 };
 
 struct ceph_mds_reply_dir_entry {
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index a1f781c46b41..c0372a725960 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -385,6 +385,7 @@ struct ceph_inode_info {
 
 	/* quotas */
 	u64 i_max_bytes, i_max_files;
+	u64 i_subvolume_id;	/* 0 = unknown/unset, matches FUSE client */
 
 	s32 i_dir_pin;
 
@@ -1057,6 +1058,7 @@ extern struct inode *ceph_get_inode(struct super_block *sb,
 extern struct inode *ceph_get_snapdir(struct inode *parent);
 extern int ceph_fill_file_size(struct inode *inode, int issued,
 			       u32 truncate_seq, u64 truncate_size, u64 size);
+extern void ceph_inode_set_subvolume(struct inode *inode, u64 subvolume_id);
 extern void ceph_fill_file_time(struct inode *inode, int issued,
 				u64 time_warp_seq, struct timespec64 *ctime,
 				struct timespec64 *mtime,
-- 
2.34.1



* [PATCH v3 3/4] ceph: add subvolume metrics collection and reporting
  2025-12-03 15:46 [PATCH v3 0/4] ceph: add subvolume metrics reporting support Alex Markuze
  2025-12-03 15:46 ` [PATCH v3 1/4] ceph: handle InodeStat v8 versioned field in reply parsing Alex Markuze
  2025-12-03 15:46 ` [PATCH v3 2/4] ceph: parse subvolume_id from InodeStat v9 and store in inode Alex Markuze
@ 2025-12-03 15:46 ` Alex Markuze
  2025-12-03 15:46 ` [PATCH v3 4/4] ceph: add CEPH_SUBVOLUME_ID_NONE Alex Markuze
  3 siblings, 0 replies; 12+ messages in thread
From: Alex Markuze @ 2025-12-03 15:46 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, linux-fsdevel, amarkuze, vdubeyko

Add complete subvolume metrics infrastructure for tracking and reporting
per-subvolume I/O metrics to the MDS. This enables administrators to
monitor I/O patterns at subvolume granularity.

The implementation includes:

- New CEPHFS_FEATURE_SUBVOLUME_METRICS feature flag for MDS negotiation
- Red-black tree based metrics tracker (subvolume_metrics.c/h)
- Wire format encoding matching the MDS C++ AggregatedIOMetrics struct
- Integration with the existing metrics reporting infrastructure
- Recording of I/O operations from file read/write paths
- Debugfs interface for monitoring collected metrics

Metrics tracked per subvolume:
- Read/write operation counts
- Read/write byte counts
- Read/write latency sums (for average calculation)

The metrics are periodically sent to the MDS as part of the existing
CLIENT_METRICS message when the MDS advertises support for the
SUBVOLUME_METRICS feature.

Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
 fs/ceph/Makefile            |   2 +-
 fs/ceph/addr.c              |  10 +
 fs/ceph/debugfs.c           | 159 +++++++++++++
 fs/ceph/file.c              |  68 +++++-
 fs/ceph/mds_client.c        |  70 ++++--
 fs/ceph/mds_client.h        |  13 +-
 fs/ceph/metric.c            | 173 ++++++++++++++-
 fs/ceph/metric.h            |  27 ++-
 fs/ceph/subvolume_metrics.c | 431 ++++++++++++++++++++++++++++++++++++
 fs/ceph/subvolume_metrics.h |  97 ++++++++
 fs/ceph/super.c             |   8 +
 fs/ceph/super.h             |   1 +
 12 files changed, 1031 insertions(+), 28 deletions(-)
 create mode 100644 fs/ceph/subvolume_metrics.c
 create mode 100644 fs/ceph/subvolume_metrics.h

diff --git a/fs/ceph/Makefile b/fs/ceph/Makefile
index 1f77ca04c426..ebb29d11ac22 100644
--- a/fs/ceph/Makefile
+++ b/fs/ceph/Makefile
@@ -8,7 +8,7 @@ obj-$(CONFIG_CEPH_FS) += ceph.o
 ceph-y := super.o inode.o dir.o file.o locks.o addr.o ioctl.o \
 	export.o caps.o snap.o xattr.o quota.o io.o \
 	mds_client.o mdsmap.o strings.o ceph_frag.o \
-	debugfs.o util.o metric.o
+	debugfs.o util.o metric.o subvolume_metrics.o
 
 ceph-$(CONFIG_CEPH_FSCACHE) += cache.o
 ceph-$(CONFIG_CEPH_FS_POSIX_ACL) += acl.o
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 322ed268f14a..feae80dc2816 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -19,6 +19,7 @@
 #include "mds_client.h"
 #include "cache.h"
 #include "metric.h"
+#include "subvolume_metrics.h"
 #include "crypto.h"
 #include <linux/ceph/osd_client.h>
 #include <linux/ceph/striper.h>
@@ -823,6 +824,10 @@ static int write_folio_nounlock(struct folio *folio,
 
 	ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
 				  req->r_end_latency, len, err);
+	if (err >= 0 && len > 0)
+		ceph_subvolume_metrics_record_io(fsc->mdsc, ci, true, len,
+						 req->r_start_latency,
+						 req->r_end_latency);
 	fscrypt_free_bounce_page(bounce_page);
 	ceph_osdc_put_request(req);
 	if (err == 0)
@@ -963,6 +968,11 @@ static void writepages_finish(struct ceph_osd_request *req)
 	ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
 				  req->r_end_latency, len, rc);
 
+	if (rc >= 0 && len > 0)
+		ceph_subvolume_metrics_record_io(mdsc, ci, true, len,
+						 req->r_start_latency,
+						 req->r_end_latency);
+
 	ceph_put_wrbuffer_cap_refs(ci, total_pages, snapc);
 
 	osd_data = osd_req_op_extent_osd_data(req, 0);
diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index f3fe786b4143..d49069a90f91 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -9,11 +9,13 @@
 #include <linux/seq_file.h>
 #include <linux/math64.h>
 #include <linux/ktime.h>
+#include <linux/atomic.h>
 
 #include <linux/ceph/libceph.h>
 #include <linux/ceph/mon_client.h>
 #include <linux/ceph/auth.h>
 #include <linux/ceph/debugfs.h>
+#include <linux/ceph/decode.h>
 
 #include "super.h"
 
@@ -21,6 +23,38 @@
 
 #include "mds_client.h"
 #include "metric.h"
+#include "subvolume_metrics.h"
+
+extern bool disable_send_metrics;
+
+/**
+ * struct ceph_session_feature_desc - Maps feature bits to names for debugfs
+ * @bit: Feature bit number from enum ceph_feature_type (see mds_client.h)
+ * @name: Human-readable feature name for debugfs output
+ *
+ * Used by metric_features_show() to display negotiated session features.
+ */
+struct ceph_session_feature_desc {
+	unsigned int bit;
+	const char *name;
+};
+
+static const struct ceph_session_feature_desc ceph_session_feature_table[] = {
+	{ CEPHFS_FEATURE_METRIC_COLLECT, "METRIC_COLLECT" },
+	{ CEPHFS_FEATURE_REPLY_ENCODING, "REPLY_ENCODING" },
+	{ CEPHFS_FEATURE_RECLAIM_CLIENT, "RECLAIM_CLIENT" },
+	{ CEPHFS_FEATURE_LAZY_CAP_WANTED, "LAZY_CAP_WANTED" },
+	{ CEPHFS_FEATURE_MULTI_RECONNECT, "MULTI_RECONNECT" },
+	{ CEPHFS_FEATURE_DELEG_INO, "DELEG_INO" },
+	{ CEPHFS_FEATURE_ALTERNATE_NAME, "ALTERNATE_NAME" },
+	{ CEPHFS_FEATURE_NOTIFY_SESSION_STATE, "NOTIFY_SESSION_STATE" },
+	{ CEPHFS_FEATURE_OP_GETVXATTR, "OP_GETVXATTR" },
+	{ CEPHFS_FEATURE_32BITS_RETRY_FWD, "32BITS_RETRY_FWD" },
+	{ CEPHFS_FEATURE_NEW_SNAPREALM_INFO, "NEW_SNAPREALM_INFO" },
+	{ CEPHFS_FEATURE_HAS_OWNER_UIDGID, "HAS_OWNER_UIDGID" },
+	{ CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK, "MDS_AUTH_CAPS_CHECK" },
+	{ CEPHFS_FEATURE_SUBVOLUME_METRICS, "SUBVOLUME_METRICS" },
+};
 
 static int mdsmap_show(struct seq_file *s, void *p)
 {
@@ -360,6 +394,59 @@ static int status_show(struct seq_file *s, void *p)
 	return 0;
 }
 
+static int subvolume_metrics_show(struct seq_file *s, void *p)
+{
+	struct ceph_fs_client *fsc = s->private;
+	struct ceph_mds_client *mdsc = fsc->mdsc;
+	struct ceph_subvol_metric_snapshot *snapshot = NULL;
+	u32 nr = 0;
+	u64 total_sent = 0;
+	u64 nonzero_sends = 0;
+	u32 i;
+
+	if (!mdsc) {
+		seq_puts(s, "mds client unavailable\n");
+		return 0;
+	}
+
+	mutex_lock(&mdsc->subvol_metrics_last_mutex);
+	if (mdsc->subvol_metrics_last && mdsc->subvol_metrics_last_nr) {
+		nr = mdsc->subvol_metrics_last_nr;
+		snapshot = kmemdup_array(mdsc->subvol_metrics_last, nr,
+					 sizeof(*snapshot), GFP_KERNEL);
+		if (!snapshot)
+			nr = 0;
+	}
+	total_sent = mdsc->subvol_metrics_sent;
+	nonzero_sends = mdsc->subvol_metrics_nonzero_sends;
+	mutex_unlock(&mdsc->subvol_metrics_last_mutex);
+
+	seq_puts(s, "Last sent subvolume metrics:\n");
+	if (!nr) {
+		seq_puts(s, "  (none)\n");
+	} else {
+		seq_puts(s, "  subvol_id          rd_ops    wr_ops    rd_bytes       wr_bytes       rd_lat_us      wr_lat_us\n");
+		for (i = 0; i < nr; i++) {
+			const struct ceph_subvol_metric_snapshot *e = &snapshot[i];
+
+			seq_printf(s, "  %-18llu %-9llu %-9llu %-14llu %-14llu %-14llu %-14llu\n",
+				   e->subvolume_id,
+				   e->read_ops, e->write_ops,
+				   e->read_bytes, e->write_bytes,
+				   e->read_latency_us, e->write_latency_us);
+		}
+	}
+	kfree(snapshot);
+
+	seq_puts(s, "\nStatistics:\n");
+	seq_printf(s, "  entries_sent:      %llu\n", total_sent);
+	seq_printf(s, "  non_zero_sends:    %llu\n", nonzero_sends);
+
+	seq_puts(s, "\nPending (unsent) subvolume metrics:\n");
+	ceph_subvolume_metrics_dump(&mdsc->subvol_metrics, s);
+	return 0;
+}
+
 DEFINE_SHOW_ATTRIBUTE(mdsmap);
 DEFINE_SHOW_ATTRIBUTE(mdsc);
 DEFINE_SHOW_ATTRIBUTE(caps);
@@ -369,7 +456,72 @@ DEFINE_SHOW_ATTRIBUTE(metrics_file);
 DEFINE_SHOW_ATTRIBUTE(metrics_latency);
 DEFINE_SHOW_ATTRIBUTE(metrics_size);
 DEFINE_SHOW_ATTRIBUTE(metrics_caps);
+DEFINE_SHOW_ATTRIBUTE(subvolume_metrics);
+
+static int metric_features_show(struct seq_file *s, void *p)
+{
+	struct ceph_fs_client *fsc = s->private;
+	struct ceph_mds_client *mdsc = fsc->mdsc;
+	unsigned long session_features = 0;
+	bool have_session = false;
+	bool metric_collect = false;
+	bool subvol_support = false;
+	bool metrics_enabled = false;
+	bool subvol_enabled = false;
+	int i;
+
+	if (!mdsc) {
+		seq_puts(s, "mds client unavailable\n");
+		return 0;
+	}
+
+	mutex_lock(&mdsc->mutex);
+	if (mdsc->metric.session) {
+		have_session = true;
+		session_features = mdsc->metric.session->s_features;
+	}
+	mutex_unlock(&mdsc->mutex);
+
+	if (have_session) {
+		metric_collect =
+			test_bit(CEPHFS_FEATURE_METRIC_COLLECT,
+				 &session_features);
+		subvol_support =
+			test_bit(CEPHFS_FEATURE_SUBVOLUME_METRICS,
+				 &session_features);
+	}
+
+	metrics_enabled = !disable_send_metrics && have_session && metric_collect;
+	subvol_enabled = metrics_enabled && subvol_support;
+
+	seq_printf(s,
+		   "metrics_enabled: %s (disable_send_metrics=%d, session=%s, metric_collect=%s)\n",
+		   metrics_enabled ? "yes" : "no",
+		   disable_send_metrics ? 1 : 0,
+		   have_session ? "yes" : "no",
+		   metric_collect ? "yes" : "no");
+	seq_printf(s, "subvolume_metrics_enabled: %s\n",
+		   subvol_enabled ? "yes" : "no");
+	seq_printf(s, "session_feature_bits: 0x%lx\n", session_features);
+
+	if (!have_session) {
+		seq_puts(s, "(no active MDS session for metrics)\n");
+		return 0;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(ceph_session_feature_table); i++) {
+		const struct ceph_session_feature_desc *desc =
+			&ceph_session_feature_table[i];
+		bool set = test_bit(desc->bit, &session_features);
+
+		seq_printf(s, "  %-24s : %s\n", desc->name,
+			   set ? "yes" : "no");
+	}
+
+	return 0;
+}
 
+DEFINE_SHOW_ATTRIBUTE(metric_features);
 
 /*
  * debugfs
@@ -404,6 +556,7 @@ void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
 	debugfs_remove(fsc->debugfs_caps);
 	debugfs_remove(fsc->debugfs_status);
 	debugfs_remove(fsc->debugfs_mdsc);
+	debugfs_remove(fsc->debugfs_subvolume_metrics);
 	debugfs_remove_recursive(fsc->debugfs_metrics_dir);
 	doutc(fsc->client, "done\n");
 }
@@ -468,6 +621,12 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
 			    &metrics_size_fops);
 	debugfs_create_file("caps", 0400, fsc->debugfs_metrics_dir, fsc,
 			    &metrics_caps_fops);
+	debugfs_create_file("metric_features", 0400, fsc->debugfs_metrics_dir,
+			    fsc, &metric_features_fops);
+	fsc->debugfs_subvolume_metrics =
+		debugfs_create_file("subvolumes", 0400,
+				    fsc->debugfs_metrics_dir, fsc,
+				    &subvolume_metrics_fops);
 	doutc(fsc->client, "done\n");
 }
 
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 99b30f784ee2..8f4425fde171 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -19,6 +19,25 @@
 #include "cache.h"
 #include "io.h"
 #include "metric.h"
+#include "subvolume_metrics.h"
+
+/*
+ * Record I/O for subvolume metrics tracking.
+ *
+ * Callers must ensure bytes > 0 for reads (ret > 0 check) to avoid counting
+ * EOF as an I/O operation. For writes, the condition is (ret >= 0 && len > 0).
+ */
+static inline void ceph_record_subvolume_io(struct inode *inode, bool is_write,
+					    ktime_t start, ktime_t end,
+					    size_t bytes)
+{
+	if (!bytes)
+		return;
+
+	ceph_subvolume_metrics_record_io(ceph_sb_to_mdsc(inode->i_sb),
+					 ceph_inode(inode),
+					 is_write, bytes, start, end);
+}
 
 static __le32 ceph_flags_sys2wire(struct ceph_mds_client *mdsc, u32 flags)
 {
@@ -1140,6 +1159,15 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
 					 req->r_start_latency,
 					 req->r_end_latency,
 					 read_len, ret);
+		/*
+		 * Only record subvolume metrics for actual bytes read.
+		 * ret == 0 means EOF (no data), not an I/O operation.
+		 */
+		if (ret > 0)
+			ceph_record_subvolume_io(inode, false,
+						 req->r_start_latency,
+						 req->r_end_latency,
+						 ret);
 
 		if (ret > 0)
 			objver = req->r_version;
@@ -1385,12 +1413,23 @@ static void ceph_aio_complete_req(struct ceph_osd_request *req)
 
 	/* r_start_latency == 0 means the request was not submitted */
 	if (req->r_start_latency) {
-		if (aio_req->write)
+		if (aio_req->write) {
 			ceph_update_write_metrics(metric, req->r_start_latency,
 						  req->r_end_latency, len, rc);
-		else
+			if (rc >= 0 && len)
+				ceph_record_subvolume_io(inode, true,
+							 req->r_start_latency,
+							 req->r_end_latency,
+							 len);
+		} else {
 			ceph_update_read_metrics(metric, req->r_start_latency,
 						 req->r_end_latency, len, rc);
+			if (rc > 0)
+				ceph_record_subvolume_io(inode, false,
+							 req->r_start_latency,
+							 req->r_end_latency,
+							 rc);
+		}
 	}
 
 	put_bvecs(osd_data->bvec_pos.bvecs, osd_data->num_bvecs,
@@ -1614,12 +1653,23 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 		ceph_osdc_start_request(req->r_osdc, req);
 		ret = ceph_osdc_wait_request(&fsc->client->osdc, req);
 
-		if (write)
+		if (write) {
 			ceph_update_write_metrics(metric, req->r_start_latency,
 						  req->r_end_latency, len, ret);
-		else
+			if (ret >= 0 && len)
+				ceph_record_subvolume_io(inode, true,
+							 req->r_start_latency,
+							 req->r_end_latency,
+							 len);
+		} else {
 			ceph_update_read_metrics(metric, req->r_start_latency,
 						 req->r_end_latency, len, ret);
+			if (ret > 0)
+				ceph_record_subvolume_io(inode, false,
+							 req->r_start_latency,
+							 req->r_end_latency,
+							 ret);
+		}
 
 		size = i_size_read(inode);
 		if (!write) {
@@ -1872,6 +1922,11 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
 						 req->r_start_latency,
 						 req->r_end_latency,
 						 read_len, ret);
+			if (ret > 0)
+				ceph_record_subvolume_io(inode, false,
+							 req->r_start_latency,
+							 req->r_end_latency,
+							 ret);
 
 			/* Ok if object is not already present */
 			if (ret == -ENOENT) {
@@ -2036,6 +2091,11 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
 
 		ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
 					  req->r_end_latency, len, ret);
+		if (ret >= 0 && write_len)
+			ceph_record_subvolume_io(inode, true,
+						 req->r_start_latency,
+						 req->r_end_latency,
+						 write_len);
 		ceph_osdc_put_request(req);
 		if (ret != 0) {
 			doutc(cl, "osd write returned %d\n", ret);
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 099b8f22683b..2b831f48c844 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -67,6 +67,22 @@ static void ceph_cap_reclaim_work(struct work_struct *work);
 
 static const struct ceph_connection_operations mds_con_ops;
 
+static void ceph_metric_bind_session(struct ceph_mds_client *mdsc,
+				     struct ceph_mds_session *session)
+{
+	struct ceph_mds_session *old;
+
+	if (!mdsc || !session || disable_send_metrics)
+		return;
+
+	old = mdsc->metric.session;
+	mdsc->metric.session = ceph_get_mds_session(session);
+	if (old)
+		ceph_put_mds_session(old);
+
+	metric_schedule_delayed(&mdsc->metric);
+}
+
 
 /*
  * mds reply parsing
@@ -95,21 +111,22 @@ static int parse_reply_info_quota(void **p, void *end,
 	return -EIO;
 }
 
-/*
- * parse individual inode info
- */
 static int parse_reply_info_in(void **p, void *end,
 			       struct ceph_mds_reply_info_in *info,
-			       u64 features)
+			       u64 features,
+			       struct ceph_mds_client *mdsc)
 {
 	int err = 0;
 	u8 struct_v = 0;
+	u8 struct_compat = 0;
+	u32 struct_len = 0;
+	struct ceph_client *cl = mdsc ? mdsc->fsc->client : NULL;
+
+	doutc(cl, "subv_metric parse start features=0x%llx\n", features);
 
 	info->subvolume_id = 0;
 
 	if (features == (u64)-1) {
-		u32 struct_len;
-		u8 struct_compat;
 		ceph_decode_8_safe(p, end, struct_v, bad);
 		ceph_decode_8_safe(p, end, struct_compat, bad);
 		/* struct_v is expected to be >= 1. we only understand
@@ -389,12 +407,13 @@ static int parse_reply_info_lease(void **p, void *end,
  */
 static int parse_reply_info_trace(void **p, void *end,
 				  struct ceph_mds_reply_info_parsed *info,
-				  u64 features)
+				  u64 features,
+				  struct ceph_mds_client *mdsc)
 {
 	int err;
 
 	if (info->head->is_dentry) {
-		err = parse_reply_info_in(p, end, &info->diri, features);
+		err = parse_reply_info_in(p, end, &info->diri, features, mdsc);
 		if (err < 0)
 			goto out_bad;
 
@@ -414,7 +433,8 @@ static int parse_reply_info_trace(void **p, void *end,
 	}
 
 	if (info->head->is_target) {
-		err = parse_reply_info_in(p, end, &info->targeti, features);
+		err = parse_reply_info_in(p, end, &info->targeti, features,
+					  mdsc);
 		if (err < 0)
 			goto out_bad;
 	}
@@ -435,7 +455,8 @@ static int parse_reply_info_trace(void **p, void *end,
  */
 static int parse_reply_info_readdir(void **p, void *end,
 				    struct ceph_mds_request *req,
-				    u64 features)
+				    u64 features,
+				    struct ceph_mds_client *mdsc)
 {
 	struct ceph_mds_reply_info_parsed *info = &req->r_reply_info;
 	struct ceph_client *cl = req->r_mdsc->fsc->client;
@@ -550,7 +571,7 @@ static int parse_reply_info_readdir(void **p, void *end,
 		rde->name_len = oname.len;
 
 		/* inode */
-		err = parse_reply_info_in(p, end, &rde->inode, features);
+		err = parse_reply_info_in(p, end, &rde->inode, features, mdsc);
 		if (err < 0)
 			goto out_bad;
 		/* ceph_readdir_prepopulate() will update it */
@@ -758,7 +779,8 @@ static int parse_reply_info_extra(void **p, void *end,
 	if (op == CEPH_MDS_OP_GETFILELOCK)
 		return parse_reply_info_filelock(p, end, info, features);
 	else if (op == CEPH_MDS_OP_READDIR || op == CEPH_MDS_OP_LSSNAP)
-		return parse_reply_info_readdir(p, end, req, features);
+		return parse_reply_info_readdir(p, end, req, features,
+						req->r_mdsc);
 	else if (op == CEPH_MDS_OP_CREATE)
 		return parse_reply_info_create(p, end, info, features, s);
 	else if (op == CEPH_MDS_OP_GETVXATTR)
@@ -787,7 +809,8 @@ static int parse_reply_info(struct ceph_mds_session *s, struct ceph_msg *msg,
 	ceph_decode_32_safe(&p, end, len, bad);
 	if (len > 0) {
 		ceph_decode_need(&p, end, len, bad);
-		err = parse_reply_info_trace(&p, p+len, info, features);
+		err = parse_reply_info_trace(&p, p + len, info, features,
+					     s->s_mdsc);
 		if (err < 0)
 			goto out_bad;
 	}
@@ -796,7 +819,7 @@ static int parse_reply_info(struct ceph_mds_session *s, struct ceph_msg *msg,
 	ceph_decode_32_safe(&p, end, len, bad);
 	if (len > 0) {
 		ceph_decode_need(&p, end, len, bad);
-		err = parse_reply_info_extra(&p, p+len, req, features, s);
+		err = parse_reply_info_extra(&p, p + len, req, features, s);
 		if (err < 0)
 			goto out_bad;
 	}
@@ -4326,6 +4349,11 @@ static void handle_session(struct ceph_mds_session *session,
 		}
 		mdsc->s_cap_auths_num = cap_auths_num;
 		mdsc->s_cap_auths = cap_auths;
+
+		session->s_features = features;
+		if (test_bit(CEPHFS_FEATURE_METRIC_COLLECT,
+			     &session->s_features))
+			ceph_metric_bind_session(mdsc, session);
 	}
 	if (op == CEPH_SESSION_CLOSE) {
 		ceph_get_mds_session(session);
@@ -4352,7 +4380,11 @@ static void handle_session(struct ceph_mds_session *session,
 			pr_info_client(cl, "mds%d reconnect success\n",
 				       session->s_mds);
 
-		session->s_features = features;
+		if (test_bit(CEPHFS_FEATURE_SUBVOLUME_METRICS,
+			     &session->s_features))
+			ceph_subvolume_metrics_enable(&mdsc->subvol_metrics, true);
+		else
+			ceph_subvolume_metrics_enable(&mdsc->subvol_metrics, false);
 		if (session->s_state == CEPH_MDS_SESSION_OPEN) {
 			pr_notice_client(cl, "mds%d is already opened\n",
 					 session->s_mds);
@@ -5591,6 +5623,12 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
 	err = ceph_metric_init(&mdsc->metric);
 	if (err)
 		goto err_mdsmap;
+	ceph_subvolume_metrics_init(&mdsc->subvol_metrics);
+	mutex_init(&mdsc->subvol_metrics_last_mutex);
+	mdsc->subvol_metrics_last = NULL;
+	mdsc->subvol_metrics_last_nr = 0;
+	mdsc->subvol_metrics_sent = 0;
+	mdsc->subvol_metrics_nonzero_sends = 0;
 
 	spin_lock_init(&mdsc->dentry_list_lock);
 	INIT_LIST_HEAD(&mdsc->dentry_leases);
@@ -6123,6 +6161,8 @@ void ceph_mdsc_destroy(struct ceph_fs_client *fsc)
 	ceph_mdsc_stop(mdsc);
 
 	ceph_metric_destroy(&mdsc->metric);
+	ceph_subvolume_metrics_destroy(&mdsc->subvol_metrics);
+	kfree(mdsc->subvol_metrics_last);
 
 	fsc->mdsc = NULL;
 	kfree(mdsc);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index bd3690baa65c..4e6c87f8414c 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -18,6 +18,7 @@
 
 #include "mdsmap.h"
 #include "metric.h"
+#include "subvolume_metrics.h"
 #include "super.h"
 
 /* The first 8 bits are reserved for old ceph releases */
@@ -36,8 +37,9 @@ enum ceph_feature_type {
 	CEPHFS_FEATURE_NEW_SNAPREALM_INFO,
 	CEPHFS_FEATURE_HAS_OWNER_UIDGID,
 	CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK,
+	CEPHFS_FEATURE_SUBVOLUME_METRICS,
 
-	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK,
+	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_SUBVOLUME_METRICS,
 };
 
 #define CEPHFS_FEATURES_CLIENT_SUPPORTED {	\
@@ -54,6 +56,7 @@ enum ceph_feature_type {
 	CEPHFS_FEATURE_32BITS_RETRY_FWD,	\
 	CEPHFS_FEATURE_HAS_OWNER_UIDGID,	\
 	CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK,	\
+	CEPHFS_FEATURE_SUBVOLUME_METRICS,	\
 }
 
 /*
@@ -537,6 +540,14 @@ struct ceph_mds_client {
 	struct list_head  dentry_dir_leases; /* lru list */
 
 	struct ceph_client_metric metric;
+	struct ceph_subvolume_metrics_tracker subvol_metrics;
+
+	/* Subvolume metrics send tracking */
+	struct mutex		subvol_metrics_last_mutex;
+	struct ceph_subvol_metric_snapshot *subvol_metrics_last;
+	u32			subvol_metrics_last_nr;
+	u64			subvol_metrics_sent;
+	u64			subvol_metrics_nonzero_sends;
 
 	spinlock_t		snapid_map_lock;
 	struct rb_root		snapid_map_tree;
diff --git a/fs/ceph/metric.c b/fs/ceph/metric.c
index 871c1090e520..9bb357abc897 100644
--- a/fs/ceph/metric.c
+++ b/fs/ceph/metric.c
@@ -4,10 +4,84 @@
 #include <linux/types.h>
 #include <linux/percpu_counter.h>
 #include <linux/math64.h>
+#include <linux/ratelimit.h>
+
+#include <linux/ceph/decode.h>
 
 #include "metric.h"
 #include "mds_client.h"
 
+static bool metrics_disable_warned;
+
+static inline u32 ceph_subvolume_entry_payload_len(void)
+{
+	return sizeof(struct ceph_subvolume_metric_entry_wire);
+}
+
+static inline u32 ceph_subvolume_entry_encoded_len(void)
+{
+	return CEPH_ENCODING_START_BLK_LEN +
+		ceph_subvolume_entry_payload_len();
+}
+
+static inline u32 ceph_subvolume_outer_payload_len(u32 nr_subvols)
+{
+	/* count is encoded as le64 (size_t on wire) to match FUSE client */
+	return sizeof(__le64) +
+		nr_subvols * ceph_subvolume_entry_encoded_len();
+}
+
+static inline u32 ceph_subvolume_metric_data_len(u32 nr_subvols)
+{
+	return CEPH_ENCODING_START_BLK_LEN +
+		ceph_subvolume_outer_payload_len(nr_subvols);
+}
+
+static inline u32 ceph_subvolume_clamp_u32(u64 val)
+{
+	return val > U32_MAX ? U32_MAX : (u32)val;
+}
+
+static void ceph_init_subvolume_wire_entry(
+	struct ceph_subvolume_metric_entry_wire *dst,
+	const struct ceph_subvol_metric_snapshot *src)
+{
+	dst->subvolume_id = cpu_to_le64(src->subvolume_id);
+	dst->read_ops = cpu_to_le32(ceph_subvolume_clamp_u32(src->read_ops));
+	dst->write_ops = cpu_to_le32(ceph_subvolume_clamp_u32(src->write_ops));
+	dst->read_bytes = cpu_to_le64(src->read_bytes);
+	dst->write_bytes = cpu_to_le64(src->write_bytes);
+	dst->read_latency_us = cpu_to_le64(src->read_latency_us);
+	dst->write_latency_us = cpu_to_le64(src->write_latency_us);
+	dst->time_stamp = 0;
+}
+
+static int ceph_encode_subvolume_metrics(void **p, void *end,
+					 struct ceph_subvol_metric_snapshot *subvols,
+					 u32 nr_subvols)
+{
+	u32 i;
+
+	ceph_start_encoding(p, 1, 1,
+			    ceph_subvolume_outer_payload_len(nr_subvols));
+	/* count is encoded as le64 (size_t on wire) to match FUSE client */
+	ceph_encode_64_safe(p, end, (u64)nr_subvols, enc_err);
+
+	for (i = 0; i < nr_subvols; i++) {
+		struct ceph_subvolume_metric_entry_wire wire_entry;
+
+		ceph_init_subvolume_wire_entry(&wire_entry, &subvols[i]);
+		ceph_start_encoding(p, 1, 1,
+				    ceph_subvolume_entry_payload_len());
+		ceph_encode_copy_safe(p, end, &wire_entry,
+				      sizeof(wire_entry), enc_err);
+	}
+
+	return 0;
+enc_err:
+	return -ERANGE;
+}
+
 static void ktime_to_ceph_timespec(struct ceph_timespec *ts, ktime_t val)
 {
 	struct timespec64 t = ktime_to_timespec64(val);
@@ -29,10 +103,14 @@ static bool ceph_mdsc_send_metrics(struct ceph_mds_client *mdsc,
 	struct ceph_read_io_size *rsize;
 	struct ceph_write_io_size *wsize;
 	struct ceph_client_metric *m = &mdsc->metric;
+	struct ceph_subvol_metric_snapshot *subvols = NULL;
 	u64 nr_caps = atomic64_read(&m->total_caps);
 	u32 header_len = sizeof(struct ceph_metric_header);
 	struct ceph_client *cl = mdsc->fsc->client;
 	struct ceph_msg *msg;
+	u32 nr_subvols = 0;
+	size_t subvol_len = 0;
+	void *cursor;
 	s64 sum;
 	s32 items = 0;
 	s32 len;
@@ -45,15 +123,37 @@ static bool ceph_mdsc_send_metrics(struct ceph_mds_client *mdsc,
 	}
 	mutex_unlock(&mdsc->mutex);
 
+	if (ceph_subvolume_metrics_enabled(&mdsc->subvol_metrics) &&
+	    test_bit(CEPHFS_FEATURE_SUBVOLUME_METRICS, &s->s_features)) {
+		int ret;
+
+		ret = ceph_subvolume_metrics_snapshot(&mdsc->subvol_metrics,
+						      &subvols, &nr_subvols,
+						      true);
+		if (ret) {
+			pr_warn_client(cl, "failed to snapshot subvolume metrics: %d\n",
+				       ret);
+			nr_subvols = 0;
+			subvols = NULL;
+		}
+	}
+
+	if (nr_subvols) {
+		/* type (le32) + ENCODE_START payload - no metric header */
+		subvol_len = sizeof(__le32) +
+			     ceph_subvolume_metric_data_len(nr_subvols);
+	}
+
 	len = sizeof(*head) + sizeof(*cap) + sizeof(*read) + sizeof(*write)
 	      + sizeof(*meta) + sizeof(*dlease) + sizeof(*files)
 	      + sizeof(*icaps) + sizeof(*inodes) + sizeof(*rsize)
-	      + sizeof(*wsize);
+	      + sizeof(*wsize) + subvol_len;
 
 	msg = ceph_msg_new(CEPH_MSG_CLIENT_METRICS, len, GFP_NOFS, true);
 	if (!msg) {
 		pr_err_client(cl, "to mds%d, failed to allocate message\n",
 			      s->s_mds);
+		kfree(subvols);
 		return false;
 	}
 
@@ -172,13 +272,56 @@ static bool ceph_mdsc_send_metrics(struct ceph_mds_client *mdsc,
 	wsize->total_size = cpu_to_le64(m->metric[METRIC_WRITE].size_sum);
 	items++;
 
+	cursor = wsize + 1;
+
+	if (nr_subvols) {
+		void *payload;
+		void *payload_end;
+		int ret;
+
+		/* Emit only the type (le32), no ver/compat/data_len */
+		ceph_encode_32(&cursor, CLIENT_METRIC_TYPE_SUBVOLUME_METRICS);
+		items++;
+
+		payload = cursor;
+		payload_end = (char *)payload +
+			      ceph_subvolume_metric_data_len(nr_subvols);
+
+		ret = ceph_encode_subvolume_metrics(&payload, payload_end,
+						    subvols, nr_subvols);
+		if (ret) {
+			pr_warn_client(cl,
+				       "failed to encode subvolume metrics\n");
+			kfree(subvols);
+			ceph_msg_put(msg);
+			return false;
+		}
+
+		WARN_ON(payload != payload_end);
+		cursor = payload;
+	}
+
 	put_unaligned_le32(items, &head->num);
-	msg->front.iov_len = len;
+	msg->front.iov_len = (char *)cursor - (char *)head;
 	msg->hdr.version = cpu_to_le16(1);
 	msg->hdr.compat_version = cpu_to_le16(1);
 	msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);
+
 	ceph_con_send(&s->s_con, msg);
 
+	if (nr_subvols) {
+		mutex_lock(&mdsc->subvol_metrics_last_mutex);
+		kfree(mdsc->subvol_metrics_last);
+		mdsc->subvol_metrics_last = subvols;
+		mdsc->subvol_metrics_last_nr = nr_subvols;
+		mdsc->subvol_metrics_sent += nr_subvols;
+		mdsc->subvol_metrics_nonzero_sends++;
+		mutex_unlock(&mdsc->subvol_metrics_last_mutex);
+
+		subvols = NULL;
+	}
+	kfree(subvols);
+
 	return true;
 }
 
@@ -201,6 +344,12 @@ static void metric_get_session(struct ceph_mds_client *mdsc)
 		 */
 		if (check_session_state(s) &&
 		    test_bit(CEPHFS_FEATURE_METRIC_COLLECT, &s->s_features)) {
+			if (ceph_subvolume_metrics_enabled(&mdsc->subvol_metrics) &&
+			    !test_bit(CEPHFS_FEATURE_SUBVOLUME_METRICS,
+				      &s->s_features)) {
+				ceph_put_mds_session(s);
+				continue;
+			}
 			mdsc->metric.session = s;
 			break;
 		}
@@ -217,8 +366,17 @@ static void metric_delayed_work(struct work_struct *work)
 	struct ceph_mds_client *mdsc =
 		container_of(m, struct ceph_mds_client, metric);
 
-	if (mdsc->stopping || disable_send_metrics)
+	if (mdsc->stopping)
+		return;
+
+	if (disable_send_metrics) {
+		if (!metrics_disable_warned) {
+			pr_info("ceph: metrics sending disabled via module parameter\n");
+			metrics_disable_warned = true;
+		}
 		return;
+	}
+	metrics_disable_warned = false;
 
 	if (!m->session || !check_session_state(m->session)) {
 		if (m->session) {
@@ -227,10 +385,13 @@ static void metric_delayed_work(struct work_struct *work)
 		}
 		metric_get_session(mdsc);
 	}
-	if (m->session) {
+
+	if (m->session)
 		ceph_mdsc_send_metrics(mdsc, m->session);
-		metric_schedule_delayed(m);
-	}
+	else
+		pr_warn_ratelimited("ceph: metrics worker has no MDS session\n");
+
+	metric_schedule_delayed(m);
 }
 
 int ceph_metric_init(struct ceph_client_metric *m)
diff --git a/fs/ceph/metric.h b/fs/ceph/metric.h
index 0d0c44bd3332..7e4aac63f6a6 100644
--- a/fs/ceph/metric.h
+++ b/fs/ceph/metric.h
@@ -25,8 +25,9 @@ enum ceph_metric_type {
 	CLIENT_METRIC_TYPE_STDEV_WRITE_LATENCY,
 	CLIENT_METRIC_TYPE_AVG_METADATA_LATENCY,
 	CLIENT_METRIC_TYPE_STDEV_METADATA_LATENCY,
+	CLIENT_METRIC_TYPE_SUBVOLUME_METRICS,
 
-	CLIENT_METRIC_TYPE_MAX = CLIENT_METRIC_TYPE_STDEV_METADATA_LATENCY,
+	CLIENT_METRIC_TYPE_MAX = CLIENT_METRIC_TYPE_SUBVOLUME_METRICS,
 };
 
 /*
@@ -50,6 +51,7 @@ enum ceph_metric_type {
 	CLIENT_METRIC_TYPE_STDEV_WRITE_LATENCY,	   \
 	CLIENT_METRIC_TYPE_AVG_METADATA_LATENCY,   \
 	CLIENT_METRIC_TYPE_STDEV_METADATA_LATENCY, \
+	CLIENT_METRIC_TYPE_SUBVOLUME_METRICS,	   \
 						   \
 	CLIENT_METRIC_TYPE_MAX,			   \
 }
@@ -139,6 +141,29 @@ struct ceph_write_io_size {
 	__le64 total_size;
 } __packed;
 
+/* Wire format for subvolume metrics - matches C++ AggregatedIOMetrics */
+struct ceph_subvolume_metric_entry_wire {
+	__le64 subvolume_id;
+	__le32 read_ops;
+	__le32 write_ops;
+	__le64 read_bytes;
+	__le64 write_bytes;
+	__le64 read_latency_us;
+	__le64 write_latency_us;
+	__le64 time_stamp;
+} __packed;
+
+/* Old struct kept for internal tracking, not used on wire */
+struct ceph_subvolume_metric_entry {
+	__le64 subvolume_id;
+	__le64 read_ops;
+	__le64 write_ops;
+	__le64 read_bytes;
+	__le64 write_bytes;
+	__le64 read_latency_us;
+	__le64 write_latency_us;
+} __packed;
+
 struct ceph_metric_head {
 	__le32 num;	/* the number of metrics that will be sent */
 } __packed;
diff --git a/fs/ceph/subvolume_metrics.c b/fs/ceph/subvolume_metrics.c
new file mode 100644
index 000000000000..111f6754e609
--- /dev/null
+++ b/fs/ceph/subvolume_metrics.c
@@ -0,0 +1,431 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/ceph/ceph_debug.h>
+
+#include <linux/math64.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+
+#include "subvolume_metrics.h"
+#include "mds_client.h"
+#include "super.h"
+
+struct ceph_subvol_metric_rb_entry {
+	struct rb_node node;
+	u64 subvolume_id;
+	u64 read_ops;
+	u64 write_ops;
+	u64 read_bytes;
+	u64 write_bytes;
+	u64 read_latency_us;
+	u64 write_latency_us;
+};
+
+static struct kmem_cache *ceph_subvol_metric_entry_cachep;
+
+void ceph_subvolume_metrics_init(struct ceph_subvolume_metrics_tracker *tracker)
+{
+	spin_lock_init(&tracker->lock);
+	tracker->tree = RB_ROOT_CACHED;
+	tracker->nr_entries = 0;
+	tracker->enabled = false;
+	atomic64_set(&tracker->snapshot_attempts, 0);
+	atomic64_set(&tracker->snapshot_empty, 0);
+	atomic64_set(&tracker->snapshot_failures, 0);
+	atomic64_set(&tracker->record_calls, 0);
+	atomic64_set(&tracker->record_disabled, 0);
+	atomic64_set(&tracker->record_no_subvol, 0);
+	atomic64_set(&tracker->total_read_ops, 0);
+	atomic64_set(&tracker->total_read_bytes, 0);
+	atomic64_set(&tracker->total_write_ops, 0);
+	atomic64_set(&tracker->total_write_bytes, 0);
+}
+
+static struct ceph_subvol_metric_rb_entry *
+__lookup_entry(struct ceph_subvolume_metrics_tracker *tracker, u64 subvol_id)
+{
+	struct rb_node *node;
+
+	node = tracker->tree.rb_root.rb_node;
+	while (node) {
+		struct ceph_subvol_metric_rb_entry *entry =
+			rb_entry(node, struct ceph_subvol_metric_rb_entry, node);
+
+		if (subvol_id < entry->subvolume_id)
+			node = node->rb_left;
+		else if (subvol_id > entry->subvolume_id)
+			node = node->rb_right;
+		else
+			return entry;
+	}
+
+	return NULL;
+}
+
+static struct ceph_subvol_metric_rb_entry *
+__insert_entry(struct ceph_subvolume_metrics_tracker *tracker,
+	       struct ceph_subvol_metric_rb_entry *entry)
+{
+	struct rb_node **link = &tracker->tree.rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	bool leftmost = true;
+
+	while (*link) {
+		struct ceph_subvol_metric_rb_entry *cur =
+			rb_entry(*link, struct ceph_subvol_metric_rb_entry, node);
+
+		parent = *link;
+		if (entry->subvolume_id < cur->subvolume_id)
+			link = &(*link)->rb_left;
+		else if (entry->subvolume_id > cur->subvolume_id) {
+			link = &(*link)->rb_right;
+			leftmost = false;
+		} else
+			return cur;
+	}
+
+	rb_link_node(&entry->node, parent, link);
+	rb_insert_color_cached(&entry->node, &tracker->tree, leftmost);
+	tracker->nr_entries++;
+	return entry;
+}
+
+static void ceph_subvolume_metrics_clear_locked(
+		struct ceph_subvolume_metrics_tracker *tracker)
+{
+	struct rb_node *node = rb_first_cached(&tracker->tree);
+
+	while (node) {
+		struct ceph_subvol_metric_rb_entry *entry =
+			rb_entry(node, struct ceph_subvol_metric_rb_entry, node);
+		struct rb_node *next = rb_next(node);
+
+		rb_erase_cached(&entry->node, &tracker->tree);
+		tracker->nr_entries--;
+		kmem_cache_free(ceph_subvol_metric_entry_cachep, entry);
+		node = next;
+	}
+
+	tracker->tree = RB_ROOT_CACHED;
+}
+
+void ceph_subvolume_metrics_destroy(struct ceph_subvolume_metrics_tracker *tracker)
+{
+	spin_lock(&tracker->lock);
+	ceph_subvolume_metrics_clear_locked(tracker);
+	tracker->enabled = false;
+	spin_unlock(&tracker->lock);
+}
+
+void ceph_subvolume_metrics_enable(struct ceph_subvolume_metrics_tracker *tracker,
+				   bool enable)
+{
+	spin_lock(&tracker->lock);
+	if (enable) {
+		tracker->enabled = true;
+	} else {
+		tracker->enabled = false;
+		ceph_subvolume_metrics_clear_locked(tracker);
+	}
+	spin_unlock(&tracker->lock);
+}
+
+void ceph_subvolume_metrics_record(struct ceph_subvolume_metrics_tracker *tracker,
+				   u64 subvol_id, bool is_write,
+				   size_t size, u64 latency_us)
+{
+	struct ceph_subvol_metric_rb_entry *entry, *new_entry = NULL;
+	bool retry = false;
+
+	/* 0 means unknown/unset subvolume (matches FUSE client convention) */
+	if (!READ_ONCE(tracker->enabled) || !subvol_id || !size || !latency_us)
+		return;
+
+	do {
+		spin_lock(&tracker->lock);
+		if (!tracker->enabled) {
+			spin_unlock(&tracker->lock);
+			if (new_entry)
+				kmem_cache_free(ceph_subvol_metric_entry_cachep,
+						new_entry);
+			return;
+		}
+
+		entry = __lookup_entry(tracker, subvol_id);
+		if (!entry) {
+			if (!new_entry) {
+				/* drop the lock to allocate, then retry */
+				spin_unlock(&tracker->lock);
+				new_entry = kmem_cache_zalloc(ceph_subvol_metric_entry_cachep,
+						      GFP_NOFS);
+				if (!new_entry)
+					return;
+				new_entry->subvolume_id = subvol_id;
+				retry = true;
+				continue;
+			}
+			/*
+			 * Lock held since the failed lookup; cannot race.
+			 */
+			entry = __insert_entry(tracker, new_entry);
+			new_entry = NULL;
+		}
+
+		if (is_write) {
+			entry->write_ops++;
+			entry->write_bytes += size;
+			entry->write_latency_us += latency_us;
+			atomic64_inc(&tracker->total_write_ops);
+			atomic64_add(size, &tracker->total_write_bytes);
+		} else {
+			entry->read_ops++;
+			entry->read_bytes += size;
+			entry->read_latency_us += latency_us;
+			atomic64_inc(&tracker->total_read_ops);
+			atomic64_add(size, &tracker->total_read_bytes);
+		}
+		spin_unlock(&tracker->lock);
+		if (new_entry)
+			kmem_cache_free(ceph_subvol_metric_entry_cachep,
+					new_entry);
+		return;
+	} while (retry);
+}
+
+int ceph_subvolume_metrics_snapshot(struct ceph_subvolume_metrics_tracker *tracker,
+				    struct ceph_subvol_metric_snapshot **out,
+				    u32 *nr, bool consume)
+{
+	struct ceph_subvol_metric_snapshot *snap = NULL;
+	struct rb_node *node;
+	u32 count = 0, idx = 0;
+	int ret = 0;
+
+	*out = NULL;
+	*nr = 0;
+
+	if (!READ_ONCE(tracker->enabled))
+		return 0;
+
+	atomic64_inc(&tracker->snapshot_attempts);
+
+	spin_lock(&tracker->lock);
+	for (node = rb_first_cached(&tracker->tree); node; node = rb_next(node)) {
+		struct ceph_subvol_metric_rb_entry *entry =
+			rb_entry(node, struct ceph_subvol_metric_rb_entry, node);
+
+		/* Include entries with ANY I/O activity (read OR write) */
+		if (entry->read_ops || entry->write_ops)
+			count++;
+	}
+	spin_unlock(&tracker->lock);
+
+	if (!count) {
+		atomic64_inc(&tracker->snapshot_empty);
+		return 0;
+	}
+
+	snap = kcalloc(count, sizeof(*snap), GFP_NOFS);
+	if (!snap) {
+		atomic64_inc(&tracker->snapshot_failures);
+		return -ENOMEM;
+	}
+
+	spin_lock(&tracker->lock);
+	node = rb_first_cached(&tracker->tree);
+	while (node) {
+		struct ceph_subvol_metric_rb_entry *entry =
+			rb_entry(node, struct ceph_subvol_metric_rb_entry, node);
+		struct rb_node *next = rb_next(node);
+
+		/* Skip entries with NO I/O activity at all */
+		if (!entry->read_ops && !entry->write_ops) {
+			rb_erase_cached(&entry->node, &tracker->tree);
+			tracker->nr_entries--;
+			kmem_cache_free(ceph_subvol_metric_entry_cachep, entry);
+			node = next;
+			continue;
+		}
+
+		if (idx >= count) {
+			pr_warn("ceph: subvol metrics snapshot race (idx=%u count=%u)\n",
+				idx, count);
+			break;
+		}
+
+		snap[idx].subvolume_id = entry->subvolume_id;
+		snap[idx].read_ops = entry->read_ops;
+		snap[idx].write_ops = entry->write_ops;
+		snap[idx].read_bytes = entry->read_bytes;
+		snap[idx].write_bytes = entry->write_bytes;
+		snap[idx].read_latency_us = entry->read_latency_us;
+		snap[idx].write_latency_us = entry->write_latency_us;
+		idx++;
+
+		if (consume) {
+			entry->read_ops = 0;
+			entry->write_ops = 0;
+			entry->read_bytes = 0;
+			entry->write_bytes = 0;
+			entry->read_latency_us = 0;
+			entry->write_latency_us = 0;
+			rb_erase_cached(&entry->node, &tracker->tree);
+			tracker->nr_entries--;
+			kmem_cache_free(ceph_subvol_metric_entry_cachep, entry);
+		}
+		node = next;
+	}
+	spin_unlock(&tracker->lock);
+
+	if (!idx) {
+		kfree(snap);
+		snap = NULL;
+		ret = 0;
+	} else {
+		*nr = idx;
+		*out = snap;
+	}
+
+	return ret;
+}
+
+void ceph_subvolume_metrics_free_snapshot(struct ceph_subvol_metric_snapshot *snapshot)
+{
+	kfree(snapshot);
+}
+
+static u64 div_rem(u64 dividend, u64 divisor)
+{
+	return divisor ? div64_u64(dividend, divisor) : 0;
+}
+
+/*
+ * Dump subvolume metrics to a seq_file for debugfs.
+ * This function does not return an error code because the seq_file API
+ * handles errors internally - any failures are tracked in the seq_file
+ * structure and reported to userspace when the file is read.
+ */
+void ceph_subvolume_metrics_dump(struct ceph_subvolume_metrics_tracker *tracker,
+				 struct seq_file *s)
+{
+	struct rb_node *node;
+	struct ceph_subvol_metric_snapshot *snapshot = NULL;
+	u32 count = 0, idx = 0;
+
+	spin_lock(&tracker->lock);
+	if (!tracker->enabled) {
+		spin_unlock(&tracker->lock);
+		seq_puts(s, "subvolume metrics disabled\n");
+		return;
+	}
+
+	for (node = rb_first_cached(&tracker->tree); node; node = rb_next(node)) {
+		struct ceph_subvol_metric_rb_entry *entry =
+			rb_entry(node, struct ceph_subvol_metric_rb_entry, node);
+
+		if (entry->read_ops || entry->write_ops)
+			count++;
+	}
+	spin_unlock(&tracker->lock);
+
+	if (!count) {
+		seq_puts(s, "(no subvolume metrics collected)\n");
+		return;
+	}
+
+	snapshot = kcalloc(count, sizeof(*snapshot), GFP_KERNEL);
+	if (!snapshot) {
+		seq_puts(s, "(unable to allocate memory for snapshot)\n");
+		return;
+	}
+
+	spin_lock(&tracker->lock);
+	for (node = rb_first_cached(&tracker->tree); node; node = rb_next(node)) {
+		struct ceph_subvol_metric_rb_entry *entry =
+			rb_entry(node, struct ceph_subvol_metric_rb_entry, node);
+
+		if (!entry->read_ops && !entry->write_ops)
+			continue;
+
+		if (idx >= count)
+			break;
+
+		snapshot[idx].subvolume_id = entry->subvolume_id;
+		snapshot[idx].read_ops = entry->read_ops;
+		snapshot[idx].write_ops = entry->write_ops;
+		snapshot[idx].read_bytes = entry->read_bytes;
+		snapshot[idx].write_bytes = entry->write_bytes;
+		snapshot[idx].read_latency_us = entry->read_latency_us;
+		snapshot[idx].write_latency_us = entry->write_latency_us;
+		idx++;
+	}
+	spin_unlock(&tracker->lock);
+
+	seq_puts(s, "subvol_id       rd_ops    rd_bytes    rd_avg_lat_us  wr_ops    wr_bytes    wr_avg_lat_us\n");
+	seq_puts(s, "------------------------------------------------------------------------------------------------\n");
+
+	/* only the first idx slots were filled in the second pass */
+	for (count = 0; count < idx; count++) {
+		u64 avg_rd_lat = div_rem(snapshot[count].read_latency_us,
+					 snapshot[count].read_ops);
+		u64 avg_wr_lat = div_rem(snapshot[count].write_latency_us,
+					 snapshot[count].write_ops);
+		seq_printf(s, "%-15llu%-10llu%-12llu%-16llu%-10llu%-12llu%-16llu\n",
+			   snapshot[count].subvolume_id,
+			   snapshot[count].read_ops,
+			   snapshot[count].read_bytes,
+			   avg_rd_lat,
+			   snapshot[count].write_ops,
+			   snapshot[count].write_bytes,
+			   avg_wr_lat);
+	}
+
+	kfree(snapshot);
+}
+
+void ceph_subvolume_metrics_record_io(struct ceph_mds_client *mdsc,
+				      struct ceph_inode_info *ci,
+				      bool is_write, size_t bytes,
+				      ktime_t start, ktime_t end)
+{
+	struct ceph_subvolume_metrics_tracker *tracker;
+	u64 subvol_id;
+	s64 delta_us;
+
+	if (!mdsc || !ci || !bytes)
+		return;
+
+	tracker = &mdsc->subvol_metrics;
+	atomic64_inc(&tracker->record_calls);
+
+	if (!ceph_subvolume_metrics_enabled(tracker)) {
+		atomic64_inc(&tracker->record_disabled);
+		return;
+	}
+
+	subvol_id = READ_ONCE(ci->i_subvolume_id);
+	if (!subvol_id) {
+		atomic64_inc(&tracker->record_no_subvol);
+		return;
+	}
+
+	delta_us = ktime_to_us(ktime_sub(end, start));
+	if (delta_us <= 0)
+		delta_us = 1;
+
+	ceph_subvolume_metrics_record(tracker, subvol_id, is_write,
+				      bytes, (u64)delta_us);
+}
+
+int __init ceph_subvolume_metrics_cache_init(void)
+{
+	ceph_subvol_metric_entry_cachep = KMEM_CACHE(ceph_subvol_metric_rb_entry,
+						    SLAB_RECLAIM_ACCOUNT);
+	if (!ceph_subvol_metric_entry_cachep)
+		return -ENOMEM;
+	return 0;
+}
+
+void ceph_subvolume_metrics_cache_destroy(void)
+{
+	kmem_cache_destroy(ceph_subvol_metric_entry_cachep);
+}
diff --git a/fs/ceph/subvolume_metrics.h b/fs/ceph/subvolume_metrics.h
new file mode 100644
index 000000000000..6f53ff726c75
--- /dev/null
+++ b/fs/ceph/subvolume_metrics.h
@@ -0,0 +1,97 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _FS_CEPH_SUBVOLUME_METRICS_H
+#define _FS_CEPH_SUBVOLUME_METRICS_H
+
+#include <linux/types.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/ktime.h>
+#include <linux/atomic.h>
+
+struct seq_file;
+struct ceph_mds_client;
+struct ceph_inode_info;
+
+/**
+ * struct ceph_subvol_metric_snapshot - Point-in-time snapshot of subvolume metrics
+ * @subvolume_id: Subvolume identifier (inode number of subvolume root)
+ * @read_ops: Number of read operations since last snapshot
+ * @write_ops: Number of write operations since last snapshot
+ * @read_bytes: Total bytes read since last snapshot
+ * @write_bytes: Total bytes written since last snapshot
+ * @read_latency_us: Sum of read latencies in microseconds (for avg calculation)
+ * @write_latency_us: Sum of write latencies in microseconds (for avg calculation)
+ */
+struct ceph_subvol_metric_snapshot {
+	u64 subvolume_id;
+	u64 read_ops;
+	u64 write_ops;
+	u64 read_bytes;
+	u64 write_bytes;
+	u64 read_latency_us;
+	u64 write_latency_us;
+};
+
+/**
+ * struct ceph_subvolume_metrics_tracker - Tracks per-subvolume I/O metrics
+ * @lock: Protects @tree and @nr_entries during concurrent access
+ * @tree: Red-black tree of per-subvolume entries, keyed by subvolume_id
+ * @nr_entries: Number of entries currently in @tree
+ * @enabled: Whether collection is enabled (requires MDS feature support)
+ * @snapshot_attempts: Debug counter: total ceph_subvolume_metrics_snapshot() calls
+ * @snapshot_empty: Debug counter: snapshots that found no data to report
+ * @snapshot_failures: Debug counter: snapshots that failed to allocate memory
+ * @record_calls: Debug counter: total ceph_subvolume_metrics_record() calls
+ * @record_disabled: Debug counter: record calls skipped because disabled
+ * @record_no_subvol: Debug counter: record calls skipped (no subvolume_id)
+ * @total_read_ops: Cumulative read ops across all snapshots (never reset)
+ * @total_read_bytes: Cumulative bytes read across all snapshots (never reset)
+ * @total_write_ops: Cumulative write ops across all snapshots (never reset)
+ * @total_write_bytes: Cumulative bytes written across all snapshots (never reset)
+ */
+struct ceph_subvolume_metrics_tracker {
+	spinlock_t lock;
+	struct rb_root_cached tree;
+	u32 nr_entries;
+	bool enabled;
+	atomic64_t snapshot_attempts;
+	atomic64_t snapshot_empty;
+	atomic64_t snapshot_failures;
+	atomic64_t record_calls;
+	atomic64_t record_disabled;
+	atomic64_t record_no_subvol;
+	atomic64_t total_read_ops;
+	atomic64_t total_read_bytes;
+	atomic64_t total_write_ops;
+	atomic64_t total_write_bytes;
+};
+
+void ceph_subvolume_metrics_init(struct ceph_subvolume_metrics_tracker *tracker);
+void ceph_subvolume_metrics_destroy(struct ceph_subvolume_metrics_tracker *tracker);
+void ceph_subvolume_metrics_enable(struct ceph_subvolume_metrics_tracker *tracker,
+				   bool enable);
+void ceph_subvolume_metrics_record(struct ceph_subvolume_metrics_tracker *tracker,
+				   u64 subvol_id, bool is_write,
+				   size_t size, u64 latency_us);
+int ceph_subvolume_metrics_snapshot(struct ceph_subvolume_metrics_tracker *tracker,
+				    struct ceph_subvol_metric_snapshot **out,
+				    u32 *nr, bool consume);
+void ceph_subvolume_metrics_free_snapshot(struct ceph_subvol_metric_snapshot *snapshot);
+void ceph_subvolume_metrics_dump(struct ceph_subvolume_metrics_tracker *tracker,
+				 struct seq_file *s);
+
+void ceph_subvolume_metrics_record_io(struct ceph_mds_client *mdsc,
+				      struct ceph_inode_info *ci,
+				      bool is_write, size_t bytes,
+				      ktime_t start, ktime_t end);
+
+static inline bool ceph_subvolume_metrics_enabled(
+		const struct ceph_subvolume_metrics_tracker *tracker)
+{
+	return READ_ONCE(tracker->enabled);
+}
+
+int __init ceph_subvolume_metrics_cache_init(void);
+void ceph_subvolume_metrics_cache_destroy(void);
+
+#endif /* _FS_CEPH_SUBVOLUME_METRICS_H */
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index f6bf24b5c683..a60f99b5c68a 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -21,6 +21,7 @@
 #include "mds_client.h"
 #include "cache.h"
 #include "crypto.h"
+#include "subvolume_metrics.h"
 
 #include <linux/ceph/ceph_features.h>
 #include <linux/ceph/decode.h>
@@ -963,8 +964,14 @@ static int __init init_caches(void)
 	if (!ceph_wb_pagevec_pool)
 		goto bad_pagevec_pool;
 
+	error = ceph_subvolume_metrics_cache_init();
+	if (error)
+		goto bad_subvol_metrics;
+
 	return 0;
 
+bad_subvol_metrics:
+	mempool_destroy(ceph_wb_pagevec_pool);
 bad_pagevec_pool:
 	kmem_cache_destroy(ceph_mds_request_cachep);
 bad_mds_req:
@@ -1001,6 +1008,7 @@ static void destroy_caches(void)
 	kmem_cache_destroy(ceph_dir_file_cachep);
 	kmem_cache_destroy(ceph_mds_request_cachep);
 	mempool_destroy(ceph_wb_pagevec_pool);
+	ceph_subvolume_metrics_cache_destroy();
 }
 
 static void __ceph_umount_begin(struct ceph_fs_client *fsc)
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index c0372a725960..a03c373efd52 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -167,6 +167,7 @@ struct ceph_fs_client {
 	struct dentry *debugfs_status;
 	struct dentry *debugfs_mds_sessions;
 	struct dentry *debugfs_metrics_dir;
+	struct dentry *debugfs_subvolume_metrics;
 #endif
 
 #ifdef CONFIG_CEPH_FSCACHE
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 4/4] ceph: add CEPH_SUBVOLUME_ID_NONE
  2025-12-03 15:46 [PATCH v3 0/4] ceph: add subvolume metrics reporting support Alex Markuze
                   ` (2 preceding siblings ...)
  2025-12-03 15:46 ` [PATCH v3 3/4] ceph: add subvolume metrics collection and reporting Alex Markuze
@ 2025-12-03 15:46 ` Alex Markuze
  2025-12-03 20:15   ` Viacheslav Dubeyko
  2025-12-12  0:14   ` kernel test robot
  3 siblings, 2 replies; 12+ messages in thread
From: Alex Markuze @ 2025-12-03 15:46 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, linux-fsdevel, amarkuze, vdubeyko

1. Introduce the CEPH_SUBVOLUME_ID_NONE constant (value 0) to make the
   unknown/unset state explicit and self-documenting.

2. Add a WARN_ON_ONCE when an already-set subvolume_id would change.
   An inode's subvolume membership is immutable: once an inode is
   created in a subvolume, it stays there, so any attempt to change
   the ID indicates a bug.
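For illustration only (not part of the patch), the set-once rule that
ceph_inode_set_subvolume() enforces can be sketched as a small user-space
model; set_subvolume_once() and its return convention are hypothetical
stand-ins for the kernel helper:

```c
#include <stdint.h>

#define CEPH_SUBVOLUME_ID_NONE 0ULL

/*
 * Model of the set-once rule: the slot may go from NONE to a valid ID
 * exactly once; re-setting the same ID is a no-op, and any other change
 * is rejected (the kernel helper would WARN_ON_ONCE there instead).
 * Returns 0 on success/no-op, -1 when the update is rejected.
 */
int set_subvolume_once(uint64_t *slot, uint64_t id)
{
	if (id == CEPH_SUBVOLUME_ID_NONE)
		return 0;	/* unknown/unset IDs are ignored */
	if (*slot == id)
		return 0;	/* idempotent re-set is fine */
	if (*slot != CEPH_SUBVOLUME_ID_NONE)
		return -1;	/* immutable once set: caller bug */
	*slot = id;
	return 0;
}
```

The sentinel value doubles as the "never set" state, which is why the MDS
only ever sends non-zero IDs for inodes inside subvolumes.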
---
 fs/ceph/inode.c             | 32 +++++++++++++++++++++++++-------
 fs/ceph/mds_client.c        |  5 +----
 fs/ceph/subvolume_metrics.c |  7 ++++---
 fs/ceph/super.h             | 10 +++++++++-
 4 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 835049004047..257b3e27b741 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -638,7 +638,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
 
 	ci->i_max_bytes = 0;
 	ci->i_max_files = 0;
-	ci->i_subvolume_id = 0;
+	ci->i_subvolume_id = CEPH_SUBVOLUME_ID_NONE;
 
 	memset(&ci->i_dir_layout, 0, sizeof(ci->i_dir_layout));
 	memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
@@ -743,7 +743,7 @@ void ceph_evict_inode(struct inode *inode)
 
 	percpu_counter_dec(&mdsc->metric.total_inodes);
 
-	ci->i_subvolume_id = 0;
+	ci->i_subvolume_id = CEPH_SUBVOLUME_ID_NONE;
 
 	netfs_wait_for_outstanding_io(inode);
 	truncate_inode_pages_final(&inode->i_data);
@@ -877,19 +877,37 @@ int ceph_fill_file_size(struct inode *inode, int issued,
 }
 
 /*
- * Set the subvolume ID for an inode. Following the FUSE client convention,
- * 0 means unknown/unset (MDS only sends non-zero IDs for subvolume inodes).
+ * Set the subvolume ID for an inode.
+ *
+ * The subvolume_id identifies which CephFS subvolume this inode belongs to.
+ * CEPH_SUBVOLUME_ID_NONE (0) means unknown/unset - the MDS only sends
+ * non-zero IDs for inodes within subvolumes.
+ *
+ * An inode's subvolume membership is immutable - once an inode is created
+ * in a subvolume, it stays there. Therefore, if we already have a valid
+ * (non-zero) subvolume_id and receive a different one, that indicates a bug.
  */
 void ceph_inode_set_subvolume(struct inode *inode, u64 subvolume_id)
 {
 	struct ceph_inode_info *ci;
+	u64 old;
 
-	if (!inode || !subvolume_id)
+	if (!inode || subvolume_id == CEPH_SUBVOLUME_ID_NONE)
 		return;
 
 	ci = ceph_inode(inode);
-	if (READ_ONCE(ci->i_subvolume_id) != subvolume_id)
-		WRITE_ONCE(ci->i_subvolume_id, subvolume_id);
+	old = READ_ONCE(ci->i_subvolume_id);
+
+	if (old == subvolume_id)
+		return;
+
+	if (old != CEPH_SUBVOLUME_ID_NONE) {
+		/* subvolume_id should not change once set */
+		WARN_ON_ONCE(1);
+		return;
+	}
+
+	WRITE_ONCE(ci->i_subvolume_id, subvolume_id);
 }
 
 void ceph_fill_file_time(struct inode *inode, int issued,
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 2b831f48c844..f2a17e11fcef 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -122,10 +122,7 @@ static int parse_reply_info_in(void **p, void *end,
 	u32 struct_len = 0;
 	struct ceph_client *cl = mdsc ? mdsc->fsc->client : NULL;
 
-	info->subvolume_id = 0;
-	doutc(cl, "subv_metric parse start features=0x%llx\n", features);
-
-	info->subvolume_id = 0;
+	info->subvolume_id = CEPH_SUBVOLUME_ID_NONE;
 
 	if (features == (u64)-1) {
 		ceph_decode_8_safe(p, end, struct_v, bad);
diff --git a/fs/ceph/subvolume_metrics.c b/fs/ceph/subvolume_metrics.c
index 111f6754e609..37cbed5b52c3 100644
--- a/fs/ceph/subvolume_metrics.c
+++ b/fs/ceph/subvolume_metrics.c
@@ -136,8 +136,9 @@ void ceph_subvolume_metrics_record(struct ceph_subvolume_metrics_tracker *tracke
 	struct ceph_subvol_metric_rb_entry *entry, *new_entry = NULL;
 	bool retry = false;
 
-	/* 0 means unknown/unset subvolume (matches FUSE client convention) */
-	if (!READ_ONCE(tracker->enabled) || !subvol_id || !size || !latency_us)
+	/* CEPH_SUBVOLUME_ID_NONE (0) means unknown/unset subvolume */
+	if (!READ_ONCE(tracker->enabled) ||
+	    subvol_id == CEPH_SUBVOLUME_ID_NONE || !size || !latency_us)
 		return;
 
 	do {
@@ -403,7 +404,7 @@ void ceph_subvolume_metrics_record_io(struct ceph_mds_client *mdsc,
 	}
 
 	subvol_id = READ_ONCE(ci->i_subvolume_id);
-	if (!subvol_id) {
+	if (subvol_id == CEPH_SUBVOLUME_ID_NONE) {
 		atomic64_inc(&tracker->record_no_subvol);
 		return;
 	}
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index a03c373efd52..731df0fcbcc8 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -386,7 +386,15 @@ struct ceph_inode_info {
 
 	/* quotas */
 	u64 i_max_bytes, i_max_files;
-	u64 i_subvolume_id;	/* 0 = unknown/unset, matches FUSE client */
+
+	/*
+	 * Subvolume ID this inode belongs to. CEPH_SUBVOLUME_ID_NONE (0)
+	 * means unknown/unset, matching the FUSE client convention.
+	 * Once set to a valid (non-zero) value, it should not change
+	 * during the inode's lifetime.
+	 */
+#define CEPH_SUBVOLUME_ID_NONE 0
+	u64 i_subvolume_id;
 
 	s32 i_dir_pin;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re:  [PATCH v3 4/4] ceph: adding CEPH_SUBVOLUME_ID_NONE
  2025-12-03 15:46 ` [PATCH v3 4/4] ceph: adding CEPH_SUBVOLUME_ID_NONE Alex Markuze
@ 2025-12-03 20:15   ` Viacheslav Dubeyko
  2025-12-03 21:22     ` Alex Markuze
  2025-12-12  0:14   ` kernel test robot
  1 sibling, 1 reply; 12+ messages in thread
From: Viacheslav Dubeyko @ 2025-12-03 20:15 UTC (permalink / raw)
  To: Alex Markuze, ceph-devel@vger.kernel.org
  Cc: Viacheslav Dubeyko, idryomov@gmail.com,
	linux-fsdevel@vger.kernel.org

On Wed, 2025-12-03 at 15:46 +0000, Alex Markuze wrote:
> 1. Introduce CEPH_SUBVOLUME_ID_NONE constant (value 0) to make the
>    unknown/unset state explicit and self-documenting.
> 
> 2. Add WARN_ON_ONCE if attempting to change an already-set subvolume_id.
>    An inode's subvolume membership is immutable - once created in a
>    subvolume, it stays there. Attempting to change it indicates a bug.
> ---
>  fs/ceph/inode.c             | 32 +++++++++++++++++++++++++-------
>  fs/ceph/mds_client.c        |  5 +----
>  fs/ceph/subvolume_metrics.c |  7 ++++---
>  fs/ceph/super.h             | 10 +++++++++-
>  4 files changed, 39 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 835049004047..257b3e27b741 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -638,7 +638,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
>  
>  	ci->i_max_bytes = 0;
>  	ci->i_max_files = 0;
> -	ci->i_subvolume_id = 0;
> +	ci->i_subvolume_id = CEPH_SUBVOLUME_ID_NONE;

I expected to see the code of this patch in the second and third ones, and
it looks really confusing. Why have you introduced another patch?

So, how can I test this patchset? I assume that an xfstests run will not be
enough. Do we have a special test environment or test cases for this?

Thanks,
Slava.

>  
>  	memset(&ci->i_dir_layout, 0, sizeof(ci->i_dir_layout));
>  	memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
> @@ -743,7 +743,7 @@ void ceph_evict_inode(struct inode *inode)
>  
>  	percpu_counter_dec(&mdsc->metric.total_inodes);
>  
> -	ci->i_subvolume_id = 0;
> +	ci->i_subvolume_id = CEPH_SUBVOLUME_ID_NONE;
>  
>  	netfs_wait_for_outstanding_io(inode);
>  	truncate_inode_pages_final(&inode->i_data);
> @@ -877,19 +877,37 @@ int ceph_fill_file_size(struct inode *inode, int issued,
>  }
>  
>  /*
> - * Set the subvolume ID for an inode. Following the FUSE client convention,
> - * 0 means unknown/unset (MDS only sends non-zero IDs for subvolume inodes).
> + * Set the subvolume ID for an inode.
> + *
> + * The subvolume_id identifies which CephFS subvolume this inode belongs to.
> + * CEPH_SUBVOLUME_ID_NONE (0) means unknown/unset - the MDS only sends
> + * non-zero IDs for inodes within subvolumes.
> + *
> + * An inode's subvolume membership is immutable - once an inode is created
> + * in a subvolume, it stays there. Therefore, if we already have a valid
> + * (non-zero) subvolume_id and receive a different one, that indicates a bug.
>   */
>  void ceph_inode_set_subvolume(struct inode *inode, u64 subvolume_id)
>  {
>  	struct ceph_inode_info *ci;
> +	u64 old;
>  
> -	if (!inode || !subvolume_id)
> +	if (!inode || subvolume_id == CEPH_SUBVOLUME_ID_NONE)
>  		return;
>  
>  	ci = ceph_inode(inode);
> -	if (READ_ONCE(ci->i_subvolume_id) != subvolume_id)
> -		WRITE_ONCE(ci->i_subvolume_id, subvolume_id);
> +	old = READ_ONCE(ci->i_subvolume_id);
> +
> +	if (old == subvolume_id)
> +		return;
> +
> +	if (old != CEPH_SUBVOLUME_ID_NONE) {
> +		/* subvolume_id should not change once set */
> +		WARN_ON_ONCE(1);
> +		return;
> +	}
> +
> +	WRITE_ONCE(ci->i_subvolume_id, subvolume_id);
>  }
>  
>  void ceph_fill_file_time(struct inode *inode, int issued,
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 2b831f48c844..f2a17e11fcef 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -122,10 +122,7 @@ static int parse_reply_info_in(void **p, void *end,
>  	u32 struct_len = 0;
>  	struct ceph_client *cl = mdsc ? mdsc->fsc->client : NULL;
>  
> -	info->subvolume_id = 0;
> -	doutc(cl, "subv_metric parse start features=0x%llx\n", features);
> -
> -	info->subvolume_id = 0;
> +	info->subvolume_id = CEPH_SUBVOLUME_ID_NONE;
>  
>  	if (features == (u64)-1) {
>  		ceph_decode_8_safe(p, end, struct_v, bad);
> diff --git a/fs/ceph/subvolume_metrics.c b/fs/ceph/subvolume_metrics.c
> index 111f6754e609..37cbed5b52c3 100644
> --- a/fs/ceph/subvolume_metrics.c
> +++ b/fs/ceph/subvolume_metrics.c
> @@ -136,8 +136,9 @@ void ceph_subvolume_metrics_record(struct ceph_subvolume_metrics_tracker *tracke
>  	struct ceph_subvol_metric_rb_entry *entry, *new_entry = NULL;
>  	bool retry = false;
>  
> -	/* 0 means unknown/unset subvolume (matches FUSE client convention) */
> -	if (!READ_ONCE(tracker->enabled) || !subvol_id || !size || !latency_us)
> +	/* CEPH_SUBVOLUME_ID_NONE (0) means unknown/unset subvolume */
> +	if (!READ_ONCE(tracker->enabled) ||
> +	    subvol_id == CEPH_SUBVOLUME_ID_NONE || !size || !latency_us)
>  		return;
>  
>  	do {
> @@ -403,7 +404,7 @@ void ceph_subvolume_metrics_record_io(struct ceph_mds_client *mdsc,
>  	}
>  
>  	subvol_id = READ_ONCE(ci->i_subvolume_id);
> -	if (!subvol_id) {
> +	if (subvol_id == CEPH_SUBVOLUME_ID_NONE) {
>  		atomic64_inc(&tracker->record_no_subvol);
>  		return;
>  	}
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index a03c373efd52..731df0fcbcc8 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -386,7 +386,15 @@ struct ceph_inode_info {
>  
>  	/* quotas */
>  	u64 i_max_bytes, i_max_files;
> -	u64 i_subvolume_id;	/* 0 = unknown/unset, matches FUSE client */
> +
> +	/*
> +	 * Subvolume ID this inode belongs to. CEPH_SUBVOLUME_ID_NONE (0)
> +	 * means unknown/unset, matching the FUSE client convention.
> +	 * Once set to a valid (non-zero) value, it should not change
> +	 * during the inode's lifetime.
> +	 */
> +#define CEPH_SUBVOLUME_ID_NONE 0
> +	u64 i_subvolume_id;
>  
>  	s32 i_dir_pin;
>  

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 4/4] ceph: adding CEPH_SUBVOLUME_ID_NONE
  2025-12-03 20:15   ` Viacheslav Dubeyko
@ 2025-12-03 21:22     ` Alex Markuze
  2025-12-03 22:54       ` Viacheslav Dubeyko
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Markuze @ 2025-12-03 21:22 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: ceph-devel@vger.kernel.org, Viacheslav Dubeyko,
	idryomov@gmail.com, linux-fsdevel@vger.kernel.org

The latest ceph code supports subvolume metrics.
The test is simple:
1. Deploy a ceph cluster
2. Create and mount a subvolume
3. run some I/O
4. I used debugfs to see that subvolume metrics were collected on the
client side, and checked that subvolume metrics were being reported on
the MDS.

Nothing more to it.

On Wed, Dec 3, 2025 at 10:15 PM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Wed, 2025-12-03 at 15:46 +0000, Alex Markuze wrote:
> > 1. Introduce CEPH_SUBVOLUME_ID_NONE constant (value 0) to make the
> >    unknown/unset state explicit and self-documenting.
> >
> > 2. Add WARN_ON_ONCE if attempting to change an already-set subvolume_id.
> >    An inode's subvolume membership is immutable - once created in a
> >    subvolume, it stays there. Attempting to change it indicates a bug.
> > ---
> >  fs/ceph/inode.c             | 32 +++++++++++++++++++++++++-------
> >  fs/ceph/mds_client.c        |  5 +----
> >  fs/ceph/subvolume_metrics.c |  7 ++++---
> >  fs/ceph/super.h             | 10 +++++++++-
> >  4 files changed, 39 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > index 835049004047..257b3e27b741 100644
> > --- a/fs/ceph/inode.c
> > +++ b/fs/ceph/inode.c
> > @@ -638,7 +638,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
> >
> >       ci->i_max_bytes = 0;
> >       ci->i_max_files = 0;
> > -     ci->i_subvolume_id = 0;
> > +     ci->i_subvolume_id = CEPH_SUBVOLUME_ID_NONE;
>
> I was expected to see the code of this patch in the second and third ones. And
> it looks really confusing. Why have you introduced another one patch?
>
> So, how I can test this patchset? I assume that xfstests run will be not enough.
> Do we have special test environment or test-cases for this?
>
> Thanks,
> Slava.
>
> >
> >       memset(&ci->i_dir_layout, 0, sizeof(ci->i_dir_layout));
> >       memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
> > @@ -743,7 +743,7 @@ void ceph_evict_inode(struct inode *inode)
> >
> >       percpu_counter_dec(&mdsc->metric.total_inodes);
> >
> > -     ci->i_subvolume_id = 0;
> > +     ci->i_subvolume_id = CEPH_SUBVOLUME_ID_NONE;
> >
> >       netfs_wait_for_outstanding_io(inode);
> >       truncate_inode_pages_final(&inode->i_data);
> > @@ -877,19 +877,37 @@ int ceph_fill_file_size(struct inode *inode, int issued,
> >  }
> >
> >  /*
> > - * Set the subvolume ID for an inode. Following the FUSE client convention,
> > - * 0 means unknown/unset (MDS only sends non-zero IDs for subvolume inodes).
> > + * Set the subvolume ID for an inode.
> > + *
> > + * The subvolume_id identifies which CephFS subvolume this inode belongs to.
> > + * CEPH_SUBVOLUME_ID_NONE (0) means unknown/unset - the MDS only sends
> > + * non-zero IDs for inodes within subvolumes.
> > + *
> > + * An inode's subvolume membership is immutable - once an inode is created
> > + * in a subvolume, it stays there. Therefore, if we already have a valid
> > + * (non-zero) subvolume_id and receive a different one, that indicates a bug.
> >   */
> >  void ceph_inode_set_subvolume(struct inode *inode, u64 subvolume_id)
> >  {
> >       struct ceph_inode_info *ci;
> > +     u64 old;
> >
> > -     if (!inode || !subvolume_id)
> > +     if (!inode || subvolume_id == CEPH_SUBVOLUME_ID_NONE)
> >               return;
> >
> >       ci = ceph_inode(inode);
> > -     if (READ_ONCE(ci->i_subvolume_id) != subvolume_id)
> > -             WRITE_ONCE(ci->i_subvolume_id, subvolume_id);
> > +     old = READ_ONCE(ci->i_subvolume_id);
> > +
> > +     if (old == subvolume_id)
> > +             return;
> > +
> > +     if (old != CEPH_SUBVOLUME_ID_NONE) {
> > +             /* subvolume_id should not change once set */
> > +             WARN_ON_ONCE(1);
> > +             return;
> > +     }
> > +
> > +     WRITE_ONCE(ci->i_subvolume_id, subvolume_id);
> >  }
> >
> >  void ceph_fill_file_time(struct inode *inode, int issued,
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 2b831f48c844..f2a17e11fcef 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -122,10 +122,7 @@ static int parse_reply_info_in(void **p, void *end,
> >       u32 struct_len = 0;
> >       struct ceph_client *cl = mdsc ? mdsc->fsc->client : NULL;
> >
> > -     info->subvolume_id = 0;
> > -     doutc(cl, "subv_metric parse start features=0x%llx\n", features);
> > -
> > -     info->subvolume_id = 0;
> > +     info->subvolume_id = CEPH_SUBVOLUME_ID_NONE;
> >
> >       if (features == (u64)-1) {
> >               ceph_decode_8_safe(p, end, struct_v, bad);
> > diff --git a/fs/ceph/subvolume_metrics.c b/fs/ceph/subvolume_metrics.c
> > index 111f6754e609..37cbed5b52c3 100644
> > --- a/fs/ceph/subvolume_metrics.c
> > +++ b/fs/ceph/subvolume_metrics.c
> > @@ -136,8 +136,9 @@ void ceph_subvolume_metrics_record(struct ceph_subvolume_metrics_tracker *tracke
> >       struct ceph_subvol_metric_rb_entry *entry, *new_entry = NULL;
> >       bool retry = false;
> >
> > -     /* 0 means unknown/unset subvolume (matches FUSE client convention) */
> > -     if (!READ_ONCE(tracker->enabled) || !subvol_id || !size || !latency_us)
> > +     /* CEPH_SUBVOLUME_ID_NONE (0) means unknown/unset subvolume */
> > +     if (!READ_ONCE(tracker->enabled) ||
> > +         subvol_id == CEPH_SUBVOLUME_ID_NONE || !size || !latency_us)
> >               return;
> >
> >       do {
> > @@ -403,7 +404,7 @@ void ceph_subvolume_metrics_record_io(struct ceph_mds_client *mdsc,
> >       }
> >
> >       subvol_id = READ_ONCE(ci->i_subvolume_id);
> > -     if (!subvol_id) {
> > +     if (subvol_id == CEPH_SUBVOLUME_ID_NONE) {
> >               atomic64_inc(&tracker->record_no_subvol);
> >               return;
> >       }
> > diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> > index a03c373efd52..731df0fcbcc8 100644
> > --- a/fs/ceph/super.h
> > +++ b/fs/ceph/super.h
> > @@ -386,7 +386,15 @@ struct ceph_inode_info {
> >
> >       /* quotas */
> >       u64 i_max_bytes, i_max_files;
> > -     u64 i_subvolume_id;     /* 0 = unknown/unset, matches FUSE client */
> > +
> > +     /*
> > +      * Subvolume ID this inode belongs to. CEPH_SUBVOLUME_ID_NONE (0)
> > +      * means unknown/unset, matching the FUSE client convention.
> > +      * Once set to a valid (non-zero) value, it should not change
> > +      * during the inode's lifetime.
> > +      */
> > +#define CEPH_SUBVOLUME_ID_NONE 0
> > +     u64 i_subvolume_id;
> >
> >       s32 i_dir_pin;
> >


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v3 4/4] ceph: adding CEPH_SUBVOLUME_ID_NONE
  2025-12-03 21:22     ` Alex Markuze
@ 2025-12-03 22:54       ` Viacheslav Dubeyko
  2025-12-04  8:18         ` Alex Markuze
  0 siblings, 1 reply; 12+ messages in thread
From: Viacheslav Dubeyko @ 2025-12-03 22:54 UTC (permalink / raw)
  To: Alex Markuze
  Cc: Viacheslav Dubeyko, ceph-devel@vger.kernel.org,
	idryomov@gmail.com, linux-fsdevel@vger.kernel.org

On Wed, 2025-12-03 at 23:22 +0200, Alex Markuze wrote:
> The latest ceph code supports subvolume metrics.
> The test is simple:
> 1. Deploy a ceph cluster
> 2. Create and mount a subvolume
> 3. run some I/O
> 4. I used debugfs to see that subvolume metrics were collected on the
> client side and checked for subvolume metrics being reported on the
> mds.
> 
> Nothing more to it.
> 

So, if it is simple, then what about adding another Ceph test to the
xfstests suite? Maybe you could consider a unit test too. I've already
introduced an initial patch with a KUnit-based unit test. We should have
a test case that anyone can run to test this code.

Thanks,
Slava.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 4/4] ceph: adding CEPH_SUBVOLUME_ID_NONE
  2025-12-03 22:54       ` Viacheslav Dubeyko
@ 2025-12-04  8:18         ` Alex Markuze
  2025-12-04 18:53           ` Viacheslav Dubeyko
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Markuze @ 2025-12-04  8:18 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Viacheslav Dubeyko, ceph-devel@vger.kernel.org,
	idryomov@gmail.com, linux-fsdevel@vger.kernel.org

There is no separate test needed. The client only differs in that the
mount path is for a subvolume.
Regardless, it's outside the scope of this patchset.

On Thu, Dec 4, 2025 at 12:55 AM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Wed, 2025-12-03 at 23:22 +0200, Alex Markuze wrote:
> > The latest ceph code supports subvolume metrics.
> > The test is simple:
> > 1. Deploy a ceph cluster
> > 2. Create and mount a subvolume
> > 3. run some I/O
> > 4. I used debugfs to see that subvolume metrics were collected on the
> > client side and checked for subvolume metrics being reported on the
> > mds.
> >
> > Nothing more to it.
> >
>
> So, if it is simple, then what's about of adding another Ceph's test into
> xfstests suite? Maybe, you can consider unit-test too. I've already introduced
> initial patch with Kunit-based unit-test. We should have some test-case that
> anyone can run and test this code.
>
> Thanks,
> Slava.
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v3 4/4] ceph: adding CEPH_SUBVOLUME_ID_NONE
  2025-12-04  8:18         ` Alex Markuze
@ 2025-12-04 18:53           ` Viacheslav Dubeyko
  2025-12-07 10:51             ` Alex Markuze
  0 siblings, 1 reply; 12+ messages in thread
From: Viacheslav Dubeyko @ 2025-12-04 18:53 UTC (permalink / raw)
  To: Alex Markuze
  Cc: Viacheslav Dubeyko, ceph-devel@vger.kernel.org,
	idryomov@gmail.com, linux-fsdevel@vger.kernel.org

On Thu, 2025-12-04 at 10:18 +0200, Alex Markuze wrote:
> There is no separate test needed. The client only differs in that the
> mount path is for a subvolume.
> Regardless its outside the scope of this patchset
> 

If we implement any new functionality, then a unit test or a dedicated
xfstests test must be introduced for it. So, I think it's very relevant
to this patchset.

Also, I believe the fourth patch should be merged into the second and
third ones. I don't see the point of introducing not-quite-correct code
and then fixing it in a subsequent patch. It looks really wrong to me.

Thanks,
Slava. 

> On Thu, Dec 4, 2025 at 12:55 AM Viacheslav Dubeyko
> <Slava.Dubeyko@ibm.com> wrote:
> > 
> > On Wed, 2025-12-03 at 23:22 +0200, Alex Markuze wrote:
> > > The latest ceph code supports subvolume metrics.
> > > The test is simple:
> > > 1. Deploy a ceph cluster
> > > 2. Create and mount a subvolume
> > > 3. run some I/O
> > > 4. I used debugfs to see that subvolume metrics were collected on the
> > > client side and checked for subvolume metrics being reported on the
> > > mds.
> > > 
> > > Nothing more to it.
> > > 
> > 
> > So, if it is simple, then what's about of adding another Ceph's test into
> > xfstests suite? Maybe, you can consider unit-test too. I've already introduced
> > initial patch with Kunit-based unit-test. We should have some test-case that
> > anyone can run and test this code.
> > 
> > Thanks,
> > Slava.
> > 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 4/4] ceph: adding CEPH_SUBVOLUME_ID_NONE
  2025-12-04 18:53           ` Viacheslav Dubeyko
@ 2025-12-07 10:51             ` Alex Markuze
  0 siblings, 0 replies; 12+ messages in thread
From: Alex Markuze @ 2025-12-07 10:51 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Viacheslav Dubeyko, ceph-devel@vger.kernel.org,
	idryomov@gmail.com, linux-fsdevel@vger.kernel.org

I'll squash the last patch into patch number 3.

I'm attaching instructions on how to create and mount a subvolume with
Ceph. Sending I/O using any of the existing tests will trigger
subvolume metrics collection and reporting.

Create a subvolume:
ceph fs subvolume create <fs name> <subvolume_name>

Here is an example of how to mount a FUSE client and get the subvolume
path (the FUSE client is the reference implementation):
/bin/ceph-fuse --id admin --client_fs <fs name> --conf
$(pwd)/ceph.conf -r $(ceph fs subvolume getpath <fs name>
<subvolume_name>) <mnt_point>

Mount the kernel client:

IP=10.251.64.6
PORT=40258
CLIENT=kclient1
KEY="AQA4zyVpTXWPGRAA4c8aozVMYt+cri+3tAv6yA=="
CEPH_PATH="/volumes/_nogroup/ssubvol_a/f089e65e-2fc9-4474-b728-117de5ad25f6"

sudo mount -t ceph \
    -o mon_addr=$IP:$PORT,secret=$KEY,name=$CLIENT,ms_mode=crc,nowsync,copyfrom,mds_namespace=a \
    $IP:$PORT:$CEPH_PATH /mnt/subvol

On Thu, Dec 4, 2025 at 8:53 PM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> On Thu, 2025-12-04 at 10:18 +0200, Alex Markuze wrote:
> > There is no separate test needed. The client only differs in that the
> > mount path is for a subvolume.
> > Regardless its outside the scope of this patchset
> >
>
> If we implement any new functionality, then unit-test or special test in
> xfstests must be introduced for any new functionality. So, I think it's very
> relevant to the patchset.
>
> Also, I believe that fourth patch should be merged into second and third ones. I
> don't see the point of introducing not completely correct code and then fix it
> by subsequent patch. It looks really wrong to me.
>
> Thanks,
> Slava.
>
> > On Thu, Dec 4, 2025 at 12:55 AM Viacheslav Dubeyko
> > <Slava.Dubeyko@ibm.com> wrote:
> > >
> > > On Wed, 2025-12-03 at 23:22 +0200, Alex Markuze wrote:
> > > > The latest ceph code supports subvolume metrics.
> > > > The test is simple:
> > > > 1. Deploy a ceph cluster
> > > > 2. Create and mount a subvolume
> > > > 3. run some I/O
> > > > 4. I used debugfs to see that subvolume metrics were collected on the
> > > > client side and checked for subvolume metrics being reported on the
> > > > mds.
> > > >
> > > > Nothing more to it.
> > > >
> > >
> > > So, if it is simple, then what's about of adding another Ceph's test into
> > > xfstests suite? Maybe, you can consider unit-test too. I've already introduced
> > > initial patch with Kunit-based unit-test. We should have some test-case that
> > > anyone can run and test this code.
> > >
> > > Thanks,
> > > Slava.
> > >


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 4/4] ceph: adding CEPH_SUBVOLUME_ID_NONE
  2025-12-03 15:46 ` [PATCH v3 4/4] ceph: adding CEPH_SUBVOLUME_ID_NONE Alex Markuze
  2025-12-03 20:15   ` Viacheslav Dubeyko
@ 2025-12-12  0:14   ` kernel test robot
  1 sibling, 0 replies; 12+ messages in thread
From: kernel test robot @ 2025-12-12  0:14 UTC (permalink / raw)
  To: Alex Markuze, ceph-devel
  Cc: oe-kbuild-all, idryomov, linux-fsdevel, amarkuze, vdubeyko

Hi Alex,

kernel test robot noticed the following build warnings:

[auto build test WARNING on ceph-client/for-linus]
[also build test WARNING on linus/master v6.18 next-20251211]
[cannot apply to ceph-client/testing]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Alex-Markuze/ceph-handle-InodeStat-v8-versioned-field-in-reply-parsing/20251204-035756
base:   https://github.com/ceph/ceph-client.git for-linus
patch link:    https://lore.kernel.org/r/20251203154625.2779153-5-amarkuze%40redhat.com
patch subject: [PATCH v3 4/4] ceph: adding CEPH_SUBVOLUME_ID_NONE
config: x86_64-randconfig-101-20251210 (https://download.01.org/0day-ci/archive/20251212/202512120708.d8OjMmgQ-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251212/202512120708.d8OjMmgQ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512120708.d8OjMmgQ-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> fs/ceph/mds_client.c:123:22: warning: unused variable 'cl' [-Wunused-variable]
     123 |         struct ceph_client *cl = mdsc ? mdsc->fsc->client : NULL;
         |                             ^~
   1 warning generated.
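The pattern behind the warning can be illustrated in a few lines of standalone C. The types and `parse_stub` here are hypothetical stand-ins; the actual fix would presumably just delete the now-unused `cl` local, since the doutc() call that used it was removed by this patch.

```c
#include <stddef.h>
#include <assert.h>

/* Hypothetical stand-ins for the kernel structures involved. */
struct client { int id; };
struct mds_client { struct client c; };

/* Before this patch, 'cl' fed a doutc() debug print; with that print
 * removed, the local became unused and clang warns. Dropping the
 * declaration (shown commented out) silences -Wunused-variable. */
static int parse_stub(const struct mds_client *mdsc)
{
	/* struct client *cl = mdsc ? &mdsc->c : NULL;  -- now unused */
	return mdsc ? 1 : 0;	/* stand-in for the real reply parsing */
}
```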


vim +/cl +123 fs/ceph/mds_client.c

b37fe1f923fb4b Yan, Zheng       2019-01-09  113  
2f2dc053404feb Sage Weil        2009-10-06  114  static int parse_reply_info_in(void **p, void *end,
14303d20f3ae3e Sage Weil        2010-12-14  115  			       struct ceph_mds_reply_info_in *info,
48a90cabed1e21 Alex Markuze     2025-12-03  116  			       u64 features,
48a90cabed1e21 Alex Markuze     2025-12-03  117  			       struct ceph_mds_client *mdsc)
2f2dc053404feb Sage Weil        2009-10-06  118  {
b37fe1f923fb4b Yan, Zheng       2019-01-09  119  	int err = 0;
b37fe1f923fb4b Yan, Zheng       2019-01-09  120  	u8 struct_v = 0;
48a90cabed1e21 Alex Markuze     2025-12-03  121  	u8 struct_compat = 0;
48a90cabed1e21 Alex Markuze     2025-12-03  122  	u32 struct_len = 0;
48a90cabed1e21 Alex Markuze     2025-12-03 @123  	struct ceph_client *cl = mdsc ? mdsc->fsc->client : NULL;
48a90cabed1e21 Alex Markuze     2025-12-03  124  
b5cda3b778d7c2 Alex Markuze     2025-12-03  125  	info->subvolume_id = CEPH_SUBVOLUME_ID_NONE;
7361b2801d4572 Alex Markuze     2025-12-03  126  
b37fe1f923fb4b Yan, Zheng       2019-01-09  127  	if (features == (u64)-1) {
b37fe1f923fb4b Yan, Zheng       2019-01-09  128  		ceph_decode_8_safe(p, end, struct_v, bad);
b37fe1f923fb4b Yan, Zheng       2019-01-09  129  		ceph_decode_8_safe(p, end, struct_compat, bad);
b37fe1f923fb4b Yan, Zheng       2019-01-09  130  		/* struct_v is expected to be >= 1. we only understand
b37fe1f923fb4b Yan, Zheng       2019-01-09  131  		 * encoding with struct_compat == 1. */
b37fe1f923fb4b Yan, Zheng       2019-01-09  132  		if (!struct_v || struct_compat != 1)
b37fe1f923fb4b Yan, Zheng       2019-01-09  133  			goto bad;
b37fe1f923fb4b Yan, Zheng       2019-01-09  134  		ceph_decode_32_safe(p, end, struct_len, bad);
b37fe1f923fb4b Yan, Zheng       2019-01-09  135  		ceph_decode_need(p, end, struct_len, bad);
b37fe1f923fb4b Yan, Zheng       2019-01-09  136  		end = *p + struct_len;
b37fe1f923fb4b Yan, Zheng       2019-01-09  137  	}
2f2dc053404feb Sage Weil        2009-10-06  138  
b37fe1f923fb4b Yan, Zheng       2019-01-09  139  	ceph_decode_need(p, end, sizeof(struct ceph_mds_reply_inode), bad);
2f2dc053404feb Sage Weil        2009-10-06  140  	info->in = *p;
2f2dc053404feb Sage Weil        2009-10-06  141  	*p += sizeof(struct ceph_mds_reply_inode) +
2f2dc053404feb Sage Weil        2009-10-06  142  		sizeof(*info->in->fragtree.splits) *
2f2dc053404feb Sage Weil        2009-10-06  143  		le32_to_cpu(info->in->fragtree.nsplits);
2f2dc053404feb Sage Weil        2009-10-06  144  
2f2dc053404feb Sage Weil        2009-10-06  145  	ceph_decode_32_safe(p, end, info->symlink_len, bad);
2f2dc053404feb Sage Weil        2009-10-06  146  	ceph_decode_need(p, end, info->symlink_len, bad);
2f2dc053404feb Sage Weil        2009-10-06  147  	info->symlink = *p;
2f2dc053404feb Sage Weil        2009-10-06  148  	*p += info->symlink_len;
2f2dc053404feb Sage Weil        2009-10-06  149  
14303d20f3ae3e Sage Weil        2010-12-14  150  	ceph_decode_copy_safe(p, end, &info->dir_layout,
14303d20f3ae3e Sage Weil        2010-12-14  151  			      sizeof(info->dir_layout), bad);
2f2dc053404feb Sage Weil        2009-10-06  152  	ceph_decode_32_safe(p, end, info->xattr_len, bad);
2f2dc053404feb Sage Weil        2009-10-06  153  	ceph_decode_need(p, end, info->xattr_len, bad);
2f2dc053404feb Sage Weil        2009-10-06  154  	info->xattr_data = *p;
2f2dc053404feb Sage Weil        2009-10-06  155  	*p += info->xattr_len;
fb01d1f8b0343f Yan, Zheng       2014-11-14  156  
b37fe1f923fb4b Yan, Zheng       2019-01-09  157  	if (features == (u64)-1) {
b37fe1f923fb4b Yan, Zheng       2019-01-09  158  		/* inline data */
b37fe1f923fb4b Yan, Zheng       2019-01-09  159  		ceph_decode_64_safe(p, end, info->inline_version, bad);
b37fe1f923fb4b Yan, Zheng       2019-01-09  160  		ceph_decode_32_safe(p, end, info->inline_len, bad);
b37fe1f923fb4b Yan, Zheng       2019-01-09  161  		ceph_decode_need(p, end, info->inline_len, bad);
b37fe1f923fb4b Yan, Zheng       2019-01-09  162  		info->inline_data = *p;
b37fe1f923fb4b Yan, Zheng       2019-01-09  163  		*p += info->inline_len;
b37fe1f923fb4b Yan, Zheng       2019-01-09  164  		/* quota */
b37fe1f923fb4b Yan, Zheng       2019-01-09  165  		err = parse_reply_info_quota(p, end, info);
b37fe1f923fb4b Yan, Zheng       2019-01-09  166  		if (err < 0)
b37fe1f923fb4b Yan, Zheng       2019-01-09  167  			goto out_bad;
b37fe1f923fb4b Yan, Zheng       2019-01-09  168  		/* pool namespace */
b37fe1f923fb4b Yan, Zheng       2019-01-09  169  		ceph_decode_32_safe(p, end, info->pool_ns_len, bad);
b37fe1f923fb4b Yan, Zheng       2019-01-09  170  		if (info->pool_ns_len > 0) {
b37fe1f923fb4b Yan, Zheng       2019-01-09  171  			ceph_decode_need(p, end, info->pool_ns_len, bad);
b37fe1f923fb4b Yan, Zheng       2019-01-09  172  			info->pool_ns_data = *p;
b37fe1f923fb4b Yan, Zheng       2019-01-09  173  			*p += info->pool_ns_len;
b37fe1f923fb4b Yan, Zheng       2019-01-09  174  		}
245ce991cca55e Jeff Layton      2019-05-29  175  
245ce991cca55e Jeff Layton      2019-05-29  176  		/* btime */
245ce991cca55e Jeff Layton      2019-05-29  177  		ceph_decode_need(p, end, sizeof(info->btime), bad);
245ce991cca55e Jeff Layton      2019-05-29  178  		ceph_decode_copy(p, &info->btime, sizeof(info->btime));
245ce991cca55e Jeff Layton      2019-05-29  179  
245ce991cca55e Jeff Layton      2019-05-29  180  		/* change attribute */
a35ead314e0b92 Jeff Layton      2019-06-06  181  		ceph_decode_64_safe(p, end, info->change_attr, bad);
b37fe1f923fb4b Yan, Zheng       2019-01-09  182  
08796873a5183b Yan, Zheng       2019-01-09  183  		/* dir pin */
08796873a5183b Yan, Zheng       2019-01-09  184  		if (struct_v >= 2) {
08796873a5183b Yan, Zheng       2019-01-09  185  			ceph_decode_32_safe(p, end, info->dir_pin, bad);
08796873a5183b Yan, Zheng       2019-01-09  186  		} else {
08796873a5183b Yan, Zheng       2019-01-09  187  			info->dir_pin = -ENODATA;
08796873a5183b Yan, Zheng       2019-01-09  188  		}
08796873a5183b Yan, Zheng       2019-01-09  189  
193e7b37628e97 David Disseldorp 2019-04-18  190  		/* snapshot birth time, remains zero for v<=2 */
193e7b37628e97 David Disseldorp 2019-04-18  191  		if (struct_v >= 3) {
193e7b37628e97 David Disseldorp 2019-04-18  192  			ceph_decode_need(p, end, sizeof(info->snap_btime), bad);
193e7b37628e97 David Disseldorp 2019-04-18  193  			ceph_decode_copy(p, &info->snap_btime,
193e7b37628e97 David Disseldorp 2019-04-18  194  					 sizeof(info->snap_btime));
193e7b37628e97 David Disseldorp 2019-04-18  195  		} else {
193e7b37628e97 David Disseldorp 2019-04-18  196  			memset(&info->snap_btime, 0, sizeof(info->snap_btime));
193e7b37628e97 David Disseldorp 2019-04-18  197  		}
193e7b37628e97 David Disseldorp 2019-04-18  198  
e7f72952508ac4 Yanhu Cao        2020-08-28  199  		/* snapshot count, remains zero for v<=3 */
e7f72952508ac4 Yanhu Cao        2020-08-28  200  		if (struct_v >= 4) {
e7f72952508ac4 Yanhu Cao        2020-08-28  201  			ceph_decode_64_safe(p, end, info->rsnaps, bad);
e7f72952508ac4 Yanhu Cao        2020-08-28  202  		} else {
e7f72952508ac4 Yanhu Cao        2020-08-28  203  			info->rsnaps = 0;
e7f72952508ac4 Yanhu Cao        2020-08-28  204  		}
e7f72952508ac4 Yanhu Cao        2020-08-28  205  
2d332d5bc42440 Jeff Layton      2020-07-27  206  		if (struct_v >= 5) {
2d332d5bc42440 Jeff Layton      2020-07-27  207  			u32 alen;
2d332d5bc42440 Jeff Layton      2020-07-27  208  
2d332d5bc42440 Jeff Layton      2020-07-27  209  			ceph_decode_32_safe(p, end, alen, bad);
2d332d5bc42440 Jeff Layton      2020-07-27  210  
2d332d5bc42440 Jeff Layton      2020-07-27  211  			while (alen--) {
2d332d5bc42440 Jeff Layton      2020-07-27  212  				u32 len;
2d332d5bc42440 Jeff Layton      2020-07-27  213  
2d332d5bc42440 Jeff Layton      2020-07-27  214  				/* key */
2d332d5bc42440 Jeff Layton      2020-07-27  215  				ceph_decode_32_safe(p, end, len, bad);
2d332d5bc42440 Jeff Layton      2020-07-27  216  				ceph_decode_skip_n(p, end, len, bad);
2d332d5bc42440 Jeff Layton      2020-07-27  217  				/* value */
2d332d5bc42440 Jeff Layton      2020-07-27  218  				ceph_decode_32_safe(p, end, len, bad);
2d332d5bc42440 Jeff Layton      2020-07-27  219  				ceph_decode_skip_n(p, end, len, bad);
2d332d5bc42440 Jeff Layton      2020-07-27  220  			}
2d332d5bc42440 Jeff Layton      2020-07-27  221  		}
2d332d5bc42440 Jeff Layton      2020-07-27  222  
2d332d5bc42440 Jeff Layton      2020-07-27  223  		/* fscrypt flag -- ignore */
2d332d5bc42440 Jeff Layton      2020-07-27  224  		if (struct_v >= 6)
2d332d5bc42440 Jeff Layton      2020-07-27  225  			ceph_decode_skip_8(p, end, bad);
2d332d5bc42440 Jeff Layton      2020-07-27  226  
2d332d5bc42440 Jeff Layton      2020-07-27  227  		info->fscrypt_auth = NULL;
2d332d5bc42440 Jeff Layton      2020-07-27  228  		info->fscrypt_auth_len = 0;
2d332d5bc42440 Jeff Layton      2020-07-27  229  		info->fscrypt_file = NULL;
2d332d5bc42440 Jeff Layton      2020-07-27  230  		info->fscrypt_file_len = 0;
2d332d5bc42440 Jeff Layton      2020-07-27  231  		if (struct_v >= 7) {
2d332d5bc42440 Jeff Layton      2020-07-27  232  			ceph_decode_32_safe(p, end, info->fscrypt_auth_len, bad);
2d332d5bc42440 Jeff Layton      2020-07-27  233  			if (info->fscrypt_auth_len) {
2d332d5bc42440 Jeff Layton      2020-07-27  234  				info->fscrypt_auth = kmalloc(info->fscrypt_auth_len,
2d332d5bc42440 Jeff Layton      2020-07-27  235  							     GFP_KERNEL);
2d332d5bc42440 Jeff Layton      2020-07-27  236  				if (!info->fscrypt_auth)
2d332d5bc42440 Jeff Layton      2020-07-27  237  					return -ENOMEM;
2d332d5bc42440 Jeff Layton      2020-07-27  238  				ceph_decode_copy_safe(p, end, info->fscrypt_auth,
2d332d5bc42440 Jeff Layton      2020-07-27  239  						      info->fscrypt_auth_len, bad);
2d332d5bc42440 Jeff Layton      2020-07-27  240  			}
2d332d5bc42440 Jeff Layton      2020-07-27  241  			ceph_decode_32_safe(p, end, info->fscrypt_file_len, bad);
2d332d5bc42440 Jeff Layton      2020-07-27  242  			if (info->fscrypt_file_len) {
2d332d5bc42440 Jeff Layton      2020-07-27  243  				info->fscrypt_file = kmalloc(info->fscrypt_file_len,
2d332d5bc42440 Jeff Layton      2020-07-27  244  							     GFP_KERNEL);
2d332d5bc42440 Jeff Layton      2020-07-27  245  				if (!info->fscrypt_file)
2d332d5bc42440 Jeff Layton      2020-07-27  246  					return -ENOMEM;
2d332d5bc42440 Jeff Layton      2020-07-27  247  				ceph_decode_copy_safe(p, end, info->fscrypt_file,
2d332d5bc42440 Jeff Layton      2020-07-27  248  						      info->fscrypt_file_len, bad);
2d332d5bc42440 Jeff Layton      2020-07-27  249  			}
2d332d5bc42440 Jeff Layton      2020-07-27  250  		}
e3d0dedf78abdf Alex Markuze     2025-12-03  251  
e3d0dedf78abdf Alex Markuze     2025-12-03  252  		/*
e3d0dedf78abdf Alex Markuze     2025-12-03  253  		 * InodeStat encoding versions:
e3d0dedf78abdf Alex Markuze     2025-12-03  254  		 *   v1-v7: various fields added over time
e3d0dedf78abdf Alex Markuze     2025-12-03  255  		 *   v8: added optmetadata (versioned sub-structure containing
e3d0dedf78abdf Alex Markuze     2025-12-03  256  		 *       optional inode metadata like charmap for case-insensitive
e3d0dedf78abdf Alex Markuze     2025-12-03  257  		 *       filesystems). The kernel client doesn't support
e3d0dedf78abdf Alex Markuze     2025-12-03  258  		 *       case-insensitive lookups, so we skip this field.
e3d0dedf78abdf Alex Markuze     2025-12-03  259  		 *   v9: added subvolume_id (parsed below)
e3d0dedf78abdf Alex Markuze     2025-12-03  260  		 */
e3d0dedf78abdf Alex Markuze     2025-12-03  261  		if (struct_v >= 8) {
e3d0dedf78abdf Alex Markuze     2025-12-03  262  			u32 v8_struct_len;
e3d0dedf78abdf Alex Markuze     2025-12-03  263  
e3d0dedf78abdf Alex Markuze     2025-12-03  264  			/* skip optmetadata versioned sub-structure */
e3d0dedf78abdf Alex Markuze     2025-12-03  265  			ceph_decode_skip_8(p, end, bad);  /* struct_v */
e3d0dedf78abdf Alex Markuze     2025-12-03  266  			ceph_decode_skip_8(p, end, bad);  /* struct_compat */
e3d0dedf78abdf Alex Markuze     2025-12-03  267  			ceph_decode_32_safe(p, end, v8_struct_len, bad);
e3d0dedf78abdf Alex Markuze     2025-12-03  268  			ceph_decode_skip_n(p, end, v8_struct_len, bad);
e3d0dedf78abdf Alex Markuze     2025-12-03  269  		}
e3d0dedf78abdf Alex Markuze     2025-12-03  270  
7361b2801d4572 Alex Markuze     2025-12-03  271  		/* struct_v 9 added subvolume_id */
7361b2801d4572 Alex Markuze     2025-12-03  272  		if (struct_v >= 9)
7361b2801d4572 Alex Markuze     2025-12-03  273  			ceph_decode_64_safe(p, end, info->subvolume_id, bad);
7361b2801d4572 Alex Markuze     2025-12-03  274  
b37fe1f923fb4b Yan, Zheng       2019-01-09  275  		*p = end;
b37fe1f923fb4b Yan, Zheng       2019-01-09  276  	} else {
2d332d5bc42440 Jeff Layton      2020-07-27  277  		/* legacy (unversioned) struct */
fb01d1f8b0343f Yan, Zheng       2014-11-14  278  		if (features & CEPH_FEATURE_MDS_INLINE_DATA) {
fb01d1f8b0343f Yan, Zheng       2014-11-14  279  			ceph_decode_64_safe(p, end, info->inline_version, bad);
fb01d1f8b0343f Yan, Zheng       2014-11-14  280  			ceph_decode_32_safe(p, end, info->inline_len, bad);
fb01d1f8b0343f Yan, Zheng       2014-11-14  281  			ceph_decode_need(p, end, info->inline_len, bad);
fb01d1f8b0343f Yan, Zheng       2014-11-14  282  			info->inline_data = *p;
fb01d1f8b0343f Yan, Zheng       2014-11-14  283  			*p += info->inline_len;
fb01d1f8b0343f Yan, Zheng       2014-11-14  284  		} else
fb01d1f8b0343f Yan, Zheng       2014-11-14  285  			info->inline_version = CEPH_INLINE_NONE;
fb01d1f8b0343f Yan, Zheng       2014-11-14  286  
fb18a57568c2b8 Luis Henriques   2018-01-05  287  		if (features & CEPH_FEATURE_MDS_QUOTA) {
b37fe1f923fb4b Yan, Zheng       2019-01-09  288  			err = parse_reply_info_quota(p, end, info);
b37fe1f923fb4b Yan, Zheng       2019-01-09  289  			if (err < 0)
b37fe1f923fb4b Yan, Zheng       2019-01-09  290  				goto out_bad;
fb18a57568c2b8 Luis Henriques   2018-01-05  291  		} else {
fb18a57568c2b8 Luis Henriques   2018-01-05  292  			info->max_bytes = 0;
fb18a57568c2b8 Luis Henriques   2018-01-05  293  			info->max_files = 0;
fb18a57568c2b8 Luis Henriques   2018-01-05  294  		}
fb18a57568c2b8 Luis Henriques   2018-01-05  295  
779fe0fb8e1883 Yan, Zheng       2016-03-07  296  		info->pool_ns_len = 0;
779fe0fb8e1883 Yan, Zheng       2016-03-07  297  		info->pool_ns_data = NULL;
5ea5c5e0a7f70b Yan, Zheng       2016-02-14  298  		if (features & CEPH_FEATURE_FS_FILE_LAYOUT_V2) {
5ea5c5e0a7f70b Yan, Zheng       2016-02-14  299  			ceph_decode_32_safe(p, end, info->pool_ns_len, bad);
779fe0fb8e1883 Yan, Zheng       2016-03-07  300  			if (info->pool_ns_len > 0) {
5ea5c5e0a7f70b Yan, Zheng       2016-02-14  301  				ceph_decode_need(p, end, info->pool_ns_len, bad);
779fe0fb8e1883 Yan, Zheng       2016-03-07  302  				info->pool_ns_data = *p;
5ea5c5e0a7f70b Yan, Zheng       2016-02-14  303  				*p += info->pool_ns_len;
779fe0fb8e1883 Yan, Zheng       2016-03-07  304  			}
5ea5c5e0a7f70b Yan, Zheng       2016-02-14  305  		}
08796873a5183b Yan, Zheng       2019-01-09  306  
245ce991cca55e Jeff Layton      2019-05-29  307  		if (features & CEPH_FEATURE_FS_BTIME) {
245ce991cca55e Jeff Layton      2019-05-29  308  			ceph_decode_need(p, end, sizeof(info->btime), bad);
245ce991cca55e Jeff Layton      2019-05-29  309  			ceph_decode_copy(p, &info->btime, sizeof(info->btime));
a35ead314e0b92 Jeff Layton      2019-06-06  310  			ceph_decode_64_safe(p, end, info->change_attr, bad);
245ce991cca55e Jeff Layton      2019-05-29  311  		}
245ce991cca55e Jeff Layton      2019-05-29  312  
08796873a5183b Yan, Zheng       2019-01-09  313  		info->dir_pin = -ENODATA;
e7f72952508ac4 Yanhu Cao        2020-08-28  314  		/* info->snap_btime and info->rsnaps remain zero */
b37fe1f923fb4b Yan, Zheng       2019-01-09  315  	}
2f2dc053404feb Sage Weil        2009-10-06  316  	return 0;
2f2dc053404feb Sage Weil        2009-10-06  317  bad:
b37fe1f923fb4b Yan, Zheng       2019-01-09  318  	err = -EIO;
b37fe1f923fb4b Yan, Zheng       2019-01-09  319  out_bad:
2f2dc053404feb Sage Weil        2009-10-06  320  	return err;
2f2dc053404feb Sage Weil        2009-10-06  321  }
2f2dc053404feb Sage Weil        2009-10-06  322  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



Thread overview: 12+ messages
2025-12-03 15:46 [PATCH v3 0/4] ceph: add subvolume metrics reporting support Alex Markuze
2025-12-03 15:46 ` [PATCH v3 1/4] ceph: handle InodeStat v8 versioned field in reply parsing Alex Markuze
2025-12-03 15:46 ` [PATCH v3 2/4] ceph: parse subvolume_id from InodeStat v9 and store in inode Alex Markuze
2025-12-03 15:46 ` [PATCH v3 3/4] ceph: add subvolume metrics collection and reporting Alex Markuze
2025-12-03 15:46 ` [PATCH v3 4/4] ceph: adding CEPH_SUBVOLUME_ID_NONE Alex Markuze
2025-12-03 20:15   ` Viacheslav Dubeyko
2025-12-03 21:22     ` Alex Markuze
2025-12-03 22:54       ` Viacheslav Dubeyko
2025-12-04  8:18         ` Alex Markuze
2025-12-04 18:53           ` Viacheslav Dubeyko
2025-12-07 10:51             ` Alex Markuze
2025-12-12  0:14   ` kernel test robot
