From: "Darrick J. Wong" <djwong@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: linux-xfs@vger.kernel.org
Subject: [PATCH 2/2] design: document changes for the realtime refcount btree
Date: Thu, 20 Mar 2025 09:36:00 -0700
Message-ID: <20250320163600.GD2803749@frogsfrogsfrogs>
In-Reply-To: <20250320162836.GV89034@frogsfrogsfrogs>
From: Darrick J. Wong <djwong@kernel.org>
Update the ondisk format documentation to reflect the realtime refcount
btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
.../internal_inodes.asciidoc | 5 -
.../journaling_log.asciidoc | 10 +
design/XFS_Filesystem_Structure/magic.asciidoc | 3
.../XFS_Filesystem_Structure/ondisk_inode.asciidoc | 2
design/XFS_Filesystem_Structure/realtime.asciidoc | 6 -
.../XFS_Filesystem_Structure/rtrefcountbt.asciidoc | 172 ++++++++++++++++++++
6 files changed, 191 insertions(+), 7 deletions(-)
create mode 100644 design/XFS_Filesystem_Structure/rtrefcountbt.asciidoc
diff --git a/design/XFS_Filesystem_Structure/internal_inodes.asciidoc b/design/XFS_Filesystem_Structure/internal_inodes.asciidoc
index 9cfb2c29b1e6fe..551c799e4f9953 100644
--- a/design/XFS_Filesystem_Structure/internal_inodes.asciidoc
+++ b/design/XFS_Filesystem_Structure/internal_inodes.asciidoc
@@ -27,6 +27,7 @@ of those inodes have been deallocated and may be reused by future features.
| xref:Real-Time_Bitmap_Inode[Realtime Bitmap] | /rtgroups/*.bitmap
| xref:Real-Time_Summary_Inode[Realtime Summary] | /rtgroups/*.summary
| xref:Real_time_Reverse_Mapping_Btree[Realtime Reverse Mapping B+tree] | /rtgroups/*.rmap
+| xref:Real_time_Refcount_Btree[Realtime Reference Count B+tree] | /rtgroups/*.refcount
|=====
Metadata files are flagged by the +XFS_DIFLAG2_METADATA+ flag in the
@@ -301,4 +302,6 @@ xref:Real-Time_Bitmap_Inode[Bitmap Inode] and the
xref:Real-Time_Summary_Inode[Summary Inode].
Each realtime group can allocate one inode to managing a
-xref:Real_time_Reverse_Mapping_Btree[reverse-index of space] usage.
+xref:Real_time_Reverse_Mapping_Btree[reverse-index of space] usage, and
+a second one to manage xref:Real_time_Refcount_Btree[reference counts] of space
+usage.
diff --git a/design/XFS_Filesystem_Structure/journaling_log.asciidoc b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
index fe8a127aa9abe0..8d5f50d26308c9 100644
--- a/design/XFS_Filesystem_Structure/journaling_log.asciidoc
+++ b/design/XFS_Filesystem_Structure/journaling_log.asciidoc
@@ -586,8 +586,9 @@ struct xfs_cui_log_format {
----
*cui_type*::
-The signature of an CUI operation, 0x1242. This value is in host-endian order,
-not big-endian like the rest of XFS.
+The signature of a CUI operation, 0x1242. For a realtime CUI, this value is
+0x124e. This value is in host-endian order, not big-endian like the rest of
+XFS.
*cui_size*::
Size of this log item. Should be 1.
@@ -621,8 +622,9 @@ struct xfs_cud_log_format {
----
*cud_type*::
-The signature of an RUD operation, 0x1243. This value is in host-endian order,
-not big-endian like the rest of XFS.
+The signature of a CUD operation, 0x1243. For a realtime CUD, this value is
+0x124f. This value is in host-endian order, not big-endian like the rest of
+XFS.
*cud_size*::
Size of this log item. Should be 1.
diff --git a/design/XFS_Filesystem_Structure/magic.asciidoc b/design/XFS_Filesystem_Structure/magic.asciidoc
index 655f638304ec5d..f9e32df4f9f882 100644
--- a/design/XFS_Filesystem_Structure/magic.asciidoc
+++ b/design/XFS_Filesystem_Structure/magic.asciidoc
@@ -51,6 +51,7 @@ relevant chapters. Magic numbers tend to have consistent locations:
| +XFS_REFC_CRC_MAGIC+ | 0x52334643 | R3FC | xref:Reference_Count_Btree[Reference Count B+tree], v5 only
| +XFS_MD_MAGIC+ | 0x5846534d | XFSM | xref:Metadata_Dumps[Metadata Dumps]
| +XFS_RTSB_MAGIC+ | 0x46726F67 | Frog | xref:Realtime_Groups[Realtime Groups]
+| +XFS_RTREFC_CRC_MAGIC+ | 0x52434e54 | RCNT | xref:Real_time_Refcount_Btree[Real-Time Reference Count B+tree], v5 only
|=====
The magic numbers for log items are at offset zero in each log item, but items
@@ -82,6 +83,8 @@ are not aligned to blocks.
| +XFS_LI_EFD_RT+ | 0x124b | | xref:EFD_Log_Item[Extent Freeing Done Log Item]
| +XFS_LI_RUI_RT+ | 0x124c | | xref:RUI_Log_Item[Reverse Mapping Update Intent]
| +XFS_LI_RUD_RT+ | 0x124d | | xref:RUD_Log_Item[Reverse Mapping Update Done]
+| +XFS_LI_CUI_RT+ | 0x124e | | xref:CUI_Log_Item[Reference Count Update Intent]
+| +XFS_LI_CUD_RT+ | 0x124f | | xref:CUD_Log_Item[Reference Count Update Done]
|=====
= Theoretical Limits
diff --git a/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc b/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
index bd192e3a929281..ab4a503b4da6df 100644
--- a/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
+++ b/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
@@ -183,6 +183,7 @@ typedef enum xfs_dinode_fmt {
XFS_DINODE_FMT_BTREE,
XFS_DINODE_FMT_UUID,
XFS_DINODE_FMT_RMAP,
+ XFS_DINODE_FMT_REFCOUNT
} xfs_dinode_fmt_t;
----
@@ -205,6 +206,7 @@ enum xfs_metafile_type {
XFS_METAFILE_RTBITMAP,
XFS_METAFILE_RTSUMMARY,
XFS_METAFILE_RTRMAP,
+ XFS_METAFILE_RTREFCOUNT,
};
----
diff --git a/design/XFS_Filesystem_Structure/realtime.asciidoc b/design/XFS_Filesystem_Structure/realtime.asciidoc
index ab47a12a50f46b..16641525e20128 100644
--- a/design/XFS_Filesystem_Structure/realtime.asciidoc
+++ b/design/XFS_Filesystem_Structure/realtime.asciidoc
@@ -14,8 +14,7 @@ By placing the real time device (and the journal) on separate high-performance
storage devices, it is possible to reduce most of the unpredictability in I/O
response times that come from metadata operations.
-None of the XFS per-AG B+trees are involved with real time files. It is not
-possible for real time files to share data blocks.
+None of the XFS per-AG B+trees are involved with real time files.
[[Real-Time_Bitmap_Inode]]
== Free Space Bitmap Inode
@@ -312,6 +311,7 @@ Each realtime group has the following characteristics:
* Free space bitmap
* Summary of free space
* Reverse space mapping btree
+ * Reference count btree
The free space metadata are the same as described in the previous sections,
except that their scope covers only a single rtgroup. The other structures are
@@ -395,3 +395,5 @@ meta_uuid = 7e55b909-8728-4d69-a1fa-891427314eea
----
include::rtrmapbt.asciidoc[]
+
+include::rtrefcountbt.asciidoc[]
diff --git a/design/XFS_Filesystem_Structure/rtrefcountbt.asciidoc b/design/XFS_Filesystem_Structure/rtrefcountbt.asciidoc
new file mode 100644
index 00000000000000..98639928ca19ce
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/rtrefcountbt.asciidoc
@@ -0,0 +1,172 @@
+[[Real_time_Refcount_Btree]]
+=== Reference Count B+tree
+
+If the reflink and real-time storage device features are enabled, each
+real-time group has its own reference count B+tree.
+
+As mentioned in the chapter about xref:Reflink_Deduplication[sharing data
+blocks], this data structure is necessary to track how many times each extent
+in the realtime volume has been mapped. This is how the copy-on-write code
+determines what to do when a realtime file is written.
+
+This B+tree is only present if the +XFS_SB_FEAT_RO_COMPAT_REFLINK+ feature is
+enabled and a real time device is present. The feature requires a version 5
+filesystem.
+
+The rtgroup reference count B+tree is rooted in an inode's data fork; the inode
+number can be found by resolving the path +/rtgroups/$rgno.refcount+ in the
+metadata directory tree. The B+tree blocks themselves are stored in the
+regular filesystem. The structures used for an inode's B+tree root are:
+
+[source, c]
+----
+struct xfs_rtrefcount_root {
+ __be16 bb_level;
+ __be16 bb_numrecs;
+};
+----
+
+* If the B+tree contains only a single level, the ondisk data fork area begins
+with a +xfs_rtrefcount_root+ header followed by an array of +xfs_refcount_rec+
+leaf records.
+
+* Otherwise, the ondisk data fork area begins with the +xfs_rtrefcount_root+
+header and is followed first by an array of +xfs_refcount_key+ values and then
+an array of +xfs_rtrefcount_ptr_t+ values. The size of both arrays is
+specified by the header's +bb_numrecs+ value.
+
+* The root node in the inode can only contain up to 28 leaf records or
+key/pointer pairs for a standard 512 byte inode before a new level of nodes is
+added between the root and the leaves.
+
+Each record in an rtgroup reference count B+tree has the same structure as a
+record in an AG reference count B+tree:
+
+[source, c]
+----
+struct xfs_refcount_rec {
+ __be32 rc_startblock;
+ __be32 rc_blockcount;
+ __be32 rc_refcount;
+};
+----
+
+*rc_startblock*::
+rtgroup block number of this record. Note that reference count records are
+tracked in units of realtime blocks, not realtime extents.
+However, records must be aligned to the realtime extent size in accordance with
+the existing realtime extent handling strategy. The high bit
+(+XFS_REFC_COW_FLAG+) is set for all records referring to an extent that is
+being used to stage a copy on write operation. This reduces recovery time
+during mount operations. The reference count of these staging extents must
+be exactly 1.
+
+*rc_blockcount*::
+The length of this extent, in filesystem blocks.
+
+*rc_refcount*::
+Number of times this extent has been shared.
+
+The key has the following structure:
+
+[source, c]
+----
+struct xfs_refcount_key {
+ __be32 rc_startblock;
+};
+----
+
+* All block numbers are 32-bit rtgroup device block numbers, though the
+key should be aligned to the realtime extent size.
+
+* The +bb_magic+ value is ``RCNT'' (0x52434e54).
+
+* The +struct xfs_btree_lblock+ header is used for intermediate B+tree nodes as
+well as the leaves.
+
+==== xfs_db rtrefcountbt Example
+
+This example shows a real-time reference count B+tree from a freshly
+populated filesystem. One directory tree has been reflinked:
+
+----
+xfs_db> path -m /rtgroups/0.refcount
+xfs_db> p
+core.magic = 0x494e
+core.mode = 0100000
+core.version = 3
+core.format = 6 (refcount)
+...
+v3.inumber = 134
+v3.uuid = 23d157a4-8ca7-4fca-8782-637dc6746105
+v3.reflink = 0
+v3.cowextsz = 0
+v3.dax = 0
+v3.bigtime = 1
+v3.nrext64 = 1
+v3.metadata = 1
+u3.rtrefcbt.level = 1
+u3.rtrefcbt.numrecs = 2
+u3.rtrefcbt.keys[1-2] = [startblock,cowflag]
+1:[4,0]
+2:[344,0]
+u3.rtrefcbt.ptrs[1-2] = 1:8 2:9
+----
+
+Notice that this is a two-level refcount btree; we must continue towards the
+leaf level.
+
+----
+xfs_db> addr u3.rtrefcbt.ptrs[2]
+xfs_db> p
+magic = 0x52434e54
+level = 0
+numrecs = 170
+leftsib = 8
+rightsib = null
+bno = 72
+lsn = 0
+uuid = 23d157a4-8ca7-4fca-8782-637dc6746105
+owner = 134
+crc = 0x21e04c3 (correct)
+recs[1-170] = [startblock,blockcount,refcount,cowflag]
+1:[344,1,2,0]
+2:[346,1,2,0]
+3:[348,1,2,0]
+4:[350,1,2,0]
+5:[352,1,2,0]
+6:[354,1,2,0]
+...
+----
+
+This indicates that realtime block 354 is shared. Let's use the realtime
+reverse mapping information to find which files are sharing these blocks:
+
+----
+xfs_db> fsmap -r 354 354
+0: 0/1 len 682 owner 10015 offset 0 bmbt 0 attrfork 0 extflag 0
+1: 0/354 len 1 owner 10014 offset 353 bmbt 0 attrfork 0 extflag 0
+----
+
+It looks as though inodes 10,014 and 10,015 share this block. Let us confirm
+this by navigating to those inodes and dumping the data fork mappings:
+
+----
+xfs_db> inode 10015
+xfs_db> p core.realtime
+core.realtime = 1
+xfs_db> bmap
+data offset 0 startblock 1 (0/1) count 682 flag 0
+xfs_db> inode 10014
+xfs_db> p core.realtime
+core.realtime = 1
+xfs_db> bmap 350 10
+data offset 351 startblock 352 (0/352) count 1 flag 0
+data offset 353 startblock 354 (0/354) count 1 flag 0
+data offset 355 startblock 356 (0/356) count 1 flag 0
+data offset 357 startblock 358 (0/358) count 1 flag 0
+data offset 359 startblock 360 (0/360) count 1 flag 0
+----
+
+Notice that both inodes have their realtime flags set, and both of them map
+a data fork extent to the same realtime block 354.
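
As a reading aid for the record format documented in rtrefcountbt.asciidoc
above, here is a minimal C sketch of how a 12-byte ondisk record could be
decoded and sanity-checked. It is not part of the patch and is not the kernel
or xfsprogs implementation. The names RTREFC_COW_FLAG, RTREFC_MAGIC,
struct rtrefc_rec_decoded, get_be32, decode_rtrefc_rec, rtrefc_rec_plausible
and rtrefc_magic_ok are hypothetical, and the rextsize parameter is assumed to
be the realtime extent size in blocks. Only the rules stated in the text are
checked: the high bit of rc_startblock is the CoW staging flag, staging records
carry a reference count of 1, and start blocks are aligned to the realtime
extent size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local names for illustration; not the kernel's definitions. */
#define RTREFC_COW_FLAG (UINT32_C(1) << 31)   /* high bit of rc_startblock */
#define RTREFC_MAGIC    UINT32_C(0x52434e54)  /* "RCNT" */

/* Host-endian view of one ondisk struct xfs_refcount_rec. */
struct rtrefc_rec_decoded {
	uint32_t startblock;  /* rtgroup block number, CoW flag stripped */
	uint32_t blockcount;  /* length of the extent in blocks */
	uint32_t refcount;    /* reference count of the extent */
	int      is_cow;      /* nonzero if the CoW staging flag was set */
};

/* Read a big-endian 32-bit value from a byte buffer. */
static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode the three __be32 fields and split the flag from the block number. */
static struct rtrefc_rec_decoded decode_rtrefc_rec(const uint8_t raw[12])
{
	struct rtrefc_rec_decoded r;
	uint32_t start = get_be32(raw);

	r.is_cow = (start & RTREFC_COW_FLAG) != 0;
	r.startblock = start & ~RTREFC_COW_FLAG;
	r.blockcount = get_be32(raw + 4);
	r.refcount = get_be32(raw + 8);
	return r;
}

/*
 * Check only the rules stated in the documentation text: a nonzero length,
 * a start block aligned to the realtime extent size, and a reference count
 * of exactly 1 for CoW staging records. Returns 1 if all checks pass.
 */
static int rtrefc_rec_plausible(const struct rtrefc_rec_decoded *r,
				uint32_t rextsize)
{
	if (rextsize == 0 || r->blockcount == 0)
		return 0;
	if (r->startblock % rextsize != 0)
		return 0;
	if (r->is_cow && r->refcount != 1)
		return 0;
	return 1;
}

/* bb_magic is the first field of the block header, stored big-endian. */
static int rtrefc_magic_ok(const uint8_t *block)
{
	return get_be32(block) == RTREFC_MAGIC;
}
```

Feeding record 6 of the xfs_db leaf dump above (bytes 00 00 01 62, 00 00 00 01,
00 00 00 02) through decode_rtrefc_rec() would yield startblock 354,
blockcount 1, refcount 2 and a clear CoW flag, matching the [354,1,2,0] line.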