[RFC] Directly mapped xattr data & fs-verity

public inbox for linux-xfs@vger.kernel.org

* [RFC] Directly mapped xattr data & fs-verity
@ 2024-12-29 13:33 Andrey Albershteyn
  2024-12-29 13:35 ` [PATCH] xfs: direct mapped xattrs design documentation Andrey Albershteyn
                   ` (4 more replies)
  0 siblings, 5 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:33 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

Hi all,

This is a somewhat different version of fsverity for XFS from the one Darrick
sent last time.

This is mostly a half-baked prototype; I'm sending it out to get feedback and
suggestions and to discuss the design. If you think this approach is fine, I
can go further with polishing it.

I haven't done anything for repair/scrub yet, haven't tested how old xattr
leaves would work with these changes, and fsverity doesn't work with block
size != page size yet...

The reasoning behind the iomap interface is that it could possibly be used for
other page cache related processing such as fscrypt or compression, by carving
out a few more regions for those.

The attribute design is described in the design doc.

This patchset consists of four parts:
1. Design document which describes directly mapped xattr data (dxattr) and how
   fsverity uses it
2. Patchset with iomap_read_region()/iomap_write_region() interface to write
   beyond EOF
3. Patchset implementing dxattr interface
4. Patchset implementing fsverity using dxattr. It uses one xattr per merkle
   tree block and does not use the da tree address space.

Thanks
Andrey

-- 
2.47.0


* [PATCH] xfs: direct mapped xattrs design documentation
  2024-12-29 13:33 [RFC] Directly mapped xattr data & fs-verity Andrey Albershteyn
@ 2024-12-29 13:35 ` Andrey Albershteyn
  2025-01-07  1:41   ` Darrick J. Wong
  2024-12-29 13:36 ` [PATCH 0/2] Introduce iomap interface to work with regions beyond EOF Andrey Albershteyn
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:35 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Direct mapped xattrs are a form of remote xattr that don't contain
internal self describing metadata. Hence the xattr data can be
directly mapped into page cache pages by iomap infrastructure
without needing to go through the XFS buffer cache.

This functionality allows XFS to implement fsverity data checksum
information externally to the file data, but interact with XFS data
checksum storage through the existing page cache interface.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 .../xfs/xfs-direct-mapped-xattr-design.rst    | 304 ++++++++++++++++++
 1 file changed, 304 insertions(+)
 create mode 100644 Documentation/filesystems/xfs/xfs-direct-mapped-xattr-design.rst

diff --git a/Documentation/filesystems/xfs/xfs-direct-mapped-xattr-design.rst b/Documentation/filesystems/xfs/xfs-direct-mapped-xattr-design.rst
new file mode 100644
index 000000000000..a0efa9546eca
--- /dev/null
+++ b/Documentation/filesystems/xfs/xfs-direct-mapped-xattr-design.rst
@@ -0,0 +1,304 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+XFS Direct Mapped Extended Attributes
+=====================================
+
+Background
+==========
+
+We need to support fsverity in XFS. An attempt to support fsverity
+using named remote xattrs has already been made, but the complexity of the
+solution has made acceptance of that implementation .... challenging.
+
+The fundamental problem caused by using remote xattr blocks to store the
+fsverity checksum data is that fsverity stores the data as opaque filesystem
+block sized chunks and accesses them directly through a page cache based
+interface.
+
+These filesystem block sized chunks do not fit neatly into remote xattr blocks
+because the remote xattr blocks have metadata headers in them containing self
+describing metadata and CRCs for the metadata. Hence filesystem block sized data
+now spans two filesystem blocks and is not contiguous on disk (it is split up
+by headers).
+
+The fsverity interfaces require page aligned data blocks, which then requires
+copying the fsverity data out of the xattr buffers into a separate bounce buffer
+which is then cached independently of the xattr buffer. IOWs, we have to double
+buffer the fsverity checksum data and that costs a lot in terms of complexity.
+
+Because the remote xattr data is also using the generic xattr read code, it
+requires use of xfs metadata buffers to read and write it to disk. Hence we have
+block based IO requirements for the fsverity data, not page based IO
+requirements. Hence there is an impedance mismatch between the fsverity IO model
+and the XFS xattr IO model as well.
+
+
+Directories, Xattrs and dabtrees
+================================
+
+Directories in XFS are complex - they maintain two separate lookup indexes to
+the dirent data because we have to optimise for two different types of access.
+We also have dirent stability requirements for seek operations.
+
+Hence directories have an offset indexed data segment that holds dirent data,
+with each individual dirent at an offset that is fixed for the life of the
+dirent. This provides dirent stability as the offset of the dirent can be used
+as a seek cookie. This offset indexed data segment is optimised for dirent
+reading by readdir/getdents which iterates sequentially through the dirent data
+and requires stable cookies for iteration continuation.
+
+Directories must also support lookup by name - path based directory lookups need
+to find the dirent by name rather than by offset. This is implemented by the
+dabtree in the directory. It stores name hashes and the leaf records for a
+specific name hash point to the offset of the dirent with that hash. Hence name
+based lookups are fast and efficient when directed through the dabtree.
+Importantly, the dabtree does not store dirent data, it simply provides a
+pointer to the external dirent: the stable offset of the dirent in the offset
+indexed data segment.
+
+In comparison, the attr fork dabtree only has one index - the name hash based
+dabtree. Everything stored in the xattr fork needs to be named and the record
+for the xattr data is indexed by the hash of that name. As there is no external
+stable offset based index data segment, data that does not fit inline in the
+xattr leaf record gets stored in a dynamically allocated remote xattr extent.
+The remote extent is created at the first large enough hole in the xattr address space,
+so the remote xattr data does not get allocated sequentially on disk.
+
+Further, because everything is name hash indexed, sequential offset indexed data
+is not going to hash down to sequential record indexes in the dabtree. Hence
+access to offset index based xattr data is not going to be sequential in either
+record lookup patterns or xattr data read patterns. This isn't a huge issue
+as the dabtree blocks rapidly get cached, but it does consume more CPU time
+than doing equivalent sequential offset based access in the directory structure.
+
+Darrick Wong pondered whether it would help to create a sequentially
+indexed segment in the xattr fork for the merkle tree blocks when discussing
+better ways to handle fsverity metadata. This document is intended to flesh out
+that concept into something that is usable by mutable data checksum
+functionality.
+
+
+fsverity is just data checksumming
+==================================
+
+I had a recent insight into fsverity when discussing what to do with the
+fsverity code with Andrey. That insight came from realising that all fsverity
+was doing is recording a per-filesystem block checksum and then validating
+it on read. While this might seem obvious now that I say it, the previous
+approach we took was all about fsverity needing to read and write opaque blocks
+of individually accessed metadata.
+
+Storing opaque, externally defined metadata is what xattrs are intended to be
+used for, and that drove the original design. i.e. a fsverity merkle tree block
+was just another named xattr object that encoded the tree index in the xattr
+name. Simple and straightforward from the xattr POV, but all the complexity
+arose in translating the xattr data into a form that fsverity could use.
+
+Fundamentally, however, the merkle tree blocks just contain data checksums.
+Yes, they are complex, cryptographically secure data checksums, but the
+fundamental observation is that there is a direct relationship between the file
+data at a given offset and the contents of the merkle tree block at a given tree
+index.
+
+fsverity has a mechanism to calculate the checksums from the file data and store
+them in filesystem blocks, hence providing external checksum storage for the
+file data. It also has a mechanism to read the external checksum storage and
+compare that against the calculated checksum of the data. Hence fsverity is just
+a fancy way of saying the filesystem provides "tamper proof read-only data
+checksum verification".
+
+But what if we want data checksums for normal read-write data files to be able
+to detect bit-rot in data at rest?
+
+
+Direct Mapped Xattr Data
+========================
+
+fsverity really wants to read and write its checksum data through the page
+cache. To do this efficiently, we need to store the fsverity metadata in block
+aligned data storage. We don't have that capability in XFS xattrs right now, and
+this is what we really want/need for data checksum storage. There are two ways
+of doing direct mapped xattr data.
+
+A New Remote Xattr Record Format
+--------------------------------
+
+The first way we can do direct mapped xattr data is to change the format of the
+remote xattr. The remote xattr header currently looks like this:
+
+.. code-block:: c
+
+	typedef struct xfs_attr_leaf_name_remote {
+		__be32  valueblk;               /* block number of value bytes */
+		__be32  valuelen;               /* number of bytes in value */
+		__u8    namelen;                /* length of name bytes */
+		/*
+		 * In Linux 6.5 this flex array was converted from name[1] to name[].
+		 * Be very careful here about extra padding at the end; see
+		 * xfs_attr_leaf_entsize_remote() for details.
+		 */
+		__u8    name[];                 /* name bytes */
+	} xfs_attr_leaf_name_remote_t;
+
+It stores the location and size of the remote xattr data as a filesystem block
+offset into the attr fork, along with the xattr name. The remote xattr block
+then contains this self describing header:
+
+.. code-block:: c
+
+	struct xfs_attr3_rmt_hdr {
+		__be32  rm_magic;
+		__be32  rm_offset;
+		__be32  rm_bytes;
+		__be32  rm_crc;
+		uuid_t  rm_uuid;
+		__be64  rm_owner;
+		__be64  rm_blkno;
+		__be64  rm_lsn;
+	};
+
+This is the self describing metadata that we use to validate the xattr data
+block is what it says it is, and this is the cause of the unaligned remote xattr
+data.
+
+The key field in the self describing metadata is ``rm_crc``. This
+contains the CRC of the xattr data, and that tells us that the contents of the
+xattr data block are the same as what we wrote to disk. Everything else in
+this header is telling us who the block belongs to, its expected size and
+when and where it was written to. This is far less critical to detecting storage
+errors than the CRC.
+
+Hence if we drop this header and move the ``rm_crc`` field to the ``struct
+xfs_attr_leaf_name_remote``, we can still check that the xattr data has not
+been changed since we wrote the data to storage. If we have rmap enabled we
+have external tracking of the owner for the xattr data block, as well as the
+offset into the xattr data fork. ``rm_lsn`` is largely meaningless for remote
+xattrs because the data is written synchronously before the dabtree remote
+record is committed to the journal.
+
+Hence we can drop the headers from the remote xattr data blocks and not really
+lose much in the way of robustness or recovery capability when rmap is enabled. This
+means all the xattr data is now filesystem block aligned, and this enables us to
+directly map the xattr data blocks for read IO.
+
+However, we can't easily directly map this xattr data for write operations. The
+xattr record header contains the CRC, and this needs to be updated
+transactionally. We can't update this before we do a direct mapped xattr write,
+because we have to do the write before we can recalculate the CRC. We can't do
+it after the write, because then we can have the transaction commit before the
+direct mapped data write IO is submitted and completed. This means recovery
+would result in a CRC mismatch for that data block. And we can't do it after the
+data write IO completes, because if we crash before the transaction is committed
+to the journal we again have a CRC mismatch.
+
+This is made more complex because we don't allow xattr headers to be
+re-written. Currently an update to an xattr requires a "remove and recreate"
+operation to ensure that the update is atomic w.r.t. remote xattr data changes.
+
+One approach which we can take is to use two CRCs - for old and new xattr data.
+``rm_crc`` becomes ``rm_crc[2]`` and xattr gains new bit flag
+``XFS_ATTR_RMCRC_SEL_BIT``. This bit defines which of the two fields contains
+the primary CRC. When we write a new CRC, we write it into the secondary
+``rm_crc[]`` field (i.e. the one the bit does not point to). When the data IO
+completes, we toggle the bit so that it points at the new primary value.
+
+If the primary does not match but the secondary does, we can flag an online
+repair operation to run at the completion of whatever operation read the xattr
+data to correct the state of the ``XFS_ATTR_RMCRC_SEL_BIT``.
+
+If neither CRC matches, then we have an -EFSCORRUPTED situation, and that needs
+to mark the attr fork as sick (which brings it to the attention of scrub/repair)
+and return -EFSCORRUPTED to the reader.
+
+Offset-based Xattr Indexing Segments
+------------------------------------
+
+The other mechanism for direct mapping xattr data is to introduce an offset
+indexed segment similar to the directory data segment. The xattr data fork uses
+32 bit filesystem block addressing, so on a 4kB block size filesystem we can
+index 16TB of address space. A dabtree that indexes this amount of data would be
+massive and quite inefficient, and we'd likely be hitting against the maximum
+extent count limits for the attr fork at this point, anyway (32 bit extent
+count).
+
+Hence taking half the address space (the upper 8TB) isn't going to cause any
+significant limitations on what we can store in the existing attr fork dabtree.
+It does, however, provide us with a significant amount of data address space we
+can use for offset indexed xattr data. We can even split this upper region
+into multiple segments so that it can have multiple separate sets of data and
+even dabtrees to index them.
+
+At this point in time, I do not see a need for these xattr segments to be
+directly accessible from userspace through existing xattr interfaces. If there
+is need for the data the kernel stores in an xattr data segment to be exposed to
+userspace, we can add the necessary interfaces when they are required.
+
+For the moment, let's first concentrate on what is needed in kernel space for
+the fsverity merkle tree blocks.
+
+
+Fsverity Data Segment
+`````````````````````
+
+Let's assume we have carved out a section of the inode address space for fsverity
+metadata. fsverity only supports file sizes up to 4TB (see `thread
+<https://lore.kernel.org/linux-xfs/Y5rDCcYGgH72Wn%2Fe@sol.localdomain/>`_)
+and so at a 4kB block size and 128 bytes per fsb the amount of addressing space
+needed for fsverity is a bit over 128GiB. Hence we could carve out a fixed
+256GiB address space segment just for fsverity data if we need to.
+
+When fsverity measures the file and creates the merkle tree block, it requires
+the filesystem to persistently record that the inode is undergoing measurement. It
+also then tells the filesystem when measurement is complete so that the
+filesystem can remove the "under measurement" flag and seal the inode as
+fsverity protected.
+
+Hence with these persistent notifications, we don't have to care about
+persistent creation of the merkle tree data. As long as it has been written back
+before we seal the inode with a synchronous transaction, the merkle tree data
+will be stable on disk before the seal is written to the journal thanks to the
+cache flushes issued before the journal commit starts.
+
+This also means that we don't have to care about what is in the fsverity segment
+when measurement is started - we can just punch out what is already there (e.g.
+debris from a failed measurement) as the measurement process will rewrite
+the entire segment contents from scratch.
+
+Ext4 does this write process via the page cache into the inode's mapping. It
+operates at the aops level directly, but that won't work for XFS as we use iomap
+for buffered IO. Hence we need to call through iomap to map the disk space
+and allocate the page cache pages for the merkle tree data to be written.
+
+This will require us to provide an iomap_ops structure with an ->iomap_begin
+method that allocates and maps blocks from the attr fork fsverity data segment.
+We don't care what file offset the iomap code chooses to cache the folios that
+are inserted into the page cache, all we care about is that we are passed the
+merkle tree block position that it needs to be stored at.
+
+This will require iomap to be aware that it is mapping external metadata rather
+than normal file data so that it can offset the page cache index it uses for
+this data appropriately. The writeback also needs to know that it's working with
+fsverity folios past EOF. This requires changes to how those folios are mapped
+as they are indexed by xattr dabtree. The differentiation factor will be the
+fact that only merkle tree data can be written while the inode is under fsverity
+initialization, or the filesystem can also check if the page is in the "fsverity"
+region of the page cache.
+
+The writeback mapping of these specially marked merkle tree folios should be, at
+this point, relatively trivial. We will need to call fsverity ->map_blocks
+callback to map the fsverity address space rather than the file data address
+space, but other than that the process of allocating space and mapping it is
+largely identical to the existing data fork allocation code. We can even use
+delayed allocation to ensure the merkle tree data is as contiguous as possible.
+
+The read side is less complex as all it needs to do is map blocks directly from
+the fsverity address space. We can read from the region intended for the
+fsverity metadata, then ->iomap_begin will map this request to xattr data blocks
+instead of file blocks.
+
+Therefore, we can have something like iomap_read_region() and
+iomap_write_region() to know that we are writing metadata and no filesize or
+any other data related checks need to be done. This interface will take normal
+IO arguments and an offset of the region allowing the filesystem to read relative to
+this offset.
-- 
2.47.0



* [PATCH 0/2] Introduce iomap interface to work with regions beyond EOF
  2024-12-29 13:33 [RFC] Directly mapped xattr data & fs-verity Andrey Albershteyn
  2024-12-29 13:35 ` [PATCH] xfs: direct mapped xattrs design documentation Andrey Albershteyn
@ 2024-12-29 13:36 ` Andrey Albershteyn
  2024-12-29 13:36   ` [PATCH 1/2] iomap: add iomap_writepages_unbound() to write " Andrey Albershteyn
  2024-12-29 13:36   ` [PATCH 2/2] iomap: introduce iomap_read/write_region interface Andrey Albershteyn
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:36 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

iomap_read/write interface without EOF bound.
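
As a rough illustration of how a region beyond EOF is addressed in the page
cache, here is a small userspace sketch. It mirrors only the
`(region->pos | region->offset) >> PAGE_SHIFT` computation from patch 2/2.
The names MODEL_PAGE_SHIFT, MODEL_REGION_OFFSET, region_cache_pos() and
region_cache_index() are hypothetical, and the region base value is an
assumption picked for the example, not something the patches define.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4kB pages. */
#define MODEL_PAGE_SHIFT	12

/* Hypothetical region base, far beyond any realistic EOF. */
#define MODEL_REGION_OFFSET	(1ULL << 53)

/*
 * Combine an IO position inside the region with the region base.
 * OR gives the same result as addition only when the two values
 * share no set bits, so the sketch checks that explicitly.
 */
static uint64_t region_cache_pos(uint64_t pos, uint64_t region_offset)
{
	assert((pos & region_offset) == 0);
	return pos | region_offset;
}

/* Page cache index of the page that backs @pos inside the region. */
static uint64_t region_cache_index(uint64_t pos, uint64_t region_offset)
{
	return region_cache_pos(pos, region_offset) >> MODEL_PAGE_SHIFT;
}
```

With these assumptions, position 8192 in the region lands two pages above
the region base in the page cache, well away from the file data pages.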

--
Andrey

Andrey Albershteyn (2):
  iomap: add iomap_writepages_unbound() to write beyond EOF
  iomap: introduce iomap_read/write_region interface

 fs/iomap/buffered-io.c | 158 +++++++++++++++++++++++++++++++++++++----
 include/linux/iomap.h  |  17 +++++
 2 files changed, 161 insertions(+), 14 deletions(-)

-- 
2.47.0



* [PATCH 1/2] iomap: add iomap_writepages_unbound() to write beyond EOF
  2024-12-29 13:36 ` [PATCH 0/2] Introduce iomap interface to work with regions beyond EOF Andrey Albershteyn
@ 2024-12-29 13:36   ` Andrey Albershteyn
  2024-12-29 17:54     ` kernel test robot
  2024-12-29 21:36     ` kernel test robot
  2024-12-29 13:36   ` [PATCH 2/2] iomap: introduce iomap_read/write_region interface Andrey Albershteyn
  1 sibling, 2 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:36 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

Add iomap_writepages_unbound(), which is not limited by EOF. XFS
will use this to write back extended attributes (fs-verity Merkle
tree) in a range far beyond EOF.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/iomap/buffered-io.c | 55 ++++++++++++++++++++++++++++++++----------
 include/linux/iomap.h  |  4 +++
 2 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 955f19e27e47..61ec924c5b80 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -979,13 +979,13 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		 * unlock and release the folio.
 		 */
 		old_size = iter->inode->i_size;
-		if (pos + written > old_size) {
+		if (!(iter->flags & IOMAP_NOSIZE) && (pos + written > old_size)) {
 			i_size_write(iter->inode, pos + written);
 			iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
 		}
 		__iomap_put_folio(iter, pos, written, folio);
 
-		if (old_size < pos)
+		if (!(iter->flags & IOMAP_NOSIZE) && (old_size < pos))
 			pagecache_isize_extended(iter->inode, old_size, pos);
 
 		cond_resched();
@@ -1918,18 +1918,10 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	int error = 0;
 	u32 rlen;
 
-	WARN_ON_ONCE(!folio_test_locked(folio));
-	WARN_ON_ONCE(folio_test_dirty(folio));
-	WARN_ON_ONCE(folio_test_writeback(folio));
-
-	trace_iomap_writepage(inode, pos, folio_size(folio));
-
-	if (!iomap_writepage_handle_eof(folio, inode, &end_pos)) {
-		folio_unlock(folio);
-		return 0;
-	}
 	WARN_ON_ONCE(end_pos <= pos);
 
+	trace_iomap_writepage(inode, pos, folio_size(folio));
+
 	if (i_blocks_per_folio(inode, folio) > 1) {
 		if (!ifs) {
 			ifs = ifs_alloc(inode, folio, 0);
@@ -1992,6 +1984,23 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	return error;
 }
 
+/* Map pages bound by EOF */
+static int iomap_writepage_map_eof(struct iomap_writepage_ctx *wpc,
+		struct writeback_control *wbc, struct folio *folio)
+{
+	int error;
+	struct inode *inode = folio->mapping->host;
+	u64 end_pos = folio_pos(folio) + folio_size(folio);
+
+	if (!iomap_writepage_handle_eof(folio, inode, &end_pos)) {
+		folio_unlock(folio);
+		return 0;
+	}
+
+	error = iomap_writepage_map(wpc, wbc, folio);
+	return error;
+}
+
 int
 iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
 		struct iomap_writepage_ctx *wpc,
@@ -2008,12 +2017,32 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
 			PF_MEMALLOC))
 		return -EIO;
 
+	wpc->ops = ops;
+	while ((folio = writeback_iter(mapping, wbc, folio, &error))) {
+		WARN_ON_ONCE(!folio_test_locked(folio));
+		WARN_ON_ONCE(folio_test_dirty(folio));
+		WARN_ON_ONCE(folio_test_writeback(folio));
+
+		error = iomap_writepage_map_eof(wpc, wbc, folio);
+	}
+	return iomap_submit_ioend(wpc, error);
+}
+EXPORT_SYMBOL_GPL(iomap_writepages);
+
+int
+iomap_writepages_unbound(struct address_space *mapping, struct writeback_control *wbc,
+		struct iomap_writepage_ctx *wpc,
+		const struct iomap_writeback_ops *ops)
+{
+	struct folio *folio = NULL;
+	int error;
+
 	wpc->ops = ops;
 	while ((folio = writeback_iter(mapping, wbc, folio, &error)))
 		error = iomap_writepage_map(wpc, wbc, folio);
 	return iomap_submit_ioend(wpc, error);
 }
-EXPORT_SYMBOL_GPL(iomap_writepages);
+EXPORT_SYMBOL_GPL(iomap_writepages_unbound);
 
 static int __init iomap_buffered_init(void)
 {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 5675af6b740c..3bfd3035ac28 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -181,6 +181,7 @@ struct iomap_folio_ops {
 #define IOMAP_DAX		(1 << 8) /* DAX mapping */
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
 #define IOMAP_ATOMIC		(1 << 9)
+#define IOMAP_NOSIZE		(1 << 10) /* Don't update in-memory inode size */
 
@@ -390,6 +391,9 @@ void iomap_sort_ioends(struct list_head *ioend_list);
 int iomap_writepages(struct address_space *mapping,
 		struct writeback_control *wbc, struct iomap_writepage_ctx *wpc,
 		const struct iomap_writeback_ops *ops);
+int iomap_writepages_unbound(struct address_space *mapping,
+		struct writeback_control *wbc, struct iomap_writepage_ctx *wpc,
+		const struct iomap_writeback_ops *ops);
 
 /*
  * Flags for direct I/O ->end_io:
-- 
2.47.0



* [PATCH 2/2] iomap: introduce iomap_read/write_region interface
  2024-12-29 13:36 ` [PATCH 0/2] Introduce iomap interface to work with regions beyond EOF Andrey Albershteyn
  2024-12-29 13:36   ` [PATCH 1/2] iomap: add iomap_writepages_unbound() to write " Andrey Albershteyn
@ 2024-12-29 13:36   ` Andrey Albershteyn
  1 sibling, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:36 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

Add an interface for reading and writing pages beyond EOF in an offset
region of the page cache.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/iomap/buffered-io.c | 103 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/iomap.h  |  13 ++++++
 2 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 61ec924c5b80..0f33ac975209 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -325,6 +325,7 @@ struct iomap_readpage_ctx {
 	bool			cur_folio_in_bio;
 	struct bio		*bio;
 	struct readahead_control *rac;
+	int			flags;
 };
 
 /**
@@ -363,7 +364,8 @@ static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter,
 
 	return srcmap->type != IOMAP_MAPPED ||
 		(srcmap->flags & IOMAP_F_NEW) ||
-		pos >= i_size_read(iter->inode);
+		(pos >= i_size_read(iter->inode) &&
+		 !(srcmap->flags & IOMAP_F_BEYOND_EOF));
 }
 
 static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
@@ -2044,6 +2046,105 @@ iomap_writepages_unbound(struct address_space *mapping, struct writeback_control
 }
 EXPORT_SYMBOL_GPL(iomap_writepages_unbound);
 
+struct folio *
+iomap_read_region(struct ioregion *region)
+{
+	struct inode *inode = region->inode;
+	fgf_t fgp = FGP_CREAT | FGP_LOCK | fgf_set_order(region->length);
+	pgoff_t index = (region->pos | region->offset) >> PAGE_SHIFT;
+	struct folio *folio = __filemap_get_folio(inode->i_mapping, index, fgp,
+				    mapping_gfp_mask(inode->i_mapping));
+	struct iomap_readpage_ctx ctx = {
+		.cur_folio = folio,
+	};
+	struct iomap_iter iter = {
+		.inode = inode,
+		.pos = folio_pos(folio),
+		.len = folio_size(folio),
+	};
+	int ret;
+
+	if (folio_test_uptodate(folio)) {
+		folio_unlock(folio);
+		return folio;
+	}
+
+	while ((ret = iomap_iter(&iter, region->ops)) > 0)
+		iter.processed = iomap_readpage_iter(&iter, &ctx, 0);
+
+	if (ret < 0) {
+		folio_unlock(folio);
+		return ERR_PTR(ret);
+	}
+
+	if (ctx.bio) {
+		submit_bio(ctx.bio);
+		WARN_ON_ONCE(!ctx.cur_folio_in_bio);
+	} else {
+		WARN_ON_ONCE(ctx.cur_folio_in_bio);
+		folio_unlock(folio);
+	}
+
+	return folio;
+}
+EXPORT_SYMBOL_GPL(iomap_read_region);
+
+static loff_t iomap_write_region_iter(struct iomap_iter *iter, const void *buf)
+{
+	loff_t pos = iter->pos;
+	loff_t length = iomap_length(iter);
+	loff_t written = 0;
+
+	do {
+		struct folio *folio;
+		int status;
+		size_t offset;
+		size_t bytes = min_t(u64, SIZE_MAX, length);
+		bool ret;
+
+		status = iomap_write_begin(iter, pos, bytes, &folio);
+		if (status)
+			return status;
+		if (iter->iomap.flags & IOMAP_F_STALE)
+			break;
+
+		offset = offset_in_folio(folio, pos);
+		if (bytes > folio_size(folio) - offset)
+			bytes = folio_size(folio) - offset;
+
+		memcpy_to_folio(folio, offset, buf, bytes);
+
+		ret = iomap_write_end(iter, pos, bytes, bytes, folio);
+		if (WARN_ON_ONCE(!ret))
+			return -EIO;
+
+		__iomap_put_folio(iter, pos, bytes, folio);
+
+		pos += bytes;
+		length -= bytes;
+		written += bytes;
+	} while (length > 0);
+
+	return written;
+}
+
+int
+iomap_write_region(struct ioregion *region)
+{
+	struct iomap_iter iter = {
+		.inode		= region->inode,
+		.pos		= region->pos | region->offset,
+		.len		= region->length,
+	};
+	ssize_t ret;
+
+	while ((ret = iomap_iter(&iter, region->ops)) > 0)
+		iter.processed = iomap_write_region_iter(&iter, region->buf);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iomap_write_region);
+
 static int __init iomap_buffered_init(void)
 {
 	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 3bfd3035ac28..3297ed36c26b 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -68,6 +68,7 @@ struct vm_fault;
 #endif /* CONFIG_BUFFER_HEAD */
 #define IOMAP_F_XATTR		(1U << 5)
 #define IOMAP_F_BOUNDARY	(1U << 6)
+#define IOMAP_F_BEYOND_EOF	(1U << 7)
 
 /*
  * Flags set by the core iomap code during operations:
@@ -458,4 +459,16 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
 # define iomap_swapfile_activate(sis, swapfile, pagespan, ops)	(-EIO)
 #endif /* CONFIG_SWAP */
 
+struct ioregion {
+	struct inode *inode;
+	loff_t pos;				/* IO position */
+	const void *buf;			/* Data to be written (in only) */
+	size_t length;				/* Length of the data */
+	loff_t offset;				/* Region offset in the cache */
+	const struct iomap_ops *ops;
+};
+
+struct folio *iomap_read_region(struct ioregion *region);
+int iomap_write_region(struct ioregion *region);
+
 #endif /* LINUX_IOMAP_H */
-- 
2.47.0



* [PATCH 00/14] Direct mapped extended attribute data
  2024-12-29 13:33 [RFC] Directly mapped xattr data & fs-verity Andrey Albershteyn
  2024-12-29 13:35 ` [PATCH] xfs: direct mapped xattrs design documentation Andrey Albershteyn
  2024-12-29 13:36 ` [PATCH 0/2] Introduce iomap interface to work with regions beyond EOF Andrey Albershteyn
@ 2024-12-29 13:38 ` Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 01/14] iomap: add wrapper to pass readpage_ctx to read path Andrey Albershteyn
                     ` (13 more replies)
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
  2025-01-06 15:42 ` [RFC] Directly mapped xattr data & fs-verity Christoph Hellwig
  4 siblings, 14 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

This patchset introduces a new format for extended attribute leaves.
The main difference is that the data block doesn't have any header and
that the data has to be written through the page cache.

The most useful part of the header, necessary for metadata
verification, is rm_crc. This field is moved into the DA tree and
doubled in size as rm_crc[2].

rm_crc[2] stores both CRCs for data before IO completion (old data)
and after IO completion (new written data). This allow us to
transactionally update CRC in the DA block while updating attribute
data with writeback.

So far, the interface isn't useful by itself as it requires
additional iomap_begin callbacks. These are implemented by fsverity,
for example.
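
To make the rm_crc[2] idea concrete, here is a minimal userspace
sketch. It is not code from this series: the struct, the helper names
and the slot order (index 0 for old data, index 1 for new data) are
assumptions made for illustration only, as is the rule that data is
accepted when its CRC matches either slot.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical in-memory model of the two CRC slots. This is not the
 * on-disk layout from the series; the slot order is an assumption.
 */
struct model_dxattr_crc {
	uint32_t rm_crc[2];	/* [0]: old data, [1]: newly written data */
};

/* Record the CRC of data that is about to be written back. */
static void model_stage_new_crc(struct model_dxattr_crc *m, uint32_t new_crc)
{
	m->rm_crc[1] = new_crc;
}

/*
 * Assumed verification rule: the data block is acceptable if its CRC
 * matches either slot, so a reader sees valid data both before and
 * after the write completes.
 */
static bool model_crc_ok(const struct model_dxattr_crc *m, uint32_t data_crc)
{
	return data_crc == m->rm_crc[0] || data_crc == m->rm_crc[1];
}
```

With both slots populated, a crash between the DA block update and
the data write would still leave a CRC that matches what is on disk,
which is one way to read "transactionally update" above.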

Andrey Albershteyn (13):
  iomap: add wrapper to pass readpage_ctx to read path
  iomap: add read path ioends for filesystem read verification
  iomap: introduce IOMAP_F_NO_MERGE for non-mergable ioends
  xfs: add incompat directly mapped xattr flag
  libxfs: add xfs_calc_chsum()
  libxfs: pass xfs_sb to xfs_attr3_leaf_name_remote()
  xfs: introduce XFS_DA_OP_EMPTY
  xfs: introduce workqueue for post read processing
  xfs: add interface to set CRC on leaf attributes
  xfs: introduce XFS_ATTRUPDATE_FLAGS operation
  xfs: add interface for page cache mapped remote xattrs
  xfs: parse both remote attr name on-disk formats
  xfs: enalbe XFS_SB_FEAT_INCOMPAT_DXATTR

Darrick J. Wong (1):
  xfs: do not use xfs_attr3_rmt_hdr for remote value blocks for dxattr

 fs/iomap/buffered-io.c          | 111 +++++++++++------
 fs/xfs/libxfs/xfs_attr.c        | 212 +++++++++++++++++++++++++++++++-
 fs/xfs/libxfs/xfs_attr.h        |  11 ++
 fs/xfs/libxfs/xfs_attr_leaf.c   | 135 +++++++++++++++-----
 fs/xfs/libxfs/xfs_attr_leaf.h   |   1 +
 fs/xfs/libxfs/xfs_attr_remote.c |  83 ++++++++++---
 fs/xfs/libxfs/xfs_attr_remote.h |   8 +-
 fs/xfs/libxfs/xfs_cksum.h       |  12 ++
 fs/xfs/libxfs/xfs_da_btree.h    |   5 +-
 fs/xfs/libxfs/xfs_da_format.h   |  18 ++-
 fs/xfs/libxfs/xfs_format.h      |   4 +-
 fs/xfs/libxfs/xfs_log_format.h  |   1 +
 fs/xfs/libxfs/xfs_ondisk.h      |   9 +-
 fs/xfs/libxfs/xfs_sb.c          |   2 +
 fs/xfs/libxfs/xfs_shared.h      |   1 +
 fs/xfs/scrub/attr.c             |   2 +-
 fs/xfs/scrub/attr_repair.c      |   3 +-
 fs/xfs/scrub/listxattr.c        |   3 +-
 fs/xfs/xfs_attr_inactive.c      |   4 +-
 fs/xfs/xfs_attr_item.c          |   6 +
 fs/xfs/xfs_attr_item.h          |   1 +
 fs/xfs/xfs_attr_list.c          |   3 +-
 fs/xfs/xfs_mount.h              |   3 +
 fs/xfs/xfs_stats.h              |   1 +
 fs/xfs/xfs_super.c              |   9 ++
 fs/xfs/xfs_trace.h              |   1 +
 include/linux/iomap.h           |  34 +++++
 27 files changed, 580 insertions(+), 103 deletions(-)

-- 
2.47.0



* [PATCH 01/14] iomap: add wrapper to pass readpage_ctx to read path
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
@ 2024-12-29 13:38   ` Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 02/14] iomap: add read path ioends for filesystem read verification Andrey Albershteyn
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

Allow filesystems to create a readpage context, similar to
iomap_writepage_ctx in the write path. This will allow a filesystem
to pass _ops to iomap for ioend configuration (->prepare_ioend),
which in turn can be used to set the BIO end callout
(bio->bi_end_io).

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/iomap/buffered-io.c | 76 ++++++++++++++++++++++++------------------
 include/linux/iomap.h  | 12 +++++++
 2 files changed, 55 insertions(+), 33 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 0f33ac975209..0d9291719d75 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -320,14 +320,6 @@ static void iomap_read_end_io(struct bio *bio)
 	bio_put(bio);
 }
 
-struct iomap_readpage_ctx {
-	struct folio		*cur_folio;
-	bool			cur_folio_in_bio;
-	struct bio		*bio;
-	struct readahead_control *rac;
-	int			flags;
-};
-
 /**
  * iomap_read_inline_data - copy inline data into the page cache
  * @iter: iteration structure
@@ -461,28 +453,27 @@ static loff_t iomap_read_folio_iter(const struct iomap_iter *iter,
 	return done;
 }
 
-int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
+int iomap_read_folio_ctx(struct iomap_readpage_ctx *ctx,
+		const struct iomap_ops *ops)
 {
+	struct folio *folio = ctx->cur_folio;
 	struct iomap_iter iter = {
 		.inode		= folio->mapping->host,
 		.pos		= folio_pos(folio),
 		.len		= folio_size(folio),
 	};
-	struct iomap_readpage_ctx ctx = {
-		.cur_folio	= folio,
-	};
 	int ret;
 
 	trace_iomap_readpage(iter.inode, 1);
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_read_folio_iter(&iter, &ctx);
+		iter.processed = iomap_read_folio_iter(&iter, ctx);
 
-	if (ctx.bio) {
-		submit_bio(ctx.bio);
-		WARN_ON_ONCE(!ctx.cur_folio_in_bio);
+	if (ctx->bio) {
+		submit_bio(ctx->bio);
+		WARN_ON_ONCE(!ctx->cur_folio_in_bio);
 	} else {
-		WARN_ON_ONCE(ctx.cur_folio_in_bio);
+		WARN_ON_ONCE(ctx->cur_folio_in_bio);
 		folio_unlock(folio);
 	}
 
@@ -493,6 +484,16 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
 	 */
 	return 0;
 }
+EXPORT_SYMBOL_GPL(iomap_read_folio_ctx);
+
+int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
+{
+	struct iomap_readpage_ctx ctx = {
+		.cur_folio	= folio,
+	};
+
+	return iomap_read_folio_ctx(&ctx, ops);
+}
 EXPORT_SYMBOL_GPL(iomap_read_folio);
 
 static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
@@ -520,6 +521,30 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
 	return done;
 }
 
+void iomap_readahead_ctx(struct iomap_readpage_ctx *ctx,
+		const struct iomap_ops *ops)
+{
+	struct readahead_control *rac = ctx->rac;
+	struct iomap_iter iter = {
+		.inode	= rac->mapping->host,
+		.pos	= readahead_pos(rac),
+		.len	= readahead_length(rac),
+	};
+
+	trace_iomap_readahead(rac->mapping->host, readahead_count(rac));
+
+	while (iomap_iter(&iter, ops) > 0)
+		iter.processed = iomap_readahead_iter(&iter, ctx);
+
+	if (ctx->bio)
+		submit_bio(ctx->bio);
+	if (ctx->cur_folio) {
+		if (!ctx->cur_folio_in_bio)
+			folio_unlock(ctx->cur_folio);
+	}
+}
+EXPORT_SYMBOL_GPL(iomap_readahead_ctx);
+
 /**
  * iomap_readahead - Attempt to read pages from a file.
  * @rac: Describes the pages to be read.
@@ -537,26 +562,11 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
  */
 void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
 {
-	struct iomap_iter iter = {
-		.inode	= rac->mapping->host,
-		.pos	= readahead_pos(rac),
-		.len	= readahead_length(rac),
-	};
 	struct iomap_readpage_ctx ctx = {
 		.rac	= rac,
 	};
 
-	trace_iomap_readahead(rac->mapping->host, readahead_count(rac));
-
-	while (iomap_iter(&iter, ops) > 0)
-		iter.processed = iomap_readahead_iter(&iter, &ctx);
-
-	if (ctx.bio)
-		submit_bio(ctx.bio);
-	if (ctx.cur_folio) {
-		if (!ctx.cur_folio_in_bio)
-			folio_unlock(ctx.cur_folio);
-	}
+	iomap_readahead_ctx(&ctx, ops);
 }
 EXPORT_SYMBOL_GPL(iomap_readahead);
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 3297ed36c26b..b5ae08955c87 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -296,9 +296,21 @@ static inline bool iomap_want_unshare_iter(const struct iomap_iter *iter)
 		iter->srcmap.type == IOMAP_MAPPED;
 }
 
+struct iomap_readpage_ctx {
+	struct folio			*cur_folio;
+	bool				cur_folio_in_bio;
+	struct bio			*bio;
+	struct readahead_control	*rac;
+	int				flags;
+};
+
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops, void *private);
+int iomap_read_folio_ctx(struct iomap_readpage_ctx *ctx,
+		const struct iomap_ops *ops);
 int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
+void iomap_readahead_ctx(struct iomap_readpage_ctx *ctx,
+		const struct iomap_ops *ops);
 void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
-- 
2.47.0



* [PATCH 02/14] iomap: add read path ioends for filesystem read verification
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 01/14] iomap: add wrapper to pass readpage_ctx to read path Andrey Albershteyn
@ 2024-12-29 13:38   ` Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 03/14] iomap: introduce IOMAP_F_NO_MERGE for non-mergable ioends Andrey Albershteyn
                     ` (11 subsequent siblings)
  13 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

Add iomap_readpage_ops with a single optional ->prepare_ioend()
callout, which lets a filesystem configure the read path ioend,
mainly to set the ->bi_end_io() handler.

Export iomap_read_end_io() so that it can be called back from the
filesystem callout after verification is done.

The read path ioends are stored side by side with the BIOs allocated
from iomap_read_ioend_bioset.
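
The "side by side" layout relies on the bioset front padding: the bio
is the last member of the ioend, and container_of() recovers the
ioend from the bio. The following userspace sketch models only that
pointer arithmetic. All model_* names are hypothetical stand-ins, and
calloc() stands in for the bioset allocator.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for struct bio; only here to give the member a type. */
struct model_bio {
	int status;
};

/* Stand-in for struct iomap_read_ioend: the bio must be last. */
struct model_read_ioend {
	long long io_offset;
	size_t io_size;
	struct model_bio io_bio;	/* MUST BE LAST, as in the patch */
};

#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Allocate front padding and bio in one chunk and hand back the bio,
 * which is what a bioset with front_pad = offsetof(..., io_bio) does.
 */
static struct model_bio *model_bio_alloc(void)
{
	char *mem = calloc(1, sizeof(struct model_read_ioend));

	if (!mem)
		return NULL;
	return (struct model_bio *)(mem +
			offsetof(struct model_read_ioend, io_bio));
}

/* Step back from the bio to the ioend that surrounds it. */
static struct model_read_ioend *model_bio_to_ioend(struct model_bio *bio)
{
	return model_container_of(bio, struct model_read_ioend, io_bio);
}

static void model_bio_free(struct model_bio *bio)
{
	free(model_bio_to_ioend(bio));
}
```

Because the ioend lives in the front padding, no separate allocation
is needed per read bio, and a ->bi_end_io handler that only receives
the bio can still reach io_inode, io_offset and io_size.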

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/iomap/buffered-io.c | 31 ++++++++++++++++++++++++++-----
 include/linux/iomap.h  | 20 ++++++++++++++++++++
 2 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 0d9291719d75..93da48ec5801 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -41,6 +41,7 @@ struct iomap_folio_state {
 };
 
 static struct bio_set iomap_ioend_bioset;
+static struct bio_set iomap_read_ioend_bioset;
 
 static inline bool ifs_is_fully_uptodate(struct folio *folio,
 		struct iomap_folio_state *ifs)
@@ -310,7 +311,7 @@ static void iomap_finish_folio_read(struct folio *folio, size_t off,
 		folio_end_read(folio, uptodate);
 }
 
-static void iomap_read_end_io(struct bio *bio)
+void iomap_read_end_io(struct bio *bio)
 {
 	int error = blk_status_to_errno(bio->bi_status);
 	struct folio_iter fi;
@@ -319,6 +320,7 @@ static void iomap_read_end_io(struct bio *bio)
 		iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);
 	bio_put(bio);
 }
+EXPORT_SYMBOL_GPL(iomap_read_end_io);
 
 /**
  * iomap_read_inline_data - copy inline data into the page cache
@@ -371,6 +373,8 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 	loff_t orig_pos = pos;
 	size_t poff, plen;
 	sector_t sector;
+	struct iomap_read_ioend *ioend;
+	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 
 	if (iomap->type == IOMAP_INLINE)
 		return iomap_read_inline_data(iter, folio);
@@ -407,21 +411,29 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 
 		if (ctx->rac) /* same as readahead_gfp_mask */
 			gfp |= __GFP_NORETRY | __GFP_NOWARN;
-		ctx->bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs),
-				     REQ_OP_READ, gfp);
+		ctx->bio = bio_alloc_bioset(iomap->bdev, bio_max_segs(nr_vecs),
+				REQ_OP_READ, gfp, &iomap_read_ioend_bioset);
 		/*
 		 * If the bio_alloc fails, try it again for a single page to
 		 * avoid having to deal with partial page reads.  This emulates
 		 * what do_mpage_read_folio does.
 		 */
 		if (!ctx->bio) {
-			ctx->bio = bio_alloc(iomap->bdev, 1, REQ_OP_READ,
-					     orig_gfp);
+			ctx->bio = bio_alloc_bioset(iomap->bdev, 1,
+				REQ_OP_READ, orig_gfp, &iomap_read_ioend_bioset);
 		}
 		if (ctx->rac)
 			ctx->bio->bi_opf |= REQ_RAHEAD;
 		ctx->bio->bi_iter.bi_sector = sector;
 		ctx->bio->bi_end_io = iomap_read_end_io;
+		ioend = container_of(ctx->bio, struct iomap_read_ioend,
+				io_bio);
+		ioend->io_inode = iter->inode;
+		ioend->io_flags	= srcmap->flags;
+		ioend->io_offset = poff;
+		ioend->io_size = plen;
+		if (ctx->ops && ctx->ops->prepare_ioend)
+			ctx->ops->prepare_ioend(ioend);
 		bio_add_folio_nofail(ctx->bio, folio, plen, poff);
 	}
 
@@ -2157,6 +2169,15 @@ EXPORT_SYMBOL_GPL(iomap_write_region);
 
 static int __init iomap_buffered_init(void)
 {
+	int error = 0;
+
+	error = bioset_init(&iomap_read_ioend_bioset,
+			   4 * (PAGE_SIZE / SECTOR_SIZE),
+			   offsetof(struct iomap_read_ioend, io_bio),
+			   BIOSET_NEED_BVECS);
+	if (error)
+		return error;
+
 	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
 			   offsetof(struct iomap_ioend, io_bio),
 			   BIOSET_NEED_BVECS);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index b5ae08955c87..f089969e4716 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -296,14 +296,34 @@ static inline bool iomap_want_unshare_iter(const struct iomap_iter *iter)
 		iter->srcmap.type == IOMAP_MAPPED;
 }
 
+struct iomap_read_ioend {
+	struct inode		*io_inode;	/* file being read from */
+	u16			io_flags;	/* IOMAP_F_* */
+	size_t			io_size;	/* size of the extent */
+	loff_t			io_offset;	/* offset in the file */
+	struct work_struct	io_work;	/* post read work (e.g. fs-verity) */
+	struct bio		io_bio;		/* MUST BE LAST! */
+};
+
+struct iomap_readpage_ops {
+	/*
+	 * Optional, allows the file systems to perform actions just before
+	 * submitting the bio and/or override the bio bi_end_io handler for
+	 * additional verification after bio is processed
+	 */
+	void (*prepare_ioend)(struct iomap_read_ioend *ioend);
+};
+
 struct iomap_readpage_ctx {
 	struct folio			*cur_folio;
 	bool				cur_folio_in_bio;
 	struct bio			*bio;
 	struct readahead_control	*rac;
 	int				flags;
+	const struct iomap_readpage_ops	*ops;
 };
 
+void iomap_read_end_io(struct bio *bio);
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops, void *private);
 int iomap_read_folio_ctx(struct iomap_readpage_ctx *ctx,
-- 
2.47.0



* [PATCH 03/14] iomap: introduce IOMAP_F_NO_MERGE for non-mergable ioends
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 01/14] iomap: add wrapper to pass readpage_ctx to read path Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 02/14] iomap: add read path ioends for filesystem read verification Andrey Albershteyn
@ 2024-12-29 13:38   ` Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 04/14] xfs: add incompat directly mapped xattr flag Andrey Albershteyn
                     ` (10 subsequent siblings)
  13 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

Add IOMAP_F_NO_MERGE, which keeps an ioend from being merged with
adjacent ioends. XFS will use it to calculate the CRC of written
data.
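
As a rough userspace model of the merge check touched by this patch
(model_* names are hypothetical; only the contiguity test and the
flag test from the hunk below are mirrored, the other conditions in
iomap_ioend_can_merge() are left out):

```c
#include <stdbool.h>
#include <stdint.h>

/* Same bit position as IOMAP_F_NO_MERGE in this patch. */
#define MODEL_F_NO_MERGE	(1U << 13)

/* Reduced stand-in for struct iomap_ioend. */
struct model_ioend {
	uint64_t io_sector;	/* start sector of the ioend */
	uint64_t io_size;	/* size in bytes */
	uint16_t io_flags;	/* IOMAP_F_* style flags */
};

static bool model_can_merge(const struct model_ioend *ioend,
			    const struct model_ioend *next)
{
	/* The two ioends must be physically contiguous (512-byte sectors). */
	if (ioend->io_sector + (ioend->io_size >> 9) != next->io_sector)
		return false;
	/* As in the patch, only the first ioend's flags are checked. */
	if (ioend->io_flags & MODEL_F_NO_MERGE)
		return false;
	return true;
}
```

Keeping such ioends unmerged presumably means each completion covers
exactly one written range, so a CRC can be computed per range.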

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/iomap/buffered-io.c | 4 ++++
 include/linux/iomap.h  | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 93da48ec5801..d6231f4f78d9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1651,6 +1651,8 @@ iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
 	 */
 	if (ioend->io_sector + (ioend->io_size >> 9) != next->io_sector)
 		return false;
+	if (ioend->io_flags & IOMAP_F_NO_MERGE)
+		return false;
 	return true;
 }
 
@@ -1782,6 +1784,8 @@ static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos)
 	 */
 	if (wpc->nr_folios >= IOEND_BATCH_SIZE)
 		return false;
+	if (wpc->iomap.flags & IOMAP_F_NO_MERGE)
+		return false;
 	return true;
 }
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f089969e4716..261772431fae 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -87,6 +87,8 @@ struct vm_fault;
  * Flags from 0x1000 up are for file system specific usage:
  */
 #define IOMAP_F_PRIVATE		(1U << 12)
+/* No ioend merges for this operation */
+#define IOMAP_F_NO_MERGE	(1U << 13)
 
 
 /*
-- 
2.47.0



* [PATCH 04/14] xfs: add incompat directly mapped xattr flag
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
                     ` (2 preceding siblings ...)
  2024-12-29 13:38   ` [PATCH 03/14] iomap: introduce IOMAP_F_NO_MERGE for non-mergable ioends Andrey Albershteyn
@ 2024-12-29 13:38   ` Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 05/14] libxfs: add xfs_calc_chsum() Andrey Albershteyn
                     ` (9 subsequent siblings)
  13 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

Add an incompat feature flag for xattrs that are directly mapped
through the page cache, as this changes the on-disk format of remote
extended attributes (xfs_attr3_rmt_hdr is gone and
xfs_attr_leaf_name_remote gains a CRC).

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/libxfs/xfs_format.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 4d47a3e723aa..154458d72bc6 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -395,6 +395,7 @@ xfs_sb_has_ro_compat_feature(
 #define XFS_SB_FEAT_INCOMPAT_EXCHRANGE	(1 << 6)  /* exchangerange supported */
 #define XFS_SB_FEAT_INCOMPAT_PARENT	(1 << 7)  /* parent pointers */
 #define XFS_SB_FEAT_INCOMPAT_METADIR	(1 << 8)  /* metadata dir tree */
+#define XFS_SB_FEAT_INCOMPAT_DXATTR	(1 << 9)  /* directly mapped xattrs */
 #define XFS_SB_FEAT_INCOMPAT_ALL \
 		(XFS_SB_FEAT_INCOMPAT_FTYPE | \
 		 XFS_SB_FEAT_INCOMPAT_SPINODES | \
-- 
2.47.0



* [PATCH 05/14] libxfs: add xfs_calc_chsum()
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
                     ` (3 preceding siblings ...)
  2024-12-29 13:38   ` [PATCH 04/14] xfs: add incompat directly mapped xattr flag Andrey Albershteyn
@ 2024-12-29 13:38   ` Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 06/14] libxfs: pass xfs_sb to xfs_attr3_leaf_name_remote() Andrey Albershteyn
                     ` (8 subsequent siblings)
  13 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

Calculate the CRC of a buffer when the CRC is not stored in that same
buffer (storing it in the buffer is the common case for all other
xfs_cksum.h helpers).
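
For readers without a kernel tree at hand, a userspace sketch of the
same contract follows. The model_* names are hypothetical and the
bitwise CRC32C loop is a plain reference implementation, not the
kernel's crc32c(). Like the helper in this patch, the sketch returns
the raw CRC state seeded with all ones and leaves any final inversion
or endian conversion to the caller.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_CRC_SEED	(~(uint32_t)0)

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t model_crc32c(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
	}
	return crc;
}

/*
 * Checksum the whole buffer. Nothing inside the buffer is skipped or
 * rewritten; the result is returned and, if dst is set, stored there.
 */
static uint32_t model_calc_cksum(const char *buffer, size_t length,
				 uint32_t *dst)
{
	uint32_t crc = model_crc32c(MODEL_CRC_SEED, buffer, length);

	if (dst)
		*dst = crc;
	return crc;
}
```

The contrast with xfs_verify_cksum() is that no cksum_offset is
needed here: there is no checksum field inside the buffer to skip.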

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/libxfs/xfs_cksum.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_cksum.h b/fs/xfs/libxfs/xfs_cksum.h
index 999a290cfd72..90884ed0ef23 100644
--- a/fs/xfs/libxfs/xfs_cksum.h
+++ b/fs/xfs/libxfs/xfs_cksum.h
@@ -79,4 +79,16 @@ xfs_verify_cksum(char *buffer, size_t length, unsigned long cksum_offset)
 	return *(__le32 *)(buffer + cksum_offset) == xfs_end_cksum(crc);
 }
 
+/*
+ * Calculate the checksum of a buffer; the checksum is stored outside of it
+ */
+static inline uint32_t
+xfs_calc_cksum(char *buffer, size_t length, uint32_t *dst)
+{
+	uint32_t crc = crc32c(XFS_CRC_SEED, buffer, length);
+	if (dst)
+		*dst = crc;
+	return crc;
+}
+
 #endif /* _XFS_CKSUM_H */
-- 
2.47.0



* [PATCH 06/14] libxfs: pass xfs_sb to xfs_attr3_leaf_name_remote()
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
                     ` (4 preceding siblings ...)
  2024-12-29 13:38   ` [PATCH 05/14] libxfs: add xfs_calc_chsum() Andrey Albershteyn
@ 2024-12-29 13:38   ` Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 07/14] xfs: introduce XFS_DA_OP_EMPTY Andrey Albershteyn
                     ` (7 subsequent siblings)
  13 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

This will be needed to check, via the feature flags, whether the new
format of xfs_attr_leaf_name_remote is in use.

This commit only changes the signature of xfs_attr3_leaf_name_remote()
by adding xfs_sb. The superblock will be used in a later commit.

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/libxfs/xfs_attr_leaf.c | 80 +++++++++++++++++++++++------------
 fs/xfs/libxfs/xfs_da_format.h |  5 ++-
 fs/xfs/scrub/attr.c           |  2 +-
 fs/xfs/scrub/attr_repair.c    |  3 +-
 fs/xfs/scrub/listxattr.c      |  3 +-
 fs/xfs/xfs_attr_inactive.c    |  2 +-
 fs/xfs/xfs_attr_list.c        |  3 +-
 7 files changed, 65 insertions(+), 33 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index fddb55605e0c..c657638efe04 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -73,7 +73,11 @@ STATIC void xfs_attr3_leaf_moveents(struct xfs_da_args *args,
 			struct xfs_attr_leafblock *dst_leaf,
 			struct xfs_attr3_icleaf_hdr *dst_ichdr, int dst_start,
 			int move_count);
-STATIC int xfs_attr_leaf_entsize(xfs_attr_leafblock_t *leaf, int index);
+STATIC int
+xfs_attr_leaf_entsize(
+		struct xfs_mount	*mp,
+		xfs_attr_leafblock_t	*leaf,
+		int			index);
 
 /*
  * attr3 block 'firstused' conversion helpers.
@@ -274,7 +278,7 @@ xfs_attr3_leaf_verify_entry(
 		if (lentry->namelen == 0)
 			return __this_address;
 	} else {
-		rentry = xfs_attr3_leaf_name_remote(leaf, idx);
+		rentry = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf, idx);
 		namesize = xfs_attr_leaf_entsize_remote(rentry->namelen);
 		name_end = (char *)rentry + namesize;
 		if (rentry->namelen == 0)
@@ -1560,7 +1564,8 @@ xfs_attr3_leaf_add_work(
 		memcpy((char *)&name_loc->nameval[args->namelen], args->value,
 				   be16_to_cpu(name_loc->valuelen));
 	} else {
-		name_rmt = xfs_attr3_leaf_name_remote(leaf, args->index);
+		name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf,
+						      args->index);
 		name_rmt->namelen = args->namelen;
 		memcpy((char *)name_rmt->name, args->name, args->namelen);
 		entry->flags |= XFS_ATTR_INCOMPLETE;
@@ -1571,9 +1576,10 @@ xfs_attr3_leaf_add_work(
 		args->rmtblkcnt = xfs_attr3_rmt_blocks(mp, args->valuelen);
 		args->rmtvaluelen = args->valuelen;
 	}
-	xfs_trans_log_buf(args->trans, bp,
-	     XFS_DA_LOGRANGE(leaf, xfs_attr3_leaf_name(leaf, args->index),
-				   xfs_attr_leaf_entsize(leaf, args->index)));
+	xfs_trans_log_buf(
+		args->trans, bp,
+		XFS_DA_LOGRANGE(leaf, xfs_attr3_leaf_name(leaf, args->index),
+				xfs_attr_leaf_entsize(mp, leaf, args->index)));
 
 	/*
 	 * Update the control info for this leaf node
@@ -1594,7 +1600,7 @@ xfs_attr3_leaf_add_work(
 						sizeof(xfs_attr_leaf_entry_t));
 		}
 	}
-	ichdr->usedbytes += xfs_attr_leaf_entsize(leaf, args->index);
+	ichdr->usedbytes += xfs_attr_leaf_entsize(mp, leaf, args->index);
 }
 
 /*
@@ -1910,6 +1916,7 @@ xfs_attr3_leaf_figure_balance(
 	struct xfs_attr_leafblock	*leaf1 = blk1->bp->b_addr;
 	struct xfs_attr_leafblock	*leaf2 = blk2->bp->b_addr;
 	struct xfs_attr_leaf_entry	*entry;
+	struct xfs_mount		*mp = state->mp;
 	int				count;
 	int				max;
 	int				index;
@@ -1958,8 +1965,8 @@ xfs_attr3_leaf_figure_balance(
 		/*
 		 * Figure out if next leaf entry would be too much.
 		 */
-		tmp = totallen + sizeof(*entry) + xfs_attr_leaf_entsize(leaf1,
-									index);
+		tmp = totallen + sizeof(*entry) +
+		      xfs_attr_leaf_entsize(mp, leaf1, index);
 		if (XFS_ATTR_ABS(half - tmp) > lastdelta)
 			break;
 		lastdelta = XFS_ATTR_ABS(half - tmp);
@@ -2132,6 +2139,7 @@ xfs_attr3_leaf_remove(
 	struct xfs_attr_leafblock *leaf;
 	struct xfs_attr3_icleaf_hdr ichdr;
 	struct xfs_attr_leaf_entry *entry;
+	struct xfs_mount *mp = args->dp->i_mount;
 	int			before;
 	int			after;
 	int			smallest;
@@ -2166,7 +2174,7 @@ xfs_attr3_leaf_remove(
 	tmp = ichdr.freemap[0].size;
 	before = after = -1;
 	smallest = XFS_ATTR_LEAF_MAPSIZE - 1;
-	entsize = xfs_attr_leaf_entsize(leaf, args->index);
+	entsize = xfs_attr_leaf_entsize(mp, leaf, args->index);
 	for (i = 0; i < XFS_ATTR_LEAF_MAPSIZE; i++) {
 		ASSERT(ichdr.freemap[i].base < args->geo->blksize);
 		ASSERT(ichdr.freemap[i].size < args->geo->blksize);
@@ -2414,6 +2422,7 @@ xfs_attr3_leaf_lookup_int(
 	struct xfs_attr_leaf_entry *entries;
 	struct xfs_attr_leaf_name_local *name_loc;
 	struct xfs_attr_leaf_name_remote *name_rmt;
+	struct xfs_mount *mp = args->dp->i_mount;
 	xfs_dahash_t		hashval;
 	int			probe;
 	int			span;
@@ -2492,7 +2501,8 @@ xfs_attr3_leaf_lookup_int(
 		} else {
 			unsigned int	valuelen;
 
-			name_rmt = xfs_attr3_leaf_name_remote(leaf, probe);
+			name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf,
+							      probe);
 			valuelen = be32_to_cpu(name_rmt->valuelen);
 			if (!xfs_attr_match(args, entry->flags, name_rmt->name,
 					name_rmt->namelen, NULL, valuelen))
@@ -2528,6 +2538,7 @@ xfs_attr3_leaf_getvalue(
 	struct xfs_attr_leaf_entry *entry;
 	struct xfs_attr_leaf_name_local *name_loc;
 	struct xfs_attr_leaf_name_remote *name_rmt;
+	struct xfs_mount *mp = args->dp->i_mount;
 
 	leaf = bp->b_addr;
 	xfs_attr3_leaf_hdr_from_disk(args->geo, &ichdr, leaf);
@@ -2544,7 +2555,7 @@ xfs_attr3_leaf_getvalue(
 					be16_to_cpu(name_loc->valuelen));
 	}
 
-	name_rmt = xfs_attr3_leaf_name_remote(leaf, args->index);
+	name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf, args->index);
 	ASSERT(name_rmt->namelen == args->namelen);
 	ASSERT(memcmp(args->name, name_rmt->name, args->namelen) == 0);
 	args->rmtvaluelen = be32_to_cpu(name_rmt->valuelen);
@@ -2576,6 +2587,7 @@ xfs_attr3_leaf_moveents(
 {
 	struct xfs_attr_leaf_entry	*entry_s;
 	struct xfs_attr_leaf_entry	*entry_d;
+	struct xfs_mount		*mp = args->dp->i_mount;
 	int				desti;
 	int				tmp;
 	int				i;
@@ -2624,7 +2636,7 @@ xfs_attr3_leaf_moveents(
 	desti = start_d;
 	for (i = 0; i < count; entry_s++, entry_d++, desti++, i++) {
 		ASSERT(be16_to_cpu(entry_s->nameidx) >= ichdr_s->firstused);
-		tmp = xfs_attr_leaf_entsize(leaf_s, start_s + i);
+		tmp = xfs_attr_leaf_entsize(mp, leaf_s, start_s + i);
 #ifdef GROT
 		/*
 		 * Code to drop INCOMPLETE entries.  Difficult to use as we
@@ -2730,12 +2742,15 @@ xfs_attr_leaf_lasthash(
  * (whether local or remote only calculate bytes in this block).
  */
 STATIC int
-xfs_attr_leaf_entsize(xfs_attr_leafblock_t *leaf, int index)
+xfs_attr_leaf_entsize(
+		struct xfs_mount	*mp,
+		xfs_attr_leafblock_t	*leaf,
+		int			index)
 {
-	struct xfs_attr_leaf_entry *entries;
-	xfs_attr_leaf_name_local_t *name_loc;
-	xfs_attr_leaf_name_remote_t *name_rmt;
-	int size;
+	struct xfs_attr_leaf_entry	*entries;
+	xfs_attr_leaf_name_local_t	*name_loc;
+	xfs_attr_leaf_name_remote_t	*name_rmt;
+	int				size;
 
 	entries = xfs_attr3_leaf_entryp(leaf);
 	if (entries[index].flags & XFS_ATTR_LOCAL) {
@@ -2743,7 +2758,7 @@ xfs_attr_leaf_entsize(xfs_attr_leafblock_t *leaf, int index)
 		size = xfs_attr_leaf_entsize_local(name_loc->namelen,
 						   be16_to_cpu(name_loc->valuelen));
 	} else {
-		name_rmt = xfs_attr3_leaf_name_remote(leaf, index);
+		name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf, index);
 		size = xfs_attr_leaf_entsize_remote(name_rmt->namelen);
 	}
 	return size;
@@ -2789,6 +2804,7 @@ xfs_attr3_leaf_clearflag(
 	struct xfs_attr_leaf_entry *entry;
 	struct xfs_attr_leaf_name_remote *name_rmt;
 	struct xfs_buf		*bp;
+	struct xfs_mount	*mp = args->dp->i_mount;
 	int			error;
 #ifdef DEBUG
 	struct xfs_attr3_icleaf_hdr ichdr;
@@ -2820,7 +2836,8 @@ xfs_attr3_leaf_clearflag(
 		namelen = name_loc->namelen;
 		name = (char *)name_loc->nameval;
 	} else {
-		name_rmt = xfs_attr3_leaf_name_remote(leaf, args->index);
+		name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf,
+						      args->index);
 		namelen = name_rmt->namelen;
 		name = (char *)name_rmt->name;
 	}
@@ -2835,11 +2852,13 @@ xfs_attr3_leaf_clearflag(
 
 	if (args->rmtblkno) {
 		ASSERT((entry->flags & XFS_ATTR_LOCAL) == 0);
-		name_rmt = xfs_attr3_leaf_name_remote(leaf, args->index);
+		name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf,
+						      args->index);
 		name_rmt->valueblk = cpu_to_be32(args->rmtblkno);
 		name_rmt->valuelen = cpu_to_be32(args->rmtvaluelen);
 		xfs_trans_log_buf(args->trans, bp,
-			 XFS_DA_LOGRANGE(leaf, name_rmt, sizeof(*name_rmt)));
+				  XFS_DA_LOGRANGE(leaf, name_rmt,
+						  sizeof(*name_rmt)));
 	}
 
 	return 0;
@@ -2856,6 +2875,7 @@ xfs_attr3_leaf_setflag(
 	struct xfs_attr_leaf_entry *entry;
 	struct xfs_attr_leaf_name_remote *name_rmt;
 	struct xfs_buf		*bp;
+	struct xfs_mount	*mp = args->dp->i_mount;
 	int error;
 #ifdef DEBUG
 	struct xfs_attr3_icleaf_hdr ichdr;
@@ -2884,7 +2904,8 @@ xfs_attr3_leaf_setflag(
 	xfs_trans_log_buf(args->trans, bp,
 			XFS_DA_LOGRANGE(leaf, entry, sizeof(*entry)));
 	if ((entry->flags & XFS_ATTR_LOCAL) == 0) {
-		name_rmt = xfs_attr3_leaf_name_remote(leaf, args->index);
+		name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf,
+						      args->index);
 		name_rmt->valueblk = 0;
 		name_rmt->valuelen = 0;
 		xfs_trans_log_buf(args->trans, bp,
@@ -2912,6 +2933,7 @@ xfs_attr3_leaf_flipflags(
 	struct xfs_attr_leaf_name_remote *name_rmt;
 	struct xfs_buf		*bp1;
 	struct xfs_buf		*bp2;
+	struct xfs_mount	*mp = args->dp->i_mount;
 	int error;
 #ifdef DEBUG
 	struct xfs_attr3_icleaf_hdr ichdr1;
@@ -2963,7 +2985,8 @@ xfs_attr3_leaf_flipflags(
 		namelen1 = name_loc->namelen;
 		name1 = (char *)name_loc->nameval;
 	} else {
-		name_rmt = xfs_attr3_leaf_name_remote(leaf1, args->index);
+		name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf1,
+						      args->index);
 		namelen1 = name_rmt->namelen;
 		name1 = (char *)name_rmt->name;
 	}
@@ -2972,7 +2995,8 @@ xfs_attr3_leaf_flipflags(
 		namelen2 = name_loc->namelen;
 		name2 = (char *)name_loc->nameval;
 	} else {
-		name_rmt = xfs_attr3_leaf_name_remote(leaf2, args->index2);
+		name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf2,
+						      args->index2);
 		namelen2 = name_rmt->namelen;
 		name2 = (char *)name_rmt->name;
 	}
@@ -2989,7 +3013,8 @@ xfs_attr3_leaf_flipflags(
 			  XFS_DA_LOGRANGE(leaf1, entry1, sizeof(*entry1)));
 	if (args->rmtblkno) {
 		ASSERT((entry1->flags & XFS_ATTR_LOCAL) == 0);
-		name_rmt = xfs_attr3_leaf_name_remote(leaf1, args->index);
+		name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf1,
+						      args->index);
 		name_rmt->valueblk = cpu_to_be32(args->rmtblkno);
 		name_rmt->valuelen = cpu_to_be32(args->rmtvaluelen);
 		xfs_trans_log_buf(args->trans, bp1,
@@ -3000,7 +3025,8 @@ xfs_attr3_leaf_flipflags(
 	xfs_trans_log_buf(args->trans, bp2,
 			  XFS_DA_LOGRANGE(leaf2, entry2, sizeof(*entry2)));
 	if ((entry2->flags & XFS_ATTR_LOCAL) == 0) {
-		name_rmt = xfs_attr3_leaf_name_remote(leaf2, args->index2);
+		name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf2,
+						      args->index2);
 		name_rmt->valueblk = 0;
 		name_rmt->valuelen = 0;
 		xfs_trans_log_buf(args->trans, bp2,
diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index 86de99e2f757..afc25b6d805e 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -773,7 +773,10 @@ xfs_attr3_leaf_name(xfs_attr_leafblock_t *leafp, int idx)
 }
 
 static inline xfs_attr_leaf_name_remote_t *
-xfs_attr3_leaf_name_remote(xfs_attr_leafblock_t *leafp, int idx)
+xfs_attr3_leaf_name_remote(
+		struct xfs_sb		*sb,
+		xfs_attr_leafblock_t	*leafp,
+		int			idx)
 {
 	return (xfs_attr_leaf_name_remote_t *)xfs_attr3_leaf_name(leafp, idx);
 }
diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 708334f9b2bd..d911cf9cad20 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -361,7 +361,7 @@ xchk_xattr_entry(
 		if (lentry->namelen == 0)
 			xchk_da_set_corrupt(ds, level);
 	} else {
-		rentry = xfs_attr3_leaf_name_remote(leaf, idx);
+		rentry = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf, idx);
 		namesize = xfs_attr_leaf_entsize_remote(rentry->namelen);
 		name_end = (char *)rentry + namesize;
 		if (rentry->namelen == 0 || rentry->valueblk == 0)
diff --git a/fs/xfs/scrub/attr_repair.c b/fs/xfs/scrub/attr_repair.c
index c7eb94069caf..2a8e70f45361 100644
--- a/fs/xfs/scrub/attr_repair.c
+++ b/fs/xfs/scrub/attr_repair.c
@@ -436,7 +436,8 @@ xrep_xattr_recover_leaf(
 			error = xrep_xattr_salvage_local_attr(rx, ent, nameidx,
 					buf_end, lentry);
 		} else {
-			rentry = xfs_attr3_leaf_name_remote(leaf, i);
+			rentry = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf,
+							    i);
 			error = xrep_xattr_salvage_remote_attr(rx, ent, nameidx,
 					buf_end, rentry, i, bp);
 		}
diff --git a/fs/xfs/scrub/listxattr.c b/fs/xfs/scrub/listxattr.c
index 256ff7700c94..70ec6d2ff907 100644
--- a/fs/xfs/scrub/listxattr.c
+++ b/fs/xfs/scrub/listxattr.c
@@ -84,7 +84,8 @@ xchk_xattr_walk_leaf_entries(
 		} else {
 			struct xfs_attr_leaf_name_remote	*name_rmt;
 
-			name_rmt = xfs_attr3_leaf_name_remote(leaf, i);
+			name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf,
+							      i);
 			name = name_rmt->name;
 			namelen = name_rmt->namelen;
 			value = NULL;
diff --git a/fs/xfs/xfs_attr_inactive.c b/fs/xfs/xfs_attr_inactive.c
index 24fb12986a56..2495ff76acec 100644
--- a/fs/xfs/xfs_attr_inactive.c
+++ b/fs/xfs/xfs_attr_inactive.c
@@ -106,7 +106,7 @@ xfs_attr3_leaf_inactive(
 		if (!entry->nameidx || (entry->flags & XFS_ATTR_LOCAL))
 			continue;
 
-		name_rmt = xfs_attr3_leaf_name_remote(leaf, i);
+		name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf, i);
 		if (!name_rmt->valueblk)
 			continue;
 
diff --git a/fs/xfs/xfs_attr_list.c b/fs/xfs/xfs_attr_list.c
index 379b48d015d2..4388443e2db7 100644
--- a/fs/xfs/xfs_attr_list.c
+++ b/fs/xfs/xfs_attr_list.c
@@ -506,7 +506,8 @@ xfs_attr3_leaf_list_int(
 		} else {
 			xfs_attr_leaf_name_remote_t *name_rmt;
 
-			name_rmt = xfs_attr3_leaf_name_remote(leaf, i);
+			name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf,
+							      i);
 			name = name_rmt->name;
 			namelen = name_rmt->namelen;
 			value = NULL;
-- 
2.47.0



* [PATCH 07/14] xfs: introduce XFS_DA_OP_EMPTY
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
                     ` (5 preceding siblings ...)
  2024-12-29 13:38   ` [PATCH 06/14] libxfs: pass xfs_sb to xfs_attr3_leaf_name_remote() Andrey Albershteyn
@ 2024-12-29 13:38   ` Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 08/14] xfs: introduce workqueue for post read processing Andrey Albershteyn
                     ` (6 subsequent siblings)
  13 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

Allocate blocks but don't copy data into the attribute.
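
As a rough illustration of the intent (a userspace sketch, not the kernel
code), the flag only gates the copy step while allocation proceeds as
usual. All demo_* names are hypothetical; only the flag's bit position and
meaning are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for XFS_DA_OP_EMPTY: allocate space, skip the copy. */
#define DEMO_OP_EMPTY	(1u << 7)

struct demo_args {
	unsigned int	op_flags;
	const char	*value;
	size_t		valuelen;
};

/*
 * Model of the remote-value step: the space is always "allocated"
 * (zeroed here), but the value is copied only when DEMO_OP_EMPTY is
 * clear. Returns 0 on success, -1 if the value does not fit.
 */
static int demo_rmtval_alloc(const struct demo_args *args, char *blocks,
			     size_t blocks_len)
{
	assert(blocks != NULL);

	if (args->valuelen > blocks_len)
		return -1;

	memset(blocks, 0, blocks_len);
	if (!(args->op_flags & DEMO_OP_EMPTY))
		memcpy(blocks, args->value, args->valuelen);
	return 0;
}
```

With the flag set the caller gets back allocated but untouched space and is
expected to fill it through another path later.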

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/libxfs/xfs_attr.c     | 8 +++++---
 fs/xfs/libxfs/xfs_da_btree.h | 4 +++-
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 17875ad865f5..5060c266f776 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -642,9 +642,11 @@ xfs_attr_rmtval_alloc(
 			goto out;
 	}
 
-	error = xfs_attr_rmtval_set_value(args);
-	if (error)
-		return error;
+	if (!(args->op_flags & XFS_DA_OP_EMPTY)) {
+		error = xfs_attr_rmtval_set_value(args);
+		if (error)
+			return error;
+	}
 
 	attr->xattri_dela_state = xfs_attr_complete_op(attr,
 						++attr->xattri_dela_state);
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 354d5d65043e..2428a3a466cb 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -97,6 +97,7 @@ typedef struct xfs_da_args {
 #define XFS_DA_OP_CILOOKUP	(1u << 4) /* lookup returns CI name if found */
 #define XFS_DA_OP_RECOVERY	(1u << 5) /* Log recovery operation */
 #define XFS_DA_OP_LOGGED	(1u << 6) /* Use intent items to track op */
+#define XFS_DA_OP_EMPTY		(1u << 7) /* Don't copy any data but alloc blks */
 
 #define XFS_DA_OP_FLAGS \
 	{ XFS_DA_OP_JUSTCHECK,	"JUSTCHECK" }, \
@@ -105,7 +106,8 @@ typedef struct xfs_da_args {
 	{ XFS_DA_OP_OKNOENT,	"OKNOENT" }, \
 	{ XFS_DA_OP_CILOOKUP,	"CILOOKUP" }, \
 	{ XFS_DA_OP_RECOVERY,	"RECOVERY" }, \
-	{ XFS_DA_OP_LOGGED,	"LOGGED" }
+	{ XFS_DA_OP_LOGGED,	"LOGGED" }, \
+	{ XFS_DA_OP_EMPTY,	"EMPTY" }
 
 /*
  * Storage for holding state during Btree searches and split/join ops.
-- 
2.47.0



* [PATCH 08/14] xfs: introduce workqueue for post read processing
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
                     ` (6 preceding siblings ...)
  2024-12-29 13:38   ` [PATCH 07/14] xfs: introduce XFS_DA_OP_EMPTY Andrey Albershteyn
@ 2024-12-29 13:38   ` Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 09/14] xfs: add interface to set CRC on leaf attributes Andrey Albershteyn
                     ` (5 subsequent siblings)
  13 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

With directly mapped attribute data we need to verify that the attribute
data matches the CRC stored in the da tree. This needs to be done at IO
completion.

Add workqueue for read path work.
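
The general pattern, deferring verification from the completion side to a
worker, can be modelled in plain userspace C as below. This is only a
sketch of the idea under simplifying assumptions (single-threaded, no
locking); it does not use the kernel workqueue API and every demo_* name is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct demo_work {
	void			(*fn)(struct demo_work *work);
	struct demo_work	*next;
};

struct demo_wq {
	struct demo_work	*head;
	struct demo_work	**tail;
};

/* A read request that still needs its data verified. */
struct demo_read {
	struct demo_work	work;	/* must stay first for the cast below */
	int			verified;
};

static void demo_wq_init(struct demo_wq *wq)
{
	wq->head = NULL;
	wq->tail = &wq->head;
}

/* "Completion" side: only links the item into the queue, runs nothing. */
static void demo_queue_work(struct demo_wq *wq, struct demo_work *work)
{
	assert(work->fn != NULL);
	work->next = NULL;
	*wq->tail = work;
	wq->tail = &work->next;
}

/* "Worker" side: runs queued items in FIFO order, returns how many ran. */
static int demo_wq_drain(struct demo_wq *wq)
{
	int ran = 0;

	while (wq->head) {
		struct demo_work *work = wq->head;

		wq->head = work->next;
		if (!wq->head)
			wq->tail = &wq->head;
		work->fn(work);
		ran++;
	}
	return ran;
}

/* Placeholder for the real check: just marks the request as verified. */
static void demo_verify(struct demo_work *work)
{
	struct demo_read *rd = (struct demo_read *)work;

	rd->verified = 1;
}
```

Queuing is cheap and does no verification itself, which is the property
that matters when the caller is an IO completion handler.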

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/xfs_mount.h | 1 +
 fs/xfs/xfs_super.c | 9 +++++++++
 2 files changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index db9dade7d22a..d772d908ba3c 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -136,6 +136,7 @@ typedef struct xfs_mount {
 	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
 	struct workqueue_struct *m_buf_workqueue;
 	struct workqueue_struct	*m_unwritten_workqueue;
+	struct workqueue_struct	*m_postread_workqueue;
 	struct workqueue_struct	*m_reclaim_workqueue;
 	struct workqueue_struct	*m_sync_workqueue;
 	struct workqueue_struct *m_blockgc_wq;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 394fdf3bb535..4ab93adaab0c 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -559,6 +559,12 @@ xfs_init_mount_workqueues(
 	if (!mp->m_unwritten_workqueue)
 		goto out_destroy_buf;
 
+	mp->m_postread_workqueue = alloc_workqueue("xfs-pread/%s",
+			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM),
+			0, mp->m_super->s_id);
+	if (!mp->m_postread_workqueue)
+		goto out_destroy_postread;
+
 	mp->m_reclaim_workqueue = alloc_workqueue("xfs-reclaim/%s",
 			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM),
 			0, mp->m_super->s_id);
@@ -592,6 +598,8 @@ xfs_init_mount_workqueues(
 	destroy_workqueue(mp->m_reclaim_workqueue);
 out_destroy_unwritten:
 	destroy_workqueue(mp->m_unwritten_workqueue);
+out_destroy_postread:
+	destroy_workqueue(mp->m_postread_workqueue);
 out_destroy_buf:
 	destroy_workqueue(mp->m_buf_workqueue);
 out:
@@ -607,6 +615,7 @@ xfs_destroy_mount_workqueues(
 	destroy_workqueue(mp->m_inodegc_wq);
 	destroy_workqueue(mp->m_reclaim_workqueue);
 	destroy_workqueue(mp->m_unwritten_workqueue);
+	destroy_workqueue(mp->m_postread_workqueue);
 	destroy_workqueue(mp->m_buf_workqueue);
 }
 
-- 
2.47.0



* [PATCH 09/14] xfs: add interface to set CRC on leaf attributes
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
                     ` (7 preceding siblings ...)
  2024-12-29 13:38   ` [PATCH 08/14] xfs: introduce workqueue for post read processing Andrey Albershteyn
@ 2024-12-29 13:38   ` Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 10/14] xfs: introduce XFS_ATTRUPDATE_FLAGS operation Andrey Albershteyn
                     ` (4 subsequent siblings)
  13 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

With attribute data passed through the page cache we need to update the
CRC when IO is complete. This function calculates the CRC of the newly
written data, stores it in the currently unused CRC slot and flips the
selector flag to point at it (the old CRC stays in the other slot).
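
The slot swap can be sketched in userspace C as follows. The bit positions
mirror XFS_ATTR_RMCRC_SEL and XFS_ATTR_INCOMPLETE from this series; the
demo_* struct and helpers are hypothetical, and endianness conversion and
logging are left out.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_RMCRC_SEL	(1u << 4)	/* which crc[] slot is current */
#define DEMO_INCOMPLETE	(1u << 7)	/* value write still in flight */

struct demo_entry {
	uint8_t		flags;
	uint32_t	crc[2];
};

/* Store the new CRC in the spare slot, then make that slot current. */
static void demo_setcrc(struct demo_entry *e, uint32_t new_crc)
{
	/* The spare slot is the one the selector bit does NOT point at. */
	unsigned int spare = (e->flags & DEMO_RMCRC_SEL) == 0;

	assert(e->flags & DEMO_INCOMPLETE);
	e->crc[spare] = new_crc;
	e->flags ^= DEMO_RMCRC_SEL;
	e->flags &= (uint8_t)~DEMO_INCOMPLETE;
}

/* Return the CRC the selector bit currently points at. */
static uint32_t demo_current_crc(const struct demo_entry *e)
{
	return e->crc[(e->flags & DEMO_RMCRC_SEL) != 0];
}
```

Because the old CRC is never overwritten in place, a reader can still tell
old data from garbage if the flag flip does not reach disk.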

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/libxfs/xfs_attr_leaf.c | 50 +++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_attr_leaf.h |  1 +
 fs/xfs/xfs_trace.h            |  1 +
 3 files changed, 52 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index c657638efe04..409c91827b47 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -3035,3 +3035,53 @@ xfs_attr3_leaf_flipflags(
 
 	return 0;
 }
+
+/*
+ * Set CRC field of remote attribute
+ */
+int
+xfs_attr3_leaf_setcrc(
+	struct xfs_da_args			*args)
+{
+	struct xfs_attr_leafblock		*leaf;
+	struct xfs_attr_leaf_entry		*entry;
+	struct xfs_attr_leaf_name_remote	*name_rmt;
+	struct xfs_buf				*bp;
+	struct xfs_mount			*mp = args->dp->i_mount;
+	int					error;
+	unsigned int				whichcrc;
+	uint32_t				crc;
+
+	trace_xfs_attr_leaf_setcrc(args);
+
+	xfs_calc_cksum(args->value, args->valuelen, &crc);
+
+	/*
+	 * Set up the operation.
+	 */
+	error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner,
+			args->blkno, &bp);
+	if (error)
+		return error;
+
+	leaf = bp->b_addr;
+	entry = &xfs_attr3_leaf_entryp(leaf)[args->index];
+	ASSERT((entry->flags & XFS_ATTR_INCOMPLETE) != 0);
+
+	whichcrc = (entry->flags & XFS_ATTR_RMCRC_SEL) == 0;
+	name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf,
+					      args->index);
+	name_rmt->crc[whichcrc] = crc;
+	xfs_trans_log_buf(args->trans, bp,
+			XFS_DA_LOGRANGE(leaf, name_rmt, sizeof(*name_rmt)));
+
+	/* Flip the XFS_ATTR_RMCRC_SEL bit to point to the right/new CRC and
+	 * clear XFS_ATTR_INCOMPLETE bit as this is final point of directly
+	 * mapped attr data write flow */
+	entry->flags ^= XFS_ATTR_RMCRC_SEL;
+	entry->flags &= ~XFS_ATTR_INCOMPLETE;
+	xfs_trans_log_buf(args->trans, bp,
+			XFS_DA_LOGRANGE(leaf, entry, sizeof(*entry)));
+
+	return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
index 589f810eedc0..c8722c8accb0 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.h
+++ b/fs/xfs/libxfs/xfs_attr_leaf.h
@@ -66,6 +66,7 @@ int	xfs_attr3_leaf_to_shortform(struct xfs_buf *bp,
 int	xfs_attr3_leaf_clearflag(struct xfs_da_args *args);
 int	xfs_attr3_leaf_setflag(struct xfs_da_args *args);
 int	xfs_attr3_leaf_flipflags(struct xfs_da_args *args);
+int	xfs_attr3_leaf_setcrc(struct xfs_da_args *args);
 
 /*
  * Routines used for growing the Btree.
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 7b16cdd72e9d..5c3b8929179d 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2189,6 +2189,7 @@ DEFINE_ATTR_EVENT(xfs_attr_leaf_to_node);
 DEFINE_ATTR_EVENT(xfs_attr_leaf_rebalance);
 DEFINE_ATTR_EVENT(xfs_attr_leaf_unbalance);
 DEFINE_ATTR_EVENT(xfs_attr_leaf_toosmall);
+DEFINE_ATTR_EVENT(xfs_attr_leaf_setcrc);
 
 DEFINE_ATTR_EVENT(xfs_attr_node_addname);
 DEFINE_ATTR_EVENT(xfs_attr_node_get);
-- 
2.47.0



* [PATCH 10/14] xfs: introduce XFS_ATTRUPDATE_FLAGS operation
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
                     ` (8 preceding siblings ...)
  2024-12-29 13:38   ` [PATCH 09/14] xfs: add interface to set CRC on leaf attributes Andrey Albershteyn
@ 2024-12-29 13:38   ` Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 11/14] xfs: add interface for page cache mapped remote xattrs Andrey Albershteyn
                     ` (3 subsequent siblings)
  13 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

Extended attributes mapped through the page cache need a way to clear
the XFS_ATTR_INCOMPLETE flag and set the data CRC when data IO is
complete. Introduce a new operation for this; for now it applies only
to leaf attributes.

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/libxfs/xfs_attr.c       | 19 ++++++++++++++++++-
 fs/xfs/libxfs/xfs_attr.h       |  3 +++
 fs/xfs/libxfs/xfs_log_format.h |  1 +
 fs/xfs/xfs_attr_item.c         |  6 ++++++
 fs/xfs/xfs_attr_item.h         |  1 +
 fs/xfs/xfs_stats.h             |  1 +
 6 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 5060c266f776..55b18ec8bc10 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -855,6 +855,13 @@ xfs_attr_set_iter(
 			attr->xattri_dela_state++;
 		break;
 
+	case XFS_DAS_LEAF_FLAGS_UPDATE:
+		error = xfs_attr3_leaf_setcrc(args);
+		if (error)
+			return error;
+		attr->xattri_dela_state = XFS_DAS_DONE;
+		break;
+
 	case XFS_DAS_LEAF_SET_RMT:
 	case XFS_DAS_NODE_SET_RMT:
 		error = xfs_attr_rmtval_find_space(attr);
@@ -1093,6 +1100,11 @@ xfs_attr_set(
 		tres = M_RES(mp)->tr_attrrm;
 		total = XFS_ATTRRM_SPACE_RES(mp);
 		break;
+	case XFS_ATTRUPDATE_FLAGS:
+		XFS_STATS_INC(mp, xs_attr_flags);
+		tres = M_RES(mp)->tr_attrrm;
+		total = XFS_ATTRRM_SPACE_RES(mp);
+		break;
 	}
 
 	/*
@@ -1119,6 +1131,11 @@ xfs_attr_set(
 			break;
 		}
 
+		if (op == XFS_ATTRUPDATE_FLAGS) {
+			xfs_attr_defer_add(args, XFS_ATTR_DEFER_FLAGS);
+			break;
+		}
+
 		/* Pure create fails if the attr already exists */
 		if (op == XFS_ATTRUPDATE_CREATE)
 			goto out_trans_cancel;
@@ -1126,7 +1143,7 @@ xfs_attr_set(
 		break;
 	case -ENOATTR:
 		/* Can't remove what isn't there. */
-		if (op == XFS_ATTRUPDATE_REMOVE)
+		if (op == XFS_ATTRUPDATE_REMOVE || op == XFS_ATTRUPDATE_FLAGS)
 			goto out_trans_cancel;
 
 		/* Pure replace fails if no existing attr to replace. */
diff --git a/fs/xfs/libxfs/xfs_attr.h b/fs/xfs/libxfs/xfs_attr.h
index 0e51d0723f9a..b851e2e4b63c 100644
--- a/fs/xfs/libxfs/xfs_attr.h
+++ b/fs/xfs/libxfs/xfs_attr.h
@@ -448,6 +448,7 @@ enum xfs_delattr_state {
 
 	XFS_DAS_LEAF_ADD,		/* Initial leaf add state */
 	XFS_DAS_LEAF_REMOVE,		/* Initial leaf replace/remove state */
+	XFS_DAS_LEAF_FLAGS_UPDATE,	/* Update leaf XFS_ATTR_* flags and CRC */
 
 	XFS_DAS_NODE_ADD,		/* Initial node add state */
 	XFS_DAS_NODE_REMOVE,		/* Initial node replace/remove state */
@@ -477,6 +478,7 @@ enum xfs_delattr_state {
 	{ XFS_DAS_SF_REMOVE,		"XFS_DAS_SF_REMOVE" }, \
 	{ XFS_DAS_LEAF_ADD,		"XFS_DAS_LEAF_ADD" }, \
 	{ XFS_DAS_LEAF_REMOVE,		"XFS_DAS_LEAF_REMOVE" }, \
+	{ XFS_DAS_LEAF_FLAGS_UPDATE,	"XFS_DAS_LEAF_FLAGS_UPDATE" }, \
 	{ XFS_DAS_NODE_ADD,		"XFS_DAS_NODE_ADD" }, \
 	{ XFS_DAS_NODE_REMOVE,		"XFS_DAS_NODE_REMOVE" }, \
 	{ XFS_DAS_LEAF_SET_RMT,		"XFS_DAS_LEAF_SET_RMT" }, \
@@ -556,6 +558,7 @@ enum xfs_attr_update {
 	XFS_ATTRUPDATE_UPSERT,	/* set value, replace any existing attr */
 	XFS_ATTRUPDATE_CREATE,	/* set value, fail if attr already exists */
 	XFS_ATTRUPDATE_REPLACE,	/* set value, fail if attr does not exist */
+	XFS_ATTRUPDATE_FLAGS,	/* update attribute flags and metadata */
 };
 
 int xfs_attr_set(struct xfs_da_args *args, enum xfs_attr_update op, bool rsvd);
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 15dec19b6c32..9f1b02a599d2 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -1035,6 +1035,7 @@ struct xfs_icreate_log {
 #define XFS_ATTRI_OP_FLAGS_PPTR_SET	4	/* Set parent pointer */
 #define XFS_ATTRI_OP_FLAGS_PPTR_REMOVE	5	/* Remove parent pointer */
 #define XFS_ATTRI_OP_FLAGS_PPTR_REPLACE	6	/* Replace parent pointer */
+#define XFS_ATTRI_OP_FLAGS_FLAGS_UPDATE	7	/* Update attribute flags */
 #define XFS_ATTRI_OP_FLAGS_TYPE_MASK	0xFF	/* Flags type mask */
 
 /*
diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
index f683b7a9323f..f392c95905b5 100644
--- a/fs/xfs/xfs_attr_item.c
+++ b/fs/xfs/xfs_attr_item.c
@@ -908,6 +908,9 @@ xfs_attr_defer_add(
 		else
 			log_op = XFS_ATTRI_OP_FLAGS_REMOVE;
 		break;
+	case XFS_ATTR_DEFER_FLAGS:
+		log_op = XFS_ATTRI_OP_FLAGS_FLAGS_UPDATE;
+		break;
 	default:
 		ASSERT(0);
 		break;
@@ -931,6 +934,9 @@ xfs_attr_defer_add(
 	case XFS_ATTRI_OP_FLAGS_REMOVE:
 		new->xattri_dela_state = xfs_attr_init_remove_state(args);
 		break;
+	case XFS_ATTRI_OP_FLAGS_FLAGS_UPDATE:
+		new->xattri_dela_state = XFS_DAS_LEAF_FLAGS_UPDATE;
+		break;
 	}
 
 	xfs_defer_add(args->trans, &new->xattri_list, &xfs_attr_defer_type);
diff --git a/fs/xfs/xfs_attr_item.h b/fs/xfs/xfs_attr_item.h
index e74128cbb722..f6f169631eb7 100644
--- a/fs/xfs/xfs_attr_item.h
+++ b/fs/xfs/xfs_attr_item.h
@@ -57,6 +57,7 @@ enum xfs_attr_defer_op {
 	XFS_ATTR_DEFER_SET,
 	XFS_ATTR_DEFER_REMOVE,
 	XFS_ATTR_DEFER_REPLACE,
+	XFS_ATTR_DEFER_FLAGS,
 };
 
 void xfs_attr_defer_add(struct xfs_da_args *args, enum xfs_attr_defer_op op);
diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
index a61fb56ed2e6..007c22e2cad2 100644
--- a/fs/xfs/xfs_stats.h
+++ b/fs/xfs/xfs_stats.h
@@ -96,6 +96,7 @@ struct __xfsstats {
 	uint32_t		xs_attr_get;
 	uint32_t		xs_attr_set;
 	uint32_t		xs_attr_remove;
+	uint32_t		xs_attr_flags;
 	uint32_t		xs_attr_list;
 	uint32_t		xs_iflush_count;
 	uint32_t		xs_icluster_flushcnt;
-- 
2.47.0



* [PATCH 11/14] xfs: add interface for page cache mapped remote xattrs
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
                     ` (9 preceding siblings ...)
  2024-12-29 13:38   ` [PATCH 10/14] xfs: introduce XFS_ATTRUPDATE_FLAGS operation Andrey Albershteyn
@ 2024-12-29 13:38   ` Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 12/14] xfs: parse both remote attr name on-disk formats Andrey Albershteyn
                     ` (2 subsequent siblings)
  13 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

Leaves of remote attributes contain xfs_attr3_rmt_hdr with the CRC of
the extended attribute and owner info. Each block of the extent has
this header.

Because of this we cannot easily map the content of an extended
attribute to a page. Such a mapping would be very helpful for fsverity,
as we could use extended attributes to store the merkle tree and map
these blocks into the page cache.

This commit changes the format of the leaves by moving the CRC into the
btree name struct. This, however, creates a consistency problem, as the
CRC update might not happen even though the data is updated.

This is solved by storing both CRCs: one for the old data and one for
the new. An attribute flag points to the correct CRC.
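
On the read side the two slots allow three outcomes to be distinguished. A
minimal userspace sketch, assuming the same selector bit as above; the
enum and function names are hypothetical, and in the patch the two failure
cases correspond to -EFSBADCRC and -EFSCORRUPTED:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RMCRC_SEL	(1u << 4)	/* which crc[] slot is current */

enum demo_verdict {
	DEMO_OK = 0,		/* data matches the current CRC */
	DEMO_BADCRC,		/* data matches only the other slot */
	DEMO_CORRUPTED,		/* data matches neither slot */
};

/* Classify freshly read data against both stored CRCs. */
static enum demo_verdict demo_check(const uint32_t crc[2], unsigned int flags,
				    uint32_t data_crc)
{
	unsigned int cur = (flags & DEMO_RMCRC_SEL) != 0;

	assert(crc != NULL);

	if (crc[cur] == data_crc)
		return DEMO_OK;
	if (crc[cur ^ 1] == data_crc)
		return DEMO_BADCRC;
	return DEMO_CORRUPTED;
}
```

A match against the non-selected slot means the data and the selector flag
disagree about which version is current, which is a different failure from
data that matches nothing.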

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/libxfs/xfs_attr.c        | 189 +++++++++++++++++++++++++++++++-
 fs/xfs/libxfs/xfs_attr.h        |   8 ++
 fs/xfs/libxfs/xfs_attr_remote.c |  12 --
 fs/xfs/libxfs/xfs_da_btree.h    |   1 +
 fs/xfs/libxfs/xfs_da_format.h   |   3 +
 fs/xfs/libxfs/xfs_ondisk.h      |   9 +-
 fs/xfs/libxfs/xfs_sb.c          |   2 +
 fs/xfs/xfs_mount.h              |   2 +
 8 files changed, 207 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 55b18ec8bc10..d357405f22ee 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -27,6 +27,7 @@
 #include "xfs_attr_item.h"
 #include "xfs_xattr.h"
 #include "xfs_parent.h"
+#include "xfs_iomap.h"
 
 struct kmem_cache		*xfs_attr_intent_cache;
 
@@ -344,6 +345,175 @@ xfs_attr_set_resv(
 	return ret;
 }
 
+/*
+ * Find attribute specified in args and return iomap pointing to the attribute
+ * data
+ */
+int
+xfs_attr_read_iomap(
+	struct xfs_da_args	*args,
+	struct iomap		*iomap)
+{
+	struct xfs_inode	*ip = args->dp;
+	struct xfs_mount	*mp = ip->i_mount;
+	int			error;
+	struct xfs_bmbt_irec	map[1];
+	int			nmap = 1;
+	int			seq;
+	unsigned int		lockmode = XFS_ILOCK_SHARED;
+	int			ret;
+	uint64_t		pos = xfs_attr_get_position(args);
+
+	ASSERT(!args->region_offset);
+
+	if (xfs_is_shutdown(mp))
+		return -EIO;
+
+	/* We just need to find the attribute and block it's pointing
+	 * to. The reading of data would be done by iomap */
+	args->valuelen = 0;
+	error = xfs_attr_get(args);
+	if (error)
+		return error;
+
+	if (xfs_need_iread_extents(&ip->i_af))
+		lockmode = XFS_ILOCK_EXCL;
+	xfs_ilock(ip, lockmode);
+	error = xfs_bmapi_read(ip, (xfs_fileoff_t)args->rmtblkno,
+			       args->rmtblkcnt, map, &nmap,
+			       XFS_BMAPI_ATTRFORK);
+	xfs_iunlock(ip, lockmode);
+	if (error)
+		return error;
+
+	map[0].br_startoff = XFS_B_TO_FSB(mp, pos | args->region_offset);
+
+	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_XATTR);
+	trace_xfs_iomap_found(ip, pos, args->valuelen, XFS_ATTR_FORK, map);
+	ret = xfs_bmbt_to_iomap(ip, iomap, map, 0, IOMAP_F_XATTR, seq);
+	/* Attributes are at args->region_offset in cache, beyond EOF of the
+	 * file */
+	iomap->flags |= IOMAP_F_BEYOND_EOF;
+
+	return ret;
+}
+
+int
+xfs_attr_read_end_io(
+		struct xfs_da_args		*args)
+{
+	struct xfs_inode			*ip = args->dp;
+	struct xfs_attr_leafblock		*leaf;
+	struct xfs_attr_leaf_entry		*entry;
+	struct xfs_attr_leaf_name_remote	*name_rmt;
+	struct xfs_buf				*bp;
+	struct xfs_mount			*mp = args->dp->i_mount;
+	uint32_t				crc;
+	int					error;
+	unsigned int				whichcrc;
+
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+
+	if (!xfs_inode_hasattr(args->dp)) {
+		error = -ENOATTR;
+		goto out_unlock;
+	}
+
+	error = xfs_iread_extents(args->trans, args->dp, XFS_ATTR_FORK);
+	if (error)
+		goto out_unlock;
+
+	error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner,
+			args->blkno, &bp);
+	if (error)
+		goto out_unlock;
+
+	leaf = bp->b_addr;
+	entry = &xfs_attr3_leaf_entryp(leaf)[args->index];
+
+	whichcrc = (entry->flags & XFS_ATTR_RMCRC_SEL) != 0;
+	name_rmt = xfs_attr3_leaf_name_remote(&(mp->m_sb), leaf,
+					      args->index);
+
+	xfs_calc_cksum(args->value, args->valuelen, &crc);
+	error = name_rmt->crc[whichcrc] != crc;
+	if (error) {
+		if (name_rmt->crc[~whichcrc & 1] != crc) {
+			error = -EFSCORRUPTED;
+			goto out_buf_relse;
+		} else {
+			error = -EFSBADCRC;
+			goto out_buf_relse;
+		}
+	}
+
+out_buf_relse:
+	xfs_buf_relse(bp);
+out_unlock:
+	xfs_iunlock(args->dp, XFS_ILOCK_SHARED);
+	return error;
+}
+
+/*
+ * Create an attribute described in args and return iomap pointing to the extent
+ * where attribute data has to be written.
+ *
+ * Created attribute has XFS_ATTR_INCOMPLETE set, and doesn't have any data CRC.
+ * Therefore, when IO is complete xfs_attr_write_end_ioend() needs to be called.
+ */
+int
+xfs_attr_write_iomap(
+	struct xfs_da_args	*args,
+	struct iomap		*iomap)
+{
+	struct xfs_inode	*ip = args->dp;
+	struct xfs_mount	*mp = ip->i_mount;
+	int			error;
+	int			nmap = 1;
+	int			seq;
+	struct xfs_bmbt_irec	imap[1];
+	uint64_t		pos = xfs_attr_get_position(args);
+	unsigned int		blksize = mp->m_attr_geo->blksize;
+
+	ASSERT(!args->region_offset);
+
+	if (xfs_is_shutdown(mp))
+		return -EIO;
+
+	/* We just want to allocate blocks without copying any data there */
+	args->op_flags |= XFS_DA_OP_EMPTY;
+	args->valuelen = round_up(min_t(int, args->valuelen, blksize), blksize);
+
+	error = xfs_attr_set(args, XFS_ATTRUPDATE_UPSERT, false);
+	if (error)
+		return error;
+
+	ASSERT(args->dp->i_af.if_format != XFS_DINODE_FMT_LOCAL);
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+	error = xfs_bmapi_read(ip, (xfs_fileoff_t)args->rmtblkno,
+			       args->rmtblkcnt, imap, &nmap,
+			       XFS_BMAPI_ATTRFORK);
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+	if (error)
+		return error;
+
+	/* Instead of xattr extent offset, which will be over data, we need
+	 * merkle tree offset in page cache */
+	imap[0].br_startoff = XFS_B_TO_FSBT(mp, pos | args->region_offset);
+
+	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_XATTR);
+	xfs_bmbt_to_iomap(ip, iomap, imap, 0, IOMAP_F_XATTR, seq);
+
+	return 0;
+}
+
+int
+xfs_attr_write_end_ioend(
+		struct xfs_da_args	*args)
+{
+	return xfs_attr_set(args, XFS_ATTRUPDATE_FLAGS, false);
+}
+
 /*
  * Add an attr to a shortform fork. If there is no space,
  * xfs_attr_shortform_addname() will convert to leaf format and return -ENOSPC.
@@ -642,11 +812,15 @@ xfs_attr_rmtval_alloc(
 			goto out;
 	}
 
-	if (!(args->op_flags & XFS_DA_OP_EMPTY)) {
+	if (args->op_flags & XFS_DA_OP_EMPTY) {
+		/* Set XFS_ATTR_INCOMPLETE flag as attribute doesn't have a value
+		 * yet (which should be written by iomap). */
+		error = xfs_attr3_leaf_setflag(args);
+	} else {
 		error = xfs_attr_rmtval_set_value(args);
-		if (error)
-			return error;
 	}
+	if (error)
+		return error;
 
 	attr->xattri_dela_state = xfs_attr_complete_op(attr,
 						++attr->xattri_dela_state);
@@ -1613,3 +1787,12 @@ xfs_attr_intent_destroy_cache(void)
 	kmem_cache_destroy(xfs_attr_intent_cache);
 	xfs_attr_intent_cache = NULL;
 }
+
+/* Retrieve attribute position from the attr name */
+uint64_t
+xfs_attr_get_position(
+	struct xfs_da_args	*args)
+{
+	ASSERT(args->namelen == sizeof(uint64_t));
+	return be64_to_cpu(*(uint64_t*)args->name);
+}
diff --git a/fs/xfs/libxfs/xfs_attr.h b/fs/xfs/libxfs/xfs_attr.h
index b851e2e4b63c..3d137117154e 100644
--- a/fs/xfs/libxfs/xfs_attr.h
+++ b/fs/xfs/libxfs/xfs_attr.h
@@ -6,6 +6,8 @@
 #ifndef __XFS_ATTR_H__
 #define	__XFS_ATTR_H__
 
+#include <linux/iomap.h>
+
 struct xfs_inode;
 struct xfs_da_args;
 struct xfs_attr_list_context;
@@ -569,6 +571,10 @@ bool xfs_attr_namecheck(unsigned int attr_flags, const void *name,
 		size_t length);
 int xfs_attr_calc_size(struct xfs_da_args *args, int *local);
 struct xfs_trans_res xfs_attr_set_resv(const struct xfs_da_args *args);
+int xfs_attr_read_iomap(struct xfs_da_args *args, struct iomap *iomap);
+int xfs_attr_read_end_io(struct xfs_da_args *args);
+int xfs_attr_write_iomap(struct xfs_da_args *args, struct iomap *iomap);
+int xfs_attr_write_end_ioend(struct xfs_da_args *args);
 
 /*
  * Check to see if the attr should be upgraded from non-existent or shortform to
@@ -652,4 +658,6 @@ void xfs_attr_intent_destroy_cache(void);
 int xfs_attr_sf_totsize(struct xfs_inode *dp);
 int xfs_attr_add_fork(struct xfs_inode *ip, int size, int rsvd);
 
+uint64_t xfs_attr_get_position(struct xfs_da_args *args);
+
 #endif	/* __XFS_ATTR_H__ */
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 4c44ce1c8a64..17125e2e6c51 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -62,13 +62,6 @@ xfs_attr3_rmt_blocks(
 	struct xfs_mount	*mp,
 	unsigned int		attrlen)
 {
-	/*
-	 * Each contiguous block has a header, so it is not just a simple
-	 * attribute length to FSB conversion.
-	 */
-	if (xfs_has_crc(mp))
-		return howmany(attrlen, xfs_attr3_rmt_buf_space(mp));
-
 	return XFS_B_TO_FSB(mp, attrlen);
 }
 
@@ -467,11 +460,6 @@ xfs_attr_rmt_find_hole(
 	unsigned int		blkcnt;
 	xfs_fileoff_t		lfileoff = 0;
 
-	/*
-	 * Because CRC enable attributes have headers, we can't just do a
-	 * straight byte to FSB conversion and have to take the header space
-	 * into account.
-	 */
 	blkcnt = xfs_attr3_rmt_blocks(mp, args->rmtvaluelen);
 	error = xfs_bmap_first_unused(args->trans, args->dp, blkcnt, &lfileoff,
 						   XFS_ATTR_FORK);
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 2428a3a466cb..81e3426ccb77 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -85,6 +85,7 @@ typedef struct xfs_da_args {
 	int		rmtblkcnt2;	/* remote attr value block count */
 	int		rmtvaluelen2;	/* remote attr value length in bytes */
 	enum xfs_dacmp	cmpresult;	/* name compare result for lookups */
+	loff_t		region_offset;	/* offset of the iomapped attr region */
 } xfs_da_args_t;
 
 /*
diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index afc25b6d805e..bf0f73624446 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -645,6 +645,7 @@ typedef struct xfs_attr_leaf_name_local {
 } xfs_attr_leaf_name_local_t;
 
 typedef struct xfs_attr_leaf_name_remote {
+	__be32	crc[2];			/* CRC of the xattr data */
 	__be32	valueblk;		/* block number of value bytes */
 	__be32	valuelen;		/* number of bytes in value */
 	__u8	namelen;		/* length of name bytes */
@@ -715,11 +716,13 @@ struct xfs_attr3_leafblock {
 #define	XFS_ATTR_ROOT_BIT	1	/* limit access to trusted attrs */
 #define	XFS_ATTR_SECURE_BIT	2	/* limit access to secure attrs */
 #define	XFS_ATTR_PARENT_BIT	3	/* parent pointer attrs */
+#define	XFS_ATTR_RMCRC_SEL_BIT	4	/* which CRC field is primary */
 #define	XFS_ATTR_INCOMPLETE_BIT	7	/* attr in middle of create/delete */
 #define XFS_ATTR_LOCAL		(1u << XFS_ATTR_LOCAL_BIT)
 #define XFS_ATTR_ROOT		(1u << XFS_ATTR_ROOT_BIT)
 #define XFS_ATTR_SECURE		(1u << XFS_ATTR_SECURE_BIT)
 #define XFS_ATTR_PARENT		(1u << XFS_ATTR_PARENT_BIT)
+#define XFS_ATTR_RMCRC_SEL	(1u << XFS_ATTR_RMCRC_SEL_BIT)
 #define XFS_ATTR_INCOMPLETE	(1u << XFS_ATTR_INCOMPLETE_BIT)
 
 #define XFS_ATTR_NSP_ONDISK_MASK	(XFS_ATTR_ROOT | \
diff --git a/fs/xfs/libxfs/xfs_ondisk.h b/fs/xfs/libxfs/xfs_ondisk.h
index ad0dedf00f18..2617081bf989 100644
--- a/fs/xfs/libxfs/xfs_ondisk.h
+++ b/fs/xfs/libxfs/xfs_ondisk.h
@@ -96,10 +96,11 @@ xfs_check_ondisk_structs(void)
 	XFS_CHECK_OFFSET(struct xfs_attr_leaf_name_local, valuelen,	0);
 	XFS_CHECK_OFFSET(struct xfs_attr_leaf_name_local, namelen,	2);
 	XFS_CHECK_OFFSET(struct xfs_attr_leaf_name_local, nameval,	3);
-	XFS_CHECK_OFFSET(struct xfs_attr_leaf_name_remote, valueblk,	0);
-	XFS_CHECK_OFFSET(struct xfs_attr_leaf_name_remote, valuelen,	4);
-	XFS_CHECK_OFFSET(struct xfs_attr_leaf_name_remote, namelen,	8);
-	XFS_CHECK_OFFSET(struct xfs_attr_leaf_name_remote, name,	9);
+	XFS_CHECK_OFFSET(struct xfs_attr_leaf_name_remote, crc,		0);
+	XFS_CHECK_OFFSET(struct xfs_attr_leaf_name_remote, valueblk,	8);
+	XFS_CHECK_OFFSET(struct xfs_attr_leaf_name_remote, valuelen,	12);
+	XFS_CHECK_OFFSET(struct xfs_attr_leaf_name_remote, namelen,	16);
+	XFS_CHECK_OFFSET(struct xfs_attr_leaf_name_remote, name,	17);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_attr_leafblock,		32);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_attr_sf_hdr,		4);
 	XFS_CHECK_OFFSET(struct xfs_attr_sf_hdr, totsize,	0);
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 3b5623611eba..20395ba66b94 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -183,6 +183,8 @@ xfs_sb_version_to_features(
 		features |= XFS_FEAT_PARENT;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_METADIR)
 		features |= XFS_FEAT_METADIR;
+	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_DXATTR)
+		features |= XFS_FEAT_DXATTR;
 
 	return features;
 }
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index d772d908ba3c..1fa4a57421c3 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -330,6 +330,7 @@ typedef struct xfs_mount {
 #define XFS_FEAT_NREXT64	(1ULL << 26)	/* large extent counters */
 #define XFS_FEAT_EXCHANGE_RANGE	(1ULL << 27)	/* exchange range */
 #define XFS_FEAT_METADIR	(1ULL << 28)	/* metadata directory tree */
+#define XFS_FEAT_DXATTR		(1ULL << 29)	/* directly mapped xattrs */
 
 /* Mount features */
 #define XFS_FEAT_NOATTR2	(1ULL << 48)	/* disable attr2 creation */
@@ -386,6 +387,7 @@ __XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
 __XFS_HAS_FEAT(large_extent_counts, NREXT64)
 __XFS_HAS_FEAT(exchange_range, EXCHANGE_RANGE)
 __XFS_HAS_FEAT(metadir, METADIR)
+__XFS_HAS_FEAT(dxattr, DXATTR)
 
 static inline bool xfs_has_rtgroups(struct xfs_mount *mp)
 {
-- 
2.47.0



* [PATCH 12/14] xfs: parse both remote attr name on-disk formats
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
                     ` (10 preceding siblings ...)
  2024-12-29 13:38   ` [PATCH 11/14] xfs: add interface for page cache mapped remote xattrs Andrey Albershteyn
@ 2024-12-29 13:38   ` Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 13/14] xfs: do not use xfs_attr3_rmt_hdr for remote value blocks for dxattr Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 14/14] xfs: enable XFS_SB_FEAT_INCOMPAT_DXATTR Andrey Albershteyn
  13 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

The new remote attr name format adds a ->crc[2] field. On filesystems
without dxattr this field does not exist on disk, so it overlaps
whatever precedes the remote attr name and must not be used.

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/libxfs/xfs_da_format.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
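
The overlap described in the commit message can be illustrated with a
minimal userspace sketch. It is not the kernel code: struct
rmt_name_sketch, rmt_name_base() and rmt_name_valueblk() are
hypothetical names, the field layout (crc[2] placed before valueblk) is
an assumption based on this patch, and values are read in host byte
order for simplicity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the new remote name entry layout. */
struct rmt_name_sketch {
	uint32_t	crc[2];		/* assumed: on disk only with dxattr */
	uint32_t	valueblk;
	uint32_t	valuelen;
};

/*
 * Return the address at which the structure is overlaid for an entry
 * that starts at 'entry'. Without dxattr the entry has no crc[2] on
 * disk, so the base is moved back by the size of crc[2] and the later
 * fields line up with the old format.
 */
static const uint8_t *
rmt_name_base(const uint8_t *entry, bool has_dxattr)
{
	if (!has_dxattr)
		return entry - sizeof(((struct rmt_name_sketch *)0)->crc);
	return entry;
}

/* Read valueblk through the overlay (host byte order in this sketch). */
static uint32_t
rmt_name_valueblk(const uint8_t *entry, bool has_dxattr)
{
	uint32_t	v;

	memcpy(&v, rmt_name_base(entry, has_dxattr) +
			offsetof(struct rmt_name_sketch, valueblk), sizeof(v));
	return v;
}
```

In this sketch the caller must guarantee that at least eight bytes
precede the entry when dxattr is off, and nothing may dereference crc[]
in that case, since those bytes belong to whatever sits in front of the
entry.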

diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index bf0f73624446..4034530ad023 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -781,7 +781,13 @@ xfs_attr3_leaf_name_remote(
 		xfs_attr_leafblock_t	*leafp,
 		int			idx)
 {
-	return (xfs_attr_leaf_name_remote_t *)xfs_attr3_leaf_name(leafp, idx);
+	char *leaf = xfs_attr3_leaf_name(leafp, idx);
+	/* Without DXATTR the on-disk entry has no ->crc[2] field, so step the
+	 * pointer back and let crc[2] overlap whatever precedes the entry; the
+	 * overlapped bytes are never accessed through ->crc in that case. */
+	if (!xfs_sb_has_incompat_feature(sb, XFS_SB_FEAT_INCOMPAT_DXATTR))
+		return (xfs_attr_leaf_name_remote_t *)(leaf - 2*sizeof(__be32));
+	return (xfs_attr_leaf_name_remote_t *)leaf;
 }
 
 static inline xfs_attr_leaf_name_local_t *
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 13/14] xfs: do not use xfs_attr3_rmt_hdr for remote value blocks for dxattr
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
                     ` (11 preceding siblings ...)
  2024-12-29 13:38   ` [PATCH 12/14] xfs: parse both remote attr name on-disk formats Andrey Albershteyn
@ 2024-12-29 13:38   ` Andrey Albershteyn
  2024-12-29 13:38   ` [PATCH 14/14] xfs: enable XFS_SB_FEAT_INCOMPAT_DXATTR Andrey Albershteyn
  13 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: "Darrick J. Wong" <djwong@kernel.org>

Don't use xfs_attr3_rmt_hdr for the data blocks of directly mapped
attributes, because these blocks have no header. Their CRCs are
stored in the btree structure and are verified on I/O completion.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/libxfs/xfs_attr.c        |  6 ++-
 fs/xfs/libxfs/xfs_attr_leaf.c   |  5 +-
 fs/xfs/libxfs/xfs_attr_remote.c | 91 ++++++++++++++++++++++++++++-----
 fs/xfs/libxfs/xfs_attr_remote.h |  8 ++-
 fs/xfs/libxfs/xfs_da_format.h   |  2 +-
 fs/xfs/libxfs/xfs_shared.h      |  1 +
 fs/xfs/xfs_attr_inactive.c      |  2 +-
 7 files changed, 94 insertions(+), 21 deletions(-)
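
The effect of dropping the header on the remote block count can be
sketched as follows. This is a standalone illustration, not the kernel
helpers: rmt_buf_space() and rmt_blocks() are hypothetical names, and
SKETCH_RMT_HDR_SIZE (56 bytes) is an assumed value for
sizeof(struct xfs_attr3_rmt_hdr).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed size of the per-block remote value header. */
#define SKETCH_RMT_HDR_SIZE	56u

/* Bytes of attribute value that fit into one remote block. */
static unsigned int
rmt_buf_space(unsigned int blocksize, bool has_header)
{
	assert(blocksize > SKETCH_RMT_HDR_SIZE);
	return has_header ? blocksize - SKETCH_RMT_HDR_SIZE : blocksize;
}

/*
 * Blocks needed for an attrlen-byte value. With headers every block
 * loses SKETCH_RMT_HDR_SIZE bytes, so this is not a plain byte-to-block
 * conversion; without headers it is a simple round-up division.
 */
static unsigned int
rmt_blocks(unsigned int blocksize, bool has_header, unsigned int attrlen)
{
	unsigned int	space = rmt_buf_space(blocksize, has_header);

	return (attrlen + space - 1) / space;
}
```

Under these assumptions a 64 KiB value on 4096-byte blocks needs 17
blocks with headers but only 16 without, which is why the value data of
a dxattr stays block aligned.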

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index d357405f22ee..e452ca55241f 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -321,7 +321,8 @@ xfs_attr_calc_size(
 		 * Out of line attribute, cannot double split, but
 		 * make room for the attribute value itself.
 		 */
-		uint	dblocks = xfs_attr3_rmt_blocks(mp, args->valuelen);
+		uint	dblocks = xfs_attr3_rmt_blocks(mp, args->attr_filter,
+						       args->valuelen);
 		nblks += dblocks;
 		nblks += XFS_NEXTENTADD_SPACE_RES(mp, dblocks, XFS_ATTR_FORK);
 	}
@@ -1263,7 +1264,8 @@ xfs_attr_set(
 		}
 
 		if (!local)
-			rmt_blks = xfs_attr3_rmt_blocks(mp, args->valuelen);
+			rmt_blks = xfs_attr3_rmt_blocks(mp, args->attr_filter,
+					args->valuelen);
 
 		tres = xfs_attr_set_resv(args);
 		total = args->total;
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index 409c91827b47..7e3577a8e5de 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -1573,7 +1573,8 @@ xfs_attr3_leaf_add_work(
 		name_rmt->valuelen = 0;
 		name_rmt->valueblk = 0;
 		args->rmtblkno = 1;
-		args->rmtblkcnt = xfs_attr3_rmt_blocks(mp, args->valuelen);
+		args->rmtblkcnt = xfs_attr3_rmt_blocks(mp, args->attr_filter,
+				args->valuelen);
 		args->rmtvaluelen = args->valuelen;
 	}
 	xfs_trans_log_buf(
@@ -2512,6 +2513,7 @@ xfs_attr3_leaf_lookup_int(
 			args->rmtblkno = be32_to_cpu(name_rmt->valueblk);
 			args->rmtblkcnt = xfs_attr3_rmt_blocks(
 							args->dp->i_mount,
+							args->attr_filter,
 							args->rmtvaluelen);
 			return -EEXIST;
 		}
@@ -2561,6 +2563,7 @@ xfs_attr3_leaf_getvalue(
 	args->rmtvaluelen = be32_to_cpu(name_rmt->valuelen);
 	args->rmtblkno = be32_to_cpu(name_rmt->valueblk);
 	args->rmtblkcnt = xfs_attr3_rmt_blocks(args->dp->i_mount,
+					       args->attr_filter,
 					       args->rmtvaluelen);
 	return xfs_attr_copy_value(args, NULL, args->rmtvaluelen);
 }
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 17125e2e6c51..e90a62c61f28 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -43,14 +43,23 @@
  * the logging system and therefore never have a log item.
  */
 
+static inline bool
+xfs_attr3_rmt_has_header(
+	struct xfs_mount	*mp,
+	unsigned int		attrns)
+{
+	return xfs_has_crc(mp) && !xfs_has_dxattr(mp);
+}
+
 /* How many bytes can be stored in a remote value buffer? */
 inline unsigned int
 xfs_attr3_rmt_buf_space(
-	struct xfs_mount	*mp)
+	struct xfs_mount	*mp,
+	unsigned int		attrns)
 {
 	unsigned int		blocksize = mp->m_attr_geo->blksize;
 
-	if (xfs_has_crc(mp))
+	if (xfs_attr3_rmt_has_header(mp, attrns))
 		return blocksize - sizeof(struct xfs_attr3_rmt_hdr);
 
 	return blocksize;
@@ -60,8 +69,16 @@ xfs_attr3_rmt_buf_space(
 unsigned int
 xfs_attr3_rmt_blocks(
 	struct xfs_mount	*mp,
+	unsigned int		attrns,
 	unsigned int		attrlen)
 {
+	/*
+	 * Each contiguous block has a header, so it is not just a simple
+	 * attribute length to FSB conversion.
+	 */
+	if (xfs_attr3_rmt_has_header(mp, attrns))
+		return howmany(attrlen, xfs_attr3_rmt_buf_space(mp, attrns));
+
 	return XFS_B_TO_FSB(mp, attrlen);
 }
 
@@ -241,6 +258,42 @@ const struct xfs_buf_ops xfs_attr3_rmt_buf_ops = {
 	.verify_struct = xfs_attr3_rmt_verify_struct,
 };
 
+static void
+xfs_attr3_rmtdxattr_read_verify(
+	struct xfs_buf	*bp)
+{
+}
+
+static xfs_failaddr_t
+xfs_attr3_rmtdxattr_verify_struct(
+	struct xfs_buf	*bp)
+{
+	return NULL;
+}
+
+static void
+xfs_attr3_rmtdxattr_write_verify(
+	struct xfs_buf	*bp)
+{
+}
+
+const struct xfs_buf_ops xfs_attr3_rmtdxattr_buf_ops = {
+	.name = "xfs_attr3_remote_dxattr",
+	.magic = { 0, 0 },
+	.verify_read = xfs_attr3_rmtdxattr_read_verify,
+	.verify_write = xfs_attr3_rmtdxattr_write_verify,
+	.verify_struct = xfs_attr3_rmtdxattr_verify_struct,
+};
+
+inline const struct xfs_buf_ops *
+xfs_attr3_remote_buf_ops(
+	struct xfs_mount	*mp)
+{
+	if (xfs_has_dxattr(mp))
+		return &xfs_attr3_rmtdxattr_buf_ops;
+	return &xfs_attr3_rmt_buf_ops;
+}
+
 STATIC int
 xfs_attr3_rmt_hdr_set(
 	struct xfs_mount	*mp,
@@ -286,6 +339,7 @@ xfs_attr_rmtval_copyout(
 	struct xfs_buf		*bp,
 	struct xfs_inode	*dp,
 	xfs_ino_t		owner,
+	unsigned int		attrns,
 	unsigned int		*offset,
 	unsigned int		*valuelen,
 	uint8_t			**dst)
@@ -299,11 +353,11 @@ xfs_attr_rmtval_copyout(
 
 	while (len > 0 && *valuelen > 0) {
 		unsigned int hdr_size = 0;
-		unsigned int byte_cnt = xfs_attr3_rmt_buf_space(mp);
+		unsigned int byte_cnt = xfs_attr3_rmt_buf_space(mp, attrns);
 
 		byte_cnt = min(*valuelen, byte_cnt);
 
-		if (xfs_has_crc(mp)) {
+		if (xfs_attr3_rmt_has_header(mp, attrns)) {
 			if (xfs_attr3_rmt_hdr_ok(src, owner, *offset,
 						  byte_cnt, bno)) {
 				xfs_alert(mp,
@@ -335,6 +389,7 @@ xfs_attr_rmtval_copyin(
 	struct xfs_mount *mp,
 	struct xfs_buf	*bp,
 	xfs_ino_t	ino,
+	unsigned int	attrns,
 	unsigned int	*offset,
 	unsigned int	*valuelen,
 	uint8_t		**src)
@@ -347,12 +402,13 @@ xfs_attr_rmtval_copyin(
 	ASSERT(len >= blksize);
 
 	while (len > 0 && *valuelen > 0) {
-		unsigned int hdr_size;
-		unsigned int byte_cnt = xfs_attr3_rmt_buf_space(mp);
+		unsigned int hdr_size = 0;
+		unsigned int byte_cnt = xfs_attr3_rmt_buf_space(mp, attrns);
 
 		byte_cnt = min(*valuelen, byte_cnt);
-		hdr_size = xfs_attr3_rmt_hdr_set(mp, dst, ino, *offset,
-						 byte_cnt, bno);
+		if (xfs_attr3_rmt_has_header(mp, attrns))
+			hdr_size = xfs_attr3_rmt_hdr_set(mp, dst, ino, *offset,
+					byte_cnt, bno);
 
 		memcpy(dst + hdr_size, *src, byte_cnt);
 
@@ -400,6 +456,7 @@ xfs_attr_rmtval_get(
 	unsigned int		blkcnt = args->rmtblkcnt;
 	int			i;
 	unsigned int		offset = 0;
+	const struct xfs_buf_ops *ops = xfs_attr3_remote_buf_ops(mp);
 
 	trace_xfs_attr_rmtval_get(args);
 
@@ -425,14 +482,15 @@ xfs_attr_rmtval_get(
 			dblkno = XFS_FSB_TO_DADDR(mp, map[i].br_startblock);
 			dblkcnt = XFS_FSB_TO_BB(mp, map[i].br_blockcount);
 			error = xfs_buf_read(mp->m_ddev_targp, dblkno, dblkcnt,
-					0, &bp, &xfs_attr3_rmt_buf_ops);
+					0, &bp, ops);
 			if (xfs_metadata_is_sick(error))
 				xfs_dirattr_mark_sick(args->dp, XFS_ATTR_FORK);
 			if (error)
 				return error;
 
 			error = xfs_attr_rmtval_copyout(mp, bp, args->dp,
-					args->owner, &offset, &valuelen, &dst);
+					args->owner, args->attr_filter,
+					&offset, &valuelen, &dst);
 			xfs_buf_relse(bp);
 			if (error)
 				return error;
@@ -460,7 +518,12 @@ xfs_attr_rmt_find_hole(
 	unsigned int		blkcnt;
 	xfs_fileoff_t		lfileoff = 0;
 
-	blkcnt = xfs_attr3_rmt_blocks(mp, args->rmtvaluelen);
+	/*
+	 * Because CRC-enabled attributes have headers, we can't just do a
+	 * straight byte to FSB conversion and have to take the header space
+	 * into account.
+	 */
+	blkcnt = xfs_attr3_rmt_blocks(mp, args->attr_filter, args->rmtvaluelen);
 	error = xfs_bmap_first_unused(args->trans, args->dp, blkcnt, &lfileoff,
 						   XFS_ATTR_FORK);
 	if (error)
@@ -519,10 +582,10 @@ xfs_attr_rmtval_set_value(
 		error = xfs_buf_get(mp->m_ddev_targp, dblkno, dblkcnt, &bp);
 		if (error)
 			return error;
-		bp->b_ops = &xfs_attr3_rmt_buf_ops;
+		bp->b_ops = xfs_attr3_remote_buf_ops(mp);
 
-		xfs_attr_rmtval_copyin(mp, bp, args->owner, &offset, &valuelen,
-				&src);
+		xfs_attr_rmtval_copyin(mp, bp, args->owner, args->attr_filter,
+				       &offset, &valuelen, &src);
 
 		error = xfs_bwrite(bp);	/* GROT: NOTE: synchronous write */
 		xfs_buf_relse(bp);
diff --git a/fs/xfs/libxfs/xfs_attr_remote.h b/fs/xfs/libxfs/xfs_attr_remote.h
index e3c6c7d774bf..2e2b3489a6cb 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.h
+++ b/fs/xfs/libxfs/xfs_attr_remote.h
@@ -6,12 +6,13 @@
 #ifndef __XFS_ATTR_REMOTE_H__
 #define	__XFS_ATTR_REMOTE_H__
 
-unsigned int xfs_attr3_rmt_blocks(struct xfs_mount *mp, unsigned int attrlen);
+unsigned int xfs_attr3_rmt_blocks(struct xfs_mount *mp, unsigned int attrns,
+		unsigned int attrlen);
 
 /* Number of rmt blocks needed to store the maximally sized attr value */
 static inline unsigned int xfs_attr3_max_rmt_blocks(struct xfs_mount *mp)
 {
-	return xfs_attr3_rmt_blocks(mp, XFS_XATTR_SIZE_MAX);
+	return xfs_attr3_rmt_blocks(mp, 0, XFS_XATTR_SIZE_MAX);
 }
 
 int xfs_attr_rmtval_get(struct xfs_da_args *args);
@@ -23,4 +24,7 @@ int xfs_attr_rmt_find_hole(struct xfs_da_args *args);
 int xfs_attr_rmtval_set_value(struct xfs_da_args *args);
 int xfs_attr_rmtval_set_blk(struct xfs_attr_intent *attr);
 int xfs_attr_rmtval_find_space(struct xfs_attr_intent *attr);
+
+const struct xfs_buf_ops *xfs_attr3_remote_buf_ops(struct xfs_mount *mp);
+
 #endif /* __XFS_ATTR_REMOTE_H__ */
diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index 4034530ad023..48bebcd1e226 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -892,7 +892,7 @@ struct xfs_attr3_rmt_hdr {
 
 #define XFS_ATTR3_RMT_CRC_OFF	offsetof(struct xfs_attr3_rmt_hdr, rm_crc)
 
-unsigned int xfs_attr3_rmt_buf_space(struct xfs_mount *mp);
+unsigned int xfs_attr3_rmt_buf_space(struct xfs_mount *mp, unsigned int attrns);
 
 /* Number of bytes in a directory block. */
 static inline unsigned int xfs_dir2_dirblock_bytes(struct xfs_sb *sbp)
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index e7efdb9ceaf3..59921c12ed15 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -26,6 +26,7 @@ extern const struct xfs_buf_ops xfs_agfl_buf_ops;
 extern const struct xfs_buf_ops xfs_agi_buf_ops;
 extern const struct xfs_buf_ops xfs_attr3_leaf_buf_ops;
 extern const struct xfs_buf_ops xfs_attr3_rmt_buf_ops;
+extern const struct xfs_buf_ops xfs_attr3_rmtdxattr_buf_ops;
 extern const struct xfs_buf_ops xfs_bmbt_buf_ops;
 extern const struct xfs_buf_ops xfs_bnobt_buf_ops;
 extern const struct xfs_buf_ops xfs_cntbt_buf_ops;
diff --git a/fs/xfs/xfs_attr_inactive.c b/fs/xfs/xfs_attr_inactive.c
index 2495ff76acec..d7a7f250d1c0 100644
--- a/fs/xfs/xfs_attr_inactive.c
+++ b/fs/xfs/xfs_attr_inactive.c
@@ -110,7 +110,7 @@ xfs_attr3_leaf_inactive(
 		if (!name_rmt->valueblk)
 			continue;
 
-		blkcnt = xfs_attr3_rmt_blocks(dp->i_mount,
+		blkcnt = xfs_attr3_rmt_blocks(dp->i_mount, entry->flags,
 				be32_to_cpu(name_rmt->valuelen));
 		error = xfs_attr3_rmt_stale(dp,
 				be32_to_cpu(name_rmt->valueblk), blkcnt);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 14/14] xfs: enable XFS_SB_FEAT_INCOMPAT_DXATTR
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
                     ` (12 preceding siblings ...)
  2024-12-29 13:38   ` [PATCH 13/14] xfs: do not use xfs_attr3_rmt_hdr for remote value blocks for dxattr Andrey Albershteyn
@ 2024-12-29 13:38   ` Andrey Albershteyn
  13 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

Enable the directly mapped attribute data feature. This feature
includes an on-disk format change in remote attribute leaves.

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
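
Why the bit has to be added to the supported mask can be shown with a
small sketch of an unknown-feature check. The bit values and the helper
name below are hypothetical and chosen only for illustration; they are
not the real XFS_SB_FEAT_INCOMPAT_* values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define SKETCH_INCOMPAT_METADIR	(1u << 0)
#define SKETCH_INCOMPAT_DXATTR	(1u << 1)

/*
 * True if the superblock advertises an incompat feature that this
 * implementation does not list in its supported mask. A filesystem
 * with such a bit set has to be refused at mount time.
 */
static bool
has_unknown_incompat(uint32_t sb_features, uint32_t supported_mask)
{
	return (sb_features & ~supported_mask) != 0;
}
```

With the DXATTR bit missing from the supported mask, a superblock that
sets it is treated as carrying an unknown feature; once the bit is part
of the mask, the same superblock passes the check.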

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 154458d72bc6..334ca8243b19 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -405,7 +405,8 @@ xfs_sb_has_ro_compat_feature(
 		 XFS_SB_FEAT_INCOMPAT_NREXT64 | \
 		 XFS_SB_FEAT_INCOMPAT_EXCHRANGE | \
 		 XFS_SB_FEAT_INCOMPAT_PARENT | \
-		 XFS_SB_FEAT_INCOMPAT_METADIR)
+		 XFS_SB_FEAT_INCOMPAT_METADIR | \
+		 XFS_SB_FEAT_INCOMPAT_DXATTR)
 
 #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
 static inline bool
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs
  2024-12-29 13:33 [RFC] Directly mapped xattr data & fs-verity Andrey Albershteyn
                   ` (2 preceding siblings ...)
  2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
@ 2024-12-29 13:39 ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 01/24] fs: add FS_XFLAG_VERITY for verity files Andrey Albershteyn
                     ` (23 more replies)
  2025-01-06 15:42 ` [RFC] Directly mapped xattr data & fs-verity Christoph Hellwig
  4 siblings, 24 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

Use the new extended attribute format, in which attribute data is
aligned to filesystem blocks and carries no header. The blocks are
mapped through the page cache via iomap.

Andrey

Andrey Albershteyn (15):
  fs: add FS_XFLAG_VERITY for verity files
  fsverity: pass tree_blocksize to end_enable_verity()
  fsverity: add tracepoints
  fsverity: flush pagecache before enabling verity
  iomap: integrate fs-verity verification into iomap's read path
  xfs: add attribute type for fs-verity
  xfs: add fs-verity ro-compat flag
  xfs: add inode on-disk VERITY flag
  xfs: initialize fs-verity on file open and cleanup on inode
    destruction
  xfs: don't allow to enable DAX on fs-verity sealed inode
  xfs: disable direct read path for fs-verity files
  xfs: add fs-verity support
  xfs: add writeback page mapping for fs-verity
  xfs: add fs-verity ioctls
  xfs: enable ro-compat fs-verity flag

Darrick J. Wong (9):
  fsverity: pass the new tree size and block size to
    ->begin_enable_verity
  fsverity: expose merkle tree geometry to callers
  fsverity: report validation errors back to the filesystem
  xfs: use an empty transaction to protect xfs_attr_get from deadlocks
  xfs: don't let xfs_bmap_first_unused overflow a xfs_dablk_t
  xfs: use merkle tree offset as attr hash
  xfs: advertise fs-verity being available on filesystem
  xfs: check and repair the verity inode flag state
  xfs: report verity failures through the health system

 Documentation/filesystems/fsverity.rst |   8 +
 MAINTAINERS                            |   1 +
 fs/btrfs/verity.c                      |   7 +-
 fs/ext4/verity.c                       |   6 +-
 fs/f2fs/verity.c                       |   6 +-
 fs/ioctl.c                             |  11 +
 fs/iomap/buffered-io.c                 |  30 +-
 fs/verity/enable.c                     |  18 +-
 fs/verity/fsverity_private.h           |   2 +
 fs/verity/init.c                       |   1 +
 fs/verity/open.c                       |  37 ++
 fs/verity/verify.c                     |  13 +
 fs/xfs/Makefile                        |   2 +
 fs/xfs/libxfs/xfs_ag.h                 |   1 +
 fs/xfs/libxfs/xfs_attr.c               |  14 +
 fs/xfs/libxfs/xfs_attr_remote.c        |   3 +
 fs/xfs/libxfs/xfs_da_btree.c           |   3 +
 fs/xfs/libxfs/xfs_da_format.h          |  34 +-
 fs/xfs/libxfs/xfs_format.h             |  17 +-
 fs/xfs/libxfs/xfs_fs.h                 |   2 +
 fs/xfs/libxfs/xfs_health.h             |   4 +-
 fs/xfs/libxfs/xfs_inode_buf.c          |   8 +
 fs/xfs/libxfs/xfs_inode_util.c         |   2 +
 fs/xfs/libxfs/xfs_log_format.h         |   1 +
 fs/xfs/libxfs/xfs_ondisk.h             |   4 +
 fs/xfs/libxfs/xfs_sb.c                 |   4 +
 fs/xfs/libxfs/xfs_verity.c             |  74 ++++
 fs/xfs/libxfs/xfs_verity.h             |  14 +
 fs/xfs/scrub/attr.c                    |   7 +
 fs/xfs/scrub/common.c                  |  68 ++++
 fs/xfs/scrub/common.h                  |   3 +
 fs/xfs/scrub/inode.c                   |   7 +
 fs/xfs/scrub/inode_repair.c            |  36 ++
 fs/xfs/xfs_aops.c                      | 141 +++++++-
 fs/xfs/xfs_file.c                      |  23 +-
 fs/xfs/xfs_fsops.c                     |   1 +
 fs/xfs/xfs_fsverity.c                  | 482 +++++++++++++++++++++++++
 fs/xfs/xfs_fsverity.h                  |  54 +++
 fs/xfs/xfs_health.c                    |   1 +
 fs/xfs/xfs_inode.h                     |   2 +
 fs/xfs/xfs_ioctl.c                     |  16 +
 fs/xfs/xfs_iomap.h                     |   2 +
 fs/xfs/xfs_iops.c                      |   4 +
 fs/xfs/xfs_mount.c                     |   1 +
 fs/xfs/xfs_mount.h                     |   2 +
 fs/xfs/xfs_super.c                     |  11 +
 fs/xfs/xfs_trace.c                     |   1 +
 fs/xfs/xfs_trace.h                     |  42 ++-
 include/linux/fsverity.h               |  34 +-
 include/linux/iomap.h                  |   5 +
 include/trace/events/fsverity.h        | 162 +++++++++
 include/uapi/linux/fs.h                |   1 +
 52 files changed, 1400 insertions(+), 33 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_verity.c
 create mode 100644 fs/xfs/libxfs/xfs_verity.h
 create mode 100644 fs/xfs/xfs_fsverity.c
 create mode 100644 fs/xfs/xfs_fsverity.h
 create mode 100644 include/trace/events/fsverity.h

-- 
2.47.0


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 01/24] fs: add FS_XFLAG_VERITY for verity files
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 02/24] fsverity: pass tree_blocksize to end_enable_verity() Andrey Albershteyn
                     ` (22 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

Add the FS_XFLAG_VERITY extended file attribute flag, which is
reported for inodes with fs-verity enabled.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
[djwong: fix broken verity flag checks]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 Documentation/filesystems/fsverity.rst |  8 ++++++++
 fs/ioctl.c                             | 11 +++++++++++
 include/uapi/linux/fs.h                |  1 +
 3 files changed, 20 insertions(+)
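
The check added to fileattr_set_prepare() relies on XOR to detect a
change of the verity bit in either direction. A minimal standalone
sketch of that idea, where SKETCH_XFLAG_VERITY mirrors the value from
this patch and verity_flag_changed() is a hypothetical helper name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as FS_XFLAG_VERITY in this patch. */
#define SKETCH_XFLAG_VERITY	0x00020000u

/*
 * XOR leaves a bit set only where the old and new flag words differ,
 * so masking with the verity bit detects both an attempt to set it and
 * an attempt to clear it.
 */
static bool
verity_flag_changed(uint32_t old_xflags, uint32_t new_xflags)
{
	return ((old_xflags ^ new_xflags) & SKETCH_XFLAG_VERITY) != 0;
}
```

A caller would return -EINVAL when this helper reports a change, while
updates that leave the verity bit as it was (set or clear) go through.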

diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
index 76e538217868..ea4ab52b6598 100644
--- a/Documentation/filesystems/fsverity.rst
+++ b/Documentation/filesystems/fsverity.rst
@@ -336,6 +336,14 @@ the file has fs-verity enabled.  This can perform better than
 FS_IOC_GETFLAGS and FS_IOC_MEASURE_VERITY because it doesn't require
 opening the file, and opening verity files can be expensive.
 
+FS_IOC_FSGETXATTR
+-----------------
+
+Since Linux v6.9, the FS_IOC_FSGETXATTR ioctl sets FS_XFLAG_VERITY (0x00020000)
+in the returned flags when the file has verity enabled. Note that this attribute
+cannot be set with FS_IOC_FSSETXATTR as enabling verity requires input
+parameters. See FS_IOC_ENABLE_VERITY.
+
 .. _accessing_verity_files:
 
 Accessing verity files
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 638a36be31c1..3484941ec30d 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -480,6 +480,8 @@ void fileattr_fill_xflags(struct fileattr *fa, u32 xflags)
 		fa->flags |= FS_DAX_FL;
 	if (fa->fsx_xflags & FS_XFLAG_PROJINHERIT)
 		fa->flags |= FS_PROJINHERIT_FL;
+	if (fa->fsx_xflags & FS_XFLAG_VERITY)
+		fa->flags |= FS_VERITY_FL;
 }
 EXPORT_SYMBOL(fileattr_fill_xflags);
 
@@ -510,6 +512,8 @@ void fileattr_fill_flags(struct fileattr *fa, u32 flags)
 		fa->fsx_xflags |= FS_XFLAG_DAX;
 	if (fa->flags & FS_PROJINHERIT_FL)
 		fa->fsx_xflags |= FS_XFLAG_PROJINHERIT;
+	if (fa->flags & FS_VERITY_FL)
+		fa->fsx_xflags |= FS_XFLAG_VERITY;
 }
 EXPORT_SYMBOL(fileattr_fill_flags);
 
@@ -640,6 +644,13 @@ static int fileattr_set_prepare(struct inode *inode,
 	    !(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
 		return -EINVAL;
 
+	/*
+	 * Verity cannot be changed through FS_IOC_FSSETXATTR/FS_IOC_SETFLAGS.
+	 * See FS_IOC_ENABLE_VERITY.
+	 */
+	if ((fa->fsx_xflags ^ old_ma->fsx_xflags) & FS_XFLAG_VERITY)
+		return -EINVAL;
+
 	/* Extent size hints of zero turn off the flags. */
 	if (fa->fsx_extsize == 0)
 		fa->fsx_xflags &= ~(FS_XFLAG_EXTSIZE | FS_XFLAG_EXTSZINHERIT);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 753971770733..803f1c47f187 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -158,6 +158,7 @@ struct fsxattr {
 #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
 #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
 #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
+#define FS_XFLAG_VERITY		0x00020000	/* fs-verity enabled */
 #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
 
 /* the read-only stuff doesn't really belong here, but any other place is
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 02/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 01/24] fs: add FS_XFLAG_VERITY for verity files Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 03/24] fsverity: add tracepoints Andrey Albershteyn
                     ` (21 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

XFS will need to know tree_blocksize to remove the tree in case of an
error. The size is needed to calculate offsets of particular Merkle
tree blocks.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: I put ebiggers' suggested changes in a separate patch]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/btrfs/verity.c        | 4 +++-
 fs/ext4/verity.c         | 3 ++-
 fs/f2fs/verity.c         | 3 ++-
 fs/verity/enable.c       | 6 ++++--
 include/linux/fsverity.h | 4 +++-
 5 files changed, 14 insertions(+), 6 deletions(-)
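
How a filesystem might use tree_blocksize to locate the Merkle tree
blocks it has to remove can be sketched as below. The helper names are
hypothetical and the sketch assumes the blocks are stored back to back
starting at offset 0; it is not the XFS implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Number of tree blocks covering merkle_tree_size bytes (rounded up). */
static uint64_t
merkle_block_count(uint64_t merkle_tree_size, unsigned int tree_blocksize)
{
	assert(tree_blocksize != 0);
	return (merkle_tree_size + tree_blocksize - 1) / tree_blocksize;
}

/* Byte offset of the index-th tree block within the tree. */
static uint64_t
merkle_block_offset(uint64_t index, unsigned int tree_blocksize)
{
	return index * tree_blocksize;
}
```

A rollback path could then loop over indexes 0 to
merkle_block_count() - 1 and drop the block stored at each
merkle_block_offset().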

diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
index e97ad824ae16..dc142c4b24dc 100644
--- a/fs/btrfs/verity.c
+++ b/fs/btrfs/verity.c
@@ -620,6 +620,7 @@ static int btrfs_begin_enable_verity(struct file *filp)
  * @desc:              verity descriptor to write out (NULL in error conditions)
  * @desc_size:         size of the verity descriptor (variable with signatures)
  * @merkle_tree_size:  size of the merkle tree in bytes
+ * @tree_blocksize:    the Merkle tree block size
  *
  * If desc is null, then VFS is signaling an error occurred during verity
  * enable, and we should try to rollback. Otherwise, attempt to finish verity.
@@ -627,7 +628,8 @@ static int btrfs_begin_enable_verity(struct file *filp)
  * Returns 0 on success, negative error code on error.
  */
 static int btrfs_end_enable_verity(struct file *filp, const void *desc,
-				   size_t desc_size, u64 merkle_tree_size)
+				   size_t desc_size, u64 merkle_tree_size,
+				   unsigned int tree_blocksize)
 {
 	struct btrfs_inode *inode = BTRFS_I(file_inode(filp));
 	int ret = 0;
diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index d9203228ce97..839ebf7d42ca 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -189,7 +189,8 @@ static int ext4_write_verity_descriptor(struct inode *inode, const void *desc,
 }
 
 static int ext4_end_enable_verity(struct file *filp, const void *desc,
-				  size_t desc_size, u64 merkle_tree_size)
+				  size_t desc_size, u64 merkle_tree_size,
+				  unsigned int tree_blocksize)
 {
 	struct inode *inode = file_inode(filp);
 	const int credits = 2; /* superblock and inode for ext4_orphan_del() */
diff --git a/fs/f2fs/verity.c b/fs/f2fs/verity.c
index 2287f238ae09..ff9308ca04aa 100644
--- a/fs/f2fs/verity.c
+++ b/fs/f2fs/verity.c
@@ -144,7 +144,8 @@ static int f2fs_begin_enable_verity(struct file *filp)
 }
 
 static int f2fs_end_enable_verity(struct file *filp, const void *desc,
-				  size_t desc_size, u64 merkle_tree_size)
+				  size_t desc_size, u64 merkle_tree_size,
+				  unsigned int tree_blocksize)
 {
 	struct inode *inode = file_inode(filp);
 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
diff --git a/fs/verity/enable.c b/fs/verity/enable.c
index c284f46d1b53..04e060880b79 100644
--- a/fs/verity/enable.c
+++ b/fs/verity/enable.c
@@ -274,7 +274,8 @@ static int enable_verity(struct file *filp,
 	 * Serialized with ->begin_enable_verity() by the inode lock.
 	 */
 	inode_lock(inode);
-	err = vops->end_enable_verity(filp, desc, desc_size, params.tree_size);
+	err = vops->end_enable_verity(filp, desc, desc_size, params.tree_size,
+				      params.block_size);
 	inode_unlock(inode);
 	if (err) {
 		fsverity_err(inode, "%ps() failed with err %d",
@@ -300,7 +301,8 @@ static int enable_verity(struct file *filp,
 
 rollback:
 	inode_lock(inode);
-	(void)vops->end_enable_verity(filp, NULL, 0, params.tree_size);
+	(void)vops->end_enable_verity(filp, NULL, 0, params.tree_size,
+				      params.block_size);
 	inode_unlock(inode);
 	goto out;
 }
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index 1eb7eae580be..ac58b19f23d3 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -51,6 +51,7 @@ struct fsverity_operations {
 	 * @desc: the verity descriptor to write, or NULL on failure
 	 * @desc_size: size of verity descriptor, or 0 on failure
 	 * @merkle_tree_size: total bytes the Merkle tree took up
+	 * @tree_blocksize: the Merkle tree block size
 	 *
 	 * If desc == NULL, then enabling verity failed and the filesystem only
 	 * must do any necessary cleanups.  Else, it must also store the given
@@ -65,7 +66,8 @@ struct fsverity_operations {
 	 * Return: 0 on success, -errno on failure
 	 */
 	int (*end_enable_verity)(struct file *filp, const void *desc,
-				 size_t desc_size, u64 merkle_tree_size);
+				 size_t desc_size, u64 merkle_tree_size,
+				 unsigned int tree_blocksize);
 
 	/**
 	 * Get the verity descriptor of the given inode.
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 03/24] fsverity: add tracepoints
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 01/24] fs: add FS_XFLAG_VERITY for verity files Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 02/24] fsverity: pass tree_blocksize to end_enable_verity() Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 04/24] fsverity: pass the new tree size and block size to ->begin_enable_verity Andrey Albershteyn
                     ` (20 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

fs-verity previously had debug printk calls, but they were removed.
This patch adds tracepoints in the same places where the printk calls
were used, plus a few additional ones.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: fix formatting]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 MAINTAINERS                     |   1 +
 fs/verity/enable.c              |   4 +
 fs/verity/fsverity_private.h    |   2 +
 fs/verity/init.c                |   1 +
 fs/verity/verify.c              |   8 ++
 include/trace/events/fsverity.h | 143 ++++++++++++++++++++++++++++++++
 6 files changed, 159 insertions(+)
 create mode 100644 include/trace/events/fsverity.h

diff --git a/MAINTAINERS b/MAINTAINERS
index e6e71b05710b..62ec363f3b6b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9394,6 +9394,7 @@ T:	git https://git.kernel.org/pub/scm/fs/fsverity/linux.git
 F:	Documentation/filesystems/fsverity.rst
 F:	fs/verity/
 F:	include/linux/fsverity.h
+F:	include/trace/events/fsverity.h
 F:	include/uapi/linux/fsverity.h
 
 FT260 FTDI USB-HID TO I2C BRIDGE DRIVER
diff --git a/fs/verity/enable.c b/fs/verity/enable.c
index 04e060880b79..9f743f916010 100644
--- a/fs/verity/enable.c
+++ b/fs/verity/enable.c
@@ -227,6 +227,8 @@ static int enable_verity(struct file *filp,
 	if (err)
 		goto out;
 
+	trace_fsverity_enable(inode, &params);
+
 	/*
 	 * Start enabling verity on this file, serialized by the inode lock.
 	 * Fail if verity is already enabled or is already being enabled.
@@ -269,6 +271,8 @@ static int enable_verity(struct file *filp,
 		goto rollback;
 	}
 
+	trace_fsverity_tree_done(inode, vi, &params);
+
 	/*
 	 * Tell the filesystem to finish enabling verity on the file.
 	 * Serialized with ->begin_enable_verity() by the inode lock.
diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
index b3506f56e180..04dd471d791c 100644
--- a/fs/verity/fsverity_private.h
+++ b/fs/verity/fsverity_private.h
@@ -154,4 +154,6 @@ static inline void fsverity_init_signature(void)
 
 void __init fsverity_init_workqueue(void);
 
+#include <trace/events/fsverity.h>
+
 #endif /* _FSVERITY_PRIVATE_H */
diff --git a/fs/verity/init.c b/fs/verity/init.c
index f440f0e61e3e..43f18914a6cd 100644
--- a/fs/verity/init.c
+++ b/fs/verity/init.c
@@ -5,6 +5,7 @@
  * Copyright 2019 Google LLC
  */
 
+#define CREATE_TRACE_POINTS
 #include "fsverity_private.h"
 
 #include <linux/ratelimit.h>
diff --git a/fs/verity/verify.c b/fs/verity/verify.c
index 4fcad0825a12..25fb795655e9 100644
--- a/fs/verity/verify.c
+++ b/fs/verity/verify.c
@@ -109,6 +109,9 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		/* Byte offset of the wanted hash relative to @addr */
 		unsigned int hoffset;
 	} hblocks[FS_VERITY_MAX_LEVELS];
+
+	trace_fsverity_verify_data_block(inode, params, data_pos);
+
 	/*
 	 * The index of the previous level's block within that level; also the
 	 * index of that block's hash within the current level.
@@ -184,6 +187,9 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
 			want_hash = _want_hash;
 			kunmap_local(haddr);
 			put_page(hpage);
+			trace_fsverity_merkle_hit(inode, data_pos, hblock_idx,
+					level,
+					hoffset >> params->log_digestsize);
 			goto descend;
 		}
 		hblocks[level].page = hpage;
@@ -219,6 +225,8 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		want_hash = _want_hash;
 		kunmap_local(haddr);
 		put_page(hpage);
+		trace_fsverity_verify_merkle_block(inode, hblock_idx,
+				level, hoffset >> params->log_digestsize);
 	}
 
 	/* Finally, verify the data block. */
diff --git a/include/trace/events/fsverity.h b/include/trace/events/fsverity.h
new file mode 100644
index 000000000000..dab220884b89
--- /dev/null
+++ b/include/trace/events/fsverity.h
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM fsverity
+
+#if !defined(_TRACE_FSVERITY_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_FSVERITY_H
+
+#include <linux/tracepoint.h>
+
+struct fsverity_descriptor;
+struct merkle_tree_params;
+struct fsverity_info;
+
+TRACE_EVENT(fsverity_enable,
+	TP_PROTO(const struct inode *inode,
+		 const struct merkle_tree_params *params),
+	TP_ARGS(inode, params),
+	TP_STRUCT__entry(
+		__field(ino_t, ino)
+		__field(u64, data_size)
+		__field(unsigned int, block_size)
+		__field(unsigned int, num_levels)
+		__field(u64, tree_size)
+	),
+	TP_fast_assign(
+		__entry->ino = inode->i_ino;
+		__entry->data_size = i_size_read(inode);
+		__entry->block_size = params->block_size;
+		__entry->num_levels = params->num_levels;
+		__entry->tree_size = params->tree_size;
+	),
+	TP_printk("ino %lu data size %llu tree size %llu block size %u levels %u",
+		(unsigned long) __entry->ino,
+		__entry->data_size,
+		__entry->tree_size,
+		__entry->block_size,
+		__entry->num_levels)
+);
+
+TRACE_EVENT(fsverity_tree_done,
+	TP_PROTO(const struct inode *inode, const struct fsverity_info *vi,
+		 const struct merkle_tree_params *params),
+	TP_ARGS(inode, vi, params),
+	TP_STRUCT__entry(
+		__field(ino_t, ino)
+		__field(unsigned int, levels)
+		__field(unsigned int, block_size)
+		__field(u64, tree_size)
+		__dynamic_array(u8, root_hash, params->digest_size)
+		__dynamic_array(u8, file_digest, params->digest_size)
+	),
+	TP_fast_assign(
+		__entry->ino = inode->i_ino;
+		__entry->levels = params->num_levels;
+		__entry->block_size = params->block_size;
+		__entry->tree_size = params->tree_size;
+		memcpy(__get_dynamic_array(root_hash), vi->root_hash, __get_dynamic_array_len(root_hash));
+		memcpy(__get_dynamic_array(file_digest), vi->file_digest, __get_dynamic_array_len(file_digest));
+	),
+	TP_printk("ino %lu levels %u block_size %u tree_size %llu root_hash %s digest %s",
+		(unsigned long) __entry->ino,
+		__entry->levels,
+		__entry->block_size,
+		__entry->tree_size,
+		__print_hex_str(__get_dynamic_array(root_hash), __get_dynamic_array_len(root_hash)),
+		__print_hex_str(__get_dynamic_array(file_digest), __get_dynamic_array_len(file_digest)))
+);
+
+TRACE_EVENT(fsverity_verify_data_block,
+	TP_PROTO(const struct inode *inode,
+		 const struct merkle_tree_params *params,
+		 u64 data_pos),
+	TP_ARGS(inode, params, data_pos),
+	TP_STRUCT__entry(
+		__field(ino_t, ino)
+		__field(u64, data_pos)
+		__field(unsigned int, block_size)
+	),
+	TP_fast_assign(
+		__entry->ino = inode->i_ino;
+		__entry->data_pos = data_pos;
+		__entry->block_size = params->block_size;
+	),
+	TP_printk("ino %lu pos %llu merkle_blocksize %u",
+		(unsigned long) __entry->ino,
+		__entry->data_pos,
+		__entry->block_size)
+);
+
+TRACE_EVENT(fsverity_merkle_hit,
+	TP_PROTO(const struct inode *inode, u64 data_pos,
+		 unsigned long hblock_idx, unsigned int level,
+		 unsigned int hidx),
+	TP_ARGS(inode, data_pos, hblock_idx, level, hidx),
+	TP_STRUCT__entry(
+		__field(ino_t, ino)
+		__field(u64, data_pos)
+		__field(unsigned long, hblock_idx)
+		__field(unsigned int, level)
+		__field(unsigned int, hidx)
+	),
+	TP_fast_assign(
+		__entry->ino = inode->i_ino;
+		__entry->data_pos = data_pos;
+		__entry->hblock_idx = hblock_idx;
+		__entry->level = level;
+		__entry->hidx = hidx;
+	),
+	TP_printk("ino %lu data_pos %llu hblock_idx %lu level %u hidx %u",
+		(unsigned long) __entry->ino,
+		__entry->data_pos,
+		__entry->hblock_idx,
+		__entry->level,
+		__entry->hidx)
+);
+
+TRACE_EVENT(fsverity_verify_merkle_block,
+	TP_PROTO(const struct inode *inode, unsigned long index,
+		 unsigned int level, unsigned int hidx),
+	TP_ARGS(inode, index, level, hidx),
+	TP_STRUCT__entry(
+		__field(ino_t, ino)
+		__field(unsigned long, index)
+		__field(unsigned int, level)
+		__field(unsigned int, hidx)
+	),
+	TP_fast_assign(
+		__entry->ino = inode->i_ino;
+		__entry->index = index;
+		__entry->level = level;
+		__entry->hidx = hidx;
+	),
+	TP_printk("ino %lu index %lu level %u hidx %u",
+		(unsigned long) __entry->ino,
+		__entry->index,
+		__entry->level,
+		__entry->hidx)
+);
+
+#endif /* _TRACE_FSVERITY_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 
2.47.0



* [PATCH 04/24] fsverity: pass the new tree size and block size to ->begin_enable_verity
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (2 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 03/24] fsverity: add tracepoints Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 05/24] fsverity: expose merkle tree geometry to callers Andrey Albershteyn
                     ` (19 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch

From: "Darrick J. Wong" <djwong@kernel.org>

When starting up the process of enabling fsverity on a file, pass the
new size of the merkle tree and the merkle tree block size to the fs
implementation.  XFS will want this information later to try to clean
out a failed previous enablement attempt.
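
As a rough illustration of why both values are useful together, here is
a userspace sketch (the helper names are hypothetical and not part of
this patch) that derives how many Merkle tree blocks a tree of a given
size occupies and where each block starts.  A filesystem that stores one
object per tree block could walk these positions when cleaning up:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch: number of Merkle tree blocks
 * needed to hold merkle_tree_size bytes.  A partial last block counts as
 * a whole block.  Written without "size + blocksize - 1" so that very
 * large sizes cannot overflow.
 */
static uint64_t merkle_block_count(uint64_t merkle_tree_size,
				   unsigned int tree_blocksize)
{
	assert(tree_blocksize != 0);
	return merkle_tree_size / tree_blocksize +
	       (merkle_tree_size % tree_blocksize != 0);
}

/* Hypothetical helper: byte offset of the idx-th Merkle tree block. */
static uint64_t merkle_block_pos(uint64_t idx, unsigned int tree_blocksize)
{
	return idx * tree_blocksize;
}
```

With a 4096-byte tree block, a 4097-byte tree would span two blocks, at
byte positions 0 and 4096.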

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/btrfs/verity.c        | 3 ++-
 fs/ext4/verity.c         | 3 ++-
 fs/f2fs/verity.c         | 3 ++-
 fs/verity/enable.c       | 3 ++-
 include/linux/fsverity.h | 5 ++++-
 5 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
index dc142c4b24dc..d7fa7274b4b0 100644
--- a/fs/btrfs/verity.c
+++ b/fs/btrfs/verity.c
@@ -578,7 +578,8 @@ static int finish_verity(struct btrfs_inode *inode, const void *desc,
  *
  * Returns 0 on success, negative error code on failure.
  */
-static int btrfs_begin_enable_verity(struct file *filp)
+static int btrfs_begin_enable_verity(struct file *filp, u64 merkle_tree_size,
+				     unsigned int tree_blocksize)
 {
 	struct btrfs_inode *inode = BTRFS_I(file_inode(filp));
 	struct btrfs_root *root = inode->root;
diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index 839ebf7d42ca..b95f31f7debb 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -99,7 +99,8 @@ static int pagecache_write(struct inode *inode, const void *buf, size_t count,
 	return 0;
 }
 
-static int ext4_begin_enable_verity(struct file *filp)
+static int ext4_begin_enable_verity(struct file *filp, u64 merkle_tree_size,
+				    unsigned int tree_blocksize)
 {
 	struct inode *inode = file_inode(filp);
 	const int credits = 2; /* superblock and inode for ext4_orphan_add() */
diff --git a/fs/f2fs/verity.c b/fs/f2fs/verity.c
index ff9308ca04aa..cef3baa13b80 100644
--- a/fs/f2fs/verity.c
+++ b/fs/f2fs/verity.c
@@ -115,7 +115,8 @@ struct fsverity_descriptor_location {
 	__le64 pos;
 };
 
-static int f2fs_begin_enable_verity(struct file *filp)
+static int f2fs_begin_enable_verity(struct file *filp, u64 merkle_tree_size,
+				    unsigned int tree_blocksize)
 {
 	struct inode *inode = file_inode(filp);
 	int err;
diff --git a/fs/verity/enable.c b/fs/verity/enable.c
index 9f743f916010..1d4a6de96014 100644
--- a/fs/verity/enable.c
+++ b/fs/verity/enable.c
@@ -237,7 +237,8 @@ static int enable_verity(struct file *filp,
 	if (IS_VERITY(inode))
 		err = -EEXIST;
 	else
-		err = vops->begin_enable_verity(filp);
+		err = vops->begin_enable_verity(filp, params.tree_size,
+				      params.block_size);
 	inode_unlock(inode);
 	if (err)
 		goto out;
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index ac58b19f23d3..81b07909d783 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -33,6 +33,8 @@ struct fsverity_operations {
 	 * Begin enabling verity on the given file.
 	 *
 	 * @filp: a readonly file descriptor for the file
+	 * @merkle_tree_size: total bytes the Merkle tree will take up
+	 * @tree_blocksize: the Merkle tree block size
 	 *
 	 * The filesystem must do any needed filesystem-specific preparations
 	 * for enabling verity, e.g. evicting inline data.  It also must return
@@ -42,7 +44,8 @@ struct fsverity_operations {
 	 *
 	 * Return: 0 on success, -errno on failure
 	 */
-	int (*begin_enable_verity)(struct file *filp);
+	int (*begin_enable_verity)(struct file *filp, u64 merkle_tree_size,
+				   unsigned int tree_blocksize);
 
 	/**
 	 * End enabling verity on the given file.
-- 
2.47.0



* [PATCH 05/24] fsverity: expose merkle tree geometry to callers
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (3 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 04/24] fsverity: pass the new tree size and block size to ->begin_enable_verity Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 06/24] fsverity: report validation errors back to the filesystem Andrey Albershteyn
                     ` (18 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch

From: "Darrick J. Wong" <djwong@kernel.org>

Create a function that will return selected information about the
geometry of the merkle tree.  Online fsck for XFS will need this piece
to perform basic checks of the merkle tree.
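
To give an idea of what such a basic check could look like, below is a
userspace sketch.  The struct and function names are hypothetical; the
three fields mirror the three output parameters of the new function.
The rule that the tree size is a whole number of tree blocks is an
assumption made for this example, not something this patch states:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the three values the new function returns. */
struct merkle_geometry {
	uint8_t		log_blocksize;	/* log2 of the tree block size */
	unsigned int	block_size;	/* tree block size, in bytes */
	uint64_t	tree_size;	/* total tree size, in bytes */
};

/*
 * Self-consistency check (an assumption about what a checker might
 * verify): block_size must equal 1 << log_blocksize, and tree_size must
 * be a multiple of block_size.
 */
static bool merkle_geometry_sane(const struct merkle_geometry *g)
{
	/* Reject shifts that would be undefined for a 32-bit unsigned. */
	if (g->log_blocksize >= 32)
		return false;
	if (g->block_size != (1U << g->log_blocksize))
		return false;
	return (g->tree_size & (g->block_size - 1)) == 0;
}
```

A caller would fill the struct from the values reported for the inode
and flag the file for closer inspection if the check fails.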

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/verity/open.c         | 37 +++++++++++++++++++++++++++++++++++++
 include/linux/fsverity.h | 11 +++++++++++
 2 files changed, 48 insertions(+)

diff --git a/fs/verity/open.c b/fs/verity/open.c
index fdeb95eca3af..de1d0bd6e703 100644
--- a/fs/verity/open.c
+++ b/fs/verity/open.c
@@ -407,6 +407,43 @@ void __fsverity_cleanup_inode(struct inode *inode)
 }
 EXPORT_SYMBOL_GPL(__fsverity_cleanup_inode);
 
+/**
+ * fsverity_merkle_tree_geometry() - return Merkle tree geometry
+ * @inode: the inode to query
+ * @log_blocksize: will be set to the log2 of the size of a merkle tree block
+ * @block_size: will be set to the size of a merkle tree block, in bytes
+ * @tree_size: will be set to the size of the merkle tree, in bytes
+ *
+ * Callers are not required to have opened the file.
+ *
+ * Return: 0 for success, -ENODATA if verity is not enabled, or any of the
+ * error codes that can result from loading verity information while opening a
+ * file.
+ */
+int fsverity_merkle_tree_geometry(struct inode *inode, u8 *log_blocksize,
+				  unsigned int *block_size, u64 *tree_size)
+{
+	struct fsverity_info *vi;
+	int error;
+
+	if (!IS_VERITY(inode))
+		return -ENODATA;
+
+	error = ensure_verity_info(inode);
+	if (error)
+		return error;
+
+	vi = inode->i_verity_info;
+	if (log_blocksize)
+		*log_blocksize = vi->tree_params.log_blocksize;
+	if (block_size)
+		*block_size = vi->tree_params.block_size;
+	if (tree_size)
+		*tree_size = vi->tree_params.tree_size;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(fsverity_merkle_tree_geometry);
+
 void __init fsverity_init_info_cache(void)
 {
 	fsverity_info_cachep = KMEM_CACHE_USERCOPY(
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index 81b07909d783..8627b11082b0 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -157,6 +157,9 @@ int __fsverity_file_open(struct inode *inode, struct file *filp);
 int __fsverity_prepare_setattr(struct dentry *dentry, struct iattr *attr);
 void __fsverity_cleanup_inode(struct inode *inode);
 
+int fsverity_merkle_tree_geometry(struct inode *inode, u8 *log_blocksize,
+				  unsigned int *block_size, u64 *tree_size);
+
 /**
  * fsverity_cleanup_inode() - free the inode's verity info, if present
  * @inode: an inode being evicted
@@ -229,6 +232,14 @@ static inline void fsverity_cleanup_inode(struct inode *inode)
 {
 }
 
+static inline int fsverity_merkle_tree_geometry(struct inode *inode,
+						u8 *log_blocksize,
+						unsigned int *block_size,
+						u64 *tree_size)
+{
+	return -EOPNOTSUPP;
+}
+
 /* read_metadata.c */
 
 static inline int fsverity_ioctl_read_metadata(struct file *filp,
-- 
2.47.0



* [PATCH 06/24] fsverity: report validation errors back to the filesystem
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (4 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 05/24] fsverity: expose merkle tree geometry to callers Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 07/24] fsverity: flush pagecache before enabling verity Andrey Albershteyn
                     ` (17 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch

From: "Darrick J. Wong" <djwong@kernel.org>

Provide a new function call so that validation errors can be reported
back to the filesystem.
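
The hook is optional, so the caller has to test for it before calling.
A minimal userspace sketch of that pattern follows; the struct, the
helper and the recording hook are all hypothetical stand-ins, and
int64_t is used in place of loff_t:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct inode;	/* opaque in this sketch */

/* Hypothetical, cut-down operations table with one optional hook. */
struct verity_ops {
	/* May be NULL if the filesystem does not want notifications. */
	void (*file_corrupt)(struct inode *inode, int64_t pos, size_t len);
};

/*
 * Call the hook only if the filesystem provided one.  Returns 1 if the
 * filesystem was notified, 0 otherwise.
 */
static int report_corruption(const struct verity_ops *ops,
			     struct inode *inode, int64_t pos, size_t len)
{
	if (!ops->file_corrupt)
		return 0;
	ops->file_corrupt(inode, pos, len);
	return 1;
}

/* Example hook: remember the last range that was reported. */
static int64_t last_pos = -1;
static size_t last_len;

static void record_corrupt(struct inode *inode, int64_t pos, size_t len)
{
	(void)inode;
	last_pos = pos;
	last_len = len;
}
```

A filesystem that leaves the hook unset sees no change in behaviour; one
that sets it could, for example, forward the range to its health
reporting.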

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/verity/verify.c              |  4 ++++
 include/linux/fsverity.h        | 14 ++++++++++++++
 include/trace/events/fsverity.h | 19 +++++++++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/fs/verity/verify.c b/fs/verity/verify.c
index 25fb795655e9..587f3a4eb34e 100644
--- a/fs/verity/verify.c
+++ b/fs/verity/verify.c
@@ -242,6 +242,10 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		     data_pos, level - 1,
 		     params->hash_alg->name, hsize, want_hash,
 		     params->hash_alg->name, hsize, real_hash);
+	trace_fsverity_file_corrupt(inode, data_pos, params->block_size);
+	if (inode->i_sb->s_vop->file_corrupt)
+		inode->i_sb->s_vop->file_corrupt(inode, data_pos,
+				params->block_size);
 error:
 	for (; level > 0; level--) {
 		kunmap_local(hblocks[level - 1].addr);
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index 8627b11082b0..9b79aaaa6626 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -125,6 +125,20 @@ struct fsverity_operations {
 	 */
 	int (*write_merkle_tree_block)(struct inode *inode, const void *buf,
 				       u64 pos, unsigned int size);
+
+	/**
+	 * Notify the filesystem that file data is corrupt.
+	 *
+	 * @inode: the inode being validated
+	 * @pos: the file position of the invalid data
+	 * @len: the length of the invalid data
+	 *
+	 * This function is called when fs-verity detects that a portion of a
+	 * file's data is inconsistent with the Merkle tree, or a Merkle tree
+	 * block needed to validate the data is inconsistent with the level
+	 * above it.
+	 */
+	void (*file_corrupt)(struct inode *inode, loff_t pos, size_t len);
 };
 
 #ifdef CONFIG_FS_VERITY
diff --git a/include/trace/events/fsverity.h b/include/trace/events/fsverity.h
index dab220884b89..375fdddac6a9 100644
--- a/include/trace/events/fsverity.h
+++ b/include/trace/events/fsverity.h
@@ -137,6 +137,25 @@ TRACE_EVENT(fsverity_verify_merkle_block,
 		__entry->hidx)
 );
 
+TRACE_EVENT(fsverity_file_corrupt,
+	TP_PROTO(const struct inode *inode, loff_t pos, size_t len),
+	TP_ARGS(inode, pos, len),
+	TP_STRUCT__entry(
+		__field(ino_t, ino)
+		__field(loff_t, pos)
+		__field(size_t, len)
+	),
+	TP_fast_assign(
+		__entry->ino = inode->i_ino;
+		__entry->pos = pos;
+		__entry->len = len;
+	),
+	TP_printk("ino %lu pos %lld len %zu",
+		(unsigned long) __entry->ino,
+		__entry->pos,
+		__entry->len)
+);
+
 #endif /* _TRACE_FSVERITY_H */
 
 /* This part must be outside protection */
-- 
2.47.0



* [PATCH 07/24] fsverity: flush pagecache before enabling verity
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (5 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 06/24] fsverity: report validation errors back to the filesystem Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 08/24] iomap: integrate fs-verity verification into iomap's read path Andrey Albershteyn
                     ` (16 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

XFS uses the iomap interface to write the Merkle tree. Writeback
distinguishes between data pages and Merkle tree pages via the
XFS_VERITY_CONSTRUCTION flag set on the inode. Data pages could get in
the way in the writeback path when the file is read-only and Merkle
tree construction has already started.

Flush the page cache before enabling fsverity.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/verity/enable.c | 5 +++++
 fs/verity/verify.c | 1 +
 2 files changed, 6 insertions(+)

diff --git a/fs/verity/enable.c b/fs/verity/enable.c
index 1d4a6de96014..af4fcbb6363d 100644
--- a/fs/verity/enable.c
+++ b/fs/verity/enable.c
@@ -11,6 +11,7 @@
 #include <linux/mount.h>
 #include <linux/sched/signal.h>
 #include <linux/uaccess.h>
+#include <linux/pagemap.h>
 
 struct block_buffer {
 	u32 filled;
@@ -374,6 +375,10 @@ int fsverity_ioctl_enable(struct file *filp, const void __user *uarg)
 	if (!S_ISREG(inode->i_mode))
 		return -EINVAL;
 
+	err = filemap_write_and_wait(inode->i_mapping);
+	if (err)
+		return err;
+
 	err = mnt_want_write_file(filp);
 	if (err) /* -EROFS */
 		return err;
diff --git a/fs/verity/verify.c b/fs/verity/verify.c
index 587f3a4eb34e..940f59bf3f53 100644
--- a/fs/verity/verify.c
+++ b/fs/verity/verify.c
@@ -9,6 +9,7 @@
 
 #include <crypto/hash.h>
 #include <linux/bio.h>
+#include <linux/pagemap.h>
 
 static struct workqueue_struct *fsverity_read_workqueue;
 
-- 
2.47.0



* [PATCH 08/24] iomap: integrate fs-verity verification into iomap's read path
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (6 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 07/24] fsverity: flush pagecache before enabling verity Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 09/24] xfs: use an empty transaction to protect xfs_attr_get from deadlocks Andrey Albershteyn
                     ` (15 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

This patch adds fs-verity verification to iomap's read path. After a
BIO's I/O operation completes, the data is verified against
fs-verity's Merkle tree. Verification work is done in a separate
workqueue.

The read path ioends (struct iomap_read_ioend) are stored side by side
with the BIOs if FS_VERITY is enabled.
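
Because the ioend and the bio live in one allocation, the work handler
can get from the work item back to the bio with container_of().  Here is
a userspace sketch of that layout trick; the types are simplified
stand-ins (only the io_work and io_bio member names are taken from the
patch), and the macro is a plain offsetof() version of the kernel one:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; only the layout matters. */
struct work_struct { int pending; };
struct bio { int status; };

/* Hypothetical, cut-down version of the read ioend. */
struct read_ioend {
	struct work_struct	io_work;	/* queued for deferred verification */
	struct bio		io_bio;		/* bio embedded in the same allocation */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* What a work handler does first: find the ioend the work belongs to. */
static struct read_ioend *ioend_from_work(struct work_struct *work)
{
	return container_of(work, struct read_ioend, io_work);
}

/* What a bio completion handler does: find the ioend around the bio. */
static struct read_ioend *ioend_from_bio(struct bio *bio)
{
	return container_of(bio, struct read_ioend, io_bio);
}
```

Going work item -> ioend -> io_bio is the path the verification worker
takes before handing the bio on for verification and completion.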

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: fix doc warning]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c | 30 ++++++++++++++++++++++++++++--
 include/linux/iomap.h  |  5 +++++
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d6231f4f78d9..59c0ff6fb6b7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -6,6 +6,7 @@
 #include <linux/module.h>
 #include <linux/compiler.h>
 #include <linux/fs.h>
+#include <linux/fsverity.h>
 #include <linux/iomap.h>
 #include <linux/pagemap.h>
 #include <linux/uio.h>
@@ -23,6 +24,8 @@
 
 #define IOEND_BATCH_SIZE	4096
 
+#define IOMAP_POOL_SIZE		(4 * (PAGE_SIZE / SECTOR_SIZE))
+
 /*
  * Structure allocated for each folio to track per-block uptodate, dirty state
  * and I/O completions.
@@ -362,6 +365,19 @@ static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter,
 		 !(srcmap->flags & IOMAP_F_BEYOND_EOF));
 }
 
+#ifdef CONFIG_FS_VERITY
+void
+iomap_read_fsverity_end_io_work(struct work_struct *work)
+{
+	struct iomap_read_ioend *fbio =
+		container_of(work, struct iomap_read_ioend, io_work);
+
+	fsverity_verify_bio(&fbio->io_bio);
+	iomap_read_end_io(&fbio->io_bio);
+}
+
+#endif /* CONFIG_FS_VERITY */
+
 static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		struct iomap_readpage_ctx *ctx, loff_t offset)
 {
@@ -376,6 +392,10 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 	struct iomap_read_ioend *ioend;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 
+	/* Fail reads from broken fsverity files immediately. */
+	if (IS_VERITY(iter->inode) && !fsverity_active(iter->inode))
+		return -EIO;
+
 	if (iomap->type == IOMAP_INLINE)
 		return iomap_read_inline_data(iter, folio);
 
@@ -387,6 +407,12 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 
 	if (iomap_block_needs_zeroing(iter, pos)) {
 		folio_zero_range(folio, poff, plen);
+		if (!(srcmap->flags & IOMAP_F_BEYOND_EOF) &&
+		    fsverity_active(iter->inode) &&
+		    !fsverity_verify_blocks(folio, plen, poff)) {
+			return -EIO;
+		}
+
 		iomap_set_range_uptodate(folio, poff, plen);
 		goto done;
 	}
@@ -2176,13 +2202,13 @@ static int __init iomap_buffered_init(void)
 	int error = 0;
 
 	error = bioset_init(&iomap_read_ioend_bioset,
-			   4 * (PAGE_SIZE / SECTOR_SIZE),
+			   IOMAP_POOL_SIZE,
 			   offsetof(struct iomap_read_ioend, io_bio),
 			   BIOSET_NEED_BVECS);
 	if (error)
 		return error;
 
-	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
+	return bioset_init(&iomap_ioend_bioset, IOMAP_POOL_SIZE,
 			   offsetof(struct iomap_ioend, io_bio),
 			   BIOSET_NEED_BVECS);
 }
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 261772431fae..e4704b337ac1 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -326,6 +326,11 @@ struct iomap_readpage_ctx {
 };
 
 void iomap_read_end_io(struct bio *bio);
+#ifdef CONFIG_FS_VERITY
+void iomap_read_fsverity_end_io_work(struct work_struct *work);
+#else
+#define iomap_read_fsverity_end_io_work (0)
+#endif /* CONFIG_FS_VERITY */
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops, void *private);
 int iomap_read_folio_ctx(struct iomap_readpage_ctx *ctx,
-- 
2.47.0



* [PATCH 09/24] xfs: use an empty transaction to protect xfs_attr_get from deadlocks
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (7 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 08/24] iomap: integrate fs-verity verification into iomap's read path Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 10/24] xfs: don't let xfs_bmap_first_unused overflow a xfs_dablk_t Andrey Albershteyn
                     ` (14 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch

From: "Darrick J. Wong" <djwong@kernel.org>

Wrap the xfs_attr_get_ilocked call in xfs_attr_get with an empty
transaction so that we cannot livelock the kernel if someone injects a
loop into the attr structure or the attr fork bmbt.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_attr.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index e452ca55241f..3f3699e9c203 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -274,6 +274,8 @@ xfs_attr_get(
 
 	XFS_STATS_INC(args->dp->i_mount, xs_attr_get);
 
+	ASSERT(!args->trans);
+
 	if (xfs_is_shutdown(args->dp->i_mount))
 		return -EIO;
 
@@ -286,8 +288,14 @@ xfs_attr_get(
 	/* Entirely possible to look up a name which doesn't exist */
 	args->op_flags = XFS_DA_OP_OKNOENT;
 
+	error = xfs_trans_alloc_empty(args->dp->i_mount, &args->trans);
+	if (error)
+		return error;
+
 	lock_mode = xfs_ilock_attr_map_shared(args->dp);
 	error = xfs_attr_get_ilocked(args);
+	xfs_trans_cancel(args->trans);
+	args->trans = NULL;
 	xfs_iunlock(args->dp, lock_mode);
 
 	return error;
-- 
2.47.0



* [PATCH 10/24] xfs: don't let xfs_bmap_first_unused overflow a xfs_dablk_t
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (8 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 09/24] xfs: use an empty transaction to protect xfs_attr_get from deadlocks Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 11/24] xfs: add attribute type for fs-verity Andrey Albershteyn
                     ` (13 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch

From: "Darrick J. Wong" <djwong@kernel.org>

The directory/xattr code uses xfs_bmap_first_unused to find a contiguous
chunk of file range that can hold a particular value.  Unfortunately,
file offsets are 64-bit quantities, whereas the dir/attr block number
type (xfs_dablk_t) is a 32-bit quantity.  If an integer truncation
occurs here, we will corrupt the file.

Therefore, check for a file offset that would truncate and return EFBIG
in that case.
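
The truncation and the guard can be reproduced in userspace.  In the
sketch below the typedefs and XFS_MAX_DABLK follow the patch, while
fileoff_to_dablk() is a hypothetical wrapper written only for this
example:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint32_t xfs_dablk_t;	/* 32-bit dir/attr block number */
typedef uint64_t xfs_fileoff_t;	/* 64-bit file block offset */

/* Maximum file block offset of a directory or an xattr. */
#define XFS_MAX_DABLK	((xfs_dablk_t)-1U)

/*
 * Hypothetical wrapper: convert a 64-bit file offset to a dablk,
 * refusing values that would silently lose their upper bits.
 */
static int fileoff_to_dablk(xfs_fileoff_t off, xfs_dablk_t *blk)
{
	if (off > XFS_MAX_DABLK)
		return -EFBIG;
	*blk = (xfs_dablk_t)off;
	return 0;
}
```

Without the check, an offset of 0x100000005 would be cast to block 5 and
the caller would go on to use the wrong block.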

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_attr_remote.c | 3 +++
 fs/xfs/libxfs/xfs_da_btree.c    | 3 +++
 fs/xfs/libxfs/xfs_da_format.h   | 3 +++
 3 files changed, 9 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index e90a62c61f28..2bd225b1772c 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -529,6 +529,9 @@ xfs_attr_rmt_find_hole(
 	if (error)
 		return error;
 
+	if (lfileoff > XFS_MAX_DABLK)
+		return -EFBIG;
+
 	args->rmtblkno = (xfs_dablk_t)lfileoff;
 	args->rmtblkcnt = blkcnt;
 
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 17d9e6154f19..6c6c7bab87fb 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2308,6 +2308,9 @@ xfs_da_grow_inode_int(
 	if (error)
 		return error;
 
+	if (*bno > XFS_MAX_DABLK)
+		return -EFBIG;
+
 	/*
 	 * Try mapping it in one filesystem block.
 	 */
diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index 48bebcd1e226..ee9635c04197 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -748,6 +748,9 @@ struct xfs_attr3_leafblock {
  */
 #define	XFS_ATTR_LEAF_NAME_ALIGN	((uint)sizeof(xfs_dablk_t))
 
+/* Maximum file block offset of a directory or an xattr. */
+#define	XFS_MAX_DABLK			((xfs_dablk_t)-1U)
+
 static inline int
 xfs_attr3_leaf_hdr_size(struct xfs_attr_leafblock *leafp)
 {
-- 
2.47.0



* [PATCH 11/24] xfs: add attribute type for fs-verity
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (9 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 10/24] xfs: don't let xfs_bmap_first_unused overflow a xfs_dablk_t Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 12/24] xfs: add fs-verity ro-compat flag Andrey Albershteyn
                     ` (12 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

The Merkle tree blocks and descriptor are stored in the extended
attributes of the inode. Add a new attribute type for fs-verity
metadata. Extend XFS_ATTR_PRIVATE_NSP_MASK so that fs-verity
attributes are skipped along with parent pointers, as both are only
for internal use. While we're at it, add a few comments in relevant
places noting that internally visible attributes are not supposed to
be handled via the interface defined in xfs_xattr.c.
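
For illustration, this is how the namespace bits and the private mask
combine.  The bit numbers are the ones in the diff below;
attr_is_private() is a hypothetical predicate written for this example,
and the LOCAL, ROOT and INCOMPLETE bits are left out for brevity:

```c
#include <assert.h>
#include <stdbool.h>

#define XFS_ATTR_SECURE_BIT	2	/* limit access to secure attrs */
#define XFS_ATTR_PARENT_BIT	3	/* parent pointer attrs */
#define XFS_ATTR_VERITY_BIT	5	/* verity merkle tree and descriptor */

#define XFS_ATTR_SECURE		(1u << XFS_ATTR_SECURE_BIT)
#define XFS_ATTR_PARENT		(1u << XFS_ATTR_PARENT_BIT)
#define XFS_ATTR_VERITY		(1u << XFS_ATTR_VERITY_BIT)

/* Private attr namespaces not exposed to userspace */
#define XFS_ATTR_PRIVATE_NSP_MASK	(XFS_ATTR_PARENT | \
					 XFS_ATTR_VERITY)

/* Hypothetical predicate: is this attr in a kernel-internal namespace? */
static bool attr_is_private(unsigned int attr_filter)
{
	return (attr_filter & XFS_ATTR_PRIVATE_NSP_MASK) != 0;
}
```

A listing or lookup path serving userspace could use such a test to hide
both parent pointer and verity attributes with one mask check.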

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_da_format.h  | 11 ++++++++---
 fs/xfs/libxfs/xfs_log_format.h |  1 +
 fs/xfs/xfs_trace.h             |  3 ++-
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index ee9635c04197..060cedb4c12d 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -717,20 +717,24 @@ struct xfs_attr3_leafblock {
 #define	XFS_ATTR_SECURE_BIT	2	/* limit access to secure attrs */
 #define	XFS_ATTR_PARENT_BIT	3	/* parent pointer attrs */
 #define	XFS_ATTR_RMCRC_SEL_BIT	4	/* which CRC field is primary */
+#define	XFS_ATTR_VERITY_BIT	5	/* verity merkle tree and descriptor */
 #define	XFS_ATTR_INCOMPLETE_BIT	7	/* attr in middle of create/delete */
 #define XFS_ATTR_LOCAL		(1u << XFS_ATTR_LOCAL_BIT)
 #define XFS_ATTR_ROOT		(1u << XFS_ATTR_ROOT_BIT)
 #define XFS_ATTR_SECURE		(1u << XFS_ATTR_SECURE_BIT)
 #define XFS_ATTR_PARENT		(1u << XFS_ATTR_PARENT_BIT)
 #define XFS_ATTR_RMCRC_SEL	(1u << XFS_ATTR_RMCRC_SEL_BIT)
+#define XFS_ATTR_VERITY		(1u << XFS_ATTR_VERITY_BIT)
 #define XFS_ATTR_INCOMPLETE	(1u << XFS_ATTR_INCOMPLETE_BIT)
 
 #define XFS_ATTR_NSP_ONDISK_MASK	(XFS_ATTR_ROOT | \
 					 XFS_ATTR_SECURE | \
-					 XFS_ATTR_PARENT)
+					 XFS_ATTR_PARENT | \
+					 XFS_ATTR_VERITY)
 
 /* Private attr namespaces not exposed to userspace */
-#define XFS_ATTR_PRIVATE_NSP_MASK	(XFS_ATTR_PARENT)
+#define XFS_ATTR_PRIVATE_NSP_MASK	(XFS_ATTR_PARENT | \
+					 XFS_ATTR_VERITY)
 
 #define XFS_ATTR_ONDISK_MASK	(XFS_ATTR_NSP_ONDISK_MASK | \
 				 XFS_ATTR_LOCAL | \
@@ -740,7 +744,8 @@ struct xfs_attr3_leafblock {
 	{ XFS_ATTR_LOCAL,	"local" }, \
 	{ XFS_ATTR_ROOT,	"root" }, \
 	{ XFS_ATTR_SECURE,	"secure" }, \
-	{ XFS_ATTR_PARENT,	"parent" }
+	{ XFS_ATTR_PARENT,	"parent" }, \
+	{ XFS_ATTR_VERITY,	"verity" }
 
 /*
  * Alignment for namelist and valuelist entries (since they are mixed
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 9f1b02a599d2..1d07e12a9a30 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -1045,6 +1045,7 @@ struct xfs_icreate_log {
 #define XFS_ATTRI_FILTER_MASK		(XFS_ATTR_ROOT | \
 					 XFS_ATTR_SECURE | \
 					 XFS_ATTR_PARENT | \
+					 XFS_ATTR_VERITY | \
 					 XFS_ATTR_INCOMPLETE)
 
 /*
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 5c3b8929179d..de937b3770d3 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -103,7 +103,8 @@ struct xfs_rtgroup;
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
 	{ XFS_ATTR_SECURE,	"SECURE" }, \
 	{ XFS_ATTR_INCOMPLETE,	"INCOMPLETE" }, \
-	{ XFS_ATTR_PARENT,	"PARENT" }
+	{ XFS_ATTR_PARENT,	"PARENT" }, \
+	{ XFS_ATTR_VERITY,	"VERITY" }
 
 DECLARE_EVENT_CLASS(xfs_attr_list_class,
 	TP_PROTO(struct xfs_attr_list_context *ctx),
-- 
2.47.0



* [PATCH 12/24] xfs: add fs-verity ro-compat flag
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (10 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 11/24] xfs: add attribute type for fs-verity Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 13/24] xfs: add inode on-disk VERITY flag Andrey Albershteyn
                     ` (11 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

To mark inodes with fs-verity enabled, a later patch adds the new
XFS_DIFLAG2_VERITY flag. This requires a ro-compat flag so that older
kernels know that a filesystem with fs-verity cannot be modified.
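
As a rough illustration of why a ro-compat bit fits here, the sketch below
models the mount-time decision in plain userspace C. The bit values mirror
xfs_format.h after this patch; can_mount() and its parameters are
hypothetical names for illustration, not the kernel's actual code path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as in xfs_format.h after this patch. */
#define RO_COMPAT_FINOBT	(1u << 0)
#define RO_COMPAT_RMAPBT	(1u << 1)
#define RO_COMPAT_REFLINK	(1u << 2)
#define RO_COMPAT_INOBTCNT	(1u << 3)
#define RO_COMPAT_VERITY	(1u << 4)

/*
 * Hypothetical helper: may a kernel that understands only the bits in
 * @known mount a filesystem whose superblock advertises @sb_ro_compat?
 * Unknown ro-compat bits still permit a read-only mount, but never a
 * read-write one.
 */
static bool can_mount(uint32_t sb_ro_compat, uint32_t known, bool readonly)
{
	uint32_t unknown = sb_ro_compat & ~known;

	return unknown == 0 || readonly;
}
```

Under this model, a kernel without fs-verity support can still read a
verity filesystem but cannot write to it.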

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h | 1 +
 fs/xfs/libxfs/xfs_sb.c     | 2 ++
 fs/xfs/xfs_mount.h         | 2 ++
 3 files changed, 5 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 334ca8243b19..aefeda01f60f 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -372,6 +372,7 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
 #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
 #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
+#define XFS_SB_FEAT_RO_COMPAT_VERITY   (1 << 4)		/* fs-verity */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 20395ba66b94..9945ad33a460 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -165,6 +165,8 @@ xfs_sb_version_to_features(
 		features |= XFS_FEAT_REFLINK;
 	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
 		features |= XFS_FEAT_INOBTCNT;
+	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_VERITY)
+		features |= XFS_FEAT_VERITY;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
 		features |= XFS_FEAT_FTYPE;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 1fa4a57421c3..dab6bc3ae0cf 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -331,6 +331,7 @@ typedef struct xfs_mount {
 #define XFS_FEAT_EXCHANGE_RANGE	(1ULL << 27)	/* exchange range */
 #define XFS_FEAT_METADIR	(1ULL << 28)	/* metadata directory tree */
 #define XFS_FEAT_DXATTR		(1ULL << 29)	/* directly mapped xattrs */
+#define XFS_FEAT_VERITY		(1ULL << 30)	/* fs-verity */
 
 /* Mount features */
 #define XFS_FEAT_NOATTR2	(1ULL << 48)	/* disable attr2 creation */
@@ -388,6 +389,7 @@ __XFS_HAS_FEAT(large_extent_counts, NREXT64)
 __XFS_HAS_FEAT(exchange_range, EXCHANGE_RANGE)
 __XFS_HAS_FEAT(metadir, METADIR)
 __XFS_HAS_FEAT(dxattr, DXATTR)
+__XFS_HAS_FEAT(verity, VERITY)
 
 static inline bool xfs_has_rtgroups(struct xfs_mount *mp)
 {
-- 
2.47.0



* [PATCH 13/24] xfs: add inode on-disk VERITY flag
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (11 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 12/24] xfs: add fs-verity ro-compat flag Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 14/24] xfs: initialize fs-verity on file open and cleanup on inode destruction Andrey Albershteyn
                     ` (10 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

Add a flag to mark inodes which have fs-verity enabled on them (i.e.
the descriptor exists and the tree is built).

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h     | 7 ++++++-
 fs/xfs/libxfs/xfs_inode_buf.c  | 8 ++++++++
 fs/xfs/libxfs/xfs_inode_util.c | 2 ++
 fs/xfs/xfs_iops.c              | 2 ++
 4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index aefeda01f60f..df84c275837d 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1215,16 +1215,21 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
  */
 #define XFS_DIFLAG2_METADATA_BIT	5
 
+/* Inode sealed by fsverity */
+#define XFS_DIFLAG2_VERITY_BIT		6
+
 #define XFS_DIFLAG2_DAX		(1ULL << XFS_DIFLAG2_DAX_BIT)
 #define XFS_DIFLAG2_REFLINK	(1ULL << XFS_DIFLAG2_REFLINK_BIT)
 #define XFS_DIFLAG2_COWEXTSIZE	(1ULL << XFS_DIFLAG2_COWEXTSIZE_BIT)
 #define XFS_DIFLAG2_BIGTIME	(1ULL << XFS_DIFLAG2_BIGTIME_BIT)
 #define XFS_DIFLAG2_NREXT64	(1ULL << XFS_DIFLAG2_NREXT64_BIT)
 #define XFS_DIFLAG2_METADATA	(1ULL << XFS_DIFLAG2_METADATA_BIT)
+#define XFS_DIFLAG2_VERITY	(1ULL << XFS_DIFLAG2_VERITY_BIT)
 
 #define XFS_DIFLAG2_ANY \
 	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
-	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_METADATA)
+	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_METADATA | \
+	 XFS_DIFLAG2_VERITY)
 
 static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
 {
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 424861fbf1bd..9ba57a1efa50 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -726,6 +726,14 @@ xfs_dinode_verify(
 	if ((flags2 & XFS_DIFLAG2_REFLINK) && (flags & XFS_DIFLAG_REALTIME))
 		return __this_address;
 
+	/* only regular files can have fsverity */
+	if (flags2 & XFS_DIFLAG2_VERITY) {
+		if (!xfs_has_verity(mp))
+			return __this_address;
+		if ((mode & S_IFMT) != S_IFREG)
+			return __this_address;
+	}
+
 	/* COW extent size hint validation */
 	fa = xfs_inode_validate_cowextsize(mp, be32_to_cpu(dip->di_cowextsize),
 			mode, flags, flags2);
diff --git a/fs/xfs/libxfs/xfs_inode_util.c b/fs/xfs/libxfs/xfs_inode_util.c
index deb0b7c00a1f..d2bbb4ca1ecd 100644
--- a/fs/xfs/libxfs/xfs_inode_util.c
+++ b/fs/xfs/libxfs/xfs_inode_util.c
@@ -126,6 +126,8 @@ xfs_ip2xflags(
 			flags |= FS_XFLAG_DAX;
 		if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
 			flags |= FS_XFLAG_COWEXTSIZE;
+		if (ip->i_diflags2 & XFS_DIFLAG2_VERITY)
+			flags |= FS_XFLAG_VERITY;
 	}
 
 	if (xfs_inode_has_attr_fork(ip))
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 207e0dadffc3..47203b8923aa 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1286,6 +1286,8 @@ xfs_diflags_to_iflags(
 		flags |= S_NOATIME;
 	if (init && xfs_inode_should_enable_dax(ip))
 		flags |= S_DAX;
+	if (xflags & FS_XFLAG_VERITY)
+		flags |= S_VERITY;
 
 	/*
 	 * S_DAX can only be set during inode initialization and is never set by
-- 
2.47.0



* [PATCH 14/24] xfs: initialize fs-verity on file open and cleanup on inode destruction
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (12 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 13/24] xfs: add inode on-disk VERITY flag Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 15/24] xfs: don't allow to enable DAX on fs-verity sealed inode Andrey Albershteyn
                     ` (9 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

On file open, fs-verity will read and attach its metadata (not the tree
itself) from disk for those inodes which already have fs-verity enabled.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_file.c  | 8 ++++++++
 fs/xfs/xfs_super.c | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9a435b1ff264..67381e728b41 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -32,6 +32,7 @@
 #include <linux/mman.h>
 #include <linux/fadvise.h>
 #include <linux/mount.h>
+#include <linux/fsverity.h>
 
 static const struct vm_operations_struct xfs_file_vm_ops;
 
@@ -1258,11 +1259,18 @@ xfs_file_open(
 	struct inode	*inode,
 	struct file	*file)
 {
+	int		error;
+
 	if (xfs_is_shutdown(XFS_M(inode->i_sb)))
 		return -EIO;
 	file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
 	if (xfs_inode_can_atomicwrite(XFS_I(inode)))
 		file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
+
+	error = fsverity_file_open(inode, file);
+	if (error)
+		return error;
+
 	return generic_file_open(inode, file);
 }
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 4ab93adaab0c..3de6717e4fad 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -52,6 +52,7 @@
 #include <linux/magic.h>
 #include <linux/fs_context.h>
 #include <linux/fs_parser.h>
+#include <linux/fsverity.h>
 
 static const struct super_operations xfs_super_operations;
 
@@ -678,6 +679,7 @@ xfs_fs_destroy_inode(
 	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
 	XFS_STATS_INC(ip->i_mount, vn_rele);
 	XFS_STATS_INC(ip->i_mount, vn_remove);
+	fsverity_cleanup_inode(inode);
 	xfs_inode_mark_reclaimable(ip);
 }
 
-- 
2.47.0



* [PATCH 15/24] xfs: don't allow to enable DAX on fs-verity sealed inode
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (13 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 14/24] xfs: initialize fs-verity on file open and cleanup on inode destruction Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 16/24] xfs: disable direct read path for fs-verity files Andrey Albershteyn
                     ` (8 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

fs-verity doesn't support DAX. Forbid the filesystem from enabling DAX on
inodes which already have fs-verity enabled. The opposite case is checked
when fs-verity is being enabled: it won't be enabled if DAX is.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: fix typo in subject]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_iops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 47203b8923aa..d653ae6b1636 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1258,6 +1258,8 @@ xfs_inode_should_enable_dax(
 		return false;
 	if (!xfs_inode_supports_dax(ip))
 		return false;
+	if (ip->i_diflags2 & XFS_DIFLAG2_VERITY)
+		return false;
 	if (xfs_has_dax_always(ip->i_mount))
 		return true;
 	if (ip->i_diflags2 & XFS_DIFLAG2_DAX)
-- 
2.47.0



* [PATCH 16/24] xfs: disable direct read path for fs-verity files
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (14 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 15/24] xfs: don't allow to enable DAX on fs-verity sealed inode Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 17/24] xfs: add fs-verity support Andrey Albershteyn
                     ` (7 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

The direct I/O path is not supported on verity files. Attempts to use the
direct I/O read path on such files should fall back to the buffered I/O path.
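
The dispatch this patch ends up with in xfs_file_read_iter() can be modelled
roughly as below. This is a userspace sketch only: KI_DIRECT stands in for
IOCB_DIRECT, and pick_read_path() and enum read_path are hypothetical names
that do not exist in the kernel.

```c
#include <stdbool.h>

enum read_path { READ_DAX, READ_DIRECT, READ_BUFFERED };

#define KI_DIRECT	0x1	/* stand-in for IOCB_DIRECT */

/*
 * Hypothetical model: choose the read path, and clear the direct flag
 * whenever a direct read is demoted to a buffered one.
 */
static enum read_path pick_read_path(bool is_dax, bool verity_active,
				     unsigned int *ki_flags)
{
	if (is_dax)
		return READ_DAX;
	if ((*ki_flags & KI_DIRECT) && !verity_active)
		return READ_DIRECT;

	/* Buffered read; the direct flag must not leak into this path. */
	*ki_flags &= ~KI_DIRECT;
	return READ_BUFFERED;
}
```

Clearing the flag matters because the generic buffered read code would
otherwise still see a direct I/O request.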

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: fix braces]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_file.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 67381e728b41..8c26347a0a2f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -257,7 +257,8 @@ xfs_file_dax_read(
 	struct kiocb		*iocb,
 	struct iov_iter		*to)
 {
-	struct xfs_inode	*ip = XFS_I(iocb->ki_filp->f_mapping->host);
+	struct inode		*inode = iocb->ki_filp->f_mapping->host;
+	struct xfs_inode	*ip = XFS_I(inode);
 	ssize_t			ret = 0;
 
 	trace_xfs_file_dax_read(iocb, to);
@@ -310,10 +311,18 @@ xfs_file_read_iter(
 
 	if (IS_DAX(inode))
 		ret = xfs_file_dax_read(iocb, to);
-	else if (iocb->ki_flags & IOCB_DIRECT)
+	else if ((iocb->ki_flags & IOCB_DIRECT) && !fsverity_active(inode))
 		ret = xfs_file_dio_read(iocb, to);
-	else
+	else {
+		/*
+		 * In case fs-verity is enabled, we also fall back to the
+		 * buffered read from the direct read path. Therefore,
+		 * IOCB_DIRECT may be set and needs to be cleared (see
+		 * generic_file_read_iter())
+		 */
+		iocb->ki_flags &= ~IOCB_DIRECT;
 		ret = xfs_file_buffered_read(iocb, to);
+	}
 
 	if (ret > 0)
 		XFS_STATS_ADD(mp, xs_read_bytes, ret);
-- 
2.47.0



* [PATCH 17/24] xfs: add fs-verity support
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (15 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 16/24] xfs: disable direct read path for fs-verity files Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 18/24] xfs: add writeback page mapping for fs-verity Andrey Albershteyn
                     ` (6 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

Add integration with fs-verity. XFS stores fs-verity metadata in
extended file attributes. The metadata consists of the verity
descriptor and Merkle tree blocks.

The descriptor is stored under the "vdesc" extended attribute. The
Merkle tree blocks are stored under binary indexes which are byte
offsets into the Merkle tree.
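
To make the naming scheme concrete, here is a small userspace sketch of the
8-byte big-endian key and of the length-only name check. It mirrors what
xfs_merkle_key_to_disk(), xfs_merkle_key_from_disk() and
xfs_verity_namecheck() do in this patch, but the function and macro names
below are illustrative only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MERKLE_KEY_LEN	8	/* sizeof(struct xfs_merkle_key) */
#define VDESC_NAME	"vdesc"
#define VDESC_NAME_LEN	(sizeof(VDESC_NAME) - 1)	/* no trailing NUL */

/* Store @pos big-endian, as cpu_to_be64() would, on any host. */
static void merkle_key_encode(uint8_t key[MERKLE_KEY_LEN], uint64_t pos)
{
	for (int i = 0; i < MERKLE_KEY_LEN; i++)
		key[i] = (uint8_t)(pos >> (56 - 8 * i));
}

/* Recover the byte position from an on-disk key. */
static uint64_t merkle_key_decode(const uint8_t key[MERKLE_KEY_LEN])
{
	uint64_t pos = 0;

	for (int i = 0; i < MERKLE_KEY_LEN; i++)
		pos = (pos << 8) | key[i];
	return pos;
}

/* A verity name is either a merkle key or the descriptor name. */
static bool verity_name_len_ok(size_t namelen)
{
	return namelen == MERKLE_KEY_LEN || namelen == VDESC_NAME_LEN;
}
```

Note that the check is by length only, so it relies on the descriptor name
never being 8 bytes long.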

When fs-verity is enabled on an inode, the XFS_VERITY_CONSTRUCTION
flag is set, meaning that the Merkle tree is being built. The
initialization ends with storing the verity descriptor and setting the
inode on-disk flag (XFS_DIFLAG2_VERITY).
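
The ordering of the enable sequence can be sketched as a tiny state model.
This is a simplification for illustration: struct ino_state, begin_enable()
and end_enable() are hypothetical, and the real code additionally writes
back the tree, commits a transaction and cleans up on errors.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-inode state relevant to enabling fs-verity. */
struct ino_state {
	bool	constructing;	/* in-memory construction flag */
	bool	has_desc;	/* descriptor xattr is stored */
	bool	sealed;		/* on-disk XFS_DIFLAG2_VERITY is set */
};

/* Only one enable operation may be in flight per inode. */
static int begin_enable(struct ino_state *s)
{
	if (s->constructing)
		return -EBUSY;
	s->constructing = true;
	return 0;
}

/* @desc_ok == false models the core code reporting a failed build. */
static void end_enable(struct ino_state *s, bool desc_ok)
{
	if (desc_ok) {
		s->has_desc = true;	/* persist the descriptor first ... */
		s->sealed = true;	/* ... then set the on-disk flag */
	}
	s->constructing = false;	/* cleared on success and on failure */
}
```

The point of the ordering is that the on-disk flag is never set before the
descriptor it refers to exists.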

Verification on read is done in the iomap read path.

Merkle tree blocks are indexed by a per-AG rhashtable to reduce the time
it takes to load a block from disk in a manner that doesn't bloat struct
xfs_inode.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: replace caching implementation with an xarray, other cleanups]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile               |   2 +
 fs/xfs/libxfs/xfs_ag.h        |   1 +
 fs/xfs/libxfs/xfs_attr.c      |   4 +
 fs/xfs/libxfs/xfs_da_format.h |  14 +
 fs/xfs/libxfs/xfs_ondisk.h    |   4 +
 fs/xfs/libxfs/xfs_verity.c    |  58 +++++
 fs/xfs/libxfs/xfs_verity.h    |  13 +
 fs/xfs/xfs_aops.c             |  56 +++-
 fs/xfs/xfs_fsops.c            |   1 +
 fs/xfs/xfs_fsverity.c         | 471 ++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_fsverity.h         |  54 ++++
 fs/xfs/xfs_inode.h            |   2 +
 fs/xfs/xfs_iomap.h            |   2 +
 fs/xfs/xfs_mount.c            |   1 +
 fs/xfs/xfs_super.c            |   9 +
 fs/xfs/xfs_trace.c            |   1 +
 fs/xfs/xfs_trace.h            |  39 +++
 17 files changed, 730 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_verity.c
 create mode 100644 fs/xfs/libxfs/xfs_verity.h
 create mode 100644 fs/xfs/xfs_fsverity.c
 create mode 100644 fs/xfs/xfs_fsverity.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index ed9b0dabc1f1..ebee7b75e5ae 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -57,6 +57,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_trans_resv.o \
 				   xfs_trans_space.o \
 				   xfs_types.o \
+				   xfs_verity.o \
 				   )
 # xfs_rtbitmap is shared with libxfs
 xfs-$(CONFIG_XFS_RT)		+= $(addprefix libxfs/, \
@@ -140,6 +141,7 @@ xfs-$(CONFIG_XFS_POSIX_ACL)	+= xfs_acl.o
 xfs-$(CONFIG_SYSCTL)		+= xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)		+= xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)	+= xfs_pnfs.o
+xfs-$(CONFIG_FS_VERITY)		+= xfs_fsverity.o
 
 # notify failure
 ifeq ($(CONFIG_MEMORY_FAILURE),y)
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index 1f24cfa27321..ea30f982771e 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -89,6 +89,7 @@ struct xfs_perag {
 
 	/* background prealloc block trimming */
 	struct delayed_work	pag_blockgc_work;
+
 #endif /* __KERNEL__ */
 };
 
diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 3f3699e9c203..9c416d2506a4 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -28,6 +28,7 @@
 #include "xfs_xattr.h"
 #include "xfs_parent.h"
 #include "xfs_iomap.h"
+#include "xfs_verity.h"
 
 struct kmem_cache		*xfs_attr_intent_cache;
 
@@ -1766,6 +1767,9 @@ xfs_attr_namecheck(
 	if (!xfs_attr_check_namespace(attr_flags))
 		return false;
 
+	if (attr_flags & XFS_ATTR_VERITY)
+		return xfs_verity_namecheck(attr_flags, name, length);
+
 	/*
 	 * MAXNAMELEN includes the trailing null, but (name/length) leave it
 	 * out, so use >= for the length check.
diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index 060cedb4c12d..cb49e2629bb5 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -924,4 +924,18 @@ struct xfs_parent_rec {
 	__be32	p_gen;
 } __packed;
 
+/*
+ * fs-verity attribute name format
+ *
+ * Merkle tree blocks are stored under extended attributes of the inode.  The
+ * names of the attributes are byte positions into the merkle data.
+ */
+struct xfs_merkle_key {
+	__be64	mk_pos;
+};
+
+/* ondisk xattr name used for the fsverity descriptor */
+#define XFS_VERITY_DESCRIPTOR_NAME	"vdesc"
+#define XFS_VERITY_DESCRIPTOR_NAME_LEN	(sizeof(XFS_VERITY_DESCRIPTOR_NAME) - 1)
+
 #endif /* __XFS_DA_FORMAT_H__ */
diff --git a/fs/xfs/libxfs/xfs_ondisk.h b/fs/xfs/libxfs/xfs_ondisk.h
index 2617081bf989..e4ac5a0a01fd 100644
--- a/fs/xfs/libxfs/xfs_ondisk.h
+++ b/fs/xfs/libxfs/xfs_ondisk.h
@@ -292,6 +292,10 @@ xfs_check_ondisk_structs(void)
 	XFS_CHECK_SB_OFFSET(sb_rgextents,		276);
 	XFS_CHECK_SB_OFFSET(sb_rgblklog,		280);
 	XFS_CHECK_SB_OFFSET(sb_pad,			281);
+
+	/* fs-verity xattrs */
+	XFS_CHECK_STRUCT_SIZE(struct xfs_merkle_key,		8);
+	XFS_CHECK_VALUE(sizeof(XFS_VERITY_DESCRIPTOR_NAME),	6);
 }
 
 #endif /* __XFS_ONDISK_H */
diff --git a/fs/xfs/libxfs/xfs_verity.c b/fs/xfs/libxfs/xfs_verity.c
new file mode 100644
index 000000000000..ff02c5c840b5
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_verity.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2023 Red Hat, Inc.
+ */
+#include "xfs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_log_format.h"
+#include "xfs_attr.h"
+#include "xfs_verity.h"
+
+/* Set a merkle tree pos in preparation for setting merkle tree attrs. */
+void
+xfs_merkle_key_to_disk(
+	struct xfs_merkle_key	*key,
+	uint64_t		pos)
+{
+	key->mk_pos = cpu_to_be64(pos);
+}
+
+/* Retrieve the merkle tree pos from the attr data. */
+uint64_t
+xfs_merkle_key_from_disk(
+	const void		*attr_name,
+	int			namelen)
+{
+	const struct xfs_merkle_key *key = attr_name;
+
+	ASSERT(namelen == sizeof(struct xfs_merkle_key));
+
+	return be64_to_cpu(key->mk_pos);
+}
+
+/* Return true if verity attr name is valid. */
+bool
+xfs_verity_namecheck(
+	unsigned int		attr_flags,
+	const void		*name,
+	int			namelen)
+{
+	if (!(attr_flags & XFS_ATTR_VERITY))
+		return false;
+
+	/*
+	 * Merkle tree pages are stored under u64 indexes; verity descriptor
+	 * blocks are held in a named attribute.
+	 */
+	if (namelen != sizeof(struct xfs_merkle_key) &&
+	    namelen != XFS_VERITY_DESCRIPTOR_NAME_LEN)
+		return false;
+
+	return true;
+}
diff --git a/fs/xfs/libxfs/xfs_verity.h b/fs/xfs/libxfs/xfs_verity.h
new file mode 100644
index 000000000000..5813665c5a01
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_verity.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Red Hat, Inc.
+ */
+#ifndef __XFS_VERITY_H__
+#define __XFS_VERITY_H__
+
+void xfs_merkle_key_to_disk(struct xfs_merkle_key *key, uint64_t pos);
+uint64_t xfs_merkle_key_from_disk(const void *attr_name, int namelen);
+bool xfs_verity_namecheck(unsigned int attr_flags, const void *name,
+		int namelen);
+
+#endif	/* __XFS_VERITY_H__ */
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 559a3a577097..bcc51628dbdd 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -19,6 +19,8 @@
 #include "xfs_reflink.h"
 #include "xfs_errortag.h"
 #include "xfs_error.h"
+#include "xfs_fsverity.h"
+#include <linux/fsverity.h>
 
 struct xfs_writepage_ctx {
 	struct iomap_writepage_ctx ctx;
@@ -132,6 +134,10 @@ xfs_end_ioend(
 
 	if (!error && xfs_ioend_is_append(ioend))
 		error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
+
+	/* This IO was to the Merkle tree region */
+	if (xfs_fsverity_in_region(ioend->io_offset))
+		error = xfs_fsverity_end_ioend(ip, ioend);
 done:
 	iomap_finish_ioends(ioend, error);
 	memalloc_nofs_restore(nofs_flag);
@@ -512,19 +518,65 @@ xfs_vm_bmap(
 	return iomap_bmap(mapping, block, &xfs_read_iomap_ops);
 }
 
+static void
+xfs_read_end_io(
+	struct bio *bio)
+{
+	struct iomap_read_ioend *ioend =
+		container_of(bio, struct iomap_read_ioend, io_bio);
+	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
+
+	WARN_ON_ONCE(!queue_work(ip->i_mount->m_postread_workqueue,
+					&ioend->io_work));
+}
+
+static void
+xfs_prepare_read_ioend(
+	struct iomap_read_ioend	*ioend)
+{
+	if (ioend->io_flags & IOMAP_F_BEYOND_EOF) {
+		INIT_WORK(&ioend->io_work, &xfs_attr_verify_args);
+		ioend->io_bio.bi_end_io = &xfs_read_end_io;
+		return;
+	}
+
+	if (!fsverity_active(ioend->io_inode))
+		return;
+
+	INIT_WORK(&ioend->io_work, &iomap_read_fsverity_end_io_work);
+	ioend->io_bio.bi_end_io = &xfs_read_end_io;
+}
+
+static const struct iomap_readpage_ops xfs_readpage_ops = {
+	.prepare_ioend		= &xfs_prepare_read_ioend,
+};
+
 STATIC int
 xfs_vm_read_folio(
 	struct file		*unused,
 	struct folio		*folio)
 {
-	return iomap_read_folio(folio, &xfs_read_iomap_ops);
+	struct iomap_readpage_ops xfs_readpage_ops = {
+		.prepare_ioend	= xfs_prepare_read_ioend
+	};
+	struct iomap_readpage_ctx ctx = {
+		.cur_folio	= folio,
+		.ops		= &xfs_readpage_ops,
+	};
+
+	return iomap_read_folio_ctx(&ctx, &xfs_read_iomap_ops);
 }
 
 STATIC void
 xfs_vm_readahead(
 	struct readahead_control	*rac)
 {
-	iomap_readahead(rac, &xfs_read_iomap_ops);
+	struct iomap_readpage_ctx ctx = {
+		.rac = rac,
+		.ops = &xfs_readpage_ops,
+	};
+
+	iomap_readahead_ctx(&ctx, &xfs_read_iomap_ops);
 }
 
 static int
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 28dde215c899..3962ce5e3023 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -21,6 +21,7 @@
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
 #include "xfs_trace.h"
+#include "xfs_fsverity.h"
 
 /*
  * Write new AG headers to disk. Non-transactional, but need to be
diff --git a/fs/xfs/xfs_fsverity.c b/fs/xfs/xfs_fsverity.c
new file mode 100644
index 000000000000..0af0f22ff075
--- /dev/null
+++ b/fs/xfs/xfs_fsverity.c
@@ -0,0 +1,471 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2023 Red Hat, Inc.
+ */
+#include "xfs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_log_format.h"
+#include "xfs_attr.h"
+#include "xfs_verity.h"
+#include "xfs_bmap_util.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_attr_leaf.h"
+#include "xfs_trace.h"
+#include "xfs_quota.h"
+#include "xfs_ag.h"
+#include "xfs_fsverity.h"
+#include "xfs_iomap.h"
+#include "xfs_bmap.h"
+#include "xfs_format.h"
+#include <linux/fsverity.h>
+#include <linux/iomap.h>
+
+/*
+ * Initialize an args structure to load or store the fsverity descriptor.
+ * Caller must ensure @args is zeroed except for value and valuelen.
+ */
+static inline void
+xfs_fsverity_init_vdesc_args(
+	struct xfs_inode	*ip,
+	struct xfs_da_args	*args)
+{
+	args->geo = ip->i_mount->m_attr_geo;
+	args->whichfork = XFS_ATTR_FORK;
+	args->attr_filter = XFS_ATTR_VERITY;
+	args->op_flags = XFS_DA_OP_OKNOENT;
+	args->dp = ip;
+	args->owner = ip->i_ino;
+	args->name = XFS_VERITY_DESCRIPTOR_NAME;
+	args->namelen = XFS_VERITY_DESCRIPTOR_NAME_LEN;
+	xfs_attr_sethash(args);
+}
+
+/*
+ * Initialize an args structure to load or store a merkle tree block.
+ * Caller must ensure @args is zeroed except for value and valuelen.
+ */
+inline void
+xfs_fsverity_init_merkle_args(
+	struct xfs_inode	*ip,
+	struct xfs_merkle_key	*key,
+	uint64_t		merkleoff,
+	struct xfs_da_args	*args)
+{
+	xfs_merkle_key_to_disk(key, merkleoff);
+	args->geo = ip->i_mount->m_attr_geo;
+	args->whichfork = XFS_ATTR_FORK;
+	args->attr_filter = XFS_ATTR_VERITY;
+	args->op_flags = XFS_DA_OP_OKNOENT;
+	args->dp = ip;
+	args->owner = ip->i_ino;
+	args->name = (const uint8_t *)key;
+	args->namelen = sizeof(struct xfs_merkle_key);
+	args->region_offset = XFS_FSVERITY_MTREE_OFFSET;
+	xfs_attr_sethash(args);
+}
+
+/* Delete the verity descriptor. */
+static int
+xfs_fsverity_delete_descriptor(
+	struct xfs_inode	*ip)
+{
+	struct xfs_da_args	args = { };
+
+	xfs_fsverity_init_vdesc_args(ip, &args);
+	return xfs_attr_set(&args, XFS_ATTRUPDATE_REMOVE, false);
+}
+
+/* Delete a merkle tree block. */
+static int
+xfs_fsverity_delete_merkle_block(
+	struct xfs_inode	*ip,
+	u64			pos)
+{
+	struct xfs_merkle_key	name;
+	struct xfs_da_args	args = { };
+
+	xfs_fsverity_init_merkle_args(ip, &name, pos, &args);
+	return xfs_attr_set(&args, XFS_ATTRUPDATE_REMOVE, false);
+}
+
+/* Retrieve the verity descriptor. */
+static int
+xfs_fsverity_get_descriptor(
+	struct inode		*inode,
+	void			*buf,
+	size_t			buf_size)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_da_args	args = {
+		.value		= buf,
+		.valuelen	= buf_size,
+	};
+	int			error = 0;
+
+	/*
+	 * The fact that (returned attribute size) == (provided buf_size) is
+	 * checked by xfs_attr_copy_value() (returns -ERANGE).  No descriptor
+	 * is treated as a short read so that common fsverity code will
+	 * complain.
+	 */
+	xfs_fsverity_init_vdesc_args(ip, &args);
+	error = xfs_attr_get(&args);
+	if (error == -ENOATTR)
+		return 0;
+	if (error)
+		return error;
+
+	return args.valuelen;
+}
+
+/*
+ * Clear out old fsverity metadata before we start building a new one.  This
+ * could happen if, say, we crashed while building fsverity data.
+ */
+static int
+xfs_fsverity_delete_stale_metadata(
+	struct xfs_inode	*ip,
+	u64			new_tree_size,
+	unsigned int		tree_blocksize)
+{
+	u64			pos;
+	int			error = 0;
+
+	/*
+	 * Delete merkle tree blocks in increasing blkno order until we don't
+	 * find any more.  That ought to be good enough for avoiding dead
+	 * bloat without excessive runtime.
+	 */
+	for (pos = new_tree_size; !error; pos += tree_blocksize) {
+		if (fatal_signal_pending(current))
+			return -EINTR;
+		error = xfs_fsverity_delete_merkle_block(ip, pos);
+		if (error)
+			break;
+	}
+
+	return error != -ENOATTR ? error : 0;
+}
+
+/* Prepare to enable fsverity by clearing old metadata. */
+static int
+xfs_fsverity_begin_enable(
+	struct file		*filp,
+	u64			merkle_tree_size,
+	unsigned int		tree_blocksize)
+{
+	struct inode		*inode = file_inode(filp);
+	struct xfs_inode	*ip = XFS_I(inode);
+	int			error;
+
+	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
+
+	if (IS_DAX(inode))
+		return -EINVAL;
+
+	if (xfs_iflags_test_and_set(ip, XFS_VERITY_CONSTRUCTION))
+		return -EBUSY;
+
+	error = xfs_qm_dqattach(ip);
+	if (error)
+		return error;
+
+	return xfs_fsverity_delete_stale_metadata(ip, merkle_tree_size,
+			tree_blocksize);
+}
+
+/* Try to remove all the fsverity metadata after a failed enablement. */
+static int
+xfs_fsverity_delete_metadata(
+	struct xfs_inode	*ip,
+	u64			merkle_tree_size,
+	unsigned int		tree_blocksize)
+{
+	u64			pos;
+	int			error;
+
+	if (!merkle_tree_size)
+		return 0;
+
+	for (pos = 0; pos < merkle_tree_size; pos += tree_blocksize) {
+		if (fatal_signal_pending(current))
+			return -EINTR;
+		error = xfs_fsverity_delete_merkle_block(ip, pos);
+		if (error == -ENOATTR)
+			error = 0;
+		if (error)
+			return error;
+	}
+
+	error = xfs_fsverity_delete_descriptor(ip);
+	return error != -ENOATTR ? error : 0;
+}
+
+/* Complete (or fail) the process of enabling fsverity. */
+static int
+xfs_fsverity_end_enable(
+	struct file		*filp,
+	const void		*desc,
+	size_t			desc_size,
+	u64			merkle_tree_size,
+	unsigned int		tree_blocksize)
+{
+	struct xfs_da_args	args = {
+		.value		= (void *)desc,
+		.valuelen	= desc_size,
+	};
+	struct inode		*inode = file_inode(filp);
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	int			error = 0;
+
+	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
+
+	/* fs-verity failed, just clean up */
+	if (desc == NULL)
+		goto out;
+
+	xfs_fsverity_init_vdesc_args(ip, &args);
+	error = xfs_attr_set(&args, XFS_ATTRUPDATE_UPSERT, false);
+	if (error)
+		goto out;
+
+	error = filemap_write_and_wait(inode->i_mapping);
+	if (error)
+		goto out;
+
+	/* Set fsverity inode flag */
+	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_ichange,
+			0, 0, false, &tp);
+	if (error)
+		goto out;
+
+	/*
+	 * Ensure that we've persisted the verity information before we enable
+	 * it on the inode and tell the caller we have sealed the inode.
+	 */
+	ip->i_diflags2 |= XFS_DIFLAG2_VERITY;
+
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+	xfs_trans_set_sync(tp);
+
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	if (!error)
+		inode->i_flags |= S_VERITY;
+
+out:
+	if (error) {
+		int	error2;
+
+		error2 = xfs_fsverity_delete_metadata(ip,
+				merkle_tree_size, tree_blocksize);
+		if (error2)
+			xfs_alert(ip->i_mount,
+ "ino 0x%llx failed to clean up new fsverity metadata, err %d",
+					ip->i_ino, error2);
+	}
+
+	xfs_iflags_clear(ip, XFS_VERITY_CONSTRUCTION);
+	return error;
+}
+
+static int
+xfs_fsverity_read_iomap_begin(
+	struct inode		*inode,
+	loff_t			pos,
+	loff_t			length,
+	unsigned		flags,
+	struct iomap		*iomap,
+	struct iomap		*srcmap)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_merkle_key	name;
+	struct xfs_da_args	args = { };
+
+	pos = pos & XFS_FSVERITY_MTREE_MASK;
+	xfs_fsverity_init_merkle_args(ip, &name, pos, &args);
+
+	return xfs_attr_read_iomap(&args, iomap);
+}
+
+const struct iomap_ops xfs_fsverity_read_iomap_ops = {
+	.iomap_begin = xfs_fsverity_read_iomap_begin,
+};
+
+static int
+xfs_fsverity_write_iomap_begin(
+	struct inode		*inode,
+	loff_t			pos,
+	loff_t			length,
+	unsigned		flags,
+	struct iomap		*iomap,
+	struct iomap		*srcmap)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_da_args	args;
+	struct xfs_merkle_key	name;
+	loff_t			xattr_name;
+	unsigned int		xattr_size;
+	int			error;
+
+	if (xfs_is_shutdown(mp))
+		return -EIO;
+
+	pos = pos & XFS_FSVERITY_MTREE_MASK;
+
+	/* We always allocate one xattr block, even for the smallest Merkle
+	 * trees, because iomap uses this block. */
+	/* TODO: this can be optimized to use shortname attributes */
+	xattr_size = mp->m_attr_geo->blksize;
+	xattr_name = pos & ~(xattr_size - 1);
+
+	xfs_fsverity_init_merkle_args(ip, &name, xattr_name, &args);
+	args.valuelen = xattr_size;
+	args.region_offset = XFS_FSVERITY_MTREE_OFFSET;
+
+	error = xfs_attr_write_iomap(&args, iomap);
+	if (error)
+		return error;
+
+	/* Offset into xattr block. One block can have multiple merkle tree
+	 * blocks */
+	iomap->offset += (pos & (xattr_size - 1));
+	/* Instead of the attribute size (which is blksize), use the requested
+	 * size */
+	iomap->length = length;
+
+	return 0;
+}
+
+int
+xfs_fsverity_end_ioend(
+	struct xfs_inode	*ip,
+	struct iomap_ioend	*ioend)
+{
+	struct xfs_da_args	args;
+	struct xfs_merkle_key	name;
+	loff_t			pos;
+	struct bio		*bio = &ioend->io_bio;
+	void			*addr;
+	int			error;
+	struct folio		*folio = bio_first_folio_all(bio);
+
+	pos = ioend->io_offset & XFS_FSVERITY_MTREE_MASK;
+	xfs_fsverity_init_merkle_args(ip, &name, pos, &args);
+	args.valuelen = ioend->io_size;
+	addr = kmap_local_folio(folio, 0);
+	args.value = addr;
+	error = xfs_attr_write_end_ioend(&args);
+	kunmap_local(addr);
+
+	return error;
+}
+
+const struct iomap_ops xfs_fsverity_write_iomap_ops = {
+	.iomap_begin = xfs_fsverity_write_iomap_begin,
+};
+
+void
+xfs_attr_verify_args(
+		struct work_struct	*work)
+{
+	struct xfs_inode		*ip;
+	void				*addr;
+	struct xfs_merkle_key		name;
+	struct xfs_da_args		args;
+	int				error;
+	struct iomap_read_ioend		*ioend =
+		container_of(work, struct iomap_read_ioend, io_work);
+	struct bio			*bio = &ioend->io_bio;
+	struct folio			*folio = bio_first_folio_all(bio);
+
+	ip = XFS_I(ioend->io_inode);
+	xfs_fsverity_init_merkle_args(ip, &name, ioend->io_offset, &args);
+	addr = kmap_local_folio(folio, 0);
+	args.valuelen = ioend->io_size;
+	args.value = addr;
+	error = xfs_attr_read_end_io(&args);
+	kunmap_local(addr);
+	if (error)
+		bio->bi_status = BLK_STS_IOERR;
+	iomap_read_end_io(bio);
+}
+
+/* Retrieve a merkle tree block. */
+static struct page *
+xfs_fsverity_read_merkle(
+	struct inode	*inode,
+	pgoff_t		index,
+	unsigned long	num_ra_pages)
+{
+	struct folio	*folio;
+	unsigned int	block_size;
+	u64		tree_size;
+	int		error;
+	u8		log_blocksize;
+
+	error = fsverity_merkle_tree_geometry(inode, &log_blocksize, &block_size,
+				      &tree_size);
+	if (error)
+		return ERR_PTR(error);
+
+	struct ioregion region = {
+		.inode = inode,
+		.pos = index << log_blocksize,
+		.length = block_size,
+		.offset = XFS_FSVERITY_MTREE_OFFSET,
+		.ops = &xfs_fsverity_read_iomap_ops,
+	};
+
+	folio = iomap_read_region(&region);
+	if (IS_ERR(folio))
+		return ERR_CAST(folio);
+
+	/* Wait for buffered read to finish */
+	error = folio_wait_locked_killable(folio);
+	if (error)
+		return ERR_PTR(error);
+	if (!folio_test_uptodate(folio))
+		return ERR_PTR(-EFSCORRUPTED);
+
+	return folio_file_page(folio, 0);
+}
+
+/* Write a merkle tree block. */
+static int
+xfs_fsverity_write_merkle(
+	struct inode	*inode,
+	const void	*buf,
+	u64		pos,
+	unsigned int	size)
+{
+	struct ioregion region = {
+		.inode = inode,
+		.pos = pos,
+		.buf = buf,
+		.length = size,
+		.offset = XFS_FSVERITY_MTREE_OFFSET,
+		.ops = &xfs_fsverity_write_iomap_ops,
+	};
+
+	return iomap_write_region(&region);
+}
+
+const struct fsverity_operations xfs_fsverity_ops = {
+	.begin_enable_verity		= xfs_fsverity_begin_enable,
+	.end_enable_verity		= xfs_fsverity_end_enable,
+	.get_verity_descriptor		= xfs_fsverity_get_descriptor,
+	.read_merkle_tree_page		= xfs_fsverity_read_merkle,
+	.write_merkle_tree_block	= xfs_fsverity_write_merkle,
+};
diff --git a/fs/xfs/xfs_fsverity.h b/fs/xfs/xfs_fsverity.h
new file mode 100644
index 000000000000..c14b01508349
--- /dev/null
+++ b/fs/xfs/xfs_fsverity.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Red Hat, Inc.
+ */
+#ifndef __XFS_FSVERITY_H__
+#define __XFS_FSVERITY_H__
+
+#include "xfs_inode.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include <linux/iomap.h>
+
+#ifdef CONFIG_FS_VERITY
+
+/* Merkle tree location in page cache. We take memory region from the inode's
+ * address space for Merkle tree. */
+#define XFS_FSVERITY_MTREE_OFFSET (1 << 30)
+#define XFS_FSVERITY_MTREE_MASK (XFS_FSVERITY_MTREE_OFFSET - 1)
+
+inline void
+xfs_fsverity_init_merkle_args(
+	struct xfs_inode	*ip,
+	struct xfs_merkle_key	*key,
+	uint64_t		merkleoff,
+	struct xfs_da_args	*args);
+
+struct xfs_merkle_bkey {
+	/* inumber of the file */
+	xfs_ino_t		ino;
+
+	/* the position of the block in the Merkle tree (in bytes) */
+	u64			pos;
+};
+
+int
+xfs_fsverity_end_ioend(
+	struct xfs_inode	*ip,
+	struct iomap_ioend	*ioend);
+
+static inline bool
+xfs_fsverity_in_region(
+		loff_t pos)
+{
+	return pos >= XFS_FSVERITY_MTREE_OFFSET;
+}
+void xfs_attr_verify_args(struct work_struct *work);
+
+extern const struct fsverity_operations xfs_fsverity_ops;
+#else
+#define xfs_fsverity_bmbt_irec(ip, key, merkleoff, args) (0)
+#define xfs_fsverity_end_ioend(ip, ioend) (0)
+#endif	/* CONFIG_FS_VERITY */
+
+#endif	/* __XFS_FSVERITY_H__ */
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 1648dc5a8068..e0b2e7acdf74 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -404,6 +404,8 @@ xfs_inode_can_atomicwrite(
  */
 #define XFS_IREMAPPING		(1U << 15)
 
+#define XFS_VERITY_CONSTRUCTION	(1U << 16) /* merkle tree construction */
+
 /* All inode state flags related to inode reclaim. */
 #define XFS_ALL_IRECLAIM_FLAGS	(XFS_IRECLAIMABLE | \
 				 XFS_IRECLAIM | \
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 8347268af727..d6cbd675e96a 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -53,5 +53,7 @@ extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
 extern const struct iomap_ops xfs_xattr_iomap_ops;
 extern const struct iomap_ops xfs_dax_write_iomap_ops;
+extern const struct iomap_ops xfs_fsverity_read_iomap_ops;
+extern const struct iomap_ops xfs_fsverity_write_iomap_ops;
 
 #endif /* __XFS_IOMAP_H__*/
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 5918f433dba7..0f60eedf3d76 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -37,6 +37,7 @@
 #include "xfs_rtbitmap.h"
 #include "xfs_metafile.h"
 #include "xfs_rtgroup.h"
+#include "xfs_fsverity.h"
 #include "scrub/stats.h"
 
 static DEFINE_MUTEX(xfs_uuid_table_mutex);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 3de6717e4fad..88862092f838 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -30,6 +30,7 @@
 #include "xfs_filestream.h"
 #include "xfs_quota.h"
 #include "xfs_sysfs.h"
+#include "xfs_fsverity.h"
 #include "xfs_ondisk.h"
 #include "xfs_rmap_item.h"
 #include "xfs_refcount_item.h"
@@ -53,6 +54,7 @@
 #include <linux/fs_context.h>
 #include <linux/fs_parser.h>
 #include <linux/fsverity.h>
+#include <linux/iomap.h>
 
 static const struct super_operations xfs_super_operations;
 
@@ -1555,6 +1557,9 @@ xfs_fs_fill_super(
 	sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP | QTYPE_MASK_PRJ;
 #endif
 	sb->s_op = &xfs_super_operations;
+#ifdef CONFIG_FS_VERITY
+	sb->s_vop = &xfs_fsverity_ops;
+#endif
 
 	/*
 	 * Delay mount work if the debug hook is set. This is debug
@@ -1799,6 +1804,10 @@ xfs_fs_fill_super(
 		xfs_set_resuming_quotaon(mp);
 	mp->m_qflags &= ~XFS_QFLAGS_MNTOPTS;
 
+	if (xfs_has_verity(mp))
+		xfs_warn(mp,
+	"EXPERIMENTAL fsverity feature in use. Use at your own risk!");
+
 	error = xfs_mountfs(mp);
 	if (error)
 		goto out_filestream_unmount;
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 8f530e69c18a..6e5a1b17c2f4 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -49,6 +49,7 @@
 #include "xfs_metafile.h"
 #include "xfs_metadir.h"
 #include "xfs_rtgroup.h"
+#include "xfs_fsverity.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index de937b3770d3..0bd6d1e992e2 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -98,6 +98,7 @@ struct xfs_rmap_intent;
 struct xfs_refcount_intent;
 struct xfs_metadir_update;
 struct xfs_rtgroup;
+struct xfs_merkle_bkey;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -5576,6 +5577,44 @@ DEFINE_EVENT(xfs_metadir_class, name, \
 	TP_ARGS(dp, name, ino))
 DEFINE_METADIR_EVENT(xfs_metadir_lookup);
 
+#ifdef CONFIG_FS_VERITY
+DECLARE_EVENT_CLASS(xfs_fsverity_cache_class,
+	TP_PROTO(struct xfs_mount *mp, const struct xfs_merkle_bkey *key,
+		 unsigned long caller_ip),
+	TP_ARGS(mp, key, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(u64, pos)
+		__field(void *, caller_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->ino = key->ino;
+		__entry->pos = key->pos;
+		__entry->caller_ip = (void *)caller_ip;
+	),
+	TP_printk("dev %d:%d ino 0x%llx pos 0x%llx caller %pS",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->pos,
+		  __entry->caller_ip)
+)
+
+#define DEFINE_XFS_FSVERITY_CACHE_EVENT(name) \
+DEFINE_EVENT(xfs_fsverity_cache_class, name, \
+	TP_PROTO(struct xfs_mount *mp, const struct xfs_merkle_bkey *key, \
+		 unsigned long caller_ip), \
+	TP_ARGS(mp, key, caller_ip))
+DEFINE_XFS_FSVERITY_CACHE_EVENT(xfs_fsverity_cache_miss);
+DEFINE_XFS_FSVERITY_CACHE_EVENT(xfs_fsverity_cache_hit);
+DEFINE_XFS_FSVERITY_CACHE_EVENT(xfs_fsverity_cache_reuse);
+DEFINE_XFS_FSVERITY_CACHE_EVENT(xfs_fsverity_cache_store);
+DEFINE_XFS_FSVERITY_CACHE_EVENT(xfs_fsverity_cache_drop);
+DEFINE_XFS_FSVERITY_CACHE_EVENT(xfs_fsverity_cache_unmount);
+DEFINE_XFS_FSVERITY_CACHE_EVENT(xfs_fsverity_cache_reclaim);
+#endif /* CONFIG_FS_VERITY */
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 18/24] xfs: add writeback page mapping for fs-verity
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (16 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 17/24] xfs: add fs-verity support Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 19/24] xfs: use merkle tree offset as attr hash Andrey Albershteyn
                     ` (5 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

Data from the fs-verity region is not mapped as file data but as a set of
extended attributes.

Add a mapping function which removes the region offset and maps the n-th
page to the attribute with name n.
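
As a rough illustration (not part of the patch), the offset arithmetic in
xfs_fsverity_map_blocks() can be sketched in plain C. The constants are
copied from xfs_fsverity.h in this series; the helper names
mtree_xattr_name() and mtree_xattr_inner_off() are hypothetical. Note that
the code below uses the byte offset into the Merkle tree, rounded down to
the write length, as the attribute name.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from xfs_fsverity.h in this series. */
#define MTREE_OFFSET	(1ULL << 30)
#define MTREE_MASK	(MTREE_OFFSET - 1)

/*
 * Hypothetical helper: strip the region offset from a page cache offset
 * and round down to the start of the xattr block that holds it.
 */
static inline uint64_t
mtree_xattr_name(uint64_t offset, uint32_t len)
{
	/* The rounding below is only valid for power-of-two lengths. */
	assert(len != 0 && (len & (len - 1)) == 0);
	return (offset & MTREE_MASK) & ~((uint64_t)len - 1);
}

/* Hypothetical helper: byte offset of the request inside that xattr block. */
static inline uint64_t
mtree_xattr_inner_off(uint64_t offset, uint32_t len)
{
	assert(len != 0 && (len & (len - 1)) == 0);
	return (offset & MTREE_MASK) & ((uint64_t)len - 1);
}
```

For a 4096-byte attribute block, page cache offset (1 << 30) + 0x1234 lands
in the attribute named 0x1000, at inner offset 0x234.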

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/xfs_aops.c | 85 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 80 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index bcc51628dbdd..976d77277e95 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -20,6 +20,7 @@
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_fsverity.h"
+#include "xfs_attr.h"
 #include <linux/fsverity.h>
 
 struct xfs_writepage_ctx {
@@ -132,7 +133,8 @@ xfs_end_ioend(
 	else if (ioend->io_type == IOMAP_UNWRITTEN)
 		error = xfs_iomap_write_unwritten(ip, offset, size, false);
 
-	if (!error && xfs_ioend_is_append(ioend))
+	if (!error && !xfs_fsverity_in_region(ioend->io_offset) &&
+			xfs_ioend_is_append(ioend))
 		error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
 
 	/* This IO was to the Merkle tree region */
@@ -472,14 +474,87 @@ static const struct iomap_writeback_ops xfs_writeback_ops = {
 	.discard_folio		= xfs_discard_folio,
 };
 
+static int
+xfs_fsverity_map_blocks(
+	struct iomap_writepage_ctx *wpc,
+	struct inode		*inode,
+	loff_t			offset,
+	unsigned int		len)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	int			error = 0;
+	int			nmap = 1;
+	loff_t			pos;
+	int			seq;
+	struct xfs_bmbt_irec	imap;
+	struct xfs_da_args	args;
+	struct xfs_merkle_key	name;
+	loff_t			xattr_name;
+
+	if (xfs_is_shutdown(mp))
+		return -EIO;
+
+	pos = (offset & XFS_FSVERITY_MTREE_MASK);
+	/* We always write one attribute block, but each block can have multiple
+	 * Merkle tree blocks */
+	ASSERT(is_power_of_2(len));
+	xattr_name = pos & ~(len - 1);
+
+	xfs_fsverity_init_merkle_args(ip, &name, xattr_name, &args);
+
+	error = xfs_attr_get(&args);
+	if (error)
+		return error;
+
+	ASSERT(args.dp->i_af.if_format != XFS_DINODE_FMT_LOCAL);
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+	error = xfs_bmapi_read(ip, (xfs_fileoff_t)args.rmtblkno,
+			       args.rmtblkcnt, &imap, &nmap,
+			       XFS_BMAPI_ATTRFORK);
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+	if (error)
+		return error;
+
+	/* Instead of xattr extent offset, which will be over data, we need
+	 * merkle tree offset in page cache */
+	imap.br_startoff =
+		XFS_B_TO_FSBT(mp, xattr_name | XFS_FSVERITY_MTREE_OFFSET);
+
+	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_XATTR);
+	xfs_bmbt_to_iomap(ip, &wpc->iomap, &imap, 0, IOMAP_F_XATTR, seq);
+
+	trace_xfs_map_blocks_found(ip, offset, len, XFS_ATTR_FORK, &imap);
+
+	/* We want this to be separate from other IO as we will do
+	 * CRC update on IO completion */
+	wpc->iomap.flags |= IOMAP_F_NO_MERGE;
+
+	return 0;
+}
+
+static const struct iomap_writeback_ops xfs_writeback_verity_ops = {
+	.map_blocks		= xfs_fsverity_map_blocks,
+	.prepare_ioend		= xfs_prepare_ioend,
+	.discard_folio		= xfs_discard_folio,
+};
+
 STATIC int
 xfs_vm_writepages(
-	struct address_space	*mapping,
-	struct writeback_control *wbc)
+	struct address_space		*mapping,
+	struct writeback_control	*wbc)
 {
-	struct xfs_writepage_ctx wpc = { };
+	struct xfs_writepage_ctx	wpc = { };
+	struct xfs_inode		*ip = XFS_I(mapping->host);
 
-	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
+	xfs_iflags_clear(ip, XFS_ITRUNCATED);
+
+	if (xfs_iflags_test(ip, XFS_VERITY_CONSTRUCTION)) {
+		wbc->range_start = XFS_FSVERITY_MTREE_OFFSET;
+		wbc->range_end = LLONG_MAX;
+		return iomap_writepages_unbound(mapping, wbc, &wpc.ctx,
+						&xfs_writeback_verity_ops);
+	}
 	return iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
 }
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 19/24] xfs: use merkle tree offset as attr hash
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (17 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 18/24] xfs: add writeback page mapping for fs-verity Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 20/24] xfs: add fs-verity ioctls Andrey Albershteyn
                     ` (4 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: "Darrick J. Wong" <djwong@kernel.org>

I was exploring the fsverity metadata with xfs_db after creating a 220MB
verity file, and I noticed the following in the debugger output:

entries[0-75] = [hashval,nameidx,incomplete,root,secure,local,parent,verity]
0:[0,4076,0,0,0,0,0,1]
1:[0,1472,0,0,0,1,0,1]
2:[0x800,4056,0,0,0,0,0,1]
3:[0x800,4036,0,0,0,0,0,1]
...
72:[0x12000,2716,0,0,0,0,0,1]
73:[0x12000,2696,0,0,0,0,0,1]
74:[0x12800,2676,0,0,0,0,0,1]
75:[0x12800,2656,0,0,0,0,0,1]
...
nvlist[0].merkle_off = 0x18000
nvlist[1].merkle_off = 0
nvlist[2].merkle_off = 0x19000
nvlist[3].merkle_off = 0x1000
...
nvlist[71].merkle_off = 0x5b000
nvlist[72].merkle_off = 0x44000
nvlist[73].merkle_off = 0x5c000
nvlist[74].merkle_off = 0x45000
nvlist[75].merkle_off = 0x5d000

Within just this attr leaf block, there are 76 attr entries, but only 38
distinct hash values.  There are 415 merkle tree blocks for this file,
but we already have hash collisions.  This isn't good performance from
the standard da hash function because we're mostly shifting and rolling
zeroes around.

However, we don't even have to do that much work -- the merkle tree
block keys are themselves u64 values.  Truncate that value to 32 bits
(the size of xfs_dahash_t) and use that for the hash.  We won't have any
collisions between merkle tree blocks until that tree grows to 2^32
blocks.  On a 4k block filesystem, we won't hit that unless the file
contains more than 2^49 bytes, assuming sha256.
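
A small standalone sketch of the arithmetic, assuming the hash is simply the
merkle offset shifted right by XFS_VERITY_HASH_SHIFT and truncated to 32
bits, as xfs_verity_hashname() below does.  The helper names are
hypothetical, and the size estimate only counts leaf-level tree blocks, so
treat it as an approximation:

```c
#include <assert.h>
#include <stdint.h>

#define VERITY_HASH_SHIFT	10	/* mirrors XFS_VERITY_HASH_SHIFT */

/* Hypothetical stand-in for the merkle-key branch of xfs_verity_hashname(). */
static inline uint32_t
verity_hash(uint64_t merkle_off)
{
	return (uint32_t)(merkle_off >> VERITY_HASH_SHIFT);
}

/*
 * log2 of the approximate file size at which leaf-level merkle offsets no
 * longer fit in the 32-bit hash.  The tree may span 2^(32 + shift) bytes;
 * each tree block of 2^blk_log2 bytes holds 2^(blk_log2 - digest_log2)
 * digests, and each digest covers one data block of 2^blk_log2 bytes.
 */
static inline unsigned int
collision_free_size_log2(unsigned int blk_log2, unsigned int digest_log2)
{
	unsigned int	tree_bytes_log2 = 32 + VERITY_HASH_SHIFT;

	assert(blk_log2 >= VERITY_HASH_SHIFT && digest_log2 <= blk_log2);
	return tree_bytes_log2 - digest_log2 + blk_log2;
}
```

With 4k blocks (blk_log2 = 12) and sha256 (32-byte digests, digest_log2 = 5)
this gives 49, which matches the 2^49 byte figure above.  The offsets from
the xfs_db dump, e.g. 0x18000 and 0x19000, hash to the distinct values 0x60
and 0x64.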

As a side effect, the keys for merkle tree blocks get written out in
roughly sequential order, though I didn't observe any change in
performance.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/libxfs/xfs_attr.c      |  2 ++
 fs/xfs/libxfs/xfs_da_format.h |  6 ++++++
 fs/xfs/libxfs/xfs_verity.c    | 16 ++++++++++++++++
 fs/xfs/libxfs/xfs_verity.h    |  1 +
 4 files changed, 25 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 9c416d2506a4..05021456578b 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -612,6 +612,8 @@ xfs_attr_hashval(
 
 	if (attr_flags & XFS_ATTR_PARENT)
 		return xfs_parent_hashattr(mp, name, namelen, value, valuelen);
+	if (attr_flags & XFS_ATTR_VERITY)
+		return xfs_verity_hashname(name, namelen);
 
 	return xfs_attr_hashname(name, namelen);
 }
diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index cb49e2629bb5..99ca5594ad02 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -938,4 +938,10 @@ struct xfs_merkle_key {
 #define XFS_VERITY_DESCRIPTOR_NAME	"vdesc"
 #define XFS_VERITY_DESCRIPTOR_NAME_LEN	(sizeof(XFS_VERITY_DESCRIPTOR_NAME) - 1)
 
+/*
+ * Merkle tree blocks cannot be smaller than 1k in size, so the hash function
+ * can right-shift the merkle offset by this amount without losing anything.
+ */
+#define XFS_VERITY_HASH_SHIFT		(10)
+
 #endif /* __XFS_DA_FORMAT_H__ */
diff --git a/fs/xfs/libxfs/xfs_verity.c b/fs/xfs/libxfs/xfs_verity.c
index ff02c5c840b5..8c470014b915 100644
--- a/fs/xfs/libxfs/xfs_verity.c
+++ b/fs/xfs/libxfs/xfs_verity.c
@@ -56,3 +56,19 @@ xfs_verity_namecheck(
 
 	return true;
 }
+
+/*
+ * Compute name hash for a verity attribute.  For merkle tree blocks, we want
+ * to use the merkle tree block offset as the hash value to avoid collisions
+ * between blocks unless the merkle tree becomes larger than 2^32 blocks.
+ */
+xfs_dahash_t
+xfs_verity_hashname(
+	const uint8_t		*name,
+	unsigned int		namelen)
+{
+	if (namelen != sizeof(struct xfs_merkle_key))
+		return xfs_attr_hashname(name, namelen);
+
+	return xfs_merkle_key_from_disk(name, namelen) >> XFS_VERITY_HASH_SHIFT;
+}
diff --git a/fs/xfs/libxfs/xfs_verity.h b/fs/xfs/libxfs/xfs_verity.h
index 5813665c5a01..3d7485c511d5 100644
--- a/fs/xfs/libxfs/xfs_verity.h
+++ b/fs/xfs/libxfs/xfs_verity.h
@@ -9,5 +9,6 @@ void xfs_merkle_key_to_disk(struct xfs_merkle_key *key, uint64_t pos);
 uint64_t xfs_merkle_key_from_disk(const void *attr_name, int namelen);
 bool xfs_verity_namecheck(unsigned int attr_flags, const void *name,
 		int namelen);
+xfs_dahash_t xfs_verity_hashname(const uint8_t *name, unsigned int namelen);
 
 #endif	/* __XFS_VERITY_H__ */
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 20/24] xfs: add fs-verity ioctls
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (18 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 19/24] xfs: use merkle tree offset as attr hash Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 21/24] xfs: advertise fs-verity being available on filesystem Andrey Albershteyn
                     ` (3 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

Add fs-verity ioctls to enable fs-verity on a file, dump its metadata
(descriptor and Merkle tree pages) and obtain the file's digest.
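
For reference, a minimal userspace sketch of how the enable ioctl is driven
through the generic fs-verity UAPI (<linux/fsverity.h>).  The wrapper name
enable_verity_sha256() is hypothetical; it assumes no salt and no built-in
signature, and the fd must be opened read-only with no writers on the file:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fsverity.h>

/*
 * Hypothetical wrapper: enable fs-verity with SHA-256 on an open fd.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
enable_verity_sha256(int fd, uint32_t block_size)
{
	struct fsverity_enable_arg arg;

	/* fs-verity only accepts power-of-two Merkle tree block sizes. */
	assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

	memset(&arg, 0, sizeof(arg));	/* reserved fields must be zero */
	arg.version = 1;
	arg.hash_algorithm = FS_VERITY_HASH_ALG_SHA256;
	arg.block_size = block_size;

	return ioctl(fd, FS_IOC_ENABLE_VERITY, &arg);
}
```

With this patch, a filesystem without the verity feature bit would make the
call fail with EOPNOTSUPP before fsverity_ioctl_enable() is reached.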

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: remove unnecessary casting]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_ioctl.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 0789c18aaa18..e62260a77b75 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -44,6 +44,7 @@
 
 #include <linux/mount.h>
 #include <linux/fileattr.h>
+#include <linux/fsverity.h>
 
 /* Return 0 on success or positive error */
 int
@@ -1410,6 +1411,21 @@ xfs_file_ioctl(
 	case XFS_IOC_COMMIT_RANGE:
 		return xfs_ioc_commit_range(filp, arg);
 
+	case FS_IOC_ENABLE_VERITY:
+		if (!xfs_has_verity(mp))
+			return -EOPNOTSUPP;
+		return fsverity_ioctl_enable(filp, arg);
+
+	case FS_IOC_MEASURE_VERITY:
+		if (!xfs_has_verity(mp))
+			return -EOPNOTSUPP;
+		return fsverity_ioctl_measure(filp, arg);
+
+	case FS_IOC_READ_VERITY_METADATA:
+		if (!xfs_has_verity(mp))
+			return -EOPNOTSUPP;
+		return fsverity_ioctl_read_metadata(filp, arg);
+
 	default:
 		return -ENOTTY;
 	}
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 21/24] xfs: advertise fs-verity being available on filesystem
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (19 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 20/24] xfs: add fs-verity ioctls Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 22/24] xfs: check and repair the verity inode flag state Andrey Albershteyn
                     ` (2 subsequent siblings)
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: "Darrick J. Wong" <djwong@kernel.org>

Advertise that this filesystem supports fsverity.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/libxfs/xfs_fs.h | 1 +
 fs/xfs/libxfs/xfs_sb.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 41ce4d3d650e..5cfd4043cb9b 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -247,6 +247,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_EXCHANGE_RANGE (1 << 24) /* exchange range */
 #define XFS_FSOP_GEOM_FLAGS_PARENT	(1 << 25) /* linux parent pointers */
 #define XFS_FSOP_GEOM_FLAGS_METADIR	(1 << 26) /* metadata directories */
+#define XFS_FSOP_GEOM_FLAGS_VERITY	(1 << 27) /* fs-verity */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 9945ad33a460..b8fd1759ebe8 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1500,6 +1500,8 @@ xfs_fs_geometry(
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_EXCHANGE_RANGE;
 	if (xfs_has_metadir(mp))
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_METADIR;
+	if (xfs_has_verity(mp))
+		geo->flags |= XFS_FSOP_GEOM_FLAGS_VERITY;
 	geo->rtsectsize = sbp->sb_blocksize;
 	geo->dirblocksize = xfs_dir2_dirblock_bytes(sbp);
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 22/24] xfs: check and repair the verity inode flag state
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (20 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 21/24] xfs: advertise fs-verity being available on filesystem Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 23/24] xfs: report verity failures through the health system Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 24/24] xfs: enable ro-compat fs-verity flag Andrey Albershteyn
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch

From: "Darrick J. Wong" <djwong@kernel.org>

If an inode has the incore verity iflag set, make sure that we can
actually activate fsverity on that inode.  If activation fails due to
a fsverity metadata validation error, clear the flag.  The usage model
for fsverity requires any program that cares about verity state to call
statx/getflags to check that the flag is set after opening the file, so
clearing the flag will not compromise that model.
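
The check that such a program is expected to make can be sketched as
follows.  This is only an illustration using the generic FS_IOC_GETFLAGS
interface; the helper name fd_verity_state() is hypothetical, and the
fallback FS_VERITY_FL definition is only there for older headers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#ifndef FS_VERITY_FL
#define FS_VERITY_FL	0x00100000	/* verity protected inode */
#endif

/*
 * Hypothetical helper: report whether the file behind fd currently has
 * the verity flag set.  Returns 0 and fills *is_verity on success, or -1
 * with errno set if the flags cannot be read.
 */
static int
fd_verity_state(int fd, bool *is_verity)
{
	int	flags = 0;

	assert(is_verity != NULL);

	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
		return -1;

	*is_verity = (flags & FS_VERITY_FL) != 0;
	return 0;
}
```

A caller that needs verity protection would treat both a failed call and
*is_verity == false as "not protected" and refuse to trust the contents.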

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/attr.c         |  7 ++++
 fs/xfs/scrub/common.c       | 68 +++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h       |  3 ++
 fs/xfs/scrub/inode.c        |  7 ++++
 fs/xfs/scrub/inode_repair.c | 36 ++++++++++++++++++++
 5 files changed, 121 insertions(+)

diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index d911cf9cad20..1f840d79cc9d 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -646,6 +646,13 @@ xchk_xattr(
 	if (!xfs_inode_hasattr(sc->ip))
 		return -ENOENT;
 
+	/*
+	 * If this is a verity file that won't activate, we cannot check the
+	 * merkle tree geometry.
+	 */
+	if (xchk_inode_verity_broken(sc->ip))
+		xchk_set_incomplete(sc);
+
 	/* Allocate memory for xattr checking. */
 	error = xchk_setup_xattr_buf(sc, 0);
 	if (error == -ENOMEM)
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 5cbd94b56582..00c07335725d 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -42,6 +42,8 @@
 #include "scrub/health.h"
 #include "scrub/tempfile.h"
 
+#include <linux/fsverity.h>
+
 /* Common code for the metadata scrubbers. */
 
 /*
@@ -1573,3 +1575,69 @@ xchk_inode_rootdir_inum(const struct xfs_inode *ip)
 		return mp->m_metadirip->i_ino;
 	return mp->m_rootip->i_ino;
 }
+
+/*
+ * If this inode has S_VERITY set on it, read the merkle tree geometry, which
+ * will activate the incore fsverity context for this file.  If the activation
+ * fails with anything other than ENOMEM, the file is corrupt, which we can
+ * detect later with fsverity_active.
+ *
+ * Callers must hold the IOLOCK and must not hold the ILOCK of sc->ip because
+ * activation reads xattrs.  @blocksize and @treesize will be filled out with
+ * merkle tree geometry if they are not NULL pointers.
+ */
+int
+xchk_inode_setup_verity(
+	struct xfs_scrub	*sc,
+	unsigned int		*blocksize,
+	u64			*treesize)
+{
+	unsigned int		bs;
+	u64			ts;
+	int			error;
+
+	if (!IS_VERITY(VFS_I(sc->ip)))
+		return 0;
+
+	error = fsverity_merkle_tree_geometry(VFS_I(sc->ip), NULL, &bs, &ts);
+	switch (error) {
+	case 0:
+		/* fsverity is active; return tree geometry. */
+		if (blocksize)
+			*blocksize = bs;
+		if (treesize)
+			*treesize = ts;
+		break;
+	case -ENODATA:
+	case -EMSGSIZE:
+	case -EINVAL:
+	case -EFSCORRUPTED:
+	case -EFBIG:
+		/*
+		 * The nonzero errno codes above are the error codes that can
+		 * be returned from fsverity on metadata validation errors.
+		 * Set the geometry to zero.
+		 */
+		if (blocksize)
+			*blocksize = 0;
+		if (treesize)
+			*treesize = 0;
+		return 0;
+	default:
+		/* runtime errors */
+		return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Is this a verity file that failed to activate?  Callers must have tried to
+ * activate fsverity via xchk_inode_setup_verity.
+ */
+bool
+xchk_inode_verity_broken(
+	struct xfs_inode	*ip)
+{
+	return IS_VERITY(VFS_I(ip)) && !fsverity_active(VFS_I(ip));
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 9ff3cafd8679..f3631c603dd4 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -274,6 +274,9 @@ void xchk_fsgates_enable(struct xfs_scrub *sc, unsigned int scrub_fshooks);
 
 int xchk_inode_is_allocated(struct xfs_scrub *sc, xfs_agino_t agino,
 		bool *inuse);
+int xchk_inode_setup_verity(struct xfs_scrub *sc, unsigned int *blocksize,
+		u64 *treesize);
+bool xchk_inode_verity_broken(struct xfs_inode *ip);
 
 bool xchk_inode_is_dirtree_root(const struct xfs_inode *ip);
 bool xchk_inode_is_sb_rooted(const struct xfs_inode *ip);
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 25ee66e7649d..661b548460e4 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -36,6 +36,10 @@ xchk_prepare_iscrub(
 
 	xchk_ilock(sc, XFS_IOLOCK_EXCL);
 
+	error = xchk_inode_setup_verity(sc, NULL, NULL);
+	if (error)
+		return error;
+
 	error = xchk_trans_alloc(sc, 0);
 	if (error)
 		return error;
@@ -815,6 +819,9 @@ xchk_inode(
 	if (S_ISREG(VFS_I(sc->ip)->i_mode))
 		xchk_inode_check_reflink_iflag(sc, sc->ip->i_ino);
 
+	if (xchk_inode_verity_broken(sc->ip))
+		xchk_ino_set_corrupt(sc, sc->sm->sm_ino);
+
 	xchk_inode_check_unlinked(sc);
 
 	xchk_inode_xref(sc, sc->ip->i_ino, &di);
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index 5a58ddd27bd2..72b97b625517 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -572,6 +572,8 @@ xrep_dinode_flags(
 		dip->di_nrext64_pad = 0;
 	else if (dip->di_version >= 3)
 		dip->di_v3_pad = 0;
+	if (!xfs_has_verity(mp) || !S_ISREG(mode))
+		flags2 &= ~XFS_DIFLAG2_VERITY;
 
 	if (flags2 & XFS_DIFLAG2_METADATA) {
 		xfs_failaddr_t	fa;
@@ -1443,6 +1445,10 @@ xrep_dinode_core(
 	if (iget_error)
 		return iget_error;
 
+	error = xchk_inode_setup_verity(sc, NULL, NULL);
+	if (error)
+		return error;
+
 	error = xchk_trans_alloc(sc, 0);
 	if (error)
 		return error;
@@ -1852,6 +1858,27 @@ xrep_inode_unlinked(
 	return 0;
 }
 
+/*
+ * If this file is a fsverity file, xchk_prepare_iscrub or xrep_dinode_core
+ * should have activated it.  If it's still not active, then there's something
+ * wrong with the verity descriptor and we should turn it off.
+ */
+STATIC int
+xrep_inode_verity(
+	struct xfs_scrub	*sc)
+{
+	struct inode		*inode = VFS_I(sc->ip);
+
+	if (xchk_inode_verity_broken(sc->ip)) {
+		sc->ip->i_diflags2 &= ~XFS_DIFLAG2_VERITY;
+		inode->i_flags &= ~S_VERITY;
+
+		xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE);
+	}
+
+	return 0;
+}
+
 /* Repair an inode's fields. */
 int
 xrep_inode(
@@ -1901,6 +1928,15 @@ xrep_inode(
 			return error;
 	}
 
+	/*
+	 * Disable fsverity if it cannot be activated.  Activation failure
+	 * prohibits the file from being opened, so there cannot be another
+	 * program with an open fd to what it thinks is a verity file.
+	 */
+	error = xrep_inode_verity(sc);
+	if (error)
+		return error;
+
 	/* Reconnect incore unlinked list */
 	error = xrep_inode_unlinked(sc);
 	if (error)
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 23/24] xfs: report verity failures through the health system
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (21 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 22/24] xfs: check and repair the verity inode flag state Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  2024-12-29 13:39   ` [PATCH 24/24] xfs: enable ro-compat fs-verity flag Andrey Albershteyn
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: "Darrick J. Wong" <djwong@kernel.org>

Record verity failures and report them through the health system.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/libxfs/xfs_fs.h     |  1 +
 fs/xfs/libxfs/xfs_health.h |  4 +++-
 fs/xfs/xfs_fsverity.c      | 11 +++++++++++
 fs/xfs/xfs_health.c        |  1 +
 4 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 5cfd4043cb9b..65978a2708ea 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -419,6 +419,7 @@ struct xfs_bulkstat {
 #define XFS_BS_SICK_SYMLINK	(1 << 6)  /* symbolic link remote target */
 #define XFS_BS_SICK_PARENT	(1 << 7)  /* parent pointers */
 #define XFS_BS_SICK_DIRTREE	(1 << 8)  /* directory tree structure */
+#define XFS_BS_SICK_DATA	(1 << 9)  /* file data */
 
 /*
  * Project quota id helpers (previously projid was 16bit only
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index d34986ac18c3..a24006180bda 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -102,6 +102,7 @@ struct xfs_rtgroup;
 /* Don't propagate sick status to ag health summary during inactivation */
 #define XFS_SICK_INO_FORGET	(1 << 12)
 #define XFS_SICK_INO_DIRTREE	(1 << 13)  /* directory tree structure */
+#define XFS_SICK_INO_DATA	(1 << 14)  /* file data */
 
 /* Primary evidence of health problems in a given group. */
 #define XFS_SICK_FS_PRIMARY	(XFS_SICK_FS_COUNTERS | \
@@ -136,7 +137,8 @@ struct xfs_rtgroup;
 				 XFS_SICK_INO_XATTR | \
 				 XFS_SICK_INO_SYMLINK | \
 				 XFS_SICK_INO_PARENT | \
-				 XFS_SICK_INO_DIRTREE)
+				 XFS_SICK_INO_DIRTREE | \
+				 XFS_SICK_INO_DATA)
 
 #define XFS_SICK_INO_ZAPPED	(XFS_SICK_INO_BMBTD_ZAPPED | \
 				 XFS_SICK_INO_BMBTA_ZAPPED | \
diff --git a/fs/xfs/xfs_fsverity.c b/fs/xfs/xfs_fsverity.c
index 0af0f22ff075..967f75a1f97d 100644
--- a/fs/xfs/xfs_fsverity.c
+++ b/fs/xfs/xfs_fsverity.c
@@ -24,6 +24,7 @@
 #include "xfs_iomap.h"
 #include "xfs_bmap.h"
 #include "xfs_format.h"
+#include "xfs_health.h"
 #include <linux/fsverity.h>
 #include <linux/iomap.h>
 
@@ -462,10 +463,20 @@ xfs_fsverity_write_merkle(
 	return iomap_write_region(&region);
 }
 
+static void
+xfs_fsverity_file_corrupt(
+	struct inode		*inode,
+	loff_t			pos,
+	size_t			len)
+{
+	xfs_inode_mark_sick(XFS_I(inode), XFS_SICK_INO_DATA);
+}
+
 const struct fsverity_operations xfs_fsverity_ops = {
 	.begin_enable_verity		= xfs_fsverity_begin_enable,
 	.end_enable_verity		= xfs_fsverity_end_enable,
 	.get_verity_descriptor		= xfs_fsverity_get_descriptor,
 	.read_merkle_tree_page		= xfs_fsverity_read_merkle,
 	.write_merkle_tree_block	= xfs_fsverity_write_merkle,
+	.file_corrupt			= xfs_fsverity_file_corrupt,
 };
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index c7c2e6561998..a61b27cc6be7 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -485,6 +485,7 @@ static const struct ioctl_sick_map ino_map[] = {
 	{ XFS_SICK_INO_DIR_ZAPPED,	XFS_BS_SICK_DIR },
 	{ XFS_SICK_INO_SYMLINK_ZAPPED,	XFS_BS_SICK_SYMLINK },
 	{ XFS_SICK_INO_DIRTREE,	XFS_BS_SICK_DIRTREE },
+	{ XFS_SICK_INO_DATA,	XFS_BS_SICK_DATA },
 };
 
 /* Fill out bulkstat health info. */
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 24/24] xfs: enable ro-compat fs-verity flag
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
                     ` (22 preceding siblings ...)
  2024-12-29 13:39   ` [PATCH 23/24] xfs: report verity failures through the health system Andrey Albershteyn
@ 2024-12-29 13:39   ` Andrey Albershteyn
  23 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2024-12-29 13:39 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, david, hch, Andrey Albershteyn

From: Andrey Albershteyn <aalbersh@redhat.com>

Finalize fs-verity integration in XFS by making the kernel aware of the
fs-verity ro-compat flag.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add spaces]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index df84c275837d..6eb10300ff31 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -374,10 +374,11 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
 #define XFS_SB_FEAT_RO_COMPAT_VERITY   (1 << 4)		/* fs-verity */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
-		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
-		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
-		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
-		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+		(XFS_SB_FEAT_RO_COMPAT_FINOBT	| \
+		 XFS_SB_FEAT_RO_COMPAT_RMAPBT	| \
+		 XFS_SB_FEAT_RO_COMPAT_REFLINK	| \
+		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT	| \
+		 XFS_SB_FEAT_RO_COMPAT_VERITY)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 1/2] iomap: add iomap_writepages_unbound() to write beyond EOF
  2024-12-29 13:36   ` [PATCH 1/2] iomap: add iomap_writepages_unbound() to write " Andrey Albershteyn
@ 2024-12-29 17:54     ` kernel test robot
  2024-12-29 21:36     ` kernel test robot
  1 sibling, 0 replies; 59+ messages in thread
From: kernel test robot @ 2024-12-29 17:54 UTC (permalink / raw)
  To: Andrey Albershteyn, linux-xfs
  Cc: llvm, oe-kbuild-all, djwong, david, hch, Andrey Albershteyn

Hi Andrey,

kernel test robot noticed the following build errors:

[auto build test ERROR on brauner-vfs/vfs.all]
[also build test ERROR on linus/master v6.13-rc4 next-20241220]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Andrey-Albershteyn/iomap-add-iomap_writepages_unbound-to-write-beyond-EOF/20241229-213942
base:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.all
patch link:    https://lore.kernel.org/r/20241229133640.1193578-2-aalbersh%40kernel.org
patch subject: [PATCH 1/2] iomap: add iomap_writepages_unbound() to write beyond EOF
config: s390-randconfig-002-20241229 (https://download.01.org/0day-ci/archive/20241230/202412300135.cvWMPZGf-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241230/202412300135.cvWMPZGf-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412300135.cvWMPZGf-lkp@intel.com/

All errors (new ones prefixed by >>):

>> fs/iomap/buffered-io.c:982:23: error: use of undeclared identifier 'IOMAP_NOSIZE'
                   if (!(iter->flags & IOMAP_NOSIZE) && (pos + written > old_size)) {
                                       ^
   fs/iomap/buffered-io.c:988:23: error: use of undeclared identifier 'IOMAP_NOSIZE'
                   if (!(iter->flags & IOMAP_NOSIZE) && (old_size < pos))
                                       ^
   2 errors generated.


vim +/IOMAP_NOSIZE +982 fs/iomap/buffered-io.c

   909	
   910	static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
   911	{
   912		loff_t length = iomap_length(iter);
   913		loff_t pos = iter->pos;
   914		ssize_t total_written = 0;
   915		long status = 0;
   916		struct address_space *mapping = iter->inode->i_mapping;
   917		size_t chunk = mapping_max_folio_size(mapping);
   918		unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
   919	
   920		do {
   921			struct folio *folio;
   922			loff_t old_size;
   923			size_t offset;		/* Offset into folio */
   924			size_t bytes;		/* Bytes to write to folio */
   925			size_t copied;		/* Bytes copied from user */
   926			size_t written;		/* Bytes have been written */
   927	
   928			bytes = iov_iter_count(i);
   929	retry:
   930			offset = pos & (chunk - 1);
   931			bytes = min(chunk - offset, bytes);
   932			status = balance_dirty_pages_ratelimited_flags(mapping,
   933								       bdp_flags);
   934			if (unlikely(status))
   935				break;
   936	
   937			if (bytes > length)
   938				bytes = length;
   939	
   940			/*
   941			 * Bring in the user page that we'll copy from _first_.
   942			 * Otherwise there's a nasty deadlock on copying from the
   943			 * same page as we're writing to, without it being marked
   944			 * up-to-date.
   945			 *
   946			 * For async buffered writes the assumption is that the user
   947			 * page has already been faulted in. This can be optimized by
   948			 * faulting the user page.
   949			 */
   950			if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
   951				status = -EFAULT;
   952				break;
   953			}
   954	
   955			status = iomap_write_begin(iter, pos, bytes, &folio);
   956			if (unlikely(status)) {
   957				iomap_write_failed(iter->inode, pos, bytes);
   958				break;
   959			}
   960			if (iter->iomap.flags & IOMAP_F_STALE)
   961				break;
   962	
   963			offset = offset_in_folio(folio, pos);
   964			if (bytes > folio_size(folio) - offset)
   965				bytes = folio_size(folio) - offset;
   966	
   967			if (mapping_writably_mapped(mapping))
   968				flush_dcache_folio(folio);
   969	
   970			copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
   971			written = iomap_write_end(iter, pos, bytes, copied, folio) ?
   972				  copied : 0;
   973	
   974			/*
   975			 * Update the in-memory inode size after copying the data into
   976			 * the page cache.  It's up to the file system to write the
   977			 * updated size to disk, preferably after I/O completion so that
   978			 * no stale data is exposed.  Only once that's done can we
   979			 * unlock and release the folio.
   980			 */
   981			old_size = iter->inode->i_size;
 > 982			if (!(iter->flags & IOMAP_NOSIZE) && (pos + written > old_size)) {
   983				i_size_write(iter->inode, pos + written);
   984				iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
   985			}
   986			__iomap_put_folio(iter, pos, written, folio);
   987	
   988			if (!(iter->flags & IOMAP_NOSIZE) && (old_size < pos))
   989				pagecache_isize_extended(iter->inode, old_size, pos);
   990	
   991			cond_resched();
   992			if (unlikely(written == 0)) {
   993				/*
   994				 * A short copy made iomap_write_end() reject the
   995				 * thing entirely.  Might be memory poisoning
   996				 * halfway through, might be a race with munmap,
   997				 * might be severe memory pressure.
   998				 */
   999				iomap_write_failed(iter->inode, pos, bytes);
  1000				iov_iter_revert(i, copied);
  1001	
  1002				if (chunk > PAGE_SIZE)
  1003					chunk /= 2;
  1004				if (copied) {
  1005					bytes = copied;
  1006					goto retry;
  1007				}
  1008			} else {
  1009				pos += written;
  1010				total_written += written;
  1011				length -= written;
  1012			}
  1013		} while (iov_iter_count(i) && length);
  1014	
  1015		if (status == -EAGAIN) {
  1016			iov_iter_revert(i, total_written);
  1017			return -EAGAIN;
  1018		}
  1019		return total_written ? total_written : status;
  1020	}
  1021	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 1/2] iomap: add iomap_writepages_unbound() to write beyond EOF
  2024-12-29 13:36   ` [PATCH 1/2] iomap: add iomap_writepages_unbound() to write " Andrey Albershteyn
  2024-12-29 17:54     ` kernel test robot
@ 2024-12-29 21:36     ` kernel test robot
  1 sibling, 0 replies; 59+ messages in thread
From: kernel test robot @ 2024-12-29 21:36 UTC (permalink / raw)
  To: Andrey Albershteyn, linux-xfs
  Cc: oe-kbuild-all, djwong, david, hch, Andrey Albershteyn

Hi Andrey,

kernel test robot noticed the following build errors:

[auto build test ERROR on brauner-vfs/vfs.all]
[also build test ERROR on linus/master v6.13-rc4 next-20241220]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Andrey-Albershteyn/iomap-add-iomap_writepages_unbound-to-write-beyond-EOF/20241229-213942
base:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.all
patch link:    https://lore.kernel.org/r/20241229133640.1193578-2-aalbersh%40kernel.org
patch subject: [PATCH 1/2] iomap: add iomap_writepages_unbound() to write beyond EOF
config: powerpc-allmodconfig (https://download.01.org/0day-ci/archive/20241230/202412300506.Upx51jzg-lkp@intel.com/config)
compiler: powerpc64-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241230/202412300506.Upx51jzg-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412300506.Upx51jzg-lkp@intel.com/

All errors (new ones prefixed by >>):

   fs/iomap/buffered-io.c: In function 'iomap_write_iter':
>> fs/iomap/buffered-io.c:982:37: error: 'IOMAP_NOSIZE' undeclared (first use in this function); did you mean 'IOMAP_HOLE'?
     982 |                 if (!(iter->flags & IOMAP_NOSIZE) && (pos + written > old_size)) {
         |                                     ^~~~~~~~~~~~
         |                                     IOMAP_HOLE
   fs/iomap/buffered-io.c:982:37: note: each undeclared identifier is reported only once for each function it appears in


vim +982 fs/iomap/buffered-io.c

   909	
   910	static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
   911	{
   912		loff_t length = iomap_length(iter);
   913		loff_t pos = iter->pos;
   914		ssize_t total_written = 0;
   915		long status = 0;
   916		struct address_space *mapping = iter->inode->i_mapping;
   917		size_t chunk = mapping_max_folio_size(mapping);
   918		unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
   919	
   920		do {
   921			struct folio *folio;
   922			loff_t old_size;
   923			size_t offset;		/* Offset into folio */
   924			size_t bytes;		/* Bytes to write to folio */
   925			size_t copied;		/* Bytes copied from user */
   926			size_t written;		/* Bytes have been written */
   927	
   928			bytes = iov_iter_count(i);
   929	retry:
   930			offset = pos & (chunk - 1);
   931			bytes = min(chunk - offset, bytes);
   932			status = balance_dirty_pages_ratelimited_flags(mapping,
   933								       bdp_flags);
   934			if (unlikely(status))
   935				break;
   936	
   937			if (bytes > length)
   938				bytes = length;
   939	
   940			/*
   941			 * Bring in the user page that we'll copy from _first_.
   942			 * Otherwise there's a nasty deadlock on copying from the
   943			 * same page as we're writing to, without it being marked
   944			 * up-to-date.
   945			 *
   946			 * For async buffered writes the assumption is that the user
   947			 * page has already been faulted in. This can be optimized by
   948			 * faulting the user page.
   949			 */
   950			if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
   951				status = -EFAULT;
   952				break;
   953			}
   954	
   955			status = iomap_write_begin(iter, pos, bytes, &folio);
   956			if (unlikely(status)) {
   957				iomap_write_failed(iter->inode, pos, bytes);
   958				break;
   959			}
   960			if (iter->iomap.flags & IOMAP_F_STALE)
   961				break;
   962	
   963			offset = offset_in_folio(folio, pos);
   964			if (bytes > folio_size(folio) - offset)
   965				bytes = folio_size(folio) - offset;
   966	
   967			if (mapping_writably_mapped(mapping))
   968				flush_dcache_folio(folio);
   969	
   970			copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
   971			written = iomap_write_end(iter, pos, bytes, copied, folio) ?
   972				  copied : 0;
   973	
   974			/*
   975			 * Update the in-memory inode size after copying the data into
   976			 * the page cache.  It's up to the file system to write the
   977			 * updated size to disk, preferably after I/O completion so that
   978			 * no stale data is exposed.  Only once that's done can we
   979			 * unlock and release the folio.
   980			 */
   981			old_size = iter->inode->i_size;
 > 982			if (!(iter->flags & IOMAP_NOSIZE) && (pos + written > old_size)) {
   983				i_size_write(iter->inode, pos + written);
   984				iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
   985			}
   986			__iomap_put_folio(iter, pos, written, folio);
   987	
   988			if (!(iter->flags & IOMAP_NOSIZE) && (old_size < pos))
   989				pagecache_isize_extended(iter->inode, old_size, pos);
   990	
   991			cond_resched();
   992			if (unlikely(written == 0)) {
   993				/*
   994				 * A short copy made iomap_write_end() reject the
   995				 * thing entirely.  Might be memory poisoning
   996				 * halfway through, might be a race with munmap,
   997				 * might be severe memory pressure.
   998				 */
   999				iomap_write_failed(iter->inode, pos, bytes);
  1000				iov_iter_revert(i, copied);
  1001	
  1002				if (chunk > PAGE_SIZE)
  1003					chunk /= 2;
  1004				if (copied) {
  1005					bytes = copied;
  1006					goto retry;
  1007				}
  1008			} else {
  1009				pos += written;
  1010				total_written += written;
  1011				length -= written;
  1012			}
  1013		} while (iov_iter_count(i) && length);
  1014	
  1015		if (status == -EAGAIN) {
  1016			iov_iter_revert(i, total_written);
  1017			return -EAGAIN;
  1018		}
  1019		return total_written ? total_written : status;
  1020	}
  1021	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC] Directly mapped xattr data & fs-verity
  2024-12-29 13:33 [RFC] Directly mapped xattr data & fs-verity Andrey Albershteyn
                   ` (3 preceding siblings ...)
  2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
@ 2025-01-06 15:42 ` Christoph Hellwig
  2025-01-06 19:50   ` Darrick J. Wong
  2025-01-06 20:56   ` Andrey Albershteyn
  4 siblings, 2 replies; 59+ messages in thread
From: Christoph Hellwig @ 2025-01-06 15:42 UTC (permalink / raw)
  To: Andrey Albershteyn; +Cc: linux-xfs, djwong, david, hch, Andrey Albershteyn

I've not looked in detail through the entire series, but I still find
all the churn for trying to force fsverity into xattrs very
counterproductive, or in fact wrong.

xattrs are for relatively small variable sized items where each item
has its own name.  fsverity has been designed to be stored beyond
i_size inside the file.  We're creating a lot of overhead for trying
to map fsverity to an underlying storage concept that does not fit it
well.  As fsverity protected files can't be written to there is no
chance of confusing fsverity blocks with post-EOF preallocation.

So please try to implement it just using the normal post-i_size blocks
and everything will become a lot simpler and cleaner even if the concept
of metadata beyond EOF might sound revolting (it still does to me to
some extent)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC] Directly mapped xattr data & fs-verity
  2025-01-06 15:42 ` [RFC] Directly mapped xattr data & fs-verity Christoph Hellwig
@ 2025-01-06 19:50   ` Darrick J. Wong
  2025-01-06 20:56   ` Andrey Albershteyn
  1 sibling, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2025-01-06 19:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrey Albershteyn, linux-xfs, david, Andrey Albershteyn

On Mon, Jan 06, 2025 at 04:42:12PM +0100, Christoph Hellwig wrote:
> I've not looked in detail through the entire series, but I still find
> all the churn for trying to force fsverity into xattrs very
> counterproductive, or in fact wrong.
> 
> xattrs are for relatively small variable sized items where each item
> has its own name.  fsverity has been designed to be stored beyond
> i_size inside the file.  We're creating a lot of overhead for trying
> to map fsverity to an underlying storage concept that does not fit it
> well.  As fsverity protected files can't be written to there is no
> chance of confusing fsverity blocks with post-EOF preallocation.
> 
> So please try to implement it just using the normal post-i_size blocks
> and everything will become a lot simpler and cleaner even if the concept
> of metadata beyond EOF might sound revolting (it still does to me to
> some extent)

I was wondering the same thing -- why not just put the merkle tree
blocks well past EOF and use the regular iomap readahead functions to
get the tree data read in bulk.  Plus you can do readahead optimization
that the fsverity code seems to want anyway.

Just be sure to put it well past EOF, since the current weird thing that
ext4 does (next ~64K after EOF) would seem to allow mmap reads of merkle
tree content for systems with really large base page sizes (e.g.
hexagon).
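
To make that overlap concrete, here is a minimal sketch.  It assumes the
ext4-style placement is round_up(i_size, 64K), which is how I read it;
the helper names and the page sizes you feed in are made up for
illustration and are not taken from any patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Return true if the page holding the last byte of file data also holds
 * the first byte of the merkle tree, given that the tree starts at
 * i_size rounded up to tree_align (assumed placement policy).
 */
static bool tree_shares_eof_page(uint64_t i_size, uint64_t tree_align,
				 uint64_t page_size)
{
	uint64_t tree_start = round_up_pow2(i_size, tree_align);
	uint64_t eof_page_end = round_up_pow2(i_size, page_size);

	return i_size > 0 && tree_start < eof_page_end;
}
```

For a 1000-byte file and 64K tree alignment this returns false with 4K
pages and true with 256K pages, which is the case a much larger offset
would avoid.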

--D

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC] Directly mapped xattr data & fs-verity
  2025-01-06 15:42 ` [RFC] Directly mapped xattr data & fs-verity Christoph Hellwig
  2025-01-06 19:50   ` Darrick J. Wong
@ 2025-01-06 20:56   ` Andrey Albershteyn
  2025-01-07 16:50     ` Christoph Hellwig
  1 sibling, 1 reply; 59+ messages in thread
From: Andrey Albershteyn @ 2025-01-06 20:56 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, djwong, david, Andrey Albershteyn

On 2025-01-06 16:42:12, Christoph Hellwig wrote:
> I've not looked in detail through the entire series, but I still find
> all the churn for trying to force fsverity into xattrs very
> counterproductive, or in fact wrong.

Have you checked
	[PATCH] xfs: direct mapped xattrs design documentation [1]?
It has more detailed argumentation of this approach.

[1]: https://lore.kernel.org/linux-xfs/20250106154212.GA27933@lst.de/T/#m412549e0f3b6671a3bb9f1cb1c0967d504c06ef4

> 
> xattrs are for relatively small variable sized items where each item
> has its own name.

Probably, but I'm not sure that matches what I see: xattrs have
the whole dabtree to address all the attributes, and there's
infrastructure to have quite a lot of pretty huge attributes.

Taking a 1G file we will have about 1908 4k merkle tree blocks (~8MB);
in comparison to file size, I see it as a pretty small set of
metadata.
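
The estimate can be reproduced with a short sketch.  It assumes 4k tree
blocks and 32-byte (SHA-256) digests when called as shown in the comment,
neither of which is fixed by the patchset, and the function name is made
up for this mail.

```c
#include <stdint.h>

/*
 * Count the blocks in a merkle tree built over file_size bytes of data.
 * block_size is used for both data and tree blocks, digest_size is the
 * hash size, e.g. merkle_tree_blocks(size, 4096, 32) for SHA-256.
 */
static uint64_t merkle_tree_blocks(uint64_t file_size, uint32_t block_size,
				   uint32_t digest_size)
{
	uint64_t hashes_per_block = block_size / digest_size;
	/* Number of data blocks hashed by the lowest tree level. */
	uint64_t nodes = (file_size + block_size - 1) / block_size;
	uint64_t total = 0;

	/* Each level packs the hashes of the level below into blocks. */
	while (nodes > 1) {
		nodes = (nodes + hashes_per_block - 1) / hashes_per_block;
		total += nodes;
	}
	return total;
}
```

With those assumptions the tree comes to a bit under 1% of the file size
(roughly 1/127), whatever the file size is.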

> fsverity has been designed to be stored beyond
> i_size inside the file.

I think the only requirement coming from fs-verity in this regard is
that Merkle blocks are stored in Pages. This allows for PG_Checked
optimization. Otherwise, I think it doesn't really care where the
data comes from or where it is.

> We're creating a lot of overhead for trying
> to map fsverity to an underlying storage concept that does not fit it
> well.  As fsverity protected files can't be written to there is no
> chance of confusing fsverity blocks with post-EOF preallocation.

Yes, that's one of the arguments in the design doc, we can possibly
use it for mutable files in future. Not sure how feasible it is with
post-EOF approach.

Regarding code overhead, I don't think it's that much. The iomap
interface could be used by any filesystem to read the tree from post
EOF or anywhere else via ->iomap_begin. The directly mapped attribute
data is a change of leaf format. Then, the fs-verity patchset
itself isn't that huge. But yeah, this is probably more changes than
the post-i_size approach.

> 
> So please try to implement it just using the normal post-i_size blocks
> and everything will become a lot simpler and cleaner even if the concept
> of metadata beyond EOF might sound revolting (it still does to me to
> some extent)
> 

I don't really see the advantage or much difference of storing
fs-verity post-i_size. Dedicating post-i_size space to fs-verity
doesn't seem to be much different from changing xattr format to align
with fs blocks, to me.

-- 
- Andrey

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] xfs: direct mapped xattrs design documentation
  2024-12-29 13:35 ` [PATCH] xfs: direct mapped xattrs design documentation Andrey Albershteyn
@ 2025-01-07  1:41   ` Darrick J. Wong
  2025-01-07 10:24     ` Andrey Albershteyn
  0 siblings, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2025-01-07  1:41 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: linux-xfs, david, hch, Andrey Albershteyn, Dave Chinner,
	Eric Biggers

[add ebiggers to cc]

On Sun, Dec 29, 2024 at 02:35:01PM +0100, Andrey Albershteyn wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Direct mapped xattrs are a form of remote xattr that don't contain
> internal self describing metadata. Hence the xattr data can be
> directly mapped into page cache pages by iomap infrastructure
> without needing to go through the XFS buffer cache.
> 
> This functionality allows XFS to implement fsverity data checksum
> information externally to the file data, but interact with XFS data
> checksum storage through the existing page cache interface.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
> ---
>  .../xfs/xfs-direct-mapped-xattr-design.rst    | 304 ++++++++++++++++++
>  1 file changed, 304 insertions(+)
>  create mode 100644 Documentation/filesystems/xfs/xfs-direct-mapped-xattr-design.rst
> 
> diff --git a/Documentation/filesystems/xfs/xfs-direct-mapped-xattr-design.rst b/Documentation/filesystems/xfs/xfs-direct-mapped-xattr-design.rst
> new file mode 100644
> index 000000000000..a0efa9546eca
> --- /dev/null
> +++ b/Documentation/filesystems/xfs/xfs-direct-mapped-xattr-design.rst
> @@ -0,0 +1,304 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================================
> +XFS Direct Mapped Extended Attributes
> +=====================================
> +
> +Background
> +==========
> +
> +We need to support fsverity in XFS. An attempt to support fsverity
> +using named remote xattrs has already been made, but the complexity of the
> +solution has made acceptance of that implementation .... challenging.
> +
> +The fundamental problem caused by using remote xattr blocks to store the
> +fsverity checksum data is that fsverity stores the data as opaque filesystem
> +block sized chunks and accesses them directly through a page cache based
> +interface.
> +
> +These filesystem block sized chunks do not fit neatly into remote xattr blocks
> +because the remote xattr blocks have metadata headers in them containing self
> +describing metadata and CRCs for the metadata. Hence filesystem block sized data
> +now spans two filesystem blocks and is not contiguous on disk (it is split up
> +by headers).
> +
> +The fsverity interfaces require page aligned data blocks, which then requires
> +copying the fsverity data out of the xattr buffers into a separate bounce buffer
> +which is then cached independently of the xattr buffer. IOWs, we have to double
> +buffer the fsverity checksum data and that costs a lot in terms of complexity.
> +
> +Because the remote xattr data is also using the generic xattr read code, it
> +requires use of xfs metadata buffers to read and write it to disk. Hence we have
> +block based IO requirements for the fsverity data, not page based IO
> +requirements. Hence there is an impedance mismatch between the fsverity IO model
> +and the XFS xattr IO model as well.
> +
> +
> +Directories, Xattrs and dabtrees
> +================================
> +
> +Directories in XFS are complex - they maintain two separate lookup indexes to
> +the dirent data because we have to optimise for two different types of access.
> +We also have dirent stability requirements for seek operations.
> +
> +Hence directories have an offset indexed data segment that holds dirent data,
> +with each individual dirent at an offset that is fixed for the life of the
> +dirent. This provides dirent stability as the offset of the dirent can be used
> +as a seek cookie. This offset indexed data segment is optimised for dirent
> +reading by readdir/getdents which iterates sequentially through the dirent data
> +and requires stable cookies for iteration continuation.
> +
> +Directories must also support lookup by name - path based directory lookups need
> +to find the dirent by name rather than by offset. This is implemented by the
> +dabtree in the directory. It stores name hashes and the leaf records for a
> +specific name hash point to the offset of the dirent with that hash. Hence name
> +based lookups are fast and efficient when directed through the dabtree.
> +Importantly, the dabtree does not store dirent data, it simply provides a
> +pointer to the external dirent: the stable offset of the dirent in the offset
> +indexed data segment.
> +
> +In comparison, the attr fork dabtree only has one index - the name hash based
> +dabtree. Everything stored in the xattr fork needs to be named and the record
> +for the xattr data is indexed by the hash of that name. As there is no external
> +stable offset based index data segment, data that does not fit inline in the
> +xattr leaf record gets stored in a dynamically allocated remote xattr extent.
> +The remote extent is created at the first large enough hole in the xattr address space,
> +so the remote xattr data does not get allocated sequentially on disk.
> +
> +Further, because everything is name hash indexed, sequential offset indexed data
> +is not going to hash down to sequential record indexes in the dabtree. Hence
> +access to offset index based xattr data is not going to be sequential in either
> +record lookup patterns or xattr data read patterns. This isn't a huge issue
> +as the dabtree blocks rapidly get cached, but it does consume more CPU time
> +than doing equivalent sequential offset based access in the directory structure.
> +
> +Darrick Wong pondered whether it would help to create a sequentially
> +indexed segment in the xattr fork for the merkle tree blocks when discussing
> +better ways to handle fsverity metadata. This document is intended to flesh out
> +that concept into something that is usable by mutable data checksum
> +functionality.
> +
> +
> +fsverity is just data checksumming
> +==================================
> +
> +I had a recent insight into fsverity when discussing what to do with the
> +fsverity code with Andrey. That insight came from realising that all fsverity
> +was doing is recording a per-filesystem block checksum and then validating
> +it on read. While this might seem obvious now that I say it, the previous
> +approach we took was all about fsverity needing to read and write opaque blocks
> +of individually accessed metadata.
> +
> +Storing opaque, externally defined metadata is what xattrs are intended to be
> +used for, and that drove the original design. i.e. a fsverity merkle tree block
> +was just another named xattr object that encoded the tree index in the xattr
> +name. Simple and straightforward from the xattr POV, but all the complexity
> +arose in translating the xattr data into a form that fsverity could use.
> +
> +Fundamentally, however, the merkle tree blocks just contain data checksums.
> +Yes, they are complex, cryptographically secure data checksums, but the
> +fundamental observation is that there is a direct relationship between the file
> +data at a given offset and the contents of merkle tree block at a given tree
> +index.
> +
> +fsverity has a mechanism to calculate the checksums from the file data and store
> +them in filesystem blocks, hence providing external checksum storage for the
> +file data. It also has a mechanism to read the external checksum storage and
> +compare that against the calculated checksum of the data. Hence fsverity is just
> +a fancy way of saying the filesystem provides "tamper proof read-only data
> +checksum verification".

Ok, so fsverity's merkle tree is (more or less) block-indexable, so you
want to use xattrs to map merkle block indexes to headerless remote
xattr blocks which you can then read into the pagecache and give to
fsverity.  Is that right?

I'm puzzled by all this, because the design here feels like it's much
more complex than writing the merkle tree as post-eof data like ext4 and
f2fs already do, and it prevents us from adding trivial fscrypt+fsverity
support.  There must be something that makes the tradeoff worthwhile,
but I'm not seeing it, unless...

> +But what if we want data checksums for normal read-write data files to be able
> +to detect bit-rot in data at rest?

...your eventual plan here is to support data block checksums as a
totally separate feature from fsverity?  And you'll reuse the same
"store two checksums with every xattr remote value" code to facilitate
this...  somehow?  I'm not sure how, since I don't think I see any of the
code for that in the patches.

Or are you planning to enhance fsverity itself to support updates?

<confused>

> +
> +
> +Direct Mapped Xattr Data
> +========================
> +
> +fsverity really wants to read and write it's checksum data through the page

"its checksum data", no apostrophe

> +cache. To do this efficiently, we need to store the fsverity metadata in block
> +aligned data storage. We don't have that capability in XFS xattrs right now, and
> +this is what we really want/need for data checksum storage. There are two ways
> +of doing direct mapped xattr data.
> +
> +A New Remote Xattr Record Format
> +--------------------------------
> +
> +The first way we can do direct mapped xattr data is to change the format of the
> +remote xattr. The remote xattr header currently looks like this:
> +
> +.. code-block:: c
> +
> +	typedef struct xfs_attr_leaf_name_remote {
> +		__be32  valueblk;               /* block number of value bytes */
> +		__be32  valuelen;               /* number of bytes in value */
> +		__u8    namelen;                /* length of name bytes */
> +		/*
> +		 * In Linux 6.5 this flex array was converted from name[1] to name[].
> +		 * Be very careful here about extra padding at the end; see
> +		 * xfs_attr_leaf_entsize_remote() for details.
> +		 */
> +		__u8    name[];                 /* name bytes */
> +	} xfs_attr_leaf_name_remote_t;
> +
> +It stores the location and size of the remote xattr data as a filesystem block
> +offset into the attr fork, along with the xattr name. The remote xattr block
> +then contains this self describing header:
> +
> +.. code-block:: c
> +
> +	struct xfs_attr3_rmt_hdr {
> +		__be32  rm_magic;
> +		__be32  rm_offset;
> +		__be32  rm_bytes;
> +		__be32  rm_crc;
> +		uuid_t  rm_uuid;
> +		__be64  rm_owner;
> +		__be64  rm_blkno;
> +		__be64  rm_lsn;
> +	};
> +
> +This is the self describing metadata that we use to validate the xattr data
> +block is what it says it is, and this is the cause of the unaligned remote xattr
> +data.
> +
> +The key field in the self describing metadata is ``rm_crc``. This
> +contains the CRC of the xattr data, and that tells us that the contents of the
> +xattr data block are the same as what we wrote to disk. Everything else in
> +this header is telling us who the block belongs to, its expected size and
> +when and where it was written to. This is far less critical to detecting storage
> +errors than the CRC.
> +
> +Hence if we drop this header and move the ``rm_crc`` field to the ``struct
> +xfs_attr_leaf_name_remote``, we can still check that the xattr data has not
> +been changed since we wrote the data to storage. If we have rmap enabled we
> +have external tracking of the owner for the xattr data block, as well as the
> +offset into the xattr data fork. ``rm_lsn`` is largely meaningless for remote
> +xattrs because the data is written synchronously before the dabtree remote
> +record is committed to the journal.
> +
> +Hence we can drop the headers from the remote xattr data blocks and not really
> +lose much in the way of robustness or recovery capability when rmap is enabled. This
> +means all the xattr data is now filesystem block aligned, and this enables us to
> +directly map the xattr data blocks for read IO.

So you're moving rm_crc to the remote name structure and eliminating the
headers.  That weakens the self describing metadata, but chances are the
crc isn't going to match if there's a torn write.  Probably not a big
loss.
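
For scale, the alignment cost of the current header is easy to work
out. A minimal sketch, assuming a local mirror of the header layout
quoted above (rmt_hdr_mirror and both helper names are made up for
illustration; they are not kernel symbols):

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the quoted struct xfs_attr3_rmt_hdr; illustration only. */
struct rmt_hdr_mirror {
	uint32_t rm_magic;
	uint32_t rm_offset;
	uint32_t rm_bytes;
	uint32_t rm_crc;
	uint8_t  rm_uuid[16];
	uint64_t rm_owner;
	uint64_t rm_blkno;
	uint64_t rm_lsn;
};

/* Blocks needed for a value when every block starts with the header. */
uint32_t rmt_blocks_with_hdr(uint32_t valuelen, uint32_t blksize)
{
	uint32_t payload;

	assert(blksize > sizeof(struct rmt_hdr_mirror));
	payload = blksize - (uint32_t)sizeof(struct rmt_hdr_mirror);
	return (valuelen + payload - 1) / payload;
}

/* Blocks needed for the same value when it is stored headerless. */
uint32_t rmt_blocks_headerless(uint32_t valuelen, uint32_t blksize)
{
	assert(blksize > 0);
	return (valuelen + blksize - 1) / blksize;
}
```

With 4096 byte blocks the 56 byte header leaves 4040 bytes of payload,
so a 4096 byte merkle tree block takes two blocks with headers and one
block without them.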

> +However, we can't easily directly map this xattr data for write operations. The
> +xattr record header contains the CRC, and this needs to be updated
> +transactionally. We can't update this before we do a direct mapped xattr write,
> +because we have to do the write before we can recalculate the CRC. We can't do
> +it after the write, because then we can have the transaction commit before the
> +direct mapped data write IO is submitted and completed. This means recovery
> +would result in a CRC mismatch for that data block. And we can't do it after the
> +data write IO completes, because if we crash before the transaction is committed
> +to the journal we again have a CRC mismatch.
> +
> +This is made more complex because we don't allow xattr headers to be
> +re-written. Currently an update to an xattr requires a "remove and recreate"
> +operation to ensure that the update is atomic w.r.t. remote xattr data changes.
> +
> +One approach which we can take is to use two CRCs - for old and new xattr data.
> +``rm_crc`` becomes ``rm_crc[2]`` and xattr gains new bit flag
> +``XFS_ATTR_RMCRC_SEL_BIT``. This bit defines which of the two fields contains
> +the primary CRC. When we write a new CRC, we write it into the secondary
> +``rm_crc[]`` field (i.e. the one the bit does not point to). When the data IO
> +completes, we toggle the bit so that it points at the new primary value.
> +
> +If the primary does not match but the secondary does, we can flag an online
> +repair operation to run at the completion of whatever operation read the xattr
> +data to correct the state of the ``XFS_ATTR_RMCRC_SEL_BIT``.

I'm a little lost on what this rm_crc[] array covers -- it's intended to
check the value of the remote xattr value, right?  And either of crc[0]
or crc[1] can match?  So I guess the idea here is that you can overwrite
remote xattr value blocks like this:

1. update remote name with crc[0] == current crc and crc[1] == new crc
2. write directly to remote xattr value blocks

and this is more efficient than running through the classic REPLACE
machinery?

Since each merkle tree block is stored as a separate xattr, you can
write to the merkle tree blocks in this fashion, which avoids
inconsistencies in the ondisk xattr structure so long as the xattr value
write itself doesn't tear.

But we only write the merkle tree once.  So why is the double crc
necessary?  If the sealing process fails, we never set the ondisk iflag.

> +If neither CRC matches, then we have an -EFSCORRUPTED situation, and that needs
> +to mark the attr fork as sick (which brings it to the attention of scrub/repair)
> +and return -EFSCORRUPTED to the reader.
> +
> +Offset-based Xattr Indexing Segments
> +------------------------------------
> +
> +The other mechanism for direct mapping xattr data is to introduce an offset
> +indexed segment similar to the directory data segment. The xattr data fork uses
> +32 bit filesystem block addressing, so on a 4kB block size filesystem we can
> +index 16TB of address space. A dabtree that indexes this amount of data would be
> +massive and quite inefficient, and we'd likely be hitting against the maximum
> +extent count limits for the attr fork at this point, anyway (32 bit extent
> +count).
> +
> +Hence taking half the address space (the upper 8TB) isn't going to cause any
> +significant limitations on what we can store in the existing attr fork dabtree.
> +It does, however, provide us with a significant amount of data address space we
> +can use for offset indexed based xattr data. We can even split this upper region
> +into multiple segments so that it can have multiple separate sets of data and
> +even dabtrees to index them.
> +
> +At this point in time, I do not see a need for these xattr segments to be
> +directly accessible from userspace through existing xattr interfaces. If there
> +is need for the data the kernel stores in an xattr data segment to be exposed to
> +userspace, we can add the necessary interfaces when they are required.
> +
> +For the moment, let's first concentrate on what is needed in kernel space for
> +the fsverity merkle tree blocks.
> +
> +
> +Fsverity Data Segment
> +`````````````````````
> +
> +Let's assume we have carved out a section of the inode address space for fsverity
> +metadata. fsverity only supports file sizes up to 4TB (see `thread
> +<https://lore.kernel.org/linux-xfs/Y5rDCcYGgH72Wn%2Fe@sol.localdomain/>`_)
> +and so at a 4kB block size and 128 bytes per fsb the amount of addressing space
> +needed for fsverity is a bit over 128GiB. Hence we could carve out a fixed
> +256GiB address space segment just for fsverity data if we need to.
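
The sizing above can be checked with a few lines of arithmetic. This is
only a sketch under stated assumptions: "4TB" is taken as 2^42 bytes,
128 bytes per block is the document's figure, and every tree block is
assumed to contribute the same 128 bytes to the level above it. The
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of fork address space reachable with 32 bit block addressing. */
uint64_t attr_fork_addr_space(uint32_t blksize)
{
	return ((uint64_t)1 << 32) * blksize;
}

/*
 * Total merkle tree blocks for a file, summing every level until a
 * single root block remains. Each block below a level contributes
 * bytes_per_blk bytes to that level.
 */
uint64_t merkle_tree_blocks(uint64_t file_size, uint32_t blksize,
			    uint32_t bytes_per_blk)
{
	uint64_t entries, blocks, total = 0;

	/* fanout of at least 2 guarantees the loop terminates */
	assert(bytes_per_blk > 0 && bytes_per_blk <= blksize / 2);

	entries = (file_size + blksize - 1) / blksize;
	while (entries > 1) {
		blocks = (entries * bytes_per_blk + blksize - 1) / blksize;
		total += blocks;
		entries = blocks;
	}
	return total;
}
```

For a 2^42 byte file with 4kB blocks, level 0 alone is 2^25 blocks
(128GiB) and the upper levels add roughly 4GiB, which agrees with "a
bit over 128GiB" and fits in a 256GiB segment.
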
> +
> +When fsverity measures the file and creates the merkle tree block, it requires
> +the filesystem to persistently record that the inode is undergoing measurement. It
> +also then tells the filesystem when measurement is complete so that the
> +filesystem can remove the "under measurement" flag and seal the inode as
> +fsverity protected.
> +
> +Hence with these persistent notifications, we don't have to care about
> +persistent creation of the merkle tree data. As long as it has been written back
> +before we seal the inode with a synchronous transaction, the merkle tree data
> +will be stable on disk before the seal is written to the journal thanks to the
> +cache flushes issued before the journal commit starts.
> +
> +This also means that we don't have to care about what is in the fsverity segment
> +when measurement is started - we can just punch out what is already there (e.g.
> +debris from a failed measurement) as the measurement process will rewrite
> +the entire segment contents from scratch.
> +
> +Ext4 does this write process via the page cache into the inode's mapping. It
> +operates at the aops level directly, but that won't work for XFS as we use iomap
> +for buffered IO. Hence we need to call through iomap to map the disk space
> +and allocate the page cache pages for the merkle tree data to be written.
> +
> +This will require us to provide an iomap_ops structure with a ->iomap_begin
> +method that allocates and maps blocks from the attr fork fsverity data segment.
> +We don't care what file offset the iomap code chooses to cache the folios that
> +are inserted into the page cache, all we care about is that we are passed the
> +merkle tree block position that it needs to be stored at.
> +
> +This will require iomap to be aware that it is mapping external metadata rather
> +than normal file data so that it can offset the page cache index it uses for
> +this data appropriately. The writeback also needs to know that it's working with
> +fsverity folios past EOF. This requires changes to how those folios are mapped
> +as they are indexed by the xattr dabtree. The differentiating factor will be the
> +fact that only merkle tree data can be written while the inode is under fsverity
> +initialization; filesystems can also check whether the page is in the "fsverity"
> +region of the page cache.
> +
> +The writeback mapping of these specially marked merkle tree folios should be, at
> +this point, relatively trivial. We will need to call fsverity ->map_blocks
> +callback to map the fsverity address space rather than the file data address
> +space, but other than that the process of allocating space and mapping it is
> +largely identical to the existing data fork allocation code. We can even use
> +delayed allocation to ensure the merkle tree data is as contiguous as possible.

Ok, so all this writeback stuff is to support the construction of the
initial merkle tree at FS_IOC_ENABLE_VERITY time, /not/ to support
read-write data integrity.

> +The read side is less complex as all it needs to do is map blocks directly from
> +the fsverity address space. We can read from the region intended for the
> +fsverity metadata, then ->iomap_begin will map this request to xattr data blocks
> +instead of file blocks.
> +
> +Therefore, we can have something like iomap_read_region() and
> +iomap_write_region() to know that we are writing metadata and no filesize or
> +any other data related checks need to be done. This interface will take normal
> +IO arguments and an offset of the region allowing the filesystem to read relative to
> +this offset.
> -- 
> 2.47.0
> 
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] xfs: direct mapped xattrs design documentation
  2025-01-07  1:41   ` Darrick J. Wong
@ 2025-01-07 10:24     ` Andrey Albershteyn
  0 siblings, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2025-01-07 10:24 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, david, hch, Andrey Albershteyn, Dave Chinner,
	Eric Biggers

On 2025-01-06 17:41:41, Darrick J. Wong wrote:
> [add ebiggers to cc]
> 
> On Sun, Dec 29, 2024 at 02:35:01PM +0100, Andrey Albershteyn wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Direct mapped xattrs are a form of remote xattr that don't contain
> > internal self describing metadata. Hence the xattr data can be
> > directly mapped into page cache pages by iomap infrastructure
> > without needing to go through the XFS buffer cache.
> > 
> > This functionality allows XFS to implement fsverity data checksum
> > information externally to the file data, but interact with XFS data
> > checksum storage through the existing page cache interface.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
> > ---
> >  .../xfs/xfs-direct-mapped-xattr-design.rst    | 304 ++++++++++++++++++
> >  1 file changed, 304 insertions(+)
> >  create mode 100644 Documentation/filesystems/xfs/xfs-direct-mapped-xattr-design.rst
> > 
> > diff --git a/Documentation/filesystems/xfs/xfs-direct-mapped-xattr-design.rst b/Documentation/filesystems/xfs/xfs-direct-mapped-xattr-design.rst
> > new file mode 100644
> > index 000000000000..a0efa9546eca
> > --- /dev/null
> > +++ b/Documentation/filesystems/xfs/xfs-direct-mapped-xattr-design.rst
> > @@ -0,0 +1,304 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=====================================
> > +XFS Direct Mapped Extended Attributes
> > +=====================================
> > +
> > +Background
> > +==========
> > +
> > +We need to support fsverity in XFS. An attempt to support fsverity
> > +using named remote xattrs has already been made, but the complexity of the
> > +solution has made acceptance of that implementation .... challenging.
> > +
> > +The fundamental problem caused by using remote xattr blocks to store the
> > +fsverity checksum data is that fsverity stores the data as opaque filesystem block
> > +sized chunks and accesses them directly through a page cache based
> > +interface.
> > +
> > +These filesystem block sized chunks do not fit neatly into remote xattr blocks
> > +because the remote xattr blocks have metadata headers in them containing self
> > +describing metadata and CRCs for the metadata. Hence filesystem block sized data
> > +now spans two filesystem blocks and is not contiguous on disk (it is split up
> > +by headers).
> > +
> > +The fsverity interfaces require page aligned data blocks, which then requires
> > +copying the fsverity data out of the xattr buffers into a separate bounce buffer
> > +which is then cached independently of the xattr buffer. IOWs, we have to double
> > +buffer the fsverity checksum data and that costs a lot in terms of complexity.
> > +
> > +Because the remote xattr data is also using the generic xattr read code, it
> > +requires use of xfs metadata buffers to read and write it to disk. Hence we have
> > +block based IO requirements for the fsverity data, not page based IO
> > +requirements. Hence there is an impedance mismatch between the fsverity IO model
> > +and the XFS xattr IO model as well.
> > +
> > +
> > +Directories, Xattrs and dabtrees
> > +================================
> > +
> > +Directories in XFS are complex - they maintain two separate lookup indexes to
> > +the dirent data because we have to optimise for two different types of access.
> > +We also have dirent stability requirements for seek operations.
> > +
> > +Hence directories have an offset indexed data segment that holds dirent data,
> > +with each individual dirent at an offset that is fixed for the life of the
> > +dirent. This provides dirent stability as the offset of the dirent can be used
> > +as a seek cookie. This offset indexed data segment is optimised for dirent
> > +reading by readdir/getdents which iterates sequentially through the dirent data
> > +and requires stable cookies for iteration continuation.
> > +
> > +Directories must also support lookup by name - path based directory lookups need
> > +to find the dirent by name rather than by offset. This is implemented by the
> > +dabtree in the directory. It stores name hashes and the leaf records for a
> > +specific name hash point to the offset of the dirent with that hash. Hence name
> > +based lookups are fast and efficient when directed through the dabtree.
> > +Importantly, the dabtree does not store dirent data, it simply provides a
> > +pointer to the external dirent: the stable offset of the dirent in the offset
> > +indexed data segment.
> > +
> > +In comparison, the attr fork dabtree only has one index - the name hash based
> > +dabtree. Everything stored in the xattr fork needs to be named and the record
> > +for the xattr data is indexed by the hash of that name. As there is no external
> > +stable offset based index data segment, data that does not fit inline in the
> > +xattr leaf record gets stored in a dynamically allocated remote xattr extent.
> > +The remote extent is created at the first sufficiently large hole in the xattr address space,
> > +so the remote xattr data does not get allocated sequentially on disk.
> > +
> > +Further, because everything is name hash indexed, sequential offset indexed data
> > +is not going to hash down to sequential record indexes in the dabtree. Hence
> > +access to offset index based xattr data is not going to be sequential in either
> > +record lookup patterns or xattr data read patterns. This isn't a huge issue
> > +as the dabtree blocks rapidly get cached, but it does consume more CPU time
> > +than doing equivalent sequential offset based access in the directory structure.
> > +
> > +Darrick Wong pondered whether it would help to create a sequentially
> > +indexed segment in the xattr fork for the merkle tree blocks when discussing
> > +better ways to handle fsverity metadata. This document is intended to flesh out
> > +that concept into something that is usable by mutable data checksum
> > +functionality.
> > +
> > +
> > +fsverity is just data checksumming
> > +==================================
> > +
> > +I had a recent insight into fsverity when discussing what to do with the
> > +fsverity code with Andrey. That insight came from realising that all fsverity
> > +was doing is recording a per-filesystem-block checksum and then validating
> > +it on read. While this might seem obvious now that I say it, the previous
> > +approach we took was all about fsverity needing to read and write opaque blocks
> > +of individually accessed metadata.
> > +
> > +Storing opaque, externally defined metadata is what xattrs are intended to be
> > +used for, and that drove the original design. i.e. a fsverity merkle tree block
> > +was just another named xattr object that encoded the tree index in the xattr
> > +name. Simple and straightforward from the xattr POV, but all the complexity
> > +arose in translating the xattr data into a form that fsverity could use.
> > +
> > +Fundamentally, however, the merkle tree blocks just contain data checksums.
> > +Yes, they are complex, cryptographically secure data checksums, but the
> > +fundamental observation is that there is a direct relationship between the file
> > +data at a given offset and the contents of merkle tree block at a given tree
> > +index.
> > +
> > +fsverity has a mechanism to calculate the checksums from the file data and store
> > +them in filesystem blocks, hence providing external checksum storage for the
> > +file data. It also has a mechanism to read the external checksum storage and
> > +compare that against the calculated checksum of the data. Hence fsverity is just
> > +a fancy way of saying the filesystem provides "tamper proof read-only data
> > +checksum verification".
> 
> Ok, so fsverity's merkle tree is (more or less) block-indexable, so you
> want to use xattrs to map merkle block indexes to headerless remote
> xattr blocks which you can then read into the pagecache and give to
> fsverity.  Is that right?

Yes. The indexing could probably be block aligned for merkle blocks
smaller than the fsb size, so that we can pack several of them into one
xattr, but otherwise it is the same.
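
Roughly, the packing could look like the sketch below. Nothing here is
from the patchset: struct dxattr_loc and merkle_block_to_xattr() are
hypothetical names, and it assumes the merkle block size divides the
fsb size evenly.

```c
#include <assert.h>
#include <stdint.h>

/* Where a merkle tree block lives when several share one xattr value. */
struct dxattr_loc {
	uint64_t xattr_idx;	/* which xattr holds the merkle block */
	uint32_t offset;	/* byte offset inside that xattr value */
};

struct dxattr_loc merkle_block_to_xattr(uint64_t merkle_idx,
					uint32_t merkle_blksize,
					uint32_t fsb_size)
{
	struct dxattr_loc loc;
	uint32_t per_xattr;

	assert(merkle_blksize > 0 && merkle_blksize <= fsb_size);
	assert(fsb_size % merkle_blksize == 0);

	/* number of merkle blocks packed into one fsb sized xattr value */
	per_xattr = fsb_size / merkle_blksize;
	loc.xattr_idx = merkle_idx / per_xattr;
	loc.offset = (uint32_t)(merkle_idx % per_xattr) * merkle_blksize;
	return loc;
}
```

With 1kB merkle blocks on a 4kB fsb, merkle block 5 lands in xattr 1 at
offset 1024. When the two sizes are equal this degenerates to one
merkle block per xattr at offset 0.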

> 
> I'm puzzled by all this, because the design here feels like it's much
> more complex than writing the merkle tree as post-eof data like ext4 and
> f2fs already do, and it prevents us from adding trivial fscrypt+fsverity
> support.  There must be something that makes the tradeoff worthwhile,
> but I'm not seeing it, unless...
> 
> > +But what if we want data checksums for normal read-write data files to be able
> > +to detect bit-rot in data at rest?
> 
> ...your eventual plan here is to support data block checksums as a
> totally separate feature from fsverity?  And you'll reuse the same
> "store two checksums with every xattr remote value" code to facilitate
> this...  somehow?  I'm not sure how, since I don't think I see any of the
> code for that in the patches.

Yes, that's kind of the idea, but this patchset doesn't try to do
anything with it. The only goal for now is to get fs-verity support
in a form which could later be used for that.

> 
> Or are you planning to enhance fsverity itself to support updates?

No

> 
> <confused>
> 
> > +
> > +
> > +Direct Mapped Xattr Data
> > +========================
> > +
> > +fsverity really wants to read and write it's checksum data through the page
> 
> "its checksum data", no apostrophe
> 
> > +cache. To do this efficiently, we need to store the fsverity metadata in block
> > +aligned data storage. We don't have that capability in XFS xattrs right now, and
> > +this is what we really want/need for data checksum storage. There are two ways
> > +of doing direct mapped xattr data.
> > +
> > +A New Remote Xattr Record Format
> > +--------------------------------
> > +
> > +The first way we can do direct mapped xattr data is to change the format of the
> > +remote xattr. The remote xattr header currently looks like this:
> > +
> > +.. code-block:: c
> > +
> > +	typedef struct xfs_attr_leaf_name_remote {
> > +		__be32  valueblk;               /* block number of value bytes */
> > +		__be32  valuelen;               /* number of bytes in value */
> > +		__u8    namelen;                /* length of name bytes */
> > +		/*
> > +		 * In Linux 6.5 this flex array was converted from name[1] to name[].
> > +		 * Be very careful here about extra padding at the end; see
> > +		 * xfs_attr_leaf_entsize_remote() for details.
> > +		 */
> > +		__u8    name[];                 /* name bytes */
> > +	} xfs_attr_leaf_name_remote_t;
> > +
> > +It stores the location and size of the remote xattr data as a filesystem block
> > +offset into the attr fork, along with the xattr name. The remote xattr block
> > +then contains this self describing header:
> > +
> > +.. code-block:: c
> > +
> > +	struct xfs_attr3_rmt_hdr {
> > +		__be32  rm_magic;
> > +		__be32  rm_offset;
> > +		__be32  rm_bytes;
> > +		__be32  rm_crc;
> > +		uuid_t  rm_uuid;
> > +		__be64  rm_owner;
> > +		__be64  rm_blkno;
> > +		__be64  rm_lsn;
> > +	};
> > +
> > +This is the self describing metadata that we use to validate the xattr data
> > +block is what it says it is, and this is the cause of the unaligned remote xattr
> > +data.
> > +
> > +The key field in the self describing metadata is ``rm_crc``. This
> > +contains the CRC of the xattr data, and that tells us that the contents of the
> > +xattr data block are the same as what we wrote to disk. Everything else in
> > +this header is telling us who the block belongs to, its expected size and
> > +when and where it was written to. This is far less critical to detecting storage
> > +errors than the CRC.
> > +
> > +Hence if we drop this header and move the ``rm_crc`` field to the ``struct
> > +xfs_attr_leaf_name_remote``, we can still check that the xattr data has not
> > +been changed since we wrote the data to storage. If we have rmap enabled we
> > +have external tracking of the owner for the xattr data block, as well as the
> > +offset into the xattr data fork. ``rm_lsn`` is largely meaningless for remote
> > +xattrs because the data is written synchronously before the dabtree remote
> > +record is committed to the journal.
> > +
> > +Hence we can drop the headers from the remote xattr data blocks and not really
> > +lose much in the way of robustness or recovery capability when rmap is enabled. This
> > +means all the xattr data is now filesystem block aligned, and this enables us to
> > +directly map the xattr data blocks for read IO.
> 
> So you're moving rm_crc to the remote name structure and eliminating the
> headers.  That weakens the self describing metadata, but chances are the
> crc isn't going to match if there's a torn write.  Probably not a big
> loss.
> 
> > +However, we can't easily directly map this xattr data for write operations. The
> > +xattr record header contains the CRC, and this needs to be updated
> > +transactionally. We can't update this before we do a direct mapped xattr write,
> > +because we have to do the write before we can recalculate the CRC. We can't do
> > +it after the write, because then we can have the transaction commit before the
> > +direct mapped data write IO is submitted and completed. This means recovery
> > +would result in a CRC mismatch for that data block. And we can't do it after the
> > +data write IO completes, because if we crash before the transaction is committed
> > +to the journal we again have a CRC mismatch.
> > +
> > +This is made more complex because we don't allow xattr headers to be
> > +re-written. Currently an update to an xattr requires a "remove and recreate"
> > +operation to ensure that the update is atomic w.r.t. remote xattr data changes.
> > +
> > +One approach which we can take is to use two CRCs - for old and new xattr data.
> > +``rm_crc`` becomes ``rm_crc[2]`` and xattr gains new bit flag
> > +``XFS_ATTR_RMCRC_SEL_BIT``. This bit defines which of the two fields contains
> > +the primary CRC. When we write a new CRC, we write it into the secondary
> > +``rm_crc[]`` field (i.e. the one the bit does not point to). When the data IO
> > +completes, we toggle the bit so that it points at the new primary value.
> > +
> > +If the primary does not match but the secondary does, we can flag an online
> > +repair operation to run at the completion of whatever operation read the xattr
> > +data to correct the state of the ``XFS_ATTR_RMCRC_SEL_BIT``.
> 
> I'm a little lost on what this rm_crc[] array covers -- it's intended to
> check the value of the remote xattr value, right?  And either of crc[0]
> or crc[1] can match?  So I guess the idea here is that you can overwrite
> remote xattr value blocks like this:
> 
> 1. update remote name with crc[0] == current crc and crc[1] == new crc
> 2. write directly to remote xattr value blocks
> 
> and this is more efficient than running through the classic REPLACE
> machinery?
> 
> Since each merkle tree block is stored as a separate xattr, you can
> write to the merkle tree blocks in this fashion, which avoids
> inconsistencies in the ondisk xattr structure so long as the xattr value
> write itself doesn't tear.
> 
> But we only write the merkle tree once.  So why is the double crc
> necessary?  If the sealing process fails, we never set the ondisk iflag.

The reason for rm_crc[] is that once the CRC is separated from the
data, we can't update the CRC and the data at once. If the CRC is
wrong, we don't know whether the update was consistent or what went
wrong. The CRC has to tell us that what we sent to disk got to disk.

In the current implementation the data is updated together with the
CRC transactionally. But with the data going through iomap, only the
CRC will be updated in the transaction.

So, by trading off a bit of space we have two CRCs - before and after
IO completion. This way we can detect what happened and take
appropriate action.
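
As a rough sketch of that state machine (not code from the patchset:
struct rmcrc_rec, the sel field standing in for XFS_ATTR_RMCRC_SEL_BIT,
and all function names are made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum rmcrc_state {
	RMCRC_OK,		/* primary CRC matches the data */
	RMCRC_NEEDS_REPAIR,	/* only the secondary matches: fix the bit */
	RMCRC_CORRUPT,		/* neither matches: -EFSCORRUPTED */
};

struct rmcrc_rec {
	uint32_t rm_crc[2];
	bool	 sel;		/* index of the primary CRC slot */
};

/* Before the data write: stage the new CRC in the secondary slot. */
void rmcrc_stage(struct rmcrc_rec *rec, uint32_t new_crc)
{
	assert(rec != NULL);
	rec->rm_crc[!rec->sel] = new_crc;
}

/* After data IO completion: make the staged CRC the primary. */
void rmcrc_commit(struct rmcrc_rec *rec)
{
	assert(rec != NULL);
	rec->sel = !rec->sel;
}

/* On read: classify the CRC computed over the data that was read. */
enum rmcrc_state rmcrc_check(const struct rmcrc_rec *rec, uint32_t data_crc)
{
	assert(rec != NULL);
	if (rec->rm_crc[rec->sel] == data_crc)
		return RMCRC_OK;
	if (rec->rm_crc[!rec->sel] == data_crc)
		return RMCRC_NEEDS_REPAIR;
	return RMCRC_CORRUPT;
}
```

A crash after staging but before the data write leaves the primary
matching the old data. A crash after the data write but before the bit
is toggled leaves only the secondary matching, which is the "flag
repair" case. Anything else is corruption.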

This is not an alternative to the REPLACE operation, but a more or
less general interface for writing leaf xattr data through iomap
without a dependency on fs-verity.

> 
> > +If neither CRC matches, then we have an -EFSCORRUPTED situation, and that needs
> > +to mark the attr fork as sick (which brings it to the attention of scrub/repair)
> > +and return -EFSCORRUPTED to the reader.
> > +
> > +Offset-based Xattr Indexing Segments
> > +------------------------------------
> > +
> > +The other mechanism for direct mapping xattr data is to introduce an offset
> > +indexed segment similar to the directory data segment. The xattr data fork uses
> > +32 bit filesystem block addressing, so on a 4kB block size filesystem we can
> > +index 16TB of address space. A dabtree that indexes this amount of data would be
> > +massive and quite inefficient, and we'd likely be hitting against the maximum
> > +extent count limits for the attr fork at this point, anyway (32 bit extent
> > +count).
> > +
> > +Hence taking half the address space (the upper 8TB) isn't going to cause any
> > +significant limitations on what we can store in the existing attr fork dabtree.
> > +It does, however, provide us with a significant amount of data address space we
> > +can use for offset indexed based xattr data. We can even split this upper region
> > +into multiple segments so that it can have multiple separate sets of data and
> > +even dabtrees to index them.
> > +
> > +At this point in time, I do not see a need for these xattr segments to be
> > +directly accessible from userspace through existing xattr interfaces. If there
> > +is need for the data the kernel stores in an xattr data segment to be exposed to
> > +userspace, we can add the necessary interfaces when they are required.
> > +
> > +For the moment, let's first concentrate on what is needed in kernel space for
> > +the fsverity merkle tree blocks.
> > +
> > +
> > +Fsverity Data Segment
> > +`````````````````````
> > +
> > +Let's assume we have carved out a section of the inode address space for fsverity
> > +metadata. fsverity only supports file sizes up to 4TB (see `thread
> > +<https://lore.kernel.org/linux-xfs/Y5rDCcYGgH72Wn%2Fe@sol.localdomain/>`_),
> > +and so at a 4kB block size and 128 bytes per fsb the amount of addressing space
> > +needed for fsverity is a bit over 128GiB. Hence we could carve out a fixed
> > +256GiB address space segment just for fsverity data if we need to.
> > +
> > +When fsverity measures the file and creates the merkle tree blocks, it requires
> > +the filesystem to persistently record that the inode is undergoing measurement. It
> > +also then tells the filesystem when measurement is complete so that the
> > +filesystem can remove the "under measurement" flag and seal the inode as
> > +fsverity protected.
> > +
> > +Hence with these persistent notifications, we don't have to care about
> > +persistent creation of the merkle tree data. As long as it has been written back
> > +before we seal the inode with a synchronous transaction, the merkle tree data
> > +will be stable on disk before the seal is written to the journal thanks to the
> > +cache flushes issued before the journal commit starts.
> > +
> > +This also means that we don't have to care about what is in the fsverity segment
> > +when measurement is started - we can just punch out what is already there (e.g.
> > +debris from a failed measurement) as the measurement process will rewrite
> > +the entire segment contents from scratch.
> > +
> > +Ext4 does this write process via the page cache into the inode's mapping. It
> > +operates at the aops level directly, but that won't work for XFS as we use iomap
> > +for buffered IO. Hence we need to call through iomap to map the disk space
> > +and allocate the page cache pages for the merkle tree data to be written.
> > +
> > +This will require us to provide an iomap_ops structure with a ->iomap_begin
> > +method that allocates and maps blocks from the attr fork fsverity data segment.
> > +We don't care what file offset the iomap code chooses to cache the folios that
> > +are inserted into the page cache, all we care about is that we are passed the
> > +merkle tree block position that it needs to be stored at.
> > +
> > +This will require iomap to be aware that it is mapping external metadata rather
> > +than normal file data so that it can offset the page cache index it uses for
> > +this data appropriately. The writeback also needs to know that it's working with
> > +fsverity folios past EOF. This requires changes to how those folios are mapped
> > +as they are indexed by xattr dabtree. The differentiating factor will be that
> > +only merkle tree data can be written while the inode is under fsverity
> > +initialization; filesystems can also check whether a page is in the "fsverity"
> > +region of the page cache.
> > +
> > +The writeback mapping of these specially marked merkle tree folios should be, at
> > +this point, relatively trivial. We will need to call fsverity ->map_blocks
> > +callback to map the fsverity address space rather than the file data address
> > +space, but other than that the process of allocating space and mapping it is
> > +largely identical to the existing data fork allocation code. We can even use
> > +delayed allocation to ensure the merkle tree data is as contiguous as possible.
> 
> Ok, so all this writeback stuff is to support the construction of the
> initial merkle tree at FS_IOC_ENABLE_VERITY time, /not/ to support
> read-write data integrity.

Yes, with fs-verity writeback will happen during tree construction.

> 
> > +The read side is less complex as all it needs to do is map blocks directly from
> > +the fsverity address space. We can read from the region intended for the
> > +fsverity metadata, then ->iomap_begin will map this request to xattr data blocks
> > +instead of file blocks.
> > +
> > +Therefore, we can have something like iomap_read_region() and
> > +iomap_write_region() that know we are writing metadata, so that no file size or
> > +any other data related checks need to be done. This interface will take normal
> > +IO arguments and an offset of the region, allowing the filesystem to read
> > +relative to this offset.
> > -- 
> > 2.47.0
> > 
> > 
> 

-- 
- Andrey



* Re: [RFC] Directly mapped xattr data & fs-verity
  2025-01-06 20:56   ` Andrey Albershteyn
@ 2025-01-07 16:50     ` Christoph Hellwig
  2025-01-08  9:20       ` Andrey Albershteyn
  0 siblings, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2025-01-07 16:50 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: Christoph Hellwig, linux-xfs, djwong, david, Andrey Albershteyn

On Mon, Jan 06, 2025 at 09:56:51PM +0100, Andrey Albershteyn wrote:
> On 2025-01-06 16:42:12, Christoph Hellwig wrote:
> > I've not looked in details through the entire series, but I still find
> > all the churn for trying to force fsverity into xattrs very counter
> > productive, or in fact wrong.
> 
> Have you checked
> 	[PATCH] xfs: direct mapped xattrs design documentation [1]?
> It has more detailed argumentation of this approach.

It assumes verity must be stored in the attr fork and then justifies
complexity by that.

> > xattrs are for relatively small variable sized items where each item
> > has its own name.
> 
> Probably, but now I'm not sure that this is what I see, xattrs have
> the whole dabtree to address all the attributes and there's
> infrastructure to have quite a lot of pretty huge attributes.

fsverity has a linear mapping.  The only thing you need to map it
is the bmap btree.  Using the dabtree helps nothing with the task
at hand, quite to the contrary it makes the task really complex.
As seen both by the design document and the code.

> Taking 1T file we will have about 1908 4k merkle tree blocks ~8Mb,
> in comparison to file size, I see it as a pretty small set of
> metadata.

And you could easily map them using a single extent in the bmap
btree with no overhead at all.  Or a few more if there isn't enough
contiguous freespace.
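
As a rough cross-check of the sizes being discussed, the total number of
merkle tree blocks can be estimated with a few lines of C. This is only a
sketch under stated assumptions: SHA-256 (32-byte digests), 4k tree blocks,
and levels built until a single root block remains. The helper name
merkle_tree_blocks() is made up for illustration and is not part of the
series.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Total number of Merkle tree blocks for a file, summed over all levels.
 * Assumption: each tree block holds block_size / digest_size hashes and
 * levels are built until a single root block remains.
 */
static uint64_t merkle_tree_blocks(uint64_t file_size, uint32_t block_size,
				   uint32_t digest_size)
{
	uint64_t hashes_per_block, nblocks, total = 0;

	/* At least two hashes per block, otherwise the tree never shrinks. */
	assert(digest_size > 0 && block_size >= 2 * digest_size);
	hashes_per_block = block_size / digest_size;

	/* Number of data blocks, rounded up. */
	nblocks = file_size / block_size + (file_size % block_size != 0);

	/* Each pass adds the level that hashes the blocks below it. */
	while (nblocks > 1) {
		nblocks = (nblocks + hashes_per_block - 1) / hashes_per_block;
		total += nblocks;
	}
	return total;
}
```

Under these assumptions a 1GiB file needs 2065 tree blocks (about 8MiB) and a
1TiB file needs 2113665 (about 8GiB), so the ~8Mb figure quoted above looks
closer to the 1GiB case. Either way the tree stays under 1% of the file size
and is laid out linearly, which is what makes a plain bmap btree extent
sufficient to map it.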

> 
> > fsverity has been designed to be stored beyond
> > i_size inside the file.
> 
> I think the only requirement coming from fs-verity in this regard is
> that Merkle blocks are stored in Pages. This allows for PG_Checked
> optimization. Otherwise, I think it doesn't really care where the
> data comes from or where it is.

I'm not saying it's a requirement.  I'm saying it's been designed with
that in mind.  In other words it is a very natural fit.  Mapping it
to some kind of xattrs is not.

> Yes, that's one of the arguments in the design doc, we can possibly
> use it for mutable files in future. Not sure how feasible it is with
> post-EOF approach.

"Maybe we can use it for $HANDWAVE" is not a good idea.  Hash based
verification works poorly for mutable files, so we'd rather have
a really good argument for that.

> I don't really see the advantage or much difference of storing
> fs-verity post-i_size. Dedicating post-i_size space to fs-verity
> doesn't seem to be much different from changing xattr format to align
> with fs blocks, to me.

It is much simpler, and more storage efficient by doing away with the
need for the dabtree entries and your new remote-remote header.


* Re: [RFC] Directly mapped xattr data & fs-verity
  2025-01-07 16:50     ` Christoph Hellwig
@ 2025-01-08  9:20       ` Andrey Albershteyn
  2025-01-09  6:12         ` Christoph Hellwig
  2025-01-09  7:39         ` Darrick J. Wong
  0 siblings, 2 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2025-01-08  9:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, djwong, david, Andrey Albershteyn

On 2025-01-07 17:50:57, Christoph Hellwig wrote:
> On Mon, Jan 06, 2025 at 09:56:51PM +0100, Andrey Albershteyn wrote:
> > On 2025-01-06 16:42:12, Christoph Hellwig wrote:
> > > I've not looked in details through the entire series, but I still find
> > > all the churn for trying to force fsverity into xattrs very counter
> > > productive, or in fact wrong.
> > 
> > Have you checked
> > 	[PATCH] xfs: direct mapped xattrs design documentation [1]?
> > It has more detailed argumentation of this approach.
> 
> It assumes verity must be stored in the attr fork and then justifies
> complexity by that.
> 
> > > xattrs are for relatively small variable sized items where each item
> > > has its own name.
> > 
> > Probably, but now I'm not sure that this is what I see, xattrs have
> > the whole dabtree to address all the attributes and there's
> > infrastructure to have quite a lot of pretty huge attributes.
> 
> fsverity has a linear mapping.  The only thing you need to map it
> is the bmap btree.  Using the dabtree helps nothing with the task
> at hand, quite to the contrary it makes the task really complex.
> As seen both by the design document and the code.
> 
> > Taking 1T file we will have about 1908 4k merkle tree blocks ~8Mb,
> > in comparison to file size, I see it as a pretty small set of
> > metadata.
> 
> And you could easily map them using a single extent in the bmap
> btree with no overhead at all.  Or a few more if there isn't enough
> contiguous freespace.
> 
> > 
> > > fsverity has been designed to be stored beyond
> > > i_size inside the file.
> > 
> > I think the only requirement coming from fs-verity in this regard is
> > that Merkle blocks are stored in Pages. This allows for PG_Checked
> > optimization. Otherwise, I think it doesn't really care where the
> > data comes from or where it is.
> 
> I'm not saying it's a requirement.  I'm saying it's been designed with
> that in mind.  In other words it is a very natural fit.  Mapping it
> to some kind of xattrs is not.
> 
> > Yes, that's one of the arguments in the design doc, we can possibly
> > use it for mutable files in future. Not sure how feasible it is with
> > post-EOF approach.
> 
> "Maybe we can use it for $HANDWAVE" is not a good idea.

> Hash based verification works poorly for mutable files, so we'd
> rather have a really good argument for that.

hmm, why? Not sure I have an understanding of this

> 
> > I don't really see the advantage or much difference of storing
> > fs-verity post-i_size. Dedicating post-i_size space to fs-verity
> > doesn't seem to be much different from changing xattr format to align
> > with fs blocks, to me.
> 
> It is much simpler, and more storage efficient by doing away with the
> need for the dabtree entries and your new remote-remote header.
> 

I see.

-- 
- Andrey



* Re: [RFC] Directly mapped xattr data & fs-verity
  2025-01-08  9:20       ` Andrey Albershteyn
@ 2025-01-09  6:12         ` Christoph Hellwig
  2025-01-09  7:39         ` Darrick J. Wong
  1 sibling, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2025-01-09  6:12 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: Christoph Hellwig, linux-xfs, djwong, david, Andrey Albershteyn

On Wed, Jan 08, 2025 at 10:20:59AM +0100, Andrey Albershteyn wrote:
> > "Maybe we can use it for $HANDWAVE" is not a good idea.
> 
> > Hash based verification works poorly for mutable files, so we'd
> > rather have a really good argument for that.
> 
> hmm, why? Not sure I have an understanding of this

You need a consistent point in time for verification against your hash to
have any meaning.  How do you define that point for a mutable file?


* Re: [RFC] Directly mapped xattr data & fs-verity
  2025-01-08  9:20       ` Andrey Albershteyn
  2025-01-09  6:12         ` Christoph Hellwig
@ 2025-01-09  7:39         ` Darrick J. Wong
  2025-01-09  7:44           ` Christoph Hellwig
  2025-01-13  9:16           ` Andrey Albershteyn
  1 sibling, 2 replies; 59+ messages in thread
From: Darrick J. Wong @ 2025-01-09  7:39 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: Christoph Hellwig, linux-xfs, david, Andrey Albershteyn

On Wed, Jan 08, 2025 at 10:20:59AM +0100, Andrey Albershteyn wrote:
> On 2025-01-07 17:50:57, Christoph Hellwig wrote:
> > On Mon, Jan 06, 2025 at 09:56:51PM +0100, Andrey Albershteyn wrote:
> > > On 2025-01-06 16:42:12, Christoph Hellwig wrote:
> > > > I've not looked in details through the entire series, but I still find
> > > > all the churn for trying to force fsverity into xattrs very counter
> > > > productive, or in fact wrong.
> > > 
> > > Have you checked
> > > 	[PATCH] xfs: direct mapped xattrs design documentation [1]?
> > > It has more detailed argumentation of this approach.
> > 
> > It assumes verity must be stored in the attr fork and then justifies
> > complexity by that.
> > 
> > > > xattrs are for relatively small variable sized items where each item
> > > > has its own name.
> > > 
> > > Probably, but now I'm not sure that this is what I see, xattrs have
> > > the whole dabtree to address all the attributes and there's
> > > infrastructure to have quite a lot of pretty huge attributes.
> > 
> > fsverity has a linear mapping.  The only thing you need to map it
> > is the bmap btree.  Using the dabtree helps nothing with the task
> > at hand, quite to the contrary it makes the task really complex.
> > As seen both by the design document and the code.
> > 
> > > Taking 1T file we will have about 1908 4k merkle tree blocks ~8Mb,
> > > in comparison to file size, I see it as a pretty small set of
> > > metadata.
> > 
> > And you could easily map them using a single extent in the bmap
> > btree with no overhead at all.  Or a few more if there isn't enough
> > contiguous freespace.
> > 
> > > 
> > > > fsverity has been designed to be stored beyond
> > > > i_size inside the file.
> > > 
> > > I think the only requirement coming from fs-verity in this regard is
> > > that Merkle blocks are stored in Pages. This allows for PG_Checked
> > > optimization. Otherwise, I think it doesn't really care where the
> > > data comes from or where it is.
> > 
> > I'm not saying it's a requirement.  I'm saying it's been designed with
> > that in mind.  In other words it is a very natural fit.  Mapping it
> > to some kind of xattrs is not.
> > 
> > > Yes, that's one of the arguments in the design doc, we can possibly
> > > use it for mutable files in future. Not sure how feasible it is with
> > > post-EOF approach.
> > 
> > > "Maybe we can use it for $HANDWAVE" is not a good idea.
> 
> > Hash based verification works poorly for mutable files, so we'd
> > rather have a really good argument for that.
> 
> hmm, why? Not sure I have an understanding of this

Me neither.  I can see how you might design file data block checksumming
to be basically an array of u32 crc[nblocks][2].  Then if you turned on
stable folios for writeback, the folio contents can't change so you can
compute the checksum of the new data, run a transaction to set
crc[nblock][0] to the old checksum; crc[nblock][1] to the new checksum;
and only then issue the writeback bio.

But I don't think that works if you crash.  At least one of the
checksums might be right if the device doesn't tear the write, but that
gets us tangled up in the untorn block writes patches.  If the device
does not guarantee untorn writes, then you probably have to do it the
way the other checksumming fses do it -- write to a new location, then
run a transaction to store the checksum and update the file mapping.
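
To make the double-crc idea concrete, here is a minimal userspace sketch of
one possible reading of it. Everything here is hypothetical: struct blk_crc,
blk_crc_stage() and blk_crc_verify() are names invented for illustration, the
"transaction" is just an in-memory update, and crc32c() is a simple bitwise
implementation rather than the kernel's. It assumes the previous write of the
block completed, so the old "new" checksum describes what is on disk now.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Hypothetical per-block slot: crc[0] = old contents, crc[1] = new contents. */
struct blk_crc {
	uint32_t crc[2];
};

/*
 * Would run inside a transaction before the writeback bio is issued:
 * the checksum of the current on-disk contents becomes "old" and the
 * checksum of the stable folio contents becomes "new".
 */
static void blk_crc_stage(struct blk_crc *slot, const void *newdata, size_t len)
{
	slot->crc[0] = slot->crc[1];
	slot->crc[1] = crc32c(newdata, len);
}

/*
 * After a crash the block may hold either the old or the new version.
 * A torn mix of the two should match neither checksum.
 */
static bool blk_crc_verify(const struct blk_crc *slot, const void *data,
			   size_t len)
{
	uint32_t crc = crc32c(data, len);

	return crc == slot->crc[0] || crc == slot->crc[1];
}
```

The sketch only shows why an untorn write always verifies against one of the
two slots; it does nothing to recover a block that matches neither.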

In any case, that's still just a linear array stored in some blocks
beyond EOF, and (presumably) growing in the top of the file.  Maybe you
can even have a merkle(ish) tree to checksum the checksum leaves.  But I
don't see why the xattr stuff is needed at all in that case, but what
I'm really looking for here is this -- do you folks have some future
design involving these double-checksummed headerless remote xattr
blocks?  Or a more clever data block checksumming design than the stupid
one I just came up with?

<shrug>

> > > I don't really see the advantage or much difference of storing
> > > fs-verity post-i_size. Dedicating post-i_size space to fs-verity
> > > doesn't seem to be much different from changing xattr format to align
> > > with fs blocks, to me.
> > 
> > It is much simpler, and more storage efficient by doing away with the
> > need for the dabtree entries and your new remote-remote header.

I agree... at least in the absence of any other knowledge.

--D

> 
> I see.
> 
> -- 
> - Andrey
> 


* Re: [RFC] Directly mapped xattr data & fs-verity
  2025-01-09  7:39         ` Darrick J. Wong
@ 2025-01-09  7:44           ` Christoph Hellwig
  2025-01-09 17:03             ` Darrick J. Wong
  2025-01-13  9:16           ` Andrey Albershteyn
  1 sibling, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2025-01-09  7:44 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrey Albershteyn, Christoph Hellwig, linux-xfs, david,
	Andrey Albershteyn

On Wed, Jan 08, 2025 at 11:39:08PM -0800, Darrick J. Wong wrote:
> > > 
> > > "Maybe we can use it for $HANDWAVE" is not a good idea.
> > 
> > > Hash based verification works poorly for mutable files, so we'd
> > > rather have a really good argument for that.
> > 
> > hmm, why? Not sure I have an understanding of this
> 
> Me neither.  I can see how you might design file data block checksumming
> to be basically an array of u32 crc[nblocks][2].  Then if you turned on
> stable folios for writeback, the folio contents can't change so you can
> compute the checksum of the new data, run a transaction to set
> crc[nblock][0] to the old checksum; crc[nblock][1] to the new checksum;
> and only then issue the writeback bio.

Are you (plural) talking about hash based integrity protection a la
fsverity, or checksums?  While they look similar in some ways, those are
totally different things!  If we're talking about "simple" data
checksums, both post-EOF data blocks and xattrs are really badly wrong,
as the checksums need to be associated with the physical block due to
reflinks, not the file.  The natural way to implement them for XFS
if we really wanted them would be a new per-AG/RTG metabtree that
is indexed by the agblock/rgblock.
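
A small sketch of why keying by physical block fits reflink: if two files
share a block, one record covers both. All names below (struct csum_rec,
csum_lookup()) are hypothetical, and a sorted array with a binary search
stands in for whatever btree a real implementation would use.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one checksum per physical block within a group. */
struct csum_rec {
	uint32_t agbno;	/* physical block number within the AG/RTG */
	uint32_t crc;	/* checksum of that block's contents */
};

/*
 * Find the record for a physical block. The records are assumed to be
 * sorted by agbno; returns NULL if the block has no checksum record.
 */
static const struct csum_rec *
csum_lookup(const struct csum_rec *recs, size_t nrecs, uint32_t agbno)
{
	size_t lo = 0, hi = nrecs;

	/* Lower-bound binary search for the first record >= agbno. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (recs[mid].agbno < agbno)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (lo < nrecs && recs[lo].agbno == agbno) ? &recs[lo] : NULL;
}
```

Any file whose mapping resolves to a given agbno ends up at the same record,
so the checksum is stored once no matter how many owners the block has.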

> But I don't think that works if you crash.  At least one of the
> checksums might be right if the device doesn't tear the write, but that
> gets us tangled up in the untorn block writes patches.  If the device
> does not guarantee untorn writes, then you probably have to do it the
> way the other checksumming fses do it -- write to a new location, then
> run a transaction to store the checksum and update the file mapping.

Yes.  That's why for data checksums you'd always need to either write
out of place (as with the pending zoned allocator) or work with intent /
intent done items.  That's assuming you can't offload the atomicity to the
device by using T10 PI or at least per-block metadata that stores the
checksum.  Which would also remove the need for any new file system
data structure, but require enterprise hardware that supports PI or
metadata.


* Re: [RFC] Directly mapped xattr data & fs-verity
  2025-01-09  7:44           ` Christoph Hellwig
@ 2025-01-09 17:03             ` Darrick J. Wong
  0 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2025-01-09 17:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrey Albershteyn, linux-xfs, david, Andrey Albershteyn

On Thu, Jan 09, 2025 at 08:44:03AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 08, 2025 at 11:39:08PM -0800, Darrick J. Wong wrote:
> > > > 
> > > > "Maybe we can use it for $HANDWAVE" is not a good idea.
> > > 
> > > > Hash based verification works poorly for mutable files, so we'd
> > > > rather have a really good argument for that.
> > > 
> > > hmm, why? Not sure I have an understanding of this
> > 
> > Me neither.  I can see how you might design file data block checksumming
> > to be basically an array of u32 crc[nblocks][2].  Then if you turned on
> > stable folios for writeback, the folio contents can't change so you can
> > compute the checksum of the new data, run a transaction to set
> > crc[nblock][0] to the old checksum; crc[nblock][1] to the new checksum;
> > and only then issue the writeback bio.
> 
> Are you (plural) talking about hash based integrity protection a la
> fsverity, or checksums?  While they look similar in some ways, those are
> totally different things!  If we're talking about "simple" data
> checksums, both post-EOF data blocks and xattrs are really badly wrong,
> as the checksums need to be associated with the physical block due to
> reflinks, not the file.  The natural way to implement them for XFS
> if we really wanted them would be a new per-AG/RTG metabtree that
> is indexed by the agblock/rgblock.

Agreed.  For simple things like crc32 I would very much rather we stuff
them in a per-group btree because we only have to store the crc once in
the filesystem and now it protects all owners of that block.  In theory
the double-crc scheme would work fine for untorn data block writes, I
think.

I only see a reason for per-file hash structures in the dabtree if the
hashes themselves have some sort of per-file configuration (like
distributor-signed merkle trees or whatever).  I asked Eric Biggers if
he had any plans for mutable fsverity files and he said no.

> > But I don't think that works if you crash.  At least one of the
> > checksums might be right if the device doesn't tear the write, but that
> > gets us tangled up in the untorn block writes patches.  If the device
> > does not guarantee untorn writes, then you probably have to do it the
> > way the other checksumming fses do it -- write to a new location, then
> > run a transaction to store the checksum and update the file mapping.
> 
> Yes.  That's why for data checksums you'd always need to either write
> out of place (as with the pending zoned allocator) or work with intent /
> intent done items.  That's assuming you can't offload the atomicity to the
> device by using T10 PI or at least per-block metadata that stores the
> checksum.  Which would also remove the need for any new file system
> data structure, but require enterprise hardware that supports PI or
> metadata.

<nod>

--D


* Re: [RFC] Directly mapped xattr data & fs-verity
  2025-01-09  7:39         ` Darrick J. Wong
  2025-01-09  7:44           ` Christoph Hellwig
@ 2025-01-13  9:16           ` Andrey Albershteyn
  1 sibling, 0 replies; 59+ messages in thread
From: Andrey Albershteyn @ 2025-01-13  9:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs, david, Andrey Albershteyn

On 2025-01-08 23:39:08, Darrick J. Wong wrote:
> On Wed, Jan 08, 2025 at 10:20:59AM +0100, Andrey Albershteyn wrote:
> > On 2025-01-07 17:50:57, Christoph Hellwig wrote:
> > > On Mon, Jan 06, 2025 at 09:56:51PM +0100, Andrey Albershteyn wrote:
> > > > On 2025-01-06 16:42:12, Christoph Hellwig wrote:
> > > > > I've not looked in details through the entire series, but I still find
> > > > > all the churn for trying to force fsverity into xattrs very counter
> > > > > productive, or in fact wrong.
> > > > 
> > > > Have you checked
> > > > 	[PATCH] xfs: direct mapped xattrs design documentation [1]?
> > > > It has more detailed argumentation of this approach.
> > > 
> > > It assumes verity must be stored in the attr fork and then justifies
> > > complexity by that.
> > > 
> > > > > xattrs are for relatively small variable sized items where each item
> > > > > has its own name.
> > > > 
> > > > Probably, but now I'm not sure that this is what I see, xattrs have
> > > > the whole dabtree to address all the attributes and there's
> > > > infrastructure to have quite a lot of pretty huge attributes.
> > > 
> > > fsverity has a linear mapping.  The only thing you need to map it
> > > is the bmap btree.  Using the dabtree helps nothing with the task
> > > at hand, quite to the contrary it makes the task really complex.
> > > As seen both by the design document and the code.
> > > 
> > > > Taking 1T file we will have about 1908 4k merkle tree blocks ~8Mb,
> > > > in comparison to file size, I see it as a pretty small set of
> > > > metadata.
> > > 
> > > And you could easily map them using a single extent in the bmap
> > > btree with no overhead at all.  Or a few more if there isn't enough
> > > contiguous freespace.
> > > 
> > > > 
> > > > > fsverity has been designed to be stored beyond
> > > > > i_size inside the file.
> > > > 
> > > > I think the only requirement coming from fs-verity in this regard is
> > > > that Merkle blocks are stored in Pages. This allows for PG_Checked
> > > > optimization. Otherwise, I think it doesn't really care where the
> > > > data comes from or where it is.
> > > 
> > > I'm not saying it's a requirement.  I'm saying it's been designed with
> > > that in mind.  In other words it is a very natural fit.  Mapping it
> > > to some kind of xattrs is not.
> > > 
> > > > Yes, that's one of the arguments in the design doc, we can possibly
> > > > use it for mutable files in future. Not sure how feasible it is with
> > > > post-EOF approach.
> > > 
> > > > "Maybe we can use it for $HANDWAVE" is not a good idea.
> > 
> > > Hash based verification works poorly for mutable files, so we'd
> > > rather have a really good argument for that.
> > 
> > hmm, why? Not sure I have an understanding of this
> 
> Me neither.  I can see how you might design file data block checksumming
> to be basically an array of u32 crc[nblocks][2].  Then if you turned on
> stable folios for writeback, the folio contents can't change so you can
> compute the checksum of the new data, run a transaction to set
> crc[nblock][0] to the old checksum; crc[nblock][1] to the new checksum;
> and only then issue the writeback bio.
> 
> But I don't think that works if you crash.  At least one of the
> checksums might be right if the device doesn't tear the write, but that
> gets us tangled up in the untorn block writes patches.  If the device
> does not guarantee untorn writes, then you probably have to do it the
> way the other checksumming fses do it -- write to a new location, then
> run a transaction to store the checksum and update the file mapping.
> 
> In any case, that's still just a linear array stored in some blocks
> beyond EOF, and (presumably) growing in the top of the file.  Maybe you
> can even have a merkle(ish) tree to checksum the checksum leaves.  But I
> don't see why the xattr stuff is needed at all in that case, but what
> I'm really looking for here is this -- do you folks have some future
> design involving these double-checksummed headerless remote xattr
> blocks?  Or a more clever data block checksumming design than the stupid
> one I just came up with?
> 
> <shrug>
> 
> > > > I don't really see the advantage or much difference of storing
> > > > fs-verity post-i_size. Dedicating post-i_size space to fs-verity
> > > > doesn't seem to be much different from changing xattr format to align
> > > > with fs blocks, to me.
> > > 
> > > It is much simpler, and more storage efficient by doing away with the
> > > need for the dabtree entries and your new remote-remote header.
> 
> I agree... at least in the absence of any other knowledge.

I will look into the post-i_size approach, then.

-- 
- Andrey



Thread overview: 59+ messages
2024-12-29 13:33 [RFC] Directly mapped xattr data & fs-verity Andrey Albershteyn
2024-12-29 13:35 ` [PATCH] xfs: direct mapped xattrs design documentation Andrey Albershteyn
2025-01-07  1:41   ` Darrick J. Wong
2025-01-07 10:24     ` Andrey Albershteyn
2024-12-29 13:36 ` [PATCH 0/2] Introduce iomap interface to work with regions beyond EOF Andrey Albershteyn
2024-12-29 13:36   ` [PATCH 1/2] iomap: add iomap_writepages_unbound() to write " Andrey Albershteyn
2024-12-29 17:54     ` kernel test robot
2024-12-29 21:36     ` kernel test robot
2024-12-29 13:36   ` [PATCH 2/2] iomap: introduce iomap_read/write_region interface Andrey Albershteyn
2024-12-29 13:38 ` [PATCH 00/14] Direct mapped extended attribute data Andrey Albershteyn
2024-12-29 13:38   ` [PATCH 01/14] iomap: add wrapper to pass readpage_ctx to read path Andrey Albershteyn
2024-12-29 13:38   ` [PATCH 02/14] iomap: add read path ioends for filesystem read verification Andrey Albershteyn
2024-12-29 13:38   ` [PATCH 03/14] iomap: introduce IOMAP_F_NO_MERGE for non-mergable ioends Andrey Albershteyn
2024-12-29 13:38   ` [PATCH 04/14] xfs: add incompat directly mapped xattr flag Andrey Albershteyn
2024-12-29 13:38   ` [PATCH 05/14] libxfs: add xfs_calc_chsum() Andrey Albershteyn
2024-12-29 13:38   ` [PATCH 06/14] libxfs: pass xfs_sb to xfs_attr3_leaf_name_remote() Andrey Albershteyn
2024-12-29 13:38   ` [PATCH 07/14] xfs: introduce XFS_DA_OP_EMPTY Andrey Albershteyn
2024-12-29 13:38   ` [PATCH 08/14] xfs: introduce workqueue for post read processing Andrey Albershteyn
2024-12-29 13:38   ` [PATCH 09/14] xfs: add interface to set CRC on leaf attributes Andrey Albershteyn
2024-12-29 13:38   ` [PATCH 10/14] xfs: introduce XFS_ATTRUPDATE_FLAGS operation Andrey Albershteyn
2024-12-29 13:38   ` [PATCH 11/14] xfs: add interface for page cache mapped remote xattrs Andrey Albershteyn
2024-12-29 13:38   ` [PATCH 12/14] xfs: parse both remote attr name on-disk formats Andrey Albershteyn
2024-12-29 13:38   ` [PATCH 13/14] xfs: do not use xfs_attr3_rmt_hdr for remote value blocks for dxattr Andrey Albershteyn
2024-12-29 13:38   ` [PATCH 14/14] xfs: enalbe XFS_SB_FEAT_INCOMPAT_DXATTR Andrey Albershteyn
2024-12-29 13:39 ` [PATCH 00/24] fsverity integration for XFS based on direct mapped xattrs Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 01/24] fs: add FS_XFLAG_VERITY for verity files Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 02/24] fsverity: pass tree_blocksize to end_enable_verity() Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 03/24] fsverity: add tracepoints Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 04/24] fsverity: pass the new tree size and block size to ->begin_enable_verity Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 05/24] fsverity: expose merkle tree geometry to callers Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 06/24] fsverity: report validation errors back to the filesystem Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 07/24] fsverity: flush pagecache before enabling verity Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 08/24] iomap: integrate fs-verity verification into iomap's read path Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 09/24] xfs: use an empty transaction to protect xfs_attr_get from deadlocks Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 10/24] xfs: don't let xfs_bmap_first_unused overflow a xfs_dablk_t Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 11/24] xfs: add attribute type for fs-verity Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 12/24] xfs: add fs-verity ro-compat flag Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 13/24] xfs: add inode on-disk VERITY flag Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 14/24] xfs: initialize fs-verity on file open and cleanup on inode destruction Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 15/24] xfs: don't allow to enable DAX on fs-verity sealed inode Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 16/24] xfs: disable direct read path for fs-verity files Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 17/24] xfs: add fs-verity support Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 18/24] xfs: add writeback page mapping for fs-verity Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 19/24] xfs: use merkle tree offset as attr hash Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 20/24] xfs: add fs-verity ioctls Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 21/24] xfs: advertise fs-verity being available on filesystem Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 22/24] xfs: check and repair the verity inode flag state Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 23/24] xfs: report verity failures through the health system Andrey Albershteyn
2024-12-29 13:39   ` [PATCH 24/24] xfs: enable ro-compat fs-verity flag Andrey Albershteyn
2025-01-06 15:42 ` [RFC] Directly mapped xattr data & fs-verity Christoph Hellwig
2025-01-06 19:50   ` Darrick J. Wong
2025-01-06 20:56   ` Andrey Albershteyn
2025-01-07 16:50     ` Christoph Hellwig
2025-01-08  9:20       ` Andrey Albershteyn
2025-01-09  6:12         ` Christoph Hellwig
2025-01-09  7:39         ` Darrick J. Wong
2025-01-09  7:44           ` Christoph Hellwig
2025-01-09 17:03             ` Darrick J. Wong
2025-01-13  9:16           ` Andrey Albershteyn
