From: David Hildenbrand <david@redhat.com>
To: "Darrick J. Wong" <djwong@kernel.org>,
Matthew Wilcox <willy@infradead.org>
Cc: Andrey Albershteyn <aalbersh@redhat.com>,
fsverity@lists.linux.dev, linux-xfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, chandan.babu@oracle.com,
akpm@linux-foundation.org, linux-mm@kvack.org,
Eric Biggers <ebiggers@kernel.org>
Subject: Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
Date: Tue, 12 Mar 2024 16:33:14 +0100
Message-ID: <08905bcc-677d-4981-926d-7f407b2f6a4a@redhat.com>
In-Reply-To: <9927568e-9f36-4417-9d26-c8a05c220399@redhat.com>

On 12.03.24 16:13, David Hildenbrand wrote:
> On 11.03.24 23:38, Darrick J. Wong wrote:
>> [add willy and linux-mm]
>>
>> On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
>>> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
>>>>> BTW, is xfs_repair planned to do anything about any such extra blocks?
>>>>
>>>> Sorry to answer your question with a question, but how much checking is
>>>> $filesystem expected to do for merkle trees?
>>>>
>>>> In theory xfs_repair could learn how to interpret the verity descriptor,
>>>> walk the merkle tree blocks, and even read the file data to confirm
>>>> intactness. If the descriptor specifies the highest block address then
>>>> we could certainly trim off excess blocks. But I don't know how much of
>>>> libfsverity actually lets you do that; I haven't looked into that
>>>> deeply. :/
>>>>
>>>> For xfs_scrub I guess the job is theoretically simpler, since we only
>>>> need to stream reads of the verity files through the page cache and let
>>>> verity tell us if the file data are consistent.
>>>>
>>>> For both tools, if something finds errors in the merkle tree structure
>>>> itself, do we turn off verity? Or do we do something nasty like
>>>> truncate the file?
>>>
>>> As far as I know (I haven't been following btrfs-progs, but I'm familiar with
>>> e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
>>> validating the data of verity inodes against their Merkle trees.
>>>
>>> e2fsck does delete the verity metadata of inodes that don't have the verity flag
>>> enabled. That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
>>>
>>> I suppose that ideally, if an inode's verity metadata is invalid, then fsck
>>> should delete that inode's verity metadata and remove the verity flag from the
>>> inode. Checking for a missing or obviously corrupt fsverity_descriptor would be
>>> fairly straightforward, but it probably wouldn't catch much compared to actually
>>> validating the data against the Merkle tree. And actually validating the data
>>> against the Merkle tree would be complex and expensive. Note, none of this
>>> would work on files that are encrypted.
>>>
>>> Re: libfsverity, I think it would be possible to validate a Merkle tree using
>>> libfsverity_compute_digest() and the callbacks that it supports. But that's not
>>> quite what it was designed for.
>>>
>>>> Is there an ioctl or something that allows userspace to validate an
>>>> entire file's contents? Sort of like what BLKVERIFY would have done for
>>>> block devices, except that we might believe its answers?
>>>
>>> Just reading the whole file and seeing whether you get an error would do it.
>>>
>>> Though if you want to make sure it's really re-reading the on-disk data, it's
>>> necessary to drop the file's pagecache first.
>>
>> I tried a straight pagecache read and it worked like a charm!
>>
>> But then I thought to myself, do I really want to waste memory bandwidth
>> copying a bunch of data? No. I don't even want to incur system call
>> overhead from reading a single byte every $pagesize bytes.
>>
>> So I created 2M mmap areas and read a byte every $pagesize bytes. That
>> worked too, insofar as SIGBUSes are annoying to handle. But it's
>> annoying to take signals like that.
>>
>> Then I started looking at madvise. MADV_POPULATE_READ looked exactly
>> like what I wanted -- it prefaults in the pages, and "If populating
>> fails, a SIGBUS signal is not generated; instead, an error is returned."
>>
>
> Yes, these were the expected semantics :)
>
>> But then I tried rigging up a test to see if I could catch an EIO, and
>> instead I had to SIGKILL the process! It looks like filemap_fault returns
>> VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
>> __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
>> handle_mm_fault -> faultin_page -> __get_user_pages. At faultin_page,
>> the VM_FAULT_RETRY is translated to -EBUSY.
>>
>> __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
>> that to madvise_populate. Unfortunately, madvise_populate increments
>> its loop counter by the return value (still 0) so it runs in an
>> infinite loop. The only way out is SIGKILL.
>
> That's certainly unexpected. One user I know is QEMU, which primarily
> uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
> primarily used with shmem/hugetlb, where I haven't heard of any such
> endless loops.
>
>>
>> So I don't know what the correct behavior is here, other than the
>> infinite loop seems pretty suspect. Is it the correct behavior that
>> madvise_populate returns EIO if __get_user_pages ever returns zero?
>> That doesn't quite sound right if it's the case that a zero return could
>> also happen if memory is tight.
>
> madvise_populate() ends up calling faultin_vma_page_range() in a loop.
> That one calls __get_user_pages().
>
> __get_user_pages() documents: "0 return value is possible when the fault
> would need to be retried."
>
> So that's what the caller does. IIRC, there are cases where we really
> have to retry (at least once) and will make progress, so treating "0" as
> an error would be wrong.
>
> Staring at other __get_user_pages() users, __get_user_pages_locked()
> documents: "Please note that this function, unlike __get_user_pages(),
> will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
>
> But there is some elaborate retry logic in there, whereby the retry will
> set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
> retry attempt (there are cases where we retry more often, but that's
> related to something else I believe).
>
> So maybe we need a similar retry logic in faultin_vma_page_range()? Or
> make it use __get_user_pages_locked(), but I recall when I introduced
> MADV_POPULATE_READ, there was a catch to it.
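
To make the loop problem and the "similar retry logic" idea concrete, here is a
small userspace model. It is only a sketch: fake_faultin() and populate_model()
are hypothetical names, the "fail after the second zero return" policy is an
assumption for illustration, and none of this is the actual kernel code or a
proposed patch.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_PAGE_SIZE 4096UL

/*
 * Hypothetical stand-in for faultin_vma_page_range(): returns the number
 * of pages faulted in, or 0 when the fault "would need to be retried".
 * With stuck != 0 it returns 0 forever, modelling the case where
 * VM_FAULT_RETRY -> -EBUSY -> 0 never makes progress.
 */
static long fake_faultin(unsigned long start, unsigned long end, int stuck)
{
	if (stuck)
		return 0;
	return (long)((end - start) / MODEL_PAGE_SIZE);
}

/*
 * Model of the populate loop. Advancing "start" by a zero return value
 * would spin forever, so this variant gives up with -EIO once a second
 * consecutive attempt makes no progress (assumed policy, see above).
 */
static int populate_model(unsigned long start, unsigned long end, int stuck)
{
	int retries = 0;

	assert(start <= end);
	while (start < end) {
		long pages = fake_faultin(start, end, stuck);

		if (pages < 0)
			return (int)pages;	/* propagate an error code */
		if (pages == 0) {
			if (++retries > 1)
				return -EIO;	/* no progress after retrying */
			continue;		/* retry once */
		}
		retries = 0;
		start += (unsigned long)pages * MODEL_PAGE_SIZE;
	}
	return 0;
}
```

With stuck set, populate_model() returns -EIO after one retry instead of
looping; without it, the whole range is consumed in one pass. Whether -EIO is
the right error, and how many retries are safe, are exactly the open questions.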

I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the
mmap()+access case you describe above.

Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately
see how we would transition from a VM_FAULT_RETRY loop to
VM_FAULT_SIGBUS, which would be required for that function to call
do_sigbus().
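
For anyone who wants to reproduce the userspace side, the experiment discussed
above (map the file in 2M windows and prefault each one with
MADV_POPULATE_READ) could look roughly like the sketch below. populate_file()
and CHUNK are names I made up; the fallback definition of MADV_POPULATE_READ
is the value from the Linux uapi headers for libcs that do not define it yet.
Note that on a kernel showing the behavior described in this thread, the
madvise() call may never return on a read error rather than fail.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22	/* Linux uapi value, available since 5.14 */
#endif

#define CHUNK (2UL << 20)	/* 2M mapping window, as in the experiment */

/*
 * Prefault every page of @path through the page cache without copying
 * data to userspace. Returns 0 on success or a negative errno value.
 */
static int populate_file(const char *path)
{
	struct stat st;
	int ret = 0;
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;
	if (fstat(fd, &st) < 0) {
		ret = -errno;
		goto out;
	}

	for (off_t off = 0; off < st.st_size && !ret; off += (off_t)CHUNK) {
		off_t left = st.st_size - off;
		size_t len = left < (off_t)CHUNK ? (size_t)left : CHUNK;
		void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);

		if (p == MAP_FAILED) {
			ret = -errno;
			break;
		}
		/* Errors are reported via errno instead of SIGBUS. */
		if (madvise(p, len, MADV_POPULATE_READ) < 0)
			ret = -errno;
		munmap(p, len);
	}
out:
	close(fd);
	return ret;
}
```

To be sure the on-disk data is really re-read, the file's pagecache would have
to be dropped first, as Eric noted; the sketch leaves that out.
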
--
Cheers,
David / dhildenb