From: David Hildenbrand <david@redhat.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>,
Andrey Albershteyn <aalbersh@redhat.com>,
fsverity@lists.linux.dev, linux-xfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, chandan.babu@oracle.com,
akpm@linux-foundation.org, linux-mm@kvack.org,
Eric Biggers <ebiggers@kernel.org>
Subject: Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
Date: Wed, 13 Mar 2024 13:29:12 +0100
Message-ID: <420b6d5f-adef-4415-b8cb-16c234dcec38@redhat.com>
In-Reply-To: <20240312164444.GG1927156@frogsfrogsfrogs>
On 12.03.24 17:44, Darrick J. Wong wrote:
> On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
>> On 12.03.24 16:13, David Hildenbrand wrote:
>>> On 11.03.24 23:38, Darrick J. Wong wrote:
>>>> [add willy and linux-mm]
>>>>
>>>> On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
>>>>> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
>>>>>>> BTW, is xfs_repair planned to do anything about any such extra blocks?
>>>>>>
>>>>>> Sorry to answer your question with a question, but how much checking is
>>>>>> $filesystem expected to do for merkle trees?
>>>>>>
>>>>>> In theory xfs_repair could learn how to interpret the verity descriptor,
>>>>>> walk the merkle tree blocks, and even read the file data to confirm
>>>>>> intactness. If the descriptor specifies the highest block address then
>>>>>> we could certainly trim off excess blocks. But I don't know how much of
>>>>>> libfsverity actually lets you do that; I haven't looked into that
>>>>>> deeply. :/
>>>>>>
>>>>>> For xfs_scrub I guess the job is theoretically simpler, since we only
>>>>>> need to stream reads of the verity files through the page cache and let
>>>>>> verity tell us if the file data are consistent.
>>>>>>
>>>>>> For both tools, if something finds errors in the merkle tree structure
>>>>>> itself, do we turn off verity? Or do we do something nasty like
>>>>>> truncate the file?
>>>>>
>>>>> As far as I know (I haven't been following btrfs-progs, but I'm familiar with
>>>>> e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
>>>>> validating the data of verity inodes against their Merkle trees.
>>>>>
>>>>> e2fsck does delete the verity metadata of inodes that don't have the verity flag
>>>>> enabled. That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
>>>>>
>>>>> I suppose that ideally, if an inode's verity metadata is invalid, then fsck
>>>>> should delete that inode's verity metadata and remove the verity flag from the
>>>>> inode. Checking for a missing or obviously corrupt fsverity_descriptor would be
>>>>> fairly straightforward, but it probably wouldn't catch much compared to actually
>>>>> validating the data against the Merkle tree. And actually validating the data
>>>>> against the Merkle tree would be complex and expensive. Note, none of this
>>>>> would work on files that are encrypted.
>>>>>
>>>>> Re: libfsverity, I think it would be possible to validate a Merkle tree using
>>>>> libfsverity_compute_digest() and the callbacks that it supports. But that's not
>>>>> quite what it was designed for.
>>>>>
>>>>>> Is there an ioctl or something that allows userspace to validate an
>>>>>> entire file's contents? Sort of like what BLKVERIFY would have done for
>>>>>> block devices, except that we might believe its answers?
>>>>>
>>>>> Just reading the whole file and seeing whether you get an error would do it.
>>>>>
>>>>> Though if you want to make sure it's really re-reading the on-disk data, it's
>>>>> necessary to drop the file's pagecache first.
>>>>
>>>> I tried a straight pagecache read and it worked like a charm!
>>>>
>>>> But then I thought to myself, do I really want to waste memory bandwidth
>>>> copying a bunch of data? No. I don't even want to incur system call
>>>> overhead from reading a single byte every $pagesize bytes.
>>>>
>>>> So I created 2M mmap areas and read a byte every $pagesize bytes. That
>>>> worked too, insofar as SIGBUSes are annoying to handle. But it's
>>>> annoying to take signals like that.
>>>>
>>>> Then I started looking at madvise. MADV_POPULATE_READ looked exactly
>>>> like what I wanted -- it prefaults in the pages, and "If populating
>>>> fails, a SIGBUS signal is not generated; instead, an error is returned."
>>>>
>>>
>>> Yes, these were the expected semantics :)
>>>
>>>> But then I tried rigging up a test to see if I could catch an EIO, and
>>>> instead I had to SIGKILL the process! It looks filemap_fault returns
>>>> VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
>>>> __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
>>>> handle_mm_fault -> faultin_page -> __get_user_pages. At faultin_pages,
>>>> the VM_FAULT_RETRY is translated to -EBUSY.
>>>>
>>>> __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
>>>> that to madvise_populate. Unfortunately, madvise_populate increments
>>>> its loop counter by the return value (still 0) so it runs in an
>>>> infinite loop. The only way out is SIGKILL.
>>>
>>> That's certainly unexpected. One user I know is QEMU, which primarily
>>> uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
>>> primarily used with shmem/hugetlb, where I haven't heard of any such
>>> endless loops.
>>>
>>>>
>>>> So I don't know what the correct behavior is here, other than the
>>>> infinite loop seems pretty suspect. Is it the correct behavior that
>>>> madvise_populate returns EIO if __get_user_pages ever returns zero?
>>>> That doesn't quite sound right if it's the case that a zero return could
>>>> also happen if memory is tight.
>>>
>>> madvise_populate() ends up calling faultin_vma_page_range() in a loop.
>>> That one calls __get_user_pages().
>>>
>>> __get_user_pages() documents: "0 return value is possible when the fault
>>> would need to be retried."
>>>
>>> So that's what the caller does. IIRC, there are cases where we really
>>> have to retry (at least once) and will make progress, so treating "0" as
>>> an error would be wrong.
>>>
>>> Staring at other __get_user_pages() users, __get_user_pages_locked()
>>> documents: "Please note that this function, unlike __get_user_pages(),
>>> will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
>>>
>>> But there is some elaborate retry logic in there, whereby the retry will
>>> set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
>>> retry attempt (there are cases where we retry more often, but that's
>>> related to something else I believe).
>>>
>>> So maybe we need a similar retry logic in faultin_vma_page_range()? Or
>>> make it use __get_user_pages_locked(), but I recall when I introduced
>>> MADV_POPULATE_READ, there was a catch to it.
>>
>> I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the
>> mmap()+access case you describe above.
>>
>> Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately see
>> how we would transition from a VM_FAULT_RETRY loop to VM_FAULT_SIGBUS.
>> Because VM_FAULT_SIGBUS would be required for that function to call
>> do_sigbus().
>
> The code I was looking at yesterday in filemap_fault was:
>
> page_not_uptodate:
> 	/*
> 	 * Umm, take care of errors if the page isn't up-to-date.
> 	 * Try to re-read it _once_. We do this synchronously,
> 	 * because there really aren't any performance issues here
> 	 * and we need to check for errors.
> 	 */
> 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> 	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
> 	if (fpin)
> 		goto out_retry;
> 	folio_put(folio);
>
> 	if (!error || error == AOP_TRUNCATED_PAGE)
> 		goto retry_find;
> 	filemap_invalidate_unlock_shared(mapping);
>
> 	return VM_FAULT_SIGBUS;
>
> Wherein I /think/ fpin is non-null in this case, so if
> filemap_read_folio returns an error, we'll do this instead:
>
> out_retry:
> 	/*
> 	 * We dropped the mmap_lock, we need to return to the fault handler to
> 	 * re-find the vma and come back and find our hopefully still populated
> 	 * page.
> 	 */
> 	if (!IS_ERR(folio))
> 		folio_put(folio);
> 	if (mapping_locked)
> 		filemap_invalidate_unlock_shared(mapping);
> 	if (fpin)
> 		fput(fpin);
> 	return ret | VM_FAULT_RETRY;
>
> and since ret was 0 before the goto, the only return code is
> VM_FAULT_RETRY. I had speculated that perhaps we could instead do:
>
> 	if (fpin) {
> 		if (error)
> 			ret |= VM_FAULT_SIGBUS;
> 		goto out_retry;
> 	}
>
> But I think the hard part here is that there doesn't seem to be any
> distinction between transient read errors (e.g. disk cable fell out) vs.
> semi-permanent errors (e.g. verity says the hash doesn't match).
> AFAICT, either the read(ahead) sets uptodate and callers read the page,
> or it doesn't set it and callers treat that as an error-retry
> opportunity.
>
> For the transient error case VM_FAULT_RETRY makes perfect sense; for the
> second case I imagine we'd want something closer to _SIGBUS.
Agreed, it's really hard to judge when it's the right time to give up
retrying. At least with MADV_POPULATE_READ we should try achieving the
same behavior as with mmap()+read access. So if the latter manages to
trigger SIGBUS, MADV_POPULATE_READ should return an error.
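To make the userspace side of that comparison concrete, here is a rough
sketch of the kind of loop I have in mind: map the file in 2M windows and
prefault each window with MADV_POPULATE_READ. The helper name
populate_read_file() and the POPULATE_WINDOW constant are made up for
illustration, and the fallback #define is only for older headers. This is
not the scrub code, just a sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22	/* Linux UAPI value, available since 5.14 */
#endif

/* Assumed window size: 2M, as in the experiment described above. */
#define POPULATE_WINDOW (2UL << 20)

/*
 * Hypothetical helper: prefault every page of fd[0, size) through the
 * page cache, one window at a time.
 * Returns 0 on success or a negative errno from mmap()/madvise().
 */
static int populate_read_file(int fd, off_t size)
{
	off_t pos = 0;

	while (pos < size) {
		size_t len = POPULATE_WINDOW;
		void *p;
		int ret = 0;

		/* The last window may be shorter than 2M. */
		if ((off_t)len > size - pos)
			len = (size_t)(size - pos);

		p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, pos);
		if (p == MAP_FAILED)
			return -errno;

		/* No signal handling needed: failures come back as errno. */
		if (madvise(p, len, MADV_POPULATE_READ))
			ret = -errno;

		munmap(p, len);
		if (ret)
			return ret;
		pos += len;
	}
	return 0;
}
```

With the semantics I'd like us to have, a file whose data fails verification
would make this return an error (madvise(2) documents EFAULT for the case
where an actual access would have raised SIGBUS) instead of never returning.
Kernels without MADV_POPULATE_READ return -EINVAL here.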
Is there an easy way for me to reproduce this scenario?
--
Cheers,
David / dhildenb