Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
       [not found]           ` <20240308044017.GC8111@sol.localdomain>
@ 2024-03-11 22:38             ` Darrick J. Wong
  2024-03-12 15:13               ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: Darrick J. Wong @ 2024-03-11 22:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, akpm, linux-mm, Eric Biggers

[add willy and linux-mm]

On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
> > > BTW, is xfs_repair planned to do anything about any such extra blocks?
> > 
> > Sorry to answer your question with a question, but how much checking is
> > $filesystem expected to do for merkle trees?
> > 
> > In theory xfs_repair could learn how to interpret the verity descriptor,
> > walk the merkle tree blocks, and even read the file data to confirm
> > intactness.  If the descriptor specifies the highest block address then
> > we could certainly trim off excess blocks.  But I don't know how much of
> > libfsverity actually lets you do that; I haven't looked into that
> > deeply. :/
> > 
> > For xfs_scrub I guess the job is theoretically simpler, since we only
> > need to stream reads of the verity files through the page cache and let
> > verity tell us if the file data are consistent.
> > 
> > For both tools, if something finds errors in the merkle tree structure
> > itself, do we turn off verity?  Or do we do something nasty like
> > truncate the file?
> 
> As far as I know (I haven't been following btrfs-progs, but I'm familiar with
> e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
> validating the data of verity inodes against their Merkle trees.
> 
> e2fsck does delete the verity metadata of inodes that don't have the verity flag
> enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
> 
> I suppose that ideally, if an inode's verity metadata is invalid, then fsck
> should delete that inode's verity metadata and remove the verity flag from the
> inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
> fairly straightforward, but it probably wouldn't catch much compared to actually
> validating the data against the Merkle tree.  And actually validating the data
> against the Merkle tree would be complex and expensive.  Note, none of this
> would work on files that are encrypted.
> 
> Re: libfsverity, I think it would be possible to validate a Merkle tree using
> libfsverity_compute_digest() and the callbacks that it supports.  But that's not
> quite what it was designed for.
> 
> > Is there an ioctl or something that allows userspace to validate an
> > entire file's contents?  Sort of like what BLKVERIFY would have done for
> > block devices, except that we might believe its answers?
> 
> Just reading the whole file and seeing whether you get an error would do it.
> 
> Though if you want to make sure it's really re-reading the on-disk data, it's
> necessary to drop the file's pagecache first.

I tried a straight pagecache read and it worked like a charm!

But then I thought to myself, do I really want to waste memory bandwidth
copying a bunch of data?  No.  I don't even want to incur system call
overhead from reading a single byte every $pagesize bytes.

So I created 2M mmap areas and read a byte every $pagesize bytes.  That
worked too, insofar as SIGBUSes are annoying to handle.  But it's
annoying to take signals like that.

Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
like what I wanted -- it prefaults in the pages, and "If populating
fails, a SIGBUS signal is not generated; instead, an error is returned."

But then I tried rigging up a test to see if I could catch an EIO, and
instead I had to SIGKILL the process!  It looks filemap_fault returns
VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
__do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_pages,
the VM_FAULT_RETRY is translated to -EBUSY.

__get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
that to madvise_populate.  Unfortunately, madvise_populate increments
its loop counter by the return value (still 0) so it runs in an
infinite loop.  The only way out is SIGKILL.

So I don't know what the correct behavior is here, other than the
infinite loop seems pretty suspect.  Is it the correct behavior that
madvise_populate returns EIO if __get_user_pages ever returns zero?
That doesn't quite sound right if it's the case that a zero return could
also happen if memory is tight.

I suppose filemap_fault could return VM_FAULT_SIGBUS in this one
scenario so userspace would get an -EFAULT.  That would solve this one
case of weird behavior.  But I think that doesn't happen in the
page_not_uptodate case because fpin is non-null?

As for xfs_scrub validating data files, I suppose it's not /so/
terrible to read one byte every $fsblocksize so that we can report
exactly where fsverity and the file data became inconsistent.  The
POPULATE_READ interface doesn't tell you how many pages it /did/ manage
to load, so perhaps MADV_POPULATE_READ isn't workable anyway.

(and now I'm just handwaving wildly about pagecache behaviors ;))

--D

> > Also -- inconsistencies between the file data and the merkle tree aren't
> > something that xfs can self-heal, right?
> 
> Similar to file data itself, only way to self-heal would be via mechanisms that
> provide redundancy.  There's been some interest in adding support forward error
> correction (FEC) to fsverity similar to what dm-verity has, but this would be
> complex, and it's not something that anyone has gotten around to yet.
> 
> - Eric
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-11 22:38             ` [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity() Darrick J. Wong
@ 2024-03-12 15:13               ` David Hildenbrand
       [not found]                 ` <08905bcc-677d-4981-926d-7f407b2f6a4a@redhat.com>
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2024-03-12 15:13 UTC (permalink / raw)
  To: Darrick J. Wong, Matthew Wilcox
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, akpm, linux-mm, Eric Biggers

On 11.03.24 23:38, Darrick J. Wong wrote:
> [add willy and linux-mm]
> 
> On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
>> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
>>>> BTW, is xfs_repair planned to do anything about any such extra blocks?
>>>
>>> Sorry to answer your question with a question, but how much checking is
>>> $filesystem expected to do for merkle trees?
>>>
>>> In theory xfs_repair could learn how to interpret the verity descriptor,
>>> walk the merkle tree blocks, and even read the file data to confirm
>>> intactness.  If the descriptor specifies the highest block address then
>>> we could certainly trim off excess blocks.  But I don't know how much of
>>> libfsverity actually lets you do that; I haven't looked into that
>>> deeply. :/
>>>
>>> For xfs_scrub I guess the job is theoretically simpler, since we only
>>> need to stream reads of the verity files through the page cache and let
>>> verity tell us if the file data are consistent.
>>>
>>> For both tools, if something finds errors in the merkle tree structure
>>> itself, do we turn off verity?  Or do we do something nasty like
>>> truncate the file?
>>
>> As far as I know (I haven't been following btrfs-progs, but I'm familiar with
>> e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
>> validating the data of verity inodes against their Merkle trees.
>>
>> e2fsck does delete the verity metadata of inodes that don't have the verity flag
>> enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
>>
>> I suppose that ideally, if an inode's verity metadata is invalid, then fsck
>> should delete that inode's verity metadata and remove the verity flag from the
>> inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
>> fairly straightforward, but it probably wouldn't catch much compared to actually
>> validating the data against the Merkle tree.  And actually validating the data
>> against the Merkle tree would be complex and expensive.  Note, none of this
>> would work on files that are encrypted.
>>
>> Re: libfsverity, I think it would be possible to validate a Merkle tree using
>> libfsverity_compute_digest() and the callbacks that it supports.  But that's not
>> quite what it was designed for.
>>
>>> Is there an ioctl or something that allows userspace to validate an
>>> entire file's contents?  Sort of like what BLKVERIFY would have done for
>>> block devices, except that we might believe its answers?
>>
>> Just reading the whole file and seeing whether you get an error would do it.
>>
>> Though if you want to make sure it's really re-reading the on-disk data, it's
>> necessary to drop the file's pagecache first.
> 
> I tried a straight pagecache read and it worked like a charm!
> 
> But then I thought to myself, do I really want to waste memory bandwidth
> copying a bunch of data?  No.  I don't even want to incur system call
> overhead from reading a single byte every $pagesize bytes.
> 
> So I created 2M mmap areas and read a byte every $pagesize bytes.  That
> worked too, insofar as SIGBUSes are annoying to handle.  But it's
> annoying to take signals like that.
> 
> Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
> like what I wanted -- it prefaults in the pages, and "If populating
> fails, a SIGBUS signal is not generated; instead, an error is returned."
> 

Yes, these were the expected semantics :)

> But then I tried rigging up a test to see if I could catch an EIO, and
> instead I had to SIGKILL the process!  It looks filemap_fault returns
> VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
> __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
> handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_pages,
> the VM_FAULT_RETRY is translated to -EBUSY.
> 
> __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
> that to madvise_populate.  Unfortunately, madvise_populate increments
> its loop counter by the return value (still 0) so it runs in an
> infinite loop.  The only way out is SIGKILL.

That's certainly unexpected. One user I know is QEMU, which primarily 
uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is 
primarily used with shmem/hugetlb, where I haven't heard of any such 
endless loops.

> 
> So I don't know what the correct behavior is here, other than the
> infinite loop seems pretty suspect.  Is it the correct behavior that
> madvise_populate returns EIO if __get_user_pages ever returns zero?
> That doesn't quite sound right if it's the case that a zero return could
> also happen if memory is tight.

madvise_populate() ends up calling faultin_vma_page_range() in a loop. 
That one calls __get_user_pages().

__get_user_pages() documents: "0 return value is possible when the fault 
would need to be retried."

So that's what the caller does. IIRC, there are cases where we really 
have to retry (at least once) and will make progress, so treating "0" as 
an error would be wrong.

Staring at other __get_user_pages() users, __get_user_pages_locked() 
documents: "Please note that this function, unlike __get_user_pages(), 
will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".

But there is some elaborate retry logic in there, whereby the retry will 
set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second 
retry attempt (there are cases where we retry more often, but that's 
related to something else I believe).

So maybe we need a similar retry logic in faultin_vma_page_range()? Or 
make it use __get_user_pages_locked(), but I recall when I introduced 
MADV_POPULATE_READ, there was a catch to it.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 7+ messages in thread

[parent not found: <08905bcc-677d-4981-926d-7f407b2f6a4a@redhat.com>]

* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
       [not found]                 ` <08905bcc-677d-4981-926d-7f407b2f6a4a@redhat.com>
@ 2024-03-12 16:44                   ` Darrick J. Wong
  2024-03-13 12:29                     ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: Darrick J. Wong @ 2024-03-12 16:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu, akpm, linux-mm, Eric Biggers

On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
> On 12.03.24 16:13, David Hildenbrand wrote:
> > On 11.03.24 23:38, Darrick J. Wong wrote:
> > > [add willy and linux-mm]
> > > 
> > > On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
> > > > On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
> > > > > > BTW, is xfs_repair planned to do anything about any such extra blocks?
> > > > > 
> > > > > Sorry to answer your question with a question, but how much checking is
> > > > > $filesystem expected to do for merkle trees?
> > > > > 
> > > > > In theory xfs_repair could learn how to interpret the verity descriptor,
> > > > > walk the merkle tree blocks, and even read the file data to confirm
> > > > > intactness.  If the descriptor specifies the highest block address then
> > > > > we could certainly trim off excess blocks.  But I don't know how much of
> > > > > libfsverity actually lets you do that; I haven't looked into that
> > > > > deeply. :/
> > > > > 
> > > > > For xfs_scrub I guess the job is theoretically simpler, since we only
> > > > > need to stream reads of the verity files through the page cache and let
> > > > > verity tell us if the file data are consistent.
> > > > > 
> > > > > For both tools, if something finds errors in the merkle tree structure
> > > > > itself, do we turn off verity?  Or do we do something nasty like
> > > > > truncate the file?
> > > > 
> > > > As far as I know (I haven't been following btrfs-progs, but I'm familiar with
> > > > e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
> > > > validating the data of verity inodes against their Merkle trees.
> > > > 
> > > > e2fsck does delete the verity metadata of inodes that don't have the verity flag
> > > > enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
> > > > 
> > > > I suppose that ideally, if an inode's verity metadata is invalid, then fsck
> > > > should delete that inode's verity metadata and remove the verity flag from the
> > > > inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
> > > > fairly straightforward, but it probably wouldn't catch much compared to actually
> > > > validating the data against the Merkle tree.  And actually validating the data
> > > > against the Merkle tree would be complex and expensive.  Note, none of this
> > > > would work on files that are encrypted.
> > > > 
> > > > Re: libfsverity, I think it would be possible to validate a Merkle tree using
> > > > libfsverity_compute_digest() and the callbacks that it supports.  But that's not
> > > > quite what it was designed for.
> > > > 
> > > > > Is there an ioctl or something that allows userspace to validate an
> > > > > entire file's contents?  Sort of like what BLKVERIFY would have done for
> > > > > block devices, except that we might believe its answers?
> > > > 
> > > > Just reading the whole file and seeing whether you get an error would do it.
> > > > 
> > > > Though if you want to make sure it's really re-reading the on-disk data, it's
> > > > necessary to drop the file's pagecache first.
> > > 
> > > I tried a straight pagecache read and it worked like a charm!
> > > 
> > > But then I thought to myself, do I really want to waste memory bandwidth
> > > copying a bunch of data?  No.  I don't even want to incur system call
> > > overhead from reading a single byte every $pagesize bytes.
> > > 
> > > So I created 2M mmap areas and read a byte every $pagesize bytes.  That
> > > worked too, insofar as SIGBUSes are annoying to handle.  But it's
> > > annoying to take signals like that.
> > > 
> > > Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
> > > like what I wanted -- it prefaults in the pages, and "If populating
> > > fails, a SIGBUS signal is not generated; instead, an error is returned."
> > > 
> > 
> > Yes, these were the expected semantics :)
> > 
> > > But then I tried rigging up a test to see if I could catch an EIO, and
> > > instead I had to SIGKILL the process!  It looks filemap_fault returns
> > > VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
> > > __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
> > > handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_pages,
> > > the VM_FAULT_RETRY is translated to -EBUSY.
> > > 
> > > __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
> > > that to madvise_populate.  Unfortunately, madvise_populate increments
> > > its loop counter by the return value (still 0) so it runs in an
> > > infinite loop.  The only way out is SIGKILL.
> > 
> > That's certainly unexpected. One user I know is QEMU, which primarily
> > uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
> > primarily used with shmem/hugetlb, where I haven't heard of any such
> > endless loops.
> > 
> > > 
> > > So I don't know what the correct behavior is here, other than the
> > > infinite loop seems pretty suspect.  Is it the correct behavior that
> > > madvise_populate returns EIO if __get_user_pages ever returns zero?
> > > That doesn't quite sound right if it's the case that a zero return could
> > > also happen if memory is tight.
> > 
> > madvise_populate() ends up calling faultin_vma_page_range() in a loop.
> > That one calls __get_user_pages().
> > 
> > __get_user_pages() documents: "0 return value is possible when the fault
> > would need to be retried."
> > 
> > So that's what the caller does. IIRC, there are cases where we really
> > have to retry (at least once) and will make progress, so treating "0" as
> > an error would be wrong.
> > 
> > Staring at other __get_user_pages() users, __get_user_pages_locked()
> > documents: "Please note that this function, unlike __get_user_pages(),
> > will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
> > 
> > But there is some elaborate retry logic in there, whereby the retry will
> > set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
> > retry attempt (there are cases where we retry more often, but that's
> > related to something else I believe).
> > 
> > So maybe we need a similar retry logic in faultin_vma_page_range()? Or
> > make it use __get_user_pages_locked(), but I recall when I introduced
> > MADV_POPULATE_READ, there was a catch to it.
> 
> I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the
> mmap()+access case you describe above.
> 
> Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately see
> how we would transition from a VM_FAULT_RETRY loop to VM_FAULT_SIGBUS.
> Because VM_FAULT_SIGBUS would be required for that function to call
> do_sigbus().

The code I was looking at yesterday in filemap_fault was:

page_not_uptodate:
	/*
	 * Umm, take care of errors if the page isn't up-to-date.
	 * Try to re-read it _once_. We do this synchronously,
	 * because there really aren't any performance issues here
	 * and we need to check for errors.
	 */
	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
	if (fpin)
		goto out_retry;
	folio_put(folio);

	if (!error || error == AOP_TRUNCATED_PAGE)
		goto retry_find;
	filemap_invalidate_unlock_shared(mapping);

	return VM_FAULT_SIGBUS;

Wherein I /think/ fpin is non-null in this case, so if
filemap_read_folio returns an error, we'll do this instead:

out_retry:
	/*
	 * We dropped the mmap_lock, we need to return to the fault handler to
	 * re-find the vma and come back and find our hopefully still populated
	 * page.
	 */
	if (!IS_ERR(folio))
		folio_put(folio);
	if (mapping_locked)
		filemap_invalidate_unlock_shared(mapping);
	if (fpin)
		fput(fpin);
	return ret | VM_FAULT_RETRY;

and since ret was 0 before the goto, the only return code is
VM_FAULT_RETRY.  I had speculated that perhaps we could instead do:

	if (fpin) {
		if (error)
			ret |= VM_FAULT_SIGBUS;
		goto out_retry;
	}

But I think the hard part here is that there doesn't seem to be any
distinction between transient read errors (e.g. disk cable fell out) vs.
semi-permanent errors (e.g. verity says the hash doesn't match).
AFAICT, either the read(ahead) sets uptodate and callers read the page,
or it doesn't set it and callers treat that as an error-retry
opportunity.

For the transient error case VM_FAULT_RETRY makes perfect sense; for the
second case I imagine we'd want something closer to _SIGBUS.

</handwave>

--D

> -- 
> Cheers,
> 
> David / dhildenb
> 
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-12 16:44                   ` Darrick J. Wong
@ 2024-03-13 12:29                     ` David Hildenbrand
  2024-03-13 17:19                       ` Darrick J. Wong
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2024-03-13 12:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu, akpm, linux-mm, Eric Biggers

On 12.03.24 17:44, Darrick J. Wong wrote:
> On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
>> On 12.03.24 16:13, David Hildenbrand wrote:
>>> On 11.03.24 23:38, Darrick J. Wong wrote:
>>>> [add willy and linux-mm]
>>>>
>>>> On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
>>>>> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
>>>>>>> BTW, is xfs_repair planned to do anything about any such extra blocks?
>>>>>>
>>>>>> Sorry to answer your question with a question, but how much checking is
>>>>>> $filesystem expected to do for merkle trees?
>>>>>>
>>>>>> In theory xfs_repair could learn how to interpret the verity descriptor,
>>>>>> walk the merkle tree blocks, and even read the file data to confirm
>>>>>> intactness.  If the descriptor specifies the highest block address then
>>>>>> we could certainly trim off excess blocks.  But I don't know how much of
>>>>>> libfsverity actually lets you do that; I haven't looked into that
>>>>>> deeply. :/
>>>>>>
>>>>>> For xfs_scrub I guess the job is theoretically simpler, since we only
>>>>>> need to stream reads of the verity files through the page cache and let
>>>>>> verity tell us if the file data are consistent.
>>>>>>
>>>>>> For both tools, if something finds errors in the merkle tree structure
>>>>>> itself, do we turn off verity?  Or do we do something nasty like
>>>>>> truncate the file?
>>>>>
>>>>> As far as I know (I haven't been following btrfs-progs, but I'm familiar with
>>>>> e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
>>>>> validating the data of verity inodes against their Merkle trees.
>>>>>
>>>>> e2fsck does delete the verity metadata of inodes that don't have the verity flag
>>>>> enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
>>>>>
>>>>> I suppose that ideally, if an inode's verity metadata is invalid, then fsck
>>>>> should delete that inode's verity metadata and remove the verity flag from the
>>>>> inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
>>>>> fairly straightforward, but it probably wouldn't catch much compared to actually
>>>>> validating the data against the Merkle tree.  And actually validating the data
>>>>> against the Merkle tree would be complex and expensive.  Note, none of this
>>>>> would work on files that are encrypted.
>>>>>
>>>>> Re: libfsverity, I think it would be possible to validate a Merkle tree using
>>>>> libfsverity_compute_digest() and the callbacks that it supports.  But that's not
>>>>> quite what it was designed for.
>>>>>
>>>>>> Is there an ioctl or something that allows userspace to validate an
>>>>>> entire file's contents?  Sort of like what BLKVERIFY would have done for
>>>>>> block devices, except that we might believe its answers?
>>>>>
>>>>> Just reading the whole file and seeing whether you get an error would do it.
>>>>>
>>>>> Though if you want to make sure it's really re-reading the on-disk data, it's
>>>>> necessary to drop the file's pagecache first.
>>>>
>>>> I tried a straight pagecache read and it worked like a charm!
>>>>
>>>> But then I thought to myself, do I really want to waste memory bandwidth
>>>> copying a bunch of data?  No.  I don't even want to incur system call
>>>> overhead from reading a single byte every $pagesize bytes.
>>>>
>>>> So I created 2M mmap areas and read a byte every $pagesize bytes.  That
>>>> worked too, insofar as SIGBUSes are annoying to handle.  But it's
>>>> annoying to take signals like that.
>>>>
>>>> Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
>>>> like what I wanted -- it prefaults in the pages, and "If populating
>>>> fails, a SIGBUS signal is not generated; instead, an error is returned."
>>>>
>>>
>>> Yes, these were the expected semantics :)
>>>
>>>> But then I tried rigging up a test to see if I could catch an EIO, and
>>>> instead I had to SIGKILL the process!  It looks filemap_fault returns
>>>> VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
>>>> __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
>>>> handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_pages,
>>>> the VM_FAULT_RETRY is translated to -EBUSY.
>>>>
>>>> __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
>>>> that to madvise_populate.  Unfortunately, madvise_populate increments
>>>> its loop counter by the return value (still 0) so it runs in an
>>>> infinite loop.  The only way out is SIGKILL.
>>>
>>> That's certainly unexpected. One user I know is QEMU, which primarily
>>> uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
>>> primarily used with shmem/hugetlb, where I haven't heard of any such
>>> endless loops.
>>>
>>>>
>>>> So I don't know what the correct behavior is here, other than the
>>>> infinite loop seems pretty suspect.  Is it the correct behavior that
>>>> madvise_populate returns EIO if __get_user_pages ever returns zero?
>>>> That doesn't quite sound right if it's the case that a zero return could
>>>> also happen if memory is tight.
>>>
>>> madvise_populate() ends up calling faultin_vma_page_range() in a loop.
>>> That one calls __get_user_pages().
>>>
>>> __get_user_pages() documents: "0 return value is possible when the fault
>>> would need to be retried."
>>>
>>> So that's what the caller does. IIRC, there are cases where we really
>>> have to retry (at least once) and will make progress, so treating "0" as
>>> an error would be wrong.
>>>
>>> Staring at other __get_user_pages() users, __get_user_pages_locked()
>>> documents: "Please note that this function, unlike __get_user_pages(),
>>> will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
>>>
>>> But there is some elaborate retry logic in there, whereby the retry will
>>> set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
>>> retry attempt (there are cases where we retry more often, but that's
>>> related to something else I believe).
>>>
>>> So maybe we need a similar retry logic in faultin_vma_page_range()? Or
>>> make it use __get_user_pages_locked(), but I recall when I introduced
>>> MADV_POPULATE_READ, there was a catch to it.
>>
>> I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the
>> mmap()+access case you describe above.
>>
>> Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately see
>> how we would transition from a VM_FAULT_RETRY loop to VM_FAULT_SIGBUS.
>> Because VM_FAULT_SIGBUS would be required for that function to call
>> do_sigbus().
> 
> The code I was looking at yesterday in filemap_fault was:
> 
> page_not_uptodate:
> 	/*
> 	 * Umm, take care of errors if the page isn't up-to-date.
> 	 * Try to re-read it _once_. We do this synchronously,
> 	 * because there really aren't any performance issues here
> 	 * and we need to check for errors.
> 	 */
> 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> 	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
> 	if (fpin)
> 		goto out_retry;
> 	folio_put(folio);
> 
> 	if (!error || error == AOP_TRUNCATED_PAGE)
> 		goto retry_find;
> 	filemap_invalidate_unlock_shared(mapping);
> 
> 	return VM_FAULT_SIGBUS;
> 
> Wherein I /think/ fpin is non-null in this case, so if
> filemap_read_folio returns an error, we'll do this instead:
> 
> out_retry:
> 	/*
> 	 * We dropped the mmap_lock, we need to return to the fault handler to
> 	 * re-find the vma and come back and find our hopefully still populated
> 	 * page.
> 	 */
> 	if (!IS_ERR(folio))
> 		folio_put(folio);
> 	if (mapping_locked)
> 		filemap_invalidate_unlock_shared(mapping);
> 	if (fpin)
> 		fput(fpin);
> 	return ret | VM_FAULT_RETRY;
> 
> and since ret was 0 before the goto, the only return code is
> VM_FAULT_RETRY.  I had speculated that perhaps we could instead do:
> 
> 	if (fpin) {
> 		if (error)
> 			ret |= VM_FAULT_SIGBUS;
> 		goto out_retry;
> 	}
> 
> But I think the hard part here is that there doesn't seem to be any
> distinction between transient read errors (e.g. disk cable fell out) vs.
> semi-permanent errors (e.g. verity says the hash doesn't match).
> AFAICT, either the read(ahead) sets uptodate and callers read the page,
> or it doesn't set it and callers treat that as an error-retry
> opportunity.
> 
> For the transient error case VM_FAULT_RETRY makes perfect sense; for the
> second case I imagine we'd want something closer to _SIGBUS.


Agreed, it's really hard to judge when it's the right time to give up 
retrying. At least with MADV_POPULATE_READ we should try achieving the 
same behavior as with mmap()+read access. So if the latter manages to 
trigger SIGBUS, MADV_POPULATE_READ should return an error.

Is there an easy way to for me to reproduce this scenario?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-13 12:29                     ` David Hildenbrand
@ 2024-03-13 17:19                       ` Darrick J. Wong
  2024-03-13 19:10                         ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: Darrick J. Wong @ 2024-03-13 17:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu, akpm, linux-mm, Eric Biggers

On Wed, Mar 13, 2024 at 01:29:12PM +0100, David Hildenbrand wrote:
> On 12.03.24 17:44, Darrick J. Wong wrote:
> > On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
> > > On 12.03.24 16:13, David Hildenbrand wrote:
> > > > On 11.03.24 23:38, Darrick J. Wong wrote:
> > > > > [add willy and linux-mm]
> > > > > 
> > > > > On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
> > > > > > On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
> > > > > > > > BTW, is xfs_repair planned to do anything about any such extra blocks?
> > > > > > > 
> > > > > > > Sorry to answer your question with a question, but how much checking is
> > > > > > > $filesystem expected to do for merkle trees?
> > > > > > > 
> > > > > > > In theory xfs_repair could learn how to interpret the verity descriptor,
> > > > > > > walk the merkle tree blocks, and even read the file data to confirm
> > > > > > > intactness.  If the descriptor specifies the highest block address then
> > > > > > > we could certainly trim off excess blocks.  But I don't know how much of
> > > > > > > libfsverity actually lets you do that; I haven't looked into that
> > > > > > > deeply. :/
> > > > > > > 
> > > > > > > For xfs_scrub I guess the job is theoretically simpler, since we only
> > > > > > > need to stream reads of the verity files through the page cache and let
> > > > > > > verity tell us if the file data are consistent.
> > > > > > > 
> > > > > > > For both tools, if something finds errors in the merkle tree structure
> > > > > > > itself, do we turn off verity?  Or do we do something nasty like
> > > > > > > truncate the file?
> > > > > > 
> > > > > > As far as I know (I haven't been following btrfs-progs, but I'm familiar with
> > > > > > e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
> > > > > > validating the data of verity inodes against their Merkle trees.
> > > > > > 
> > > > > > e2fsck does delete the verity metadata of inodes that don't have the verity flag
> > > > > > enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
> > > > > > 
> > > > > > I suppose that ideally, if an inode's verity metadata is invalid, then fsck
> > > > > > should delete that inode's verity metadata and remove the verity flag from the
> > > > > > inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
> > > > > > fairly straightforward, but it probably wouldn't catch much compared to actually
> > > > > > validating the data against the Merkle tree.  And actually validating the data
> > > > > > against the Merkle tree would be complex and expensive.  Note, none of this
> > > > > > would work on files that are encrypted.
> > > > > > 
> > > > > > Re: libfsverity, I think it would be possible to validate a Merkle tree using
> > > > > > libfsverity_compute_digest() and the callbacks that it supports.  But that's not
> > > > > > quite what it was designed for.
> > > > > > 
> > > > > > > Is there an ioctl or something that allows userspace to validate an
> > > > > > > entire file's contents?  Sort of like what BLKVERIFY would have done for
> > > > > > > block devices, except that we might believe its answers?
> > > > > > 
> > > > > > Just reading the whole file and seeing whether you get an error would do it.
> > > > > > 
> > > > > > Though if you want to make sure it's really re-reading the on-disk data, it's
> > > > > > necessary to drop the file's pagecache first.
> > > > > 
> > > > > I tried a straight pagecache read and it worked like a charm!
> > > > > 
> > > > > But then I thought to myself, do I really want to waste memory bandwidth
> > > > > copying a bunch of data?  No.  I don't even want to incur system call
> > > > > overhead from reading a single byte every $pagesize bytes.
> > > > > 
> > > > > So I created 2M mmap areas and read a byte every $pagesize bytes.  That
> > > > > worked too, insofar as SIGBUSes are annoying to handle.  But it's
> > > > > annoying to take signals like that.
> > > > > 
> > > > > Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
> > > > > like what I wanted -- it prefaults in the pages, and "If populating
> > > > > fails, a SIGBUS signal is not generated; instead, an error is returned."
> > > > > 
> > > > 
> > > > Yes, these were the expected semantics :)
> > > > 
> > > > > But then I tried rigging up a test to see if I could catch an EIO, and
> > > > > instead I had to SIGKILL the process!  It looks filemap_fault returns
> > > > > VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
> > > > > __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
> > > > > handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_pages,
> > > > > the VM_FAULT_RETRY is translated to -EBUSY.
> > > > > 
> > > > > __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
> > > > > that to madvise_populate.  Unfortunately, madvise_populate increments
> > > > > its loop counter by the return value (still 0) so it runs in an
> > > > > infinite loop.  The only way out is SIGKILL.
> > > > 
> > > > That's certainly unexpected. One user I know is QEMU, which primarily
> > > > uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
> > > > primarily used with shmem/hugetlb, where I haven't heard of any such
> > > > endless loops.
> > > > 
> > > > > 
> > > > > So I don't know what the correct behavior is here, other than the
> > > > > infinite loop seems pretty suspect.  Is it the correct behavior that
> > > > > madvise_populate returns EIO if __get_user_pages ever returns zero?
> > > > > That doesn't quite sound right if it's the case that a zero return could
> > > > > also happen if memory is tight.
> > > > 
> > > > madvise_populate() ends up calling faultin_vma_page_range() in a loop.
> > > > That one calls __get_user_pages().
> > > > 
> > > > __get_user_pages() documents: "0 return value is possible when the fault
> > > > would need to be retried."
> > > > 
> > > > So that's what the caller does. IIRC, there are cases where we really
> > > > have to retry (at least once) and will make progress, so treating "0" as
> > > > an error would be wrong.
> > > > 
> > > > Staring at other __get_user_pages() users, __get_user_pages_locked()
> > > > documents: "Please note that this function, unlike __get_user_pages(),
> > > > will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
> > > > 
> > > > But there is some elaborate retry logic in there, whereby the retry will
> > > > set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
> > > > retry attempt (there are cases where we retry more often, but that's
> > > > related to something else I believe).
> > > > 
> > > > So maybe we need a similar retry logic in faultin_vma_page_range()? Or
> > > > make it use __get_user_pages_locked(), but I recall when I introduced
> > > > MADV_POPULATE_READ, there was a catch to it.
> > > 
> > > I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the
> > > mmap()+access case you describe above.
> > > 
> > > Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately see
> > > how we would transition from a VM_FAULT_RETRY loop to VM_FAULT_SIGBUS.
> > > Because VM_FAULT_SIGBUS would be required for that function to call
> > > do_sigbus().
> > 
> > The code I was looking at yesterday in filemap_fault was:
> > 
> > page_not_uptodate:
> > 	/*
> > 	 * Umm, take care of errors if the page isn't up-to-date.
> > 	 * Try to re-read it _once_. We do this synchronously,
> > 	 * because there really aren't any performance issues here
> > 	 * and we need to check for errors.
> > 	 */
> > 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> > 	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
> > 	if (fpin)
> > 		goto out_retry;
> > 	folio_put(folio);
> > 
> > 	if (!error || error == AOP_TRUNCATED_PAGE)
> > 		goto retry_find;
> > 	filemap_invalidate_unlock_shared(mapping);
> > 
> > 	return VM_FAULT_SIGBUS;
> > 
> > Wherein I /think/ fpin is non-null in this case, so if
> > filemap_read_folio returns an error, we'll do this instead:
> > 
> > out_retry:
> > 	/*
> > 	 * We dropped the mmap_lock, we need to return to the fault handler to
> > 	 * re-find the vma and come back and find our hopefully still populated
> > 	 * page.
> > 	 */
> > 	if (!IS_ERR(folio))
> > 		folio_put(folio);
> > 	if (mapping_locked)
> > 		filemap_invalidate_unlock_shared(mapping);
> > 	if (fpin)
> > 		fput(fpin);
> > 	return ret | VM_FAULT_RETRY;
> > 
> > and since ret was 0 before the goto, the only return code is
> > VM_FAULT_RETRY.  I had speculated that perhaps we could instead do:
> > 
> > 	if (fpin) {
> > 		if (error)
> > 			ret |= VM_FAULT_SIGBUS;
> > 		goto out_retry;
> > 	}
> > 
> > But I think the hard part here is that there doesn't seem to be any
> > distinction between transient read errors (e.g. disk cable fell out) vs.
> > semi-permanent errors (e.g. verity says the hash doesn't match).
> > AFAICT, either the read(ahead) sets uptodate and callers read the page,
> > or it doesn't set it and callers treat that as an error-retry
> > opportunity.
> > 
> > For the transient error case VM_FAULT_RETRY makes perfect sense; for the
> > second case I imagine we'd want something closer to _SIGBUS.
> 
> 
> Agreed, it's really hard to judge when it's the right time to give up
> retrying. At least with MADV_POPULATE_READ we should try achieving the same
> behavior as with mmap()+read access. So if the latter manages to trigger
> SIGBUS, MADV_POPULATE_READ should return an error.
> 
> Is there an easy way to for me to reproduce this scenario?

Yes.  Take this Makefile:

CFLAGS=-Wall -Werror -O2 -g -Wno-unused-variable

all: mpr

and this C program mpr.c:

/* test MAP_POPULATE_READ on a file */
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/mman.h>

#define min(a, b)	((a) < (b) ? (a) : (b))
#define BUFSIZE		(2097152)

int main(int argc, char *argv[])
{
	struct stat sb;
	long pagesize = sysconf(_SC_PAGESIZE);
	off_t read_sz, pos;
	void *addr;
	char c;
	int fd, ret;

	if (argc != 2) {
		printf("Usage: %s fname\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror(argv[1]);
		return 1;
	}

	ret = fstat(fd, &sb);
	if (ret) {
		perror("fstat");
		return 1;
	}

	/* Validate the file contents with regular reads */
	for (pos = 0; pos < sb.st_size; pos += sb.st_blksize) {
		ret = pread(fd, &c, 1, pos);
		if (ret < 0) {
			if (errno != EIO) {
				perror("pread");
				return 1;
			}

			printf("%s: at offset %llu: %s\n", argv[1],
					(unsigned long long)pos,
					strerror(errno));
			break;
		}
	}

	ret = pread(fd, &c, 1, sb.st_size);
	if (ret < 0) {
		if (errno != EIO) {
			perror("pread");
			return 1;
		}

		printf("%s: at offset %llu: %s\n", argv[1],
				(unsigned long long)sb.st_size,
				strerror(errno));
	}

	/* Validate the file contents with MADV_POPULATE_READ */
	read_sz = ((sb.st_size + (pagesize - 1)) / pagesize) * pagesize;
	printf("%s: read bytes %llu\n", argv[1], (unsigned long long)read_sz);

	for (pos = 0; pos < read_sz; pos += BUFSIZE) {
		unsigned int mappos;
		size_t maplen = min(read_sz - pos, BUFSIZE);

		addr = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, pos);
		if (addr == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		ret = madvise(addr, maplen, MADV_POPULATE_READ);
		if (ret) {
			perror("madvise");
			return 1;
		}

		ret = munmap(addr, maplen);
		if (ret) {
			perror("munmap");
			return 1;
		}
	}

	ret = close(fd);
	if (ret) {
		perror("close");
		return 1;
	}

	return 0;
}

and this shell script mpr.sh:

#!/bin/bash -x

# Try to trigger infinite loop with regular IO errors and MADV_POPULATE_READ

scriptdir="$(dirname "$0")"

commands=(dmsetup mkfs.xfs xfs_io timeout strace "$scriptdir/mpr")
for cmd in "${commands[@]}"; do
	if ! command -v "$cmd" &>/dev/null; then
		echo "$cmd: Command required for this program."
		exit 1
	fi
done

dev="${1:-/dev/sda}"
mnt="${2:-/mnt}"
dmtarget="dumbtarget"

# Clean up any old mounts
umount "$dev" "$mnt"
dmsetup remove "$dmtarget"
rmmod xfs

# Create dm linear mapping to block device and format filesystem
sectors="$(blockdev --getsz "$dev")"
tgt="/dev/mapper/$dmtarget"
echo "0 $sectors linear $dev 0" | dmsetup create "$dmtarget"
mkfs.xfs -f "$tgt"

# Create a file that we'll read, then cycle mount to zap pagecache
mount "$tgt" "$mnt"
xfs_io -f -c "pwrite -S 0x58 0 1m" "$mnt/a"
umount "$mnt"
mount "$tgt" "$mnt"

# Load file metadata
stat "$mnt/a"

# Induce EIO errors on read
dmsetup suspend --noflush --nolockfs "$dmtarget"
echo "0 $sectors error" | dmsetup load "$dmtarget"
dmsetup resume "$dmtarget"

# Try to provoke the kernel; kill the process after 10s so we can clean up
timeout -s KILL 10s strace -s99 -e madvise "$scriptdir/mpr" "$mnt/a"

# Stop EIO errors so we can unmount
dmsetup suspend --noflush --nolockfs "$dmtarget"
echo "0 $sectors linear $dev 0" | dmsetup load "$dmtarget"
dmsetup resume "$dmtarget"

# Unmount and clean up after ourselves
umount "$mnt"
dmsetup remove "$dmtarget"
<EOF>

make the C program, then run ./mpr.sh <device> <mountpoint>.  It should
stall in the madvise call until timeout sends sigkill to the program;
you can crank the 10s timeout up if you want.

<insert usual disclaimer that I run all these things in scratch VMs>

--D

> -- 
> Cheers,
> 
> David / dhildenb
> 
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-13 17:19                       ` Darrick J. Wong
@ 2024-03-13 19:10                         ` David Hildenbrand
  2024-03-13 21:03                           ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2024-03-13 19:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu, akpm, linux-mm, Eric Biggers

On 13.03.24 18:19, Darrick J. Wong wrote:
> On Wed, Mar 13, 2024 at 01:29:12PM +0100, David Hildenbrand wrote:
>> On 12.03.24 17:44, Darrick J. Wong wrote:
>>> On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
>>>> On 12.03.24 16:13, David Hildenbrand wrote:
>>>>> On 11.03.24 23:38, Darrick J. Wong wrote:
>>>>>> [add willy and linux-mm]
>>>>>>
>>>>>> On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
>>>>>>> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
>>>>>>>>> BTW, is xfs_repair planned to do anything about any such extra blocks?
>>>>>>>>
>>>>>>>> Sorry to answer your question with a question, but how much checking is
>>>>>>>> $filesystem expected to do for merkle trees?
>>>>>>>>
>>>>>>>> In theory xfs_repair could learn how to interpret the verity descriptor,
>>>>>>>> walk the merkle tree blocks, and even read the file data to confirm
>>>>>>>> intactness.  If the descriptor specifies the highest block address then
>>>>>>>> we could certainly trim off excess blocks.  But I don't know how much of
>>>>>>>> libfsverity actually lets you do that; I haven't looked into that
>>>>>>>> deeply. :/
>>>>>>>>
>>>>>>>> For xfs_scrub I guess the job is theoretically simpler, since we only
>>>>>>>> need to stream reads of the verity files through the page cache and let
>>>>>>>> verity tell us if the file data are consistent.
>>>>>>>>
>>>>>>>> For both tools, if something finds errors in the merkle tree structure
>>>>>>>> itself, do we turn off verity?  Or do we do something nasty like
>>>>>>>> truncate the file?
>>>>>>>
>>>>>>> As far as I know (I haven't been following btrfs-progs, but I'm familiar with
>>>>>>> e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
>>>>>>> validating the data of verity inodes against their Merkle trees.
>>>>>>>
>>>>>>> e2fsck does delete the verity metadata of inodes that don't have the verity flag
>>>>>>> enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
>>>>>>>
>>>>>>> I suppose that ideally, if an inode's verity metadata is invalid, then fsck
>>>>>>> should delete that inode's verity metadata and remove the verity flag from the
>>>>>>> inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
>>>>>>> fairly straightforward, but it probably wouldn't catch much compared to actually
>>>>>>> validating the data against the Merkle tree.  And actually validating the data
>>>>>>> against the Merkle tree would be complex and expensive.  Note, none of this
>>>>>>> would work on files that are encrypted.
>>>>>>>
>>>>>>> Re: libfsverity, I think it would be possible to validate a Merkle tree using
>>>>>>> libfsverity_compute_digest() and the callbacks that it supports.  But that's not
>>>>>>> quite what it was designed for.
>>>>>>>
>>>>>>>> Is there an ioctl or something that allows userspace to validate an
>>>>>>>> entire file's contents?  Sort of like what BLKVERIFY would have done for
>>>>>>>> block devices, except that we might believe its answers?
>>>>>>>
>>>>>>> Just reading the whole file and seeing whether you get an error would do it.
>>>>>>>
>>>>>>> Though if you want to make sure it's really re-reading the on-disk data, it's
>>>>>>> necessary to drop the file's pagecache first.
>>>>>>
>>>>>> I tried a straight pagecache read and it worked like a charm!
>>>>>>
>>>>>> But then I thought to myself, do I really want to waste memory bandwidth
>>>>>> copying a bunch of data?  No.  I don't even want to incur system call
>>>>>> overhead from reading a single byte every $pagesize bytes.
>>>>>>
>>>>>> So I created 2M mmap areas and read a byte every $pagesize bytes.  That
>>>>>> worked too, insofar as SIGBUSes are annoying to handle.  But it's
>>>>>> annoying to take signals like that.
>>>>>>
>>>>>> Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
>>>>>> like what I wanted -- it prefaults in the pages, and "If populating
>>>>>> fails, a SIGBUS signal is not generated; instead, an error is returned."
>>>>>>
>>>>>
>>>>> Yes, these were the expected semantics :)
>>>>>
>>>>>> But then I tried rigging up a test to see if I could catch an EIO, and
>>>>>> instead I had to SIGKILL the process!  It looks filemap_fault returns
>>>>>> VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
>>>>>> __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
>>>>>> handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_pages,
>>>>>> the VM_FAULT_RETRY is translated to -EBUSY.
>>>>>>
>>>>>> __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
>>>>>> that to madvise_populate.  Unfortunately, madvise_populate increments
>>>>>> its loop counter by the return value (still 0) so it runs in an
>>>>>> infinite loop.  The only way out is SIGKILL.
>>>>>
>>>>> That's certainly unexpected. One user I know is QEMU, which primarily
>>>>> uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
>>>>> primarily used with shmem/hugetlb, where I haven't heard of any such
>>>>> endless loops.
>>>>>
>>>>>>
>>>>>> So I don't know what the correct behavior is here, other than the
>>>>>> infinite loop seems pretty suspect.  Is it the correct behavior that
>>>>>> madvise_populate returns EIO if __get_user_pages ever returns zero?
>>>>>> That doesn't quite sound right if it's the case that a zero return could
>>>>>> also happen if memory is tight.
>>>>>
>>>>> madvise_populate() ends up calling faultin_vma_page_range() in a loop.
>>>>> That one calls __get_user_pages().
>>>>>
>>>>> __get_user_pages() documents: "0 return value is possible when the fault
>>>>> would need to be retried."
>>>>>
>>>>> So that's what the caller does. IIRC, there are cases where we really
>>>>> have to retry (at least once) and will make progress, so treating "0" as
>>>>> an error would be wrong.
>>>>>
>>>>> Staring at other __get_user_pages() users, __get_user_pages_locked()
>>>>> documents: "Please note that this function, unlike __get_user_pages(),
>>>>> will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
>>>>>
>>>>> But there is some elaborate retry logic in there, whereby the retry will
>>>>> set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
>>>>> retry attempt (there are cases where we retry more often, but that's
>>>>> related to something else I believe).
>>>>>
>>>>> So maybe we need a similar retry logic in faultin_vma_page_range()? Or
>>>>> make it use __get_user_pages_locked(), but I recall when I introduced
>>>>> MADV_POPULATE_READ, there was a catch to it.
>>>>
>>>> I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the
>>>> mmap()+access case you describe above.
>>>>
>>>> Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately see
>>>> how we would transition from a VM_FAULT_RETRY loop to VM_FAULT_SIGBUS.
>>>> Because VM_FAULT_SIGBUS would be required for that function to call
>>>> do_sigbus().
>>>
>>> The code I was looking at yesterday in filemap_fault was:
>>>
>>> page_not_uptodate:
>>> 	/*
>>> 	 * Umm, take care of errors if the page isn't up-to-date.
>>> 	 * Try to re-read it _once_. We do this synchronously,
>>> 	 * because there really aren't any performance issues here
>>> 	 * and we need to check for errors.
>>> 	 */
>>> 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>>> 	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
>>> 	if (fpin)
>>> 		goto out_retry;
>>> 	folio_put(folio);
>>>
>>> 	if (!error || error == AOP_TRUNCATED_PAGE)
>>> 		goto retry_find;
>>> 	filemap_invalidate_unlock_shared(mapping);
>>>
>>> 	return VM_FAULT_SIGBUS;
>>>
>>> Wherein I /think/ fpin is non-null in this case, so if
>>> filemap_read_folio returns an error, we'll do this instead:
>>>
>>> out_retry:
>>> 	/*
>>> 	 * We dropped the mmap_lock, we need to return to the fault handler to
>>> 	 * re-find the vma and come back and find our hopefully still populated
>>> 	 * page.
>>> 	 */
>>> 	if (!IS_ERR(folio))
>>> 		folio_put(folio);
>>> 	if (mapping_locked)
>>> 		filemap_invalidate_unlock_shared(mapping);
>>> 	if (fpin)
>>> 		fput(fpin);
>>> 	return ret | VM_FAULT_RETRY;
>>>
>>> and since ret was 0 before the goto, the only return code is
>>> VM_FAULT_RETRY.  I had speculated that perhaps we could instead do:
>>>
>>> 	if (fpin) {
>>> 		if (error)
>>> 			ret |= VM_FAULT_SIGBUS;
>>> 		goto out_retry;
>>> 	}
>>>
>>> But I think the hard part here is that there doesn't seem to be any
>>> distinction between transient read errors (e.g. disk cable fell out) vs.
>>> semi-permanent errors (e.g. verity says the hash doesn't match).
>>> AFAICT, either the read(ahead) sets uptodate and callers read the page,
>>> or it doesn't set it and callers treat that as an error-retry
>>> opportunity.
>>>
>>> For the transient error case VM_FAULT_RETRY makes perfect sense; for the
>>> second case I imagine we'd want something closer to _SIGBUS.
>>
>>
>> Agreed, it's really hard to judge when it's the right time to give up
>> retrying. At least with MADV_POPULATE_READ we should try achieving the same
>> behavior as with mmap()+read access. So if the latter manages to trigger
>> SIGBUS, MADV_POPULATE_READ should return an error.
>>
>> Is there an easy way to for me to reproduce this scenario?
> 
> Yes.  Take this Makefile:
> 
> CFLAGS=-Wall -Werror -O2 -g -Wno-unused-variable
> 
> all: mpr
> 
> and this C program mpr.c:
> 
> /* test MAP_POPULATE_READ on a file */
> #include <stdio.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <string.h>
> #include <sys/stat.h>
> #include <sys/mman.h>
> 
> #define min(a, b)	((a) < (b) ? (a) : (b))
> #define BUFSIZE		(2097152)
> 
> int main(int argc, char *argv[])
> {
> 	struct stat sb;
> 	long pagesize = sysconf(_SC_PAGESIZE);
> 	off_t read_sz, pos;
> 	void *addr;
> 	char c;
> 	int fd, ret;
> 
> 	if (argc != 2) {
> 		printf("Usage: %s fname\n", argv[0]);
> 		return 1;
> 	}
> 
> 	fd = open(argv[1], O_RDONLY);
> 	if (fd < 0) {
> 		perror(argv[1]);
> 		return 1;
> 	}
> 
> 	ret = fstat(fd, &sb);
> 	if (ret) {
> 		perror("fstat");
> 		return 1;
> 	}
> 
> 	/* Validate the file contents with regular reads */
> 	for (pos = 0; pos < sb.st_size; pos += sb.st_blksize) {
> 		ret = pread(fd, &c, 1, pos);
> 		if (ret < 0) {
> 			if (errno != EIO) {
> 				perror("pread");
> 				return 1;
> 			}
> 
> 			printf("%s: at offset %llu: %s\n", argv[1],
> 					(unsigned long long)pos,
> 					strerror(errno));
> 			break;
> 		}
> 	}
> 
> 	ret = pread(fd, &c, 1, sb.st_size);
> 	if (ret < 0) {
> 		if (errno != EIO) {
> 			perror("pread");
> 			return 1;
> 		}
> 
> 		printf("%s: at offset %llu: %s\n", argv[1],
> 				(unsigned long long)sb.st_size,
> 				strerror(errno));
> 	}
> 
> 	/* Validate the file contents with MADV_POPULATE_READ */
> 	read_sz = ((sb.st_size + (pagesize - 1)) / pagesize) * pagesize;
> 	printf("%s: read bytes %llu\n", argv[1], (unsigned long long)read_sz);
> 
> 	for (pos = 0; pos < read_sz; pos += BUFSIZE) {
> 		unsigned int mappos;
> 		size_t maplen = min(read_sz - pos, BUFSIZE);
> 
> 		addr = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, pos);
> 		if (addr == MAP_FAILED) {
> 			perror("mmap");
> 			return 1;
> 		}
> 
> 		ret = madvise(addr, maplen, MADV_POPULATE_READ);
> 		if (ret) {
> 			perror("madvise");
> 			return 1;
> 		}
> 
> 		ret = munmap(addr, maplen);
> 		if (ret) {
> 			perror("munmap");
> 			return 1;
> 		}
> 	}
> 
> 	ret = close(fd);
> 	if (ret) {
> 		perror("close");
> 		return 1;
> 	}
> 
> 	return 0;
> }
> 
> and this shell script mpr.sh:
> 
> #!/bin/bash -x
> 
> # Try to trigger infinite loop with regular IO errors and MADV_POPULATE_READ
> 
> scriptdir="$(dirname "$0")"
> 
> commands=(dmsetup mkfs.xfs xfs_io timeout strace "$scriptdir/mpr")
> for cmd in "${commands[@]}"; do
> 	if ! command -v "$cmd" &>/dev/null; then
> 		echo "$cmd: Command required for this program."
> 		exit 1
> 	fi
> done
> 
> dev="${1:-/dev/sda}"
> mnt="${2:-/mnt}"
> dmtarget="dumbtarget"
> 
> # Clean up any old mounts
> umount "$dev" "$mnt"
> dmsetup remove "$dmtarget"
> rmmod xfs
> 
> # Create dm linear mapping to block device and format filesystem
> sectors="$(blockdev --getsz "$dev")"
> tgt="/dev/mapper/$dmtarget"
> echo "0 $sectors linear $dev 0" | dmsetup create "$dmtarget"
> mkfs.xfs -f "$tgt"
> 
> # Create a file that we'll read, then cycle mount to zap pagecache
> mount "$tgt" "$mnt"
> xfs_io -f -c "pwrite -S 0x58 0 1m" "$mnt/a"
> umount "$mnt"
> mount "$tgt" "$mnt"
> 
> # Load file metadata
> stat "$mnt/a"
> 
> # Induce EIO errors on read
> dmsetup suspend --noflush --nolockfs "$dmtarget"
> echo "0 $sectors error" | dmsetup load "$dmtarget"
> dmsetup resume "$dmtarget"
> 
> # Try to provoke the kernel; kill the process after 10s so we can clean up
> timeout -s KILL 10s strace -s99 -e madvise "$scriptdir/mpr" "$mnt/a"
> 
> # Stop EIO errors so we can unmount
> dmsetup suspend --noflush --nolockfs "$dmtarget"
> echo "0 $sectors linear $dev 0" | dmsetup load "$dmtarget"
> dmsetup resume "$dmtarget"
> 
> # Unmount and clean up after ourselves
> umount "$mnt"
> dmsetup remove "$dmtarget"
> <EOF>
> 
> make the C program, then run ./mpr.sh <device> <mountpoint>.  It should
> stall in the madvise call until timeout sends sigkill to the program;
> you can crank the 10s timeout up if you want.
> 
> <insert usual disclaimer that I run all these things in scratch VMs>

Yes, seems to work, nice!


[  452.455636] buffer_io_error: 6 callbacks suppressed
[  452.455638] Buffer I/O error on dev dm-0, logical block 16, async page read
[  452.456169] Buffer I/O error on dev dm-0, logical block 17, async page read
[  452.456456] Buffer I/O error on dev dm-0, logical block 18, async page read
[  452.456754] Buffer I/O error on dev dm-0, logical block 19, async page read
[  452.457061] Buffer I/O error on dev dm-0, logical block 20, async page read
[  452.457350] Buffer I/O error on dev dm-0, logical block 21, async page read
[  452.457639] Buffer I/O error on dev dm-0, logical block 22, async page read
[  452.457942] Buffer I/O error on dev dm-0, logical block 23, async page read
[  452.458242] Buffer I/O error on dev dm-0, logical block 16, async page read
[  452.458552] Buffer I/O error on dev dm-0, logical block 17, async page read
+ timeout -s KILL 10s strace -s99 -e madvise ./mpr /mnt/tmp//a
/mnt/tmp//a: at offset 0: Input/output error
/mnt/tmp//a: read bytes 1048576
madvise(0x7f9393624000, 1048576, MADV_POPULATE_READ./mpr.sh: line 45:  2070 Killed                  tim"


And once I switch over to reading instead of MADV_POPULATE_READ:

[  753.940230] buffer_io_error: 6 callbacks suppressed
[  753.940233] Buffer I/O error on dev dm-0, logical block 8192, async page read
[  753.941402] Buffer I/O error on dev dm-0, logical block 8193, async page read
[  753.942084] Buffer I/O error on dev dm-0, logical block 8194, async page read
[  753.942738] Buffer I/O error on dev dm-0, logical block 8195, async page read
[  753.943412] Buffer I/O error on dev dm-0, logical block 8196, async page read
[  753.944088] Buffer I/O error on dev dm-0, logical block 8197, async page read
[  753.944741] Buffer I/O error on dev dm-0, logical block 8198, async page read
[  753.945415] Buffer I/O error on dev dm-0, logical block 8199, async page read
[  753.946105] Buffer I/O error on dev dm-0, logical block 8192, async page read
[  753.946661] Buffer I/O error on dev dm-0, logical block 8193, async page read
+ timeout -s KILL 10s strace -s99 -e madvise ./mpr /mnt/tmp//a
/mnt/tmp//a: at offset 0: Input/output error
/mnt/tmp//a: read bytes 1048576
--- SIGBUS {si_signo=SIGBUS, si_code=BUS_ADRERR, si_addr=0x7f34f82d8000} ---
+++ killed by SIGBUS (core dumped) +++
timeout: the monitored command dumped core
./mpr.sh: line 45:  2388 Bus error               timeout -s KILL 10s strace -s99 -e madvise "$scriptdir"


Let me dig how the fault handler is able to conclude SIGBUS here!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-13 19:10                         ` David Hildenbrand
@ 2024-03-13 21:03                           ` David Hildenbrand
  0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand @ 2024-03-13 21:03 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu, akpm, linux-mm, Eric Biggers

On 13.03.24 20:10, David Hildenbrand wrote:
> On 13.03.24 18:19, Darrick J. Wong wrote:
>> On Wed, Mar 13, 2024 at 01:29:12PM +0100, David Hildenbrand wrote:
>>> On 12.03.24 17:44, Darrick J. Wong wrote:
>>>> On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
>>>>> On 12.03.24 16:13, David Hildenbrand wrote:
>>>>>> On 11.03.24 23:38, Darrick J. Wong wrote:
>>>>>>> [add willy and linux-mm]
>>>>>>>
>>>>>>> On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
>>>>>>>> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
>>>>>>>>>> BTW, is xfs_repair planned to do anything about any such extra blocks?
>>>>>>>>>
>>>>>>>>> Sorry to answer your question with a question, but how much checking is
>>>>>>>>> $filesystem expected to do for merkle trees?
>>>>>>>>>
>>>>>>>>> In theory xfs_repair could learn how to interpret the verity descriptor,
>>>>>>>>> walk the merkle tree blocks, and even read the file data to confirm
>>>>>>>>> intactness.  If the descriptor specifies the highest block address then
>>>>>>>>> we could certainly trim off excess blocks.  But I don't know how much of
>>>>>>>>> libfsverity actually lets you do that; I haven't looked into that
>>>>>>>>> deeply. :/
>>>>>>>>>
>>>>>>>>> For xfs_scrub I guess the job is theoretically simpler, since we only
>>>>>>>>> need to stream reads of the verity files through the page cache and let
>>>>>>>>> verity tell us if the file data are consistent.
>>>>>>>>>
>>>>>>>>> For both tools, if something finds errors in the merkle tree structure
>>>>>>>>> itself, do we turn off verity?  Or do we do something nasty like
>>>>>>>>> truncate the file?
>>>>>>>>
>>>>>>>> As far as I know (I haven't been following btrfs-progs, but I'm familiar with
>>>>>>>> e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
>>>>>>>> validating the data of verity inodes against their Merkle trees.
>>>>>>>>
>>>>>>>> e2fsck does delete the verity metadata of inodes that don't have the verity flag
>>>>>>>> enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
>>>>>>>>
>>>>>>>> I suppose that ideally, if an inode's verity metadata is invalid, then fsck
>>>>>>>> should delete that inode's verity metadata and remove the verity flag from the
>>>>>>>> inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
>>>>>>>> fairly straightforward, but it probably wouldn't catch much compared to actually
>>>>>>>> validating the data against the Merkle tree.  And actually validating the data
>>>>>>>> against the Merkle tree would be complex and expensive.  Note, none of this
>>>>>>>> would work on files that are encrypted.
>>>>>>>>
>>>>>>>> Re: libfsverity, I think it would be possible to validate a Merkle tree using
>>>>>>>> libfsverity_compute_digest() and the callbacks that it supports.  But that's not
>>>>>>>> quite what it was designed for.
>>>>>>>>
>>>>>>>>> Is there an ioctl or something that allows userspace to validate an
>>>>>>>>> entire file's contents?  Sort of like what BLKVERIFY would have done for
>>>>>>>>> block devices, except that we might believe its answers?
>>>>>>>>
>>>>>>>> Just reading the whole file and seeing whether you get an error would do it.
>>>>>>>>
>>>>>>>> Though if you want to make sure it's really re-reading the on-disk data, it's
>>>>>>>> necessary to drop the file's pagecache first.
>>>>>>>
>>>>>>> I tried a straight pagecache read and it worked like a charm!
>>>>>>>
>>>>>>> But then I thought to myself, do I really want to waste memory bandwidth
>>>>>>> copying a bunch of data?  No.  I don't even want to incur system call
>>>>>>> overhead from reading a single byte every $pagesize bytes.
>>>>>>>
>>>>>>> So I created 2M mmap areas and read a byte every $pagesize bytes.  That
>>>>>>> worked too, insofar as SIGBUSes are annoying to handle.  But it's
>>>>>>> annoying to take signals like that.
>>>>>>>
>>>>>>> Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
>>>>>>> like what I wanted -- it prefaults in the pages, and "If populating
>>>>>>> fails, a SIGBUS signal is not generated; instead, an error is returned."
>>>>>>>
>>>>>>
>>>>>> Yes, these were the expected semantics :)
>>>>>>
>>>>>>> But then I tried rigging up a test to see if I could catch an EIO, and
>>>>>>> instead I had to SIGKILL the process!  It looks filemap_fault returns
>>>>>>> VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
>>>>>>> __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
>>>>>>> handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_pages,
>>>>>>> the VM_FAULT_RETRY is translated to -EBUSY.
>>>>>>>
>>>>>>> __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
>>>>>>> that to madvise_populate.  Unfortunately, madvise_populate increments
>>>>>>> its loop counter by the return value (still 0) so it runs in an
>>>>>>> infinite loop.  The only way out is SIGKILL.
>>>>>>
>>>>>> That's certainly unexpected. One user I know is QEMU, which primarily
>>>>>> uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
>>>>>> primarily used with shmem/hugetlb, where I haven't heard of any such
>>>>>> endless loops.
>>>>>>
>>>>>>>
>>>>>>> So I don't know what the correct behavior is here, other than the
>>>>>>> infinite loop seems pretty suspect.  Is it the correct behavior that
>>>>>>> madvise_populate returns EIO if __get_user_pages ever returns zero?
>>>>>>> That doesn't quite sound right if it's the case that a zero return could
>>>>>>> also happen if memory is tight.
>>>>>>
>>>>>> madvise_populate() ends up calling faultin_vma_page_range() in a loop.
>>>>>> That one calls __get_user_pages().
>>>>>>
>>>>>> __get_user_pages() documents: "0 return value is possible when the fault
>>>>>> would need to be retried."
>>>>>>
>>>>>> So that's what the caller does. IIRC, there are cases where we really
>>>>>> have to retry (at least once) and will make progress, so treating "0" as
>>>>>> an error would be wrong.
>>>>>>
>>>>>> Staring at other __get_user_pages() users, __get_user_pages_locked()
>>>>>> documents: "Please note that this function, unlike __get_user_pages(),
>>>>>> will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
>>>>>>
>>>>>> But there is some elaborate retry logic in there, whereby the retry will
>>>>>> set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
>>>>>> retry attempt (there are cases where we retry more often, but that's
>>>>>> related to something else I believe).
>>>>>>
>>>>>> So maybe we need a similar retry logic in faultin_vma_page_range()? Or
>>>>>> make it use __get_user_pages_locked(), but I recall when I introduced
>>>>>> MADV_POPULATE_READ, there was a catch to it.
>>>>>
>>>>> I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the
>>>>> mmap()+access case you describe above.
>>>>>
>>>>> Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately see
>>>>> how we would transition from a VM_FAULT_RETRY loop to VM_FAULT_SIGBUS.
>>>>> Because VM_FAULT_SIGBUS would be required for that function to call
>>>>> do_sigbus().
>>>>
>>>> The code I was looking at yesterday in filemap_fault was:
>>>>
>>>> page_not_uptodate:
>>>> 	/*
>>>> 	 * Umm, take care of errors if the page isn't up-to-date.
>>>> 	 * Try to re-read it _once_. We do this synchronously,
>>>> 	 * because there really aren't any performance issues here
>>>> 	 * and we need to check for errors.
>>>> 	 */
>>>> 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>>>> 	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
>>>> 	if (fpin)
>>>> 		goto out_retry;
>>>> 	folio_put(folio);
>>>>
>>>> 	if (!error || error == AOP_TRUNCATED_PAGE)
>>>> 		goto retry_find;
>>>> 	filemap_invalidate_unlock_shared(mapping);
>>>>
>>>> 	return VM_FAULT_SIGBUS;
>>>>
>>>> Wherein I /think/ fpin is non-null in this case, so if
>>>> filemap_read_folio returns an error, we'll do this instead:
>>>>
>>>> out_retry:
>>>> 	/*
>>>> 	 * We dropped the mmap_lock, we need to return to the fault handler to
>>>> 	 * re-find the vma and come back and find our hopefully still populated
>>>> 	 * page.
>>>> 	 */
>>>> 	if (!IS_ERR(folio))
>>>> 		folio_put(folio);
>>>> 	if (mapping_locked)
>>>> 		filemap_invalidate_unlock_shared(mapping);
>>>> 	if (fpin)
>>>> 		fput(fpin);
>>>> 	return ret | VM_FAULT_RETRY;
>>>>
>>>> and since ret was 0 before the goto, the only return code is
>>>> VM_FAULT_RETRY.  I had speculated that perhaps we could instead do:
>>>>
>>>> 	if (fpin) {
>>>> 		if (error)
>>>> 			ret |= VM_FAULT_SIGBUS;
>>>> 		goto out_retry;
>>>> 	}
>>>>
>>>> But I think the hard part here is that there doesn't seem to be any
>>>> distinction between transient read errors (e.g. disk cable fell out) vs.
>>>> semi-permanent errors (e.g. verity says the hash doesn't match).
>>>> AFAICT, either the read(ahead) sets uptodate and callers read the page,
>>>> or it doesn't set it and callers treat that as an error-retry
>>>> opportunity.
>>>>
>>>> For the transient error case VM_FAULT_RETRY makes perfect sense; for the
>>>> second case I imagine we'd want something closer to _SIGBUS.
>>>
>>>
>>> Agreed, it's really hard to judge when it's the right time to give up
>>> retrying. At least with MADV_POPULATE_READ we should try achieving the same
>>> behavior as with mmap()+read access. So if the latter manages to trigger
>>> SIGBUS, MADV_POPULATE_READ should return an error.
>>>
>>> Is there an easy way to for me to reproduce this scenario?
>>
>> Yes.  Take this Makefile:
>>
>> CFLAGS=-Wall -Werror -O2 -g -Wno-unused-variable
>>
>> all: mpr
>>
>> and this C program mpr.c:
>>
>> /* test MAP_POPULATE_READ on a file */
>> #include <stdio.h>
>> #include <errno.h>
>> #include <fcntl.h>
>> #include <unistd.h>
>> #include <string.h>
>> #include <sys/stat.h>
>> #include <sys/mman.h>
>>
>> #define min(a, b)	((a) < (b) ? (a) : (b))
>> #define BUFSIZE		(2097152)
>>
>> int main(int argc, char *argv[])
>> {
>> 	struct stat sb;
>> 	long pagesize = sysconf(_SC_PAGESIZE);
>> 	off_t read_sz, pos;
>> 	void *addr;
>> 	char c;
>> 	int fd, ret;
>>
>> 	if (argc != 2) {
>> 		printf("Usage: %s fname\n", argv[0]);
>> 		return 1;
>> 	}
>>
>> 	fd = open(argv[1], O_RDONLY);
>> 	if (fd < 0) {
>> 		perror(argv[1]);
>> 		return 1;
>> 	}
>>
>> 	ret = fstat(fd, &sb);
>> 	if (ret) {
>> 		perror("fstat");
>> 		return 1;
>> 	}
>>
>> 	/* Validate the file contents with regular reads */
>> 	for (pos = 0; pos < sb.st_size; pos += sb.st_blksize) {
>> 		ret = pread(fd, &c, 1, pos);
>> 		if (ret < 0) {
>> 			if (errno != EIO) {
>> 				perror("pread");
>> 				return 1;
>> 			}
>>
>> 			printf("%s: at offset %llu: %s\n", argv[1],
>> 					(unsigned long long)pos,
>> 					strerror(errno));
>> 			break;
>> 		}
>> 	}
>>
>> 	ret = pread(fd, &c, 1, sb.st_size);
>> 	if (ret < 0) {
>> 		if (errno != EIO) {
>> 			perror("pread");
>> 			return 1;
>> 		}
>>
>> 		printf("%s: at offset %llu: %s\n", argv[1],
>> 				(unsigned long long)sb.st_size,
>> 				strerror(errno));
>> 	}
>>
>> 	/* Validate the file contents with MADV_POPULATE_READ */
>> 	read_sz = ((sb.st_size + (pagesize - 1)) / pagesize) * pagesize;
>> 	printf("%s: read bytes %llu\n", argv[1], (unsigned long long)read_sz);
>>
>> 	for (pos = 0; pos < read_sz; pos += BUFSIZE) {
>> 		unsigned int mappos;
>> 		size_t maplen = min(read_sz - pos, BUFSIZE);
>>
>> 		addr = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, pos);
>> 		if (addr == MAP_FAILED) {
>> 			perror("mmap");
>> 			return 1;
>> 		}
>>
>> 		ret = madvise(addr, maplen, MADV_POPULATE_READ);
>> 		if (ret) {
>> 			perror("madvise");
>> 			return 1;
>> 		}
>>
>> 		ret = munmap(addr, maplen);
>> 		if (ret) {
>> 			perror("munmap");
>> 			return 1;
>> 		}
>> 	}
>>
>> 	ret = close(fd);
>> 	if (ret) {
>> 		perror("close");
>> 		return 1;
>> 	}
>>
>> 	return 0;
>> }
>>
>> and this shell script mpr.sh:
>>
>> #!/bin/bash -x
>>
>> # Try to trigger infinite loop with regular IO errors and MADV_POPULATE_READ
>>
>> scriptdir="$(dirname "$0")"
>>
>> commands=(dmsetup mkfs.xfs xfs_io timeout strace "$scriptdir/mpr")
>> for cmd in "${commands[@]}"; do
>> 	if ! command -v "$cmd" &>/dev/null; then
>> 		echo "$cmd: Command required for this program."
>> 		exit 1
>> 	fi
>> done
>>
>> dev="${1:-/dev/sda}"
>> mnt="${2:-/mnt}"
>> dmtarget="dumbtarget"
>>
>> # Clean up any old mounts
>> umount "$dev" "$mnt"
>> dmsetup remove "$dmtarget"
>> rmmod xfs
>>
>> # Create dm linear mapping to block device and format filesystem
>> sectors="$(blockdev --getsz "$dev")"
>> tgt="/dev/mapper/$dmtarget"
>> echo "0 $sectors linear $dev 0" | dmsetup create "$dmtarget"
>> mkfs.xfs -f "$tgt"
>>
>> # Create a file that we'll read, then cycle mount to zap pagecache
>> mount "$tgt" "$mnt"
>> xfs_io -f -c "pwrite -S 0x58 0 1m" "$mnt/a"
>> umount "$mnt"
>> mount "$tgt" "$mnt"
>>
>> # Load file metadata
>> stat "$mnt/a"
>>
>> # Induce EIO errors on read
>> dmsetup suspend --noflush --nolockfs "$dmtarget"
>> echo "0 $sectors error" | dmsetup load "$dmtarget"
>> dmsetup resume "$dmtarget"
>>
>> # Try to provoke the kernel; kill the process after 10s so we can clean up
>> timeout -s KILL 10s strace -s99 -e madvise "$scriptdir/mpr" "$mnt/a"
>>
>> # Stop EIO errors so we can unmount
>> dmsetup suspend --noflush --nolockfs "$dmtarget"
>> echo "0 $sectors linear $dev 0" | dmsetup load "$dmtarget"
>> dmsetup resume "$dmtarget"
>>
>> # Unmount and clean up after ourselves
>> umount "$mnt"
>> dmsetup remove "$dmtarget"
>> <EOF>
>>
>> make the C program, then run ./mpr.sh <device> <mountpoint>.  It should
>> stall in the madvise call until timeout sends sigkill to the program;
>> you can crank the 10s timeout up if you want.
>>
>> <insert usual disclaimer that I run all these things in scratch VMs>
> 
> Yes, seems to work, nice!
> 
> 
> [  452.455636] buffer_io_error: 6 callbacks suppressed
> [  452.455638] Buffer I/O error on dev dm-0, logical block 16, async page read
> [  452.456169] Buffer I/O error on dev dm-0, logical block 17, async page read
> [  452.456456] Buffer I/O error on dev dm-0, logical block 18, async page read
> [  452.456754] Buffer I/O error on dev dm-0, logical block 19, async page read
> [  452.457061] Buffer I/O error on dev dm-0, logical block 20, async page read
> [  452.457350] Buffer I/O error on dev dm-0, logical block 21, async page read
> [  452.457639] Buffer I/O error on dev dm-0, logical block 22, async page read
> [  452.457942] Buffer I/O error on dev dm-0, logical block 23, async page read
> [  452.458242] Buffer I/O error on dev dm-0, logical block 16, async page read
> [  452.458552] Buffer I/O error on dev dm-0, logical block 17, async page read
> + timeout -s KILL 10s strace -s99 -e madvise ./mpr /mnt/tmp//a
> /mnt/tmp//a: at offset 0: Input/output error
> /mnt/tmp//a: read bytes 1048576
> madvise(0x7f9393624000, 1048576, MADV_POPULATE_READ./mpr.sh: line 45:  2070 Killed                  tim"
> 
> 
> And once I switch over to reading instead of MADV_POPULATE_READ:
> 
> [  753.940230] buffer_io_error: 6 callbacks suppressed
> [  753.940233] Buffer I/O error on dev dm-0, logical block 8192, async page read
> [  753.941402] Buffer I/O error on dev dm-0, logical block 8193, async page read
> [  753.942084] Buffer I/O error on dev dm-0, logical block 8194, async page read
> [  753.942738] Buffer I/O error on dev dm-0, logical block 8195, async page read
> [  753.943412] Buffer I/O error on dev dm-0, logical block 8196, async page read
> [  753.944088] Buffer I/O error on dev dm-0, logical block 8197, async page read
> [  753.944741] Buffer I/O error on dev dm-0, logical block 8198, async page read
> [  753.945415] Buffer I/O error on dev dm-0, logical block 8199, async page read
> [  753.946105] Buffer I/O error on dev dm-0, logical block 8192, async page read
> [  753.946661] Buffer I/O error on dev dm-0, logical block 8193, async page read
> + timeout -s KILL 10s strace -s99 -e madvise ./mpr /mnt/tmp//a
> /mnt/tmp//a: at offset 0: Input/output error
> /mnt/tmp//a: read bytes 1048576
> --- SIGBUS {si_signo=SIGBUS, si_code=BUS_ADRERR, si_addr=0x7f34f82d8000} ---
> +++ killed by SIGBUS (core dumped) +++
> timeout: the monitored command dumped core
> ./mpr.sh: line 45:  2388 Bus error               timeout -s KILL 10s strace -s99 -e madvise "$scriptdir"
> 
> 
> Let me dig how the fault handler is able to conclude SIGBUS here!

Might be as simple as this:

diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d5..62df1548b7779 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1734,8 +1734,8 @@ long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
         if (check_vma_flags(vma, gup_flags))
                 return -EINVAL;
  
-       ret = __get_user_pages(mm, start, nr_pages, gup_flags,
-                              NULL, locked);
+       ret = __get_user_pages_locked(mm, start, nr_pages, NULL, locked,
+                                     gup_flags);
         lru_add_drain();
         return ret;
  }


[   57.154542] Buffer I/O error on dev dm-0, logical block 16, async page read
[   57.155276] Buffer I/O error on dev dm-0, logical block 17, async page read
[   57.155911] Buffer I/O error on dev dm-0, logical block 18, async page read
[   57.156568] Buffer I/O error on dev dm-0, logical block 19, async page read
[   57.157245] Buffer I/O error on dev dm-0, logical block 20, async page read
[   57.157880] Buffer I/O error on dev dm-0, logical block 21, async page read
[   57.158539] Buffer I/O error on dev dm-0, logical block 22, async page read
[   57.159197] Buffer I/O error on dev dm-0, logical block 23, async page read
[   57.159838] Buffer I/O error on dev dm-0, logical block 16, async page read
[   57.160492] Buffer I/O error on dev dm-0, logical block 17, async page read
+ timeout -s KILL 10s strace -s99 -e madvise ./mpr /mnt/tmp//a
[   57.190472] Retrying
/mnt/tmp//a: at offset 0: Input/output error
/mnt/tmp//a: read bytes 1048576
madvise(0x7f0fa0384000, 1048576, MADV_POPULATE_READ) = -1 EFAULT (Bad address)
madvise: Bad address


And -EFAULT is what MADV_POPULATE_READ is documented to return for SIGBUS.

(there are a couple of ways we could speedup MADV_POPULATE_READ, maybe
at some point using VMA locks)

-- 
Cheers,

David / dhildenb



^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-03-13 21:03 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20240304191046.157464-2-aalbersh@redhat.com>
     [not found] ` <20240304191046.157464-8-aalbersh@redhat.com>
     [not found]   ` <20240305005242.GE17145@sol.localdomain>
     [not found]     ` <20240306163000.GP1927156@frogsfrogsfrogs>
     [not found]       ` <20240307220224.GA1799@sol.localdomain>
     [not found]         ` <20240308034650.GK1927156@frogsfrogsfrogs>
     [not found]           ` <20240308044017.GC8111@sol.localdomain>
2024-03-11 22:38             ` [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity() Darrick J. Wong
2024-03-12 15:13               ` David Hildenbrand
     [not found]                 ` <08905bcc-677d-4981-926d-7f407b2f6a4a@redhat.com>
2024-03-12 16:44                   ` Darrick J. Wong
2024-03-13 12:29                     ` David Hildenbrand
2024-03-13 17:19                       ` Darrick J. Wong
2024-03-13 19:10                         ` David Hildenbrand
2024-03-13 21:03                           ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).