* inline extents
@ 2018-09-19 16:23 Chris Murphy
2018-09-20 1:45 ` Dave Chinner
0 siblings, 1 reply; 4+ messages in thread
From: Chris Murphy @ 2018-09-19 16:23 UTC (permalink / raw)
To: xfs list
Hi,
I ran across this:
xfs: introduce inode data inline feature
https://lwn.net/Articles/759183/
I'm wondering if there's a chance it becomes the default one day in
the not too distant future? And if it is enabled at all (user or by
default), what happens if something where to directly overwrite 1024
bytes outside of the file system?
For example:
The file /boot/grub/grubenv is 1KiB and stores GRUB environment info,
like what the default boot should be. Fedora up until now only writes
to it when booted, so it goes through the file system, and the
bootloader only reads from it.
Fedora 29 has a new feature to test if boot+startup fails, so the
bootloader can do a fallback at next boot, to a previously working
entry. Part of this means GRUB (the bootloader code, not the user
space code) uses "save_env" to overwrite the 1024 data bytes with
updated environment information.
On something like FAT or ext2 or even ext4 without checksums, this
isn't a problem. On Btrfs, 1KiB is almost always going to be an inline
extent, found inside a 16KiB leaf, and that leaf has a checksum
predicated on the entire contents of that leaf. Overwrite 1KiB outside
the file system and now the checksum is wrong, the kernel code will
consider the entire 16KiB leaf corrupt. And that leaf might contain
items totally unrelated to the file being modified so it could be a
rather significant corruption. And it may not be fixable (I haven't
really tested this yet and I think GRUB knows better than to write to
a grubenv on Btrfs anyway).
For XFS, I'm not sure how the inline extent is saved, and whether
metadata checksumming includes or excludes the inline extent.
I'm also kinda ignoring the reflink ramifications of this behavior,
for now. Let's just say even if there's no corruption I'm really
suspicious of bootloader code writing anything, even what seems to be
a simple overwrite of two sectors.
Thanks,
--
Chris Murphy
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: inline extents
2018-09-19 16:23 inline extents Chris Murphy
@ 2018-09-20 1:45 ` Dave Chinner
2018-09-20 3:18 ` Chris Murphy
0 siblings, 1 reply; 4+ messages in thread
From: Dave Chinner @ 2018-09-20 1:45 UTC (permalink / raw)
To: Chris Murphy; +Cc: xfs list
On Wed, Sep 19, 2018 at 10:23:38AM -0600, Chris Murphy wrote:
> Hi,
>
> I ran across this:
> xfs: introduce inode data inline feature
> https://lwn.net/Articles/759183/
>
> I'm wondering if there's a chance it becomes the default one day in
> the not too distant future? And if it is enabled at all (user or by
> default), what happens if something where to directly overwrite 1024
> bytes outside of the file system?
First somebody has to implement the feature in a way that it can be
merged. :)
> For example:
>
> The file /boot/grub/grubenv is 1KiB and stores GRUB environment info,
> like what the default boot should be. Fedora up until now only writes
> to it when booted, so it goes through the file system, and the
> bootloader only reads from it.
>
> Fedora 29 has a new feature to test if boot+startup fails, so the
> bootloader can do a fallback at next boot, to a previously working
> entry. Part of this means GRUB (the bootloader code, not the user
> space code) uses "save_env" to overwrite the 1024 data bytes with
> updated environment information.
That's just broken. Illegal. Completely unsupportable. Doesn't
matter what the filesystem is, nobody is allowed to write directly
to the block device a filesystem owns.
> On something like FAT or ext2 or even ext4 without checksums, this
> isn't a problem. On Btrfs, 1KiB is almost always going to be an inline
> extent, found inside a 16KiB leaf, and that leaf has a checksum
> predicated on the entire contents of that leaf. Overwrite 1KiB outside
> the file system and now the checksum is wrong, the kernel code will
> consider the entire 16KiB leaf corrupt.
Yup. Or it's a shared data extent (i.e. a reflinked or deduped file)
and writing to it corrupts the other copies because the filesystem
wasn't able to COW it.
> And that leaf might contain
> items totally unrelated to the file being modified so it could be a
> rather significant corruption. And it may not be fixable (I haven't
> really tested this yet and I think GRUB knows better than to write to
> a grubenv on Btrfs anyway).
ext4 has inline data, too, so there's every chance grub will corrupt
ext4 filesystems with tit's wonderful new feature. I'm not sure if
the ext4 metadata cksums cover the entire inode and inline data, but
if they do it's the same problem as btrfs.
> For XFS, I'm not sure how the inline extent is saved, and whether
> metadata checksumming includes or excludes the inline extent.
When XFS implements this, it will be like btrfs as the data will be
covered by the metadata CRCs for the inode, and so writing directly
to it would corrupt the inode and render it unreadable by the
filesystem.
> I'm also kinda ignoring the reflink ramifications of this behavior,
> for now. Let's just say even if there's no corruption I'm really
> suspicious of bootloader code writing anything, even what seems to be
> a simple overwrite of two sectors.
You're not the only one
Like I said, it doesn't matter what the filesystem is, overwriting
file data by writing directly to the block device is not
supportable. It's essentially a filesystem corruption vector, and
grub needs to have that functionality removed immediately.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: inline extents
2018-09-20 1:45 ` Dave Chinner
@ 2018-09-20 3:18 ` Chris Murphy
2018-09-20 19:34 ` Theodore Y. Ts'o
0 siblings, 1 reply; 4+ messages in thread
From: Chris Murphy @ 2018-09-20 3:18 UTC (permalink / raw)
To: Dave Chinner, Linux FS Devel; +Cc: xfs list, Btrfs BTRFS, linux-ext4
Adding fsdevel@, linux-ext4, and btrfs@ (which has a separate subject
on this same issue)
On Wed, Sep 19, 2018 at 7:45 PM, Dave Chinner <david@fromorbit.com> wrote:
>On Wed, Sep 19, 2018 at 10:23:38AM -0600, Chris Murphy wrote:
>> Fedora 29 has a new feature to test if boot+startup fails, so the
>> bootloader can do a fallback at next boot, to a previously working
>> entry. Part of this means GRUB (the bootloader code, not the user
>> space code) uses "save_env" to overwrite the 1024 data bytes with
>> updated environment information.
>
> That's just broken. Illegal. Completely unsupportable. Doesn't
> matter what the filesystem is, nobody is allowed to write directly
> to the block device a filesystem owns.
Yeah, the word I'm thinking of is abomination.
However in their defense, grubenv and the 'save_env' command are old features:
line 3638 @node Environment block
http://git.savannah.gnu.org/cgit/grub.git/tree/docs/grub.texi
"For safety reasons, this storage is only available when installed on a plain
disk (no LVM or RAID), using a non-checksumming filesystem (no ZFS), and
using BIOS or EFI functions (no ATA, USB or IEEE1275)."
I haven't checked how it tests for this. But by now, it should list
the supported file systems, rather than what's exempt. That's a
shorter list.
> ext4 has inline data, too, so there's every chance grub will corrupt
> ext4 filesystems with tit's wonderful new feature. I'm not sure if
> the ext4 metadata cksums cover the entire inode and inline data, but
> if they do it's the same problem as btrfs.
I don't see inline used with a default mkfs, but I do see metadata_csum
e2fsprogs-1.44.3-1.fc29.x86_64
Filesystem features: has_journal ext_attr resize_inode dir_index
filetype extent 64bit flex_bg sparse_super large_file huge_file
dir_nlink extra_isize metadata_csum
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
>
>> For XFS, I'm not sure how the inline extent is saved, and whether
>> metadata checksumming includes or excludes the inline extent.
>
> When XFS implements this, it will be like btrfs as the data will be
> covered by the metadata CRCs for the inode, and so writing directly
> to it would corrupt the inode and render it unreadable by the
> filesystem.
Good to know.
>
>> I'm also kinda ignoring the reflink ramifications of this behavior,
>> for now. Let's just say even if there's no corruption I'm really
>> suspicious of bootloader code writing anything, even what seems to be
>> a simple overwrite of two sectors.
>
> You're not the only one
>
> Like I said, it doesn't matter what the filesystem is, overwriting
> file data by writing directly to the block device is not
> supportable. It's essentially a filesystem corruption vector, and
> grub needs to have that functionality removed immediately.
I'm in agreement with respect to the more complex file systems. We've
already realized the folly of the bootloader being unable to do
journal replay, ergo it doesn't really for sure have a complete
picture of the file system anyway. That's suboptimal when it results
in boot failure. But if it were going to use stale file system
information, get a wrong idea of the file system, and then use that to
do even 1024 bytes of writes? No, no, and no.
Meanwhile, also in Fedoraland, it's one of the distros where grubenv
and grub.cfg stuff is on the EFI System partition, which is FAT. This
overwrite behavior will work there, but even this case is a kind of
betrayal that the file is being modified, without its metadata being
updated. I think it's an old era hack that by today's standards simply
isn't good enough. I'm a little surprised that all UEFI
implementations permit arbitrary writes from the pre-boot environment
to arbitrary block devices, even with Secure Boot enabled. That seems
specious.
I know some of the file systems have reserve areas for bootloader
stuff. I'm not sure if that's preferred over bootloaders just getting
their own partition and controlling it stem to stern however they
want.
--
Chris Murphy
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: inline extents
2018-09-20 3:18 ` Chris Murphy
@ 2018-09-20 19:34 ` Theodore Y. Ts'o
0 siblings, 0 replies; 4+ messages in thread
From: Theodore Y. Ts'o @ 2018-09-20 19:34 UTC (permalink / raw)
To: Chris Murphy
Cc: Dave Chinner, Linux FS Devel, xfs list, Btrfs BTRFS, linux-ext4
On Wed, Sep 19, 2018 at 09:18:16PM -0600, Chris Murphy wrote:
> > ext4 has inline data, too, so there's every chance grub will corrupt
> > ext4 filesystems with tit's wonderful new feature. I'm not sure if
> > the ext4 metadata cksums cover the entire inode and inline data, but
> > if they do it's the same problem as btrfs.
>
> I don't see inline used with a default mkfs, but I do see metadata_csum
The grub environment block is 1024 bytes, so unless you are using a 4k
inode size (the default is 256 bytes), the grub environment block
won't be an inline data file. So in practice, this won't be a
problem. So in practice this won't be a problem for ext4 today.
There are ways that grub can query the file system (using FIEMAP or
FS_IOC_GETFLAGS) to see whether or not the file is an inline file.
For xfs grub would also need to make sure the file isn't reflinked
file (which can also be queried using FIEMAP). That's fine for
inline, since a file won't magically change from stored in a block
versus inline, so grub can check this at grub setup time. But the
problem with reflink is that a file that was originally using a block
exclusively could later have that block be shared by another file, at
which point it's only safe to write the block using copy-on-write.
> I know some of the file systems have reserve areas for bootloader
> stuff. I'm not sure if that's preferred over bootloaders just getting
> their own partition and controlling it stem to stern however they
> want.
Ext4 had reserved the bootloader inode, and we have defined a way for
a bootloader to access it. The original intent was that not it just
be fore the environment block, but also the grub stage 1.5 or stage 2
loader. This would allow grub to just use a fixed list of blocks or
extents, and not need to understand the ext4 extents structures. We
added way this when, because we anticipated this would be easier for
grub than having it learn how to read ext4 extents structures.
Instead, the bootloader would at setup time use the EXT4_IOC_SWAP_BOOT
ioctl and then use a fixed list of blocks or block extents like LILO
used to do.
Of course, grub exceeded expectations and actually learned how to read
ext4 file systems, and probably we (the ext4 developers) should have
reached out to GRUB folks.
Cheers,
- Ted
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2018-09-21 1:19 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-09-19 16:23 inline extents Chris Murphy
2018-09-20 1:45 ` Dave Chinner
2018-09-20 3:18 ` Chris Murphy
2018-09-20 19:34 ` Theodore Y. Ts'o
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).