* Re: packed_meta_blocks=1 incompatible with resize2fs?
2023-06-28 0:03 ` Theodore Ts'o
@ 2023-06-28 14:35 ` Roberto Ragusa
2023-06-28 15:44 ` Theodore Ts'o
2023-06-28 21:41 ` Andreas Dilger
1 sibling, 1 reply; 6+ messages in thread
From: Roberto Ragusa @ 2023-06-28 14:35 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4@vger.kernel.org
On 6/28/23 02:03, Theodore Ts'o wrote:
> Unfortunately, (a) there is no place where the fact that the file
> system was created with this mkfs option is recorded in the
> superblock, and (b) once the file system starts getting used, the
> blocks where the metadata would need to be allocated at the start of
> the disk will get used for directory and data blocks.
Isn't resize2fs already capable of migrating directory and data blocks
away? According to the comments at the beginning of resize2fs.c, I mean.
> OK, so I think what you're trying to do is to create a RAID0 device
> where the first part of the md raid device is stored on SSD, and after
> that there would be one or more HDD devices. Is that right?
More or less.
Using LVM and having the first extents of the LV on a fast PV and all the
others on a slow PV.
> In theory, it would be possible to add a file system extension where
> the first portion of the MD device is not allowed to be used for
> normal block allocations[...]
I would not aim for such a complex approach.
My hope was that one of the two was possible:
1. reserve the bitmap and inode table space from the beginning (with the
mke2fs resize option, for example)
2. push things out of the way when the expansion is done
I could attempt to code something to do 2., but I would either have to
study the resize2fs code, which is not trivial, or write something from
scratch based only on the layout docs, which would also be complex and
not easily mergeable into resize2fs.
Other options may have been:
3. do not add new inodes when expanding (impossible by design, right?)
4. have an offline way (custom tool, or detecting conflicting files and
temporarily removing them, ...) to free the needed blocks
At the moment the best option I have is to continue doing what I've been
doing for years already: use dumpe2fs and debugfs to discover which block
groups contain metadata and the journal, then selectively use "pvmove" to
migrate those extents (PEs) to the fast PV. Automatable, but still messy.
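The automatable half of that workaround can be sketched as below. The sample dumpe2fs output, device names, and the exact output format are assumptions for illustration; in practice you would pipe in real `dumpe2fs /dev/vg/lv` output.

```shell
#!/bin/sh
# Sketch: parse dumpe2fs output to find the block ranges holding inode
# tables, which then map to the physical extents to pvmove to the fast PV.
# The sample output below is illustrative; replace it with real output
# from `dumpe2fs /dev/vg/lv`.
sample='Group 0: (Blocks 0-32767)
  Inode table at 1028-1539 (+1028)
Group 1: (Blocks 32768-65535)
  Inode table at 33796-34307 (+1028)'
ranges=$(printf '%s\n' "$sample" | awk '/Inode table at/ { print $4 }')
printf '%s\n' "$ranges"
# Each block range would then be converted to PE numbers (block range
# divided by the PE size in blocks) and fed to something like:
#   pvmove /dev/slow_pv:PE-PE /dev/fast_pv
```

The same idea extends to bitmaps and the journal by matching the corresponding dumpe2fs lines.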
Discovering "packed_meta_blocks" turned out to be a less useful find than
I was hoping, if you then can't resize.
Thanks.
--
Roberto Ragusa mail at robertoragusa.it
^ permalink raw reply [flat|nested] 6+ messages in thread

* Re: packed_meta_blocks=1 incompatible with resize2fs?
2023-06-28 0:03 ` Theodore Ts'o
2023-06-28 14:35 ` Roberto Ragusa
@ 2023-06-28 21:41 ` Andreas Dilger
1 sibling, 0 replies; 6+ messages in thread
From: Andreas Dilger @ 2023-06-28 21:41 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Roberto Ragusa, linux-ext4@vger.kernel.org
On Jun 27, 2023, at 6:03 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Mon, Jun 26, 2023 at 11:15:14AM +0200, Roberto Ragusa wrote:
>> Is there a way to have metadata at the beginning of the disk
>> and be able to enlarge the fs later?
>> Planning to do this on big filesystems, placing the initial part
>> on SSD extents; reformat+copy is not an option.
>
> OK, so I think what you're trying to do is to create a RAID0 device
> where the first part of the md raid device is stored on SSD, and after
> that there would be one or more HDD devices. Is that right?
>
> In theory, it would be possible to add a file system extension where
> the first portion of the MD device is not allowed to be used for
> normal block allocations, so that when you later grow the raid0
> device, the SSD blocks are available for use for the extended
> metadata. This would require adding support for this in the
> superblock format, which would have to be an RO_COMPAT feature (that
> is, kernels that didn't understand the meaning of the feature bit
> would be prohibited from mounting the file system read/write, and
> older versions of fsck.ext4 would be prohibited from touching the file
> system at all). We would then have to add support for off-line and
> on-line resizing for using the reserved SSD space for this use case.
We already have 600TB+ Lustre storage targets with declustered RAID
volumes and will hit 1 PiB very soon (60 * 18TB drives). This results
in millions of block groups, which can cause issues with mounting, block
allocation, unlinking large files, etc., due to loading the metadata
from disk. Being able to store the filesystem metadata on flash will
avoid performance contention from HDD seeking without the cost of
all-flash storage (some Lustre filesystems are over 700PiB already).
We are investigating hybrid flash+HDD OSTs with sparse_super2 to put
all static metadata at the start of the LV on flash, and keep regular
data on HDDs. I think there isn't a huge amount of work needed to
get this working reasonably well, and it can be done incrementally.
Modify mke2fs to not force meta_bg to be enabled on filesystems over
256TiB. Locate the sparse_super2 superblock/GDT backup #1 in one of
the sparse_super backup groups (3^n, 5^n, 7^n) instead of always group
#1, so it can be found by e2fsck easily like sparse_super filesystems.
Locate the sparse_super2 backup #2 group in the last ({3,5,7}^n) group
within the filesystem (instead of the *actual* last group). That puts
it in the "slow" part of the device, but it is rarely accessed, and
is still far enough from the start of the device to avoid corruption.
In addition to avoiding the group #0/#1 GDT collision for backup #1 on
very large filesystems, this separates superblock/GDT copies further
apart for safety, and makes the #2 backup easier to find in case of
filesystem resize. The current use of the last filesystem group
is not "correct" after a resize, and is not easily found in this case.
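For reference, the sparse_super convention mentioned above places backups in group 1 plus every power of 3, 5, and 7; a quick sketch to enumerate them (the helper name is mine, not part of e2fsprogs):

```shell
# Enumerate the block groups that hold sparse_super backups: group 1
# plus every power of 3, 5, and 7 below the group count. The function
# name is illustrative, not from any existing tool.
sparse_backup_groups() {
  n=$1
  echo 1
  for base in 3 5 7; do
    g=$base
    while [ "$g" -lt "$n" ]; do
      echo "$g"
      g=$((g * base))
    done
  done
}
groups=$(sparse_backup_groups 100 | sort -n)
printf '%s\n' "$groups"
```

Picking the sparse_super2 backup locations from this set is what makes them easy for e2fsck to find without extra hints.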
This allows "mke2fs -O sparse_super2 -E packed_meta_blocks" allocation
of static metadata (superblocks, GDTs, bitmaps, inode tables, journal)
at the start of the block device, which puts most of the metadata IOPS
onto flash instead of HDD. No ext4/e2fsck changes are needed for this.
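That combination can already be tried on a throwaway file-backed image. A sketch, with the assumption (worth checking against your e2fsprogs version) that resize_inode must be disabled explicitly because mke2fs rejects it together with sparse_super2:

```shell
#!/bin/sh
# Sketch: try the packed-metadata layout on a sparse file-backed image.
# ^resize_inode is disabled explicitly on the assumption that mke2fs
# rejects resize_inode together with sparse_super2.
img=$(mktemp) && truncate -s 512M "$img"   # sparse file stands in for the LV
if command -v mke2fs >/dev/null 2>&1; then
  mke2fs -F -q -t ext4 -O sparse_super2,^resize_inode \
         -E packed_meta_blocks=1 "$img"
  status=$?
else
  status=0   # mke2fs not installed here; nothing to verify
fi
rm -f "$img"
```

Running `dumpe2fs` on the result should show all bitmaps and inode tables clustered at the start of the device.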
Update e2fsck to check sparse_super groups for backup superblocks/GDTs.
I think this is useful independent of the sparse_super2 changes. This was
previously submitted as "ext2fs: automaticlly open backup superblocks"
but needs some updates before it can land:
https://patchwork.ozlabs.org/project/linux-ext4/patch/20190129175134.26652-5-c17828@cray.com/
To handle the allocation of *dynamic* metadata on the flash device
(indirect/index blocks, directories, xattr blocks) the ext4 mballoc
code needs to be informed about which groups are on the high IOPS
device. I think it makes sense to add a new EXT4_BG_IOPS = 0x0010
group descriptor flag to mark groups that are on "high IOPS" media,
and then mballoc can use these groups when allocating the metadata.
Normal file data blocks would not be allocated from these groups.
Using a flag instead of a "last group" marker in the superblock allows
more flexibility (e.g. for resizing after filesystem creation).
mke2fs would be modified to add a "-E iops=[START-]END[,START-END,...]"
option to know which groups to mark with the IOPS flag. The START/END
would be given with either block numbers, or with unit suffixes, so
it can be specified easily by the user, and converted internally to
group numbers based on other filesystem parameters.
The normal case would be a single "END" argument (default START=0),
which puts all the IOPS capacity at the start of the device. The ability
to specify multiple [,START-END] ranges is only there for completeness
and flexibility; I don't expect that we would be using it ourselves.
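Since "-E iops=" is only a proposal, here is a sketch of just the range-to-group conversion it would need internally (function name and syntax are mine, modeled on the proposal; 32768 blocks per group is the ext4 default with 4KiB blocks; the unit-suffix parsing mentioned above is omitted):

```shell
#!/bin/sh
# Convert the proposed "[START-]END[,START-END,...]" block ranges into
# block-group ranges, as mke2fs would have to do internally. The option
# itself does not exist; this only models the conversion. Unit suffixes
# (G, T, ...) are not handled here, only raw block numbers.
iops_ranges_to_groups() {
  ranges=$1 bpg=$2   # bpg = blocks per group (32768 for 4KiB blocks)
  printf '%s\n' "$ranges" | tr ',' '\n' | while read -r r; do
    case "$r" in
      *-*) start=${r%-*}; end=${r#*-} ;;
      *)   start=0;       end=$r      ;;  # bare END implies START=0
    esac
    printf '%d-%d\n' $((start / bpg)) $((end / bpg))
  done
}
iops_ranges_to_groups "262144,1048576-2097152" 32768
```

The resulting group ranges are what would get the EXT4_BG_IOPS flag set in their group descriptors.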
This dynamic metadata allocation on flash is not strictly necessary,
but still accounts for about 1/5 of metadata IOPS per write in my testing.
> The downside of this particular approach is that the SSD space would
> be "wasted" until the file system is resized, and you have to know up
> front how big you might want to grow the file system eventually.
I agree it doesn't make sense to hugely over-provision flash capacity.
It could be used for dynamic metadata allocations (as above), but that
also defeats the purpose of reserving this space. My calculation is
that the IOPS capacity should be about 0.5% of the filesystem size to
handle the static and dynamic metadata, depending on inode size/count,
journal size, etc.
> I could imagine another approach might be that when you grow the file
> system, if you are using an LVM-type approach, you would append a
> certain number of LVM stripes backed by SSD, and a certain number
> backed by HDD's, and then give a hint to the resizing code where the
> metadata blocks should be allocated, and you would need to know ahead
> of time how many SSD-backed LV stripes to allocate to support the
> additional number of HDD-backed LV stripes.
That is what I was thinking also - if you are resizing (which is itself
an uncommon operation), then it makes sense to also add a large chunk
of flash at the end of the current device followed by the bulk of the
capacity on HDDs, and then have the resize code locate all of the new
static metadata on the new flash groups, like '-E packed_meta_blocks'.
In the past I also tried 1x 128MiB flash + 255x 128MiB HDD, matching
the flex_bg layout exactly to this, but it became complex very quickly.
LVM block remapping also slows down significantly with thousands of
segments, probably due to linear list walking, and that was still far
from the ~100k segments needed for a 1 PiB+ filesystem.
> This would require a bunch of abstraction violations, and it's a
> little bit gross. More to the point, it would require a bunch of
> development work, and I'm not sure there is interest in the ext4
> development community, or the companies that back those developers,
> for implementing such a feature.
We're definitely interested in hybrid storage devices like this, but
resizing is not a priority for us.
Cheers, Andreas
^ permalink raw reply [flat|nested] 6+ messages in thread