Re: [PATCH 1/2] ext4: optimize metadata allocation for hybrid LUNs

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Andreas Dilger <adilger@dilger.ca>
Cc: Bobi Jam <bobijam@hotmail.com>, linux-ext4@vger.kernel.org
Subject: Re: [PATCH 1/2] ext4: optimize metadata allocation for hybrid LUNs
Date: Wed, 16 Aug 2023 15:39:21 +0530	[thread overview]
Message-ID: <871qg347u6.fsf@doe.com> (raw)
In-Reply-To: <8AF0F706-B25F-4365-B9F2-8CA1BB336EC3@dilger.ca>

Andreas Dilger <adilger@dilger.ca> writes:

> On Aug 3, 2023, at 6:10 AM, Ritesh Harjani (IBM) <ritesh.list@gmail.com> wrote:
>> 
>> Bobi Jam <bobijam@hotmail.com> writes:
>> 
>>> With LVM it is possible to create an LV with SSD storage at the
>>> beginning of the LV and HDD storage at the end of the LV, and use that
>>> to separate ext4 metadata allocations (that need small random IOs)
>>> from data allocations (that are better suited for large sequential
>>> IOs) depending on the type of underlying storage.  Between 0.5-1.0% of
>>> the filesystem capacity would need to be high-IOPS storage in order to
>>> hold all of the internal metadata.
>>> 
>>> This would improve performance for inode and other metadata access,
>>> such as ls, find, e2fsck, and in general improve file access latency,
>>> modification, truncate, unlink, transaction commit, etc.
>>> 
>>> This patch split largest free order group lists and average fragment
>>> size lists into other two lists for IOPS/fast storage groups, and
>>> cr 0 / cr 1 group scanning for metadata block allocation in following
>>> order:
>>> 
>>> cr 0 on largest free order IOPS group list
>>> cr 1 on average fragment size IOPS group list
>>> cr 0 on largest free order non-IOPS group list
>>> cr 1 on average fragment size non-IOPS group list
>>> cr >= 2 perform the linear search as before
>
> Hi Ritesh,
> thanks for the review and the discussion about the patch.
>
>> Yes. The implementation looks straight forward to me.
>> 
>
>>> Non-metadata block allocation does not allocate from the IOPS groups.
>>> 
>>> Add for mke2fs an option to mark which blocks are in the IOPS region
>>> of storage at format time:
>>> 
>>>  -E iops=0-1024G,4096-8192G
>> 
>
>> However few things to discuss here are -
>
> As Ted requested on the call, this should be done as two separate calls
> to the allocator, rather than embedding the policy in mballoc group
> selection itself.  Presumably this would be in ext4_mb_new_blocks()
> calling ext4_mb_regular_allocator() twice with different allocation
> flags (first with EXT4_MB_HINT_METADATA, then without, though I don't
> actually see this was used anywhere in the code before this patch?)
>
> Metadata allocations should try only IOPS groups on the first call,
> but would go through all allocation phases.  If IOPS allocation fails,
> then the allocator should do a full second pass to allocate from the
> non-IOPS groups. Non-metadata allocations would only allocate from
> non-IOPS groups.
>
>> 1. What happens when the hdd space for data gets fully exhausted? AFAICS,
>> the allocation for data blocks will still succeed, however we won't be
>> able to make use of optimized scanning any more. Because we search within
>> iops lists only when EXT4_MB_HINT_METADATA is set in ac->ac_flags.
>
> The intention for our usage is that data allocations should *only* come
> from the HDD region of the device, and *not* from the IOPS (flash) region
> of the device.  The IOPS region will be comparatively small (0.5-1.0% of
> the total device size) so using or not using this space will be mostly
> meaningless to the overall filesystem usage, especially with a 1-5%
> reserved blocks percentage that is the default for new filesystems.
>

Yes, but when we give this functionality to non-enterprise users,
everyone would like to take advantage of a faster performing ext4 using
1 ssd and few hdds or a smaller spare ssd and larger hdds. Then it could
be that the space of iops region might not strictly be less than 1-2%
and could be anywhere between 10-50% ;)  

Shouldn't we still support this class of usecase as well? ^^^ 
So if the HDD gets full then the allocation should fallback to ssd for
data blocks no?

Or we can have a policy knob i.e. fallback_data_to_iops_region_thresh.
So if the free space %age in the iops region is above 1% (can be changed
by user) then the data allocations can fallback to iops region if it is
unable to allocate data blocks from hdd region.

      echo %age_threshold > fallback_data_to_iops_region_thresh (default 1%)

        Fallback data allocations to iops region as long as we have free space
        %age of iops region above %age_threshold.

> As you mentioned on the call, it seems this is a defect in the current
> patch, that non-metadata allocations may eventually fall back to scan
> all block groups for free space including IOPS groups.  They need to
> explicitly skip groups that have the IOPS flags set.
>
>> 2. Similarly what happens when the ssd space for metadata gets full?
>> In this case we keep falling back to cr2 for allocation and we don't
>> utilize optimize_scanning to find the block groups from hdd space to
>> allocate from.
>
> In the case when the IOPS groups are full then the metadata allocations
> should fall back to using non-IOPS groups.  That avoids ENOSPC when the
> metadata space is accidentally formatted too small, or unexpected usage
> such as large xattrs or many directories are consuming more IOPS space.
>
>> 3. So it seems after a period of time, these iops lists can have block
>> groups belonging to differnt ssds. Could this cause the metadata
>> allocation of related inodes to come from different ssds.
>> Will this be optimal? Checking on this...
>>     ...On checking further on this, we start with a goal group and we
>> at least scan s_mb_max_linear_groups (4) linearly. So it's unlikely that
>> we frequently allocate metadata blocks from different SSDs.
>
> In our usage will typically be only a single IOPS region at the start of
> the device, but the ability to allow multiple IOPS regions was added for
> completeness and flexibility in the future (e.g. resize of filesystem).

I am interested in knowing what do you think will be challenges in
supporting resize with hybrid devices? Like if someone would like to add
an additional ssd and do a resize, do you think all later metadata
allocations can be fullfilled from this iops region?

And what happens when someone adds hdds to existing ssds.
I guess adding an hdd followed by resize operation can still allocate, bgdt, block/inode
bitmaps and inode tables etc for these block groups to sit on the resized hdd right. 
Are there any other challenges as well for such usecase? 


> In our case, the IOPS region would itself be RAIDed, so "different SSDs"
> is not really a concern.
>
>> 4. Ok looking into this, do we even require the iops lists for metadata
>> allocations? Do we allocate more than 1 blocks for metadata? If not then
>> maintaining these iops lists for metadata allocation isn't really
>> helpful. On the other hand it does make sense to maintain it when we
>> allow data allocations from these ssds when hdds gets full.
>
> I don't think we *need* to use the same mballoc code for IOPS allocation
> in most cases, though large xattr inode allocations should also be using
> the IOPS groups for allocating blocks, and these might be up to 64KB.
> I don't think that is actually implemented properly in this patch yet.
>
> Also, the mballoc list/array make it easy to find groups with free space
> in a full filesystem instead of having to scan for them, even if we
> don't need the full "allocate order-N" functionality.  Having one list
> of free groups or order-N lists doesn't make it more expensive (and it
> actually improves scalability to have multiple list heads).
>
> One of the future enhancements might be to allow small files (of some
> configurable size) to also be allocated from the IOPS groups, so it is
> probably easier IMHO to just stick with the same allocator for both.
>
>> 5. Did we run any benchmarks with this yet? What kind of gains we are
>> looking for? Do we have any numbers for this?
>
> We're working on that.  I just wanted to get the initial patches out for
> review sooner rather than later, both to get feedback on implementation
> (like this, thanks), and also to reserve the EXT4_BG_IOPS field so it
> doesn't get used in a conflicting manner.
>
>> 6. I couldn't stop but start to think of...
>> Should there also be a provision from the user to pass hot/cold data
>> types which we can use as a hint within the filesystem to allocate from
>> ssd v/s hdd? Does it even make sense to think in this direction?
>
> Yes, I also had the same idea, but then left it out of my email to avoid
> getting distracted from the initial goal.  There are a number of possible
> improvements that could be done with a mechanism like this:
> - have fast/slow regions within a single HDD (i.e. last 20% of spindle is
>   in "slow" region due to reduced linear velocity/bandwidth on inner tracks)
>   to avoid using the slow region unless the fast region is (mostly) full
> - have several regions across an HDD to *intentionally* allocate some
>   extents in the "slow" groups to reduce *peak* bandwidth but keep
>   *average* bandwidth higher as the disk becomes more full since there
>   would still be free space in the faster groups.
>

Interesting! 

> Cheers, Andreas
>

Thanks
-ritesh

next prev parent reply	other threads:[~2023-08-16 10:10 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-27 23:45 [PATCH 1/2] ext4: optimize metadata allocation for hybrid LUNs Bobi Jam
2023-08-02 23:16 ` Andreas Dilger
2023-08-03 12:10 ` Ritesh Harjani
2023-08-15  4:10   ` Andreas Dilger
2023-08-16 10:09     ` Ritesh Harjani [this message]
2023-08-18 19:53       ` Andreas Dilger
2023-09-12  6:59 ` [PATCH v3] " Bobi Jam
2023-09-18 21:47   ` Andreas Dilger
2023-09-20  5:23     ` Ritesh Harjani
2023-09-22  3:27       ` Andreas Dilger
2023-09-22 11:07         ` Dongyang Li
2023-09-20  5:39   ` Ritesh Harjani
2023-09-22  4:38     ` Andreas Dilger
2023-09-26  3:35       ` Ritesh Harjani
2023-09-26 19:19         ` Andreas Dilger
2023-09-27  2:48           ` Ritesh Harjani
2023-09-28 15:12             ` Ojaswin Mujoo
2023-09-28 14:40   ` Ojaswin Mujoo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=871qg347u6.fsf@doe.com \
    --to=ritesh.list@gmail.com \
    --cc=adilger@dilger.ca \
    --cc=bobijam@hotmail.com \
    --cc=linux-ext4@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.