Re: [RFC] fadvise: add more flags to provide a hint for block allocation


From: Dave Chinner <david@fromorbit.com>
To: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Andreas Dilger <aedilger@gmail.com>,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org
Subject: Re: [RFC] fadvise: add more flags to provide a hint for block allocation
Date: Thu, 8 Mar 2012 18:07:20 +1100
Message-ID: <20120308070720.GP3592@dastard>
In-Reply-To: <yq1vcmfhga2.fsf@sermon.lab.mkp.net>

On Wed, Mar 07, 2012 at 11:23:49PM -0500, Martin K. Petersen wrote:
> >>>>> "Dave" == Dave Chinner <david@fromorbit.com> writes:
> 
> Dave> From what I've seen of the proposed SMR device standards, we're
> Dave> going to have to redesign filesystem allocation policies
> 
> [...]
> 
> The initial proposal involved SMR disks having a sparse LBA map carved
> into 2GB chunks.

2TB chunks, IIRC - the lower 32 bits of the 48-bit LBA were intended
to be the relative offset into the region (RBA), with the upper 16
bits being the region number.
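
As a rough sketch of that split (not taken from any draft of the
standard), the arithmetic looks like the following. The 512-byte
logical sector size is an assumption; it is what makes a 32-bit offset
cover 2TB. The names split_lba, join_lba and the constants are
hypothetical, used only for illustration.

```python
# Sketch of the proposed 48-bit LBA layout: upper 16 bits select the
# region, lower 32 bits are the relative block address (RBA) inside it.
# SECTOR_SIZE is an assumption, not something stated in the proposal.
SECTOR_SIZE = 512
RBA_BITS = 32
REGION_BITS = 16
LBA_BITS = RBA_BITS + REGION_BITS  # 48

# Bytes addressable inside one region: 2^32 sectors * 512 bytes = 2 TiB.
REGION_SIZE_BYTES = (1 << RBA_BITS) * SECTOR_SIZE


def split_lba(lba):
    """Return (region, rba) for a 48-bit LBA."""
    if not 0 <= lba < (1 << LBA_BITS):
        raise ValueError("LBA does not fit in 48 bits")
    region = lba >> RBA_BITS
    rba = lba & ((1 << RBA_BITS) - 1)
    return region, rba


def join_lba(region, rba):
    """Inverse of split_lba: build the 48-bit LBA from (region, rba)."""
    if not 0 <= region < (1 << REGION_BITS):
        raise ValueError("region number does not fit in 16 bits")
    if not 0 <= rba < (1 << RBA_BITS):
        raise ValueError("RBA does not fit in 32 bits")
    return (region << RBA_BITS) | rba
```

With those assumptions, each region spans exactly 2^41 bytes, and the
LBA space is sparse whenever a region is physically smaller than that.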

> However, that was shot down pretty hard.

That's unfortunate - it maps really well to how XFS uses allocation
groups. XFS already uses sparse regions for breaking up allocation
to enable parallelism. XFS could map to this sort of layout pretty
easily by placing an allocation group per region. That immediately
separates the SMR regions into discrete regions in the filesystem,
and just requires some tweaking to make use of the different
characteristics of the regions.

For example, use the standard btree freespace allocator for the
random write regions, and the bitmap allocator (used by the
realtime device) for the sequential write regions, because its
metadata is held externally to the region it is tracking, i.e. it
can be located in the random write regions. This could all be
handled by mkfs.xfs, including setting up the regions on the SMR
drives....
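
To make that concrete, here is a toy model of the layout decision: one
allocation group per region, with the allocator chosen by the region's
write characteristics. This is only a sketch of the idea, not how
mkfs.xfs works or would work; every name in it (WriteMode, Allocator,
AllocationGroup, plan_layout) is hypothetical, and placing all bitmap
metadata in the first random write region is an arbitrary choice made
for the example.

```python
from dataclasses import dataclass
from enum import Enum


class WriteMode(Enum):
    RANDOM = "random write"
    SEQUENTIAL = "sequential write"


class Allocator(Enum):
    BTREE = "btree freespace allocator"
    BITMAP = "bitmap allocator"


@dataclass(frozen=True)
class AllocationGroup:
    agno: int             # allocation group number, one per region
    region: int           # device region this group covers
    write_mode: WriteMode
    allocator: Allocator
    metadata_region: int  # region holding the free space metadata


def plan_layout(region_modes):
    """Map each device region to one allocation group.

    region_modes is a list of WriteMode values indexed by region number.
    """
    random_regions = [
        i for i, mode in enumerate(region_modes) if mode is WriteMode.RANDOM
    ]
    if not random_regions:
        raise ValueError("need at least one random write region for metadata")

    groups = []
    for region, mode in enumerate(region_modes):
        if mode is WriteMode.RANDOM:
            # The btree allocator keeps its metadata inside the region.
            groups.append(
                AllocationGroup(region, region, mode, Allocator.BTREE, region)
            )
        else:
            # Bitmap metadata lives outside the region it tracks, so it
            # can sit in a random write region (here: the first one).
            groups.append(
                AllocationGroup(
                    region, region, mode, Allocator.BITMAP, random_regions[0]
                )
            )
    return groups
```

For a device with one random write region followed by two sequential
write regions, plan_layout([WriteMode.RANDOM, WriteMode.SEQUENTIAL,
WriteMode.SEQUENTIAL]) gives a btree group for region 0 and two bitmap
groups whose metadata is held in region 0.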

IOWs, XFS already has most of the allocation infrastructure to
handle the proposed region based SMR devices, and would only need a
bit of modification and extension to fully support sequential write
regions along with random write regions.  The allocation policy
stuff (deciding what sort of region to allocate from and aggregating
writes appropriately) is where all the new complexity lies, but
we have to do that anyway to handle all the different sorts of
access hints we are likely to see.

> The approach currently being worked uses either dynamic (flash, tiered
> storage) or static hints (SMR) to put things in an appropriate area
> given the nature of the I/O.
> This puts the burden of virtual to physical LBA management on the device
> rather than in the filesystem allocators. And gives us the benefit of
> having a single interface that can be used for many different device
> types.

So the current proposal hides all the physical characteristics of
the devices from the file system and remaps the LBA internally based
on the IO hint? But that is the opposite of the direction we've
been taking over the past couple of years - we want more visibility
of device characteristics at the filesystem level so we can optimise
the filesystem better, not less.

> That said, the current proposal is crazy complex and clearly written
> with Windows in mind. They are creating different access profiles for
> .DLLs, .INI files, apps in the startup folder, and so on.

I'll pass judgement when I see it. 

To tell the truth, I'd much prefer that we have direct control of
physical layout in the filesystem rather than have the storage
device virtualise it with some unknown algorithm. Every device will
have different algorithms, so we won't get relatively consistent
behaviour across devices from different manufacturers like we have
now.  If that is all hidden in the drive firmware and is different
for each different device we see, then we've got no hope of being
able to diagnose why two files with identical filesystem layouts at
adjacent LBAs have vastly different performance for the same access
pattern....

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com
