From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [XFS SUMMIT] SSD optimised allocation policy
Date: Thu, 14 May 2020 20:34:54 +1000
Message-ID: <20200514103454.GL2040@dread.disaster.area>


Topic:	SSD Optimised allocation policies

Scope:
	Performance
	Storage efficiency

Proposal:

Non-rotational storage is typically very fast. Our allocation
policies are all, fundamentally, designed for very slow storage that
has extremely high latency between IOs to different LBA regions. We
burn CPU optimising for minimal seeks to avoid the expensive
physical movement of disk heads and platter rotation.

We know when the underlying storage is solid state - there's a
"non-rotational" field in the block device config that tells us the
storage doesn't need physical seek optimisation. We should make use
of that.
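
As a rough userspace illustration of that check, the sketch below reads
the block queue's "rotational" attribute from sysfs, which holds 0 for
non-rotational devices. The function name and the sysfs_root parameter
are made up for this example; the filesystem itself would consult the
equivalent in-kernel state rather than sysfs.

```python
from pathlib import Path


def is_non_rotational(device: str, sysfs_root: str = "/sys/block") -> bool:
    """Return True if the kernel reports `device` as non-rotational.

    `device` is a block device name such as "sda" or "nvme0n1".
    `sysfs_root` is a hypothetical parameter so the lookup can be
    pointed at a fake tree for testing.
    """
    flag = Path(sysfs_root) / device / "queue" / "rotational"
    # The attribute contains "0" for non-rotational storage and "1"
    # for rotational storage.
    return flag.read_text().strip() == "0"
```

On a typical Linux system, is_non_rotational("sda") would tell us
whether seek optimisation is worth anything for that device.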

My proposal is that we look towards arranging the filesystem
allocation policies into CPU-optimised silos. We start by making
filesystems on SSDs with AG counts that are multiples of the CPU
count in the system (e.g. 4x the number of CPUs) to drive
parallelism at the allocation level, and then associate allocation
groups with specific CPUs in the system. Each CPU then has a set of
allocation groups it selects between for the operations that are run
on it, so allocation is typically local to a specific CPU.
Optimisation proceeds from the basis of CPU locality optimisation,
not storage locality optimisation.
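
To make the silo idea concrete, here is a minimal sketch of one
possible AG-to-CPU association. The interleaved layout, the function
names and the per-CPU rotor are all assumptions made for illustration;
the proposal does not specify how AGs would be assigned or selected.

```python
from typing import Dict, List


def build_ag_silos(nr_cpus: int, ags_per_cpu: int = 4) -> Dict[int, List[int]]:
    """Partition AG numbers into one set per CPU.

    The AG count is a multiple of the CPU count (4x by default, as in
    the example above). AGs are interleaved across CPUs here; that
    layout is an assumption, not part of the proposal.
    """
    agcount = nr_cpus * ags_per_cpu
    silos: Dict[int, List[int]] = {cpu: [] for cpu in range(nr_cpus)}
    for agno in range(agcount):
        silos[agno % nr_cpus].append(agno)
    return silos


def pick_ag(silos: Dict[int, List[int]], cpu: int, rotor: int) -> int:
    """Select an AG from the calling CPU's own silo.

    `rotor` is a hypothetical per-CPU counter used to rotate through
    the silo; no other CPU's AGs are ever considered.
    """
    ags = silos[cpu]
    return ags[rotor % len(ags)]
```

With two CPUs this gives CPU 0 the even-numbered AGs and CPU 1 the
odd-numbered ones, so allocations issued from different CPUs never
target the same AG.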

This allows processes on different CPUs to never contend for
allocation resources. Locality of objects just doesn't matter for
solid state storage, so we gain nothing by trying to group inodes,
directories, their metadata and data physically close together. We
want writes that happen at the same time to be physically close
together so we can aggregate them into larger IOs, but we really
don't care about optimising write locality for best read performance
(i.e. must be contiguous for sequential access) for this storage.

Further, we can look at faster allocation strategies - if we don't
have a contiguous free extent to allocate into, we don't need to
find the "nearest" free extent, we just want the one that costs the
least CPU to find. This is because solid state storage is so fast
that filesystem performance is CPU limited, not storage limited.
Hence we need to think about allocation policies differently and
start optimising them for minimum CPU expenditure rather than best
layout.
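
The difference between the two strategies can be sketched with a toy
free extent list. This is not the XFS allocator; the function names and
data layout are invented for the example. The "nearest" search has to
weigh every large-enough extent against a target block, while the
"cheapest" search does a single lookup in a size-sorted index and takes
whatever it finds.

```python
import bisect
from typing import List, Optional, Tuple


def alloc_nearest(free: List[Tuple[int, int]], target: int,
                  want: int) -> Optional[int]:
    """Locality-optimised: scan every (start, length) extent that is
    large enough and return the start closest to `target`."""
    best = None
    for start, length in free:
        if length < want:
            continue
        distance = abs(start - target)
        if best is None or distance < best[0]:
            best = (distance, start)
    return None if best is None else best[1]


def alloc_cheapest(free_by_len: List[Tuple[int, int]],
                   want: int) -> Optional[int]:
    """CPU-optimised: `free_by_len` is sorted (length, start) tuples.
    One binary search finds the first extent with length >= want;
    where it sits on disk is ignored."""
    i = bisect.bisect_left(free_by_len, (want, 0))
    if i == len(free_by_len):
        return None
    return free_by_len[i][1]
```

For the same request the two functions can return different extents:
the first pays for a scan to get good placement, the second pays one
index lookup and accepts any placement.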

Other things to discuss include:
	- how do we convert metadata structures to write-once style
	  behaviour rather than overwrite in place?
	- extremely large block sizes for metadata (e.g. 4MB) to
	  align better with SSD erase block sizes
	- what parts of the allocation algorithms don't we need?
	- are we better off with huge numbers of small AGs rather
	  than fewer large AGs?

-- 
Dave Chinner
david@fromorbit.com
