public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>,
	Ritesh Harjani <ritesh.list@gmail.com>,
	chandan.babu@oracle.com, dchinner@redhat.com, hch@lst.de,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	linux-xfs@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, catherine.hoang@oracle.com,
	martin.petersen@oracle.com
Subject: Re: [PATCH v4 00/14] forcealign for xfs
Date: Wed, 18 Sep 2024 09:34:37 +1000	[thread overview]
Message-ID: <ZuoSDVJPOX44Fww/@dread.disaster.area> (raw)
In-Reply-To: <20240917205420.GB182177@frogsfrogsfrogs>

On Tue, Sep 17, 2024 at 01:54:20PM -0700, Darrick J. Wong wrote:
> On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote:
> > On 16/09/2024 08:03, Dave Chinner wrote:
> > > OTOH, we can't do this with atomic writes. Atomic writes require
> > > some mkfs help because they require explicit physical alignment of
> > > the filesystem to the underlying storage.
> 
> Forcealign requires agsize%extsize==0,

No it doesn't.

AG size is irrelevant when aligning extents - all
that matters is that we can find a free extent that can be trimmed
to the alignment defined by extsize. 

> it's atomicwrites that adds the
> requirement that extsize be a power of 2...

Only by explicit implementation constraint.

Atomic writes do not require power of two extsize - they only
require correctly aligned physical extents.

e.g an 8kB atomic write is always guaranteed to succeed if
we have an extsize of 24kB for laying out the physical extents
because a 24kB physical extent is always 8kB aligned and is an exact
multiple of 8kB. This meets the requirements for 8kB atomic writes to
always succeed, and hence there is no fundamental requirement for
extsize to be a power of 2.

We have *chosen* to simplify the implementation by only allowing
a single aligned atomic write to be issued at a time. This means
the alignment and size of atomic writes is always the minimum size
the hardware advertises, and that is (at present) always a power of
2.

Hence the "extsize needs to be a power of 2" comes from the
constraints exposed from the block device configuration (i.e.
minimum atomic write unit), not from a filesystem design or
implementation constraint.

At the filesystem level, we have further simplified things by only
allowing extsize = atomic write size. Hence the initial
implementation ends up only support power of 2 extsize values. This
is not a hard design or implementation constraint, however.

.....

hmmmmm.

.....

In writing this I've had further thoughts on force-align and the
sub-alloc-unit unwritten extent issue we've been discussing here.
i.e. I've stopped and considered the existing design constraints
given what I wrote above and considered what is needed for
supporting large extsize for small atomic writes.

I think we need to support large extsize with small atomic write
size for two reasons:

1. extsize is still going to be needed for preventing excessive
fragmentation with atomic writes. It's the small DIO write workloads
that see lots of fragmentation, and applications using atomic writes
are essentially being forced down the path of being small DIO write
workloads.

2. we can allow force-align w/o atomic writes behaviour to match the
existing rtvol sb_rextsize > 1 fsb behaviour without impacting
atomic write behaviour. (i.e. less behavioural differences, more
common code, better performance, etc).

To do this (and I think we do want to do this), then we have to do
two things:

1. force-align needs to add a "unwritten align" inode parameter to
allow sub-extsize unwritten extent boundaries to exist in the BMBT.
(i.e.  similar to how rt files w/ sb_rextsize > 1 fsb currently
behave.)

This is purely an in-memory value - for pure "force-align" we can
set it 1 fsb and then the behaviour will match existing RT
behaviour.  We can abstract this behaviour by replacing the hard
coded 1 block alignment for unwritten conversion with an "unwritten
align" value which would initially be set to 1.

We can also abstract this code away from being "rt specific" and
make it entirely dependent on "alloc-unit" configuration. This means
rt, force-align and atomic write will all be running the same code,
which makes testing a lot easier..

2. inodes with atomic write enabled must set the unwritten align
value to the atomic write size exposed by the hardware, and the
extsize must be an exact integer multiple of the unwritten align
size.

The initial settings of unwritten align == extsize gives the current
behaviour of all allocation and extent conversion being aligned to
atomic write constraints.

The separation of unwritten conversion from the extsize then allows
allows the situation I described above with 8kB atomic writes
and 24kB extsize. Because unwritten conversion is aligned to atomic
wriet boundaries, we can use sub-alloc-unit unwritten extents
without violating atomic write boundaries.

This would allow us to use extsize for atomic writes in the same
manner we use it for now - enable large contiguous allocations to
prevent fragmentation when doing lots of concurrent small
"immediate write" operations across many files.

I think this can all be added on top of the existing patchset - it's
not really a fundamental change to any of it. It's a little bit more
abstraction and unification, but it enables a lot more flexibility
for optimising atomic write functionality in the future.

Thoughts?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2024-09-17 23:34 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-13 16:36 [PATCH v4 00/14] forcealign for xfs John Garry
2024-08-13 16:36 ` [PATCH v4 01/14] xfs: only allow minlen allocations when near ENOSPC John Garry
2024-08-23 16:28   ` Darrick J. Wong
2024-08-13 16:36 ` [PATCH v4 02/14] xfs: always tail align maxlen allocations John Garry
2024-08-23 16:31   ` Darrick J. Wong
2024-08-29 17:58     ` John Garry
2024-08-29 21:34       ` Darrick J. Wong
2024-08-13 16:36 ` [PATCH v4 03/14] xfs: simplify extent allocation alignment John Garry
2024-08-13 16:36 ` [PATCH v4 04/14] xfs: make EOF allocation simpler John Garry
2024-09-04 18:25   ` Ritesh Harjani
2024-09-05  7:51     ` John Garry
2024-08-13 16:36 ` [PATCH v4 05/14] xfs: introduce forced allocation alignment John Garry
2024-08-13 16:36 ` [PATCH v4 06/14] xfs: align args->minlen for " John Garry
2024-08-13 16:36 ` [PATCH v4 07/14] xfs: Introduce FORCEALIGN inode flag John Garry
2024-08-13 16:36 ` [PATCH v4 08/14] xfs: Update xfs_inode_alloc_unitsize() for forcealign John Garry
2024-08-13 16:36 ` [PATCH v4 09/14] xfs: Update xfs_setattr_size() " John Garry
2024-08-13 16:36 ` [PATCH v4 10/14] xfs: Do not free EOF blocks " John Garry
2024-08-13 16:36 ` [PATCH v4 11/14] xfs: Only free full extents " John Garry
2024-08-13 16:36 ` [PATCH v4 12/14] xfs: Unmap blocks according to forcealign John Garry
2024-08-23 16:35   ` Darrick J. Wong
2024-08-13 16:36 ` [PATCH v4 13/14] xfs: Don't revert allocated offset for forcealign John Garry
2024-08-13 16:36 ` [PATCH v4 14/14] xfs: Enable file data forcealign feature John Garry
2024-09-04 18:14 ` [PATCH v4 00/14] forcealign for xfs Ritesh Harjani
2024-09-04 23:20   ` Dave Chinner
2024-09-05  3:56     ` Ritesh Harjani
2024-09-05  6:33       ` Dave Chinner
2024-09-10  2:51         ` Ritesh Harjani
2024-09-16  6:33           ` Dave Chinner
2024-09-10 12:33         ` Ritesh Harjani
2024-09-16  7:03           ` Dave Chinner
2024-09-16 10:24             ` John Garry
2024-09-17 20:54               ` Darrick J. Wong
2024-09-17 23:34                 ` Dave Chinner [this message]
2024-09-17 22:12               ` Dave Chinner
2024-09-18  7:59                 ` John Garry
2024-09-23  2:57                   ` Dave Chinner
2024-09-23  3:33                     ` Christoph Hellwig
2024-09-23  8:16                       ` John Garry
2024-09-23 12:07                         ` Christoph Hellwig
2024-09-23 12:33                           ` John Garry
2024-09-24  6:17                             ` Christoph Hellwig
2024-09-24  9:48                               ` John Garry
2024-11-29 11:36                                 ` John Garry
2024-09-23  8:00                     ` John Garry
2024-09-05 10:15     ` John Garry
2024-09-05 21:47       ` Dave Chinner
2024-09-06 14:31         ` John Garry
2024-09-08 22:49           ` Dave Chinner
2024-09-09 16:18             ` John Garry
2024-09-16  5:25               ` Dave Chinner
2024-09-16  9:44                 ` John Garry
2024-09-17 22:27                   ` Dave Chinner
2024-09-18 10:12                     ` John Garry
2024-11-14 12:48                       ` Long Li
2024-11-14 16:22                         ` John Garry
2024-11-14 20:07                         ` Dave Chinner
2024-11-15  8:14                           ` John Garry
2024-11-15 11:20                           ` Long Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZuoSDVJPOX44Fww/@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=brauner@kernel.org \
    --cc=catherine.hoang@oracle.com \
    --cc=chandan.babu@oracle.com \
    --cc=dchinner@redhat.com \
    --cc=djwong@kernel.org \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=john.g.garry@oracle.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=ritesh.list@gmail.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox