From: Gionatan Danti <g.danti@assyoma.it>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, g.danti@assyoma.it
Subject: Re: Block size and read-modify-write
Date: Wed, 03 Jan 2018 23:09:30 +0100
Message-ID: <fa54fea7ec0d836f2c037f7b71a17365@assyoma.it>
In-Reply-To: <20180103214741.GO5858@dastard>
On 03-01-2018 22:47 Dave Chinner wrote:
> On Wed, Jan 03, 2018 at 03:54:42PM +0100, Gionatan Danti wrote:
>>
>>
>> On 03/01/2018 02:19, Dave Chinner wrote:
>> >Cached writes smaller than a *page* will cause RMW cycles in the
>> >page cache, regardless of the block size of the filesystem.
>>
>> Sure, in this case a page-sized r/m/w cycle happens in the pagecache.
>> However, it seems to me that, when flushed to disk, writes happen at
>> block-level granularity, as you can see from tests[1,2] below.
>> Am I wrong? Am I missing something?
>
> You're writing into unwritten extents. That's not a data overwrite,
> so behaviour can be very different. And when you have sub-page block
> sizes, the filesystem and/or page cache may decide not to read the
> whole page if it doesn't need to immediately. e.g. you'll see
> different behaviour between a 512 byte write() and a 512 byte write
> via mmap()...
The first "dd" execution surely writes into unwritten extents. However,
on the following writes real data are overwritten, right?
> IOWs, there are so many different combinations of behaviour and
> variables that we don't try to explain every single nuance. If you
> do sub-page and/or sub-block size IO, then expect page-sized RMW
> to occur. It might be smaller depending on the fs config, the file
> layout, the underlying extent type, the type of operation the
> write must perform (e.g. plain overwrite vs copy-on-write), the
> offset into the page/block, etc. The simple message is this: avoid
> sub-block/page size IO if you can possibly avoid it.
>
>> >Ok, there is a difference between *sector size* and *filesystem
>> >block size*. You seem to be using them interchangeably in your
>> >question, and that's not correct.
>>
>> True, maybe I have issues grasping the concept of sector size from
>> the XFS point of view. I understand sector size as a hardware property
>> of the underlying block device, but how does it relate to the
>> filesystem?
>>
>> I naively supposed that an XFS filesystem created with 4k *sector*
>> size (i.e. mkfs.xfs -s size=4096) would prevent 512-byte O_DIRECT
>> writes, but my test[3] shows that even on such a filesystem a 512B
>> direct write is indeed possible.
>>
>> Is sector size information only used by XFS's own metadata and
>> journaling in order to avoid costly device-level r/m/w cycles on
>> 512e devices? I understand that on a 4Kn device you *have* to avoid
>> sub-sector writes, or the transfer will fail.
>
> We don't care if the device does internal RMW cycles (RAID does
> that all the time). The sector size we care about is the size of an
> atomic write IO - the IO size that the device guarantees will either
> succeed completely or fail without modification. This is needed for
> journal recovery sanity.
>
> For data, the kernel checks the logical device sector size and
> limits direct IO to those sizes, not the filesystem sector size.
> i.e. filesystem sector size is there for sizing journal operations
> and metadata, not limiting data access alignment.
This is an outstanding explanation.
Thank you very much.
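
To restate the data-path rule as I understand it, here is a minimal
Python sketch. The function names are my own illustration, not kernel
code. It assumes the only constraint is that offset and length are
multiples of the device's logical sector size, and it ignores the
memory-buffer alignment that O_DIRECT also requires.

```python
from pathlib import Path


def logical_sector_size(dev: str) -> int:
    """Read the logical sector size the block layer reports for e.g. 'sda'."""
    return int(Path(f"/sys/block/{dev}/queue/logical_block_size").read_text())


def direct_io_aligned(offset: int, length: int, logical_size: int) -> bool:
    """True if offset and length are multiples of the logical sector size.

    The filesystem sector size (mkfs.xfs -s size=...) is deliberately not
    an input: per the explanation above it sizes journal and metadata IO
    and does not limit data access alignment. (Simplified model.)
    """
    return (
        length > 0
        and offset % logical_size == 0
        and length % logical_size == 0
    )
```

With a 512e device (logical size 512), a 512-byte direct write passes
this check even on a filesystem made with -s size=4096, which matches
my test[3]. With a 4Kn device (logical size 4096), the same write fails
the check.
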
>> I want to put some context on the original question, and why I am so
>> interested in r/m/w cycles. SSD's flash-page size has, in recent
>> years (2014+), ballooned to 8/16/32K. I wonder if a matching
>> blocksize and/or sector size are needed to avoid (some of)
>> device-level r/m/w cycles, which can dramatically increase flash
>> write amplification (with reduced endurance).
>
> We've been over this many times in the past few years. User data
> alignment is controlled by stripe unit/width specification,
> not sector/block sizes.
Sure, but to avoid/mitigate device-level r/m/w, proper alignment is
not sufficient by itself. You should also avoid partial page writes.
Anyway, I got the message: this is not business that XFS directly cares
about.
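
To illustrate what I mean by partial page writes, here is a rough
back-of-the-envelope sketch in Python. It assumes a simplified device
that must program every flash page a write touches in full. Real flash
translation layers buffer and coalesce writes, so treat the result as a
worst case. The function names and the model are my own illustration.

```python
def bytes_programmed(offset: int, length: int, flash_page: int) -> int:
    """Bytes the simplified device programs for one write.

    Every flash page touched by [offset, offset + length) is assumed
    to be rewritten in full.
    """
    if length <= 0 or flash_page <= 0:
        raise ValueError("length and flash_page must be positive")
    first = offset // flash_page
    last = (offset + length - 1) // flash_page
    return (last - first + 1) * flash_page


def write_amplification(offset: int, length: int, flash_page: int) -> float:
    """Ratio of bytes programmed to bytes the host actually wrote."""
    return bytes_programmed(offset, length, flash_page) / length
```

In this model a 4K write at offset 0 on a 16K flash page gives 4.0 and
a 16K write at offset 0 gives 1.0, while a 16K write at offset 8K still
gives 2.0 because it straddles two flash pages.
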
Thanks again.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8