From: Gionatan Danti <g.danti@assyoma.it>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, g.danti@assyoma.it
Subject: Re: Block size and read-modify-write
Date: Wed, 03 Jan 2018 23:09:30 +0100
Message-ID: <fa54fea7ec0d836f2c037f7b71a17365@assyoma.it>
In-Reply-To: <20180103214741.GO5858@dastard>
On 03-01-2018 22:47, Dave Chinner wrote:
> On Wed, Jan 03, 2018 at 03:54:42PM +0100, Gionatan Danti wrote:
>>
>>
>> On 03/01/2018 02:19, Dave Chinner wrote:
>> >Cached writes smaller than a *page* will cause RMW cycles in the
>> >page cache, regardless of the block size of the filesystem.
>>
>> Sure, in this case a page-sized r/m/w cycle happens in the pagecache.
>> However, it seems to me that, when flushed to disk, writes happen at
>> block-level granularity, as you can see from tests[1,2] below.
>> Am I wrong? Am I missing something?
>
> You're writing into unwritten extents. That's not a data overwrite,
> so behaviour can be very different. And when you have sub-page block
> sizes, the filesystem and/or page cache may decide not to read the
> whole page if it doesn't need to immediately, e.g. you'll see
> different behaviour between a 512 byte write() and a 512 byte write
> via mmap()...
The first "dd" execution surely writes into unwritten extents. However,
on the following writes real data are overwritten, right?
> IOWs, there are so many different combinations of behaviour and
> variables that we don't try to explain every single nuance. If you
> do sub-page and/or sub-block size IO, then expect page-sized RMW
> to occur. It might be smaller depending on the fs config, the file
> layout, the underlying extent type, the type of IO operation the
> write must perform (e.g. plain overwrite vs copy-on-write), the
> offset into the page/block, etc. The simple message is this: avoid
> sub-block/page size IO if you possibly can.
>
>> >Ok, there is a difference between *sector size* and *filesystem
>> >block size*. You seem to be using them interchangably in your
>> >question, and that's not correct.
>>
>> True, maybe I have issues grasping the concept of sector size from
>> XFS's point of view. I understand sector size as a hardware property
>> of the underlying block device, but how does it relate to the
>> filesystem?
>>
>> I naively supposed that an XFS filesystem created with 4k *sector*
>> size (i.e. mkfs.xfs -s size=4096) would prevent 512-byte O_DIRECT
>> writes, but my test[3] shows that even on such a filesystem a 512B
>> direct write is indeed possible.
>>
>> Is sector size information only used by XFS's own metadata and
>> journaling in order to avoid costly device-level r/m/w cycles on
>> 512e devices? I understand that on a 4Kn device you *have* to avoid
>> sub-sector writes, or the transfer will fail.
>
> We don't care if the device does internal RMW cycles (RAID does
> that all the time). The sector size we care about is the size of an
> atomic write IO - the IO size that the device guarantees will either
> succeed completely or fail without modification. This is needed for
> journal recovery sanity.
>
> For data, the kernel checks the logical device sector size and
> limits direct IO to those sizes, not the filesystem sector size.
> I.e. the filesystem sector size is there for sizing journal operations
> and metadata, not limiting data access alignment.
This is an outstanding explanation.
Thank you very much.
>> I want to give some context on the original question, and why I am so
>> interested in r/m/w cycles. SSD flash-page sizes have, in recent
>> years (2014+), ballooned to 8/16/32K. I wonder if a matching
>> block size and/or sector size is needed to avoid (some of the)
>> device-level r/m/w cycles, which can dramatically increase flash
>> write amplification (and reduce endurance).
>
> We've been over this many times in the past few years. User data
> alignment is controlled by the stripe unit/width specification,
> not by sector/block sizes.
Sure, but to avoid/mitigate device-level r/m/w, proper alignment is
not sufficient by itself: you should also avoid partial-page writes
(see the rough numbers below).
Anyway, I got the message: this is not something XFS directly cares
about.
Thanks again.
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8