linux-xfs.vger.kernel.org archive mirror
From: Gionatan Danti <g.danti@assyoma.it>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, g.danti@assyoma.it
Subject: Re: Block size and read-modify-write
Date: Wed, 03 Jan 2018 23:09:30 +0100
Message-ID: <fa54fea7ec0d836f2c037f7b71a17365@assyoma.it>
In-Reply-To: <20180103214741.GO5858@dastard>

On 03-01-2018 22:47, Dave Chinner wrote:
> On Wed, Jan 03, 2018 at 03:54:42PM +0100, Gionatan Danti wrote:
>> 
>> 
>> On 03/01/2018 02:19, Dave Chinner wrote:
>> >Cached writes smaller than a *page* will cause RMW cycles in the
>> >page cache, regardless of the block size of the filesystem.
>> 
>> Sure, in this case a page-sized r/m/w cycle happens in the pagecache.
>> However, it seems to me that, when flushed to disk, writes happen at
>> block-level granularity, as you can see from tests[1,2] below.
>> Am I wrong? Am I missing something?
> 
> You're writing into unwritten extents. That's not a data overwrite,
> so behaviour can be very different. And when you have sub-page block
> sizes, the filesystem and/or page cache may decide not to read the
> whole page if it doesn't need to immediately. e.g. you'll see
> different behaviour between a 512 byte write() and a 512 byte write
> via mmap()...

The first "dd" execution surely writes into unwritten extents. However, 
on the following writes real data are overwritten, right?
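
To make the two cases concrete, here is a minimal C sketch (the path
/mnt/test/file, the 4k size and the use of fallocate() are illustrative
assumptions of mine, not the exact dd tests referenced earlier): the
first pwrite() lands in space preallocated as unwritten extents, the
second is a plain overwrite of already-written blocks. Comparing the
resulting block-device traffic with blktrace should show whether the
two cases are treated differently.

#define _GNU_SOURCE             /* O_DIRECT, fallocate() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const size_t len = 4096;
        void *buf;

        /* O_DIRECT needs an aligned buffer */
        if (posix_memalign(&buf, 4096, len))
                return 1;
        memset(buf, 0xab, len);

        /* hypothetical test file on the filesystem under test */
        int fd = open("/mnt/test/file", O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* preallocate: the extent exists but is flagged unwritten */
        if (fallocate(fd, 0, 0, len)) { perror("fallocate"); return 1; }

        /* first write: converts the unwritten extent to written */
        if (pwrite(fd, buf, len, 0) != (ssize_t)len) { perror("pwrite 1"); return 1; }

        /* second write: a plain overwrite of existing data */
        if (pwrite(fd, buf, len, 0) != (ssize_t)len) { perror("pwrite 2"); return 1; }

        close(fd);
        free(buf);
        return 0;
}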

> IOWs, there are so many different combinations of behaviour and
> variables that we don't try to explain every single nuance. If you
> do sub-page and/or sub-block size IO, then expect page-sized RMW
> to occur. It might be smaller depending on the fs config, the file
> layout, the underlying extent type, the type of operation the
> write must perform (e.g. plain overwrite vs copy-on-write), the
> offset into the page/block, etc. The simple message is this: avoid
> sub-block/page size IO if you can possibly avoid it.
> 
>> >Ok, there is a difference between *sector size* and *filesystem
>> >block size*. You seem to be using them interchangeably in your
>> >question, and that's not correct.
>> 
>> True, maybe I have trouble grasping the concept of sector size from
>> XFS's point of view. I understand sector size as a hardware property
>> of the underlying block device, but how does it relate to the
>> filesystem?
>> 
>> I naively supposed that an XFS filesystem created with a 4k *sector*
>> size (i.e. mkfs.xfs -s size=4096) would prevent 512-byte O_DIRECT
>> writes, but my test[3] shows that even on such a filesystem a 512B
>> direct write is indeed possible.
>> 
>> Is the sector size information only used for XFS's own metadata and
>> journaling, in order to avoid costly device-level r/m/w cycles on
>> 512e devices? I understand that on a 4Kn device you *have* to avoid
>> sub-sector writes, or the transfer will fail.
> 
> We don't care if the device does internal RMW cycles (RAID does
> that all the time). The sector size we care about is the size of an
> atomic write IO - the IO size that the device guarantees will either
> succeed completely or fail without modification. This is needed for
> journal recovery sanity.
> 
> For data, the kernel checks the logical device sector size and
> limits direct IO to those sizes, not the filesystem sector size.
> i.e. the filesystem sector size is there for sizing journal operations
> and metadata, not for limiting data access alignment.

This is an outstanding explanation.
Thank you very much.
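
That distinction is also easy to check from userspace. A minimal sketch
(the device path /dev/sda is just a placeholder): BLKSSZGET returns the
logical sector size that bounds O_DIRECT alignment, while BLKPBSZGET
returns the physical sector size a 512e drive uses internally.

#include <fcntl.h>
#include <linux/fs.h>           /* BLKSSZGET, BLKPBSZGET */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
        int logical = 0;
        unsigned int physical = 0;

        /* placeholder device path */
        int fd = open("/dev/sda", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* logical sector size: the unit that bounds direct IO alignment */
        if (ioctl(fd, BLKSSZGET, &logical)) { perror("BLKSSZGET"); return 1; }

        /* physical sector size: what a 512e drive really uses internally */
        if (ioctl(fd, BLKPBSZGET, &physical)) { perror("BLKPBSZGET"); return 1; }

        printf("logical sector size:  %d bytes\n", logical);
        printf("physical sector size: %u bytes\n", physical);

        close(fd);
        return 0;
}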

>> I want to give some context on the original question, and why I am so
>> interested in r/m/w cycles. SSD flash-page sizes have, in recent
>> years (2014+), ballooned to 8/16/32K. I wonder if a matching
>> blocksize and/or sector size is needed to avoid (some of the)
>> device-level r/m/w cycles, which can dramatically increase flash
>> write amplification (and thus reduce endurance).
> 
> We've been over this many times in the past few years. User data
> alignment is controlled by the stripe unit/width specification,
> not by sector/block sizes.

Sure, but to avoid/mitigate device-level r/m/w, proper alignment is
not sufficient by itself: you should also avoid partial page writes.
Anyway, I got the message: this is not something XFS directly cares
about.
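
For completeness, the geometry XFS does export (block size, sector size
and the stripe unit/width it aligns allocation to) can be read with the
XFS_IOC_FSGEOMETRY ioctl. A rough sketch, assuming the xfsprogs
development headers are installed and /mnt/test is a placeholder path
on the filesystem:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <xfs/xfs.h>            /* XFS_IOC_FSGEOMETRY, struct xfs_fsop_geom */

int main(void)
{
        /* any fd on the XFS filesystem will do; the path is a placeholder */
        int fd = open("/mnt/test", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct xfs_fsop_geom geo = { 0 };
        if (ioctl(fd, XFS_IOC_FSGEOMETRY, &geo)) {
                perror("XFS_IOC_FSGEOMETRY");
                return 1;
        }

        printf("block size:   %u bytes\n", geo.blocksize);
        printf("sector size:  %u bytes\n", geo.sectsize);
        printf("stripe unit:  %u blocks\n", geo.sunit);
        printf("stripe width: %u blocks\n", geo.swidth);

        close(fd);
        return 0;
}

This reports the stripe geometry that mkfs.xfs records from its su/sw
(or sunit/swidth) options, which is what controls data alignment.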

Thanks again.

-- 
Danti Gionatan
Technical Support
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8

Thread overview: 9+ messages
2017-12-28 23:14 Block size and read-modify-write Gionatan Danti
2018-01-02 10:25 ` Carlos Maiolino
2018-01-03  1:19   ` Dave Chinner
2018-01-03  8:19     ` Carlos Maiolino
2018-01-03 14:54     ` Gionatan Danti
2018-01-03 21:47       ` Dave Chinner
2018-01-03 22:09         ` Gionatan Danti [this message]
2018-01-03 22:59           ` Dave Chinner
2018-01-04  1:38             ` Gionatan Danti
