public inbox for linux-ext4@vger.kernel.org
From: Andreas Dilger <adilger@dilger.ca>
To: Santosh S <santosh.letterz@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>, linux-ext4@vger.kernel.org
Subject: Re: Overwrite faster than fallocate
Date: Mon, 20 Jun 2022 12:52:29 -0600
Message-ID: <117682F9-5CEF-44F2-935E-E048C8A9D75D@dilger.ca>
In-Reply-To: <CAGQ4T_J-43q5xszJK8yDTUt14NGjjQACK4Z1RST-ZQkju3xSzQ@mail.gmail.com>


On Jun 17, 2022, at 5:56 PM, Santosh S <santosh.letterz@gmail.com> wrote:
> 
> On Fri, Jun 17, 2022 at 6:13 PM Theodore Ts'o <tytso@mit.edu> wrote:
>> 
>> On Fri, Jun 17, 2022 at 12:38:20PM -0400, Santosh S wrote:
>>> Dear ext4 developers,
>>> 
>>> This is my test - preallocate a large file (2G) and then do sequential
>>> 4K direct-io writes to that file, with fdatasync after every write.
>>> I am preallocating using fallocate mode 0. I noticed that if the 2G
>>> file is pre-written rather than fallocate'd I get more than twice the
>>> throughput. I could reproduce this with fio. The storage is nvme.
>>> Kernel version is 5.3.18 on Suse.
>>> 
>>> Am I doing something wrong or is this difference expected? Any
>>> suggestion to get a better throughput without actually pre-writing the
>>> file.
>> 
>> This is, alas, expected.  The reason is that when you use
>> fallocate, the extent is marked as uninitialized, so that when you
>> read from those newly allocated blocks, you don't see previously
>> written data belonging to deleted files.  These files could contain
>> someone else's e-mail, or medical information, etc.  So if we didn't
>> do this, it would be a walking, talking HIPAA or PCI violation.
>> 
>> So when you write to an fallocated region, and then call fdatasync(2),
>> we need to update the metadata blocks to clear the uninitialized bit
>> so that when you read from the file after a crash, you actually get
>> the data that was written.  So the fdatasync(2) operation is quite the
>> heavyweight operation, since it requires a journal commit because of the
>> required metadata update.  When you do an overwrite, there is no need
>> to force a metadata update and journal update, which is why write(2)
>> plus fdatasync(2) is much lighter weight when you do an overwrite.
>> 
>> What enterprise databases (e.g., Oracle Enterprise Database and IBM's
>> Informix DB) tend to do is to fallocate a chunk of space (say,
>> 16MB or 32MB), because for Legacy Unix OS's, this tends to enable some
>> file system's block allocators to be more likely to allocate a
>> contiguous block range, and then immediately write zeros to that 16 or
>> 32MB, plus a fdatasync(2).  This fdatasync(2) would update the extent
>> tree once to mark that 16MB or 32MB as initialized in the
>> database's tablespace file, so you only pay the metadata update once,
>> instead of every few dozen kilobytes as you write each database commit
>> into the tablespace file.
>> 
>> There is also an old, out of tree patch which enables an fallocate
>> mode called "no hide stale", which marks the extent tree blocks which
>> are allocated using fallocate(2) as initialized.  This substantially
>> speeds things up, but it is potentially a walking, talking, HIPAA or
>> PCI violation in that revealing previously written data is considered
>> a horrible security violation by most file system developers.
>> 
>> If you know, say, that a cluster file system is the only user of the
>> file system, and all data is written encrypted at rest using a
>> per-user key, such that exposing stale data is not a security
>> disaster, the "no hide stale" flag could be "safe" in that highly
>> specialized user case.
>> 
>> But that assumes that file system authors can trust application
>> writers not to do something stupid and insecure, and historically,
>> file system authors (possibly with good reason, given bitter past
>> experience) don't trust application writers to do something which is
>> very easy, and gooses performance, even if it has terrible side
>> effects on either data robustness or data security.
>> 
>> Effectively, the no hide stale flag could be considered an "Attractive
>> Nuisance"[1] and so support for this feature has never been accepted
>> into the mainline kernel, nor into any distro kernels, since the
>> distribution companies don't want to be held liable for making an
>> "attractive nuisance" that might enable application authors to shoot
>> themselves in the foot.
>> 
>> [1] https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine
>> 
>> In any case, the technique of fallocate(2) plus zero-fill-write plus
>> fdatasync(2) isn't *that* slow, and is only needed when you are first
>> extending the tablespace file.  In the steady state, most database
>> applications tend to be overwriting space, so this isn't an issue.
>> 
>> In any case, if you need to get that last 5% or so of performance ---
>> say, if you are an enterprise database company interested in
>> taking a full page advertisement on the back cover of Business Week
>> Magazine touting how your enterprise database benchmarks are better
>> than the competition --- the simple solution is to use a raw block
>> device.  Of course, most end users want the convenience of the file
>> system, but that's not the point if you are engaging in
>> benchmarketing.   :-)
>> 
>> Cheers,
>> 
>>                                                - Ted
> 
> Thank you for a comprehensive answer :-)
> 
> I have one more question - when I gradually increase the i/o transfer
> size the performance degradation begins to lessen and at 32K it is
> similar to the "overwriting the file" case. I assume this is because
> the metadata update is now spread over 32K of data rather than 4K.
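
The chunked approach Ted describes above (fallocate a chunk, zero-fill it, then a single fdatasync) can be sketched roughly as follows. This is only an illustration: the helper name `extend_initialized`, the 16MB default and the 1MB write size are assumptions for the example, it uses buffered writes rather than direct I/O, and `os.posix_fallocate` stands in for fallocate mode 0 on Linux.

```python
import os

CHUNK = 16 * 1024 * 1024  # chunk size from the description above; adjust as needed


def extend_initialized(fd, offset, length=CHUNK, block=1024 * 1024):
    """Allocate `length` bytes at `offset`, zero-fill them, then sync once.

    Hypothetical helper: the name and defaults are assumptions for this sketch.
    """
    # Reserve the space first so the allocator has a chance to pick a
    # contiguous range for the whole chunk.
    os.posix_fallocate(fd, offset, length)

    zeros = bytes(block)
    pos = offset
    end = offset + length
    while pos < end:
        n = min(block, end - pos)
        # pwrite may write fewer bytes than requested, so advance by its result.
        pos += os.pwrite(fd, zeros[:n], pos)

    # A single fdatasync pays for the metadata update once per chunk
    # instead of once per small application write.
    os.fdatasync(fd)
    return length
```

Later 4K writes inside the chunk are then plain overwrites, so each fdatasync(2) should not need an extent tree update.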

When splitting unwritten extents, the ext4 code will write out zero
blocks up to 32KB by default (/sys/fs/ext4/*/extent_max_zeroout_kb)
to avoid having millions of very small extents in a file (e.g. in
case of a pathological alternating 4KB write pattern).  If your test
is writing >= 32KB blocks then this no longer needs to be done.  If
writing smaller blocks then it makes sense that the speed is 1/2 the
raw speed because the file blocks are all being written twice (first
with zeroes, then with actual data on a later write).
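
As a rough illustration of the "written twice" effect, here is a toy model that counts bytes sent to disk for sequential writes into unwritten space. It is not the ext4 code: the function name `simulate_disk_bytes` and the rule "a small write zero-fills the next zeroout-sized region" are simplifying assumptions based only on the description above.

```python
def simulate_disk_bytes(total, write_size, zeroout=32 * 1024):
    """Count bytes written to disk for sequential writes into unwritten space.

    Toy model (an assumption, not kernel behaviour in detail): a write smaller
    than `zeroout` that lands in unwritten space first zero-fills up to
    `zeroout` bytes; a write of at least `zeroout` bytes needs no zero-fill.
    """
    initialized_upto = 0  # sequential writes, so one high-water mark suffices
    disk_bytes = 0
    offset = 0
    while offset < total:
        n = min(write_size, total - offset)
        end = offset + n
        if end > initialized_upto:
            if n < zeroout:
                # Small write: zero-fill a zeroout-sized region first.
                zero_start = max(initialized_upto, offset)
                zero_end = min(zero_start + zeroout, total)
                disk_bytes += zero_end - zero_start
                initialized_upto = zero_end
            else:
                # Large write: the range is converted without zero-fill.
                initialized_upto = end
        disk_bytes += n  # the data itself
        offset = end
    return disk_bytes


if __name__ == "__main__":
    one_mib = 1024 * 1024
    for size_kb in (4, 16, 32, 64):
        ratio = simulate_disk_bytes(one_mib, size_kb * 1024) / one_mib
        print(f"{size_kb}KB writes: {ratio:.1f}x bytes written")
```

In this model, writes below the threshold send twice the bytes to disk, and writes at or above it send each byte once.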

32KB (or 64KB) is a reasonable minimum size because any disk write
will take the same time to write a single block or a whole sector,
so doing writes in smaller units is not very efficient.  Depending
on the underlying storage (e.g. RAID-6) it might be more efficient
to set extent_max_zeroout_kb=1024 or similar.
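
To check or change the tunable, something like the sketch below may be enough. The sysfs path is the one mentioned above; the helper names `read_zeroout_kb` and `set_zeroout_kb` and the `sysfs_root` parameter are made up for this example, and writing the value normally requires root.

```python
import glob
import os


def read_zeroout_kb(sysfs_root="/sys/fs/ext4"):
    """Return {device: extent_max_zeroout_kb} for each mounted ext4 filesystem."""
    values = {}
    pattern = os.path.join(sysfs_root, "*", "extent_max_zeroout_kb")
    for path in glob.glob(pattern):
        # The directory name under /sys/fs/ext4 is the device name.
        dev = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            values[dev] = int(f.read().strip())
    return values


def set_zeroout_kb(dev, kb, sysfs_root="/sys/fs/ext4"):
    """Write a new extent_max_zeroout_kb value for one device (needs root)."""
    path = os.path.join(sysfs_root, dev, "extent_max_zeroout_kb")
    with open(path, "w") as f:
        f.write(str(kb))


if __name__ == "__main__":
    for dev, kb in sorted(read_zeroout_kb().items()):
        print(f"{dev}: {kb} KB")
```

The `sysfs_root` parameter exists only so the helpers can be tried against a scratch directory before touching a real system.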

> However, my understanding is that, in my case, an extent should
> represent the max 128MiB of data and so the clearing of the
> uninitialized bit for an extent should happen once every 128MiB, so
> then why is a higher transfer size making a difference?

You are misunderstanding how uninitialized extents are cleared.  The
uninitialized extent is split into two/three parts, where only the
extent that has data written to it (min 32KB) is set to "initialized"
and the remaining one/two extents are left uninitialized.  Otherwise,
each write to an uninitialized extent would need up to 128MB of zeroes
written to disk each time, which would be slow/high latency.
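
The two/three-way split can be pictured with a small model. This only illustrates the bookkeeping and is not the ext4 implementation: `split_unwritten` is a hypothetical name, units are filesystem blocks, and the zero-fill of small writes is left out.

```python
def split_unwritten(ext_start, ext_len, wr_start, wr_len):
    """Split an unwritten extent around a write; all units are blocks.

    Returns a list of (start, length, state) tuples. Illustrative model only.
    """
    ext_end = ext_start + ext_len
    wr_end = wr_start + wr_len
    if wr_len <= 0 or wr_start < ext_start or wr_end > ext_end:
        raise ValueError("write must be non-empty and lie inside the extent")

    parts = []
    if wr_start > ext_start:
        # Space before the write stays unwritten.
        parts.append((ext_start, wr_start - ext_start, "unwritten"))
    # Only the range that received data becomes initialized.
    parts.append((wr_start, wr_len, "initialized"))
    if wr_end < ext_end:
        # Space after the write stays unwritten.
        parts.append((wr_end, ext_end - wr_end, "unwritten"))
    return parts


if __name__ == "__main__":
    # A 128MiB extent is 32768 blocks of 4KiB; write 8 blocks (32KiB) mid-extent.
    for part in split_unwritten(0, 32768, 1024, 8):
        print(part)
```

A write at the start or end of the extent gives two parts, and a write in the middle gives three, so each such write changes the extent tree.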

Cheers, Andreas





