From: "Theodore Ts'o" <tytso@mit.edu>
To: Travis Downs <travis.downs@gmail.com>
Cc: linux-block@vger.kernel.org
Subject: Re: Semantics of racy O_DIRECT writes
Date: Wed, 8 Jan 2025 23:57:43 -0500
Message-ID: <20250109045743.GE1323402@mit.edu>
In-Reply-To: <CAOBGo4xx+88nZM=nqqgQU5RRiHP1QOqU4i2dDwXt7rF6K0gaUQ@mail.gmail.com>
On Wed, Jan 08, 2025 at 01:33:07PM -0300, Travis Downs wrote:
> Hello linux-block,
>
> We are experiencing data corruption in our storage intensive server
> application and are wondering about the semantics of "racy" O_DIRECT
> writes.
>
> Normally we target XFS, but the question is a general one.
>
> Specifically, imagine that we are writing a single 4K aligned page,
> with contents AB00 (each char being 1K bytes). We only care about
> the first 2048 bytes (the AB part). We are using libaio writes
> (io_submit) with O_DIRECT semantics. While the write is in flight,
> i.e.,
> after we have submitted it and before we reap it in io_getevents, the
> userspace application writes into second half of the page,
> changing it to ABCD (let's say via memcpy). The first half is not changed.
>
> The question then is: is this safe in the sense that would result in
> ABxx being written where xx "is don't care"? Or could it do something
> crazier, like cause later writes to be ignored (e.g. if something in
> the kernel storage layer hashes the page for some purpose and
> this hash is out of sync with the page at the time it was captured, or
> something like that).
>
> Of course, the easy answer is "don't do that", but I still want to
> know what happens if we do.
Don't do that. Really.
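
One way to follow that advice without stalling the application is to
hand the kernel a private copy of the data, so the buffer that is in
flight is never touched until the write completes. The following is
only a minimal sketch under stated assumptions: dio_snapshot() is a
hypothetical helper name, not an existing API, and the alignment you
need depends on the device and filesystem.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for posix_memalign() */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of any library): copy `len` bytes from
 * `src` into a newly allocated buffer aligned to `align` bytes. The size
 * is rounded up to a multiple of `align` and the tail is zero-filled, so
 * the result can be submitted as a whole aligned block.
 *
 * `align` must be a power of two and a multiple of sizeof(void *).
 * Returns NULL on failure. The caller frees the buffer with free(), but
 * only after the write using it has completed.
 */
void *dio_snapshot(const void *src, size_t len, size_t align,
                   size_t *out_len)
{
    assert(src != NULL && out_len != NULL);

    if (align == 0 || len > SIZE_MAX - (align - 1))
        return NULL;

    /* Round the length up to the next multiple of the alignment. */
    size_t padded = (len + align - 1) / align * align;
    if (padded == 0)
        padded = align;

    void *buf = NULL;
    if (posix_memalign(&buf, align, padded) != 0)
        return NULL;

    memcpy(buf, src, len);                        /* the bytes we care about */
    memset((char *)buf + len, 0, padded - len);   /* deterministic padding */

    *out_len = padded;
    return buf;
}
```

In the scenario from the question, the application would snapshot the
first 2048 bytes of the page, submit the returned buffer with
io_submit(), and free it only after io_getevents() reports completion.
The original page can then be modified freely, at the cost of one
extra copy per write.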
First of all, your program might need to run on OS's other than Linux,
such as legacy Unix systems, Mac OS X, etc., and officially, there are
zero guarantees about cache coherency between O_DIRECT writes and the
page cache. So if you mix O_DIRECT I/O with buffered I/O or mmap
access to the same file... all bets are off.
By definition, O_DIRECT I/O bypasses the page cache, so if there is a
copy of the file's data block in the page cache, for some
implementations of some OS's the page cache might contain the
previous, stale version of the block, so buffered reads might not see
the update made by the O_DIRECT write. And if the page is mmap'ed into
some process's address space, and the process dirties that page, that
page will get written back to the disk, potentially overwriting the
O_DIRECT write.
Linux will make best efforts to maintain cache coherency between
O_DIRECT and the page cache. It does that by writing out the page in
the page cache if it is dirty, and then evicting the page from the
page cache. In practice this is good enough for a program like a
database which normally uses O_DIRECT for its transactions, but which
locks the database so it can take a consistent snapshot and then does
the backup via buffered I/O. That will work --- and if the database
weren't locked while taking the backup, the result would be a complete
mess, and O_DIRECT vs. page cache coherency would be the *least* of
your worries.
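
If a program really must hand a file over from a buffered phase to an
O_DIRECT phase, it can do the write-out-and-evict step explicitly
rather than lean on the kernel's best efforts. A minimal sketch,
assuming a hypothetical helper named flush_and_drop_cache(); note that
POSIX_FADV_DONTNEED is only advice, so this narrows the window but
does not guarantee coherency, especially if the file is still mmap'ed
or being written through the page cache by someone else.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for posix_fadvise() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not an existing API): flush dirty page cache
 * pages for `fd` to stable storage, then ask the kernel to drop the
 * cached pages for the whole file.
 *
 * Returns 0 on success or an errno-style error number on failure.
 */
int flush_and_drop_cache(int fd)
{
    assert(fd >= 0);

    /* Write back anything that buffered writers left dirty. */
    if (fsync(fd) != 0)
        return errno;

    /*
     * offset 0 with len 0 means "the whole file". Unlike most calls,
     * posix_fadvise() returns the error number directly instead of
     * setting errno.
     */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

The intended use is to call this once, after all buffered writers have
finished and before the first O_DIRECT access, with the file not
mmap'ed anywhere. It is not a substitute for keeping the two kinds of
I/O apart.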
But in general, don't mix buffered/mmap and O_DIRECT I/O to the same
file. Just don't. It might work, but remember that the raison d'etre
for O_DIRECT is performance in support of databases and storage
systems where developers Know What They Are Doing(tm) and Follow The
Rules(tm). Linux's cache coherency is best efforts only (and other
OS's might not even bother), and database developers and sysadmins
would be sad if we compromised O_DIRECT performance just to make
things 100% safe for people who want to do things which break the
rules. If you like breaking rules, don't use O_DIRECT. You'll be
happier for it, as will hapless future users of your programs. :-)
Remember, good programs are maintainable and portable. What if someone
takes your program and tries to make it work on MacOS?
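
To make the portability point concrete: O_DIRECT is not in POSIX, and
macOS does not provide it at all; the closest equivalent there is
fcntl() with F_NOCACHE, whose semantics are similar but not identical.
A rough sketch of how a program might isolate that difference,
assuming a hypothetical wrapper named open_uncached() that falls back
to ordinary buffered I/O when cache bypass is unavailable:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc only exposes O_DIRECT with this defined */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical wrapper (not an existing API): open `path` with page
 * cache bypass where the platform and filesystem allow it.
 *
 * On return, *direct is 1 if cache bypass was enabled and 0 if the
 * file was opened for ordinary buffered I/O instead. Returns the file
 * descriptor, or -1 with errno set on failure.
 */
int open_uncached(const char *path, int flags, mode_t mode, int *direct)
{
    assert(path != NULL && direct != NULL);

    int fd;
    *direct = 0;

#if defined(O_DIRECT)
    /* Linux and some other systems: request direct I/O at open time. */
    fd = open(path, flags | O_DIRECT, mode);
    if (fd >= 0) {
        *direct = 1;
        return fd;
    }
    if (errno != EINVAL)
        return -1;
    /* The filesystem rejected O_DIRECT; fall back to buffered I/O. */
    return open(path, flags, mode);
#else
    fd = open(path, flags, mode);
    if (fd < 0)
        return -1;
#if defined(F_NOCACHE)
    /* macOS: disable data caching on the descriptor after opening. */
    if (fcntl(fd, F_NOCACHE, 1) == 0)
        *direct = 1;
#endif
    return fd;
#endif
}
```

Callers should check *direct rather than assume bypass succeeded,
since buffer alignment requirements only apply in the direct case, and
a silent fallback changes the performance and caching behaviour of
everything that follows.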
Cheers,
- Ted
P.S. I commend to you the ten commandments for C programmers,
especially the last one. Remember, all the world's not Linux!
https://www.lysator.liu.se/c/ten-commandments.html