From: Phillip Susi <psusi@cfl.rr.com>
To: Andrea Arcangeli <andrea@suse.de>
Cc: Denis Vlasenko <vda.linux@googlemail.com>,
Bill Davidsen <davidsen@tmr.com>,
Michael Tokarev <mjt@tls.msk.ru>,
Linus Torvalds <torvalds@osdl.org>, Viktor <vvp01@inbox.ru>,
Aubrey <aubreylee@gmail.com>, Hua Zhong <hzhong@gmail.com>,
Hugh Dickins <hugh@veritas.com>,
Linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: O_DIRECT question
Date: Tue, 30 Jan 2007 18:07:14 -0500
Message-ID: <45BFCFA2.6000701@cfl.rr.com>
In-Reply-To: <20070130195720.GR8030@opteron.random>
Andrea Arcangeli wrote:
> When you have I/O errors during _writes_ (not Read!!) the raid must
> kick the disk out of the array before the OS ever notices. And if it's
> software raid that you're using, the OS should kick out the disk
> before your app ever notices any I/O error. when the write I/O error
> happens, it's not a problem for the application to solve.
I thought it obvious that we were talking about non-recoverable errors
that then DO make it to the application. And any kind of
mission-critical app most definitely does care about write errors. You
don't want your db completing the transaction when it was only half recorded.
It needs to know it failed so it can back out and/or recover the data
and record it elsewhere. You certainly don't want the users to think
everything is fine, walk away, and have the system continue to limp on
making things worse by the second.
> when the I/O error reaches the filesystem if you're lucky if the OS
> won't crash (ext3 claims to handle it), if your app receives the I/O
> error all you should be doing is to shutdown things gracefully sending
> all errors you can to the admin.
If the OS crashes due to an IO error reading user data, then there is
something seriously wrong and beyond the scope of this discussion. It
suffices to say that due to the semantics of write() and sound
engineering practice, the application expects to be notified of errors
so it can try to recover, or fail gracefully. Whether it chooses to
fail gracefully as you say it should, or recovers from the error, it
needs to know that an error happened, and where it was.
> It doesn't matter much where the error happened, all it matters is that
> you didn't have a fault tolerant raid setup (your fault) and your
> primary disk just died and you're now screwed(tm). If you could trust
> that part of the disk is still sane you could perhaps attempt to avoid
> a restore from the last backup, otherwise all you can do is the
> equivalent of a e2fsck -f on the db metadata after copying what you
> can still read to the new device.
It most certainly matters where the error happened because "you are
screwed" is not an acceptable outcome in a mission-critical application.
A well-engineered solution will deal with errors as best as possible,
not simply give up and tell the user they are screwed because the
designer was lazy. There is a reason that read and write return the
number of bytes _actually_ transferred, and the application is supposed
to check that result to verify proper operation.
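
As an illustration of checking that result, here is a minimal sketch in
Python (whose os.write is a thin wrapper over write(2)). The names
write_all and WriteFailed are hypothetical, made up for this example.
It retries after short writes and, on failure, reports how many bytes
were accepted before the error, so the caller knows where it happened.

```python
import os


class WriteFailed(Exception):
    """Hypothetical error type: records how far a write got before failing."""

    def __init__(self, written, cause):
        super().__init__(f"write failed after {written} bytes: {cause}")
        self.written = written  # bytes the kernel accepted before the error
        self.cause = cause      # the underlying OSError


def write_all(fd, data):
    """Write all of data to fd, continuing after short writes.

    Returns the number of bytes written (len(data) on success).
    Raises WriteFailed, carrying the count written so far, on an error.
    """
    view = memoryview(data)
    written = 0
    while written < len(view):
        try:
            n = os.write(fd, view[written:])
        except OSError as exc:
            # The caller learns both that it failed and where.
            raise WriteFailed(written, exc) from exc
        if n == 0:
            # No progress on a non-empty buffer: treat as a failure
            # rather than spinning forever.
            raise WriteFailed(written, OSError("write returned 0"))
        written += n  # short write: loop again for the remainder
    return written
```

A caller can catch WriteFailed and use its written field to decide
whether to back out the transaction or record the data elsewhere.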
> Sorry but as far as ordering is concerned, O_DIRECT, fsync and O_SYNC
> offers exactly the same guarantees. Feel free to check the real life
> db code. Even bdb uses fsync.
No, there is a slight difference. An fsync() flushes all dirty buffers
in an undefined order. Using O_DIRECT or O_SYNC, you can control the
flush order because you can simply wait for one set of writes to
complete before starting another set that must not be written until
after the first are on the disk. You can emulate that by placing an
fsync between both sets of writes, but that will flush any other dirty
buffers whose ordering you do not care about. Also there is no aio
version of fsync.
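
The ordering argument can be sketched roughly as follows, again in
Python. The function name write_in_order and the (offset, data) layout
are assumptions made for this example, and it assumes the filesystem and
device actually honor O_SYNC. Because each write on an O_SYNC
descriptor returns only after the kernel reports the data as written,
the second set cannot reach the disk before the first.

```python
import os


def _pwrite_all(fd, data, offset):
    """Write all of data at offset, continuing after short writes."""
    view = memoryview(data)
    done = 0
    while done < len(view):
        done += os.pwrite(fd, view[done:], offset + done)


def write_in_order(path, first, second):
    """Write every (offset, data) pair in `first`, then those in `second`.

    The file is opened with O_SYNC, so each pwrite returns only once the
    kernel reports the data as written. No write from `second` is issued
    until all writes from `first` have completed.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    try:
        for offset, data in first:
            _pwrite_all(fd, data, offset)
        # Ordering point: everything in `first` is done. No fsync() is
        # needed here, so unrelated dirty buffers are not flushed.
        for offset, data in second:
            _pwrite_all(fd, data, offset)
    finally:
        os.close(fd)
```

For example, `first` could hold journal blocks and `second` the commit
record that must not land before them.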
> Please try yourself, it's simple enough:
>
> time dd if=/dev/hda of=/dev/null bs=16M count=100
> time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=sync
> time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=direct
>
> if you can measure any slowdown in the sync/direct you're welcome (it
> runs faster here... as it should). The pipeline stall is not
> measurable when it's so infrequent, and actually the pipeline stall is
> not a big issue when the I/O is contiguous and the dma commands are
> always large.
>
> aio is mandatory only while dealing with small buffers, especially
> while seeking to take advantage of the elevator.
>
sync has no effect on reading, so that test is pointless. direct saves
the cpu overhead of the buffer copy, but isn't good if the cache isn't
entirely cold. The large buffer size really has little to do with it;
rather, it is the fact that the writes to null do not block dd from
making the next read for any length of time. If dd were blocking on an
actual output device, that would leave the input device idle for the
portion of the time that dd was blocked.
In any case, this is a totally different example than your previous one
which had dd _writing_ to a disk, where it would block for long periods
of time due to O_SYNC, thereby preventing it from reading from the input
buffer in a timely manner. By not reading the input pipe frequently, it
becomes full and thus, tar blocks. In that case the large buffer size
is actually a detriment because with a smaller buffer size, dd would not
be blocked as long and so it could empty the pipe more frequently
allowing tar to block less.
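
The reason tar blocks is that a pipe holds only a bounded amount of
unread data. A small sketch that measures this, assuming a Unix-like
system (the function name pipe_capacity is made up for this example):
it fills a pipe that nobody reads from and counts how many bytes are
accepted before a further write would block.

```python
import os


def pipe_capacity(chunk=4096):
    """Fill a pipe that nobody reads and return how many bytes it took."""
    r, w = os.pipe()
    try:
        # Non-blocking, so a full pipe raises instead of stalling us.
        os.set_blocking(w, False)
        block = b"\0" * chunk
        total = 0
        while True:
            try:
                total += os.write(w, block)
            except BlockingIOError:
                # A blocking writer (tar, in the example above) would
                # stall here until the reader (dd) drains the pipe.
                return total
    finally:
        os.close(r)
        os.close(w)
```

On Linux the default is typically 64 KiB, which is much smaller than a
16M dd buffer, so the writer stalls for as long as the reader stays
blocked on its own output.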
> This whole thing is about performance, if you remove performance
> factors from the equation, you can stick to your O_SYNC 512bytes at
> time to the journal design. You're perfectly right that when you
> remove performance from the equation you can claim that O_DIRECT is
> much the same as O_SYNC.
> Guess what, if O_SYNC could run as fast as O_DIRECT by still passing
> through pagecache, O_DIRECT wouldn't exist. You can't pretend to
> describe the semantics of any kernel API if you remove performance
> considerations from it. It must be some not useful university theory
> if they thought you that performance evaluation must not be present in
> the semantics. If that's the case, it's best you stop talking about
> semantics when you discuss about any kernel APIs. A ton of kernel APIs
> are all about improving performance, so they'll all be the same if you
> only look at your performance agnostic semantics, it's not just
> O_DIRECT that would become the same as O_SYNC.
You seem to have missed the point of this thread. Denis Vlasenko's
message that you replied to simply pointed out that they are
semantically equivalent, so O_DIRECT can be dropped provided that O_SYNC
+ madvise could be fixed to perform as well. Several people including
Linus seem to like this idea and think it is quite possible.