From: Ric Wheeler <rwheeler@redhat.com>
To: Howard Chu <hyc@symas.com>
Cc: General Discussion of SQLite Database <sqlite-users@sqlite.org>,
David Lang <david@lang.hm>, Vladislav Bolkhovitin <vst@vlnb.net>,
"Theodore Ts'o" <tytso@mit.edu>, Richard Hipp <drh@hwaci.com>,
linux-kernel <linux-kernel@vger.kernel.org>,
linux-fsdevel@vger.kernel.org
Subject: Re: [sqlite] light weight write barriers
Date: Fri, 16 Nov 2012 13:03:02 -0500
Message-ID: <50A67FD6.1030108@redhat.com>
In-Reply-To: <50A661D0.4030200@symas.com>
On 11/16/2012 10:54 AM, Howard Chu wrote:
> Ric Wheeler wrote:
>> On 11/16/2012 10:06 AM, Howard Chu wrote:
>>> David Lang wrote:
>>>> barriers keep getting mentioned because they are an easy concept to understand.
>>>> "do this set of stuff before doing any of this other set of stuff, but I don't
>>>> care when any of this gets done" and they fit well with the requirements of
>>>> the users.
>>>>
>>>> Users readily accept that if the system crashes, they will lose the most
>>>> recent stuff that they did,
>>>
>>> *some* users may accept that. *None* should.
>>>
>>>> but they get annoyed when things get corrupted to the point
>>>> that they lose the entire file.
>>>>
>>>> this includes things like modifying one option and a crash resulting in the
>>>> config file being blank. Yes, you can do the "write to temp file, sync file,
>>>> sync directory, rename file" dance, but the fact that to do so the user
>>>> must sit and wait for the syncs to take place can be a problem. It would be
>>>> far better to be able to say "write to temp file, and after it's on disk,
>>>> rename the file" and not have the user wait. The user doesn't really care if
>>>> the changes hit disk immediately, or several seconds (or even 10s of seconds)
>>>> later, as long as there is not any possibility of the rename hitting disk
>>>> before the file contents.
>>>>
>>>> The fact that this could be implemented in multiple ways in the existing
>>>> hardware does not mean that there need to be multiple ways exposed to
>>>> userspace, it just means that the cost of doing the operation will vary
>>>> depending on the hardware that you have. This also means that if new hardware
>>>> introduces a new way of implementing this, that improvement can be passed on
>>>> to the users without needing application changes.
>>>
>>> There are a couple industry failures here:
>>>
>>> 1) the drive manufacturers sell drives that lie, and consumers accept it
>>> because they don't know better. We programmers, who know better, have failed
>>> to raise a stink and demand that this be fixed.
>>> A) Drives should not lose data on power failure. If a drive accepts a write
>>> request and says "OK, done" then that data should get written to stable
>>> storage, period. Whether it requires capacitors or some other onboard power
>>> supply, or whatever, they should just do it. Keep in mind that today, most of
>>> the difference between enterprise drives and consumer desktop drives is just a
>>> firmware change; the hardware is already identical. Nobody should accept a
>>> product that doesn't offer this guarantee. It's inexcusable.
>>> B) it should go without saying - drives should reliably report back to the
>>> host, when something goes wrong. E.g., if a write request has been accepted,
>>> cached, and reported complete, but then during the actual write an ECC failure
>>> is detected in the cacheline, the drive needs to tell the host "oh by the way,
>>> block XXX didn't actually make it to disk like I told you it did 10ms ago."
>>>
>>> If the entire software industry were to simply state "your shit stinks and
>>> we're not going to take it any more" the hard drive industry would have no
>>> choice but to fix it. And in most cases it would be a zero-cost fix for them.
>>>
>>> Once you have drives that are actually trustworthy, actually reliable (which
>>> doesn't mean they never fail, it only means they tell the truth about
>>> successes or failures), most of these other issues disappear. Most of the need
>>> for barriers disappears.
>>>
>>
>> I think that you are arguing a fairly silly point.
>
> Seems to me that you're arguing that we should accept inferior technology.
> Who's really being silly?

No, just suggesting that you either pay for the expensive stuff or learn how to
use cost-effective, high-capacity storage like the rest of the world.

I don't disagree that having non-volatile write caches would be nice, but
everyone has learned how to deal with volatile write caches at the low end of
the market.

>
>> If you want that behaviour, you have had it for more than a decade - simply
>> disable the write cache on your drive and you are done.
>
> You seem to believe it's nonsensical for someone to want both fast and
> reliable writes, or that it's unreasonable for a storage device to offer the
> same, cheaply. And yet it is clearly trivial to provide all of the above.

I look forward to seeing your products in the market.

Until you have more than "I want" and "I think" on your storage system design
resume, I suggest you spend the money to get the parts with non-volatile write
caches or fix your code.

Ric

>> If you - as a user - want to run faster and use applications that are coded to
>> handle data integrity properly (fsync, fdatasync, etc), leave the write cache
>> enabled and use file system barriers.
>
> Applications aren't supposed to need to worry about such details, that's why
> we have operating systems.
>
> Drives should tell the truth. In the event of an error detected after the fact,
> the drive should report the error back to the host. There's nothing
> nonsensical there.
>
> When a drive's cache is enabled, the host should maintain a queue of written
> pages, of a length equal to the size of the drive's cache. If a drive says
> "hey, block XXX failed" the OS can reissue the write from its own queue. No
> muss, no fuss, no performance bottlenecks. This is what Real Computers did
> before the age of VAX Unix.
>
>> Everyone has to trade off cost versus something else and this is a very, very
>> long standing trade off that drive manufacturers have made.
>
> With the cost of storage falling as rapidly as it has in recent years, this is
> a stupid tradeoff.
>
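
For readers following the thread, the "write to temp file, sync file, sync
directory, rename file" sequence that David Lang describes in the quoted text
can be sketched roughly as below. This is only an illustration under stated
assumptions, not code from anyone on the thread: the function name
atomic_replace and the temp-file prefix are hypothetical, a POSIX system is
assumed (opening a directory to fsync it does not work on Windows), and the
directory is synced after the rename, which is the ordering commonly used so
that the rename itself is durable. Note that the caller still blocks on both
syncs, which is exactly the cost the thread is arguing about.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace `path` with `data` so a crash leaves the old or the new file.

    Hypothetical helper for illustration; assumes POSIX semantics.
    """
    dirpath = os.path.dirname(os.path.abspath(path))

    # Create the temp file in the same directory so the rename cannot
    # cross a filesystem boundary. mkstemp creates it with mode 0600.
    fd, tmp = tempfile.mkstemp(dir=dirpath, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            # Block until the file contents have been handed to storage.
            os.fsync(f.fileno())
        # Atomically swap the new file into place.
        os.replace(tmp, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise

    # Sync the directory so the rename itself survives a crash.
    dfd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Whether fsync actually reaches stable media still depends on the drive's
write cache and the file system's barrier settings discussed above.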