From: Ric Wheeler <rwheeler@redhat.com>
To: Chris Mason <clmason@fusionio.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>,
"Martin K. Petersen" <mkp@mkp.net>,
"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>
Subject: Re: atomic write & T10 standards
Date: Wed, 03 Jul 2013 14:55:28 -0400
Message-ID: <51D473A0.9050703@redhat.com>
In-Reply-To: <20130703185417.14981.87700@localhost.localdomain>
On 07/03/2013 02:54 PM, Chris Mason wrote:
> Quoting Ric Wheeler (2013-07-03 14:31:59)
>> On 07/03/2013 11:54 AM, Chris Mason wrote:
>>> Quoting Ric Wheeler (2013-07-03 11:42:38)
>>>> On 07/03/2013 11:37 AM, James Bottomley wrote:
>>>>> On Wed, 2013-07-03 at 11:27 -0400, Ric Wheeler wrote:
>>>>>> On 07/03/2013 11:22 AM, James Bottomley wrote:
>>>>>>> On Wed, 2013-07-03 at 11:04 -0400, Ric Wheeler wrote:
>>>>>>>> Why not have the atomic write actually imply that it is atomic and durable for
>>>>>>>> just that command?
>>>>>>> I don't understand why you think you need guaranteed durability for
>>>>>>> every journal transaction? That's what causes us performance problems
>>>>>>> because we have to pause on every transaction commit.
>>>>>>>
>>>>>>> We require durability for explicit flushes, obviously, but we could
>>>>>>> achieve far better performance if we could just let the filesystem
>>>>>>> updates stream to the disk and rely on atomic writes making sure the
>>>>>>> journal entries were all correct. The reason we require durability for
>>>>>>> journal entries today is to ensure caching effects don't cause the
>>>>>>> journal to lie or be corrupt.
>>>>>> Why would we use atomic writes for things that don't need to be
>>>>>> durable?
>>>>>>
>>>>>> Avoiding a torn page write seems to be the only real difference here if
>>>>>> you use the atomic operations and don't have durability...
>>>>> It's not just about torn pages: Journal entries are big complex beasts.
>>>>> They can be megabytes big (at least on xfs). If we can guarantee all or
>>>>> nothing atomicity in the entire journal entry write it permits a more
>>>>> streaming design of the filesystem writeout path.
>>>>>
>>>>> James
>>>>>
>>>>>
>>>> Journals are normally big (128MB or so?) - I don't think that this is unique to xfs.
>>> We're mixing a bunch of concepts here. The filesystems have a lot of
>>> different requirements, and atomics are just one small part.
>>>
>>> Creating a new file often uses resources freed by past files. So
>>> deleting the old must be ordered against allocating the new. They are
>>> really separate atomic units but you can't handle them completely
>>> independently.
>>>
>>>> If our existing journal commit is:
>>>>
>>>> * write the data blocks for a transaction
>>>> * flush
>>>> * write the commit block for the transaction
>>>> * flush
>>>>
>>>> Which part of this does an atomic write help?
>>>>
>>>> We would still need at least:
>>>>
>>>> * atomic write of data blocks & commit blocks
>>>> * flush
>>> Yes. But just because we need the flush here doesn't mean we need the
>>> flush for every single atomic write.
>>>
>>> -chris
>>>
>> The catch is that our current flush mechanisms are still pretty brute force and
>> act across either the whole device or in a temporal (everything flushed before
>> this is acked) way.
> This is only partially true, since you're extending the sata drive model
> into atomics, and the devices implementing atomics (so far anyway)
> are not sata.
>
>> I still see it would be useful to have the atomic write really be atomic and
>> durable just for that IO - no flush needed.
> In sata speak, it could go down as atomic + FUA + NCQ. In practice this
> is going to be in fusionio, nvme devices and big storage arrays, all of
> which we can expect to have proper knobs for lies about IO that isn't
> really done yet.
>
>> Can you give a sequence for the use case for the non-durable atomic write that
>> would not need a sync? Can we really trust all devices to make something atomic
>> that is not durable :) ?
> Today's usage is mostly O_DIRECT, which really should be FUA. Long term
> we can hope people will find more interesting uses.
>
> Either way the point is that an atomic write is a grouping mechanism,
> and if the standards people want to control fuaness in a separate bit,
> that's really fine.
>
> -chris
>
That makes sense to me - happy to have a separate bit to indicate durability in
the atomic operation...
Ric
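
The commit sequences discussed in this thread can be compared with a toy
model. The sketch below is only an illustration in Python: ToyDevice, its
write/flush/crash methods, the atomic and fua flags, and the commit_* helpers
are all hypothetical names made up for this example, not a kernel, SCSI, or
T10 interface. It assumes a device with a volatile write cache where a crash
persists an arbitrary subset of the cached blocks, except that an atomic
group reaches media all or nothing.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDevice:
    """Toy block device with a volatile write cache (illustrative only).

    Overlapping LBAs between cached and FUA writes are not modeled.
    """
    durable: dict = field(default_factory=dict)  # blocks that survive power loss
    pending: list = field(default_factory=list)  # cached (blocks, atomic) groups
    flushes: int = 0                             # number of cache flushes issued

    def write(self, blocks, atomic=False, fua=False):
        """Accept {lba: data}. atomic groups the blocks; fua persists them now."""
        if fua:
            self.durable.update(blocks)
        else:
            self.pending.append((dict(blocks), atomic))

    def flush(self):
        """Whole-device flush: everything acknowledged so far becomes durable."""
        for blocks, _ in self.pending:
            self.durable.update(blocks)
        self.pending.clear()
        self.flushes += 1

    def crash(self, lucky_lbas):
        """Power loss: only cached blocks whose LBA is in lucky_lbas reach media.

        An atomic group is kept only if every one of its blocks is lucky.
        """
        for blocks, atomic in self.pending:
            hit = {lba: d for lba, d in blocks.items() if lba in lucky_lbas}
            if atomic and len(hit) != len(blocks):
                continue  # all or nothing: drop the partially written group
            self.durable.update(hit)
        self.pending.clear()


def commit_classic(dev, data, commit):
    """Existing journal commit: data blocks, flush, commit block, flush."""
    dev.write(data)
    dev.flush()
    dev.write(commit)
    dev.flush()


def commit_atomic(dev, data, commit):
    """One atomic group, no flush: never torn, but not yet durable."""
    dev.write({**data, **commit}, atomic=True)


def commit_atomic_durable(dev, data, commit):
    """Atomic group with the separate durability bit set: no flush needed."""
    dev.write({**data, **commit}, atomic=True, fua=True)
```

In this model the classic sequence costs two flushes. An atomic group with no
flush can lose the whole transaction in a crash, but it never leaves a commit
block on media without its data blocks. Setting the hypothetical fua flag as
well makes that one write durable without any flush, and calling flush() after
commit_atomic() gives the "atomic write, then flush" sequence with a single
flush.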