From: Vladislav Bolkhovitin <vst@vlnb.net>
To: Ted Ts'o <tytso@mit.edu>, Andreas Dilger <adilger@dilger.ca>,
	Ric Wheeler <rwheeler@redhat.com>, Christoph Hellwig <hch@lst.de>,
	Tejun Heo <tj@kernel.org>, Vivek Goyal <vgoyal@redhat.>
Subject: Re: [RFC] relaxed barrier semantics
Date: Fri, 30 Jul 2010 16:56:31 +0400
Message-ID: <4C52CBFF.6090406@vlnb.net>
In-Reply-To: <20100729230406.GI4506@thunk.org>

Ted Ts'o, on 07/30/2010 03:04 AM wrote:
> On Thu, Jul 29, 2010 at 04:30:54PM -0600, Andreas Dilger wrote:
>> Like James wrote, this is basically everything FUA.  It is OK for
>> ordered mode to allow the device to aggregate the normal filesystem
>> and journal IO, but when the commit block is written it should flush
>> all of the previously written data to disk.  This still allows
>> request re-ordering and merging inside the device, but orders the
>> data vs. the commit block.  Having the proposed "flush ranges"
>> interface to the disk would be ideal, since there would be no wasted
>> time flushing data that does not need it (i.e. other partitions).
>
> My understanding is that "everything FUA" can be a performance
> disaster.  That's because it bypasses the track buffer, and things get
> written directly to disk.  So there is no possibility to reorder
> buffers so that they get written in one disk rotation.  Depending on
> the disk, it might even be that if you send N sequential sectors all
> tagged with FUA, it could be slower than sending the N sectors
> followed by a cache flush or SYNCHRONIZE_CACHE command.

It should be slower, because the write-back sequence gives the drive the
opportunity to better load its internal resources and to pipeline data
transfers. Although, of course, it's possible to imagine a stupid drive
with nearly broken caching that would work faster in write-through mode.

I used the word "drive", not "disk", above because I think this
discussion is not only about disks. Storage might be not only disks, but
also external arrays and even clusters of arrays. They all look to the
system like single "disks", but are much more advanced and sophisticated
in all internal capabilities than dumb (S)ATA disks. Such arrays and
clusters are becoming more and more common. Anybody can build such an
array using just a Linux box with any OSS SCSI target software and use
it with a variety of interfaces: iSCSI, Fibre Channel, SAS, InfiniBand
and even familiar parallel SCSI (Funny, 2 Linux boxes connected by Wide
SCSI :) ).

So, why limit the discussion to low-end disks only? I believe it would be
more productive if we first determine the set of capabilities which
advanced storage devices can provide and which should be used for the
best performance, and then go down to the lower end, dropping the
advanced features at the cost of performance. Otherwise, by ignoring the
"hardware offload" which advanced devices provide, we would never achieve
the best performance they could give.

I'd start the analysis of the best performance facilities from the following:

1. Full set of SCSI queuing and task management control facilities. Namely:

  - SIMPLE, ORDERED, ACA and, maybe, HEAD OF QUEUE command attributes

  - Never draining the queue to wait for completion of one or more
commands, except in some rare error recovery cases.

  - ACA and UA_INTLCK for protecting the queue order if one or more
commands in it finish abnormally.

  - Use of write-back caching by default, switching to write-through
only for "blacklisted" drives.

  - FUA for sequences of a few write commands, where either a
SYNCHRONIZE_CACHE command is overkill, or there is an internal order
dependency between the commands, so they must be written to the media
exactly in the required order.

So, for instance, a naive sequence of meta-data updates with the 
corresponding journal writes would be a chain of commands:

1. 1st journal write command (SIMPLE)

2. 2nd journal write command (SIMPLE)

3. 3rd journal write command (SIMPLE)

4. SYNCHRONIZE_CACHE for blocks written by those 3 commands (ORDERED)

5. Necessary amount of meta-data update commands (all SIMPLE)

6. SYNCHRONIZE_CACHE for blocks written in 5 (ORDERED)

7. Command marking the transaction committed in the journal (ORDERED)

That's all. No queue draining anywhere. Plus, sending commands without
internal order requirements as SIMPLE would allow the drive to better
schedule their execution across its internal storage (actual disks).
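
To illustrate the chain above, here is a minimal Python sketch of how a
drive could treat the two task attributes. It assumes a simplified model:
SIMPLE commands may run in any order among themselves, while an ORDERED
command waits for everything queued before it and holds back everything
queued after it. The names Command, schedule_phases and
journal_commit_chain are made up for this illustration and are not part
of any real SCSI stack.

```python
from dataclasses import dataclass

SIMPLE, ORDERED = "SIMPLE", "ORDERED"


@dataclass(frozen=True)
class Command:
    name: str
    attr: str  # SIMPLE or ORDERED


def schedule_phases(queue):
    """Split a queue into phases.

    Commands inside one phase may be executed in any order (or in
    parallel); phases are executed strictly one after another.
    """
    phases, current = [], []
    for cmd in queue:
        if cmd.attr == ORDERED:
            # An ORDERED command closes the preceding group of SIMPLE
            # commands and forms a phase of its own.
            if current:
                phases.append(current)
                current = []
            phases.append([cmd])
        else:
            current.append(cmd)
    if current:
        phases.append(current)
    return phases


def journal_commit_chain(n_journal=3, n_meta=2):
    """Build the seven-step chain from the text (n_meta is arbitrary)."""
    chain = [Command(f"journal write {i}", SIMPLE)
             for i in range(1, n_journal + 1)]
    chain.append(Command("SYNCHRONIZE_CACHE (journal blocks)", ORDERED))
    chain += [Command(f"meta-data update {i}", SIMPLE)
              for i in range(1, n_meta + 1)]
    chain.append(Command("SYNCHRONIZE_CACHE (meta-data blocks)", ORDERED))
    chain.append(Command("journal commit write", ORDERED))
    return chain


if __name__ == "__main__":
    for i, phase in enumerate(schedule_phases(journal_commit_chain()), 1):
        print(i, [c.name for c in phase])
```

The whole chain is handed to the drive at once; the host never waits for
the queue to empty, and the drive alone decides how to execute the SIMPLE
commands inside each phase.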

For an error recovery case, consider command (4) finishing abnormally
because of some external event, like a Unit Attention. The drive would
then establish an ACA condition and suspend the command queue with the
commands from (5) at its head. The system would retry command (4) with
the ACA attribute and, once it finished, clear the ACA condition. The
drive would then resume the queue, and the commands at its head (those
from (5)) would start being processed.
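
The recovery flow can be sketched as a toy state machine. This is only an
illustration under strong assumptions: the drive processes commands
strictly in queue order, a command fails at most once, and the class and
method names (Drive, retry_with_aca, clear_aca, commit_with_recovery) are
hypothetical, not a real target or initiator API.

```python
from collections import deque


class Drive:
    """Toy drive model: in-order queue plus an ACA flag."""

    def __init__(self, fail_once=()):
        self.queue = deque()
        self.aca_active = False
        self.fail_once = set(fail_once)  # commands that fail the first time
        self.log = []

    def submit(self, name):
        self.queue.append(name)

    def run(self):
        """Process the queue until it is empty or ACA is established.

        Returns the name of the failed command, or None.
        """
        while self.queue and not self.aca_active:
            name = self.queue.popleft()
            if name in self.fail_once:
                self.fail_once.discard(name)
                self.aca_active = True  # queue is now suspended
                self.log.append(f"{name}: failed, ACA established")
                return name
            self.log.append(f"{name}: done")
        return None

    def retry_with_aca(self, name):
        # Only a command sent with the ACA attribute may run while the
        # ACA condition is active.
        assert self.aca_active
        self.log.append(f"{name}: done (ACA attribute)")

    def clear_aca(self):
        self.aca_active = False
        self.log.append("ACA cleared")


def commit_with_recovery(drive, commands):
    """Queue everything up front, then handle failures as they appear."""
    for name in commands:
        drive.submit(name)
    failed = drive.run()
    while failed is not None:
        drive.retry_with_aca(failed)
        drive.clear_aca()
        failed = drive.run()  # the suspended queue resumes
    return drive.log


if __name__ == "__main__":
    d = Drive(fail_once={"sync journal"})
    for line in commit_with_recovery(
            d, ["j1", "j2", "j3", "sync journal", "m1", "sync meta",
                "commit"]):
        print(line)
```

The point of the sketch is that the commands queued behind the failed one
stay in the drive and keep their position, so the order is preserved
without resubmitting them.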

For a simpler device (a disk without support for ORDERED queuing) the 
same meta-data updates would be:

1. 1st journal write command

2. 2nd journal write command

3. 3rd journal write command

4. The queue draining.

5. SYNCHRONIZE_CACHE

6. The queue draining.

7. Necessary amount of meta-data update commands

8. The queue draining.

9. SYNCHRONIZE_CACHE for blocks written in 7

10. The queue draining.

11. Command marking the transaction committed in the journal
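
The relation between the two sequences can be sketched as a small
lowering function: the same chain is either sent as-is to a drive that
supports ORDERED, or every ORDERED command is emulated by draining the
queue before and after it. The function name lower_chain and the step
labels are invented for this sketch, and the meta-data updates are
collapsed into a single entry for brevity.

```python
DRAIN = "drain queue"

# The chain from the ORDERED-capable example: (command, task attribute).
CHAIN = [
    ("journal write 1", "SIMPLE"),
    ("journal write 2", "SIMPLE"),
    ("journal write 3", "SIMPLE"),
    ("SYNCHRONIZE_CACHE (journal)", "ORDERED"),
    ("meta-data updates", "SIMPLE"),
    ("SYNCHRONIZE_CACHE (meta-data)", "ORDERED"),
    ("journal commit write", "ORDERED"),
]


def lower_chain(chain, supports_ordered):
    """Return the list of host-side steps needed to submit the chain."""
    if supports_ordered:
        # The drive enforces the order itself; nothing to emulate.
        return [f"{name} ({attr})" for name, attr in chain]

    steps = []
    for name, attr in chain:
        # Emulate ORDERED: wait for everything before the command ...
        if attr == "ORDERED" and steps and steps[-1] != DRAIN:
            steps.append(DRAIN)
        steps.append(name)
        # ... and for the command itself before sending anything else.
        if attr == "ORDERED":
            steps.append(DRAIN)
    # Nothing follows the last command, so a trailing drain is pointless.
    if steps and steps[-1] == DRAIN:
        steps.pop()
    return steps


if __name__ == "__main__":
    for ordered in (True, False):
        print("ORDERED supported:", ordered)
        for i, step in enumerate(lower_chain(CHAIN, ordered), 1):
            print(f"  {i}. {step}")
```

For the simpler device this yields eleven steps, four of which are queue
drains, versus seven steps and no drains when ORDERED is available.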

Then we would need to figure out an interface that lets file systems
specify the necessary ordering and cache flushing requirements in a
generic way. The current interface looks almost good, but:

1. In it, the semantics of "barrier" are quite overloaded, hence
confusing and hard to implement.

2. It doesn't allow binding several requests into an ordered chain.
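
Just to make the two points concrete, here is a purely hypothetical
sketch of what such an interface could look like: the ordering, flushing
and FUA requirements are separate flags instead of one overloaded
"barrier", and requests are bound into a chain so that ordering applies
only inside it. Nothing here corresponds to an existing kernel API; Req,
Request, Chain and may_reorder are names invented for the illustration.

```python
from dataclasses import dataclass
from enum import Flag, auto
from itertools import count
from typing import Optional


class Req(Flag):
    NONE = 0
    ORDERED = auto()       # keep position relative to the rest of the chain
    FLUSH_BEFORE = auto()  # earlier chain writes must be on stable media
    FUA = auto()           # this write itself must reach stable media


@dataclass
class Request:
    name: str
    flags: Req = Req.NONE
    chain_id: Optional[int] = None


class Chain:
    """Binds several requests together; ordering is scoped to the chain."""

    _ids = count(1)

    def __init__(self):
        self.chain_id = next(Chain._ids)
        self.requests = []

    def add(self, name, flags=Req.NONE):
        req = Request(name, flags, self.chain_id)
        self.requests.append(req)
        return req


def may_reorder(a, b):
    """True if the lower layers are free to swap requests a and b."""
    if a.chain_id != b.chain_id:
        return True  # different chains never constrain each other
    return not ((a.flags | b.flags) & Req.ORDERED)


if __name__ == "__main__":
    tx = Chain()
    j1 = tx.add("journal write 1")
    j2 = tx.add("journal write 2")
    commit = tx.add("commit record", Req.ORDERED | Req.FLUSH_BEFORE | Req.FUA)
    print(may_reorder(j1, j2), may_reorder(j1, commit))
```

With separate flags a file system could ask for ordering without a flush,
or for a flush without ordering, and requests from other chains (for
instance, another file system on the same device) would not be affected.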

Vlad
