From: Bill Davidsen <davidsen@tmr.com>
To: Leon Woestenberg <leonw@mailcan.com>
Cc: Linux RAID <linux-raid@vger.kernel.org>,
	Peter Grandi <pg_xf2@xf2.for.sabi.co.UK>,
	Linux XFS <xfs@oss.sgi.com>
Subject: Re: 12x performance drop on md/linux+sw raid1 due to barriers [xfs]
Date: Thu, 18 Dec 2008 18:33:14 -0500
Message-ID: <494ADDBA.6010105@tmr.com>
In-Reply-To: <494A07BA.1080008@mailcan.com>

Leon Woestenberg wrote:
> Hello all,
>
> Bill Davidsen wrote:
>> Peter Grandi wrote:
>>   
>>> Unfortunately that seems the case.
>>>
>>> The purpose of barriers is to guarantee that relevant data is
>>> known to be on persistent storage (kind of hardware 'fsync').
>>>
>>> In effect write barrier means "tell me when relevant data is on
>>> persistent storage", or less precisely "flush/sync writes now
>>> and tell me when it is done". Properties as to ordering are just
>>> a side effect.
>>>   
>>>     
>>
>> I don't get that sense from the barriers stuff in Documentation, in fact 
>> I think it's essentially a pure ordering thing, I don't even see that it 
>> has an effect of forcing the data to be written to the device, other 
>> than by preventing other writes until the drive writes everything. So we 
>> read the intended use differently.
>>
>> What really bothers me is that there's no obvious need for barriers at 
>> the device level if the file system is just a bit smarter and does its 
>> own async io (like aio_*), because you can track writes outstanding on a 
>> per-fd basis, so instead of stopping the flow of data to the drive, you 
>> can just block a file descriptor and wait for the count of outstanding 
>> i/o to drop to zero. That provides the order semantics of barriers as 
>> far as I can see, having tirelessly thought about it for ten minutes or 
>> so. Oh, and did something very similar decades ago in a long-gone 
>> mainframe OS.
>>   
> Did that mainframe OS have re-ordering devices? If it did, you'd 
> still need barriers all the way down:
>
Why? As long as you can tell when all the writes before the barrier are 
physically on the drive (this is on a per fd basis, remember) you don't 
care about the order of physical writes, you serialize either one fd, or 
one thread, or one application, but you don't have to kill performance 
for the rest of the system to the drive. So you can fsync() one fd or 
several, then write another thread. Or you can wait until the 
outstanding write count for a whole process reaches zero. And the 
application satisfies the needs, not the kernel, which reduces impact on 
other applications.
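
The per-fd bookkeeping described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not anything the kernel or a file system does: `TrackedFd`, `submit_write`, `drain` and `outstanding` are hypothetical names, a thread pool stands in for aio_*-style asynchronous I/O, and the code assumes a POSIX system where `os.pwrite` is available. Whether the final `fsync()` really means "on the platters" depends on the file system and the drive write cache, which is the question being argued in this thread.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


class TrackedFd:
    """Hypothetical helper: counts writes in flight on one file descriptor."""

    def __init__(self, fd):
        self.fd = fd
        self._outstanding = 0
        self._cond = threading.Condition()
        # The pool is a stand-in for real asynchronous I/O (assumption).
        self._pool = ThreadPoolExecutor(max_workers=4)

    def submit_write(self, data, offset):
        # Count the write before handing it off, so drain() cannot miss it.
        with self._cond:
            self._outstanding += 1
        self._pool.submit(self._do_write, data, offset)

    def _do_write(self, data, offset):
        try:
            # Short writes are ignored here; a real caller would loop.
            os.pwrite(self.fd, data, offset)
        finally:
            with self._cond:
                self._outstanding -= 1
                self._cond.notify_all()

    def outstanding(self):
        with self._cond:
            return self._outstanding

    def drain(self):
        # Only callers interested in this fd block here; other descriptors
        # and other processes keep writing.
        with self._cond:
            self._cond.wait_for(lambda: self._outstanding == 0)
        os.fsync(self.fd)

    def close(self):
        self._pool.shutdown(wait=True)
```

A caller would issue any number of `submit_write()` calls, call `drain()` at the point where ordering matters, and only then issue the dependent write.
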
> The drive itself may still re-order writes, thus can cause corruption 
> if halfway the power goes down.
> From my understanding, disabling write-caches simply forces the drive 
> to operate in-order.
>
If your ordering logic is 'write A, B, and C, then barrier, then write D' 
I don't see that the physical order of A, B, or C matters, as long as 
they are all complete before you write D. That's what I see in the 
barrier description, let previous writes finish.
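
As a minimal application-level sketch of that 'A, B, C, barrier, D' sequence, assuming `fsync()` is used as the stand-in for a barrier on a single descriptor: the function name `write_with_barrier` and its parameters are made up for illustration, and the sketch says nothing about how the block layer implements barriers.

```python
import os


def write_with_barrier(fd, before, after):
    """Write every chunk in `before`, wait for them, then write `after`.

    Hypothetical helper: the chunks in `before` may reach the disk in any
    order relative to each other; only the fsync() separates them from
    `after`.
    """
    offset = 0
    for chunk in before:
        os.pwrite(fd, chunk, offset)
        offset += len(chunk)
    # Stand-in for the barrier: returns once the earlier writes are reported
    # as flushed for this fd (subject to drive write-cache behaviour).
    os.fsync(fd)
    os.pwrite(fd, after, offset)
    return offset + len(after)
```

For example, `write_with_barrier(fd, [a, b, c], d)` lays the four buffers out back to back and returns the end offset.
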
> Barriers need to travel all the way down to the point where-after 
> everything remains in-order.
> Devices with write-cache enabled will still re-order, but not across 
> barriers (which are implemented as
> either a single cache flush with forced unit access, or a double cache 
> flush around the barrier write).
>
> Whether the data has made it to the drive platters is not really 
> important from a barrier point of view, however,
> iff part of the data made it to the platters, then we want to be sure 
> it was in-order.
>
And you could use a barrier after every write (some DB setups do fsync() 
after each). Perhaps you mean parts like the journal entry before a 
change is made, then the change, then the journal entry for transaction 
complete?
> Because only in this way can we ensure that the data that is on the 
> platters is consistent.

I think we mean the same thing, but I'm not totally sure. As long as 
logical operations are completed in order, the physical writes don't 
matter, because a journal rollback would reset things to consistent anyway.
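
The journal-entry, change, commit-record ordering mentioned above might look like this in miniature. It is a toy sketch under loud assumptions: the JSON-lines journal format and the names `journaled_update` and `recover` are invented for illustration, this is not how XFS or any real file system journals, and a torn final journal line is not handled.

```python
import json
import os


def journaled_update(journal_fd, data_fd, offset, new_bytes):
    """Hypothetical three-step update: intent, change, commit."""
    old_bytes = os.pread(data_fd, len(new_bytes), offset)
    intent = {"op": "begin", "offset": offset, "old": old_bytes.hex()}
    os.write(journal_fd, (json.dumps(intent) + "\n").encode())
    os.fsync(journal_fd)  # the intent must be stable before the change

    os.pwrite(data_fd, new_bytes, offset)
    os.fsync(data_fd)     # the change must be stable before the commit

    os.write(journal_fd, (json.dumps({"op": "commit"}) + "\n").encode())
    os.fsync(journal_fd)


def recover(journal_fd, data_fd):
    """Roll back the last update if its commit record is missing."""
    size = os.fstat(journal_fd).st_size
    lines = os.pread(journal_fd, size, 0).decode().splitlines()
    records = [json.loads(line) for line in lines if line]
    if records and records[-1]["op"] == "begin":
        last = records[-1]
        os.pwrite(data_fd, bytes.fromhex(last["old"]), last["offset"])
        os.fsync(data_fd)
        return True   # something was rolled back
    return False      # journal is clean
```

The only ordering that matters here is between the three fsync() points; within each step the physical write order is left to the drive.
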

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismarck




