* bug in md write barrier support?
@ 2004-09-03 17:24 Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread

From: Christoph Hellwig @ 2004-09-03 17:24 UTC
To: neilb, axboe
Cc: linux-kernel

md_flush_mddev just passes on the sector relative to the raid device,
shouldn't it be translated somewhere?
* Re: bug in md write barrier support?

From: Neil Brown @ 2004-09-04 0:56 UTC
To: Christoph Hellwig
Cc: axboe, linux-kernel

On Friday September 3, hch@lst.de wrote:
> md_flush_mddev just passes on the sector relative to the raid device,
> shouldn't it be translated somewhere?

Yes.  md_flush_mddev should simply be removed.
The functionality should be, and largely is, in the individual
personalities.

Is there documentation somewhere on exactly what an issue_flush_fn
should do?  (Is it allowed to sleep?  What must happen before it is
allowed to return?  What is the "error_sector" for?  That sort of
thing.)

I suspect that at least raid5 will need some fairly special handling.

NeilBrown
* Re: bug in md write barrier support?

From: Jens Axboe @ 2004-09-04 8:21 UTC
To: Neil Brown
Cc: Christoph Hellwig, linux-kernel

On Sat, Sep 04 2004, Neil Brown wrote:
> On Friday September 3, hch@lst.de wrote:
> > md_flush_mddev just passes on the sector relative to the raid device,
> > shouldn't it be translated somewhere?
>
> Yes.  md_flush_mddev should simply be removed.
> The functionality should be, and largely is, in the individual
> personalities.

Yes, sorry, I was a little lazy there even though I followed the
plugging conversion :(

> Is there documentation somewhere on exactly what an issue_flush_fn
> should do?  (Is it allowed to sleep?  What must happen before it is
> allowed to return?  What is the "error_sector" for?  That sort of
> thing.)

It is allowed to sleep, and you should only return when the flush is
complete.  error_sector is the failed location, which really should be
a (dev, sector) tuple.

-- 
Jens Axboe
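The contract Jens describes (the hook may sleep, must not return until the flush has completed, and reports a failed location through an error-sector result) can be modeled in a few lines. This is an illustrative sketch of the semantics only; the names are invented and this is not the 2.6 kernel API:

```python
class ToyQueue:
    """Toy model of a queue exposing an issue_flush_fn-style hook."""

    def __init__(self):
        self.cached = []        # writes acknowledged but possibly only cached
        self.stable = []        # writes known to be on stable media
        self.fail_at = None     # simulate a media error at this sector

    def write(self, sector):
        # Completion is signalled here, before the data is durable.
        self.cached.append(sector)

    def issue_flush(self):
        """Return (0, None) on success, (-1, error_sector) on failure.

        In the real thing this would block until the device-level
        cache flush has actually finished.
        """
        if self.fail_at is not None and self.fail_at in self.cached:
            return -1, self.fail_at
        self.stable += self.cached
        self.cached = []
        return 0, None

q = ToyQueue()
q.write(10)
q.write(11)
ret, err = q.issue_flush()      # everything acknowledged so far is now stable
```

Note that the single error sector is exactly the weak spot Jens concedes: for a stacked driver like md, one sector number with no device identity is ambiguous, hence the (dev, sector) tuple remark.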
* Re: bug in md write barrier support?

From: Neil Brown @ 2004-09-06 1:36 UTC
To: Jens Axboe
Cc: Christoph Hellwig, linux-kernel

On Saturday September 4, axboe@suse.de wrote:
> It is allowed to sleep, and you should only return when the flush is
> complete.  error_sector is the failed location, which really should be
> a (dev, sector) tuple.

Could I get a little more information about this function please?
I've read through the code, and there isn't much in the way of
examples to follow: only reiserfs uses it, and only scsi-disk and
ide-disk support it (I think).

It would seem that this is for write requests where b_end_io has
already been called, indicating that the data is safe, but that maybe
the data isn't really safe after all, and blk_issue_flush needs to be
called.

I would have thought that after b_end_io is called, that data should
be safe anyway.  Not so?

How do you tell a device: it is OK to just leave the data in cache,
I'll call blk_issue_flush when I want it safe?

Is this related to barriers at all?

NeilBrown
* Re: bug in md write barrier support?

From: Jens Axboe @ 2004-09-08 9:23 UTC
To: Neil Brown
Cc: Christoph Hellwig, linux-kernel

On Mon, Sep 06 2004, Neil Brown wrote:
> Could I get a little more information about this function please?
> I've read through the code, and there isn't much in the way of
> examples to follow: only reiserfs uses it, and only scsi-disk and
> ide-disk support it (I think).

That is correct.  The current definition is to ensure that previously
sent writes are on disk.  I hope to tie a range to it in the future,
for devices that can optimize the flush in that case.  So for ide with
write back caching, it's currently a FLUSH_CACHE command.  Ditto for
SCSI.  SCSI with write through cache can make it a noop as well.

> It would seem that this is for write requests where b_end_io has
> already been called, indicating that the data is safe, but that maybe
> the data isn't really safe after all, and blk_issue_flush needs to be
> called.

Right on.

> I would have thought that after b_end_io is called, that data should
> be safe anyway.  Not so?

Not necessarily, if you have write caching enabled.

> How do you tell a device: it is OK to just leave the data in cache,
> I'll call blk_issue_flush when I want it safe?

How would md know?  The lower level driver knows what to do (if
anything) to ensure the data is safe.

> Is this related to barriers at all?

Yes and no.  Currently it's used for fsync(), but it can be used for
anything where you want to insert a flush point without having a piece
of data to tie it to.

-- 
Jens Axboe
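The distinction Jens is making, that a write can be "completed" (b_end_io called) while the data still sits in a volatile drive cache, can be sketched as a toy model. Everything here is illustrative; it is not driver code:

```python
class WriteBackDisk:
    """Model of a drive with write-back caching enabled."""

    def __init__(self):
        self.cache = []              # volatile write cache
        self.platter = []            # stable media

    def submit_write(self, block):
        self.cache.append(block)
        return "completed"           # completion fires before data is stable

    def flush(self):
        # Models FLUSH_CACHE: only now is cached data durable.
        self.platter += self.cache
        self.cache = []

    def power_cut(self):
        self.cache = []              # unflushed cache contents are lost

d = WriteBackDisk()
status = d.submit_write("A")         # "completed", but only cached
d.power_cut()
survived = "A" in d.platter          # False: completion did not mean safe
```

With a write-through cache, submit_write would populate the platter directly, which is why Jens notes the flush can be a noop there.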
* Re: bug in md write barrier support?

From: Alan Cox @ 2004-09-08 13:35 UTC
To: Jens Axboe
Cc: Neil Brown, Christoph Hellwig, Linux Kernel Mailing List

On Mer, 2004-09-08 at 10:23, Jens Axboe wrote:
> That is correct.  The current definition is to ensure that previously
> sent writes are on disk.  I hope to tie a range to it in the future,
> for devices that can optimize the flush in that case.  So for ide with
> write back caching, it's currently a FLUSH_CACHE command.  Ditto for
> SCSI.  SCSI with write through cache can make it a noop as well.

Some semantics questions I have, thinking about it from the I2O and
aacraid side: you talk about it as a barrier.  Can other I/O cross the
cache flush?  In other words, if I issue a flush_cache and continue
doing I/O, the flush will finish when the I/O outstanding at that time
has completed, but other I/O may get scheduled to disk first.

Secondly, what are the intended semantics for a flush error?
* Re: bug in md write barrier support?

From: Jens Axboe @ 2004-09-08 15:46 UTC
To: Alan Cox
Cc: Neil Brown, Christoph Hellwig, Linux Kernel Mailing List

On Wed, Sep 08 2004, Alan Cox wrote:
> Some semantics questions I have, thinking about it from the I2O and
> aacraid side: you talk about it as a barrier.  Can other I/O cross the
> cache flush?  In other words, if I issue a flush_cache and continue
> doing I/O, the flush will finish when the I/O outstanding at that time
> has completed, but other I/O may get scheduled to disk first.

That's a worry if it really does that - does it, or are you just
speculating about possible problems?

> Secondly, what are the intended semantics for a flush error?

It's up to the issuer.  For IDE it would ideally be issuing FLUSH_CACHE
repeatedly until it doesn't error anymore, but keeping track of the
error location.  Come to think of it, we should pass down the range
right now to flag which range we are actually interested in being
errored on.

-- 
Jens Axboe
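One reading of the retry strategy Jens sketches, reissue the flush until it stops failing while remembering the first failed location to report back, looks roughly like this. This is a hypothetical sketch of the idea, not the IDE driver's actual error handling:

```python
def flush_with_retries(issue_flush, max_tries=8):
    """Retry a flush hook until it succeeds or max_tries is exhausted.

    issue_flush() -> (ret, error_sector).  The first error location is
    kept so the caller learns where data may have been lost, even if a
    later retry succeeds.
    """
    first_error = None
    for _ in range(max_tries):
        ret, sector = issue_flush()
        if ret == 0:
            return 0, first_error
        if first_error is None:
            first_error = sector
    return -1, first_error

# Fake device that fails twice at sectors 77 and 78, then succeeds.
attempts = iter([(-1, 77), (-1, 78), (0, None)])
ret, first_err = flush_with_retries(lambda: next(attempts))
```

The caller ends up with a successful flush plus the first failed sector, which matches the "keep track of the error location" part of the suggestion.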
* Re: bug in md write barrier support?

From: Alan Cox @ 2004-09-08 22:21 UTC
To: Jens Axboe
Cc: Neil Brown, Christoph Hellwig, Linux Kernel Mailing List

On Mer, 2004-09-08 at 16:46, Jens Axboe wrote:
> That's a worry if it really does that - does it, or are you just
> speculating about possible problems?

I2O defines cache flush very loosely.  It flushes the cache and returns
when the cache has been flushed.  From playing with the controllers I
have, it seems some at least merge further queued writes into the
output stream.  Thus if I issue

	write 1, 2, 3, 4, 40, 41, flush cache, write 5, 6, 100

it'll write 1, 2, 3, 4, 5, 6, 40, 41, then report flush cache complete.

Obviously I can implement full barrier semantics in the driver if need
be, but that would cost performance, hence the question.
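Alan's observation can be captured as a small model: the flush completes once everything issued before it is out, but later-queued writes that sort next to earlier ones get merged into the output stream ahead of the completion. This is a guess at the observed controller behavior, not I2O specification semantics:

```python
def i2o_like_flush(before, after):
    """Model a controller that merges post-flush writes into the stream.

    `before` are sectors written before the flush, `after` those queued
    after it.  Returns (sectors out when "flush complete" is reported,
    sectors still pending).
    """
    horizon = max(before)
    # Later writes at or below the highest pre-flush sector get merged in.
    merged = sorted(set(before) | {w for w in after if w <= horizon})
    pending = [w for w in after if w > horizon]
    return merged, pending

out, pending = i2o_like_flush([1, 2, 3, 4, 40, 41], [5, 6, 100])
# Reproduces the ordering in the message: 1,2,3,4,5,6,40,41 out, 100 pending.
```

The flush guarantee ("everything before me is on disk") still holds in this model; what it does not provide is a barrier against later writes overtaking it, which is exactly the distinction Alan is asking about.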
* Re: bug in md write barrier support?

From: Jens Axboe @ 2004-09-09 8:06 UTC
To: Alan Cox
Cc: Neil Brown, Christoph Hellwig, Linux Kernel Mailing List

On Wed, Sep 08 2004, Alan Cox wrote:
> Obviously I can implement full barrier semantics in the driver if need
> be, but that would cost performance, hence the question.

Precisely, it's always possible to just drop queueing depth to zero at
that point.  If I2O really does reorder around the cache flush (this
seems broken...), then you probably should.

-- 
Jens Axboe
* Re: bug in md write barrier support?

From: Arjan van de Ven @ 2004-09-09 8:22 UTC
To: Jens Axboe
Cc: Alan Cox, Neil Brown, Christoph Hellwig, Linux Kernel Mailing List

> Precisely, it's always possible to just drop queueing depth to zero at
> that point.  If I2O really does reorder around the cache flush (this
> seems broken...),

Why does this seem broken?  Semantics of "cache flush guarantees that
all io submitted prior to it hits the spindle" are quite sane imo; no
guarantee of later submitted IO.  Compare the unix "sync" command; same
level of semantics.
* Re: bug in md write barrier support?

From: Jens Axboe @ 2004-09-09 8:29 UTC
To: Arjan van de Ven
Cc: Alan Cox, Neil Brown, Christoph Hellwig, Linux Kernel Mailing List

On Thu, Sep 09 2004, Arjan van de Ven wrote:
> Why does this seem broken?  Semantics of "cache flush guarantees that
> all io submitted prior to it hits the spindle" are quite sane imo; no
> guarantee of later submitted IO.  Compare the unix "sync" command;
> same level of semantics.

Depends on your angle, I think it breaks the principle of least
surprise.

-- 
Jens Axboe
* Re: bug in md write barrier support?

From: Alan Cox @ 2004-09-09 12:51 UTC
To: Jens Axboe
Cc: Arjan van de Ven, Neil Brown, Christoph Hellwig, Linux Kernel Mailing List

On Iau, 2004-09-09 at 09:29, Jens Axboe wrote:
> Depends on your angle, I think it breaks the principle of least
> surprise.

As far as I can ascertain, raid controllers in general follow this set
of semantics.  It's less of an issue for many of them with battery
backup, obviously.

It also makes a lot of sense at the hardware level for performance,
especially when dealing with raid.

Alan
* Re: bug in md write barrier support?

From: Jens Axboe @ 2004-09-09 14:34 UTC
To: Alan Cox
Cc: Arjan van de Ven, Neil Brown, Christoph Hellwig, Linux Kernel Mailing List

On Thu, Sep 09 2004, Alan Cox wrote:
> As far as I can ascertain, raid controllers in general follow this set
> of semantics.  It's less of an issue for many of them with battery
> backup, obviously.
>
> It also makes a lot of sense at the hardware level for performance,
> especially when dealing with raid.

Yes.  As long as the required semantics aren't explicitly guaranteed in
the specification, we should not rely on it.

-- 
Jens Axboe
* Re: bug in md write barrier support?

From: Rogier Wolff @ 2004-09-12 17:13 UTC
To: Alan Cox
Cc: Jens Axboe, Neil Brown, Christoph Hellwig, Linux Kernel Mailing List

On Wed, Sep 08, 2004 at 11:21:39PM +0100, Alan Cox wrote:
> I2O defines cache flush very loosely.  It flushes the cache and
> returns
[...]
> write 1, 2, 3, 4, 40, 41, flush cache, write 5, 6, 100
> it'll write 1, 2, 3, 4, 5, 6, 40, 41, then report flush cache
> complete.

which, if 5 and 6 are the metadata updates belonging to logfile writes
40 and 41, and the system powers down between 5 and 41, spells trouble.

	Roger.

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
**** "Linux is like a wigwam - no windows, no gates, apache inside!" ****
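Rogier's failure window is easy to state concretely: take the controller output ordering from Alan's message and cut power partway through. This is purely a model of the scenario, not filesystem code:

```python
def on_disk_after_crash(stream, last_written):
    """Blocks durable when power fails right after `last_written`."""
    return stream[:stream.index(last_written) + 1]

# The controller output ordering from Alan's message: metadata blocks
# 5 and 6 were merged ahead of the log writes 40 and 41 they depend on.
controller_order = [1, 2, 3, 4, 5, 6, 40, 41]

on_disk = on_disk_after_crash(controller_order, 6)  # power lost after 6
inconsistent = (5 in on_disk) and (40 not in on_disk)
```

The metadata is durable while the log entries it references are not, which is the write-after-log invariant that journaling relies on a barrier to protect.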