[Drbd-dev] Running Protocol C with disk cache enabled

* [Drbd-dev] Running Protocol C with disk cache enabled
@ 2007-06-19 15:16 Graham, Simon
  2007-06-20 13:33 ` Philipp Reisner
  0 siblings, 1 reply; 8+ messages in thread
From: Graham, Simon @ 2007-06-19 15:16 UTC
  To: drbd-dev

I've been thinking recently about making sure that DRBD handles failures
properly when the disks are run with their caches enabled - in most
cases, I believe that the existing activity log code in DRBD will
correctly handle this by ensuring that portions of the disk that _might_
have been in cache only when a failure occurred are resynchronized.

However - there is one case that I don't think is covered currently;
it's entirely possible that I'm missing something, but I wanted to
check; the case in question is if the Secondary system suffers an
unexpected power loss, thereby potentially losing some writes that were
acknowledged prior to the failure. Now, I think that the activity log
maintained by the Primary actually includes the necessary information
about blocks which should be resynchronized _but_ I don't see any code
that would actually add these blocks to the bitmap when such a failure
occurs.
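
The failure case above can be sketched with a toy model. This is Python, not
DRBD code; the WriteBackDisk class and the replicate() helper are
hypothetical names, and the only assumption is that a write-back cache
reports completion before the data is durable.

```python
class WriteBackDisk:
    """Toy disk that acknowledges writes once they reach a volatile cache."""

    def __init__(self):
        self.platter = {}   # durable contents
        self.cache = {}     # volatile write-back cache

    def write(self, block, data):
        self.cache[block] = data
        return True         # completion reported before the data is durable

    def flush(self):
        self.platter.update(self.cache)
        self.cache.clear()

    def read(self, block):
        return self.cache.get(block, self.platter.get(block))

    def power_loss(self):
        self.cache.clear()  # everything not yet flushed is gone


def replicate(primary, secondary, block, data):
    """Protocol C style: done only once both sides report completion."""
    return primary.write(block, data) and secondary.write(block, data)
```

If the secondary loses power after replicate() has returned, the block is
missing there while the primary still holds it, and nothing in this model
records that the block needs to be resynchronized.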

Conversely, if the Primary suffers an unexpected power loss, when it
comes back up, it will add all the blocks described by its on-disk
activity log to the bitmap as part of the attach processing on that
node.

Maybe this is overkill, but perhaps the Primary should add the contents
of the current AL to the in-memory and on-disk bitmaps whenever it loses
contact with the secondary unexpectedly?
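
A minimal sketch of that idea, as a Python toy model rather than DRBD's
kernel C. The extent size (4 MiB), the bitmap granularity (4 KiB) and every
name below are assumptions for illustration, not taken from the DRBD source.

```python
# Toy model: on unexpected loss of the peer, mark every block covered by the
# current activity log (AL) as out of sync in the bitmap.
AL_EXTENT_SIZE = 4 * 1024 * 1024   # assumed bytes covered by one AL extent
BM_BLOCK_SIZE = 4 * 1024           # assumed bytes covered by one bitmap bit
BITS_PER_EXTENT = AL_EXTENT_SIZE // BM_BLOCK_SIZE


def al_to_bitmap(active_extents, bitmap):
    """Set the bitmap bits for every extent in the AL; return bits newly set."""
    newly_set = 0
    for extent in active_extents:
        first = extent * BITS_PER_EXTENT
        for bit in range(first, first + BITS_PER_EXTENT):
            if bit not in bitmap:
                bitmap.add(bit)
                newly_set += 1
    return newly_set


def on_connection_lost(active_extents, bitmap, clean_disconnect):
    """Hypothetical hook: only an unexpected loss triggers the marking."""
    if clean_disconnect:
        return 0
    return al_to_bitmap(active_extents, bitmap)
```

The bitmap is modelled as a set of bit numbers; a real implementation would
also have to write the updated bitmap to disk before relying on it.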

Simon


* Re: [Drbd-dev] Running Protocol C with disk cache enabled
  2007-06-19 15:16 [Drbd-dev] Running Protocol C with disk cache enabled Graham, Simon
@ 2007-06-20 13:33 ` Philipp Reisner
  2007-06-20 16:31   ` Lars Ellenberg
  2007-06-20 19:47   ` Graham, Simon
  0 siblings, 2 replies; 8+ messages in thread
From: Philipp Reisner @ 2007-06-20 13:33 UTC
  To: drbd-dev

On Tuesday 19 June 2007 17:16:40 Graham, Simon wrote:
> I've been thinking recently about making sure that DRBD handles failures
> properly when the disks are run with their caches enabled - in most
> cases, I believe that the existing activity log code in DRBD will
> correctly handle this by ensuring that portions of the disk that _might_
> have been in cache only when a failure occurred are resynchronized.
>

Well, right. An interesting question. But do we really need to solve
it in DRBD?

In the end it is the file system that wants to ensure that something
is on disk.

The first answer that I had for this was (Linux-2.2 and Linux-2.4)

  wait until IO is completed, then it is on disk.

This of course totally ignored that even at that time most
IDE drives already had write caches in write-back mode. Even worse,
on most drives it is not possible to disable these caches. 

The message to our customers was: 
  Either use a (RAID) disk controller with battery-backed RAM,
  or run the write caches in write-through mode, not write-back mode.

Now since the time of Linux-2.6 we (finally) got IO-barriers.

Each driver can now state what it needs in order to really write
the in-flight data to disk:

	 * NONE		: hardbarrier unsupported
	 * DRAIN	: ordering by draining is enough
	 * DRAIN_FLUSH	: ordering by draining w/ pre and post flushes
	 * DRAIN_FUA	: ordering by draining w/ pre flush and FUA write
	 * TAG		: ordering by tag is enough
	 * TAG_FLUSH	: ordering by tag w/ pre and post flushes
	 * TAG_FUA	: ordering by tag w/ pre flush and FUA write

The last time I looked I realized that none of the machines in
our lab had a driver that exposed anything other than NONE.

Filesystems are slowly starting to use the WRITE_BARRIER flag on
BIOs, which is then translated by the queuing layer into the right
request flags on the requests, according to the queue settings.
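
As a rough illustration of that translation, here is a Python toy that
expands one barrier write into the steps each queue mode in the list above
implies. It follows only the one-line descriptions quoted there; the step
names and the function are hypothetical, and this is not the block layer's
actual code.

```python
def barrier_sequence(mode):
    """Return the ordered steps one barrier write turns into for a queue mode."""
    if mode == "NONE":
        raise ValueError("hard barriers unsupported by this queue")
    ordering, _, extra = mode.partition("_")
    if ordering not in ("DRAIN", "TAG") or extra not in ("", "FLUSH", "FUA"):
        raise ValueError("unknown mode: " + mode)
    # DRAIN waits for the queue to empty; TAG relies on ordered tags instead.
    steps = ["drain_queue"] if ordering == "DRAIN" else ["ordered_tag"]
    if extra == "FLUSH":
        steps += ["pre_flush", "write", "post_flush"]   # flush before and after
    elif extra == "FUA":
        steps += ["pre_flush", "fua_write"]             # FUA replaces the post flush
    else:
        steps += ["write"]                              # ordering alone is enough
    return steps
```

A queue of type NONE raises an error here, which corresponds to the case
where the filesystem finds that barriers do not work and disables them.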

What we see of those filesystems is:
"JBD: barrier-based sync failed on XXX - disabling barriers"

Ok, so far the theory and the facts IMHO.

How is DRBD concerned with all this? I think we are done as long
as we pass the BIO_RW_BARRIER and BIO_RW_SYNC flags from the primary
to the secondary. -- And we need to respect the implicit write
barriers that arise out of the usage pattern:

 submit_bio()
 wait_for_io_completion()
 submit_bio()
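
One conservative way to model these implicit barriers, sketched in Python:
any write submitted after a completion has been seen is placed in a new
epoch, and a peer may reorder writes within an epoch but not across epochs.
The function name and the event format are assumptions for illustration
only.

```python
def split_into_epochs(events):
    """Group writes into epochs separated by implicit barriers.

    events is a list of ("submit", id) / ("complete", id) tuples in the
    order the primary saw them.
    """
    epochs, current, seen_completion = [], [], False
    for kind, req in events:
        if kind == "complete":
            seen_completion = True
        elif kind == "submit":
            if seen_completion and current:
                epochs.append(current)   # implicit barrier: close the epoch
                current = []
            seen_completion = False
            current.append(req)
        else:
            raise ValueError(kind)
    if current:
        epochs.append(current)
    return epochs
```

This over-approximates the real dependencies (a later write may not depend
on the completed one at all), which is safe but can cost some reordering
freedom.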

>
> However - there is one case that I don't think is covered currently;
> it's entirely possible that I'm missing something, but I wanted to
> check; the case in question is if the Secondary system suffers an
> unexpected power loss, thereby potentially losing some writes that were
> acknowledged prior to the failure. Now, I think that the activity log
> maintained by the Primary actually includes the necessary information
> about blocks which should be resynchronized _but_ I don't see any code
> that would actually add these blocks to the bitmap when such a failure
> occurs.
>

Right, we do not do this. The current opinion on this is: If the
disk reported IO completion it has to be on disk. (actually a point
of view from the Linux-2.2 and Linux-2.4 era).

Hmm, I can see your point; let me think about this for a few days.

Even if we marked everything after the last acknowledged
BIO_RW_BARRIER, we have to keep in mind that today most drivers'
queues are of type NONE.

>
> Conversely, if the Primary suffers an unexpected power loss, when it
> comes back up, it will add all the blocks described by its on-disk
> activity log to the bitmap as part of the attach processing on that
> node.
>

Right. We do this.
The original intention of the AL was to "revert" blocks that got written
on the primary shortly before a crash, and made it to disk but not to
network before the crash.

>
> Maybe this is overkill, but perhaps the Primary should add the contents
> of the current AL to the in-memory and on-disk bitmaps whenever it loses
> contact with the secondary unexpectedly?
>

Simon, I definitely see your point.

It is necessary for disk subsystems that "lie" to the upper layers
with their completion events. -- But you are right, most of today's
disk subsystems do this. -- Maybe it should be configurable...

Let me think about it... , further comments and opinions welcome of course!

-Phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :


* Re: [Drbd-dev] Running Protocol C with disk cache enabled
  2007-06-20 13:33 ` Philipp Reisner
@ 2007-06-20 16:31   ` Lars Ellenberg
  2007-06-20 19:47   ` Graham, Simon
  1 sibling, 0 replies; 8+ messages in thread
From: Lars Ellenberg @ 2007-06-20 16:31 UTC
  To: drbd-dev

On Wed, Jun 20, 2007 at 03:33:14PM +0200, Philipp Reisner wrote:
> > However - there is one case that I don't think is covered currently;
> > it's entirely possible that I'm missing something, but I wanted to
> > check; the case in question is if the Secondary system suffers an
> > unexpected power loss, thereby potentially losing some writes that were
> > acknowledged prior to the failure. Now, I think that the activity log
> > maintained by the Primary actually includes the necessary information
> > about blocks which should be resynchronized _but_ I don't see any code
> > that would actually add these blocks to the bitmap when such a failure
> > occurs.
> >
> 
> Right we do not do this. The current opinion on this is: If the
> disk reported IO completion it has to be on disk. (actually a point
> of view of the Linux-2.2 and Linux-2.4 time).

Me and Phil had a few words about this.

Now, lying hardware is sooo broken :(
but, anyways.

the easiest way to realise this workaround
in current drbd appears to be:
 upon attach, always apply the activity log, 
 unless known to have been cleanly shut down.

we would basically maintain the activity log on the secondary
as well, and introduce an additional "cleanly detached" flag.

whenever you attach it again, the extents covered would need to be
resynced.  obviously this behaviour should be configurable, you want to
disable it for good hardware and large activity log.
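
a rough Python sketch of that attach rule, as a toy model only. the
function, its arguments and the apply_al_on_secondary switch are all
hypothetical names standing in for the "cleanly detached" flag and the
configuration option; none of them exist in drbd.

```python
def extents_to_resync_on_attach(al_extents, cleanly_detached, was_primary,
                                apply_al_on_secondary=True):
    """Return the AL extents to mark for resync when a disk is attached."""
    if cleanly_detached:
        # Clean shutdown or detach: nothing can be stuck in a write cache.
        return set()
    if was_primary or apply_al_on_secondary:
        # Unclean: everything the activity log covers is suspect.
        return set(al_extents)
    # Workaround disabled (good hardware, large activity log).
    return set()
```

with the switch turned off, an unclean secondary contributes nothing, while
an unclean primary still has its activity log applied.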

I can think of a few possible optimizations, even...
but we should not over-engineer what is "just" a workaround.

> further comments and opinions welcome of course!

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :


* RE: [Drbd-dev] Running Protocol C with disk cache enabled
  2007-06-20 13:33 ` Philipp Reisner
  2007-06-20 16:31   ` Lars Ellenberg
@ 2007-06-20 19:47   ` Graham, Simon
  2007-06-21 13:26     ` Lars Ellenberg
  1 sibling, 1 reply; 8+ messages in thread
From: Graham, Simon @ 2007-06-20 19:47 UTC
  To: Lars Ellenberg, drbd-dev

> > > acknowledged prior to the failure. Now, I think that the activity log
> > > maintained by the Primary actually includes the necessary information
> > > about blocks which should be resynchronized _but_ I don't see any code
> > > that would actually add these blocks to the bitmap when such a failure
> > > occurs.
> > >
> >
> > Right we do not do this. The current opinion on this is: If the
> > disk reported IO completion it has to be on disk. (actually a point
> > of view of the Linux-2.2 and Linux-2.4 time).
> 
> Me and Phil had a few words about this.
> 
> Now, lying hardware is sooo broken :(
> but, anyways.
> 

Well, I look at this slightly differently; use of the on-disk cache is
really the only way to get decent (i.e. competitive) performance out of
rotating rust, so what we have to do is find ways to allow this and
still be correct.

BTW: another case that is of interest to me is when you have a caching
controller -- even though these have battery backup, there is still the
case to worry about when the controller itself fails (something we have
to worry about when building fault tolerant servers) -- in this case, it
should be possible to repair/replace the failed controller and then
reboot and have DRBD resync correctly...

> 
> we would basically maintain the activity log on the secondary
> as well, and introduce an additional "cleanly detached" flag.
> 
> whenever you attach it again, the extents covered would need to be
> resynced.  obviously this behaviour should be configurable, you want to
> disable it for good hardware and large activity log.
> 
> I can think of few possible optimizations, even...
> but we should not over-engineer what is "just" a workaround.
> 

I thought about this too -- however, I managed to convince myself that
it isn't necessary to store the AL on both disks since we can use the AL
from the disk that was primary (wouldn't they be identical?) -- maybe
there's a case I'm not considering though.

Would it be enough to modify the code to add the current AL to the
in-memory and on-disk bitmaps on the primary whenever you lose contact
with the peer??? I realize this is different from how it is handled for
the loss-of-primary case...

Simon


* Re: [Drbd-dev] Running Protocol C with disk cache enabled
  2007-06-20 19:47   ` Graham, Simon
@ 2007-06-21 13:26     ` Lars Ellenberg
  2007-06-22 18:41       ` Philipp Reisner
                         ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Lars Ellenberg @ 2007-06-21 13:26 UTC
  To: drbd-dev

On Wed, Jun 20, 2007 at 03:47:02PM -0400, Graham, Simon wrote:
> > > > acknowledged prior to the failure. Now, I think that the activity log
> > > > maintained by the Primary actually includes the necessary information
> > > > about blocks which should be resynchronized _but_ I don't see any code
> > > > that would actually add these blocks to the bitmap when such a failure
> > > > occurs.
> > > >
> > >
> > > Right we do not do this. The current opinion on this is: If the
> > > disk reported IO completion it has to be on disk. (actually a point
> > > of view of the Linux-2.2 and Linux-2.4 time).
> > 
> > Me and Phil had a few words about this.
> > 
> > Now, lying hardware is sooo broken :(
> > but, anyways.
> > 
> 
> Well, I look at this slightly differently; use of the on-disk cache is
> really the only way to get decent (i.e. competitive) performance out of
> rotating rust, so what we have to do is find ways to allow this and
> still be correct.

well, yes.
but when the kernel asks the disk to "get it to disk now, and tell me when it is
there", and the disk lies about it, this is bad.
in a perfect world, there would be no need to disable the cache,
if the disk would just tell the truth.

> BTW: another case that is of interest to me is when you have a caching
> controller -- even though these have battery backup, there is still the
> case to worry about when the controller itself fails (something we have
> to worry about when building fault tolerant servers) -- in this case, it
> should be possible to repair/replace the failed controller and then
> reboot and have DRBD resync correctly...
> 
> > 
> > we would basically maintain the activity log on the secondary
> > as well, and introduce an additional "cleanly detached" flag.
> > 
> > whenever you attach it again, the extents covered would need to be
> > resynced.  obviously this behaviour should be configurable, you want to
> > disable it for good hardware and large activity log.
> > 
> > I can think of few possible optimizations, even...
> > but we should not over-engineer what is "just" a workaround.
> > 
> 
> I thought about this too -- however, I managed to convince myself that
> it isn't necessary to store the AL on both disks since we can use the AL
> from the disk that was primary (wouldn't they be identical?) -- maybe
> there's a case I'm not considering though.

as a side note, no, they are not necessarily identical at all times
(requests in flight to non-covered extent).

in the "allow-two-primaries" case I think we maintain it anyways.
it should not be too much overhead to maintain it always.
and it is the most generic solution: it would just work.

> Would it be enough to modify the code to add the current AL to the
> in-memory and on-disk bitmaps on the primary whenever you lose contact
> with the peer??? I realize this is different from how it is handled for
> the loss-of-primary case...

any implementation should not be half-assed.
but, provided that 
  we can prove that we, under all circumstances (protocol != C,
  small activity log, many random writes etc.), still cover *at least*
  the area that might have been in the write cache of the remote disk,
then, yes, it should be sufficient.

the interesting part here is, that if we maintain a timestamp
(in the lru lists in memory), we could optimize for large activity logs
(several GB covered extents), by flagging only those extents that have
seen activity during the last $seconds. Or we could maintain some
"throughput" statistic, and only flag those extents which the last
$megabytes targeted.
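
both variants can be sketched in a few lines of Python. this is a toy model
with hypothetical names; it assumes a per-extent "last write" timestamp and
a log of (extent, nbytes) writes are available, which is not something the
thread says drbd keeps today.

```python
def recently_active(last_write, now, window_seconds):
    """Extents whose last write falls within the window ending at `now`."""
    return {ext for ext, t in last_write.items() if now - t <= window_seconds}


def targets_of_last_bytes(write_log, budget_bytes):
    """Extents hit by the most recent writes totalling at least budget_bytes.

    write_log is a list of (extent, nbytes) tuples in submission order.
    """
    flagged, remaining = set(), budget_bytes
    for extent, nbytes in reversed(write_log):
        if remaining <= 0:
            break
        flagged.add(extent)
        remaining -= nbytes
    return flagged
```

either way, the window or the byte budget has to be at least as large as
what the remote write cache can hold, otherwise the "cover *at least*"
condition above is not met.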

food for thought :)

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :


* Re: [Drbd-dev] Running Protocol C with disk cache enabled
  2007-06-21 13:26     ` Lars Ellenberg
@ 2007-06-22 18:41       ` Philipp Reisner
  2007-06-22 20:03       ` Graham, Simon
  2007-06-22 20:09       ` Graham, Simon
  2 siblings, 0 replies; 8+ messages in thread
From: Philipp Reisner @ 2007-06-22 18:41 UTC
  To: drbd-dev

Hi,

Lars and I discussed various implementation ideas about the issue 
today. I was just about to write them down. -- But then this thought
came to my mind:

 * The message would be: You no longer need good disk IO subsystems
   that tell the operating system the truth. Go out and use the
   cheapest RAID5 controllers with enormous on-controller memory,
   without battery unit...

   In case your secondary crashes, DRBD will take care and replay
   the data that was lost in your controller's RAM.

   But how does this work on the primary? Our activity-log
   depends on a working disk subsystem. If you have an IO
   subsystem with write-back caches on the primary, we will not
   have a complete AL after the crash.

Does it make sense to solve an issue with broken hardware for
a DRBD node in secondary role, when we depend on working hardware
when the same node is in primary role ? -- I do not think so.

The bottom line:
There is lots of working hardware around. SCSI drives do not have
write-back caches (enabled). I guess SATA drives are okay as well,
but I do not know for sure. All serious raid5 controllers have 
battery units. People have to use those.

Just a comment to this:

> Well, I look at this slightly differently; use of the on-disk cache is
> really the only way to get decent (i.e. competitive) performance out of
> rotating rust, so what we have to do is find ways to allow this and
> still be correct.

It is perfectly fine for a disk drive or a controller to accept thousands
of IO operations. -- And in fact Linux (2.6) (and DRBD) takes advantage
of this. I have seen an HP raid5 controller that accepted up to 10000
write requests at once without blocking acceptance of further write requests.
-- But when the controller signals IO completion to the operating
system it is its task to ensure that the data is either on disk or
safe by other means (battery backed-up RAM).

-Phil


* RE: [Drbd-dev] Running Protocol C with disk cache enabled
  2007-06-21 13:26     ` Lars Ellenberg
  2007-06-22 18:41       ` Philipp Reisner
@ 2007-06-22 20:03       ` Graham, Simon
  2007-06-22 20:09       ` Graham, Simon
  2 siblings, 0 replies; 8+ messages in thread
From: Graham, Simon @ 2007-06-22 20:03 UTC
  To: Philipp Reisner, drbd-dev

> Just a comment to this:
> 
> > Well, I look at this slightly differently; use of the on-disk cache is
> > really the only way to get decent (i.e. competitive) performance out of
> > rotating rust, so what we have to do is find ways to allow this and
> > still be correct.
> 
> A disk drive or a controller is really fine to take over thousands of
> IO operations. -- And in fact Linux (2.6) (and DRBD) takes advantage
> of this. I have seen an HP raid5 controller that accepted up to 10000
> write requests at once without blocking acceptance of further write
> requests.

Perhaps I'm missing something (and I definitely need to go look at the
barrier stuff you mentioned previously) but it's really not the number
of outstanding requests I am concerned about but the time to complete
any specific request -- if you don't use the disk cache in write behind
mode, then you end up at a competitive disadvantage and it does no good
to explain how you are really better because data can't be lost. 

Obviously if you are streaming data to the disk then the cache doesn't
really help - once it's full, you have to wait anyway. However, a lot of
real world cases are either very bursty or they tend to modify the same
blocks over and over and using the write behind cache can provide
significant improvements (if you don't care about availability).

The trick is providing both perf and reliability!

So - my homework for the w/e is to read up on the barrier support you
mentioned before and see whether or not this provides what we need and
can work in all cases - I'm particularly interested in two cases:
1. Using DRBD volumes with no filesystem (e.g. Oracle or some other DB)
2. Using DRBD volumes from virtual machines (again, there is no filesystem
   in the host environment - I need to see whether or not the virtual disk
   drivers that sit between DRBD and the filesystem in the guest implement
   the barriers properly).

Simon


* RE: [Drbd-dev] Running Protocol C with disk cache enabled
  2007-06-21 13:26     ` Lars Ellenberg
  2007-06-22 18:41       ` Philipp Reisner
  2007-06-22 20:03       ` Graham, Simon
@ 2007-06-22 20:09       ` Graham, Simon
  2 siblings, 0 replies; 8+ messages in thread
From: Graham, Simon @ 2007-06-22 20:09 UTC
  To: Philipp Reisner, drbd-dev

BTW, regarding:

>  * The message would be: You no longer need good disk IO subsystems
>    that tell the operating system the truth. Go out and use the
>    cheapest RAID5 controllers with enormous on-controller memory,
>    without battery unit...
> 

As I mentioned previously, I'm also concerned about the caching RAID
controller case but for a different reason - imagine the controller
actually fails; I'd like to be able to replace the controller and
restart the system and still have DRBD get the disks back in sync even
though there were changes cached only in the controller...

Again - it's possible the barrier support in Linux will be the solution,
I just don't know enough about it yet...

Simon
