Distributed Replicated Block Device (DRBD) development
* [Drbd-dev] Handling on-disk caches
@ 2007-11-07  3:54 Graham, Simon
From: Graham, Simon @ 2007-11-07  3:54 UTC (permalink / raw)
  To: drbd-dev

A few months ago, we had a discussion about how to handle systems with
on-disk caches enabled in the face of failures which can cause the cache
to be lost after disk writes are completed back to DRBD. At the time,
the suggestion was to rely on the Linux barrier implementation which is
used by the file systems to ensure correct behavior in the face of disk
caches.

I've now had time to get back to this and review the Linux barrier
implementation, and it's become clear to me that barriers alone are
insufficient. Imagine the case where a write is being done, it completes
on the secondary (but is still in the disk cache there), and then we
power off that node. NO errors are reported to Linux on the primary:
because the other half of the RAID set is still there, the original IO
completes successfully, BUT we now have a difference between the two
sides.

So a failure of the secondary is NOT reflected back to Linux, and
therefore we can get out of sync in a way that does not track the blocks
that need to be resynced, independent of the use of barriers.

Consider the following sequence of writes:

[1] [2] [3] [barrier] [4] [5]

If we've processed [1] through [3] and the writes have completed on both
primary and secondary but the data is sitting in the disk cache and then
the secondary is powered off, the following occurs:

1. The primary doesn't return any error to Linux.
2. The primary goes ahead and processes the [barrier] (which flushes
   [1]-[3] to disk), then performs [4] and [5] and includes the blocks
   covered by these in the DRBD bitmap.
3. Now the secondary comes back -- we ONLY resync [4] and [5], even
   though [1]-[3] never made it to disk there (because we didn't execute
   the [barrier] on the secondary).
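
To make the sequence above concrete, here is a toy simulation. It is only
a sketch under stated assumptions: CachedDisk, run_scenario and the set
used as a "bitmap" are hypothetical names for illustration, not DRBD
code, and a barrier is modeled as nothing more than a cache flush.

```python
class CachedDisk:
    """Toy disk with a volatile write cache (hypothetical model, not DRBD code)."""

    def __init__(self):
        self.cache = {}  # writes reported complete but not yet on stable media
        self.media = {}  # contents that survive a power loss

    def write(self, block, data):
        # The write "completes" as soon as it reaches the cache.
        self.cache[block] = data

    def flush(self):
        # Modeled effect of a barrier: push cached writes to stable media.
        self.media.update(self.cache)
        self.cache.clear()

    def power_off(self):
        # Anything still in the cache is lost.
        self.cache.clear()


def run_scenario():
    primary, secondary = CachedDisk(), CachedDisk()
    bitmap = set()               # blocks the primary believes need resync
    state = {"connected": True}  # whether the secondary is reachable

    def replicated_write(block):
        data = f"data{block}"
        primary.write(block, data)
        if state["connected"]:
            secondary.write(block, data)
        else:
            bitmap.add(block)

    for block in (1, 2, 3):      # [1] [2] [3] complete on both sides
        replicated_write(block)

    secondary.power_off()        # secondary dies before any barrier
    state["connected"] = False

    primary.flush()              # [barrier] runs on the primary only
    for block in (4, 5):         # [4] [5] are tracked in the bitmap
        replicated_write(block)
    primary.flush()

    for block in bitmap:         # secondary returns: resync bitmap blocks only
        secondary.media[block] = primary.media[block]

    diverged = sorted(b for b in primary.media
                      if secondary.media.get(b) != primary.media[b])
    return bitmap, diverged


if __name__ == "__main__":
    bitmap, diverged = run_scenario()
    print("resynced:", sorted(bitmap))
    print("still different:", diverged)
```

In this model the bitmap ends up holding only blocks 4 and 5, while
blocks 1 through 3 remain different between the two sides after resync.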

I think the solution to this consists of a number of changes:

1. As suggested previously, DRBD should respect barriers on the
   secondary (by passing the appropriate flags to the secondary) -- this
   will handle unexpected failure of the primary.
2. Meta-data updates (certainly the AL, but possibly all meta-data
   updates) should be issued as barrier requests, so that we know these
   are on disk before issuing the associated writes. (I don't think they
   are currently.)
3. DRBD should include the area addressed by the AL when recovering from
   an unexpected secondary failure. There are two approaches for this:
   a) Maintain the AL on both sides -- when the secondary restarts, add
      the AL to the set of blocks needing to be resynced, as is done on
      the primary today.
   b) Add the current AL to the bitmap on the primary when it loses
      contact with the secondary.
   The second is probably easier and is, I think, just as effective --
   even if the primary fails as well (so we lose the in-memory bitmap),
   when it comes back it WILL add the on-disk AL to the bitmap, and we
   won't resync until it comes back...
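
As a rough illustration of approach (b), the sketch below marks every
block covered by the current AL as out of sync at the moment the primary
loses the secondary. Everything here is an assumption for illustration:
the Primary class, on_secondary_lost, and BLOCKS_PER_EXTENT = 4 are
made-up names and sizes, not DRBD's actual structures or extent size.

```python
BLOCKS_PER_EXTENT = 4  # illustrative value only, not DRBD's real extent size


def extent_of(block):
    """Map a block number to the (hypothetical) AL extent that covers it."""
    return block // BLOCKS_PER_EXTENT


class Primary:
    """Toy primary that tracks an activity log (AL) and a resync bitmap."""

    def __init__(self):
        self.active_extents = set()  # stands in for the AL
        self.bitmap = set()          # blocks to resync to the secondary
        self.connected = True

    def write(self, block):
        # Every write touches the AL; while disconnected it is also
        # recorded in the bitmap.
        self.active_extents.add(extent_of(block))
        if not self.connected:
            self.bitmap.add(block)

    def on_secondary_lost(self):
        # Approach (b): treat everything the AL covers as possibly missing
        # on the secondary, since it may only have reached its disk cache.
        self.connected = False
        for extent in self.active_extents:
            start = extent * BLOCKS_PER_EXTENT
            self.bitmap.update(range(start, start + BLOCKS_PER_EXTENT))


if __name__ == "__main__":
    p = Primary()
    for block in (1, 2, 3):
        p.write(block)
    p.on_secondary_lost()
    for block in (4, 5):
        p.write(block)
    print(sorted(p.bitmap))
```

With these toy sizes, blocks 1 through 3 are now covered by the bitmap
along with 4 and 5. The cost is some over-resync: block 0 is included
too, because it shares an extent with blocks 1 through 3.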

What do you think?
Simon
