Extra write mode to close RAID5 write hole (kind of)

public inbox for linux-bcache@vger.kernel.org

From: James Pharaoh <james@pharaoh.uk>
To: linux-bcache@vger.kernel.org
Subject: Extra write mode to close RAID5 write hole (kind of)
Date: Wed, 26 Oct 2016 16:20:38 +0100
Message-ID: <bd0112e0-976a-59e8-94ef-81777f240ffd@pharaoh.uk>

Hi all,

I'm creating an elaborate storage system and using bcache, with great 
success, to combine SSDs with smallish (500GB) network mounted block 
devices, with RAID5 in between.

I believe this should allow me to use RAID5 at large scale without high
risk of data loss, because I can rebuild the small number of affected
devices quickly, spreading the work across a distributed system.

I am using separate filesystems on each and abstracting their 
combination at a higher level, and I have redundant copies of their data 
in different locations (different countries in fact), so even if I lose 
one it can be recreated efficiently.

I believe this addresses the issue of two devices failing 
simultaneously, because it would affect an even smaller proportion of 
the total data than a single failure, which would simply trigger a 
number of RAID5 rebuilds.

I have high faith in SSD storage, especially given drives' SMART 
capabilities to report failure well in advance of it happening, so it 
occurs to me that bcache is going to close the RAID5 write hole for me, 
assuming certain things.
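
To make the failure mode concrete, here is a toy sketch of the write hole
on a single stripe with two data blocks and one XOR parity block. The
block contents and function names are illustrative assumptions, not
anything taken from md or bcache.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks (RAID5-style parity)."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def rebuild_after_interrupted_update(d0_old: bytes, d0_new: bytes, d1: bytes) -> bytes:
    """Model a crash between the data write and the parity write.

    Returns what a rebuild of d1 would produce if the disk holding d1 is
    lost after d0 was updated but the parity block was not.
    """
    stale_parity = xor_blocks(d0_old, d1)    # parity still describes d0_old
    return xor_blocks(d0_new, stale_parity)  # reconstruct d1 from survivors


if __name__ == "__main__":
    d1 = b"BBBB"
    rebuilt = rebuild_after_interrupted_update(b"AAAA", b"CCCC", d1)
    print("d1 on disk before failure:", d1)
    print("d1 after rebuild:         ", rebuilt)
```

The rebuilt block differs from the original whenever d0 actually changed,
even though d1 itself was never written to.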

I am making assumptions about the ordering of writes that RAID5 makes, 
and will post to the appropriate list about that, with the possibility 
of another option. However, I also note that bcache "optimises" 
sequential writes directly to the underlying device:

 > Since random IO is what SSDs excel at, there generally won't be much
 > benefit to caching large sequential IO. Bcache detects sequential IO
 > and skips it; it also keeps a rolling average of the IO sizes per
 > task, and as long as the average is above the cutoff it will skip all
 > IO from that task - instead of caching the first 512k after every
 > seek. Backups and large file copies should thus entirely bypass the
 > cache.

Since I want my bcache device to essentially be a "journal", and to 
close the RAID5 write hole, I would prefer to disable this behaviour.
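
For reference, the kernel's bcache documentation describes two per-device
sysfs attributes that bear on this: cache_mode, and sequential_cutoff,
where writing 0 turns the sequential bypass off. Below is a minimal
sketch of setting both. The device name bcache0 and the helper names are
my assumptions, and this only switches off the sequential heuristic; it
is not the stricter mode I describe next.

```python
from pathlib import Path


def set_bcache_attr(device: str, attr: str, value: str,
                    sysfs_root: str = "/sys/block") -> Path:
    """Write `value` to <sysfs_root>/<device>/bcache/<attr>.

    Raises FileNotFoundError if the attribute does not exist, for example
    because the device is not a registered bcache device.
    """
    path = Path(sysfs_root) / device / "bcache" / attr
    if not path.exists():
        raise FileNotFoundError(f"no such bcache attribute: {path}")
    path.write_text(value + "\n")
    return path


def disable_sequential_bypass(device: str = "bcache0",
                              sysfs_root: str = "/sys/block") -> None:
    """Use writeback caching and stop sequential IO from skipping the cache."""
    set_bcache_attr(device, "cache_mode", "writeback", sysfs_root)
    set_bcache_attr(device, "sequential_cutoff", "0", sysfs_root)


if __name__ == "__main__":
    # Needs root and an existing bcache device; bcache0 is an assumption.
    disable_sequential_bypass("bcache0")
```

The sysfs_root parameter exists only so the helper can be pointed at a
scratch directory when trying it out without a real device.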

I propose, therefore, a further write mode, in which data is always 
written to the cache first, and synced, before it is written to the 
underlying device. This could be called "journal" perhaps, or something 
similar.
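
The ordering I have in mind can be sketched as a toy model: make the
write durable in the cache, then write it to the underlying device, then
retire the cache entry, and replay anything left over after a crash. The
class and its names are hypothetical and are not bcache internals. The
model also does not cover how the RAID layer recomputes parity when a
block is replayed, which is one of the assumptions I still need to check.

```python
class JournalFirstDevice:
    """Toy model: every write is made durable in the cache before it
    touches the backing device, and is retired only afterwards."""

    def __init__(self, n_blocks: int):
        self.backing = [b""] * n_blocks  # stands in for the RAID5 array
        self.journal = []                # stands in for synced cache entries

    def write(self, block: int, data: bytes,
              crash_before_backing: bool = False) -> None:
        if not 0 <= block < len(self.backing):
            raise IndexError("block out of range")
        entry = (block, data)
        self.journal.append(entry)       # steps 1 and 2: write to cache, sync
        if crash_before_backing:
            return                       # simulated power loss
        self.backing[block] = data       # step 3: write to backing device
        self.journal.remove(entry)       # step 4: retire the journal entry

    def recover(self) -> int:
        """Replay surviving journal entries in order; return how many."""
        replayed = len(self.journal)
        for block, data in self.journal:
            self.backing[block] = data
        self.journal.clear()
        return replayed


if __name__ == "__main__":
    dev = JournalFirstDevice(n_blocks=4)
    dev.write(2, b"new data", crash_before_backing=True)
    print("backing before recovery:", dev.backing[2])
    dev.recover()
    print("backing after recovery: ", dev.backing[2])
```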

I am optimistic that this would be a relatively small change to the
code, since it only requires that writes always go to the cache first.
The sync behaviour may be more complex; I am not familiar with the
internals.

So, does anyone have any idea if this is practical, if it would 
genuinely close the write hole, or any other thoughts?

I am prepared to write up what I am designing in detail and open source
it; I believe it would be a useful method of managing this kind of
large-scale storage in general.

James
