* Extra write mode to close RAID5 write hole (kind of)
From: James Pharaoh @ 2016-10-26 15:20 UTC
To: linux-bcache

Hi all,

I'm creating an elaborate storage system and using bcache, with great
success, to combine SSDs with smallish (500GB) network mounted block
devices, with RAID5 in between.

I believe this should allow me to use RAID5 at large scale without high
risk of data loss, because I can very quickly rebuild the small number of
devices efficiently, across a distributed system.

I am using separate filesystems on each and abstracting their combination
at a higher level, and I have redundant copies of their data in different
locations (different countries in fact), so even if I lose one it can be
recreated efficiently.

I believe this addresses the issue of two devices failing simultaneously,
because it would affect an even smaller proportion of the total data than
a single failure, which would simply trigger a number of RAID5 rebuilds.

I have high faith in SSD storage, especially given drives' SMART
capabilities to report failure well in advance of it happening, so it
occurs to me that bcache is going to close the RAID5 write hole for me,
assuming certain things.

I am making assumptions about the ordering of writes that RAID5 makes,
and will post to the appropriate list about that, with the possibility of
another option. However, I also note that bcache "optimises" sequential
writes directly to the underlying device:

> Since random IO is what SSDs excel at, there generally won't be much
> benefit to caching large sequential IO. Bcache detects sequential IO
> and skips it; it also keeps a rolling average of the IO sizes per
> task, and as long as the average is above the cutoff it will skip all
> IO from that task - instead of caching the first 512k after every
> seek. Backups and large file copies should thus entirely bypass the
> cache.

Since I want my bcache device to essentially be a "journal", and to close
the RAID5 write hole, I would prefer to disable this behaviour.

I propose, therefore, a further write mode, in which data is always
written to the cache first, and synced, before it is written to the
underlying device. This could be called "journal" perhaps, or something
similar.

I am optimistic that this would be a relatively small change to the code,
since it only requires always choosing the cache as the place to write
data first. Perhaps the sync behaviour is more complex; I am not familiar
with the internals.

So, does anyone have any idea whether this is practical, whether it would
genuinely close the write hole, or any other thoughts?

I am prepared to write up what I am designing in detail and open source
it; I believe it would be a useful method of managing this kind of high
scale storage in general.

James
* Re: Extra write mode to close RAID5 write hole (kind of)
From: Vojtech Pavlik @ 2016-10-26 22:31 UTC
To: James Pharaoh
Cc: linux-bcache

On Wed, Oct 26, 2016 at 04:20:38PM +0100, James Pharaoh wrote:

> Hi all,
>
> I'm creating an elaborate storage system and using bcache, with
> great success, to combine SSDs with smallish (500GB) network mounted
> block devices, with RAID5 in between.
>
> I believe this should allow me to use RAID5 at large scale without
> high risk of data loss, because I can very quickly rebuild the small
> number of devices efficiently, across a distributed system.
>
> I am using separate filesystems on each and abstracting their
> combination at a higher level, and I have redundant copies of their
> data in different locations (different countries in fact), so even
> if I lose one it can be recreated efficiently.
>
> I believe this addresses the issue of two devices failing
> simultaneously, because it would affect an even smaller proportion
> of the total data than a single failure, which would simply trigger
> a number of RAID5 rebuilds.
>
> I have high faith in SSD storage, especially given drives' SMART
> capabilities to report failure well in advance of it happening, so
> it occurs to me that bcache is going to close the RAID5 write hole
> for me, assuming certain things.

I believe your faith in SSDs is somewhat misplaced: they die ahead of any
SMART warning more often than you might think, and when they do, they
don't just get bad sectors - the whole device is gone.

If you want to protect your data, either use RAID for your cache devices
too, use bcache in writethrough mode, or use writeback mode with a zero
dirty data target.

> I am making assumptions about the ordering of writes that RAID5
> makes, and will post to the appropriate list about that, with the
> possibility of another option. However, I also note that bcache
> "optimises" sequential writes directly to the underlying device:

In case you're using mdraid for the RAID part on a reasonably recent
Linux kernel, there is no write hole. Linux mdraid implements barriers
properly even on RAID5, at the cost of performance - mdraid waits for a
barrier to complete on all drives before submitting more i/o.

Any journalling, log or CoW filesystem that relies on i/o barriers for
consistency will be consistent in Linux even on mdraid RAID5.

> > Since random IO is what SSDs excel at, there generally won't be much
> > benefit to caching large sequential IO. Bcache detects sequential IO
> > and skips it; it also keeps a rolling average of the IO sizes per
> > task, and as long as the average is above the cutoff it will skip all
> > IO from that task - instead of caching the first 512k after every
> > seek. Backups and large file copies should thus entirely bypass the
> > cache.
>
> Since I want my bcache device to essentially be a "journal", and to
> close the RAID5 write hole, I would prefer to disable this
> behaviour.
>
> I propose, therefore, a further write mode, in which data is always
> written to the cache first, and synced, before it is written to the
> underlying device. This could be called "journal" perhaps, or
> something similar.

Using bcache to accelerate a RAID using an SSD is a fairly common use
case. What you're asking for can likely be achieved by:

echo writeback > cache_mode
echo 0 > writeback_percent
echo 10240 > writeback_rate
echo 5 > writeback_delay
echo 0 > readahead
echo 0 > sequential_cutoff
echo 0 > cache/congested_read_threshold_us
echo 0 > cache/congested_write_threshold_us

This is what I use personally on my system with success.

It enables writeback to optimise writing whole RAID stripes and sets a
writeback delay to make sure whole stripes are collected before writing
them out. It sets a fixed writeback rate such that reads aren't
significantly delayed even during heavy writes - the dirty data will grow
instead. It disables readahead, disallows skipping the cache for
sequential writes, and disables cache device congestion control to make
sure that writes always go through the cache device.

As a result, if the cached device is busy with writes, only full stripes
ever get written to the RAID. When the device is idle, even the remaining
dirty data gets written to the RAID.

> I am optimistic that this would be a relatively small change to the
> code, since it only requires always choosing the cache as the place
> to write data first. Perhaps the sync behaviour is more complex; I
> am not familiar with the internals.
>
> So, does anyone have any idea whether this is practical, whether it
> would genuinely close the write hole, or any other thoughts?

It works without code changes, properly implements barriers throughout
the whole stack, doesn't get corrupted on pulling the cord if using a
modern fs, is fast, and doesn't leave dirty data on the SSD unless the
cord is pulled during a busy period.

> I am prepared to write up what I am designing in detail and open
> source it; I believe it would be a useful method of managing this
> kind of high scale storage in general.

--
Vojtech Pavlik
Director SuSE Labs
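For concreteness, the settings above live under the bcache sysfs
directory of the cached device. A minimal sketch of applying them,
assuming the device is bcache0 (the path and device name are assumptions
for illustration, not something stated in the thread):

  #!/bin/sh
  # Apply the writeback tuning described above.
  # Assumption: the cached device is bcache0; adjust for your system.
  B=/sys/block/bcache0/bcache

  echo writeback > "$B/cache_mode"      # all writes go via the cache
  echo 0 > "$B/writeback_percent"
  echo 10240 > "$B/writeback_rate"      # fixed background writeback rate
  echo 5 > "$B/writeback_delay"         # let full stripes accumulate first
  echo 0 > "$B/readahead"               # no readahead into the cache
  echo 0 > "$B/sequential_cutoff"       # never bypass the cache for sequential IO
  echo 0 > "$B/cache/congested_read_threshold_us"   # disable congestion bypass
  echo 0 > "$B/cache/congested_write_threshold_us"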
* Re: Extra write mode to close RAID5 write hole (kind of)
From: James Pharaoh @ 2016-10-27 21:46 UTC
To: Vojtech Pavlik
Cc: linux-bcache

On 26/10/16 23:31, Vojtech Pavlik wrote:

> I believe your faith in SSDs is somewhat misplaced: they die ahead of
> any SMART warning more often than you might think, and when they do,
> they don't just get bad sectors - the whole device is gone.

In my experience they are extremely reliable compared to traditional
drives, and much faster, which increases their reliability because a
rebuild/restore is much faster. And of course I have redundant backups,
stored in systems with significantly distinct designs, locations, access
controls, etc.

> If you want to protect your data, either use RAID for your cache
> devices too, use bcache in writethrough mode, or use writeback mode
> with a zero dirty data target.

Ok, I'm not sure what this means, but it sounds like something I might
want to use. Have you got a link for this?

>> I am making assumptions about the ordering of writes that RAID5
>> makes, and will post to the appropriate list about that, with the
>> possibility of another option. However, I also note that bcache
>> "optimises" sequential writes directly to the underlying device:
>
> In case you're using mdraid for the RAID part on a reasonably recent
> Linux kernel, there is no write hole. Linux mdraid implements barriers
> properly even on RAID5, at the cost of performance - mdraid waits for a
> barrier to complete on all drives before submitting more i/o.
>
> Any journalling, log or CoW filesystem that relies on i/o barriers for
> consistency will be consistent in Linux even on mdraid RAID5.

Ok, wow, I did not know this. Again, have you got a link to any
documentation about this? Unfortunately these kinds of low-level systems
tend to be quite hard to find information about...

> Using bcache to accelerate a RAID using an SSD is a fairly common use
> case. What you're asking for can likely be achieved by:
>
> echo writeback > cache_mode
> echo 0 > writeback_percent
> echo 10240 > writeback_rate
> echo 5 > writeback_delay
> echo 0 > readahead
> echo 0 > sequential_cutoff
> echo 0 > cache/congested_read_threshold_us
> echo 0 > cache/congested_write_threshold_us
>
> This is what I use personally on my system with success.

Thanks, I'll look at this. Genuinely a much more helpful response than I
could ever have hoped for ;-)

James
* Re: Extra write mode to close RAID5 write hole (kind of)
From: Kent Overstreet @ 2016-10-28 11:52 UTC
To: Vojtech Pavlik
Cc: James Pharaoh, linux-bcache

On Thu, Oct 27, 2016 at 12:31:58AM +0200, Vojtech Pavlik wrote:

> In case you're using mdraid for the RAID part on a reasonably recent
> Linux kernel, there is no write hole. Linux mdraid implements barriers
> properly even on RAID5, at the cost of performance - mdraid waits for a
> barrier to complete on all drives before submitting more i/o.

That's not what the raid 5 hole is. The raid 5 hole comes from the fact
that it's not possible to update the p/q blocks atomically with the data
blocks, thus there is a point in time when they are _inconsistent_ with
the rest of the stripe, and if used will lead to reconstructing incorrect
data. There's no way to fix this with just flushes.
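To make the inconsistency concrete, here is a small illustrative shell
sketch of a three-device RAID5 stripe (the byte values are invented
purely for the example): a data block is rewritten, power is lost before
the parity is updated, and a later rebuild of the untouched block from
the stale parity produces the wrong data.

  # Two data blocks and one parity block, one byte each (values invented).
  d0=$(( 0xAA )); d1=$(( 0x55 ))
  p=$(( d0 ^ d1 ))            # parity consistent with the stripe

  d0=$(( 0xFF ))              # d0 is rewritten on disk...
                              # ...power is lost before p is rewritten

  # Later the disk holding d1 fails; d1 must be rebuilt from d0 and p:
  rebuilt=$(( d0 ^ p ))
  printf 'd1 was 0x55, rebuilt as 0x%02X\n' "$rebuilt"   # prints 0x00 - wrong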
* Re: Extra write mode to close RAID5 write hole (kind of)
From: Vojtech Pavlik @ 2016-10-28 13:07 UTC
To: Kent Overstreet
Cc: James Pharaoh, linux-bcache

On Fri, Oct 28, 2016 at 03:52:49AM -0800, Kent Overstreet wrote:

> On Thu, Oct 27, 2016 at 12:31:58AM +0200, Vojtech Pavlik wrote:
> > In case you're using mdraid for the RAID part on a reasonably recent
> > Linux kernel, there is no write hole. Linux mdraid implements barriers
> > properly even on RAID5, at the cost of performance - mdraid waits for a
> > barrier to complete on all drives before submitting more i/o.
>
> That's not what the raid 5 hole is. The raid 5 hole comes from the fact that
> it's not possible to update the p/q blocks atomically with the data blocks, thus
> there is a point in time when they are _inconsistent_ with the rest of the
> stripe, and if used will lead to reconstructing incorrect data. There's no way
> to fix this with just flushes.

Indeed. However, together with the write intent bitmap, and filesystems
ensuring consistency through barriers, it's still greatly mitigated.

Mdraid will mark areas of disk dirty in the write intent bitmap before
writing to them. When the system comes up after a power outage, all
areas marked dirty are scanned and the xor block written where it
doesn't match the rest.

Thanks to the strict ordering using barriers, the damage to the
consistency of the RAID can only be in requests since the last
successfully written barrier.

As such, the filesystem will always see a consistent state, and the raid
will also always recover to a consistent state.

The only situation where data damage can happen is a power outage that
comes together with a loss of one of the drives. In such a case, the
content of any blocks written past the last barrier is undefined. It
then depends on the filesystem whether it can revert to the last sane
state. Not sure about others, but btrfs will do so.

--
Vojtech Pavlik
Director SuSE Labs
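For reference, the write-intent bitmap is something mdadm can enable when
an array is created, or add to an existing array; a sketch with
placeholder device names:

  # Create a 4-disk RAID5 array with an internal write-intent bitmap
  # (device names are placeholders):
  mdadm --create /dev/md0 --level=5 --raid-devices=4 \
        --bitmap=internal /dev/sdb /dev/sdc /dev/sdd /dev/sde

  # Or add a bitmap to an existing array:
  mdadm --grow /dev/md0 --bitmap=internal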
* Re: Extra write mode to close RAID5 write hole (kind of)
From: Kent Overstreet @ 2016-10-28 13:13 UTC
To: Vojtech Pavlik
Cc: James Pharaoh, linux-bcache

On Fri, Oct 28, 2016 at 03:07:20PM +0200, Vojtech Pavlik wrote:

> The only situation where data damage can happen is a power outage that
> comes together with a loss of one of the drives. In such a case, the
> content of any blocks written past the last barrier is undefined. It
> then depends on the filesystem whether it can revert to the last sane
> state. Not sure about others, but btrfs will do so.

It's not only data written since the last barrier that is at risk - in a
non-COW filesystem, potentially the entire stripe is toast, which means
existing unrelated data gets corrupted. There's nothing really a non-COW
filesystem can do about it.
* Re: Extra write mode to close RAID5 write hole (kind of)
From: Vojtech Pavlik @ 2016-10-28 16:55 UTC
To: Kent Overstreet
Cc: James Pharaoh, linux-bcache

On Fri, Oct 28, 2016 at 05:13:10AM -0800, Kent Overstreet wrote:

> On Fri, Oct 28, 2016 at 03:07:20PM +0200, Vojtech Pavlik wrote:
> > The only situation where data damage can happen is a power outage that
> > comes together with a loss of one of the drives. In such a case, the
> > content of any blocks written past the last barrier is undefined. It
> > then depends on the filesystem whether it can revert to the last sane
> > state. Not sure about others, but btrfs will do so.
>
> It's not only data written since the last barrier that is at risk - in a
> non-COW filesystem, potentially the entire stripe is toast, which means
> existing unrelated data gets corrupted. There's nothing really a non-COW
> filesystem can do about it.

Again, you're right: if a drive is lost during a power outage, there can
be damage even outside of the blocks that were written, if the plain data
was written and the xor wasn't. I don't think there is a filesystem that
can handle damage to untouched data cleanly.

An additional journal that works closely with the RAID device and tracks
what has been written to all devices is required to close this remaining
gap.

But then, if the journal device is lost during a power outage ... ;)

--
Vojtech Pavlik
Director SuSE Labs
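For what it's worth, this is essentially the role of the md write
journal. Assuming a reasonably recent mdadm (3.4 or later) and a kernel
with md journal support (4.4 or later), a dedicated journal device can be
given when the array is created; a sketch with placeholder device names:

  # RAID5 with a dedicated write journal on an SSD (placeholder names,
  # assuming mdadm >= 3.4 and kernel >= 4.4):
  mdadm --create /dev/md0 --level=5 --raid-devices=4 \
        --write-journal /dev/nvme0n1 /dev/sdb /dev/sdc /dev/sdd /dev/sde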
* Re: Extra write mode to close RAID5 write hole (kind of)
From: James Pharaoh @ 2016-10-28 16:58 UTC
To: Vojtech Pavlik, Kent Overstreet
Cc: linux-bcache

On 28/10/16 14:07, Vojtech Pavlik wrote:

> On Fri, Oct 28, 2016 at 03:52:49AM -0800, Kent Overstreet wrote:
>
> Indeed. However, together with the write intent bitmap, and filesystems
> ensuring consistency through barriers, it's still greatly mitigated.
>
> Mdraid will mark areas of disk dirty in the write intent bitmap before
> writing to them. When the system comes up after a power outage, all
> areas marked dirty are scanned and the xor block written where it
> doesn't match the rest.
>
> Thanks to the strict ordering using barriers, the damage to the
> consistency of the RAID can only be in requests since the last
> successfully written barrier.

Ok, so - without my posting to the mdraid list - you are confident that,
assuming the disks (etc) are correctly ordering writes, RAID5 as
implemented by a modern Linux kernel does not suffer from a write hole.
If so, this is great news.

I understand that there is a clear issue in the case of a drive failure,
but that's specifically why I think that bcache can be of use, because it
should be able to mitigate some of this.

I have a feeling I would need to bcache the backing devices, rather than
the array itself, to make this work, since in the case of a drive failure
- specifically the loss of a data stripe as opposed to a parity one - the
writes cannot be ordered in a way that avoids corruption. But I think
that a bcache layer on each backing device, assuming of course that the
bcache cache device is consistent, would provide this level of assurance.

> The only situation where data damage can happen is a power outage that
> comes together with a loss of one of the drives. In such a case, the
> content of any blocks written past the last barrier is undefined. It
> then depends on the filesystem whether it can revert to the last sane
> state. Not sure about others, but btrfs will do so.

Yes, and of course I've mentioned this above. But... I feel that this is
something that bcache could help with, and I also have several redundant
backups so that, in the unlikely event of a drive failure which causes
corruption, I can easily restore the files in question.

I would like to understand a little more about how Linux mdraid behaves
in this respect, but it sounds like it does a pretty good job, and that
my bcache layer, and redundant backups, provide a good layer of data
security.

I am mostly using this to store zbackup repositories, which store the
majority of data in 256 directories, which I currently map to 16 backing
devices, and could, of course, easily map to as many as 256. In this use
case, with the redundant backups, and of course some automatic testing
and verification of the data, I am fairly confident that I won't be
losing any backups.

James
* Re: Extra write mode to close RAID5 write hole (kind of)
From: James Pharaoh @ 2016-10-28 17:07 UTC
To: Kent Overstreet, Vojtech Pavlik
Cc: linux-bcache

On 28/10/16 12:52, Kent Overstreet wrote:

> That's not what the raid 5 hole is. The raid 5 hole comes from the fact that
> it's not possible to update the p/q blocks atomically with the data blocks, thus
> there is a point in time when they are _inconsistent_ with the rest of the
> stripe, and if used will lead to reconstructing incorrect data. There's no way
> to fix this with just flushes.

Yes, I understand this, but if the kernel strictly orders writing mdraid
data blocks before parity ones, then it closes part of the hole,
especially if I have a "journal" in a higher layer, and of course ensure
that this journal is reliable.

I think that in the case of a failure of a drive which contains data
blocks that have been written, but whose parity blocks have not been,
this will still fail.

I also think, however, that putting bcache /under/ mdraid, and (again)
ensuring that the bcache layer is reliable, along with the requirement
for bcache to "journal" all writes, would provide an extremely reliable
storage layer, even at a very large scale.

James
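For clarity, the layout proposed here - one bcache device per RAID
member, with the array assembled on top of the bcache devices - would
look roughly like the following sketch (placeholder device names; as Kent
argues below, this layout by itself does not address the parity
atomicity problem):

  # One cache set on the SSD, one bcache backing device per RAID member,
  # then RAID5 assembled from the bcache devices (placeholder names).
  make-bcache -C /dev/nvme0n1
  for dev in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
      make-bcache -B "$dev"
  done
  # After attaching each bcacheN to the cache set (echo the cache set UUID
  # into /sys/block/bcacheN/bcache/attach), build the array on top:
  mdadm --create /dev/md0 --level=5 --raid-devices=4 \
        /dev/bcache0 /dev/bcache1 /dev/bcache2 /dev/bcache3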
* Re: Extra write mode to close RAID5 write hole (kind of)
From: Kent Overstreet @ 2016-10-29 00:58 UTC
To: James Pharaoh
Cc: Vojtech Pavlik, linux-bcache

On Fri, Oct 28, 2016 at 06:07:21PM +0100, James Pharaoh wrote:

> On 28/10/16 12:52, Kent Overstreet wrote:
>
> > That's not what the raid 5 hole is. The raid 5 hole comes from the fact that
> > it's not possible to update the p/q blocks atomically with the data blocks, thus
> > there is a point in time when they are _inconsistent_ with the rest of the
> > stripe, and if used will lead to reconstructing incorrect data. There's no way
> > to fix this with just flushes.
>
> Yes, I understand this, but if the kernel strictly orders writing mdraid
> data blocks before parity ones, then it closes part of the hole, especially
> if I have a "journal" in a higher layer, and of course ensure that this
> journal is reliable.

Ordering cannot help you here. Whichever order you do the writes in,
there is a point in time where the p/q blocks are inconsistent with the
data blocks, thus if you do a reconstruct you will reconstruct incorrect
data. Unless you were writing to the entire stripe, this affects data you
were _not_ writing to.

> I also think, however, that putting bcache /under/ mdraid, and (again)
> ensuring that the bcache layer is reliable, along with the requirement
> for bcache to "journal" all writes, would provide an extremely reliable
> storage layer, even at a very large scale.

What? No, putting bcache under md wouldn't do anything, it couldn't do
anything about the atomicity issue there.

Also - Vojtech - btrfs _is_ subject to the raid5 hole, it would have to
be doing copygc to not be affected.
* Re: Extra write mode to close RAID5 write hole (kind of)
From: James Pharaoh @ 2016-10-29 19:58 UTC
To: Kent Overstreet
Cc: Vojtech Pavlik, linux-bcache

Okay... So I think the situation is that:

- Currently there is no facility to atomically write out more than one
  block at a time.

- Mdraid orders writes to ensure that data blocks are updated atomically,
  and these are used for reads.

- If a data block is updated, but the parity is not, and there is a
  failure of any of the devices containing a data block with inconsistent
  parity, then the other blocks which share the parity block - effectively
  "random" blocks from the point of view of the filesystem - will be
  corrupted.

- Some kind of journal - and of course I'm proposing that bcache could
  serve this purpose - could potentially close the write hole.

The main missing functionality is the first point above: if the block
layer could communicate that multiple block writes need to be made
together or not at all - i.e. that multiple blocks can be written
atomically - then, assuming a journal is present, this would fix the
problem.

Has this been discussed before? As always, I find it hard to find good
information about this kind of low-level stuff, and think that asking the
people who have written it is the only way to get anywhere.

Obviously a change to the device mapper API is not something that would
be done without significant consideration, although a POC would of course
be welcomed, I think.

I think the gains to be made here are substantial, and that bcache is a
very good candidate for the journal implementation. I also think that
this implementation is relatively simple, compared to other options.

I have also read many opinions on the problems of scaling up RAID5 and
RAID6 as drives become larger, so I think there's definitely an urgent
interest in finding a solution to this.

So, I would propose to add this kind of atomic write to the kernel's
device mapper API, presumably with some way to detect whether it is going
to be honoured or not. I'm not familiar enough with it to know if this is
more complicated than I make it sound...

The mdraid layer would need to use this API, perhaps as an option, but
arguably, if it can detect the presence of this facility, it would be
easy to recommend as the default, presumably after a period of testing.

Bcache would need to implement this API, and ensure that the "journal"
atomically contains, or does not contain, all of the atomically updated
blocks. I'm also assuming that the cache device is reliable, of course,
and I've said I'm simply trusting a single SSD (or potentially a RAID0
array of backing devices with LVM), but I think that simply using RAID1
for the cache device would give a reasonable level of reliability for the
bcache cache/journal.

I assume it uses some kind of COW tree with an atomic update at the root,
and ordering, so that updates to the data can be ordered behind a single
update which "commits" the changes, and that when this is read back, it
is able to confirm whether the critical commit has been made or not.
Perhaps another API extension to the block layer could perform a read
which checks with a lower layer (RAID1 in this case) that the block is
genuinely consistent.

In my main use case - where I am storing backups which are redundantly
stored elsewhere, and where I believe an SSD array, even a RAID0 one, is
quite reliable - I still think this is good enough.

That said, SSDs are cheap enough for me to use RAID1 even in this case. I
also have other use cases, for example where I would RAID0 several
bcache+RAID5 devices into a single LVM volume group. In this case, I'd
definitely want the extra protection on the cache device, because an
error would potentially affect a large filesystem built on top of it.

I think that there is a further opportunity for optimisation as well. If,
as I am led to believe, mdraid strictly orders writes to data blocks
before parity ones, to "partially" close the write hole, then being able
to atomically write out all the blocks that change - two at minimum -
could replace the strict ordering. This would improve performance,
because it removes the round trip of verifying the first write before
performing the second.

Does this all make sense? Is this interesting for anyone else? Is there
any other work that attempts to solve this problem?

James

On 29/10/16 02:58, Kent Overstreet wrote:

> On Fri, Oct 28, 2016 at 06:07:21PM +0100, James Pharaoh wrote:
>> On 28/10/16 12:52, Kent Overstreet wrote:
>>
>>> That's not what the raid 5 hole is. The raid 5 hole comes from the fact that
>>> it's not possible to update the p/q blocks atomically with the data blocks, thus
>>> there is a point in time when they are _inconsistent_ with the rest of the
>>> stripe, and if used will lead to reconstructing incorrect data. There's no way
>>> to fix this with just flushes.
>>
>> Yes, I understand this, but if the kernel strictly orders writing mdraid
>> data blocks before parity ones, then it closes part of the hole, especially
>> if I have a "journal" in a higher layer, and of course ensure that this
>> journal is reliable.
>
> Ordering cannot help you here. Whichever order you do the writes in, there is a
> point in time where the p/q blocks are inconsistent with the data blocks, thus
> if you do a reconstruct you will reconstruct incorrect data. Unless you were
> writing to the entire stripe, this affects data you were _not_ writing to.
>
>> I also think, however, that putting bcache /under/ mdraid, and (again)
>> ensuring that the bcache layer is reliable, along with the requirement for
>> bcache to "journal" all writes, would provide an extremely reliable storage
>> layer, even at a very large scale.
>
> What? No, putting bcache under md wouldn't do anything, it couldn't do anything
> about the atomicity issue there.
>
> Also - Vojtech - btrfs _is_ subject to the raid5 hole, it would have to be doing
> copygc to not be affected.
* Re: Extra write mode to close RAID5 write hole (kind of)
From: Kent Overstreet @ 2016-10-28 11:59 UTC
To: James Pharaoh
Cc: linux-bcache

On Wed, Oct 26, 2016 at 04:20:38PM +0100, James Pharaoh wrote:

> Since I want my bcache device to essentially be a "journal", and to close
> the RAID5 write hole, I would prefer to disable this behaviour.
>
> I propose, therefore, a further write mode, in which data is always written
> to the cache first, and synced, before it is written to the underlying
> device. This could be called "journal" perhaps, or something similar.
>
> I am optimistic that this would be a relatively small change to the code,
> since it only requires always choosing the cache as the place to write data
> first. Perhaps the sync behaviour is more complex; I am not familiar with
> the internals.
>
> So, does anyone have any idea whether this is practical, whether it would
> genuinely close the write hole, or any other thoughts?

It's not a crazy idea - bcache already has some stripe awareness code
that could be used as a starting point.

The main thing you'd need to do is ensure that:

- all writes are writeback, not writethrough (as you noted)

- when the writeback thread is flushing dirty data, only flush entire
  stripes - reading more data into the cache if necessary and marking it
  dirty - then ensure that the entire stripe is marked dirty until the
  entire stripe is flushed.

This would basically be using bcache to do full data journalling.

I'm not going to do the work myself - I'd rather spend my time working on
adding erasure coding to bcachefs - but I could help out if you or
someone else wanted to work on adding this to bcache.
* Re: Extra write mode to close RAID5 write hole (kind of)
From: James Pharaoh @ 2016-10-28 17:02 UTC
To: Kent Overstreet
Cc: linux-bcache

On 28/10/16 12:59, Kent Overstreet wrote:

> On Wed, Oct 26, 2016 at 04:20:38PM +0100, James Pharaoh wrote:
>> Since I want my bcache device to essentially be a "journal", and to close
>> the RAID5 write hole, I would prefer to disable this behaviour.
>>
>> I propose, therefore, a further write mode, in which data is always written
>> to the cache first, and synced, before it is written to the underlying
>> device. This could be called "journal" perhaps, or something similar.
>>
>> I am optimistic that this would be a relatively small change to the code,
>> since it only requires always choosing the cache as the place to write data
>> first. Perhaps the sync behaviour is more complex; I am not familiar with
>> the internals.
>>
>> So, does anyone have any idea whether this is practical, whether it would
>> genuinely close the write hole, or any other thoughts?
>
> It's not a crazy idea - bcache already has some stripe awareness code that
> could be used as a starting point.
>
> The main thing you'd need to do is ensure that:
>
> - all writes are writeback, not writethrough (as you noted)
>
> - when the writeback thread is flushing dirty data, only flush entire
>   stripes - reading more data into the cache if necessary and marking it
>   dirty - then ensure that the entire stripe is marked dirty until the
>   entire stripe is flushed.
>
> This would basically be using bcache to do full data journalling.
>
> I'm not going to do the work myself - I'd rather spend my time working on
> adding erasure coding to bcachefs - but I could help out if you or someone
> else wanted to work on adding this to bcache.

I don't expect anyone else to do the work, nor to do this myself,
although if I have the funds - and I may do soon - I would be prepared to
pay someone to do it.

At the moment, I'm trying to check my facts and assumptions while
designing a complex system which won't be fully operational for a while.
I'd like to be sure that it is genuinely scalable - that the design is
valid - before I continue working in this way.

For what it's worth, I have recently set up a lot of this, taking
advantage of extremely cheap servers set up in a "novel" way, and the
performance is pretty good. As I've mentioned, I would like to write up
what I've done and why, and perhaps create an open source management
suite for people to repeat it.

James