public inbox for linux-bcache@vger.kernel.org
* layering question.
@ 2015-08-04 16:20 A. James Lewis
  2015-08-04 17:01 ` Jens-U. Mozdzen
  2015-08-05  6:28 ` Kai Krakow
  0 siblings, 2 replies; 14+ messages in thread
From: A. James Lewis @ 2015-08-04 16:20 UTC (permalink / raw)
  To: linux-bcache

Hi all...

I've heard rumours that layering bcache with other block device drivers 
might not be recommended... I wonder what the truth really is... perhaps 
someone can advise.

I was planning to use 2 SSDs... combined with 4 large spinning drives 
to create a large filesystem with BTRFS... my questions are as follows.

1. Is there a way to use 2 SSDs directly, or would it be OK to use MD 
to stripe them... and then use the MD array as the cache device?

2. I would be using BTRFS, so would it be better to create 4 separate 
bcache devices, each attached to the single cache device, and then use 
BTRFS to raid the 4 bcache devices (obviously this would be more 
flexible), or would I need to make an MD raid of the 4 devices, and 
then use that to create a single bcache device and build a BTRFS 
filesystem on top of that?

Hope that's clear, any clarification would be appreciated...

Also, there's talk about a pending on-disk cache format change some time 
around 3.19, but no details... is this over with, or still pending?

James

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: layering question.
  2015-08-04 16:20 layering question A. James Lewis
@ 2015-08-04 17:01 ` Jens-U. Mozdzen
  2015-08-04 17:16   ` A. James Lewis
  2015-08-05  6:28 ` Kai Krakow
  1 sibling, 1 reply; 14+ messages in thread
From: Jens-U. Mozdzen @ 2015-08-04 17:01 UTC (permalink / raw)
  To: A. James Lewis; +Cc: linux-bcache

Hi James,

Quoting "A. James Lewis" <james@fsck.co.uk>:
> Hi all...
>
> I've heard rumours that layering bcache with other block device  
> drivers might not be recommended... I wonder what the truth really  
> is... perhaps someone can advise.

to me it's more than rumors. We're facing severe difficulties (server  
reboots, disks marked faulty by MDRAID, hangs) in our layered setup:

- physical disks MD-RAID6 (data) plus two SSDs MD-RAID1 (cache)
- bcache
- LVM
- DRBD for many of the logical volumes (always primary, no fail-overs)
- ext4 fs
- NFS / Samba / SCST (fileio)

> I was planning to use 2 SSD's... combined with 4 large spinning  
> drives to create a large filesystem with BTRFS...  my questions are  
> as follows.
>
> 1. Is there a way to use 2 SSDs directly, or would it be OK to use  
> MD to stripe them?... and then use the MD array as the cache device?

MD-RAID1 is what our current configuration looks like. We've also  
combined the spinning disks into a RAID6.
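For concreteness, the two-array layout described here can be sketched as follows; all device names are placeholders chosen for illustration, not taken from the actual setup:

```shell
# Illustrative sketch only -- /dev/sdb..sde, /dev/sdf, /dev/sdg are
# placeholder device names; adapt counts and names to your hardware.

# Mirror the two SSDs into an MD-RAID1 to serve as the bcache cache device:
mdadm --create /dev/md/cache --level=1 --raid-devices=2 /dev/sdf /dev/sdg

# Combine the spinning disks into an MD-RAID6 backing array:
mdadm --create /dev/md/data --level=6 --raid-devices=4 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Format cache (-C) and backing (-B) device for bcache in one step;
# the resulting cached device appears as /dev/bcache0:
make-bcache -C /dev/md/cache -B /dev/md/data
```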

> 2. I would be using BTRFS, so would it be better to create 4  
> separate bcache devices each attached to the single cache device,  
> and then use BTRFS to raid 4 bcache devices... obviously this would  
> be more flexible, or would I need to make an MD raid of the 4  
> devices, and then use that to create a single bcache device and  
> build a BTRFS filesystem on top of that.

I have no btrfs experience, so I cannot answer that one. I went for a  
single data and cache device (via RAID) so I won't have to partition  
my SSDs - that would not have been scalable (we're planning to add  
plenty of physical disks over time, and to use many LVs/file systems).

> Also, there's talk about a pending on-disk cache format change some  
> time around 3.19, but no details... is this over with, or still  
> pending?

Perhaps someone else can help with that one?

Regards,
Jens

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: layering question.
  2015-08-04 17:01 ` Jens-U. Mozdzen
@ 2015-08-04 17:16   ` A. James Lewis
  2015-08-05  6:56     ` Jens-U. Mozdzen
  0 siblings, 1 reply; 14+ messages in thread
From: A. James Lewis @ 2015-08-04 17:16 UTC (permalink / raw)
  To: linux-bcache


Thanks for the details... to clarify, you are using raid1 for SSD cache 
devices, and then creating a RAID6 MD device to act as backing store?

What kernel are you using? You are having some stability issues, but in 
principle it works?...  what is performance like?

James


On 04/08/15 18:01, Jens-U. Mozdzen wrote:
> Hi James,
>
> Quoting "A. James Lewis" <james@fsck.co.uk>:
>> Hi all...
>>
>> I've heard rumours that layering bcache with other block device 
>> drivers might not be recommended... I wonder what the truth really 
>> is... perhaps someone can advise.
>
> to me it's more than rumors. We're facing severe difficulties (server 
> reboots, disks marked faulty by MDRAID, hangs) in our layered setup:
>
> - physical disks MD-RAID6 (data) plus two SSDs MD-RAID1 (cache)
> - bcache
> - LVM
> - DRBD for many of the logical volumes (always primary, no fail-overs)
> - ext4 fs
> - NFS / Samba / SCST (fileio)
>
>> I was planning to use 2 SSD's... combined with 4 large spinning 
>> drives to create a large filesystem with BTRFS...  my questions are 
>> as follows.
>>
>> 1. Is there a way to use 2 SSDs directly, or would it be OK to use 
>> MD to stripe them?... and then use the MD array as the cache device?
>
> MD-RAID1 is what our current configuration looks like. We've also 
> combined the spinning disks into a RAID6.
>
>> 2. I would be using BTRFS, so would it be better to create 4 separate 
>> bcache devices each attached to the single cache device, and then use 
>> BTRFS to raid 4 bcache devices... obviously this would be more 
>> flexible, or would I need to make an MD raid of the 4 devices, and 
>> then use that to create a single bcache device and build a BTRFS 
>> filesystem on top of that.
>
> I have no btrfs experience, so I cannot answer that one. I went for a 
> single data and cache device (via RAID) so I won't have to partition 
> my SSDs - that would not have been scalable (we're planning to add 
> plenty of physical disks over time, and to use many LVs/file systems).
>
>> Also, there's talk about a pending on-disk cache format change some 
>> time around 3.19, but no details... is this over with, or still pending?
>
> Someone else might want to help with that one as well?
>
> Regards,
> Jens
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: layering question.
  2015-08-04 16:20 layering question A. James Lewis
  2015-08-04 17:01 ` Jens-U. Mozdzen
@ 2015-08-05  6:28 ` Kai Krakow
  2015-08-05  7:04   ` Jens-U. Mozdzen
  1 sibling, 1 reply; 14+ messages in thread
From: Kai Krakow @ 2015-08-05  6:28 UTC (permalink / raw)
  To: linux-bcache

A. James Lewis <james@fsck.co.uk> wrote:

> I've heard rumours that layering bcache with other block device drivers
> might not be recommended... I wonder what the truth really is... perhaps
> someone can advise.

I think this is not just rumours. Multiple people have reported problems 
when layering caching or backing devices on top of MD devices. This may 
be an implementation problem in MD that is gone in later kernel versions 
and had to do with correctly passing discards through the layers. If you 
want to use such a setup, I'd at least recommend disabling discards and 
disabling write-back caching (just use write-through or write-around, 
which is obviously slower but might fit your workload).
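A minimal sketch of that advice, assuming an already-registered cache set; the UUID path is a placeholder, and both knobs are documented in Documentation/bcache.txt:

```shell
# Disable discards on the cache device (the UUID path is hypothetical;
# substitute your cache set's UUID from /sys/fs/bcache/):
echo 0 > /sys/fs/bcache/<cache-set-uuid>/cache0/discard

# Avoid write-back caching; write-through keeps the backing device
# authoritative at the cost of write speed:
echo writethrough > /sys/block/bcache0/bcache/cache_mode
```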

> I was planning to use 2 SSD's... combined with 4 large spinning drives
> to create a large filesystem with BTRFS...  my questions are as follows.

Using one SSD with 3 spinning drives here.

> 1. Is there a way to use 2 SSDs directly, or would it be OK to use MD
> to stripe them?... and then use the MD array as the cache device?

I think currently you can only add one caching device to a cache set. I 
think it is planned to allow more in a later development stage, but for 
now your only way to go would be an MD array if you want to use MD. I'd 
rather suggest using hardware RAID for that.

> 2. I would be using BTRFS, so would it be better to create 4 separate
> bcache devices each attached to the single cache device, and then use
> BTRFS to raid 4 bcache devices... obviously this would be more flexible,
> or would I need to make an MD raid of the 4 devices, and then use that
> to create a single bcache device and build a BTRFS filesystem on top of
> that.

You can just attach multiple backing devices (each sub-device of your 
btrfs pool) to the same cache set - so your caching device would cache 
all backing devices.
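Assuming one SSD and four spinning drives (all device names are hypothetical), that setup might look like:

```shell
# One cache set caching four backing devices; btrfs RAID is then built
# on the resulting /dev/bcacheN nodes. Device names are placeholders.
make-bcache -C /dev/ssd                             # create the cache set
make-bcache -B /dev/sdb /dev/sdc /dev/sdd /dev/sde  # four backing devices

# Attach every backing device to the same cache set via its UUID
# (shown by bcache-super-show /dev/ssd):
for dev in bcache0 bcache1 bcache2 bcache3; do
    echo <cache-set-uuid> > /sys/block/$dev/bcache/attach
done

# Let btrfs do the RAID across the cached devices:
mkfs.btrfs -d raid10 -m raid10 \
    /dev/bcache0 /dev/bcache1 /dev/bcache2 /dev/bcache3
```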

> Hope that's clear, any clarification would be appreciated...

I'd go with the following setup:

I'm not sure which btrfs RAID level you are going to use. Maybe RAID 10, 
probably RAID 0. This means, btrfs tries to evenly spread writes and reads 
across all devices.

I suggest using 2 cache sets: one bcache for btrfs pool members 1 and 2, 
one bcache for btrfs pool members 3 and 4. If you add more members to 
the pool later, just attach them in alternating order to the first or 
second cache set.

This should give you most value out of your current setup.

If you plan to use 2 SSDs for improved robustness (combined into RAID 1), 
you obviously don't have this option. You could try MD-RAID, though I 
wouldn't recommend this without doing your own tests and backups. Better 
to use hardware RAID then, though most controllers will then disable the 
ability to use discard, so you probably want to leave some spare space 
unused on your SSDs for improved lifetime and long-term performance.

> Also, there's talk about a pending on-disk cache format change some time
> around 3.19, but no details... is this over with, or still pending?

No idea, I'm on 4.1 now and used bcache since 3.18 or 3.19 - not sure. It 
worked well.

PS: Disable "autodefrag" if you use btrfs+bcache... ;-) It helps 
performance a bit, but it eats SSD lifetime (it used 20% of my projected 
SSD lifetime in only 2 months, according to smartctl). Better to just 
defragment metadata from time to time (read: use btrfs defrag on 
directories only) if you want to defrag.
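A sketch of that advice (mount point and paths are placeholders):

```shell
# Mount without autodefrag (it is off by default; noautodefrag
# overrides a default configured elsewhere):
mount -o noautodefrag /dev/bcache0 /mnt/data

# Defragmenting a directory *without* -r touches only the directory
# metadata, not the files inside it:
btrfs filesystem defragment /mnt/data/some/directory
```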

-- 
Replies to list only preferred.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: layering question.
  2015-08-04 17:16   ` A. James Lewis
@ 2015-08-05  6:56     ` Jens-U. Mozdzen
  0 siblings, 0 replies; 14+ messages in thread
From: Jens-U. Mozdzen @ 2015-08-05  6:56 UTC (permalink / raw)
  To: A. James Lewis; +Cc: linux-bcache

Hi James,

Quoting "A. James Lewis" <james@fsck.co.uk>:
> Thanks for the details... to clarify, you are using raid1 for SSD  
> cache devices, and then creating a RAID6 MD device to act as backing  
> store?

yes - my two SSDs are RAID1 (I named it /dev/md/linux:san02-cache),  
seven 1TB disks are RAID6 (/dev/md/linux:san02-data), and these two  
were prepared using "make-bcache -C /dev/md/linux\:san02-cache -B  
/dev/md/linux\:san02-data" to create /dev/bcache0.

> What kernel are you using? You are having some stability issues, but  
> in principle it works?...  what is performance like?

Originally, I used the latest OpenSUSE 13.1 stable kernel  
(kernel-default-3.11.10-29.1), but after seeing random reboots that  
seemed to match bugs fixed in later DRBD versions, the servers were  
updated to kernel-default-3.18.8-5.1.

Basically, the system works like a charm; we were enthusiastic about  
the first results. I don't have absolute numbers in terms of  
throughput, but since switching to bcache, our I/O waits dropped from  
"5 to 45%" to "0 to 5%". The servers are mainly used to provide  
virtual disks for a number of virtual machines, which are running on a  
separate server farm connected via Fibre Channel. Additionally, the  
servers provide NFS access to various file systems (among them the home  
directories of the local users and the working area for a distributed  
development environment). Add in a small amount of SMB traffic for a  
few MS-Win machines and you have the overall picture... mostly  
small-sized accesses, with plenty of reads and writes from various  
sources. Even with lots of memory caching, that mix did bring our  
servers to a user-noticeable I/O load, which basically vanished when we  
introduced bcache. Only large consecutive writes (e.g. ISOs) go to the  
disks directly and hence lead to measurable I/O waits... but that's  
rare and only turns up in monitoring, rather than being "felt by  
users".

As I detailed in the other recent thread, when switching to 3.18.8 we  
suddenly were unable to create new file systems on one of the servers;  
mkfs reproducibly led to a server reboot. Turning bcache to  
"writethrough" solved this, but made MD report disks in our backing  
device as failing, always in the context of what seemed to be hanging  
disk accesses, matching the "bcache locking problem" pattern. (I did  
apply the set of known patches from this mailing list.)

Since last Saturday, we're back to "writeback" and no more disks have  
failed - but I haven't tried creating new file systems since; I'll  
have to wait for a maintenance window ;)

Just for completeness: we use /dev/bcache0 as the only "physical  
volume" of an LVM volume group, and create various logical volumes on  
top (both for local system use, and many that are used per NFS  
resource and per "Fibre Channel"-connected VM). The non-system LVs are  
mirrored to a separate machine via DRBD (which then will periodically  
break the link and back up the "snapshots" to external media), and the  
actual file systems (Ext4) are created on top of these DRBD resources.
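The lower half of that stack can be sketched like this (volume group and LV names are placeholders, and the DRBD configuration is omitted):

```shell
# /dev/bcache0 as the sole LVM physical volume, with LVs carved on top.
pvcreate /dev/bcache0
vgcreate vg_san /dev/bcache0
lvcreate -L 100G -n home vg_san

# In the setup described above, DRBD sits between the LV and the
# filesystem; without DRBD the filesystem would go directly on the LV:
mkfs.ext4 /dev/vg_san/home
```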

Regards,
Jens

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: layering question.
  2015-08-05  6:28 ` Kai Krakow
@ 2015-08-05  7:04   ` Jens-U. Mozdzen
  2015-08-05 23:10     ` Kai Krakow
  0 siblings, 1 reply; 14+ messages in thread
From: Jens-U. Mozdzen @ 2015-08-05  7:04 UTC (permalink / raw)
  To: Kai Krakow; +Cc: linux-bcache

Hi Kai,

Quoting Kai Krakow <hurikhan77@gmail.com>:
> A. James Lewis <james@fsck.co.uk> wrote:
>
>> I've heard rumours that layering bcache with other block device drivers
>> might not be recommended... I wonder what the truth really is... perhaps
>> someone can advise.
>
> I think this is not just rumours. Multiple people reported problems when
> layering caching or backing devices on top of MD devices. This may be an
> implementation problem in MD which is gone in later kernel versions [...]

being rather new to bcache, I only browsed the last few months of  
mailing list history - are you saying that these problems were fixed  
(or simply vanished) at some point after 3.18.8? Because if so, I'd of  
course try to upgrade our servers to a more recent kernel :)

Regards,
Jens

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: layering question.
  2015-08-05  7:04   ` Jens-U. Mozdzen
@ 2015-08-05 23:10     ` Kai Krakow
  2015-08-06  0:54       ` A. James Lewis
  0 siblings, 1 reply; 14+ messages in thread
From: Kai Krakow @ 2015-08-05 23:10 UTC (permalink / raw)
  To: linux-bcache

Jens-U. Mozdzen <jmozdzen@nde.ag> wrote:

> Quoting Kai Krakow <hurikhan77@gmail.com>:
>> A. James Lewis <james@fsck.co.uk> wrote:
>>
>>> I've heard rumours that layering bcache with other block device drivers
>>> might not be recommended... I wonder what the truth really is... perhaps
>>> someone can advise.
>>
>> I think this is not just rumours. Multiple people reported problems when
>> layering caching or backing devices on top of MD devices. This may be an
>> implementation problem in MD which is gone in later kernel versions [...]
> 
> being rather new to bcache, I did only browse the last few months of
> mailing list history - are you saying that these problems were fixed
> (or simply vanished) some point after 3.18.8? Because if so, I'd of
> course try to upgrade our servers to a more recent kernel :)

The latest posts imply it is still a problem, and it fits with earlier 
reports: caching on a native device, backing on an MD device... bcache 
breaks within the caching device (although this one is not on MD). 
There still seem to be bugs in how bcache and MD interact.

It was suspected that bcache uses a faulty discard implementation; some 
reports miss details about this setting. However, my setups are working 
fine with discards fully enabled on the SSD - but without using MD. And 
it has been robust to accidental or forced reboots for as long as I 
have been using it (even with btrfs as the filesystem on bcache).

So I'd probably remove MD from your plans for using bcache.

BTW: My system uses vanilla gentoo kernel, 4.1.4 currently.

-- 
Replies to list only preferred.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: layering question.
  2015-08-05 23:10     ` Kai Krakow
@ 2015-08-06  0:54       ` A. James Lewis
  2015-08-06 23:12         ` Kai Krakow
  0 siblings, 1 reply; 14+ messages in thread
From: A. James Lewis @ 2015-08-06  0:54 UTC (permalink / raw)
  To: linux-bcache

The problem is, though... with a very large backing store, I'm not 
really happy with a single point of failure in the cache... is there 
another way to mirror the cache device?

Does anyone make an M.2 RAID controller? Being PCIe, I don't know if 
that would even be possible.

James

On 06/08/15 00:10, Kai Krakow wrote:
> Jens-U. Mozdzen <jmozdzen@nde.ag> wrote:
>
>> Quoting Kai Krakow <hurikhan77@gmail.com>:
>>> A. James Lewis <james@fsck.co.uk> wrote:
>>>
>>>> I've heard rumours that layering bcache with other block device drivers
>>>> might not be recommended... I wonder what the truth really is... perhaps
>>>> someone can advise.
>>> I think this is not just rumours. Multiple people reported problems when
>>> layering caching or backing devices on top of MD devices. This may be an
>>> implementation problem in MD which is gone in later kernel versions [...]
>> being rather new to bcache, I did only browse the last few months of
>> mailing list history - are you saying that these problems were fixed
>> (or simply vanished) some point after 3.18.8? Because if so, I'd of
>> course try to upgrade our servers to a more recent kernel :)
> The latest posts imply it is still a problem, and it fits with earlier
> reports: caching on a native device, backing on an MD device... bcache
> breaks within the caching device (although this one is not on MD).
> There still seem to be bugs in how bcache and MD interact.
>
> It was suspected that bcache uses a faulty discard implementation; some
> reports miss details about this setting. However, my setups are working
> fine with discards fully enabled on the SSD - but without using MD. And
> it has been robust to accidental or forced reboots for as long as I
> have been using it (even with btrfs as the filesystem on bcache).
>
> So I'd probably remove MD from your plans for using bcache.
>
> BTW: My system uses vanilla gentoo kernel, 4.1.4 currently.
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: layering question.
  2015-08-06  0:54       ` A. James Lewis
@ 2015-08-06 23:12         ` Kai Krakow
  2015-08-07 12:43           ` Jens-U. Mozdzen
  0 siblings, 1 reply; 14+ messages in thread
From: Kai Krakow @ 2015-08-06 23:12 UTC (permalink / raw)
  To: linux-bcache

Hi!

A. James Lewis <james@fsck.co.uk> wrote:

> The problem is tho... with a very large backing store, I'm not really
> happy with a single point of failure in the cache... is there another
> way to mirror the cache device?

Well, AFAIR there are plans to add such capabilities to bcache itself - 
read: make it possible to add more than one caching device to a cache 
set. It would use some sort of hybrid mirroring/striping to get the 
best combination of speed and safety - at least that's what the idea is 
about. I just don't remember where I read about it, nor do I know its 
current status.

If you want to eliminate the single point of failure, you may want to 
try mdadm with its write-mostly option instead of using bcache. It's 
obviously slower for writes, but it gracefully falls back if the SSD 
fails. Obviously, you also cannot benefit from the huge storage, 
because it's classic RAID-1 and thus the smallest member will limit 
your storage size.
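A sketch of the write-mostly alternative (device names are hypothetical): mdadm flags the devices listed after --write-mostly as members to avoid for reads, so reads are served from the SSD while writes still hit both:

```shell
# RAID1 of one SSD and one HDD; the HDD is flagged write-mostly so the
# SSD serves (almost) all reads. Device names are placeholders.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    /dev/ssd1 --write-mostly /dev/hdd1
mkfs.btrfs /dev/md0
```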

Bcache also has countermeasures for a failing caching device but I didn't 
really look into that yet. You should read the documentation about it in 
Documentation/bcache.txt (Error Handling). The safest mode to use here is 
writethrough.
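For reference, the error-handling knob lives next to the cache mode in sysfs; this sketch uses placeholder paths and the attribute names from Documentation/bcache.txt:

```shell
# Safest cache mode, per the discussion above:
echo writethrough > /sys/block/bcache0/bcache/cache_mode

# What to do when the cache device accumulates too many I/O errors:
# "unregister" detaches the cache, "panic" halts the machine.
echo unregister > /sys/fs/bcache/<cache-set-uuid>/errors
```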

> On 06/08/15 00:10, Kai Krakow wrote:
>> Jens-U. Mozdzen <jmozdzen@nde.ag> wrote:
>>
>>> Quoting Kai Krakow <hurikhan77@gmail.com>:
>>>> A. James Lewis <james@fsck.co.uk> wrote:
>>>>
>>>>> I've heard rumours that layering bcache with other block device
>>>>> drivers might not be recommended... I wonder what the truth really
>>>>> is... perhaps someone can advise.
>>>> I think this is not just rumours. Multiple people reported problems
>>>> when layering caching or backing devices on top of MD devices. This may
>>>> be an implementation problem in MD which is gone in later kernel
>>>> versions [...]
>>> being rather new to bcache, I did only browse the last few months of
>>> mailing list history - are you saying that these problems were fixed
>>> (or simply vanished) some point after 3.18.8? Because if so, I'd of
>>> course try to upgrade our servers to a more recent kernel :)
>> The latest posts imply it is still a problem, and it fits with earlier
>> reports: caching on a native device, backing on an MD device... bcache
>> breaks within the caching device (although this one is not on MD).
>> There still seem to be bugs in how bcache and MD interact.
>>
>> It was suspected that bcache uses a faulty discard implementation; some
>> reports miss details about this setting. However, my setups are working
>> fine with discards fully enabled on the SSD - but without using MD. And
>> it has been robust to accidental or forced reboots for as long as I
>> have been using it (even with btrfs as the filesystem on bcache).
>>
>> So I'd probably remove MD from your plans for using bcache.
>>
>> BTW: My system uses vanilla gentoo kernel, 4.1.4 currently.
>>
-- 
Replies to list only preferred.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: layering question.
  2015-08-06 23:12         ` Kai Krakow
@ 2015-08-07 12:43           ` Jens-U. Mozdzen
  2015-08-07 14:38             ` A. James Lewis
  0 siblings, 1 reply; 14+ messages in thread
From: Jens-U. Mozdzen @ 2015-08-07 12:43 UTC (permalink / raw)
  To: linux-bcache

Hi *,

Quoting Kai Krakow <hurikhan77@gmail.com>:
> Hi!
>
> A. James Lewis <james@fsck.co.uk> wrote:
>
>> The problem is tho... with a very large backing store, I'm not really
>> happy with a single point of failure in the cache... is there another
>> way to mirror the cache device?
>
> Well, AFAIR there are plans to add such capabilities into bcache itself -
> read: make it possible to add more than one caching device to a cache set.
> It will use some sort of hybrid mirror / striping to get the best
> combination of speed and safety - at least that's what the idea is about. I
> just don't remember where I've read about it, neither do I know the status
> of it.
>
> If you want to eliminate the single point of failure, you may want to try
> mdadm with its write-mostly option instead of using bcache. It's slower for
> writes obviously but gracefully falls back if the SSD fails. Obviously, you
> can also not benefit from having a huge storage because it's classic RAID-1
> and thus the smallest member will limit your storage size.
>
> Bcache also has countermeasures for a failing caching device but I didn't
> really look into that yet. You should read the documentation about it in
> Documentation/bcache.txt (Error Handling). The safest mode to use here is
> writethrough.

A word of caution here: at least in my layered setup (kernel 3.18.8),  
the upper layers from time to time run into some sort of time-out  
situation when writing to the (bcached) disk. The writes abort (bad,  
but tolerable in my circumstances), but on top of that this makes MD  
mark the current disk faulty, degrading your RAID.

When using "writeback", the likelihood of this happening is relatively  
small (not more than once every few days), probably because the writes  
to SSD are fairly quick. These hits have then always been on the  
caching device (MD-RAID1 in my case).

When using "writethrough", the likelihood was much higher (I've seen 2  
hits within 6 hours, no later than 28 hours after switching to  
"writethrough"), and the hit was on the data device (MD-RAID6 in my  
case).

Had I only set up RAID5, my data array would have dropped dead then.

After switching back to "writeback", I've had *one* further incident,  
again on the caching device, within 6 days.

I would definitely not call "writethrough" "the safest mode" when  
using MD-RAID for the bcache devices, on kernel 3.18.8.

Regards,
Jens

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: layering question.
  2015-08-07 12:43           ` Jens-U. Mozdzen
@ 2015-08-07 14:38             ` A. James Lewis
  2015-08-07 15:36               ` Jens-U. Mozdzen
  0 siblings, 1 reply; 14+ messages in thread
From: A. James Lewis @ 2015-08-07 14:38 UTC (permalink / raw)
  To: linux-bcache, Jens-U. Mozdzen


That's interesting... are you putting your MD on top of multiple bcache 
devices, rather than bcache on top of an MD device? I wonder what the 
rationale behind this is.

Also, can anyone give me a summary of how bcache compares with dm-cache?

James


On 07/08/15 13:43, Jens-U. Mozdzen wrote:
> Hi *,
>
> Quoting Kai Krakow <hurikhan77@gmail.com>:
>> Hi!
>>
>> A. James Lewis <james@fsck.co.uk> wrote:
>>
>>> The problem is tho... with a very large backing store, I'm not really
>>> happy with a single point of failure in the cache... is there another
>>> way to mirror the cache device?
>>
>> Well, AFAIR there are plans to add such capabilities into bcache 
>> itself -
>> read: make it possible to add more than one caching device to a cache 
>> set.
>> It will use some sort of hybrid mirror / striping to get the best
>> combination of speed and safety - at least that's what the idea is 
>> about. I
>> just don't remember where I've read about it, neither do I know the 
>> status
>> of it.
>>
>> If you want to eliminate the single point of failure, you may want to 
>> try
>> mdadm with its write-mostly option instead of using bcache. It's 
>> slower for
>> writes obviously but gracefully falls back if the SSD fails. 
>> Obviously, you
>> can also not benefit from having a huge storage because it's classic 
>> RAID-1
>> and thus the smallest member will limit your storage size.
>>
>> Bcache also has countermeasures for a failing caching device but I 
>> didn't
>> really look into that yet. You should read the documentation about it in
>> Documentation/bcache.txt (Error Handling). The safest mode to use 
>> here is
>> writethrough.
>
> A word of caution here: at least in my layered setup (kernel 3.18.8), 
> the upper layers from time to time run into some sort of time-out 
> situation when writing to the (bcached) disk. The writes abort (bad, 
> but tolerable in my circumstances), but on top of that this makes MD 
> mark the current disk faulty, degrading your RAID.
>
> When using "writeback", the likelihood of this happening is relatively 
> small (not more than once every few days), probably because the writes 
> to SSD are fairly quick. These hits have then always been on the 
> caching device (MD-RAID1 in my case).
>
> When using "writethrough", the likelihood was much higher (I've seen 2 
> hits within 6 hours, no later than 28 hours after switching to 
> "writethrough"), and the hit was on the data device (MD-RAID6 in my case).
>
> Had I only set up RAID5, my data array would have dropped dead then.
>
> After switching back to "writeback", I've had *one* further incident, 
> again on the caching device, within 6 days.
>
> I would definitely not call "writethrough" "the safest mode" when 
> using MD-RAID for the bcache devices, on kernel 3.18.8.
>
> Regards,
> Jens
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: layering question.
  2015-08-07 14:38             ` A. James Lewis
@ 2015-08-07 15:36               ` Jens-U. Mozdzen
  2015-08-07 16:16                 ` A. James Lewis
  0 siblings, 1 reply; 14+ messages in thread
From: Jens-U. Mozdzen @ 2015-08-07 15:36 UTC (permalink / raw)
  To: A. James Lewis; +Cc: linux-bcache

Hi James,

Quoting "A. James Lewis" <james@fsck.co.uk>:
> That's interesting, are you putting your MD on top of multiple  
> bcache devices... rather than bcache on top of an MD device... I  
> wonder what the rationale behind this is?

No such thing here...

bcache is running on top of two MD-RAIDs - RAID6 with 7 spinning  
drives and RAID1 with two SSDs.

The stack is, from bottom to top:

- MD-RAID6 data, MD-RAID1 cache
- bcache (/dev/bcache0, used as an LVM PV)
- LVM
- many LVs
- DRBD on top of most of the LVs
- Ext4 on each of the DRBD devices
- SCST / NFS / SMB sharing these file systems

In the referenced incidents, SCST reports that (many) writes failed  
due to time-out, and MD reports a single disk faulty. No other traces  
in syslog, especially no stalled processes, locking problems or kernel  
bugs.

The i/o pattern is highly parallel reads and writes, mostly via SCST.

Regards,
Jens

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: layering question.
  2015-08-07 15:36               ` Jens-U. Mozdzen
@ 2015-08-07 16:16                 ` A. James Lewis
  0 siblings, 0 replies; 14+ messages in thread
From: A. James Lewis @ 2015-08-07 16:16 UTC (permalink / raw)
  To: linux-bcache


OK, but in that case bcache is not between your MD RAID and its disks, 
so if your disks are dropping out of the MD array, that has to be 
either an independent problem or a very complex bug.

James


On 07/08/15 16:36, Jens-U. Mozdzen wrote:
> Hi James,
>
> Quoting "A. James Lewis" <james@fsck.co.uk>:
>> That's interesting, are you putting your MD on top of multiple bcache
>> devices... rather than bcache on top of an MD device... I wonder what
>> the rationale behind this is?
>
> Hi James, no such thing here...
>
> bcache is running on top of two MD-RAIDs - RAID6 with 7 spinning
> drives and RAID1 with two SSDs.
>
> The stack is, from bottom to top:
>
> - MD-RAID6 data, MD-RAID1 cache
> - bcache (/dev/bcache0, used as an LVM PV)
> - LVM
> - many LVs
> - DRBD on top of most of the LVs
> - Ext4 on each of the DRBD devices
> - SCST / NFS / SMB sharing these file systems
>
> In the referenced incidents, SCST reports that (many) writes failed
> due to time-out, and MD reports a single disk faulty. No other traces
> in syslog, especially no stalled processes, locking problems or kernel
> bugs.
>
> The i/o pattern is highly parallel reads and writes, mostly via SCST.
>
> Regards,
> Jens
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: layering question.
@ 2015-08-07 16:24 Jens-U. Mozdzen
  0 siblings, 0 replies; 14+ messages in thread
From: Jens-U. Mozdzen @ 2015-08-07 16:24 UTC (permalink / raw)
  To: linux-bcache

Hi James,

Quoting "A. James Lewis" <james@fsck.co.uk>:
> OK, but in that case bcache is not between your MD RAID and its  
> disks, so if your disks are dropping out of the MD array, that has  
> to be either an independent problem or a very complex bug.

My guess is that it's a rather simple timeout/locking problem, which  
leads to an expiring timer in the MD code. And bcache has a well-known  
history of locking problems, according to the mailing list.

Regards,
Jens

> James
>
>
> On 07/08/15 16:36, Jens-U. Mozdzen wrote:
>> Hi James,
>>
>> Quoting "A. James Lewis" <james@fsck.co.uk>:
>>> That's interesting, are you putting your MD on top of multiple  
>>> bcache devices... rather than bcache on top of an MD device... I  
>>> wonder what the rationale behind this is?
>>
>> Hi James, no such thing here...
>>
>> bcache is running on top of two MD-RAIDs - RAID6 with 7 spinning  
>> drives and RAID1 with two SSDs.
>>
>> The stack is, from bottom to top:
>>
>> - MD-RAID6 data, MD-RAID1 cache
>> - bcache (/dev/bcache0, used as an LVM PV)
>> - LVM
>> - many LVs
>> - DRBD on top of most of the LVs
>> - Ext4 on each of the DRBD devices
>> - SCST / NFS / SMB sharing these file systems
>>
>> In the referenced incidents, SCST reports that (many) writes failed  
>> due to time-out, and MD reports a single disk faulty. No other  
>> traces in syslog, especially no stalled processes, locking problems  
>> or kernel bugs.
>>
>> The i/o pattern is highly parallel reads and writes, mostly via SCST.
>>
>> Regards,
>> Jens

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2015-08-07 16:24 UTC | newest]

Thread overview: 14+ messages
2015-08-04 16:20 layering question A. James Lewis
2015-08-04 17:01 ` Jens-U. Mozdzen
2015-08-04 17:16   ` A. James Lewis
2015-08-05  6:56     ` Jens-U. Mozdzen
2015-08-05  6:28 ` Kai Krakow
2015-08-05  7:04   ` Jens-U. Mozdzen
2015-08-05 23:10     ` Kai Krakow
2015-08-06  0:54       ` A. James Lewis
2015-08-06 23:12         ` Kai Krakow
2015-08-07 12:43           ` Jens-U. Mozdzen
2015-08-07 14:38             ` A. James Lewis
2015-08-07 15:36               ` Jens-U. Mozdzen
2015-08-07 16:16                 ` A. James Lewis
  -- strict thread matches above, loose matches on Subject: below --
2015-08-07 16:24 Jens-U. Mozdzen
