* Bcache on btrfs: the perfect setup or the perfect storm?
@ 2015-04-08 6:59 arnaud gaboury
2015-04-08 19:02 ` Kai Krakow
[not found] ` <55257b31.0308c20a.75a0.7849@mx.google.com>
0 siblings, 2 replies; 5+ messages in thread
From: arnaud gaboury @ 2015-04-08 6:59 UTC (permalink / raw)
To: linux-bcache
I plan to set up bcache with an SSD as the caching device and an HD as the
backing device, with btrfs on top. The box will be a server, but not a
production one.
I know this scheme is not recommended and can be a cause of filesystem
corruption. But as I like living on the edge, I want to give it a try,
with a tight backup policy.

Any advice? Any settings worth tweaking to avoid future drama?
Thank you
--
google.com/+arnaudgabourygabx
* Re: Bcache on btrfs: the perfect setup or the perfect storm?
  2015-04-08  6:59 Bcache on btrfs: the perfect setup or the perfect storm? arnaud gaboury
@ 2015-04-08 19:02 ` Kai Krakow
  [not found]   ` <55257b31.0308c20a.75a0.7849@mx.google.com>
  1 sibling, 0 replies; 5+ messages in thread
From: Kai Krakow @ 2015-04-08 19:02 UTC (permalink / raw)
To: linux-bcache

arnaud gaboury <arnaud.gaboury@gmail.com> wrote:

> I plan to set up bcache with an SSD as the caching device and an HD as the
> backing device, with btrfs on top. The box will be a server, but not a
> production one.
>
> I know this scheme is not recommended and can be a cause of filesystem
> corruption. But as I like living on the edge, I want to give it a try,
> with a tight backup policy.
>
> Any advice? Any settings worth tweaking to avoid future drama?

See my recommendations about partitioning in the post I just made here:
http://permalink.gmane.org/gmane.linux.kernel.bcache.devel/2865

Take note of the warnings about trimming/enabling discard. If you want to
use it, at least test it before trusting your kitty to it. ;-)

BTW: I'm planning a similar system, though the plans may well be 1 or 2
years in the future; that system is meant to be based on bcache/btrfs and
should become a production container-VM host. We'll see. The fallback plan
is a battery-backed RAID controller with CacheCade (SSDs as volume cache).

--
Replies to list only preferred.
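Not part of the original thread, but a minimal sketch of how to act on the
trimming advice above, assuming the SSD is /dev/sdb as later messages
suggest: check whether the drive advertises TRIM at all before enabling
--discard anywhere.

-------------------------------------------------------
## Hypothetical pre-check, run as root; /dev/sdb assumed to be the SSD.
hdparm -I /dev/sdb | grep -i trim   # should list "Data Set Management TRIM supported"
lsblk --discard /dev/sdb            # non-zero DISC-GRAN/DISC-MAX means discards can be issued
-------------------------------------------------------

If either check comes back empty, enabling discard in make-bcache or in the
filesystem gains nothing and is best left off.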
[parent not found: <55257b31.0308c20a.75a0.7849@mx.google.com>]
* Re: Bcache on btrfs: the perfect setup or the perfect storm?
  [not found]   ` <55257b31.0308c20a.75a0.7849@mx.google.com>
@ 2015-04-10 14:42     ` arnaud gaboury
  2015-04-10 23:33       ` Kai Krakow
  0 siblings, 1 reply; 5+ messages in thread
From: arnaud gaboury @ 2015-04-10 14:42 UTC (permalink / raw)
To: Kai Krakow; +Cc: linux-bcache

On Wed, Apr 8, 2015 at 9:02 PM, Kai Krakow <kai@kaishome.de> wrote:
> arnaud gaboury <arnaud.gaboury@gmail.com> wrote:
>
>> I plan to set up bcache with an SSD as the caching device and an HD as
>> the backing device, with btrfs on top. The box will be a server, but not
>> a production one.
>>
>> I know this scheme is not recommended and can be a cause of filesystem
>> corruption. But as I like living on the edge, I want to give it a try,
>> with a tight backup policy.
>>
>> Any advice? Any settings worth tweaking to avoid future drama?
>
> See my recommendations about partitioning in the post I just made here:
> http://permalink.gmane.org/gmane.linux.kernel.bcache.devel/2865
>
> Take note of the warnings about trimming/enabling discard. If you want to
> use it, at least test it before trusting your kitty to it. ;-)
>
> BTW: I'm planning a similar system, though the plans may well be 1 or 2
> years in the future; that system is meant to be based on bcache/btrfs and
> should become a production container-VM host. We'll see. The fallback plan
> is a battery-backed RAID controller with CacheCade (SSDs as volume cache).
>
> --
> Replies to list only preferred.

Thank you.
Here is what I did to prevent any drama:

The caching device, an SSD, with GPT partitions:

-------------------------------------------------------
# gdisk /dev/sdd
Command: p
Disk /dev/sdb: 224674128 sectors, 107.1 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): EAAC52BC-8236-483F-9875-744AF7031E72
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 224674094
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048       167774207    80.0 GiB   8300  poppy-root
   2       167774208       224674094    27.1 GiB   8300  poppy-cache
-------------------------------------------------------

Then:

# make-bcache --wipe-bcache --block 4k --bucket 2M -B /dev/sdd4 -C /dev/sdb2 --discard

Now, referring to your post, I have no idea about:
"Fourth - wear-levelling reservation."

Shall I change my partition table for /dev/sdd2 and leave some space?
Below is some info about the SSD:

-----------------------------------------------------------------------------------------
# hdparm -I /dev/sdb

/dev/sdb:

ATA device, with non-removable media
        Model Number:       OCZ-VERTEX2 3.5
        Serial Number:      OCZ-52181PZL67XEETUZ
        Firmware Revision:  1.33
        Transport:          Serial
Standards:
        Used: unknown (minor revision code 0x0028)
        Supported: 8 7 6 5
        Likely used: 8
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  224674128
        LBA48  user addressable sectors:  224674128
        Logical  Sector size:                   512 bytes
        Physical Sector size:                   512 bytes
        Logical Sector-0 offset:                  0 bytes
        device size with M = 1024*1024:      109704 MBytes
        device size with M = 1000*1000:      115033 MBytes (115 GB)
        cache/buffer size  = unknown
        Nominal Media Rotation Rate: Solid State Device
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, no device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
-----------------------------------------------------------------------------------------

--
google.com/+arnaudgabourygabx
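Again not from the original exchange: a sketch of how the result of the
make-bcache run above could be sanity-checked, assuming bcache-tools is
installed and the backing device comes up as bcache0 (both assumptions;
adjust to the actual layout).

-------------------------------------------------------
## Hypothetical verification steps, run as root.
bcache-super-show /dev/sdb2             # cache superblock: bucket size, cset.uuid
bcache-super-show /dev/sdd4             # backing device superblock
cat /sys/block/bcache0/bcache/state     # "clean"/"dirty" once attached, "no cache" otherwise
cat /sys/block/bcache0/bcache/cache_mode   # writethrough is the default; switch to writeback only after testing
-------------------------------------------------------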
* Re: Bcache on btrfs: the perfect setup or the perfect storm?
  2015-04-10 14:42     ` arnaud gaboury
@ 2015-04-10 23:33       ` Kai Krakow
  2015-04-10 23:49         ` Kai Krakow
  0 siblings, 1 reply; 5+ messages in thread
From: Kai Krakow @ 2015-04-10 23:33 UTC (permalink / raw)
To: linux-bcache

arnaud gaboury <arnaud.gaboury@gmail.com> wrote:

> On Wed, Apr 8, 2015 at 9:02 PM, Kai Krakow <kai@kaishome.de> wrote:
> [...]
>
> Thank you.
> Here is what I did to prevent any drama:

Because of the bugs mentioned here, to prevent drama you'd better not use
discard, or at least do your tests with appropriate backups in place; and
run your performance tests as well.

> The caching device, an SSD, with GPT partitions:
>
> -------------------------------------------------------
> # gdisk /dev/sdd
> Command: p
> Disk /dev/sdb: 224674128 sectors, 107.1 GiB
> Logical sector size: 512 bytes
> Disk identifier (GUID): EAAC52BC-8236-483F-9875-744AF7031E72
> Partition table holds up to 128 entries
> First usable sector is 34, last usable sector is 224674094
> Partitions will be aligned on 2048-sector boundaries
> Total free space is 2014 sectors (1007.0 KiB)
>
> Number  Start (sector)    End (sector)  Size       Code  Name
>    1            2048       167774207    80.0 GiB   8300  poppy-root
>    2       167774208       224674094    27.1 GiB   8300  poppy-cache

This matches my setup (though I used another layout). My drive reports the
full 128 GB instead of only 100 GB, so I partitioned only 100 GB and
trimmed the whole drive before partitioning.

> Then:
>
> # make-bcache --wipe-bcache --block 4k --bucket 2M -B /dev/sdd4 -C
> /dev/sdb2 --discard

Take your safety measures with "discard"; see other threads/posts and
above.

> Now, referring to your post, I have no idea about:
> "Fourth - wear-levelling reservation."

Wear-levelling means that the drive tries to distribute writes evenly
across the flash cells. It uses an internal mapping to dynamically remap
logical blocks to internal block addresses. Because flash cells cannot be
overwritten in place, they have to be erased first, then written with the
new data. The erasing itself is slow. This also implies that to modify a
single logical sector of a cell, the drive has to read, modify, erase, and
then write a whole cell. This is clearly even slower; it is what is known
as "write amplification".

To compensate for that (performance-wise), the drive reads a block,
modifies the data, and writes it to a fresh (already erased) cell. This is
known as a read-modify-write cycle. This is where the internal remapping
comes into play.
The erasing of the old cell is deferred to the background; it is done when
the drive is idle. If you do a lot of writes, this accumulates, so the
drive needs a reservation area of discardable cells (hence the option is
usually called "discard" in filesystems).

A flash cell also has a limited lifetime in terms of how often it can be
erased and rewritten. To compensate for that, SSD firmware implements
"wear levelling": it tries to remap writes to a flash cell that has a low
erase counter. If your system informs the drive which cells no longer hold
data (the "discard" option), the drive can do a much better job at it
because it does not have to rely on the reservation pool alone. I
artificially grow this pool by partitioning only about 80% of the native
flash size, which means the pool has 20% of the whole capacity (by default
it is around 7% for most manufacturers: 100 GB vs. 128 GB ~ 20%, 120 GB
vs. 128 GB ~ 7%).

Now bcache's bucket size comes into play. Cells are much bigger than
logical sectors (usually 512k; if your drive is RAID-striped internally,
and bigger drives usually are, you get integer multiples of that, like 2M
for a 500 GB drive where 4 cells make up one block [1]; this is called an
erase block because it can only be erased/written as a whole), so you want
to avoid the read-modify-write cycle as much as possible. This is what
bcache uses its buckets for: it tries to fill and discard complete buckets
in one go, caching as much as possible in RAM before pushing it out to the
drive. If you write a complete block the size of the drive's erase block,
the drive doesn't have to read-modify first, it just writes. This is
clearly a performance benefit.

As a logical consequence, you can improve long-term performance and
lifetime by using discard and a reservation pool, and you can improve
direct performance by using an optimal bucket size in bcache. When
optimizing Ext4 for SSDs, there are similar approaches to make best use of
the erase block size.

> Shall I change my partition table for /dev/sdd2 and leave some space?

No, but maybe on sdb2, because that is your SSD, right?

> Below is some info about the SSD:
[...]

Since sdb is your SSD, the above recommendations apply to sdb, not sdd. If
you repartition, you may want to trim the whole drive first, for the
reasons mentioned above.

[1]: If your drive has high write rates, it is probably using striped
cells, because writing flash is a slow process, much slower than reading
(25% of the read speed or less).

--
Replies to list only preferred.
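To make the wear-levelling reservation described above concrete, here is a
sketch only, with made-up partition sizes for the ~107 GiB drive from
earlier in the thread: trim the whole SSD first, then partition only about
80% of it. blkdiscard erases everything on the device, so this is only for
a drive that holds no data yet.

-------------------------------------------------------
## Hypothetical over-provisioning layout for a blank ~107 GiB SSD (/dev/sdb assumed).
blkdiscard /dev/sdb                                        # trim the entire drive; destroys all data on it
sgdisk -n 1:0:+64G -t 1:8300 -c 1:poppy-root  /dev/sdb     # root partition
sgdisk -n 2:0:+21G -t 2:8300 -c 2:poppy-cache /dev/sdb     # bcache cache partition
sgdisk -p /dev/sdb                                         # ~20% of the drive is left unpartitioned (and trimmed)
-------------------------------------------------------

The unpartitioned, trimmed tail is what the firmware can draw on as extra
wear-levelling reserve, which is the point Kai makes above.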
* Re: Bcache on btrfs: the perfect setup or the perfect storm?
  2015-04-10 23:33       ` Kai Krakow
@ 2015-04-10 23:49         ` Kai Krakow
  0 siblings, 0 replies; 5+ messages in thread
From: Kai Krakow @ 2015-04-10 23:49 UTC (permalink / raw)
To: linux-bcache

Kai Krakow <hurikhan77@gmail.com> wrote:

BTW: When I wrote "cell" below, I did not mean the single cell that stores
only one, two, or three bits within a flash block. I meant a complete
block of cells making up one native block that can be erased as a whole.

[...]
> Wear-levelling means that the drive tries to distribute writes evenly
> across the flash cells. It uses an internal mapping to dynamically remap
> logical blocks to internal block addresses. Because flash cells cannot
> be overwritten in place, they have to be erased first, then written with
> the new data.
[...]

--
Replies to list only preferred.
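One closing, heavily hedged note that is not from the thread: bcache's
per-bucket discard can also be toggled at runtime through sysfs, so
--discard does not have to be a one-time decision at make-bcache time. The
path below is an assumption based on the kernel's bcache documentation and
may differ between versions.

-------------------------------------------------------
## Hypothetical runtime toggle for bcache's per-bucket discard (single cache set assumed).
cat /sys/fs/bcache/*/cache0/discard        # current setting (0/1); path is an assumption, verify on your kernel
echo 0 > /sys/fs/bcache/*/cache0/discard   # turn discard off while testing stability
-------------------------------------------------------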