From: Kai Krakow <hurikhan77@gmail.com>
To: linux-bcache@vger.kernel.org
Subject: Re: Bcache on btrfs: the perfect setup or the perfect storm?
Date: Sat, 11 Apr 2015 01:49:06 +0200	[thread overview]
Message-ID: <i44mvb-283.ln1@hurikhan77.spdns.de> (raw)
In-Reply-To: e63mvb-p62.ln1@hurikhan77.spdns.de

Kai Krakow <hurikhan77@gmail.com> wrote:

BTW: When I wrote "cell" below, I wasn't referring to the single cell which 
stores only one, two, or three bits within a flash block. I meant a complete 
block of cells making up one native block that can be erased as a whole.

> arnaud gaboury <arnaud.gaboury@gmail.com> wrote:
> 
>> On Wed, Apr 8, 2015 at 9:02 PM, Kai Krakow <kai@kaishome.de> wrote:
>>> arnaud gaboury <arnaud.gaboury@gmail.com> wrote:
>>>
>>>> I plan to set up Bcache on a Btrfs SSD/HD caching/backing device. The
>>>> box will be a server, but not a production one.
>>>>
>>>> I know this scheme is not recommended and can be a cause of filesystem
>>>> corruption. But as I like living on the edge, I want to give a try,
>>>> with a tight backup policy.
>>>>
>>>> Any advice? Any settings that would be worth tweaking to avoid future drama?
>>>
>>> See my recommendations about partitioning here I just posted:
>>> http://permalink.gmane.org/gmane.linux.kernel.bcache.devel/2865
>>>
>>> Take care of the warnings about trimming/enabling discard. If you want to
>>> use it, I'd at least test it before trusting your kitty to it. ;-)
>>>
>>> BTW: I'm planning a similar system, though the plans may well be 1 or 2
>>> years in the future, and the system is planned to be based on
>>> bcache/btrfs. It should become a production container-VM host. We'll
>>> see. The fallback plan is to use a battery-backed RAID controller with
>>> CacheCade (SSDs as volume cache).
>> 
>> Thank you.
>> Here is what I did to prevent any drama:
> 
> Due to the bugs mentioned here, to prevent drama, better don't use discard,
> or at least do your tests with appropriate backups in place, and do your
> performance tests as well.
> 
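As a sketch of what I mean by testing (untested here, the device and mount
point are only examples): lsblk and hdparm should tell you whether TRIM is
supported at all, and fstrim is a quick functional test once a filesystem is
mounted on top:

  # lsblk --discard /dev/sdb
  # hdparm -I /dev/sdb | grep -i trim
  # fstrim -v /mnt/somewhere

If fstrim reports trimmed bytes without errors, discard at least passes
through the stack.
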
>> the caching device, a SSD:
>> - GPT partitions.
>> 
>> -------------------------------------------------------
>> # gdisk /dev/sdd
>> Command: p
>> Disk /dev/sdb: 224674128 sectors, 107.1 GiB
>> Logical sector size: 512 bytes
>> Disk identifier (GUID): EAAC52BC-8236-483F-9875-744AF7031E72
>> Partition table holds up to 128 entries
>> First usable sector is 34, last usable sector is 224674094
>> Partitions will be aligned on 2048-sector boundaries
>> Total free space is 2014 sectors (1007.0 KiB)
>> 
>> Number  Start (sector)    End (sector)  Size       Code  Name
>>    1            2048       167774207   80.0 GiB    8300  poppy-root
>>    2       167774208       224674094   27.1 GiB    8300  poppy-cache
> 
> This matches my setup (though I used another layout). My drive reports the
> full 128GB instead of only 100GB. So I partitioned only 100GB and trimmed
> the whole drive before partitioning.
> 
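For reference, a minimal sketch of that "trim the whole drive first" step,
assuming the SSD is /dev/sdb and carries nothing you want to keep (blkdiscard
wipes the entire device):

  # blkdiscard /dev/sdb
  # gdisk /dev/sdb

and then allocate only ~80% of the capacity in gdisk.
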
>> Then :
>> 
>> # make-bcache --wipe-bcache --block 4k --bucket 2M -B /dev/sdd4 -C
>> /dev/sdb2 --discard
> 
> Take your safety measures with "discard"; see other threads/posts and
> above.
> 
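If you'd rather not format with --discard right away, bcache also exposes a
per-cache discard switch in sysfs (at least on the kernels I've looked at),
so you can flip it on later once you trust it. <cache-set-uuid> below is just
a placeholder for your cache set's UUID:

  # echo 1 > /sys/fs/bcache/<cache-set-uuid>/cache0/discard
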
>> Now, when referring to your post, I have no idea about:
>>  Fourth - wear-levelling reservation.
> 
> Wear-levelling means that the drive tries to distribute writes evenly across
> the flash cells. It uses an internal mapping to dynamically remap logical
> blocks to internal block addresses. Because flash cells cannot simply be
> overwritten, they have to be erased first, then written with the new data.
> The erasing itself is slow. This also implies that to modify a single
> logical sector of a cell, the drive has to read, modify, erase, then write
> the whole cell, which is clearly even slower. This is more or less the same
> effect as what is known as "write amplification".
> 
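To put a number on the worst case: rewriting a single 4 KiB logical sector
inside a 2 MiB erase block makes the drive shuffle the full 2 MiB internally,
i.e. a write amplification of 2 MiB / 4 KiB = 512 for that write.
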
> To compensate for that (performance-wise), the drive reads a block, modifies
> the data, and writes it to a fresh (already erased) cell. This is known as a
> read-modify-write cycle. This is where the internal remapping comes into
> play: the erasing of the old cell is deferred to the background and done
> when the drive is idle. If you do a lot of writes, this accumulates. The
> drive needs a reservation area of discardable cells (thus, the option is
> usually called "discard" in filesystems).
> 
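What the drive advertises for its discard behaviour can be peeked at in sysfs
(sdb as an example; SSDs don't have to report their real erase block size
here, so take the values with a grain of salt):

  # cat /sys/block/sdb/queue/discard_granularity
  # cat /sys/block/sdb/queue/discard_max_bytes
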
> A flash cell also has a limited lifetime in terms of how often it can be
> erased and rewritten. To compensate for that, SSD firmwares implement "wear
> levelling": the drive tries to remap writes to flash cells with a low erase
> counter. If your system informs the drive which cells no longer hold data
> (the "discard" option), the drive can do a much better job at this because
> it does not have to rely on the reservation pool alone. I artificially grow
> this pool by partitioning only about 80% of the native flash size, which
> means the pool has 20% of the whole capacity (by default, most manufacturers
> reserve around 7%). (100GB vs. 128GB ~ 20%, 120GB vs. 128GB ~ 7%)
> 
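Spelled out: partitioning 100GB of 128GB leaves 28GB spare, roughly 22%; a
120GB consumer drive on the same 128GB of flash keeps about 8GB, roughly 6%.
So the ~20% and ~7% above are the right ballpark.
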
> Now, bcache's bucket size comes into play. Cells are much bigger than
> logical sectors (usually 512k; if your drive is RAID-striped internally, as
> bigger drives usually are, you get integer multiples of that, like 2M for a
> 500GB drive where 4 cells make up one block [1] - this is called an erase
> block because it can only be erased/written all or nothing), so you want to
> avoid the read-modify-write cycle as much as possible. This is what bcache
> uses its buckets for: it tries to fill and discard complete buckets in one
> go, caching as much as possible in RAM first before pushing it out to the
> drive. If you write a complete block the size of the drive's erase block, it
> doesn't have to read-modify first, it just writes. This is clearly a
> performance benefit.
> 
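If you want to double-check what bucket and block size a cache device was
actually formatted with, bcache-super-show from bcache-tools should print the
superblock fields (exact field names may vary between versions):

  # bcache-super-show /dev/sdb2
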
> As a logical consequence, you can improve long-term performance and
> lifetime by using discard and a reservation pool, and you can improve
> direct performance by using an optimal bucket size in bcache. When
> optimizing Ext4 for SSDs, there are similar approaches to make best use of
> the erase block size.
> 
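For Ext4 that usually means telling mkfs about the erase block geometry via
the RAID-style stride option, e.g. with 4 KiB filesystem blocks and a 2 MiB
erase block (512 * 4 KiB = 2 MiB). Just a sketch - /dev/sdX is a placeholder,
and check mke2fs(8) for the related stripe-width option and exact spellings:

  # mkfs.ext4 -b 4096 -E stride=512 /dev/sdX
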
> 
>> Shall I change my partition table for /dev/sdd2 and leave some space?
> 
> No, but maybe on sdb2 because that is your SSD, right?
> 
>> Below are some infos about the SSD:
> [...]
> 
> Since sdb is your SSD, the above recommendations apply to sdb, not sdd. If
> you repartition, you may want to trim your whole drive first for the
> reasons mentioned above.
> 
> 
> [1]: If your drive has high write rates, it is probably using striped cells
> because writing flash is a slow process, much slower than reading (25% or
> less).
> 
-- 
Replies to list only preferred.

