* Bcache on btrfs: the perfect setup or the perfect storm?
@ 2015-04-08 6:59 arnaud gaboury
2015-04-08 19:02 ` Kai Krakow
[not found] ` <55257b31.0308c20a.75a0.7849@mx.google.com>
0 siblings, 2 replies; 5+ messages in thread
From: arnaud gaboury @ 2015-04-08 6:59 UTC (permalink / raw)
To: linux-bcache
I plan to set up bcache with an SSD as the caching device and an HD as the
backing device, with btrfs on top. The box will be a server, but not a
production one.
I know this scheme is not recommended and can be a cause of filesystem
corruption. But as I like living on the edge, I want to give it a try,
with a tight backup policy.

Any advice? Any settings worth tweaking to avoid future drama?
Thank you
--
google.com/+arnaudgabourygabx
* Re: Bcache on btrfs: the perfect setup or the perfect storm?
  2015-04-08  6:59 Bcache on btrfs: the perfect setup or the perfect storm? arnaud gaboury
@ 2015-04-08 19:02 ` Kai Krakow
  [not found]   ` <55257b31.0308c20a.75a0.7849@mx.google.com>
  1 sibling, 0 replies; 5+ messages in thread
From: Kai Krakow @ 2015-04-08 19:02 UTC (permalink / raw)
To: linux-bcache

arnaud gaboury <arnaud.gaboury@gmail.com> wrote:

> I plan to set up bcache with an SSD as the caching device and an HD as the
> backing device, with btrfs on top. The box will be a server, but not a
> production one.
>
> I know this scheme is not recommended and can be a cause of filesystem
> corruption. But as I like living on the edge, I want to give it a try,
> with a tight backup policy.
>
> Any advice? Any settings worth tweaking to avoid future drama?

See my recommendations about partitioning in the post I just made here:
http://permalink.gmane.org/gmane.linux.kernel.bcache.devel/2865

Take note of the warnings about trimming/enabling discard. If you want to
use it, at least test it before trusting your kitty to it. ;-)

BTW: I'm planning a similar system, though the plans may well be 1 or 2
years in the future; that system is meant to be based on bcache/btrfs and
should become a production container-VM host. We'll see. The fallback plan
is a battery-backed RAID controller with CacheCade (SSDs as volume cache).

--
Replies to list only preferred.
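Not part of the original thread, but a minimal sketch of how to act on the
trimming advice above, assuming the SSD is /dev/sdb as later messages
suggest: check whether the drive advertises TRIM at all before enabling
--discard anywhere.

-------------------------------------------------------
## Hypothetical pre-check, run as root; /dev/sdb assumed to be the SSD.
hdparm -I /dev/sdb | grep -i trim   # should list "Data Set Management TRIM supported"
lsblk --discard /dev/sdb            # non-zero DISC-GRAN/DISC-MAX means discards can be issued
-------------------------------------------------------

If either check comes back empty, enabling discard in make-bcache or in the
filesystem gains nothing and is best left off.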
[parent not found: <55257b31.0308c20a.75a0.7849@mx.google.com>]
* Re: Bcache on btrfs: the perfect setup or the perfect storm?
  [not found]   ` <55257b31.0308c20a.75a0.7849@mx.google.com>
@ 2015-04-10 14:42     ` arnaud gaboury
  2015-04-10 23:33       ` Kai Krakow
  0 siblings, 1 reply; 5+ messages in thread
From: arnaud gaboury @ 2015-04-10 14:42 UTC (permalink / raw)
To: Kai Krakow; +Cc: linux-bcache

On Wed, Apr 8, 2015 at 9:02 PM, Kai Krakow <kai@kaishome.de> wrote:
> arnaud gaboury <arnaud.gaboury@gmail.com> wrote:
>
>> I plan to set up bcache with an SSD as the caching device and an HD as
>> the backing device, with btrfs on top. The box will be a server, but not
>> a production one.
>>
>> I know this scheme is not recommended and can be a cause of filesystem
>> corruption. But as I like living on the edge, I want to give it a try,
>> with a tight backup policy.
>>
>> Any advice? Any settings worth tweaking to avoid future drama?
>
> See my recommendations about partitioning in the post I just made here:
> http://permalink.gmane.org/gmane.linux.kernel.bcache.devel/2865
>
> Take note of the warnings about trimming/enabling discard. If you want to
> use it, at least test it before trusting your kitty to it. ;-)
>
> BTW: I'm planning a similar system, though the plans may well be 1 or 2
> years in the future; that system is meant to be based on bcache/btrfs and
> should become a production container-VM host. We'll see. The fallback plan
> is a battery-backed RAID controller with CacheCade (SSDs as volume cache).
>
> --
> Replies to list only preferred.

Thank you.
Here is what I did to prevent any drama:

The caching device, an SSD, with GPT partitions:

-------------------------------------------------------
# gdisk /dev/sdd
Command: p
Disk /dev/sdb: 224674128 sectors, 107.1 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): EAAC52BC-8236-483F-9875-744AF7031E72
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 224674094
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048       167774207    80.0 GiB   8300  poppy-root
   2       167774208       224674094    27.1 GiB   8300  poppy-cache
-------------------------------------------------------

Then:

# make-bcache --wipe-bcache --block 4k --bucket 2M -B /dev/sdd4 -C /dev/sdb2 --discard

Now, referring to your post, I have no idea about:
"Fourth - wear-levelling reservation."

Shall I change my partition table for /dev/sdd2 and leave some space?
Below is some info about the SSD:

-----------------------------------------------------------------------------------------
# hdparm -I /dev/sdb

/dev/sdb:

ATA device, with non-removable media
        Model Number:       OCZ-VERTEX2 3.5
        Serial Number:      OCZ-52181PZL67XEETUZ
        Firmware Revision:  1.33
        Transport:          Serial
Standards:
        Used: unknown (minor revision code 0x0028)
        Supported: 8 7 6 5
        Likely used: 8
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  224674128
        LBA48  user addressable sectors:  224674128
        Logical  Sector size:                   512 bytes
        Physical Sector size:                   512 bytes
        Logical Sector-0 offset:                  0 bytes
        device size with M = 1024*1024:      109704 MBytes
        device size with M = 1000*1000:      115033 MBytes (115 GB)
        cache/buffer size  = unknown
        Nominal Media Rotation Rate: Solid State Device
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, no device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
-----------------------------------------------------------------------------------------

--
google.com/+arnaudgabourygabx
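Again not from the original exchange: a sketch of how the result of the
make-bcache run above could be sanity-checked, assuming bcache-tools is
installed and the backing device comes up as bcache0 (both assumptions;
adjust to the actual layout).

-------------------------------------------------------
## Hypothetical verification steps, run as root.
bcache-super-show /dev/sdb2             # cache superblock: bucket size, cset.uuid
bcache-super-show /dev/sdd4             # backing device superblock
cat /sys/block/bcache0/bcache/state     # "clean"/"dirty" once attached, "no cache" otherwise
cat /sys/block/bcache0/bcache/cache_mode   # writethrough is the default; switch to writeback only after testing
-------------------------------------------------------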
* Re: Bcache on btrfs: the perfect setup or the perfect storm?
  2015-04-10 14:42     ` arnaud gaboury
@ 2015-04-10 23:33       ` Kai Krakow
  2015-04-10 23:49         ` Kai Krakow
  0 siblings, 1 reply; 5+ messages in thread
From: Kai Krakow @ 2015-04-10 23:33 UTC (permalink / raw)
To: linux-bcache

arnaud gaboury <arnaud.gaboury@gmail.com> wrote:

> On Wed, Apr 8, 2015 at 9:02 PM, Kai Krakow <kai@kaishome.de> wrote:
> [...]
>
> Thank you.
> Here is what I did to prevent any drama:

Because of the bugs mentioned here, to prevent drama you'd better not use
discard, or at least do your tests with appropriate backups in place; and
run your performance tests as well.

> The caching device, an SSD, with GPT partitions:
>
> -------------------------------------------------------
> # gdisk /dev/sdd
> Command: p
> Disk /dev/sdb: 224674128 sectors, 107.1 GiB
> Logical sector size: 512 bytes
> Disk identifier (GUID): EAAC52BC-8236-483F-9875-744AF7031E72
> Partition table holds up to 128 entries
> First usable sector is 34, last usable sector is 224674094
> Partitions will be aligned on 2048-sector boundaries
> Total free space is 2014 sectors (1007.0 KiB)
>
> Number  Start (sector)    End (sector)  Size       Code  Name
>    1            2048       167774207    80.0 GiB   8300  poppy-root
>    2       167774208       224674094    27.1 GiB   8300  poppy-cache

This matches my setup (though I used another layout). My drive reports the
full 128 GB instead of only 100 GB, so I partitioned only 100 GB and
trimmed the whole drive before partitioning.

> Then:
>
> # make-bcache --wipe-bcache --block 4k --bucket 2M -B /dev/sdd4 -C
> /dev/sdb2 --discard

Take your safety measures with "discard"; see other threads/posts and
above.

> Now, referring to your post, I have no idea about:
> "Fourth - wear-levelling reservation."

Wear-levelling means that the drive tries to distribute writes evenly
across the flash cells. It uses an internal mapping to dynamically remap
logical blocks to internal block addresses. Because flash cells cannot be
overwritten in place, they have to be erased first, then written with the
new data. The erasing itself is slow. This also implies that to modify a
single logical sector of a cell, the drive has to read, modify, erase, and
then write a whole cell. This is clearly even slower; it is what is known
as "write amplification".

To compensate for that (performance-wise), the drive reads a block,
modifies the data, and writes it to a fresh (already erased) cell. This is
known as a read-modify-write cycle. This is where the internal remapping
comes into play.
The erasing of the old cell is deferred to the background; it is done when
the drive is idle. If you do a lot of writes, this accumulates, so the
drive needs a reservation area of discardable cells (hence the option is
usually called "discard" in filesystems).

A flash cell also has a limited lifetime in terms of how often it can be
erased and rewritten. To compensate for that, SSD firmware implements
"wear levelling": it tries to remap writes to a flash cell that has a low
erase counter. If your system informs the drive which cells no longer hold
data (the "discard" option), the drive can do a much better job at it
because it does not have to rely on the reservation pool alone. I
artificially grow this pool by partitioning only about 80% of the native
flash size, which means the pool has 20% of the whole capacity (by default
it is around 7% for most manufacturers: 100 GB vs. 128 GB ~ 20%, 120 GB
vs. 128 GB ~ 7%).

Now bcache's bucket size comes into play. Cells are much bigger than
logical sectors (usually 512k; if your drive is RAID-striped internally,
and bigger drives usually are, you get integer multiples of that, like 2M
for a 500 GB drive where 4 cells make up one block [1]; this is called an
erase block because it can only be erased/written as a whole), so you want
to avoid the read-modify-write cycle as much as possible. This is what
bcache uses its buckets for: it tries to fill and discard complete buckets
in one go, caching as much as possible in RAM before pushing it out to the
drive. If you write a complete block the size of the drive's erase block,
the drive doesn't have to read-modify first, it just writes. This is
clearly a performance benefit.

As a logical consequence, you can improve long-term performance and
lifetime by using discard and a reservation pool, and you can improve
direct performance by using an optimal bucket size in bcache. When
optimizing Ext4 for SSDs, there are similar approaches to make best use of
the erase block size.

> Shall I change my partition table for /dev/sdd2 and leave some space?

No, but maybe on sdb2, because that is your SSD, right?

> Below is some info about the SSD:
[...]

Since sdb is your SSD, the above recommendations apply to sdb, not sdd. If
you repartition, you may want to trim the whole drive first, for the
reasons mentioned above.

[1]: If your drive has high write rates, it is probably using striped
cells, because writing flash is a slow process, much slower than reading
(25% of the read speed or less).

--
Replies to list only preferred.
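To make the wear-levelling reservation described above concrete, here is a
sketch only, with made-up partition sizes for the ~107 GiB drive from
earlier in the thread: trim the whole SSD first, then partition only about
80% of it. blkdiscard erases everything on the device, so this is only for
a drive that holds no data yet.

-------------------------------------------------------
## Hypothetical over-provisioning layout for a blank ~107 GiB SSD (/dev/sdb assumed).
blkdiscard /dev/sdb                                        # trim the entire drive; destroys all data on it
sgdisk -n 1:0:+64G -t 1:8300 -c 1:poppy-root  /dev/sdb     # root partition
sgdisk -n 2:0:+21G -t 2:8300 -c 2:poppy-cache /dev/sdb     # bcache cache partition
sgdisk -p /dev/sdb                                         # ~20% of the drive is left unpartitioned (and trimmed)
-------------------------------------------------------

The unpartitioned, trimmed tail is what the firmware can draw on as extra
wear-levelling reserve, which is the point Kai makes above.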
* Re: Bcache on btrfs: the perfect setup or the perfect storm?
  2015-04-10 23:33       ` Kai Krakow
@ 2015-04-10 23:49         ` Kai Krakow
  0 siblings, 0 replies; 5+ messages in thread
From: Kai Krakow @ 2015-04-10 23:49 UTC (permalink / raw)
To: linux-bcache

Kai Krakow <hurikhan77@gmail.com> wrote:

BTW: When I wrote "cell" below, I did not mean the single cell that stores
only one, two, or three bits within a flash block. I meant a complete
block of cells making up one native block that can be erased as a whole.

[...]
> Wear-levelling means that the drive tries to distribute writes evenly
> across the flash cells. It uses an internal mapping to dynamically remap
> logical blocks to internal block addresses. Because flash cells cannot
> be overwritten in place, they have to be erased first, then written with
> the new data.
[...]

--
Replies to list only preferred.
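One closing, heavily hedged note that is not from the thread: bcache's
per-bucket discard can also be toggled at runtime through sysfs, so
--discard does not have to be a one-time decision at make-bcache time. The
path below is an assumption based on the kernel's bcache documentation and
may differ between versions.

-------------------------------------------------------
## Hypothetical runtime toggle for bcache's per-bucket discard (single cache set assumed).
cat /sys/fs/bcache/*/cache0/discard        # current setting (0/1); path is an assumption, verify on your kernel
echo 0 > /sys/fs/bcache/*/cache0/discard   # turn discard off while testing stability
-------------------------------------------------------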