From: Kai Krakow <hurikhan77+btrfs@gmail.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: [RFC 0/5] BTRFS hot relocation support
Date: Thu, 16 May 2013 09:12:53 +0200
Message-ID: <mgde6a-75n.ln1@hurikhan.ath.cx>
In-Reply-To: CAEH94LgEAcOHJKpuY+fY4cscxJ+QbynC0H9WbSGiV0fKp+-Ajw@mail.gmail.com

Hi!

I think such a solution as part of the filesystem could do much better than
something outside of it (like bcache). But I'm not sure: What makes data
hot? I think the most benefit comes from detecting random read access and
marking only that data as hot. Writes should also go to the SSD first and
then be spooled to the harddisks in the background. Bcache does a lot in
this regard.
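
To make the "random read access" idea concrete, here is a minimal sketch in
Python. It is only an illustration, not what the patchset or bcache does:
the class name, the per-file bookkeeping and the threshold of 3 are all
assumptions. A read counts as random when it does not start where the
previous read of the same file ended.

```python
from dataclasses import dataclass, field


@dataclass
class ReadTracker:
    """Counts non-contiguous reads per file (hypothetical heuristic)."""

    random_threshold: int = 3  # assumed: random reads needed before a file is "hot"
    _next_offset: dict = field(default_factory=dict)
    _random_reads: dict = field(default_factory=dict)

    def record_read(self, file_id, offset, length):
        expected = self._next_offset.get(file_id)
        # A read that does not continue where the last one ended costs a seek.
        if expected is not None and offset != expected:
            self._random_reads[file_id] = self._random_reads.get(file_id, 0) + 1
        self._next_offset[file_id] = offset + length

    def is_hot(self, file_id):
        return self._random_reads.get(file_id, 0) >= self.random_threshold


tracker = ReadTracker()
for off in (0, 4096, 8192, 12288):      # purely sequential access
    tracker.record_read("seq", off, 4096)
for off in (0, 40960, 8192, 81920):     # scattered access
    tracker.record_read("rnd", off, 4096)
print(tracker.is_hot("seq"), tracker.is_hot("rnd"))
```

A sequentially streamed file never becomes hot under this rule, no matter
how often it is read, which matches the point that harddisks handle
sequential reads well on their own.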

Since this is within the filesystem, users could even mark files as being
always "hot" with some attribute or ioctl. A boot readahead and preload
implementation could use this to automatically mark as hot the files used
during booting, or to preload files when I start an application.
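
As a rough sketch of how a preload tool might set such a mark from user
space, assuming an extended attribute were used: the name "user.hot" is
purely hypothetical, neither btrfs nor the patchset defines it. The only
real API used is Python's os.setxattr() (Linux only).

```python
import os

HOT_XATTR = "user.hot"  # hypothetical attribute name, not defined by btrfs


def mark_hot(path, setxattr=None):
    """Tag one file as always hot. `setxattr` can be swapped out for testing."""
    setter = setxattr if setxattr is not None else os.setxattr
    setter(path, HOT_XATTR, b"1")


def mark_files_hot(paths, setxattr=None):
    """Tag every file that can be tagged and return the ones that succeeded."""
    marked = []
    for path in paths:
        try:
            mark_hot(path, setxattr)
            marked.append(path)
        except OSError:
            # Missing file, or a filesystem without user xattrs: skip it.
            continue
    return marked
```

A boot readahead tool would call mark_files_hot() with the list of files it
saw being opened during boot.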

On the other hand, hot relocation should reduce writes to the SSD as much
as possible. For example: Do not defragment files during autodefrag, it
makes no sense. Also, write data in bursts of erase block size, etc.
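
A small sketch of what "bursts of erase block size" could look like: writes
are buffered and only handed on in whole multiples of the erase block. The
class name and the 512 KiB default are assumptions; the real erase block
size is device specific.

```python
class EraseBlockWriter:
    """Buffers small writes and emits them in erase-block-sized bursts."""

    def __init__(self, sink, erase_block_size=512 * 1024):
        self.sink = sink                      # callable that receives bytes
        self.erase_block_size = erase_block_size  # assumed value, device specific
        self.pending = bytearray()

    def write(self, data):
        self.pending += data
        # Largest multiple of the erase block size that is already buffered.
        full = len(self.pending) // self.erase_block_size * self.erase_block_size
        if full:
            self.sink(bytes(self.pending[:full]))
            del self.pending[:full]

    def flush(self):
        # Push out the unaligned remainder, e.g. on sync or unmount.
        if self.pending:
            self.sink(bytes(self.pending))
            self.pending.clear()


bursts = []
writer = EraseBlockWriter(bursts.append, erase_block_size=8)
writer.write(b"abc")            # too small, stays buffered
writer.write(b"defghij")        # 10 bytes buffered, one 8-byte burst goes out
writer.flush()                  # remaining 2 bytes
print([len(b) for b in bursts])
```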

Also important: What if the SSD dies from wear? Will it gracefully fall
back to the harddisk? What does "relocation" mean? Files (hot data) should
only be cached as a copy on the SSD, not moved there. It should be possible
for btrfs to just drop a failing SSD from the filesystem without data loss,
because otherwise one would have to use two SSDs in raid-1 mode to get safe
cache storage.
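
The difference between "cached as a copy" and "moved" can be shown with a
toy model. This is only a sketch under the assumption of a write-through
cache; the class and method names are made up for illustration.

```python
class CopyCache:
    """HDD is the authoritative store; the SSD only ever holds copies."""

    def __init__(self):
        self.hdd = {}
        self.ssd = {}   # becomes None once the SSD has been dropped

    def write(self, key, data):
        self.hdd[key] = data            # every write lands on the HDD
        if self.ssd is not None:
            self.ssd[key] = data        # the SSD copy is optional

    def read(self, key):
        if self.ssd is not None and key in self.ssd:
            return self.ssd[key]        # fast path
        return self.hdd[key]            # graceful fallback

    def drop_ssd(self):
        self.ssd = None                 # simulate a dying SSD


cache = CopyCache()
cache.write("bootfile", b"data")
cache.drop_ssd()
print(cache.read("bootfile"))           # still readable from the HDD
```

With data moved to the SSD, or with dirty write-back data that exists only
on the SSD, the same drop_ssd() would lose data. That is the case where a
raid-1 pair of SSDs would be needed.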

Altogether I think that a spinning media btrfs raid can outperform a single
SSD, so hot relocation should probably be used to reduce head movement,
because avoiding seeks is where an SSD really excels. So everything that
involves heavy head movement should go to the SSD first and then be written
back to the harddisk. And I think there's a lot of potential for
optimization because a COW filesystem like btrfs naturally causes a lot of
head movement.

What do you think?

BTW: I have not tried either one yet because I'm still deciding which way
to go. Your patches are more welcome because I would not need to migrate my
storage to bcache-provided block devices. OTOH the bcache implementation
looks a lot more mature (with regard to performance and safety) at this
point because it provides many of the above-mentioned features, most
importantly graceful handling of failing SSDs.

Regarding btrfs raid outperforming an SSD: During boot my spinning media
3-device btrfs raid reads boot files at up to 600 MB/s (from an LZ
compressed fs). Boot takes about 7 seconds until the display manager starts
(which takes another 30 seconds, but that's another story), and the system
is pretty crowded with services I wouldn't need if I optimized for boot
performance. But I think systemd's read-ahead implementation has a lot of
influence on this fast booting: It defragments and relocates boot files on
btrfs during boot so the harddisks can read all this stuff sequentially. I
think it also compresses boot files if compression is enabled, because
booting is IO bound, not CPU bound. Benchmarks showed that my btrfs raid
can technically read up to 450 MB/s, so I think the 600 MB/s figure refers
to decompressed data. A single SSD could not do that.

For the same reason I created a small script to defragment and compress the
files used by the preload daemon. I did not benchmark it, but it felt like
another small performance boost. So I'm eager to see what could come next
with some sort of SSD cache, because the only problem left seems to be
heavy head movement, which slows down the system.
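
The script itself is not part of this mail. A minimal sketch of the idea
could look like the following, assuming the list of files is already known
(how it is obtained from the preload daemon is left out) and assuming lzo
as the default compression. The function names are made up; the only
external command used is "btrfs filesystem defragment" with its -c option.

```python
import subprocess


def build_defrag_command(paths, algorithm="lzo"):
    """Build a `btrfs filesystem defragment` call that also compresses."""
    if algorithm not in ("zlib", "lzo"):
        raise ValueError(f"unsupported compression algorithm: {algorithm}")
    if not paths:
        raise ValueError("no files given")
    # -c<algo> asks defragment to compress the rewritten extents.
    return ["btrfs", "filesystem", "defragment", f"-c{algorithm}", *paths]


def defragment_and_compress(paths, algorithm="lzo", dry_run=True):
    """Run the command, or with dry_run=True only return what would be run."""
    cmd = build_defrag_command(paths, algorithm)
    if not dry_run:
        subprocess.run(cmd, check=True)  # needs root and a btrfs mount
    return cmd


# Hypothetical file list, for illustration only.
print(defragment_and_compress(["/usr/bin/example-app"]))
```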

Zhi Yong Wu <zwu.kernel@gmail.com> wrote:

> Hi,
> 
>    What do you think, is the design approach going in the right
> direction? Do you have any comments or a better design idea for BTRFS
> hot relocation support? Any comments are appreciated, thanks.
> 
> 
> On Mon, May 6, 2013 at 4:53 PM,  <zwu.kernel@gmail.com> wrote:
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>
>>   The patchset is sent out as an RFC mainly to see if it goes in the
>> correct development direction.
>>
>>   The patchset tries to introduce hot relocation support
>> for BTRFS. In a hybrid storage environment, when data on the
>> HDD gets hot, BTRFS hot relocation support can relocate it to
>> the SSD automatically; also, if the SSD usage ratio exceeds
>> its upper threshold, data which has gotten cold is looked up
>> and relocated to the HDD first to make more space on the SSD,
>> and then the data which has gotten hot is relocated to the SSD
>> automatically.
>>
>>   BTRFS hot relocation mainly reserves block space on the SSD
>> first, loads the hot data from the HDD into the page cache, allocates
>> block space on the SSD, and finally writes the data to the SSD.
>>
>>   If you'd like to play with it, please pull the patchset from
>> my git on github:
>>   https://github.com/wuzhy/kernel.git hot_reloc
>>
>> For how to use it, please refer to the example below:
>>
>> root@debian-i386:~# echo 0 > /sys/block/vdc/queue/rotational
>> ^^^ The above command makes /dev/vdc appear as an SSD
>> root@debian-i386:~# echo 999999 > /proc/sys/fs/hot-age-interval
>> root@debian-i386:~# echo 10 > /proc/sys/fs/hot-update-interval
>> root@debian-i386:~# echo 10 > /proc/sys/fs/hot-reloc-interval
>> root@debian-i386:~# mkfs.btrfs -d single -m single -h /dev/vdb /dev/vdc -f
>>
>> WARNING! - Btrfs v0.20-rc1-254-gb0136aa-dirty IS EXPERIMENTAL
>> WARNING! - see http://btrfs.wiki.kernel.org before using
>>
>> [ 140.279011] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 1 transid 16 /dev/vdb
>> [ 140.283650] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
>> [ 140.517089] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
>> [ 140.550759] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
>> [ 140.552473] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
>> adding device /dev/vdc id 2
>> [ 140.636215] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 2 transid 3 /dev/vdc
>> fs created label (null) on /dev/vdb
>> nodesize 4096 leafsize 4096 sectorsize 4096 size 14.65GB
>> Btrfs v0.20-rc1-254-gb0136aa-dirty
>> root@debian-i386:~# mount -o hot_move /dev/vdb /data2
>> [ 144.855471] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 6 /dev/vdb
>> [ 144.870444] btrfs: disk space caching is enabled
>> [ 144.904214] VFS: Turning on hot data tracking
>> root@debian-i386:~# dd if=/dev/zero of=/data2/test1 bs=1M count=2048
>> 2048+0 records in
>> 2048+0 records out
>> 2147483648 bytes (2.1 GB) copied, 23.4948 s, 91.4 MB/s
>> root@debian-i386:~# df -h
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/vda1 16G 13G 2.2G 86% /
>> tmpfs 4.8G 0 4.8G 0% /lib/init/rw
>> udev 10M 176K 9.9M 2% /dev
>> tmpfs 4.8G 0 4.8G 0% /dev/shm
>> /dev/vdb 15G 2.0G 13G 14% /data2
>> root@debian-i386:~# btrfs fi df /data2
>> Data: total=3.01GB, used=2.00GB
>> System: total=4.00MB, used=4.00KB
>> Metadata: total=8.00MB, used=2.19MB
>> Data_SSD: total=8.00MB, used=0.00
>> root@debian-i386:~# echo 108 > /proc/sys/fs/hot-reloc-threshold
>> ^^^ The above command will start HOT RELOCATE, because the data
>> temperature is currently 109
>> root@debian-i386:~# df -h
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/vda1 16G 13G 2.2G 86% /
>> tmpfs 4.8G 0 4.8G 0% /lib/init/rw
>> udev 10M 176K 9.9M 2% /dev
>> tmpfs 4.8G 0 4.8G 0% /dev/shm
>> /dev/vdb 15G 2.1G 13G 14% /data2
>> root@debian-i386:~# btrfs fi df /data2
>> Data: total=3.01GB, used=6.25MB
>> System: total=4.00MB, used=4.00KB
>> Metadata: total=8.00MB, used=2.26MB
>> Data_SSD: total=2.01GB, used=2.00GB
>> root@debian-i386:~#
>>
>> Zhi Yong Wu (5):
>>   vfs: add one list_head field
>>   btrfs: add one new block group
>>   btrfs: add one hot relocation kthread
>>   procfs: add three proc interfaces
>>   btrfs: add hot relocation support
>>
>>  fs/btrfs/Makefile            |   3 +-
>>  fs/btrfs/ctree.h             |  26 +-
>>  fs/btrfs/extent-tree.c       | 107 +++++-
>>  fs/btrfs/extent_io.c         |  31 +-
>>  fs/btrfs/extent_io.h         |   4 +
>>  fs/btrfs/file.c              |  36 +-
>>  fs/btrfs/hot_relocate.c      | 802 +++++++++++++++++++++++++++++++++++++++++++
>>  fs/btrfs/hot_relocate.h      |  48 +++
>>  fs/btrfs/inode-map.c         |  13 +-
>>  fs/btrfs/inode.c             |  92 ++++-
>>  fs/btrfs/ioctl.c             |  23 +-
>>  fs/btrfs/relocation.c        |  14 +-
>>  fs/btrfs/super.c             |  30 +-
>>  fs/btrfs/volumes.c           |  28 +-
>>  fs/hot_tracking.c            |   1 +
>>  include/linux/btrfs.h        |   4 +
>>  include/linux/hot_tracking.h |   1 +
>>  kernel/sysctl.c              |  22 ++
>>  18 files changed, 1234 insertions(+), 51 deletions(-)
>>  create mode 100644 fs/btrfs/hot_relocate.c
>>  create mode 100644 fs/btrfs/hot_relocate.h
>>
>> --
>> 1.7.11.7
>>
> 
> 
> 
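
The relocation flow described in the quoted cover letter (first push cold
data back to the HDD when the SSD is too full, then pull hot data onto the
SSD) can be sketched as a small planning function. This is not the kernel
code from the patchset: the function, its parameters and the "hottest
first" ordering are assumptions. The only detail taken from the example is
that data counts as hot when its temperature is above the threshold (109
with a threshold of 108).

```python
def plan_relocation(temps, on_ssd, sizes, ssd_capacity,
                    hot_threshold, ssd_ratio_limit):
    """Return (demote, promote) lists of file names.

    temps:  {name: temperature}     sizes: {name: size}
    on_ssd: set of names currently stored on the SSD
    """
    used = sum(sizes[name] for name in on_ssd)
    limit = ssd_capacity * ssd_ratio_limit

    # Step 1: SSD over its usage limit, so move cold data back, coldest first.
    demote = []
    for name in sorted(on_ssd, key=lambda n: temps[n]):
        if used <= limit or temps[name] > hot_threshold:
            break
        demote.append(name)
        used -= sizes[name]

    # Step 2: move hot data to the SSD, hottest first, while it still fits.
    promote = []
    for name in sorted(temps, key=lambda n: temps[n], reverse=True):
        if name in on_ssd or temps[name] <= hot_threshold:
            continue
        if used + sizes[name] <= limit:
            promote.append(name)
            used += sizes[name]
    return demote, promote


# Mirrors the quoted example: temperature 109, threshold 108, empty SSD.
print(plan_relocation({"test1": 109}, set(), {"test1": 2},
                      ssd_capacity=8, hot_threshold=108, ssd_ratio_limit=0.5))
```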


