Re: Major HDD performance degradation on btrfs receive

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: Major HDD performance degradation on btrfs receive
@ 2016-02-22 19:58 Nazar Mokrynskyi
  2016-02-22 23:30 ` Duncan
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Nazar Mokrynskyi @ 2016-02-22 19:58 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 5933 bytes --]

> On Tue, Feb 16, 2016 at 5:44 AM, Nazar Mokrynskyi <na...@mokrynskyi.com> wrote:
> > I have 2 SSD with BTRFS filesystem (RAID) on them and several subvolumes.
> > Each 15 minutes I'm creating read-only snapshot of subvolumes /root, /home
> > and /web inside /backup.
> > After this I'm searching for last common subvolume on /backup_hdd, sending
> > difference between latest common snapshot and simply latest snapshot to
> > /backup_hdd.
> > On top of all above there is snapshots rotation, so that /backup contains
> > much less snapshots than /backup_hdd.
> >
> > I'm using this setup for last 7 months or so and this is luckily the longest
> > period when I had no problems with BTRFS at all.
> > However, last 2+ months btrfs receive command loads HDD so much that I can't
> > even get list of directories in it.
> > This happens even if diff between snapshots is really small.
> > HDD contains 2 filesystems - mentioned BTRFS and ext4 for other files, so I
> > can't even play mp3 file from ext4 filesystem while btrfs receive is
> > running.
> > Since I'm running everything each 15 minutes this is a real headache.
> >
> > My guess is that performance hit might be caused by filesystem fragmentation
> > even though there is more than enough empty space. But I'm not sure how to
> > properly check this and can't, obviously, run defragmentation on read-only
> > subvolumes.
> >
> > I'll be thankful for anything that might help to identify and resolve this
> > issue.
> >
> > ~> uname -a
> > Linux nazar-pc 4.5.0-rc4-haswell #1 SMP Tue Feb 16 02:09:13 CET 2016 x86_64
> > x86_64 x86_64 GNU/Linux
> >
> > ~> btrfs --version
> > btrfs-progs v4.4
> >
> > ~> sudo btrfs fi show
> > Label: none  uuid: 5170aca4-061a-4c6c-ab00-bd7fc8ae6030
> >     Total devices 2 FS bytes used 71.00GiB
> >     devid    1 size 111.30GiB used 111.30GiB path /dev/sdb2
> >     devid    2 size 111.30GiB used 111.29GiB path /dev/sdc2
> >
> > Label: 'Backup'  uuid: 40b8240a-a0a2-4034-ae55-f8558c0343a8
> >     Total devices 1 FS bytes used 252.54GiB
> >     devid    1 size 800.00GiB used 266.08GiB path /dev/sda1
> >
> > ~> sudo btrfs fi df /
> > Data, RAID0: total=214.56GiB, used=69.10GiB
> > System, RAID1: total=8.00MiB, used=16.00KiB
> > System, single: total=4.00MiB, used=0.00B
> > Metadata, RAID1: total=4.00GiB, used=1.87GiB
> > Metadata, single: total=8.00MiB, used=0.00B
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> > ~> sudo btrfs fi df /backup_hdd
> > Data, single: total=245.01GiB, used=243.61GiB
> > System, DUP: total=32.00MiB, used=48.00KiB
> > System, single: total=4.00MiB, used=0.00B
> > Metadata, DUP: total=10.50GiB, used=8.93GiB
> > Metadata, single: total=8.00MiB, used=0.00B
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> > Relevant mount options:
> > UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    / btrfs
> > compress=lzo,noatime,relatime,ssd,subvol=/root    0 1
> > UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    /home btrfs
> > compress=lzo,noatime,relatime,ssd,subvol=/home 0    1
> > UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    /backup btrfs
> > compress=lzo,noatime,relatime,ssd,subvol=/backup 0    1
> > UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    /web btrfs
> > compress=lzo,noatime,relatime,ssd,subvol=/web 0    1
> > UUID=40b8240a-a0a2-4034-ae55-f8558c0343a8    /backup_hdd btrfs
> > compress=lzo,noatime,relatime,noexec 0    1
> As already indicated by Duncan, the amount of snapshots might be just
> too much. The fragmentation on the HDD might have become very high. If
> there is limited amount of RAM in the system (so limited caching), too
> much time is lost in seeks. In addition:
>
>   compress=lzo
> this also increases the chance of scattering fragments and fragmentation.
>
>   noatime,relatime
> I am not sure why you have this. Hopefully you have the actual mount
> listed as   noatime
>
> You could use the principles of the tool/package called  snapper  to
> do a sort of non-linear snapshot thinning: further back in time you
> will have a much higher granularity of snapshot over a certain
> timeframe.
>
> You could use skinny metadata (recreate the fs with newer tools or use
> btrfstune -x on /dev/sda1). I think at the moment this flag is not
> enabled on /dev/sda1
>
> If you put just 1 btrfs fs on the hdd (so move all the content from
> the ext4 fs in the the btrfs fs) you might get better overall
> performance. I assume the ext4 fs is on the second (slower part) of
> the HDD and that is a disadvantage I think.
> But you probably have reasons for why the setup is like it is.
I've replied to Duncan's message about number of snapshots, there is 
snapshots rotation and number of snapshots it is quite small, 491 in total.

About memory - 16 GiB of RAM should be enough I guess:) Can I measure 
somehow if seeking is a problem?

What is wrong with noatime,relatime? I'm using them for a long time as 
good compromise in terms of performance.

I'll try btrfstune -x and let you know whether it changes anything.

About ext4 - it is actually because I did have some serious problems 
with BTRFS in past 2.5 years or so (however, first time I've recovered 
files by building manually git version of btrfs-tools and last time I 
had not very up to date, but backup of everything in other place so I 
didn't miss too much unrecoverable data), so for a while I'd like to 
store some files separately on filesystem that is extremely difficult to 
break. Its content not critical in terms of performance, files do not 
compress well and I do not really need any other extended features on 
that partition - so ext4 will be there for a while.

-- 
Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Skype: nazar-pc
Diaspora: nazarpc@diaspora.mokrynskyi.com
Tox: A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249



[-- Attachment #2: ÐÑÑÐ¿ÑÐ¾Ð³ÑÐ°ÑÑÑÐ½Ð¸Ð¹ Ð¿ÑÐ´Ð¿Ð¸Ñ S/MIME --]
[-- Type: application/pkcs7-signature, Size: 3825 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-22 19:58 Major HDD performance degradation on btrfs receive Nazar Mokrynskyi
@ 2016-02-22 23:30 ` Duncan
  2016-02-23 17:26   ` Marc MERLIN
  2016-02-23 16:55 ` Nazar Mokrynskyi
  2016-02-23 17:44 ` Nazar Mokrynskyi
  2 siblings, 1 reply; 34+ messages in thread
From: Duncan @ 2016-02-22 23:30 UTC (permalink / raw)
  To: linux-btrfs

Nazar Mokrynskyi posted on Mon, 22 Feb 2016 20:58:45 +0100 as excerpted:

> What is wrong with noatime,relatime? I'm using them for a long time as
> good compromise in terms of performance.

The one option ends up canceling the other, as they're both atime related 
options that say do different things.

I'd have to actually setup a test or do some research to be sure which 
one overrides the other (but someone here probably can say without 
further research), tho I'd /guess/ the latter one overrides the earlier 
one, which would effectively make them both pretty much useless, since 
relatime is the normal kernel default and thus doesn't need to be 
specified.

Noatime is strongly recommended for btrfs, however, particularly with 
snapshots, as otherwise, the changes between snapshots can consist mostly 
of generally useless atime changes.

(FWIW, after over a decade of using noatime here (I first used it on the 
then new reiserfs, after finding a recommendation for it on that), I got 
tired of specifying the option on nearly all my fstab entries, and now 
days carry a local kernel patch that changes the default to noatime, 
allowing me to drop specifying it everywhere.  I don't claim to be a 
coder, let alone a kernel level coder, but as a gentooer used to building 
from source for over a decade, I've found that I can often find the code 
behind some behavior I'd like to tweak, and given good enough comments, I 
can often create trivial patches to accomplish that tweak, even if it's 
not exactly the code a real C coder would choose to use, which is exactly 
what I've done here.  So now, unless some other atime option is 
specified, my filesystems are all mounted noatime.  =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-22 23:30 ` Duncan
@ 2016-02-23 17:26   ` Marc MERLIN
  2016-02-23 17:34     ` Marc MERLIN
  2016-02-24 10:01     ` Patrik Lundquist
  0 siblings, 2 replies; 34+ messages in thread
From: Marc MERLIN @ 2016-02-23 17:26 UTC (permalink / raw)
  To: Duncan, Nazar Mokrynskyi, Alexander Fougner; +Cc: linux-btrfs

Well, since we're on the topic, my backup server btrfs FS has become so
slow that it hangs my system a few seconds here and there and causes
some of my cron jobs to fail.

I'm going to re-create it for a 3 time (in 3 years), adding bcache this
time, but clearly there is a good chance that this filesystem is indeed
going to crap performance wise because all it does is receive btrfs
receive and rsync backups, with snapshot rotations (daily snapshots, and
they expire after a couple of weeks).

I'm currently doing a very slow defrag to see if it'll help (looks like
it's going to take days).
I'm doing this:
for i in dir1 dir2 debian32 debian64 ubuntu dir4 ; do echo $i; time btrfs fi defragment -v -r $i; done

But, just to be clear, is there a way I missed to see how fragmented my
filesystem is without running filefrag on millions of files and parsing
the output?

Label: 'dshelf2'  uuid: d4a51178-c1e6-4219-95ab-5c5864695bfd
        Total devices 1 FS bytes used 4.25TiB
        devid    1 size 7.28TiB used 4.44TiB path /dev/mapper/dshelf2

btrfs fi df /mnt/btrfs_pool2/
Data, single: total=4.29TiB, used=4.18TiB
System, DUP: total=64.00MiB, used=512.00KiB
Metadata, DUP: total=77.50GiB, used=73.31GiB
GlobalReserve, single: total=512.00MiB, used=31.22MiB

Currently, it's btrfs on top of dmcrpyt on top of swraid5

Since I'm about to recreate this after a very slow backup/restore
process, if you have suggestions on how I could better build this
(outside of using a 4.4 kernel this time), they would be appreciated.

Also, should I try running defragment -r from cron from time to time?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-23 17:26   ` Marc MERLIN
@ 2016-02-23 17:34     ` Marc MERLIN
  2016-02-23 18:01       ` Lionel Bouton
  2016-02-24 10:01     ` Patrik Lundquist
  1 sibling, 1 reply; 34+ messages in thread
From: Marc MERLIN @ 2016-02-23 17:34 UTC (permalink / raw)
  To: Duncan, Nazar Mokrynskyi, Alexander Fougner; +Cc: linux-btrfs

On Tue, Feb 23, 2016 at 09:26:35AM -0800, Marc MERLIN wrote:
> Label: 'dshelf2'  uuid: d4a51178-c1e6-4219-95ab-5c5864695bfd
>         Total devices 1 FS bytes used 4.25TiB
>         devid    1 size 7.28TiB used 4.44TiB path /dev/mapper/dshelf2
> 
> btrfs fi df /mnt/btrfs_pool2/
> Data, single: total=4.29TiB, used=4.18TiB
> System, DUP: total=64.00MiB, used=512.00KiB
> Metadata, DUP: total=77.50GiB, used=73.31GiB
> GlobalReserve, single: total=512.00MiB, used=31.22MiB
> 
> Currently, it's btrfs on top of dmcrpyt on top of swraid5

Sorry, I forgot to give the mount options:
/dev/mapper/dshelf2 on /mnt/dshelf2/backup type btrfs (rw,noatime,compress=lzo,space_cache,skip_balance,subvolid=257,subvol=/backup)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-23 17:34     ` Marc MERLIN
@ 2016-02-23 18:01       ` Lionel Bouton
  2016-02-23 18:30         ` Marc MERLIN
  0 siblings, 1 reply; 34+ messages in thread
From: Lionel Bouton @ 2016-02-23 18:01 UTC (permalink / raw)
  To: Marc MERLIN, Duncan, Nazar Mokrynskyi, Alexander Fougner; +Cc: linux-btrfs

Le 23/02/2016 18:34, Marc MERLIN a écrit :
> On Tue, Feb 23, 2016 at 09:26:35AM -0800, Marc MERLIN wrote:
>> Label: 'dshelf2'  uuid: d4a51178-c1e6-4219-95ab-5c5864695bfd
>>         Total devices 1 FS bytes used 4.25TiB
>>         devid    1 size 7.28TiB used 4.44TiB path /dev/mapper/dshelf2
>>
>> btrfs fi df /mnt/btrfs_pool2/
>> Data, single: total=4.29TiB, used=4.18TiB
>> System, DUP: total=64.00MiB, used=512.00KiB
>> Metadata, DUP: total=77.50GiB, used=73.31GiB
>> GlobalReserve, single: total=512.00MiB, used=31.22MiB
>>
>> Currently, it's btrfs on top of dmcrpyt on top of swraid5
> Sorry, I forgot to give the mount options:
> /dev/mapper/dshelf2 on /mnt/dshelf2/backup type btrfs (rw,noatime,compress=lzo,space_cache,skip_balance,subvolid=257,subvol=/backup)

Why don't you use autodefrag ? If you have writable snapshots and do
write to them heavily it would not be a good idea (depending on how
BTRFS handles this in most cases you would probably either break the
reflinks or fragment a snapshot to defragment another) but if you only
have read-only snapshots it may work for you (it does for me).

The only BTRFS filesystems where I disabled autodefrag where Ceph OSDs
with heavy in-place updates. Another option would have been to mark
files NoCoW but I didn't want to abandon BTRFS checksumming.

Lionel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-23 18:01       ` Lionel Bouton
@ 2016-02-23 18:30         ` Marc MERLIN
  2016-02-23 20:35           ` Lionel Bouton
  0 siblings, 1 reply; 34+ messages in thread
From: Marc MERLIN @ 2016-02-23 18:30 UTC (permalink / raw)
  To: Lionel Bouton; +Cc: Duncan, Nazar Mokrynskyi, Alexander Fougner, linux-btrfs

On Tue, Feb 23, 2016 at 07:01:52PM +0100, Lionel Bouton wrote:
> Why don't you use autodefrag ? If you have writable snapshots and do
> write to them heavily it would not be a good idea (depending on how
> BTRFS handles this in most cases you would probably either break the
> reflinks or fragment a snapshot to defragment another) but if you only
> have read-only snapshots it may work for you (it does for me).
 
It's not a stupid question, I had issues with autodefrag in the past,
and turned it off, but it's been a good 2 years, so maybe it works well
enough now.

> The only BTRFS filesystems where I disabled autodefrag where Ceph OSDs
> with heavy in-place updates. Another option would have been to mark
> files NoCoW but I didn't want to abandon BTRFS checksumming.

Right. I don't have to worry about COW for virtualbox images there, and
the snapshots are read only (well, my script makes read-write snapshots
too, but I almost never use them. Hopefully their presence isn't a
problem, right?)

Thanks for the suggestion.
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-23 18:30         ` Marc MERLIN
@ 2016-02-23 20:35           ` Lionel Bouton
  0 siblings, 0 replies; 34+ messages in thread
From: Lionel Bouton @ 2016-02-23 20:35 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Duncan, Nazar Mokrynskyi, Alexander Fougner, linux-btrfs

Le 23/02/2016 19:30, Marc MERLIN a écrit :
> On Tue, Feb 23, 2016 at 07:01:52PM +0100, Lionel Bouton wrote:
>> Why don't you use autodefrag ? If you have writable snapshots and do
>> write to them heavily it would not be a good idea (depending on how
>> BTRFS handles this in most cases you would probably either break the
>> reflinks or fragment a snapshot to defragment another) but if you only
>> have read-only snapshots it may work for you (it does for me).
>  
> It's not a stupid question, I had issues with autodefrag in the past,
> and turned it off, but it's been a good 2 years, so maybe it works well
> enough now.
>
>> The only BTRFS filesystems where I disabled autodefrag where Ceph OSDs
>> with heavy in-place updates. Another option would have been to mark
>> files NoCoW but I didn't want to abandon BTRFS checksumming.
> Right. I don't have to worry about COW for virtualbox images there, and
> the snapshots are read only (well, my script makes read-write snapshots
> too, but I almost never use them. Hopefully their presence isn't a
> problem, right?)

I believe autodefrag is only triggering defragmentation on access (write
access only according to the wiki) and uses a queue of limited length
for defragmentation tasks to perform. So the snapshots by themselves
won't cause problems. Even if you access files the defragmentation
should be focused mainly on the versions of the files you access the
most. The real problems probably happen when you access the same file
from several snapshots with lots of internal modifications between the
versions in these snapshots: either autodefrag will break the reflinks
between them or it will attempt to optimize the 2 file versions at
roughly the same time which won't give any benefit but will waste I/O.

Lionel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-23 17:26   ` Marc MERLIN
  2016-02-23 17:34     ` Marc MERLIN
@ 2016-02-24 10:01     ` Patrik Lundquist
  1 sibling, 0 replies; 34+ messages in thread
From: Patrik Lundquist @ 2016-02-24 10:01 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs@vger.kernel.org

On 23 February 2016 at 18:26, Marc MERLIN <marc@merlins.org> wrote:
>
> I'm currently doing a very slow defrag to see if it'll help (looks like
> it's going to take days).
> I'm doing this:
> for i in dir1 dir2 debian32 debian64 ubuntu dir4 ; do echo $i; time btrfs fi defragment -v -r $i; done
[snip]
> Also, should I try running defragment -r from cron from time to time?

I find the default threshold a bit low and defragment daily with "-t
1m" to combat heavy random write fragmentation.

Once in a while I defrag e.g. VM disk images with "-t 128m" but find
higher thresholds mostly a waste of time.

YMMV.


> But, just to be clear, is there a way I missed to see how fragmented my
> filesystem is without running filefrag on millions of files and parsing
> the output?

I don't think so, and filefrag is slow with heavily fragmented files
because ioctl(FS_IOC_FIEMAP) is called many times with a buffer which
only fits 292 fiemap_extents.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-22 19:58 Major HDD performance degradation on btrfs receive Nazar Mokrynskyi
  2016-02-22 23:30 ` Duncan
@ 2016-02-23 16:55 ` Nazar Mokrynskyi
  2016-02-23 17:05   ` Alexander Fougner
  2016-02-23 17:44 ` Nazar Mokrynskyi
  2 siblings, 1 reply; 34+ messages in thread
From: Nazar Mokrynskyi @ 2016-02-23 16:55 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1986 bytes --]

> > What is wrong with noatime,relatime? I'm using them for a long time as
> > good compromise in terms of performance.
> The one option ends up canceling the other, as they're both atime related
> options that say do different things.
>
> I'd have to actually setup a test or do some research to be sure which
> one overrides the other (but someone here probably can say without
> further research), tho I'd /guess/ the latter one overrides the earlier
> one, which would effectively make them both pretty much useless, since
> relatime is the normal kernel default and thus doesn't need to be
> specified.
>
> Noatime is strongly recommended for btrfs, however, particularly with
> snapshots, as otherwise, the changes between snapshots can consist mostly
> of generally useless atime changes.
>
> (FWIW, after over a decade of using noatime here (I first used it on the
> then new reiserfs, after finding a recommendation for it on that), I got
> tired of specifying the option on nearly all my fstab entries, and now
> days carry a local kernel patch that changes the default to noatime,
> allowing me to drop specifying it everywhere.  I don't claim to be a
> coder, let alone a kernel level coder, but as a gentooer used to building
> from source for over a decade, I've found that I can often find the code
> behind some behavior I'd like to tweak, and given good enough comments, I
> can often create trivial patches to accomplish that tweak, even if it's
> not exactly the code a real C coder would choose to use, which is exactly
> what I've done here.  So now, unless some other atime option is
> specified, my filesystems are all mounted noatime.  =:^)
Well, then I'll leave relatime on root fs and noatime on partition with 
snapshots, thanks.

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Skype: nazar-pc
Diaspora: nazarpc@diaspora.mokrynskyi.com
Tox: A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249



[-- Attachment #2: ÐÑÑÐ¿ÑÐ¾Ð³ÑÐ°ÑÑÑÐ½Ð¸Ð¹ Ð¿ÑÐ´Ð¿Ð¸Ñ S/MIME --]
[-- Type: application/pkcs7-signature, Size: 3825 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-23 16:55 ` Nazar Mokrynskyi
@ 2016-02-23 17:05   ` Alexander Fougner
  2016-02-23 17:18     ` Nazar Mokrynskyi
  0 siblings, 1 reply; 34+ messages in thread
From: Alexander Fougner @ 2016-02-23 17:05 UTC (permalink / raw)
  To: Nazar Mokrynskyi; +Cc: linux-btrfs

2016-02-23 17:55 GMT+01:00 Nazar Mokrynskyi <nazar@mokrynskyi.com>:
>> > What is wrong with noatime,relatime? I'm using them for a long time as
>> > good compromise in terms of performance.
>> The one option ends up canceling the other, as they're both atime related
>> options that say do different things.
>>
>> I'd have to actually setup a test or do some research to be sure which
>> one overrides the other (but someone here probably can say without
>> further research), tho I'd /guess/ the latter one overrides the earlier
>> one, which would effectively make them both pretty much useless, since
>> relatime is the normal kernel default and thus doesn't need to be
>> specified.
>>
>> Noatime is strongly recommended for btrfs, however, particularly with
>> snapshots, as otherwise, the changes between snapshots can consist mostly
>> of generally useless atime changes.
>>
>> (FWIW, after over a decade of using noatime here (I first used it on the
>> then new reiserfs, after finding a recommendation for it on that), I got
>> tired of specifying the option on nearly all my fstab entries, and now
>> days carry a local kernel patch that changes the default to noatime,
>> allowing me to drop specifying it everywhere.  I don't claim to be a
>> coder, let alone a kernel level coder, but as a gentooer used to building
>> from source for over a decade, I've found that I can often find the code
>> behind some behavior I'd like to tweak, and given good enough comments, I
>> can often create trivial patches to accomplish that tweak, even if it's
>> not exactly the code a real C coder would choose to use, which is exactly
>> what I've done here.  So now, unless some other atime option is
>> specified, my filesystems are all mounted noatime.  =:^)
>
> Well, then I'll leave relatime on root fs and noatime on partition with
> snapshots, thanks.

If you snapshot the root filesystem then the atime changes will still
be there, and you'll be having a lot of unnecessary changes between
each snapshot.

> Sincerely, Nazar Mokrynskyi
> github.com/nazar-pc
> Skype: nazar-pc
> Diaspora: nazarpc@diaspora.mokrynskyi.com
> Tox:
> A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249
>
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-23 17:05   ` Alexander Fougner
@ 2016-02-23 17:18     ` Nazar Mokrynskyi
  2016-02-23 17:29       ` Alexander Fougner
  0 siblings, 1 reply; 34+ messages in thread
From: Nazar Mokrynskyi @ 2016-02-23 17:18 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2467 bytes --]

But why? I have relatime option, it should not cause changes unless file 
contents is actually changed if I understand this option correctly.

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Skype: nazar-pc
Diaspora: nazarpc@diaspora.mokrynskyi.com
Tox: A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249

On 23.02.16 18:05, Alexander Fougner wrote:
> 2016-02-23 17:55 GMT+01:00 Nazar Mokrynskyi <nazar@mokrynskyi.com>:
>>>> What is wrong with noatime,relatime? I'm using them for a long time as
>>>> good compromise in terms of performance.
>>> The one option ends up canceling the other, as they're both atime related
>>> options that say do different things.
>>>
>>> I'd have to actually setup a test or do some research to be sure which
>>> one overrides the other (but someone here probably can say without
>>> further research), tho I'd /guess/ the latter one overrides the earlier
>>> one, which would effectively make them both pretty much useless, since
>>> relatime is the normal kernel default and thus doesn't need to be
>>> specified.
>>>
>>> Noatime is strongly recommended for btrfs, however, particularly with
>>> snapshots, as otherwise, the changes between snapshots can consist mostly
>>> of generally useless atime changes.
>>>
>>> (FWIW, after over a decade of using noatime here (I first used it on the
>>> then new reiserfs, after finding a recommendation for it on that), I got
>>> tired of specifying the option on nearly all my fstab entries, and now
>>> days carry a local kernel patch that changes the default to noatime,
>>> allowing me to drop specifying it everywhere.  I don't claim to be a
>>> coder, let alone a kernel level coder, but as a gentooer used to building
>>> from source for over a decade, I've found that I can often find the code
>>> behind some behavior I'd like to tweak, and given good enough comments, I
>>> can often create trivial patches to accomplish that tweak, even if it's
>>> not exactly the code a real C coder would choose to use, which is exactly
>>> what I've done here.  So now, unless some other atime option is
>>> specified, my filesystems are all mounted noatime.  =:^)
>> Well, then I'll leave relatime on root fs and noatime on partition with
>> snapshots, thanks.
> If you snapshot the root filesystem then the atime changes will still
> be there, and you'll be having a lot of unnecessary changes between
> each snapshot.


[-- Attachment #2: ÐÑÑÐ¿ÑÐ¾Ð³ÑÐ°ÑÑÑÐ½Ð¸Ð¹ Ð¿ÑÐ´Ð¿Ð¸Ñ S/MIME --]
[-- Type: application/pkcs7-signature, Size: 3825 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-23 17:18     ` Nazar Mokrynskyi
@ 2016-02-23 17:29       ` Alexander Fougner
  2016-02-23 17:34         ` Nazar Mokrynskyi
  0 siblings, 1 reply; 34+ messages in thread
From: Alexander Fougner @ 2016-02-23 17:29 UTC (permalink / raw)
  To: Nazar Mokrynskyi; +Cc: linux-btrfs

2016-02-23 18:18 GMT+01:00 Nazar Mokrynskyi <nazar@mokrynskyi.com>:
> But why? I have relatime option, it should not cause changes unless file
> contents is actually changed if I understand this option correctly.
>

*or* if it is older than 1 day. From the manpages:

relatime
              Update inode access times relative to modify or change time.
              Access time is only updated if the previous access time was
              earlier than the current modify or change time.  (Similar to
              noatime, but it doesn't break mutt or other applications that
              need to know if a file has been read since the last time it
              was modified.)

              Since Linux 2.6.30, the kernel defaults to the behavior
              provided by this option (unless noatime was specified), and
              the strictatime option is required to obtain traditional
>>>>>   semantics.  In addition, since Linux 2.6.30, the file's last
              access time is always updated if it is more than 1 day old. <<<<<

Also, if you only use relatime, then you don't need to specify it,
it's the default since 2.6.30 as mentioned above.


> Sincerely, Nazar Mokrynskyi
> github.com/nazar-pc
> Skype: nazar-pc
> Diaspora: nazarpc@diaspora.mokrynskyi.com
> Tox:
> A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249
>
> On 23.02.16 18:05, Alexander Fougner wrote:
>>
>> 2016-02-23 17:55 GMT+01:00 Nazar Mokrynskyi <nazar@mokrynskyi.com>:
>>>>>
>>>>> What is wrong with noatime,relatime? I'm using them for a long time as
>>>>> good compromise in terms of performance.
>>>>
>>>> The one option ends up canceling the other, as they're both atime
>>>> related
>>>> options that say do different things.
>>>>
>>>> I'd have to actually setup a test or do some research to be sure which
>>>> one overrides the other (but someone here probably can say without
>>>> further research), tho I'd /guess/ the latter one overrides the earlier
>>>> one, which would effectively make them both pretty much useless, since
>>>> relatime is the normal kernel default and thus doesn't need to be
>>>> specified.
>>>>
>>>> Noatime is strongly recommended for btrfs, however, particularly with
>>>> snapshots, as otherwise, the changes between snapshots can consist
>>>> mostly
>>>> of generally useless atime changes.
>>>>
>>>> (FWIW, after over a decade of using noatime here (I first used it on the
>>>> then new reiserfs, after finding a recommendation for it on that), I got
>>>> tired of specifying the option on nearly all my fstab entries, and now
>>>> days carry a local kernel patch that changes the default to noatime,
>>>> allowing me to drop specifying it everywhere.  I don't claim to be a
>>>> coder, let alone a kernel level coder, but as a gentooer used to
>>>> building
>>>> from source for over a decade, I've found that I can often find the code
>>>> behind some behavior I'd like to tweak, and given good enough comments,
>>>> I
>>>> can often create trivial patches to accomplish that tweak, even if it's
>>>> not exactly the code a real C coder would choose to use, which is
>>>> exactly
>>>> what I've done here.  So now, unless some other atime option is
>>>> specified, my filesystems are all mounted noatime.  =:^)
>>>
>>> Well, then I'll leave relatime on root fs and noatime on partition with
>>> snapshots, thanks.
>>
>> If you snapshot the root filesystem then the atime changes will still
>> be there, and you'll be having a lot of unnecessary changes between
>> each snapshot.
>
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-23 17:29       ` Alexander Fougner
@ 2016-02-23 17:34         ` Nazar Mokrynskyi
  2016-02-23 18:09           ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 34+ messages in thread
From: Nazar Mokrynskyi @ 2016-02-23 17:34 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4223 bytes --]

Wow, this is interesting, didn't know it.

I'll probably try noatime instead:)

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Skype: nazar-pc
Diaspora: nazarpc@diaspora.mokrynskyi.com
Tox: A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249

On 23.02.16 18:29, Alexander Fougner wrote:
> 2016-02-23 18:18 GMT+01:00 Nazar Mokrynskyi <nazar@mokrynskyi.com>:
>> But why? I have relatime option, it should not cause changes unless file
>> contents is actually changed if I understand this option correctly.
>>
> *or* if it is older than 1 day. From the manpages:
>
> relatime
>                Update inode access times relative to modify or change time.
>                Access time is only updated if the previous access time was
>                earlier than the current modify or change time.  (Similar to
>                noatime, but it doesn't break mutt or other applications that
>                need to know if a file has been read since the last time it
>                was modified.)
>
>                Since Linux 2.6.30, the kernel defaults to the behavior
>                provided by this option (unless noatime was specified), and
>                the strictatime option is required to obtain traditional
>>>>>>    semantics.  In addition, since Linux 2.6.30, the file's last
>                access time is always updated if it is more than 1 day old. <<<<<
>
> Also, if you only use relatime, then you don't need to specify it,
> it's the default since 2.6.30 as mentioned above.
>
>
>> Sincerely, Nazar Mokrynskyi
>> github.com/nazar-pc
>> Skype: nazar-pc
>> Diaspora: nazarpc@diaspora.mokrynskyi.com
>> Tox:
>> A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249
>>
>> On 23.02.16 18:05, Alexander Fougner wrote:
>>> 2016-02-23 17:55 GMT+01:00 Nazar Mokrynskyi <nazar@mokrynskyi.com>:
>>>>>> What is wrong with noatime,relatime? I'm using them for a long time as
>>>>>> good compromise in terms of performance.
>>>>> The one option ends up canceling the other, as they're both atime
>>>>> related
>>>>> options that say do different things.
>>>>>
>>>>> I'd have to actually setup a test or do some research to be sure which
>>>>> one overrides the other (but someone here probably can say without
>>>>> further research), tho I'd /guess/ the latter one overrides the earlier
>>>>> one, which would effectively make them both pretty much useless, since
>>>>> relatime is the normal kernel default and thus doesn't need to be
>>>>> specified.
>>>>>
>>>>> Noatime is strongly recommended for btrfs, however, particularly with
>>>>> snapshots, as otherwise, the changes between snapshots can consist
>>>>> mostly
>>>>> of generally useless atime changes.
>>>>>
>>>>> (FWIW, after over a decade of using noatime here (I first used it on the
>>>>> then new reiserfs, after finding a recommendation for it on that), I got
>>>>> tired of specifying the option on nearly all my fstab entries, and now
>>>>> days carry a local kernel patch that changes the default to noatime,
>>>>> allowing me to drop specifying it everywhere.  I don't claim to be a
>>>>> coder, let alone a kernel level coder, but as a gentooer used to
>>>>> building
>>>>> from source for over a decade, I've found that I can often find the code
>>>>> behind some behavior I'd like to tweak, and given good enough comments,
>>>>> I
>>>>> can often create trivial patches to accomplish that tweak, even if it's
>>>>> not exactly the code a real C coder would choose to use, which is
>>>>> exactly
>>>>> what I've done here.  So now, unless some other atime option is
>>>>> specified, my filesystems are all mounted noatime.  =:^)
>>>> Well, then I'll leave relatime on root fs and noatime on partition with
>>>> snapshots, thanks.
>>> If you snapshot the root filesystem then the atime changes will still
>>> be there, and you'll be having a lot of unnecessary changes between
>>> each snapshot.
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



[-- Attachment #2: ÐÑÑÐ¿ÑÐ¾Ð³ÑÐ°ÑÑÑÐ½Ð¸Ð¹ Ð¿ÑÐ´Ð¿Ð¸Ñ S/MIME --]
[-- Type: application/pkcs7-signature, Size: 3825 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-23 17:34         ` Nazar Mokrynskyi
@ 2016-02-23 18:09           ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 34+ messages in thread
From: Austin S. Hemmelgarn @ 2016-02-23 18:09 UTC (permalink / raw)
  To: Nazar Mokrynskyi, linux-btrfs

On 2016-02-23 12:34, Nazar Mokrynskyi wrote:
> Wow, this is interesting, didn't know it.
>
> I'll probably try noatime instead:)
>
For what it's worth, due to how it's implemented on almost every UNIX 
derived system in existence (including Linux), atimes are essentially 
useless.  A majority of the software that has used them over the years 
has mad the flawed assumption that the atime only gets updated when the 
file data is read or modified, when they actually get updated when ever 
the file data is read or modified, and when the metadata is modified 
(and in some old UNIX systems, the update on file data modification was 
simply implemented as a cascade effect from the mtime getting updated). 
  Mutt is one of the only publicly available programs I know of that 
uses them, and it makes this same flawed assumption.  The only software 
I know of that uses them right is tmpwatch and tmpreaper, which use them 
to clean up /tmp and similar directories when files there haven't been 
touched in a long time, and even those have the option to not use atimes.

Now, long rant aside, you may want to also look into the 'lazytime' 
mount option.  It won't reduce fragmentation, but it should improve 
performance overall, the only downsides are that mtimes might be 
incorrect after a crash, and it's only available in newer kernels (I 
think it got added in 4.0 or 4.1, but I'm not certain).

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-22 19:58 Major HDD performance degradation on btrfs receive Nazar Mokrynskyi
  2016-02-22 23:30 ` Duncan
  2016-02-23 16:55 ` Nazar Mokrynskyi
@ 2016-02-23 17:44 ` Nazar Mokrynskyi
  2016-02-24 22:32   ` Henk Slager
  2 siblings, 1 reply; 34+ messages in thread
From: Nazar Mokrynskyi @ 2016-02-23 17:44 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 7455 bytes --]

Looks like btrfstune -x did nothing, probably, it was already used at 
creation time, I'm using rcX versions of kernel all the time and rolling 
version of Ubuntu, so this is very likely to be the case.

One thing I've noticed is much slower mount/umount on HDD than on SSD:

> nazar-pc@nazar-pc ~> time sudo umount /backup
> 0.00user 0.00system 0:00.01elapsed 36%CPU (0avgtext+0avgdata 
> 7104maxresident)k
> 0inputs+0outputs (0major+784minor)pagefaults 0swaps
> nazar-pc@nazar-pc ~> time sudo mount /backup
> 0.00user 0.00system 0:00.03elapsed 23%CPU (0avgtext+0avgdata 
> 7076maxresident)k
> 0inputs+0outputs (0major+803minor)pagefaults 0swaps
> nazar-pc@nazar-pc ~> time sudo umount /backup_hdd
> 0.00user 0.11system 0:01.04elapsed 11%CPU (0avgtext+0avgdata 
> 7092maxresident)k
> 0inputs+15296outputs (0major+787minor)pagefaults 0swaps
> nazar-pc@nazar-pc ~> time sudo mount /backup_hdd
> 0.00user 0.02system 0:04.45elapsed 0%CPU (0avgtext+0avgdata 
> 7140maxresident)k
> 14648inputs+0outputs (0major+795minor)pagefaults 0swaps
It is especially long (tenth of seconds with hight HDD load) when called 
after some time, not consequently.

Once it took something like 20 seconds to unmount filesystem and around 
10 seconds to mount it.

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Skype: nazar-pc
Diaspora:nazarpc@diaspora.mokrynskyi.com
Tox: A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249

On 22.02.16 20:58, Nazar Mokrynskyi wrote:
>> On Tue, Feb 16, 2016 at 5:44 AM, Nazar Mokrynskyi 
>> <na...@mokrynskyi.com> wrote:
>> > I have 2 SSD with BTRFS filesystem (RAID) on them and several 
>> subvolumes.
>> > Each 15 minutes I'm creating read-only snapshot of subvolumes 
>> /root, /home
>> > and /web inside /backup.
>> > After this I'm searching for last common subvolume on /backup_hdd, 
>> sending
>> > difference between latest common snapshot and simply latest 
>> snapshot to
>> > /backup_hdd.
>> > On top of all above there is snapshots rotation, so that /backup 
>> contains
>> > much less snapshots than /backup_hdd.
>> >
>> > I'm using this setup for last 7 months or so and this is luckily 
>> the longest
>> > period when I had no problems with BTRFS at all.
>> > However, last 2+ months btrfs receive command loads HDD so much 
>> that I can't
>> > even get list of directories in it.
>> > This happens even if diff between snapshots is really small.
>> > HDD contains 2 filesystems - mentioned BTRFS and ext4 for other 
>> files, so I
>> > can't even play mp3 file from ext4 filesystem while btrfs receive is
>> > running.
>> > Since I'm running everything each 15 minutes this is a real headache.
>> >
>> > My guess is that performance hit might be caused by filesystem 
>> fragmentation
>> > even though there is more than enough empty space. But I'm not sure 
>> how to
>> > properly check this and can't, obviously, run defragmentation on 
>> read-only
>> > subvolumes.
>> >
>> > I'll be thankful for anything that might help to identify and 
>> resolve this
>> > issue.
>> >
>> > ~> uname -a
>> > Linux nazar-pc 4.5.0-rc4-haswell #1 SMP Tue Feb 16 02:09:13 CET 
>> 2016 x86_64
>> > x86_64 x86_64 GNU/Linux
>> >
>> > ~> btrfs --version
>> > btrfs-progs v4.4
>> >
>> > ~> sudo btrfs fi show
>> > Label: none  uuid: 5170aca4-061a-4c6c-ab00-bd7fc8ae6030
>> >     Total devices 2 FS bytes used 71.00GiB
>> >     devid    1 size 111.30GiB used 111.30GiB path /dev/sdb2
>> >     devid    2 size 111.30GiB used 111.29GiB path /dev/sdc2
>> >
>> > Label: 'Backup'  uuid: 40b8240a-a0a2-4034-ae55-f8558c0343a8
>> >     Total devices 1 FS bytes used 252.54GiB
>> >     devid    1 size 800.00GiB used 266.08GiB path /dev/sda1
>> >
>> > ~> sudo btrfs fi df /
>> > Data, RAID0: total=214.56GiB, used=69.10GiB
>> > System, RAID1: total=8.00MiB, used=16.00KiB
>> > System, single: total=4.00MiB, used=0.00B
>> > Metadata, RAID1: total=4.00GiB, used=1.87GiB
>> > Metadata, single: total=8.00MiB, used=0.00B
>> > GlobalReserve, single: total=512.00MiB, used=0.00B
>> >
>> > ~> sudo btrfs fi df /backup_hdd
>> > Data, single: total=245.01GiB, used=243.61GiB
>> > System, DUP: total=32.00MiB, used=48.00KiB
>> > System, single: total=4.00MiB, used=0.00B
>> > Metadata, DUP: total=10.50GiB, used=8.93GiB
>> > Metadata, single: total=8.00MiB, used=0.00B
>> > GlobalReserve, single: total=512.00MiB, used=0.00B
>> >
>> > Relevant mount options:
>> > UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    / btrfs
>> > compress=lzo,noatime,relatime,ssd,subvol=/root    0 1
>> > UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    /home btrfs
>> > compress=lzo,noatime,relatime,ssd,subvol=/home 0    1
>> > UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    /backup btrfs
>> > compress=lzo,noatime,relatime,ssd,subvol=/backup 0    1
>> > UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    /web btrfs
>> > compress=lzo,noatime,relatime,ssd,subvol=/web 0    1
>> > UUID=40b8240a-a0a2-4034-ae55-f8558c0343a8    /backup_hdd btrfs
>> > compress=lzo,noatime,relatime,noexec 0    1
>> As already indicated by Duncan, the amount of snapshots might be just
>> too much. The fragmentation on the HDD might have become very high. If
>> there is limited amount of RAM in the system (so limited caching), too
>> much time is lost in seeks. In addition:
>>
>>   compress=lzo
>> this also increases the chance of scattering fragments and 
>> fragmentation.
>>
>>   noatime,relatime
>> I am not sure why you have this. Hopefully you have the actual mount
>> listed as   noatime
>>
>> You could use the principles of the tool/package called snapper  to
>> do a sort of non-linear snapshot thinning: further back in time you
>> will have a much higher granularity of snapshot over a certain
>> timeframe.
>>
>> You could use skinny metadata (recreate the fs with newer tools or use
>> btrfstune -x on /dev/sda1). I think at the moment this flag is not
>> enabled on /dev/sda1
>>
>> If you put just 1 btrfs fs on the hdd (so move all the content from
>> the ext4 fs in the the btrfs fs) you might get better overall
>> performance. I assume the ext4 fs is on the second (slower part) of
>> the HDD and that is a disadvantage I think.
>> But you probably have reasons for why the setup is like it is.
> I've replied to Duncan's message about number of snapshots, there is 
> snapshots rotation and number of snapshots it is quite small, 491 in 
> total.
>
> About memory - 16 GiB of RAM should be enough I guess:) Can I measure 
> somehow if seeking is a problem?
>
> What is wrong with noatime,relatime? I'm using them for a long time as 
> good compromise in terms of performance.
>
> I'll try btrfstune -x and let you know whether it changes anything.
>
> About ext4 - it is actually because I did have some serious problems 
> with BTRFS in past 2.5 years or so (however, first time I've recovered 
> files by building manually git version of btrfs-tools and last time I 
> had not very up to date, but backup of everything in other place so I 
> didn't miss too much unrecoverable data), so for a while I'd like to 
> store some files separately on filesystem that is extremely difficult 
> to break. Its content not critical in terms of performance, files do 
> not compress well and I do not really need any other extended features 
> on that partition - so ext4 will be there for a while.
>



[-- Attachment #2: ÐÑÑÐ¿ÑÐ¾Ð³ÑÐ°ÑÑÑÐ½Ð¸Ð¹ Ð¿ÑÐ´Ð¿Ð¸Ñ S/MIME --]
[-- Type: application/pkcs7-signature, Size: 3825 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-23 17:44 ` Nazar Mokrynskyi
@ 2016-02-24 22:32   ` Henk Slager
  2016-02-24 22:46     ` Nazar Mokrynskyi
       [not found]     ` <ce805cd7-422c-ab6a-fbf8-18a304aa640d@mokrynskyi.com>
  0 siblings, 2 replies; 34+ messages in thread
From: Henk Slager @ 2016-02-24 22:32 UTC (permalink / raw)
  To: linux-btrfs

On Tue, Feb 23, 2016 at 6:44 PM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
> Looks like btrfstune -x did nothing, probably, it was already used at
> creation time, I'm using rcX versions of kernel all the time and rolling
> version of Ubuntu, so this is very likely to be the case.

The command    btrfs-show-super   shows the features of the
filesystem. You have a 'dummy' single profiles on the HDD fs and that
gives me a hint that you likely have used older tools to create the
fs. The current kernel does not set this feature flag on disk. If the
flag was already set, then no difference in performance.

If it was not set, then from now on, new metadata extents should be
skinny, which saves on total memory size and processing for (the
larger) filesystems. But for your existing data (snapshot subvolumes
in your  case) the metadata is then still non-skinny. So you won't
notice an instant difference only after all exiting fileblocks are
re-written or removed.
You will probably have a measurable difference if you equally fill 2
filesystems, one with and the other without the flag.

> One thing I've noticed is much slower mount/umount on HDD than on SSD:
>
>> nazar-pc@nazar-pc ~> time sudo umount /backup
>> 0.00user 0.00system 0:00.01elapsed 36%CPU (0avgtext+0avgdata
>> 7104maxresident)k
>> 0inputs+0outputs (0major+784minor)pagefaults 0swaps
>> nazar-pc@nazar-pc ~> time sudo mount /backup
>> 0.00user 0.00system 0:00.03elapsed 23%CPU (0avgtext+0avgdata
>> 7076maxresident)k
>> 0inputs+0outputs (0major+803minor)pagefaults 0swaps
>> nazar-pc@nazar-pc ~> time sudo umount /backup_hdd
>> 0.00user 0.11system 0:01.04elapsed 11%CPU (0avgtext+0avgdata
>> 7092maxresident)k
>> 0inputs+15296outputs (0major+787minor)pagefaults 0swaps
>> nazar-pc@nazar-pc ~> time sudo mount /backup_hdd
>> 0.00user 0.02system 0:04.45elapsed 0%CPU (0avgtext+0avgdata
>> 7140maxresident)k
>> 14648inputs+0outputs (0major+795minor)pagefaults 0swaps
>
> It is especially long (tenth of seconds with hight HDD load) when called
> after some time, not consequently.
>
> Once it took something like 20 seconds to unmount filesystem and around 10
> seconds to mount it.

Those are quite typical values for an already heavily used btrfs on a HDD.

>> About memory - 16 GiB of RAM should be enough I guess:) Can I measure
>> somehow if seeking is a problem?

I don't know a tool that can measure seek times and gather statistics
over and extended period of time and relate that to filesystem
internal actions. It would be best if all this were done by the HDD
firmware (under command of the filesystem code). One can make a model
of it I think, but the question is how good that is for modern drives.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-24 22:32   ` Henk Slager
@ 2016-02-24 22:46     ` Nazar Mokrynskyi
       [not found]     ` <ce805cd7-422c-ab6a-fbf8-18a304aa640d@mokrynskyi.com>
  1 sibling, 0 replies; 34+ messages in thread
From: Nazar Mokrynskyi @ 2016-02-24 22:46 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 5651 bytes --]

Here is btrfs-show-super output:

> nazar-pc@nazar-pc ~> sudo btrfs-show-super /dev/sda1
> superblock: bytenr=65536, device=/dev/sda1
> ---------------------------------------------------------
> csum            0x1e3c6fb8 [match]
> bytenr            65536
> flags            0x1
>             ( WRITTEN )
> magic            _BHRfS_M [match]
> fsid            40b8240a-a0a2-4034-ae55-f8558c0343a8
> label            Backup
> generation        165491
> root            143985360896
> sys_array_size        226
> chunk_root_generation    162837
> root_level        1
> chunk_root        247023583232
> chunk_root_level    1
> log_root        0
> log_root_transid    0
> log_root_level        0
> total_bytes        858993459200
> bytes_used        276512202752
> sectorsize        4096
> nodesize        16384
> leafsize        16384
> stripesize        4096
> root_dir        6
> num_devices        1
> compat_flags        0x0
> compat_ro_flags        0x0
> incompat_flags        0x169
>             ( MIXED_BACKREF |
>               COMPRESS_LZO |
>               BIG_METADATA |
>               EXTENDED_IREF |
>               SKINNY_METADATA )
> csum_type        0
> csum_size        4
> cache_generation    165491
> uuid_tree_generation    165491
> dev_item.uuid        81eee7a6-774e-4bb5-8b72-cebb85a2f2ce
> dev_item.fsid        40b8240a-a0a2-4034-ae55-f8558c0343a8 [match]
> dev_item.type        0
> dev_item.total_bytes    858993459200
> dev_item.bytes_used    291072114688
> dev_item.io_align    4096
> dev_item.io_width    4096
> dev_item.sector_size    4096
> dev_item.devid        1
> dev_item.dev_group    0
> dev_item.seek_speed    0
> dev_item.bandwidth    0
> dev_item.generation    0
It is sad that skinny metadata will only affect new data, probably, I'll 
end up re-creating it:(

Can I rebalance it or something simple for this purpose?

> Those are quite typical values for an already heavily used btrfs on a HDD.

Bad news, since I'm doing mounting/unmounting few times during snapshots 
creation because of how BTRFS works (source code: 
https://github.com/nazar-pc/just-backup-btrfs/blob/master/just-backup-btrfs.php#L148)

So if 10+20 seconds is typical, then in my case HDD can be very busy 
during a minute or sometimes more, this is not good and basically part 
or even real reason of initial question.

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Skype: nazar-pc
Diaspora: nazarpc@diaspora.mokrynskyi.com
Tox: 
A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249

On 24.02.16 23:32, Henk Slager wrote:
> On Tue, Feb 23, 2016 at 6:44 PM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
>> Looks like btrfstune -x did nothing, probably, it was already used at
>> creation time, I'm using rcX versions of kernel all the time and rolling
>> version of Ubuntu, so this is very likely to be the case.
> The command    btrfs-show-super   shows the features of the
> filesystem. You have a 'dummy' single profiles on the HDD fs and that
> gives me a hint that you likely have used older tools to create the
> fs. The current kernel does not set this feature flag on disk. If the
> flag was already set, then no difference in performance.
>
> If it was not set, then from now on, new metadata extents should be
> skinny, which saves on total memory size and processing for (the
> larger) filesystems. But for your existing data (snapshot subvolumes
> in your  case) the metadata is then still non-skinny. So you won't
> notice an instant difference only after all exiting fileblocks are
> re-written or removed.
> You will probably have a measurable difference if you equally fill 2
> filesystems, one with and the other without the flag.
>
>> One thing I've noticed is much slower mount/umount on HDD than on SSD:
>>
>>> nazar-pc@nazar-pc ~> time sudo umount /backup
>>> 0.00user 0.00system 0:00.01elapsed 36%CPU (0avgtext+0avgdata
>>> 7104maxresident)k
>>> 0inputs+0outputs (0major+784minor)pagefaults 0swaps
>>> nazar-pc@nazar-pc ~> time sudo mount /backup
>>> 0.00user 0.00system 0:00.03elapsed 23%CPU (0avgtext+0avgdata
>>> 7076maxresident)k
>>> 0inputs+0outputs (0major+803minor)pagefaults 0swaps
>>> nazar-pc@nazar-pc ~> time sudo umount /backup_hdd
>>> 0.00user 0.11system 0:01.04elapsed 11%CPU (0avgtext+0avgdata
>>> 7092maxresident)k
>>> 0inputs+15296outputs (0major+787minor)pagefaults 0swaps
>>> nazar-pc@nazar-pc ~> time sudo mount /backup_hdd
>>> 0.00user 0.02system 0:04.45elapsed 0%CPU (0avgtext+0avgdata
>>> 7140maxresident)k
>>> 14648inputs+0outputs (0major+795minor)pagefaults 0swaps
>> It is especially long (tenth of seconds with hight HDD load) when called
>> after some time, not consequently.
>>
>> Once it took something like 20 seconds to unmount filesystem and around 10
>> seconds to mount it.
> Those are quite typical values for an already heavily used btrfs on a HDD.
>
>>> About memory - 16 GiB of RAM should be enough I guess:) Can I measure
>>> somehow if seeking is a problem?
> I don't know a tool that can measure seek times and gather statistics
> over and extended period of time and relate that to filesystem
> internal actions. It would be best if all this were done by the HDD
> firmware (under command of the filesystem code). One can make a model
> of it I think, but the question is how good that is for modern drives.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




[-- Attachment #2: ÐÑÑÐ¿ÑÐ¾Ð³ÑÐ°ÑÑÑÐ½Ð¸Ð¹ Ð¿ÑÐ´Ð¿Ð¸Ñ S/MIME --]
[-- Type: application/pkcs7-signature, Size: 3825 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

[parent not found: <ce805cd7-422c-ab6a-fbf8-18a304aa640d@mokrynskyi.com>]

* Re: Major HDD performance degradation on btrfs receive
       [not found]     ` <ce805cd7-422c-ab6a-fbf8-18a304aa640d@mokrynskyi.com>
@ 2016-02-25  1:04       ` Henk Slager
  2016-03-15  0:47         ` Nazar Mokrynskyi
  0 siblings, 1 reply; 34+ messages in thread
From: Henk Slager @ 2016-02-25  1:04 UTC (permalink / raw)
  To: linux-btrfs

On Wed, Feb 24, 2016 at 11:45 PM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
> Here is btrfs-show-super output:
>
>> nazar-pc@nazar-pc ~> sudo btrfs-show-super /dev/sda1
>> superblock: bytenr=65536, device=/dev/sda1
>> ---------------------------------------------------------
>> csum            0x1e3c6fb8 [match]
>> bytenr            65536
>> flags            0x1
>>             ( WRITTEN )
>> magic            _BHRfS_M [match]
>> fsid            40b8240a-a0a2-4034-ae55-f8558c0343a8
>> label            Backup
>> generation        165491
>> root            143985360896
>> sys_array_size        226
>> chunk_root_generation    162837
>> root_level        1
>> chunk_root        247023583232
>> chunk_root_level    1
>> log_root        0
>> log_root_transid    0
>> log_root_level        0
>> total_bytes        858993459200
>> bytes_used        276512202752
>> sectorsize        4096
>> nodesize        16384
>> leafsize        16384
>> stripesize        4096
>> root_dir        6
>> num_devices        1
>> compat_flags        0x0
>> compat_ro_flags        0x0
>> incompat_flags        0x169
>>             ( MIXED_BACKREF |
>>               COMPRESS_LZO |
>>               BIG_METADATA |
>>               EXTENDED_IREF |
>>               SKINNY_METADATA )
>> csum_type        0
>> csum_size        4
>> cache_generation    165491
>> uuid_tree_generation    165491
>> dev_item.uuid        81eee7a6-774e-4bb5-8b72-cebb85a2f2ce
>> dev_item.fsid        40b8240a-a0a2-4034-ae55-f8558c0343a8 [match]
>> dev_item.type        0
>> dev_item.total_bytes    858993459200
>> dev_item.bytes_used    291072114688
>> dev_item.io_align    4096
>> dev_item.io_width    4096
>> dev_item.sector_size    4096
>> dev_item.devid        1
>> dev_item.dev_group    0
>> dev_item.seek_speed    0
>> dev_item.bandwidth    0
>> dev_item.generation    0
>
> It is sad that skinny metadata will only affect new data, probably, I'll end
> up re-creating it:(
>
> Can I rebalance it or something simple for this purpose?

A balance won't help for that and also your metadata does look quite
compact already. But I think you should not expect so much of this
skinny metadata on a PC with 16G RAM

>> Those are quite typical values for an already heavily used btrfs on a HDD.
>
>
> Bad news, since I'm doing mounting/unmounting few times during snapshots
> creation because of how BTRFS works (source code:
> https://github.com/nazar-pc/just-backup-btrfs/blob/master/just-backup-btrfs.php#L148)
>
> So if 10+20 seconds is typical, then in my case HDD can be very busy during
> a minute or sometimes more, this is not good and basically part or even real
> reason of initial question.

Yes indeed! This mount/unmount every 15 minutes (or more times per 15
minutes) is killing for performance IMO. At the moment I don't fully
understand why you are bothered by the limitation you mention in the
php source comments. I think it's definitely worth to change paths
and/or your requirements in such a way that you can avoid the
umount/mount.

As a workaround, bcache with its cache device nicely filled over time,
will absolutely speedup the mount. But as you had some troubles with
btrfs in the past and also you use ext4 on the same disk because it is
a more mature filesystem, you might not want bache+btrfs for backup
storage, it is up to you.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-25  1:04       ` Henk Slager
@ 2016-03-15  0:47         ` Nazar Mokrynskyi
  2016-03-15 23:11           ` Henk Slager
  0 siblings, 1 reply; 34+ messages in thread
From: Nazar Mokrynskyi @ 2016-03-15  0:47 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4970 bytes --]

Some update since last time (few weeks ago).

All filesystems are mounted with noatime, I've also added mounting 
optimization - so there is no problem with remounting filesystem every 
time, it is done only once.

Remounting optimization helped by reducing 1 complete snapshot + 
send/receive cycle by some seconds, but otherwise it is still very slow 
when `btrfs receive` is active.

I'm not considering bcache + btrfs as potential setup because I do not 
currently have free SSD for it and basically spending SSD besides HDD 
for backup partition feels like a bit of overkill (especially for 
desktop use).

My current kernel is 4.5.0 stable, btrfs-tools still 4.4-1 from Ubuntu 
16.04 repository as of today.

As I'm reading mailing list there are other folks having similar 
performance issues. So can we debug things to find the root cause and 
fix it at some point?

My C/C++/Kernel/BTRFS knowledges are scarce, which is why some 
assistance here is needed from someone more experienced.

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Skype: nazar-pc
Diaspora: nazarpc@diaspora.mokrynskyi.com
Tox: A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249

On 25.02.16 03:04, Henk Slager wrote:
> On Wed, Feb 24, 2016 at 11:45 PM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
>> Here is btrfs-show-super output:
>>
>>> nazar-pc@nazar-pc ~> sudo btrfs-show-super /dev/sda1
>>> superblock: bytenr=65536, device=/dev/sda1
>>> ---------------------------------------------------------
>>> csum            0x1e3c6fb8 [match]
>>> bytenr            65536
>>> flags            0x1
>>>              ( WRITTEN )
>>> magic            _BHRfS_M [match]
>>> fsid            40b8240a-a0a2-4034-ae55-f8558c0343a8
>>> label            Backup
>>> generation        165491
>>> root            143985360896
>>> sys_array_size        226
>>> chunk_root_generation    162837
>>> root_level        1
>>> chunk_root        247023583232
>>> chunk_root_level    1
>>> log_root        0
>>> log_root_transid    0
>>> log_root_level        0
>>> total_bytes        858993459200
>>> bytes_used        276512202752
>>> sectorsize        4096
>>> nodesize        16384
>>> leafsize        16384
>>> stripesize        4096
>>> root_dir        6
>>> num_devices        1
>>> compat_flags        0x0
>>> compat_ro_flags        0x0
>>> incompat_flags        0x169
>>>              ( MIXED_BACKREF |
>>>                COMPRESS_LZO |
>>>                BIG_METADATA |
>>>                EXTENDED_IREF |
>>>                SKINNY_METADATA )
>>> csum_type        0
>>> csum_size        4
>>> cache_generation    165491
>>> uuid_tree_generation    165491
>>> dev_item.uuid        81eee7a6-774e-4bb5-8b72-cebb85a2f2ce
>>> dev_item.fsid        40b8240a-a0a2-4034-ae55-f8558c0343a8 [match]
>>> dev_item.type        0
>>> dev_item.total_bytes    858993459200
>>> dev_item.bytes_used    291072114688
>>> dev_item.io_align    4096
>>> dev_item.io_width    4096
>>> dev_item.sector_size    4096
>>> dev_item.devid        1
>>> dev_item.dev_group    0
>>> dev_item.seek_speed    0
>>> dev_item.bandwidth    0
>>> dev_item.generation    0
>> It is sad that skinny metadata will only affect new data, probably, I'll end
>> up re-creating it:(
>>
>> Can I rebalance it or something simple for this purpose?
> A balance won't help for that and also your metadata does look quite
> compact already. But I think you should not expect so much of this
> skinny metadata on a PC with 16G RAM
>
>>> Those are quite typical values for an already heavily used btrfs on a HDD.
>>
>> Bad news, since I'm doing mounting/unmounting few times during snapshots
>> creation because of how BTRFS works (source code:
>> https://github.com/nazar-pc/just-backup-btrfs/blob/master/just-backup-btrfs.php#L148)
>>
>> So if 10+20 seconds is typical, then in my case HDD can be very busy during
>> a minute or sometimes more, this is not good and basically part or even real
>> reason of initial question.
> Yes indeed! This mount/unmount every 15 minutes (or more times per 15
> minutes) is killing for performance IMO. At the moment I don't fully
> understand why you are bothered by the limitation you mention in the
> php source comments. I think it's definitely worth to change paths
> and/or your requirements in such a way that you can avoid the
> umount/mount.
>
> As a workaround, bcache with its cache device nicely filled over time,
> will absolutely speedup the mount. But as you had some troubles with
> btrfs in the past and also you use ext4 on the same disk because it is
> a more mature filesystem, you might not want bache+btrfs for backup
> storage, it is up to you.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



[-- Attachment #2: ÐÑÑÐ¿ÑÐ¾Ð³ÑÐ°ÑÑÑÐ½Ð¸Ð¹ Ð¿ÑÐ´Ð¿Ð¸Ñ S/MIME --]
[-- Type: application/pkcs7-signature, Size: 3825 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-03-15  0:47         ` Nazar Mokrynskyi
@ 2016-03-15 23:11           ` Henk Slager
  2016-03-16  3:37             ` Nazar Mokrynskyi
  0 siblings, 1 reply; 34+ messages in thread
From: Henk Slager @ 2016-03-15 23:11 UTC (permalink / raw)
  To: Nazar Mokrynskyi; +Cc: linux-btrfs

On Tue, Mar 15, 2016 at 1:47 AM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
> Some update since last time (few weeks ago).
>
> All filesystems are mounted with noatime, I've also added mounting
> optimization - so there is no problem with remounting filesystem every time,
> it is done only once.
>
> Remounting optimization helped by reducing 1 complete snapshot +
> send/receive cycle by some seconds, but otherwise it is still very slow when
> `btrfs receive` is active.

OK, great that the umount+mount is gone. I think most time is
unfortunately spent in seeks; I think over time and due to various
factors, both free space and files are highly fragmented on your disk.
It could also be that the disk is bit older and has or is starting to
use its spare sectors.

> I'm not considering bcache + btrfs as potential setup because I do not
> currently have free SSD for it and basically spending SSD besides HDD for
> backup partition feels like a bit of overkill (especially for desktop use).

Yes I think so too; For backup, I am also a bit reluctant to use
bcache. But the big difference is that you do a snapshot transfer
every 15minute while I do that only every 24hour. So I almost dont
care how long the send|receive takes in the middle of the night. I
also almost never look at the backups, and when I do, indeed scanning
through a 1000 snapshots fs on spinning disk takes time. If a script
does that every 15mins, and the fs uses LZO compression and there is
another active partition then you will have to deal with the slowness.
And if the files are mostly small, like source-trees, it gets even
worse. So it is about 100x more creates+deletes of subvolumes. To be
honest, it is just requiring too much from a HDD I think, knowing that
btrfs is CoW. On a fresh fs it might work OK in the beginning, but
over time...

You could adapt the script or backup method not to search every time,
but to just write the next diff send|receive and only step back and
search if this fails.

Or keeping more 15min snapshots only on SSD and lower the rate of
send|receive them to HDD

Another thing you could do is skip the receive step; So just pipe the
15min snapshot diff to a stream file and just leave it on the backup
HDD until you need files from the backup. Only then do a series of
incremental receives of the streams. An every now and then a full
(non-incremental) send.

> My current kernel is 4.5.0 stable, btrfs-tools still 4.4-1 from Ubuntu 16.04
> repository as of today.
>
> As I'm reading mailing list there are other folks having similar performance
> issues. So can we debug things to find the root cause and fix it at some
> point?

Indeed there are multiple reports with similar symptoms. I think it is
not really that one should see it as an error or root cause or some
fault. It is further optimization and then specifically for harddisks.
Or implementing additional concepts just for harddisks. For (parity)
RAID (by btrfs itself, not DM or MD etc), one can exploit parallelism,
but it is not trivial to get that fully optimized for all device
configurations and tasks.

> My C/C++/Kernel/BTRFS knowledges are scarce, which is why some assistance
> here is needed from someone more experienced.

It is all about HDD seek times in the first place. There are many
thoughts, articles and benchmarks about this over the years on the
internet, but I just found this one from last year about XFS:
https://lkml.org/lkml/2015/4/29/776

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-03-15 23:11           ` Henk Slager
@ 2016-03-16  3:37             ` Nazar Mokrynskyi
  2016-03-16  4:18               ` Chris Murphy
                                 ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Nazar Mokrynskyi @ 2016-03-16  3:37 UTC (permalink / raw)
  Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 12486 bytes --]

> It could also be that the disk is bit older and has or is starting to
> use its spare sectors.
I do not really think HDD is that old. I've got it brand new less than 
year ago. Here is smartctl output:

> nazar-pc@nazar-pc ~> sudo smartctl -a /dev/sda
> smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.5.0-haswell] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, 
> www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Samsung SpinPoint M9T
> Device Model:     ST2000LM003 HN-M201RAD
> Serial Number:    S34RJ9CF727799
> LU WWN Device Id: 5 0004cf 20dbc7ec5
> Firmware Version: 2BC10004
> User Capacity:    2 000 398 934 016 bytes [2,00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    5400 rpm
> Form Factor:      2.5 inches
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ATA8-ACS T13/1699-D revision 6
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is:    Wed Mar 16 01:25:17 2016 EET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x00)    Offline data collection 
> activity
>                     was never started.
>                     Auto Offline Data Collection: Disabled.
> Self-test execution status:      (   0)    The previous self-test 
> routine completed
>                     without error or no self-test has ever
>                     been run.
> Total time to complete Offline
> data collection:         (23760) seconds.
> Offline data collection
> capabilities:              (0x5b) SMART execute Offline immediate.
>                     Auto Offline data collection on/off support.
>                     Suspend Offline collection upon new
>                     command.
>                     Offline surface scan supported.
>                     Self-test supported.
>                     No Conveyance Self-test supported.
>                     Selective Self-test supported.
> SMART capabilities:            (0x0003)    Saves SMART data before 
> entering
>                     power-saving mode.
>                     Supports SMART auto save timer.
> Error logging capability:        (0x01)    Error logging supported.
>                     General Purpose Logging supported.
> Short self-test routine
> recommended polling time:      (   1) minutes.
> Extended self-test routine
> recommended polling time:      ( 396) minutes.
> SCT capabilities:            (0x003f)    SCT Status supported.
>                     SCT Error Recovery Control supported.
>                     SCT Feature Control supported.
>                     SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      
> UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   100   100   051 Pre-fail  
> Always       -       16
>   2 Throughput_Performance  0x0026   252   252   000 Old_age   
> Always       -       0
>   3 Spin_Up_Time            0x0023   088   086   025 Pre-fail  
> Always       -       3760
>   4 Start_Stop_Count        0x0032   100   100   000 Old_age   
> Always       -       840
>   5 Reallocated_Sector_Ct   0x0033   252   252   010 Pre-fail  
> Always       -       0
>   7 Seek_Error_Rate         0x002e   252   252   051 Old_age   
> Always       -       0
>   8 Seek_Time_Performance   0x0024   252   252   015 Old_age   
> Offline      -       0
>   9 Power_On_Hours          0x0032   100   100   000 Old_age   
> Always       -       6208
>  10 Spin_Retry_Count        0x0032   252   252   051 Old_age   
> Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000 Old_age   
> Always       -       678
> 191 G-Sense_Error_Rate      0x0022   100   100   000 Old_age   
> Always       -       11
> 192 Power-Off_Retract_Count 0x0022   252   252   000 Old_age   
> Always       -       0
> 194 Temperature_Celsius     0x0002   053   044   000 Old_age   
> Always       -       47 (Min/Max 17/56)
> 195 Hardware_ECC_Recovered  0x003a   100   100   000 Old_age   
> Always       -       0
> 196 Reallocated_Event_Count 0x0032   252   252   000 Old_age   
> Always       -       0
> 197 Current_Pending_Sector  0x0032   252   252   000 Old_age   
> Always       -       0
> 198 Offline_Uncorrectable   0x0030   252   252   000 Old_age   
> Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0036   200   200   000 Old_age   
> Always       -       0
> 200 Multi_Zone_Error_Rate   0x002a   100   100   000 Old_age   
> Always       -       20
> 223 Load_Retry_Count        0x0032   100   100   000 Old_age   
> Always       -       7
> 225 Load_Cycle_Count        0x0032   100   100   000 Old_age   
> Always       -       7035
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> No self-tests have been logged.  [To run self-tests, use: smartctl -t]
>
> SMART Selective self-test log data structure revision number 0
> Note: revision number not 1 implies that no selective self-test has 
> ever been run
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Completed [00% left] (0-65535)
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute 
> delay.
gnome-disks says it worked 15 days and 8 minutes

> I also almost never look at the backups, and when I do, indeed scanning
> through a 1000 snapshots fs on spinning disk takes time. If a script
> does that every 15mins, and the fs uses LZO compression and there is
> another active partition then you will have to deal with the slowness.
Well, it is not that bad and hard in reality. Every 15 minutes I'm 
transfering 3 diffs. Right now HDD contains 453 subvolumes totally, 34% 
of 359 GiB partition space used. After writing last message I've decided 
to collect diffs for further analysis.

So /home subvolume's diffs ranging from 6 to 270 MiB. Typically 30-40 MiB.

/root subvolume's diffs ranging from 10 KiB to 380 MiB (during software 
updates). Typically 40-80 KiB.

/web (source code here) subvolume's diffs ranging from bytes to 1 MiB, 
typically 150 KiB.

So generally when I'm watching movie or playing some game (not changing 
source code, not updating software and not doing anything that might 
cause significant changes in /home subvolume) I'll get about 30 MiB of 
diff in total. This is not that much for SATA3 HDD, it shouldn't stuck 
for some seconds when everything is so slow that video stops completely 
for few seconds.

Maybe BTRFS construction requires this small diff to make a big party 
all over HDD, I don't know, but there is some problem here for sure.

> You could adapt the script or backup method not to search every time,
> but to just write the next diff send|receive and only step back and
> search if this fails.
>
> Or keeping more 15min snapshots only on SSD and lower the rate of
> send|receive them to HDD
I'm not sure what you mean exactly by searching. My first SSD died 
during waking up from suspend mode, it worked perfectly till last 
moment. It was not used for critical data at that time, but now I 
understand clearly that SSD failure can happen at any time. Having RAID0 
of 2 SSDs it 2 times more risky, so I'm not ready to lose anything 
beyond 15 minutes threshold. I'd rather end up having another HDD purely 
for backup purposes.


Interesting question: is there any tool to see the whole picture about 
how btrfs partition is fragmented? I saw many tools for NTFS on Windows 
that show nice picture, but not for Linux filesystems. Saw answer on 
StackOverlow about fsck, but btrfsck doesn't provide similar output.

Also, I can't really run defragmentation anyway since all backups are 
read-only.

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Skype: nazar-pc
Diaspora: nazarpc@diaspora.mokrynskyi.com
Tox: A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249

On 16.03.16 01:11, Henk Slager wrote:
> On Tue, Mar 15, 2016 at 1:47 AM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
>> Some update since last time (few weeks ago).
>>
>> All filesystems are mounted with noatime, I've also added mounting
>> optimization - so there is no problem with remounting filesystem every time,
>> it is done only once.
>>
>> Remounting optimization helped by reducing 1 complete snapshot +
>> send/receive cycle by some seconds, but otherwise it is still very slow when
>> `btrfs receive` is active.
> OK, great that the umount+mount is gone. I think most time is
> unfortunately spent in seeks; I think over time and due to various
> factors, both free space and files are highly fragmented on your disk.
> It could also be that the disk is bit older and has or is starting to
> use its spare sectors.
>
>> I'm not considering bcache + btrfs as potential setup because I do not
>> currently have free SSD for it and basically spending SSD besides HDD for
>> backup partition feels like a bit of overkill (especially for desktop use).
> Yes I think so too; For backup, I am also a bit reluctant to use
> bcache. But the big difference is that you do a snapshot transfer
> every 15minute while I do that only every 24hour. So I almost dont
> care how long the send|receive takes in the middle of the night. I
> also almost never look at the backups, and when I do, indeed scanning
> through a 1000 snapshots fs on spinning disk takes time. If a script
> does that every 15mins, and the fs uses LZO compression and there is
> another active partition then you will have to deal with the slowness.
> And if the files are mostly small, like source-trees, it gets even
> worse. So it is about 100x more creates+deletes of subvolumes. To be
> honest, it is just requiring too much from a HDD I think, knowing that
> btrfs is CoW. On a fresh fs it might work OK in the beginning, but
> over time...
>
> You could adapt the script or backup method not to search every time,
> but to just write the next diff send|receive and only step back and
> search if this fails.
>
> Or keeping more 15min snapshots only on SSD and lower the rate of
> send|receive them to HDD
>
> Another thing you could do is skip the receive step; So just pipe the
> 15min snapshot diff to a stream file and just leave it on the backup
> HDD until you need files from the backup. Only then do a series of
> incremental receives of the streams. An every now and then a full
> (non-incremental) send.
>
>> My current kernel is 4.5.0 stable, btrfs-tools still 4.4-1 from Ubuntu 16.04
>> repository as of today.
>>
>> As I'm reading mailing list there are other folks having similar performance
>> issues. So can we debug things to find the root cause and fix it at some
>> point?
> Indeed there are multiple reports with similar symptoms. I think it is
> not really that one should see it as an error or root cause or some
> fault. It is further optimization and then specifically for harddisks.
> Or implementing additional concepts just for harddisks. For (parity)
> RAID (by btrfs itself, not DM or MD etc), one can exploit parallelism,
> but it is not trivial to get that fully optimized for all device
> configurations and tasks.
>
>> My C/C++/Kernel/BTRFS knowledges are scarce, which is why some assistance
>> here is needed from someone more experienced.
> It is all about HDD seek times in the first place. There are many
> thoughts, articles and benchmarks about this over the years on the
> internet, but I just found this one from last year about XFS:
> https://lkml.org/lkml/2015/4/29/776
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



[-- Attachment #2: ÐÑÑÐ¿ÑÐ¾Ð³ÑÐ°ÑÑÑÐ½Ð¸Ð¹ Ð¿ÑÐ´Ð¿Ð¸Ñ S/MIME --]
[-- Type: application/pkcs7-signature, Size: 3825 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-03-16  3:37             ` Nazar Mokrynskyi
@ 2016-03-16  4:18               ` Chris Murphy
  2016-03-16  4:23                 ` Nazar Mokrynskyi
  2016-03-16  4:22               ` Chris Murphy
  2016-03-17  7:00               ` Duncan
  2 siblings, 1 reply; 34+ messages in thread
From: Chris Murphy @ 2016-03-16  4:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: nazar

Very simplistically: visualizing Btrfs writes without file deletion,
it's a contiguous write. There isn't much scatter, even accounting for
metadata and data chunk writes happening in slightly different regions
of platter space. (I'm thinking this slow down happens overwhelmingly
on HDDs.)

If there are file deletions, holes appear, and now some later writes
will fill those holes, but not exactly, which will lead to
fragmentation and thus seek times. Seeks would go up by a lot the
smaller the holes are. And the holes are smaller the fewer files are
being deleted at once.

If there's a snapshot, and then file deletions, holes don't appear.
Everything is always copy on write and deleted files don't actually
get deleted (they're still in another subvolume). So as soon as a file
is reflinked or in a snapshotted subvolume there's no fragmentation
happening with file deletions.

If there's many snapshots happening in a short time, such as once
every 10 minutes, that means only 10 minutes worth of writes happening
in a given subvolume. If that space is later released by deleting
snapshots one at time (like a rolling snapshot and delete strategy
every 10 minutes) that means only small holes are opening up for later
writes. It's maybe the worst case scenario for fragmenting Btrfs.

A better way might be to delay snapshot deletion. Keep taking the
snapshots, but delete old snapshots in batches. Delete maybe 10 or 100
(if we're talking thousands of snapshots) at once. This should free a
lot more contiguous space for later writes and significantly reduce
the chance of significant fragmentation. Of course some fragmentation
is going to happen no matter what, but I think the usage pattern
described in a lot of these slow down cases sound to me like worse
case scenario for cow.

Now, a less lazy person would actually test this hypothesis.

Chris Murphy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-03-16  4:18               ` Chris Murphy
@ 2016-03-16  4:23                 ` Nazar Mokrynskyi
  2016-03-16  6:51                   ` Chris Murphy
  0 siblings, 1 reply; 34+ messages in thread
From: Nazar Mokrynskyi @ 2016-03-16  4:23 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2579 bytes --]

Sounds like a really good idea!

I'll try to implement in in my backup tool, but it might take some time 
to see real benefit from it (or no benefit:)).

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Skype: nazar-pc
Diaspora:nazarpc@diaspora.mokrynskyi.com
Tox: A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249

On 16.03.16 06:18, Chris Murphy wrote:
> Very simplistically: visualizing Btrfs writes without file deletion,
> it's a contiguous write. There isn't much scatter, even accounting for
> metadata and data chunk writes happening in slightly different regions
> of platter space. (I'm thinking this slow down happens overwhelmingly
> on HDDs.)
>
> If there are file deletions, holes appear, and now some later writes
> will fill those holes, but not exactly, which will lead to
> fragmentation and thus seek times. Seeks would go up by a lot the
> smaller the holes are. And the holes are smaller the fewer files are
> being deleted at once.
>
> If there's a snapshot, and then file deletions, holes don't appear.
> Everything is always copy on write and deleted files don't actually
> get deleted (they're still in another subvolume). So as soon as a file
> is reflinked or in a snapshotted subvolume there's no fragmentation
> happening with file deletions.
>
> If there's many snapshots happening in a short time, such as once
> every 10 minutes, that means only 10 minutes worth of writes happening
> in a given subvolume. If that space is later released by deleting
> snapshots one at time (like a rolling snapshot and delete strategy
> every 10 minutes) that means only small holes are opening up for later
> writes. It's maybe the worst case scenario for fragmenting Btrfs.
>
> A better way might be to delay snapshot deletion. Keep taking the
> snapshots, but delete old snapshots in batches. Delete maybe 10 or 100
> (if we're talking thousands of snapshots) at once. This should free a
> lot more contiguous space for later writes and significantly reduce
> the chance of significant fragmentation. Of course some fragmentation
> is going to happen no matter what, but I think the usage pattern
> described in a lot of these slow down cases sound to me like worse
> case scenario for cow.
>
> Now, a less lazy person would actually test this hypothesis.
>
>
> Chris Murphy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message tomajordomo@vger.kernel.org
> More majordomo info athttp://vger.kernel.org/majordomo-info.html



[-- Attachment #2: ÐÑÑÐ¿ÑÐ¾Ð³ÑÐ°ÑÑÑÐ½Ð¸Ð¹ Ð¿ÑÐ´Ð¿Ð¸Ñ S/MIME --]
[-- Type: application/pkcs7-signature, Size: 3825 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-03-16  4:23                 ` Nazar Mokrynskyi
@ 2016-03-16  6:51                   ` Chris Murphy
  2016-03-16 11:53                     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 34+ messages in thread
From: Chris Murphy @ 2016-03-16  6:51 UTC (permalink / raw)
  To: Nazar Mokrynskyi, Btrfs BTRFS

On Tue, Mar 15, 2016 at 10:23 PM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
> Sounds like a really good idea!
>
> I'll try to implement in in my backup tool, but it might take some time to
> see real benefit from it (or no benefit:)).

There is a catch. I'm not sure how much testing deleting 100
subvolumes at once gets. It should work. I haven't looked in xfstests
to see how much of this is being tested. So it's possible you're
testing it. So be ready.

Also since it's batched, consider doing it at night when it's not
used. The cleaner task will always slow things down because it has to
decrement reference counts and then find out what to actually delete
and then update metadata to reflect it. You could also add in a 5
minute delay after subvolume deletes then issue sysrq+w in case
there's significant blocked tasks, the logs will have extra debug
info.

Another idea is maybe graphically modeling (seekwatcher) the normal
write pattern, and see how it changes when even a single subvolume is
deleted (the current every 15 minute method). That might give an idea
how significantly cleaner tasks affect your particular workload. The
results might support batching not only to avoid fragmentation by
getting larger contiguous free space, but to avoid the IOPS hit during
the day time from too aggressive (?) of a cleaner task.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-03-16  6:51                   ` Chris Murphy
@ 2016-03-16 11:53                     ` Austin S. Hemmelgarn
  2016-03-16 20:58                       ` Chris Murphy
  0 siblings, 1 reply; 34+ messages in thread
From: Austin S. Hemmelgarn @ 2016-03-16 11:53 UTC (permalink / raw)
  To: Chris Murphy, Nazar Mokrynskyi, Btrfs BTRFS

On 2016-03-16 02:51, Chris Murphy wrote:
> On Tue, Mar 15, 2016 at 10:23 PM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
>> Sounds like a really good idea!
>>
>> I'll try to implement in in my backup tool, but it might take some time to
>> see real benefit from it (or no benefit:)).
>
> There is a catch. I'm not sure how much testing deleting 100
> subvolumes at once gets. It should work. I haven't looked in xfstests
> to see how much of this is being tested. So it's possible you're
> testing it. So be ready.
I've actually tested bulk removal of large numbers of snapshots multiple 
times before (it's actually one of the things that isn't in xfstests 
that I check when testing patches, I usually test power of two groups 
from 16 up to 256 at a time).  It works, but it may tie up most of the 
disk bandwidth for a while depending on what type of storage you're using.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-03-16 11:53                     ` Austin S. Hemmelgarn
@ 2016-03-16 20:58                       ` Chris Murphy
  0 siblings, 0 replies; 34+ messages in thread
From: Chris Murphy @ 2016-03-16 20:58 UTC (permalink / raw)
  To: Austin Hemmelgarn; +Cc: Chris Murphy, Nazar Mokrynskyi, Btrfs BTRFS

On Wed, Mar 16, 2016 at 5:53 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-03-16 02:51, Chris Murphy wrote:
>>
>> On Tue, Mar 15, 2016 at 10:23 PM, Nazar Mokrynskyi <nazar@mokrynskyi.com>
>> wrote:
>>>
>>> Sounds like a really good idea!
>>>
>>> I'll try to implement in in my backup tool, but it might take some time
>>> to
>>> see real benefit from it (or no benefit:)).
>>
>>
>> There is a catch. I'm not sure how much testing deleting 100
>> subvolumes at once gets. It should work. I haven't looked in xfstests
>> to see how much of this is being tested. So it's possible you're
>> testing it. So be ready.
>
> I've actually tested bulk removal of large numbers of snapshots multiple
> times before (it's actually one of the things that isn't in xfstests that I
> check when testing patches, I usually test power of two groups from 16 up to
> 256 at a time).  It works, but it may tie up most of the disk bandwidth for
> a while depending on what type of storage you're using.

Good to know.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-03-16  3:37             ` Nazar Mokrynskyi
  2016-03-16  4:18               ` Chris Murphy
@ 2016-03-16  4:22               ` Chris Murphy
  2016-03-17  7:00               ` Duncan
  2 siblings, 0 replies; 34+ messages in thread
From: Chris Murphy @ 2016-03-16  4:22 UTC (permalink / raw)
  To: Nazar Mokrynskyi; +Cc: linux-btrfs

Maybe a starting point.
https://oss.oracle.com/~mason/seekwatcher/

This shows write patterns, not current state fragmentation. So it's
less useful for what you're asking, and more useful for what I was
suggesting as a batched snapshot delete strategy.

Chris Murphy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-03-16  3:37             ` Nazar Mokrynskyi
  2016-03-16  4:18               ` Chris Murphy
  2016-03-16  4:22               ` Chris Murphy
@ 2016-03-17  7:00               ` Duncan
  2016-03-18 14:22                 ` Nazar Mokrynskyi
  2 siblings, 1 reply; 34+ messages in thread
From: Duncan @ 2016-03-17  7:00 UTC (permalink / raw)
  To: linux-btrfs

Nazar Mokrynskyi posted on Wed, 16 Mar 2016 05:37:02 +0200 as excerpted:

> I'm not sure what you mean exactly by searching. My first SSD died
> during waking up from suspend mode, it worked perfectly till last
> moment. It was not used for critical data at that time, but now I
> understand clearly that SSD failure can happen at any time. Having RAID0
> of 2 SSDs it 2 times more risky, so I'm not ready to lose anything
> beyond 15 minutes threshold. I'd rather end up having another HDD purely
> for backup purposes.

I understand the raid0 N times the danger part, which is why I only ever 
used raid0 on stuff like the distro packages cache that I could easily 
redownload from the net, here.

But seriously, what are you doing that you can't lose more than 15 
minutes of?  Couldn't it be even 20 minutes, or a half-hour or an hour 
with 15 minute snapshots only on the ssds (yes, I know the raid0 factor, 
but the question still applies)?

What /would/ you do if you lost a whole hour's worth of work?  Surely you 
could duplicate it in the next hour?  Or are you doing securities trading 
or something, where you /can't/ recover work at all, because by then the 
market and your world have moved on?  But in that case...

And perhaps more importantly for your data, btrfs is still considered 
"stabilizing, not fully stable and mature".  Use without backups is 
highly discouraged, but I'd suggest that btrfs in its current state might 
not be what you're looking for if you can't deal with loss of more than 
15 minutes worth of changes anyway.

Be that as it may...

Btrfs is definitely not yet optimized.  In many cases it reads or writes 
only one device at a time, for instance, even in RaidN configuration.  
And there are definitely snapshot scaling issues altho at your newer 500 
snapshots total that shouldn't be /too/ bad.

Dealing with reality, regardless of how or why, you currently have a 
situation of intolerably slow receives that needs addressed.  From a 
practical perspective you said an ssd for backups is ridiculous and I 
can't disagree, but there's another "throw hardware at it" solution that 
might be a bit more reasonable...

Spinning rust hard drives are cheap.  What about getting another one, and 
alternating your backup receives between them?  That would halve the load 
to one every thirty minutes, without changing your 15-minute snapshot and 
backup policy at all. =:^)

So that gives you two choices for halving the load to the spinning rust.  
Either decide you really can live with half-hour loss of data, or throw 
only a relatively small amount of money (well, as long as you have room 
to plug in another sata device anyway, otherwise...) at it for a second 
backup device, and alternate between them.

OTOH, since you mentioned possible coding, optimization might not be a 
bad thing, if you're willing to put in the time necessary to get up to 
speed with the code and can work with the other devs in terms of timing, 
etc.  But that will definitely take significant time even if you do it, 
and the alternating backup solution can be put to use as soon as you can 
get another device plugged in and setup. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-03-17  7:00               ` Duncan
@ 2016-03-18 14:22                 ` Nazar Mokrynskyi
  2016-05-27  1:57                   ` Nazar Mokrynskyi
  0 siblings, 1 reply; 34+ messages in thread
From: Nazar Mokrynskyi @ 2016-03-18 14:22 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 6909 bytes --]

> But seriously, what are you doing that you can't lose more than 15
> minutes of?  Couldn't it be even 20 minutes, or a half-hour or an hour
> with 15 minute snapshots only on the ssds (yes, I know the raid0 factor,
> but the question still applies)?
This is artificial psychological limit. Sometimes when you're actively 
coding, it is quite sad to loose even 5 minutes of work, since 
productivity is not constant. This is why 15 minutes was chosen as 
something that is not too critical. There is no other real reason behind 
this limit other than how I feel it.

> And perhaps more importantly for your data, btrfs is still considered
> "stabilizing, not fully stable and mature".  Use without backups is
> highly discouraged, but I'd suggest that btrfs in its current state might
> not be what you're looking for if you can't deal with loss of more than
> 15 minutes worth of changes anyway.
>
> Be that as it may...
>
> Btrfs is definitely not yet optimized.  In many cases it reads or writes
> only one device at a time, for instance, even in RaidN configuration.
> And there are definitely snapshot scaling issues altho at your newer 500
> snapshots total that shouldn't be /too/ bad.
As an (relatively) early adopter I'm fine using experimental stuff with 
extra safeties like backups (hey, I've used it even without those while 
back:)). I fully acknowledge what is current state of BTRFS and want to 
help make it even better by stressing issues that me and other users 
encounter, searching for solutions, etc.

> Dealing with reality, regardless of how or why, you currently have a
> situation of intolerably slow receives that needs addressed.  From a
> practical perspective you said an ssd for backups is ridiculous and I
> can't disagree, but there's another "throw hardware at it" solution that
> might be a bit more reasonable...
>
> Spinning rust hard drives are cheap.  What about getting another one, and
> alternating your backup receives between them?  That would halve the load
> to one every thirty minutes, without changing your 15-minute snapshot and
> backup policy at all. =:^)
>
> So that gives you two choices for halving the load to the spinning rust.
> Either decide you really can live with half-hour loss of data, or throw
> only a relatively small amount of money (well, as long as you have room
> to plug in another sata device anyway, otherwise...) at it for a second
> backup device, and alternate between them.
Yes, I'm leaning toward earning new hardware right now, fortunately, 
laptop allows me to insert 2 x mSATA + 2 x 2.5 SATA drives, so I have 
exactly 2.5 SATA slot free.

> OTOH, since you mentioned possible coding, optimization might not be a
> bad thing, if you're willing to put in the time necessary to get up to
> speed with the code and can work with the other devs in terms of timing,
> etc.  But that will definitely take significant time even if you do it,
> and the alternating backup solution can be put to use as soon as you can
> get another device plugged in and setup. =:^)
I'm not coding C/C++, so my capabilities to improve BTRFS itself are 
limited, but I'm always trying to find the reason and fix it instead of 
living with workarounds forever.

I'll play with Seekwatcher and optimizing snapshots deletion and will 
post an update afterwards.

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Skype: nazar-pc
Diaspora: nazarpc@diaspora.mokrynskyi.com
Tox: A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249

On 17.03.16 09:00, Duncan wrote:
> Nazar Mokrynskyi posted on Wed, 16 Mar 2016 05:37:02 +0200 as excerpted:
>
>> I'm not sure what you mean exactly by searching. My first SSD died
>> during waking up from suspend mode, it worked perfectly till last
>> moment. It was not used for critical data at that time, but now I
>> understand clearly that SSD failure can happen at any time. Having RAID0
>> of 2 SSDs it 2 times more risky, so I'm not ready to lose anything
>> beyond 15 minutes threshold. I'd rather end up having another HDD purely
>> for backup purposes.
> I understand the raid0 N times the danger part, which is why I only ever
> used raid0 on stuff like the distro packages cache that I could easily
> redownload from the net, here.
>
> But seriously, what are you doing that you can't lose more than 15
> minutes of?  Couldn't it be even 20 minutes, or a half-hour or an hour
> with 15 minute snapshots only on the ssds (yes, I know the raid0 factor,
> but the question still applies)?
>
> What /would/ you do if you lost a whole hour's worth of work?  Surely you
> could duplicate it in the next hour?  Or are you doing securities trading
> or something, where you /can't/ recover work at all, because by then the
> market and your world have moved on?  But in that case...
>
> And perhaps more importantly for your data, btrfs is still considered
> "stabilizing, not fully stable and mature".  Use without backups is
> highly discouraged, but I'd suggest that btrfs in its current state might
> not be what you're looking for if you can't deal with loss of more than
> 15 minutes worth of changes anyway.
>
> Be that as it may...
>
> Btrfs is definitely not yet optimized.  In many cases it reads or writes
> only one device at a time, for instance, even in RaidN configuration.
> And there are definitely snapshot scaling issues altho at your newer 500
> snapshots total that shouldn't be /too/ bad.
>
> Dealing with reality, regardless of how or why, you currently have a
> situation of intolerably slow receives that needs addressed.  From a
> practical perspective you said an ssd for backups is ridiculous and I
> can't disagree, but there's another "throw hardware at it" solution that
> might be a bit more reasonable...
>
> Spinning rust hard drives are cheap.  What about getting another one, and
> alternating your backup receives between them?  That would halve the load
> to one every thirty minutes, without changing your 15-minute snapshot and
> backup policy at all. =:^)
>
> So that gives you two choices for halving the load to the spinning rust.
> Either decide you really can live with half-hour loss of data, or throw
> only a relatively small amount of money (well, as long as you have room
> to plug in another sata device anyway, otherwise...) at it for a second
> backup device, and alternate between them.
>
>
> OTOH, since you mentioned possible coding, optimization might not be a
> bad thing, if you're willing to put in the time necessary to get up to
> speed with the code and can work with the other devs in terms of timing,
> etc.  But that will definitely take significant time even if you do it,
> and the alternating backup solution can be put to use as soon as you can
> get another device plugged in and setup. =:^)
>



[-- Attachment #2: ÐÑÑÐ¿ÑÐ¾Ð³ÑÐ°ÑÑÑÐ½Ð¸Ð¹ Ð¿ÑÐ´Ð¿Ð¸Ñ S/MIME --]
[-- Type: application/pkcs7-signature, Size: 3825 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-03-18 14:22                 ` Nazar Mokrynskyi
@ 2016-05-27  1:57                   ` Nazar Mokrynskyi
  0 siblings, 0 replies; 34+ messages in thread
From: Nazar Mokrynskyi @ 2016-05-27  1:57 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 5020 bytes --]

I've being running newer version of just-backup-btrfs, which was configured to remove snapshots in batches ~ at least 3x100 at once (this is what I typically have in 1.5-2 days).

Snapshots transferring become much faster, however when I delete 300 snapshots at once, well... you can imagine what happens, but I can afford this on desktop.

Seekwatcher fails to run on my system with following error:

~> sudo seekwatcher -t find.trace -o find.png -p 'find /backup_hdd > /dev/null' -d /dev/sda1

Traceback (most recent call last):
  File "/usr/bin/seekwatcher", line 58, in <module>
    from seekwatcher import rundata
  File "numpy.pxd", line 43, in seekwatcher.rundata (seekwatcher/rundata.c:7885)
ValueError: numpy.dtype does not appear to be the correct type object

I have no idea what does it mean, but generally I think if seeking because of fragmentation is a real cause of performance degradation, then this is something that BTRFS can improve, since I still have 65% of free space on BTRFS partition that receives snapshots and fragmentation in this case seems weird.

P.S. I've unsubscribed from mailing list, cc me on answers, please.

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Skype: nazar-pc
Diaspora: nazarpc@diaspora.mokrynskyi.com
Tox: A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249

On 18.03.16 16:22, Nazar Mokrynskyi wrote:
>> But seriously, what are you doing that you can't lose more than 15
>> minutes of?  Couldn't it be even 20 minutes, or a half-hour or an hour
>> with 15 minute snapshots only on the ssds (yes, I know the raid0 factor,
>> but the question still applies)?
> This is artificial psychological limit. Sometimes when you're actively coding, it is quite sad to loose even 5 minutes of work, since productivity is not constant. This is why 15 minutes was chosen as something that is not too critical. There is no other real reason behind this limit other than how I feel it.
>
>> And perhaps more importantly for your data, btrfs is still considered
>> "stabilizing, not fully stable and mature".  Use without backups is
>> highly discouraged, but I'd suggest that btrfs in its current state might
>> not be what you're looking for if you can't deal with loss of more than
>> 15 minutes worth of changes anyway.
>>
>> Be that as it may...
>>
>> Btrfs is definitely not yet optimized.  In many cases it reads or writes
>> only one device at a time, for instance, even in RaidN configuration.
>> And there are definitely snapshot scaling issues altho at your newer 500
>> snapshots total that shouldn't be /too/ bad.
> As an (relatively) early adopter I'm fine using experimental stuff with extra safeties like backups (hey, I've used it even without those while back:)). I fully acknowledge what is current state of BTRFS and want to help make it even better by stressing issues that me and other users encounter, searching for solutions, etc.
>
>> Dealing with reality, regardless of how or why, you currently have a
>> situation of intolerably slow receives that needs addressed.  From a
>> practical perspective you said an ssd for backups is ridiculous and I
>> can't disagree, but there's another "throw hardware at it" solution that
>> might be a bit more reasonable...
>>
>> Spinning rust hard drives are cheap.  What about getting another one, and
>> alternating your backup receives between them?  That would halve the load
>> to one every thirty minutes, without changing your 15-minute snapshot and
>> backup policy at all. =:^)
>>
>> So that gives you two choices for halving the load to the spinning rust.
>> Either decide you really can live with half-hour loss of data, or throw
>> only a relatively small amount of money (well, as long as you have room
>> to plug in another sata device anyway, otherwise...) at it for a second
>> backup device, and alternate between them.
> Yes, I'm leaning toward earning new hardware right now, fortunately, laptop allows me to insert 2 x mSATA + 2 x 2.5 SATA drives, so I have exactly 2.5 SATA slot free.
>
>> OTOH, since you mentioned possible coding, optimization might not be a
>> bad thing, if you're willing to put in the time necessary to get up to
>> speed with the code and can work with the other devs in terms of timing,
>> etc.  But that will definitely take significant time even if you do it,
>> and the alternating backup solution can be put to use as soon as you can
>> get another device plugged in and setup. =:^)
> I'm not coding C/C++, so my capabilities to improve BTRFS itself are limited, but I'm always trying to find the reason and fix it instead of living with workarounds forever.
>
> I'll play with Seekwatcher and optimizing snapshots deletion and will post an update afterwards.
>
> Sincerely, Nazar Mokrynskyi
> github.com/nazar-pc
> Skype: nazar-pc
> Diaspora: nazarpc@diaspora.mokrynskyi.com
> Tox: A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249
>



[-- Attachment #2: ÐÑÑÐ¿ÑÐ¾Ð³ÑÐ°ÑÑÑÐ½Ð¸Ð¹ Ð¿ÑÐ´Ð¿Ð¸Ñ S/MIME --]
[-- Type: application/pkcs7-signature, Size: 3825 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
@ 2016-02-22 19:39 Nazar Mokrynskyi
  0 siblings, 0 replies; 34+ messages in thread
From: Nazar Mokrynskyi @ 2016-02-22 19:39 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 9963 bytes --]

> > I have 2 SSD with BTRFS filesystem (RAID) on them and several
> > subvolumes. Each 15 minutes I'm creating read-only snapshot of
> > subvolumes /root, /home and /web inside /backup.
> > After this I'm searching for last common subvolume on /backup_hdd,
> > sending difference between latest common snapshot and simply latest
> > snapshot to /backup_hdd.
> > On top of all above there is snapshots rotation, so that /backup
> > contains much less snapshots than /backup_hdd.
> One thing thing that you imply, but don't actually make explicit except
> in the btrfs command output and mount options listing, is that /backup_hdd
> is a mountpoint for a second entirely independent btrfs (LABEL=Backup),
> while /backup is a subvolume on the primary / btrfs.  Knowing that is
> quite helpful in figuring out exactly what you're doing. =:^)
>
> Further, implied, but not explicit since some folks use hdd when
> referring to ssds as well, is that the /backup_hdd hdd is spinning rust,
> tho you do make it explicit that the primary btrfs is on ssds.
>
> > I'm using this setup for last 7 months or so and this is luckily the
> > longest period when I had no problems with BTRFS at all.
> > However, last 2+ months btrfs receive command loads HDD so much that I
> > can't even get list of directories in it.
> > This happens even if diff between snapshots is really small.
> > HDD contains 2 filesystems - mentioned BTRFS and ext4 for other files,
> > so I can't even play mp3 file from ext4 filesystem while btrfs receive
> > is running.
> > Since I'm running everything each 15 minutes this is a real headache.
>
> The *big* question is how many snapshots you have on LABEL=Backup, since
> you mention rotating backups in /backup, but don't mention rotating/
> thinning backups on LABEL=Backup, and do explicitly state that it has far
> more snapshots, and with four snapshots an hour, they'll build up rather
> fast if you aren't thinning them.
>
> The rest of this post assumes that's the issue, since you didn't mention
> thinning out the snapshots on LABEL=Backup.  If you're already familiar
> with the snapshot scaling issue and snapshot caps and thinning
> recommendations regularly posted here, feel free to skip the below as
> it'll simply be review. =:^)
>
> Btrfs has scaling issues when there's too many snapshots.  The
> recommendation I've been using is a target of no more than 250 snapshots
> per subvolume, with a target of no more than eight subvolumes and ideally
> no more than four subvolumes being snapshotted per filesystem, which
> doing the math leads to an overall filesystem target snapshot cap of
> 1000-2000, and definitely no more than 3000, tho by that point the
> scaling issues are beginning to kick in and you'll feel it in lost
> performance, particularly on spinning rust, when doing btrfs maintenance
> such as snapshotting, send/receive, balance, check, etc.
>
> Unfortunately, many people post here complaining about performance issues
> when they're running 10K+ or even 100K+ snapshots per filesystem and the
> various btrfs maintenance commands have almost ground to a halt. =:^(
>
> You say you're snapshotting three subvolumes, / /home and /web, at 15
> minute intervals.  That's 3*4=12 snapshots per hour, 12*24=288 snapshots
> per day.  If all those are on LABEL=Backup, you're hitting the 250
> snapshots per subvolume target in 250/4/24 = ... just over 2 and a half
> days.  And you're hitting the total per-filesystem snapshots target cap
> in 2000/288= ... just under seven days.
>
> If you've been doing that for 7 months with no thinning, that's
> 7*30*288= ... over 60K snapshots!  No *WONDER* you're seeing performance
> issues!
>
> Meanwhile, say you need a file from a snapshot from six months ago.  Are
> you *REALLY* going to care, or even _know_, exactly what 15 minute
> snapshot it was?  And even if you do, just digging thru 60K+ snapshots...
> OK, so we'll assume you sort them by snapshotted subvolume so only have
> to dig thru 20K+ snapshots... just digging thru 20K snapshots to find the
> exact 15-minute snapshot you need... is quite a bit of work!
>
> Instead, suppose you have a "reasonable" thinning program.  First, do you
> really need _FOUR_ snapshots an hour to LABEL=Backup?  Say you make it
> every 20 minutes, three an hour instead of four.  That already kills a
> third of them.  Then, say you take them every 15 or 20 minutes, but only
> send one per hour to LABEL=Backup.  (Or if you want, do them every 15
> minutes and send only ever other one, half-hourly to LABEL=Backup.  The
> point is to keep it both something you're comfortable with but also more
> reasonable.)
>
> For illustration, I'll say you send once an hour.  That's 3*24=72
> snapshots per day, 24/day per subvolume, already a great improvement over
> the 96/day/subvolume and 288/day total you're doing now.
>
> If then once a day, you thin down the third day back to every other hour,
> you'll have 2-3 days worth of hourly snapshots on LABEL=backup, so upto
> 72 hourly snapshots per subvolume.  If on the 8th day you thin down to
> six-hourly, 4/day, cutting out 2/3, you'll have five days of 12/day/
> subvolume, 60 snapshots per subvolume, plus the 72, 132 snapshots per
> subvolume total, to 8 days out so you can recover over a week's worth at
> at least 2-hourly, if needed.
>
> If then on the 32 day (giving you a month's worth of at least 4X/day),
> you cut every other one, giving you twice a day snapshots, that's 24 days
> of 2X/day or 48 snapshots per subvolume, plus the 132 from before, 180
> snapshots per subvolume total, now.
>
> If then on the 92 day (giving you two more months of 2X/day, a quarter's
> worth of at least 2X/day) you again thin every other one, to one per day,
> you have 60 days @ 2X/day or 120 snapshots per subvolume, plus the 180 we
> had already, 300 snapshots per subvolume, now.
>
> OK, so we're already over our target 250/subvolume, so we could thin a
> bit more drastically.  However, we're only snapshotting three subvolumes,
> so we can afford a bit of lenience on the per-subvolume cap as that's
> assuming 4-8 snapshotted subvolumes, and we're still well under our total
> filesystem snapshot cap.
>
> If then you keep another quarter's worth of daily snapshots, out to 183
> days, that's 91 days of daily snapshots, 91 per subvolume, on top of the
> 300 we had, so now 391 snapshots per subvolume.
>
> If you then thin to weekly snapshots, cutting 6/7, and keep them around
> another 27 weeks (just over half a year, thus over a year total), that's
> 27 more snapshots per subvolume, plus the 391 we had, 418 snapshots per
> subvolume total.
>
> 418 snapshots per subvolume total, starting at 3-4X per hour to /backup
> and hourly to LABEL=Backup, thinning down gradually to weekly after six
> months and keeping that for the rest of the year.  Given that you're
> snapshotting three subvolumes, that's 1254 snapshots total, still well
> within the 1000-2000 total snapshots per filesystem target cap.
>
> During that year, if the data is worth it, you should have done an offsite
> or at least offline backup, we'll say quarterly.  After that, keeping the
> local online backup around is merely for convenience, and with quarterly
> backups, after a year you have multiple copies and can simply delete the
> year-old snapshots, one a week, probably at the same time you thin down
> the six-month-old daily snapshots to weekly.
>
> Compare that just over 1200 snapshots to the 60K+ snapshots you may have
> now, knowing that scaling over 10K snapshots is an issue particularly on
> spinning rust, and you should be able to appreciate the difference it's
> likely to make. =:^)
>
> But at the same time, in practice it'll probably be much easier to
> actually retrieve something from a snapshot a few months old, because you
> won't have tens of thousands of effectively useless snapshots to sort
> thru as you will be regularly thinning them down! =:^)
>
> > ~> uname [-r]
> > 4.5.0-rc4-haswell
> >
> > ~> btrfs --version
> > btrfs-progs v4.4
>
> You're staying current with your btrfs versions.  Kudos on that! =:^)
>
> And on including btrfs fi show and btrfs fi df, as they were useful, tho
> I'm snipping them here.
>
> One more tip.  Btrfs quotas are known to have scaling issues as well.  If
> you're using them, they'll exacerbate the problem.  And while I'm not
> sure about current 4.4 status, thru 4.3 at least, they were buggy and not
> reliable anyway.  So the recommendation is to leave quotas off on btrfs,
> and use some other more mature filesystem where they're known to work
> reliably if you really need them.
>
> -- 
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
First of all, sorry for delay, for whatever reason was not subscribed to 
mailing list.

You are right, RAID is on 2 SSDs and backup_hdd (LABEL=Backup) is 
separate really HDD.

Example was simplified to give an overview to not dig too deep into 
details. I actually have correct backups rotation, so we are not talking 
about thousands of snapshots:)
Here is tool I've created and using right now: 
https://github.com/nazar-pc/just-backup-btrfs
I'm keeping all snapshots for last day, up to 90 for last month and up 
to 48 throughout the year.
So as result there are:
* 166 snapshots in /backup_hdd/root
* 166 snapshots in /backup_hdd/home
* 159 snapshots in /backup_hdd/web

I'm not using quotas, there is nothing on this BTRFS partition besides 
mentioned snapshots.

-- 
Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Skype: nazar-pc
Diaspora: nazarpc@diaspora.mokrynskyi.com
Tox: A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249



[-- Attachment #2: ÐÑÑÐ¿ÑÐ¾Ð³ÑÐ°ÑÑÑÐ½Ð¸Ð¹ Ð¿ÑÐ´Ð¿Ð¸Ñ S/MIME --]
[-- Type: application/pkcs7-signature, Size: 3825 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Major HDD performance degradation on btrfs receive
@ 2016-02-16  4:44 Nazar Mokrynskyi
  2016-02-16  9:10 ` Duncan
  2016-02-18 18:19 ` Henk Slager
  0 siblings, 2 replies; 34+ messages in thread
From: Nazar Mokrynskyi @ 2016-02-16  4:44 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3355 bytes --]

I have 2 SSD with BTRFS filesystem (RAID) on them and several 
subvolumes. Each 15 minutes I'm creating read-only snapshot of 
subvolumes /root, /home and /web inside /backup.
After this I'm searching for last common subvolume on /backup_hdd, 
sending difference between latest common snapshot and simply latest 
snapshot to /backup_hdd.
On top of all above there is snapshots rotation, so that /backup 
contains much less snapshots than /backup_hdd.

I'm using this setup for last 7 months or so and this is luckily the 
longest period when I had no problems with BTRFS at all.
However, last 2+ months btrfs receive command loads HDD so much that I 
can't even get list of directories in it.
This happens even if diff between snapshots is really small.
HDD contains 2 filesystems - mentioned BTRFS and ext4 for other files, 
so I can't even play mp3 file from ext4 filesystem while btrfs receive 
is running.
Since I'm running everything each 15 minutes this is a real headache.

My guess is that performance hit might be caused by filesystem 
fragmentation even though there is more than enough empty space. But I'm 
not sure how to properly check this and can't, obviously, run 
defragmentation on read-only subvolumes.

I'll be thankful for anything that might help to identify and resolve 
this issue.

~> uname -a
Linux nazar-pc 4.5.0-rc4-haswell #1 SMP Tue Feb 16 02:09:13 CET 2016 
x86_64 x86_64 x86_64 GNU/Linux

~> btrfs --version
btrfs-progs v4.4

~> sudo btrfs fi show
Label: none  uuid: 5170aca4-061a-4c6c-ab00-bd7fc8ae6030
     Total devices 2 FS bytes used 71.00GiB
     devid    1 size 111.30GiB used 111.30GiB path /dev/sdb2
     devid    2 size 111.30GiB used 111.29GiB path /dev/sdc2

Label: 'Backup'  uuid: 40b8240a-a0a2-4034-ae55-f8558c0343a8
     Total devices 1 FS bytes used 252.54GiB
     devid    1 size 800.00GiB used 266.08GiB path /dev/sda1

~> sudo btrfs fi df /
Data, RAID0: total=214.56GiB, used=69.10GiB
System, RAID1: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=4.00GiB, used=1.87GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

~> sudo btrfs fi df /backup_hdd
Data, single: total=245.01GiB, used=243.61GiB
System, DUP: total=32.00MiB, used=48.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=10.50GiB, used=8.93GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

Relevant mount options:
UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    / btrfs        
compress=lzo,noatime,relatime,ssd,subvol=/root    0 1
UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    /home btrfs        
compress=lzo,noatime,relatime,ssd,subvol=/home 0    1
UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    /backup btrfs        
compress=lzo,noatime,relatime,ssd,subvol=/backup 0    1
UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    /web btrfs        
compress=lzo,noatime,relatime,ssd,subvol=/web 0    1
UUID=40b8240a-a0a2-4034-ae55-f8558c0343a8    /backup_hdd btrfs        
compress=lzo,noatime,relatime,noexec 0    1

-- 
Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Skype: nazar-pc
Diaspora: nazarpc@diaspora.mokrynskyi.com
Tox: A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249

[-- Attachment #2: ÐÑÑÐ¿ÑÐ¾Ð³ÑÐ°ÑÑÑÐ½Ð¸Ð¹ Ð¿ÑÐ´Ð¿Ð¸Ñ S/MIME --]
[-- Type: application/pkcs7-signature, Size: 3825 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-16  4:44 Nazar Mokrynskyi
@ 2016-02-16  9:10 ` Duncan
  2016-02-18 18:19 ` Henk Slager
  1 sibling, 0 replies; 34+ messages in thread
From: Duncan @ 2016-02-16  9:10 UTC (permalink / raw)
  To: linux-btrfs

Nazar Mokrynskyi posted on Tue, 16 Feb 2016 05:44:30 +0100 as excerpted:

> I have 2 SSD with BTRFS filesystem (RAID) on them and several
> subvolumes. Each 15 minutes I'm creating read-only snapshot of
> subvolumes /root, /home and /web inside /backup.
> After this I'm searching for last common subvolume on /backup_hdd,
> sending difference between latest common snapshot and simply latest
> snapshot to /backup_hdd.
> On top of all above there is snapshots rotation, so that /backup
> contains much less snapshots than /backup_hdd.

One thing thing that you imply, but don't actually make explicit except 
in the btrfs command output and mount options listing, is that /backup_hdd 
is a mountpoint for a second entirely independent btrfs (LABEL=Backup), 
while /backup is a subvolume on the primary / btrfs.  Knowing that is 
quite helpful in figuring out exactly what you're doing. =:^)

Further, implied, but not explicit since some folks use hdd when 
referring to ssds as well, is that the /backup_hdd hdd is spinning rust, 
tho you do make it explicit that the primary btrfs is on ssds.

> I'm using this setup for last 7 months or so and this is luckily the
> longest period when I had no problems with BTRFS at all.
> However, last 2+ months btrfs receive command loads HDD so much that I
> can't even get list of directories in it.
> This happens even if diff between snapshots is really small.
> HDD contains 2 filesystems - mentioned BTRFS and ext4 for other files,
> so I can't even play mp3 file from ext4 filesystem while btrfs receive
> is running.
> Since I'm running everything each 15 minutes this is a real headache.

The *big* question is how many snapshots you have on LABEL=Backup, since 
you mention rotating backups in /backup, but don't mention rotating/
thinning backups on LABEL=Backup, and do explicitly state that it has far 
more snapshots, and with four snapshots an hour, they'll build up rather 
fast if you aren't thinning them.

The rest of this post assumes that's the issue, since you didn't mention 
thinning out the snapshots on LABEL=Backup.  If you're already familiar 
with the snapshot scaling issue and snapshot caps and thinning 
recommendations regularly posted here, feel free to skip the below as 
it'll simply be review. =:^)

Btrfs has scaling issues when there's too many snapshots.  The 
recommendation I've been using is a target of no more than 250 snapshots 
per subvolume, with a target of no more than eight subvolumes and ideally 
no more than four subvolumes being snapshotted per filesystem, which 
doing the math leads to an overall filesystem target snapshot cap of 
1000-2000, and definitely no more than 3000, tho by that point the 
scaling issues are beginning to kick in and you'll feel it in lost 
performance, particularly on spinning rust, when doing btrfs maintenance 
such as snapshotting, send/receive, balance, check, etc.

Unfortunately, many people post here complaining about performance issues 
when they're running 10K+ or even 100K+ snapshots per filesystem and the 
various btrfs maintenance commands have almost ground to a halt. =:^(

You say you're snapshotting three subvolumes, / /home and /web, at 15 
minute intervals.  That's 3*4=12 snapshots per hour, 12*24=288 snapshots 
per day.  If all those are on LABEL=Backup, you're hitting the 250 
snapshots per subvolume target in 250/4/24 = ... just over 2 and a half 
days.  And you're hitting the total per-filesystem snapshots target cap 
in 2000/288= ... just under seven days.

If you've been doing that for 7 months with no thinning, that's 
7*30*288= ... over 60K snapshots!  No *WONDER* you're seeing performance 
issues!

Meanwhile, say you need a file from a snapshot from six months ago.  Are 
you *REALLY* going to care, or even _know_, exactly what 15 minute 
snapshot it was?  And even if you do, just digging thru 60K+ snapshots... 
OK, so we'll assume you sort them by snapshotted subvolume so only have 
to dig thru 20K+ snapshots... just digging thru 20K snapshots to find the 
exact 15-minute snapshot you need... is quite a bit of work!

Instead, suppose you have a "reasonable" thinning program.  First, do you 
really need _FOUR_ snapshots an hour to LABEL=Backup?  Say you make it 
every 20 minutes, three an hour instead of four.  That already kills a 
third of them.  Then, say you take them every 15 or 20 minutes, but only 
send one per hour to LABEL=Backup.  (Or if you want, do them every 15 
minutes and send only ever other one, half-hourly to LABEL=Backup.  The 
point is to keep it both something you're comfortable with but also more 
reasonable.)

For illustration, I'll say you send once an hour.  That's 3*24=72 
snapshots per day, 24/day per subvolume, already a great improvement over 
the 96/day/subvolume and 288/day total you're doing now.

If then once a day, you thin down the third day back to every other hour, 
you'll have 2-3 days worth of hourly snapshots on LABEL=backup, so upto 
72 hourly snapshots per subvolume.  If on the 8th day you thin down to 
six-hourly, 4/day, cutting out 2/3, you'll have five days of 12/day/
subvolume, 60 snapshots per subvolume, plus the 72, 132 snapshots per 
subvolume total, to 8 days out so you can recover over a week's worth at 
at least 2-hourly, if needed.

If then on the 32 day (giving you a month's worth of at least 4X/day), 
you cut every other one, giving you twice a day snapshots, that's 24 days 
of 2X/day or 48 snapshots per subvolume, plus the 132 from before, 180 
snapshots per subvolume total, now.

If then on the 92 day (giving you two more months of 2X/day, a quarter's 
worth of at least 2X/day) you again thin every other one, to one per day, 
you have 60 days @ 2X/day or 120 snapshots per subvolume, plus the 180 we 
had already, 300 snapshots per subvolume, now.

OK, so we're already over our target 250/subvolume, so we could thin a 
bit more drastically.  However, we're only snapshotting three subvolumes, 
so we can afford a bit of lenience on the per-subvolume cap as that's 
assuming 4-8 snapshotted subvolumes, and we're still well under our total 
filesystem snapshot cap.

If then you keep another quarter's worth of daily snapshots, out to 183 
days, that's 91 days of daily snapshots, 91 per subvolume, on top of the 
300 we had, so now 391 snapshots per subvolume.

If you then thin to weekly snapshots, cutting 6/7, and keep them around 
another 27 weeks (just over half a year, thus over a year total), that's 
27 more snapshots per subvolume, plus the 391 we had, 418 snapshots per 
subvolume total.

418 snapshots per subvolume total, starting at 3-4X per hour to /backup 
and hourly to LABEL=Backup, thinning down gradually to weekly after six 
months and keeping that for the rest of the year.  Given that you're 
snapshotting three subvolumes, that's 1254 snapshots total, still well 
within the 1000-2000 total snapshots per filesystem target cap.

During that year, if the data is worth it, you should have done an offsite 
or at least offline backup, we'll say quarterly.  After that, keeping the 
local online backup around is merely for convenience, and with quarterly 
backups, after a year you have multiple copies and can simply delete the 
year-old snapshots, one a week, probably at the same time you thin down 
the six-month-old daily snapshots to weekly.

Compare that just over 1200 snapshots to the 60K+ snapshots you may have 
now, knowing that scaling over 10K snapshots is an issue particularly on 
spinning rust, and you should be able to appreciate the difference it's 
likely to make. =:^)

But at the same time, in practice it'll probably be much easier to 
actually retrieve something from a snapshot a few months old, because you 
won't have tens of thousands of effectively useless snapshots to sort 
thru as you will be regularly thinning them down! =:^)

> ~> uname [-r]
> 4.5.0-rc4-haswell
> 
> ~> btrfs --version
> btrfs-progs v4.4

You're staying current with your btrfs versions.  Kudos on that! =:^)

And on including btrfs fi show and btrfs fi df, as they were useful, tho 
I'm snipping them here.

One more tip.  Btrfs quotas are known to have scaling issues as well.  If 
you're using them, they'll exacerbate the problem.  And while I'm not 
sure about current 4.4 status, thru 4.3 at least, they were buggy and not 
reliable anyway.  So the recommendation is to leave quotas off on btrfs, 
and use some other more mature filesystem where they're known to work 
reliably if you really need them.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Major HDD performance degradation on btrfs receive
  2016-02-16  4:44 Nazar Mokrynskyi
  2016-02-16  9:10 ` Duncan
@ 2016-02-18 18:19 ` Henk Slager
  1 sibling, 0 replies; 34+ messages in thread
From: Henk Slager @ 2016-02-18 18:19 UTC (permalink / raw)
  To: linux-btrfs

On Tue, Feb 16, 2016 at 5:44 AM, Nazar Mokrynskyi <nazar@mokrynskyi.com> wrote:
> I have 2 SSD with BTRFS filesystem (RAID) on them and several subvolumes.
> Each 15 minutes I'm creating read-only snapshot of subvolumes /root, /home
> and /web inside /backup.
> After this I'm searching for last common subvolume on /backup_hdd, sending
> difference between latest common snapshot and simply latest snapshot to
> /backup_hdd.
> On top of all above there is snapshots rotation, so that /backup contains
> much less snapshots than /backup_hdd.
>
> I'm using this setup for last 7 months or so and this is luckily the longest
> period when I had no problems with BTRFS at all.
> However, last 2+ months btrfs receive command loads HDD so much that I can't
> even get list of directories in it.
> This happens even if diff between snapshots is really small.
> HDD contains 2 filesystems - mentioned BTRFS and ext4 for other files, so I
> can't even play mp3 file from ext4 filesystem while btrfs receive is
> running.
> Since I'm running everything each 15 minutes this is a real headache.
>
> My guess is that performance hit might be caused by filesystem fragmentation
> even though there is more than enough empty space. But I'm not sure how to
> properly check this and can't, obviously, run defragmentation on read-only
> subvolumes.
>
> I'll be thankful for anything that might help to identify and resolve this
> issue.
>
> ~> uname -a
> Linux nazar-pc 4.5.0-rc4-haswell #1 SMP Tue Feb 16 02:09:13 CET 2016 x86_64
> x86_64 x86_64 GNU/Linux
>
> ~> btrfs --version
> btrfs-progs v4.4
>
> ~> sudo btrfs fi show
> Label: none  uuid: 5170aca4-061a-4c6c-ab00-bd7fc8ae6030
>     Total devices 2 FS bytes used 71.00GiB
>     devid    1 size 111.30GiB used 111.30GiB path /dev/sdb2
>     devid    2 size 111.30GiB used 111.29GiB path /dev/sdc2
>
> Label: 'Backup'  uuid: 40b8240a-a0a2-4034-ae55-f8558c0343a8
>     Total devices 1 FS bytes used 252.54GiB
>     devid    1 size 800.00GiB used 266.08GiB path /dev/sda1
>
> ~> sudo btrfs fi df /
> Data, RAID0: total=214.56GiB, used=69.10GiB
> System, RAID1: total=8.00MiB, used=16.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=4.00GiB, used=1.87GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> ~> sudo btrfs fi df /backup_hdd
> Data, single: total=245.01GiB, used=243.61GiB
> System, DUP: total=32.00MiB, used=48.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, DUP: total=10.50GiB, used=8.93GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> Relevant mount options:
> UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    / btrfs
> compress=lzo,noatime,relatime,ssd,subvol=/root    0 1
> UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    /home btrfs
> compress=lzo,noatime,relatime,ssd,subvol=/home 0    1
> UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    /backup btrfs
> compress=lzo,noatime,relatime,ssd,subvol=/backup 0    1
> UUID=5170aca4-061a-4c6c-ab00-bd7fc8ae6030    /web btrfs
> compress=lzo,noatime,relatime,ssd,subvol=/web 0    1
> UUID=40b8240a-a0a2-4034-ae55-f8558c0343a8    /backup_hdd btrfs
> compress=lzo,noatime,relatime,noexec 0    1

As already indicated by Duncan, the amount of snapshots might be just
too much. The fragmentation on the HDD might have become very high. If
there is limited amount of RAM in the system (so limited caching), too
much time is lost in seeks. In addition:

 compress=lzo
this also increases the chance of scattering fragments and fragmentation.

 noatime,relatime
I am not sure why you have this. Hopefully you have the actual mount
listed as   noatime

You could use the principles of the tool/package called  snapper  to
do a sort of non-linear snapshot thinning: further back in time you
will have a much higher granularity of snapshot over a certain
timeframe.

You could use skinny metadata (recreate the fs with newer tools or use
btrfstune -x on /dev/sda1). I think at the moment this flag is not
enabled on /dev/sda1

If you put just 1 btrfs fs on the hdd (so move all the content from
the ext4 fs in the the btrfs fs) you might get better overall
performance. I assume the ext4 fs is on the second (slower part) of
the HDD and that is a disadvantage I think.
But you probably have reasons for why the setup is like it is.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2016-05-27  2:03 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-02-22 19:58 Major HDD performance degradation on btrfs receive Nazar Mokrynskyi
2016-02-22 23:30 ` Duncan
2016-02-23 17:26   ` Marc MERLIN
2016-02-23 17:34     ` Marc MERLIN
2016-02-23 18:01       ` Lionel Bouton
2016-02-23 18:30         ` Marc MERLIN
2016-02-23 20:35           ` Lionel Bouton
2016-02-24 10:01     ` Patrik Lundquist
2016-02-23 16:55 ` Nazar Mokrynskyi
2016-02-23 17:05   ` Alexander Fougner
2016-02-23 17:18     ` Nazar Mokrynskyi
2016-02-23 17:29       ` Alexander Fougner
2016-02-23 17:34         ` Nazar Mokrynskyi
2016-02-23 18:09           ` Austin S. Hemmelgarn
2016-02-23 17:44 ` Nazar Mokrynskyi
2016-02-24 22:32   ` Henk Slager
2016-02-24 22:46     ` Nazar Mokrynskyi
     [not found]     ` <ce805cd7-422c-ab6a-fbf8-18a304aa640d@mokrynskyi.com>
2016-02-25  1:04       ` Henk Slager
2016-03-15  0:47         ` Nazar Mokrynskyi
2016-03-15 23:11           ` Henk Slager
2016-03-16  3:37             ` Nazar Mokrynskyi
2016-03-16  4:18               ` Chris Murphy
2016-03-16  4:23                 ` Nazar Mokrynskyi
2016-03-16  6:51                   ` Chris Murphy
2016-03-16 11:53                     ` Austin S. Hemmelgarn
2016-03-16 20:58                       ` Chris Murphy
2016-03-16  4:22               ` Chris Murphy
2016-03-17  7:00               ` Duncan
2016-03-18 14:22                 ` Nazar Mokrynskyi
2016-05-27  1:57                   ` Nazar Mokrynskyi
  -- strict thread matches above, loose matches on Subject: below --
2016-02-22 19:39 Nazar Mokrynskyi
2016-02-16  4:44 Nazar Mokrynskyi
2016-02-16  9:10 ` Duncan
2016-02-18 18:19 ` Henk Slager

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).