Re: Snapshots slowing system

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: Snapshots slowing system
@ 2016-03-14 23:03 pete
  2016-03-15 15:52 ` Duncan
  0 siblings, 1 reply; 17+ messages in thread
From: pete @ 2016-03-14 23:03 UTC (permalink / raw)
  To: linux-btrfs

>pete posted on Sat, 12 Mar 2016 13:01:17 +0000 as excerpted:

>> I hope this message stays within the thread on the list.  I had email
>> problems and ended up hacking around with sendmail & grabbing the
>> message id off of the web based group archives.

>Looks like it should have as the reply-to looks right, but at least on 
>gmane's news/nntp archive of the list (which is how I read and reply), it 
>didn't.  But the thread was found easily enough.

Found out what had happened.  I think I had a quota full issue at my hosting
provider, I suspect bounce messages caused majordomo to unsubscribe me, the
very week I asked a quesiton.

Thanks for the huge response, and thanks also to Boris.

>>>>I wondered whether you had elimated fragmentation, or any other known
>>>>gotchas, as a cause?
>> 
>> Subvolumes are mounted with the following options:
>> autodefrag,relatime,compress=lzo,subvol=<sub vol name>>

>That relatime (which is the default), could be an issue.  See below.

I've now changed that to noatime.  I think I read or missread relatime as
a good comprimise sometime in the past.


>> Not sure if there is much else to do about fragmentation apart from
>> running a balance which would probally make thje machine v sluggish for
>> a day or so.
>> 
>>>>Out of curiosity, what is/was the utilisation of the disk? Were the
>>>>snapshots read-only or read-write?
>> 
>> root@phoenix:~# btrfs fi df /
>> Data, single: total=101.03GiB, used=97.91GiB
>> System, single: total=32.00MiB, used=16.00KiB
>> Metadata, single: total=8.00GiB, used=5.29GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>> 
>> root@phoenix:~# btrfs fi df /home
>> Data, RAID1: total=1.99TiB, used=1.97TiB
>> System, RAID1: total=32.00MiB, used=352.00KiB
>> Metadata, RAID1: total=53.00GiB, used=50.22GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B

>Normally when posting, either btrfs fi df *and* btrfs fi show are 
>needed, /or/ (with a new enough btrfs-progs) btrfs fi usage.  And of 
>course the kernel (4.0.4 in your case) and btrfs-progs (not posted, that 
>I saw) versions.

OK, I have usage.  For the SSD with the system:

root@phoenix:~# btrfs fi usage /
Overall:
    Device size:		 118.05GiB
    Device allocated:		 110.06GiB
    Device unallocated:		   7.99GiB
    Used:			 103.46GiB
    Free (estimated):		  11.85GiB	(min: 11.85GiB)
    Data ratio:			      1.00
    Metadata ratio:		      1.00
    Global reserve:		 512.00MiB	(used: 0.00B)

Data,single: Size:102.03GiB, Used:98.16GiB
   /dev/sda3	 102.03GiB

Metadata,single: Size:8.00GiB, Used:5.30GiB
   /dev/sda3	   8.00GiB

System,single: Size:32.00MiB, Used:16.00KiB
   /dev/sda3	  32.00MiB

Unallocated:
   /dev/sda3	   7.99GiB


Hmm.  A bit tight.  I've just ordered a replacement SSD. Slackware
should it in about 5GB+ of disk space I've seen on a website?  Hmm.  Don't
beleive that.  I'd allow at least 10GB and more if I want to add extra 
packages such as libreoffice.  If I have no snapshots it seems to get to
45GB with various extra packages installed and grows to 100ish with
snapshotting probally owing to updates.

Anyway, took the lazy, but less tearing less hair out route and ordered
a 500GB drive.  Prices have dropped and fortunately a new drive is not
a major issue.  Timing is also good with Slack 14.2 immanent. You
rarely hear people complaining about disk too empty problems...
   
   
For the traditional hard drives with the data:

root@phoenix:~# btrfs fi usage /home
Overall:
    Device size:		   5.46TiB
    Device allocated:		   4.09TiB
    Device unallocated:		   1.37TiB
    Used:			   4.04TiB
    Free (estimated):		 720.58GiB	(min: 720.58GiB)
    Data ratio:			      2.00
    Metadata ratio:		      2.00
    Global reserve:		 512.00MiB	(used: 0.00B)

Data,RAID1: Size:1.99TiB, Used:1.97TiB
   /dev/sdb	   1.99TiB
   /dev/sdc	   1.99TiB

Metadata,RAID1: Size:53.00GiB, Used:49.65GiB
   /dev/sdb	  53.00GiB
   /dev/sdc	  53.00GiB

System,RAID1: Size:32.00MiB, Used:352.00KiB
   /dev/sdb	  32.00MiB
   /dev/sdc	  32.00MiB

Unallocated:
   /dev/sdb	 699.49GiB
   /dev/sdc	 699.49GiB
root@phoenix:~# 

   

>> Hmm.  The system disk is getting a little tight. cddisk reports the
>> partition I use for btrfs containing root as 127GB approx.  Not sure why
>> it grows so much. Suspect that software updates can't help as snapshots
>> will contain the legacy versions.  On the other hand they can be useful.

>With the 127 GiB (I _guess_ it's GiB, 1024, not GB, 1000, multiplier, 
>btrfs consistently uses the 1024 multiplier and properly specifies it 
>using the XiB notation) for /, however, and the btrfs fi df sizes of 101 
>GiB plus data and 8 GiB metadata (with system's 32 MiB a rounding error 
>and global reserve actually taken from metadata, so it doesn't add to 
>chunk reservation on its own) we can see that as you mention, it's 
>starting to get tight, a bit under 110 GiB of 127 GiB, but that 17 GiB 
>free isn't horrible, just slightly tight, as you said.

>Tho it'll obviously be tighter if that's 127 GB, 1000 multiplier...

Note that the system btrfs does not get 127GB, it gets /dev/sda3, not
far off, but I've a 209MB partition for /boot and a 1G partition for a 
very cut down system for maintenance purposes (both ext4).  On the 
new drive I'll keep the 'maintenance' ext4 install but I could use 
/boot from that filesystem using bind mounts, a bit cleaner.



>It's tight enough that particularly with the regular snapshotting, btrfs 
>might be having to fragment more than it'd like.  Tho kudos for the 
>_excellent_ snapshot rotation.  We regularly see folks in here with 100K 
>or more snapshots per filesystem, and btrfs _does_ have scaling issues in 
>that case.  But your rotation seems to be keeping it well below the 1-3K 
>snapshots per filesystem recommended max, so that's obviously NOT you're 
>problem, unless of course the snapshot deletion bugged out and they 
>aren't being deleted as they should.

Yay, I've done it right at least somewhere...  I was assuming that was
on server hardware so I thought best to keep it tighter on my more 
modest desktop.

They are deleting.  The new ones are also read only now.


>(Of course, you can check that by listing them, and I would indeed double-
>check, as that _is_ the _usual_ problem we have with snapshots slowing 
>things down, simply too many of them, hitting the known scaling issues 
>btrfs had with over 10K snapshots per filesystem.  But FWIW I don't use 
>snapshots here and thus don't deal with snapshots command-level detail.)

Rarely use them except when I either delete the wrong file or do something
very sneaky but dumb like inavertently set umask for root and install
a package and break _lots_ of file system permissions.  Easiest to 
recover from a good snapshot than try to fix that mess...



>But as I mentioned above, that relatime mount option isn't your best 
>choice, in the presence of heavy snapshotting.  Unless you KNOW you need 
>atimes for something or other, noatime is _strongly_ recommended with 
>snapshotting, because relatime, while /relatively/ better than 
>strictatime, still updates atimes once a day for files you're accessing 
>at least that frequently.

Now noatime.


>And that interacts badly with snapshots, particularly where few of the 
>files themselves have changed, because in that case, a large share of the 
>changes from one snapshot to another are going to be those atime updates 
>themselves.  Ensuring that you're always using noatime avoids the atime 
>updates entirely (well, unless the file itself changes and thus mtime 
>changes as well), which should, in the normal most files unchanged 
>snapshotting context, make for much smaller snapshot-exclusive sizes.

>And you mention below that the snapshots are read-write, but generally 
>used as read-only.  Does that include actually mounting them read-only?  
>Because if not, and if they too are mounted the default relatime, 
>accessing them is obviously going to be updating atimes the relatime-
>default once per day there as well... triggering further divergence of 
>snapshots from the subvolumes they are snapshots of and from each other...

Actually they are normally not mounted.  Only mount them, or rather the 
default subvolume that contains them, on an as needed basis.  The script
that does the snapshotting mounts and then unmounts.


>> Is it likely the SSD?  If likely I could get a larger one, now is a good
>> time with a new version of slackware imminent.  However, no point in
>> spending money for the sake of it.

>Not directly btrfs related, but when you do buy a new ssd, now or later, 
>keep in mind that a lot of authorities recommend that for ssds you buy 
>10-33% larger than you plan on actually provisioning, and that you leave 
>that extra space entirely unprovisioned -- either leave that extra space 
>entirely unpartitioned, or partition it, but don't put filesystems or 
>anything else (swap, etc) on it.  This leaves those erase-blocks free to 
>be used by the FTL for additional wear-leveling block-swap, thus helping 
>maintain device speed as it ages, and with good wear-leveling firmware, 
>should dramatically increase device usable lifetime, as well.

Well, went OTT so got ordered a 500GB.  So if I put say 20GB as my 
'maintenance' partition, then the rest minus 100-150GB as btrfs and keep
the rest unallocated that should work well?


>FWIW, I ended up going rather overboard with that here, as I knew I 
<snip>

So have I.  The price seems almost linear per gigabyte perhaps?  
Suspected it was better to go larger if I could and delay the 
time until the new disk runs out.  Could put the old disk in the
laptop for experimentation with distros.


>>>>Apropos Nada: quick shout out to Qu to wish him luck for the 4.6 merge.
>> 
>> I'm wondering if it is time for an update from 4.0.4?

>The going list recommendation is to choose either current kernel track or 
>LTS kernel track.  If you choose current kernel, the recommendation is to 
>stick within 1-2 kernel cycles of newest current, which with 4.5 about to 
>come out, means you would be on 4.3 at the oldest, and be looking at 4.4 
>by now, again, on the current kernel track.

4.5 is out.  Maybe I ought to await 4.5.1 or .2 for any initial bugs to 
shake out.


>If you choose LTS kernels, until recently, the recommendation was again 
>the latest two, but here LTS kernel cycles.  That would be 4.4 as the 
>newest LTS and 4.1 previous to that.  However, 3.18, the LTS kernel 
>previous to 4.1, has been holding up reasonably well, so while 4.1 would 
>be preferred, 3.18 remains reasonably well supported as well.

Can't see the advantage to me for a LTS kernel.  In the past I've gone
for the latest and then updated the kernel with the new latest kernel.  
Distro maintainers might want LTS kernels but I'm not going to go from
say 4.1.10 to 4.1.19 when I can go to 4.5.

OK googled for a bit.  Upgrading within an LTS branch fixes bugs but 
reduces chances of breakage due to new functionality.


>You're on 4.0, which isn't an LTS kernel series and is thus, along with 
>4.2, out of upstream's support window.  So it's past time to look at 
>updating. =:^)  Given that you obviously do _not_ follow the last couple 

Whilst everything worked fine and there were no security horrors there was
no need to update.

Kind regards,

Pete

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Snapshots slowing system
  2016-03-14 23:03 Snapshots slowing system pete
@ 2016-03-15 15:52 ` Duncan
  2016-03-15 22:29   ` Peter Chant
  0 siblings, 1 reply; 17+ messages in thread
From: Duncan @ 2016-03-15 15:52 UTC (permalink / raw)
  To: linux-btrfs

pete posted on Mon, 14 Mar 2016 23:03:52 +0000 as excerpted:

> [Duncan wrote...]

>>pete posted on Sat, 12 Mar 2016 13:01:17 +0000 as excerpted:
>>> 
>>> Subvolumes are mounted with the following options:
>>> autodefrag,relatime,compress=lzo,subvol=<sub vol name>>
> 
>>That relatime (which is the default), could be an issue.  See below.
> 
> I've now changed that to noatime.  I think I read or missread relatime
> as a good comprimise sometime in the past.

Well, "good" is relative (ha! much like relatime itself! =:^).

Relatime is certainly better than strictatime as it cuts down on atime 
updates quite a bit, and as a default it's a reasonable compromise (at 
least for most filesystems), because it /does/ do a pretty good job of 
eliminating /most/ atime updates while still doing the minimal amount to 
avoid breaking all known apps that still rely on what is mostly a legacy 
POSIX feature that very little actually modern software actually relies 
on any more.

For normal filesystems and normal use-cases, relatime really is a 
reasonably "good" compromise.  But btrfs is definitely not a traditional 
filesystem, relying as it does on COW, and snapshotting is even more 
definitely not a traditional filesystem feature.  Relatime does still 
work, but it's just not particularly suitable to frequent snapshotting.

Meanwhile, so little actually depends on atime these days, that unless 
you're trying to work out a compromise solution for a kernel with a 
standing rule that breaking working userspace is simply not acceptable, 
the context in which relatime was developed and for which it really is a 
good compromise, chances are pretty high that unless you are running 
something like mutt that is /known/ to need atime, you can simply set 
noatime and forget about it.

And I'm sure, were the kernel rules on avoiding breaking old but 
otherwise still working userspace somewhat less strict, noatime would be 
the kernel default now, as well.

Meanwhile, FWIW, some months ago I finally got tired of having to specify 
noatime on all my mounts, expanding my fstab width by 8 chars (including 
the ,) and the total fstab character count by several multiples of that 
as I added it to all entries, and decided to see if I might per chance, 
even as a sysadmin not a dev, be able to come up with a patch that 
changed the kernel default to noatime.  It wasn't actually hard, tho were 
I a coder and actually knew what I was doing, I imagine I could create a 
much better patch.  So now all my filesystems (barring a few of the 
memory-only virtual-filesystem mounts) are mounted noatime by default, as 
opposed to the unpatched relatime, and I was able to take all the noatimes 
out of my fstab. =:^)

>>Normally when posting, either btrfs fi df *and* btrfs fi show are
>>needed, /or/ (with a new enough btrfs-progs) btrfs fi usage.  And of
>>course the kernel (4.0.4 in your case) and btrfs-progs (not posted, that
>>I saw) versions.
> 
> OK, I have usage.  For the SSD with the system:
> 
> root@phoenix:~# btrfs fi usage /
> Overall:
>     Device size:		 118.05GiB
>     Device allocated:		 110.06GiB
>     Device unallocated:	   7.99GiB
>     Used:			 103.46GiB
>     Free (estimated):		  11.85GiB	(min: 11.85GiB)
>     Data ratio:		      1.00
>     Metadata ratio:		      1.00
>     Global reserve:		 512.00MiB	(used: 0.00B)
> 
> Data,single: Size:102.03GiB, Used:98.16GiB
>    /dev/sda3	 102.03GiB
> 
> Metadata,single: Size:8.00GiB, Used:5.30GiB
>    /dev/sda3	   8.00GiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB
>    /dev/sda3	  32.00MiB
> 
> Unallocated:
>    /dev/sda3	   7.99GiB
> 
> 
> Hmm.  A bit tight.  I've just ordered a replacement SSD.

While ~8 GiB unallocated on a ~118 GiB filesystem is indeed a bit tight, 
it's nothing that should be giving btrfs fits yet.

Tho even with autodefrag, given the previous relatime and snapshotting, 
it could be that the free-space in existing chunks is fragmented, which 
over time and continued usage would force higher file fragmentation 
despite the autodefrag, since there simply aren't any large contiguous 
free-space areas left in which to write files.

> Slackware
> should it in about 5GB+ of disk space I've seen on a website?  Hmm. 
> Don't beleive that.  I'd allow at least 10GB and more if I want to add
> extra packages such as libreoffice.  If I have no snapshots it seems to
> get to 45GB with various extra packages installed and grows to 100ish
> with snapshotting probally owing to updates.

FWIW, here on gentoo and actually using separate partitions and btrfs,
/not/ btrfs subvolumes (because I don't want all my data eggs in the same 
filesystem basket, should that filesystem go bad)...

My / is 8 GiB (per device, btrfs raid1 both data and metadata on 
partitions from two ssds, so same stuff on each device) including all 
files installed by packages except some individual subdirs in /var/ which 
are symlinked to dirs in /home/var/ where necessary, because I keep / 
read-only mounted by default, and some services want a writable /var/ 
config.

Tho I don't have libreoffice installed, nor multiple desktop environments 
as I prefer (a much slimmed down) kde, but I have had multiple versions 
of kde (kde 3/4 back when, kde 4/5 more recently) installed at the same 
time as I was switching from one to the other.  While gentoo allows 
pulling in rather fewer deps than many distros if one is conservative 
with their USE flag settings, that's probably roughly canceled out by the 
fact that it's build-from-source and thus all the developer package 
halves not installed on binary distros need installed on gentoo, in 
ordered to build packages that depend on them.

Anyway, with compress=lzo, here's my root usage:

$$ sudo btrfs fi usage /
Overall:
    Device size:                  16.00GiB
    Device allocated:              9.06GiB
    Device unallocated:            6.94GiB
    Device missing:                  0.00B
    Used:                          5.41GiB
    Free (estimated):              4.99GiB      (min: 4.99GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:               64.00MiB      (used: 0.00B)

Data,RAID1: Size:4.00GiB, Used:2.47GiB
   /dev/sda5       4.00GiB
   /dev/sdb5       4.00GiB

Metadata,RAID1: Size:512.00MiB, Used:237.55MiB
   /dev/sda5     512.00MiB
   /dev/sdb5     512.00MiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
   /dev/sda5      32.00MiB
   /dev/sdb5      32.00MiB

Unallocated:
   /dev/sda5       3.47GiB
   /dev/sdb5       3.47GiB

So of that 8 gig (per device, two device raid1), nearly half, ~3.5 GiB, 
remains unallocated.  Data is 4 GiB allocated, ~2.5 GiB used.  Metadata 
is half a GiB allocated, just over half used, and there's 32 MiB of 
system allocated as well, with trivial usage.  Including both the 
allocated but unused data space and the entirely unallocated space, I 
should still be able to write nearly 5 GiB (the free estimate already 
accounts for the raid1).

Regular df (not btrfs fi df) reports similar numbers, 8192 MiB total, 
2836 MiB used, 5114 MiB available, tho with non-btrfs df the numbers are 
going to be fuzzy since its understanding of btrfs internals is somewhat 
fuzzy.

But either way, given the LZO compression it appears I've used under half 
the 8 GiB capacity.  Meanwhile, du -xBM / says 4158M, so just over half 
in uncompressed data  (with --apparent-size added it says 3624M).

So installation-only may well fit in under 5 GiB, and indeed, some years 
ago (before btrfs and the ssds, so reiserfs on spinning rust), I was 
running 5 GiB /, which on reiserfs was possible due to tail packing even 
without compression, but it was indeed a bit tighter than I was 
comfortable with, thus the 8 GiB I'm much happier with, today, when I 
partitioned up the ssds with btrfs and lzo compression in mind.

My /home is 20 GiB (per device, dual-ssd-partition btrfs raid1), tho 
that's with a separate media partition and will obviously vary *GREATLY* 
per person/installation.   My distro's git tree and overlays, along with 
sources tarball cache, built binpkgs cache, ccache build cache, and 
mainline kernel git repo, is 24 GiB.

Log is separate to avoid runaway logging filling up more critical 
filesystems and is tiny, 640 MiB, which I'll make smaller, possibly half 
a GiB, next time I repartition.

Boot is an exception to the usual btrfs raid1, with a separate working 
boot partition on one device and its backup on the other, so I can point 
the BIOS at and boot either one.  It's btrfs mixed-bg mode dup, 256 MiB 
for each of working and backup, which because it's dup means 128 MiB 
capacity.  That's actually a bit small, and why I'll be shrinking the log 
partition the next time I repartition.  Making it 384 MiB dup, for 192 
MiB capacity, would be much better, and since I can shrink the log 
partition by that and still keep the main partitions GiB aligned, it all 
works out.

Under the GiB boundary in addition to boot and log I also have separate 
BIOS and EFI partitions.  Yes, both, for compatibility. =:^)  The sizes 
of all the sub-GiB partitions are calculated so (as I mentioned) the main 
partitions are all GiB aligned.

Further, all main partitions have both a working and a backup partition, 
the same size, which combined with the dual-SSD btrfs raid1 and a btrfs 
dup boot on each device, gives me both working copy and primary backups 
on the SSDs (except for log, which is btrfs raid1 but without a backup 
copy as I didn't see the point).

As I mentioned elsewhere, with another purpose-dedicated partition or two 
and their backups, that's about 130 GiB out of the 256 GB ssds, with the 
rest left unpartitioned for use by the ssd FTL.

I also mentioned a media partition.  That's on spinning rust, along with 
the secondary backups for the main system.  It too is bootable on its 
own, should I need to resort to that, tho I don't keep the secondary 
backups near as current as the primary backups on the SSDs, because I 
figure between the raid1 and the primary backups on the ssds, there's a 
relatively small chance I'll actually have to resort to the secondary 
backups on spinning rust.

> Anyway, took the lazy, but less tearing less hair out route and ordered
> a 500GB drive.  Prices have dropped and fortunately a new drive is not a
> major issue.  Timing is also good with Slack 14.2 immanent. You rarely
> hear people complaining about disk too empty problems...

If I had 500 GiB SSDs like the one you're getting, I could put the media 
partition on SSDs and be rid of the spinning rust entirely.  But I seem 
to keep finding higher priorities for the money I'd spend on a pair of 
them...

(Tho I'm finding I do online media enough these days that I don't use the 
media partition so much these days.  I could probably go thru it, delete 
some stuff, and shrink what I have stored on it.  Given the near 50% 
unpartitioned space on the SSDs if I could get it to 64 GiB or under, I'd 
still have the recommended 20% unallocated space for the FTL to use, and 
wouldn't need to wait to upgrade the SSDs to put media on the SSDs and 
could then unplug the then only "secondary backup usage" spinning rust, 
except for doing those backups.)

> Note that the system btrfs does not get 127GB, it gets /dev/sda3, not
> far off, but I've a 209MB partition for /boot and a 1G partition for a
> very cut down system for maintenance purposes (both ext4).  On the new
> drive I'll keep the 'maintenance' ext4 install but I could use /boot
> from that filesystem using bind mounts, a bit cleaner.

Good point.  Similar here except the backup/maintenance isn't a cutdown 
system, it's a snapshot (in time, not btrfs snapshot) of exactly what was 
on the system when I did the backup.  That way, should it be necessary, I 
can boot the backup and have a fully functional system exactly as it was 
the day I took that backup.  That's very nice to have for a maintenance 
setup, since it means I have access to full manpages, even a full X, 
media players, a full graphical browser to google my problems with, etc.

And of course I have it partitioned up into much smaller pieces, with the 
second device in raid1 as well as having the backup partition copies.

> Rarely use them except when I either delete the wrong file or do
> something very sneaky but dumb like inavertently set umask for root and
> install a package and break _lots_ of file system permissions.  Easiest
> to recover from a good snapshot than try to fix that mess...

Of course snapshots aren't backups as if the filesystem goes south, it 
takes the snapshots with it.  But it's still great for fat-fingering 
issues, as you mention.  But I still prefer smaller and easier/faster 
maintained partitions, with backup partition copies that are totally 
independent filesystems from the working copies.  Between that and the 
btrfs raid1 to cover device failure, AND secondary backups on spinning 
rust, I guess I'm /reasonably/ prepared.

(I don't worry much about or bother with offsite backups, however, as I 
figure if I'm forced to resort to them, I'll have a whole lot more 
important things to worry about, like where I'm going to live if a fire 
or whatever took them out, or simply what sort of computer I'll replace 
it with and how I'll actually set it up, if it was simply burglarized.  
After all, the real important stuff is in my head anyway, and if I lose 
/that/ backup I'm not going to be caring much about /anything/, so...)

>>FWIW, I ended up going rather overboard with that here, as I knew I
> <snip>
> 
> So have I.  The price seems almost linear per gigabyte perhaps?
> Suspected it was better to go larger if I could and delay the time until
> the new disk runs out.  Could put the old disk in the laptop for
> experimentation with distros.

It seems to be more or less linear within a sweet-spot, yes.  Back when I 
bought mine, the sweet-spot was 32-256 GiB or so, smaller you paid more 
due to overhead, larger simply wasn't manufactured in high enough 
quantities yet.

Now it seems the sweet-spot is 256 GB to 1 TB, at around 3 GB/USD
low end price (pricewatch.com, SATA-600).  (128 GB is available at that, 
but only for bare laptop OEM models, M2 I'd guess, possibly used.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Snapshots slowing system
  2016-03-15 15:52 ` Duncan
@ 2016-03-15 22:29   ` Peter Chant
  2016-03-16 11:39     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Chant @ 2016-03-15 22:29 UTC (permalink / raw)
  To: linux-btrfs

On 03/15/2016 03:52 PM, Duncan wrote:

<snip>

> Meanwhile, FWIW, some months ago I finally got tired of having to specify 
> noatime on all my mounts, expanding my fstab width by 8 chars (including 
> the ,) and the total fstab character count by several multiples of that 
> as I added it to all entries, and decided to see if I might per chance, 
> even as a sysadmin not a dev, be able to come up with a patch that 
> changed the kernel default to noatime.  It wasn't actually hard, tho were 
> I a coder and actually knew what I was doing, I imagine I could create a 
> much better patch.  So now all my filesystems (barring a few of the 
> memory-only virtual-filesystem mounts) are mounted noatime by default, as 
> opposed to the unpatched relatime, and I was able to take all the noatimes 
> out of my fstab. =:^)

It is a pity you cannot use variables or macros in fstab.  Its not too
bad with traditional file systems on my home user machine but with
multiple subvolumes my fstab is huge and there is a lot of repetition
of the options.



>> Hmm.  A bit tight.  I've just ordered a replacement SSD.
> 
> While ~8 GiB unallocated on a ~118 GiB filesystem is indeed a bit tight, 
> it's nothing that should be giving btrfs fits yet.
> 

Too late, drive ordered.  It was only a matter of time anyway.


> Tho even with autodefrag, given the previous relatime and snapshotting, 
> it could be that the free-space in existing chunks is fragmented, which 
> over time and continued usage would force higher file fragmentation 
> despite the autodefrag, since there simply aren't any large contiguous 
> free-space areas left in which to write files.
>

Hmm. The following returns instantly as if it were a null operation.
btrfs fi defrag /

I thought though that btrfs fi defrag <name> would only defrag the one
file or directory?

btrfs fi defrag /srv/photos/
Is considerably slower, it is still running.  Disk light is on solid.
Processes kworker and btrfs-transacti are pretty busy according to iotop.

<snip>


> But either way, given the LZO compression it appears I've used under half 
> the 8 GiB capacity.  Meanwhile, du -xBM / says 4158M, so just over half 
> in uncompressed data  (with --apparent-size added it says 3624M).
> 

I seem to install a lot of interesting looking things I barely use.  I
am surprised about how full the filesystem gets, it should not.
However, large disks make life much easier rather than routing out
unused packages as a hobby.  Unless it gets silly.


<snip>
> 
> Boot is an exception to the usual btrfs raid1, with a separate working 
> boot partition on one device and its backup on the other, so I can point 
> the BIOS at and boot either one.  It's btrfs mixed-bg mode dup, 256 MiB 
> for each of working and backup, which because it's dup means 128 MiB 
> capacity.  That's actually a bit small, and why I'll be shrinking the log 
> partition the next time I repartition.  Making it 384 MiB dup, for 192 
> MiB capacity, would be much better, and since I can shrink the log 
> partition by that and still keep the main partitions GiB aligned, it all 
> works out.
> 

Slackware uses lilo so I need a separate /boot with something that is
supported by lilo.

<snip>

> If I had 500 GiB SSDs like the one you're getting, I could put the media 
> partition on SSDs and be rid of the spinning rust entirely.  But I seem 
> to keep finding higher priorities for the money I'd spend on a pair of 
> them...


I'm getting one, not two, so the system is raid0.  Data is more
important (and backed up).

> 
<snip>

> Good point.  Similar here except the backup/maintenance isn't a cutdown 
> system, it's a snapshot (in time, not btrfs snapshot) of exactly what was 
> on the system when I did the backup.  That way, should it be necessary, I 
> can boot the backup and have a fully functional system exactly as it was 
> the day I took that backup.  That's very nice to have for a maintenance 
> setup, since it means I have access to full manpages, even a full X, 
> media players, a full graphical browser to google my problems with, etc.
> 
I have that as well.  But the non-btrfs maintenance partition is there
in case btrfs is unbootable.

-- 
Peter Chant

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Snapshots slowing system
  2016-03-15 22:29   ` Peter Chant
@ 2016-03-16 11:39     ` Austin S. Hemmelgarn
  2016-03-17 21:08       ` Pete
  0 siblings, 1 reply; 17+ messages in thread
From: Austin S. Hemmelgarn @ 2016-03-16 11:39 UTC (permalink / raw)
  To: Peter Chant, linux-btrfs

On 2016-03-15 18:29, Peter Chant wrote:
> On 03/15/2016 03:52 PM, Duncan wrote:
>> Tho even with autodefrag, given the previous relatime and snapshotting,
>> it could be that the free-space in existing chunks is fragmented, which
>> over time and continued usage would force higher file fragmentation
>> despite the autodefrag, since there simply aren't any large contiguous
>> free-space areas left in which to write files.
>>
>
> Hmm. The following returns instantly as if it were a null operation.
> btrfs fi defrag /
That should return almost immediately, as defrag isn't recursive by 
default, and / should only have at most about 16-20 directory entries.
>
> I thought though that btrfs fi defrag <name> would only defrag the one
> file or directory?
It does, it's just not recursive unless you tell it to be.
>
> btrfs fi defrag /srv/photos/
> Is considerably slower, it is still running.  Disk light is on solid.
> Processes kworker and btrfs-transacti are pretty busy according to iotop.
If you have a lot of items in /srv/photos/ (be it either lots of 
individual files, or lots of directories at the top level), then this is 
normal, if not, then you may have found a bug.
>>
>> Boot is an exception to the usual btrfs raid1, with a separate working
>> boot partition on one device and its backup on the other, so I can point
>> the BIOS at and boot either one.  It's btrfs mixed-bg mode dup, 256 MiB
>> for each of working and backup, which because it's dup means 128 MiB
>> capacity.  That's actually a bit small, and why I'll be shrinking the log
>> partition the next time I repartition.  Making it 384 MiB dup, for 192
>> MiB capacity, would be much better, and since I can shrink the log
>> partition by that and still keep the main partitions GiB aligned, it all
>> works out.
>>
>
> Slackware uses lilo so I need a separate /boot with something that is
> supported by lilo.
I would like to point out that just because the distribution prefers one 
package doesn't mean you can't use another, it's just not quite as easy. 
  It's worth noting that I do similarly to Duncan in this respect 
though, although I provisioned 512MiB when I set things up (and stuck 
the BIOS boot partition (because I use GPT on everything these days) in 
the unaligned slack space between the partition table and /boot).  It 
also has the advantage that I can fall back to old versions of the 
kernel and initrd if need be when an upgrade fails to boot for some reason.
>
> <snip>
>
>> If I had 500 GiB SSDs like the one you're getting, I could put the media
>> partition on SSDs and be rid of the spinning rust entirely.  But I seem
>> to keep finding higher priorities for the money I'd spend on a pair of
>> them...
>
>
> I'm getting one, not two, so the system is raid0.  Data is more
> important (and backed up).
If you don't need the full terabyte of space, I would seriously suggest 
using raid1 instead of raid0.  If you're using SSD's, then you won't get 
much performance gain from BTRFS raid0 (because the I/O dispatching is 
not particularly smart), and it also makes it more likely that you will 
need to rebuild from scratch.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Snapshots slowing system
  2016-03-16 11:39     ` Austin S. Hemmelgarn
@ 2016-03-17 21:08       ` Pete
  2016-03-18  9:17         ` Duncan
  0 siblings, 1 reply; 17+ messages in thread
From: Pete @ 2016-03-17 21:08 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, linux-btrfs

On 03/16/2016 11:39 AM, Austin S. Hemmelgarn wrote:

>> I thought though that btrfs fi defrag <name> would only defrag the one
>> file or directory?
> It does, it's just not recursive unless you tell it to be.

Hmm.  That shows when I last used it.  Last time I used it the '-r'
option did not exist.  So I set and forgot 'autodefrag'.


>>
>> btrfs fi defrag /srv/photos/
>> Is considerably slower, it is still running.  Disk light is on solid.
>> Processes kworker and btrfs-transacti are pretty busy according to iotop.
> If you have a lot of items in /srv/photos/ (be it either lots of
> individual files, or lots of directories at the top level), then this is
> normal, if not, then you may have found a bug.

20 files.  15 directories.  A lot of files under this directory but
recursive NOT set.

Hmm.  Comments on ssd s set me googling.  Don't normally touch smartctl

root@phoenix:~# smartctl --attributes /dev/sdc

<snip>

184 End-to-End_Error        0x0032   098   098   099    Old_age   Always
  FAILING_NOW 2

<snip>

also:
  1 Raw_Read_Error_Rate     0x000f   120   099   006    Pre-fail  Always
      -       241052216

That figure seems to be on the move.  On /dev/sdb (the other half of my
hdd raid1 btrfs it is zero).  I presume zero means either 'no errors,
happy days' or 'not supported'.

Hmm.  Is this bad and/or possibly the smoking gun for slowness?  I will
keep an eye on the number to see if it changes.

OK, full output:
root@phoenix:~# smartctl --attributes /dev/sdc
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-4.0.4] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE
UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   120   099   006    Pre-fail  Always
      -       241159856
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always
      -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always
      -       83
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always
      -       0
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always
      -       56166570022
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always
      -       22098
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always
      -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always
      -       83
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always
      -       0
184 End-to-End_Error        0x0032   098   098   099    Old_age   Always
  FAILING_NOW 2
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always
      -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always
      -       8590065669
189 High_Fly_Writes         0x003a   095   095   000    Old_age   Always
      -       5
190 Airflow_Temperature_Cel 0x0022   066   063   045    Old_age   Always
      -       34 (Min/Max 30/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always
      -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always
      -       27
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always
      -       287836
194 Temperature_Celsius     0x0022   034   040   000    Old_age   Always
      -       34 (0 14 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always
      -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age
Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always
      -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age
Offline      -       281032595099550
241 Total_LBAs_Written      0x0000   100   253   000    Old_age
Offline      -       75393744072
242 Total_LBAs_Read         0x0000   100   253   000    Old_age
Offline      -       115340399121

OK, head flying hours explains it, drive is over 32 billion years old...

As I am slowly producing this post raw_read_error_rate is now at
241507192.  But I did set  smartctl -t long /dev/sdc in motion if that
is at all relevent.


>>
>> Slackware uses lilo so I need a separate /boot with something that is
>> supported by lilo.
> I would like to point out that just because the distribution prefers one
> package doesn't mean you can't use another, it's just not quite as easy.
>  It's worth noting that I do similarly to Duncan in this respect though,
> although I provisioned 512MiB when I set things up (and stuck the BIOS
> boot partition (because I use GPT on everything these days) in the
> unaligned slack space between the partition table and /boot).  It also
> has the advantage that I can fall back to old versions of the kernel and
> initrd if need be when an upgrade fails to boot for some reason.

Thanks.  I know that.  Have dallied with Grub and Grub2 but when it
works well lilo is nice and simple.  My 'maintenance' partition plan is
to give me something more powerful than a rescue disk if things go
south.  Bit frustrating the time I found that btrfs-tools was well
behind on the maintenance partition.  At least I could go online and
fix.  Rescue CDs are not helpful there.


>>
>> <snip>
>>
>>> If I had 500 GiB SSDs like the one you're getting, I could put the media
>>> partition on SSDs and be rid of the spinning rust entirely.  But I seem
>>> to keep finding higher priorities for the money I'd spend on a pair of
>>> them...
>>
>>
>> I'm getting one, not two, so the system is raid0.  Data is more
>> important (and backed up).
> If you don't need the full terabyte of space, I would seriously suggest
> using raid1 instead of raid0.  If you're using SSD's, then you won't get
> much performance gain from BTRFS raid0 (because the I/O dispatching is
> not particularly smart), and it also makes it more likely that you will
> need to rebuild from scratch.

Confused.  I'm getting one SSD which I intend to use raid0.  Seems to me
to make no sense to split it in two and put both sides of raid1 on one
disk and I reasonably think that you are not suggesting that.  Or are
you assuming that I'm getting two disks?  Or are you saying that buying
a second SSD disk is strongly advised?  (bearing in mind that it looks
like I might need another hdd if the smart field above is worth worrying
about).

Pete



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Snapshots slowing system
  2016-03-17 21:08       ` Pete
@ 2016-03-18  9:17         ` Duncan
  2016-03-18 11:38           ` Austin S. Hemmelgarn
  2016-03-18 18:16           ` Pete
  0 siblings, 2 replies; 17+ messages in thread
From: Duncan @ 2016-03-18  9:17 UTC (permalink / raw)
  To: linux-btrfs

Pete posted on Thu, 17 Mar 2016 21:08:23 +0000 as excerpted:

> Hmm.  Comments on ssd s set me googling.  Don't normally touch smartctl
> 
> root@phoenix:~# smartctl --attributes /dev/sdc
> <snip>
> 184 End-to-End_Error        0x0032   098   098   099    Old_age   Always
>   FAILING_NOW 2
> <snip>
>   1 Raw_Read_Error_Rate     0x000f   120   099   006    Pre-fail  Always
>       -       241052216
> 
> That figure seems to be on the move.  On /dev/sdb (the other half of my
> hdd raid1 btrfs it is zero).  I presume zero means either 'no errors,
> happy days' or 'not supported'.

This is very useful.  See below.

> Hmm.  Is this bad and/or possibly the smoking gun for slowness?  I will
> keep an eye on the number to see if it changes.
> 
> OK, full output:
> root@phoenix:~# smartctl --attributes /dev/sdc
> [...]
> === START OF READ SMART DATA SECTION ===
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE UPDATED 
> WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   120   099   006    Pre-fail  Always
>       -       241159856

This one's showing some issues, but is within tolerance as even the worst 
value of 99 is still _well_ above the failure threshold of 6.

But the fact that the raw value isn't simply zero means that it is having 
mild problems, they're just well within tolerance according to the cooked 
value and threshold.

(I've snipped a few of these...)

>   3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always
>       -       0

On spinning rust this one's a strong indicator of one of the failure 
modes, a very long time to spin up.  Obviously that's not a problem with 
this device.  Even raw is zero.

>   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always
>       -       83

Spinning up a drive is hard on it.  Laptops in particular often spin down 
their drives to save power, then spin them up again.  Wall-powered 
machines can and sometimes do, but it's not as common, and when they do, 
the spin-down time is often an hour or higher of idle, where on laptops 
it's commonly 15 minutes and may be as low as 5.

Obviously you're doing no spindowns except for power-offs, and thus have 
a very low raw count of 83, which hasn't dropped the cooked value from 
100 yet, so great on this one as well.

>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always
>       -       0

This one is available on ssds and spinning rust, and while it never 
actually hit failure mode for me on an ssd I had that went bad, I watched 
over some months as the raw reallocated sector count increased a bit at a 
time.  (The device was one of a pair with multiple btrfs raid1 on 
parallel partitions on each, and the other device of the pair remains 
perfectly healthy to this day, so I was able to use btrfs checksumming 
and scrubs to keep the one that was going bad repaired based on the other 
one, and was thus able to run it for quite some time after I would have 
otherwise replaced it, simply continuing to use it out of curiosity and 
to get some experience with how it and btrfs behaved when failing.)

In my case, it started at 253 cooked with 0 raw, then dropped to a 
percentage (still 100 at first) as soon as the first sector was 
reallocated (raw count of 1).  It appears that your manufacturer treats 
it as a percentage from a raw count of 0.

What really surprised me was just how many spare sectors that ssd 
apparently had.  512 byte sectors, so half a KiB each.  But it was into 
the thousands of replaced sectors raw count, so Megabytes used, but the 
cooked count had only dropped to 85 or so by the time I got tired of 
constantly scrubbing to keep it half working as more and more sectors 
failed.   But threshold was 36, so I wasn't anywhere CLOSE to getting to 
reported failure here, despite having thousands of replaced sectors thus 
megabytes in size.

But the ssd was simply bad before its time, as it wasn't failing due to 
write-cycle wear-out, but due to bad flash, plain and simple.  With the 
other device (and the one I replaced it with as well, I actually had 
three of the same brand and size SSDs), there's still no replaced sectors 
at all.

But apparently, when ssds hit normal old-age and start to go bad from 
write-cycle failure, THAT is when those 128 MiB or so (as I calculated 
based on percentage and raw value failed at one point, or was it 256 MiB, 
IDR for sure) of replacement sectors start to be used.  And on SSDs, 
apparently when that happens, sectors often fail and are replaced faster 
than I was seeing, so it's likely people will actually get to failure 
mode on this attribute in that case.

I'd guess spinning rust has something less, maybe 64 MiB for multiple TB 
of storage, instead of the 128 or 256 MiB I saw on my 256 GiB SSDs.  That 
would be because spinning rust failure mode is typically different, and 
while a few sectors might die and be replaced over the life of the 
device, typically it's not that many, and failure is by some other means 
like mechanical failure (failure to spin up, or read heads getting out of 
tolerated sync with the cylinders on the device).

>   7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always
>       -       56166570022

Like the raw-read-error-rate attribute above, you're seeing minor issues 
as the raw number isn't 0, and in this case, the cooked value is 
obviously dropping significantly as well, but it's still within 
tolerance, so it's not failing yet.  That worst cooked value of 60 is 
starting to get close to that threshold of 30, however, so this one's 
definitely showing wear, just not failure... yet.

>   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always
>       -       22098

Reasonable for a middle-aged drive, considering you obviously don't shut 
it down often (a start-stop-count raw of 80-something).  That's ~2.5 
years of power-on.

>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always
>       -       0

This one goes with spin-up time.  Absolutely no problems here.

>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always
>       -       83

Matches start-stop-count.  Good. =:^)  Since you obviously don't spin 
down except at power-off, this one isn't going to be a problem for you.

> 184 End-to-End_Error        0x0032   098   098   099    Old_age   Always
>   FAILING_NOW 2

I /think/ this one is a power-on head self-test head seek from one side 
of the device to the other, and back, covering both ways.

Assuming I'm correct on the above guess, the combination of this failing 
for you, and the not yet failing but a non-zero raw-value for raw-read-
error-rate and seek-error-rate, with the latter's cooked value being 
significantly down if not yet failing, is definitely concerning, as the 
three values all have to do with head seeking errors.

I'd definitely get your data onto something else as soon as possible, tho 
as much of it is backups, you're not in too bad a shape even if you lose 
them, as long as you don't lose the working copy at the same time.

But with all three seek attributes indicating at least some issue and one 
failing, at least get anything off it that is NOT backups ASAP.

And that very likely explains the slowdowns as well, as obviously, while 
all sectors are still readable, it's having to retry multiple times on 
some of them, and that WILL slow things down.

> 188 Command_Timeout         0x0032   100   099   000    Old_age   Always
>       -       8590065669

Again, a non-zero raw value indicating command timeouts, probably due to 
those bad seeks.  It'll have to retry those commands, and that'll 
definitely mean slowdowns.

Tho there's no threshold, but 99 worst-value cooked isn't horrible.

FWIW, on my spinning rust device this value actually shows a worst of 
001, here (100 current cooked value, tho), with a threshold of zero, 
however.  But as I've experienced no problems with it I'd guess that's an 
aberration.  I haven't the foggiest why/how/when it got that 001 worst.

> 189 High_Fly_Writes         0x003a   095   095   000    Old_age   Always
>       -       5

Again, this demonstrates a bit of disk wobble or head slop.  But with a 
threshold of zero and a value and worst of 95, it doesn't seem to be too 
bad.

> 193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always
>       -       287836

Interesting.  My spinning rust has the exact same value and worst of 1, 
threshold 0, and a relatively similar 237181 raw count.

But I don't really know what this counts unless it's actual seeks, and 
mine seems in good health still, certainly far better than the cooked 
value and worst of 1 might suggest.

> 240 Head_Flying_Hours       0x0000   100   253   000    Old_age Offline 
>     -       281032595099550

> OK, head flying hours explains it, drive is over 32 billion years old...
> 

While my spinning rust has this attribute and the cooked values are 
identical 100/253/0, the raw value is reported and formatted entirely 
differently, as 21122 (89 19 0).  I don't know what those values are, but 
presumably your big long value reports the others mine does, as well, 
only as a big long combined value.

Which would explain the apparent multi-billion years yours is reporting! 
=:^)  It's not a single value, it's multiple values somehow combined.

At least with my power-on hours of 23637, a head-flying hours of 21122 
seems reasonable.  (I only recently configured the BIOS to spin down that 
drive after 15 minutes I think, because it's only backups and my media 
partition which isn't mounted all the time anyway, so I might as well 
leave it off instead of idle-spinning when I might not use it for days at 
a time.  So a difference of a couple thousand hours between power-on and 
head-flying, on a base of 20K+ hours for both, makes sense given that I 
only recently configured it to spin down.)

But given your ~22K power-on hours, even simply peeling off the first 5 
digits of your raw value would be 28K head-flying, and that doesn't make 
sense for only 22K power-on, so obviously they're using a rather more 
complex formula than that.

So bottom line regarding that smartctl output, yeah, a new device is 
probably a very good idea at this point.  Those smart attributes indicate 
either head slop or spin wobble, and some errors and command timeouts and 
retries, which could well account for your huge slowdowns.  Fortunately, 
it's mostly backup, so you have your working copy, but if I'm not mixing 
up my threads, you have some media files, etc, on a different partition 
on it as well, and if you don't have backups elsewhere, getting them onto 
something else ASAP is a very good idea, because this drive does look to 
be struggling, and tho it could continue working in a low usage scenario 
for some time yet, it could also fail rather quickly, as well.

> As I am slowly producing this post raw_read_error_rate is now at
> 241507192.  But I did set  smartctl -t long /dev/sdc in motion if that
> is at all relevent.
> 
>>> <snip>
>>>
>>>> If I had 500 GiB SSDs like the one you're getting, I could put the
>>>> media partition on SSDs and be rid of the spinning rust entirely. 
>>>> But I seem to keep finding higher priorities for the money I'd spend
>>>> on a pair of them...
>>>
>>>
>>> I'm getting one, not two, so the system is raid0.  Data is more
>>> important (and backed up).
>> If you don't need the full terabyte of space, I would seriously suggest
>> using raid1 instead of raid0.  If you're using SSD's, then you won't
>> get much performance gain from BTRFS raid0 (because the I/O dispatching
>> is not particularly smart), and it also makes it more likely that you
>> will need to rebuild from scratch.
> 
> Confused.  I'm getting one SSD which I intend to use raid0.  Seems to me
> to make no sense to split it in two and put both sides of raid1 on one
> disk and I reasonably think that you are not suggesting that.  Or are
> you assuming that I'm getting two disks?  Or are you saying that buying
> a second SSD disk is strongly advised?  (bearing in mind that it looks
> like I might need another hdd if the smart field above is worth worrying
> about).

Well, raid0 normally requires two devices.  So either you mean single 
mode on a single device, or you're combining it with another device (or 
more than one more) to do raid0.

And if you're combining it with another device to do raid0, than the 
suggestion, unless you really need all the room from the raid0, is to do 
raid1, because the usual reason for raid0 is speed, and btrfs raid0 isn't 
yet particularly optimized so you don't get so much more speed than on a 
single device.  And raid0 has a much higher risk of failure because if 
any of the devices fail the whole filesystem is gone.

So raid0 really doesn't get you much besides the additional room of the 
multiple devices.

Meanwhile, in addition to the traditional device redundancy that you 
normally get with raid1, btrfs raid1 has some additional features as 
well, namely, data integrity due to checksumming, and the ability to 
repair a bad copy from the other one, assuming the other copy passes 
checksum verification.  While traditional raid1 lets you do a similar 
repair, because it doesn't have and verify the checksums like btrfs does, 
on traditional raid1, you're just as likely to be replacing the good copy 
with the bad one, as the other way around.  Btrfs' ability to actually 
repair bad data from a verified good second copy like that, is a very 
nice feature indeed, and having lived thru a failing ssd as I mentioned 
above, btrfs raid1 is not only what saved my data, it's what allowed me 
to continue playing with the failing ssd as I continued to use it well 
passed when I would have otherwise replaced it, so I could watch just how 
it behaved as it failed and get more experience with both that and 
working with btrfs raid1 recovery under that sort of situation.

So btrfs raid1 has data integrity and repair features that aren't 
available on normal raid1, and thus is highly recommended.

But, raid1 /does/ mean two copies of both data and metadata (assuming of 
course you make them both raid1, as I did), and if you simply don't have 
room to do it that way, you don't have room, highly recommended tho it 
may be.

Tho raid1 shouldn't be considered the same as a backup, because it's 
not.  In particular, while you do have reasonable protection against 
device failure, and with btrfs, against the data going bad, raid1, on its 
own, doesn't protect against fat-fingering, simply making a mistake and 
deleting something you shouldn't have, which as any admin knows, tends to 
be the greatest risk to data.  You need a real backup (or a snapshot) to 
recover from that.

Additionally, raid1 alone isn't going to help if the filesystem itself 
goes bad.  Neither will a snapshot, there.  You need a backup to recover 
in that case.

Similarly in the case of an electrical problem, robbery of the machine, 
or fire, since both/all devices in a raid1 will be affected together.  If 
you want to be able to recover your data in that case, better have a real 
backup, preferably kept offline except when actually making the backup, 
and even better, off-site.  For this sort of thing, in fact, the usual 
recommendation is at least two offsite backups, alternated such that if 
tragedy strikes when you're updating the one, taking it out as well, you 
still have the other one safe and sound, and will only lose the 
difference since that alternating backup, even when both your working 
copy and the other of the backups are both taken out at once.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Snapshots slowing system
  2016-03-18  9:17         ` Duncan
@ 2016-03-18 11:38           ` Austin S. Hemmelgarn
  2016-03-18 17:58             ` Pete
  2016-03-18 23:58             ` Duncan
  2016-03-18 18:16           ` Pete
  1 sibling, 2 replies; 17+ messages in thread
From: Austin S. Hemmelgarn @ 2016-03-18 11:38 UTC (permalink / raw)
  To: linux-btrfs

On 2016-03-18 05:17, Duncan wrote:
> Pete posted on Thu, 17 Mar 2016 21:08:23 +0000 as excerpted:
>>    5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always
>>        -       0
>
> This one is available on ssds and spinning rust, and while it never
> actually hit failure mode for me on an ssd I had that went bad, I watched
> over some months as the raw reallocated sector count increased a bit at a
> time.  (The device was one of a pair with multiple btrfs raid1 on
> parallel partitions on each, and the other device of the pair remains
> perfectly healthy to this day, so I was able to use btrfs checksumming
> and scrubs to keep the one that was going bad repaired based on the other
> one, and was thus able to run it for quite some time after I would have
> otherwise replaced it, simply continuing to use it out of curiosity and
> to get some experience with how it and btrfs behaved when failing.)
>
> In my case, it started at 253 cooked with 0 raw, then dropped to a
> percentage (still 100 at first) as soon as the first sector was
> reallocated (raw count of 1).  It appears that your manufacturer treats
> it as a percentage from a raw count of 0.
>
> What really surprised me was just how many spare sectors that ssd
> apparently had.  512 byte sectors, so half a KiB each.  But it was into
> the thousands of replaced sectors raw count, so Megabytes used, but the
> cooked count had only dropped to 85 or so by the time I got tired of
> constantly scrubbing to keep it half working as more and more sectors
> failed.   But threshold was 36, so I wasn't anywhere CLOSE to getting to
> reported failure here, despite having thousands of replaced sectors thus
> megabytes in size.
This actually makes sense, as SSD's have spare 'sectors' in erase block 
size chunks, and most use a minimum 1MiB erase block size, with 4-8MiB 
being normal for most consumer devices.
>
> But the ssd was simply bad before its time, as it wasn't failing due to
> write-cycle wear-out, but due to bad flash, plain and simple.  With the
> other device (and the one I replaced it with as well, I actually had
> three of the same brand and size SSDs), there's still no replaced sectors
> at all.
>
> But apparently, when ssds hit normal old-age and start to go bad from
> write-cycle failure, THAT is when those 128 MiB or so (as I calculated
> based on percentage and raw value failed at one point, or was it 256 MiB,
> IDR for sure) of replacement sectors start to be used.  And on SSDs,
> apparently when that happens, sectors often fail and are replaced faster
> than I was seeing, so it's likely people will actually get to failure
> mode on this attribute in that case.
>
> I'd guess spinning rust has something less, maybe 64 MiB for multiple TB
> of storage, instead of the 128 or 256 MiB I saw on my 256 GiB SSDs.  That
> would be because spinning rust failure mode is typically different, and
> while a few sectors might die and be replaced over the life of the
> device, typically it's not that many, and failure is by some other means
> like mechanical failure (failure to spin up, or read heads getting out of
> tolerated sync with the cylinders on the device).
>
>>    7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always
>>        -       56166570022
>
> Like the raw-read-error-rate attribute above, you're seeing minor issues
> as the raw number isn't 0, and in this case, the cooked value is
> obviously dropping significantly as well, but it's still within
> tolerance, so it's not failing yet.  That worst cooked value of 60 is
> starting to get close to that threshold of 30, however, so this one's
> definitely showing wear, just not failure... yet.
>
>>    9 Power_On_Hours          0x0032   075   075   000    Old_age   Always
>>        -       22098
>
> Reasonable for a middle-aged drive, considering you obviously don't shut
> it down often (a start-stop-count raw of 80-something).  That's ~2.5
> years of power-on.
>
>>   10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always
>>        -       0
>
> This one goes with spin-up time.  Absolutely no problems here.
>
>>   12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always
>>        -       83
>
> Matches start-stop-count.  Good. =:^)  Since you obviously don't spin
> down except at power-off, this one isn't going to be a problem for you.
>
>> 184 End-to-End_Error        0x0032   098   098   099    Old_age   Always
>>    FAILING_NOW 2
>
> I /think/ this one is a power-on head self-test head seek from one side
> of the device to the other, and back, covering both ways.
I believe you're correct about this, although I've never seen any 
definitive answer anywhere.
>
> Assuming I'm correct on the above guess, the combination of this failing
> for you, and the not yet failing but a non-zero raw-value for raw-read-
> error-rate and seek-error-rate, with the latter's cooked value being
> significantly down if not yet failing, is definitely concerning, as the
> three values all have to do with head seeking errors.
>
> I'd definitely get your data onto something else as soon as possible, tho
> as much of it is backups, you're not in too bad a shape even if you lose
> them, as long as you don't lose the working copy at the same time.
>
> But with all three seek attributes indicating at least some issue and one
> failing, at least get anything off it that is NOT backups ASAP.
>
> And that very likely explains the slowdowns as well, as obviously, while
> all sectors are still readable, it's having to retry multiple times on
> some of them, and that WILL slow things down.
>
>> 188 Command_Timeout         0x0032   100   099   000    Old_age   Always
>>        -       8590065669
>
> Again, a non-zero raw value indicating command timeouts, probably due to
> those bad seeks.  It'll have to retry those commands, and that'll
> definitely mean slowdowns.
>
> Tho there's no threshold, but 99 worst-value cooked isn't horrible.
>
> FWIW, on my spinning rust device this value actually shows a worst of
> 001, here (100 current cooked value, tho), with a threshold of zero,
> however.  But as I've experienced no problems with it I'd guess that's an
> aberration.  I haven't the foggiest why/how/when it got that 001 worst.
Such an occurrence is actually not unusual when you have particularly 
bad sectors on a 'desktop' rated HDD, as they will keep retrying for an 
insanely long time to read the bad sector before giving up.
>
>> 189 High_Fly_Writes         0x003a   095   095   000    Old_age   Always
>>        -       5
>
> Again, this demonstrates a bit of disk wobble or head slop.  But with a
> threshold of zero and a value and worst of 95, it doesn't seem to be too
> bad.
>
>> 193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always
>>        -       287836
>
> Interesting.  My spinning rust has the exact same value and worst of 1,
> threshold 0, and a relatively similar 237181 raw count.
>
> But I don't really know what this counts unless it's actual seeks, and
> mine seems in good health still, certainly far better than the cooked
> value and worst of 1 might suggest.
As far as I understand it, this is an indicator of the number of times 
the heads have been loaded and unloaded.  This is tracked separately as 
there are multiple reasons the heads might get parked without spinning 
down the disk (most disks will park them if they've been idle, so that 
they reduce the risk of a head crash, and many modern laptops will park 
them if they detect that they're in free fall to protect the disk when 
they impact whatever they fall onto).  It's not unusual to see values 
like that for similarly aged disks either though, so it's not too worrying.
>
>> 240 Head_Flying_Hours       0x0000   100   253   000    Old_age Offline
>>      -       281032595099550
>
>> OK, head flying hours explains it, drive is over 32 billion years old...
>>
>
> While my spinning rust has this attribute and the cooked values are
> identical 100/253/0, the raw value is reported and formatted entirely
> differently, as 21122 (89 19 0).  I don't know what those values are, but
> presumably your big long value reports the others mine does, as well,
> only as a big long combined value.
>
> Which would explain the apparent multi-billion years yours is reporting!
> =:^)  It's not a single value, it's multiple values somehow combined.
>
> At least with my power-on hours of 23637, a head-flying hours of 21122
> seems reasonable.  (I only recently configured the BIOS to spin down that
> drive after 15 minutes I think, because it's only backups and my media
> partition which isn't mounted all the time anyway, so I might as well
> leave it off instead of idle-spinning when I might not use it for days at
> a time.  So a difference of a couple thousand hours between power-on and
> head-flying, on a base of 20K+ hours for both, makes sense given that I
> only recently configured it to spin down.)
>
> But given your ~22K power-on hours, even simply peeling off the first 5
> digits of your raw value would be 28K head-flying, and that doesn't make
> sense for only 22K power-on, so obviously they're using a rather more
> complex formula than that.
This one is tricky, as it's not very clearly defined in the SMART spec. 
  Most manufacturers just count the total time the head has been loaded. 
  There are some however who count the time the heads have been loaded, 
multiplied by the number of heads.  This value still appears to be 
incorrect though, as combined with the Power_On_Hours, it implies well 
over 1024 heads, which is physically impossible on even a 5.25 inch disk 
using modern technology, even using multiple spindles.  The fact that 
this is so blatantly wrong should be a red flag regarding the disk 
firmware or on-board electronics, which just reinforces what Duncan 
already said about getting a new disk.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Snapshots slowing system
  2016-03-18 11:38           ` Austin S. Hemmelgarn
@ 2016-03-18 17:58             ` Pete
  2016-03-18 23:58             ` Duncan
  1 sibling, 0 replies; 17+ messages in thread
From: Pete @ 2016-03-18 17:58 UTC (permalink / raw)
  To: linux-btrfs

On 03/18/2016 11:38 AM, Austin S. Hemmelgarn wrote:

> This one is tricky, as it's not very clearly defined in the SMART spec.
>  Most manufacturers just count the total time the head has been loaded.
>  There are some however who count the time the heads have been loaded,
> multiplied by the number of heads.  This value still appears to be
> incorrect though, as combined with the Power_On_Hours, it implies well
> over 1024 heads, which is physically impossible on even a 5.25 inch disk
> using modern technology, even using multiple spindles.  The fact that
> this is so blatantly wrong should be a red flag regarding the disk
> firmware or on-board electronics, which just reinforces what Duncan
> already said about getting a new disk.

Have got a larger SSD on the way as it looked tight.  So annoylingly
wallet has to come out again.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Snapshots slowing system
  2016-03-18 11:38           ` Austin S. Hemmelgarn
  2016-03-18 17:58             ` Pete
@ 2016-03-18 23:58             ` Duncan
  1 sibling, 0 replies; 17+ messages in thread
From: Duncan @ 2016-03-18 23:58 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Fri, 18 Mar 2016 07:38:29 -0400 as
excerpted:

>>> 188 Command_Timeout         0x0032   100   099   000    Old_age  
>>> Always
>>>        -       8590065669
>>
>> Again, a non-zero raw value indicating command timeouts, probably due
>> to those bad seeks.  It'll have to retry those commands, and that'll
>> definitely mean slowdowns.
>>
>> Tho there's no threshold, but 99 worst-value cooked isn't horrible.
>>
>> FWIW, on my spinning rust device this value actually shows a worst of
>> 001, here (100 current cooked value, tho), with a threshold of zero,
>> however.  But as I've experienced no problems with it I'd guess that's
>> an aberration.  I haven't the foggiest why/how/when it got that 001
>> worst.

> Such an occurrence is actually not unusual when you have particularly
> bad sectors on a 'desktop' rated HDD, as they will keep retrying for an
> insanely long time to read the bad sector before giving up.

Which is why it's mystifying to me how it could be reporting a worst-
value 1, when the device seems to be working just fine, and I don't 
recall even one event of waiting "an insanely long time", or even 
anything out of the ordinary, for anything on that device, ever.

Tho I suppose it's within reason that whatever it was froze up the system 
bad enough that I rebooted, and I attributed the one-off to something 
else.  But with no other attributes indicating issues, I remain clueless 
as to what might have happened and why that 1-worst, particularly so 
given the 0 threshold for that attribute and that it's an old-age 
indicator rather than a fail indicator, but the device is neither that 
old, nor as I said, in any other way indicating anything close to what 
that 1-worst value for just that single attribute implies.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Snapshots slowing system
  2016-03-18  9:17         ` Duncan
  2016-03-18 11:38           ` Austin S. Hemmelgarn
@ 2016-03-18 18:16           ` Pete
  2016-03-18 18:54             ` Austin S. Hemmelgarn
  2016-03-19  1:15             ` Duncan
  1 sibling, 2 replies; 17+ messages in thread
From: Pete @ 2016-03-18 18:16 UTC (permalink / raw)
  To: linux-btrfs

On 03/18/2016 09:17 AM, Duncan wrote:

> So bottom line regarding that smartctl output, yeah, a new device is 
> probably a very good idea at this point.  Those smart attributes indicate 
> either head slop or spin wobble, and some errors and command timeouts and 
> retries, which could well account for your huge slowdowns.  Fortunately, 
> it's mostly backup, so you have your working copy, but if I'm not mixing 
> up my threads, you have some media files, etc, on a different partition 
> on it as well, and if you don't have backups elsewhere, getting them onto 
> something else ASAP is a very good idea, because this drive does look to 
> be struggling, and tho it could continue working in a low usage scenario 
> for some time yet, it could also fail rather quickly, as well.
> 

This disk is one of a pair or raid1 disks which hold the data on my
system.  As you summised the machine is generally on 24x7 as it can just
get on with backups and some data grabbing and crunching on its own.

This is a set up of 2 x 3TB disks completely dedicated to btrfs.  I'm
wondering if the failing one is the older one wrenched out of a USB
enclosure as it was cheaper than a desktop one or whether it was the
desktop drive?  Still academic.  I have 1.37TB unallocated, 720GB free
estimated.  I'm therefore wondering whether I opt for the cheapest
reasonable desktop drive, a NAS drive advertised for 24x7 or whether I
pick a wallet frightening 'enterprise drive' as it might be twice as
much as the standard desktop but will give me less grief in the long
term.  Probably one for comp.os.linux.hardware.

>> Confused.  I'm getting one SSD which I intend to use raid0.  Seems to me
>> to make no sense to split it in two and put both sides of raid1 on one
>> disk and I reasonably think that you are not suggesting that.  Or are
>> you assuming that I'm getting two disks?  Or are you saying that buying
>> a second SSD disk is strongly advised?  (bearing in mind that it looks
>> like I might need another hdd if the smart field above is worth worrying
>> about).
> 
> Well, raid0 normally requires two devices.  So either you mean single 
> mode on a single device, or you're combining it with another device (or 
> more than one more) to do raid0.

Sorry, I confused raid0 with single.  The _lone_ system disk contains
the root partition, it is btrfs in single mode.

> So btrfs raid1 has data integrity and repair features that aren't 
> available on normal raid1, and thus is highly recommended.
> 
> But, raid1 /does/ mean two copies of both data and metadata (assuming of 
> course you make them both raid1, as I did), and if you simply don't have 
> room to do it that way, you don't have room, highly recommended tho it 
> may be.

This looks like a strong recommendation to get a second SSD for the root
partition and go raid1.  Are SSDs more flakey that hdd or are you just a
strong believer in the integrity of raid1?

> 
> Tho raid1 shouldn't be considered the same as a backup, because it's 
> not.  In particular, while you do have reasonable protection against 

<snip>

Backup nightly to an external usb hdd with ext4 via rsync.  Permanently
connected.  Also periodically (when I remember) backup via rsync to
another hdd formatted btrfs, single mode, with snapshots.

Given the discussions here maybe a couple of extra copies of the very
important stuff would not go amiss.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Snapshots slowing system
  2016-03-18 18:16           ` Pete
@ 2016-03-18 18:54             ` Austin S. Hemmelgarn
  2016-03-19  0:59               ` Duncan
  2016-03-19  1:15             ` Duncan
  1 sibling, 1 reply; 17+ messages in thread
From: Austin S. Hemmelgarn @ 2016-03-18 18:54 UTC (permalink / raw)
  To: Pete, linux-btrfs

On 2016-03-18 14:16, Pete wrote:
> On 03/18/2016 09:17 AM, Duncan wrote:
>
>> So bottom line regarding that smartctl output, yeah, a new device is
>> probably a very good idea at this point.  Those smart attributes indicate
>> either head slop or spin wobble, and some errors and command timeouts and
>> retries, which could well account for your huge slowdowns.  Fortunately,
>> it's mostly backup, so you have your working copy, but if I'm not mixing
>> up my threads, you have some media files, etc, on a different partition
>> on it as well, and if you don't have backups elsewhere, getting them onto
>> something else ASAP is a very good idea, because this drive does look to
>> be struggling, and tho it could continue working in a low usage scenario
>> for some time yet, it could also fail rather quickly, as well.
>>
>
> This disk is one of a pair or raid1 disks which hold the data on my
> system.  As you summised the machine is generally on 24x7 as it can just
> get on with backups and some data grabbing and crunching on its own.
>
> This is a set up of 2 x 3TB disks completely dedicated to btrfs.  I'm
> wondering if the failing one is the older one wrenched out of a USB
> enclosure as it was cheaper than a desktop one or whether it was the
> desktop drive?  Still academic.  I have 1.37TB unallocated, 720GB free
> estimated.  I'm therefore wondering whether I opt for the cheapest
> reasonable desktop drive, a NAS drive advertised for 24x7 or whether I
> pick a wallet frightening 'enterprise drive' as it might be twice as
> much as the standard desktop but will give me less grief in the long
> term.  Probably one for comp.os.linux.hardware.
Personally, I find that desktop drives generally do fine for 24/7 usage 
as long as things aren't constantly being written to and read from them. 
  For a write-once-read-many workload like most backup setups, there's 
not usually a huge advantage to getting high end disks unless you can't 
be there to replace them relatively soon after they fail (one disk in a 
RAID set failing puts more load on the other disk, thus increasing it's 
chance of also failing).  Desktop disks usually do provide similarly low 
error rates as higher end disks, the big difference is in how they 
handle errors.  Desktop drives will (usually) keep retrying a read on a 
bad sector for multiple minutes before giving up, while NAS drives will 
return an error almost immediately, and enterprise drives will let you 
configure how long it will retry.
>
>
>>> Confused.  I'm getting one SSD which I intend to use raid0.  Seems to me
>>> to make no sense to split it in two and put both sides of raid1 on one
>>> disk and I reasonably think that you are not suggesting that.  Or are
>>> you assuming that I'm getting two disks?  Or are you saying that buying
>>> a second SSD disk is strongly advised?  (bearing in mind that it looks
>>> like I might need another hdd if the smart field above is worth worrying
>>> about).
>>
>> Well, raid0 normally requires two devices.  So either you mean single
>> mode on a single device, or you're combining it with another device (or
>> more than one more) to do raid0.
>
> Sorry, I confused raid0 with single.  The _lone_ system disk contains
> the root partition, it is btrfs in single mode.
Don't feel bad, I made this mistake myself a couple of times at first too.
>
>
>
>> So btrfs raid1 has data integrity and repair features that aren't
>> available on normal raid1, and thus is highly recommended.
>>
>> But, raid1 /does/ mean two copies of both data and metadata (assuming of
>> course you make them both raid1, as I did), and if you simply don't have
>> room to do it that way, you don't have room, highly recommended tho it
>> may be.
>
> This looks like a strong recommendation to get a second SSD for the root
> partition and go raid1.  Are SSDs more flakey that hdd or are you just a
> strong believer in the integrity of raid1?
Generally, SSD's have better reliability in harsh conditions than HDD's, 
they can safely handle a wider temperature range, and are pretty much 
unaffected by vibration.  They fail in different ways however, so advice 
for preventing data loss on HDD's doesn't necessarily apply to SSD's.

Overall though, it really depends on what brand you get.  As of right 
now, the top three brands of SSD as far as quality IMHO are Intel, 
Samsung, and Crucial.  I usually go with Crucial myself because they are 
almost on-par with the other two, give more deterministic performance 
(their peak performance is often lower, but I'm willing to sacrifice a 
bit of performance to get consistency across operating conditions), and 
cost less (sometimes less than half as much as an equivalently sized 
Intel or Samsung SSD) .  Kingston, SanDisk, ADATA, Transcend, and Micron 
are generally OK, but sometimes have issues with data loss when they 
lose power unexpectedly (this likely won't be an issue for you though if 
you have a system that's on 24/7).  The only brand I would actively 
avoid is OCZ, as they've had numerous issues with reliability and data 
integrity over multiple revisions of multiple models of SSD.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Snapshots slowing system
  2016-03-18 18:54             ` Austin S. Hemmelgarn
@ 2016-03-19  0:59               ` Duncan
  0 siblings, 0 replies; 17+ messages in thread
From: Duncan @ 2016-03-19  0:59 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Fri, 18 Mar 2016 14:54:54 -0400 as
excerpted:

> As of right now, the top three brands of SSD as far as quality IMHO are
> Intel, Samsung, and Crucial.  I usually go with Crucial myself because
> they are almost on-par with the other two, give more deterministic
> performance (their peak performance is often lower, but I'm willing to
> sacrifice a bit of performance to get consistency across operating
> conditions), and cost less (sometimes less than half as much as an
> equivalently sized Intel or Samsung SSD) .  Kingston, SanDisk, ADATA,
> Transcend, and Micron are generally OK, but sometimes have issues with
> data loss when they lose power unexpectedly (this likely won't be an
> issue for you though if you have a system that's on 24/7).  The only
> brand I would actively avoid is OCZ, as they've had numerous issues with
> reliability and data integrity over multiple revisions of multiple
> models of SSD.

Thanks.  This is useful information for me as well, because while I'm not 
in the /immediate/ market for SSDs ATM, I'm relatively likely to be doing 
some new machines later this year, and will likely either be getting new 
ssds for them or will be getting newer and bigger for my main machine and 
will be putting the current 256 GiB main machine ssds in the smaller 
machines.  So I was recently looking at prices on pricewatch.com, and 
wondering again about brands.  I got relatively lucky with my first ssds 
purchased some years ago when I knew little about them (except that one 
went bad prematurely, but it has been replaced with one of the others 
now), but know much more about the technology now.  Only thing is I knew 
nothing about which brands were in general good or to stay away from, in 
ordered to narrow down the search a bit, so this was really helpful, 
particularly the Crucial bit as I already know Intels aren't 
realistically in my price range.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Snapshots slowing system
  2016-03-18 18:16           ` Pete
  2016-03-18 18:54             ` Austin S. Hemmelgarn
@ 2016-03-19  1:15             ` Duncan
  1 sibling, 0 replies; 17+ messages in thread
From: Duncan @ 2016-03-19  1:15 UTC (permalink / raw)
  To: linux-btrfs

Pete posted on Fri, 18 Mar 2016 18:16:50 +0000 as excerpted:

> On 03/18/2016 09:17 AM, Duncan wrote:

>> So btrfs raid1 has data integrity and repair features that aren't
>> available on normal raid1, and thus is highly recommended.
>> 
>> But, raid1 /does/ mean two copies of both data and metadata (assuming
>> of course you make them both raid1, as I did), and if you simply don't
>> have room to do it that way, you don't have room, highly recommended
>> tho it may be.
> 
> This looks like a strong recommendation to get a second SSD for the root
> partition and go raid1.  Are SSDs more flakey that hdd or are you just a
> strong believer in the integrity of raid1?

As Austin says, I'd generally consider ssds /more/ reliable than hdds, at 
least as long as you stay away from the OCZs, etc (but then again, there 
are spinning rust brands and specific models I stay away from, as well), 
but the failure modes are a bit different so it's not always as simple as 
that.

But I played with raid1 before btrfs and find the additional data 
integrity features that btrfs raid1 brings even more compelling, so yes, 
it would indeed be fair to say that I'm a strong booster of btrfs raid1 
in particular. =:^)

The roadmapped but still to come feature I'm /really/ looking forward to, 
however, is N-way-mirroring, because btrfs raid1 is currently very 
specifically two copies, regardless of how many devices there are, and I 
would really /really/ like to have the choice of three copies, again, not 
just for device-failure protection, but because with btrfs checksumming 
and data integrity, if one copy is found to be bad for whatever reason, 
be it a crash before all copies were written, or a failing device, a 
simple bad block on an otherwise fine device, or whatever gamma ray block 
damage or the like, right now that means the other copy BETTER come out 
checksum-verified, or that data or metadata is toast, and I'd rest far 
easier if even if one failed, I knew there was still not just the one 
copy, but a second, to fall back on as well.

With N-way-mirroring, of course that could be 4 or more copies as well, 
but three is closest to my sweet spot balance between cost and extreme 
reliability, and I'd very much like to have that choice as an option.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Snapshots slowing system
@ 2016-03-12 13:01 pete
  2016-03-13  3:28 ` Duncan
  0 siblings, 1 reply; 17+ messages in thread
From: pete @ 2016-03-12 13:01 UTC (permalink / raw)
  To: linux-btrfs

I hope this message stays within the thread on the list.  I had email problems
and ended up hacking around with sendmail & grabbing the message id off of
the web based group archives.

>I wondered whether you had elimated fragmentation, or any other known gotchas, 
>as a cause?

Subvolumes are mounted with the following options:
autodefrag,relatime,compress=lzo,subvol=<sub vol name>

Not sure if there is much else to do about fragmentation apart from running a
balance which would probally make thje machine v sluggish for a day or so.

>Out of curiosity, what is/was the utilisation of the disk? Were the snapshots 
>read-only or read-write?

root@phoenix:~# btrfs fi df /    
Data, single: total=101.03GiB, used=97.91GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=8.00GiB, used=5.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

root@phoenix:~# btrfs fi df /home
Data, RAID1: total=1.99TiB, used=1.97TiB
System, RAID1: total=32.00MiB, used=352.00KiB
Metadata, RAID1: total=53.00GiB, used=50.22GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Hmm.  The system disk is getting a little tight. cddisk reports the partition I
use for btrfs containing root as 127GB approx.  Not sure why it grows so much.
Suspect that software updates can't help as snapshots will contain the legacy
versions.  On the other hand they can be useful.

Is it likely the SSD?  If likely I could get a larger one, now is a good time with
a new version of slackware imminent.  However, no point in spending money for
the sake of it.

All snapshots read-write.  However, I have mainly treated them as read-only.  
Does that make a difference?

>Apropos Nada: quick shout out to Qu to wish him luck for the 4.6 merge.

I'm wondering if it is time for an update from 4.0.4?

>[Also, damn you autocorrection on my phone!]

Yep!

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Snapshots slowing system
  2016-03-12 13:01 pete
@ 2016-03-13  3:28 ` Duncan
  0 siblings, 0 replies; 17+ messages in thread
From: Duncan @ 2016-03-13  3:28 UTC (permalink / raw)
  To: linux-btrfs

pete posted on Sat, 12 Mar 2016 13:01:17 +0000 as excerpted:

> I hope this message stays within the thread on the list.  I had email
> problems and ended up hacking around with sendmail & grabbing the
> message id off of the web based group archives.

Looks like it should have as the reply-to looks right, but at least on 
gmane's news/nntp archive of the list (which is how I read and reply), it 
didn't.  But the thread was found easily enough.

>>I wondered whether you had elimated fragmentation, or any other known
>>gotchas, as a cause?
> 
> Subvolumes are mounted with the following options:
> autodefrag,relatime,compress=lzo,subvol=<sub vol name>

That relatime (which is the default), could be an issue.  See below.

> Not sure if there is much else to do about fragmentation apart from
> running a balance which would probally make thje machine v sluggish for
> a day or so.
> 
>>Out of curiosity, what is/was the utilisation of the disk? Were the
>>snapshots read-only or read-write?
> 
> root@phoenix:~# btrfs fi df /
> Data, single: total=101.03GiB, used=97.91GiB
> System, single: total=32.00MiB, used=16.00KiB
> Metadata, single: total=8.00GiB, used=5.29GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> root@phoenix:~# btrfs fi df /home
> Data, RAID1: total=1.99TiB, used=1.97TiB
> System, RAID1: total=32.00MiB, used=352.00KiB
> Metadata, RAID1: total=53.00GiB, used=50.22GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

Normally when posting, either btrfs fi df *and* btrfs fi show are 
needed, /or/ (with a new enough btrfs-progs) btrfs fi usage.  And of 
course the kernel (4.0.4 in your case) and btrfs-progs (not posted, that 
I saw) versions.

Btrfs fi df shows the chunk allocation and usage within the chunks, but 
does not show the size of the filesystem or of individual devices.  Btrfs 
fi show, shows that, but not the chunk allocation and usage info.  Btrfs 
fi usage shows both, but it's a newer command that isn't available on old 
btrfs-progs, and was buggy for some layouts (raid56 and mixed-mode, where 
the bugs would cause the numbers to go negative, which would appear as 
EiB free (I wish!!)) until relatively recently.

> Hmm.  The system disk is getting a little tight. cddisk reports the
> partition I use for btrfs containing root as 127GB approx.  Not sure why
> it grows so much. Suspect that software updates can't help as snapshots
> will contain the legacy versions.  On the other hand they can be useful.

With the 127 GiB (I _guess_ it's GiB, 1024, not GB, 1000, multiplier, 
btrfs consistently uses the 1024 multiplier and properly specifies it 
using the XiB notation) for /, however, and the btrfs fi df sizes of 101 
GiB plus data and 8 GiB metadata (with system's 32 MiB a rounding error 
and global reserve actually taken from metadata, so it doesn't add to 
chunk reservation on its own) we can see that as you mention, it's 
starting to get tight, a bit under 110 GiB of 127 GiB, but that 17 GiB 
free isn't horrible, just slightly tight, as you said.

Tho it'll obviously be tighter if that's 127 GB, 1000 multiplier...

It's tight enough that particularly with the regular snapshotting, btrfs 
might be having to fragment more than it'd like.  Tho kudos for the 
_excellent_ snapshot rotation.  We regularly see folks in here with 100K 
or more snapshots per filesystem, and btrfs _does_ have scaling issues in 
that case.  But your rotation seems to be keeping it well below the 1-3K 
snapshots per filesystem recommended max, so that's obviously NOT you're 
problem, unless of course the snapshot deletion bugged out and they 
aren't being deleted as they should.

(Of course, you can check that by listing them, and I would indeed double-
check, as that _is_ the _usual_ problem we have with snapshots slowing 
things down, simply too many of them, hitting the known scaling issues 
btrfs had with over 10K snapshots per filesystem.  But FWIW I don't use 
snapshots here and thus don't deal with snapshots command-level detail.)

But as I mentioned above, that relatime mount option isn't your best 
choice, in the presence of heavy snapshotting.  Unless you KNOW you need 
atimes for something or other, noatime is _strongly_ recommended with 
snapshotting, because relatime, while /relatively/ better than 
strictatime, still updates atimes once a day for files you're accessing 
at least that frequently.

And that interacts badly with snapshots, particularly where few of the 
files themselves have changed, because in that case, a large share of the 
changes from one snapshot to another are going to be those atime updates 
themselves.  Ensuring that you're always using noatime avoids the atime 
updates entirely (well, unless the file itself changes and thus mtime 
changes as well), which should, in the normal most files unchanged 
snapshotting context, make for much smaller snapshot-exclusive sizes.

And you mention below that the snapshots are read-write, but generally 
used as read-only.  Does that include actually mounting them read-only?  
Because if not, and if they too are mounted the default relatime, 
accessing them is obviously going to be updating atimes the relatime-
default once per day there as well... triggering further divergence of 
snapshots from the subvolumes they are snapshots of and from each other...

> Is it likely the SSD?  If likely I could get a larger one, now is a good
> time with a new version of slackware imminent.  However, no point in
> spending money for the sake of it.

Not directly btrfs related, but when you do buy a new ssd, now or later, 
keep in mind that a lot of authorities recommend that for ssds you buy 
10-33% larger than you plan on actually provisioning, and that you leave 
that extra space entirely unprovisioned -- either leave that extra space 
entirely unpartitioned, or partition it, but don't put filesystems or 
anything else (swap, etc) on it.  This leaves those erase-blocks free to 
be used by the FTL for additional wear-leveling block-swap, thus helping 
maintain device speed as it ages, and with good wear-leveling firmware, 
should dramatically increase device usable lifetime, as well.

FWIW, I ended up going rather overboard with that here, as I knew I 
needed a bit under 128 GiB (1024, I was trying to fit it in 100 GiB, so I 
could get 120 or 128 GB (1000) and use the extra as slack, but that was 
going to be tighter than I actually wanted) and thus thought I'd get 140 
GB (1000) or so devices, but I ended up getting 256 GB (1000), as that's 
what was both in-stock and at a reasonable price and performance level.  
Of course that meant I spent somewhat more, but I put it on credit and 
paid it off in 2-3 months, before the interest ate up _all_ the price 
savings I got on it.  So I ended up being able to put a couple more 
partitions on the SSD that I had planned to keep on spinning rust, and 
_still_ was only to 130 GiB or so, so I was still close to only 50% 
actually partitioned and used.

But it has been nice since I basically don't need to worry about trim/
discard at all, tho I do have a cronjob setup to run fstrim every week or 
so.  And given the price on the 256 GB ssds, I actually didn't spend 
_that_ much more on them than I would have on 160 GB or 200 GB devices -- 
well either that or I'd have had to wait for them to get more in stock, 
since all the good-price/performance devices were out of stock in the 
120-200 GB range.

> All snapshots read-write.  However, I have mainly treated them as
> read-only. Does that make a difference?

See above.  It definitely will if you're not using noatime when mounting 
them.

>>Apropos Nada: quick shout out to Qu to wish him luck for the 4.6 merge.
> 
> I'm wondering if it is time for an update from 4.0.4?

The going list recommendation is to choose either current kernel track or 
LTS kernel track.  If you choose current kernel, the recommendation is to 
stick within 1-2 kernel cycles of newest current, which with 4.5 about to 
come out, means you would be on 4.3 at the oldest, and be looking at 4.4 
by now, again, on the current kernel track.

If you choose LTS kernels, until recently, the recommendation was again 
the latest two, but here LTS kernel cycles.  That would be 4.4 as the 
newest LTS and 4.1 previous to that.  However, 3.18, the LTS kernel 
previous to 4.1, has been holding up reasonably well, so while 4.1 would 
be preferred, 3.18 remains reasonably well supported as well.

You're on 4.0, which isn't an LTS kernel series and is thus, along with 
4.2, out of upstream's support window.  So it's past time to look at 
updating. =:^)  Given that you obviously do _not_ follow the last couple 
current kernels rule, I'd strongly recommend that you consider switching 
to an LTS kernel, and given that you're on 4.0 now, the 4.1 or 4.4 LTS 
kernels would be your best candidates.  4.1 should be supported for quite 
some time yet, both btrfs-wise and in general, and would be the minimal 
incremental upgrade, but of course if your object is to upgrade as far as 
you reasonably can when you /do/ upgrade, 4.4, the latest LTS, is perhaps 
your best candidate.

In normal operation, the btrfs-progs userspace version isn't as critical, 
as long as it has support for the features you're using, of course, 
because for most normal runtime tasks, all progs does is make the 
appropriate calls to the kernel to do the real work anyway.  But as soon 
as you find yourself trying to fix a filesystem that isn't working 
properly and possibly won't mount, btrfs-progs version becomes more 
critical, as the newest versions can fix more bugs than older versions, 
which didn't know about the bugs discovered since then.

As a result, a reasonable userspace rule of thumb is to use at _least_ a 
version corresponding to your kernel.  Newer is fine as well, but using 
at _least_ a version corresponding to your kernel means you're running a 
userspace that was developed with that kernel in mind, and also, as long 
as you're following kernel recommendations already, nicely keeps your 
userspace from getting /too/ outdated, to the point that the commands and 
output are enough different from current userspace to create problems 
when you post command output to the list, etc.

>>[Also, damn you autocorrection on my phone!]
> 
> Yep!

I'm one of those folks who still doesn't have a cell phone -- tho I have 
a VoIP adaptor hooked up to my internet, and and a cordless phone 
attached to it (and pay... about $30/year to a VoIP phone service for a 
phone number and US-domestic dialing without additional fees... tho 
obviously I have to keep an internet connection to keep that working, but 
that's why I don't have a cell, at the pitifully small full-speed data 
limits, I can't switch to cell for data, and it's simply not cost 
effective for voice when I can get full US phone coverage at no 
additional cost for what amounts to $2.50/mo.).

But FWIW, if you've not already discovered it, plug in phone autocorrect 
on youtube some day when you have some time, and be prepared to spend a 
few hours laughing your *** off!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Snapshots slowing system
@ 2016-03-11 20:03 Pete
  2016-03-11 23:38 ` boris
  0 siblings, 1 reply; 17+ messages in thread
From: Pete @ 2016-03-11 20:03 UTC (permalink / raw)
  To: linux-btrfs

I though I would post this in case it was useful info for the list.  No
help needed as I have a fix (sort of).

I've an PC with a 3 core Phenom 720 CPU, 8GB of RAM.  / is in a RAID0
SSD btrfs file system and data, home directories, various bits of data
etc, on 2x3TB disks at btrfs RAID1.  The data file system contains the
data spread across about 5 or 6 subvolumes for ease of management.
Kernel 4.0.4.

I wrote a script which performs snapshots for appropriate subvolumes on
each file system.  Hourly snapshots were taken and kept for 24 hours
before deletion.  Daily ones 30 days and weekly ones about a year.  So
each sub-volume had approx 86 snapshots.  This worked well with the odd
sluggish response from the file system, but these were infrequent and I
was happy to accept them given the benefits of subvolumes & snapshots.

Over the past few weeks I had noticed a degradation in performance to
the point where it paused with busy disks when trying to do anything
that might involve disks until it got to an unacceptable state.  Not
sure that anything had changed, but the slowness came on over a period
of a couple weeks.

I fixed this by disabling the hourly snapshots and deleting them.
System is back to normal.  Though I would share in case there is any
value in this info for the devs.

Kind regards,

Pete

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Snapshots slowing system
  2016-03-11 20:03 Pete
@ 2016-03-11 23:38 ` boris
  0 siblings, 0 replies; 17+ messages in thread
From: boris @ 2016-03-11 23:38 UTC (permalink / raw)
  To: linux-btrfs

I wondered whether you had elimated fragmentation, or any other known gotchas, 
as a cause?

Out of curiosity, what is/was the utilisation of the disk? Were the snapshots 
read-only or read-write?

Apropos Nada: quick shout out to Qu to wish him luck for the 4.6 merge.

[Also, damn you autocorrection on my phone!]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-03-19  1:15 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-03-14 23:03 Snapshots slowing system pete
2016-03-15 15:52 ` Duncan
2016-03-15 22:29   ` Peter Chant
2016-03-16 11:39     ` Austin S. Hemmelgarn
2016-03-17 21:08       ` Pete
2016-03-18  9:17         ` Duncan
2016-03-18 11:38           ` Austin S. Hemmelgarn
2016-03-18 17:58             ` Pete
2016-03-18 23:58             ` Duncan
2016-03-18 18:16           ` Pete
2016-03-18 18:54             ` Austin S. Hemmelgarn
2016-03-19  0:59               ` Duncan
2016-03-19  1:15             ` Duncan
  -- strict thread matches above, loose matches on Subject: below --
2016-03-12 13:01 pete
2016-03-13  3:28 ` Duncan
2016-03-11 20:03 Pete
2016-03-11 23:38 ` boris

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).