linux-btrfs.vger.kernel.org archive mirror
* Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
@ 2018-08-17  9:08 Martin Steigerwald
  2018-08-17 11:58 ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 11+ messages in thread
From: Martin Steigerwald @ 2018-08-17  9:08 UTC (permalink / raw)
  To: linux-btrfs

Hi!

This happened about two weeks ago. I already dealt with it and all is 
well.

Linux hung on suspend, so I switched off this ThinkPad T520 forcefully. 
After that it did not boot the operating system anymore. The Intel SSD 
320 (latest firmware, which is supposed to fix this bug but apparently 
does not) reported a capacity of only 8 MiB, and those 8 MiB contain 
nothing but zeros.

Access via GRML and "mount -fo degraded" worked. I initially was even 
able to write onto this degraded filesystem. First I copied all data to 
a backup drive.

I even started a balance to "single" so that it would work with one SSD.

But later I learned that a secure erase may recover the Intel SSD 320, 
and since I had no other SSD at hand, I did that. And yes, it recovered 
the drive. So I canceled the balance.

I partitioned the Intel SSD 320 and put LVM on it, just as I had it 
before. But by that time I was no longer able to mount the degraded 
BTRFS on the other SSD as writable, not even with "-f" ("I know what I 
am doing"). Thus I was not able to add a device to it and balance it 
back to RAID 1. Even "btrfs replace" was not working.

I thus formatted a new BTRFS RAID 1 and restored.

A week later I migrated the Intel SSD 320 to a Samsung 860 Pro. Again 
via one full backup and restore cycle. However, this time I was able to 
copy most of the data of the Intel SSD 320 with "mount -fo degraded" via 
eSATA and thus the copy operation was way faster.

So conclusion:

1. Pro: BTRFS RAID 1 really protected my data against a complete SSD 
outage.

2. Con: It does not allow me to add a device and balance back to RAID 1, 
or to replace a device that is already missing, at this time.

3. I keep using BTRFS RAID 1 on two SSDs for often changed, critical 
data.

4. And yes, I know it does not replace a backup. As it was the holidays 
and I was lazy, the backup was already two weeks old, so I was happy to 
have all my data still on the other SSD.

5. The kernel error messages when mounting without "-o degraded" are 
less than helpful. They indicate a corrupted filesystem instead of 
simply saying that one device is missing and that "-o degraded" would 
help here.


I have seen a discussion about the limitation in point 2: that allowing 
one to add a device and make it into a RAID 1 again might be dangerous, 
because of the system chunk and probably other reasons. I did not 
completely read and understand it though.

So I still don´t get it, because:

Either it is a RAID 1, in which case one disk may fail and I still have 
*all* data. Also for the system chunk, which according to btrfs fi df / 
btrfs fi sh was indeed RAID 1. If so, then period. Then I don´t see why 
it would need to disallow me to make it into a RAID 1 again after one 
device has been lost.

Or it is no RAID 1, and then what is the point to begin with? As I was 
able to copy off all data from the degraded mount, I´d say it was a 
RAID 1.

(I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just does 
two copies regardless of how many drives you use.)


For this laptop it was not all that important, but I wonder about BTRFS 
RAID 1 in an enterprise environment, because restoring from backup adds 
significantly more downtime.

Anyway, creating a new filesystem may have been for the better here, 
because it replaced a BTRFS that had aged over several years with a new 
one. Due to the increased capacity, and because I thought the Samsung 
860 Pro compresses data itself, I removed LZO compression. This also 
gives larger extents on files that are not fragmented or only slightly 
fragmented. I think the Intel SSD 320 did not compress, but the Crucial 
m500 mSATA SSD does. That has been the secondary SSD that still had all 
the data after the outage of the Intel SSD 320.


Overall I am happy, because BTRFS RAID 1 gave me access to the data 
after the SSD outage. That is the most important thing about it for me.

Thanks,
-- 
Martin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
  2018-08-17  9:08 Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD Martin Steigerwald
@ 2018-08-17 11:58 ` Austin S. Hemmelgarn
  2018-08-17 12:28   ` Martin Steigerwald
  0 siblings, 1 reply; 11+ messages in thread
From: Austin S. Hemmelgarn @ 2018-08-17 11:58 UTC (permalink / raw)
  To: Martin Steigerwald, linux-btrfs

On 2018-08-17 05:08, Martin Steigerwald wrote:
> Hi!
> 
> This happened about two weeks ago. I already dealt with it and all is
> well.
> 
> Linux hung on suspend so I switched off this ThinkPad T520 forcefully.
> After that it did not boot the operating system anymore. Intel SSD 320,
> latest firmware, which should patch this bug, but apparently does not,
> is only 8 MiB big. Those 8 MiB just contain zeros.
> 
> Access via GRML and "mount -fo degraded" worked. I initially was even
> able to write onto this degraded filesystem. First I copied all data to
> a backup drive.
> 
> I even started a balance to "single" so that it would work with one SSD.
> 
> But later I learned that secure erase may recover the Intel SSD 320 and
> since I had no other SSD at hand, did that. And yes, it did. So I
> canceled the balance.
> 
> I partitioned the Intel SSD 320 and put LVM on it, just as I had it. But
> at that time I was not able to mount the degraded BTRFS on the other SSD
> as writable anymore, not even with "-f" "I know what I am doing". Thus I
> was not able to add a device to it and btrfs balance it to RAID 1. Even
> "btrfs replace" was not working.
> 
> I thus formatted a new BTRFS RAID 1 and restored.
> 
> A week later I migrated the Intel SSD 320 to a Samsung 860 Pro. Again
> via one full backup and restore cycle. However, this time I was able to
> copy most of the data of the Intel SSD 320 with "mount -fo degraded" via
> eSATA and thus the copy operation was way faster.
> 
> So conclusion:
> 
> 1. Pro: BTRFS RAID 1 really protected my data against a complete SSD
> outage.
Glad to hear I'm not the only one!
> 
> 2. Con:  It does not allow me to add a device and balance to RAID 1 or
> replace one device that is already missing at this time.
See below where you comment about this more, I've replied regarding it 
there.
> 
> 3. I keep using BTRFS RAID 1 on two SSDs for often changed, critical
> data.
> 
> 4. And yes, I know it does not replace a backup. As it was holidays and
> I was lazy backup was two weeks old already, so I was happy to have all
> my data still on the other SSD.
> 
> 5. The error messages in kernel when mounting without "-o degraded" are
> less than helpful. They indicate a corrupted filesystem instead of just
> telling that one device is missing and "-o degraded" would help here.
Agreed, the kernel error messages need significant improvement, not just 
for this case, but in general (I would _love_ to make sure that there 
are exactly zero exit paths for open_ctree that don't involve a proper 
error message being printed beyond the ubiquitous `open_ctree failed` 
message you get when it fails).
> 
> 
> I have seen a discussion about the limitation in point 2. That allowing
> to add a device and make it into RAID 1 again might be dangerous, cause
> of system chunk and probably other reasons. I did not completely read
> and understand it tough.
> 
> So I still don´t get it, cause:
> 
> Either it is a RAID 1, then, one disk may fail and I still have *all*
> data. Also for the system chunk, which according to btrfs fi df / btrfs
> fi sh was indeed RAID 1. If so, then period. Then I don´t see why it
> would need to disallow me to make it into an RAID 1 again after one
> device has been lost.
> 
> Or it is no RAID 1 and then what is the point to begin with? As I was
> able to copy of all date of the degraded mount, I´d say it was a RAID 1.
> 
> (I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just does
> two copies regardless of how many drives you use.)
So, what's happening here is a bit complicated.  The issue is entirely 
with older kernels that are missing a couple of specific patches, but it 
appears that not all distributions have their kernels updated to include 
those patches yet.

In short, when you have a volume consisting of _exactly_ two devices 
using raid1 profiles that is missing one device, and you mount it 
writable and degraded on such a kernel, newly created chunks will be 
single-profile chunks instead of raid1 chunks with one half missing. 
Any write has the potential to trigger allocation of a new chunk, and 
more importantly any _read_ has the potential to trigger allocation of a 
new chunk if you don't use the `noatime` mount option (because a read 
will trigger an atime update, which results in a write).

When older kernels then go and try to mount that volume a second time, 
they see that there are single-profile chunks (which can't tolerate 
_any_ device failures), and refuse to mount at all (because they can't 
guarantee that metadata is intact).  Newer kernels fix this part by 
checking per-chunk if a chunk is degraded/complete/missing, which avoids 
this because all the single chunks are on the remaining device.
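
As a quick, hedged illustration: you can check whether this has happened 
to a given filesystem by looking at the chunk profiles (the mount point 
is a placeholder):

  btrfs filesystem df /mnt
  btrfs filesystem usage /mnt

Any "Data, single" or "Metadata, single" lines showing up next to the 
RAID1 ones (ignore "GlobalReserve, single", which is always there) are 
most likely chunks created while the volume was mounted writable and 
degraded on such a kernel.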

As far as avoiding this in the future:

* If you're just pulling data off the device, mark the device read-only 
in the _block layer_, not the filesystem, before you mount it.  If 
you're using LVM, just mark the LV read-only using LVM commands (see 
the short sketch after this list).  This will make 100% certain that 
nothing gets written to the device, and thus makes sure that you won't 
accidentally cause issues like this.
* If you're going to convert to a single device, just do it and don't 
stop it part way through.  In particular, make sure that your system 
will not lose power.
* Otherwise, don't mount the volume unless you know you're going to 
repair it.
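
A minimal sketch of that first option, with device, VG/LV and mount 
point names as placeholders:

  # whole-device case: read-only at the block layer
  blockdev --setro /dev/sdb
  # LVM case: mark just the LV read-only
  lvchange --permission r vg0/data
  # then mount read-only and degraded
  mount -o ro,degraded /dev/sdb /mnt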
> 
> 
> For this laptop it was not all that important but I wonder about BTRFS
> RAID 1 in enterprise environment, cause restoring from backup adds a
> significantly higher downtime.
> 
> Anyway, creating a new filesystem may have been better here anyway,
> cause it replaced an BTRFS that aged over several years with a new one.
> Due to the increased capacity and due to me thinking that Samsung 860
> Pro compresses itself, I removed LZO compression. This would also give
> larger extents on files that are not fragmented or only slightly
> fragmented. I think that Intel SSD 320 did not compress, but Crucial
> m500 mSATA SSD does. That has been the secondary SSD that still had all
> the data after the outage of the Intel SSD 320.
First off, keep in mind that the SSD firmware doing compression only 
really helps with wear-leveling.  Doing it in the filesystem will help 
not only with that, but will also give you more space to work with.

Secondly, keep in mind that most SSDs use compression algorithms that 
are fast but don't generally get particularly amazing compression ratios 
(think LZ4 or Snappy for examples of this).  In comparison, BTRFS 
provides a couple of options that are slower but get far better ratios 
most of the time (zlib, and more recently zstd, which is actually pretty 
fast).
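
If you want a rough feel for that gap on your own data, and assuming the 
lz4 and zstd command-line tools are installed, something like this (the 
file name is a placeholder) prints the compressed sizes side by side:

  ls -l somefile
  lz4 -c somefile | wc -c       # fast algorithm, similar class to firmware
  zstd -19 -c somefile | wc -c  # slow, strong zstd setting

It's not exactly what either the firmware or BTRFS does, but the ratio 
difference is usually obvious.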
> 
> 
> Overall I am happy, cause BTRFS RAID 1 gave me access to the data after
> the SSD outage. That is the most important thing about it for me.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
  2018-08-17 11:58 ` Austin S. Hemmelgarn
@ 2018-08-17 12:28   ` Martin Steigerwald
  2018-08-17 12:50     ` Roman Mamedov
  2018-08-17 12:55     ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 11+ messages in thread
From: Martin Steigerwald @ 2018-08-17 12:28 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

Thanks for your detailed answer.  

Austin S. Hemmelgarn - 17.08.18, 13:58:
> On 2018-08-17 05:08, Martin Steigerwald wrote:
[…]
> > I have seen a discussion about the limitation in point 2. That
> > allowing to add a device and make it into RAID 1 again might be
> > dangerous, cause of system chunk and probably other reasons. I did
> > not completely read and understand it tough.
> > 
> > So I still don´t get it, cause:
> > 
> > Either it is a RAID 1, then, one disk may fail and I still have
> > *all*
> > data. Also for the system chunk, which according to btrfs fi df /
> > btrfs fi sh was indeed RAID 1. If so, then period. Then I don´t see
> > why it would need to disallow me to make it into an RAID 1 again
> > after one device has been lost.
> > 
> > Or it is no RAID 1 and then what is the point to begin with? As I
> > was
> > able to copy of all date of the degraded mount, I´d say it was a
> > RAID 1.
> > 
> > (I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just
> > does two copies regardless of how many drives you use.)
> 
> So, what's happening here is a bit complicated.  The issue is entirely
> with older kernels that are missing a couple of specific patches, but
> it appears that not all distributions have their kernels updated to
> include those patches yet.
> 
> In short, when you have a volume consisting of _exactly_ two devices
> using raid1 profiles that is missing one device, and you mount it
> writable and degraded on such a kernel, newly created chunks will be
> single-profile chunks instead of raid1 chunks with one half missing.
> Any write has the potential to trigger allocation of a new chunk, and
> more importantly any _read_ has the potential to trigger allocation of
> a new chunk if you don't use the `noatime` mount option (because a
> read will trigger an atime update, which results in a write).
> 
> When older kernels then go and try to mount that volume a second time,
> they see that there are single-profile chunks (which can't tolerate
> _any_ device failures), and refuse to mount at all (because they
> can't guarantee that metadata is intact).  Newer kernels fix this
> part by checking per-chunk if a chunk is degraded/complete/missing,
> which avoids this because all the single chunks are on the remaining
> device.

How new does the kernel need to be for that to happen?

Do I get this right that it is the kernel used for recovery, i.e. the 
one on the live distro, that needs to be new enough? The one on this 
laptop is meanwhile already 4.18.1.

I used the latest GRML stable release, 2017.05, which has a 4.9 kernel.

> As far as avoiding this in the future:

I hope that with the new Samsung 860 Pro together with the existing 
Crucial m500 I am spared from this for years to come. According to its 
SMART lifetime-used status, that Crucial SSD still has quite some time 
to go.

> * If you're just pulling data off the device, mark the device
> read-only in the _block layer_, not the filesystem, before you mount
> it.  If you're using LVM, just mark the LV read-only using LVM
> commands  This will make 100% certain that nothing gets written to
> the device, and thus makes sure that you won't accidentally cause
> issues like this.

> * If you're going to convert to a single device,
> just do it and don't stop it part way through.  In particular, make
> sure that your system will not lose power.

> * Otherwise, don't mount the volume unless you know you're going to
> repair it.

Thanks for those. Good to keep in mind.

> > For this laptop it was not all that important but I wonder about
> > BTRFS RAID 1 in enterprise environment, cause restoring from backup
> > adds a significantly higher downtime.
> > 
> > Anyway, creating a new filesystem may have been better here anyway,
> > cause it replaced an BTRFS that aged over several years with a new
> > one. Due to the increased capacity and due to me thinking that
> > Samsung 860 Pro compresses itself, I removed LZO compression. This
> > would also give larger extents on files that are not fragmented or
> > only slightly fragmented. I think that Intel SSD 320 did not
> > compress, but Crucial m500 mSATA SSD does. That has been the
> > secondary SSD that still had all the data after the outage of the
> > Intel SSD 320.
> 
> First off, keep in mind that the SSD firmware doing compression only
> really helps with wear-leveling.  Doing it in the filesystem will help
> not only with that, but will also give you more space to work with.

While also reducing the ability of the SSD to wear-level. The more data 
I fit on the SSD, the less it can wear-level. And the better I compress 
that data, the less it can wear-level.

> Secondarily, keep in mind that most SSD's use compression algorithms
> that are fast, but don't generally get particularly amazing
> compression ratios (think LZ4 or Snappy for examples of this).  In
> comparison, BTRFS provides a couple of options that are slower, but
> get far better ratios most of the time (zlib, and more recently zstd,
> which is actually pretty fast).

I considered switching to zstd. But it may not be compatible with the 
grml 2017.05 4.9 kernel; of course I could test a grml snapshot with a 
newer kernel. I always like to be able to recover with some live distro 
:). And GRML is the one of my choice.

However… I am not all that convinced that it would benefit me as long 
as I have enough space. That SSD replacement more than doubled capacity, 
from about 680 GB to 1480 GB. I have a ton of free space in the 
filesystems – usage of /home is only 46% for example – and there are 96 
GiB completely unused in LVM on the Crucial SSD and even more than 183 
GiB completely unused on the Samsung SSD. The system is doing weekly 
"fstrim" runs on all filesystems. I think that this is more than is 
needed for the longevity of the SSDs, but well, actually I just don´t 
need the space, so… 

Of course, in case I manage to fill up all that space, I consider using 
compression. Until then, I am not all that convinced that I´d benefit 
from it.

Of course it may increase read speeds, and in the case of nicely 
compressible data also write speeds, but I am not sure whether that even 
matters. Also it uses up some CPU cycles on a dual-core (+ 
hyperthreading) Sandy Bridge mobile i5. While I am not sure about it, I 
bet having larger possible extent sizes may also help a bit. And no 
compression may also help a bit with fragmentation.

Well putting this to a (non-scientific) test:

[…]/.local/share/akonadi/db_data/akonadi> du -sh * | sort -rh | head -5
3,1G    parttable.ibd

[…]/.local/share/akonadi/db_data/akonadi> filefrag parttable.ibd 
parttable.ibd: 11583 extents found

Hmmm, already quite a lot of extents after just about one week with the 
new filesystem. On the old filesystem I had somewhere around 40000-50000 
extents on that file.


Well actually what do I know: I don´t even have an idea whether not 
using compression would be beneficial. Maybe it does not even matter all 
that much.

I bet testing it to the point that I could be sure about it for my 
workload would take a considerable amount of time.

Ciao,
-- 
Martin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
  2018-08-17 12:28   ` Martin Steigerwald
@ 2018-08-17 12:50     ` Roman Mamedov
  2018-08-17 13:01       ` Austin S. Hemmelgarn
  2018-08-17 21:17       ` Martin Steigerwald
  2018-08-17 12:55     ` Austin S. Hemmelgarn
  1 sibling, 2 replies; 11+ messages in thread
From: Roman Mamedov @ 2018-08-17 12:50 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Austin S. Hemmelgarn, linux-btrfs

On Fri, 17 Aug 2018 14:28:25 +0200
Martin Steigerwald <martin@lichtvoll.de> wrote:

> > First off, keep in mind that the SSD firmware doing compression only
> > really helps with wear-leveling.  Doing it in the filesystem will help
> > not only with that, but will also give you more space to work with.
> 
> While also reducing the ability of the SSD to wear-level. The more data 
> I fit on the SSD, the less it can wear-level. And the better I compress 
> that data, the less it can wear-level.

Do not consider SSD "compression" as a factor in any of your calculations or
planning. Modern controllers do not do it anymore; the last ones that did were
SandForce, and that's 2010-era stuff. You can check for yourself by comparing
write speeds of compressible vs incompressible data; it should be the same. At
most, the modern ones know to recognize a stream of binary zeroes and have a
special case for that.

As a general comment on this thread, always try to save the exact messages
you get when troubleshooting or getting failures from your system. Saying just
"was not able to add" or "btrfs replace not working" without any exact details
isn't really helpful as a bug report or even as a general "experiences" story,
as we don't know what the exact cause of those was, whether it could have been
avoided or worked around, not to mention what your FS state was at the time
(as in "btrfs fi show" and "fi df").

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
  2018-08-17 12:28   ` Martin Steigerwald
  2018-08-17 12:50     ` Roman Mamedov
@ 2018-08-17 12:55     ` Austin S. Hemmelgarn
  2018-08-17 21:26       ` Martin Steigerwald
  1 sibling, 1 reply; 11+ messages in thread
From: Austin S. Hemmelgarn @ 2018-08-17 12:55 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs

On 2018-08-17 08:28, Martin Steigerwald wrote:
> Thanks for your detailed answer.
> 
> Austin S. Hemmelgarn - 17.08.18, 13:58:
>> On 2018-08-17 05:08, Martin Steigerwald wrote:
> […]
>>> I have seen a discussion about the limitation in point 2. That
>>> allowing to add a device and make it into RAID 1 again might be
>>> dangerous, cause of system chunk and probably other reasons. I did
>>> not completely read and understand it tough.
>>>
>>> So I still don´t get it, cause:
>>>
>>> Either it is a RAID 1, then, one disk may fail and I still have
>>> *all*
>>> data. Also for the system chunk, which according to btrfs fi df /
>>> btrfs fi sh was indeed RAID 1. If so, then period. Then I don´t see
>>> why it would need to disallow me to make it into an RAID 1 again
>>> after one device has been lost.
>>>
>>> Or it is no RAID 1 and then what is the point to begin with? As I
>>> was
>>> able to copy of all date of the degraded mount, I´d say it was a
>>> RAID 1.
>>>
>>> (I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just
>>> does two copies regardless of how many drives you use.)
>>
>> So, what's happening here is a bit complicated.  The issue is entirely
>> with older kernels that are missing a couple of specific patches, but
>> it appears that not all distributions have their kernels updated to
>> include those patches yet.
>>
>> In short, when you have a volume consisting of _exactly_ two devices
>> using raid1 profiles that is missing one device, and you mount it
>> writable and degraded on such a kernel, newly created chunks will be
>> single-profile chunks instead of raid1 chunks with one half missing.
>> Any write has the potential to trigger allocation of a new chunk, and
>> more importantly any _read_ has the potential to trigger allocation of
>> a new chunk if you don't use the `noatime` mount option (because a
>> read will trigger an atime update, which results in a write).
>>
>> When older kernels then go and try to mount that volume a second time,
>> they see that there are single-profile chunks (which can't tolerate
>> _any_ device failures), and refuse to mount at all (because they
>> can't guarantee that metadata is intact).  Newer kernels fix this
>> part by checking per-chunk if a chunk is degraded/complete/missing,
>> which avoids this because all the single chunks are on the remaining
>> device.
> 
> How new the kernel needs to be for that to happen?
> 
> Do I get this right that it would be the kernel used for recovery, i.e.
> the one on the live distro that needs to be new enough? To one on this
> laptop meanwhile is already 4.18.1.
Yes, the kernel used for recovery is the important one here.  I don't 
remember for certain when the patches went in, but I'm pretty sure it's 
been no earlier than 4.14.  FWIW, I'm pretty sure SystemRescueCD has a 
new enough kernel, but they still (sadly) lack zstd support.
> 
> I used latest GRML stable release 2017.05 which has an 4.9 kernel.
While I don't know exactly when the patches went in, I'm fairly certain 
that 4.9 never got them.
> 
>> As far as avoiding this in the future:
> 
> I hope that with the new Samsung Pro 860 together with the existing
> Crucial m500 I am spared from this for years to come. That Crucial SSD
> according to SMART status about lifetime used has still quite some time
> to go.
Yes, hopefully.  And the SMART status on that Crucial is probably 
right; in my experience they tend to do a very good job of accurately 
measuring life expectancy (that, or they're just _really_ good at 
predicting failures; I've never had a Crucial SSD that did not indicate 
correctly in the SMART status that it would fail in the near future).
> 
>> * If you're just pulling data off the device, mark the device
>> read-only in the _block layer_, not the filesystem, before you mount
>> it.  If you're using LVM, just mark the LV read-only using LVM
>> commands  This will make 100% certain that nothing gets written to
>> the device, and thus makes sure that you won't accidentally cause
>> issues like this.
> 
>> * If you're going to convert to a single device,
>> just do it and don't stop it part way through.  In particular, make
>> sure that your system will not lose power.
> 
>> * Otherwise, don't mount the volume unless you know you're going to
>> repair it.
> 
> Thanks for those. Good to keep in mind.
The last one is actually good advice in general, not just for BTRFS.  I 
can't count how many stories I've heard of people who tried to run half 
an array simply to avoid downtime, and ended up making things far worse 
than they were as a result.
> 
>>> For this laptop it was not all that important but I wonder about
>>> BTRFS RAID 1 in enterprise environment, cause restoring from backup
>>> adds a significantly higher downtime.
>>>
>>> Anyway, creating a new filesystem may have been better here anyway,
>>> cause it replaced an BTRFS that aged over several years with a new
>>> one. Due to the increased capacity and due to me thinking that
>>> Samsung 860 Pro compresses itself, I removed LZO compression. This
>>> would also give larger extents on files that are not fragmented or
>>> only slightly fragmented. I think that Intel SSD 320 did not
>>> compress, but Crucial m500 mSATA SSD does. That has been the
>>> secondary SSD that still had all the data after the outage of the
>>> Intel SSD 320.
>>
>> First off, keep in mind that the SSD firmware doing compression only
>> really helps with wear-leveling.  Doing it in the filesystem will help
>> not only with that, but will also give you more space to work with.
> 
> While also reducing the ability of the SSD to wear-level. The more data
> I fit on the SSD, the less it can wear-level. And the better I compress
> that data, the less it can wear-level.
No, the better you compress the data, the _less_ data you are physically 
putting on the SSD, just like compressing a file makes it take up less 
space.  This actually makes it easier for the firmware to do 
wear-leveling.  Wear-leveling is entirely about picking where to put 
data, and by reducing the total amount of data you are writing to the 
SSD, you're making that decision easier for the firmware, and also 
reducing the number of blocks of flash memory needed (which also helps 
with SSD life expectancy because it translates to fewer erase cycles).

The compression they do internally operates on the same principle; the 
only difference is that you have no control over how it's done and no 
way to see exactly how efficient it is (but it's pretty well known that 
it needs to be fast, and fast compression usually does not get good 
compression ratios).
> 
>> Secondarily, keep in mind that most SSD's use compression algorithms
>> that are fast, but don't generally get particularly amazing
>> compression ratios (think LZ4 or Snappy for examples of this).  In
>> comparison, BTRFS provides a couple of options that are slower, but
>> get far better ratios most of the time (zlib, and more recently zstd,
>> which is actually pretty fast).
> 
> I considered switching to zstd. But it may not be compatible with grml
> 2017.05 4.9 kernel, of course I could test a grml snapshot with a newer
> kernel. I always like to be able to recover with some live distro :).
> And GRML is the one of my choice.
> 
> However… I am not all that convinced that it would benefit me as long as
> I have enough space. That SSD replacement more than doubled capacity
> from about 680 TB to 1480 TB. I have ton of free space in the
> filesystems – usage of /home is only 46% for example – and there are 96
> GiB completely unused in LVM on the Crucial SSD and even more than 183
> GiB completely unused on Samsung SSD. The system is doing weekly
> "fstrim" on all filesystems. I think that this is more than is needed
> for the longevity of the SSDs, but well actually I just don´t need the
> space, so…
> 
> Of course, in case I manage to fill up all that space, I consider using
> compression. Until then, I am not all that convinced that I´d benefit
> from it.
> 
> Of course it may increase read speeds and in case of nicely compressible
> data also write speeds, I am not sure whether it even matters. Also it
> uses up some CPU cycles on a dual core (+ hyperthreading) Sandybridge
> mobile i5. While I am not sure about it, I bet also having larger
> possible extent sizes may help a bit. As well as no compression may also
> help a bit with fragmentation.
It generally does actually. Less data physically on the device means 
lower chances of fragmentation.  In your case, it may not improve speed 
much though (your i5 _probably_ can't compress data much faster than it 
can access your SSDs, which means you likely won't see much performance 
benefit other than reducing fragmentation).
> 
> Well putting this to a (non-scientific) test:
> 
> […]/.local/share/akonadi/db_data/akonadi> du -sh * | sort -rh | head -5
> 3,1G    parttable.ibd
> 
> […]/.local/share/akonadi/db_data/akonadi> filefrag parttable.ibd
> parttable.ibd: 11583 extents found
> 
> Hmmm, already quite many extents after just about one week with the new
> filesystem. On the old filesystem I had somewhat around 40000-50000
> extents on that file.
Filefrag doesn't properly handle compressed files on BTRFS.  It treats 
each 128KiB compression block as a separate extent, even though they may 
be contiguous as part of one BTRFS extent.  That one file by itself 
should have reported as about 25396 extents on the old volume (assuming 
it was entirely compressed), so your numbers seem to match up 
realistically.
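
If you want to see how much compression is actually saving, rather than 
inferring it from extent counts, a tool like compsize (packaged as 
btrfs-compsize in Debian, assuming it is available to you) reports the 
real on-disk versus uncompressed sizes per compression type:

  compsize /path/to/file-or-directory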
> 
> Well actually what do I know: I don´t even have an idea whether not
> using compression would be beneficial. Maybe it does not even matter all
> that much.
> 
> I bet testing it to the point that I could be sure about it for my
> workload would take considerable amount of time.
> 
One last quick thing about compression in general on BTRFS.  Unless you 
have a lot of files that are likely to be completely incompressible, 
you're generally better off using `compress-force` instead of 
`compress`.  With regular `compress`, BTRFS will try to compress the 
first few blocks of a file, and if that fails will mark the file as 
incompressible and not try to compress any of it automatically ever 
again.  With `compress-force`, BTRFS will just unconditionally compress 
everything.
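
As a hedged example of what that looks like in practice (the mount point 
and filesystem UUID are placeholders, and remounting only affects newly 
written data):

  mount -o remount,compress-force=zstd /home

or permanently in /etc/fstab:

  UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /home  btrfs  defaults,noatime,compress-force=zstd  0  0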

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
  2018-08-17 12:50     ` Roman Mamedov
@ 2018-08-17 13:01       ` Austin S. Hemmelgarn
  2018-08-17 21:16         ` Martin Steigerwald
  2018-08-17 21:17       ` Martin Steigerwald
  1 sibling, 1 reply; 11+ messages in thread
From: Austin S. Hemmelgarn @ 2018-08-17 13:01 UTC (permalink / raw)
  To: Roman Mamedov, Martin Steigerwald; +Cc: linux-btrfs

On 2018-08-17 08:50, Roman Mamedov wrote:
> On Fri, 17 Aug 2018 14:28:25 +0200
> Martin Steigerwald <martin@lichtvoll.de> wrote:
> 
>>> First off, keep in mind that the SSD firmware doing compression only
>>> really helps with wear-leveling.  Doing it in the filesystem will help
>>> not only with that, but will also give you more space to work with.
>>
>> While also reducing the ability of the SSD to wear-level. The more data
>> I fit on the SSD, the less it can wear-level. And the better I compress
>> that data, the less it can wear-level.
> 
> Do not consider SSD "compression" as a factor in any of your calculations or
> planning. Modern controllers do not do it anymore, the last ones that did are
> SandForce, and that's 2010 era stuff. You can check for yourself by comparing
> write speeds of compressible vs incompressible data, it should be the same. At
> most, the modern ones know to recognize a stream of binary zeroes and have a
> special case for that.
All that testing write speeds for compressible versus incompressible 
data tells you is whether the SSD is doing real-time compression of 
data, not whether it is doing any compression at all.  Also, this test 
only works if you turn the write cache on the device off.

Besides, you can't prove 100% for certain that any manufacturer who does 
not sell their controller chips isn't doing this, which means there are 
a few manufacturers that may still be doing it.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
  2018-08-17 13:01       ` Austin S. Hemmelgarn
@ 2018-08-17 21:16         ` Martin Steigerwald
  0 siblings, 0 replies; 11+ messages in thread
From: Martin Steigerwald @ 2018-08-17 21:16 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Roman Mamedov, linux-btrfs

Austin S. Hemmelgarn - 17.08.18, 15:01:
> On 2018-08-17 08:50, Roman Mamedov wrote:
> > On Fri, 17 Aug 2018 14:28:25 +0200
> > 
> > Martin Steigerwald <martin@lichtvoll.de> wrote:
> >>> First off, keep in mind that the SSD firmware doing compression
> >>> only
> >>> really helps with wear-leveling.  Doing it in the filesystem will
> >>> help not only with that, but will also give you more space to
> >>> work with.>> 
> >> While also reducing the ability of the SSD to wear-level. The more
> >> data I fit on the SSD, the less it can wear-level. And the better
> >> I compress that data, the less it can wear-level.
> > 
> > Do not consider SSD "compression" as a factor in any of your
> > calculations or planning. Modern controllers do not do it anymore,
> > the last ones that did are SandForce, and that's 2010 era stuff.
> > You can check for yourself by comparing write speeds of
> > compressible vs incompressible data, it should be the same. At
> > most, the modern ones know to recognize a stream of binary zeroes
> > and have a special case for that.
> 
> All that testing write speeds forz compressible versus incompressible
> data tells you is if the SSD is doing real-time compression of data,
> not if they are doing any compression at all..  Also, this test only
> works if you turn the write-cache on the device off.

As the data still needs to be transferred to the SSD, I bet that, at 
least when the SATA connection is maxed out, you won´t see any 
difference in write speed whether the SSD compresses in real time or 
not.

> Besides, you can't prove 100% for certain that any manufacturer who
> does not sell their controller chips isn't doing this, which means
> there are a few manufacturers that may still be doing it.

Who really knows what SSD controller manufacturers are doing? I have not 
seen any Open Channel SSD stuff for laptops so far.

Thanks,
-- 
Martin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
  2018-08-17 12:50     ` Roman Mamedov
  2018-08-17 13:01       ` Austin S. Hemmelgarn
@ 2018-08-17 21:17       ` Martin Steigerwald
  2018-08-18  7:12         ` Roman Mamedov
  1 sibling, 1 reply; 11+ messages in thread
From: Martin Steigerwald @ 2018-08-17 21:17 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Austin S. Hemmelgarn, linux-btrfs

Hi Roman.

Now with proper CC.

Roman Mamedov - 17.08.18, 14:50:
> On Fri, 17 Aug 2018 14:28:25 +0200
> 
> Martin Steigerwald <martin@lichtvoll.de> wrote:
> > > First off, keep in mind that the SSD firmware doing compression
> > > only
> > > really helps with wear-leveling.  Doing it in the filesystem will
> > > help not only with that, but will also give you more space to
> > > work with.> 
> > While also reducing the ability of the SSD to wear-level. The more
> > data I fit on the SSD, the less it can wear-level. And the better I
> > compress that data, the less it can wear-level.
> 
> Do not consider SSD "compression" as a factor in any of your
> calculations or planning. Modern controllers do not do it anymore,
> the last ones that did are SandForce, and that's 2010 era stuff. You
> can check for yourself by comparing write speeds of compressible vs
> incompressible data, it should be the same. At most, the modern ones
> know to recognize a stream of binary zeroes and have a special case
> for that.

Interesting. Do you have any backup for your claim?

> As for general comment on this thread, always try to save the exact
> messages you get when troubleshooting or getting failures from your
> system. Saying just "was not able to add" or "btrfs replace not
> working" without any exact details isn't really helpful as a bug
> report or even as a general "experiences" story, as we don't know
> what was the exact cause of those, could that have been avoided or
> worked around, not to mention what was your FS state at the time (as
> in "btrfs fi show" and "fi df").

I had a screen.log, but I put it on the filesystem after the 
backup was made, so it was lost.

Anyway, the reason for not being able to add the device was the 
read-only state of the BTRFS, as I wrote. Same goes for replace. I was able 
to read the error message just fine. AFAIR the exact wording was "read 
only filesystem".

In any case: it was an experience report, not a request for help, so I 
don´t see why exact error messages are absolutely needed. If I had a 
support inquiry, that would be different, I agree.

Thanks,
-- 
Martin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
  2018-08-17 12:55     ` Austin S. Hemmelgarn
@ 2018-08-17 21:26       ` Martin Steigerwald
  0 siblings, 0 replies; 11+ messages in thread
From: Martin Steigerwald @ 2018-08-17 21:26 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

Austin S. Hemmelgarn - 17.08.18, 14:55:
> On 2018-08-17 08:28, Martin Steigerwald wrote:
> > Thanks for your detailed answer.
> > 
> > Austin S. Hemmelgarn - 17.08.18, 13:58:
> >> On 2018-08-17 05:08, Martin Steigerwald wrote:
[…]
> >>> Anyway, creating a new filesystem may have been better here
> >>> anyway,
> >>> cause it replaced an BTRFS that aged over several years with a new
> >>> one. Due to the increased capacity and due to me thinking that
> >>> Samsung 860 Pro compresses itself, I removed LZO compression. This
> >>> would also give larger extents on files that are not fragmented or
> >>> only slightly fragmented. I think that Intel SSD 320 did not
> >>> compress, but Crucial m500 mSATA SSD does. That has been the
> >>> secondary SSD that still had all the data after the outage of the
> >>> Intel SSD 320.
> >> 
> >> First off, keep in mind that the SSD firmware doing compression
> >> only
> >> really helps with wear-leveling.  Doing it in the filesystem will
> >> help not only with that, but will also give you more space to work
> >> with.> 
> > While also reducing the ability of the SSD to wear-level. The more
> > data I fit on the SSD, the less it can wear-level. And the better I
> > compress that data, the less it can wear-level.
> 
> No, the better you compress the data, the _less_ data you are
> physically putting on the SSD, just like compressing a file makes it
> take up less space.  This actually makes it easier for the firmware
> to do wear-leveling.  Wear-leveling is entirely about picking where
> to put data, and by reducing the total amount of data you are writing
> to the SSD, you're making that decision easier for the firmware, and
> also reducing the number of blocks of flash memory needed (which also
> helps with SSD life expectancy because it translates to fewer erase
> cycles).

On one hand I can go with this, but:

If I fill the SSD to 99% with already compressed data, then, in case it 
compresses data itself for wear leveling, it has less chance to 
wear-level than with 99% of not yet compressed data that it could 
compress itself.

That was the point I was trying to make.

Sure, with a fill rate of about 46% for home, compression would help 
the wear leveling. And if the controller does not compress at all, it 
would help as well.

Hmmm, maybe I will enable "zstd", but on the other hand I save CPU 
cycles by not enabling it. 

> > However… I am not all that convinced that it would benefit me as
> > long as I have enough space. That SSD replacement more than doubled
> > capacity from about 680 TB to 1480 TB. I have ton of free space in
> > the filesystems – usage of /home is only 46% for example – and
> > there are 96 GiB completely unused in LVM on the Crucial SSD and
> > even more than 183 GiB completely unused on Samsung SSD. The system
> > is doing weekly "fstrim" on all filesystems. I think that this is
> > more than is needed for the longevity of the SSDs, but well
> > actually I just don´t need the space, so…
> > 
> > Of course, in case I manage to fill up all that space, I consider
> > using compression. Until then, I am not all that convinced that I´d
> > benefit from it.
> > 
> > Of course it may increase read speeds and in case of nicely
> > compressible data also write speeds, I am not sure whether it even
> > matters. Also it uses up some CPU cycles on a dual core (+
> > hyperthreading) Sandybridge mobile i5. While I am not sure about
> > it, I bet also having larger possible extent sizes may help a bit.
> > As well as no compression may also help a bit with fragmentation.
> 
> It generally does actually. Less data physically on the device means
> lower chances of fragmentation.  In your case, it may not improve

I thought "no compression" may help with fragmentation, but I think you 
think that "compression" helps with fragmentation and misunderstood what 
I wrote.

> speed much though (your i5 _probably_ can't compress data much faster
> than it can access your SSD's, which means you likely won't see much
> performance benefit other than reducing fragmentation).
> 
> > Well putting this to a (non-scientific) test:
> > 
> > […]/.local/share/akonadi/db_data/akonadi> du -sh * | sort -rh | head
> > -5 3,1G    parttable.ibd
> > 
> > […]/.local/share/akonadi/db_data/akonadi> filefrag parttable.ibd
> > parttable.ibd: 11583 extents found
> > 
> > Hmmm, already quite many extents after just about one week with the
> > new filesystem. On the old filesystem I had somewhat around
> > 40000-50000 extents on that file.
> 
> Filefrag doesn't properly handle compressed files on BTRFS.  It treats
> each 128KiB compression block as a separate extent, even though they
> may be contiguous as part of one BTRFS extent.  That one file by
> itself should have reported as about 25396 extents on the old volume
> (assuming it was entirely compressed), so your numbers seem to match
> up realistically.>

Oh, thanks. I did not know that filefrag does not understand extents for 
compressed files in BTRFS.

> > Well actually what do I know: I don´t even have an idea whether not
> > using compression would be beneficial. Maybe it does not even matter
> > all that much.
> > 
> > I bet testing it to the point that I could be sure about it for my
> > workload would take considerable amount of time.
> 
> One last quick thing about compression in general on BTRFS.  Unless
> you have a lot of files that are likely to be completely
> incompressible, you're generally better off using `compress-force`
> instead of `compress`.  With regular `compress`, BTRFS will try to
> compress the first few blocks of a file, and if that fails will mark
> the file as incompressible and not try to compress any of it
> automatically ever again.  With `compress-force`, BTRFS will just
> unconditionally compress everything.

Well, on one filesystem, which is on a single SSD, I do have lots of 
image files, mostly JPG, and audio files in MP3 or Ogg Vorbis formats.

Thanks,
-- 
Martin

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
  2018-08-17 21:17       ` Martin Steigerwald
@ 2018-08-18  7:12         ` Roman Mamedov
  2018-08-18  8:47           ` Martin Steigerwald
  0 siblings, 1 reply; 11+ messages in thread
From: Roman Mamedov @ 2018-08-18  7:12 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Austin S. Hemmelgarn, linux-btrfs

On Fri, 17 Aug 2018 23:17:33 +0200
Martin Steigerwald <martin@lichtvoll.de> wrote:

> > Do not consider SSD "compression" as a factor in any of your
> > calculations or planning. Modern controllers do not do it anymore,
> > the last ones that did are SandForce, and that's 2010 era stuff. You
> > can check for yourself by comparing write speeds of compressible vs
> > incompressible data, it should be the same. At most, the modern ones
> > know to recognize a stream of binary zeroes and have a special case
> > for that.
> 
> Interesting. Do you have any backup for your claim?

Just "something I read". I follow quote a bit of SSD-related articles and
reviews which often also include a section to talk about the controller
utilized, its background and technological improvements/changes -- and the
compression going out of fashion after SandForce seems to be considered a
well-known fact.

Incidentally, your old Intel 320 SSDs actually seem to be based on that old
SandForce controller (or at least license some of that IP to extend on it),
and hence those indeed might perform compression.

> As the data still needs to be transferred to the SSD at least when the 
> SATA connection is maxed out I bet you won´t see any difference in write 
> speed whether the SSD compresses in real time or not.

Most controllers expose two readings in SMART:

  - Lifetime writes from host (SMART attribute 241)
  - Lifetime writes to flash (attribute 233, or 177, or 173...)

It might be difficult to get the second one, as often it needs to be decoded
from others such as "Average block erase count" or "Wear leveling count".
(And seems to be impossible on Samsung NVMe ones, for example)

But if you have numbers for both, you know the write amplification of the
drive (and its past workload).

If there is compression at work, you'd see the 2nd number being somewhat, or
significantly lower -- and barely increase at all, if you write highly
compressible data. This is not typically observed on modern SSDs, except maybe
when writing zeroes. Writes to flash will be the same as writes from host, or
most often somewhat higher, as the hardware can typically erase flash only in
chunks of 2MB or so, hence there's quite a bit of under the hood reorganizing
going on. Also as a result depending on workloads the "to flash" number can be
much higher than "from host".

Point is, even when the SATA link is maxed out in both cases, you can still
check if there's compression at work via using those SMART attributes.
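
A small sketch of the arithmetic (attribute names and units vary per 
vendor, and the numbers here are purely illustrative):

  smartctl -A /dev/sda | egrep -i 'host_writes|wear_level|erase'

  write amplification = lifetime writes to flash / lifetime writes from host

If both counters use the same unit, then e.g. 45000 GiB written to flash
versus 43000 GiB written from the host gives roughly 1.05 -- no sign of
compression. A ratio clearly below 1.0 while writing compressible data would
hint that the controller compresses.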

> In any case: It was a experience report, no request for help, so I don´t 
> see why exact error messages are absolutely needed. If I had a support 
> inquiry that would be different, I agree.

Well, when reading such stories (involving software that I also use) I imagine
what I would do if I had been in that situation myself: would I have anything
else to try, do I know about any workaround for this? And without any
technical details to go on, those are all questions left unanswered.

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
  2018-08-18  7:12         ` Roman Mamedov
@ 2018-08-18  8:47           ` Martin Steigerwald
  0 siblings, 0 replies; 11+ messages in thread
From: Martin Steigerwald @ 2018-08-18  8:47 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Austin S. Hemmelgarn, linux-btrfs

Roman Mamedov - 18.08.18, 09:12:
> On Fri, 17 Aug 2018 23:17:33 +0200
> 
> Martin Steigerwald <martin@lichtvoll.de> wrote:
> > > Do not consider SSD "compression" as a factor in any of your
> > > calculations or planning. Modern controllers do not do it anymore,
> > > the last ones that did are SandForce, and that's 2010 era stuff.
> > > You
> > > can check for yourself by comparing write speeds of compressible
> > > vs
> > > incompressible data, it should be the same. At most, the modern
> > > ones
> > > know to recognize a stream of binary zeroes and have a special
> > > case
> > > for that.
> > 
> > Interesting. Do you have any backup for your claim?
> 
> Just "something I read". I follow quote a bit of SSD-related articles
> and reviews which often also include a section to talk about the
> controller utilized, its background and technological
> improvements/changes -- and the compression going out of fashion
> after SandForce seems to be considered a well-known fact.
> 
> Incidentally, your old Intel 320 SSDs actually seem to be based on
> that old SandForce controller (or at least license some of that IP to
> extend on it), and hence those indeed might perform compression.

Interesting. Back then I read that the Intel SSD 320 would not compress.
I think it's difficult to know for sure with those proprietary controllers.

> > As the data still needs to be transferred to the SSD at least when
> > the SATA connection is maxed out I bet you won´t see any difference
> > in write speed whether the SSD compresses in real time or not.
> 
> Most controllers expose two readings in SMART:
> 
>   - Lifetime writes from host (SMART attribute 241)
>   - Lifetime writes to flash (attribute 233, or 177, or 173...)
>
> It might be difficult to get the second one, as often it needs to be
> decoded from others such as "Average block erase count" or "Wear
> leveling count". (And seems to be impossible on Samsung NVMe ones,
> for example)

I got the impression every manufacturer does their own thing here. And I
would not even be surprised if it differs between different generations
of SSDs by one manufacturer.

# Crucial mSATA

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       16345
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4193
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Wear_Leveling_Count     0x0032   078   078   000    Old_age   Always       -       663
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       362
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000    Pre-fail  Always       -       8219
183 SATA_Iface_Downshift    0x0032   100   100   000    Old_age   Always       -       1
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   046   020   000    Old_age   Always       -       54 (Min/Max -10/80)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       16
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Used   0x0031   078   078   000    Pre-fail  Offline      -       22

I expect the raw value of this to rise more slowly now that there are almost
100 GiB completely unused and there is lots of free space in the filesystems.
But even if not, the SSD has been in use since March 2014. So it has plenty
of time to go.

206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   ---    Old_age   Always       -       91288276930

^^ In sectors. 91288276930 * 512 / 1024 / 1024 / 1024 ~= 43529 GiB

Could be 4 KiB… but as it says Host_Sector and the value multiplied
by eight does not make any sense, I bet it's 512 bytes.

% smartctl /dev/sdb --all |grep "Sector Size"
Sector Sizes:     512 bytes logical, 4096 bytes physical

247 Host_Program_Page_Count 0x0032   100   100   ---    Old_age   Always       -       2892511571
248 Bckgnd_Program_Page_Cnt 0x0032   100   100   ---    Old_age   Always       -       742817198


# Intel SSD 320, before secure erase

This is the Intel SSD 320 as of April 2017. I lost the smartctl -a output
from directly before the secure erase because I wrote it to the /home
filesystem after the backup – I do have the more recent attrlog CSV file,
but I feel too lazy to format it in a meaningful way:

SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
  4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       21035
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       5292
170 Reserve_Block_Count     0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       169
183 SATA_Downshift_Count    0x0030   100   100   000    Old_age   Offline      -       3
184 End-to-End_Error        0x0032   100   100   090    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       462
199 CRC_Error_Count         0x0030   100   100   000    Old_age   Offline      -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1370316
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       2206583
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       49
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       13857327
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   097   097   000    Old_age   Always       -       0

^^ almost new. I have a PDF from Intel explaining this value somewhere.
The Intel SSD 320 had more free space than the Crucial m500 for a good
part of their usage.

241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1370316

^^ 1370316 * 32 / 1024 ~= 42822 GiB

242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       2016560

The Intel SSD has been in use for a longer time, since May 2011.


# Intel SSD 320 after secure erase:

Interestingly the secure erase nuked the SMART values:

SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
  4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       3
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       6726
170 Reserve_Block_Count     0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
183 SATA_Downshift_Count    0x0030   100   100   000    Old_age   Offline      -       0
184 End-to-End_Error        0x0032   100   100   090    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       537
199 CRC_Error_Count         0x0030   100   100   000    Old_age   Offline      -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       5768
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       65535
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       65535
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       65535
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       5768
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -

Good for selling it. You could claim it is all fresh and new :)


# Samsung 860 Pro

Note this SSD is almost new – smartctl 6.6 2016-05-31 does not know about 
one attribute. I am not sure why the command is so old in Debian Sid:

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       50
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       26
177 Wear_Leveling_Count     0x0013   100   100   000    Pre-fail  Always       -       0

^^ new :)

179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   065   052   000    Old_age   Always       -       35
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       1
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       1133999775

According to references on the internet, sectors are meant here, so:

1133999775 * 512 / 1024 / 1024 / 1024 ~= 541 GiB

% smartctl /dev/sda --all |grep "Sector Size"
Sector Size:      512 bytes logical/physical

> But if you have numbers for both, you know the write amplification of
> the drive (and its past workload).

Sure.

> If there is compression at work, you'd see the 2nd number being
> somewhat, or significantly lower -- and barely increase at all, if
> you write highly compressible data. This is not typically observed on
> modern SSDs, except maybe when writing zeroes. Writes to flash will
> be the same as writes from host, or most often somewhat higher, as
> the hardware can typically erase flash only in chunks of 2MB or so,
> hence there's quite a bit of under the hood reorganizing going on.
> Also as a result depending on workloads the "to flash" number can be
> much higher than "from host".

Okay, I get that, but it would be quite some effort to make reliable
measurements, because you´d need to write quite a large amount of data
for the media wearout indicator to change. I do not intend to do that.

> Point is, even when the SATA link is maxed out in both cases, you can
> still check if there's compression at work via using those SMART
> attributes.

Sure. But with quite some effort, and with some aging of the SSDs involved.

I can imagine better uses of my time :)

> > In any case: It was a experience report, no request for help, so I
> > don´t see why exact error messages are absolutely needed. If I had
> > a support inquiry that would be different, I agree.
> 
> Well, when reading such stories (involving software that I also use) I
> imagine what if I had been in that situation myself, what would I do,
> would I have anything else to try, do I know about any workaround for
> this. And without any technical details to go from, those are all
> questions left unanswered.

Sure, I get that.

My priority was to bring the machine back online. I managed to put the
screen log on a filesystem I destroyed afterwards, and I managed to put it
there after the backup of that filesystem was complete… so, c’est la vie, the
log is gone. But even if I still had it, I probably would not have included
all error messages. But I would have been able to provide the ones you
are interested in. Anyway, it's gone and that is it.

Thanks,
-- 
Martin

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-08-18 11:54 UTC | newest]

Thread overview: 11+ messages
2018-08-17  9:08 Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD Martin Steigerwald
2018-08-17 11:58 ` Austin S. Hemmelgarn
2018-08-17 12:28   ` Martin Steigerwald
2018-08-17 12:50     ` Roman Mamedov
2018-08-17 13:01       ` Austin S. Hemmelgarn
2018-08-17 21:16         ` Martin Steigerwald
2018-08-17 21:17       ` Martin Steigerwald
2018-08-18  7:12         ` Roman Mamedov
2018-08-18  8:47           ` Martin Steigerwald
2018-08-17 12:55     ` Austin S. Hemmelgarn
2018-08-17 21:26       ` Martin Steigerwald
