* Periodic fstrim job vs mounting with discard
From: Jared D. Cottrell @ 2016-10-20 22:32 UTC
To: linux-xfs
We've been running our Ubuntu 14.04-based, SSD-backed databases with a
weekly fstrim cron job, but have been finding more and more clusters
that are locking all IO for a couple minutes as a result of the job.
In theory, mounting with discard could be appropriate for our use case
as file deletes are infrequent and handled in background threads.
However, we read some dire warnings about using discard on this list
(http://oss.sgi.com/archives/xfs/2014-08/msg00465.html) that make us
want to avoid it.
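For clarity, by "mounting with discard" I mean the standard XFS mount
option; the device and mount point below are placeholders rather than
our actual layout:

    # online discard via the mount option -- illustrative paths only
    mount -o discard /dev/xvdb /data
    # or persistently via /etc/fstab:
    # /dev/xvdb  /data  xfs  defaults,discard  0  0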
Is discard still to be avoided at all costs? Are the corruption and
bricking problems mentioned still something to be expected even with
the protection of Linux's built-in blacklist of broken SSD hardware?
We happen to be using Amazon's in-chassis SSDs. I'm sure they use
multiple vendors but I can't imagine they're taking short-cuts with
cheap hardware.
If discard is still strongly discouraged, perhaps we can approach the
problem from the other side: does the slow fstrim we're seeing sound
like a known issue? After a bunch of testing and research, we've
determined the following:
Essentially, XFS looks to be iterating over every allocation group and
issuing TRIMs for all free extents every time this ioctl is called.
This, coupled with the fact that Linux's interface to the TRIM
command is synchronous and does not support a vectorized list of
ranges (see: https://github.com/torvalds/linux/blob/3fc9d690936fb2e20e180710965ba2cc3a0881f8/block/blk-lib.c#L112),
is leading to a large number of extraneous TRIM commands (each of
which has been observed to be slow, see:
http://oss.sgi.com/archives/xfs/2011-12/msg00311.html) being issued to
the disk for ranges that both the filesystem and the disk know to be
free. In practice, we have seen IO disruptions of up to 2 minutes. I
realize that the duration of these disruptions may be controller
dependent. Unfortunately, when running on a platform like AWS, one
does not have the luxury of choosing specific hardware.
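For what it's worth, the FITRIM ioctl does take a start/length/minlen
range and fstrim exposes those fields, so in principle the work can be
chopped into smaller passes; the numbers and mount point below are
arbitrary, just to show the shape of it:

    # sketch: trim only the first 256GiB, ignoring free extents < 16MiB
    fstrim -v -o 0 -l 256G -m 16M /data
    # a follow-up pass could then continue from offset 256GiB, etc.
    fstrim -v -o 256G -l 256G -m 16M /data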
EXT4, on the other hand, tracks blocks that have been deleted since
the previous FITRIM ioctl and targets subsequent TRIMs at the
appropriate block ranges (see:
http://blog.taz.net.au/2012/01/07/fstrim-and-xfs/). In real-world
tests this significantly reduces the impact of fstrim to the point
that it is unnoticeable to the database / application.
For a bit more context, here's a write-up of the same issue we did for
the MongoDB community:
https://groups.google.com/forum/#!topic/mongodb-user/Mj0x6m-02Ms
Regards,
Jared
* Re: Periodic fstrim job vs mounting with discard
From: Dave Chinner @ 2016-10-21 1:48 UTC
To: Jared D. Cottrell; +Cc: linux-xfs
On Thu, Oct 20, 2016 at 03:32:48PM -0700, Jared D. Cottrell wrote:
> We've been running our Ubuntu 14.04-based, SSD-backed databases with a
> weekly fstrim cron job, but have been finding more and more clusters
Command line for fstrim?
> that are locking all IO for a couple minutes as a result of the job.
> In theory, mounting with discard could be appropriate for our use case
> as file deletes are infrequent and handled in background threads.
> However, we read some dire warnings about using discard on this list
> (http://oss.sgi.com/archives/xfs/2014-08/msg00465.html) that make us
> want to avoid it.
discard is being improved - Christoph posted a patchset a few days
ago that solves many of the XFS-specific issues. It also tries to
avoid the various deficiencies of underlying infrastructure as much
as possible.
> Is discard still to be avoided at all costs? Are the corruption and
> bricking problems mentioned still something to be expected even with
> the protection of Linux's built-in blacklist of broken SSD hardware?
> We happen to be using Amazon's in-chassis SSDs. I'm sure they use
> multiple vendors but I can't imagine they're taking short-cuts with
> cheap hardware.
Every so often we see a problem that manifests when discard is
enabled, and it goes away when it is turned off. Not just on XFS -
there's similar reports on the btrfs list. It's up to you to decide
whether you use it or not.
> If discard is still strongly discouraged, perhaps we can approach the
> problem from the other side: does the slow fstrim we're seeing sound
> like a known issue? After a bunch of testing and research, we've
> determined the following:
>
> Essentially, XFS looks to be iterating over every allocation group and
> issuing TRIMs for all free extents every time this ioctl is called.
> This, coupled with the fact that Linux's interface to the TRIM
> command is synchronous and does not support a vectorized list of
> ranges (see: https://github.com/torvalds/linux/blob/3fc9d690936fb2e20e180710965ba2cc3a0881f8/block/blk-lib.c#L112),
> is leading to a large number of extraneous TRIM commands (each of
> which has been observed to be slow, see:
> http://oss.sgi.com/archives/xfs/2011-12/msg00311.html) being issued to
> the disk for ranges that both the filesystem and the disk know to be
> free. In practice, we have seen IO disruptions of up to 2 minutes. I
> realize that the duration of these disruptions may be controller
> dependent. Unfortunately, when running on a platform like AWS, one
> does not have the luxury of choosing specific hardware.
Many issues here, none of which have changed recently.
One of the common misconceptions about discard is that it will
improve performance. People are led to think "empty drive" SSD
performance is what they should always get as that is what the
manufacturers quote, not the performance once the drive has been
completely written once. They are also led to believe that running
TRIM will restore their drive to "empty drive" performance. This is
not true - for most users, the "overwritten" performance is what
you'll get for the majority of the life of an active drive,
regardless of whether you use TRIM or not.
If you want an idea of how misleading the performance expectations
manufacturers set for their SSDs are, go have a look at the
SSD "performance consistency" tests that are run on all SSDs at
anandtech.com, e.g. Samsung's latest 960 Pro. Quoted at 360,000
random 4k write iops, it can actually only sustain 25,000 random 4k
write iops once the drive has been filled, which only takes a few
minutes to do:
http://www.anandtech.com/show/10754/samsung-960-pro-ssd-review/3
This matches what will happen in the few hours after a TRIM is run
on an SSD under constant write pressure where the filesystem used
space pattern at the time of the fstrim was significantly different
to the SSD's used space pattern. i.e. fstrim will free up used
space in the SSD which means performance will go up and be fast
(yay!), but as soon as the "known free" area is exhausted it will
fall into the steady state where the garbage collection algorithm
limits performance.
At this point, running fstrim again won't make any difference to
performance unless new areas of the block device address space
have been freed by the filesystem. This is because the SSD's
record of "used space" still closely matches the filesystem's view
of free space. Hence fstrim will fail to free any significant amount
of space in the SSD it could use to improve performance, and so the
SSD remains in the slow "garbage collection mode" to sustain ongoing
writes.
IOWs, fstrim/discard will not restore any significant SSD
performance unless your application has a very dynamic filesystem
usage pattern (i.e. regularly fills and empties the filesystem).
That doesn't seem to be the situation your application is running in
("... our use case [...] file deletes are infrequent .. "), so
maybe you're best to just disable fstrim altogether?
Put simply: fstrim needs to be considered similarly to online
defragmentation - it can be actively harmful to production workloads
when it is used unnecessarily or inappropriately.
> EXT4, on the other hand, tracks blocks that have been deleted since
> the previous FITRIM ioctl
ext4 tracks /block groups/, not blocks. Freeing a single 4k block in
a 128MB block group will mark it for processing on the next fstrim
run. IOWs if you are freeing blocks all over your filesystem between
weekly fstrim runs, ext4 will behave pretty much identically to XFS.
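If you want to see the granularity involved, dumpe2fs will show the
block size and blocks-per-group for an ext4 filesystem (the device
name below is just a placeholder):

    # 4k blocks x 32768 blocks per group = 128MB of trim granularity
    dumpe2fs -h /dev/sdb1 | grep -Ei 'block size|blocks per group'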
> and targets subsequent TRIMs at the
> appropriate block ranges (see:
> http://blog.taz.net.au/2012/01/07/fstrim-and-xfs/). In real-world
> tests this significantly reduces the impact of fstrim to the point
> that it is unnoticeable to the database / application.
IMO that's a completely meaningless benchmark/comparison. To start
with nobody runs fstrim twice in a row on production systems, so
back-to-back behaviour is irrelevant to us. Also, every test is run
on different hardware so the results simply cannot be compared to
each other. Now if it were run on the same hardware, with some kind of
significant workload in between runs it would be slightly more
meaningful.(*)
A lot of the "interwebs knowledge" around discard, fstrim, TRIM, SSD
performance, etc that you find with google is really just cargo-cult
stuff. What impact fstrim is going to have on your SSDs is largely
workload dependent, and the reality is that a large number of
workloads don't have the dynamic allocation behaviour that allows
regular usage of fstrim to provide a meaningful, measurable and
sustained performance improvement.
So, with all that in mind, the first thing you need to do is gather
measurements to determine if SSD performance is actually improved
after running a weekly fstrim. If there's no /significant/ change in
IO latency or throughput, then fstrim is not doing anything useful
for you and you can reduce the frequency at which you run it, only
run it in scheduled maintenance windows, or simply stop using it.
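Nothing fancy is needed for that - even something as simple as the
following around the weekly run would do (intervals, filenames and
mount point are just examples):

    # capture device-level latency/throughput before and after the run
    iostat -dxm 60 30 > pre-fstrim.iostat
    fstrim -v /data
    iostat -dxm 60 30 > post-fstrim.iostat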
If there is a significant improvement in IO performance as a result
of running fstrim, then we need to work out why your application is
getting stuck during fstrim. sysrq-w output when fstrim is running
and the application is blocking will tell us where the blocking
issue lies (it may not be XFS!), and along with the various
information about your system here:
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
we should be able to determine what is causing the blocking and
hence determine if it's fixable or not....
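Roughly:

    # make sure sysrq is enabled (it may already be on your distro)
    echo 1 > /proc/sys/kernel/sysrq
    # dump blocked tasks to the kernel log while the stall is happening
    echo w > /proc/sysrq-trigger
    dmesg > sysrq-w.log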
Cheers,
Dave.
(*) This test will probably still come out in favour of ext4
because of empty filesystem allocation patterns. i.e. ext4
allocation is all nice and compact until you dirty all the block
groups in the filesystem, then the allocation patterns become
scattered and non-deterministic. At that point, typical data
intensive workloads will always dirty a significant proportion of
the block groups in the filesystem, and fstrim behaviour becomes
much more like XFS. XFS's behaviour does not change with workloads
- it only changes as free space patterns change. Hence it should show
roughly consistent and predictable behaviour for a given free
space pattern regardless of the workload or the age of the
filesystem.
--
Dave Chinner
david@fromorbit.com
* Re: Periodic fstrim job vs mounting with discard
From: Jared D. Cottrell @ 2016-11-02 13:50 UTC
To: Dave Chinner; +Cc: linux-xfs
On Thu, Oct 20, 2016 at 6:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Oct 20, 2016 at 03:32:48PM -0700, Jared D. Cottrell wrote:
>> We've been running our Ubuntu 14.04-based, SSD-backed databases with a
>> weekly fstrim cron job, but have been finding more and more clusters
>
> Command line for fstrim?
fstrim-all
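That's the wrapper script shipped with Ubuntu 14.04, run from a weekly
cron entry. As far as I can tell it effectively just walks the mounted
filesystems and calls fstrim on each, something like this (a sketch,
not the real script):

    # rough equivalent of what the weekly job boils down to
    for mnt in $(awk '$3 == "xfs" || $3 == "ext4" { print $2 }' /proc/mounts); do
        fstrim "$mnt"
    done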
>> that are locking all IO for a couple minutes as a result of the job.
>> In theory, mounting with discard could be appropriate for our use case
>> as file deletes are infrequent and handled in background threads.
>> However, we read some dire warnings about using discard on this list
>> (http://oss.sgi.com/archives/xfs/2014-08/msg00465.html) that make us
>> want to avoid it.
>
> discard is being improved - Christoph posted a patchset a few days
> ago that solves many of the XFS-specific issues. It also tries to
> avoid the various deficiencies of underlying infrastructure as much
> as possible.
>
>> Is discard still to be avoided at all costs? Are the corruption and
>> bricking problems mentioned still something to be expected even with
>> the protection of Linux's built-in blacklist of broken SSD hardware?
>> We happen to be using Amazon's in-chassis SSDs. I'm sure they use
>> multiple vendors but I can't imagine they're taking short-cuts with
>> cheap hardware.
>
> Every so often we see a problem that manifests when discard is
> enabled, and it goes away when it is turned off. Not just on XFS -
> there's similar reports on the btrfs list. It's up to you to decide
> whether you use it or not.
So if we want to be conservative we should stay away, then.
It sounds like the issues don't follow any kind of pattern that would
support this, but is there any way to test our particular hardware to
be confident it won't be a problem?
There was mention of corruption issues in the original thread. We're
more worried about those than performance issues. Does that change the
answer to the above?
>> If discard is still strongly discouraged, perhaps we can approach the
>> problem from the other side: does the slow fstrim we're seeing sound
>> like a known issue? After a bunch of testing and research, we've
>> determined the following:
>>
>> Essentially, XFS looks to be iterating over every allocation group and
>> issuing TRIMs for all free extents every time this ioctl is called.
>> This, coupled with the fact that Linux's interface to the TRIM
>> command is synchronous and does not support a vectorized list of
>> ranges (see: https://github.com/torvalds/linux/blob/3fc9d690936fb2e20e180710965ba2cc3a0881f8/block/blk-lib.c#L112),
>> is leading to a large number of extraneous TRIM commands (each of
>> which has been observed to be slow, see:
>> http://oss.sgi.com/archives/xfs/2011-12/msg00311.html) being issued to
>> the disk for ranges that both the filesystem and the disk know to be
>> free. In practice, we have seen IO disruptions of up to 2 minutes. I
>> realize that the duration of these disruptions may be controller
>> dependent. Unfortunately, when running on a platform like AWS, one
>> does not have the luxury of choosing specific hardware.
>
> Many issues here, none of which have changed recently.
>
> One of the common misconceptions about discard is that it will
> improve performance. People are led to think "empty drive" SSD
> performance is what they should always get as that is what the
> manufacturers quote, not the performance once the drive has been
> completely written once. They are also led to believe that running
> TRIM will restore their drive to "empty drive" performance. This is
> not true - for most users, the "overwritten" performance is what
> you'll get for the majority of the life of an active drive,
> regardless of whether you use TRIM or not.
>
> If you want an idea of how misleading the performance expectations
> manufacturers set for their SSDs are, go have a look at the
> SSD "performance consistency" tests that are run on all SSDs at
> anandtech.com, e.g. Samsung's latest 960 Pro. Quoted at 360,000
> random 4k write iops, it can actually only sustain 25,000 random 4k
> write iops once the drive has been filled, which only takes a few
> minutes to do:
>
> http://www.anandtech.com/show/10754/samsung-960-pro-ssd-review/3
>
> This matches what will happen in the few hours after a TRIM is run
> on an SSD under constant write pressure where the filesystem used
> space pattern at the time of the fstrim was significantly different
> to the SSD's used space pattern. i.e. fstrim will free up used
> space in the SSD which means performance will go up and be fast
> (yay!), but as soon as the "known free" area is exhausted it will
> fall into the steady state where the garbage collection algorithm
> limits performance.
>
> At this point, running fstrim again won't make any difference to
> performance unless new areas of the block device address space
> have been freed by the filesystem. This is because the SSD's
> record of "used space" still closely matches the filesystem's view
> of free space. Hence fstrim will fail to free any significant amount
> of space in the SSD it could use to improve performance, and so the
> SSD remains in the slow "garbage collection mode" to sustain ongoing
> writes.
>
> IOWs, fstrim/discard will not restore any significant SSD
> performance unless your application has a very dynamic filesystem
> usage pattern (i.e. regularly fills and empties the filesystem).
> That doesn't seem to be the situation your application is running in
> ("... our use case [...] file deletes are infrequent .. "), so
> maybe you're best to just disable fstrim altogether?
>
> Put simply: fstrim needs to be considered similarly to online
> defragmentation - it can be actively harmful to production workloads
> when it is used unnecessarily or inappropriately.
>
>> EXT4, on the other hand, tracks blocks that have been deleted since
>> the previous FITRIM ioctl
>
> ext4 tracks /block groups/, not blocks. Freeing a single 4k block in
> a 128MB block group will mark it for processing on the next fstrim
> run. IOWs if you are freeing blocks all over your filesystem between
> weekly fstrim runs, ext4 will behave pretty much identically to XFS.
>
>> and targets subsequent TRIMs at the
>> appropriate block ranges (see:
>> http://blog.taz.net.au/2012/01/07/fstrim-and-xfs/). In real-world
>> tests this significantly reduces the impact of fstrim to the point
>> that it is unnoticeable to the database / application.
>
> IMO that's a completely meaningless benchmark/comparison. To start
> with nobody runs fstrim twice in a row on production systems, so
> back-to-back behaviour is irrelevant to us. Also, every test is run
> on different hardware so the results simply cannot be compared to
> each other. Now if it were run on the same hardware, with some kind of
> significant workload in between runs it would be slightly more
> meaningful.(*)
>
> A lot of the "interwebs knowledge" around discard, fstrim, TRIM, SSD
> performance, etc that you find with google is really just cargo-cult
> stuff. What impact fstrim is going to have on your SSDs is largely
> workload dependent, and the reality is that a large number of
> workloads don't have the dynamic allocation behaviour that allows
> regular usage of fstrim to provide a meaningful, measurable and
> sustained performance improvement.
>
> So, with all that in mind, the first thing you need to do is gather
> measurements to determine if SSD performance is actually improved
> after running a weekly fstrim. If there's no /significant/ change in
> IO latency or throughput, then fstrim is not doing anything useful
> for you and you can reduce the frequency at which you run it, only
> run it in scheduled maintenance windows, or simply stop using it.
>
> If there is a significant improvement in IO performance as a result
> of running fstrim, then we need to work out why your application is
> getting stuck during fstrim. sysrq-w output when fstrim is running
> and the application is blocking will tell us where the blocking
> issue lies (it may not be XFS!), and along with the various
> information about your system here:
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> we should be able to determine what is causing the blocking and
> hence determine if it's fixable or not....
Good points, we'll add to our testing regimen.
One issue we have is that no matter how much testing we do, we don't
have just one workload, we have them all (well, all the workloads you
can expect to see when running a database). Our customers are free to
do whatever they want (within reason, of course) with their
deployments.
Ideally each customer would go through a testing phase where they
would determine whether and how often to run fstrim, but we'd like to
simplify things for them as much as possible.
Obviously, the simplest thing is for customers not to have to go
through the tuning phase or consider it in their operations at all.
This is why discard is theoretically attractive, and also why running
fstrim perhaps more aggressively than needed, to cover most cases, is
attractive.
But let's pretend we did disable automated fstrim jobs, didn't mount
with discard, and just provided a button for folks to click to run
fstrim on demand as needed. Are there any additional tools we can
expose to help customers figure out when to push the button? Perhaps
some telemetry we can present customers that might indicate when TRIM
debt is getting high (e.g. "Having performance problems and showing
TRIM debt? Try fstrim.")? Maybe some of the stats here?
http://xfs.org/index.php/Runtime_Stats
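For example, if some combination of the counters in /proc/fs/xfs/stat
turned out to correlate with accumulating TRIM debt, we could sample
them periodically, along the lines of:

    # sample the global XFS counters; which fields (if any) matter is
    # exactly the question we'd like help answering
    cat /proc/fs/xfs/stat
    # e.g. the extent allocation/free activity counters
    grep '^extent_alloc' /proc/fs/xfs/stat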
> Cheers,
>
> Dave.
>
> (*) This test will probably still come out in favour of ext4
> because of empty filesystem allocation patterns. i.e. ext4
> allocation is all nice and compact until you dirty all the block
> groups in the filesystem, then the allocation patterns become
> scattered and non-deterministic. At that point, typical data
> intensive workloads will always dirty a significant proportion of
> the block groups in the filesystem, and fstrim behaviour becomes
> much more like XFS. XFS's behaviour does not change with workloads
> - it only changes as free space patterns change. Hence it should show
> roughly consistent and predictable behaviour for a given free
> space pattern regardless of the workload or the age of the
> filesystem.