From: David Brown <david@westcontrol.com>
To: linux-raid@vger.kernel.org
Subject: Re: Software RAID and TRIM
Date: Mon, 18 Jul 2011 12:35:42 +0200
Message-ID: <j012fa$6k6$1@dough.gmane.org>
In-Reply-To: <4E235984.2070704@5t9.de>

On 17/07/2011 23:52, Lutz Vieweg wrote:
> David Brown wrote:
>> However, AFAIUI, you are wrong about TRIM being essential for the
>> continued high performance of SSDs. As long as your SSDs have some
>> over-provisioning (or you only partition something like 90% of the
>> drive), and it's got good garbage collection, then TRIM will have
>> minimal effect.
>
> I beg to differ.
>

Well, I don't have your experience here (I have a couple of 60 GB SSDs
in RAID0, without TRIM, but that's hardly in the same class), so I
don't expect you to put much weight on my opinions.  But maybe it will
give you reason for more testing.

> We are using SSDs in very much the way that Tom de Mulder intends,
> and from our extensive performance measurements over many months
> now I can say that (at least if you do have significant amounts
> of write operations) it _does_ make a lot of difference whether you
> periodically discard the unused sectors or not.
> (For us, the write performance was measured to be about half as
> good when there are no free erase blocks available anymore.)
>

If there are no free erase blocks, then your SSDs don't have enough
over-provisioning.  This is, after all, the whole point of having more
physical flash than the logical disk size would suggest.  Depending on
the quality of the SSD (more expensive ones have more
over-provisioning) and the usage patterns (lots of small random writes
need more spare space), you might have to "manually" over-provision
the disk by partitioning only about 90% of it.  Of course, you must
make sure that the remaining 10% is "discarded" or has never been
written since new, and that you use the partition for your RAID rather
than the whole disk.

So now you have plenty of erase blocks at any time, and your write 
performance will be good.
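
Just to put rough numbers on that 90% suggestion, here is a trivial
sketch (Python; the sector count and the 512 KiB erase-block size are
assumptions for illustration - check your own drive):

def partition_sectors(total_sectors, reserve_fraction=0.10,
                      erase_block_bytes=512 * 1024, sector_bytes=512):
    """Size a partition that leaves reserve_fraction of the disk
    unpartitioned, rounded down to an erase-block boundary so the
    spare area stays in whole erase blocks."""
    keep = int(total_sectors * (1.0 - reserve_fraction))
    sectors_per_eb = erase_block_bytes // sector_bytes
    return keep - (keep % sectors_per_eb)

# e.g. a nominal 60 GB SSD reporting 117231408 sectors:
print(partition_sectors(117231408))   # -> 105507840 sectors (~54 GB)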


TRIM, on the other hand, does not give you any extra free erase
blocks.  If you think it does, you've misunderstood it.

TRIM exists to make garbage collection a little more efficient - when 
garbage collecting an erase block that contains TRIM'ed blocks, the 
TRIM'ed blocks don't need to be copied.  This saves a small amount of 
time in the copying, and allows slightly denser packing.  It may 
sometimes lead to saving whole erase blocks, but that's seldom the case 
in practice except when deleting large files.
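
A toy model of that copying cost (purely illustrative - the page
counts and states below are made up) shows why the saving is real but
usually modest:

# Pages in an erase block are 'valid', 'stale' (superseded by a later
# write elsewhere), or 'trimmed' (freed by the filesystem via TRIM).
def pages_to_copy(block, drive_sees_trim):
    # Without TRIM information the drive cannot tell trimmed pages
    # from valid ones, so it must copy them out before the erase.
    live = {'valid'} if drive_sees_trim else {'valid', 'trimmed'}
    return sum(1 for page in block if page in live)

block = ['valid'] * 96 + ['stale'] * 24 + ['trimmed'] * 8
print(pages_to_copy(block, drive_sees_trim=False))  # 104 pages copied
print(pages_to_copy(block, drive_sees_trim=True))   # 96 pages copied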

If your disks are reasonably full, then TRIM will not help much because 
the garbage collection will be desperately trying to piece together 
small bits into complete erase blocks, and your performance will drop 
through the floor.  If you have plenty of overprovisioning, then the SSD 
still has lots of completely free erase blocks whenever it needs them.

If your filesystem re-uses (logical) blocks, then TRIM will not help. 
It is /always/ more efficient for the FS to simply write new data to the 
same block, rather than TRIM'ing it first.

TRIM is a very expensive command - it acts a bit like a write, but it is 
not a queued command.  Thus the block layer must wait for /all/ IO 
commands to have completed, then issue the TRIM, then wait for it to 
complete, and then carry on with new commands.  On some SSDs, it will
(according to something I read) trigger garbage collection, which may
slow down the SSD.  Even without that, the performance of most
metadata operations (such as delete) drops considerably when they also
need to issue a TRIM.
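
A back-of-the-envelope model of that queue draining (all the timings
are invented, just to show the shape of the effect):

def elapsed_us(ops, depth=32, write_us=100, trim_us=500):
    """ops is a list of 'write'/'trim' commands.  Writes overlap up
    to the queue depth (a full queue is modelled as costing one
    service time); a TRIM drains the queue and runs alone."""
    total, queued = 0, 0
    for op in ops:
        if op == 'write':
            queued += 1
            if queued == depth:     # a full queue completes together
                total += write_us
                queued = 0
        else:                       # drain, then the lone TRIM
            if queued:
                total += write_us
                queued = 0
            total += trim_us
    return total + (write_us if queued else 0)

print(elapsed_us(['write'] * 3200))                   # 10000 us
print(elapsed_us((['write'] * 32 + ['trim']) * 100))  # 60000 us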

<http://people.redhat.com/jmoyer/discard/ext4_batched_discard/ext4_discard.html>

<http://lwn.net/Articles/347511/>

<http://www.realworldtech.com/beta/forums/index.cfm?action=detail&id=116034&threadid=115697&roomid=2>


On the other hand, your off-line batch TRIM during low use periods could 
well be a win.  The cost of these discards is not going to be an issue, 
and large batched discards are going to be far more useful to the SSD 
than small scattered ones.  I believe that there has been work on a 
similar system in XFS - I don't know what happened to that, or if there 
is any way to make it work in concert with md raid.
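
For what it's worth, the batched discard in the ext4 work linked
above is driven by the FITRIM ioctl, which a cron job could issue
during quiet hours.  A minimal sketch (the mount point is
hypothetical, and this assumes a filesystem that implements FITRIM):

import fcntl, os, struct

FITRIM = 0xC0185879   # _IOWR('X', 121, struct fstrim_range), linux/fs.h

def batched_discard(mountpoint, minlen=0):
    """Ask the filesystem to discard all its free space in one pass;
    returns the number of bytes actually discarded."""
    # struct fstrim_range { __u64 start; __u64 len; __u64 minlen; }
    rng = bytearray(struct.pack('=QQQ', 0, 2**64 - 1, minlen))
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FITRIM, rng, True)  # kernel writes back bytes trimmed
    finally:
        os.close(fd)
    return struct.unpack('=QQQ', bytes(rng))[1]

print('discarded %d bytes' % batched_discard('/mnt/ssd'))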


What will make a big difference to using SSDs in md raid is the
sync/no-sync tracking.  This will avoid a lot of unnecessary writes,
especially with a new array, and leave the SSD with more free blocks
(at least until the disk is getting full of data).  It is also much
higher up the things-to-do list, because it will be useful for all
uses of md raid, and is a prerequisite for general discard support.
(Strictly speaking it is not needed for SSDs that guarantee a zero
return on TRIM'ed blocks - but only some SSDs give that guarantee.)


> Of course, you can only benefit from discards if your filesystem
> is not full (because then there is nothing to discard). But any
> kind of "garbage collection" by the SSD itself will not have the
> same effect, since it cannot know which blocks are in use by the
> filesystem.
>

Garbage collection will recycle blocks that have been overwritten.
The filesystem knows which logical blocks are in use, and which are
free.  Filesystems already heavily re-use blocks, with the aim of
preferring the faster outer tracks on HDs and minimizing head
movement.  So when a file is deleted, there's a good chance that those
same logical blocks will be re-used soon - TRIM is of no benefit in
that case.

>> I think other SSD-optimisations, such as those in BTRFS, are much more
>> important.
>
> Actually, (apart from btrfs still being in development, not really
> ready for production use, yet), XFS (-o delaylog,barrier) performs
> better on our SSDs than btrfs - without any SSD-specific options.
>

btrfs is ready for some uses, but is not mature and real-world tested 
enough for serious systems (and its tools are still lacking somewhat). 
But more generally, different filesystems are faster or slower for
different usage patterns.

One SSD optimisation that many filesystems could implement is to be
less concerned about fragmentation.  Most modern filesystems go out of
their way to reduce fragmentation, which is great for HD use.  But on
SSDs, you should be happy to fragment files if it promotes re-use of
erased blocks, as long as the fragments fill complete erase blocks (in
size and alignment) - as in the sketch below.
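
As a sketch of what such an allocator could prefer (the 512 KiB
erase-block size is an assumption - real sizes vary by drive):

ERASE_BLOCK = 512 * 1024   # assumed erase-block size in bytes

def erase_block_extents(free_blocks, length):
    """Place `length` bytes as fragments that each fill exactly one
    free erase block, rather than insisting on a contiguous extent.
    free_blocks: byte offsets of free, erase-block-aligned regions."""
    needed = -(-length // ERASE_BLOCK)    # ceiling division
    assert len(free_blocks) >= needed, "not enough free erase blocks"
    return [(off, ERASE_BLOCK) for off in free_blocks[:needed]]

# A 1.2 MB file becomes three scattered, erase-block-sized fragments:
print(erase_block_extents([0, 4 * ERASE_BLOCK, 9 * ERASE_BLOCK], 1200000))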


> What is really an important factor for SSD performance: The controller.
> The same SSDs perform with significantly lower latency for us when
> connected to SATA controller channels than when connected to SAS
> controllers (and they perform abysmally when used as hardware-RAID
> constituents, in comparison).

That is /very/ interesting to know, and is a data point I haven't read 
elsewhere (though I knew about poor performance of hardware RAID with 
SSD).  Thanks for sharing that.


>
> Regards,
>
> Lutz Vieweg
>


