public inbox for linux-xfs@vger.kernel.org
* TRIM details
@ 2011-01-07  3:22 Phil Karn
  2011-01-07  4:35 ` Martin K. Petersen
  2011-01-07  9:11 ` Matthias Schniedermeyer
  0 siblings, 2 replies; 9+ messages in thread
From: Phil Karn @ 2011-01-07  3:22 UTC (permalink / raw)
  To: xfs

Now that I've rebuilt my main Linux server, added a 120GB Intel SSD and
converted all the file systems to XFS, I've gotten interested in the
internals of both XFS and TRIM and how they work together (or will work
together).

I'd like to know exactly how the drives implement TRIM but I've only
found bits and pieces. Can anyone suggest a current and complete
reference for the complete SATA command set that includes all the TRIM
related stuff?

As I understand it, there's a SATA (and SCSI?) command that will
repeatedly write a fixed block of data to some number of consecutive
LBAs (WRITE SAME), and an "unmap" bit in the write command can be set to
indicate that instead of actually writing the blocks, they can be marked
for erasure and placed in the free pool.

Is this the only way it can be done? It occurs to me that while an
"unmap" bit should be quite fast, you don't absolutely *have* to have it.

Just have the drive interpret an ordinary write of all 0's to any LBA as
an implicit "unmap" indication for that LBA. As long as the drive
returns all 0's when an unmapped LBA is read (and I believe this is
already a requirement) then were an application to write a block of real
data that just happens to contain all 0's, it would still get back what
it wrote.

Then you could manually trim a drive with something like

dd if=/dev/zero of=foobar bs=1024k count=10240
rm foobar

or if you're really adventurous and don't mind a little hiccup:

dd if=/dev/zero of=foobar bs=1024k count=20M
rm foobar

(i.e., let dd run the file system out of space before you delete the
temporary file).

Then you wouldn't need a potentially dangerous program like wiper.sh
talking directly to the drive behind the file system's back. And while
wiper.sh only works with file systems whose structures it knows, this
approach would work with ANY file system.

This will all become moot when every SSD supports TRIM and every file
system uses it. But there are a lot of file systems out there; not all
of them support TRIM, and many won't for some time.

Somebody must have already thought of this, right?

Phil

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: TRIM details
  2011-01-07  3:22 TRIM details Phil Karn
@ 2011-01-07  4:35 ` Martin K. Petersen
  2011-01-07  9:11 ` Matthias Schniedermeyer
  1 sibling, 0 replies; 9+ messages in thread
From: Martin K. Petersen @ 2011-01-07  4:35 UTC (permalink / raw)
  To: karn; +Cc: xfs

>>>>> "Phil" == Phil Karn <karn@philkarn.net> writes:

Phil> I'd like to know exactly how the drives implement TRIM but I've
Phil> only found bits and pieces. Can anyone suggest a current and
Phil> complete reference for the complete SATA command set that includes
Phil> all the TRIM related stuff?

You kind of have to be a T13 member to get it. But try googling ATA
ACS-2...


Phil> As I understand it, there's a SATA (and SCSI?) command that will
Phil> repeatedly write a fixed block of data to some number of
Phil> consecutive LBAs (WRITE SAME), and an "unmap" bit in the write
Phil> command can be set to indicate that instead of actually writing
Phil> the blocks, they can be marked for erasure and placed in the free
Phil> pool.

There are several commands and variations...

For ATA there's the DSM TRIM command which allows you to indicate ranges
of blocks to discard. The ranges are stored in the data blocks and not
the command itself. A device can indicate how many blocks of payload it
supports. Many don't. Some of those that do blow up if you actually send
more than one block.
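To make that payload format concrete: per the ACS-2 drafts, each range entry is eight little-endian bytes, with the low 48 bits holding the starting LBA and the high 16 bits the sector count, and a 512-byte payload block holds 64 such entries. A minimal sketch (the helper name is mine):

```c
#include <stdint.h>

/* Pack one DSM TRIM range entry: low 48 bits = starting LBA,
 * high 16 bits = sector count, stored little-endian in the payload.
 * A 512-byte payload block holds 64 such entries; unused entries
 * are left as all zeroes. */
static void trim_pack_entry(uint8_t *dst, uint64_t lba, uint16_t count)
{
    uint64_t entry = (lba & 0xFFFFFFFFFFFFULL) | ((uint64_t)count << 48);
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t)(entry >> (8 * i));
}
```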

In SCSI there are three ways:

1. WRITE SAME with a zeroed payload
2. WRITE SAME with the UNMAP bit set
3. UNMAP command

UNMAP, like ATA DSM, takes a set of ranges in the data payload. Just to
make things more interesting they are not the same format and don't have
a 1:1 mapping with the ATA ranges.

There is no official support for (1) at the protocol level. You have to
know via means outside the standard whether the device supports logical
block provisioning with zero detection. There are a few storage arrays
out there that do.

Whether the device supports (2) or (3) is indicated in a set of VPD
pages that also indicate preferred granularity, alignment, etc. That
didn't use to be the case so for a while you just had to guess. We have
some heuristics in place that pick the right command depending on the
device.

Furthermore, in Linux, ATA sits underneath SCSI. So we translate WRITE
SAME(16) with the UNMAP bit set to DSM TRIM in our SCSI-ATA Translation
Layer.

Finally, there are a set of bits in both ATA and SCSI that indicate
whether read after a discard will return zeroes or garbage. Some devices
report that they return zeroes but don't in all cases.

The kernel goes through a lot of blah to make sure we're doing the right
thing. I really don't think that's a headache that's worth repeating.

Thankfully, at the top of the stack we have a generic block device ioctl
that hides all the complexity from the user. If you want to tinker
that's a much better place to start.
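That generic ioctl is BLKDISCARD. A minimal sketch of using it (destructive and needs root on a real device; the wrapper name is mine):

```c
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* BLKDISCARD */

/* Discard a byte range on a block device. The block layer translates
 * this into DSM TRIM, UNMAP or WRITE SAME as appropriate, so the
 * caller never sees the per-protocol mess. Returns 0 on success,
 * -1 on error. */
static int discard_range(const char *dev, uint64_t start, uint64_t len)
{
    uint64_t range[2] = { start, len };
    int fd = open(dev, O_WRONLY);
    if (fd < 0)
        return -1;
    int ret = ioctl(fd, BLKDISCARD, range);
    close(fd);
    return ret;
}
```

For example, discard_range("/dev/sdb", 0, 1ULL << 30) would discard the first gigabyte of that device; don't aim it at anything mounted.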

If you check the archives you'll also see that the filesystem-specific
FITRIM ioctl is being worked on. Plus some filesystems have the option
of doing discards in realtime.
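For reference, the FITRIM interface that ended up in mainline (the ioctl behind what became fstrim(8)) takes a byte range plus a minimum extent length and reports back how many bytes were trimmed. A sketch, with the wrapper name mine:

```c
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FITRIM, struct fstrim_range */

/* Ask the filesystem mounted at `mountpoint` to discard all of its
 * free space. Returns the number of bytes trimmed, or -1 on error. */
static int64_t fstrim_mount(const char *mountpoint)
{
    struct fstrim_range range = {
        .start  = 0,
        .len    = UINT64_MAX,   /* whole filesystem */
        .minlen = 0,            /* no minimum extent size */
    };
    int fd = open(mountpoint, O_RDONLY);
    if (fd < 0)
        return -1;
    if (ioctl(fd, FITRIM, &range) < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return (int64_t)range.len;  /* kernel fills in bytes trimmed */
}
```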


Phil> Just have the drive interpret an ordinary write of all 0's to any
Phil> LBA as an implicit "unmap" indication for that LBA. As long as the
Phil> drive returns all 0's when an unmapped LBA is read (and I believe
Phil> this is already a requirement) then were an application to write a
Phil> block of real data that just happens to contain all 0's, it would
Phil> still get back what it wrote.

See above.


Phil> Then you could manually trim a drive with something like

Phil> dd if=/dev/zero of=foobar bs=1024k count=10240
Phil> rm foobar

But if the device does not detect zeroes then you'll end up:

 - transferring a bunch of useless data across the bus which will slow
   things to a grinding halt

and

 - if it's an SSD, wear out a lot of flash cells for no reason

-- 
Martin K. Petersen	Oracle Linux Engineering



* Re: TRIM details
  2011-01-07  3:22 TRIM details Phil Karn
  2011-01-07  4:35 ` Martin K. Petersen
@ 2011-01-07  9:11 ` Matthias Schniedermeyer
  2011-01-07  9:17   ` Matthias Schniedermeyer
                     ` (2 more replies)
  1 sibling, 3 replies; 9+ messages in thread
From: Matthias Schniedermeyer @ 2011-01-07  9:11 UTC (permalink / raw)
  To: karn; +Cc: xfs

On 06.01.2011 19:22, Phil Karn wrote:
> Now that I've rebuilt my main Linux server, added a 120GB Intel SSD and
> converted all the file systems to XFS, I've gotten interested in the
> internals of both XFS and TRIM and how they work together (or will work
> together).
> 
> I'd like to know exactly how the drives implement TRIM but I've only
> found bits and pieces. Can anyone suggest a current and complete
> reference for the complete SATA command set that includes all the TRIM
> related stuff?
> 
> As I understand it, there's a SATA (and SCSI?) command that will
> repeatedly write a fixed block of data to some number of consecutive
> LBAs (WRITE SAME), and an "unmap" bit in the write command can be set to
> indicate that instead of actually writing the blocks, they can be marked
> for erasure and placed in the free pool.

I roughly know what happens in the SATA version.
The SATA command takes a sector offset and a sector count (up to 64K).

The spec is linked from the Wikipedia page:
http://en.wikipedia.org/wiki/TRIM

> Is this the only way it can be done? It occurs to me that while an
> "unmap" bit should be quite fast, you don't absolutely *have* to have it.

"Quite fast" is something of an understatement.
The Intel SSD can TRIM the whole drive in a matter of seconds. I tested 
that with hdparm when I wrote myself a simple disc-imaging Perl script.

> Just have the drive interpret an ordinary write of all 0's to any LBA as
> an implicit "unmap" indication for that LBA. As long as the drive

The drive would have to look into each written sector on the off chance 
that it might be all zeroes; that's a lot of electrons to burn for not 
much gain. And that ignores the performance side: doing such a check on 
each incoming write would be expensive at best.

Bis denn

-- 
Real Programmers consider "what you see is what you get" to be just as 
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated, 
cryptic, powerful, unforgiving, dangerous.




* Re: TRIM details
  2011-01-07  9:11 ` Matthias Schniedermeyer
@ 2011-01-07  9:17   ` Matthias Schniedermeyer
  2011-01-07 14:15     ` Phil Karn
  2011-01-07 14:13   ` Phil Karn
  2011-01-07 14:21   ` Phil Karn
  2 siblings, 1 reply; 9+ messages in thread
From: Matthias Schniedermeyer @ 2011-01-07  9:17 UTC (permalink / raw)
  To: karn; +Cc: xfs

On 07.01.2011 10:11, Matthias Schniedermeyer wrote:
> On 06.01.2011 19:22, Phil Karn wrote:
> 
> > Just have the drive interpret an ordinary write of all 0's to any LBA as
> > an implicit "unmap" indication for that LBA. As long as the drive
> 
> The drive would have to look into each written sector on the off chance 
> that it might be all zeroes; that's a lot of electrons to burn for not 
> much gain. And that ignores the performance side: doing such a check 
> on each incoming write would be expensive at best.

Although, after thinking about it a little more: doing a population 
count in the controller while the data comes in over the wire can't be 
that expensive.

Bis denn




* Re: TRIM details
  2011-01-07  9:11 ` Matthias Schniedermeyer
  2011-01-07  9:17   ` Matthias Schniedermeyer
@ 2011-01-07 14:13   ` Phil Karn
  2011-01-07 16:50     ` Martin K. Petersen
  2011-01-07 14:21   ` Phil Karn
  2 siblings, 1 reply; 9+ messages in thread
From: Phil Karn @ 2011-01-07 14:13 UTC (permalink / raw)
  To: Matthias Schniedermeyer; +Cc: xfs

On 1/7/11 1:11 AM, Matthias Schniedermeyer wrote:

> The drive would have to look into each written sector on the off chance 
> that it might be all zeroes; that's a lot of electrons to burn for not 
> much gain. And that ignores the performance side: doing such a check 
> on each incoming write would be expensive at best.

Oh, there's no question that an explicit TRIM command would be *far*
more efficient than an implicit TRIM that writes zeroes. If nothing
else, implicit TRIMming requires writing every single sector
individually, while the WRITE SAME command lets the host wipe up to
65,536 (I think) sectors with a single command.

But that's not my point. My point is that if the drive could recognize a
write of 0s to a sector as an implicit TRIM, then it would still be
possible to manually trim the drive without any support whatsoever from
the device driver or file system.

You could use a standard copy command, provided you have something like
/dev/zero, or you could write a simple application that wouldn't even
need root privileges (assuming it didn't need to get around any quotas
when creating the temporary file). And it would work for any file system
and any operating system while we're waiting for native TRIM support
(I'm still waiting for TRIM support for HFS+ in Mac OS X).

I don't think it would be that hard for the drive to recognize a write
of all zeroes. It already has to compute a set of Reed-Solomon parity
symbols for every block written to the drive. That's quite a bit more
work than merely seeing if the block is all 0's.

You could even use the existing Reed-Solomon encoder to optimize the
process, though I doubt it would really be necessary. The RS parities
for an all-0 data block are also all 0. If any of the parities are
nonzero, then it can't be a block of 0's. If the parities are all zero,
then confirm in software that the data is all 0's; you'll have very few
false alarms.



* Re: TRIM details
  2011-01-07  9:17   ` Matthias Schniedermeyer
@ 2011-01-07 14:15     ` Phil Karn
  0 siblings, 0 replies; 9+ messages in thread
From: Phil Karn @ 2011-01-07 14:15 UTC (permalink / raw)
  To: Matthias Schniedermeyer; +Cc: xfs

On 1/7/11 1:17 AM, Matthias Schniedermeyer wrote:

> Although, after thinking about it a little more: doing a population 
> count in the controller while the data comes in over the wire can't be 
> that expensive.

Doesn't even have to be a popcount. Just OR every word into a register
as the data flies by and look to see if the result is 0.
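As a sketch of that OR-as-it-streams check (function name mine; a 512-byte sector is 64 such words):

```c
#include <stddef.h>
#include <stdint.h>

/* OR every word of the sector into an accumulator as the data streams
 * past; the sector is all zeroes iff the accumulator ends up zero.
 * One OR per word, one compare at the end -- no popcount needed. */
static int sector_is_zero(const uint64_t *words, size_t nwords)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < nwords; i++)
        acc |= words[i];
    return acc == 0;
}
```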



* Re: TRIM details
  2011-01-07  9:11 ` Matthias Schniedermeyer
  2011-01-07  9:17   ` Matthias Schniedermeyer
  2011-01-07 14:13   ` Phil Karn
@ 2011-01-07 14:21   ` Phil Karn
  2 siblings, 0 replies; 9+ messages in thread
From: Phil Karn @ 2011-01-07 14:21 UTC (permalink / raw)
  To: Matthias Schniedermeyer; +Cc: karn, xfs

On 1/7/11 1:11 AM, Matthias Schniedermeyer wrote:

> "Quite fast" is something of an understatement.
> The Intel SSD can TRIM the whole drive in a matter of seconds. I have 
> tested that with hdparm, when i wrote me a simple disc imaging 
> perl-script.

Yes, I've been running wiper.sh on the XFS filesystems on my Intel 120
GB SSD and it's impressively fast. I've been careful to do backups
first, but I haven't had a problem yet.



* Re: TRIM details
  2011-01-07 14:13   ` Phil Karn
@ 2011-01-07 16:50     ` Martin K. Petersen
  2011-01-07 23:43       ` Phil Karn
  0 siblings, 1 reply; 9+ messages in thread
From: Martin K. Petersen @ 2011-01-07 16:50 UTC (permalink / raw)
  To: Phil Karn; +Cc: xfs

>>>>> "Phil" == Phil Karn <karn@ka9q.net> writes:

Phil> Oh, there's no question that an explicit TRIM command would be
Phil> *far* more efficient than an implicit TRIM that writes zeroes. If
Phil> nothing else, implicit TRIMming requires writing every single
Phil> sector individually, while the WRITE SAME command lets the host
Phil> wipe up to 65,536 (I think) sectors with a single command.

ATA does not have WRITE SAME. It's a SCSI command.

WRITE SAME(10) allows clearing 32MB per command on a device with
512-byte blocks. WRITE SAME(16) allows a bigger area but most drives
don't support it. Those that do often cap at 16 bits anyway. (Note that
I'm talking about drives; arrays are more flexible.)

DSM TRIM allows you to clear 2GB per command with a 512-byte
payload. Several modern drives will let you clear 16GB with a 4KB
payload.
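Those figures follow from the payload format, assuming 8-byte range entries with a 16-bit sector count each (my arithmetic; these are upper bounds):

```c
#include <stdint.h>

/* Upper bound on bytes discardable by one DSM TRIM command for a given
 * payload size: 8-byte range entries, at most 65,535 sectors per entry,
 * 512-byte sectors. A 512-byte payload gives ~2GB and a 4KB payload
 * ~16GB, matching the figures quoted above. */
static uint64_t dsm_trim_max_bytes(uint64_t payload_bytes)
{
    const uint64_t entries = payload_bytes / 8;
    const uint64_t max_sectors_per_entry = 65535;
    return entries * max_sectors_per_entry * 512;
}
```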


Phil> But that's not my point. My point is that if the drive could
Phil> recognize a write of 0s to a sector as an implicit TRIM, then it
Phil> would still be possible to manually trim the drive without any
Phil> support whatsoever from the device driver or file system.

But the fact remains that drives don't implement this. They do implement
DSM TRIM. Even if the drives did support zero detection we'd have no way
of getting the information to them short of sending a bazillion zeroes
down the pipe. And why would the drive vendors add support for a
crappier interface when DSM exists?

If you are set on using dd you could do zero detection in the kernel and
have the filesystem either send the data pages or issue discards for the
relevant regions if the device supports it. We pretty much have all the
infrastructure in place for that. But your time is better spent adding
FITRIM support to your filesystem of choice. XFS is done already,
Christoph posted the patches.


Phil> I don't think it would be that hard for the drive to recognize a
Phil> write of all zeroes. It already has to compute a set of Reed
Phil> Solomon parity symbols for every block written to the
Phil> drive.

That typically happens way later. There's usually a clear separation
between command processing and encoding. The zero detection needs to
happen early as it affects whether you need to mark the block in a
bitmap or allocate a real flash block.

-- 
Martin K. Petersen	Oracle Linux Engineering



* Re: TRIM details
  2011-01-07 16:50     ` Martin K. Petersen
@ 2011-01-07 23:43       ` Phil Karn
  0 siblings, 0 replies; 9+ messages in thread
From: Phil Karn @ 2011-01-07 23:43 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: xfs

On 1/7/11 8:50 AM, Martin K. Petersen wrote:

> ATA does not have WRITE SAME. It's a SCSI command.

Ah. I keep thinking that the ATA commands are the same as SCSI commands
because Linux does a pretty good job of making ATA drives look as though
they're SCSI, but the mapping isn't 1:1.

> But the fact remains that drives don't implement this. They do implement
> DSM TRIM. Even if the drives did support zero detection we'd have no way
> of getting the information to them short of sending a bazillion zeroes
> down the pipe.

I know.

> And why would the drive vendors add support for a
> crappier interface when DSM exists?

The *only* reason I suggest this is to make it possible to manually TRIM
a drive when the file system and/or device driver don't yet support the
explicit device-level TRIM command.

RAID subsystems are another obstacle to TRIM, but I don't quite see the
point in using RAID with SSDs, or especially why so many people seem to
want to do RAID-0 with SSDs. SSDs already implement something much like
RAID-0 internally, i.e., interleaving for speed, which is why the bigger
SSDs are generally faster than the smaller ones until the interface
saturates. At that point you're better off abandoning SATA and attaching
the SSD subsystem directly to the processor over a PCIe path, as in the
OCZ RevoDrive.

The implicit TRIM-with-zeroes feature I suggest wouldn't have to be in
every drive. You wouldn't have to use it even if it were there. And I
will certainly use your new XFS code as soon as it's stable enough to go
into the production kernel.

But the many people concerned about the lack of TRIM support in their
proprietary, closed-source OS/FS of choice could select such a drive and
trim manually at the application layer until (or if) their vendor
finally gets around to supporting explicit device-level TRIM. OS X still
doesn't have it, which is surprising given how many SSDs Apple has sold
in MacBooks -- and how much they charge for them.

This is not something I'd expected to be of direct use to those who use
XFS. But you've obviously been thinking a lot about SSDs and TRIM in
general so I knew you'd have some useful comments on the idea. I really
ought to approach an SSD vendor with this idea, but I don't know anybody
who works for one. Thanks for your ideas.

Phil


