* Some interesting input from a flash manufacturer
@ 2012-03-02 21:00 Theodore Ts'o
  2012-03-02 21:04 ` Eric Sandeen
  2012-03-05  7:00 ` Lukas Czerner
  0 siblings, 2 replies; 9+ messages in thread
From: Theodore Ts'o @ 2012-03-02 21:00 UTC
To: linux-ext4; +Cc: Lukas Czerner

I spent an hour talking to an architecture guy from a major flash manufacturer, who makes everything from SSD's to SD cards to eMMC devices, and he said a few things that were interesting.

One is that he would actually be very happy if we send lots of extra trim commands; in particular, he would actually *like* us to send trims at unlink/commit time, *and* trims periodically via FITRIM. The reason for that is because that way, if the disk is busy, it would be OK if he dropped the TRIM on the floor, knowing that he would get another bite at the apple later on. But, if the disk has time to process the trim, he would be able to use that information as quickly as possible.

One of the other things we talked about was it would be really nice if we could send TRIM commands at journal checkpoint time, and perhaps send checkpoints more aggressively (although the requirement to send a SYNCHRONIZE CACHE command may make this too expensive, unless we have ways of reliably knowing when the disk is idle, since unlike the enterprise server case, when ext4 is used in a mobile device, the fs access patterns tend to have more gaps where this sort of maintenance can take place).

We also talked about ways that we might write some application notes so that handset OEM's understood how to use mke2fs parameters to optimize their file systems for different types of flash systems, and perhaps ways that the eMMC spec could be enhanced so that key parameters such as erase block size, flash page size, and translation table granularity could be passed back to the block layer, and made available to file system and mkfs.
Anyway, going back to TRIM, I suspect that efforts to optimize out TRIM requests may not make as much sense once we have devices that are SATA 3.1 compliant, when we will have a queueable TRIM command. Also, presumably SATA 3.1 compliant devices are less likely to have disastrous firmware bugs that make TRIM such a performance dog, and in fact they may be devices that would very much like as much TRIM information as we are willing to send to them.

Regards,

- Ted
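For readers unfamiliar with the periodic path mentioned above: FITRIM is an ioctl issued on a mounted filesystem, taking a `struct fstrim_range` (three 64-bit fields: start, len, minlen). The following is a minimal sketch in Python, assuming the common Linux ioctl number encoding used on x86 and ARM; the function names are illustrative and not part of any existing tool, and actually calling `fitrim()` needs root privileges and a filesystem that supports discard.

```python
import fcntl
import os
import struct

# struct fstrim_range from <linux/fs.h>: __u64 start, __u64 len, __u64 minlen.
FSTRIM_RANGE_FMT = "=QQQ"
_IOC_READ_WRITE = 3  # direction bits for _IOWR (assumes the generic encoding)


def fitrim_request_code() -> int:
    """Compute _IOWR('X', 121, struct fstrim_range) for the generic encoding."""
    size = struct.calcsize(FSTRIM_RANGE_FMT)
    return (_IOC_READ_WRITE << 30) | (size << 16) | (ord("X") << 8) | 121


def pack_fstrim_range(start: int = 0, length: int = 2**64 - 1, minlen: int = 0) -> bytes:
    """Serialize the range to trim; the defaults cover the whole filesystem."""
    return struct.pack(FSTRIM_RANGE_FMT, start, length, minlen)


def fitrim(mountpoint: str, start: int = 0, length: int = 2**64 - 1, minlen: int = 0) -> int:
    """Ask the filesystem mounted at `mountpoint` to discard its free space.

    Returns the number of bytes the kernel reports as trimmed.
    """
    buf = bytearray(pack_fstrim_range(start, length, minlen))
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, fitrim_request_code(), buf)  # kernel updates buf in place
    finally:
        os.close(fd)
    _, trimmed, _ = struct.unpack(FSTRIM_RANGE_FMT, buf)
    return trimmed
```

Run periodically (for example from a timer job), something like `fitrim("/")` would give the device the "second bite at the apple" described above, independent of whether online discard is also enabled.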
* Re: Some interesting input from a flash manufacturer
  2012-03-02 21:00 Some interesting input from a flash manufacturer Theodore Ts'o
@ 2012-03-02 21:04 ` Eric Sandeen
  2012-03-02 23:11   ` Ted Ts'o
  2012-03-06 18:42   ` Martin K. Petersen
  2012-03-05  7:00 ` Lukas Czerner
  1 sibling, 2 replies; 9+ messages in thread
From: Eric Sandeen @ 2012-03-02 21:04 UTC
To: Theodore Ts'o; +Cc: linux-ext4, Lukas Czerner

On 3/2/12 3:00 PM, Theodore Ts'o wrote:
> I spent an hour talking to an architecture guy from a major flash
> manufacturer, who makes everything from SSD's to SD cards to eMMC
> devices, and he said a few things that were interesting.
>
> One is that he would actually be very happy if we send lots of extra
> trim commands; in particular, he would actually *like* us to send trims
> at unlink/commit time, *and* trims periodically via FITRIM. The reason
> for that is because that way, if the disk is busy, it would be OK if he
> dropped the TRIM on the floor, knowing that he would get another bite at
> the apple later on. But, if the disk has time to process the trim, he
> would be able to use that information as quickly as possible.

Is that within spec?

> One of the other things we talked about was it would be really nice if
> we could send TRIM commands at journal checkpoint time, and perhaps send
> checkpoints more aggressively (although the requirement to send a
> SYNCHRONIZE CACHE command may make this too expensive, unless we have
> ways of reliably knowing when the disk is idle, since unlike the
> enterprise server case, when ext4 is used in a mobile device, the fs
> access patterns tend to have more gaps where this sort of maintenance
> can take place).
>
> We also talked about ways that we might write some application notes so
> that handset OEM's understood how to use mke2fs parameters to optimize
> their file systems for different types of flash systems, and perhaps
> ways that the eMMC spec could be enhanced so that key parameters such as
> erase block size, flash page size, and translation table granularity
> could be passed back to the block layer, and made available to file
> system and mkfs.

Now that would be nice. Could some of this just be piggybacked on the existing preferred_io_size-type geometry interfaces?

-Eric

> Anyway, going back to TRIM, I suspect that efforts to optimize out TRIM
> requests may not make as much sense once we have devices that are SATA
> 3.1 compliant, when we will have a queueable TRIM command. Also,
> presumably SATA 3.1 compliant devices are less likely to have
> disastrous firmware bugs that make TRIM such a performance dog, and in
> fact they may be devices that would very much like as much TRIM
> information as we are willing to send to them.
>
> Regards,
>
> - Ted
* Re: Some interesting input from a flash manufacturer
  2012-03-02 21:04 ` Eric Sandeen
@ 2012-03-02 23:11   ` Ted Ts'o
  2012-03-06  1:12     ` Greg Freemyer
  2012-03-06 18:44     ` Martin K. Petersen
  2012-03-06 18:42   ` Martin K. Petersen
  1 sibling, 2 replies; 9+ messages in thread
From: Ted Ts'o @ 2012-03-02 23:11 UTC
To: Eric Sandeen; +Cc: linux-ext4, Lukas Czerner

On Fri, Mar 02, 2012 at 03:04:48PM -0600, Eric Sandeen wrote:
> > One is that he would actually be very happy if we send lots of extra
> > trim commands; in particular, he would actually *like* us to send trims
> > at unlink/commit time, *and* trims periodically via FITRIM. The reason
> > for that is because that way, if the disk is busy, it would be OK if he
> > dropped the TRIM on the floor, knowing that he would get another bite at
> > the apple later on. But, if the disk has time to process the trim, he
> > would be able to use that information as quickly as possible.
>
> Is that within spec?

Yup; the drive manufacturer is free to do anything they want with the TRIM command; it's purely advisory. So dropping it on the floor if you're too busy because some other process is sending random 4k writes to you at a high rate is something that's within spec. Or if the thin provisioning service is only tracking blocks with a granularity of 4 megs, and it receives a trim request for less than 4 megabytes, it again is perfectly free to drop the trim request on the floor.

I'm even aware of one implementation which remembers the trim request while the system is powered on, but since it doesn't (necessarily) write the trim information to stable store, you could trim the block, read the block and get zeros, then take a power failure, and afterwards, read the block and get the previous contents.

As far as I know, the Trim spec allows all of this.
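The granularity point can be made concrete with a little arithmetic. This is a minimal sketch of one possible device-side policy, assumed purely for illustration (the function name and the policy are hypothetical, not something any spec mandates): only the part of a trim request that covers whole tracking granules is honored, and the rest is dropped.

```python
def whole_granules(offset: int, length: int, granularity: int):
    """Return (start, length) of the part of [offset, offset + length) that
    covers whole granules, or None if no complete granule is covered."""
    start = -(-offset // granularity) * granularity         # round start up
    end = ((offset + length) // granularity) * granularity  # round end down
    if end <= start:
        return None  # the request is smaller than one aligned granule
    return start, end - start


MiB = 1024 * 1024
# A 1 MiB trim against 4 MiB tracking granularity covers no whole granule.
small = whole_granules(0, 1 * MiB, 4 * MiB)
# An 8 MiB trim starting at 1 MiB covers exactly one granule: [4 MiB, 8 MiB).
unaligned = whole_granules(1 * MiB, 8 * MiB, 4 * MiB)
```

Under this assumed policy, `small` is `None` (the whole request is dropped) and `unaligned` keeps only the aligned middle, which is one reason sending the same information again later via FITRIM can still be useful.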
> > We also talked about ways that we might write some application notes so
> > that handset OEM's understood how to use mke2fs parameters to optimize
> > their file systems for different types of flash systems, and perhaps
> > ways that the eMMC spec could be enhanced so that key parameters such as
> > erase block size, flash page size, and translation table granularity
> > could be passed back to the block layer, and made available to file
> > system and mkfs.
>
> Now that would be nice. Could some of this just be piggybacked on the
> existing preferred_io_size-type geometry interfaces?

As far as the /sys/block/XXX/queue/* framework, certainly. It's not clear, however, whether or not we should use entirely new parameters, or try to reuse the existing parameters. For example, would it be better to use optimal_io_size for the flash page size, or the erase block size?

- Ted
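For context, the existing topology and discard limits are plain integer files under `/sys/block/<dev>/queue/`. The following is a minimal sketch of how a userspace tool might read them; the helper name `read_queue_limits` and the `sysfs_root` parameter are illustrative assumptions, and a missing or unreadable file is reported as `None` rather than guessed at.

```python
from pathlib import Path

# Existing queue attributes relevant to this discussion (values are in bytes).
QUEUE_FILES = (
    "minimum_io_size",
    "optimal_io_size",
    "discard_granularity",
    "discard_max_bytes",
)


def read_queue_limits(device: str, sysfs_root: str = "/sys/block") -> dict:
    """Read the I/O topology and discard limits the block layer exports."""
    queue = Path(sysfs_root) / device / "queue"
    limits = {}
    for name in QUEUE_FILES:
        try:
            limits[name] = int((queue / name).read_text().strip())
        except (OSError, ValueError):
            limits[name] = None  # attribute absent or not an integer
    return limits
```

For example, `read_queue_limits("sda")` returns a dictionary keyed by attribute name. Any new flash-specific parameters would presumably appear as additional files in the same directory, but their names are not defined anywhere in this thread.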
* Re: Some interesting input from a flash manufacturer
  2012-03-02 23:11 ` Ted Ts'o
@ 2012-03-06  1:12   ` Greg Freemyer
  2012-03-06 18:44   ` Martin K. Petersen
  1 sibling, 0 replies; 9+ messages in thread
From: Greg Freemyer @ 2012-03-06 1:12 UTC
To: Ted Ts'o; +Cc: Eric Sandeen, linux-ext4, Lukas Czerner

On Fri, Mar 2, 2012 at 6:11 PM, Ted Ts'o <tytso@mit.edu> wrote:
> I'm even aware of one implementation which remembers the trim
> request while the system is powered on, but since it doesn't
> (necessarily) write the trim information to stable store, you could
> trim the block, read the block and get zeros, then take a power
> failure, and afterwards, read the block and get the previous contents.
>
> As far as I know, the Trim spec allows all of this.

It's been a while since I read the spec, but I believe the read operation above changes the rules.

That is, if the SSD advertises itself as having deterministic reads after a trim, that read should lock in the values, and a power cycle should not change that, as I understood the spec.

Otherwise what you describe would be a non-deterministic read. That is also allowed, but the drive would need to advertise itself as non-deterministic after trim.

Greg
* Re: Some interesting input from a flash manufacturer
  2012-03-02 23:11 ` Ted Ts'o
  2012-03-06  1:12   ` Greg Freemyer
@ 2012-03-06 18:44   ` Martin K. Petersen
  2012-03-07  0:52     ` Ted Ts'o
  1 sibling, 1 reply; 9+ messages in thread
From: Martin K. Petersen @ 2012-03-06 18:44 UTC
To: Ted Ts'o; +Cc: Eric Sandeen, linux-ext4, Lukas Czerner

>>>>> "Ted" == Ted Ts'o <tytso@mit.edu> writes:

Ted> As far as the /sys/block/XXX/queue/* framework, certainly. It's
Ted> not clear, however, whether or not we should use entirely new
Ted> parameters, or try to reuse the existing parameters. For example,
Ted> would it be better to use optimal_io_size for the flash page size,
Ted> or the erase block size?

If we were to use the existing fields we'd probably set min_io to the flash page size and optimal_io to the erase block size.

--
Martin K. Petersen	Oracle Linux Engineering
* Re: Some interesting input from a flash manufacturer
  2012-03-06 18:44 ` Martin K. Petersen
@ 2012-03-07  0:52   ` Ted Ts'o
  2012-03-08  4:36     ` Martin K. Petersen
  0 siblings, 1 reply; 9+ messages in thread
From: Ted Ts'o @ 2012-03-07 0:52 UTC
To: Martin K. Petersen; +Cc: Eric Sandeen, linux-ext4, Lukas Czerner

On Tue, Mar 06, 2012 at 01:44:28PM -0500, Martin K. Petersen wrote:
> >>>>> "Ted" == Ted Ts'o <tytso@mit.edu> writes:
>
> Ted> As far as the /sys/block/XXX/queue/* framework, certainly. It's
> Ted> not clear, however, whether or not we should use entirely new
> Ted> parameters, or try to reuse the existing parameters. For example,
> Ted> would it be better to use optimal_io_size for the flash page size,
> Ted> or the erase block size?
>
> If we were to use the existing fields we'd probably set min_io to the
> flash page size and optimal_io to the erase block size.

But min_io currently means the smallest size that we're allowed to write, correct? And the flash page size could be 128k and 512 byte writes might be perfectly OK; it's just that writes are more optimal at 128k, and would be even more optimal at the erase block size of 4 megs.

That's why I'm not sure it makes sense to use the existing fields, since it will confuse file system utilities that are reading those fields.

- Ted
* Re: Some interesting input from a flash manufacturer
  2012-03-07  0:52 ` Ted Ts'o
@ 2012-03-08  4:36   ` Martin K. Petersen
  0 siblings, 0 replies; 9+ messages in thread
From: Martin K. Petersen @ 2012-03-08 4:36 UTC
To: Ted Ts'o; +Cc: Martin K. Petersen, Eric Sandeen, linux-ext4, Lukas Czerner

>>>>> "Ted" == Ted Ts'o <tytso@mit.edu> writes:

Ted> But min_io currently means the smallest size that we're allowed to
Ted> write, correct?

Without incurring a penalty, yes. That was conceived in the standards with 4K sectors and RAID RMW in mind. But I think it would apply to SSDs as well. Depending on how mkfs.* interpret the field, obviously.

Ted> And the flash page size could be 128k and 512 byte writes might be
Ted> perfectly OK; it's just that writes are more optimal at 128k, and
Ted> would be even more optimal at the erase block size of 4 megs.

Yep. Just like in the RAID case where writing the full stripe chunk is better than just a logical block. And a full stripe is even better.

Ted> That's why I'm not sure it makes sense to use the existing fields,
Ted> since it will confuse file system utilities that are reading those
Ted> fields.

Happy to add new fields if it makes sense. But right now ATA ACS doesn't even have anything corresponding to the SCSI fields that populate min_io and opt_io.

--
Martin K. Petersen	Oracle Linux Engineering
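To illustrate the RAID analogy with the numbers used in this subthread: mke2fs already accepts `stride` and `stripe_width` extended options, both expressed in filesystem blocks. The sketch below assumes, purely for illustration, that the flash page size is mapped to `stride` and the erase block size to `stripe_width`; nobody in the thread settled on that mapping, and the helper name is hypothetical.

```python
def mke2fs_extended_opts(flash_page_bytes: int, erase_block_bytes: int,
                         fs_block_bytes: int = 4096) -> str:
    """Build a mke2fs -E option string from flash geometry.

    Assumed mapping (by analogy with RAID): flash page -> stride,
    erase block -> stripe_width, both counted in filesystem blocks.
    """
    if flash_page_bytes % fs_block_bytes or erase_block_bytes % fs_block_bytes:
        raise ValueError("flash geometry must be a multiple of the fs block size")
    stride = flash_page_bytes // fs_block_bytes
    stripe_width = erase_block_bytes // fs_block_bytes
    return f"stride={stride},stripe_width={stripe_width}"


# 128 KiB flash pages and 4 MiB erase blocks, as in the example above.
opts = mke2fs_extended_opts(128 * 1024, 4 * 1024 * 1024)
```

With 4 KiB filesystem blocks this yields `stride=32,stripe_width=1024`, which could be passed as `mke2fs -E stride=32,stripe_width=1024 ...`. Whether existing tools should derive these values from min_io and opt_io for flash is exactly the open question being discussed here.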
* Re: Some interesting input from a flash manufacturer
  2012-03-02 21:04 ` Eric Sandeen
  2012-03-02 23:11   ` Ted Ts'o
@ 2012-03-06 18:42   ` Martin K. Petersen
  1 sibling, 0 replies; 9+ messages in thread
From: Martin K. Petersen @ 2012-03-06 18:42 UTC
To: Eric Sandeen; +Cc: Theodore Ts'o, linux-ext4, Lukas Czerner

>>>>> "Eric" == Eric Sandeen <sandeen@redhat.com> writes:

>> We also talked about ways that we might write some application notes
>> so that handset OEM's understood how to use mke2fs parameters to
>> optimize their file systems for different types of flash systems, and
>> perhaps ways that the eMMC spec could be enhanced so that key
>> parameters such as erase block size, flash page size, and translation
>> table granularity could be passed back to the block layer, and made
>> available to file system and mkfs.

Eric> Now that would be nice. Could some of this just be piggybacked on
Eric> the existing preferred_io_size-type geometry interfaces?

So far the barrier has been that the flash manufacturers did not want to disclose the erase block size, etc. That's why the original standardization efforts in that department were shelved.

If the devices actually start exporting this information I'll be happy to put it in the topology.

--
Martin K. Petersen	Oracle Linux Engineering
* Re: Some interesting input from a flash manufacturer
  2012-03-02 21:00 Some interesting input from a flash manufacturer Theodore Ts'o
  2012-03-02 21:04 ` Eric Sandeen
@ 2012-03-05  7:00 ` Lukas Czerner
  1 sibling, 0 replies; 9+ messages in thread
From: Lukas Czerner @ 2012-03-05 7:00 UTC
To: Theodore Ts'o; +Cc: linux-ext4, Lukas Czerner

On Fri, 2 Mar 2012, Theodore Ts'o wrote:
>
> I spent an hour talking to an architecture guy from a major flash
> manufacturer, who makes everything from SSD's to SD cards to eMMC
> devices, and he said a few things that were interesting.
>
> One is that he would actually be very happy if we send lots of extra
> trim commands; in particular, he would actually *like* us to send trims
> at unlink/commit time, *and* trims periodically via FITRIM. The reason
> for that is because that way, if the disk is busy, it would be OK if he
> dropped the TRIM on the floor, knowing that he would get another bite at
> the apple later on. But, if the disk has time to process the trim, he
> would be able to use that information as quickly as possible.

Hi Ted,

yes, they can do a lot of things behind the curtain, and dropping the TRIM on the floor is clearly one of them. We do not actually care all that much, but they should export proper flags accordingly. So if TRIMs can be dropped on the floor, or the unmapped regions can be read again after a power cycle, they should not export the "discard zeroes data" thing. Of course we do not want them to drop every TRIM command either :).

I think that we would very much like to enable '-o discard'; however, it is still very slow because TRIM is a non-queueable command and it also takes a while to process. Moreover, I have noticed that some devices become 'busy' after they get the TRIM command, so performance is lower for a short period of time after the TRIM.
>
> One of the other things we talked about was it would be really nice if
> we could send TRIM commands at journal checkpoint time, and perhaps send
> checkpoints more aggressively (although the requirement to send a
> SYNCHRONIZE CACHE command may make this too expensive, unless we have
> ways of reliably knowing when the disk is idle, since unlike the
> enterprise server case, when ext4 is used in a mobile device, the fs
> access patterns tend to have more gaps where this sort of maintenance
> can take place).
>
> We also talked about ways that we might write some application notes so
> that handset OEM's understood how to use mke2fs parameters to optimize
> their file systems for different types of flash systems, and perhaps
> ways that the eMMC spec could be enhanced so that key parameters such as
> erase block size, flash page size, and translation table granularity
> could be passed back to the block layer, and made available to file
> system and mkfs.

Regarding eMMC, it would also be very nice of them if they stopped optimizing their flash for FAT and instead took a more general approach and advertised which parts of the flash are faster than others :). Also, from what I know, doing frequent discards on those devices might make them wear out much faster, because wear leveling involves copying data around the flash so that whole erase blocks can be freed.

>
> Anyway, going back to TRIM, I suspect that efforts to optimize out TRIM
> requests may not make as much sense once we have devices that are SATA
> 3.1 compliant, when we will have a queueable TRIM command. Also,
> presumably SATA 3.1 compliant devices are less likely to have
> disastrous firmware bugs that make TRIM such a performance dog, and in
> fact they may be devices that would very much like as much TRIM
> information as we are willing to send to them.

That is definitely very good news; however, those optimizations still make sense.
SSD's are not the only discard-capable devices out there, and SATA 3.1 compliant SSD's will not be the only ones either. So we still need some kind of optimization so that it does not hurt performance on thin-provisioned storage, or on today's SSD's, right?

But I definitely agree that we should start looking into enabling the new SSD's to be more effective, and if frequent discard can help them, then we could start to look at how to enable -o discard for such devices by default. Maybe /sys/block/sda/queue/discard_queuable or something.

Thanks!
-Lukas

>
> Regards,
>
> - Ted
>