* Re: [RFC 4/5] MMC: Adjust unaligned write accesses.
[not found] ` <AANLkTinEQEwa2SqEwTnbe3kcYuDoM-ZbzWE7X+V3B+zV@mail.gmail.com>
@ 2011-03-21 14:21 ` Arnd Bergmann
2011-03-21 14:41 ` Andrei Warkentin
0 siblings, 1 reply; 10+ messages in thread
From: Arnd Bergmann @ 2011-03-21 14:21 UTC (permalink / raw)
To: Andrei Warkentin; +Cc: linux-mmc, linux-ext4
On Saturday 19 March 2011, Andrei Warkentin wrote:
> On Mon, Mar 14, 2011 at 2:40 AM, Andrei Warkentin <andreiw@motorola.com> wrote:
>
> >>>
> >>> Revalidating the data now, along with some more tests, to get a better
> >>> picture. It seems the more data I get, the less it makes sense :(.
> >>
> >> I was already fearing that the change would only benefit low-level
> >> benchmarks. It certainly helps writing small chunks to the buffer
> >> that is meant for FAT32 directories, but at some point, the card
> >> will have to write back the entire logical erase block, so you
> >> might not be able to gain much in real-world workloads.
> >>
> >
>
> Attaching is some data I have collected on the MMC32G part. I tried
> to make the collection process as controlled as possible, as well as
> use more-or-less a "real life" usage case that involves running a user
> application, so it's not just a purely synthetic test at block level.
>
> Attached file (I hope you don't mind PDFs) contains data collected for
> two possible optimizations. The second page of the document tests the
> vendor suggested optimization that is basically -
> if (request_blocks < 24) {
>         /* given request offset, calculate sectors remaining on 8K page
>            containing offset */
>         sectors = 16 - (request_offset % 16);
>         if (request_blocks > sectors) {
>                 request_blocks = sectors;
>         }
> }
> ...I'll call this optimization A.
>
> ...the first page of the document tests the optimization that floated
> up on the list when I first sent a patch with the vendor suggestions.
> That optimization being - align all unaligned accesses (either all
> completely, or under a certain size threshold) on flash page size.
> I'll call this optimization B.
I'm not sure if I really understand the difference between the two.
Do you mean optimization A makes sure that you don't have partial
pages at the start of a request, while optimization B also splits
small requests on page boundary if the first page in it is aligned?
> To test, I collect time info for 2000 small sqlite inserts into each
> of 20 separate tables. So that's 20 x 2000 sqlite inserts per
> test. The test is executed for ext2, ext3 and ext4 with a 4k block
> size. Every test begins with a flash discard and format operation on
> the partition where the tables are created and accessed, to ensure
> similar accesses to flash on every test. All other partitions are RO,
> and no processes other than those needed by the tests run. All power
> management is disabled. The results are thus repeatable, consistent
> and stable across reboots and power-on time...
>
> Each test consists of:
> 1) Unmount partition
> 2) Flash erase
> 3) Format with fs
> 4) Mount
> 5) Sync
> 6) echo 3 > /proc/sys/vm/drop_caches
> 7) run 20 x 2000 inserts as described above
> 8) unmount
Just to make sure: Did you properly align the partition start on an
erase block boundary of 4MB?
I would have loved to see results with nilfs2 and btrfs as well, but
I can understand that these were less relevant to you, especially
since you don't really want to compare the file systems as much as
your own changes.
One very surprising result to me is how much worse the ext4 numbers
are compared to ext2/ext3. I would have guessed that they should
be much better, given that the ext4 developers are specifically
trying to optimize for this case. I've taken the ext4 mailing
list on Cc here and will forward your test results there as
well.
> For optimization B testing, the alignment size and alignment access
> size threshold (same parameters as in my RFC patch) are exposed
> through debugfs. To get B test data, the flow was
>
> 1) Set alignment to none (no optimization)
> 2) Sql test on ext2
> 3) Sql test on ext3
> 4) Sql test on ext4
>
> 6) Set alignment to 8k, no threshold
> 7) Sql test on ext2
> 8) Sql test on ext3
> 9) Sql test on ext4
>
> 10) Set alignment to 8k, < 8k only
> 11) Sql test on ext2
> 12) Sql test on ext3
> 13) Sql test on ext4
>
> ...all the way up to 32K threshold.
>
> For optimization A testing, the optimization was turned off/on with a
> debugfs attribute, and the data collected with this flow:
>
> 1) Turn off optimization
> 2) Sql test on ext2
> 3) Sql test on ext3
> 4) Sql test on ext4
> 5) Turn on optimization
> 6) Sql test on ext2
> 7) Sql test on ext3
> 8) Sql test on ext4
>
> My interpretation of the results: Any kind of alignment-on-flash page
> optimization produced data that in all cases was either
> indistinguishable from control, or was worse. Do you agree with my
> interpretation?
I suppose when the result is total runtime in seconds, larger numbers
are always worse, so I agree.
One potential flaw in the measurement might be that running the test
a second time means that the card is already in a state that requires
garbage collection and therefore slower. Running the test in the opposite
order (optimized first, then unoptimized) might theoretically lead
to other results. It's not clear from your description whether your
test method has taken this into account (I would assume yes).
> So I guess that hexes the align optimization, at least until I can get
> data for MMC16G with the same controlled setup. Sorry about that. I'll
> work on the "reliability optimization" now, which I guess is pretty
> generic for cards with similar buffer schemes. It relies on reliable
> writes, so exposing that will be first for review here...
>
> Even though I'm rescinding the adjust/align patch, is there any chance
> for pulling in my quirks changes?
The quirks patch still looks fine to me, I'd just recommend that we
don't apply it before we have a need for it, i.e. at least a single
card specific quirk.
Arnd
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC 4/5] MMC: Adjust unaligned write accesses.
2011-03-21 14:21 ` [RFC 4/5] MMC: Adjust unaligned write accesses Arnd Bergmann
@ 2011-03-21 14:41 ` Andrei Warkentin
2011-03-21 18:03 ` Andreas Dilger
0 siblings, 1 reply; 10+ messages in thread
From: Andrei Warkentin @ 2011-03-21 14:41 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-mmc, linux-ext4
On Mon, Mar 21, 2011 at 9:21 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>> Attached file (I hope you don't mind PDFs) contains data collected for
>> two possible optimizations. The second page of the document tests the
>> vendor suggested optimization that is basically -
>> if (request_blocks < 24) {
>>         /* given request offset, calculate sectors remaining on 8K page
>>            containing offset */
>>         sectors = 16 - (request_offset % 16);
>>         if (request_blocks > sectors) {
>>                 request_blocks = sectors;
>>         }
>> }
>> ...I'll call this optimization A.
>>
>> ...the first page of the document tests the optimization that floated
>> up on the list when I first sent a patch with the vendor suggestions.
>> That optimization being - align all unaligned accesses (either all
>> completely, or under a certain size threshold) on flash page size.
>> I'll call this optimization B.
>
> I'm not sure if I really understand the difference between the two.
> Do you mean optimization A makes sure that you don't have partial
> pages at the start of a request, while optimization B also splits
> small requests on page boundary if the first page in it is aligned?
The vendor optimization always splits accesses under 12k, even if they
are aligned. There are (still) some outstanding questions
on how that's supposed to work (to improve anything), but that's the algorithm.
"our" optimization, suggested on this list, was to align accesses onto
flash page size, thus splitting each request into (small) unaligned
and aligned portions.
>
>> To test, I collect time info for 2000 small sqlite inserts into each
>> of 20 separate tables. So that's 20 x 2000 sqlite inserts per
>> test. The test is executed for ext2, ext3 and ext4 with a 4k block
>> size. Every test begins with a flash discard and format operation on
>> the partition where the tables are created and accessed, to ensure
>> similar accesses to flash on every test. All other partitions are RO,
>> and no processes other than those needed by the tests run. All power
>> management is disabled. The results are thus repeatable, consistent
>> and stable across reboots and power-on time...
>>
>> Each test consists of:
>> 1) Unmount partition
>> 2) Flash erase
>> 3) Format with fs
>> 4) Mount
>> 5) Sync
>> 6) echo 3 > /proc/sys/vm/drop_caches
>> 7) run 20 x 2000 inserts as described above
>> 8) unmount
>
> Just to make sure: Did you properly align the partition start on an
> erase block boundary of 4MB?
>
Yes, absolutely.
> I would have loved to see results with nilfs2 and btrfs as well, but
> I can understand that these were less relevant to you, especially
> since you don't really want to compare the file systems as much as
> your own changes.
>
In the context of looking at this anyway, I will try and get
comparison data for sqlite on different fs (and different fs tunables)
on flash.
> One very surprising result to me is how much worse the ext4 numbers
> are compared to ext2/ext3. I would have guessed that they should
> be much better, given that the ext4 developers are specifically
> trying to optimize for this case. I've taken the ext4 mailing
> list on Cc here and will forward your test results there as
> well.
I was surprised too.
> One potential flaw in the measurement might be that running the test
> a second time means that the card is already in a state that requires
> garbage collection and therefore slower. Running the test in the opposite
> order (optimized first, then unoptimized) might theoretically lead
> to other results. It's not clear from your description whether your
> test method has taken this into account (I would assume yes).
>
I've done tests across reboots that showed consistent results.
Additionally, repeating one test after another showed the same results. At
least on this flash medium, a block erase (using the erase utility from
flashbench, modified to erase everything if no argument is provided)
before formatting with the fs for every test seemed to make results
consistent.
>> So I guess that hexes the align optimization, at least until I can get
>> data for MMC16G with the same controlled setup. Sorry about that. I'll
>> work on the "reliability optimization" now, which I guess is pretty
>> generic for cards with similar buffer schemes. It relies on reliable
>> writes, so exposing that will be first for review here...
>>
>> Even though I'm rescinding the adjust/align patch, is there any chance
>> for pulling in my quirks changes?
>
> The quirks patch still looks fine to me, I'd just recommend that we
> don't apply it before we have a need for it, i.e. at least a single
> card specific quirk.
>
Ok. Sounds good. Back to reliable writes it is, so I can roll up the
second quirk...
A
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC 4/5] MMC: Adjust unaligned write accesses.
2011-03-21 14:41 ` Andrei Warkentin
@ 2011-03-21 18:03 ` Andreas Dilger
2011-03-21 19:05 ` Arnd Bergmann
0 siblings, 1 reply; 10+ messages in thread
From: Andreas Dilger @ 2011-03-21 18:03 UTC (permalink / raw)
To: Andrei Warkentin; +Cc: Arnd Bergmann, linux-mmc, linux-ext4
On 2011-03-21, at 3:41 PM, Andrei Warkentin wrote:
> On Mon, Mar 21, 2011 at 9:21 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>>> Attached file (I hope you don't mind PDFs) contains data collected for
>>> two possible optimizations. The second page of the document tests the
>>> vendor suggested optimization that is basically -
>>> if (request_blocks < 24) {
>>>         /* given request offset, calculate sectors remaining on 8K page
>>>            containing offset */
>>>         sectors = 16 - (request_offset % 16);
>>>         if (request_blocks > sectors) {
>>>                 request_blocks = sectors;
>>>         }
>>> }
>>> ...I'll call this optimization A.
>>>
>>> ...the first page of the document tests the optimization that floated
>>> up on the list when I first sent a patch with the vendor suggestions.
>>> That optimization being - align all unaligned accesses (either all
>>> completely, or under a certain size threshold) on flash page size.
>>> I'll call this optimization B.
>>
>> I'm not sure if I really understand the difference between the two.
>> Do you mean optimization A makes sure that you don't have partial
>> pages at the start of a request, while optimization B also splits
>> small requests on page boundary if the first page in it is aligned?
>
> The vendor optimization always splits accesses under 12k, even if they
> are aligned. There are (still) some outstanding questions
> on how that's supposed to work (to improve anything), but that's the algorithm.
>
> "our" optimization, suggested on this list, was to align accesses onto
> flash page size, thus splitting each request into (small) unaligned
> and aligned portions.
Note that mballoc was specifically designed to handle allocation requests that are aligned on RAID stripe boundaries, so it should be able to handle this for MMC as well. What is needed is to tell the filesystem what the underlying alignment is. That can be done at format time with mke2fs or afterward with tune2fs by using the "-E stripe_width" option.
>>> To test, I collect time info for 2000 small sqlite inserts into each
>>> of 20 separate tables. So that's 20 x 2000 sqlite inserts per
>>> test. The test is executed for ext2, ext3 and ext4 with a 4k block
>>> size. Every test begins with a flash discard and format operation on
>>> the partition where the tables are created and accessed, to ensure
>>> similar accesses to flash on every test. All other partitions are RO,
>>> and no processes other than those needed by the tests run. All power
>>> management is disabled. The results are thus repeatable, consistent
>>> and stable across reboots and power-on time...
>>>
>>> Each test consists of:
>>> 1) Unmount partition
>>> 2) Flash erase
>>> 3) Format with fs
>>> 4) Mount
>>> 5) Sync
>>> 6) echo 3 > /proc/sys/vm/drop_caches
>>> 7) run 20 x 2000 inserts as described above
>>> 8) unmount
>>
>> Just to make sure: Did you properly align the partition start on an
>> erase block boundary of 4MB?
>>
>
> Yes, absolutely.
>
>> I would have loved to see results with nilfs2 and btrfs as well, but
>> I can understand that these were less relevant to you, especially
>> since you don't really want to compare the file systems as much as
>> your own changes.
>>
>
> In the context of looking at this anyway, I will try and get
> comparison data for sqlite on different fs (and different fs tunables)
> on flash.
>
>> One very surprising result to me is how much worse the ext4 numbers
>> are compared to ext2/ext3. I would have guessed that they should
>> be much better, given that the ext4 developers are specifically
>> trying to optimize for this case. I've taken the ext4 mailing
>> list on Cc here and will forward your test results there as
>> well.
>
> I was surprised too.
>
>> One potential flaw in the measurement might be that running the test
>> a second time means that the card is already in a state that requires
>> garbage collection and therefore slower. Running the test in the opposite
>> order (optimized first, then unoptimized) might theoretically lead
>> to other results. It's not clear from your description whether your
>> test method has taken this into account (I would assume yes).
>>
>
> I've done tests across reboots that showed consistent results.
> Additionally, repeating one test after another showed the same results. At
> least on this flash medium, a block erase (using the erase utility from
> flashbench, modified to erase everything if no argument is provided)
> before formatting with the fs for every test seemed to make results
> consistent.
>
>>> So I guess that hexes the align optimization, at least until I can get
>>> data for MMC16G with the same controlled setup. Sorry about that. I'll
>>> work on the "reliability optimization" now, which I guess is pretty
>>> generic for cards with similar buffer schemes. It relies on reliable
>>> writes, so exposing that will be first for review here...
>>>
>>> Even though I'm rescinding the adjust/align patch, is there any chance
>>> for pulling in my quirks changes?
>>
>> The quirks patch still looks fine to me, I'd just recommend that we
>> don't apply it before we have a need for it, i.e. at least a single
>> card specific quirk.
>>
>
> Ok. Sounds good. Back to reliable writes it is, so I can roll up the
> second quirk...
>
> A
Cheers, Andreas
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC 4/5] MMC: Adjust unaligned write accesses.
2011-03-21 18:03 ` Andreas Dilger
@ 2011-03-21 19:05 ` Arnd Bergmann
2011-03-21 23:58 ` Andreas Dilger
0 siblings, 1 reply; 10+ messages in thread
From: Arnd Bergmann @ 2011-03-21 19:05 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Andrei Warkentin, linux-mmc, linux-ext4
On Monday 21 March 2011 19:03:09 Andreas Dilger wrote:
> Note that mballoc was specifically designed to handle allocation
> requests that are aligned on RAID stripe boundaries, so it should
> be able to handle this for MMC as well. What is needed is to tell
> the filesystem what the underlying alignment is. That can be done
> at format time with mke2fs or afterward with tune2fs by using the
> "-E stripe_width" option.
Ah, that sounds useful. So would I set the stripe_width to the
erase block size, and the block group size to a multiple of that?
Does this also work in (rare) cases where the erase block size is
not a power of two?
Arnd
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC 4/5] MMC: Adjust unaligned write accesses.
2011-03-21 19:05 ` Arnd Bergmann
@ 2011-03-21 23:58 ` Andreas Dilger
2011-03-22 13:56 ` Arnd Bergmann
0 siblings, 1 reply; 10+ messages in thread
From: Andreas Dilger @ 2011-03-21 23:58 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: Andrei Warkentin, linux-mmc, linux-ext4
On 2011-03-21, at 8:05 PM, Arnd Bergmann wrote:
> On Monday 21 March 2011 19:03:09 Andreas Dilger wrote:
>> Note that mballoc was specifically designed to handle allocation
>> requests that are aligned on RAID stripe boundaries, so it should
>> be able to handle this for MMC as well. What is needed is to tell
>> the filesystem what the underlying alignment is. That can be done
>> at format time with mke2fs or afterward with tune2fs by using the
>> "-E stripe_width" option.
>
> Ah, that sounds useful. So would I set the stripe_width to the
> erase block size, and the block group size to a multiple of that?
When you write "block group size" do you mean the ext4 block group? Then yes it would help. You could also consider setting the flex_bg size to a multiple of this, so that the bitmap blocks are grouped as a multiple of this size. However, they may not be aligned correctly, which needs extra effort that isn't obvious.
I think it would be nice to have mke2fs take the stripe_width and/or flex_bg factor into account when sizing/aligning the bitmaps, but it doesn't yet.
> Does this also work in (rare) cases where the erase block size is
> not a power of two?
It does (or is supposed to), but that isn't code that is exercised very much (most installations use a power-of-two size).
Cheers, Andreas
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC 4/5] MMC: Adjust unaligned write accesses.
2011-03-21 23:58 ` Andreas Dilger
@ 2011-03-22 13:56 ` Arnd Bergmann
2011-03-22 15:02 ` Andreas Dilger
0 siblings, 1 reply; 10+ messages in thread
From: Arnd Bergmann @ 2011-03-22 13:56 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Andrei Warkentin, linux-mmc, linux-ext4
On Tuesday 22 March 2011, Andreas Dilger wrote:
> On 2011-03-21, at 8:05 PM, Arnd Bergmann wrote:
> > On Monday 21 March 2011 19:03:09 Andreas Dilger wrote:
> >> Note that mballoc was specifically designed to handle allocation
> >> requests that are aligned on RAID stripe boundaries, so it should
> >> be able to handle this for MMC as well. What is needed is to tell
> >> the filesystem what the underlying alignment is. That can be done
> >> at format time with mke2fs or afterward with tune2fs by using the
> >> "-E stripe_width" option.
> >
> > Ah, that sounds useful. So would I set the stripe_width to the
> > erase block size, and the block group size to a multiple of that?
>
> When you write "block group size" do you mean the ext4 block group?
Yes.
> Then yes it would help. You could also consider setting the flex_bg
> size to a multiple of this, so that the bitmap blocks are grouped as
> a multiple of this size. However, they may not be aligned correctly,
> which needs extra effort that isn't obvious.
>
> I think it would be nice to have mke2fs take the stripe_width and/or
> flex_bg factor into account when sizing/aligning the bitmaps, but it
> doesn't yet.
A few more questions:
* On cards that can only write to a single erase block at a time,
should I make the block group size the same as the erase
block? I suppose writing both block bitmaps, inode and data to
separate erase blocks would create multiple eraseblock
read-modify-write cycles for every single file otherwise.
* Is it guaranteed that inode bitmap, inode, block bitmap and
blocks are always written in low-to-high sector order within
one ext4 block group? A lot of the drives will do a garbage-collect
step (adding hundreds of milliseconds) every time you move back
inside of the eraseblock.
* Is there any way to make ext4 use effective blocks larger
than 4 KB? The most common size for a NAND flash page is 16
KB, right (effectively, ignoring what the hardware does), so
it would be good to never write smaller.
* Calling TRIM on SD cards is probably counterproductive unless
you trim entire erase blocks. Is that even possible with ext4,
assuming that we use block group == erase block?
* Is there a way to put the journal into specific parts of the
drive? Almost all SD cards have an area in the second 4 MB
(more for larger cards) that can be written using random access
without forcing garbage collection on other parts.
> > Does this also work in (rare) cases where the erase block size is
> > not a power of two?
>
> It does (or is supposed to), but that isn't code that is exercised
> very much (most installations use a power-of-two size).
Ok. Recently, cheap TLC (three-level cell, 3-bit MLC) NAND is
becoming popular. I've seen erase block sizes of 6 MiB, 1376 KiB
(4096 / 3, rounded up) and 4128 KiB (1376 * 3) because of this, in
place of the common 4096 KiB. The SD card standard specifies
values of 12 MB and 24 MB aside from the usual power-of-two values
up to 64 MB for large cards (>32GB), while smaller cards are allowed
only up to 4 MB erase blocks and need to be power-of-two. Many
cards do not use the size they claim in their registers.
Arnd
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC 4/5] MMC: Adjust unaligned write accesses.
2011-03-22 13:56 ` Arnd Bergmann
@ 2011-03-22 15:02 ` Andreas Dilger
2011-03-22 15:44 ` Arnd Bergmann
0 siblings, 1 reply; 10+ messages in thread
From: Andreas Dilger @ 2011-03-22 15:02 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: Andrei Warkentin, linux-mmc, linux-ext4
On 2011-03-22, at 2:56 PM, Arnd Bergmann wrote:
> On Tuesday 22 March 2011, Andreas Dilger wrote:
>> On 2011-03-21, at 8:05 PM, Arnd Bergmann wrote:
>>> So would I set the stripe_width to the erase block size, and the
>>> block group size to a multiple of that?
>>
>> When you write "block group size" do you mean the ext4 block group?
>
> Yes.
>
>> Then yes it would help. You could also consider setting the flex_bg
>> size to a multiple of this, so that the bitmap blocks are grouped as
>> a multiple of this size. However, they may not be aligned correctly,
>> which needs extra effort that isn't obvious.
>>
>> I think it would be nice to have mke2fs take the stripe_width and/or
>> flex_bg factor into account when sizing/aligning the bitmaps, but it
>> doesn't yet.
>
> A few more questions:
>
> * On cards that can only write to a single erase block at a time,
> > should I make the block group size the same as the erase
> block? I suppose writing both block bitmaps, inode and data to
> separate erase blocks would create multiple eraseblock
> read-modify-write cycles for every single file otherwise.
That doesn't seem like a very good idea. It will significantly limit the size of the filesystem, and will cause a lot of overhead (two bitmaps per group for only a handful of blocks).
> * Is it guaranteed that inode bitmap, inode, block bitmap and
> blocks are always written in low-to-high sector order within
> one ext4 block group? A lot of the drives will do a garbage-collect
> > step (adding hundreds of milliseconds) every time you move back
> inside of the eraseblock.
Generally, yes. I don't think there is a hard guarantee, but the block device elevator will sort the blocks.
> * Is there any way to make ext4 use effective blocks larger
> than 4 KB? The most common size for a NAND flash page is 16
> > KB, right (effectively, ignoring what the hardware does), so
> it would be good to never write smaller.
You may be interested in Ted's bigalloc patchset. This will force block allocation to be at a power-of-two multiple of the blocksize, so it could be 16kB or whatever. However, this is inefficient if the average filesize is not large enough.
> * Calling TRIM on SD cards is probably counterproductive unless
> you trim entire erase blocks. Is that even possible with ext4,
> assuming that we use block group == erase block?
That is already the case, if the underlying storage reports the erase block size to the filesystem.
> * Is there a way to put the journal into specific parts of the
> drive? Almost all SD cards have an area in the second 4 MB
> (more for larger cards) that can be written using random access
> without forcing garbage collection on other parts.
That would need a small patch to mke2fs. I've been interested in this also for other reasons, but haven't had time to work on it. It will likely need only some small adjustments to ext2fs_add_journal_inode() to allow passing the goal block, and write_journal_inode() to use the goal block instead of its internal heuristic. The default location of the journal inode was previously moved from the beginning of the filesystem to the middle of the filesystem for performance reasons, so this is mostly already handled.
>>> Does this also work in (rare) cases where the erase block size is
>>> not a power of two?
>>
>> It does (or is supposed to), but that isn't code that is exercised
>> very much (most installations use a power-of-two size).
>
> Ok. Recently, cheap TLC (three-level cell, 3-bit MLC) NAND is
> becoming popular. I've seen erase block sizes of 6 MiB, 1376 KiB
> (4096 / 3, rounded up) and 4128 KiB (1376 * 3) because of this, in
> place of the common 4096 KiB. The SD card standard specifies
> values of 12 MB and 24 MB aside from the usual power-of-two values
> up to 64 MB for large cards (>32GB), while smaller cards are allowed
> only up to 4 MB erase blocks and need to be power-of-two. Many
> cards do not use the size they claim in their registers.
Well, the large erase block size is not in itself a problem, but if the devices do not use the reported erase block size internally, there is nothing much that ext4 or the rest of the kernel can do about it, since it has no other way of knowing what the real erase block size is.
Cheers, Andreas
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC 4/5] MMC: Adjust unaligned write accesses.
2011-03-22 15:02 ` Andreas Dilger
@ 2011-03-22 15:44 ` Arnd Bergmann
0 siblings, 0 replies; 10+ messages in thread
From: Arnd Bergmann @ 2011-03-22 15:44 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Andrei Warkentin, linux-mmc, linux-ext4
On Tuesday 22 March 2011, Andreas Dilger wrote:
> On 2011-03-22, at 2:56 PM, Arnd Bergmann wrote:
> > On Tuesday 22 March 2011, Andreas Dilger wrote:
> >> On 2011-03-21, at 8:05 PM, Arnd Bergmann wrote:
> >
> > * On cards that can only write to a single erase block at a time,
> > should I make the block group size the same as the erase
> > block? I suppose writing both block bitmaps, inode and data to
> > separate erase blocks would create multiple eraseblock
> > read-modify-write cycles for every single file otherwise.
>
> That doesn't seem like a very good idea. It will significantly limit
> the size of the filesystem, and will cause a lot of overhead (two bitmaps
> per group for only a handful of blocks).
I'm willing to spend a little space overhead in return for one
or two orders of magnitude in performance and life expectancy
for the card ;-)
A typical case is that a single-page (16KB) write to the currently
open erase block takes 1ms, but since writing to another erase block
requires a garbage-collection (erase-rewrite 4 MB), it takes 500 ms,
just like the following access to the first erase block, which
has now been closed.
Every erase cycle ages the drive, and on some cheap ones, you only
have about 2000 guaranteed erases per erase block!
> > * Is it guaranteed that inode bitmap, inode, block bitmap and
> > blocks are always written in low-to-high sector order within
> > one ext4 block group? A lot of the drives will do a garbage-collect
> > step (adding hundreds of milliseconds) every time you move back
> > inside of the eraseblock.
>
> Generally, yes. I don't think there is a hard guarantee,
> but the block device elevator will sort the blocks.
Ok.
> > * Is there any way to make ext4 use effective blocks larger
> > than 4 KB? The most common size for a NAND flash page is 16
> > KB, right (effectively, ignoring what the hardware does), so
> > it would be good to never write smaller.
>
> You may be interested in Ted's bigalloc patchset. This will force
> block allocation to be at a power-of-two multiple of the blocksize,
> so it could be 16kB or whatever. However, this is inefficient if
> the average filesize is not large enough.
Is it just a performance/space tradeoff, or is there also a
performance overhead in this?
> > * Calling TRIM on SD cards is probably counterproductive unless
> > you trim entire erase blocks. Is that even possible with ext4,
> > assuming that we use block group == erase block?
>
> That is already the case, if the underlying storage reports the
> erase block size to the filesystem.
Ok, I should try to find out how this is done on SD cards.
The hardware interface allows erasing 512 byte sectors, so
we might be reporting that instead.
> > * Is there a way to put the journal into specific parts of the
> > drive? Almost all SD cards have an area in the second 4 MB
> > (more for larger cards) that can be written using random access
> > without forcing garbage collection on other parts.
>
> That would need a small patch to mke2fs. I've been interested in
> this also for other reasons, but haven't had time to work on it.
> It will likely need only some small adjustments to
> ext2fs_add_journal_inode() to allow passing the goal block, and
> write_journal_inode() to use the goal block instead of its internal
> heuristic. The default location of the journal inode was previously
> moved from the beginning of the filesystem to the middle of the
> filesystem for performance reasons, so this is mostly already handled.
Ok. It was previously suggested to put an external journal on a
4 MB partition for experimenting with this. I hope I can back this
up with performance numbers soon.
> Well, the large erase block size is not in itself a problem, but
> if the devices do not use the reported erase block size internally,
> there is nothing much that ext4 or the rest of the kernel can do
> about it, since it has no other way of knowing what the real erase
> block size is.
For SDHC cards, the typical case is that they are reasonably efficient
when you use the reported size, because they are tested that way.
A lot of cards use 2 MiB internally but report 4 MiB, which is fine
as long as you write the 4 MiB consecutively and don't alternate
between the two halves.
The split into three erase blocks of (4 MiB / 3) is on low-end
SanDisk cards, and I believe it has mostly advantages and will
work well if we use the reported 4 MiB.
The 4128 KiB erase blocks are on a USB stick, and those devices
do not report any erase block size at all.
I have written a tool to detect the actual erase block size,
and perhaps that could be integrated into mke2fs and similar
tools.
Arnd
^ permalink raw reply [flat|nested] 10+ messages in thread
* Fwd: Re: [RFC 4/5] MMC: Adjust unaligned write accesses.
@ 2011-03-21 14:27 Arnd Bergmann
2011-03-21 23:45 ` Andreas Dilger
0 siblings, 1 reply; 10+ messages in thread
From: Arnd Bergmann @ 2011-03-21 14:27 UTC (permalink / raw)
To: linux-ext4, Andrei Warkentin
[-- Attachment #1: Type: text/plain, Size: 4565 bytes --]
Hi ext4 developers,
Andrei has been experimenting with optimizations in the mmc layer for
specific eMMC media. The test results so far show not much success, but
I was rather surprised that ext4 performs worse than ext3 on this
drive and test case.
I would have expected that with support for delayed allocation and
trim, it should be better.
Arnd
---------- Forwarded Message ----------
Subject: Re: [RFC 4/5] MMC: Adjust unaligned write accesses.
Date: Saturday 19 March 2011
From: Andrei Warkentin <andreiw@motorola.com>
To: Arnd Bergmann <arnd@arndb.de>
CC: linux-mmc@vger.kernel.org
Hi Arnd, all...
On Mon, Mar 14, 2011 at 2:40 AM, Andrei Warkentin <andreiw@motorola.com> wrote:
>>>
>>> Revalidating the data now, along with some more tests, to get a better
>>> picture. It seems the more data I get, the less it makes sense :(.
>>
>> I was already fearing that the change would only benefit low-level
>> benchmarks. It certainly helps writing small chunks to the buffer
>> that is meant for FAT32 directories, but at some point, the card
>> will have to write back the entire logical erase block, so you
>> might not be able to gain much in real-world workloads.
>>
>
Attached is some data I have collected on the MMC32G part. I tried
to make the collection process as controlled as possible, and to
use a more-or-less "real life" usage case that involves running a user
application, so it's not just a purely synthetic test at block level.
The attached file (I hope you don't mind PDFs) contains data collected
for two possible optimizations. The second page of the document tests
the vendor-suggested optimization, which is basically:
if (request_blocks < 24) {
        /* given the request offset, calculate sectors remaining on
         * the 8K page containing the offset */
        sectors = 16 - (request_offset % 16);
        if (request_blocks > sectors)
                request_blocks = sectors;
}
...I'll call this optimization A.
...the first page of the document tests the optimization that floated
up on the list when I first sent a patch with the vendor suggestions.
That optimization is: align all unaligned accesses (either all of
them, or only those under a certain size threshold) on the flash
page size.
I'll call this optimization B.
To test, I collect timing info for 2000 small sqlite inserts into each
of 20 separate tables, i.e. 20 x 2000 sqlite inserts per test. The
test is executed for ext2, ext3 and ext4 with a 4k block size. Every
test begins with a flash discard and format operation on the partition
where the tables are created and accessed, to ensure similar accesses
to flash on every test. All other partitions are RO, and no processes
other than those needed by the tests run. All power management is
disabled. The results are thus repeatable, consistent and stable
across reboots and power-on time...
Each test consists of:
1) Unmount partition
2) Flash erase
3) Format with fs
4) Mount
5) Sync
6) echo 3 > /proc/sys/vm/drop_caches
7) run 20 x 2000 inserts as described above
8) unmount
For optimization B testing, the alignment size and alignment access
size threshold (same parameters as in my RFC patch) are exposed
through debugfs. To get B test data, the flow was
1) Set alignment to none (no optimization)
2) Sql test on ext2
3) Sql test on ext3
4) Sql test on ext4
5) Set alignment to 8k, no threshold
6) Sql test on ext2
7) Sql test on ext3
8) Sql test on ext4
9) Set alignment to 8k, < 8k only
10) Sql test on ext2
11) Sql test on ext3
12) Sql test on ext4
...all the way up to 32K threshold.
For optimization A testing, the optimization was turned off/on with a
debugfs attribute, and the data collected with this flow:
1) Turn off optimization
2) Sql test on ext2
3) Sql test on ext3
4) Sql test on ext4
5) Turn on optimization
6) Sql test on ext2
7) Sql test on ext3
8) Sql test on ext4
My interpretation of the results: any kind of align-on-flash-page
optimization produced data that in all cases was either
indistinguishable from the control, or worse. Do you agree with my
interpretation?
So I guess that nixes the align optimization, at least until I can get
data for MMC16G with the same controlled setup. Sorry about that. I'll
work on the "reliability optimization" now, which I guess is pretty
generic for cards with similar buffer schemes. It relies on reliable
writes, so exposing that will be first for review here...
Even though I'm rescinding the adjust/align patch, is there any chance
for pulling in my quirks changes?
Thanks,
A
-------------------------------------------------------
[-- Attachment #2: flash data MMC32G.pdf --]
[-- Type: application/pdf, Size: 55157 bytes --]
* Re: [RFC 4/5] MMC: Adjust unaligned write accesses.
2011-03-21 14:27 Fwd: " Arnd Bergmann
@ 2011-03-21 23:45 ` Andreas Dilger
2011-03-22 7:18 ` Andrei Warkentin
0 siblings, 1 reply; 10+ messages in thread
From: Andreas Dilger @ 2011-03-21 23:45 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: linux-ext4@vger.kernel.org, Andrei Warkentin
I was just looking at the test data. I wonder if this slowness might also be due to sync on ext4 using a barrier, and not on ext2/3?
Cheers, Andreas
On 2011-03-21, at 3:27 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> Hi ext4 developers,
>
> Andrei has been experimenting with optimizations in the mmc layer for
> specific eMMC media. The test results so far show not much success, but
> I was rather surprised that ext4 performs worse than ext3 on this
> drive and test case.
>
> I would have expected that with support for delayed allocation and
> trim, it should be better.
>
> Arnd
>
> ---------- Forwarded Message ----------
>
> Subject: Re: [RFC 4/5] MMC: Adjust unaligned write accesses.
> Date: Saturday 19 March 2011
> From: Andrei Warkentin <andreiw@motorola.com>
> To: Arnd Bergmann <arnd@arndb.de>
> CC: linux-mmc@vger.kernel.org
>
> Hi Arnd, all...
>
> On Mon, Mar 14, 2011 at 2:40 AM, Andrei Warkentin <andreiw@motorola.com> wrote:
>
>>>>
>>>> Revalidating the data now, along with some more tests, to get a better
>>>> picture. It seems the more data I get, the less it makes sense :(.
>>>
>>> I was already fearing that the change would only benefit low-level
>>> benchmarks. It certainly helps writing small chunks to the buffer
>>> that is meant for FAT32 directories, but at some point, the card
>>> will have to write back the entire logical erase block, so you
>>> might not be able to gain much in real-world workloads.
>>>
>>
>
> Attaching is some data I have collected on the MMC32G part. I tried
> to make the collection process as controlled as possible, as well as
> use more-or-less a "real life" usage case that involves running a user
> application, so it's not just a purely synthetic test at block level.
>
> Attached file (I hope you don't mind PDFs) contains data collected for
> two possible optimizations. The second page of the document tests the
> vendor suggested optimization that is basically -
> if (request_blocks < 24) {
> /* given request offset, calculate sectors remaining on 8K page
> containing offset */
> sectors = 16 - (request_offset % 16);
> if (request_blocks > sectors) {
> request_blocks = sectors;
> }
> }
> ...I'll call this optimization A.
>
> ...the first page of the document tests the optimization that floated
> up on the list when I first sent a patch with the vendor suggestions.
> That optimization being - align all unaligned accesses (either all
> completely, or under a certain size threshold) on flash page size.
> I'll call this optimization B.
>
> To test, a collect time info for 2000 small inserts into a table with
> sqlite into 20 separate tables. So that's 20 x 2000 sqlite inserts per
> test. The test is executed for ext2, ext3 and ext4 with a 4k block
> size. Every test begins with a flash discard and format operation on
> the partition where the tables are created and accessed, to ensure
> similar acceses to flash on every test. All other partitions are RO,
> and no processes other than those needed by the tests run. All power
> management is disabled. The results are thus repeatable, consistent
> and stable across reboots and power-on time...
>
> Each test consists of:
> 1) Unmount partition
> 2) Flash erase
> 3) Format with fs
> 4) Mount
> 5) Sync
> 6) echo 3 > /proc/sys/vm/drop_caches
> 7) run 20 x 2000 inserts as described above
> 8) unmount
>
> For optimization B testing, the alignment size and alignment access
> size threshold (same parameters as in my RFC patch) are exposed
> through debugfs. To get B test data, the flow was
>
> 1) Set alignment to none (no optimization)
> 2) Sql test on ext2
> 3) Sql test on ext3
> 4) Sql test on ext4
>
> 6) Set alignment to 8k, no threshold
> 7) Sql test on ext2
> 8) Sql test on ext3
> 9) Sql test on ext4
>
> 10) Set alignment to 8k, < 8k only
> 11) Sql test on ext2
> 12) Sql test on ext3
> 13) Sql test on ext4
>
> ...all the way up to 32K threshold.
>
> For optimization A testing, the optimization was turned off/on with a
> debugfs attribute, and the data collected with this flow:
>
> 1) Turn off optimization
> 2) Sql test on ext2
> 3) Sql test on ext3
> 4) Sql test on ext4
> 5) Turn on optimization
> 6) Sql test on ext2
> 7) Sql test on ext3
> 8) Sql test on ext4
>
> My interpretation of the results: Any kind of alignment-on-flash page
> optimization produced data that in all cases was either
> indistinguishable from control, or was worse. Do you agree with my
> interpretation?
>
> So I guess that hexes the align optimization, at least until I can get
> data for MMC16G with the same controlled setup. Sorry about that. I'll
> work on the "reliability optimization" now, which I guess are pretty
> generic for cards with similar buffer schemes. It relies on reliable
> writes, so exposing that will be first for review here...
>
> Even though I'm rescinding the adjust/align patch, is there any chance
> for pulling in my quirks changes?
>
> Thanks,
> A
>
> -------------------------------------------------------
> <flash data MMC32G.pdf>
* Re: [RFC 4/5] MMC: Adjust unaligned write accesses.
2011-03-21 23:45 ` Andreas Dilger
@ 2011-03-22 7:18 ` Andrei Warkentin
0 siblings, 0 replies; 10+ messages in thread
From: Andrei Warkentin @ 2011-03-22 7:18 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Arnd Bergmann, linux-ext4@vger.kernel.org
On Mon, Mar 21, 2011 at 6:45 PM, Andreas Dilger <adilger@whamcloud.com> wrote:
> I was just looking at the test data. I wonder if this slowness might also be due to sync on ext4 using a barrier, and not on ext2/3?
Indeed. ext4 mounts by default with barrier=1... I'll collect some
comparison data with various tunables. Sorry.
A
end of thread, other threads:[~2011-03-22 15:44 UTC | newest]
Thread overview: 10+ messages
[not found] <1299718449-15172-1-git-send-email-andreiw@motorola.com>
[not found] ` <AANLkTimV-no9Wk4wbS4gQGdSgq-2L=ims6SXDFrdEZAe@mail.gmail.com>
[not found] ` <AANLkTinEQEwa2SqEwTnbe3kcYuDoM-ZbzWE7X+V3B+zV@mail.gmail.com>
2011-03-21 14:21 ` [RFC 4/5] MMC: Adjust unaligned write accesses Arnd Bergmann
2011-03-21 14:41 ` Andrei Warkentin
2011-03-21 18:03 ` Andreas Dilger
2011-03-21 19:05 ` Arnd Bergmann
2011-03-21 23:58 ` Andreas Dilger
2011-03-22 13:56 ` Arnd Bergmann
2011-03-22 15:02 ` Andreas Dilger
2011-03-22 15:44 ` Arnd Bergmann
2011-03-21 14:27 Fwd: " Arnd Bergmann
2011-03-21 23:45 ` Andreas Dilger
2011-03-22 7:18 ` Andrei Warkentin