Re: MMC quirks relating to performance/lifetime.

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: MMC quirks relating to performance/lifetime.
       [not found]   ` <AANLkTimGrT_p_5xdXJv5UURPMC0cwJCPkZ=xcX5Nk=o=@mail.gmail.com>
@ 2011-02-20 14:39     ` Arnd Bergmann
  2011-02-22  7:46       ` Andrei Warkentin
  0 siblings, 1 reply; 17+ messages in thread
From: Arnd Bergmann @ 2011-02-20 14:39 UTC (permalink / raw)
  To: linux-arm-kernel, linux-fsdevel
  Cc: Andrei Warkentin, Linus Walleij, linux-mmc

[adding linux-fsdevel to Cc, see http://lwn.net/Articles/428941/ and
http://comments.gmane.org/gmane.linux.ports.arm.kernel/105607 for more
on this discussion.]

On Sunday 20 February 2011 12:27:39 Andrei Warkentin wrote:
> On Thu, Feb 17, 2011 at 9:47 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > I think I'd try to reduce the number of sysfs files needed for this.
> > What are the values you would typically set here?
> >
> > My feeling is that separating unaligned page writes from full pages
> > or multiples of pages could always be benefitial for all cards, or at
> > least harmless, but that will require more measurements.
> > Whether to do the reliable write or not could be a simple flag
> > if the numbers are the same.
> 
> I thought about this some more, and I realized it would be ugly if
> everybody added enable_workaround_sec_start/enable_workaround_sec_end
> for every novel idea of working around some issue with
> performance/reliability on mmc/sd cards.
> 
> What about letting the user/embedder create policies for how certain
> accesses are done? That way you give runtime-accessible
> blocks for tuning mmc block layer while having one interface to
> manipulate (and combine) multiple workarounds, all the while catching
> conflicts and
> without forcing specific policy in code.
> 
> Essentially under /sys/block/mmcblk0/device you have an attribute
> called "policies". Example:
> 
> # echo mypol0 > /sys/block/mmcblk0/device/policies
> # ls /sys/block/mmcblk0/device/mypol0
> debug
> delete
> start_block
> end_block
> access_size_low
> access_size_high
> write_policy
> erase_policy
> read_policy
> # cat /sys/block/mmcblk0/device/mypol0/write_policy
> Current: none
> 0x00000001: Split unaligned writes across page_size
> 0x00000002: Split writes into page_size chunks and write using reliable writes
> 0x00000004: Use reliable writes for WRITE_META blocks.
> # cat /sys/block/mmcblk0/device/mypol0/erase_policy
> Current: none
> 0x00000001: Use secure erase.
> # echo 1 > delete
> # Policy is deleted.
> 
> The policies are all stored in a rb-tree. First order of business
> inside mmc_blk_issue_rw_rq/mmc_blk_issue_* is to fetch an existing
> policy given the access type and block start/end (which both tells
> where the access is going and the size of the access). Later, it's
> that policy information which controls how the request is translated
> into MMC commands. I'm almost done with a prototype.

I think it's good to discuss all the options, but my feeling is that
we should not add so much complexity at the interface level, because
we will never be able to change all that again. In general, sysfs
files should contain simple values that are self-descriptive (a simple
number or one word), and should have no side-effects (unlike the delete
or the policies attributes you describe).

The behavior of the Toshiba chip is peculiar enough to justify having
some workarounds for it, including run-time selected ones, but I'm
looking for something much simpler. I'd certainly be interested in
the patch you come up with and any performance results, but I don't
think it can be merged like that.

In the end, Chris will have to make the decision on mmc patches of
course -- I'm just trying to contribute experience from other subsystems.

What I see as a more promising approach is to add the tunables
to attributes of the CFQ I/O scheduler once we know what we want.
This will allow doing the same optimizations to non-MMC devices such
as USB sticks or CF/IDE cards without reimplementing it in other
subsystems, and give more control over the individual requests than
the MMC layer has.

E.g. the I/O scheduler can also make sure that we always submit all
blocks from the start of one erase unit (e.g. 4 MB) to the end, but
not try to merge requests across erase unit boundaries. It can
also try to group the requests in aligned power-of-two sized chunks
rather than merging as many sectors as possible up to the maximum
request size, ignoring the alignment.

	Arnd

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-20 14:39     ` MMC quirks relating to performance/lifetime Arnd Bergmann
@ 2011-02-22  7:46       ` Andrei Warkentin
  2011-02-22 17:00         ` Arnd Bergmann
  0 siblings, 1 reply; 17+ messages in thread
From: Andrei Warkentin @ 2011-02-22  7:46 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Sun, Feb 20, 2011 at 8:39 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> [adding linux-fsdevel to Cc, see http://lwn.net/Articles/428941/ and
> http://comments.gmane.org/gmane.linux.ports.arm.kernel/105607 for more
> on this discussion.]
>
>
> I think it's good to discuss all the options, but my feeling is that
> we should not add so much complexity at the interface level, because
> we will never be able to change all that again. In general, sysfs
> files should contain simple values that are self-descriptive (a simple
> number or one word), and should have no side-effects (unlike the delete
> or the policies attributes you describe).
>
> The behavior of the Toshiba chip is peculiar enough to justify having
> some workarounds for it, including run-time selected ones, but I'm
> looking for something much simpler. I'd certainly be interested in
> the patch you come up with and any performance results, but I don't
> think it can be merged like that.
>

Sure. The page_align patch is just going to be a single sysfs
attribute. All I need to prove to myself now is the effect for large
unaligned accesses (and show everyone else the data :-)).

> In the end, Chris will have to make the decision on mmc patches of
> course -- I'm just trying to contribute experience from other subsystems.
>
> What I see as a more promising approach is to add the tunables
> to attributes of the CFQ I/O scheduler once we know what we want.
> This will allow doing the same optimizations to non-MMC devices such
> as USB sticks or CF/IDE cards without reimplementing it in other
> subsystems, and give more control over the individual requests than
> the MMC layer has.
>
> E.g. the I/O scheduler can also make sure that we always submit all
> blocks from the start of one erase unit (e.g. 4 MB) to the end, but
> not try to merge requests across erase unit boundaries. It can
> also try to group the requests in aligned power-of-two sized chunks
> rather than merging as many sectors as possible up to the maximum
> request size, ignoring the alignment.

I agree. These are common things that affect any kind of flash
storage, and it belongs in the I/O scheduler as simple tuneables. I'll
see if I can figure my way around that...

What belongs in mmc card driver are tunable workarounds for MMC/SD
brokeness. For example - needing to use 8K-spitted reliable writes to
ensure that a 64KB access doesn't wind up in the 4MB buffer B (as to
improve lifespan of the card.) But you want a waterline above which
you don't do this anymore, otherwise the overall performance will go
to 0 - i.e. there is a need to balance between performance and
reliability, so the range of access size for which the workaround
works needs to be runtime controlled, as it's potentially different.
Another example (this one is apparently affecting Sandisk) - do
special stuff for block erase, since the card violates spec in that
regard (touch ext_csd instead of argument, I believe). A different
example might be turning on reliable writes for WRITE_META (or all)
blocks for a certain partition (but I just made that up... ).

So there are things that just should be on (spec brokeness
workarounds), and things that apply only to a subset of accesses (and
thus they are selective at issue_*_rq time), whether it's because of
accessed offset or access size.

I agree that the sysfs method is particularly nasty, and I guess I
didn't have to make a prototype to figure that out :-) (but needed
something similar for selective testing anyway). Nothing else exists
right now that acts in the same way, and nothing really should, as
there is no feedback for manipulating the policies (echo POLICY_ENUM >
policy, if it doesn't stick, then the arguments were wrong, etc).

You could put the entire MMC block policy interface through an API
usable by system integrators - i.e. you would really only care for
tuning the MMC parameters if you're creating a device around an emmc.

Idea (1). One idea is to keep the "policies" from my previous mail.
Policies are registered through platform-specific code. The policies
could be then matched for enabling against a specific block device by
manfid/date/etc at the time of mmc_block_alloc... For removable media
no one would fiddle with the tunable parameters anyway, unless there
was some global database of cards and workarounds and a daemon or some
such to take care of that... Probably don't want to add such baggage
to the kernel.

Idea (2). There is probably no need to overcomplicate. Just add a
platform callback (something like int
(*mmc_platform_block_workaround)(struct request *, struct
mmc_blk_request *)). This will be usable as-is for R/W accesses, and
the discard code will need to be slightly modified.

Do you think there is any need for runtime tuning of the MMC
workarounds (disregarding ones that really belong in the I/O
scheduler)? Should the workarounds be simply platform callbacks, or
should they be something heftier ("policies")?

A

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-22  7:46       ` Andrei Warkentin
@ 2011-02-22 17:00         ` Arnd Bergmann
  2011-02-23 10:19           ` Andrei Warkentin
  0 siblings, 1 reply; 17+ messages in thread
From: Arnd Bergmann @ 2011-02-22 17:00 UTC (permalink / raw)
  To: Andrei Warkentin
  Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Tuesday 22 February 2011, Andrei Warkentin wrote:
> On Sun, Feb 20, 2011 at 8:39 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > E.g. the I/O scheduler can also make sure that we always submit all
> > blocks from the start of one erase unit (e.g. 4 MB) to the end, but
> > not try to merge requests across erase unit boundaries. It can
> > also try to group the requests in aligned power-of-two sized chunks
> > rather than merging as many sectors as possible up to the maximum
> > request size, ignoring the alignment.
> 
> I agree. These are common things that affect any kind of flash
> storage, and it belongs in the I/O scheduler as simple tuneables. I'll
> see if I can figure my way around that...
> 
> What belongs in mmc card driver are tunable workarounds for MMC/SD
> brokeness. For example - needing to use 8K-spitted reliable writes to
> ensure that a 64KB access doesn't wind up in the 4MB buffer B (as to
> improve lifespan of the card.) But you want a waterline above which
> you don't do this anymore, otherwise the overall performance will go
> to 0 - i.e. there is a need to balance between performance and
> reliability, so the range of access size for which the workaround
> works needs to be runtime controlled, as it's potentially different.
> Another example (this one is apparently affecting Sandisk) - do
> special stuff for block erase, since the card violates spec in that
> regard (touch ext_csd instead of argument, I believe). A different
> example might be turning on reliable writes for WRITE_META (or all)
> blocks for a certain partition (but I just made that up... ).

Yes, makes sense.

> You could put the entire MMC block policy interface through an API
> usable by system integrators - i.e. you would really only care for
> tuning the MMC parameters if you're creating a device around an emmc.
> 
> Idea (1). One idea is to keep the "policies" from my previous mail.
> Policies are registered through platform-specific code. The policies
> could be then matched for enabling against a specific block device by
> manfid/date/etc at the time of mmc_block_alloc... For removable media
> no one would fiddle with the tunable parameters anyway, unless there
> was some global database of cards and workarounds and a daemon or some
> such to take care of that... Probably don't want to add such baggage
> to the kernel.
> 
> Idea (2). There is probably no need to overcomplicate. Just add a
> platform callback (something like int
> (*mmc_platform_block_workaround)(struct request *, struct
> mmc_blk_request *)). This will be usable as-is for R/W accesses, and
> the discard code will need to be slightly modified.
> 
> Do you think there is any need for runtime tuning of the MMC
> workarounds (disregarding ones that really belong in the I/O
> scheduler)? Should the workarounds be simply platform callbacks, or
> should they be something heftier ("policies")?

The platform hook seems the wrong place, because you might use
the same chip in multiple platforms, and a single platform might
have a large number of different boards, all of which require
separate workarounds.

A per-card quirk table does not seem so bad, we have that in
other subsystems as well. I wouldn't necessarily make it
a list of possible quirks, but rather a __devinit function that
is called for a new card on insertion, in order to tweak various
parameters.

	Arnd

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-22 17:00         ` Arnd Bergmann
@ 2011-02-23 10:19           ` Andrei Warkentin
  2011-02-23 16:09             ` Arnd Bergmann
  0 siblings, 1 reply; 17+ messages in thread
From: Andrei Warkentin @ 2011-02-23 10:19 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Tue, Feb 22, 2011 at 11:00 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> Do you think there is any need for runtime tuning of the MMC
>> workarounds (disregarding ones that really belong in the I/O
>> scheduler)? Should the workarounds be simply platform callbacks, or
>> should they be something heftier ("policies")?
>
> The platform hook seems the wrong place, because you might use
> the same chip in multiple platforms, and a single platform might
> have a large number of different boards, all of which require
> separate workarounds.
>

That's a good point. At best it would result in massive copy-paste/

> A per-card quirk table does not seem so bad, we have that in
> other subsystems as well. I wouldn't necessarily make it
> a list of possible quirks, but rather a __devinit function that
> is called for a new card on insertion, in order to tweak various
> parameters.
>

That sounds good! In fact, for any quirks enabled for a particular
card, I'll expose the tuneables through sysfs attributes, something
like /sys/block/mmcblk0/device/quirks/quirk-name/attr-names.

Quirks will have block intervals and access size intervals over which
they are valid, along with any other quirk-specific parameter.
Interval overlap will not be allowed for quirks in the same operation
type (r/w/e). The goal here is to make the changes to issue_*_rq as
small as possible, and not to pollute block.c at all with the quirks
stuff. Quirks are looked up inside issue_*_rq based on req type and
[start,end) interval. The resulting found quirks structure will
contain a callback used inside issue_*_rq to modify mmc block request
structures prior to generating actual MMC commands.

Quirks consist of a callback called inside of mmc issue_*_rq,
configurable attributes, and the sysfs interface. Quirk groups are
defined per-card. At card insertion time, a matching quirk group is
found, and is enabled. The quirk group enable function then enables
the relevant quirks with the right parameters (adds them to per
mmc_blk_data quirk interval tree). Some sane defaults for the tunables
are used. If the tunables are modified through sysfs, care is taken
that an interval overlap never happens, otherwise the tunable is not
modified and a kernel error message is logged.

I hope I explained the tentative idea clearly... Thoughts?

A

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-23 10:19           ` Andrei Warkentin
@ 2011-02-23 16:09             ` Arnd Bergmann
  2011-02-23 22:26               ` Andrei Warkentin
  0 siblings, 1 reply; 17+ messages in thread
From: Arnd Bergmann @ 2011-02-23 16:09 UTC (permalink / raw)
  To: Andrei Warkentin
  Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Wednesday 23 February 2011, Andrei Warkentin wrote:
> That sounds good! In fact, for any quirks enabled for a particular
> card, I'll expose the tuneables through sysfs attributes, something
> like /sys/block/mmcblk0/device/quirks/quirk-name/attr-names.
> 
> Quirks will have block intervals and access size intervals over which
> they are valid, along with any other quirk-specific parameter.
> Interval overlap will not be allowed for quirks in the same operation
> type (r/w/e). The goal here is to make the changes to issue_*_rq as
> small as possible, and not to pollute block.c at all with the quirks
> stuff. Quirks are looked up inside issue_*_rq based on req type and
> [start,end) interval. The resulting found quirks structure will
> contain a callback used inside issue_*_rq to modify mmc block request
> structures prior to generating actual MMC commands.
>
> Quirks consist of a callback called inside of mmc issue_*_rq,
> configurable attributes, and the sysfs interface. Quirk groups are
> defined per-card. At card insertion time, a matching quirk group is
> found, and is enabled. The quirk group enable function then enables
> the relevant quirks with the right parameters (adds them to per
> mmc_blk_data quirk interval tree). Some sane defaults for the tunables
> are used. If the tunables are modified through sysfs, care is taken
> that an interval overlap never happens, otherwise the tunable is not
> modified and a kernel error message is logged.
> 
> I hope I explained the tentative idea clearly... Thoughts?

I would hope that the quirks can be simpler than this still, without
the need to call any function pointers while using the device, or
quirk specific sysfs directories.

What I meant is to have a single function pointer that can get
called when detecting a specific known card. All this function
does is to set values and flags that we can export either through
common attributes of block devices (e.g. preferred erase size),
or attributes specific to mmc devices (e.g. the toshiba hack, as
a bool attribute).

An obvious attribute would be the minimum size of an atomic
page update. By default this could be 32KB, because any device
should support that (FAT32 cannot have larger clusters). A
card specific quirk can set it to another value, like 8KB, 16KB
or 64KB, and file systems or other tools like mkfs can optimize
for this value.

I would like the flags like "don't submit requests spanning
this boundary" and "make all writes below this size" to be defined
in terms of the regular sizes we already know about, like the
page size or the erase size.

	Arnd

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-23 16:09             ` Arnd Bergmann
@ 2011-02-23 22:26               ` Andrei Warkentin
  2011-02-24  9:24                 ` Arnd Bergmann
  0 siblings, 1 reply; 17+ messages in thread
From: Andrei Warkentin @ 2011-02-23 22:26 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Wed, Feb 23, 2011 at 10:09 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Wednesday 23 February 2011, Andrei Warkentin wrote:
>> That sounds good! In fact, for any quirks enabled for a particular
>> card, I'll expose the tuneables through sysfs attributes, something
>> like /sys/block/mmcblk0/device/quirks/quirk-name/attr-names.
>>
>> Quirks will have block intervals and access size intervals over which
>> they are valid, along with any other quirk-specific parameter.
>> Interval overlap will not be allowed for quirks in the same operation
>> type (r/w/e). The goal here is to make the changes to issue_*_rq as
>> small as possible, and not to pollute block.c at all with the quirks
>> stuff. Quirks are looked up inside issue_*_rq based on req type and
>> [start,end) interval. The resulting found quirks structure will
>> contain a callback used inside issue_*_rq to modify mmc block request
>> structures prior to generating actual MMC commands.
>>
>> Quirks consist of a callback called inside of mmc issue_*_rq,
>> configurable attributes, and the sysfs interface. Quirk groups are
>> defined per-card. At card insertion time, a matching quirk group is
>> found, and is enabled. The quirk group enable function then enables
>> the relevant quirks with the right parameters (adds them to per
>> mmc_blk_data quirk interval tree). Some sane defaults for the tunables
>> are used. If the tunables are modified through sysfs, care is taken
>> that an interval overlap never happens, otherwise the tunable is not
>> modified and a kernel error message is logged.
>>
>> I hope I explained the tentative idea clearly... Thoughts?
>

> I would hope that the quirks can be simpler than this still, without
> the need to call any function pointers while using the device, or
> quirk specific sysfs directories.
>

I'll skip the sysfs part from the first RFC patch. I think this
complicates what I'm trying to achieve and makes this whole thing look
bigger than it is.

> What I meant is to have a single function pointer that can get
> called when detecting a specific known card. All this function
> does is to set values and flags that we can export either through
> common attributes of block devices (e.g. preferred erase size),
> or attributes specific to mmc devices (e.g. the toshiba hack, as
> a bool attribute).
>
> An obvious attribute would be the minimum size of an atomic
> page update. By default this could be 32KB, because any device
> should support that (FAT32 cannot have larger clusters). A
> card specific quirk can set it to another value, like 8KB, 16KB
> or 64KB, and file systems or other tools like mkfs can optimize
> for this value.
>
> I would like the flags like "don't submit requests spanning
> this boundary" and "make all writes below this size" to be defined
> in terms of the regular sizes we already know about, like the
> page size or the erase size.
>

I agree with you on the size/align issues. These are very generic
attributes and don't need a complicated framework like I described to
be dealt with. Ultimately they are just hints to the I/O scheduler, so
they should be part of the block device.

I am more concerned with workarounds that depend on access size (like
the toshiba one) and that modify the MMC commands sent (using reliable
writes, like the Toshiba one, or putting parameters differently like
the Sandisk erase workaround). It's these kinds of workarounds that
the quirks framework is meant to address. I don't think it's a good
idea to pollute mmc_blk_issue_rw_rq and mmc_blk_issue_discard_rq with
if()-elsed workarounds, because it's going to quickly complicate the
logic, and get out of hand and unmanageable the more cards are added.
I'm trying to avoid having to make any changes to card/block.c as part
of making quirk workarounds. The only cost when compared to an if-else
will be one O(log n) quirk lookup, where n is either one or something
close that (since the search is only done for quirks per
mmc_blk_data), and one callback invoked after "brq.data.sg_len =
mmc_queue_map_sg(mq);" so it can patch up mrq as necessary.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-23 22:26               ` Andrei Warkentin
@ 2011-02-24  9:24                 ` Arnd Bergmann
  2011-02-25 11:02                   ` Andrei Warkentin
  0 siblings, 1 reply; 17+ messages in thread
From: Arnd Bergmann @ 2011-02-24  9:24 UTC (permalink / raw)
  To: Andrei Warkentin
  Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Wednesday 23 February 2011, Andrei Warkentin wrote:

> I am more concerned with workarounds that depend on access size (like
> the toshiba one) and that modify the MMC commands sent (using reliable
> writes, like the Toshiba one, or putting parameters differently like
> the Sandisk erase workaround). It's these kinds of workarounds that
> the quirks framework is meant to address. I don't think it's a good
> idea to pollute mmc_blk_issue_rw_rq and mmc_blk_issue_discard_rq with
> if()-elsed workarounds, because it's going to quickly complicate the
> logic, and get out of hand and unmanageable the more cards are added.
> I'm trying to avoid having to make any changes to card/block.c as part
> of making quirk workarounds. The only cost when compared to an if-else
> will be one O(log n) quirk lookup, where n is either one or something
> close that (since the search is only done for quirks per
> mmc_blk_data), and one callback invoked after "brq.data.sg_len =
> mmc_queue_map_sg(mq);" so it can patch up mrq as necessary.

Unlike the sysfs interface, the code does not need to be future-proof,
it can always be changed if we feel the code becomes more maintainable
by doing it another way.

The approach that I'd like to see here is:

* Start out with an ad-hoc patch for a quirk (like the one you already
  have).
* Add a boolean variable to enable it per card.
* Get performance data for this quirk to show that it's useful in
  real-world workloads for some cards but counterproductive for others
* Get the patch into the mmc tree.
* Repeat for the next quirk
* When the code becomes overly complicated after adding all the quirks,
  decide on a good strategy to move the code around, and do a new patch.

I understand that you are convinced that you will need the indirect function
calls in the end. That is fine, just don't add them before they are
actually needed -- that would only make it harder for you to get the first
patch included.

Note that the situation is very different for user interfaces such as sysfs:
You need to plan ahead because once the interface is merged upstream, it
can never be changed. When you submit a patch that introduces a new sysfs
interface, it has to be documented, and you have to convince the reviewers
that it is sufficient to cover all the cases it is designed for, while
at the same time it is the most simple way to achieve this.

	Arnd

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-24  9:24                 ` Arnd Bergmann
@ 2011-02-25 11:02                   ` Andrei Warkentin
  2011-02-25 12:21                     ` Arnd Bergmann
  0 siblings, 1 reply; 17+ messages in thread
From: Andrei Warkentin @ 2011-02-25 11:02 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Thu, Feb 24, 2011 at 3:24 AM, Arnd Bergmann <arnd@arndb.de> wrote:

> Unlike the sysfs interface, the code does not need to be future-proof,
> it can always be changed if we feel the code becomes more maintainable
> by doing it another way.
>
> The approach that I'd like to see here is:
>
> * Start out with an ad-hoc patch for a quirk (like the one you already
>  have).
> * Add a boolean variable to enable it per card.
> * Get performance data for this quirk to show that it's useful in
>  real-world workloads for some cards but counterproductive for others
> * Get the patch into the mmc tree.
> * Repeat for the next quirk
> * When the code becomes overly complicated after adding all the quirks,
>  decide on a good strategy to move the code around, and do a new patch.
>

Yup. I understand :-).  That's the strategy I'm going to follow. For
page_size-alignment/splitting I'm looking at the block layer now. Is
that the right approach or should I still submit a (cleaned up) patch
to mmc/card/block.c for that performance improvement? The other
(Toshiba quirk) is obviously a quirk belonging to mmc/card/block.c.

> I understand that you are convinced that you will need the indirect function
> calls in the end. That is fine, just don't add them before they are
> actually needed -- that would only make it harder for you to get the first
> patch included.
>
> Note that the situation is very different for user interfaces such as sysfs:
> You need to plan ahead because once the interface is merged upstream, it
> can never be changed. When you submit a patch that introduces a new sysfs
> interface, it has to be documented, and you have to convince the reviewers
> that it is sufficient to cover all the cases it is designed for, while
> at the same time it is the most simple way to achieve this.


Ok, thanks a lot for the explanation, I hadn't thought of it that way
(and should have).

A
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-25 11:02                   ` Andrei Warkentin
@ 2011-02-25 12:21                     ` Arnd Bergmann
  2011-03-01 18:48                       ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Arnd Bergmann @ 2011-02-25 12:21 UTC (permalink / raw)
  To: Andrei Warkentin, Jens Axboe
  Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc

On Friday 25 February 2011, Andrei Warkentin wrote:
> Yup. I understand :-).  That's the strategy I'm going to follow. For
> page_size-alignment/splitting I'm looking at the block layer now. Is
> that the right approach or should I still submit a (cleaned up) patch
> to mmc/card/block.c for that performance improvement.

I guess it should live in block/cfq-iosched in the long run, but I don't
know how easy it is to implement it there for test purposes.

It may be easier to prototype it in the mmc code, since you are more
familiar with that already, post that patch together with benchmark
results and then do a new patch for the final solution. We'll need
more benchmarking to figure out if that should be applied for
all nonrotational storage, or if there are cases where it actually
hurts performance to split requests on page boundaries.

If it turns out to be a good idea in general, we won't even need a
sysfs interface for enabling it, just one for reading/writing the
underlying page size.

> The other (Toshiba quirk) is obviously a quirk belonging to mmc/card/block.c.

Makes sense.

	Arnd

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-02-25 12:21                     ` Arnd Bergmann
@ 2011-03-01 18:48                       ` Jens Axboe
  2011-03-01 19:11                         ` Arnd Bergmann
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2011-03-01 18:48 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij,
	linux-mmc

On 2011-02-25 07:21, Arnd Bergmann wrote:
> On Friday 25 February 2011, Andrei Warkentin wrote:
>> Yup. I understand :-).  That's the strategy I'm going to follow. For
>> page_size-alignment/splitting I'm looking at the block layer now. Is
>> that the right approach or should I still submit a (cleaned up) patch
>> to mmc/card/block.c for that performance improvement.
> 
> I guess it should live in block/cfq-iosched in the long run, but I don't
> know how easy it is to implement it there for test purposes.

I don't think I saw the original patch(es) for this?


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-03-01 18:48                       ` Jens Axboe
@ 2011-03-01 19:11                         ` Arnd Bergmann
  2011-03-01 19:15                           ` Jens Axboe
  2011-03-02 10:34                           ` Andrei Warkentin
  0 siblings, 2 replies; 17+ messages in thread
From: Arnd Bergmann @ 2011-03-01 19:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij,
	linux-mmc

On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote:
> 
> On 2011-02-25 07:21, Arnd Bergmann wrote:
> > On Friday 25 February 2011, Andrei Warkentin wrote:
> >> Yup. I understand :-).  That's the strategy I'm going to follow. For
> >> page_size-alignment/splitting I'm looking at the block layer now. Is
> >> that the right approach or should I still submit a (cleaned up) patch
> >> to mmc/card/block.c for that performance improvement.
> > 
> > I guess it should live in block/cfq-iosched in the long run, but I don't
> > know how easy it is to implement it there for test purposes.
> 
> I don't think I saw the original patch(es) for this?

Nobody has posted one yet, only discussions. Andrei made a patch for the
MMC block driver to split requests in some cases, but I think the
concept has changed enough that it's probably not useful to look at
that patch.

I think what needs to be done here is to split requests in these cases:

* Small requests should be split on flash page boundaries, where a page
is typically 8 to 32 KB. Sending one hardware request that spans two
partial pages can be slower than sending two requests with the same
data, but on page boundaries.

* If a hardware transfer is limited to a few sectors, these should be
aligned to page boundaries. E.g. assuming a 16 sector page and 32 sector
maximum transfers, a request that spans from sector 7 to 62 should be
split into three transfers: 7-15, 16-47 and 48-62, not 7-38 and 39-62.
This reduces the number of page read-modify-write cycles that the drive
does.

* No request should ever span multiple erase blocks. Most flash drives today
have 4MB erase blocks (sometimes 1, 2 or 8), and the I/O scheduler should
treat the erase block boundary like a seek on a hard drive. The I/O
scheduler should try to send all sector writes of an erase block in sequence,
but after that it can chose any other erase block to write to next.

I think if we get this logic, we can deal well with all cheap flash drives.
The two parameters we need are the page size and the erase block size,
which the kernel can sometimes guess, but should also be tunable in
sysfs for devices that don't tell us or lie to the kernel about them.

I'm not sure if we want to do this for all nonrotational media, or
add another flag to enable these optimizations. On proper SSDs that have
an intelligent controller and enough RAM, they probably would not help
all that much, or even make it slightly slower due to a higher number
of separate write requests.

	Arnd

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-03-01 19:11                         ` Arnd Bergmann
@ 2011-03-01 19:15                           ` Jens Axboe
  2011-03-01 19:51                             ` Arnd Bergmann
  2011-03-02 10:34                           ` Andrei Warkentin
  1 sibling, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2011-03-01 19:15 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij,
	linux-mmc

On 2011-03-01 14:11, Arnd Bergmann wrote:
> On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote:
>>
>> On 2011-02-25 07:21, Arnd Bergmann wrote:
>>> On Friday 25 February 2011, Andrei Warkentin wrote:
>>>> Yup. I understand :-).  That's the strategy I'm going to follow. For
>>>> page_size-alignment/splitting I'm looking at the block layer now. Is
>>>> that the right approach or should I still submit a (cleaned up) patch
>>>> to mmc/card/block.c for that performance improvement.
>>>
>>> I guess it should live in block/cfq-iosched in the long run, but I don't
>>> know how easy it is to implement it there for test purposes.
>>
>> I don't think I saw the original patch(es) for this?
> 
> Nobody has posted one yet, only discussions. Andrei made a patch for the
> MMC block driver to split requests in some cases, but I think the
> concept has changed enough that it's probably not useful to look at
> that patch.
> 
> I think what needs to be done here is to split requests in these cases:
> 
> * Small requests should be split on flash page boundaries, where a page
> is typically 8 to 32 KB. Sending one hardware request that spans two
> partial pages can be slower than sending two requests with the same
> data, but on page boundaries.
> 
> * If a hardware transfer is limited to a few sectors, these should be
> aligned to page boundaries. E.g. assuming a 16 sector page and 32 sector
> maximum transfers, a request that spans from sector 7 to 62 should be
> split into three transfers: 7-15, 16-47 and 48-62, not 7-38 and 39-62.
> This reduces the number of page read-modify-write cycles that the drive
> does.
> 
> * No request should ever span multiple erase blocks. Most flash drives today
> have 4MB erase blocks (sometimes 1, 2 or 8), and the I/O scheduler should
> treat the erase block boundary like a seek on a hard drive. The I/O
> scheduler should try to send all sector writes of an erase block in sequence,
> but after that it can chose any other erase block to write to next.
> 
> I think if we get this logic, we can deal well with all cheap flash drives.
> The two parameters we need are the page size and the erase block size,
> which the kernel can sometimes guess, but should also be tunable in
> sysfs for devices that don't tell us or lie to the kernel about them.
> 
> I'm not sure if we want to do this for all nonrotational media, or
> add another flag to enable these optimizations. On proper SSDs that have
> an intelligent controller and enough RAM, they probably would not help
> all that much, or even make it slightly slower due to a higher number
> of separate write requests.

Thanks for the recap. One way to handle this would be to have a dm
target that ensures that requests are never built up to violate any of
the above items. Doing splitting is a little silly, when you can prevent
it from happening in the first place.

Alternatively, a queue ->merge_bvec_fn() with a settings table could
provide the same.

As this is of limited scope, I would prefer having this done via a
plugin of some sort (like a dm target).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-03-01 19:15                           ` Jens Axboe
@ 2011-03-01 19:51                             ` Arnd Bergmann
  2011-03-01 21:33                               ` Andrei Warkentin
  0 siblings, 1 reply; 17+ messages in thread
From: Arnd Bergmann @ 2011-03-01 19:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij,
	linux-mmc

On Tuesday 01 March 2011 20:15:30 Jens Axboe wrote:
> Thanks for the recap. One way to handle this would be to have a dm
> target that ensures that requests are never built up to violate any of
> the above items. Doing splitting is a little silly, when you can prevent
> it from happening in the first place.

Ok, that sounds good. I didn't know that it's possible to prevent
bios from getting created that violate this.

I'm actually trying to do a device mapper target that does much more than
this, see
https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashDeviceMapper
for an early draft. The design has moved on since I wrote that, but
the basic idea is still the same: all blocks get written in a way that
fills up entire 4MB segments before moving to another segment,
independent of what the logical block numbers are, and a little space
is used to store a lookup table for the logical-to-physical block mapping.

> Alternatively, a queue ->merge_bvec_fn() with a settings table could
> provide the same.

That's probably better for the common case. The device mapper target
would be useful for those that want the best case write performance,
but if I understand you correctly, the merge_bvec_fn() could be used
per block driver, so we could simply add that to the SCSI (for USB and
consumer SSD) case and MMC block drivers.

The point that this does not solve is submitting all outstanding writes
for an erase block together, which is needed to reduce the garbage
collection overhead. When you do a partial update of an erase block
(4MB typically) and then start writing to another erase block, the
drive will have to copy all data you did not write in order to free
up internal resources.

> As this is of limited scope, I would prefer having this done via a
> plugin of some sort (like a dm target).

I'm not sure what you mean with limited scope. This is certainly not
as important for the classic server environment (aside from USB boot
drives), but I assume that it is highly relevant for the a large
portion of new embedded designs as people move from raw flash to
eMMC and similar "technologies".

	Arnd

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-03-01 19:51                             ` Arnd Bergmann
@ 2011-03-01 21:33                               ` Andrei Warkentin
  0 siblings, 0 replies; 17+ messages in thread
From: Andrei Warkentin @ 2011-03-01 21:33 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jens Axboe, linux-arm-kernel, linux-fsdevel, Linus Walleij,
	linux-mmc

On Tue, Mar 1, 2011 at 1:51 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 01 March 2011 20:15:30 Jens Axboe wrote:
>> Thanks for the recap. One way to handle this would be to have a dm
>> target that ensures that requests are never built up to violate any of
>> the above items. Doing splitting is a little silly, when you can prevent
>> it from happening in the first place.
>
> Ok, that sounds good. I didn't know that it's possible to prevent
> bios from getting created that violate this.
>

Wouldn't someone still be able to perform a generic_make_request that
would violate the conditions (i.e. cross alignment boundary while
performing unaligned write)? You could prevent the merges that would
result in violating the conditions, sure, but you would need to handle
single unaligned accesses correctly too... Sorry, I'm just groping my
way around the block layer...a lot I'm still trying to draw a mental
picture for.

P.S. I've submitted for review the first 3 patches. Tear into them :).

A

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-03-01 19:11                         ` Arnd Bergmann
  2011-03-01 19:15                           ` Jens Axboe
@ 2011-03-02 10:34                           ` Andrei Warkentin
  2011-03-05  9:23                             ` Andrei Warkentin
  1 sibling, 1 reply; 17+ messages in thread
From: Andrei Warkentin @ 2011-03-02 10:34 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jens Axboe, linux-arm-kernel, linux-fsdevel, Linus Walleij,
	linux-mmc

On Tue, Mar 1, 2011 at 1:11 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote:
>>
>> On 2011-02-25 07:21, Arnd Bergmann wrote:
>> > On Friday 25 February 2011, Andrei Warkentin wrote:
>> >> Yup. I understand :-).  That's the strategy I'm going to follow. For
>> >> page_size-alignment/splitting I'm looking at the block layer now. Is
>> >> that the right approach or should I still submit a (cleaned up) patch
>> >> to mmc/card/block.c for that performance improvement.
>> >
>> > I guess it should live in block/cfq-iosched in the long run, but I don't
>> > know how easy it is to implement it there for test purposes.
>>
>> I don't think I saw the original patch(es) for this?
>
> Nobody has posted one yet, only discussions. Andrei made a patch for the
> MMC block driver to split requests in some cases, but I think the
> concept has changed enough that it's probably not useful to look at
> that patch.
>

Before the generic improvements are made to the block layer, I think
there is some value
in implementing the (simpler) ones in mmc block code, as well as
expose an mmc block quirk interface by which its easy to add complex
workarounds. Some things will never be able to completely stay above
mmc block code, for example, when splitting up smaller accesses, you
need to be careful on the Toshiba card, since the 4th consecutive 8KB
block results in the entire 32KB getting pushed  into the bigger 4MB
buffer. On our platform, there are a lot of accesses in the 16KB-32KB
range which benefit from the splitting. Data collected showed
splitting more than 32KB to have adverse effect on performance (I
guess that sort of makes sense, after all, why else would the
controller treat 4 consecutive 8KB accesses as a larger access and
treat it accordingly?) On the other hand, that data was collected on
code that used reliable write for every portion of the split access,
so I'm going to have to get some new data...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: MMC quirks relating to performance/lifetime.
  2011-03-02 10:34                           ` Andrei Warkentin
@ 2011-03-05  9:23                             ` Andrei Warkentin
  0 siblings, 0 replies; 17+ messages in thread
From: Andrei Warkentin @ 2011-03-05  9:23 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jens Axboe, linux-arm-kernel, linux-fsdevel, Linus Walleij,
	linux-mmc

On Wed, Mar 2, 2011 at 4:34 AM, Andrei Warkentin <andreiw@motorola.com> wrote:
> Before the generic improvements are made to the block layer, I think
> there is some value
> in implementing the (simpler) ones in mmc block code, as well as
> expose an mmc block quirk interface by which its easy to add complex
> workarounds. Some things will never be able to completely stay above
> mmc block code, for example, when splitting up smaller accesses, you
> need to be careful on the Toshiba card, since the 4th consecutive 8KB
> block results in the entire 32KB getting pushed  into the bigger 4MB
> buffer. On our platform, there are a lot of accesses in the 16KB-32KB
> range which benefit from the splitting. Data collected showed
> splitting more than 32KB to have adverse effect on performance (I
> guess that sort of makes sense, after all, why else would the
> controller treat 4 consecutive 8KB accesses as a larger access and
> treat it accordingly?) On the other hand, that data was collected on
> code that used reliable write for every portion of the split access,
> so I'm going to have to get some new data...
>

Just want to correct myself - any consecutive write that exceeds 8K
goes into the 4MB buffer.
Also, according to vendor, there is no performance penalty for using
reliable write.
This is why in the patch set, for splitting larger requests (to
improve lifetime by reducing the number of AU write/erase cycles) I
perform a reliable write for each split block set.

^ permalink raw reply	[flat|nested] 17+ messages in thread

[parent not found: <201102111551.15508.arnd@arndb.de>]

[parent not found: <20110308065911.GC1357@ucw.cz>]

* Re: MMC quirks relating to performance/lifetime.
       [not found]   ` <20110308065911.GC1357@ucw.cz>
@ 2011-03-08 14:03     ` Arnd Bergmann
  0 siblings, 0 replies; 17+ messages in thread
From: Arnd Bergmann @ 2011-03-08 14:03 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-arm-kernel, Andrei Warkentin, linux-fsdevel, linux-mmc

On Tuesday 08 March 2011, Pavel Machek wrote:
> > > 
> > > How big is performance difference?
> > 
> > Several orders of magnitude. It is very easy to get a card that can write
> > 12 MB/s into a case where it writes no more than 30 KB/s, doing only
> > things that happen frequently with ext3.
> 
> Ungood.
> 
> I guess we should create something like loopback device, which knows
> about flash specifics, and does the right coalescing so that card
> stays in the fast mode?

I have listed a few suggestions for areas to work in my article
at https://lwn.net/Articles/428584/. My idea was to use a device mapper
target, as described in https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashDeviceMapper
but a loopback device might work as well.

The other area that I think will help a lot is to make the I/O
scheduler aware of the erase block size and the preferred access
patterns.

> ...or, do we need to create new, simple filesystem with layout similar
> to fat32, for use on mmc cards?

It doesn't need to be similar to fat32, but creating a new file system
could fix this, too. Microsoft seems to have built ExFAT around
cheap flash devices, though they don't document what that does exactly.
I think we can do better than that, and I still want to find out
how close nilfs2 and btrfs can actually get to the optimum.

Note that it's not just MMC cards though, you get the exact same
effects on some low-end SSDs (which are basically repackaged CF
cards) and most USB sticks. The best USB sticks I have seen
can hide some effects with a bit of caching, and they have a higher
number of open segments than the cheap ones, but the basic
problems are unchanged.

The requirements for a good low-end flash optimized file system
would be roughly:

1. Do all writes is chunks of 32 or 64 KB. If there is less
   data to write, fill the chunk with zeroes and clean up later,
   but don't write more data to the same chunk.
2. Start writing on a segment (e.g. 4 MB, configurable) boundary,
   then write that segment to the end using the chunks mentioned
   above.
3. Erase full segments using trim/erase/discard before writing
   to them, if supported by the drive.
4. Have a configurable number of segments open for writing, i.e.
   you have written blocks at the start of the segment but not
   filled the segment to the end. Typical hardware limitations
   are between 1 and 10 open segments.
5. Keep all metadata within a single 4 MB segment. Drives that cannot
   do random access within normal segments can do it in the area
   that holds the FAT. If 4 MB is not enough, the FAT area can be
   used as a journal or cache, for a larger metadata area that gets
   written less frequently.
6. Because of the requirement to erase 4 MB chunks at once, there
   needs to be garbage collection to free up space. The quality
   of the garbage collection algorithm directly relates to the
   performance on full file systems and/or the space overhead.
7. Some static wear levelling is required to increase the expected
   life of consumer devices that only do dynamic wear levelling,
   i.e. the segments that contain purely static data need to
   be written occasionally so they make it back into the
   wear leveling pool of the hardware.

	Arnd

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2011-03-08 14:03 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <AANLkTikh4vfS7SLKAa-aUXhbTxcHzYHmBuaXj1qHHYN9@mail.gmail.com>
     [not found] ` <201102171647.46190.arnd@arndb.de>
     [not found]   ` <AANLkTimGrT_p_5xdXJv5UURPMC0cwJCPkZ=xcX5Nk=o=@mail.gmail.com>
2011-02-20 14:39     ` MMC quirks relating to performance/lifetime Arnd Bergmann
2011-02-22  7:46       ` Andrei Warkentin
2011-02-22 17:00         ` Arnd Bergmann
2011-02-23 10:19           ` Andrei Warkentin
2011-02-23 16:09             ` Arnd Bergmann
2011-02-23 22:26               ` Andrei Warkentin
2011-02-24  9:24                 ` Arnd Bergmann
2011-02-25 11:02                   ` Andrei Warkentin
2011-02-25 12:21                     ` Arnd Bergmann
2011-03-01 18:48                       ` Jens Axboe
2011-03-01 19:11                         ` Arnd Bergmann
2011-03-01 19:15                           ` Jens Axboe
2011-03-01 19:51                             ` Arnd Bergmann
2011-03-01 21:33                               ` Andrei Warkentin
2011-03-02 10:34                           ` Andrei Warkentin
2011-03-05  9:23                             ` Andrei Warkentin
     [not found] ` <201102111551.15508.arnd@arndb.de>
     [not found]   ` <20110308065911.GC1357@ucw.cz>
2011-03-08 14:03     ` Arnd Bergmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).