* Re: MMC quirks relating to performance/lifetime. [not found] ` <AANLkTimGrT_p_5xdXJv5UURPMC0cwJCPkZ=xcX5Nk=o=@mail.gmail.com> @ 2011-02-20 14:39 ` Arnd Bergmann 2011-02-22 7:46 ` Andrei Warkentin 0 siblings, 1 reply; 17+ messages in thread From: Arnd Bergmann @ 2011-02-20 14:39 UTC (permalink / raw) To: linux-arm-kernel, linux-fsdevel Cc: Andrei Warkentin, Linus Walleij, linux-mmc [adding linux-fsdevel to Cc, see http://lwn.net/Articles/428941/ and http://comments.gmane.org/gmane.linux.ports.arm.kernel/105607 for more on this discussion.] On Sunday 20 February 2011 12:27:39 Andrei Warkentin wrote: > On Thu, Feb 17, 2011 at 9:47 AM, Arnd Bergmann <arnd@arndb.de> wrote: > > I think I'd try to reduce the number of sysfs files needed for this. > > What are the values you would typically set here? > > > > My feeling is that separating unaligned page writes from full pages > > or multiples of pages could always be benefitial for all cards, or at > > least harmless, but that will require more measurements. > > Whether to do the reliable write or not could be a simple flag > > if the numbers are the same. > > I thought about this some more, and I realized it would be ugly if > everybody added enable_workaround_sec_start/enable_workaround_sec_end > for every novel idea of working around some issue with > performance/reliability on mmc/sd cards. > > What about letting the user/embedder create policies for how certain > accesses are done? That way you give runtime-accessible > blocks for tuning mmc block layer while having one interface to > manipulate (and combine) multiple workarounds, all the while catching > conflicts and > without forcing specific policy in code. > > Essentially under /sys/block/mmcblk0/device you have an attribute > called "policies". Example: > > # echo mypol0 > /sys/block/mmcblk0/device/policies > # ls /sys/block/mmcblk0/device/mypol0 > debug > delete > start_block > end_block > access_size_low > access_size_high > write_policy > erase_policy > read_policy > # cat /sys/block/mmcblk0/device/mypol0/write_policy > Current: none > 0x00000001: Split unaligned writes across page_size > 0x00000002: Split writes into page_size chunks and write using reliable writes > 0x00000004: Use reliable writes for WRITE_META blocks. > # cat /sys/block/mmcblk0/device/mypol0/erase_policy > Current: none > 0x00000001: Use secure erase. > # echo 1 > delete > # Policy is deleted. > > The policies are all stored in a rb-tree. First order of business > inside mmc_blk_issue_rw_rq/mmc_blk_issue_* is to fetch an existing > policy given the access type and block start/end (which both tells > where the access is going and the size of the access). Later, it's > that policy information which controls how the request is translated > into MMC commands. I'm almost done with a prototype. I think it's good to discuss all the options, but my feeling is that we should not add so much complexity at the interface level, because we will never be able to change all that again. In general, sysfs files should contain simple values that are self-descriptive (a simple number or one word), and should have no side-effects (unlike the delete or the policies attributes you describe). The behavior of the Toshiba chip is peculiar enough to justify having some workarounds for it, including run-time selected ones, but I'm looking for something much simpler. I'd certainly be interested in the patch you come up with and any performance results, but I don't think it can be merged like that. In the end, Chris will have to make the decision on mmc patches of course -- I'm just trying to contribute experience from other subsystems. What I see as a more promising approach is to add the tunables to attributes of the CFQ I/O scheduler once we know what we want. This will allow doing the same optimizations to non-MMC devices such as USB sticks or CF/IDE cards without reimplementing it in other subsystems, and give more control over the individual requests than the MMC layer has. E.g. the I/O scheduler can also make sure that we always submit all blocks from the start of one erase unit (e.g. 4 MB) to the end, but not try to merge requests across erase unit boundaries. It can also try to group the requests in aligned power-of-two sized chunks rather than merging as many sectors as possible up to the maximum request size, ignoring the alignment. Arnd ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: MMC quirks relating to performance/lifetime. 2011-02-20 14:39 ` MMC quirks relating to performance/lifetime Arnd Bergmann @ 2011-02-22 7:46 ` Andrei Warkentin 2011-02-22 17:00 ` Arnd Bergmann 0 siblings, 1 reply; 17+ messages in thread From: Andrei Warkentin @ 2011-02-22 7:46 UTC (permalink / raw) To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc On Sun, Feb 20, 2011 at 8:39 AM, Arnd Bergmann <arnd@arndb.de> wrote: > [adding linux-fsdevel to Cc, see http://lwn.net/Articles/428941/ and > http://comments.gmane.org/gmane.linux.ports.arm.kernel/105607 for more > on this discussion.] > > > I think it's good to discuss all the options, but my feeling is that > we should not add so much complexity at the interface level, because > we will never be able to change all that again. In general, sysfs > files should contain simple values that are self-descriptive (a simple > number or one word), and should have no side-effects (unlike the delete > or the policies attributes you describe). > > The behavior of the Toshiba chip is peculiar enough to justify having > some workarounds for it, including run-time selected ones, but I'm > looking for something much simpler. I'd certainly be interested in > the patch you come up with and any performance results, but I don't > think it can be merged like that. > Sure. The page_align patch is just going to be a single sysfs attribute. All I need to prove to myself now is the effect for large unaligned accesses (and show everyone else the data :-)). > In the end, Chris will have to make the decision on mmc patches of > course -- I'm just trying to contribute experience from other subsystems. > > What I see as a more promising approach is to add the tunables > to attributes of the CFQ I/O scheduler once we know what we want. > This will allow doing the same optimizations to non-MMC devices such > as USB sticks or CF/IDE cards without reimplementing it in other > subsystems, and give more control over the individual requests than > the MMC layer has. > > E.g. the I/O scheduler can also make sure that we always submit all > blocks from the start of one erase unit (e.g. 4 MB) to the end, but > not try to merge requests across erase unit boundaries. It can > also try to group the requests in aligned power-of-two sized chunks > rather than merging as many sectors as possible up to the maximum > request size, ignoring the alignment. I agree. These are common things that affect any kind of flash storage, and it belongs in the I/O scheduler as simple tuneables. I'll see if I can figure my way around that... What belongs in mmc card driver are tunable workarounds for MMC/SD brokeness. For example - needing to use 8K-spitted reliable writes to ensure that a 64KB access doesn't wind up in the 4MB buffer B (as to improve lifespan of the card.) But you want a waterline above which you don't do this anymore, otherwise the overall performance will go to 0 - i.e. there is a need to balance between performance and reliability, so the range of access size for which the workaround works needs to be runtime controlled, as it's potentially different. Another example (this one is apparently affecting Sandisk) - do special stuff for block erase, since the card violates spec in that regard (touch ext_csd instead of argument, I believe). A different example might be turning on reliable writes for WRITE_META (or all) blocks for a certain partition (but I just made that up... ). So there are things that just should be on (spec brokeness workarounds), and things that apply only to a subset of accesses (and thus they are selective at issue_*_rq time), whether it's because of accessed offset or access size. I agree that the sysfs method is particularly nasty, and I guess I didn't have to make a prototype to figure that out :-) (but needed something similar for selective testing anyway). Nothing else exists right now that acts in the same way, and nothing really should, as there is no feedback for manipulating the policies (echo POLICY_ENUM > policy, if it doesn't stick, then the arguments were wrong, etc). You could put the entire MMC block policy interface through an API usable by system integrators - i.e. you would really only care for tuning the MMC parameters if you're creating a device around an emmc. Idea (1). One idea is to keep the "policies" from my previous mail. Policies are registered through platform-specific code. The policies could be then matched for enabling against a specific block device by manfid/date/etc at the time of mmc_block_alloc... For removable media no one would fiddle with the tunable parameters anyway, unless there was some global database of cards and workarounds and a daemon or some such to take care of that... Probably don't want to add such baggage to the kernel. Idea (2). There is probably no need to overcomplicate. Just add a platform callback (something like int (*mmc_platform_block_workaround)(struct request *, struct mmc_blk_request *)). This will be usable as-is for R/W accesses, and the discard code will need to be slightly modified. Do you think there is any need for runtime tuning of the MMC workarounds (disregarding ones that really belong in the I/O scheduler)? Should the workarounds be simply platform callbacks, or should they be something heftier ("policies")? A ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: MMC quirks relating to performance/lifetime. 2011-02-22 7:46 ` Andrei Warkentin @ 2011-02-22 17:00 ` Arnd Bergmann 2011-02-23 10:19 ` Andrei Warkentin 0 siblings, 1 reply; 17+ messages in thread From: Arnd Bergmann @ 2011-02-22 17:00 UTC (permalink / raw) To: Andrei Warkentin Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc On Tuesday 22 February 2011, Andrei Warkentin wrote: > On Sun, Feb 20, 2011 at 8:39 AM, Arnd Bergmann <arnd@arndb.de> wrote: > > E.g. the I/O scheduler can also make sure that we always submit all > > blocks from the start of one erase unit (e.g. 4 MB) to the end, but > > not try to merge requests across erase unit boundaries. It can > > also try to group the requests in aligned power-of-two sized chunks > > rather than merging as many sectors as possible up to the maximum > > request size, ignoring the alignment. > > I agree. These are common things that affect any kind of flash > storage, and it belongs in the I/O scheduler as simple tuneables. I'll > see if I can figure my way around that... > > What belongs in mmc card driver are tunable workarounds for MMC/SD > brokeness. For example - needing to use 8K-spitted reliable writes to > ensure that a 64KB access doesn't wind up in the 4MB buffer B (as to > improve lifespan of the card.) But you want a waterline above which > you don't do this anymore, otherwise the overall performance will go > to 0 - i.e. there is a need to balance between performance and > reliability, so the range of access size for which the workaround > works needs to be runtime controlled, as it's potentially different. > Another example (this one is apparently affecting Sandisk) - do > special stuff for block erase, since the card violates spec in that > regard (touch ext_csd instead of argument, I believe). A different > example might be turning on reliable writes for WRITE_META (or all) > blocks for a certain partition (but I just made that up... ). Yes, makes sense. > You could put the entire MMC block policy interface through an API > usable by system integrators - i.e. you would really only care for > tuning the MMC parameters if you're creating a device around an emmc. > > Idea (1). One idea is to keep the "policies" from my previous mail. > Policies are registered through platform-specific code. The policies > could be then matched for enabling against a specific block device by > manfid/date/etc at the time of mmc_block_alloc... For removable media > no one would fiddle with the tunable parameters anyway, unless there > was some global database of cards and workarounds and a daemon or some > such to take care of that... Probably don't want to add such baggage > to the kernel. > > Idea (2). There is probably no need to overcomplicate. Just add a > platform callback (something like int > (*mmc_platform_block_workaround)(struct request *, struct > mmc_blk_request *)). This will be usable as-is for R/W accesses, and > the discard code will need to be slightly modified. > > Do you think there is any need for runtime tuning of the MMC > workarounds (disregarding ones that really belong in the I/O > scheduler)? Should the workarounds be simply platform callbacks, or > should they be something heftier ("policies")? The platform hook seems the wrong place, because you might use the same chip in multiple platforms, and a single platform might have a large number of different boards, all of which require separate workarounds. A per-card quirk table does not seem so bad, we have that in other subsystems as well. I wouldn't necessarily make it a list of possible quirks, but rather a __devinit function that is called for a new card on insertion, in order to tweak various parameters. Arnd ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: MMC quirks relating to performance/lifetime. 2011-02-22 17:00 ` Arnd Bergmann @ 2011-02-23 10:19 ` Andrei Warkentin 2011-02-23 16:09 ` Arnd Bergmann 0 siblings, 1 reply; 17+ messages in thread From: Andrei Warkentin @ 2011-02-23 10:19 UTC (permalink / raw) To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc On Tue, Feb 22, 2011 at 11:00 AM, Arnd Bergmann <arnd@arndb.de> wrote: >> >> Do you think there is any need for runtime tuning of the MMC >> workarounds (disregarding ones that really belong in the I/O >> scheduler)? Should the workarounds be simply platform callbacks, or >> should they be something heftier ("policies")? > > The platform hook seems the wrong place, because you might use > the same chip in multiple platforms, and a single platform might > have a large number of different boards, all of which require > separate workarounds. > That's a good point. At best it would result in massive copy-paste/ > A per-card quirk table does not seem so bad, we have that in > other subsystems as well. I wouldn't necessarily make it > a list of possible quirks, but rather a __devinit function that > is called for a new card on insertion, in order to tweak various > parameters. > That sounds good! In fact, for any quirks enabled for a particular card, I'll expose the tuneables through sysfs attributes, something like /sys/block/mmcblk0/device/quirks/quirk-name/attr-names. Quirks will have block intervals and access size intervals over which they are valid, along with any other quirk-specific parameter. Interval overlap will not be allowed for quirks in the same operation type (r/w/e). The goal here is to make the changes to issue_*_rq as small as possible, and not to pollute block.c at all with the quirks stuff. Quirks are looked up inside issue_*_rq based on req type and [start,end) interval. The resulting found quirks structure will contain a callback used inside issue_*_rq to modify mmc block request structures prior to generating actual MMC commands. Quirks consist of a callback called inside of mmc issue_*_rq, configurable attributes, and the sysfs interface. Quirk groups are defined per-card. At card insertion time, a matching quirk group is found, and is enabled. The quirk group enable function then enables the relevant quirks with the right parameters (adds them to per mmc_blk_data quirk interval tree). Some sane defaults for the tunables are used. If the tunables are modified through sysfs, care is taken that an interval overlap never happens, otherwise the tunable is not modified and a kernel error message is logged. I hope I explained the tentative idea clearly... Thoughts? A ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: MMC quirks relating to performance/lifetime. 2011-02-23 10:19 ` Andrei Warkentin @ 2011-02-23 16:09 ` Arnd Bergmann 2011-02-23 22:26 ` Andrei Warkentin 0 siblings, 1 reply; 17+ messages in thread From: Arnd Bergmann @ 2011-02-23 16:09 UTC (permalink / raw) To: Andrei Warkentin Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc On Wednesday 23 February 2011, Andrei Warkentin wrote: > That sounds good! In fact, for any quirks enabled for a particular > card, I'll expose the tuneables through sysfs attributes, something > like /sys/block/mmcblk0/device/quirks/quirk-name/attr-names. > > Quirks will have block intervals and access size intervals over which > they are valid, along with any other quirk-specific parameter. > Interval overlap will not be allowed for quirks in the same operation > type (r/w/e). The goal here is to make the changes to issue_*_rq as > small as possible, and not to pollute block.c at all with the quirks > stuff. Quirks are looked up inside issue_*_rq based on req type and > [start,end) interval. The resulting found quirks structure will > contain a callback used inside issue_*_rq to modify mmc block request > structures prior to generating actual MMC commands. > > Quirks consist of a callback called inside of mmc issue_*_rq, > configurable attributes, and the sysfs interface. Quirk groups are > defined per-card. At card insertion time, a matching quirk group is > found, and is enabled. The quirk group enable function then enables > the relevant quirks with the right parameters (adds them to per > mmc_blk_data quirk interval tree). Some sane defaults for the tunables > are used. If the tunables are modified through sysfs, care is taken > that an interval overlap never happens, otherwise the tunable is not > modified and a kernel error message is logged. > > I hope I explained the tentative idea clearly... Thoughts? I would hope that the quirks can be simpler than this still, without the need to call any function pointers while using the device, or quirk specific sysfs directories. What I meant is to have a single function pointer that can get called when detecting a specific known card. All this function does is to set values and flags that we can export either through common attributes of block devices (e.g. preferred erase size), or attributes specific to mmc devices (e.g. the toshiba hack, as a bool attribute). An obvious attribute would be the minimum size of an atomic page update. By default this could be 32KB, because any device should support that (FAT32 cannot have larger clusters). A card specific quirk can set it to another value, like 8KB, 16KB or 64KB, and file systems or other tools like mkfs can optimize for this value. I would like the flags like "don't submit requests spanning this boundary" and "make all writes below this size" to be defined in terms of the regular sizes we already know about, like the page size or the erase size. Arnd ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: MMC quirks relating to performance/lifetime. 2011-02-23 16:09 ` Arnd Bergmann @ 2011-02-23 22:26 ` Andrei Warkentin 2011-02-24 9:24 ` Arnd Bergmann 0 siblings, 1 reply; 17+ messages in thread From: Andrei Warkentin @ 2011-02-23 22:26 UTC (permalink / raw) To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc On Wed, Feb 23, 2011 at 10:09 AM, Arnd Bergmann <arnd@arndb.de> wrote: > On Wednesday 23 February 2011, Andrei Warkentin wrote: >> That sounds good! In fact, for any quirks enabled for a particular >> card, I'll expose the tuneables through sysfs attributes, something >> like /sys/block/mmcblk0/device/quirks/quirk-name/attr-names. >> >> Quirks will have block intervals and access size intervals over which >> they are valid, along with any other quirk-specific parameter. >> Interval overlap will not be allowed for quirks in the same operation >> type (r/w/e). The goal here is to make the changes to issue_*_rq as >> small as possible, and not to pollute block.c at all with the quirks >> stuff. Quirks are looked up inside issue_*_rq based on req type and >> [start,end) interval. The resulting found quirks structure will >> contain a callback used inside issue_*_rq to modify mmc block request >> structures prior to generating actual MMC commands. >> >> Quirks consist of a callback called inside of mmc issue_*_rq, >> configurable attributes, and the sysfs interface. Quirk groups are >> defined per-card. At card insertion time, a matching quirk group is >> found, and is enabled. The quirk group enable function then enables >> the relevant quirks with the right parameters (adds them to per >> mmc_blk_data quirk interval tree). Some sane defaults for the tunables >> are used. If the tunables are modified through sysfs, care is taken >> that an interval overlap never happens, otherwise the tunable is not >> modified and a kernel error message is logged. >> >> I hope I explained the tentative idea clearly... Thoughts? > > I would hope that the quirks can be simpler than this still, without > the need to call any function pointers while using the device, or > quirk specific sysfs directories. > I'll skip the sysfs part from the first RFC patch. I think this complicates what I'm trying to achieve and makes this whole thing look bigger than it is. > What I meant is to have a single function pointer that can get > called when detecting a specific known card. All this function > does is to set values and flags that we can export either through > common attributes of block devices (e.g. preferred erase size), > or attributes specific to mmc devices (e.g. the toshiba hack, as > a bool attribute). > > An obvious attribute would be the minimum size of an atomic > page update. By default this could be 32KB, because any device > should support that (FAT32 cannot have larger clusters). A > card specific quirk can set it to another value, like 8KB, 16KB > or 64KB, and file systems or other tools like mkfs can optimize > for this value. > > I would like the flags like "don't submit requests spanning > this boundary" and "make all writes below this size" to be defined > in terms of the regular sizes we already know about, like the > page size or the erase size. > I agree with you on the size/align issues. These are very generic attributes and don't need a complicated framework like I described to be dealt with. Ultimately they are just hints to the I/O scheduler, so they should be part of the block device. I am more concerned with workarounds that depend on access size (like the toshiba one) and that modify the MMC commands sent (using reliable writes, like the Toshiba one, or putting parameters differently like the Sandisk erase workaround). It's these kinds of workarounds that the quirks framework is meant to address. I don't think it's a good idea to pollute mmc_blk_issue_rw_rq and mmc_blk_issue_discard_rq with if()-elsed workarounds, because it's going to quickly complicate the logic, and get out of hand and unmanageable the more cards are added. I'm trying to avoid having to make any changes to card/block.c as part of making quirk workarounds. The only cost when compared to an if-else will be one O(log n) quirk lookup, where n is either one or something close that (since the search is only done for quirks per mmc_blk_data), and one callback invoked after "brq.data.sg_len = mmc_queue_map_sg(mq);" so it can patch up mrq as necessary. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: MMC quirks relating to performance/lifetime. 2011-02-23 22:26 ` Andrei Warkentin @ 2011-02-24 9:24 ` Arnd Bergmann 2011-02-25 11:02 ` Andrei Warkentin 0 siblings, 1 reply; 17+ messages in thread From: Arnd Bergmann @ 2011-02-24 9:24 UTC (permalink / raw) To: Andrei Warkentin Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc On Wednesday 23 February 2011, Andrei Warkentin wrote: > I am more concerned with workarounds that depend on access size (like > the toshiba one) and that modify the MMC commands sent (using reliable > writes, like the Toshiba one, or putting parameters differently like > the Sandisk erase workaround). It's these kinds of workarounds that > the quirks framework is meant to address. I don't think it's a good > idea to pollute mmc_blk_issue_rw_rq and mmc_blk_issue_discard_rq with > if()-elsed workarounds, because it's going to quickly complicate the > logic, and get out of hand and unmanageable the more cards are added. > I'm trying to avoid having to make any changes to card/block.c as part > of making quirk workarounds. The only cost when compared to an if-else > will be one O(log n) quirk lookup, where n is either one or something > close that (since the search is only done for quirks per > mmc_blk_data), and one callback invoked after "brq.data.sg_len = > mmc_queue_map_sg(mq);" so it can patch up mrq as necessary. Unlike the sysfs interface, the code does not need to be future-proof, it can always be changed if we feel the code becomes more maintainable by doing it another way. The approach that I'd like to see here is: * Start out with an ad-hoc patch for a quirk (like the one you already have). * Add a boolean variable to enable it per card. * Get performance data for this quirk to show that it's useful in real-world workloads for some cards but counterproductive for others * Get the patch into the mmc tree. * Repeat for the next quirk * When the code becomes overly complicated after adding all the quirks, decide on a good strategy to move the code around, and do a new patch. I understand that you are convinced that you will need the indirect function calls in the end. That is fine, just don't add them before they are actually needed -- that would only make it harder for you to get the first patch included. Note that the situation is very different for user interfaces such as sysfs: You need to plan ahead because once the interface is merged upstream, it can never be changed. When you submit a patch that introduces a new sysfs interface, it has to be documented, and you have to convince the reviewers that it is sufficient to cover all the cases it is designed for, while at the same time it is the most simple way to achieve this. Arnd ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: MMC quirks relating to performance/lifetime. 2011-02-24 9:24 ` Arnd Bergmann @ 2011-02-25 11:02 ` Andrei Warkentin 2011-02-25 12:21 ` Arnd Bergmann 0 siblings, 1 reply; 17+ messages in thread From: Andrei Warkentin @ 2011-02-25 11:02 UTC (permalink / raw) To: Arnd Bergmann; +Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc On Thu, Feb 24, 2011 at 3:24 AM, Arnd Bergmann <arnd@arndb.de> wrote: > Unlike the sysfs interface, the code does not need to be future-proof, > it can always be changed if we feel the code becomes more maintainable > by doing it another way. > > The approach that I'd like to see here is: > > * Start out with an ad-hoc patch for a quirk (like the one you already > have). > * Add a boolean variable to enable it per card. > * Get performance data for this quirk to show that it's useful in > real-world workloads for some cards but counterproductive for others > * Get the patch into the mmc tree. > * Repeat for the next quirk > * When the code becomes overly complicated after adding all the quirks, > decide on a good strategy to move the code around, and do a new patch. > Yup. I understand :-). That's the strategy I'm going to follow. For page_size-alignment/splitting I'm looking at the block layer now. Is that the right approach or should I still submit a (cleaned up) patch to mmc/card/block.c for that performance improvement? The other (Toshiba quirk) is obviously a quirk belonging to mmc/card/block.c. > I understand that you are convinced that you will need the indirect function > calls in the end. That is fine, just don't add them before they are > actually needed -- that would only make it harder for you to get the first > patch included. > > Note that the situation is very different for user interfaces such as sysfs: > You need to plan ahead because once the interface is merged upstream, it > can never be changed. When you submit a patch that introduces a new sysfs > interface, it has to be documented, and you have to convince the reviewers > that it is sufficient to cover all the cases it is designed for, while > at the same time it is the most simple way to achieve this. Ok, thanks a lot for the explanation, I hadn't thought of it that way (and should have). A -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: MMC quirks relating to performance/lifetime. 2011-02-25 11:02 ` Andrei Warkentin @ 2011-02-25 12:21 ` Arnd Bergmann 2011-03-01 18:48 ` Jens Axboe 0 siblings, 1 reply; 17+ messages in thread From: Arnd Bergmann @ 2011-02-25 12:21 UTC (permalink / raw) To: Andrei Warkentin, Jens Axboe Cc: linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc On Friday 25 February 2011, Andrei Warkentin wrote: > Yup. I understand :-). That's the strategy I'm going to follow. For > page_size-alignment/splitting I'm looking at the block layer now. Is > that the right approach or should I still submit a (cleaned up) patch > to mmc/card/block.c for that performance improvement. I guess it should live in block/cfq-iosched in the long run, but I don't know how easy it is to implement it there for test purposes. It may be easier to prototype it in the mmc code, since you are more familiar with that already, post that patch together with benchmark results and then do a new patch for the final solution. We'll need more benchmarking to figure out if that should be applied for all nonrotational storage, or if there are cases where it actually hurts performance to split requests on page boundaries. If it turns out to be a good idea in general, we won't even need a sysfs interface for enabling it, just one for reading/writing the underlying page size. > The other (Toshiba quirk) is obviously a quirk belonging to mmc/card/block.c. Makes sense. Arnd ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: MMC quirks relating to performance/lifetime. 2011-02-25 12:21 ` Arnd Bergmann @ 2011-03-01 18:48 ` Jens Axboe 2011-03-01 19:11 ` Arnd Bergmann 0 siblings, 1 reply; 17+ messages in thread From: Jens Axboe @ 2011-03-01 18:48 UTC (permalink / raw) To: Arnd Bergmann Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc On 2011-02-25 07:21, Arnd Bergmann wrote: > On Friday 25 February 2011, Andrei Warkentin wrote: >> Yup. I understand :-). That's the strategy I'm going to follow. For >> page_size-alignment/splitting I'm looking at the block layer now. Is >> that the right approach or should I still submit a (cleaned up) patch >> to mmc/card/block.c for that performance improvement. > > I guess it should live in block/cfq-iosched in the long run, but I don't > know how easy it is to implement it there for test purposes. I don't think I saw the original patch(es) for this? -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: MMC quirks relating to performance/lifetime. 2011-03-01 18:48 ` Jens Axboe @ 2011-03-01 19:11 ` Arnd Bergmann 2011-03-01 19:15 ` Jens Axboe 2011-03-02 10:34 ` Andrei Warkentin 0 siblings, 2 replies; 17+ messages in thread From: Arnd Bergmann @ 2011-03-01 19:11 UTC (permalink / raw) To: Jens Axboe Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote: > > On 2011-02-25 07:21, Arnd Bergmann wrote: > > On Friday 25 February 2011, Andrei Warkentin wrote: > >> Yup. I understand :-). That's the strategy I'm going to follow. For > >> page_size-alignment/splitting I'm looking at the block layer now. Is > >> that the right approach or should I still submit a (cleaned up) patch > >> to mmc/card/block.c for that performance improvement. > > > > I guess it should live in block/cfq-iosched in the long run, but I don't > > know how easy it is to implement it there for test purposes. > > I don't think I saw the original patch(es) for this? Nobody has posted one yet, only discussions. Andrei made a patch for the MMC block driver to split requests in some cases, but I think the concept has changed enough that it's probably not useful to look at that patch. I think what needs to be done here is to split requests in these cases: * Small requests should be split on flash page boundaries, where a page is typically 8 to 32 KB. Sending one hardware request that spans two partial pages can be slower than sending two requests with the same data, but on page boundaries. * If a hardware transfer is limited to a few sectors, these should be aligned to page boundaries. E.g. assuming a 16 sector page and 32 sector maximum transfers, a request that spans from sector 7 to 62 should be split into three transfers: 7-15, 16-47 and 48-62, not 7-38 and 39-62. This reduces the number of page read-modify-write cycles that the drive does. * No request should ever span multiple erase blocks. Most flash drives today have 4MB erase blocks (sometimes 1, 2 or 8), and the I/O scheduler should treat the erase block boundary like a seek on a hard drive. The I/O scheduler should try to send all sector writes of an erase block in sequence, but after that it can chose any other erase block to write to next. I think if we get this logic, we can deal well with all cheap flash drives. The two parameters we need are the page size and the erase block size, which the kernel can sometimes guess, but should also be tunable in sysfs for devices that don't tell us or lie to the kernel about them. I'm not sure if we want to do this for all nonrotational media, or add another flag to enable these optimizations. On proper SSDs that have an intelligent controller and enough RAM, they probably would not help all that much, or even make it slightly slower due to a higher number of separate write requests. Arnd ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: MMC quirks relating to performance/lifetime. 2011-03-01 19:11 ` Arnd Bergmann @ 2011-03-01 19:15 ` Jens Axboe 2011-03-01 19:51 ` Arnd Bergmann 2011-03-02 10:34 ` Andrei Warkentin 1 sibling, 1 reply; 17+ messages in thread From: Jens Axboe @ 2011-03-01 19:15 UTC (permalink / raw) To: Arnd Bergmann Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc On 2011-03-01 14:11, Arnd Bergmann wrote: > On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote: >> >> On 2011-02-25 07:21, Arnd Bergmann wrote: >>> On Friday 25 February 2011, Andrei Warkentin wrote: >>>> Yup. I understand :-). That's the strategy I'm going to follow. For >>>> page_size-alignment/splitting I'm looking at the block layer now. Is >>>> that the right approach or should I still submit a (cleaned up) patch >>>> to mmc/card/block.c for that performance improvement. >>> >>> I guess it should live in block/cfq-iosched in the long run, but I don't >>> know how easy it is to implement it there for test purposes. >> >> I don't think I saw the original patch(es) for this? > > Nobody has posted one yet, only discussions. Andrei made a patch for the > MMC block driver to split requests in some cases, but I think the > concept has changed enough that it's probably not useful to look at > that patch. > > I think what needs to be done here is to split requests in these cases: > > * Small requests should be split on flash page boundaries, where a page > is typically 8 to 32 KB. Sending one hardware request that spans two > partial pages can be slower than sending two requests with the same > data, but on page boundaries. > > * If a hardware transfer is limited to a few sectors, these should be > aligned to page boundaries. E.g. assuming a 16 sector page and 32 sector > maximum transfers, a request that spans from sector 7 to 62 should be > split into three transfers: 7-15, 16-47 and 48-62, not 7-38 and 39-62. > This reduces the number of page read-modify-write cycles that the drive > does. > > * No request should ever span multiple erase blocks. Most flash drives today > have 4MB erase blocks (sometimes 1, 2 or 8), and the I/O scheduler should > treat the erase block boundary like a seek on a hard drive. The I/O > scheduler should try to send all sector writes of an erase block in sequence, > but after that it can chose any other erase block to write to next. > > I think if we get this logic, we can deal well with all cheap flash drives. > The two parameters we need are the page size and the erase block size, > which the kernel can sometimes guess, but should also be tunable in > sysfs for devices that don't tell us or lie to the kernel about them. > > I'm not sure if we want to do this for all nonrotational media, or > add another flag to enable these optimizations. On proper SSDs that have > an intelligent controller and enough RAM, they probably would not help > all that much, or even make it slightly slower due to a higher number > of separate write requests. Thanks for the recap. One way to handle this would be to have a dm target that ensures that requests are never built up to violate any of the above items. Doing splitting is a little silly, when you can prevent it from happening in the first place. Alternatively, a queue ->merge_bvec_fn() with a settings table could provide the same. As this is of limited scope, I would prefer having this done via a plugin of some sort (like a dm target). -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: MMC quirks relating to performance/lifetime. 2011-03-01 19:15 ` Jens Axboe @ 2011-03-01 19:51 ` Arnd Bergmann 2011-03-01 21:33 ` Andrei Warkentin 0 siblings, 1 reply; 17+ messages in thread From: Arnd Bergmann @ 2011-03-01 19:51 UTC (permalink / raw) To: Jens Axboe Cc: Andrei Warkentin, linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc On Tuesday 01 March 2011 20:15:30 Jens Axboe wrote: > Thanks for the recap. One way to handle this would be to have a dm > target that ensures that requests are never built up to violate any of > the above items. Doing splitting is a little silly, when you can prevent > it from happening in the first place. Ok, that sounds good. I didn't know that it's possible to prevent bios from getting created that violate this. I'm actually trying to do a device mapper target that does much more than this, see https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashDeviceMapper for an early draft. The design has moved on since I wrote that, but the basic idea is still the same: all blocks get written in a way that fills up entire 4MB segments before moving to another segment, independent of what the logical block numbers are, and a little space is used to store a lookup table for the logical-to-physical block mapping. > Alternatively, a queue ->merge_bvec_fn() with a settings table could > provide the same. That's probably better for the common case. The device mapper target would be useful for those that want the best case write performance, but if I understand you correctly, the merge_bvec_fn() could be used per block driver, so we could simply add that to the SCSI (for USB and consumer SSD) case and MMC block drivers. The point that this does not solve is submitting all outstanding writes for an erase block together, which is needed to reduce the garbage collection overhead. When you do a partial update of an erase block (4MB typically) and then start writing to another erase block, the drive will have to copy all data you did not write in order to free up internal resources. > As this is of limited scope, I would prefer having this done via a > plugin of some sort (like a dm target). I'm not sure what you mean with limited scope. This is certainly not as important for the classic server environment (aside from USB boot drives), but I assume that it is highly relevant for the a large portion of new embedded designs as people move from raw flash to eMMC and similar "technologies". Arnd ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: MMC quirks relating to performance/lifetime. 2011-03-01 19:51 ` Arnd Bergmann @ 2011-03-01 21:33 ` Andrei Warkentin 0 siblings, 0 replies; 17+ messages in thread From: Andrei Warkentin @ 2011-03-01 21:33 UTC (permalink / raw) To: Arnd Bergmann Cc: Jens Axboe, linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc On Tue, Mar 1, 2011 at 1:51 PM, Arnd Bergmann <arnd@arndb.de> wrote: > On Tuesday 01 March 2011 20:15:30 Jens Axboe wrote: >> Thanks for the recap. One way to handle this would be to have a dm >> target that ensures that requests are never built up to violate any of >> the above items. Doing splitting is a little silly, when you can prevent >> it from happening in the first place. > > Ok, that sounds good. I didn't know that it's possible to prevent > bios from getting created that violate this. > Wouldn't someone still be able to perform a generic_make_request that would violate the conditions (i.e. cross alignment boundary while performing unaligned write)? You could prevent the merges that would result in violating the conditions, sure, but you would need to handle single unaligned accesses correctly too... Sorry, I'm just groping my way around the block layer...a lot I'm still trying to draw a mental picture for. P.S. I've submitted for review the first 3 patches. Tear into them :). A ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: MMC quirks relating to performance/lifetime. 2011-03-01 19:11 ` Arnd Bergmann 2011-03-01 19:15 ` Jens Axboe @ 2011-03-02 10:34 ` Andrei Warkentin 2011-03-05 9:23 ` Andrei Warkentin 1 sibling, 1 reply; 17+ messages in thread From: Andrei Warkentin @ 2011-03-02 10:34 UTC (permalink / raw) To: Arnd Bergmann Cc: Jens Axboe, linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc On Tue, Mar 1, 2011 at 1:11 PM, Arnd Bergmann <arnd@arndb.de> wrote: > On Tuesday 01 March 2011 19:48:17 Jens Axboe wrote: >> >> On 2011-02-25 07:21, Arnd Bergmann wrote: >> > On Friday 25 February 2011, Andrei Warkentin wrote: >> >> Yup. I understand :-). That's the strategy I'm going to follow. For >> >> page_size-alignment/splitting I'm looking at the block layer now. Is >> >> that the right approach or should I still submit a (cleaned up) patch >> >> to mmc/card/block.c for that performance improvement. >> > >> > I guess it should live in block/cfq-iosched in the long run, but I don't >> > know how easy it is to implement it there for test purposes. >> >> I don't think I saw the original patch(es) for this? > > Nobody has posted one yet, only discussions. Andrei made a patch for the > MMC block driver to split requests in some cases, but I think the > concept has changed enough that it's probably not useful to look at > that patch. > Before the generic improvements are made to the block layer, I think there is some value in implementing the (simpler) ones in mmc block code, as well as expose an mmc block quirk interface by which its easy to add complex workarounds. Some things will never be able to completely stay above mmc block code, for example, when splitting up smaller accesses, you need to be careful on the Toshiba card, since the 4th consecutive 8KB block results in the entire 32KB getting pushed into the bigger 4MB buffer. On our platform, there are a lot of accesses in the 16KB-32KB range which benefit from the splitting. Data collected showed splitting more than 32KB to have adverse effect on performance (I guess that sort of makes sense, after all, why else would the controller treat 4 consecutive 8KB accesses as a larger access and treat it accordingly?) On the other hand, that data was collected on code that used reliable write for every portion of the split access, so I'm going to have to get some new data... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: MMC quirks relating to performance/lifetime. 2011-03-02 10:34 ` Andrei Warkentin @ 2011-03-05 9:23 ` Andrei Warkentin 0 siblings, 0 replies; 17+ messages in thread From: Andrei Warkentin @ 2011-03-05 9:23 UTC (permalink / raw) To: Arnd Bergmann Cc: Jens Axboe, linux-arm-kernel, linux-fsdevel, Linus Walleij, linux-mmc On Wed, Mar 2, 2011 at 4:34 AM, Andrei Warkentin <andreiw@motorola.com> wrote: > Before the generic improvements are made to the block layer, I think > there is some value > in implementing the (simpler) ones in mmc block code, as well as > expose an mmc block quirk interface by which its easy to add complex > workarounds. Some things will never be able to completely stay above > mmc block code, for example, when splitting up smaller accesses, you > need to be careful on the Toshiba card, since the 4th consecutive 8KB > block results in the entire 32KB getting pushed into the bigger 4MB > buffer. On our platform, there are a lot of accesses in the 16KB-32KB > range which benefit from the splitting. Data collected showed > splitting more than 32KB to have adverse effect on performance (I > guess that sort of makes sense, after all, why else would the > controller treat 4 consecutive 8KB accesses as a larger access and > treat it accordingly?) On the other hand, that data was collected on > code that used reliable write for every portion of the split access, > so I'm going to have to get some new data... > Just want to correct myself - any consecutive write that exceeds 8K goes into the 4MB buffer. Also, according to vendor, there is no performance penalty for using reliable write. This is why in the patch set, for splitting larger requests (to improve lifetime by reducing the number of AU write/erase cycles) I perform a reliable write for each split block set. ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <201102111551.15508.arnd@arndb.de>]
[parent not found: <20110308065911.GC1357@ucw.cz>]
* Re: MMC quirks relating to performance/lifetime. [not found] ` <20110308065911.GC1357@ucw.cz> @ 2011-03-08 14:03 ` Arnd Bergmann 0 siblings, 0 replies; 17+ messages in thread From: Arnd Bergmann @ 2011-03-08 14:03 UTC (permalink / raw) To: Pavel Machek; +Cc: linux-arm-kernel, Andrei Warkentin, linux-fsdevel, linux-mmc On Tuesday 08 March 2011, Pavel Machek wrote: > > > > > > How big is performance difference? > > > > Several orders of magnitude. It is very easy to get a card that can write > > 12 MB/s into a case where it writes no more than 30 KB/s, doing only > > things that happen frequently with ext3. > > Ungood. > > I guess we should create something like loopback device, which knows > about flash specifics, and does the right coalescing so that card > stays in the fast mode? I have listed a few suggestions for areas to work in my article at https://lwn.net/Articles/428584/. My idea was to use a device mapper target, as described in https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashDeviceMapper but a loopback device might work as well. The other area that I think will help a lot is to make the I/O scheduler aware of the erase block size and the preferred access patterns. > ...or, do we need to create new, simple filesystem with layout similar > to fat32, for use on mmc cards? It doesn't need to be similar to fat32, but creating a new file system could fix this, too. Microsoft seems to have built ExFAT around cheap flash devices, though they don't document what that does exactly. I think we can do better than that, and I still want to find out how close nilfs2 and btrfs can actually get to the optimum. Note that it's not just MMC cards though, you get the exact same effects on some low-end SSDs (which are basically repackaged CF cards) and most USB sticks. The best USB sticks I have seen can hide some effects with a bit of caching, and they have a higher number of open segments than the cheap ones, but the basic problems are unchanged. The requirements for a good low-end flash optimized file system would be roughly: 1. Do all writes is chunks of 32 or 64 KB. If there is less data to write, fill the chunk with zeroes and clean up later, but don't write more data to the same chunk. 2. Start writing on a segment (e.g. 4 MB, configurable) boundary, then write that segment to the end using the chunks mentioned above. 3. Erase full segments using trim/erase/discard before writing to them, if supported by the drive. 4. Have a configurable number of segments open for writing, i.e. you have written blocks at the start of the segment but not filled the segment to the end. Typical hardware limitations are between 1 and 10 open segments. 5. Keep all metadata within a single 4 MB segment. Drives that cannot do random access within normal segments can do it in the area that holds the FAT. If 4 MB is not enough, the FAT area can be used as a journal or cache, for a larger metadata area that gets written less frequently. 6. Because of the requirement to erase 4 MB chunks at once, there needs to be garbage collection to free up space. The quality of the garbage collection algorithm directly relates to the performance on full file systems and/or the space overhead. 7. Some static wear levelling is required to increase the expected life of consumer devices that only do dynamic wear levelling, i.e. the segments that contain purely static data need to be written occasionally so they make it back into the wear leveling pool of the hardware. Arnd ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2011-03-08 14:03 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <AANLkTikh4vfS7SLKAa-aUXhbTxcHzYHmBuaXj1qHHYN9@mail.gmail.com>
[not found] ` <201102171647.46190.arnd@arndb.de>
[not found] ` <AANLkTimGrT_p_5xdXJv5UURPMC0cwJCPkZ=xcX5Nk=o=@mail.gmail.com>
2011-02-20 14:39 ` MMC quirks relating to performance/lifetime Arnd Bergmann
2011-02-22 7:46 ` Andrei Warkentin
2011-02-22 17:00 ` Arnd Bergmann
2011-02-23 10:19 ` Andrei Warkentin
2011-02-23 16:09 ` Arnd Bergmann
2011-02-23 22:26 ` Andrei Warkentin
2011-02-24 9:24 ` Arnd Bergmann
2011-02-25 11:02 ` Andrei Warkentin
2011-02-25 12:21 ` Arnd Bergmann
2011-03-01 18:48 ` Jens Axboe
2011-03-01 19:11 ` Arnd Bergmann
2011-03-01 19:15 ` Jens Axboe
2011-03-01 19:51 ` Arnd Bergmann
2011-03-01 21:33 ` Andrei Warkentin
2011-03-02 10:34 ` Andrei Warkentin
2011-03-05 9:23 ` Andrei Warkentin
[not found] ` <201102111551.15508.arnd@arndb.de>
[not found] ` <20110308065911.GC1357@ucw.cz>
2011-03-08 14:03 ` Arnd Bergmann
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).